From: Michael Ellerman <mpe@ellerman.id.au>
To: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>,
	Mahesh Salgaonkar <mahesh@linux.ibm.com>,
	linuxppc-dev <linuxppc-dev@ozlabs.org>
Cc: Ganesh Goudar <ganeshgr@linux.ibm.com>,
	Nicholas Piggin <npiggin@gmail.com>
Subject: Re: [PATCH v4] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.
Date: Fri, 08 Mar 2024 19:08:50 +1100
Message-ID: <874jdhno19.fsf@mail.lhotse>
In-Reply-To: <8d973907-8e86-4b9f-8995-cf3a8621f6b6@linux.ibm.com>

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> On 3/7/24 5:13 PM, Michael Ellerman wrote:
>> Mahesh Salgaonkar <mahesh@linux.ibm.com> writes:
>>> nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel
>>> crash when invoked during real mode interrupt handling (e.g. early HMI/MCE
>>> interrupt handler) if percpu allocation comes from vmalloc area.
>>>
>>> Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI()
>>> wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when
>>> percpu allocation is from the embedded first chunk. However with
>>> CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu
>>> allocation can come from the vmalloc area.
>>>
>>> With kernel command line "percpu_alloc=page" we can force percpu allocation
>>> to come from vmalloc area and can see kernel crash in machine_check_early:
>>>
>>> [    1.215714] NIP [c000000000e49eb4] rcu_nmi_enter+0x24/0x110
>>> [    1.215717] LR [c0000000000461a0] machine_check_early+0xf0/0x2c0
>>> [    1.215719] --- interrupt: 200
>>> [    1.215720] [c000000fffd73180] [0000000000000000] 0x0 (unreliable)
>>> [    1.215722] [c000000fffd731b0] [0000000000000000] 0x0
>>> [    1.215724] [c000000fffd73210] [c000000000008364] machine_check_early_common+0x134/0x1f8
>>>
>>> Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
>>> first chunk is not embedded.
>> 
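The condition described in the patch text above can be modelled as a small
predicate. This is a minimal userspace sketch in C, not the kernel code and
not necessarily how the patch implements it. The function name and both
parameters are hypothetical stand-ins: translation_on models "not in real
mode", and first_chunk_paged models "the percpu first chunk is page-mapped
(vmalloc-backed) rather than embedded".

```c
#include <stdbool.h>

/*
 * Userspace model only; names are hypothetical, not kernel symbols.
 *
 * translation_on:    true when data address translation is enabled,
 *                    i.e. the handler is NOT running in real mode.
 * first_chunk_paged: true when the percpu first chunk was page-mapped
 *                    (so percpu data may live in the vmalloc area)
 *                    instead of embedded.
 */
static bool may_call_nmi_enter(bool translation_on, bool first_chunk_paged)
{
	/* With translation on, vmalloc-backed percpu data is reachable. */
	if (translation_on)
		return true;

	/* Real mode: only safe if percpu data is in the embedded chunk. */
	return !first_chunk_paged;
}
```

In this model the only unsafe combination is real mode together with a
page-mapped first chunk, which matches the machine_check_early crash
reported with percpu_alloc=page.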
>> My system (powernv) doesn't even boot with percpu_alloc=page.
>
>
> Can you share the crash details?

Yes but it's not pretty :)

  [    1.725257][  T714] systemd-journald[714]: Collecting audit messages is disabled.
  [    1.729401][    T1] systemd[1]: Finished systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
  [  OK  ] Finished systemd-tmpfiles-…reate Static Device Nodes in /dev.
  [    1.773902][   C22] Disabling lock debugging due to kernel taint
  [    1.773905][   C23] Oops: Machine check, sig: 7 [#1]
  [    1.773911][   C23] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
  [    1.773916][   C23] Modules linked in:
  [    1.773920][   C23] CPU: 23 PID: 0 Comm: swapper/23 Tainted: G   M               6.8.0-rc7-02500-g23515c370cbb #1
  [    1.773924][   C23] Hardware name: 8335-GTH POWER9 0x4e1202 opal:skiboot-v6.5.3-35-g1851b2a06 PowerNV
  [    1.773926][   C23] NIP:  0000000000000000 LR: 0000000000000000 CTR: 0000000000000000
  [    1.773929][   C23] REGS: c000000fffa6ef50 TRAP: 0000   Tainted: G   M                (6.8.0-rc7-02500-g23515c370cbb)
  [    1.773932][   C23] MSR:  0000000000000000 <>  CR: 00000000  XER: 00000000
  [    1.773937][   C23] CFAR: 0000000000000000 IRQMASK: 3 
  [    1.773937][   C23] GPR00: 0000000000000000 c000000fffa6efe0 c000000fffa6efb0 0000000000000000 
  [    1.773937][   C23] GPR04: c00000000003d8c0 c000000001f5f000 0000000000000000 0000000000000103 
  [    1.773937][   C23] GPR08: 0000000000000003 653a0d962a590300 0000000000000000 0000000000000000 
  [    1.773937][   C23] GPR12: c000000fffa6f280 0000000000000000 c0000000000084a4 0000000000000000 
  [    1.773937][   C23] GPR16: 0000000053474552 0000000000000000 c00000000003d8c0 c000000fffa6f280 
  [    1.773937][   C23] GPR20: c000000001f5f000 c000000fffa6f340 c000000fffa6f2e8 0000000000000000 
  [    1.773937][   C23] GPR24: 0007fffffecf0000 c0000000065bbb80 0000000000550102 c000000002172b20 
  [    1.773937][   C23] GPR28: 0000000000000000 0000000053474552 0000000000000000 c000000ffffc6d80 
  [    1.773982][   C23] NIP [0000000000000000] 0x0
  [    1.773988][   C23] LR [0000000000000000] 0x0
  [    1.773990][   C23] Call Trace:
  [    1.773991][   C23] [c000000fffa6efe0] [c000000001f5f000] .TOC.+0x0/0xa1000 (unreliable)
  [    1.773999][   C23] Code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
  [    1.774021][   C23] ---[ end trace 0000000000000000 ]---

Something has gone badly wrong.

That was a test kernel with some other commits, but nothing that should
cause that. Removing percpu_alloc=page fixes it.

It's based on fddff98e83b4b4d54470902ea0d520c4d423ca3b.

>> AFAIK the only reason we added support for it was to handle 4K kernels
>> with HPT. See commit eb553f16973a ("powerpc/64/mm: implement page
>> mapping percpu first chunk allocator").
>> 
>> So I wonder if we should change the Kconfig to only offer it as an
>> option in that case, and change the logic in setup_per_cpu_areas() to
>> only use it as a last resort.
>> 
>> I guess we probably still need this commit though, even if just for 4K HPT.
>> 
>>
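The "last resort" ordering suggested above could look roughly like the
following. This is a standalone C sketch under stated assumptions, not the
actual setup_per_cpu_areas() logic: the enum, the function name and both
flags are hypothetical and only illustrate preferring the embedded first
chunk and falling back to page mapping when embedding fails.

```c
#include <stdbool.h>

/* Hypothetical labels for the outcome; not kernel identifiers. */
enum first_chunk_kind {
	FIRST_CHUNK_EMBED,	/* first chunk embedded (not vmalloc-backed) */
	FIRST_CHUNK_PAGE,	/* first chunk page-mapped into vmalloc space */
	FIRST_CHUNK_NONE,	/* neither allocator could be used */
};

/*
 * embed_ok:       the embedded first chunk allocator succeeded.
 * page_supported: the page-mapping allocator is built in (for example,
 *                 only offered on configurations that really need it).
 */
static enum first_chunk_kind pick_first_chunk(bool embed_ok, bool page_supported)
{
	/* Always prefer the embedded first chunk when it works. */
	if (embed_ok)
		return FIRST_CHUNK_EMBED;

	/* Page mapping is only a fallback, never the first choice. */
	if (page_supported)
		return FIRST_CHUNK_PAGE;

	return FIRST_CHUNK_NONE;
}
```

With this ordering, a page-mapped first chunk is only ever selected when
embedding has already failed.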
> We have also observed some errors when there is a large gap between the start memory of
> NUMA nodes. That made the percpu offset really large, causing boot failures even on 64K.

Yeah, I have vague memories of that :)

cheers
