From: Ganesh <ganeshgr@linux.ibm.com>
To: Nicholas Piggin <npiggin@gmail.com>,
linuxppc-dev@lists.ozlabs.org, mpe@ellerman.id.au
Cc: mahesh@linux.vnet.ibm.com
Subject: Re: [PATCH] powerpc/pseries: Fix MCE handling on pseries
Date: Mon, 16 Mar 2020 17:17:38 +0530
Message-ID: <d22f9ef9-07db-9615-6420-001b85dd2742@linux.ibm.com>
In-Reply-To: <1584157063.g5s75uhbdu.astroid@bobo.none>
On 3/14/20 9:18 AM, Nicholas Piggin wrote:
> Ganesh Goudar's on March 14, 2020 12:04 am:
>> MCE handling on the pSeries platform fails because the recent rework to
>> use common code for pSeries and PowerNV machine check error handling
>> tries to access per-cpu variables in real mode. The per-cpu variables
>> may be outside the RMO region on the pSeries platform and need
>> translation to be enabled for access. Just moving these per-cpu
>> variables into the RMO region didn't help because we queue some work
>> to workqueues in real mode, which again tries to touch per-cpu variables.
> Which queues are these? We should not be using Linux workqueues, but the
> powerpc mce code which uses irq_work.
Yes, it is irq_work, and irq_work_queue() accesses memory outside the RMO:
irq_work_queue()->__irq_work_queue_local()->[this_cpu_ptr(&lazy_list) | this_cpu_ptr(&raised_list)]
>> Also fwnmi_release_errinfo()
>> cannot be called when translation is not enabled.
> Why not?
It crashes when we try to look up the RTAS token for "ibm,nmi-interlock"
in the device tree. We could avoid that by storing the rtas_token somewhere
beforehand, but I haven't tried it yet. Here is the backtrace I got when
fwnmi_release_errinfo() was called from the real-mode handler:
[ 70.856908] BUG: Unable to handle kernel data access on read at 0xc0000001ffffa8f8
[ 70.856918] Faulting instruction address: 0xc000000000853920
[ 70.856927] Oops: Kernel access of bad area, sig: 11 [#1]
[ 70.856935] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[ 70.856943] Modules linked in: mcetest_slb(OE+) bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sg pseries_rng ip_tables xfs libcrc32c sd_mod t10_pi ibmvscsi ibmveth scsi_transport_srp
[ 70.856975] CPU: 13 PID: 6480 Comm: insmod Kdump: loaded Tainted: G OE 5.6.0-rc2-ganesh+ #6
[ 70.856985] NIP: c000000000853920 LR: c000000000853a14 CTR: c0000000000376b0
[ 70.856994] REGS: c000000007e4b870 TRAP: 0300 Tainted: G OE (5.6.0-rc2-ganesh+)
[ 70.857003] MSR: 8000000000001003 <SF,ME,RI,LE> CR: 88000422 XER: 00000009
[ 70.857015] CFAR: c000000000853a10 DAR: c0000001ffffa8f8 DSISR: 40000000 IRQMASK: 1
[ 70.857015] GPR00: c000000000853a14 c000000007e4bb00 c000000001372b00 c0000001ffffa8c8
[ 70.857015] GPR04: c000000000cf8728 0000000000000000 0000000000000002 c008000000420810
[ 70.857015] GPR08: 0000000000000000 0000000000000000 0000000000000001 0000000000000001
[ 70.857015] GPR12: 0000000000000000 c000000007f92000 c0000001f8113d70 c00800000059070d
[ 70.857015] GPR16: 00000000000004f8 c008000000421080 000000000000fff1 c008000000421038
[ 70.857015] GPR20: c00000000125eb20 c000000000d1d1c8 c008000000590000 0000000000000000
[ 70.857015] GPR24: 4000000000000510 c008000008000000 c0000000012355d8 c008000000420940
[ 70.857015] GPR28: c008000008000011 0000000000000000 c000000000cf8728 c00000000169a098
[ 70.857097] NIP [c000000000853920] __of_find_property+0x30/0xd0
[ 70.857106] LR [c000000000853a14] of_find_property+0x54/0x90
[ 70.857113] Call Trace:
[ 70.857117] Instruction dump:
[ 70.857124] 3c4c00b2 3842f210 2c230000 418200bc 7c0802a6 fba1ffe8 fbc1fff0 7cbd2b78
[ 70.857136] fbe1fff8 7c9e2378 f8010010 f821ffc1 <ebe30030> 2fbf0000 409e0014 48000064
[ 70.857152] ---[ end trace 13755f7502f3150b ]---
[ 70.864199]
[ 70.864226] Sending IPI to other CPUs
[ 82.011761] ERROR: 15 cpu(s) not responding
>> This patch fixes this by enabling translation in the exception handler
>> when all required real mode handling is done. This change only affects
>> the pSeries platform.
> Not supposed to do this, because we might not be in a state
> where the MMU is ready to be turned on at this point.
>
> I'd like to understand better which accesses are a problem, and whether
> we can fix them all to be in the RMO.
I ran into three such problematic accesses:
* accessing per-cpu data (such as mce_event and mce_event_queue); we can
  move these inside the RMO.
* calling fwnmi_release_errinfo().
* queuing work with irq_work_queue(); I'm not sure how to fix this one.
> Thanks,
> Nick
Thread overview: 7+ messages
2020-03-13 14:04 [PATCH] powerpc/pseries: Fix MCE handling on pseries Ganesh Goudar
2020-03-14 3:48 ` Nicholas Piggin
2020-03-16 11:47 ` Ganesh [this message]
2020-03-17 10:01 ` Nicholas Piggin
2020-03-17 14:35 ` Ganesh
2020-03-20 2:41 ` Nicholas Piggin
2020-03-20 9:09 ` Ganesh