* [PATCH v2] powerpc/book3s: mce: Move add_taint() later in virtual mode.
@ 2017-04-18 16:38 Mahesh J Salgaonkar
2017-04-20 11:09 ` Daniel Axtens
2017-05-03 22:18 ` [v2] " Michael Ellerman
0 siblings, 2 replies; 7+ messages in thread
From: Mahesh J Salgaonkar @ 2017-04-18 16:38 UTC (permalink / raw)
To: linuxppc-dev, Michael Ellerman; +Cc: Daniel Axtens
From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
machine_check_early() gets called in real mode. The very first time when
add_taint() is called, it prints a warning which ends up calling opal
call (that uses OPAL_CALL wrapper) for writing it to console. If we get a
very first machine check while we are in opal we are doomed. OPAL_CALL
overwrites the PACASAVEDMSR in r13 and in this case when we are done with
MCE handling the original opal call will use this new MSR on it's way
back to opal_return. This usually leads unexpected behaviour or kernel
to panic. Instead move the add_taint() call later in the virtual mode
where it is safe to call.
This is broken with current FW level. We got lucky so far for not getting
very first MCE hit while in OPAL. But easily reproducible on Mambo.
This should go to stable.
Fixes: 27ea2c420cad powerpc: Set the correct kernel taint on machine check errors.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
arch/powerpc/kernel/mce.c | 2 ++
arch/powerpc/kernel/traps.c | 4 ++--
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index a1475e6..b23b323 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -221,6 +221,8 @@ static void machine_check_process_queued_event(struct irq_work *work)
{
int index;
+ add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
+
/*
* For now just print it to console.
* TODO: log this error event to FSP or nvram.
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index ff365f9..af97e81 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -306,8 +306,6 @@ long machine_check_early(struct pt_regs *regs)
__this_cpu_inc(irq_stat.mce_exceptions);
- add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
-
if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
handled = cur_cpu_spec->machine_check_early(regs);
return handled;
@@ -741,6 +739,8 @@ void machine_check_exception(struct pt_regs *regs)
__this_cpu_inc(irq_stat.mce_exceptions);
+ add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
+
/* See if any machine dependent calls. In theory, we would want
* to call the CPU first, and call the ppc_md. one if the CPU
* one returns a positive number. However there is existing code
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v2] powerpc/book3s: mce: Move add_taint() later in virtual mode.
2017-04-18 16:38 [PATCH v2] powerpc/book3s: mce: Move add_taint() later in virtual mode Mahesh J Salgaonkar
@ 2017-04-20 11:09 ` Daniel Axtens
2017-04-21 4:07 ` Michael Ellerman
2017-05-03 22:18 ` [v2] " Michael Ellerman
1 sibling, 1 reply; 7+ messages in thread
From: Daniel Axtens @ 2017-04-20 11:09 UTC (permalink / raw)
To: Mahesh J Salgaonkar, linuxppc-dev, Michael Ellerman
Hi Mahesh,
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index a1475e6..b23b323 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -221,6 +221,8 @@ static void machine_check_process_queued_event(struct irq_work *work)
> {
> int index;
>
> + add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
> +
This bit makes sense...
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index ff365f9..af97e81 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -741,6 +739,8 @@ void machine_check_exception(struct pt_regs *regs)
>
> __this_cpu_inc(irq_stat.mce_exceptions);
>
> + add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
> +
But this bit I'm not sure about.
Isn't machine_check_exception called from asm in
kernel/exceptions-64s.S? As in, it's called really early/in real mode?
Regards,
Daniel
> /* See if any machine dependent calls. In theory, we would want
> * to call the CPU first, and call the ppc_md. one if the CPU
> * one returns a positive number. However there is existing code
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] powerpc/book3s: mce: Move add_taint() later in virtual mode.
2017-04-20 11:09 ` Daniel Axtens
@ 2017-04-21 4:07 ` Michael Ellerman
2017-04-25 4:48 ` Mahesh Jagannath Salgaonkar
2017-05-01 15:22 ` Daniel Axtens
0 siblings, 2 replies; 7+ messages in thread
From: Michael Ellerman @ 2017-04-21 4:07 UTC (permalink / raw)
To: Daniel Axtens, Mahesh J Salgaonkar, linuxppc-dev
Daniel Axtens <dja@axtens.net> writes:
>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>> index a1475e6..b23b323 100644
>> --- a/arch/powerpc/kernel/mce.c
>> +++ b/arch/powerpc/kernel/mce.c
>> @@ -221,6 +221,8 @@ static void machine_check_process_queued_event(struct irq_work *work)
>> {
>> int index;
>>
>> + add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
>> +
> This bit makes sense...
>
>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
>> index ff365f9..af97e81 100644
>> --- a/arch/powerpc/kernel/traps.c
>> +++ b/arch/powerpc/kernel/traps.c
>> @@ -741,6 +739,8 @@ void machine_check_exception(struct pt_regs *regs)
>>
>> __this_cpu_inc(irq_stat.mce_exceptions);
>>
>> + add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
>> +
>
> But this bit I'm not sure about.
>
> Isn't machine_check_exception called from asm in
> kernel/exceptions-64s.S? As in, it's called really early/in real mode?
It is called from there, in asm, but not from real mode AFAICS.
There's a call from machine_check_common(), we're already in virtual
mode there.
The other call is from unrecover_mce(), and both places that call that
do so via rfid, using PACAKMSR, which should turn on virtual mode.
But none of that really matters. The fundamental issue here is we can't
recursively call OPAL, that's what matters.
So if we were in OPAL and take an MCE, then we must not call OPAL again
from the MCE handler.
This fixes one case where we know that can happen, but AFAICS we are not
protected in general from it.
For example if we take an MCE in OPAL, decide it's not recoverable and
go to unrecover_mce(), that will call machine_check_exception() which
can then call OPAL via printk.
Or maybe there's a check in there somewhere that makes it OK, but it's
not clear to me.
cheers
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] powerpc/book3s: mce: Move add_taint() later in virtual mode.
2017-04-21 4:07 ` Michael Ellerman
@ 2017-04-25 4:48 ` Mahesh Jagannath Salgaonkar
2017-05-01 15:22 ` Daniel Axtens
1 sibling, 0 replies; 7+ messages in thread
From: Mahesh Jagannath Salgaonkar @ 2017-04-25 4:48 UTC (permalink / raw)
To: Michael Ellerman, Daniel Axtens, linuxppc-dev
On 04/21/2017 09:37 AM, Michael Ellerman wrote:
> Daniel Axtens <dja@axtens.net> writes:
>>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>>> index a1475e6..b23b323 100644
>>> --- a/arch/powerpc/kernel/mce.c
>>> +++ b/arch/powerpc/kernel/mce.c
>>> @@ -221,6 +221,8 @@ static void machine_check_process_queued_event(struct irq_work *work)
>>> {
>>> int index;
>>>
>>> + add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
>>> +
>> This bit makes sense...
>>
>>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
>>> index ff365f9..af97e81 100644
>>> --- a/arch/powerpc/kernel/traps.c
>>> +++ b/arch/powerpc/kernel/traps.c
>>> @@ -741,6 +739,8 @@ void machine_check_exception(struct pt_regs *regs)
>>>
>>> __this_cpu_inc(irq_stat.mce_exceptions);
>>>
>>> + add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
>>> +
>>
>> But this bit I'm not sure about.
>>
>> Isn't machine_check_exception called from asm in
>> kernel/exceptions-64s.S? As in, it's called really early/in real mode?
>
> It is called from there, in asm, but not from real mode AFAICS.
>
> There's a call from machine_check_common(), we're already in virtual
> mode there.
>
> The other call is from unrecover_mce(), and both places that call that
> do so via rfid, using PACAKMSR, which should turn on virtual mode.
>
>
> But none of that really matters. The fundamental issue here is we can't
> recursively call OPAL, that's what matters.
>
> So if we were in OPAL and take an MCE, then we must not call OPAL again
> from the MCE handler.
>
> This fixes one case where we know that can happen, but AFAICS we are not
> protected in general from it.
>
> For example if we take an MCE in OPAL, decide it's not recoverable and
> go to unrecover_mce(), that will call machine_check_exception() which
> can then call OPAL via printk.
>
> Or maybe there's a check in there somewhere that makes it OK, but it's
> not clear to me.
There is no check, but for non-recoverable MCE in OPAL we print mce
event, go down to panic path and reboot. Hence we are fine. For
recoverable mce error in opal we would never end up in
machine_check_exception().
Thanks,
-Mahesh.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] powerpc/book3s: mce: Move add_taint() later in virtual mode.
2017-04-21 4:07 ` Michael Ellerman
2017-04-25 4:48 ` Mahesh Jagannath Salgaonkar
@ 2017-05-01 15:22 ` Daniel Axtens
2017-05-02 6:21 ` Mahesh J Salgaonkar
1 sibling, 1 reply; 7+ messages in thread
From: Daniel Axtens @ 2017-05-01 15:22 UTC (permalink / raw)
To: Michael Ellerman, Mahesh J Salgaonkar, linuxppc-dev
Michael Ellerman <mpe@ellerman.id.au> writes:
> Daniel Axtens <dja@axtens.net> writes:
>>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>>> index a1475e6..b23b323 100644
>>> --- a/arch/powerpc/kernel/mce.c
>>> +++ b/arch/powerpc/kernel/mce.c
>>> @@ -221,6 +221,8 @@ static void machine_check_process_queued_event(struct irq_work *work)
>>> {
>>> int index;
>>>
>>> + add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
>>> +
>> This bit makes sense...
>>
>>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
>>> index ff365f9..af97e81 100644
>>> --- a/arch/powerpc/kernel/traps.c
>>> +++ b/arch/powerpc/kernel/traps.c
>>> @@ -741,6 +739,8 @@ void machine_check_exception(struct pt_regs *regs)
>>>
>>> __this_cpu_inc(irq_stat.mce_exceptions);
>>>
>>> + add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
>>> +
>>
>> But this bit I'm not sure about.
>>
>> Isn't machine_check_exception called from asm in
>> kernel/exceptions-64s.S? As in, it's called really early/in real mode?
>
> It is called from there, in asm, but not from real mode AFAICS.
>
> There's a call from machine_check_common(), we're already in virtual
> mode there.
>
> The other call is from unrecover_mce(), and both places that call that
> do so via rfid, using PACAKMSR, which should turn on virtual mode.
>
>
> But none of that really matters. The fundamental issue here is we can't
> recursively call OPAL, that's what matters.
Ah right, yes. Every time I think I understand MCEs, it is made plain to
me that I do not. I have been trying to compose a coherent reply for
over an hour and every time I think I have got somewhere I find a
massive issue in my understanding. But here goes!
> So if we were in OPAL and take an MCE, then we must not call OPAL again
> from the MCE handler.
If I'm correctly understanding the code, this includes the non-early
handlers (ppc_md.machine_check_exception) as well as the early handlers
(ppc_md.machine_check_early). Is that right?
So the only place we're safe to do prints is in the 'bottom half' on the
other side of the work queue?
> This fixes one case where we know that can happen, but AFAICS we are not
> protected in general from it.
>
> For example if we take an MCE in OPAL, decide it's not recoverable and
> go to unrecover_mce(), that will call machine_check_exception() which
> can then call OPAL via printk.
(Assuming I understand the code, we probably won't get that far because
we'll attempt an opal reboot and a panic in opal_machine_check - but
carrying on...)
I think I see what you mean: even without any tainting, panic() will
printk(), which will take us back to OPAL.
I *think* we can fix this with a tweak to opal_machine_check. It pulls
the event off the work-queue and prints it, but I now think it
shouldn't; it just try to recover. If it can't recover it should proceed
to do the opal_cec_reboot2. That might fail on a BMC system or an early
firmware level, leaving the machine hosed, but it was about to panic
anyway.
We can then leave popping the event and printing it to the bottom
half/work queue.
... assuming I have understood it correctly. Thoughts?
> Or maybe there's a check in there somewhere that makes it OK, but it's
> not clear to me.
More immediately, what I'm confused by (amongst many other things!) -
and maybe Mahesh you can enlighten me - is this: what situation is
caught by this add_taint() in machine_check_exception? If we're about to
die, tainting the kernel is probably not the most useful thing to do. As
I understand it, if we're not about to die, we'll print out the taint in
the 'bottom half'/work queue side. Is this right?
Perhaps we can slightly shrink the window for disaster and fix the hard
bits later :P
Regards,
Daniel
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] powerpc/book3s: mce: Move add_taint() later in virtual mode.
2017-05-01 15:22 ` Daniel Axtens
@ 2017-05-02 6:21 ` Mahesh J Salgaonkar
0 siblings, 0 replies; 7+ messages in thread
From: Mahesh J Salgaonkar @ 2017-05-02 6:21 UTC (permalink / raw)
To: Daniel Axtens; +Cc: Michael Ellerman, linuxppc-dev
On 2017-05-02 01:22:06 Tue, Daniel Axtens wrote:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>
> > Daniel Axtens <dja@axtens.net> writes:
> >>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> >>> index a1475e6..b23b323 100644
> >>> --- a/arch/powerpc/kernel/mce.c
> >>> +++ b/arch/powerpc/kernel/mce.c
> >>> @@ -221,6 +221,8 @@ static void machine_check_process_queued_event(struct irq_work *work)
> >>> {
> >>> int index;
> >>>
> >>> + add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
> >>> +
> >> This bit makes sense...
> >>
> >>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> >>> index ff365f9..af97e81 100644
> >>> --- a/arch/powerpc/kernel/traps.c
> >>> +++ b/arch/powerpc/kernel/traps.c
> >>> @@ -741,6 +739,8 @@ void machine_check_exception(struct pt_regs *regs)
> >>>
> >>> __this_cpu_inc(irq_stat.mce_exceptions);
> >>>
> >>> + add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
> >>> +
> >>
> >> But this bit I'm not sure about.
> >>
> >> Isn't machine_check_exception called from asm in
> >> kernel/exceptions-64s.S? As in, it's called really early/in real mode?
> >
> > It is called from there, in asm, but not from real mode AFAICS.
> >
> > There's a call from machine_check_common(), we're already in virtual
> > mode there.
> >
> > The other call is from unrecover_mce(), and both places that call that
> > do so via rfid, using PACAKMSR, which should turn on virtual mode.
> >
> >
> > But none of that really matters. The fundamental issue here is we can't
> > recursively call OPAL, that's what matters.
>
> Ah right, yes. Every time I think I understand MCEs, it is made plain to
> me that I do not. I have been trying to compose a coherent reply for
> over an hour and every time I think I have got somewhere I find a
> massive issue in my understanding. But here goes!
>
> > So if we were in OPAL and take an MCE, then we must not call OPAL again
> > from the MCE handler.
>
> If I'm correctly understanding the code, this includes the non-early
> handlers (ppc_md.machine_check_exception) as well as the early handlers
> (ppc_md.machine_check_early). Is that right?
Yes, the early handler runs in real mode and non-early handler gets called
in virtual mode. The early handler tries to recover from soft errors and
generates an MCE event. There are following scenarios where above handler's
get called depending on whether error have been recovered or not:
1. MCE in kernel/OPAL
- Call ppc_md.machine_check_early() handler in real mode
- Try to recover from soft errors and generate event.
- if Recovered:
- queue the event for work queue to print later
- Go back to kernel/OPAL code
- else if not recovered
- Switch to virtual mode
- call ppc_md.machine_check_exception()
- print event and panic
2. MCE in userspace
- Call ppc_md.machine_check_early() handler in real mode
- Try to recover from soft errors and generate event.
- Switch to virtual mode
- call ppc_md.machine_check_exception()
- Print event.
- if recovered:
- Return from interrupt.
- else if not recovered:
- Kill the process
- Return from interrupt.
>
> So the only place we're safe to do prints is in the 'bottom half' on the
> other side of the work queue?
>
> > This fixes one case where we know that can happen, but AFAICS we are not
> > protected in general from it.
> >
> > For example if we take an MCE in OPAL, decide it's not recoverable and
> > go to unrecover_mce(), that will call machine_check_exception() which
> > can then call OPAL via printk.
>
> (Assuming I understand the code, we probably won't get that far because
> we'll attempt an opal reboot and a panic in opal_machine_check - but
> carrying on...)
>
> I think I see what you mean: even without any tainting, panic() will
> printk(), which will take us back to OPAL.
>
> I *think* we can fix this with a tweak to opal_machine_check. It pulls
> the event off the work-queue and prints it, but I now think it
> shouldn't; it just try to recover. If it can't recover it should proceed
> to do the opal_cec_reboot2. That might fail on a BMC system or an early
> firmware level, leaving the machine hosed, but it was about to panic
> anyway.
>
> We can then leave popping the event and printing it to the bottom
> half/work queue.
>
> ... assuming I have understood it correctly. Thoughts?
>
> > Or maybe there's a check in there somewhere that makes it OK, but it's
> > not clear to me.
>
> More immediately, what I'm confused by (amongst many other things!) -
> and maybe Mahesh you can enlighten me - is this: what situation is
> caught by this add_taint() in machine_check_exception? If we're about to
> die, tainting the kernel is probably not the most useful thing to do. As
> I understand it, if we're not about to die, we'll print out the taint in
> the 'bottom half'/work queue side. Is this right?
machine_check_exception() gets called for unrecovered MCE in kernel/OPAL, and
MCE in userspace. For MCE in kernel and userspace, we are safe to call
add_taint from machine_check_exception() even if it calls OPAL for printk.
Now for MCE in OPAL, there are two cases:
1. Recovered MCE:
- MCE handler queue's up the event to print it later
- Return to currently executing OPAL routine.
- When scheduled, work queue calls the add_taint and prints the event.
- machine_check_exception() never gets called.
Not calling add_taint() (or anything that could make OPAL call) from MCE
handler prevents OPAL stack corruption. This makes MCE handler to
safely return to currently executing OPAL routine.
2. Unrecovered MCE:
- Can't return to currently executing OPAL routine.
- kernel needs to print the event and call panic.
- Switch to virtual mode and call machine_check_exception()
- machine_check_exception() calls add_taint() and panic()
Now for case (2), kernel needs to panic. Since MCE handler will never go back
to currently executing OPAL routine, we should be now safe to call add_taint()
or new OPAL call. Kernel can now print event and head to panic or call
opal_cec_reboot2().
Thanks,
-Mahesh.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [v2] powerpc/book3s: mce: Move add_taint() later in virtual mode.
2017-04-18 16:38 [PATCH v2] powerpc/book3s: mce: Move add_taint() later in virtual mode Mahesh J Salgaonkar
2017-04-20 11:09 ` Daniel Axtens
@ 2017-05-03 22:18 ` Michael Ellerman
1 sibling, 0 replies; 7+ messages in thread
From: Michael Ellerman @ 2017-05-03 22:18 UTC (permalink / raw)
To: Mahesh Salgaonkar, linuxppc-dev; +Cc: Daniel Axtens
On Tue, 2017-04-18 at 16:38:17 UTC, Mahesh Salgaonkar wrote:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> machine_check_early() gets called in real mode. The very first time when
> add_taint() is called, it prints a warning which ends up calling opal
> call (that uses OPAL_CALL wrapper) for writing it to console. If we get a
> very first machine check while we are in opal we are doomed. OPAL_CALL
> overwrites the PACASAVEDMSR in r13 and in this case when we are done with
> MCE handling the original opal call will use this new MSR on it's way
> back to opal_return. This usually leads unexpected behaviour or kernel
> to panic. Instead move the add_taint() call later in the virtual mode
> where it is safe to call.
>
> This is broken with current FW level. We got lucky so far for not getting
> very first MCE hit while in OPAL. But easily reproducible on Mambo.
> This should go to stable.
>
> Fixes: 27ea2c420cad powerpc: Set the correct kernel taint on machine check errors.
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Applied to powerpc next, thanks.
https://git.kernel.org/powerpc/c/d93b0ac01a9ce276ec39644be47001
cheers
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2017-05-03 22:18 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-04-18 16:38 [PATCH v2] powerpc/book3s: mce: Move add_taint() later in virtual mode Mahesh J Salgaonkar
2017-04-20 11:09 ` Daniel Axtens
2017-04-21 4:07 ` Michael Ellerman
2017-04-25 4:48 ` Mahesh Jagannath Salgaonkar
2017-05-01 15:22 ` Daniel Axtens
2017-05-02 6:21 ` Mahesh J Salgaonkar
2017-05-03 22:18 ` [v2] " Michael Ellerman
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).