linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] powerpc/book3s: Fix MCE console messages for unrecoverable MCE.
@ 2016-08-04  4:46 Mahesh J Salgaonkar
  2016-08-04  8:05 ` Greg KH
  2016-08-04  9:57 ` Michael Ellerman
  0 siblings, 2 replies; 7+ messages in thread
From: Mahesh J Salgaonkar @ 2016-08-04  4:46 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman; +Cc: Paul Mackerras, stable

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

When machine check occurs with MSR(RI=0), it means MC interrupt is
unrecoverable and kernel goes down to panic path. But the console
message still shows it as recovered. This patch fixes the MCE console
messages.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/mce.c             |    3 ++-
 arch/powerpc/platforms/powernv/opal.c |    2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index ef267fd..5e7ece0 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -92,7 +92,8 @@ void save_mce_event(struct pt_regs *regs, long handled,
 	mce->in_use = 1;
 
 	mce->initiator = MCE_INITIATOR_CPU;
-	if (handled)
+	/* Mark it recovered if we have handled it and MSR(RI=1). */
+	if (handled && (regs->msr & MSR_RI))
 		mce->disposition = MCE_DISPOSITION_RECOVERED;
 	else
 		mce->disposition = MCE_DISPOSITION_NOT_RECOVERED;
diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 5385434..8154171 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -401,6 +401,8 @@ static int opal_recover_mce(struct pt_regs *regs,
 
 	if (!(regs->msr & MSR_RI)) {
 		/* If MSR_RI isn't set, we cannot recover */
+		printk(KERN_ERR "Machine check interrupt unrecoverable:"
+				" MSR(RI=0)\n");
 		recovered = 0;
 	} else if (evt->disposition == MCE_DISPOSITION_RECOVERED) {
 		/* Platform corrected itself */

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] powerpc/book3s: Fix MCE console messages for unrecoverable MCE.
  2016-08-04  4:46 [PATCH] powerpc/book3s: Fix MCE console messages for unrecoverable MCE Mahesh J Salgaonkar
@ 2016-08-04  8:05 ` Greg KH
  2016-08-04  8:59   ` Mahesh Jagannath Salgaonkar
  2016-08-04  9:57 ` Michael Ellerman
  1 sibling, 1 reply; 7+ messages in thread
From: Greg KH @ 2016-08-04  8:05 UTC (permalink / raw)
  To: Mahesh J Salgaonkar
  Cc: linuxppc-dev, Michael Ellerman, Paul Mackerras, stable

On Thu, Aug 04, 2016 at 10:16:48AM +0530, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> 
> When machine check occurs with MSR(RI=0), it means MC interrupt is
> unrecoverable and kernel goes down to panic path. But the console
> message still shows it as recovered. This patch fixes the MCE console
> messages.
> 
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/mce.c             |    3 ++-
>  arch/powerpc/platforms/powernv/opal.c |    2 ++
>  2 files changed, 4 insertions(+), 1 deletion(-)


<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read Documentation/stable_kernel_rules.txt
for how to do this properly.

</formletter>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] powerpc/book3s: Fix MCE console messages for unrecoverable MCE.
  2016-08-04  8:05 ` Greg KH
@ 2016-08-04  8:59   ` Mahesh Jagannath Salgaonkar
  2016-08-04  9:53     ` Michael Ellerman
  0 siblings, 1 reply; 7+ messages in thread
From: Mahesh Jagannath Salgaonkar @ 2016-08-04  8:59 UTC (permalink / raw)
  To: Greg KH; +Cc: linuxppc-dev, Michael Ellerman, Paul Mackerras, stable

On 08/04/2016 01:35 PM, Greg KH wrote:
> On Thu, Aug 04, 2016 at 10:16:48AM +0530, Mahesh J Salgaonkar wrote:
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> When machine check occurs with MSR(RI=0), it means MC interrupt is
>> unrecoverable and kernel goes down to panic path. But the console
>> message still shows it as recovered. This patch fixes the MCE console
>> messages.
>>
>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/kernel/mce.c             |    3 ++-
>>  arch/powerpc/platforms/powernv/opal.c |    2 ++
>>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> 
> <formletter>
> 
> This is not the correct way to submit patches for inclusion in the
> stable kernel tree.  Please read Documentation/stable_kernel_rules.txt
> for how to do this properly.
> 
> </formletter>
> 

Ouch. My mistake. Will follow Documentation/stable_kernel_rules.txt

Thanks,
-Mahesh.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] powerpc/book3s: Fix MCE console messages for unrecoverable MCE.
  2016-08-04  8:59   ` Mahesh Jagannath Salgaonkar
@ 2016-08-04  9:53     ` Michael Ellerman
  0 siblings, 0 replies; 7+ messages in thread
From: Michael Ellerman @ 2016-08-04  9:53 UTC (permalink / raw)
  To: Mahesh Jagannath Salgaonkar, Greg KH; +Cc: linuxppc-dev, Paul Mackerras, stable

Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> writes:

> On 08/04/2016 01:35 PM, Greg KH wrote:
>> On Thu, Aug 04, 2016 at 10:16:48AM +0530, Mahesh J Salgaonkar wrote:
>>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>>
>>> When machine check occurs with MSR(RI=0), it means MC interrupt is
>>> unrecoverable and kernel goes down to panic path. But the console
>>> message still shows it as recovered. This patch fixes the MCE console
>>> messages.
>>>
>>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> 
>> <formletter>
>> 
>> This is not the correct way to submit patches for inclusion in the
>> stable kernel tree.  Please read Documentation/stable_kernel_rules.txt
>> for how to do this properly.
>> 
>> </formletter>
>
> Ouch. My mistake. Will follow Documentation/stable_kernel_rules.txt

Additionally I appreciate it if you can add a Fixes: line. Which
indicates the commit that introduced the bug you are fixing. It is
documented in Documentation/SubmittingPatches.

In this case it looks like it would be:

Fixes: 36df96f8acaf ("powerpc/book3s: Decode and save machine check event.")

cheers

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] powerpc/book3s: Fix MCE console messages for unrecoverable MCE.
  2016-08-04  4:46 [PATCH] powerpc/book3s: Fix MCE console messages for unrecoverable MCE Mahesh J Salgaonkar
  2016-08-04  8:05 ` Greg KH
@ 2016-08-04  9:57 ` Michael Ellerman
  2016-08-04 12:05   ` Mahesh Jagannath Salgaonkar
  1 sibling, 1 reply; 7+ messages in thread
From: Michael Ellerman @ 2016-08-04  9:57 UTC (permalink / raw)
  To: Mahesh J Salgaonkar, linuxppc-dev; +Cc: Paul Mackerras

Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:

> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> When machine check occurs with MSR(RI=0), it means MC interrupt is
> unrecoverable and kernel goes down to panic path. But the console
> message still shows it as recovered. This patch fixes the MCE console
> messages.
>
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/mce.c             |    3 ++-
>  arch/powerpc/platforms/powernv/opal.c |    2 ++
>  2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index ef267fd..5e7ece0 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -92,7 +92,8 @@ void save_mce_event(struct pt_regs *regs, long handled,
>  	mce->in_use = 1;
>  
>  	mce->initiator = MCE_INITIATOR_CPU;
> -	if (handled)
> +	/* Mark it recovered if we have handled it and MSR(RI=1). */
> +	if (handled && (regs->msr & MSR_RI))
>  		mce->disposition = MCE_DISPOSITION_RECOVERED;

This seems like it has bigger implications than just changing the
printk output? We're now (correctly) marking any MC where RI=0 as
unrecoverable.

Or is the only place that uses this the code below which *also* checks
MSR_RI?

> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index 5385434..8154171 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -401,6 +401,8 @@ static int opal_recover_mce(struct pt_regs *regs,
>  
>  	if (!(regs->msr & MSR_RI)) {
>  		/* If MSR_RI isn't set, we cannot recover */

Why do we check MSR_RI again here? Shouldn't we just be looking at the evt->disposition?

> +		printk(KERN_ERR "Machine check interrupt unrecoverable:"
> +				" MSR(RI=0)\n");

Are we sure it's safe to call printk() there?

Please don't split the message across lines, and use pr_err() like the
rest of the code in this file. So it would be:

		pr_err("Machine check interrupt unrecoverable: MSR(RI=0)\n");

>  		recovered = 0;
>  	} else if (evt->disposition == MCE_DISPOSITION_RECOVERED) {
>  		/* Platform corrected itself */

cheers

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] powerpc/book3s: Fix MCE console messages for unrecoverable MCE.
  2016-08-04  9:57 ` Michael Ellerman
@ 2016-08-04 12:05   ` Mahesh Jagannath Salgaonkar
  2016-08-04 12:55     ` Nicholas Piggin
  0 siblings, 1 reply; 7+ messages in thread
From: Mahesh Jagannath Salgaonkar @ 2016-08-04 12:05 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: Paul Mackerras

On 08/04/2016 03:27 PM, Michael Ellerman wrote:
> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
> 
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> When machine check occurs with MSR(RI=0), it means MC interrupt is
>> unrecoverable and kernel goes down to panic path. But the console
>> message still shows it as recovered. This patch fixes the MCE console
>> messages.
>>
>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/kernel/mce.c             |    3 ++-
>>  arch/powerpc/platforms/powernv/opal.c |    2 ++
>>  2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>> index ef267fd..5e7ece0 100644
>> --- a/arch/powerpc/kernel/mce.c
>> +++ b/arch/powerpc/kernel/mce.c
>> @@ -92,7 +92,8 @@ void save_mce_event(struct pt_regs *regs, long handled,
>>  	mce->in_use = 1;
>>  
>>  	mce->initiator = MCE_INITIATOR_CPU;
>> -	if (handled)
>> +	/* Mark it recovered if we have handled it and MSR(RI=1). */
>> +	if (handled && (regs->msr & MSR_RI))
>>  		mce->disposition = MCE_DISPOSITION_RECOVERED;
> 
> This seems like it has bigger implications than just changing the
> printk output? We're now (correctly) marking any MC where RI=0 as
> unrecoverable.
> 
> Or is the only place that uses this the code below which *also* checks
> MSR_RI?

We would always check MSR_RI at code below and panic correctly. It was
just that we were always printing it as recovered and then panic.

> 
>> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
>> index 5385434..8154171 100644
>> --- a/arch/powerpc/platforms/powernv/opal.c
>> +++ b/arch/powerpc/platforms/powernv/opal.c
>> @@ -401,6 +401,8 @@ static int opal_recover_mce(struct pt_regs *regs,
>>  
>>  	if (!(regs->msr & MSR_RI)) {
>>  		/* If MSR_RI isn't set, we cannot recover */
> 
> Why do we check MSR_RI again here? Shouldn't we just be looking at the evt->disposition?

When MSR_RI=0, where SRR0/SRR1 registers values have been thrashed,
kernel can not continue reliably if we return from interrupt. It should
definitely go down to panic path. Hence we check for RI=0 and return 0.
Where as, if MSR_RI=1 and disposition is "unrecovered", we can minimise
the damage to user process if this MCE was hit in user space context.

The print is just to tell that the kernel panic'ed because MCE occured
during a rare window where MSR RI bit was set to zero(0) and not that
handler could not fix the error.

> 
>> +		printk(KERN_ERR "Machine check interrupt unrecoverable:"
>> +				" MSR(RI=0)\n");
> 
> Are we sure it's safe to call printk() there?

Yes, we had just printed MCE event info before we came here.

> 
> Please don't split the message across lines, and use pr_err() like the
> rest of the code in this file. So it would be:
> 
> 		pr_err("Machine check interrupt unrecoverable: MSR(RI=0)\n");

Sure. Will make the change.

> 
>>  		recovered = 0;
>>  	} else if (evt->disposition == MCE_DISPOSITION_RECOVERED) {
>>  		/* Platform corrected itself */
> 
> cheers
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] powerpc/book3s: Fix MCE console messages for unrecoverable MCE.
  2016-08-04 12:05   ` Mahesh Jagannath Salgaonkar
@ 2016-08-04 12:55     ` Nicholas Piggin
  0 siblings, 0 replies; 7+ messages in thread
From: Nicholas Piggin @ 2016-08-04 12:55 UTC (permalink / raw)
  To: Mahesh Jagannath Salgaonkar
  Cc: Michael Ellerman, linuxppc-dev, Paul Mackerras

On Thu, 4 Aug 2016 17:35:45 +0530
Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:

> On 08/04/2016 03:27 PM, Michael Ellerman wrote:
> > Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
> >   
> >> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> >>
> >> When machine check occurs with MSR(RI=0), it means MC interrupt is
> >> unrecoverable and kernel goes down to panic path. But the console
> >> message still shows it as recovered. This patch fixes the MCE console
> >> messages.
> >>
> >> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> >> ---
> >>  arch/powerpc/kernel/mce.c             |    3 ++-
> >>  arch/powerpc/platforms/powernv/opal.c |    2 ++
> >>  2 files changed, 4 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> >> index ef267fd..5e7ece0 100644
> >> --- a/arch/powerpc/kernel/mce.c
> >> +++ b/arch/powerpc/kernel/mce.c
> >> @@ -92,7 +92,8 @@ void save_mce_event(struct pt_regs *regs, long handled,
> >>  	mce->in_use = 1;
> >>  
> >>  	mce->initiator = MCE_INITIATOR_CPU;
> >> -	if (handled)
> >> +	/* Mark it recovered if we have handled it and MSR(RI=1). */
> >> +	if (handled && (regs->msr & MSR_RI))
> >>  		mce->disposition = MCE_DISPOSITION_RECOVERED;  
> > 
> > This seems like it has bigger implications than just changing the
> > printk output? We're now (correctly) marking any MC where RI=0 as
> > unrecoverable.
> > 
> > Or is the only place that uses this the code below which *also* checks
> > MSR_RI?  
> 
> We would always check MSR_RI at code below and panic correctly. It was
> just that we were always printing it as recovered and then panic.
> 
> >   
> >> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> >> index 5385434..8154171 100644
> >> --- a/arch/powerpc/platforms/powernv/opal.c
> >> +++ b/arch/powerpc/platforms/powernv/opal.c
> >> @@ -401,6 +401,8 @@ static int opal_recover_mce(struct pt_regs *regs,
> >>  
> >>  	if (!(regs->msr & MSR_RI)) {
> >>  		/* If MSR_RI isn't set, we cannot recover */  
> > 
> > Why do we check MSR_RI again here? Shouldn't we just be looking at the evt->disposition?  
> 
> When MSR_RI=0, where SRR0/SRR1 registers values have been thrashed,
> kernel can not continue reliably if we return from interrupt.

If it's a user process that raises a synchronous/instruction caused
exception that gets hit with the MCE, I wonder if we can kill the
process and continue? I'm not saying you should do it with this patch.

It might need some auditing we don't leave the paca in some bad state
or something, but it would be a nice feature because right now a user
doing a busy loop of the funny 0x1ebe syscall, or maybe an illegal
instruction or emulation could probably keep the MSR_RI bit clear for
probably half or more of the CPU's cycles couldn't they?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-08-04 12:56 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-08-04  4:46 [PATCH] powerpc/book3s: Fix MCE console messages for unrecoverable MCE Mahesh J Salgaonkar
2016-08-04  8:05 ` Greg KH
2016-08-04  8:59   ` Mahesh Jagannath Salgaonkar
2016-08-04  9:53     ` Michael Ellerman
2016-08-04  9:57 ` Michael Ellerman
2016-08-04 12:05   ` Mahesh Jagannath Salgaonkar
2016-08-04 12:55     ` Nicholas Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).