linux-edac.vger.kernel.org archive mirror
* [RFC,v3,3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal"
From: Alexandru Gagniuc @ 2018-04-25 20:39 UTC
  To: linux-acpi, linux-edac
  Cc: Alexandru Gagniuc, Rafael J. Wysocki, Len Brown, Tony Luck,
	Borislav Petkov, Mauro Carvalho Chehab, Robert Moore,
	Erik Schmauss, Tyler Baicar, Will Deacon, James Morse, Shiju Jose,
	Jonathan (Zhixiong) Zhang, Dongjiu Geng, linux-kernel, devel

There seems to be a culture amongst BIOS teams to want to crash the
OS when an error can't be handled in firmware. Marking GHES errors as
"fatal" is a very common way to do this.

However, a number of errors reported by GHES may be fatal in the sense
that a device or link is lost, but are not fatal to the system. When there
is a disagreement with firmware about the handleability of an error,
print a warning message.

Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
 drivers/acpi/apei/ghes.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 8ccb9cc10fc8..34d0da692dd0 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -539,6 +539,12 @@ static void ghes_do_proc(struct ghes *ghes,
 					       sec_sev, err,
 					       gdata->error_data_length);
 		}
+
+	}
+
+	if ((sev >= GHES_SEV_PANIC) && (ghes_actual_severity(ghes) < sev)) {
+		pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
+		pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");
 	}
 }
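
The check added above depends on ghes_actual_severity(), which is not part of this message. A minimal userspace sketch of just the comparison follows, assuming that helper returns the severity the OS itself assigns to the error. The enum values and the name firmware_overstated() are hypothetical and only mirror the GHES_SEV_* ordering used in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Severity levels, ordered from least to most severe. The names echo the
 * GHES_SEV_* constants in the patch; the numeric values are an assumption.
 */
enum sev {
	SEV_NO = 0,
	SEV_CORRECTED = 1,
	SEV_RECOVERABLE = 2,
	SEV_PANIC = 3,
};

/*
 * Hypothetical helper: true when firmware reported a fatal ("panic")
 * severity but the OS-side assessment is lower. 'actual' stands in for
 * whatever ghes_actual_severity() would return in the patch.
 */
static bool firmware_overstated(enum sev reported, enum sev actual)
{
	/* Reject values outside the assumed severity range. */
	assert((int)reported >= SEV_NO && (int)reported <= SEV_PANIC);
	assert((int)actual >= SEV_NO && (int)actual <= SEV_PANIC);

	return reported >= SEV_PANIC && actual < reported;
}
```

Under these assumptions, firmware_overstated(SEV_PANIC, SEV_RECOVERABLE) is true, which is the case that triggers the warning, while firmware_overstated(SEV_PANIC, SEV_PANIC) is false because both sides agree the error is fatal.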
 


* [RFC,v3,3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal"
From: Borislav Petkov @ 2018-04-26 11:20 UTC
  To: Alexandru Gagniuc
  Cc: linux-acpi, linux-edac, Rafael J. Wysocki, Len Brown, Tony Luck,
	Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar,
	Will Deacon, James Morse, Shiju Jose, Jonathan (Zhixiong) Zhang,
	Dongjiu Geng, linux-kernel, devel

On Wed, Apr 25, 2018 at 03:39:51PM -0500, Alexandru Gagniuc wrote:
> There seems to be a culture amongst BIOS teams to want to crash the
> OS when an error can't be handled in firmware. Marking GHES errors as
> "fatal" is a very common way to do this.
> 
> However, a number of errors reported by GHES may be fatal in the sense
> a device or link is lost, but are not fatal to the system. When there
> is a disagreement with firmware about the handleability of an error,
> print a warning message.
> 
> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> ---
>  drivers/acpi/apei/ghes.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 8ccb9cc10fc8..34d0da692dd0 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -539,6 +539,12 @@ static void ghes_do_proc(struct ghes *ghes,
>  					       sec_sev, err,
>  					       gdata->error_data_length);
>  		}
> +
> +	}
> +
> +	if ((sev >= GHES_SEV_PANIC) && (ghes_actual_severity(ghes) < sev)) {
> +		pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
> +		pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");

Pasting the same comment from last time since you missed it:

"No, I don't want any of that crap issuing stuff in dmesg and then people
opening bugs and running around and trying to replace hardware.

We either can handle the error and log a normal record somewhere or we
cannot and explode. The complaining about the FW doesn't bring shit."


* [RFC,v3,3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal"
From: Alexandru Gagniuc @ 2018-04-26 17:47 UTC
  To: Borislav Petkov
  Cc: linux-acpi, linux-edac, Rafael J. Wysocki, Len Brown, Tony Luck,
	Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar,
	Will Deacon, James Morse, Shiju Jose, Jonathan (Zhixiong) Zhang,
	Dongjiu Geng, linux-kernel, devel

On 04/26/2018 06:20 AM, Borislav Petkov wrote:
> Pasting the same comment from last time since you missed it:
> 
> "No, I don't want any of that crap issuing stuff in dmesg and then people
> opening bugs and running around and trying to replace hardware.
> 
> We either can handle the error and log a normal record somewhere or we
> cannot and explode. The complaining about the FW doesn't bring shit."

"Borislav, if you don't like the third patch in the series, feel free
to leave it out. Things will work beautifully with or without it."

:)
---
To unsubscribe from this list: send the line "unsubscribe linux-edac" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [RFC,v3,3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal"
From: Borislav Petkov @ 2018-04-26 18:03 UTC
  To: Alex G.
  Cc: linux-acpi, linux-edac, Rafael J. Wysocki, Len Brown, Tony Luck,
	Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar,
	Will Deacon, James Morse, Shiju Jose, Jonathan (Zhixiong) Zhang,
	Dongjiu Geng, linux-kernel, devel

On Thu, Apr 26, 2018 at 12:47:30PM -0500, Alex G. wrote:
> "Borislav, if you don't like the third patch in the series, feel free to
> leave it out. Things will work beautifully with or without it."

Then don't send it.


* [RFC,v3,3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal"
From: Pavel Machek @ 2018-05-02 19:10 UTC
  To: Borislav Petkov
  Cc: Alexandru Gagniuc, linux-acpi, linux-edac, Rafael J. Wysocki,
	Len Brown, Tony Luck, Mauro Carvalho Chehab, Robert Moore,
	Erik Schmauss, Tyler Baicar, Will Deacon, James Morse, Shiju Jose,
	Jonathan (Zhixiong) Zhang, Dongjiu Geng, linux-kernel, devel

On Thu 2018-04-26 13:20:57, Borislav Petkov wrote:
> On Wed, Apr 25, 2018 at 03:39:51PM -0500, Alexandru Gagniuc wrote:
> > There seems to be a culture amongst BIOS teams to want to crash the
> > OS when an error can't be handled in firmware. Marking GHES errors as
> > "fatal" is a very common way to do this.
> > 
> > However, a number of errors reported by GHES may be fatal in the sense
> > a device or link is lost, but are not fatal to the system. When there
> > is a disagreement with firmware about the handleability of an error,
> > print a warning message.


> > +
> > +	if ((sev >= GHES_SEV_PANIC) && (ghes_actual_severity(ghes) < sev)) {
> > +		pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
> > +		pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");
> 
> Pasting the same comment from last time since you missed it:
> 
> "No, I don't want any of that crap issuing stuff in dmesg and then people
> opening bugs and running around and trying to replace hardware.

We want to see warnings. Maybe they can be toned down. We even have
dedicated distros for firmware testing.

> Good mailing practices for 400: avoid top-posting and trim the reply.

Good mailing practices -- limit use of four letter words on public lists.

Pavel


* [RFC,v3,3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal"
From: Alexandru Gagniuc @ 2018-05-02 19:29 UTC
  To: Pavel Machek, Borislav Petkov
  Cc: linux-acpi, linux-edac, Rafael J. Wysocki, Len Brown, Tony Luck,
	Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar,
	Will Deacon, James Morse, Shiju Jose, Jonathan (Zhixiong) Zhang,
	Dongjiu Geng, linux-kernel, devel

On 05/02/2018 02:10 PM, Pavel Machek wrote:
> On Thu 2018-04-26 13:20:57, Borislav Petkov wrote:
>> On Wed, Apr 25, 2018 at 03:39:51PM -0500, Alexandru Gagniuc wrote:
>>> There seems to be a culture amongst BIOS teams to want to crash the
>>> OS when an error can't be handled in firmware. Marking GHES errors as
>>> "fatal" is a very common way to do this.
>>>
>>> However, a number of errors reported by GHES may be fatal in the sense
>>> a device or link is lost, but are not fatal to the system. When there
>>> is a disagreement with firmware about the handleability of an error,
>>> print a warning message.
> 
> 
>>> +
>>> +	if ((sev >= GHES_SEV_PANIC) && (ghes_actual_severity(ghes) < sev)) {
>>> +		pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
>>> +		pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");
>>
>> Pasting the same comment from last time since you missed it:
>>
>> "No, I don't want any of that crap issuing stuff in dmesg and then people
>> opening bugs and running around and trying to replace hardware.
> 
> We want to see warnings. Maybe they can be toned down. We even have
> dedicated distros for firmware testing.

I'm told that had we had this warning when the r740 BIOS was in
development, we would have solved a lot of the issues that I'm currently
working on. That would, in turn, have exposed bigger issues, and we
would have had a platform to fix and test those bigger issues.

Hardware vendors who test on Linux might be scratching their heads at
this error, though they tend to figure out what they're doing wrong, and
fix it.

One argument against was "expensive support calls", on which I call BS.
The firmware resources are expensive, but those are there whether or not
the customers call to complain.

Alex

>> Good mailing practices for 400: avoid top-posting and trim the reply.
> 
> Good mailing practices -- limit use of four letter words on public lists.

Then we can't show the word 'four'.
