* [PATCH] acpi: apei: clear error status before acknowledging the error
@ 2017-07-28 22:25 Tyler Baicar
2017-07-29 6:53 ` Borislav Petkov
0 siblings, 1 reply; 8+ messages in thread
From: Tyler Baicar @ 2017-07-28 22:25 UTC (permalink / raw)
To: rjw, lenb, will.deacon, james.morse, bp, shiju.jose, geliangtang,
andriy.shevchenko, tony.luck, linux-acpi, linux-kernel
Cc: Tyler Baicar
Currently we acknowledge errors before clearing the error status.
This could cause a new error to be populated by firmware in-between
the error acknowledgment and the error status clearing which would
cause the second error's status to be cleared without being handled.
So, clear the error status before acknowledging the errors.
Also, make sure to acknowledge the error if the error status read
fails.
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
---
drivers/acpi/apei/ghes.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
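The ordering problem described in the commit message can be sketched with
a toy userspace model. This is not kernel code: every name below
(toy_ghes, toy_fw_step, toy_proc, toy_lost_errors) is hypothetical, and
the model assumes the worst case, where firmware writes its next record
as soon as it sees the ack.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one GHESv2 error source; all names are hypothetical. */
struct toy_ghes {
	bool status_valid;  /* the error status block holds a record */
	bool seen;          /* the OS has processed that record */
	bool fw_may_write;  /* set by the OS ack, consumed by firmware */
	int pending;        /* errors firmware still wants to report */
	int handled;        /* errors the OS processed */
	int lost;           /* errors cleared before the OS looked at them */
};

/* Firmware: writes a new record only after the previous one was acked. */
static void toy_fw_step(struct toy_ghes *g)
{
	if (!g->fw_may_write || g->pending == 0)
		return;
	g->status_valid = true;
	g->seen = false;
	g->fw_may_write = false;
	g->pending--;
}

static void toy_handle(struct toy_ghes *g)
{
	if (g->status_valid && !g->seen) {
		g->seen = true;
		g->handled++;
	}
}

static void toy_clear(struct toy_ghes *g)
{
	/* Clearing a record nobody processed loses that error. */
	if (g->status_valid && !g->seen)
		g->lost++;
	g->status_valid = false;
}

static void toy_ack(struct toy_ghes *g)
{
	g->fw_may_write = true;
}

/* One handler pass; firmware gets a chance to run after every OS step. */
static void toy_proc(struct toy_ghes *g, bool clear_first)
{
	toy_handle(g);
	toy_fw_step(g);
	if (clear_first) {
		toy_clear(g);
		toy_fw_step(g);
		toy_ack(g);
	} else {
		toy_ack(g);
		toy_fw_step(g);
		toy_clear(g);
	}
	toy_fw_step(g);
}

/* Number of errors lost when firmware reports `total` errors in a row. */
static int toy_lost_errors(int total, bool clear_first)
{
	struct toy_ghes g = { .status_valid = true, .pending = total - 1 };
	int i;

	assert(total >= 1);
	for (i = 0; i < total; i++)
		toy_proc(&g, clear_first);
	return g.lost;
}
```

In this model, toy_lost_errors(2, false) (ack before clear) drops the
second error, while toy_lost_errors(2, true) (clear before ack) drops
none.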
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index d661d45..6a6895a 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -743,17 +743,15 @@ static int ghes_proc(struct ghes *ghes)
}
ghes_do_proc(ghes, ghes->estatus);
+out:
+ ghes_clear_estatus(ghes);
/*
* GHESv2 type HEST entries introduce support for error acknowledgment,
* so only acknowledge the error if this support is present.
*/
if (is_hest_type_generic_v2(ghes)) {
rc = ghes_ack_error(ghes->generic_v2);
- if (rc)
- return rc;
}
-out:
- ghes_clear_estatus(ghes);
return rc;
}
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
* Re: [PATCH] acpi: apei: clear error status before acknowledging the error
2017-07-28 22:25 [PATCH] acpi: apei: clear error status before acknowledging the error Tyler Baicar
@ 2017-07-29 6:53 ` Borislav Petkov
2017-07-31 16:15 ` Baicar, Tyler
0 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2017-07-29 6:53 UTC (permalink / raw)
To: Tyler Baicar
Cc: rjw, lenb, will.deacon, james.morse, shiju.jose, geliangtang,
andriy.shevchenko, tony.luck, linux-acpi, linux-kernel
On Fri, Jul 28, 2017 at 04:25:03PM -0600, Tyler Baicar wrote:
> Currently we acknowledge errors before clearing the error status.
> This could cause a new error to be populated by firmware in-between
> the error acknowledgment and the error status clearing which would
> cause the second error's status to be cleared without being handled.
> So, clear the error status before acknowledging the errors.
>
> Also, make sure to acknowledge the error if the error status read
> fails.
>
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> ---
> drivers/acpi/apei/ghes.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index d661d45..6a6895a 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -743,17 +743,15 @@ static int ghes_proc(struct ghes *ghes)
> }
> ghes_do_proc(ghes, ghes->estatus);
>
> +out:
If the first ghes_read_estatus() fails and we jump straight to that
label...
> + ghes_clear_estatus(ghes);
> /*
> * GHESv2 type HEST entries introduce support for error acknowledgment,
> * so only acknowledge the error if this support is present.
> */
> if (is_hest_type_generic_v2(ghes)) {
> rc = ghes_ack_error(ghes->generic_v2);
... and ACK the error anyway, even though the status read failed, wouldn't that
confuse the firmware?
> - if (rc)
> - return rc;
> }
No need for the curly brackets anymore.
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
* Re: [PATCH] acpi: apei: clear error status before acknowledging the error
2017-07-29 6:53 ` Borislav Petkov
@ 2017-07-31 16:15 ` Baicar, Tyler
2017-07-31 17:00 ` Luck, Tony
2017-07-31 17:11 ` James Morse
0 siblings, 2 replies; 8+ messages in thread
From: Baicar, Tyler @ 2017-07-31 16:15 UTC (permalink / raw)
To: Borislav Petkov
Cc: rjw, lenb, will.deacon, james.morse, shiju.jose, geliangtang,
andriy.shevchenko, tony.luck, linux-acpi, linux-kernel
On 7/29/2017 12:53 AM, Borislav Petkov wrote:
> On Fri, Jul 28, 2017 at 04:25:03PM -0600, Tyler Baicar wrote:
>> Currently we acknowledge errors before clearing the error status.
>> This could cause a new error to be populated by firmware in-between
>> the error acknowledgment and the error status clearing which would
>> cause the second error's status to be cleared without being handled.
>> So, clear the error status before acknowledging the errors.
>>
>> Also, make sure to acknowledge the error if the error status read
>> fails.
>>
>> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
>> ---
>> drivers/acpi/apei/ghes.c | 6 ++----
>> 1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index d661d45..6a6895a 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -743,17 +743,15 @@ static int ghes_proc(struct ghes *ghes)
>> }
>> ghes_do_proc(ghes, ghes->estatus);
>>
>> +out:
> If the first ghes_read_estatus() fails and we jump straight to that
> label...
>
>> + ghes_clear_estatus(ghes);
>> /*
>> * GHESv2 type HEST entries introduce support for error acknowledgment,
>> * so only acknowledge the error if this support is present.
>> */
>> if (is_hest_type_generic_v2(ghes)) {
>> rc = ghes_ack_error(ghes->generic_v2);
> ... and ACK the error anyway, even though the status read failed, wouldn't that
> confuse the firmware?
Hello Boris,
I think the better thing to do in this case is to still send the ack.
If ghes_read_estatus() fails, then either we are unable to read the
estatus or the estatus is empty/invalid. For the first case, there's
not much that can be done. The second case would be a FW bug with
populating the estatus.
If we do not send the ack, then we will be in a scenario where FW will
not send any more errors. I think it would be better to still have the
FW send the errors and the kernel complain about issues with the errors
populated, rather than have the kernel complain on the first error and
then not be sent any more errors.
If you don't agree with this, then I can change it back to not sending
the ack if the read fails.
>
>> - if (rc)
>> - return rc;
>> }
> No need for the curly brackets anymore.
I'll remove these brackets in the next version.
Thanks,
Tyler
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
* Re: [PATCH] acpi: apei: clear error status before acknowledging the error
2017-07-31 16:15 ` Baicar, Tyler
@ 2017-07-31 17:00 ` Luck, Tony
2017-07-31 17:44 ` Baicar, Tyler
2017-07-31 17:11 ` James Morse
1 sibling, 1 reply; 8+ messages in thread
From: Luck, Tony @ 2017-07-31 17:00 UTC (permalink / raw)
To: Baicar, Tyler
Cc: Borislav Petkov, rjw, lenb, will.deacon, james.morse, shiju.jose,
geliangtang, andriy.shevchenko, linux-acpi, linux-kernel
On Mon, Jul 31, 2017 at 10:15:27AM -0600, Baicar, Tyler wrote:
> I think the better thing to do in this case is still send the ack. If
> ghes_read_estatus() fails, then
> either we are unable to read the estatus or the estatus is empty/invalid.
Right now we silently handle that failure of ghes_read_estatus(). That
might be hiding some Linux bugs if we are calling ghes_proc() in cases
where we shouldn't.
Perhaps we should have something like this, so if systems do start acting
weirdly there will be a note that we took this path:
rc = ghes_read_estatus(ghes, 0);
if (rc) {
pr_notice("surprise failure reading ghes estatus\n");
goto out;
}
> If we do not send the ack, then we will be in a scenario where FW will not
> send any more errors.
We might ACK something that the firmware didn't send, which may
lead to other problems.
> I think it would be better to still have the FW send the errors and kernel
> complain about issues with
But I agree with this. We should send the ACK. Luckily this doesn't have
a long legacy problem because the whole ACK mechanism is a new thing. So
we only have to worry about GHESv2 supporting BIOS.
-Tony
* Re: [PATCH] acpi: apei: clear error status before acknowledging the error
2017-07-31 16:15 ` Baicar, Tyler
2017-07-31 17:00 ` Luck, Tony
@ 2017-07-31 17:11 ` James Morse
2017-07-31 17:57 ` Baicar, Tyler
1 sibling, 1 reply; 8+ messages in thread
From: James Morse @ 2017-07-31 17:11 UTC (permalink / raw)
To: Baicar, Tyler
Cc: Borislav Petkov, rjw, lenb, will.deacon, shiju.jose, geliangtang,
andriy.shevchenko, tony.luck, linux-acpi, linux-kernel
Hi Tyler,
On 31/07/17 17:15, Baicar, Tyler wrote:
> On 7/29/2017 12:53 AM, Borislav Petkov wrote:
>> On Fri, Jul 28, 2017 at 04:25:03PM -0600, Tyler Baicar wrote:
>>> Currently we acknowledge errors before clearing the error status.
>>> This could cause a new error to be populated by firmware in-between
>>> the error acknowledgment and the error status clearing which would
>>> cause the second error's status to be cleared without being handled.
>>> So, clear the error status before acknowledging the errors.
>>>
>>> Also, make sure to acknowledge the error if the error status read
>>> fails.
>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>> index d661d45..6a6895a 100644
>>> --- a/drivers/acpi/apei/ghes.c
>>> +++ b/drivers/acpi/apei/ghes.c
>>> @@ -743,17 +743,15 @@ static int ghes_proc(struct ghes *ghes)
>>> }
>>> ghes_do_proc(ghes, ghes->estatus);
>>> +out:
>> If the first ghes_read_estatus() fails and we jump straight to that
>> label...
>>
>>> + ghes_clear_estatus(ghes);
>>> /*
>>> * GHESv2 type HEST entries introduce support for error acknowledgment,
>>> * so only acknowledge the error if this support is present.
>>> */
>>> if (is_hest_type_generic_v2(ghes)) {
>>> rc = ghes_ack_error(ghes->generic_v2);
>> ... and ACK the error anyway, even though the status read failed, wouldn't that
>> confuse the firmware?
> I think the better thing to do in this case is still send the ack. If
> ghes_read_estatus() fails, then
> either we are unable to read the estatus or the estatus is empty/invalid. For
> the first case, there's
> not much that can be done. The second case would be a FW bug with populating the
> estatus.
Wouldn't this mean acking on a timer for ghes_poll_func()?
What happens if:
> kernel: read error-status-block
> kernel: nothing here
> firmware: error! write to error-status-block
> kernel: write to ack register
(this is probably only a problem for polling as there is no notification)
> If we do not send the ack, then we will be in a scenario where FW will not send
> any more errors.
Because we haven't yet handled the first one...
I thought GHESv2's ack was also used to catch errors that occur while an earlier
error is being handled. But from the text in ACPI 6.2's 18.3.2.8 the 'ack' is
only described as releasing the memory region, not completion of the error handler.
Thanks,
James
* Re: [PATCH] acpi: apei: clear error status before acknowledging the error
2017-07-31 17:00 ` Luck, Tony
@ 2017-07-31 17:44 ` Baicar, Tyler
2017-08-03 22:06 ` Baicar, Tyler
0 siblings, 1 reply; 8+ messages in thread
From: Baicar, Tyler @ 2017-07-31 17:44 UTC (permalink / raw)
To: Luck, Tony
Cc: Borislav Petkov, rjw, lenb, will.deacon, james.morse, shiju.jose,
geliangtang, andriy.shevchenko, linux-acpi, linux-kernel
On 7/31/2017 11:00 AM, Luck, Tony wrote:
> On Mon, Jul 31, 2017 at 10:15:27AM -0600, Baicar, Tyler wrote:
>> I think the better thing to do in this case is still send the ack. If
>> ghes_read_estatus() fails, then
>> either we are unable to read the estatus or the estatus is empty/invalid.
> Right now we silently handle that failure of ghes_read_estatus(). That
> might be hiding some Linux bugs if we are calling ghes_proc() in cases
> where we shouldn't.
>
> Perhaps we should have something like this, so if systems do start acting
> weirdly there will be a note that we took this path:
>
> rc = ghes_read_estatus(ghes, 0);
> if (rc) {
> pr_notice("surprise failure reading ghes estatus\n");
> goto out;
> }
Thank you Tony for the feedback. I can add a print like this in the
next version. I'll verify that rc is not -ENOENT though, so we don't
print it in the empty case, since the polled source will be hitting
this path frequently.
-Tyler
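
A minimal sketch of that check, written as a standalone helper so it can
be compiled outside the kernel. The helper name
toy_should_log_read_failure() is hypothetical; the only assumption taken
from the discussion is that -ENOENT is what an empty status block
returns.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a ghes_read_estatus() return code
 * deserves a log line. 0 means the read succeeded, and -ENOENT is
 * assumed to mean the status block was simply empty, which a polled
 * source hits on every poll with no error present.
 */
static bool toy_should_log_read_failure(int rc)
{
	return rc != 0 && rc != -ENOENT;
}
```

With this, a polled source that finds nothing stays quiet, while any
other failure (for example -EIO) would still be reported.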
>
>> If we do not send the ack, then we will be in a scenario where FW will not
>> send any more errors.
> We might ACK something that the firmware didn't send, which may
> lead to other problems.
>
>> I think it would be better to still have the FW send the errors and kernel
>> complain about issues with
> But I agree with this. We should send the ACK. Luckily this doesn't have
> a long legacy problem because the whole ACK mechanism is a new thing. So
> we only have to worry about GHESv2 supporting BIOS.
>
> -Tony
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
* Re: [PATCH] acpi: apei: clear error status before acknowledging the error
2017-07-31 17:11 ` James Morse
@ 2017-07-31 17:57 ` Baicar, Tyler
0 siblings, 0 replies; 8+ messages in thread
From: Baicar, Tyler @ 2017-07-31 17:57 UTC (permalink / raw)
To: James Morse
Cc: Borislav Petkov, rjw, lenb, will.deacon, shiju.jose, geliangtang,
andriy.shevchenko, tony.luck, linux-acpi, linux-kernel
On 7/31/2017 11:11 AM, James Morse wrote:
> Hi Tyler,
>
> On 31/07/17 17:15, Baicar, Tyler wrote:
>> On 7/29/2017 12:53 AM, Borislav Petkov wrote:
>>> On Fri, Jul 28, 2017 at 04:25:03PM -0600, Tyler Baicar wrote:
>>>> Currently we acknowledge errors before clearing the error status.
>>>> This could cause a new error to be populated by firmware in-between
>>>> the error acknowledgment and the error status clearing which would
>>>> cause the second error's status to be cleared without being handled.
>>>> So, clear the error status before acknowledging the errors.
>>>>
>>>> Also, make sure to acknowledge the error if the error status read
>>>> fails.
>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>> index d661d45..6a6895a 100644
>>>> --- a/drivers/acpi/apei/ghes.c
>>>> +++ b/drivers/acpi/apei/ghes.c
>>>> @@ -743,17 +743,15 @@ static int ghes_proc(struct ghes *ghes)
>>>> }
>>>> ghes_do_proc(ghes, ghes->estatus);
>>>> +out:
>>> If the first ghes_read_estatus() fails and we jump straight to that
>>> label...
>>>
>>>> + ghes_clear_estatus(ghes);
>>>> /*
>>>> * GHESv2 type HEST entries introduce support for error acknowledgment,
>>>> * so only acknowledge the error if this support is present.
>>>> */
>>>> if (is_hest_type_generic_v2(ghes)) {
>>>> rc = ghes_ack_error(ghes->generic_v2);
>>> ... and ACK the error anyway, even though the status read failed, wouldn't that
>>> confuse the firmware?
>> I think the better thing to do in this case is still send the ack. If
>> ghes_read_estatus() fails, then
>> either we are unable to read the estatus or the estatus is empty/invalid. For
>> the first case, there's
>> not much that can be done. The second case would be a FW bug with populating the
>> estatus.
> Wouldn't this mean acking on a timer for ghes_poll_func()?
>
> What happens if:
>> kernel: read error-status-block
>> kernel: nothing here
>> firmware: error! write to error-status-block
>> kernel: write to ack register
> (this is probably only a problem for polling as there is no notification)
Hello James,
Yes, good point! I'll add a check so we avoid sending the ack for
polling sources that return -ENOENT from ghes_read_estatus().
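
Roughly, the ack decision would then look like the sketch below. This is
a standalone illustration, not the next version of the patch: the
function toy_should_ack() and its parameters are hypothetical stand-ins
for the state that ghes_proc() would actually consult.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical decision helper for the ack discussed in this thread.
 *   is_ghes_v2: the source is a GHESv2 HEST entry (has an ack register)
 *   is_polled:  the source is polled rather than notified
 *   read_rc:    return value of the error status read
 */
static bool toy_should_ack(bool is_ghes_v2, bool is_polled, int read_rc)
{
	if (!is_ghes_v2)
		return false;	/* no ack mechanism before GHESv2 */
	if (is_polled && read_rc == -ENOENT)
		return false;	/* nothing was reported, so nothing to ack */
	return true;		/* ack on success and on other read failures */
}
```

This keeps the ack for real read failures, so firmware can keep
reporting, but avoids acking on every empty poll, which is the
interleaving James describes.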
>
>
>> If we do not send the ack, then we will be in a scenario where FW will not send
>> any more errors.
> Because we haven't yet handled the first one...
>
> I thought GHESv2's ack was also used to catch errors that occur while an earlier
> error is being handled. But from the text in ACPI 6.2's 18.3.2.8 the 'ack' is
> only described as releasing the memory region, not completion of the error handler.
Once FW notifies the OS of the first error, it shouldn't touch the
memory region until it receives the ack. That's why FW won't send any
more errors if we don't send the ack.
Thanks,
Tyler
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
* Re: [PATCH] acpi: apei: clear error status before acknowledging the error
2017-07-31 17:44 ` Baicar, Tyler
@ 2017-08-03 22:06 ` Baicar, Tyler
0 siblings, 0 replies; 8+ messages in thread
From: Baicar, Tyler @ 2017-08-03 22:06 UTC (permalink / raw)
To: Luck, Tony
Cc: Borislav Petkov, rjw, lenb, will.deacon, james.morse, shiju.jose,
geliangtang, andriy.shevchenko, linux-acpi, linux-kernel
On 7/31/2017 11:44 AM, Baicar, Tyler wrote:
> On 7/31/2017 11:00 AM, Luck, Tony wrote:
>> On Mon, Jul 31, 2017 at 10:15:27AM -0600, Baicar, Tyler wrote:
>>> I think the better thing to do in this case is still send the ack. If
>>> ghes_read_estatus() fails, then
>>> either we are unable to read the estatus or the estatus is
>>> empty/invalid.
>> Right now we silently handle that failure of ghes_read_estatus(). That
>> might be hiding some Linux bugs if we are calling ghes_proc() in cases
>> where we shouldn't.
>>
>> Perhaps we should have something like this, so if systems do start
>> acting
>> weirdly there will be a note that we took this path:
>>
>> rc = ghes_read_estatus(ghes, 0);
>> if (rc) {
>> pr_notice("surprise failure reading ghes estatus\n");
>> goto out;
>> }
> Thank you Tony for the feedback, I can add a print like this in the
> next version. I'll verify that
> rc is not -ENOENT though so we don't print it on empty scenarios since
> the polled source
> will be hitting this path frequently.
>
Hi Tony,
I think I'm going to avoid adding this print. The failures are already
reported by prints in ghes_read_estatus(), so it looks a little redundant:
[ 133.601165] [Firmware Warn]: GHES: Failed to read error status block!
[ 133.601167] surprise failure reading GHES estatus
Thanks,
Tyler
>>
>>> If we do not send the ack, then we will be in a scenario where FW
>>> will not
>>> send any more errors.
>> We might ACK something that the firmware didn't send, which may
>> lead to other problems.
>>
>>> I think it would be better to still have the FW send the errors and
>>> kernel
>>> complain about issues with
>> But I agree with this. We should send the ACK. Luckily this doesn't
>> have
>> a long legacy problem because the whole ACK mechanism is a new thing. So
>> we only have to worry about GHESv2 supporting BIOS.
>>
>> -Tony
>
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
Thread overview: 8+ messages
2017-07-28 22:25 [PATCH] acpi: apei: clear error status before acknowledging the error Tyler Baicar
2017-07-29 6:53 ` Borislav Petkov
2017-07-31 16:15 ` Baicar, Tyler
2017-07-31 17:00 ` Luck, Tony
2017-07-31 17:44 ` Baicar, Tyler
2017-08-03 22:06 ` Baicar, Tyler
2017-07-31 17:11 ` James Morse
2017-07-31 17:57 ` Baicar, Tyler