From: tbaicar@codeaurora.org (Baicar, Tyler)
Date: Mon, 13 Feb 2017 15:45:44 -0700
Subject: [PATCH V8 06/10] acpi: apei: panic OS with fatal error status block
In-Reply-To: <589C490A.9080109@arm.com>
References: <1485969413-23577-1-git-send-email-tbaicar@codeaurora.org>
 <1485969413-23577-7-git-send-email-tbaicar@codeaurora.org>
 <589C490A.9080109@arm.com>
Message-ID: <5b06372d-e389-5157-ccb4-a7b023990d4d@codeaurora.org>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

Hello James,

On 2/9/2017 3:48 AM, James Morse wrote:
> Hi Jonathan, Tyler,
>
> On 01/02/17 17:16, Tyler Baicar wrote:
>> From: "Jonathan (Zhixiong) Zhang"
>>
>> Even if an error status block's severity is fatal, the kernel does not
>> honor the severity level and does not panic.
>>
>> With the firmware first model, the platform could inform the OS about a
>> fatal hardware error through the non-NMI GHES notification type. The OS
>> should panic when a hardware error record is received with this
>> severity.
>>
>> Call panic() after the CPER data in the error status block is printed,
>> if the severity is fatal, before each error section is handled.
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 8756172..86c1f15 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -687,6 +689,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
>>  	return rc;
>>  }
>>
>> +static void __ghes_call_panic(void)
>> +{
>> +	if (panic_timeout == 0)
>> +		panic_timeout = ghes_panic_timeout;
>> +	panic("Fatal hardware error!");
>> +}
>> +
> __ghes_panic() also has:
>> 	__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
> Which prints this estatus regardless of rate limiting and caching.
>
> [ ... ]
>
>> @@ -698,6 +707,10 @@ static int ghes_proc(struct ghes *ghes)
>>  		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
> ghes_print_estatus() uses some custom rate limiting, '2 messages every 5
> seconds', and GHES_SEV_PANIC shares the same limit as GHES_SEV_RECOVERABLE.
>
> I think it's possible to get 2 recoverable messages, then a panic, in a
> 5 second window. The rate limit will kick in to stop the panic estatus
> block being printed, but we still go on to call panic() without the real
> reason being printed...
>
> (The caching only seems to consider identical messages; given we would
> never see two panic messages, I don't think that will cause any problems.)
>
>>  			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>>  	}
>> +	if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
>> +		__ghes_call_panic();
>> +	}
>> +
> I think this ghes_severity() then panic() should go above the:
>>  	if (!ghes_estatus_cached(ghes->estatus)) {
> and we should call __ghes_print_estatus() here too, to make sure the
> message definitely got out!

Okay, that makes sense. If we move this up, is there a problem with
calling __ghes_panic() instead of making the __ghes_print_estatus() and
__ghes_call_panic() calls here? It looks like that will just add a call
to oops_begin() and ghes_print_queued_estatus() as well, but this is
what ghes_notify_nmi() does if the severity is panic.

Thanks,
Tyler

> With that,
> Reviewed-by: James Morse
>
>
> Thanks,
>
> James

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc. Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
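
For illustration, here is a minimal sketch of the reordering James
suggests: the fatal-severity check hoisted above the cached,
rate-limited printing path in ghes_proc(). This is only one possible
shape for the next revision, not the merged code; the helpers
(__ghes_panic(), ghes_read_estatus(), ghes_estatus_cached(),
ghes_do_proc(), ghes_clear_estatus()) are the ones visible in the
quoted patch context, and GHES v2 ack handling is omitted for brevity.

/*
 * Sketch only (not the merged patch): ghes_proc() with the severity
 * check moved above the rate-limited printing path.
 */
static int ghes_proc(struct ghes *ghes)
{
	int rc;

	rc = ghes_read_estatus(ghes, 0);
	if (rc)
		goto out;

	/*
	 * Panic before the rate-limited path below: __ghes_panic()
	 * calls __ghes_print_estatus() directly, so the fatal record
	 * reaches the console even if two recoverable records already
	 * consumed the '2 messages every 5 seconds' budget.
	 */
	if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC)
		__ghes_panic(ghes);

	if (!ghes_estatus_cached(ghes->estatus)) {
		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
	}
	ghes_do_proc(ghes, ghes->estatus);

out:
	ghes_clear_estatus(ghes);
	return rc;
}

With this ordering, a separate __ghes_call_panic() would likely be
redundant, which is the direction Tyler's question points in:
__ghes_panic() already prints the estatus and sets panic_timeout
before calling panic().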