* [PATCH 0/6] Fix issues with ARM Processor CPER records
@ 2024-07-08 11:18 Mauro Carvalho Chehab
2024-07-08 11:18 ` [PATCH 6/6] docs: efi: add CPER functions to driver-api Mauro Carvalho Chehab
0 siblings, 1 reply; 3+ messages in thread
From: Mauro Carvalho Chehab @ 2024-07-08 11:18 UTC (permalink / raw)
To: Borislav Petkov, Tony Luck
Cc: Mauro Carvalho Chehab, Ard Biesheuvel, James Morse,
Jonathan Cameron, Shiju Jose, linux-efi, linux-kernel, linux-edac,
Len Brown, linux-acpi, Rafael J. Wysocki, Jonathan Corbet,
linux-doc
This series replaces two previously-sent series:
- https://lore.kernel.org/linux-edac/cover.1719219886.git.mchehab+huawei@kernel.org/T/#t
- https://lore.kernel.org/linux-edac/cover.1719484498.git.mchehab+huawei@kernel.org/T/#t
It is also available at:
https://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac.git/log/?h=edac-arm64
This is needed for both kernelspace and userspace to properly handle ARM processor CPER
events.
Patches 1 and 2 of this series fix the UEFI 2.6+ implementation of the ARM
trace event, as the original implementation was incomplete.
Changeset e9279e83ad1f ("trace, ras: add ARM processor error trace event")
added such an event, but it reports only some fields of the CPER record
defined in UEFI 2.6+ appendix N, table N.16. Those are not enough to
actually parse such events in userspace, as not even the event type
is exported.
Patch 3 fixes a compilation breakage when W=1;
Patch 4 adds a new helper function to be used by cper and ghes drivers to
display CPER bitmaps;
Patch 5 fixes the CPER logic according to the UEFI 2.9A errata. Before it, there
was no description of how the processor type field was encoded. The errata
defines it as a bitmask and specifies how it should
be encoded.
Patch 6 adds CPER functions to Kernel-doc.
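To illustrate the bitmask encoding that patches 4 and 5 deal with, here is a minimal userspace sketch of turning an error type bitmask into a "hex: name|name" string. It is not the kernel implementation: the helper name err_type_to_str() and the table arm_err_type_strs[] are hypothetical, and the bit-to-name mapping is only inferred from the log output shown in this cover letter (0x02 cache error, 0x04 TLB error, 0x08 bus error, 0x10 micro-architectural error).

```c
#include <stdio.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Hypothetical table, indexed by bit position. The names are inferred
 * from the log in this cover letter; bit 0 is left unnamed (assumption).
 */
static const char * const arm_err_type_strs[] = {
	NULL,
	"cache error",
	"TLB error",
	"bus error",
	"micro-architectural error",
};

/*
 * Hypothetical helper: write the bitmask in hexadecimal to buf, then
 * append the name of each set bit, separated by '|'. Output is
 * truncated if buf is too small. Returns buf.
 */
static char *err_type_to_str(char *buf, size_t size, unsigned int bits)
{
	int len = snprintf(buf, size, "0x%02x: ", bits);
	int first = 1;
	size_t i;

	for (i = 0; i < ARRAY_SIZE(arm_err_type_strs); i++) {
		/* Skip bits that are clear or that have no name. */
		if (!(bits & (1u << i)) || !arm_err_type_strs[i])
			continue;
		/* Stop once the buffer is full. */
		if (len < 0 || (size_t)len >= size)
			break;
		len += snprintf(buf + len, size - (size_t)len, "%s%s",
				first ? "" : "|", arm_err_type_strs[i]);
		first = 0;
	}
	return buf;
}
```

With a sufficiently large buffer, err_type_to_str(buf, sizeof(buf), 0x1c) produces "0x1c: TLB error|bus error|micro-architectural error", which is the same shape as the error_type lines in the log further down.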
This series was validated with the help of ARM EINJ code for QEMU:
https://github.com/mchehab/rasdaemon/wiki/error-injection
Using the QEMU injection code at:
https://gitlab.com/mchehab_kernel/qemu/-/commits/arm-error-inject-v2?ref_type=heads
Running it on QEMU and sending these commands via the QEMU QMP interface:
{ "execute": "qmp_capabilities" }
{ "execute": "arm-inject-error", "arguments": {
"validation": ["mpidr-valid", "affinity-valid", "running-state-valid", "vendor-specific-valid"],
"running-state": [], "psci-state": 1229279264, "error": [
{"type": ["tlb-error", "bus-error", "micro-arch-error"], "multiple-error": 2},
{"type": ["micro-arch-error"]},
{"type": ["tlb-error"]},
{"type": ["bus-error"]},
{"type": ["cache-error"]}]} }
The CPER event is now properly handled:
[ 53.223383] {1}[Hardware Error]: event severity: recoverable
[ 53.223690] {1}[Hardware Error]: Error 0, type: recoverable
[ 53.224073] {1}[Hardware Error]: section_type: ARM processor error
[ 53.224419] {1}[Hardware Error]: MIDR: 0x0000000000000000
[ 53.224694] {1}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000080000000
[ 53.225029] {1}[Hardware Error]: error affinity level: 2
[ 53.225266] {1}[Hardware Error]: running state: 0x0
[ 53.225516] {1}[Hardware Error]: Power State Coordination Interface state: 1229279264
[ 53.225857] {1}[Hardware Error]: Error info structure 0:
[ 53.226094] {1}[Hardware Error]: num errors: 3
[ 53.226317] {1}[Hardware Error]: first error captured
[ 53.226548] {1}[Hardware Error]: propagated error captured
[ 53.226806] {1}[Hardware Error]: overflow occurred, error info is incomplete
[ 53.227180] {1}[Hardware Error]: error_type: 0x1c: TLB error|bus error|micro-architectural error
[ 53.227549] {1}[Hardware Error]: error_info: 0x000000000054007f
[ 53.227819] {1}[Hardware Error]: virtual fault address: 0x0000000067320230
[ 53.228106] {1}[Hardware Error]: physical fault address: 0x000000005cdfd492
[ 53.228403] {1}[Hardware Error]: Error info structure 1:
[ 53.228636] {1}[Hardware Error]: num errors: 3
[ 53.228840] {1}[Hardware Error]: first error captured
[ 53.229061] {1}[Hardware Error]: propagated error captured
[ 53.229296] {1}[Hardware Error]: overflow occurred, error info is incomplete
[ 53.229577] {1}[Hardware Error]: error_type: 0x10: micro-architectural error
[ 53.229873] {1}[Hardware Error]: error_info: 0x0000000078da03ff
[ 53.230130] {1}[Hardware Error]: virtual fault address: 0x0000000067320230
[ 53.230412] {1}[Hardware Error]: physical fault address: 0x000000005cdfd492
[ 53.230694] {1}[Hardware Error]: Error info structure 2:
[ 53.230924] {1}[Hardware Error]: num errors: 3
[ 53.231128] {1}[Hardware Error]: first error captured
[ 53.231349] {1}[Hardware Error]: propagated error captured
[ 53.231582] {1}[Hardware Error]: overflow occurred, error info is incomplete
[ 53.231863] {1}[Hardware Error]: error_type: 0x04: TLB error
[ 53.232116] {1}[Hardware Error]: error_info: 0x000000000054007f
[ 53.232396] {1}[Hardware Error]: transaction type: Instruction
[ 53.232686] {1}[Hardware Error]: TLB error, operation type: Instruction fetch
[ 53.232998] {1}[Hardware Error]: TLB level: 1
[ 53.233215] {1}[Hardware Error]: processor context not corrupted
[ 53.233479] {1}[Hardware Error]: the error has not been corrected
[ 53.233740] {1}[Hardware Error]: PC is imprecise
[ 53.233974] {1}[Hardware Error]: virtual fault address: 0x0000000067320230
[ 53.234264] {1}[Hardware Error]: physical fault address: 0x000000005cdfd492
[ 53.234547] {1}[Hardware Error]: Error info structure 3:
[ 53.234776] {1}[Hardware Error]: num errors: 3
[ 53.234980] {1}[Hardware Error]: first error captured
[ 53.235199] {1}[Hardware Error]: propagated error captured
[ 53.235433] {1}[Hardware Error]: overflow occurred, error info is incomplete
[ 53.235714] {1}[Hardware Error]: error_type: 0x08: bus error
[ 53.235966] {1}[Hardware Error]: error_info: 0x00000080d6460fff
[ 53.236223] {1}[Hardware Error]: transaction type: Generic
[ 53.236478] {1}[Hardware Error]: bus error, operation type: Generic read (type of instruction or data request cannot be determined)
[ 53.236923] {1}[Hardware Error]: affinity level at which the bus error occurred: 1
[ 53.237234] {1}[Hardware Error]: processor context corrupted
[ 53.237481] {1}[Hardware Error]: the error has been corrected
[ 53.237728] {1}[Hardware Error]: PC is imprecise
[ 53.237937] {1}[Hardware Error]: Program execution can be restarted reliably at the PC associated with the error.
[ 53.238329] {1}[Hardware Error]: participation type: Local processor observed
[ 53.238627] {1}[Hardware Error]: request timed out
[ 53.238851] {1}[Hardware Error]: address space: External Memory Access
[ 53.239129] {1}[Hardware Error]: memory access attributes:0x20
[ 53.239393] {1}[Hardware Error]: access mode: secure
[ 53.239613] {1}[Hardware Error]: virtual fault address: 0x0000000067320230
[ 53.239890] {1}[Hardware Error]: physical fault address: 0x000000005cdfd492
[ 53.240168] {1}[Hardware Error]: Error info structure 4:
[ 53.240396] {1}[Hardware Error]: num errors: 3
[ 53.240601] {1}[Hardware Error]: first error captured
[ 53.240816] {1}[Hardware Error]: propagated error captured
[ 53.241048] {1}[Hardware Error]: overflow occurred, error info is incomplete
[ 53.241332] {1}[Hardware Error]: error_type: 0x02: cache error
[ 53.241589] {1}[Hardware Error]: error_info: 0x000000000091000f
[ 53.241843] {1}[Hardware Error]: transaction type: Data Access
[ 53.242101] {1}[Hardware Error]: cache error, operation type: Data write
[ 53.242385] {1}[Hardware Error]: cache level: 2
[ 53.242596] {1}[Hardware Error]: processor context not corrupted
[ 53.242847] {1}[Hardware Error]: virtual fault address: 0x0000000067320230
[ 53.243125] {1}[Hardware Error]: physical fault address: 0x000000005cdfd492
[ 53.243426] {1}[Hardware Error]: Context info structure 0:
[ 53.243675] {1}[Hardware Error]: register context type: AArch64 EL1 context registers
[ 53.244185] {1}[Hardware Error]: 00000000: 12abde67 00000000 00000000 00000000
[ 53.244540] {1}[Hardware Error]: 00000010: 00000000 00000000 00000000 00000000
[ 53.244864] {1}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
[ 53.245183] {1}[Hardware Error]: 00000030: 00000000 00000000 00000000 00000000
[ 53.245504] {1}[Hardware Error]: 00000040: 00000000 00000000 00000000 00000000
[ 53.245828] {1}[Hardware Error]: 00000050: 00000000 00000000 00000000 00000000
[ 53.246149] {1}[Hardware Error]: 00000060: 00000000 00000000 00000000 00000000
[ 53.246475] {1}[Hardware Error]: 00000070: 00000000 00000000 00000000 00000000
[ 53.246799] {1}[Hardware Error]: 00000080: 00000000 00000000 00000000 00000000
[ 53.247122] {1}[Hardware Error]: 00000090: 00000000 00000000 00000000 00000000
[ 53.247446] {1}[Hardware Error]: 000000a0: 00000000 00000000 00000000 00000000
[ 53.247767] {1}[Hardware Error]: 000000b0: 00000000 00000000 00000000 00000000
[ 53.248090] {1}[Hardware Error]: 000000c0: 00000000 00000000 00000000 00000000
[ 53.248415] {1}[Hardware Error]: 000000d0: 00000000 00000000 00000000 00000000
[ 53.248739] {1}[Hardware Error]: 000000e0: 00000000 00000000 00000000 00000000
[ 53.249064] {1}[Hardware Error]: 000000f0: 00000000 00000000 00000000 00000000
[ 53.249398] {1}[Hardware Error]: 00000100: 00000000 00000000 00000000 00000000
[ 53.249727] {1}[Hardware Error]: 00000110: 00000000 00000000 00000000 00000000
[ 53.250053] {1}[Hardware Error]: 00000120: 00000000 00000000 00000000 00000000
[ 53.250377] {1}[Hardware Error]: 00000130: 00000000 00000000 00000000 00000000
[ 53.250700] {1}[Hardware Error]: 00000140: 00000000 00000000 00000000 00000000
[ 53.251038] {1}[Hardware Error]: 00000150: 00000000 00000000 00000000 00000000
[ 53.251368] {1}[Hardware Error]: 00000160: 00000000 00000000 00000000 00000000
[ 53.251694] {1}[Hardware Error]: 00000170: 00000000 00000000 00000000 00000000
[ 53.252017] {1}[Hardware Error]: 00000180: 00000000 00000000 00000000 00000000
[ 53.252342] {1}[Hardware Error]: 00000190: 00000000 00000000 00000000 00000000
[ 53.252664] {1}[Hardware Error]: 000001a0: 00000000 00000000 00000000 00000000
[ 53.252984] {1}[Hardware Error]: 000001b0: 00000000 00000000 00000000 00000000
[ 53.253309] {1}[Hardware Error]: 000001c0: 00000000 00000000 00000000 00000000
[ 53.253630] {1}[Hardware Error]: 000001d0: 00000000 00000000 00000000 00000000
[ 53.253949] {1}[Hardware Error]: 000001e0: 00000000 00000000 00000000 00000000
[ 53.254273] {1}[Hardware Error]: 000001f0: 00000000 00000000 00000000 00000000
[ 53.254595] {1}[Hardware Error]: 00000200: 00000000 00000000 00000000 00000000
[ 53.254917] {1}[Hardware Error]: 00000210: 00000000 00000000 00000000 00000000
[ 53.255245] {1}[Hardware Error]: 00000220: 00000000 00000000 00000000 00000000
[ 53.255569] {1}[Hardware Error]: 00000230: 00000000 00000000 00000000 00000000
[ 53.255890] {1}[Hardware Error]: 00000240: 00000000 00000000 00000000 00000000
[ 53.256794] [Firmware Warn]: GHES: Unhandled processor error type 0x1c: TLB error|bus error|micro-architectural error
[ 53.257203] [Firmware Warn]: GHES: Unhandled processor error type 0x10: micro-architectural error
[ 53.257543] [Firmware Warn]: GHES: Unhandled processor error type 0x04: TLB error
[ 53.257842] [Firmware Warn]: GHES: Unhandled processor error type 0x08: bus error
-
I also tested the ghes and cper reports both with and without this
change, using different versions of rasdaemon, with and without
support for the extended trace event. This is a summary of the
test results:
- adding more fields to the trace events didn't break the userspace API:
both versions of rasdaemon handled it;
- the rasdaemon patches to handle the new trace report were missing
backward-compatibility logic. I have already fixed that, so rasdaemon
can now handle both old and new trace events.
Btw, rasdaemon has supported the extended trace since
version 0.5.8 (released in 2021). I didn't see any issues there
complaining about trouble with it, so either distros used on ARM servers
are using an old version of rasdaemon, or they're carrying the trace
event changes as well.
Daniel Ferguson (1):
RAS: ACPI: APEI: add conditional compilation to ARM error report
functions
Mauro Carvalho Chehab (4):
efi/cper: Adjust infopfx size to accept an extra space
efi/cper: Add a new helper function to print bitmasks
efi/cper: align ARM CPER type with UEFI 2.9A/2.10 specs
docs: efi: add CPER functions to driver-api
Shengwei Luo (1):
RAS: Report all ARM processor CPER information to userspace
.../driver-api/firmware/efi/index.rst | 11 ++--
drivers/acpi/apei/ghes.c | 31 +++++------
drivers/firmware/efi/cper-arm.c | 52 +++++++++----------
drivers/firmware/efi/cper.c | 43 ++++++++++++++-
drivers/ras/ras.c | 47 ++++++++++++++++-
include/linux/cper.h | 12 +++--
include/linux/ras.h | 16 ++++--
include/ras/ras_event.h | 48 +++++++++++++++--
8 files changed, 198 insertions(+), 62 deletions(-)
--
2.45.2
* [PATCH 6/6] docs: efi: add CPER functions to driver-api
2024-07-08 11:18 [PATCH 0/6] Fix issues with ARM Processor CPER records Mauro Carvalho Chehab
@ 2024-07-08 11:18 ` Mauro Carvalho Chehab
2024-07-08 15:47 ` Jonathan Cameron
0 siblings, 1 reply; 3+ messages in thread
From: Mauro Carvalho Chehab @ 2024-07-08 11:18 UTC (permalink / raw)
To: Borislav Petkov, Tony Luck
Cc: Mauro Carvalho Chehab, Ard Biesheuvel, James Morse,
Jonathan Cameron, Len Brown, Rafael J. Wysocki, Shiju Jose,
Jonathan Corbet, linux-acpi, linux-doc, linux-edac, linux-efi,
linux-kernel
There are two functions at cper that are used by other parts of cper
and by the ghes driver. They both have kernel-doc-like
descriptions.
Change the tags for them to be actual kernel-doc tags and add them
to the driver-api documentation in the UEFI section.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
Documentation/driver-api/firmware/efi/index.rst | 11 ++++++++---
drivers/firmware/efi/cper.c | 10 ++++------
2 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/Documentation/driver-api/firmware/efi/index.rst b/Documentation/driver-api/firmware/efi/index.rst
index 4fe8abba9fc6..5a6b6229592c 100644
--- a/Documentation/driver-api/firmware/efi/index.rst
+++ b/Documentation/driver-api/firmware/efi/index.rst
@@ -1,11 +1,16 @@
.. SPDX-License-Identifier: GPL-2.0
-============
-UEFI Support
-============
+====================================================
+Unified Extensible Firmware Interface (UEFI) Support
+====================================================
UEFI stub library functions
===========================
.. kernel-doc:: drivers/firmware/efi/libstub/mem.c
:internal:
+
+UEFI Common Platform Error Record (CPER) functions
+==================================================
+
+.. kernel-doc:: drivers/firmware/efi/cper.c
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index f8c8a15cd527..2785c8ea8ad8 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -69,7 +69,7 @@ const char *cper_severity_str(unsigned int severity)
}
EXPORT_SYMBOL_GPL(cper_severity_str);
-/*
+/**
* cper_print_bits - print strings for set bits
* @pfx: prefix for each line, including log level and prefix string
* @bits: bit mask
@@ -106,18 +106,16 @@ void cper_print_bits(const char *pfx, unsigned int bits,
printk("%s\n", buf);
}
-/*
+/**
* cper_bits_to_str - return a string for set bits
* @buf: buffer to store the output string
* @buf_size: size of the output string buffer
* @bits: bit mask
* @strs: string array, indexed by bit position
* @strs_size: size of the string array: @strs
- * @mask: a continuous bitmask used to detect the first valid bit of the
- * bitmap.
*
- * Add to @buf the bitmask in hexadecimal. Then, for each set bit in @bits
- * mask, add the corresponding string describing the bit in @strs to @buf.
+ * Add to @buf the bitmask in hexadecimal. Then, for each set bit in @bits,
+ * add the corresponding string describing the bit in @strs to @buf.
*/
char *cper_bits_to_str(char *buf, int buf_size, unsigned long bits,
const char * const strs[], unsigned int strs_size)
--
2.45.2
* Re: [PATCH 6/6] docs: efi: add CPER functions to driver-api
2024-07-08 11:18 ` [PATCH 6/6] docs: efi: add CPER functions to driver-api Mauro Carvalho Chehab
@ 2024-07-08 15:47 ` Jonathan Cameron
0 siblings, 0 replies; 3+ messages in thread
From: Jonathan Cameron @ 2024-07-08 15:47 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Borislav Petkov, Tony Luck, Ard Biesheuvel, James Morse,
Len Brown, Rafael J. Wysocki, Shiju Jose, Jonathan Corbet,
linux-acpi, linux-doc, linux-edac, linux-efi, linux-kernel
On Mon, 8 Jul 2024 13:18:15 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> There are two functions at cper that are used by other parts of cper
> and by the ghes driver. They both have kernel-doc-like
> descriptions.
>
> Change the tags for them to be actual kernel-doc tags and add them
> to the driver-api documentation in the UEFI section.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Other than the blob at the end, which belongs in an earlier patch, LGTM.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
> Documentation/driver-api/firmware/efi/index.rst | 11 ++++++++---
> drivers/firmware/efi/cper.c | 10 ++++------
> 2 files changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/driver-api/firmware/efi/index.rst b/Documentation/driver-api/firmware/efi/index.rst
> index 4fe8abba9fc6..5a6b6229592c 100644
> --- a/Documentation/driver-api/firmware/efi/index.rst
> +++ b/Documentation/driver-api/firmware/efi/index.rst
> @@ -1,11 +1,16 @@
> .. SPDX-License-Identifier: GPL-2.0
>
> -============
> -UEFI Support
> -============
> +====================================================
> +Unified Extensible Firmware Interface (UEFI) Support
> +====================================================
>
> UEFI stub library functions
> ===========================
>
> .. kernel-doc:: drivers/firmware/efi/libstub/mem.c
> :internal:
> +
> +UEFI Common Platform Error Record (CPER) functions
> +==================================================
> +
> +.. kernel-doc:: drivers/firmware/efi/cper.c
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index f8c8a15cd527..2785c8ea8ad8 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -69,7 +69,7 @@ const char *cper_severity_str(unsigned int severity)
> }
> EXPORT_SYMBOL_GPL(cper_severity_str);
>
> -/*
> +/**
> * cper_print_bits - print strings for set bits
> * @pfx: prefix for each line, including log level and prefix string
> * @bits: bit mask
> @@ -106,18 +106,16 @@ void cper_print_bits(const char *pfx, unsigned int bits,
> printk("%s\n", buf);
> }
>
> -/*
> +/**
> * cper_bits_to_str - return a string for set bits
> * @buf: buffer to store the output string
> * @buf_size: size of the output string buffer
> * @bits: bit mask
> * @strs: string array, indexed by bit position
> * @strs_size: size of the string array: @strs
> - * @mask: a continuous bitmask used to detect the first valid bit of the
> - * bitmap.
> *
> - * Add to @buf the bitmask in hexadecimal. Then, for each set bit in @bits
> - * mask, add the corresponding string describing the bit in @strs to @buf.
> + * Add to @buf the bitmask in hexadecimal. Then, for each set bit in @bits,
> + * add the corresponding string describing the bit in @strs to @buf.
This is in the wrong patch. No point in introducing wrong docs to fix later.
> */
> char *cper_bits_to_str(char *buf, int buf_size, unsigned long bits,
> const char * const strs[], unsigned int strs_size)