linux-doc.vger.kernel.org archive mirror
* [PATCH 0/6] Fix issues with ARM Processor CPER records
@ 2024-07-08 11:18 Mauro Carvalho Chehab
  2024-07-08 11:18 ` [PATCH 6/6] docs: efi: add CPER functions to driver-api Mauro Carvalho Chehab
  0 siblings, 1 reply; 3+ messages in thread
From: Mauro Carvalho Chehab @ 2024-07-08 11:18 UTC
  To: Borislav Petkov, Tony Luck
  Cc: Mauro Carvalho Chehab, Ard Biesheuvel, James Morse,
	Jonathan Cameron, Shiju Jose, linux-efi, linux-kernel, linux-edac,
	Len Brown, linux-acpi, Rafael J. Wysocki, Jonathan Corbet,
	linux-doc

This series replaces two previously-sent series:
- https://lore.kernel.org/linux-edac/cover.1719219886.git.mchehab+huawei@kernel.org/T/#t
- https://lore.kernel.org/linux-edac/cover.1719484498.git.mchehab+huawei@kernel.org/T/#t

It is also available at:

	https://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac.git/log/?h=edac-arm64

This is needed for both kernelspace and userspace to properly handle ARM
processor CPER events.

Patches 1 and 2 of this series fix the UEFI 2.6+ implementation of the ARM
trace event, as the original implementation was incomplete.
Changeset e9279e83ad1f ("trace, ras: add ARM processor error trace event")
added this event, but it reports only some fields of the CPER record
defined in UEFI 2.6+ appendix N, table N.16. Those are not enough to
actually parse such events in userspace, as not even the event type
is exported.

Patch 3 fixes a compilation breakage when building with W=1;

Patch 4 adds a new helper function to be used by cper and ghes drivers to
display CPER bitmaps;

Patch 5 fixes the CPER logic according to the UEFI 2.9A errata. Before it,
there was no description of how the processor type field was encoded. The
errata define it as a bitmask and describe how it should be encoded.
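
As an illustration of the bitmask encoding, here is a minimal userspace
sketch that turns a type value into the "0x1c: TLB error|bus error|..."
form seen in the log later in this message. The bit-to-name table is
inferred from that log output only; treating bit 0 as unused is an
assumption, and err_type_to_str() is a hypothetical helper, not the
kernel's cper_bits_to_str().

```c
#include <stdio.h>
#include <string.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Names inferred from the log in this cover letter (assumption). */
static const char * const arm_err_type_strs[] = {
	NULL,				/* bit 0: assumed unused */
	"cache error",			/* bit 1 (0x02) */
	"TLB error",			/* bit 2 (0x04) */
	"bus error",			/* bit 3 (0x08) */
	"micro-architectural error",	/* bit 4 (0x10) */
};

/*
 * Hypothetical helper: write the mask in hex, then the name of every
 * set bit, separated by '|'. Output is truncated if @buf is too small.
 */
static char *err_type_to_str(char *buf, size_t buf_size, unsigned int bits)
{
	unsigned int i;
	size_t len;
	int first = 1;

	if (!buf_size)
		return buf;

	snprintf(buf, buf_size, "0x%02x: ", bits);

	for (i = 0; i < ARRAY_LEN(arm_err_type_strs); i++) {
		if (!(bits & (1U << i)) || !arm_err_type_strs[i])
			continue;

		len = strlen(buf);
		if (len + 1 >= buf_size)
			break;		/* no room left for more names */

		snprintf(buf + len, buf_size - len, "%s%s",
			 first ? "" : "|", arm_err_type_strs[i]);
		first = 0;
	}

	return buf;
}
```

Calling err_type_to_str() with 0x1c on a large enough buffer yields the
same text as the first "error_type" line of the log.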

Patch 6 adds the CPER functions to the kernel-doc documentation.

This series was validated with the help of ARM EINJ code for QEMU:

	https://github.com/mchehab/rasdaemon/wiki/error-injection

The QEMU injection code is at:

   https://gitlab.com/mchehab_kernel/qemu/-/commits/arm-error-inject-v2?ref_type=heads

It was run on QEMU, and these commands were sent via the QEMU QMP interface:

    { "execute": "qmp_capabilities" } 
    { "execute": "arm-inject-error", "arguments": {
	"validation": ["mpidr-valid", "affinity-valid", "running-state-valid", "vendor-specific-valid"],
	"running-state": [], "psci-state": 1229279264, "error": [
	{"type": ["tlb-error", "bus-error", "micro-arch-error"], "multiple-error": 2}, 
	{"type": ["micro-arch-error"]},
	{"type": ["tlb-error"]}, 
	{"type": ["bus-error"]}, 
	{"type": ["cache-error"]}]} }

The CPER event is now properly handled:

[   53.223383] {1}[Hardware Error]: event severity: recoverable
[   53.223690] {1}[Hardware Error]:  Error 0, type: recoverable
[   53.224073] {1}[Hardware Error]:   section_type: ARM processor error
[   53.224419] {1}[Hardware Error]:   MIDR: 0x0000000000000000
[   53.224694] {1}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000080000000
[   53.225029] {1}[Hardware Error]:   error affinity level: 2
[   53.225266] {1}[Hardware Error]:   running state: 0x0
[   53.225516] {1}[Hardware Error]:   Power State Coordination Interface state: 1229279264
[   53.225857] {1}[Hardware Error]:   Error info structure 0:
[   53.226094] {1}[Hardware Error]:   num errors: 3
[   53.226317] {1}[Hardware Error]:    first error captured
[   53.226548] {1}[Hardware Error]:    propagated error captured
[   53.226806] {1}[Hardware Error]:    overflow occurred, error info is incomplete
[   53.227180] {1}[Hardware Error]:    error_type: 0x1c: TLB error|bus error|micro-architectural error
[   53.227549] {1}[Hardware Error]:    error_info: 0x000000000054007f
[   53.227819] {1}[Hardware Error]:    virtual fault address: 0x0000000067320230
[   53.228106] {1}[Hardware Error]:    physical fault address: 0x000000005cdfd492
[   53.228403] {1}[Hardware Error]:   Error info structure 1:
[   53.228636] {1}[Hardware Error]:   num errors: 3
[   53.228840] {1}[Hardware Error]:    first error captured
[   53.229061] {1}[Hardware Error]:    propagated error captured
[   53.229296] {1}[Hardware Error]:    overflow occurred, error info is incomplete
[   53.229577] {1}[Hardware Error]:    error_type: 0x10: micro-architectural error
[   53.229873] {1}[Hardware Error]:    error_info: 0x0000000078da03ff
[   53.230130] {1}[Hardware Error]:    virtual fault address: 0x0000000067320230
[   53.230412] {1}[Hardware Error]:    physical fault address: 0x000000005cdfd492
[   53.230694] {1}[Hardware Error]:   Error info structure 2:
[   53.230924] {1}[Hardware Error]:   num errors: 3
[   53.231128] {1}[Hardware Error]:    first error captured
[   53.231349] {1}[Hardware Error]:    propagated error captured
[   53.231582] {1}[Hardware Error]:    overflow occurred, error info is incomplete
[   53.231863] {1}[Hardware Error]:    error_type: 0x04: TLB error
[   53.232116] {1}[Hardware Error]:    error_info: 0x000000000054007f
[   53.232396] {1}[Hardware Error]:     transaction type: Instruction
[   53.232686] {1}[Hardware Error]:     TLB error, operation type: Instruction fetch
[   53.232998] {1}[Hardware Error]:     TLB level: 1
[   53.233215] {1}[Hardware Error]:     processor context not corrupted
[   53.233479] {1}[Hardware Error]:     the error has not been corrected
[   53.233740] {1}[Hardware Error]:     PC is imprecise
[   53.233974] {1}[Hardware Error]:    virtual fault address: 0x0000000067320230
[   53.234264] {1}[Hardware Error]:    physical fault address: 0x000000005cdfd492
[   53.234547] {1}[Hardware Error]:   Error info structure 3:
[   53.234776] {1}[Hardware Error]:   num errors: 3
[   53.234980] {1}[Hardware Error]:    first error captured
[   53.235199] {1}[Hardware Error]:    propagated error captured
[   53.235433] {1}[Hardware Error]:    overflow occurred, error info is incomplete
[   53.235714] {1}[Hardware Error]:    error_type: 0x08: bus error
[   53.235966] {1}[Hardware Error]:    error_info: 0x00000080d6460fff
[   53.236223] {1}[Hardware Error]:     transaction type: Generic
[   53.236478] {1}[Hardware Error]:     bus error, operation type: Generic read (type of instruction or data request cannot be determined)
[   53.236923] {1}[Hardware Error]:     affinity level at which the bus error occurred: 1
[   53.237234] {1}[Hardware Error]:     processor context corrupted
[   53.237481] {1}[Hardware Error]:     the error has been corrected
[   53.237728] {1}[Hardware Error]:     PC is imprecise
[   53.237937] {1}[Hardware Error]:     Program execution can be restarted reliably at the PC associated with the error.
[   53.238329] {1}[Hardware Error]:     participation type: Local processor observed
[   53.238627] {1}[Hardware Error]:     request timed out
[   53.238851] {1}[Hardware Error]:     address space: External Memory Access
[   53.239129] {1}[Hardware Error]:     memory access attributes:0x20
[   53.239393] {1}[Hardware Error]:     access mode: secure
[   53.239613] {1}[Hardware Error]:    virtual fault address: 0x0000000067320230
[   53.239890] {1}[Hardware Error]:    physical fault address: 0x000000005cdfd492
[   53.240168] {1}[Hardware Error]:   Error info structure 4:
[   53.240396] {1}[Hardware Error]:   num errors: 3
[   53.240601] {1}[Hardware Error]:    first error captured
[   53.240816] {1}[Hardware Error]:    propagated error captured
[   53.241048] {1}[Hardware Error]:    overflow occurred, error info is incomplete
[   53.241332] {1}[Hardware Error]:    error_type: 0x02: cache error
[   53.241589] {1}[Hardware Error]:    error_info: 0x000000000091000f
[   53.241843] {1}[Hardware Error]:     transaction type: Data Access
[   53.242101] {1}[Hardware Error]:     cache error, operation type: Data write
[   53.242385] {1}[Hardware Error]:     cache level: 2
[   53.242596] {1}[Hardware Error]:     processor context not corrupted
[   53.242847] {1}[Hardware Error]:    virtual fault address: 0x0000000067320230
[   53.243125] {1}[Hardware Error]:    physical fault address: 0x000000005cdfd492
[   53.243426] {1}[Hardware Error]:   Context info structure 0:
[   53.243675] {1}[Hardware Error]:    register context type: AArch64 EL1 context registers
[   53.244185] {1}[Hardware Error]:    00000000: 12abde67 00000000 00000000 00000000
[   53.244540] {1}[Hardware Error]:    00000010: 00000000 00000000 00000000 00000000
[   53.244864] {1}[Hardware Error]:    00000020: 00000000 00000000 00000000 00000000
[   53.245183] {1}[Hardware Error]:    00000030: 00000000 00000000 00000000 00000000
[   53.245504] {1}[Hardware Error]:    00000040: 00000000 00000000 00000000 00000000
[   53.245828] {1}[Hardware Error]:    00000050: 00000000 00000000 00000000 00000000
[   53.246149] {1}[Hardware Error]:    00000060: 00000000 00000000 00000000 00000000
[   53.246475] {1}[Hardware Error]:    00000070: 00000000 00000000 00000000 00000000
[   53.246799] {1}[Hardware Error]:    00000080: 00000000 00000000 00000000 00000000
[   53.247122] {1}[Hardware Error]:    00000090: 00000000 00000000 00000000 00000000
[   53.247446] {1}[Hardware Error]:    000000a0: 00000000 00000000 00000000 00000000
[   53.247767] {1}[Hardware Error]:    000000b0: 00000000 00000000 00000000 00000000
[   53.248090] {1}[Hardware Error]:    000000c0: 00000000 00000000 00000000 00000000
[   53.248415] {1}[Hardware Error]:    000000d0: 00000000 00000000 00000000 00000000
[   53.248739] {1}[Hardware Error]:    000000e0: 00000000 00000000 00000000 00000000
[   53.249064] {1}[Hardware Error]:    000000f0: 00000000 00000000 00000000 00000000
[   53.249398] {1}[Hardware Error]:    00000100: 00000000 00000000 00000000 00000000
[   53.249727] {1}[Hardware Error]:    00000110: 00000000 00000000 00000000 00000000
[   53.250053] {1}[Hardware Error]:    00000120: 00000000 00000000 00000000 00000000
[   53.250377] {1}[Hardware Error]:    00000130: 00000000 00000000 00000000 00000000
[   53.250700] {1}[Hardware Error]:    00000140: 00000000 00000000 00000000 00000000
[   53.251038] {1}[Hardware Error]:    00000150: 00000000 00000000 00000000 00000000
[   53.251368] {1}[Hardware Error]:    00000160: 00000000 00000000 00000000 00000000
[   53.251694] {1}[Hardware Error]:    00000170: 00000000 00000000 00000000 00000000
[   53.252017] {1}[Hardware Error]:    00000180: 00000000 00000000 00000000 00000000
[   53.252342] {1}[Hardware Error]:    00000190: 00000000 00000000 00000000 00000000
[   53.252664] {1}[Hardware Error]:    000001a0: 00000000 00000000 00000000 00000000
[   53.252984] {1}[Hardware Error]:    000001b0: 00000000 00000000 00000000 00000000
[   53.253309] {1}[Hardware Error]:    000001c0: 00000000 00000000 00000000 00000000
[   53.253630] {1}[Hardware Error]:    000001d0: 00000000 00000000 00000000 00000000
[   53.253949] {1}[Hardware Error]:    000001e0: 00000000 00000000 00000000 00000000
[   53.254273] {1}[Hardware Error]:    000001f0: 00000000 00000000 00000000 00000000
[   53.254595] {1}[Hardware Error]:    00000200: 00000000 00000000 00000000 00000000
[   53.254917] {1}[Hardware Error]:    00000210: 00000000 00000000 00000000 00000000
[   53.255245] {1}[Hardware Error]:    00000220: 00000000 00000000 00000000 00000000
[   53.255569] {1}[Hardware Error]:    00000230: 00000000 00000000 00000000 00000000
[   53.255890] {1}[Hardware Error]:    00000240: 00000000 00000000 00000000 00000000
[   53.256794] [Firmware Warn]: GHES: Unhandled processor error type 0x1c: TLB error|bus error|micro-architectural error
[   53.257203] [Firmware Warn]: GHES: Unhandled processor error type 0x10: micro-architectural error
[   53.257543] [Firmware Warn]: GHES: Unhandled processor error type 0x04: TLB error
[   53.257842] [Firmware Warn]: GHES: Unhandled processor error type 0x08: bus error

- 

I also tested the ghes and cper reports both with and without this
change, using different versions of rasdaemon, with and without
support for the extended trace event. This is a summary of the
test results:

- adding more fields to the trace events didn't break the userspace API:
  both versions of rasdaemon handled it;

- the rasdaemon patches that handle the new trace report were missing
  backward-compatibility logic. I have already fixed that, so rasdaemon
  can now handle both old and new trace events.

Btw, rasdaemon has supported the extended trace event since
version 0.5.8 (released in 2021). I haven't seen any issues reported there
complaining about it, so either distros used on ARM servers
are using an old version of rasdaemon, or they're carrying the trace
event changes as well.

Daniel Ferguson (1):
  RAS: ACPI: APEI: add conditional compilation to ARM error report
    functions

Mauro Carvalho Chehab (4):
  efi/cper: Adjust infopfx size to accept an extra space
  efi/cper: Add a new helper function to print bitmasks
  efi/cper: align ARM CPER type with UEFI 2.9A/2.10 specs
  docs: efi: add CPER functions to driver-api

Shengwei Luo (1):
  RAS: Report all ARM processor CPER information to userspace

 .../driver-api/firmware/efi/index.rst         | 11 ++--
 drivers/acpi/apei/ghes.c                      | 31 +++++------
 drivers/firmware/efi/cper-arm.c               | 52 +++++++++----------
 drivers/firmware/efi/cper.c                   | 43 ++++++++++++++-
 drivers/ras/ras.c                             | 47 ++++++++++++++++-
 include/linux/cper.h                          | 12 +++--
 include/linux/ras.h                           | 16 ++++--
 include/ras/ras_event.h                       | 48 +++++++++++++++--
 8 files changed, 198 insertions(+), 62 deletions(-)

-- 
2.45.2




* [PATCH 6/6] docs: efi: add CPER functions to driver-api
  2024-07-08 11:18 [PATCH 0/6] Fix issues with ARM Processor CPER records Mauro Carvalho Chehab
@ 2024-07-08 11:18 ` Mauro Carvalho Chehab
  2024-07-08 15:47   ` Jonathan Cameron
  0 siblings, 1 reply; 3+ messages in thread
From: Mauro Carvalho Chehab @ 2024-07-08 11:18 UTC
  To: Borislav Petkov, Tony Luck
  Cc: Mauro Carvalho Chehab, Ard Biesheuvel, James Morse,
	Jonathan Cameron, Len Brown, Rafael J. Wysocki, Shiju Jose,
	Jonathan Corbet, linux-acpi, linux-doc, linux-edac, linux-efi,
	linux-kernel

There are two functions in cper that are used by other parts of cper
and by the ghes driver. Both have kernel-doc-like descriptions.

Change their comment tags to actual kernel-doc tags and add them
to the driver-api documentation in the UEFI section.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
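For reference, a minimal sketch of the comment layout that kernel-doc
picks up: only comments opened with "/**" are parsed, which is why the
tags are changed here. example_count_bits() is a hypothetical function
used purely for illustration; it is not part of cper.c.

```c
/**
 * example_count_bits - count the set bits in a mask
 * @bits: bit mask to inspect
 *
 * Hypothetical helper that only illustrates the layout: a one-line
 * summary, one "@name:" line per parameter, a blank comment line, and
 * then the longer description.
 *
 * Return: number of bits set in @bits.
 */
static unsigned int example_count_bits(unsigned long bits)
{
	unsigned int count = 0;

	while (bits) {
		bits &= bits - 1;	/* clear the lowest set bit */
		count++;
	}

	return count;
}
```

The same comment opened with a plain "/*" would be ignored by the
kernel-doc directive added to index.rst.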
 Documentation/driver-api/firmware/efi/index.rst | 11 ++++++++---
 drivers/firmware/efi/cper.c                     | 10 ++++------
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/Documentation/driver-api/firmware/efi/index.rst b/Documentation/driver-api/firmware/efi/index.rst
index 4fe8abba9fc6..5a6b6229592c 100644
--- a/Documentation/driver-api/firmware/efi/index.rst
+++ b/Documentation/driver-api/firmware/efi/index.rst
@@ -1,11 +1,16 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-============
-UEFI Support
-============
+====================================================
+Unified Extensible Firmware Interface (UEFI) Support
+====================================================
 
 UEFI stub library functions
 ===========================
 
 .. kernel-doc:: drivers/firmware/efi/libstub/mem.c
    :internal:
+
+UEFI Common Platform Error Record (CPER) functions
+==================================================
+
+.. kernel-doc:: drivers/firmware/efi/cper.c
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index f8c8a15cd527..2785c8ea8ad8 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -69,7 +69,7 @@ const char *cper_severity_str(unsigned int severity)
 }
 EXPORT_SYMBOL_GPL(cper_severity_str);
 
-/*
+/**
  * cper_print_bits - print strings for set bits
  * @pfx: prefix for each line, including log level and prefix string
  * @bits: bit mask
@@ -106,18 +106,16 @@ void cper_print_bits(const char *pfx, unsigned int bits,
 		printk("%s\n", buf);
 }
 
-/*
+/**
  * cper_bits_to_str - return a string for set bits
  * @buf: buffer to store the output string
  * @buf_size: size of the output string buffer
  * @bits: bit mask
  * @strs: string array, indexed by bit position
  * @strs_size: size of the string array: @strs
- * @mask: a continuous bitmask used to detect the first valid bit of the
- *        bitmap.
  *
- * Add to @buf the bitmask in hexadecimal. Then, for each set bit in @bits
- * mask, add the corresponding string describing the bit in @strs to @buf.
+ * Add to @buf the bitmask in hexadecimal. Then, for each set bit in @bits,
+ * add the corresponding string describing the bit in @strs to @buf.
  */
 char *cper_bits_to_str(char *buf, int buf_size, unsigned long bits,
 		       const char * const strs[], unsigned int strs_size)
-- 
2.45.2



* Re: [PATCH 6/6] docs: efi: add CPER functions to driver-api
  2024-07-08 11:18 ` [PATCH 6/6] docs: efi: add CPER functions to driver-api Mauro Carvalho Chehab
@ 2024-07-08 15:47   ` Jonathan Cameron
  0 siblings, 0 replies; 3+ messages in thread
From: Jonathan Cameron @ 2024-07-08 15:47 UTC
  To: Mauro Carvalho Chehab
  Cc: Borislav Petkov, Tony Luck, Ard Biesheuvel, James Morse,
	Len Brown, Rafael J. Wysocki, Shiju Jose, Jonathan Corbet,
	linux-acpi, linux-doc, linux-edac, linux-efi, linux-kernel

On Mon,  8 Jul 2024 13:18:15 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> There are two kernel-doc like descriptions at cper, which is used
> by other parts of cper and on ghes driver. They both have kernel-doc
> like descriptions.
> 
> Change the tags for them to be actual kernel-doc tags and add them
> to the driver-api documentaion at the UEFI section.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

Other than the blob at the end, which belongs in an earlier patch, LGTM.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  Documentation/driver-api/firmware/efi/index.rst | 11 ++++++++---
>  drivers/firmware/efi/cper.c                     | 10 ++++------
>  2 files changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/driver-api/firmware/efi/index.rst b/Documentation/driver-api/firmware/efi/index.rst
> index 4fe8abba9fc6..5a6b6229592c 100644
> --- a/Documentation/driver-api/firmware/efi/index.rst
> +++ b/Documentation/driver-api/firmware/efi/index.rst
> @@ -1,11 +1,16 @@
>  .. SPDX-License-Identifier: GPL-2.0
>  
> -============
> -UEFI Support
> -============
> +====================================================
> +Unified Extensible Firmware Interface (UEFI) Support
> +====================================================
>  
>  UEFI stub library functions
>  ===========================
>  
>  .. kernel-doc:: drivers/firmware/efi/libstub/mem.c
>     :internal:
> +
> +UEFI Common Platform Error Record (CPER) functions
> +==================================================
> +
> +.. kernel-doc:: drivers/firmware/efi/cper.c
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index f8c8a15cd527..2785c8ea8ad8 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -69,7 +69,7 @@ const char *cper_severity_str(unsigned int severity)
>  }
>  EXPORT_SYMBOL_GPL(cper_severity_str);
>  
> -/*
> +/**
>   * cper_print_bits - print strings for set bits
>   * @pfx: prefix for each line, including log level and prefix string
>   * @bits: bit mask
> @@ -106,18 +106,16 @@ void cper_print_bits(const char *pfx, unsigned int bits,
>  		printk("%s\n", buf);
>  }
>  
> -/*
> +/**
>   * cper_bits_to_str - return a string for set bits
>   * @buf: buffer to store the output string
>   * @buf_size: size of the output string buffer
>   * @bits: bit mask
>   * @strs: string array, indexed by bit position
>   * @strs_size: size of the string array: @strs
> - * @mask: a continuous bitmask used to detect the first valid bit of the
> - *        bitmap.
>   *
> - * Add to @buf the bitmask in hexadecimal. Then, for each set bit in @bits
> - * mask, add the corresponding string describing the bit in @strs to @buf.
> + * Add to @buf the bitmask in hexadecimal. Then, for each set bit in @bits,
> + * add the corresponding string describing the bit in @strs to @buf.

This is in the wrong patch. No point in introducing wrong docs only to fix
them later.

>   */
>  char *cper_bits_to_str(char *buf, int buf_size, unsigned long bits,
>  		       const char * const strs[], unsigned int strs_size)


