* [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation
@ 2024-08-16 7:37 Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 01/13] acpi/generic_event_device: add an APEI error device Mauro Carvalho Chehab
` (13 more replies)
0 siblings, 14 replies; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-16 7:37 UTC (permalink / raw)
Cc: Jonathan Cameron, Shiju Jose, Mauro Carvalho Chehab, Ani Sinha,
Dongjiu Geng, Paolo Bonzini, Peter Maydell, Shannon Zhao, kvm,
qemu-arm, qemu-devel
This series add support for injecting generic CPER records. Such records
are generated outside QEMU via a provided script.
On this version, I added two optional patches at the end:
- acpi/ghes: cleanup generic error data logic
It drops some obvious comments from some already-existing code.
As we're already doing lots of changes at the code, it sounded
reasonable to me to have such cleanup here;
- acpi/ghes: check if the BIOS pointers for HEST are correct
QEMU has two ways to navigate to a CPER start data: via its
memory address or indirectly following 2 BIOS pointers.
OS only have the latter one. This patch validates if the BIOS
links used by the OS were properly produced, comparing to the
actual location of the CPER record.
---
v8:
- Fix one of the BIOS links that were incorrect;
- Changed mem error internal injection to use a common code;
- No more hardcoded values for CPER: instead of using just the
payload at the QAPI, it now has the full raw CPER there;
- Error injection script now supports changing fields at the
Generic Error Data section of the CPER;
- Several minor cleanups.
v7:
- Change the way offsets are calculated and used on HEST table.
Now, it is compatible with migrations as all offsets are relative
to the HEST table;
- GHES interface is now more generic: the entire CPER is sent via
QMP, instead of just the payload;
- Some code cleanups to make the code more robust;
- The python script now uses QEMUMonitorProtocol class.
v6:
- PNP0C33 device creation moved to aml-build.c;
- acpi_ghes record functions now use ACPI notify parameter,
instead of source ID;
- the number of source IDs is now automatically calculated;
- some code cleanups and function/var renames;
- some fixes and cleanups at the error injection script;
- ghes cper stub now produces an error if cper JSON is not compiled;
- Offset calculation logic for GHES was refactored;
- Updated documentation to reflect the GHES allocated size;
- Added a x-mpidr object for QOM usage;
- Added a patch making usage of x-mpidr field at ARM injection
script;
v5:
- CPER guid is now passing as string;
- raw-data is now passed with base64 encode;
- Removed several GPIO left-overs from arm/virt.c changes;
- Lots of cleanups and improvements at the error injection script.
It now better handles QMP dialog and doesn't print debug messages.
Also, code was split on two modules, to make easier to add more
error injection commands.
v4:
- CPER generation moved to happen outside QEMU;
- One patch adding support for mpidr query was removed.
v3:
- patch 1 cleanups with some comment changes and adding another place where
the poweroff GPIO define should be used. No changes on other patches (except
due to conflict resolution).
v2:
- added a new patch using a define for GPIO power pin;
- patch 2 changed to also use a define for generic error GPIO pin;
- a couple cleanups at patch 2 removing uneeded else clauses.
Example of generating a CPER record:
$ scripts/ghes_inject.py -d arm -p 0xdeadbeef
GUID: e19e3d16-bc11-11e4-9caa-c2051d5d46b0
Generic Error Status Block (20 bytes):
00000000 01 00 00 00 00 00 00 00 00 00 00 00 90 00 00 00 ................
00000010 00 00 00 00 ....
Generic Error Data Entry (72 bytes):
00000000 16 3d 9e e1 11 bc e4 11 9c aa c2 05 1d 5d 46 b0 .=...........]F.
00000010 00 00 00 00 00 03 00 00 48 00 00 00 00 00 00 00 ........H.......
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000040 00 00 00 00 00 00 00 00 ........
Payload (72 bytes):
00000000 05 00 00 00 01 00 00 00 48 00 00 00 00 00 00 00 ........H.......
00000010 00 00 00 80 00 00 00 00 10 05 0f 00 00 00 00 00 ................
00000020 00 00 00 00 00 00 00 00 00 20 14 00 02 01 00 03 ......... ......
00000030 0f 00 91 00 00 00 00 00 ef be ad de 00 00 00 00 ................
00000040 ef be ad de 00 00 00 00 ........
Error injected.
[ 9.358364] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 9.359027] {1}[Hardware Error]: event severity: recoverable
[ 9.359586] {1}[Hardware Error]: Error 0, type: recoverable
[ 9.360124] {1}[Hardware Error]: section_type: ARM processor error
[ 9.360561] {1}[Hardware Error]: MIDR: 0x00000000000f0510
[ 9.361160] {1}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000080000000
[ 9.361643] {1}[Hardware Error]: running state: 0x0
[ 9.362142] {1}[Hardware Error]: Power State Coordination Interface state: 0
[ 9.362682] {1}[Hardware Error]: Error info structure 0:
[ 9.363030] {1}[Hardware Error]: num errors: 2
[ 9.363656] {1}[Hardware Error]: error_type: 0x02: cache error
[ 9.364163] {1}[Hardware Error]: error_info: 0x000000000091000f
[ 9.364834] {1}[Hardware Error]: transaction type: Data Access
[ 9.365599] {1}[Hardware Error]: cache error, operation type: Data write
[ 9.366441] {1}[Hardware Error]: cache level: 2
[ 9.367005] {1}[Hardware Error]: processor context not corrupted
[ 9.367753] {1}[Hardware Error]: physical fault address: 0x00000000deadbeef
[ 9.374267] Memory failure: 0xdeadb: recovery action for free buddy page: Recovered
Such script currently supports arm processor error CPER, but can easily be
extended to other GHES notification types.
Jonathan Cameron (1):
acpi/ghes: Add support for GED error device
Mauro Carvalho Chehab (12):
acpi/generic_event_device: add an APEI error device
arm/virt: Wire up a GED error device for ACPI / GHES
qapi/acpi-hest: add an interface to do generic CPER error injection
acpi/ghes: rework the logic to handle HEST source ID
acpi/ghes: add support for generic error injection via QAPI
acpi/ghes: cleanup the memory error code logic
docs: acpi_hest_ghes: fix documentation for CPER size
scripts/ghes_inject: add a script to generate GHES error inject
target/arm: add an experimental mpidr arm cpu property object
scripts/arm_processor_error.py: retrieve mpidr if not filled
acpi/ghes: cleanup generic error data logic
acpi/ghes: check if the BIOS pointers for HEST are correct
MAINTAINERS | 10 +
docs/specs/acpi_hest_ghes.rst | 6 +-
hw/acpi/Kconfig | 5 +
hw/acpi/aml-build.c | 10 +
hw/acpi/generic_event_device.c | 8 +
hw/acpi/ghes-stub.c | 3 +-
hw/acpi/ghes.c | 362 ++++++++-----
hw/acpi/ghes_cper.c | 33 ++
hw/acpi/ghes_cper_stub.c | 19 +
hw/acpi/meson.build | 2 +
hw/arm/Kconfig | 5 +
hw/arm/virt-acpi-build.c | 6 +-
hw/arm/virt.c | 12 +-
include/hw/acpi/acpi_dev_interface.h | 1 +
include/hw/acpi/aml-build.h | 2 +
include/hw/acpi/generic_event_device.h | 1 +
include/hw/acpi/ghes.h | 24 +-
include/hw/arm/virt.h | 1 +
qapi/acpi-hest.json | 36 ++
qapi/meson.build | 1 +
qapi/qapi-schema.json | 1 +
scripts/arm_processor_error.py | 388 ++++++++++++++
scripts/ghes_inject.py | 51 ++
scripts/qmp_helper.py | 702 +++++++++++++++++++++++++
target/arm/cpu.c | 1 +
target/arm/cpu.h | 1 +
target/arm/helper.c | 10 +-
target/arm/kvm.c | 2 +-
28 files changed, 1551 insertions(+), 152 deletions(-)
create mode 100644 hw/acpi/ghes_cper.c
create mode 100644 hw/acpi/ghes_cper_stub.c
create mode 100644 qapi/acpi-hest.json
create mode 100644 scripts/arm_processor_error.py
create mode 100755 scripts/ghes_inject.py
create mode 100644 scripts/qmp_helper.py
--
2.46.0
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v8 01/13] acpi/generic_event_device: add an APEI error device
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
@ 2024-08-16 7:37 ` Mauro Carvalho Chehab
2024-08-19 11:21 ` Igor Mammedov
2024-08-16 7:37 ` [PATCH v8 02/13] arm/virt: Wire up a GED error device for ACPI / GHES Mauro Carvalho Chehab
` (12 subsequent siblings)
13 siblings, 1 reply; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-16 7:37 UTC (permalink / raw)
Cc: Jonathan Cameron, Shiju Jose, Mauro Carvalho Chehab,
Michael S. Tsirkin, Ani Sinha, Igor Mammedov, linux-kernel,
qemu-devel
Adds a generic error device to handle generic hardware error
events as specified at ACPI 6.5 specification at 18.3.2.7.2:
https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#event-notification-for-generic-error-sources
using HID PNP0C33.
The PNP0C33 device is used to report hardware errors to
the guest via ACPI APEI Generic Hardware Error Source (GHES).
Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Co-authored-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
---
hw/acpi/aml-build.c | 10 ++++++++++
hw/acpi/generic_event_device.c | 8 ++++++++
include/hw/acpi/acpi_dev_interface.h | 1 +
include/hw/acpi/aml-build.h | 2 ++
include/hw/acpi/generic_event_device.h | 1 +
5 files changed, 22 insertions(+)
diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 6d4517cfbe3d..cb167523859f 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -2520,3 +2520,13 @@ Aml *aml_i2c_serial_bus_device(uint16_t address, const char *resource_source)
return var;
}
+
+/* ACPI 5.0: 18.3.2.6.2 Event Notification For Generic Error Sources */
+Aml *aml_error_device(void)
+{
+ Aml *dev = aml_device(ACPI_APEI_ERROR_DEVICE);
+ aml_append(dev, aml_name_decl("_HID", aml_string("PNP0C33")));
+ aml_append(dev, aml_name_decl("_UID", aml_int(0)));
+
+ return dev;
+}
diff --git a/hw/acpi/generic_event_device.c b/hw/acpi/generic_event_device.c
index 15b4c3ebbf24..b4c83a089a02 100644
--- a/hw/acpi/generic_event_device.c
+++ b/hw/acpi/generic_event_device.c
@@ -26,6 +26,7 @@ static const uint32_t ged_supported_events[] = {
ACPI_GED_PWR_DOWN_EVT,
ACPI_GED_NVDIMM_HOTPLUG_EVT,
ACPI_GED_CPU_HOTPLUG_EVT,
+ ACPI_GED_ERROR_EVT,
};
/*
@@ -116,6 +117,11 @@ void build_ged_aml(Aml *table, const char *name, HotplugHandler *hotplug_dev,
aml_notify(aml_name(ACPI_POWER_BUTTON_DEVICE),
aml_int(0x80)));
break;
+ case ACPI_GED_ERROR_EVT:
+ aml_append(if_ctx,
+ aml_notify(aml_name(ACPI_APEI_ERROR_DEVICE),
+ aml_int(0x80)));
+ break;
case ACPI_GED_NVDIMM_HOTPLUG_EVT:
aml_append(if_ctx,
aml_notify(aml_name("\\_SB.NVDR"),
@@ -295,6 +301,8 @@ static void acpi_ged_send_event(AcpiDeviceIf *adev, AcpiEventStatusBits ev)
sel = ACPI_GED_MEM_HOTPLUG_EVT;
} else if (ev & ACPI_POWER_DOWN_STATUS) {
sel = ACPI_GED_PWR_DOWN_EVT;
+ } else if (ev & ACPI_GENERIC_ERROR) {
+ sel = ACPI_GED_ERROR_EVT;
} else if (ev & ACPI_NVDIMM_HOTPLUG_STATUS) {
sel = ACPI_GED_NVDIMM_HOTPLUG_EVT;
} else if (ev & ACPI_CPU_HOTPLUG_STATUS) {
diff --git a/include/hw/acpi/acpi_dev_interface.h b/include/hw/acpi/acpi_dev_interface.h
index 68d9d15f50aa..8294f8f0ccca 100644
--- a/include/hw/acpi/acpi_dev_interface.h
+++ b/include/hw/acpi/acpi_dev_interface.h
@@ -13,6 +13,7 @@ typedef enum {
ACPI_NVDIMM_HOTPLUG_STATUS = 16,
ACPI_VMGENID_CHANGE_STATUS = 32,
ACPI_POWER_DOWN_STATUS = 64,
+ ACPI_GENERIC_ERROR = 128,
} AcpiEventStatusBits;
#define TYPE_ACPI_DEVICE_IF "acpi-device-interface"
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index a3784155cb33..44d1a6af0c69 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -252,6 +252,7 @@ struct CrsRangeSet {
/* Consumer/Producer */
#define AML_SERIAL_BUS_FLAG_CONSUME_ONLY (1 << 1)
+#define ACPI_APEI_ERROR_DEVICE "GEDD"
/**
* init_aml_allocator:
*
@@ -382,6 +383,7 @@ Aml *aml_dma(AmlDmaType typ, AmlDmaBusMaster bm, AmlTransferSize sz,
uint8_t channel);
Aml *aml_sleep(uint64_t msec);
Aml *aml_i2c_serial_bus_device(uint16_t address, const char *resource_source);
+Aml *aml_error_device(void);
/* Block AML object primitives */
Aml *aml_scope(const char *name_format, ...) G_GNUC_PRINTF(1, 2);
diff --git a/include/hw/acpi/generic_event_device.h b/include/hw/acpi/generic_event_device.h
index 40af3550b56d..9ace8fe70328 100644
--- a/include/hw/acpi/generic_event_device.h
+++ b/include/hw/acpi/generic_event_device.h
@@ -98,6 +98,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(AcpiGedState, ACPI_GED)
#define ACPI_GED_PWR_DOWN_EVT 0x2
#define ACPI_GED_NVDIMM_HOTPLUG_EVT 0x4
#define ACPI_GED_CPU_HOTPLUG_EVT 0x8
+#define ACPI_GED_ERROR_EVT 0x10
typedef struct GEDState {
MemoryRegion evt;
--
2.46.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 02/13] arm/virt: Wire up a GED error device for ACPI / GHES
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 01/13] acpi/generic_event_device: add an APEI error device Mauro Carvalho Chehab
@ 2024-08-16 7:37 ` Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 03/13] acpi/ghes: Add support for GED error device Mauro Carvalho Chehab
` (11 subsequent siblings)
13 siblings, 0 replies; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-16 7:37 UTC (permalink / raw)
Cc: Jonathan Cameron, Shiju Jose, Mauro Carvalho Chehab,
Michael S. Tsirkin, Ani Sinha, Dongjiu Geng, Igor Mammedov,
Peter Maydell, Shannon Zhao, linux-kernel, qemu-arm, qemu-devel
Adds support to ARM virtualization to allow handling
generic error ACPI Event via GED & error source device.
It is aligned with Linux Kernel patch:
https://lore.kernel.org/lkml/1272350481-27951-8-git-send-email-ying.huang@intel.com/
Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Co-authored-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Igor Mammedov <imammedo@redhat.com>
---
hw/acpi/ghes.c | 3 +++
hw/arm/virt-acpi-build.c | 1 +
hw/arm/virt.c | 12 +++++++++++-
include/hw/acpi/ghes.h | 3 +++
include/hw/arm/virt.h | 1 +
5 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index e9511d9b8f71..13b105c5d02d 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -444,6 +444,9 @@ int acpi_ghes_record_errors(uint8_t source_id, uint64_t physical_address)
return ret;
}
+NotifierList acpi_generic_error_notifiers =
+ NOTIFIER_LIST_INITIALIZER(error_device_notifiers);
+
bool acpi_ghes_present(void)
{
AcpiGedState *acpi_ged_state;
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index f76fb117adff..1769467d23b2 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -858,6 +858,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
}
acpi_dsdt_add_power_button(scope);
+ aml_append(scope, aml_error_device());
#ifdef CONFIG_TPM
acpi_dsdt_add_tpm(scope, vms);
#endif
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 687fe0bb8bc9..22448e5c5b73 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -677,7 +677,7 @@ static inline DeviceState *create_acpi_ged(VirtMachineState *vms)
DeviceState *dev;
MachineState *ms = MACHINE(vms);
int irq = vms->irqmap[VIRT_ACPI_GED];
- uint32_t event = ACPI_GED_PWR_DOWN_EVT;
+ uint32_t event = ACPI_GED_PWR_DOWN_EVT | ACPI_GED_ERROR_EVT;
if (ms->ram_slots) {
event |= ACPI_GED_MEM_HOTPLUG_EVT;
@@ -1009,6 +1009,13 @@ static void virt_powerdown_req(Notifier *n, void *opaque)
}
}
+static void virt_generic_error_req(Notifier *n, void *opaque)
+{
+ VirtMachineState *s = container_of(n, VirtMachineState, generic_error_notifier);
+
+ acpi_send_event(s->acpi_dev, ACPI_GENERIC_ERROR);
+}
+
static void create_gpio_keys(char *fdt, DeviceState *pl061_dev,
uint32_t phandle)
{
@@ -2385,6 +2392,9 @@ static void machvirt_init(MachineState *machine)
if (has_ged && aarch64 && firmware_loaded && virt_is_acpi_enabled(vms)) {
vms->acpi_dev = create_acpi_ged(vms);
+ vms->generic_error_notifier.notify = virt_generic_error_req;
+ notifier_list_add(&acpi_generic_error_notifiers,
+ &vms->generic_error_notifier);
} else {
create_gpio_devices(vms, VIRT_GPIO, sysmem);
}
diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
index 674f6958e905..fb80897e7eac 100644
--- a/include/hw/acpi/ghes.h
+++ b/include/hw/acpi/ghes.h
@@ -23,6 +23,9 @@
#define ACPI_GHES_H
#include "hw/acpi/bios-linker-loader.h"
+#include "qemu/notify.h"
+
+extern NotifierList acpi_generic_error_notifiers;
/*
* Values for Hardware Error Notification Type field
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index a4d937ed45ac..ad9f6e94dcc5 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -175,6 +175,7 @@ struct VirtMachineState {
DeviceState *gic;
DeviceState *acpi_dev;
Notifier powerdown_notifier;
+ Notifier generic_error_notifier;
PCIBus *bus;
char *oem_id;
char *oem_table_id;
--
2.46.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 03/13] acpi/ghes: Add support for GED error device
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 01/13] acpi/generic_event_device: add an APEI error device Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 02/13] arm/virt: Wire up a GED error device for ACPI / GHES Mauro Carvalho Chehab
@ 2024-08-16 7:37 ` Mauro Carvalho Chehab
2024-08-19 11:43 ` Igor Mammedov
2024-08-16 7:37 ` [PATCH v8 04/13] qapi/acpi-hest: add an interface to do generic CPER error injection Mauro Carvalho Chehab
` (10 subsequent siblings)
13 siblings, 1 reply; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-16 7:37 UTC (permalink / raw)
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, Igor Mammedov, linux-kernel, qemu-arm, qemu-devel,
Mauro Carvalho Chehab
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
As a GED error device is now defined, add another type
of notification.
Add error notification to GHES v2 using a GED error device GED
triggered via interrupt.
[mchehab: do some cleanups at ACPI_HEST_SRC_ID_* checks and
rename HEST event to better identify GED interrupt OSPM]
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
---
hw/acpi/ghes.c | 11 +++++++++--
include/hw/acpi/ghes.h | 3 ++-
2 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index 13b105c5d02d..df59fd35568c 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -34,8 +34,8 @@
/* The max size in bytes for one error block */
#define ACPI_GHES_MAX_RAW_DATA_LENGTH (1 * KiB)
-/* Now only support ARMv8 SEA notification type error source */
-#define ACPI_GHES_ERROR_SOURCE_COUNT 1
+/* Support ARMv8 SEA notification type error source and GPIO interrupt. */
+#define ACPI_GHES_ERROR_SOURCE_COUNT 2
/* Generic Hardware Error Source version 2 */
#define ACPI_GHES_SOURCE_GENERIC_ERROR_V2 10
@@ -290,6 +290,9 @@ void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
static void build_ghes_v2(GArray *table_data, int source_id, BIOSLinker *linker)
{
uint64_t address_offset;
+
+ assert(source_id < ACPI_HEST_SRC_ID_RESERVED);
+
/*
* Type:
* Generic Hardware Error Source version 2(GHESv2 - Type 10)
@@ -327,6 +330,9 @@ static void build_ghes_v2(GArray *table_data, int source_id, BIOSLinker *linker)
*/
build_ghes_hw_error_notification(table_data, ACPI_GHES_NOTIFY_SEA);
break;
+ case ACPI_HEST_SRC_ID_GED:
+ build_ghes_hw_error_notification(table_data, ACPI_GHES_NOTIFY_GPIO);
+ break;
default:
error_report("Not support this error source");
abort();
@@ -370,6 +376,7 @@ void acpi_build_hest(GArray *table_data, BIOSLinker *linker,
/* Error Source Count */
build_append_int_noprefix(table_data, ACPI_GHES_ERROR_SOURCE_COUNT, 4);
build_ghes_v2(table_data, ACPI_HEST_SRC_ID_SEA, linker);
+ build_ghes_v2(table_data, ACPI_HEST_SRC_ID_GED, linker);
acpi_table_end(linker, &table);
}
diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
index fb80897e7eac..419a97d5cbd9 100644
--- a/include/hw/acpi/ghes.h
+++ b/include/hw/acpi/ghes.h
@@ -59,9 +59,10 @@ enum AcpiGhesNotifyType {
ACPI_GHES_NOTIFY_RESERVED = 12
};
+/* Those are used as table indexes when building GHES tables */
enum {
ACPI_HEST_SRC_ID_SEA = 0,
- /* future ids go here */
+ ACPI_HEST_SRC_ID_GED,
ACPI_HEST_SRC_ID_RESERVED,
};
--
2.46.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 04/13] qapi/acpi-hest: add an interface to do generic CPER error injection
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
` (2 preceding siblings ...)
2024-08-16 7:37 ` [PATCH v8 03/13] acpi/ghes: Add support for GED error device Mauro Carvalho Chehab
@ 2024-08-16 7:37 ` Mauro Carvalho Chehab
2024-08-19 11:54 ` Igor Mammedov
2024-08-16 7:37 ` [PATCH v8 05/13] acpi/ghes: rework the logic to handle HEST source ID Mauro Carvalho Chehab
` (9 subsequent siblings)
13 siblings, 1 reply; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-16 7:37 UTC (permalink / raw)
Cc: Jonathan Cameron, Shiju Jose, Mauro Carvalho Chehab,
Michael S. Tsirkin, Ani Sinha, Dongjiu Geng, Eric Blake,
Igor Mammedov, Markus Armbruster, Michael Roth, Paolo Bonzini,
Peter Maydell, linux-kernel, qemu-arm, qemu-devel
Creates a QMP command to be used for generic ACPI APEI hardware error
injection (HEST) via GHESv2.
The actual GHES code will be added at the followup patch.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
MAINTAINERS | 7 +++++++
hw/acpi/Kconfig | 5 +++++
hw/acpi/ghes_cper.c | 33 +++++++++++++++++++++++++++++++++
hw/acpi/ghes_cper_stub.c | 19 +++++++++++++++++++
hw/acpi/meson.build | 2 ++
hw/arm/Kconfig | 5 +++++
include/hw/acpi/ghes.h | 3 +++
qapi/acpi-hest.json | 36 ++++++++++++++++++++++++++++++++++++
qapi/meson.build | 1 +
qapi/qapi-schema.json | 1 +
10 files changed, 112 insertions(+)
create mode 100644 hw/acpi/ghes_cper.c
create mode 100644 hw/acpi/ghes_cper_stub.c
create mode 100644 qapi/acpi-hest.json
diff --git a/MAINTAINERS b/MAINTAINERS
index 3584d6a6c6da..1d8091818899 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2077,6 +2077,13 @@ F: hw/acpi/ghes.c
F: include/hw/acpi/ghes.h
F: docs/specs/acpi_hest_ghes.rst
+ACPI/HEST/GHES/ARM processor CPER
+R: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
+S: Maintained
+F: hw/arm/ghes_cper.c
+F: hw/acpi/ghes_cper_stub.c
+F: qapi/acpi-hest.json
+
ppc4xx
L: qemu-ppc@nongnu.org
S: Orphan
diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig
index e07d3204eb36..73ffbb82c150 100644
--- a/hw/acpi/Kconfig
+++ b/hw/acpi/Kconfig
@@ -51,6 +51,11 @@ config ACPI_APEI
bool
depends on ACPI
+config GHES_CPER
+ bool
+ depends on ACPI_APEI
+ default y
+
config ACPI_PCI
bool
depends on ACPI && PCI
diff --git a/hw/acpi/ghes_cper.c b/hw/acpi/ghes_cper.c
new file mode 100644
index 000000000000..92ca84d738de
--- /dev/null
+++ b/hw/acpi/ghes_cper.c
@@ -0,0 +1,33 @@
+/*
+ * CPER payload parser for error injection
+ *
+ * Copyright(C) 2024 Huawei LTD.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+
+#include "qemu/base64.h"
+#include "qemu/error-report.h"
+#include "qemu/uuid.h"
+#include "qapi/qapi-commands-acpi-hest.h"
+#include "hw/acpi/ghes.h"
+
+void qmp_ghes_cper(const char *qmp_cper,
+ Error **errp)
+{
+
+ uint8_t *cper;
+ size_t len;
+
+ cper = qbase64_decode(qmp_cper, -1, &len, errp);
+ if (!cper) {
+ error_setg(errp, "missing GHES CPER payload");
+ return;
+ }
+
+ /* TODO: call a function at ghes */
+}
diff --git a/hw/acpi/ghes_cper_stub.c b/hw/acpi/ghes_cper_stub.c
new file mode 100644
index 000000000000..36138c462ac9
--- /dev/null
+++ b/hw/acpi/ghes_cper_stub.c
@@ -0,0 +1,19 @@
+/*
+ * Stub interface for CPER payload parser for error injection
+ *
+ * Copyright(C) 2024 Huawei LTD.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qapi/qapi-commands-acpi-hest.h"
+#include "hw/acpi/ghes.h"
+
+void qmp_ghes_cper(const char *cper, Error **errp)
+{
+ error_setg(errp, "GHES QMP error inject is not compiled in");
+}
diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
index fa5c07db9068..6cbf430eb66d 100644
--- a/hw/acpi/meson.build
+++ b/hw/acpi/meson.build
@@ -34,4 +34,6 @@ endif
system_ss.add(when: 'CONFIG_ACPI', if_false: files('acpi-stub.c', 'aml-build-stub.c', 'ghes-stub.c', 'acpi_interface.c'))
system_ss.add(when: 'CONFIG_ACPI_PCI_BRIDGE', if_false: files('pci-bridge-stub.c'))
system_ss.add_all(when: 'CONFIG_ACPI', if_true: acpi_ss)
+system_ss.add(when: 'CONFIG_GHES_CPER', if_true: files('ghes_cper.c'))
+system_ss.add(when: 'CONFIG_GHES_CPER', if_false: files('ghes_cper_stub.c'))
system_ss.add(files('acpi-qmp-cmds.c'))
diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
index 1ad60da7aa2d..bed6ba27d715 100644
--- a/hw/arm/Kconfig
+++ b/hw/arm/Kconfig
@@ -712,3 +712,8 @@ config ARMSSE
select UNIMP
select SSE_COUNTER
select SSE_TIMER
+
+config GHES_CPER
+ bool
+ depends on ARM
+ default y if AARCH64
diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
index 419a97d5cbd9..b977d65564ba 100644
--- a/include/hw/acpi/ghes.h
+++ b/include/hw/acpi/ghes.h
@@ -23,6 +23,7 @@
#define ACPI_GHES_H
#include "hw/acpi/bios-linker-loader.h"
+#include "qapi/error.h"
#include "qemu/notify.h"
extern NotifierList acpi_generic_error_notifiers;
@@ -77,6 +78,8 @@ void acpi_build_hest(GArray *table_data, BIOSLinker *linker,
void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s,
GArray *hardware_errors);
int acpi_ghes_record_errors(uint8_t notify, uint64_t error_physical_addr);
+void ghes_record_cper_errors(const void *cper, size_t len,
+ enum AcpiGhesNotifyType notify, Error **errp);
/**
* acpi_ghes_present: Report whether ACPI GHES table is present
diff --git a/qapi/acpi-hest.json b/qapi/acpi-hest.json
new file mode 100644
index 000000000000..91296755d285
--- /dev/null
+++ b/qapi/acpi-hest.json
@@ -0,0 +1,36 @@
+# -*- Mode: Python -*-
+# vim: filetype=python
+
+##
+# = GHESv2 CPER Error Injection
+#
+# Defined since ACPI Specification 6.2,
+# section 18.3.2.8 Generic Hardware Error Source version 2. See:
+#
+# https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#generic-hardware-error-source-version-2-ghesv2-type-10
+##
+
+
+##
+# @ghes-cper:
+#
+# Inject a CPER error data to be filled according to ACPI 6.2
+# spec via GHESv2.
+#
+# @cper: contains a base64 encoded string with raw data for a single CPER
+# record with Generic Error Status Block, Generic Error Data Entry and
+# generic error data payload, as described at
+# https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#format
+#
+# Features:
+#
+# @unstable: This command is experimental.
+#
+# Since: 9.2
+##
+{ 'command': 'ghes-cper',
+ 'data': {
+ 'cper': 'str'
+ },
+ 'features': [ 'unstable' ]
+}
diff --git a/qapi/meson.build b/qapi/meson.build
index e7bc54e5d047..35cea6147262 100644
--- a/qapi/meson.build
+++ b/qapi/meson.build
@@ -59,6 +59,7 @@ qapi_all_modules = [
if have_system
qapi_all_modules += [
'acpi',
+ 'acpi-hest',
'audio',
'cryptodev',
'qdev',
diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
index b1581988e4eb..baf19ab73afe 100644
--- a/qapi/qapi-schema.json
+++ b/qapi/qapi-schema.json
@@ -75,6 +75,7 @@
{ 'include': 'misc-target.json' }
{ 'include': 'audio.json' }
{ 'include': 'acpi.json' }
+{ 'include': 'acpi-hest.json' }
{ 'include': 'pci.json' }
{ 'include': 'stats.json' }
{ 'include': 'virtio.json' }
--
2.46.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 05/13] acpi/ghes: rework the logic to handle HEST source ID
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
` (3 preceding siblings ...)
2024-08-16 7:37 ` [PATCH v8 04/13] qapi/acpi-hest: add an interface to do generic CPER error injection Mauro Carvalho Chehab
@ 2024-08-16 7:37 ` Mauro Carvalho Chehab
2024-08-19 12:10 ` Igor Mammedov
2024-08-16 7:37 ` [PATCH v8 06/13] acpi/ghes: add support for generic error injection via QAPI Mauro Carvalho Chehab
` (8 subsequent siblings)
13 siblings, 1 reply; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-16 7:37 UTC (permalink / raw)
Cc: Jonathan Cameron, Shiju Jose, Mauro Carvalho Chehab,
Michael S. Tsirkin, Ani Sinha, Dongjiu Geng, Igor Mammedov,
Peter Maydell, Shannon Zhao, linux-kernel, qemu-arm, qemu-devel
The current logic is based on a lot of duct tape, with
offsets calculated based on one define with the number of
source IDs and an enum.
Rewrite the logic in a way that it would be more resilient
of code changes, by moving the source ID count to an enum
and make the offset calculus more explicit.
Such change was inspired on a patch from Jonathan Cameron
splitting the logic to get the CPER address on a separate
function, as this will be needed to support generic error
injection.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
hw/acpi/ghes-stub.c | 3 +-
hw/acpi/ghes.c | 210 ++++++++++++++++++++++++---------------
hw/arm/virt-acpi-build.c | 5 +-
include/hw/acpi/ghes.h | 17 ++--
4 files changed, 138 insertions(+), 97 deletions(-)
diff --git a/hw/acpi/ghes-stub.c b/hw/acpi/ghes-stub.c
index c315de1802d6..8762449870b5 100644
--- a/hw/acpi/ghes-stub.c
+++ b/hw/acpi/ghes-stub.c
@@ -11,7 +11,8 @@
#include "qemu/osdep.h"
#include "hw/acpi/ghes.h"
-int acpi_ghes_record_errors(uint8_t source_id, uint64_t physical_address)
+int acpi_ghes_record_errors(enum AcpiGhesNotifyType notify,
+ uint64_t physical_address)
{
return -1;
}
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index df59fd35568c..7870f51e2a9e 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -28,14 +28,23 @@
#include "hw/nvram/fw_cfg.h"
#include "qemu/uuid.h"
-#define ACPI_GHES_ERRORS_FW_CFG_FILE "etc/hardware_errors"
-#define ACPI_GHES_DATA_ADDR_FW_CFG_FILE "etc/hardware_errors_addr"
+#define ACPI_HW_ERROR_FW_CFG_FILE "etc/hardware_errors"
+#define ACPI_HW_ERROR_ADDR_FW_CFG_FILE "etc/hardware_errors_addr"
+#define ACPI_HEST_ADDR_FW_CFG_FILE "etc/acpi_table_hest_addr"
/* The max size in bytes for one error block */
#define ACPI_GHES_MAX_RAW_DATA_LENGTH (1 * KiB)
-/* Support ARMv8 SEA notification type error source and GPIO interrupt. */
-#define ACPI_GHES_ERROR_SOURCE_COUNT 2
+/*
+ * ID numbers used to fill HEST source ID field
+ */
+enum AcpiHestSourceId {
+ ACPI_HEST_SRC_ID_SEA,
+ ACPI_HEST_SRC_ID_GED,
+
+ /* Shall be the last one */
+ ACPI_HEST_SRC_ID_COUNT
+} AcpiHestSourceId;
/* Generic Hardware Error Source version 2 */
#define ACPI_GHES_SOURCE_GENERIC_ERROR_V2 10
@@ -63,6 +72,19 @@
*/
#define ACPI_GHES_GESB_SIZE 20
+/*
+ * Offsets with regards to the start of the HEST table stored at
+ * ags->hest_addr_le, according with the memory layout map at
+ * docs/specs/acpi_hest_ghes.rst.
+ */
+
+/* ACPI 4.0: 17.3.2 ACPI Error Source */
+#define ACPI_HEST_HEADER_SIZE 40
+
+/* ACPI 6.2: 18.3.2.8 Generic Hardware Error Source version 2 */
+#define HEST_GHES_V2_TABLE_SIZE 92
+#define GHES_ACK_OFFSET (64 + GAS_ADDR_OFFSET + ACPI_HEST_HEADER_SIZE)
+
/*
* Values for error_severity field
*/
@@ -236,17 +258,17 @@ static int acpi_ghes_record_mem_error(uint64_t error_block_address,
* Initialize "etc/hardware_errors" and "etc/hardware_errors_addr" fw_cfg blobs.
* See docs/specs/acpi_hest_ghes.rst for blobs format.
*/
-void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
+static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
{
int i, error_status_block_offset;
/* Build error_block_address */
- for (i = 0; i < ACPI_GHES_ERROR_SOURCE_COUNT; i++) {
+ for (i = 0; i < ACPI_HEST_SRC_ID_COUNT; i++) {
build_append_int_noprefix(hardware_errors, 0, sizeof(uint64_t));
}
/* Build read_ack_register */
- for (i = 0; i < ACPI_GHES_ERROR_SOURCE_COUNT; i++) {
+ for (i = 0; i < ACPI_HEST_SRC_ID_COUNT; i++) {
/*
* Initialize the value of read_ack_register to 1, so GHES can be
* writable after (re)boot.
@@ -261,20 +283,20 @@ void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
/* Reserve space for Error Status Data Block */
acpi_data_push(hardware_errors,
- ACPI_GHES_MAX_RAW_DATA_LENGTH * ACPI_GHES_ERROR_SOURCE_COUNT);
+ ACPI_GHES_MAX_RAW_DATA_LENGTH * ACPI_HEST_SRC_ID_COUNT);
/* Tell guest firmware to place hardware_errors blob into RAM */
- bios_linker_loader_alloc(linker, ACPI_GHES_ERRORS_FW_CFG_FILE,
+ bios_linker_loader_alloc(linker, ACPI_HW_ERROR_FW_CFG_FILE,
hardware_errors, sizeof(uint64_t), false);
- for (i = 0; i < ACPI_GHES_ERROR_SOURCE_COUNT; i++) {
+ for (i = 0; i < ACPI_HEST_SRC_ID_COUNT; i++) {
/*
* Tell firmware to patch error_block_address entries to point to
* corresponding "Generic Error Status Block"
*/
bios_linker_loader_add_pointer(linker,
- ACPI_GHES_ERRORS_FW_CFG_FILE, sizeof(uint64_t) * i,
- sizeof(uint64_t), ACPI_GHES_ERRORS_FW_CFG_FILE,
+ ACPI_HW_ERROR_FW_CFG_FILE, sizeof(uint64_t) * i,
+ sizeof(uint64_t), ACPI_HW_ERROR_FW_CFG_FILE,
error_status_block_offset + i * ACPI_GHES_MAX_RAW_DATA_LENGTH);
}
@@ -282,16 +304,39 @@ void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
* tell firmware to write hardware_errors GPA into
* hardware_errors_addr fw_cfg, once the former has been initialized.
*/
- bios_linker_loader_write_pointer(linker, ACPI_GHES_DATA_ADDR_FW_CFG_FILE,
- 0, sizeof(uint64_t), ACPI_GHES_ERRORS_FW_CFG_FILE, 0);
+ bios_linker_loader_write_pointer(linker, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, 0,
+ sizeof(uint64_t),
+ ACPI_HW_ERROR_FW_CFG_FILE, 0);
+}
+
+static bool ghes_notify_to_source_id(enum AcpiGhesNotifyType notify,
+ enum AcpiHestSourceId *source_id)
+{
+ switch (notify) {
+ case ACPI_GHES_NOTIFY_SEA: /* ARMv8 */
+ *source_id = ACPI_HEST_SRC_ID_SEA;
+ return false;
+ case ACPI_GHES_NOTIFY_GPIO:
+ *source_id = ACPI_HEST_SRC_ID_GED;
+ return false;
+ default:
+ /* Unsupported notification types */
+ return true;
+ }
}
/* Build Generic Hardware Error Source version 2 (GHESv2) */
-static void build_ghes_v2(GArray *table_data, int source_id, BIOSLinker *linker)
+static void build_ghes_v2(GArray *table_data,
+ enum AcpiGhesNotifyType notify,
+ BIOSLinker *linker)
{
uint64_t address_offset;
+ enum AcpiHestSourceId source_id;
- assert(source_id < ACPI_HEST_SRC_ID_RESERVED);
+ if (ghes_notify_to_source_id(notify, &source_id)) {
+ error_report("Error: notify %d not supported", notify);
+ abort();
+ }
/*
* Type:
@@ -319,24 +364,13 @@ static void build_ghes_v2(GArray *table_data, int source_id, BIOSLinker *linker)
build_append_gas(table_data, AML_AS_SYSTEM_MEMORY, 0x40, 0,
4 /* QWord access */, 0);
bios_linker_loader_add_pointer(linker, ACPI_BUILD_TABLE_FILE,
- address_offset + GAS_ADDR_OFFSET, sizeof(uint64_t),
- ACPI_GHES_ERRORS_FW_CFG_FILE, source_id * sizeof(uint64_t));
+ address_offset + GAS_ADDR_OFFSET,
+ sizeof(uint64_t),
+ ACPI_HW_ERROR_FW_CFG_FILE,
+ source_id * sizeof(uint64_t));
- switch (source_id) {
- case ACPI_HEST_SRC_ID_SEA:
- /*
- * Notification Structure
- * Now only enable ARMv8 SEA notification type
- */
- build_ghes_hw_error_notification(table_data, ACPI_GHES_NOTIFY_SEA);
- break;
- case ACPI_HEST_SRC_ID_GED:
- build_ghes_hw_error_notification(table_data, ACPI_GHES_NOTIFY_GPIO);
- break;
- default:
- error_report("Not support this error source");
- abort();
- }
+ /* Notification Structure */
+ build_ghes_hw_error_notification(table_data, notify);
/* Error Status Block Length */
build_append_int_noprefix(table_data, ACPI_GHES_MAX_RAW_DATA_LENGTH, 4);
@@ -350,9 +384,11 @@ static void build_ghes_v2(GArray *table_data, int source_id, BIOSLinker *linker)
build_append_gas(table_data, AML_AS_SYSTEM_MEMORY, 0x40, 0,
4 /* QWord access */, 0);
bios_linker_loader_add_pointer(linker, ACPI_BUILD_TABLE_FILE,
- address_offset + GAS_ADDR_OFFSET,
- sizeof(uint64_t), ACPI_GHES_ERRORS_FW_CFG_FILE,
- (ACPI_GHES_ERROR_SOURCE_COUNT + source_id) * sizeof(uint64_t));
+ address_offset + GAS_ADDR_OFFSET,
+ sizeof(uint64_t),
+ ACPI_HW_ERROR_FW_CFG_FILE,
+ (ACPI_HEST_SRC_ID_COUNT + source_id) *
+ sizeof(uint64_t));
/*
* Read Ack Preserve field
@@ -365,90 +401,100 @@ static void build_ghes_v2(GArray *table_data, int source_id, BIOSLinker *linker)
}
/* Build Hardware Error Source Table */
-void acpi_build_hest(GArray *table_data, BIOSLinker *linker,
+void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
+ BIOSLinker *linker,
const char *oem_id, const char *oem_table_id)
{
AcpiTable table = { .sig = "HEST", .rev = 1,
.oem_id = oem_id, .oem_table_id = oem_table_id };
+ build_ghes_error_table(hardware_errors, linker);
+
+ int hest_offset = table_data->len;
+
acpi_table_begin(&table, table_data);
/* Error Source Count */
- build_append_int_noprefix(table_data, ACPI_GHES_ERROR_SOURCE_COUNT, 4);
- build_ghes_v2(table_data, ACPI_HEST_SRC_ID_SEA, linker);
- build_ghes_v2(table_data, ACPI_HEST_SRC_ID_GED, linker);
+ build_append_int_noprefix(table_data, ACPI_HEST_SRC_ID_COUNT, 4);
+ build_ghes_v2(table_data, ACPI_GHES_NOTIFY_SEA, linker);
+ build_ghes_v2(table_data, ACPI_GHES_NOTIFY_GPIO, linker);
acpi_table_end(linker, &table);
+
+ /*
+ * tell firmware to write into GPA the address of HEST via fw_cfg,
+ * once initialized.
+ */
+ bios_linker_loader_write_pointer(linker,
+ ACPI_HEST_ADDR_FW_CFG_FILE, 0,
+ sizeof(uint64_t),
+ ACPI_BUILD_TABLE_FILE, hest_offset);
}
void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
GArray *hardware_error)
{
/* Create a read-only fw_cfg file for GHES */
- fw_cfg_add_file(s, ACPI_GHES_ERRORS_FW_CFG_FILE, hardware_error->data,
+ fw_cfg_add_file(s, ACPI_HW_ERROR_FW_CFG_FILE, hardware_error->data,
hardware_error->len);
/* Create a read-write fw_cfg file for Address */
- fw_cfg_add_file_callback(s, ACPI_GHES_DATA_ADDR_FW_CFG_FILE, NULL, NULL,
+ fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, NULL, NULL,
NULL, &(ags->ghes_addr_le), sizeof(ags->ghes_addr_le), false);
+ fw_cfg_add_file_callback(s, ACPI_HEST_ADDR_FW_CFG_FILE, NULL, NULL,
+ NULL, &(ags->hest_addr_le), sizeof(ags->hest_addr_le), false);
+
ags->present = true;
}
-int acpi_ghes_record_errors(uint8_t source_id, uint64_t physical_address)
+int acpi_ghes_record_errors(enum AcpiGhesNotifyType notify,
+ uint64_t physical_address)
{
- uint64_t error_block_addr, read_ack_register_addr, read_ack_register = 0;
- uint64_t start_addr;
- bool ret = -1;
+ uint64_t cper_addr, read_ack_register = 0;
+ uint64_t read_ack_start_addr;
+ enum AcpiHestSourceId source;
AcpiGedState *acpi_ged_state;
AcpiGhesState *ags;
- assert(source_id < ACPI_HEST_SRC_ID_RESERVED);
+ if (ghes_notify_to_source_id(ACPI_HEST_SRC_ID_SEA, &source)) {
+ error_report("GHES: Invalid error block/ack address(es) for notify %d",
+ notify);
+ return -1;
+ }
acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
NULL));
g_assert(acpi_ged_state);
ags = &acpi_ged_state->ghes_state;
- start_addr = le64_to_cpu(ags->ghes_addr_le);
+ cper_addr = le64_to_cpu(ags->ghes_addr_le);
+ cper_addr += ACPI_HEST_SRC_ID_COUNT * sizeof(uint64_t);
+ read_ack_start_addr = cper_addr + source * sizeof(uint64_t);
- if (physical_address) {
+ cper_addr += ACPI_HEST_SRC_ID_COUNT * sizeof(uint64_t);
+ cper_addr += source * ACPI_GHES_MAX_RAW_DATA_LENGTH;
- if (source_id < ACPI_HEST_SRC_ID_RESERVED) {
- start_addr += source_id * sizeof(uint64_t);
- }
-
- cpu_physical_memory_read(start_addr, &error_block_addr,
- sizeof(error_block_addr));
-
- error_block_addr = le64_to_cpu(error_block_addr);
-
- read_ack_register_addr = start_addr +
- ACPI_GHES_ERROR_SOURCE_COUNT * sizeof(uint64_t);
-
- cpu_physical_memory_read(read_ack_register_addr,
- &read_ack_register, sizeof(read_ack_register));
-
- /* zero means OSPM does not acknowledge the error */
- if (!read_ack_register) {
- error_report("OSPM does not acknowledge previous error,"
- " so can not record CPER for current error anymore");
- } else if (error_block_addr) {
- read_ack_register = cpu_to_le64(0);
- /*
- * Clear the Read Ack Register, OSPM will write it to 1 when
- * it acknowledges this error.
- */
- cpu_physical_memory_write(read_ack_register_addr,
- &read_ack_register, sizeof(uint64_t));
-
- ret = acpi_ghes_record_mem_error(error_block_addr,
- physical_address);
- } else
- error_report("can not find Generic Error Status Block");
+ if (!physical_address) {
+ error_report("can not find Generic Error Status Block for notify %d",
+ notify);
+ return -1;
}
- return ret;
+ cpu_physical_memory_read(read_ack_start_addr,
+ &read_ack_register, sizeof(read_ack_register));
+
+ /* zero means OSPM does not acknowledge the error */
+
+ read_ack_register = cpu_to_le64(0);
+ /*
+ * Clear the Read Ack Register, OSPM will write it to 1 when
+ * it acknowledges this error.
+ */
+ cpu_physical_memory_write(read_ack_start_addr,
+ &read_ack_register, sizeof(uint64_t));
+
+ return acpi_ghes_record_mem_error(cper_addr, physical_address);
}
NotifierList acpi_generic_error_notifiers =
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 1769467d23b2..79635bc7a0a8 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -944,10 +944,9 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
build_dbg2(tables_blob, tables->linker, vms);
if (vms->ras) {
- build_ghes_error_table(tables->hardware_errors, tables->linker);
acpi_add_table(table_offsets, tables_blob);
- acpi_build_hest(tables_blob, tables->linker, vms->oem_id,
- vms->oem_table_id);
+ acpi_build_hest(tables_blob, tables->hardware_errors, tables->linker,
+ vms->oem_id, vms->oem_table_id);
}
if (ms->numa_state->num_nodes > 0) {
diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
index b977d65564ba..6e349264fd8b 100644
--- a/include/hw/acpi/ghes.h
+++ b/include/hw/acpi/ghes.h
@@ -29,7 +29,7 @@
extern NotifierList acpi_generic_error_notifiers;
/*
- * Values for Hardware Error Notification Type field
+ * ACPI spec values for Hardware Error Notification Type field
*/
enum AcpiGhesNotifyType {
/* Polled */
@@ -60,24 +60,19 @@ enum AcpiGhesNotifyType {
ACPI_GHES_NOTIFY_RESERVED = 12
};
-/* Those are used as table indexes when building GHES tables */
-enum {
- ACPI_HEST_SRC_ID_SEA = 0,
- ACPI_HEST_SRC_ID_GED,
- ACPI_HEST_SRC_ID_RESERVED,
-};
-
typedef struct AcpiGhesState {
+ uint64_t hest_addr_le;
uint64_t ghes_addr_le;
bool present; /* True if GHES is present at all on this board */
} AcpiGhesState;
-void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker);
-void acpi_build_hest(GArray *table_data, BIOSLinker *linker,
+void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
+ BIOSLinker *linker,
const char *oem_id, const char *oem_table_id);
void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s,
GArray *hardware_errors);
-int acpi_ghes_record_errors(uint8_t notify, uint64_t error_physical_addr);
+int acpi_ghes_record_errors(enum AcpiGhesNotifyType notify,
+ uint64_t error_physical_addr);
void ghes_record_cper_errors(const void *cper, size_t len,
enum AcpiGhesNotifyType notify, Error **errp);
--
2.46.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 06/13] acpi/ghes: add support for generic error injection via QAPI
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
` (4 preceding siblings ...)
2024-08-16 7:37 ` [PATCH v8 05/13] acpi/ghes: rework the logic to handle HEST source ID Mauro Carvalho Chehab
@ 2024-08-16 7:37 ` Mauro Carvalho Chehab
2024-08-19 12:51 ` Igor Mammedov
2024-08-16 7:37 ` [PATCH v8 07/13] acpi/ghes: cleanup the memory error code logic Mauro Carvalho Chehab
` (7 subsequent siblings)
13 siblings, 1 reply; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-16 7:37 UTC (permalink / raw)
Cc: Jonathan Cameron, Shiju Jose, Mauro Carvalho Chehab,
Michael S. Tsirkin, Ani Sinha, Dongjiu Geng, Igor Mammedov,
linux-kernel, qemu-arm, qemu-devel
Provide a generic interface for error injection via GHESv2.
This patch is co-authored:
- original ghes logic to inject a simple ARM record by Shiju Jose;
- generic logic to handle block addresses by Jonathan Cameron;
- generic GHESv2 error inject by Mauro Carvalho Chehab;
Co-authored-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Co-authored-by: Shiju Jose <shiju.jose@huawei.com>
Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
hw/acpi/ghes.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
hw/acpi/ghes_cper.c | 2 +-
2 files changed, 58 insertions(+), 1 deletion(-)
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index 7870f51e2a9e..a3ae710dcf81 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -500,6 +500,63 @@ int acpi_ghes_record_errors(enum AcpiGhesNotifyType notify,
NotifierList acpi_generic_error_notifiers =
NOTIFIER_LIST_INITIALIZER(error_device_notifiers);
+void ghes_record_cper_errors(uint8_t *cper, size_t len,
+ enum AcpiGhesNotifyType notify, Error **errp)
+{
+ uint64_t cper_addr, read_ack_start_addr;
+ enum AcpiHestSourceId source;
+ AcpiGedState *acpi_ged_state;
+ AcpiGhesState *ags;
+ uint64_t read_ack;
+
+ if (ghes_notify_to_source_id(notify, &source)) {
+ error_setg(errp,
+ "GHES: Invalid error block/ack address(es) for notify %d",
+ notify);
+ return;
+ }
+
+ acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
+ NULL));
+ g_assert(acpi_ged_state);
+ ags = &acpi_ged_state->ghes_state;
+
+ cper_addr = le64_to_cpu(ags->ghes_addr_le);
+ cper_addr += ACPI_HEST_SRC_ID_COUNT * sizeof(uint64_t);
+ read_ack_start_addr = cper_addr + source * sizeof(uint64_t);
+
+ cper_addr += ACPI_HEST_SRC_ID_COUNT * sizeof(uint64_t);
+ cper_addr += source * ACPI_GHES_MAX_RAW_DATA_LENGTH;
+
+ cpu_physical_memory_read(read_ack_start_addr,
+ &read_ack, sizeof(uint64_t));
+
+ /* zero means OSPM does not acknowledge the error */
+ if (!read_ack) {
+ error_setg(errp,
+ "Last CPER record was not acknowledged yet");
+ read_ack = 1;
+ cpu_physical_memory_write(read_ack_start_addr,
+ &read_ack, sizeof(uint64_t));
+ return;
+ }
+
+ read_ack = cpu_to_le64(0);
+ cpu_physical_memory_write(read_ack_start_addr,
+ &read_ack, sizeof(uint64_t));
+
+ /* Build CPER record */
+
+ if (len > ACPI_GHES_MAX_RAW_DATA_LENGTH) {
+ error_setg(errp, "GHES CPER record is too big: %ld", len);
+ }
+
+ /* Write the generic error data entry into guest memory */
+ cpu_physical_memory_write(cper_addr, cper, len);
+
+ notifier_list_notify(&acpi_generic_error_notifiers, NULL);
+}
+
bool acpi_ghes_present(void)
{
AcpiGedState *acpi_ged_state;
diff --git a/hw/acpi/ghes_cper.c b/hw/acpi/ghes_cper.c
index 92ca84d738de..2328dbff7012 100644
--- a/hw/acpi/ghes_cper.c
+++ b/hw/acpi/ghes_cper.c
@@ -29,5 +29,5 @@ void qmp_ghes_cper(const char *qmp_cper,
return;
}
- /* TODO: call a function at ghes */
+ ghes_record_cper_errors(cper, len, ACPI_GHES_NOTIFY_GPIO, errp);
}
--
2.46.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 07/13] acpi/ghes: cleanup the memory error code logic
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
` (5 preceding siblings ...)
2024-08-16 7:37 ` [PATCH v8 06/13] acpi/ghes: add support for generic error injection via QAPI Mauro Carvalho Chehab
@ 2024-08-16 7:37 ` Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 08/13] docs: acpi_hest_ghes: fix documentation for CPER size Mauro Carvalho Chehab
` (6 subsequent siblings)
13 siblings, 0 replies; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-16 7:37 UTC (permalink / raw)
Cc: Jonathan Cameron, Shiju Jose, Mauro Carvalho Chehab,
Michael S. Tsirkin, Ani Sinha, Dongjiu Geng, Igor Mammedov,
Paolo Bonzini, Peter Maydell, kvm, linux-kernel, qemu-arm,
qemu-devel
Better organize the code of the function, making it to use the
raw CPER function, thus removing duplicated code.
While here, rename the function to actually reflect what it does.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
hw/acpi/ghes-stub.c | 2 +-
hw/acpi/ghes.c | 125 +++++++++++++++--------------------------
include/hw/acpi/ghes.h | 4 +-
target/arm/kvm.c | 2 +-
4 files changed, 50 insertions(+), 83 deletions(-)
diff --git a/hw/acpi/ghes-stub.c b/hw/acpi/ghes-stub.c
index 8762449870b5..a60ae07a8e7c 100644
--- a/hw/acpi/ghes-stub.c
+++ b/hw/acpi/ghes-stub.c
@@ -11,7 +11,7 @@
#include "qemu/osdep.h"
#include "hw/acpi/ghes.h"
-int acpi_ghes_record_errors(enum AcpiGhesNotifyType notify,
+int acpi_ghes_memory_errors(enum AcpiGhesNotifyType notify,
uint64_t physical_address)
{
return -1;
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index a3ae710dcf81..4f7b6c5ad2b6 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -206,51 +206,30 @@ static void acpi_ghes_build_append_mem_cper(GArray *table,
build_append_int_noprefix(table, 0, 7);
}
-static int acpi_ghes_record_mem_error(uint64_t error_block_address,
- uint64_t error_physical_addr)
+static void
+ghes_gen_err_data_uncorrectable_recoverable(GArray *block,
+ const uint8_t *section_type,
+ int data_length)
{
- GArray *block;
-
- /* Memory Error Section Type */
- const uint8_t uefi_cper_mem_sec[] =
- UUID_LE(0xA5BC1114, 0x6F64, 0x4EDE, 0xB8, 0x63, 0x3E, 0x83, \
- 0xED, 0x7C, 0x83, 0xB1);
-
/* invalid fru id: ACPI 4.0: 17.3.2.6.1 Generic Error Data,
* Table 17-13 Generic Error Data Entry
*/
QemuUUID fru_id = {};
- uint32_t data_length;
- block = g_array_new(false, true /* clear */, 1);
-
- /* This is the length if adding a new generic error data entry*/
- data_length = ACPI_GHES_DATA_LENGTH + ACPI_GHES_MEM_CPER_LENGTH;
/*
- * It should not run out of the preallocated memory if adding a new generic
- * error data entry
+ * Calculate the size with this block. No need to check for
+ * too big CPER, as CPER size is checked at ghes_record_cper_errors()
*/
- assert((data_length + ACPI_GHES_GESB_SIZE) <=
- ACPI_GHES_MAX_RAW_DATA_LENGTH);
+ data_length += ACPI_GHES_GESB_SIZE;
/* Build the new generic error status block header */
acpi_ghes_generic_error_status(block, ACPI_GEBS_UNCORRECTABLE,
0, 0, data_length, ACPI_CPER_SEV_RECOVERABLE);
/* Build this new generic error data entry header */
- acpi_ghes_generic_error_data(block, uefi_cper_mem_sec,
+ acpi_ghes_generic_error_data(block, section_type,
ACPI_CPER_SEV_RECOVERABLE, 0, 0,
ACPI_GHES_MEM_CPER_LENGTH, fru_id, 0);
-
- /* Build the memory section CPER for above new generic error data entry */
- acpi_ghes_build_append_mem_cper(block, error_physical_addr);
-
- /* Write the generic error data entry into guest memory */
- cpu_physical_memory_write(error_block_address, block->data, block->len);
-
- g_array_free(block, true);
-
- return 0;
}
/*
@@ -448,59 +427,10 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
ags->present = true;
}
-int acpi_ghes_record_errors(enum AcpiGhesNotifyType notify,
- uint64_t physical_address)
-{
- uint64_t cper_addr, read_ack_register = 0;
- uint64_t read_ack_start_addr;
- enum AcpiHestSourceId source;
- AcpiGedState *acpi_ged_state;
- AcpiGhesState *ags;
-
- if (ghes_notify_to_source_id(ACPI_HEST_SRC_ID_SEA, &source)) {
- error_report("GHES: Invalid error block/ack address(es) for notify %d",
- notify);
- return -1;
- }
-
- acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
- NULL));
- g_assert(acpi_ged_state);
- ags = &acpi_ged_state->ghes_state;
-
- cper_addr = le64_to_cpu(ags->ghes_addr_le);
- cper_addr += ACPI_HEST_SRC_ID_COUNT * sizeof(uint64_t);
- read_ack_start_addr = cper_addr + source * sizeof(uint64_t);
-
- cper_addr += ACPI_HEST_SRC_ID_COUNT * sizeof(uint64_t);
- cper_addr += source * ACPI_GHES_MAX_RAW_DATA_LENGTH;
-
- if (!physical_address) {
- error_report("can not find Generic Error Status Block for notify %d",
- notify);
- return -1;
- }
-
- cpu_physical_memory_read(read_ack_start_addr,
- &read_ack_register, sizeof(read_ack_register));
-
- /* zero means OSPM does not acknowledge the error */
-
- read_ack_register = cpu_to_le64(0);
- /*
- * Clear the Read Ack Register, OSPM will write it to 1 when
- * it acknowledges this error.
- */
- cpu_physical_memory_write(read_ack_start_addr,
- &read_ack_register, sizeof(uint64_t));
-
- return acpi_ghes_record_mem_error(cper_addr, physical_address);
-}
-
NotifierList acpi_generic_error_notifiers =
NOTIFIER_LIST_INITIALIZER(error_device_notifiers);
-void ghes_record_cper_errors(uint8_t *cper, size_t len,
+void ghes_record_cper_errors(const void *cper, size_t len,
enum AcpiGhesNotifyType notify, Error **errp)
{
uint64_t cper_addr, read_ack_start_addr;
@@ -557,6 +487,43 @@ void ghes_record_cper_errors(uint8_t *cper, size_t len,
notifier_list_notify(&acpi_generic_error_notifiers, NULL);
}
+int acpi_ghes_memory_errors(enum AcpiGhesNotifyType notify,
+ uint64_t physical_address)
+{
+ /* Memory Error Section Type */
+ const uint8_t guid[] =
+ UUID_LE(0xA5BC1114, 0x6F64, 0x4EDE, 0xB8, 0x63, 0x3E, 0x83, \
+ 0xED, 0x7C, 0x83, 0xB1);
+ Error *errp = NULL;
+ GArray *block;
+
+ if (!physical_address) {
+ error_report("can not find Generic Error Status Block for notify %d",
+ notify);
+ return -1;
+ }
+
+ block = g_array_new(false, true /* clear */, 1);
+
+ ghes_gen_err_data_uncorrectable_recoverable(block, guid,
+ ACPI_GHES_MAX_RAW_DATA_LENGTH);
+
+ /* Build the memory section CPER for above new generic error data entry */
+ acpi_ghes_build_append_mem_cper(block, physical_address);
+
+ /* Report the error */
+ ghes_record_cper_errors(block->data, block->len, notify, &errp);
+
+ g_array_free(block, true);
+
+ if (errp) {
+ error_report_err(errp);
+ return -1;
+ }
+
+ return 0;
+}
+
bool acpi_ghes_present(void)
{
AcpiGedState *acpi_ged_state;
diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
index 6e349264fd8b..a24fe8a3bc33 100644
--- a/include/hw/acpi/ghes.h
+++ b/include/hw/acpi/ghes.h
@@ -71,7 +71,7 @@ void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
const char *oem_id, const char *oem_table_id);
void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s,
GArray *hardware_errors);
-int acpi_ghes_record_errors(enum AcpiGhesNotifyType notify,
+int acpi_ghes_memory_errors(enum AcpiGhesNotifyType notify,
uint64_t error_physical_addr);
void ghes_record_cper_errors(const void *cper, size_t len,
enum AcpiGhesNotifyType notify, Error **errp);
@@ -80,7 +80,7 @@ void ghes_record_cper_errors(const void *cper, size_t len,
* acpi_ghes_present: Report whether ACPI GHES table is present
*
* Returns: true if the system has an ACPI GHES table and it is
- * safe to call acpi_ghes_record_errors() to record a memory error.
+ * safe to call acpi_ghes_memory_errors() to record a memory error.
*/
bool acpi_ghes_present(void);
#endif
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 849e2e21b304..915f07376c8f 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2373,7 +2373,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
*/
if (code == BUS_MCEERR_AR) {
kvm_cpu_synchronize_state(c);
- if (!acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
+ if (!acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
kvm_inject_arm_sea(c);
} else {
error_report("failed to record the error");
--
2.46.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 08/13] docs: acpi_hest_ghes: fix documentation for CPER size
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
` (6 preceding siblings ...)
2024-08-16 7:37 ` [PATCH v8 07/13] acpi/ghes: cleanup the memory error code logic Mauro Carvalho Chehab
@ 2024-08-16 7:37 ` Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 09/13] scripts/ghes_inject: add a script to generate GHES error inject Mauro Carvalho Chehab
` (5 subsequent siblings)
13 siblings, 0 replies; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-16 7:37 UTC (permalink / raw)
Cc: Jonathan Cameron, Shiju Jose, Mauro Carvalho Chehab, Dongjiu Geng,
linux-kernel, qemu-arm, qemu-devel, Igor Mammedov
While the spec defines a CPER size of 4KiB for each record,
currently it is set to 1KiB. Fix the documentation and add
a pointer to the macro name there, as this may help to keep
it updated.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Igor Mammedov <imammedo@redhat.com>
---
docs/specs/acpi_hest_ghes.rst | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/docs/specs/acpi_hest_ghes.rst b/docs/specs/acpi_hest_ghes.rst
index 68f1fbe0a4af..c3e9f8d9a702 100644
--- a/docs/specs/acpi_hest_ghes.rst
+++ b/docs/specs/acpi_hest_ghes.rst
@@ -67,8 +67,10 @@ Design Details
(3) The address registers table contains N Error Block Address entries
and N Read Ack Register entries. The size for each entry is 8-byte.
The Error Status Data Block table contains N Error Status Data Block
- entries. The size for each entry is 4096(0x1000) bytes. The total size
- for the "etc/hardware_errors" fw_cfg blob is (N * 8 * 2 + N * 4096) bytes.
+ entries. The size for each entry is defined at the source code as
+ ACPI_GHES_MAX_RAW_DATA_LENGTH (currently 1024 bytes). The total size
+ for the "etc/hardware_errors" fw_cfg blob is
+ (N * 8 * 2 + N * ACPI_GHES_MAX_RAW_DATA_LENGTH) bytes.
N is the number of the kinds of hardware error sources.
(4) QEMU generates the ACPI linker/loader script for the firmware. The
--
2.46.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 09/13] scripts/ghes_inject: add a script to generate GHES error inject
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
` (7 preceding siblings ...)
2024-08-16 7:37 ` [PATCH v8 08/13] docs: acpi_hest_ghes: fix documentation for CPER size Mauro Carvalho Chehab
@ 2024-08-16 7:37 ` Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 10/13] target/arm: add an experimental mpidr arm cpu property object Mauro Carvalho Chehab
` (4 subsequent siblings)
13 siblings, 0 replies; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-16 7:37 UTC (permalink / raw)
Cc: Jonathan Cameron, Shiju Jose, Mauro Carvalho Chehab, Cleber Rosa,
John Snow, linux-kernel, qemu-devel
Using the QMP GHESv2 API requires preparing a raw data array
containing a CPER record.
Add a helper script with subcommands to prepare such data.
Currently, only ARM Processor error CPER record is supported.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
MAINTAINERS | 3 +
scripts/arm_processor_error.py | 377 ++++++++++++++++++
scripts/ghes_inject.py | 51 +++
scripts/qmp_helper.py | 702 +++++++++++++++++++++++++++++++++
4 files changed, 1133 insertions(+)
create mode 100644 scripts/arm_processor_error.py
create mode 100755 scripts/ghes_inject.py
create mode 100644 scripts/qmp_helper.py
diff --git a/MAINTAINERS b/MAINTAINERS
index 1d8091818899..249ed2858198 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2083,6 +2083,9 @@ S: Maintained
F: hw/arm/ghes_cper.c
F: hw/acpi/ghes_cper_stub.c
F: qapi/acpi-hest.json
+F: scripts/ghes_inject.py
+F: scripts/arm_processor_error.py
+F: scripts/qmp_helper.py
ppc4xx
L: qemu-ppc@nongnu.org
diff --git a/scripts/arm_processor_error.py b/scripts/arm_processor_error.py
new file mode 100644
index 000000000000..62e0c5662232
--- /dev/null
+++ b/scripts/arm_processor_error.py
@@ -0,0 +1,377 @@
+#!/usr/bin/env python3
+#
+# pylint: disable=C0301,C0114,R0903,R0912,R0913,R0914,R0915,W0511
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2024 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
+
+# TODO: current implementation has dummy defaults.
+#
+# For a better implementation, a QMP addition/call is needed to
+# retrieve some data for ARM Processor Error injection:
+#
+# - ARM registers: power_state, mpidr.
+
+import argparse
+import re
+
+from qmp_helper import qmp, util, cper_guid
+
+class ArmProcessorEinj:
+ """
+ Implements ARM Processor Error injection via GHES
+ """
+
+ DESC = """
+ Generates an ARM processor error CPER, compatible with
+ UEFI 2.9A Errata.
+ """
+
+ ACPI_GHES_ARM_CPER_LENGTH = 40
+ ACPI_GHES_ARM_CPER_PEI_LENGTH = 32
+
+ # Context types
+ CONTEXT_AARCH32_EL1 = 1
+ CONTEXT_AARCH64_EL1 = 5
+ CONTEXT_MISC_REG = 8
+
+ def __init__(self, subparsers):
+ """Initialize the error injection class and add subparser"""
+
+ # Valid choice values
+ self.arm_valid_bits = {
+ "mpidr": util.bit(0),
+ "affinity": util.bit(1),
+ "running": util.bit(2),
+ "vendor": util.bit(3),
+ }
+
+ self.pei_flags = {
+ "first": util.bit(0),
+ "last": util.bit(1),
+ "propagated": util.bit(2),
+ "overflow": util.bit(3),
+ }
+
+ self.pei_error_types = {
+ "cache": util.bit(1),
+ "tlb": util.bit(2),
+ "bus": util.bit(3),
+ "micro-arch": util.bit(4),
+ }
+
+ self.pei_valid_bits = {
+ "multiple-error": util.bit(0),
+ "flags": util.bit(1),
+ "error-info": util.bit(2),
+ "virt-addr": util.bit(3),
+ "phy-addr": util.bit(4),
+ }
+
+ self.data = bytearray()
+
+ parser = subparsers.add_parser("arm", description=self.DESC)
+
+ arm_valid_bits = ",".join(self.arm_valid_bits.keys())
+ flags = ",".join(self.pei_flags.keys())
+ error_types = ",".join(self.pei_error_types.keys())
+ pei_valid_bits = ",".join(self.pei_valid_bits.keys())
+
+ # UEFI N.16 ARM Validation bits
+ g_arm = parser.add_argument_group("ARM processor")
+ g_arm.add_argument("--arm", "--arm-valid",
+ help=f"ARM valid bits: {arm_valid_bits}")
+ g_arm.add_argument("-a", "--affinity", "--level", "--affinity-level",
+ type=lambda x: int(x, 0),
+ help="Affinity level (when multiple levels apply)")
+ g_arm.add_argument("-l", "--mpidr", type=lambda x: int(x, 0),
+ help="Multiprocessor Affinity Register")
+ g_arm.add_argument("-i", "--midr", type=lambda x: int(x, 0),
+ help="Main ID Register")
+ g_arm.add_argument("-r", "--running",
+ action=argparse.BooleanOptionalAction,
+ default=None,
+ help="Indicates if the processor is running or not")
+ g_arm.add_argument("--psci", "--psci-state",
+ type=lambda x: int(x, 0),
+ help="Power State Coordination Interface - PSCI state")
+
+ # TODO: Add vendor-specific support
+
+ # UEFI N.17 bitmaps (type and flags)
+ g_pei = parser.add_argument_group("ARM Processor Error Info (PEI)")
+ g_pei.add_argument("-t", "--type", nargs="+",
+ help=f"one or more error types: {error_types}")
+ g_pei.add_argument("-f", "--flags", nargs="*",
+ help=f"zero or more error flags: {flags}")
+ g_pei.add_argument("-V", "--pei-valid", "--error-valid", nargs="*",
+ help=f"zero or more PEI valid bits: {pei_valid_bits}")
+
+ # UEFI N.17 Integer values
+ g_pei.add_argument("-m", "--multiple-error", nargs="+",
+ help="Number of errors: 0: Single error, 1: Multiple errors, 2-65535: Error count if known")
+ g_pei.add_argument("-e", "--error-info", nargs="+",
+ help="Error information (UEFI 2.10 tables N.18 to N.20)")
+ g_pei.add_argument("-p", "--physical-address", nargs="+",
+ help="Physical address")
+ g_pei.add_argument("-v", "--virtual-address", nargs="+",
+ help="Virtual address")
+
+ # UEFI N.21 Context
+ g_ctx = parser.add_argument_group("Processor Context")
+ g_ctx.add_argument("--ctx-type", "--context-type", nargs="*",
+ help="Type of the context (0=ARM32 GPR, 5=ARM64 EL1, other values supported)")
+ g_ctx.add_argument("--ctx-size", "--context-size", nargs="*",
+ help="Minimal size of the context")
+ g_ctx.add_argument("--ctx-array", "--context-array", nargs="*",
+ help="Comma-separated arrays for each context")
+
+ # Vendor-specific data
+ g_vendor = parser.add_argument_group("Vendor-specific data")
+ g_vendor.add_argument("--vendor", "--vendor-specific", nargs="+",
+ help="Vendor-specific byte arrays of data")
+
+ # Add arguments for Generic Error Data
+ qmp.argparse(parser)
+
+ parser.set_defaults(func=self.send_cper)
+
+ def send_cper(self, args):
+ """Parse subcommand arguments and send a CPER via QMP"""
+
+ qmp_cmd = qmp(args.host, args.port, args.debug)
+
+ # Handle Generic Error Data arguments if any
+ qmp_cmd.set_args(args)
+
+ is_cpu_type = re.compile(r"^([\w+]+\-)?arm\-cpu$")
+ cpus = qmp_cmd.search_qom("/machine/unattached/device",
+ "type", is_cpu_type)
+
+ cper = {}
+ pei = {}
+ ctx = {}
+ vendor = {}
+
+ arg = vars(args)
+
+ # Handle global parameters
+ if args.arm:
+ arm_valid_init = False
+ cper["valid"] = util.get_choice(name="valid",
+ value=args.arm,
+ choices=self.arm_valid_bits,
+ suffixes=["-error", "-err"])
+ else:
+ cper["valid"] = 0
+ arm_valid_init = True
+
+ if "running" in arg:
+ if args.running:
+ cper["running-state"] = util.bit(0)
+ else:
+ cper["running-state"] = 0
+ else:
+ cper["running-state"] = 0
+
+ if arm_valid_init:
+ if args.affinity:
+ cper["valid"] |= self.arm_valid_bits["affinity"]
+
+ if args.mpidr:
+ cper["valid"] |= self.arm_valid_bits["mpidr"]
+
+ if "running-state" in cper:
+ cper["valid"] |= self.arm_valid_bits["running"]
+
+ if args.psci:
+ cper["valid"] |= self.arm_valid_bits["running"]
+
+ # Handle PEI
+ if not args.type:
+ args.type = ["cache-error"]
+
+ util.get_mult_choices(
+ pei,
+ name="valid",
+ values=args.pei_valid,
+ choices=self.pei_valid_bits,
+ suffixes=["-valid", "--addr"],
+ )
+ util.get_mult_choices(
+ pei,
+ name="type",
+ values=args.type,
+ choices=self.pei_error_types,
+ suffixes=["-error", "-err"],
+ )
+ util.get_mult_choices(
+ pei,
+ name="flags",
+ values=args.flags,
+ choices=self.pei_flags,
+ suffixes=["-error", "-cap"],
+ )
+ util.get_mult_int(pei, "error-info", args.error_info)
+ util.get_mult_int(pei, "multiple-error", args.multiple_error)
+ util.get_mult_int(pei, "phy-addr", args.physical_address)
+ util.get_mult_int(pei, "virt-addr", args.virtual_address)
+
+ # Handle context
+ util.get_mult_int(ctx, "type", args.ctx_type, allow_zero=True)
+ util.get_mult_int(ctx, "minimal-size", args.ctx_size, allow_zero=True)
+ util.get_mult_array(ctx, "register", args.ctx_array, allow_zero=True)
+
+ util.get_mult_array(vendor, "bytes", args.vendor, max_val=255)
+
+ # Store PEI
+ pei_data = bytearray()
+ default_flags = self.pei_flags["first"]
+ default_flags |= self.pei_flags["last"]
+
+ error_info_num = 0
+
+ for i, p in pei.items(): # pylint: disable=W0612
+ error_info_num += 1
+
+ # UEFI 2.10 doesn't define how to encode error information
+ # when multiple types are raised. So, provide a default only
+ # if a single type is there
+ if "error-info" not in p:
+ if p["type"] == util.bit(1):
+ p["error-info"] = 0x0091000F
+ if p["type"] == util.bit(2):
+ p["error-info"] = 0x0054007F
+ if p["type"] == util.bit(3):
+ p["error-info"] = 0x80D6460FFF
+ if p["type"] == util.bit(4):
+ p["error-info"] = 0x78DA03FF
+
+ if "valid" not in p:
+ p["valid"] = 0
+ if "multiple-error" in p:
+ p["valid"] |= self.pei_valid_bits["multiple-error"]
+
+ if "flags" in p:
+ p["valid"] |= self.pei_valid_bits["flags"]
+
+ if "error-info" in p:
+ p["valid"] |= self.pei_valid_bits["error-info"]
+
+ if "phy-addr" in p:
+ p["valid"] |= self.pei_valid_bits["phy-addr"]
+
+ if "virt-addr" in p:
+ p["valid"] |= self.pei_valid_bits["virt-addr"]
+
+ # Version
+ util.data_add(pei_data, 0, 1)
+
+ util.data_add(pei_data,
+ self.ACPI_GHES_ARM_CPER_PEI_LENGTH, 1)
+
+ util.data_add(pei_data, p["valid"], 2)
+ util.data_add(pei_data, p["type"], 1)
+ util.data_add(pei_data, p.get("multiple-error", 1), 2)
+ util.data_add(pei_data, p.get("flags", default_flags), 1)
+ util.data_add(pei_data, p.get("error-info", 0), 8)
+ util.data_add(pei_data, p.get("virt-addr", 0xDEADBEEF), 8)
+ util.data_add(pei_data, p.get("phy-addr", 0xABBA0BAD), 8)
+
+ # Store Context
+ ctx_data = bytearray()
+ context_info_num = 0
+
+ if ctx:
+ ret = qmp_cmd.send_cmd("query-target", may_open=True)
+
+ default_ctx = self.CONTEXT_MISC_REG
+
+ if "arch" in ret:
+ if ret["arch"] == "aarch64":
+ default_ctx = self.CONTEXT_AARCH64_EL1
+ elif ret["arch"] == "arm":
+ default_ctx = self.CONTEXT_AARCH32_EL1
+
+ for k in sorted(ctx.keys()):
+ context_info_num += 1
+
+ if "type" not in ctx[k]:
+ ctx[k]["type"] = default_ctx
+
+ if "register" not in ctx[k]:
+ ctx[k]["register"] = []
+
+ reg_size = len(ctx[k]["register"])
+ size = 0
+
+ if "minimal-size" in ctx:
+ size = ctx[k]["minimal-size"]
+
+ size = max(size, reg_size)
+
+ size = (size + 1) % 0xFFFE
+
+ # Version
+ util.data_add(ctx_data, 0, 2)
+
+ util.data_add(ctx_data, ctx[k]["type"], 2)
+
+ util.data_add(ctx_data, 8 * size, 4)
+
+ for r in ctx[k]["register"]:
+ util.data_add(ctx_data, r, 8)
+
+ for i in range(reg_size, size): # pylint: disable=W0612
+ util.data_add(ctx_data, 0, 8)
+
+ # Vendor-specific bytes are not grouped
+ vendor_data = bytearray()
+ if vendor:
+ for k in sorted(vendor.keys()):
+ for b in vendor[k]["bytes"]:
+ util.data_add(vendor_data, b, 1)
+
+ # Encode ARM Processor Error
+ data = bytearray()
+
+ util.data_add(data, cper["valid"], 4)
+
+ util.data_add(data, error_info_num, 2)
+ util.data_add(data, context_info_num, 2)
+
+ # Calculate the length of the CPER data
+ cper_length = self.ACPI_GHES_ARM_CPER_LENGTH
+ cper_length += len(pei_data)
+ cper_length += len(vendor_data)
+ cper_length += len(ctx_data)
+ util.data_add(data, cper_length, 4)
+
+ util.data_add(data, arg.get("affinity-level", 0), 1)
+
+ # Reserved
+ util.data_add(data, 0, 3)
+
+ if "midr-el1" not in arg:
+ if cpus:
+ cmd_arg = {
+ 'path': cpus[0],
+ 'property': "midr"
+ }
+ ret = qmp_cmd.send_cmd("qom-get", cmd_arg, may_open=True)
+ if isinstance(ret, int):
+ arg["midr-el1"] = ret
+
+ util.data_add(data, arg.get("mpidr-el1", 0), 8)
+ util.data_add(data, arg.get("midr-el1", 0), 8)
+ util.data_add(data, cper["running-state"], 4)
+ util.data_add(data, arg.get("psci-state", 0), 4)
+
+ # Add PEI
+ data.extend(pei_data)
+ data.extend(ctx_data)
+ data.extend(vendor_data)
+
+ self.data = data
+
+ qmp_cmd.send_cper(cper_guid.CPER_PROC_ARM, self.data)
diff --git a/scripts/ghes_inject.py b/scripts/ghes_inject.py
new file mode 100755
index 000000000000..67cb6077bec8
--- /dev/null
+++ b/scripts/ghes_inject.py
@@ -0,0 +1,51 @@
+#!/usr/bin/env python3
+#
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2024 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
+
+"""
+Handle ACPI GHESv2 error injection logic QEMU QMP interface.
+"""
+
+import argparse
+import sys
+
+from arm_processor_error import ArmProcessorEinj
+
+EINJ_DESC = """
+Handle ACPI GHESv2 error injection logic QEMU QMP interface.
+
+It allows using UEFI BIOS EINJ features to generate GHES records.
+
+It helps testing CPER and GHES drivers at the guest OS and how
+userspace applications at the guest handle them.
+"""
+
+def main():
+ """Main program"""
+
+ # Main parser - handle generic args like QEMU QMP TCP socket options
+ parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+ usage="%(prog)s [options]",
+ description=EINJ_DESC)
+
+ g_options = parser.add_argument_group("QEMU QMP socket options")
+ g_options.add_argument("-H", "--host", default="localhost", type=str,
+ help="host name")
+ g_options.add_argument("-P", "--port", default=4445, type=int,
+ help="TCP port number")
+ g_options.add_argument('-d', '--debug', action='store_true')
+
+ subparsers = parser.add_subparsers()
+
+ ArmProcessorEinj(subparsers)
+
+ args = parser.parse_args()
+ if "func" in args:
+ args.func(args)
+ else:
+ sys.exit(f"Please specify a valid command for {sys.argv[0]}")
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
new file mode 100644
index 000000000000..348db9275c36
--- /dev/null
+++ b/scripts/qmp_helper.py
@@ -0,0 +1,702 @@
+#!/usr/bin/env python3
+#
+# # pylint: disable=C0103,E0213,E1135,E1136,E1137,R0902,R0903,R0912,R0913
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2024 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
+
+"""
+Helper classes to be used by ghes_inject command classes.
+"""
+
+import json
+import sys
+
+from datetime import datetime
+from os import path as os_path
+
+try:
+ qemu_dir = os_path.abspath(os_path.dirname(os_path.dirname(__file__)))
+ sys.path.append(os_path.join(qemu_dir, 'python'))
+
+ from qemu.qmp.legacy import QEMUMonitorProtocol
+
+except ModuleNotFoundError as exc:
+ print(f"Module '{exc.name}' not found.")
+ print("Try export PYTHONPATH=top-qemu-dir/python or run from top-qemu-dir")
+ sys.exit(1)
+
+from base64 import b64encode
+
+class util:
+ """
+ Ancillary functions to deal with bitmaps, parse arguments,
+ generate GUID and encode data on a bytearray buffer.
+ """
+
+ #
+ # Helper routines to handle multiple choice arguments
+ #
+ def get_choice(name, value, choices, suffixes=None, bitmask=True):
+ """Produce a list from multiple choice argument"""
+
+ new_values = 0
+
+ if not value:
+ return new_values
+
+ for val in value.split(","):
+ val = val.lower()
+
+ if suffixes:
+ for suffix in suffixes:
+ val = val.removesuffix(suffix)
+
+ if val not in choices.keys():
+ if suffixes:
+ for suffix in suffixes:
+ if val + suffix in choices.keys():
+ val += suffix
+ break
+
+ if val not in choices.keys():
+ sys.exit(f"Error on '{name}': choice '{val}' is invalid.")
+
+ val = choices[val]
+
+ if bitmask:
+ new_values |= val
+ else:
+ if new_values:
+ sys.exit(f"Error on '{name}': only one value is accepted.")
+
+ new_values = val
+
+ return new_values
+
+ def get_array(name, values, max_val=None):
+ """Add numbered hashes from integer lists into an array"""
+
+ array = []
+
+ for value in values:
+ for val in value.split(","):
+ try:
+ val = int(val, 0)
+ except ValueError:
+ sys.exit(f"Error on '{name}': {val} is not an integer")
+
+ if val < 0:
+ sys.exit(f"Error on '{name}': {val} is not unsigned")
+
+ if max_val and val > max_val:
+ sys.exit(f"Error on '{name}': {val} is too little")
+
+ array.append(val)
+
+ return array
+
+ def get_mult_array(mult, name, values, allow_zero=False, max_val=None):
+ """Add numbered hashes from integer lists"""
+
+ if not allow_zero:
+ if not values:
+ return
+ else:
+ if values is None:
+ return
+
+ if not values:
+ i = 0
+ if i not in mult:
+ mult[i] = {}
+
+ mult[i][name] = []
+ return
+
+ i = 0
+ for value in values:
+ for val in value.split(","):
+ try:
+ val = int(val, 0)
+ except ValueError:
+ sys.exit(f"Error on '{name}': {val} is not an integer")
+
+ if val < 0:
+ sys.exit(f"Error on '{name}': {val} is not unsigned")
+
+ if max_val and val > max_val:
+ sys.exit(f"Error on '{name}': {val} is too little")
+
+ if i not in mult:
+ mult[i] = {}
+
+ if name not in mult[i]:
+ mult[i][name] = []
+
+ mult[i][name].append(val)
+
+ i += 1
+
+
+ def get_mult_choices(mult, name, values, choices,
+ suffixes=None, allow_zero=False):
+ """Add numbered hashes from multiple choice arguments"""
+
+ if not allow_zero:
+ if not values:
+ return
+ else:
+ if values is None:
+ return
+
+ i = 0
+ for val in values:
+ new_values = util.get_choice(name, val, choices, suffixes)
+
+ if i not in mult:
+ mult[i] = {}
+
+ mult[i][name] = new_values
+ i += 1
+
+
+ def get_mult_int(mult, name, values, allow_zero=False):
+ """Add numbered hashes from integer arguments"""
+ if not allow_zero:
+ if not values:
+ return
+ else:
+ if values is None:
+ return
+
+ i = 0
+ for val in values:
+ try:
+ val = int(val, 0)
+ except ValueError:
+ sys.exit(f"Error on '{name}': {val} is not an integer")
+
+ if val < 0:
+ sys.exit(f"Error on '{name}': {val} is not unsigned")
+
+ if i not in mult:
+ mult[i] = {}
+
+ mult[i][name] = val
+ i += 1
+
+
+ #
+ # Data encode helper functions
+ #
+ def bit(b):
+ """Simple macro to define a bit on a bitmask"""
+ return 1 << b
+
+
+ def data_add(data, value, num_bytes):
+ """Adds bytes from value inside a bitarray"""
+
+ data.extend(value.to_bytes(num_bytes, byteorder="little")) # pylint: disable=E1101
+
+ def dump_bytearray(name, data):
+ """Does an hexdump of a byte array, grouping in bytes"""
+
+ print(f"{name} ({len(data)} bytes):")
+
+ for ln_start in range(0, len(data), 16):
+ ln_end = min(ln_start + 16, len(data))
+ print(f" {ln_start:08x} ", end="")
+ for i in range(ln_start, ln_end):
+ print(f"{data[i]:02x} ", end="")
+ for i in range(ln_end, ln_start + 16):
+ print(" ", end="")
+ print(" ", end="")
+ for i in range(ln_start, ln_end):
+ if data[i] >= 32 and data[i] < 127:
+ print(chr(data[i]), end="")
+ else:
+ print(".", end="")
+
+ print()
+ print()
+
+ def time(string):
+ """Handle BCD timestamps used on Generic Error Data Block"""
+
+ time = None
+
+ # Formats to be used when parsing time stamps
+ formats = [
+ "%Y-%m-%d %H:%M:%S",
+ ]
+
+ if string == "now":
+ time = datetime.now()
+
+ if time is None:
+ for fmt in formats:
+ try:
+ time = datetime.strptime(string, fmt)
+ break
+ except ValueError:
+ pass
+
+ if time is None:
+ raise ValueError("Invalid time format")
+
+ return time
+
+class guid:
+ """
+ Simple class to handle GUID fields.
+ """
+
+ def __init__(self, time_low, time_mid, time_high, nodes):
+ """Initialize a GUID value"""
+
+ assert len(nodes) == 8
+
+ self.time_low = time_low
+ self.time_mid = time_mid
+ self.time_high = time_high
+ self.nodes = nodes
+
+ @classmethod
+ def UUID(cls, guid_str):
+ """Initialize a GUID using a string on its standard format"""
+
+ if len(guid_str) != 36:
+ print("Size not 36")
+ raise ValueError('Invalid GUID size')
+
+ # It is easier to parse without separators. So, drop them
+ guid_str = guid_str.replace('-', '')
+
+ if len(guid_str) != 32:
+ print("Size not 32", guid_str, len(guid_str))
+ raise ValueError('Invalid GUID hex size')
+
+ time_low = 0
+ time_mid = 0
+ time_high = 0
+ nodes = []
+
+ for i in reversed(range(16, 32, 2)):
+ h = guid_str[i:i + 2]
+ value = int(h, 16)
+ nodes.insert(0, value)
+
+ time_high = int(guid_str[12:16], 16)
+ time_mid = int(guid_str[8:12], 16)
+ time_low = int(guid_str[0:8], 16)
+
+ return cls(time_low, time_mid, time_high, nodes)
+
+ def __str__(self):
+ """Output a GUID value on its default string representation"""
+
+ clock = self.nodes[0] << 8 | self.nodes[1]
+
+ node = 0
+ for i in range(2, len(self.nodes)):
+ node = node << 8 | self.nodes[i]
+
+ s = f"{self.time_low:08x}-{self.time_mid:04x}-"
+ s += f"{self.time_high:04x}-{clock:04x}-{node:012x}"
+ return s
+
+ def to_bytes(self):
+ """Output a GUID value in bytes"""
+
+ data = bytearray()
+
+ util.data_add(data, self.time_low, 4)
+ util.data_add(data, self.time_mid, 2)
+ util.data_add(data, self.time_high, 2)
+ data.extend(bytearray(self.nodes))
+
+ return data
+
+class qmp:
+ """
+ Opens a connection and send/receive QMP commands.
+ """
+
+ def send_cmd(self, command, args=None, may_open=False, return_error=True):
+ """Send a command to QMP, optinally opening a connection"""
+
+ if may_open:
+ self._connect()
+ elif not self.connected:
+ return False
+
+ msg = { 'execute': command }
+ if args:
+ msg['arguments'] = args
+
+ try:
+ obj = self.qmp_monitor.cmd_obj(msg)
+ # Can we use some other exception class here?
+ except Exception as e: # pylint: disable=W0718
+ print(f"Command: {command}")
+ print(f"Failed to inject error: {e}.")
+ return None
+
+ if "return" in obj:
+ if isinstance(obj.get("return"), dict):
+ if obj["return"]:
+ return obj["return"]
+ return "OK"
+
+ return obj["return"]
+
+ if isinstance(obj.get("error"), dict):
+ error = obj["error"]
+ if return_error:
+ print(f"Command: {msg}")
+ print(f'{error["class"]}: {error["desc"]}')
+ else:
+ print(json.dumps(obj))
+
+ return None
+
+ def _close(self):
+ """Shutdown and close the socket, if opened"""
+ if not self.connected:
+ return
+
+ self.qmp_monitor.close()
+ self.connected = False
+
+ def _connect(self):
+ """Connect to a QMP TCP/IP port, if not connected yet"""
+
+ if self.connected:
+ return True
+
+ try:
+ self.qmp_monitor.connect(negotiate=True)
+ except ConnectionError:
+ sys.exit(f"Can't connect to QMP host {self.host}:{self.port}")
+
+ self.connected = True
+
+ return True
+
+ BLOCK_STATUS_BITS = {
+ "uncorrectable": util.bit(0),
+ "correctable": util.bit(1),
+ "multi-uncorrectable": util.bit(2),
+ "multi-correctable": util.bit(3),
+ }
+
+ ERROR_SEVERITY = {
+ "recoverable": 0,
+ "fatal": 1,
+ "corrected": 2,
+ "none": 3,
+ }
+
+ VALIDATION_BITS = {
+ "fru-id": util.bit(0),
+ "fru-text": util.bit(1),
+ "timestamp": util.bit(2),
+ }
+
+ GEDB_FLAGS_BITS = {
+ "recovered": util.bit(0),
+ "prev-error": util.bit(1),
+ "simulated": util.bit(2),
+ }
+
+ GENERIC_DATA_SIZE = 72
+
+ def argparse(parser):
+ """Prepare a parser group to query generic error data"""
+
+ block_status_bits = ",".join(qmp.BLOCK_STATUS_BITS.keys())
+ error_severity_enum = ",".join(qmp.ERROR_SEVERITY.keys())
+ validation_bits = ",".join(qmp.VALIDATION_BITS.keys())
+ gedb_flags_bits = ",".join(qmp.GEDB_FLAGS_BITS.keys())
+
+ g_gen = parser.add_argument_group("Generic Error Data") # pylint: disable=E1101
+ g_gen.add_argument("--block-status",
+ help=f"block status bits: {block_status_bits}")
+ g_gen.add_argument("--raw-data", nargs="+",
+ help="Raw data inside the Error Status Block")
+ g_gen.add_argument("--error-severity", "--severity",
+ help=f"error severity: {error_severity_enum}")
+ g_gen.add_argument("--gen-err-valid-bits",
+ "--generic-error-validation-bits",
+ help=f"validation bits: {validation_bits}")
+ g_gen.add_argument("--fru-id", type=guid.UUID,
+ help="GUID representing a physical device")
+ g_gen.add_argument("--fru-text",
+ help="ASCII string identifying the FRU hardware")
+ g_gen.add_argument("--timestamp", type=util.time,
+ help="Time when the error info was collected")
+ g_gen.add_argument("--precise", "--precise-timestamp",
+ action='store_true',
+ help="Marks the timestamp as precise if --timestamp is used")
+ g_gen.add_argument("--gedb-flags",
+ help=f"General Error Data Block flags: {gedb_flags_bits}")
+
+ def set_args(self, args):
+ """Set the arguments optionally defined via self.argparse()"""
+
+ if args.block_status:
+ self.block_status = util.get_choice(name="block-status",
+ value=args.block_status,
+ choices=self.BLOCK_STATUS_BITS,
+ bitmask=False)
+ if args.raw_data:
+ self.raw_data = util.get_array("raw-data", args.raw_data,
+ max_val=255)
+ print(self.raw_data)
+
+ if args.error_severity:
+ self.error_severity = util.get_choice(name="error-severity",
+ value=args.error_severity,
+ choices=self.ERROR_SEVERITY,
+ bitmask=False)
+
+ if args.fru_id:
+ self.fru_id = args.fru_id.to_bytes()
+ if not args.gen_err_valid_bits:
+ self.validation_bits |= self.VALIDATION_BITS["fru-id"]
+
+ if args.fru_text:
+ text = bytearray(args.fru_text.encode('ascii'))
+ if len(text) > 20:
+ sys.exit("FRU text is too big to fit")
+
+ self.fru_text = text
+ if not args.gen_err_valid_bits:
+ self.validation_bits |= self.VALIDATION_BITS["fru-text"]
+
+ if args.timestamp:
+ time = args.timestamp
+ century = int(time.year / 100)
+
+ bcd = bytearray()
+ util.data_add(bcd, (time.second // 10) << 4 | (time.second % 10), 1)
+ util.data_add(bcd, (time.minute // 10) << 4 | (time.minute % 10), 1)
+ util.data_add(bcd, (time.hour // 10) << 4 | (time.hour % 10), 1)
+
+ if args.precise:
+ util.data_add(bcd, 1, 1)
+ else:
+ util.data_add(bcd, 0, 1)
+
+ util.data_add(bcd, (time.day // 10) << 4 | (time.day % 10), 1)
+ util.data_add(bcd, (time.month // 10) << 4 | (time.month % 10), 1)
+ util.data_add(bcd,
+ ((time.year % 100) // 10) << 4 | (time.year % 10), 1)
+ util.data_add(bcd, ((century % 100) // 10) << 4 | (century % 10), 1)
+
+ self.timestamp = bcd
+ if not args.gen_err_valid_bits:
+ self.validation_bits |= self.VALIDATION_BITS["timestamp"]
+
+ if args.gen_err_valid_bits:
+ self.validation_bits = util.get_choice(name="validation",
+ value=args.gen_err_valid_bits,
+ choices=self.VALIDATION_BITS)
+
+ def __init__(self, host, port, debug=False):
+ """Initialize variables used by the QMP send logic"""
+
+ self.connected = False
+ self.host = host
+ self.port = port
+ self.debug = debug
+
+ # ACPI 6.1: 18.3.2.7.1 Generic Error Data: Generic Error Status Block
+ self.block_status = self.BLOCK_STATUS_BITS["uncorrectable"]
+ self.raw_data = []
+ self.error_severity = self.ERROR_SEVERITY["recoverable"]
+
+ # ACPI 6.1: 18.3.2.7.1 Generic Error Data: Generic Error Data Entry
+ self.validation_bits = 0
+ self.flags = 0
+ self.fru_id = bytearray(16)
+ self.fru_text = bytearray(20)
+ self.timestamp = bytearray(8)
+
+ self.qmp_monitor = QEMUMonitorProtocol(address=(self.host, self.port))
+
+ #
+ # Socket QMP send command
+ #
+ def send_cper_raw(self, cper_data):
+ """Send a raw CPER data to QEMU though QMP TCP socket"""
+
+ data = b64encode(bytes(cper_data)).decode('ascii')
+
+ cmd_arg = {
+ 'cper': data
+ }
+
+ self._connect()
+
+ if self.send_cmd("ghes-cper", cmd_arg):
+ print("Error injected.")
+
+ def send_cper(self, notif_type, payload):
+ """Send commands to QEMU though QMP TCP socket"""
+
+ # Fill CPER record header
+
+ # NOTE: bits 4 to 13 of block status contain the number of
+ # data entries in the data section. This is currently unsupported.
+
+ cper_length = len(payload)
+ data_length = cper_length + len(self.raw_data) + self.GENERIC_DATA_SIZE
+
+ # Generic Error Data Entry
+ gede = bytearray()
+
+ gede.extend(notif_type.to_bytes())
+ util.data_add(gede, self.error_severity, 4)
+ util.data_add(gede, 0x300, 2)
+ util.data_add(gede, self.validation_bits, 1)
+ util.data_add(gede, self.flags, 1)
+ util.data_add(gede, cper_length, 4)
+ gede.extend(self.fru_id)
+ gede.extend(self.fru_text)
+ gede.extend(self.timestamp)
+
+ # Generic Error Status Block
+ gebs = bytearray()
+
+ if self.raw_data:
+ raw_data_offset = len(gebs)
+ else:
+ raw_data_offset = 0
+
+ util.data_add(gebs, self.block_status, 4)
+ util.data_add(gebs, raw_data_offset, 4)
+ util.data_add(gebs, len(self.raw_data), 4)
+ util.data_add(gebs, data_length, 4)
+ util.data_add(gebs, self.error_severity, 4)
+
+ cper_data = bytearray()
+ cper_data.extend(gebs)
+ cper_data.extend(gede)
+ cper_data.extend(bytearray(self.raw_data))
+ cper_data.extend(bytearray(payload))
+
+ if self.debug:
+ print(f"GUID: {notif_type}")
+
+ util.dump_bytearray("Generic Error Status Block", gebs)
+ util.dump_bytearray("Generic Error Data Entry", gede)
+
+ if self.raw_data:
+ util.dump_bytearray("Raw data", bytearray(self.raw_data))
+
+ util.dump_bytearray("Payload", payload)
+
+ self.send_cper_raw(cper_data)
+
+
+ def search_qom(self, path, prop, regex):
+ """
+ Return a list of devices that match path array like:
+
+ /machine/unattached/device
+ /machine/peripheral-anon/device
+ ...
+ """
+
+ found = []
+
+ i = 0
+ while 1:
+ dev = f"{path}[{i}]"
+ args = {
+ 'path': dev,
+ 'property': prop
+ }
+ ret = self.send_cmd("qom-get", args, may_open=True, return_error=False)
+ if not ret:
+ break
+
+ if isinstance(ret, str):
+ if regex.search(ret):
+ found.append(dev)
+
+ i += 1
+ if i > 10000:
+ print("Too many objects returned by qom-get!")
+ break
+
+ return found
+
+class cper_guid:
+ """
+ Contains CPER GUID, as per:
+ https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html
+ """
+
+ CPER_PROC_GENERIC = guid(0x9876CCAD, 0x47B4, 0x4bdb,
+ [0xB6, 0x5E, 0x16, 0xF1,
+ 0x93, 0xC4, 0xF3, 0xDB])
+
+ CPER_PROC_X86 = guid(0xDC3EA0B0, 0xA144, 0x4797,
+ [0xB9, 0x5B, 0x53, 0xFA,
+ 0x24, 0x2B, 0x6E, 0x1D])
+
+ CPER_PROC_ITANIUM = guid(0xe429faf1, 0x3cb7, 0x11d4,
+ [0xbc, 0xa7, 0x00, 0x80,
+ 0xc7, 0x3c, 0x88, 0x81])
+
+ CPER_PROC_ARM = guid(0xE19E3D16, 0xBC11, 0x11E4,
+ [0x9C, 0xAA, 0xC2, 0x05,
+ 0x1D, 0x5D, 0x46, 0xB0])
+
+ CPER_PLATFORM_MEM = guid(0xA5BC1114, 0x6F64, 0x4EDE,
+ [0xB8, 0x63, 0x3E, 0x83,
+ 0xED, 0x7C, 0x83, 0xB1])
+
+ CPER_PLATFORM_MEM2 = guid(0x61EC04FC, 0x48E6, 0xD813,
+ [0x25, 0xC9, 0x8D, 0xAA,
+ 0x44, 0x75, 0x0B, 0x12])
+
+ CPER_PCIE = guid(0xD995E954, 0xBBC1, 0x430F,
+ [0xAD, 0x91, 0xB4, 0x4D,
+ 0xCB, 0x3C, 0x6F, 0x35])
+
+ CPER_PCI_BUS = guid(0xC5753963, 0x3B84, 0x4095,
+ [0xBF, 0x78, 0xED, 0xDA,
+ 0xD3, 0xF9, 0xC9, 0xDD])
+
+ CPER_PCI_DEV = guid(0xEB5E4685, 0xCA66, 0x4769,
+ [0xB6, 0xA2, 0x26, 0x06,
+ 0x8B, 0x00, 0x13, 0x26])
+
+ CPER_FW_ERROR = guid(0x81212A96, 0x09ED, 0x4996,
+ [0x94, 0x71, 0x8D, 0x72,
+ 0x9C, 0x8E, 0x69, 0xED])
+
+ CPER_DMA_GENERIC = guid(0x5B51FEF7, 0xC79D, 0x4434,
+ [0x8F, 0x1B, 0xAA, 0x62,
+ 0xDE, 0x3E, 0x2C, 0x64])
+
+ CPER_DMA_VT = guid(0x71761D37, 0x32B2, 0x45cd,
+ [0xA7, 0xD0, 0xB0, 0xFE,
+ 0xDD, 0x93, 0xE8, 0xCF])
+
+ CPER_DMA_IOMMU = guid(0x036F84E1, 0x7F37, 0x428c,
+ [0xA7, 0x9E, 0x57, 0x5F,
+ 0xDF, 0xAA, 0x84, 0xEC])
+
+ CPER_CCIX_PER = guid(0x91335EF6, 0xEBFB, 0x4478,
+ [0xA6, 0xA6, 0x88, 0xB7,
+ 0x28, 0xCF, 0x75, 0xD7])
+
+ CPER_CXL_PROT_ERR = guid(0x80B9EFB4, 0x52B5, 0x4DE3,
+ [0xA7, 0x77, 0x68, 0x78,
+ 0x4B, 0x77, 0x10, 0x48])
--
2.46.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 10/13] target/arm: add an experimental mpidr arm cpu property object
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
` (8 preceding siblings ...)
2024-08-16 7:37 ` [PATCH v8 09/13] scripts/ghes_inject: add a script to generate GHES error inject Mauro Carvalho Chehab
@ 2024-08-16 7:37 ` Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 11/13] scripts/arm_processor_error.py: retrieve mpidr if not filled Mauro Carvalho Chehab
` (3 subsequent siblings)
13 siblings, 0 replies; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-16 7:37 UTC (permalink / raw)
Cc: Jonathan Cameron, Shiju Jose, Mauro Carvalho Chehab,
Peter Maydell, linux-kernel, qemu-arm, qemu-devel
Accurately injecting an ARM Processor error ACPI/APEI GHES
error record requires the value of the ARM Multiprocessor
Affinity Register (mpidr).
While ARM implements it, this is currently not visible.
Add a field at CPU storing it, and place it at arm_cpu_properties
as experimental, thus allowing it to be queried via QMP using
qom-get function.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
target/arm/cpu.c | 1 +
target/arm/cpu.h | 1 +
target/arm/helper.c | 10 ++++++++--
3 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 19191c239181..30fcf0a10f46 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -2619,6 +2619,7 @@ static ObjectClass *arm_cpu_class_by_name(const char *cpu_model)
static Property arm_cpu_properties[] = {
DEFINE_PROP_UINT64("midr", ARMCPU, midr, 0),
+ DEFINE_PROP_UINT64("x-mpidr", ARMCPU, mpidr, 0),
DEFINE_PROP_UINT64("mp-affinity", ARMCPU,
mp_affinity, ARM64_AFFINITY_INVALID),
DEFINE_PROP_INT32("node-id", ARMCPU, node_id, CPU_UNSET_NUMA_NODE_ID),
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index 9a3fd595621f..3ad4de793409 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1033,6 +1033,7 @@ struct ArchCPU {
uint64_t reset_pmcr_el0;
} isar;
uint64_t midr;
+ uint64_t mpidr;
uint32_t revidr;
uint32_t reset_fpsid;
uint64_t ctr;
diff --git a/target/arm/helper.c b/target/arm/helper.c
index 0a582c1cd3b3..d6e7aa069489 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -4690,7 +4690,7 @@ static uint64_t mpidr_read_val(CPUARMState *env)
return mpidr;
}
-static uint64_t mpidr_read(CPUARMState *env, const ARMCPRegInfo *ri)
+static uint64_t mpidr_read(CPUARMState *env)
{
unsigned int cur_el = arm_current_el(env);
@@ -4700,6 +4700,11 @@ static uint64_t mpidr_read(CPUARMState *env, const ARMCPRegInfo *ri)
return mpidr_read_val(env);
}
+static uint64_t mpidr_read_ri(CPUARMState *env, const ARMCPRegInfo *ri)
+{
+ return mpidr_read(env);
+}
+
static const ARMCPRegInfo lpae_cp_reginfo[] = {
/* NOP AMAIR0/1 */
{ .name = "AMAIR0", .state = ARM_CP_STATE_BOTH,
@@ -9721,7 +9726,7 @@ void register_cp_regs_for_features(ARMCPU *cpu)
{ .name = "MPIDR_EL1", .state = ARM_CP_STATE_BOTH,
.opc0 = 3, .crn = 0, .crm = 0, .opc1 = 0, .opc2 = 5,
.fgt = FGT_MPIDR_EL1,
- .access = PL1_R, .readfn = mpidr_read, .type = ARM_CP_NO_RAW },
+ .access = PL1_R, .readfn = mpidr_read_ri, .type = ARM_CP_NO_RAW },
};
#ifdef CONFIG_USER_ONLY
static const ARMCPRegUserSpaceInfo mpidr_user_cp_reginfo[] = {
@@ -9731,6 +9736,7 @@ void register_cp_regs_for_features(ARMCPU *cpu)
modify_arm_cp_regs(mpidr_cp_reginfo, mpidr_user_cp_reginfo);
#endif
define_arm_cp_regs(cpu, mpidr_cp_reginfo);
+ cpu->mpidr = mpidr_read(env);
}
if (arm_feature(env, ARM_FEATURE_AUXCR)) {
--
2.46.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 11/13] scripts/arm_processor_error.py: retrieve mpidr if not filled
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
` (9 preceding siblings ...)
2024-08-16 7:37 ` [PATCH v8 10/13] target/arm: add an experimental mpidr arm cpu property object Mauro Carvalho Chehab
@ 2024-08-16 7:37 ` Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 12/13] acpi/ghes: cleanup generic error data logic Mauro Carvalho Chehab
` (2 subsequent siblings)
13 siblings, 0 replies; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-16 7:37 UTC (permalink / raw)
Cc: Jonathan Cameron, Shiju Jose, Mauro Carvalho Chehab, Cleber Rosa,
John Snow, linux-kernel, qemu-devel
Add support to retrieve mpidr value via qom-get.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
scripts/arm_processor_error.py | 27 +++++++++++++++++++--------
1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/scripts/arm_processor_error.py b/scripts/arm_processor_error.py
index 62e0c5662232..0a16d4f0d8b1 100644
--- a/scripts/arm_processor_error.py
+++ b/scripts/arm_processor_error.py
@@ -5,12 +5,10 @@
#
# Copyright (C) 2024 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
-# TODO: current implementation has dummy defaults.
-#
-# For a better implementation, a QMP addition/call is needed to
-# retrieve some data for ARM Processor Error injection:
-#
-# - ARM registers: power_state, mpidr.
+# Note: currently it lacks a method to fill the ARM Processor Error CPER
+# psci field from emulation. On a real hardware, this is filled only
+# when a CPU is not running. Implementing support for it to simulate a
+# real hardware is not trivial.
import argparse
import re
@@ -174,11 +172,24 @@ def send_cper(self, args):
else:
cper["running-state"] = 0
+ if args.mpidr:
+ cper["mpidr-el1"] = arg["mpidr"]
+ elif cpus:
+ cmd_arg = {
+ 'path': cpus[0],
+ 'property': "x-mpidr"
+ }
+ ret = qmp_cmd.send_cmd("qom-get", cmd_arg, may_open=True)
+ if isinstance(ret, int):
+ cper["mpidr-el1"] = ret
+ else:
+ cper["mpidr-el1"] = 0
+
if arm_valid_init:
if args.affinity:
cper["valid"] |= self.arm_valid_bits["affinity"]
- if args.mpidr:
+ if "mpidr-el1" in cper:
cper["valid"] |= self.arm_valid_bits["mpidr"]
if "running-state" in cper:
@@ -362,7 +373,7 @@ def send_cper(self, args):
if isinstance(ret, int):
arg["midr-el1"] = ret
- util.data_add(data, arg.get("mpidr-el1", 0), 8)
+ util.data_add(data, cper["mpidr-el1"], 8)
util.data_add(data, arg.get("midr-el1", 0), 8)
util.data_add(data, cper["running-state"], 4)
util.data_add(data, arg.get("psci-state", 0), 4)
--
2.46.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 12/13] acpi/ghes: cleanup generic error data logic
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
` (10 preceding siblings ...)
2024-08-16 7:37 ` [PATCH v8 11/13] scripts/arm_processor_error.py: retrieve mpidr if not filled Mauro Carvalho Chehab
@ 2024-08-16 7:37 ` Mauro Carvalho Chehab
2024-08-19 12:57 ` Igor Mammedov
2024-08-16 7:37 ` [PATCH v8 13/13] acpi/ghes: check if the BIOS pointers for HEST are correct Mauro Carvalho Chehab
2024-08-19 14:21 ` [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Igor Mammedov
13 siblings, 1 reply; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-16 7:37 UTC (permalink / raw)
Cc: Jonathan Cameron, Shiju Jose, Mauro Carvalho Chehab,
Michael S. Tsirkin, Ani Sinha, Dongjiu Geng, Igor Mammedov,
linux-kernel, qemu-arm, qemu-devel
Remove comments that are obvious.
No functional changes.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
hw/acpi/ghes.c | 38 +++++++++++++++-----------------------
1 file changed, 15 insertions(+), 23 deletions(-)
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index 4f7b6c5ad2b6..a822a5eafaa0 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -130,34 +130,28 @@ static void build_ghes_hw_error_notification(GArray *table, const uint8_t type)
* ACPI 6.1: 18.3.2.7.1 Generic Error Data
*/
static void acpi_ghes_generic_error_data(GArray *table,
- const uint8_t *section_type, uint32_t error_severity,
- uint8_t validation_bits, uint8_t flags,
- uint32_t error_data_length, QemuUUID fru_id,
- uint64_t time_stamp)
+ const uint8_t *section_type,
+ uint32_t error_severity,
+ uint8_t validation_bits,
+ uint8_t flags,
+ uint32_t error_data_length,
+ QemuUUID fru_id,
+ uint64_t time_stamp)
{
const uint8_t fru_text[20] = {0};
- /* Section Type */
g_array_append_vals(table, section_type, 16);
-
- /* Error Severity */
build_append_int_noprefix(table, error_severity, 4);
+
/* Revision */
build_append_int_noprefix(table, 0x300, 2);
- /* Validation Bits */
+
build_append_int_noprefix(table, validation_bits, 1);
- /* Flags */
build_append_int_noprefix(table, flags, 1);
- /* Error Data Length */
build_append_int_noprefix(table, error_data_length, 4);
- /* FRU Id */
g_array_append_vals(table, fru_id.data, ARRAY_SIZE(fru_id.data));
-
- /* FRU Text */
g_array_append_vals(table, fru_text, sizeof(fru_text));
-
- /* Timestamp */
build_append_int_noprefix(table, time_stamp, 8);
}
@@ -165,19 +159,17 @@ static void acpi_ghes_generic_error_data(GArray *table,
* Generic Error Status Block
* ACPI 6.1: 18.3.2.7.1 Generic Error Data
*/
-static void acpi_ghes_generic_error_status(GArray *table, uint32_t block_status,
- uint32_t raw_data_offset, uint32_t raw_data_length,
- uint32_t data_length, uint32_t error_severity)
+static void acpi_ghes_generic_error_status(GArray *table,
+ uint32_t block_status,
+ uint32_t raw_data_offset,
+ uint32_t raw_data_length,
+ uint32_t data_length,
+ uint32_t error_severity)
{
- /* Block Status */
build_append_int_noprefix(table, block_status, 4);
- /* Raw Data Offset */
build_append_int_noprefix(table, raw_data_offset, 4);
- /* Raw Data Length */
build_append_int_noprefix(table, raw_data_length, 4);
- /* Data Length */
build_append_int_noprefix(table, data_length, 4);
- /* Error Severity */
build_append_int_noprefix(table, error_severity, 4);
}
--
2.46.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 13/13] acpi/ghes: check if the BIOS pointers for HEST are correct
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
` (11 preceding siblings ...)
2024-08-16 7:37 ` [PATCH v8 12/13] acpi/ghes: cleanup generic error data logic Mauro Carvalho Chehab
@ 2024-08-16 7:37 ` Mauro Carvalho Chehab
2024-08-19 14:07 ` Igor Mammedov
2024-08-19 14:21 ` [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Igor Mammedov
13 siblings, 1 reply; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-16 7:37 UTC (permalink / raw)
Cc: Jonathan Cameron, Shiju Jose, Mauro Carvalho Chehab,
Michael S. Tsirkin, Ani Sinha, Dongjiu Geng, Igor Mammedov,
linux-kernel, qemu-arm, qemu-devel
The OS kernels navigate between HEST, error source struct
and CPER by the usage of some pointers. Double-check if such
pointers were properly initializing, ensuring that they match
the right address for CPER.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
hw/acpi/ghes.c | 30 +++++++++++++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index a822a5eafaa0..51e2e40e5a9c 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -85,6 +85,9 @@ enum AcpiHestSourceId {
#define HEST_GHES_V2_TABLE_SIZE 92
#define GHES_ACK_OFFSET (64 + GAS_ADDR_OFFSET + ACPI_HEST_HEADER_SIZE)
+/* ACPI 6.2: 18.3.2.7: Generic Hardware Error Source */
+#define GHES_ERR_ST_ADDR_OFFSET (20 + GAS_ADDR_OFFSET + ACPI_HEST_HEADER_SIZE)
+
/*
* Values for error_severity field
*/
@@ -425,7 +428,10 @@ NotifierList acpi_generic_error_notifiers =
void ghes_record_cper_errors(const void *cper, size_t len,
enum AcpiGhesNotifyType notify, Error **errp)
{
- uint64_t cper_addr, read_ack_start_addr;
+ uint64_t hest_read_ack_start_addr, read_ack_start_addr;
+ uint64_t read_ack_start_addr_2, err_source_struct;
+ uint64_t hest_err_block_addr, error_block_addr;
+ uint64_t cper_addr, cper_addr_2;
enum AcpiHestSourceId source;
AcpiGedState *acpi_ged_state;
AcpiGhesState *ags;
@@ -450,6 +456,28 @@ void ghes_record_cper_errors(const void *cper, size_t len,
cper_addr += ACPI_HEST_SRC_ID_COUNT * sizeof(uint64_t);
cper_addr += source * ACPI_GHES_MAX_RAW_DATA_LENGTH;
+ err_source_struct = le64_to_cpu(ags->hest_addr_le) +
+ source * HEST_GHES_V2_TABLE_SIZE;
+
+ /* Check if BIOS addr pointers were properly generated */
+
+ hest_err_block_addr = err_source_struct + GHES_ERR_ST_ADDR_OFFSET;
+ hest_read_ack_start_addr = err_source_struct + GHES_ACK_OFFSET;
+
+ cpu_physical_memory_read(hest_err_block_addr, &error_block_addr,
+ sizeof(error_block_addr));
+
+ cpu_physical_memory_read(error_block_addr, &cper_addr_2,
+ sizeof(error_block_addr));
+
+ cpu_physical_memory_read(hest_read_ack_start_addr, &read_ack_start_addr_2,
+ sizeof(read_ack_start_addr_2));
+
+ assert(cper_addr == cper_addr_2);
+ assert(read_ack_start_addr == read_ack_start_addr_2);
+
+ /* Update ACK offset to notify about a new error */
+
cpu_physical_memory_read(read_ack_start_addr,
&read_ack, sizeof(uint64_t));
--
2.46.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v8 01/13] acpi/generic_event_device: add an APEI error device
2024-08-16 7:37 ` [PATCH v8 01/13] acpi/generic_event_device: add an APEI error device Mauro Carvalho Chehab
@ 2024-08-19 11:21 ` Igor Mammedov
0 siblings, 0 replies; 34+ messages in thread
From: Igor Mammedov @ 2024-08-19 11:21 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
linux-kernel, qemu-devel
On Fri, 16 Aug 2024 09:37:33 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Adds a generic error device to handle generic hardware error
> events as specified at ACPI 6.5 specification at 18.3.2.7.2:
> https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#event-notification-for-generic-error-sources
make it match comment in the code for consistency
(i.e reference to 5.0b spec incl chapter/name)
> using HID PNP0C33.
>
> The PNP0C33 device is used to report hardware errors to
> the guest via ACPI APEI Generic Hardware Error Source (GHES).
>
> Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Co-authored-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
> ---
> hw/acpi/aml-build.c | 10 ++++++++++
> hw/acpi/generic_event_device.c | 8 ++++++++
> include/hw/acpi/acpi_dev_interface.h | 1 +
> include/hw/acpi/aml-build.h | 2 ++
> include/hw/acpi/generic_event_device.h | 1 +
> 5 files changed, 22 insertions(+)
>
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 6d4517cfbe3d..cb167523859f 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -2520,3 +2520,13 @@ Aml *aml_i2c_serial_bus_device(uint16_t address, const char *resource_source)
>
> return var;
> }
> +
> +/* ACPI 5.0: 18.3.2.6.2 Event Notification For Generic Error Sources */
should be ACPI 5.0b
> +Aml *aml_error_device(void)
> +{
> + Aml *dev = aml_device(ACPI_APEI_ERROR_DEVICE);
> + aml_append(dev, aml_name_decl("_HID", aml_string("PNP0C33")));
> + aml_append(dev, aml_name_decl("_UID", aml_int(0)));
> +
> + return dev;
> +}
> diff --git a/hw/acpi/generic_event_device.c b/hw/acpi/generic_event_device.c
> index 15b4c3ebbf24..b4c83a089a02 100644
> --- a/hw/acpi/generic_event_device.c
> +++ b/hw/acpi/generic_event_device.c
> @@ -26,6 +26,7 @@ static const uint32_t ged_supported_events[] = {
> ACPI_GED_PWR_DOWN_EVT,
> ACPI_GED_NVDIMM_HOTPLUG_EVT,
> ACPI_GED_CPU_HOTPLUG_EVT,
> + ACPI_GED_ERROR_EVT,
> };
>
> /*
> @@ -116,6 +117,11 @@ void build_ged_aml(Aml *table, const char *name, HotplugHandler *hotplug_dev,
> aml_notify(aml_name(ACPI_POWER_BUTTON_DEVICE),
> aml_int(0x80)));
> break;
> + case ACPI_GED_ERROR_EVT:
> + aml_append(if_ctx,
> + aml_notify(aml_name(ACPI_APEI_ERROR_DEVICE),
> + aml_int(0x80)));
> + break;
> case ACPI_GED_NVDIMM_HOTPLUG_EVT:
> aml_append(if_ctx,
> aml_notify(aml_name("\\_SB.NVDR"),
> @@ -295,6 +301,8 @@ static void acpi_ged_send_event(AcpiDeviceIf *adev, AcpiEventStatusBits ev)
> sel = ACPI_GED_MEM_HOTPLUG_EVT;
> } else if (ev & ACPI_POWER_DOWN_STATUS) {
> sel = ACPI_GED_PWR_DOWN_EVT;
> + } else if (ev & ACPI_GENERIC_ERROR) {
> + sel = ACPI_GED_ERROR_EVT;
> } else if (ev & ACPI_NVDIMM_HOTPLUG_STATUS) {
> sel = ACPI_GED_NVDIMM_HOTPLUG_EVT;
> } else if (ev & ACPI_CPU_HOTPLUG_STATUS) {
> diff --git a/include/hw/acpi/acpi_dev_interface.h b/include/hw/acpi/acpi_dev_interface.h
> index 68d9d15f50aa..8294f8f0ccca 100644
> --- a/include/hw/acpi/acpi_dev_interface.h
> +++ b/include/hw/acpi/acpi_dev_interface.h
> @@ -13,6 +13,7 @@ typedef enum {
> ACPI_NVDIMM_HOTPLUG_STATUS = 16,
> ACPI_VMGENID_CHANGE_STATUS = 32,
> ACPI_POWER_DOWN_STATUS = 64,
> + ACPI_GENERIC_ERROR = 128,
> } AcpiEventStatusBits;
>
> #define TYPE_ACPI_DEVICE_IF "acpi-device-interface"
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index a3784155cb33..44d1a6af0c69 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -252,6 +252,7 @@ struct CrsRangeSet {
> /* Consumer/Producer */
> #define AML_SERIAL_BUS_FLAG_CONSUME_ONLY (1 << 1)
>
> +#define ACPI_APEI_ERROR_DEVICE "GEDD"
> /**
> * init_aml_allocator:
> *
> @@ -382,6 +383,7 @@ Aml *aml_dma(AmlDmaType typ, AmlDmaBusMaster bm, AmlTransferSize sz,
> uint8_t channel);
> Aml *aml_sleep(uint64_t msec);
> Aml *aml_i2c_serial_bus_device(uint16_t address, const char *resource_source);
> +Aml *aml_error_device(void);
>
> /* Block AML object primitives */
> Aml *aml_scope(const char *name_format, ...) G_GNUC_PRINTF(1, 2);
> diff --git a/include/hw/acpi/generic_event_device.h b/include/hw/acpi/generic_event_device.h
> index 40af3550b56d..9ace8fe70328 100644
> --- a/include/hw/acpi/generic_event_device.h
> +++ b/include/hw/acpi/generic_event_device.h
> @@ -98,6 +98,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(AcpiGedState, ACPI_GED)
> #define ACPI_GED_PWR_DOWN_EVT 0x2
> #define ACPI_GED_NVDIMM_HOTPLUG_EVT 0x4
> #define ACPI_GED_CPU_HOTPLUG_EVT 0x8
> +#define ACPI_GED_ERROR_EVT 0x10
>
> typedef struct GEDState {
> MemoryRegion evt;
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 03/13] acpi/ghes: Add support for GED error device
2024-08-16 7:37 ` [PATCH v8 03/13] acpi/ghes: Add support for GED error device Mauro Carvalho Chehab
@ 2024-08-19 11:43 ` Igor Mammedov
2024-08-23 23:28 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 34+ messages in thread
From: Igor Mammedov @ 2024-08-19 11:43 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
On Fri, 16 Aug 2024 09:37:35 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>
> As a GED error device is now defined, add another type
> of notification.
>
> Add error notification to GHES v2 using
>a GED error device GED triggered via interrupt.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is hard to parse, perhaps update so it would be
more clear what does what
>
> [mchehab: do some cleanups at ACPI_HEST_SRC_ID_* checks and
> rename HEST event to better identify GED interrupt OSPM]
>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
> ---
in addition to change log in cover letter,
I'd suggest to keep per patch change log as well (after ---),
it helps reviewer to notice intended changes.
[...]
> + case ACPI_HEST_SRC_ID_GED:
> + build_ghes_hw_error_notification(table_data, ACPI_GHES_NOTIFY_GPIO);
While GPIO works for arm, it's not the case for other machines.
I recall a suggestion to use ACPI_GHES_NOTIFY_EXTERNAL instead of GPIO one,
but that got lost somewhere...
> + break;
> default:
> error_report("Not support this error source");
> abort();
> @@ -370,6 +376,7 @@ void acpi_build_hest(GArray *table_data, BIOSLinker *linker,
> /* Error Source Count */
> build_append_int_noprefix(table_data, ACPI_GHES_ERROR_SOURCE_COUNT, 4);
> build_ghes_v2(table_data, ACPI_HEST_SRC_ID_SEA, linker);
> + build_ghes_v2(table_data, ACPI_HEST_SRC_ID_GED, linker);
>
> acpi_table_end(linker, &table);
> }
> diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
> index fb80897e7eac..419a97d5cbd9 100644
> --- a/include/hw/acpi/ghes.h
> +++ b/include/hw/acpi/ghes.h
> @@ -59,9 +59,10 @@ enum AcpiGhesNotifyType {
> ACPI_GHES_NOTIFY_RESERVED = 12
> };
>
> +/* Those are used as table indexes when building GHES tables */
> enum {
> ACPI_HEST_SRC_ID_SEA = 0,
> - /* future ids go here */
> + ACPI_HEST_SRC_ID_GED,
> ACPI_HEST_SRC_ID_RESERVED,
> };
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 04/13] qapi/acpi-hest: add an interface to do generic CPER error injection
2024-08-16 7:37 ` [PATCH v8 04/13] qapi/acpi-hest: add an interface to do generic CPER error injection Mauro Carvalho Chehab
@ 2024-08-19 11:54 ` Igor Mammedov
0 siblings, 0 replies; 34+ messages in thread
From: Igor Mammedov @ 2024-08-19 11:54 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, Eric Blake, Markus Armbruster, Michael Roth,
Paolo Bonzini, Peter Maydell, linux-kernel, qemu-arm, qemu-devel
On Fri, 16 Aug 2024 09:37:36 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Creates a QMP command to be used for generic ACPI APEI hardware error
> injection (HEST) via GHESv2.
>
> The actual GHES code will be added at the followup patch.
modulo inconsistency in comments (see below), LGTM
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
> MAINTAINERS | 7 +++++++
> hw/acpi/Kconfig | 5 +++++
> hw/acpi/ghes_cper.c | 33 +++++++++++++++++++++++++++++++++
> hw/acpi/ghes_cper_stub.c | 19 +++++++++++++++++++
> hw/acpi/meson.build | 2 ++
> hw/arm/Kconfig | 5 +++++
> include/hw/acpi/ghes.h | 3 +++
> qapi/acpi-hest.json | 36 ++++++++++++++++++++++++++++++++++++
> qapi/meson.build | 1 +
> qapi/qapi-schema.json | 1 +
> 10 files changed, 112 insertions(+)
> create mode 100644 hw/acpi/ghes_cper.c
> create mode 100644 hw/acpi/ghes_cper_stub.c
> create mode 100644 qapi/acpi-hest.json
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 3584d6a6c6da..1d8091818899 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2077,6 +2077,13 @@ F: hw/acpi/ghes.c
> F: include/hw/acpi/ghes.h
> F: docs/specs/acpi_hest_ghes.rst
>
> +ACPI/HEST/GHES/ARM processor CPER
> +R: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> +S: Maintained
> +F: hw/arm/ghes_cper.c
> +F: hw/acpi/ghes_cper_stub.c
> +F: qapi/acpi-hest.json
> +
> ppc4xx
> L: qemu-ppc@nongnu.org
> S: Orphan
> diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig
> index e07d3204eb36..73ffbb82c150 100644
> --- a/hw/acpi/Kconfig
> +++ b/hw/acpi/Kconfig
> @@ -51,6 +51,11 @@ config ACPI_APEI
> bool
> depends on ACPI
>
> +config GHES_CPER
> + bool
> + depends on ACPI_APEI
> + default y
> +
> config ACPI_PCI
> bool
> depends on ACPI && PCI
> diff --git a/hw/acpi/ghes_cper.c b/hw/acpi/ghes_cper.c
> new file mode 100644
> index 000000000000..92ca84d738de
> --- /dev/null
> +++ b/hw/acpi/ghes_cper.c
> @@ -0,0 +1,33 @@
> +/*
> + * CPER payload parser for error injection
> + *
> + * Copyright(C) 2024 Huawei LTD.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "qemu/base64.h"
> +#include "qemu/error-report.h"
> +#include "qemu/uuid.h"
> +#include "qapi/qapi-commands-acpi-hest.h"
> +#include "hw/acpi/ghes.h"
> +
> +void qmp_ghes_cper(const char *qmp_cper,
> + Error **errp)
> +{
> +
> + uint8_t *cper;
> + size_t len;
> +
> + cper = qbase64_decode(qmp_cper, -1, &len, errp);
> + if (!cper) {
> + error_setg(errp, "missing GHES CPER payload");
> + return;
> + }
> +
> + /* TODO: call a function at ghes */
> +}
> diff --git a/hw/acpi/ghes_cper_stub.c b/hw/acpi/ghes_cper_stub.c
> new file mode 100644
> index 000000000000..36138c462ac9
> --- /dev/null
> +++ b/hw/acpi/ghes_cper_stub.c
> @@ -0,0 +1,19 @@
> +/*
> + * Stub interface for CPER payload parser for error injection
> + *
> + * Copyright(C) 2024 Huawei LTD.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qapi/qapi-commands-acpi-hest.h"
> +#include "hw/acpi/ghes.h"
> +
> +void qmp_ghes_cper(const char *cper, Error **errp)
> +{
> + error_setg(errp, "GHES QMP error inject is not compiled in");
> +}
> diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
> index fa5c07db9068..6cbf430eb66d 100644
> --- a/hw/acpi/meson.build
> +++ b/hw/acpi/meson.build
> @@ -34,4 +34,6 @@ endif
> system_ss.add(when: 'CONFIG_ACPI', if_false: files('acpi-stub.c', 'aml-build-stub.c', 'ghes-stub.c', 'acpi_interface.c'))
> system_ss.add(when: 'CONFIG_ACPI_PCI_BRIDGE', if_false: files('pci-bridge-stub.c'))
> system_ss.add_all(when: 'CONFIG_ACPI', if_true: acpi_ss)
> +system_ss.add(when: 'CONFIG_GHES_CPER', if_true: files('ghes_cper.c'))
> +system_ss.add(when: 'CONFIG_GHES_CPER', if_false: files('ghes_cper_stub.c'))
> system_ss.add(files('acpi-qmp-cmds.c'))
> diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
> index 1ad60da7aa2d..bed6ba27d715 100644
> --- a/hw/arm/Kconfig
> +++ b/hw/arm/Kconfig
> @@ -712,3 +712,8 @@ config ARMSSE
> select UNIMP
> select SSE_COUNTER
> select SSE_TIMER
> +
> +config GHES_CPER
> + bool
> + depends on ARM
> + default y if AARCH64
> diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
> index 419a97d5cbd9..b977d65564ba 100644
> --- a/include/hw/acpi/ghes.h
> +++ b/include/hw/acpi/ghes.h
> @@ -23,6 +23,7 @@
> #define ACPI_GHES_H
>
> #include "hw/acpi/bios-linker-loader.h"
> +#include "qapi/error.h"
> #include "qemu/notify.h"
>
> extern NotifierList acpi_generic_error_notifiers;
> @@ -77,6 +78,8 @@ void acpi_build_hest(GArray *table_data, BIOSLinker *linker,
> void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s,
> GArray *hardware_errors);
> int acpi_ghes_record_errors(uint8_t notify, uint64_t error_physical_addr);
> +void ghes_record_cper_errors(const void *cper, size_t len,
> + enum AcpiGhesNotifyType notify, Error **errp);
>
> /**
> * acpi_ghes_present: Report whether ACPI GHES table is present
> diff --git a/qapi/acpi-hest.json b/qapi/acpi-hest.json
> new file mode 100644
> index 000000000000..91296755d285
> --- /dev/null
> +++ b/qapi/acpi-hest.json
> @@ -0,0 +1,36 @@
> +# -*- Mode: Python -*-
> +# vim: filetype=python
> +
> +##
> +# = GHESv2 CPER Error Injection
> +#
> +# Defined since ACPI Specification 6.2,
> +# section 18.3.2.8 Generic Hardware Error Source version 2. See:
earliest definition, I've found was 6.1, so this should be fixed up here
> +#
> +# https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#generic-hardware-error-source-version-2-ghesv2-type-10
make it consistent with above spec version
> +##
> +
> +
> +##
> +# @ghes-cper:
> +#
> +# Inject a CPER error data to be filled according to ACPI 6.2
ditto
> +# spec via GHESv2.
> +#
> +# @cper: contains a base64 encoded string with raw data for a single CPER
> +# record with Generic Error Status Block, Generic Error Data Entry and
> +# generic error data payload, as described at
> +# https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#format
> +#
> +# Features:
> +#
> +# @unstable: This command is experimental.
> +#
> +# Since: 9.2
> +##
> +{ 'command': 'ghes-cper',
> + 'data': {
> + 'cper': 'str'
> + },
> + 'features': [ 'unstable' ]
> +}
> diff --git a/qapi/meson.build b/qapi/meson.build
> index e7bc54e5d047..35cea6147262 100644
> --- a/qapi/meson.build
> +++ b/qapi/meson.build
> @@ -59,6 +59,7 @@ qapi_all_modules = [
> if have_system
> qapi_all_modules += [
> 'acpi',
> + 'acpi-hest',
> 'audio',
> 'cryptodev',
> 'qdev',
> diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
> index b1581988e4eb..baf19ab73afe 100644
> --- a/qapi/qapi-schema.json
> +++ b/qapi/qapi-schema.json
> @@ -75,6 +75,7 @@
> { 'include': 'misc-target.json' }
> { 'include': 'audio.json' }
> { 'include': 'acpi.json' }
> +{ 'include': 'acpi-hest.json' }
> { 'include': 'pci.json' }
> { 'include': 'stats.json' }
> { 'include': 'virtio.json' }
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 05/13] acpi/ghes: rework the logic to handle HEST source ID
2024-08-16 7:37 ` [PATCH v8 05/13] acpi/ghes: rework the logic to handle HEST source ID Mauro Carvalho Chehab
@ 2024-08-19 12:10 ` Igor Mammedov
2024-08-25 2:02 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 34+ messages in thread
From: Igor Mammedov @ 2024-08-19 12:10 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, Peter Maydell, Shannon Zhao, linux-kernel, qemu-arm,
qemu-devel
On Fri, 16 Aug 2024 09:37:37 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> The current logic is based on a lot of duct tape, with
> offsets calculated based on one define with the number of
> source IDs and an enum.
>
> Rewrite the logic in a way that it would be more resilient
> of code changes, by moving the source ID count to an enum
> and make the offset calculus more explicit.
>
> Such change was inspired on a patch from Jonathan Cameron
> splitting the logic to get the CPER address on a separate
> function, as this will be needed to support generic error
> injection.
patch does too many things, that it's hard to review.
Please split it up on smaller distinct parts, with more specific
commit messages. (see some comments below)
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
> hw/acpi/ghes-stub.c | 3 +-
> hw/acpi/ghes.c | 210 ++++++++++++++++++++++++---------------
> hw/arm/virt-acpi-build.c | 5 +-
> include/hw/acpi/ghes.h | 17 ++--
> 4 files changed, 138 insertions(+), 97 deletions(-)
>
> diff --git a/hw/acpi/ghes-stub.c b/hw/acpi/ghes-stub.c
> index c315de1802d6..8762449870b5 100644
> --- a/hw/acpi/ghes-stub.c
> +++ b/hw/acpi/ghes-stub.c
> @@ -11,7 +11,8 @@
> #include "qemu/osdep.h"
> #include "hw/acpi/ghes.h"
>
> -int acpi_ghes_record_errors(uint8_t source_id, uint64_t physical_address)
> +int acpi_ghes_record_errors(enum AcpiGhesNotifyType notify,
> + uint64_t physical_address)
> {
> return -1;
> }
> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> index df59fd35568c..7870f51e2a9e 100644
> --- a/hw/acpi/ghes.c
> +++ b/hw/acpi/ghes.c
> @@ -28,14 +28,23 @@
> #include "hw/nvram/fw_cfg.h"
> #include "qemu/uuid.h"
>
> -#define ACPI_GHES_ERRORS_FW_CFG_FILE "etc/hardware_errors"
> -#define ACPI_GHES_DATA_ADDR_FW_CFG_FILE "etc/hardware_errors_addr"
> +#define ACPI_HW_ERROR_FW_CFG_FILE "etc/hardware_errors"
> +#define ACPI_HW_ERROR_ADDR_FW_CFG_FILE "etc/hardware_errors_addr"
split out renaming part into a presiding separate patch,
so it won't mask a new code
> +#define ACPI_HEST_ADDR_FW_CFG_FILE "etc/acpi_table_hest_addr"
>
> /* The max size in bytes for one error block */
> #define ACPI_GHES_MAX_RAW_DATA_LENGTH (1 * KiB)
>
> -/* Support ARMv8 SEA notification type error source and GPIO interrupt. */
> -#define ACPI_GHES_ERROR_SOURCE_COUNT 2
> +/*
> + * ID numbers used to fill HEST source ID field
> + */
> +enum AcpiHestSourceId {
> + ACPI_HEST_SRC_ID_SEA,
> + ACPI_HEST_SRC_ID_GED,
> +
> + /* Shall be the last one */
> + ACPI_HEST_SRC_ID_COUNT
> +} AcpiHestSourceId;
>
this rename also should go into its own separate patch.
> /* Generic Hardware Error Source version 2 */
> #define ACPI_GHES_SOURCE_GENERIC_ERROR_V2 10
> @@ -63,6 +72,19 @@
> */
> #define ACPI_GHES_GESB_SIZE 20
>
> +/*
> + * Offsets with regards to the start of the HEST table stored at
> + * ags->hest_addr_le, according with the memory layout map at
> + * docs/specs/acpi_hest_ghes.rst.
> + */
> +
> +/* ACPI 4.0: 17.3.2 ACPI Error Source */
> +#define ACPI_HEST_HEADER_SIZE 40
> +
> +/* ACPI 6.2: 18.3.2.8 Generic Hardware Error Source version 2 */
> +#define HEST_GHES_V2_TABLE_SIZE 92
> +#define GHES_ACK_OFFSET (64 + GAS_ADDR_OFFSET + ACPI_HEST_HEADER_SIZE)
> +
> /*
> * Values for error_severity field
> */
> @@ -236,17 +258,17 @@ static int acpi_ghes_record_mem_error(uint64_t error_block_address,
> * Initialize "etc/hardware_errors" and "etc/hardware_errors_addr" fw_cfg blobs.
> * See docs/specs/acpi_hest_ghes.rst for blobs format.
> */
> -void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
> +static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
> {
> int i, error_status_block_offset;
>
> /* Build error_block_address */
> - for (i = 0; i < ACPI_GHES_ERROR_SOURCE_COUNT; i++) {
> + for (i = 0; i < ACPI_HEST_SRC_ID_COUNT; i++) {
> build_append_int_noprefix(hardware_errors, 0, sizeof(uint64_t));
> }
>
> /* Build read_ack_register */
> - for (i = 0; i < ACPI_GHES_ERROR_SOURCE_COUNT; i++) {
> + for (i = 0; i < ACPI_HEST_SRC_ID_COUNT; i++) {
> /*
> * Initialize the value of read_ack_register to 1, so GHES can be
> * writable after (re)boot.
> @@ -261,20 +283,20 @@ void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
>
> /* Reserve space for Error Status Data Block */
> acpi_data_push(hardware_errors,
> - ACPI_GHES_MAX_RAW_DATA_LENGTH * ACPI_GHES_ERROR_SOURCE_COUNT);
> + ACPI_GHES_MAX_RAW_DATA_LENGTH * ACPI_HEST_SRC_ID_COUNT);
>
> /* Tell guest firmware to place hardware_errors blob into RAM */
> - bios_linker_loader_alloc(linker, ACPI_GHES_ERRORS_FW_CFG_FILE,
> + bios_linker_loader_alloc(linker, ACPI_HW_ERROR_FW_CFG_FILE,
> hardware_errors, sizeof(uint64_t), false);
>
> - for (i = 0; i < ACPI_GHES_ERROR_SOURCE_COUNT; i++) {
> + for (i = 0; i < ACPI_HEST_SRC_ID_COUNT; i++) {
> /*
> * Tell firmware to patch error_block_address entries to point to
> * corresponding "Generic Error Status Block"
> */
> bios_linker_loader_add_pointer(linker,
> - ACPI_GHES_ERRORS_FW_CFG_FILE, sizeof(uint64_t) * i,
> - sizeof(uint64_t), ACPI_GHES_ERRORS_FW_CFG_FILE,
> + ACPI_HW_ERROR_FW_CFG_FILE, sizeof(uint64_t) * i,
> + sizeof(uint64_t), ACPI_HW_ERROR_FW_CFG_FILE,
> error_status_block_offset + i * ACPI_GHES_MAX_RAW_DATA_LENGTH);
> }
>
> @@ -282,16 +304,39 @@ void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
> * tell firmware to write hardware_errors GPA into
> * hardware_errors_addr fw_cfg, once the former has been initialized.
> */
> - bios_linker_loader_write_pointer(linker, ACPI_GHES_DATA_ADDR_FW_CFG_FILE,
> - 0, sizeof(uint64_t), ACPI_GHES_ERRORS_FW_CFG_FILE, 0);
> + bios_linker_loader_write_pointer(linker, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, 0,
> + sizeof(uint64_t),
> + ACPI_HW_ERROR_FW_CFG_FILE, 0);
> +}
> +
> +static bool ghes_notify_to_source_id(enum AcpiGhesNotifyType notify,
> + enum AcpiHestSourceId *source_id)
> +{
> + switch (notify) {
> + case ACPI_GHES_NOTIFY_SEA: /* ARMv8 */
> + *source_id = ACPI_HEST_SRC_ID_SEA;
> + return false;
> + case ACPI_GHES_NOTIFY_GPIO:
> + *source_id = ACPI_HEST_SRC_ID_GED;
> + return false;
> + default:
> + /* Unsupported notification types */
> + return true;
> + }
> }
>
> /* Build Generic Hardware Error Source version 2 (GHESv2) */
> -static void build_ghes_v2(GArray *table_data, int source_id, BIOSLinker *linker)
> +static void build_ghes_v2(GArray *table_data,
> + enum AcpiGhesNotifyType notify,
> + BIOSLinker *linker)
I'd suggest to drop this change as AcpiGhesNotifyType is not unique ID,
in fact one can easily have several ID with the same type.
> {
> uint64_t address_offset;
> + enum AcpiHestSourceId source_id;
>
> - assert(source_id < ACPI_HEST_SRC_ID_RESERVED);
> + if (ghes_notify_to_source_id(notify, &source_id)) {
> + error_report("Error: notify %d not supported", notify);
> + abort();
> + }
>
> /*
> * Type:
> @@ -319,24 +364,13 @@ static void build_ghes_v2(GArray *table_data, int source_id, BIOSLinker *linker)
> build_append_gas(table_data, AML_AS_SYSTEM_MEMORY, 0x40, 0,
> 4 /* QWord access */, 0);
> bios_linker_loader_add_pointer(linker, ACPI_BUILD_TABLE_FILE,
> - address_offset + GAS_ADDR_OFFSET, sizeof(uint64_t),
> - ACPI_GHES_ERRORS_FW_CFG_FILE, source_id * sizeof(uint64_t));
> + address_offset + GAS_ADDR_OFFSET,
> + sizeof(uint64_t),
> + ACPI_HW_ERROR_FW_CFG_FILE,
> + source_id * sizeof(uint64_t));
>
> - switch (source_id) {
> - case ACPI_HEST_SRC_ID_SEA:
> - /*
> - * Notification Structure
> - * Now only enable ARMv8 SEA notification type
> - */
> - build_ghes_hw_error_notification(table_data, ACPI_GHES_NOTIFY_SEA);
> - break;
> - case ACPI_HEST_SRC_ID_GED:
> - build_ghes_hw_error_notification(table_data, ACPI_GHES_NOTIFY_GPIO);
> - break;
> - default:
> - error_report("Not support this error source");
> - abort();
> - }
> + /* Notification Structure */
> + build_ghes_hw_error_notification(table_data, notify);
>
> /* Error Status Block Length */
> build_append_int_noprefix(table_data, ACPI_GHES_MAX_RAW_DATA_LENGTH, 4);
> @@ -350,9 +384,11 @@ static void build_ghes_v2(GArray *table_data, int source_id, BIOSLinker *linker)
> build_append_gas(table_data, AML_AS_SYSTEM_MEMORY, 0x40, 0,
> 4 /* QWord access */, 0);
> bios_linker_loader_add_pointer(linker, ACPI_BUILD_TABLE_FILE,
> - address_offset + GAS_ADDR_OFFSET,
> - sizeof(uint64_t), ACPI_GHES_ERRORS_FW_CFG_FILE,
> - (ACPI_GHES_ERROR_SOURCE_COUNT + source_id) * sizeof(uint64_t));
> + address_offset + GAS_ADDR_OFFSET,
> + sizeof(uint64_t),
> + ACPI_HW_ERROR_FW_CFG_FILE,
> + (ACPI_HEST_SRC_ID_COUNT + source_id) *
> + sizeof(uint64_t));
>
> /*
> * Read Ack Preserve field
> @@ -365,90 +401,100 @@ static void build_ghes_v2(GArray *table_data, int source_id, BIOSLinker *linker)
> }
>
> /* Build Hardware Error Source Table */
> -void acpi_build_hest(GArray *table_data, BIOSLinker *linker,
> +void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
> + BIOSLinker *linker,
> const char *oem_id, const char *oem_table_id)
> {
> AcpiTable table = { .sig = "HEST", .rev = 1,
> .oem_id = oem_id, .oem_table_id = oem_table_id };
>
> + build_ghes_error_table(hardware_errors, linker);
> +
> + int hest_offset = table_data->len;
> +
> acpi_table_begin(&table, table_data);
>
> /* Error Source Count */
> - build_append_int_noprefix(table_data, ACPI_GHES_ERROR_SOURCE_COUNT, 4);
> - build_ghes_v2(table_data, ACPI_HEST_SRC_ID_SEA, linker);
> - build_ghes_v2(table_data, ACPI_HEST_SRC_ID_GED, linker);
> + build_append_int_noprefix(table_data, ACPI_HEST_SRC_ID_COUNT, 4);
> + build_ghes_v2(table_data, ACPI_GHES_NOTIFY_SEA, linker);
> + build_ghes_v2(table_data, ACPI_GHES_NOTIFY_GPIO, linker);
>
> acpi_table_end(linker, &table);
> +
> + /*
> + * tell firmware to write into GPA the address of HEST via fw_cfg,
> + * once initialized.
> + */
> + bios_linker_loader_write_pointer(linker,
> + ACPI_HEST_ADDR_FW_CFG_FILE, 0,
> + sizeof(uint64_t),
> + ACPI_BUILD_TABLE_FILE, hest_offset);
> }
>
> void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
> GArray *hardware_error)
> {
> /* Create a read-only fw_cfg file for GHES */
> - fw_cfg_add_file(s, ACPI_GHES_ERRORS_FW_CFG_FILE, hardware_error->data,
> + fw_cfg_add_file(s, ACPI_HW_ERROR_FW_CFG_FILE, hardware_error->data,
> hardware_error->len);
>
> /* Create a read-write fw_cfg file for Address */
> - fw_cfg_add_file_callback(s, ACPI_GHES_DATA_ADDR_FW_CFG_FILE, NULL, NULL,
> + fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, NULL, NULL,
> NULL, &(ags->ghes_addr_le), sizeof(ags->ghes_addr_le), false);
>
> + fw_cfg_add_file_callback(s, ACPI_HEST_ADDR_FW_CFG_FILE, NULL, NULL,
> + NULL, &(ags->hest_addr_le), sizeof(ags->hest_addr_le), false);
> +
> ags->present = true;
> }
>
> -int acpi_ghes_record_errors(uint8_t source_id, uint64_t physical_address)
> +int acpi_ghes_record_errors(enum AcpiGhesNotifyType notify,
> + uint64_t physical_address)
> {
> - uint64_t error_block_addr, read_ack_register_addr, read_ack_register = 0;
> - uint64_t start_addr;
> - bool ret = -1;
> + uint64_t cper_addr, read_ack_register = 0;
> + uint64_t read_ack_start_addr;
> + enum AcpiHestSourceId source;
> AcpiGedState *acpi_ged_state;
> AcpiGhesState *ags;
>
> - assert(source_id < ACPI_HEST_SRC_ID_RESERVED);
> + if (ghes_notify_to_source_id(ACPI_HEST_SRC_ID_SEA, &source)) {
> + error_report("GHES: Invalid error block/ack address(es) for notify %d",
> + notify);
> + return -1;
> + }
>
> acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
> NULL));
> g_assert(acpi_ged_state);
> ags = &acpi_ged_state->ghes_state;
>
> - start_addr = le64_to_cpu(ags->ghes_addr_le);
> + cper_addr = le64_to_cpu(ags->ghes_addr_le);
> + cper_addr += ACPI_HEST_SRC_ID_COUNT * sizeof(uint64_t);
> + read_ack_start_addr = cper_addr + source * sizeof(uint64_t);
>
> - if (physical_address) {
> + cper_addr += ACPI_HEST_SRC_ID_COUNT * sizeof(uint64_t);
> + cper_addr += source * ACPI_GHES_MAX_RAW_DATA_LENGTH;
>
> - if (source_id < ACPI_HEST_SRC_ID_RESERVED) {
> - start_addr += source_id * sizeof(uint64_t);
> - }
> -
> - cpu_physical_memory_read(start_addr, &error_block_addr,
> - sizeof(error_block_addr));
> -
> - error_block_addr = le64_to_cpu(error_block_addr);
> -
> - read_ack_register_addr = start_addr +
> - ACPI_GHES_ERROR_SOURCE_COUNT * sizeof(uint64_t);
> -
> - cpu_physical_memory_read(read_ack_register_addr,
> - &read_ack_register, sizeof(read_ack_register));
> -
> - /* zero means OSPM does not acknowledge the error */
> - if (!read_ack_register) {
> - error_report("OSPM does not acknowledge previous error,"
> - " so can not record CPER for current error anymore");
> - } else if (error_block_addr) {
> - read_ack_register = cpu_to_le64(0);
> - /*
> - * Clear the Read Ack Register, OSPM will write it to 1 when
> - * it acknowledges this error.
> - */
> - cpu_physical_memory_write(read_ack_register_addr,
> - &read_ack_register, sizeof(uint64_t));
> -
> - ret = acpi_ghes_record_mem_error(error_block_addr,
> - physical_address);
> - } else
> - error_report("can not find Generic Error Status Block");
> + if (!physical_address) {
> + error_report("can not find Generic Error Status Block for notify %d",
> + notify);
> + return -1;
> }
>
> - return ret;
> + cpu_physical_memory_read(read_ack_start_addr,
> + &read_ack_register, sizeof(read_ack_register));
> +
> + /* zero means OSPM does not acknowledge the error */
> +
> + read_ack_register = cpu_to_le64(0);
> + /*
> + * Clear the Read Ack Register, OSPM will write it to 1 when
> + * it acknowledges this error.
> + */
> + cpu_physical_memory_write(read_ack_start_addr,
> + &read_ack_register, sizeof(uint64_t));
> +
> + return acpi_ghes_record_mem_error(cper_addr, physical_address);
> }
>
> NotifierList acpi_generic_error_notifiers =
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 1769467d23b2..79635bc7a0a8 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -944,10 +944,9 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
> build_dbg2(tables_blob, tables->linker, vms);
>
> if (vms->ras) {
> - build_ghes_error_table(tables->hardware_errors, tables->linker);
> acpi_add_table(table_offsets, tables_blob);
> - acpi_build_hest(tables_blob, tables->linker, vms->oem_id,
> - vms->oem_table_id);
> + acpi_build_hest(tables_blob, tables->hardware_errors, tables->linker,
> + vms->oem_id, vms->oem_table_id);
> }
>
> if (ms->numa_state->num_nodes > 0) {
> diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
> index b977d65564ba..6e349264fd8b 100644
> --- a/include/hw/acpi/ghes.h
> +++ b/include/hw/acpi/ghes.h
> @@ -29,7 +29,7 @@
> extern NotifierList acpi_generic_error_notifiers;
>
> /*
> - * Values for Hardware Error Notification Type field
> + * ACPI spec values for Hardware Error Notification Type field
> */
> enum AcpiGhesNotifyType {
> /* Polled */
> @@ -60,24 +60,19 @@ enum AcpiGhesNotifyType {
> ACPI_GHES_NOTIFY_RESERVED = 12
> };
>
> -/* Those are used as table indexes when building GHES tables */
> -enum {
> - ACPI_HEST_SRC_ID_SEA = 0,
> - ACPI_HEST_SRC_ID_GED,
> - ACPI_HEST_SRC_ID_RESERVED,
> -};
> -
> typedef struct AcpiGhesState {
> + uint64_t hest_addr_le;
> uint64_t ghes_addr_le;
> bool present; /* True if GHES is present at all on this board */
> } AcpiGhesState;
>
> -void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker);
> -void acpi_build_hest(GArray *table_data, BIOSLinker *linker,
> +void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
> + BIOSLinker *linker,
> const char *oem_id, const char *oem_table_id);
> void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s,
> GArray *hardware_errors);
> -int acpi_ghes_record_errors(uint8_t notify, uint64_t error_physical_addr);
> +int acpi_ghes_record_errors(enum AcpiGhesNotifyType notify,
> + uint64_t error_physical_addr);
> void ghes_record_cper_errors(const void *cper, size_t len,
> enum AcpiGhesNotifyType notify, Error **errp);
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/13] acpi/ghes: add support for generic error injection via QAPI
2024-08-16 7:37 ` [PATCH v8 06/13] acpi/ghes: add support for generic error injection via QAPI Mauro Carvalho Chehab
@ 2024-08-19 12:51 ` Igor Mammedov
2024-08-25 3:29 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 34+ messages in thread
From: Igor Mammedov @ 2024-08-19 12:51 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
On Fri, 16 Aug 2024 09:37:38 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Provide a generic interface for error injection via GHESv2.
>
> This patch is co-authored:
> - original ghes logic to inject a simple ARM record by Shiju Jose;
> - generic logic to handle block addresses by Jonathan Cameron;
> - generic GHESv2 error inject by Mauro Carvalho Chehab;
>
> Co-authored-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Co-authored-by: Shiju Jose <shiju.jose@huawei.com>
> Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
> hw/acpi/ghes.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
> hw/acpi/ghes_cper.c | 2 +-
> 2 files changed, 58 insertions(+), 1 deletion(-)
>
> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> index 7870f51e2a9e..a3ae710dcf81 100644
> --- a/hw/acpi/ghes.c
> +++ b/hw/acpi/ghes.c
> @@ -500,6 +500,63 @@ int acpi_ghes_record_errors(enum AcpiGhesNotifyType notify,
> NotifierList acpi_generic_error_notifiers =
> NOTIFIER_LIST_INITIALIZER(error_device_notifiers);
>
> +void ghes_record_cper_errors(uint8_t *cper, size_t len,
> + enum AcpiGhesNotifyType notify, Error **errp)
> +{
> + uint64_t cper_addr, read_ack_start_addr;
> + enum AcpiHestSourceId source;
> + AcpiGedState *acpi_ged_state;
> + AcpiGhesState *ags;
> + uint64_t read_ack;
> +
> + if (ghes_notify_to_source_id(notify, &source)) {
> + error_setg(errp,
> + "GHES: Invalid error block/ack address(es) for notify %d",
> + notify);
> + return;
> + }
> +
> + acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
> + NULL));
> + g_assert(acpi_ged_state);
> + ags = &acpi_ged_state->ghes_state;
> +
> + cper_addr = le64_to_cpu(ags->ghes_addr_le);
^^^ suggest to rename to error_block_address
that way reader can easily match it with spec.
> + cper_addr += ACPI_HEST_SRC_ID_COUNT * sizeof(uint64_t);
and it would be better to merge this with previous line to be more clear
+ to avoid shifting meaning of variable between lines.
> + read_ack_start_addr = cper_addr + source * sizeof(uint64_t);
> + cper_addr += ACPI_HEST_SRC_ID_COUNT * sizeof(uint64_t);
> + cper_addr += source * ACPI_GHES_MAX_RAW_DATA_LENGTH;
I'd avoid changing meaning of variable, it adds up to confusion.
Anyway, what the point of of above math?
> +
> + cpu_physical_memory_read(read_ack_start_addr,
> + &read_ack, sizeof(uint64_t));
s/sizeof(uint64_t)/sizeof(read_ack)/
ditto elsewhere
> +
> + /* zero means OSPM does not acknowledge the error */
> + if (!read_ack) {
> + error_setg(errp,
> + "Last CPER record was not acknowledged yet");
> + read_ack = 1;
> + cpu_physical_memory_write(read_ack_start_addr,
> + &read_ack, (uint64_t));
we don't do this for SEV so, why are you setting it to 1 here?
> + return;
> + }
> +
> + read_ack = cpu_to_le64(0);
> + cpu_physical_memory_write(read_ack_start_addr,
> + &read_ack, sizeof(uint64_t));
> +
> + /* Build CPER record */
> +
> + if (len > ACPI_GHES_MAX_RAW_DATA_LENGTH) {
> + error_setg(errp, "GHES CPER record is too big: %ld", len);
> + }
move check at start of function?
> +
> + /* Write the generic error data entry into guest memory */
> + cpu_physical_memory_write(cper_addr, cper, len);
> +
> + notifier_list_notify(&acpi_generic_error_notifiers, NULL);
> +}
> +
> bool acpi_ghes_present(void)
> {
> AcpiGedState *acpi_ged_state;
> diff --git a/hw/acpi/ghes_cper.c b/hw/acpi/ghes_cper.c
> index 92ca84d738de..2328dbff7012 100644
> --- a/hw/acpi/ghes_cper.c
> +++ b/hw/acpi/ghes_cper.c
> @@ -29,5 +29,5 @@ void qmp_ghes_cper(const char *qmp_cper,
> return;
> }
>
> - /* TODO: call a function at ghes */
> + ghes_record_cper_errors(cper, len, ACPI_GHES_NOTIFY_GPIO, errp);
> }
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 12/13] acpi/ghes: cleanup generic error data logic
2024-08-16 7:37 ` [PATCH v8 12/13] acpi/ghes: cleanup generic error data logic Mauro Carvalho Chehab
@ 2024-08-19 12:57 ` Igor Mammedov
0 siblings, 0 replies; 34+ messages in thread
From: Igor Mammedov @ 2024-08-19 12:57 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
On Fri, 16 Aug 2024 09:37:44 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Remove comments that are obvious.
>
> No functional changes.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
these comments help if you don't have spec side by side with code
to compare. I'd even say such comments are preferable than no comments
when composing an ACPI table.
pls, drop the patch
> ---
> hw/acpi/ghes.c | 38 +++++++++++++++-----------------------
> 1 file changed, 15 insertions(+), 23 deletions(-)
>
> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> index 4f7b6c5ad2b6..a822a5eafaa0 100644
> --- a/hw/acpi/ghes.c
> +++ b/hw/acpi/ghes.c
> @@ -130,34 +130,28 @@ static void build_ghes_hw_error_notification(GArray *table, const uint8_t type)
> * ACPI 6.1: 18.3.2.7.1 Generic Error Data
> */
> static void acpi_ghes_generic_error_data(GArray *table,
> - const uint8_t *section_type, uint32_t error_severity,
> - uint8_t validation_bits, uint8_t flags,
> - uint32_t error_data_length, QemuUUID fru_id,
> - uint64_t time_stamp)
> + const uint8_t *section_type,
> + uint32_t error_severity,
> + uint8_t validation_bits,
> + uint8_t flags,
> + uint32_t error_data_length,
> + QemuUUID fru_id,
> + uint64_t time_stamp)
> {
> const uint8_t fru_text[20] = {0};
>
> - /* Section Type */
> g_array_append_vals(table, section_type, 16);
> -
> - /* Error Severity */
> build_append_int_noprefix(table, error_severity, 4);
> +
> /* Revision */
> build_append_int_noprefix(table, 0x300, 2);
> - /* Validation Bits */
> +
> build_append_int_noprefix(table, validation_bits, 1);
> - /* Flags */
> build_append_int_noprefix(table, flags, 1);
> - /* Error Data Length */
> build_append_int_noprefix(table, error_data_length, 4);
>
> - /* FRU Id */
> g_array_append_vals(table, fru_id.data, ARRAY_SIZE(fru_id.data));
> -
> - /* FRU Text */
> g_array_append_vals(table, fru_text, sizeof(fru_text));
> -
> - /* Timestamp */
> build_append_int_noprefix(table, time_stamp, 8);
> }
>
> @@ -165,19 +159,17 @@ static void acpi_ghes_generic_error_data(GArray *table,
> * Generic Error Status Block
> * ACPI 6.1: 18.3.2.7.1 Generic Error Data
> */
> -static void acpi_ghes_generic_error_status(GArray *table, uint32_t block_status,
> - uint32_t raw_data_offset, uint32_t raw_data_length,
> - uint32_t data_length, uint32_t error_severity)
> +static void acpi_ghes_generic_error_status(GArray *table,
> + uint32_t block_status,
> + uint32_t raw_data_offset,
> + uint32_t raw_data_length,
> + uint32_t data_length,
> + uint32_t error_severity)
> {
> - /* Block Status */
> build_append_int_noprefix(table, block_status, 4);
> - /* Raw Data Offset */
> build_append_int_noprefix(table, raw_data_offset, 4);
> - /* Raw Data Length */
> build_append_int_noprefix(table, raw_data_length, 4);
> - /* Data Length */
> build_append_int_noprefix(table, data_length, 4);
> - /* Error Severity */
> build_append_int_noprefix(table, error_severity, 4);
> }
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 13/13] acpi/ghes: check if the BIOS pointers for HEST are correct
2024-08-16 7:37 ` [PATCH v8 13/13] acpi/ghes: check if the BIOS pointers for HEST are correct Mauro Carvalho Chehab
@ 2024-08-19 14:07 ` Igor Mammedov
2024-08-24 0:15 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 34+ messages in thread
From: Igor Mammedov @ 2024-08-19 14:07 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
On Fri, 16 Aug 2024 09:37:45 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> The OS kernels navigate between HEST, error source struct
> and CPER by the usage of some pointers. Double-check if such
> pointers were properly initializing, ensuring that they match
> the right address for CPER.
as QEMU, we don't care about what guest wrote into those addresses
(aka it's not hw businesses), even if later qemu will trample
on wrong guest memory (it's guest responsibility to do init right).
However this patch introduces usage for hest_addr_le, that I was looking for.
See notes below.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
> hw/acpi/ghes.c | 30 +++++++++++++++++++++++++++++-
> 1 file changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> index a822a5eafaa0..51e2e40e5a9c 100644
> --- a/hw/acpi/ghes.c
> +++ b/hw/acpi/ghes.c
> @@ -85,6 +85,9 @@ enum AcpiHestSourceId {
> #define HEST_GHES_V2_TABLE_SIZE 92
> #define GHES_ACK_OFFSET (64 + GAS_ADDR_OFFSET + ACPI_HEST_HEADER_SIZE)
>
> +/* ACPI 6.2: 18.3.2.7: Generic Hardware Error Source */
> +#define GHES_ERR_ST_ADDR_OFFSET (20 + GAS_ADDR_OFFSET + ACPI_HEST_HEADER_SIZE)
> +
> /*
> * Values for error_severity field
> */
> @@ -425,7 +428,10 @@ NotifierList acpi_generic_error_notifiers =
> void ghes_record_cper_errors(const void *cper, size_t len,
> enum AcpiGhesNotifyType notify, Error **errp)
> {
> - uint64_t cper_addr, read_ack_start_addr;
> + uint64_t hest_read_ack_start_addr, read_ack_start_addr;
> + uint64_t read_ack_start_addr_2, err_source_struct;
> + uint64_t hest_err_block_addr, error_block_addr;
> + uint64_t cper_addr, cper_addr_2;
> enum AcpiHestSourceId source;
> AcpiGedState *acpi_ged_state;
> AcpiGhesState *ags;
> @@ -450,6 +456,28 @@ void ghes_record_cper_errors(const void *cper, size_t len,
> cper_addr += ACPI_HEST_SRC_ID_COUNT * sizeof(uint64_t);
> cper_addr += source * ACPI_GHES_MAX_RAW_DATA_LENGTH;
>
> + err_source_struct = le64_to_cpu(ags->hest_addr_le) +
> + source * HEST_GHES_V2_TABLE_SIZE;
there is no guaranties that HEST table will contain only GHESv2 sources,
and once such is added this place becomes broken.
we need to iterate over HEST taking that into account
and find only ghesv2 structure with source id of interest.
This function (and acpi_ghes_record_errors() as well) taking source_id
as input should be able to lookup pointers from HEST in guest RAM,
very crude idea could look something like this:
typedef struct hest_source_type2len{
uint16_t type
int len
} hest_structure_type2len
hest_structure_type2len supported_hest_sources[] = {
/* Table 18-344 Generic Hardware Error Source version 2 (GHESv2) Structure */
{.type = 10, .len = 92},
}
uint64_t find_error_source(src_id) {
uint32_t struct_offset = hest_header_size;
uint16_t type, id
do {
addr = ags->hest_addr_le + struct_offset
cpu_physical_memory_read(addr, &id)
if (src_id == id)
return addr
cpu_physical_memory_read(addr, &type)
struct_offset ++= get_len_from_supported_hest_sources(type)
while(struct_offset < hest_len)
assert if not found
}
unit64_t get_error_status_block_addr(src_id) {
struct_addr = find_error_source(src_id)
hest_err_block_addr = struct_addr + GHES_ERR_ST_ADDR_OFFSET
// read intermediate pointer to status block addr pointer in hw table
cpu_physical_memory_read(hest_err_block_addr, &error_block_addr)
// read actual pointer to status block
cpu_physical_memory_read(error_block_addr, &error_status_block_addr)
return error_status_block_addr
}
ditto for read_ack modulo indirection that we have for error_status_block_addr
This way we can easily map source id to error status block
and find needed addresses using pointer info from guest RAM
without fragile pointer math and assumptions which might go wrong
when new error sources are added and regardless of the order they
are being added.
> + /* Check if BIOS addr pointers were properly generated */
> +
> + hest_err_block_addr = err_source_struct + GHES_ERR_ST_ADDR_OFFSET;
> + hest_read_ack_start_addr = err_source_struct + GHES_ACK_OFFSET;
> +
> + cpu_physical_memory_read(hest_err_block_addr, &error_block_addr,
> + sizeof(error_block_addr));
> +
> + cpu_physical_memory_read(error_block_addr, &cper_addr_2,
> + sizeof(error_block_addr));
> +
> + cpu_physical_memory_read(hest_read_ack_start_addr, &read_ack_start_addr_2,
> + sizeof(read_ack_start_addr_2));
> +
> + assert(cper_addr == cper_addr_2);
> + assert(read_ack_start_addr == read_ack_start_addr_2);
> +
> + /* Update ACK offset to notify about a new error */
> +
> cpu_physical_memory_read(read_ack_start_addr,
> &read_ack, sizeof(uint64_t));
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
` (12 preceding siblings ...)
2024-08-16 7:37 ` [PATCH v8 13/13] acpi/ghes: check if the BIOS pointers for HEST are correct Mauro Carvalho Chehab
@ 2024-08-19 14:21 ` Igor Mammedov
13 siblings, 0 replies; 34+ messages in thread
From: Igor Mammedov @ 2024-08-19 14:21 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Jonathan Cameron, Shiju Jose, Ani Sinha, Dongjiu Geng,
Paolo Bonzini, Peter Maydell, Shannon Zhao, kvm, qemu-arm,
qemu-devel
On Fri, 16 Aug 2024 09:37:32 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> This series add support for injecting generic CPER records. Such records
> are generated outside QEMU via a provided script.
>
> On this version, I added two optional patches at the end:
> - acpi/ghes: cleanup generic error data logic
>
> It drops some obvious comments from some already-existing code.
> As we're already doing lots of changes at the code, it sounded
> reasonable to me to have such cleanup here;
>
> - acpi/ghes: check if the BIOS pointers for HEST are correct
>
> QEMU has two ways to navigate to a CPER start data: via its
> memory address or indirectly following 2 BIOS pointers.
> OS only have the latter one. This patch validates if the BIOS
> links used by the OS were properly produced, comparing to the
> actual location of the CPER record.
I went over the series,
once suggestion in 13/13 implemented
we can get rid of pointer math that is reshuffled several times
in patches here.
I'd suggest to structure series as following:
1: patch that adds hest_addr_le
2: refactoring current code to use address lookup vs pointer math
3. renaming patches
4. patch adding new error source
5. QAPI patch
6. python script for error injection
with that in place we probably would need to
* iron out minor migration compat issues
(I didn't look for them during this review round as much
would change yet)
* make sure that bios tables test is updated
>
> ---
>
> v8:
> - Fix one of the BIOS links that were incorrect;
> - Changed mem error internal injection to use a common code;
> - No more hardcoded values for CPER: instead of using just the
> payload at the QAPI, it now has the full raw CPER there;
> - Error injection script now supports changing fields at the
> Generic Error Data section of the CPER;
> - Several minor cleanups.
>
> v7:
> - Change the way offsets are calculated and used on HEST table.
> Now, it is compatible with migrations as all offsets are relative
> to the HEST table;
> - GHES interface is now more generic: the entire CPER is sent via
> QMP, instead of just the payload;
> - Some code cleanups to make the code more robust;
> - The python script now uses QEMUMonitorProtocol class.
>
> v6:
> - PNP0C33 device creation moved to aml-build.c;
> - acpi_ghes record functions now use ACPI notify parameter,
> instead of source ID;
> - the number of source IDs is now automatically calculated;
> - some code cleanups and function/var renames;
> - some fixes and cleanups at the error injection script;
> - ghes cper stub now produces an error if cper JSON is not compiled;
> - Offset calculation logic for GHES was refactored;
> - Updated documentation to reflect the GHES allocated size;
> - Added a x-mpidr object for QOM usage;
> - Added a patch making usage of x-mpidr field at ARM injection
> script;
>
> v5:
> - CPER guid is now passing as string;
> - raw-data is now passed with base64 encode;
> - Removed several GPIO left-overs from arm/virt.c changes;
> - Lots of cleanups and improvements at the error injection script.
> It now better handles QMP dialog and doesn't print debug messages.
> Also, code was split on two modules, to make easier to add more
> error injection commands.
>
> v4:
> - CPER generation moved to happen outside QEMU;
> - One patch adding support for mpidr query was removed.
>
> v3:
> - patch 1 cleanups with some comment changes and adding another place where
> the poweroff GPIO define should be used. No changes on other patches (except
> due to conflict resolution).
>
> v2:
> - added a new patch using a define for GPIO power pin;
> - patch 2 changed to also use a define for generic error GPIO pin;
> - a couple cleanups at patch 2 removing uneeded else clauses.
>
> Example of generating a CPER record:
>
> $ scripts/ghes_inject.py -d arm -p 0xdeadbeef
> GUID: e19e3d16-bc11-11e4-9caa-c2051d5d46b0
> Generic Error Status Block (20 bytes):
> 00000000 01 00 00 00 00 00 00 00 00 00 00 00 90 00 00 00 ................
> 00000010 00 00 00 00 ....
>
> Generic Error Data Entry (72 bytes):
> 00000000 16 3d 9e e1 11 bc e4 11 9c aa c2 05 1d 5d 46 b0 .=...........]F.
> 00000010 00 00 00 00 00 03 00 00 48 00 00 00 00 00 00 00 ........H.......
> 00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000040 00 00 00 00 00 00 00 00 ........
>
> Payload (72 bytes):
> 00000000 05 00 00 00 01 00 00 00 48 00 00 00 00 00 00 00 ........H.......
> 00000010 00 00 00 80 00 00 00 00 10 05 0f 00 00 00 00 00 ................
> 00000020 00 00 00 00 00 00 00 00 00 20 14 00 02 01 00 03 ......... ......
> 00000030 0f 00 91 00 00 00 00 00 ef be ad de 00 00 00 00 ................
> 00000040 ef be ad de 00 00 00 00 ........
>
> Error injected.
>
> [ 9.358364] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> [ 9.359027] {1}[Hardware Error]: event severity: recoverable
> [ 9.359586] {1}[Hardware Error]: Error 0, type: recoverable
> [ 9.360124] {1}[Hardware Error]: section_type: ARM processor error
> [ 9.360561] {1}[Hardware Error]: MIDR: 0x00000000000f0510
> [ 9.361160] {1}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000080000000
> [ 9.361643] {1}[Hardware Error]: running state: 0x0
> [ 9.362142] {1}[Hardware Error]: Power State Coordination Interface state: 0
> [ 9.362682] {1}[Hardware Error]: Error info structure 0:
> [ 9.363030] {1}[Hardware Error]: num errors: 2
> [ 9.363656] {1}[Hardware Error]: error_type: 0x02: cache error
> [ 9.364163] {1}[Hardware Error]: error_info: 0x000000000091000f
> [ 9.364834] {1}[Hardware Error]: transaction type: Data Access
> [ 9.365599] {1}[Hardware Error]: cache error, operation type: Data write
> [ 9.366441] {1}[Hardware Error]: cache level: 2
> [ 9.367005] {1}[Hardware Error]: processor context not corrupted
> [ 9.367753] {1}[Hardware Error]: physical fault address: 0x00000000deadbeef
> [ 9.374267] Memory failure: 0xdeadb: recovery action for free buddy page: Recovered
>
> Such script currently supports arm processor error CPER, but can easily be
> extended to other GHES notification types.
>
>
>
> Jonathan Cameron (1):
> acpi/ghes: Add support for GED error device
>
> Mauro Carvalho Chehab (12):
> acpi/generic_event_device: add an APEI error device
> arm/virt: Wire up a GED error device for ACPI / GHES
> qapi/acpi-hest: add an interface to do generic CPER error injection
> acpi/ghes: rework the logic to handle HEST source ID
> acpi/ghes: add support for generic error injection via QAPI
> acpi/ghes: cleanup the memory error code logic
> docs: acpi_hest_ghes: fix documentation for CPER size
> scripts/ghes_inject: add a script to generate GHES error inject
> target/arm: add an experimental mpidr arm cpu property object
> scripts/arm_processor_error.py: retrieve mpidr if not filled
> acpi/ghes: cleanup generic error data logic
> acpi/ghes: check if the BIOS pointers for HEST are correct
>
> MAINTAINERS | 10 +
> docs/specs/acpi_hest_ghes.rst | 6 +-
> hw/acpi/Kconfig | 5 +
> hw/acpi/aml-build.c | 10 +
> hw/acpi/generic_event_device.c | 8 +
> hw/acpi/ghes-stub.c | 3 +-
> hw/acpi/ghes.c | 362 ++++++++-----
> hw/acpi/ghes_cper.c | 33 ++
> hw/acpi/ghes_cper_stub.c | 19 +
> hw/acpi/meson.build | 2 +
> hw/arm/Kconfig | 5 +
> hw/arm/virt-acpi-build.c | 6 +-
> hw/arm/virt.c | 12 +-
> include/hw/acpi/acpi_dev_interface.h | 1 +
> include/hw/acpi/aml-build.h | 2 +
> include/hw/acpi/generic_event_device.h | 1 +
> include/hw/acpi/ghes.h | 24 +-
> include/hw/arm/virt.h | 1 +
> qapi/acpi-hest.json | 36 ++
> qapi/meson.build | 1 +
> qapi/qapi-schema.json | 1 +
> scripts/arm_processor_error.py | 388 ++++++++++++++
> scripts/ghes_inject.py | 51 ++
> scripts/qmp_helper.py | 702 +++++++++++++++++++++++++
> target/arm/cpu.c | 1 +
> target/arm/cpu.h | 1 +
> target/arm/helper.c | 10 +-
> target/arm/kvm.c | 2 +-
> 28 files changed, 1551 insertions(+), 152 deletions(-)
> create mode 100644 hw/acpi/ghes_cper.c
> create mode 100644 hw/acpi/ghes_cper_stub.c
> create mode 100644 qapi/acpi-hest.json
> create mode 100644 scripts/arm_processor_error.py
> create mode 100755 scripts/ghes_inject.py
> create mode 100644 scripts/qmp_helper.py
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 03/13] acpi/ghes: Add support for GED error device
2024-08-19 11:43 ` Igor Mammedov
@ 2024-08-23 23:28 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-23 23:28 UTC (permalink / raw)
To: Igor Mammedov
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
Em Mon, 19 Aug 2024 13:43:04 +0200
Igor Mammedov <imammedo@redhat.com> escreveu:
> On Fri, 16 Aug 2024 09:37:35 +0200
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
>
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >
> > As a GED error device is now defined, add another type
> > of notification.
> >
> > Add error notification to GHES v2 using
> >a GED error device GED triggered via interrupt.
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> This is hard to parse, perhaps update so it would be
> more clear what does what
>
> >
> > [mchehab: do some cleanups at ACPI_HEST_SRC_ID_* checks and
> > rename HEST event to better identify GED interrupt OSPM]
> >
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > Reviewed-by: Igor Mammedov <imammedo@redhat.com>
> > ---
>
> in addition to change log in cover letter,
> I'd suggest to keep per patch change log as well (after ---),
> it helps reviewer to notice intended changes.
>
>
> [...]
> > + case ACPI_HEST_SRC_ID_GED:
> > + build_ghes_hw_error_notification(table_data, ACPI_GHES_NOTIFY_GPIO);
> While GPIO works for arm, it's not the case for other machines.
> I recall a suggestion to use ACPI_GHES_NOTIFY_EXTERNAL instead of GPIO one,
> but that got lost somewhere...
True, but the same also applies to SEA, which is ARMv8+. After having
everything in place, I confined the source ID into this code inside
ghes.c:
enum AcpiHestSourceId {
ACPI_HEST_SRC_ID_SEA,
ACPI_HEST_SRC_ID_GED,
/* Shall be the last one */
ACPI_HEST_SRC_ID_COUNT
} AcpiHestSourceId;
static bool ghes_notify_to_source_id(enum AcpiGhesNotifyType notify,
enum AcpiHestSourceId *source_id)
{
switch (notify) {
case ACPI_GHES_NOTIFY_SEA: /* ARMv8 */
*source_id = ACPI_HEST_SRC_ID_SEA;
return false;
case ACPI_GHES_NOTIFY_GPIO:
*source_id = ACPI_HEST_SRC_ID_GED;
return false;
default:
/* Unsupported notification types */
return true;
}
}
The only place where the source ID number is used is at
ghes_notify_to_source_id() - still we use ACPI_HEST_SRC_ID_COUNT on other
places to initialize and fill in the HEST table and its error source
structures.
On other words, the source ID field is filled from the notification types as
defined at include/hw/acpi/ghes.h:
ACPI_GHES_NOTIFY_POLLED = 0,
ACPI_GHES_NOTIFY_EXTERNAL = 1,
ACPI_GHES_NOTIFY_LOCAL = 2,
ACPI_GHES_NOTIFY_SCI = 3,
ACPI_GHES_NOTIFY_NMI = 4,
ACPI_GHES_NOTIFY_CMCI = 5,
ACPI_GHES_NOTIFY_MCE = 6,
ACPI_GHES_NOTIFY_GPIO = 7,
ACPI_GHES_NOTIFY_SEA = 8,
ACPI_GHES_NOTIFY_SEI = 9,
ACPI_GHES_NOTIFY_GSIV = 10,
ACPI_GHES_NOTIFY_SDEI = 11,
(please notice that ACPI already defines "EXTERNAL" as being something
else)
Now, if we want to add support for x86, we could either add some ifdefs
inside ghes.c, e. g. something like:
enum AcpiHestSourceId {
#ifdef TARGET_ARM
ACPI_HEST_SRC_ID_SEA,
ACPI_HEST_SRC_ID_GED,
#endif
#ifdef TARGET_I386
ACPI_HEST_SRC_ID_MCE,
#endif
/* Shall be the last one */
ACPI_HEST_SRC_ID_COUNT
} AcpiHestSourceId;
and something similar at ghes_notify_to_source_id():
static bool ghes_notify_to_source_id(enum AcpiGhesNotifyType notify,
enum AcpiHestSourceId *source_id)
{
switch (notify) {
#ifdef TARGET_ARM
case ACPI_GHES_NOTIFY_SEA: /* ARMv8 */
*source_id = ACPI_HEST_SRC_ID_SEA;
return false;
case ACPI_GHES_NOTIFY_GPIO:
*source_id = ACPI_HEST_SRC_ID_GED;
return false;
#endif
#ifdef TARGET_I386
case ACPI_GHES_NOTIFY_MCE:
*source_id = ACPI_HEST_SRC_ID_MCE;
return false;
#endif
default:
/* Unsupported notification types */
return true;
}
}
An alternative would be to move source id/notification code out, placing
them at hw/arm, hw/i386, but a more complex binding logic will be needed.
If we're willing to do something like that, I would prefer to not do such
redesign now. Better to do such change when we'll be ready to add some
notification support that works on x86 (MCE? SCI? NMI?).
Regards,
Mauro
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 13/13] acpi/ghes: check if the BIOS pointers for HEST are correct
2024-08-19 14:07 ` Igor Mammedov
@ 2024-08-24 0:15 ` Mauro Carvalho Chehab
2024-08-25 3:48 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-24 0:15 UTC (permalink / raw)
To: Igor Mammedov
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
Em Mon, 19 Aug 2024 16:07:33 +0200
Igor Mammedov <imammedo@redhat.com> escreveu:
> > + err_source_struct = le64_to_cpu(ags->hest_addr_le) +
> > + source * HEST_GHES_V2_TABLE_SIZE;
>
> there is no guaranties that HEST table will contain only GHESv2 sources,
> and once such is added this place becomes broken.
>
> we need to iterate over HEST taking that into account
> and find only ghesv2 structure with source id of interest.
>
> This function (and acpi_ghes_record_errors() as well) taking source_id
> as input should be able to lookup pointers from HEST in guest RAM,
> very crude idea could look something like this:
>
> typedef struct hest_source_type2len{
> uint16_t type
> int len
> } hest_structure_type2len
>
> hest_structure_type2len supported_hest_sources[] = {
> /* Table 18-344 Generic Hardware Error Source version 2 (GHESv2) Structure */
> {.type = 10, .len = 92},
> }
Sounds interesting, but IMO it should be done only when other types besides
ghes would be added, as:
1. Right now, the file is acpi/ghes.c. Adding non-type 10 HEST structures
there would be a little weird. It should likely be renamed to acpi/hest.c
when such time comes.
2. ACPI 6.5 has made clear that the above will only work up to type 11,
as, from type 12 and above, the length will be added to the error
struct, according with:
https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#error-source-structure-header-type-12-onward
3. some types have variable size. Starting from the beginning, type 0, as
defined at:
https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#hardware-errors-and-error-sources
has:
size = 40 + 24 * Number of Hardware banks
So, a simple table like the above with fixed sizes won't work.
The code would need instead a switch if types are <= 11.
Adding proper support for all already defined 12 types sounds lots of
work, as the code would need to calculate the size depending on the
size, and we don't really initialize the HEST table with other types
but GHES.
Ok, we could still do something like this pseudo-code to get the
error source offset:
#define ACPI_HEST_TYPE_GHESV2 11
err_struct_offset = 0;
for (i = 0; i < source_id_count; i++) {
/* NOTE: Other types may have different sizes */
assert(ghes[i].type == ACPI_HEST_TYPE_GHESV2);
if (ghes[i].source_id == source_id)
break;
err_struct_offset += HEST_GHES_V2_TABLE_SIZE;
}
assert (i < source_id_count);
---
That's said, maybe this will just add unwanted complexity, as QEMU
is already setting those offsets via bios_linker_loader_add_pointer().
So, an alternative for that is to merge the code on patch 13 with the one
on patch 5, dropping the math calcus there and relying that QEMU will
always handle properly bios links.
See, the logic which constructs GHESv2 source IDs do this to create
the links between HEST ACPI table and etc/hardware_errors:
with:
Per-source ID logic at build_ghes_v2():
address_offset = table_data->len;
/* Error Status Address */
build_append_gas(table_data, AML_AS_SYSTEM_MEMORY, 0x40, 0,
4 /* QWord access */, 0);
bios_linker_loader_add_pointer(linker, ACPI_BUILD_TABLE_FILE,
address_offset + GAS_ADDR_OFFSET,
sizeof(uint64_t),
ACPI_HW_ERROR_FW_CFG_FILE,
source_id * sizeof(uint64_t));
...
/*
* Read Ack Register
* ACPI 6.1: 18.3.2.8 Generic Hardware Error Source
* version 2 (GHESv2 - Type 10)
*/
address_offset = table_data->len;
build_append_gas(table_data, AML_AS_SYSTEM_MEMORY, 0x40, 0,
4 /* QWord access */, 0);
bios_linker_loader_add_pointer(linker, ACPI_BUILD_TABLE_FILE,
address_offset + GAS_ADDR_OFFSET,
sizeof(uint64_t),
ACPI_HW_ERROR_FW_CFG_FILE,
(ACPI_HEST_SRC_ID_COUNT + source_id) *
sizeof(uint64_t));
HEST table creation logic inside build_ghes_error_table():
for (i = 0; i < ACPI_HEST_SRC_ID_COUNT; i++) {
/*
* Tell firmware to patch error_block_address entries to point to
* corresponding "Generic Error Status Block"
*/
bios_linker_loader_add_pointer(linker,
ACPI_HW_ERROR_FW_CFG_FILE, sizeof(uint64_t) * i,
sizeof(uint64_t), ACPI_HW_ERROR_FW_CFG_FILE,
error_status_block_offset + i * ACPI_GHES_MAX_RAW_DATA_LENGTH);
}
Using those, the location of the CPER and ack addresses is easy and won't
require any math:
/* GHESv2 CPER offset */
cpu_physical_memory_read(hest_err_block_addr, &error_block_addr,
sizeof(error_block_addr));
cpu_physical_memory_read(error_block_addr, &cper_addr,
sizeof(error_block_addr));
/* GHESv2 ack offset */
cpu_physical_memory_read(hest_read_ack_start_addr, &read_ack_start_addr,
sizeof(read_ack_start_addr));
Regards,
Mauro
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 05/13] acpi/ghes: rework the logic to handle HEST source ID
2024-08-19 12:10 ` Igor Mammedov
@ 2024-08-25 2:02 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-25 2:02 UTC (permalink / raw)
To: Igor Mammedov
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, Peter Maydell, Shannon Zhao, linux-kernel, qemu-arm,
qemu-devel
Em Mon, 19 Aug 2024 14:10:37 +0200
Igor Mammedov <imammedo@redhat.com> escreveu:
> On Fri, 16 Aug 2024 09:37:37 +0200
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
>
> > The current logic is based on a lot of duct tape, with
> > offsets calculated based on one define with the number of
> > source IDs and an enum.
> >
> > Rewrite the logic in a way that it would be more resilient
> > of code changes, by moving the source ID count to an enum
> > and make the offset calculus more explicit.
> >
> > Such change was inspired on a patch from Jonathan Cameron
> > splitting the logic to get the CPER address on a separate
> > function, as this will be needed to support generic error
> > injection.
>
> patch does too many things, that it's hard to review.
> Please split it up on smaller distinct parts, with more specific
> commit messages. (see some comments below)
True, but there's not much that can be done when doing it and still
keeping the code working. I'll split the renames.
>
>
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > ---
> > hw/acpi/ghes-stub.c | 3 +-
> > hw/acpi/ghes.c | 210 ++++++++++++++++++++++++---------------
> > hw/arm/virt-acpi-build.c | 5 +-
> > include/hw/acpi/ghes.h | 17 ++--
> > 4 files changed, 138 insertions(+), 97 deletions(-)
> >
> > diff --git a/hw/acpi/ghes-stub.c b/hw/acpi/ghes-stub.c
> > index c315de1802d6..8762449870b5 100644
> > --- a/hw/acpi/ghes-stub.c
> > +++ b/hw/acpi/ghes-stub.c
> > @@ -11,7 +11,8 @@
> > #include "qemu/osdep.h"
> > #include "hw/acpi/ghes.h"
> >
> > -int acpi_ghes_record_errors(uint8_t source_id, uint64_t physical_address)
> > +int acpi_ghes_record_errors(enum AcpiGhesNotifyType notify,
> > + uint64_t physical_address)
> > {
> > return -1;
> > }
> > diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> > index df59fd35568c..7870f51e2a9e 100644
> > --- a/hw/acpi/ghes.c
> > +++ b/hw/acpi/ghes.c
> > @@ -28,14 +28,23 @@
> > #include "hw/nvram/fw_cfg.h"
> > #include "qemu/uuid.h"
> >
> > -#define ACPI_GHES_ERRORS_FW_CFG_FILE "etc/hardware_errors"
> > -#define ACPI_GHES_DATA_ADDR_FW_CFG_FILE "etc/hardware_errors_addr"
> > +#define ACPI_HW_ERROR_FW_CFG_FILE "etc/hardware_errors"
> > +#define ACPI_HW_ERROR_ADDR_FW_CFG_FILE "etc/hardware_errors_addr"
> split out renaming part into a presiding separate patch,
> so it won't mask a new code
>
> > +#define ACPI_HEST_ADDR_FW_CFG_FILE "etc/acpi_table_hest_addr"
> >
> > /* The max size in bytes for one error block */
> > #define ACPI_GHES_MAX_RAW_DATA_LENGTH (1 * KiB)
> >
>
>
> > -/* Support ARMv8 SEA notification type error source and GPIO interrupt. */
> > -#define ACPI_GHES_ERROR_SOURCE_COUNT 2
> > +/*
> > + * ID numbers used to fill HEST source ID field
> > + */
> > +enum AcpiHestSourceId {
> > + ACPI_HEST_SRC_ID_SEA,
> > + ACPI_HEST_SRC_ID_GED,
> > +
> > + /* Shall be the last one */
> > + ACPI_HEST_SRC_ID_COUNT
> > +} AcpiHestSourceId;
> >
> this rename also should go into its own separate patch.
I opted to remove this completely and move it to arm/virt, as this specific
set of sources is for ARM.
On such split, I ended placing the QMP error injection as the first one,
as this is probably the first one that we'll be mapping on x86 and
other architectures.
This way, the code at ghes.c won't rely on any hardcoded values. They'll
be passed at target ACPI table preparation using this function:
void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
BIOSLinker *linker,
const uint16_t * const notify,
int num_sources,
const char *oem_id, const char *oem_table_id)
On arm (at the rework patch, before adding GPIO method), the call
to the HEST build table (and etc/hardware_errors init) is done via:
static const uint16_t hest_ghes_notify[] = {
[ARM_ACPI_HEST_SRC_ID_SEA] = ACPI_GHES_NOTIFY_SEA,
};
void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
{
...
if (vms->ras) {
acpi_add_table(table_offsets, tables_blob);
acpi_build_hest(tables_blob, tables->hardware_errors, tables->linker,
hest_ghes_notify, sizeof(hest_ghes_notify),
vms->oem_id, vms->oem_table_id);
}
...
This way, adding support for a new notification type at the arch-specific code
is as easy as adding a new entry to hest_ghes_notify[].
>
> > /* Generic Hardware Error Source version 2 */
> > #define ACPI_GHES_SOURCE_GENERIC_ERROR_V2 10
> > @@ -63,6 +72,19 @@
> > */
> > #define ACPI_GHES_GESB_SIZE 20
> >
> > +/*
> > + * Offsets with regards to the start of the HEST table stored at
> > + * ags->hest_addr_le, according with the memory layout map at
> > + * docs/specs/acpi_hest_ghes.rst.
> > + */
> > +
> > +/* ACPI 4.0: 17.3.2 ACPI Error Source */
> > +#define ACPI_HEST_HEADER_SIZE 40
> > +
> > +/* ACPI 6.2: 18.3.2.8 Generic Hardware Error Source version 2 */
> > +#define HEST_GHES_V2_TABLE_SIZE 92
> > +#define GHES_ACK_OFFSET (64 + GAS_ADDR_OFFSET + ACPI_HEST_HEADER_SIZE)
> > +
> > /*
> > * Values for error_severity field
> > */
> > @@ -236,17 +258,17 @@ static int acpi_ghes_record_mem_error(uint64_t error_block_address,
> > * Initialize "etc/hardware_errors" and "etc/hardware_errors_addr" fw_cfg blobs.
> > * See docs/specs/acpi_hest_ghes.rst for blobs format.
> > */
> > -void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
> > +static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
> > {
> > int i, error_status_block_offset;
> >
> > /* Build error_block_address */
> > - for (i = 0; i < ACPI_GHES_ERROR_SOURCE_COUNT; i++) {
> > + for (i = 0; i < ACPI_HEST_SRC_ID_COUNT; i++) {
> > build_append_int_noprefix(hardware_errors, 0, sizeof(uint64_t));
> > }
> >
> > /* Build read_ack_register */
> > - for (i = 0; i < ACPI_GHES_ERROR_SOURCE_COUNT; i++) {
> > + for (i = 0; i < ACPI_HEST_SRC_ID_COUNT; i++) {
> > /*
> > * Initialize the value of read_ack_register to 1, so GHES can be
> > * writable after (re)boot.
> > @@ -261,20 +283,20 @@ void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
> >
> > /* Reserve space for Error Status Data Block */
> > acpi_data_push(hardware_errors,
> > - ACPI_GHES_MAX_RAW_DATA_LENGTH * ACPI_GHES_ERROR_SOURCE_COUNT);
> > + ACPI_GHES_MAX_RAW_DATA_LENGTH * ACPI_HEST_SRC_ID_COUNT);
> >
> > /* Tell guest firmware to place hardware_errors blob into RAM */
> > - bios_linker_loader_alloc(linker, ACPI_GHES_ERRORS_FW_CFG_FILE,
> > + bios_linker_loader_alloc(linker, ACPI_HW_ERROR_FW_CFG_FILE,
> > hardware_errors, sizeof(uint64_t), false);
> >
> > - for (i = 0; i < ACPI_GHES_ERROR_SOURCE_COUNT; i++) {
> > + for (i = 0; i < ACPI_HEST_SRC_ID_COUNT; i++) {
> > /*
> > * Tell firmware to patch error_block_address entries to point to
> > * corresponding "Generic Error Status Block"
> > */
> > bios_linker_loader_add_pointer(linker,
> > - ACPI_GHES_ERRORS_FW_CFG_FILE, sizeof(uint64_t) * i,
> > - sizeof(uint64_t), ACPI_GHES_ERRORS_FW_CFG_FILE,
> > + ACPI_HW_ERROR_FW_CFG_FILE, sizeof(uint64_t) * i,
> > + sizeof(uint64_t), ACPI_HW_ERROR_FW_CFG_FILE,
> > error_status_block_offset + i * ACPI_GHES_MAX_RAW_DATA_LENGTH);
> > }
> >
> > @@ -282,16 +304,39 @@ void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
> > * tell firmware to write hardware_errors GPA into
> > * hardware_errors_addr fw_cfg, once the former has been initialized.
> > */
> > - bios_linker_loader_write_pointer(linker, ACPI_GHES_DATA_ADDR_FW_CFG_FILE,
> > - 0, sizeof(uint64_t), ACPI_GHES_ERRORS_FW_CFG_FILE, 0);
> > + bios_linker_loader_write_pointer(linker, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, 0,
> > + sizeof(uint64_t),
> > + ACPI_HW_ERROR_FW_CFG_FILE, 0);
> > +}
> > +
> > +static bool ghes_notify_to_source_id(enum AcpiGhesNotifyType notify,
> > + enum AcpiHestSourceId *source_id)
> > +{
> > + switch (notify) {
> > + case ACPI_GHES_NOTIFY_SEA: /* ARMv8 */
> > + *source_id = ACPI_HEST_SRC_ID_SEA;
> > + return false;
> > + case ACPI_GHES_NOTIFY_GPIO:
> > + *source_id = ACPI_HEST_SRC_ID_GED;
> > + return false;
> > + default:
> > + /* Unsupported notification types */
> > + return true;
> > + }
> > }
> >
> > /* Build Generic Hardware Error Source version 2 (GHESv2) */
> > -static void build_ghes_v2(GArray *table_data, int source_id, BIOSLinker *linker)
> > +static void build_ghes_v2(GArray *table_data,
> > + enum AcpiGhesNotifyType notify,
> > + BIOSLinker *linker)
>
> I'd suggest to drop this change as AcpiGhesNotifyType is not unique ID,
> in fact one can easily have several ID with the same type.
IMO, the best is to have different IDs for each type of notification on
a given guest type. The way I see, each supported notification type will
require a separate source ID, as different notification mechanisms will
use different ways to synchronize between QEMU, BIOS and OS.
So, with the new version I'll be sending, this will also use whatever table
with a notification/source_ID association is passed by target acpi init
code, so the way I'm designing it is that different archs may reuse the same
id numbers like:
/* ARM acpi build logic - currently the only one implemented */
static const uint16_t hest_ghes_notify[] = {
[0] = ACPI_GHES_NOTIFY_GPIO,
[1] = ACPI_GHES_NOTIFY_SEA,
};
/* x86 acpi build logic */
static const uint16_t hest_ghes_notify[] = {
[0] = ACPI_GHES_NOTIFY_SCI,
[1] = ACPI_GHES_NOTIFY_MCE,
[2] = ACPI_GHES_NOTIFY_NMI,
};
/* PPC acpi build logic */
static const uint16_t hest_ghes_notify[] = {
[0] = ACPI_GHES_NOTIFY_GPIO,
}
...
Btw, I'm reserving 0 for QMP based event injection, as we don't know
in advance how much events each architecture will implement. So, basically
this will be added to ghes.h:
/* Source ID associated with qapi/acpi-hest.json QMP error injection */
#define ACPI_HEST_SRC_ID_QMP 0
And the actual binding on arm is done as:
enum {
ARM_ACPI_HEST_SRC_ID_GPIO = ACPI_HEST_SRC_ID_QMP,
ARM_ACPI_HEST_SRC_ID_SEA,
};
static const uint16_t hest_ghes_notify[] = {
[ARM_ACPI_HEST_SRC_ID_SEA] = ACPI_GHES_NOTIFY_SEA,
[ARM_ACPI_HEST_SRC_ID_GPIO] = ACPI_GHES_NOTIFY_GPIO,
};
Regards,
Mauro
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/13] acpi/ghes: add support for generic error injection via QAPI
2024-08-19 12:51 ` Igor Mammedov
@ 2024-08-25 3:29 ` Mauro Carvalho Chehab
2024-09-11 13:21 ` Igor Mammedov
0 siblings, 1 reply; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-25 3:29 UTC (permalink / raw)
To: Igor Mammedov
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
Em Mon, 19 Aug 2024 14:51:36 +0200
Igor Mammedov <imammedo@redhat.com> escreveu:
> > + read_ack = 1;
> > + cpu_physical_memory_write(read_ack_start_addr,
> > + &read_ack, (uint64_t));
> we don't do this for SEV so, why are you setting it to 1 here?
According with:
https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#generic-hardware-error-source-version-2-ghesv2-type-10
"These are the steps the OS must take once detecting an error from a particular GHESv2 error source:
OSPM detects error (via interrupt/exception or polling the block status)
OSPM copies the error status block
OSPM clears the block status field of the error status block
OSPM acknowledges the error via Read Ack register. For example:
OSPM reads the Read Ack register –> X
OSPM writes –> (( X & ReadAckPreserve) | ReadAckWrite)"
So, basically the guest OS takes some time to detect that an error
is raised. When it detects, it needs to mark that the error was
handled.
IMO, this is needed, independently of the notification mechanism.
Regards,
Mauro
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 13/13] acpi/ghes: check if the BIOS pointers for HEST are correct
2024-08-24 0:15 ` Mauro Carvalho Chehab
@ 2024-08-25 3:48 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-08-25 3:48 UTC (permalink / raw)
To: Igor Mammedov
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
Em Sat, 24 Aug 2024 02:15:10 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> escreveu:
> Ok, we could still do something like this pseudo-code to get the
> error source offset:
>
> #define ACPI_HEST_TYPE_GHESV2 11
>
> err_struct_offset = 0;
> for (i = 0; i < source_id_count; i++) {
> /* NOTE: Other types may have different sizes */
> assert(ghes[i].type == ACPI_HEST_TYPE_GHESV2);
> if (ghes[i].source_id == source_id)
> break;
> err_struct_offset += HEST_GHES_V2_TABLE_SIZE;
> }
> assert (i < source_id_count);
This is what I ended implementing on v9.
Regards,
Mauro
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/13] acpi/ghes: add support for generic error injection via QAPI
2024-08-25 3:29 ` Mauro Carvalho Chehab
@ 2024-09-11 13:21 ` Igor Mammedov
2024-09-11 15:34 ` Jonathan Cameron via
0 siblings, 1 reply; 34+ messages in thread
From: Igor Mammedov @ 2024-09-11 13:21 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
On Sun, 25 Aug 2024 05:29:23 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Em Mon, 19 Aug 2024 14:51:36 +0200
> Igor Mammedov <imammedo@redhat.com> escreveu:
>
> > > + read_ack = 1;
> > > + cpu_physical_memory_write(read_ack_start_addr,
> > > + &read_ack, (uint64_t));
> > we don't do this for SEV so, why are you setting it to 1 here?
>
> According with:
> https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#generic-hardware-error-source-version-2-ghesv2-type-10
>
> "These are the steps the OS must take once detecting an error from a particular GHESv2 error source:
>
> OSPM detects error (via interrupt/exception or polling the block status)
>
> OSPM copies the error status block
>
> OSPM clears the block status field of the error status block
>
> OSPM acknowledges the error via Read Ack register. For example:
>
> OSPM reads the Read Ack register –> X
>
> OSPM writes –> (( X & ReadAckPreserve) | ReadAckWrite)"
>
>
> So, basically the guest OS takes some time to detect that an error
> is raised. When it detects, it needs to mark that the error was
> handled.
what you are doing here by setting read_ack = 1,
is making ack on behalf of OSPM when OSPM haven't handled existing error yet.
Essentially making HW/FW do the job of OSPM. That looks wrong to me.
From HW/FW side read_ack register should be thought as read-only.
>
> IMO, this is needed, independently of the notification mechanism.
>
> Regards,
> Mauro
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/13] acpi/ghes: add support for generic error injection via QAPI
2024-09-11 13:21 ` Igor Mammedov
@ 2024-09-11 15:34 ` Jonathan Cameron via
2024-09-12 12:42 ` Igor Mammedov
0 siblings, 1 reply; 34+ messages in thread
From: Jonathan Cameron via @ 2024-09-11 15:34 UTC (permalink / raw)
To: Igor Mammedov
Cc: Mauro Carvalho Chehab, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
On Wed, 11 Sep 2024 15:21:32 +0200
Igor Mammedov <imammedo@redhat.com> wrote:
> On Sun, 25 Aug 2024 05:29:23 +0200
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
>
> > Em Mon, 19 Aug 2024 14:51:36 +0200
> > Igor Mammedov <imammedo@redhat.com> escreveu:
> >
> > > > + read_ack = 1;
> > > > + cpu_physical_memory_write(read_ack_start_addr,
> > > > + &read_ack, (uint64_t));
> > > we don't do this for SEV so, why are you setting it to 1 here?
> >
> > According with:
> > https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#generic-hardware-error-source-version-2-ghesv2-type-10
> >
> > "These are the steps the OS must take once detecting an error from a particular GHESv2 error source:
> >
> > OSPM detects error (via interrupt/exception or polling the block status)
> >
> > OSPM copies the error status block
> >
> > OSPM clears the block status field of the error status block
> >
> > OSPM acknowledges the error via Read Ack register. For example:
> >
> > OSPM reads the Read Ack register –> X
> >
> > OSPM writes –> (( X & ReadAckPreserve) | ReadAckWrite)"
> >
> >
> > So, basically the guest OS takes some time to detect that an error
> > is raised. When it detects, it needs to mark that the error was
> > handled.
>
> what you are doing here by setting read_ack = 1,
> is making ack on behalf of OSPM when OSPM haven't handled existing error yet.
>
> Essentially making HW/FW do the job of OSPM. That looks wrong to me.
> From HW/FW side read_ack register should be thought as read-only.
It's not read-only because HW/FW has to clear it so that HW/FW can detect
when the OSPM next writes it.
Agreed this write to 1 looks wrong, but the one a few lines further down (to zero
it) is correct.
My bug a long time back I think.
Jonathan
>
> >
> > IMO, this is needed, independently of the notification mechanism.
> >
> > Regards,
> > Mauro
> >
>
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/13] acpi/ghes: add support for generic error injection via QAPI
2024-09-11 15:34 ` Jonathan Cameron via
@ 2024-09-12 12:42 ` Igor Mammedov
2024-09-13 5:20 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 34+ messages in thread
From: Igor Mammedov @ 2024-09-12 12:42 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Mauro Carvalho Chehab, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
On Wed, 11 Sep 2024 16:34:36 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> On Wed, 11 Sep 2024 15:21:32 +0200
> Igor Mammedov <imammedo@redhat.com> wrote:
>
> > On Sun, 25 Aug 2024 05:29:23 +0200
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> >
> > > Em Mon, 19 Aug 2024 14:51:36 +0200
> > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > >
> > > > > + read_ack = 1;
> > > > > + cpu_physical_memory_write(read_ack_start_addr,
> > > > > + &read_ack, (uint64_t));
> > > > we don't do this for SEV so, why are you setting it to 1 here?
> > >
> > > According with:
> > > https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#generic-hardware-error-source-version-2-ghesv2-type-10
> > >
> > > "These are the steps the OS must take once detecting an error from a particular GHESv2 error source:
> > >
> > > OSPM detects error (via interrupt/exception or polling the block status)
> > >
> > > OSPM copies the error status block
> > >
> > > OSPM clears the block status field of the error status block
> > >
> > > OSPM acknowledges the error via Read Ack register. For example:
> > >
> > > OSPM reads the Read Ack register –> X
> > >
> > > OSPM writes –> (( X & ReadAckPreserve) | ReadAckWrite)"
> > >
> > >
> > > So, basically the guest OS takes some time to detect that an error
> > > is raised. When it detects, it needs to mark that the error was
> > > handled.
> >
> > what you are doing here by setting read_ack = 1,
> > is making ack on behalf of OSPM when OSPM haven't handled existing error yet.
> >
> > Essentially making HW/FW do the job of OSPM. That looks wrong to me.
> > From HW/FW side read_ack register should be thought as read-only.
>
> It's not read-only because HW/FW has to clear it so that HW/FW can detect
> when the OSPM next writes it.
By readonly, I've meant that hw shall not do above mentioned write
(bad phrasing on my side).
>
> Agreed this write to 1 looks wrong, but the one a few lines further down (to zero
> it) is correct.
yep, hw should clear register.
It would be better to so on OSPM ACK, but alas we can't intercept that,
so the next option would be to do that at the time when we add a new error block
>
> My bug a long time back I think.
>
> Jonathan
>
> >
> > >
> > > IMO, this is needed, independently of the notification mechanism.
> > >
> > > Regards,
> > > Mauro
> > >
> >
> >
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/13] acpi/ghes: add support for generic error injection via QAPI
2024-09-12 12:42 ` Igor Mammedov
@ 2024-09-13 5:20 ` Mauro Carvalho Chehab
2024-09-13 10:13 ` Jonathan Cameron via
0 siblings, 1 reply; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-09-13 5:20 UTC (permalink / raw)
To: Igor Mammedov
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
Em Thu, 12 Sep 2024 14:42:33 +0200
Igor Mammedov <imammedo@redhat.com> escreveu:
> On Wed, 11 Sep 2024 16:34:36 +0100
> Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
>
> > On Wed, 11 Sep 2024 15:21:32 +0200
> > Igor Mammedov <imammedo@redhat.com> wrote:
> >
> > > On Sun, 25 Aug 2024 05:29:23 +0200
> > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > >
> > > > Em Mon, 19 Aug 2024 14:51:36 +0200
> > > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > > >
> > > > > > + read_ack = 1;
> > > > > > + cpu_physical_memory_write(read_ack_start_addr,
> > > > > > + &read_ack, (uint64_t));
> > > > > we don't do this for SEV so, why are you setting it to 1 here?
The diffstat doesn't really help here. The full code is:
/* zero means OSPM does not acknowledge the error */
if (!read_ack) {
error_setg(errp,
"Last CPER record was not acknowledged yet");
read_ack = 1;
cpu_physical_memory_write(read_ack_start_addr,
&read_ack, sizeof(read_ack));
return;
}
> > > what you are doing here by setting read_ack = 1,
> > > is making ack on behalf of OSPM when OSPM haven't handled existing error yet.
> > >
> > > Essentially making HW/FW do the job of OSPM. That looks wrong to me.
> > > From HW/FW side read_ack register should be thought as read-only.
> >
> > It's not read-only because HW/FW has to clear it so that HW/FW can detect
> > when the OSPM next writes it.
>
> By readonly, I've meant that hw shall not do above mentioned write
> (bad phrasing on my side).
The above code is actually an error handling condition: if for some
reason errors are triggered too fast, there's a bug on QEMU or there is
a bug at the OSPM, an error message is raised and the logic resets the
record to a sane state. So, on a next error, OSPM will get it.
As described at https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html?highlight=asynchronous#generic-hardware-error-source:
"Some platforms may describe multiple Generic Hardware Error Source
structures with different notification types, as defined in
Table 18.10. For example, a platform may describe one error source
for the handling of synchronous errors (e.g. MCE or SEA), and a
second source for handling asynchronous errors (e.g. SCI or
External Interrupt)."
Basically, the error logic there seems to fit for the asynchronous
case, detecting if another error happened before OSPM handles the
first one.
IMO, there are a couple of alternatives to handle such case:
1. Keep the code as-is: if this ever happens, an error message will
be issued. If SEA/MCE gets implemented synchronously on HW/FW/OSPM,
the above code will never be called;
2. Change the logic to do that only for asynchronous sources
(currently, only if source ID is QMP);
3. Add a special QMP message to reset the notification ack. Probably
would use Notification type as an input parameter;
4. Have a much more complex code to implement asynchronous notifications,
with a queue to receive HEST errors and a separate thread to deliver
errors to OSPM asynchronously. If we go this way, QMP would be
returning the number of error messages queued, allowing error injection
code to know if OSPM has troubles delivering errors;
5. Just return an error code without doing any resets. To me, this is
the worse scenario.
I don't like (5), as if something bad happens, there's nothing to be
done.
For QMP error injection (4) seems is overkill. It may be needed in the
future if we end implementing a logic where host OS informs guest about
hardware problems, and such errors use asynchronous notifications.
I would also avoid implementing (3) at least for now, as reporting
such error via QMP seems enough for the QMP usecase.
So, if ok for you, I'll change the code to (2).
> > Agreed this write to 1 looks wrong, but the one a few lines further down (to zero
> > it) is correct.
>
> yep, hw should clear register.
> It would be better to so on OSPM ACK, but alas we can't intercept that,
> so the next option would be to do that at the time when we add a new error block
>
> >
> > My bug a long time back I think.
> >
> > Jonathan
> >
> > >
> > > >
> > > > IMO, this is needed, independently of the notification mechanism.
> > > >
> > > > Regards,
> > > > Mauro
> > > >
> > >
> > >
> >
>
Thanks,
Mauro
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/13] acpi/ghes: add support for generic error injection via QAPI
2024-09-13 5:20 ` Mauro Carvalho Chehab
@ 2024-09-13 10:13 ` Jonathan Cameron via
2024-09-13 12:28 ` Igor Mammedov
0 siblings, 1 reply; 34+ messages in thread
From: Jonathan Cameron via @ 2024-09-13 10:13 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Igor Mammedov, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
On Fri, 13 Sep 2024 07:20:25 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Em Thu, 12 Sep 2024 14:42:33 +0200
> Igor Mammedov <imammedo@redhat.com> escreveu:
>
> > On Wed, 11 Sep 2024 16:34:36 +0100
> > Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> >
> > > On Wed, 11 Sep 2024 15:21:32 +0200
> > > Igor Mammedov <imammedo@redhat.com> wrote:
> > >
> > > > On Sun, 25 Aug 2024 05:29:23 +0200
> > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > >
> > > > > Em Mon, 19 Aug 2024 14:51:36 +0200
> > > > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > > > >
> > > > > > > + read_ack = 1;
> > > > > > > + cpu_physical_memory_write(read_ack_start_addr,
> > > > > > > + &read_ack, (uint64_t));
> > > > > > we don't do this for SEV so, why are you setting it to 1 here?
>
> The diffstat doesn't really help here. The full code is:
>
> /* zero means OSPM does not acknowledge the error */
> if (!read_ack) {
> error_setg(errp,
> "Last CPER record was not acknowledged yet");
> read_ack = 1;
> cpu_physical_memory_write(read_ack_start_addr,
> &read_ack, sizeof(read_ack));
> return;
> }
>
> > > > what you are doing here by setting read_ack = 1,
> > > > is making ack on behalf of OSPM when OSPM haven't handled existing error yet.
> > > >
> > > > Essentially making HW/FW do the job of OSPM. That looks wrong to me.
> > > > From HW/FW side read_ack register should be thought as read-only.
> > >
> > > It's not read-only because HW/FW has to clear it so that HW/FW can detect
> > > when the OSPM next writes it.
> >
> > By readonly, I've meant that hw shall not do above mentioned write
> > (bad phrasing on my side).
>
> The above code is actually an error handling condition: if for some
> reason errors are triggered too fast, there's a bug on QEMU or there is
> a bug at the OSPM, an error message is raised and the logic resets the
> record to a sane state. So, on a next error, OSPM will get it.
>
> As described at https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html?highlight=asynchronous#generic-hardware-error-source:
>
> "Some platforms may describe multiple Generic Hardware Error Source
> structures with different notification types, as defined in
> Table 18.10. For example, a platform may describe one error source
> for the handling of synchronous errors (e.g. MCE or SEA), and a
> second source for handling asynchronous errors (e.g. SCI or
> External Interrupt)."
>
> Basically, the error logic there seems to fit for the asynchronous
> case, detecting if another error happened before OSPM handles the
> first one.
Agreed - the error logic to act as backpressure for the tool injecting
the error makes sense - it's just hardware acknowledging to paper
over slow software that is an issue.
>
> IMO, there are a couple of alternatives to handle such case:
>
> 1. Keep the code as-is: if this ever happens, an error message will
> be issued. If SEA/MCE gets implemented synchronously on HW/FW/OSPM,
> the above code will never be called;
> 2. Change the logic to do that only for asynchronous sources
> (currently, only if source ID is QMP);
> 3. Add a special QMP message to reset the notification ack. Probably
> would use Notification type as an input parameter;
> 4. Have a much more complex code to implement asynchronous notifications,
> with a queue to receive HEST errors and a separate thread to deliver
> errors to OSPM asynchronously. If we go this way, QMP would be
> returning the number of error messages queued, allowing error injection
> code to know if OSPM has troubles delivering errors;
Is this not better done in the injection code outside of qemu?
So detect the error in that and if it happens back off and try again
later? Basically EBUSY done in an inelegant way.
> 5. Just return an error code without doing any resets. To me, this is
> the worse scenario.
>
> I don't like (5), as if something bad happens, there's nothing to be
> done.
If it happens on a real system nothing is done either. So I'm not sure
we need to handle that. Or maybe real hardware reinjects the interrupt
if the OSPM hasn't done anything about it for a while.
>
> For QMP error injection (4) seems is overkill. It may be needed in the
> future if we end implementing a logic where host OS informs guest about
> hardware problems, and such errors use asynchronous notifications.
>
> I would also avoid implementing (3) at least for now, as reporting
> such error via QMP seems enough for the QMP usecase.
>
> So, if ok for you, I'll change the code to (2).
Whilst I don't feel strongly about it, I think 5 is unfortunately the
correct option if we aren't going to queue errors in qemu (so make it
an injection tool problem).
>
>
> > > Agreed this write to 1 looks wrong, but the one a few lines further down (to zero
> > > it) is correct.
> >
> > yep, hw should clear register.
> > It would be better to so on OSPM ACK, but alas we can't intercept that,
> > so the next option would be to do that at the time when we add a new error block
> >
> > >
> > > My bug a long time back I think.
> > >
> > > Jonathan
> > >
> > > >
> > > > >
> > > > > IMO, this is needed, independently of the notification mechanism.
> > > > >
> > > > > Regards,
> > > > > Mauro
> > > > >
> > > >
> > > >
> > >
> >
>
>
>
> Thanks,
> Mauro
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/13] acpi/ghes: add support for generic error injection via QAPI
2024-09-13 10:13 ` Jonathan Cameron via
@ 2024-09-13 12:28 ` Igor Mammedov
2024-09-14 5:38 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 34+ messages in thread
From: Igor Mammedov @ 2024-09-13 12:28 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Mauro Carvalho Chehab, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
On Fri, 13 Sep 2024 11:13:00 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> On Fri, 13 Sep 2024 07:20:25 +0200
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
>
> > Em Thu, 12 Sep 2024 14:42:33 +0200
> > Igor Mammedov <imammedo@redhat.com> escreveu:
> >
> > > On Wed, 11 Sep 2024 16:34:36 +0100
> > > Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> > >
> > > > On Wed, 11 Sep 2024 15:21:32 +0200
> > > > Igor Mammedov <imammedo@redhat.com> wrote:
> > > >
> > > > > On Sun, 25 Aug 2024 05:29:23 +0200
> > > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > > >
> > > > > > Em Mon, 19 Aug 2024 14:51:36 +0200
> > > > > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > > > > >
> > > > > > > > + read_ack = 1;
> > > > > > > > + cpu_physical_memory_write(read_ack_start_addr,
> > > > > > > > + &read_ack, (uint64_t));
> > > > > > > we don't do this for SEV so, why are you setting it to 1 here?
> >
> > The diffstat doesn't really help here. The full code is:
> >
> > /* zero means OSPM does not acknowledge the error */
> > if (!read_ack) {
> > error_setg(errp,
> > "Last CPER record was not acknowledged yet");
> > read_ack = 1;
> > cpu_physical_memory_write(read_ack_start_addr,
> > &read_ack, sizeof(read_ack));
> > return;
> > }
> >
> > > > > what you are doing here by setting read_ack = 1,
> > > > > is making ack on behalf of OSPM when OSPM haven't handled existing error yet.
> > > > >
> > > > > Essentially making HW/FW do the job of OSPM. That looks wrong to me.
> > > > > From HW/FW side read_ack register should be thought as read-only.
> > > >
> > > > It's not read-only because HW/FW has to clear it so that HW/FW can detect
> > > > when the OSPM next writes it.
> > >
> > > By readonly, I've meant that hw shall not do above mentioned write
> > > (bad phrasing on my side).
> >
> > The above code is actually an error handling condition: if for some
> > reason errors are triggered too fast, there's a bug on QEMU or there is
> > a bug at the OSPM, an error message is raised and the logic resets the
> > record to a sane state. So, on a next error, OSPM will get it.
> >
> > As described at https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html?highlight=asynchronous#generic-hardware-error-source:
> >
> > "Some platforms may describe multiple Generic Hardware Error Source
> > structures with different notification types, as defined in
> > Table 18.10. For example, a platform may describe one error source
> > for the handling of synchronous errors (e.g. MCE or SEA), and a
> > second source for handling asynchronous errors (e.g. SCI or
> > External Interrupt)."
> >
> > Basically, the error logic there seems to fit for the asynchronous
> > case, detecting if another error happened before OSPM handles the
> > first one.
>
> Agreed - the error logic to act as backpressure for the tool injecting
> the error makes sense - it's just hardware acknowledging to paper
> over slow software that is an issue.
on top of that, read_ack is serving as sync primitive
If one disregards it and starts overwriting error block regardless of
ack value, One will be inducing race condition, where OSPM might be
accessing error_block while HW is in process of overwriting it.
>
> >
> > IMO, there are a couple of alternatives to handle such case:
> >
> > 1. Keep the code as-is: if this ever happens, an error message will
> > be issued. If SEA/MCE gets implemented synchronously on HW/FW/OSPM,
> > the above code will never be called;
> > 2. Change the logic to do that only for asynchronous sources
> > (currently, only if source ID is QMP);
> > 3. Add a special QMP message to reset the notification ack. Probably
> > would use Notification type as an input parameter;
> > 4. Have a much more complex code to implement asynchronous notifications,
> > with a queue to receive HEST errors and a separate thread to deliver
> > errors to OSPM asynchronously. If we go this way, QMP would be
> > returning the number of error messages queued, allowing error injection
> > code to know if OSPM has troubles delivering errors;
>
> Is this not better done in the injection code outside of qemu?
> So detect the error in that and if it happens back off and try again
> later? Basically EBUSY done in an inelegant way.
>
> > 5. Just return an error code without doing any resets. To me, this is
> > the worse scenario.
> >
> > I don't like (5), as if something bad happens, there's nothing to be
> > done.
>
> If it happens on a real system nothing is done either. So I'm not sure
> we need to handle that. Or maybe real hardware reinjects the interrupt
> if the OSPM hasn't done anything about it for a while.
>
> >
> > For QMP error injection (4) seems is overkill. It may be needed in the
> > future if we end implementing a logic where host OS informs guest about
> > hardware problems, and such errors use asynchronous notifications.
> >
> > I would also avoid implementing (3) at least for now, as reporting
> > such error via QMP seems enough for the QMP usecase.
> >
> > So, if ok for you, I'll change the code to (2).
>
> Whilst I don't feel strongly about it, I think 5 is unfortunately the
> correct option if we aren't going to queue errors in qemu (so make it
> an injection tool problem).
+1 to option (5)
> >
> >
> > > > Agreed this write to 1 looks wrong, but the one a few lines further down (to zero
> > > > it) is correct.
> > >
> > > yep, hw should clear register.
> > > It would be better to so on OSPM ACK, but alas we can't intercept that,
> > > so the next option would be to do that at the time when we add a new error block
> > >
> > > >
> > > > My bug a long time back I think.
> > > >
> > > > Jonathan
> > > >
> > > > >
> > > > > >
> > > > > > IMO, this is needed, independently of the notification mechanism.
> > > > > >
> > > > > > Regards,
> > > > > > Mauro
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > Thanks,
> > Mauro
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/13] acpi/ghes: add support for generic error injection via QAPI
2024-09-13 12:28 ` Igor Mammedov
@ 2024-09-14 5:38 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2024-09-14 5:38 UTC (permalink / raw)
To: Igor Mammedov
Cc: Jonathan Cameron, Shiju Jose, Michael S. Tsirkin, Ani Sinha,
Dongjiu Geng, linux-kernel, qemu-arm, qemu-devel
Em Fri, 13 Sep 2024 14:28:02 +0200
Igor Mammedov <imammedo@redhat.com> escreveu:
> > > 5. Just return an error code without doing any resets. To me, this is
> > > the worse scenario.
> > >
> > > I don't like (5), as if something bad happens, there's nothing to be
> > > done.
> >
> > If it happens on a real system nothing is done either. So I'm not sure
> > we need to handle that. Or maybe real hardware reinjects the interrupt
> > if the OSPM hasn't done anything about it for a while.
> >
> > >
> > > For QMP error injection (4) seems is overkill. It may be needed in the
> > > future if we end implementing a logic where host OS informs guest about
> > > hardware problems, and such errors use asynchronous notifications.
> > >
> > > I would also avoid implementing (3) at least for now, as reporting
> > > such error via QMP seems enough for the QMP usecase.
> > >
> > > So, if ok for you, I'll change the code to (2).
> >
> > Whilst I don't feel strongly about it, I think 5 is unfortunately the
> > correct option if we aren't going to queue errors in qemu (so make it
> > an injection tool problem).
>
> +1 to option (5)
Ok, will do (5) then.
Thanks,
Mauro
^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2024-09-14 5:39 UTC | newest]
Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-16 7:37 [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 01/13] acpi/generic_event_device: add an APEI error device Mauro Carvalho Chehab
2024-08-19 11:21 ` Igor Mammedov
2024-08-16 7:37 ` [PATCH v8 02/13] arm/virt: Wire up a GED error device for ACPI / GHES Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 03/13] acpi/ghes: Add support for GED error device Mauro Carvalho Chehab
2024-08-19 11:43 ` Igor Mammedov
2024-08-23 23:28 ` Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 04/13] qapi/acpi-hest: add an interface to do generic CPER error injection Mauro Carvalho Chehab
2024-08-19 11:54 ` Igor Mammedov
2024-08-16 7:37 ` [PATCH v8 05/13] acpi/ghes: rework the logic to handle HEST source ID Mauro Carvalho Chehab
2024-08-19 12:10 ` Igor Mammedov
2024-08-25 2:02 ` Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 06/13] acpi/ghes: add support for generic error injection via QAPI Mauro Carvalho Chehab
2024-08-19 12:51 ` Igor Mammedov
2024-08-25 3:29 ` Mauro Carvalho Chehab
2024-09-11 13:21 ` Igor Mammedov
2024-09-11 15:34 ` Jonathan Cameron via
2024-09-12 12:42 ` Igor Mammedov
2024-09-13 5:20 ` Mauro Carvalho Chehab
2024-09-13 10:13 ` Jonathan Cameron via
2024-09-13 12:28 ` Igor Mammedov
2024-09-14 5:38 ` Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 07/13] acpi/ghes: cleanup the memory error code logic Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 08/13] docs: acpi_hest_ghes: fix documentation for CPER size Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 09/13] scripts/ghes_inject: add a script to generate GHES error inject Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 10/13] target/arm: add an experimental mpidr arm cpu property object Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 11/13] scripts/arm_processor_error.py: retrieve mpidr if not filled Mauro Carvalho Chehab
2024-08-16 7:37 ` [PATCH v8 12/13] acpi/ghes: cleanup generic error data logic Mauro Carvalho Chehab
2024-08-19 12:57 ` Igor Mammedov
2024-08-16 7:37 ` [PATCH v8 13/13] acpi/ghes: check if the BIOS pointers for HEST are correct Mauro Carvalho Chehab
2024-08-19 14:07 ` Igor Mammedov
2024-08-24 0:15 ` Mauro Carvalho Chehab
2024-08-25 3:48 ` Mauro Carvalho Chehab
2024-08-19 14:21 ` [PATCH v8 00/13] Add ACPI CPER firmware first error injection on ARM emulation Igor Mammedov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).