* [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
@ 2025-01-31 17:42 Mauro Carvalho Chehab
2025-01-31 17:42 ` [PATCH v3 01/14] acpi/ghes: Prepare to support multiple sources on ghes Mauro Carvalho Chehab
` (14 more replies)
0 siblings, 15 replies; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Philippe Mathieu-Daudé, Ani Sinha,
Cleber Rosa, Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
linux-kernel
Now that the ghes preparation patches were merged, let's add support
for error injection.
In this series, the first 6 patches change the math used to calculate offsets at HEST
table and hardware_error firmware file, together with its migration code. Migration tested
with both latest QEMU released kernel and upstream, on both directions.
The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
to inject ARM Processor Error records.
If I'm counting correctly, this is the 19th submission of my error inject patches.
---
v3:
- addressed more nits;
- hest_add_le now points to the beginning of HEST table;
- removed HEST from tests/data/acpi;
- added an extra patch to not use fw_cfg with virt-10.0 for hw_error_le
v2:
- address some nits;
- improved ags cleanup patch and removed ags.present field;
- added some missing le*_to_cpu() calls;
- update date at copyright for new files to 2024-2025;
- qmp command changed to: inject-ghes-v2-error and since updated to 10.0;
- added HEST and DSDT tables after the changes to make check target happy.
(two patches: first one whitelisting such tables; second one removing from
whitelist and updating/adding such tables to tests/data/acpi)
A diff against v2 follows, to better show the differences.
diff --git a/hw/acpi/generic_event_device.c b/hw/acpi/generic_event_device.c
index 50313ed7ee96..f5e899155d34 100644
--- a/hw/acpi/generic_event_device.c
+++ b/hw/acpi/generic_event_device.c
@@ -29,12 +29,6 @@ static const uint32_t ged_supported_events[] = {
ACPI_GED_ERROR_EVT,
};
-/*
- * ACPI 5.0b: 5.6.6 Device Object Notifications
- * Table 5-135 Error Device Notification Values
- */
-#define ERROR_DEVICE_NOTIFICATION 0x80
-
/*
* The ACPI Generic Event Device (GED) is a hardware-reduced specific
* device[ACPI v6.1 Section 5.6.9] that handles all platform events,
@@ -124,9 +118,14 @@ void build_ged_aml(Aml *table, const char *name, HotplugHandler *hotplug_dev,
aml_int(0x80)));
break;
case ACPI_GED_ERROR_EVT:
+ /*
+ * ACPI 5.0b: 5.6.6 Device Object Notifications
+ * Table 5-135 Error Device Notification Values
+ * Defines 0x80 as the value to be used on notifications
+ */
aml_append(if_ctx,
aml_notify(aml_name(ACPI_APEI_ERROR_DEVICE),
- aml_int(ERROR_DEVICE_NOTIFICATION)));
+ aml_int(0x80)));
break;
case ACPI_GED_NVDIMM_HOTPLUG_EVT:
aml_append(if_ctx,
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index ef57ad14a38b..bcef0b22e612 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -41,6 +41,12 @@
/* Address offset in Generic Address Structure(GAS) */
#define GAS_ADDR_OFFSET 4
+/*
+ * ACPI spec 1.0b
+ * 5.2.3 System Description Table Header
+ */
+#define ACPI_DESC_HEADER_OFFSET 36
+
/*
* The total size of Generic Error Data Entry
* ACPI 6.1/6.2: 18.3.2.7.1 Generic Error Data,
@@ -226,8 +232,8 @@ ghes_gen_err_data_uncorrectable_recoverable(GArray *block,
* Initialize "etc/hardware_errors" and "etc/hardware_errors_addr" fw_cfg blobs.
* See docs/specs/acpi_hest_ghes.rst for blobs format.
*/
-static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker,
- int num_sources)
+static void build_ghes_error_table(AcpiGhesState *ags, GArray *hardware_errors,
+ BIOSLinker *linker, int num_sources)
{
int i, error_status_block_offset;
@@ -272,13 +278,15 @@ static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker,
i * ACPI_GHES_MAX_RAW_DATA_LENGTH);
}
- /*
- * Tell firmware to write hardware_errors GPA into
- * hardware_errors_addr fw_cfg, once the former has been initialized.
- */
- bios_linker_loader_write_pointer(linker, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, 0,
- sizeof(uint64_t),
- ACPI_HW_ERROR_FW_CFG_FILE, 0);
+ if (!ags->use_hest_addr) {
+ /*
+ * Tell firmware to write hardware_errors GPA into
+ * hardware_errors_addr fw_cfg, once the former has been initialized.
+ */
+ bios_linker_loader_write_pointer(linker, ACPI_HW_ERROR_ADDR_FW_CFG_FILE,
+ 0, sizeof(uint64_t),
+ ACPI_HW_ERROR_FW_CFG_FILE, 0);
+ }
}
/* Build Generic Hardware Error Source version 2 (GHESv2) */
@@ -365,11 +373,11 @@ void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
uint32_t hest_offset;
int i;
- build_ghes_error_table(hardware_errors, linker, num_sources);
+ hest_offset = table_data->len;
- acpi_table_begin(&table, table_data);
+ build_ghes_error_table(ags, hardware_errors, linker, num_sources);
- hest_offset = table_data->len;
+ acpi_table_begin(&table, table_data);
/* Error Source Count */
build_append_int_noprefix(table_data, num_sources, 4);
@@ -383,7 +391,6 @@ void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
* Tell firmware to write into GPA the address of HEST via fw_cfg,
* once initialized.
*/
-
if (ags->use_hest_addr) {
bios_linker_loader_write_pointer(linker,
ACPI_HEST_ADDR_FW_CFG_FILE, 0,
@@ -399,13 +406,13 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
fw_cfg_add_file(s, ACPI_HW_ERROR_FW_CFG_FILE, hardware_error->data,
hardware_error->len);
- /* Create a read-write fw_cfg file for Address */
- fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, NULL, NULL,
- NULL, &(ags->hw_error_le), sizeof(ags->hw_error_le), false);
-
if (ags->use_hest_addr) {
fw_cfg_add_file_callback(s, ACPI_HEST_ADDR_FW_CFG_FILE, NULL, NULL,
NULL, &(ags->hest_addr_le), sizeof(ags->hest_addr_le), false);
+ } else {
+ /* Create a read-write fw_cfg file for Address */
+ fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, NULL, NULL,
+ NULL, &(ags->hw_error_le), sizeof(ags->hw_error_le), false);
}
}
@@ -432,7 +439,7 @@ static void get_hw_error_offsets(uint64_t ghes_addr,
}
static void get_ghes_source_offsets(uint16_t source_id,
- uint64_t hest_entry_addr,
+ uint64_t hest_addr,
uint64_t *cper_addr,
uint64_t *read_ack_start_addr,
Error **errp)
@@ -441,12 +448,13 @@ static void get_ghes_source_offsets(uint16_t source_id,
uint64_t err_source_entry, error_block_addr;
uint32_t num_sources, i;
+ hest_addr += ACPI_DESC_HEADER_OFFSET;
- cpu_physical_memory_read(hest_entry_addr, &num_sources,
+ cpu_physical_memory_read(hest_addr, &num_sources,
sizeof(num_sources));
num_sources = le32_to_cpu(num_sources);
- err_source_entry = hest_entry_addr + sizeof(num_sources);
+ err_source_entry = hest_addr + sizeof(num_sources);
/*
* Currently, HEST Error source navigates only for GHESv2 tables
@@ -468,7 +476,6 @@ static void get_ghes_source_offsets(uint16_t source_id,
/* Compare CPER source address at the GHESv2 structure */
addr += sizeof(type);
cpu_physical_memory_read(addr, &src_id, sizeof(src_id));
-
if (le16_to_cpu(src_id) == source_id) {
break;
}
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 7d3580244179..7b6e90d69298 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -956,8 +956,10 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
build_dbg2(tables_blob, tables->linker, vms);
if (vms->ras) {
- AcpiGhesState *ags;
+ static const AcpiNotificationSourceId *notify;
AcpiGedState *acpi_ged_state;
+ unsigned int notify_sz;
+ AcpiGhesState *ags;
acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
NULL));
@@ -967,16 +969,16 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
acpi_add_table(table_offsets, tables_blob);
if (!ags->use_hest_addr) {
- acpi_build_hest(ags, tables_blob, tables->hardware_errors,
- tables->linker, hest_ghes_notify_9_2,
- ARRAY_SIZE(hest_ghes_notify_9_2),
- vms->oem_id, vms->oem_table_id);
+ notify = hest_ghes_notify_9_2;
+ notify_sz = ARRAY_SIZE(hest_ghes_notify_9_2);
} else {
- acpi_build_hest(ags, tables_blob, tables->hardware_errors,
- tables->linker, hest_ghes_notify,
- ARRAY_SIZE(hest_ghes_notify),
- vms->oem_id, vms->oem_table_id);
+ notify = hest_ghes_notify;
+ notify_sz = ARRAY_SIZE(hest_ghes_notify);
}
+
+ acpi_build_hest(ags, tables_blob, tables->hardware_errors,
+ tables->linker, notify, notify_sz,
+ vms->oem_id, vms->oem_table_id);
}
}
diff --git a/scripts/arm_processor_error.py b/scripts/arm_processor_error.py
index 4ac04ab08299..b0e8450e667e 100644
--- a/scripts/arm_processor_error.py
+++ b/scripts/arm_processor_error.py
@@ -12,11 +12,110 @@
#
# - ARM registers: power_state, mpidr.
+"""
+Generates an ARM processor error CPER, compatible with
+UEFI 2.9A Errata.
+
+Injecting such errors can be done using:
+
+ $ ./scripts/ghes_inject.py arm
+ Error injected.
+
+Produces a simple CPER register, as detected on a Linux guest:
+
+[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
+[Hardware Error]: event severity: recoverable
+[Hardware Error]: Error 0, type: recoverable
+[Hardware Error]: section_type: ARM processor error
+[Hardware Error]: MIDR: 0x0000000000000000
+[Hardware Error]: running state: 0x0
+[Hardware Error]: Power State Coordination Interface state: 0
+[Hardware Error]: Error info structure 0:
+[Hardware Error]: num errors: 2
+[Hardware Error]: error_type: 0x02: cache error
+[Hardware Error]: error_info: 0x000000000091000f
+[Hardware Error]: transaction type: Data Access
+[Hardware Error]: cache error, operation type: Data write
+[Hardware Error]: cache level: 2
+[Hardware Error]: processor context not corrupted
+[Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
+
+The ARM Processor Error message can be customized via command line
+parameters. For instance:
+
+ $ ./scripts/ghes_inject.py arm --mpidr 0x444 --running --affinity 1 \
+ --error-info 12345678 --vendor 0x13,123,4,5,1 --ctx-array 0,1,2,3,4,5 \
+ -t cache tlb bus micro-arch tlb,micro-arch
+ Error injected.
+
+Injects this error, as detected on a Linux guest:
+
+[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
+[Hardware Error]: event severity: recoverable
+[Hardware Error]: Error 0, type: recoverable
+[Hardware Error]: section_type: ARM processor error
+[Hardware Error]: MIDR: 0x0000000000000000
+[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000000000000
+[Hardware Error]: error affinity level: 0
+[Hardware Error]: running state: 0x1
+[Hardware Error]: Power State Coordination Interface state: 0
+[Hardware Error]: Error info structure 0:
+[Hardware Error]: num errors: 2
+[Hardware Error]: error_type: 0x02: cache error
+[Hardware Error]: error_info: 0x0000000000bc614e
+[Hardware Error]: cache level: 2
+[Hardware Error]: processor context not corrupted
+[Hardware Error]: Error info structure 1:
+[Hardware Error]: num errors: 2
+[Hardware Error]: error_type: 0x04: TLB error
+[Hardware Error]: error_info: 0x000000000054007f
+[Hardware Error]: transaction type: Instruction
+[Hardware Error]: TLB error, operation type: Instruction fetch
+[Hardware Error]: TLB level: 1
+[Hardware Error]: processor context not corrupted
+[Hardware Error]: the error has not been corrected
+[Hardware Error]: PC is imprecise
+[Hardware Error]: Error info structure 2:
+[Hardware Error]: num errors: 2
+[Hardware Error]: error_type: 0x08: bus error
+[Hardware Error]: error_info: 0x00000080d6460fff
+[Hardware Error]: transaction type: Generic
+[Hardware Error]: bus error, operation type: Generic read (type of instruction or data request cannot be determined)
+[Hardware Error]: affinity level at which the bus error occurred: 1
+[Hardware Error]: processor context corrupted
+[Hardware Error]: the error has been corrected
+[Hardware Error]: PC is imprecise
+[Hardware Error]: Program execution can be restarted reliably at the PC associated with the error.
+[Hardware Error]: participation type: Local processor observed
+[Hardware Error]: request timed out
+[Hardware Error]: address space: External Memory Access
+[Hardware Error]: memory access attributes:0x20
+[Hardware Error]: access mode: secure
+[Hardware Error]: Error info structure 3:
+[Hardware Error]: num errors: 2
+[Hardware Error]: error_type: 0x10: micro-architectural error
+[Hardware Error]: error_info: 0x0000000078da03ff
+[Hardware Error]: Error info structure 4:
+[Hardware Error]: num errors: 2
+[Hardware Error]: error_type: 0x14: TLB error|micro-architectural error
+[Hardware Error]: Context info structure 0:
+[Hardware Error]: register context type: AArch64 EL1 context registers
+[Hardware Error]: 00000000: 00000000 00000000
+[Hardware Error]: Vendor specific error info has 5 bytes:
+[Hardware Error]: 00000000: 13 7b 04 05 01 .{...
+[Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
+[Firmware Warn]: GHES: Unhandled processor error type 0x04: TLB error
+[Firmware Warn]: GHES: Unhandled processor error type 0x08: bus error
+[Firmware Warn]: GHES: Unhandled processor error type 0x10: micro-architectural error
+[Firmware Warn]: GHES: Unhandled processor error type 0x14: TLB error|micro-architectural error
+"""
+
import argparse
import re
from qmp_helper import qmp, util, cper_guid
+
class ArmProcessorEinj:
"""
Implements ARM Processor Error injection via GHES
diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
old mode 100644
new mode 100755
index 4f7ebb31b424..8e375d6a6cab
--- a/scripts/qmp_helper.py
+++ b/scripts/qmp_helper.py
@@ -541,7 +541,7 @@ def send_cper_raw(self, cper_data):
self._connect()
- if self.send_cmd("inject-ghes-error", cmd_arg):
+ if self.send_cmd("inject-ghes-v2-error", cmd_arg):
print("Error injected.")
def send_cper(self, notif_type, payload):
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 544ff174784d..80ca7779797b 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2371,7 +2371,6 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
ags = acpi_ghes_get_state();
-
if (ags && addr) {
ram_addr = qemu_ram_addr_from_host(addr);
if (ram_addr != RAM_ADDR_INVALID &&
diff --git a/tests/data/acpi/aarch64/virt/HEST b/tests/data/acpi/aarch64/virt/HEST
deleted file mode 100644
index 8b0cf87700fa..000000000000
Binary files a/tests/data/acpi/aarch64/virt/HEST and /dev/null differ
-
Mauro Carvalho Chehab (14):
acpi/ghes: Prepare to support multiple sources on ghes
acpi/ghes: add a firmware file with HEST address
acpi/ghes: Use HEST table offsets when preparing GHES records
acpi/generic_event_device: Update GHES migration to cover hest addr
acpi/generic_event_device: add logic to detect if HEST addr is
available
acpi/ghes: only set hw_error_le or hest_addr_le
acpi/ghes: add a notifier to notify when error data is ready
acpi/ghes: Cleanup the code which gets ghes ged state
acpi/generic_event_device: add an APEI error device
tests/acpi: virt: allow acpi table changes for a new table: HEST
arm/virt: Wire up a GED error device for ACPI / GHES
tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT
qapi/acpi-hest: add an interface to do generic CPER error injection
scripts/ghes_inject: add a script to generate GHES error inject
MAINTAINERS | 10 +
hw/acpi/Kconfig | 5 +
hw/acpi/aml-build.c | 10 +
hw/acpi/generic_event_device.c | 43 ++
hw/acpi/ghes-stub.c | 7 +-
hw/acpi/ghes.c | 231 ++++--
hw/acpi/ghes_cper.c | 38 +
hw/acpi/ghes_cper_stub.c | 19 +
hw/acpi/meson.build | 2 +
hw/arm/virt-acpi-build.c | 37 +-
hw/arm/virt.c | 19 +-
hw/core/machine.c | 2 +
include/hw/acpi/acpi_dev_interface.h | 1 +
include/hw/acpi/aml-build.h | 2 +
include/hw/acpi/generic_event_device.h | 1 +
include/hw/acpi/ghes.h | 45 +-
include/hw/arm/virt.h | 2 +
qapi/acpi-hest.json | 35 +
qapi/meson.build | 1 +
qapi/qapi-schema.json | 1 +
scripts/arm_processor_error.py | 476 ++++++++++++
scripts/ghes_inject.py | 51 ++
scripts/qmp_helper.py | 702 ++++++++++++++++++
target/arm/kvm.c | 7 +-
tests/data/acpi/aarch64/virt/DSDT | Bin 5196 -> 5240 bytes
.../data/acpi/aarch64/virt/DSDT.acpihmatvirt | Bin 5282 -> 5326 bytes
tests/data/acpi/aarch64/virt/DSDT.memhp | Bin 6557 -> 6601 bytes
tests/data/acpi/aarch64/virt/DSDT.pxb | Bin 7679 -> 7723 bytes
tests/data/acpi/aarch64/virt/DSDT.topology | Bin 5398 -> 5442 bytes
29 files changed, 1666 insertions(+), 81 deletions(-)
create mode 100644 hw/acpi/ghes_cper.c
create mode 100644 hw/acpi/ghes_cper_stub.c
create mode 100644 qapi/acpi-hest.json
create mode 100644 scripts/arm_processor_error.py
create mode 100755 scripts/ghes_inject.py
create mode 100755 scripts/qmp_helper.py
--
2.48.1
* [PATCH v3 01/14] acpi/ghes: Prepare to support multiple sources on ghes
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
2025-01-31 17:42 ` [PATCH v3 02/14] acpi/ghes: add a firmware file with HEST address Mauro Carvalho Chehab
` (13 subsequent siblings)
14 siblings, 0 replies; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Ani Sinha, Dongjiu Geng, Peter Maydell,
Shannon Zhao, linux-kernel
The current code is actually dependent on having just one error
structure with a single source.
As the number of sources should be arch-dependent, as it will depend on
what kind of notifications will exist, change the logic to dynamically
build the table.
Yet, for proper support, we need to get the number of sources by
reading the number from the HEST table. However, bios currently doesn't
store a pointer to it.
For now just change the logic at table build time, while enforcing that
it will behave like before with a single source ID.
A future patch will add a HEST table bios pointer and change the logic
at acpi_ghes_record_errors() to dynamically use the new size.
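For reference, the offsets used by this patch inside the "etc/hardware_errors" blob can be sketched as below. This is only an illustration of my reading of the diff, not code from the series: the `HwErrorOffsets` type and the `hw_error_offsets()` helper are hypothetical names, and the start of the data blocks assumes they are reserved right after the two pointer arrays, as the build order in build_ghes_error_table() suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as ACPI_GHES_MAX_RAW_DATA_LENGTH (1 KiB) in hw/acpi/ghes.c */
#define MAX_RAW_DATA_LENGTH 1024u

/* Hypothetical type: byte offsets inside "etc/hardware_errors" */
typedef struct {
    size_t error_block_address; /* pointer to the Error Status Data Block */
    size_t read_ack_register;   /* read ack register of the same source */
    size_t error_status_block;  /* data block itself (assumed placement) */
} HwErrorOffsets;

/* Hypothetical helper: layout for source "index" out of "num_sources" */
HwErrorOffsets hw_error_offsets(unsigned index, unsigned num_sources)
{
    HwErrorOffsets o;

    assert(index < num_sources);

    /* One 64-bit error_block_address entry per source comes first */
    o.error_block_address = index * sizeof(uint64_t);

    /* Then one 64-bit read_ack_register entry per source */
    o.read_ack_register = (num_sources + index) * sizeof(uint64_t);

    /* Data blocks are assumed to follow both arrays, 1 KiB per source */
    o.error_status_block = 2 * num_sources * sizeof(uint64_t) +
                           (size_t)index * MAX_RAW_DATA_LENGTH;
    return o;
}
```

With a single source, as enforced by the assert() added here, hw_error_offsets(0, 1) places the read ack register at offset 8, which matches what the old source_id-based math produced.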
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
---
hw/acpi/ghes.c | 47 +++++++++++++++++++++++++---------------
hw/arm/virt-acpi-build.c | 5 +++++
include/hw/acpi/ghes.h | 21 ++++++++++++------
3 files changed, 49 insertions(+), 24 deletions(-)
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index b709c177cdea..4cabb177ad47 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -206,17 +206,26 @@ ghes_gen_err_data_uncorrectable_recoverable(GArray *block,
* Initialize "etc/hardware_errors" and "etc/hardware_errors_addr" fw_cfg blobs.
* See docs/specs/acpi_hest_ghes.rst for blobs format.
*/
-static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
+static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker,
+ int num_sources)
{
int i, error_status_block_offset;
+ /*
+ * TODO: Current version supports only one source.
+ * A further patch will drop this check, after adding a proper migration
+ * code, as, for the code to work, we need to store a bios pointer to the
+ * HEST table.
+ */
+ assert(num_sources == 1);
+
/* Build error_block_address */
- for (i = 0; i < ACPI_GHES_ERROR_SOURCE_COUNT; i++) {
+ for (i = 0; i < num_sources; i++) {
build_append_int_noprefix(hardware_errors, 0, sizeof(uint64_t));
}
/* Build read_ack_register */
- for (i = 0; i < ACPI_GHES_ERROR_SOURCE_COUNT; i++) {
+ for (i = 0; i < num_sources; i++) {
/*
* Initialize the value of read_ack_register to 1, so GHES can be
* writable after (re)boot.
@@ -231,13 +240,13 @@ static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
/* Reserve space for Error Status Data Block */
acpi_data_push(hardware_errors,
- ACPI_GHES_MAX_RAW_DATA_LENGTH * ACPI_GHES_ERROR_SOURCE_COUNT);
+ ACPI_GHES_MAX_RAW_DATA_LENGTH * num_sources);
/* Tell guest firmware to place hardware_errors blob into RAM */
bios_linker_loader_alloc(linker, ACPI_HW_ERROR_FW_CFG_FILE,
hardware_errors, sizeof(uint64_t), false);
- for (i = 0; i < ACPI_GHES_ERROR_SOURCE_COUNT; i++) {
+ for (i = 0; i < num_sources; i++) {
/*
* Tell firmware to patch error_block_address entries to point to
* corresponding "Generic Error Status Block"
@@ -261,12 +270,14 @@ static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker)
}
/* Build Generic Hardware Error Source version 2 (GHESv2) */
-static void build_ghes_v2(GArray *table_data,
- BIOSLinker *linker,
- enum AcpiGhesNotifyType notify,
- uint16_t source_id)
+static void build_ghes_v2_entry(GArray *table_data,
+ BIOSLinker *linker,
+ const AcpiNotificationSourceId *notif_src,
+ uint16_t index, int num_sources)
{
uint64_t address_offset;
+ const uint16_t notify = notif_src->notify;
+ const uint16_t source_id = notif_src->source_id;
/*
* Type:
@@ -297,7 +308,7 @@ static void build_ghes_v2(GArray *table_data,
address_offset + GAS_ADDR_OFFSET,
sizeof(uint64_t),
ACPI_HW_ERROR_FW_CFG_FILE,
- source_id * sizeof(uint64_t));
+ index * sizeof(uint64_t));
/* Notification Structure */
build_ghes_hw_error_notification(table_data, notify);
@@ -317,8 +328,7 @@ static void build_ghes_v2(GArray *table_data,
address_offset + GAS_ADDR_OFFSET,
sizeof(uint64_t),
ACPI_HW_ERROR_FW_CFG_FILE,
- (ACPI_GHES_ERROR_SOURCE_COUNT + source_id)
- * sizeof(uint64_t));
+ (num_sources + index) * sizeof(uint64_t));
/*
* Read Ack Preserve field
@@ -333,19 +343,23 @@ static void build_ghes_v2(GArray *table_data,
/* Build Hardware Error Source Table */
void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
BIOSLinker *linker,
+ const AcpiNotificationSourceId *notif_source,
+ int num_sources,
const char *oem_id, const char *oem_table_id)
{
AcpiTable table = { .sig = "HEST", .rev = 1,
.oem_id = oem_id, .oem_table_id = oem_table_id };
+ int i;
- build_ghes_error_table(hardware_errors, linker);
+ build_ghes_error_table(hardware_errors, linker, num_sources);
acpi_table_begin(&table, table_data);
/* Error Source Count */
- build_append_int_noprefix(table_data, ACPI_GHES_ERROR_SOURCE_COUNT, 4);
- build_ghes_v2(table_data, linker,
- ACPI_GHES_NOTIFY_SEA, ACPI_HEST_SRC_ID_SEA);
+ build_append_int_noprefix(table_data, num_sources, 4);
+ for (i = 0; i < num_sources; i++) {
+ build_ghes_v2_entry(table_data, linker, &notif_source[i], i, num_sources);
+ }
acpi_table_end(linker, &table);
}
@@ -410,7 +424,6 @@ void ghes_record_cper_errors(const void *cper, size_t len,
}
ags = &acpi_ged_state->ghes_state;
- assert(ACPI_GHES_ERROR_SOURCE_COUNT == 1);
get_hw_error_offsets(le64_to_cpu(ags->hw_error_le),
&cper_addr, &read_ack_register_addr);
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 3ac8f8e17861..3d411787fc37 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -893,6 +893,10 @@ static void acpi_align_size(GArray *blob, unsigned align)
g_array_set_size(blob, ROUND_UP(acpi_data_len(blob), align));
}
+static const AcpiNotificationSourceId hest_ghes_notify[] = {
+ { ACPI_HEST_SRC_ID_SYNC, ACPI_GHES_NOTIFY_SEA },
+};
+
static
void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
{
@@ -948,6 +952,7 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
if (vms->ras) {
acpi_add_table(table_offsets, tables_blob);
acpi_build_hest(tables_blob, tables->hardware_errors, tables->linker,
+ hest_ghes_notify, ARRAY_SIZE(hest_ghes_notify),
vms->oem_id, vms->oem_table_id);
}
diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
index 39619a2457cb..9f0120d0d596 100644
--- a/include/hw/acpi/ghes.h
+++ b/include/hw/acpi/ghes.h
@@ -57,20 +57,27 @@ enum AcpiGhesNotifyType {
ACPI_GHES_NOTIFY_RESERVED = 12
};
-enum {
- ACPI_HEST_SRC_ID_SEA = 0,
- /* future ids go here */
-
- ACPI_GHES_ERROR_SOURCE_COUNT
-};
-
typedef struct AcpiGhesState {
uint64_t hw_error_le;
bool present; /* True if GHES is present at all on this board */
} AcpiGhesState;
+/*
+ * ID numbers used to fill HEST source ID field
+ */
+enum AcpiGhesSourceID {
+ ACPI_HEST_SRC_ID_SYNC,
+};
+
+typedef struct AcpiNotificationSourceId {
+ enum AcpiGhesSourceID source_id;
+ enum AcpiGhesNotifyType notify;
+} AcpiNotificationSourceId;
+
void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
BIOSLinker *linker,
+ const AcpiNotificationSourceId * const notif_source,
+ int num_sources,
const char *oem_id, const char *oem_table_id);
void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s,
GArray *hardware_errors);
--
2.48.1
* [PATCH v3 02/14] acpi/ghes: add a firmware file with HEST address
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
2025-01-31 17:42 ` [PATCH v3 01/14] acpi/ghes: Prepare to support multiple sources on ghes Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
2025-02-03 13:41 ` Igor Mammedov
2025-01-31 17:42 ` [PATCH v3 03/14] acpi/ghes: Use HEST table offsets when preparing GHES records Mauro Carvalho Chehab
` (12 subsequent siblings)
14 siblings, 1 reply; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Ani Sinha, Dongjiu Geng, linux-kernel
Store the HEST table address at GPA, placing the start of the table at
the hest_addr_le variable.
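As the _le suffix indicates, the value written by firmware is kept in little-endian byte order, so it has to be converted before being used as an address (the existing code does that with le64_to_cpu() for hw_error_le). A minimal, host-independent sketch of such a conversion follows; `read_le64()` is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode 8 little-endian bytes into a host uint64_t */
uint64_t read_le64(const uint8_t *p)
{
    uint64_t v = 0;

    assert(p != NULL);

    /* Start from the most significant byte, which is stored last */
    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}
```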
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
hw/acpi/ghes.c | 16 ++++++++++++++++
include/hw/acpi/ghes.h | 1 +
2 files changed, 17 insertions(+)
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index 4cabb177ad47..27478f2d5674 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -30,6 +30,7 @@
#define ACPI_HW_ERROR_FW_CFG_FILE "etc/hardware_errors"
#define ACPI_HW_ERROR_ADDR_FW_CFG_FILE "etc/hardware_errors_addr"
+#define ACPI_HEST_ADDR_FW_CFG_FILE "etc/acpi_table_hest_addr"
/* The max size in bytes for one error block */
#define ACPI_GHES_MAX_RAW_DATA_LENGTH (1 * KiB)
@@ -349,8 +350,11 @@ void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
{
AcpiTable table = { .sig = "HEST", .rev = 1,
.oem_id = oem_id, .oem_table_id = oem_table_id };
+ uint32_t hest_offset;
int i;
+ hest_offset = table_data->len;
+
build_ghes_error_table(hardware_errors, linker, num_sources);
acpi_table_begin(&table, table_data);
@@ -362,6 +366,15 @@ void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
}
acpi_table_end(linker, &table);
+
+ /*
+ * Tell firmware to write into GPA the address of HEST via fw_cfg,
+ * once initialized.
+ */
+ bios_linker_loader_write_pointer(linker,
+ ACPI_HEST_ADDR_FW_CFG_FILE, 0,
+ sizeof(uint64_t),
+ ACPI_BUILD_TABLE_FILE, hest_offset);
}
void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
@@ -375,6 +388,9 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, NULL, NULL,
NULL, &(ags->hw_error_le), sizeof(ags->hw_error_le), false);
+ fw_cfg_add_file_callback(s, ACPI_HEST_ADDR_FW_CFG_FILE, NULL, NULL,
+ NULL, &(ags->hest_addr_le), sizeof(ags->hest_addr_le), false);
+
ags->present = true;
}
diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
index 9f0120d0d596..237721fec0a2 100644
--- a/include/hw/acpi/ghes.h
+++ b/include/hw/acpi/ghes.h
@@ -58,6 +58,7 @@ enum AcpiGhesNotifyType {
};
typedef struct AcpiGhesState {
+ uint64_t hest_addr_le;
uint64_t hw_error_le;
bool present; /* True if GHES is present at all on this board */
} AcpiGhesState;
--
2.48.1
* [PATCH v3 03/14] acpi/ghes: Use HEST table offsets when preparing GHES records
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
2025-01-31 17:42 ` [PATCH v3 01/14] acpi/ghes: Prepare to support multiple sources on ghes Mauro Carvalho Chehab
2025-01-31 17:42 ` [PATCH v3 02/14] acpi/ghes: add a firmware file with HEST address Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
2025-02-03 10:42 ` Jonathan Cameron via
2025-02-03 14:34 ` Igor Mammedov
2025-01-31 17:42 ` [PATCH v3 04/14] acpi/generic_event_device: Update GHES migration to cover hest addr Mauro Carvalho Chehab
` (11 subsequent siblings)
14 siblings, 2 replies; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Ani Sinha, Dongjiu Geng, linux-kernel
There are two pointers that are needed during error injection:
1. The start address of the CPER block to be stored;
2. The address of the ack.
It is preferable to calculate them from the HEST table. This allows
checking the source ID, the size of the table and the type of the
HEST error block structures.
Yet, keep the old code, as this is needed for migration purposes.
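The table walk can be illustrated with a simplified, stand-alone sketch that works on an in-memory copy of the HEST instead of guest memory. It is not the code from this patch: `find_ghes_v2_fields()`, `le16()` and `le32()` are hypothetical names, the offsets are the constants defined in the diff below, and the GHESv2 type value (10) is taken from the ACPI specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants with the same values as the ones defined in this patch */
#define ACPI_DESC_HEADER_OFFSET  36 /* ACPI table header size */
#define HEST_GHES_V2_TABLE_SIZE  92 /* size of one GHESv2 structure */
#define GHES_ERR_STATUS_ADDR_OFF 20 /* 'Error Status Address' GAS */
#define GHES_READ_ACK_ADDR_OFF   64 /* 'Read Ack Register' GAS */
#define GAS_ADDR_OFFSET           4 /* address field inside a GAS */
#define GHES_V2_TYPE             10 /* GHESv2 error source type */

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Hypothetical helper: look up source_id in a HEST copy and return, as
 * offsets from the start of the table, where the two 64-bit pointers
 * live. Returns 0 on success, -1 if the source is missing, the table is
 * truncated, or a non-GHESv2 entry is found.
 */
int find_ghes_v2_fields(const uint8_t *hest, size_t len, uint16_t source_id,
                        size_t *err_status_addr_off,
                        size_t *read_ack_addr_off)
{
    /* Entries start after the header and the Error Source Count */
    size_t entry = ACPI_DESC_HEADER_OFFSET + sizeof(uint32_t);
    uint32_t num_sources, i;

    if (len < entry) {
        return -1;
    }
    num_sources = le32(hest + ACPI_DESC_HEADER_OFFSET);

    for (i = 0; i < num_sources; i++, entry += HEST_GHES_V2_TABLE_SIZE) {
        if (len - entry < HEST_GHES_V2_TABLE_SIZE) {
            return -1; /* truncated table */
        }
        /* Only the size of GHESv2 entries is known */
        if (le16(hest + entry) != GHES_V2_TYPE) {
            return -1;
        }
        /* Source ID follows the 16-bit type field */
        if (le16(hest + entry + sizeof(uint16_t)) == source_id) {
            *err_status_addr_off = entry + GHES_ERR_STATUS_ADDR_OFF +
                                   GAS_ADDR_OFFSET;
            *read_ack_addr_off = entry + GHES_READ_ACK_ADDR_OFF +
                                 GAS_ADDR_OFFSET;
            return 0;
        }
    }
    return -1;
}
```

In the patch itself, the equivalent of these two offsets is added to the HEST GPA, and the resulting pointers are then dereferenced with cpu_physical_memory_read() to reach the CPER block and the read ack register.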
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
hw/acpi/ghes.c | 132 ++++++++++++++++++++++++++++++++++++-----
include/hw/acpi/ghes.h | 1 +
2 files changed, 119 insertions(+), 14 deletions(-)
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index 27478f2d5674..8f284fd191a6 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -41,6 +41,12 @@
/* Address offset in Generic Address Structure(GAS) */
#define GAS_ADDR_OFFSET 4
+/*
+ * ACPI spec 1.0b
+ * 5.2.3 System Description Table Header
+ */
+#define ACPI_DESC_HEADER_OFFSET 36
+
/*
* The total size of Generic Error Data Entry
* ACPI 6.1/6.2: 18.3.2.7.1 Generic Error Data,
@@ -61,6 +67,25 @@
*/
#define ACPI_GHES_GESB_SIZE 20
+/*
+ * Offsets with regards to the start of the HEST table stored at
+ * ags->hest_addr_le, according with the memory layout map at
+ * docs/specs/acpi_hest_ghes.rst.
+ */
+
+/*
+ * ACPI 6.2: 18.3.2.8 Generic Hardware Error Source version 2
+ * Table 18-382 Generic Hardware Error Source version 2 (GHESv2) Structure
+ */
+#define HEST_GHES_V2_TABLE_SIZE 92
+#define GHES_READ_ACK_ADDR_OFF 64
+
+/*
+ * ACPI 6.2: 18.3.2.7: Generic Hardware Error Source
+ * Table 18-380: 'Error Status Address' field
+ */
+#define GHES_ERR_STATUS_ADDR_OFF 20
+
/*
* Values for error_severity field
*/
@@ -212,14 +237,6 @@ static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker,
{
int i, error_status_block_offset;
- /*
- * TODO: Current version supports only one source.
- * A further patch will drop this check, after adding a proper migration
- * code, as, for the code to work, we need to store a bios pointer to the
- * HEST table.
- */
- assert(num_sources == 1);
-
/* Build error_block_address */
for (i = 0; i < num_sources; i++) {
build_append_int_noprefix(hardware_errors, 0, sizeof(uint64_t));
@@ -352,6 +369,14 @@ void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
.oem_id = oem_id, .oem_table_id = oem_table_id };
uint32_t hest_offset;
int i;
+ AcpiGedState *acpi_ged_state;
+ AcpiGhesState *ags = NULL;
+
+ acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
+ NULL));
+ if (acpi_ged_state) {
+ ags = &acpi_ged_state->ghes_state;
+ }
hest_offset = table_data->len;
@@ -371,10 +396,12 @@ void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
* Tell firmware to write into GPA the address of HEST via fw_cfg,
* once initialized.
*/
- bios_linker_loader_write_pointer(linker,
- ACPI_HEST_ADDR_FW_CFG_FILE, 0,
- sizeof(uint64_t),
- ACPI_BUILD_TABLE_FILE, hest_offset);
+ if (ags->use_hest_addr) {
+ bios_linker_loader_write_pointer(linker,
+ ACPI_HEST_ADDR_FW_CFG_FILE, 0,
+ sizeof(uint64_t),
+ ACPI_BUILD_TABLE_FILE, hest_offset);
+ }
}
void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
@@ -420,6 +447,78 @@ static void get_hw_error_offsets(uint64_t ghes_addr,
*read_ack_register_addr = ghes_addr + sizeof(uint64_t);
}
+static void get_ghes_source_offsets(uint16_t source_id,
+ uint64_t hest_addr,
+ uint64_t *cper_addr,
+ uint64_t *read_ack_start_addr,
+ Error **errp)
+{
+ uint64_t hest_err_block_addr, hest_read_ack_addr;
+ uint64_t err_source_entry, error_block_addr;
+ uint32_t num_sources, i;
+
+ hest_addr += ACPI_DESC_HEADER_OFFSET;
+
+ cpu_physical_memory_read(hest_addr, &num_sources,
+ sizeof(num_sources));
+ num_sources = le32_to_cpu(num_sources);
+
+ err_source_entry = hest_addr + sizeof(num_sources);
+
+ /*
+ * Currently, HEST error source navigation only supports GHESv2 tables
+ */
+
+ for (i = 0; i < num_sources; i++) {
+ uint64_t addr = err_source_entry;
+ uint16_t type, src_id;
+
+ cpu_physical_memory_read(addr, &type, sizeof(type));
+ type = le16_to_cpu(type);
+
+ /* For now, we only know the size of GHESv2 table */
+ if (type != ACPI_GHES_SOURCE_GENERIC_ERROR_V2) {
+ error_setg(errp, "HEST: type %d not supported.", type);
+ return;
+ }
+
+ /* Compare the CPER source ID in the GHESv2 structure */
+ addr += sizeof(type);
+ cpu_physical_memory_read(addr, &src_id, sizeof(src_id));
+ if (le16_to_cpu(src_id) == source_id) {
+ break;
+ }
+
+ err_source_entry += HEST_GHES_V2_TABLE_SIZE;
+ }
+ if (i == num_sources) {
+ error_setg(errp, "HEST: Source %d not found.", source_id);
+ return;
+ }
+
+ /* Navigate through table address pointers */
+ hest_err_block_addr = err_source_entry + GHES_ERR_STATUS_ADDR_OFF +
+ GAS_ADDR_OFFSET;
+
+ cpu_physical_memory_read(hest_err_block_addr, &error_block_addr,
+ sizeof(error_block_addr));
+
+ error_block_addr = le64_to_cpu(error_block_addr);
+
+ cpu_physical_memory_read(error_block_addr, cper_addr,
+ sizeof(*cper_addr));
+
+ *cper_addr = le64_to_cpu(*cper_addr);
+
+ hest_read_ack_addr = err_source_entry + GHES_READ_ACK_ADDR_OFF +
+ GAS_ADDR_OFFSET;
+
+ cpu_physical_memory_read(hest_read_ack_addr, read_ack_start_addr,
+ sizeof(*read_ack_start_addr));
+
+ *read_ack_start_addr = le64_to_cpu(*read_ack_start_addr);
+}
+
void ghes_record_cper_errors(const void *cper, size_t len,
uint16_t source_id, Error **errp)
{
@@ -440,8 +539,13 @@ void ghes_record_cper_errors(const void *cper, size_t len,
}
ags = &acpi_ged_state->ghes_state;
- get_hw_error_offsets(le64_to_cpu(ags->hw_error_le),
- &cper_addr, &read_ack_register_addr);
+ if (!ags->hest_addr_le) {
+ get_hw_error_offsets(le64_to_cpu(ags->hw_error_le),
+ &cper_addr, &read_ack_register_addr);
+ } else {
+ get_ghes_source_offsets(source_id, le64_to_cpu(ags->hest_addr_le),
+ &cper_addr, &read_ack_register_addr, errp);
+ }
if (!cper_addr) {
error_setg(errp, "can not find Generic Error Status Block");
diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
index 237721fec0a2..6c2e57af0456 100644
--- a/include/hw/acpi/ghes.h
+++ b/include/hw/acpi/ghes.h
@@ -61,6 +61,7 @@ typedef struct AcpiGhesState {
uint64_t hest_addr_le;
uint64_t hw_error_le;
bool present; /* True if GHES is present at all on this board */
+ bool use_hest_addr; /* True if HEST address is present */
} AcpiGhesState;
/*
--
2.48.1
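For readers following the lookup done by get_ghes_source_offsets(), here is a
minimal, self-contained sketch of the same walk over a HEST image held in a
plain byte buffer instead of guest memory. It is only an illustration: the
helper names (read_le16, read_le32, hest_find_ghes_v2) and the DEMO_* macros
are hypothetical and not part of QEMU. The 36-byte ACPI table header and the
GHESv2 type value 10 come from the ACPI specification; the 92-byte entry size
is HEST_GHES_V2_TABLE_SIZE from the patch above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ACPI_HEADER_SIZE   36  /* standard ACPI table header */
#define DEMO_GHES_V2_SIZE       92  /* HEST_GHES_V2_TABLE_SIZE in the patch */
#define DEMO_TYPE_GHES_V2       10  /* ACPI HEST source type for GHESv2 */

/* Read little-endian values byte by byte, independent of host endianness */
static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Return the offset (from the start of the HEST image) of the GHESv2
 * entry whose source ID matches. Return -1 if no entry matches, if a
 * non-GHESv2 entry is met, or if the buffer is too short.
 */
static long hest_find_ghes_v2(const uint8_t *hest, size_t len,
                              uint16_t source_id)
{
    size_t off = DEMO_ACPI_HEADER_SIZE;
    uint32_t num_sources, i;

    assert(hest != NULL);

    if (len < off + sizeof(uint32_t)) {
        return -1;
    }
    /* The error source count follows the ACPI header */
    num_sources = read_le32(hest + off);
    off += sizeof(uint32_t);

    for (i = 0; i < num_sources; i++) {
        if (len - off < DEMO_GHES_V2_SIZE) {
            return -1;
        }
        /* Only the GHESv2 entry size is known, so stop on anything else */
        if (read_le16(hest + off) != DEMO_TYPE_GHES_V2) {
            return -1;
        }
        /* The 16-bit source ID follows the 16-bit type field */
        if (read_le16(hest + off + 2) == source_id) {
            return (long)off;
        }
        off += DEMO_GHES_V2_SIZE;
    }
    return -1;
}
```

Unlike the patch, the sketch bounds-checks every access, because it has no
guarantee that the buffer holds a table built by acpi_build_hest().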
* [PATCH v3 04/14] acpi/generic_event_device: Update GHES migration to cover hest addr
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
` (2 preceding siblings ...)
2025-01-31 17:42 ` [PATCH v3 03/14] acpi/ghes: Use HEST table offsets when preparing GHES records Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
2025-01-31 17:42 ` [PATCH v3 05/14] acpi/generic_event_device: add logic to detect if HEST addr is available Mauro Carvalho Chehab
` (10 subsequent siblings)
14 siblings, 0 replies; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC (permalink / raw)
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Ani Sinha, linux-kernel
The GHES migration logic should now support HEST table location too.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
---
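A toy model of the optional subsection may help when reviewing the migration
change. It is not QEMU's VMState machinery, and every demo_* name is
hypothetical. It only shows the idea that the extra field travels in the
stream when the "needed" callback says so, which keeps streams from guests
without a HEST address unchanged. As a simplification, hw_error_le is always
written here, while in QEMU it lives in its own subsection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct DemoGhesState {
    uint64_t hest_addr_le;
    uint64_t hw_error_le;
} DemoGhesState;

/* Mirrors hest_needed(): send the subsection only if the address is set */
static bool demo_hest_needed(const DemoGhesState *s)
{
    return s->hest_addr_le != 0;
}

/*
 * Serialize into out[]: hw_error_le first, then hest_addr_le only when
 * the subsection is needed. Returns the number of bytes written.
 */
static size_t demo_save(const DemoGhesState *s, uint8_t *out, size_t cap)
{
    size_t n = 0;

    assert(s != NULL && out != NULL && cap >= 2 * sizeof(uint64_t));

    memcpy(out + n, &s->hw_error_le, sizeof(s->hw_error_le));
    n += sizeof(s->hw_error_le);

    if (demo_hest_needed(s)) {
        memcpy(out + n, &s->hest_addr_le, sizeof(s->hest_addr_le));
        n += sizeof(s->hest_addr_le);
    }
    return n;
}
```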
hw/acpi/generic_event_device.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/hw/acpi/generic_event_device.c b/hw/acpi/generic_event_device.c
index c85d97ca3776..5346cae573b7 100644
--- a/hw/acpi/generic_event_device.c
+++ b/hw/acpi/generic_event_device.c
@@ -386,6 +386,34 @@ static const VMStateDescription vmstate_ghes_state = {
}
};
+static const VMStateDescription vmstate_hest = {
+ .name = "acpi-hest",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .fields = (const VMStateField[]) {
+ VMSTATE_UINT64(hest_addr_le, AcpiGhesState),
+ VMSTATE_END_OF_LIST()
+ },
+};
+
+static bool hest_needed(void *opaque)
+{
+ AcpiGedState *s = opaque;
+ return s->ghes_state.hest_addr_le;
+}
+
+static const VMStateDescription vmstate_hest_state = {
+ .name = "acpi-ged/hest",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .needed = hest_needed,
+ .fields = (const VMStateField[]) {
+ VMSTATE_STRUCT(ghes_state, AcpiGedState, 1,
+ vmstate_hest, AcpiGhesState),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
static const VMStateDescription vmstate_acpi_ged = {
.name = "acpi-ged",
.version_id = 1,
@@ -398,6 +426,7 @@ static const VMStateDescription vmstate_acpi_ged = {
&vmstate_memhp_state,
&vmstate_cpuhp_state,
&vmstate_ghes_state,
+ &vmstate_hest_state,
NULL
}
};
--
2.48.1
* [PATCH v3 05/14] acpi/generic_event_device: add logic to detect if HEST addr is available
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
` (3 preceding siblings ...)
2025-01-31 17:42 ` [PATCH v3 04/14] acpi/generic_event_device: Update GHES migration to cover hest addr Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
2025-02-03 14:56 ` Igor Mammedov
2025-01-31 17:42 ` [PATCH v3 06/14] acpi/ghes: only set hw_error_le or hest_addr_le Mauro Carvalho Chehab
` (9 subsequent siblings)
14 siblings, 1 reply; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC (permalink / raw)
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Philippe Mathieu-Daudé, Ani Sinha,
Dongjiu Geng, Eduardo Habkost, Marcel Apfelbaum, Peter Maydell,
Shannon Zhao, Yanan Wang, Zhao Liu, linux-kernel
Create a new property (x-has-hest-addr) and use it to detect whether
the GHES table offsets can be calculated from the HEST address
(QEMU 10.0 and newer) or must be obtained the legacy way, from an
offset stored in the hardware_errors firmware file.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
hw/acpi/generic_event_device.c | 1 +
hw/acpi/ghes.c | 17 ++++++-----------
hw/arm/virt-acpi-build.c | 32 ++++++++++++++++++++++++++++----
hw/core/machine.c | 2 ++
include/hw/acpi/ghes.h | 3 ++-
5 files changed, 39 insertions(+), 16 deletions(-)
diff --git a/hw/acpi/generic_event_device.c b/hw/acpi/generic_event_device.c
index 5346cae573b7..14d8513a5440 100644
--- a/hw/acpi/generic_event_device.c
+++ b/hw/acpi/generic_event_device.c
@@ -318,6 +318,7 @@ static void acpi_ged_send_event(AcpiDeviceIf *adev, AcpiEventStatusBits ev)
static const Property acpi_ged_properties[] = {
DEFINE_PROP_UINT32("ged-event", AcpiGedState, ged_event_bitmap, 0),
+ DEFINE_PROP_BOOL("x-has-hest-addr", AcpiGedState, ghes_state.use_hest_addr, false),
};
static const VMStateDescription vmstate_memhp_state = {
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index 8f284fd191a6..a91dcd777433 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -359,7 +359,8 @@ static void build_ghes_v2_entry(GArray *table_data,
}
/* Build Hardware Error Source Table */
-void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
+void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
+ GArray *hardware_errors,
BIOSLinker *linker,
const AcpiNotificationSourceId *notif_source,
int num_sources,
@@ -369,14 +370,6 @@ void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
.oem_id = oem_id, .oem_table_id = oem_table_id };
uint32_t hest_offset;
int i;
- AcpiGedState *acpi_ged_state;
- AcpiGhesState *ags = NULL;
-
- acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
- NULL));
- if (acpi_ged_state) {
- ags = &acpi_ged_state->ghes_state;
- }
hest_offset = table_data->len;
@@ -415,8 +408,10 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, NULL, NULL,
NULL, &(ags->hw_error_le), sizeof(ags->hw_error_le), false);
- fw_cfg_add_file_callback(s, ACPI_HEST_ADDR_FW_CFG_FILE, NULL, NULL,
- NULL, &(ags->hest_addr_le), sizeof(ags->hest_addr_le), false);
+ if (ags->use_hest_addr) {
+ fw_cfg_add_file_callback(s, ACPI_HEST_ADDR_FW_CFG_FILE, NULL, NULL,
+ NULL, &(ags->hest_addr_le), sizeof(ags->hest_addr_le), false);
+ }
ags->present = true;
}
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 3d411787fc37..9de51105a513 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -897,6 +897,10 @@ static const AcpiNotificationSourceId hest_ghes_notify[] = {
{ ACPI_HEST_SRC_ID_SYNC, ACPI_GHES_NOTIFY_SEA },
};
+static const AcpiNotificationSourceId hest_ghes_notify_9_2[] = {
+ { ACPI_HEST_SRC_ID_SYNC, ACPI_GHES_NOTIFY_SEA },
+};
+
static
void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
{
@@ -950,10 +954,30 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
build_dbg2(tables_blob, tables->linker, vms);
if (vms->ras) {
- acpi_add_table(table_offsets, tables_blob);
- acpi_build_hest(tables_blob, tables->hardware_errors, tables->linker,
- hest_ghes_notify, ARRAY_SIZE(hest_ghes_notify),
- vms->oem_id, vms->oem_table_id);
+ static const AcpiNotificationSourceId *notify;
+ AcpiGedState *acpi_ged_state;
+ unsigned int notify_sz;
+ AcpiGhesState *ags;
+
+ acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
+ NULL));
+ if (acpi_ged_state) {
+ ags = &acpi_ged_state->ghes_state;
+
+ acpi_add_table(table_offsets, tables_blob);
+
+ if (!ags->use_hest_addr) {
+ notify = hest_ghes_notify_9_2;
+ notify_sz = ARRAY_SIZE(hest_ghes_notify_9_2);
+ } else {
+ notify = hest_ghes_notify;
+ notify_sz = ARRAY_SIZE(hest_ghes_notify);
+ }
+
+ acpi_build_hest(ags, tables_blob, tables->hardware_errors,
+ tables->linker, notify, notify_sz,
+ vms->oem_id, vms->oem_table_id);
+ }
}
if (ms->numa_state->num_nodes > 0) {
diff --git a/hw/core/machine.c b/hw/core/machine.c
index c23b39949649..0d0cde481954 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -34,10 +34,12 @@
#include "hw/virtio/virtio-pci.h"
#include "hw/virtio/virtio-net.h"
#include "hw/virtio/virtio-iommu.h"
+#include "hw/acpi/generic_event_device.h"
#include "audio/audio.h"
GlobalProperty hw_compat_9_2[] = {
{"arm-cpu", "backcompat-pauth-default-use-qarma5", "true"},
+ { TYPE_ACPI_GED, "x-has-hest-addr", "false" },
};
const size_t hw_compat_9_2_len = G_N_ELEMENTS(hw_compat_9_2);
diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
index 6c2e57af0456..bfc8fd851648 100644
--- a/include/hw/acpi/ghes.h
+++ b/include/hw/acpi/ghes.h
@@ -76,7 +76,8 @@ typedef struct AcpiNotificationSourceId {
enum AcpiGhesNotifyType notify;
} AcpiNotificationSourceId;
-void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
+void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
+ GArray *hardware_errors,
BIOSLinker *linker,
const AcpiNotificationSourceId * const notif_source,
int num_sources,
--
2.48.1
* [PATCH v3 06/14] acpi/ghes: only set hw_error_le or hest_addr_le
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
` (4 preceding siblings ...)
2025-01-31 17:42 ` [PATCH v3 05/14] acpi/generic_event_device: add logic to detect if HEST addr is available Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
2025-02-03 10:48 ` Jonathan Cameron via
2025-01-31 17:42 ` [PATCH v3 07/14] acpi/ghes: add a notifier to notify when error data is ready Mauro Carvalho Chehab
` (8 subsequent siblings)
14 siblings, 1 reply; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC (permalink / raw)
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Ani Sinha, Dongjiu Geng, linux-kernel
The hw_error_le pointer is used for legacy support (virt-9.2).
Starting from virt-10.0, the HEST table is accessed via hest_addr_le.
Remove the legacy fw_cfg logic when the virt machine is 10.0 or newer.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
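The selection made by this patch can be summarized with the small sketch
below. The demo_* helpers are hypothetical and only restate the intent: one
address file is registered, and only the matching field is expected to be
filled in by firmware. "etc/hardware_errors_addr" is the name quoted in the
ghes.c comment; the HEST file name used here is an assumption about the value
of ACPI_HEST_ADDR_FW_CFG_FILE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Name quoted in ghes.c comments */
#define DEMO_HW_ERROR_ADDR_FILE "etc/hardware_errors_addr"
/* Assumed value of ACPI_HEST_ADDR_FW_CFG_FILE */
#define DEMO_HEST_ADDR_FILE     "etc/acpi_table_hest_addr"

/* Pick the single writable fw_cfg address file for this machine version */
static const char *demo_addr_fw_cfg_file(bool use_hest_addr)
{
    return use_hest_addr ? DEMO_HEST_ADDR_FILE : DEMO_HW_ERROR_ADDR_FILE;
}

/*
 * With only one file registered, firmware can fill in only one address:
 * the other one must stay zero.
 */
static bool demo_state_is_consistent(bool use_hest_addr,
                                     uint64_t hw_error_le,
                                     uint64_t hest_addr_le)
{
    return use_hest_addr ? hw_error_le == 0 : hest_addr_le == 0;
}
```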
hw/acpi/ghes.c | 30 ++++++++++++++++--------------
1 file changed, 16 insertions(+), 14 deletions(-)
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index a91dcd777433..ba8b1a3a13dc 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -232,8 +232,8 @@ ghes_gen_err_data_uncorrectable_recoverable(GArray *block,
* Initialize "etc/hardware_errors" and "etc/hardware_errors_addr" fw_cfg blobs.
* See docs/specs/acpi_hest_ghes.rst for blobs format.
*/
-static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker,
- int num_sources)
+static void build_ghes_error_table(AcpiGhesState *ags, GArray *hardware_errors,
+ BIOSLinker *linker, int num_sources)
{
int i, error_status_block_offset;
@@ -278,13 +278,15 @@ static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker,
i * ACPI_GHES_MAX_RAW_DATA_LENGTH);
}
- /*
- * tell firmware to write hardware_errors GPA into
- * hardware_errors_addr fw_cfg, once the former has been initialized.
- */
- bios_linker_loader_write_pointer(linker, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, 0,
- sizeof(uint64_t),
- ACPI_HW_ERROR_FW_CFG_FILE, 0);
+ if (!ags->use_hest_addr) {
+ /*
+ * Tell firmware to write hardware_errors GPA into
+ * hardware_errors_addr fw_cfg, once the former has been initialized.
+ */
+ bios_linker_loader_write_pointer(linker, ACPI_HW_ERROR_ADDR_FW_CFG_FILE,
+ 0, sizeof(uint64_t),
+ ACPI_HW_ERROR_FW_CFG_FILE, 0);
+ }
}
/* Build Generic Hardware Error Source version 2 (GHESv2) */
@@ -373,7 +375,7 @@ void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
hest_offset = table_data->len;
- build_ghes_error_table(hardware_errors, linker, num_sources);
+ build_ghes_error_table(ags, hardware_errors, linker, num_sources);
acpi_table_begin(&table, table_data);
@@ -404,13 +406,13 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
fw_cfg_add_file(s, ACPI_HW_ERROR_FW_CFG_FILE, hardware_error->data,
hardware_error->len);
- /* Create a read-write fw_cfg file for Address */
- fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, NULL, NULL,
- NULL, &(ags->hw_error_le), sizeof(ags->hw_error_le), false);
-
if (ags->use_hest_addr) {
fw_cfg_add_file_callback(s, ACPI_HEST_ADDR_FW_CFG_FILE, NULL, NULL,
NULL, &(ags->hest_addr_le), sizeof(ags->hest_addr_le), false);
+ } else {
+ /* Create a read-write fw_cfg file for Address */
+ fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, NULL, NULL,
+ NULL, &(ags->hw_error_le), sizeof(ags->hw_error_le), false);
}
ags->present = true;
--
2.48.1
* [PATCH v3 07/14] acpi/ghes: add a notifier to notify when error data is ready
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
` (5 preceding siblings ...)
2025-01-31 17:42 ` [PATCH v3 06/14] acpi/ghes: only set hw_error_le or hest_addr_le Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
2025-01-31 17:42 ` [PATCH v3 08/14] acpi/ghes: Cleanup the code which gets ghes ged state Mauro Carvalho Chehab
` (7 subsequent siblings)
14 siblings, 0 replies; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC (permalink / raw)
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Ani Sinha, Dongjiu Geng, linux-kernel
Some error injection notify methods are async, like GPIO
notify. Add a notifier to be used when the error record is
ready to be sent to the guest OS.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
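For reviewers unfamiliar with the notifier pattern, here is a stand-alone
sketch of the idea: subscribers register a callback, and the producer walks
the list once the error record has been written. This is a simplified model
with hypothetical Demo* names, not QEMU's qemu/notify.h implementation.

```c
#include <assert.h>
#include <stddef.h>

typedef struct DemoNotifier DemoNotifier;
struct DemoNotifier {
    void (*notify)(DemoNotifier *n, void *data);
    DemoNotifier *next;
};

typedef struct DemoNotifierList {
    DemoNotifier *head;
} DemoNotifierList;

/* Register a subscriber at the head of the list */
static void demo_notifier_list_add(DemoNotifierList *list, DemoNotifier *n)
{
    assert(list != NULL && n != NULL && n->notify != NULL);
    n->next = list->head;
    list->head = n;
}

/* Call every registered subscriber with the same data pointer */
static void demo_notifier_list_notify(DemoNotifierList *list, void *data)
{
    DemoNotifier *n;

    assert(list != NULL);
    for (n = list->head; n != NULL; n = n->next) {
        n->notify(n, data);
    }
}

/* Example subscriber: counts how many error records became ready */
static int demo_records_ready;

static void demo_on_error_ready(DemoNotifier *n, void *data)
{
    (void)n;
    (void)data;
    demo_records_ready++;
}
```

In the patch, the producer side is the notifier_list_notify() call at the end
of ghes_record_cper_errors(); the subscriber is expected to raise the actual
notification towards the guest.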
hw/acpi/ghes.c | 5 ++++-
include/hw/acpi/ghes.h | 3 +++
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index ba8b1a3a13dc..dd93f0fc93fd 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -516,6 +516,9 @@ static void get_ghes_source_offsets(uint16_t source_id,
*read_ack_start_addr = le64_to_cpu(*read_ack_start_addr);
}
+NotifierList acpi_generic_error_notifiers =
+ NOTIFIER_LIST_INITIALIZER(error_device_notifiers);
+
void ghes_record_cper_errors(const void *cper, size_t len,
uint16_t source_id, Error **errp)
{
@@ -571,7 +574,7 @@ void ghes_record_cper_errors(const void *cper, size_t len,
/* Write the generic error data entry into guest memory */
cpu_physical_memory_write(cper_addr, cper, len);
- return;
+ notifier_list_notify(&acpi_generic_error_notifiers, NULL);
}
int acpi_ghes_memory_errors(uint16_t source_id, uint64_t physical_address)
diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
index bfc8fd851648..80a0c3fcfaca 100644
--- a/include/hw/acpi/ghes.h
+++ b/include/hw/acpi/ghes.h
@@ -24,6 +24,9 @@
#include "hw/acpi/bios-linker-loader.h"
#include "qapi/error.h"
+#include "qemu/notify.h"
+
+extern NotifierList acpi_generic_error_notifiers;
/*
* Values for Hardware Error Notification Type field
--
2.48.1
* [PATCH v3 08/14] acpi/ghes: Cleanup the code which gets ghes ged state
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
` (6 preceding siblings ...)
2025-01-31 17:42 ` [PATCH v3 07/14] acpi/ghes: add a notifier to notify when error data is ready Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
2025-02-03 10:51 ` Jonathan Cameron via
2025-01-31 17:42 ` [PATCH v3 09/14] acpi/generic_event_device: add an APEI error device Mauro Carvalho Chehab
` (6 subsequent siblings)
14 siblings, 1 reply; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC (permalink / raw)
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Ani Sinha, Dongjiu Geng, Paolo Bonzini,
Peter Maydell, kvm, linux-kernel
Move the check logic into a common function and simplify the
code which checks if GHES is enabled and was properly set up.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
---
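The calling convention after this cleanup can be sketched as follows. The
Demo* names are hypothetical; the sketch takes the candidate state as a
parameter instead of resolving the GED object, and it does not build a CPER
record. It only shows the "get state once, NULL means GHES is unusable"
pattern that callers such as kvm_arch_on_sigbus_vcpu() now follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoGhesState {
    uint64_t hest_addr_le;
    uint64_t hw_error_le;
    bool use_hest_addr;
} DemoGhesState;

/* Return the state only if firmware filled in one of the two addresses */
static DemoGhesState *demo_ghes_get_state(DemoGhesState *candidate)
{
    if (!candidate) {
        return NULL;
    }
    if (!candidate->hw_error_le && !candidate->hest_addr_le) {
        return NULL;
    }
    return candidate;
}

/* Caller pattern: a single NULL check replaces the old "present" flag */
static int demo_report_memory_error(DemoGhesState *candidate, uint64_t paddr)
{
    DemoGhesState *ags = demo_ghes_get_state(candidate);

    if (!ags) {
        return -1;
    }
    (void)paddr; /* a real caller would record a memory error CPER here */
    return 0;
}
```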
hw/acpi/ghes-stub.c | 7 ++++---
hw/acpi/ghes.c | 40 ++++++++++++----------------------------
include/hw/acpi/ghes.h | 15 ++++++++-------
target/arm/kvm.c | 7 +++++--
4 files changed, 29 insertions(+), 40 deletions(-)
diff --git a/hw/acpi/ghes-stub.c b/hw/acpi/ghes-stub.c
index 7cec1812dad9..40f660c246fe 100644
--- a/hw/acpi/ghes-stub.c
+++ b/hw/acpi/ghes-stub.c
@@ -11,12 +11,13 @@
#include "qemu/osdep.h"
#include "hw/acpi/ghes.h"
-int acpi_ghes_memory_errors(uint16_t source_id, uint64_t physical_address)
+int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
+ uint64_t physical_address)
{
return -1;
}
-bool acpi_ghes_present(void)
+AcpiGhesState *acpi_ghes_get_state(void)
{
- return false;
+ return NULL;
}
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index dd93f0fc93fd..b25e61537c87 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -414,18 +414,12 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, NULL, NULL,
NULL, &(ags->hw_error_le), sizeof(ags->hw_error_le), false);
}
-
- ags->present = true;
}
static void get_hw_error_offsets(uint64_t ghes_addr,
uint64_t *cper_addr,
uint64_t *read_ack_register_addr)
{
- if (!ghes_addr) {
- return;
- }
-
/*
* non-HEST version supports only one source, so no need to change
* the start offset based on the source ID. Also, we can't validate
@@ -519,27 +513,17 @@ static void get_ghes_source_offsets(uint16_t source_id,
NotifierList acpi_generic_error_notifiers =
NOTIFIER_LIST_INITIALIZER(error_device_notifiers);
-void ghes_record_cper_errors(const void *cper, size_t len,
+void ghes_record_cper_errors(AcpiGhesState *ags, const void *cper, size_t len,
uint16_t source_id, Error **errp)
{
uint64_t cper_addr = 0, read_ack_register_addr = 0, read_ack_register;
- AcpiGedState *acpi_ged_state;
- AcpiGhesState *ags;
if (len > ACPI_GHES_MAX_RAW_DATA_LENGTH) {
error_setg(errp, "GHES CPER record is too big: %zd", len);
return;
}
- acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
- NULL));
- if (!acpi_ged_state) {
- error_setg(errp, "Can't find ACPI_GED object");
- return;
- }
- ags = &acpi_ged_state->ghes_state;
-
- if (!ags->hest_addr_le) {
+ if (!ags->use_hest_addr) {
get_hw_error_offsets(le64_to_cpu(ags->hw_error_le),
&cper_addr, &read_ack_register_addr);
} else {
@@ -547,11 +531,6 @@ void ghes_record_cper_errors(const void *cper, size_t len,
&cper_addr, &read_ack_register_addr, errp);
}
- if (!cper_addr) {
- error_setg(errp, "can not find Generic Error Status Block");
- return;
- }
-
cpu_physical_memory_read(read_ack_register_addr,
&read_ack_register, sizeof(read_ack_register));
@@ -577,7 +556,8 @@ void ghes_record_cper_errors(const void *cper, size_t len,
notifier_list_notify(&acpi_generic_error_notifiers, NULL);
}
-int acpi_ghes_memory_errors(uint16_t source_id, uint64_t physical_address)
+int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
+ uint64_t physical_address)
{
/* Memory Error Section Type */
const uint8_t guid[] =
@@ -603,7 +583,7 @@ int acpi_ghes_memory_errors(uint16_t source_id, uint64_t physical_address)
acpi_ghes_build_append_mem_cper(block, physical_address);
/* Report the error */
- ghes_record_cper_errors(block->data, block->len, source_id, &errp);
+ ghes_record_cper_errors(ags, block->data, block->len, source_id, &errp);
g_array_free(block, true);
@@ -615,7 +595,7 @@ int acpi_ghes_memory_errors(uint16_t source_id, uint64_t physical_address)
return 0;
}
-bool acpi_ghes_present(void)
+AcpiGhesState *acpi_ghes_get_state(void)
{
AcpiGedState *acpi_ged_state;
AcpiGhesState *ags;
@@ -624,8 +604,12 @@ bool acpi_ghes_present(void)
NULL));
if (!acpi_ged_state) {
- return false;
+ return NULL;
}
ags = &acpi_ged_state->ghes_state;
- return ags->present;
+
+ if (!ags->hw_error_le && !ags->hest_addr_le) {
+ return NULL;
+ }
+ return ags;
}
diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
index 80a0c3fcfaca..e1b66141d01c 100644
--- a/include/hw/acpi/ghes.h
+++ b/include/hw/acpi/ghes.h
@@ -63,7 +63,6 @@ enum AcpiGhesNotifyType {
typedef struct AcpiGhesState {
uint64_t hest_addr_le;
uint64_t hw_error_le;
- bool present; /* True if GHES is present at all on this board */
bool use_hest_addr; /* True if HEST address is present */
} AcpiGhesState;
@@ -87,15 +86,17 @@ void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
const char *oem_id, const char *oem_table_id);
void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s,
GArray *hardware_errors);
-int acpi_ghes_memory_errors(uint16_t source_id, uint64_t error_physical_addr);
-void ghes_record_cper_errors(const void *cper, size_t len,
+int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
+ uint64_t error_physical_addr);
+void ghes_record_cper_errors(AcpiGhesState *ags, const void *cper, size_t len,
uint16_t source_id, Error **errp);
/**
- * acpi_ghes_present: Report whether ACPI GHES table is present
+ * acpi_ghes_get_state: Get a pointer to the ACPI GHES state
*
- * Returns: true if the system has an ACPI GHES table and it is
- * safe to call acpi_ghes_memory_errors() to record a memory error.
+ * Returns: a pointer to the GHES state if the system has an ACPI GHES
+ * table, it is enabled and it is safe to call acpi_ghes_memory_errors()
+ * to record a memory error. Returns NULL otherwise.
*/
-bool acpi_ghes_present(void);
+AcpiGhesState *acpi_ghes_get_state(void);
#endif
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index da30bdbb2349..80ca7779797b 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2366,10 +2366,12 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
{
ram_addr_t ram_addr;
hwaddr paddr;
+ AcpiGhesState *ags;
assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
- if (acpi_ghes_present() && addr) {
+ ags = acpi_ghes_get_state();
+ if (ags && addr) {
ram_addr = qemu_ram_addr_from_host(addr);
if (ram_addr != RAM_ADDR_INVALID &&
kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
@@ -2387,7 +2389,8 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
*/
if (code == BUS_MCEERR_AR) {
kvm_cpu_synchronize_state(c);
- if (!acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
+ if (!acpi_ghes_memory_errors(ags, ACPI_HEST_SRC_ID_SEA,
+ paddr)) {
kvm_inject_arm_sea(c);
} else {
error_report("failed to record the error");
--
2.48.1
* [PATCH v3 09/14] acpi/generic_event_device: add an APEI error device
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
` (7 preceding siblings ...)
2025-01-31 17:42 ` [PATCH v3 08/14] acpi/ghes: Cleanup the code which gets ghes ged state Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
2025-01-31 17:42 ` [PATCH v3 10/14] tests/acpi: virt: allow acpi table changes for a new table: HEST Mauro Carvalho Chehab
` (5 subsequent siblings)
14 siblings, 0 replies; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC (permalink / raw)
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Ani Sinha, linux-kernel
Add a generic error device, using HID PNP0C33, to handle generic
hardware error events as specified in section 18.3.2.7.2 of the
ACPI 6.5 specification:
https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#event-notification-for-generic-error-sources
The PNP0C33 device is used to report hardware errors to
the guest via ACPI APEI Generic Hardware Error Source (GHES).
Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Co-authored-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
---
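The event routing added to acpi_ged_send_event() can be illustrated with the
sketch below. It models only the three events whose numeric values are
visible in this patch, keeps the same priority order as the if/else chain,
and uses hypothetical DEMO_* and demo_* names. Returning 0 for events the
sketch does not model is a choice made here, not QEMU behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* AcpiEventStatusBits values visible in the patch */
#define DEMO_ACPI_NVDIMM_HOTPLUG_STATUS 16
#define DEMO_ACPI_POWER_DOWN_STATUS     64
#define DEMO_ACPI_GENERIC_ERROR         128

/* GED selector bits from generic_event_device.h */
#define DEMO_GED_PWR_DOWN_EVT        0x2
#define DEMO_GED_NVDIMM_HOTPLUG_EVT  0x4
#define DEMO_GED_ERROR_EVT           0x10

/* Map an ACPI event status bit to the GED selector bit read by the guest */
static uint32_t demo_ged_select(uint32_t ev)
{
    if (ev & DEMO_ACPI_POWER_DOWN_STATUS) {
        return DEMO_GED_PWR_DOWN_EVT;
    } else if (ev & DEMO_ACPI_GENERIC_ERROR) {
        return DEMO_GED_ERROR_EVT;
    } else if (ev & DEMO_ACPI_NVDIMM_HOTPLUG_STATUS) {
        return DEMO_GED_NVDIMM_HOTPLUG_EVT;
    }
    return 0; /* event not modelled by this sketch */
}
```

On the guest side, the generated AML then notifies the error device
(ACPI_APEI_ERROR_DEVICE) with value 0x80 when the error selector bit is set.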
hw/acpi/aml-build.c | 10 ++++++++++
hw/acpi/generic_event_device.c | 13 +++++++++++++
include/hw/acpi/acpi_dev_interface.h | 1 +
include/hw/acpi/aml-build.h | 2 ++
include/hw/acpi/generic_event_device.h | 1 +
5 files changed, 27 insertions(+)
diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index f8f93a9f66c8..e4bd7b611372 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -2614,3 +2614,13 @@ Aml *aml_i2c_serial_bus_device(uint16_t address, const char *resource_source)
return var;
}
+
+/* ACPI 5.0b: 18.3.2.6.2 Event Notification For Generic Error Sources */
+Aml *aml_error_device(void)
+{
+ Aml *dev = aml_device(ACPI_APEI_ERROR_DEVICE);
+ aml_append(dev, aml_name_decl("_HID", aml_string("PNP0C33")));
+ aml_append(dev, aml_name_decl("_UID", aml_int(0)));
+
+ return dev;
+}
diff --git a/hw/acpi/generic_event_device.c b/hw/acpi/generic_event_device.c
index 14d8513a5440..180eebbce1cd 100644
--- a/hw/acpi/generic_event_device.c
+++ b/hw/acpi/generic_event_device.c
@@ -26,6 +26,7 @@ static const uint32_t ged_supported_events[] = {
ACPI_GED_PWR_DOWN_EVT,
ACPI_GED_NVDIMM_HOTPLUG_EVT,
ACPI_GED_CPU_HOTPLUG_EVT,
+ ACPI_GED_ERROR_EVT,
};
/*
@@ -116,6 +117,16 @@ void build_ged_aml(Aml *table, const char *name, HotplugHandler *hotplug_dev,
aml_notify(aml_name(ACPI_POWER_BUTTON_DEVICE),
aml_int(0x80)));
break;
+ case ACPI_GED_ERROR_EVT:
+ /*
+ * ACPI 5.0b: 5.6.6 Device Object Notifications
+ * Table 5-135 Error Device Notification Values
+ * Defines 0x80 as the value to be used on notifications
+ */
+ aml_append(if_ctx,
+ aml_notify(aml_name(ACPI_APEI_ERROR_DEVICE),
+ aml_int(0x80)));
+ break;
case ACPI_GED_NVDIMM_HOTPLUG_EVT:
aml_append(if_ctx,
aml_notify(aml_name("\\_SB.NVDR"),
@@ -295,6 +306,8 @@ static void acpi_ged_send_event(AcpiDeviceIf *adev, AcpiEventStatusBits ev)
sel = ACPI_GED_MEM_HOTPLUG_EVT;
} else if (ev & ACPI_POWER_DOWN_STATUS) {
sel = ACPI_GED_PWR_DOWN_EVT;
+ } else if (ev & ACPI_GENERIC_ERROR) {
+ sel = ACPI_GED_ERROR_EVT;
} else if (ev & ACPI_NVDIMM_HOTPLUG_STATUS) {
sel = ACPI_GED_NVDIMM_HOTPLUG_EVT;
} else if (ev & ACPI_CPU_HOTPLUG_STATUS) {
diff --git a/include/hw/acpi/acpi_dev_interface.h b/include/hw/acpi/acpi_dev_interface.h
index 68d9d15f50aa..8294f8f0ccca 100644
--- a/include/hw/acpi/acpi_dev_interface.h
+++ b/include/hw/acpi/acpi_dev_interface.h
@@ -13,6 +13,7 @@ typedef enum {
ACPI_NVDIMM_HOTPLUG_STATUS = 16,
ACPI_VMGENID_CHANGE_STATUS = 32,
ACPI_POWER_DOWN_STATUS = 64,
+ ACPI_GENERIC_ERROR = 128,
} AcpiEventStatusBits;
#define TYPE_ACPI_DEVICE_IF "acpi-device-interface"
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index c18f68134246..f38e12971932 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -252,6 +252,7 @@ struct CrsRangeSet {
/* Consumer/Producer */
#define AML_SERIAL_BUS_FLAG_CONSUME_ONLY (1 << 1)
+#define ACPI_APEI_ERROR_DEVICE "GEDD"
/**
* init_aml_allocator:
*
@@ -382,6 +383,7 @@ Aml *aml_dma(AmlDmaType typ, AmlDmaBusMaster bm, AmlTransferSize sz,
uint8_t channel);
Aml *aml_sleep(uint64_t msec);
Aml *aml_i2c_serial_bus_device(uint16_t address, const char *resource_source);
+Aml *aml_error_device(void);
/* Block AML object primitives */
Aml *aml_scope(const char *name_format, ...) G_GNUC_PRINTF(1, 2);
diff --git a/include/hw/acpi/generic_event_device.h b/include/hw/acpi/generic_event_device.h
index d2dac87b4a9f..1c18ac296fcb 100644
--- a/include/hw/acpi/generic_event_device.h
+++ b/include/hw/acpi/generic_event_device.h
@@ -101,6 +101,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(AcpiGedState, ACPI_GED)
#define ACPI_GED_PWR_DOWN_EVT 0x2
#define ACPI_GED_NVDIMM_HOTPLUG_EVT 0x4
#define ACPI_GED_CPU_HOTPLUG_EVT 0x8
+#define ACPI_GED_ERROR_EVT 0x10
typedef struct GEDState {
MemoryRegion evt;
--
2.48.1
* [PATCH v3 10/14] tests/acpi: virt: allow acpi table changes for a new table: HEST
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
` (8 preceding siblings ...)
2025-01-31 17:42 ` [PATCH v3 09/14] acpi/generic_event_device: add an APEI error device Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
2025-01-31 17:42 ` [PATCH v3 11/14] arm/virt: Wire up a GED error device for ACPI / GHES Mauro Carvalho Chehab
` (4 subsequent siblings)
14 siblings, 0 replies; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC (permalink / raw)
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Ani Sinha, linux-kernel
The DSDT table will also be affected by this change.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tests/data/acpi/aarch64/virt/HEST | 0
tests/qtest/bios-tables-test-allowed-diff.h | 1 +
2 files changed, 1 insertion(+)
create mode 100644 tests/data/acpi/aarch64/virt/HEST
diff --git a/tests/data/acpi/aarch64/virt/HEST b/tests/data/acpi/aarch64/virt/HEST
new file mode 100644
index 000000000000..e69de29bb2d1
diff --git a/tests/qtest/bios-tables-test-allowed-diff.h b/tests/qtest/bios-tables-test-allowed-diff.h
index dfb8523c8bf4..1a4c2277bd5a 100644
--- a/tests/qtest/bios-tables-test-allowed-diff.h
+++ b/tests/qtest/bios-tables-test-allowed-diff.h
@@ -1 +1,2 @@
/* List of comma-separated changed AML files to ignore */
+"tests/data/acpi/aarch64/virt/DSDT",
--
2.48.1
* [PATCH v3 11/14] arm/virt: Wire up a GED error device for ACPI / GHES
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
` (9 preceding siblings ...)
2025-01-31 17:42 ` [PATCH v3 10/14] tests/acpi: virt: allow acpi table changes for a new table: HEST Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
2025-01-31 17:42 ` [PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT Mauro Carvalho Chehab
` (3 subsequent siblings)
14 siblings, 0 replies; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC (permalink / raw)
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Ani Sinha, Peter Maydell, Shannon Zhao,
linux-kernel
Add support to the ARM virt machine for handling the generic error
ACPI event via GED and the error source device.
It is aligned with this Linux kernel patch:
https://lore.kernel.org/lkml/1272350481-27951-8-git-send-email-ying.huang@intel.com/
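As an illustration only (this is not part of the patch), the event wiring can be modelled in a few lines of Python. The bit values are the ones defined in include/hw/acpi/generic_event_device.h in this series; the helper name decode_ged_events and the display names are hypothetical. The check mirrors the "(Local0 & 0x10) == 0x10" test that the generated DSDT performs before notifying the error device.

```python
# Illustrative model only: GED event selector bits, with values taken from
# include/hw/acpi/generic_event_device.h as shown in this series.
ACPI_GED_PWR_DOWN_EVT = 0x2
ACPI_GED_NVDIMM_HOTPLUG_EVT = 0x4
ACPI_GED_CPU_HOTPLUG_EVT = 0x8
ACPI_GED_ERROR_EVT = 0x10

# Display names are assumptions made for this sketch, not QEMU identifiers.
GED_EVENTS = {
    ACPI_GED_PWR_DOWN_EVT: "power-down",
    ACPI_GED_NVDIMM_HOTPLUG_EVT: "nvdimm-hotplug",
    ACPI_GED_CPU_HOTPLUG_EVT: "cpu-hotplug",
    ACPI_GED_ERROR_EVT: "error",
}


def decode_ged_events(selector):
    """Return the names of all events whose bit is set in the selector."""
    return [name for bit, name in GED_EVENTS.items() if selector & bit == bit]


# Base bitmap programmed by create_acpi_ged() once this patch is applied
event = ACPI_GED_PWR_DOWN_EVT | ACPI_GED_ERROR_EVT
print(hex(event))
print(decode_ged_events(event))
```

With the base bitmap, both the power button and the error device are eligible for a Notify when their bit is set in the selector.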
Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Co-authored-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Igor Mammedov <imammedo@redhat.com>
---
Changes from v8:
- Added a call to the function that produces GHES generic
records, as this is now added earlier in this series.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
hw/acpi/generic_event_device.c | 2 +-
hw/arm/virt-acpi-build.c | 1 +
hw/arm/virt.c | 12 +++++++++++-
include/hw/arm/virt.h | 1 +
4 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/hw/acpi/generic_event_device.c b/hw/acpi/generic_event_device.c
index 180eebbce1cd..f5e899155d34 100644
--- a/hw/acpi/generic_event_device.c
+++ b/hw/acpi/generic_event_device.c
@@ -331,7 +331,7 @@ static void acpi_ged_send_event(AcpiDeviceIf *adev, AcpiEventStatusBits ev)
static const Property acpi_ged_properties[] = {
DEFINE_PROP_UINT32("ged-event", AcpiGedState, ged_event_bitmap, 0),
- DEFINE_PROP_BOOL("x-has-hest-addr", AcpiGedState, ghes_state.use_hest_addr, false),
+ DEFINE_PROP_BOOL("x-has-hest-addr", AcpiGedState, ghes_state.use_hest_addr, true),
};
static const VMStateDescription vmstate_memhp_state = {
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 9de51105a513..4f174795ed60 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -861,6 +861,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
}
acpi_dsdt_add_power_button(scope);
+ aml_append(scope, aml_error_device());
#ifdef CONFIG_TPM
acpi_dsdt_add_tpm(scope, vms);
#endif
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 99e0a68b6c55..e272b35ea114 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -678,7 +678,7 @@ static inline DeviceState *create_acpi_ged(VirtMachineState *vms)
DeviceState *dev;
MachineState *ms = MACHINE(vms);
int irq = vms->irqmap[VIRT_ACPI_GED];
- uint32_t event = ACPI_GED_PWR_DOWN_EVT;
+ uint32_t event = ACPI_GED_PWR_DOWN_EVT | ACPI_GED_ERROR_EVT;
if (ms->ram_slots) {
event |= ACPI_GED_MEM_HOTPLUG_EVT;
@@ -1010,6 +1010,13 @@ static void virt_powerdown_req(Notifier *n, void *opaque)
}
}
+static void virt_generic_error_req(Notifier *n, void *opaque)
+{
+ VirtMachineState *s = container_of(n, VirtMachineState, generic_error_notifier);
+
+ acpi_send_event(s->acpi_dev, ACPI_GENERIC_ERROR);
+}
+
static void create_gpio_keys(char *fdt, DeviceState *pl061_dev,
uint32_t phandle)
{
@@ -2404,6 +2411,9 @@ static void machvirt_init(MachineState *machine)
if (has_ged && aarch64 && firmware_loaded && virt_is_acpi_enabled(vms)) {
vms->acpi_dev = create_acpi_ged(vms);
+ vms->generic_error_notifier.notify = virt_generic_error_req;
+ notifier_list_add(&acpi_generic_error_notifiers,
+ &vms->generic_error_notifier);
} else {
create_gpio_devices(vms, VIRT_GPIO, sysmem);
}
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index c8e94e6aedc9..f3cf28436770 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -176,6 +176,7 @@ struct VirtMachineState {
DeviceState *gic;
DeviceState *acpi_dev;
Notifier powerdown_notifier;
+ Notifier generic_error_notifier;
PCIBus *bus;
char *oem_id;
char *oem_table_id;
--
2.48.1
* [PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
` (10 preceding siblings ...)
2025-01-31 17:42 ` [PATCH v3 11/14] arm/virt: Wire up a GED error device for ACPI / GHES Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
2025-02-03 10:53 ` Jonathan Cameron via
2025-01-31 17:42 ` [PATCH v3 13/14] qapi/acpi-hest: add an interface to do generic CPER error injection Mauro Carvalho Chehab
` (2 subsequent siblings)
14 siblings, 1 reply; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC (permalink / raw)
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Ani Sinha, linux-kernel
--- a/DSDT.dsl 2025-01-28 09:38:15.155347858 +0100
+++ b/DSDT.dsl 2025-01-28 09:39:01.684836954 +0100
@@ -9,9 +9,9 @@
*
* Original Table Header:
* Signature "DSDT"
- * Length 0x00001516 (5398)
+ * Length 0x00001542 (5442)
* Revision 0x02
- * Checksum 0x0F
+ * Checksum 0xE9
* OEM ID "BOCHS "
* OEM Table ID "BXPC "
* OEM Revision 0x00000001 (1)
@@ -1931,6 +1931,11 @@
{
Notify (PWRB, 0x80) // Status Change
}
+
+ If (((Local0 & 0x10) == 0x10))
+ {
+ Notify (GEDD, 0x80) // Status Change
+ }
}
}
@@ -1939,6 +1944,12 @@
Name (_HID, "PNP0C0C" /* Power Button Device */) // _HID: Hardware ID
Name (_UID, Zero) // _UID: Unique ID
}
+
+ Device (GEDD)
+ {
+ Name (_HID, "PNP0C33" /* Error Device */) // _HID: Hardware ID
+ Name (_UID, Zero) // _UID: Unique ID
+ }
}
}
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tests/data/acpi/aarch64/virt/DSDT | Bin 5196 -> 5240 bytes
.../data/acpi/aarch64/virt/DSDT.acpihmatvirt | Bin 5282 -> 5326 bytes
tests/data/acpi/aarch64/virt/DSDT.memhp | Bin 6557 -> 6601 bytes
tests/data/acpi/aarch64/virt/DSDT.pxb | Bin 7679 -> 7723 bytes
tests/data/acpi/aarch64/virt/DSDT.topology | Bin 5398 -> 5442 bytes
tests/data/acpi/aarch64/virt/HEST | 0
tests/qtest/bios-tables-test-allowed-diff.h | 1 -
7 files changed, 1 deletion(-)
delete mode 100644 tests/data/acpi/aarch64/virt/HEST
diff --git a/tests/data/acpi/aarch64/virt/DSDT b/tests/data/acpi/aarch64/virt/DSDT
index 36d3e5d5a5e47359b6dcb3706f98b4f225677591..a182bd9d7182dccdf63c650d048c58f18505d001 100644
GIT binary patch
delta 109
zcmX@3@k4{lCD<jTLWF^ViDe>}G*h$dM)euOOwJsW4+;nC=*7E+g>V+Q2D|zsED)Gn
zoxsJ!z{S)S5FX^j)c_F?VBivHb9Z%dnXE4&D;?b=31V}^dw9C=2KWUSI2#)?aKwjt
Hx-b9$X;vI^
delta 64
zcmeyNaYlp7CD<jzM}&caNqQoeG*i3NM)euOOit{R4+;lM%f`Egg>V+Q2D|zsED)Gn
UoxsJ!z{S)S5FX?-*+E1W06%jPR{#J2
diff --git a/tests/data/acpi/aarch64/virt/DSDT.acpihmatvirt b/tests/data/acpi/aarch64/virt/DSDT.acpihmatvirt
index e6154d0355f84fdcc51387b4db8f9ee63acae4e9..af1f2b0eb0b77a80c5bd74f201d24f71e486627f 100644
GIT binary patch
delta 110
zcmZ3ac}|ndCD<k8oCpI0)4_>c(oCIR8`a+lGdXii78eO-)SH|wBICY5U~+W=mjDBo
yK%2X(iwjpnbdzL2c#soEyoaX?Z-8HbfwO@#14n$Qrwc=LlO#wDl9aJAR0;r(tsHj%
delta 66
zcmX@7xk!`CCD<iokq83=(~XH-(oDVX8`a+lGdZzO78eO-l%1R{A|oB$BpDDM<irv0
W;pxH~;1^)vY~akm5g+R5!T<noi4jWx
diff --git a/tests/data/acpi/aarch64/virt/DSDT.memhp b/tests/data/acpi/aarch64/virt/DSDT.memhp
index 33f011d6b635035a04c0b39ce9b4e219f7ae74b7..10436ec87c4859fb84b3ecb7bba5788f38112e59 100644
GIT binary patch
delta 88
zcmbPheA1Z9CD<k8q$C3algUIbX{MH08`WnBGdXcjJ}4Z_<jXo)OvH<SfxzVI1TFyv
qE`c_8R~MJfaU%At($P(lAPz^oho=i~fM0-tv#~J)M|`NK3j+W#;TF9B
delta 44
zcmX?UJlB}ZCD<iot|S8klg&gfX{L_p8`WnBGdXfiJ}4Z_<ij#qOvGz*p@=Oj039?8
AE&u=k
diff --git a/tests/data/acpi/aarch64/virt/DSDT.pxb b/tests/data/acpi/aarch64/virt/DSDT.pxb
index c0fdc6e9c1396cc2259dc4bc665ba023adcf4c9b..0524b3cbe00bfe552de824dd1090bd00a208c527 100644
GIT binary patch
delta 110
zcmexwz1oJ$CD<iITaJN&sbC_PG*jDyjq2XAOwJsWOJsu?^(LQ?m2qDnFu6K`OMrn(
ypv~RY#f7UOx=Au1JjjV7-ow*{H^48zz}di=fg?WD(}f|rNfM+6Ny^w5Dg^+WYaFrw
delta 66
zcmZ2&^WU1wCD<k8zbpd-Q^!OuX{N5b8`ZsKnVi@sm&gV)%1%BZD<d7<BpDDM<irv0
W;pxH~;1^)vY~akm5g+R5!T<oNArgiF
diff --git a/tests/data/acpi/aarch64/virt/DSDT.topology b/tests/data/acpi/aarch64/virt/DSDT.topology
index 029d03eecc4efddc001e5377e85ac8e831294362..8c0423fe62d6950f9098983d86bfee256d7d003a 100644
GIT binary patch
delta 86
zcmbQHbx4cLCD<jzNtA(s>E%Q&X{O%5jp|7vOwJsWyG4Q-^(NmJk>Ot;Fu6K`OMrn(
opv~RY#bxqO5n1WzCP@&RBi_T)g*U)2z`)tqn1Lfc)YF9l01l28<p2Nx
delta 42
ycmX@4HBF1lCD<iIOq79viGL!OG*hGhM)f2SCMWjE-6Fw^vXk$N$V}!Dl?DLb(h64q
diff --git a/tests/data/acpi/aarch64/virt/HEST b/tests/data/acpi/aarch64/virt/HEST
deleted file mode 100644
index e69de29bb2d1..000000000000
diff --git a/tests/qtest/bios-tables-test-allowed-diff.h b/tests/qtest/bios-tables-test-allowed-diff.h
index 1a4c2277bd5a..dfb8523c8bf4 100644
--- a/tests/qtest/bios-tables-test-allowed-diff.h
+++ b/tests/qtest/bios-tables-test-allowed-diff.h
@@ -1,2 +1 @@
/* List of comma-separated changed AML files to ignore */
-"tests/data/acpi/aarch64/virt/DSDT",
--
2.48.1
* [PATCH v3 13/14] qapi/acpi-hest: add an interface to do generic CPER error injection
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
` (11 preceding siblings ...)
2025-01-31 17:42 ` [PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
2025-02-05 8:12 ` Markus Armbruster
2025-01-31 17:42 ` [PATCH v3 14/14] scripts/ghes_inject: add a script to generate GHES error inject Mauro Carvalho Chehab
2025-02-03 11:09 ` [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for " Jonathan Cameron via
14 siblings, 1 reply; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC (permalink / raw)
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Ani Sinha, Dongjiu Geng, Eric Blake,
Markus Armbruster, Michael Roth, Paolo Bonzini, Peter Maydell,
Shannon Zhao, linux-kernel
Create a QMP command to be used for generic ACPI APEI hardware error
injection (HEST) via GHESv2, and add support for it on ARM guests.
Error injection uses the ACPI_HEST_SRC_ID_QMP source ID, to be platform
independent. This ID is mapped in the arch virt bindings, depending on the
notification types supported by QEMU and by the BIOS. On ARM, it is
supported via the ACPI_GHES_NOTIFY_GPIO notification type.
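A minimal sketch of how a client might build the request, assuming only what the schema in this patch states: the command is named inject-ghes-v2-error and takes a single "cper" string holding base64-encoded raw data. The helper name build_inject_command and the payload bytes are placeholders for illustration; the bytes are not a valid CPER record.

```python
import base64
import json


def build_inject_command(cper_bytes):
    """Return the QMP JSON text for inject-ghes-v2-error (hypothetical helper)."""
    payload = base64.b64encode(cper_bytes).decode("ascii")
    return json.dumps({
        "execute": "inject-ghes-v2-error",
        "arguments": {"cper": payload},
    })


# Placeholder bytes only: a real request needs a complete CPER record
# (Generic Error Status Block, Generic Error Data Entry and payload).
raw = bytes(range(8))
cmd = build_inject_command(raw)
print(cmd)
```

The resulting text would be sent over a QMP connection after the usual capabilities negotiation; the helper script added later in this series takes care of that part.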
This patch is co-authored:
- original ghes logic to inject a simple ARM record by Shiju Jose;
- generic logic to handle block addresses by Jonathan Cameron;
- generic GHESv2 error inject by Mauro Carvalho Chehab.
Co-authored-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Co-authored-by: Shiju Jose <shiju.jose@huawei.com>
Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Igor Mammedov <imammedo@redhat.com>
---
Changes since v9:
- ARM source IDs renamed to reflect SYNC/ASYNC;
- command name changed to better reflect what it does;
- some improvements at JSON documentation;
- add a check for QMP source at the notification logic.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
MAINTAINERS | 7 +++++++
hw/acpi/Kconfig | 5 +++++
hw/acpi/ghes.c | 2 +-
hw/acpi/ghes_cper.c | 38 ++++++++++++++++++++++++++++++++++++++
hw/acpi/ghes_cper_stub.c | 19 +++++++++++++++++++
hw/acpi/meson.build | 2 ++
hw/arm/virt-acpi-build.c | 1 +
hw/arm/virt.c | 7 +++++++
include/hw/acpi/ghes.h | 1 +
include/hw/arm/virt.h | 1 +
qapi/acpi-hest.json | 35 +++++++++++++++++++++++++++++++++++
qapi/meson.build | 1 +
qapi/qapi-schema.json | 1 +
13 files changed, 119 insertions(+), 1 deletion(-)
create mode 100644 hw/acpi/ghes_cper.c
create mode 100644 hw/acpi/ghes_cper_stub.c
create mode 100644 qapi/acpi-hest.json
diff --git a/MAINTAINERS b/MAINTAINERS
index 846b81e3ec03..8e1f662fa0e0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2075,6 +2075,13 @@ F: hw/acpi/ghes.c
F: include/hw/acpi/ghes.h
F: docs/specs/acpi_hest_ghes.rst
+ACPI/HEST/GHES/ARM processor CPER
+R: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
+S: Maintained
+F: hw/acpi/ghes_cper.c
+F: hw/acpi/ghes_cper_stub.c
+F: qapi/acpi-hest.json
+
ppc4xx
L: qemu-ppc@nongnu.org
S: Orphan
diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig
index 1d4e9f0845c0..daabbe6cd11e 100644
--- a/hw/acpi/Kconfig
+++ b/hw/acpi/Kconfig
@@ -51,6 +51,11 @@ config ACPI_APEI
bool
depends on ACPI
+config GHES_CPER
+ bool
+ depends on ACPI_APEI
+ default y
+
config ACPI_PCI
bool
depends on ACPI && PCI
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index b25e61537c87..bcef0b22e612 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -553,7 +553,7 @@ void ghes_record_cper_errors(AcpiGhesState *ags, const void *cper, size_t len,
/* Write the generic error data entry into guest memory */
cpu_physical_memory_write(cper_addr, cper, len);
- notifier_list_notify(&acpi_generic_error_notifiers, NULL);
+ notifier_list_notify(&acpi_generic_error_notifiers, &source_id);
}
int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
diff --git a/hw/acpi/ghes_cper.c b/hw/acpi/ghes_cper.c
new file mode 100644
index 000000000000..0a2d95dd8b27
--- /dev/null
+++ b/hw/acpi/ghes_cper.c
@@ -0,0 +1,38 @@
+/*
+ * CPER payload parser for error injection
+ *
+ * Copyright(C) 2024-2025 Huawei LTD.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+
+#include "qemu/base64.h"
+#include "qemu/error-report.h"
+#include "qemu/uuid.h"
+#include "qapi/qapi-commands-acpi-hest.h"
+#include "hw/acpi/ghes.h"
+
+void qmp_inject_ghes_v2_error(const char *qmp_cper, Error **errp)
+{
+ AcpiGhesState *ags;
+
+ ags = acpi_ghes_get_state();
+ if (!ags) {
+ return;
+ }
+
+ uint8_t *cper;
+ size_t len;
+
+ cper = qbase64_decode(qmp_cper, -1, &len, errp);
+ if (!cper) {
+ error_prepend(errp, "invalid GHES CPER payload: ");
+ return;
+ }
+
+ ghes_record_cper_errors(ags, cper, len, ACPI_HEST_SRC_ID_QMP, errp);
+}
diff --git a/hw/acpi/ghes_cper_stub.c b/hw/acpi/ghes_cper_stub.c
new file mode 100644
index 000000000000..5ebc61970a78
--- /dev/null
+++ b/hw/acpi/ghes_cper_stub.c
@@ -0,0 +1,19 @@
+/*
+ * Stub interface for CPER payload parser for error injection
+ *
+ * Copyright(C) 2024-2025 Huawei LTD.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qapi/qapi-commands-acpi-hest.h"
+#include "hw/acpi/ghes.h"
+
+void qmp_inject_ghes_v2_error(const char *cper, Error **errp)
+{
+ error_setg(errp, "GHES QMP error inject is not compiled in");
+}
diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
index 73f02b96912b..56b5d1ec9691 100644
--- a/hw/acpi/meson.build
+++ b/hw/acpi/meson.build
@@ -34,4 +34,6 @@ endif
system_ss.add(when: 'CONFIG_ACPI', if_false: files('acpi-stub.c', 'aml-build-stub.c', 'ghes-stub.c', 'acpi_interface.c'))
system_ss.add(when: 'CONFIG_ACPI_PCI_BRIDGE', if_false: files('pci-bridge-stub.c'))
system_ss.add_all(when: 'CONFIG_ACPI', if_true: acpi_ss)
+system_ss.add(when: 'CONFIG_GHES_CPER', if_true: files('ghes_cper.c'))
+system_ss.add(when: 'CONFIG_GHES_CPER', if_false: files('ghes_cper_stub.c'))
system_ss.add(files('acpi-qmp-cmds.c'))
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 4f174795ed60..7b6e90d69298 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -896,6 +896,7 @@ static void acpi_align_size(GArray *blob, unsigned align)
static const AcpiNotificationSourceId hest_ghes_notify[] = {
{ ACPI_HEST_SRC_ID_SYNC, ACPI_GHES_NOTIFY_SEA },
+ { ACPI_HEST_SRC_ID_QMP, ACPI_GHES_NOTIFY_GPIO },
};
static const AcpiNotificationSourceId hest_ghes_notify_9_2[] = {
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index e272b35ea114..9074a540197d 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1012,6 +1012,13 @@ static void virt_powerdown_req(Notifier *n, void *opaque)
static void virt_generic_error_req(Notifier *n, void *opaque)
{
+ uint16_t *source_id = opaque;
+
+ /* Currently, only QMP source ID is async */
+ if (*source_id != ACPI_HEST_SRC_ID_QMP) {
+ return;
+ }
+
VirtMachineState *s = container_of(n, VirtMachineState, generic_error_notifier);
acpi_send_event(s->acpi_dev, ACPI_GENERIC_ERROR);
diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
index e1b66141d01c..376933a0024a 100644
--- a/include/hw/acpi/ghes.h
+++ b/include/hw/acpi/ghes.h
@@ -71,6 +71,7 @@ typedef struct AcpiGhesState {
*/
enum AcpiGhesSourceID {
ACPI_HEST_SRC_ID_SYNC,
+ ACPI_HEST_SRC_ID_QMP, /* Use it only for QMP injected errors */
};
typedef struct AcpiNotificationSourceId {
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index f3cf28436770..56f270f61cf5 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -33,6 +33,7 @@
#include "exec/hwaddr.h"
#include "qemu/notify.h"
#include "hw/boards.h"
+#include "hw/acpi/ghes.h"
#include "hw/arm/boot.h"
#include "hw/arm/bsa.h"
#include "hw/block/flash.h"
diff --git a/qapi/acpi-hest.json b/qapi/acpi-hest.json
new file mode 100644
index 000000000000..fff5018c7ec1
--- /dev/null
+++ b/qapi/acpi-hest.json
@@ -0,0 +1,35 @@
+# -*- Mode: Python -*-
+# vim: filetype=python
+
+##
+# == GHESv2 CPER Error Injection
+#
+# Defined since ACPI Specification 6.1,
+# section 18.3.2.8 Generic Hardware Error Source version 2. See:
+#
+# https://uefi.org/sites/default/files/resources/ACPI_6_1.pdf
+##
+
+
+##
+# @inject-ghes-v2-error:
+#
+# Inject an error with additional ACPI 6.1 GHESv2 error information
+#
+# @cper: contains a base64 encoded string with raw data for a single
+# CPER record with Generic Error Status Block, Generic Error Data
+# Entry and generic error data payload, as described at
+# https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#format
+#
+# Features:
+#
+# @unstable: This command is experimental.
+#
+# Since: 10.0
+##
+{ 'command': 'inject-ghes-v2-error',
+ 'data': {
+ 'cper': 'str'
+ },
+ 'features': [ 'unstable' ]
+}
diff --git a/qapi/meson.build b/qapi/meson.build
index e7bc54e5d047..35cea6147262 100644
--- a/qapi/meson.build
+++ b/qapi/meson.build
@@ -59,6 +59,7 @@ qapi_all_modules = [
if have_system
qapi_all_modules += [
'acpi',
+ 'acpi-hest',
'audio',
'cryptodev',
'qdev',
diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
index b1581988e4eb..baf19ab73afe 100644
--- a/qapi/qapi-schema.json
+++ b/qapi/qapi-schema.json
@@ -75,6 +75,7 @@
{ 'include': 'misc-target.json' }
{ 'include': 'audio.json' }
{ 'include': 'acpi.json' }
+{ 'include': 'acpi-hest.json' }
{ 'include': 'pci.json' }
{ 'include': 'stats.json' }
{ 'include': 'virtio.json' }
--
2.48.1
* [PATCH v3 14/14] scripts/ghes_inject: add a script to generate GHES error inject
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
` (12 preceding siblings ...)
2025-01-31 17:42 ` [PATCH v3 13/14] qapi/acpi-hest: add an interface to do generic CPER error injection Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
2025-02-03 10:56 ` Jonathan Cameron via
2025-02-05 8:16 ` Markus Armbruster
2025-02-03 11:09 ` [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for " Jonathan Cameron via
14 siblings, 2 replies; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC (permalink / raw)
To: Igor Mammedov, Michael S . Tsirkin
Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
Mauro Carvalho Chehab, Cleber Rosa, John Snow, linux-kernel
Using the QMP GHESv2 API requires preparing a raw data array
containing a CPER record.
Add a helper script with subcommands to prepare such data.
Currently, only the ARM Processor Error CPER record is supported, by
using:
$ ghes_inject.py arm
which produces these messages on Linux:
[ 705.032426] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
[ 774.866308] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 774.866583] {4}[Hardware Error]: event severity: recoverable
[ 774.866738] {4}[Hardware Error]: Error 0, type: recoverable
[ 774.866889] {4}[Hardware Error]: section_type: ARM processor error
[ 774.867048] {4}[Hardware Error]: MIDR: 0x00000000000f0510
[ 774.867189] {4}[Hardware Error]: running state: 0x0
[ 774.867321] {4}[Hardware Error]: Power State Coordination Interface state: 0
[ 774.867511] {4}[Hardware Error]: Error info structure 0:
[ 774.867679] {4}[Hardware Error]: num errors: 2
[ 774.867801] {4}[Hardware Error]: error_type: 0x02: cache error
[ 774.867962] {4}[Hardware Error]: error_info: 0x000000000091000f
[ 774.868124] {4}[Hardware Error]: transaction type: Data Access
[ 774.868280] {4}[Hardware Error]: cache error, operation type: Data write
[ 774.868465] {4}[Hardware Error]: cache level: 2
[ 774.868592] {4}[Hardware Error]: processor context not corrupted
[ 774.868774] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
The script allows customizing the error data, so that all fields in
the record can be changed. For more details about its usage, run:
$ ghes_inject.py arm -h
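As a rough illustration of how error type names map to the error_type values seen in the kernel log (0x02 for a cache error), the sketch below combines comma-separated names into a bitmask. The bit positions match the pei_error_types table in scripts/arm_processor_error.py; the functions bit and error_type_mask are simplified stand-ins written for this example, not the helpers from qmp_helper.py.

```python
def bit(n):
    """Return an integer with only bit n set (stand-in for util.bit)."""
    return 1 << n


# Same bit positions as pei_error_types in arm_processor_error.py
PEI_ERROR_TYPES = {
    "cache": bit(1),
    "tlb": bit(2),
    "bus": bit(3),
    "micro-arch": bit(4),
}


def error_type_mask(spec):
    """Combine comma-separated error type names into a single bitmask."""
    mask = 0
    for name in spec.split(","):
        mask |= PEI_ERROR_TYPES[name.strip()]
    return mask


for spec in ["cache", "tlb", "bus", "micro-arch", "tlb,micro-arch"]:
    print(f"{spec}: 0x{error_type_mask(spec):02x}")
```

Each "-t" argument yields one error info structure, so a combined name such as tlb,micro-arch sets two type bits within a single structure.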
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
MAINTAINERS | 3 +
scripts/arm_processor_error.py | 476 ++++++++++++++++++++++
scripts/ghes_inject.py | 51 +++
scripts/qmp_helper.py | 702 +++++++++++++++++++++++++++++++++
4 files changed, 1232 insertions(+)
create mode 100644 scripts/arm_processor_error.py
create mode 100755 scripts/ghes_inject.py
create mode 100755 scripts/qmp_helper.py
diff --git a/MAINTAINERS b/MAINTAINERS
index 8e1f662fa0e0..99a9ba5c2ace 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2081,6 +2081,9 @@ S: Maintained
F: hw/acpi/ghes_cper.c
F: hw/acpi/ghes_cper_stub.c
F: qapi/acpi-hest.json
+F: scripts/ghes_inject.py
+F: scripts/arm_processor_error.py
+F: scripts/qmp_helper.py
ppc4xx
L: qemu-ppc@nongnu.org
diff --git a/scripts/arm_processor_error.py b/scripts/arm_processor_error.py
new file mode 100644
index 000000000000..b0e8450e667e
--- /dev/null
+++ b/scripts/arm_processor_error.py
@@ -0,0 +1,476 @@
+#!/usr/bin/env python3
+#
+# pylint: disable=C0301,C0114,R0903,R0912,R0913,R0914,R0915,W0511
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2024-2025 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
+
+# TODO: current implementation has dummy defaults.
+#
+# For a better implementation, a QMP addition/call is needed to
+# retrieve some data for ARM Processor Error injection:
+#
+# - ARM registers: power_state, mpidr.
+
+"""
+Generates an ARM processor error CPER, compatible with
+UEFI 2.9A Errata.
+
+Injecting such errors can be done using:
+
+ $ ./scripts/ghes_inject.py arm
+ Error injected.
+
+Produces a simple CPER record, as detected on a Linux guest:
+
+[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
+[Hardware Error]: event severity: recoverable
+[Hardware Error]: Error 0, type: recoverable
+[Hardware Error]: section_type: ARM processor error
+[Hardware Error]: MIDR: 0x0000000000000000
+[Hardware Error]: running state: 0x0
+[Hardware Error]: Power State Coordination Interface state: 0
+[Hardware Error]: Error info structure 0:
+[Hardware Error]: num errors: 2
+[Hardware Error]: error_type: 0x02: cache error
+[Hardware Error]: error_info: 0x000000000091000f
+[Hardware Error]: transaction type: Data Access
+[Hardware Error]: cache error, operation type: Data write
+[Hardware Error]: cache level: 2
+[Hardware Error]: processor context not corrupted
+[Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
+
+The ARM Processor Error message can be customized via command line
+parameters. For instance:
+
+ $ ./scripts/ghes_inject.py arm --mpidr 0x444 --running --affinity 1 \
+ --error-info 12345678 --vendor 0x13,123,4,5,1 --ctx-array 0,1,2,3,4,5 \
+ -t cache tlb bus micro-arch tlb,micro-arch
+ Error injected.
+
+Injects this error, as detected on a Linux guest:
+
+[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
+[Hardware Error]: event severity: recoverable
+[Hardware Error]: Error 0, type: recoverable
+[Hardware Error]: section_type: ARM processor error
+[Hardware Error]: MIDR: 0x0000000000000000
+[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000000000000
+[Hardware Error]: error affinity level: 0
+[Hardware Error]: running state: 0x1
+[Hardware Error]: Power State Coordination Interface state: 0
+[Hardware Error]: Error info structure 0:
+[Hardware Error]: num errors: 2
+[Hardware Error]: error_type: 0x02: cache error
+[Hardware Error]: error_info: 0x0000000000bc614e
+[Hardware Error]: cache level: 2
+[Hardware Error]: processor context not corrupted
+[Hardware Error]: Error info structure 1:
+[Hardware Error]: num errors: 2
+[Hardware Error]: error_type: 0x04: TLB error
+[Hardware Error]: error_info: 0x000000000054007f
+[Hardware Error]: transaction type: Instruction
+[Hardware Error]: TLB error, operation type: Instruction fetch
+[Hardware Error]: TLB level: 1
+[Hardware Error]: processor context not corrupted
+[Hardware Error]: the error has not been corrected
+[Hardware Error]: PC is imprecise
+[Hardware Error]: Error info structure 2:
+[Hardware Error]: num errors: 2
+[Hardware Error]: error_type: 0x08: bus error
+[Hardware Error]: error_info: 0x00000080d6460fff
+[Hardware Error]: transaction type: Generic
+[Hardware Error]: bus error, operation type: Generic read (type of instruction or data request cannot be determined)
+[Hardware Error]: affinity level at which the bus error occurred: 1
+[Hardware Error]: processor context corrupted
+[Hardware Error]: the error has been corrected
+[Hardware Error]: PC is imprecise
+[Hardware Error]: Program execution can be restarted reliably at the PC associated with the error.
+[Hardware Error]: participation type: Local processor observed
+[Hardware Error]: request timed out
+[Hardware Error]: address space: External Memory Access
+[Hardware Error]: memory access attributes:0x20
+[Hardware Error]: access mode: secure
+[Hardware Error]: Error info structure 3:
+[Hardware Error]: num errors: 2
+[Hardware Error]: error_type: 0x10: micro-architectural error
+[Hardware Error]: error_info: 0x0000000078da03ff
+[Hardware Error]: Error info structure 4:
+[Hardware Error]: num errors: 2
+[Hardware Error]: error_type: 0x14: TLB error|micro-architectural error
+[Hardware Error]: Context info structure 0:
+[Hardware Error]: register context type: AArch64 EL1 context registers
+[Hardware Error]: 00000000: 00000000 00000000
+[Hardware Error]: Vendor specific error info has 5 bytes:
+[Hardware Error]: 00000000: 13 7b 04 05 01 .{...
+[Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
+[Firmware Warn]: GHES: Unhandled processor error type 0x04: TLB error
+[Firmware Warn]: GHES: Unhandled processor error type 0x08: bus error
+[Firmware Warn]: GHES: Unhandled processor error type 0x10: micro-architectural error
+[Firmware Warn]: GHES: Unhandled processor error type 0x14: TLB error|micro-architectural error
+"""
+
+import argparse
+import re
+
+from qmp_helper import qmp, util, cper_guid
+
+
+class ArmProcessorEinj:
+ """
+ Implements ARM Processor Error injection via GHES
+ """
+
+ DESC = """
+ Generates an ARM processor error CPER, compatible with
+ UEFI 2.9A Errata.
+ """
+
+ ACPI_GHES_ARM_CPER_LENGTH = 40
+ ACPI_GHES_ARM_CPER_PEI_LENGTH = 32
+
+ # Context types
+ CONTEXT_AARCH32_EL1 = 1
+ CONTEXT_AARCH64_EL1 = 5
+ CONTEXT_MISC_REG = 8
+
+ def __init__(self, subparsers):
+ """Initialize the error injection class and add subparser"""
+
+ # Valid choice values
+ self.arm_valid_bits = {
+ "mpidr": util.bit(0),
+ "affinity": util.bit(1),
+ "running": util.bit(2),
+ "vendor": util.bit(3),
+ }
+
+ self.pei_flags = {
+ "first": util.bit(0),
+ "last": util.bit(1),
+ "propagated": util.bit(2),
+ "overflow": util.bit(3),
+ }
+
+ self.pei_error_types = {
+ "cache": util.bit(1),
+ "tlb": util.bit(2),
+ "bus": util.bit(3),
+ "micro-arch": util.bit(4),
+ }
+
+ self.pei_valid_bits = {
+ "multiple-error": util.bit(0),
+ "flags": util.bit(1),
+ "error-info": util.bit(2),
+ "virt-addr": util.bit(3),
+ "phy-addr": util.bit(4),
+ }
+
+ self.data = bytearray()
+
+ parser = subparsers.add_parser("arm", description=self.DESC)
+
+ arm_valid_bits = ",".join(self.arm_valid_bits.keys())
+ flags = ",".join(self.pei_flags.keys())
+ error_types = ",".join(self.pei_error_types.keys())
+ pei_valid_bits = ",".join(self.pei_valid_bits.keys())
+
+ # UEFI N.16 ARM Validation bits
+ g_arm = parser.add_argument_group("ARM processor")
+ g_arm.add_argument("--arm", "--arm-valid",
+ help=f"ARM valid bits: {arm_valid_bits}")
+ g_arm.add_argument("-a", "--affinity", "--level", "--affinity-level",
+ type=lambda x: int(x, 0),
+ help="Affinity level (when multiple levels apply)")
+ g_arm.add_argument("-l", "--mpidr", type=lambda x: int(x, 0),
+ help="Multiprocessor Affinity Register")
+ g_arm.add_argument("-i", "--midr", type=lambda x: int(x, 0),
+ help="Main ID Register")
+ g_arm.add_argument("-r", "--running",
+ action=argparse.BooleanOptionalAction,
+ default=None,
+ help="Indicates if the processor is running or not")
+ g_arm.add_argument("--psci", "--psci-state",
+ type=lambda x: int(x, 0),
+ help="Power State Coordination Interface - PSCI state")
+
+ # TODO: Add vendor-specific support
+
+ # UEFI N.17 bitmaps (type and flags)
+ g_pei = parser.add_argument_group("ARM Processor Error Info (PEI)")
+ g_pei.add_argument("-t", "--type", nargs="+",
+ help=f"one or more error types: {error_types}")
+ g_pei.add_argument("-f", "--flags", nargs="*",
+ help=f"zero or more error flags: {flags}")
+ g_pei.add_argument("-V", "--pei-valid", "--error-valid", nargs="*",
+ help=f"zero or more PEI valid bits: {pei_valid_bits}")
+
+ # UEFI N.17 Integer values
+ g_pei.add_argument("-m", "--multiple-error", nargs="+",
+ help="Number of errors: 0: Single error, 1: Multiple errors, 2-65535: Error count if known")
+ g_pei.add_argument("-e", "--error-info", nargs="+",
+ help="Error information (UEFI 2.10 tables N.18 to N.20)")
+ g_pei.add_argument("-p", "--physical-address", nargs="+",
+ help="Physical address")
+ g_pei.add_argument("-v", "--virtual-address", nargs="+",
+ help="Virtual address")
+
+ # UEFI N.21 Context
+ g_ctx = parser.add_argument_group("Processor Context")
+ g_ctx.add_argument("--ctx-type", "--context-type", nargs="*",
+ help="Type of the context (0=ARM32 GPR, 5=ARM64 EL1, other values supported)")
+ g_ctx.add_argument("--ctx-size", "--context-size", nargs="*",
+ help="Minimal size of the context")
+ g_ctx.add_argument("--ctx-array", "--context-array", nargs="*",
+ help="Comma-separated arrays for each context")
+
+ # Vendor-specific data
+ g_vendor = parser.add_argument_group("Vendor-specific data")
+ g_vendor.add_argument("--vendor", "--vendor-specific", nargs="+",
+ help="Vendor-specific byte arrays of data")
+
+ # Add arguments for Generic Error Data
+ qmp.argparse(parser)
+
+ parser.set_defaults(func=self.send_cper)
+
+ def send_cper(self, args):
+ """Parse subcommand arguments and send a CPER via QMP"""
+
+ qmp_cmd = qmp(args.host, args.port, args.debug)
+
+ # Handle Generic Error Data arguments if any
+ qmp_cmd.set_args(args)
+
+ is_cpu_type = re.compile(r"^([\w+]+\-)?arm\-cpu$")
+ cpus = qmp_cmd.search_qom("/machine/unattached/device",
+ "type", is_cpu_type)
+
+ cper = {}
+ pei = {}
+ ctx = {}
+ vendor = {}
+
+ arg = vars(args)
+
+ # Handle global parameters
+ if args.arm:
+ arm_valid_init = False
+ cper["valid"] = util.get_choice(name="valid",
+ value=args.arm,
+ choices=self.arm_valid_bits,
+ suffixes=["-error", "-err"])
+ else:
+ cper["valid"] = 0
+ arm_valid_init = True
+
+ if "running" in arg:
+ if args.running:
+ cper["running-state"] = util.bit(0)
+ else:
+ cper["running-state"] = 0
+ else:
+ cper["running-state"] = 0
+
+ if arm_valid_init:
+ if args.affinity:
+ cper["valid"] |= self.arm_valid_bits["affinity"]
+
+ if args.mpidr:
+ cper["valid"] |= self.arm_valid_bits["mpidr"]
+
+ if "running-state" in cper:
+ cper["valid"] |= self.arm_valid_bits["running"]
+
+ if args.psci:
+ cper["valid"] |= self.arm_valid_bits["running"]
+
+ # Handle PEI
+ if not args.type:
+ args.type = ["cache-error"]
+
+ util.get_mult_choices(
+ pei,
+ name="valid",
+ values=args.pei_valid,
+ choices=self.pei_valid_bits,
+ suffixes=["-valid", "--addr"],
+ )
+ util.get_mult_choices(
+ pei,
+ name="type",
+ values=args.type,
+ choices=self.pei_error_types,
+ suffixes=["-error", "-err"],
+ )
+ util.get_mult_choices(
+ pei,
+ name="flags",
+ values=args.flags,
+ choices=self.pei_flags,
+ suffixes=["-error", "-cap"],
+ )
+ util.get_mult_int(pei, "error-info", args.error_info)
+ util.get_mult_int(pei, "multiple-error", args.multiple_error)
+ util.get_mult_int(pei, "phy-addr", args.physical_address)
+ util.get_mult_int(pei, "virt-addr", args.virtual_address)
+
+ # Handle context
+ util.get_mult_int(ctx, "type", args.ctx_type, allow_zero=True)
+ util.get_mult_int(ctx, "minimal-size", args.ctx_size, allow_zero=True)
+ util.get_mult_array(ctx, "register", args.ctx_array, allow_zero=True)
+
+ util.get_mult_array(vendor, "bytes", args.vendor, max_val=255)
+
+ # Store PEI
+ pei_data = bytearray()
+ default_flags = self.pei_flags["first"]
+ default_flags |= self.pei_flags["last"]
+
+ error_info_num = 0
+
+ for i, p in pei.items(): # pylint: disable=W0612
+ error_info_num += 1
+
+ # UEFI 2.10 doesn't define how to encode error information
+ # when multiple types are raised. So, provide a default only
+ # if a single type is there
+ if "error-info" not in p:
+ if p["type"] == util.bit(1):
+ p["error-info"] = 0x0091000F
+ if p["type"] == util.bit(2):
+ p["error-info"] = 0x0054007F
+ if p["type"] == util.bit(3):
+ p["error-info"] = 0x80D6460FFF
+ if p["type"] == util.bit(4):
+ p["error-info"] = 0x78DA03FF
+
+ if "valid" not in p:
+ p["valid"] = 0
+ if "multiple-error" in p:
+ p["valid"] |= self.pei_valid_bits["multiple-error"]
+
+ if "flags" in p:
+ p["valid"] |= self.pei_valid_bits["flags"]
+
+ if "error-info" in p:
+ p["valid"] |= self.pei_valid_bits["error-info"]
+
+ if "phy-addr" in p:
+ p["valid"] |= self.pei_valid_bits["phy-addr"]
+
+ if "virt-addr" in p:
+ p["valid"] |= self.pei_valid_bits["virt-addr"]
+
+ # Version
+ util.data_add(pei_data, 0, 1)
+
+ util.data_add(pei_data,
+ self.ACPI_GHES_ARM_CPER_PEI_LENGTH, 1)
+
+ util.data_add(pei_data, p["valid"], 2)
+ util.data_add(pei_data, p["type"], 1)
+ util.data_add(pei_data, p.get("multiple-error", 1), 2)
+ util.data_add(pei_data, p.get("flags", default_flags), 1)
+ util.data_add(pei_data, p.get("error-info", 0), 8)
+ util.data_add(pei_data, p.get("virt-addr", 0xDEADBEEF), 8)
+ util.data_add(pei_data, p.get("phy-addr", 0xABBA0BAD), 8)
+
+ # Store Context
+ ctx_data = bytearray()
+ context_info_num = 0
+
+ if ctx:
+ ret = qmp_cmd.send_cmd("query-target", may_open=True)
+
+ default_ctx = self.CONTEXT_MISC_REG
+
+ if "arch" in ret:
+ if ret["arch"] == "aarch64":
+ default_ctx = self.CONTEXT_AARCH64_EL1
+ elif ret["arch"] == "arm":
+ default_ctx = self.CONTEXT_AARCH32_EL1
+
+ for k in sorted(ctx.keys()):
+ context_info_num += 1
+
+ if "type" not in ctx[k]:
+ ctx[k]["type"] = default_ctx
+
+ if "register" not in ctx[k]:
+ ctx[k]["register"] = []
+
+ reg_size = len(ctx[k]["register"])
+ size = 0
+
+ if "minimal-size" in ctx[k]:
+ size = ctx[k]["minimal-size"]
+
+ size = max(size, reg_size)
+
+ size = (size + 1) % 0xFFFE
+
+ # Version
+ util.data_add(ctx_data, 0, 2)
+
+ util.data_add(ctx_data, ctx[k]["type"], 2)
+
+ util.data_add(ctx_data, 8 * size, 4)
+
+ for r in ctx[k]["register"]:
+ util.data_add(ctx_data, r, 8)
+
+ for i in range(reg_size, size): # pylint: disable=W0612
+ util.data_add(ctx_data, 0, 8)
+
+ # Vendor-specific bytes are not grouped
+ vendor_data = bytearray()
+ if vendor:
+ for k in sorted(vendor.keys()):
+ for b in vendor[k]["bytes"]:
+ util.data_add(vendor_data, b, 1)
+
+ # Encode ARM Processor Error
+ data = bytearray()
+
+ util.data_add(data, cper["valid"], 4)
+
+ util.data_add(data, error_info_num, 2)
+ util.data_add(data, context_info_num, 2)
+
+ # Calculate the length of the CPER data
+ cper_length = self.ACPI_GHES_ARM_CPER_LENGTH
+ cper_length += len(pei_data)
+ cper_length += len(vendor_data)
+ cper_length += len(ctx_data)
+ util.data_add(data, cper_length, 4)
+
+ util.data_add(data, arg.get("affinity-level", 0), 1)
+
+ # Reserved
+ util.data_add(data, 0, 3)
+
+ if "midr-el1" not in arg:
+ if cpus:
+ cmd_arg = {
+ 'path': cpus[0],
+ 'property': "midr"
+ }
+ ret = qmp_cmd.send_cmd("qom-get", cmd_arg, may_open=True)
+ if isinstance(ret, int):
+ arg["midr-el1"] = ret
+
+ util.data_add(data, arg.get("mpidr-el1", 0), 8)
+ util.data_add(data, arg.get("midr-el1", 0), 8)
+ util.data_add(data, cper["running-state"], 4)
+ util.data_add(data, arg.get("psci-state", 0), 4)
+
+ # Add PEI
+ data.extend(pei_data)
+ data.extend(ctx_data)
+ data.extend(vendor_data)
+
+ self.data = data
+
+ qmp_cmd.send_cper(cper_guid.CPER_PROC_ARM, self.data)
diff --git a/scripts/ghes_inject.py b/scripts/ghes_inject.py
new file mode 100755
index 000000000000..5d72bc7f09e1
--- /dev/null
+++ b/scripts/ghes_inject.py
@@ -0,0 +1,51 @@
+#!/usr/bin/env python3
+#
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2024-2025 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
+
+"""
+Handle ACPI GHESv2 error injection logic via the QEMU QMP interface.
+"""
+
+import argparse
+import sys
+
+from arm_processor_error import ArmProcessorEinj
+
+EINJ_DESC = """
+Handle ACPI GHESv2 error injection logic via the QEMU QMP interface.
+
+It allows using UEFI BIOS EINJ features to generate GHES records.
+
+It helps testing CPER and GHES drivers at the guest OS and how
+userspace applications at the guest handle them.
+"""
+
+def main():
+ """Main program"""
+
+ # Main parser - handle generic args like QEMU QMP TCP socket options
+ parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+ usage="%(prog)s [options]",
+ description=EINJ_DESC)
+
+ g_options = parser.add_argument_group("QEMU QMP socket options")
+ g_options.add_argument("-H", "--host", default="localhost", type=str,
+ help="host name")
+ g_options.add_argument("-P", "--port", default=4445, type=int,
+ help="TCP port number")
+ g_options.add_argument('-d', '--debug', action='store_true')
+
+ subparsers = parser.add_subparsers()
+
+ ArmProcessorEinj(subparsers)
+
+ args = parser.parse_args()
+ if "func" in args:
+ args.func(args)
+ else:
+ sys.exit(f"Please specify a valid command for {sys.argv[0]}")
+
+if __name__ == "__main__":
+ main()
diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
new file mode 100755
index 000000000000..8e375d6a6cab
--- /dev/null
+++ b/scripts/qmp_helper.py
@@ -0,0 +1,702 @@
+#!/usr/bin/env python3
+#
+# # pylint: disable=C0103,E0213,E1135,E1136,E1137,R0902,R0903,R0912,R0913
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2024-2025 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
+
+"""
+Helper classes to be used by ghes_inject command classes.
+"""
+
+import json
+import sys
+
+from datetime import datetime
+from os import path as os_path
+
+try:
+ qemu_dir = os_path.abspath(os_path.dirname(os_path.dirname(__file__)))
+ sys.path.append(os_path.join(qemu_dir, 'python'))
+
+ from qemu.qmp.legacy import QEMUMonitorProtocol
+
+except ModuleNotFoundError as exc:
+ print(f"Module '{exc.name}' not found.")
+ print("Try export PYTHONPATH=top-qemu-dir/python or run from top-qemu-dir")
+ sys.exit(1)
+
+from base64 import b64encode
+
+class util:
+ """
+ Ancillary functions to deal with bitmaps, parse arguments,
+ generate GUID and encode data on a bytearray buffer.
+ """
+
+ #
+ # Helper routines to handle multiple choice arguments
+ #
+ def get_choice(name, value, choices, suffixes=None, bitmask=True):
+ """Produce a list from multiple choice argument"""
+
+ new_values = 0
+
+ if not value:
+ return new_values
+
+ for val in value.split(","):
+ val = val.lower()
+
+ if suffixes:
+ for suffix in suffixes:
+ val = val.removesuffix(suffix)
+
+ if val not in choices.keys():
+ if suffixes:
+ for suffix in suffixes:
+ if val + suffix in choices.keys():
+ val += suffix
+ break
+
+ if val not in choices.keys():
+ sys.exit(f"Error on '{name}': choice '{val}' is invalid.")
+
+ val = choices[val]
+
+ if bitmask:
+ new_values |= val
+ else:
+ if new_values:
+ sys.exit(f"Error on '{name}': only one value is accepted.")
+
+ new_values = val
+
+ return new_values
+
+ def get_array(name, values, max_val=None):
+ """Add numbered hashes from integer lists into an array"""
+
+ array = []
+
+ for value in values:
+ for val in value.split(","):
+ try:
+ val = int(val, 0)
+ except ValueError:
+ sys.exit(f"Error on '{name}': {val} is not an integer")
+
+ if val < 0:
+ sys.exit(f"Error on '{name}': {val} is not unsigned")
+
+ if max_val and val > max_val:
+ sys.exit(f"Error on '{name}': {val} is too big")
+
+ array.append(val)
+
+ return array
+
+ def get_mult_array(mult, name, values, allow_zero=False, max_val=None):
+ """Add numbered hashes from integer lists"""
+
+ if not allow_zero:
+ if not values:
+ return
+ else:
+ if values is None:
+ return
+
+ if not values:
+ i = 0
+ if i not in mult:
+ mult[i] = {}
+
+ mult[i][name] = []
+ return
+
+ i = 0
+ for value in values:
+ for val in value.split(","):
+ try:
+ val = int(val, 0)
+ except ValueError:
+ sys.exit(f"Error on '{name}': {val} is not an integer")
+
+ if val < 0:
+ sys.exit(f"Error on '{name}': {val} is not unsigned")
+
+ if max_val and val > max_val:
+ sys.exit(f"Error on '{name}': {val} is too big")
+
+ if i not in mult:
+ mult[i] = {}
+
+ if name not in mult[i]:
+ mult[i][name] = []
+
+ mult[i][name].append(val)
+
+ i += 1
+
+
+ def get_mult_choices(mult, name, values, choices,
+ suffixes=None, allow_zero=False):
+ """Add numbered hashes from multiple choice arguments"""
+
+ if not allow_zero:
+ if not values:
+ return
+ else:
+ if values is None:
+ return
+
+ i = 0
+ for val in values:
+ new_values = util.get_choice(name, val, choices, suffixes)
+
+ if i not in mult:
+ mult[i] = {}
+
+ mult[i][name] = new_values
+ i += 1
+
+
+ def get_mult_int(mult, name, values, allow_zero=False):
+ """Add numbered hashes from integer arguments"""
+ if not allow_zero:
+ if not values:
+ return
+ else:
+ if values is None:
+ return
+
+ i = 0
+ for val in values:
+ try:
+ val = int(val, 0)
+ except ValueError:
+ sys.exit(f"Error on '{name}': {val} is not an integer")
+
+ if val < 0:
+ sys.exit(f"Error on '{name}': {val} is not unsigned")
+
+ if i not in mult:
+ mult[i] = {}
+
+ mult[i][name] = val
+ i += 1
+
+
+ #
+ # Data encode helper functions
+ #
+ def bit(b):
+ """Simple macro to define a bit on a bitmask"""
+ return 1 << b
+
+
+ def data_add(data, value, num_bytes):
+ """Adds bytes from value inside a bytearray"""
+
+ data.extend(value.to_bytes(num_bytes, byteorder="little")) # pylint: disable=E1101
+
+ def dump_bytearray(name, data):
+ """Does a hexdump of a byte array, grouping in bytes"""
+
+ print(f"{name} ({len(data)} bytes):")
+
+ for ln_start in range(0, len(data), 16):
+ ln_end = min(ln_start + 16, len(data))
+ print(f" {ln_start:08x} ", end="")
+ for i in range(ln_start, ln_end):
+ print(f"{data[i]:02x} ", end="")
+ for i in range(ln_end, ln_start + 16):
+ print(" ", end="")
+ print(" ", end="")
+ for i in range(ln_start, ln_end):
+ if data[i] >= 32 and data[i] < 127:
+ print(chr(data[i]), end="")
+ else:
+ print(".", end="")
+
+ print()
+ print()
+
+ def time(string):
+ """Handle BCD timestamps used on Generic Error Data Block"""
+
+ time = None
+
+ # Formats to be used when parsing time stamps
+ formats = [
+ "%Y-%m-%d %H:%M:%S",
+ ]
+
+ if string == "now":
+ time = datetime.now()
+
+ if time is None:
+ for fmt in formats:
+ try:
+ time = datetime.strptime(string, fmt)
+ break
+ except ValueError:
+ pass
+
+ if time is None:
+ raise ValueError("Invalid time format")
+
+ return time
+
+class guid:
+ """
+ Simple class to handle GUID fields.
+ """
+
+ def __init__(self, time_low, time_mid, time_high, nodes):
+ """Initialize a GUID value"""
+
+ assert len(nodes) == 8
+
+ self.time_low = time_low
+ self.time_mid = time_mid
+ self.time_high = time_high
+ self.nodes = nodes
+
+ @classmethod
+ def UUID(cls, guid_str):
+ """Initialize a GUID using a string on its standard format"""
+
+ if len(guid_str) != 36:
+ print("Size not 36")
+ raise ValueError('Invalid GUID size')
+
+ # It is easier to parse without separators. So, drop them
+ guid_str = guid_str.replace('-', '')
+
+ if len(guid_str) != 32:
+ print("Size not 32", guid_str, len(guid_str))
+ raise ValueError('Invalid GUID hex size')
+
+ time_low = 0
+ time_mid = 0
+ time_high = 0
+ nodes = []
+
+ for i in reversed(range(16, 32, 2)):
+ h = guid_str[i:i + 2]
+ value = int(h, 16)
+ nodes.insert(0, value)
+
+ time_high = int(guid_str[12:16], 16)
+ time_mid = int(guid_str[8:12], 16)
+ time_low = int(guid_str[0:8], 16)
+
+ return cls(time_low, time_mid, time_high, nodes)
+
+ def __str__(self):
+ """Output a GUID value on its default string representation"""
+
+ clock = self.nodes[0] << 8 | self.nodes[1]
+
+ node = 0
+ for i in range(2, len(self.nodes)):
+ node = node << 8 | self.nodes[i]
+
+ s = f"{self.time_low:08x}-{self.time_mid:04x}-"
+ s += f"{self.time_high:04x}-{clock:04x}-{node:012x}"
+ return s
+
+ def to_bytes(self):
+ """Output a GUID value in bytes"""
+
+ data = bytearray()
+
+ util.data_add(data, self.time_low, 4)
+ util.data_add(data, self.time_mid, 2)
+ util.data_add(data, self.time_high, 2)
+ data.extend(bytearray(self.nodes))
+
+ return data
+
+class qmp:
+ """
+ Opens a connection and sends/receives QMP commands.
+ """
+
+ def send_cmd(self, command, args=None, may_open=False, return_error=True):
+ """Send a command to QMP, optionally opening a connection"""
+
+ if may_open:
+ self._connect()
+ elif not self.connected:
+ return False
+
+ msg = { 'execute': command }
+ if args:
+ msg['arguments'] = args
+
+ try:
+ obj = self.qmp_monitor.cmd_obj(msg)
+ # Can we use some other exception class here?
+ except Exception as e: # pylint: disable=W0718
+ print(f"Command: {command}")
+ print(f"Failed to inject error: {e}.")
+ return None
+
+ if "return" in obj:
+ if isinstance(obj.get("return"), dict):
+ if obj["return"]:
+ return obj["return"]
+ return "OK"
+
+ return obj["return"]
+
+ if isinstance(obj.get("error"), dict):
+ error = obj["error"]
+ if return_error:
+ print(f"Command: {msg}")
+ print(f'{error["class"]}: {error["desc"]}')
+ else:
+ print(json.dumps(obj))
+
+ return None
+
+ def _close(self):
+ """Shutdown and close the socket, if opened"""
+ if not self.connected:
+ return
+
+ self.qmp_monitor.close()
+ self.connected = False
+
+ def _connect(self):
+ """Connect to a QMP TCP/IP port, if not connected yet"""
+
+ if self.connected:
+ return True
+
+ try:
+ self.qmp_monitor.connect(negotiate=True)
+ except ConnectionError:
+ sys.exit(f"Can't connect to QMP host {self.host}:{self.port}")
+
+ self.connected = True
+
+ return True
+
+ BLOCK_STATUS_BITS = {
+ "uncorrectable": util.bit(0),
+ "correctable": util.bit(1),
+ "multi-uncorrectable": util.bit(2),
+ "multi-correctable": util.bit(3),
+ }
+
+ ERROR_SEVERITY = {
+ "recoverable": 0,
+ "fatal": 1,
+ "corrected": 2,
+ "none": 3,
+ }
+
+ VALIDATION_BITS = {
+ "fru-id": util.bit(0),
+ "fru-text": util.bit(1),
+ "timestamp": util.bit(2),
+ }
+
+ GEDB_FLAGS_BITS = {
+ "recovered": util.bit(0),
+ "prev-error": util.bit(1),
+ "simulated": util.bit(2),
+ }
+
+ GENERIC_DATA_SIZE = 72
+
+ def argparse(parser):
+ """Prepare a parser group to query generic error data"""
+
+ block_status_bits = ",".join(qmp.BLOCK_STATUS_BITS.keys())
+ error_severity_enum = ",".join(qmp.ERROR_SEVERITY.keys())
+ validation_bits = ",".join(qmp.VALIDATION_BITS.keys())
+ gedb_flags_bits = ",".join(qmp.GEDB_FLAGS_BITS.keys())
+
+ g_gen = parser.add_argument_group("Generic Error Data") # pylint: disable=E1101
+ g_gen.add_argument("--block-status",
+ help=f"block status bits: {block_status_bits}")
+ g_gen.add_argument("--raw-data", nargs="+",
+ help="Raw data inside the Error Status Block")
+ g_gen.add_argument("--error-severity", "--severity",
+ help=f"error severity: {error_severity_enum}")
+ g_gen.add_argument("--gen-err-valid-bits",
+ "--generic-error-validation-bits",
+ help=f"validation bits: {validation_bits}")
+ g_gen.add_argument("--fru-id", type=guid.UUID,
+ help="GUID representing a physical device")
+ g_gen.add_argument("--fru-text",
+ help="ASCII string identifying the FRU hardware")
+ g_gen.add_argument("--timestamp", type=util.time,
+ help="Time when the error info was collected")
+ g_gen.add_argument("--precise", "--precise-timestamp",
+ action='store_true',
+ help="Marks the timestamp as precise if --timestamp is used")
+ g_gen.add_argument("--gedb-flags",
+ help=f"General Error Data Block flags: {gedb_flags_bits}")
+
+ def set_args(self, args):
+ """Set the arguments optionally defined via self.argparse()"""
+
+ if args.block_status:
+ self.block_status = util.get_choice(name="block-status",
+ value=args.block_status,
+ choices=self.BLOCK_STATUS_BITS,
+ bitmask=False)
+ if args.raw_data:
+ self.raw_data = util.get_array("raw-data", args.raw_data,
+ max_val=255)
+ print(self.raw_data)
+
+ if args.error_severity:
+ self.error_severity = util.get_choice(name="error-severity",
+ value=args.error_severity,
+ choices=self.ERROR_SEVERITY,
+ bitmask=False)
+
+ if args.fru_id:
+ self.fru_id = args.fru_id.to_bytes()
+ if not args.gen_err_valid_bits:
+ self.validation_bits |= self.VALIDATION_BITS["fru-id"]
+
+ if args.fru_text:
+ text = bytearray(args.fru_text.encode('ascii'))
+ if len(text) > 20:
+ sys.exit("FRU text is too big to fit")
+
+ self.fru_text = text
+ if not args.gen_err_valid_bits:
+ self.validation_bits |= self.VALIDATION_BITS["fru-text"]
+
+ if args.timestamp:
+ time = args.timestamp
+ century = int(time.year / 100)
+
+ bcd = bytearray()
+ util.data_add(bcd, (time.second // 10) << 4 | (time.second % 10), 1)
+ util.data_add(bcd, (time.minute // 10) << 4 | (time.minute % 10), 1)
+ util.data_add(bcd, (time.hour // 10) << 4 | (time.hour % 10), 1)
+
+ if args.precise:
+ util.data_add(bcd, 1, 1)
+ else:
+ util.data_add(bcd, 0, 1)
+
+ util.data_add(bcd, (time.day // 10) << 4 | (time.day % 10), 1)
+ util.data_add(bcd, (time.month // 10) << 4 | (time.month % 10), 1)
+ util.data_add(bcd,
+ ((time.year % 100) // 10) << 4 | (time.year % 10), 1)
+ util.data_add(bcd, ((century % 100) // 10) << 4 | (century % 10), 1)
+
+ self.timestamp = bcd
+ if not args.gen_err_valid_bits:
+ self.validation_bits |= self.VALIDATION_BITS["timestamp"]
+
+ if args.gen_err_valid_bits:
+ self.validation_bits = util.get_choice(name="validation",
+ value=args.gen_err_valid_bits,
+ choices=self.VALIDATION_BITS)
+
+ def __init__(self, host, port, debug=False):
+ """Initialize variables used by the QMP send logic"""
+
+ self.connected = False
+ self.host = host
+ self.port = port
+ self.debug = debug
+
+ # ACPI 6.1: 18.3.2.7.1 Generic Error Data: Generic Error Status Block
+ self.block_status = self.BLOCK_STATUS_BITS["uncorrectable"]
+ self.raw_data = []
+ self.error_severity = self.ERROR_SEVERITY["recoverable"]
+
+ # ACPI 6.1: 18.3.2.7.1 Generic Error Data: Generic Error Data Entry
+ self.validation_bits = 0
+ self.flags = 0
+ self.fru_id = bytearray(16)
+ self.fru_text = bytearray(20)
+ self.timestamp = bytearray(8)
+
+ self.qmp_monitor = QEMUMonitorProtocol(address=(self.host, self.port))
+
+ #
+ # Socket QMP send command
+ #
+ def send_cper_raw(self, cper_data):
+ """Send a raw CPER data to QEMU through QMP TCP socket"""
+
+ data = b64encode(bytes(cper_data)).decode('ascii')
+
+ cmd_arg = {
+ 'cper': data
+ }
+
+ self._connect()
+
+ if self.send_cmd("inject-ghes-v2-error", cmd_arg):
+ print("Error injected.")
+
+ def send_cper(self, notif_type, payload):
+ """Send commands to QEMU through QMP TCP socket"""
+
+ # Fill CPER record header
+
+ # NOTE: bits 4 to 13 of block status contain the number of
+ # data entries in the data section. This is currently unsupported.
+
+ cper_length = len(payload)
+ data_length = cper_length + len(self.raw_data) + self.GENERIC_DATA_SIZE
+
+ # Generic Error Data Entry
+ gede = bytearray()
+
+ gede.extend(notif_type.to_bytes())
+ util.data_add(gede, self.error_severity, 4)
+ util.data_add(gede, 0x300, 2)
+ util.data_add(gede, self.validation_bits, 1)
+ util.data_add(gede, self.flags, 1)
+ util.data_add(gede, cper_length, 4)
+ gede.extend(self.fru_id)
+ gede.extend(self.fru_text)
+ gede.extend(self.timestamp)
+
+ # Generic Error Status Block
+ gebs = bytearray()
+
+ if self.raw_data:
+ raw_data_offset = len(gebs)
+ else:
+ raw_data_offset = 0
+
+ util.data_add(gebs, self.block_status, 4)
+ util.data_add(gebs, raw_data_offset, 4)
+ util.data_add(gebs, len(self.raw_data), 4)
+ util.data_add(gebs, data_length, 4)
+ util.data_add(gebs, self.error_severity, 4)
+
+ cper_data = bytearray()
+ cper_data.extend(gebs)
+ cper_data.extend(gede)
+ cper_data.extend(bytearray(self.raw_data))
+ cper_data.extend(bytearray(payload))
+
+ if self.debug:
+ print(f"GUID: {notif_type}")
+
+ util.dump_bytearray("Generic Error Status Block", gebs)
+ util.dump_bytearray("Generic Error Data Entry", gede)
+
+ if self.raw_data:
+ util.dump_bytearray("Raw data", bytearray(self.raw_data))
+
+ util.dump_bytearray("Payload", payload)
+
+ self.send_cper_raw(cper_data)
+
+
+ def search_qom(self, path, prop, regex):
+ """
+ Return a list of devices that match path array like:
+
+ /machine/unattached/device
+ /machine/peripheral-anon/device
+ ...
+ """
+
+ found = []
+
+ i = 0
+ while 1:
+ dev = f"{path}[{i}]"
+ args = {
+ 'path': dev,
+ 'property': prop
+ }
+ ret = self.send_cmd("qom-get", args, may_open=True, return_error=False)
+ if not ret:
+ break
+
+ if isinstance(ret, str):
+ if regex.search(ret):
+ found.append(dev)
+
+ i += 1
+ if i > 10000:
+ print("Too many objects returned by qom-get!")
+ break
+
+ return found
+
+class cper_guid:
+ """
+ Contains CPER GUID, as per:
+ https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html
+ """
+
+ CPER_PROC_GENERIC = guid(0x9876CCAD, 0x47B4, 0x4bdb,
+ [0xB6, 0x5E, 0x16, 0xF1,
+ 0x93, 0xC4, 0xF3, 0xDB])
+
+ CPER_PROC_X86 = guid(0xDC3EA0B0, 0xA144, 0x4797,
+ [0xB9, 0x5B, 0x53, 0xFA,
+ 0x24, 0x2B, 0x6E, 0x1D])
+
+ CPER_PROC_ITANIUM = guid(0xe429faf1, 0x3cb7, 0x11d4,
+ [0xbc, 0xa7, 0x00, 0x80,
+ 0xc7, 0x3c, 0x88, 0x81])
+
+ CPER_PROC_ARM = guid(0xE19E3D16, 0xBC11, 0x11E4,
+ [0x9C, 0xAA, 0xC2, 0x05,
+ 0x1D, 0x5D, 0x46, 0xB0])
+
+ CPER_PLATFORM_MEM = guid(0xA5BC1114, 0x6F64, 0x4EDE,
+ [0xB8, 0x63, 0x3E, 0x83,
+ 0xED, 0x7C, 0x83, 0xB1])
+
+ CPER_PLATFORM_MEM2 = guid(0x61EC04FC, 0x48E6, 0xD813,
+ [0x25, 0xC9, 0x8D, 0xAA,
+ 0x44, 0x75, 0x0B, 0x12])
+
+ CPER_PCIE = guid(0xD995E954, 0xBBC1, 0x430F,
+ [0xAD, 0x91, 0xB4, 0x4D,
+ 0xCB, 0x3C, 0x6F, 0x35])
+
+ CPER_PCI_BUS = guid(0xC5753963, 0x3B84, 0x4095,
+ [0xBF, 0x78, 0xED, 0xDA,
+ 0xD3, 0xF9, 0xC9, 0xDD])
+
+ CPER_PCI_DEV = guid(0xEB5E4685, 0xCA66, 0x4769,
+ [0xB6, 0xA2, 0x26, 0x06,
+ 0x8B, 0x00, 0x13, 0x26])
+
+ CPER_FW_ERROR = guid(0x81212A96, 0x09ED, 0x4996,
+ [0x94, 0x71, 0x8D, 0x72,
+ 0x9C, 0x8E, 0x69, 0xED])
+
+ CPER_DMA_GENERIC = guid(0x5B51FEF7, 0xC79D, 0x4434,
+ [0x8F, 0x1B, 0xAA, 0x62,
+ 0xDE, 0x3E, 0x2C, 0x64])
+
+ CPER_DMA_VT = guid(0x71761D37, 0x32B2, 0x45cd,
+ [0xA7, 0xD0, 0xB0, 0xFE,
+ 0xDD, 0x93, 0xE8, 0xCF])
+
+ CPER_DMA_IOMMU = guid(0x036F84E1, 0x7F37, 0x428c,
+ [0xA7, 0x9E, 0x57, 0x5F,
+ 0xDF, 0xAA, 0x84, 0xEC])
+
+ CPER_CCIX_PER = guid(0x91335EF6, 0xEBFB, 0x4478,
+ [0xA6, 0xA6, 0x88, 0xB7,
+ 0x28, 0xCF, 0x75, 0xD7])
+
+ CPER_CXL_PROT_ERR = guid(0x80B9EFB4, 0x52B5, 0x4DE3,
+ [0xA7, 0x77, 0x68, 0x78,
+ 0x4B, 0x77, 0x10, 0x48])
--
2.48.1
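As a side note on guid.to_bytes() in qmp_helper.py: below is a minimal
standalone sketch, assuming only the Python standard library, of the
mixed-endian layout it emits. The first three fields are stored
little-endian and the eight node bytes are copied unchanged, which should
match uuid.UUID.bytes_le. The helper name guid_to_bytes is made up for this
illustration and is not part of the patch.

```python
import uuid


def guid_to_bytes(time_low, time_mid, time_high, nodes):
    """Encode a GUID the same way guid.to_bytes() in qmp_helper.py does."""
    assert len(nodes) == 8

    data = bytearray()
    data.extend(time_low.to_bytes(4, byteorder="little"))
    data.extend(time_mid.to_bytes(2, byteorder="little"))
    data.extend(time_high.to_bytes(2, byteorder="little"))
    data.extend(bytes(nodes))   # clock and node bytes are not swapped
    return bytes(data)


# CPER_PROC_ARM, as defined in class cper_guid
ARM_GUID = guid_to_bytes(0xE19E3D16, 0xBC11, 0x11E4,
                         [0x9C, 0xAA, 0xC2, 0x05, 0x1D, 0x5D, 0x46, 0xB0])

# Cross-check against the standard library's mixed-endian encoding
REFERENCE = uuid.UUID("e19e3d16-bc11-11e4-9caa-c2051d5d46b0").bytes_le
```

Comparing ARM_GUID with REFERENCE is a cheap way to sanity-check any new
entry added to cper_guid.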
* Re: [PATCH v3 03/14] acpi/ghes: Use HEST table offsets when preparing GHES records
2025-01-31 17:42 ` [PATCH v3 03/14] acpi/ghes: Use HEST table offsets when preparing GHES records Mauro Carvalho Chehab
@ 2025-02-03 10:42 ` Jonathan Cameron via
2025-02-03 14:34 ` Igor Mammedov
1 sibling, 0 replies; 43+ messages in thread
From: Jonathan Cameron via @ 2025-02-03 10:42 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Igor Mammedov, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Ani Sinha, Dongjiu Geng, linux-kernel
On Fri, 31 Jan 2025 18:42:44 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> There are two pointers that are needed during error injection:
>
> 1. The start address of the CPER block to be stored;
> 2. The address of the ack.
>
> It is preferable to calculate them from the HEST table. This allows
> checking the source ID, the size of the table and the type of the
> HEST error block structures.
>
> Yet, keep the old code, as this is needed for migration purposes.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Tiny niggle on patch split up inline. Either way
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> @@ -212,14 +237,6 @@ static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker,
> {
> int i, error_status_block_offset;
>
> - /*
> - * TODO: Current version supports only one source.
> - * A further patch will drop this check, after adding a proper migration
> - * code, as, for the code to work, we need to store a bios pointer to the
> - * HEST table.
> - */
> - assert(num_sources == 1);
> -
> /* Build error_block_address */
> for (i = 0; i < num_sources; i++) {
> build_append_int_noprefix(hardware_errors, 0, sizeof(uint64_t));
> @@ -352,6 +369,14 @@ void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
> .oem_id = oem_id, .oem_table_id = oem_table_id };
> uint32_t hest_offset;
> int i;
> + AcpiGedState *acpi_ged_state;
> + AcpiGhesState *ags = NULL;
> +
> + acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
> + NULL));
> + if (acpi_ged_state) {
> + ags = &acpi_ged_state->ghes_state;
> + }
>
> hest_offset = table_data->len;
>
> @@ -371,10 +396,12 @@ void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
> * Tell firmware to write into GPA the address of HEST via fw_cfg,
> * once initialized.
> */
> - bios_linker_loader_write_pointer(linker,
> - ACPI_HEST_ADDR_FW_CFG_FILE, 0,
> - sizeof(uint64_t),
> - ACPI_BUILD_TABLE_FILE, hest_offset);
> + if (ags->use_hest_addr) {
Maybe move ags->use_hest_addr introduction to previous patch to avoid
churn here? It's not set yet anyway.
> + bios_linker_loader_write_pointer(linker,
> + ACPI_HEST_ADDR_FW_CFG_FILE, 0,
> + sizeof(uint64_t),
> + ACPI_BUILD_TABLE_FILE, hest_offset);
> + }
> }
>
> void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
> @@ -420,6 +447,78 @@ static void get_hw_error_offsets(uint64_t ghes_addr,
> *read_ack_register_addr = ghes_addr + sizeof(uint64_t);
> }
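To make the legacy path concrete, here is a rough Python sketch of the
offset math that get_hw_error_offsets() keeps for migration, based only on
the hunks quoted above. With a single source, the read ack register sits one
uint64_t after the error block address. The generalisation to several
sources (all error block addresses first, then all read ack registers) is an
assumption about the hardware_errors layout, and legacy_offsets is a made-up
name.

```python
UINT64_SIZE = 8  # sizeof(uint64_t)


def legacy_offsets(ghes_addr, source=0, num_sources=1):
    """Return (error_block_address_ptr, read_ack_register_addr).

    Assumed layout of the hardware_errors blob:
      num_sources * uint64_t  error block addresses
      num_sources * uint64_t  read ack registers
    """
    if not 0 <= source < num_sources:
        raise ValueError("invalid source index")

    error_block_ptr = ghes_addr + source * UINT64_SIZE
    read_ack_addr = ghes_addr + (num_sources + source) * UINT64_SIZE
    return error_block_ptr, read_ack_addr
```

For the single-source case this reduces to ghes_addr and
ghes_addr + sizeof(uint64_t), which is what the quoted C code computes.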
* Re: [PATCH v3 06/14] acpi/ghes: only set hw_error_le or hest_addr_le
2025-01-31 17:42 ` [PATCH v3 06/14] acpi/ghes: only set hw_error_le or hest_addr_le Mauro Carvalho Chehab
@ 2025-02-03 10:48 ` Jonathan Cameron via
0 siblings, 0 replies; 43+ messages in thread
From: Jonathan Cameron via @ 2025-02-03 10:48 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Igor Mammedov, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Ani Sinha, Dongjiu Geng, linux-kernel
On Fri, 31 Jan 2025 18:42:47 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> The hw_error_le pointer is used for legacy support (virt-9.2).
> Starting from virt-10.0, HEST table is accessed via hest_addr_le.
>
> Remove fw_cfg logic for legacy support if virt is 10.0 or newer.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
* Re: [PATCH v3 08/14] acpi/ghes: Cleanup the code which gets ghes ged state
2025-01-31 17:42 ` [PATCH v3 08/14] acpi/ghes: Cleanup the code which gets ghes ged state Mauro Carvalho Chehab
@ 2025-02-03 10:51 ` Jonathan Cameron via
0 siblings, 0 replies; 43+ messages in thread
From: Jonathan Cameron via @ 2025-02-03 10:51 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Igor Mammedov, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Ani Sinha, Dongjiu Geng, Paolo Bonzini, Peter Maydell,
kvm, linux-kernel
On Fri, 31 Jan 2025 18:42:49 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Move the check logic into a common function and simplify the
> code which checks if GHES is enabled and was properly setup.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
One minor comment inline on a change I think should be in an earlier
patch.
> -void ghes_record_cper_errors(const void *cper, size_t len,
> +void ghes_record_cper_errors(AcpiGhesState *ags, const void *cper, size_t len,
> uint16_t source_id, Error **errp)
> {
> uint64_t cper_addr = 0, read_ack_register_addr = 0, read_ack_register;
> - AcpiGedState *acpi_ged_state;
> - AcpiGhesState *ags;
>
> if (len > ACPI_GHES_MAX_RAW_DATA_LENGTH) {
> error_setg(errp, "GHES CPER record is too big: %zd", len);
> return;
> }
>
> - acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
> - NULL));
> - if (!acpi_ged_state) {
> - error_setg(errp, "Can't find ACPI_GED object");
> - return;
> - }
> - ags = &acpi_ged_state->ghes_state;
> -
> - if (!ags->hest_addr_le) {
> + if (!ags->use_hest_addr) {
Should this change be moved back to patch 3? use_hest_addr was available
at that point and it would reduce churn a tiny bit.
> get_hw_error_offsets(le64_to_cpu(ags->hw_error_le),
> &cper_addr, &read_ack_register_addr);
> } else {
> @@ -547,11 +531,6 @@ void ghes_record_cper_errors(const void *cper, size_t len,
> &cper_addr, &read_ack_register_addr, errp);
> }
>
> - if (!cper_addr) {
> - error_setg(errp, "can not find Generic Error Status Block");
> - return;
> - }
> -
> cpu_physical_memory_read(read_ack_register_addr,
> &read_ack_register, sizeof(read_ack_register));
>
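The selection in the hunk above can be sketched in Python, assuming the two
pointers are kept as raw little-endian 8-byte values, as the _le suffix
suggests. Here le64_to_cpu is a stand-in for the C helper and pick_lookup is
a hypothetical name; it only reports which lookup path would be taken.

```python
def le64_to_cpu(raw):
    """Decode an 8-byte little-endian value into a Python int."""
    if len(raw) != 8:
        raise ValueError("expected 8 bytes")
    return int.from_bytes(raw, byteorder="little")


def pick_lookup(use_hest_addr, hw_error_le, hest_addr_le):
    """Return the lookup path name and the decoded base address."""
    if not use_hest_addr:
        # Legacy machines: offsets come from the hardware_errors blob
        return "hw_error", le64_to_cpu(hw_error_le)

    # Newer machines: offsets are derived from the HEST table
    return "hest", le64_to_cpu(hest_addr_le)
```

Keying the branch on the flag rather than on whether hest_addr_le is
non-zero keeps the decision independent of what firmware has written so far.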
* Re: [PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT
2025-01-31 17:42 ` [PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT Mauro Carvalho Chehab
@ 2025-02-03 10:53 ` Jonathan Cameron via
0 siblings, 0 replies; 43+ messages in thread
From: Jonathan Cameron via @ 2025-02-03 10:53 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Igor Mammedov, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Ani Sinha, linux-kernel
On Fri, 31 Jan 2025 18:42:53 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> --- a/DSDT.dsl 2025-01-28 09:38:15.155347858 +0100
> +++ b/DSDT.dsl 2025-01-28 09:39:01.684836954 +0100
> @@ -9,9 +9,9 @@
> *
> * Original Table Header:
> * Signature "DSDT"
> - * Length 0x00001516 (5398)
> + * Length 0x00001542 (5442)
> * Revision 0x02
> - * Checksum 0x0F
> + * Checksum 0xE9
> * OEM ID "BOCHS "
> * OEM Table ID "BXPC "
> * OEM Revision 0x00000001 (1)
> @@ -1931,6 +1931,11 @@
> {
> Notify (PWRB, 0x80) // Status Change
> }
> +
> + If (((Local0 & 0x10) == 0x10))
> + {
> + Notify (GEDD, 0x80) // Status Change
> + }
> }
> }
>
> @@ -1939,6 +1944,12 @@
> Name (_HID, "PNP0C0C" /* Power Button Device */) // _HID: Hardware ID
> Name (_UID, Zero) // _UID: Unique ID
> }
> +
> + Device (GEDD)
> + {
> + Name (_HID, "PNP0C33" /* Error Device */) // _HID: Hardware ID
> + Name (_UID, Zero) // _UID: Unique ID
> + }
> }
> }
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Diff looks good.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 14/14] scripts/ghes_inject: add a script to generate GHES error inject
2025-01-31 17:42 ` [PATCH v3 14/14] scripts/ghes_inject: add a script to generate GHES error inject Mauro Carvalho Chehab
@ 2025-02-03 10:56 ` Jonathan Cameron via
2025-02-05 8:16 ` Markus Armbruster
1 sibling, 0 replies; 43+ messages in thread
From: Jonathan Cameron via @ 2025-02-03 10:56 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Igor Mammedov, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Cleber Rosa, John Snow, linux-kernel
On Fri, 31 Jan 2025 18:42:55 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Using the QMP GHESv2 API requires preparing a raw data array
> containing a CPER record.
>
> Add a helper script with subcommands to prepare such data.
>
> Currently, only ARM Processor error CPER record is supported, by
> using:
> $ ghes_inject.py arm
>
> which produces those warnings on Linux:
>
> [ 705.032426] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
> [ 774.866308] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> [ 774.866583] {4}[Hardware Error]: event severity: recoverable
> [ 774.866738] {4}[Hardware Error]: Error 0, type: recoverable
> [ 774.866889] {4}[Hardware Error]: section_type: ARM processor error
> [ 774.867048] {4}[Hardware Error]: MIDR: 0x00000000000f0510
> [ 774.867189] {4}[Hardware Error]: running state: 0x0
> [ 774.867321] {4}[Hardware Error]: Power State Coordination Interface state: 0
> [ 774.867511] {4}[Hardware Error]: Error info structure 0:
> [ 774.867679] {4}[Hardware Error]: num errors: 2
> [ 774.867801] {4}[Hardware Error]: error_type: 0x02: cache error
> [ 774.867962] {4}[Hardware Error]: error_info: 0x000000000091000f
> [ 774.868124] {4}[Hardware Error]: transaction type: Data Access
> [ 774.868280] {4}[Hardware Error]: cache error, operation type: Data write
> [ 774.868465] {4}[Hardware Error]: cache level: 2
> [ 774.868592] {4}[Hardware Error]: processor context not corrupted
> [ 774.868774] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
>
> Such script allows customizing the error data, allowing to change
> all fields at the record. Please use:
>
> $ ghes_inject.py arm -h
>
> For more details about its usage.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Thanks for the examples. Given my poor Python skills, take this one with
a pinch of salt.
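For anyone cross-checking the log above by hand, the error_info value
0x000000000091000f can be decoded with a few shifts. This is only a sketch:
it assumes the cache-error bit layout of the UEFI CPER ARM Processor Error
Information structure (validation bits in [15:0], transaction type in
[17:16], operation in [21:18], level in [24:22], context-corrupt flag in
bit 25), and the helper and table names are made up for illustration.

```python
# Assumed layout: UEFI CPER "ARM cache error" structure. Names are illustrative.
TRANSACTION_TYPES = {0: "Instruction", 1: "Data Access", 2: "Generic"}
CACHE_OPERATIONS = {
    0: "Generic error", 1: "Generic read", 2: "Generic write",
    3: "Data read", 4: "Data write", 5: "Instruction fetch",
    6: "Prefetch", 7: "Eviction", 8: "Snooping", 9: "Snooped",
    10: "Management",
}


def decode_cache_error_info(error_info):
    """Decode only the fields whose validation bit is set."""
    valid = error_info & 0xFFFF  # validation bits [15:0]
    fields = {}
    if valid & 0x1:  # transaction type valid
        code = (error_info >> 16) & 0x3
        fields["transaction"] = TRANSACTION_TYPES.get(code, "unknown")
    if valid & 0x2:  # operation valid
        code = (error_info >> 18) & 0xF
        fields["operation"] = CACHE_OPERATIONS.get(code, "unknown")
    if valid & 0x4:  # cache level valid
        fields["level"] = (error_info >> 22) & 0x7
    if valid & 0x8:  # processor-context-corrupt flag valid
        fields["context_corrupt"] = bool((error_info >> 25) & 0x1)
    return fields


print(decode_cache_error_info(0x000000000091000F))
```

Under those assumptions the result agrees with the four decoded lines Linux
prints after error_info: Data Access, Data write, cache level 2, processor
context not corrupted.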
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
` (13 preceding siblings ...)
2025-01-31 17:42 ` [PATCH v3 14/14] scripts/ghes_inject: add a script to generate GHES error inject Mauro Carvalho Chehab
@ 2025-02-03 11:09 ` Jonathan Cameron via
2025-02-03 15:22 ` Igor Mammedov
14 siblings, 1 reply; 43+ messages in thread
From: Jonathan Cameron via @ 2025-02-03 11:09 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Igor Mammedov, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
linux-kernel
On Fri, 31 Jan 2025 18:42:41 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Now that the ghes preparation patches were merged, let's add support
> for error injection.
>
> On this series, the first 6 patches change the math used to calculate offsets at HEST
> table and hardware_error firmware file, together with its migration code. Migration tested
> with both latest QEMU released kernel and upstream, on both directions.
>
> The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> to inject ARM Processor Error records.
>
> If I'm counting well, this is the 19th submission of my error inject patches.
Looks good to me. All remaining trivial things are in the category
of things to consider only if you are doing another spin. The code
ends up how I'd like it at the end of the series anyway, just
a question of the precise path to that state!
Jonathan
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 02/14] acpi/ghes: add a firmware file with HEST address
2025-01-31 17:42 ` [PATCH v3 02/14] acpi/ghes: add a firmware file with HEST address Mauro Carvalho Chehab
@ 2025-02-03 13:41 ` Igor Mammedov
0 siblings, 0 replies; 43+ messages in thread
From: Igor Mammedov @ 2025-02-03 13:41 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Michael S . Tsirkin, Jonathan Cameron, Shiju Jose, qemu-arm,
qemu-devel, Ani Sinha, Dongjiu Geng, linux-kernel
On Fri, 31 Jan 2025 18:42:43 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Store the HEST table address at GPA, placing the start of the table in the
> hest_addr_le variable.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
> ---
> hw/acpi/ghes.c | 16 ++++++++++++++++
> include/hw/acpi/ghes.h | 1 +
> 2 files changed, 17 insertions(+)
>
> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> index 4cabb177ad47..27478f2d5674 100644
> --- a/hw/acpi/ghes.c
> +++ b/hw/acpi/ghes.c
> @@ -30,6 +30,7 @@
>
> #define ACPI_HW_ERROR_FW_CFG_FILE "etc/hardware_errors"
> #define ACPI_HW_ERROR_ADDR_FW_CFG_FILE "etc/hardware_errors_addr"
> +#define ACPI_HEST_ADDR_FW_CFG_FILE "etc/acpi_table_hest_addr"
>
> /* The max size in bytes for one error block */
> #define ACPI_GHES_MAX_RAW_DATA_LENGTH (1 * KiB)
> @@ -349,8 +350,11 @@ void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
> {
> AcpiTable table = { .sig = "HEST", .rev = 1,
> .oem_id = oem_id, .oem_table_id = oem_table_id };
> + uint32_t hest_offset;
> int i;
>
> + hest_offset = table_data->len;
> +
> build_ghes_error_table(hardware_errors, linker, num_sources);
>
> acpi_table_begin(&table, table_data);
> @@ -362,6 +366,15 @@ void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
> }
>
> acpi_table_end(linker, &table);
> +
> + /*
> + * Tell firmware to write into GPA the address of HEST via fw_cfg,
> + * once initialized.
> + */
> + bios_linker_loader_write_pointer(linker,
> + ACPI_HEST_ADDR_FW_CFG_FILE, 0,
> + sizeof(uint64_t),
> + ACPI_BUILD_TABLE_FILE, hest_offset);
> }
>
> void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
> @@ -375,6 +388,9 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
> fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, NULL, NULL,
> NULL, &(ags->hw_error_le), sizeof(ags->hw_error_le), false);
>
> + fw_cfg_add_file_callback(s, ACPI_HEST_ADDR_FW_CFG_FILE, NULL, NULL,
> + NULL, &(ags->hest_addr_le), sizeof(ags->hest_addr_le), false);
> +
> ags->present = true;
> }
>
> diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
> index 9f0120d0d596..237721fec0a2 100644
> --- a/include/hw/acpi/ghes.h
> +++ b/include/hw/acpi/ghes.h
> @@ -58,6 +58,7 @@ enum AcpiGhesNotifyType {
> };
>
> typedef struct AcpiGhesState {
> + uint64_t hest_addr_le;
> uint64_t hw_error_le;
> bool present; /* True if GHES is present at all on this board */
> } AcpiGhesState;
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 03/14] acpi/ghes: Use HEST table offsets when preparing GHES records
2025-01-31 17:42 ` [PATCH v3 03/14] acpi/ghes: Use HEST table offsets when preparing GHES records Mauro Carvalho Chehab
2025-02-03 10:42 ` Jonathan Cameron via
@ 2025-02-03 14:34 ` Igor Mammedov
2025-02-21 6:02 ` Mauro Carvalho Chehab
1 sibling, 1 reply; 43+ messages in thread
From: Igor Mammedov @ 2025-02-03 14:34 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Michael S . Tsirkin, Jonathan Cameron, Shiju Jose, qemu-arm,
qemu-devel, Ani Sinha, Dongjiu Geng, linux-kernel
On Fri, 31 Jan 2025 18:42:44 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> There are two pointers that are needed during error injection:
>
> 1. The start address of the CPER block to be stored;
> 2. The address of the ack.
>
> It is preferable to calculate them from the HEST table. This allows
> checking the source ID, the size of the table and the type of the
> HEST error block structures.
>
> Yet, keep the old code, as this is needed for migration purposes.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
> hw/acpi/ghes.c | 132 ++++++++++++++++++++++++++++++++++++-----
> include/hw/acpi/ghes.h | 1 +
> 2 files changed, 119 insertions(+), 14 deletions(-)
>
> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> index 27478f2d5674..8f284fd191a6 100644
> --- a/hw/acpi/ghes.c
> +++ b/hw/acpi/ghes.c
> @@ -41,6 +41,12 @@
> /* Address offset in Generic Address Structure(GAS) */
> #define GAS_ADDR_OFFSET 4
>
> +/*
> + * ACPI spec 1.0b
> + * 5.2.3 System Description Table Header
> + */
> +#define ACPI_DESC_HEADER_OFFSET 36
> +
> /*
> * The total size of Generic Error Data Entry
> * ACPI 6.1/6.2: 18.3.2.7.1 Generic Error Data,
> @@ -61,6 +67,25 @@
> */
> #define ACPI_GHES_GESB_SIZE 20
>
> +/*
> + * Offsets with regards to the start of the HEST table stored at
> + * ags->hest_addr_le,
If I read this literally, then the offsets above are not what is
declared later in this patch.
I'd really drop this comment altogether, as it's confusing,
and rather get the variable/macro naming right.
> according with the memory layout map at
> + * docs/specs/acpi_hest_ghes.rst.
> + */
What we need is an update to the above doc, describing the new and old ways,
as a separate patch.
> +
> +/*
> + * ACPI 6.2: 18.3.2.8 Generic Hardware Error Source version 2
^^^^^^^^ - wrt version, I see it in 6.1.
Our requirement is to point to the earliest spec version where it appeared.
If it must point to a later version for some justified reason,
the explanation of 'why' should be mentioned in the comment.
Please check all the versions/chapters you are touching or adding in this series.
> + * Table 18-382 Generic Hardware Error Source version 2 (GHESv2) Structure
> + */
> +#define HEST_GHES_V2_TABLE_SIZE 92
It's not a table but rather a GHES_V2 entry in HEST,
and it should be named as such (emphasis on _entry_).
> +#define GHES_READ_ACK_ADDR_OFF 64
Please add a comment like the one below, but for 'Read Ack Register'.
> +/*
> + * ACPI 6.2: 18.3.2.7: Generic Hardware Error Source
> + * Table 18-380: 'Error Status Address' field
> + */
> +#define GHES_ERR_STATUS_ADDR_OFF 20
> +
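As a cross-check of the three constants above (92, 64 and 20), the offsets
can be recomputed by summing field sizes. The sketch below assumes the
field order and sizes of the ACPI "Generic Hardware Error Source version 2"
structure as I read it, with a 12-byte Generic Address Structure (GAS) and
a 28-byte notification structure. The field names are shorthand, not
identifiers from QEMU.

```python
GAS_SIZE = 12           # Generic Address Structure: 4 header bytes + 8-byte address
NOTIFICATION_SIZE = 28  # hardware error notification structure

# (name, size in bytes), in structure order; names are illustrative shorthand.
GHES_V2_FIELDS = [
    ("type", 2),
    ("source_id", 2),
    ("related_source_id", 2),
    ("flags", 1),
    ("enabled", 1),
    ("num_records_to_preallocate", 4),
    ("max_sections_per_record", 4),
    ("max_raw_data_length", 4),
    ("error_status_address", GAS_SIZE),
    ("notification", NOTIFICATION_SIZE),
    ("error_status_block_length", 4),
    ("read_ack_register", GAS_SIZE),
    ("read_ack_preserve", 8),
    ("read_ack_write", 8),
]


def field_offsets(fields):
    """Return ({name: offset}, total_size) for a packed field list."""
    offsets, pos = {}, 0
    for name, size in fields:
        offsets[name] = pos
        pos += size
    return offsets, pos


offsets, entry_size = field_offsets(GHES_V2_FIELDS)
print(offsets["error_status_address"], offsets["read_ack_register"], entry_size)
```

With those sizes the sums come out as 20, 64 and 92, matching
GHES_ERR_STATUS_ADDR_OFF, GHES_READ_ACK_ADDR_OFF and the entry size. The
8-byte address inside each GAS then sits 4 bytes further in, which is what
GAS_ADDR_OFFSET accounts for.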
> /*
> * Values for error_severity field
> */
> @@ -212,14 +237,6 @@ static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker,
> {
> int i, error_status_block_offset;
>
> - /*
> - * TODO: Current version supports only one source.
> - * A further patch will drop this check, after adding a proper migration
> - * code, as, for the code to work, we need to store a bios pointer to the
> - * HEST table.
> - */
> - assert(num_sources == 1);
> -
> /* Build error_block_address */
> for (i = 0; i < num_sources; i++) {
> build_append_int_noprefix(hardware_errors, 0, sizeof(uint64_t));
> @@ -352,6 +369,14 @@ void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
> .oem_id = oem_id, .oem_table_id = oem_table_id };
> uint32_t hest_offset;
> int i;
> + AcpiGedState *acpi_ged_state;
> + AcpiGhesState *ags = NULL;
> +
> + acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
> + NULL));
> + if (acpi_ged_state) {
> + ags = &acpi_ged_state->ghes_state;
> + }
>
> hest_offset = table_data->len;
>
> @@ -371,10 +396,12 @@ void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
> * Tell firmware to write into GPA the address of HEST via fw_cfg,
> * once initialized.
> */
> - bios_linker_loader_write_pointer(linker,
> - ACPI_HEST_ADDR_FW_CFG_FILE, 0,
> - sizeof(uint64_t),
> - ACPI_BUILD_TABLE_FILE, hest_offset);
> + if (ags->use_hest_addr) {
> + bios_linker_loader_write_pointer(linker,
> + ACPI_HEST_ADDR_FW_CFG_FILE, 0,
> + sizeof(uint64_t),
> + ACPI_BUILD_TABLE_FILE, hest_offset);
> + }
I'd move this patch before 2/14, to avoid issues during bisection.
Also, the legacy variant is hidden in build_ghes_error_table():
/*
* tell firmware to write hardware_errors GPA into
* hardware_errors_addr fw_cfg, once the former has been initialized.
*/
bios_linker_loader_write_pointer()
and after this patch we end up with scattered code that should pick
only one of them (but doesn't).
As a prerequisite, I'd move the legacy call into acpi_build_hest() as a separate patch,
then have this patch add the above 'if' gate,
and have a follow-up patch [2/14 currently] add bios_linker_loader_write_pointer(ACPI_HEST_ADDR_FW_CFG_FILE).
> }
>
> void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
Shouldn't we do the same for the fw_cfg_add_file_callback() hunk added
in the previous patch and the related 'fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE'?
We need only one of them.
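To make the requested end state concrete, here is a toy model of the
either/or selection. It is not QEMU code: the two file names are the ones
defined in the quoted hunks, while the function name and its return shape
are hypothetical.

```python
# File names as defined in the quoted hunks; everything else is illustrative.
ACPI_HW_ERROR_ADDR_FW_CFG_FILE = "etc/hardware_errors_addr"
ACPI_HEST_ADDR_FW_CFG_FILE = "etc/acpi_table_hest_addr"


def address_files_to_register(use_hest_addr):
    """Return the single writable address file a machine should expose."""
    if use_hest_addr:
        # New machine types: firmware writes back the HEST table address.
        return [ACPI_HEST_ADDR_FW_CFG_FILE]
    # Legacy machine types: firmware writes back the hardware_errors address.
    return [ACPI_HW_ERROR_ADDR_FW_CFG_FILE]


print(address_files_to_register(True))
print(address_files_to_register(False))
```

The point is only that each machine type exposes exactly one of the two
address files, never both.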
> @@ -420,6 +447,78 @@ static void get_hw_error_offsets(uint64_t ghes_addr,
> *read_ack_register_addr = ghes_addr + sizeof(uint64_t);
> }
>
> +static void get_ghes_source_offsets(uint16_t source_id,
> + uint64_t hest_addr,
> + uint64_t *cper_addr,
> + uint64_t *read_ack_start_addr,
> + Error **errp)
> +{
> + uint64_t hest_err_block_addr, hest_read_ack_addr;
> + uint64_t err_source_entry, error_block_addr;
> + uint32_t num_sources, i;
> +
> + hest_addr += ACPI_DESC_HEADER_OFFSET;
> +
> + cpu_physical_memory_read(hest_addr, &num_sources,
> + sizeof(num_sources));
> + num_sources = le32_to_cpu(num_sources);
> +
> + err_source_entry = hest_addr + sizeof(num_sources);
> +
> + /*
> + * Currently, HEST Error source navigates only for GHESv2 tables
> + */
> +
Unneeded newline.
> + for (i = 0; i < num_sources; i++) {
> + uint64_t addr = err_source_entry;
> + uint16_t type, src_id;
> +
> + cpu_physical_memory_read(addr, &type, sizeof(type));
> + type = le16_to_cpu(type);
> +
> + /* For now, we only know the size of GHESv2 table */
> + if (type != ACPI_GHES_SOURCE_GENERIC_ERROR_V2) {
> + error_setg(errp, "HEST: type %d not supported.", type);
> + return;
> + }
> +
> + /* Compare CPER source address at the GHESv2 structure */
> + addr += sizeof(type);
> + cpu_physical_memory_read(addr, &src_id, sizeof(src_id));
> + if (le16_to_cpu(src_id) == source_id) {
> + break;
> + }
> +
> + err_source_entry += HEST_GHES_V2_TABLE_SIZE;
> + }
> + if (i == num_sources) {
> + error_setg(errp, "HEST: Source %d not found.", source_id);
> + return;
> + }
> +
> + /* Navigate through table address pointers */
> + hest_err_block_addr = err_source_entry + GHES_ERR_STATUS_ADDR_OFF +
> + GAS_ADDR_OFFSET;
> +
> + cpu_physical_memory_read(hest_err_block_addr, &error_block_addr,
> + sizeof(error_block_addr));
> +
I'd drop newlines for related read/processing
> + error_block_addr = le64_to_cpu(error_block_addr);
> +
> + cpu_physical_memory_read(error_block_addr, cper_addr,
> + sizeof(*cper_addr));
> +
ditto
> + *cper_addr = le64_to_cpu(*cper_addr);
> +
> + hest_read_ack_addr = err_source_entry + GHES_READ_ACK_ADDR_OFF +
> + GAS_ADDR_OFFSET;
> +
> + cpu_physical_memory_read(hest_read_ack_addr, read_ack_start_addr,
> + sizeof(*read_ack_start_addr));
> +
ditto
> + *read_ack_start_addr = le64_to_cpu(*read_ack_start_addr);
> +}
> +
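For readers who want to follow the pointer chasing in
get_ghes_source_offsets() without a guest, the walk can be modelled over a
plain byte buffer. This is a sketch under assumptions, not the patch's
code: guest memory is a bytearray, the constants mirror the quoted macros,
the GHESv2 type value 10 is taken from the ACPI HEST source types, and
build_fake_guest_memory() is a made-up helper with arbitrary addresses.

```python
import struct

ACPI_DESC_HEADER_OFFSET = 36
GHES_V2_ENTRY_SIZE = 92
GHES_ERR_STATUS_ADDR_OFF = 20
GHES_READ_ACK_ADDR_OFF = 64
GAS_ADDR_OFFSET = 4
ACPI_GHES_SOURCE_GENERIC_ERROR_V2 = 10  # assumed HEST type value for GHESv2


def find_ghes_source(mem, hest_addr, source_id):
    """Return (cper_addr, read_ack_addr) for source_id, reading little-endian."""
    pos = hest_addr + ACPI_DESC_HEADER_OFFSET
    (num_sources,) = struct.unpack_from("<I", mem, pos)
    entry = pos + 4
    for _ in range(num_sources):
        src_type, src_id = struct.unpack_from("<HH", mem, entry)
        if src_type != ACPI_GHES_SOURCE_GENERIC_ERROR_V2:
            raise ValueError(f"HEST: type {src_type} not supported")
        if src_id == source_id:
            break
        entry += GHES_V2_ENTRY_SIZE
    else:
        raise LookupError(f"HEST: source {source_id} not found")
    # Error Status Address GAS -> pointer slot -> CPER block address.
    ptr = entry + GHES_ERR_STATUS_ADDR_OFF + GAS_ADDR_OFFSET
    (error_block_addr,) = struct.unpack_from("<Q", mem, ptr)
    (cper_addr,) = struct.unpack_from("<Q", mem, error_block_addr)
    # Read Ack Register GAS -> address of the ack register itself.
    ptr = entry + GHES_READ_ACK_ADDR_OFF + GAS_ADDR_OFFSET
    (read_ack_addr,) = struct.unpack_from("<Q", mem, ptr)
    return cper_addr, read_ack_addr


def build_fake_guest_memory(source_ids):
    """Hypothetical test fixture: a HEST at 0x100 with one entry per source."""
    mem = bytearray(4096)
    hest_addr = 0x100
    first = hest_addr + ACPI_DESC_HEADER_OFFSET
    struct.pack_into("<I", mem, first, len(source_ids))
    for i, sid in enumerate(source_ids):
        entry = first + 4 + i * GHES_V2_ENTRY_SIZE
        struct.pack_into("<HH", mem, entry, ACPI_GHES_SOURCE_GENERIC_ERROR_V2, sid)
        slot = 0x800 + 8 * i  # pointer slot holding the CPER block address
        struct.pack_into("<Q", mem, entry + 24, slot)
        struct.pack_into("<Q", mem, slot, 0xA00 + 0x100 * i)
        struct.pack_into("<Q", mem, entry + 68, 0x900 + 8 * i)
    return mem, hest_addr


mem, hest = build_fake_guest_memory([0, 1])
print([hex(v) for v in find_ghes_source(mem, hest, 1)])
```

As in the quoted C, an unknown entry type aborts the walk because the entry
size is only known for GHESv2, and running off the end of the source list
is reported as "not found".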
> void ghes_record_cper_errors(const void *cper, size_t len,
> uint16_t source_id, Error **errp)
> {
> @@ -440,8 +539,13 @@ void ghes_record_cper_errors(const void *cper, size_t len,
> }
> ags = &acpi_ged_state->ghes_state;
>
> - get_hw_error_offsets(le64_to_cpu(ags->hw_error_le),
> - &cper_addr, &read_ack_register_addr);
> + if (!ags->hest_addr_le) {
> + get_hw_error_offsets(le64_to_cpu(ags->hw_error_le),
> + &cper_addr, &read_ack_register_addr);
> + } else {
> + get_ghes_source_offsets(source_id, le64_to_cpu(ags->hest_addr_le),
> + &cper_addr, &read_ack_register_addr, errp);
> + }
>
> if (!cper_addr) {
> error_setg(errp, "can not find Generic Error Status Block");
> diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
> index 237721fec0a2..6c2e57af0456 100644
> --- a/include/hw/acpi/ghes.h
> +++ b/include/hw/acpi/ghes.h
> @@ -61,6 +61,7 @@ typedef struct AcpiGhesState {
> uint64_t hest_addr_le;
> uint64_t hw_error_le;
> bool present; /* True if GHES is present at all on this board */
> + bool use_hest_addr; /* True if HEST address is present */
> } AcpiGhesState;
>
> /*
an
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 05/14] acpi/generic_event_device: add logic to detect if HEST addr is available
2025-01-31 17:42 ` [PATCH v3 05/14] acpi/generic_event_device: add logic to detect if HEST addr is available Mauro Carvalho Chehab
@ 2025-02-03 14:56 ` Igor Mammedov
0 siblings, 0 replies; 43+ messages in thread
From: Igor Mammedov @ 2025-02-03 14:56 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Michael S . Tsirkin, Jonathan Cameron, Shiju Jose, qemu-arm,
qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Dongjiu Geng,
Eduardo Habkost, Marcel Apfelbaum, Peter Maydell, Shannon Zhao,
Yanan Wang, Zhao Liu, linux-kernel
On Fri, 31 Jan 2025 18:42:46 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Create a new property (x-has-hest-addr) and use it to detect if
> the GHES table offsets can be calculated from the HEST address
> (qemu 10.0 and upper) or via the legacy way via an offset obtained
> from the hardware_errors firmware file.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
> hw/acpi/generic_event_device.c | 1 +
> hw/acpi/ghes.c | 17 ++++++-----------
> hw/arm/virt-acpi-build.c | 32 ++++++++++++++++++++++++++++----
> hw/core/machine.c | 2 ++
> include/hw/acpi/ghes.h | 3 ++-
> 5 files changed, 39 insertions(+), 16 deletions(-)
>
> diff --git a/hw/acpi/generic_event_device.c b/hw/acpi/generic_event_device.c
> index 5346cae573b7..14d8513a5440 100644
> --- a/hw/acpi/generic_event_device.c
> +++ b/hw/acpi/generic_event_device.c
> @@ -318,6 +318,7 @@ static void acpi_ged_send_event(AcpiDeviceIf *adev, AcpiEventStatusBits ev)
>
> static const Property acpi_ged_properties[] = {
> DEFINE_PROP_UINT32("ged-event", AcpiGedState, ged_event_bitmap, 0),
> + DEFINE_PROP_BOOL("x-has-hest-addr", AcpiGedState, ghes_state.use_hest_addr, false),
> };
>
> static const VMStateDescription vmstate_memhp_state = {
> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> index 8f284fd191a6..a91dcd777433 100644
> --- a/hw/acpi/ghes.c
> +++ b/hw/acpi/ghes.c
> @@ -359,7 +359,8 @@ static void build_ghes_v2_entry(GArray *table_data,
> }
>
> /* Build Hardware Error Source Table */
> -void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
> +void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
> + GArray *hardware_errors,
> BIOSLinker *linker,
> const AcpiNotificationSourceId *notif_source,
> int num_sources,
> @@ -369,14 +370,6 @@ void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
> .oem_id = oem_id, .oem_table_id = oem_table_id };
> uint32_t hest_offset;
> int i;
> - AcpiGedState *acpi_ged_state;
> - AcpiGhesState *ags = NULL;
> -
> - acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
> - NULL));
> - if (acpi_ged_state) {
> - ags = &acpi_ged_state->ghes_state;
> - }
Hmm, can we move this just once within the series, to the place where it should end up,
instead of rewriting just-added code over again,
somewhere at the beginning of the series (maybe as a separate patch)?
> hest_offset = table_data->len;
>
> @@ -415,8 +408,10 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
> fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, NULL, NULL,
> NULL, &(ags->hw_error_le), sizeof(ags->hw_error_le), false);
>
> - fw_cfg_add_file_callback(s, ACPI_HEST_ADDR_FW_CFG_FILE, NULL, NULL,
> - NULL, &(ags->hest_addr_le), sizeof(ags->hest_addr_le), false);
> + if (ags->use_hest_addr) {
> + fw_cfg_add_file_callback(s, ACPI_HEST_ADDR_FW_CFG_FILE, NULL, NULL,
> + NULL, &(ags->hest_addr_le), sizeof(ags->hest_addr_le), false);
> + }
Same as the comment on 3/14: please no flipping back and forth, which might
break bisection. Also, this hunk looks misplaced and should be part of 3/14.
And do not forget about ACPI_HW_ERROR_ADDR_FW_CFG_FILE, which should be excluded
when use_hest_addr == TRUE.
I see that 6/14 does that, but order makes it
>
> ags->present = true;
> }
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 3d411787fc37..9de51105a513 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -897,6 +897,10 @@ static const AcpiNotificationSourceId hest_ghes_notify[] = {
> { ACPI_HEST_SRC_ID_SYNC, ACPI_GHES_NOTIFY_SEA },
> };
>
> +static const AcpiNotificationSourceId hest_ghes_notify_9_2[] = {
> + { ACPI_HEST_SRC_ID_SYNC, ACPI_GHES_NOTIFY_SEA },
> +};
> +
> static
> void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
> {
> @@ -950,10 +954,30 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
> build_dbg2(tables_blob, tables->linker, vms);
>
> if (vms->ras) {
> - acpi_add_table(table_offsets, tables_blob);
> - acpi_build_hest(tables_blob, tables->hardware_errors, tables->linker,
> - hest_ghes_notify, ARRAY_SIZE(hest_ghes_notify),
> - vms->oem_id, vms->oem_table_id);
> + static const AcpiNotificationSourceId *notify;
> + AcpiGedState *acpi_ged_state;
> + unsigned int notify_sz;
> + AcpiGhesState *ags;
> +
> + acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
> + NULL));
> + if (acpi_ged_state) {
> + ags = &acpi_ged_state->ghes_state;
> +
> + acpi_add_table(table_offsets, tables_blob);
> +
> + if (!ags->use_hest_addr) {
> + notify = hest_ghes_notify_9_2;
> + notify_sz = ARRAY_SIZE(hest_ghes_notify_9_2);
All the 9.2 compat hunks look misplaced:
they have no relation to using the HEST address at all.
They belong to the patches that introduce the new error type,
i.e. where hest_ghes_notify grows into a two-entry array.
> + } else {
> + notify = hest_ghes_notify;
> + notify_sz = ARRAY_SIZE(hest_ghes_notify);
> + }
> +
> + acpi_build_hest(ags, tables_blob, tables->hardware_errors,
> + tables->linker, notify, notify_sz,
> + vms->oem_id, vms->oem_table_id);
> + }
> }
>
> if (ms->numa_state->num_nodes > 0) {
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index c23b39949649..0d0cde481954 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -34,10 +34,12 @@
> #include "hw/virtio/virtio-pci.h"
> #include "hw/virtio/virtio-net.h"
> #include "hw/virtio/virtio-iommu.h"
> +#include "hw/acpi/generic_event_device.h"
> #include "audio/audio.h"
>
> GlobalProperty hw_compat_9_2[] = {
> {"arm-cpu", "backcompat-pauth-default-use-qarma5", "true"},
> + { TYPE_ACPI_GED, "x-has-hest-addr", "false" },
> };
> const size_t hw_compat_9_2_len = G_N_ELEMENTS(hw_compat_9_2);
>
> diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
> index 6c2e57af0456..bfc8fd851648 100644
> --- a/include/hw/acpi/ghes.h
> +++ b/include/hw/acpi/ghes.h
> @@ -76,7 +76,8 @@ typedef struct AcpiNotificationSourceId {
> enum AcpiGhesNotifyType notify;
> } AcpiNotificationSourceId;
>
> -void acpi_build_hest(GArray *table_data, GArray *hardware_errors,
> +void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
> + GArray *hardware_errors,
> BIOSLinker *linker,
> const AcpiNotificationSourceId * const notif_source,
> int num_sources,
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
2025-02-03 11:09 ` [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for " Jonathan Cameron via
@ 2025-02-03 15:22 ` Igor Mammedov
2025-02-21 6:38 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 43+ messages in thread
From: Igor Mammedov @ 2025-02-03 15:22 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Mauro Carvalho Chehab, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
linux-kernel
On Mon, 3 Feb 2025 11:09:34 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> On Fri, 31 Jan 2025 18:42:41 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
>
> > Now that the ghes preparation patches were merged, let's add support
> > for error injection.
> >
> > On this series, the first 6 patches change the math used to calculate offsets at HEST
> > table and hardware_error firmware file, together with its migration code. Migration tested
> > with both latest QEMU released kernel and upstream, on both directions.
> >
> > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > to inject ARM Processor Error records.
> >
> > If I'm counting well, this is the 19th submission of my error inject patches.
>
> Looks good to me. All remaining trivial things are in the category
> of things to consider only if you are doing another spin. The code
> ends up how I'd like it at the end of the series anyway, just
> a question of the precise path to that state!
If you look at the series as a whole, it's more or less fine (I guess you
and I got used to it).
However, if you take it patch by patch (as if you've never seen it), the
ordering is messed up (the same would apply to everyone after a while,
once it's forgotten).
So I'd strongly suggest restructuring the series (especially 2-6/14).
To sum up my comments wrt ordering:
0. Add a test case for the HEST table with the current HEST as the expected blob
(currently missing), so that we can be sure that we haven't messed up
existing tables during refactoring.
1. Introduce use_hest_addr (disabled for now) so we can place all
legacy code in the !use_hest_addr branch.
2. Then the patches that do the switch to HEST addr lookup:
* ged lookup (preferably at the place it should end up eventually)
* legacy bios_linker/fwcfg fencing patches
* on top of that, the new hest bios_linker/fwcfg ones
* and then the rest
(everything that belongs to the 2nd error source should _not_ be a part of that)
3. Add the 2nd error source, incl. the necessary test procedures, and introduce
and update DSDT/HEST.
>
> Jonathan
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 13/14] qapi/acpi-hest: add an interface to do generic CPER error injection
2025-01-31 17:42 ` [PATCH v3 13/14] qapi/acpi-hest: add an interface to do generic CPER error injection Mauro Carvalho Chehab
@ 2025-02-05 8:12 ` Markus Armbruster
0 siblings, 0 replies; 43+ messages in thread
From: Markus Armbruster @ 2025-02-05 8:12 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Igor Mammedov, Michael S . Tsirkin, Jonathan Cameron, Shiju Jose,
qemu-arm, qemu-devel, Ani Sinha, Dongjiu Geng, Eric Blake,
Michael Roth, Paolo Bonzini, Peter Maydell, Shannon Zhao,
linux-kernel
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> Creates a QMP command to be used for generic ACPI APEI hardware error
> injection (HEST) via GHESv2, and add support for it for ARM guests.
>
> Error injection uses ACPI_HEST_SRC_ID_QMP source ID to be platform
> independent. This is mapped at arch virt bindings, depending on the
> types supported by QEMU and by the BIOS. So, on ARM, this is supported
> via ACPI_GHES_NOTIFY_GPIO notification type.
>
> This patch is co-authored:
> - original ghes logic to inject a simple ARM record by Shiju Jose;
> - generic logic to handle block addresses by Jonathan Cameron;
> - generic GHESv2 error inject by Mauro Carvalho Chehab;
>
> Co-authored-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Co-authored-by: Shiju Jose <shiju.jose@huawei.com>
> Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Acked-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
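For reference, a client-side sketch of what sending such a command could
look like. The command name inject-ghes-v2-error comes from the cover
letter and the 1 KiB limit from ACPI_GHES_MAX_RAW_DATA_LENGTH in ghes.c.
The "cper" argument name and the base64 encoding are my assumptions about
the schema, and the function name is made up.

```python
import base64
import json

ACPI_GHES_MAX_RAW_DATA_LENGTH = 1024  # 1 KiB, as in hw/acpi/ghes.c


def build_inject_command(cper_record):
    """Serialize a raw CPER record into a QMP command string."""
    if len(cper_record) > ACPI_GHES_MAX_RAW_DATA_LENGTH:
        # Mirror the size check done on the QEMU side before sending.
        raise ValueError(f"GHES CPER record is too big: {len(cper_record)}")
    return json.dumps({
        "execute": "inject-ghes-v2-error",
        # Assumed argument name and encoding; check the QAPI schema.
        "arguments": {"cper": base64.b64encode(cper_record).decode("ascii")},
    })


print(build_inject_command(b"\x01\x02"))
```

Checking the length before sending simply avoids a round trip for records
that QEMU would reject anyway.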
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 14/14] scripts/ghes_inject: add a script to generate GHES error inject
2025-01-31 17:42 ` [PATCH v3 14/14] scripts/ghes_inject: add a script to generate GHES error inject Mauro Carvalho Chehab
2025-02-03 10:56 ` Jonathan Cameron via
@ 2025-02-05 8:16 ` Markus Armbruster
2025-02-21 4:57 ` Mauro Carvalho Chehab
1 sibling, 1 reply; 43+ messages in thread
From: Markus Armbruster @ 2025-02-05 8:16 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Igor Mammedov, Michael S . Tsirkin, Jonathan Cameron, Shiju Jose,
qemu-arm, qemu-devel, Cleber Rosa, John Snow, linux-kernel
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> Using the QMP GHESv2 API requires preparing a raw data array
> containing a CPER record.
>
> Add a helper script with subcommands to prepare such data.
>
> Currently, only ARM Processor error CPER record is supported, by
> using:
> $ ghes_inject.py arm
>
> which produces those warnings on Linux:
>
> [ 705.032426] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
> [ 774.866308] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> [ 774.866583] {4}[Hardware Error]: event severity: recoverable
> [ 774.866738] {4}[Hardware Error]: Error 0, type: recoverable
> [ 774.866889] {4}[Hardware Error]: section_type: ARM processor error
> [ 774.867048] {4}[Hardware Error]: MIDR: 0x00000000000f0510
> [ 774.867189] {4}[Hardware Error]: running state: 0x0
> [ 774.867321] {4}[Hardware Error]: Power State Coordination Interface state: 0
> [ 774.867511] {4}[Hardware Error]: Error info structure 0:
> [ 774.867679] {4}[Hardware Error]: num errors: 2
> [ 774.867801] {4}[Hardware Error]: error_type: 0x02: cache error
> [ 774.867962] {4}[Hardware Error]: error_info: 0x000000000091000f
> [ 774.868124] {4}[Hardware Error]: transaction type: Data Access
> [ 774.868280] {4}[Hardware Error]: cache error, operation type: Data write
> [ 774.868465] {4}[Hardware Error]: cache level: 2
> [ 774.868592] {4}[Hardware Error]: processor context not corrupted
> [ 774.868774] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
>
> Such script allows customizing the error data, allowing to change
> all fields at the record. Please use:
>
> $ ghes_inject.py arm -h
>
> For more details about its usage.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
[...]
> diff --git a/scripts/arm_processor_error.py b/scripts/arm_processor_error.py
> new file mode 100644
> index 000000000000..b0e8450e667e
> --- /dev/null
> +++ b/scripts/arm_processor_error.py
> @@ -0,0 +1,476 @@
> +#!/usr/bin/env python3
> +#
> +# pylint: disable=C0301,C0114,R0903,R0912,R0913,R0914,R0915,W0511
> +# SPDX-License-Identifier: GPL-2.0
Sorry if this has been answered already... why not GPL-2.0-or-later?
More of the same below.
[...]
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 14/14] scripts/ghes_inject: add a script to generate GHES error inject
2025-02-05 8:16 ` Markus Armbruster
@ 2025-02-21 4:57 ` Mauro Carvalho Chehab
2025-02-21 5:50 ` Markus Armbruster
0 siblings, 1 reply; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-21 4:57 UTC (permalink / raw)
To: Markus Armbruster
Cc: Igor Mammedov, Michael S . Tsirkin, Jonathan Cameron, Shiju Jose,
qemu-arm, qemu-devel, Cleber Rosa, John Snow, linux-kernel
On Wed, 05 Feb 2025 09:16:53 +0100
Markus Armbruster <armbru@redhat.com> wrote:
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
>
> > Using the QMP GHESv2 API requires preparing a raw data array
> > containing a CPER record.
> >
> > Add a helper script with subcommands to prepare such data.
> >
> > Currently, only ARM Processor error CPER record is supported, by
> > using:
> > $ ghes_inject.py arm
> >
> > which produces those warnings on Linux:
> >
> > [ 705.032426] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
> > [ 774.866308] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > [ 774.866583] {4}[Hardware Error]: event severity: recoverable
> > [ 774.866738] {4}[Hardware Error]: Error 0, type: recoverable
> > [ 774.866889] {4}[Hardware Error]: section_type: ARM processor error
> > [ 774.867048] {4}[Hardware Error]: MIDR: 0x00000000000f0510
> > [ 774.867189] {4}[Hardware Error]: running state: 0x0
> > [ 774.867321] {4}[Hardware Error]: Power State Coordination Interface state: 0
> > [ 774.867511] {4}[Hardware Error]: Error info structure 0:
> > [ 774.867679] {4}[Hardware Error]: num errors: 2
> > [ 774.867801] {4}[Hardware Error]: error_type: 0x02: cache error
> > [ 774.867962] {4}[Hardware Error]: error_info: 0x000000000091000f
> > [ 774.868124] {4}[Hardware Error]: transaction type: Data Access
> > [ 774.868280] {4}[Hardware Error]: cache error, operation type: Data write
> > [ 774.868465] {4}[Hardware Error]: cache level: 2
> > [ 774.868592] {4}[Hardware Error]: processor context not corrupted
> > [ 774.868774] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
> >
> > Such script allows customizing the error data, allowing to change
> > all fields at the record. Please use:
> >
> > $ ghes_inject.py arm -h
> >
> > For more details about its usage.
> >
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
>
> [...]
>
> > diff --git a/scripts/arm_processor_error.py b/scripts/arm_processor_error.py
> > new file mode 100644
> > index 000000000000..b0e8450e667e
> > --- /dev/null
> > +++ b/scripts/arm_processor_error.py
> > @@ -0,0 +1,476 @@
> > +#!/usr/bin/env python3
> > +#
> > +# pylint: disable=C0301,C0114,R0903,R0912,R0913,R0914,R0915,W0511
> > +# SPDX-License-Identifier: GPL-2.0
>
> Sorry if this has been answered already... why not GPL-2.0-or-later?
>
> More of the same below.
No particular reason. It is just that GPL-2.0 is my preferred license.
I'll change the license of the three scripts to be GPL-2.0-or-later.
>
> [...]
>
Thanks,
Mauro
* Re: [PATCH v3 14/14] scripts/ghes_inject: add a script to generate GHES error inject
2025-02-21 4:57 ` Mauro Carvalho Chehab
@ 2025-02-21 5:50 ` Markus Armbruster
0 siblings, 0 replies; 43+ messages in thread
From: Markus Armbruster @ 2025-02-21 5:50 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Igor Mammedov, Michael S . Tsirkin, Jonathan Cameron, Shiju Jose,
qemu-arm, qemu-devel, Cleber Rosa, John Snow, linux-kernel
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> On Wed, 05 Feb 2025 09:16:53 +0100
> Markus Armbruster <armbru@redhat.com> wrote:
[...]
>> Sorry if this has been answered already... why not GPL-2.0-or-later?
>>
>> More of the same below.
>
> No particular reason. It is just that GPL-2.0 is my preferred license.
>
> I'll change the license of the three scripts to be GPL-2.0-or-later.
Thank you!
* Re: [PATCH v3 03/14] acpi/ghes: Use HEST table offsets when preparing GHES records
2025-02-03 14:34 ` Igor Mammedov
@ 2025-02-21 6:02 ` Mauro Carvalho Chehab
2025-02-25 9:43 ` Igor Mammedov
0 siblings, 1 reply; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-21 6:02 UTC (permalink / raw)
To: Igor Mammedov
Cc: Michael S . Tsirkin, Jonathan Cameron, Shiju Jose, qemu-arm,
qemu-devel, Ani Sinha, Dongjiu Geng, linux-kernel
On Mon, 3 Feb 2025 15:34:23 +0100
Igor Mammedov <imammedo@redhat.com> wrote:
> On Fri, 31 Jan 2025 18:42:44 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
>
> > There are two pointers that are needed during error injection:
> >
> > 1. The start address of the CPER block to be stored;
> > 2. The address of the ack.
> >
> > It is preferable to calculate them from the HEST table. This allows
> > checking the source ID, the size of the table and the type of the
> > HEST error block structures.
> >
> > Yet, keep the old code, as this is needed for migration purposes.
> >
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > ---
> > hw/acpi/ghes.c | 132 ++++++++++++++++++++++++++++++++++++-----
> > include/hw/acpi/ghes.h | 1 +
> > 2 files changed, 119 insertions(+), 14 deletions(-)
> >
> > diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> > index 27478f2d5674..8f284fd191a6 100644
> > --- a/hw/acpi/ghes.c
> > +++ b/hw/acpi/ghes.c
> > @@ -41,6 +41,12 @@
> > /* Address offset in Generic Address Structure(GAS) */
> > #define GAS_ADDR_OFFSET 4
> >
> > +/*
> > + * ACPI spec 1.0b
> > + * 5.2.3 System Description Table Header
> > + */
> > +#define ACPI_DESC_HEADER_OFFSET 36
> > +
> > /*
> > * The total size of Generic Error Data Entry
> > * ACPI 6.1/6.2: 18.3.2.7.1 Generic Error Data,
> > @@ -61,6 +67,25 @@
> > */
> > #define ACPI_GHES_GESB_SIZE 20
> >
> > +/*
> > + * Offsets with regards to the start of the HEST table stored at
> > + * ags->hest_addr_le,
>
> If I read this literary, then offsets above are not what
> declared later in this patch.
> I'd really drop this comment altogether as it's confusing,
> and rather get variables/macro naming right
>
> > according with the memory layout map at
> > + * docs/specs/acpi_hest_ghes.rst.
> > + */
>
> what we need is update to above doc, describing new and old ways.
> a separate patch.
I can't see anything that should be changed at
docs/specs/acpi_hest_ghes.rst, as this series doesn't change the
firmware layout: we're still using two firmware tables:

- etc/acpi/tables, with HEST on it;
- etc/hardware_errors, with:
  - error block addresses;
  - read_ack registers;
  - CPER records.

The only changes that this series introduces are related to how
the error generation logic navigates between HEST and the hw_errors
firmware. This is not described at acpi_hest_ghes.rst, and both
ways follow the ACPI specs to the letter.

The only difference is that the code which populates the CPER
record and the error/read ack offsets doesn't need to know how
the HEST table generation placed the offsets, as it basically
reproduces what the OSPM does when handling HEST events.
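
To illustrate the HEST-based navigation described above, here is a minimal
Python sketch (not the code from the series, which is C in hw/acpi/ghes.c).
It walks a HEST blob the way an OSPM would: skip the 36-byte table header,
read the error source count, then step through GHESv2 entries until the
source ID matches. The field offsets follow my reading of the ACPI GHESv2
layout; the function and helper names are hypothetical, and the synthetic
table builders exist only so the walk can be exercised.

```python
import struct

ACPI_DESC_HEADER_SIZE = 36   # ACPI System Description Table Header
GHES_V2_TYPE = 10            # HEST source type: Generic Hardware Error Source v2
GHES_V2_ENTRY_SIZE = 92      # size of one GHESv2 source structure
ERR_STATUS_GAS_OFFSET = 20   # Error Status Address (GAS) within the entry
READ_ACK_GAS_OFFSET = 64     # Read Ack Register (GAS) within the entry
GAS_ADDR_OFFSET = 4          # 64-bit address field within a GAS


def find_ghes_v2_source(hest, source_id):
    """Walk a HEST blob; return (error_status_addr, read_ack_addr)."""
    (count,) = struct.unpack_from("<I", hest, ACPI_DESC_HEADER_SIZE)
    pos = ACPI_DESC_HEADER_SIZE + 4      # first error source structure
    for _ in range(count):
        src_type, src_id = struct.unpack_from("<HH", hest, pos)
        if src_type != GHES_V2_TYPE:
            # Other source types have other sizes; this sketch stops here.
            raise ValueError(f"unsupported HEST source type {src_type}")
        if src_id == source_id:
            err_off = pos + ERR_STATUS_GAS_OFFSET + GAS_ADDR_OFFSET
            ack_off = pos + READ_ACK_GAS_OFFSET + GAS_ADDR_OFFSET
            (err_addr,) = struct.unpack_from("<Q", hest, err_off)
            (ack_addr,) = struct.unpack_from("<Q", hest, ack_off)
            return err_addr, ack_addr
        pos += GHES_V2_ENTRY_SIZE
    raise KeyError(f"source id {source_id} not found in HEST")


def make_entry(source_id, err_addr, ack_addr):
    """Hypothetical helper: build a mostly-zero GHESv2 entry for testing."""
    entry = bytearray(GHES_V2_ENTRY_SIZE)
    struct.pack_into("<HH", entry, 0, GHES_V2_TYPE, source_id)
    struct.pack_into("<Q", entry, ERR_STATUS_GAS_OFFSET + GAS_ADDR_OFFSET,
                     err_addr)
    struct.pack_into("<Q", entry, READ_ACK_GAS_OFFSET + GAS_ADDR_OFFSET,
                     ack_addr)
    return bytes(entry)


def make_hest(entries):
    """Hypothetical helper: header (signature only) + count + entries."""
    header = bytearray(ACPI_DESC_HEADER_SIZE)
    header[0:4] = b"HEST"
    return bytes(header) + struct.pack("<I", len(entries)) + b"".join(entries)
```

Note that the error status address found this way is itself a slot in
etc/hardware_errors holding the address of the CPER block, so the real code
needs one more guest-memory read, which this sketch leaves out.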
Thanks,
Mauro
* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
2025-02-03 15:22 ` Igor Mammedov
@ 2025-02-21 6:38 ` Mauro Carvalho Chehab
2025-02-21 10:21 ` Jonathan Cameron via
0 siblings, 1 reply; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-21 6:38 UTC (permalink / raw)
To: Igor Mammedov
Cc: Jonathan Cameron, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
linux-kernel
On Mon, 3 Feb 2025 16:22:36 +0100
Igor Mammedov <imammedo@redhat.com> wrote:
> On Mon, 3 Feb 2025 11:09:34 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>
> > On Fri, 31 Jan 2025 18:42:41 +0100
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> >
> > > Now that the ghes preparation patches were merged, let's add support
> > > for error injection.
> > >
> > > On this series, the first 6 patches chang to the math used to calculate offsets at HEST
> > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > with both latest QEMU released kernel and upstream, on both directions.
> > >
> > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > to inject ARM Processor Error records.
> > >
> > > If I'm counting well, this is the 19th submission of my error inject patches.
> >
> > Looks good to me. All remaining trivial things are in the category
> > of things to consider only if you are doing another spin. The code
> > ends up how I'd like it at the end of the series anyway, just
> > a question of the precise path to that state!
>
> if you look at series as a whole it's more or less fine (I guess you
> and me got used to it)
>
> however if you take it patch by patch (as if you've never seen it)
> ordering is messed up (the same would apply to everyone after a while
> when it's forgotten)
>
> So I'd strongly suggest to restructure the series (especially 2-6/14).
> re sum up my comments wrt ordering:
>
> 0 add testcase for HEST table with current HEST as expected blob
> (currently missing), so that we can be sure that we haven't messed
> existing tables during refactoring.
Not sure if I got this one. The HEST table is part of etc/acpi/tables,
which is already tested, as you pointed out in the previous reviews.
Changes there are already detected. That's basically why we added patches
10 and 12:

  [PATCH v3 10/14] tests/acpi: virt: allow acpi table changes for a new table: HEST
  [PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT

What the tests don't have is a check for the etc/hardware_errors firmware
inside tests/data/acpi/aarch64/virt/, but, IMO, we shouldn't add it there.

See, the hardware_errors table contains only some skeleton space to
store:

- 1 or more error block address offsets;
- 1 or more read ack registers;
- 1 or more HEST source entries containing CPER blocks.

There's nothing there to actually check: it is just some
empty space with a variable number of fields.

With the new code, the actual number of CPER blocks and their
corresponding offsets and read ack registers can differ between
architectures. So, for instance, when we add x86 support,
we'll likely start with just one error source entry, while arm will
have two after this changeset.

Also, one possibility to address the issues reported by Gavin Shan at
https://lore.kernel.org/qemu-devel/20250214041635.608012-1-gshan@redhat.com/
would be to have one entry per CPU. So, the size of such firmware
could depend on the number of CPUs.

So, adding any validation to it would just cause pain and probably
won't detect any problems.
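
To make the "variable number of fields" point concrete, here is a rough
Python sketch of how the offsets inside etc/hardware_errors move as sources
are added. It assumes 8-byte address and ack slots, 1 KiB reserved per CPER
block, and that all address slots come first, then all ack registers, then
the blocks. Those sizes and that grouping are my assumptions for
illustration, not values taken from this series, and the function name is
hypothetical.

```python
ADDRESS_SIZE = 8     # assumed: one 64-bit slot per source
BLOCK_SIZE = 1024    # assumed: room reserved for each CPER block


def hardware_errors_layout(num_sources, block_size=BLOCK_SIZE):
    """Return (total_size, per-source offsets) for a skeleton blob."""
    if num_sources < 1:
        raise ValueError("at least one error source is needed")
    read_ack_base = num_sources * ADDRESS_SIZE     # after address slots
    blocks_base = 2 * num_sources * ADDRESS_SIZE   # after ack registers
    sources = []
    for i in range(num_sources):
        sources.append({
            "error_block_address": i * ADDRESS_SIZE,
            "read_ack_register": read_ack_base + i * ADDRESS_SIZE,
            "cper_block": blocks_base + i * block_size,
        })
    total_size = blocks_base + num_sources * block_size
    return total_size, sources
```

Under these assumptions, going from one source to two moves every ack
register and every CPER block, so an expected-blob comparison would have to
be regenerated each time a source is added.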
What could be done instead is to have a different type of test that
uses the error injection script to check whether regressions are
introduced after QEMU 10.0. Such a new kind of test would require
this series to be merged first. It would also require an OSPM image
with some testing tools on it. This is easier said than done, as
besides the complexity of having an OSPM test image, such tests
would require extra logic, especially if they check for regressions
in SEA and other notification sources.
Thanks,
Mauro
* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
2025-02-21 6:38 ` Mauro Carvalho Chehab
@ 2025-02-21 10:21 ` Jonathan Cameron via
2025-02-21 12:23 ` Mauro Carvalho Chehab
2025-02-25 10:01 ` Igor Mammedov
0 siblings, 2 replies; 43+ messages in thread
From: Jonathan Cameron via @ 2025-02-21 10:21 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Igor Mammedov, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
linux-kernel, Gavin Shan
On Fri, 21 Feb 2025 07:38:23 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> On Mon, 3 Feb 2025 16:22:36 +0100
> Igor Mammedov <imammedo@redhat.com> wrote:
>
> > On Mon, 3 Feb 2025 11:09:34 +0000
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >
> > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > >
> > > > Now that the ghes preparation patches were merged, let's add support
> > > > for error injection.
> > > >
> > > > On this series, the first 6 patches chang to the math used to calculate offsets at HEST
> > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > with both latest QEMU released kernel and upstream, on both directions.
> > > >
> > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > to inject ARM Processor Error records.
> > > >
> > > > If I'm counting well, this is the 19th submission of my error inject patches.
> > >
> > > Looks good to me. All remaining trivial things are in the category
> > > of things to consider only if you are doing another spin. The code
> > > ends up how I'd like it at the end of the series anyway, just
> > > a question of the precise path to that state!
> >
> > if you look at series as a whole it's more or less fine (I guess you
> > and me got used to it)
> >
> > however if you take it patch by patch (as if you've never seen it)
> > ordering is messed up (the same would apply to everyone after a while
> > when it's forgotten)
> >
> > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > re sum up my comments wrt ordering:
> >
> > 0 add testcase for HEST table with current HEST as expected blob
> > (currently missing), so that we can be sure that we haven't messed
> > existing tables during refactoring.
To potentially save time: I think Igor is asking that, before you do
anything at all, you plug the existing test hole, which is that we don't
test HEST at all. Even after this series I think we don't test HEST. You
add a stub HEST and an exclusion, but then in patch 12 the HEST stub is
deleted, whereas it should be replaced with the example data for the test.

That indeed doesn't address testing the error data storage, which would be
a different problem.
>
> Not sure if I got this one. The HEST table is part of etc/acpi/tables,
> which is already tested, as you pointed at the previous reviews. Doing
> changes there is already detected. That's basically why we added patches
> 10 and 12:
>
> [PATCH v3 10/14] tests/acpi: virt: allow acpi table changes for a new table: HEST
> [PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT
>
> What tests don't have is a check for etc/hardware_errors firmware inside
> tests/data/acpi/aarch64/virt/, but, IMO, we shouldn't add it there.
>
> See, hardware_errors table contains only some skeleton space to
> store:
>
> - 1 or more error block address offsets;
> - 1 or more read ack register;
> - 1 or more HEST source entries containing CPER blocks.
>
> There's nothing there to be actually checked: it is just some
> empty spaces with a variable number of fields.
>
> With the new code, the actual number of CPER blocks and their
> corresponding offsets and read ack registers can be different on
> different architectures. So, for instance, when we add x86 support,
> we'll likely start with just one error source entry, while arm will
> have two after this changeset.
>
> Also, one possibility to address the issues reported by Gavin Shan at
> https://lore.kernel.org/qemu-devel/20250214041635.608012-1-gshan@redhat.com/
> would be to have one entry per each CPU. So, the size of such firmware
> could be dependent on the number of CPUs.
>
> So, adding any validation to it would just cause pain and probably
> won't detect any problems.
If we did do this, the test would use a fixed number of CPUs, so it
would just verify we didn't break a small number of variants. Useful,
but to me a follow-up to this series, not something that needs to
be part of it - particularly as Gavin's work may well change that!
>
> What could be done instead is to have a different type of tests that
> would use the error injection script to check if regressions are
> introduced after QEMU 10.0. Such new kind of test would require
> this series to be merged first. It would also require the usage of
> an OSPM image with some testing tools on it. This is easier said
> than done, as besides the complexity of having an OSPM test image,
> such kind of tests would require extra logic, specially if it would
> check regressions for SEA and other notification sources.
>
Agreed that a more end-to-end test is even better, but those are
quite a bit more complex, so definitely a follow-up.
J
> Thanks,
> Mauro
>
* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
2025-02-21 10:21 ` Jonathan Cameron via
@ 2025-02-21 12:23 ` Mauro Carvalho Chehab
2025-02-21 15:05 ` Mauro Carvalho Chehab
2025-02-25 10:01 ` Igor Mammedov
1 sibling, 1 reply; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-21 12:23 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Igor Mammedov, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
linux-kernel, Gavin Shan
On Fri, 21 Feb 2025 10:21:27 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> On Fri, 21 Feb 2025 07:38:23 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
>
> > On Mon, 3 Feb 2025 16:22:36 +0100
> > Igor Mammedov <imammedo@redhat.com> wrote:
> >
> > > On Mon, 3 Feb 2025 11:09:34 +0000
> > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > >
> > > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > >
> > > > > Now that the ghes preparation patches were merged, let's add support
> > > > > for error injection.
> > > > >
> > > > > On this series, the first 6 patches chang to the math used to calculate offsets at HEST
> > > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > >
> > > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > > to inject ARM Processor Error records.
> > > > >
> > > > > If I'm counting well, this is the 19th submission of my error inject patches.
> > > >
> > > > Looks good to me. All remaining trivial things are in the category
> > > > of things to consider only if you are doing another spin. The code
> > > > ends up how I'd like it at the end of the series anyway, just
> > > > a question of the precise path to that state!
> > >
> > > if you look at series as a whole it's more or less fine (I guess you
> > > and me got used to it)
> > >
> > > however if you take it patch by patch (as if you've never seen it)
> > > ordering is messed up (the same would apply to everyone after a while
> > > when it's forgotten)
> > >
> > > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > > re sum up my comments wrt ordering:
> > >
> > > 0 add testcase for HEST table with current HEST as expected blob
> > > (currently missing), so that we can be sure that we haven't messed
> > > existing tables during refactoring.
>
> To potentially save time I think Igor is asking that before you do anything
> at all you plug the existing test hole which is that we don't test HEST
> at all. Even after this series I think we don't test HEST.
On a previous review (v2, I guess), Igor asked me to do the DSDT
test just before and after the patch which actually changes its
content (patch 11). The HEST table is inside DSDT firmware, and it is
already tested.
> You add
> a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
> it should be replaced with the example data for the test.
This was actually a misinterpretation on my side: patch 10 adds the
etc/hardware_errors table (mistakenly naming it HEST), but this
was never tested. For the next submission, I'll drop the
etc/hardware_errors table from patches 10 and 12.
> That indeed doesn't address testing the error data storage which would be
> a different problem.
> >
> > Not sure if I got this one. The HEST table is part of etc/acpi/tables,
> > which is already tested, as you pointed at the previous reviews. Doing
> > changes there is already detected. That's basically why we added patches
> > 10 and 12:
> >
> > [PATCH v3 10/14] tests/acpi: virt: allow acpi table changes for a new table: HEST
> > [PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT
> >
> > What tests don't have is a check for etc/hardware_errors firmware inside
> > tests/data/acpi/aarch64/virt/, but, IMO, we shouldn't add it there.
> >
> > See, hardware_errors table contains only some skeleton space to
> > store:
> >
> > - 1 or more error block address offsets;
> > - 1 or more read ack register;
> > - 1 or more HEST source entries containing CPER blocks.
> >
> > There's nothing there to be actually checked: it is just some
> > empty spaces with a variable number of fields.
> >
> > With the new code, the actual number of CPER blocks and their
> > corresponding offsets and read ack registers can be different on
> > different architectures. So, for instance, when we add x86 support,
> > we'll likely start with just one error source entry, while arm will
> > have two after this changeset.
> >
> > Also, one possibility to address the issues reported by Gavin Shan at
> > https://lore.kernel.org/qemu-devel/20250214041635.608012-1-gshan@redhat.com/
> > would be to have one entry per each CPU. So, the size of such firmware
> > could be dependent on the number of CPUs.
> >
> > So, adding any validation to it would just cause pain and probably
> > won't detect any problems.
>
> If we did do this the test would use a fixed number of CPUs so
> would just verify we didn't break a small number of variants. Useful
> but to me a follow up to this series not something that needs to
> be part of it - particularly as Gavin's work may well change that!
I don't think that testing etc/hardware_errors would detect any
regressions. It would just create a test scenario that requires
constant changes, as adding any entry to HEST would hit it.

Besides that, I don't think adding support for it would be a simple
matter of adding another table. See, after this series, there are two
different scenarios for etc/hardware_errors:

- one with a single GHESv2 entry, for virt-9.2;
- another one with two GHESv2 entries for virt-10.0 and above, which
  will dynamically change its size (starting from 2) depending on
  the features we add, and on whether we'll have one entry per CPU.

Right now, the tests there are only for "virt-latest": there's no
test directory for "virt-9.2". Adding support for virt-legacy will
very likely require lots of changes to the test infrastructure,
as it will require some virt migration support.
> > What could be done instead is to have a different type of tests that
> > would use the error injection script to check if regressions are
> > introduced after QEMU 10.0. Such new kind of test would require
> > this series to be merged first. It would also require the usage of
> > an OSPM image with some testing tools on it. This is easier said
> > than done, as besides the complexity of having an OSPM test image,
> > such kind of tests would require extra logic, specially if it would
> > check regressions for SEA and other notification sources.
> >
> Agreed that a more end to end test is even better, but those are
> quite a bit more complex so definitely a follow up.
Yes, but it could be simpler than modifying ACPI tests to handle
migration.

The way I see it, this kind of integration could be done by some
gitlab workflow that would run an error injection script inside a
pre-defined image emulating both virt-9.2 and virt-latest, checking
whether the HEST tables were properly generated for both SEA and GED
sources.

This is probably easier for GED, as the QMP interface already
detects that the read ack register was changed by the OSPM. For
SEA, it may require either some additional instrumentation or
capturing OSPM logs.

Anyway, either way, a change like that is IMO outside the scope of
this series, as it will require lots of unrelated changes.
Regards,
Mauro
* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
2025-02-21 12:23 ` Mauro Carvalho Chehab
@ 2025-02-21 15:05 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-21 15:05 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Igor Mammedov, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
linux-kernel, Gavin Shan
On Fri, 21 Feb 2025 13:23:06 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> On Fri, 21 Feb 2025 10:21:27 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>
> > On Fri, 21 Feb 2025 07:38:23 +0100
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> >
> > > On Mon, 3 Feb 2025 16:22:36 +0100
> > > Igor Mammedov <imammedo@redhat.com> wrote:
> > >
> > > > On Mon, 3 Feb 2025 11:09:34 +0000
> > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > > >
> > > > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > > >
> > > > > > Now that the ghes preparation patches were merged, let's add support
> > > > > > for error injection.
> > > > > >
> > > > > > On this series, the first 6 patches chang to the math used to calculate offsets at HEST
> > > > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > > >
> > > > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > > > to inject ARM Processor Error records.
> > > > > >
> > > > > > If I'm counting well, this is the 19th submission of my error inject patches.
> > > > >
> > > > > Looks good to me. All remaining trivial things are in the category
> > > > > of things to consider only if you are doing another spin. The code
> > > > > ends up how I'd like it at the end of the series anyway, just
> > > > > a question of the precise path to that state!
> > > >
> > > > if you look at series as a whole it's more or less fine (I guess you
> > > > and me got used to it)
> > > >
> > > > however if you take it patch by patch (as if you've never seen it)
> > > > ordering is messed up (the same would apply to everyone after a while
> > > > when it's forgotten)
> > > >
> > > > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > > > re sum up my comments wrt ordering:
> > > >
> > > > 0 add testcase for HEST table with current HEST as expected blob
> > > > (currently missing), so that we can be sure that we haven't messed
> > > > existing tables during refactoring.
> >
> > To potentially save time I think Igor is asking that before you do anything
> > at all you plug the existing test hole which is that we don't test HEST
> > at all. Even after this series I think we don't test HEST.
>
> On a previous review (v2, I guess), Igor requested me to do the DSDT
> test just before and after the patch which is actually changing its
> content (patch 11). The HEST table is inside DSDT firmware, and it is
> already tested.
>
> > You add
> > a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
> > it should be replaced with the example data for the test.
>
> This was actually a misinterpretation from my side: patch 10 adds the
> etc/hardware_errors table (mistakenly naming it as HEST), but this
> was never tested. For the next submission, I'll drop etc/hardware_errors
> table from patches 10 and 12.
>
> > That indeed doesn't address testing the error data storage which would be
> > a different problem.
> > >
> > > Not sure if I got this one. The HEST table is part of etc/acpi/tables,
> > > which is already tested, as you pointed at the previous reviews. Doing
> > > changes there is already detected. That's basically why we added patches
> > > 10 and 12:
> > >
> > > [PATCH v3 10/14] tests/acpi: virt: allow acpi table changes for a new table: HEST
> > > [PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT
> > >
> > > What tests don't have is a check for etc/hardware_errors firmware inside
> > > tests/data/acpi/aarch64/virt/, but, IMO, we shouldn't add it there.
> > >
> > > See, hardware_errors table contains only some skeleton space to
> > > store:
> > >
> > > - 1 or more error block address offsets;
> > > - 1 or more read ack register;
> > > - 1 or more HEST source entries containing CPER blocks.
> > >
> > > There's nothing there to be actually checked: it is just some
> > > empty spaces with a variable number of fields.
> > >
> > > With the new code, the actual number of CPER blocks and their
> > > corresponding offsets and read ack registers can be different on
> > > different architectures. So, for instance, when we add x86 support,
> > > we'll likely start with just one error source entry, while arm will
> > > have two after this changeset.
> > >
> > > Also, one possibility to address the issues reported by Gavin Shan at
> > > https://lore.kernel.org/qemu-devel/20250214041635.608012-1-gshan@redhat.com/
> > > would be to have one entry per each CPU. So, the size of such firmware
> > > could be dependent on the number of CPUs.
> > >
> > > So, adding any validation to it would just cause pain and probably
> > > won't detect any problems.
> >
> > If we did do this the test would use a fixed number of CPUs so
> > would just verify we didn't break a small number of variants. Useful
> > but to me a follow up to this series not something that needs to
> > be part of it - particularly as Gavin's work may well change that!
>
> I don't think that testing etc/hardware_errors would detect any
> regressions. It will just create a test scenario that will require
> constant changes, as adding any entry to HEST would hit it.
Btw, there is just one patch on this series touching
etc/hardware_errors:
https://lore.kernel.org/qemu-devel/647f9c974e606924b6b881a83e047d1d4dff47d5.1740148260.git.mchehab+huawei@kernel.org/T/#u
The table change is due to this simple hunk:
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 4f174795ed60..7b6e90d69298 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -896,6 +896,7 @@ static void acpi_align_size(GArray *blob, unsigned align)
static const AcpiNotificationSourceId hest_ghes_notify[] = {
{ ACPI_HEST_SRC_ID_SYNC, ACPI_GHES_NOTIFY_SEA },
+ { ACPI_HEST_SRC_ID_QMP, ACPI_GHES_NOTIFY_GPIO },
};
Before such patch, /etc/hardware_errors has:
- 1 error block offset;
- 1 ack register;
- 1 GHESv2 entry for SEA
After the change:
- for virt-9.2: nothing changes, as hw/arm/virt-acpi-build.c will
use the backward-compatible table with a single entry to be
added to HEST:
static const AcpiNotificationSourceId hest_ghes_notify_9_2[] = {
{ ACPI_HEST_SRC_ID_SYNC, ACPI_GHES_NOTIFY_SEA },
};
- for virt-latest/virt-10.0, it will use the new table to create two
sources:
static const AcpiNotificationSourceId hest_ghes_notify[] = {
{ ACPI_HEST_SRC_ID_SYNC, ACPI_GHES_NOTIFY_SEA },
{ ACPI_HEST_SRC_ID_QMP, ACPI_GHES_NOTIFY_GPIO },
};
which will actually mean that /etc/hardware_errors will now have:
- 2 error block offsets (one for SEA, one for GED);
- 2 ack registers (one for SEA, one for GED);
- 1 GHESv2 entry for SEA notifier;
- 1 GHESv2 entry for GED GPIO notifier.
Following the discussions with Gavin, for virt-10.0 and above we may end up
changing the new table (hest_ghes_notify) to have one SEA entry per CPU, plus
the GPIO one, and adding extra logic to the error injection code to select the
SEA entry based on the CPU ID and/or on whether a SEA notifier has already
been acked.
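
To make the layout above concrete, here is a minimal sketch of where each slot lands inside /etc/hardware_errors for a given number of sources. It is not the QEMU code: the struct, function, and macro names are hypothetical, and it assumes that all error block address slots come first, followed by all read ack registers, followed by one 0x400-byte CPER area per source (the "Error Status Block Length" reported by HEST).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names; the sizes are read off the HEST dumps in this thread. */
#define ERR_ADDR_SLOT_SIZE  8       /* one 64-bit error block address */
#define READ_ACK_SLOT_SIZE  8       /* one 64-bit read ack register */
#define CPER_BLOCK_SIZE     0x400   /* assumed CPER area size per source */

struct hw_err_slots {
    uint64_t err_addr_off;  /* where the error block address is stored */
    uint64_t read_ack_off;  /* where the read ack register is stored */
    uint64_t cper_off;      /* where the CPER block itself is stored */
};

/* Offsets, from the start of the firmware file, for source idx of n_sources. */
static struct hw_err_slots hw_err_slots_for(unsigned n_sources, unsigned idx)
{
    struct hw_err_slots s;

    assert(idx < n_sources);
    s.err_addr_off = (uint64_t)idx * ERR_ADDR_SLOT_SIZE;
    s.read_ack_off = (uint64_t)n_sources * ERR_ADDR_SLOT_SIZE +
                     (uint64_t)idx * READ_ACK_SLOT_SIZE;
    s.cper_off = (uint64_t)n_sources *
                 (ERR_ADDR_SLOT_SIZE + READ_ACK_SLOT_SIZE) +
                 (uint64_t)idx * CPER_BLOCK_SIZE;
    return s;
}

/* Total size of the firmware file under the same assumptions. */
static uint64_t hw_err_total_size(unsigned n_sources)
{
    return (uint64_t)n_sources *
           (ERR_ADDR_SLOT_SIZE + READ_ACK_SLOT_SIZE + CPER_BLOCK_SIZE);
}
```

With one source, the read ack register of source 0 sits at offset 0x8; with two sources it moves to 0x10, which is consistent with the Read Ack Register addresses shown in the HEST dumps posted in this thread. A per-CPU SEA layout would only change n_sources.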
>
> Besides that, I don't think adding support for it would be a simple
> matter of adding another table. See, after this series, there are two
> different scenarios for the /etc/hardware_errors:
>
> - one with a single GHESv2 entry, for virt-9.2;
> - another one with two GHESv2 entries for virt-10.0 and above that
> will dynamically change its size (starting from 2) depending on
> the features we add, and if we'll have one entry per CPU or not.
>
> Right now, the tests there are only for "virt-latest": there's no
> test directory for "virt-9.2". Adding support for virt-legacy will
> very likely require lots of changes there at the test infrastructure,
> as it will require some virt migration support.
>
> > > What could be done instead is to have a different type of tests that
> > > would use the error injection script to check if regressions are
> > > introduced after QEMU 10.0. Such new kind of test would require
> > > this series to be merged first. It would also require the usage of
> > > an OSPM image with some testing tools on it. This is easier said
> > > than done, as besides the complexity of having an OSPM test image,
> > > such kind of tests would require extra logic, specially if it would
> > > check regressions for SEA and other notification sources.
> > >
> > Agreed that a more end to end test is even better, but those are
> > quite a bit more complex so definitely a follow up.
>
> Yes, but it could be simpler than modifying ACPI tests to handle
> migration.
>
> The way I see is that such kind of integration could be done by some
> gitlab workflow that would run an error injection script inside a
> pre-defined image emulating both virt-9.2 and virt-latest and checking
> if the HEST tables were properly generated for both SEA and GED
> sources.
>
> This is probably easier for GED, as the QMP interface already
> detects that the read ack register was changed by the OSPM. For
> SEA, it may require either some additional instrumentation or to
> capture OSPM logs.
>
> Anyway, either way, a change like that is IMO outside the scope of
> this series, as it will require lots of unrelated changes.
>
> Regards,
> Mauro
>
Thanks,
Mauro
* Re: [PATCH v3 03/14] acpi/ghes: Use HEST table offsets when preparing GHES records
2025-02-21 6:02 ` Mauro Carvalho Chehab
@ 2025-02-25 9:43 ` Igor Mammedov
2025-02-26 16:14 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 43+ messages in thread
From: Igor Mammedov @ 2025-02-25 9:43 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Michael S . Tsirkin, Jonathan Cameron, Shiju Jose, qemu-arm,
qemu-devel, Ani Sinha, Dongjiu Geng, linux-kernel
On Fri, 21 Feb 2025 07:02:21 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Em Mon, 3 Feb 2025 15:34:23 +0100
> Igor Mammedov <imammedo@redhat.com> escreveu:
>
> > On Fri, 31 Jan 2025 18:42:44 +0100
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> >
> > > There are two pointers that are needed during error injection:
> > >
> > > 1. The start address of the CPER block to be stored;
> > > 2. The address of the ack.
> > >
> > > It is preferable to calculate them from the HEST table. This allows
> > > checking the source ID, the size of the table and the type of the
> > > HEST error block structures.
> > >
> > > Yet, keep the old code, as this is needed for migration purposes.
> > >
> > > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > > ---
> > > hw/acpi/ghes.c | 132 ++++++++++++++++++++++++++++++++++++-----
> > > include/hw/acpi/ghes.h | 1 +
> > > 2 files changed, 119 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> > > index 27478f2d5674..8f284fd191a6 100644
> > > --- a/hw/acpi/ghes.c
> > > +++ b/hw/acpi/ghes.c
> > > @@ -41,6 +41,12 @@
> > > /* Address offset in Generic Address Structure(GAS) */
> > > #define GAS_ADDR_OFFSET 4
> > >
> > > +/*
> > > + * ACPI spec 1.0b
> > > + * 5.2.3 System Description Table Header
> > > + */
> > > +#define ACPI_DESC_HEADER_OFFSET 36
> > > +
> > > /*
> > > * The total size of Generic Error Data Entry
> > > * ACPI 6.1/6.2: 18.3.2.7.1 Generic Error Data,
> > > @@ -61,6 +67,25 @@
> > > */
> > > #define ACPI_GHES_GESB_SIZE 20
> > >
> > > +/*
> > > + * Offsets with regards to the start of the HEST table stored at
> > > + * ags->hest_addr_le,
> >
> > > If I read this literally, then the offsets above are not what is
> > > declared later in this patch.
> > I'd really drop this comment altogether as it's confusing,
> > and rather get variables/macro naming right
> >
> > > according with the memory layout map at
> > > + * docs/specs/acpi_hest_ghes.rst.
> > > + */
> >
> > what we need is update to above doc, describing new and old ways.
> > a separate patch.
>
> I can't see anything that should be changed at
> docs/specs/acpi_hest_ghes.rst, as this series doesn't change the
> firmware layout: we're still using two firmware tables:
>
> - etc/acpi/tables, with HEST on it;
> - etc/hardware_errors, with:
> - error block addresses;
> - read_ack registers;
> - CPER records.
>
> The only changes that this series introduce are related to how
> the error generation logic navigates between HEST and hw_errors
> firmware. This is not described at acpi_hest_ghes.rst, and both
> ways follow ACPI specs to the letter.
>
> The only difference is that the code which populates the CPER
> record and the error/read offsets doesn't require to know how
> the HEST table generation placed offsets, as it will basically
> reproduce what OSPM firmware does when handling HEST events.
Section 8 describes the old way to get to the address where the CPER is
recorded, so it needs to be amended to also describe the new approach and say
which way is used for which version.
Possibly section 11 might need some massaging as well.
>
> Thanks,
> Mauro
>
* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
2025-02-21 10:21 ` Jonathan Cameron via
2025-02-21 12:23 ` Mauro Carvalho Chehab
@ 2025-02-25 10:01 ` Igor Mammedov
2025-02-26 9:56 ` Mauro Carvalho Chehab
1 sibling, 1 reply; 43+ messages in thread
From: Igor Mammedov @ 2025-02-25 10:01 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Mauro Carvalho Chehab, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
linux-kernel, Gavin Shan, Ani Sinha
On Fri, 21 Feb 2025 10:21:27 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> On Fri, 21 Feb 2025 07:38:23 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
>
> > Em Mon, 3 Feb 2025 16:22:36 +0100
> > Igor Mammedov <imammedo@redhat.com> escreveu:
> >
> > > On Mon, 3 Feb 2025 11:09:34 +0000
> > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > >
> > > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > >
> > > > > Now that the ghes preparation patches were merged, let's add support
> > > > > for error injection.
> > > > >
> > > > > > On this series, the first 6 patches change the math used to calculate offsets at HEST
> > > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > >
> > > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > > to inject ARM Processor Error records.
> > > > >
> > > > > If I'm counting well, this is the 19th submission of my error inject patches.
> > > >
> > > > Looks good to me. All remaining trivial things are in the category
> > > > of things to consider only if you are doing another spin. The code
> > > > ends up how I'd like it at the end of the series anyway, just
> > > > a question of the precise path to that state!
> > >
> > > if you look at series as a whole it's more or less fine (I guess you
> > > and me got used to it)
> > >
> > > however if you take it patch by patch (as if you've never seen it)
> > > ordering is messed up (the same would apply to everyone after a while
> > > when it's forgotten)
> > >
> > > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > > re sum up my comments wrt ordering:
> > >
> > > 0 add testcase for HEST table with current HEST as expected blob
> > > (currently missing), so that we can be sure that we haven't messed
> > > existing tables during refactoring.
>
> To potentially save time I think Igor is asking that before you do anything
> at all you plug the existing test hole which is that we don't test HEST
> at all. Even after this series I think we don't test HEST. You add
> a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
> it should be replaced with the example data for the test.
that's what I was saying.
HEST table should be in DSDT, but it's optional and one has to use
'ras=on' option to enable that, which we aren't doing ATM.
So whatever changes are happening we aren't seeing them in tests
nor will we see any regression for the same reason.
While whitelisting the tables before the change and then updating them
is the right thing to do, it's not sufficient, since none of the tests
run with 'ras' enabled, hence the code is not actually executed.
>
> That indeed doesn't address testing the error data storage which would be
> a different problem.
I'd skip hardware_errors/CPER testing in QEMU unit tests.
Testing that basically requires a functioning 'APEI driver'.
Maybe we can use Ani's framework to parse HEST and all the way
towards CPER record(s) traversal, but that's certainly out of
scope of this series.
It could be done on top, but I won't insist even on that
since Mauro's out of tree error injection testing will
cover that using actual guest (which I assume he would
like to run periodically).
> >
> > Not sure if I got this one. The HEST table is part of etc/acpi/tables,
> > which is already tested, as you pointed at the previous reviews. Doing
> > changes there is already detected. That's basically why we added patches
> > 10 and 12:
> >
> > [PATCH v3 10/14] tests/acpi: virt: allow acpi table changes for a new table: HEST
> > [PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT
> >
> > What tests don't have is a check for etc/hardware_errors firmware inside
> > tests/data/acpi/aarch64/virt/, but, IMO, we shouldn't add it there.
> >
> > See, hardware_errors table contains only some skeleton space to
> > store:
> >
> > - 1 or more error block address offsets;
> > - 1 or more read ack register;
> > - 1 or more HEST source entries containing CPER blocks.
> >
> > There's nothing there to be actually checked: it is just some
> > empty spaces with a variable number of fields.
> >
> > With the new code, the actual number of CPER blocks and their
> > corresponding offsets and read ack registers can be different on
> > different architectures. So, for instance, when we add x86 support,
> > we'll likely start with just one error source entry, while arm will
> > have two after this changeset.
> >
> > Also, one possibility to address the issues reported by Gavin Shan at
> > https://lore.kernel.org/qemu-devel/20250214041635.608012-1-gshan@redhat.com/
> > would be to have one entry per each CPU. So, the size of such firmware
> > could be dependent on the number of CPUs.
> >
> > So, adding any validation to it would just cause pain and probably
> > won't detect any problems.
>
> If we did do this the test would use a fixed number of CPUs so
> would just verify we didn't break a small number of variants. Useful
> but to me a follow up to this series not something that needs to
> be part of it - particularly as Gavin's work may well change that!
>
> >
> > What could be done instead is to have a different type of tests that
> > would use the error injection script to check if regressions are
> > introduced after QEMU 10.0. Such new kind of test would require
> > this series to be merged first. It would also require the usage of
> > an OSPM image with some testing tools on it. This is easier said
> > than done, as besides the complexity of having an OSPM test image,
> > such kind of tests would require extra logic, specially if it would
> > check regressions for SEA and other notification sources.
> >
> Agreed that a more end to end test is even better, but those are
> quite a bit more complex so definitely a follow up.
>
> J
> > Thanks,
> > Mauro
> >
>
* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
2025-02-25 10:01 ` Igor Mammedov
@ 2025-02-26 9:56 ` Mauro Carvalho Chehab
2025-02-26 11:23 ` Mauro Carvalho Chehab
2025-02-26 12:29 ` Igor Mammedov
0 siblings, 2 replies; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-26 9:56 UTC (permalink / raw)
To: Igor Mammedov
Cc: Jonathan Cameron, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
linux-kernel, Gavin Shan
Em Tue, 25 Feb 2025 11:01:15 +0100
Igor Mammedov <imammedo@redhat.com> escreveu:
> On Fri, 21 Feb 2025 10:21:27 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>
> > On Fri, 21 Feb 2025 07:38:23 +0100
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> >
> > > Em Mon, 3 Feb 2025 16:22:36 +0100
> > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > >
> > > > On Mon, 3 Feb 2025 11:09:34 +0000
> > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > > >
> > > > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > > >
> > > > > > Now that the ghes preparation patches were merged, let's add support
> > > > > > for error injection.
> > > > > >
> > > > > > > On this series, the first 6 patches change the math used to calculate offsets at HEST
> > > > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > > >
> > > > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > > > to inject ARM Processor Error records.
> > > > > >
> > > > > > If I'm counting well, this is the 19th submission of my error inject patches.
> > > > >
> > > > > Looks good to me. All remaining trivial things are in the category
> > > > > of things to consider only if you are doing another spin. The code
> > > > > ends up how I'd like it at the end of the series anyway, just
> > > > > a question of the precise path to that state!
> > > >
> > > > if you look at series as a whole it's more or less fine (I guess you
> > > > and me got used to it)
> > > >
> > > > however if you take it patch by patch (as if you've never seen it)
> > > > ordering is messed up (the same would apply to everyone after a while
> > > > when it's forgotten)
> > > >
> > > > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > > > re sum up my comments wrt ordering:
> > > >
> > > > 0 add testcase for HEST table with current HEST as expected blob
> > > > (currently missing), so that we can be sure that we haven't messed
> > > > existing tables during refactoring.
> >
> > To potentially save time I think Igor is asking that before you do anything
> > at all you plug the existing test hole which is that we don't test HEST
> > at all. Even after this series I think we don't test HEST. You add
> > a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
> > it should be replaced with the example data for the test.
>
> that's what I was saying.
> HEST table should be in DSDT, but it's optional and one has to use
> 'ras=on' option to enable that, which we aren't doing ATM.
> So whatever changes are happening we aren't seeing them in tests
> nor will we see any regression for the same reason.
>
> While white listing tables before change should happen and then updating them
> is the right thing to do, it's not sufficient since none of tests
> run with 'ras' enabled, hence code is not actually executed.
Ok. Well, again, we're not modifying the HEST table structure in this
changeset. The only change affecting HEST is that the number of entries
increases from 1 to 2.
Now, looking at bios-tables-test.c, if I got it right, I should be doing
something similar to the enclosed patch, right?
If so, I have a couple of questions:
1. from where should I get the HEST table? dumping the table from the
running VM?
2. what values should I use to fill those variables:
int hest_offset = 40 /* HEST */;
int hest_entry_size = 4;
>
> >
> > That indeed doesn't address testing the error data storage which would be
> > a different problem.
>
> I'd skip hardware_errors/CPER testing in QEMU unit tests.
> Testing that basically requires a functioning 'APEI driver'.
>
> Maybe we can use Ani's framework to parse HEST and all the way
> towards CPER record(s) traversal, but that's certainly out of
> scope of this series.
> It could be done on top, but I won't insist even on that
> since Mauro's out of tree error injection testing will
> cover that using actual guest (which I assume he would
> like to run periodically).
Yeah, my plan is to test it periodically. I intend to set up a CI somewhere
to test the kernel, QEMU and rasdaemon all together.
Thanks,
Mauro
---
diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
index 0a333ec43536..31e69d906db4 100644
--- a/tests/qtest/bios-tables-test.c
+++ b/tests/qtest/bios-tables-test.c
@@ -210,6 +210,8 @@ static void test_acpi_fadt_table(test_data *data)
uint32_t val;
int dsdt_offset = 40 /* DSDT */;
int dsdt_entry_size = 4;
+ int hest_offset = 40 /* HEST */;
+ int hest_entry_size = 4;
g_assert(compare_signature(&table, "FACP"));
@@ -242,6 +244,12 @@ static void test_acpi_fadt_table(test_data *data)
/* update checksum */
fadt_aml[9 /* Checksum */] = 0;
fadt_aml[9 /* Checksum */] -= acpi_calc_checksum(fadt_aml, fadt_len);
+
+
+
+ acpi_fetch_table(data->qts, &table.aml, &table.aml_len,
+ fadt_aml + hest_offset, hest_entry_size, "HEST", true);
+ g_array_append_val(data->tables, table);
}
static void dump_aml_files(test_data *data, bool rebuild)
@@ -2411,7 +2419,7 @@ static void test_acpi_aarch64_virt_oem_fields(void)
};
char *args;
- args = test_acpi_create_args(&data, "-cpu cortex-a57 "OEM_TEST_ARGS);
+ args = test_acpi_create_args(&data, "-ras on -cpu cortex-a57 "OEM_TEST_ARGS);
data.qts = qtest_init(args);
test_acpi_load_tables(&data);
test_oem_fields(&data);
* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
2025-02-26 9:56 ` Mauro Carvalho Chehab
@ 2025-02-26 11:23 ` Mauro Carvalho Chehab
2025-02-26 11:31 ` Mauro Carvalho Chehab
2025-02-26 12:29 ` Igor Mammedov
1 sibling, 1 reply; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-26 11:23 UTC (permalink / raw)
To: Igor Mammedov
Cc: Jonathan Cameron, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
linux-kernel, Gavin Shan
Em Wed, 26 Feb 2025 10:56:28 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> escreveu:
> Em Tue, 25 Feb 2025 11:01:15 +0100
> Igor Mammedov <imammedo@redhat.com> escreveu:
>
> > On Fri, 21 Feb 2025 10:21:27 +0000
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >
> > > On Fri, 21 Feb 2025 07:38:23 +0100
> > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > >
> > > > Em Mon, 3 Feb 2025 16:22:36 +0100
> > > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > > >
> > > > > On Mon, 3 Feb 2025 11:09:34 +0000
> > > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > > > >
> > > > > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > > > >
> > > > > > > Now that the ghes preparation patches were merged, let's add support
> > > > > > > for error injection.
> > > > > > >
> > > > > > > > On this series, the first 6 patches change the math used to calculate offsets at HEST
> > > > > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > > > >
> > > > > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > > > > to inject ARM Processor Error records.
> > > > > > >
> > > > > > > If I'm counting well, this is the 19th submission of my error inject patches.
> > > > > >
> > > > > > Looks good to me. All remaining trivial things are in the category
> > > > > > of things to consider only if you are doing another spin. The code
> > > > > > ends up how I'd like it at the end of the series anyway, just
> > > > > > a question of the precise path to that state!
> > > > >
> > > > > if you look at series as a whole it's more or less fine (I guess you
> > > > > and me got used to it)
> > > > >
> > > > > however if you take it patch by patch (as if you've never seen it)
> > > > > ordering is messed up (the same would apply to everyone after a while
> > > > > when it's forgotten)
> > > > >
> > > > > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > > > > re sum up my comments wrt ordering:
> > > > >
> > > > > 0 add testcase for HEST table with current HEST as expected blob
> > > > > (currently missing), so that we can be sure that we haven't messed
> > > > > existing tables during refactoring.
> > >
> > > To potentially save time I think Igor is asking that before you do anything
> > > at all you plug the existing test hole which is that we don't test HEST
> > > at all. Even after this series I think we don't test HEST. You add
> > > a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
> > > it should be replaced with the example data for the test.
> >
> > that's what I was saying.
> > HEST table should be in DSDT, but it's optional and one has to use
> > 'ras=on' option to enable that, which we aren't doing ATM.
> > So whatever changes are happening we aren't seeing them in tests
> > nor will we see any regression for the same reason.
> >
> > While white listing tables before change should happen and then updating them
> > is the right thing to do, it's not sufficient since none of tests
> > run with 'ras' enabled, hence code is not actually executed.
>
> Ok. Well, again we're not modifying HEST table structure on this
> changeset. The only change affecting HEST is when the number of entries
> increased from 1 to 2.
>
> Now, looking at bios-tables-test.c, if I got it right, I should be doing
> something similar to the enclosed patch, right?
>
> If so, I have a couple of questions:
>
> 1. from where should I get the HEST table? dumping the table from the
> running VM?
>
> 2. what values should I use to fill those variables:
>
> int hest_offset = 40 /* HEST */;
> int hest_entry_size = 4;
Thanks,
Mauro
As a reference, this is the HEST table before the patch series:
/*
* Intel ACPI Component Architecture
* AML/ASL+ Disassembler version 20240927 (64-bit version)
* Copyright (c) 2000 - 2023 Intel Corporation
*
* Disassembly of hest.dat
*
* ACPI Data Table [HEST]
*
* Format: [HexOffset DecimalOffset ByteLength] FieldName : FieldValue (in hex)
*/
[000h 0000 004h] Signature : "HEST" [Hardware Error Source Table]
[004h 0004 004h] Table Length : 00000084
[008h 0008 001h] Revision : 01
[009h 0009 001h] Checksum : E0
[00Ah 0010 006h] Oem ID : "BOCHS "
[010h 0016 008h] Oem Table ID : "BXPC "
[018h 0024 004h] Oem Revision : 00000001
[01Ch 0028 004h] Asl Compiler ID : "BXPC"
[020h 0032 004h] Asl Compiler Revision : 00000001
[024h 0036 004h] Error Source Count : 00000001
[028h 0040 002h] Subtable Type : 000A [Generic Hardware Error Source V2]
[02Ah 0042 002h] Source Id : 0000
[02Ch 0044 002h] Related Source Id : FFFF
[02Eh 0046 001h] Reserved : 00
[02Fh 0047 001h] Enabled : 01
[030h 0048 004h] Records To Preallocate : 00000001
[034h 0052 004h] Max Sections Per Record : 00000001
[038h 0056 004h] Max Raw Data Length : 00000400
[03Ch 0060 00Ch] Error Status Address : [Generic Address Structure]
[03Ch 0060 001h] Space ID : 00 [SystemMemory]
[03Dh 0061 001h] Bit Width : 40
[03Eh 0062 001h] Bit Offset : 00
[03Fh 0063 001h] Encoded Access Width : 04 [QWord Access:64]
[040h 0064 008h] Address : 0000000139E40000
[048h 0072 01Ch] Notify : [Hardware Error Notification Structure]
[048h 0072 001h] Notify Type : 08 [SEA]
[049h 0073 001h] Notify Length : 1C
[04Ah 0074 002h] Configuration Write Enable : 0000
[04Ch 0076 004h] PollInterval : 00000000
[050h 0080 004h] Vector : 00000000
[054h 0084 004h] Polling Threshold Value : 00000000
[058h 0088 004h] Polling Threshold Window : 00000000
[05Ch 0092 004h] Error Threshold Value : 00000000
[060h 0096 004h] Error Threshold Window : 00000000
[064h 0100 004h] Error Status Block Length : 00000400
[068h 0104 00Ch] Read Ack Register : [Generic Address Structure]
[068h 0104 001h] Space ID : 00 [SystemMemory]
[069h 0105 001h] Bit Width : 40
[06Ah 0106 001h] Bit Offset : 00
[06Bh 0107 001h] Encoded Access Width : 04 [QWord Access:64]
[06Ch 0108 008h] Address : 0000000139E40008
[074h 0116 008h] Read Ack Preserve : FFFFFFFFFFFFFFFE
[07Ch 0124 008h] Read Ack Write : 0000000000000001
Raw Table Data: Length 132 (0x84)
0000: 48 45 53 54 84 00 00 00 01 E0 42 4F 43 48 53 20 // HEST......BOCHS
0010: 42 58 50 43 20 20 20 20 01 00 00 00 42 58 50 43 // BXPC ....BXPC
0020: 01 00 00 00 01 00 00 00 0A 00 00 00 FF FF 00 01 // ................
0030: 01 00 00 00 01 00 00 00 00 04 00 00 00 40 00 04 // .............@..
0040: 00 00 E4 39 01 00 00 00 08 1C 00 00 00 00 00 00 // ...9............
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 // ................
0060: 00 00 00 00 00 04 00 00 00 40 00 04 08 00 E4 39 // .........@.....9
0070: 01 00 00 00 FE FF FF FF FF FF FF FF 01 00 00 00 // ................
0080: 00 00 00 00
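
As an illustration of how the two addresses needed for error injection can be read back from a table like the one above, here is a minimal, self-contained sketch that walks the raw bytes of this dump. It is not the code from the series (which reads guest memory rather than a local array); the macro and function names are hypothetical, the offsets are taken from the disassembly above, and it only handles tables made exclusively of GHESv2 entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Raw HEST bytes copied from the dump above (132 bytes). */
static const uint8_t hest_before[] = {
    0x48, 0x45, 0x53, 0x54, 0x84, 0x00, 0x00, 0x00, 0x01, 0xE0, 0x42, 0x4F, 0x43, 0x48, 0x53, 0x20,
    0x42, 0x58, 0x50, 0x43, 0x20, 0x20, 0x20, 0x20, 0x01, 0x00, 0x00, 0x00, 0x42, 0x58, 0x50, 0x43,
    0x01, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x0A, 0x00, 0x00, 0x00, 0xFF, 0xFF, 0x00, 0x01,
    0x01, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0x40, 0x00, 0x04,
    0x00, 0x00, 0xE4, 0x39, 0x01, 0x00, 0x00, 0x00, 0x08, 0x1C, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0x40, 0x00, 0x04, 0x08, 0x00, 0xE4, 0x39,
    0x01, 0x00, 0x00, 0x00, 0xFE, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x01, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
};

/* Offsets read off the disassembly above; the names are illustrative. */
#define HEST_SRC_COUNT_OFF   36  /* Error Source Count */
#define HEST_FIRST_SRC_OFF   40  /* first error source structure */
#define GHESV2_TYPE          10  /* Generic Hardware Error Source V2 */
#define GHESV2_ENTRY_SIZE    92  /* 0x84 - 0x28 */
#define GHESV2_SRC_ID_OFF     2
#define GHESV2_ERR_ADDR_OFF  24  /* Error Status Address GAS (+20) + 4 */
#define GHESV2_ACK_ADDR_OFF  68  /* Read Ack Register GAS (+64) + 4 */

/* Read a little-endian integer of len bytes. */
static uint64_t rd_le(const uint8_t *p, unsigned len)
{
    uint64_t v = 0;

    while (len--) {
        v = (v << 8) | p[len];
    }
    return v;
}

/* An ACPI table is valid when all of its bytes sum to 0 modulo 256. */
static int acpi_checksum_ok(const uint8_t *tbl, size_t len)
{
    uint8_t sum = 0;

    for (size_t i = 0; i < len; i++) {
        sum += tbl[i];
    }
    return sum == 0;
}

/* Find a GHESv2 source by ID; returns 0 and fills both addresses on success. */
static int hest_find_source(const uint8_t *hest, size_t len, uint16_t source_id,
                            uint64_t *err_addr, uint64_t *ack_addr)
{
    if (len < HEST_FIRST_SRC_OFF) {
        return -1;
    }
    uint32_t count = (uint32_t)rd_le(hest + HEST_SRC_COUNT_OFF, 4);
    size_t off = HEST_FIRST_SRC_OFF;

    for (uint32_t i = 0; i < count; i++, off += GHESV2_ENTRY_SIZE) {
        if (off + GHESV2_ENTRY_SIZE > len) {
            return -1;              /* truncated table */
        }
        const uint8_t *e = hest + off;
        if (rd_le(e, 2) != GHESV2_TYPE) {
            return -1;              /* other source types have other sizes */
        }
        if (rd_le(e + GHESV2_SRC_ID_OFF, 2) == source_id) {
            *err_addr = rd_le(e + GHESV2_ERR_ADDR_OFF, 8);
            *ack_addr = rd_le(e + GHESV2_ACK_ADDR_OFF, 8);
            return 0;
        }
    }
    return -1;                      /* source ID not present */
}
```

For source ID 0 this yields the Error Status Address and Read Ack Register address shown in the disassembly; a lookup for source ID 1 fails on this table, since it only has one entry.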
* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
2025-02-26 11:23 ` Mauro Carvalho Chehab
@ 2025-02-26 11:31 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-26 11:31 UTC (permalink / raw)
To: Igor Mammedov
Cc: Jonathan Cameron, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
linux-kernel, Gavin Shan
Em Wed, 26 Feb 2025 12:23:03 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> escreveu:
> Em Wed, 26 Feb 2025 10:56:28 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> escreveu:
>
> > Em Tue, 25 Feb 2025 11:01:15 +0100
> > Igor Mammedov <imammedo@redhat.com> escreveu:
> >
> > > On Fri, 21 Feb 2025 10:21:27 +0000
> > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > >
> > > > On Fri, 21 Feb 2025 07:38:23 +0100
> > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > >
> > > > > Em Mon, 3 Feb 2025 16:22:36 +0100
> > > > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > > > >
> > > > > > On Mon, 3 Feb 2025 11:09:34 +0000
> > > > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > > > > >
> > > > > > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > > > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > > > > >
> > > > > > > > Now that the ghes preparation patches were merged, let's add support
> > > > > > > > for error injection.
> > > > > > > >
> > > > > > > > > On this series, the first 6 patches change the math used to calculate offsets at HEST
> > > > > > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > > > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > > > > >
> > > > > > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > > > > > to inject ARM Processor Error records.
> > > > > > > >
> > > > > > > > If I'm counting well, this is the 19th submission of my error inject patches.
> > > > > > >
> > > > > > > Looks good to me. All remaining trivial things are in the category
> > > > > > > of things to consider only if you are doing another spin. The code
> > > > > > > ends up how I'd like it at the end of the series anyway, just
> > > > > > > a question of the precise path to that state!
> > > > > >
> > > > > > if you look at series as a whole it's more or less fine (I guess you
> > > > > > and me got used to it)
> > > > > >
> > > > > > however if you take it patch by patch (as if you've never seen it)
> > > > > > ordering is messed up (the same would apply to everyone after a while
> > > > > > when it's forgotten)
> > > > > >
> > > > > > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > > > > > re sum up my comments wrt ordering:
> > > > > >
> > > > > > 0 add testcase for HEST table with current HEST as expected blob
> > > > > > (currently missing), so that we can be sure that we haven't messed
> > > > > > existing tables during refactoring.
> > > >
> > > > To potentially save time I think Igor is asking that before you do anything
> > > > at all you plug the existing test hole which is that we don't test HEST
> > > > at all. Even after this series I think we don't test HEST. You add
> > > > a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
> > > > it should be replaced with the example data for the test.
> > >
> > > that's what I was saying.
> > > HEST table should be in DSDT, but it's optional and one has to use
> > > 'ras=on' option to enable that, which we aren't doing ATM.
> > > So whatever changes are happening we aren't seeing them in tests
> > > nor will we see any regression for the same reason.
> > >
> > > While white listing tables before change should happen and then updating them
> > > is the right thing to do, it's not sufficient since none of tests
> > > run with 'ras' enabled, hence code is not actually executed.
> >
> > Ok. Well, again we're not modifying HEST table structure on this
> > changeset. The only change affecting HEST is when the number of entries
> > increased from 1 to 2.
> >
> > Now, looking at bios-tables-test.c, if I got it right, I should be doing
> > something similar to the enclosed patch, right?
> >
> > If so, I have a couple of questions:
> >
> > 1. from where should I get the HEST table? dumping the table from the
> > running VM?
> >
> > 2. what values should I use to fill those variables:
> >
> > int hest_offset = 40 /* HEST */;
> > int hest_entry_size = 4;
>
> Thanks,
> Mauro
>
> As a reference, this is the HEST table before the patch series:
This is the diff of the HEST table before/after this series.
As already commented, the diff is basically:
-[024h 0036 004h] Error Source Count : 00000001
+[024h 0036 004h] Error Source Count : 00000002
Plus the new entry for source ID 1 using notify type 7 (GPIO):
+[084h 0132 002h] Subtable Type : 000A [Generic Hardware Error Source V2]
+[086h 0134 002h] Source Id : 0001
+[088h 0136 002h] Related Source Id : FFFF
...
+[0A4h 0164 001h] Notify Type : 07 [GPIO]
...
+[0D0h 0208 008h] Read Ack Preserve : FFFFFFFFFFFFFFFE
+[0D8h 0216 008h] Read Ack Write : 0000000000000001
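The new Table Length in the full diff (0x84 becoming 0xE0) follows from the same change. A minimal sketch of the arithmetic, assuming every error source is a GHESv2 entry of 92 bytes (0x84 - 0x28 in the dump) placed right after the 36-byte ACPI header and the 4-byte Error Source Count. The names below are illustrative only and are not QEMU identifiers:

```c
#include <assert.h>
#include <stdint.h>

/* Sizes read off the iasl dump in this message, not taken from QEMU headers. */
enum {
    HEST_HEADER_SIZE  = 36, /* ACPI table header; Error Source Count is at 0x24 */
    HEST_COUNT_SIZE   = 4,  /* Error Source Count field */
    GHESV2_ENTRY_SIZE = 92, /* first entry starts at 0x28, second at 0x84 */
};

/* Offset of the n-th (0-based) GHESv2 entry from the start of HEST. */
static uint32_t hest_entry_offset(uint32_t n)
{
    return HEST_HEADER_SIZE + HEST_COUNT_SIZE + n * GHESV2_ENTRY_SIZE;
}

/* Total HEST length when the table holds only GHESv2 entries. */
static uint32_t hest_table_length(uint32_t num_sources)
{
    return hest_entry_offset(num_sources);
}
```

With one source this gives 0x84, and with two it gives 0xE0, which are the Table Length values before and after the series.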
Complete diff follows.
Regards,
Mauro
---
diff -u hest-before-changes.dsl hest-after-changes.dsl
--- hest-before-changes.dsl 2025-02-26 11:23:30.845089077 +0000
+++ hest-after-changes.dsl 2025-02-26 11:25:29.095066026 +0000
@@ -11,16 +11,16 @@
*/
[000h 0000 004h] Signature : "HEST" [Hardware Error Source Table]
-[004h 0004 004h] Table Length : 00000084
+[004h 0004 004h] Table Length : 000000E0
[008h 0008 001h] Revision : 01
-[009h 0009 001h] Checksum : E0
+[009h 0009 001h] Checksum : 68
[00Ah 0010 006h] Oem ID : "BOCHS "
[010h 0016 008h] Oem Table ID : "BXPC "
[018h 0024 004h] Oem Revision : 00000001
[01Ch 0028 004h] Asl Compiler ID : "BXPC"
[020h 0032 004h] Asl Compiler Revision : 00000001
-[024h 0036 004h] Error Source Count : 00000001
+[024h 0036 004h] Error Source Count : 00000002
[028h 0040 002h] Subtable Type : 000A [Generic Hardware Error Source V2]
[02Ah 0042 002h] Source Id : 0000
@@ -55,19 +55,62 @@
[069h 0105 001h] Bit Width : 40
[06Ah 0106 001h] Bit Offset : 00
[06Bh 0107 001h] Encoded Access Width : 04 [QWord Access:64]
-[06Ch 0108 008h] Address : 0000000139E40008
+[06Ch 0108 008h] Address : 0000000139E40010
[074h 0116 008h] Read Ack Preserve : FFFFFFFFFFFFFFFE
[07Ch 0124 008h] Read Ack Write : 0000000000000001
-Raw Table Data: Length 132 (0x84)
+[084h 0132 002h] Subtable Type : 000A [Generic Hardware Error Source V2]
+[086h 0134 002h] Source Id : 0001
+[088h 0136 002h] Related Source Id : FFFF
+[08Ah 0138 001h] Reserved : 00
+[08Bh 0139 001h] Enabled : 01
+[08Ch 0140 004h] Records To Preallocate : 00000001
+[090h 0144 004h] Max Sections Per Record : 00000001
+[094h 0148 004h] Max Raw Data Length : 00000400
+
+[098h 0152 00Ch] Error Status Address : [Generic Address Structure]
+[098h 0152 001h] Space ID : 00 [SystemMemory]
+[099h 0153 001h] Bit Width : 40
+[09Ah 0154 001h] Bit Offset : 00
+[09Bh 0155 001h] Encoded Access Width : 04 [QWord Access:64]
+[09Ch 0156 008h] Address : 0000000139E40008
+
+[0A4h 0164 01Ch] Notify : [Hardware Error Notification Structure]
+[0A4h 0164 001h] Notify Type : 07 [GPIO]
+[0A5h 0165 001h] Notify Length : 1C
+[0A6h 0166 002h] Configuration Write Enable : 0000
+[0A8h 0168 004h] PollInterval : 00000000
+[0ACh 0172 004h] Vector : 00000000
+[0B0h 0176 004h] Polling Threshold Value : 00000000
+[0B4h 0180 004h] Polling Threshold Window : 00000000
+[0B8h 0184 004h] Error Threshold Value : 00000000
+[0BCh 0188 004h] Error Threshold Window : 00000000
+
+[0C0h 0192 004h] Error Status Block Length : 00000400
+[0C4h 0196 00Ch] Read Ack Register : [Generic Address Structure]
+[0C4h 0196 001h] Space ID : 00 [SystemMemory]
+[0C5h 0197 001h] Bit Width : 40
+[0C6h 0198 001h] Bit Offset : 00
+[0C7h 0199 001h] Encoded Access Width : 04 [QWord Access:64]
+[0C8h 0200 008h] Address : 0000000139E40018
- 0000: 48 45 53 54 84 00 00 00 01 E0 42 4F 43 48 53 20 // HEST......BOCHS
+[0D0h 0208 008h] Read Ack Preserve : FFFFFFFFFFFFFFFE
+[0D8h 0216 008h] Read Ack Write : 0000000000000001
+
+Raw Table Data: Length 224 (0xE0)
+
+ 0000: 48 45 53 54 E0 00 00 00 01 68 42 4F 43 48 53 20 // HEST.....hBOCHS
0010: 42 58 50 43 20 20 20 20 01 00 00 00 42 58 50 43 // BXPC ....BXPC
- 0020: 01 00 00 00 01 00 00 00 0A 00 00 00 FF FF 00 01 // ................
+ 0020: 01 00 00 00 02 00 00 00 0A 00 00 00 FF FF 00 01 // ................
0030: 01 00 00 00 01 00 00 00 00 04 00 00 00 40 00 04 // .............@..
0040: 00 00 E4 39 01 00 00 00 08 1C 00 00 00 00 00 00 // ...9............
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 // ................
- 0060: 00 00 00 00 00 04 00 00 00 40 00 04 08 00 E4 39 // .........@.....9
+ 0060: 00 00 00 00 00 04 00 00 00 40 00 04 10 00 E4 39 // .........@.....9
0070: 01 00 00 00 FE FF FF FF FF FF FF FF 01 00 00 00 // ................
- 0080: 00 00 00 00 // ....
+ 0080: 00 00 00 00 0A 00 01 00 FF FF 00 01 01 00 00 00 // ................
+ 0090: 01 00 00 00 00 04 00 00 00 40 00 04 08 00 E4 39 // .........@.....9
+ 00A0: 01 00 00 00 07 1C 00 00 00 00 00 00 00 00 00 00 // ................
+ 00B0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 // ................
+ 00C0: 00 04 00 00 00 40 00 04 18 00 E4 39 01 00 00 00 // .....@.....9....
+ 00D0: FE FF FF FF FF FF FF FF 01 00 00 00 00 00 00 00 // ................
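The Checksum byte changes (E0 to 68) only because other bytes of the table changed: an ACPI table checksum is chosen so that all bytes of the table sum to zero modulo 256. A small sketch that checks this against the pre-series bytes from the "-" side of the dump above. The helper names are illustrative and are not the ones used by bios-tables-test:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ACPI_CHECKSUM_OFFSET 9 /* "[009h 0009 001h] Checksum" in the dump */

/* Sum of all table bytes modulo 256; 0 for a table with a valid checksum. */
static uint8_t acpi_byte_sum(const uint8_t *table, size_t len)
{
    uint8_t sum = 0;

    for (size_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + table[i]);
    }
    return sum;
}

/* Value the Checksum byte must hold so that the whole table sums to 0. */
static uint8_t acpi_expected_checksum(const uint8_t *table, size_t len)
{
    assert(len > ACPI_CHECKSUM_OFFSET);
    /* Take the current checksum byte out of the sum, then negate. */
    uint8_t others = (uint8_t)(acpi_byte_sum(table, len) -
                               table[ACPI_CHECKSUM_OFFSET]);
    return (uint8_t)(0u - others);
}

/* Raw bytes of the HEST before this series (132 bytes, from the dump). */
static const uint8_t hest_before[0x84] = {
    0x48, 0x45, 0x53, 0x54, 0x84, 0x00, 0x00, 0x00, 0x01, 0xE0, 0x42, 0x4F, 0x43, 0x48, 0x53, 0x20,
    0x42, 0x58, 0x50, 0x43, 0x20, 0x20, 0x20, 0x20, 0x01, 0x00, 0x00, 0x00, 0x42, 0x58, 0x50, 0x43,
    0x01, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x0A, 0x00, 0x00, 0x00, 0xFF, 0xFF, 0x00, 0x01,
    0x01, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0x40, 0x00, 0x04,
    0x00, 0x00, 0xE4, 0x39, 0x01, 0x00, 0x00, 0x00, 0x08, 0x1C, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0x40, 0x00, 0x04, 0x08, 0x00, 0xE4, 0x39,
    0x01, 0x00, 0x00, 0x00, 0xFE, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x01, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
};
```

Calling acpi_byte_sum(hest_before, sizeof(hest_before)) returns 0, and acpi_expected_checksum() on the same bytes returns 0xE0, the value shown in the pre-series dump. Any change to the table, such as the Error Source Count going from 1 to 2, therefore forces a new Checksum byte.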
Thanks,
Mauro
* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
2025-02-26 9:56 ` Mauro Carvalho Chehab
2025-02-26 11:23 ` Mauro Carvalho Chehab
@ 2025-02-26 12:29 ` Igor Mammedov
1 sibling, 0 replies; 43+ messages in thread
From: Igor Mammedov @ 2025-02-26 12:29 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Jonathan Cameron, Michael S . Tsirkin, Shiju Jose, qemu-arm,
qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
linux-kernel, Gavin Shan
On Wed, 26 Feb 2025 10:56:28 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Em Tue, 25 Feb 2025 11:01:15 +0100
> Igor Mammedov <imammedo@redhat.com> escreveu:
>
> > On Fri, 21 Feb 2025 10:21:27 +0000
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >
> > > On Fri, 21 Feb 2025 07:38:23 +0100
> > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > >
> > > > Em Mon, 3 Feb 2025 16:22:36 +0100
> > > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > > >
> > > > > On Mon, 3 Feb 2025 11:09:34 +0000
> > > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > > > >
> > > > > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > > > >
> > > > > > > Now that the ghes preparation patches were merged, let's add support
> > > > > > > for error injection.
> > > > > > >
> > > > > > > On this series, the first 6 patches chang to the math used to calculate offsets at HEST
> > > > > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > > > >
> > > > > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > > > > to inject ARM Processor Error records.
> > > > > > >
> > > > > > > If I'm counting well, this is the 19th submission of my error inject patches.
> > > > > >
> > > > > > Looks good to me. All remaining trivial things are in the category
> > > > > > of things to consider only if you are doing another spin. The code
> > > > > > ends up how I'd like it at the end of the series anyway, just
> > > > > > a question of the precise path to that state!
> > > > >
> > > > > if you look at series as a whole it's more or less fine (I guess you
> > > > > and me got used to it)
> > > > >
> > > > > however if you take it patch by patch (as if you've never seen it)
> > > > > ordering is messed up (the same would apply to everyone after a while
> > > > > when it's forgotten)
> > > > >
> > > > > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > > > > re sum up my comments wrt ordering:
> > > > >
> > > > > 0 add testcase for HEST table with current HEST as expected blob
> > > > > (currently missing), so that we can be sure that we haven't messed
> > > > > existing tables during refactoring.
> > >
> > > To potentially save time I think Igor is asking that before you do anything
> > > at all you plug the existing test hole which is that we don't test HEST
> > > at all. Even after this series I think we don't test HEST. You add
> > > a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
> > > it should be replaced with the example data for the test.
> >
> > that's what I was saying.
> > HEST table should be in DSDT, but it's optional and one has to use
> > 'ras=on' option to enable that, which we aren't doing ATM.
> > So whatever changes are happening we aren't seeing them in tests
> > nor will we see any regression for the same reason.
> >
> > While white listing tables before change should happen and then updating them
> > is the right thing to do, it's not sufficient since none of tests
> > run with 'ras' enabled, hence code is not actually executed.
>
> Ok. Well, again we're not modifying HEST table structure on this
> changeset. The only change affecting HEST is when the number of entries
> increased from 1 to 2.
>
> Now, looking at bios-tables-test.c, if I got it right, I should be doing
> something similar to the enclosed patch, right?
>
> If so, I have a couple of questions:
>
> 1. from where should I get the HEST table? dumping the table from the
> running VM?
>
> 2. what values should I use to fill those variables:
>
> int hest_offset = 40 /* HEST */;
> int hest_entry_size = 4;
You don't need to do that:
bios-tables-test will dump all ACPI tables for you automatically,
you only need to add or extend a test with the ras=on option.
1: first add an empty table and whitelist it ("tests/data/acpi/aarch64/virt/HEST")
2: enable ras in an existing testcase
--- a/tests/qtest/bios-tables-test.c
+++ b/tests/qtest/bios-tables-test.c
@@ -2123,7 +2123,8 @@ static void test_acpi_aarch64_virt_tcg(void)
data.smbios_cpu_max_speed = 2900;
data.smbios_cpu_curr_speed = 2700;
test_acpi_one("-cpu cortex-a57 "
- "-smbios type=4,max-speed=2900,current-speed=2700", &data);
+ "-smbios type=4,max-speed=2900,current-speed=2700 "
+ "-machine ras=on", &data);
free_test_data(&data);
}
then, with IASL installed, run
V=1 QTEST_QEMU_BINARY=./qemu-system-aarch64 ./tests/qtest/bios-tables-test
to see the diff
3: rebuild the tables and follow the rest of the procedure to update the expected blobs,
as described in the comment at the top of tests/qtest/bios-tables-test.c
I'd recommend adding 3 patches at the beginning of the series,
that way we can be sure that if something changes unintentionally
it won't go unnoticed.
>
> >
> > >
> > > That indeed doesn't address testing the error data storage which would be
> > > a different problem.
> >
> > I'd skip hardware_errors/CPER testing from QEMU unit tests.
> > That's basically requires functioning 'APEI driver' to test that.
> >
> > Maybe we can use Ani's framework to parse HEST and all the way
> > towards CPER record(s) traversal, but that's certainly out of
> > scope of this series.
> > It could be done on top, but I won't insist even on that
> > since Mauro's out of tree error injection testing will
> > cover that using actual guest (which I assume he would
> > like to run periodically).
>
> Yeah, my plan is to periodically test it. I intend to setup somewhere
> a CI to test Kernel, QEMU and rasdaemon altogether.
>
> Thanks,
> Mauro
>
> ---
>
> diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
> index 0a333ec43536..31e69d906db4 100644
> --- a/tests/qtest/bios-tables-test.c
> +++ b/tests/qtest/bios-tables-test.c
> @@ -210,6 +210,8 @@ static void test_acpi_fadt_table(test_data *data)
> uint32_t val;
> int dsdt_offset = 40 /* DSDT */;
> int dsdt_entry_size = 4;
> + int hest_offset = 40 /* HEST */;
> + int hest_entry_size = 4;
>
> g_assert(compare_signature(&table, "FACP"));
>
> @@ -242,6 +244,12 @@ static void test_acpi_fadt_table(test_data *data)
> /* update checksum */
> fadt_aml[9 /* Checksum */] = 0;
> fadt_aml[9 /* Checksum */] -= acpi_calc_checksum(fadt_aml, fadt_len);
> +
> +
> +
> + acpi_fetch_table(data->qts, &table.aml, &table.aml_len,
> + fadt_aml + hest_offset, hest_entry_size, "HEST", true);
> + g_array_append_val(data->tables, table);
> }
>
> static void dump_aml_files(test_data *data, bool rebuild)
> @@ -2411,7 +2419,7 @@ static void test_acpi_aarch64_virt_oem_fields(void)
> };
> char *args;
>
> - args = test_acpi_create_args(&data, "-cpu cortex-a57 "OEM_TEST_ARGS);
> + args = test_acpi_create_args(&data, "-ras on -cpu cortex-a57 "OEM_TEST_ARGS);
> data.qts = qtest_init(args);
> test_acpi_load_tables(&data);
> test_oem_fields(&data);
>
* Re: [PATCH v3 03/14] acpi/ghes: Use HEST table offsets when preparing GHES records
2025-02-25 9:43 ` Igor Mammedov
@ 2025-02-26 16:14 ` Mauro Carvalho Chehab
2025-02-27 9:22 ` Igor Mammedov
0 siblings, 1 reply; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-26 16:14 UTC (permalink / raw)
To: Igor Mammedov
Cc: Michael S . Tsirkin, Jonathan Cameron, Shiju Jose, qemu-arm,
qemu-devel, Ani Sinha, Dongjiu Geng, linux-kernel
Em Tue, 25 Feb 2025 10:43:27 +0100
Igor Mammedov <imammedo@redhat.com> escreveu:
> On Fri, 21 Feb 2025 07:02:21 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
>
> > Em Mon, 3 Feb 2025 15:34:23 +0100
> > Igor Mammedov <imammedo@redhat.com> escreveu:
> >
> > > On Fri, 31 Jan 2025 18:42:44 +0100
> > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > >
> > > > There are two pointers that are needed during error injection:
> > > >
> > > > 1. The start address of the CPER block to be stored;
> > > > 2. The address of the ack.
> > > >
> > > > It is preferable to calculate them from the HEST table. This allows
> > > > checking the source ID, the size of the table and the type of the
> > > > HEST error block structures.
> > > >
> > > > Yet, keep the old code, as this is needed for migration purposes.
> > > >
> > > > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > > > ---
> > > > hw/acpi/ghes.c | 132 ++++++++++++++++++++++++++++++++++++-----
> > > > include/hw/acpi/ghes.h | 1 +
> > > > 2 files changed, 119 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> > > > index 27478f2d5674..8f284fd191a6 100644
> > > > --- a/hw/acpi/ghes.c
> > > > +++ b/hw/acpi/ghes.c
> > > > @@ -41,6 +41,12 @@
> > > > /* Address offset in Generic Address Structure(GAS) */
> > > > #define GAS_ADDR_OFFSET 4
> > > >
> > > > +/*
> > > > + * ACPI spec 1.0b
> > > > + * 5.2.3 System Description Table Header
> > > > + */
> > > > +#define ACPI_DESC_HEADER_OFFSET 36
> > > > +
> > > > /*
> > > > * The total size of Generic Error Data Entry
> > > > * ACPI 6.1/6.2: 18.3.2.7.1 Generic Error Data,
> > > > @@ -61,6 +67,25 @@
> > > > */
> > > > #define ACPI_GHES_GESB_SIZE 20
> > > >
> > > > +/*
> > > > + * Offsets with regards to the start of the HEST table stored at
> > > > + * ags->hest_addr_le,
> > >
> > > If I read this literary, then offsets above are not what
> > > declared later in this patch.
> > > I'd really drop this comment altogether as it's confusing,
> > > and rather get variables/macro naming right
> > >
> > > > according with the memory layout map at
> > > > + * docs/specs/acpi_hest_ghes.rst.
> > > > + */
> > >
> > > what we need is update to above doc, describing new and old ways.
> > > a separate patch.
> >
> > I can't see anything that should be changed at
> > docs/specs/acpi_hest_ghes.rst, as this series doesn't change the
> > firmware layout: we're still using two firmware tables:
> >
> > - etc/acpi/tables, with HEST on it;
> > - etc/hardware_errors, with:
> > - error block addresses;
> > - read_ack registers;
> > - CPER records.
> >
> > The only changes that this series introduce are related to how
> > the error generation logic navigates between HEST and hw_errors
> > firmware. This is not described at acpi_hest_ghes.rst, and both
> > ways follow ACPI specs to the letter.
> >
> > The only difference is that the code which populates the CPER
> > record and the error/read offsets doesn't require to know how
> > the HEST table generation placed offsets, as it will basically
> > reproduce what OSPM firmware does when handling HEST events.
>
> section 8 describes old way to get to address to record old CPER,
> so it needs to amended to also describe a new approach and say
> which way is used for which version.
>
> possibly section 11 might need some messaging as well.
OK, I'll modify it and place the change at the end of the series. Please
check below whether the new text is OK for you.
---
[PATCH] docs/specs/acpi_hest_ghes.rst: update it to reflect some changes
While the HEST layout didn't change, there are some internal
changes related to how offsets are calculated and how memory error
events are triggered.
Update specs to reflect such changes.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
diff --git a/docs/specs/acpi_hest_ghes.rst b/docs/specs/acpi_hest_ghes.rst
index c3e9f8d9a702..f22d2eefdec7 100644
--- a/docs/specs/acpi_hest_ghes.rst
+++ b/docs/specs/acpi_hest_ghes.rst
@@ -89,12 +89,21 @@ Design Details
addresses in the "error_block_address" fields with a pointer to the
respective "Error Status Data Block" in the "etc/hardware_errors" blob.
-(8) QEMU defines a third and write-only fw_cfg blob which is called
- "etc/hardware_errors_addr". Through that blob, the firmware can send back
- the guest-side allocation addresses to QEMU. The "etc/hardware_errors_addr"
- blob contains a 8-byte entry. QEMU generates a single WRITE_POINTER command
- for the firmware. The firmware will write back the start address of
- "etc/hardware_errors" blob to the fw_cfg file "etc/hardware_errors_addr".
+(8) QEMU defines a third and write-only fw_cfg blob to store the location
+ where the error block offsets, read ack registers and CPER records are
+ stored.
+
+ Up to QEMU 9.2, the location was at "etc/hardware_errors_addr", and
+ contains an offset for the beginning of "etc/hardware_errors".
+
+ Newer versions place the location at "etc/acpi_table_hest_addr",
+ pointing to the beginning of the HEST table.
+
+ Through that such offsets, the firmware can send back the guest-side
+ allocation addresses to QEMU. They contain a 8-byte entry. QEMU generates
+ a single WRITE_POINTER command for the firmware. The firmware will write
+ back the start address of either "etc/hardware_errors" or HEST table at
+ the correspoinding address firmware.
(9) When QEMU gets a SIGBUS from the kernel, QEMU writes CPER into corresponding
"Error Status Data Block", guest memory, and then injects platform specific
@@ -105,8 +114,6 @@ Design Details
kernel, on receiving notification, guest APEI driver could read the CPER error
and take appropriate action.
-(11) kvm_arch_on_sigbus_vcpu() uses source_id as index in "etc/hardware_errors" to
- find out "Error Status Data Block" entry corresponding to error source. So supported
- source_id values should be assigned here and not be changed afterwards to make sure
- that guest will write error into expected "Error Status Data Block" even if guest was
- migrated to a newer QEMU.
+(11) kvm_arch_on_sigbus_vcpu() report RAS errors via a SEA notifications,
+ when a SIGBUS event is triggered. The logic to convert a SEA notification
+ into a source ID is defined inside ghes.c source file.
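To make the "new way" concrete, here is a minimal sketch of navigating from the start of the HEST table to the two addresses needed for a given source ID. It assumes that all sources are GHESv2 entries of 92 bytes and uses the field offsets visible in the iasl dump posted earlier in this thread. The function and constant names are hypothetical, the sketch works on a plain byte buffer rather than on guest memory, and it is not the code in hw/acpi/ghes.c:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets read off the iasl dump in this thread (illustrative names). */
enum {
    HEST_LENGTH_OFFSET     = 4,  /* Table Length */
    HEST_SRC_COUNT_OFFSET  = 36, /* Error Source Count */
    HEST_FIRST_SRC_OFFSET  = 40, /* first error source entry */
    HEST_TYPE_GHESV2       = 10, /* Subtable Type 000A */
    GHESV2_SIZE            = 92,
    GHESV2_SOURCE_ID       = 2,  /* Source Id, relative to the entry */
    GHESV2_ERR_STATUS_ADDR = 24, /* Error Status Address GAS + 4 */
    GHESV2_READ_ACK_ADDR   = 68, /* Read Ack Register GAS + 4 */
};

/* Read an nbytes-wide little-endian value. */
static uint64_t load_le(const uint8_t *p, unsigned nbytes)
{
    uint64_t v = 0;

    for (unsigned i = 0; i < nbytes; i++) {
        v |= (uint64_t)p[i] << (8 * i);
    }
    return v;
}

/*
 * Hypothetical helper: find the GHESv2 entry for source_id in a HEST image
 * and return the addresses stored in its two Generic Address Structures.
 */
static bool hest_find_source(const uint8_t *hest, size_t size,
                             uint16_t source_id,
                             uint64_t *err_status_addr,
                             uint64_t *read_ack_addr)
{
    if (size < HEST_FIRST_SRC_OFFSET) {
        return false;
    }

    uint32_t length = (uint32_t)load_le(hest + HEST_LENGTH_OFFSET, 4);
    uint32_t count = (uint32_t)load_le(hest + HEST_SRC_COUNT_OFFSET, 4);
    size_t off = HEST_FIRST_SRC_OFFSET;

    if (length > size) {
        return false;
    }

    for (uint32_t i = 0; i < count; i++, off += GHESV2_SIZE) {
        const uint8_t *entry = hest + off;

        if (off + GHESV2_SIZE > length) {
            return false; /* entry would cross the end of the table */
        }
        if (load_le(entry, 2) != HEST_TYPE_GHESV2) {
            return false; /* other types have other sizes; not handled here */
        }
        if (load_le(entry + GHESV2_SOURCE_ID, 2) == source_id) {
            *err_status_addr = load_le(entry + GHESV2_ERR_STATUS_ADDR, 8);
            *read_ack_addr = load_le(entry + GHESV2_READ_ACK_ADDR, 8);
            return true;
        }
    }
    return false;
}
```

The point of the sketch is that the lookup validates the table size, the entry type and the source ID on the way, instead of relying on knowledge of how the entries were laid out when the table was generated.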
* Re: [PATCH v3 03/14] acpi/ghes: Use HEST table offsets when preparing GHES records
2025-02-26 16:14 ` Mauro Carvalho Chehab
@ 2025-02-27 9:22 ` Igor Mammedov
2025-02-27 10:11 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 43+ messages in thread
From: Igor Mammedov @ 2025-02-27 9:22 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Michael S . Tsirkin, Jonathan Cameron, Shiju Jose, qemu-arm,
qemu-devel, Ani Sinha, Dongjiu Geng, linux-kernel
On Wed, 26 Feb 2025 17:14:06 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Em Tue, 25 Feb 2025 10:43:27 +0100
> Igor Mammedov <imammedo@redhat.com> escreveu:
>
> > On Fri, 21 Feb 2025 07:02:21 +0100
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> >
> > > Em Mon, 3 Feb 2025 15:34:23 +0100
> > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > >
> > > > On Fri, 31 Jan 2025 18:42:44 +0100
> > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > >
> > > > > There are two pointers that are needed during error injection:
> > > > >
> > > > > 1. The start address of the CPER block to be stored;
> > > > > 2. The address of the ack.
> > > > >
> > > > > It is preferable to calculate them from the HEST table. This allows
> > > > > checking the source ID, the size of the table and the type of the
> > > > > HEST error block structures.
> > > > >
> > > > > Yet, keep the old code, as this is needed for migration purposes.
> > > > >
> > > > > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > > > > ---
> > > > > hw/acpi/ghes.c | 132 ++++++++++++++++++++++++++++++++++++-----
> > > > > include/hw/acpi/ghes.h | 1 +
> > > > > 2 files changed, 119 insertions(+), 14 deletions(-)
> > > > >
> > > > > diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> > > > > index 27478f2d5674..8f284fd191a6 100644
> > > > > --- a/hw/acpi/ghes.c
> > > > > +++ b/hw/acpi/ghes.c
> > > > > @@ -41,6 +41,12 @@
> > > > > /* Address offset in Generic Address Structure(GAS) */
> > > > > #define GAS_ADDR_OFFSET 4
> > > > >
> > > > > +/*
> > > > > + * ACPI spec 1.0b
> > > > > + * 5.2.3 System Description Table Header
> > > > > + */
> > > > > +#define ACPI_DESC_HEADER_OFFSET 36
> > > > > +
> > > > > /*
> > > > > * The total size of Generic Error Data Entry
> > > > > * ACPI 6.1/6.2: 18.3.2.7.1 Generic Error Data,
> > > > > @@ -61,6 +67,25 @@
> > > > > */
> > > > > #define ACPI_GHES_GESB_SIZE 20
> > > > >
> > > > > +/*
> > > > > + * Offsets with regards to the start of the HEST table stored at
> > > > > + * ags->hest_addr_le,
> > > >
> > > > If I read this literary, then offsets above are not what
> > > > declared later in this patch.
> > > > I'd really drop this comment altogether as it's confusing,
> > > > and rather get variables/macro naming right
> > > >
> > > > > according with the memory layout map at
> > > > > + * docs/specs/acpi_hest_ghes.rst.
> > > > > + */
> > > >
> > > > what we need is update to above doc, describing new and old ways.
> > > > a separate patch.
> > >
> > > I can't see anything that should be changed at
> > > docs/specs/acpi_hest_ghes.rst, as this series doesn't change the
> > > firmware layout: we're still using two firmware tables:
> > >
> > > - etc/acpi/tables, with HEST on it;
> > > - etc/hardware_errors, with:
> > > - error block addresses;
> > > - read_ack registers;
> > > - CPER records.
> > >
> > > The only changes that this series introduce are related to how
> > > the error generation logic navigates between HEST and hw_errors
> > > firmware. This is not described at acpi_hest_ghes.rst, and both
> > > ways follow ACPI specs to the letter.
> > >
> > > The only difference is that the code which populates the CPER
> > > record and the error/read offsets doesn't require to know how
> > > the HEST table generation placed offsets, as it will basically
> > > reproduce what OSPM firmware does when handling HEST events.
> >
> > section 8 describes old way to get to address to record old CPER,
> > so it needs to amended to also describe a new approach and say
> > which way is used for which version.
> >
> > possibly section 11 might need some messaging as well.
>
> Ok, I'll modify it and place at the end of the series. Please
> see below if the new text is ok for you.
>
> ---
>
> [PATCH] docs/specs/acpi_hest_ghes.rst: update it to reflect some changes
s/^^^/docs: hest: add new "etc/acpi_table_hest_addr" and update workflow/
>
> While the HEST layout didn't change, there are some internal
> changes related to how offsets are calculated and how memory error
> events are triggered.
>
> Update specs to reflect such changes.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
>
> diff --git a/docs/specs/acpi_hest_ghes.rst b/docs/specs/acpi_hest_ghes.rst
> index c3e9f8d9a702..f22d2eefdec7 100644
> --- a/docs/specs/acpi_hest_ghes.rst
> +++ b/docs/specs/acpi_hest_ghes.rst
> @@ -89,12 +89,21 @@ Design Details
> addresses in the "error_block_address" fields with a pointer to the
> respective "Error Status Data Block" in the "etc/hardware_errors" blob.
>
> -(8) QEMU defines a third and write-only fw_cfg blob which is called
> - "etc/hardware_errors_addr". Through that blob, the firmware can send back
> - the guest-side allocation addresses to QEMU. The "etc/hardware_errors_addr"
> - blob contains a 8-byte entry. QEMU generates a single WRITE_POINTER command
> - for the firmware. The firmware will write back the start address of
> - "etc/hardware_errors" blob to the fw_cfg file "etc/hardware_errors_addr".
> +(8) QEMU defines a third and write-only fw_cfg blob to store the location
> + where the error block offsets, read ack registers and CPER records are
> + stored.
> +
> + Up to QEMU 9.2, the location was at "etc/hardware_errors_addr", and
> + contains an offset for the beginning of "etc/hardware_errors".
> +
> + Newer versions place the location at "etc/acpi_table_hest_addr",
> + pointing to the beginning of the HEST table.
> +
> + Through that such offsets, the firmware can send back the guest-side
    ^^^^^^^^^^^^^^^^^^^^^^^^^ can't parse that, suggest to just drop the phrase
> + allocation addresses to QEMU. They contain a 8-byte entry. QEMU generates
> + a single WRITE_POINTER command for the firmware. The firmware will write
> + back the start address of either "etc/hardware_errors" or HEST table at
^^^^ drop this?
> + the correspoinding address firmware.
>
> (9) When QEMU gets a SIGBUS from the kernel, QEMU writes CPER into corresponding
> "Error Status Data Block", guest memory, and then injects platform specific
> @@ -105,8 +114,6 @@ Design Details
> kernel, on receiving notification, guest APEI driver could read the CPER error
> and take appropriate action.
>
> -(11) kvm_arch_on_sigbus_vcpu() uses source_id as index in "etc/hardware_errors" to
> - find out "Error Status Data Block" entry corresponding to error source. So supported
> - source_id values should be assigned here and not be changed afterwards to make sure
> - that guest will write error into expected "Error Status Data Block" even if guest was
> - migrated to a newer QEMU.
> +(11) kvm_arch_on_sigbus_vcpu() report RAS errors via a SEA notifications,
> + when a SIGBUS event is triggered.
> The logic to convert a SEA notification
> + into a source ID is defined inside ghes.c source file.
that's cheating and not documentation by any means
>
>
>
* Re: [PATCH v3 03/14] acpi/ghes: Use HEST table offsets when preparing GHES records
2025-02-27 9:22 ` Igor Mammedov
@ 2025-02-27 10:11 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 43+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-27 10:11 UTC (permalink / raw)
To: Igor Mammedov
Cc: Michael S . Tsirkin, Jonathan Cameron, Shiju Jose, qemu-arm,
qemu-devel, Ani Sinha, Dongjiu Geng, linux-kernel
Em Thu, 27 Feb 2025 10:22:55 +0100
Igor Mammedov <imammedo@redhat.com> escreveu:
> On Wed, 26 Feb 2025 17:14:06 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
>
> > Em Tue, 25 Feb 2025 10:43:27 +0100
> > Igor Mammedov <imammedo@redhat.com> escreveu:
> >
> > > On Fri, 21 Feb 2025 07:02:21 +0100
> > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > >
> > > > Em Mon, 3 Feb 2025 15:34:23 +0100
> > > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > > >
> > > > > On Fri, 31 Jan 2025 18:42:44 +0100
> > > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > > >
> > > > > > There are two pointers that are needed during error injection:
> > > > > >
> > > > > > 1. The start address of the CPER block to be stored;
> > > > > > 2. The address of the ack.
> > > > > >
> > > > > > It is preferable to calculate them from the HEST table. This allows
> > > > > > checking the source ID, the size of the table and the type of the
> > > > > > HEST error block structures.
> > > > > >
> > > > > > Yet, keep the old code, as this is needed for migration purposes.
> > > > > >
> > > > > > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > > > > > ---
> > > > > > hw/acpi/ghes.c | 132 ++++++++++++++++++++++++++++++++++++-----
> > > > > > include/hw/acpi/ghes.h | 1 +
> > > > > > 2 files changed, 119 insertions(+), 14 deletions(-)
> > > > > >
> > > > > > diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> > > > > > index 27478f2d5674..8f284fd191a6 100644
> > > > > > --- a/hw/acpi/ghes.c
> > > > > > +++ b/hw/acpi/ghes.c
> > > > > > @@ -41,6 +41,12 @@
> > > > > > /* Address offset in Generic Address Structure(GAS) */
> > > > > > #define GAS_ADDR_OFFSET 4
> > > > > >
> > > > > > +/*
> > > > > > + * ACPI spec 1.0b
> > > > > > + * 5.2.3 System Description Table Header
> > > > > > + */
> > > > > > +#define ACPI_DESC_HEADER_OFFSET 36
> > > > > > +
> > > > > > /*
> > > > > > * The total size of Generic Error Data Entry
> > > > > > * ACPI 6.1/6.2: 18.3.2.7.1 Generic Error Data,
> > > > > > @@ -61,6 +67,25 @@
> > > > > > */
> > > > > > #define ACPI_GHES_GESB_SIZE 20
> > > > > >
> > > > > > +/*
> > > > > > + * Offsets with regards to the start of the HEST table stored at
> > > > > > + * ags->hest_addr_le,
> > > > >
> > > > > If I read this literary, then offsets above are not what
> > > > > declared later in this patch.
> > > > > I'd really drop this comment altogether as it's confusing,
> > > > > and rather get variables/macro naming right
> > > > >
> > > > > > according with the memory layout map at
> > > > > > + * docs/specs/acpi_hest_ghes.rst.
> > > > > > + */
> > > > >
> > > > > what we need is update to above doc, describing new and old ways.
> > > > > a separate patch.
> > > >
> > > > I can't see anything that should be changed at
> > > > docs/specs/acpi_hest_ghes.rst, as this series doesn't change the
> > > > firmware layout: we're still using two firmware tables:
> > > >
> > > > - etc/acpi/tables, with HEST on it;
> > > > - etc/hardware_errors, with:
> > > > - error block addresses;
> > > > - read_ack registers;
> > > > - CPER records.
> > > >
> > > > The only changes that this series introduce are related to how
> > > > the error generation logic navigates between HEST and hw_errors
> > > > firmware. This is not described at acpi_hest_ghes.rst, and both
> > > > ways follow ACPI specs to the letter.
> > > >
> > > > The only difference is that the code which populates the CPER
> > > > record and the error/read offsets doesn't require to know how
> > > > the HEST table generation placed offsets, as it will basically
> > > > reproduce what OSPM firmware does when handling HEST events.
> > >
> > > section 8 describes old way to get to address to record old CPER,
> > > so it needs to amended to also describe a new approach and say
> > > which way is used for which version.
> > >
> > > possibly section 11 might need some messaging as well.
> >
> > Ok, I'll modify it and place at the end of the series. Please
> > see below if the new text is ok for you.
> >
> > ---
> >
> > [PATCH] docs/specs/acpi_hest_ghes.rst: update it to reflect some changes
>
> s/^^^/docs: hest: add new "etc/acpi_table_hest_addr" and update workflow/
>
> >
> > While the HEST layout didn't change, there are some internal
> > changes related to how offsets are calculated and how memory error
> > events are triggered.
> >
> > Update specs to reflect such changes.
> >
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> >
> > diff --git a/docs/specs/acpi_hest_ghes.rst b/docs/specs/acpi_hest_ghes.rst
> > index c3e9f8d9a702..f22d2eefdec7 100644
> > --- a/docs/specs/acpi_hest_ghes.rst
> > +++ b/docs/specs/acpi_hest_ghes.rst
> > @@ -89,12 +89,21 @@ Design Details
> > addresses in the "error_block_address" fields with a pointer to the
> > respective "Error Status Data Block" in the "etc/hardware_errors" blob.
> >
> > -(8) QEMU defines a third and write-only fw_cfg blob which is called
> > - "etc/hardware_errors_addr". Through that blob, the firmware can send back
> > - the guest-side allocation addresses to QEMU. The "etc/hardware_errors_addr"
> > - blob contains a 8-byte entry. QEMU generates a single WRITE_POINTER command
> > - for the firmware. The firmware will write back the start address of
> > - "etc/hardware_errors" blob to the fw_cfg file "etc/hardware_errors_addr".
> > +(8) QEMU defines a third and write-only fw_cfg blob to store the location
> > + where the error block offsets, read ack registers and CPER records are
> > + stored.
> > +
> > + Up to QEMU 9.2, the location was at "etc/hardware_errors_addr", and
> > + contains an offset for the beginning of "etc/hardware_errors".
> > +
> > + Newer versions place the location at "etc/acpi_table_hest_addr",
> > + pointing to the beginning of the HEST table.
> > +
> > + Through that such offsets, the firmware can send back the guest-side
> ^^^^^^^^^^^^^^^^^^^^^^^^^ can't parse that, suggest to just drop the phrase
>
> > + allocation addresses to QEMU. They contain a 8-byte entry. QEMU generates
> > + a single WRITE_POINTER command for the firmware. The firmware will write
> > + back the start address of either "etc/hardware_errors" or HEST table at
> ^^^^ drop this?
>
> > + the correspoinding address firmware.
> >
> > (9) When QEMU gets a SIGBUS from the kernel, QEMU writes CPER into corresponding
> > "Error Status Data Block", guest memory, and then injects platform specific
> > @@ -105,8 +114,6 @@ Design Details
> > kernel, on receiving notification, guest APEI driver could read the CPER error
> > and take appropriate action.
> >
> > -(11) kvm_arch_on_sigbus_vcpu() uses source_id as index in "etc/hardware_errors" to
> > - find out "Error Status Data Block" entry corresponding to error source. So supported
> > - source_id values should be assigned here and not be changed afterwards to make sure
> > - that guest will write error into expected "Error Status Data Block" even if guest was
> > - migrated to a newer QEMU.
> > +(11) kvm_arch_on_sigbus_vcpu() report RAS errors via a SEA notifications,
> > + when a SIGBUS event is triggered.
>
> > The logic to convert a SEA notification
> > + into a source ID is defined inside ghes.c source file.
> that's cheating and not documentation by any means

I'll drop the last paragraph. I guess that (11) was there to document a hack in
the original design:

    "supported source_id values should be assigned here".

The new code doesn't make any assumptions like that. All it needs is for the
caller to specify the notification type (currently, SEA or GPIO).

So, IMO the only thing eventually useful there is:

    (11) kvm_arch_on_sigbus_vcpu() reports RAS errors via SEA notifications
         when a SIGBUS event is triggered.

Yet, to be frank, I'm not sure if this should be documented. So perhaps
the entire section (11) can be dropped.

Thanks,
Mauro
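
For readers following the thread, the navigation discussed above (start at the
beginning of HEST, pick the error source by notification type, then read the
error status and read_ack addresses from it) can be sketched roughly as below.
This is only an illustration and not the code from this series: the function
names are hypothetical, the table is synthetic, the sketch assumes HEST holds
only GHESv2 entries, and the field offsets are taken from the GHESv2 layout in
the ACPI specification. The address found in "Error Status Address" is itself
the location of a pointer to the Error Status Data Block, so a further
guest-memory read would follow; that step is left out here.

```python
import struct

# Offsets below follow the ACPI GHESv2 layout; all names are illustrative.
ACPI_HEADER_LEN = 36   # standard ACPI table header
HEST_COUNT_LEN = 4     # "Error Source Count" field right after the header
GHES_V2_TYPE = 10      # error source structure type for GHESv2
GHES_V2_LEN = 92       # size of one GHESv2 structure
GAS_ADDR_OFFSET = 4    # 64-bit address inside a Generic Address Structure
ERR_STATUS_GAS = 20    # "Error Status Address" GAS inside GHESv2
NOTIFY_TYPE = 32       # first byte of the notification structure
READ_ACK_GAS = 64      # "Read Ack Register" GAS inside GHESv2

NOTIFY_GPIO = 7        # GPIO-Signal notification type
NOTIFY_SEA = 8         # ARMv8 SEA notification type


def find_ghes_v2(hest: bytes, notify_type: int):
    """Walk HEST and return (source_id, err_status_addr, read_ack_addr)."""
    (count,) = struct.unpack_from("<I", hest, ACPI_HEADER_LEN)
    pos = ACPI_HEADER_LEN + HEST_COUNT_LEN
    for _ in range(count):
        src_type, source_id = struct.unpack_from("<HH", hest, pos)
        if src_type != GHES_V2_TYPE:
            # Assumption: only GHESv2 sources are present in the table.
            raise ValueError(f"unsupported error source type {src_type}")
        if hest[pos + NOTIFY_TYPE] == notify_type:
            (err_addr,) = struct.unpack_from(
                "<Q", hest, pos + ERR_STATUS_GAS + GAS_ADDR_OFFSET)
            (ack_addr,) = struct.unpack_from(
                "<Q", hest, pos + READ_ACK_GAS + GAS_ADDR_OFFSET)
            return source_id, err_addr, ack_addr
        pos += GHES_V2_LEN
    raise LookupError(f"no error source with notification type {notify_type}")


def make_ghes_v2(source_id, notify_type, err_addr, ack_addr) -> bytes:
    """Build a mostly-zeroed GHESv2 entry for the synthetic table."""
    entry = bytearray(GHES_V2_LEN)
    struct.pack_into("<HH", entry, 0, GHES_V2_TYPE, source_id)
    entry[NOTIFY_TYPE] = notify_type
    struct.pack_into("<Q", entry, ERR_STATUS_GAS + GAS_ADDR_OFFSET, err_addr)
    struct.pack_into("<Q", entry, READ_ACK_GAS + GAS_ADDR_OFFSET, ack_addr)
    return bytes(entry)


def make_hest(entries) -> bytes:
    """Build a synthetic HEST: header, source count, then the entries."""
    body = b"".join(entries)
    header = bytearray(ACPI_HEADER_LEN)
    header[0:4] = b"HEST"
    total_len = ACPI_HEADER_LEN + HEST_COUNT_LEN + len(body)
    struct.pack_into("<I", header, 4, total_len)   # table length field
    return bytes(header) + struct.pack("<I", len(entries)) + body


if __name__ == "__main__":
    # Made-up addresses, only to exercise the lookup.
    table = make_hest([
        make_ghes_v2(0, NOTIFY_SEA, 0x1000, 0x2000),
        make_ghes_v2(1, NOTIFY_GPIO, 0x1008, 0x2008),
    ])
    print(find_ghes_v2(table, NOTIFY_GPIO))
```

The point of the sketch is that the lookup needs nothing beyond the start of
HEST and the notification type: it does not depend on how the table generation
code laid out "etc/hardware_errors".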