[PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject

public inbox for kvm@vger.kernel.org

* [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
@ 2025-01-31 17:42 Mauro Carvalho Chehab
  2025-01-31 17:42 ` [PATCH v3 08/14] acpi/ghes: Cleanup the code which gets ghes ged state Mauro Carvalho Chehab
  2025-02-03 11:09 ` [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Jonathan Cameron
  0 siblings, 2 replies; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC (permalink / raw)
  To: Igor Mammedov, Michael S . Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
	Mauro Carvalho Chehab, Philippe Mathieu-Daudé, Ani Sinha,
	Cleber Rosa, Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
	Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
	Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
	linux-kernel

Now that the ghes preparation patches were merged, let's add support
for error injection.

In this series, the first 6 patches change the math used to calculate
offsets in the HEST table and the hardware_error firmware file, together
with the related migration code. Migration was tested between the latest
released QEMU and upstream, in both directions.
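
As a rough illustration of the HEST-based offset math (a sketch only, not
the code from the series): the standard ACPI table header is 36 bytes
(ACPI_DESC_HEADER_OFFSET in the diff below), it is followed by a 32-bit
little-endian error source count, and each source entry starts with a
16-bit type and a 16-bit source ID. The function name and the synthetic
table are made up for this example; the type value 10 for GHESv2 comes
from the ACPI specification.

```python
import struct

# Size of the standard ACPI System Description Table Header.
ACPI_DESC_HEADER_OFFSET = 36


def read_hest_prefix(blob: bytes):
    """Return (num_sources, first_type, first_source_id) from a HEST image.

    Hypothetical helper: it only decodes the fields right after the
    header and does not walk the remaining entries.
    """
    off = ACPI_DESC_HEADER_OFFSET
    # Error Source Count: 4 bytes, little-endian.
    (num_sources,) = struct.unpack_from("<I", blob, off)
    off += 4
    if num_sources == 0:
        return 0, None, None
    # Each entry begins with a 16-bit type followed by a 16-bit source ID.
    first_type, first_source_id = struct.unpack_from("<HH", blob, off)
    return num_sources, first_type, first_source_id


# Synthetic table for illustration: zeroed header, one source,
# type 10 (GHESv2), source ID 1.
fake_hest = b"HEST" + bytes(32) + struct.pack("<I", 1) + struct.pack("<HH", 10, 1)
result = read_hest_prefix(fake_hest)
```

With a pointer to the beginning of the table, everything else can be
derived by adding fixed offsets, which is why only one address needs to
be stored.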

The remaining patches add a new QAPI command that allows injecting
GHESv2 errors, and a script that uses it to inject ARM Processor Error
records.
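
For readers unfamiliar with QMP, a command is a JSON object with an
"execute" key and optional "arguments". The sketch below builds such a
message for inject-ghes-v2-error. Note that the argument name "cper" and
its base64 encoding are assumptions made for this example; the
authoritative schema is qapi/acpi-hest.json, which is not quoted in this
cover letter. The function name is also hypothetical.

```python
import base64
import json


def build_inject_command(cper_bytes: bytes) -> str:
    """Build a QMP message that asks QEMU to inject a raw CPER record."""
    # ASSUMPTION: the payload is passed as a base64 string named "cper".
    args = {"cper": base64.b64encode(cper_bytes).decode("ascii")}
    return json.dumps({"execute": "inject-ghes-v2-error", "arguments": args})


# Two placeholder bytes stand in for a real CPER record.
message = build_inject_command(b"\x00\x01")
```

The scripts in this series wrap this kind of message construction in
qmp_helper.py, so users normally only call ghes_inject.py.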

If I'm counting correctly, this is the 19th submission of my error inject patches.

---
v3:
- addressed more nits;
- hest_addr_le now points to the beginning of HEST table;
- removed HEST from tests/data/acpi;
- added an extra patch to not use fw_cfg with virt-10.0 for hw_error_le

v2: 
- address some nits;
- improved ags cleanup patch and removed ags.present field;
- added some missing le*_to_cpu() calls;
- updated the copyright dates in new files to 2024-2025;
- QMP command renamed to inject-ghes-v2-error and its "since" updated to 10.0;
- added HEST and DSDT tables after the changes to make check target happy.
  (two patches: first one whitelisting such tables; second one removing from
   whitelist and updating/adding such tables to tests/data/acpi)

A diff against v2 follows, to better show the differences.

diff --git a/hw/acpi/generic_event_device.c b/hw/acpi/generic_event_device.c
index 50313ed7ee96..f5e899155d34 100644
--- a/hw/acpi/generic_event_device.c
+++ b/hw/acpi/generic_event_device.c
@@ -29,12 +29,6 @@ static const uint32_t ged_supported_events[] = {
     ACPI_GED_ERROR_EVT,
 };
 
-/*
- * ACPI 5.0b: 5.6.6 Device Object Notifications
- * Table 5-135 Error Device Notification Values
- */
-#define ERROR_DEVICE_NOTIFICATION   0x80
-
 /*
  * The ACPI Generic Event Device (GED) is a hardware-reduced specific
  * device[ACPI v6.1 Section 5.6.9] that handles all platform events,
@@ -124,9 +118,14 @@ void build_ged_aml(Aml *table, const char *name, HotplugHandler *hotplug_dev,
                                       aml_int(0x80)));
                 break;
             case ACPI_GED_ERROR_EVT:
+                /*
+                 * ACPI 5.0b: 5.6.6 Device Object Notifications
+                 * Table 5-135 Error Device Notification Values
+                 * Defines 0x80 as the value to be used on notifications
+                 */
                 aml_append(if_ctx,
                            aml_notify(aml_name(ACPI_APEI_ERROR_DEVICE),
-                                      aml_int(ERROR_DEVICE_NOTIFICATION)));
+                                      aml_int(0x80)));
                 break;
             case ACPI_GED_NVDIMM_HOTPLUG_EVT:
                 aml_append(if_ctx,
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index ef57ad14a38b..bcef0b22e612 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -41,6 +41,12 @@
 /* Address offset in Generic Address Structure(GAS) */
 #define GAS_ADDR_OFFSET 4
 
+/*
+ * ACPI spec 1.0b
+ * 5.2.3 System Description Table Header
+ */
+#define ACPI_DESC_HEADER_OFFSET     36
+
 /*
  * The total size of Generic Error Data Entry
  * ACPI 6.1/6.2: 18.3.2.7.1 Generic Error Data,
@@ -226,8 +232,8 @@ ghes_gen_err_data_uncorrectable_recoverable(GArray *block,
  * Initialize "etc/hardware_errors" and "etc/hardware_errors_addr" fw_cfg blobs.
  * See docs/specs/acpi_hest_ghes.rst for blobs format.
  */
-static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker,
-                                   int num_sources)
+static void build_ghes_error_table(AcpiGhesState *ags, GArray *hardware_errors,
+                                   BIOSLinker *linker, int num_sources)
 {
     int i, error_status_block_offset;
 
@@ -272,13 +278,15 @@ static void build_ghes_error_table(GArray *hardware_errors, BIOSLinker *linker,
                                        i * ACPI_GHES_MAX_RAW_DATA_LENGTH);
     }
 
-    /*
-     * Tell firmware to write hardware_errors GPA into
-     * hardware_errors_addr fw_cfg, once the former has been initialized.
-     */
-    bios_linker_loader_write_pointer(linker, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, 0,
-                                     sizeof(uint64_t),
-                                     ACPI_HW_ERROR_FW_CFG_FILE, 0);
+    if (!ags->use_hest_addr) {
+        /*
+         * Tell firmware to write hardware_errors GPA into
+         * hardware_errors_addr fw_cfg, once the former has been initialized.
+         */
+        bios_linker_loader_write_pointer(linker, ACPI_HW_ERROR_ADDR_FW_CFG_FILE,
+                                         0, sizeof(uint64_t),
+                                         ACPI_HW_ERROR_FW_CFG_FILE, 0);
+    }
 }
 
 /* Build Generic Hardware Error Source version 2 (GHESv2) */
@@ -365,11 +373,11 @@ void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
     uint32_t hest_offset;
     int i;
 
-    build_ghes_error_table(hardware_errors, linker, num_sources);
+    hest_offset = table_data->len;
 
-    acpi_table_begin(&table, table_data);
+    build_ghes_error_table(ags, hardware_errors, linker, num_sources);
 
-    hest_offset = table_data->len;
+    acpi_table_begin(&table, table_data);
 
     /* Error Source Count */
     build_append_int_noprefix(table_data, num_sources, 4);
@@ -383,7 +391,6 @@ void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
      * Tell firmware to write into GPA the address of HEST via fw_cfg,
      * once initialized.
      */
-
     if (ags->use_hest_addr) {
         bios_linker_loader_write_pointer(linker,
                                          ACPI_HEST_ADDR_FW_CFG_FILE, 0,
@@ -399,13 +406,13 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
     fw_cfg_add_file(s, ACPI_HW_ERROR_FW_CFG_FILE, hardware_error->data,
                     hardware_error->len);
 
-    /* Create a read-write fw_cfg file for Address */
-    fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, NULL, NULL,
-        NULL, &(ags->hw_error_le), sizeof(ags->hw_error_le), false);
-
     if (ags->use_hest_addr) {
         fw_cfg_add_file_callback(s, ACPI_HEST_ADDR_FW_CFG_FILE, NULL, NULL,
             NULL, &(ags->hest_addr_le), sizeof(ags->hest_addr_le), false);
+    } else {
+        /* Create a read-write fw_cfg file for Address */
+        fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, NULL, NULL,
+            NULL, &(ags->hw_error_le), sizeof(ags->hw_error_le), false);
     }
 }
 
@@ -432,7 +439,7 @@ static void get_hw_error_offsets(uint64_t ghes_addr,
 }
 
 static void get_ghes_source_offsets(uint16_t source_id,
-                                    uint64_t hest_entry_addr,
+                                    uint64_t hest_addr,
                                     uint64_t *cper_addr,
                                     uint64_t *read_ack_start_addr,
                                     Error **errp)
@@ -441,12 +448,13 @@ static void get_ghes_source_offsets(uint16_t source_id,
     uint64_t err_source_entry, error_block_addr;
     uint32_t num_sources, i;
 
+    hest_addr += ACPI_DESC_HEADER_OFFSET;
 
-    cpu_physical_memory_read(hest_entry_addr, &num_sources,
+    cpu_physical_memory_read(hest_addr, &num_sources,
                              sizeof(num_sources));
     num_sources = le32_to_cpu(num_sources);
 
-    err_source_entry = hest_entry_addr + sizeof(num_sources);
+    err_source_entry = hest_addr + sizeof(num_sources);
 
     /*
      * Currently, HEST Error source navigates only for GHESv2 tables
@@ -468,7 +476,6 @@ static void get_ghes_source_offsets(uint16_t source_id,
         /* Compare CPER source address at the GHESv2 structure */
         addr += sizeof(type);
         cpu_physical_memory_read(addr, &src_id, sizeof(src_id));
-
         if (le16_to_cpu(src_id) == source_id) {
             break;
         }
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 7d3580244179..7b6e90d69298 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -956,8 +956,10 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
     build_dbg2(tables_blob, tables->linker, vms);
 
     if (vms->ras) {
-        AcpiGhesState *ags;
+        static const AcpiNotificationSourceId *notify;
         AcpiGedState *acpi_ged_state;
+        unsigned int notify_sz;
+        AcpiGhesState *ags;
 
         acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
                                                        NULL));
@@ -967,16 +969,16 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
             acpi_add_table(table_offsets, tables_blob);
 
             if (!ags->use_hest_addr) {
-                acpi_build_hest(ags, tables_blob, tables->hardware_errors,
-                                tables->linker, hest_ghes_notify_9_2,
-                                ARRAY_SIZE(hest_ghes_notify_9_2),
-                                vms->oem_id, vms->oem_table_id);
+                notify = hest_ghes_notify_9_2;
+                notify_sz = ARRAY_SIZE(hest_ghes_notify_9_2);
             } else {
-                acpi_build_hest(ags, tables_blob, tables->hardware_errors,
-                                tables->linker, hest_ghes_notify,
-                                ARRAY_SIZE(hest_ghes_notify),
-                                vms->oem_id, vms->oem_table_id);
+                notify = hest_ghes_notify;
+                notify_sz = ARRAY_SIZE(hest_ghes_notify);
             }
+
+            acpi_build_hest(ags, tables_blob, tables->hardware_errors,
+                            tables->linker, notify, notify_sz,
+                            vms->oem_id, vms->oem_table_id);
         }
     }
 
diff --git a/scripts/arm_processor_error.py b/scripts/arm_processor_error.py
index 4ac04ab08299..b0e8450e667e 100644
--- a/scripts/arm_processor_error.py
+++ b/scripts/arm_processor_error.py
@@ -12,11 +12,110 @@
 #
 #   - ARM registers: power_state, mpidr.
 
+"""
+Generates an ARM processor error CPER, compatible with
+UEFI 2.9A Errata.
+
+Injecting such errors can be done using:
+
+    $ ./scripts/ghes_inject.py arm
+    Error injected.
+
+Produces a simple CPER register, as detected on a Linux guest:
+
+[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
+[Hardware Error]: event severity: recoverable
+[Hardware Error]:  Error 0, type: recoverable
+[Hardware Error]:   section_type: ARM processor error
+[Hardware Error]:   MIDR: 0x0000000000000000
+[Hardware Error]:   running state: 0x0
+[Hardware Error]:   Power State Coordination Interface state: 0
+[Hardware Error]:   Error info structure 0:
+[Hardware Error]:   num errors: 2
+[Hardware Error]:    error_type: 0x02: cache error
+[Hardware Error]:    error_info: 0x000000000091000f
+[Hardware Error]:     transaction type: Data Access
+[Hardware Error]:     cache error, operation type: Data write
+[Hardware Error]:     cache level: 2
+[Hardware Error]:     processor context not corrupted
+[Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
+
+The ARM Processor Error message can be customized via command line
+parameters. For instance:
+
+    $ ./scripts/ghes_inject.py arm --mpidr 0x444 --running --affinity 1 \
+        --error-info 12345678 --vendor 0x13,123,4,5,1 --ctx-array 0,1,2,3,4,5 \
+        -t cache tlb bus micro-arch tlb,micro-arch
+    Error injected.
+
+Injects this error, as detected on a Linux guest:
+
+[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
+[Hardware Error]: event severity: recoverable
+[Hardware Error]:  Error 0, type: recoverable
+[Hardware Error]:   section_type: ARM processor error
+[Hardware Error]:   MIDR: 0x0000000000000000
+[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000000000000
+[Hardware Error]:   error affinity level: 0
+[Hardware Error]:   running state: 0x1
+[Hardware Error]:   Power State Coordination Interface state: 0
+[Hardware Error]:   Error info structure 0:
+[Hardware Error]:   num errors: 2
+[Hardware Error]:    error_type: 0x02: cache error
+[Hardware Error]:    error_info: 0x0000000000bc614e
+[Hardware Error]:     cache level: 2
+[Hardware Error]:     processor context not corrupted
+[Hardware Error]:   Error info structure 1:
+[Hardware Error]:   num errors: 2
+[Hardware Error]:    error_type: 0x04: TLB error
+[Hardware Error]:    error_info: 0x000000000054007f
+[Hardware Error]:     transaction type: Instruction
+[Hardware Error]:     TLB error, operation type: Instruction fetch
+[Hardware Error]:     TLB level: 1
+[Hardware Error]:     processor context not corrupted
+[Hardware Error]:     the error has not been corrected
+[Hardware Error]:     PC is imprecise
+[Hardware Error]:   Error info structure 2:
+[Hardware Error]:   num errors: 2
+[Hardware Error]:    error_type: 0x08: bus error
+[Hardware Error]:    error_info: 0x00000080d6460fff
+[Hardware Error]:     transaction type: Generic
+[Hardware Error]:     bus error, operation type: Generic read (type of instruction or data request cannot be determined)
+[Hardware Error]:     affinity level at which the bus error occurred: 1
+[Hardware Error]:     processor context corrupted
+[Hardware Error]:     the error has been corrected
+[Hardware Error]:     PC is imprecise
+[Hardware Error]:     Program execution can be restarted reliably at the PC associated with the error.
+[Hardware Error]:     participation type: Local processor observed
+[Hardware Error]:     request timed out
+[Hardware Error]:     address space: External Memory Access
+[Hardware Error]:     memory access attributes:0x20
+[Hardware Error]:     access mode: secure
+[Hardware Error]:   Error info structure 3:
+[Hardware Error]:   num errors: 2
+[Hardware Error]:    error_type: 0x10: micro-architectural error
+[Hardware Error]:    error_info: 0x0000000078da03ff
+[Hardware Error]:   Error info structure 4:
+[Hardware Error]:   num errors: 2
+[Hardware Error]:    error_type: 0x14: TLB error|micro-architectural error
+[Hardware Error]:   Context info structure 0:
+[Hardware Error]:    register context type: AArch64 EL1 context registers
+[Hardware Error]:    00000000: 00000000 00000000
+[Hardware Error]:   Vendor specific error info has 5 bytes:
+[Hardware Error]:    00000000: 13 7b 04 05 01                                   .{...
+[Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
+[Firmware Warn]: GHES: Unhandled processor error type 0x04: TLB error
+[Firmware Warn]: GHES: Unhandled processor error type 0x08: bus error
+[Firmware Warn]: GHES: Unhandled processor error type 0x10: micro-architectural error
+[Firmware Warn]: GHES: Unhandled processor error type 0x14: TLB error|micro-architectural error
+"""
+
 import argparse
 import re
 
 from qmp_helper import qmp, util, cper_guid
 
+
 class ArmProcessorEinj:
     """
     Implements ARM Processor Error injection via GHES
diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
old mode 100644
new mode 100755
index 4f7ebb31b424..8e375d6a6cab
--- a/scripts/qmp_helper.py
+++ b/scripts/qmp_helper.py
@@ -541,7 +541,7 @@ def send_cper_raw(self, cper_data):
 
         self._connect()
 
-        if self.send_cmd("inject-ghes-error", cmd_arg):
+        if self.send_cmd("inject-ghes-v2-error", cmd_arg):
             print("Error injected.")
 
     def send_cper(self, notif_type, payload):
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 544ff174784d..80ca7779797b 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2371,7 +2371,6 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
     assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
 
     ags = acpi_ghes_get_state();
-
     if (ags && addr) {
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
diff --git a/tests/data/acpi/aarch64/virt/HEST b/tests/data/acpi/aarch64/virt/HEST
deleted file mode 100644
index 8b0cf87700fa..000000000000
Binary files a/tests/data/acpi/aarch64/virt/HEST and /dev/null differ

-

Mauro Carvalho Chehab (14):
  acpi/ghes: Prepare to support multiple sources on ghes
  acpi/ghes: add a firmware file with HEST address
  acpi/ghes: Use HEST table offsets when preparing GHES records
  acpi/generic_event_device: Update GHES migration to cover hest addr
  acpi/generic_event_device: add logic to detect if HEST addr is
    available
  acpi/ghes: only set hw_error_le or hest_addr_le
  acpi/ghes: add a notifier to notify when error data is ready
  acpi/ghes: Cleanup the code which gets ghes ged state
  acpi/generic_event_device: add an APEI error device
  tests/acpi: virt: allow acpi table changes for a new table: HEST
  arm/virt: Wire up a GED error device for ACPI / GHES
  tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT
  qapi/acpi-hest: add an interface to do generic CPER error injection
  scripts/ghes_inject: add a script to generate GHES error inject

 MAINTAINERS                                   |  10 +
 hw/acpi/Kconfig                               |   5 +
 hw/acpi/aml-build.c                           |  10 +
 hw/acpi/generic_event_device.c                |  43 ++
 hw/acpi/ghes-stub.c                           |   7 +-
 hw/acpi/ghes.c                                | 231 ++++--
 hw/acpi/ghes_cper.c                           |  38 +
 hw/acpi/ghes_cper_stub.c                      |  19 +
 hw/acpi/meson.build                           |   2 +
 hw/arm/virt-acpi-build.c                      |  37 +-
 hw/arm/virt.c                                 |  19 +-
 hw/core/machine.c                             |   2 +
 include/hw/acpi/acpi_dev_interface.h          |   1 +
 include/hw/acpi/aml-build.h                   |   2 +
 include/hw/acpi/generic_event_device.h        |   1 +
 include/hw/acpi/ghes.h                        |  45 +-
 include/hw/arm/virt.h                         |   2 +
 qapi/acpi-hest.json                           |  35 +
 qapi/meson.build                              |   1 +
 qapi/qapi-schema.json                         |   1 +
 scripts/arm_processor_error.py                | 476 ++++++++++++
 scripts/ghes_inject.py                        |  51 ++
 scripts/qmp_helper.py                         | 702 ++++++++++++++++++
 target/arm/kvm.c                              |   7 +-
 tests/data/acpi/aarch64/virt/DSDT             | Bin 5196 -> 5240 bytes
 .../data/acpi/aarch64/virt/DSDT.acpihmatvirt  | Bin 5282 -> 5326 bytes
 tests/data/acpi/aarch64/virt/DSDT.memhp       | Bin 6557 -> 6601 bytes
 tests/data/acpi/aarch64/virt/DSDT.pxb         | Bin 7679 -> 7723 bytes
 tests/data/acpi/aarch64/virt/DSDT.topology    | Bin 5398 -> 5442 bytes
 29 files changed, 1666 insertions(+), 81 deletions(-)
 create mode 100644 hw/acpi/ghes_cper.c
 create mode 100644 hw/acpi/ghes_cper_stub.c
 create mode 100644 qapi/acpi-hest.json
 create mode 100644 scripts/arm_processor_error.py
 create mode 100755 scripts/ghes_inject.py
 create mode 100755 scripts/qmp_helper.py

-- 
2.48.1




* [PATCH v3 08/14] acpi/ghes: Cleanup the code which gets ghes ged state
  2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
@ 2025-01-31 17:42 ` Mauro Carvalho Chehab
  2025-02-03 10:51   ` Jonathan Cameron
  2025-02-03 11:09 ` [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Jonathan Cameron
  1 sibling, 1 reply; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2025-01-31 17:42 UTC (permalink / raw)
  To: Igor Mammedov, Michael S . Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-arm, qemu-devel,
	Mauro Carvalho Chehab, Ani Sinha, Dongjiu Geng, Paolo Bonzini,
	Peter Maydell, kvm, linux-kernel

Move the check logic into a common function and simplify the
code which checks if GHES is enabled and was properly setup.
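
The resulting control flow can be modelled roughly as follows. This is a
Python model of the C logic in the diff below, written only to show the
pattern; the class and function names are made up for the illustration
and do not exist in QEMU.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GhesState:
    """Model of the fields kept in AcpiGhesState after this patch."""
    hest_addr_le: int = 0
    hw_error_le: int = 0
    use_hest_addr: bool = False


def get_state(ged_state: Optional[GhesState]) -> Optional[GhesState]:
    """Return the state only when GHES exists and firmware set an address."""
    if ged_state is None:
        # No GED device was found, so GHES is not available.
        return None
    if not ged_state.hw_error_le and not ged_state.hest_addr_le:
        # Neither fw_cfg pointer was written: GHES is not set up.
        return None
    return ged_state


def on_memory_error(ged_state: Optional[GhesState]) -> str:
    """Caller pattern: one lookup, then pass the state down explicitly."""
    ags = get_state(ged_state)
    if ags is None:
        return "ghes unavailable"
    return "hest offsets" if ags.use_hest_addr else "hw_error offsets"
```

A single lookup replaces both the old "present" flag and the repeated
object resolution inside the record function.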

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
---
 hw/acpi/ghes-stub.c    |  7 ++++---
 hw/acpi/ghes.c         | 40 ++++++++++++----------------------------
 include/hw/acpi/ghes.h | 15 ++++++++-------
 target/arm/kvm.c       |  7 +++++--
 4 files changed, 29 insertions(+), 40 deletions(-)

diff --git a/hw/acpi/ghes-stub.c b/hw/acpi/ghes-stub.c
index 7cec1812dad9..40f660c246fe 100644
--- a/hw/acpi/ghes-stub.c
+++ b/hw/acpi/ghes-stub.c
@@ -11,12 +11,13 @@
 #include "qemu/osdep.h"
 #include "hw/acpi/ghes.h"
 
-int acpi_ghes_memory_errors(uint16_t source_id, uint64_t physical_address)
+int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
+                            uint64_t physical_address)
 {
     return -1;
 }
 
-bool acpi_ghes_present(void)
+AcpiGhesState *acpi_ghes_get_state(void)
 {
-    return false;
+    return NULL;
 }
diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
index dd93f0fc93fd..b25e61537c87 100644
--- a/hw/acpi/ghes.c
+++ b/hw/acpi/ghes.c
@@ -414,18 +414,12 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s,
         fw_cfg_add_file_callback(s, ACPI_HW_ERROR_ADDR_FW_CFG_FILE, NULL, NULL,
             NULL, &(ags->hw_error_le), sizeof(ags->hw_error_le), false);
     }
-
-    ags->present = true;
 }
 
 static void get_hw_error_offsets(uint64_t ghes_addr,
                                  uint64_t *cper_addr,
                                  uint64_t *read_ack_register_addr)
 {
-    if (!ghes_addr) {
-        return;
-    }
-
     /*
      * non-HEST version supports only one source, so no need to change
      * the start offset based on the source ID. Also, we can't validate
@@ -519,27 +513,17 @@ static void get_ghes_source_offsets(uint16_t source_id,
 NotifierList acpi_generic_error_notifiers =
     NOTIFIER_LIST_INITIALIZER(error_device_notifiers);
 
-void ghes_record_cper_errors(const void *cper, size_t len,
+void ghes_record_cper_errors(AcpiGhesState *ags, const void *cper, size_t len,
                              uint16_t source_id, Error **errp)
 {
     uint64_t cper_addr = 0, read_ack_register_addr = 0, read_ack_register;
-    AcpiGedState *acpi_ged_state;
-    AcpiGhesState *ags;
 
     if (len > ACPI_GHES_MAX_RAW_DATA_LENGTH) {
         error_setg(errp, "GHES CPER record is too big: %zd", len);
         return;
     }
 
-    acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
-                                                       NULL));
-    if (!acpi_ged_state) {
-        error_setg(errp, "Can't find ACPI_GED object");
-        return;
-    }
-    ags = &acpi_ged_state->ghes_state;
-
-    if (!ags->hest_addr_le) {
+    if (!ags->use_hest_addr) {
         get_hw_error_offsets(le64_to_cpu(ags->hw_error_le),
                              &cper_addr, &read_ack_register_addr);
     } else {
@@ -547,11 +531,6 @@ void ghes_record_cper_errors(const void *cper, size_t len,
                                 &cper_addr, &read_ack_register_addr, errp);
     }
 
-    if (!cper_addr) {
-        error_setg(errp, "can not find Generic Error Status Block");
-        return;
-    }
-
     cpu_physical_memory_read(read_ack_register_addr,
                              &read_ack_register, sizeof(read_ack_register));
 
@@ -577,7 +556,8 @@ void ghes_record_cper_errors(const void *cper, size_t len,
     notifier_list_notify(&acpi_generic_error_notifiers, NULL);
 }
 
-int acpi_ghes_memory_errors(uint16_t source_id, uint64_t physical_address)
+int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
+                            uint64_t physical_address)
 {
     /* Memory Error Section Type */
     const uint8_t guid[] =
@@ -603,7 +583,7 @@ int acpi_ghes_memory_errors(uint16_t source_id, uint64_t physical_address)
     acpi_ghes_build_append_mem_cper(block, physical_address);
 
     /* Report the error */
-    ghes_record_cper_errors(block->data, block->len, source_id, &errp);
+    ghes_record_cper_errors(ags, block->data, block->len, source_id, &errp);
 
     g_array_free(block, true);
 
@@ -615,7 +595,7 @@ int acpi_ghes_memory_errors(uint16_t source_id, uint64_t physical_address)
     return 0;
 }
 
-bool acpi_ghes_present(void)
+AcpiGhesState *acpi_ghes_get_state(void)
 {
     AcpiGedState *acpi_ged_state;
     AcpiGhesState *ags;
@@ -624,8 +604,12 @@ bool acpi_ghes_present(void)
                                                        NULL));
 
     if (!acpi_ged_state) {
-        return false;
+        return NULL;
     }
     ags = &acpi_ged_state->ghes_state;
-    return ags->present;
+
+    if (!ags->hw_error_le && !ags->hest_addr_le) {
+        return NULL;
+    }
+    return ags;
 }
diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h
index 80a0c3fcfaca..e1b66141d01c 100644
--- a/include/hw/acpi/ghes.h
+++ b/include/hw/acpi/ghes.h
@@ -63,7 +63,6 @@ enum AcpiGhesNotifyType {
 typedef struct AcpiGhesState {
     uint64_t hest_addr_le;
     uint64_t hw_error_le;
-    bool present; /* True if GHES is present at all on this board */
     bool use_hest_addr; /* True if HEST address is present */
 } AcpiGhesState;
 
@@ -87,15 +86,17 @@ void acpi_build_hest(AcpiGhesState *ags, GArray *table_data,
                      const char *oem_id, const char *oem_table_id);
 void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s,
                           GArray *hardware_errors);
-int acpi_ghes_memory_errors(uint16_t source_id, uint64_t error_physical_addr);
-void ghes_record_cper_errors(const void *cper, size_t len,
+int acpi_ghes_memory_errors(AcpiGhesState *ags, uint16_t source_id,
+                            uint64_t error_physical_addr);
+void ghes_record_cper_errors(AcpiGhesState *ags, const void *cper, size_t len,
                              uint16_t source_id, Error **errp);
 
 /**
- * acpi_ghes_present: Report whether ACPI GHES table is present
+ * acpi_ghes_get_state: Get a pointer for ACPI ghes state
  *
- * Returns: true if the system has an ACPI GHES table and it is
- * safe to call acpi_ghes_memory_errors() to record a memory error.
+ * Returns: a pointer to ghes state if the system has an ACPI GHES table,
+ * it is enabled and it is safe to call acpi_ghes_memory_errors() to record
+ * a memory error. Returns NULL otherwise.
  */
-bool acpi_ghes_present(void);
+AcpiGhesState *acpi_ghes_get_state(void);
 #endif
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index da30bdbb2349..80ca7779797b 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2366,10 +2366,12 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
 {
     ram_addr_t ram_addr;
     hwaddr paddr;
+    AcpiGhesState *ags;
 
     assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
 
-    if (acpi_ghes_present() && addr) {
+    ags = acpi_ghes_get_state();
+    if (ags && addr) {
         ram_addr = qemu_ram_addr_from_host(addr);
         if (ram_addr != RAM_ADDR_INVALID &&
             kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
@@ -2387,7 +2389,8 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
              */
             if (code == BUS_MCEERR_AR) {
                 kvm_cpu_synchronize_state(c);
-                if (!acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
+                if (!acpi_ghes_memory_errors(ags, ACPI_HEST_SRC_ID_SEA,
+                                             paddr)) {
                     kvm_inject_arm_sea(c);
                 } else {
                     error_report("failed to record the error");
-- 
2.48.1



* Re: [PATCH v3 08/14] acpi/ghes: Cleanup the code which gets ghes ged state
  2025-01-31 17:42 ` [PATCH v3 08/14] acpi/ghes: Cleanup the code which gets ghes ged state Mauro Carvalho Chehab
@ 2025-02-03 10:51   ` Jonathan Cameron
  0 siblings, 0 replies; 14+ messages in thread
From: Jonathan Cameron @ 2025-02-03 10:51 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Igor Mammedov, Michael S . Tsirkin, Shiju Jose, qemu-arm,
	qemu-devel, Ani Sinha, Dongjiu Geng, Paolo Bonzini, Peter Maydell,
	kvm, linux-kernel

On Fri, 31 Jan 2025 18:42:49 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Move the check logic into a common function and simplify the
> code which checks if GHES is enabled and was properly setup.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by:  Igor Mammedov <imammedo@redhat.com>

One minor comment inline on a change I think should be in an earlier
patch.

> -void ghes_record_cper_errors(const void *cper, size_t len,
> +void ghes_record_cper_errors(AcpiGhesState *ags, const void *cper, size_t len,
>                               uint16_t source_id, Error **errp)
>  {
>      uint64_t cper_addr = 0, read_ack_register_addr = 0, read_ack_register;
> -    AcpiGedState *acpi_ged_state;
> -    AcpiGhesState *ags;
>  
>      if (len > ACPI_GHES_MAX_RAW_DATA_LENGTH) {
>          error_setg(errp, "GHES CPER record is too big: %zd", len);
>          return;
>      }
>  
> -    acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED,
> -                                                       NULL));
> -    if (!acpi_ged_state) {
> -        error_setg(errp, "Can't find ACPI_GED object");
> -        return;
> -    }
> -    ags = &acpi_ged_state->ghes_state;
> -
> -    if (!ags->hest_addr_le) {
> +    if (!ags->use_hest_addr) {

Should this change be moved back to patch 3?  use_hest_addr was available
at that point and it would reduce churn a tiny bit.

>          get_hw_error_offsets(le64_to_cpu(ags->hw_error_le),
>                               &cper_addr, &read_ack_register_addr);
>      } else {
> @@ -547,11 +531,6 @@ void ghes_record_cper_errors(const void *cper, size_t len,
>                                  &cper_addr, &read_ack_register_addr, errp);
>      }
>  
> -    if (!cper_addr) {
> -        error_setg(errp, "can not find Generic Error Status Block");
> -        return;
> -    }
> -
>      cpu_physical_memory_read(read_ack_register_addr,
>                               &read_ack_register, sizeof(read_ack_register));
>  


* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
  2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
  2025-01-31 17:42 ` [PATCH v3 08/14] acpi/ghes: Cleanup the code which gets ghes ged state Mauro Carvalho Chehab
@ 2025-02-03 11:09 ` Jonathan Cameron
  2025-02-03 15:22   ` Igor Mammedov
  1 sibling, 1 reply; 14+ messages in thread
From: Jonathan Cameron @ 2025-02-03 11:09 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Igor Mammedov, Michael S . Tsirkin, Shiju Jose, qemu-arm,
	qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
	Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
	Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
	Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
	linux-kernel

On Fri, 31 Jan 2025 18:42:41 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Now that the ghes preparation patches were merged, let's add support
> for error injection.
> 
> On this series, the first 6 patches chang to the math used to calculate offsets at HEST
> table and hardware_error firmware file, together with its migration code. Migration tested
> with both latest QEMU released kernel and upstream, on both directions.
> 
> The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
>    to inject ARM Processor Error records.
> 
> If I'm counting well, this is the 19th submission of my error inject patches.

Looks good to me. All remaining trivial things are in the category
of things to consider only if you are doing another spin.  The code
ends up how I'd like it at the end of the series anyway, just
a question of the precise path to that state!

Jonathan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
  2025-02-03 11:09 ` [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Jonathan Cameron
@ 2025-02-03 15:22   ` Igor Mammedov
  2025-02-21  6:38     ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 14+ messages in thread
From: Igor Mammedov @ 2025-02-03 15:22 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Mauro Carvalho Chehab, Michael S . Tsirkin, Shiju Jose, qemu-arm,
	qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
	Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
	Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
	Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
	linux-kernel

On Mon, 3 Feb 2025 11:09:34 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> On Fri, 31 Jan 2025 18:42:41 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 
> > Now that the ghes preparation patches were merged, let's add support
> > for error injection.
> > 
> > On this series, the first 6 patches chang to the math used to calculate offsets at HEST
> > table and hardware_error firmware file, together with its migration code. Migration tested
> > with both latest QEMU released kernel and upstream, on both directions.
> > 
> > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> >    to inject ARM Processor Error records.
> > 
> > If I'm counting well, this is the 19th submission of my error inject patches.  
> 
> Looks good to me. All remaining trivial things are in the category
> of things to consider only if you are doing another spin.  The code
> ends up how I'd like it at the end of the series anyway, just
> a question of the precise path to that state!

If you look at the series as a whole it's more or less fine (I guess you
and I got used to it).

However, if you take it patch by patch (as if you've never seen it), the
ordering is messed up (the same would apply to everyone after a while,
once it's forgotten).

So I'd strongly suggest restructuring the series (especially 2-6/14).
To sum up my comments wrt ordering:

0. Add a test case for the HEST table with the current HEST as the
   expected blob (currently missing), so that we can be sure that we
   haven't messed up existing tables during refactoring.
1. Introduce use_hest_addr (disabled for now) so we can place all
   legacy code in the !use_hest_addr branch.
2. Then the patches that do the switch to HEST addr lookup:
    * ged lookup (preferably at the place it should end up eventually)
    * legacy bios_linker/fwcfg fencing patches
    * on top of that, the new hest bios_linker/fwcfg ones
    * and then the rest
    (everything that belongs to the 2nd error source should _not_ be a part of that)
3. Add the 2nd error source, including the necessary test procedures
   to introduce and update DSDT/HEST.



> 
> Jonathan
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
  2025-02-03 15:22   ` Igor Mammedov
@ 2025-02-21  6:38     ` Mauro Carvalho Chehab
  2025-02-21 10:21       ` Jonathan Cameron
  0 siblings, 1 reply; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-21  6:38 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Jonathan Cameron, Michael S . Tsirkin, Shiju Jose, qemu-arm,
	qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
	Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
	Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
	Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
	linux-kernel

Em Mon, 3 Feb 2025 16:22:36 +0100
Igor Mammedov <imammedo@redhat.com> escreveu:

> On Mon, 3 Feb 2025 11:09:34 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> > On Fri, 31 Jan 2025 18:42:41 +0100
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> >   
> > > Now that the ghes preparation patches were merged, let's add support
> > > for error injection.
> > > 
> > > On this series, the first 6 patches chang to the math used to calculate offsets at HEST
> > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > with both latest QEMU released kernel and upstream, on both directions.
> > > 
> > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > >    to inject ARM Processor Error records.
> > > 
> > > If I'm counting well, this is the 19th submission of my error inject patches.    
> > 
> > Looks good to me. All remaining trivial things are in the category
> > of things to consider only if you are doing another spin.  The code
> > ends up how I'd like it at the end of the series anyway, just
> > a question of the precise path to that state!  
> 
> if you look at series as a whole it's more or less fine (I guess you
> and me got used to it)
> 
> however if you take it patch by patch (as if you've never seen it)
> ordering is messed up (the same would apply to everyone after a while
> when it's forgotten)
> 
> So I'd strongly suggest to restructure the series (especially 2-6/14).
> re sum up my comments wrt ordering:
> 
> 0  add testcase for HEST table with current HEST as expected blob
>    (currently missing), so that we can be sure that we haven't messed
>    existing tables during refactoring.

Not sure I got this one. The HEST table is part of etc/acpi/tables,
which is already tested, as you pointed out in the previous reviews.
Changes there are already detected. That's basically why we added patches
10 and 12:

	[PATCH v3 10/14] tests/acpi: virt: allow acpi table changes for a new table: HEST
	[PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT

What the tests don't have is a check for the etc/hardware_errors firmware
file inside tests/data/acpi/aarch64/virt/, but, IMO, we shouldn't add it
there.

See, the hardware_errors table contains only some skeleton space to
store:

	- 1 or more error block address offsets;
	- 1 or more read ack registers;
	- 1 or more HEST source entries containing CPER blocks.

There's nothing there to be actually checked: it is just some
empty spaces with a variable number of fields.
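
As a rough illustration of that skeleton (a sketch only, not the QEMU
code): assuming each address offset and each read ack register takes one
8-byte slot, that the three regions are laid out back to back in the
order listed above, and that every source reserves a fixed-size CPER
block, the offsets for N sources could be computed as below. The
1024-byte block size and all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes, for illustration only. */
#define ADDR_SLOT_SIZE   sizeof(uint64_t)  /* one address or ack register */
#define CPER_BLOCK_SIZE  1024u             /* hypothetical CPER block size */

/* Hypothetical description of where each region of the blob starts. */
struct hw_err_layout {
    size_t err_block_addr_off;  /* start of the error block address array */
    size_t read_ack_off;        /* start of the read ack register array */
    size_t cper_off;            /* start of the CPER block storage */
    size_t total_size;          /* size of the whole blob */
};

/* Compute the region offsets for a blob holding num_sources sources. */
static struct hw_err_layout hw_err_layout_for(size_t num_sources)
{
    struct hw_err_layout l;

    assert(num_sources >= 1);
    l.err_block_addr_off = 0;
    l.read_ack_off = l.err_block_addr_off + num_sources * ADDR_SLOT_SIZE;
    l.cper_off = l.read_ack_off + num_sources * ADDR_SLOT_SIZE;
    l.total_size = l.cper_off + num_sources * CPER_BLOCK_SIZE;
    return l;
}
```

Under these assumptions one source gives a 1040-byte blob and two
sources give 2080 bytes, so every offset after the first array moves
whenever a source is added.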

With the new code, the actual number of CPER blocks and their
corresponding offsets and read ack registers can be different on 
different architectures. So, for instance, when we add x86 support,
we'll likely start with just one error source entry, while arm will
have two after this changeset.

Also, one possibility to address the issues reported by Gavin Shan at
https://lore.kernel.org/qemu-devel/20250214041635.608012-1-gshan@redhat.com/
would be to have one entry per CPU. So, the size of such firmware
could depend on the number of CPUs.

So, adding any validation to it would just cause pain and probably
won't detect any problems.

What could be done instead is to have a different type of test that
would use the error injection script to check whether regressions are
introduced after QEMU 10.0. Such a test would require this series to be
merged first. It would also require an OSPM image with some testing
tools on it. This is easier said than done: besides the complexity of
having an OSPM test image, such tests would require extra logic,
especially if they have to check regressions for SEA and other
notification sources.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
  2025-02-21  6:38     ` Mauro Carvalho Chehab
@ 2025-02-21 10:21       ` Jonathan Cameron
  2025-02-21 12:23         ` Mauro Carvalho Chehab
  2025-02-25 10:01         ` Igor Mammedov
  0 siblings, 2 replies; 14+ messages in thread
From: Jonathan Cameron @ 2025-02-21 10:21 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Igor Mammedov, Michael S . Tsirkin, Shiju Jose, qemu-arm,
	qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
	Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
	Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
	Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
	linux-kernel, Gavin Shan

On Fri, 21 Feb 2025 07:38:23 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Em Mon, 3 Feb 2025 16:22:36 +0100
> Igor Mammedov <imammedo@redhat.com> escreveu:
> 
> > On Mon, 3 Feb 2025 11:09:34 +0000
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >   
> > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > >     
> > > > Now that the ghes preparation patches were merged, let's add support
> > > > for error injection.
> > > > 
> > > > On this series, the first 6 patches chang to the math used to calculate offsets at HEST
> > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > 
> > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > >    to inject ARM Processor Error records.
> > > > 
> > > > If I'm counting well, this is the 19th submission of my error inject patches.      
> > > 
> > > Looks good to me. All remaining trivial things are in the category
> > > of things to consider only if you are doing another spin.  The code
> > > ends up how I'd like it at the end of the series anyway, just
> > > a question of the precise path to that state!    
> > 
> > if you look at series as a whole it's more or less fine (I guess you
> > and me got used to it)
> > 
> > however if you take it patch by patch (as if you've never seen it)
> > ordering is messed up (the same would apply to everyone after a while
> > when it's forgotten)
> > 
> > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > re sum up my comments wrt ordering:
> > 
> > 0  add testcase for HEST table with current HEST as expected blob
> >    (currently missing), so that we can be sure that we haven't messed
> >    existing tables during refactoring.  

To potentially save time I think Igor is asking that before you do anything
at all you plug the existing test hole which is that we don't test HEST
at all.   Even after this series I think we don't test HEST.  You add
a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
it should be replaced with the example data for the test.

That indeed doesn't address testing the error data storage which would be
a different problem.
> 
> Not sure if I got this one. The HEST table is part of etc/acpi/tables,
> which is already tested, as you pointed at the previous reviews. Doing
> changes there is already detected. That's basically why we added patches
> 10 and 12:
> 
> 	[PATCH v3 10/14] tests/acpi: virt: allow acpi table changes for a new table: HEST
> 	[PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT
> 
> What tests don't have is a check for etc/hardware_errors firmware inside 
> tests/data/acpi/aarch64/virt/, but, IMO, we shouldn't add it there.
> 
> See, hardware_errors table contains only some skeleton space to
> store:
> 
> 	- 1 or more error block address offsets;
> 	- 1 or more read ack register;
> 	- 1 or more HEST source entries containing CPER blocks.
> 
> There's nothing there to be actually checked: it is just some
> empty spaces with a variable number of fields.
> 
> With the new code, the actual number of CPER blocks and their
> corresponding offsets and read ack registers can be different on 
> different architectures. So, for instance, when we add x86 support,
> we'll likely start with just one error source entry, while arm will
> have two after this changeset.
> 
> Also, one possibility to address the issues reported by Gavin Shan at
> https://lore.kernel.org/qemu-devel/20250214041635.608012-1-gshan@redhat.com/
> would be to have one entry per each CPU. So, the size of such firmware
> could be dependent on the number of CPUs.
> 
> So, adding any validation to it would just cause pain and probably
> won't detect any problems.

If we did do this the test would use a fixed number of CPUs so
would just verify we didn't break a small number of variants. Useful
but to me a follow up to this series not something that needs to
be part of it - particularly as Gavin's work may well change that!

> 
> What could be done instead is to have a different type of tests that
> would use the error injection script to check if regressions are 
> introduced after QEMU 10.0. Such new kind of test would require
> this series to be merged first. It would also require the usage of
> an OSPM image with some testing tools on it. This is easier said 
> than done, as besides the complexity of having an OSPM test image,
> such kind of tests would require extra logic, specially if it would
> check regressions for SEA and other notification sources.
> 
Agreed that a more end to end test is even better, but those are
quite a bit more complex so definitely a follow up.

J
> Thanks,
> Mauro
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
  2025-02-21 10:21       ` Jonathan Cameron
@ 2025-02-21 12:23         ` Mauro Carvalho Chehab
  2025-02-21 15:05           ` Mauro Carvalho Chehab
  2025-02-25 10:01         ` Igor Mammedov
  1 sibling, 1 reply; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-21 12:23 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Igor Mammedov, Michael S . Tsirkin, Shiju Jose, qemu-arm,
	qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
	Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
	Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
	Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
	linux-kernel, Gavin Shan

Em Fri, 21 Feb 2025 10:21:27 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:

> On Fri, 21 Feb 2025 07:38:23 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 
> > Em Mon, 3 Feb 2025 16:22:36 +0100
> > Igor Mammedov <imammedo@redhat.com> escreveu:
> >   
> > > On Mon, 3 Feb 2025 11:09:34 +0000
> > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > >     
> > > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > >       
> > > > > Now that the ghes preparation patches were merged, let's add support
> > > > > for error injection.
> > > > > 
> > > > > On this series, the first 6 patches chang to the math used to calculate offsets at HEST
> > > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > > 
> > > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > >    to inject ARM Processor Error records.
> > > > > 
> > > > > If I'm counting well, this is the 19th submission of my error inject patches.        
> > > > 
> > > > Looks good to me. All remaining trivial things are in the category
> > > > of things to consider only if you are doing another spin.  The code
> > > > ends up how I'd like it at the end of the series anyway, just
> > > > a question of the precise path to that state!      
> > > 
> > > if you look at series as a whole it's more or less fine (I guess you
> > > and me got used to it)
> > > 
> > > however if you take it patch by patch (as if you've never seen it)
> > > ordering is messed up (the same would apply to everyone after a while
> > > when it's forgotten)
> > > 
> > > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > > re sum up my comments wrt ordering:
> > > 
> > > 0  add testcase for HEST table with current HEST as expected blob
> > >    (currently missing), so that we can be sure that we haven't messed
> > >    existing tables during refactoring.    
> 
> To potentially save time I think Igor is asking that before you do anything
> at all you plug the existing test hole which is that we don't test HEST
> at all.   Even after this series I think we don't test HEST. 

On a previous review (v2, I guess), Igor asked me to do the DSDT
test just before and after the patch which actually changes its
content (patch 11). The HEST table is inside DSDT firmware, and it is
already tested.

> You add
> a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
> it should be replaced with the example data for the test.

This was actually a misinterpretation on my side: patch 10 adds the
etc/hardware_errors table (mistakenly naming it HEST), but this
was never tested. For the next submission, I'll drop the
etc/hardware_errors table from patches 10 and 12.

> That indeed doesn't address testing the error data storage which would be
> a different problem.
> > 
> > Not sure if I got this one. The HEST table is part of etc/acpi/tables,
> > which is already tested, as you pointed at the previous reviews. Doing
> > changes there is already detected. That's basically why we added patches
> > 10 and 12:
> > 
> > 	[PATCH v3 10/14] tests/acpi: virt: allow acpi table changes for a new table: HEST
> > 	[PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT
> > 
> > What tests don't have is a check for etc/hardware_errors firmware inside 
> > tests/data/acpi/aarch64/virt/, but, IMO, we shouldn't add it there.
> > 
> > See, hardware_errors table contains only some skeleton space to
> > store:
> > 
> > 	- 1 or more error block address offsets;
> > 	- 1 or more read ack register;
> > 	- 1 or more HEST source entries containing CPER blocks.
> > 
> > There's nothing there to be actually checked: it is just some
> > empty spaces with a variable number of fields.
> > 
> > With the new code, the actual number of CPER blocks and their
> > corresponding offsets and read ack registers can be different on 
> > different architectures. So, for instance, when we add x86 support,
> > we'll likely start with just one error source entry, while arm will
> > have two after this changeset.
> > 
> > Also, one possibility to address the issues reported by Gavin Shan at
> > https://lore.kernel.org/qemu-devel/20250214041635.608012-1-gshan@redhat.com/
> > would be to have one entry per each CPU. So, the size of such firmware
> > could be dependent on the number of CPUs.
> > 
> > So, adding any validation to it would just cause pain and probably
> > won't detect any problems.  
> 
> If we did do this the test would use a fixed number of CPUs so
> would just verify we didn't break a small number of variants. Useful
> but to me a follow up to this series not something that needs to
> be part of it - particularly as Gavin's work may well change that!

I don't think that testing etc/hardware_errors would detect any
regressions. It will just create a test scenario that will require
constant changes, as adding any entry to HEST would hit it. 

Besides that, I don't think adding support for it would be a simple
matter of adding another table. See, after this series, there are two 
different scenarios for the /etc/hardware_errors:

- one with a single GHESv2 entry, for virt-9.2;
- another one with two GHESv2 entries for virt-10.0 and above that
  will dynamically change its size (starting from 2) depending on
  the features we add, and if we'll have one entry per CPU or not.

Right now, the tests there are only for "virt-latest": there's no
test directory for "virt-9.2". Adding support for virt-legacy will
very likely require lots of changes to the test infrastructure,
as it will require some virt migration support.

> > What could be done instead is to have a different type of tests that
> > would use the error injection script to check if regressions are 
> > introduced after QEMU 10.0. Such new kind of test would require
> > this series to be merged first. It would also require the usage of
> > an OSPM image with some testing tools on it. This is easier said 
> > than done, as besides the complexity of having an OSPM test image,
> > such kind of tests would require extra logic, specially if it would
> > check regressions for SEA and other notification sources.
> >   
> Agreed that a more end to end test is even better, but those are
> quite a bit more complex so definitely a follow up.

Yes, but it could be simpler than modifying ACPI tests to handle
migration.

The way I see it, that kind of integration could be done by some
gitlab workflow that would run an error injection script inside a
pre-defined image emulating both virt-9.2 and virt-latest, checking
whether the HEST tables were properly generated for both SEA and GED
sources.

This is probably easier for GED, as the QMP interface already
detects that the read ack register was changed by the OSPM. For
SEA, it may require either some additional instrumentation or to
capture OSPM logs.

Anyway, either way, a change like that is IMO outside the scope of
this series, as it will require lots of unrelated changes.

Regards,
Mauro


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
  2025-02-21 12:23         ` Mauro Carvalho Chehab
@ 2025-02-21 15:05           ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-21 15:05 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Igor Mammedov, Michael S . Tsirkin, Shiju Jose, qemu-arm,
	qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
	Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
	Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
	Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
	linux-kernel, Gavin Shan

Em Fri, 21 Feb 2025 13:23:06 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> escreveu:

> Em Fri, 21 Feb 2025 10:21:27 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:
> 
> > On Fri, 21 Feb 2025 07:38:23 +0100
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> >   
> > > Em Mon, 3 Feb 2025 16:22:36 +0100
> > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > >     
> > > > On Mon, 3 Feb 2025 11:09:34 +0000
> > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > > >       
> > > > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > > >         
> > > > > > Now that the ghes preparation patches were merged, let's add support
> > > > > > for error injection.
> > > > > > 
> > > > > > On this series, the first 6 patches chang to the math used to calculate offsets at HEST
> > > > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > > > 
> > > > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > > >    to inject ARM Processor Error records.
> > > > > > 
> > > > > > If I'm counting well, this is the 19th submission of my error inject patches.          
> > > > > 
> > > > > Looks good to me. All remaining trivial things are in the category
> > > > > of things to consider only if you are doing another spin.  The code
> > > > > ends up how I'd like it at the end of the series anyway, just
> > > > > a question of the precise path to that state!        
> > > > 
> > > > if you look at series as a whole it's more or less fine (I guess you
> > > > and me got used to it)
> > > > 
> > > > however if you take it patch by patch (as if you've never seen it)
> > > > ordering is messed up (the same would apply to everyone after a while
> > > > when it's forgotten)
> > > > 
> > > > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > > > re sum up my comments wrt ordering:
> > > > 
> > > > 0  add testcase for HEST table with current HEST as expected blob
> > > >    (currently missing), so that we can be sure that we haven't messed
> > > >    existing tables during refactoring.      
> > 
> > To potentially save time I think Igor is asking that before you do anything
> > at all you plug the existing test hole which is that we don't test HEST
> > at all.   Even after this series I think we don't test HEST.   
> 
> On a previous review (v2, I guess), Igor requested me to do the DSDT
> test just before and after the patch which is actually changing its
> content (patch 11). The HEST table is inside DSDT firmware, and it is
> already tested.
> 
> > You add
> > a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
> > it should be replaced with the example data for the test.  
> 
> This was actually a misinterpretation from my side: patch 10 adds the
> etc/hardware_errors table (mistakenly naming it as HEST), but this
> was never tested. For the next submission, I'll drop etc/hardware_errors
> table from patches 10 and 12.
> 
> > That indeed doesn't address testing the error data storage which would be
> > a different problem.  
> > > 
> > > Not sure if I got this one. The HEST table is part of etc/acpi/tables,
> > > which is already tested, as you pointed at the previous reviews. Doing
> > > changes there is already detected. That's basically why we added patches
> > > 10 and 12:
> > > 
> > > 	[PATCH v3 10/14] tests/acpi: virt: allow acpi table changes for a new table: HEST
> > > 	[PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT
> > > 
> > > What tests don't have is a check for etc/hardware_errors firmware inside 
> > > tests/data/acpi/aarch64/virt/, but, IMO, we shouldn't add it there.
> > > 
> > > See, hardware_errors table contains only some skeleton space to
> > > store:
> > > 
> > > 	- 1 or more error block address offsets;
> > > 	- 1 or more read ack register;
> > > 	- 1 or more HEST source entries containing CPER blocks.
> > > 
> > > There's nothing there to be actually checked: it is just some
> > > empty spaces with a variable number of fields.
> > > 
> > > With the new code, the actual number of CPER blocks and their
> > > corresponding offsets and read ack registers can be different on 
> > > different architectures. So, for instance, when we add x86 support,
> > > we'll likely start with just one error source entry, while arm will
> > > have two after this changeset.
> > > 
> > > Also, one possibility to address the issues reported by Gavin Shan at
> > > https://lore.kernel.org/qemu-devel/20250214041635.608012-1-gshan@redhat.com/
> > > would be to have one entry per each CPU. So, the size of such firmware
> > > could be dependent on the number of CPUs.
> > > 
> > > So, adding any validation to it would just cause pain and probably
> > > won't detect any problems.    
> > 
> > If we did do this the test would use a fixed number of CPUs so
> > would just verify we didn't break a small number of variants. Useful
> > but to me a follow up to this series not something that needs to
> > be part of it - particularly as Gavin's work may well change that!  
> 
> I don't think that testing etc/hardware_errors would detect any
> regressions. It will just create a test scenario that will require
> constant changes, as adding any entry to HEST would hit it. 

Btw, there is just one patch in this series touching
etc/hardware_errors:

	https://lore.kernel.org/qemu-devel/647f9c974e606924b6b881a83e047d1d4dff47d5.1740148260.git.mchehab+huawei@kernel.org/T/#u

The table change is due to this simple hunk:

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 4f174795ed60..7b6e90d69298 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -896,6 +896,7 @@ static void acpi_align_size(GArray *blob, unsigned align)
 
 static const AcpiNotificationSourceId hest_ghes_notify[] = {
     { ACPI_HEST_SRC_ID_SYNC, ACPI_GHES_NOTIFY_SEA },
+    { ACPI_HEST_SRC_ID_QMP, ACPI_GHES_NOTIFY_GPIO },
 };


Before that patch, /etc/hardware_errors has:

	- 1 error block offset;
	- 1 ack register;
	- 1 GHESv2 entry for SEA

After the change:

- for virt-9.2: nothing changes, as hw/arm/virt-acpi-build.c will
  use the backward-compatible table with a single entry to be
  added to HEST:

	static const AcpiNotificationSourceId hest_ghes_notify_9_2[] = {
	    { ACPI_HEST_SRC_ID_SYNC, ACPI_GHES_NOTIFY_SEA },
	};

- for virt-latest/virt-10.0, it will use the new table to create two
  sources:

	static const AcpiNotificationSourceId hest_ghes_notify[] = {
	    { ACPI_HEST_SRC_ID_SYNC, ACPI_GHES_NOTIFY_SEA },
	    { ACPI_HEST_SRC_ID_QMP, ACPI_GHES_NOTIFY_GPIO },
	};

  which will actually mean that /etc/hardware_errors will now have:

	- 2 error block offsets (one for SEA, one for GED);
	- 2 ack registers (one for SEA, one for GED);
	- 1 GHESv2 entry for SEA notifier;
	- 1 GHESv2 entry for GED GPIO notifier.

Following the discussions with Gavin, for virt-10.0 and above, we may end
up changing the new table (hest_ghes_notify) to have one SEA entry per
CPU, plus the GPIO one, and adding extra logic to the error injection
code to select the SEA entry based on the CPU ID and/or on having an
already acked SEA notifier.
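
To make the per-machine-version selection concrete, here is a small
standalone sketch. The enum and struct names are simplified stand-ins
for the identifiers in the hunks above, and pick_notify_table() is a
hypothetical helper, not a function from the series:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the source IDs and notification types. */
enum src_id { SRC_ID_SYNC, SRC_ID_QMP };
enum notify_type { NOTIFY_SEA, NOTIFY_GPIO };

struct notify_source {
    enum src_id id;
    enum notify_type notify;
};

/* Backward-compatible table: a single SEA source. */
static const struct notify_source notify_9_2[] = {
    { SRC_ID_SYNC, NOTIFY_SEA },
};

/* Newer table: SEA plus the GPIO source used for injection. */
static const struct notify_source notify_latest[] = {
    { SRC_ID_SYNC, NOTIFY_SEA },
    { SRC_ID_QMP, NOTIFY_GPIO },
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Store the table for the machine version in *table, return its length. */
static size_t pick_notify_table(bool legacy_9_2,
                                const struct notify_source **table)
{
    assert(table != NULL);

    if (legacy_9_2) {
        *table = notify_9_2;
        return ARRAY_LEN(notify_9_2);
    }
    *table = notify_latest;
    return ARRAY_LEN(notify_latest);
}
```

The returned length is the number of error block offsets, ack registers
and GHESv2 entries that end up in /etc/hardware_errors for that machine
version: one for the legacy table, two for the new one.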
 
> 
> Besides that, I don't think adding support for it would be a simple
> matter of adding another table. See, after this series, there are two 
> different scenarios for the /etc/hardware_errors:
> 
> - one with a single GHESv2 entry, for virt-9.2;
> - another one with two GHESv2 entries for virt-10.0 and above that
>   will dynamically change its size (starting from 2) depending on
>   the features we add, and if we'll have one entry per CPU or not.
> 
> Right now, the tests there are only for "virt-latest": there's no
> test directory for "virt-9.2". Adding support for virt-legacy will 
> very likely require lots of changes there at the test infrastructure,
> as it will require some virt migration support. 
> 
> > > What could be done instead is to have a different type of tests that
> > > would use the error injection script to check if regressions are 
> > > introduced after QEMU 10.0. Such new kind of test would require
> > > this series to be merged first. It would also require the usage of
> > > an OSPM image with some testing tools on it. This is easier said 
> > > than done, as besides the complexity of having an OSPM test image,
> > > such kind of tests would require extra logic, specially if it would
> > > check regressions for SEA and other notification sources.
> > >     
> > Agreed that a more end to end test is even better, but those are
> > quite a bit more complex so definitely a follow up.  
> 
> Yes, but it could be simpler than modifying ACPI tests to handle
> migration.
> 
> The way I see is that such kind of integration could be done by some
> gitlab workflow that would run an error injection script inside a
> pre-defined image emulating both virt-9.2 and virt-latest and checking
> if the HEST tables were properly generated for both SEA and GED
> sources.
> 
> This is probably easier for GED, as the QMP interface already
> detects that the read ack register was changed by the OSPM. For
> SEA, it may require either some additional instrumentation or to
> capture OSPM logs.
> 
> Anyway, either way, a change like that is IMO outside the scope of
> this series, as it will require lots of unrelated changes.
> 
> Regards,
> Mauro
> 



Thanks,
Mauro

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
  2025-02-21 10:21       ` Jonathan Cameron
  2025-02-21 12:23         ` Mauro Carvalho Chehab
@ 2025-02-25 10:01         ` Igor Mammedov
  2025-02-26  9:56           ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 14+ messages in thread
From: Igor Mammedov @ 2025-02-25 10:01 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Mauro Carvalho Chehab, Michael S . Tsirkin, Shiju Jose, qemu-arm,
	qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
	Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
	Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
	Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
	linux-kernel, Gavin Shan, Ani Sinha

On Fri, 21 Feb 2025 10:21:27 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> On Fri, 21 Feb 2025 07:38:23 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 
> > Em Mon, 3 Feb 2025 16:22:36 +0100
> > Igor Mammedov <imammedo@redhat.com> escreveu:
> >   
> > > On Mon, 3 Feb 2025 11:09:34 +0000
> > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > >     
> > > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > >       
> > > > > Now that the ghes preparation patches were merged, let's add support
> > > > > for error injection.
> > > > > 
> > > > > > In this series, the first 6 patches change the math used to calculate offsets at HEST
> > > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > > 
> > > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > >    to inject ARM Processor Error records.
> > > > > 
> > > > > If I'm counting well, this is the 19th submission of my error inject patches.        
> > > > 
> > > > Looks good to me. All remaining trivial things are in the category
> > > > of things to consider only if you are doing another spin.  The code
> > > > ends up how I'd like it at the end of the series anyway, just
> > > > a question of the precise path to that state!      
> > > 
> > > if you look at series as a whole it's more or less fine (I guess you
> > > and me got used to it)
> > > 
> > > however if you take it patch by patch (as if you've never seen it)
> > > ordering is messed up (the same would apply to everyone after a while
> > > when it's forgotten)
> > > 
> > > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > > re sum up my comments wrt ordering:
> > > 
> > > 0  add testcase for HEST table with current HEST as expected blob
> > >    (currently missing), so that we can be sure that we haven't messed
> > >    existing tables during refactoring.    
> 
> To potentially save time I think Igor is asking that before you do anything
> at all you plug the existing test hole which is that we don't test HEST
> at all.   Even after this series I think we don't test HEST.  You add
> a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
> it should be replaced with the example data for the test.

that's what I was saying.
HEST table should be in DSDT, but it's optional and one has to use
'ras=on' option to enable that, which we aren't doing ATM.
So whatever changes are happening we aren't seeing them in tests
nor will we see any regression for the same reason.

While whitelisting the tables before the change and then updating them
is the right thing to do, it's not sufficient, since none of the tests
run with 'ras' enabled; hence the code is not actually executed.

> 
> That indeed doesn't address testing the error data storage which would be
> a different problem.

I'd skip hardware_errors/CPER testing in QEMU unit tests.
Testing that basically requires a functioning 'APEI driver'.

Maybe we can use Ani's framework to parse HEST and all the way
towards CPER record(s) traversal, but that's certainly out of
scope of this series.
It could be done on top, but I won't insist even on that
since Mauro's out of tree error injection testing will
cover that using actual guest (which I assume he would
like to run periodically).

> > 
> > Not sure if I got this one. The HEST table is part of etc/acpi/tables,
> > which is already tested, as you pointed at the previous reviews. Doing
> > changes there is already detected. That's basically why we added patches
> > 10 and 12:
> > 
> > 	[PATCH v3 10/14] tests/acpi: virt: allow acpi table changes for a new table: HEST
> > 	[PATCH v3 12/14] tests/acpi: virt: add a HEST table to aarch64 virt and update DSDT
> > 
> > What tests don't have is a check for etc/hardware_errors firmware inside 
> > tests/data/acpi/aarch64/virt/, but, IMO, we shouldn't add it there.
> > 
> > See, hardware_errors table contains only some skeleton space to
> > store:
> > 
> > 	- 1 or more error block address offsets;
> > 	- 1 or more read ack register;
> > 	- 1 or more HEST source entries containing CPER blocks.
> > 
> > There's nothing there to be actually checked: it is just some
> > empty spaces with a variable number of fields.
> > 
> > With the new code, the actual number of CPER blocks and their
> > corresponding offsets and read ack registers can be different on 
> > different architectures. So, for instance, when we add x86 support,
> > we'll likely start with just one error source entry, while arm will
> > have two after this changeset.
> > 
> > Also, one possibility to address the issues reported by Gavin Shan at
> > https://lore.kernel.org/qemu-devel/20250214041635.608012-1-gshan@redhat.com/
> > would be to have one entry per each CPU. So, the size of such firmware
> > could be dependent on the number of CPUs.
> > 
> > So, adding any validation to it would just cause pain and probably
> > won't detect any problems.  
> 
> If we did do this the test would use a fixed number of CPUs so
> would just verify we didn't break a small number of variants. Useful
> but to me a follow up to this series not something that needs to
> be part of it - particularly as Gavin's work may well change that!
> 
> > 
> > What could be done instead is to have a different type of tests that
> > would use the error injection script to check if regressions are 
> > introduced after QEMU 10.0. Such new kind of test would require
> > this series to be merged first. It would also require the usage of
> > an OSPM image with some testing tools on it. This is easier said 
> > than done, as besides the complexity of having an OSPM test image,
> > such kind of tests would require extra logic, specially if it would
> > check regressions for SEA and other notification sources.
> >   
> Agreed that a more end to end test is even better, but those are
> quite a bit more complex so definitely a follow up.
> 
> J
> > Thanks,
> > Mauro
> >   
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
  2025-02-25 10:01         ` Igor Mammedov
@ 2025-02-26  9:56           ` Mauro Carvalho Chehab
  2025-02-26 11:23             ` Mauro Carvalho Chehab
  2025-02-26 12:29             ` Igor Mammedov
  0 siblings, 2 replies; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-26  9:56 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Jonathan Cameron, Michael S . Tsirkin, Shiju Jose, qemu-arm,
	qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
	Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
	Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
	Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
	linux-kernel, Gavin Shan

Em Tue, 25 Feb 2025 11:01:15 +0100
Igor Mammedov <imammedo@redhat.com> escreveu:

> On Fri, 21 Feb 2025 10:21:27 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> > On Fri, 21 Feb 2025 07:38:23 +0100
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> >   
> > > Em Mon, 3 Feb 2025 16:22:36 +0100
> > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > >     
> > > > On Mon, 3 Feb 2025 11:09:34 +0000
> > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > > >       
> > > > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > > >         
> > > > > > Now that the ghes preparation patches were merged, let's add support
> > > > > > for error injection.
> > > > > > 
> > > > > > > In this series, the first 6 patches change the math used to calculate offsets at HEST
> > > > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > > > 
> > > > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > > >    to inject ARM Processor Error records.
> > > > > > 
> > > > > > If I'm counting well, this is the 19th submission of my error inject patches.          
> > > > > 
> > > > > Looks good to me. All remaining trivial things are in the category
> > > > > of things to consider only if you are doing another spin.  The code
> > > > > ends up how I'd like it at the end of the series anyway, just
> > > > > a question of the precise path to that state!        
> > > > 
> > > > if you look at series as a whole it's more or less fine (I guess you
> > > > and me got used to it)
> > > > 
> > > > however if you take it patch by patch (as if you've never seen it)
> > > > ordering is messed up (the same would apply to everyone after a while
> > > > when it's forgotten)
> > > > 
> > > > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > > > re sum up my comments wrt ordering:
> > > > 
> > > > 0  add testcase for HEST table with current HEST as expected blob
> > > >    (currently missing), so that we can be sure that we haven't messed
> > > >    existing tables during refactoring.      
> > 
> > To potentially save time I think Igor is asking that before you do anything
> > at all you plug the existing test hole which is that we don't test HEST
> > at all.   Even after this series I think we don't test HEST.  You add
> > a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
> > it should be replaced with the example data for the test.  
> 
> that's what I was saying.
> HEST table should be in DSDT, but it's optional and one has to use
> 'ras=on' option to enable that, which we aren't doing ATM.
> So whatever changes are happening we aren't seeing them in tests
> nor will we see any regression for the same reason.
> 
> While white listing tables before change should happen and then updating them
> is the right thing to do, it's not sufficient since none of tests
> run with 'ras' enabled, hence code is not actually executed. 

Ok. Well, again, we're not modifying the HEST table structure in this
changeset. The only change affecting HEST is that the number of entries
increases from 1 to 2.

Now, looking at bios-tables-test.c, if I got it right, I should be doing
something similar to the enclosed patch, right?

If so, I have a couple of questions:

1. Where should I get the HEST table from? By dumping it from the
   running VM?

2. what values should I use to fill those variables:

	int hest_offset = 40 /* HEST */;
	int hest_entry_size = 4;


> 
> > 
> > That indeed doesn't address testing the error data storage which would be
> > a different problem.  
> 
> I'd skip hardware_errors/CPER testing from QEMU unit tests.
> That's basically requires functioning 'APEI driver' to test that.
> 
> Maybe we can use Ani's framework to parse HEST and all the way
> towards CPER record(s) traversal, but that's certainly out of
> scope of this series.
> It could be done on top, but I won't insist even on that
> since Mauro's out of tree error injection testing will
> cover that using actual guest (which I assume he would
> like to run periodically).

Yeah, my plan is to test it periodically. I intend to set up a CI
somewhere to test the kernel, QEMU and rasdaemon all together.

Thanks,
Mauro

---

diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
index 0a333ec43536..31e69d906db4 100644
--- a/tests/qtest/bios-tables-test.c
+++ b/tests/qtest/bios-tables-test.c
@@ -210,6 +210,8 @@ static void test_acpi_fadt_table(test_data *data)
     uint32_t val;
     int dsdt_offset = 40 /* DSDT */;
     int dsdt_entry_size = 4;
+    int hest_offset = 40 /* HEST */;
+    int hest_entry_size = 4;
 
     g_assert(compare_signature(&table, "FACP"));
 
@@ -242,6 +244,12 @@ static void test_acpi_fadt_table(test_data *data)
     /* update checksum */
     fadt_aml[9 /* Checksum */] = 0;
     fadt_aml[9 /* Checksum */] -= acpi_calc_checksum(fadt_aml, fadt_len);
+
+
+
+    acpi_fetch_table(data->qts, &table.aml, &table.aml_len,
+                     fadt_aml + hest_offset, hest_entry_size, "HEST", true);
+    g_array_append_val(data->tables, table);
 }
 
 static void dump_aml_files(test_data *data, bool rebuild)
@@ -2411,7 +2419,7 @@ static void test_acpi_aarch64_virt_oem_fields(void)
     };
     char *args;
 
-    args = test_acpi_create_args(&data, "-cpu cortex-a57 "OEM_TEST_ARGS);
+    args = test_acpi_create_args(&data, "-machine ras=on -cpu cortex-a57 "OEM_TEST_ARGS);
     data.qts = qtest_init(args);
     test_acpi_load_tables(&data);
     test_oem_fields(&data);


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
  2025-02-26  9:56           ` Mauro Carvalho Chehab
@ 2025-02-26 11:23             ` Mauro Carvalho Chehab
  2025-02-26 11:31               ` Mauro Carvalho Chehab
  2025-02-26 12:29             ` Igor Mammedov
  1 sibling, 1 reply; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-26 11:23 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Jonathan Cameron, Michael S . Tsirkin, Shiju Jose, qemu-arm,
	qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
	Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
	Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
	Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
	linux-kernel, Gavin Shan

Em Wed, 26 Feb 2025 10:56:28 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> escreveu:

> Em Tue, 25 Feb 2025 11:01:15 +0100
> Igor Mammedov <imammedo@redhat.com> escreveu:
> 
> > On Fri, 21 Feb 2025 10:21:27 +0000
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >   
> > > On Fri, 21 Feb 2025 07:38:23 +0100
> > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > >     
> > > > Em Mon, 3 Feb 2025 16:22:36 +0100
> > > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > > >       
> > > > > On Mon, 3 Feb 2025 11:09:34 +0000
> > > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > > > >         
> > > > > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > > > >           
> > > > > > > Now that the ghes preparation patches were merged, let's add support
> > > > > > > for error injection.
> > > > > > > 
> > > > > > > > In this series, the first 6 patches change the math used to calculate offsets at HEST
> > > > > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > > > > 
> > > > > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > > > >    to inject ARM Processor Error records.
> > > > > > > 
> > > > > > > If I'm counting well, this is the 19th submission of my error inject patches.            
> > > > > > 
> > > > > > Looks good to me. All remaining trivial things are in the category
> > > > > > of things to consider only if you are doing another spin.  The code
> > > > > > ends up how I'd like it at the end of the series anyway, just
> > > > > > a question of the precise path to that state!          
> > > > > 
> > > > > if you look at series as a whole it's more or less fine (I guess you
> > > > > and me got used to it)
> > > > > 
> > > > > however if you take it patch by patch (as if you've never seen it)
> > > > > ordering is messed up (the same would apply to everyone after a while
> > > > > when it's forgotten)
> > > > > 
> > > > > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > > > > re sum up my comments wrt ordering:
> > > > > 
> > > > > 0  add testcase for HEST table with current HEST as expected blob
> > > > >    (currently missing), so that we can be sure that we haven't messed
> > > > >    existing tables during refactoring.        
> > > 
> > > To potentially save time I think Igor is asking that before you do anything
> > > at all you plug the existing test hole which is that we don't test HEST
> > > at all.   Even after this series I think we don't test HEST.  You add
> > > a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
> > > it should be replaced with the example data for the test.    
> > 
> > that's what I was saying.
> > HEST table should be in DSDT, but it's optional and one has to use
> > 'ras=on' option to enable that, which we aren't doing ATM.
> > So whatever changes are happening we aren't seeing them in tests
> > nor will we see any regression for the same reason.
> > 
> > While white listing tables before change should happen and then updating them
> > is the right thing to do, it's not sufficient since none of tests
> > run with 'ras' enabled, hence code is not actually executed.   
> 
> Ok. Well, again we're not modifying HEST table structure on this
> changeset. The only change affecting HEST is when the number of entries
> increased from 1 to 2.
> 
> Now, looking at bios-tables-test.c, if I got it right, I should be doing
> something similar to the enclosed patch, right?
> 
> If so, I have a couple of questions:
> 
> 1. from where should I get the HEST table? dumping the table from the
>    running VM?
> 
> 2. what values should I use to fill those variables:
> 
> 	int hest_offset = 40 /* HEST */;
> 	int hest_entry_size = 4;

Thanks,
Mauro

For reference, this is the HEST table before the patch series:

/*
 * Intel ACPI Component Architecture
 * AML/ASL+ Disassembler version 20240927 (64-bit version)
 * Copyright (c) 2000 - 2023 Intel Corporation
 * 
 * Disassembly of hest.dat
 *
 * ACPI Data Table [HEST]
 *
 * Format: [HexOffset DecimalOffset ByteLength]  FieldName : FieldValue (in hex)
 */

[000h 0000 004h]                   Signature : "HEST"    [Hardware Error Source Table]
[004h 0004 004h]                Table Length : 00000084
[008h 0008 001h]                    Revision : 01
[009h 0009 001h]                    Checksum : E0
[00Ah 0010 006h]                      Oem ID : "BOCHS "
[010h 0016 008h]                Oem Table ID : "BXPC    "
[018h 0024 004h]                Oem Revision : 00000001
[01Ch 0028 004h]             Asl Compiler ID : "BXPC"
[020h 0032 004h]       Asl Compiler Revision : 00000001

[024h 0036 004h]          Error Source Count : 00000001

[028h 0040 002h]               Subtable Type : 000A [Generic Hardware Error Source V2]
[02Ah 0042 002h]                   Source Id : 0000
[02Ch 0044 002h]           Related Source Id : FFFF
[02Eh 0046 001h]                    Reserved : 00
[02Fh 0047 001h]                     Enabled : 01
[030h 0048 004h]      Records To Preallocate : 00000001
[034h 0052 004h]     Max Sections Per Record : 00000001
[038h 0056 004h]         Max Raw Data Length : 00000400

[03Ch 0060 00Ch]        Error Status Address : [Generic Address Structure]
[03Ch 0060 001h]                    Space ID : 00 [SystemMemory]
[03Dh 0061 001h]                   Bit Width : 40
[03Eh 0062 001h]                  Bit Offset : 00
[03Fh 0063 001h]        Encoded Access Width : 04 [QWord Access:64]
[040h 0064 008h]                     Address : 0000000139E40000

[048h 0072 01Ch]                      Notify : [Hardware Error Notification Structure]
[048h 0072 001h]                 Notify Type : 08 [SEA]
[049h 0073 001h]               Notify Length : 1C
[04Ah 0074 002h]  Configuration Write Enable : 0000
[04Ch 0076 004h]                PollInterval : 00000000
[050h 0080 004h]                      Vector : 00000000
[054h 0084 004h]     Polling Threshold Value : 00000000
[058h 0088 004h]    Polling Threshold Window : 00000000
[05Ch 0092 004h]       Error Threshold Value : 00000000
[060h 0096 004h]      Error Threshold Window : 00000000

[064h 0100 004h]   Error Status Block Length : 00000400
[068h 0104 00Ch]           Read Ack Register : [Generic Address Structure]
[068h 0104 001h]                    Space ID : 00 [SystemMemory]
[069h 0105 001h]                   Bit Width : 40
[06Ah 0106 001h]                  Bit Offset : 00
[06Bh 0107 001h]        Encoded Access Width : 04 [QWord Access:64]
[06Ch 0108 008h]                     Address : 0000000139E40008

[074h 0116 008h]           Read Ack Preserve : FFFFFFFFFFFFFFFE
[07Ch 0124 008h]              Read Ack Write : 0000000000000001

Raw Table Data: Length 132 (0x84)

    0000: 48 45 53 54 84 00 00 00 01 E0 42 4F 43 48 53 20  // HEST......BOCHS 
    0010: 42 58 50 43 20 20 20 20 01 00 00 00 42 58 50 43  // BXPC    ....BXPC
    0020: 01 00 00 00 01 00 00 00 0A 00 00 00 FF FF 00 01  // ................
    0030: 01 00 00 00 01 00 00 00 00 04 00 00 00 40 00 04  // .............@..
    0040: 00 00 E4 39 01 00 00 00 08 1C 00 00 00 00 00 00  // ...9............
    0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  // ................
    0060: 00 00 00 00 00 04 00 00 00 40 00 04 08 00 E4 39  // .........@.....9
    0070: 01 00 00 00 FE FF FF FF FF FF FF FF 01 00 00 00  // ................
    0080: 00 00 00 00                                      // ....
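
The dump above can be sanity-checked offline. What follows is only a
minimal Python sketch, assuming the bytes are copied by hand from the
"Raw Table Data" lines; the helper names (parse_hexdump, hest_summary)
are made up for this illustration and are not part of QEMU or of
bios-tables-test.c. The field offsets are the ones printed in the
disassembly.

```python
import struct

# Raw bytes copied from the "Raw Table Data" section of the dump above.
HEST_BEFORE = """
0000: 48 45 53 54 84 00 00 00 01 E0 42 4F 43 48 53 20
0010: 42 58 50 43 20 20 20 20 01 00 00 00 42 58 50 43
0020: 01 00 00 00 01 00 00 00 0A 00 00 00 FF FF 00 01
0030: 01 00 00 00 01 00 00 00 00 04 00 00 00 40 00 04
0040: 00 00 E4 39 01 00 00 00 08 1C 00 00 00 00 00 00
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0060: 00 00 00 00 00 04 00 00 00 40 00 04 08 00 E4 39
0070: 01 00 00 00 FE FF FF FF FF FF FF FF 01 00 00 00
0080: 00 00 00 00
"""


def parse_hexdump(text):
    """Turn 'offset: xx xx ...' lines into bytes (hypothetical helper)."""
    out = bytearray()
    for line in text.strip().splitlines():
        # Drop the 'offset:' prefix and decode the remaining hex pairs.
        _, _, payload = line.partition(":")
        out += bytes.fromhex(payload)
    return bytes(out)


def hest_summary(table):
    """Decode the few HEST fields discussed in this thread."""
    (length,) = struct.unpack_from("<I", table, 4)
    (source_count,) = struct.unpack_from("<I", table, 36)
    # The first error source entry starts at offset 40 (0x28).
    (subtable_type,) = struct.unpack_from("<H", table, 40)
    return {
        "signature": table[0:4].decode("ascii"),
        "length": length,
        # All bytes of a valid ACPI table sum to zero modulo 256.
        "checksum_ok": sum(table) % 256 == 0,
        "source_count": source_count,
        "subtable_type": subtable_type,
        # Notify Type of the first entry lives at offset 72 (0x48).
        "notify_type": table[72],
    }


table = parse_hexdump(HEST_BEFORE)
summary = hest_summary(table)
print(summary)
```

A check like this only validates the table blob itself; it says nothing
about whether the OSPM can actually consume the error records.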




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
  2025-02-26 11:23             ` Mauro Carvalho Chehab
@ 2025-02-26 11:31               ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2025-02-26 11:31 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Jonathan Cameron, Michael S . Tsirkin, Shiju Jose, qemu-arm,
	qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
	Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
	Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
	Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
	linux-kernel, Gavin Shan

Em Wed, 26 Feb 2025 12:23:03 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> escreveu:

> Em Wed, 26 Feb 2025 10:56:28 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> escreveu:
> 
> > Em Tue, 25 Feb 2025 11:01:15 +0100
> > Igor Mammedov <imammedo@redhat.com> escreveu:
> >   
> > > On Fri, 21 Feb 2025 10:21:27 +0000
> > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > >     
> > > > On Fri, 21 Feb 2025 07:38:23 +0100
> > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > >       
> > > > > Em Mon, 3 Feb 2025 16:22:36 +0100
> > > > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > > > >         
> > > > > > On Mon, 3 Feb 2025 11:09:34 +0000
> > > > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > > > > >           
> > > > > > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > > > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > > > > >             
> > > > > > > > Now that the ghes preparation patches were merged, let's add support
> > > > > > > > for error injection.
> > > > > > > > 
> > > > > > > > > In this series, the first 6 patches change the math used to calculate offsets at HEST
> > > > > > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > > > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > > > > > 
> > > > > > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > > > > >    to inject ARM Processor Error records.
> > > > > > > > 
> > > > > > > > If I'm counting well, this is the 19th submission of my error inject patches.              
> > > > > > > 
> > > > > > > Looks good to me. All remaining trivial things are in the category
> > > > > > > of things to consider only if you are doing another spin.  The code
> > > > > > > ends up how I'd like it at the end of the series anyway, just
> > > > > > > a question of the precise path to that state!            
> > > > > > 
> > > > > > if you look at series as a whole it's more or less fine (I guess you
> > > > > > and me got used to it)
> > > > > > 
> > > > > > however if you take it patch by patch (as if you've never seen it)
> > > > > > ordering is messed up (the same would apply to everyone after a while
> > > > > > when it's forgotten)
> > > > > > 
> > > > > > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > > > > > re sum up my comments wrt ordering:
> > > > > > 
> > > > > > 0  add testcase for HEST table with current HEST as expected blob
> > > > > >    (currently missing), so that we can be sure that we haven't messed
> > > > > >    existing tables during refactoring.          
> > > > 
> > > > To potentially save time I think Igor is asking that before you do anything
> > > > at all you plug the existing test hole which is that we don't test HEST
> > > > at all.   Even after this series I think we don't test HEST.  You add
> > > > a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
> > > > it should be replaced with the example data for the test.      
> > > 
> > > that's what I was saying.
> > > HEST table should be in DSDT, but it's optional and one has to use
> > > 'ras=on' option to enable that, which we aren't doing ATM.
> > > So whatever changes are happening we aren't seeing them in tests
> > > nor will we see any regression for the same reason.
> > > 
> > > While white listing tables before change should happen and then updating them
> > > is the right thing to do, it's not sufficient since none of tests
> > > run with 'ras' enabled, hence code is not actually executed.     
> > 
> > Ok. Well, again we're not modifying HEST table structure on this
> > changeset. The only change affecting HEST is when the number of entries
> > increased from 1 to 2.
> > 
> > Now, looking at bios-tables-test.c, if I got it right, I should be doing
> > something similar to the enclosed patch, right?
> > 
> > If so, I have a couple of questions:
> > 
> > 1. from where should I get the HEST table? dumping the table from the
> >    running VM?
> > 
> > 2. what values should I use to fill those variables:
> > 
> > 	int hest_offset = 40 /* HEST */;
> > 	int hest_entry_size = 4;  
> 
> Thanks,
> Mauro
> 
> As a reference, this is the HEST table before the patch series:

This is the diff of the HEST table before/after this series.

As already commented, the diff is basically:

	-[024h 0036 004h]          Error Source Count : 00000001
	+[024h 0036 004h]          Error Source Count : 00000002

Plus the new entry for source ID 1 using notify type 7 (GPIO):

	+[084h 0132 002h]               Subtable Type : 000A [Generic Hardware Error Source V2]
	+[086h 0134 002h]                   Source Id : 0001
	+[088h 0136 002h]           Related Source Id : FFFF
	...
	+[0A4h 0164 001h]                 Notify Type : 07 [GPIO]
	...
	+[0D0h 0208 008h]           Read Ack Preserve : FFFFFFFFFFFFFFFE
	+[0D8h 0216 008h]              Read Ack Write : 0000000000000001

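The Read Ack Register address of source ID 0 also moves from
0000000139E40008 to 0000000139E40010, and the table length and checksum
are updated accordingly.

The complete diff follows.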

Regards,
Mauro

---

diff -u hest-before-changes.dsl hest-after-changes.dsl
--- hest-before-changes.dsl     2025-02-26 11:23:30.845089077 +0000
+++ hest-after-changes.dsl      2025-02-26 11:25:29.095066026 +0000
@@ -11,16 +11,16 @@
  */
 
 [000h 0000 004h]                   Signature : "HEST"    [Hardware Error Source Table]
-[004h 0004 004h]                Table Length : 00000084
+[004h 0004 004h]                Table Length : 000000E0
 [008h 0008 001h]                    Revision : 01
-[009h 0009 001h]                    Checksum : E0
+[009h 0009 001h]                    Checksum : 68
 [00Ah 0010 006h]                      Oem ID : "BOCHS "
 [010h 0016 008h]                Oem Table ID : "BXPC    "
 [018h 0024 004h]                Oem Revision : 00000001
 [01Ch 0028 004h]             Asl Compiler ID : "BXPC"
 [020h 0032 004h]       Asl Compiler Revision : 00000001
 
-[024h 0036 004h]          Error Source Count : 00000001
+[024h 0036 004h]          Error Source Count : 00000002
 
 [028h 0040 002h]               Subtable Type : 000A [Generic Hardware Error Source V2]
 [02Ah 0042 002h]                   Source Id : 0000
@@ -55,19 +55,62 @@
 [069h 0105 001h]                   Bit Width : 40
 [06Ah 0106 001h]                  Bit Offset : 00
 [06Bh 0107 001h]        Encoded Access Width : 04 [QWord Access:64]
-[06Ch 0108 008h]                     Address : 0000000139E40008
+[06Ch 0108 008h]                     Address : 0000000139E40010
 
 [074h 0116 008h]           Read Ack Preserve : FFFFFFFFFFFFFFFE
 [07Ch 0124 008h]              Read Ack Write : 0000000000000001
 
-Raw Table Data: Length 132 (0x84)
+[084h 0132 002h]               Subtable Type : 000A [Generic Hardware Error Source V2]
+[086h 0134 002h]                   Source Id : 0001
+[088h 0136 002h]           Related Source Id : FFFF
+[08Ah 0138 001h]                    Reserved : 00
+[08Bh 0139 001h]                     Enabled : 01
+[08Ch 0140 004h]      Records To Preallocate : 00000001
+[090h 0144 004h]     Max Sections Per Record : 00000001
+[094h 0148 004h]         Max Raw Data Length : 00000400
+
+[098h 0152 00Ch]        Error Status Address : [Generic Address Structure]
+[098h 0152 001h]                    Space ID : 00 [SystemMemory]
+[099h 0153 001h]                   Bit Width : 40
+[09Ah 0154 001h]                  Bit Offset : 00
+[09Bh 0155 001h]        Encoded Access Width : 04 [QWord Access:64]
+[09Ch 0156 008h]                     Address : 0000000139E40008
+
+[0A4h 0164 01Ch]                      Notify : [Hardware Error Notification Structure]
+[0A4h 0164 001h]                 Notify Type : 07 [GPIO]
+[0A5h 0165 001h]               Notify Length : 1C
+[0A6h 0166 002h]  Configuration Write Enable : 0000
+[0A8h 0168 004h]                PollInterval : 00000000
+[0ACh 0172 004h]                      Vector : 00000000
+[0B0h 0176 004h]     Polling Threshold Value : 00000000
+[0B4h 0180 004h]    Polling Threshold Window : 00000000
+[0B8h 0184 004h]       Error Threshold Value : 00000000
+[0BCh 0188 004h]      Error Threshold Window : 00000000
+
+[0C0h 0192 004h]   Error Status Block Length : 00000400
+[0C4h 0196 00Ch]           Read Ack Register : [Generic Address Structure]
+[0C4h 0196 001h]                    Space ID : 00 [SystemMemory]
+[0C5h 0197 001h]                   Bit Width : 40
+[0C6h 0198 001h]                  Bit Offset : 00
+[0C7h 0199 001h]        Encoded Access Width : 04 [QWord Access:64]
+[0C8h 0200 008h]                     Address : 0000000139E40018
 
-    0000: 48 45 53 54 84 00 00 00 01 E0 42 4F 43 48 53 20  // HEST......BOCHS 
+[0D0h 0208 008h]           Read Ack Preserve : FFFFFFFFFFFFFFFE
+[0D8h 0216 008h]              Read Ack Write : 0000000000000001
+
+Raw Table Data: Length 224 (0xE0)
+
+    0000: 48 45 53 54 E0 00 00 00 01 68 42 4F 43 48 53 20  // HEST.....hBOCHS 
     0010: 42 58 50 43 20 20 20 20 01 00 00 00 42 58 50 43  // BXPC    ....BXPC
-    0020: 01 00 00 00 01 00 00 00 0A 00 00 00 FF FF 00 01  // ................
+    0020: 01 00 00 00 02 00 00 00 0A 00 00 00 FF FF 00 01  // ................
     0030: 01 00 00 00 01 00 00 00 00 04 00 00 00 40 00 04  // .............@..
     0040: 00 00 E4 39 01 00 00 00 08 1C 00 00 00 00 00 00  // ...9............
     0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  // ................
-    0060: 00 00 00 00 00 04 00 00 00 40 00 04 08 00 E4 39  // .........@.....9
+    0060: 00 00 00 00 00 04 00 00 00 40 00 04 10 00 E4 39  // .........@.....9
     0070: 01 00 00 00 FE FF FF FF FF FF FF FF 01 00 00 00  // ................
-    0080: 00 00 00 00                                      // ....
+    0080: 00 00 00 00 0A 00 01 00 FF FF 00 01 01 00 00 00  // ................
+    0090: 01 00 00 00 00 04 00 00 00 40 00 04 08 00 E4 39  // .........@.....9
+    00A0: 01 00 00 00 07 1C 00 00 00 00 00 00 00 00 00 00  // ................
+    00B0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  // ................
+    00C0: 00 04 00 00 00 40 00 04 18 00 E4 39 01 00 00 00  // .....@.....9....
+    00D0: FE FF FF FF FF FF FF FF 01 00 00 00 00 00 00 00  // ................
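
As a cross-check of the new expected blob, the header fields can be decoded by hand from the raw bytes above. The following is only a minimal sketch: it assumes the standard 36-byte ACPI table header followed by the 32-bit HEST Error Source Count at offset 0x24, and the helper names (le16, le32, hest_length, hest_source_count, hest_first_source_type, hest_self_check) are made up for illustration; they are not QEMU functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* First 0x30 bytes of the new HEST blob, copied from the dump above. */
static const uint8_t hest_head[0x30] = {
    0x48, 0x45, 0x53, 0x54, 0xE0, 0x00, 0x00, 0x00,
    0x01, 0x68, 0x42, 0x4F, 0x43, 0x48, 0x53, 0x20,
    0x42, 0x58, 0x50, 0x43, 0x20, 0x20, 0x20, 0x20,
    0x01, 0x00, 0x00, 0x00, 0x42, 0x58, 0x50, 0x43,
    0x01, 0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00,
    0x0A, 0x00, 0x00, 0x00, 0xFF, 0xFF, 0x00, 0x01,
};

/* ACPI tables are little-endian; read fields byte by byte. */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Table length lives at offset 4 of the common ACPI header. */
static uint32_t hest_length(const uint8_t *t)
{
    return le32(t + 4);
}

/* Error Source Count follows the 36-byte header (offset 0x24). */
static uint32_t hest_source_count(const uint8_t *t)
{
    return le32(t + 0x24);
}

/* The first error source structure starts at 0x28 with its 16-bit type. */
static uint16_t hest_first_source_type(const uint8_t *t)
{
    return le16(t + 0x28);
}

/* Hypothetical self-check tying the decoded values to the disassembly. */
static void hest_self_check(void)
{
    assert(memcmp(hest_head, "HEST", 4) == 0);
    assert(hest_length(hest_head) == 0xE0);           /* 224 bytes */
    assert(hest_source_count(hest_head) == 2);        /* was 1 before */
    assert(hest_first_source_type(hest_head) == 0x0A); /* GHESv2 */
}
```

With these bytes the length decodes to 0xE0 (224) and the source count to 2, which matches the "Raw Table Data: Length 224 (0xE0)" line and the second GHESv2 subtable at 084h in the disassembly.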


Thanks,
Mauro


* Re: [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject
  2025-02-26  9:56           ` Mauro Carvalho Chehab
  2025-02-26 11:23             ` Mauro Carvalho Chehab
@ 2025-02-26 12:29             ` Igor Mammedov
  1 sibling, 0 replies; 14+ messages in thread
From: Igor Mammedov @ 2025-02-26 12:29 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Jonathan Cameron, Michael S . Tsirkin, Shiju Jose, qemu-arm,
	qemu-devel, Philippe Mathieu-Daudé, Ani Sinha, Cleber Rosa,
	Dongjiu Geng, Eduardo Habkost, Eric Blake, John Snow,
	Marcel Apfelbaum, Markus Armbruster, Michael Roth, Paolo Bonzini,
	Peter Maydell, Shannon Zhao, Yanan Wang, Zhao Liu, kvm,
	linux-kernel, Gavin Shan

On Wed, 26 Feb 2025 10:56:28 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Em Tue, 25 Feb 2025 11:01:15 +0100
> Igor Mammedov <imammedo@redhat.com> escreveu:
> 
> > On Fri, 21 Feb 2025 10:21:27 +0000
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >   
> > > On Fri, 21 Feb 2025 07:38:23 +0100
> > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > >     
> > > > Em Mon, 3 Feb 2025 16:22:36 +0100
> > > > Igor Mammedov <imammedo@redhat.com> escreveu:
> > > >       
> > > > > On Mon, 3 Feb 2025 11:09:34 +0000
> > > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > > > >         
> > > > > > On Fri, 31 Jan 2025 18:42:41 +0100
> > > > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > > > >           
> > > > > > > Now that the ghes preparation patches were merged, let's add support
> > > > > > > for error injection.
> > > > > > > 
> > > > > > > On this series, the first 6 patches change the math used to calculate offsets at HEST
> > > > > > > table and hardware_error firmware file, together with its migration code. Migration tested
> > > > > > > with both latest QEMU released kernel and upstream, on both directions.
> > > > > > > 
> > > > > > > The next patches add a new QAPI to allow injecting GHESv2 errors, and a script using such QAPI
> > > > > > >    to inject ARM Processor Error records.
> > > > > > > 
> > > > > > > If I'm counting well, this is the 19th submission of my error inject patches.            
> > > > > > 
> > > > > > Looks good to me. All remaining trivial things are in the category
> > > > > > of things to consider only if you are doing another spin.  The code
> > > > > > ends up how I'd like it at the end of the series anyway, just
> > > > > > a question of the precise path to that state!          
> > > > > 
> > > > > if you look at series as a whole it's more or less fine (I guess you
> > > > > and me got used to it)
> > > > > 
> > > > > however if you take it patch by patch (as if you've never seen it)
> > > > > ordering is messed up (the same would apply to everyone after a while
> > > > > when it's forgotten)
> > > > > 
> > > > > So I'd strongly suggest to restructure the series (especially 2-6/14).
> > > > > re sum up my comments wrt ordering:
> > > > > 
> > > > > 0  add testcase for HEST table with current HEST as expected blob
> > > > >    (currently missing), so that we can be sure that we haven't messed
> > > > >    existing tables during refactoring.        
> > > 
> > > To potentially save time I think Igor is asking that before you do anything
> > > at all you plug the existing test hole which is that we don't test HEST
> > > at all.   Even after this series I think we don't test HEST.  You add
> > > a stub hest and exclusion but then in patch 12 the HEST stub is deleted whereas
> > > it should be replaced with the example data for the test.    
> > 
> > that's what I was saying.
> > HEST table should be in DSDT, but it's optional and one has to use
> > 'ras=on' option to enable that, which we aren't doing ATM.
> > So whatever changes are happening we aren't seeing them in tests
> > nor will we see any regression for the same reason.
> > 
> > While white listing tables before change should happen and then updating them
> > is the right thing to do, it's not sufficient since none of tests
> > run with 'ras' enabled, hence code is not actually executed.   
> 
> Ok. Well, again we're not modifying HEST table structure on this
> changeset. The only change affecting HEST is when the number of entries
> increased from 1 to 2.
> 
> Now, looking at bios-tables-test.c, if I got it right, I should be doing
> something similar to the enclosed patch, right?
> 
> If so, I have a couple of questions:
> 
> 1. from where should I get the HEST table? dumping the table from the
>    running VM?


> 
> 2. what values should I use to fill those variables:
> 
> 	int hest_offset = 40 /* HEST */;
> 	int hest_entry_size = 4;
you don't need to do that,
bios-tables-test will dump all ACPI tables for you automatically,
you only need to add or extend a test with ras=on option.

   1: first add an empty table and whitelist it ("tests/data/acpi/aarch64/virt/HEST")
   2: enable ras in an existing testcase

--- a/tests/qtest/bios-tables-test.c
+++ b/tests/qtest/bios-tables-test.c
@@ -2123,7 +2123,8 @@ static void test_acpi_aarch64_virt_tcg(void)
     data.smbios_cpu_max_speed = 2900;
     data.smbios_cpu_curr_speed = 2700;
     test_acpi_one("-cpu cortex-a57 "
-                  "-smbios type=4,max-speed=2900,current-speed=2700", &data);
+                  "-smbios type=4,max-speed=2900,current-speed=2700 "
+                  "-machine ras=on", &data);
     free_test_data(&data);
 }
     
  then, with IASL installed, run
    V=1 QTEST_QEMU_BINARY=./qemu-system-aarch64  ./tests/qtest/bios-tables-test
  to see the diff

  3: rebuild the tables and follow the rest of the procedure to update the expected
     blobs, as described in the comment at the top of tests/qtest/bios-tables-test.c
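
For step 1, the whitelist entry would look roughly like the fragment below. This is a sketch that assumes the usual procedure from that comment, where files listed in tests/qtest/bios-tables-test-allowed-diff.h are allowed to differ from the expected blobs:

```c
/* tests/qtest/bios-tables-test-allowed-diff.h */
/* List of comma-separated changed AML files to ignore */
"tests/data/acpi/aarch64/virt/HEST",
```

The entry is then dropped again in the patch that commits the regenerated blob.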

I'd recommend adding these 3 patches at the beginning of the series,
that way we can be sure that if something changes unintentionally
it won't go unnoticed.

> 
> >   
> > > 
> > > That indeed doesn't address testing the error data storage which would be
> > > a different problem.    
> > 
> > I'd skip hardware_errors/CPER testing from QEMU unit tests.
> > That basically requires a functioning 'APEI driver' to test it.
> > 
> > Maybe we can use Ani's framework to parse HEST and all the way
> > towards CPER record(s) traversal, but that's certainly out of
> > scope of this series.
> > It could be done on top, but I won't insist even on that
> > since Mauro's out of tree error injection testing will
> > cover that using actual guest (which I assume he would
> > like to run periodically).  
> 
> Yeah, my plan is to periodically test it. I intend to setup somewhere
> a CI to test Kernel, QEMU and rasdaemon altogether.
> 
> Thanks,
> Mauro
> 
> ---
> 
> diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
> index 0a333ec43536..31e69d906db4 100644
> --- a/tests/qtest/bios-tables-test.c
> +++ b/tests/qtest/bios-tables-test.c
> @@ -210,6 +210,8 @@ static void test_acpi_fadt_table(test_data *data)
>      uint32_t val;
>      int dsdt_offset = 40 /* DSDT */;
>      int dsdt_entry_size = 4;
> +    int hest_offset = 40 /* HEST */;
> +    int hest_entry_size = 4;
>  
>      g_assert(compare_signature(&table, "FACP"));
>  
> @@ -242,6 +244,12 @@ static void test_acpi_fadt_table(test_data *data)
>      /* update checksum */
>      fadt_aml[9 /* Checksum */] = 0;
>      fadt_aml[9 /* Checksum */] -= acpi_calc_checksum(fadt_aml, fadt_len);
> +
> +
> +
> +    acpi_fetch_table(data->qts, &table.aml, &table.aml_len,
> +                     fadt_aml + hest_offset, hest_entry_size, "HEST", true);
> +    g_array_append_val(data->tables, table);
>  }
>  
>  static void dump_aml_files(test_data *data, bool rebuild)
> @@ -2411,7 +2419,7 @@ static void test_acpi_aarch64_virt_oem_fields(void)
>      };
>      char *args;
>  
> -    args = test_acpi_create_args(&data, "-cpu cortex-a57 "OEM_TEST_ARGS);
> +    args = test_acpi_create_args(&data, "-ras on -cpu cortex-a57 "OEM_TEST_ARGS);
>      data.qts = qtest_init(args);
>      test_acpi_load_tables(&data);
>      test_oem_fields(&data);
> 



end of thread, other threads:[~2025-02-26 12:29 UTC | newest]

Thread overview: 14+ messages
2025-01-31 17:42 [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Mauro Carvalho Chehab
2025-01-31 17:42 ` [PATCH v3 08/14] acpi/ghes: Cleanup the code which gets ghes ged state Mauro Carvalho Chehab
2025-02-03 10:51   ` Jonathan Cameron
2025-02-03 11:09 ` [PATCH v3 00/14] Change ghes to use HEST-based offsets and add support for error inject Jonathan Cameron
2025-02-03 15:22   ` Igor Mammedov
2025-02-21  6:38     ` Mauro Carvalho Chehab
2025-02-21 10:21       ` Jonathan Cameron
2025-02-21 12:23         ` Mauro Carvalho Chehab
2025-02-21 15:05           ` Mauro Carvalho Chehab
2025-02-25 10:01         ` Igor Mammedov
2025-02-26  9:56           ` Mauro Carvalho Chehab
2025-02-26 11:23             ` Mauro Carvalho Chehab
2025-02-26 11:31               ` Mauro Carvalho Chehab
2025-02-26 12:29             ` Igor Mammedov
