* [PATCH 00/13] Add more commands to scripts/ghes_inject.py
@ 2026-01-21 11:25 Mauro Carvalho Chehab
  2026-01-21 11:25 ` [PATCH 01/13] scripts/qmp_helper: add a return code to send_cper Mauro Carvalho Chehab
                   ` (12 more replies)
  0 siblings, 13 replies; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 11:25 UTC
  To: Michael S Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-devel, Igor Mammedov,
	Mauro Carvalho Chehab, Cleber Rosa, John Snow

Now that the basic support is merged in QEMU, add more
commands to scripts/ghes_inject.py. After this series, the
tool will support the following commands:

    arm                 Inject an ARM processor error CPER, compatible with
                        UEFI 2.9A Errata.
    pcie-bus            Inject a PCIe bus error CPER
    fuzzy-test (fuzzy)  Inject fuzzy test CPER packets
    raw-error (raw)     Inject CPER records from previously recorded ones.

Of these, only arm already exists.

The pcie-bus command injects a PCIe bus error - currently not supported
on Linux (GUID: c5753963-3b84-4095-bf78-eddad3f9c9dd).

The fuzzy-test command allows injecting one or more CPER records
for all GUID types supported on UEFI 2.11, with contents that are
either zeroed or random, and with a payload size that can also be
random.

The raw-error command allows reproducing a CPER from a text file.
It is helpful in conjunction with fuzzy-test to re-test the OSPM
after some fixes.

Besides the commands, new helper logic was added at
scripts/ghes_decode.py: when the tool is called with the
--debug command line argument, it decodes the injected
record, allowing one to compare what was injected with what
the OSPM/userspace tools would interpret.

The first 6 patches on this series improve the qmp_helper
logic to support the new functionality.

The next 6 patches add the extra functionality to ghes_inject.

The final patch improves its help message when called without
a command.

Mauro Carvalho Chehab (13):
  scripts/qmp_helper: add a return code to send_cper
  scripts/qmp_helper: add missing CXL UEFI GUID
  scripts/qmp_helper: add support for FRU Memory Poison
  scripts/qmp_helper: make send_cper() more generic
  scripts/qmp_helper: fix raw_data logic
  scripts/qmp_helper: add support for a timeout logic
  scripts/ghes_inject: add a logic to decode CPER
  scripts/ghes_inject: exit 1 if command was not sent
  scripts/ghes_inject: add a handler for PCIe bus error
  scripts/ghes_inject: add support for fuzzy logic testing
  scripts/ghes_inject: add a raw error inject command
  scripts/ghes_inject: print help if no command specified
  scripts/ghes_inject: improve help message

 MAINTAINERS                    |    4 +
 scripts/arm_processor_error.py |    8 +-
 scripts/fuzzy_error.py         |  208 ++++++
 scripts/ghes_decode.py         | 1155 ++++++++++++++++++++++++++++++++
 scripts/ghes_inject.py         |   30 +-
 scripts/pcie_bus_error.py      |  148 ++++
 scripts/qmp_helper.py          |  159 ++++-
 scripts/raw_error.py           |  175 +++++
 8 files changed, 1849 insertions(+), 38 deletions(-)
 create mode 100644 scripts/fuzzy_error.py
 create mode 100644 scripts/ghes_decode.py
 create mode 100644 scripts/pcie_bus_error.py
 create mode 100644 scripts/raw_error.py

-- 
2.52.0
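
As a rough illustration of the fuzzy-test idea described in the cover
letter (zeroed or random contents, optionally with a random payload
size), a payload generator could look like the sketch below. The name
make_fuzz_payload and its parameters are hypothetical; they are not
taken from scripts/fuzzy_error.py, which may work differently.

```python
import random


def make_fuzz_payload(min_size, max_size, zeroed=False, rng=None):
    """Return a bytearray whose size is picked from [min_size, max_size].

    Hypothetical helper for illustration only; the real fuzzy_error.py
    may generate its payloads differently.
    """
    rng = rng or random.Random()

    # Random size within the inclusive range
    size = rng.randint(min_size, max_size)

    if zeroed:
        return bytearray(size)

    # Random contents, one byte at a time
    return bytearray(rng.getrandbits(8) for _ in range(size))


# A fixed seed makes a fuzz run reproducible
payload = make_fuzz_payload(16, 64, rng=random.Random(1234))
zeros = make_fuzz_payload(32, 32, zeroed=True)
```

Seeding the generator is one way to make a failing fuzz run repeatable;
the series itself offers the raw-error command for that purpose.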




* [PATCH 01/13] scripts/qmp_helper: add a return code to send_cper
  2026-01-21 11:25 [PATCH 00/13] Add more commands to scripts/ghes_inject.py Mauro Carvalho Chehab
@ 2026-01-21 11:25 ` Mauro Carvalho Chehab
  2026-01-21 12:08   ` Jonathan Cameron via qemu development
  2026-01-21 11:25 ` [PATCH 02/13] scripts/qmp_helper: add missing CXL UEFI GUID Mauro Carvalho Chehab
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 11:25 UTC
  To: Michael S Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-devel, Igor Mammedov,
	Mauro Carvalho Chehab, Cleber Rosa, John Snow

When used inside a loop, it is useful to have a return
code to indicate whether a send cper command succeeded or not.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 scripts/qmp_helper.py | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
index c1e7e0fd80ce..249a8c7187d1 100755
--- a/scripts/qmp_helper.py
+++ b/scripts/qmp_helper.py
@@ -531,7 +531,11 @@ def __init__(self, host, port, debug=False):
     # Socket QMP send command
     #
     def send_cper_raw(self, cper_data):
-        """Send a raw CPER data to QEMU though QMP TCP socket"""
+        """
+        Send a raw CPER data to QEMU though QMP TCP socket.
+
+        Return True on success, False otherwise.
+        """
 
         data = b64encode(bytes(cper_data)).decode('ascii')
 
@@ -543,9 +547,16 @@ def send_cper_raw(self, cper_data):
 
         if self.send_cmd("inject-ghes-v2-error", cmd_arg):
             print("Error injected.")
+            return True
+
+        return False
 
     def send_cper(self, notif_type, payload):
-        """Send commands to QEMU though QMP TCP socket"""
+        """
+        Send commands to QEMU though QMP TCP socket.
+
+        Return True on success, False otherwise.
+        """
 
         # Fill CPER record header
 
@@ -599,8 +610,7 @@ def send_cper(self, notif_type, payload):
 
             util.dump_bytearray("Payload", payload)
 
-        self.send_cper_raw(cper_data)
-
+        return self.send_cper_raw(cper_data)
 
     def search_qom(self, path, prop, regex):
         """
-- 
2.52.0
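
A minimal sketch of how a caller might use the new return code inside
a loop. The names inject_all and fake_send are hypothetical; fake_send
only stands in for a bound send_cper_raw() so the example runs without
a QEMU instance.

```python
def inject_all(send, records):
    """Send each record and return the number of successful injections.

    `send` is any callable returning a truthy value on success and a
    falsy one otherwise, such as send_cper_raw() after this patch.
    """
    sent = 0
    for record in records:
        if send(record):
            sent += 1
        else:
            print(f"Failed to inject a {len(record)}-byte record")
    return sent


def fake_send(record):
    """Stand-in for the QMP call: pretend empty records are rejected."""
    return len(record) > 0


records = [bytearray(4), bytearray(), bytearray(8)]
sent = inject_all(fake_send, records)
print(f"{sent}/{len(records)} records injected")
```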




* [PATCH 02/13] scripts/qmp_helper: add missing CXL UEFI GUID
  2026-01-21 11:25 [PATCH 00/13] Add more commands to scripts/ghes_inject.py Mauro Carvalho Chehab
  2026-01-21 11:25 ` [PATCH 01/13] scripts/qmp_helper: add a return code to send_cper Mauro Carvalho Chehab
@ 2026-01-21 11:25 ` Mauro Carvalho Chehab
  2026-01-21 12:26     ` Jonathan Cameron via qemu development
  2026-01-21 11:25 ` [PATCH 03/13] scripts/qmp_helper: add support for FRU Memory Poison Mauro Carvalho Chehab
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 11:25 UTC
  To: Michael S Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-devel, Igor Mammedov,
	Mauro Carvalho Chehab, Cleber Rosa, John Snow

The UEFI 2.11 - N.2.14. CXL Component Events Section states that
CXL events are described at CXL specification 3.2:
        8.2.10.2.1 Event Records

Add the GUIDs defined there to the fuzzy logic error injection code.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 scripts/qmp_helper.py | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
index 249a8c7187d1..7e786c4adfd9 100755
--- a/scripts/qmp_helper.py
+++ b/scripts/qmp_helper.py
@@ -711,3 +711,28 @@ class cper_guid:
     CPER_CXL_PROT_ERR = guid(0x80B9EFB4, 0x52B5, 0x4DE3,
                              [0xA7, 0x77, 0x68, 0x78,
                               0x4B, 0x77, 0x10, 0x48])
+
+    CPER_CXL_EVT_GEN_MEDIA = guid(0xFBCD0A77, 0xC260, 0x417F,
+                                  [0x85, 0xA9, 0x08, 0x8B,
+                                   0x16, 0x21, 0xEB, 0xA6])
+    CPER_CXL_EVT_DRAM = guid(0x601DCBB3, 0x9C06, 0x4EAB,
+                             [0xB8, 0xAF, 0x4E, 0x9B,
+                              0xFB, 0x5C, 0x96, 0x24])
+    CPER_CXL_EVT_MEM_MODULE = guid(0xFE927475, 0xDD59, 0x4339,
+                                   [0xA5, 0x86, 0x79, 0xBA,
+                                    0xB1, 0x13, 0xBC, 0x74])
+    CPER_CXL_EVT_MEM_SPARING = guid(0xE71F3A40, 0x2D29, 0x4092,
+                                    [0x8A, 0x39, 0x4D, 0x1C,
+                                     0x96, 0x6C, 0x7C, 0x65])
+    CPER_CXL_EVT_PHY_SW = guid(0x77CF9271, 0x9C02, 0x470B,
+                               [0x9F, 0xE4, 0xBC, 0x7B,
+                                0x75, 0xF2, 0xDA, 0x97])
+    CPER_CXL_EVT_VIRT_SW = guid(0x40D26425, 0x3396, 0x4C4D,
+                                [0xA5, 0xDA, 0x3D, 0x47,
+                                  0x2A, 0x63, 0xAF, 0x25])
+    CPER_CXL_EVT_MLD_PORT = guid(0x8DC44363, 0x0C96, 0x4710,
+                                 [0xB7, 0xBF, 0x04, 0xBB,
+                                  0x99, 0x53, 0x4C, 0x3F])
+    CPER_CXL_EVT_DYNA_CAP = guid(0xCA95AFA7, 0xF183, 0x4018,
+                                 [0x8C, 0x2F, 0x95, 0x26,
+                                  0x8E, 0x10, 0x1A, 0x2A])
-- 
2.52.0
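
For readers comparing these constants with the textual GUIDs printed by
other tools, the sketch below packs one of them in the mixed-endian
layout commonly used for UEFI GUIDs and converts it back to text with
the standard uuid module. It assumes the guid() helper in qmp_helper.py
uses that same layout; guid_bytes is a hypothetical name.

```python
import struct
import uuid


def guid_bytes(time_low, time_mid, time_hi, node):
    """Pack a GUID with its first three fields in little-endian order."""
    return struct.pack("<IHH8B", time_low, time_mid, time_hi, *node)


# Values copied from CPER_CXL_EVT_GEN_MEDIA in the patch above
raw = guid_bytes(0xFBCD0A77, 0xC260, 0x417F,
                 [0x85, 0xA9, 0x08, 0x8B, 0x16, 0x21, 0xEB, 0xA6])

# bytes_le interprets the first three fields as little-endian
text = str(uuid.UUID(bytes_le=raw))
print(text)  # fbcd0a77-c260-417f-85a9-088b1621eba6
```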




* [PATCH 03/13] scripts/qmp_helper: add support for FRU Memory Poison
  2026-01-21 11:25 [PATCH 00/13] Add more commands to scripts/ghes_inject.py Mauro Carvalho Chehab
  2026-01-21 11:25 ` [PATCH 01/13] scripts/qmp_helper: add a return code to send_cper Mauro Carvalho Chehab
  2026-01-21 11:25 ` [PATCH 02/13] scripts/qmp_helper: add missing CXL UEFI GUID Mauro Carvalho Chehab
@ 2026-01-21 11:25 ` Mauro Carvalho Chehab
  2026-01-21 12:27   ` Jonathan Cameron via qemu development
  2026-01-21 11:25 ` [PATCH 04/13] scripts/qmp_helper: make send_cper() more generic Mauro Carvalho Chehab
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 11:25 UTC
  To: Michael S Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-devel, Igor Mammedov,
	Mauro Carvalho Chehab, Cleber Rosa, John Snow

This GUID record descriptor was added in UEFI 2.11.
Add support for it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 scripts/qmp_helper.py | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
index 7e786c4adfd9..19bf641a13ce 100755
--- a/scripts/qmp_helper.py
+++ b/scripts/qmp_helper.py
@@ -736,3 +736,7 @@ class cper_guid:
     CPER_CXL_EVT_DYNA_CAP = guid(0xCA95AFA7, 0xF183, 0x4018,
                                  [0x8C, 0x2F, 0x95, 0x26,
                                   0x8E, 0x10, 0x1A, 0x2A])
+
+    CPER_FRU_MEM_POISON = guid(0x5E4706C1, 0x5356, 0x48C6,
+                               [0x93, 0x0B, 0x52, 0xF2,
+                                0x12, 0x0A, 0x44, 0x58])
-- 
2.52.0




* [PATCH 04/13] scripts/qmp_helper: make send_cper() more generic
  2026-01-21 11:25 [PATCH 00/13] Add more commands to scripts/ghes_inject.py Mauro Carvalho Chehab
                   ` (2 preceding siblings ...)
  2026-01-21 11:25 ` [PATCH 03/13] scripts/qmp_helper: add support for FRU Memory Poison Mauro Carvalho Chehab
@ 2026-01-21 11:25 ` Mauro Carvalho Chehab
  2026-01-21 12:30   ` Jonathan Cameron via qemu development
  2026-01-21 11:25 ` [PATCH 05/13] scripts/qmp_helper: fix raw_data logic Mauro Carvalho Chehab
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 11:25 UTC
  To: Michael S Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-devel, Igor Mammedov,
	Mauro Carvalho Chehab, Cleber Rosa, John Snow

Allow the caller to set GEDE, GEBS and raw data. This can be
useful if one wants to replicate a CPER with the same parameters
as the original one.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 scripts/qmp_helper.py | 69 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 55 insertions(+), 14 deletions(-)

diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
index 19bf641a13ce..51c8ad92a39d 100755
--- a/scripts/qmp_helper.py
+++ b/scripts/qmp_helper.py
@@ -551,22 +551,11 @@ def send_cper_raw(self, cper_data):
 
         return False
 
-    def send_cper(self, notif_type, payload):
+    def get_gede(self, notif_type, cper_length):
         """
-        Send commands to QEMU though QMP TCP socket.
-
-        Return True on success, False otherwise.
+        Return a Generic Error Data Entry bytearray
         """
 
-        # Fill CPER record header
-
-        # NOTE: bits 4 to 13 of block status contain the number of
-        # data entries in the data section. This is currently unsupported.
-
-        cper_length = len(payload)
-        data_length = cper_length + len(self.raw_data) + self.GENERIC_DATA_SIZE
-
-        #  Generic Error Data Entry
         gede = bytearray()
 
         gede.extend(notif_type.to_bytes())
@@ -579,7 +568,13 @@ def send_cper(self, notif_type, payload):
         gede.extend(self.fru_text)
         gede.extend(self.timestamp)
 
-        # Generic Error Status Block
+        return gede
+
+    def get_gebs(self, data_length):
+        """
+        Return a Generic Error Status Block bytearray
+        """
+
         gebs = bytearray()
 
         if self.raw_data:
@@ -593,6 +588,52 @@ def send_cper(self, notif_type, payload):
         util.data_add(gebs, data_length, 4)
         util.data_add(gebs, self.error_severity, 4)
 
+        return gebs
+
+    def send_cper(self, notif_type, payload,
+                  gede=None, gebs=None, raw_data=None):
+        """
+        Send commands to QEMU though QMP TCP socket.
+
+        Return True on success, False otherwise.
+
+        Arguments:
+            notif_type: Notification type (GUID)
+            payload: bytearray with the payload
+            gede: Generic Error Data Entry. If None, the code will generate
+                  one using the defaults and generic error data arguments
+            gebs: Generic Error Status block. If None, the code will generate
+                  one using the defaults and generic error data arguments
+            raw_data: Raw data to be added after GEBS. If not specified,
+                      the code will generate one if Generic Error Data
+                      --raw-data parameter is specified.
+        """
+
+        # Fill CPER record header
+
+        # NOTE: bits 4 to 13 of block status contain the number of
+        # data entries in the data section. This is currently unsupported.
+
+        if raw_data:
+            self.raw_data = raw_data
+
+        cper_length = len(payload)
+        data_length = cper_length + len(self.raw_data) + self.GENERIC_DATA_SIZE
+
+        if gede and len(gede) != 72:
+            print(f"Invalid Generic Error Data Entry length: {len(gede)}. Ignoring it")
+            gede = None
+
+        if gebs and len(gebs) != 20:
+            print(f"Invalid Generic Error Status Block length: {len(gebs)}. Ignoring it")
+            gebs = None
+
+        if not gede:
+            gede = self.get_gede(notif_type, cper_length)
+
+        if not gebs:
+            gebs = self.get_gebs(data_length)
+
         cper_data = bytearray()
         cper_data.extend(gebs)
         cper_data.extend(gede)
-- 
2.52.0
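
The override logic in this patch follows a simple pattern: accept a
caller-supplied block only if it has the expected length, otherwise
fall back to a generated one. A standalone sketch of that pattern
follows; pick_block is a hypothetical helper, as send_cper() performs
the equivalent checks inline.

```python
GEDE_SIZE = 72  # Generic Error Data Entry length checked by the patch
GEBS_SIZE = 20  # Generic Error Status Block length checked by the patch


def pick_block(candidate, expected_size, build_default):
    """Return candidate if it has the expected size, else a default block.

    Hypothetical helper, used here only to illustrate the fallback.
    """
    if candidate and len(candidate) != expected_size:
        print(f"Invalid block length: {len(candidate)}. Ignoring it")
        candidate = None

    if not candidate:
        candidate = build_default()

    return candidate


good = pick_block(bytearray(b"\x01" * GEDE_SIZE), GEDE_SIZE,
                  lambda: bytearray(GEDE_SIZE))
bad = pick_block(bytearray(b"\x01" * 10), GEBS_SIZE,
                 lambda: bytearray(GEBS_SIZE))
```

Here good keeps the caller's 72-byte block, while bad is replaced by
the zero-filled 20-byte default because its length did not match.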




* [PATCH 05/13] scripts/qmp_helper: fix raw_data logic
  2026-01-21 11:25 [PATCH 00/13] Add more commands to scripts/ghes_inject.py Mauro Carvalho Chehab
                   ` (3 preceding siblings ...)
  2026-01-21 11:25 ` [PATCH 04/13] scripts/qmp_helper: make send_cper() more generic Mauro Carvalho Chehab
@ 2026-01-21 11:25 ` Mauro Carvalho Chehab
  2026-01-21 12:35   ` Jonathan Cameron via qemu development
  2026-01-21 11:25 ` [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic Mauro Carvalho Chehab
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 11:25 UTC
  To: Michael S Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-devel, Igor Mammedov,
	Mauro Carvalho Chehab, Cleber Rosa, John Snow

According to the ACPI 6.4 spec, Table 18.11 Generic Error Status Block:

Raw Data Offset:	Offset in bytes from the beginning of the
			Error Status Block to raw error data.
			The raw data must follow any Generic Error
			Data Entries.

Data Length:		Length in bytes of the generic error data.

So, basically, we have:

        +----------+                 /
        | GEBS     |                 |
        +----------+   /             |
        | GEDE     |   |             |
        | header   |   |             +--> raw data
        +----------+   +--> data     |    offset
        | GEDE     |   |    length   |
        | payload  |   |             |
        +----------+   /             /
        | Raw data |
        +----------+

where:

- raw data offset is relative to the beginning of GEBS;
- data length is only for GEDE header and payload.

Fix the code to handle it the expected way.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 scripts/qmp_helper.py | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
index 51c8ad92a39d..40059cd105f6 100755
--- a/scripts/qmp_helper.py
+++ b/scripts/qmp_helper.py
@@ -411,6 +411,7 @@ def _connect(self):
         "simulated":    util.bit(2),
     }
 
+    GENERIC_ERROR_STATUS_SIZE = 20
     GENERIC_DATA_SIZE = 72
 
     def argparse(parser):
@@ -551,7 +552,7 @@ def send_cper_raw(self, cper_data):
 
         return False
 
-    def get_gede(self, notif_type, cper_length):
+    def get_gede(self, notif_type, payload_length):
         """
         Return a Generic Error Data Entry bytearray
         """
@@ -563,22 +564,27 @@ def get_gede(self, notif_type, cper_length):
         util.data_add(gede, 0x300, 2)
         util.data_add(gede, self.validation_bits, 1)
         util.data_add(gede, self.flags, 1)
-        util.data_add(gede, cper_length, 4)
+        util.data_add(gede, payload_length, 4)
         gede.extend(self.fru_id)
         gede.extend(self.fru_text)
         gede.extend(self.timestamp)
 
         return gede
 
-    def get_gebs(self, data_length):
+    def get_gebs(self, payload_length):
         """
         Return a Generic Error Status Block bytearray
         """
 
+        data_length = payload_length
+        data_length += self.GENERIC_DATA_SIZE
+
         gebs = bytearray()
 
         if self.raw_data:
-            raw_data_offset = len(gebs)
+            raw_data_offset = payload_length
+            raw_data_offset += self.GENERIC_ERROR_STATUS_SIZE
+            raw_data_offset += self.GENERIC_DATA_SIZE
         else:
             raw_data_offset = 0
 
@@ -617,8 +623,7 @@ def send_cper(self, notif_type, payload,
         if raw_data:
             self.raw_data = raw_data
 
-        cper_length = len(payload)
-        data_length = cper_length + len(self.raw_data) + self.GENERIC_DATA_SIZE
+        payload_length = len(payload)
 
         if gede and len(gede) != 72:
             print(f"Invalid Generic Error Data Entry length: {len(gede)}. Ignoring it")
@@ -629,16 +634,16 @@ def send_cper(self, notif_type, payload,
             gebs = None
 
         if not gede:
-            gede = self.get_gede(notif_type, cper_length)
+            gede = self.get_gede(notif_type, payload_length)
 
         if not gebs:
-            gebs = self.get_gebs(data_length)
+            gebs = self.get_gebs(payload_length)
 
         cper_data = bytearray()
         cper_data.extend(gebs)
         cper_data.extend(gede)
-        cper_data.extend(bytearray(self.raw_data))
         cper_data.extend(bytearray(payload))
+        cper_data.extend(bytearray(self.raw_data))
 
         if self.debug:
             print(f"GUID: {notif_type}")
-- 
2.52.0
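
The arithmetic described in this patch can be checked with a small
standalone sketch. The two constants mirror GENERIC_ERROR_STATUS_SIZE
and GENERIC_DATA_SIZE from the diff; the function name gebs_lengths is
hypothetical, and a single GEDE is assumed.

```python
GENERIC_ERROR_STATUS_SIZE = 20  # GEBS
GENERIC_DATA_SIZE = 72          # GEDE header


def gebs_lengths(payload_length, has_raw_data):
    """Return (raw_data_offset, data_length) for one GEDE plus payload."""
    # Data length covers only the GEDE header and its payload
    data_length = GENERIC_DATA_SIZE + payload_length

    # Raw data, when present, starts right after GEBS + GEDE + payload
    if has_raw_data:
        raw_data_offset = GENERIC_ERROR_STATUS_SIZE + data_length
    else:
        raw_data_offset = 0

    return raw_data_offset, data_length


print(gebs_lengths(48, has_raw_data=True))   # (140, 120)
print(gebs_lengths(48, has_raw_data=False))  # (0, 120)
```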




* [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic
  2026-01-21 11:25 [PATCH 00/13] Add more commands to scripts/ghes_inject.py Mauro Carvalho Chehab
                   ` (4 preceding siblings ...)
  2026-01-21 11:25 ` [PATCH 05/13] scripts/qmp_helper: fix raw_data logic Mauro Carvalho Chehab
@ 2026-01-21 11:25 ` Mauro Carvalho Chehab
  2026-01-21 12:39   ` Jonathan Cameron via qemu development
  2026-01-21 11:25 ` [PATCH 07/13] scripts/ghes_inject: add a logic to decode CPER Mauro Carvalho Chehab
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 11:25 UTC
  To: Michael S Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-devel, Igor Mammedov,
	Mauro Carvalho Chehab, Cleber Rosa, John Snow

We can't inject a new GHES record to the same source before
the previous one has been acked. There is an async mechanism to
verify when the kernel is ready, which is implemented in QEMU's
ghes driver.

If errors are injected too fast, QEMU may return an error. When
such errors occur, retry until a maximum timeout is reached.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 scripts/qmp_helper.py | 47 +++++++++++++++++++++++++++++++------------
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
index 40059cd105f6..63f3df2d75c3 100755
--- a/scripts/qmp_helper.py
+++ b/scripts/qmp_helper.py
@@ -14,6 +14,7 @@
 
 from datetime import datetime
 from os import path as os_path
+from time import sleep
 
 try:
     qemu_dir = os_path.abspath(os_path.dirname(os_path.dirname(__file__)))
@@ -324,7 +325,8 @@ class qmp:
     Opens a connection and send/receive QMP commands.
     """
 
-    def send_cmd(self, command, args=None, may_open=False, return_error=True):
+    def send_cmd(self, command, args=None, may_open=False, return_error=True,
+                 timeout=None):
         """Send a command to QMP, optinally opening a connection"""
 
         if may_open:
@@ -336,12 +338,31 @@ def send_cmd(self, command, args=None, may_open=False, return_error=True):
         if args:
             msg['arguments'] = args
 
-        try:
-            obj = self.qmp_monitor.cmd_obj(msg)
-        # Can we use some other exception class here?
-        except Exception as e:                         # pylint: disable=W0718
-            print(f"Command: {command}")
-            print(f"Failed to inject error: {e}.")
+        if timeout and timeout > 0:
+            attempts = int(timeout * 10)
+        else:
+            attempts = 1
+
+        # Try up to attempts
+        for i in range(0, attempts):
+            try:
+                obj = self.qmp_monitor.cmd_obj(msg)
+
+                if obj and "return" in obj and not obj["return"]:
+                    break
+
+            except Exception as e:                     # pylint: disable=W0718
+                print(f"Command: {command}")
+                print(f"Failed to inject error: {e}.")
+                obj = None
+
+            if attempts > 1:
+                print(f"Error inject attempt {i + 1}/{attempts} failed.")
+
+            if i + 1 < attempts:
+                sleep(0.1)
+
+        if not obj:
             return None
 
         if "return" in obj:
@@ -531,7 +552,7 @@ def __init__(self, host, port, debug=False):
     #
     # Socket QMP send command
     #
-    def send_cper_raw(self, cper_data):
+    def send_cper_raw(self, cper_data, timeout=None):
         """
         Send a raw CPER data to QEMU though QMP TCP socket.
 
@@ -546,11 +567,11 @@ def send_cper_raw(self, cper_data):
 
         self._connect()
 
-        if self.send_cmd("inject-ghes-v2-error", cmd_arg):
+        ret = self.send_cmd("inject-ghes-v2-error", cmd_arg, timeout=timeout)
+        if ret:
             print("Error injected.")
-            return True
 
-        return False
+        return ret
 
     def get_gede(self, notif_type, payload_length):
         """
@@ -597,7 +618,7 @@ def get_gebs(self, payload_length):
         return gebs
 
     def send_cper(self, notif_type, payload,
-                  gede=None, gebs=None, raw_data=None):
+                  gede=None, gebs=None, raw_data=None, timeout=None):
         """
         Send commands to QEMU though QMP TCP socket.
 
@@ -656,7 +677,7 @@ def send_cper(self, notif_type, payload,
 
             util.dump_bytearray("Payload", payload)
 
-        return self.send_cper_raw(cper_data)
+        return self.send_cper_raw(cper_data, timeout=timeout)
 
     def search_qom(self, path, prop, regex):
         """
-- 
2.52.0
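
The retry scheme can be sketched in isolation as below. This is a
simplified illustration, not the code from the patch: the helper name
retry and the injectable wait parameter are hypothetical, and success
is reduced to "the action returned a truthy value", whereas send_cmd()
inspects the QMP reply. The number of attempts follows the patch
(timeout * 10, with 0.1 s between attempts); the max(1, ...) guard for
timeouts shorter than one interval is an addition of this sketch.

```python
from time import sleep

INTERVAL = 0.1  # seconds between attempts, as in the patch


def retry(action, timeout=None, wait=sleep):
    """Call action() until it returns a truthy value or attempts run out.

    Hypothetical helper; `wait` is injectable so tests need not sleep.
    """
    if timeout and timeout > 0:
        attempts = max(1, int(timeout * 10))
    else:
        attempts = 1

    for i in range(attempts):
        result = action()
        if result:
            return result

        # Do not sleep after the final attempt
        if i + 1 < attempts:
            wait(INTERVAL)

    return None


calls = []


def flaky():
    """Fail twice, then succeed, mimicking a source that is still busy."""
    calls.append(1)
    return len(calls) >= 3


waits = []
ok = retry(flaky, timeout=0.5, wait=waits.append)
```

With a 0.5 s timeout the helper allows five attempts; flaky() succeeds
on the third, so only two waits are recorded.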




* [PATCH 07/13] scripts/ghes_inject: add a logic to decode CPER
  2026-01-21 11:25 [PATCH 00/13] Add more commands to scripts/ghes_inject.py Mauro Carvalho Chehab
                   ` (5 preceding siblings ...)
  2026-01-21 11:25 ` [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic Mauro Carvalho Chehab
@ 2026-01-21 11:25 ` Mauro Carvalho Chehab
  2026-01-21 13:27   ` Jonathan Cameron via qemu development
  2026-01-21 11:25 ` [PATCH 08/13] scripts/ghes_inject: exit 1 if command was not sent Mauro Carvalho Chehab
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 11:25 UTC
  To: Michael S Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-devel, Igor Mammedov,
	Mauro Carvalho Chehab, Cleber Rosa, John Snow

Add a decoder to help debugging injected CPERs. This becomes more
relevant when we add fuzzy-testing error injection, as the
decoder helps to identify which packets will be sent
via QEMU to the firmware-first logic.

On purpose, I opted to keep this completely independent from
the encoders' implementation, as this can be used even when
there are no encoders for a certain GUID type (except for a
fuzzy logic test, which is pretty much independent of the
records' meaning).
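
To illustrate the field-level, table-driven approach the decoder takes,
here is a much reduced sketch that handles only little-endian integer
fields. The function decode_fields is hypothetical and is not part of
ghes_decode.py, which uses its own DecodeField class and supports more
field types.

```python
def decode_fields(data, fields):
    """Decode little-endian integer fields described by (name, size) pairs.

    Returns a list of (name, hex string or "N/A") tuples. Hypothetical
    helper for illustration only.
    """
    pos = 0
    decoded = []

    for name, size in fields:
        chunk = data[pos:pos + size]
        pos += size

        if len(chunk) < size:
            # Record is shorter than the layout: flag the missing field
            decoded.append((name, "N/A"))
            continue

        value = int.from_bytes(chunk, byteorder="little")
        decoded.append((name, f"{value:0{size * 2}x}"))

    return decoded


layout = [("Validation Bits", 8), ("Processor Type", 1), ("Processor ISA", 1)]
record = bytearray([0x03, 0, 0, 0, 0, 0, 0, 0, 0x01])

for name, text in decode_fields(record, layout):
    print(f"{name:<26s}: {text}")
```

The record above is deliberately one byte short, so the last field is
reported as N/A instead of raising an exception, which is the kind of
truncated input a fuzzy test is likely to produce.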

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 MAINTAINERS            |    1 +
 scripts/ghes_decode.py | 1155 ++++++++++++++++++++++++++++++++++++++++
 scripts/qmp_helper.py  |    3 +
 3 files changed, 1159 insertions(+)
 create mode 100644 scripts/ghes_decode.py

diff --git a/MAINTAINERS b/MAINTAINERS
index 36a2be3ddba7..a970c47dd089 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2225,6 +2225,7 @@ S: Maintained
 F: hw/arm/ghes_cper.c
 F: hw/acpi/ghes_cper_stub.c
 F: qapi/acpi-hest.json
+F: scripts/ghes_decode.py
 F: scripts/ghes_inject.py
 F: scripts/arm_processor_error.py
 F: scripts/qmp_helper.py
diff --git a/scripts/ghes_decode.py b/scripts/ghes_decode.py
new file mode 100644
index 000000000000..6c7fdfe84e3a
--- /dev/null
+++ b/scripts/ghes_decode.py
@@ -0,0 +1,1155 @@
+#!/usr/bin/env python3
+#
+# pylint: disable=R0903,R0912,R0913,R0915,R0917,R1713,E1121,C0302,W0613
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# Copyright (C) 2025 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
+
+"""
+Helper classes to decode a generic error data entry.
+
+On purpose, the logic here is independent of the logic inside qmp_helper
+and other modules. With a different implementation, it is more likely to
+discover bugs in the error injection logic. Also, as this can be used to
+dump errors injected by reproducing an error message or for fuzzy error
+injection, it can't rely on the encoding logic inside each module of
+ghes_inject.py.
+
+To make the decoder simple, the decode logic here is at field level, not
+trying to decode bitmaps.
+"""
+
+from typing import Optional
+
+class DecodeField():
+    """
+    Helper functions to decode a field, printing its results
+    """
+
+    def __init__(self, cper_data: bytearray):
+        """Initialize the decoder with a cper bytearray"""
+        self.data = cper_data
+        self.pos = 0
+        self.past_end = False
+
+    @property
+    def remaining(self):
+        """Returns the number of bytes not decoded yet"""
+        return max(0, len(self.data) - self.pos)
+
+    @property
+    def is_end(self):
+        """
+        Returns true if all bytes were decoded and it didn't try
+        to read past the end.
+        """
+        if not self.past_end and self.pos == len(self.data):
+            return True
+
+        return False
+
+    def decode(self, name: str, size: int, ftype: str,
+               pos: Optional[int] = None,
+               show_incomplete: Optional[bool] = False) -> None:
+        """
+        Decodes and outputs a specified field from an ACPI table.
+
+        For ints, we opted to decode them byte by byte, thus not being
+        limited to an integer max size.
+
+        Arguments:
+            name: name of the field
+            size: number of bytes of the field
+            ftype: field type (str, int, guid, bcd)
+            pos: if specified, show a field at the specific position. If
+                 not, use last position and increment it with size at the end
+        """
+        if pos:
+            cur_pos = pos
+        else:
+            cur_pos = self.pos
+
+        try:
+            if cur_pos + size > len(self.data):
+                if not pos:
+                    self.past_end = True
+
+                if not show_incomplete:
+                    decoded = "N/A"
+                    return None
+
+            raw_data = self.data[cur_pos:cur_pos + size]
+
+            decoded = ""
+            if ftype == "str":
+                failures = False
+                for b in raw_data:
+                    if b >= 32 and b <= 126:            # pylint: disable=R1716
+                        decoded += chr(b)
+                    elif b:
+                        decoded += '.'
+                        failures = True
+                    else:
+                        decoded += r'\x0'
+
+                if failures:
+                    decoded += " # warning: non-ascii chars found"
+
+                if self.past_end:
+                    if decoded:
+                        decoded += " "
+                    decoded += "EOL"
+
+            elif ftype == "int":
+                i = 0
+                for b in reversed(raw_data):
+                    i += 1
+                    if len(raw_data) > 8 and i > 1:
+                        decoded += " "
+
+                    decoded += f"{b:02x}"
+
+                if self.past_end:
+                    if decoded:
+                        decoded += " "
+                    decoded += "EOL"
+
+            elif ftype == "guid":
+                if len(raw_data) != 16 or size != 16:
+                    decoded = "Invalid GUID"
+                else:
+                    for b in reversed(raw_data[0:4]):
+                        decoded += f"{b:02x}"
+
+                    decoded += "-"
+
+                    for b in reversed(raw_data[4:6]):
+                        decoded += f"{b:02x}"
+
+                    decoded += "-"
+
+                    for b in reversed(raw_data[6:8]):
+                        decoded += f"{b:02x}"
+
+                    decoded += "-"
+
+                    for b in raw_data[8:10]:
+                        decoded += f"{b:02x}"
+
+                    decoded += "-"
+
+                    for b in raw_data[10:]:
+                        decoded += f"{b:02x}"
+
+                    raw_data = decoded
+
+            elif ftype == "bcd":
+                val = 0
+                for b in raw_data:
+                    if (b >> 4) > 9 or (b & 0x0f) > 9:
+                        raise ValueError("Invalid BCD value")
+                    val = (val << 8) | b
+
+                decoded = f"{val:0{size * 2}x}"
+
+                if self.past_end:
+                    if decoded:
+                        decoded += " "
+                    decoded += "EOL"
+            else:
+                decoded = f"Warning: Unknown format {ftype}"
+
+        except ValueError as e:
+            decoded = f"Error decoding {e}"
+
+        finally:
+            print(f"{name:<26s}: {decoded}")
+            if not pos:
+                self.pos += size
+
+        return raw_data
+
+
+class DecodeProcGeneric():
+    """
+    Class to decode a Generic Processor Error as defined at
+    UEFI 2.1 - N.2.2 Section Descriptor
+    """
+    # GUID for Generic Processor Error
+    guid = "9876ccad-47b4-4bdb-b65e-16f193c4f3db"
+
+    fields = [
+        ("Validation Bits", 8, "int"),
+        ("Processor Type", 1, "int"),
+        ("Processor ISA", 1, "int"),
+        ("Processor Error Type", 1, "int"),
+        ("Operation", 1, "int"),
+        ("Flags", 1, "int"),
+        ("Level", 1, "int"),
+        ("Reserved", 2, "int"),
+        ("CPU Version Info", 8, "int"),
+        ("CPU Brand String", 128, "str"),
+        ("Processor ID", 8, "int"),
+        ("Target Address", 8, "int"),
+        ("Requestor Identifier", 8, "int"),
+        ("Responder Identifier", 8, "int"),
+        ("Instruction IP", 8, "int")
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode Generic Processor Error"""
+        print("Generic Processor Error")
+
+        for name, size, ftype in self.fields:
+            self.cper.decode(name, size, ftype)
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodeProcGeneric.guid, DecodeProcGeneric)]
+
+class DecodeProcX86():
+    """
+    Class to decode an x86 Processor Error as defined at
+    UEFI 2.1 - N.2.2 Section Descriptor
+    """
+
+    # GUID for x86 Processor Error
+    guid = "dc3ea0b0-a144-4797-b95b-53fa242b6e1d"
+
+    pei_fields = [
+        ("Error Structure Type", 16, "guid"),
+        ("Validation Bits", 8, "int"),
+        ("Check Information", 8, "int"),
+        ("Target Identifier", 8, "int"),
+        ("Requestor Identifier", 8, "int"),
+        ("Responder Identifier", 8, "int"),
+        ("Instruction Pointer", 8, "int")
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode x86 Processor Error"""
+        print("x86 Processor Error")
+
+        val = self.cper.decode("Validation Bits", 8, "int")
+        try:
+            val_bits = int.from_bytes(val, byteorder='little')
+        except (ValueError, TypeError):
+            val_bits = 0
+
+        error_info_num = (val_bits >> 2) & 0x3f    # bits 2-7
+        context_info_num = (val_bits >> 8) & 0x3f  # bits 8-13
+
+        self.cper.decode("Local APIC_ID", 8, "int")
+        self.cper.decode("CPUID Info", 48, "int")
+
+        for pei in range(0, error_info_num):
+            if self.cper.past_end:
+                return
+
+            print()
+            print(f"Processor Error Info {pei}")
+            for name, size, ftype in self.pei_fields:
+                self.cper.decode(name, size, ftype)
+
+        for ctx in range(0, context_info_num):
+            if self.cper.past_end:
+                return
+
+            print()
+            print(f"Context {ctx}")
+
+            self.cper.decode("Register Context Type", 2, "int")
+
+            val = self.cper.decode("Register Array Size", 2, "int")
+            try:
+                context_size = int.from_bytes(val, byteorder='little') // 8
+            except (ValueError, TypeError):
+                context_size = 0
+
+            self.cper.decode("MSR Address", 4, "int")
+            self.cper.decode("MM Register Address", 8, "int")
+
+            for reg in range(0, context_size):
+                if self.cper.past_end:
+                    return
+                self.cper.decode(f"Register offset {reg:<3}", 8, "int")
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodeProcX86.guid, DecodeProcX86)]
+
+class DecodeProcItanium():
+    """
+    Class to decode an Itanium Processor Error as defined at
+    UEFI 2.1 - N.2.2 Section Descriptor
+    """
+
+    # GUID for Itanium Processor Error
+    guid = "e429faf1-3cb7-11d4-bca7-0080c73c8881"
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """
+        Decode Itanium Processor Error.
+
+        Itanium processors stopped being sold in 2021, so there is
+        probably not much sense in implementing a decoder for them.
+        """
+
+        print("Itanium Processor Error")
+
+        remaining = self.cper.remaining
+        if remaining:
+            print()
+            self.cper.decode("Data", remaining, "int")
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodeProcItanium.guid, DecodeProcItanium)]
+
+
+class DecodeProcArm():
+    """
+    Class to decode an ARM Processor Error as defined at
+    UEFI 2.6 - N.2.2 Section Descriptor
+    """
+
+    # GUID for ARM Processor Error
+    guid = "e19e3d16-bc11-11e4-9caa-c2051d5d46b0"
+
+    arm_pei_fields = [
+        ("Version",              1, "int"),
+        ("Length",               1, "int"),
+        ("valid",                2, "int"),
+        ("type",                 1, "int"),
+        ("multiple-error",       2, "int"),
+        ("flags",                1, "int"),
+        ("error-info",           8, "int"),
+        ("virt-addr",            8, "int"),
+        ("phy-addr",             8, "int"),
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode Processor ARM"""
+
+        print("ARM Processor Error")
+
+        start = self.cper.pos
+
+        self.cper.decode("Valid", 4, "int")
+
+        val = self.cper.decode("Error Info num", 2, "int")
+        try:
+            error_info_num = int.from_bytes(val, byteorder='little')
+        except (ValueError, TypeError):
+            error_info_num = 0
+
+        val = self.cper.decode("Context Info num", 2, "int")
+        try:
+            context_info_num = int.from_bytes(val, byteorder='little')
+        except (ValueError, TypeError):
+            context_info_num = 0
+
+        val = self.cper.decode("Section Length", 4, "int")
+        try:
+            section_length = int.from_bytes(val, byteorder='little')
+        except (ValueError, TypeError):
+            section_length = 0
+
+        self.cper.decode("Error affinity level", 1, "int")
+        self.cper.decode("Reserved", 3, "int")
+        self.cper.decode("MPIDR_EL1", 8, "int")
+        self.cper.decode("MIDR_EL1", 8, "int")
+        self.cper.decode("Running State", 4, "int")
+        self.cper.decode("PSCI State", 4, "int")
+
+        for pei in range(0, error_info_num):
+            if self.cper.past_end:
+                return
+
+            print()
+            print(f"Processor Error Info {pei}")
+            for name, size, ftype in self.arm_pei_fields:
+                self.cper.decode(name, size, ftype)
+
+        for ctx in range(0, context_info_num):
+            if self.cper.past_end:
+                return
+
+            print()
+            print(f"Context {ctx}")
+            self.cper.decode("Version", 2, "int")
+            self.cper.decode("Register Context Type", 2, "int")
+            val = self.cper.decode("Register Array Size", 4, "int")
+            try:
+                context_size = int.from_bytes(val, byteorder='little') // 8
+            except (ValueError, TypeError):
+                context_size = 0
+
+            for reg in range(0, context_size):
+                if self.cper.past_end:
+                    return
+                self.cper.decode(f"Register {reg:<3}", 8, "int")
+
+        remaining = max(section_length + start - self.cper.pos, 0)
+        if remaining:
+            print()
+            self.cper.decode("Vendor data", remaining, "int")
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodeProcArm.guid, DecodeProcArm)]
+
+
+class DecodePlatformMem():
+    """
+    Class to decode a Platform Memory Error as defined at
+    UEFI 2.1 - N.2.2 Section Descriptor
+    """
+
+    # GUID for Platform Memory Error
+    guid = "a5bc1114-6f64-4ede-b863-3e83ed7c83b1"
+
+    fields = [
+        ("Validation Bits", 8, "int"),
+        ("Error Status", 8, "int"),
+        ("Physical Address", 8, "int"),
+        ("Physical Address Mask", 8, "int"),
+        ("Node", 2, "int"),
+        ("Card", 2, "int"),
+        ("Module", 2, "int"),
+        ("Bank", 2, "int"),
+        ("Device", 2, "int"),
+        ("Row", 2, "int"),
+        ("Column", 2, "int"),
+        ("Bit Position", 2, "int"),
+        ("Requestor ID", 8, "int"),
+        ("Responder ID", 8, "int"),
+        ("Target ID", 8, "int"),
+        ("Memory Error Type", 1, "int"),
+        ("Extended", 1, "int"),
+        ("Rank Number", 2, "int"),
+        ("Card Handle", 2, "int"),
+        ("Module Handle", 2, "int")
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode Platform Memory Error"""
+        print("Platform Memory Error")
+
+        for name, size, ftype in self.fields:
+            self.cper.decode(name, size, ftype)
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodePlatformMem.guid, DecodePlatformMem)]
+
+
+class DecodePlatformMem2():
+    """
+    Class to decode a Platform Memory Error (Type 2) as defined at
+    UEFI 2.5 - N.2.6. Memory Error Section 2
+    """
+
+    # GUID for Platform Memory Error Type 2
+    guid = "61ec04fc-48e6-d813-25c9-8daa44750b12"
+
+    fields = [
+        ("Validation Bits", 8, "int"),
+        ("Error Status", 8, "int"),
+        ("Physical Address", 8, "int"),
+        ("Physical Address Mask", 8, "int"),
+        ("Node", 2, "int"),
+        ("Card", 2, "int"),
+        ("Module", 2, "int"),
+        ("Bank", 2, "int"),
+        ("Device", 4, "int"),
+        ("Row", 4, "int"),
+        ("Column", 4, "int"),
+        ("Rank", 4, "int"),
+        ("Bit Position", 4, "int"),
+        ("Chip Identification", 1, "int"),
+        ("Memory Error Type", 1, "int"),
+        ("Status", 1, "int"),
+        ("Reserved", 1, "int"),
+        ("Requestor ID", 8, "int"),
+        ("Responder ID", 8, "int"),
+        ("Target ID", 8, "int"),
+        ("Card Handle", 4, "int"),
+        ("Module Handle", 4, "int")
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode Platform Memory Error Type 2"""
+        print("Platform Memory Error Type 2")
+
+        for name, size, ftype in self.fields:
+            self.cper.decode(name, size, ftype)
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodePlatformMem2.guid, DecodePlatformMem2)]
+
+
+class DecodePCIe():
+    """
+    Class to decode a PCI Express Error as defined at
+    UEFI 2.1 - N.2.2 Section Descriptor
+    """
+    # GUID for PCI Express Error
+    guid = "d995e954-bbc1-430f-ad91-b44dcb3c6f35"
+
+    fields = [
+        ("Validation Bits", 8, "int"),
+        ("Port Type", 4, "int"),
+        ("Version", 4, "int"),
+        ("Command Status", 4, "int"),
+        ("RCRB High Address", 4, "int"),
+        ("Device ID", 16, "int"),
+        ("Device Serial Number", 8, "int"),
+        ("Bridge Control Status", 4, "int"),
+        ("Capability Structure", 60, "int"),
+        ("AER Info", 96, "int")
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode PCI Express Error"""
+        print("PCI Express Error")
+
+        for name, size, ftype in self.fields:
+            self.cper.decode(name, size, ftype)
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodePCIe.guid, DecodePCIe)]
+
+
+class DecodePCIBus():
+    """
+    Class to decode a PCI Bus Error as defined at
+    UEFI 2.1 - N.2.2 Section Descriptor
+    """
+
+    # GUID for PCI Bus Error
+    guid = "c5753963-3b84-4095-bf78-eddad3f9c9dd"
+
+    fields = [
+        ("Validation Bits", 8, "int"),
+        ("Error Status", 8, "int"),
+        ("Error Type", 2, "int"),
+        ("Bus Id", 2, "int"),
+        ("Reserved", 4, "int"),
+        ("Bus Address", 8, "int"),
+        ("Bus Data", 8, "int"),
+        ("Bus Command", 8, "int"),
+        ("Bus Requestor Id", 8, "int"),
+        ("Bus Completer Id", 8, "int"),
+        ("Target Id", 8, "int")
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode PCI Bus Error"""
+        print("PCI Bus Error")
+
+        for name, size, ftype in self.fields:
+            self.cper.decode(name, size, ftype)
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodePCIBus.guid, DecodePCIBus)]
+
+
+class DecodePCIDev():
+    """
+    Class to decode a PCI Device Error as defined at
+    UEFI 2.1 - N.2.2 Section Descriptor
+    """
+
+    # GUID for PCI Device Error
+    guid = "eb5e4685-ca66-4769-b6a2-26068b001326"
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode PCI Device Error"""
+        print("PCI Device Error")
+
+        self.cper.decode("Validation Bits", 8, "int")
+        self.cper.decode("Error Status", 8, "int")
+        self.cper.decode("Id Info", 16, "int")
+
+        val = self.cper.decode("Memory Number", 4, "int")
+        try:
+            mem_num = int.from_bytes(val, byteorder='little')
+        except (ValueError, TypeError):
+            mem_num = 0
+
+        self.cper.decode("IO Number", 4, "int")
+
+        for mem in range(0, mem_num):
+            if self.cper.past_end:
+                return
+
+            print()
+            print(f"Register Data Pair {mem}")
+            self.cper.decode("Register 0", 8, "int")
+            self.cper.decode("Register 1", 8, "int")
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodePCIDev.guid, DecodePCIDev)]
+
+
+class DecodeFWError():
+    """
+    Class to decode a Firmware Error as defined at
+    UEFI 2.1 - N.2.2 Section Descriptor
+    """
+
+    # GUID for Firmware Error
+    guid = "81212a96-09ed-4996-9471-8d729c8e69ed"
+
+    # NOTE: UEFI 2.11 has a discrepancy, as it lists:
+    #       byte offset 1: revision (1 byte)
+    #       byte offset 1: reserved (7 bytes)
+    #
+    # both starting at position 1. We opted to change reserved size to 6,
+    # in order to better cope with the spec issues
+
+    fields = [
+        ("Firmware Error Record Type", 1, "int"),
+        ("Revision", 1, "int"),
+        ("Reserved", 6, "int"),
+        ("Record Identifier", 8, "int"),
+        ("Record identifier GUID extension", 16, "guid")
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode Firmware Error"""
+        print("Firmware Error")
+
+        for name, size, ftype in self.fields:
+            self.cper.decode(name, size, ftype)
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodeFWError.guid, DecodeFWError)]
+
+
+class DecodeDMAGeneric():
+    """
+    Class to decode a Generic DMA Error as defined at
+    UEFI 2.2 - N.2.2 Section Descriptor
+    """
+
+    # GUID for Generic DMA Error
+    guid = "5b51fef7-c79d-4434-8f1b-aa62de3e2c64"
+
+    fields = [
+        ("Requester-ID", 2, "int"),
+        ("Segment Number", 2, "int"),
+        ("Fault Reason", 1, "int"),
+        ("Access Type", 1, "int"),
+        ("Address Type", 1, "int"),
+        ("Architecture Type", 1, "int"),
+        ("Device Address", 8, "int"),
+        ("Reserved", 16, "int")
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode Generic DMA Error"""
+        print("Generic DMA Error")
+
+        for name, size, ftype in self.fields:
+            self.cper.decode(name, size, ftype)
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodeDMAGeneric.guid, DecodeDMAGeneric)]
+
+
+class DecodeDMAVT():
+    """
+    Class to decode a DMA Virtualization Technology Error as defined at
+    UEFI 2.2 - N.2.2 Section Descriptor
+    """
+
+    # GUID for DMA VT Error
+    guid = "71761d37-32b2-45cd-a7d0-b0fedd93e8cf"
+
+    fields = [
+        ("Version", 1, "int"),
+        ("Revision", 1, "int"),
+        ("OemId", 6, "int"),
+        ("Capability", 8, "int"),
+        ("Extended Capability", 8, "int"),
+        ("Global Command", 4, "int"),
+        ("Global Status", 4, "int"),
+        ("Fault Status", 4, "int"),
+        ("Reserved", 12, "int"),
+        ("Fault record", 16, "int"),
+        ("Root Entry", 16, "int"),
+        ("Context Entry", 16, "int"),
+        ("Level 6 Page Table Entry", 8, "int"),
+        ("Level 5 Page Table Entry", 8, "int"),
+        ("Level 4 Page Table Entry", 8, "int"),
+        ("Level 3 Page Table Entry", 8, "int"),
+        ("Level 2 Page Table Entry", 8, "int"),
+        ("Level 1 Page Table Entry", 8, "int")
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode DMA VT Error"""
+        print("DMA VT Error")
+
+        for name, size, ftype in self.fields:
+            self.cper.decode(name, size, ftype)
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodeDMAVT.guid, DecodeDMAVT)]
+
+
+class DecodeDMAIOMMU():
+    """
+    Class to decode an IOMMU DMA Error as defined at
+    UEFI 2.2 - N.2.2 Section Descriptor
+    """
+
+    # GUID for IOMMU DMA Error
+    guid = "036f84e1-7f37-428c-a79e-575fdfaa84ec"
+
+    fields = [
+        ("Revision", 1, "int"),
+        ("Reserved", 7, "int"),
+        ("Control", 8, "int"),
+        ("Status", 8, "int"),
+        ("Reserved", 8, "int"),
+        ("Event Log Entry", 16, "int"),
+        ("Reserved", 16, "int"),
+        ("Device Table Entry", 32, "int"),
+        ("Level 6 Page Table Entry", 8, "int"),
+        ("Level 5 Page Table Entry", 8, "int"),
+        ("Level 4 Page Table Entry", 8, "int"),
+        ("Level 3 Page Table Entry", 8, "int"),
+        ("Level 2 Page Table Entry", 8, "int"),
+        ("Level 1 Page Table Entry", 8, "int")
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode IOMMU DMA Error"""
+        print("IOMMU DMA Error")
+
+        for name, size, ftype in self.fields:
+            self.cper.decode(name, size, ftype)
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodeDMAIOMMU.guid, DecodeDMAIOMMU)]
+
+
+class DecodeCCIXPER():
+    """
+    Class to decode a CCIX Protocol Error as defined at
+    UEFI 2.8 - N.2.12. CCIX PER Log Error Section
+    """
+
+    # GUID for CCIX Protocol Error
+    guid = "91335ef6-ebfb-4478-a6a6-88b728cf75d7"
+
+    fields = [
+        ("Validation Bits", 8, "int"),
+        ("CCIX Source ID", 1, "int"),
+        ("CCIX Port ID", 1, "int"),
+        ("Reserved", 2, "int"),
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode CCIX Protocol Error"""
+        print("CCIX Protocol Error")
+
+        start = self.cper.pos
+
+        val = self.cper.decode("Length", 4, "int")
+        try:
+            length = int.from_bytes(val, byteorder='little')
+        except (ValueError, TypeError):
+            length = 0
+
+        for name, size, ftype in self.fields:
+            self.cper.decode(name, size, ftype)
+
+        # Length covers the whole section, counted from its first byte
+        remaining = max(0, length + start - self.cper.pos)
+        for dword in range(0, remaining // 4):
+            if self.cper.past_end:
+                return
+
+            self.cper.decode(f"CCIX PER log {dword}", 4, "int")
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodeCCIXPER.guid, DecodeCCIXPER)]
+
+
+class DecodeCXLProtErr():
+    """
+    Class to decode a CXL Protocol Error as defined at
+    UEFI 2.9 - N.2.13. Compute Express Link (CXL) Protocol Error Section
+    """
+
+    # GUID for CXL Protocol Error
+    guid = "80b9efb4-52b5-4de3-a777-68784b771048"
+
+    fields = [
+        ("Validation Bits", 8, "int"),
+        ("CXL Agent Type", 1, "int"),
+        ("Reserved", 7, "int"),
+        ("CXL Agent Address", 8, "int"),
+        ("Device ID", 16, "int"),
+        ("Device Serial Number", 8, "int"),
+        ("Capability Structure", 60, "int"),
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode CXL Protocol Error"""
+        print("CXL Protocol Error")
+
+        for name, size, ftype in self.fields:
+            self.cper.decode(name, size, ftype)
+
+        val = self.cper.decode("CXL DVSEC Length", 2, "int")
+        try:
+            cxl_devsec_len = int.from_bytes(val, byteorder='little')
+        except (ValueError, TypeError):
+            cxl_devsec_len = 0
+
+        val = self.cper.decode("CXL Error Log Length", 2, "int")
+        try:
+            cxl_error_log_len = int.from_bytes(val, byteorder='little')
+        except (ValueError, TypeError):
+            cxl_error_log_len = 0
+
+        self.cper.decode("Reserved", 4, "int")
+        self.cper.decode("CXL DVSEC", cxl_devsec_len, "int",
+                         show_incomplete=True)
+        self.cper.decode("CXL Error Log", cxl_error_log_len, "int",
+                         show_incomplete=True)
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodeCXLProtErr.guid, DecodeCXLProtErr)]
+
+
+class DecodeCXLCompEvent():
+    """
+    Class to decode a CXL Component Error as defined at
+    UEFI 2.9 - N.2.14. CXL Component Events Section
+
+    Currently, the decoder handles only the common fields, displaying
+    the CXL Component Event Log field in bytes.
+    """
+
+    # GUIDs, as defined at CXL specification 3.2: 8.2.10.2.1 Event Records
+    guids = [
+        ("General Media",              "fbcd0a77-c260-417f-85a9-088b1621eba6"),
+        ("DRAM",                       "601dcbb3-9c06-4eab-b8af-4e9bfb5c9624"),
+        ("Memory Module",              "fe927475-dd59-4339-a586-79bab113bc74"),
+        ("Memory Sparing",             "e71f3a40-2d29-4092-8a39-4d1c966c7c65"),
+        ("Physical Switch",            "77cf9271-9c02-470b-9fe4-bc7b75f2da97"),
+        ("Virtual Switch",             "40d26425-3396-4c4d-a5da-3d472a63af25"),
+        ("MLD Port",                   "8dc44363-0c96-4710-b7bf-04bb99534c3f"),
+        ("Dynamic Capacity",           "ca95afa7-f183-4018-8c2f-95268e101a2a"),
+    ]
+
+    fields = [
+        ("Validation Bits", 8, "int"),
+        ("Device ID", 12, "int"),
+        ("Device Serial Number", 8, "int")
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode CXL Component Event"""
+        for name, guid_event in DecodeCXLCompEvent.guids:
+            if guid == guid_event:
+                print(f"CXL {name} Event Record")
+                break
+
+        start = self.cper.pos
+
+        val = self.cper.decode("Length", 4, "int")
+        try:
+            length = int.from_bytes(val, byteorder='little')
+        except (ValueError, TypeError):
+            length = 0
+
+        for name, size, ftype in self.fields:
+            self.cper.decode(name, size, ftype)
+
+        # Length covers the whole section, counted from its first byte
+        length = max(0, length + start - self.cper.pos)
+
+        self.cper.decode("CXL Component Event Log", length, "int",
+                         show_incomplete=True)
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples, one per event GUID
+        """
+
+        guid_list = []
+
+        for _, guid in DecodeCXLCompEvent.guids:
+            guid_list.append((guid, DecodeCXLCompEvent))
+
+        return guid_list
+
+
+class DecodeFRUMemoryPoison():
+    """
+    Class to decode a FRU Memory Poison Section as defined at
+    UEFI 2.11 - N.2.15 FRU Memory Poison Section
+    """
+
+    # GUID for FRU Memory Poison Section
+    guid = "5e4706c1-5356-48c6-930b-52f2120a4458"
+
+    common_fields = [
+        ("Checksum", 4, "int"),
+        ("Validation Bits", 8, "int"),
+        ("FRU Architecture Type", 4, "int"),
+        ("FRU Architecture Value", 8, "int"),
+        ("FRU Identifier Type", 4, "int"),
+        ("FRU Identifier Value", 8, "int")
+    ]
+
+    poison_fields = [
+        ("Poison Timestamp", 8, "int"),
+        ("Hardware Identifier Type", 4, "int"),
+        ("Hardware Identifier Value", 8, "int"),
+        ("Address Type", 4, "int"),
+        ("Address Value", 8, "int")
+    ]
+
+    def __init__(self, cper: DecodeField):
+        self.cper = cper
+
+    def decode(self, guid):
+        """Decode FRU Memory Poison"""
+        print("FRU Memory Poison")
+
+        for name, size, ftype in self.common_fields:
+            self.cper.decode(name, size, ftype)
+
+        val = self.cper.decode("Poison List Entries", 4, "int")
+        try:
+            poison_list_entries = int.from_bytes(val, byteorder='little')
+        except (ValueError, TypeError):
+            poison_list_entries = 0
+
+        for entry in range(0, poison_list_entries):
+            if self.cper.past_end:
+                return
+
+            print()
+            print(f"Poison List {entry}")
+            for name, size, ftype in self.poison_fields:
+                if self.cper.past_end:
+                    return
+
+                self.cper.decode(name, size, ftype)
+
+    @staticmethod
+    def decode_list():
+        """
+        Returns a list of (GUID, class) tuples
+        """
+        return [(DecodeFRUMemoryPoison.guid, DecodeFRUMemoryPoison)]
+
+
+class DecodeGhesEntry():
+    """
+    Class to decode a GHESv2 element, as defined at:
+    ACPI 6.1: 18.3.2.8 Generic Hardware Error Source version 2
+    """
+
+    # Fields present on all CPER records
+    common_fields = [
+        # Generic Error Status Block fields
+        ("Block Status",           4, "int", None),
+        ("Raw Data Offset",        4, "int", "raw_data_offset"),
+        ("Raw Data Length",        4, "int", "raw_data_len"),
+        ("Data Length",            4, "int", None),
+        ("Error Severity",         4, "int", None),
+
+        # Generic Error Data Entry
+        ("Section Type",          16, "guid", "session_type"),
+        ("Error Severity",         4, "int", None),
+        ("Revision",               2, "int", None),
+        ("Validation Bits",        1, "int", None),
+        ("Flags",                  1, "int", None),
+        ("Error Data Length",      4, "int", None),
+        ("FRU Id",                16, "guid", None),
+        ("FRU Text",              20, "str", None),
+        ("Timestamp",              8, "bcd", None),
+    ]
+
+    def __init__(self, cper_data: bytearray):
+        """
+        Decodes a CPER byte array, printing the results on the
+        screen.
+        """
+
+        # Create a decode list with the per-type decoders
+        decode_list = []
+        decode_list += DecodeProcGeneric.decode_list()
+        decode_list += DecodeProcX86.decode_list()
+        decode_list += DecodeProcItanium.decode_list()
+        decode_list += DecodeProcArm.decode_list()
+        decode_list += DecodePlatformMem.decode_list()
+        decode_list += DecodePlatformMem2.decode_list()
+        decode_list += DecodePCIe.decode_list()
+        decode_list += DecodePCIBus.decode_list()
+        decode_list += DecodePCIDev.decode_list()
+        decode_list += DecodeFWError.decode_list()
+        decode_list += DecodeDMAGeneric.decode_list()
+        decode_list += DecodeDMAVT.decode_list()
+        decode_list += DecodeDMAIOMMU.decode_list()
+        decode_list += DecodeCCIXPER.decode_list()
+        decode_list += DecodeCXLProtErr.decode_list()
+        decode_list += DecodeCXLCompEvent.decode_list()
+        decode_list += DecodeFRUMemoryPoison.decode_list()
+
+        # Handle common types
+        cper = DecodeField(cper_data)
+
+        fields = {}
+        for name, size, ftype, var in self.common_fields:
+            val = cper.decode(name, size, ftype)
+
+            if ftype == "int":
+                try:
+                    val = int.from_bytes(val, byteorder='little')
+                except (ValueError, TypeError):
+                    val = 0
+
+            if var is not None:
+                fields[var] = val
+
+        if fields["raw_data_len"]:
+            cper.decode("Raw Data", fields["raw_data_len"],
+                        "int", pos=fields["raw_data_offset"])
+
+        if not fields["session_type"]:
+            return
+
+        print()
+
+        # Now, decode the rest of the record for known decoders
+        for guid, cls in decode_list:
+            if fields["session_type"] == guid:
+                dec = cls(cper)
+                dec.decode(guid)
+
+                if not cper.is_end:
+                    print()
+                    print("Warning: incomplete decode or broken CPER")
+                    if cper.remaining:
+                        cper.decode("Extra Data", cper.remaining, "int")
+
+                print()
+                return
+
+        # If we don't have a class to decode the full payload,
+        # output the undecoded part
+        print(f"Unknown GUID: {fields['session_type']}")
+        remaining = cper.remaining
+        if remaining:
+            cper.decode("Payload", remaining, "int")
+
+        print()
diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
index 63f3df2d75c3..32baca17ce10 100755
--- a/scripts/qmp_helper.py
+++ b/scripts/qmp_helper.py
@@ -21,6 +21,7 @@
     sys.path.append(os_path.join(qemu_dir, 'python'))
 
     from qemu.qmp.legacy import QEMUMonitorProtocol
+    from ghes_decode import DecodeGhesEntry
 
 except ModuleNotFoundError as exc:
     print(f"Module '{exc.name}' not found.")
@@ -677,6 +678,8 @@ def send_cper(self, notif_type, payload,
 
             util.dump_bytearray("Payload", payload)
 
+            DecodeGhesEntry(cper_data)
+
         return self.send_cper_raw(cper_data, timeout=timeout)
 
     def search_qom(self, path, prop, regex):
-- 
2.52.0
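
For reference, the mixed-endian layout handled by the "guid" branch of
the decoder in the patch above (first three fields byte-swapped, last
two kept as stored) can be cross-checked against Python's uuid module.
This is only a sketch: guid_to_str() is a hypothetical helper name that
does not exist in the patch, and the ARM Processor Error GUID is used
merely as sample input.

```python
import uuid


def guid_to_str(raw: bytes) -> str:
    """Format 16 raw bytes as a mixed-endian GUID string."""
    if len(raw) != 16:
        raise ValueError("a GUID needs exactly 16 bytes")

    parts = [
        raw[0:4][::-1],   # 32-bit field, stored little-endian
        raw[4:6][::-1],   # 16-bit field, stored little-endian
        raw[6:8][::-1],   # 16-bit field, stored little-endian
        raw[8:10],        # kept in stored order
        raw[10:16],       # kept in stored order
    ]
    return "-".join(p.hex() for p in parts)


# Sample input: the ARM Processor Error GUID, serialized the way it
# would appear inside a CPER record (uuid's "bytes_le" layout).
raw = uuid.UUID("e19e3d16-bc11-11e4-9caa-c2051d5d46b0").bytes_le
print(guid_to_str(raw))
```

The same string should come out of str(uuid.UUID(bytes_le=raw)), which
may be a simpler option if pulling in the uuid module is acceptable.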


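The "bcd" field type used for the timestamp packs two decimal digits
into each byte, so both nibbles have to be range-checked and both
contribute to the value. A minimal stand-alone sketch of that idea
follows; bcd_to_int() is a hypothetical name, and it returns a decimal
integer rather than the hex string the decoder prints.

```python
def bcd_to_int(raw: bytes) -> int:
    """Decode packed BCD (two decimal digits per byte) into an int."""
    value = 0
    for b in raw:
        hi, lo = b >> 4, b & 0x0f
        # Each nibble must be a decimal digit
        if hi > 9 or lo > 9:
            raise ValueError(f"Invalid BCD byte 0x{b:02x}")
        value = value * 100 + hi * 10 + lo
    return value


# Bytes 0x20 0x26 carry the decimal digits 2, 0, 2, 6
print(bcd_to_int(bytes([0x20, 0x26])))
```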


* [PATCH 08/13] scripts/ghes_inject: exit 1 if command was not sent
  2026-01-21 11:25 [PATCH 00/13] Add more commands to scripts/ghes_inject.py Mauro Carvalho Chehab
                   ` (6 preceding siblings ...)
  2026-01-21 11:25 ` [PATCH 07/13] scripts/ghes_inject: add a logic to decode CPER Mauro Carvalho Chehab
@ 2026-01-21 11:25 ` Mauro Carvalho Chehab
  2026-01-21 13:28   ` Jonathan Cameron via qemu development
  2026-01-21 11:25 ` [PATCH 09/13] scripts/ghes_inject: add a handler for PCIe bus error Mauro Carvalho Chehab
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 11:25 UTC (permalink / raw)
  To: Michael S Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-devel, Igor Mammedov,
	Mauro Carvalho Chehab, Cleber Rosa, John Snow

Add a return code to the subparser func() and exit with status 1 if
the command failed.
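
The pattern can be sketched in isolation as below. The handler names
(inject_ok, inject_fail), the "ok"/"fail" commands and the status 2
used when no command is given are assumptions made for illustration;
the real tool wires its own subcommands and prints a message instead.

```python
import argparse


def inject_ok(args):
    """Hypothetical handler: pretend the CPER was sent."""
    return True


def inject_fail(args):
    """Hypothetical handler: pretend sending the CPER failed."""
    return False


def run(argv):
    """Return the exit status for the given argument list."""
    parser = argparse.ArgumentParser()
    sub = parser.add_subparsers()
    sub.add_parser("ok").set_defaults(func=inject_ok)
    sub.add_parser("fail").set_defaults(func=inject_fail)

    args = parser.parse_args(argv)
    if "func" not in args:
        return 2    # assumption: arbitrary status for a missing command

    # A falsy return from the handler maps to exit status 1
    return 0 if args.func(args) else 1


print(run(["ok"]), run(["fail"]))
```

A caller would typically pass the result to sys.exit(), so that shell
scripts driving the injector can detect a failed injection.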

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 scripts/arm_processor_error.py | 2 +-
 scripts/ghes_inject.py         | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/scripts/arm_processor_error.py b/scripts/arm_processor_error.py
index 73d069f070d4..d9845adb0c0a 100644
--- a/scripts/arm_processor_error.py
+++ b/scripts/arm_processor_error.py
@@ -473,4 +473,4 @@ def send_cper(self, args):
 
         self.data = data
 
-        qmp_cmd.send_cper(cper_guid.CPER_PROC_ARM, self.data)
+        return qmp_cmd.send_cper(cper_guid.CPER_PROC_ARM, self.data)
diff --git a/scripts/ghes_inject.py b/scripts/ghes_inject.py
index 9a235201418b..6ac917d0b5db 100755
--- a/scripts/ghes_inject.py
+++ b/scripts/ghes_inject.py
@@ -43,7 +43,8 @@ def main():
 
     args = parser.parse_args()
     if "func" in args:
-        args.func(args)
+        if not args.func(args):
+            sys.exit(1)
     else:
         sys.exit(f"Please specify a valid command for {sys.argv[0]}")
 
-- 
2.52.0
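
The exit-status convention introduced by this patch can be sketched in isolation. This is only an illustration under stated assumptions, not the ghes_inject.py code: fake_inject() and dispatch() are hypothetical names, the handler stands in for a subcommand's send_cper(), and the status 2 used when no command is given is an arbitrary choice for the sketch.

```python
import argparse


def fake_inject(args):
    """Hypothetical handler: truthy on success, falsy on failure."""
    return not args.fail


def dispatch(argv):
    """Return the process exit status for the given command line."""
    parser = argparse.ArgumentParser(prog="demo")
    subparsers = parser.add_subparsers()

    sub = subparsers.add_parser("inject")
    sub.add_argument("--fail", action="store_true")
    sub.set_defaults(func=fake_inject)

    args = parser.parse_args(argv)
    if "func" not in args:
        return 2  # no command given (arbitrary status for this sketch)

    # A falsy return from the handler becomes a non-zero exit status
    return 0 if args.func(args) else 1
```

A caller would typically end with sys.exit(dispatch(sys.argv[1:])), so that shell scripts driving the tool can detect a failed injection.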




* [PATCH 09/13] scripts/ghes_inject: add a handler for PCIe bus error
  2026-01-21 11:25 [PATCH 00/13] Add more commands to scripts/ghes_inject.py Mauro Carvalho Chehab
                   ` (7 preceding siblings ...)
  2026-01-21 11:25 ` [PATCH 08/13] scripts/ghes_inject: exit 1 if command was not sent Mauro Carvalho Chehab
@ 2026-01-21 11:25 ` Mauro Carvalho Chehab
  2026-01-21 13:32   ` Jonathan Cameron via qemu development
  2026-01-21 11:25 ` [PATCH 10/13] scripts/ghes_inject: add support for fuzzy logic testing Mauro Carvalho Chehab
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 11:25 UTC (permalink / raw)
  To: Michael S Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-devel, Igor Mammedov,
	Mauro Carvalho Chehab, Cleber Rosa, John Snow

Add logic to inject PCIe bus errors.

In the Linux kernel, although the CPER_SEC_PCI_X_BUS macro is defined for
this event, ghes.c doesn't implement support for it yet:

[16950.077494] {26}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[16950.077866] {26}[Hardware Error]: event severity: recoverable
[16950.078118] {26}[Hardware Error]:  Error 0, type: recoverable
[16950.078444] {26}[Hardware Error]:   section type: unknown, c5753963-3b84-4095-bf78-eddad3f9c9dd
[16950.078800] {26}[Hardware Error]:   section length: 0x48
[16950.079069] {26}[Hardware Error]:   00000000: 00000000 00000000 00000000 00000000  ................
[16950.079442] {26}[Hardware Error]:   00000010: 00000001 00000000 00000000 00000000  ................
[16950.079811] {26}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
[16950.080181] {26}[Hardware Error]:   00000030: 00000000 00000000 00000000 00000000  ................
[16950.080538] {26}[Hardware Error]:   00000040: 00000000 00000000                    ........

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 MAINTAINERS               |   1 +
 scripts/ghes_inject.py    |   2 +
 scripts/pcie_bus_error.py | 146 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 149 insertions(+)
 create mode 100644 scripts/pcie_bus_error.py

diff --git a/MAINTAINERS b/MAINTAINERS
index a970c47dd089..48067a618523 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2228,6 +2228,7 @@ F: qapi/acpi-hest.json
 F: scripts/ghes_decode.py
 F: scripts/ghes_inject.py
 F: scripts/arm_processor_error.py
+F: scripts/pcie_bus_error.py
 F: scripts/qmp_helper.py
 
 ppc4xx
diff --git a/scripts/ghes_inject.py b/scripts/ghes_inject.py
index 6ac917d0b5db..29a6a57508cd 100755
--- a/scripts/ghes_inject.py
+++ b/scripts/ghes_inject.py
@@ -12,6 +12,7 @@
 import sys
 
 from arm_processor_error import ArmProcessorEinj
+from pcie_bus_error import PcieBusError
 
 EINJ_DESC = """
 Handle ACPI GHESv2 error injection logic QEMU QMP interface.
@@ -40,6 +41,7 @@ def main():
     subparsers = parser.add_subparsers()
 
     ArmProcessorEinj(subparsers)
+    PcieBusError(subparsers)
 
     args = parser.parse_args()
     if "func" in args:
diff --git a/scripts/pcie_bus_error.py b/scripts/pcie_bus_error.py
new file mode 100644
index 000000000000..e8285b5dcc84
--- /dev/null
+++ b/scripts/pcie_bus_error.py
@@ -0,0 +1,146 @@
+#!/usr/bin/env python3
+#
+# pylint: disable=C0114,R0903
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2024 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
+
+from qmp_helper import qmp, util, cper_guid
+
+class PcieBusError:
+    """
+    Implements PCI Express bus error injection via GHES
+    """
+
+    def __init__(self, subparsers):
+        """Initialize the error injection class and add subparser"""
+
+        # Valid values
+        self.valid_bits = {
+            "status": util.bit(0),
+            "type": util.bit(1),
+            "bus-id": util.bit(2),
+            "bus-addr": util.bit(3),
+            "bus-data": util.bit(4),
+            "command": util.bit(5),
+            "requestor-id": util.bit(6),
+            "completer-id": util.bit(7),
+            "target-id": util.bit(8),
+        }
+
+        self.bus_command_bits = {
+            "pci": 0,               # Bit 56 is zero
+            "pci-x": util.bit(56)
+        }
+
+        self.data = bytearray()
+
+        parser = subparsers.add_parser("pcie-bus",
+                                       description="Generate PCIe bus error CPER")
+        g_pcie = parser.add_argument_group("PCIe bus error")
+
+        valid_bits = ",".join(self.valid_bits.keys())
+        bus_command_bits = ",".join(self.bus_command_bits.keys())
+
+        g_pcie.add_argument("-v", "--valid",
+                            help=f"Valid bits: {valid_bits}")
+        g_pcie.add_argument("-s", "--error-status",
+                            type=lambda x: int(x, 0),
+                            help="Error Status")
+        g_pcie.add_argument("-t", "--error-type",
+                            type=lambda x: int(x, 0),
+                            help="Error type")
+        g_pcie.add_argument("-b", "--bus-number",
+                            type=lambda x: int(x, 0),
+                            help="Bus number")
+        g_pcie.add_argument("-S", "--segment-number",
+                            type=lambda x: int(x, 0),
+                            help="Segment number")
+        g_pcie.add_argument("-a", "--bus-address",
+                            type=lambda x: int(x, 0),
+                            help="Bus address")
+        g_pcie.add_argument("-d", "--bus-data",
+                            type=lambda x: int(x, 0),
+                            help="Bus data")
+        g_pcie.add_argument("-c", "--bus-command",
+                            help=f"bus-command: {bus_command_bits}")
+        g_pcie.add_argument("-r", "--bus-requestor",
+                            type=lambda x: int(x, 0),
+                            help="Bus requestor ID")
+        g_pcie.add_argument("-C", "--bus-completer",
+                            type=lambda x: int(x, 0),
+                            help="Bus completer ID")
+        g_pcie.add_argument("-i", "--target-id",
+                            type=lambda x: int(x, 0),
+                            help="Target ID")
+
+        parser.set_defaults(func=self.send_cper)
+
+    def send_cper(self, args):
+        """Parse subcommand arguments and send a CPER via QMP"""
+
+        qmp_cmd = qmp(args.host, args.port, args.debug)
+
+        cper = {}
+        arg = vars(args)
+
+        # Handle global parameters
+        if args.valid:
+            valid_init = False
+            cper["valid"] = util.get_choice(name="valid",
+                                            value=args.valid,
+                                            choices=self.valid_bits)
+        else:
+            cper["valid"] = 0
+            valid_init = True
+
+        if args.bus_command:
+            cper["bus-command"] = util.get_choice(name="bus-command",
+                                                    value=args.bus_command,
+                                                    choices=self.bus_command_bits)
+        if valid_init:
+            if args.error_status:
+                cper["valid"] |= self.valid_bits["status"]
+
+            if args.error_type:
+                cper["valid"] |= self.valid_bits["type"]
+
+            if args.bus_number and args.segment_number:
+                cper["valid"] |= self.valid_bits["bus-id"]
+
+            if args.bus_address:
+                cper["valid"] |= self.valid_bits["bus-addr"]
+
+            if args.bus_data:
+                cper["valid"] |= self.valid_bits["bus-data"]
+
+            if args.bus_requestor:
+                cper["valid"] |= self.valid_bits["requestor-id"]
+
+            if args.bus_completer:
+                cper["valid"] |= self.valid_bits["completer-id"]
+
+            if args.target_id:
+                cper["valid"] |= self.valid_bits["target-id"]
+
+        util.data_add(self.data, cper["valid"], 8)
+        util.data_add(self.data, arg.get("error_status") or 0, 8)
+        util.data_add(self.data, arg.get("error_type") or util.bit(0), 2)
+
+        # Bus ID
+        util.data_add(self.data, arg.get("bus_number") or 0, 1)
+        util.data_add(self.data, arg.get("segment_number") or 0, 1)
+
+        # Reserved
+        util.data_add(self.data, 0, 4)
+
+        util.data_add(self.data, arg.get("bus_address") or 0, 8)
+        util.data_add(self.data, arg.get("bus_data") or 0, 8)
+
+        util.data_add(self.data, cper.get("bus-command", 0), 8)
+
+        util.data_add(self.data, arg.get("bus_requestor") or 0, 8)
+        util.data_add(self.data, arg.get("bus_completer") or 0, 8)
+        util.data_add(self.data, arg.get("target_id") or 0, 8)
+
+        return qmp_cmd.send_cper(cper_guid.CPER_PCI_BUS, self.data)
-- 
2.52.0
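
The section layout built by send_cper() above adds up to 72 bytes, which matches the "section length: 0x48" line in the kernel log quoted in the commit message. A minimal sketch of the same layout using the standard struct module follows. It assumes little-endian packing with no padding (an assumption about what util.data_add() does, though it agrees with the 00000001 word at offset 0x10 in the dump), and pack_pci_bus_error() is a hypothetical helper, not part of the patch.

```python
import struct

# Field order follows the data_add() calls in the patch:
# valid(8) status(8) type(2) bus(1) segment(1) reserved(4)
# address(8) data(8) command(8) requestor(8) completer(8) target(8)
# "<" selects little-endian without padding (assumed here).
PCI_BUS_FMT = "<QQHBBIQQQQQQ"


def pack_pci_bus_error(valid=0, status=0, err_type=1, bus=0, segment=0,
                       address=0, data=0, command=0,
                       requestor=0, completer=0, target=0):
    """Pack a PCI/PCI-X bus error section into 72 bytes."""
    return struct.pack(PCI_BUS_FMT, valid, status, err_type, bus, segment,
                       0,  # reserved
                       address, data, command,
                       requestor, completer, target)


payload = pack_pci_bus_error()
```

Packing everything in one struct.pack() call makes the total size easy to check with struct.calcsize(PCI_BUS_FMT) before anything is sent.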




* [PATCH 10/13] scripts/ghes_inject: add support for fuzzy logic testing
  2026-01-21 11:25 [PATCH 00/13] Add more commands to scripts/ghes_inject.py Mauro Carvalho Chehab
                   ` (8 preceding siblings ...)
  2026-01-21 11:25 ` [PATCH 09/13] scripts/ghes_inject: add a handler for PCIe bus error Mauro Carvalho Chehab
@ 2026-01-21 11:25 ` Mauro Carvalho Chehab
  2026-01-21 13:37   ` Jonathan Cameron via qemu development
  2026-01-21 11:25 ` [PATCH 11/13] scripts/ghes_inject: add a raw error inject command Mauro Carvalho Chehab
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 11:25 UTC (permalink / raw)
  To: Michael S Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-devel, Igor Mammedov,
	Mauro Carvalho Chehab, Cleber Rosa, John Snow

Add a command to inject random errors for fuzzy logic testing.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 MAINTAINERS            |   1 +
 scripts/fuzzy_error.py | 206 +++++++++++++++++++++++++++++++++++++++++
 scripts/ghes_inject.py |   2 +
 3 files changed, 209 insertions(+)
 create mode 100644 scripts/fuzzy_error.py

diff --git a/MAINTAINERS b/MAINTAINERS
index 48067a618523..e553f8252f14 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2228,6 +2228,7 @@ F: qapi/acpi-hest.json
 F: scripts/ghes_decode.py
 F: scripts/ghes_inject.py
 F: scripts/arm_processor_error.py
+F: scripts/fuzzy_error.py
 F: scripts/pcie_bus_error.py
 F: scripts/qmp_helper.py
 
diff --git a/scripts/fuzzy_error.py b/scripts/fuzzy_error.py
new file mode 100644
index 000000000000..9f80abb72319
--- /dev/null
+++ b/scripts/fuzzy_error.py
@@ -0,0 +1,206 @@
+#!/usr/bin/env python3
+#
+# pylint: disable=C0114,R0903,R0912
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2024 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
+
+import argparse
+import sys
+
+from time import sleep
+from random import randrange
+from qmp_helper import qmp, util, cper_guid
+
+class FuzzyError:
+    """
+    Implements Fuzzy error injection via GHES
+    """
+
+    def __init__(self, subparsers):
+        """Initialize the error injection class and add subparser"""
+
+        # as defined at UEFI spec v2.10, section N.2.2
+        # Sizes here are just hints to have some default
+        self.types = {
+            "proc-generic": {
+                "guid": cper_guid.CPER_PROC_GENERIC,
+                "default_size": 192
+            },
+            "proc-x86": {
+                "guid": cper_guid.CPER_PROC_X86,
+                "default_size": 64
+            },
+            "proc-itanium": {
+                "guid": cper_guid.CPER_PROC_ITANIUM,
+                "default_size": 64
+            },
+            "proc-arm": {
+                "guid": cper_guid.CPER_PROC_ARM,
+                "default_size": 72
+            },
+            "platform-mem": {
+                "guid": cper_guid.CPER_PLATFORM_MEM,
+                "default_size": 80
+            },
+            "platform-mem2": {
+                "guid": cper_guid.CPER_PLATFORM_MEM2,
+                "default_size": 96
+            },
+            "pcie": {
+                "guid": cper_guid.CPER_PCIE,
+                "default_size": 208
+            },
+            "pci-bus": {
+                "guid": cper_guid.CPER_PCI_BUS,
+                "default_size": 72
+            },
+            "pci-dev": {
+                "guid": cper_guid.CPER_PCI_DEV,
+                "default_size": 56
+            },
+            "firmware-error": {
+                "guid": cper_guid.CPER_FW_ERROR,
+                "default_size": 32
+            },
+            "dma-generic": {
+                "guid": cper_guid.CPER_DMA_GENERIC,
+                "default_size": 32
+            },
+            "dma-vt": {
+                "guid": cper_guid.CPER_DMA_VT,
+                "default_size": 144
+            },
+            "dma-iommu": {
+                "guid": cper_guid.CPER_DMA_IOMMU,
+                "default_size": 144
+            },
+            "ccix-per": {
+                "guid": cper_guid.CPER_CCIX_PER,
+                "default_size": 36
+            },
+            "cxl-prot-err": {
+                "guid": cper_guid.CPER_CXL_PROT_ERR,
+                "default_size": 116
+            },
+            "cxl-evt-media": {
+                "guid": cper_guid.CPER_CXL_EVT_GEN_MEDIA,
+                "default_size": 32
+            },
+            "cxl-evt-dram": {
+                "guid": cper_guid.CPER_CXL_EVT_DRAM,
+                "default_size": 64
+            },
+            "cxl-evt-mem-module": {
+                "guid": cper_guid.CPER_CXL_EVT_MEM_MODULE,
+                "default_size": 64
+            },
+            "cxl-evt-mem-sparing": {
+                "guid": cper_guid.CPER_CXL_EVT_MEM_SPARING,
+                "default_size": 64
+            },
+            "cxl-evt-phy-sw": {
+                "guid": cper_guid.CPER_CXL_EVT_PHY_SW,
+                "default_size": 64
+            },
+            "cxl-evt-virt-sw": {
+                "guid": cper_guid.CPER_CXL_EVT_VIRT_SW,
+                "default_size": 64
+            },
+            "cxl-evt-mld-port": {
+                "guid": cper_guid.CPER_CXL_EVT_MLD_PORT,
+                "default_size": 64
+            },
+            "cxl-evt-dyna-cap": {
+                "guid": cper_guid.CPER_CXL_EVT_DYNA_CAP,
+                "default_size": 64
+            },
+            "fru-mem-poison": {
+                "guid": cper_guid.CPER_FRU_MEM_POISON,
+                "default_size": 72
+            },
+        }
+
+        parser = subparsers.add_parser("fuzzy-test", aliases=['fuzzy'],
+                                       description="Inject a fuzzy test CPER",
+                                       formatter_class=argparse.RawTextHelpFormatter)
+        g_fuzzy = parser.add_argument_group("Fuzz testing error inject")
+
+
+        cper_types = ",".join(self.types.keys())
+
+        g_fuzzy.add_argument("-T", "--type",
+                            help=f"Type of the error: {cper_types}")
+        g_fuzzy.add_argument("--min-size",
+                    type=lambda x: int(x, 0),
+                    help="Minimal size of the CPER")
+        g_fuzzy.add_argument("--max-size",
+                    type=lambda x: int(x, 0),
+                    help="Maximal size of the CPER")
+        g_fuzzy.add_argument("-z", "--zero", action="store_true",
+                            help="Zero all bytes of the CPER payload (default: %(default)s)")
+        g_fuzzy.add_argument("-t", "--timeout", type=float,
+                    default=30.0,
+                    help="Specify timeout for CPER send retries (default: %(default)s seconds)")
+        g_fuzzy.add_argument("-d", "--delay", type=float,
+                    default=0,
+                    help="Specify a delay between multiple CPER (default: %(default)s)")
+        g_fuzzy.add_argument("-c", "--count", type=int,
+                    default=1,
+                    help="Specify the number of CPER records to be sent (default: %(default)s)")
+
+        parser.set_defaults(func=self.send_cper)
+
+    def send_cper(self, args):
+        """Parse subcommand arguments and send a CPER via QMP"""
+
+        qmp_cmd = qmp(args.host, args.port, args.debug)
+
+        args.count = max(args.count, 1)
+
+        for i in range(0, args.count):
+            if i:
+                if args.delay > 0:
+                    sleep(args.delay)
+
+            # Handle global parameters
+            if args.type:
+                if not args.type in self.types:
+                    sys.exit(f"Invalid type: {args.type}")
+
+                inj_type = args.type
+            else:
+                i = randrange(len(self.types))
+                keys = list(self.types.keys())
+                inj_type = keys[i]
+
+            inject = self.types[inj_type]
+
+            guid = inject["guid"]
+            min_size = inject["default_size"]
+            max_size = min_size
+
+            if args.min_size:
+                min_size = args.min_size
+
+            if args.max_size:
+                max_size = args.max_size
+
+            size = min_size
+
+            if min_size < max_size:
+                size += randrange(max_size - min_size)
+
+            data = bytearray()
+
+            if not args.zero:
+                for i in range(size):
+                    util.data_add(data, randrange(256), 1)
+            else:
+                for i in range(size):
+                    util.data_add(data, 0, 1)
+
+            print(f"Injecting {inj_type} with {size} bytes")
+            ret = qmp_cmd.send_cper(guid, data, timeout=args.timeout)
+            if ret and ret != "OK":
+                return ret
diff --git a/scripts/ghes_inject.py b/scripts/ghes_inject.py
index 29a6a57508cd..9b0a2443fc97 100755
--- a/scripts/ghes_inject.py
+++ b/scripts/ghes_inject.py
@@ -13,6 +13,7 @@
 
 from arm_processor_error import ArmProcessorEinj
 from pcie_bus_error import PcieBusError
+from fuzzy_error import FuzzyError
 
 EINJ_DESC = """
 Handle ACPI GHESv2 error injection logic QEMU QMP interface.
@@ -42,6 +43,7 @@ def main():
 
     ArmProcessorEinj(subparsers)
     PcieBusError(subparsers)
+    FuzzyError(subparsers)
 
     args = parser.parse_args()
     if "func" in args:
-- 
2.52.0
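
The size selection and payload fill done inside the fuzzy-test loop can be sketched as a standalone function. This is an illustration only: fuzz_payload() and its parameters are hypothetical names, and an explicit random.Random instance is used so runs can be made reproducible, which the patch itself does not do.

```python
from random import Random


def fuzz_payload(default_size, min_size=None, max_size=None,
                 zero=False, rng=None):
    """Build a payload sized and filled like the fuzzy-test loop does."""
    if rng is None:
        rng = Random()

    # Unset (or zero) limits fall back to the per-type default size
    low = min_size if min_size else default_size
    high = max_size if max_size else default_size

    size = low
    if low < high:
        # randrange(n) yields 0..n-1, so size lands in [low, high)
        size += rng.randrange(high - low)

    if zero:
        return bytearray(size)
    return bytearray(rng.randrange(256) for _ in range(size))
```

Note that with this arithmetic the maximum size is exclusive: a range of 16 to 32 produces payloads of 16 to 31 bytes.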




* [PATCH 11/13] scripts/ghes_inject: add a raw error inject command
  2026-01-21 11:25 [PATCH 00/13] Add more commands to scripts/ghes_inject.py Mauro Carvalho Chehab
                   ` (9 preceding siblings ...)
  2026-01-21 11:25 ` [PATCH 10/13] scripts/ghes_inject: add support for fuzzy logic testing Mauro Carvalho Chehab
@ 2026-01-21 11:25 ` Mauro Carvalho Chehab
  2026-01-21 11:25 ` [PATCH 12/13] scripts/ghes_inject: print help if no command specified Mauro Carvalho Chehab
  2026-01-21 11:25 ` [PATCH 13/13] scripts/ghes_inject: improve help message Mauro Carvalho Chehab
  12 siblings, 0 replies; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 11:25 UTC (permalink / raw)
  To: Michael S Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-devel, Igor Mammedov,
	Mauro Carvalho Chehab, Cleber Rosa, John Snow

Add a command to replay a raw CPER record.

This helps to reproduce a condition that happened before.

The input format of the file is identical to the hexadecimal
dump generated by the ghes_inject tool.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 MAINTAINERS            |   1 +
 scripts/ghes_inject.py |   2 +
 scripts/raw_error.py   | 175 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 178 insertions(+)
 create mode 100644 scripts/raw_error.py

diff --git a/MAINTAINERS b/MAINTAINERS
index e553f8252f14..9d74d057048e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2231,6 +2231,7 @@ F: scripts/arm_processor_error.py
 F: scripts/fuzzy_error.py
 F: scripts/pcie_bus_error.py
 F: scripts/qmp_helper.py
+F: scripts/raw_error.py
 
 ppc4xx
 L: qemu-ppc@nongnu.org
diff --git a/scripts/ghes_inject.py b/scripts/ghes_inject.py
index 9b0a2443fc97..781b37cc68af 100755
--- a/scripts/ghes_inject.py
+++ b/scripts/ghes_inject.py
@@ -14,6 +14,7 @@
 from arm_processor_error import ArmProcessorEinj
 from pcie_bus_error import PcieBusError
 from fuzzy_error import FuzzyError
+from raw_error import RawError
 
 EINJ_DESC = """
 Handle ACPI GHESv2 error injection logic QEMU QMP interface.
@@ -44,6 +45,7 @@ def main():
     ArmProcessorEinj(subparsers)
     PcieBusError(subparsers)
     FuzzyError(subparsers)
+    RawError(subparsers)
 
     args = parser.parse_args()
     if "func" in args:
diff --git a/scripts/raw_error.py b/scripts/raw_error.py
new file mode 100644
index 000000000000..f5e77bdfcead
--- /dev/null
+++ b/scripts/raw_error.py
@@ -0,0 +1,175 @@
+#!/usr/bin/env python3
+#
+# pylint: disable=C0114,R0903,R0912,R0914,R0915,R1732
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2024 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
+
+import argparse
+import os
+import re
+import sys
+
+from time import sleep
+
+from qmp_helper import qmp, guid
+
+class RawError:
+    """
+    Injects errors from a file containing raw data
+    """
+
+    SCRIPT_NAME = sys.argv[0]
+
+    HELP=f"""
+    Inject a CPER record from a previously recorded one.
+
+    One or more CPER records can be recorded. The records to be
+    injected are read from a specified file or from stdin and should
+    have the format produced by this script when using --debug, e.g.:
+
+    GUID: e19e3d16-bc11-11e4-9caa-c2051d5d46b0
+    CPER:
+        00000000  04 00 00 00 02 00 01 00 88 00 00 00 00 00 00 00   ................
+        00000010  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
+        00000020  00 00 00 00 00 00 00 00 00 20 05 00 08 02 00 03   ......... ......
+        00000030  ff 0f 46 d6 80 00 00 00 ef be ad de 00 00 00 00   ..F.............
+        00000040  ad 0b ba ab 00 00 00 00 00 20 04 00 04 01 00 03   ......... ......
+        00000050  7f 00 54 00 00 00 00 00 ef be ad de 00 00 00 00   ..T.............
+        00000060  ad 0b ba ab 00 00 00 00 00 00 05 00 18 00 00 00   ................
+        00000070  ef be ad de 00 00 00 00 ab ba ba ab 00 00 00 00   ................
+        00000080  00 00 00 00 00 00 00 00                           ........
+
+    Multiple such records can be used. In that case, a delay will
+    be introduced between them.
+
+    All lines that can't be parsed will be silently ignored.
+    As such, the output of this help can be piped to the raw-error
+    generator with:
+
+        {SCRIPT_NAME} -d raw-error -h | {SCRIPT_NAME} -d raw-error
+    """
+
+    def __init__(self, subparsers):
+        """Initialize the error injection class and add subparser"""
+
+        self.payload = bytearray()
+        self.inj_type = None
+        self.size = 0
+
+        parser = subparsers.add_parser("raw-error",  aliases=['raw'],
+                                       description=self.HELP,
+                                       formatter_class=argparse.RawTextHelpFormatter)
+
+        parser.add_argument("-f", "--file",
+                            help="File name with the raw error data. '-' for stdin")
+        parser.add_argument("-d", "--delay", type=lambda x: int(x, 0),
+                            default=1,
+                            help="Specify a delay between multiple CPER. Default=1")
+
+        parser.set_defaults(func=self.send_cper)
+
+    def send_cper(self, args):
+        """Parse subcommand arguments and send a CPER via QMP"""
+
+        if not args.file:
+            args.file='-'
+
+        is_guid = re.compile(r"^\s*guid:\s*(\w+\-\w+\-\w+\-\w+-\w+)", re.I)
+        is_gesb = re.compile(r"^Generic Error Status Block.*:", re.I)
+        is_gede = re.compile(r"^Generic Error Data Entry.*:", re.I)
+        is_raw_data = re.compile(r"^Raw data.*:", re.I)
+        is_payload = re.compile(r"^(Payload|CPER).*:", re.I)
+        is_hexdump = re.compile(r"^(\s*[\da-f]........\s+)(.*)\s\s+.*", re.I)
+        is_hex = re.compile(r"\b([\da-f].)\b", re.I)
+
+        cper = []
+
+        if args.file == "-":
+            fp = sys.stdin
+            if os.isatty(0):
+                print("Using stdin. Press CTRL-D to finish input.")
+            else:
+                print("Reading from stdin pipe")
+        else:
+            try:
+                fp = open(args.file, encoding="utf-8")
+            except FileNotFoundError:
+                sys.exit('File Not Found')
+
+        guid_obj = None
+        gebs = bytearray()
+        gede = bytearray()
+        raw_data = bytearray()
+        payload = bytearray()
+        ln_used = 0
+        ln = 0
+
+        cur = payload
+
+        for ln, line in enumerate(fp):
+            if match := is_guid.search(line):
+                if guid_obj and payload:
+                    cper.append({"guid": guid_obj, "payload": payload,
+                                 "gede": gede, "gebs": gebs,
+                                 "raw-data": raw_data})
+                    payload, gebs = bytearray(), bytearray()
+                    gede, raw_data = bytearray(), bytearray()
+
+                guid_obj = guid.UUID(match.group(1))
+
+                ln_used += 1
+                continue
+
+            if match := is_gesb.match(line):
+                cur = gebs
+                continue
+
+            if match := is_gede.match(line):
+                cur = gede
+                continue
+
+            if match := is_payload.match(line):
+                cur = payload
+                continue
+
+            if match := is_raw_data.match(line):
+                cur = raw_data
+                continue
+
+            new = is_hexdump.sub(r"\2", line)
+            if new != line:
+                if match := is_hex.findall(new):
+                    for m in match:
+                        cur.extend(bytes.fromhex(m))
+                    ln_used += 1
+                    continue
+                continue
+
+        if guid_obj and payload:
+            cper.append({"guid": guid_obj,
+                         "payload": payload,
+                         "gede": gede,
+                         "gebs": gebs,
+                         "raw-data": raw_data})
+
+        print(f"{ln} lines read, {ln - ln_used} lines ignored.")
+
+        if fp is not sys.stdin:
+            fp.close()
+
+        qmp_cmd = qmp(args.host, args.port, args.debug)
+
+        if not cper:
+            sys.exit("Format of the file not recognized.")
+
+        for i, c in enumerate(cper):
+            if i:
+                sleep(args.delay)
+
+            ret = qmp_cmd.send_cper(c["guid"], c["payload"], gede=c["gede"],
+                                    gebs=c["gebs"], raw_data=c["raw-data"])
+            if not ret:
+                return ret
+
+        return True
-- 
2.52.0
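
The core idea of the raw-error parser, turning a "GUID:" line plus hexdump lines back into bytes, can be shown with a reduced sketch. It is not the patch's implementation: parse_dump() and both regular expressions are written for this example, and it only handles the GUID and hexdump lines, ignoring the separate Generic Error Status Block, Generic Error Data Entry and raw data sections that the patch tracks.

```python
import re
import uuid

GUID_RE = re.compile(
    r"^\s*GUID:\s*([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})", re.I)
# Offset column, then one or more two-digit byte groups. The ASCII
# column is separated by extra spaces, so it is never matched.
HEX_RE = re.compile(r"^\s*[0-9a-f]{8}\s+((?:[0-9a-f]{2}(?: |$))+)", re.I)


def parse_dump(text):
    """Return a list of (uuid.UUID, bytes) tuples found in a debug dump."""
    records = []
    guid, data = None, bytearray()

    for line in text.splitlines():
        if match := GUID_RE.match(line):
            # A new GUID closes the record collected so far
            if guid is not None and data:
                records.append((guid, bytes(data)))
            guid, data = uuid.UUID(match.group(1)), bytearray()
        elif match := HEX_RE.match(line):
            # bytes.fromhex() skips the spaces between byte groups
            data += bytes.fromhex(match.group(1))

    if guid is not None and data:
        records.append((guid, bytes(data)))
    return records
```

Lines that match neither pattern are skipped, which is what lets help text or log noise surround the records.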




* [PATCH 12/13] scripts/ghes_inject: print help if no command specified
  2026-01-21 11:25 [PATCH 00/13] Add more commands to scripts/ghes_inject.py Mauro Carvalho Chehab
                   ` (10 preceding siblings ...)
  2026-01-21 11:25 ` [PATCH 11/13] scripts/ghes_inject: add a raw error inject command Mauro Carvalho Chehab
@ 2026-01-21 11:25 ` Mauro Carvalho Chehab
  2026-01-21 13:42   ` Jonathan Cameron via qemu development
  2026-01-21 11:25 ` [PATCH 13/13] scripts/ghes_inject: improve help message Mauro Carvalho Chehab
  12 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 11:25 UTC (permalink / raw)
  To: Michael S Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-devel, Igor Mammedov,
	Mauro Carvalho Chehab, Cleber Rosa, John Snow

The first positional argument (command) is mandatory. If not
specified, instead of a simple error message, show help as well.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 scripts/ghes_inject.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/scripts/ghes_inject.py b/scripts/ghes_inject.py
index 781b37cc68af..4028fdb15582 100755
--- a/scripts/ghes_inject.py
+++ b/scripts/ghes_inject.py
@@ -52,6 +52,9 @@ def main():
         if not args.func(args):
             sys.exit(1)
     else:
+        print("Error: no command specified\n", file=sys.stderr)
+        parser.print_help(file=sys.stderr)
+        print(file=sys.stderr)
         sys.exit(f"Please specify a valid command for {sys.argv[0]}")
 
 if __name__ == "__main__":
-- 
2.52.0
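
The behaviour added here, printing the full help to stderr when no subcommand was given, can be tried out with a small standalone parser. The names below (build_parser(), run(), the "inject" command) are hypothetical and only serve the example; the error text is captured in a string instead of being written to sys.stderr so it can be inspected.

```python
import argparse
import io


def build_parser():
    """Create a parser with a single dummy subcommand."""
    parser = argparse.ArgumentParser(prog="demo")
    subparsers = parser.add_subparsers()
    sub = subparsers.add_parser("inject", help="Inject something")
    sub.set_defaults(func=lambda args: True)
    return parser


def run(argv):
    """Return (exit_status, error_text) for the given command line."""
    parser = build_parser()
    args = parser.parse_args(argv)

    if "func" in args:
        return (0 if args.func(args) else 1), ""

    # Nothing to dispatch: explain the problem, then show the full help
    err = io.StringIO()
    print("Error: no command specified\n", file=err)
    parser.print_help(file=err)
    return 1, err.getvalue()
```

Passing file= to print_help() keeps the help text out of stdout, so a pipeline consuming the tool's normal output is not polluted by the error path.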




* [PATCH 13/13] scripts/ghes_inject: improve help message
  2026-01-21 11:25 [PATCH 00/13] Add more commands to scripts/ghes_inject.py Mauro Carvalho Chehab
                   ` (11 preceding siblings ...)
  2026-01-21 11:25 ` [PATCH 12/13] scripts/ghes_inject: print help if no command specified Mauro Carvalho Chehab
@ 2026-01-21 11:25 ` Mauro Carvalho Chehab
  2026-01-21 13:43   ` Jonathan Cameron via qemu development
  12 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 11:25 UTC (permalink / raw)
  To: Michael S Tsirkin
  Cc: Jonathan Cameron, Shiju Jose, qemu-devel, Igor Mammedov,
	Mauro Carvalho Chehab, Cleber Rosa, John Snow

Add a one-liner help message for each type of error inject
command, and use raw formatter to keep line breaks.

While here, use a more uniform language.

With that, "ghes_inject -h" will now show:

  usage: ghes_inject.py [options]

  Handles ACPI GHESv2 error injection via the QEMU QMP interface.

  It uses UEFI BIOS APEI features to generate GHES records, which helps to
  test CPER and GHES drivers on the guest OS and see how user‑space
  applications on that guest handle such errors.

  positional arguments:
    {arm,pcie-bus,fuzzy-test,fuzzy,raw-error,raw}
      arm                 Inject an ARM processor error CPER, compatible with
                          UEFI 2.9A Errata.
      pcie-bus            Inject a PCIe bus error CPER
      fuzzy-test (fuzzy)  Inject fuzzy test CPER packets
      raw-error (raw)     Inject CPER records from previously recorded ones.

  options:
    -h, --help            show this help message and exit

  QEMU QMP socket options:
    -H, --host HOST       host name (default: localhost)
    -P, --port PORT       TCP port number (default: 4445)
    -d, --debug           enable debug output (default: False)

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 scripts/arm_processor_error.py |  6 ++++--
 scripts/fuzzy_error.py         |  4 +++-
 scripts/ghes_inject.py         | 18 +++++++++---------
 scripts/pcie_bus_error.py      |  4 +++-
 scripts/raw_error.py           |  6 +++---
 5 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/scripts/arm_processor_error.py b/scripts/arm_processor_error.py
index d9845adb0c0a..597382031ab8 100644
--- a/scripts/arm_processor_error.py
+++ b/scripts/arm_processor_error.py
@@ -122,7 +122,7 @@ class ArmProcessorEinj:
     """
 
     DESC = """
-    Generates an ARM processor error CPER, compatible with
+    Inject an ARM processor error CPER, compatible with
     UEFI 2.9A Errata.
     """
 
@@ -169,7 +169,9 @@ def __init__(self, subparsers):
 
         self.data = bytearray()
 
-        parser = subparsers.add_parser("arm", description=self.DESC)
+        parser = subparsers.add_parser("arm",
+                                       help=self.DESC,
+                                       description=self.DESC)
 
         arm_valid_bits = ",".join(self.arm_valid_bits.keys())
         flags = ",".join(self.pei_flags.keys())
diff --git a/scripts/fuzzy_error.py b/scripts/fuzzy_error.py
index 9f80abb72319..3ddb90f743a1 100644
--- a/scripts/fuzzy_error.py
+++ b/scripts/fuzzy_error.py
@@ -121,8 +121,10 @@ def __init__(self, subparsers):
             },
         }
 
+        DESC = "Inject fuzzy test CPER packets"
+
         parser = subparsers.add_parser("fuzzy-test", aliases=['fuzzy'],
-                                       description="Inject a fuzzy test CPER",
+                                       help=DESC, description=DESC,
                                        formatter_class=argparse.RawTextHelpFormatter)
         g_fuzzy = parser.add_argument_group("Fuzz testing error inject")
 
diff --git a/scripts/ghes_inject.py b/scripts/ghes_inject.py
index 4028fdb15582..488d10ffcafd 100755
--- a/scripts/ghes_inject.py
+++ b/scripts/ghes_inject.py
@@ -17,28 +17,28 @@
 from raw_error import RawError
 
 EINJ_DESC = """
-Handle ACPI GHESv2 error injection logic QEMU QMP interface.
+Handles ACPI GHESv2 error injection via the QEMU QMP interface.
 
-It allows using UEFI BIOS EINJ features to generate GHES records.
-
-It helps testing CPER and GHES drivers at the guest OS and how
-userspace applications at the guest handle them.
+It uses UEFI BIOS APEI features to generate GHES records, which helps to
+test CPER and GHES drivers on the guest OS and see how user‑space
+applications on that guest handle such errors.
 """
 
 def main():
     """Main program"""
 
     # Main parser - handle generic args like QEMU QMP TCP socket options
-    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    parser = argparse.ArgumentParser(formatter_class=argparse.RawDescriptionHelpFormatter,
                                      usage="%(prog)s [options]",
                                      description=EINJ_DESC)
 
     g_options = parser.add_argument_group("QEMU QMP socket options")
     g_options.add_argument("-H", "--host", default="localhost", type=str,
-                           help="host name")
+                           help="host name (default: %(default)s)")
     g_options.add_argument("-P", "--port", default=4445, type=int,
-                           help="TCP port number")
-    g_options.add_argument('-d', '--debug', action='store_true')
+                           help="TCP port number (default: %(default)s)")
+    g_options.add_argument('-d', '--debug', action='store_true',
+                           help="enable debug output (default: %(default)s)")
 
     subparsers = parser.add_subparsers()
 
diff --git a/scripts/pcie_bus_error.py b/scripts/pcie_bus_error.py
index e8285b5dcc84..843ec09c7572 100644
--- a/scripts/pcie_bus_error.py
+++ b/scripts/pcie_bus_error.py
@@ -35,8 +35,10 @@ def __init__(self, subparsers):
 
         self.data = bytearray()
 
+        DESC = "Inject a PCIe bus error CPER"
+
         parser = subparsers.add_parser("pcie-bus",
-                                       description="Generate PCIe bus error CPER")
+                                       help=DESC, description=DESC)
         g_pcie = parser.add_argument_group("PCIe bus error")
 
         valid_bits = ",".join(self.valid_bits.keys())
diff --git a/scripts/raw_error.py b/scripts/raw_error.py
index f5e77bdfcead..1e9eb1bcf15b 100644
--- a/scripts/raw_error.py
+++ b/scripts/raw_error.py
@@ -21,8 +21,8 @@ class RawError:
 
     SCRIPT_NAME = sys.argv[0]
 
-    HELP=f"""
-    Inject a CPER record from a previously recorded one.
+    HELP="Inject CPER records from previously recorded ones."
+    DESC=HELP + f"""
 
     One or more CPER records can be recorded. The records to be
     injected are read from an specified file or from stdin and should
@@ -58,7 +58,7 @@ def __init__(self, subparsers):
         self.size = 0
 
         parser = subparsers.add_parser("raw-error",  aliases=['raw'],
-                                       description=self.HELP,
+                                       help=self.HELP, description=self.DESC,
                                        formatter_class=argparse.RawTextHelpFormatter)
 
         parser.add_argument("-f", "--file",
-- 
2.52.0
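
The help-text change above relies on two argparse behaviours: a subcommand gets a one-line summary in the top-level listing only when add_parser() receives help=, and RawDescriptionHelpFormatter keeps the line breaks of the description. A minimal sketch of both follows. The program name "demo_inject.py", the "quiet" subcommand and the description text are made up for illustration and are not part of the patch.

```python
import argparse

# Illustrative description; the raw formatter prints it without re-wrapping.
DESC = """
Handles error injection (illustrative description).

Line breaks in this text are preserved by the raw formatter.
"""


def build_parser():
    """Build a small parser mirroring the structure used in the patch."""
    parser = argparse.ArgumentParser(
        prog="demo_inject.py",  # hypothetical program name
        formatter_class=argparse.RawDescriptionHelpFormatter,
        description=DESC)

    # With a raw formatter, defaults are no longer appended automatically,
    # so they are spelled out in each help string.
    parser.add_argument("-P", "--port", default=4445, type=int,
                        help="TCP port number (default: %(default)s)")

    subparsers = parser.add_subparsers()

    # help= supplies the summary shown by the top-level -h,
    # description= is shown by "pcie-bus -h".
    subparsers.add_parser("pcie-bus",
                          help="Inject a PCIe bus error CPER",
                          description="Inject a PCIe bus error CPER")

    # Without help=, the subcommand is accepted but gets no summary line.
    subparsers.add_parser("quiet", description="No summary in the listing")

    return parser


if __name__ == "__main__":
    print(build_parser().format_help())
```

Passing the same string as both help= and description= keeps the two views consistent, which is what the patch does for each subcommand.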




* Re: [PATCH 01/13] scripts/qmp_helper: add a return code to send_cper
  2026-01-21 11:25 ` [PATCH 01/13] scripts/qmp_helper: add a return code to send_cper Mauro Carvalho Chehab
@ 2026-01-21 12:08   ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-01-21 12:08 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow

On Wed, 21 Jan 2026 12:25:09 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> When used inside a loop, it is interesting to have a return

Maybe 'useful' rather than 'interesting'?

> code to indicate weather a send cper command succedded or not.

whether

> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Otherwise LGTM
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>



* Re: [PATCH 02/13] scripts/qmp_helper: add missing CXL UEFI GUID
  2026-01-21 11:25 ` [PATCH 02/13] scripts/qmp_helper: add missing CXL UEFI GUID Mauro Carvalho Chehab
@ 2026-01-21 12:26     ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron @ 2026-01-21 12:26 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow, linux-cxl

On Wed, 21 Jan 2026 12:25:10 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> The UEFI 2.11 - N.2.14. CXL Component Events Section states that
> XL events are described at CXL specification 3.2:
CXL

>         8.2.10.2.1 Event Records
> 
> Add the GUIDs defined here to fuzzy logic error injection code.

+CC linux-cxl, as more folks there will be familiar with this
stuff.

Some of these won't be seen on a host. The same event
infrastructure is used for reporting on out of band interfaces
and some in band ones, but not ones that will turn up on the
mailboxes that firmware will be using to get info.

> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
>  scripts/qmp_helper.py | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
> index 249a8c7187d1..7e786c4adfd9 100755
> --- a/scripts/qmp_helper.py
> +++ b/scripts/qmp_helper.py
> @@ -711,3 +711,28 @@ class cper_guid:
>      CPER_CXL_PROT_ERR = guid(0x80B9EFB4, 0x52B5, 0x4DE3,
>                               [0xA7, 0x77, 0x68, 0x78,
>                                0x4B, 0x77, 0x10, 0x48])
> +
> +    CPER_CXL_EVT_GEN_MEDIA = guid(0xFBCD0A77, 0xC260, 0x417F,
> +                                  [0x85, 0xA9, 0x08, 0x8B,
> +                                   0x16, 0x21, 0xEB, 0xA6])
> +    CPER_CXL_EVT_DRAM = guid(0x601DCBB3, 0x9C06, 0x4EAB,
> +                             [0xB8, 0xAF, 0x4E, 0x9B,
> +                              0xFB, 0x5C, 0x96, 0x24])
> +    CPER_CXL_EVT_MEM_MODULE = guid(0xFE927475, 0xDD59, 0x4339,
> +                                   [0xA5, 0x86, 0x79, 0xBA,
> +                                    0xB1, 0x13, 0xBC, 0x74])
> +    CPER_CXL_EVT_MEM_SPARING = guid(0xE71F3A40, 0x2D29, 0x4092,
> +                                    [0x8A, 0x39, 0x4D, 0x1C,
> +                                     0x96, 0x6C, 0x7C, 0x65])

The above are all fine I think.

From here on I think they will never come via a CPER record.

> +    CPER_CXL_EVT_PHY_SW = guid(0x77CF9271, 0x9C02, 0x470B,
> +                               [0x9F, 0xE4, 0xBC, 0x7B,
> +                                0x75, 0xF2, 0xDA, 0x97])

This is only going to surface over either out of band or switch CCI.
I'd be very surprised to see firmware anywhere near these.
More specifically, they are only defined in the Fabric management
section of the spec, which strongly hints we'd not expect host firmware
to know anything about them.
The events reported may well span bits of the topology currently
assigned to different hosts.

> +    CPER_CXL_EVT_VIRT_SW = guid(0x40D26425, 0x3396, 0x4C4D,
> +                                [0xA5, 0xDA, 0x3D, 0x47,
> +                                  0x2A, 0x63, 0xAF, 0x25])

Also a fabric management event.

> +    CPER_CXL_EVT_MLD_PORT = guid(0x8DC44363, 0x0C96, 0x4710,
> +                                 [0xB7, 0xBF, 0x04, 0xBB,
> +                                  0x99, 0x53, 0x4C, 0x3F])

Also a fabric management event.

> +    CPER_CXL_EVT_DYNA_CAP = guid(0xCA95AFA7, 0xF183, 0x4018,
> +                                 [0x8C, 0x2F, 0x95, 0x26,
> +                                  0x8E, 0x10, 0x1A, 0x2A])

These are never routed to firmware. They are part of the OS-only
managed flows for dynamic capacity.
They have their own event log on the hardware, and for this particular
set the most relevant thing is in
CXL 4.0 Table 8-235 Set Event Interrupt Policy Input Payload,
which controls whether a firmware interrupt or MSI-X is used to signal
each log. The Dynamic Capacity Event Log Interrupt Settings field only
allows for MSI/MSI-X, not FW interrupt (EFN VDM) like the other logs.


Jonathan





* Re: [PATCH 03/13] scripts/qmp_helper: add support for FRU Memory Poison
  2026-01-21 11:25 ` [PATCH 03/13] scripts/qmp_helper: add support for FRU Memory Poison Mauro Carvalho Chehab
@ 2026-01-21 12:27   ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-01-21 12:27 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow

On Wed, 21 Jan 2026 12:25:11 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> This GUID record descriptor was added on UEFI 2.11.
> Add support for it.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Matches the spec.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> ---
>  scripts/qmp_helper.py | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
> index 7e786c4adfd9..19bf641a13ce 100755
> --- a/scripts/qmp_helper.py
> +++ b/scripts/qmp_helper.py
> @@ -736,3 +736,7 @@ class cper_guid:
>      CPER_CXL_EVT_DYNA_CAP = guid(0xCA95AFA7, 0xF183, 0x4018,
>                                   [0x8C, 0x2F, 0x95, 0x26,
>                                    0x8E, 0x10, 0x1A, 0x2A])
> +
> +    CPER_FRU_MEM_POISON = guid(0x5E4706C1, 0x5356, 0x48C6,
> +                               [0x93, 0x0B, 0x52, 0xF2,
> +                                0x12, 0x0A, 0x44, 0x58])




* Re: [PATCH 04/13] scripts/qmp_helper: make send_cper() more generic
  2026-01-21 11:25 ` [PATCH 04/13] scripts/qmp_helper: make send_cper() more generic Mauro Carvalho Chehab
@ 2026-01-21 12:30   ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-01-21 12:30 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow

On Wed, 21 Jan 2026 12:25:12 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Allow the caller to set GEDE, GEBS and raw data. This can be
> useful if one wants to replicate a CPER with the same parameters
> as the original one.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Subject to my very rusty python, seems fine to me
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>



* Re: [PATCH 05/13] scripts/qmp_helper: fix raw_data logic
  2026-01-21 11:25 ` [PATCH 05/13] scripts/qmp_helper: fix raw_data logic Mauro Carvalho Chehab
@ 2026-01-21 12:35   ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-01-21 12:35 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow

On Wed, 21 Jan 2026 12:25:13 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> According with UEFI 6.4 spec Table 18.11 Generic Error Status Block:
> 
> Raw Data Offset:	Offset in bytes from the beginning of the
> 			Error Status Block to raw error data.
> 			The raw data must follow any Generic Error
> 			Data Entries.
> 
> Data Length:		Length in bytes of the generic error data.
> 
> So, basically, we have:
> 
> 	+----------+                 /
> 	| GEBS     |                 |
> 	+----------+   /             |
>         | GEDE     |   |             |

Mix of tabs and spaces it seems.  Nice to tidy that up.

>         | header   |   |             +--> raw data
> 	+----------+   +--> data     |    offset
>         | GEDE     |   |    length   |
>         | payload  |   |             |
> 	+----------+   /             /
> 	| Raw data |
> 	+----------+
> 
> where:
> 
> - raw data offset is relative to the beginning of GEBS;
> - data length is only for GEDE header and payload.
> 
> Fix the code to handle it the expected way.

Maybe say what it did previously?

> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Code seems fine to me.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> ---
>  scripts/qmp_helper.py | 23 ++++++++++++++---------
>  1 file changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
> index 51c8ad92a39d..40059cd105f6 100755
> --- a/scripts/qmp_helper.py
> +++ b/scripts/qmp_helper.py
> @@ -411,6 +411,7 @@ def _connect(self):
>          "simulated":    util.bit(2),
>      }
>  
> +    GENERIC_ERROR_STATUS_SIZE = 20
>      GENERIC_DATA_SIZE = 72
>  
>      def argparse(parser):
> @@ -551,7 +552,7 @@ def send_cper_raw(self, cper_data):
>  
>          return False
>  
> -    def get_gede(self, notif_type, cper_length):
> +    def get_gede(self, notif_type, payload_length):
>          """
>          Return a Generic Error Data Entry bytearray
>          """
> @@ -563,22 +564,27 @@ def get_gede(self, notif_type, cper_length):
>          util.data_add(gede, 0x300, 2)
>          util.data_add(gede, self.validation_bits, 1)
>          util.data_add(gede, self.flags, 1)
> -        util.data_add(gede, cper_length, 4)
> +        util.data_add(gede, payload_length, 4)
>          gede.extend(self.fru_id)
>          gede.extend(self.fru_text)
>          gede.extend(self.timestamp)
>  
>          return gede
>  
> -    def get_gebs(self, data_length):
> +    def get_gebs(self, payload_length):
>          """
>          Return a Generic Error Status Block bytearray
>          """
>  
> +        data_length = payload_length
> +        data_length += self.GENERIC_DATA_SIZE
> +
>          gebs = bytearray()
>  
>          if self.raw_data:
> -            raw_data_offset = len(gebs)
> +            raw_data_offset = payload_length
> +            raw_data_offset += self.GENERIC_ERROR_STATUS_SIZE
> +            raw_data_offset += self.GENERIC_DATA_SIZE
>          else:
>              raw_data_offset = 0
>  
> @@ -617,8 +623,7 @@ def send_cper(self, notif_type, payload,
>          if raw_data:
>              self.raw_data = raw_data
>  
> -        cper_length = len(payload)
> -        data_length = cper_length + len(self.raw_data) + self.GENERIC_DATA_SIZE
> +        payload_length = len(payload)
>  
>          if gede and len(gede) != 72:
>              print(f"Invalid Generic Error Data Entry length: {len(gede)}. Ignoring it")
> @@ -629,16 +634,16 @@ def send_cper(self, notif_type, payload,
>              gebs = None
>  
>          if not gede:
> -            gede = self.get_gede(notif_type, cper_length)
> +            gede = self.get_gede(notif_type, payload_length)
>  
>          if not gebs:
> -            gebs = self.get_gebs(data_length)
> +            gebs = self.get_gebs(payload_length)
>  
>          cper_data = bytearray()
>          cper_data.extend(gebs)
>          cper_data.extend(gede)
> -        cper_data.extend(bytearray(self.raw_data))
>          cper_data.extend(bytearray(payload))
> +        cper_data.extend(bytearray(self.raw_data))
>  
>          if self.debug:
>              print(f"GUID: {notif_type}")
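
The layout described in the commit message can be checked with a few lines of arithmetic. The sketch below reuses the two size constants from the quoted patch; the function names gebs_lengths() and assemble() are hypothetical helpers written for this illustration and are not the qmp_helper API.

```python
# Sizes taken from the quoted patch.
GENERIC_ERROR_STATUS_SIZE = 20   # Generic Error Status Block (GEBS) header
GENERIC_DATA_SIZE = 72           # Generic Error Data Entry (GEDE) header


def gebs_lengths(payload_length, has_raw_data):
    """Return (data_length, raw_data_offset) for one GEDE plus payload.

    data_length covers only the GEDE header and its payload;
    raw_data_offset is counted from the start of the GEBS and is zero
    when no raw data is attached.
    """
    data_length = GENERIC_DATA_SIZE + payload_length

    if has_raw_data:
        raw_data_offset = GENERIC_ERROR_STATUS_SIZE + data_length
    else:
        raw_data_offset = 0

    return data_length, raw_data_offset


def assemble(gebs, gede, payload, raw_data):
    """Concatenate the parts in the order required by the layout:
    GEBS, GEDE header, GEDE payload and, last, the raw data."""
    record = bytearray()
    record.extend(gebs)
    record.extend(gede)
    record.extend(payload)
    record.extend(raw_data)
    return record


if __name__ == "__main__":
    length, offset = gebs_lengths(payload_length=100, has_raw_data=True)
    print(f"data length: {length}, raw data offset: {offset}")
```

Because the raw data is appended after the payload, slicing the assembled record at raw_data_offset yields exactly the raw data, which is the property the fix restores.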




* Re: [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic
  2026-01-21 11:25 ` [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic Mauro Carvalho Chehab
@ 2026-01-21 12:39   ` Jonathan Cameron via qemu development
  2026-01-21 15:56     ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-01-21 12:39 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow

On Wed, 21 Jan 2026 12:25:14 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> We can't inject a new GHES record to the same source before
> it has been acked. There is an async mechanism to verify when
> the Kernel is ready, which is implemented at QEMU's ghes
> driver.
> 
> If error inject is too fast, QEMU may return an error. When
> such errors occur, implement a retry mechanism, based on a
> maximum timeout.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
A few trivial comments below. Either way this seems fine to me and
should make the tooling easier to use.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> ---
>  scripts/qmp_helper.py | 47 +++++++++++++++++++++++++++++++------------
>  1 file changed, 34 insertions(+), 13 deletions(-)
> 
> diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
> index 40059cd105f6..63f3df2d75c3 100755
> --- a/scripts/qmp_helper.py
> +++ b/scripts/qmp_helper.py
> @@ -14,6 +14,7 @@
>  
>  from datetime import datetime
>  from os import path as os_path
> +from time import sleep
>  
>  try:
>      qemu_dir = os_path.abspath(os_path.dirname(os_path.dirname(__file__)))
> @@ -324,7 +325,8 @@ class qmp:
>      Opens a connection and send/receive QMP commands.
>      """
>  
> -    def send_cmd(self, command, args=None, may_open=False, return_error=True):
> +    def send_cmd(self, command, args=None, may_open=False, return_error=True,
> +                 timeout=None):
>          """Send a command to QMP, optinally opening a connection"""
>  
>          if may_open:
> @@ -336,12 +338,31 @@ def send_cmd(self, command, args=None, may_open=False, return_error=True):
>          if args:
>              msg['arguments'] = args
>  
> -        try:
> -            obj = self.qmp_monitor.cmd_obj(msg)
> -        # Can we use some other exception class here?
> -        except Exception as e:                         # pylint: disable=W0718
> -            print(f"Command: {command}")
> -            print(f"Failed to inject error: {e}.")
> +        if timeout and timeout > 0:
> +            attempts = int(timeout * 10)
> +        else:
> +            attempts = 1
> +
> +        # Try up to attempts
That reads oddly because of the variable name.  Made me ask myself
"How many attempts?"
Maybe  " Retry up to attempts times" or something like that.

> +        for i in range(0, attempts):
> +            try:
> +                obj = self.qmp_monitor.cmd_obj(msg)
> +
> +                if obj and "return" in obj and not obj["return"]:
> +                    break
> +
> +            except Exception as e:                     # pylint: disable=W0718
> +                print(f"Command: {command}")
> +                print(f"Failed to inject error: {e}.")
> +                obj = None
> +
> +            if attempts > 1:
> +                print(f"Error inject attempt {i + 1}/{attempts} failed.")
> +
> +            if i + 1 < attempts:
> +                sleep(0.1)

Do we care about a sleep at the end?  Feels like a micro optimization that
isn't needed.

> +
> +        if not obj:
>              return None
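
For reference, the retry pattern under discussion can be reduced to a small standalone helper. This is only a sketch: call_with_retry(), its interval and pause parameters, and the "result is not None" success test are assumptions made for the example. The quoted patch instead checks for an empty "return" member in the QMP reply and hard-codes a 0.1 second step.

```python
from time import sleep


def call_with_retry(func, timeout=None, interval=0.1, pause=sleep):
    """Call func() until it returns something other than None.

    timeout is the total time budget in seconds; it is converted into a
    number of attempts spaced by interval. Without a timeout, func() is
    tried exactly once.
    """
    if timeout and timeout > 0:
        # round() avoids losing one attempt to float error (0.3 / 0.1 < 3)
        attempts = max(1, round(timeout / interval))
    else:
        attempts = 1

    for i in range(attempts):
        try:
            result = func()
            if result is not None:
                return result
        except Exception as exc:  # broad on purpose, as in the patch
            print(f"Attempt {i + 1}/{attempts} raised: {exc}")

        # Skip the pause after the final attempt: nothing follows it.
        if i + 1 < attempts:
            pause(interval)

    return None


if __name__ == "__main__":
    replies = iter([None, None, "acked"])
    print(call_with_retry(lambda: next(replies), timeout=1))
```

Dropping the "i + 1 < attempts" check, as suggested in the review, would only cost one extra interval on the failure path, so either variant behaves the same from the caller's point of view.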





* Re: [PATCH 07/13] scripts/ghes_inject: add a logic to decode CPER
  2026-01-21 11:25 ` [PATCH 07/13] scripts/ghes_inject: add a logic to decode CPER Mauro Carvalho Chehab
@ 2026-01-21 13:27   ` Jonathan Cameron via qemu development
  2026-01-21 16:24     ` Mauro Carvalho Chehab
  2026-01-22 16:23     ` Mauro Carvalho Chehab
  0 siblings, 2 replies; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-01-21 13:27 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow

On Wed, 21 Jan 2026 12:25:15 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Add a decoder to help debugging injected CPERs. This is more
> relevant when we add fuzzy-testing error inject, as the
> decoder is helpful to identify what it packages will be sent
> via QEMU to the firmware-fist logic.
> 
> By purpose, I opted to keep this completely independent from
> the encoders implementation, as this can be used even when
> there are no encoders for a certain GGUID type (except for a
GGUID?

> fuzzy logic test, which is pretty much independent of the
> records meaning).
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

A few minor things inline.

I checked the specs, so all that stuff is fine including the spec bug
you mention (which I'm sure you've already reported :)

Jonathan

> ---
>  MAINTAINERS            |    1 +
>  scripts/ghes_decode.py | 1155 ++++++++++++++++++++++++++++++++++++++++
>  scripts/qmp_helper.py  |    3 +
>  3 files changed, 1159 insertions(+)
>  create mode 100644 scripts/ghes_decode.py
> 

> diff --git a/scripts/ghes_decode.py b/scripts/ghes_decode.py
> new file mode 100644
> index 000000000000..6c7fdfe84e3a
> --- /dev/null
> +++ b/scripts/ghes_decode.py
> @@ -0,0 +1,1155 @@
> +#!/usr/bin/env python3
> +#
> +# pylint: disable=R0903,R0912,R0913,R0915,R0917,R1713,E1121,C0302,W0613
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +#
> +# Copyright (C) 2025 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> +
> +"""
> +Helper classes to decode a generic error data entry.
> +
> +By purpose, the logic here is independent of the logic inside qmp_helper
> +and other modules. With a different implementation, it is more likely to
> +discover bugs at the error injection logic. Also, as this can be used to
> +dump errors injected by reproducing an error mesage or for fuzzy error
> +injection, it can't rely at the encoding logic inside each module of

rely on the encoding logic

> +ghes_inject.py.
> +
> +To make the decoder simple, the decode logic here is at field level, not
> +trying to decode bitmaps.
> +"""
> +
> +from typing import Optional


> +class DecodeProcX86():
> +    """
> +    Class to decode an x86 Processor Error as defined at
> +    UEFI 2.1 - N.2.2 Section Descriptor
> +    """
> +
> +    # GUID for x86 Processor Error
> +    guid = "dc3ea0b0-a144-4797-b95b-53fa242b6e1d"
> +
> +    pei_fields = [
> +        ("Error Structure Type", 16, "guid"),
> +        ("Validation Bits", 8, "int"),
> +        ("Check Information", 8, "int"),
> +        ("Target Identifier", 8, "int"),
> +        ("Requestor Identifier", 8, "int"),
> +        ("Responder Identifier", 8, "int"),
> +        ("Instruction Pointer", 8, "int")
> +    ]
> +
> +    def __init__(self, cper: DecodeField):
> +        self.cper = cper
> +
> +    def decode(self, guid):
> +        """Decode x86 Processor Error"""
> +        print("x86 Processor Error")
> +
> +        val = self.cper.decode("Validation Bits", 8, "int")
> +        try:
> +            val_bits = int.from_bytes(val, byteorder='little')
> +        except ValueError, TypeError:
> +            val_bits = 0
> +
> +        error_info_num = (val_bits >> 2) & 0x3f    # bits 2-7
> +        context_info_num = (val_bits >> 8) & 0xff  # bits 8-13
> +
> +        self.cper.decode("Local APIC_ID", 8, "int")
> +        self.cper.decode("CPUID Info", 48, "int")
> +
> +        for pei in range(0, error_info_num):
> +            if self.cper.past_end:
> +                return
> +
> +            print()
> +            print(f"Processor Error Info {pei}")
> +            for name, size, ftype in self.pei_fields:
> +                self.cper.decode(name, size, ftype)
> +
> +        for ctx in range(0, context_info_num):
> +            if self.cper.past_end:
> +                return
> +
> +            print()
> +            print(f"Context {ctx}")
> +
> +            self.cper.decode("Register Context Type", 2, "int")
> +
> +            val = self.cper.decode("Register Array Size", 2, "int")
> +            try:
> +                context_size = int(int.from_bytes(val, byteorder='little') / 8)
> +            except ValueError, TypeError:
> +                context_size = 0
> +
> +            self.cper.decode("MSR Address", 4, "int")
> +            self.cper.decode("MM Register Address", 8, "int")
> +
> +            for reg in range(0, context_size):
> +                if self.cper.past_end:
> +                    return
> +                self.cper.decode(f"Register offset {reg:<3}", 8, "int")

As for arm: this probably needs a sanity check that it's the 8 byte version.
And for giggles, even on x86 some are 16 bytes ;)
(GDTR, IDTR)

Meh, I don't care about those 16 byte ones, but not presenting the 4 byte
ones as 8 would be good.



> +
> +    @staticmethod
> +    def decode_list():
> +        """
> +        Returns a tuple with the GUID and class
> +        """
> +        return [(DecodeProcX86.guid, DecodeProcX86)]
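
As a side note on the validation-bits handling in the quoted decode(): the comments there place the two counters at bits 2-7 and bits 8-13, which are both 6-bit fields. The sketch below extracts them with a 6-bit mask for each. That choice is an assumption based on those comments (the quoted code masks the second field with 0xff), and x86_info_counts() is a hypothetical helper, not part of ghes_decode.py. It also uses the parenthesized form of the except tuple, which is accepted by every Python 3 release.

```python
def x86_info_counts(raw):
    """Return (error_info_num, context_info_num) from the 8-byte
    validation bits field, given as little-endian bytes.

    Invalid or missing input yields (0, 0), matching the fallback
    used in the quoted decoder.
    """
    try:
        val_bits = int.from_bytes(raw, byteorder="little")
    except (ValueError, TypeError):
        val_bits = 0

    error_info_num = (val_bits >> 2) & 0x3f     # bits 2-7
    context_info_num = (val_bits >> 8) & 0x3f   # bits 8-13 (assumed 6 bits)

    return error_info_num, context_info_num


if __name__ == "__main__":
    # Three error info structures, two contexts, plus one bit set above
    # bit 13 to show that it does not leak into the context count.
    value = (3 << 2) | (2 << 8) | (1 << 14)
    print(x86_info_counts(value.to_bytes(8, byteorder="little")))
```

With fuzzed input the bits above bit 13 are often set, so the width of the mask decides how many context blocks the decoder will try to walk.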

> +class DecodeProcArm():
> +    """
> +    Class to decode an ARM Processor Error as defined at
> +    UEFI 2.6 - N.2.2 Section Descriptor
> +    """
> +
> +    # GUID for ARM Processor Error
> +    guid = "e19e3d16-bc11-11e4-9caa-c2051d5d46b0"
> +
> +    arm_pei_fields = [
> +        ("Version",              1, "int"),
> +        ("Length",               1, "int"),
> +        ("valid",                2, "int"),
> +        ("type",                 1, "int"),
> +        ("multiple-error",       2, "int"),
> +        ("flags",                1, "int"),
> +        ("error-info",           8, "int"),
> +        ("virt-addr",            8, "int"),
> +        ("phy-addr",             8, "int"),
> +    ]
> +
> +    def __init__(self, cper: DecodeField):
> +        self.cper = cper
> +
> +    def decode(self, guid):
> +        """Decode Processor ARM"""
> +
> +        print("ARM Processor Error")
> +
> +        start = self.cper.pos
> +
> +        self.cper.decode("Valid", 4, "int")
> +
> +        val = self.cper.decode("Error Info num", 2, "int")
> +        try:
> +            error_info_num = int.from_bytes(val, byteorder='little')
> +        except ValueError, TypeError:
> +            error_info_num = 0
> +
> +        val = self.cper.decode("Context Info num", 2, "int")
> +        try:
> +            context_info_num = int.from_bytes(val, byteorder='little')
> +        except ValueError, TypeError:
> +            context_info_num = 0
> +
> +        val = self.cper.decode("Section Length", 4, "int")
> +        try:
> +            section_length = int.from_bytes(val, byteorder='little')
> +        except ValueError, TypeError:
> +            section_length = 0
> +
> +        self.cper.decode("Error affinity level", 1, "int")
> +        self.cper.decode("Reserved", 3, "int")
> +        self.cper.decode("MPIDR_EL1", 8, "int")
> +        self.cper.decode("MIDR_EL1", 8, "int")
> +        self.cper.decode("Running State", 4, "int")
> +        self.cper.decode("PSCI State", 4, "int")
> +
> +        for pei in range(0, error_info_num):
> +            if self.cper.past_end:
> +                return
> +
> +            print()
> +            print(f"Processor Error Info {pei}")
> +            for name, size, ftype in self.arm_pei_fields:
> +                self.cper.decode(name, size, ftype)
> +
> +        for ctx in range(0, context_info_num):
> +            if self.cper.past_end:
> +                return
> +
> +            print()
> +            print(f"Context {ctx}")
> +            self.cper.decode("Version", 2, "int")
> +            self.cper.decode("Register Context Type", 2, "int")
> +            val = self.cper.decode("Register Array Size", 4, "int")
> +            try:
> +                context_size = int(int.from_bytes(val, byteorder='little') / 8)
> +            except ValueError:
> +                context_size = 0
> +
> +            for reg in range(0, context_size):
> +                if self.cper.past_end:
> +                    return
> +                self.cper.decode(f"Register {reg:<3}", 8, "int")

Maybe check it's not a 32 bit context?  Don't decode it if it
is but don't try to decode it as 64 bit.  Can get that from the
Register Context Type.  Anything over 4 is fine.
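A rough sketch of such a check follows. The helper names are made up
for illustration and are not part of the patch. The type numbering is
an assumption based on the UEFI ARM Processor Context Information
table (0-3 being the AArch32 contexts, 4-7 the AArch64 ones), so
double check it against the spec before relying on it:

```python
# Hypothetical helpers, not part of the patch: pick the register width
# from the ARM "Register Context Type" field.
# Assumption: types 0-3 are AArch32 contexts (4-byte registers) and
# types 4-7 are AArch64 contexts (8-byte registers). Anything else is
# left undecoded.
from typing import List, Optional


def arm_reg_width(context_type: int) -> Optional[int]:
    """Return the register size in bytes, or None if unknown."""
    if 0 <= context_type <= 3:
        return 4
    if 4 <= context_type <= 7:
        return 8
    return None


def split_registers(array: bytes, context_type: int) -> List[int]:
    """Split a register array into little-endian integers."""
    width = arm_reg_width(context_type)
    if width is None:
        return []
    # Ignore a trailing partial register, if any
    usable = len(array) - len(array) % width
    return [int.from_bytes(array[i:i + width], byteorder="little")
            for i in range(0, usable, width)]
```

With something like this, the register loop could use the returned
width instead of a hard-coded 8, and skip the array when the width is
None.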


> +
> +        remaining = max(section_length + start - self.cper.pos, 0)
> +        if remaining:
> +            print()
> +            self.cper.decode("Vendor data", remaining, "int")
> +
> +    @staticmethod
> +    def decode_list():
> +        """
> +        Returns a tuple with the GUID and class
> +        """
> +        return [(DecodeProcArm.guid, DecodeProcArm)]

> +class DecodeDMAVT():
> +    """
> +    Class to decode a DMA Virtualization Technology Error as defined at
> +    UEFI 2.2 - N.2.2 Section Descriptor
As below. Feels like it should be
N2.11.2 Intel VT for Directed I/O specific DMAr Error Section Descriptor.
Same for other places this comment exists.


> +    """
> +
> +    # GUID for DMA VT Error
> +    guid = "71761d37-32b2-45cd-a7d0-b0fedd93e8cf"

> +class DecodeDMAIOMMU():
> +    """
> +    Class to decode an IOMMU DMA Error as defined at
> +    UEFI 2.2 - N.2.2 Section Descriptor

Odd reference choice.
This stuff is in N2.11.3 IOMMU Specific DMAr Error Section
in 2.11. I'm too lazy to find it in the older spec.

> +    """
> +
> +    # GUID for IOMMU DMA Error

Maybe call it the IOMMU Specific DMAr Error

> +    guid = "036f84e1-7f37-428c-a79e-575fdfaa84ec"
> +
> +    fields = [

> +class DecodeCXLCompEvent():
> +    """
> +    Class to decode a CXL Component Error as defined at
> +    UEFI 2.9 - N.2.14. CXL Component Events Section
> +
> +    Currently, the decoder handles only the common fields, displaying
> +    the CXL Component Event Log field in bytes.
> +    """
> +
> +    # GUIDs, as defined at CXL specification 3.2: 8.2.10.2.1 Event Records
> +    guids = [
> +        ("General Media",              "fbcd0a77-c260-417f-85a9-088b1621eba6"),
> +        ("DRAM",                       "601dcbb3-9c06-4eab-b8af-4e9bfb5c9624"),
> +        ("Memory Module",              "fe927475-dd59-4339-a586-79bab113bc74"),
> +        ("Memory Sparing",             "e71f3a40-2d29-4092-8a39-4d1c966c7c65"),
> +        ("Physical Switch",            "77cf9271-9c02-470b-9fe4-bc7b75f2da97"),

As per the earlier patch review, I'm not sure we care about this and the
following. I don't think we'll ever see them in CPER records, unless going
over something else that encapsulates that format.

> +        ("Virtual Switch",             "40d26425-3396-4c4d-a5da-3d472a63af25"),
> +        ("MDL Port",                   "8dc44363-0c96-4710-b7bf-04bb99534c3f"),
MLD -> Multi-Logical Device

> +        ("Dynamic Capabilities",       "ca95afa7-f183-4018-8c2f-95268e101a2a"),
> +    ]
> +
> +    fields = [
> +        ("Validation Bits", 8, "int"),
> +        ("Device ID", 12, "int"),
> +        ("Device Serial Number", 8, "int")
> +    ]
> +
> +    def __init__(self, cper: DecodeField):
> +        self.cper = cper
> +
> +    def decode(self, guid):
> +        """Decode CXL Protocol Error"""
> +        for name, guid_event in DecodeCXLCompEvent.guids:
> +            if guid == guid_event:
> +                print(f"CXL {name} Event Record")
> +                break
> +
> +        val = self.cper.decode("Length", 4, "int")
> +        try:
> +            length = int.from_bytes(val, byteorder='little')
> +        except ValueError, TypeError:
> +            length = 0
> +
> +        for name, size, ftype in self.fields:
> +            self.cper.decode(name, size, ftype)
> +
> +        length = max(0, length - self.cper.pos)
> +
> +        self.cper.decode("CXL Component Event Log", length, "int",
> +                         show_incomplete=True)
> +
> +    @staticmethod
> +    def decode_list():
> +        """
> +        Returns a tuple with the GUID and class
> +        """
> +
> +        guid_list = []
> +
> +        for _, guid in DecodeCXLCompEvent.guids:
> +            guid_list.append((guid, DecodeCXLCompEvent))
> +
> +        return guid_list


...

> +class DecodeGhesEntry():
> +    """
> +    Class to decode a GHESv2 element, as defined at:
> +    ACPI 6.1: 18.3.2.8 Generic Hardware Error Source version 2
> +    """
> +
> +    # Fields present on all CPER records
> +    common_fields = [
> +        # Generic Error Status Block fields
> +        ("Block Status",           4, "int", None),
> +        ("Raw Data Offset",        4, "int", "raw_data_offset"),
> +        ("Raw Data Length",        4, "int", "raw_data_len"),
> +        ("Data Length",            4, "int", None),
> +        ("Error Severity",         4, "int", None),
> +
> +        # Generic Error Data Entry
> +        ("Section Type",          16, "guid", "session_type"),

Why session_type? Is the idea that it's the type of decode session we are
doing? Feels a bit too much like a typo for section_type.


> +        ("Error Severity",         4, "int", None),
> +        ("Revision",               2, "int", None),
> +        ("Validation Bits",        1, "int", None),
> +        ("Flags",                  1, "int", None),
> +        ("Error Data Length",      4, "int", None),
> +        ("FRU Id",                16, "guid", None),
> +        ("FRU Text",              20, "str", None),
> +        ("Timestamp",              8, "bcd", None),
> +    ]
> +
> +    def __init__(self, cper_data: bytearray):
> +        """
> +        Initializes a byte array, decoding it, printing results at the
> +        screen.
> +        """

...

> +
> +        # Handle common types
> +        cper = DecodeField(cper_data)
> +
> +        fields = {}
> +        for name, size, ftype, var in self.common_fields:
> +            val = cper.decode(name, size, ftype)
> +
> +            if ftype == "int":
> +                try:
> +                    val = int.from_bytes(val, byteorder='little')
> +                except ValueError, TypeError:
> +                    val = 0
> +
> +            if var is not None:
> +                fields[var] = val
> +
> +        if fields["raw_data_len"]:
> +            cper.decode("Raw Data", fields["raw_data_len"],
> +                        "int", pos=fields["raw_data_offset"])
> +
> +        if not fields["session_type"]:
> +            return




* Re: [PATCH 08/13] scripts/ghes_inject: exit 1 if command was not sent
  2026-01-21 11:25 ` [PATCH 08/13] scripts/ghes_inject: exit 1 if command was not sent Mauro Carvalho Chehab
@ 2026-01-21 13:28   ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-01-21 13:28 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow

On Wed, 21 Jan 2026 12:25:16 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> add a return code to subparser func() and return 1 if the
> command failed.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> ---
>  scripts/arm_processor_error.py | 2 +-
>  scripts/ghes_inject.py         | 3 ++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/scripts/arm_processor_error.py b/scripts/arm_processor_error.py
> index 73d069f070d4..d9845adb0c0a 100644
> --- a/scripts/arm_processor_error.py
> +++ b/scripts/arm_processor_error.py
> @@ -473,4 +473,4 @@ def send_cper(self, args):
>  
>          self.data = data
>  
> -        qmp_cmd.send_cper(cper_guid.CPER_PROC_ARM, self.data)
> +        return qmp_cmd.send_cper(cper_guid.CPER_PROC_ARM, self.data)
> diff --git a/scripts/ghes_inject.py b/scripts/ghes_inject.py
> index 9a235201418b..6ac917d0b5db 100755
> --- a/scripts/ghes_inject.py
> +++ b/scripts/ghes_inject.py
> @@ -43,7 +43,8 @@ def main():
>  
>      args = parser.parse_args()
>      if "func" in args:
> -        args.func(args)
> +        if not args.func(args):
> +            sys.exit(1)
>      else:
>          sys.exit(f"Please specify a valid command for {sys.argv[0]}")
>  
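For reference, the pattern used by this patch can be sketched as a
standalone script. This is an illustration only: the "inject" command
and its "--fail" flag are made up, and the sketch returns the exit
status from main() instead of calling sys.exit() directly as
ghes_inject.py does:

```python
# Standalone sketch; the "inject" command and "--fail" flag are made up.
import argparse


def inject(args) -> bool:
    """Pretend to send a CPER; return True on success."""
    return not args.fail


def main(argv=None) -> int:
    """Parse the command line and map the handler result to an exit code."""
    parser = argparse.ArgumentParser()
    sub = parser.add_subparsers()
    cmd = sub.add_parser("inject")
    cmd.add_argument("--fail", action="store_true")
    cmd.set_defaults(func=inject)

    args = parser.parse_args(argv)
    if "func" not in args:
        # No subcommand given: show help and report a usage error
        parser.print_help()
        return 2

    # Map the handler's boolean result to a process exit status
    return 0 if args.func(args) else 1
```

A script would then end with sys.exit(main()), so that shell callers
can tell a failed injection from a successful one.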




* Re: [PATCH 09/13] scripts/ghes_inject: add a handler for PCIe bus error
  2026-01-21 11:25 ` [PATCH 09/13] scripts/ghes_inject: add a handler for PCIe bus error Mauro Carvalho Chehab
@ 2026-01-21 13:32   ` Jonathan Cameron via qemu development
  2026-01-21 13:33     ` Jonathan Cameron via qemu development
                       ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-01-21 13:32 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow

On Wed, 21 Jan 2026 12:25:17 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Add a logic to do PCIe BUS error injection.
> 
> On Linux Kernel, despite CPER_SEC_PCI_X_BUS macro is defined for such
> event, ghes.c doesn't implement support for it yet:
> 
> [16950.077494] {26}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> [16950.077866] {26}[Hardware Error]: event severity: recoverable
> [16950.078118] {26}[Hardware Error]:  Error 0, type: recoverable
> [16950.078444] {26}[Hardware Error]:   section type: unknown, c5753963-3b84-4095-bf78-eddad3f9c9dd
> [16950.078800] {26}[Hardware Error]:   section length: 0x48
> [16950.079069] {26}[Hardware Error]:   00000000: 00000000 00000000 00000000 00000000  ................
> [16950.079442] {26}[Hardware Error]:   00000010: 00000001 00000000 00000000 00000000  ................
> [16950.079811] {26}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> [16950.080181] {26}[Hardware Error]:   00000030: 00000000 00000000 00000000 00000000  ................
> [16950.080538] {26}[Hardware Error]:   00000040: 00000000 00000000                    ........
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

LGTM. Bit surprised Linux doesn't decode it but fair enough.
Seems a bit unlikely it ever will given this seems not to cover PCIe
which has its own records.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>



* Re: [PATCH 09/13] scripts/ghes_inject: add a handler for PCIe bus error
  2026-01-21 13:32   ` Jonathan Cameron via qemu development
@ 2026-01-21 13:33     ` Jonathan Cameron via qemu development
  2026-02-06 12:52       ` Jonathan Cameron via qemu development
  2026-01-21 16:26     ` Mauro Carvalho Chehab
  2026-01-22 16:42     ` Mauro Carvalho Chehab
  2 siblings, 1 reply; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-01-21 13:33 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow

On Wed, 21 Jan 2026 13:32:55 +0000
Jonathan Cameron <jonathan.cameron@huawei.com> wrote:

> On Wed, 21 Jan 2026 12:25:17 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 
> > Add a logic to do PCIe BUS error injection.
> > 
> > On Linux Kernel, despite CPER_SEC_PCI_X_BUS macro is defined for such
> > event, ghes.c doesn't implement support for it yet:
> > 
> > [16950.077494] {26}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > [16950.077866] {26}[Hardware Error]: event severity: recoverable
> > [16950.078118] {26}[Hardware Error]:  Error 0, type: recoverable
> > [16950.078444] {26}[Hardware Error]:   section type: unknown, c5753963-3b84-4095-bf78-eddad3f9c9dd
> > [16950.078800] {26}[Hardware Error]:   section length: 0x48
> > [16950.079069] {26}[Hardware Error]:   00000000: 00000000 00000000 00000000 00000000  ................
> > [16950.079442] {26}[Hardware Error]:   00000010: 00000001 00000000 00000000 00000000  ................
> > [16950.079811] {26}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> > [16950.080181] {26}[Hardware Error]:   00000030: 00000000 00000000 00000000 00000000  ................
> > [16950.080538] {26}[Hardware Error]:   00000040: 00000000 00000000                    ........
> > 
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>  
> 
> LGTM. Bit surprised Linux doesn't decode it but fair enough.
> Seems a bit unlikely it ever will given this seems not to cover PCIe
> which has its own records.
> 
Just noticed your patch description. This is PCI/PCI-X errors, not PCIe.

> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>




* Re: [PATCH 10/13] scripts/ghes_inject: add support for fuzzy logic testing
  2026-01-21 11:25 ` [PATCH 10/13] scripts/ghes_inject: add support for fuzzy logic testing Mauro Carvalho Chehab
@ 2026-01-21 13:37   ` Jonathan Cameron via qemu development
  2026-01-21 16:35     ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-01-21 13:37 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow

On Wed, 21 Jan 2026 12:25:18 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Add a command to inject random errors for fuzzy logic testing.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Seems reasonable, but maybe some more text in the description on
what types of fuzzy records it generates?  I.e. what is constrained
or at least starts as being standard values vs what is entirely random.

J



* Re: [PATCH 12/13] scripts/ghes_inject: print help if no command specified
  2026-01-21 11:25 ` [PATCH 12/13] scripts/ghes_inject: print help if no command specified Mauro Carvalho Chehab
@ 2026-01-21 13:42   ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-01-21 13:42 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow

On Wed, 21 Jan 2026 12:25:20 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> The first positional argument (command) is mandatory. If not
> specified, instead of a simple error message, show help as well.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> ---
>  scripts/ghes_inject.py | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/scripts/ghes_inject.py b/scripts/ghes_inject.py
> index 781b37cc68af..4028fdb15582 100755
> --- a/scripts/ghes_inject.py
> +++ b/scripts/ghes_inject.py
> @@ -52,6 +52,9 @@ def main():
>          if not args.func(args):
>              sys.exit(1)
>      else:
> +        print("Error: no command specified\n", file=sys.stderr)
> +        parser.print_help(file=sys.stderr)
> +        print(file=sys.stderr)
>          sys.exit(f"Please specify a valid command for {sys.argv[0]}")
>  
>  if __name__ == "__main__":




* Re: [PATCH 13/13] scripts/ghes_inject: improve help message
  2026-01-21 11:25 ` [PATCH 13/13] scripts/ghes_inject: improve help message Mauro Carvalho Chehab
@ 2026-01-21 13:43   ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-01-21 13:43 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow

On Wed, 21 Jan 2026 12:25:21 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Add a one-liner help message for each type of error inject
> command, and use raw formatter to keep line breaks.
> 
> While here, use a more uniform language.
> 
> With that, "ghes_inject -h" will now show:
> 
>   usage: ghes_inject.py [options]
> 
>   Handles ACPI GHESv2 error injection via the QEMU QMP interface.
> 
>   It uses UEFI BIOS APEI features to generate GHES records, which helps to
>   test CPER and GHES drivers on the guest OS and see how user‑space
>   applications on that guest handle such errors.
> 
>   positional arguments:
>     {arm,pcie-bus,fuzzy-test,fuzzy,raw-error,raw}
>       arm                 Inject an ARM processor error CPER, compatible with
>                           UEFI 2.9A Errata.
>       pcie-bus            Inject a PCIe bus error CPER
>       fuzzy-test (fuzzy)  Inject fuzzy test CPER packets
>       raw-error (raw)     Inject CPER records from previously recorded ones.
> 
>   options:
>     -h, --help            show this help message and exit
> 
>   QEMU QMP socket options:
>     -H, --host HOST       host name (default: localhost)
>     -P, --port PORT       TCP port number (default: 4445)
>     -d, --debug           enable debug output (default: False)
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

LGTM
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>





* Re: [PATCH 02/13] scripts/qmp_helper: add missing CXL UEFI GUID
  2026-01-21 12:26     ` Jonathan Cameron via qemu development
  (?)
@ 2026-01-21 15:45     ` Mauro Carvalho Chehab
  2026-01-22 10:52         ` Jonathan Cameron via qemu development
  -1 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 15:45 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Mauro Carvalho Chehab, Michael S Tsirkin, Shiju Jose, qemu-devel,
	Igor Mammedov, Cleber Rosa, John Snow, linux-cxl

On Wed, Jan 21, 2026 at 12:26:04PM +0000, Jonathan Cameron wrote:
> On Wed, 21 Jan 2026 12:25:10 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 
> > The UEFI 2.11 - N.2.14. CXL Component Events Section states that
> > XL events are described at CXL specification 3.2:
> CXL
> 
> >         8.2.10.2.1 Event Records
> > 
> > Add the GUIDs defined here to fuzzy logic error injection code.
> 
> +CC linux-cxl as more folk there who will be familiar with this
> stuff.
> 
> Some of these won't be seen on a host. The same event
> infrastructure is used for reporting on out of band interfaces
> and some in band ones, but not ones that will turn up on the
> mailboxes that firmware will be using to get info.

Good to know, but UEFI 2.11 still mentions all of them as
possible GUIDs:

    https://uefi.org/specs/UEFI/2.11/Apx_N_Common_Platform_Error_Record.html#cxl-component-events-section

So, UEFI 2.11 doesn't explicitly state they won't be delivered
to OSPM. Quite the contrary, they're listed as valid values for CPER,
even if, in practice, they won't be.

This is just a small set of variables that won't have any major
impact on the code. So, I prefer to keep them in sync with the spec.
If they end up removing the unused ones, we can update it in the future.

If you want, I can add a note in the next version with your
comments about them.
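
To cross-check the constants below against the GUID strings used by
the spec, something like this can be used. It is a sketch only:
guid_to_str() is a hypothetical helper and not part of qmp_helper.py.
It relies on Python's standard uuid module:

```python
# Sketch only: guid_to_str() is a hypothetical helper, not qmp_helper API.
import uuid


def guid_to_str(time_low, time_mid, time_hi, tail):
    """Build the canonical GUID string from the four-part form."""
    # The last six bytes form the 48-bit "node" field, big-endian
    node = int.from_bytes(bytes(tail[2:]), byteorder="big")
    return str(uuid.UUID(fields=(time_low, time_mid, time_hi,
                                 tail[0], tail[1], node)))


# Values taken from the General Media constant in this patch
gen_media = guid_to_str(0xFBCD0A77, 0xC260, 0x417F,
                        [0x85, 0xA9, 0x08, 0x8B, 0x16, 0x21, 0xEB, 0xA6])
print(gen_media)
```

uuid.UUID(gen_media).bytes_le then gives the 16 bytes with the first
three fields stored little-endian.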

> 
> > 
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > ---
> >  scripts/qmp_helper.py | 25 +++++++++++++++++++++++++
> >  1 file changed, 25 insertions(+)
> > 
> > diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
> > index 249a8c7187d1..7e786c4adfd9 100755
> > --- a/scripts/qmp_helper.py
> > +++ b/scripts/qmp_helper.py
> > @@ -711,3 +711,28 @@ class cper_guid:
> >      CPER_CXL_PROT_ERR = guid(0x80B9EFB4, 0x52B5, 0x4DE3,
> >                               [0xA7, 0x77, 0x68, 0x78,
> >                                0x4B, 0x77, 0x10, 0x48])
> > +
> > +    CPER_CXL_EVT_GEN_MEDIA = guid(0xFBCD0A77, 0xC260, 0x417F,
> > +                                  [0x85, 0xA9, 0x08, 0x8B,
> > +                                   0x16, 0x21, 0xEB, 0xA6])
> > +    CPER_CXL_EVT_DRAM = guid(0x601DCBB3, 0x9C06, 0x4EAB,
> > +                             [0xB8, 0xAF, 0x4E, 0x9B,
> > +                              0xFB, 0x5C, 0x96, 0x24])
> > +    CPER_CXL_EVT_MEM_MODULE = guid(0xFE927475, 0xDD59, 0x4339,
> > +                                   [0xA5, 0x86, 0x79, 0xBA,
> > +                                    0xB1, 0x13, 0xBC, 0x74])
> > +    CPER_CXL_EVT_MEM_SPARING = guid(0xE71F3A40, 0x2D29, 0x4092,
> > +                                    [0x8A, 0x39, 0x4D, 0x1C,
> > +                                     0x96, 0x6C, 0x7C, 0x65])
> 
> The above are all fine I think.
> 
> From here on I think they will never come via a CPER record.
> 
> > +    CPER_CXL_EVT_PHY_SW = guid(0x77CF9271, 0x9C02, 0x470B,
> > +                               [0x9F, 0xE4, 0xBC, 0x7B,
> > +                                0x75, 0xF2, 0xDA, 0x97])
> 
> This is only going to surface over either out of band or switch CCI
> I'd be very surprised to see a firmware anywhere near these.
> More specifically they are only defined in the Fabric management
> section of the spec, which strongly hints we'd not expect host firmware
> to know anything about them. 
> The events reported may well span bits of the topology currently
> assigned to different hosts.
> 
> > +    CPER_CXL_EVT_VIRT_SW = guid(0x40D26425, 0x3396, 0x4C4D,
> > +                                [0xA5, 0xDA, 0x3D, 0x47,
> > +                                  0x2A, 0x63, 0xAF, 0x25])
> 
> Also a fabric management event.
> 
> > +    CPER_CXL_EVT_MLD_PORT = guid(0x8DC44363, 0x0C96, 0x4710,
> > +                                 [0xB7, 0xBF, 0x04, 0xBB,
> > +                                  0x99, 0x53, 0x4C, 0x3F])
> 
> Also a fabric management event.
> 
> > +    CPER_CXL_EVT_DYNA_CAP = guid(0xCA95AFA7, 0xF183, 0x4018,
> > +                                 [0x8C, 0x2F, 0x95, 0x26,
> > +                                  0x8E, 0x10, 0x1A, 0x2A])
> These are never routed to firmware. They are part of the OS only
> managed flows for dynamic capacity.
> They have their own event log on the hardware and for this particular
> set most relevant thing is in
> CXL 4.0 Table 8-235 Set Event Interrupt Policy Input Payload
> which controls whether a firmware interrupt or MSIX is used signal
> the Dynamic Capacity Event Log Interrupt Settings only allows
> for MSI/MSI-X, not FW interrupt (EFN VDM) like the other logs.
> 
> 
> Jonathan
> 
> 

-- 
Thanks,
Mauro


* Re: [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic
  2026-01-21 12:39   ` Jonathan Cameron via qemu development
@ 2026-01-21 15:56     ` Mauro Carvalho Chehab
  2026-01-23 16:16       ` Jonathan Cameron via qemu development
  0 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 15:56 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Mauro Carvalho Chehab, Michael S Tsirkin, Shiju Jose, qemu-devel,
	Igor Mammedov, Cleber Rosa, John Snow

On Wed, Jan 21, 2026 at 12:39:27PM +0000, Jonathan Cameron wrote:
> On Wed, 21 Jan 2026 12:25:14 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 
> > We can't inject a new GHES record to the same source before
> > it has been acked. There is an async mechanism to verify when
> > the Kernel is ready, which is implemented at QEMU's ghes
> > driver.
> > 
> > If error inject is too fast, QEMU may return an error. When
> > such errors occur, implement a retry mechanism, based on a
> > maximum timeout.
> > 
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> A few trivial comments below. Either way this seems fine to me and
> should make the tooling easier to use.
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> 
> > ---
> >  scripts/qmp_helper.py | 47 +++++++++++++++++++++++++++++++------------
> >  1 file changed, 34 insertions(+), 13 deletions(-)
> > 
> > diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
> > index 40059cd105f6..63f3df2d75c3 100755
> > --- a/scripts/qmp_helper.py
> > +++ b/scripts/qmp_helper.py
> > @@ -14,6 +14,7 @@
> >  
> >  from datetime import datetime
> >  from os import path as os_path
> > +from time import sleep
> >  
> >  try:
> >      qemu_dir = os_path.abspath(os_path.dirname(os_path.dirname(__file__)))
> > @@ -324,7 +325,8 @@ class qmp:
> >      Opens a connection and send/receive QMP commands.
> >      """
> >  
> > -    def send_cmd(self, command, args=None, may_open=False, return_error=True):
> > +    def send_cmd(self, command, args=None, may_open=False, return_error=True,
> > +                 timeout=None):
> >          """Send a command to QMP, optinally opening a connection"""
> >  
> >          if may_open:
> > @@ -336,12 +338,31 @@ def send_cmd(self, command, args=None, may_open=False, return_error=True):
> >          if args:
> >              msg['arguments'] = args
> >  
> > -        try:
> > -            obj = self.qmp_monitor.cmd_obj(msg)
> > -        # Can we use some other exception class here?
> > -        except Exception as e:                         # pylint: disable=W0718
> > -            print(f"Command: {command}")
> > -            print(f"Failed to inject error: {e}.")
> > +        if timeout and timeout > 0:
> > +            attempts = int(timeout * 10)
> > +        else:
> > +            attempts = 1
> > +
> > +        # Try up to attempts
> That reads oddly because of the variable name.  Made me ask myself
> "How many attempts?"
> Maybe  " Retry up to attempts times" or something like that.

I'll improve the message. The goal here is to keep trying for at least
"timeout" seconds.

That's why we multiply it by 10...

> 
> > +        for i in range(0, attempts):
> > +            try:
> > +                obj = self.qmp_monitor.cmd_obj(msg)
> > +
> > +                if obj and "return" in obj and not obj["return"]:
> > +                    break
> > +
> > +            except Exception as e:                     # pylint: disable=W0718
> > +                print(f"Command: {command}")
> > +                print(f"Failed to inject error: {e}.")
> > +                obj = None
> > +
> > +            if attempts > 1:
> > +                print(f"Error inject attempt {i + 1}/{attempts} failed.")
> > +
> > +            if i + 1 < attempts:
> > +                sleep(0.1)

... and here, we sleep for 0.1 seconds.

> 
> Do we care about a sleep at the end?  Feels like a micro optimization that
> isn't needed.

This is not a micro-optimization. It is there to ensure that we won't
retry too fast.

What happens is that the QMP interface asks the BIOS to send an async
message to OSPM, clearing an ack register. When the OSPM reads the
error, it writes 1 to the ack register.

If we send messages too fast, the logic at ghes.c will detect that
the ack didn't happen, immediately returning an error code.

In such a case, we sleep for 100ms before trying again.

In practice, on my Ryzen 9 machines with QEMU emulating ARM,
even under massive error injection, 99% of the time no retries
happen. The worst-case scenario I got here is that sometimes the
kernel got stuck and took between 5s and 10s to accept the error
submission.
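
Stripped of the QMP details, the retry logic amounts to the sketch
below. It is an illustration only: send_once is a stand-in for a
single QMP command attempt and send_with_retry() is not the actual
function in qmp_helper.py:

```python
# Sketch only: send_once is a stand-in for a single QMP command attempt.
from time import sleep


def send_with_retry(send_once, timeout=None, interval=0.1):
    """Call send_once() until it succeeds or about `timeout` seconds pass."""
    if timeout and timeout > 0:
        # round() avoids losing an attempt to floating point error
        attempts = max(1, round(timeout / interval))
    else:
        attempts = 1

    for i in range(attempts):
        if send_once():
            return True
        # Give the guest time to ack the previous record before
        # retrying; there is no point in sleeping after the last try
        if i + 1 < attempts:
            sleep(interval)

    return False
```

With the default interval, a timeout of 10 seconds gives up to 100
attempts, which covers the 5s to 10s stalls mentioned above.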

> 
> > +
> > +        if not obj:
> >              return None
> 
> 

-- 
Thanks,
Mauro



* Re: [PATCH 07/13] scripts/ghes_inject: add a logic to decode CPER
  2026-01-21 13:27   ` Jonathan Cameron via qemu development
@ 2026-01-21 16:24     ` Mauro Carvalho Chehab
  2026-01-22 16:23     ` Mauro Carvalho Chehab
  1 sibling, 0 replies; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 16:24 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Mauro Carvalho Chehab, Michael S Tsirkin, Shiju Jose, qemu-devel,
	Igor Mammedov, Cleber Rosa, John Snow

On Wed, Jan 21, 2026 at 01:27:38PM +0000, Jonathan Cameron wrote:
> On Wed, 21 Jan 2026 12:25:15 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 
> > Add a decoder to help debugging injected CPERs. This is more
> > relevant when we add fuzzy-testing error inject, as the
> > decoder is helpful to identify what it packages will be sent
> > via QEMU to the firmware-fist logic.
> > 
> > By purpose, I opted to keep this completely independent from
> > the encoders implementation, as this can be used even when
> > there are no encoders for a certain GGUID type (except for a
> GGUID?
> 
> > fuzzy logic test, which is pretty much independent of the
> > records meaning).
> > 
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> 
> A few minor things inline.
> 
> I checked the specs, so all that stuff is fine including the spec bug
> you mention (which I'm sure you've already reported :)
> 
> Jonathan
> 
> > ---
> >  MAINTAINERS            |    1 +
> >  scripts/ghes_decode.py | 1155 ++++++++++++++++++++++++++++++++++++++++
> >  scripts/qmp_helper.py  |    3 +
> >  3 files changed, 1159 insertions(+)
> >  create mode 100644 scripts/ghes_decode.py
> > 
> 
> > diff --git a/scripts/ghes_decode.py b/scripts/ghes_decode.py
> > new file mode 100644
> > index 000000000000..6c7fdfe84e3a
> > --- /dev/null
> > +++ b/scripts/ghes_decode.py
> > @@ -0,0 +1,1155 @@
> > +#!/usr/bin/env python3
> > +#
> > +# pylint: disable=R0903,R0912,R0913,R0915,R0917,R1713,E1121,C0302,W0613
> > +# SPDX-License-Identifier: GPL-2.0-or-later
> > +#
> > +# Copyright (C) 2025 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > +
> > +"""
> > +Helper classes to decode a generic error data entry.
> > +
> > +By purpose, the logic here is independent of the logic inside qmp_helper
> > +and other modules. With a different implementation, it is more likely to
> > +discover bugs at the error injection logic. Also, as this can be used to
> > +dump errors injected by reproducing an error mesage or for fuzzy error
> > +injection, it can't rely at the encoding logic inside each module of
> 
> rely on the encoding logic
> 
> > +ghes_inject.py.
> > +
> > +To make the decoder simple, the decode logic here is at field level, not
> > +trying to decode bitmaps.
> > +"""
> > +
> > +from typing import Optional
> 
> 
> > +class DecodeProcX86():
> > +    """
> > +    Class to decode an x86 Processor Error as defined at
> > +    UEFI 2.1 - N.2.2 Section Descriptor
> > +    """
> > +
> > +    # GUID for x86 Processor Error
> > +    guid = "dc3ea0b0-a144-4797-b95b-53fa242b6e1d"
> > +
> > +    pei_fields = [
> > +        ("Error Structure Type", 16, "guid"),
> > +        ("Validation Bits", 8, "int"),
> > +        ("Check Information", 8, "int"),
> > +        ("Target Identifier", 8, "int"),
> > +        ("Requestor Identifier", 8, "int"),
> > +        ("Responder Identifier", 8, "int"),
> > +        ("Instruction Pointer", 8, "int")
> > +    ]
> > +
> > +    def __init__(self, cper: DecodeField):
> > +        self.cper = cper
> > +
> > +    def decode(self, guid):
> > +        """Decode x86 Processor Error"""
> > +        print("x86 Processor Error")
> > +
> > +        val = self.cper.decode("Validation Bits", 8, "int")
> > +        try:
> > +            val_bits = int.from_bytes(val, byteorder='little')
> > +        except ValueError, TypeError:
> > +            val_bits = 0
> > +
> > +        error_info_num = (val_bits >> 2) & 0x3f    # bits 2-7
> > +        context_info_num = (val_bits >> 8) & 0xff  # bits 8-13
> > +
> > +        self.cper.decode("Local APIC_ID", 8, "int")
> > +        self.cper.decode("CPUID Info", 48, "int")
> > +
> > +        for pei in range(0, error_info_num):
> > +            if self.cper.past_end:
> > +                return
> > +
> > +            print()
> > +            print(f"Processor Error Info {pei}")
> > +            for name, size, ftype in self.pei_fields:
> > +                self.cper.decode(name, size, ftype)
> > +
> > +        for ctx in range(0, context_info_num):
> > +            if self.cper.past_end:
> > +                return
> > +
> > +            print()
> > +            print(f"Context {ctx}")
> > +
> > +            self.cper.decode("Register Context Type", 2, "int")
> > +
> > +            val = self.cper.decode("Register Array Size", 2, "int")
> > +            try:
> > +                context_size = int(int.from_bytes(val, byteorder='little') / 8)
> > +            except ValueError, TypeError:
> > +                context_size = 0
> > +
> > +            self.cper.decode("MSR Address", 4, "int")
> > +            self.cper.decode("MM Register Address", 8, "int")
> > +
> > +            for reg in range(0, context_size):
> > +                if self.cper.past_end:
> > +                    return
> > +                self.cper.decode(f"Register offset {reg:<3}", 8, "int")
> 
> As for arm. Probably need sanity check it's the 8 byte version.
> And for giggles even on x86 some are 16 bytes ;)
> GDTR, IDTR
> 
> Meh, I don't care about those 16 byte ones, but not presenting the 4 byte
> ones as 8 would be good.

As this decoder is for debugging purposes, I tried to keep the code as simple
as possible, so I avoided adding too much detail to it. What is important,
IMO, is mainly to be able to quickly check whether the kernel report and
rasdaemon are properly decoding the message.

Also, to be fair, even on 32-bit systems, the size of this register is
8 bytes. Nothing prevents it from holding values above 32 bits.

As a matter of fact, when using the fuzzy-test command, the logic
there won't care about what is valid or not: right now, it will either:

    - place a random number;
    - place 0, if used like:
            ghes_inject.py fuzzy --zero

Btw, if I'm not mistaken, currently, RAS is only enabled in QEMU for arm64.

> 
> 
> 
> > +
> > +    @staticmethod
> > +    def decode_list():
> > +        """
> > +        Returns a tuple with the GUID and class
> > +        """
> > +        return [(DecodeProcX86.guid, DecodeProcX86)]
> 
> > +class DecodeProcArm():
> > +    """
> > +    Class to decode an ARM Processor Error as defined at
> > +    UEFI 2.6 - N.2.2 Section Descriptor
> > +    """
> > +
> > +    # GUID for ARM Processor Error
> > +    guid = "e19e3d16-bc11-11e4-9caa-c2051d5d46b0"
> > +
> > +    arm_pei_fields = [
> > +        ("Version",              1, "int"),
> > +        ("Length",               1, "int"),
> > +        ("valid",                2, "int"),
> > +        ("type",                 1, "int"),
> > +        ("multiple-error",       2, "int"),
> > +        ("flags",                1, "int"),
> > +        ("error-info",           8, "int"),
> > +        ("virt-addr",            8, "int"),
> > +        ("phy-addr",             8, "int"),
> > +    ]
> > +
> > +    def __init__(self, cper: DecodeField):
> > +        self.cper = cper
> > +
> > +    def decode(self, guid):
> > +        """Decode Processor ARM"""
> > +
> > +        print("ARM Processor Error")
> > +
> > +        start = self.cper.pos
> > +
> > +        self.cper.decode("Valid", 4, "int")
> > +
> > +        val = self.cper.decode("Error Info num", 2, "int")
> > +        try:
> > +            error_info_num = int.from_bytes(val, byteorder='little')
> > +        except ValueError, TypeError:
> > +            error_info_num = 0
> > +
> > +        val = self.cper.decode("Context Info num", 2, "int")
> > +        try:
> > +            context_info_num = int.from_bytes(val, byteorder='little')
> > +        except ValueError, TypeError:
> > +            context_info_num = 0
> > +
> > +        val = self.cper.decode("Section Length", 4, "int")
> > +        try:
> > +            section_length = int.from_bytes(val, byteorder='little')
> > +        except ValueError, TypeError:
> > +            section_length = 0
> > +
> > +        self.cper.decode("Error affinity level", 1, "int")
> > +        self.cper.decode("Reserved", 3, "int")
> > +        self.cper.decode("MPIDR_EL1", 8, "int")
> > +        self.cper.decode("MIDR_EL1", 8, "int")
> > +        self.cper.decode("Running State", 4, "int")
> > +        self.cper.decode("PSCI State", 4, "int")
> > +
> > +        for pei in range(0, error_info_num):
> > +            if self.cper.past_end:
> > +                return
> > +
> > +            print()
> > +            print(f"Processor Error Info {pei}")
> > +            for name, size, ftype in self.arm_pei_fields:
> > +                self.cper.decode(name, size, ftype)
> > +
> > +        for ctx in range(0, context_info_num):
> > +            if self.cper.past_end:
> > +                return
> > +
> > +            print()
> > +            print(f"Context {ctx}")
> > +            self.cper.decode("Version", 2, "int")
> > +            self.cper.decode("Register Context Type", 2, "int")
> > +            val = self.cper.decode("Register Array Size", 4, "int")
> > +            try:
> > +                context_size = int(int.from_bytes(val, byteorder='little') / 8)
> > +            except ValueError:
> > +                context_size = 0
> > +
> > +            for reg in range(0, context_size):
> > +                if self.cper.past_end:
> > +                    return
> > +                self.cper.decode(f"Register {reg:<3}", 8, "int")
> 
> Maybe check it's not a 32 bit context?  Don't decode it if it
> is but don't try to decode it as 64 bit.  Can get that from the
> Register Context Type.  Anything over 4 is fine.

I can, but see above: this can actually be a 64-bit number even
when only 32 bits are valid (when used with the fuzzy injector).
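
A minimal sketch of the width check suggested in the review, not part of
the patch. It assumes, based only on the review comment above, that ARM
Register Context Types below 4 describe 32-bit registers and the others
64-bit ones; the helper name and the threshold constant are hypothetical
and should be checked against the UEFI register context table.

```python
import struct

# Assumption taken from the review comment: context types below this
# value hold 32-bit registers, the others hold 64-bit registers.
FIRST_64BIT_TYPE = 4


def split_registers(ctx_type: int, raw: bytes) -> list[tuple[int, int]]:
    """Split a register array into (width_in_bytes, value) pairs."""
    width = 4 if ctx_type < FIRST_64BIT_TYPE else 8
    fmt = "<I" if width == 4 else "<Q"      # little-endian, unsigned
    usable = len(raw) - len(raw) % width    # ignore a truncated tail
    return [(width, struct.unpack_from(fmt, raw, off)[0])
            for off in range(0, usable, width)]
```

With fuzzy-injected data the same eight bytes decode either as two 32-bit
values or as one 64-bit value, which is the trade-off discussed above.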

> > +
> > +        remaining = max(section_length + start - self.cper.pos, 0)
> > +        if remaining:
> > +            print()
> > +            self.cper.decode("Vendor data", remaining, "int")
> > +
> > +    @staticmethod
> > +    def decode_list():
> > +        """
> > +        Returns a tuple with the GUID and class
> > +        """
> > +        return [(DecodeProcArm.guid, DecodeProcArm)]
> 
> > +class DecodeDMAVT():
> > +    """
> > +    Class to decode a DMA Virtualization Technology Error as defined at
> > +    UEFI 2.2 - N.2.2 Section Descriptor
> As below. Feels like it should be 
> 
> N2.11.2 Intel VT for Directed I/O specific DMAr Error Section Descriptor.
> Same for other places this comment exists.
> 
> 
> > +    """
> > +
> > +    # GUID for DMA VT Error
> > +    guid = "71761d37-32b2-45cd-a7d0-b0fedd93e8cf"
> 
> > +class DecodeDMAIOMMU():
> > +    """
> > +    Class to decode an IOMMU DMA Error as defined at
> > +    UEFI 2.2 - N.2.2 Section Descriptor
> 
> Odd reference choice.
> This stuff is in N2.11.3 IOMMU Specific DMAr Error Section
> in 2.11. I'm too lazy to find it in the older spec.
> 
> > +    """
> > +
> > +    # GUID for IOMMU DMA Error
> 
> Maybe call it the IOMMU Specific DMAr Error
> 
> > +    guid = "036f84e1-7f37-428c-a79e-575fdfaa84ec"
> > +
> > +    fields = [
> 
> > +class DecodeCXLCompEvent():
> > +    """
> > +    Class to decode a CXL Component Error as defined at
> > +    UEFI 2.9 - N.2.14. CXL Component Events Section
> > +
> > +    Currently, the decoder handles only the common fields, displaying
> > +    the CXL Component Event Log field in bytes.
> > +    """
> > +
> > +    # GUIDs, as defined at CXL specification 3.2: 8.2.10.2.1 Event Records
> > +    guids = [
> > +        ("General Media",              "fbcd0a77-c260-417f-85a9-088b1621eba6"),
> > +        ("DRAM",                       "601dcbb3-9c06-4eab-b8af-4e9bfb5c9624"),
> > +        ("Memory Module",              "fe927475-dd59-4339-a586-79bab113bc74"),
> > +        ("Memory Sparing",             "e71f3a40-2d29-4092-8a39-4d1c966c7c65"),
> > +        ("Physical Switch",            "77cf9271-9c02-470b-9fe4-bc7b75f2da97"),
> 
> As per earlier patch review I'm not sure we care about this and the following.
> I don't think we'll ever see them in CPER records, unless going other something
> else that encapsulates that format.

As I commented on patch 2, UEFI 2.11 accepts those in CPERs. So, unless a
new errata drops them, from the UEFI perspective, all those types could
be found in CPERs sent to OSPM.

From my perspective, for fuzzy-testing, we'd like to be able to inject them
as well, to check that they won't cause any trouble in the OSPM implementation.

See more below...

> 
> > +        ("Virtual Switch",             "40d26425-3396-4c4d-a5da-3d472a63af25"),
> > +        ("MDL Port",                   "8dc44363-0c96-4710-b7bf-04bb99534c3f"),
> MLD -> Multi-Logical Device
> 
> > +        ("Dynamic Capabilities",       "ca95afa7-f183-4018-8c2f-95268e101a2a"),
> > +    ]
> > +
> > +    fields = [
> > +        ("Validation Bits", 8, "int"),
> > +        ("Device ID", 12, "int"),
> > +        ("Device Serial Number", 8, "int")
> > +    ]
> > +
> > +    def __init__(self, cper: DecodeField):
> > +        self.cper = cper
> > +
> > +    def decode(self, guid):
> > +        """Decode CXL Protocol Error"""
> > +        for name, guid_event in DecodeCXLCompEvent.guids:
> > +            if guid == guid_event:
> > +                print(f"CXL {name} Event Record")
> > +                break
> > +
> > +        val = self.cper.decode("Length", 4, "int")
> > +        try:
> > +            length = int.from_bytes(val, byteorder='little')
> > +        except ValueError, TypeError:
> > +            length = 0
> > +
> > +        for name, size, ftype in self.fields:
> > +            self.cper.decode(name, size, ftype)
> > +
> > +        length = max(0, length - self.cper.pos)
> > +
> > +        self.cper.decode("CXL Component Event Log", length, "int",
> > +                         show_incomplete=True)

... here, we are not actually decoding the CXL-specific part of the
message; instead, we display it as a byte sequence.

When the time comes and we add "cxl" command(s) to the tool, we
may want to decode the event log according to the CXL spec.

For that purpose, we won't need to decode all types, just the ones
that make sense in practice.
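
Showing an undecoded region as a byte sequence can be sketched as below.
This is only an illustration: the function name is hypothetical and the
layout simply mimics the hex dumps quoted elsewhere in this thread, so it
is not necessarily how ghes_decode.py formats its output.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as 'offset  hex bytes   printable text' lines."""
    pad = width * 3 - 1                     # room for a full row of hex
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as dots.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{off:08x}  {hexpart:<{pad}}   {text}")
    return "\n".join(lines)
```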

> > +
> > +    @staticmethod
> > +    def decode_list():
> > +        """
> > +        Returns a tuple with the GUID and class
> > +        """
> > +
> > +        guid_list = []
> > +
> > +        for _, guid in DecodeCXLCompEvent.guids:
> > +            guid_list.append((guid, DecodeCXLCompEvent))
> > +
> > +        return guid_list
> 
> 
> ...
> 
> > +class DecodeGhesEntry():
> > +    """
> > +    Class to decode a GHESv2 element, as defined at:
> > +    ACPI 6.1: 18.3.2.8 Generic Hardware Error Source version 2
> > +    """
> > +
> > +    # Fields present on all CPER records
> > +    common_fields = [
> > +        # Generic Error Status Block fields
> > +        ("Block Status",           4, "int", None),
> > +        ("Raw Data Offset",        4, "int", "raw_data_offset"),
> > +        ("Raw Data Length",        4, "int", "raw_data_len"),
> > +        ("Data Length",            4, "int", None),
> > +        ("Error Severity",         4, "int", None),
> > +
> > +        # Generic Error Data Entry
> > +        ("Section Type",          16, "guid", "session_type"),
> 
> Why session_type? Is idea it's the type of decode session we are
> doing? Feels a bit too much like a typo from section_type.
> 
> 
> > +        ("Error Severity",         4, "int", None),
> > +        ("Revision",               2, "int", None),
> > +        ("Validation Bits",        1, "int", None),
> > +        ("Flags",                  1, "int", None),
> > +        ("Error Data Length",      4, "int", None),
> > +        ("FRU Id",                16, "guid", None),
> > +        ("FRU Text",              20, "str", None),
> > +        ("Timestamp",              8, "bcd", None),
> > +    ]
> > +
> > +    def __init__(self, cper_data: bytearray):
> > +        """
> > +        Initializes a byte array, decoding it, printing results at the
> > +        screen.
> > +        """
> 
> ...
> 
> > +
> > +        # Handle common types
> > +        cper = DecodeField(cper_data)
> > +
> > +        fields = {}
> > +        for name, size, ftype, var in self.common_fields:
> > +            val = cper.decode(name, size, ftype)
> > +
> > +            if ftype == "int":
> > +                try:
> > +                    val = int.from_bytes(val, byteorder='little')
> > +                except ValueError, TypeError:
> > +                    val = 0
> > +
> > +            if var is not None:
> > +                fields[var] = val
> > +
> > +        if fields["raw_data_len"]:
> > +            cper.decode("Raw Data", fields["raw_data_len"],
> > +                        "int", pos=fields["raw_data_offset"])
> > +
> > +        if not fields["session_type"]:
> > +            return
> 

-- 
Thanks,
Mauro


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 09/13] scripts/ghes_inject: add a handler for PCIe bus error
  2026-01-21 13:32   ` Jonathan Cameron via qemu development
  2026-01-21 13:33     ` Jonathan Cameron via qemu development
@ 2026-01-21 16:26     ` Mauro Carvalho Chehab
  2026-01-22 16:42     ` Mauro Carvalho Chehab
  2 siblings, 0 replies; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 16:26 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Mauro Carvalho Chehab, Michael S Tsirkin, Shiju Jose, qemu-devel,
	Igor Mammedov, Cleber Rosa, John Snow

On Wed, Jan 21, 2026 at 01:32:55PM +0000, Jonathan Cameron wrote:
> On Wed, 21 Jan 2026 12:25:17 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 
> > Add a logic to do PCIe BUS error injection.
> > 
> > On Linux Kernel, despite CPER_SEC_PCI_X_BUS macro is defined for such
> > event, ghes.c doesn't implement support for it yet:
> > 
> > [16950.077494] {26}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > [16950.077866] {26}[Hardware Error]: event severity: recoverable
> > [16950.078118] {26}[Hardware Error]:  Error 0, type: recoverable
> > [16950.078444] {26}[Hardware Error]:   section type: unknown, c5753963-3b84-4095-bf78-eddad3f9c9dd
> > [16950.078800] {26}[Hardware Error]:   section length: 0x48
> > [16950.079069] {26}[Hardware Error]:   00000000: 00000000 00000000 00000000 00000000  ................
> > [16950.079442] {26}[Hardware Error]:   00000010: 00000001 00000000 00000000 00000000  ................
> > [16950.079811] {26}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> > [16950.080181] {26}[Hardware Error]:   00000030: 00000000 00000000 00000000 00000000  ................
> > [16950.080538] {26}[Hardware Error]:   00000040: 00000000 00000000                    ........
> > 
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> 
> LGTM. Bit surprised Linux doesn't decode it but fair enough.
> Seems a bit unlikely it ever will given this seems not to cover PCIe
> which has it's own records.

Yeah, me too. If I got it right from the specs, this one is related to
the PCIe bus controller, while the other one is for the PCIe device.

Perhaps in practice vendors are using a hardware-first approach for
the PCI controller.

> 
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

-- 
Thanks,
Mauro


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 10/13] scripts/ghes_inject: add support for fuzzy logic testing
  2026-01-21 13:37   ` Jonathan Cameron via qemu development
@ 2026-01-21 16:35     ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-21 16:35 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Mauro Carvalho Chehab, Michael S Tsirkin, Shiju Jose, qemu-devel,
	Igor Mammedov, Cleber Rosa, John Snow

On Wed, Jan 21, 2026 at 01:37:10PM +0000, Jonathan Cameron wrote:
> On Wed, 21 Jan 2026 12:25:18 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 
> > Add a command to inject random errors for fuzzy logic testing.
> > 
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Seems reasonable, but maybe some more text in the description on
> what types of fuzzy records it generates?  I.e. what is constrained
> or at least starts as being standard values vs what is entirely random

I'll improve it in the next version. By default, it just randomly picks
a valid GUID from the list, and selects a default size that would be
a valid choice.

There is a parameter to force it to use a specific type:

    $ ghes_inject.py fuzzy -h

    Inject fuzzy test CPER packets

    options:
      -h, --help            show this help message and exit

    Fuzz testing error inject:
      -T, --type TYPE       Type of the error: proc-generic,proc-x86,proc-itanium,proc-arm,platform-mem,platform-mem2,pcie,pci-bus,pci-dev,firmware-error,dma-generic,dma-vt,dma-iommu,ccix-per,cxl-prot-err,cxl-evt-media,cxl-evt-dram,cxl-evt-mem-module,cxl-evt-mem-sparing,cxl-evt-phy-sw,cxl-evt-virt-sw,cxl-evt-mdl-port,cxl-evt-dyna-cap,fru-mem-poison
      --min-size MIN_SIZE   Minimal size of the CPER
      --max-size MAX_SIZE   Maximal size of the CPER
      -z, --zero            Zero all bytes of the CPER payload (default: False)
      -t, --timeout TIMEOUT
                            Specify timeout for CPER send retries (default: 30.0 seconds)
      -d, --delay DELAY     Specify a delay between multiple CPER (default: 0)
      -c, --count COUNT     Specify the number of CPER records to be sent (default: 1)

and parameters that let it vary the payload size.

When -T is not used, it randomly picks a valid GUID. When it is
used, all injected packets will have the same type.
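
The selection rules described above can be sketched as follows. This is
an illustration under stated assumptions, not the code from
ghes_inject.py: the function name, the reduced type table, and the
default size limits are hypothetical. The GUIDs are the ones quoted in
this thread.

```python
import random
from typing import Optional

# A few of the types and GUIDs quoted in this thread; the real tool
# supports many more.
GUIDS = {
    "proc-x86": "dc3ea0b0-a144-4797-b95b-53fa242b6e1d",
    "proc-arm": "e19e3d16-bc11-11e4-9caa-c2051d5d46b0",
    "cxl-evt-media": "fbcd0a77-c260-417f-85a9-088b1621eba6",
    "cxl-evt-dyna-cap": "ca95afa7-f183-4018-8c2f-95268e101a2a",
}


def make_fuzzy_record(rng: random.Random, err_type: Optional[str] = None,
                      min_size: int = 16, max_size: int = 64,
                      zero: bool = False):
    """Return (type, guid, payload) for one fuzzy CPER payload."""
    if err_type is None:
        # Like running without -T: pick any known type.
        err_type = rng.choice(sorted(GUIDS))
    size = rng.randint(min_size, max_size)  # hypothetical default limits
    if zero:
        payload = bytes(size)               # like -z: all-zero payload
    else:
        payload = bytes(rng.getrandbits(8) for _ in range(size))
    return err_type, GUIDS[err_type], payload
```

Passing a seeded random.Random makes a given sequence of records
reproducible, which is convenient when re-testing after a fix.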

Right now, fuzzy-testing only mangles the CPER payload, so GUIDs are
valid. See:


  $ ghes_inject.py -d fuzzy
  Injecting cxl-evt-dyna-cap with 64 bytes
  GUID: ca95afa7-f183-4018-8c2f-95268e101a2a
  Generic Error Status Block (20 bytes):
      00000000  01 00 00 00 00 00 00 00 00 00 00 00 88 00 00 00   ................
      00000010  00 00 00 00                                       ....

  Generic Error Data Entry (72 bytes):
      00000000  a7 af 95 ca 83 f1 18 40 8c 2f 95 26 8e 10 1a 2a   .......@./.&...*
      00000010  00 00 00 00 00 03 00 00 40 00 00 00 00 00 00 00   ........@.......
      00000020  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
      00000030  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
      00000040  00 00 00 00 00 00 00 00                           ........

  Payload (64 bytes):
      00000000  32 ba ed 9c 1f ea ac cd 8c 8f 44 7b ab 4b c1 8f   2.........D{.K..
      00000010  68 32 8a c1 07 dd 0f 93 54 de 09 a8 42 79 80 1f   h2......T...By..
      00000020  f4 e8 0c 85 02 2d 0b 7d f5 64 32 8e 3b d6 f1 6b   .....-.}.d2.;..k
      00000030  73 39 97 00 54 30 aa e6 39 f0 5d 95 1c b1 cd 0f   s9..T0..9.].....

The first two tables (GESB and GEDE) aren't randomized, and the GUID is always
a valid one. Just the payload contains either random numbers (default) or
all zeros:

  $ ghes_inject.py -d fuzzy -z
  Injecting cxl-evt-media with 32 bytes
  GUID: fbcd0a77-c260-417f-85a9-088b1621eba6
  Generic Error Status Block (20 bytes):
      00000000  01 00 00 00 00 00 00 00 00 00 00 00 68 00 00 00   ............h...
      00000010  00 00 00 00                                       ....

  Generic Error Data Entry (72 bytes):
      00000000  77 0a cd fb 60 c2 7f 41 85 a9 08 8b 16 21 eb a6   w...`..A.....!..
      00000010  00 00 00 00 00 03 00 00 20 00 00 00 00 00 00 00   ........ .......
      00000020  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
      00000030  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
      00000040  00 00 00 00 00 00 00 00                           ........

  Payload (32 bytes):
      00000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
      00000010  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
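
For reference, the two fixed headers in the dumps above can be rebuilt
with a short sketch. The field order follows the common_fields table from
the ghes_decode.py patch, and the constant values (Block Status 1,
Revision 0x0300, everything else zero) are read off the dumps. The
function name is hypothetical and this is not the encoder used by
qmp_helper.py.

```python
import struct
import uuid


def build_headers(guid: str, payload_len: int) -> tuple[bytes, bytes]:
    """Return (GESB, GEDE) headers for a payload of payload_len bytes."""
    # Generic Error Data Entry: 72 bytes, little-endian, no padding.
    gede = struct.pack(
        "<16sIHBBI16s20sQ",
        uuid.UUID(guid).bytes_le,   # Section Type (mixed-endian GUID)
        0,                          # Error Severity
        0x0300,                     # Revision, as seen in the dumps
        0,                          # Validation Bits
        0,                          # Flags
        payload_len,                # Error Data Length
        bytes(16),                  # FRU Id
        bytes(20),                  # FRU Text
        0,                          # Timestamp
    )
    # Generic Error Status Block: 20 bytes.
    gesb = struct.pack(
        "<5I",
        1,                          # Block Status
        0,                          # Raw Data Offset
        0,                          # Raw Data Length
        len(gede) + payload_len,    # Data Length
        0,                          # Error Severity
    )
    return gesb, gede
```

For the first dump, a 64-byte payload gives a Data Length of 72 + 64 =
0x88, and bytes_le yields the a7 af 95 ca ... byte order shown in the
Generic Error Data Entry.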

-- 
Thanks,
Mauro


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/13] scripts/qmp_helper: add missing CXL UEFI GUID
  2026-01-21 15:45     ` Mauro Carvalho Chehab
@ 2026-01-22 10:52         ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron @ 2026-01-22 10:52 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow, linux-cxl

On Wed, 21 Jan 2026 16:45:58 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> On Wed, Jan 21, 2026 at 12:26:04PM +0000, Jonathan Cameron wrote:
> > On Wed, 21 Jan 2026 12:25:10 +0100
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> >   
> > > The UEFI 2.11 - N.2.14. CXL Component Events Section states that
> > > XL events are described at CXL specification 3.2:  
> > CXL
> >   
> > >         8.2.10.2.1 Event Records
> > > 
> > > Add the GUIDs defined here to fuzzy logic error injection code.  
> > 
> > +CC linux-cxl as more folk there who will be familiar with this
> > stuff.
> > 
> > Some of these won't be seen on a host. The same event
> > infrastructure is used for reporting on out of band interfaces
> > and some in band ones, but not ones that will turn up on the
> > mailboxes that firmware will be using to get info.  
> 
> Good to know, but UEFI 2.11 still mentions all of them as
> possible GUIDs:
> 
>     https://uefi.org/specs/UEFI/2.11/Apx_N_Common_Platform_Error_Record.html#cxl-component-events-section
> 
> So, the UEFI 2.11 doesn't explicitly state they won't de delivered
> to OSPM. Quite contrary, they're listed as valid values for CPER,
> even if, in practice, they won't.
> 
> This is just a small set of variables, that won't bring any major
> impact on the code. So, I prefer to keep them in sync with the spec.
> If they end removing the unused ones, we can update it in the future.
> 
> If you want, I can add a note at the next version with your
> comments about them.
> 
A note works for me.

As I understand it, CPER records are getting adopted in other specs,
so it may make sense to document them, even if they aren't a possibility
via ACPI.

However, I'm not seeing them in the spec link you point at.
All I'm seeing is a cross reference to the CXL spec Events Record Format.

> >   
> > > 
> > > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > > ---
> > >  scripts/qmp_helper.py | 25 +++++++++++++++++++++++++
> > >  1 file changed, 25 insertions(+)
> > > 
> > > diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
> > > index 249a8c7187d1..7e786c4adfd9 100755
> > > --- a/scripts/qmp_helper.py
> > > +++ b/scripts/qmp_helper.py
> > > @@ -711,3 +711,28 @@ class cper_guid:
> > >      CPER_CXL_PROT_ERR = guid(0x80B9EFB4, 0x52B5, 0x4DE3,
> > >                               [0xA7, 0x77, 0x68, 0x78,
> > >                                0x4B, 0x77, 0x10, 0x48])
> > > +
> > > +    CPER_CXL_EVT_GEN_MEDIA = guid(0xFBCD0A77, 0xC260, 0x417F,
> > > +                                  [0x85, 0xA9, 0x08, 0x8B,
> > > +                                   0x16, 0x21, 0xEB, 0xA6])
> > > +    CPER_CXL_EVT_DRAM = guid(0x601DCBB3, 0x9C06, 0x4EAB,
> > > +                             [0xB8, 0xAF, 0x4E, 0x9B,
> > > +                              0xFB, 0x5C, 0x96, 0x24])
> > > +    CPER_CXL_EVT_MEM_MODULE = guid(0xFE927475, 0xDD59, 0x4339,
> > > +                                   [0xA5, 0x86, 0x79, 0xBA,
> > > +                                    0xB1, 0x13, 0xBC, 0x74])
> > > +    CPER_CXL_EVT_MEM_SPARING = guid(0xE71F3A40, 0x2D29, 0x4092,
> > > +                                    [0x8A, 0x39, 0x4D, 0x1C,
> > > +                                     0x96, 0x6C, 0x7C, 0x65])  
> > 
> > The above are all fine I think.
> > 
> > From here on I think they will never come via a CPER record.
> >   
> > > +    CPER_CXL_EVT_PHY_SW = guid(0x77CF9271, 0x9C02, 0x470B,
> > > +                               [0x9F, 0xE4, 0xBC, 0x7B,
> > > +                                0x75, 0xF2, 0xDA, 0x97])  
> > 
> > This is only going to surface over either out of band or switch CCI
> > I'd be very surprised to see a firmware anywhere near these.
> > More specifically they are only defined in the Fabric management
> > section of the spec, which strongly hints we'd not expect host firmware
> > to know anything about them. 
> > The events reported may well span bits of the topology currently
> > assigned to different hosts.
> >   
> > > +    CPER_CXL_EVT_VIRT_SW = guid(0x40D26425, 0x3396, 0x4C4D,
> > > +                                [0xA5, 0xDA, 0x3D, 0x47,
> > > +                                  0x2A, 0x63, 0xAF, 0x25])  
> > 
> > Also a fabric management event.
> >   
> > > +    CPER_CXL_EVT_MLD_PORT = guid(0x8DC44363, 0x0C96, 0x4710,
> > > +                                 [0xB7, 0xBF, 0x04, 0xBB,
> > > +                                  0x99, 0x53, 0x4C, 0x3F])  
> > 
> > Also a fabric management event.
> >   
> > > +    CPER_CXL_EVT_DYNA_CAP = guid(0xCA95AFA7, 0xF183, 0x4018,
> > > +                                 [0x8C, 0x2F, 0x95, 0x26,
> > > +                                  0x8E, 0x10, 0x1A, 0x2A])  
> > These are never routed to firmware. They are part of the OS only
> > managed flows for dynamic capacity.
> > They have their own event log on the hardware and for this particular
> > set most relevant thing is in
> > CXL 4.0 Table 8-235 Set Event Interrupt Policy Input Payload
> > which controls whether a firmware interrupt or MSIX is used signal
> > the Dynamic Capacity Event Log Interrupt Settings only allows
> > for MSI/MSI-X, not FW interrupt (EFN VDM) like the other logs.
> > 
> > 
> > Jonathan
> > 
> >   
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/13] scripts/qmp_helper: add missing CXL UEFI GUID
  2026-01-22 10:52         ` Jonathan Cameron via qemu development
  (?)
@ 2026-01-22 15:08         ` Mauro Carvalho Chehab
  2026-01-22 17:13             ` Jonathan Cameron via qemu development
  -1 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-22 15:08 UTC (permalink / raw)
  To: Jonathan Cameron via qemu development
  Cc: Jonathan Cameron, Michael S Tsirkin, Shiju Jose, Igor Mammedov,
	Cleber Rosa, John Snow, linux-cxl

On Thu, 22 Jan 2026 10:52:14 +0000
Jonathan Cameron via qemu development <qemu-devel@nongnu.org> wrote:

> On Wed, 21 Jan 2026 16:45:58 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 
> > On Wed, Jan 21, 2026 at 12:26:04PM +0000, Jonathan Cameron wrote:  
> > > On Wed, 21 Jan 2026 12:25:10 +0100
> > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > >     
> > > > The UEFI 2.11 - N.2.14. CXL Component Events Section states that
> > > > XL events are described at CXL specification 3.2:    
> > > CXL
> > >     
> > > >         8.2.10.2.1 Event Records
> > > > 
> > > > Add the GUIDs defined here to fuzzy logic error injection code.    
> > > 
> > > +CC linux-cxl as more folk there who will be familiar with this
> > > stuff.
> > > 
> > > Some of these won't be seen on a host. The same event
> > > infrastructure is used for reporting on out of band interfaces
> > > and some in band ones, but not ones that will turn up on the
> > > mailboxes that firmware will be using to get info.    
> > 
> > Good to know, but UEFI 2.11 still mentions all of them as
> > possible GUIDs:
> > 
> >     https://uefi.org/specs/UEFI/2.11/Apx_N_Common_Platform_Error_Record.html#cxl-component-events-section
> > 
> > So, the UEFI 2.11 doesn't explicitly state they won't de delivered
> > to OSPM. Quite contrary, they're listed as valid values for CPER,
> > even if, in practice, they won't.
> > 
> > This is just a small set of variables, that won't bring any major
> > impact on the code. So, I prefer to keep them in sync with the spec.
> > If they end removing the unused ones, we can update it in the future.
> > 
> > If you want, I can add a note at the next version with your
> > comments about them.
> >   
> A note works for me.
> 
> As I understand it CPER records are getting adopted in other specs
> so it may make sense to document them, even if they  aren't a possibility
> via ACPI.
> 
> However I'm not seeing them in the spec link you point at.  
> All I'm seeing is a cross reference the CXL spec Events Record Format.

Heh, true, I got confused with another field. Yet CXL spec 3.2
says:


	8.2.10.2 Events
	===============

	This section defines the standard event record format that all CXL devices shall use
	when reporting events to the host.

	...

	Table 8-55. Common Event Record Format (Sheet 1 of 2)

	------	--------	----------------------------------------------------------------------------
	Byte	Length		Description
	Offset	in Bytes
	------	--------	----------------------------------------------------------------------------
	00h	10h		Event Record Identifier: UUID representing the specific Event Record format.
				The following UUIDs are defined in this spec:
				• fbcd0a77-c260-417f-85a9-088b1621eba6 – General Media Event Record
				  (see Table 8-57)
				• 601dcbb3-9c06-4eab-b8af-4e9bfb5c9624 – DRAM Event Record (see
				  Table 8-58)
				• fe927475-dd59-4339-a586-79bab113b774 – Memory Module Event Record
				  (see Table 8-59)
				• e71f3a40-2d29-4092-8a39-4d1c966c7c65 - Memory Sparing Event Record
				  (see Table 8-60)
				• 77cf9271-9c02-470b-9fe4-bc7b75f2da97 – Physical Switch Event Record
				  (see Table 7-77)
				• 40d26425-3396-4c4d-a5da-3d47263af425 – Virtual Switch Event Record
				  (see Table 7-78)
				• 8dc44363-0c96-4710-b7bf-04bb99534c3f – MLD Port Event Record (see
				  Table 7-79)
				• ca95afa7-f183-4018-8c2f-95268e101a2a - Dynamic Capacity Event Record
				  (see Table 8-62)
	------	--------	----------------------------------------------------------------------------

	...

So, the CXL spec says they'll arrive at the host, and UEFI doesn't say
they can't reach the OSPM.

In any case, it is easier to just pick the entire set of 8 GUIDs
and keep the scripts in sync with the specs than to filter out what
shouldn't be there and explain why, at the risk of eventually missing
something.

I'll append the diff below to the relevant patches on this patchset.

Regards,
Mauro

---

diff --git a/scripts/ghes_decode.py b/scripts/ghes_decode.py
index 6c7fdfe84e3a..7bac7dbd6b3a 100644
--- a/scripts/ghes_decode.py
+++ b/scripts/ghes_decode.py
@@ -935,6 +935,10 @@ class DecodeCXLCompEvent():
     """
 
     # GUIDs, as defined at CXL specification 3.2: 8.2.10.2.1 Event Records
+    # on Table 8-55. Common Event Record Format
+    #
+    # Please notice that, in practice, not all those events will be passed
+    # to OSPM. Some may be handled internally.
     guids = [
         ("General Media",              "fbcd0a77-c260-417f-85a9-088b1621eba6"),
         ("DRAM",                       "601dcbb3-9c06-4eab-b8af-4e9bfb5c9624"),
diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
index 32baca17ce10..583f898f04ef 100755
--- a/scripts/qmp_helper.py
+++ b/scripts/qmp_helper.py
@@ -778,10 +778,14 @@ class cper_guid:
                          [0xA6, 0xA6, 0x88, 0xB7,
                           0x28, 0xCF, 0x75, 0xD7])
 
+    # CXL GUIDs, as defined at CXL specification 3.2: 8.2.10.2.1 Event Records
+    # on Table 8-55. Common Event Record Format
+    #
+    # Please notice that, in practice, not all those events will be passed
+    # to OSPM. Some may be consumed internally
     CPER_CXL_PROT_ERR = guid(0x80B9EFB4, 0x52B5, 0x4DE3,
                              [0xA7, 0x77, 0x68, 0x78,
                               0x4B, 0x77, 0x10, 0x48])
-
     CPER_CXL_EVT_GEN_MEDIA = guid(0xFBCD0A77, 0xC260, 0x417F,
                                   [0x85, 0xA9, 0x08, 0x8B,
                                    0x16, 0x21, 0xEB, 0xA6])



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/13] scripts/ghes_inject: add a logic to decode CPER
  2026-01-21 13:27   ` Jonathan Cameron via qemu development
  2026-01-21 16:24     ` Mauro Carvalho Chehab
@ 2026-01-22 16:23     ` Mauro Carvalho Chehab
  1 sibling, 0 replies; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-22 16:23 UTC (permalink / raw)
  To: Jonathan Cameron via qemu development
  Cc: Jonathan Cameron, Michael S Tsirkin, Shiju Jose, Igor Mammedov,
	Cleber Rosa, John Snow, Mauro Carvalho Chehab

On Wed, 21 Jan 2026 13:27:38 +0000
Jonathan Cameron via qemu development <qemu-devel@nongnu.org> wrote:

> On Wed, 21 Jan 2026 12:25:15 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 

...

> > +class DecodeDMAVT():
> > +    """
> > +    Class to decode a DMA Virtualization Technology Error as defined at
> > +    UEFI 2.2 - N.2.2 Section Descriptor  
> As below. Feels like it should be 
> 
> N2.11.2 Intel VT for Directed I/O specific DMAr Error Section Descriptor.
> Same for other places this comment exists.

When reviewing the ghes.c driver, Igor asked us to pick the oldest
possible spec when writing comments inside GHES. In this specific case,
the oldest one is UEFI 2.2:
	https://uefi.org/sites/default/files/resources/UEFI_Spec_2_2_D.pdf

> > +    """
> > +
> > +    # GUID for DMA VT Error
> > +    guid = "71761d37-32b2-45cd-a7d0-b0fedd93e8cf"  

... there, at:
	N.2.2 Section Descriptor
	Table 227. Section Descriptor

Section Type shows several different GUIDs, including this one,
listed there as:
	
	Intel® VT for Directed I/O specific DMAr section
	• {0x71761D37, 0x32B2, 0x45cd, {0xA7, 0xD0, 0xB0,
	  0xFE, 0xDD, 0x93, 0xE8, 0xCF}}

That's basically why it is listed here for all the 9 GUIDs
defined at UEFI 2.2.
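
For reference, the struct notation used in that table and the string form
used in the decoder describe the same value. A minimal sketch of the
mapping, using only Python's standard uuid module; efi_guid() is a
hypothetical helper written for this mail, not the guid class from
qmp_helper.py, and the byte layout shown assumes the usual EFI_GUID
encoding (first three fields little-endian):

```python
import uuid


def efi_guid(time_low, time_mid, time_hi, tail):
    """Build a uuid.UUID from the EFI_GUID struct notation used by UEFI.

    Hypothetical helper for illustration only; it is not part of the
    scripts under review.
    """
    if len(tail) != 8:
        raise ValueError("EFI_GUID needs exactly 8 trailing bytes")

    # uuid.UUID wants six fields: the last six bytes form one 48-bit node
    return uuid.UUID(fields=(time_low, time_mid, time_hi,
                             tail[0], tail[1],
                             int.from_bytes(bytes(tail[2:]), "big")))


# Values copied from the UEFI 2.2 Section Descriptor entry quoted above
dmar = efi_guid(0x71761D37, 0x32B2, 0x45CD,
                [0xA7, 0xD0, 0xB0, 0xFE, 0xDD, 0x93, 0xE8, 0xCF])

print(str(dmar))             # canonical string, as used in the decoder
print(dmar.bytes_le.hex())   # 16 bytes with the first 3 fields little-endian
```

The string printed first should match the guid attribute in DecodeDMAVT.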

> 
> > +class DecodeDMAIOMMU():
> > +    """
> > +    Class to decode an IOMMU DMA Error as defined at
> > +    UEFI 2.2 - N.2.2 Section Descriptor  
> 
> Odd reference choice.
> This stuff is in N2.11.3 IOMMU Specific DMAr Error Section
> in 2.11. I'm too lazy to find it in the older spec.

Heh, most of the time I spent writing this patch went into finding
which spec version introduced each GUID ;-D

> 
> > +    """
> > +
> > +    # GUID for IOMMU DMA Error  
> 
> Maybe call it the IOMMU Specific DMAr Error
> 
> > +    guid = "036f84e1-7f37-428c-a79e-575fdfaa84ec"
> > +
> > +    fields = [  
> 
> > +class DecodeCXLCompEvent():
> > +    """
> > +    Class to decode a CXL Component Error as defined at
> > +    UEFI 2.9 - N.2.14. CXL Component Events Section
> > +
> > +    Currently, the decoder handles only the common fields, displaying
> > +    the CXL Component Event Log field in bytes.
> > +    """
> > +
> > +    # GUIDs, as defined at CXL specification 3.2: 8.2.10.2.1 Event Records
> > +    guids = [
> > +        ("General Media",              "fbcd0a77-c260-417f-85a9-088b1621eba6"),
> > +        ("DRAM",                       "601dcbb3-9c06-4eab-b8af-4e9bfb5c9624"),
> > +        ("Memory Module",              "fe927475-dd59-4339-a586-79bab113bc74"),
> > +        ("Memory Sparing",             "e71f3a40-2d29-4092-8a39-4d1c966c7c65"),
> > +        ("Physical Switch",            "77cf9271-9c02-470b-9fe4-bc7b75f2da97"),  
> 
> As per earlier patch review I'm not sure we care about this and the following.
> I don't think we'll ever see them in CPER records, unless going other something
> else that encapsulates that format.

I'm adding a notice before this table.

> 
> > +        ("Virtual Switch",             "40d26425-3396-4c4d-a5da-3d472a63af25"),
> > +        ("MDL Port",                   "8dc44363-0c96-4710-b7bf-04bb99534c3f"),  
> MLD -> Multi-Logical Device
> 
> > +        ("Dynamic Capabilities",       "ca95afa7-f183-4018-8c2f-95268e101a2a"),
> > +    ]
> > +
> > +    fields = [
> > +        ("Validation Bits", 8, "int"),
> > +        ("Device ID", 12, "int"),
> > +        ("Device Serial Number", 8, "int")
> > +    ]
> > +
> > +    def __init__(self, cper: DecodeField):
> > +        self.cper = cper
> > +
> > +    def decode(self, guid):
> > +        """Decode CXL Protocol Error"""
> > +        for name, guid_event in DecodeCXLCompEvent.guids:
> > +            if guid == guid_event:
> > +                print(f"CXL {name} Event Record")
> > +                break
> > +
> > +        val = self.cper.decode("Length", 4, "int")
> > +        try:
> > +            length = int.from_bytes(val, byteorder='little')
> > +        except (ValueError, TypeError):
> > +            length = 0
> > +
> > +        for name, size, ftype in self.fields:
> > +            self.cper.decode(name, size, ftype)
> > +
> > +        length = max(0, length - self.cper.pos)
> > +
> > +        self.cper.decode("CXL Component Event Log", length, "int",
> > +                         show_incomplete=True)
> > +
> > +    @staticmethod
> > +    def decode_list():
> > +        """
> > +        Returns a tuple with the GUID and class
> > +        """
> > +
> > +        guid_list = []
> > +
> > +        for _, guid in DecodeCXLCompEvent.guids:
> > +            guid_list.append((guid, DecodeCXLCompEvent))
> > +
> > +        return guid_list  
> 
> 
> ...
> 
> > +class DecodeGhesEntry():
> > +    """
> > +    Class to decode a GHESv2 element, as defined at:
> > +    ACPI 6.1: 18.3.2.8 Generic Hardware Error Source version 2
> > +    """
> > +
> > +    # Fields present on all CPER records
> > +    common_fields = [
> > +        # Generic Error Status Block fields
> > +        ("Block Status",           4, "int", None),
> > +        ("Raw Data Offset",        4, "int", "raw_data_offset"),
> > +        ("Raw Data Length",        4, "int", "raw_data_len"),
> > +        ("Data Length",            4, "int", None),
> > +        ("Error Severity",         4, "int", None),
> > +
> > +        # Generic Error Data Entry
> > +        ("Section Type",          16, "guid", "session_type"),  
> 
> Why session_type? Is idea it's the type of decode session we are
> doing? Feels a bit too much like a typo from section_type.

It is meant to be a dict key, but I'll just drop it and take a different
approach.

> > +        ("Error Severity",         4, "int", None),
> > +        ("Revision",               2, "int", None),
> > +        ("Validation Bits",        1, "int", None),
> > +        ("Flags",                  1, "int", None),
> > +        ("Error Data Length",      4, "int", None),
> > +        ("FRU Id",                16, "guid", None),
> > +        ("FRU Text",              20, "str", None),
> > +        ("Timestamp",              8, "bcd", None),
> > +    ]
> > +
> > +    def __init__(self, cper_data: bytearray):
> > +        """
> > +        Initializes a byte array, decoding it, printing results at the
> > +        screen.
> > +        """  
> 
> ...
> 
> > +
> > +        # Handle common types
> > +        cper = DecodeField(cper_data)
> > +
> > +        fields = {}
> > +        for name, size, ftype, var in self.common_fields:
> > +            val = cper.decode(name, size, ftype)
> > +
> > +            if ftype == "int":
> > +                try:
> > +                    val = int.from_bytes(val, byteorder='little')
> > +                except (ValueError, TypeError):
> > +                    val = 0
> > +
> > +            if var is not None:
> > +                fields[var] = val
> > +
> > +        if fields["raw_data_len"]:
> > +            cper.decode("Raw Data", fields["raw_data_len"],
> > +                        "int", pos=fields["raw_data_offset"])
> > +
> > +        if not fields["session_type"]:
> > +            return  
> 
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 09/13] scripts/ghes_inject: add a handler for PCIe bus error
  2026-01-21 13:32   ` Jonathan Cameron via qemu development
  2026-01-21 13:33     ` Jonathan Cameron via qemu development
  2026-01-21 16:26     ` Mauro Carvalho Chehab
@ 2026-01-22 16:42     ` Mauro Carvalho Chehab
  2 siblings, 0 replies; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-22 16:42 UTC (permalink / raw)
  To: Jonathan Cameron via qemu development
  Cc: Jonathan Cameron, Michael S Tsirkin, Shiju Jose, Igor Mammedov,
	Cleber Rosa, John Snow

On Wed, 21 Jan 2026 13:32:55 +0000
Jonathan Cameron via qemu development <qemu-devel@nongnu.org> wrote:

> On Wed, 21 Jan 2026 12:25:17 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 
> > Add a logic to do PCIe BUS error injection.
> > 
> > On Linux Kernel, despite CPER_SEC_PCI_X_BUS macro is defined for such
> > event, ghes.c doesn't implement support for it yet:
> > 
> > [16950.077494] {26}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > [16950.077866] {26}[Hardware Error]: event severity: recoverable
> > [16950.078118] {26}[Hardware Error]:  Error 0, type: recoverable
> > [16950.078444] {26}[Hardware Error]:   section type: unknown, c5753963-3b84-4095-bf78-eddad3f9c9dd
> > [16950.078800] {26}[Hardware Error]:   section length: 0x48
> > [16950.079069] {26}[Hardware Error]:   00000000: 00000000 00000000 00000000 00000000  ................
> > [16950.079442] {26}[Hardware Error]:   00000010: 00000001 00000000 00000000 00000000  ................
> > [16950.079811] {26}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> > [16950.080181] {26}[Hardware Error]:   00000030: 00000000 00000000 00000000 00000000  ................
> > [16950.080538] {26}[Hardware Error]:   00000040: 00000000 00000000                    ........
> > 
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>  
> 
> LGTM. Bit surprised Linux doesn't decode it but fair enough.
> Seems a bit unlikely it ever will given this seems not to cover PCIe
> which has it's own records.

Yeah, I misread the spec when I wrote it: this one is specific to
PCI/PCI-X, and not PCIe. That probably explains why it has not been
implemented in practice yet.

I'll rename it.

Still, it is good to test it, even though it is not implemented, as it
helps to check how Linux reacts to a GUID it doesn't know about.
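
To make it easier to compare what was injected with the raw dump Linux
prints for an unknown section type, something like the sketch below can
reproduce the offset/word layout seen in the log above. dump_section() is
a hypothetical helper, not part of ghes_inject.py; it assumes the dump
shows little-endian 32-bit words and it leaves out the ASCII column. The
payload is only a guess at what produced the log (0x48 zero bytes with
byte 0x10 set to 1).

```python
def dump_section(payload, width=16):
    """Return hex-dump lines for a section payload, as 32-bit words.

    Hypothetical helper: each line holds the offset followed by up to
    four little-endian words. A trailing chunk shorter than 4 bytes is
    still printed as one zero-padded word.
    """
    lines = []
    for off in range(0, len(payload), width):
        chunk = payload[off:off + width]
        words = [int.from_bytes(chunk[i:i + 4], "little")
                 for i in range(0, len(chunk), 4)]
        lines.append(f"{off:08x}: " + " ".join(f"{w:08x}" for w in words))
    return lines


# Assumed payload: section length 0x48, a single non-zero byte at 0x10
payload = bytearray(0x48)
payload[0x10] = 1

for line in dump_section(payload):
    print(line)
```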

> 
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/13] scripts/qmp_helper: add missing CXL UEFI GUID
  2026-01-22 15:08         ` Mauro Carvalho Chehab
@ 2026-01-22 17:13             ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron @ 2026-01-22 17:13 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Jonathan Cameron via qemu development, Michael S Tsirkin,
	Shiju Jose, Igor Mammedov, Cleber Rosa, John Snow, linux-cxl,
	John Groves

On Thu, 22 Jan 2026 16:08:09 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> On Thu, 22 Jan 2026 10:52:14 +0000
> Jonathan Cameron via qemu development <qemu-devel@nongnu.org> wrote:
> 
> > On Wed, 21 Jan 2026 16:45:58 +0100
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> >   
> > > On Wed, Jan 21, 2026 at 12:26:04PM +0000, Jonathan Cameron wrote:    
> > > > On Wed, 21 Jan 2026 12:25:10 +0100
> > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > > >       
> > > > > The UEFI 2.11 - N.2.14. CXL Component Events Section states that
> > > > > XL events are described at CXL specification 3.2:      
> > > > CXL
> > > >       
> > > > >         8.2.10.2.1 Event Records
> > > > > 
> > > > > Add the GUIDs defined here to fuzzy logic error injection code.      
> > > > 
> > > > +CC linux-cxl as more folk there who will be familiar with this
> > > > stuff.
> > > > 
> > > > Some of these won't be seen on a host. The same event
> > > > infrastructure is used for reporting on out of band interfaces
> > > > and some in band ones, but not ones that will turn up on the
> > > > mailboxes that firmware will be using to get info.      
> > > 
> > > Good to know, but UEFI 2.11 still mentions all of them as
> > > possible GUIDs:
> > > 
> > >     https://uefi.org/specs/UEFI/2.11/Apx_N_Common_Platform_Error_Record.html#cxl-component-events-section
> > > 
> > > So, the UEFI 2.11 doesn't explicitly state they won't de delivered
> > > to OSPM. Quite contrary, they're listed as valid values for CPER,
> > > even if, in practice, they won't.
> > > 
> > > This is just a small set of variables, that won't bring any major
> > > impact on the code. So, I prefer to keep them in sync with the spec.
> > > If they end removing the unused ones, we can update it in the future.
> > > 
> > > If you want, I can add a note at the next version with your
> > > comments about them.
> > >     
> > A note works for me.
> > 
> > As I understand it CPER records are getting adopted in other specs
> > so it may make sense to document them, even if they  aren't a possibility
> > via ACPI.
> > 
> > However I'm not seeing them in the spec link you point at.  
> > All I'm seeing is a cross reference the CXL spec Events Record Format.  
> 
> Heh, true, I got confused with another field. Yet, at CXL spec 3.2,
> it is said that:
> 
> 
> 	8.2.10.2 Events
> 	===============
> 
> 	This section defines the standard event record format that all CXL devices shall use
> 	when reporting events to the host.
> 
> 	...
> 
> 	Table 8-55. Common Event Record Format (Sheet 1 of 2)
> 
> 	------	--------	----------------------------------------------------------------------------
> 	Byte	Length		Description
> 	Offset	in Bytes
> 	------	--------	----------------------------------------------------------------------------
> 	00h	10h		Event Record Identifier: UUID representing the specific Event Record format.
> 				The following UUIDs are defined in this spec:
> 				• fbcd0a77-c260-417f-85a9-088b1621eba6 – General Media Event Record
> 				  (see Table 8-57)
> 				• 601dcbb3-9c06-4eab-b8af-4e9bfb5c9624 – DRAM Event Record (see
> 				  Table 8-58)
> 				• fe927475-dd59-4339-a586-79bab113b774 – Memory Module Event Record
> 				  (see Table 8-59)
> 				• e71f3a40-2d29-4092-8a39-4d1c966c7c65 - Memory Sparing Event Record
> 				  (see Table 8-60)
> 				• 77cf9271-9c02-470b-9fe4-bc7b75f2da97 – Physical Switch Event Record
> 				  (see Table 7-77)
> 				• 40d26425-3396-4c4d-a5da-3d47263af425 – Virtual Switch Event Record
> 				  (see Table 7-78)
> 				• 8dc44363-0c96-4710-b7bf-04bb99534c3f – MLD Port Event Record (see
> 				  Table 7-79)
> 				• ca95afa7-f183-4018-8c2f-95268e101a2a - Dynamic Capacity Event Record
> 				  (see Table 8-62)
> 	------	--------	----------------------------------------------------------------------------
> 
> 	...
> 
> So, CXL specs say they'll arrive at the host, and UEFI doesn't tell
> they can't arrive the OSPM.
> 

Fair point. Some of those we shouldn't see at a host. Something to tidy up
in the spec.

+CC John Groves to make it his problem ;)

Dynamic Capacity events go to the host (well, some of them anyway),
but never to firmware, so there is a blurry boundary.
Some values of the Event type for those state they are not sent to the host.
"This event shall only be reported to the FM".

J
> In any case, it is easier to just pick the entire set with 8 GUIDs
> and keep the scripts in sync with the specs than filtering what 
> shouldn't belong there and why, with the risk of eventually miss
> something.
> 
> I'll append the diff below to the relevant patches on this patchset.
> 
> Regards,
> Mauro
> 
> ---
> 
> diff --git a/scripts/ghes_decode.py b/scripts/ghes_decode.py
> index 6c7fdfe84e3a..7bac7dbd6b3a 100644
> --- a/scripts/ghes_decode.py
> +++ b/scripts/ghes_decode.py
> @@ -935,6 +935,10 @@ class DecodeCXLCompEvent():
>      """
>  
>      # GUIDs, as defined at CXL specification 3.2: 8.2.10.2.1 Event Records
> +    # on Table 8-55. Common Event Record Format
> +    #
> +    # Please notice that, in practice, not all those events will be passed
> +    # to OSPM. Some may be handled internally.
>      guids = [
>          ("General Media",              "fbcd0a77-c260-417f-85a9-088b1621eba6"),
>          ("DRAM",                       "601dcbb3-9c06-4eab-b8af-4e9bfb5c9624"),
> diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
> index 32baca17ce10..583f898f04ef 100755
> --- a/scripts/qmp_helper.py
> +++ b/scripts/qmp_helper.py
> @@ -778,10 +778,14 @@ class cper_guid:
>                           [0xA6, 0xA6, 0x88, 0xB7,
>                            0x28, 0xCF, 0x75, 0xD7])
>  
> +    # CXL GUIDs, as defined at CXL specification 3.2: 8.2.10.2.1 Event Records
> +    # on Table 8-55. Common Event Record Format
> +    #
> +    # Please notice that, in practice, not all those events will be passed
> +    # to OSPM. Some may be consumed internally
>      CPER_CXL_PROT_ERR = guid(0x80B9EFB4, 0x52B5, 0x4DE3,
>                               [0xA7, 0x77, 0x68, 0x78,
>                                0x4B, 0x77, 0x10, 0x48])
> -
>      CPER_CXL_EVT_GEN_MEDIA = guid(0xFBCD0A77, 0xC260, 0x417F,
>                                    [0x85, 0xA9, 0x08, 0x8B,
>                                     0x16, 0x21, 0xEB, 0xA6])
> 
> 
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic
  2026-01-21 15:56     ` Mauro Carvalho Chehab
@ 2026-01-23 16:16       ` Jonathan Cameron via qemu development
  2026-01-26 11:23         ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-01-23 16:16 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Michael S Tsirkin, Shiju Jose, qemu-devel, Igor Mammedov,
	Cleber Rosa, John Snow


> >   
> > > +        for i in range(0, attempts):
> > > +            try:
> > > +                obj = self.qmp_monitor.cmd_obj(msg)
> > > +
> > > +                if obj and "return" in obj and not obj["return"]:
> > > +                    break
> > > +
> > > +            except Exception as e:                     # pylint: disable=W0718
> > > +                print(f"Command: {command}")
> > > +                print(f"Failed to inject error: {e}.")
> > > +                obj = None
> > > +
> > > +            if attempts > 1:
> > > +                print(f"Error inject attempt {i + 1}/{attempts} failed.")
> > > +
> > > +            if i + 1 < attempts:
> > > +                sleep(0.1)  
> 
> ... and here, we sleep for 0.1 seconds.
> 
> > 
> > Do we care about a sleep at the end?  Feels like a micro optimization that
> > isn't needed.  
> 
> This is not a micro-optimization. It is more to ensure that we won't
> respin it too fast.
> 
> What happens is that the QMP interface asks the BIOS to send an async
> message to OSPM, clearing an ack register. When the OSPM reads the
> error, it writes 1 to the ack register.
> 
> If we send messages too fast, the logic in ghes.c will detect that
> the ack didn't happen, immediately returning an error code.
> 
> In such a case, we sleep for 100ms before trying again.
I was suggesting the opposite.  Just sleep one more time at the end
before timing out.
So instead of
	if i + 1 < attempts
		sleep(0.1)

simply
	sleep(0.1)



> 
> In practice, on my Ryzen 9 machines with QEMU emulating ARM,
> even under massive error injection, no retries happen 99% of
> the time. The worst-case scenario I got here is that the kernel
> sometimes got stuck and took between 5s and 10s to accept the
> error submission.
> 
> >   
> > > +
> > > +        if not obj:
> > >              return None  
> > 
> >   
> 




* Re: [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic
  2026-01-23 16:16       ` Jonathan Cameron via qemu development
@ 2026-01-26 11:23         ` Mauro Carvalho Chehab
  2026-01-26 11:29           ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-26 11:23 UTC (permalink / raw)
  To: Jonathan Cameron via qemu development
  Cc: Jonathan Cameron, Michael S Tsirkin, Shiju Jose, Igor Mammedov,
	Cleber Rosa, John Snow

On Fri, 23 Jan 2026 16:16:03 +0000
Jonathan Cameron via qemu development <qemu-devel@nongnu.org> wrote:

> > >     
> > > > +        for i in range(0, attempts):
> > > > +            try:
> > > > +                obj = self.qmp_monitor.cmd_obj(msg)
> > > > +
> > > > +                if obj and "return" in obj and not obj["return"]:
> > > > +                    break
> > > > +
> > > > +            except Exception as e:                     # pylint: disable=W0718
> > > > +                print(f"Command: {command}")
> > > > +                print(f"Failed to inject error: {e}.")
> > > > +                obj = None
> > > > +
> > > > +            if attempts > 1:
> > > > +                print(f"Error inject attempt {i + 1}/{attempts} failed.")
> > > > +
> > > > +            if i + 1 < attempts:
> > > > +                sleep(0.1)    
> > 
> > ... and here, we sleep for 0.1 seconds.
> >   
> > > 
> > > Do we care about a sleep at the end?  Feels like a micro optimization that
> > > isn't needed.    
> > 
> > This is not a micro-optimization. It is more to ensure that we won't
> > respin it too fast.
> > 
> > What happens is that the QMP interface asks the BIOS to send an async
> > message to OSPM, clearing an ack register. When the OSPM reads the
> > error, it writes 1 to the ack register.
> > 
> > If we send messages too fast, the logic in ghes.c will detect that
> > the ack didn't happen, immediately returning an error code.
> > 
> > In such a case, we sleep for 100ms before trying again.
> I was suggesting the opposite.  Just sleep one more time at the end
> before timing out.
> So instead of
> 	if i + 1 < attempts
> 		sleep(0.1)
> 
> simply
> 	sleep(0.1)

If one writes an external loop calling fuzzy with different parameters,
like:

	for i in $(seq 1 360000); do
            scripts/ghes_inject.py fuzzy -T proc-arm;
            scripts/ghes_inject.py fuzzy -T firmware-error;
        done

The extra unneeded sleep would waste 10 hours doing nothing.
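
As a rough sanity check of that figure, here is a minimal sketch. It
assumes the worst case, where every call to ghes_inject.py pays exactly
one unconditional 100ms trailing sleep; the function name is only
illustrative. 360000 such sleeps add up to 10 hours, and since the loop
above runs two commands per iteration, its worst case would be twice that.

```python
# Back-of-the-envelope estimate of time lost to a trailing sleep.
# Assumption: each invocation ends with one unconditional sleep.

SLEEP_MS = 100  # the 0.1 s delay discussed in this thread


def wasted_hours(invocations, sleep_ms=SLEEP_MS):
    """Return the hours spent sleeping across all invocations."""
    return invocations * sleep_ms / 3_600_000  # milliseconds per hour


one_per_iteration = wasted_hours(360000)
two_per_iteration = wasted_hours(2 * 360000)
print(f"{one_per_iteration:.1f} h, {two_per_iteration:.1f} h")
```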

Regards,
Mauro



* Re: [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic
  2026-01-26 11:23         ` Mauro Carvalho Chehab
@ 2026-01-26 11:29           ` Mauro Carvalho Chehab
  2026-01-26 12:27             ` Jonathan Cameron via qemu development
  0 siblings, 1 reply; 45+ messages in thread
From: Mauro Carvalho Chehab @ 2026-01-26 11:29 UTC (permalink / raw)
  To: Jonathan Cameron via qemu development
  Cc: Jonathan Cameron, Michael S Tsirkin, Shiju Jose, Igor Mammedov,
	Cleber Rosa, John Snow, Mauro Carvalho Chehab

On Mon, 26 Jan 2026 12:23:30 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> (by way of Mauro Carvalho Chehab <mchehab+huawei@kernel.org>) wrote:

> On Fri, 23 Jan 2026 16:16:03 +0000
> Jonathan Cameron via qemu development <qemu-devel@nongnu.org> wrote:
> 
> > > >     
> > > > > +        for i in range(0, attempts):
> > > > > +            try:
> > > > > +                obj = self.qmp_monitor.cmd_obj(msg)
> > > > > +
> > > > > +                if obj and "return" in obj and not obj["return"]:
> > > > > +                    break
> > > > > +
> > > > > +            except Exception as e:                     # pylint: disable=W0718
> > > > > +                print(f"Command: {command}")
> > > > > +                print(f"Failed to inject error: {e}.")
> > > > > +                obj = None
> > > > > +
> > > > > +            if attempts > 1:
> > > > > +                print(f"Error inject attempt {i + 1}/{attempts} failed.")
> > > > > +
> > > > > +            if i + 1 < attempts:
> > > > > +                sleep(0.1)    
> > > 
> > > ... and here, we sleep for 0.1 seconds.
> > >   
> > > > 
> > > > Do we care about a sleep at the end?  Feels like a micro optimization that
> > > > isn't needed.    
> > > 
> > > This is not a micro-optimization. It is more to ensure that we won't
> > > respin it too fast.
> > > 
> > > What happens is that the QMP interface asks the BIOS to send an async
> > > message to OSPM, clearing an ack register. When the OSPM reads the
> > > error, it writes 1 to the ack register.
> > > 
> > > If we send messages too fast, the logic in ghes.c will detect that
> > > the ack didn't happen, immediately returning an error code.
> > > 
> > > In such a case, we sleep for 100ms before trying again.
> > I was suggesting the opposite.  Just sleep one more time at the end
> > before timing out.
> > So instead of
> > 	if i + 1 < attempts
> > 		sleep(0.1)
> > 
> > simply
> > 	sleep(0.1)
> 
> If one writes an external loop calling fuzzy with different parameters,
> like:
> 
> 	for i in $(seq 1 360000); do
>             scripts/ghes_inject.py fuzzy -T proc-arm;
>             scripts/ghes_inject.py fuzzy -T firmware-error;
>         done
> 
> The extra unneeded sleep would waste 10 hours doing nothing.

Btw, the same applies when using the -c parameter:

             scripts/ghes_inject.py fuzzy -T proc-arm -c 360000

The goal here is to optimize in a way that we could one day have a
CI running lots of tests in a reasonable time to detect regressions
at QEMU + Linux Kernel + rasdaemon.

So, we don't want unneeded delays. We only need to sleep if a
retry attempt failed and it will be retrying again.
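
A minimal sketch of that policy, not the actual qmp_helper code: the
function name, the `send` callable and the injectable `sleep` parameter
are assumptions made for illustration. Only the success check follows
the patch quoted above.

```python
import time


def send_with_retry(send, attempts=3, delay=0.1, sleep=time.sleep):
    """Call send() up to `attempts` times, sleeping only between retries.

    Returns the reply on success, or None if every attempt failed.
    """
    for i in range(attempts):
        try:
            obj = send()
            # Same check as in the quoted patch: success is a reply
            # that carries an empty "return" value.
            if obj and "return" in obj and not obj["return"]:
                return obj
        except Exception as e:  # broad on purpose, as in the quoted patch
            print(f"Failed to inject error: {e}.")

        # No sleep after the last attempt: nothing will be retried.
        if i + 1 < attempts:
            sleep(delay)

    return None
```

With this shape, a successful first attempt never sleeps, and a run
that exhausts all attempts sleeps attempts - 1 times.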

Regards,



* Re: [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic
  2026-01-26 11:29           ` Mauro Carvalho Chehab
@ 2026-01-26 12:27             ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-01-26 12:27 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Jonathan Cameron via qemu development, Michael S Tsirkin,
	Shiju Jose, Igor Mammedov, Cleber Rosa, John Snow

On Mon, 26 Jan 2026 12:29:32 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> On Mon, 26 Jan 2026 12:23:30 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> (by way of Mauro Carvalho Chehab <mchehab+huawei@kernel.org>) wrote:
> 
> > On Fri, 23 Jan 2026 16:16:03 +0000
> > Jonathan Cameron via qemu development <qemu-devel@nongnu.org> wrote:
> >   
> > > > >       
> > > > > > +        for i in range(0, attempts):
> > > > > > +            try:
> > > > > > +                obj = self.qmp_monitor.cmd_obj(msg)
> > > > > > +
> > > > > > +                if obj and "return" in obj and not obj["return"]:
> > > > > > +                    break
> > > > > > +
> > > > > > +            except Exception as e:                     # pylint: disable=W0718
> > > > > > +                print(f"Command: {command}")
> > > > > > +                print(f"Failed to inject error: {e}.")
> > > > > > +                obj = None
> > > > > > +
> > > > > > +            if attempts > 1:
> > > > > > +                print(f"Error inject attempt {i + 1}/{attempts} failed.")
> > > > > > +
> > > > > > +            if i + 1 < attempts:
> > > > > > +                sleep(0.1)      
> > > > 
> > > > ... and here, we sleep for 0.1 seconds.
> > > >     
> > > > > 
> > > > > Do we care about a sleep at the end?  Feels like a micro optimization that
> > > > > isn't needed.      
> > > > 
> > > > This is not a micro-optimization. It is more to ensure that we won't
> > > > respin it too fast.
> > > > 
> > > > What happens is that the QMP interface asks the BIOS to send an async
> > > > message to OSPM, clearing an ack register. When the OSPM reads the
> > > > error, it writes 1 to the ack register.
> > > > 
> > > > If we send messages too fast, the logic in ghes.c will detect that
> > > > the ack didn't happen, immediately returning an error code.
> > > > 
> > > > In such a case, we sleep for 100ms before trying again.
> > > I was suggesting the opposite.  Just sleep one more time at the end
> > > before timing out.
> > > So instead of
> > > 	if i + 1 < attempts
> > > 		sleep(0.1)
> > > 
> > > simply
> > > 	sleep(0.1)  
> > 
> > If one writes an external loop calling fuzzy with different parameters,
> > like:
> > 
> > 	for i in $(seq 1 360000); do
> >             scripts/ghes_inject.py fuzzy -T proc-arm;
> >             scripts/ghes_inject.py fuzzy -T firmware-error;
> >         done
> > 
> > The extra unneeded sleep would waste 10 hours doing nothing.

True if it fails every time, which you were suggesting was very rare. 

Anyhow, I really don't mind that much; it just seemed like a tiny
bit over-engineered for a rare case.
 
> 
> Btw, the same applies when using the -c parameter:
> 
>              scripts/ghes_inject.py fuzzy -T proc-arm -c 360000
> 
> The goal here is to optimize in a way that we could one day have a
> CI running lots of tests in a reasonable time to detect regressions
> at QEMU + Linux Kernel + rasdaemon.
> 
> So, we don't want unneeded delays. We only need to sleep if a
> retry attempt failed and it will be retrying again.
> 
> Regards,
> 




* Re: [PATCH 09/13] scripts/ghes_inject: add a handler for PCIe bus error
  2026-01-21 13:33     ` Jonathan Cameron via qemu development
@ 2026-02-06 12:52       ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-02-06 12:52 UTC (permalink / raw)
  To: Jonathan Cameron via qemu development
  Cc: Jonathan Cameron, Mauro Carvalho Chehab, Michael S Tsirkin,
	Shiju Jose, Igor Mammedov, Cleber Rosa, John Snow

On Wed, 21 Jan 2026 13:33:55 +0000
Jonathan Cameron via qemu development <qemu-devel@nongnu.org> wrote:

> On Wed, 21 Jan 2026 13:32:55 +0000
> Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> 
> > On Wed, 21 Jan 2026 12:25:17 +0100
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> >   
> > > Add a logic to do PCIe BUS error injection.
> > > 
> > > In the Linux kernel, although the CPER_SEC_PCI_X_BUS macro is defined
> > > for such an event, ghes.c doesn't implement support for it yet:
> > > 
> > > [16950.077494] {26}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > > [16950.077866] {26}[Hardware Error]: event severity: recoverable
> > > [16950.078118] {26}[Hardware Error]:  Error 0, type: recoverable
> > > [16950.078444] {26}[Hardware Error]:   section type: unknown, c5753963-3b84-4095-bf78-eddad3f9c9dd
> > > [16950.078800] {26}[Hardware Error]:   section length: 0x48
> > > [16950.079069] {26}[Hardware Error]:   00000000: 00000000 00000000 00000000 00000000  ................
> > > [16950.079442] {26}[Hardware Error]:   00000010: 00000001 00000000 00000000 00000000  ................
> > > [16950.079811] {26}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> > > [16950.080181] {26}[Hardware Error]:   00000030: 00000000 00000000 00000000 00000000  ................
> > > [16950.080538] {26}[Hardware Error]:   00000040: 00000000 00000000                    ........
> > > 
> > > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>    
> > 
> > LGTM. Bit surprised Linux doesn't decode it but fair enough.
> > Seems a bit unlikely it ever will given this seems not to cover PCIe
> > which has its own records.
> >   
> Just noticed your patch description. This is PCI/PCI-X errors, not PCIe.
Seems this was stuck in my outbox. Please ignore it, as I think you
fixed this long ago.

J
> 
> > Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>  
> 
> 




end of thread, other threads:[~2026-02-06 12:53 UTC | newest]

Thread overview: 45+ messages
2026-01-21 11:25 [PATCH 00/13] Add more commands to scripts/ghes_inject.py Mauro Carvalho Chehab
2026-01-21 11:25 ` [PATCH 01/13] scripts/qmp_helper: add a return code to send_cper Mauro Carvalho Chehab
2026-01-21 12:08   ` Jonathan Cameron via qemu development
2026-01-21 11:25 ` [PATCH 02/13] scripts/qmp_helper: add missing CXL UEFI GUID Mauro Carvalho Chehab
2026-01-21 12:26   ` Jonathan Cameron
2026-01-21 12:26     ` Jonathan Cameron via qemu development
2026-01-21 15:45     ` Mauro Carvalho Chehab
2026-01-22 10:52       ` Jonathan Cameron
2026-01-22 10:52         ` Jonathan Cameron via qemu development
2026-01-22 15:08         ` Mauro Carvalho Chehab
2026-01-22 17:13           ` Jonathan Cameron
2026-01-22 17:13             ` Jonathan Cameron via qemu development
2026-01-21 11:25 ` [PATCH 03/13] scripts/qmp_helper: add support for FRU Memory Poison Mauro Carvalho Chehab
2026-01-21 12:27   ` Jonathan Cameron via qemu development
2026-01-21 11:25 ` [PATCH 04/13] scripts/qmp_helper: make send_cper() more generic Mauro Carvalho Chehab
2026-01-21 12:30   ` Jonathan Cameron via qemu development
2026-01-21 11:25 ` [PATCH 05/13] scripts/qmp_helper: fix raw_data logic Mauro Carvalho Chehab
2026-01-21 12:35   ` Jonathan Cameron via qemu development
2026-01-21 11:25 ` [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic Mauro Carvalho Chehab
2026-01-21 12:39   ` Jonathan Cameron via qemu development
2026-01-21 15:56     ` Mauro Carvalho Chehab
2026-01-23 16:16       ` Jonathan Cameron via qemu development
2026-01-26 11:23         ` Mauro Carvalho Chehab
2026-01-26 11:29           ` Mauro Carvalho Chehab
2026-01-26 12:27             ` Jonathan Cameron via qemu development
2026-01-21 11:25 ` [PATCH 07/13] scripts/ghes_inject: add a logic to decode CPER Mauro Carvalho Chehab
2026-01-21 13:27   ` Jonathan Cameron via qemu development
2026-01-21 16:24     ` Mauro Carvalho Chehab
2026-01-22 16:23     ` Mauro Carvalho Chehab
2026-01-21 11:25 ` [PATCH 08/13] scripts/ghes_inject: exit 1 if command was not sent Mauro Carvalho Chehab
2026-01-21 13:28   ` Jonathan Cameron via qemu development
2026-01-21 11:25 ` [PATCH 09/13] scripts/ghes_inject: add a handler for PCIe bus error Mauro Carvalho Chehab
2026-01-21 13:32   ` Jonathan Cameron via qemu development
2026-01-21 13:33     ` Jonathan Cameron via qemu development
2026-02-06 12:52       ` Jonathan Cameron via qemu development
2026-01-21 16:26     ` Mauro Carvalho Chehab
2026-01-22 16:42     ` Mauro Carvalho Chehab
2026-01-21 11:25 ` [PATCH 10/13] scripts/ghes_inject: add support for fuzzy logic testing Mauro Carvalho Chehab
2026-01-21 13:37   ` Jonathan Cameron via qemu development
2026-01-21 16:35     ` Mauro Carvalho Chehab
2026-01-21 11:25 ` [PATCH 11/13] scripts/ghes_inject: add a raw error inject command Mauro Carvalho Chehab
2026-01-21 11:25 ` [PATCH 12/13] scripts/ghes_inject: print help if no command specified Mauro Carvalho Chehab
2026-01-21 13:42   ` Jonathan Cameron via qemu development
2026-01-21 11:25 ` [PATCH 13/13] scripts/ghes_inject: improve help message Mauro Carvalho Chehab
2026-01-21 13:43   ` Jonathan Cameron via qemu development
