* [PATCH 0/7] Implement MPIPL for PowerNV
@ 2025-02-17 7:19 Aditya Gupta
2025-02-17 7:19 ` [PATCH 1/7] hw/ppc: Log S0/S1 Interrupt triggers by OPAL Aditya Gupta
` (7 more replies)
0 siblings, 8 replies; 24+ messages in thread
From: Aditya Gupta @ 2025-02-17 7:19 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
Overview
=========
Implemented MPIPL (Memory Preserving IPL, aka fadump) on PowerNV machine
in QEMU.
Note: It's okay if this isn't merged as there might be less users. Sending
for archieval purpose, as the patches can be referred for how fadump/mpipl
can be implemented in baremetal/PowerNV/any other arch QEMU.
Fadump is an alternative dump mechanism to kdump, in which we the firmware
does a memory preserving boot, and the second/crashkernel is booted fresh
like a normal system reset, instead of the crashed kernel loading the
second/crashkernel in case of kdump.
MPIPL in PowerNV, is similar to fadump in Pseries. The idea is same, memory
preserving, where in PowerNV we are assisted by SBE (Self Boot Engine) &
Hostboot, while in Pseries we are assisted by PHyp (Power Hypervisor)
For implementing in baremetal/powernv QEMU, we need to export a
"ibm,opal/dump" node in the device tree, to tell the kernel we support
MPIPL
Once kernel sees the support, and "fadump=on" is passed on commandline,
kernel will register memory regions to preserve with Skiboot.
Kernel sends these data using OPAL calls, after which skiboot/opal saves
the memory region details to MDST and MDDT tables (S-source, D-destination)
Skiboot then triggers the "S0 Interrupt" to the SBE (Self Boot Engine),
along with OPAL's relocated base address.
SBE then stops all core clocks, and only does particular ISteps for a
memory preserving boot.
Then, hostboot comes up, and with help of the relocated base address, it
accesses MDST & MDDT tables (S-source and D-destination), and preserves the
memory regions according to the data in these tables.
And after preserving, it writes the preserved memory region details to MDRT
tables (R-Result), for the kernel to know where/whether a memory region is
preserved.
Both SBE's and hostboot responsiblities have in implemented in the SBE code
in QEMU.
Then in the second kernel/crashkernel boot, OPAL passes the "mpipl-boot"
property for the kernel to know that a dump is active, which kernel then
exports in /proc/vmcore
Git Tree for Testing
====================
https://github.com/adi-g15-ibm/qemu/tree/fadump-powernv-v1
Known Issues
============
* CPU save area has not been implemented
Aditya Gupta (7):
hw/ppc: Log S0/S1 Interrupt triggers by OPAL
hw/ppc: Implement S0 SBE interrupt as cpu_pause then host reset
hw/ppc: Handle stash command in PowerNV SBE
hw/ppc: Add MDST/MDDT/MDRT table structures and offsets
hw/ppc: Preserve Memory Regions as per MDST/MDDT tables
hw/ppc: [WIP] Add Processor Dump Area offsets in Pnv SBE
hw/ppc: Implement MPIPL in PowerNV
hw/ppc/pnv.c | 49 +++++++
hw/ppc/pnv_sbe.c | 295 +++++++++++++++++++++++++++++++++++++--
include/hw/ppc/pnv_sbe.h | 7 +
3 files changed, 342 insertions(+), 9 deletions(-)
--
2.48.1
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 1/7] hw/ppc: Log S0/S1 Interrupt triggers by OPAL
2025-02-17 7:19 [PATCH 0/7] Implement MPIPL for PowerNV Aditya Gupta
@ 2025-02-17 7:19 ` Aditya Gupta
2025-03-11 4:38 ` Harsh Prateek Bora
2025-02-17 7:19 ` [PATCH 2/7] hw/ppc: Implement S0 SBE interrupt as cpu_pause then host reset Aditya Gupta
` (6 subsequent siblings)
7 siblings, 1 reply; 24+ messages in thread
From: Aditya Gupta @ 2025-02-17 7:19 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
During MPIPL (aka fadump), OPAL triggers the S0 SBE interrupt to trigger
MPIPL.
Currently QEMU treats it as "Unimplemented", handle the interrupts by
just logging that the interrupt happened.
Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
---
hw/ppc/pnv_sbe.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
index 74cee4eea7ad..62c94a04a2df 100644
--- a/hw/ppc/pnv_sbe.c
+++ b/hw/ppc/pnv_sbe.c
@@ -109,6 +109,19 @@ static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
trace_pnv_sbe_xscom_ctrl_write(addr, val);
switch (offset) {
+ case SBE_CONTROL_REG_RW:
+ switch (val) {
+ case SBE_CONTROL_REG_S0:
+ qemu_log_mask(LOG_UNIMP, "SBE: S0 Interrupt triggered\n");
+ break;
+ case SBE_CONTROL_REG_S1:
+ qemu_log_mask(LOG_UNIMP, "SBE: S1 Interrupt triggered\n");
+ break;
+ default:
+ qemu_log_mask(LOG_UNIMP, "SBE Unimplemented register: Ox%"
+ HWADDR_PRIx "\n", addr >> 3);
+ }
+ break;
default:
qemu_log_mask(LOG_UNIMP, "SBE Unimplemented register: Ox%"
HWADDR_PRIx "\n", addr >> 3);
--
2.48.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 2/7] hw/ppc: Implement S0 SBE interrupt as cpu_pause then host reset
2025-02-17 7:19 [PATCH 0/7] Implement MPIPL for PowerNV Aditya Gupta
2025-02-17 7:19 ` [PATCH 1/7] hw/ppc: Log S0/S1 Interrupt triggers by OPAL Aditya Gupta
@ 2025-02-17 7:19 ` Aditya Gupta
2025-03-11 4:45 ` Harsh Prateek Bora
2025-02-17 7:19 ` [PATCH 3/7] hw/ppc: Handle stash command in PowerNV SBE Aditya Gupta
` (5 subsequent siblings)
7 siblings, 1 reply; 24+ messages in thread
From: Aditya Gupta @ 2025-02-17 7:19 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
SBE's implementation of S0 seems to be basically "stop all clocks" and
then "host reset"
Nearest equivalent to the stop clocks seems to be 'pause_all_vcpus' in
QEMU,
Then reset the host, which is 'SHUTDOWN_CAUSE_GUEST_RESET' in QEMU.
Implement the S0 interrupt as pause_vcpus + guest_reset
See 'stopClocksS0' in SBE source code for more information.
Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
---
hw/ppc/pnv_sbe.c | 50 +++++++++++++++++++++++++++++++++++++++---------
1 file changed, 41 insertions(+), 9 deletions(-)
diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
index 62c94a04a2df..a6bf13650f2d 100644
--- a/hw/ppc/pnv_sbe.c
+++ b/hw/ppc/pnv_sbe.c
@@ -21,6 +21,8 @@
#include "qapi/error.h"
#include "qemu/log.h"
#include "qemu/module.h"
+#include "system/cpus.h"
+#include "system/runstate.h"
#include "hw/irq.h"
#include "hw/qdev-properties.h"
#include "hw/ppc/pnv.h"
@@ -80,6 +82,15 @@
#define SBE_CONTROL_REG_S0 PPC_BIT(14)
#define SBE_CONTROL_REG_S1 PPC_BIT(15)
+static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
+{
+ val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW do? */
+ sbe->host_doorbell = val;
+
+ trace_pnv_sbe_reg_set_host_doorbell(val);
+ qemu_set_irq(sbe->psi_irq, !!val);
+}
+
struct sbe_msg {
uint64_t reg[4];
};
@@ -104,6 +115,7 @@ static uint64_t pnv_sbe_power9_xscom_ctrl_read(void *opaque, hwaddr addr,
static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
uint64_t val, unsigned size)
{
+ PnvSBE *sbe = opaque;
uint32_t offset = addr >> 3;
trace_pnv_sbe_xscom_ctrl_write(addr, val);
@@ -113,6 +125,35 @@ static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
switch (val) {
case SBE_CONTROL_REG_S0:
qemu_log_mask(LOG_UNIMP, "SBE: S0 Interrupt triggered\n");
+
+ pnv_sbe_set_host_doorbell(sbe, sbe->host_doorbell | SBE_HOST_RESPONSE_MASK);
+
+ /*
+ * Looks like, SBE stops clocks for all cores in S0.
+ * See 'stopClocksS0' in SBE source code.
+ * Nearest equivalent in QEMU seems to be 'pause_all_vcpus'
+ */
+ pause_all_vcpus();
+
+ /*
+ * TODO: Pass `mpipl` node in device tree to signify next
+ * boot is an MPIPL boot
+ */
+
+ /* Then do a guest reset */
+ /*
+ * Requirement:
+ * This guest reset should not clear the memory (which is
+ * the case when this is merged)
+ */
+ qemu_system_reset_request(SHUTDOWN_CAUSE_GUEST_RESET);
+
+ /*
+ * XXX: Does SBE really do system reset or only stop
+ * clocks ? OPAL seems to think that control will not come
+ * to it after it has triggered S0 interrupt.
+ */
+
break;
case SBE_CONTROL_REG_S1:
qemu_log_mask(LOG_UNIMP, "SBE: S1 Interrupt triggered\n");
@@ -138,15 +179,6 @@ static const MemoryRegionOps pnv_sbe_power9_xscom_ctrl_ops = {
.endianness = DEVICE_BIG_ENDIAN,
};
-static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
-{
- val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW do? */
- sbe->host_doorbell = val;
-
- trace_pnv_sbe_reg_set_host_doorbell(val);
- qemu_set_irq(sbe->psi_irq, !!val);
-}
-
/* SBE Target Type */
#define SBE_TARGET_TYPE_PROC 0x00
#define SBE_TARGET_TYPE_EX 0x01
--
2.48.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 3/7] hw/ppc: Handle stash command in PowerNV SBE
2025-02-17 7:19 [PATCH 0/7] Implement MPIPL for PowerNV Aditya Gupta
2025-02-17 7:19 ` [PATCH 1/7] hw/ppc: Log S0/S1 Interrupt triggers by OPAL Aditya Gupta
2025-02-17 7:19 ` [PATCH 2/7] hw/ppc: Implement S0 SBE interrupt as cpu_pause then host reset Aditya Gupta
@ 2025-02-17 7:19 ` Aditya Gupta
2025-03-11 4:50 ` Harsh Prateek Bora
2025-02-17 7:19 ` [PATCH 4/7] hw/ppc: Add MDST/MDDT/MDRT table structures and offsets Aditya Gupta
` (4 subsequent siblings)
7 siblings, 1 reply; 24+ messages in thread
From: Aditya Gupta @ 2025-02-17 7:19 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
Earlier since the SBE_CMD_STASH_MPIPL_CONFIG command was not handled, so
skiboot used to not get any response from SBE:
[ 106.350742821,3] SBE: Message timeout [chip id = 0], cmd = d7, subcmd = 7
[ 106.352067746,3] SBE: Failed to send stash MPIPL config [chip id = 0x0, rc = 254]
Fix this by handling the command in PowerNV SBE, and sending a response so
skiboot knows SBE has handled the STASH command
The stashed skiboot base is later used to access the relocated MDST/MDDT
tables when MPIPL is implemented.
The purpose of stashing relocated base address is explained in following
skiboot commit:
author Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Fri Jul 12 16:47:51 2019 +0530
committer Oliver O'Halloran <oohall@gmail.com> Thu Aug 15 17:53:39 2019 +1000
SBE: Send OPAL relocated base address to SBE
OPAL relocates itself during boot. During memory preserving IPL hostboot needs
to access relocated OPAL base address to get MDST, MDDT tables. Hence send
relocated base address to SBE via 'stash MPIPL config' chip-op. During next
IPL SBE will send stashed data to hostboot... so that hostboot can access
these data.
Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
---
hw/ppc/pnv_sbe.c | 25 +++++++++++++++++++++++++
include/hw/ppc/pnv_sbe.h | 3 +++
2 files changed, 28 insertions(+)
diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
index a6bf13650f2d..79818177fc36 100644
--- a/hw/ppc/pnv_sbe.c
+++ b/hw/ppc/pnv_sbe.c
@@ -82,6 +82,8 @@
#define SBE_CONTROL_REG_S0 PPC_BIT(14)
#define SBE_CONTROL_REG_S1 PPC_BIT(15)
+static uint64_t mpipl_skiboot_base = 0x30000000 /*default SKIBOOT_BASE*/;
+
static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
{
val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW do? */
@@ -281,6 +283,29 @@ static void do_sbe_msg(PnvSBE *sbe)
timer_del(sbe->timer);
}
break;
+ case SBE_CMD_STASH_MPIPL_CONFIG:
+ /* key = sbe->mbox[1] */
+ switch (sbe->mbox[1]) {
+ case SBE_STASH_KEY_SKIBOOT_BASE:
+ mpipl_skiboot_base = sbe->mbox[2];
+ qemu_log_mask(LOG_UNIMP,
+ "Stashing skiboot base: 0x%lx\n", mpipl_skiboot_base);
+
+ /*
+ * Set the response register.
+ *
+ * Currently setting the same sequence number in
+ * response as we got in the request.
+ */
+ sbe->mbox[4] = sbe->mbox[0]; /* sequence number */
+ pnv_sbe_set_host_doorbell(sbe,
+ sbe->host_doorbell | SBE_HOST_RESPONSE_WAITING);
+
+ break;
+ default:
+ qemu_log_mask(LOG_UNIMP, "SBE Unimplemented command: 0x%x\n", cmd);
+ }
+ break;
default:
qemu_log_mask(LOG_UNIMP, "SBE Unimplemented command: 0x%x\n", cmd);
}
diff --git a/include/hw/ppc/pnv_sbe.h b/include/hw/ppc/pnv_sbe.h
index b6b378ad14c7..f6cbcf990ed9 100644
--- a/include/hw/ppc/pnv_sbe.h
+++ b/include/hw/ppc/pnv_sbe.h
@@ -53,4 +53,7 @@ struct PnvSBEClass {
const MemoryRegionOps *xscom_mbox_ops;
};
+/* Helper to access stashed SKIBOOT_BASE */
+bool pnv_sbe_mpipl_skiboot_base(void);
+
#endif /* PPC_PNV_SBE_H */
--
2.48.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 4/7] hw/ppc: Add MDST/MDDT/MDRT table structures and offsets
2025-02-17 7:19 [PATCH 0/7] Implement MPIPL for PowerNV Aditya Gupta
` (2 preceding siblings ...)
2025-02-17 7:19 ` [PATCH 3/7] hw/ppc: Handle stash command in PowerNV SBE Aditya Gupta
@ 2025-02-17 7:19 ` Aditya Gupta
2025-03-11 5:11 ` Harsh Prateek Bora
2025-02-17 7:19 ` [PATCH 5/7] hw/ppc: Preserve Memory Regions as per MDST/MDDT tables Aditya Gupta
` (3 subsequent siblings)
7 siblings, 1 reply; 24+ messages in thread
From: Aditya Gupta @ 2025-02-17 7:19 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
Add the MDST, MDDT, MDRT tables offsets and structures as per current
skiboot upstream:
commit bc7b85db1e7e ("opal-ci: Remove centos7")
These structures will be later populated when preserving memory regions
for MPIPL
Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
---
hw/ppc/pnv_sbe.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 113 insertions(+)
diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
index 79818177fc36..361a3854307d 100644
--- a/hw/ppc/pnv_sbe.c
+++ b/hw/ppc/pnv_sbe.c
@@ -84,6 +84,119 @@
static uint64_t mpipl_skiboot_base = 0x30000000 /*default SKIBOOT_BASE*/;
+/* Following offsets are copied from Skiboot source code */
+/* Use 768 bytes for SPIRAH */
+#define SPIRAH_OFF 0x00010000
+#define SPIRAH_SIZE 0x300
+
+/* Use 256 bytes for processor dump area */
+#define PROC_DUMP_AREA_OFF (SPIRAH_OFF + SPIRAH_SIZE)
+#define PROC_DUMP_AREA_SIZE 0x100
+
+#define PROCIN_OFF (PROC_DUMP_AREA_OFF + PROC_DUMP_AREA_SIZE)
+#define PROCIN_SIZE 0x800
+
+/* Offsets of MDST and MDDT tables from skiboot base */
+#define MDST_TABLE_OFF (PROCIN_OFF + PROCIN_SIZE)
+#define MDST_TABLE_SIZE 0x400
+
+#define MDDT_TABLE_OFF (MDST_TABLE_OFF + MDST_TABLE_SIZE)
+#define MDDT_TABLE_SIZE 0x400
+
+#define CPU_CTL_OFF (MDDT_TABLE_OFF + MDDT_TABLE_SIZE)
+#define CPU_CTL_SIZE 0x2000
+
+/* MPIPL reserved regions (offset by skiboot_base to access) */
+#define MDST_TABLE_BASE (mpipl_skiboot_base + MDST_TABLE_OFF)
+#define MDDT_TABLE_BASE (mpipl_skiboot_base + MDDT_TABLE_OFF)
+#define PROC_DUMP_AREA_BASE (mpipl_skiboot_base + PROC_DUMP_AREA_OFF)
+
+#define __packed __attribute__((packed))
+
+/* Metadata to capture before triggering MPIPL */
+struct mpipl_metadata {
+ /* Crashing PIR is required to create OPAL dump */
+ uint32_t crashing_pir;
+ /* Kernel expects OPAL to presrve tag and pass it back via OPAL API */
+ uint64_t kernel_tag;
+ /* Post MPIPL kernel boot memory size */
+ uint64_t boot_mem_size;
+} __packed;
+
+/* Structure version */
+#define OPAL_MPIPL_VERSION 0x01
+
+/* Preserved memory details */
+struct opal_mpipl_region {
+ __be64 src;
+ __be64 dest;
+ __be64 size;
+};
+
+struct opal_mpipl_fadump {
+ uint8_t version;
+ uint8_t reserved[7];
+ __be32 crashing_pir; /* OPAL crashing CPU PIR */
+ __be32 cpu_data_version;
+ __be32 cpu_data_size;
+ __be32 region_cnt;
+ struct opal_mpipl_region *region;
+};
+
+/*
+ * This is our dump result table after MPIPL. Hostboot will write to this
+ * memory after moving memory content from source to destination memory.
+ */
+#define MDRT_TABLE_BASE (mpipl_skiboot_base + 0x01c00000)
+#define MDRT_TABLE_SIZE 0x00008000
+
+/*
+ * This is our dump metadata area. We will use this memory to save metadata
+ * (like crashing CPU details, payload tags) before triggering MPIPL.
+ */
+#define DUMP_METADATA_AREA_BASE (mpipl_skiboot_base + 0x01c08000)
+#define DUMP_METADATA_AREA_SIZE 0x8000
+
+/*
+ * Memory Dump Source Table
+ *
+ * Format of this table is same as Memory Dump Source Table (MDST)
+ * defined in HDAT spec.
+ */
+struct mdst_table {
+ __be64 addr;
+ uint8_t data_region; /* DUMP_REGION_* */
+ uint8_t dump_type; /* DUMP_TYPE_* */
+ __be16 reserved;
+ __be32 size;
+} __packed;
+
+/* Memory dump destination table (MDDT) */
+struct mddt_table {
+ __be64 addr;
+ uint8_t data_region;
+ uint8_t dump_type;
+ __be16 reserved;
+ __be32 size;
+} __packed;
+
+/*
+ * Memory dump result table (MDRT)
+ *
+ * List of the memory ranges that have been included in the dump. This table is
+ * filled by hostboot and passed to OPAL on second boot. OPAL/payload will use
+ * this table to extract the dump.
+ */
+struct mdrt_table {
+ __be64 src_addr;
+ __be64 dest_addr;
+ uint8_t data_region;
+ uint8_t dump_type; /* unused */
+ __be16 reserved; /* unused */
+ __be32 size;
+ __be64 padding; /* unused */
+} __packed;
+
static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
{
val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW do? */
--
2.48.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 5/7] hw/ppc: Preserve Memory Regions as per MDST/MDDT tables
2025-02-17 7:19 [PATCH 0/7] Implement MPIPL for PowerNV Aditya Gupta
` (3 preceding siblings ...)
2025-02-17 7:19 ` [PATCH 4/7] hw/ppc: Add MDST/MDDT/MDRT table structures and offsets Aditya Gupta
@ 2025-02-17 7:19 ` Aditya Gupta
2025-03-11 5:18 ` Harsh Prateek Bora
2025-02-17 7:19 ` [PATCH 6/7] hw/ppc: [WIP] Add Processor Dump Area offsets in Pnv SBE Aditya Gupta
` (2 subsequent siblings)
7 siblings, 1 reply; 24+ messages in thread
From: Aditya Gupta @ 2025-02-17 7:19 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
When MPIPL is used, OPAL/Linux registers memory regions to be preserved
on a Memory-Preserving boot ('crashkernel boot').
The regions are added to two tables: MDST and MDDT (source and
destination tables)
The MDST contains the start address of the region, and size of region
The MDDT contains the destination address where the region should be
copied (and size of region which will be same as in MDST entry)
Then after a crash, when hostboot (pnv_sbe.c in case of QEMU)
preserves the memory region, it adds the details of preserved regions to
MDRT (results table)
Copy memory regions mentioned in MDST to addresses mentioned in MDDT.
And accordingly update the copied region details in MDRT table.
Note: If we did not preserve the regions, and MDRT is empty then OPAL
simply logs "OPAL dump is not available", while kernel will assume that
firmware would have preserved the regions, and export /proc/vmcore, but
the vmcore won't have most basic kernel structures hence crash will be
unable to analyse the vmcore
Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
---
hw/ppc/pnv_sbe.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)
diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
index 361a3854307d..ee905df4e0a6 100644
--- a/hw/ppc/pnv_sbe.c
+++ b/hw/ppc/pnv_sbe.c
@@ -227,6 +227,60 @@ static uint64_t pnv_sbe_power9_xscom_ctrl_read(void *opaque, hwaddr addr,
return val;
}
+static void pnv_mpipl_preserve_mem(void)
+{
+ /* Get access to metadata */
+ struct mpipl_metadata *metadata = malloc(DUMP_METADATA_AREA_SIZE);
+ struct mdst_table *mdst = malloc(MDST_TABLE_SIZE);
+ struct mddt_table *mddt = malloc(MDDT_TABLE_SIZE);
+ struct mdrt_table *mdrt = malloc(MDRT_TABLE_SIZE);
+ __be64 source_addr, dest_addr, bytes_to_copy;
+ uint8_t *copy_buffer;
+
+ cpu_physical_memory_read(DUMP_METADATA_AREA_BASE, metadata, DUMP_METADATA_AREA_SIZE);
+ cpu_physical_memory_read(MDST_TABLE_BASE, mdst, MDST_TABLE_SIZE);
+ cpu_physical_memory_read(MDDT_TABLE_BASE, mddt, MDDT_TABLE_SIZE);
+
+ /* HRMOR_BIT copied from skiboot */
+ #define HRMOR_BIT (1ul << 63)
+
+ for (int i = 0;; ++i) {
+ /* NOTE: Assuming uninitialised will be all zeroes */
+ if ((mdst[i].addr == 0) && (mdst[i].size == 0)) {
+ break;
+ }
+
+ if (mdst[i].size != mddt[i].size) {
+ qemu_log_mask(LOG_TRACE,
+ "Warning: Invalid entry, size mismatch in MDST & MDDT\n");
+ continue;
+ }
+
+ if (mdst[i].data_region != mddt[i].data_region) {
+ qemu_log_mask(LOG_TRACE,
+ "Warning: Invalid entry, region mismatch in MDST & MDDT\n");
+ continue;
+ }
+
+ mdrt[i].src_addr = mdst[i].addr;
+ mdrt[i].dest_addr = mddt[i].addr;
+ mdrt[i].size = mdst[i].size;
+ mdrt[i].data_region = mdst[i].data_region;
+
+ source_addr = cpu_to_be64(mdst[i].addr) & ~HRMOR_BIT;
+ dest_addr = cpu_to_be64(mddt[i].addr) & ~HRMOR_BIT;
+ bytes_to_copy = cpu_to_be32(mddt[i].size);
+
+ /* XXX: Am i assuming we are in big endian mode ? */
+ copy_buffer = malloc(bytes_to_copy);
+ cpu_physical_memory_read(source_addr, copy_buffer, bytes_to_copy);
+ cpu_physical_memory_write(dest_addr, copy_buffer, bytes_to_copy);
+ free(copy_buffer);
+ }
+
+ cpu_physical_memory_write(MDRT_TABLE_BASE, mdrt, MDRT_TABLE_SIZE);
+}
+
static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
uint64_t val, unsigned size)
{
@@ -250,6 +304,9 @@ static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
*/
pause_all_vcpus();
+ /* Preserve the memory locations registered for MPIPL */
+ pnv_mpipl_preserve_mem();
+
/*
* TODO: Pass `mpipl` node in device tree to signify next
* boot is an MPIPL boot
--
2.48.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 6/7] hw/ppc: [WIP] Add Processor Dump Area offsets in Pnv SBE
2025-02-17 7:19 [PATCH 0/7] Implement MPIPL for PowerNV Aditya Gupta
` (4 preceding siblings ...)
2025-02-17 7:19 ` [PATCH 5/7] hw/ppc: Preserve Memory Regions as per MDST/MDDT tables Aditya Gupta
@ 2025-02-17 7:19 ` Aditya Gupta
2025-03-11 5:23 ` Harsh Prateek Bora
2025-02-17 7:19 ` [PATCH 7/7] hw/ppc: Implement MPIPL in PowerNV Aditya Gupta
2025-02-27 3:37 ` [PATCH 0/7] Implement MPIPL for PowerNV Nicholas Piggin
7 siblings, 1 reply; 24+ messages in thread
From: Aditya Gupta @ 2025-02-17 7:19 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
Add offsets for the processor state captured during MPIPL dump.
This is incomplete. And might be implemented in future if the effort to
implement MPIPL is resumed again.
Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
---
hw/ppc/pnv_sbe.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
index ee905df4e0a6..3b50667226b5 100644
--- a/hw/ppc/pnv_sbe.c
+++ b/hw/ppc/pnv_sbe.c
@@ -197,6 +197,25 @@ struct mdrt_table {
__be64 padding; /* unused */
} __packed;
+/*
+ * Processor Dump Area
+ *
+ * This contains the information needed for having processor
+ * state captured during a platform dump.
+ */
+struct proc_dump_area {
+ __be32 thread_size; /* Size of each thread register entry */
+#define PROC_DUMP_AREA_FORMAT_P9 0x1 /* P9 format */
+ uint8_t version;
+ uint8_t reserved[11];
+ __be64 alloc_addr; /* Destination memory to place register data */
+ __be32 reserved2;
+ __be32 alloc_size; /* Allocated size */
+ __be64 dest_addr; /* Destination address */
+ __be32 reserved3;
+ __be32 act_size; /* Actual data size */
+} __packed;
+
static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
{
val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW do? */
@@ -281,6 +300,11 @@ static void pnv_mpipl_preserve_mem(void)
cpu_physical_memory_write(MDRT_TABLE_BASE, mdrt, MDRT_TABLE_SIZE);
}
+static void pnv_mpipl_save_proc_regs(void)
+{
+ /* TODO: modify PROC_DUMP_AREA_BASE */
+}
+
static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
uint64_t val, unsigned size)
{
@@ -307,6 +331,9 @@ static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
/* Preserve the memory locations registered for MPIPL */
pnv_mpipl_preserve_mem();
+ /* Save processor state */
+ pnv_mpipl_save_proc_regs();
+
/*
* TODO: Pass `mpipl` node in device tree to signify next
* boot is an MPIPL boot
--
2.48.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 7/7] hw/ppc: Implement MPIPL in PowerNV
2025-02-17 7:19 [PATCH 0/7] Implement MPIPL for PowerNV Aditya Gupta
` (5 preceding siblings ...)
2025-02-17 7:19 ` [PATCH 6/7] hw/ppc: [WIP] Add Processor Dump Area offsets in Pnv SBE Aditya Gupta
@ 2025-02-17 7:19 ` Aditya Gupta
2025-03-11 5:41 ` Harsh Prateek Bora
2025-02-27 3:37 ` [PATCH 0/7] Implement MPIPL for PowerNV Nicholas Piggin
7 siblings, 1 reply; 24+ messages in thread
From: Aditya Gupta @ 2025-02-17 7:19 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
Linux expect a "ibm,opal/dump" node to know whether MPIPL (aka fadump)
is supported on the hardware.
Export the "ibm,opal/dump" node in QEMU's device tree for Linux to know
that PowerNV supports MPIPL.
With the commit, kernel boots thinking fadump is supported, and reserves
memory regions for fadump if "fadump=on" is passed in kernel cmdline:
Linux/PowerPC load: init=/bin/sh debug fadump=on
Finalizing device tree... flat tree at 0x20ebaca0
[ 1.005765851,5] DUMP: Payload sent metadata tag : 0x800002a8
[ 1.005980914,5] DUMP: Boot mem size : 0x40000000
[ 0.000000] opal fadump: Kernel metadata addr: 800002a8
[ 0.000000] fadump: Reserved 1024MB of memory at 0x00000040000000 (System RAM: 20480MB)
[ 0.000000] fadump: Initialized 0x40000000 bytes cma area at 1024MB from 0x400102a8 bytes of memory reserved for firmware-assisted dump
Also, OPAL and Linux expect the "mpipl-boot" device tree node on a MPIPL
boot. Hence add "mpipl-boot" property in device tree on an MPIPL boot.
Hence after crash, Linux knows when it's a MPIPL/fadump boot:
[ 0.000000] opal fadump: Firmware-assisted dump is active.
[ 0.000000] fadump: Firmware-assisted dump is active.
[ 0.000000] fadump: Reserving 23552MB of memory at 0x00000040000000 for preserving crash data
Do note that fadump boot in PowerNV seems to require more memory,
trying with 1GB causes this error by kernel:
[ 0.000000] fadump: Failed to find memory chunk for reservation!
And even with anything from 2GB - 19GB, the kernel fails to boot due to
some memory issues.
Trying with >20GB memory is recommended for now
Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
---
hw/ppc/pnv.c | 49 ++++++++++++++++++++++++++++++++++++++++
hw/ppc/pnv_sbe.c | 18 +++++++++++----
include/hw/ppc/pnv_sbe.h | 4 ++++
3 files changed, 67 insertions(+), 4 deletions(-)
diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 11fd477b71be..39ed3f873e9a 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -51,6 +51,7 @@
#include "hw/ppc/pnv_chip.h"
#include "hw/ppc/pnv_xscom.h"
#include "hw/ppc/pnv_pnor.h"
+#include "hw/ppc/pnv_sbe.h"
#include "hw/isa/isa.h"
#include "hw/char/serial-isa.h"
@@ -697,6 +698,26 @@ static void *pnv_dt_create(MachineState *machine)
pmc->dt_power_mgt(pnv, fdt);
}
+ /* Add "dump" node so kernel knows MPIPL (aka fadump) is supported */
+ off = fdt_add_subnode(fdt, 0, "ibm,opal");
+ if (off == -FDT_ERR_EXISTS) {
+ off = fdt_path_offset(fdt, "/ibm,opal");
+ }
+
+ _FDT(off);
+ off = fdt_add_subnode(fdt, off, "dump");
+ _FDT(off);
+ _FDT((fdt_setprop_string(fdt, off, "compatible", "ibm,opal-dump")));
+
+ /* Add kernel and initrd as fw-load-area */
+ uint64_t fw_load_area[4] = {
+ cpu_to_be64(KERNEL_LOAD_ADDR), cpu_to_be64(KERNEL_MAX_SIZE),
+ cpu_to_be64(INITRD_LOAD_ADDR), cpu_to_be64(INITRD_MAX_SIZE)
+ };
+
+ _FDT((fdt_setprop(fdt, off, "fw-load-area",
+ fw_load_area, sizeof(fw_load_area))));
+
return fdt;
}
@@ -714,6 +735,7 @@ static void pnv_reset(MachineState *machine, ResetType type)
PnvMachineState *pnv = PNV_MACHINE(machine);
IPMIBmc *bmc;
void *fdt;
+ int node_offset;
qemu_devices_reset(type);
@@ -744,6 +766,33 @@ static void pnv_reset(MachineState *machine, ResetType type)
_FDT((fdt_pack(fdt)));
}
+ /*
+ * If it's a MPIPL boot, add the "mpipl-boot" property, and reset the
+ * boolean for MPIPL boot for next boot
+ */
+ if (pnv_sbe_is_mpipl_boot()) {
+ void *fdt_copy = g_malloc0(FDT_MAX_SIZE);
+
+ /* Create a writable copy of the fdt */
+ _FDT((fdt_open_into(fdt, fdt_copy, FDT_MAX_SIZE)));
+
+ node_offset = fdt_path_offset(fdt_copy, "/ibm,opal/dump");
+ _FDT((fdt_appendprop_u64(fdt_copy, node_offset, "mpipl-boot", 1)));
+
+ /* Update the fdt, and free the original fdt */
+ if (fdt != machine->fdt) {
+ /*
+ * Only free the fdt if it's not machine->fdt, to prevent
+ * double free, since we already free machine->fdt later
+ */
+ g_free(fdt);
+ }
+ fdt = fdt_copy;
+
+ /* This boot is an MPIPL, reset the boolean for next boot */
+ pnv_sbe_reset_is_next_boot_mpipl();
+ }
+
qemu_fdt_dumpdtb(fdt, fdt_totalsize(fdt));
cpu_physical_memory_write(PNV_FDT_ADDR, fdt, fdt_totalsize(fdt));
diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
index 3b50667226b5..671dc81c9501 100644
--- a/hw/ppc/pnv_sbe.c
+++ b/hw/ppc/pnv_sbe.c
@@ -216,6 +216,18 @@ struct proc_dump_area {
__be32 act_size; /* Actual data size */
} __packed;
+static bool is_next_boot_mpipl;
+
+bool pnv_sbe_is_mpipl_boot(void)
+{
+ return is_next_boot_mpipl;
+}
+
+void pnv_sbe_reset_is_next_boot_mpipl(void)
+{
+ is_next_boot_mpipl = false;
+}
+
static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
{
val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW do? */
@@ -334,10 +346,8 @@ static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
/* Save processor state */
pnv_mpipl_save_proc_regs();
- /*
- * TODO: Pass `mpipl` node in device tree to signify next
- * boot is an MPIPL boot
- */
+ /* Mark next boot as Memory-preserving boot */
+ is_next_boot_mpipl = true;
/* Then do a guest reset */
/*
diff --git a/include/hw/ppc/pnv_sbe.h b/include/hw/ppc/pnv_sbe.h
index f6cbcf990ed9..94bbdc7b6414 100644
--- a/include/hw/ppc/pnv_sbe.h
+++ b/include/hw/ppc/pnv_sbe.h
@@ -56,4 +56,8 @@ struct PnvSBEClass {
/* Helper to access stashed SKIBOOT_BASE */
bool pnv_sbe_mpipl_skiboot_base(void);
+/* Helpers to know if next boot is MPIPL boot */
+bool pnv_sbe_is_mpipl_boot(void);
+void pnv_sbe_reset_is_next_boot_mpipl(void);
+
#endif /* PPC_PNV_SBE_H */
--
2.48.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 0/7] Implement MPIPL for PowerNV
2025-02-17 7:19 [PATCH 0/7] Implement MPIPL for PowerNV Aditya Gupta
` (6 preceding siblings ...)
2025-02-17 7:19 ` [PATCH 7/7] hw/ppc: Implement MPIPL in PowerNV Aditya Gupta
@ 2025-02-27 3:37 ` Nicholas Piggin
2025-02-27 6:23 ` Aditya Gupta
7 siblings, 1 reply; 24+ messages in thread
From: Nicholas Piggin @ 2025-02-27 3:37 UTC (permalink / raw)
To: Aditya Gupta, qemu-devel
Cc: qemu-ppc, Frédéric Barrat, Sourabh Jain,
Mahesh J Salgaonkar, Hari Bathini
On Mon Feb 17, 2025 at 5:19 PM AEST, Aditya Gupta wrote:
> Overview
> =========
>
> Implemented MPIPL (Memory Preserving IPL, aka fadump) on PowerNV machine
> in QEMU.
Wow, that's a lot of effort.
> Note: It's okay if this isn't merged as there might be less users. Sending
> for archieval purpose, as the patches can be referred for how fadump/mpipl
> can be implemented in baremetal/PowerNV/any other arch QEMU.
I would like to add it. It helps test a bunch of code that is in Linux
and skiboot, so it would be quite useful. A functional test would be
important to have.
I've had a glance through it, but better review might have to wait for
until the next development cycle.
Thanks,
Nick
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/7] Implement MPIPL for PowerNV
2025-02-27 3:37 ` [PATCH 0/7] Implement MPIPL for PowerNV Nicholas Piggin
@ 2025-02-27 6:23 ` Aditya Gupta
0 siblings, 0 replies; 24+ messages in thread
From: Aditya Gupta @ 2025-02-27 6:23 UTC (permalink / raw)
To: Nicholas Piggin, qemu-devel
Cc: qemu-ppc, Frédéric Barrat, Sourabh Jain,
Mahesh J Salgaonkar, Hari Bathini
Hi Nick,
On 27/02/25 09:07, Nicholas Piggin wrote:
> On Mon Feb 17, 2025 at 5:19 PM AEST, Aditya Gupta wrote:
>> Overview
>> =========
>>
>> Implemented MPIPL (Memory Preserving IPL, aka fadump) on PowerNV machine
>> in QEMU.
> Wow, that's a lot of effort.
Thanks Nick.
>> Note: It's okay if this isn't merged as there might be less users. Sending
>> for archieval purpose, as the patches can be referred for how fadump/mpipl
>> can be implemented in baremetal/PowerNV/any other arch QEMU.
> I would like to add it. It helps test a bunch of code that is in Linux
> and skiboot, so it would be quite useful. A functional test would be
> important to have.
Sure, it's not complete yet (didn't implement the CPU saving part) as I
just wanted to do a experiment I did, will improve those things by v2
then. It might take some time though.
Will look into the functional test thing also.
> I've had a glance through it, but better review might have to wait for
> until the next development cycle.
Sure, that's totally okay. Thank you for looking at it.
Thanks,
- Aditya G
> Thanks,
> Nick
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/7] hw/ppc: Log S0/S1 Interrupt triggers by OPAL
2025-02-17 7:19 ` [PATCH 1/7] hw/ppc: Log S0/S1 Interrupt triggers by OPAL Aditya Gupta
@ 2025-03-11 4:38 ` Harsh Prateek Bora
2025-03-13 18:43 ` Aditya Gupta
0 siblings, 1 reply; 24+ messages in thread
From: Harsh Prateek Bora @ 2025-03-11 4:38 UTC (permalink / raw)
To: Aditya Gupta, qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
On 2/17/25 12:49, Aditya Gupta wrote:
> During MPIPL (aka fadump), OPAL triggers the S0 SBE interrupt to trigger
> MPIPL.
>
> Currently QEMU treats it as "Unimplemented", handle the interrupts by
> just logging that the interrupt happened.
>
> Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
> ---
> hw/ppc/pnv_sbe.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
> index 74cee4eea7ad..62c94a04a2df 100644
> --- a/hw/ppc/pnv_sbe.c
> +++ b/hw/ppc/pnv_sbe.c
> @@ -109,6 +109,19 @@ static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
> trace_pnv_sbe_xscom_ctrl_write(addr, val);
>
> switch (offset) {
> + case SBE_CONTROL_REG_RW:
> + switch (val) {
> + case SBE_CONTROL_REG_S0:
> + qemu_log_mask(LOG_UNIMP, "SBE: S0 Interrupt triggered\n");
> + break;
> + case SBE_CONTROL_REG_S1:
> + qemu_log_mask(LOG_UNIMP, "SBE: S1 Interrupt triggered\n");
> + break;
> + default:
> + qemu_log_mask(LOG_UNIMP, "SBE Unimplemented register: Ox%"
This log could be made specific to SBE unimplemented register "bits",
otherwise fall back to outer switch-default case.
> + HWADDR_PRIx "\n", addr >> 3);
> + }
> + break;
> default:
> qemu_log_mask(LOG_UNIMP, "SBE Unimplemented register: Ox%"
> HWADDR_PRIx "\n", addr >> 3);
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/7] hw/ppc: Implement S0 SBE interrupt as cpu_pause then host reset
2025-02-17 7:19 ` [PATCH 2/7] hw/ppc: Implement S0 SBE interrupt as cpu_pause then host reset Aditya Gupta
@ 2025-03-11 4:45 ` Harsh Prateek Bora
2025-03-13 18:45 ` Aditya Gupta
0 siblings, 1 reply; 24+ messages in thread
From: Harsh Prateek Bora @ 2025-03-11 4:45 UTC (permalink / raw)
To: Aditya Gupta, qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
On 2/17/25 12:49, Aditya Gupta wrote:
> SBE's implementation of S0 seems to be basically "stop all clocks" and
> then "host reset"
>
> Nearest equivalent to the stop clocks seems to be 'pause_all_vcpus' in
> QEMU,
>
> Then reset the host, which is 'SHUTDOWN_CAUSE_GUEST_RESET' in QEMU.
>
> Implement the S0 interrupt as pause_vcpus + guest_reset
>
> See 'stopClocksS0' in SBE source code for more information.
>
> Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
> ---
> hw/ppc/pnv_sbe.c | 50 +++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 41 insertions(+), 9 deletions(-)
>
> diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
> index 62c94a04a2df..a6bf13650f2d 100644
> --- a/hw/ppc/pnv_sbe.c
> +++ b/hw/ppc/pnv_sbe.c
> @@ -21,6 +21,8 @@
> #include "qapi/error.h"
> #include "qemu/log.h"
> #include "qemu/module.h"
> +#include "system/cpus.h"
> +#include "system/runstate.h"
> #include "hw/irq.h"
> #include "hw/qdev-properties.h"
> #include "hw/ppc/pnv.h"
> @@ -80,6 +82,15 @@
> #define SBE_CONTROL_REG_S0 PPC_BIT(14)
> #define SBE_CONTROL_REG_S1 PPC_BIT(15)
>
> +static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
> +{
> + val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW do? */
> + sbe->host_doorbell = val;
> +
> + trace_pnv_sbe_reg_set_host_doorbell(val);
> + qemu_set_irq(sbe->psi_irq, !!val);
> +}
> +
> struct sbe_msg {
> uint64_t reg[4];
> };
> @@ -104,6 +115,7 @@ static uint64_t pnv_sbe_power9_xscom_ctrl_read(void *opaque, hwaddr addr,
> static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
> uint64_t val, unsigned size)
> {
> + PnvSBE *sbe = opaque;
> uint32_t offset = addr >> 3;
>
> trace_pnv_sbe_xscom_ctrl_write(addr, val);
> @@ -113,6 +125,35 @@ static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
> switch (val) {
> case SBE_CONTROL_REG_S0:
> qemu_log_mask(LOG_UNIMP, "SBE: S0 Interrupt triggered\n");
> +
> + pnv_sbe_set_host_doorbell(sbe, sbe->host_doorbell | SBE_HOST_RESPONSE_MASK);
> +
> + /*
> + * Looks like, SBE stops clocks for all cores in S0.
> + * See 'stopClocksS0' in SBE source code.
> + * Nearest equivalent in QEMU seems to be 'pause_all_vcpus'
> + */
> + pause_all_vcpus();
> +
> + /*
> + * TODO: Pass `mpipl` node in device tree to signify next
> + * boot is an MPIPL boot
> + */
> +
> + /* Then do a guest reset */
> + /*
> + * Requirement:
> + * This guest reset should not clear the memory (which is
> + * the case when this is merged)
This comment may need a rephrase. Also, if we are keeping an assumption
that SHUTDOWN_CAUSE_GUEST_RESET should not clear the memory, it may be
better to put a comment about this wherever it is handled as well.
> + */
> + qemu_system_reset_request(SHUTDOWN_CAUSE_GUEST_RESET);
> +
> + /*
> + * XXX: Does SBE really do system reset or only stop
> + * clocks ? OPAL seems to think that control will not come
> + * to it after it has triggered S0 interrupt.
> + */
> +
> break;
> case SBE_CONTROL_REG_S1:
> qemu_log_mask(LOG_UNIMP, "SBE: S1 Interrupt triggered\n");
> @@ -138,15 +179,6 @@ static const MemoryRegionOps pnv_sbe_power9_xscom_ctrl_ops = {
> .endianness = DEVICE_BIG_ENDIAN,
> };
>
> -static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
> -{
> - val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW do? */
> - sbe->host_doorbell = val;
> -
> - trace_pnv_sbe_reg_set_host_doorbell(val);
> - qemu_set_irq(sbe->psi_irq, !!val);
> -}
> -
Code movement like above could be a separate patch, not necessary though.
Patch 1 can be squashed with this one.
Thanks
Harsh
> /* SBE Target Type */
> #define SBE_TARGET_TYPE_PROC 0x00
> #define SBE_TARGET_TYPE_EX 0x01
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/7] hw/ppc: Handle stash command in PowerNV SBE
2025-02-17 7:19 ` [PATCH 3/7] hw/ppc: Handle stash command in PowerNV SBE Aditya Gupta
@ 2025-03-11 4:50 ` Harsh Prateek Bora
2025-03-13 18:46 ` Aditya Gupta
0 siblings, 1 reply; 24+ messages in thread
From: Harsh Prateek Bora @ 2025-03-11 4:50 UTC (permalink / raw)
To: Aditya Gupta, qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
On 2/17/25 12:49, Aditya Gupta wrote:
> Earlier since the SBE_CMD_STASH_MPIPL_CONFIG command was not handled, so
> skiboot used to not get any response from SBE:
>
> [ 106.350742821,3] SBE: Message timeout [chip id = 0], cmd = d7, subcmd = 7
> [ 106.352067746,3] SBE: Failed to send stash MPIPL config [chip id = 0x0, rc = 254]
>
> Fix this by handling the command in PowerNV SBE, and sending a response so
> skiboot knows SBE has handled the STASH command
>
> The stashed skiboot base is later used to access the relocated MDST/MDDT
> tables when MPIPL is implemented.
>
> The purpose of stashing relocated base address is explained in following
> skiboot commit:
>
> author Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Fri Jul 12 16:47:51 2019 +0530
> committer Oliver O'Halloran <oohall@gmail.com> Thu Aug 15 17:53:39 2019 +1000
>
> SBE: Send OPAL relocated base address to SBE
>
> OPAL relocates itself during boot. During memory preserving IPL hostboot needs
> to access relocated OPAL base address to get MDST, MDDT tables. Hence send
> relocated base address to SBE via 'stash MPIPL config' chip-op. During next
> IPL SBE will send stashed data to hostboot... so that hostboot can access
> these data.
>
> Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
> ---
> hw/ppc/pnv_sbe.c | 25 +++++++++++++++++++++++++
> include/hw/ppc/pnv_sbe.h | 3 +++
> 2 files changed, 28 insertions(+)
>
> diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
> index a6bf13650f2d..79818177fc36 100644
> --- a/hw/ppc/pnv_sbe.c
> +++ b/hw/ppc/pnv_sbe.c
> @@ -82,6 +82,8 @@
> #define SBE_CONTROL_REG_S0 PPC_BIT(14)
> #define SBE_CONTROL_REG_S1 PPC_BIT(15)
>
> +static uint64_t mpipl_skiboot_base = 0x30000000 /*default SKIBOOT_BASE*/;
> +
> static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
> {
> val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW do? */
> @@ -281,6 +283,29 @@ static void do_sbe_msg(PnvSBE *sbe)
> timer_del(sbe->timer);
> }
> break;
> + case SBE_CMD_STASH_MPIPL_CONFIG:
> + /* key = sbe->mbox[1] */
> + switch (sbe->mbox[1]) {
> + case SBE_STASH_KEY_SKIBOOT_BASE:
> + mpipl_skiboot_base = sbe->mbox[2];
> + qemu_log_mask(LOG_UNIMP,
> + "Stashing skiboot base: 0x%lx\n", mpipl_skiboot_base);
> +
> + /*
> + * Set the response register.
> + *
> + * Currently setting the same sequence number in
> + * response as we got in the request.
> + */
> + sbe->mbox[4] = sbe->mbox[0]; /* sequence number */
> + pnv_sbe_set_host_doorbell(sbe,
> + sbe->host_doorbell | SBE_HOST_RESPONSE_WAITING);
> +
> + break;
> + default:
> + qemu_log_mask(LOG_UNIMP, "SBE Unimplemented command: 0x%x\n", cmd);
Unimplemented SBE_CMD_STASH_MPIPL_CONFIG key ?
> + }
> + break;
> default:
> qemu_log_mask(LOG_UNIMP, "SBE Unimplemented command: 0x%x\n", cmd);
> }
> diff --git a/include/hw/ppc/pnv_sbe.h b/include/hw/ppc/pnv_sbe.h
> index b6b378ad14c7..f6cbcf990ed9 100644
> --- a/include/hw/ppc/pnv_sbe.h
> +++ b/include/hw/ppc/pnv_sbe.h
> @@ -53,4 +53,7 @@ struct PnvSBEClass {
> const MemoryRegionOps *xscom_mbox_ops;
> };
>
> +/* Helper to access stashed SKIBOOT_BASE */
> +bool pnv_sbe_mpipl_skiboot_base(void);
> +
> #endif /* PPC_PNV_SBE_H */
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/7] hw/ppc: Add MDST/MDDT/MDRT table structures and offsets
2025-02-17 7:19 ` [PATCH 4/7] hw/ppc: Add MDST/MDDT/MDRT table structures and offsets Aditya Gupta
@ 2025-03-11 5:11 ` Harsh Prateek Bora
2025-03-13 18:50 ` Aditya Gupta
0 siblings, 1 reply; 24+ messages in thread
From: Harsh Prateek Bora @ 2025-03-11 5:11 UTC (permalink / raw)
To: Aditya Gupta, qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
On 2/17/25 12:49, Aditya Gupta wrote:
> Add the MDST, MDDT, MDRT tables offsets and structures as per current
> skiboot upstream:
>
> commit bc7b85db1e7e ("opal-ci: Remove centos7")
>
> These structures will be later populated when preserving memory regions
> for MPIPL
>
> Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
> ---
> hw/ppc/pnv_sbe.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 113 insertions(+)
>
> diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
> index 79818177fc36..361a3854307d 100644
> --- a/hw/ppc/pnv_sbe.c
> +++ b/hw/ppc/pnv_sbe.c
These should have been added in include/hw/ppc/pnv_sbe.h
> @@ -84,6 +84,119 @@
>
> static uint64_t mpipl_skiboot_base = 0x30000000 /*default SKIBOOT_BASE*/;
>
> +/* Following offsets are copied from Skiboot source code */
> +/* Use 768 bytes for SPIRAH */
> +#define SPIRAH_OFF 0x00010000
> +#define SPIRAH_SIZE 0x300
> +
> +/* Use 256 bytes for processor dump area */
> +#define PROC_DUMP_AREA_OFF (SPIRAH_OFF + SPIRAH_SIZE)
> +#define PROC_DUMP_AREA_SIZE 0x100
> +
> +#define PROCIN_OFF (PROC_DUMP_AREA_OFF + PROC_DUMP_AREA_SIZE)
> +#define PROCIN_SIZE 0x800
> +
> +/* Offsets of MDST and MDDT tables from skiboot base */
> +#define MDST_TABLE_OFF (PROCIN_OFF + PROCIN_SIZE)
> +#define MDST_TABLE_SIZE 0x400
> +
> +#define MDDT_TABLE_OFF (MDST_TABLE_OFF + MDST_TABLE_SIZE)
> +#define MDDT_TABLE_SIZE 0x400
> +
> +#define CPU_CTL_OFF (MDDT_TABLE_OFF + MDDT_TABLE_SIZE)
> +#define CPU_CTL_SIZE 0x2000
> +
> +/* MPIPL reserved regions (offset by skiboot_base to access) */
> +#define MDST_TABLE_BASE (mpipl_skiboot_base + MDST_TABLE_OFF)
> +#define MDDT_TABLE_BASE (mpipl_skiboot_base + MDDT_TABLE_OFF)
> +#define PROC_DUMP_AREA_BASE (mpipl_skiboot_base + PROC_DUMP_AREA_OFF)
> +
> +#define __packed __attribute__((packed))
> +
> +/* Metadata to capture before triggering MPIPL */
> +struct mpipl_metadata {
CamelCase for structs? Applies for other structs below.
> + /* Crashing PIR is required to create OPAL dump */
> + uint32_t crashing_pir;
> + /* Kernel expects OPAL to presrve tag and pass it back via OPAL API */
> + uint64_t kernel_tag;
> + /* Post MPIPL kernel boot memory size */
> + uint64_t boot_mem_size;
> +} __packed;
> +
> +/* Structure version */
> +#define OPAL_MPIPL_VERSION 0x01
> +
> +/* Preserved memory details */
> +struct opal_mpipl_region {
> + __be64 src;
> + __be64 dest;
> + __be64 size;
> +};
> +
> +struct opal_mpipl_fadump {
> + uint8_t version;
> + uint8_t reserved[7];
> + __be32 crashing_pir; /* OPAL crashing CPU PIR */
> + __be32 cpu_data_version;
> + __be32 cpu_data_size;
> + __be32 region_cnt;
> + struct opal_mpipl_region *region;
> +};
> +
> +/*
> + * This is our dump result table after MPIPL. Hostboot will write to this
> + * memory after moving memory content from source to destination memory.
> + */
> +#define MDRT_TABLE_BASE (mpipl_skiboot_base + 0x01c00000)
> +#define MDRT_TABLE_SIZE 0x00008000
> +
> +/*
> + * This is our dump metadata area. We will use this memory to save metadata
> + * (like crashing CPU details, payload tags) before triggering MPIPL.
> + */
> +#define DUMP_METADATA_AREA_BASE (mpipl_skiboot_base + 0x01c08000)
> +#define DUMP_METADATA_AREA_SIZE 0x8000
> +
> +/*
> + * Memory Dump Source Table
> + *
> + * Format of this table is same as Memory Dump Source Table (MDST)
> + * defined in HDAT spec.
> + */
> +struct mdst_table {
> + __be64 addr;
> + uint8_t data_region; /* DUMP_REGION_* */
> + uint8_t dump_type; /* DUMP_TYPE_* */
> + __be16 reserved;
> + __be32 size;
> +} __packed;
> +
> +/* Memory dump destination table (MDDT) */
> +struct mddt_table {
> + __be64 addr;
> + uint8_t data_region;
> + uint8_t dump_type;
> + __be16 reserved;
> + __be32 size;
> +} __packed;
If both mdst_table and mddt_table are supposed to be same, we could have
just one mdxt_table and variable instances could be named mdst/mddt.
Thanks
Harsh
> +
> +/*
> + * Memory dump result table (MDRT)
> + *
> + * List of the memory ranges that have been included in the dump. This table is
> + * filled by hostboot and passed to OPAL on second boot. OPAL/payload will use
> + * this table to extract the dump.
> + */
> +struct mdrt_table {
> + __be64 src_addr;
> + __be64 dest_addr;
> + uint8_t data_region;
> + uint8_t dump_type; /* unused */
> + __be16 reserved; /* unused */
> + __be32 size;
> + __be64 padding; /* unused */
> +} __packed;
> +
> static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
> {
> val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW do? */
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 5/7] hw/ppc: Preserve Memory Regions as per MDST/MDDT tables
2025-02-17 7:19 ` [PATCH 5/7] hw/ppc: Preserve Memory Regions as per MDST/MDDT tables Aditya Gupta
@ 2025-03-11 5:18 ` Harsh Prateek Bora
2025-03-13 18:54 ` Aditya Gupta
0 siblings, 1 reply; 24+ messages in thread
From: Harsh Prateek Bora @ 2025-03-11 5:18 UTC (permalink / raw)
To: Aditya Gupta, qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
On 2/17/25 12:49, Aditya Gupta wrote:
> When MPIPL is used, OPAL/Linux registers memory regions to be preserved
> on a Memory-Preserving boot ('crashkernel boot').
>
> The regions are added to two tables: MDST and MDDT (source and
> destination tables)
>
> The MDST contains the start address of the region, and size of region
>
> The MDDT contains the destination address where the region should be
> copied (and size of region which will be same as in MDST entry)
>
> Then after a crash, when hostboot (pnv_sbe.c in case of QEMU)
> preserves the memory region, it adds the details of preserved regions to
> MDRT (results table)
>
> Copy memory regions mentioned in MDST to addresses mentioned in MDDT.
> And accordingly update the copied region details in MDRT table.
>
> Note: If we did not preserve the regions, and MDRT is empty then OPAL
> simply logs "OPAL dump is not available", while kernel will assume that
> firmware would have preserved the regions, and export /proc/vmcore, but
> the vmcore won't have most basic kernel structures hence crash will be
> unable to analyse the vmcore
>
> Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
> ---
> hw/ppc/pnv_sbe.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 57 insertions(+)
>
> diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
> index 361a3854307d..ee905df4e0a6 100644
> --- a/hw/ppc/pnv_sbe.c
> +++ b/hw/ppc/pnv_sbe.c
> @@ -227,6 +227,60 @@ static uint64_t pnv_sbe_power9_xscom_ctrl_read(void *opaque, hwaddr addr,
> return val;
> }
>
> +static void pnv_mpipl_preserve_mem(void)
> +{
> + /* Get access to metadata */
> + struct mpipl_metadata *metadata = malloc(DUMP_METADATA_AREA_SIZE);
> + struct mdst_table *mdst = malloc(MDST_TABLE_SIZE);
> + struct mddt_table *mddt = malloc(MDDT_TABLE_SIZE);
> + struct mdrt_table *mdrt = malloc(MDRT_TABLE_SIZE);
Where are these getting free()ed? Mem leak ?
> + __be64 source_addr, dest_addr, bytes_to_copy;
> + uint8_t *copy_buffer;
> +
> + cpu_physical_memory_read(DUMP_METADATA_AREA_BASE, metadata, DUMP_METADATA_AREA_SIZE);
> + cpu_physical_memory_read(MDST_TABLE_BASE, mdst, MDST_TABLE_SIZE);
> + cpu_physical_memory_read(MDDT_TABLE_BASE, mddt, MDDT_TABLE_SIZE);
> +
> + /* HRMOR_BIT copied from skiboot */
> + #define HRMOR_BIT (1ul << 63)
Could be moved to pnv_sbe.h file.
> +
> + for (int i = 0;; ++i) {
> + /* NOTE: Assuming uninitialised will be all zeroes */
> + if ((mdst[i].addr == 0) && (mdst[i].size == 0)) {
> + break;
> + }
What if there is no uninitialized entry till the end of array?
Out-of-bound access since we do not have a loop exit condition?
> +
> + if (mdst[i].size != mddt[i].size) {
> + qemu_log_mask(LOG_TRACE,
> + "Warning: Invalid entry, size mismatch in MDST & MDDT\n");
> + continue;
> + }
> +
> + if (mdst[i].data_region != mddt[i].data_region) {
> + qemu_log_mask(LOG_TRACE,
> + "Warning: Invalid entry, region mismatch in MDST & MDDT\n");
> + continue;
> + }
> +
> + mdrt[i].src_addr = mdst[i].addr;
> + mdrt[i].dest_addr = mddt[i].addr;
> + mdrt[i].size = mdst[i].size;
> + mdrt[i].data_region = mdst[i].data_region;
> +
> + source_addr = cpu_to_be64(mdst[i].addr) & ~HRMOR_BIT;
> + dest_addr = cpu_to_be64(mddt[i].addr) & ~HRMOR_BIT;
> + bytes_to_copy = cpu_to_be32(mddt[i].size);
> +
> + /* XXX: Am i assuming we are in big endian mode ? */
If the patches are assuming to work only with BE, it should gracefully
handle the LE case.
Thanks
Harsh
> + copy_buffer = malloc(bytes_to_copy);
> + cpu_physical_memory_read(source_addr, copy_buffer, bytes_to_copy);
> + cpu_physical_memory_write(dest_addr, copy_buffer, bytes_to_copy);
> + free(copy_buffer);
> + }
> +
> + cpu_physical_memory_write(MDRT_TABLE_BASE, mdrt, MDRT_TABLE_SIZE);
> +}
> +
> static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
> uint64_t val, unsigned size)
> {
> @@ -250,6 +304,9 @@ static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
> */
> pause_all_vcpus();
>
> + /* Preserve the memory locations registered for MPIPL */
> + pnv_mpipl_preserve_mem();
> +
> /*
> * TODO: Pass `mpipl` node in device tree to signify next
> * boot is an MPIPL boot
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 6/7] hw/ppc: [WIP] Add Processor Dump Area offsets in Pnv SBE
2025-02-17 7:19 ` [PATCH 6/7] hw/ppc: [WIP] Add Processor Dump Area offsets in Pnv SBE Aditya Gupta
@ 2025-03-11 5:23 ` Harsh Prateek Bora
2025-03-13 18:56 ` Aditya Gupta
0 siblings, 1 reply; 24+ messages in thread
From: Harsh Prateek Bora @ 2025-03-11 5:23 UTC (permalink / raw)
To: Aditya Gupta, qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
On 2/17/25 12:49, Aditya Gupta wrote:
> Add offsets for the processor state captured during MPIPL dump.
>
> This is incomplete. And might be implemented in future if the effort to
> implement MPIPL is resumed again.
Please use RFC prefix in next iteration of patch series until the
patches are requested to be merged.
>
> Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
> ---
> hw/ppc/pnv_sbe.c | 27 +++++++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
>
> diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
> index ee905df4e0a6..3b50667226b5 100644
> --- a/hw/ppc/pnv_sbe.c
> +++ b/hw/ppc/pnv_sbe.c
> @@ -197,6 +197,25 @@ struct mdrt_table {
> __be64 padding; /* unused */
> } __packed;
>
> +/*
> + * Processor Dump Area
> + *
> + * This contains the information needed for having processor
> + * state captured during a platform dump.
> + */
> +struct proc_dump_area {
> + __be32 thread_size; /* Size of each thread register entry */
> +#define PROC_DUMP_AREA_FORMAT_P9 0x1 /* P9 format */
> + uint8_t version;
> + uint8_t reserved[11];
> + __be64 alloc_addr; /* Destination memory to place register data */
> + __be32 reserved2;
> + __be32 alloc_size; /* Allocated size */
> + __be64 dest_addr; /* Destination address */
> + __be32 reserved3;
> + __be32 act_size; /* Actual data size */
> +} __packed;
> +
Above should go into pnv_sbe.h and introduce when actually get used.
> static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
> {
> val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW do? */
> @@ -281,6 +300,11 @@ static void pnv_mpipl_preserve_mem(void)
> cpu_physical_memory_write(MDRT_TABLE_BASE, mdrt, MDRT_TABLE_SIZE);
> }
>
> +static void pnv_mpipl_save_proc_regs(void)
> +{
> + /* TODO: modify PROC_DUMP_AREA_BASE */
> +}
> +
> static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
> uint64_t val, unsigned size)
> {
> @@ -307,6 +331,9 @@ static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
> /* Preserve the memory locations registered for MPIPL */
> pnv_mpipl_preserve_mem();
>
> + /* Save processor state */
> + pnv_mpipl_save_proc_regs();
Introduce when actually get implemented, otherwise doesnt need a
separate patch for this stub.
> +
> /*
> * TODO: Pass `mpipl` node in device tree to signify next
> * boot is an MPIPL boot
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 7/7] hw/ppc: Implement MPIPL in PowerNV
2025-02-17 7:19 ` [PATCH 7/7] hw/ppc: Implement MPIPL in PowerNV Aditya Gupta
@ 2025-03-11 5:41 ` Harsh Prateek Bora
2025-03-13 19:00 ` Aditya Gupta
0 siblings, 1 reply; 24+ messages in thread
From: Harsh Prateek Bora @ 2025-03-11 5:41 UTC (permalink / raw)
To: Aditya Gupta, qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
On 2/17/25 12:49, Aditya Gupta wrote:
> Linux expect a "ibm,opal/dump" node to know whether MPIPL (aka fadump)
> is supported on the hardware.
>
> Export the "ibm,opal/dump" node in QEMU's device tree for Linux to know
> that PowerNV supports MPIPL.
>
> With the commit, kernel boots thinking fadump is supported, and reserves
> memory regions for fadump if "fadump=on" is passed in kernel cmdline:
>
> Linux/PowerPC load: init=/bin/sh debug fadump=on
> Finalizing device tree... flat tree at 0x20ebaca0
> [ 1.005765851,5] DUMP: Payload sent metadata tag : 0x800002a8
> [ 1.005980914,5] DUMP: Boot mem size : 0x40000000
> [ 0.000000] opal fadump: Kernel metadata addr: 800002a8
> [ 0.000000] fadump: Reserved 1024MB of memory at 0x00000040000000 (System RAM: 20480MB)
> [ 0.000000] fadump: Initialized 0x40000000 bytes cma area at 1024MB from 0x400102a8 bytes of memory reserved for firmware-assisted dump
>
> Also, OPAL and Linux expect the "mpipl-boot" device tree node on a MPIPL
> boot. Hence add "mpipl-boot" property in device tree on an MPIPL boot.
>
> Hence after crash, Linux knows when it's a MPIPL/fadump boot:
>
> [ 0.000000] opal fadump: Firmware-assisted dump is active.
> [ 0.000000] fadump: Firmware-assisted dump is active.
> [ 0.000000] fadump: Reserving 23552MB of memory at 0x00000040000000 for preserving crash data
>
> Do note that fadump boot in PowerNV seems to require more memory,
> trying with 1GB causes this error by kernel:
>
> [ 0.000000] fadump: Failed to find memory chunk for reservation!
>
> And even with anything from 2GB - 19GB, the kernel fails to boot due to
> some memory issues.
>
> Trying with >20GB memory is recommended for now
>
> Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
> ---
> hw/ppc/pnv.c | 49 ++++++++++++++++++++++++++++++++++++++++
> hw/ppc/pnv_sbe.c | 18 +++++++++++----
> include/hw/ppc/pnv_sbe.h | 4 ++++
> 3 files changed, 67 insertions(+), 4 deletions(-)
>
> diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
> index 11fd477b71be..39ed3f873e9a 100644
> --- a/hw/ppc/pnv.c
> +++ b/hw/ppc/pnv.c
> @@ -51,6 +51,7 @@
> #include "hw/ppc/pnv_chip.h"
> #include "hw/ppc/pnv_xscom.h"
> #include "hw/ppc/pnv_pnor.h"
> +#include "hw/ppc/pnv_sbe.h"
>
> #include "hw/isa/isa.h"
> #include "hw/char/serial-isa.h"
> @@ -697,6 +698,26 @@ static void *pnv_dt_create(MachineState *machine)
> pmc->dt_power_mgt(pnv, fdt);
> }
>
> + /* Add "dump" node so kernel knows MPIPL (aka fadump) is supported */
> + off = fdt_add_subnode(fdt, 0, "ibm,opal");
> + if (off == -FDT_ERR_EXISTS) {
> + off = fdt_path_offset(fdt, "/ibm,opal");
> + }
> +
> + _FDT(off);
> + off = fdt_add_subnode(fdt, off, "dump");
> + _FDT(off);
> + _FDT((fdt_setprop_string(fdt, off, "compatible", "ibm,opal-dump")));
> +
> + /* Add kernel and initrd as fw-load-area */
> + uint64_t fw_load_area[4] = {
> + cpu_to_be64(KERNEL_LOAD_ADDR), cpu_to_be64(KERNEL_MAX_SIZE),
> + cpu_to_be64(INITRD_LOAD_ADDR), cpu_to_be64(INITRD_MAX_SIZE)
> + };
> +
> + _FDT((fdt_setprop(fdt, off, "fw-load-area",
> + fw_load_area, sizeof(fw_load_area))));
> +
Above could be wrapped in pnv_dt_mpipl(fdt) and called here?
> return fdt;
> }
>
> @@ -714,6 +735,7 @@ static void pnv_reset(MachineState *machine, ResetType type)
> PnvMachineState *pnv = PNV_MACHINE(machine);
> IPMIBmc *bmc;
> void *fdt;
> + int node_offset;
>
> qemu_devices_reset(type);
>
> @@ -744,6 +766,33 @@ static void pnv_reset(MachineState *machine, ResetType type)
> _FDT((fdt_pack(fdt)));
> }
>
> + /*
> + * If it's a MPIPL boot, add the "mpipl-boot" property, and reset the
> + * boolean for MPIPL boot for next boot
> + */
> + if (pnv_sbe_is_mpipl_boot()) {
> + void *fdt_copy = g_malloc0(FDT_MAX_SIZE);
Where is this getting free'ed ?
> +
> + /* Create a writable copy of the fdt */
> + _FDT((fdt_open_into(fdt, fdt_copy, FDT_MAX_SIZE)));
> +
> + node_offset = fdt_path_offset(fdt_copy, "/ibm,opal/dump");
> + _FDT((fdt_appendprop_u64(fdt_copy, node_offset, "mpipl-boot", 1)));
> +
> + /* Update the fdt, and free the original fdt */
> + if (fdt != machine->fdt) {
> + /*
> + * Only free the fdt if it's not machine->fdt, to prevent
> + * double free, since we already free machine->fdt later
> + */
> + g_free(fdt);
> + }
> + fdt = fdt_copy;
> +
> + /* This boot is an MPIPL, reset the boolean for next boot */
> + pnv_sbe_reset_is_next_boot_mpipl();
> + }
> +
Could above be wrapped in pnv_reset_mpipl(fdt) and called if
pnv_sbe_is_mpipl_boot is true?
> qemu_fdt_dumpdtb(fdt, fdt_totalsize(fdt));
> cpu_physical_memory_write(PNV_FDT_ADDR, fdt, fdt_totalsize(fdt));
>
> diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
> index 3b50667226b5..671dc81c9501 100644
> --- a/hw/ppc/pnv_sbe.c
> +++ b/hw/ppc/pnv_sbe.c
> @@ -216,6 +216,18 @@ struct proc_dump_area {
> __be32 act_size; /* Actual data size */
> } __packed;
>
> +static bool is_next_boot_mpipl;
> +
> +bool pnv_sbe_is_mpipl_boot(void)
> +{
> + return is_next_boot_mpipl;
> +}
> +
> +void pnv_sbe_reset_is_next_boot_mpipl(void)
> +{
> + is_next_boot_mpipl = false;
> +}
> +
> static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
> {
> val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW do? */
> @@ -334,10 +346,8 @@ static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
> /* Save processor state */
> pnv_mpipl_save_proc_regs();
>
> - /*
> - * TODO: Pass `mpipl` node in device tree to signify next
> - * boot is an MPIPL boot
> - */
> + /* Mark next boot as Memory-preserving boot */
> + is_next_boot_mpipl = true;
>
> /* Then do a guest reset */
> /*
> diff --git a/include/hw/ppc/pnv_sbe.h b/include/hw/ppc/pnv_sbe.h
> index f6cbcf990ed9..94bbdc7b6414 100644
> --- a/include/hw/ppc/pnv_sbe.h
> +++ b/include/hw/ppc/pnv_sbe.h
> @@ -56,4 +56,8 @@ struct PnvSBEClass {
> /* Helper to access stashed SKIBOOT_BASE */
> bool pnv_sbe_mpipl_skiboot_base(void);
>
> +/* Helpers to know if next boot is MPIPL boot */
> +bool pnv_sbe_is_mpipl_boot(void);
> +void pnv_sbe_reset_is_next_boot_mpipl(void);
Usually we have a set along with reset and helpers for modifying struct
members. Above helpers for a gloabl var seems a bit un-necessary. I
guess we want to keep such global vars inside relevant struct.
Thanks
Harsh
> +
> #endif /* PPC_PNV_SBE_H */
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/7] hw/ppc: Log S0/S1 Interrupt triggers by OPAL
2025-03-11 4:38 ` Harsh Prateek Bora
@ 2025-03-13 18:43 ` Aditya Gupta
0 siblings, 0 replies; 24+ messages in thread
From: Aditya Gupta @ 2025-03-13 18:43 UTC (permalink / raw)
To: Harsh Prateek Bora, qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
Hi Harsh,
Thank you for the reviews.
On 11/03/25 10:08, Harsh Prateek Bora wrote:
>
>
> On 2/17/25 12:49, Aditya Gupta wrote:
>> During MPIPL (aka fadump), OPAL triggers the S0 SBE interrupt to trigger
>> MPIPL.
>>
>> Currently QEMU treats it as "Unimplemented", handle the interrupts by
>> just logging that the interrupt happened.
>>
>> Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
>> ---
>> hw/ppc/pnv_sbe.c | 13 +++++++++++++
>> 1 file changed, 13 insertions(+)
>>
>> diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
>> index 74cee4eea7ad..62c94a04a2df 100644
>> --- a/hw/ppc/pnv_sbe.c
>> +++ b/hw/ppc/pnv_sbe.c
>> @@ -109,6 +109,19 @@ static void pnv_sbe_power9_xscom_ctrl_write(void
>> *opaque, hwaddr addr,
>> trace_pnv_sbe_xscom_ctrl_write(addr, val);
>> switch (offset) {
>> + case SBE_CONTROL_REG_RW:
>> + switch (val) {
>> + case SBE_CONTROL_REG_S0:
>> + qemu_log_mask(LOG_UNIMP, "SBE: S0 Interrupt triggered\n");
>> + break;
>> + case SBE_CONTROL_REG_S1:
>> + qemu_log_mask(LOG_UNIMP, "SBE: S1 Interrupt triggered\n");
>> + break;
>> + default:
>> + qemu_log_mask(LOG_UNIMP, "SBE Unimplemented register: Ox%"
>
> This log could be made specific to SBE unimplemented register "bits",
> otherwise fall back to outer switch-default case.
Agreed, will fix it in v2.
Thanks,
- Aditya G
>
>
>> + HWADDR_PRIx "\n", addr >> 3);
>> + }
>> + break;
>> default:
>> qemu_log_mask(LOG_UNIMP, "SBE Unimplemented register: Ox%"
>> HWADDR_PRIx "\n", addr >> 3);
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/7] hw/ppc: Implement S0 SBE interrupt as cpu_pause then host reset
2025-03-11 4:45 ` Harsh Prateek Bora
@ 2025-03-13 18:45 ` Aditya Gupta
0 siblings, 0 replies; 24+ messages in thread
From: Aditya Gupta @ 2025-03-13 18:45 UTC (permalink / raw)
To: Harsh Prateek Bora, qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
On 11/03/25 10:15, Harsh Prateek Bora wrote:
>
>
> On 2/17/25 12:49, Aditya Gupta wrote:
>> <...snip...>
>>
>> diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
>> index 62c94a04a2df..a6bf13650f2d 100644
>> --- a/hw/ppc/pnv_sbe.c
>> +++ b/hw/ppc/pnv_sbe.c
>> @@ -21,6 +21,8 @@
>> #include "qapi/error.h"
>> #include "qemu/log.h"
>> #include "qemu/module.h"
>> +#include "system/cpus.h"
>> +#include "system/runstate.h"
>> #include "hw/irq.h"
>> #include "hw/qdev-properties.h"
>> #include "hw/ppc/pnv.h"
>> @@ -80,6 +82,15 @@
>> #define SBE_CONTROL_REG_S0 PPC_BIT(14)
>> #define SBE_CONTROL_REG_S1 PPC_BIT(15)
>> +static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
>> +{
>> + val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW
>> do? */
>> + sbe->host_doorbell = val;
>> +
>> + trace_pnv_sbe_reg_set_host_doorbell(val);
>> + qemu_set_irq(sbe->psi_irq, !!val);
>> +}
>> +
>> struct sbe_msg {
>> uint64_t reg[4];
>> };
>> @@ -104,6 +115,7 @@ static uint64_t
>> pnv_sbe_power9_xscom_ctrl_read(void *opaque, hwaddr addr,
>> static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
>> uint64_t val, unsigned size)
>> {
>> + PnvSBE *sbe = opaque;
>> uint32_t offset = addr >> 3;
>> trace_pnv_sbe_xscom_ctrl_write(addr, val);
>> @@ -113,6 +125,35 @@ static void pnv_sbe_power9_xscom_ctrl_write(void
>> *opaque, hwaddr addr,
>> switch (val) {
>> case SBE_CONTROL_REG_S0:
>> qemu_log_mask(LOG_UNIMP, "SBE: S0 Interrupt triggered\n");
>> +
>> + pnv_sbe_set_host_doorbell(sbe, sbe->host_doorbell |
>> SBE_HOST_RESPONSE_MASK);
>> +
>> + /*
>> + * Looks like, SBE stops clocks for all cores in S0.
>> + * See 'stopClocksS0' in SBE source code.
>> + * Nearest equivalent in QEMU seems to be 'pause_all_vcpus'
>> + */
>> + pause_all_vcpus();
>> +
>> + /*
>> + * TODO: Pass `mpipl` node in device tree to signify next
>> + * boot is an MPIPL boot
>> + */
>> +
>> + /* Then do a guest reset */
>> + /*
>> + * Requirement:
>> + * This guest reset should not clear the memory (which is
>> + * the case when this is merged)
>
> This comment may need a rephrase. Also, if we are keeping an
> assumption that SHUTDOWN_CAUSE_GUEST_RESET should not clear the
> memory, it may be better to put a comment about this wherever it is
> handled as well.
Sure, will make the comment more direct.
>
>> + */
>> + qemu_system_reset_request(SHUTDOWN_CAUSE_GUEST_RESET);
>> +
>> + /*
>> + * XXX: Does SBE really do system reset or only stop
>> + * clocks ? OPAL seems to think that control will not come
>> + * to it after it has triggered S0 interrupt.
>> + */
>> +
>> break;
>> case SBE_CONTROL_REG_S1:
>> qemu_log_mask(LOG_UNIMP, "SBE: S1 Interrupt triggered\n");
>> @@ -138,15 +179,6 @@ static const MemoryRegionOps
>> pnv_sbe_power9_xscom_ctrl_ops = {
>> .endianness = DEVICE_BIG_ENDIAN,
>> };
>> -static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
>> -{
>> - val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW
>> do? */
>> - sbe->host_doorbell = val;
>> -
>> - trace_pnv_sbe_reg_set_host_doorbell(val);
>> - qemu_set_irq(sbe->psi_irq, !!val);
>> -}
>> -
>
> Code movement like above could be a separate patch, not necessary though.
> Patch 1 can be squashed with this one.
Got it, will move it to a separate patch, and squash patch 1 and the
other parts of this patch.
Thanks,
- Aditya G
>
> Thanks
> Harsh
>> /* SBE Target Type */
>> #define SBE_TARGET_TYPE_PROC 0x00
>> #define SBE_TARGET_TYPE_EX 0x01
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/7] hw/ppc: Handle stash command in PowerNV SBE
2025-03-11 4:50 ` Harsh Prateek Bora
@ 2025-03-13 18:46 ` Aditya Gupta
0 siblings, 0 replies; 24+ messages in thread
From: Aditya Gupta @ 2025-03-13 18:46 UTC (permalink / raw)
To: Harsh Prateek Bora, qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
On 11/03/25 10:20, Harsh Prateek Bora wrote:
>
>> <...snip...>
>>
>> --- a/hw/ppc/pnv_sbe.c
>> +++ b/hw/ppc/pnv_sbe.c
>> @@ -82,6 +82,8 @@
>> #define SBE_CONTROL_REG_S0 PPC_BIT(14)
>> #define SBE_CONTROL_REG_S1 PPC_BIT(15)
>> +static uint64_t mpipl_skiboot_base = 0x30000000 /*default
>> SKIBOOT_BASE*/;
>> +
>> static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
>> {
>> val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW
>> do? */
>> @@ -281,6 +283,29 @@ static void do_sbe_msg(PnvSBE *sbe)
>> timer_del(sbe->timer);
>> }
>> break;
>> + case SBE_CMD_STASH_MPIPL_CONFIG:
>> + /* key = sbe->mbox[1] */
>> + switch (sbe->mbox[1]) {
>> + case SBE_STASH_KEY_SKIBOOT_BASE:
>> + mpipl_skiboot_base = sbe->mbox[2];
>> + qemu_log_mask(LOG_UNIMP,
>> + "Stashing skiboot base: 0x%lx\n", mpipl_skiboot_base);
>> +
>> + /*
>> + * Set the response register.
>> + *
>> + * Currently setting the same sequence number in
>> + * response as we got in the request.
>> + */
>> + sbe->mbox[4] = sbe->mbox[0]; /* sequence number */
>> + pnv_sbe_set_host_doorbell(sbe,
>> + sbe->host_doorbell | SBE_HOST_RESPONSE_WAITING);
>> +
>> + break;
>> + default:
>> + qemu_log_mask(LOG_UNIMP, "SBE Unimplemented command:
>> 0x%x\n", cmd);
>
> Unimplemented SBE_CMD_STASH_MPIPL_CONFIG key ?
Got it. Thanks for the reword suggestion, will do it.
Thanks,
- Aditya G
>
>> + }
>> + break;
>> default:
>> qemu_log_mask(LOG_UNIMP, "SBE Unimplemented command:
>> 0x%x\n", cmd);
>> }
>> diff --git a/include/hw/ppc/pnv_sbe.h b/include/hw/ppc/pnv_sbe.h
>> index b6b378ad14c7..f6cbcf990ed9 100644
>> --- a/include/hw/ppc/pnv_sbe.h
>> +++ b/include/hw/ppc/pnv_sbe.h
>> @@ -53,4 +53,7 @@ struct PnvSBEClass {
>> const MemoryRegionOps *xscom_mbox_ops;
>> };
>> +/* Helper to access stashed SKIBOOT_BASE */
>> +bool pnv_sbe_mpipl_skiboot_base(void);
>> +
>> #endif /* PPC_PNV_SBE_H */
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/7] hw/ppc: Add MDST/MDDT/MDRT table structures and offsets
2025-03-11 5:11 ` Harsh Prateek Bora
@ 2025-03-13 18:50 ` Aditya Gupta
0 siblings, 0 replies; 24+ messages in thread
From: Aditya Gupta @ 2025-03-13 18:50 UTC (permalink / raw)
To: Harsh Prateek Bora, qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
On 11/03/25 10:41, Harsh Prateek Bora wrote:
>
>
> On 2/17/25 12:49, Aditya Gupta wrote:
>> Add the MDST, MDDT, MDRT tables offsets and structures as per current
>> skiboot upstream:
>>
>> commit bc7b85db1e7e ("opal-ci: Remove centos7")
>>
>> These structures will be later populated when preserving memory regions
>> for MPIPL
>>
>> Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
>> ---
>> hw/ppc/pnv_sbe.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 113 insertions(+)
>>
>> diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
>> index 79818177fc36..361a3854307d 100644
>> --- a/hw/ppc/pnv_sbe.c
>> +++ b/hw/ppc/pnv_sbe.c
>
> These should have been added in include/hw/ppc/pnv_sbe.h
>
Got it, will move the #defines to a header.
>> @@ -84,6 +84,119 @@
>> static uint64_t mpipl_skiboot_base = 0x30000000 /*default
>> SKIBOOT_BASE*/;
>> +/* Following offsets are copied from Skiboot source code */
>> +/* Use 768 bytes for SPIRAH */
>> +#define SPIRAH_OFF 0x00010000
>> +#define SPIRAH_SIZE 0x300
>> +
>> +/* Use 256 bytes for processor dump area */
>> +#define PROC_DUMP_AREA_OFF (SPIRAH_OFF + SPIRAH_SIZE)
>> +#define PROC_DUMP_AREA_SIZE 0x100
>> +
>> +#define PROCIN_OFF (PROC_DUMP_AREA_OFF + PROC_DUMP_AREA_SIZE)
>> +#define PROCIN_SIZE 0x800
>> +
>> +/* Offsets of MDST and MDDT tables from skiboot base */
>> +#define MDST_TABLE_OFF (PROCIN_OFF + PROCIN_SIZE)
>> +#define MDST_TABLE_SIZE 0x400
>> +
>> +#define MDDT_TABLE_OFF (MDST_TABLE_OFF + MDST_TABLE_SIZE)
>> +#define MDDT_TABLE_SIZE 0x400
>> +
>> +#define CPU_CTL_OFF (MDDT_TABLE_OFF + MDDT_TABLE_SIZE)
>> +#define CPU_CTL_SIZE 0x2000
>> +
>> +/* MPIPL reserved regions (offset by skiboot_base to access) */
>> +#define MDST_TABLE_BASE (mpipl_skiboot_base + MDST_TABLE_OFF)
>> +#define MDDT_TABLE_BASE (mpipl_skiboot_base + MDDT_TABLE_OFF)
>> +#define PROC_DUMP_AREA_BASE (mpipl_skiboot_base + PROC_DUMP_AREA_OFF)
>> +
>> +#define __packed __attribute__((packed))
>> +
>> +/* Metadata to capture before triggering MPIPL */
>> +struct mpipl_metadata {
>
> CamelCase for structs? Applies for other structs below.
Sure, will follow CamelCase. Had just copied the structs as it is.
>
>> + /* Crashing PIR is required to create OPAL dump */
>> + uint32_t crashing_pir;
>> + /* Kernel expects OPAL to presrve tag and pass it back via OPAL
>> API */
>> + uint64_t kernel_tag;
>> + /* Post MPIPL kernel boot memory size */
>> + uint64_t boot_mem_size;
>> +} __packed;
>> +
>> +/* Structure version */
>> +#define OPAL_MPIPL_VERSION 0x01
>> +
>> +/* Preserved memory details */
>> +struct opal_mpipl_region {
>> + __be64 src;
>> + __be64 dest;
>> + __be64 size;
>> +};
>> +
>> +struct opal_mpipl_fadump {
>> + uint8_t version;
>> + uint8_t reserved[7];
>> + __be32 crashing_pir; /* OPAL crashing CPU PIR */
>> + __be32 cpu_data_version;
>> + __be32 cpu_data_size;
>> + __be32 region_cnt;
>> + struct opal_mpipl_region *region;
>> +};
>> +
>> +/*
>> + * This is our dump result table after MPIPL. Hostboot will write to
>> this
>> + * memory after moving memory content from source to destination
>> memory.
>> + */
>> +#define MDRT_TABLE_BASE (mpipl_skiboot_base + 0x01c00000)
>> +#define MDRT_TABLE_SIZE 0x00008000
>> +
>> +/*
>> + * This is our dump metadata area. We will use this memory to save
>> metadata
>> + * (like crashing CPU details, payload tags) before triggering MPIPL.
>> + */
>> +#define DUMP_METADATA_AREA_BASE (mpipl_skiboot_base + 0x01c08000)
>> +#define DUMP_METADATA_AREA_SIZE 0x8000
>> +
>> +/*
>> + * Memory Dump Source Table
>> + *
>> + * Format of this table is same as Memory Dump Source Table (MDST)
>> + * defined in HDAT spec.
>> + */
>> +struct mdst_table {
>> + __be64 addr;
>> + uint8_t data_region; /* DUMP_REGION_* */
>> + uint8_t dump_type; /* DUMP_TYPE_* */
>> + __be16 reserved;
>> + __be32 size;
>> +} __packed;
>> +
>> +/* Memory dump destination table (MDDT) */
>> +struct mddt_table {
>> + __be64 addr;
>> + uint8_t data_region;
>> + uint8_t dump_type;
>> + __be16 reserved;
>> + __be32 size;
>> +} __packed;
>
> If both mdst_table and mddt_table are supposed to be same, we could have
> just one mdxt_table and variable instances could be named mdst/mddt.
Agreed. But mdxt might be confusing, as 'mdrt' has a different structure.
Maybe will do a 'typedef MdstTable MddtTable' ?
Thanks,
- Aditya G
>
> Thanks
> Harsh
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 5/7] hw/ppc: Preserve Memory Regions as per MDST/MDDT tables
2025-03-11 5:18 ` Harsh Prateek Bora
@ 2025-03-13 18:54 ` Aditya Gupta
0 siblings, 0 replies; 24+ messages in thread
From: Aditya Gupta @ 2025-03-13 18:54 UTC (permalink / raw)
To: Harsh Prateek Bora, qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
On 11/03/25 10:48, Harsh Prateek Bora wrote:
>
> On 2/17/25 12:49, Aditya Gupta wrote:
>> When MPIPL is used, OPAL/Linux registers memory regions to be preserved
>> on a Memory-Preserving boot ('crashkernel boot').
>>
>> The regions are added to two tables: MDST and MDDT (source and
>> destination tables)
>>
>> The MDST contains the start address of the region, and size of region
>>
>> The MDDT contains the destination address where the region should be
>> copied (and size of region which will be same as in MDST entry)
>>
>> Then after a crash, when hostboot (pnv_sbe.c in case of QEMU)
>> preserves the memory region, it adds the details of preserved regions to
>> MDRT (results table)
>>
>> Copy memory regions mentioned in MDST to addresses mentioned in MDDT.
>> And accordingly update the copied region details in MDRT table.
>>
>> Note: If we did not preserve the regions, and MDRT is empty then OPAL
>> simply logs "OPAL dump is not available", while kernel will assume that
>> firmware would have preserved the regions, and export /proc/vmcore, but
>> the vmcore won't have most basic kernel structures hence crash will be
>> unable to analyse the vmcore
>>
>> Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
>> ---
>> hw/ppc/pnv_sbe.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 57 insertions(+)
>>
>> diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
>> index 361a3854307d..ee905df4e0a6 100644
>> --- a/hw/ppc/pnv_sbe.c
>> +++ b/hw/ppc/pnv_sbe.c
>> @@ -227,6 +227,60 @@ static uint64_t
>> pnv_sbe_power9_xscom_ctrl_read(void *opaque, hwaddr addr,
>> return val;
>> }
>> +static void pnv_mpipl_preserve_mem(void)
>> +{
>> + /* Get access to metadata */
>> + struct mpipl_metadata *metadata = malloc(DUMP_METADATA_AREA_SIZE);
>> + struct mdst_table *mdst = malloc(MDST_TABLE_SIZE);
>> + struct mddt_table *mddt = malloc(MDDT_TABLE_SIZE);
>> + struct mdrt_table *mdrt = malloc(MDRT_TABLE_SIZE);
>
> Where are these getting free()ed? Mem leak ?
Yes. Thanks for catching this, it's a memory leak, will free it in v2.
>
>> + __be64 source_addr, dest_addr, bytes_to_copy;
>> + uint8_t *copy_buffer;
>> +
>> + cpu_physical_memory_read(DUMP_METADATA_AREA_BASE, metadata,
>> DUMP_METADATA_AREA_SIZE);
>> + cpu_physical_memory_read(MDST_TABLE_BASE, mdst, MDST_TABLE_SIZE);
>> + cpu_physical_memory_read(MDDT_TABLE_BASE, mddt, MDDT_TABLE_SIZE);
>> +
>> + /* HRMOR_BIT copied from skiboot */
>> + #define HRMOR_BIT (1ul << 63)
> Could be moved to pnv_sbe.h file.
Okay.
>
>> +
>> + for (int i = 0;; ++i) {
>> + /* NOTE: Assuming uninitialised will be all zeroes */
>> + if ((mdst[i].addr == 0) && (mdst[i].size == 0)) {
>> + break;
>> + }
>
> What if there is no uninitialized entry till the end of array?
> Out-of-bound access since we do not have a loop exit condition?
My bad, didn't handle that. Will limit the loop to at max MDST_MAX_SIZE
/ MDST_ENTRY_SIZE
>
>> +
>> + if (mdst[i].size != mddt[i].size) {
>> + qemu_log_mask(LOG_TRACE,
>> + "Warning: Invalid entry, size mismatch in MDST &
>> MDDT\n");
>> + continue;
>> + }
>> +
>> + if (mdst[i].data_region != mddt[i].data_region) {
>> + qemu_log_mask(LOG_TRACE,
>> + "Warning: Invalid entry, region mismatch in MDST
>> & MDDT\n");
>> + continue;
>> + }
>> +
>> + mdrt[i].src_addr = mdst[i].addr;
>> + mdrt[i].dest_addr = mddt[i].addr;
>> + mdrt[i].size = mdst[i].size;
>> + mdrt[i].data_region = mdst[i].data_region;
>> +
>> + source_addr = cpu_to_be64(mdst[i].addr) & ~HRMOR_BIT;
>> + dest_addr = cpu_to_be64(mddt[i].addr) & ~HRMOR_BIT;
>> + bytes_to_copy = cpu_to_be32(mddt[i].size);
>> +
>> + /* XXX: Am i assuming we are in big endian mode ? */
> If the patches are assuming to work only with BE, it should gracefully
> handle the LE case.
Agreed, I have to fix it, so it works in both cases, will handle with
enough cpu_to_be* for values coming from the firmware/kernel.
Thanks,
- Aditya G
>
> Thanks
> Harsh
>
>> + copy_buffer = malloc(bytes_to_copy);
>> + cpu_physical_memory_read(source_addr, copy_buffer,
>> bytes_to_copy);
>> + cpu_physical_memory_write(dest_addr, copy_buffer,
>> bytes_to_copy);
>> + free(copy_buffer);
>> + }
>> +
>> + cpu_physical_memory_write(MDRT_TABLE_BASE, mdrt, MDRT_TABLE_SIZE);
>> +}
>> +
>> static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
>> uint64_t val, unsigned size)
>> {
>> @@ -250,6 +304,9 @@ static void pnv_sbe_power9_xscom_ctrl_write(void
>> *opaque, hwaddr addr,
>> */
>> pause_all_vcpus();
>> + /* Preserve the memory locations registered for MPIPL */
>> + pnv_mpipl_preserve_mem();
>> +
>> /*
>> * TODO: Pass `mpipl` node in device tree to signify next
>> * boot is an MPIPL boot
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 6/7] hw/ppc: [WIP] Add Processor Dump Area offsets in Pnv SBE
2025-03-11 5:23 ` Harsh Prateek Bora
@ 2025-03-13 18:56 ` Aditya Gupta
0 siblings, 0 replies; 24+ messages in thread
From: Aditya Gupta @ 2025-03-13 18:56 UTC (permalink / raw)
To: Harsh Prateek Bora, qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
On 11/03/25 10:53, Harsh Prateek Bora wrote:
>
>
> On 2/17/25 12:49, Aditya Gupta wrote:
>> Add offsets for the processor state captured during MPIPL dump.
>>
>> This is incomplete. And might be implemented in future if the effort to
>> implement MPIPL is resumed again.
>
> Please use RFC prefix in next iteration of patch series until the
> patches are requested to be merged.
Sorry I missed it, and just sent it as a normal patch series.
Will take care from next time.
>
>>
>> Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
>> ---
>> hw/ppc/pnv_sbe.c | 27 +++++++++++++++++++++++++++
>> 1 file changed, 27 insertions(+)
>>
>> diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
>> index ee905df4e0a6..3b50667226b5 100644
>> --- a/hw/ppc/pnv_sbe.c
>> +++ b/hw/ppc/pnv_sbe.c
>> @@ -197,6 +197,25 @@ struct mdrt_table {
>> __be64 padding; /* unused */
>> } __packed;
>> +/*
>> + * Processor Dump Area
>> + *
>> + * This contains the information needed for having processor
>> + * state captured during a platform dump.
>> + */
>> +struct proc_dump_area {
>> + __be32 thread_size; /* Size of each thread register entry */
>> +#define PROC_DUMP_AREA_FORMAT_P9 0x1 /* P9 format */
>> + uint8_t version;
>> + uint8_t reserved[11];
>> + __be64 alloc_addr; /* Destination memory to place register
>> data */
>> + __be32 reserved2;
>> + __be32 alloc_size; /* Allocated size */
>> + __be64 dest_addr; /* Destination address */
>> + __be32 reserved3;
>> + __be32 act_size; /* Actual data size */
>> +} __packed;
>> +
>
> Above should go into pnv_sbe.h and introduce when actually get used.
Sure, will take care of introducing only when used.
>
>> static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
>> {
>> val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW
>> do? */
>> @@ -281,6 +300,11 @@ static void pnv_mpipl_preserve_mem(void)
>> cpu_physical_memory_write(MDRT_TABLE_BASE, mdrt, MDRT_TABLE_SIZE);
>> }
>> +static void pnv_mpipl_save_proc_regs(void)
>> +{
>> + /* TODO: modify PROC_DUMP_AREA_BASE */
>> +}
>> +
>> static void pnv_sbe_power9_xscom_ctrl_write(void *opaque, hwaddr addr,
>> uint64_t val, unsigned size)
>> {
>> @@ -307,6 +331,9 @@ static void pnv_sbe_power9_xscom_ctrl_write(void
>> *opaque, hwaddr addr,
>> /* Preserve the memory locations registered for MPIPL */
>> pnv_mpipl_preserve_mem();
>> + /* Save processor state */
>> + pnv_mpipl_save_proc_regs();
>
> Introduce when actually get implemented, otherwise doesnt need a
> separate patch for this stub.
Okay, I planned to implement it in this patch itself. Will implement in v2.
Thanks,
- Aditya G
>
>> +
>> /*
>> * TODO: Pass `mpipl` node in device tree to signify next
>> * boot is an MPIPL boot
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 7/7] hw/ppc: Implement MPIPL in PowerNV
2025-03-11 5:41 ` Harsh Prateek Bora
@ 2025-03-13 19:00 ` Aditya Gupta
0 siblings, 0 replies; 24+ messages in thread
From: Aditya Gupta @ 2025-03-13 19:00 UTC (permalink / raw)
To: Harsh Prateek Bora, qemu-devel
Cc: qemu-ppc, Nicholas Piggin, Frédéric Barrat,
Sourabh Jain, Mahesh J Salgaonkar, Hari Bathini
On 11/03/25 11:11, Harsh Prateek Bora wrote:
>
>
> On 2/17/25 12:49, Aditya Gupta wrote:
>> Linux expect a "ibm,opal/dump" node to know whether MPIPL (aka fadump)
>> is supported on the hardware.
>>
>> Export the "ibm,opal/dump" node in QEMU's device tree for Linux to know
>> that PowerNV supports MPIPL.
>>
>> With the commit, kernel boots thinking fadump is supported, and reserves
>> memory regions for fadump if "fadump=on" is passed in kernel cmdline:
>>
>> Linux/PowerPC load: init=/bin/sh debug fadump=on
>> Finalizing device tree... flat tree at 0x20ebaca0
>> [ 1.005765851,5] DUMP: Payload sent metadata tag : 0x800002a8
>> [ 1.005980914,5] DUMP: Boot mem size : 0x40000000
>> [ 0.000000] opal fadump: Kernel metadata addr: 800002a8
>> [ 0.000000] fadump: Reserved 1024MB of memory at
>> 0x00000040000000 (System RAM: 20480MB)
>> [ 0.000000] fadump: Initialized 0x40000000 bytes cma area at
>> 1024MB from 0x400102a8 bytes of memory reserved for firmware-assisted
>> dump
>>
>> Also, OPAL and Linux expect the "mpipl-boot" device tree node on a MPIPL
>> boot. Hence add "mpipl-boot" property in device tree on an MPIPL boot.
>>
>> Hence after crash, Linux knows when it's a MPIPL/fadump boot:
>>
>> [ 0.000000] opal fadump: Firmware-assisted dump is active.
>> [ 0.000000] fadump: Firmware-assisted dump is active.
>> [ 0.000000] fadump: Reserving 23552MB of memory at
>> 0x00000040000000 for preserving crash data
>>
>> Do note that fadump boot in PowerNV seems to require more memory,
>> trying with 1GB causes this error by kernel:
>>
>> [ 0.000000] fadump: Failed to find memory chunk for reservation!
>>
>> And even with anything from 2GB - 19GB, the kernel fails to boot due to
>> some memory issues.
>>
>> Trying with >20GB memory is recommended for now
>>
>> Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
>> ---
>> hw/ppc/pnv.c | 49 ++++++++++++++++++++++++++++++++++++++++
>> hw/ppc/pnv_sbe.c | 18 +++++++++++----
>> include/hw/ppc/pnv_sbe.h | 4 ++++
>> 3 files changed, 67 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
>> index 11fd477b71be..39ed3f873e9a 100644
>> --- a/hw/ppc/pnv.c
>> +++ b/hw/ppc/pnv.c
>> @@ -51,6 +51,7 @@
>> #include "hw/ppc/pnv_chip.h"
>> #include "hw/ppc/pnv_xscom.h"
>> #include "hw/ppc/pnv_pnor.h"
>> +#include "hw/ppc/pnv_sbe.h"
>> #include "hw/isa/isa.h"
>> #include "hw/char/serial-isa.h"
>> @@ -697,6 +698,26 @@ static void *pnv_dt_create(MachineState *machine)
>> pmc->dt_power_mgt(pnv, fdt);
>> }
>> + /* Add "dump" node so kernel knows MPIPL (aka fadump) is
>> supported */
>> + off = fdt_add_subnode(fdt, 0, "ibm,opal");
>> + if (off == -FDT_ERR_EXISTS) {
>> + off = fdt_path_offset(fdt, "/ibm,opal");
>> + }
>> +
>> + _FDT(off);
>> + off = fdt_add_subnode(fdt, off, "dump");
>> + _FDT(off);
>> + _FDT((fdt_setprop_string(fdt, off, "compatible",
>> "ibm,opal-dump")));
>> +
>> + /* Add kernel and initrd as fw-load-area */
>> + uint64_t fw_load_area[4] = {
>> + cpu_to_be64(KERNEL_LOAD_ADDR), cpu_to_be64(KERNEL_MAX_SIZE),
>> + cpu_to_be64(INITRD_LOAD_ADDR), cpu_to_be64(INITRD_MAX_SIZE)
>> + };
>> +
>> + _FDT((fdt_setprop(fdt, off, "fw-load-area",
>> + fw_load_area, sizeof(fw_load_area))));
>> +
>
> Above could be wrapped in pnv_dt_mpipl(fdt) and called here?
Sure, will create a helper in v2.
>
>> return fdt;
>> }
>> @@ -714,6 +735,7 @@ static void pnv_reset(MachineState *machine,
>> ResetType type)
>> PnvMachineState *pnv = PNV_MACHINE(machine);
>> IPMIBmc *bmc;
>> void *fdt;
>> + int node_offset;
>> qemu_devices_reset(type);
>> @@ -744,6 +766,33 @@ static void pnv_reset(MachineState *machine,
>> ResetType type)
>> _FDT((fdt_pack(fdt)));
>> }
>> + /*
>> + * If it's a MPIPL boot, add the "mpipl-boot" property, and
>> reset the
>> + * boolean for MPIPL boot for next boot
>> + */
>> + if (pnv_sbe_is_mpipl_boot()) {
>> + void *fdt_copy = g_malloc0(FDT_MAX_SIZE);
>
> Where is this getting free'ed ?
We don't free it currently, as we assign it to the fdt, which (from my
understanding) gets freed on a system reset if fdt needs to get modified.
Will see it.
>
>> +
>> + /* Create a writable copy of the fdt */
>> + _FDT((fdt_open_into(fdt, fdt_copy, FDT_MAX_SIZE)));
>> +
>> + node_offset = fdt_path_offset(fdt_copy, "/ibm,opal/dump");
>> + _FDT((fdt_appendprop_u64(fdt_copy, node_offset,
>> "mpipl-boot", 1)));
>> +
>> + /* Update the fdt, and free the original fdt */
>> + if (fdt != machine->fdt) {
>> + /*
>> + * Only free the fdt if it's not machine->fdt, to prevent
>> + * double free, since we already free machine->fdt later
>> + */
>> + g_free(fdt);
>> + }
>> + fdt = fdt_copy;
>> +
>> + /* This boot is an MPIPL, reset the boolean for next boot */
>> + pnv_sbe_reset_is_next_boot_mpipl();
>> + }
>> +
>
> Could above be wrapped in pnv_reset_mpipl(fdt) and called if
> pnv_sbe_is_mpipl_boot is true?
Agreed, will do it that way.
>
>> qemu_fdt_dumpdtb(fdt, fdt_totalsize(fdt));
>> cpu_physical_memory_write(PNV_FDT_ADDR, fdt, fdt_totalsize(fdt));
>> diff --git a/hw/ppc/pnv_sbe.c b/hw/ppc/pnv_sbe.c
>> index 3b50667226b5..671dc81c9501 100644
>> --- a/hw/ppc/pnv_sbe.c
>> +++ b/hw/ppc/pnv_sbe.c
>> @@ -216,6 +216,18 @@ struct proc_dump_area {
>> __be32 act_size; /* Actual data size */
>> } __packed;
>> +static bool is_next_boot_mpipl;
>> +
>> +bool pnv_sbe_is_mpipl_boot(void)
>> +{
>> + return is_next_boot_mpipl;
>> +}
>> +
>> +void pnv_sbe_reset_is_next_boot_mpipl(void)
>> +{
>> + is_next_boot_mpipl = false;
>> +}
>> +
>> static void pnv_sbe_set_host_doorbell(PnvSBE *sbe, uint64_t val)
>> {
>> val &= SBE_HOST_RESPONSE_MASK; /* Is this right? What does HW
>> do? */
>> @@ -334,10 +346,8 @@ static void pnv_sbe_power9_xscom_ctrl_write(void
>> *opaque, hwaddr addr,
>> /* Save processor state */
>> pnv_mpipl_save_proc_regs();
>> - /*
>> - * TODO: Pass `mpipl` node in device tree to signify next
>> - * boot is an MPIPL boot
>> - */
>> + /* Mark next boot as Memory-preserving boot */
>> + is_next_boot_mpipl = true;
>> /* Then do a guest reset */
>> /*
>> diff --git a/include/hw/ppc/pnv_sbe.h b/include/hw/ppc/pnv_sbe.h
>> index f6cbcf990ed9..94bbdc7b6414 100644
>> --- a/include/hw/ppc/pnv_sbe.h
>> +++ b/include/hw/ppc/pnv_sbe.h
>> @@ -56,4 +56,8 @@ struct PnvSBEClass {
>> /* Helper to access stashed SKIBOOT_BASE */
>> bool pnv_sbe_mpipl_skiboot_base(void);
>> +/* Helpers to know if next boot is MPIPL boot */
>> +bool pnv_sbe_is_mpipl_boot(void);
>> +void pnv_sbe_reset_is_next_boot_mpipl(void);
>
> Usually we have a set along with reset and helpers for modifying
> struct members. Above helpers for a gloabl var seems a bit
> un-necessary. I guess we want to keep such global vars inside relevant
> struct.
Yes, will move these to relevant structs.
About set and reset pair, will take care.
Thanks Harsh for the reviews.
- Aditya G
>
> Thanks
> Harsh
>> +
>> #endif /* PPC_PNV_SBE_H */
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2025-03-13 19:01 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-17 7:19 [PATCH 0/7] Implement MPIPL for PowerNV Aditya Gupta
2025-02-17 7:19 ` [PATCH 1/7] hw/ppc: Log S0/S1 Interrupt triggers by OPAL Aditya Gupta
2025-03-11 4:38 ` Harsh Prateek Bora
2025-03-13 18:43 ` Aditya Gupta
2025-02-17 7:19 ` [PATCH 2/7] hw/ppc: Implement S0 SBE interrupt as cpu_pause then host reset Aditya Gupta
2025-03-11 4:45 ` Harsh Prateek Bora
2025-03-13 18:45 ` Aditya Gupta
2025-02-17 7:19 ` [PATCH 3/7] hw/ppc: Handle stash command in PowerNV SBE Aditya Gupta
2025-03-11 4:50 ` Harsh Prateek Bora
2025-03-13 18:46 ` Aditya Gupta
2025-02-17 7:19 ` [PATCH 4/7] hw/ppc: Add MDST/MDDT/MDRT table structures and offsets Aditya Gupta
2025-03-11 5:11 ` Harsh Prateek Bora
2025-03-13 18:50 ` Aditya Gupta
2025-02-17 7:19 ` [PATCH 5/7] hw/ppc: Preserve Memory Regions as per MDST/MDDT tables Aditya Gupta
2025-03-11 5:18 ` Harsh Prateek Bora
2025-03-13 18:54 ` Aditya Gupta
2025-02-17 7:19 ` [PATCH 6/7] hw/ppc: [WIP] Add Processor Dump Area offsets in Pnv SBE Aditya Gupta
2025-03-11 5:23 ` Harsh Prateek Bora
2025-03-13 18:56 ` Aditya Gupta
2025-02-17 7:19 ` [PATCH 7/7] hw/ppc: Implement MPIPL in PowerNV Aditya Gupta
2025-03-11 5:41 ` Harsh Prateek Bora
2025-03-13 19:00 ` Aditya Gupta
2025-02-27 3:37 ` [PATCH 0/7] Implement MPIPL for PowerNV Nicholas Piggin
2025-02-27 6:23 ` Aditya Gupta
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).