qemu-devel.nongnu.org archive mirror
* [PATCH v2] numa: add 'spm' option for Specific Purpose Memory
@ 2025-10-20  9:07 fanhuang
  2025-10-20  9:07 ` fanhuang
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: fanhuang @ 2025-10-20  9:07 UTC (permalink / raw)
  To: qemu-devel, david, imammedo; +Cc: Zhigang.Luo, Lianjie.Shi, FangSheng.Huang

Hi David and Igor,

I apologize for the delayed response. Thank you very much for your thoughtful
questions and feedback on the SPM patch series.

Before addressing your questions, I'd like to briefly mention what the new
QEMU patch series additionally resolves:

1. **Corrected SPM terminology**: Fixed the description error from the previous
   version. The correct acronym is "Specific Purpose Memory" (not "special
   purpose memory" as previously stated).

2. **Fixed overlapping E820 entries**: Updated the implementation to properly
   handle overlapping E820 RAM entries before adding E820_SOFT_RESERVED
   regions. 

   The previous implementation created overlapping E820 entries by first adding
   a large E820_RAM entry covering the entire above-4GB memory range, then
   adding E820_SOFT_RESERVED entries for SPM regions that overlapped with the
   RAM entry. This violated the E820 specification and caused OVMF/UEFI
   firmware to receive conflicting memory type information for the same
   physical addresses.

   The new implementation processes SPM regions first to identify reserved
   areas, then adds RAM entries around the SPM regions, generating a clean,
   non-overlapping E820 map.
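
   For illustration only, the split around a single SPM region can be
   sketched as below. This is a simplified standalone sketch
   (`split_around_spm` is a hypothetical helper, not the function in the
   patch); the real implementation additionally handles endianness
   conversion and the dynamically sized table:

   ```c
   #include <assert.h>
   #include <stddef.h>
   #include <stdint.h>

   #define E820_RAM            1
   #define E820_SOFT_RESERVED  0xEFFFFFFF

   struct entry { uint64_t addr, len; uint32_t type; };

   /*
    * Split one RAM entry around the SPM range [start, start + len),
    * emitting up to three non-overlapping entries into out[]:
    * optional RAM below, the SOFT_RESERVED range, optional RAM above.
    * Returns the number of entries written.
    */
   static size_t split_around_spm(struct entry ram, uint64_t start,
                                  uint64_t len, struct entry out[3])
   {
       uint64_t end = start + len;
       size_t n = 0;

       assert(ram.type == E820_RAM);
       assert(ram.addr <= start && end <= ram.addr + ram.len);

       if (ram.addr < start) {
           out[n++] = (struct entry){ ram.addr, start - ram.addr, E820_RAM };
       }
       out[n++] = (struct entry){ start, len, E820_SOFT_RESERVED };
       if (end < ram.addr + ram.len) {
           out[n++] = (struct entry){ end, ram.addr + ram.len - end, E820_RAM };
       }
       return n;
   }
   ```

   Feeding in the above-4GB range from the example further down
   (14GB of RAM at 0x100000000 with an 8GB SPM region at 0x280000000)
   yields exactly the two entries shown in the guest's BIOS-e820 output.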

Now, regarding your questions:

========================================================================
Why SPM Must Be Boot Memory
========================================================================

SPM cannot be implemented as hotplug memory (DIMM/NVDIMM). The primary
goal of SPM is to ensure that the memory is managed by guest device
drivers, not the guest OS, which requires boot-time discovery for two
key reasons:

1. SPM regions must appear in the E820 memory map as `E820_SOFT_RESERVED`
   during firmware initialization, before the OS starts.

2. Hotplug memory is integrated into kernel memory management, making
   it unavailable for device-specific use.

========================================================================
Detailed Use Case
========================================================================

**Background**
Unified Address Space for CPU and GPU:

Modern heterogeneous computing architectures implement a coherent and
unified address space shared between CPUs and GPUs. Unlike traditional
discrete GPU designs with a dedicated frame buffer, these accelerators
connect CPU and GPU through high-speed interconnects (e.g., XGMI):

- **HBM (High Bandwidth Memory)**: Physically attached to each GPU,
  reported to the OS as driver-managed system memory

- **XGMI (eXternal Global Memory Interconnect, aka. Infinity Fabric)**:
  Maintains data coherence between CPU and GPU, enabling direct CPU
  access to GPU HBM without data copying

In this architecture, GPU HBM is reported as system memory to the OS,
but it needs to be managed exclusively by the GPU driver rather than
the general OS memory allocator. This driver-managed memory provides
optimal performance for GPU workloads while enabling coherent CPU-GPU
data sharing through the XGMI. This is where SPM (Specific Purpose
Memory) becomes essential.

**Virtualization Scenario**

In virtualization, the hypervisor needs to expose this memory topology
to guest VMs while maintaining the same driver-managed vs. OS-managed
distinction.

In the configuration below, `0000:c1:02.0` is a GPU Virtual Function
(VF) device that requires dedicated memory allocation. The host driver
obtains VF HBM information and creates a user-space device node for
each VF (for example `/dev/vf_hbm_0000.c1.02.0`) providing an mmap()
interface that
allows QEMU to allocate memory from the VF's HBM. By using SPM, this
memory is reserved exclusively for the GPU driver rather than being
available for general OS allocation.

**QEMU Configuration**:
```
-object memory-backend-ram,size=8G,id=m0 \
-numa node,nodeid=0,memdev=m0 \
-object memory-backend-file,size=8G,id=m1,mem-path=/dev/vf_hbm_0000.c1.02.0,prealloc=on,align=16M \
-numa node,nodeid=1,memdev=m1,spm=on \
-device vfio-pci,host=0000:c1:02.0,bus=pcie.0
```

**BIOS-e820**

The BIOS-provided physical RAM map, in which 0x280000000-0x47fffffff is
reported as soft reserved:

```
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000027fffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000280000000-0x000000047fffffff] soft reserved
```

**Guest OS**

Guest OS sees 8GB (0x280000000-0x47fffffff) as "soft reserved" memory
that only the GPU driver can use, preventing conflicts with general OS
memory allocation:

```
100000000-27fffffff : System RAM
  1b7a00000-1b8ffffff : Kernel code
  1b9000000-1b9825fff : Kernel rodata
  1b9a00000-1b9e775bf : Kernel data
  1ba397000-1ba7fffff : Kernel bss
280000000-47fffffff : Soft Reserved
  280000000-47fffffff : dax0.0
    280000000-47fffffff : System RAM (kmem)
```

========================================================================

I hope this addresses your concerns. Please let me know if you need any
further clarification or have additional questions.

Best regards,
Jerry Huang




* [PATCH v2] numa: add 'spm' option for Specific Purpose Memory
  2025-10-20  9:07 [PATCH v2] numa: add 'spm' option for Specific Purpose Memory fanhuang
@ 2025-10-20  9:07 ` fanhuang
  2025-11-03 12:32   ` David Hildenbrand
  2025-10-20 10:15 ` Jonathan Cameron via
  2025-10-20 20:10 ` David Hildenbrand
  2 siblings, 1 reply; 12+ messages in thread
From: fanhuang @ 2025-10-20  9:07 UTC (permalink / raw)
  To: qemu-devel, david, imammedo; +Cc: Zhigang.Luo, Lianjie.Shi, FangSheng.Huang

This patch adds support for Specific Purpose Memory (SPM) through the
NUMA node configuration. When 'spm=on' is specified for a NUMA node,
QEMU will:

1. Set the RAM_SPM flag in the RAM block of the corresponding memory region
2. Update the overlapping E820 RAM entries before adding E820_SOFT_RESERVED
3. Set the E820 type to E820_SOFT_RESERVED for this memory region

This allows guest operating systems to recognize the memory as soft reserved
memory, which can be used for device-specific memory management without
E820 table conflicts.

Usage:
  -numa node,nodeid=0,memdev=m1,spm=on

Signed-off-by: fanhuang <FangSheng.Huang@amd.com>
---
 hw/core/numa.c               |  3 ++
 hw/i386/e820_memory_layout.c | 73 ++++++++++++++++++++++++++++++++++++
 hw/i386/e820_memory_layout.h |  2 +
 hw/i386/pc.c                 | 37 ++++++++++++++++++
 include/exec/cpu-common.h    |  1 +
 include/system/memory.h      |  3 ++
 include/system/numa.h        |  1 +
 qapi/machine.json            |  6 +++
 system/physmem.c             |  7 +++-
 9 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/hw/core/numa.c b/hw/core/numa.c
index 218576f745..e680130460 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -163,6 +163,9 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
         numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
     }
 
+    /* Store spm configuration for later processing */
+    numa_info[nodenr].is_spm = node->has_spm && node->spm;
+
     numa_info[nodenr].present = true;
     max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
     ms->numa_state->num_nodes++;
diff --git a/hw/i386/e820_memory_layout.c b/hw/i386/e820_memory_layout.c
index 3e848fb69c..5b090ac6df 100644
--- a/hw/i386/e820_memory_layout.c
+++ b/hw/i386/e820_memory_layout.c
@@ -46,3 +46,76 @@ bool e820_get_entry(int idx, uint32_t type, uint64_t *address, uint64_t *length)
     }
     return false;
 }
+
+bool e820_update_entry_type(uint64_t start, uint64_t length, uint32_t new_type)
+{
+    uint64_t end = start + length;
+    bool updated = false;
+    assert(!e820_done);
+
+    /* For E820_SOFT_RESERVED, validate range is within E820_RAM */
+    if (new_type == E820_SOFT_RESERVED) {
+        bool range_in_ram = false;
+        for (size_t j = 0; j < e820_entries; j++) {
+            uint64_t ram_start = le64_to_cpu(e820_table[j].address);
+            uint64_t ram_end = ram_start + le64_to_cpu(e820_table[j].length);
+            uint32_t ram_type = le32_to_cpu(e820_table[j].type);
+
+            if (ram_type == E820_RAM && ram_start <= start && ram_end >= end) {
+                range_in_ram = true;
+                break;
+            }
+        }
+        if (!range_in_ram) {
+            return false;
+        }
+    }
+
+    /* Find entry that contains the target range and update it */
+    for (size_t i = 0; i < e820_entries; i++) {
+        uint64_t entry_start = le64_to_cpu(e820_table[i].address);
+        uint64_t entry_length = le64_to_cpu(e820_table[i].length);
+        uint64_t entry_end = entry_start + entry_length;
+
+        if (entry_start <= start && entry_end >= end) {
+            uint32_t original_type = e820_table[i].type;
+
+            /* Remove original entry */
+            memmove(&e820_table[i], &e820_table[i + 1],
+                    (e820_entries - i - 1) * sizeof(struct e820_entry));
+            e820_entries--;
+
+            /* Add split parts inline */
+            if (entry_start < start) {
+                e820_table = g_renew(struct e820_entry, e820_table,
+                                     e820_entries + 1);
+                e820_table[e820_entries].address = cpu_to_le64(entry_start);
+                e820_table[e820_entries].length =
+                    cpu_to_le64(start - entry_start);
+                e820_table[e820_entries].type = original_type;
+                e820_entries++;
+            }
+
+            e820_table = g_renew(struct e820_entry, e820_table,
+                                 e820_entries + 1);
+            e820_table[e820_entries].address = cpu_to_le64(start);
+            e820_table[e820_entries].length = cpu_to_le64(length);
+            e820_table[e820_entries].type = cpu_to_le32(new_type);
+            e820_entries++;
+
+            if (end < entry_end) {
+                e820_table = g_renew(struct e820_entry, e820_table,
+                                     e820_entries + 1);
+                e820_table[e820_entries].address = cpu_to_le64(end);
+                e820_table[e820_entries].length = cpu_to_le64(entry_end - end);
+                e820_table[e820_entries].type = original_type;
+                e820_entries++;
+            }
+
+            updated = true;
+            break;
+        }
+    }
+
+    return updated;
+}
diff --git a/hw/i386/e820_memory_layout.h b/hw/i386/e820_memory_layout.h
index b50acfa201..657cc679e2 100644
--- a/hw/i386/e820_memory_layout.h
+++ b/hw/i386/e820_memory_layout.h
@@ -15,6 +15,7 @@
 #define E820_ACPI       3
 #define E820_NVS        4
 #define E820_UNUSABLE   5
+#define E820_SOFT_RESERVED  0xEFFFFFFF
 
 struct e820_entry {
     uint64_t address;
@@ -26,5 +27,6 @@ void e820_add_entry(uint64_t address, uint64_t length, uint32_t type);
 bool e820_get_entry(int index, uint32_t type,
                     uint64_t *address, uint64_t *length);
 int e820_get_table(struct e820_entry **table);
+bool e820_update_entry_type(uint64_t start, uint64_t length, uint32_t new_type);
 
 #endif
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index bc048a6d13..3e50570484 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -26,6 +26,7 @@
 #include "qemu/units.h"
 #include "exec/target_page.h"
 #include "hw/i386/pc.h"
+#include "system/ramblock.h"
 #include "hw/char/serial-isa.h"
 #include "hw/char/parallel.h"
 #include "hw/hyperv/hv-balloon.h"
@@ -787,6 +788,41 @@ static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
     return pc_above_4g_end(pcms) - 1;
 }
 
+static int pc_update_spm_memory(RAMBlock *rb, void *opaque)
+{
+    X86MachineState *x86ms = opaque;
+    MachineState *ms = MACHINE(x86ms);
+    ram_addr_t offset;
+    ram_addr_t length;
+    bool is_spm = false;
+
+    /* Check if this RAM block belongs to a NUMA node with spm=on */
+    for (int i = 0; i < ms->numa_state->num_nodes; i++) {
+        NodeInfo *numa_info = &ms->numa_state->nodes[i];
+        if (numa_info->is_spm && numa_info->node_memdev) {
+            MemoryRegion *mr = &numa_info->node_memdev->mr;
+            if (mr->ram_block == rb) {
+                /* Mark this RAM block as SPM and set the flag */
+                rb->flags |= RAM_SPM;
+                is_spm = true;
+                break;
+            }
+        }
+    }
+
+    if (is_spm) {
+        offset = qemu_ram_get_offset(rb) +
+                 (0x100000000ULL - x86ms->below_4g_mem_size);
+        length = qemu_ram_get_used_length(rb);
+        if (!e820_update_entry_type(offset, length, E820_SOFT_RESERVED)) {
+            warn_report("Failed to update E820 entry for SPM at 0x%" PRIx64
+                        " length 0x%" PRIx64, offset, length);
+        }
+    }
+
+    return 0;
+}
+
 /*
  * AMD systems with an IOMMU have an additional hole close to the
  * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
@@ -901,6 +937,7 @@ void pc_memory_init(PCMachineState *pcms,
     if (pcms->sgx_epc.size != 0) {
         e820_add_entry(pcms->sgx_epc.base, pcms->sgx_epc.size, E820_RESERVED);
     }
+    qemu_ram_foreach_block(pc_update_spm_memory, x86ms);
 
     if (!pcmc->has_reserved_memory &&
         (machine->ram_slots ||
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 9b658a3f48..9b437eaa10 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -89,6 +89,7 @@ ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb);
 ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
 ram_addr_t qemu_ram_get_max_length(RAMBlock *rb);
 bool qemu_ram_is_shared(RAMBlock *rb);
+bool qemu_ram_is_spm(RAMBlock *rb);
 bool qemu_ram_is_noreserve(RAMBlock *rb);
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
 void qemu_ram_set_uf_zeroable(RAMBlock *rb);
diff --git a/include/system/memory.h b/include/system/memory.h
index aa85fc27a1..0d36cbd30d 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -275,6 +275,9 @@ typedef struct IOMMUTLBEvent {
  */
 #define RAM_PRIVATE (1 << 13)
 
+/* RAM is Specific Purpose Memory */
+#define RAM_SPM (1 << 14)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
diff --git a/include/system/numa.h b/include/system/numa.h
index 1044b0eb6e..438511a756 100644
--- a/include/system/numa.h
+++ b/include/system/numa.h
@@ -41,6 +41,7 @@ typedef struct NodeInfo {
     bool present;
     bool has_cpu;
     bool has_gi;
+    bool is_spm;
     uint8_t lb_info_provided;
     uint16_t initiator;
     uint8_t distance[MAX_NODES];
diff --git a/qapi/machine.json b/qapi/machine.json
index 038eab281c..1fa31b0224 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -500,6 +500,11 @@
 # @memdev: memory backend object.  If specified for one node, it must
 #     be specified for all nodes.
 #
+# @spm: if true, mark the memory region of this node as Specific
+#     Purpose Memory (SPM). This will set the RAM_SPM flag for the
+#     corresponding memory region and set the E820 type to
+#     E820_SOFT_RESERVED. (default: false, since 9.2)
+#
 # @initiator: defined in ACPI 6.3 Chapter 5.2.27.3 Table 5-145, points
 #     to the nodeid which has the memory controller responsible for
 #     this NUMA node.  This field provides additional information as
@@ -514,6 +519,7 @@
    '*cpus':   ['uint16'],
    '*mem':    'size',
    '*memdev': 'str',
+   '*spm':    'bool',
    '*initiator': 'uint16' }}
 
 ##
diff --git a/system/physmem.c b/system/physmem.c
index ae8ecd50ea..0090d9955d 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1611,6 +1611,11 @@ bool qemu_ram_is_noreserve(RAMBlock *rb)
     return rb->flags & RAM_NORESERVE;
 }
 
+bool qemu_ram_is_spm(RAMBlock *rb)
+{
+    return rb->flags & RAM_SPM;
+}
+
 /* Note: Only set at the start of postcopy */
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb)
 {
@@ -2032,7 +2037,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, ram_addr_t max_size,
     ram_flags &= ~RAM_PRIVATE;
 
     /* Just support these ram flags by now. */
-    assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE |
+    assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_SPM | RAM_NORESERVE |
                           RAM_PROTECTED | RAM_NAMED_FILE | RAM_READONLY |
                           RAM_READONLY_FD | RAM_GUEST_MEMFD |
                           RAM_RESIZEABLE)) == 0);
-- 
2.34.1




* Re: [PATCH v2] numa: add 'spm' option for Specific Purpose Memory
  2025-10-20  9:07 [PATCH v2] numa: add 'spm' option for Specific Purpose Memory fanhuang
  2025-10-20  9:07 ` fanhuang
@ 2025-10-20 10:15 ` Jonathan Cameron via
  2025-10-20 20:03   ` David Hildenbrand
  2025-10-20 20:10 ` David Hildenbrand
  2 siblings, 1 reply; 12+ messages in thread
From: Jonathan Cameron via @ 2025-10-20 10:15 UTC (permalink / raw)
  To: fanhuang
  Cc: qemu-devel, david, imammedo, Zhigang.Luo, Lianjie.Shi,
	David Hildenbrand, Oscar Salvador

On Mon, 20 Oct 2025 17:07:00 +0800
fanhuang <FangSheng.Huang@amd.com> wrote:

> Hi David and Igor,
> 
> I apologize for the delayed response. Thank you very much for your thoughtful
> questions and feedback on the SPM patch series.
> 
> Before addressing your questions, I'd like to briefly mention what the new
> QEMU patch series additionally resolves:
> 
> 1. **Corrected SPM terminology**: Fixed the description error from the previous
>    version. The correct acronym is "Specific Purpose Memory" (not "special
>    purpose memory" as previously stated).
> 
> 2. **Fixed overlapping E820 entries**: Updated the implementation to properly
>    handle overlapping E820 RAM entries before adding E820_SOFT_RESERVED
>    regions. 
> 
>    The previous implementation created overlapping E820 entries by first adding
>    a large E820_RAM entry covering the entire above-4GB memory range, then
>    adding E820_SOFT_RESERVED entries for SPM regions that overlapped with the
>    RAM entry. This violated the E820 specification and caused OVMF/UEFI
>    firmware to receive conflicting memory type information for the same
>    physical addresses.
> 
>    The new implementation processes SPM regions first to identify reserved
>    areas, then adds RAM entries around the SPM regions, generating a clean,
>    non-overlapping E820 map.

I'm definitely in favor of this support for testing purposes as well as
for the GPU cases you describe.

Given that, I took your brief comment on hotplug and expanded on it;
+CC David and Oscar.

> 
> Now, regarding your questions:
> 
> ========================================================================
> Why SPM Must Be Boot Memory
> ========================================================================
> 
> SPM cannot be implemented as hotplug memory (DIMM/NVDIMM) because:
> 
> The primary goal of SPM is to ensure that memory is managed by guest
> device drivers, not the guest OS. This requires boot-time discovery
> for three key reasons:
> 
> 1. SPM regions must appear in the E820 memory map as `E820_SOFT_RESERVED`
>    during firmware initialization, before the OS starts.
> 
> 2. Hotplug memory is integrated into kernel memory management, making
>    it unavailable for device-specific use.

This is only sort of true, and perhaps reflects missing kernel support
for ACPI features that no one has yet been interested in.
See 9.11.3 Hot-pluggable Memory Description Illustrated in the 6.6 ACPI spec.
That has an example where the EFI_MEMORY_SP bit is provided. 
I had a dig around and for now ACPICA / kernel doesn't seem to put that alongside
write_protect and the other bits that IIUC come from the same field.
It would be relatively easy to pipe that through and potentially add handling
in the memory hotplug path to allow for drivers to pick these regions up
(which boils down I think to making them visible in some way but doing nothing
else with them)

Other path would be to use a discoverable path such as emulating CXL memory.
Hotplug of that would work fine from point of view of coming up as driver managed
SPM style (the flag is in runtime data provided by the device). It would however
look different to the firmware managed approach you are using in the host.

All I want to draw attention to is that there are other ways of doing this
that might be relevant in future, but don't work for what you need to do today.
So don't see this an objection to this specific bit of work!

Thanks,

Jonathan




* Re: [PATCH v2] numa: add 'spm' option for Specific Purpose Memory
  2025-10-20 10:15 ` Jonathan Cameron via
@ 2025-10-20 20:03   ` David Hildenbrand
  2025-10-22 10:19     ` Huang, FangSheng (Jerry)
  0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2025-10-20 20:03 UTC (permalink / raw)
  To: Jonathan Cameron, fanhuang
  Cc: qemu-devel, imammedo, Zhigang.Luo, Lianjie.Shi, Oscar Salvador

On 20.10.25 12:15, Jonathan Cameron wrote:
> On Mon, 20 Oct 2025 17:07:00 +0800
> fanhuang <FangSheng.Huang@amd.com> wrote:
> 
>> Hi David and Igor,
>>
>> I apologize for the delayed response. Thank you very much for your thoughtful
>> questions and feedback on the SPM patch series.
>>
>> Before addressing your questions, I'd like to briefly mention what the new
>> QEMU patch series additionally resolves:
>>
>> 1. **Corrected SPM terminology**: Fixed the description error from the previous
>>     version. The correct acronym is "Specific Purpose Memory" (not "special
>>     purpose memory" as previously stated).
>>
>> 2. **Fixed overlapping E820 entries**: Updated the implementation to properly
>>     handle overlapping E820 RAM entries before adding E820_SOFT_RESERVED
>>     regions.
>>
>>     The previous implementation created overlapping E820 entries by first adding
>>     a large E820_RAM entry covering the entire above-4GB memory range, then
>>     adding E820_SOFT_RESERVED entries for SPM regions that overlapped with the
>>     RAM entry. This violated the E820 specification and caused OVMF/UEFI
>>     firmware to receive conflicting memory type information for the same
>>     physical addresses.
>>
>>     The new implementation processes SPM regions first to identify reserved
>>     areas, then adds RAM entries around the SPM regions, generating a clean,
>>     non-overlapping E820 map.
> 
> I'm definitely in favor of this support for testing purposes as well as
> for the GPU cases you describe.

Thanks for taking a look!

> 
> Given I took your brief comment on hotplug and expanded on it +CC David
> and Oscar.
> 
>>
>> Now, regarding your questions:
>>
>> ========================================================================
>> Why SPM Must Be Boot Memory
>> ========================================================================
>>
>> SPM cannot be implemented as hotplug memory (DIMM/NVDIMM) because:
>>
>> The primary goal of SPM is to ensure that memory is managed by guest
>> device drivers, not the guest OS. This requires boot-time discovery
>> for three key reasons:
>>
>> 1. SPM regions must appear in the E820 memory map as `E820_SOFT_RESERVED`
>>     during firmware initialization, before the OS starts.
>>
>> 2. Hotplug memory is integrated into kernel memory management, making
>>     it unavailable for device-specific use.
> 
> This is only sort of true and perhaps reflects support in the kernel for ACPI
> features being missing as no one has yet been interested in them.
> See 9.11.3 Hot-pluggable Memory Description Illustrated in the 6.6 ACPI spec.
> That has an example where the EFI_MEMORY_SP bit is provided.
> I had a dig around and for now ACPICA / kernel doesn't seem to put that alongside
> write_protect and the other bits that IIUC come from the same field.
> It would be relatively easy to pipe that through and potentially add handling
> in the memory hotplug path to allow for drivers to pick these regions up
> (which boils down I think to making them visible in some way but doing nothing
> else with them)

Considering something like DIMMs, one challenge is also that hotplugged 
memory in QEMU is never advertised in e820 (we only indicate the 
hotpluggable region), which is different from real hardware but lets us 
stop the early kernel that is booting up from considering these areas 
"initial memory" and effectively turning them hot-unpluggable in the 
default case.

Then, the question is what happens when someone plugs such a DIMM, 
unplugs it, and plugs something else in there that's not supposed to be SP.

I assume that's all solvable, just want to point out that the default 
memory hotplug path in QEMU is not really suitable for that right now I 
think.

> 
> Other path would be to use a discoverable path such as emulating CXL memory.
> Hotplug of that would work fine from point of view of coming up as driver managed
> SPM style (the flag is in runtime data provided by the device). It would however
> look different to the firmware managed approach you are using in the host.

Right.

-- 
Cheers

David / dhildenb




* Re: [PATCH v2] numa: add 'spm' option for Specific Purpose Memory
  2025-10-20  9:07 [PATCH v2] numa: add 'spm' option for Specific Purpose Memory fanhuang
  2025-10-20  9:07 ` fanhuang
  2025-10-20 10:15 ` Jonathan Cameron via
@ 2025-10-20 20:10 ` David Hildenbrand
  2025-10-22 10:09   ` Huang, FangSheng (Jerry)
  2 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2025-10-20 20:10 UTC (permalink / raw)
  To: fanhuang, qemu-devel, imammedo; +Cc: Zhigang.Luo, Lianjie.Shi, Jonathan Cameron

On 20.10.25 11:07, fanhuang wrote:
> Hi David and Igor,
> 
> I apologize for the delayed response. Thank you very much for your thoughtful
> questions and feedback on the SPM patch series.
> 
> Before addressing your questions, I'd like to briefly mention what the new
> QEMU patch series additionally resolves:
> 
> 1. **Corrected SPM terminology**: Fixed the description error from the previous
>     version. The correct acronym is "Specific Purpose Memory" (not "special
>     purpose memory" as previously stated).
> 
> 2. **Fixed overlapping E820 entries**: Updated the implementation to properly
>     handle overlapping E820 RAM entries before adding E820_SOFT_RESERVED
>     regions.
> 
>     The previous implementation created overlapping E820 entries by first adding
>     a large E820_RAM entry covering the entire above-4GB memory range, then
>     adding E820_SOFT_RESERVED entries for SPM regions that overlapped with the
>     RAM entry. This violated the E820 specification and caused OVMF/UEFI
>     firmware to receive conflicting memory type information for the same
>     physical addresses.
> 
>     The new implementation processes SPM regions first to identify reserved
>     areas, then adds RAM entries around the SPM regions, generating a clean,
>     non-overlapping E820 map.
> 
> Now, regarding your questions:
> 
> ========================================================================
> Why SPM Must Be Boot Memory
> ========================================================================
> 
> SPM cannot be implemented as hotplug memory (DIMM/NVDIMM) because:
> 
> The primary goal of SPM is to ensure that memory is managed by guest
> device drivers, not the guest OS. This requires boot-time discovery
> for three key reasons:
> 
> 1. SPM regions must appear in the E820 memory map as `E820_SOFT_RESERVED`
>     during firmware initialization, before the OS starts.
> 
> 2. Hotplug memory is integrated into kernel memory management, making
>     it unavailable for device-specific use.
> 
> ========================================================================
> Detailed Use Case
> ========================================================================
> 
> **Background**
> Unified Address Space for CPU and GPU:
> 
> Modern heterogeneous computing architectures implement a coherent and
> unified address space shared between CPUs and GPUs. Unlike traditional
> discrete GPU designs with dedicated frame buffer, these accelerators
> connect CPU and GPU through high-speed interconnects (e.g., XGMI):
> 
> - **HBM (High Bandwidth Memory)**: Physically attached to each GPU,
>    reported to the OS as driver-managed system memory
> 
> - **XGMI (eXternal Global Memory Interconnect, aka. Infinity Fabric)**:
>    Maintains data coherence between CPU and GPU, enabling direct CPU
>    access to GPU HBM without data copying
> 
> In this architecture, GPU HBM is reported as system memory to the OS,
> but it needs to be managed exclusively by the GPU driver rather than
> the general OS memory allocator. This driver-managed memory provides
> optimal performance for GPU workloads while enabling coherent CPU-GPU
> data sharing through the XGMI. This is where SPM (Specific Purpose
> Memory) becomes essential.
> 
> **Virtualization Scenario**
> 
> In virtualization, hypervisor need to expose this memory topology to
> guest VMs while maintaining the same driver-managed vs OS-managed
> distinction.

Just wondering, could device hotplug in that model ever work? I guess we 
wouldn't expose the memory at all in e820 (after all, it gets hotplugged 
later) and instead the device driver in the guest would have to 
detect+hotplug that memory.

But that sounds weird, because the device driver in the VM shouldn't do 
something virt specific.

Which raises the question: how is device hotplug of such GPUs handled 
on bare metal? Or does it simply not work? :)

-- 
Cheers

David / dhildenb




* Re: [PATCH v2] numa: add 'spm' option for Specific Purpose Memory
  2025-10-20 20:10 ` David Hildenbrand
@ 2025-10-22 10:09   ` Huang, FangSheng (Jerry)
  2025-10-22 10:28     ` David Hildenbrand
  0 siblings, 1 reply; 12+ messages in thread
From: Huang, FangSheng (Jerry) @ 2025-10-22 10:09 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel, imammedo
  Cc: Zhigang.Luo, Lianjie.Shi, Jonathan Cameron



On 10/21/2025 4:10 AM, David Hildenbrand wrote:
> On 20.10.25 11:07, fanhuang wrote:
>> Hi David and Igor,
>>
>> I apologize for the delayed response. Thank you very much for your 
>> thoughtful
>> questions and feedback on the SPM patch series.
>>
>> Before addressing your questions, I'd like to briefly mention what the 
>> new
>> QEMU patch series additionally resolves:
>>
>> 1. **Corrected SPM terminology**: Fixed the description error from the 
>> previous
>>     version. The correct acronym is "Specific Purpose Memory" (not 
>> "special
>>     purpose memory" as previously stated).
>>
>> 2. **Fixed overlapping E820 entries**: Updated the implementation to 
>> properly
>>     handle overlapping E820 RAM entries before adding E820_SOFT_RESERVED
>>     regions.
>>
>>     The previous implementation created overlapping E820 entries by 
>> first adding
>>     a large E820_RAM entry covering the entire above-4GB memory range, 
>> then
>>     adding E820_SOFT_RESERVED entries for SPM regions that overlapped 
>> with the
>>     RAM entry. This violated the E820 specification and caused OVMF/UEFI
>>     firmware to receive conflicting memory type information for the same
>>     physical addresses.
>>
>>     The new implementation processes SPM regions first to identify 
>> reserved
>>     areas, then adds RAM entries around the SPM regions, generating a 
>> clean,
>>     non-overlapping E820 map.
>>
>> Now, regarding your questions:
>>
>> ========================================================================
>> Why SPM Must Be Boot Memory
>> ========================================================================
>>
>> SPM cannot be implemented as hotplug memory (DIMM/NVDIMM) because:
>>
>> The primary goal of SPM is to ensure that memory is managed by guest
>> device drivers, not the guest OS. This requires boot-time discovery
>> for two key reasons:
>>
>> 1. SPM regions must appear in the E820 memory map as `E820_SOFT_RESERVED`
>>     during firmware initialization, before the OS starts.
>>
>> 2. Hotplug memory is integrated into kernel memory management, making
>>     it unavailable for device-specific use.
>>
>> ========================================================================
>> Detailed Use Case
>> ========================================================================
>>
>> **Background**
>> Unified Address Space for CPU and GPU:
>>
>> Modern heterogeneous computing architectures implement a coherent and
>> unified address space shared between CPUs and GPUs. Unlike traditional
>> discrete GPU designs with a dedicated frame buffer, these accelerators
>> connect CPU and GPU through high-speed interconnects (e.g., XGMI):
>>
>> - **HBM (High Bandwidth Memory)**: Physically attached to each GPU,
>>    reported to the OS as driver-managed system memory
>>
>> - **XGMI (eXternal Global Memory Interconnect, aka. Infinity Fabric)**:
>>    Maintains data coherence between CPU and GPU, enabling direct CPU
>>    access to GPU HBM without data copying
>>
>> In this architecture, GPU HBM is reported as system memory to the OS,
>> but it needs to be managed exclusively by the GPU driver rather than
>> the general OS memory allocator. This driver-managed memory provides
>> optimal performance for GPU workloads while enabling coherent CPU-GPU
>> data sharing through the XGMI. This is where SPM (Specific Purpose
>> Memory) becomes essential.
>>
>> **Virtualization Scenario**
>>
>> In virtualization, the hypervisor needs to expose this memory topology to
>> guest VMs while maintaining the same driver-managed vs OS-managed
>> distinction.
> 
> Just wondering, could device hotplug in that model ever work? I guess we 
> wouldn't expose the memory at all in e820 (after all, it gets hotplugged 
> later) and instead the device driver in the guest would have to 
> detect+hotplug that memory.
> 
> But that sounds weird, because the device driver in the VM shouldn't do 
> something virt specific.
> 
> Which raises the question: how is device hotplug of such GPUs handled on 
> bare metal? Or does it simply not work? :)
> 
Hi David, thank you for your thoughtful feedback.
To directly answer your question:
in our use case, GPU device hotplug does NOT work on bare metal,
and this is by design.

HBM as Boot Memory:
- HBM (High Bandwidth Memory) is physically attached to each GPU.
- This memory is exposed via ACPI during firmware initialization.
- GPU drivers discover HBM regions by parsing these ACPI tables.



* Re: [PATCH v2] numa: add 'spm' option for Specific Purpose Memory
  2025-10-20 20:03   ` David Hildenbrand
@ 2025-10-22 10:19     ` Huang, FangSheng (Jerry)
  0 siblings, 0 replies; 12+ messages in thread
From: Huang, FangSheng (Jerry) @ 2025-10-22 10:19 UTC (permalink / raw)
  To: David Hildenbrand, Jonathan Cameron
  Cc: qemu-devel, imammedo, Zhigang.Luo, Lianjie.Shi, Oscar Salvador



On 10/21/2025 4:03 AM, David Hildenbrand wrote:
> On 20.10.25 12:15, Jonathan Cameron wrote:
>> On Mon, 20 Oct 2025 17:07:00 +0800
>> fanhuang <FangSheng.Huang@amd.com> wrote:
>>
>>> Hi David and Igor,
>>>
>>> I apologize for the delayed response. Thank you very much for your 
>>> thoughtful
>>> questions and feedback on the SPM patch series.
>>>
>>> Before addressing your questions, I'd like to briefly mention what 
>>> the new
>>> QEMU patch series additionally resolves:
>>>
>>> 1. **Corrected SPM terminology**: Fixed the description error from 
>>> the previous
>>>     version. The correct acronym is "Specific Purpose Memory" (not 
>>> "special
>>>     purpose memory" as previously stated).
>>>
>>> 2. **Fixed overlapping E820 entries**: Updated the implementation to 
>>> properly
>>>     handle overlapping E820 RAM entries before adding E820_SOFT_RESERVED
>>>     regions.
>>>
>>>     The previous implementation created overlapping E820 entries by 
>>> first adding
>>>     a large E820_RAM entry covering the entire above-4GB memory 
>>> range, then
>>>     adding E820_SOFT_RESERVED entries for SPM regions that overlapped 
>>> with the
>>>     RAM entry. This violated the E820 specification and caused OVMF/UEFI
>>>     firmware to receive conflicting memory type information for the same
>>>     physical addresses.
>>>
>>>     The new implementation processes SPM regions first to identify 
>>> reserved
>>>     areas, then adds RAM entries around the SPM regions, generating a 
>>> clean,
>>>     non-overlapping E820 map.
>>
>> I'm definitely in favor of this support for testing purposes as well as
>> for the GPU cases you describe.
> 
> Thanks for taking a look!
> 
>>
>> Given I took your brief comment on hotplug and expanded on it +CC David
>> and Oscar.
>>
>>>
>>> Now, regarding your questions:
>>>
>>> ========================================================================
>>> Why SPM Must Be Boot Memory
>>> ========================================================================
>>>
>>> SPM cannot be implemented as hotplug memory (DIMM/NVDIMM) because:
>>>
>>> The primary goal of SPM is to ensure that memory is managed by guest
>>> device drivers, not the guest OS. This requires boot-time discovery
>>> for two key reasons:
>>>
>>> 1. SPM regions must appear in the E820 memory map as 
>>> `E820_SOFT_RESERVED`
>>>     during firmware initialization, before the OS starts.
>>>
>>> 2. Hotplug memory is integrated into kernel memory management, making
>>>     it unavailable for device-specific use.
>>
>> This is only sort of true, and perhaps reflects missing kernel support
>> for ACPI features that no one has yet been interested in.
>> See 9.11.3 Hot-pluggable Memory Description Illustrated in the 6.6 
>> ACPI spec.
>> That has an example where the EFI_MEMORY_SP bit is provided.
>> I had a dig around, and for now ACPICA / the kernel doesn't seem to put
>> that alongside write_protect and the other bits that IIUC come from the
>> same field. It would be relatively easy to pipe that through and
>> potentially add handling in the memory hotplug path to allow drivers to
>> pick these regions up (which I think boils down to making them visible
>> in some way but doing nothing else with them).
> 
> Considering something like DIMMs, one challenge is also that hotplugged 
> memory in QEMU is never advertised in e820 (we only indicate the 
> hotpluggable region). This differs from real hardware, but lets us stop 
> the early kernel that is booting up from considering these areas 
> "initial memory", which would effectively make them non-hot-unpluggable 
> in the default case.
> 
> Then, the question is what happens when someone plugs such a DIMM, 
> unplugs it, and plugs something else in there that's not supposed to be SP.
> 
> I assume that's all solvable, just want to point out that the default 
> memory hotplug path in QEMU is not really suitable for that right now I 
> think.
> 
>>
>> Another path would be to use a discoverable one, such as emulating CXL 
>> memory. Hotplug of that would work fine from the point of view of coming 
>> up as driver-managed, SPM-style memory (the flag is in runtime data 
>> provided by the device). It would however look different from the 
>> firmware-managed approach you are using in the host.
> 
> Right.
> 
Hi David and Jonathan,

Thank you both for the support and for the forward-looking
perspective on potential future approaches like EFI_MEMORY_SP and CXL
emulation.

I appreciate the constructive technical discussion from both of you. 
Please let me know if there's anything else I should clarify about the 
implementation.

Best regards,
Jerry Huang




* Re: [PATCH v2] numa: add 'spm' option for Specific Purpose Memory
  2025-10-22 10:09   ` Huang, FangSheng (Jerry)
@ 2025-10-22 10:28     ` David Hildenbrand
  2025-11-03  3:01       ` Huang, FangSheng (Jerry)
  0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2025-10-22 10:28 UTC (permalink / raw)
  To: Huang, FangSheng (Jerry), qemu-devel, imammedo
  Cc: Zhigang.Luo, Lianjie.Shi, Jonathan Cameron

On 22.10.25 12:09, Huang, FangSheng (Jerry) wrote:
> 
> 
> On 10/21/2025 4:10 AM, David Hildenbrand wrote:
>> On 20.10.25 11:07, fanhuang wrote:
>>> Hi David and Igor,
>>>
>>> I apologize for the delayed response. Thank you very much for your
>>> thoughtful
>>> questions and feedback on the SPM patch series.
>>>
>>> Before addressing your questions, I'd like to briefly mention what the
>>> new
>>> QEMU patch series additionally resolves:
>>>
>>> 1. **Corrected SPM terminology**: Fixed the description error from the
>>> previous
>>>      version. The correct acronym is "Specific Purpose Memory" (not
>>> "special
>>>      purpose memory" as previously stated).
>>>
>>> 2. **Fixed overlapping E820 entries**: Updated the implementation to
>>> properly
>>>      handle overlapping E820 RAM entries before adding E820_SOFT_RESERVED
>>>      regions.
>>>
>>>      The previous implementation created overlapping E820 entries by
>>> first adding
>>>      a large E820_RAM entry covering the entire above-4GB memory range,
>>> then
>>>      adding E820_SOFT_RESERVED entries for SPM regions that overlapped
>>> with the
>>>      RAM entry. This violated the E820 specification and caused OVMF/UEFI
>>>      firmware to receive conflicting memory type information for the same
>>>      physical addresses.
>>>
>>>      The new implementation processes SPM regions first to identify
>>> reserved
>>>      areas, then adds RAM entries around the SPM regions, generating a
>>> clean,
>>>      non-overlapping E820 map.
>>>
>>> Now, regarding your questions:
>>>
>>> ========================================================================
>>> Why SPM Must Be Boot Memory
>>> ========================================================================
>>>
>>> SPM cannot be implemented as hotplug memory (DIMM/NVDIMM) because:
>>>
>>> The primary goal of SPM is to ensure that memory is managed by guest
>>> device drivers, not the guest OS. This requires boot-time discovery
>>> for two key reasons:
>>>
>>> 1. SPM regions must appear in the E820 memory map as `E820_SOFT_RESERVED`
>>>      during firmware initialization, before the OS starts.
>>>
>>> 2. Hotplug memory is integrated into kernel memory management, making
>>>      it unavailable for device-specific use.
>>>
>>> ========================================================================
>>> Detailed Use Case
>>> ========================================================================
>>>
>>> **Background**
>>> Unified Address Space for CPU and GPU:
>>>
>>> Modern heterogeneous computing architectures implement a coherent and
>>> unified address space shared between CPUs and GPUs. Unlike traditional
>>> discrete GPU designs with a dedicated frame buffer, these accelerators
>>> connect CPU and GPU through high-speed interconnects (e.g., XGMI):
>>>
>>> - **HBM (High Bandwidth Memory)**: Physically attached to each GPU,
>>>     reported to the OS as driver-managed system memory
>>>
>>> - **XGMI (eXternal Global Memory Interconnect, aka. Infinity Fabric)**:
>>>     Maintains data coherence between CPU and GPU, enabling direct CPU
>>>     access to GPU HBM without data copying
>>>
>>> In this architecture, GPU HBM is reported as system memory to the OS,
>>> but it needs to be managed exclusively by the GPU driver rather than
>>> the general OS memory allocator. This driver-managed memory provides
>>> optimal performance for GPU workloads while enabling coherent CPU-GPU
>>> data sharing through the XGMI. This is where SPM (Specific Purpose
>>> Memory) becomes essential.
>>>
>>> **Virtualization Scenario**
>>>
>>> In virtualization, the hypervisor needs to expose this memory topology to
>>> guest VMs while maintaining the same driver-managed vs OS-managed
>>> distinction.
>>
>> Just wondering, could device hotplug in that model ever work? I guess we
>> wouldn't expose the memory at all in e820 (after all, it gets hotplugged
>> later) and instead the device driver in the guest would have to
>> detect+hotplug that memory.
>>
>> But that sounds weird, because the device driver in the VM shouldn't do
>> something virt specific.
>>
>> Which raises the question: how is device hotplug of such GPUs handled on
>> bare metal? Or does it simply not work? :)
>>
> Hi David, Thank you for your thoughtful feedback.
> To directly answer your question:
> in our use case, GPU device hotplug does NOT work on bare metal,
> and this is by design.

Cool, thanks for clarifying!

-- 
Cheers

David / dhildenb




* Re: [PATCH v2] numa: add 'spm' option for Specific Purpose Memory
  2025-10-22 10:28     ` David Hildenbrand
@ 2025-11-03  3:01       ` Huang, FangSheng (Jerry)
  2025-11-03 12:36         ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 12+ messages in thread
From: Huang, FangSheng (Jerry) @ 2025-11-03  3:01 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel, imammedo
  Cc: Zhigang.Luo, Lianjie.Shi, Jonathan Cameron

Hi David,

I hope this email finds you well. I wanted to follow up on the SPM
patch series we discussed back in October.

I'm reaching out to check on the current status and see if there's
anything else I should address or any additional information I can
provide.

Thank you for your time and guidance on this!

Best regards,
Jerry Huang

On 10/22/2025 6:28 PM, David Hildenbrand wrote:
> On 22.10.25 12:09, Huang, FangSheng (Jerry) wrote:
>>
>>
>> On 10/21/2025 4:10 AM, David Hildenbrand wrote:
>>> On 20.10.25 11:07, fanhuang wrote:
>>>> Hi David and Igor,
>>>>
>>>> I apologize for the delayed response. Thank you very much for your
>>>> thoughtful
>>>> questions and feedback on the SPM patch series.
>>>>
>>>> Before addressing your questions, I'd like to briefly mention what the
>>>> new
>>>> QEMU patch series additionally resolves:
>>>>
>>>> 1. **Corrected SPM terminology**: Fixed the description error from the
>>>> previous
>>>>      version. The correct acronym is "Specific Purpose Memory" (not
>>>> "special
>>>>      purpose memory" as previously stated).
>>>>
>>>> 2. **Fixed overlapping E820 entries**: Updated the implementation to
>>>> properly
>>>>      handle overlapping E820 RAM entries before adding 
>>>> E820_SOFT_RESERVED
>>>>      regions.
>>>>
>>>>      The previous implementation created overlapping E820 entries by
>>>> first adding
>>>>      a large E820_RAM entry covering the entire above-4GB memory range,
>>>> then
>>>>      adding E820_SOFT_RESERVED entries for SPM regions that overlapped
>>>> with the
>>>>      RAM entry. This violated the E820 specification and caused 
>>>> OVMF/UEFI
>>>>      firmware to receive conflicting memory type information for the 
>>>> same
>>>>      physical addresses.
>>>>
>>>>      The new implementation processes SPM regions first to identify
>>>> reserved
>>>>      areas, then adds RAM entries around the SPM regions, generating a
>>>> clean,
>>>>      non-overlapping E820 map.
>>>>
>>>> Now, regarding your questions:
>>>>
>>>> ========================================================================
>>>> Why SPM Must Be Boot Memory
>>>> ========================================================================
>>>>
>>>> SPM cannot be implemented as hotplug memory (DIMM/NVDIMM) because:
>>>>
>>>> The primary goal of SPM is to ensure that memory is managed by guest
>>>> device drivers, not the guest OS. This requires boot-time discovery
>>>> for two key reasons:
>>>>
>>>> 1. SPM regions must appear in the E820 memory map as 
>>>> `E820_SOFT_RESERVED`
>>>>      during firmware initialization, before the OS starts.
>>>>
>>>> 2. Hotplug memory is integrated into kernel memory management, making
>>>>      it unavailable for device-specific use.
>>>>
>>>> ========================================================================
>>>> Detailed Use Case
>>>> ========================================================================
>>>>
>>>> **Background**
>>>> Unified Address Space for CPU and GPU:
>>>>
>>>> Modern heterogeneous computing architectures implement a coherent and
>>>> unified address space shared between CPUs and GPUs. Unlike traditional
>>>> discrete GPU designs with a dedicated frame buffer, these accelerators
>>>> connect CPU and GPU through high-speed interconnects (e.g., XGMI):
>>>>
>>>> - **HBM (High Bandwidth Memory)**: Physically attached to each GPU,
>>>>     reported to the OS as driver-managed system memory
>>>>
>>>> - **XGMI (eXternal Global Memory Interconnect, aka. Infinity Fabric)**:
>>>>     Maintains data coherence between CPU and GPU, enabling direct CPU
>>>>     access to GPU HBM without data copying
>>>>
>>>> In this architecture, GPU HBM is reported as system memory to the OS,
>>>> but it needs to be managed exclusively by the GPU driver rather than
>>>> the general OS memory allocator. This driver-managed memory provides
>>>> optimal performance for GPU workloads while enabling coherent CPU-GPU
>>>> data sharing through the XGMI. This is where SPM (Specific Purpose
>>>> Memory) becomes essential.
>>>>
>>>> **Virtualization Scenario**
>>>>
>>>> In virtualization, the hypervisor needs to expose this memory topology to
>>>> guest VMs while maintaining the same driver-managed vs OS-managed
>>>> distinction.
>>>
>>> Just wondering, could device hotplug in that model ever work? I guess we
>>> wouldn't expose the memory at all in e820 (after all, it gets hotplugged
>>> later) and instead the device driver in the guest would have to
>>> detect+hotplug that memory.
>>>
>>> But that sounds weird, because the device driver in the VM shouldn't do
>>> something virt specific.
>>>
>>> Which raises the question: how is device hotplug of such GPUs handled on
>>> bare metal? Or does it simply not work? :)
>>>
>> Hi David, Thank you for your thoughtful feedback.
>> To directly answer your question:
>> in our use case, GPU device hotplug does NOT work on bare metal,
>> and this is by design.
> 
> Cool, thanks for clarifying!
> 




* Re: [PATCH v2] numa: add 'spm' option for Specific Purpose Memory
  2025-10-20  9:07 ` fanhuang
@ 2025-11-03 12:32   ` David Hildenbrand
  2025-11-04  8:00     ` Huang, FangSheng (Jerry)
  0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2025-11-03 12:32 UTC (permalink / raw)
  To: fanhuang, qemu-devel, imammedo; +Cc: Zhigang.Luo, Lianjie.Shi

On 20.10.25 11:07, fanhuang wrote:
> This patch adds support for Specific Purpose Memory (SPM) through the
> NUMA node configuration. When 'spm=on' is specified for a NUMA node,
> QEMU will:
> 
> 1. Set the RAM_SPM flag in the RAM block of the corresponding memory region
> 2. Update the overlapping E820 RAM entries before adding E820_SOFT_RESERVED
> 3. Set the E820 type to E820_SOFT_RESERVED for this memory region
> 
> This allows guest operating systems to recognize the memory as soft reserved
> memory, which can be used for device-specific memory management without
> E820 table conflicts.
> 
> Usage:
>    -numa node,nodeid=0,memdev=m1,spm=on
> 
> Signed-off-by: fanhuang <FangSheng.Huang@amd.com>
> ---
>   hw/core/numa.c               |  3 ++
>   hw/i386/e820_memory_layout.c | 73 ++++++++++++++++++++++++++++++++++++
>   hw/i386/e820_memory_layout.h |  2 +
>   hw/i386/pc.c                 | 37 ++++++++++++++++++
>   include/exec/cpu-common.h    |  1 +
>   include/system/memory.h      |  3 ++
>   include/system/numa.h        |  1 +
>   qapi/machine.json            |  6 +++
>   system/physmem.c             |  7 +++-
>   9 files changed, 132 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/core/numa.c b/hw/core/numa.c
> index 218576f745..e680130460 100644
> --- a/hw/core/numa.c
> +++ b/hw/core/numa.c
> @@ -163,6 +163,9 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
>           numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
>       }
>   
> +    /* Store spm configuration for later processing */
> +    numa_info[nodenr].is_spm = node->has_spm && node->spm;
> +
>       numa_info[nodenr].present = true;
>       max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
>       ms->numa_state->num_nodes++;
> diff --git a/hw/i386/e820_memory_layout.c b/hw/i386/e820_memory_layout.c
> index 3e848fb69c..5b090ac6df 100644
> --- a/hw/i386/e820_memory_layout.c
> +++ b/hw/i386/e820_memory_layout.c
> @@ -46,3 +46,76 @@ bool e820_get_entry(int idx, uint32_t type, uint64_t *address, uint64_t *length)
>       }
>       return false;
>   }
> +
> +bool e820_update_entry_type(uint64_t start, uint64_t length, uint32_t new_type)
> +{
> +    uint64_t end = start + length;
> +    bool updated = false;
> +    assert(!e820_done);
> +
> +    /* For E820_SOFT_RESERVED, validate range is within E820_RAM */
> +    if (new_type == E820_SOFT_RESERVED) {
> +        bool range_in_ram = false;
> +        for (size_t j = 0; j < e820_entries; j++) {
> +            uint64_t ram_start = le64_to_cpu(e820_table[j].address);
> +            uint64_t ram_end = ram_start + le64_to_cpu(e820_table[j].length);
> +            uint32_t ram_type = le32_to_cpu(e820_table[j].type);
> +
> +            if (ram_type == E820_RAM && ram_start <= start && ram_end >= end) {
> +                range_in_ram = true;
> +                break;
> +            }
> +        }
> +        if (!range_in_ram) {
> +            return false;
> +        }
> +    }
> +
> +    /* Find entry that contains the target range and update it */
> +    for (size_t i = 0; i < e820_entries; i++) {
> +        uint64_t entry_start = le64_to_cpu(e820_table[i].address);
> +        uint64_t entry_length = le64_to_cpu(e820_table[i].length);
> +        uint64_t entry_end = entry_start + entry_length;
> +
> +        if (entry_start <= start && entry_end >= end) {
> +            uint32_t original_type = e820_table[i].type;
> +
> +            /* Remove original entry */
> +            memmove(&e820_table[i], &e820_table[i + 1],
> +                    (e820_entries - i - 1) * sizeof(struct e820_entry));
> +            e820_entries--;
> +
> +            /* Add split parts inline */
> +            if (entry_start < start) {
> +                e820_table = g_renew(struct e820_entry, e820_table,
> +                                     e820_entries + 1);
> +                e820_table[e820_entries].address = cpu_to_le64(entry_start);
> +                e820_table[e820_entries].length =
> +                    cpu_to_le64(start - entry_start);
> +                e820_table[e820_entries].type = original_type;
> +                e820_entries++;
> +            }
> +
> +            e820_table = g_renew(struct e820_entry, e820_table,
> +                                 e820_entries + 1);
> +            e820_table[e820_entries].address = cpu_to_le64(start);
> +            e820_table[e820_entries].length = cpu_to_le64(length);
> +            e820_table[e820_entries].type = cpu_to_le32(new_type);
> +            e820_entries++;
> +
> +            if (end < entry_end) {
> +                e820_table = g_renew(struct e820_entry, e820_table,
> +                                     e820_entries + 1);
> +                e820_table[e820_entries].address = cpu_to_le64(end);
> +                e820_table[e820_entries].length = cpu_to_le64(entry_end - end);
> +                e820_table[e820_entries].type = original_type;
> +                e820_entries++;
> +            }
> +
> +            updated = true;
> +            break;
> +        }
> +    }
> +
> +    return updated;
> +}
> diff --git a/hw/i386/e820_memory_layout.h b/hw/i386/e820_memory_layout.h
> index b50acfa201..657cc679e2 100644
> --- a/hw/i386/e820_memory_layout.h
> +++ b/hw/i386/e820_memory_layout.h
> @@ -15,6 +15,7 @@
>   #define E820_ACPI       3
>   #define E820_NVS        4
>   #define E820_UNUSABLE   5
> +#define E820_SOFT_RESERVED  0xEFFFFFFF
>   
>   struct e820_entry {
>       uint64_t address;
> @@ -26,5 +27,6 @@ void e820_add_entry(uint64_t address, uint64_t length, uint32_t type);
>   bool e820_get_entry(int index, uint32_t type,
>                       uint64_t *address, uint64_t *length);
>   int e820_get_table(struct e820_entry **table);
> +bool e820_update_entry_type(uint64_t start, uint64_t length, uint32_t new_type);
>   
>   #endif
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index bc048a6d13..3e50570484 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -26,6 +26,7 @@
>   #include "qemu/units.h"
>   #include "exec/target_page.h"
>   #include "hw/i386/pc.h"
> +#include "system/ramblock.h"
>   #include "hw/char/serial-isa.h"
>   #include "hw/char/parallel.h"
>   #include "hw/hyperv/hv-balloon.h"
> @@ -787,6 +788,41 @@ static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
>       return pc_above_4g_end(pcms) - 1;
>   }
>   
> +static int pc_update_spm_memory(RAMBlock *rb, void *opaque)
> +{
> +    X86MachineState *x86ms = opaque;
> +    MachineState *ms = MACHINE(x86ms);
> +    ram_addr_t offset;
> +    ram_addr_t length;
> +    bool is_spm = false;
> +
> +    /* Check if this RAM block belongs to a NUMA node with spm=on */
> +    for (int i = 0; i < ms->numa_state->num_nodes; i++) {
> +        NodeInfo *numa_info = &ms->numa_state->nodes[i];
> +        if (numa_info->is_spm && numa_info->node_memdev) {
> +            MemoryRegion *mr = &numa_info->node_memdev->mr;
> +            if (mr->ram_block == rb) {
> +                /* Mark this RAM block as SPM and set the flag */
> +                rb->flags |= RAM_SPM;
> +                is_spm = true;
> +                break;
> +            }
> +        }
> +    }
> +
> +    if (is_spm) {
> +        offset = qemu_ram_get_offset(rb) +
> +                 (0x100000000ULL - x86ms->below_4g_mem_size);
> +        length = qemu_ram_get_used_length(rb);
> +        if (!e820_update_entry_type(offset, length, E820_SOFT_RESERVED)) {
> +            warn_report("Failed to update E820 entry for SPM at 0x%" PRIx64
> +                        " length 0x%" PRIx64, offset, length);
> +        }
> +    }
> +
> +    return 0;
> +}
> +
>   /*
>    * AMD systems with an IOMMU have an additional hole close to the
>    * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> @@ -901,6 +937,7 @@ void pc_memory_init(PCMachineState *pcms,
>       if (pcms->sgx_epc.size != 0) {
>           e820_add_entry(pcms->sgx_epc.base, pcms->sgx_epc.size, E820_RESERVED);
>       }
> +    qemu_ram_foreach_block(pc_update_spm_memory, x86ms);
>   
>       if (!pcmc->has_reserved_memory &&
>           (machine->ram_slots ||
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 9b658a3f48..9b437eaa10 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -89,6 +89,7 @@ ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb);
>   ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
>   ram_addr_t qemu_ram_get_max_length(RAMBlock *rb);
>   bool qemu_ram_is_shared(RAMBlock *rb);
> +bool qemu_ram_is_spm(RAMBlock *rb);
>   bool qemu_ram_is_noreserve(RAMBlock *rb);
>   bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
>   void qemu_ram_set_uf_zeroable(RAMBlock *rb);
> diff --git a/include/system/memory.h b/include/system/memory.h
> index aa85fc27a1..0d36cbd30d 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -275,6 +275,9 @@ typedef struct IOMMUTLBEvent {
>    */
>   #define RAM_PRIVATE (1 << 13)
>   
> +/* RAM is Specific Purpose Memory */
> +#define RAM_SPM (1 << 14)
> +
>   static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>                                          IOMMUNotifierFlag flags,
>                                          hwaddr start, hwaddr end,
> diff --git a/include/system/numa.h b/include/system/numa.h
> index 1044b0eb6e..438511a756 100644
> --- a/include/system/numa.h
> +++ b/include/system/numa.h
> @@ -41,6 +41,7 @@ typedef struct NodeInfo {
>       bool present;
>       bool has_cpu;
>       bool has_gi;
> +    bool is_spm;
>       uint8_t lb_info_provided;
>       uint16_t initiator;
>       uint8_t distance[MAX_NODES];
> diff --git a/qapi/machine.json b/qapi/machine.json
> index 038eab281c..1fa31b0224 100644
> --- a/qapi/machine.json
> +++ b/qapi/machine.json
> @@ -500,6 +500,11 @@
>   # @memdev: memory backend object.  If specified for one node, it must
>   #     be specified for all nodes.
>   #
> +# @spm: if true, mark the memory region of this node as Specific
> +#     Purpose Memory (SPM). This will set the RAM_SPM flag for the
> +#     corresponding memory region and set the E820 type to
> +#     E820_SOFT_RESERVED. (default: false, since 9.2)
> +#
>   # @initiator: defined in ACPI 6.3 Chapter 5.2.27.3 Table 5-145, points
>   #     to the nodeid which has the memory controller responsible for
>   #     this NUMA node.  This field provides additional information as
> @@ -514,6 +519,7 @@
>      '*cpus':   ['uint16'],
>      '*mem':    'size',
>      '*memdev': 'str',
> +   '*spm':    'bool',
>      '*initiator': 'uint16' }}
>   
>   ##
> diff --git a/system/physmem.c b/system/physmem.c
> index ae8ecd50ea..0090d9955d 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1611,6 +1611,11 @@ bool qemu_ram_is_noreserve(RAMBlock *rb)
>       return rb->flags & RAM_NORESERVE;
>   }
>   
> +bool qemu_ram_is_spm(RAMBlock *rb)
> +{
> +    return rb->flags & RAM_SPM;
> +}
> +

IIUC, this function is unused, and the only setter is in 
pc_update_spm_memory().

Why do we have to modify the RAMBlock at all or walk over them?

Shouldn't it be sufficient to just walk over all 
&ms->numa_state->nodes[i] and update e820 accordingly?

-- 
Cheers

David / dhildenb




* Re: [PATCH v2] numa: add 'spm' option for Specific Purpose Memory
  2025-11-03  3:01       ` Huang, FangSheng (Jerry)
@ 2025-11-03 12:36         ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-03 12:36 UTC (permalink / raw)
  To: Huang, FangSheng (Jerry), qemu-devel, imammedo
  Cc: Zhigang.Luo, Lianjie.Shi, Jonathan Cameron

On 03.11.25 04:01, Huang, FangSheng (Jerry) wrote:
> Hi David,

Hi!

> 
> I hope this email finds you well. I wanted to follow up on the SPM
> patch series we discussed back in October.
> 
> I'm reaching out to check on the current status and see if there's
> anything else I should address or any additional information I can
> provide.
> 
> Thank you for your time and guidance on this!

I just commented on the implementation; I think it can be simplified.

Regarding the overall idea, it would be great to learn whether Igor has
any more concerns.

Cheers,

David



* Re: [PATCH v2] numa: add 'spm' option for Specific Purpose Memory
  2025-11-03 12:32   ` David Hildenbrand
@ 2025-11-04  8:00     ` Huang, FangSheng (Jerry)
  0 siblings, 0 replies; 12+ messages in thread
From: Huang, FangSheng (Jerry) @ 2025-11-04  8:00 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel, imammedo; +Cc: Zhigang.Luo, Lianjie.Shi



On 11/3/2025 8:32 PM, David Hildenbrand wrote:
> On 20.10.25 11:07, fanhuang wrote:
>> This patch adds support for Specific Purpose Memory (SPM) through the
>> NUMA node configuration. When 'spm=on' is specified for a NUMA node,
>> QEMU will:
>>
>> 1. Set the RAM_SPM flag in the RAM block of the corresponding memory 
>> region
>> 2. Update the overlapping E820 RAM entries before adding 
>> E820_SOFT_RESERVED
>> 3. Set the E820 type to E820_SOFT_RESERVED for this memory region
>>
>> This allows guest operating systems to recognize the memory as soft reserved
>> memory, which can be used for device-specific memory management without
>> E820 table conflicts.
>>
>> Usage:
>>    -numa node,nodeid=0,memdev=m1,spm=on
>>
>> Signed-off-by: fanhuang <FangSheng.Huang@amd.com>
>> ---
>>   hw/core/numa.c               |  3 ++
>>   hw/i386/e820_memory_layout.c | 73 ++++++++++++++++++++++++++++++++++++
>>   hw/i386/e820_memory_layout.h |  2 +
>>   hw/i386/pc.c                 | 37 ++++++++++++++++++
>>   include/exec/cpu-common.h    |  1 +
>>   include/system/memory.h      |  3 ++
>>   include/system/numa.h        |  1 +
>>   qapi/machine.json            |  6 +++
>>   system/physmem.c             |  7 +++-
>>   9 files changed, 132 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/core/numa.c b/hw/core/numa.c
>> index 218576f745..e680130460 100644
>> --- a/hw/core/numa.c
>> +++ b/hw/core/numa.c
>> @@ -163,6 +163,9 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
>>           numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
>>       }
>> +    /* Store spm configuration for later processing */
>> +    numa_info[nodenr].is_spm = node->has_spm && node->spm;
>> +
>>       numa_info[nodenr].present = true;
>>       max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
>>       ms->numa_state->num_nodes++;
>> diff --git a/hw/i386/e820_memory_layout.c b/hw/i386/e820_memory_layout.c
>> index 3e848fb69c..5b090ac6df 100644
>> --- a/hw/i386/e820_memory_layout.c
>> +++ b/hw/i386/e820_memory_layout.c
>> @@ -46,3 +46,76 @@ bool e820_get_entry(int idx, uint32_t type, uint64_t *address, uint64_t *length)
>>       }
>>       return false;
>>   }
>> +
>> +bool e820_update_entry_type(uint64_t start, uint64_t length, uint32_t new_type)
>> +{
>> +    uint64_t end = start + length;
>> +    bool updated = false;
>> +    assert(!e820_done);
>> +
>> +    /* For E820_SOFT_RESERVED, validate range is within E820_RAM */
>> +    if (new_type == E820_SOFT_RESERVED) {
>> +        bool range_in_ram = false;
>> +        for (size_t j = 0; j < e820_entries; j++) {
>> +            uint64_t ram_start = le64_to_cpu(e820_table[j].address);
>> +            uint64_t ram_end = ram_start + le64_to_cpu(e820_table[j].length);
>> +            uint32_t ram_type = le32_to_cpu(e820_table[j].type);
>> +
>> +            if (ram_type == E820_RAM && ram_start <= start && ram_end >= end) {
>> +                range_in_ram = true;
>> +                break;
>> +            }
>> +        }
>> +        if (!range_in_ram) {
>> +            return false;
>> +        }
>> +    }
>> +
>> +    /* Find entry that contains the target range and update it */
>> +    for (size_t i = 0; i < e820_entries; i++) {
>> +        uint64_t entry_start = le64_to_cpu(e820_table[i].address);
>> +        uint64_t entry_length = le64_to_cpu(e820_table[i].length);
>> +        uint64_t entry_end = entry_start + entry_length;
>> +
>> +        if (entry_start <= start && entry_end >= end) {
>> +            uint32_t original_type = e820_table[i].type;
>> +
>> +            /* Remove original entry */
>> +            memmove(&e820_table[i], &e820_table[i + 1],
>> +                    (e820_entries - i - 1) * sizeof(struct e820_entry));
>> +            e820_entries--;
>> +
>> +            /* Add split parts inline */
>> +            if (entry_start < start) {
>> +                e820_table = g_renew(struct e820_entry, e820_table,
>> +                                     e820_entries + 1);
>> +                e820_table[e820_entries].address = cpu_to_le64(entry_start);
>> +                e820_table[e820_entries].length =
>> +                    cpu_to_le64(start - entry_start);
>> +                e820_table[e820_entries].type = original_type;
>> +                e820_entries++;
>> +            }
>> +
>> +            e820_table = g_renew(struct e820_entry, e820_table,
>> +                                 e820_entries + 1);
>> +            e820_table[e820_entries].address = cpu_to_le64(start);
>> +            e820_table[e820_entries].length = cpu_to_le64(length);
>> +            e820_table[e820_entries].type = cpu_to_le32(new_type);
>> +            e820_entries++;
>> +
>> +            if (end < entry_end) {
>> +                e820_table = g_renew(struct e820_entry, e820_table,
>> +                                     e820_entries + 1);
>> +                e820_table[e820_entries].address = cpu_to_le64(end);
>> +                e820_table[e820_entries].length = cpu_to_le64(entry_end - end);
>> +                e820_table[e820_entries].type = original_type;
>> +                e820_entries++;
>> +            }
>> +
>> +            updated = true;
>> +            break;
>> +        }
>> +    }
>> +
>> +    return updated;
>> +}
>> diff --git a/hw/i386/e820_memory_layout.h b/hw/i386/e820_memory_layout.h
>> index b50acfa201..657cc679e2 100644
>> --- a/hw/i386/e820_memory_layout.h
>> +++ b/hw/i386/e820_memory_layout.h
>> @@ -15,6 +15,7 @@
>>   #define E820_ACPI       3
>>   #define E820_NVS        4
>>   #define E820_UNUSABLE   5
>> +#define E820_SOFT_RESERVED  0xEFFFFFFF
>>   struct e820_entry {
>>       uint64_t address;
>> @@ -26,5 +27,6 @@ void e820_add_entry(uint64_t address, uint64_t length, uint32_t type);
>>   bool e820_get_entry(int index, uint32_t type,
>>                       uint64_t *address, uint64_t *length);
>>   int e820_get_table(struct e820_entry **table);
>> +bool e820_update_entry_type(uint64_t start, uint64_t length, uint32_t new_type);
>>   #endif
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index bc048a6d13..3e50570484 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -26,6 +26,7 @@
>>   #include "qemu/units.h"
>>   #include "exec/target_page.h"
>>   #include "hw/i386/pc.h"
>> +#include "system/ramblock.h"
>>   #include "hw/char/serial-isa.h"
>>   #include "hw/char/parallel.h"
>>   #include "hw/hyperv/hv-balloon.h"
>> @@ -787,6 +788,41 @@ static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
>>       return pc_above_4g_end(pcms) - 1;
>>   }
>> +static int pc_update_spm_memory(RAMBlock *rb, void *opaque)
>> +{
>> +    X86MachineState *x86ms = opaque;
>> +    MachineState *ms = MACHINE(x86ms);
>> +    ram_addr_t offset;
>> +    ram_addr_t length;
>> +    bool is_spm = false;
>> +
>> +    /* Check if this RAM block belongs to a NUMA node with spm=on */
>> +    for (int i = 0; i < ms->numa_state->num_nodes; i++) {
>> +        NodeInfo *numa_info = &ms->numa_state->nodes[i];
>> +        if (numa_info->is_spm && numa_info->node_memdev) {
>> +            MemoryRegion *mr = &numa_info->node_memdev->mr;
>> +            if (mr->ram_block == rb) {
>> +                /* Mark this RAM block as SPM and set the flag */
>> +                rb->flags |= RAM_SPM;
>> +                is_spm = true;
>> +                break;
>> +            }
>> +        }
>> +    }
>> +
>> +    if (is_spm) {
>> +        offset = qemu_ram_get_offset(rb) +
>> +                 (0x100000000ULL - x86ms->below_4g_mem_size);
>> +        length = qemu_ram_get_used_length(rb);
>> +        if (!e820_update_entry_type(offset, length, E820_SOFT_RESERVED)) {
>> +            warn_report("Failed to update E820 entry for SPM at 0x%" PRIx64
>> +                        " length 0x%" PRIx64, offset, length);
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   /*
>>    * AMD systems with an IOMMU have an additional hole close to the
>>    * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>> @@ -901,6 +937,7 @@ void pc_memory_init(PCMachineState *pcms,
>>       if (pcms->sgx_epc.size != 0) {
>>           e820_add_entry(pcms->sgx_epc.base, pcms->sgx_epc.size, E820_RESERVED);
>>       }
>> +    qemu_ram_foreach_block(pc_update_spm_memory, x86ms);
>>       if (!pcmc->has_reserved_memory &&
>>           (machine->ram_slots ||
>> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
>> index 9b658a3f48..9b437eaa10 100644
>> --- a/include/exec/cpu-common.h
>> +++ b/include/exec/cpu-common.h
>> @@ -89,6 +89,7 @@ ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb);
>>   ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
>>   ram_addr_t qemu_ram_get_max_length(RAMBlock *rb);
>>   bool qemu_ram_is_shared(RAMBlock *rb);
>> +bool qemu_ram_is_spm(RAMBlock *rb);
>>   bool qemu_ram_is_noreserve(RAMBlock *rb);
>>   bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
>>   void qemu_ram_set_uf_zeroable(RAMBlock *rb);
>> diff --git a/include/system/memory.h b/include/system/memory.h
>> index aa85fc27a1..0d36cbd30d 100644
>> --- a/include/system/memory.h
>> +++ b/include/system/memory.h
>> @@ -275,6 +275,9 @@ typedef struct IOMMUTLBEvent {
>>    */
>>   #define RAM_PRIVATE (1 << 13)
>> +/* RAM is Specific Purpose Memory */
>> +#define RAM_SPM (1 << 14)
>> +
>>   static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>>                                          IOMMUNotifierFlag flags,
>>                                          hwaddr start, hwaddr end,
>> diff --git a/include/system/numa.h b/include/system/numa.h
>> index 1044b0eb6e..438511a756 100644
>> --- a/include/system/numa.h
>> +++ b/include/system/numa.h
>> @@ -41,6 +41,7 @@ typedef struct NodeInfo {
>>       bool present;
>>       bool has_cpu;
>>       bool has_gi;
>> +    bool is_spm;
>>       uint8_t lb_info_provided;
>>       uint16_t initiator;
>>       uint8_t distance[MAX_NODES];
>> diff --git a/qapi/machine.json b/qapi/machine.json
>> index 038eab281c..1fa31b0224 100644
>> --- a/qapi/machine.json
>> +++ b/qapi/machine.json
>> @@ -500,6 +500,11 @@
>>   # @memdev: memory backend object.  If specified for one node, it must
>>   #     be specified for all nodes.
>>   #
>> +# @spm: if true, mark the memory region of this node as Specific
>> +#     Purpose Memory (SPM). This will set the RAM_SPM flag for the
>> +#     corresponding memory region and set the E820 type to
>> +#     E820_SOFT_RESERVED. (default: false, since 9.2)
>> +#
>>   # @initiator: defined in ACPI 6.3 Chapter 5.2.27.3 Table 5-145, points
>>   #     to the nodeid which has the memory controller responsible for
>>   #     this NUMA node.  This field provides additional information as
>> @@ -514,6 +519,7 @@
>>      '*cpus':   ['uint16'],
>>      '*mem':    'size',
>>      '*memdev': 'str',
>> +   '*spm':    'bool',
>>      '*initiator': 'uint16' }}
>>   ##
>> diff --git a/system/physmem.c b/system/physmem.c
>> index ae8ecd50ea..0090d9955d 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -1611,6 +1611,11 @@ bool qemu_ram_is_noreserve(RAMBlock *rb)
>>       return rb->flags & RAM_NORESERVE;
>>   }
>> +bool qemu_ram_is_spm(RAMBlock *rb)
>> +{
>> +    return rb->flags & RAM_SPM;
>> +}
>> +
> 
> IIUC, this function is unused, and the only setter is in pc_update_spm_memory().
> 
> Why do we have to modify the RAMBlock at all or walk over them?
> 
> Shouldn't it be sufficient to just walk over all
> &ms->numa_state->nodes[i] and update e820 accordingly?
> 
Hi David,

Thank you for the excellent review and the insightful suggestion!

You're absolutely right - I've simplified the implementation to
directly iterate over NUMA nodes instead of RAMBlocks.
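
Either way the carving semantics stay the same: the containing E820_RAM
entry is removed and re-added as up to three non-overlapping pieces, with
the middle piece retyped. A minimal standalone model of that split (not
QEMU code; the entry table and update_type() helper here are purely
illustrative, and endianness handling is omitted):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define E820_RAM           1
#define E820_SOFT_RESERVED 0xEFFFFFFF
#define MAX_ENTRIES        16

struct entry { uint64_t addr, len; uint32_t type; };

static struct entry table[MAX_ENTRIES];
static size_t nentries;

/* Retype [start, start+length) inside the entry that fully contains it,
 * splitting that entry into up to three non-overlapping pieces. */
static int update_type(uint64_t start, uint64_t length, uint32_t new_type)
{
    uint64_t end = start + length;

    for (size_t i = 0; i < nentries; i++) {
        uint64_t es = table[i].addr;
        uint64_t ee = es + table[i].len;
        uint32_t old = table[i].type;

        if (es <= start && end <= ee) {
            /* Remove the containing entry... */
            memmove(&table[i], &table[i + 1],
                    (nentries - i - 1) * sizeof(table[0]));
            nentries--;
            assert(nentries + 3 <= MAX_ENTRIES);
            /* ...and re-add head, retyped middle, and tail pieces. */
            if (es < start) {
                table[nentries++] = (struct entry){ es, start - es, old };
            }
            table[nentries++] = (struct entry){ start, length, new_type };
            if (end < ee) {
                table[nentries++] = (struct entry){ end, ee - end, old };
            }
            return 1;
        }
    }
    return 0; /* no containing entry: caller warns, as in the patch */
}
```

Carving a 1 GiB SPM range out of a 4 GiB above-4G RAM entry then yields a
RAM head, an E820_SOFT_RESERVED middle, and a RAM tail with no overlaps,
which is the property OVMF needs.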

I'll send v3 after internal review. I also understand Igor's
feedback would be valuable - I'll wait to hear if he has any
concerns.

Best regards,
Jerry Huang



end of thread, other threads:[~2025-11-04  8:39 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-20  9:07 [PATCH v2] numa: add 'spm' option for Specific Purpose Memory fanhuang
2025-10-20  9:07 ` fanhuang
2025-11-03 12:32   ` David Hildenbrand
2025-11-04  8:00     ` Huang, FangSheng (Jerry)
2025-10-20 10:15 ` Jonathan Cameron via
2025-10-20 20:03   ` David Hildenbrand
2025-10-22 10:19     ` Huang, FangSheng (Jerry)
2025-10-20 20:10 ` David Hildenbrand
2025-10-22 10:09   ` Huang, FangSheng (Jerry)
2025-10-22 10:28     ` David Hildenbrand
2025-11-03  3:01       ` Huang, FangSheng (Jerry)
2025-11-03 12:36         ` David Hildenbrand (Red Hat)
