* [PATCH v7 0/1] numa: add 'memmap-type' option for memory type configuration
@ 2026-03-06  8:27 fanhuang
  2026-03-06  8:27 ` [PATCH v7 1/1] " fanhuang
  2026-03-13  8:30 ` [PATCH v7 0/1] " Huang, FangSheng (Jerry)
  0 siblings, 2 replies; 11+ messages in thread
From: fanhuang @ 2026-03-06  8:27 UTC (permalink / raw)
  To: qemu-devel, david, imammedo, gourry, jonathan.cameron
  Cc: apopple, dan.j.williams, Zhigang.Luo, Lianjie.Shi, fanhuang

Hi all,

This is v7 of the SPM (Specific Purpose Memory) patch. Thank you
David for the Acked-by, and Gregory for the Reviewed-by and for
catching the hardcoded address bug.

Changes in v7:
- Fixed pc_update_numa_memory_types() to use x86ms->above_4g_mem_start
  instead of hardcoded 0x100000000ULL (4 GiB). On AMD hosts with IOMMU,
  above_4g_mem_start is relocated to above 1 TB, so the hardcode would
  produce wrong guest physical addresses. (spotted by Gregory Price)
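
  For clarity, the corrected mapping can be sketched like this
  (illustrative Python, not QEMU code; the parameter names mirror the
  X86MachineState fields mentioned above):

  ```python
  def node_guest_addr(node_offset, below_4g_mem_size, above_4g_mem_start):
      """Map a node's offset in the linear RAM layout to a guest physical
      address. Offsets past the below-4G split must be based on the
      machine's above_4g_mem_start (which can be relocated above 1 TiB on
      AMD hosts with an IOMMU), not on a hardcoded 4 GiB base."""
      if node_offset < below_4g_mem_size:
          return node_offset
      return above_4g_mem_start + (node_offset - below_4g_mem_size)

  # With above_4g_mem_start relocated to 1 TiB and a 2 GiB below-4G split,
  # an offset of 4 GiB lands above 1 TiB rather than at 4 GiB:
  print(hex(node_guest_addr(0x100000000, 0x80000000, 1 << 40)))  # 0x10080000000
  ```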

Changes in v6:
- Added validation: memmap-type now requires memdev to be specified,
  to avoid misconfiguration on memory-less NUMA nodes
- Simplified pc_update_numa_memory_types() by replacing switch/goto
  with a direct conditional expression
- Reserved memory nodes are now excluded from SRAT memory affinity
  entries, since E820 already marks them as reserved and SRAT should
  not report them as enabled memory affinity

Use case:
This feature allows marking NUMA node memory as Specific Purpose Memory
(SPM) or reserved in the E820 table. SPM serves as a hint to the guest
that this memory might be managed by device drivers based on guest policy.

Example usage:
  -object memory-backend-ram,size=8G,id=m0
  -object memory-backend-memfd,size=8G,id=m1
  -numa node,nodeid=0,memdev=m0
  -numa node,nodeid=1,memdev=m1,memmap-type=spm

Supported memmap-type values:
  - normal:   Regular system RAM (E820 type 1, default)
  - spm:      Specific Purpose Memory (E820 type 0xEFFFFFFF), a hint
              that this memory might be managed by device drivers
  - reserved: Reserved memory (E820 type 2), not usable as RAM
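
The E820 manipulation underneath (the new e820_update_entry_type() in the
patch) can be modeled as follows; this is an illustrative Python sketch of
the splitting logic only, not QEMU code:

```python
E820_RAM = 1
E820_RESERVED = 2
E820_SOFT_RESERVED = 0xEFFFFFFF  # "spm"

def update_entry_type(table, start, length, new_type):
    """Retype [start, start+length) if it is fully contained in one entry,
    splitting that entry into up to three pieces. Entries are
    (address, length, type) tuples; returns True on success."""
    end = start + length
    for i, (e_start, e_len, e_type) in enumerate(table):
        if e_start <= start and e_start + e_len >= end:
            parts = []
            if e_start < start:                      # head keeps old type
                parts.append((e_start, start - e_start, e_type))
            parts.append((start, length, new_type))  # retyped middle
            if end < e_start + e_len:                # tail keeps old type
                parts.append((end, e_start + e_len - end, e_type))
            table[i:i + 1] = parts
            return True
    return False

# Marking the middle 8 GiB of a 16 GiB RAM entry as SPM splits it in three:
GiB = 1 << 30
table = [(0, 16 * GiB, E820_RAM)]
update_entry_type(table, 4 * GiB, 8 * GiB, E820_SOFT_RESERVED)
```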

Please review. Thanks!

Best regards,
Jerry Huang

fanhuang (1):
  numa: add 'memmap-type' option for memory type configuration

 hw/core/numa.c               | 24 ++++++++++++
 hw/i386/acpi-build.c         |  8 ++++
 hw/i386/e820_memory_layout.c | 72 ++++++++++++++++++++++++++++++++++++
 hw/i386/e820_memory_layout.h | 12 +++---
 hw/i386/pc.c                 | 48 ++++++++++++++++++++++++
 include/system/numa.h        |  7 ++++
 qapi/machine.json            | 24 ++++++++++++
 qemu-options.hx              | 14 ++++++-
 8 files changed, 202 insertions(+), 7 deletions(-)

-- 
2.34.1



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v7 1/1] numa: add 'memmap-type' option for memory type configuration
  2026-03-06  8:27 [PATCH v7 0/1] numa: add 'memmap-type' option for memory type configuration fanhuang
@ 2026-03-06  8:27 ` fanhuang
  2026-05-14 13:05   ` Igor Mammedov
  2026-03-13  8:30 ` [PATCH v7 0/1] " Huang, FangSheng (Jerry)
  1 sibling, 1 reply; 11+ messages in thread
From: fanhuang @ 2026-03-06  8:27 UTC (permalink / raw)
  To: qemu-devel, david, imammedo, gourry, jonathan.cameron
  Cc: apopple, dan.j.williams, Zhigang.Luo, Lianjie.Shi, fanhuang,
	David Hildenbrand

Add a 'memmap-type' option to NUMA node configuration that allows
specifying the memory type for a NUMA node.

Supported values:
  - normal:   Regular system RAM (E820 type 1, default)
  - spm:      Specific Purpose Memory (E820 type 0xEFFFFFFF)
  - reserved: Reserved memory (E820 type 2)

The 'spm' type indicates Specific Purpose Memory - a hint to the guest
that this memory might be managed by device drivers based on guest policy.
The 'reserved' type marks memory as not usable as RAM.

Note: This option is only supported on x86 platforms.

Usage:
  -numa node,nodeid=1,memdev=m1,memmap-type=spm

Signed-off-by: fanhuang <FangSheng.Huang@amd.com>
Acked-by: David Hildenbrand <david@kernel.org>
Reviewed-by: Gregory Price <gourry@gourry.net>
---
 hw/core/numa.c               | 24 ++++++++++++
 hw/i386/acpi-build.c         |  8 ++++
 hw/i386/e820_memory_layout.c | 72 ++++++++++++++++++++++++++++++++++++
 hw/i386/e820_memory_layout.h | 12 +++---
 hw/i386/pc.c                 | 48 ++++++++++++++++++++++++
 include/system/numa.h        |  7 ++++
 qapi/machine.json            | 24 ++++++++++++
 qemu-options.hx              | 14 ++++++-
 8 files changed, 202 insertions(+), 7 deletions(-)

diff --git a/hw/core/numa.c b/hw/core/numa.c
index f462883c87..521c8f10f1 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -38,6 +38,7 @@
 #include "hw/mem/pc-dimm.h"
 #include "hw/core/boards.h"
 #include "hw/mem/memory-device.h"
+#include "hw/i386/x86.h"
 #include "qemu/option.h"
 #include "qemu/config-file.h"
 #include "qemu/cutils.h"
@@ -164,6 +165,29 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
         numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
     }
 
+    if (node->has_memmap_type && node->memmap_type != NUMA_MEMMAP_TYPE_NORMAL) {
+        if (!node->memdev) {
+            error_setg(errp, "memmap-type=%s requires memdev to be specified",
+                       NumaMemmapType_str(node->memmap_type));
+            return;
+        }
+        if (!object_dynamic_cast(OBJECT(ms), TYPE_X86_MACHINE)) {
+            error_setg(errp, "memmap-type=%s is only supported on x86 machines",
+                       NumaMemmapType_str(node->memmap_type));
+            return;
+        }
+        switch (node->memmap_type) {
+        case NUMA_MEMMAP_TYPE_SPM:
+            numa_info[nodenr].memmap_type = NUMA_MEMMAP_SPM;
+            break;
+        case NUMA_MEMMAP_TYPE_RESERVED:
+            numa_info[nodenr].memmap_type = NUMA_MEMMAP_RESERVED;
+            break;
+        default:
+            break;
+        }
+    }
+
     numa_info[nodenr].present = true;
     max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
     ms->numa_state->num_nodes++;
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index f622b91b76..521bf66ca1 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -1417,6 +1417,14 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
         mem_len = numa_info[i - 1].node_mem;
         next_base = mem_base + mem_len;
 
+        /*
+         * Skip reserved memory nodes - E820 marks them as reserved,
+         * so SRAT should not report them as enabled memory affinity.
+         */
+        if (numa_info[i - 1].memmap_type == NUMA_MEMMAP_RESERVED) {
+            continue;
+        }
+
         /* Cut out the 640K hole */
         if (mem_base <= HOLE_640K_START &&
             next_base > HOLE_640K_START) {
diff --git a/hw/i386/e820_memory_layout.c b/hw/i386/e820_memory_layout.c
index 3e848fb69c..4c62b5ddea 100644
--- a/hw/i386/e820_memory_layout.c
+++ b/hw/i386/e820_memory_layout.c
@@ -46,3 +46,75 @@ bool e820_get_entry(int idx, uint32_t type, uint64_t *address, uint64_t *length)
     }
     return false;
 }
+
+bool e820_update_entry_type(uint64_t start, uint64_t length, uint32_t new_type)
+{
+    uint64_t end = start + length;
+    assert(!e820_done);
+
+    /* For E820_SOFT_RESERVED, validate range is within E820_RAM */
+    if (new_type == E820_SOFT_RESERVED) {
+        bool range_in_ram = false;
+
+        for (size_t j = 0; j < e820_entries; j++) {
+            uint64_t ram_start = le64_to_cpu(e820_table[j].address);
+            uint64_t ram_end = ram_start + le64_to_cpu(e820_table[j].length);
+            uint32_t ram_type = le32_to_cpu(e820_table[j].type);
+
+            if (ram_type == E820_RAM && ram_start <= start && ram_end >= end) {
+                range_in_ram = true;
+                break;
+            }
+        }
+        if (!range_in_ram) {
+            return false;
+        }
+    }
+
+    /* Find entry that contains the target range and update it */
+    for (size_t i = 0; i < e820_entries; i++) {
+        uint64_t entry_start = le64_to_cpu(e820_table[i].address);
+        uint64_t entry_length = le64_to_cpu(e820_table[i].length);
+        uint64_t entry_end = entry_start + entry_length;
+
+        if (entry_start <= start && entry_end >= end) {
+            uint32_t original_type = e820_table[i].type;
+
+            /* Remove original entry */
+            memmove(&e820_table[i], &e820_table[i + 1],
+                    (e820_entries - i - 1) * sizeof(struct e820_entry));
+            e820_entries--;
+
+            /* Add split parts inline */
+            if (entry_start < start) {
+                e820_table = g_renew(struct e820_entry, e820_table,
+                                     e820_entries + 1);
+                e820_table[e820_entries].address = cpu_to_le64(entry_start);
+                e820_table[e820_entries].length =
+                    cpu_to_le64(start - entry_start);
+                e820_table[e820_entries].type = original_type;
+                e820_entries++;
+            }
+
+            e820_table = g_renew(struct e820_entry, e820_table,
+                                 e820_entries + 1);
+            e820_table[e820_entries].address = cpu_to_le64(start);
+            e820_table[e820_entries].length = cpu_to_le64(length);
+            e820_table[e820_entries].type = cpu_to_le32(new_type);
+            e820_entries++;
+
+            if (end < entry_end) {
+                e820_table = g_renew(struct e820_entry, e820_table,
+                                     e820_entries + 1);
+                e820_table[e820_entries].address = cpu_to_le64(end);
+                e820_table[e820_entries].length = cpu_to_le64(entry_end - end);
+                e820_table[e820_entries].type = original_type;
+                e820_entries++;
+            }
+
+            return true;
+        }
+    }
+
+    return false;
+}
diff --git a/hw/i386/e820_memory_layout.h b/hw/i386/e820_memory_layout.h
index b50acfa201..a85b4fd14c 100644
--- a/hw/i386/e820_memory_layout.h
+++ b/hw/i386/e820_memory_layout.h
@@ -10,11 +10,12 @@
 #define HW_I386_E820_MEMORY_LAYOUT_H
 
 /* e820 types */
-#define E820_RAM        1
-#define E820_RESERVED   2
-#define E820_ACPI       3
-#define E820_NVS        4
-#define E820_UNUSABLE   5
+#define E820_RAM            1
+#define E820_RESERVED       2
+#define E820_ACPI           3
+#define E820_NVS            4
+#define E820_UNUSABLE       5
+#define E820_SOFT_RESERVED  0xEFFFFFFF
 
 struct e820_entry {
     uint64_t address;
@@ -26,5 +27,6 @@ void e820_add_entry(uint64_t address, uint64_t length, uint32_t type);
 bool e820_get_entry(int index, uint32_t type,
                     uint64_t *address, uint64_t *length);
 int e820_get_table(struct e820_entry **table);
+bool e820_update_entry_type(uint64_t start, uint64_t length, uint32_t new_type);
 
 #endif
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 819e729a6e..c024a34db2 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -740,6 +740,51 @@ static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
     return pc_above_4g_end(pcms) - 1;
 }
 
+/*
+ * Update E820 entries for NUMA nodes with non-default memory types.
+ */
+static void pc_update_numa_memory_types(X86MachineState *x86ms)
+{
+    MachineState *ms = MACHINE(x86ms);
+    uint64_t addr = 0;
+
+    for (int i = 0; i < ms->numa_state->num_nodes; i++) {
+        NodeInfo *numa_info = &ms->numa_state->nodes[i];
+        uint64_t node_size = numa_info->node_mem;
+
+        if (numa_info->node_memdev &&
+            (numa_info->memmap_type == NUMA_MEMMAP_SPM ||
+             numa_info->memmap_type == NUMA_MEMMAP_RESERVED)) {
+            uint64_t guest_addr;
+            uint32_t e820_type = (numa_info->memmap_type == NUMA_MEMMAP_SPM)
+                                  ? E820_SOFT_RESERVED : E820_RESERVED;
+
+            if (addr < x86ms->below_4g_mem_size) {
+                if (addr + node_size <= x86ms->below_4g_mem_size) {
+                    guest_addr = addr;
+                } else {
+                    error_report("NUMA node %d with memmap-type spans across "
+                                 "4GB boundary, not supported", i);
+                    exit(EXIT_FAILURE);
+                }
+            } else {
+                guest_addr = x86ms->above_4g_mem_start +
+                            (addr - x86ms->below_4g_mem_size);
+            }
+
+            if (!e820_update_entry_type(guest_addr, node_size, e820_type)) {
+                warn_report("Failed to update E820 entry for node %d "
+                           "at 0x%" PRIx64 " length 0x%" PRIx64,
+                           i, guest_addr, node_size);
+            }
+        }
+
+        if (numa_info->node_memdev) {
+            addr += node_size;
+        }
+    }
+}
+
 /*
  * AMD systems with an IOMMU have an additional hole close to the
  * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
@@ -856,6 +901,9 @@ void pc_memory_init(PCMachineState *pcms,
         e820_add_entry(pcms->sgx_epc.base, pcms->sgx_epc.size, E820_RESERVED);
     }
 
+    /* Update E820 for NUMA nodes with special memory types */
+    pc_update_numa_memory_types(x86ms);
+
     if (!pcmc->has_reserved_memory &&
         (machine->ram_slots ||
          (machine->maxram_size > machine->ram_size))) {
diff --git a/include/system/numa.h b/include/system/numa.h
index 1044b0eb6e..64e8f63736 100644
--- a/include/system/numa.h
+++ b/include/system/numa.h
@@ -35,12 +35,19 @@ enum {
 
 #define UINT16_BITS       16
 
+typedef enum {
+    NUMA_MEMMAP_NORMAL = 0,
+    NUMA_MEMMAP_SPM,
+    NUMA_MEMMAP_RESERVED,
+} NumaMemmapTypeInternal;
+
 typedef struct NodeInfo {
     uint64_t node_mem;
     struct HostMemoryBackend *node_memdev;
     bool present;
     bool has_cpu;
     bool has_gi;
+    NumaMemmapTypeInternal memmap_type;
     uint8_t lb_info_provided;
     uint16_t initiator;
     uint8_t distance[MAX_NODES];
diff --git a/qapi/machine.json b/qapi/machine.json
index 685e4e29b8..67ba487f6c 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -466,6 +466,22 @@
 { 'enum': 'NumaOptionsType',
   'data': [ 'node', 'dist', 'cpu', 'hmat-lb', 'hmat-cache' ] }
 
+##
+# @NumaMemmapType:
+#
+# Memory mapping type for a NUMA node.
+#
+# @normal: Normal system RAM (E820 type 1)
+#
+# @spm: Specific Purpose Memory (E820 type 0xEFFFFFFF)
+#
+# @reserved: Reserved memory (E820 type 2)
+#
+# Since: 10.2
+##
+{ 'enum': 'NumaMemmapType',
+  'data': ['normal', 'spm', 'reserved'] }
+
 ##
 # @NumaOptions:
 #
@@ -502,6 +518,13 @@
 # @memdev: memory backend object.  If specified for one node, it must
 #     be specified for all nodes.
 #
+# @memmap-type: specifies the memory type for this NUMA node.
+#     'normal' (default) is regular system RAM.
+#     'spm' is Specific Purpose Memory - a hint to the guest that
+#     this memory might be managed by device drivers based on policy.
+#     'reserved' is reserved memory, not usable as RAM.
+#     Currently only supported on x86.  (since 10.2)
+#
 # @initiator: defined in ACPI 6.3 Chapter 5.2.27.3 Table 5-145, points
 #     to the nodeid which has the memory controller responsible for
 #     this NUMA node.  This field provides additional information as
@@ -516,6 +539,7 @@
    '*cpus':   ['uint16'],
    '*mem':    'size',
    '*memdev': 'str',
+   '*memmap-type': 'NumaMemmapType',
    '*initiator': 'uint16' }}
 
 ##
diff --git a/qemu-options.hx b/qemu-options.hx
index 0da2b4d034..c898428822 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -433,7 +433,7 @@ ERST
 
 DEF("numa", HAS_ARG, QEMU_OPTION_numa,
     "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
-    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
+    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node][,memmap-type=normal|spm|reserved]\n"
     "-numa dist,src=source,dst=destination,val=distance\n"
     "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n"
     "-numa hmat-lb,initiator=node,target=node,hierarchy=memory|first-level|second-level|third-level,data-type=access-latency|read-latency|write-latency[,latency=lat][,bandwidth=bw]\n"
@@ -442,7 +442,7 @@ DEF("numa", HAS_ARG, QEMU_OPTION_numa,
 SRST
 ``-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=initiator]``
   \ 
-``-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=initiator]``
+``-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=initiator][,memmap-type=type]``
   \
 ``-numa dist,src=source,dst=destination,val=distance``
   \ 
@@ -510,6 +510,16 @@ SRST
     largest bandwidth) to this NUMA node. Note that this option can be
     set only when the machine property 'hmat' is set to 'on'.
 
+    '\ ``memmap-type``\ ' specifies the memory type for this NUMA node:
+
+    - ``normal`` (default): Regular system RAM (E820 type 1)
+    - ``spm``: Specific Purpose Memory (E820 type 0xEFFFFFFF). This is a
+      hint to the guest that the memory might be managed by device drivers
+      based on guest policy.
+    - ``reserved``: Reserved memory (E820 type 2), not usable as RAM.
+
+    This option is only supported on x86 platforms.
+
     Following example creates a machine with 2 NUMA nodes, node 0 has
     CPU. node 1 has only memory, and its initiator is node 0. Note that
     because node 0 has CPU, by default the initiator of node 0 is itself
-- 
2.34.1




* Re: [PATCH v7 0/1] numa: add 'memmap-type' option for memory type configuration
  2026-03-06  8:27 [PATCH v7 0/1] numa: add 'memmap-type' option for memory type configuration fanhuang
  2026-03-06  8:27 ` [PATCH v7 1/1] " fanhuang
@ 2026-03-13  8:30 ` Huang, FangSheng (Jerry)
  2026-03-13 15:18   ` Gregory Price
  1 sibling, 1 reply; 11+ messages in thread
From: Huang, FangSheng (Jerry) @ 2026-03-13  8:30 UTC (permalink / raw)
  To: qemu-devel, gourry
  Cc: apopple, dan.j.williams, Zhigang.Luo, Lianjie.Shi, david,
	imammedo, jonathan.cameron


On 3/6/2026 4:27 PM, fanhuang wrote:
> Hi all,
> 
> This is v7 of the SPM (Specific Purpose Memory) patch. Thank you
> David for the Acked-by, and Gregory for the Reviewed-by and for
> catching the hardcoded address bug.
> 
> Changes in v7:
> - Fixed pc_update_numa_memory_types() to use x86ms->above_4g_mem_start
>    instead of hardcoded 0x100000000ULL (4 GiB). On AMD hosts with IOMMU,
>    above_4g_mem_start is relocated to above 1 TB, so the hardcode would
>    produce wrong guest physical addresses. (spotted by Gregory Price)
> 
> Changes in v6:
> - Added validation: memmap-type now requires memdev to be specified,
>    to avoid misconfiguration on memory-less NUMA nodes
> - Simplified pc_update_numa_memory_types() by replacing switch/goto
>    with a direct conditional expression
> - Reserved memory nodes are now excluded from SRAT memory affinity
>    entries, since E820 already marks them as reserved and SRAT should
>    not report them as enabled memory affinity
> 
> Use case:
> This feature allows marking NUMA node memory as Specific Purpose Memory
> (SPM) or reserved in the E820 table. SPM serves as a hint to the guest
> that this memory might be managed by device drivers based on guest policy
> 
> Example usage:
>    -object memory-backend-ram,size=8G,id=m0
>    -object memory-backend-memfd,size=8G,id=m1
>    -numa node,nodeid=0,memdev=m0
>    -numa node,nodeid=1,memdev=m1,memmap-type=spm
> 
> Supported memmap-type values:
>    - normal:   Regular system RAM (E820 type 1, default)
>    - spm:      Specific Purpose Memory (E820 type 0xEFFFFFFF), a hint
>                that this memory might be managed by device drivers
>    - reserved: Reserved memory (E820 type 2), not usable as RAM
> 
> Please review. Thanks!
> 
> Best regards,
> Jerry Huang
> 
> fanhuang (1):
>    numa: add 'memmap-type' option for memory type configuration
> 
>   hw/core/numa.c               | 24 ++++++++++++
>   hw/i386/acpi-build.c         |  8 ++++
>   hw/i386/e820_memory_layout.c | 72 ++++++++++++++++++++++++++++++++++++
>   hw/i386/e820_memory_layout.h | 12 +++---
>   hw/i386/pc.c                 | 48 ++++++++++++++++++++++++
>   include/system/numa.h        |  7 ++++
>   qapi/machine.json            | 24 ++++++++++++
>   qemu-options.hx              | 14 ++++++-
>   8 files changed, 202 insertions(+), 7 deletions(-)
> 
Hi Gregory,

Thanks again for the thorough review on v6 and for catching the hardcoded
address issue — v7 has that fixed, now using x86ms->above_4g_mem_start as
you suggested.

Just wanted to follow up and check if your Reviewed-by still applies to v7,
since it's a minimal change from v6 (only the one-line fix you identified).

On a related note, the corresponding OVMF patch for E820 Soft Reserved
support has been merged upstream (edk2 PR #11964), so the firmware side is
now in place. The QEMU patch is the remaining piece to complete the 
pipeline.

Could you also advise on the next steps for getting this merged?
Is there a target merge window? Happy to provide anything else that's
needed.

Thanks,
Jerry Huang



* Re: [PATCH v7 0/1] numa: add 'memmap-type' option for memory type configuration
  2026-03-13  8:30 ` [PATCH v7 0/1] " Huang, FangSheng (Jerry)
@ 2026-03-13 15:18   ` Gregory Price
  2026-03-13 16:14     ` Jonathan Cameron via qemu development
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Price @ 2026-03-13 15:18 UTC (permalink / raw)
  To: Huang, FangSheng (Jerry)
  Cc: qemu-devel, apopple, dan.j.williams, Zhigang.Luo, Lianjie.Shi,
	david, imammedo, jonathan.cameron

On Fri, Mar 13, 2026 at 04:30:20PM +0800, Huang, FangSheng (Jerry) wrote:
> Hi Gregory,
> 
> Thanks again for the thorough review on v6 and for catching the hardcoded
> address issue — v7 has that fixed, now using x86ms->above_4g_mem_start as
> you suggested.
> 
> Just wanted to follow up and check if your Reviewed-by still applies to v7,
> since it's a minimal change from v6 (only the one-line fix you identified).
> 

Yes sorry, we're good to go

Reviewed-by: Gregory Price <gourry@gourry.net>

Thank you for this, this will be very helpful for testing

Jonathan would be able to answer more questions re: upstream



* Re: [PATCH v7 0/1] numa: add 'memmap-type' option for memory type configuration
  2026-03-13 15:18   ` Gregory Price
@ 2026-03-13 16:14     ` Jonathan Cameron via qemu development
  2026-03-16  7:17       ` Huang, FangSheng (Jerry)
  0 siblings, 1 reply; 11+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-03-13 16:14 UTC (permalink / raw)
  To: Gregory Price
  Cc: Huang, FangSheng (Jerry), qemu-devel, apopple, dan.j.williams,
	Zhigang.Luo, Lianjie.Shi, david, imammedo

On Fri, 13 Mar 2026 11:18:18 -0400
Gregory Price <gourry@gourry.net> wrote:

> On Fri, Mar 13, 2026 at 04:30:20PM +0800, Huang, FangSheng (Jerry) wrote:
> > Hi Gregory,
> > 
> > Thanks again for the thorough review on v6 and for catching the hardcoded
> > address issue — v7 has that fixed, now using x86ms->above_4g_mem_start as
> > you suggested.
> > 
> > Just wanted to follow up and check if your Reviewed-by still applies to v7,
> > since it's a minimal change from v6 (only the one-line fix you identified).
> >   
> 
> Yes sorry, we're good to go
> 
> Reviewed-by: Gregory Price <gourry@gourry.net>
> 
> Thank you for this, this will be very helpful for testing
> 
> Jonathan would be able to answer more questions re: upstream

I can provide general details, but the specifics of this patch are probably
something for Igor, as it's not CXL specific.

Soft freeze for 11.0 has passed, so this will be an 11.1 feature.
https://wiki.qemu.org/Planning/11.0
is the 11.0 timeline. (The 11.1 page will go up around the time of the
11.0 release.)

Jonathan



* Re: [PATCH v7 0/1] numa: add 'memmap-type' option for memory type configuration
  2026-03-13 16:14     ` Jonathan Cameron via qemu development
@ 2026-03-16  7:17       ` Huang, FangSheng (Jerry)
  2026-04-27  8:47         ` Huang, FangSheng (Jerry)
  0 siblings, 1 reply; 11+ messages in thread
From: Huang, FangSheng (Jerry) @ 2026-03-16  7:17 UTC (permalink / raw)
  To: Jonathan Cameron, Gregory Price, Igor Mammedov
  Cc: qemu-devel, apopple, dan.j.williams, Zhigang.Luo, Lianjie.Shi,
	david


On 3/14/2026 12:14 AM, Jonathan Cameron wrote:
> On Fri, 13 Mar 2026 11:18:18 -0400
> Gregory Price <gourry@gourry.net> wrote:
> 
>> On Fri, Mar 13, 2026 at 04:30:20PM +0800, Huang, FangSheng (Jerry) wrote:
>>> Hi Gregory,
>>>
>>> Thanks again for the thorough review on v6 and for catching the hardcoded
>>> address issue — v7 has that fixed, now using x86ms->above_4g_mem_start as
>>> you suggested.
>>>
>>> Just wanted to follow up and check if your Reviewed-by still applies to v7,
>>> since it's a minimal change from v6 (only the one-line fix you identified).
>>>    
>>
>> Yes sorry, we're good to go
>>
>> Reviewed-by: Gregory Price <gourry@gourry.net>
>>
>> Thank you for this, this will be very helpful for testing
>>
>> Jonathan would be able to answer more questions re: upstream
> 
> Can provide general details, but specifics of this patch is probably
> something for Igor as it's not CXL specific.
> 
> Soft freeze for 11.0 has passed, so this will be a 11.1 feature.
> https://wiki.qemu.org/Planning/11.0
> Is the 11.0 timeline. (11.1 will go up around time of 11.0 release).
> 
> Jonathan

Hi Gregory, Jonathan,

Thank you both!

Gregory — much appreciated for the Reviewed-by and your continued
support throughout the review process.

Jonathan — thanks for clarifying the timeline. Understood that this
would target the 11.1 cycle. Will follow up with Igor when the 11.1 
window opens up.

Best regards,
Jerry Huang



* Re: [PATCH v7 0/1] numa: add 'memmap-type' option for memory type configuration
  2026-03-16  7:17       ` Huang, FangSheng (Jerry)
@ 2026-04-27  8:47         ` Huang, FangSheng (Jerry)
  0 siblings, 0 replies; 11+ messages in thread
From: Huang, FangSheng (Jerry) @ 2026-04-27  8:47 UTC (permalink / raw)
  To: Jonathan Cameron, Gregory Price, Igor Mammedov
  Cc: qemu-devel, apopple, dan.j.williams, Zhigang.Luo, Lianjie.Shi,
	david

Hi Igor,

Gentle ping on this v7 patch. The QEMU 11.1 development window opened on
April 24, so I wanted to follow up as discussed in the previous round
(Jonathan suggested I reach out once 11.1 was open).

The patch carries the following review tags from v7:

   Acked-by: David Hildenbrand <david@redhat.com>
   Reviewed-by: Gregory Price <gourry@gourry.net>

Patch link:
https://lore.kernel.org/qemu-devel/20260306082735.1106690-2-FangSheng.Huang@amd.com/

Could you take a look and consider queuing it for 11.1?

Best regards,
Jerry Huang

On 3/16/2026 3:17 PM, Huang, FangSheng (Jerry) wrote:
> 
> On 3/14/2026 12:14 AM, Jonathan Cameron wrote:
>> On Fri, 13 Mar 2026 11:18:18 -0400
>> Gregory Price <gourry@gourry.net> wrote:
>>
>>> On Fri, Mar 13, 2026 at 04:30:20PM +0800, Huang, FangSheng (Jerry) 
>>> wrote:
>>>> Hi Gregory,
>>>>
>>>> Thanks again for the thorough review on v6 and for catching the 
>>>> hardcoded
>>>> address issue — v7 has that fixed, now using x86ms->above_4g_mem_start
>>>> as you suggested.
>>>>
>>>> Just wanted to follow up and check if your Reviewed-by still applies 
>>>> to v7,
>>>> since it's a minimal change from v6 (only the one-line fix you 
>>>> identified).
>>>
>>> Yes sorry, we're good to go
>>>
>>> Reviewed-by: Gregory Price <gourry@gourry.net>
>>>
>>> Thank you for this, this will be very helpful for testing
>>>
>>> Jonathan would be able to answer more questions re: upstream
>>
>> Can provide general details, but specifics of this patch is probably
>> something for Igor as it's not CXL specific.
>>
>> Soft freeze for 11.0 has passed, so this will be a 11.1 feature.
>> https://wiki.qemu.org/Planning/11.0
>> Is the 11.0 timeline. (11.1 will go up around time of 11.0 release).
>>
>> Jonathan
> 
> Hi Gregory, Jonathan,
> 
> Thank you both!
> 
> Gregory — much appreciated for the Reviewed-by and your continued
> support throughout the review process.
> 
> Jonathan — thanks for clarifying the timeline. Understood that this
> would target the 11.1 cycle. Will follow up with Igor when the 11.1 
> window opens up.
> 
> Best regards,
> Jerry Huang




* Re: [PATCH v7 1/1] numa: add 'memmap-type' option for memory type configuration
  2026-03-06  8:27 ` [PATCH v7 1/1] " fanhuang
@ 2026-05-14 13:05   ` Igor Mammedov
  2026-05-14 13:38     ` Gregory Price
  2026-05-15  7:53     ` Huang, FangSheng (Jerry)
  0 siblings, 2 replies; 11+ messages in thread
From: Igor Mammedov @ 2026-05-14 13:05 UTC (permalink / raw)
  To: fanhuang
  Cc: qemu-devel, david, gourry, jonathan.cameron, apopple,
	dan.j.williams, Zhigang.Luo, Lianjie.Shi, David Hildenbrand

On Fri, 6 Mar 2026 16:27:35 +0800
fanhuang <FangSheng.Huang@amd.com> wrote:

> Add a 'memmap-type' option to NUMA node configuration that allows
> specifying the memory type for a NUMA node.
> 
> Supported values:
>   - normal:   Regular system RAM (E820 type 1, default)
>   - spm:      Specific Purpose Memory (E820 type 0xEFFFFFFF)
>   - reserved: Reserved memory (E820 type 2)
> 
> The 'spm' type indicates Specific Purpose Memory - a hint to the guest
> that this memory might be managed by device drivers based on guest policy.
> The 'reserved' type marks memory as not usable as RAM.
> 
> Note: This option is only supported on x86 platforms.
> 
> Usage:
>   -numa node,nodeid=1,memdev=m1,memmap-type=spm

in short:
  don't do it this way
  I'm against merging it as is, till you convince me otherwise.

more detailed answer:

* mandatory bashing chapter:

the more I look at it, the hackier this approach looks to me,
and what's even worse, that nonsense propagates to the firmware.

Judging by the commit message, the goal is to expose some RAM as
E820 SPM to the guest (that's it).

You, however, picked -numa node as the way to achieve that,
and then hack the numa code not to generate NUMA data for it (SRAT)
and massage e820 to exclude SPM from RAM entries.

But at this stage I don't really see a good justification for the hack(s)
this patch introduces (it's definitely not in the commit message nor the
cover letter).

And until an alternative approach is explored and proved to be worse,
I'm against merging this patch.

* suggestion chapter:

I don't recall, but I likely asked before:
why not use device memory for it instead (i.e. a DIMM device, or some
device derived from the device-memory object, and then add an e820 entry
for it)?

It would be a far simpler approach and implementation, with no need to
re-split anything in e820.
And no need to mess with firmware (SeaBIOS: RamSizeOver4G patch) nor EDK2.
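
For the record, the device-memory alternative Igor alludes to might look
roughly like this on the command line (a hypothetical sketch: pc-dimm,
memory-backend-memfd and the -m slots/maxmem options exist today, but a
mechanism to tag the DIMM's GPA range as E820_SOFT_RESERVED does not and
would still need to be introduced):

```shell
# Hypothetical sketch: back the special-purpose range with a hotpluggable
# DIMM instead of a -numa node; a follow-up change would then add an
# E820_SOFT_RESERVED entry covering the DIMM's address range.
qemu-system-x86_64 \
  -m 8G,slots=2,maxmem=16G \
  -object memory-backend-memfd,size=8G,id=m1 \
  -device pc-dimm,memdev=m1,node=1
```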



> 
> Signed-off-by: fanhuang <FangSheng.Huang@amd.com>
> Acked-by: David Hildenbrand <david@kernel.org>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> ---
>  hw/core/numa.c               | 24 ++++++++++++
>  hw/i386/acpi-build.c         |  8 ++++
>  hw/i386/e820_memory_layout.c | 72 ++++++++++++++++++++++++++++++++++++
>  hw/i386/e820_memory_layout.h | 12 +++---
>  hw/i386/pc.c                 | 48 ++++++++++++++++++++++++
>  include/system/numa.h        |  7 ++++
>  qapi/machine.json            | 24 ++++++++++++
>  qemu-options.hx              | 14 ++++++-
>  8 files changed, 202 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/core/numa.c b/hw/core/numa.c
> index f462883c87..521c8f10f1 100644
> --- a/hw/core/numa.c
> +++ b/hw/core/numa.c
> @@ -38,6 +38,7 @@
>  #include "hw/mem/pc-dimm.h"
>  #include "hw/core/boards.h"
>  #include "hw/mem/memory-device.h"
> +#include "hw/i386/x86.h"
>  #include "qemu/option.h"
>  #include "qemu/config-file.h"
>  #include "qemu/cutils.h"
> @@ -164,6 +165,29 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
>          numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
>      }
>  
> +    if (node->has_memmap_type && node->memmap_type != NUMA_MEMMAP_TYPE_NORMAL) {
> +        if (!node->memdev) {
> +            error_setg(errp, "memmap-type=%s requires memdev to be specified",
> +                       NumaMemmapType_str(node->memmap_type));
> +            return;
> +        }
> +        if (!object_dynamic_cast(OBJECT(ms), TYPE_X86_MACHINE)) {
> +            error_setg(errp, "memmap-type=%s is only supported on x86 machines",
> +                       NumaMemmapType_str(node->memmap_type));
> +            return;
> +        }
> +        switch (node->memmap_type) {
> +        case NUMA_MEMMAP_TYPE_SPM:
> +            numa_info[nodenr].memmap_type = NUMA_MEMMAP_SPM;
> +            break;
> +        case NUMA_MEMMAP_TYPE_RESERVED:
> +            numa_info[nodenr].memmap_type = NUMA_MEMMAP_RESERVED;
> +            break;
> +        default:
> +            break;
> +        }
> +    }
> +
>      numa_info[nodenr].present = true;
>      max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
>      ms->numa_state->num_nodes++;
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index f622b91b76..521bf66ca1 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -1417,6 +1417,14 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
>          mem_len = numa_info[i - 1].node_mem;
>          next_base = mem_base + mem_len;
>  
> +        /*
> +         * Skip reserved memory nodes - E820 marks them as reserved,
> +         * so SRAT should not report them as enabled memory affinity.
> +         */
> +        if (numa_info[i - 1].memmap_type == NUMA_MEMMAP_RESERVED) {
> +            continue;
> +        }
> +
>          /* Cut out the 640K hole */
>          if (mem_base <= HOLE_640K_START &&
>              next_base > HOLE_640K_START) {
> diff --git a/hw/i386/e820_memory_layout.c b/hw/i386/e820_memory_layout.c
> index 3e848fb69c..4c62b5ddea 100644
> --- a/hw/i386/e820_memory_layout.c
> +++ b/hw/i386/e820_memory_layout.c
> @@ -46,3 +46,75 @@ bool e820_get_entry(int idx, uint32_t type, uint64_t *address, uint64_t *length)
>      }
>      return false;
>  }
> +
> +bool e820_update_entry_type(uint64_t start, uint64_t length, uint32_t new_type)
> +{
> +    uint64_t end = start + length;
> +    assert(!e820_done);
> +
> +    /* For E820_SOFT_RESERVED, validate range is within E820_RAM */
> +    if (new_type == E820_SOFT_RESERVED) {
> +        bool range_in_ram = false;
> +
> +        for (size_t j = 0; j < e820_entries; j++) {
> +            uint64_t ram_start = le64_to_cpu(e820_table[j].address);
> +            uint64_t ram_end = ram_start + le64_to_cpu(e820_table[j].length);
> +            uint32_t ram_type = le32_to_cpu(e820_table[j].type);
> +
> +            if (ram_type == E820_RAM && ram_start <= start && ram_end >= end) {
> +                range_in_ram = true;
> +                break;
> +            }
> +        }
> +        if (!range_in_ram) {
> +            return false;
> +        }
> +    }
> +
> +    /* Find entry that contains the target range and update it */
> +    for (size_t i = 0; i < e820_entries; i++) {
> +        uint64_t entry_start = le64_to_cpu(e820_table[i].address);
> +        uint64_t entry_length = le64_to_cpu(e820_table[i].length);
> +        uint64_t entry_end = entry_start + entry_length;
> +
> +        if (entry_start <= start && entry_end >= end) {
> +            uint32_t original_type = e820_table[i].type;
> +
> +            /* Remove original entry */
> +            memmove(&e820_table[i], &e820_table[i + 1],
> +                    (e820_entries - i - 1) * sizeof(struct e820_entry));
> +            e820_entries--;
> +
> +            /* Add split parts inline */
> +            if (entry_start < start) {
> +                e820_table = g_renew(struct e820_entry, e820_table,
> +                                     e820_entries + 1);
> +                e820_table[e820_entries].address = cpu_to_le64(entry_start);
> +                e820_table[e820_entries].length =
> +                    cpu_to_le64(start - entry_start);
> +                e820_table[e820_entries].type = original_type;
> +                e820_entries++;
> +            }
> +
> +            e820_table = g_renew(struct e820_entry, e820_table,
> +                                 e820_entries + 1);
> +            e820_table[e820_entries].address = cpu_to_le64(start);
> +            e820_table[e820_entries].length = cpu_to_le64(length);
> +            e820_table[e820_entries].type = cpu_to_le32(new_type);
> +            e820_entries++;
> +
> +            if (end < entry_end) {
> +                e820_table = g_renew(struct e820_entry, e820_table,
> +                                     e820_entries + 1);
> +                e820_table[e820_entries].address = cpu_to_le64(end);
> +                e820_table[e820_entries].length = cpu_to_le64(entry_end - end);
> +                e820_table[e820_entries].type = original_type;
> +                e820_entries++;
> +            }
> +
> +            return true;
> +        }
> +    }
> +
> +    return false;
> +}
> diff --git a/hw/i386/e820_memory_layout.h b/hw/i386/e820_memory_layout.h
> index b50acfa201..a85b4fd14c 100644
> --- a/hw/i386/e820_memory_layout.h
> +++ b/hw/i386/e820_memory_layout.h
> @@ -10,11 +10,12 @@
>  #define HW_I386_E820_MEMORY_LAYOUT_H
>  
>  /* e820 types */
> -#define E820_RAM        1
> -#define E820_RESERVED   2
> -#define E820_ACPI       3
> -#define E820_NVS        4
> -#define E820_UNUSABLE   5
> +#define E820_RAM            1
> +#define E820_RESERVED       2
> +#define E820_ACPI           3
> +#define E820_NVS            4
> +#define E820_UNUSABLE       5
> +#define E820_SOFT_RESERVED  0xEFFFFFFF
>  
>  struct e820_entry {
>      uint64_t address;
> @@ -26,5 +27,6 @@ void e820_add_entry(uint64_t address, uint64_t length, uint32_t type);
>  bool e820_get_entry(int index, uint32_t type,
>                      uint64_t *address, uint64_t *length);
>  int e820_get_table(struct e820_entry **table);
> +bool e820_update_entry_type(uint64_t start, uint64_t length, uint32_t new_type);
>  
>  #endif
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 819e729a6e..c024a34db2 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -740,6 +740,51 @@ static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
>      return pc_above_4g_end(pcms) - 1;
>  }
>  
> +/*
> + * Update E820 entries for NUMA nodes with non-default memory types.
> + */
> +static void pc_update_numa_memory_types(X86MachineState *x86ms)
> +{
> +    MachineState *ms = MACHINE(x86ms);
> +    uint64_t addr = 0;
> +
> +    for (int i = 0; i < ms->numa_state->num_nodes; i++) {
> +        NodeInfo *numa_info = &ms->numa_state->nodes[i];
> +        uint64_t node_size = numa_info->node_mem;
> +
> +        if (numa_info->node_memdev &&
> +            (numa_info->memmap_type == NUMA_MEMMAP_SPM ||
> +             numa_info->memmap_type == NUMA_MEMMAP_RESERVED)) {
> +            uint64_t guest_addr;
> +            uint32_t e820_type = (numa_info->memmap_type == NUMA_MEMMAP_SPM)
> +                                  ? E820_SOFT_RESERVED : E820_RESERVED;
> +
> +            if (addr < x86ms->below_4g_mem_size) {
> +                if (addr + node_size <= x86ms->below_4g_mem_size) {
> +                    guest_addr = addr;
> +                } else {
> +                    error_report("NUMA node %d with memmap-type spans across "
> +                                 "4GB boundary, not supported", i);
> +                    exit(EXIT_FAILURE);
> +                }
> +            } else {
> +                guest_addr = x86ms->above_4g_mem_start +
> +                            (addr - x86ms->below_4g_mem_size);
> +            }
> +
> +            if (!e820_update_entry_type(guest_addr, node_size, e820_type)) {
> +                warn_report("Failed to update E820 entry for node %d "
> +                           "at 0x%" PRIx64 " length 0x%" PRIx64,
> +                           i, guest_addr, node_size);
> +            }
> +        }
> +
> +        if (numa_info->node_memdev) {
> +            addr += node_size;
> +        }
> +    }
> +}
> +
>  /*
>   * AMD systems with an IOMMU have an additional hole close to the
>   * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> @@ -856,6 +901,9 @@ void pc_memory_init(PCMachineState *pcms,
>          e820_add_entry(pcms->sgx_epc.base, pcms->sgx_epc.size, E820_RESERVED);
>      }
>  
> +    /* Update E820 for NUMA nodes with special memory types */
> +    pc_update_numa_memory_types(x86ms);
> +
>      if (!pcmc->has_reserved_memory &&
>          (machine->ram_slots ||
>           (machine->maxram_size > machine->ram_size))) {
> diff --git a/include/system/numa.h b/include/system/numa.h
> index 1044b0eb6e..64e8f63736 100644
> --- a/include/system/numa.h
> +++ b/include/system/numa.h
> @@ -35,12 +35,19 @@ enum {
>  
>  #define UINT16_BITS       16
>  
> +typedef enum {
> +    NUMA_MEMMAP_NORMAL = 0,
> +    NUMA_MEMMAP_SPM,
> +    NUMA_MEMMAP_RESERVED,
> +} NumaMemmapTypeInternal;
> +
>  typedef struct NodeInfo {
>      uint64_t node_mem;
>      struct HostMemoryBackend *node_memdev;
>      bool present;
>      bool has_cpu;
>      bool has_gi;
> +    NumaMemmapTypeInternal memmap_type;
>      uint8_t lb_info_provided;
>      uint16_t initiator;
>      uint8_t distance[MAX_NODES];
> diff --git a/qapi/machine.json b/qapi/machine.json
> index 685e4e29b8..67ba487f6c 100644
> --- a/qapi/machine.json
> +++ b/qapi/machine.json
> @@ -466,6 +466,22 @@
>  { 'enum': 'NumaOptionsType',
>    'data': [ 'node', 'dist', 'cpu', 'hmat-lb', 'hmat-cache' ] }
>  
> +##
> +# @NumaMemmapType:
> +#
> +# Memory mapping type for a NUMA node.
> +#
> +# @normal: Normal system RAM (E820 type 1)
> +#
> +# @spm: Specific Purpose Memory (E820 type 0xEFFFFFFF)
> +#
> +# @reserved: Reserved memory (E820 type 2)
> +#
> +# Since: 10.2
> +##
> +{ 'enum': 'NumaMemmapType',
> +  'data': ['normal', 'spm', 'reserved'] }
> +
>  ##
>  # @NumaOptions:
>  #
> @@ -502,6 +518,13 @@
>  # @memdev: memory backend object.  If specified for one node, it must
>  #     be specified for all nodes.
>  #
> +# @memmap-type: specifies the memory type for this NUMA node.
> +#     'normal' (default) is regular system RAM.
> +#     'spm' is Specific Purpose Memory - a hint to the guest that
> +#     this memory might be managed by device drivers based on policy.
> +#     'reserved' is reserved memory, not usable as RAM.
> +#     Currently only supported on x86.  (since 10.2)
> +#
>  # @initiator: defined in ACPI 6.3 Chapter 5.2.27.3 Table 5-145, points
>  #     to the nodeid which has the memory controller responsible for
>  #     this NUMA node.  This field provides additional information as
> @@ -516,6 +539,7 @@
>     '*cpus':   ['uint16'],
>     '*mem':    'size',
>     '*memdev': 'str',
> +   '*memmap-type': 'NumaMemmapType',
>     '*initiator': 'uint16' }}
>  
>  ##
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 0da2b4d034..c898428822 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -433,7 +433,7 @@ ERST
>  
>  DEF("numa", HAS_ARG, QEMU_OPTION_numa,
>      "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
> -    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
> +    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node][,memmap-type=normal|spm|reserved]\n"
>      "-numa dist,src=source,dst=destination,val=distance\n"
>      "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n"
>      "-numa hmat-lb,initiator=node,target=node,hierarchy=memory|first-level|second-level|third-level,data-type=access-latency|read-latency|write-latency[,latency=lat][,bandwidth=bw]\n"
> @@ -442,7 +442,7 @@ DEF("numa", HAS_ARG, QEMU_OPTION_numa,
>  SRST
>  ``-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=initiator]``
>    \ 
> -``-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=initiator]``
> +``-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=initiator][,memmap-type=type]``
>    \
>  ``-numa dist,src=source,dst=destination,val=distance``
>    \ 
> @@ -510,6 +510,16 @@ SRST
>      largest bandwidth) to this NUMA node. Note that this option can be
>      set only when the machine property 'hmat' is set to 'on'.
>  
> +    '\ ``memmap-type``\ ' specifies the memory type for this NUMA node:
> +
> +    - ``normal`` (default): Regular system RAM (E820 type 1)
> +    - ``spm``: Specific Purpose Memory (E820 type 0xEFFFFFFF). This is a
> +      hint to the guest that the memory might be managed by device drivers
> +      based on guest policy.
> +    - ``reserved``: Reserved memory (E820 type 2), not usable as RAM.
> +
> +    This option is only supported on x86 platforms.
> +
>      Following example creates a machine with 2 NUMA nodes, node 0 has
>      CPU. node 1 has only memory, and its initiator is node 0. Note that
>      because node 0 has CPU, by default the initiator of node 0 is itself



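As an editorial aside: the range-splitting that e820_update_entry_type() performs in the quoted patch can be exercised stand-alone. The sketch below is not the QEMU code — it drops the endianness conversions and g_renew() growth, uses a fixed array, and replaces the order-preserving memmove with a swap-remove — but the carve-out logic (left remainder, retyped middle, right remainder) is the same:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES     16
#define T_RAM           1u
#define T_SOFT_RESERVED 0xEFFFFFFFu   /* SPM */

struct entry { uint64_t addr, len; uint32_t type; };

static struct entry tab[MAX_ENTRIES];
static size_t nent;

static void add(uint64_t addr, uint64_t len, uint32_t type)
{
    assert(nent < MAX_ENTRIES);
    tab[nent++] = (struct entry){ addr, len, type };
}

/*
 * Retype [start, start+length) if it is fully contained in one entry.
 * The containing entry is removed and up to three pieces are appended:
 * left remainder, retyped middle, right remainder.
 */
static int update_type(uint64_t start, uint64_t length, uint32_t new_type)
{
    uint64_t end = start + length;

    for (size_t i = 0; i < nent; i++) {
        uint64_t es = tab[i].addr, ee = es + tab[i].len;
        uint32_t old = tab[i].type;

        if (es <= start && end <= ee) {
            tab[i] = tab[--nent];          /* swap-remove the original */
            if (es < start) {
                add(es, start - es, old);  /* left remainder */
            }
            add(start, length, new_type);  /* retyped middle */
            if (end < ee) {
                add(end, ee - end, old);   /* right remainder */
            }
            return 1;
        }
    }
    return 0;                              /* no containing entry */
}
```

Marking a 1 GiB sub-range of a single [0, 4 GiB) RAM entry as SOFT_RESERVED yields three entries, which is what makes the guest-visible E820 for the spm case look like what BM firmware emits.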

* Re: [PATCH v7 1/1] numa: add 'memmap-type' option for memory type configuration
  2026-05-14 13:05   ` Igor Mammedov
@ 2026-05-14 13:38     ` Gregory Price
  2026-05-15  7:53     ` Huang, FangSheng (Jerry)
  1 sibling, 0 replies; 11+ messages in thread
From: Gregory Price @ 2026-05-14 13:38 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: fanhuang, qemu-devel, david, jonathan.cameron, apopple,
	dan.j.williams, Zhigang.Luo, Lianjie.Shi, David Hildenbrand

On Thu, May 14, 2026 at 03:05:59PM +0200, Igor Mammedov wrote:
> 
> I don't recall, but I likely asked before: why not use device memory
> for this instead (i.e. a DIMM device, or some device derived from the
> device-memory object, and then add an e820 entry for it)?
> 
> It would be a much simpler approach and implementation, without any need
> to re-split anything in e820.
> And no need to mess with firmware (the SeaBIOS RamSizeOver4G patch) nor EDK2.
> 

David previously addressed your question on the original patch version:

https://lore.kernel.org/qemu-devel/6e7ad90d-a467-40cc-99fa-d0915438dd05@redhat.com/

  I wondered the same in my reply: I'm afraid it cannot be a DIMM/NVDIMM,
  these ranges are only described in E820 as "hotplug area".

  I think it must be something that's present in the memory map right from
  the start, where the OS would identify it as SP and treat it accordingly.


We're trending towards devices being given dedicated nodes for their
memory, so this actually makes sense as an extension to NUMA.

While heterogeneous device/memory nodes are possible, they're also
pretty nonsensical outside of specifically the simple use case of:

   This node has both hotpluggable and not-hotpluggable memory.

Which can already be accomplished another way.

For a device being given a node with memory, marking it reserved or spm
in e820 is needed to make the memory hotpluggable in the future (as that
node has to be reserved and the hotplug memory region accounted for).

Unless I am misunderstanding your feedback here - please let me know.

~Gregory



* Re: [PATCH v7 1/1] numa: add 'memmap-type' option for memory type configuration
  2026-05-14 13:05   ` Igor Mammedov
  2026-05-14 13:38     ` Gregory Price
@ 2026-05-15  7:53     ` Huang, FangSheng (Jerry)
  2026-05-15 13:04       ` Igor Mammedov
  1 sibling, 1 reply; 11+ messages in thread
From: Huang, FangSheng (Jerry) @ 2026-05-15  7:53 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, david, gourry, jonathan.cameron, apopple,
	dan.j.williams, Zhigang.Luo, Lianjie.Shi, David Hildenbrand



On 5/14/2026 9:05 PM, Igor Mammedov wrote:
> On Fri, 6 Mar 2026 16:27:35 +0800
> fanhuang <FangSheng.Huang@amd.com> wrote:
> 
>> Add a 'memmap-type' option to NUMA node configuration that allows
>> specifying the memory type for a NUMA node.
>>
>> Supported values:
>>    - normal:   Regular system RAM (E820 type 1, default)
>>    - spm:      Specific Purpose Memory (E820 type 0xEFFFFFFF)
>>    - reserved: Reserved memory (E820 type 2)
>>
>> The 'spm' type indicates Specific Purpose Memory - a hint to the guest
>> that this memory might be managed by device drivers based on guest policy.
>> The 'reserved' type marks memory as not usable as RAM.
>>
>> Note: This option is only supported on x86 platforms.
>>
>> Usage:
>>    -numa node,nodeid=1,memdev=m1,memmap-type=spm
> 
> in short:
>    don't do it this way
>    I'm against merging it as is, till you convince me otherwise.
> 
> more detailed answer:
> 
> * mandatory bashing chapter:
> 
> the more I look at it, the hackier this approach looks to me,
> and what is even worse, that nonsense propagates to firmware.
> 
> Judging by the commit message, the goal is to expose some RAM
> to the guest as E820 SPM (that's it).
> 
> You, however, picked -numa node as a way to achieve that,
> and then hacked the numa code not to generate numa data (SRAT) for it
> and massaged e820 to exclude SPM from RAM entries.
> 
> But at this stage I don't really see a good justification for the hack(s)
> this patch introduces (it's definitely not in the commit message nor the cover letter).
> 
> And until an alternative approach is explored and proved to be worse,
> I'm against merging this patch.
> 
> * suggestion chapter:
> 
> I don't recall, but I likely asked before: why not use device memory
> for this instead (i.e. a DIMM device, or some device derived from the
> device-memory object, and then add an e820 entry for it)?
> 
> It would be a much simpler approach and implementation, without any need
> to re-split anything in e820.
> And no need to mess with firmware (the SeaBIOS RamSizeOver4G patch) nor EDK2.
> 
>

Hi Igor,

Thanks for taking the time to review this -- and for the candor in
the bashing chapter.  Before going into the bigger picture, let me
re-establish one factual point that v7 didn't carry forward from
the v6 cover letter.

On SRAT generation:

v7 only suppresses SRAT for memmap-type=reserved.  memmap-type=spm
nodes get a normal SRAT Memory Affinity entry.  This was shown
explicitly in the v6 cover letter, which v7 didn't carry forward
since v7 is a single-patch series.  For the spm case:

     [    0.042582] ACPI: SRAT: Node 1 PXM 1 [mem 0x280000000-0x47fffffff]

Full transcript with all three memmap-type variants side by side:
https://lore.kernel.org/qemu-devel/20260226105023.256568-1-FangSheng.Huang@amd.com/

The bigger picture -- real-world context that drove the design:

The use case is GPU/accelerator HBM exposed to the OS as SPM.  On
bare metal, the platform firmware:

   - emits E820 type 0xEFFFFFFF (SOFT_RESERVED) for the HBM region;
   - emits ACPI SRAT memory affinity entries that bind HBM to a
     dedicated proximity domain (NUMA node);
   - tags the accelerator's PCI device with _PXM matching that node.

That gives the device driver a stable lookup chain at runtime:

     dev -> pci_dev_to_node(dev) -> SRAT walk -> HBM GPA range

NUMA node here is not incidental -- it is the OS-exposed
intermediary ID that the device driver uses to find its own HBM.
This is the in-tree path used by accelerator drivers today.
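The lookup chain above can be made concrete with a small user-space simulation. This is an editorial sketch, not kernel code: dev_to_node() stands in for the real pci_dev_to_node() (which is fed by the firmware _PXM), node_to_range() stands in for the SRAT walk, and all table contents are hypothetical, chosen only to match the SRAT line quoted earlier in this message:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct srat_mem_affinity { int node; uint64_t base, len; };

/* SRAT memory-affinity entries as firmware might emit them:
 * node 0 = regular RAM, node 1 = SPM-backed HBM. */
static const struct srat_mem_affinity srat[] = {
    { 0, 0x000000000ULL, 0x280000000ULL },
    { 1, 0x280000000ULL, 0x200000000ULL },  /* [0x280000000-0x47fffffff] */
};

/* Stand-in for pci_dev_to_node(): returns the node the firmware's
 * _PXM bound the accelerator to (hard-coded here). */
static int dev_to_node(int devfn)
{
    (void)devfn;
    return 1;
}

/* "SRAT walk": find the memory range bound to a proximity domain. */
static const struct srat_mem_affinity *node_to_range(int node)
{
    for (size_t i = 0; i < sizeof(srat) / sizeof(srat[0]); i++) {
        if (srat[i].node == node) {
            return &srat[i];
        }
    }
    return NULL;
}
```

The driver never needs a device-specific discovery mechanism for the HBM GPA range: the node id is the stable intermediary, which is why the SRAT entry for the spm node is kept.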

The "-numa node + memmap-type=spm + E820 SOFT_RESERVED" combo in
v7 is a direct 1:1 model of this BM topology.  The E820 retyping
in the patch is exactly what makes the guest-visible E820 match
what BM firmware emits for the same kind of region.

On the DIMM / device-memory alternative:

David pointed this out in the v6 thread, and Gregory's reply in
this thread reinforces the same point -- DIMM / NVDIMM ranges are
described in E820 only as the hotplug area.  SPM needs to be in
the boot E820 from the start so the OS classifies it as SP and
treats it accordingly.  Going via DIMM would also detach the
memory from the NUMA topology (no SRAT entry tied to the device's
_PXM), which breaks the dev -> node -> SRAT -> HBM lookup the
driver relies on.

Happy to dig into any of this further, or to reshape parts you
still see as too hacky.

Best regards,
FangSheng Huang (Jerry)
> 




* Re: [PATCH v7 1/1] numa: add 'memmap-type' option for memory type configuration
  2026-05-15  7:53     ` Huang, FangSheng (Jerry)
@ 2026-05-15 13:04       ` Igor Mammedov
  0 siblings, 0 replies; 11+ messages in thread
From: Igor Mammedov @ 2026-05-15 13:04 UTC (permalink / raw)
  To: Huang, FangSheng (Jerry)
  Cc: qemu-devel, david, gourry, jonathan.cameron, apopple,
	dan.j.williams, Zhigang.Luo, Lianjie.Shi, David Hildenbrand

On Fri, 15 May 2026 15:53:07 +0800
"Huang, FangSheng (Jerry)" <FangSheng.Huang@amd.com> wrote:

> On 5/14/2026 9:05 PM, Igor Mammedov wrote:
> > On Fri, 6 Mar 2026 16:27:35 +0800
> > fanhuang <FangSheng.Huang@amd.com> wrote:
> >   
> >> Add a 'memmap-type' option to NUMA node configuration that allows
> >> specifying the memory type for a NUMA node.
> >>
> >> Supported values:
> >>    - normal:   Regular system RAM (E820 type 1, default)
> >>    - spm:      Specific Purpose Memory (E820 type 0xEFFFFFFF)
> >>    - reserved: Reserved memory (E820 type 2)
> >>
> >> The 'spm' type indicates Specific Purpose Memory - a hint to the guest
> >> that this memory might be managed by device drivers based on guest policy.
> >> The 'reserved' type marks memory as not usable as RAM.
> >>
> >> Note: This option is only supported on x86 platforms.
> >>
> >> Usage:
> >>    -numa node,nodeid=1,memdev=m1,memmap-type=spm  
> > 
> > in short:
> >    don't do it this way
> >    I'm against merging it as is, till you convince me otherwise.
> > 
> > more detailed answer:
> > 
> > * mandatory bashing chapter:
> > 
> > the more I look at it, the hackier this approach looks to me,
> > and what is even worse, that nonsense propagates to firmware.
> > 
> > Judging by the commit message, the goal is to expose some RAM
> > to the guest as E820 SPM (that's it).
> > 
> > You, however, picked -numa node as a way to achieve that,
> > and then hacked the numa code not to generate numa data (SRAT) for it
> > and massaged e820 to exclude SPM from RAM entries.
> > 
> > But at this stage I don't really see a good justification for the hack(s)
> > this patch introduces (it's definitely not in the commit message nor the cover letter).
> > 
> > And until an alternative approach is explored and proved to be worse,
> > I'm against merging this patch.
> > 
> > * suggestion chapter:
> > 
> > I don't recall, but I likely asked before: why not use device memory
> > for this instead (i.e. a DIMM device, or some device derived from the
> > device-memory object, and then add an e820 entry for it)?
> > 
> > It would be a much simpler approach and implementation, without any need
> > to re-split anything in e820.
> > And no need to mess with firmware (the SeaBIOS RamSizeOver4G patch) nor EDK2.
> > 
> >  
> 
> Hi Igor,
> 
> Thanks for taking the time to review this -- and for the candor in
> the bashing chapter.  Before going into the bigger picture, let me
> re-establish one factual point that v7 didn't carry forward from
> the v6 cover letter.

Feel free to bash my review as well; I hope we end up with a clear
picture of what we are doing and why.

> 
> On SRAT generation:
> 
> v7 only suppresses SRAT for memmap-type=reserved.  memmap-type=spm
> nodes get a normal SRAT Memory Affinity entry.  This was shown
> explicitly in the v6 cover letter, which v7 didn't carry forward
> since v7 is a single-patch series.  For the spm case:
> 
>      [    0.042582] ACPI: SRAT: Node 1 PXM 1 [mem 0x280000000-0x47fffffff]
> 
> Full transcript with all three memmap-type variants side by side:
> https://lore.kernel.org/qemu-devel/20260226105023.256568-1-FangSheng.Huang@amd.com/
> 
> The bigger picture -- real-world context that drove the design:

The bigger picture should be somewhere in the commit message, so that
later on a reader can understand why we are doing it at all/this way.

Let's continue with questions wrt the implementation.

> The use case is GPU/accelerator HBM exposed to the OS as SPM.  On
> bare metal, the platform firmware:
> 
>    - emits E820 type 0xEFFFFFFF (SOFT_RESERVED) for the HBM region;
>    - emits ACPI SRAT memory affinity entries that bind HBM to a
>      dedicated proximity domain (NUMA node);
>    - tags the accelerator's PCI device with _PXM matching that node.
> 
> That gives the device driver a stable lookup chain at runtime:
> 
>      dev -> pci_dev_to_node(dev) -> SRAT walk -> HBM GPA range

It looks kind of convoluted, doesn't it?
PCI devices were supposed to be self-describing/discoverable,
preferably without the above-mentioned firmware 'hooks'.
The above example could be just an early implementation issue
rather than a by-design one.

> NUMA node here is not incidental -- it is the OS-exposed
> intermediary ID that the device driver uses to find its own HBM.
> This is the in-tree path used by accelerator drivers today.

I'm assuming the GPU is exposed as some composite PCI/CXL device,
and the use case is its pass-through to the guest.

Perhaps we can't do anything about it now.
But shouldn't the device driver discover its own memory (HBM and whatnot)
without external parties magically gaining knowledge about parts of the
device that the driver supposedly driving it has no clue about?
How does the BIOS know about SPM when the device's driver, which knows
the device internals, knows nothing about it?
 
> The "-numa node + memmap-type=spm + E820 SOFT_RESERVED" combo in
> v7 is a direct 1:1 model of this BM topology.  The E820 retyping
> in the patch is exactly what makes the guest-visible E820 match
> what BM firmware emits for the same kind of region.
> 
> On the DIMM / device-memory alternative:

Wrt modeling GPU pass-through, my first attempt would be
to make -device gpu-foo take everything needed to compose the device
(like in real hw) and be done with it (the PCI/CXL machinery would
take care of mapping/exposing memory to the guest).
Why aren't we doing that?

Barring that, and assuming we have to pass SPM in as separate memory
(why, and why should it be exposed in E820 and at boot time only?),
I'd try a -device foo-memory approach.

> David pointed this out in the v6 thread, and Gregory's reply in
> this thread reinforces the same point -- DIMM / NVDIMM ranges are
> described in E820 only as the hotplug area.  SPM needs to be in
> the boot E820 from the start so the OS classifies it as SP and
> treats it accordingly.  Going via DIMM would also detach the
> memory from the NUMA topology (no SRAT entry tied to the device's
> _PXM), which breaks the dev -> node -> SRAT -> HBM lookup the
> driver relies on.

Whether we should bend the modeling to driver behavior is questionable.
But I don't know nearly enough about the subject; it could be a parallel
discussion. We do need to capture the 'why' somewhere in the commit
message, to justify the pass-through-as-separate-memory approach.

For now let's leave it alone.

Wrt my suggestion of using memory-device:
it's true that the device memory region started out as hotpluggable memory.
But that's an implementation detail; nothing fundamentally prevents us from
describing a mix of boot-time-present memory devices within it in e820/SRAT.

The answer to why DIMMs aren't in e820 was to avoid dealing with the
Linux kernel putting that memory into ZONE_NORMAL instead of ZONE_MOVABLE.
On real hardware, one is likely to see all boot-time-present DIMMs in
both e820 and SRAT.
For already existing memory devices, I'd like us to continue dodging e820
so we don't break existing deployments; however, for a new memory device
we don't have such limitations.

What I'd try is:
 1. inherit an spm-memory device from memory-device
    (all the memory mapping and ACPI memory device descriptors can be
    made to pick it up along with DIMM devices)
 2. figure out why the device driver has to fetch the memory map and
    proximity from static tables, as opposed to getting them dynamically
    from the _PXM -> mapped-memory range (at the time PCI device
    enumeration runs, all ACPI info, including the runtime parts, is
    fully accessible to in-kernel users), i.e. try to make the driver
    work with runtime proximity
 3. if #2 is impossible, we can try to expose SPM memory devices in e820
    and partition SRAT to match the actual device_memory region layout.

 
> Happy to dig into any of this further, or to reshape parts you
> still see as too hacky.
> 
> Best regards,
> FangSheng Huang (Jerry)
> >   
> 




Thread overview: 11+ messages
2026-03-06  8:27 [PATCH v7 0/1] numa: add 'memmap-type' option for memory type configuration fanhuang
2026-03-06  8:27 ` [PATCH v7 1/1] " fanhuang
2026-05-14 13:05   ` Igor Mammedov
2026-05-14 13:38     ` Gregory Price
2026-05-15  7:53     ` Huang, FangSheng (Jerry)
2026-05-15 13:04       ` Igor Mammedov
2026-03-13  8:30 ` [PATCH v7 0/1] " Huang, FangSheng (Jerry)
2026-03-13 15:18   ` Gregory Price
2026-03-13 16:14     ` Jonathan Cameron via qemu development
2026-03-16  7:17       ` Huang, FangSheng (Jerry)
2026-04-27  8:47         ` Huang, FangSheng (Jerry)
