* [PATCH] nvdimm: add 'target-node' option
@ 2021-06-25 2:25 Jingqi Liu
2021-06-30 15:00 ` Igor Mammedov
0 siblings, 1 reply; 3+ messages in thread
From: Jingqi Liu @ 2021-06-25 2:25 UTC (permalink / raw)
To: xiaoguangrong.eric, mst, imammedo, marcel.apfelbaum, pbonzini,
richard.henderson, ehabkost
Cc: Jingqi Liu, qemu-devel
Linux kernel version 5.1 brings in support for the volatile use of
persistent memory as a hotplugged memory region (KMEM DAX).
When this feature is enabled, persistent memory can appear as one or
more separate memory-only NUMA nodes. This newly added memory can be
selected by its unique NUMA node.

Add a 'target-node' option to the 'nvdimm' device to indicate this
NUMA node. It may refer to a new node beyond all existing NUMA nodes.

The 'node' option of the 'pc-dimm' device adds the DIMM to an existing
NUMA node, so its value must be one of the available NUMA nodes. With
KMEM DAX, persistent memory can instead live in a new, separate
memory-only NUMA node that is created dynamically. The 'target-node'
option therefore lets users control whether persistent memory is added
to an existing NUMA node or to a new one.
An example configuration is as follows.

Use the following QEMU command line:
 -object memory-backend-file,id=nvmem1,share=on,mem-path=/dev/dax0.0,size=3G,align=2M
 -device nvdimm,id=nvdimm1,memdev=nvmem1,label-size=128K,target-node=2
In the guest, create a namespace in devdax mode, to be used as standard
memory:
 # ndctl create-namespace --mode=devdax --map=mem

List the resulting DAX device:
 # daxctl list -u
 {
   "chardev":"dax0.0",
   "size":"3.00 GiB (3.22 GB)",
   "target_node":2,
   "mode":"devdax"
 }

Reconfigure the DAX device from devdax mode to system-ram mode:
 # daxctl reconfigure-device dax0.0 --mode=system-ram
There are two pre-existing NUMA nodes in the guest. After these
operations, persistent memory is configured as a separate node 2 and
can be used as volatile memory. This NUMA node is created dynamically
according to 'target-node'.
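As an illustration (the program name below is only a placeholder),
guest applications can then be bound to the newly created node:

 # numactl --membind=2 ./app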
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
---
docs/nvdimm.txt | 93 +++++++++++++++++++++++++++++++++++++++++
hw/acpi/nvdimm.c | 18 ++++++--
hw/i386/acpi-build.c | 12 +++++-
hw/mem/nvdimm.c | 23 +++++++++-
include/hw/mem/nvdimm.h | 15 ++++++-
util/nvdimm-utils.c | 22 ++++++++++
6 files changed, 175 insertions(+), 8 deletions(-)
diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
index 0aae682be3..083d954bb4 100644
--- a/docs/nvdimm.txt
+++ b/docs/nvdimm.txt
@@ -107,6 +107,99 @@ Note:
may result guest data corruption (e.g. breakage of guest file
system).
+Target node
+-----------
+
+Linux kernel version 5.1 brings in support for the volatile-use of
+persistent memory as a hotplugged memory region (KMEM DAX).
+When this feature is enabled, persistent memory can appear as one or
+more separate memory-only NUMA nodes. This newly-added memory can be
+selected by its unique NUMA node.
+The 'target-node' option of the nvdimm device indicates this NUMA node.
+It may refer to a new node beyond all existing NUMA nodes.
+
+An example of configuration is presented below.
+
+Using the following QEMU command:
+ -object memory-backend-file,id=nvmem1,share=on,mem-path=/dev/dax0.0,size=3G,align=2M
+ -device nvdimm,id=nvdimm1,memdev=nvmem1,label-size=128K,target-node=1
+
+The operations below are performed in the guest.
+
+To list available NUMA nodes using numactl:
+ # numactl -H
+ available: 1 nodes (0)
+ node 0 cpus: 0 1 2 3 4 5 6 7
+ node 0 size: 5933 MB
+ node 0 free: 5457 MB
+ node distances:
+ node 0
+ 0: 10
+
+To create a namespace in Device-DAX mode, to be used as standard memory,
+from all the available capacity of the NVDIMM:
+
+ # ndctl create-namespace --mode=devdax --map=mem
+ {
+ "dev":"namespace0.0",
+ "mode":"devdax",
+ "map":"mem",
+ "size":"3.00 GiB (3.22 GB)",
+ "uuid":"4e4d8293-dd3b-4e43-8ad9-7f3d2a8d1680",
+ "daxregion":{
+ "id":0,
+ "size":"3.00 GiB (3.22 GB)",
+ "align":2097152,
+ "devices":[
+ {
+ "chardev":"dax0.0",
+ "size":"3.00 GiB (3.22 GB)",
+ "target_node":1,
+ "mode":"devdax"
+ }
+ ]
+ },
+ "align":2097152
+ }
+
+To list DAX devices:
+ # daxctl list -u
+ {
+ "chardev":"dax0.0",
+ "size":"3.00 GiB (3.22 GB)",
+ "target_node":1,
+ "mode":"devdax"
+ }
+
+To reconfigure the DAX device from devdax mode to system-ram mode:
+ # daxctl reconfigure-device dax0.0 --mode=system-ram
+ [
+ {
+ "chardev":"dax0.0",
+ "size":3217031168,
+ "target_node":1,
+ "mode":"system-ram",
+ "movable":false
+ }
+ ]
+
+After this operation, persistent memory is configured as a separate NUMA node
+and can be used as volatile memory.
+The new NUMA node is Node 1:
+ # numactl -H
+ available: 2 nodes (0-1)
+ node 0 cpus: 0 1 2 3 4 5 6 7
+ node 0 size: 5933 MB
+ node 0 free: 5339 MB
+ node 1 cpus:
+ node 1 size: 2816 MB
+ node 1 free: 2815 MB
+ node distances:
+ node 0 1
+ 0: 10 20
+ 1: 20 10
+
+
Hotplug
-------
diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index e3d5fe1939..376ad6fd58 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -228,11 +228,13 @@ nvdimm_build_structure_spa(GArray *structures, DeviceState *dev)
NULL);
uint64_t size = object_property_get_uint(OBJECT(dev), PC_DIMM_SIZE_PROP,
NULL);
- uint32_t node = object_property_get_uint(OBJECT(dev), PC_DIMM_NODE_PROP,
+ int node = object_property_get_int(OBJECT(dev), NVDIMM_TARGET_NODE_PROP,
NULL);
int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
NULL);
-
+ if (node < 0) {
+ node = object_property_get_uint(OBJECT(dev), PC_DIMM_NODE_PROP, NULL);
+ }
nfit_spa = acpi_data_push(structures, sizeof(*nfit_spa));
nfit_spa->type = cpu_to_le16(0 /* System Physical Address Range
@@ -1337,8 +1339,9 @@ static void nvdimm_build_ssdt(GArray *table_offsets, GArray *table_data,
free_aml_allocator();
}
-void nvdimm_build_srat(GArray *table_data)
+int nvdimm_build_srat(GArray *table_data)
{
+ int max_target_node = nvdimm_check_target_nodes();
GSList *device_list = nvdimm_get_device_list();
for (; device_list; device_list = device_list->next) {
@@ -1348,7 +1351,12 @@ void nvdimm_build_srat(GArray *table_data)
uint64_t addr, size;
int node;
- node = object_property_get_int(obj, PC_DIMM_NODE_PROP, &error_abort);
+ node = object_property_get_int(obj, NVDIMM_TARGET_NODE_PROP,
+ &error_abort);
+ if (node < 0) {
+ node = object_property_get_uint(obj, PC_DIMM_NODE_PROP,
+ &error_abort);
+ }
addr = object_property_get_uint(obj, PC_DIMM_ADDR_PROP, &error_abort);
size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, &error_abort);
@@ -1357,6 +1365,8 @@ void nvdimm_build_srat(GArray *table_data)
MEM_AFFINITY_ENABLED | MEM_AFFINITY_NON_VOLATILE);
}
g_slist_free(device_list);
+
+ return max_target_node;
}
void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 796ffc6f5c..19bf91063f 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -1879,6 +1879,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
AcpiSratMemoryAffinity *numamem;
int i;
+ int max_node = 0;
int srat_start, numa_start, slots;
uint64_t mem_len, mem_base, next_base;
MachineClass *mc = MACHINE_GET_CLASS(machine);
@@ -1974,7 +1975,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
}
if (machine->nvdimms_state->is_enabled) {
- nvdimm_build_srat(table_data);
+ max_node = nvdimm_build_srat(table_data);
}
slots = (table_data->len - numa_start) / sizeof *numamem;
@@ -1992,9 +1993,16 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
* providing _PXM method if necessary.
*/
if (hotplugabble_address_space_size) {
+ if (max_node < 0) {
+ max_node = pcms->numa_nodes - 1;
+ } else {
+ max_node = max_node > pcms->numa_nodes - 1 ?
+ max_node : pcms->numa_nodes - 1;
+ }
+
numamem = acpi_data_push(table_data, sizeof *numamem);
build_srat_memory(numamem, machine->device_memory->base,
- hotplugabble_address_space_size, pcms->numa_nodes - 1,
+ hotplugabble_address_space_size, max_node,
MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
}
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 7397b67156..a9c27f7ad0 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -27,11 +27,15 @@
#include "qemu/pmem.h"
#include "qapi/error.h"
#include "qapi/visitor.h"
+#include "hw/boards.h"
#include "hw/mem/nvdimm.h"
#include "hw/qdev-properties.h"
#include "hw/mem/memory-device.h"
#include "sysemu/hostmem.h"
+unsigned long nvdimm_target_nodes[BITS_TO_LONGS(MAX_NODES)];
+int nvdimm_max_target_node;
+
static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name,
void *opaque, Error **errp)
{
@@ -96,7 +100,6 @@ static void nvdimm_set_uuid(Object *obj, Visitor *v, const char *name,
g_free(value);
}
-
static void nvdimm_init(Object *obj)
{
object_property_add(obj, NVDIMM_LABEL_SIZE_PROP, "int",
@@ -181,6 +184,23 @@ static MemoryRegion *nvdimm_md_get_memory_region(MemoryDeviceState *md,
static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
{
NVDIMMDevice *nvdimm = NVDIMM(dimm);
+ MachineState *ms = MACHINE(qdev_get_machine());
+ int nb_numa_nodes = ms->numa_state->num_nodes;
+
+ if (nvdimm->target_node >= MAX_NODES) {
+ error_setg(errp, "'NVDIMM property " NVDIMM_TARGET_NODE_PROP
+ " has value %" PRIu32
+ "' which exceeds the max number of numa nodes: %d",
+ nvdimm->target_node, MAX_NODES);
+ return;
+ }
+
+ if (nvdimm->target_node >= nb_numa_nodes) {
+ set_bit(nvdimm->target_node, nvdimm_target_nodes);
+ if (nvdimm->target_node > nvdimm_max_target_node) {
+ nvdimm_max_target_node = nvdimm->target_node;
+ }
+ }
if (!nvdimm->nvdimm_mr) {
nvdimm_prepare_memory_region(nvdimm, errp);
@@ -229,6 +249,7 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
static Property nvdimm_properties[] = {
DEFINE_PROP_BOOL(NVDIMM_UNARMED_PROP, NVDIMMDevice, unarmed, false),
+ DEFINE_PROP_INT32(NVDIMM_TARGET_NODE_PROP, NVDIMMDevice, target_node, -1),
DEFINE_PROP_END_OF_LIST(),
};
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index bcf62f825c..96b609a60e 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -51,6 +51,7 @@ OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
#define NVDIMM_LABEL_SIZE_PROP "label-size"
#define NVDIMM_UUID_PROP "uuid"
#define NVDIMM_UNARMED_PROP "unarmed"
+#define NVDIMM_TARGET_NODE_PROP "target-node"
struct NVDIMMDevice {
/* private */
@@ -89,6 +90,14 @@ struct NVDIMMDevice {
* The PPC64 - spapr requires each nvdimm device have a uuid.
*/
QemuUUID uuid;
+
+ /*
+ * Support for the volatile-use of persistent memory as normal RAM.
+ * This newly-added memory can be selected by its unique NUMA node.
+ * This node can be extended to a new node after all existing NUMA
+ * nodes.
+ */
+ int target_node;
};
struct NVDIMMClass {
@@ -148,14 +157,18 @@ struct NVDIMMState {
};
typedef struct NVDIMMState NVDIMMState;
+extern unsigned long nvdimm_target_nodes[];
+extern int nvdimm_max_target_node;
+
void nvdimm_init_acpi_state(NVDIMMState *state, MemoryRegion *io,
struct AcpiGenericAddress dsm_io,
FWCfgState *fw_cfg, Object *owner);
-void nvdimm_build_srat(GArray *table_data);
+int nvdimm_build_srat(GArray *table_data);
void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
BIOSLinker *linker, NVDIMMState *state,
uint32_t ram_slots, const char *oem_id,
const char *oem_table_id);
void nvdimm_plug(NVDIMMState *state);
void nvdimm_acpi_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev);
+int nvdimm_check_target_nodes(void);
#endif
diff --git a/util/nvdimm-utils.c b/util/nvdimm-utils.c
index aa3d199f2d..767f1e4787 100644
--- a/util/nvdimm-utils.c
+++ b/util/nvdimm-utils.c
@@ -1,5 +1,7 @@
#include "qemu/osdep.h"
#include "qemu/nvdimm-utils.h"
+#include "qapi/error.h"
+#include "hw/boards.h"
#include "hw/mem/nvdimm.h"
static int nvdimm_device_list(Object *obj, void *opaque)
@@ -28,3 +30,23 @@ GSList *nvdimm_get_device_list(void)
object_child_foreach(qdev_get_machine(), nvdimm_device_list, &list);
return list;
}
+
+int nvdimm_check_target_nodes(void)
+{
+ MachineState *ms = MACHINE(qdev_get_machine());
+ int nb_numa_nodes = ms->numa_state->num_nodes;
+ int node;
+
+ if (!nvdimm_max_target_node) {
+ return -1;
+ }
+
+ for (node = nb_numa_nodes; node <= nvdimm_max_target_node; node++) {
+ if (!test_bit(node, nvdimm_target_nodes)) {
+ error_report("nvdimm target-node: Node ID missing: %d", node);
+ exit(1);
+ }
+ }
+
+ return nvdimm_max_target_node;
+}
--
2.21.3
* Re: [PATCH] nvdimm: add 'target-node' option
  2021-06-25  2:25 [PATCH] nvdimm: add 'target-node' option Jingqi Liu
@ 2021-06-30 15:00 ` Igor Mammedov
  2021-07-01  7:55   ` Liu, Jingqi
  0 siblings, 1 reply; 3+ messages in thread
From: Igor Mammedov @ 2021-06-30 15:00 UTC (permalink / raw)
To: Jingqi Liu
Cc: ehabkost, mst, richard.henderson, qemu-devel, pbonzini,
    xiaoguangrong.eric

On Fri, 25 Jun 2021 10:25:18 +0800
Jingqi Liu <jingqi.liu@intel.com> wrote:

> Linux kernel version 5.1 brings in support for the volatile use of
> persistent memory as a hotplugged memory region (KMEM DAX).
> When this feature is enabled, persistent memory can appear as one or
> more separate memory-only NUMA nodes. This newly added memory can be
> selected by its unique NUMA node.
>
> Add a 'target-node' option to the 'nvdimm' device to indicate this
> NUMA node. It may refer to a new node beyond all existing NUMA nodes.

How dynamic is it?
Can we force it to be a 'static' node, and would that break anything?

> The 'node' option of the 'pc-dimm' device adds the DIMM to an existing
> NUMA node, so its value must be one of the available NUMA nodes. With
> KMEM DAX, persistent memory can instead live in a new, separate
> memory-only NUMA node that is created dynamically. The 'target-node'
> option therefore lets users control whether persistent memory is added
> to an existing NUMA node or to a new one.

I don't get the reasoning behind creating a new property instead of
reusing the existing 'node'.

Can you provide more context by pointing to the relevant kernel series?
A pointer to a specification?

The SRAT handling also looks a bit sketchy: you are saying the node is
dynamic and assigned by the guest, and then you try to put it into the
static SRAT along with the predefined nodes.

> An example configuration is as follows.
[...]
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> @@ -228,11 +228,13 @@ nvdimm_build_structure_spa(GArray *structures, DeviceState *dev)
[...]
> - uint32_t node = object_property_get_uint(OBJECT(dev), PC_DIMM_NODE_PROP,

node id is 32 bit per spec, don't lose it here

> + int node = object_property_get_int(OBJECT(dev), NVDIMM_TARGET_NODE_PROP,
> NULL);
> int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
> NULL);
> -
> + if (node < 0) {
> + node = object_property_get_uint(OBJECT(dev), PC_DIMM_NODE_PROP, NULL);
> + }

see below about error handling

[...]

> @@ -1348,7 +1351,12 @@ void nvdimm_build_srat(GArray *table_data)
> - node = object_property_get_int(obj, PC_DIMM_NODE_PROP, &error_abort);
> + node = object_property_get_int(obj, NVDIMM_TARGET_NODE_PROP,
> + &error_abort);
> + if (node < 0) {
> + node = object_property_get_uint(obj, PC_DIMM_NODE_PROP,
> + &error_abort);
> + }

it should be checked at realize time, with proper error handling.
Also I'd make both properties mutually exclusive

[...]
> @@ -181,6 +184,23 @@ static MemoryRegion *nvdimm_md_get_memory_region(MemoryDeviceState *md,
> static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
> {
> NVDIMMDevice *nvdimm = NVDIMM(dimm);
> + MachineState *ms = MACHINE(qdev_get_machine());
> + int nb_numa_nodes = ms->numa_state->num_nodes;
[...]
> + if (nvdimm->target_node >= nb_numa_nodes) {
> + set_bit(nvdimm->target_node, nvdimm_target_nodes);
> + if (nvdimm->target_node > nvdimm_max_target_node) {
> + nvdimm_max_target_node = nvdimm->target_node;
> + }
> + }

the device shouldn't poke into Machine,
use a _pre_plug callback for that

[...]
* Re: [PATCH] nvdimm: add 'target-node' option
  2021-06-30 15:00 ` Igor Mammedov
@ 2021-07-01  7:55   ` Liu, Jingqi
  0 siblings, 0 replies; 3+ messages in thread
From: Liu, Jingqi @ 2021-07-01 7:55 UTC (permalink / raw)
To: Igor Mammedov
Cc: ehabkost@redhat.com, mst@redhat.com, richard.henderson@linaro.org,
    qemu-devel@nongnu.org, pbonzini@redhat.com,
    xiaoguangrong.eric@gmail.com

Hi Igor,

Thanks for your comments.

On 6/30/2021 11:00 PM, Igor Mammedov wrote:
> How dynamic is it?
> Can we force it to be a 'static' node, and would that break anything?

In the guest, the node can only be created after running a command such
as 'daxctl reconfigure-device dax0.0 --mode=system-ram', which lets the
persistent memory be used as volatile memory.

The only way to force it to be a 'static' node is a QEMU command line
like the following:

  -numa node,nodeid=1
  -object memory-backend-file,id=nvmem0,share=on,mem-path=/dev/dax0.0,size=2G,align=2M
  -device nvdimm,memdev=nvmem0,id=nv0,label-size=2M,node=1

In the guest, before the DAX device is reconfigured to volatile memory,
the NUMA nodes then look like this:

  $ numactl -H
  available: 2 nodes (0-1)
  node 0 cpus: 0 1 2 3
  node 0 size: 5933 MB
  node 0 free: 5402 MB
  node 1 cpus: 4 5 6 7
  node 1 size: 0 MB
  node 1 free: 0 MB
  node distances:
  node   0   1
    0:  10  20
    1:  20  10

It looks unfriendly.

> I don't get the reasoning behind creating a new property instead of
> reusing the existing 'node'.
>
> Can you provide more context by pointing to the relevant kernel series?
> A pointer to a specification?

The relevant kernel series is here:
https://patchwork.kernel.org/project/linux-nvdimm/cover/20190225185727.BCBD768C@viggo.jf.intel.com/

For the specification, please refer to the links below:
  https://pmem.io/glossary  (KMEM DAX)
  https://pmem.io/2020/01/20/memkind-dax-kmem.html

> The SRAT handling also looks a bit sketchy: you are saying the node is
> dynamic and assigned by the guest, and then you try to put it into the
> static SRAT along with the predefined nodes.

The 'target-node' is stored in the 'proximity domain' field of the SRAT
and NFIT. When the DAX device is created, the kernel DAX driver sets its
'target_node' from this field. If the DAX device is later reconfigured
as volatile memory, it is hot-plugged into the NUMA node specified by
that 'target_node'.
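For reference, the memkind-dax-kmem link above is also how a guest
application would typically consume such a node directly. A minimal
sketch (illustrative only; it assumes memkind >= 1.10 with
MEMKIND_DAX_KMEM inside the guest, built with -lmemkind):

  #include <memkind.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      size_t len = 64UL << 20;
      /* Allocate from the hot-plugged KMEM DAX node(s) instead of DRAM. */
      char *buf = memkind_malloc(MEMKIND_DAX_KMEM, len);

      if (!buf) {
          fprintf(stderr, "no KMEM DAX node available\n");
          return 1;
      }
      memset(buf, 0, len);
      memkind_free(MEMKIND_DAX_KMEM, buf);
      return 0;
  }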
[...]

>> - uint32_t node = object_property_get_uint(OBJECT(dev), PC_DIMM_NODE_PROP,
>
> node id is 32 bit per spec, don't lose it here

Got it.

[...]

>> + node = object_property_get_int(obj, NVDIMM_TARGET_NODE_PROP,
>> + &error_abort);
>> + if (node < 0) {
>> + node = object_property_get_uint(obj, PC_DIMM_NODE_PROP,
>> + &error_abort);
>> + }
>
> it should be checked at realize time, with proper error handling.
> Also I'd make both properties mutually exclusive

Good point.
I'll move the checking to realize().
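A rough sketch of such a realize-time check might look like this (not
part of this patch; the 'node_set' flag is hypothetical, since
PCDIMMDevice does not currently record whether 'node' was set
explicitly):

  static void nvdimm_validate_node_props(PCDIMMDevice *dimm, Error **errp)
  {
      NVDIMMDevice *nvdimm = NVDIMM(dimm);

      /* Range-check 'target-node' with a recoverable error, not an abort. */
      if (nvdimm->target_node >= MAX_NODES) {
          error_setg(errp, "nvdimm: '" NVDIMM_TARGET_NODE_PROP "' is %d, "
                     "but the maximum supported node is %d",
                     nvdimm->target_node, MAX_NODES - 1);
          return;
      }

      /*
       * Make 'target-node' and 'node' mutually exclusive.  'node_set' is
       * a hypothetical flag recording an explicit 'node' assignment.
       */
      if (nvdimm->target_node >= 0 && dimm->node_set) {
          error_setg(errp, "nvdimm: '" NVDIMM_TARGET_NODE_PROP "' and '"
                     PC_DIMM_NODE_PROP "' cannot both be set");
      }
  }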
[...]

>> + if (nvdimm->target_node >= nb_numa_nodes) {
>> + set_bit(nvdimm->target_node, nvdimm_target_nodes);
>> + if (nvdimm->target_node > nvdimm_max_target_node) {
>> + nvdimm_max_target_node = nvdimm->target_node;
>> + }
>> + }
>
> the device shouldn't poke into Machine,
> use a _pre_plug callback for that

Good point.
How about calling nvdimm_pre_plug() in pc_memory_pre_plug()?

[...]
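A possible shape for that hook (illustrative only; nvdimm_pre_plug()
does not exist yet and the exact signature would still need to be
agreed on):

  /*
   * Called from the machine's pre-plug path, e.g. pc_memory_pre_plug(),
   * so that the device model itself never reaches into MachineState.
   */
  void nvdimm_pre_plug(NVDIMMDevice *nvdimm, MachineState *ms, Error **errp)
  {
      int nb_numa_nodes = ms->numa_state->num_nodes;

      if (nvdimm->target_node < 0) {
          return;     /* 'target-node' not used */
      }

      if (nvdimm->target_node >= MAX_NODES) {
          error_setg(errp, "nvdimm: '" NVDIMM_TARGET_NODE_PROP "' %d exceeds"
                     " the maximum number of NUMA nodes (%d)",
                     nvdimm->target_node, MAX_NODES);
          return;
      }

      if (nvdimm->target_node >= nb_numa_nodes) {
          /* Track extra nodes so SRAT generation can cover them. */
          set_bit(nvdimm->target_node, nvdimm_target_nodes);
          nvdimm_max_target_node = MAX(nvdimm_max_target_node,
                                       nvdimm->target_node);
      }
  }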