qemu-devel.nongnu.org archive mirror
* [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
@ 2023-02-23  8:13 Gavin Shan
  2023-02-23  8:13 ` [PATCH v2 1/4] qtest/numa-test: Follow socket-NUMA-node boundary for aarch64 Gavin Shan
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Gavin Shan @ 2023-02-23  8:13 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, qemu-riscv, rad, peter.maydell, quic_llindhol,
	eduardo, marcel.apfelbaum, philmd, wangyanan55, palmer,
	alistair.francis, bin.meng, thuth, lvivier, pbonzini, imammedo,
	yihyu, shan.gavin

For the arm64 and RISC-V architectures, the driver (drivers/base/arch_topology.c)
is used to populate the CPU topology in the Linux guest. It requires that
the CPUs in one socket don't span multiple NUMA nodes. Otherwise, the Linux
scheduling domains can't be sorted out, as the following warning message
indicates. To avoid unexpected confusion, this series attempts to
reject such insane configurations.

   -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
   -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
   -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
   -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
 
   ------------[ cut here ]------------
   WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2271 build_sched_domains+0x284/0x910
   Modules linked in:
   CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-268.el9.aarch64 #1
   pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
   pc : build_sched_domains+0x284/0x910
   lr : build_sched_domains+0x184/0x910
   sp : ffff80000804bd50
   x29: ffff80000804bd50 x28: 0000000000000002 x27: 0000000000000000
   x26: ffff800009cf9a80 x25: 0000000000000000 x24: ffff800009cbf840
   x23: ffff000080325000 x22: ffff0000005df800 x21: ffff80000a4ce508
   x20: 0000000000000000 x19: ffff000080324440 x18: 0000000000000014
   x17: 00000000388925c0 x16: 000000005386a066 x15: 000000009c10cc2e
   x14: 00000000000001c0 x13: 0000000000000001 x12: ffff00007fffb1a0
   x11: ffff00007fffb180 x10: ffff80000a4ce508 x9 : 0000000000000041
   x8 : ffff80000a4ce500 x7 : ffff80000a4cf920 x6 : 0000000000000001
   x5 : 0000000000000001 x4 : 0000000000000007 x3 : 0000000000000002
   x2 : 0000000000001000 x1 : ffff80000a4cf928 x0 : 0000000000000001
   Call trace:
    build_sched_domains+0x284/0x910
    sched_init_domains+0xac/0xe0
    sched_init_smp+0x48/0xc8
    kernel_init_freeable+0x140/0x1ac
    kernel_init+0x28/0x140
    ret_from_fork+0x10/0x20
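
For reference, a configuration where the NUMA split is aligned to the socket
boundary passes the proposed validation. A minimal sketch, reusing the memory
backends (ram0/ram1) from the failing example above:

   -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
   -numa node,nodeid=0,cpus=0-2,memdev=ram0                \
   -numa node,nodeid=1,cpus=3-5,memdev=ram1

Here CPUs 0-2 sit in socket-0 and CPUs 3-5 in socket-1, so no socket spans
multiple NUMA nodes.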

PATCH[1] Improves qtest/numa-test for aarch64 to follow the socket-NUMA-node boundary
PATCH[2] Validates the configuration in the NUMA subsystem when required
PATCH[3] Enables the validation for aarch64 machines
PATCH[4] Enables the validation for RiscV machines

v1: https://lists.nongnu.org/archive/html/qemu-arm/2023-02/msg00886.html

Changelog
=========
v2:
  * Fix socket-NUMA-node boundary issues in qtests/numa-test  (Gavin)
  * Add helper set_numa_socket_boundary() and validate the
    boundary in the generic path                              (Philippe)

Gavin Shan (4):
  qtest/numa-test: Follow socket-NUMA-node boundary for aarch64
  numa: Validate socket and NUMA node boundary if required
  hw/arm: Validate socket and NUMA node boundary
  hw/riscv: Validate socket and NUMA node boundary

 hw/arm/sbsa-ref.c       |  2 ++
 hw/arm/virt.c           |  2 ++
 hw/core/machine.c       | 34 ++++++++++++++++++++++++++++++++++
 hw/core/numa.c          |  7 +++++++
 hw/riscv/spike.c        |  1 +
 hw/riscv/virt.c         |  1 +
 include/sysemu/numa.h   |  4 ++++
 tests/qtest/numa-test.c | 13 ++++++++++---
 8 files changed, 61 insertions(+), 3 deletions(-)

-- 
2.23.0



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/4] qtest/numa-test: Follow socket-NUMA-node boundary for aarch64
  2023-02-23  8:13 [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines Gavin Shan
@ 2023-02-23  8:13 ` Gavin Shan
  2023-02-23  8:13 ` [PATCH v2 2/4] numa: Validate socket and NUMA node boundary if required Gavin Shan
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Gavin Shan @ 2023-02-23  8:13 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, qemu-riscv, rad, peter.maydell, quic_llindhol,
	eduardo, marcel.apfelbaum, philmd, wangyanan55, palmer,
	alistair.francis, bin.meng, thuth, lvivier, pbonzini, imammedo,
	yihyu, shan.gavin

After the socket-to-NUMA-node boundary is applied to aarch64 in the subsequent
patches, we need to explicitly specify 'smp.sockets=2' for the 'test_mon_explicit'
and 'test_query_cpus' test cases. Besides, 'test_mon_partial' is no longer
applicable to aarch64.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 tests/qtest/numa-test.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/tests/qtest/numa-test.c b/tests/qtest/numa-test.c
index c5eb13f349..ebfd522af3 100644
--- a/tests/qtest/numa-test.c
+++ b/tests/qtest/numa-test.c
@@ -25,7 +25,8 @@ static void test_mon_explicit(const void *data)
     g_autofree char *s = NULL;
     g_autofree char *cli = NULL;
 
-    cli = make_cli(data, "-machine smp.cpus=8 -numa node,nodeid=0,memdev=ram,cpus=0-3 "
+    cli = make_cli(data, "-machine smp.cpus=8,smp.sockets=2 "
+                         "-numa node,nodeid=0,memdev=ram,cpus=0-3 "
                          "-numa node,nodeid=1,cpus=4-7");
     qts = qtest_init(cli);
 
@@ -87,7 +88,8 @@ static void test_query_cpus(const void *data)
     QTestState *qts;
     g_autofree char *cli = NULL;
 
-    cli = make_cli(data, "-machine smp.cpus=8 -numa node,memdev=ram,cpus=0-3 "
+    cli = make_cli(data, "-machine smp.cpus=8,smp.sockets=2 "
+                         "-numa node,memdev=ram,cpus=0-3 "
                          "-numa node,cpus=4-7");
     qts = qtest_init(cli);
     cpus = get_cpus(qts, &resp);
@@ -565,7 +567,12 @@ int main(int argc, char **argv)
 
     qtest_add_data_func("/numa/mon/cpus/default", args, test_def_cpu_split);
     qtest_add_data_func("/numa/mon/cpus/explicit", args, test_mon_explicit);
-    qtest_add_data_func("/numa/mon/cpus/partial", args, test_mon_partial);
+
+    if (!strcmp(arch, "i386") || !strcmp(arch, "x86_64") ||
+        !strcmp(arch, "ppc64")) {
+        qtest_add_data_func("/numa/mon/cpus/partial", args, test_mon_partial);
+    }
+
     qtest_add_data_func("/numa/qmp/cpus/query-cpus", args, test_query_cpus);
 
     if (!strcmp(arch, "i386") || !strcmp(arch, "x86_64")) {
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 2/4] numa: Validate socket and NUMA node boundary if required
  2023-02-23  8:13 [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines Gavin Shan
  2023-02-23  8:13 ` [PATCH v2 1/4] qtest/numa-test: Follow socket-NUMA-node boundary for aarch64 Gavin Shan
@ 2023-02-23  8:13 ` Gavin Shan
  2023-02-23  9:05   ` Philippe Mathieu-Daudé
  2023-02-23  8:14 ` [PATCH v2 3/4] hw/arm: Validate socket and NUMA node boundary Gavin Shan
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Gavin Shan @ 2023-02-23  8:13 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, qemu-riscv, rad, peter.maydell, quic_llindhol,
	eduardo, marcel.apfelbaum, philmd, wangyanan55, palmer,
	alistair.francis, bin.meng, thuth, lvivier, pbonzini, imammedo,
	yihyu, shan.gavin

For some architectures like ARM64, multiple CPUs in one socket can't be
associated with different NUMA nodes. Otherwise, the guest kernel is confused
about the CPU topology. For example, the following warning message is observed
from the Linux guest with the command lines below.

  -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
  -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
  -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
  -numa node,nodeid=2,cpus=4-5,memdev=ram2                \

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2271 build_sched_domains+0x284/0x910
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-268.el9.aarch64 #1
  pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : build_sched_domains+0x284/0x910
  lr : build_sched_domains+0x184/0x910
  sp : ffff80000804bd50
  x29: ffff80000804bd50 x28: 0000000000000002 x27: 0000000000000000
  x26: ffff800009cf9a80 x25: 0000000000000000 x24: ffff800009cbf840
  x23: ffff000080325000 x22: ffff0000005df800 x21: ffff80000a4ce508
  x20: 0000000000000000 x19: ffff000080324440 x18: 0000000000000014
  x17: 00000000388925c0 x16: 000000005386a066 x15: 000000009c10cc2e
  x14: 00000000000001c0 x13: 0000000000000001 x12: ffff00007fffb1a0
  x11: ffff00007fffb180 x10: ffff80000a4ce508 x9 : 0000000000000041
  x8 : ffff80000a4ce500 x7 : ffff80000a4cf920 x6 : 0000000000000001
  x5 : 0000000000000001 x4 : 0000000000000007 x3 : 0000000000000002
  x2 : 0000000000001000 x1 : ffff80000a4cf928 x0 : 0000000000000001
  Call trace:
   build_sched_domains+0x284/0x910
   sched_init_domains+0xac/0xe0
   sched_init_smp+0x48/0xc8
   kernel_init_freeable+0x140/0x1ac
   kernel_init+0x28/0x140
   ret_from_fork+0x10/0x20

Improve the sitation to reject the configuration where multiple CPUs
in one socket have been associated with different NUMA nodes. The
newly introduced helper set_numa_socket_boundary() is expected to be
called by specific machines (boards) where the boundary is required.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 hw/core/machine.c     | 34 ++++++++++++++++++++++++++++++++++
 hw/core/numa.c        |  7 +++++++
 include/sysemu/numa.h |  4 ++++
 3 files changed, 45 insertions(+)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index f73fc4c45c..875a3fe6c4 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -1174,6 +1174,36 @@ static char *cpu_slot_to_string(const CPUArchId *cpu)
     return g_string_free(s, false);
 }
 
+static void numa_validate_socket_boundary(MachineState *ms)
+{
+    MachineClass *mc = MACHINE_GET_CLASS(ms);
+    NumaState *state = ms->numa_state;
+    const CPUArchIdList *possible_cpus = mc->possible_cpu_arch_ids(ms);
+    const CPUArchId *cpus = possible_cpus->cpus;
+    int len = possible_cpus->len, i, j;
+
+    if (state->num_nodes <= 1 || len <= 1) {
+        return;
+    }
+
+    for (i = 0; i < len; i++) {
+        for (j = i + 1; j < len; j++) {
+            if (cpus[i].props.has_socket_id &&
+                cpus[i].props.has_node_id &&
+                cpus[j].props.has_socket_id &&
+                cpus[j].props.has_node_id &&
+                cpus[i].props.socket_id == cpus[j].props.socket_id &&
+                cpus[i].props.node_id != cpus[j].props.node_id) {
+                error_report("CPU-%d and CPU-%d in socket-%ld have been "
+                             "associated with node-%ld and node-%ld "
+                             "respectively", i, j, cpus[i].props.socket_id,
+                             cpus[i].props.node_id, cpus[j].props.node_id);
+                exit(1);
+            }
+        }
+    }
+}
+
 static void numa_validate_initiator(NumaState *numa_state)
 {
     int i;
@@ -1239,6 +1269,10 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
         }
     }
 
+    if (machine->numa_state->have_socket_boundary) {
+        numa_validate_socket_boundary(machine);
+    }
+
     if (machine->numa_state->hmat_enabled) {
         numa_validate_initiator(machine->numa_state);
     }
diff --git a/hw/core/numa.c b/hw/core/numa.c
index d8d36b16d8..ebdd964ec8 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -460,6 +460,13 @@ void parse_numa_hmat_cache(MachineState *ms, NumaHmatCacheOptions *node,
     ms->numa_state->hmat_cache[node->node_id][node->level] = hmat_cache;
 }
 
+void set_numa_socket_boundary(MachineState *ms)
+{
+    if (ms->numa_state) {
+        ms->numa_state->have_socket_boundary = true;
+    }
+}
+
 void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
 {
     if (!ms->numa_state) {
diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
index 4173ef2afa..160008fff4 100644
--- a/include/sysemu/numa.h
+++ b/include/sysemu/numa.h
@@ -86,6 +86,9 @@ struct NumaState {
     /* Detect if HMAT support is enabled. */
     bool hmat_enabled;
 
+    /* CPUs in one socket can't break socket boundary */
+    bool have_socket_boundary;
+
     /* NUMA nodes information */
     NodeInfo nodes[MAX_NODES];
 
@@ -97,6 +100,7 @@ struct NumaState {
 };
 typedef struct NumaState NumaState;
 
+void set_numa_socket_boundary(MachineState *ms);
 void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp);
 void parse_numa_opts(MachineState *ms);
 void parse_numa_hmat_lb(NumaState *numa_state, NumaHmatLBOptions *node,
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 3/4] hw/arm: Validate socket and NUMA node boundary
  2023-02-23  8:13 [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines Gavin Shan
  2023-02-23  8:13 ` [PATCH v2 1/4] qtest/numa-test: Follow socket-NUMA-node boundary for aarch64 Gavin Shan
  2023-02-23  8:13 ` [PATCH v2 2/4] numa: Validate socket and NUMA node boundary if required Gavin Shan
@ 2023-02-23  8:14 ` Gavin Shan
  2023-02-23  8:14 ` [PATCH v2 4/4] hw/riscv: " Gavin Shan
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Gavin Shan @ 2023-02-23  8:14 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, qemu-riscv, rad, peter.maydell, quic_llindhol,
	eduardo, marcel.apfelbaum, philmd, wangyanan55, palmer,
	alistair.francis, bin.meng, thuth, lvivier, pbonzini, imammedo,
	yihyu, shan.gavin

There are two NUMA-aware ARM machines: 'virt' and 'sbsa-ref'. Both of
them are required to follow the socket-NUMA-node boundary. Enable the
validation on them to reject incorrect configurations.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 hw/arm/sbsa-ref.c | 2 ++
 hw/arm/virt.c     | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/hw/arm/sbsa-ref.c b/hw/arm/sbsa-ref.c
index f778cb6d09..1a87437017 100644
--- a/hw/arm/sbsa-ref.c
+++ b/hw/arm/sbsa-ref.c
@@ -844,6 +844,8 @@ static void sbsa_ref_instance_init(Object *obj)
     SBSAMachineState *sms = SBSA_MACHINE(obj);
 
     sbsa_flash_create(sms);
+
+    set_numa_socket_boundary(MACHINE(obj));
 }
 
 static void sbsa_ref_class_init(ObjectClass *oc, void *data)
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ac626b3bef..9d9f26626c 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -3210,6 +3210,8 @@ static void virt_instance_init(Object *obj)
 
     vms->oem_id = g_strndup(ACPI_BUILD_APPNAME6, 6);
     vms->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8);
+
+    set_numa_socket_boundary(MACHINE(obj));
 }
 
 static const TypeInfo virt_machine_info = {
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 4/4] hw/riscv: Validate socket and NUMA node boundary
  2023-02-23  8:13 [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines Gavin Shan
                   ` (2 preceding siblings ...)
  2023-02-23  8:14 ` [PATCH v2 3/4] hw/arm: Validate socket and NUMA node boundary Gavin Shan
@ 2023-02-23  8:14 ` Gavin Shan
  2023-02-23 12:25 ` [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines Andrew Jones
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Gavin Shan @ 2023-02-23  8:14 UTC (permalink / raw)
  To: qemu-arm
  Cc: qemu-devel, qemu-riscv, rad, peter.maydell, quic_llindhol,
	eduardo, marcel.apfelbaum, philmd, wangyanan55, palmer,
	alistair.francis, bin.meng, thuth, lvivier, pbonzini, imammedo,
	yihyu, shan.gavin

There are two NUMA-aware RISC-V machines: 'virt' and 'spike'. Both of
them are required to follow the socket-NUMA-node boundary. Enable the
validation on them to reject incorrect configurations.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 hw/riscv/spike.c | 1 +
 hw/riscv/virt.c  | 1 +
 2 files changed, 2 insertions(+)

diff --git a/hw/riscv/spike.c b/hw/riscv/spike.c
index cc3f6dac17..fba0cbec29 100644
--- a/hw/riscv/spike.c
+++ b/hw/riscv/spike.c
@@ -342,6 +342,7 @@ static void spike_board_init(MachineState *machine)
 
 static void spike_machine_instance_init(Object *obj)
 {
+    set_numa_socket_boundary(MACHINE(obj));
 }
 
 static void spike_machine_class_init(ObjectClass *oc, void *data)
diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
index b81081c70b..ed79becb96 100644
--- a/hw/riscv/virt.c
+++ b/hw/riscv/virt.c
@@ -1521,6 +1521,7 @@ static void virt_machine_init(MachineState *machine)
 
 static void virt_machine_instance_init(Object *obj)
 {
+    set_numa_socket_boundary(MACHINE(obj));
 }
 
 static char *virt_get_aia_guests(Object *obj, Error **errp)
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] numa: Validate socket and NUMA node boundary if required
  2023-02-23  8:13 ` [PATCH v2 2/4] numa: Validate socket and NUMA node boundary if required Gavin Shan
@ 2023-02-23  9:05   ` Philippe Mathieu-Daudé
  2023-02-23 10:27     ` Gavin Shan
  0 siblings, 1 reply; 18+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-02-23  9:05 UTC (permalink / raw)
  To: Gavin Shan, qemu-arm
  Cc: qemu-devel, qemu-riscv, rad, peter.maydell, quic_llindhol,
	eduardo, marcel.apfelbaum, wangyanan55, palmer, alistair.francis,
	bin.meng, thuth, lvivier, pbonzini, imammedo, yihyu, shan.gavin

On 23/2/23 09:13, Gavin Shan wrote:
> For some architectures like ARM64, multiple CPUs in one socket can't be
> associated with different NUMA nodes. Otherwise, the guest kernel is confused
> about the CPU topology. For example, the following warning message is observed
> from linux guest with the below command lines.
> 
>    -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>    -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>    -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>    -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
> 
>    ------------[ cut here ]------------
>    WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2271 build_sched_domains+0x284/0x910
>    Modules linked in:
>    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-268.el9.aarch64 #1
>    pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>    pc : build_sched_domains+0x284/0x910
>    lr : build_sched_domains+0x184/0x910
>    sp : ffff80000804bd50
>    x29: ffff80000804bd50 x28: 0000000000000002 x27: 0000000000000000
>    x26: ffff800009cf9a80 x25: 0000000000000000 x24: ffff800009cbf840
>    x23: ffff000080325000 x22: ffff0000005df800 x21: ffff80000a4ce508
>    x20: 0000000000000000 x19: ffff000080324440 x18: 0000000000000014
>    x17: 00000000388925c0 x16: 000000005386a066 x15: 000000009c10cc2e
>    x14: 00000000000001c0 x13: 0000000000000001 x12: ffff00007fffb1a0
>    x11: ffff00007fffb180 x10: ffff80000a4ce508 x9 : 0000000000000041
>    x8 : ffff80000a4ce500 x7 : ffff80000a4cf920 x6 : 0000000000000001
>    x5 : 0000000000000001 x4 : 0000000000000007 x3 : 0000000000000002
>    x2 : 0000000000001000 x1 : ffff80000a4cf928 x0 : 0000000000000001
>    Call trace:
>     build_sched_domains+0x284/0x910
>     sched_init_domains+0xac/0xe0
>     sched_init_smp+0x48/0xc8
>     kernel_init_freeable+0x140/0x1ac
>     kernel_init+0x28/0x140
>     ret_from_fork+0x10/0x20
> 
> Improve the sitation to reject the configuration where multiple CPUs

Typo "situation".

> in one socket have been associated with different NUMA nodes. The
> newly introduced helper set_numa_socket_boundary() is expected to
> called by specific machines (boards) where the boundary is required.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>   hw/core/machine.c     | 34 ++++++++++++++++++++++++++++++++++
>   hw/core/numa.c        |  7 +++++++
>   include/sysemu/numa.h |  4 ++++
>   3 files changed, 45 insertions(+)


> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> index 4173ef2afa..160008fff4 100644
> --- a/include/sysemu/numa.h
> +++ b/include/sysemu/numa.h
> @@ -86,6 +86,9 @@ struct NumaState {
>       /* Detect if HMAT support is enabled. */
>       bool hmat_enabled;
>   
> +    /* CPUs in one socket can't break socket boundary */
> +    bool have_socket_boundary;

This field belongs to MachineClass; please add it as
numa_have_socket_boundary, just after numa_mem_supported.

Next patches become:

---
diff --git a/hw/arm/sbsa-ref.c b/hw/arm/sbsa-ref.c
index f778cb6d09..a48f1b2329 100644
--- a/hw/arm/sbsa-ref.c
+++ b/hw/arm/sbsa-ref.c
@@ -864,6 +864,7 @@ static void sbsa_ref_class_init(ObjectClass *oc, 
void *data)
      mc->possible_cpu_arch_ids = sbsa_ref_possible_cpu_arch_ids;
      mc->cpu_index_to_instance_props = sbsa_ref_cpu_index_to_props;
      mc->get_default_cpu_node_id = sbsa_ref_get_default_cpu_node_id;
+    mc->numa_have_socket_boundary = true;
  }
---

Otherwise LGTM :)


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] numa: Validate socket and NUMA node boundary if required
  2023-02-23  9:05   ` Philippe Mathieu-Daudé
@ 2023-02-23 10:27     ` Gavin Shan
  0 siblings, 0 replies; 18+ messages in thread
From: Gavin Shan @ 2023-02-23 10:27 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-arm
  Cc: qemu-devel, qemu-riscv, rad, peter.maydell, quic_llindhol,
	eduardo, marcel.apfelbaum, wangyanan55, palmer, alistair.francis,
	bin.meng, thuth, lvivier, pbonzini, imammedo, yihyu, shan.gavin

On 2/23/23 8:05 PM, Philippe Mathieu-Daudé wrote:
> On 23/2/23 09:13, Gavin Shan wrote:
>> For some architectures like ARM64, multiple CPUs in one socket can't be
>> associated with different NUMA nodes. Otherwise, the guest kernel is confused
>> about the CPU topology. For example, the following warning message is observed
>> from linux guest with the below command lines.
>>
>>    -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>>    -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>>    -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>>    -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
>>
>>    ------------[ cut here ]------------
>>    WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2271 build_sched_domains+0x284/0x910
>>    Modules linked in:
>>    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-268.el9.aarch64 #1
>>    pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>    pc : build_sched_domains+0x284/0x910
>>    lr : build_sched_domains+0x184/0x910
>>    sp : ffff80000804bd50
>>    x29: ffff80000804bd50 x28: 0000000000000002 x27: 0000000000000000
>>    x26: ffff800009cf9a80 x25: 0000000000000000 x24: ffff800009cbf840
>>    x23: ffff000080325000 x22: ffff0000005df800 x21: ffff80000a4ce508
>>    x20: 0000000000000000 x19: ffff000080324440 x18: 0000000000000014
>>    x17: 00000000388925c0 x16: 000000005386a066 x15: 000000009c10cc2e
>>    x14: 00000000000001c0 x13: 0000000000000001 x12: ffff00007fffb1a0
>>    x11: ffff00007fffb180 x10: ffff80000a4ce508 x9 : 0000000000000041
>>    x8 : ffff80000a4ce500 x7 : ffff80000a4cf920 x6 : 0000000000000001
>>    x5 : 0000000000000001 x4 : 0000000000000007 x3 : 0000000000000002
>>    x2 : 0000000000001000 x1 : ffff80000a4cf928 x0 : 0000000000000001
>>    Call trace:
>>     build_sched_domains+0x284/0x910
>>     sched_init_domains+0xac/0xe0
>>     sched_init_smp+0x48/0xc8
>>     kernel_init_freeable+0x140/0x1ac
>>     kernel_init+0x28/0x140
>>     ret_from_fork+0x10/0x20
>>
>> Improve the sitation to reject the configuration where multiple CPUs
> 
> Typo "situation".
> 

Yes, I will fix it up in next revision.

>> in one socket have been associated with different NUMA nodes. The
>> newly introduced helper set_numa_socket_boundary() is expected to
>> called by specific machines (boards) where the boundary is required.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   hw/core/machine.c     | 34 ++++++++++++++++++++++++++++++++++
>>   hw/core/numa.c        |  7 +++++++
>>   include/sysemu/numa.h |  4 ++++
>>   3 files changed, 45 insertions(+)
> 
> 
>> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
>> index 4173ef2afa..160008fff4 100644
>> --- a/include/sysemu/numa.h
>> +++ b/include/sysemu/numa.h
>> @@ -86,6 +86,9 @@ struct NumaState {
>>       /* Detect if HMAT support is enabled. */
>>       bool hmat_enabled;
>> +    /* CPUs in one socket can't break socket boundary */
>> +    bool have_socket_boundary;
> 
> This field belong to MachineClass, please add it as
> numa_have_socket_boundary just after to numa_mem_supported.
> 
> Next patches become:
> 
> ---
> diff --git a/hw/arm/sbsa-ref.c b/hw/arm/sbsa-ref.c
> index f778cb6d09..a48f1b2329 100644
> --- a/hw/arm/sbsa-ref.c
> +++ b/hw/arm/sbsa-ref.c
> @@ -864,6 +864,7 @@ static void sbsa_ref_class_init(ObjectClass *oc, void *data)
>       mc->possible_cpu_arch_ids = sbsa_ref_possible_cpu_arch_ids;
>       mc->cpu_index_to_instance_props = sbsa_ref_cpu_index_to_props;
>       mc->get_default_cpu_node_id = sbsa_ref_get_default_cpu_node_id;
> +    mc->numa_have_socket_boundary = true;
>   }
> ---
> 
> Otherwise LGTM :)
> 

Thanks, Philippe. It makes sense to have the field in MachineClass and the
changes will be included in next revision.
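
A rough sketch of how the reworked field and check might look (field name and
placement taken from your suggestion; only illustrative, the actual change will
be in the next revision):

  /* include/hw/boards.h */
  struct MachineClass {
      ...
      bool numa_mem_supported;
      /* Platform choice: CPUs in one socket must not span NUMA nodes */
      bool numa_have_socket_boundary;
      ...
  };

  /* hw/core/machine.c */
  static void machine_numa_finish_cpu_init(MachineState *machine)
  {
      MachineClass *mc = MACHINE_GET_CLASS(machine);
      ...
      if (mc->numa_have_socket_boundary) {
          numa_validate_socket_boundary(machine);
      }
      ...
  }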

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
  2023-02-23  8:13 [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines Gavin Shan
                   ` (3 preceding siblings ...)
  2023-02-23  8:14 ` [PATCH v2 4/4] hw/riscv: " Gavin Shan
@ 2023-02-23 12:25 ` Andrew Jones
  2023-02-24  7:20   ` Gavin Shan
  2023-02-23 12:57 ` Daniel P. Berrangé
  2023-02-23 13:18 ` Daniel Henrique Barboza
  6 siblings, 1 reply; 18+ messages in thread
From: Andrew Jones @ 2023-02-23 12:25 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-arm, qemu-devel, qemu-riscv, rad, peter.maydell,
	quic_llindhol, eduardo, marcel.apfelbaum, philmd, wangyanan55,
	palmer, alistair.francis, bin.meng, thuth, lvivier, pbonzini,
	imammedo, yihyu, shan.gavin

On Thu, Feb 23, 2023 at 04:13:57PM +0800, Gavin Shan wrote:
> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
> used to populate the CPU topology in the Linux guest. It's required that
> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
> scheduling domain can't be sorted out, as the following warning message
> indicates. To avoid the unexpected confusion, this series attempts to
> rejects such kind of insane configurations.

Hi Gavin,

I'm not sure 'insane' is the right word, maybe 'irregular'. I've never
seen, and don't really expect to find, specifications stating that arm
and/or riscv must have 1:1 mappings for numa nodes and sockets. But, as
QEMU machine types can choose certain configurations to support, if
everyone is in agreement to restrict sbsa-ref and arm/riscv virt to 1:1
mappings, then why not. However, I suggest stating that this is simply a
choice being made for these boards, since the expectation of needing
"irregular" configurations is low. We should not be trying to use Linux
assumptions as rationale, since other OSes may cope better with irregular
configurations.

Also, I suggest adding a comment to the boards which adopt this
restriction, explaining that it's only a platform choice, not
something specified by the architecture.
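
Purely as an illustration of the wording (not a concrete patch), something
like this next to the board hook added in patches 3 and 4, or wherever the
boundary ends up being enabled:

    /*
     * Platform choice for this board, not an architectural requirement:
     * CPUs within one socket are kept within one NUMA node.
     */
    set_numa_socket_boundary(MACHINE(obj));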

Thanks,
drew

> 
>    -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>    -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>    -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>    -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
>  
>    ------------[ cut here ]------------
>    WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2271 build_sched_domains+0x284/0x910
>    Modules linked in:
>    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-268.el9.aarch64 #1
>    pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>    pc : build_sched_domains+0x284/0x910
>    lr : build_sched_domains+0x184/0x910
>    sp : ffff80000804bd50
>    x29: ffff80000804bd50 x28: 0000000000000002 x27: 0000000000000000
>    x26: ffff800009cf9a80 x25: 0000000000000000 x24: ffff800009cbf840
>    x23: ffff000080325000 x22: ffff0000005df800 x21: ffff80000a4ce508
>    x20: 0000000000000000 x19: ffff000080324440 x18: 0000000000000014
>    x17: 00000000388925c0 x16: 000000005386a066 x15: 000000009c10cc2e
>    x14: 00000000000001c0 x13: 0000000000000001 x12: ffff00007fffb1a0
>    x11: ffff00007fffb180 x10: ffff80000a4ce508 x9 : 0000000000000041
>    x8 : ffff80000a4ce500 x7 : ffff80000a4cf920 x6 : 0000000000000001
>    x5 : 0000000000000001 x4 : 0000000000000007 x3 : 0000000000000002
>    x2 : 0000000000001000 x1 : ffff80000a4cf928 x0 : 0000000000000001
>    Call trace:
>     build_sched_domains+0x284/0x910
>     sched_init_domains+0xac/0xe0
>     sched_init_smp+0x48/0xc8
>     kernel_init_freeable+0x140/0x1ac
>     kernel_init+0x28/0x140
>     ret_from_fork+0x10/0x20
> 
> PATCH[1] Improves numa/test for aarch64 to follow socket-to-NUMA-node boundary
> PATCH[2] Validate the configuration if required in the NUMA subsystem
> PATCH[3] Enable the validation for aarch64 machines
> PATCH[4] Enable the validation for RiscV machines
> 
> v1: https://lists.nongnu.org/archive/html/qemu-arm/2023-02/msg00886.html
> 
> Changelog
> =========
> v2:
>   * Fix socket-NUMA-node boundary issues in qtests/numa-test  (Gavin)
>   * Add helper set_numa_socket_boundary() and validate the
>     boundary in the generic path                              (Philippe)
> 
> Gavin Shan (4):
>   qtest/numa-test: Follow socket-NUMA-node boundary for aarch64
>   numa: Validate socket and NUMA node boundary if required
>   hw/arm: Validate socket and NUMA node boundary
>   hw/riscv: Validate socket and NUMA node boundary
> 
>  hw/arm/sbsa-ref.c       |  2 ++
>  hw/arm/virt.c           |  2 ++
>  hw/core/machine.c       | 34 ++++++++++++++++++++++++++++++++++
>  hw/core/numa.c          |  7 +++++++
>  hw/riscv/spike.c        |  1 +
>  hw/riscv/virt.c         |  1 +
>  include/sysemu/numa.h   |  4 ++++
>  tests/qtest/numa-test.c | 13 ++++++++++---
>  8 files changed, 61 insertions(+), 3 deletions(-)
> 
> -- 
> 2.23.0
> 
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
  2023-02-23  8:13 [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines Gavin Shan
                   ` (4 preceding siblings ...)
  2023-02-23 12:25 ` [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines Andrew Jones
@ 2023-02-23 12:57 ` Daniel P. Berrangé
  2023-02-24  5:47   ` Gavin Shan
  2023-02-23 13:18 ` Daniel Henrique Barboza
  6 siblings, 1 reply; 18+ messages in thread
From: Daniel P. Berrangé @ 2023-02-23 12:57 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-arm, qemu-devel, qemu-riscv, rad, peter.maydell,
	quic_llindhol, eduardo, marcel.apfelbaum, philmd, wangyanan55,
	palmer, alistair.francis, bin.meng, thuth, lvivier, pbonzini,
	imammedo, yihyu, shan.gavin

On Thu, Feb 23, 2023 at 04:13:57PM +0800, Gavin Shan wrote:
> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
> used to populate the CPU topology in the Linux guest. It's required that
> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
> scheduling domain can't be sorted out, as the following warning message
> indicates. To avoid the unexpected confusion, this series attempts to
> rejects such kind of insane configurations.
> 
>    -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>    -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>    -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>    -numa node,nodeid=2,cpus=4-5,memdev=ram2                \

This is somewhat odd as a config, because core 2 is in socket 0
and core 3 is in socket 1, so it wouldn't make much conceptual
sense to have them in the same NUMA node, while other cores within
the same socket are in different NUMA nodes. Basically the split
of NUMA nodes is not aligned with any level in the topology.

This series, however, also rejects configurations that I would
normally consider to be reasonable. I've not tested linux kernel
behaviour though, but as a user I would expect to be able to
ask for:

    -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
    -numa node,nodeid=0,cpus=0,memdev=ram0                \
    -numa node,nodeid=1,cpus=1,memdev=ram1                \
    -numa node,nodeid=2,cpus=2,memdev=ram2                \
    -numa node,nodeid=3,cpus=3,memdev=ram3                \
    -numa node,nodeid=4,cpus=4,memdev=ram4                \
    -numa node,nodeid=5,cpus=5,memdev=ram5                \

ie, every core gets its own NUMA node

Or to ask for every cluster as a numa node:

    -smp 6,maxcpus=6,sockets=2,clusters=3,cores=1,threads=1 \
    -numa node,nodeid=0,cpus=0,memdev=ram0                \
    -numa node,nodeid=1,cpus=1,memdev=ram1                \
    -numa node,nodeid=2,cpus=2,memdev=ram2                \
    -numa node,nodeid=3,cpus=3,memdev=ram3                \
    -numa node,nodeid=4,cpus=4,memdev=ram4                \
    -numa node,nodeid=5,cpus=5,memdev=ram5                \

In both cases the NUMA split is aligned with a given level
in the topology, which was not the case with your example.

Rejecting these feels overly strict to me, and may risk breaking
existing valid deployments, unless we can demonstrate those
scenarios were unambiguously already broken ?

If there were something in the hardware specs that requires
this, then it would be more valid to do than if it is merely a
specific guest kernel limitation that might be fixed any day.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
  2023-02-23  8:13 [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines Gavin Shan
                   ` (5 preceding siblings ...)
  2023-02-23 12:57 ` Daniel P. Berrangé
@ 2023-02-23 13:18 ` Daniel Henrique Barboza
  2023-02-24  7:09   ` Gavin Shan
  6 siblings, 1 reply; 18+ messages in thread
From: Daniel Henrique Barboza @ 2023-02-23 13:18 UTC (permalink / raw)
  To: Gavin Shan, qemu-arm
  Cc: qemu-devel, qemu-riscv, rad, peter.maydell, quic_llindhol,
	eduardo, marcel.apfelbaum, philmd, wangyanan55, palmer,
	alistair.francis, bin.meng, thuth, lvivier, pbonzini, imammedo,
	yihyu, shan.gavin



On 2/23/23 05:13, Gavin Shan wrote:
> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
> used to populate the CPU topology in the Linux guest. It's required that
> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
> scheduling domain can't be sorted out, as the following warning message
> indicates. To avoid the unexpected confusion, this series attempts to
> rejects such kind of insane configurations.
> 
>     -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>     -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>     -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>     -numa node,nodeid=2,cpus=4-5,memdev=ram2                \


And why is this a QEMU problem? This doesn't hurt ACPI.

Also, this restriction breaks ARM guests in the wild that are running
non-Linux OSes. I don't see why we should impact use cases that have nothing to
do with the Linux kernel's feelings about socket/NUMA-node exclusivity.


Thanks,


Daniel


>   
>     ------------[ cut here ]------------
>     WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2271 build_sched_domains+0x284/0x910
>     Modules linked in:
>     CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-268.el9.aarch64 #1
>     pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>     pc : build_sched_domains+0x284/0x910
>     lr : build_sched_domains+0x184/0x910
>     sp : ffff80000804bd50
>     x29: ffff80000804bd50 x28: 0000000000000002 x27: 0000000000000000
>     x26: ffff800009cf9a80 x25: 0000000000000000 x24: ffff800009cbf840
>     x23: ffff000080325000 x22: ffff0000005df800 x21: ffff80000a4ce508
>     x20: 0000000000000000 x19: ffff000080324440 x18: 0000000000000014
>     x17: 00000000388925c0 x16: 000000005386a066 x15: 000000009c10cc2e
>     x14: 00000000000001c0 x13: 0000000000000001 x12: ffff00007fffb1a0
>     x11: ffff00007fffb180 x10: ffff80000a4ce508 x9 : 0000000000000041
>     x8 : ffff80000a4ce500 x7 : ffff80000a4cf920 x6 : 0000000000000001
>     x5 : 0000000000000001 x4 : 0000000000000007 x3 : 0000000000000002
>     x2 : 0000000000001000 x1 : ffff80000a4cf928 x0 : 0000000000000001
>     Call trace:
>      build_sched_domains+0x284/0x910
>      sched_init_domains+0xac/0xe0
>      sched_init_smp+0x48/0xc8
>      kernel_init_freeable+0x140/0x1ac
>      kernel_init+0x28/0x140
>      ret_from_fork+0x10/0x20
> 
> PATCH[1] Improves numa/test for aarch64 to follow socket-to-NUMA-node boundary
> PATCH[2] Validate the configuration if required in the NUMA subsystem
> PATCH[3] Enable the validation for aarch64 machines
> PATCH[4] Enable the validation for RiscV machines
> 
> v1: https://lists.nongnu.org/archive/html/qemu-arm/2023-02/msg00886.html
> 
> Changelog
> =========
> v2:
>    * Fix socket-NUMA-node boundary issues in qtests/numa-test  (Gavin)
>    * Add helper set_numa_socket_boundary() and validate the
>      boundary in the generic path                              (Philippe)
> 
> Gavin Shan (4):
>    qtest/numa-test: Follow socket-NUMA-node boundary for aarch64
>    numa: Validate socket and NUMA node boundary if required
>    hw/arm: Validate socket and NUMA node boundary
>    hw/riscv: Validate socket and NUMA node boundary
> 
>   hw/arm/sbsa-ref.c       |  2 ++
>   hw/arm/virt.c           |  2 ++
>   hw/core/machine.c       | 34 ++++++++++++++++++++++++++++++++++
>   hw/core/numa.c          |  7 +++++++
>   hw/riscv/spike.c        |  1 +
>   hw/riscv/virt.c         |  1 +
>   include/sysemu/numa.h   |  4 ++++
>   tests/qtest/numa-test.c | 13 ++++++++++---
>   8 files changed, 61 insertions(+), 3 deletions(-)
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
  2023-02-23 12:57 ` Daniel P. Berrangé
@ 2023-02-24  5:47   ` Gavin Shan
  0 siblings, 0 replies; 18+ messages in thread
From: Gavin Shan @ 2023-02-24  5:47 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-arm, qemu-devel, qemu-riscv, rad, peter.maydell,
	quic_llindhol, eduardo, marcel.apfelbaum, philmd, wangyanan55,
	palmer, alistair.francis, bin.meng, thuth, lvivier, pbonzini,
	imammedo, yihyu, shan.gavin

On 2/23/23 11:57 PM, Daniel P. Berrangé wrote:
> On Thu, Feb 23, 2023 at 04:13:57PM +0800, Gavin Shan wrote:
>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>> used to populate the CPU topology in the Linux guest. It's required that
>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>> scheduling domain can't be sorted out, as the following warning message
>> indicates. To avoid the unexpected confusion, this series attempts to
>> rejects such kind of insane configurations.
>>
>>     -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>>     -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>>     -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>>     -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
> 
> This is somewhat odd as a config, because core 2 is in socket 0
> and core 3 is in socket 1, so it wouldn't make much conceptual
> sense to have them in the same NUMA node, while other cores within
> the same socket are in different NUMA nodes. Basically the split
> of NUMA nodes is not aligned with any level in the topology.
> 
> This series, however, also rejects configurations that I would
> normally consider to be reasonable. I've not tested linux kernel
> behaviour though, but as a user I would expect to be able to
> ask for:
> 
>      -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>      -numa node,nodeid=0,cpus=0,memdev=ram0                \
>      -numa node,nodeid=1,cpus=1,memdev=ram1                \
>      -numa node,nodeid=2,cpus=2,memdev=ram2                \
>      -numa node,nodeid=3,cpus=3,memdev=ram3                \
>      -numa node,nodeid=4,cpus=4,memdev=ram4                \
>      -numa node,nodeid=5,cpus=5,memdev=ram5                \
> 
> ie, every core gets its own NUMA node
> 

It doesn't work for the Linux guest either. As the following warning message
indicates, the multicore domain isn't a subset of the DIE (cluster or socket)
domain. For example, the multicore domain is 0-2 while the DIE domain is 0 for
CPU-0.

[    0.023486] CPU-0: 36,56,0,-1 thread=0  core=0-2  cluster=0-2 llc=0    // parsed from ACPI PPTT
[    0.023490] CPU-1: 36,56,1,-1 thread=1  core=0-2  cluster=0-2 llc=1
[    0.023492] CPU-2: 36,56,2,-1 thread=2  core=0-2  cluster=0-2 llc=2
[    0.023494] CPU-3: 136,156,3,-1 thread=3  core=3-5  cluster=3-5 llc=3
[    0.023495] CPU-4: 136,156,4,-1 thread=4  core=3-5  cluster=3-5 llc=4
[    0.023497] CPU-5: 136,156,5,-1 thread=5  core=3-5  cluster=3-5 llc=5
[    0.023499] CPU-0: SMT=0  CLUSTER=0  MULTICORE=0-2  DIE=0  CPU-OF-NODE=0      // Seen by scheduling domain
[    0.023501] CPU-1: SMT=1  CLUSTER=1  MULTICORE=0-2  DIE=1  CPU-OF-NODE=1
[    0.023503] CPU-2: SMT=2  CLUSTER=2  MULTICORE=0-2  DIE=2  CPU-OF-NODE=2
[    0.023504] CPU-3: SMT=3  CLUSTER=3  MULTICORE=3-5  DIE=3  CPU-OF-NODE=3
[    0.023506] CPU-4: SMT=4  CLUSTER=4  MULTICORE=3-5  DIE=4  CPU-OF-NODE=4
[    0.023508] CPU-5: SMT=5  CLUSTER=5  MULTICORE=3-5  DIE=5  CPU-OF_NODE=5
         :
[    0.023555] BUG: arch topology borken
[    0.023556]      the MC domain not a subset of the DIE domain

NOTE that DIE and CPU-OF-NODE are the same since both are returned by
'cpumask_of_node(cpu_to_node(cpu))'.


> Or to aask for every cluster as a numa node:
> 
>      -smp 6,maxcpus=6,sockets=2,clusters=3,cores=1,threads=1 \
>      -numa node,nodeid=0,cpus=0,memdev=ram0                \
>      -numa node,nodeid=1,cpus=1,memdev=ram1                \
>      -numa node,nodeid=2,cpus=2,memdev=ram2                \
>      -numa node,nodeid=3,cpus=3,memdev=ram3                \
>      -numa node,nodeid=4,cpus=4,memdev=ram4                \
>      -numa node,nodeid=5,cpus=5,memdev=ram5                \
> 

This case works fine for the Linux guest.

[    0.024505] CPU-0: 36,56,0,-1 thread=0  core=0-2  cluster=0 llc=0            // parsed from ACPI PPTT
[    0.024509] CPU-1: 36,96,1,-1 thread=1  core=0-2  cluster=1 llc=1
[    0.024511] CPU-2: 36,136,2,-1 thread=2  core=0-2  cluster=2 llc=2
[    0.024512] CPU-3: 176,196,3,-1 thread=3  core=3-5  cluster=3 llc=3
[    0.024514] CPU-4: 176,236,4,-1 thread=4  core=3-5  cluster=4 llc=4
[    0.024515] CPU-5: 176,276,5,-1 thread=5  core=3-5  cluster=5 llc=5
[    0.024518] CPU-0: SMT=0  CLUSTER=0  MULTICORE=0  DIE=0  CPU-OF-NODE=0      // Seen by scheduling domain
[    0.024519] CPU-1: SMT=1  CLUSTER=1  MULTICORE=1  DIE=1  CPU-OF-NODE=1
[    0.024521] CPU-2: SMT=2  CLUSTER=2  MULTICORE=2  DIE=2  CPU-OF-NODE=2
[    0.024522] CPU-3: SMT=3  CLUSTER=3  MULTICORE=3  DIE=3  CPU-OF-NODE=3
[    0.024524] CPU-4: SMT=4  CLUSTER=4  MULTICORE=4  DIE=4  CPU-OF-NODE=4
[    0.024525] CPU-5: SMT=5  CLUSTER=5  MULTICORE=5  DIE=5  CPU-OF-NODE=5


> In both cases the NUMA split is aligned with a given level
> in the topology, which was not the case with your example.
> 
> Rejecting these feels overly strict to me, and may risk breaking
> existing valid deployments, unless we can demonstrate those
> scenarios were unambiguously already broken ?
> 
> If there was something in the hardware specs that requires
> this, then it is more valid to do, than if it is merely an
> specific guest kernel limitation that might be fixed any day.
> 

Yes, I agree that it's strict to have the socket-to-NUMA boundary. However,
it doesn't seem sensible to split the CPUs in one cluster across different NUMA
nodes, or to split the CPUs in one core across different NUMA nodes, in a
baremetal environment. I think we probably need to prevent these two cases,
meaning two clusters in one socket would still be allowed to be associated with
different NUMA nodes.

I failed to find accurate information about the relation among socket/cluster/core
in the specs. As I understand it, the CPUs in one core share the L2 cache and the
cores in one cluster share the L3 cache, while each thread has its own L1 cache.
The L3 cache usually corresponds to a NUMA node. I may be totally wrong here.

Thanks,
Gavin






^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
  2023-02-23 13:18 ` Daniel Henrique Barboza
@ 2023-02-24  7:09   ` Gavin Shan
  2023-02-24  9:26     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 18+ messages in thread
From: Gavin Shan @ 2023-02-24  7:09 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-arm
  Cc: qemu-devel, qemu-riscv, rad, peter.maydell, quic_llindhol,
	eduardo, marcel.apfelbaum, philmd, wangyanan55, palmer,
	alistair.francis, bin.meng, thuth, lvivier, pbonzini, imammedo,
	yihyu, shan.gavin

On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote:
> On 2/23/23 05:13, Gavin Shan wrote:
>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>> used to populate the CPU topology in the Linux guest. It's required that
>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>> scheduling domain can't be sorted out, as the following warning message
>> indicates. To avoid the unexpected confusion, this series attempts to
>> rejects such kind of insane configurations.
>>
>>     -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>>     -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>>     -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>>     -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
> 
> 
> And why is this a QEMU problem? This doesn't hurt ACPI.
> 
> Also, this restriction impacts breaks ARM guests in the wild that are running
> non-Linux OSes. I don't see why we should impact use cases that has nothing to
> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity.
> 

With the above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while
CPU-3/4/5 are put into socket-1-cluster-0, meaning CPU-2 and CPU-3 belong to
different sockets and clusters. However, CPU-2/3 are associated with NUMA node-1.
In summary, multiple CPUs in different clusters and sockets have been associated
with one NUMA node.

If I'm correct, the configuration isn't sensible in a baremetal environment, and
the same Linux kernel is supposed to work well for baremetal and virtualized
machines. So I think QEMU needs to emulate the topology as closely as we can to
match the baremetal environment. That's the reason why I think it's a QEMU problem
even though it doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé
<berrange@redhat.com> in another thread, we may need to guarantee that the CPUs in
one cluster can't be split across multiple NUMA nodes, which matches the baremetal
environment, as I understand it.

Right, the restriction to have a socket-NUMA-node or cluster-NUMA-node boundary
will definitely break configurations running in the wild.

Thanks,
Gavin

[...]



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
  2023-02-23 12:25 ` [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines Andrew Jones
@ 2023-02-24  7:20   ` Gavin Shan
  0 siblings, 0 replies; 18+ messages in thread
From: Gavin Shan @ 2023-02-24  7:20 UTC (permalink / raw)
  To: Andrew Jones
  Cc: qemu-arm, qemu-devel, qemu-riscv, rad, peter.maydell,
	quic_llindhol, eduardo, marcel.apfelbaum, philmd, wangyanan55,
	palmer, alistair.francis, bin.meng, thuth, lvivier, pbonzini,
	imammedo, yihyu, shan.gavin

Hi Drew,

On 2/23/23 11:25 PM, Andrew Jones wrote:
> On Thu, Feb 23, 2023 at 04:13:57PM +0800, Gavin Shan wrote:
>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>> used to populate the CPU topology in the Linux guest. It's required that
>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>> scheduling domain can't be sorted out, as the following warning message
>> indicates. To avoid the unexpected confusion, this series attempts to
>> rejects such kind of insane configurations.
> 
> I'm not sure 'insane' is the right word, maybe 'irregular'. I've never
> seen, and don't really expect to find, specifications stating that arm
> and/or riscv must have 1:1 mappings for numa nodes and sockets. But, as
> QEMU machine types can choose certain configurations to support, if
> everyone is in agreement to restrict sbsa-ref and arm/riscv virt to 1:1
> mappings, then why not. However, I suggest stating that this is simply a
> choice being made for these boards, since the expectation of needing
> "irregular" configurations is low. We should not be trying to use Linux
> assumptions as rationale, since other OSes may cope better with irregular
> configurations.
> 
> Also, I suggest adding a comment to the boards, which adopt this
> restriction, which explains that it's only a platform choice, not
> something specified by the architecture.
> 

[...]

Thanks for your comments. Yes, "irregular" is definitely a better word. I failed
to find a statement that ARM64 requires a 1:1 mapping between
socket and NUMA node, which makes it hard to tell if the restriction is required
at the architectural level. However, the Linux driver (drivers/base/arch_topology.c)
considers both the CPU-to-NUMA association and the CPU topology from the ACPI PPTT
when the CPU set in the core group is calculated. It means the CPU-to-NUMA
association can affect the CPU topology seen by the Linux scheduler.

With the restriction of a 1:1 mapping between socket and NUMA node, or between
cluster and NUMA node as I mentioned in the reply to Daniel P. Berrangé <berrange@redhat.com>,
existing configurations for other OSes can become invalid.

Yes, a comment mentioning that it's a platform choice, not an architectural
requirement, definitely helps to understand the context.

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
  2023-02-24  7:09   ` Gavin Shan
@ 2023-02-24  9:26     ` Daniel Henrique Barboza
  2023-02-24 10:16       ` Gavin Shan
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Henrique Barboza @ 2023-02-24  9:26 UTC (permalink / raw)
  To: Gavin Shan, qemu-arm
  Cc: qemu-devel, qemu-riscv, rad, peter.maydell, quic_llindhol,
	eduardo, marcel.apfelbaum, philmd, wangyanan55, palmer,
	alistair.francis, bin.meng, thuth, lvivier, pbonzini, imammedo,
	yihyu, shan.gavin



On 2/24/23 04:09, Gavin Shan wrote:
> On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote:
>> On 2/23/23 05:13, Gavin Shan wrote:
>>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>>> used to populate the CPU topology in the Linux guest. It's required that
>>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>>> scheduling domain can't be sorted out, as the following warning message
>>> indicates. To avoid the unexpected confusion, this series attempts to
>>> rejects such kind of insane configurations.
>>>
>>>     -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>>>     -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>>>     -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>>>     -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
>>
>>
>> And why is this a QEMU problem? This doesn't hurt ACPI.
>>
>> Also, this restriction impacts breaks ARM guests in the wild that are running
>> non-Linux OSes. I don't see why we should impact use cases that has nothing to
>> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity.
>>
> 
> With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5
> are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and
> cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple
> CPUs in different clusters and sockets have been associated with one NUMA node.
> 
> If I'm correct, the configuration isn't sensible in a baremetal environment and
> same Linux kernel is supposed to work well for baremetal and virtualized machine.
> So I think QEMU needs to emulate the topology as much as we can to match with the
> baremetal environment. It's the reason why I think it's a QEMU problem even it
> doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com>
> in another thread, we may need to gurantee that the CPUs in one cluster can't be
> split to multiple NUMA nodes, which matches with the baremetal environment, as I
> can understand.
> 
> Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will
> definitely break the configurations running in the wild.


What about a warning? If the user attempts to use an exotic NUMA configuration
like the one you mentioned, we can print something like:

"Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave"

This would inform the user what might be going wrong in case Linux is crashing/erroring
out on them, and then the user is free to fix their topology (or the kernel). At
the same time we wouldn't break existing stuff that happens to be working.
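
Roughly, the check in patch 2 could downgrade its error_report()/exit() to
warn_report(), e.g. (just a sketch reusing the names from that patch):

    if (cpus[i].props.socket_id == cpus[j].props.socket_id &&
        cpus[i].props.node_id != cpus[j].props.node_id) {
        warn_report("CPU-%d and CPU-%d in socket-%ld are associated with "
                    "node-%ld and node-%ld respectively; OSes like Linux "
                    "may misbehave", i, j, cpus[i].props.socket_id,
                    cpus[i].props.node_id, cpus[j].props.node_id);
    }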


Thanks,


Daniel


> 
> Thanks,
> Gavin
> 
> [...]
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
  2023-02-24  9:26     ` Daniel Henrique Barboza
@ 2023-02-24 10:16       ` Gavin Shan
  2023-02-24 10:39         ` Andrew Jones
  2023-02-24 14:20         ` Igor Mammedov
  0 siblings, 2 replies; 18+ messages in thread
From: Gavin Shan @ 2023-02-24 10:16 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-arm
  Cc: qemu-devel, qemu-riscv, rad, peter.maydell, quic_llindhol,
	eduardo, marcel.apfelbaum, philmd, wangyanan55, palmer,
	alistair.francis, bin.meng, thuth, lvivier, pbonzini, imammedo,
	yihyu, shan.gavin

On 2/24/23 8:26 PM, Daniel Henrique Barboza wrote:
> On 2/24/23 04:09, Gavin Shan wrote:
>> On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote:
>>> On 2/23/23 05:13, Gavin Shan wrote:
>>>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>>>> used to populate the CPU topology in the Linux guest. It's required that
>>>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>>>> scheduling domain can't be sorted out, as the following warning message
>>>> indicates. To avoid the unexpected confusion, this series attempts to
>>>> rejects such kind of insane configurations.
>>>>
>>>>     -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>>>>     -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>>>>     -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>>>>     -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
>>>
>>>
>>> And why is this a QEMU problem? This doesn't hurt ACPI.
>>>
>>> Also, this restriction impacts breaks ARM guests in the wild that are running
>>> non-Linux OSes. I don't see why we should impact use cases that has nothing to
>>> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity.
>>>
>>
>> With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5
>> are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and
>> cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple
>> CPUs in different clusters and sockets have been associated with one NUMA node.
>>
>> If I'm correct, the configuration isn't sensible in a baremetal environment and
>> same Linux kernel is supposed to work well for baremetal and virtualized machine.
>> So I think QEMU needs to emulate the topology as much as we can to match with the
>> baremetal environment. It's the reason why I think it's a QEMU problem even it
>> doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com>
>> in another thread, we may need to gurantee that the CPUs in one cluster can't be
>> split to multiple NUMA nodes, which matches with the baremetal environment, as I
>> can understand.
>>
>> Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will
>> definitely break the configurations running in the wild.
> 
> 
> What about a warning? If the user attempts to use an exotic NUMA configuration
> like the one you mentioned we can print something like:
> 
> "Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave"
> 
> This would inform the user what might be going wrong in case Linux is crashing/error
> out on them and then the user is free to fix their topology (or the kernel). And
> at the same time we wouldn't break existing stuff that happens to be working.
> 
> 

Yes, I think a warning message is more appropriate, so that users can fix their
irregular configurations while existing configurations keep working. It would be
nice to get agreement from Daniel P. Berrangé and Drew before I change the code
and post the next revision.

Thanks,
Gavin




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
  2023-02-24 10:16       ` Gavin Shan
@ 2023-02-24 10:39         ` Andrew Jones
  2023-02-24 14:20         ` Igor Mammedov
  1 sibling, 0 replies; 18+ messages in thread
From: Andrew Jones @ 2023-02-24 10:39 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Daniel Henrique Barboza, qemu-arm, qemu-devel, qemu-riscv, rad,
	peter.maydell, quic_llindhol, eduardo, marcel.apfelbaum, philmd,
	wangyanan55, palmer, alistair.francis, bin.meng, thuth, lvivier,
	pbonzini, imammedo, yihyu, shan.gavin

On Fri, Feb 24, 2023 at 09:16:39PM +1100, Gavin Shan wrote:
> On 2/24/23 8:26 PM, Daniel Henrique Barboza wrote:
> > On 2/24/23 04:09, Gavin Shan wrote:
> > > On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote:
> > > > On 2/23/23 05:13, Gavin Shan wrote:
> > > > > For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
> > > > > used to populate the CPU topology in the Linux guest. It's required that
> > > > > the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
> > > > > scheduling domain can't be sorted out, as the following warning message
> > > > > indicates. To avoid the unexpected confusion, this series attempts to
> > > > > rejects such kind of insane configurations.
> > > > > 
> > > > >     -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
> > > > >     -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
> > > > >     -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
> > > > >     -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
> > > > 
> > > > 
> > > > And why is this a QEMU problem? This doesn't hurt ACPI.
> > > > 
> > > > Also, this restriction impacts breaks ARM guests in the wild that are running
> > > > non-Linux OSes. I don't see why we should impact use cases that has nothing to
> > > > do with Linux Kernel feelings about sockets - NUMA nodes exclusivity.
> > > > 
> > > 
> > > With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5
> > > are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and
> > > cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple
> > > CPUs in different clusters and sockets have been associated with one NUMA node.
> > > 
> > > If I'm correct, the configuration isn't sensible in a baremetal environment and
> > > same Linux kernel is supposed to work well for baremetal and virtualized machine.
> > > So I think QEMU needs to emulate the topology as much as we can to match with the
> > > baremetal environment. It's the reason why I think it's a QEMU problem even it
> > > doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com>
> > > in another thread, we may need to gurantee that the CPUs in one cluster can't be
> > > split to multiple NUMA nodes, which matches with the baremetal environment, as I
> > > can understand.
> > > 
> > > Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will
> > > definitely break the configurations running in the wild.
> > 
> > 
> > What about a warning? If the user attempts to use an exotic NUMA configuration
> > like the one you mentioned we can print something like:
> > 
> > "Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave"
> > 
> > This would inform the user what might be going wrong in case Linux is crashing/error
> > out on them and then the user is free to fix their topology (or the kernel). And
> > at the same time we wouldn't break existing stuff that happens to be working.
> > 
> > 
> 
> Yes, I think a warning message is more appropriate, so that users can fix their
> irregular configurations and the existing configurations aren't disconnected.
> It would be nice to get the agreements from Daniel P. Berrangé and Drew, before
> I'm going to change the code and post next revision.
>

If there's a concern that this will break non-Linux OSes on arm, then, at
most, the change needs to be tied to the next machine type version, and
possibly it can never be made for arm at all. riscv is OK, since it currently
ignores smp parameters anyway: it derives the number of sockets from the
number of NUMA nodes, using a 1:1 mapping. When smp parameters are eventually
implemented for riscv, this can be revisited.
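
If we did go the machine-type route, the usual pattern, as far as I know, is a
MachineClass flag that the older versioned machine options clear. Purely a
sketch, assuming a new 'numa_boundary_check' bool were added to MachineClass;
the field name and the option functions below are illustrative:

    /* Hypothetical MachineClass flag: the newest machine type enables the
     * check, older versioned types keep today's behaviour. */
    static void virt_machine_8_0_options(MachineClass *mc)
    {
        mc->numa_boundary_check = true;   /* assumed new field */
    }

    static void virt_machine_7_2_options(MachineClass *mc)
    {
        virt_machine_8_0_options(mc);
        mc->numa_boundary_check = false;  /* existing 7.2 machines unchanged */
        /* plus the usual compat properties for 7.2, elided here */
    }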

Also, it sounds like not only has the rationale for this series changed to
"platform choice", but also that it's the cluster <-> numa node mapping that
should be 1:1, not the socket <-> numa node mapping. If that's the case, then
the series probably needs to be reworked accordingly.

Thanks,
drew


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
  2023-02-24 10:16       ` Gavin Shan
  2023-02-24 10:39         ` Andrew Jones
@ 2023-02-24 14:20         ` Igor Mammedov
  2023-02-25  0:05           ` Gavin Shan
  1 sibling, 1 reply; 18+ messages in thread
From: Igor Mammedov @ 2023-02-24 14:20 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Daniel Henrique Barboza, qemu-arm, qemu-devel, qemu-riscv, rad,
	peter.maydell, quic_llindhol, eduardo, marcel.apfelbaum, philmd,
	wangyanan55, palmer, alistair.francis, bin.meng, thuth, lvivier,
	pbonzini, yihyu, shan.gavin, Michal Prívozník

On Fri, 24 Feb 2023 21:16:39 +1100
Gavin Shan <gshan@redhat.com> wrote:

> On 2/24/23 8:26 PM, Daniel Henrique Barboza wrote:
> > On 2/24/23 04:09, Gavin Shan wrote:  
> >> On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote:  
> >>> On 2/23/23 05:13, Gavin Shan wrote:  
> >>>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
> >>>> used to populate the CPU topology in the Linux guest. It's required that
> >>>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
> >>>> scheduling domain can't be sorted out, as the following warning message
> >>>> indicates. To avoid the unexpected confusion, this series attempts to
> >>>> rejects such kind of insane configurations.
> >>>>
> >>>>     -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
> >>>>     -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
> >>>>     -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
> >>>>     -numa node,nodeid=2,cpus=4-5,memdev=ram2                \  
> >>>
> >>>
> >>> And why is this a QEMU problem? This doesn't hurt ACPI.
> >>>
> >>> Also, this restriction impacts breaks ARM guests in the wild that are running
> >>> non-Linux OSes. I don't see why we should impact use cases that has nothing to
> >>> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity.
> >>>  
> >>
> >> With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5
> >> are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and
> >> cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple
> >> CPUs in different clusters and sockets have been associated with one NUMA node.
> >>
> >> If I'm correct, the configuration isn't sensible in a baremetal environment and
> >> same Linux kernel is supposed to work well for baremetal and virtualized machine.
> >> So I think QEMU needs to emulate the topology as much as we can to match with the
> >> baremetal environment. It's the reason why I think it's a QEMU problem even it
> >> doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com>
> >> in another thread, we may need to gurantee that the CPUs in one cluster can't be
> >> split to multiple NUMA nodes, which matches with the baremetal environment, as I
> >> can understand.
> >>
> >> Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will
> >> definitely break the configurations running in the wild.  
> > 
> > 
> > What about a warning? If the user attempts to use an exotic NUMA configuration
> > like the one you mentioned we can print something like:
> > 
> > "Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave"
> > 
> > This would inform the user what might be going wrong in case Linux is crashing/error
> > out on them and then the user is free to fix their topology (or the kernel). And
> > at the same time we wouldn't break existing stuff that happens to be working.
> > 
> >   
> 
> Yes, I think a warning message is more appropriate, so that users can fix their
> irregular configurations and the existing configurations aren't disconnected.
> It would be nice to get the agreements from Daniel P. Berrangé and Drew, before
> I'm going to change the code and post next revision.

Honestly you (and libvirt as far as I recall) are using legacy options
to assign cpus to numa nodes.
With '-numa node,nodeid=0,cpus=0-1' you can't really be sure where in the
topology those cpus end up.
What you can do is use the '-numa cpu,...' option to assign a socket/core/...
to a numa node, e.g.:
        -numa cpu,node-id=1,socket-id=0                           or
        -numa cpu,node-id=0,socket-id=1,core-id=0                 or
        -numa cpu,node-id=0,socket-id=1,core-id=1,thread-id=0
to get your desired mapping.

The problem that has so far been stopping its adoption by libvirt (Michal CCed)
is that the values it takes are machine specific. To do it properly for a
concrete '-M x -smp ...', QEMU should, at least the first time, be started with
the -preconfig option, and the user should then query the possible cpus for
those values and assign them to numa nodes via QMP.

btw: on x86 we also allow 'insane' configurations, including ones that do not
exist on baremetal, and we don't warn anyone about it (i.e. it's the user's
responsibility to provide a topology that the specific guest OS can handle,
aka it's not QEMU's business but the upper layers'). (I do occasionally try to
introduce stricter checks in that area, but that gets pushed back more often
than not.)

I'd do the check only in the case of a specific board where the mapping is
fixed by the specs of the emulated machine; otherwise I wouldn't bother.

> Thanks,
> Gavin
> 
> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
  2023-02-24 14:20         ` Igor Mammedov
@ 2023-02-25  0:05           ` Gavin Shan
  0 siblings, 0 replies; 18+ messages in thread
From: Gavin Shan @ 2023-02-25  0:05 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Daniel Henrique Barboza, qemu-arm, qemu-devel, qemu-riscv, rad,
	peter.maydell, quic_llindhol, eduardo, marcel.apfelbaum, philmd,
	wangyanan55, palmer, alistair.francis, bin.meng, thuth, lvivier,
	pbonzini, yihyu, shan.gavin, Michal Prívozník

On 2/25/23 1:20 AM, Igor Mammedov wrote:
> On Fri, 24 Feb 2023 21:16:39 +1100
> Gavin Shan <gshan@redhat.com> wrote:
> 
>> On 2/24/23 8:26 PM, Daniel Henrique Barboza wrote:
>>> On 2/24/23 04:09, Gavin Shan wrote:
>>>> On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote:
>>>>> On 2/23/23 05:13, Gavin Shan wrote:
>>>>>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>>>>>> used to populate the CPU topology in the Linux guest. It's required that
>>>>>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>>>>>> scheduling domain can't be sorted out, as the following warning message
>>>>>> indicates. To avoid the unexpected confusion, this series attempts to
>>>>>> rejects such kind of insane configurations.
>>>>>>
>>>>>>      -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>>>>>>      -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>>>>>>      -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>>>>>>      -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
>>>>>
>>>>>
>>>>> And why is this a QEMU problem? This doesn't hurt ACPI.
>>>>>
>>>>> Also, this restriction impacts breaks ARM guests in the wild that are running
>>>>> non-Linux OSes. I don't see why we should impact use cases that has nothing to
>>>>> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity.
>>>>>   
>>>>
>>>> With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5
>>>> are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and
>>>> cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple
>>>> CPUs in different clusters and sockets have been associated with one NUMA node.
>>>>
>>>> If I'm correct, the configuration isn't sensible in a baremetal environment and
>>>> same Linux kernel is supposed to work well for baremetal and virtualized machine.
>>>> So I think QEMU needs to emulate the topology as much as we can to match with the
>>>> baremetal environment. It's the reason why I think it's a QEMU problem even it
>>>> doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com>
>>>> in another thread, we may need to gurantee that the CPUs in one cluster can't be
>>>> split to multiple NUMA nodes, which matches with the baremetal environment, as I
>>>> can understand.
>>>>
>>>> Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will
>>>> definitely break the configurations running in the wild.
>>>
>>>
>>> What about a warning? If the user attempts to use an exotic NUMA configuration
>>> like the one you mentioned we can print something like:
>>>
>>> "Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave"
>>>
>>> This would inform the user what might be going wrong in case Linux is crashing/error
>>> out on them and then the user is free to fix their topology (or the kernel). And
>>> at the same time we wouldn't break existing stuff that happens to be working.
>>>
>>>    
>>
>> Yes, I think a warning message is more appropriate, so that users can fix their
>> irregular configurations and the existing configurations aren't disconnected.
>> It would be nice to get the agreements from Daniel P. Berrangé and Drew, before
>> I'm going to change the code and post next revision.
> 
> Honestly you (and libvirt as far as I recall) are using legacy options
> to assign cpus to numa nodes.
> With '-numa node,nodeid=0,cpus=0-1' you can't really be sure what/where
> in topology those cpus are.
> What you can do is to use '-numa cpu,...' option to assign socket/core/...
> to numa node, ex:
>          "-numa cpu,node-id=1,socket-id=0 "           or
>          "-numa cpu,node-id=0,socket-id=1,core-id=0 "       or
>          "-numa cpu,node-id=0,socket-id=1,core-id=1,thread-id=0
> to get your desired mapping.
> 
> The problem that's so far was stopping the later adoption by libvirt (Michal CCed)
> is that values used by it are machine specific and to do it properly, for a concrete
> '-M x -smp ...' at least for the first time qemu should be started with-
>   -preconfig option and then user should query possible cpus for those values
> and assign them to numa nodes via QMP.
> 

The issue isn't related to whether the legacy or the modern way is used to
configure the CPU-to-NUMA association. Note that qtests/numa-test also uses the
legacy way. For example, the issue can be triggered with the following command
line, where the modern configuration is used:

   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64            \
   -accel kvm -machine virt,gic-version=host                          \
   -cpu host -smp 6,sockets=2,clusters=1,cores=3,threads=1            \
   -m 768M,slots=16,maxmem=64G                                        \
   -object memory-backend-ram,id=mem0,size=256M                       \
   -object memory-backend-ram,id=mem1,size=256M                       \
   -object memory-backend-ram,id=mem2,size=256M                       \
   -numa node,nodeid=0,memdev=mem0                                    \
   -numa node,nodeid=1,memdev=mem1                                    \
   -numa node,nodeid=2,memdev=mem2                                    \
   -numa cpu,node-id=0,socket-id=0,cluster-id=0,core-id=0,thread-id=0 \
   -numa cpu,node-id=0,socket-id=0,cluster-id=0,core-id=1,thread-id=0 \
   -numa cpu,node-id=1,socket-id=0,cluster-id=0,core-id=2,thread-id=0 \
   -numa cpu,node-id=1,socket-id=1,cluster-id=0,core-id=0,thread-id=0 \
   -numa cpu,node-id=2,socket-id=1,cluster-id=0,core-id=1,thread-id=0 \
   -numa cpu,node-id=2,socket-id=1,cluster-id=0,core-id=2,thread-id=0
     :
     :
   [    0.022260] =============== sched_init_domains ===================
   [    0.022263] CPU-0: 36,56,0,-1 thread=0  core=0-2  cluster=0-2 llc=0
   [    0.022267] CPU-1: 36,56,1,-1 thread=1  core=0-2  cluster=0-2 llc=1
   [    0.022269] CPU-2: 36,56,2,-1 thread=2  core=0-2  cluster=0-2 llc=2
   [    0.022270] CPU-3: 136,156,3,-1 thread=3  core=3-5  cluster=3-5 llc=3
   [    0.022272] CPU-4: 136,156,4,-1 thread=4  core=3-5  cluster=3-5 llc=4
   [    0.022273] CPU-5: 136,156,5,-1 thread=5  core=3-5  cluster=3-5 llc=5
   [    0.022275] CPU-0: SMT=0  CLUSTER=0  MULTICORE=0-2  CPUMASK=0-1  0-1
   [    0.022277] CPU-1: SMT=1  CLUSTER=1  MULTICORE=0-2  CPUMASK=0-1  0-1
   [    0.022279] CPU-2: SMT=2  CLUSTER=0-2  MULTICORE=2-3  CPUMASK=2-3  2-3
   [    0.022281] CPU-3: SMT=3  CLUSTER=3-5  MULTICORE=2-3  CPUMASK=2-3  2-3
   [    0.022283] CPU-4: SMT=4  CLUSTER=4  MULTICORE=3-5  CPUMASK=4-5  4-5
   [    0.022284] CPU-5: SMT=5  CLUSTER=5  MULTICORE=3-5  CPUMASK=4-5  4-5
   [    0.022314] build_sched_domains: CPU-0 level-SMT
   [    0.022317] build_sched_domains: CPU-0 level-CLS
   [    0.022318] topology_span_sane: CPU-0 0 vs CPU-2 0-2
   [    0.022322] ------------[ cut here ]------------
   [    0.022323] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2275 build_sched_domains+0x135c/0x1660
      :

> btw: on x86 we also allow 'insane' configuration incl. those that do not
> exist on baremetal and do not warn about it anyone (i.e. it's user's
> responsibility to provide topology that specific guest OS could handle,
> aka it's not QEMU business but upper layers). (I do occasionally try
> introduce stricter checks in that are, but that gets push back more
> often that not).
> 
> I'd do check only in case of a specific board where mapping is fixed
> in specs of emulated machine, otherwise I wouldn't bother.
> 

Right, x86 handles the irregular configuration well and we didn't see any
warning messages triggered there. As the subject indicates, it's an ARM64- or
riscv-specific issue. Unfortunately, I failed to find the socket-to-NUMA-node
or cluster-to-NUMA-node 1:1 mapping requirement in the specs. However, it's not
sensible to split the CPUs in one cluster across multiple NUMA nodes in an
(ARM64 or RISCv) baremetal environment. QEMU should emulate an environment as
close to the baremetal machine as we can.

In summary, a warning message when the CPUs in one cluster are split across
multiple NUMA nodes, as Daniel suggested, sounds reasonable to me. Please let me
know your thoughts, Igor :)

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-02-25  0:06 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-23  8:13 [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines Gavin Shan
2023-02-23  8:13 ` [PATCH v2 1/4] qtest/numa-test: Follow socket-NUMA-node boundary for aarch64 Gavin Shan
2023-02-23  8:13 ` [PATCH v2 2/4] numa: Validate socket and NUMA node boundary if required Gavin Shan
2023-02-23  9:05   ` Philippe Mathieu-Daudé
2023-02-23 10:27     ` Gavin Shan
2023-02-23  8:14 ` [PATCH v2 3/4] hw/arm: Validate socket and NUMA node boundary Gavin Shan
2023-02-23  8:14 ` [PATCH v2 4/4] hw/riscv: " Gavin Shan
2023-02-23 12:25 ` [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines Andrew Jones
2023-02-24  7:20   ` Gavin Shan
2023-02-23 12:57 ` Daniel P. Berrangé
2023-02-24  5:47   ` Gavin Shan
2023-02-23 13:18 ` Daniel Henrique Barboza
2023-02-24  7:09   ` Gavin Shan
2023-02-24  9:26     ` Daniel Henrique Barboza
2023-02-24 10:16       ` Gavin Shan
2023-02-24 10:39         ` Andrew Jones
2023-02-24 14:20         ` Igor Mammedov
2023-02-25  0:05           ` Gavin Shan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).