From: Paolo Bonzini <pbonzini@redhat.com>
To: qemu-devel@nongnu.org
Cc: "Gavin Shan" <gshan@redhat.com>,
"Philippe Mathieu-Daudé" <philmd@linaro.org>,
"Igor Mammedov" <imammedo@redhat.com>
Subject: [PULL 03/18] numa: Validate cluster and NUMA node boundary if required
Date: Mon, 26 Jun 2023 13:14:30 +0200
Message-ID: <20230626111445.163573-4-pbonzini@redhat.com>
In-Reply-To: <20230626111445.163573-1-pbonzini@redhat.com>
From: Gavin Shan <gshan@redhat.com>

For some architectures like ARM64, multiple CPUs in one cluster can be
associated with different NUMA nodes. This is an irregular configuration
because it does not occur in a bare-metal environment. The irregular
configuration causes a Linux guest to misbehave, as the following warning
messages indicate.
-smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
-numa node,nodeid=0,cpus=0-1,memdev=ram0 \
-numa node,nodeid=1,cpus=2-3,memdev=ram1 \
-numa node,nodeid=2,cpus=4-5,memdev=ram2 \
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2271 build_sched_domains+0x284/0x910
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-268.el9.aarch64 #1
pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : build_sched_domains+0x284/0x910
lr : build_sched_domains+0x184/0x910
sp : ffff80000804bd50
x29: ffff80000804bd50 x28: 0000000000000002 x27: 0000000000000000
x26: ffff800009cf9a80 x25: 0000000000000000 x24: ffff800009cbf840
x23: ffff000080325000 x22: ffff0000005df800 x21: ffff80000a4ce508
x20: 0000000000000000 x19: ffff000080324440 x18: 0000000000000014
x17: 00000000388925c0 x16: 000000005386a066 x15: 000000009c10cc2e
x14: 00000000000001c0 x13: 0000000000000001 x12: ffff00007fffb1a0
x11: ffff00007fffb180 x10: ffff80000a4ce508 x9 : 0000000000000041
x8 : ffff80000a4ce500 x7 : ffff80000a4cf920 x6 : 0000000000000001
x5 : 0000000000000001 x4 : 0000000000000007 x3 : 0000000000000002
x2 : 0000000000001000 x1 : ffff80000a4cf928 x0 : 0000000000000001
Call trace:
build_sched_domains+0x284/0x910
sched_init_domains+0xac/0xe0
sched_init_smp+0x48/0xc8
kernel_init_freeable+0x140/0x1ac
kernel_init+0x28/0x140
ret_from_fork+0x10/0x20
Improve the situation by warning when multiple CPUs in one cluster have
been associated with different NUMA nodes. However, one NUMA node is
still allowed to be associated with different clusters.
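
The rule above can be illustrated outside of QEMU with a minimal,
standalone sketch. It is not the code from this patch: CpuTopo,
crosses_numa_boundary(), count_cluster_numa_conflicts() and the two
sample tables are hypothetical names introduced for illustration. The
first table models the failing command line above and assumes that CPU
indexes are assigned to sockets linearly (CPUs 0-2 in socket 0, CPUs
3-5 in socket 1).

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for the per-CPU topology properties. */
typedef struct {
    int64_t socket_id;
    int64_t cluster_id;
    int64_t node_id;
} CpuTopo;

/* True when two CPUs share a socket and cluster but sit on different nodes. */
static bool crosses_numa_boundary(const CpuTopo *a, const CpuTopo *b)
{
    return a->socket_id == b->socket_id &&
           a->cluster_id == b->cluster_id &&
           a->node_id != b->node_id;
}

/* Visit every unordered CPU pair once; report and count conflicting pairs. */
static size_t count_cluster_numa_conflicts(const CpuTopo *cpus, size_t len)
{
    size_t conflicts = 0;

    for (size_t i = 0; i < len; i++) {
        for (size_t j = i + 1; j < len; j++) {
            if (crosses_numa_boundary(&cpus[i], &cpus[j])) {
                fprintf(stderr,
                        "CPU-%zu and CPU-%zu in socket-%" PRId64
                        "-cluster-%" PRId64 " are on node-%" PRId64
                        " and node-%" PRId64 "\n",
                        i, j, cpus[i].socket_id, cpus[i].cluster_id,
                        cpus[i].node_id, cpus[j].node_id);
                conflicts++;
            }
        }
    }
    return conflicts;
}

/* Assumed layout for: sockets=2,clusters=1,cores=3 with nodes 0-1, 2-3, 4-5. */
static const CpuTopo example_cpus[] = {
    { 0, 0, 0 }, { 0, 0, 0 }, { 0, 0, 1 },
    { 1, 0, 1 }, { 1, 0, 2 }, { 1, 0, 2 },
};

/* Same topology, but with one NUMA node per cluster: no boundary is crossed. */
static const CpuTopo aligned_cpus[] = {
    { 0, 0, 0 }, { 0, 0, 0 }, { 0, 0, 0 },
    { 1, 0, 1 }, { 1, 0, 1 }, { 1, 0, 1 },
};
```

Under these assumptions, count_cluster_numa_conflicts(example_cpus, 6)
finds four conflicting pairs (CPU 0/2, 1/2, 3/4 and 3/5), while the
aligned table yields none. CPUs 2 and 3 share node 1 but live in
different sockets, which the rule permits.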
Signed-off-by: Gavin Shan <gshan@redhat.com>
Acked-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Acked-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20230509002739.18388-2-gshan@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
hw/core/machine.c | 42 ++++++++++++++++++++++++++++++++++++++++++
include/hw/boards.h | 1 +
2 files changed, 43 insertions(+)
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 1000406211f..46f8f9a2b04 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -1262,6 +1262,45 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
     g_string_free(s, true);
 }
 
+static void validate_cpu_cluster_to_numa_boundary(MachineState *ms)
+{
+    MachineClass *mc = MACHINE_GET_CLASS(ms);
+    NumaState *state = ms->numa_state;
+    const CPUArchIdList *possible_cpus = mc->possible_cpu_arch_ids(ms);
+    const CPUArchId *cpus = possible_cpus->cpus;
+    int i, j;
+
+    if (state->num_nodes <= 1 || possible_cpus->len <= 1) {
+        return;
+    }
+
+    /*
+     * The Linux scheduling domain can't be parsed when the multiple CPUs
+     * in one cluster have been associated with different NUMA nodes. However,
+     * it's fine to associate one NUMA node with CPUs in different clusters.
+     */
+    for (i = 0; i < possible_cpus->len; i++) {
+        for (j = i + 1; j < possible_cpus->len; j++) {
+            if (cpus[i].props.has_socket_id &&
+                cpus[i].props.has_cluster_id &&
+                cpus[i].props.has_node_id &&
+                cpus[j].props.has_socket_id &&
+                cpus[j].props.has_cluster_id &&
+                cpus[j].props.has_node_id &&
+                cpus[i].props.socket_id == cpus[j].props.socket_id &&
+                cpus[i].props.cluster_id == cpus[j].props.cluster_id &&
+                cpus[i].props.node_id != cpus[j].props.node_id) {
+                warn_report("CPU-%d and CPU-%d in socket-%" PRId64 "-cluster-%" PRId64
+                            " have been associated with node-%" PRId64 " and node-%" PRId64
+                            " respectively. It can cause OSes like Linux to"
+                            " misbehave", i, j, cpus[i].props.socket_id,
+                            cpus[i].props.cluster_id, cpus[i].props.node_id,
+                            cpus[j].props.node_id);
+            }
+        }
+    }
+}
+
 MemoryRegion *machine_consume_memdev(MachineState *machine,
                                      HostMemoryBackend *backend)
 {
@@ -1355,6 +1394,9 @@ void machine_run_board_init(MachineState *machine, const char *mem_path, Error *
         numa_complete_configuration(machine);
         if (machine->numa_state->num_nodes) {
             machine_numa_finish_cpu_init(machine);
+            if (machine_class->cpu_cluster_has_numa_boundary) {
+                validate_cpu_cluster_to_numa_boundary(machine);
+            }
         }
     }
 
diff --git a/include/hw/boards.h b/include/hw/boards.h
index a385010909d..6b267c21ce7 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -274,6 +274,7 @@ struct MachineClass {
     bool nvdimm_supported;
     bool numa_mem_supported;
     bool auto_enable_numa;
+    bool cpu_cluster_has_numa_boundary;
     SMPCompatProps smp_props;
     const char *default_ram_id;
 
--
2.41.0
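
The validation in this patch is opt-in: it runs only for machine classes
that set cpu_cluster_has_numa_boundary, and only when NUMA nodes are
configured. The gating can be sketched in isolation as follows. All
Sketch* types and sketch_* functions are hypothetical stand-ins for
illustration, not QEMU APIs, and the validator body is reduced to a
counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the class-level opt-in flag. */
typedef struct {
    bool cpu_cluster_has_numa_boundary;
} SketchMachineClass;

/* Hypothetical stand-in for the per-machine state the gate looks at. */
typedef struct {
    const SketchMachineClass *klass;
    size_t num_nodes;
    size_t num_cpus;
    unsigned validations_run; /* incremented when the pairwise scan would run */
} SketchMachine;

/* Mirrors the early return: nothing to compare with <= 1 node or <= 1 CPU. */
static void sketch_validate(SketchMachine *m)
{
    if (m->num_nodes <= 1 || m->num_cpus <= 1) {
        return;
    }
    m->validations_run++;
}

/* Mirrors the call site: validate only with NUMA configured and opt-in set. */
static void sketch_board_init(SketchMachine *m)
{
    if (m->num_nodes) {
        if (m->klass->cpu_cluster_has_numa_boundary) {
            sketch_validate(m);
        }
    }
}
```

A machine class that leaves the flag at its default of false is never
scanned, so boards that do not opt in keep their existing behavior.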