From: Gavin Shan <gshan@redhat.com>
To: Igor Mammedov <imammedo@redhat.com>
Cc: peter.maydell@linaro.org, drjones@redhat.com,
	richard.henderson@linaro.org, qemu-devel@nongnu.org,
	"wangyanan \(Y\)" <wangyanan55@huawei.com>,
	qemu-arm@nongnu.org, shan.gavin@gmail.com
Subject: Re: [PATCH] hw/arm/virt: Fix CPU's default NUMA node ID
Date: Mon, 28 Feb 2022 12:26:53 +0800	[thread overview]
Message-ID: <6a520e47-e3ad-13a1-89f1-8d4213fb944a@redhat.com> (raw)
In-Reply-To: <20220225110306.6c3e97a1@redhat.com>
Hi Igor,
On 2/25/22 6:03 PM, Igor Mammedov wrote:
> On Fri, 25 Feb 2022 16:41:43 +0800
> Gavin Shan <gshan@redhat.com> wrote:
>> On 2/17/22 10:14 AM, Gavin Shan wrote:
>>> On 1/26/22 5:14 PM, Igor Mammedov wrote:
>>>> On Wed, 26 Jan 2022 13:24:10 +0800
>>>> Gavin Shan <gshan@redhat.com> wrote:
>>>>   
>>>>> The default CPU-to-NUMA association is given by mc->get_default_cpu_node_id()
>>>>> when it isn't provided explicitly. However, the CPU topology isn't fully
>>>>> considered in the default association and it causes CPU topology broken
>>>>> warnings on booting Linux guest.
>>>>>
>>>>> For example, the following warning messages are observed when the Linux guest
>>>>> is booted with the following command lines.
>>>>>
>>>>>     /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>>>>>     -accel kvm -machine virt,gic-version=host               \
>>>>>     -cpu host                                               \
>>>>>     -smp 6,sockets=2,cores=3,threads=1                      \
>>>>>     -m 1024M,slots=16,maxmem=64G                            \
>>>>>     -object memory-backend-ram,id=mem0,size=128M            \
>>>>>     -object memory-backend-ram,id=mem1,size=128M            \
>>>>>     -object memory-backend-ram,id=mem2,size=128M            \
>>>>>     -object memory-backend-ram,id=mem3,size=128M            \
>>>>>     -object memory-backend-ram,id=mem4,size=128M            \
>>>>>     -object memory-backend-ram,id=mem4,size=384M            \
>>>>>     -numa node,nodeid=0,memdev=mem0                         \
>>>>>     -numa node,nodeid=1,memdev=mem1                         \
>>>>>     -numa node,nodeid=2,memdev=mem2                         \
>>>>>     -numa node,nodeid=3,memdev=mem3                         \
>>>>>     -numa node,nodeid=4,memdev=mem4                         \
>>>>>     -numa node,nodeid=5,memdev=mem5
>>>>>            :
>>>>>     alternatives: patching kernel code
>>>>>     BUG: arch topology borken
>>>>>     the CLS domain not a subset of the MC domain
>>>>>     <the above error log repeats>
>>>>>     BUG: arch topology borken
>>>>>     the DIE domain not a subset of the NODE domain
>>>>>
>>>>> With current implementation of mc->get_default_cpu_node_id(), CPU#0 to CPU#5
>>>>> are associated with NODE#0 to NODE#5 separately. That's incorrect because
>>>>> CPU#0/1/2 should be associated with same NUMA node because they're seated
>>>>> in same socket.
>>>>>
>>>>> This fixes the issue by considering the socket when default CPU-to-NUMA
>>>>> is given. With this applied, no more CPU topology broken warnings are seen
>>>>> from the Linux guest. The 6 CPUs are associated with NODE#0/1, but there are
>>>>> no CPUs associated with NODE#2/3/4/5.
>>>>   
>>>>>  From migration point of view it looks fine to me, and doesn't need a compat knob
>>>> since NUMA data (on virt-arm) only used to construct ACPI tables (and we don't
>>>> version those unless something is broken by it).
>>>>
>>>>   
>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>>> ---
>>>>>    hw/arm/virt.c | 2 +-
>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
>>>>> index 141350bf21..b4a95522d3 100644
>>>>> --- a/hw/arm/virt.c
>>>>> +++ b/hw/arm/virt.c
>>>>> @@ -2499,7 +2499,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
>>>>>    static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>>>>>    {
>>>>> -    return idx % ms->numa_state->num_nodes;
>>>>> +    return idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
>>>>
>>>> I'd like for ARM folks to confirm whether above is correct
>>>> (i.e. socket is NUMA node boundary and also if above topo vars
>>>> could have odd values. Don't look at horribly complicated x86
>>>> as example, but it showed that vendors could stash pretty much
>>>> anything there, so we should consider it here as well and maybe
>>>> forbid that in smp virt-arm parser)
>>>>   
>>>
>>> After doing some investigation, I don't think the socket is NUMA node boundary.
>>> Unfortunately, I didn't find it's documented like this in any documents after
>>> checking device-tree specification, Linux CPU topology and NUMA binding documents.
>>>
>>> However, there are two options here according to Linux (guest) kernel code:
>>> (A) socket is NUMA node boundary  (B) CPU die is NUMA node boundary. They are
>>> equivalent as CPU die isn't supported on arm/virt machine. Besides, the topology
>>> of one-to-one association between socket and NUMA node sounds natural and simplified.
>>> So I think (A) is the best way to go.
>>>
>>> Another thing I want to explain here is how the changes affect the memory
>>> allocation in Linux guest. Taking the command lines included in the commit
>>> log as an example, the first two NUMA nodes are bound to CPUs while the other
>>> 4 NUMA nodes are regarded as remote NUMA nodes to CPUs. The remote NUMA node
>>> won't accommodate the memory allocation until the memory in the near (local)
>>> NUMA node becomes exhausted. However, it's uncertain how the memory is hosted
>>> if memory binding isn't applied.
>>>
>>> Besides, I think the code should be improved like below to avoid overflow on
>>> ms->numa_state->num_nodes.
>>>
>>>    static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>>>    {
>>> -    return idx % ms->numa_state->num_nodes;
>>> +    int node_idx;
>>> +
>>> +    node_idx = idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
>>> +    return node_idx % ms->numa_state->num_nodes;
> 
> using idx directly to deduce node looks a bit iffy
> take x86_get_default_cpu_node_id() as an example,
> it translates it uses idx to pick arch_id (APIC ID)
> which has topology encoded into it and than translates
> that to node boundary (pkg_id -> socket)
> 
> Probably the same should happen here.
> 
> PS:
> may be a little on tangent to the topic but chunk above
> mentions dies/clusters/cores/threads as possible attributes
> for CPUs but virt_possible_cpu_arch_ids() says that only
> has_thread_id = true
> are supported, which looks broken to me.
> 
The x86's APIC ID, where the CPU topology is encoded, is something
ideal for arm/virt to follow here. With CPU topology embedded to
struct CPUArchId::arch_id, lots of ACPI tables like MADT, SRAT, PPTT
needs the corresponding update to expose it through "ACPI Processor UID"
field in those ACPI tables. It's much more than what we want to fix
the issue here because I don't see additional benefits to do it.
Besides, the package or socket index is exactly determined by 'idx'
on arm/virt. The CPU topology is exposed through ACPI PPTT, depending on
ms->smp. The threads/cores/clusters/dies in the struct determines
the indexes of threads ('idx') who belongs to the same socket.
Yes, the information maintained in ms->possible_cpus->cpu[i].props
is inconsistent to ms->smp. It means ms->possible_cpus->cpu[i].props.thread_id
is "CPU index" or "vcpu index", not "thread ID". Other fields like
sockets/dies/clusters/cores in ms->possible_cpus->cpu[i].props are
never used on arm/virt. However, the code changes included in this
patch to fix the broken CPU topology issue is still correct, or we
can enhance it like below in case 'has_socket_id' contains the correct
information in the future.
static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
{
-    return idx % ms->numa_state->num_nodes;
+    int socket_id;
+
+    if (ms->possible_cpus->cpus[idx].props.has_socket_id) {
+        socket_id = ms->possible_cpus->cpus[idx].props.socket_id;
+    } else {
+        socket_id = idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
+    }
+
+    return socket_id % ms->numa_state->num_nodes;
}
>>>    }
>>>
>>>    
>>>>>    }
>>>>>    static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
>>>>   
Thanks,
Gavin
next prev parent reply	other threads:[~2022-02-28  4:28 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-26  5:24 [PATCH] hw/arm/virt: Fix CPU's default NUMA node ID Gavin Shan
2022-01-26  7:41 ` Andrew Jones
2022-01-26  9:14 ` Igor Mammedov
2022-01-28  7:05   ` wangyanan (Y) via
2022-02-15  8:19     ` Gavin Shan
2022-02-15  8:32       ` Andrew Jones
2022-02-16 10:58         ` Gavin Shan
2022-02-08 14:49   ` Peter Maydell
2022-02-17  2:14   ` Gavin Shan
2022-02-25  8:41     ` Gavin Shan
2022-02-25 10:03       ` Igor Mammedov
2022-02-28  4:26         ` Gavin Shan [this message]
2022-02-28 10:54           ` Igor Mammedov
2022-03-01  9:14             ` Gavin Shan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox
  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):
  git send-email \
    --in-reply-to=6a520e47-e3ad-13a1-89f1-8d4213fb944a@redhat.com \
    --to=gshan@redhat.com \
    --cc=drjones@redhat.com \
    --cc=imammedo@redhat.com \
    --cc=peter.maydell@linaro.org \
    --cc=qemu-arm@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=richard.henderson@linaro.org \
    --cc=shan.gavin@gmail.com \
    --cc=wangyanan55@huawei.com \
    /path/to/YOUR_REPLY
  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
  Be sure your reply has a Subject: header at the top and a blank line
  before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).