From: Gavin Shan <gshan@redhat.com>
To: Igor Mammedov <imammedo@redhat.com>
Cc: peter.maydell@linaro.org, drjones@redhat.com,
	ehabkost@redhat.com, richard.henderson@linaro.org,
	qemu-devel@nongnu.org, qemu-arm@nongnu.org, shan.gavin@gmail.com
Subject: Re: [PATCH v2] hw/arm/virt: Expose empty NUMA nodes through ACPI
Date: Thu, 28 Oct 2021 22:32:09 +1100
Message-ID: <fecb9351-ae78-8fcd-e377-623243ef80df@redhat.com>
In-Reply-To: <20211027174028.1f16fcfb@redhat.com>

On 10/28/21 2:40 AM, Igor Mammedov wrote:
> On Wed, 27 Oct 2021 13:29:58 +0800
> Gavin Shan <gshan@redhat.com> wrote:
> 
>> The empty NUMA nodes, where no memory resides, aren't exposed
>> through ACPI SRAT table. It's not user preferred behaviour because
>> the corresponding memory node devices are missed from the guest
>> kernel as the following example shows. It means the guest kernel
>> doesn't have the node information as user specifies. However,
>> memory can be still hot added to these empty NUMA nodes when
>> they're not exposed.
>>
>>    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>>    -accel kvm -machine virt,gic-version=host               \
>>    -cpu host -smp 4,sockets=2,cores=2,threads=1            \
>>    -m 1024M,slots=16,maxmem=64G                            \
>>    -object memory-backend-ram,id=mem0,size=512M            \
>>    -object memory-backend-ram,id=mem1,size=512M            \
>>    -numa node,nodeid=0,cpus=0-1,memdev=mem0                \
>>    -numa node,nodeid=1,cpus=2-3,memdev=mem1                \
>>    -numa node,nodeid=2                                     \
>>    -numa node,nodeid=3                                     \
>>       :
>>    guest# ls /sys/devices/system/node | grep node
>>    node0
>>    node1
>>    (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
>>    (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
>>    guest# ls /sys/devices/system/node | grep node
>>    node0
>>    node1
>>    node2
>>    guest# cat /sys/devices/system/node/node2/meminfo | grep MemTotal
>>    Node 2 MemTotal:    1048576 kB
>>
>> This exposes these empty NUMA nodes through ACPI SRAT table. With
>> this applied, the corresponding memory node devices can be found
>> from the guest. Note that the hotpluggable capability is explicitly
>> given to these empty NUMA nodes for sake of completeness.
>>
>>    guest# ls /sys/devices/system/node | grep node
>>    node0
>>    node1
>>    node2
>>    node3
>>    guest# cat /sys/devices/system/node/node3/meminfo | grep MemTotal
>>    Node 3 MemTotal:    0 kB
>>    (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
>>    (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
>>    guest# cat /sys/devices/system/node/node3/meminfo | grep MemTotal
>>    Node 3 MemTotal:    1048576 kB
> 
> I'm still not sure why this is necessary and if it's a good idea,
> is there a real hardware that have such nodes?
> 
> SRAT is used to assign resources to nodes, I haven't seen it being
> used  as means to describe an empty node anywhere in the spec.
> (perhaps we should not allow empty nodes on QEMU CLI at all).
> 
> Then if we really need this, why it's done for ARM only
> and not for x86?
> 

I think this case does exist on real hardware: a memory DIMM slot can
be left unpopulated while the node is still probed. Besides, this
patch addresses two issues:

(1) It makes the information in the guest kernel consistent with
     the command line, as the user expects. The sysfs entries for
     these empty NUMA nodes in the guest kernel then reflect what
     the user provided.

(2) Without this patch, the node numbering can differ from what the
     user specified. In the example from the commit log, the DIMM is
     hot added to node 3, so node3 should be created, but node2 is
     created instead. The patch reserves the NUMA node IDs in advance
     to avoid the issue.

     /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
        :
     -numa node,nodeid=0,cpus=0-1,memdev=mem0                \
     -numa node,nodeid=1,cpus=2-3,memdev=mem1                \
     -numa node,nodeid=2                                     \
     -numa node,nodeid=3                                     \
     guest# ls /sys/devices/system/node | grep node
     node0  node1
     (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
     (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
     guest# ls /sys/devices/system/node | grep node
     node0  node1  node2
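
The renumbering in (2) can be illustrated with a small model. This is
only a sketch, not the kernel's code: it assumes the guest hands out
node IDs in the order in which proximity domains are first seen, which
is what the output above suggests. The names pxm_map, pxm_map_init()
and pxm_to_node() are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PXM  8
#define NO_NODE  (-1)

/* Hypothetical model of a guest's proximity-domain -> node ID table. */
struct pxm_map {
    int node_of_pxm[MAX_PXM];   /* NO_NODE until the domain is seen */
    int next_node;              /* lowest node ID not handed out yet */
};

static void pxm_map_init(struct pxm_map *m)
{
    for (size_t i = 0; i < MAX_PXM; i++) {
        m->node_of_pxm[i] = NO_NODE;
    }
    m->next_node = 0;
}

/*
 * Return the node ID for a proximity domain. A domain that has not
 * been seen before gets the lowest unused node ID (assumption).
 */
static int pxm_to_node(struct pxm_map *m, int pxm)
{
    assert(pxm >= 0 && pxm < MAX_PXM);
    if (m->node_of_pxm[pxm] == NO_NODE) {
        m->node_of_pxm[pxm] = m->next_node++;
    }
    return m->node_of_pxm[pxm];
}
```

If only domains 0 and 1 are seen at boot, a later hot add to domain 3
gets node ID 2 in this model. If domains 0 to 3 are all seen at boot,
the hot add lands on node 3 as the user asked.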

We definitely need to support empty NUMA nodes on the QEMU CLI. One
case I've heard of is kdump developers specifying NUMA nodes and the
corresponding pc-dimm objects to hot add memory and test that it is
usable. I'm not familiar with the ACPI specification, but the Linux
kernel fetches NUMA node IDs from the following ACPI structures on
ARM64. It's possible for the empty NUMA node IDs to be parsed from
GENERIC_AFFINITY entries or the SLIT table if they are present.

     ACPI_SRAT_TYPE_MEMORY_AFFINITY
     ACPI_SRAT_TYPE_GENERIC_AFFINITY
     ACPI_SIG_SLIT                          # if it exists

So I think other architectures, including x86, need a similar mechanism
to expose NUMA node IDs through the ACPI tables. If you agree, I can post
additional patches to do this after this one is settled and merged.

>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> Reviewed-by: Andrew Jones <drjones@redhat.com>
>> ---
>> v2: Improved commit log as suggested by Drew and Igor.
>> ---
>>   hw/arm/virt-acpi-build.c | 14 +++++++++-----
>>   1 file changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
>> index 674f902652..a4c95b2f64 100644
>> --- a/hw/arm/virt-acpi-build.c
>> +++ b/hw/arm/virt-acpi-build.c
>> @@ -526,6 +526,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>>       const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(ms);
>>       AcpiTable table = { .sig = "SRAT", .rev = 3, .oem_id = vms->oem_id,
>>                           .oem_table_id = vms->oem_table_id };
>> +    MemoryAffinityFlags flags;
>>   
>>       acpi_table_begin(&table, table_data);
>>       build_append_int_noprefix(table_data, 1, 4); /* Reserved */
>> @@ -547,12 +548,15 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>>   
>>       mem_base = vms->memmap[VIRT_MEM].base;
>>       for (i = 0; i < ms->numa_state->num_nodes; ++i) {
>> -        if (ms->numa_state->nodes[i].node_mem > 0) {
>> -            build_srat_memory(table_data, mem_base,
>> -                              ms->numa_state->nodes[i].node_mem, i,
>> -                              MEM_AFFINITY_ENABLED);
>> -            mem_base += ms->numa_state->nodes[i].node_mem;
>> +        if (ms->numa_state->nodes[i].node_mem) {
>> +            flags = MEM_AFFINITY_ENABLED;
>> +        } else {
>> +            flags = MEM_AFFINITY_ENABLED | MEM_AFFINITY_HOTPLUGGABLE;
>>           }
>> +
>> +        build_srat_memory(table_data, mem_base,
>> +                          ms->numa_state->nodes[i].node_mem, i, flags);
> that will create 0 length memory range, which is "Enabled",
> I'm not sure it's safe thing to do.
> 
> As side effect this will also create empty ranges for memory-less
> nodes that have only CPUs, where it's not necessary.
> 
> I'd really try avoid adding empty ranges unless it hard requirement,
> described somewhere or fixes a bug that can't be fixed elsewhere.
> 

It's safe for Linux at least, as I tested on ARM64. The zero-length
memory block doesn't affect anything. Besides, a memory block that is
marked as hotpluggable isn't actually handled by Linux on ARM64.

Yes, the empty NUMA nodes are meaningless to CPUs until memory is
hot added into them.
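
For reference, here is a standalone model of the loop as changed by the
patch. It is a sketch outside QEMU, not the real build_srat(): the
struct srat_mem_entry and model_srat_memory() names are invented for
illustration, and the flag values are local stand-ins that follow the
SRAT Memory Affinity flag bits (bit 0 Enabled, bit 1 Hot Pluggable).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the SRAT Memory Affinity flag bits. */
#define MEM_AFFINITY_ENABLED      (1u << 0)
#define MEM_AFFINITY_HOTPLUGGABLE (1u << 1)

/* Hypothetical record of one emitted memory affinity entry. */
struct srat_mem_entry {
    uint64_t base;
    uint64_t len;
    uint32_t node;
    uint32_t flags;
};

/*
 * Emit one entry per node, as the patched loop does. Empty nodes get
 * a zero-length range that is also marked hotpluggable. 'out' must
 * have room for num_nodes entries. Returns the number of entries.
 */
static size_t model_srat_memory(const uint64_t *node_mem, size_t num_nodes,
                                uint64_t mem_base, struct srat_mem_entry *out)
{
    assert(node_mem != NULL && out != NULL);
    for (size_t i = 0; i < num_nodes; i++) {
        out[i].base = mem_base;
        out[i].len = node_mem[i];
        out[i].node = (uint32_t)i;
        out[i].flags = MEM_AFFINITY_ENABLED;
        if (node_mem[i] == 0) {
            out[i].flags |= MEM_AFFINITY_HOTPLUGGABLE;
        }
        mem_base += node_mem[i];
    }
    return num_nodes;
}
```

With the command line from the commit log (two 512M nodes followed by
two empty nodes), the model emits four entries. The last two are
zero-length, share the same base address, and carry both flags.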


>> +        mem_base += ms->numa_state->nodes[i].node_mem;
>>       }
>>   
>>       if (ms->nvdimms_state->is_enabled) {
> 

Thanks,
Gavin

