From: Igor Mammedov <imammedo@redhat.com>
To: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>,
qemu-devel@nongnu.org, Eduardo Habkost <ehabkost@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [Qemu-devel] [RFC PATCH] numa: enable sparse node numbering
Date: Tue, 24 Jun 2014 09:33:48 +0200 [thread overview]
Message-ID: <20140624093348.094e6176@nial.usersys.redhat.com> (raw)
In-Reply-To: <20140623193310.GD4323@linux.vnet.ibm.com>
On Mon, 23 Jun 2014 12:33:10 -0700
Nishanth Aravamudan <nacc@linux.vnet.ibm.com> wrote:
> Add a present field to NodeInfo which indicates if a given nodeid was
> present on the command-line or not. Current code relies on a memory
> value being passed for a node to indicate presence, which is
> insufficient in the presence of memoryless nodes or sparse numbering.
> Adjust the iteration of various NUMA loops to use the maximum known NUMA
> ID rather than the number of NUMA nodes.
Why would you need sparse numbering?
What task exactly are you trying to solve?
>
> numa.c::set_numa_nodes() has become a bit more convoluted for
> round-robin'ing the CPUs over known nodes when not specified by the
> user.
>
> Note that architecture-specific code still needs changes (forthcoming)
> for both sparse node numbering and memoryless nodes.
>
> Examples:
>
> (1) qemu-system-x86_64 -enable-kvm -m 4096 -numa node,nodeid=3 -numa
> node,nodeid=2 -smp 16
>
> Before:
>
> node 0 cpus: 0 2 4 6 8 10 12 14
> node 0 size: 2048 MB
> node 1 cpus: 1 3 5 7 9 11 13 15
> node 1 size: 2048 MB
>
> After:
>
> node 2 cpus: 0 2 4 6 8 10 12 14 |
> node 2 size: 2048 MB |
> node 3 cpus: 1 3 5 7 9 11 13 15 |
> node 3 size: 2048 MB
>
> (2) qemu-system-x86_64 -enable-kvm -m 4096 -numa node,nodeid=0 -numa
> node,nodeid=1 -numa node,nodeid=2 -numa node,nodeid=3 -smp 16
>
> Before:
>
> node 0 cpus: 0 4 8 12 |
> node 0 size: 1024 MB |
> node 1 cpus: 1 5 9 13 |
> node 1 size: 1024 MB |
> node 2 cpus: 2 6 10 14 |
> node 2 size: 1024 MB |
> node 3 cpus: 3 7 11 15 |
> node 3 size: 1024 MB
>
> After:
>
> node 0 cpus: 0 4 8 12
> node 0 size: 1024 MB
> node 1 cpus: 1 5 9 13
> node 1 size: 1024 MB
> node 2 cpus: 2 6 10 14
> node 2 size: 1024 MB
> node 3 cpus: 3 7 11 15
> node 3 size: 1024 MB
>
> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
>
> ---
> Based off mst's for_upstream tag, which has the NodeInfo changes.
>
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 277230d..b90bf66 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -145,11 +145,13 @@ extern int mem_prealloc;
> */
> #define MAX_CPUMASK_BITS 255
>
> -extern int nb_numa_nodes;
> +extern int nb_numa_nodes; /* Number of NUMA nodes */
> +extern int max_numa_node; /* Highest specified NUMA node ID */
> typedef struct node_info {
> uint64_t node_mem;
> DECLARE_BITMAP(node_cpu, MAX_CPUMASK_BITS);
> struct HostMemoryBackend *node_memdev;
> + bool present;
> } NodeInfo;
> extern NodeInfo numa_info[MAX_NODES];
> void set_numa_nodes(void);
> diff --git a/monitor.c b/monitor.c
> index c7f8797..4721996 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -2003,7 +2003,10 @@ static void do_info_numa(Monitor *mon, const QDict *qdict)
> CPUState *cpu;
>
> monitor_printf(mon, "%d nodes\n", nb_numa_nodes);
> - for (i = 0; i < nb_numa_nodes; i++) {
> + for (i = 0; i < max_numa_node; i++) {
> + if (!numa_info[i].present) {
> + continue;
> + }
> monitor_printf(mon, "node %d cpus:", i);
> CPU_FOREACH(cpu) {
> if (cpu->numa_node == i) {
> diff --git a/numa.c b/numa.c
> index e471afe..c31857d 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -106,6 +106,10 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
> numa_info[nodenr].node_mem = object_property_get_int(o, "size", NULL);
> numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
> }
> + numa_info[nodenr].present = true;
> + if (nodenr >= max_numa_node) {
> + max_numa_node = nodenr + 1;
> + }
> }
>
> int numa_init_func(QemuOpts *opts, void *opaque)
> @@ -155,7 +159,7 @@ void set_numa_nodes(void)
> {
> if (nb_numa_nodes > 0) {
> uint64_t numa_total;
> - int i;
> + int i, j, last_j = -1;
>
> if (nb_numa_nodes > MAX_NODES) {
> nb_numa_nodes = MAX_NODES;
> @@ -164,27 +168,29 @@ void set_numa_nodes(void)
> /* If no memory size if given for any node, assume the default case
> * and distribute the available memory equally across all nodes
> */
> - for (i = 0; i < nb_numa_nodes; i++) {
> - if (numa_info[i].node_mem != 0) {
> + for (i = 0; i < max_numa_node; i++) {
> + if (numa_info[i].present && numa_info[i].node_mem != 0) {
> break;
> }
> }
> - if (i == nb_numa_nodes) {
> + if (i == max_numa_node) {
> uint64_t usedmem = 0;
>
> /* On Linux, the each node's border has to be 8MB aligned,
> * the final node gets the rest.
> */
> - for (i = 0; i < nb_numa_nodes - 1; i++) {
> - numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
> - ~((1 << 23UL) - 1);
> - usedmem += numa_info[i].node_mem;
> + for (i = 0; i < max_numa_node - 1; i++) {
> + if (numa_info[i].present) {
> + numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
> + ~((1 << 23UL) - 1);
> + usedmem += numa_info[i].node_mem;
> + }
> }
> numa_info[i].node_mem = ram_size - usedmem;
> }
>
> numa_total = 0;
> - for (i = 0; i < nb_numa_nodes; i++) {
> + for (i = 0; i < max_numa_node; i++) {
> numa_total += numa_info[i].node_mem;
> }
> if (numa_total != ram_size) {
> @@ -194,8 +200,9 @@ void set_numa_nodes(void)
> exit(1);
> }
>
> - for (i = 0; i < nb_numa_nodes; i++) {
> - if (!bitmap_empty(numa_info[i].node_cpu, MAX_CPUMASK_BITS)) {
> + for (i = 0; i < max_numa_node; i++) {
> + if (numa_info[i].present &&
> + !bitmap_empty(numa_info[i].node_cpu, MAX_CPUMASK_BITS)) {
> break;
> }
> }
> @@ -203,9 +210,19 @@ void set_numa_nodes(void)
> * must cope with this anyway, because there are BIOSes out there in
> * real machines which also use this scheme.
> */
> - if (i == nb_numa_nodes) {
> + if (i == max_numa_node) {
> for (i = 0; i < max_cpus; i++) {
> - set_bit(i, numa_info[i % nb_numa_nodes].node_cpu);
> + for (j = last_j + 1;
> + j != (last_j == 0 ? max_numa_node : last_j - 1); j++) {
> + if (j >= max_numa_node) {
> + j = 0;
> + }
> + if (numa_info[j].present) {
> + break;
> + }
> + }
> + last_j = j;
> + set_bit(i, numa_info[j].node_cpu);
> }
> }
> }
> @@ -217,7 +234,7 @@ void set_numa_modes(void)
> int i;
>
> CPU_FOREACH(cpu) {
> - for (i = 0; i < nb_numa_nodes; i++) {
> + for (i = 0; i < max_numa_node; i++) {
> if (test_bit(cpu->cpu_index, numa_info[i].node_cpu)) {
> cpu->numa_node = i;
> }
> diff --git a/vl.c b/vl.c
> index 54b4627..e1a6ab8 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -195,6 +195,7 @@ static QTAILQ_HEAD(, FWBootEntry) fw_boot_order =
> QTAILQ_HEAD_INITIALIZER(fw_boot_order);
>
> int nb_numa_nodes;
> +int max_numa_node;
> NodeInfo numa_info[MAX_NODES];
>
> uint8_t qemu_uuid[16];
> @@ -2967,10 +2968,12 @@ int main(int argc, char **argv, char **envp)
>
> for (i = 0; i < MAX_NODES; i++) {
> numa_info[i].node_mem = 0;
> + numa_info[i].present = false;
> bitmap_zero(numa_info[i].node_cpu, MAX_CPUMASK_BITS);
> }
>
> nb_numa_nodes = 0;
> + max_numa_node = -1;
> nb_nics = 0;
>
> bdrv_init_with_whitelist();
>
>
next prev parent reply other threads:[~2014-06-24 7:34 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-06-23 19:33 [Qemu-devel] [RFC PATCH] numa: enable sparse node numbering Nishanth Aravamudan
2014-06-24 0:48 ` [Qemu-devel] [RFC PATCH v2] " Nishanth Aravamudan
2014-06-24 1:08 ` [Qemu-devel] [RFC PATCH] ppc/spapr: support sparse NUMA " Nishanth Aravamudan
2014-06-24 3:34 ` Nishanth Aravamudan
2014-06-24 7:43 ` [Qemu-devel] [RFC PATCH v2] numa: enable sparse " Hu Tao
2014-06-24 17:40 ` [Qemu-devel] [RFC PATCH v3] " Nishanth Aravamudan
2014-06-25 11:21 ` Igor Mammedov
2014-06-25 16:13 ` Nishanth Aravamudan
2014-06-25 16:52 ` Eduardo Habkost
2014-06-25 17:04 ` Nishanth Aravamudan
2014-06-25 18:24 ` Michael S. Tsirkin
2014-06-26 6:29 ` Igor Mammedov
2014-06-25 18:23 ` Michael S. Tsirkin
2014-06-26 9:09 ` Hu Tao
2014-06-26 17:58 ` Nishanth Aravamudan
2014-06-26 18:39 ` Eduardo Habkost
2014-06-26 19:37 ` Eduardo Habkost
2014-06-30 18:26 ` Nishanth Aravamudan
2014-06-30 21:48 ` Eduardo Habkost
2014-06-24 7:33 ` Igor Mammedov [this message]
2014-06-24 8:58 ` [Qemu-devel] [RFC PATCH] " Alexey Kardashevskiy
2014-06-24 16:39 ` Nishanth Aravamudan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140624093348.094e6176@nial.usersys.redhat.com \
--to=imammedo@redhat.com \
--cc=aik@ozlabs.ru \
--cc=ehabkost@redhat.com \
--cc=mst@redhat.com \
--cc=nacc@linux.vnet.ibm.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).