From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:44692) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Sgjus-0006jr-11 for qemu-devel@nongnu.org; Mon, 18 Jun 2012 17:55:23 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Sgjup-0004yT-CM for qemu-devel@nongnu.org; Mon, 18 Jun 2012 17:55:21 -0400
Received: from g6t0185.atlanta.hp.com ([15.193.32.62]:1278) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Sgjup-0004xS-5f for qemu-devel@nongnu.org; Mon, 18 Jun 2012 17:55:19 -0400
Message-ID: <4FDFA3C2.9030602@hp.com>
Date: Mon, 18 Jun 2012 14:55:14 -0700
From: Chegu Vinod
MIME-Version: 1.0
References: <1339963951-7798-1-git-send-email-chegu_vinod@hp.com> <20120618202956.GA32417@otherpad.lan.raisama.net>
In-Reply-To: <20120618202956.GA32417@otherpad.lan.raisama.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH] Fixes related to processing of qemu's -numa option
To: Eduardo Habkost
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org

On 6/18/2012 1:29 PM, Eduardo Habkost wrote:
> On Sun, Jun 17, 2012 at 01:12:31PM -0700, Chegu Vinod wrote:
>> The -numa option to qemu is used to create [fake] numa nodes
>> and expose them to the guest OS instance.
>>
>> There are a couple of issues with the -numa option:
>>
>> a) The maximum number of VCPUs that can be specified for a guest
>>    through qemu's -numa option is 64. Due to a typecasting issue,
>>    when the number of VCPUs is > 32 the VCPUs don't show up
>>    under the specified [fake] numa nodes.
>>
>> b) KVM currently has support for 160 VCPUs per guest, while
>>    qemu's -numa option only supports up to 64 VCPUs per guest.
>>
>> This patch addresses these two issues. [ Note: This
>> patch has been verified by Eduardo Habkost. ]
> I was going to add a Tested-by line, but this patch breaks the automatic
> round-robin assignment when no CPU is added to any node on the
> command-line. Pl. see below.
>
> Additional questions below:
>
> [...]
>> diff --git a/cpus.c b/cpus.c
>> index b182b3d..f9cee60 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -1145,7 +1145,7 @@ void set_numa_modes(void)
>>
>>      for (env = first_cpu; env != NULL; env = env->next_cpu) {
>>          for (i = 0; i < nb_numa_nodes; i++) {
>> -            if (node_cpumask[i] & (1 << env->cpu_index)) {
>> +            if (CPU_ISSET_S(env->cpu_index, cpumask_size, node_cpumask[i])) {
>>                  env->numa_node = i;
>>              }
>>          }
>> diff --git a/hw/pc.c b/hw/pc.c
>> index 8368701..f0c3665 100644
>> --- a/hw/pc.c
>> +++ b/hw/pc.c
>> @@ -639,7 +639,7 @@ static void *bochs_bios_init(void)
>>      numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes);
>>      for (i = 0; i < max_cpus; i++) {
>>          for (j = 0; j < nb_numa_nodes; j++) {
>> -            if (node_cpumask[j] & (1 << i)) {
>> +            if (CPU_ISSET_S(i, cpumask_size, node_cpumask[j])) {
>>                  numa_fw_cfg[i + 1] = cpu_to_le64(j);
>>                  break;
>>              }
>> diff --git a/sysemu.h b/sysemu.h
>> index bc2c788..6e4d342 100644
>> --- a/sysemu.h
>> +++ b/sysemu.h
>> @@ -9,6 +9,7 @@
>>  #include "qapi-types.h"
>>  #include "notify.h"
>>  #include "main-loop.h"
>> +#include <sched.h>
>>
>>  /* vl.c */
>>
>> @@ -133,9 +134,11 @@ extern uint8_t qemu_extra_params_fw[2];
>>  extern QEMUClock *rtc_clock;
>>
>>  #define MAX_NODES 64
>> +#define KVM_MAX_VCPUS 254
>
> Do we really need this constant? Why not just use max_cpus when
> allocating the CPU sets, instead?

Hmm.... I had thought about this earlier too. max_cpus was not getting
set at the point where the allocations were being done. I have now moved
that code to a bit later and switched to using max_cpus.
>> extern int nb_numa_nodes;
>> extern uint64_t node_mem[MAX_NODES];
>> -extern uint64_t node_cpumask[MAX_NODES];
>> +extern cpu_set_t *node_cpumask[MAX_NODES];
>> +extern size_t cpumask_size;
>>
>> #define MAX_OPTION_ROMS 16
>> typedef struct QEMUOptionRom {
>> diff --git a/vl.c b/vl.c
>> index 204d85b..1906412 100644
>> --- a/vl.c
>> +++ b/vl.c
>> @@ -28,6 +28,7 @@
>>  #include
>>  #include
>>  #include
>> +#include <sched.h>
>>
>>  /* Needed early for CONFIG_BSD etc. */
>>  #include "config-host.h"
>> @@ -240,7 +241,8 @@ QTAILQ_HEAD(, FWBootEntry) fw_boot_order = QTAILQ_HEAD_INITIALIZER(fw_boot_order
>>
>>  int nb_numa_nodes;
>>  uint64_t node_mem[MAX_NODES];
>> -uint64_t node_cpumask[MAX_NODES];
>> +cpu_set_t *node_cpumask[MAX_NODES];
>> +size_t cpumask_size;
>>
>>  uint8_t qemu_uuid[16];
>>
>> @@ -950,6 +952,9 @@ static void numa_add(const char *optarg)
>>      char *endptr;
>>      unsigned long long value, endvalue;
>>      int nodenr;
>> +    int i;
>> +
>> +    value = endvalue = 0;
>>
>>      optarg = get_opt_name(option, 128, optarg, ',') + 1;
>>      if (!strcmp(option, "node")) {
>> @@ -970,27 +975,17 @@ static void numa_add(const char *optarg)
>>          }
>>          node_mem[nodenr] = sval;
>>      }
>> -    if (get_param_value(option, 128, "cpus", optarg) == 0) {
>> -        node_cpumask[nodenr] = 0;
>> -    } else {
>> +    if (get_param_value(option, 128, "cpus", optarg) != 0) {
>>          value = strtoull(option, &endptr, 10);
>> -        if (value >= 64) {
>> -            value = 63;
>> -            fprintf(stderr, "only 64 CPUs in NUMA mode supported.\n");
>> +        if (*endptr == '-') {
>> +            endvalue = strtoull(endptr+1, &endptr, 10);
>>          } else {
>> -            if (*endptr == '-') {
>> -                endvalue = strtoull(endptr+1, &endptr, 10);
>> -                if (endvalue >= 63) {
>> -                    endvalue = 62;
>> -                    fprintf(stderr,
>> -                            "only 63 CPUs in NUMA mode supported.\n");
>> -                }
>> -                value = (2ULL << endvalue) - (1ULL << value);
>> -            } else {
>> -                value = 1ULL << value;
>> -            }
>> +            endvalue = value;
>> +        }
>> +
>> +        for (i = value; i <= endvalue; ++i) {
>> +            CPU_SET_S(i, cpumask_size, node_cpumask[nodenr]);
> What if endvalue is larger than the cpu mask size we allocated?

I can add a check (endvalue >= max_cpus) and print an error. Should we
force-set endvalue to max_cpus-1 in that case?

>>          }
>> -        node_cpumask[nodenr] = value;
>>      }
>>      nb_numa_nodes++;
>> }
>> @@ -2331,7 +2326,9 @@ int main(int argc, char **argv, char **envp)
>>
>>      for (i = 0; i < MAX_NODES; i++) {
>>          node_mem[i] = 0;
>> -        node_cpumask[i] = 0;
>> +        node_cpumask[i] = CPU_ALLOC(KVM_MAX_VCPUS);
>> +        cpumask_size = CPU_ALLOC_SIZE(KVM_MAX_VCPUS);
>> +        CPU_ZERO_S(cpumask_size, node_cpumask[i]);
>>      }
>>
>>      nb_numa_nodes = 0;
>> @@ -3465,8 +3462,9 @@ int main(int argc, char **argv, char **envp)
>>      }
>>
>>      for (i = 0; i < nb_numa_nodes; i++) {
>> -        if (node_cpumask[i] != 0)
>> +        if (node_cpumask[i] != NULL) {
>
> This will be true for every node (as you preallocate all the CPU
> sets in the beginning), so the loop will always end with i==0, in turn
> unconditionally disabling the round-robin CPU assignment code below.
> You can use CPU_COUNT_S, instead.

Argh... I should have tested this. My bad! Thanks for catching this.
I made the change and tested it, and it's doing the round-robin
assignment now.

Will post the next version of the patch soon.

Thanks for your feedback.
Vinod

>>              break;
>> +        }
>>      }
>>      /* assigning the VCPUs round-robin is easier to implement, guest OSes
>>       * must cope with this anyway, because there are BIOSes out there in
>> @@ -3474,7 +3472,7 @@
>>       */
>>      if (i == nb_numa_nodes) {
>>          for (i = 0; i < max_cpus; i++) {
>> -            node_cpumask[i % nb_numa_nodes] |= 1 << i;
>> +            CPU_SET_S(i, cpumask_size, node_cpumask[i % nb_numa_nodes]);
>>          }
>>      }
>> }
>> --
>> 1.7.1
>>