Date: Mon, 16 Jun 2014 17:16:50 -0700
From: Nishanth Aravamudan
To: Eduardo Habkost
Cc: Alexey Kardashevskiy, qemu-ppc@nongnu.org, qemu-devel@nongnu.org, Alexander Graf
Subject: Re: [Qemu-devel] [PATCH 7/7] numa: Allow empty nodes
Message-ID: <20140617001649.GJ16644@linux.vnet.ibm.com>
In-Reply-To: <20140616201140.GB8629@otherpad.lan.raisama.net>
References: <1402905233-26510-1-git-send-email-aik@ozlabs.ru> <1402905233-26510-8-git-send-email-aik@ozlabs.ru> <20140616161500.GD3222@otherpad.lan.raisama.net> <20140616184945.GI16644@linux.vnet.ibm.com> <20140616201140.GB8629@otherpad.lan.raisama.net>

On 16.06.2014 [17:11:40 -0300], Eduardo Habkost wrote:
> On Mon, Jun 16, 2014 at 11:49:46AM -0700, Nishanth Aravamudan wrote:
> > On 16.06.2014 [13:15:00 -0300], Eduardo Habkost wrote:
> > > On Mon, Jun 16, 2014 at 05:53:53PM +1000, Alexey Kardashevskiy wrote:
> > > > Currently NUMA nodes must go consequently and QEMU ignores nodes
> > > > with @nodeid bigger than the number of NUMA nodes in the command line.
> > >
> > > Why would somebody need a NUMA node with nodeid bigger than the number
> > > of NUMA nodes? NUMA node IDs must be in the [0, numa_nodes-1] range.
> >
> > That is not how the code works currently.
> >
> > vl.c::numa_add()
> >     ...
> >     if (get_param_value(option, 128, "nodeid", optarg) == 0) {
> >         nodenr = nb_numa_nodes;
> >     } else {
> >         if (parse_uint_full(option, &nodenr, 10) < 0) {
> >             fprintf(stderr, "qemu: Invalid NUMA nodeid: %s\n", option);
> >             exit(1);
> >         }
> >     }
> >     ...
> >     if (get_param_value(option, 128, "mem", optarg) == 0) {
> >         node_mem[nodenr] = 0;
> >     } else {
> >         int64_t sval;
> >         sval = strtosz(option, &endptr);
> >         if (sval < 0 || *endptr) {
> >             fprintf(stderr, "qemu: invalid numa mem size: %s\n", optarg);
> >             exit(1);
> >         }
> >         node_mem[nodenr] = sval;
> >     }
> >     ...
> >     nb_numa_nodes++;
> >     ...
> >
> > So if a user passes nodeid= to the NUMA node definition, that entry in
> > node_mem is set to the appropriate value, but nb_numa_nodes, which is
> > used to bound the iteration of that array, is not bumped appropriately.
> > So we end up looking at arbitrary indices in the node_mem array, which
> > are often 0.
> >
> > Note also that this means we can't generically differentiate here
> > between a user-defined memoryless node and one that happens to be 0
> > because the particular nodeid was not specified on the command-line.
>
> That's because all nodeids must be specified in the command-line.
> Accepting omitted nodes is a bug which should be fixed.

Fair enough, but that's certainly not enforced. In fact, the way the
code is written right now (in hw/ppc/spapr.c and hw/i386/pc.c, e.g.), it
seems like the architecture code assumes nodeids are sequential (0 to
nb_numa_nodes), but the generic code puts values in node_mem based upon
the nodeid specified on the command-line. That's the bug Alexey's 7/7
patch is attempting to fix, I think.

> > Alexey, how do you differentiate between these two cases after your
> > patches?
> > In patch 3, I see you check (and skip in the loop) explicitly
> > if !node_mem[nodeid], but I'm not sure how that check can differentiate
> > between the statically 0 (from main's initialization loop) and when a
> > user says a node's memory is 0. Probably something obvious I'm missing
> > (it is Monday after all)...
> >
> > > > This prevents us from creating memory-less nodes which is possible
> > > > situation in POWERPC under pHyp or Sapphire.
> > >
> > > Why? If I recall correctly, nodes without any CPUs or any memory are
> > > already allowed.
> >
> > They are allowed, but it seems like the code throughout qemu (where it's
> > relevant) assumes that NUMA nodes are sequential and contiguous, but
> > that's not required (nor is it enforced on the command-line).
>
> It is not being enforced, but you will get weird bugs if you don't do
> it. What I suggest is that we start enforcing it instead of magically
> creating new nodes when a wrong (too high) ID is provided.
>
> > > > How exactly would this patch help you? How do you expect the
> > > > command-line to look like for your use case?
> >
> > Alexey has replied with that, it looks like.
>
> Where? I don't see a reply.

> > > > This makes nb_numa_nodes a total number of nodes or the biggest node
> > > > number + 1, whichever is greater.
> > > > Signed-off-by: Alexey Kardashevskiy
> > > > ---
> > > >  vl.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/vl.c b/vl.c
> > > > index ac0e3d7..f1b75cb 100644
> > > > --- a/vl.c
> > > > +++ b/vl.c
> > > > @@ -1356,7 +1356,7 @@ static void numa_add(const char *optarg)
> > > >          if (get_param_value(option, 128, "cpus", optarg) != 0) {
> > > >              numa_node_parse_cpus(nodenr, option);
> > > >          }
> > > > -        nb_numa_nodes++;
> > > > +        nb_numa_nodes = MAX(nb_numa_nodes + 1, nodenr + 1);
> > >
> > > I would instead suggest that if any node in the [0, max_node_id] range
> > > is not present on the command-line, QEMU should instead reject the
> > > command-line.
> >
> > We already check two things:
> >
> > Too many nodes (meaning we've filled the array):
> >     if (nb_numa_nodes >= MAX_NODES) {
> >         fprintf(stderr, "qemu: too many NUMA nodes\n");
> >         exit(1);
> >     }
> >
> > Node ID itself is out of range (due to the use of an array):
> >     if (nodenr >= MAX_NODES) {
> >         fprintf(stderr, "qemu: invalid NUMA nodeid: %llu\n", nodenr);
> >         exit(1);
> >     }
> >
> > But that doesn't prevent one from having *sparse* NUMA node IDs. And, as
> > far as I can tell, this is allowed by the spec, but isn't properly
> > supported by qemu.
>
> Wait, is the node ID visible to the guest at all? I believe it is a
> QEMU-internal thing, just to allow the NUMA nodes to be ordered in the
> command-line. I would even claim that the parameter is useless and
> shouldn't have been introduced in the first place.

nodeids show up in the topology to Linux, at least on ppc.

Thanks,
Nish