From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
To: Eduardo Habkost <ehabkost@redhat.com>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>,
qemu-ppc@nongnu.org, qemu-devel@nongnu.org,
Alexander Graf <agraf@suse.de>
Subject: Re: [Qemu-devel] [PATCH 7/7] numa: Allow empty nodes
Date: Mon, 16 Jun 2014 17:16:50 -0700 [thread overview]
Message-ID: <20140617001649.GJ16644@linux.vnet.ibm.com> (raw)
In-Reply-To: <20140616201140.GB8629@otherpad.lan.raisama.net>
On 16.06.2014 [17:11:40 -0300], Eduardo Habkost wrote:
> On Mon, Jun 16, 2014 at 11:49:46AM -0700, Nishanth Aravamudan wrote:
> > On 16.06.2014 [13:15:00 -0300], Eduardo Habkost wrote:
> > > On Mon, Jun 16, 2014 at 05:53:53PM +1000, Alexey Kardashevskiy wrote:
> > > > Currently NUMA nodes must go consequently and QEMU ignores nodes
> > > > with @nodeid bigger than the number of NUMA nodes in the command line.
> > >
> > > Why would somebody need a NUMA node with nodeid bigger than the number
> > > of NUMA nodes? NUMA node IDs must be in the [0, numa_nodes-1] range.
> >
> > That is not how the code works currently.
> >
> > vl.c::numa_add()
> > ...
> > if (get_param_value(option, 128, "nodeid", optarg) == 0) {
> > nodenr = nb_numa_nodes;
> > } else {
> > if (parse_uint_full(option, &nodenr, 10) < 0) {
> > fprintf(stderr, "qemu: Invalid NUMA nodeid: %s\n", option);
> > exit(1);
> > }
> > }
> > ...
> > if (get_param_value(option, 128, "mem", optarg) == 0) {
> > node_mem[nodenr] = 0;
> > } else {
> > int64_t sval;
> > sval = strtosz(option, &endptr);
> > if (sval < 0 || *endptr) {
> > fprintf(stderr, "qemu: invalid numa mem size: %s\n", optarg);
> > exit(1);
> > }
> > node_mem[nodenr] = sval;
> > }
> > ...
> > nb_numa_nodes++;
> > ...
> >
> > So if a user passes nodeid= to the NUMA node definition, that entry in
> > node_mem is set to the appropriate value, but nb_numa_nodes, which is
> > used to bound the iteration of that array is not bumped appropriately.
> > So we end up looking at arbitrary indices in the node_mem array, which
> > are often 0.
<snip, replying to this part in your follow-up>
> > Note also that means that we can't generically differentiate here
> > between a user-defined memoryless node and one that happens to be 0
> > because the particular nodeid was not specified on the command-line.
>
> That's because all nodeids must be specified in the command-line.
> Accepting omitted nodes is a bug which should be fixed.
Fair enough, but that's certainly not enforced. In fact, the way the
code is written right now (in hw/ppc/spapr.c and hw/i386/pc.c, e.g) it
seems like hte architecture code assumes nodeids are sequential (0 to
nb_numa_nodes), but the generic code puts values in node_mem based upon
the nodeid specified on the command-line. That's the bug Alexey's 7/7
patch is attempting to fix, I think.
> > Alexey, how do you differentiate between these two cases after your
> > patches? In patch 3, I see you check (and skip in the loop) explicitly
> > if !node_mem[nodeid], but I'm not sure how that check can differentiate
> > between the statically 0 (from main's intialization loop) and when a
> > user says a node's memory is 0. Probably something obvious I'm missing
> > (it is Monday after all)...
> >
> > > > This prevents us from creating memory-less nodes which is possible
> > > > situation in POWERPC under pHyp or Sapphire.
> > >
> > > Why? If I recall correctly, nodes without any CPUs or any memory are
> > > already allowed.
> >
> > They are allowed, but it seems like the code throughout qemu (where it's
> > relevant) assumes that NUMA nodes are sequential and continuous, but
> > that's not required (nor is it enforced on the command-line).
>
> It is not being enforced, but you will get weird bugs if you don't do
> it. What I suggest is that we start enforcing it instead of magically
> crating new nodes when a wrong (too high) ID is provided.
>
> >
> > > How exactly would this patch help you? How do you expect the
> > > command-line to look like for your use case?
> >
> > Alexey has replied with that, it looks like.
>
> Where? I don't see a reply.
>
> >
> > > > This makes nb_numa_nodes a total number of nodes or the biggest node
> > > > number + 1 whichever is greater.
> > > >
> > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > > ---
> > > > vl.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/vl.c b/vl.c
> > > > index ac0e3d7..f1b75cb 100644
> > > > --- a/vl.c
> > > > +++ b/vl.c
> > > > @@ -1356,7 +1356,7 @@ static void numa_add(const char *optarg)
> > > > if (get_param_value(option, 128, "cpus", optarg) != 0) {
> > > > numa_node_parse_cpus(nodenr, option);
> > > > }
> > > > - nb_numa_nodes++;
> > > > + nb_numa_nodes = MAX(nb_numa_nodes + 1, nodenr + 1);
> > >
> > > I would instead suggest that if any node in the [0, max_node_id] range
> > > is not present on the command-line, QEMU should instead reject the
> > > command-line.
> >
> > We already check two things:
> >
> > Too many nodes (meaning we've filled the array):
> > if (nb_numa_nodes >= MAX_NODES) {
> > fprintf(stderr, "qemu: too many NUMA nodes\n");
> > exit(1);
> > }
> >
> > Node ID itself is out of range (due to the use of an array):
> > if (nodenr >= MAX_NODES) {
> > fprintf(stderr, "qemu: invalid NUMA nodeid: %llu\n", nodenr);
> > exit(1);
> > }
> >
> > But that doesn't prevent one from having *sparse* NUMA node IDs. And, as
> > far as I can tell, this is allowed by the spec, but isn't properly
> > supported by qemu.
>
> Wait, is the node ID visible to the guest at all? I believe it is a
> QEMU-internal thing, just to allow the NUMA nodes to be ordered in the
> command-line. I would even claim that the parameter is useless and
> shouldn't have been introduced in the first place.
nodeid's show up in the topology to Linux, at least on ppc.
Thanks,
Nish
next prev parent reply other threads:[~2014-06-17 0:17 UTC|newest]
Thread overview: 45+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-06-16 7:53 [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
2014-06-16 7:53 ` [Qemu-devel] [PATCH 1/7] spapr: Move DT memory node rendering to a helper Alexey Kardashevskiy
2014-06-16 7:53 ` [Qemu-devel] [PATCH 2/7] spapr: Use DT memory node rendering helper for other nodes Alexey Kardashevskiy
2014-06-16 7:53 ` [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory() Alexey Kardashevskiy
2014-06-18 5:04 ` Alexey Kardashevskiy
2014-06-20 19:10 ` Nishanth Aravamudan
2014-06-21 3:08 ` Alexey Kardashevskiy
2014-06-23 17:41 ` Nishanth Aravamudan
2014-06-23 22:02 ` Alexey Kardashevskiy
2014-06-20 22:55 ` Nishanth Aravamudan
2014-06-21 3:06 ` Alexey Kardashevskiy
2014-06-23 17:40 ` Nishanth Aravamudan
2014-06-24 6:07 ` Alexey Kardashevskiy
2014-06-24 17:07 ` Nishanth Aravamudan
2014-06-24 3:08 ` Nishanth Aravamudan
2014-06-24 6:14 ` Alexey Kardashevskiy
2014-06-24 17:01 ` Nishanth Aravamudan
2014-07-21 18:08 ` Nishanth Aravamudan
2014-06-16 7:53 ` [Qemu-devel] [PATCH 4/7] spapr: Split memory nodes to power-of-two blocks Alexey Kardashevskiy
2014-06-17 7:07 ` Alexey Kardashevskiy
2014-06-16 7:53 ` [Qemu-devel] [PATCH 5/7] spapr: Add a helper for node0_size calculation Alexey Kardashevskiy
2014-06-16 18:43 ` Nishanth Aravamudan
2014-06-16 7:53 ` [Qemu-devel] [PATCH 6/7] spapr: Fix ibm, associativity for memory nodes Alexey Kardashevskiy
2014-06-16 7:53 ` [Qemu-devel] [PATCH 7/7] numa: Allow empty nodes Alexey Kardashevskiy
2014-06-16 16:15 ` Eduardo Habkost
2014-06-16 18:49 ` Nishanth Aravamudan
2014-06-16 20:11 ` Eduardo Habkost
2014-06-16 20:31 ` Eduardo Habkost
2014-06-17 0:21 ` Nishanth Aravamudan
2014-06-17 0:16 ` Nishanth Aravamudan [this message]
2014-06-16 8:16 ` [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
2014-06-16 18:26 ` Nishanth Aravamudan
2014-06-16 20:51 ` Eduardo Habkost
2014-06-17 0:25 ` Nishanth Aravamudan
2014-06-17 1:37 ` Eduardo Habkost
2014-06-17 18:36 ` Nishanth Aravamudan
2014-06-17 1:41 ` Eduardo Habkost
2014-06-17 18:37 ` Nishanth Aravamudan
2014-06-17 5:51 ` Alexey Kardashevskiy
2014-06-17 14:07 ` Eduardo Habkost
2014-06-17 18:38 ` Nishanth Aravamudan
2014-06-17 19:22 ` Eduardo Habkost
2014-06-18 18:28 ` Nishanth Aravamudan
2014-06-18 19:33 ` Eduardo Habkost
2014-06-18 23:58 ` Nishanth Aravamudan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140617001649.GJ16644@linux.vnet.ibm.com \
--to=nacc@linux.vnet.ibm.com \
--cc=agraf@suse.de \
--cc=aik@ozlabs.ru \
--cc=ehabkost@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=qemu-ppc@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.