* [PATCH] numa: fix problems with memory-less nodes
@ 2010-01-12 16:30 Andre Przywara
  2010-01-13  8:26 ` Keir Fraser
From: Andre Przywara @ 2010-01-12 16:30 UTC
  To: Keir Fraser; +Cc: xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 957 bytes --]

Hi,

If we have decided not to report memory-less nodes in physinfo, we should 
also skip them in the node_to_{cpu,memory,dma32_mem} Python lists. Currently 
Xen will not start guests on machines where the memory-less nodes are not 
the last ones. On an 8-node machine with empty nodes 4 and 5, "xm info" 
reported wrong data, and the node assignment algorithm crashed with a 
division-by-zero error.
The attached patch fixes this by skipping empty nodes in the enumeration 
of resources.

Regards,
Andre.


Signed-off-by: Andre Przywara <andre.przywara@amd.com>

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448 3567 12
----to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Andrew Bowd; Thomas M. McCoy; Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

[-- Attachment #2: physinfo_empty_numa_nodes.patch --]
[-- Type: text/x-patch, Size: 1134 bytes --]

diff -r 0e3910f1de64 tools/python/xen/lowlevel/xc/xc.c
--- a/tools/python/xen/lowlevel/xc/xc.c	Tue Jan 12 07:17:40 2010 +0000
+++ b/tools/python/xen/lowlevel/xc/xc.c	Tue Jan 12 17:28:55 2010 +0100
@@ -1129,20 +1129,23 @@
             Py_DECREF(pyint);
             node_exists = 1;
         }
-        PyList_Append(node_to_cpu_obj, cpus); 
+        if (node_exists)
+            PyList_Append(node_to_cpu_obj, cpus); 
         Py_DECREF(cpus);
 
         /* Memory. */
         xc_availheap(self->xc_handle, 0, 0, i, &free_heap);
         node_exists = node_exists || (free_heap != 0);
         pyint = PyInt_FromLong(free_heap / 1024);
-        PyList_Append(node_to_memory_obj, pyint);
+        if (node_exists)
+            PyList_Append(node_to_memory_obj, pyint);
         Py_DECREF(pyint);
 
         /* DMA memory. */
         xc_availheap(self->xc_handle, 0, 32, i, &free_heap);
         pyint = PyInt_FromLong(free_heap / 1024);
-        PyList_Append(node_to_dma32_mem_obj, pyint);
+        if (node_exists)
+            PyList_Append(node_to_dma32_mem_obj, pyint);
         Py_DECREF(pyint);
 
         if ( node_exists )


* Re: [PATCH] numa: fix problems with memory-less nodes
  2010-01-12 16:30 [PATCH] numa: fix problems with memory-less nodes Andre Przywara
@ 2010-01-13  8:26 ` Keir Fraser
  2010-01-13  9:42   ` Andre Przywara
From: Keir Fraser @ 2010-01-13  8:26 UTC
  To: Andre Przywara; +Cc: xen-devel@lists.xensource.com

On 12/01/2010 16:30, "Andre Przywara" <andre.przywara@amd.com> wrote:

> If we have decided not to report memory-less nodes in physinfo, we should
> also skip them in the node_to_{cpu,memory,dma32_mem} Python lists. Currently
> Xen will not start guests on machines where the memory-less nodes are not
> the last ones. On an 8-node machine with empty nodes 4 and 5, "xm info"
> reported wrong data, and the node assignment algorithm crashed with a
> division-by-zero error.
> The attached patch fixes this by skipping empty nodes in the enumeration
> of resources.

Where to begin? Firstly, I thought that the ordering of nodes in the
node_to_* lists actually mattered -- the lists are indexed by nodeid (a
handle which can be passed to other Xen interfaces), are they not? If you
don't include empty entries, then the index positions of the entries are no
longer meaningful.
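
Roughly what I mean, as a toy Python sketch (hypothetical consumer code,
not the actual xend implementation):

# The list index is taken to be the node id, so entry i must describe node i.
node_to_cpu = [[0, 1], [2, 3], [], [4, 5]]    # node 2 is cpu-less

def cpus_of(nodeid):
    # Works only as long as empty nodes keep an (empty) placeholder entry.
    return node_to_cpu[nodeid]

print(cpus_of(3))                             # [4, 5]

# If the empty entry is skipped, the list shrinks to [[0, 1], [2, 3], [4, 5]]:
# cpus_of(3) now raises IndexError, and cpus_of(2) silently returns node 3's
# CPUs.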

Secondly, you avoid appending to the node_to_cpu list if the node is
cpu-less. But you avoid appending to the node_to_{memory,dma32} lists only
if the node is *both* cpu-less and memory-less. That's not even consistent.

Please just fix the crap Python code.

 -- Keir


* Re: [PATCH] numa: fix problems with memory-less nodes
  2010-01-13  8:26 ` Keir Fraser
@ 2010-01-13  9:42   ` Andre Przywara
  2010-01-13  9:55     ` Keir Fraser
  2010-01-13 10:02     ` Keir Fraser
From: Andre Przywara @ 2010-01-13  9:42 UTC
  To: Keir Fraser; +Cc: xen-devel@lists.xensource.com

Keir Fraser wrote:
> On 12/01/2010 16:30, "Andre Przywara" <andre.przywara@amd.com> wrote:
> 
>> If we have decided not to report memory-less nodes in physinfo, we should
>> also skip them in the node_to_{cpu,memory,dma32_mem} Python lists. Currently
>> Xen will not start guests on machines where the memory-less nodes are not
>> the last ones. On an 8-node machine with empty nodes 4 and 5, "xm info"
>> reported wrong data, and the node assignment algorithm crashed with a
>> division-by-zero error.
>> The attached patch fixes this by skipping empty nodes in the enumeration
>> of resources.
> 
> Where to begin? Firstly, I thought that the ordering of nodes in the
> node_to_* lists actually mattered -- the lists are indexed by nodeid (a
> handle which can be passed to other Xen interfaces), are they not? If you
> don't include empty entries, then the index positions of the entries are no
> longer meaningful.
OK, that seems to be an issue.
To be honest I am not a fan of omitting nodes from physinfo, but that is 
what the current code (RC1!) does and it definitely breaks Xen on my 
box. So I just made this small patch to make it work again.
Actually I would opt to revert the patch cropping the number of nodes 
reported by physinfo (20762:a1d0a575b4ba ?). Yes, that would result in 
nodes reported with zero memory, but in my tests this did not raise 
problems, as a node's memory can (and will) be exhausted even during 
normal operation.
To illustrate the problem:
My box has 8 nodes; I removed the memory from nodes 4 and 5.
With the unpatched version, xm info says:
total_memory           : 73712
free_memory            : 70865
node_to_cpu            : node0:0-5,24-35
                          node1:6-11
                          node2:12-17
                          node3:18-23
                          node4:no cpus
                          node5:no cpus
node_to_memory         : node0:14267
                          node1:8167
                          node2:16335
                          node3:8167
                          node4:0
                          node5:0
So this listing completely omits the last two nodes (CPUs 36-47 and the 
24 GB connected to them). The Xen-internal listing triggered by the debug 
key is correct, though:
(XEN) idx0 -> NODE0 start->0 size->4423680
(XEN) phys_to_nid(0000000000001000) -> 0 should be 0
(XEN) idx1 -> NODE1 start->4423680 size->2097152
(XEN) phys_to_nid(0000000438001000) -> 1 should be 1
(XEN) idx2 -> NODE2 start->6520832 size->4194304
(XEN) phys_to_nid(0000000638001000) -> 2 should be 2
(XEN) idx3 -> NODE3 start->10715136 size->2097152
(XEN) phys_to_nid(0000000a38001000) -> 3 should be 3
(XEN) idx6 -> NODE6 start->12812288 size->4194304
(XEN) phys_to_nid(0000000c38001000) -> 6 should be 6
(XEN) idx7 -> NODE7 start->17006592 size->2097152
(XEN) phys_to_nid(0000001038001000) -> 7 should be 7
With the patched xc.so, xm info reports:
node_to_cpu            : node0:0-5,24-35
                          node1:6-11
                          node2:12-17
                          node3:18-23
                          node4:36-41
                          node5:42-47
node_to_memory         : node0:14267
                          node1:8167
                          node2:16335
                          node3:8167
                          node4:16335
                          node5:7590

Although memory-less nodes are not very common, they can occur with our 
new dual-node processor, where one could (even accidentally) forget to 
populate certain memory slots, as it in fact has a dual-node, dual-channel 
memory interface.

> Secondly, you avoid appending to the node_to_cpu list if the node is
> cpu-less. But you avoid appending to the node_to_{memory,dma32} lists only
> if the node is *both* cpu-less and memory-less. That's not even consistent.
OK, that's a fair point. I see that the value of node_exists can change 
between the two checks.
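
To spell that out (a rough, self-contained Python sketch of the patch's
control flow, not the actual xc.c code):

cpus = []                    # CPUs found on this node (none here)
free_heap = 16335            # free memory on this node (non-zero here)
node_to_cpu, node_to_memory = [], []

node_exists = len(cpus) > 0                   # derived from the CPU loop
if node_exists:                               # cpu-less: no CPU entry appended
    node_to_cpu.append(cpus)

node_exists = node_exists or free_heap != 0   # OR'ed with the memory check
if node_exists:                               # cpu-less node with memory still
    node_to_memory.append(free_heap)          # gets a memory entry

# node_to_cpu and node_to_memory now have different lengths, which is the
# inconsistency you point out.
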
> Please just fix the crap Python code.
Which part exactly do you mean? The part that triggers the division by zero?

I will see if I can fix this properly.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448 3567 12
----to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Andrew Bowd; Thomas M. McCoy; Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632


* Re: [PATCH] numa: fix problems with memory-less nodes
  2010-01-13  9:42   ` Andre Przywara
@ 2010-01-13  9:55     ` Keir Fraser
  2010-01-13 10:02     ` Keir Fraser
From: Keir Fraser @ 2010-01-13  9:55 UTC
  To: Andre Przywara; +Cc: xen-devel@lists.xensource.com

On 13/01/2010 09:42, "Andre Przywara" <andre.przywara@amd.com> wrote:

> To be honest I am not a fan of omitting nodes from physinfo, but that is
> what the current code (RC1!) does and it definitely breaks Xen on my
> box. So I just made this small patch to make it work again.
> Actually I would opt to revert the patch cropping the number of nodes
> reported by physinfo (20762:a1d0a575b4ba ?). Yes, that would result in
> nodes reported with zero memory, but in my tests this did not raise
> problems, as a node's memory can (and will) be exhausted even during
> normal operation.

The intention of 20762 was not to change the semantics of the node_to_*
lists. It's simply supposed to make available max_node_id to the toolstack
(since this can differ from nr_nodes if there are holes in the online node
map).
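
For illustration (made-up numbers, and just a sketch of what a toolstack
could do with that value -- not actual xend code):

online_nodes = [0, 1, 2, 3, 6, 7]     # nodes 4 and 5 are holes in the map

nr_nodes    = len(online_nodes)       # 6
max_node_id = max(online_nodes)       # 7

# A toolstack that wants list index == nodeid has to size its lists by
# max_node_id and keep placeholder entries for the holes:
node_to_cpu = [[] for node in range(max_node_id + 1)]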

If the node_to_* semantics really have been changed by 20762, then it is a
bug. It's not a bug I can eyeball, however.

 -- Keir


* Re: [PATCH] numa: fix problems with memory-less nodes
  2010-01-13  9:42   ` Andre Przywara
  2010-01-13  9:55     ` Keir Fraser
@ 2010-01-13 10:02     ` Keir Fraser
From: Keir Fraser @ 2010-01-13 10:02 UTC
  To: Andre Przywara; +Cc: xen-devel@lists.xensource.com

On 13/01/2010 09:42, "Andre Przywara" <andre.przywara@amd.com> wrote:

> node_to_cpu            : node0:0-5,24-35
>                           node1:6-11
>                           node2:12-17
>                           node3:18-23
>                           node4:no cpus
>                           node5:no cpus
> node_to_memory         : node0:14267
>                           node1:8167
>                           node2:16335
>                           node3:8167
>                           node4:0
>                           node5:0

To be clear: I certainly agree that node_to_cpu[4,5] should list the CPUs
that are present, even if the nodes are memory-less. But you tell me how
20762 has changed this behaviour.

If it was changed by one of Intel's slightly earlier changesets (which is
possible), I'm happy to see it changed back. The new node_to_cpu[] behaviour
you are observing is obviously stupid, and I assume unintended.

 -- Keir

