* Memory allocation in NUMA system
@ 2008-07-25  3:34 Yang, Xiaowei
  2008-07-25  6:53 ` Keir Fraser
  0 siblings, 1 reply; 12+ messages in thread
From: Yang, Xiaowei @ 2008-07-25  3:34 UTC (permalink / raw)
  To: xen-devel

Currently, when alloc_domheap_pages() is called with no range specified
(which is the case when allocating domain memory during domain creation), it
allocates memory in the following priority order:
1) current node, > DMA range (2^dma_bitsize = 4G)
2) other nodes, > DMA range
3) current node, any range
4) other nodes, any range

Let's say we have a 2-node system, with node0's and node1's memory ranges
being 0-0xc0000000 (<4G) and 0x100000000-0x1c0000000 (>4G) respectively.
In that case, node1's memory is always preferred for domain memory
allocation, no matter which node the created domain is pinned to. This
results in a performance penalty for domains pinned to node0, since all
of their memory is remote.
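
For illustration, the four-step order above can be modelled in a few lines of C. This is only a sketch, not the code in xen/common/page_alloc.c: struct node_range, has_memory_above_dma(), has_memory() and pick_node() are hypothetical names, and the model assumes that every node still has all of its memory free.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one NUMA node: a physical address range. */
struct node_range {
    uint64_t start;  /* first byte of the node's memory */
    uint64_t end;    /* one past the last byte of the node's memory */
};

/* True if the node has memory at or above 2^dma_bitsize (dma_bitsize < 64). */
static bool has_memory_above_dma(const struct node_range *n,
                                 unsigned int dma_bitsize)
{
    return n->end > ((uint64_t)1 << dma_bitsize);
}

/* True if the node has any memory at all. */
static bool has_memory(const struct node_range *n)
{
    return n->end > n->start;
}

/*
 * Pick the node an allocation would come from, following the four-step
 * order described in the mail. Returns a node index, or -1 if the
 * arguments are invalid or no node has memory.
 */
static int pick_node(const struct node_range *nodes, int nr_nodes,
                     int current, unsigned int dma_bitsize)
{
    int i;

    if (nr_nodes <= 0 || current < 0 || current >= nr_nodes)
        return -1;

    /* 1) current node, above the DMA range */
    if (has_memory_above_dma(&nodes[current], dma_bitsize))
        return current;

    /* 2) other nodes, above the DMA range */
    for (i = 0; i < nr_nodes; i++)
        if (i != current && has_memory_above_dma(&nodes[i], dma_bitsize))
            return i;

    /* 3) current node, any range */
    if (has_memory(&nodes[current]))
        return current;

    /* 4) other nodes, any range */
    for (i = 0; i < nr_nodes; i++)
        if (i != current && has_memory(&nodes[i]))
            return i;

    return -1;
}
```

With the two-node layout from the example and a dma_bitsize of 32, pick_node() returns node1 whether the domain runs on node0 or node1, because node0 has no memory above 4G. With a dma_bitsize of 30 it returns the domain's own node in both cases.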

One possible fix is to specify the whole range for domain memory
allocation, so that local memory is preferred. To limit the impact, this
change could be restricted to domains pinned to a single node.

One side effect is that less DMA memory may remain available, which hurts
device domains. This can be addressed by reserving node0 to be used last.

Comments?

Thanks,
Xiaowei


* Re: Memory allocation in NUMA system
  2008-07-25  3:34 Memory allocation in NUMA system Yang, Xiaowei
@ 2008-07-25  6:53 ` Keir Fraser
  2008-07-25  7:22   ` Yang, Xiaowei
  0 siblings, 1 reply; 12+ messages in thread
From: Keir Fraser @ 2008-07-25  6:53 UTC (permalink / raw)
  To: Yang, Xiaowei, xen-devel

On 25/7/08 04:34, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:

> Let's say we have a 2-node system, with node0 and node1's memory range
> being 0-0xc0000000 (<4G) and 0x100000000-0x1c0000000 (>4G) respectively.
> In that case, node1's memory is always preferred for domain memory
> allocation, no matter which node the created domain is pinned to. It
> results in performance penalty.
> 
> One possible fix is to specify all range for the domain memory
> allocation, which means local memory is preferred. This change may be
> restricted only to the domain pinned to one node for less impact.
> 
> One side effect is that the DMA memory size may be smaller, which makes
> device domain unhappy. This can be addressed by reserving node0 to be
> used lastly.

Doesn't your solution amount to what we already do, for the 2-node example?
i.e., node0 would not be chosen until node1 is exhausted?

 -- Keir


* Re: Memory allocation in NUMA system
  2008-07-25  6:53 ` Keir Fraser
@ 2008-07-25  7:22   ` Yang, Xiaowei
  2008-07-25  7:27     ` Keir Fraser
  0 siblings, 1 reply; 12+ messages in thread
From: Yang, Xiaowei @ 2008-07-25  7:22 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

Keir Fraser wrote:
> On 25/7/08 04:34, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:
> 
>  > Let's say we have a 2-node system, with node0 and node1's memory range
>  > being 0-0xc0000000 (<4G) and 0x100000000-0x1c0000000 (>4G) respectively.
>  > In that case, node1's memory is always preferred for domain memory
>  > allocation, no matter which node the created domain is pinned to. It
>  > results in performance penalty.
>  >
>  > One possible fix is to specify all range for the domain memory
>  > allocation, which means local memory is preferred. This change may be
>  > restricted only to the domain pinned to one node for less impact.
>  >
>  > One side effect is that the DMA memory size may be smaller, which makes
>  > device domain unhappy. This can be addressed by reserving node0 to be
>  > used lastly.
> 
> Doesn't your solution amount to what we already do, for the 2-node example?
> i.e., node0 would not be chosen until node1 is exhausted?
> 
Oh, what I mean is:
With the above possible fix, the domain memory is allocated from the
node the domain is pinned to. As node0's memory is precious for DMA, it's
suggested to pin VMs to other nodes first.

And for non-pinned VM, we can stick to the original method.

Thanks,
Xiaowei


* Re: Memory allocation in NUMA system
  2008-07-25  7:22   ` Yang, Xiaowei
@ 2008-07-25  7:27     ` Keir Fraser
  2008-07-25  7:51       ` Yang, Xiaowei
  0 siblings, 1 reply; 12+ messages in thread
From: Keir Fraser @ 2008-07-25  7:27 UTC (permalink / raw)
  To: Yang, Xiaowei; +Cc: xen-devel

On 25/7/08 08:22, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:

>> Doesn't your solution amount to what we already do, for the 2-node example?
>> i.e., node0 would not be chosen until node1 is exhausted?
>> 
> Oh, what I mean is:
> With the above possible fix, the domain memory is allocated from the
> node it pinned to. As node0's memory is precious for DMA, it's suggested
> to pin VMs to other nodes firstly.
> 
> And for non-pinned VM, we can stick to the original method.

How about by default we guarantee no more than 25% of a node's memory is
classed as 'DMA memory', and we reduce the DMA address width variable in Xen
to ensure that?

So, in your example, we would reduce dma_bitsize to 30.
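
As a worked check of the numbers in this example, here is a small sketch in C. The helper names floor_log2(), dma_bits_round_down() and dma_bits_round_up() are hypothetical and not part of Xen, and the thread does not say which rounding is intended. Node0 spans 0xc0000000 bytes (3G), so a quarter of it is 768M: rounding down to a power of two gives 29 bits (512M), and rounding up gives 30 bits (1G), which is the value quoted above.

```c
#include <stdint.h>

/* floor(log2(x)) for x > 0; returns 0 for x == 0. */
static unsigned int floor_log2(uint64_t x)
{
    unsigned int bits = 0;

    while (x >>= 1)
        bits++;
    return bits;
}

/* Largest width such that 2^width does not exceed a quarter of the node. */
static unsigned int dma_bits_round_down(uint64_t node_bytes)
{
    return floor_log2(node_bytes / 4);
}

/* Smallest width such that 2^width covers a quarter of the node. */
static unsigned int dma_bits_round_up(uint64_t node_bytes)
{
    uint64_t quarter = node_bytes / 4;
    unsigned int bits = floor_log2(quarter);

    return (((uint64_t)1 << bits) < quarter) ? bits + 1 : bits;
}
```

For a node of 0xc0000000 bytes, dma_bits_round_down() gives 29 and dma_bits_round_up() gives 30. A width of 30 therefore classes 1G of node0's 3G as DMA memory, which is slightly more than 25%.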

 -- Keir


* Re: Memory allocation in NUMA system
  2008-07-25  7:27     ` Keir Fraser
@ 2008-07-25  7:51       ` Yang, Xiaowei
  2008-07-25  7:55         ` Keir Fraser
  0 siblings, 1 reply; 12+ messages in thread
From: Yang, Xiaowei @ 2008-07-25  7:51 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

Keir Fraser wrote:
> On 25/7/08 08:22, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:
> 
>  >> Doesn't your solution amount to what we already do, for the 2-node example?
>  >> i.e., node0 would not be chosen until node1 is exhausted?
>  >>
>  > Oh, what I mean is:
>  > With the above possible fix, the domain memory is allocated from the
>  > node it pinned to. As node0's memory is precious for DMA, it's suggested
>  > to pin VMs to other nodes firstly.
>  >
>  > And for non-pinned VM, we can stick to the original method.
> 
> How about by default we guarantee no more than 25% of a node's memory is
> classed as 'DMA memory', and we reduce the DMA address width variable in Xen
> to ensure that?
> 
> So, in your example, we would reduce dma_bitsize to 30.
> 
>  -- Keir
> 
> 
Yes, a good suggestion!

Thanks,
Xiaowei


* Re: Memory allocation in NUMA system
  2008-07-25  7:51       ` Yang, Xiaowei
@ 2008-07-25  7:55         ` Keir Fraser
  2008-07-25 10:26           ` Yang, Xiaowei
  0 siblings, 1 reply; 12+ messages in thread
From: Keir Fraser @ 2008-07-25  7:55 UTC (permalink / raw)
  To: Yang, Xiaowei; +Cc: xen-devel

On 25/7/08 08:51, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:

>> How about by default we guarantee no more than 25% of a node's memory is
>> classed as 'DMA memory', and we reduce the DMA address width variable in Xen
>> to ensure that?
>> 
>> So, in your example, we would reduce dma_bitsize to 30.
>> 
>>  -- Keir
>> 
>> 
> Yes, a good suggestion!

Indeed the only reason we still have dma_bitsize is to break the
select-NUMA-node-first memory allocation search strategy. So tweaking the
dma_bitsize approach further to strike the correct NUMA-vs-DMA balance does
seem the right thing to do. Feel free to work up a patch.

 -- Keir


* Re: Memory allocation in NUMA system
  2008-07-25  7:55         ` Keir Fraser
@ 2008-07-25 10:26           ` Yang, Xiaowei
  2008-07-25 12:56             ` Keir Fraser
  0 siblings, 1 reply; 12+ messages in thread
From: Yang, Xiaowei @ 2008-07-25 10:26 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

Keir Fraser wrote:
> On 25/7/08 08:51, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:
> 
>  >> How about by default we guarantee no more than 25% of a node's memory is
>  >> classed as 'DMA memory', and we reduce the DMA address width variable in Xen
>  >> to ensure that?
>  >>
>  >> So, in your example, we would reduce dma_bitsize to 30.
>  >>
>  >>  -- Keir
>  >>
>  >>
>  > Yes, a good suggestion!
> 
> Indeed the only reason we still have dma_bitsize is to break the
> select-NUMA-node-first memory allocation search strategy. So tweaking the
> dma_bitsize approach further to strike the correct NUMA-vs-DMA balance does
> seem the right thing to do. Feel free to work up a patch.
> 
>  -- Keir
> 
> 
> 
How about this one?

diff -r 63317b6c3eab xen/common/page_alloc.c
--- a/xen/common/page_alloc.c	Mon Jul 14 15:21:03 2008 +0100
+++ b/xen/common/page_alloc.c	Fri Jul 25 18:24:16 2008 +0800
@@ -55,7 +55,7 @@
  /*
   * Bit width of the DMA heap.
   */
-static unsigned int dma_bitsize = CONFIG_DMA_BITSIZE;
+static unsigned int dma_bitsize;
  static void __init parse_dma_bits(char *s)
  {
      unsigned int v = simple_strtol(s, NULL, 0);
@@ -583,6 +583,16 @@
              init_heap_pages(pfn_dom_zone_type(i), mfn_to_page(i), 1);
      }

+    /* Reserve up to 25% of node0's memory for DMA */
+    if ( dma_bitsize == 0 )
+    {
+        dma_bitsize = pfn_dom_zone_type(NODE_DATA(0)->node_spanned_pages / 4)
+                      + PAGE_SHIFT;
+
+        ASSERT(dma_bitsize <= BITS_PER_LONG + PAGE_SHIFT);
+        ASSERT(dma_bitsize > PAGE_SHIFT + 1);
+    }
+
      printk("Domain heap initialised: DMA width %u bits\n", dma_bitsize);
  }
  #undef avail_for_domheap
diff -r 63317b6c3eab xen/include/asm-x86/config.h
--- a/xen/include/asm-x86/config.h	Mon Jul 14 15:21:03 2008 +0100
+++ b/xen/include/asm-x86/config.h	Fri Jul 25 18:24:16 2008 +0800
@@ -96,8 +96,6 @@

  /* Primary stack is restricted to 8kB by guard pages. */
  #define PRIMARY_STACK_SIZE 8192
-
-#define CONFIG_DMA_BITSIZE 32

  #define BOOT_TRAMPOLINE 0x8c000
  #define bootsym_phys(sym)                                 \


Thanks,
Xiaowei


* Re: Memory allocation in NUMA system
  2008-07-25 10:26           ` Yang, Xiaowei
@ 2008-07-25 12:56             ` Keir Fraser
  2008-07-28 12:21               ` Andre Przywara
  0 siblings, 1 reply; 12+ messages in thread
From: Keir Fraser @ 2008-07-25 12:56 UTC (permalink / raw)
  To: Yang, Xiaowei; +Cc: xen-devel

On 25/7/08 11:26, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:

>> Indeed the only reason we still have dma_bitsize is to break the
>> select-NUMA-node-first memory allocation search strategy. So tweaking the
>> dma_bitsize approach further to strike the correct NUMA-vs-DMA balance does
>> seem the right thing to do. Feel free to work up a patch.
>> 
>>  -- Keir
>> 
>> 
>> 
> How about this one?

Hmmm.. something like that. Let's wait until 3.4 development opens to get
this checked in.

 -- Keir


* Re: Memory allocation in NUMA system
  2008-07-25 12:56             ` Keir Fraser
@ 2008-07-28 12:21               ` Andre Przywara
  2008-07-28 12:38                 ` Keir Fraser
  0 siblings, 1 reply; 12+ messages in thread
From: Andre Przywara @ 2008-07-28 12:21 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel, Yang, Xiaowei

Keir Fraser wrote:
> On 25/7/08 11:26, "Yang, Xiaowei" <xiaowei.yang@intel.com> wrote:
> 
>>> Indeed the only reason we still have dma_bitsize is to break the
>>> select-NUMA-node-first memory allocation search strategy. So tweaking the
>>> dma_bitsize approach further to strike the correct NUMA-vs-DMA balance does
>>> seem the right thing to do. Feel free to work up a patch.
>>>
>>>  -- Keir
>>>
>> How about this one?
> 
> Hmmm.. something like that. Let's wait until 3.4 development opens to get
> this checked in.
Mmh, why not check this in for 3.3? I noticed this problem a year ago
and had a different kind of fix for it (which actually preferred nodes
over zones):
http://lists.xensource.com/archives/html/xen-devel/2007-12/msg00831.html
I think this is a somewhat serious issue on NUMA machines, since with
the automatic pinning now active (new in 3.3!) many domains will end up
with remote memory _all the time_. So I think of this as a bugfix.
Actually I have had dma_bitsize=30 hardwired in my Grub's menu.lst for
some months now...


Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917


* Re: Memory allocation in NUMA system
  2008-07-28 12:21               ` Andre Przywara
@ 2008-07-28 12:38                 ` Keir Fraser
  2008-07-28 14:26                   ` Andre Przywara
  0 siblings, 1 reply; 12+ messages in thread
From: Keir Fraser @ 2008-07-28 12:38 UTC (permalink / raw)
  To: Andre Przywara; +Cc: xen-devel, Yang, Xiaowei

On 28/7/08 13:21, "Andre Przywara" <andre.przywara@amd.com> wrote:

> Mmh, why not check this in in 3.3? I have noticed this problem already a
> year ago and was having some other kind of fix for it (which actually
> prefered nodes over zones):
> http://lists.xensource.com/archives/html/xen-devel/2007-12/msg00831.html
> I think this is a somewhat serious issue on NUMA machines, since with
> the automatic pinning now active (new in 3.3!) many domains will end up
> with remote memory _all the time_. So I think of this as a bugfix.
> Actually I have dma_bitsize=30 hardwired in my Grub's menu.lst for some
> months now...

Well, fine, but unfortunately the patch breaks ia64 and doesn't even work
properly:
 - why should NUMA node 0 be the one that overlaps with default DMA memory?
 - a 'large' NUMA node 0 will cause dma_bitsize to be set much larger than
it is currently, thus breaking its original intent.

 -- Keir


* Re: Memory allocation in NUMA system
  2008-07-28 12:38                 ` Keir Fraser
@ 2008-07-28 14:26                   ` Andre Przywara
  2008-07-28 14:53                     ` Keir Fraser
  0 siblings, 1 reply; 12+ messages in thread
From: Andre Przywara @ 2008-07-28 14:26 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel, Yang, Xiaowei

[-- Attachment #1: Type: text/plain, Size: 2067 bytes --]

Keir Fraser wrote:
> On 28/7/08 13:21, "Andre Przywara" <andre.przywara@amd.com> wrote:
> 
>> Mmh, why not check this in in 3.3? I have noticed this problem already a
>> year ago and was having some other kind of fix for it (which actually
>> prefered nodes over zones):
>> http://lists.xensource.com/archives/html/xen-devel/2007-12/msg00831.html
>> I think this is a somewhat serious issue on NUMA machines, since with
>> the automatic pinning now active (new in 3.3!) many domains will end up
>> with remote memory _all the time_. So I think of this as a bugfix.
>> Actually I have dma_bitsize=30 hardwired in my Grub's menu.lst for some
>> months now...
> 
> Well, fine, but unfortunately the patch breaks ia64
Fixed.
> and doesn't even work properly:
>  - why should NUMA node 0 be the one that overlaps with default DMA memory?
Because that is the most common configuration? Do you know of any
machine where this is not true? I agree that a dual-node machine with 2
gig on each node does not need this patch, but NUMA machines tend to
have more memory than this (especially given the current memory costs).
I changed the default DMA_BITSIZE to 30 bits, which seems to be a
reasonable value.
>  - a 'large' NUMA node 0 will cause dma_bitsize to be set much larger than
> it is currently, thus breaking its original intent.
Fixed in the attached patch. It now caps dma_bitsize to at most 1/4 of 
node0 memory.
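
To make the difference between the two patch versions concrete, here is a sketch in C. It is not the patch code: quarter_width(), derive_uncapped() and derive_capped() are hypothetical names, the sketch works on byte counts instead of page counts, and it assumes a round-down to a power of two, whereas the real patch relies on pfn_dom_zone_type(), whose exact rounding is not shown in this thread.

```c
#include <stdint.h>

/* floor(log2(node0_bytes / 4)): width of a quarter of node0 (assumed rounding). */
static unsigned int quarter_width(uint64_t node0_bytes)
{
    uint64_t quarter = node0_bytes / 4;
    unsigned int bits = 0;

    while (quarter >>= 1)
        bits++;
    return bits;
}

/* First version: the width is derived from node0 alone, with no upper bound. */
static unsigned int derive_uncapped(uint64_t node0_bytes)
{
    return quarter_width(node0_bytes);
}

/* Revised version: never exceed the configured default width. */
static unsigned int derive_capped(unsigned int default_bits,
                                  uint64_t node0_bytes)
{
    unsigned int derived = quarter_width(node0_bytes);

    return (derived < default_bits) ? derived : default_bits;
}
```

For a 64G node0 the uncapped variant yields 34 bits, which is larger than the old default of 32 and is the case objected to above. The capped variant with a default of 30 stays at 30, and for the 3G node0 from the original example it drops to 29 under the rounding assumed here.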

What about using this patch for Xen 3.3 and working out a more general
solution for Xen 3.4?

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Based on the patch from: "Yang, Xiaowei" <xiaowei.yang@intel.com>

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
----to satisfy European Law for business letters:
AMD Saxony Limited Liability Company & Co. KG,
Wilschdorfer Landstr. 101, 01109 Dresden, Germany
Register Court Dresden: HRA 4896, General Partner authorized
to represent: AMD Saxony LLC (Wilmington, Delaware, US)
General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy

[-- Attachment #2: dma_bitsize.patch --]
[-- Type: text/plain, Size: 1434 bytes --]

diff -r 37fae02cc335 xen/common/page_alloc.c
--- a/xen/common/page_alloc.c	Fri Jul 25 15:03:03 2008 +0100
+++ b/xen/common/page_alloc.c	Mon Jul 28 16:02:40 2008 +0200
@@ -583,6 +583,13 @@
             init_heap_pages(pfn_dom_zone_type(i), mfn_to_page(i), 1);
     }
 
+    /* Reserve only up to 25% of node0's memory for DMA */
+    i = pfn_dom_zone_type(NODE_DATA(0)->node_spanned_pages / 4)
+                          + PAGE_SHIFT;
+    if ( i < dma_bitsize ) dma_bitsize = i;
+
+    ASSERT(dma_bitsize > PAGE_SHIFT + 1);
+
     printk("Domain heap initialised: DMA width %u bits\n", dma_bitsize);
 }
 #undef avail_for_domheap
diff -r 37fae02cc335 xen/include/asm-ia64/config.h
--- a/xen/include/asm-ia64/config.h	Fri Jul 25 15:03:03 2008 +0100
+++ b/xen/include/asm-ia64/config.h	Mon Jul 28 16:02:40 2008 +0200
@@ -44,7 +44,7 @@
 #define CONFIG_IOSAPIC
 #define supervisor_mode_kernel (0)
 
-#define CONFIG_DMA_BITSIZE 32
+#define CONFIG_DMA_BITSIZE 30
 
 #define PADDR_BITS	48
 
diff -r 37fae02cc335 xen/include/asm-x86/config.h
--- a/xen/include/asm-x86/config.h	Fri Jul 25 15:03:03 2008 +0100
+++ b/xen/include/asm-x86/config.h	Mon Jul 28 16:02:40 2008 +0200
@@ -97,7 +97,7 @@
 /* Primary stack is restricted to 8kB by guard pages. */
 #define PRIMARY_STACK_SIZE 8192
 
-#define CONFIG_DMA_BITSIZE 32
+#define CONFIG_DMA_BITSIZE 30
 
 #define BOOT_TRAMPOLINE 0x8c000
 #define bootsym_phys(sym)                                 \


* Re: Memory allocation in NUMA system
  2008-07-28 14:26                   ` Andre Przywara
@ 2008-07-28 14:53                     ` Keir Fraser
  0 siblings, 0 replies; 12+ messages in thread
From: Keir Fraser @ 2008-07-28 14:53 UTC (permalink / raw)
  To: Andre Przywara; +Cc: xen-devel, Yang, Xiaowei

On 28/7/08 15:26, "Andre Przywara" <andre.przywara@amd.com> wrote:

> Because that is the most common configuration? Do you know of any
> machine where this is not true? I agree that a dual node machine with 2
> gig on each node does not need this patch, but NUMA machines tend to
> have more memory than this (especially given the current memory costs).
> I changed the default DMA_BITSIZE to 30 bits, this seems to be a
> reasonable value.

I'll take that bit then (the CONFIG_DMA_BITSIZE change). Sounds like it
suffices for all systems you care about.

 -- Keir


Thread overview: 12+ messages
2008-07-25  3:34 Memory allocation in NUMA system Yang, Xiaowei
2008-07-25  6:53 ` Keir Fraser
2008-07-25  7:22   ` Yang, Xiaowei
2008-07-25  7:27     ` Keir Fraser
2008-07-25  7:51       ` Yang, Xiaowei
2008-07-25  7:55         ` Keir Fraser
2008-07-25 10:26           ` Yang, Xiaowei
2008-07-25 12:56             ` Keir Fraser
2008-07-28 12:21               ` Andre Przywara
2008-07-28 12:38                 ` Keir Fraser
2008-07-28 14:26                   ` Andre Przywara
2008-07-28 14:53                     ` Keir Fraser
