Tmem vs order>0 allocation, workaround RFC

* Tmem vs order>0 allocation, workaround RFC
@ 2010-02-12 17:24 Dan Magenheimer
  2010-02-12 18:07 ` Dan Magenheimer
  2010-02-15  8:21 ` Keir Fraser
  0 siblings, 2 replies; 15+ messages in thread
From: Dan Magenheimer @ 2010-02-12 17:24 UTC (permalink / raw)
  To: Keir Fraser, xen-devel, Jan Beulich
  Cc: George Dunlap, kurt.hackel, Ian Pratt, Tim Deegan, Patrick Colp,
	Grzegorz Milos, Andrew Peace

I just had an idea for a workaround that might be low enough
impact to get in for 4.0 and allow tmem to be enabled by
default.  I think it will not eliminate the fragmentation
problem entirely, but would greatly reduce the probability
of it causing problems for domain creation/migration when tmem
is enabled, and possibly for the other memory utilization
features as well.

Simply, alloc_heap_pages would fail if total_avail_pages
is less than 1%(?) of the total memory on the system AND
the request is order==0.  Essentially, this is reserving
a "zone" for order>0 allocations.
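
A rough sketch of the check described above, not the actual patch: the
helper names reserve_pages and request_allowed are hypothetical, and the
1% figure is only the tentative value floated in this mail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: pages held back for order>0 requests (1% of total). */
long reserve_pages(long total_pages)
{
    assert(total_pages >= 0);
    return total_pages / 100;
}

/* Refuse order==0 requests once free memory dips into the reserve. */
bool request_allowed(unsigned int order, long avail_pages, long total_pages)
{
    if (order == 0 && avail_pages < reserve_pages(total_pages))
        return false;
    return true;
}
```

With 10000 total pages the reserve is 100 pages, so an order==0 request
is refused at 99 free pages while an order>0 request is still attempted.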

It could be tied to tmem_enabled but, as previously discussed,
even frequent ballooning can fragment memory and cause
problems for domain creation/migration.  And since, without
memory utilization features, it is highly unlikely that a
system will "accidentally" pack in enough domains to use
between 99% and 100% of physical memory anyway, always
enabling this restriction would affect very few systems.

Comments?  I'm not sure I've thought this all the way
through and certainly haven't tested it yet, but it
seems like it should be easy to implement in a low-impact
patch.

Thanks,
Dan

* RE: Tmem vs order>0 allocation, workaround RFC
  2010-02-12 17:24 Tmem vs order>0 allocation, workaround RFC Dan Magenheimer
@ 2010-02-12 18:07 ` Dan Magenheimer
  2010-02-15  8:21 ` Keir Fraser
  1 sibling, 0 replies; 15+ messages in thread
From: Dan Magenheimer @ 2010-02-12 18:07 UTC (permalink / raw)
  To: dan.magenheimer, Keir Fraser, xen-devel, Jan Beulich
  Cc: George Dunlap, kurt.hackel, Ian Pratt, Tim Deegan, Patrick Colp,
	Grzegorz Milos, Andrew Peace

> Simply, alloc_heap_pages would fail if total_avail_pages
> is less than 1%(?) of the total memory on the system AND
> the request is order==0.  Essentially, this is reserving
> a "zone" for order>0 allocations.

Avoid worst fragmentation issues by reserving a "zone"
of physical memory only for order>0 allocations.

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>

--- a/xen/common/page_alloc.c	Fri Feb 12 09:24:18 2010 +0000
+++ b/xen/common/page_alloc.c	Fri Feb 12 11:05:19 2010 -0700
@@ -223,6 +223,10 @@ static heap_by_zone_and_order_t *_heap[M
 
 static unsigned long *avail[MAX_NUMNODES];
 static long total_avail_pages;
+static long max_total_avail_pages; /* highwater mark */
+#define ORDER_NONZERO_FRAC 128
+static long order_nonzero_zonesize; /* reserved for order>0 allocations */
+
 
 static DEFINE_SPINLOCK(heap_lock);
 
@@ -304,6 +308,13 @@ static struct page_info *alloc_heap_page
     spin_lock(&heap_lock);
 
     /*
+       When available memory is scarce, allow only larger allocations
+       to avoid worst of fragmentation issues
+    */
+    if ( !order && (total_avail_pages <= order_nonzero_zonesize) )
+        goto fail;
+
+    /*
      * Start with requested node, but exhaust all node memory in requested 
      * zone before failing, only calc new node value if we fail to find memory 
      * in target node, this avoids needless computation on fast-path.
@@ -337,6 +348,7 @@ static struct page_info *alloc_heap_page
     }
 
     /* No suitable memory blocks. Fail the request. */
+fail:
     spin_unlock(&heap_lock);
     return NULL;
 
@@ -503,6 +515,11 @@ static void free_heap_pages(
 
     avail[node][zone] += 1 << order;
     total_avail_pages += 1 << order;
+    if ( total_avail_pages > max_total_avail_pages )
+    {
+        max_total_avail_pages = total_avail_pages;
+        order_nonzero_zonesize = max_total_avail_pages / ORDER_NONZERO_FRAC;
+    }
 
     /* Merge chunks as far as possible. */
     while ( order < MAX_ORDER )

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

* Re: Tmem vs order>0 allocation, workaround RFC
  2010-02-12 17:24 Tmem vs order>0 allocation, workaround RFC Dan Magenheimer
  2010-02-12 18:07 ` Dan Magenheimer
@ 2010-02-15  8:21 ` Keir Fraser
  2010-02-15 14:31   ` Dan Magenheimer
  1 sibling, 1 reply; 15+ messages in thread
From: Keir Fraser @ 2010-02-15  8:21 UTC (permalink / raw)
  To: Dan Magenheimer, xen-devel@lists.xensource.com, Jan Beulich
  Cc: George Dunlap, kurt.hackel@oracle.com, Ian Pratt, Tim Deegan,
	Patrick Colp, Grzegorz Milos, Andrew Peace

On 12/02/2010 17:24, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> I just had an idea for a workaround that might be low enough
> impact to get in for 4.0 and allow tmem to be enabled by
> default.  I think it will not eliminate the fragmentation
> problem entirely, but would greatly reduce the probability
> of it causing problems for domain creation/migration when tmem
> is enabled, and possibly for the other memory utilization
> features as well.
> 
> Simply, alloc_heap_pages would fail if total_avail_pages
> is less than 1%(?) of the total memory on the system AND
> the request is order==0.  Essentially, this is reserving
> a "zone" for order>0 allocations.

I don't see how that necessarily works. Pages can be allocated in order>0
chunks and freed order==0, so even that last 1% can get fragmented. For
example, guests get their memory allocated in 2MB chunks where possible; but
their balloon drivers may then free arbitrary 4kB pages within those chunks.
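
To illustrate the point with a toy model (a sketch only, not Xen's buddy
allocator; the per-page bitmap and the helper names are assumptions made
for this example): memory is tracked per 4kB page, and a 2MB chunk can
satisfy an order-9 request only if all 512 of its aligned pages are free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_CHUNK 512  /* 2MB chunk / 4kB page, i.e. order 9 */

/* Balloon-style free: release every stride-th 4kB page. */
void free_every_nth(bool *page_free, size_t nr_pages, size_t stride)
{
    assert(stride > 0);
    for (size_t i = 0; i < nr_pages; i += stride)
        page_free[i] = true;
}

/* Total number of free 4kB pages. */
size_t count_free_pages(const bool *page_free, size_t nr_pages)
{
    size_t n = 0;
    for (size_t i = 0; i < nr_pages; i++)
        if (page_free[i])
            n++;
    return n;
}

/* Number of aligned 2MB chunks whose 512 pages are all free. */
size_t count_free_chunks(const bool *page_free, size_t nr_pages)
{
    size_t chunks = 0;

    assert(nr_pages % PAGES_PER_CHUNK == 0);
    for (size_t base = 0; base < nr_pages; base += PAGES_PER_CHUNK) {
        bool all_free = true;
        for (size_t i = 0; i < PAGES_PER_CHUNK; i++) {
            if (!page_free[base + i]) {
                all_free = false;
                break;
            }
        }
        if (all_free)
            chunks++;
    }
    return chunks;
}
```

Starting from four fully allocated 2MB chunks and freeing every second
page leaves half the memory free, yet no whole 2MB chunk is available.
A free-page counter alone cannot distinguish that state from four
half-empty but contiguous regions.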

 -- Keir

* RE: Tmem vs order>0 allocation, workaround RFC
  2010-02-15  8:21 ` Keir Fraser
@ 2010-02-15 14:31   ` Dan Magenheimer
  2010-02-15 15:40     ` Keir Fraser
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Magenheimer @ 2010-02-15 14:31 UTC (permalink / raw)
  To: Keir Fraser, xen-devel, Jan Beulich
  Cc: George Dunlap, kurt.hackel, Ian Pratt, Tim Deegan, Patrick Colp,
	Grzegorz Milos, Andrew Peace

> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> > I just had an idea for a workaround that might be low enough
> > impact to get in for 4.0 and allow tmem to be enabled by
> > default.  I think it will not eliminate the fragmentation
> > problem entirely, but would greatly reduce the probability
> > of it causing problems for domain creation/migration when tmem
> > is enabled, and possibly for the other memory utilization
> > features as well.
> >
> > Simply, alloc_heap_pages would fail if total_avail_pages
> > is less than 1%(?) of the total memory on the system AND
> > the request is order==0.  Essentially, this is reserving
> > a "zone" for order>0 allocations.
> 
> I don't see how that necessarily works. Pages can be allocated in
> order>0 chunks and freed order==0, so even that last 1% can get
> fragmented. For example, guests get their memory allocated in 2MB
> chunks where possible; but their balloon drivers may then free
> arbitrary 4kB pages within those chunks.

Good point.  BUT... do you know of any other asymmetric
allocs/frees?  Since the 2MB allocation does fall back
if it fails (to 4K, I think?), if the patch is modified
to restrict the "zone" to order>0 && order<9, will that
be sufficient?

I know this is quite a hack...  I don't like it much
either.  But I expect the process of restructuring all
data structures to limit them to order==0 to take a long
time with an even longer bug tail (AND be a whack-a-mole
game in the future unless we disallow order>0 entirely).
In that light (and with the low impact of this workaround),
this hack may be just fine for a while.

* Re: Tmem vs order>0 allocation, workaround RFC
  2010-02-15 14:31   ` Dan Magenheimer
@ 2010-02-15 15:40     ` Keir Fraser
  2010-02-15 15:55       ` Dan Magenheimer
  0 siblings, 1 reply; 15+ messages in thread
From: Keir Fraser @ 2010-02-15 15:40 UTC (permalink / raw)
  To: Dan Magenheimer, xen-devel@lists.xensource.com, Jan Beulich
  Cc: George Dunlap, kurt.hackel@oracle.com, Ian Pratt, Tim Deegan,
	Patrick Colp, Grzegorz Milos, Andrew Peace

On 15/02/2010 14:31, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Good point.  BUT... do you know of any other asymmetric
> allocs/frees?  Since the 2MB allocation does fall back
> if it fails (to 4K, I think?), if the patch is modified
> to restrict the "zone" to order>0 && order<9, will that
> be sufficient?

Even though that one can fall back, the point is that even one asymmetric
alloc/free (and that is by far going to be the most common one) can hoover
up the 1% 'pool' and fragment it, so that allocations that cannot fall back
can no longer use the pool.

> I know this is quite a hack...  I don't like it much
> either.  But I expect the process of restructuring all
> data structures to limit them to order==0 to take a long
> time with an even longer bug tail (AND be a whack-a-mole
> game in the future unless we disallow order>0 entirely).
> In that light (and with the low impact of this workaround),
> this hack may be just fine for a while.

Well, I think it's not only not very nice but also dubious whether it will
work in practice very well.

 -- Keir

* RE: Tmem vs order>0 allocation, workaround RFC
  2010-02-15 15:40     ` Keir Fraser
@ 2010-02-15 15:55       ` Dan Magenheimer
  2010-02-15 16:36         ` Dan Magenheimer
  2010-02-15 16:37         ` Keir Fraser
  0 siblings, 2 replies; 15+ messages in thread
From: Dan Magenheimer @ 2010-02-15 15:55 UTC (permalink / raw)
  To: Keir Fraser, xen-devel, Jan Beulich
  Cc: George Dunlap, kurt.hackel, Ian Pratt, Tim Deegan, Patrick Colp,
	Grzegorz Milos, Andrew Peace

> Even though that one can fall back, the point is that even one
> asymmetric alloc/free (and that is by far going to be the most
> common one) can hoover up the 1% 'pool' and fragment it, so that
> allocations that cannot fall back can no longer use the pool.

Understood.

If we eliminate this case, can you think of any others that
are asymmetric, except possibly very uncommon ones?

> > I know this is quite a hack...  I don't like it much
> > either.  But I expect the process of restructuring all
> > data structures to limit them to order==0 to take a long
> > time with an even longer bug tail (AND be a whack-a-mole
> > game in the future unless we disallow order>0 entirely).
> > In that light (and with the low impact of this workaround),
> > this hack may be just fine for a while.
> 
> Well, I think it's not only not very nice but also dubious whether
> it will work in practice very well.

Other than the above, can you (or Jan? or others?) think of
other cases where it won't work in practice?  If not, it's
at least worth a try to see if Jan's test cases continue
to see a problem.

* RE: Tmem vs order>0 allocation, workaround RFC
  2010-02-15 15:55       ` Dan Magenheimer
@ 2010-02-15 16:36         ` Dan Magenheimer
  2010-02-16  8:20           ` Jan Beulich
  2010-02-15 16:37         ` Keir Fraser
  1 sibling, 1 reply; 15+ messages in thread
From: Dan Magenheimer @ 2010-02-15 16:36 UTC (permalink / raw)
  To: dan.magenheimer, Keir Fraser, xen-devel, Jan Beulich
  Cc: George Dunlap, kurt.hackel, Ian Pratt, Tim Deegan, Patrick Colp,
	Grzegorz Milos, Andrew Peace

This version should have zero impact if tmem is not enabled.

=======

When tmem is enabled, reserve a fraction of memory
for allocations of 0<order<9 to avoid fragmentation
issues.

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>

diff -r 3bb163b74673 xen/common/page_alloc.c
--- a/xen/common/page_alloc.c	Fri Feb 12 09:24:18 2010 +0000
+++ b/xen/common/page_alloc.c	Mon Feb 15 09:28:01 2010 -0700
@@ -223,6 +223,12 @@ static heap_by_zone_and_order_t *_heap[M
 
 static unsigned long *avail[MAX_NUMNODES];
 static long total_avail_pages;
+static long max_total_avail_pages; /* highwater mark */
+
+/* reserved for midsize (0<order<9) allocations, tmem only for now */
+static long midsize_alloc_zone_pages;
+#define MIDSIZE_ALLOC_FRAC 128
+
 
 static DEFINE_SPINLOCK(heap_lock);
 
@@ -304,6 +310,15 @@ static struct page_info *alloc_heap_page
     spin_lock(&heap_lock);
 
     /*
+       When available memory is scarce, allow only mid-size allocations
+       to avoid worst of fragmentation issues.  For now, only special-case
+       this when transcendent memory is enabled
+    */
+    if ( opt_tmem && ((order == 0) || (order >= 9)) &&
+         (total_avail_pages <= midsize_alloc_zone_pages) )
+        goto fail;
+
+    /*
      * Start with requested node, but exhaust all node memory in requested 
      * zone before failing, only calc new node value if we fail to find memory 
      * in target node, this avoids needless computation on fast-path.
@@ -337,6 +352,7 @@ static struct page_info *alloc_heap_page
     }
 
     /* No suitable memory blocks. Fail the request. */
+fail:
     spin_unlock(&heap_lock);
     return NULL;
 
@@ -503,6 +519,11 @@ static void free_heap_pages(
 
     avail[node][zone] += 1 << order;
     total_avail_pages += 1 << order;
+    if ( total_avail_pages > max_total_avail_pages )
+    {
+        max_total_avail_pages = total_avail_pages;
+        midsize_alloc_zone_pages  = max_total_avail_pages / MIDSIZE_ALLOC_FRAC;
+    }
 
     /* Merge chunks as far as possible. */
     while ( order < MAX_ORDER )
@@ -842,6 +863,8 @@ static unsigned long avail_heap_pages(
 
 unsigned long total_free_pages(void)
 {
+    if ( opt_tmem )
+        return total_avail_pages - midsize_alloc_zone_pages ;
     return total_avail_pages;
 }

* Re: Tmem vs order>0 allocation, workaround RFC
  2010-02-15 15:55       ` Dan Magenheimer
  2010-02-15 16:36         ` Dan Magenheimer
@ 2010-02-15 16:37         ` Keir Fraser
  1 sibling, 0 replies; 15+ messages in thread
From: Keir Fraser @ 2010-02-15 16:37 UTC (permalink / raw)
  To: Dan Magenheimer, xen-devel@lists.xensource.com, Jan Beulich
  Cc: George Dunlap, kurt.hackel@oracle.com, Ian Pratt, Tim Deegan,
	Patrick Colp, Grzegorz Milos, Andrew Peace

On 15/02/2010 15:55, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Other than the above, can you (or Jan? or others?) think of
> other cases where it won't work in practice?  If not, it's
> at least worth a try to see if Jan's test cases continue
> to see a problem.

I think that's the only obvious one.

 -- Keir

* RE: Tmem vs order>0 allocation, workaround RFC
  2010-02-15 16:36         ` Dan Magenheimer
@ 2010-02-16  8:20           ` Jan Beulich
  2010-02-16 15:05             ` Dan Magenheimer
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2010-02-16  8:20 UTC (permalink / raw)
  To: Keir Fraser, xen-devel, Dan Magenheimer
  Cc: George Dunlap, kurt.hackel, Ian Pratt, Tim Deegan, Patrick Colp,
	Grzegorz Milos, Andrew Peace

Besides generally not liking hackery like this (but we all seem to agree on
that part), and besides having an unexplained feeling that there may
be other bad effects from this, I also think that on large systems this
may not work well: When you have 1TB, you'd reserve 8GB, making Dom0
single-page-below-4G allocations impossible (unless dom0_mem= was
used), if I read the logic correctly.
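
The arithmetic behind the one-terabyte example can be checked with a
short sketch. It assumes 4kB pages and that the high-water mark of free
memory equals the full terabyte; reserved_bytes is a hypothetical
helper, and only MIDSIZE_ALLOC_FRAC is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12           /* assumed 4kB pages */
#define MIDSIZE_ALLOC_FRAC 128  /* divisor used by the patch */

/* Bytes held back when the free-memory high-water mark is total_bytes. */
uint64_t reserved_bytes(uint64_t total_bytes)
{
    /* This sketch only handles page-aligned sizes. */
    assert((total_bytes & ((1ULL << PAGE_SHIFT) - 1)) == 0);

    uint64_t pages = total_bytes >> PAGE_SHIFT;
    uint64_t reserved_pages = pages / MIDSIZE_ALLOC_FRAC;
    return reserved_pages << PAGE_SHIFT;
}
```

For 2^40 bytes this gives 2^28 pages, of which 2^21 pages (8GB) are
reserved, which is more than the whole below-4G range.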

Jan

* RE: Tmem vs order>0 allocation, workaround RFC
  2010-02-16  8:20           ` Jan Beulich
@ 2010-02-16 15:05             ` Dan Magenheimer
  2010-02-16 15:15               ` Jan Beulich
  2010-02-16 18:20               ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 15+ messages in thread
From: Dan Magenheimer @ 2010-02-16 15:05 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser, xen-devel
  Cc: George Dunlap, kurt.hackel, Patrick Colp, Andrew Peace, Tim Deegan,
	Ian Pratt, Grzegorz Milos

Hi Jan --

Thanks for thinking about this.

> may not work well: When you have 1TB, you'd reserve 8GB, making Dom0
> single-page-below-4G allocations impossible (unless dom0_mem= was
> used), if I read the logic correctly.

Good point.  But tmem doesn't work very well at all if dom0_mem
isn't set, because dom0 then hogs all the spare memory in the
system, so only fallow memory reclaimed from self-ballooning
domains can be used by tmem.

Under what circumstances does dom0 require single-page-below-4G
allocations?  Is it only for bounce buffers for PCI passthrough
of old devices with 32-bit addressing limitations?  Or am I
missing a much more common case?  (I think it's important to
enumerate and understand -- and document -- all special needs
of memory pages as Xen has been fairly careless/lucky with
fragmentation so far, but with all the memory optimization
technologies in 4.0, we need to root out all the cases.)

Thanks,
Dan

* RE: Tmem vs order>0 allocation, workaround RFC
  2010-02-16 15:05             ` Dan Magenheimer
@ 2010-02-16 15:15               ` Jan Beulich
  2010-02-16 15:31                 ` Dan Magenheimer
  2010-02-16 18:20               ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2010-02-16 15:15 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: xen-devel, Andrew Peace, George Dunlap, kurt.hackel, Ian Pratt,
	Tim Deegan, Patrick Colp, Keir Fraser, Grzegorz Milos

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 16.02.10 16:05 >>>
>Under what circumstances does dom0 require single-page-below-4G
>allocations?  Is it only for bounce buffers for PCI passthrough
>of old devices with 32-bit addressing limitations?  Or am I
>missing a much more common case?

Not just for pass-through; all devices only supporting 32-bit
addressing would have such requirements, and particularly common
ones are display adapters which have DRM/AGP drivers loaded for
them.

Jan

* RE: Tmem vs order>0 allocation, workaround RFC
  2010-02-16 15:15               ` Jan Beulich
@ 2010-02-16 15:31                 ` Dan Magenheimer
  2010-02-16 15:45                   ` Jan Beulich
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Magenheimer @ 2010-02-16 15:31 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, George Dunlap, kurt.hackel, Ian Pratt, Tim Deegan,
	Patrick Colp, Grzegorz Milos, Keir Fraser, Andrew Peace

> From: Jan Beulich [mailto:JBeulich@novell.com]
> Subject: RE: Tmem vs order>0 allocation, workaround RFC
> 
> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 16.02.10 16:05 >>>
> >Under what circumstances does dom0 require single-page-below-4G
> >allocations?  Is it only for bounce buffers for PCI passthrough
> >of old devices with 32-bit addressing limitations?  Or am I
> >missing a much more common case?
> 
> Not just for pass-through; all devices only supporting 32-bit
> addressing would have such requirements, and particularly common
> ones are display adapters which have DRM/AGP drivers loaded for
> them.

Right, but those are statically allocated when dom0 is
launched, not dynamically allocated later after tmem
(or other memory allocation technologies) begin working,
right?  Whereas pass-through devices would need below-4G
pages later?

(And 32-bit devices in a 1TB machine seem a bit of a
stretch, but I suppose it is good to enumerate all the
cases.)

* RE: Tmem vs order>0 allocation, workaround RFC
  2010-02-16 15:31                 ` Dan Magenheimer
@ 2010-02-16 15:45                   ` Jan Beulich
  2010-02-16 16:44                     ` Dan Magenheimer
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2010-02-16 15:45 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: xen-devel, Andrew Peace, George Dunlap, kurt.hackel, Ian Pratt,
	Tim Deegan, Patrick Colp, Keir Fraser, Grzegorz Milos

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 16.02.10 16:31 >>>
>> From: Jan Beulich [mailto:JBeulich@novell.com] 
>> Subject: RE: Tmem vs order>0 allocation, workaround RFC
>> 
>> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 16.02.10 16:05 >>>
>> >Under what circumstances does dom0 require single-page-below-4G
>> >allocations?  Is it only for bounce buffers for PCI passthrough
>> >of old devices with 32-bit addressing limitations?  Or am I
>> >missing a much more common case?
>> 
>> Not just for pass-through; all devices only supporting 32-bit
>> addressing would have such requirements, and particularly common
>> ones are display adapters which have DRM/AGP drivers loaded for
>> them.
>
>Right, but those are statically allocated when dom0 is
>launched, not dynamically allocated later after tmem
>(or other memory allocation technologies) begin working,
>right?  Whereas pass-through devices would need below-4G
>pages later?

No, consistent/coherent allocations can happen at run time.
Typically the largest share of the allocations would happen when
the respective driver loads, but especially for the DRM/AGP case
I think allocations get triggered by user mode (X initializing a
display), which may happen at any time.

>(And 32-bit devices in a 1TB machine seems a bit of a
>stretch, but I suppose it is good to enumerate all the
>cases.)

Yes, but the 1TB was just taken as an extreme example. Issues may
arise earlier. And the display adapter part would likely remain valid
even there - just see the use of vmalloc_32() in
drivers/gpu/drm/drm_scatter.c for an example.

Jan

* RE: Tmem vs order>0 allocation, workaround RFC
  2010-02-16 15:45                   ` Jan Beulich
@ 2010-02-16 16:44                     ` Dan Magenheimer
  0 siblings, 0 replies; 15+ messages in thread
From: Dan Magenheimer @ 2010-02-16 16:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Peace, George Dunlap, kurt.hackel, Ian Pratt,
	Tim Deegan, Patrick Colp, Keir Fraser, Grzegorz Milos

Hi Jan --

Fair enough.  You've convinced me that I shouldn't push for
tmem to be turned back on by default for the official
Xen 4.0 release.  But the patch as just checked in by Keir
limits allocations only if tmem is enabled so I will just
document that tmem may cause problems if 32-bit-limited
devices are in the system. (I'd expect that to be rare in
the cloud environment where tmem would be most used.)

I do think it's unfortunate (turning off tmem by default)
as I suspect that "thar be (more) dragons" in Xen, when
trying to do any kind of memory utilization optimization,
that will come back and bite us.  Tmem is just the first
to aggressively pursue this and disabling it only delays
the inevitable.  For example, I'll bet improvements to
NUMA support will have many similar problems.

Anyway, thanks as usual for thinking deeply through the
issue and for trying out tmem... any new technology is
going to have some growing pains.

Thanks again,
Dan

* Re: RE: Tmem vs order>0 allocation, workaround RFC
  2010-02-16 15:05             ` Dan Magenheimer
  2010-02-16 15:15               ` Jan Beulich
@ 2010-02-16 18:20               ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-02-16 18:20 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: xen-devel, Andrew Peace, Grzegorz Milos, George Dunlap, kurt.hackel,
	Ian Pratt, Jan Beulich, Tim Deegan, Patrick Colp, Keir Fraser

On Tue, Feb 16, 2010 at 07:05:48AM -0800, Dan Magenheimer wrote:
> Hi Jan --
> 
> Thanks for thinking about this.
> 
> > may not work well: When you have 1TB, you'd reserve 8GB, making Dom0
> > single-page-below-4G allocations impossible (unless dom0_mem= was
> > used), if I read the logic correctly.
> 
> Good point.  But tmem doesn't work very well at all if dom0_mem
> isn't set as dom0 is hogging all the spare memory in the system
> so only fallow memory reclaimed from selfballooning domains
> can be used by tmem.
> 
> Under what circumstances does dom0 require single-page-below-4G
> allocations?  Is it only for bounce buffers for PCI passthrough
> of old devices with 32-bit addressing limitations?  Or am I
> missing a much more common case?  (I think it's important to

The software IO TLB is initialized unconditionally if no IOMMUs are found.
This is a 64MB + 32KB chunk of memory that is exchanged with Xen to make sure
it lies below the 32-bit (4GB) boundary.

> enumerate and understand -- and document -- all special needs
> of memory pages as Xen has been fairly careless/lucky with
> fragmentation so far, but with all the memory optimization
> technologies in 4.0, we need to root out all the cases.)
> 
> Thanks,
> Dan
