* [RFC PATCH] page_alloc: use first half of higher order chunks when halving @ 2014-03-25 11:22 Matt Wilson 2014-03-25 11:44 ` Andrew Cooper 2014-03-25 12:19 ` Tim Deegan 0 siblings, 2 replies; 55+ messages in thread From: Matt Wilson @ 2014-03-25 11:22 UTC (permalink / raw) To: xen-devel Cc: Keir Fraser, Matt Wilson, Andrew Cooper, Tim Deegan, Matt Rushton, Jan Beulich From: Matt Rushton <mrushton@amazon.com> This patch makes the Xen heap allocator use the first half of higher order chunks instead of the second half when breaking them down for smaller order allocations. Linux currently remaps the memory overlapping PCI space one page at a time. Before this change this resulted in the mfns being allocated in reverse order and led to discontiguous dom0 memory. This forced dom0 to use bounce buffers for doing DMA and resulted in poor performance. This change more gracefully handles the dom0 use case and returns contiguous memory for subsequent allocations. Cc: xen-devel@lists.xenproject.org Cc: Keir Fraser <keir@xen.org> Cc: Jan Beulich <jbeulich@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Tim Deegan <tim@xen.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Matt Rushton <mrushton@amazon.com> Signed-off-by: Matt Wilson <msw@amazon.com> --- xen/common/page_alloc.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c index 601319c..27e7f18 100644 --- a/xen/common/page_alloc.c +++ b/xen/common/page_alloc.c @@ -677,9 +677,10 @@ static struct page_info *alloc_heap_pages( /* We may have to halve the chunk a number of times. */ while ( j != order ) { - PFN_ORDER(pg) = --j; + struct page_info *pg2; + pg2 = pg + (1 << --j); + PFN_ORDER(pg) = j; page_list_add_tail(pg, &heap(node, zone, j)); - pg += 1 << j; } ASSERT(avail[node][zone] >= request); -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 55+ messages in thread
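For illustration, a minimal trace of the halving loop above, assuming an order-2 chunk starting at MFN m and repeated order-0 requests; the arithmetic follows from the quoted code rather than from anything stated elsewhere in the thread:

    /*
     * Original loop (before the patch): the low half of the chunk goes
     * back on the free list and pg advances past it, so the caller gets
     * the high end of the chunk.
     *
     *   start:  j = 2, pg = m, requested order = 0
     *   j -> 1: free [m, m+1] at order 1, pg = m + 2
     *   j -> 0: free [m+2]    at order 0, pg = m + 3
     *   return m + 3
     *
     * Repeated order-0 allocations therefore hand out m+3, m+2, m+1, m:
     * descending MFNs, i.e. the reverse-order, discontiguous dom0 layout
     * described in the commit message. With the intended change (the
     * corrected loop posted later in the thread, which frees pg2 rather
     * than pg) the high half is returned to the free list and pg stays
     * at m, so the same sequence of requests returns m, m+1, m+2, m+3.
     */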
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-25 11:22 [RFC PATCH] page_alloc: use first half of higher order chunks when halving Matt Wilson @ 2014-03-25 11:44 ` Andrew Cooper 2014-03-25 13:20 ` Matt Wilson 2014-03-25 12:19 ` Tim Deegan 1 sibling, 1 reply; 55+ messages in thread From: Andrew Cooper @ 2014-03-25 11:44 UTC (permalink / raw) To: Matt Wilson Cc: Keir Fraser, Matt Wilson, Tim Deegan, Matt Rushton, Jan Beulich, xen-devel On 25/03/14 11:22, Matt Wilson wrote: > From: Matt Rushton <mrushton@amazon.com> > > This patch makes the Xen heap allocator use the first half of higher > order chunks instead of the second half when breaking them down for > smaller order allocations. > > Linux currently remaps the memory overlapping PCI space one page at a > time. Before this change this resulted in the mfns being allocated in > reverse order and led to discontiguous dom0 memory. This forced dom0 > to use bounce buffers for doing DMA and resulted in poor performance. > > This change more gracefully handles the dom0 use case and returns > contiguous memory for subsequent allocations. > > Cc: xen-devel@lists.xenproject.org > Cc: Keir Fraser <keir@xen.org> > Cc: Jan Beulich <jbeulich@suse.com> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > Cc: Tim Deegan <tim@xen.org> > Cc: Andrew Cooper <andrew.cooper3@citrix.com> > Signed-off-by: Matt Rushton <mrushton@amazon.com> > Signed-off-by: Matt Wilson <msw@amazon.com> How does dom0 work out that it is safe to join multiple pfns into a dma buffer without the swiotlb? > --- > xen/common/page_alloc.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c > index 601319c..27e7f18 100644 > --- a/xen/common/page_alloc.c > +++ b/xen/common/page_alloc.c > @@ -677,9 +677,10 @@ static struct page_info *alloc_heap_pages( > /* We may have to halve the chunk a number of times. */ > while ( j != order ) > { > - PFN_ORDER(pg) = --j; > + struct page_info *pg2; At the very least, Xen style mandates a blank line after this variable declaration. ~Andrew > + pg2 = pg + (1 << --j); > + PFN_ORDER(pg) = j; > page_list_add_tail(pg, &heap(node, zone, j)); > - pg += 1 << j; > } > > ASSERT(avail[node][zone] >= request); ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-25 11:44 ` Andrew Cooper @ 2014-03-25 13:20 ` Matt Wilson 2014-03-25 20:18 ` Matthew Rushton 0 siblings, 1 reply; 55+ messages in thread From: Matt Wilson @ 2014-03-25 13:20 UTC (permalink / raw) To: Andrew Cooper Cc: Keir Fraser, Matt Wilson, Tim Deegan, Matt Rushton, Jan Beulich, xen-devel On Tue, Mar 25, 2014 at 11:44:19AM +0000, Andrew Cooper wrote: > On 25/03/14 11:22, Matt Wilson wrote: > > From: Matt Rushton <mrushton@amazon.com> > > > > This patch makes the Xen heap allocator use the first half of higher > > order chunks instead of the second half when breaking them down for > > smaller order allocations. > > > > Linux currently remaps the memory overlapping PCI space one page at a > > time. Before this change this resulted in the mfns being allocated in > > reverse order and led to discontiguous dom0 memory. This forced dom0 > > to use bounce buffers for doing DMA and resulted in poor performance. > > > > This change more gracefully handles the dom0 use case and returns > > contiguous memory for subsequent allocations. > > > > Cc: xen-devel@lists.xenproject.org > > Cc: Keir Fraser <keir@xen.org> > > Cc: Jan Beulich <jbeulich@suse.com> > > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > Cc: Tim Deegan <tim@xen.org> > > Cc: Andrew Cooper <andrew.cooper3@citrix.com> > > Signed-off-by: Matt Rushton <mrushton@amazon.com> > > Signed-off-by: Matt Wilson <msw@amazon.com> > > How does dom0 work out that it is safe to join multiple pfns into a dma > buffer without the swiotlb? I'm not familiar enough with how this works to say. Perhaps Matt R. can chime in during his day. My guess is that xen_swiotlb_alloc_coherent() avoids allocating a contiguous region if the pages allocated already happen to be physically contiguous. Konrad, can you enlighten us? The setup code in question that does the remapping one page at a time is in arch/x86/xen/setup.c. > > --- > > xen/common/page_alloc.c | 5 +++-- > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c > > index 601319c..27e7f18 100644 > > --- a/xen/common/page_alloc.c > > +++ b/xen/common/page_alloc.c > > @@ -677,9 +677,10 @@ static struct page_info *alloc_heap_pages( > > /* We may have to halve the chunk a number of times. */ > > while ( j != order ) > > { > > - PFN_ORDER(pg) = --j; > > + struct page_info *pg2; > > At the very least, Xen style mandates a blank line after this variable > declaration. Ack. > ~Andrew > > > + pg2 = pg + (1 << --j); > > + PFN_ORDER(pg) = j; > > page_list_add_tail(pg, &heap(node, zone, j)); > > - pg += 1 << j; > > } > > > > ASSERT(avail[node][zone] >= request); > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-25 13:20 ` Matt Wilson @ 2014-03-25 20:18 ` Matthew Rushton 0 siblings, 0 replies; 55+ messages in thread From: Matthew Rushton @ 2014-03-25 20:18 UTC (permalink / raw) To: Matt Wilson, Andrew Cooper Cc: Keir Fraser, Jan Beulich, Tim Deegan, Matt Wilson, xen-devel On 03/25/14 06:20, Matt Wilson wrote: > On Tue, Mar 25, 2014 at 11:44:19AM +0000, Andrew Cooper wrote: >> On 25/03/14 11:22, Matt Wilson wrote: >>> From: Matt Rushton <mrushton@amazon.com> >>> >>> This patch makes the Xen heap allocator use the first half of higher >>> order chunks instead of the second half when breaking them down for >>> smaller order allocations. >>> >>> Linux currently remaps the memory overlapping PCI space one page at a >>> time. Before this change this resulted in the mfns being allocated in >>> reverse order and led to discontiguous dom0 memory. This forced dom0 >>> to use bounce buffers for doing DMA and resulted in poor performance. >>> >>> This change more gracefully handles the dom0 use case and returns >>> contiguous memory for subsequent allocations. >>> >>> Cc: xen-devel@lists.xenproject.org >>> Cc: Keir Fraser <keir@xen.org> >>> Cc: Jan Beulich <jbeulich@suse.com> >>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >>> Cc: Tim Deegan <tim@xen.org> >>> Cc: Andrew Cooper <andrew.cooper3@citrix.com> >>> Signed-off-by: Matt Rushton <mrushton@amazon.com> >>> Signed-off-by: Matt Wilson <msw@amazon.com> >> How does dom0 work out that it is safe to join multiple pfns into a dma >> buffer without the swiotlb? > I'm not familiar enough with how this works to say. Perhaps Matt R. > can chime in during his day. My guess is that xen_swiotlb_alloc_coherent() > avoids allocating a contiguous region if the pages allocated already > happen to be physically contiguous. > > Konrad, can you enlighten us? The setup code in question that does the > remapping one page at a time is in arch/x86/xen/setup.c. > The swiotlb code will check if the underlying mfns are contiguous and use a bounce buffer if and only if they are not. Everything goes through the swiotlb via the normal Linux dma apis it's just a matter of if it uses a bounce buffer or not. >>> --- >>> xen/common/page_alloc.c | 5 +++-- >>> 1 file changed, 3 insertions(+), 2 deletions(-) >>> >>> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c >>> index 601319c..27e7f18 100644 >>> --- a/xen/common/page_alloc.c >>> +++ b/xen/common/page_alloc.c >>> @@ -677,9 +677,10 @@ static struct page_info *alloc_heap_pages( >>> /* We may have to halve the chunk a number of times. */ >>> while ( j != order ) >>> { >>> - PFN_ORDER(pg) = --j; >>> + struct page_info *pg2; >> At the very least, Xen style mandates a blank line after this variable >> declaration. > Ack. > >> ~Andrew >> >>> + pg2 = pg + (1 << --j); >>> + PFN_ORDER(pg) = j; >>> page_list_add_tail(pg, &heap(node, zone, j)); >>> - pg += 1 << j; >>> } >>> >>> ASSERT(avail[node][zone] >= request); ^ permalink raw reply [flat|nested] 55+ messages in thread
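A rough sketch of the check being described, assuming it amounts to walking the p2m and comparing successive MFNs (simplified logic, not the actual xen_swiotlb_map_page() / range_straddles_page_boundary() source; buffer_needs_bounce() is a made-up name):

    #include <linux/types.h>
    /* pfn_to_mfn() is the PV p2m lookup from <asm/xen/page.h>. */

    /* Bounce only if the buffer crosses a page boundary and the MFNs
     * backing it are not consecutive in machine address space. */
    static bool buffer_needs_bounce(phys_addr_t p, size_t size)
    {
        unsigned long pfn = p >> PAGE_SHIFT;
        unsigned long end_pfn = (p + size - 1) >> PAGE_SHIFT;

        for (; pfn < end_pfn; pfn++)
            if (pfn_to_mfn(pfn + 1) != pfn_to_mfn(pfn) + 1)
                return true;    /* discontiguous: take the bounce path */

        return false;           /* machine-contiguous: map in place */
    }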
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-25 11:22 [RFC PATCH] page_alloc: use first half of higher order chunks when halving Matt Wilson 2014-03-25 11:44 ` Andrew Cooper @ 2014-03-25 12:19 ` Tim Deegan 2014-03-25 13:27 ` Matt Wilson 1 sibling, 1 reply; 55+ messages in thread From: Tim Deegan @ 2014-03-25 12:19 UTC (permalink / raw) To: Matt Wilson Cc: Keir Fraser, Matt Wilson, Andrew Cooper, Matt Rushton, Jan Beulich, xen-devel At 13:22 +0200 on 25 Mar (1395750124), Matt Wilson wrote: > From: Matt Rushton <mrushton@amazon.com> > > This patch makes the Xen heap allocator use the first half of higher > order chunks instead of the second half when breaking them down for > smaller order allocations. > > Linux currently remaps the memory overlapping PCI space one page at a > time. Before this change this resulted in the mfns being allocated in > reverse order and led to discontiguous dom0 memory. This forced dom0 > to use bounce buffers for doing DMA and resulted in poor performance. This seems like something better fixed on the dom0 side, by asking explicitly for contiguous memory in cases where it makes a difference. On the Xen side, this change seems harmless, but we might like to keep the explicitly reversed allocation on debug builds, to flush out guests that rely on their memory being contiguous. > This change more gracefully handles the dom0 use case and returns > contiguous memory for subsequent allocations. > > Cc: xen-devel@lists.xenproject.org > Cc: Keir Fraser <keir@xen.org> > Cc: Jan Beulich <jbeulich@suse.com> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > Cc: Tim Deegan <tim@xen.org> > Cc: Andrew Cooper <andrew.cooper3@citrix.com> > Signed-off-by: Matt Rushton <mrushton@amazon.com> > Signed-off-by: Matt Wilson <msw@amazon.com> > --- > xen/common/page_alloc.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c > index 601319c..27e7f18 100644 > --- a/xen/common/page_alloc.c > +++ b/xen/common/page_alloc.c > @@ -677,9 +677,10 @@ static struct page_info *alloc_heap_pages( > /* We may have to halve the chunk a number of times. */ > while ( j != order ) > { > - PFN_ORDER(pg) = --j; > + struct page_info *pg2; > + pg2 = pg + (1 << --j); > + PFN_ORDER(pg) = j; > page_list_add_tail(pg, &heap(node, zone, j)); > - pg += 1 << j; AFAICT this uses the low half (pg) for the allocation _and_ puts it on the freelist, and just leaks the high half (pg2). Am I missing something? Tim. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-25 12:19 ` Tim Deegan @ 2014-03-25 13:27 ` Matt Wilson 2014-03-25 20:09 ` Matthew Rushton 0 siblings, 1 reply; 55+ messages in thread From: Matt Wilson @ 2014-03-25 13:27 UTC (permalink / raw) To: Tim Deegan Cc: Keir Fraser, Matt Wilson, Andrew Cooper, Matt Rushton, Jan Beulich, xen-devel On Tue, Mar 25, 2014 at 01:19:22PM +0100, Tim Deegan wrote: > At 13:22 +0200 on 25 Mar (1395750124), Matt Wilson wrote: > > From: Matt Rushton <mrushton@amazon.com> > > > > This patch makes the Xen heap allocator use the first half of higher > > order chunks instead of the second half when breaking them down for > > smaller order allocations. > > > > Linux currently remaps the memory overlapping PCI space one page at a > > time. Before this change this resulted in the mfns being allocated in > > reverse order and led to discontiguous dom0 memory. This forced dom0 > > to use bounce buffers for doing DMA and resulted in poor performance. > > This seems like something better fixed on the dom0 side, by asking > explicitly for contiguous memory in cases where it makes a difference. > On the Xen side, this change seems harmless, but we might like to keep > the explicitly reversed allocation on debug builds, to flush out > guests that rely on their memory being contiguous. Yes, I think that retaining the reverse allocation on debug builds is fine. I'd like Konrad's take on if it's better or possible to fix this on the Linux side. > > This change more gracefully handles the dom0 use case and returns > > contiguous memory for subsequent allocations. > > > > Cc: xen-devel@lists.xenproject.org > > Cc: Keir Fraser <keir@xen.org> > > Cc: Jan Beulich <jbeulich@suse.com> > > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > Cc: Tim Deegan <tim@xen.org> > > Cc: Andrew Cooper <andrew.cooper3@citrix.com> > > Signed-off-by: Matt Rushton <mrushton@amazon.com> > > Signed-off-by: Matt Wilson <msw@amazon.com> > > --- > > xen/common/page_alloc.c | 5 +++-- > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c > > index 601319c..27e7f18 100644 > > --- a/xen/common/page_alloc.c > > +++ b/xen/common/page_alloc.c > > @@ -677,9 +677,10 @@ static struct page_info *alloc_heap_pages( > > /* We may have to halve the chunk a number of times. */ > > while ( j != order ) > > { > > - PFN_ORDER(pg) = --j; > > + struct page_info *pg2; > > + pg2 = pg + (1 << --j); > > + PFN_ORDER(pg) = j; > > page_list_add_tail(pg, &heap(node, zone, j)); > > - pg += 1 << j; > > AFAICT this uses the low half (pg) for the allocation _and_ puts it on > the freelist, and just leaks the high half (pg2). Am I missing something? Argh, oops. this is totally my fault (not Matt R.'s). I ported the patch out of our development tree incorrectly. The code should have read: while ( j != order ) { struct page_info *pg2; pg2 = pg + (1 << --j); PFN_ORDER(pg2) = j; page_list_add_tail(pg2, &heap(node, zone, j)); } Apologies to Matt for my mangling of his patch (which also already had the correct blank line per Andy's comment). --msw ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-25 13:27 ` Matt Wilson @ 2014-03-25 20:09 ` Matthew Rushton 2014-03-26 9:55 ` Tim Deegan 0 siblings, 1 reply; 55+ messages in thread From: Matthew Rushton @ 2014-03-25 20:09 UTC (permalink / raw) To: Matt Wilson, Tim Deegan Cc: Keir Fraser, Jan Beulich, Andrew Cooper, Matt Wilson, xen-devel On 03/25/14 06:27, Matt Wilson wrote: > On Tue, Mar 25, 2014 at 01:19:22PM +0100, Tim Deegan wrote: >> At 13:22 +0200 on 25 Mar (1395750124), Matt Wilson wrote: >>> From: Matt Rushton <mrushton@amazon.com> >>> >>> This patch makes the Xen heap allocator use the first half of higher >>> order chunks instead of the second half when breaking them down for >>> smaller order allocations. >>> >>> Linux currently remaps the memory overlapping PCI space one page at a >>> time. Before this change this resulted in the mfns being allocated in >>> reverse order and led to discontiguous dom0 memory. This forced dom0 >>> to use bounce buffers for doing DMA and resulted in poor performance. >> This seems like something better fixed on the dom0 side, by asking >> explicitly for contiguous memory in cases where it makes a difference. >> On the Xen side, this change seems harmless, but we might like to keep >> the explicitly reversed allocation on debug builds, to flush out >> guests that rely on their memory being contiguous. > Yes, I think that retaining the reverse allocation on debug builds is > fine. I'd like Konrad's take on if it's better or possible to fix this > on the Linux side. I considered fixing it in Linux but this was a more straight forward change with no downside as far as I can tell. I see no reason in not fixing it in both places but this at least behaves more reasonably for one potential use case. I'm also interested in other opinions. >>> This change more gracefully handles the dom0 use case and returns >>> contiguous memory for subsequent allocations. >>> >>> Cc: xen-devel@lists.xenproject.org >>> Cc: Keir Fraser <keir@xen.org> >>> Cc: Jan Beulich <jbeulich@suse.com> >>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >>> Cc: Tim Deegan <tim@xen.org> >>> Cc: Andrew Cooper <andrew.cooper3@citrix.com> >>> Signed-off-by: Matt Rushton <mrushton@amazon.com> >>> Signed-off-by: Matt Wilson <msw@amazon.com> >>> --- >>> xen/common/page_alloc.c | 5 +++-- >>> 1 file changed, 3 insertions(+), 2 deletions(-) >>> >>> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c >>> index 601319c..27e7f18 100644 >>> --- a/xen/common/page_alloc.c >>> +++ b/xen/common/page_alloc.c >>> @@ -677,9 +677,10 @@ static struct page_info *alloc_heap_pages( >>> /* We may have to halve the chunk a number of times. */ >>> while ( j != order ) >>> { >>> - PFN_ORDER(pg) = --j; >>> + struct page_info *pg2; >>> + pg2 = pg + (1 << --j); >>> + PFN_ORDER(pg) = j; >>> page_list_add_tail(pg, &heap(node, zone, j)); >>> - pg += 1 << j; >> AFAICT this uses the low half (pg) for the allocation _and_ puts it on >> the freelist, and just leaks the high half (pg2). Am I missing something? > Argh, oops. this is totally my fault (not Matt R.'s). I ported the > patch out of our development tree incorrectly. The code should have > read: > > while ( j != order ) > { > struct page_info *pg2; > > pg2 = pg + (1 << --j); > PFN_ORDER(pg2) = j; > page_list_add_tail(pg2, &heap(node, zone, j)); > } > > Apologies to Matt for my mangling of his patch (which also already had > the correct blank line per Andy's comment). 
> > --msw No worries, I was about to correct you :) ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-25 20:09 ` Matthew Rushton @ 2014-03-26 9:55 ` Tim Deegan 2014-03-26 10:17 ` Matt Wilson 0 siblings, 1 reply; 55+ messages in thread From: Tim Deegan @ 2014-03-26 9:55 UTC (permalink / raw) To: Matthew Rushton Cc: Keir Fraser, Matt Wilson, Matt Wilson, Jan Beulich, Andrew Cooper, xen-devel Hi, At 13:09 -0700 on 25 Mar (1395749353), Matthew Rushton wrote: > On 03/25/14 06:27, Matt Wilson wrote: > > On Tue, Mar 25, 2014 at 01:19:22PM +0100, Tim Deegan wrote: > >> At 13:22 +0200 on 25 Mar (1395750124), Matt Wilson wrote: > >>> From: Matt Rushton <mrushton@amazon.com> > >>> > >>> This patch makes the Xen heap allocator use the first half of higher > >>> order chunks instead of the second half when breaking them down for > >>> smaller order allocations. > >>> > >>> Linux currently remaps the memory overlapping PCI space one page at a > >>> time. Before this change this resulted in the mfns being allocated in > >>> reverse order and led to discontiguous dom0 memory. This forced dom0 > >>> to use bounce buffers for doing DMA and resulted in poor performance. > >> This seems like something better fixed on the dom0 side, by asking > >> explicitly for contiguous memory in cases where it makes a difference. > >> On the Xen side, this change seems harmless, but we might like to keep > >> the explicitly reversed allocation on debug builds, to flush out > >> guests that rely on their memory being contiguous. > > Yes, I think that retaining the reverse allocation on debug builds is > > fine. I'd like Konrad's take on if it's better or possible to fix this > > on the Linux side. > > I considered fixing it in Linux but this was a more straight forward > change with no downside as far as I can tell. I see no reason in not > fixing it in both places but this at least behaves more reasonably for > one potential use case. I'm also interested in other opinions. Well, I'm happy enough with changing Xen (though it's common code so you'll need Keir's ack anyway rather than mine), since as you say it happens to make one use case a bit better and is otherwise harmless. But that comes with a stinking great warning: - This is not 'fixing' anything in Xen because Xen is doing exactly what dom0 asks for in the current code; and conversely - dom0 (and other guests) _must_not_ rely on it, whether for performance or correctness. Xen might change its page allocator at some point in the future, for any reason, and if linux perf starts sucking when that happens, that's (still) a linux bug. Cheers, Tim. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 9:55 ` Tim Deegan @ 2014-03-26 10:17 ` Matt Wilson 2014-03-26 10:44 ` David Vrabel 2014-03-26 15:08 ` Konrad Rzeszutek Wilk 0 siblings, 2 replies; 55+ messages in thread From: Matt Wilson @ 2014-03-26 10:17 UTC (permalink / raw) To: Tim Deegan Cc: Keir Fraser, Matt Wilson, Matthew Rushton, Andrew Cooper, Jan Beulich, xen-devel On Wed, Mar 26, 2014 at 10:55:33AM +0100, Tim Deegan wrote: > Hi, > > At 13:09 -0700 on 25 Mar (1395749353), Matthew Rushton wrote: > > On 03/25/14 06:27, Matt Wilson wrote: > > > On Tue, Mar 25, 2014 at 01:19:22PM +0100, Tim Deegan wrote: > > >> At 13:22 +0200 on 25 Mar (1395750124), Matt Wilson wrote: > > >>> From: Matt Rushton <mrushton@amazon.com> > > >>> > > >>> This patch makes the Xen heap allocator use the first half of higher > > >>> order chunks instead of the second half when breaking them down for > > >>> smaller order allocations. > > >>> > > >>> Linux currently remaps the memory overlapping PCI space one page at a > > >>> time. Before this change this resulted in the mfns being allocated in > > >>> reverse order and led to discontiguous dom0 memory. This forced dom0 > > >>> to use bounce buffers for doing DMA and resulted in poor performance. > > >> This seems like something better fixed on the dom0 side, by asking > > >> explicitly for contiguous memory in cases where it makes a difference. > > >> On the Xen side, this change seems harmless, but we might like to keep > > >> the explicitly reversed allocation on debug builds, to flush out > > >> guests that rely on their memory being contiguous. > > > Yes, I think that retaining the reverse allocation on debug builds is > > > fine. I'd like Konrad's take on if it's better or possible to fix this > > > on the Linux side. > > > > I considered fixing it in Linux but this was a more straight forward > > change with no downside as far as I can tell. I see no reason in not > > fixing it in both places but this at least behaves more reasonably for > > one potential use case. I'm also interested in other opinions. > > Well, I'm happy enough with changing Xen (though it's common code so > you'll need Keir's ack anyway rather than mine), since as you say it > happens to make one use case a bit better and is otherwise harmless. > But that comes with a stinking great warning: Anyone can Ack or Nack, but I wouldn't want to move forward on a change like this without Keir's Ack. :-) > - This is not 'fixing' anything in Xen because Xen is doing exactly > what dom0 asks for in the current code; and conversely > > - dom0 (and other guests) _must_not_ rely on it, whether for > performance or correctness. Xen might change its page allocator at > some point in the future, for any reason, and if linux perf starts > sucking when that happens, that's (still) a linux bug. I agree with both of these. This was just the "least change" patch to a particular problem we observed. Konrad, what's the possibility of fixing this in Linux Xen PV setup code? I think it'd be a matter batching up pages and doing larger order allocations in linux/arch/x86/xen/setup.c:xen_do_chunk(), falling back to smaller pages if allocations fail due to fragmentation, etc. --msw ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 10:17 ` Matt Wilson @ 2014-03-26 10:44 ` David Vrabel 2014-03-26 10:48 ` Matt Wilson 2014-03-26 15:08 ` Konrad Rzeszutek Wilk 1 sibling, 1 reply; 55+ messages in thread From: David Vrabel @ 2014-03-26 10:44 UTC (permalink / raw) To: Matt Wilson Cc: Keir Fraser, Matt Wilson, Matthew Rushton, Andrew Cooper, Tim Deegan, Jan Beulich, xen-devel, Malcolm Crossley On 26/03/14 10:17, Matt Wilson wrote: > > Konrad, what's the possibility of fixing this in Linux Xen PV setup > code? I think it'd be a matter batching up pages and doing larger > order allocations in linux/arch/x86/xen/setup.c:xen_do_chunk(), > falling back to smaller pages if allocations fail due to > fragmentation, etc. We plan to fix problems caused by non-machine-contiguous memory by setting up the IOMMU to have 1:1 bus to pseudo-physical mappings. This would avoid using the swiotlb always[1], regardless of the machine layout of dom0 or the driver domain. I think I would prefer this approach rather than making xen/setup.c even more horribly complicated. Malcolm has a working prototype of this already. David [1] Unless you have a non-64 bit DMA capable device, but we don't care about performance with these. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 10:44 ` David Vrabel @ 2014-03-26 10:48 ` Matt Wilson 2014-03-26 11:13 ` Ian Campbell 2014-03-26 12:43 ` David Vrabel 0 siblings, 2 replies; 55+ messages in thread From: Matt Wilson @ 2014-03-26 10:48 UTC (permalink / raw) To: David Vrabel Cc: Keir Fraser, Matt Wilson, Matthew Rushton, Andrew Cooper, Tim Deegan, Jan Beulich, xen-devel, Malcolm Crossley On Wed, Mar 26, 2014 at 10:44:18AM +0000, David Vrabel wrote: > On 26/03/14 10:17, Matt Wilson wrote: > > > > Konrad, what's the possibility of fixing this in Linux Xen PV setup > > code? I think it'd be a matter batching up pages and doing larger > > order allocations in linux/arch/x86/xen/setup.c:xen_do_chunk(), > > falling back to smaller pages if allocations fail due to > > fragmentation, etc. > > We plan to fix problems caused by non-machine-contiguous memory by > setting up the IOMMU to have 1:1 bus to pseudo-physical mappings. This > would avoid using the swiotlb always[1], regardless of the machine > layout of dom0 or the driver domain. > > I think I would prefer this approach rather than making xen/setup.c even > more horribly complicated. I imagine that some users will not want to run dom0 under an IOMMU. If changing Linux Xen PV setup is (rightly) objectionable due to complexity, perhaps this small change to the hypervisor is a better short-term fix. > Malcolm has a working prototype of this already. Is it ready for a RFC posting? > David > > [1] Unless you have a non-64 bit DMA capable device, but we don't care > about performance with these. --msw ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 10:48 ` Matt Wilson @ 2014-03-26 11:13 ` Ian Campbell 2014-03-26 11:41 ` Matt Wilson 2014-03-26 12:43 ` David Vrabel 1 sibling, 1 reply; 55+ messages in thread From: Ian Campbell @ 2014-03-26 11:13 UTC (permalink / raw) To: Matt Wilson Cc: Keir Fraser, Jan Beulich, Matthew Rushton, Andrew Cooper, Tim Deegan, David Vrabel, Matt Wilson, xen-devel, Malcolm Crossley On Wed, 2014-03-26 at 12:48 +0200, Matt Wilson wrote: > On Wed, Mar 26, 2014 at 10:44:18AM +0000, David Vrabel wrote: > > On 26/03/14 10:17, Matt Wilson wrote: > > > > > > Konrad, what's the possibility of fixing this in Linux Xen PV setup > > > code? I think it'd be a matter batching up pages and doing larger > > > order allocations in linux/arch/x86/xen/setup.c:xen_do_chunk(), > > > falling back to smaller pages if allocations fail due to > > > fragmentation, etc. > > > > We plan to fix problems caused by non-machine-contiguous memory by > > setting up the IOMMU to have 1:1 bus to pseudo-physical mappings. This > > would avoid using the swiotlb always[1], regardless of the machine > > layout of dom0 or the driver domain. > > > > I think I would prefer this approach rather than making xen/setup.c even > > more horribly complicated. > > I imagine that some users will not want to run dom0 under an IOMMU. Then they have chosen that (for whatever reason) over performance, right? ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 11:13 ` Ian Campbell @ 2014-03-26 11:41 ` Matt Wilson 2014-03-26 11:45 ` Andrew Cooper 0 siblings, 1 reply; 55+ messages in thread From: Matt Wilson @ 2014-03-26 11:41 UTC (permalink / raw) To: Ian Campbell Cc: Keir Fraser, Jan Beulich, Matthew Rushton, Andrew Cooper, Tim Deegan, David Vrabel, Matt Wilson, xen-devel, Malcolm Crossley On Wed, Mar 26, 2014 at 11:13:49AM +0000, Ian Campbell wrote: > On Wed, 2014-03-26 at 12:48 +0200, Matt Wilson wrote: > > On Wed, Mar 26, 2014 at 10:44:18AM +0000, David Vrabel wrote: > > > On 26/03/14 10:17, Matt Wilson wrote: > > > > > > > > Konrad, what's the possibility of fixing this in Linux Xen PV setup > > > > code? I think it'd be a matter batching up pages and doing larger > > > > order allocations in linux/arch/x86/xen/setup.c:xen_do_chunk(), > > > > falling back to smaller pages if allocations fail due to > > > > fragmentation, etc. > > > > > > We plan to fix problems caused by non-machine-contiguous memory by > > > setting up the IOMMU to have 1:1 bus to pseudo-physical mappings. This > > > would avoid using the swiotlb always[1], regardless of the machine > > > layout of dom0 or the driver domain. > > > > > > I think I would prefer this approach rather than making xen/setup.c even > > > more horribly complicated. > > > > I imagine that some users will not want to run dom0 under an IOMMU. > > Then they have chosen that (for whatever reason) over performance, > right? IOMMU is not free, so I imagine that some users would actually be choosing to avoid it specifically for performance reasons. --msw ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 11:41 ` Matt Wilson @ 2014-03-26 11:45 ` Andrew Cooper 2014-03-26 11:50 ` Matt Wilson 0 siblings, 1 reply; 55+ messages in thread From: Andrew Cooper @ 2014-03-26 11:45 UTC (permalink / raw) To: Matt Wilson Cc: Keir Fraser, Ian Campbell, Matthew Rushton, Tim Deegan, Jan Beulich, David Vrabel, Matt Wilson, xen-devel, Malcolm Crossley On 26/03/14 11:41, Matt Wilson wrote: > On Wed, Mar 26, 2014 at 11:13:49AM +0000, Ian Campbell wrote: >> On Wed, 2014-03-26 at 12:48 +0200, Matt Wilson wrote: >>> On Wed, Mar 26, 2014 at 10:44:18AM +0000, David Vrabel wrote: >>>> On 26/03/14 10:17, Matt Wilson wrote: >>>>> Konrad, what's the possibility of fixing this in Linux Xen PV setup >>>>> code? I think it'd be a matter batching up pages and doing larger >>>>> order allocations in linux/arch/x86/xen/setup.c:xen_do_chunk(), >>>>> falling back to smaller pages if allocations fail due to >>>>> fragmentation, etc. >>>> We plan to fix problems caused by non-machine-contiguous memory by >>>> setting up the IOMMU to have 1:1 bus to pseudo-physical mappings. This >>>> would avoid using the swiotlb always[1], regardless of the machine >>>> layout of dom0 or the driver domain. >>>> >>>> I think I would prefer this approach rather than making xen/setup.c even >>>> more horribly complicated. >>> I imagine that some users will not want to run dom0 under an IOMMU. >> Then they have chosen that (for whatever reason) over performance, >> right? > IOMMU is not free, so I imagine that some users would actually be > choosing to avoid it specifically for performance reasons. > > --msw True, but if people are actually looking for performance, they will be turning on features like GRO and more generically scatter/gather which results in drivers using single page DMA mappings at a time, which completely bypass the bouncing in the swiotlb. ~Andrew ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 11:45 ` Andrew Cooper @ 2014-03-26 11:50 ` Matt Wilson 0 siblings, 0 replies; 55+ messages in thread From: Matt Wilson @ 2014-03-26 11:50 UTC (permalink / raw) To: Andrew Cooper Cc: Keir Fraser, Ian Campbell, Matthew Rushton, Tim Deegan, Jan Beulich, David Vrabel, Matt Wilson, xen-devel, Malcolm Crossley On Wed, Mar 26, 2014 at 11:45:50AM +0000, Andrew Cooper wrote: > On 26/03/14 11:41, Matt Wilson wrote: > > On Wed, Mar 26, 2014 at 11:13:49AM +0000, Ian Campbell wrote: > >> On Wed, 2014-03-26 at 12:48 +0200, Matt Wilson wrote: > >>> I imagine that some users will not want to run dom0 under an IOMMU. > >> > >> Then they have chosen that (for whatever reason) over performance, > >> right? > > > > IOMMU is not free, so I imagine that some users would actually be > > choosing to avoid it specifically for performance reasons. > > > > --msw > > True, but if people are actually looking for performance, they will be > turning on features like GRO and more generically scatter/gather which > results in drivers using single page DMA mappings at a time, which > completely bypass the bouncing in the swiotlb. Alas, GRO isn't without its own problems (at least historically). --msw ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 10:48 ` Matt Wilson 2014-03-26 11:13 ` Ian Campbell @ 2014-03-26 12:43 ` David Vrabel 2014-03-26 12:48 ` Matt Wilson 1 sibling, 1 reply; 55+ messages in thread From: David Vrabel @ 2014-03-26 12:43 UTC (permalink / raw) To: Matt Wilson Cc: Keir Fraser, Jan Beulich, Matthew Rushton, Andrew Cooper, Tim Deegan, David Vrabel, Matt Wilson, xen-devel, Malcolm Crossley On 26/03/14 10:48, Matt Wilson wrote: > On Wed, Mar 26, 2014 at 10:44:18AM +0000, David Vrabel wrote: >> On 26/03/14 10:17, Matt Wilson wrote: >>> >>> Konrad, what's the possibility of fixing this in Linux Xen PV setup >>> code? I think it'd be a matter batching up pages and doing larger >>> order allocations in linux/arch/x86/xen/setup.c:xen_do_chunk(), >>> falling back to smaller pages if allocations fail due to >>> fragmentation, etc. >> >> We plan to fix problems caused by non-machine-contiguous memory by >> setting up the IOMMU to have 1:1 bus to pseudo-physical mappings. This >> would avoid using the swiotlb always[1], regardless of the machine >> layout of dom0 or the driver domain. >> >> I think I would prefer this approach rather than making xen/setup.c even >> more horribly complicated. > > I imagine that some users will not want to run dom0 under an IOMMU. If > changing Linux Xen PV setup is (rightly) objectionable due to > complexity, perhaps this small change to the hypervisor is a better > short-term fix. Users who are not using the IOMMU for performance reasons but are complaining about swiotlb costs? I'm not sure that's an interesting set of users... I'm not ruling out any Linux-side memory setup changes. I just don't think they're a complete solution. >> Malcolm has a working prototype of this already. > > Is it ready for a RFC posting? Not sure. Malcolm is current OoO. David ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 12:43 ` David Vrabel @ 2014-03-26 12:48 ` Matt Wilson 0 siblings, 0 replies; 55+ messages in thread From: Matt Wilson @ 2014-03-26 12:48 UTC (permalink / raw) To: David Vrabel Cc: Keir Fraser, Matt Wilson, Matthew Rushton, Andrew Cooper, Tim Deegan, Jan Beulich, xen-devel, Malcolm Crossley On Wed, Mar 26, 2014 at 12:43:25PM +0000, David Vrabel wrote: > On 26/03/14 10:48, Matt Wilson wrote: > > On Wed, Mar 26, 2014 at 10:44:18AM +0000, David Vrabel wrote: > >> On 26/03/14 10:17, Matt Wilson wrote: > >>> > >>> Konrad, what's the possibility of fixing this in Linux Xen PV setup > >>> code? I think it'd be a matter batching up pages and doing larger > >>> order allocations in linux/arch/x86/xen/setup.c:xen_do_chunk(), > >>> falling back to smaller pages if allocations fail due to > >>> fragmentation, etc. > >> > >> We plan to fix problems caused by non-machine-contiguous memory by > >> setting up the IOMMU to have 1:1 bus to pseudo-physical mappings. This > >> would avoid using the swiotlb always[1], regardless of the machine > >> layout of dom0 or the driver domain. > >> > >> I think I would prefer this approach rather than making xen/setup.c even > >> more horribly complicated. > > > > I imagine that some users will not want to run dom0 under an IOMMU. If > > changing Linux Xen PV setup is (rightly) objectionable due to > > complexity, perhaps this small change to the hypervisor is a better > > short-term fix. > > Users who are not using the IOMMU for performance reasons but are > complaining about swiotlb costs? I'm not sure that's an interesting set > of users... You have to admit that a configuration that avoids both IOMMU and bouncing in swiotlb will be the best performance in many scenarios, don't you think? --msw ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 10:17 ` Matt Wilson 2014-03-26 10:44 ` David Vrabel @ 2014-03-26 15:08 ` Konrad Rzeszutek Wilk 2014-03-26 15:15 ` Matt Wilson 1 sibling, 1 reply; 55+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-03-26 15:08 UTC (permalink / raw) To: Matt Wilson Cc: Keir Fraser, Matt Wilson, Matthew Rushton, Andrew Cooper, Tim Deegan, Jan Beulich, xen-devel On Wed, Mar 26, 2014 at 12:17:53PM +0200, Matt Wilson wrote: > On Wed, Mar 26, 2014 at 10:55:33AM +0100, Tim Deegan wrote: > > Hi, > > > > At 13:09 -0700 on 25 Mar (1395749353), Matthew Rushton wrote: > > > On 03/25/14 06:27, Matt Wilson wrote: > > > > On Tue, Mar 25, 2014 at 01:19:22PM +0100, Tim Deegan wrote: > > > >> At 13:22 +0200 on 25 Mar (1395750124), Matt Wilson wrote: > > > >>> From: Matt Rushton <mrushton@amazon.com> > > > >>> > > > >>> This patch makes the Xen heap allocator use the first half of higher > > > >>> order chunks instead of the second half when breaking them down for > > > >>> smaller order allocations. > > > >>> > > > >>> Linux currently remaps the memory overlapping PCI space one page at a > > > >>> time. Before this change this resulted in the mfns being allocated in > > > >>> reverse order and led to discontiguous dom0 memory. This forced dom0 > > > >>> to use bounce buffers for doing DMA and resulted in poor performance. > > > >> This seems like something better fixed on the dom0 side, by asking > > > >> explicitly for contiguous memory in cases where it makes a difference. > > > >> On the Xen side, this change seems harmless, but we might like to keep > > > >> the explicitly reversed allocation on debug builds, to flush out > > > >> guests that rely on their memory being contiguous. > > > > Yes, I think that retaining the reverse allocation on debug builds is > > > > fine. I'd like Konrad's take on if it's better or possible to fix this > > > > on the Linux side. > > > > > > I considered fixing it in Linux but this was a more straight forward > > > change with no downside as far as I can tell. I see no reason in not > > > fixing it in both places but this at least behaves more reasonably for > > > one potential use case. I'm also interested in other opinions. > > > > Well, I'm happy enough with changing Xen (though it's common code so > > you'll need Keir's ack anyway rather than mine), since as you say it > > happens to make one use case a bit better and is otherwise harmless. > > But that comes with a stinking great warning: > > Anyone can Ack or Nack, but I wouldn't want to move forward on a > change like this without Keir's Ack. :-) > > > - This is not 'fixing' anything in Xen because Xen is doing exactly > > what dom0 asks for in the current code; and conversely > > > > - dom0 (and other guests) _must_not_ rely on it, whether for > > performance or correctness. Xen might change its page allocator at > > some point in the future, for any reason, and if linux perf starts > > sucking when that happens, that's (still) a linux bug. > > I agree with both of these. This was just the "least change" patch to > a particular problem we observed. > > Konrad, what's the possibility of fixing this in Linux Xen PV setup > code? I think it'd be a matter batching up pages and doing larger > order allocations in linux/arch/x86/xen/setup.c:xen_do_chunk(), > falling back to smaller pages if allocations fail due to > fragmentation, etc. Could you elaborate a bit more on the use-case please? 
My understanding is that most drivers use a scatter-gather list - in which case it does not matter if the underlying MFNs in the PFN space are not contiguous. But I presume the issue you are hitting is with drivers doing dma_map_page and the page is not 4KB but rather large (a compound page). Is that the problem you have observed? Thanks. > > --msw ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 15:08 ` Konrad Rzeszutek Wilk @ 2014-03-26 15:15 ` Matt Wilson 2014-03-26 15:59 ` Matthew Rushton 2014-03-26 16:34 ` Konrad Rzeszutek Wilk 0 siblings, 2 replies; 55+ messages in thread From: Matt Wilson @ 2014-03-26 15:15 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Keir Fraser, Matt Wilson, Matthew Rushton, Andrew Cooper, Tim Deegan, Jan Beulich, xen-devel On Wed, Mar 26, 2014 at 11:08:01AM -0400, Konrad Rzeszutek Wilk wrote: > > Could you elaborate a bit more on the use-case please? > My understanding is that most drivers use a scatter gather list - in which > case it does not matter if the underlaying MFNs in the PFNs spare are > not contingous. > > But I presume the issue you are hitting is with drivers doing dma_map_page > and the page is not 4KB but rather large (compound page). Is that the > problem you have observed? Drivers are using very large size arguments to dma_alloc_coherent() for things like RX and TX descriptor rings. --msw ^ permalink raw reply [flat|nested] 55+ messages in thread
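For reference, the coherent-allocation pattern being referred to usually looks something like this in a NIC driver (the structure and function names below are illustrative, not taken from any particular driver):

    #include <linux/types.h>
    #include <linux/dma-mapping.h>

    struct my_tx_desc { __le64 addr; __le32 len_flags; __le32 status; };

    struct my_tx_ring {
        struct my_tx_desc *desc;   /* descriptor array seen by the NIC */
        dma_addr_t dma;            /* bus address programmed into the NIC */
        unsigned int count;        /* number of descriptors, often 256-4096 */
    };

    /* One coherent allocation covering the whole ring; with a few thousand
     * descriptors this is tens of kilobytes, far larger than one page.
     * Under Xen the swiotlb backs coherent allocations with machine-
     * contiguous memory, so (as noted later in the thread) these are not
     * the allocations that end up bouncing. */
    static int my_alloc_tx_ring(struct device *dev, struct my_tx_ring *ring)
    {
        size_t size = ring->count * sizeof(*ring->desc);

        ring->desc = dma_alloc_coherent(dev, size, &ring->dma, GFP_KERNEL);
        return ring->desc ? 0 : -ENOMEM;
    }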
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 15:15 ` Matt Wilson @ 2014-03-26 15:59 ` Matthew Rushton 2014-03-26 16:36 ` Konrad Rzeszutek Wilk 2014-03-26 16:34 ` Konrad Rzeszutek Wilk 1 sibling, 1 reply; 55+ messages in thread From: Matthew Rushton @ 2014-03-26 15:59 UTC (permalink / raw) To: Matt Wilson, Konrad Rzeszutek Wilk Cc: Keir Fraser, Jan Beulich, Andrew Cooper, Tim Deegan, Matt Wilson, xen-devel On 03/26/14 08:15, Matt Wilson wrote: > On Wed, Mar 26, 2014 at 11:08:01AM -0400, Konrad Rzeszutek Wilk wrote: >> Could you elaborate a bit more on the use-case please? >> My understanding is that most drivers use a scatter gather list - in which >> case it does not matter if the underlaying MFNs in the PFNs spare are >> not contingous. >> >> But I presume the issue you are hitting is with drivers doing dma_map_page >> and the page is not 4KB but rather large (compound page). Is that the >> problem you have observed? > Drivers are using very large size arguments to dma_alloc_coherent() > for things like RX and TX descriptor rings. > > --msw It's the dma streaming api I've noticed the problem with, so dma_map_single(). Applicable swiotlb code would be xen_swiotlb_map_page() and range_straddles_page_boundary(). So yes for larger buffers it can cause bouncing. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 15:59 ` Matthew Rushton @ 2014-03-26 16:36 ` Konrad Rzeszutek Wilk 2014-03-26 17:47 ` Matthew Rushton 0 siblings, 1 reply; 55+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-03-26 16:36 UTC (permalink / raw) To: Matthew Rushton Cc: Keir Fraser, Matt Wilson, Matt Wilson, Tim Deegan, Jan Beulich, Andrew Cooper, xen-devel On Wed, Mar 26, 2014 at 08:59:04AM -0700, Matthew Rushton wrote: > On 03/26/14 08:15, Matt Wilson wrote: > >On Wed, Mar 26, 2014 at 11:08:01AM -0400, Konrad Rzeszutek Wilk wrote: > >>Could you elaborate a bit more on the use-case please? > >>My understanding is that most drivers use a scatter gather list - in which > >>case it does not matter if the underlaying MFNs in the PFNs spare are > >>not contingous. > >> > >>But I presume the issue you are hitting is with drivers doing dma_map_page > >>and the page is not 4KB but rather large (compound page). Is that the > >>problem you have observed? > >Drivers are using very large size arguments to dma_alloc_coherent() > >for things like RX and TX descriptor rings. Large size like larger than 512kB? That would also cause problems on baremetal then when swiotlb is activated I believe. > > > >--msw > > It's the dma streaming api I've noticed the problem with, so > dma_map_single(). Applicable swiotlb code would be > xen_swiotlb_map_page() and range_straddles_page_boundary(). So yes > for larger buffers it can cause bouncing. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 16:36 ` Konrad Rzeszutek Wilk @ 2014-03-26 17:47 ` Matthew Rushton 2014-03-26 17:56 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 55+ messages in thread From: Matthew Rushton @ 2014-03-26 17:47 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Keir Fraser, Matt Wilson, Matt Wilson, Tim Deegan, Jan Beulich, Andrew Cooper, xen-devel On 03/26/14 09:36, Konrad Rzeszutek Wilk wrote: > On Wed, Mar 26, 2014 at 08:59:04AM -0700, Matthew Rushton wrote: >> On 03/26/14 08:15, Matt Wilson wrote: >>> On Wed, Mar 26, 2014 at 11:08:01AM -0400, Konrad Rzeszutek Wilk wrote: >>>> Could you elaborate a bit more on the use-case please? >>>> My understanding is that most drivers use a scatter gather list - in which >>>> case it does not matter if the underlaying MFNs in the PFNs spare are >>>> not contingous. >>>> >>>> But I presume the issue you are hitting is with drivers doing dma_map_page >>>> and the page is not 4KB but rather large (compound page). Is that the >>>> problem you have observed? >>> Drivers are using very large size arguments to dma_alloc_coherent() >>> for things like RX and TX descriptor rings. > Large size like larger than 512kB? That would also cause problems > on baremetal then when swiotlb is activated I believe. I was looking at network IO performance so the buffers would not have been that large. I think large in this context is relative to the 4k page size and the odds of the buffer spanning a page boundary. For context I saw ~5-10% performance increase with guest network throughput by avoiding bounce buffers and also saw dom0 tcp streaming performance go from ~6Gb/s to over 9Gb/s on my test setup with a 10Gb NIC. > >>> --msw >> It's the dma streaming api I've noticed the problem with, so >> dma_map_single(). Applicable swiotlb code would be >> xen_swiotlb_map_page() and range_straddles_page_boundary(). So yes >> for larger buffers it can cause bouncing. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 17:47 ` Matthew Rushton @ 2014-03-26 17:56 ` Konrad Rzeszutek Wilk 2014-03-26 22:15 ` Matthew Rushton 0 siblings, 1 reply; 55+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-03-26 17:56 UTC (permalink / raw) To: Matthew Rushton Cc: Keir Fraser, Matt Wilson, Matt Wilson, Tim Deegan, Jan Beulich, Andrew Cooper, xen-devel On Wed, Mar 26, 2014 at 10:47:44AM -0700, Matthew Rushton wrote: > On 03/26/14 09:36, Konrad Rzeszutek Wilk wrote: > >On Wed, Mar 26, 2014 at 08:59:04AM -0700, Matthew Rushton wrote: > >>On 03/26/14 08:15, Matt Wilson wrote: > >>>On Wed, Mar 26, 2014 at 11:08:01AM -0400, Konrad Rzeszutek Wilk wrote: > >>>>Could you elaborate a bit more on the use-case please? > >>>>My understanding is that most drivers use a scatter gather list - in which > >>>>case it does not matter if the underlaying MFNs in the PFNs spare are > >>>>not contingous. > >>>> > >>>>But I presume the issue you are hitting is with drivers doing dma_map_page > >>>>and the page is not 4KB but rather large (compound page). Is that the > >>>>problem you have observed? > >>>Drivers are using very large size arguments to dma_alloc_coherent() > >>>for things like RX and TX descriptor rings. > >Large size like larger than 512kB? That would also cause problems > >on baremetal then when swiotlb is activated I believe. > > I was looking at network IO performance so the buffers would not > have been that large. I think large in this context is relative to > the 4k page size and the odds of the buffer spanning a page > boundary. For context I saw ~5-10% performance increase with guest > network throughput by avoiding bounce buffers and also saw dom0 tcp > streaming performance go from ~6Gb/s to over 9Gb/s on my test setup > with a 10Gb NIC. OK, but that would not be the dma_alloc_coherent ones then? That sounds more like the generic TCP mechanism allocated 64KB pages instead of 4KB and used those. Did you try looking at this hack that Ian proposed a long time ago to verify that it is said problem? https://lkml.org/lkml/2013/9/4/540 > > > > >>>--msw > >>It's the dma streaming api I've noticed the problem with, so > >>dma_map_single(). Applicable swiotlb code would be > >>xen_swiotlb_map_page() and range_straddles_page_boundary(). So yes > >>for larger buffers it can cause bouncing. > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 17:56 ` Konrad Rzeszutek Wilk @ 2014-03-26 22:15 ` Matthew Rushton 2014-03-28 17:02 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 55+ messages in thread From: Matthew Rushton @ 2014-03-26 22:15 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Keir Fraser, Matt Wilson, Matt Wilson, Tim Deegan, Jan Beulich, Andrew Cooper, xen-devel On 03/26/14 10:56, Konrad Rzeszutek Wilk wrote: > On Wed, Mar 26, 2014 at 10:47:44AM -0700, Matthew Rushton wrote: >> On 03/26/14 09:36, Konrad Rzeszutek Wilk wrote: >>> On Wed, Mar 26, 2014 at 08:59:04AM -0700, Matthew Rushton wrote: >>>> On 03/26/14 08:15, Matt Wilson wrote: >>>>> On Wed, Mar 26, 2014 at 11:08:01AM -0400, Konrad Rzeszutek Wilk wrote: >>>>>> Could you elaborate a bit more on the use-case please? >>>>>> My understanding is that most drivers use a scatter gather list - in which >>>>>> case it does not matter if the underlaying MFNs in the PFNs spare are >>>>>> not contingous. >>>>>> >>>>>> But I presume the issue you are hitting is with drivers doing dma_map_page >>>>>> and the page is not 4KB but rather large (compound page). Is that the >>>>>> problem you have observed? >>>>> Drivers are using very large size arguments to dma_alloc_coherent() >>>>> for things like RX and TX descriptor rings. >>> Large size like larger than 512kB? That would also cause problems >>> on baremetal then when swiotlb is activated I believe. >> I was looking at network IO performance so the buffers would not >> have been that large. I think large in this context is relative to >> the 4k page size and the odds of the buffer spanning a page >> boundary. For context I saw ~5-10% performance increase with guest >> network throughput by avoiding bounce buffers and also saw dom0 tcp >> streaming performance go from ~6Gb/s to over 9Gb/s on my test setup >> with a 10Gb NIC. > OK, but that would not be the dma_alloc_coherent ones then? That sounds > more like the generic TCP mechanism allocated 64KB pages instead of 4KB > and used those. > > Did you try looking at this hack that Ian proposed a long time ago > to verify that it is said problem? > > https://lkml.org/lkml/2013/9/4/540 > Yes I had seen that and intially had the same reaction but the change was relatively recent and not relevant. I *think* all the coherent allocations are ok since the swiotlb makes them contiguous. The problem comes with the use of the streaming api. As one example with jumbo frames enabled a driver might use larger rx buffers which triggers the problem. I think the right thing to do is to make the dma streaming api work better with larger buffers on dom0. That way it works across all drivers and device types regardless of how they were designed. >>>>> --msw >>>> It's the dma streaming api I've noticed the problem with, so >>>> dma_map_single(). Applicable swiotlb code would be >>>> xen_swiotlb_map_page() and range_straddles_page_boundary(). So yes >>>> for larger buffers it can cause bouncing. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 22:15 ` Matthew Rushton @ 2014-03-28 17:02 ` Konrad Rzeszutek Wilk 2014-03-28 22:06 ` Matthew Rushton 0 siblings, 1 reply; 55+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-03-28 17:02 UTC (permalink / raw) To: Matthew Rushton Cc: Keir Fraser, Matt Wilson, Matt Wilson, Tim Deegan, Jan Beulich, Andrew Cooper, xen-devel On Wed, Mar 26, 2014 at 03:15:42PM -0700, Matthew Rushton wrote: > On 03/26/14 10:56, Konrad Rzeszutek Wilk wrote: > >On Wed, Mar 26, 2014 at 10:47:44AM -0700, Matthew Rushton wrote: > >>On 03/26/14 09:36, Konrad Rzeszutek Wilk wrote: > >>>On Wed, Mar 26, 2014 at 08:59:04AM -0700, Matthew Rushton wrote: > >>>>On 03/26/14 08:15, Matt Wilson wrote: > >>>>>On Wed, Mar 26, 2014 at 11:08:01AM -0400, Konrad Rzeszutek Wilk wrote: > >>>>>>Could you elaborate a bit more on the use-case please? > >>>>>>My understanding is that most drivers use a scatter gather list - in which > >>>>>>case it does not matter if the underlaying MFNs in the PFNs spare are > >>>>>>not contingous. > >>>>>> > >>>>>>But I presume the issue you are hitting is with drivers doing dma_map_page > >>>>>>and the page is not 4KB but rather large (compound page). Is that the > >>>>>>problem you have observed? > >>>>>Drivers are using very large size arguments to dma_alloc_coherent() > >>>>>for things like RX and TX descriptor rings. > >>>Large size like larger than 512kB? That would also cause problems > >>>on baremetal then when swiotlb is activated I believe. > >>I was looking at network IO performance so the buffers would not > >>have been that large. I think large in this context is relative to > >>the 4k page size and the odds of the buffer spanning a page > >>boundary. For context I saw ~5-10% performance increase with guest > >>network throughput by avoiding bounce buffers and also saw dom0 tcp > >>streaming performance go from ~6Gb/s to over 9Gb/s on my test setup > >>with a 10Gb NIC. > >OK, but that would not be the dma_alloc_coherent ones then? That sounds > >more like the generic TCP mechanism allocated 64KB pages instead of 4KB > >and used those. > > > >Did you try looking at this hack that Ian proposed a long time ago > >to verify that it is said problem? > > > >https://lkml.org/lkml/2013/9/4/540 > > > > Yes I had seen that and intially had the same reaction but the > change was relatively recent and not relevant. I *think* all the > coherent allocations are ok since the swiotlb makes them contiguous. > The problem comes with the use of the streaming api. As one example > with jumbo frames enabled a driver might use larger rx buffers which > triggers the problem. > > I think the right thing to do is to make the dma streaming api work > better with larger buffers on dom0. That way it works across all OK. > drivers and device types regardless of how they were designed. Can you point me to an example of the DMA streaming API? I am not sure if you mean 'streaming API' as scatter gather operations using DMA API? Is there a particular easy way for me to reproduce this. I have to say I hadn't enabled Jumbo frame on my box since I am not even sure if the switch I have can do it. Is there a idiots-punch-list of how to reproduce this? Thanks! > > >>>>>--msw > >>>>It's the dma streaming api I've noticed the problem with, so > >>>>dma_map_single(). Applicable swiotlb code would be > >>>>xen_swiotlb_map_page() and range_straddles_page_boundary(). So yes > >>>>for larger buffers it can cause bouncing. 
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-28 17:02 ` Konrad Rzeszutek Wilk @ 2014-03-28 22:06 ` Matthew Rushton 2014-03-31 14:15 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 55+ messages in thread From: Matthew Rushton @ 2014-03-28 22:06 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Keir Fraser, Matt Wilson, Matt Wilson, Tim Deegan, Jan Beulich, Andrew Cooper, xen-devel On 03/28/14 10:02, Konrad Rzeszutek Wilk wrote: > On Wed, Mar 26, 2014 at 03:15:42PM -0700, Matthew Rushton wrote: >> On 03/26/14 10:56, Konrad Rzeszutek Wilk wrote: >>> On Wed, Mar 26, 2014 at 10:47:44AM -0700, Matthew Rushton wrote: >>>> On 03/26/14 09:36, Konrad Rzeszutek Wilk wrote: >>>>> On Wed, Mar 26, 2014 at 08:59:04AM -0700, Matthew Rushton wrote: >>>>>> On 03/26/14 08:15, Matt Wilson wrote: >>>>>>> On Wed, Mar 26, 2014 at 11:08:01AM -0400, Konrad Rzeszutek Wilk wrote: >>>>>>>> Could you elaborate a bit more on the use-case please? >>>>>>>> My understanding is that most drivers use a scatter gather list - in which >>>>>>>> case it does not matter if the underlaying MFNs in the PFNs spare are >>>>>>>> not contingous. >>>>>>>> >>>>>>>> But I presume the issue you are hitting is with drivers doing dma_map_page >>>>>>>> and the page is not 4KB but rather large (compound page). Is that the >>>>>>>> problem you have observed? >>>>>>> Drivers are using very large size arguments to dma_alloc_coherent() >>>>>>> for things like RX and TX descriptor rings. >>>>> Large size like larger than 512kB? That would also cause problems >>>>> on baremetal then when swiotlb is activated I believe. >>>> I was looking at network IO performance so the buffers would not >>>> have been that large. I think large in this context is relative to >>>> the 4k page size and the odds of the buffer spanning a page >>>> boundary. For context I saw ~5-10% performance increase with guest >>>> network throughput by avoiding bounce buffers and also saw dom0 tcp >>>> streaming performance go from ~6Gb/s to over 9Gb/s on my test setup >>>> with a 10Gb NIC. >>> OK, but that would not be the dma_alloc_coherent ones then? That sounds >>> more like the generic TCP mechanism allocated 64KB pages instead of 4KB >>> and used those. >>> >>> Did you try looking at this hack that Ian proposed a long time ago >>> to verify that it is said problem? >>> >>> https://lkml.org/lkml/2013/9/4/540 >>> >> Yes I had seen that and intially had the same reaction but the >> change was relatively recent and not relevant. I *think* all the >> coherent allocations are ok since the swiotlb makes them contiguous. >> The problem comes with the use of the streaming api. As one example >> with jumbo frames enabled a driver might use larger rx buffers which >> triggers the problem. >> >> I think the right thing to do is to make the dma streaming api work >> better with larger buffers on dom0. That way it works across all > OK. >> drivers and device types regardless of how they were designed. > Can you point me to an example of the DMA streaming API? > > I am not sure if you mean 'streaming API' as scatter gather operations > using DMA API? > > Is there a particular easy way for me to reproduce this. I have > to say I hadn't enabled Jumbo frame on my box since I am not even > sure if the switch I have can do it. Is there a idiots-punch-list > of how to reproduce this? > > Thanks! By streaming API I'm just referring to drivers that use dma_map_single/dma_unmap_single on every buffer instead of using coherent allocations. 
So not related to sg in my case. If you want an example of this you can look at the bnx2x Broadcom driver. To reproduce this at a minimum you'll need to have: 1) Enough dom0 memory so it overlaps with PCI space and gets remapped by Linux at boot 2) A driver that uses dma_map_single/dma_unmap_single 3) Large enough buffers so that they span page boundaries Things that may help with 3 are enabling jumbos and various offload settings in either guests or dom0. >>>>>>> --msw >>>>>> It's the dma streaming api I've noticed the problem with, so >>>>>> dma_map_single(). Applicable swiotlb code would be >>>>>> xen_swiotlb_map_page() and range_straddles_page_boundary(). So yes >>>>>> for larger buffers it can cause bouncing. ^ permalink raw reply [flat|nested] 55+ messages in thread
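On point 3), the decisive question is whether a streamed buffer that crosses a page boundary is still machine-contiguous underneath. The helper below is a simplified illustration written for this thread, not a copy of the in-tree range_straddles_page_boundary(); pfn_to_mfn() is the real PV lookup, the function name is invented.

#include <linux/mm.h>
#include <linux/types.h>
#include <asm/xen/page.h>	/* pfn_to_mfn() on a PV dom0 */

/* Simplified: a streaming mapping can go straight to the device only if it
 * fits in one page or its PFNs are backed by consecutive MFNs; otherwise
 * xen-swiotlb has to bounce it through its own contiguous memory. */
static bool example_needs_bounce(phys_addr_t paddr, size_t size)
{
	unsigned long first_pfn = paddr >> PAGE_SHIFT;
	unsigned long last_pfn = (paddr + size - 1) >> PAGE_SHIFT;
	unsigned long pfn;

	if (first_pfn == last_pfn)
		return false;			/* fits inside a single 4k page */

	for (pfn = first_pfn; pfn < last_pfn; pfn++)
		if (pfn_to_mfn(pfn) + 1 != pfn_to_mfn(pfn + 1))
			return true;		/* discontiguous MFNs: must bounce */

	return false;
}

With the remapped PCI-hole memory handed out in reverse MFN order, the inner test fails for essentially every multi-page buffer, which is where the bounce-buffer overhead described above comes from.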
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-28 22:06 ` Matthew Rushton @ 2014-03-31 14:15 ` Konrad Rzeszutek Wilk 2014-04-01 3:25 ` Matthew Rushton 0 siblings, 1 reply; 55+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-03-31 14:15 UTC (permalink / raw) To: Matthew Rushton Cc: Keir Fraser, Matt Wilson, Matt Wilson, Tim Deegan, Jan Beulich, Andrew Cooper, xen-devel On Fri, Mar 28, 2014 at 03:06:23PM -0700, Matthew Rushton wrote: > On 03/28/14 10:02, Konrad Rzeszutek Wilk wrote: > >On Wed, Mar 26, 2014 at 03:15:42PM -0700, Matthew Rushton wrote: > >>On 03/26/14 10:56, Konrad Rzeszutek Wilk wrote: > >>>On Wed, Mar 26, 2014 at 10:47:44AM -0700, Matthew Rushton wrote: > >>>>On 03/26/14 09:36, Konrad Rzeszutek Wilk wrote: > >>>>>On Wed, Mar 26, 2014 at 08:59:04AM -0700, Matthew Rushton wrote: > >>>>>>On 03/26/14 08:15, Matt Wilson wrote: > >>>>>>>On Wed, Mar 26, 2014 at 11:08:01AM -0400, Konrad Rzeszutek Wilk wrote: > >>>>>>>>Could you elaborate a bit more on the use-case please? > >>>>>>>>My understanding is that most drivers use a scatter gather list - in which > >>>>>>>>case it does not matter if the underlaying MFNs in the PFNs spare are > >>>>>>>>not contingous. > >>>>>>>> > >>>>>>>>But I presume the issue you are hitting is with drivers doing dma_map_page > >>>>>>>>and the page is not 4KB but rather large (compound page). Is that the > >>>>>>>>problem you have observed? > >>>>>>>Drivers are using very large size arguments to dma_alloc_coherent() > >>>>>>>for things like RX and TX descriptor rings. > >>>>>Large size like larger than 512kB? That would also cause problems > >>>>>on baremetal then when swiotlb is activated I believe. > >>>>I was looking at network IO performance so the buffers would not > >>>>have been that large. I think large in this context is relative to > >>>>the 4k page size and the odds of the buffer spanning a page > >>>>boundary. For context I saw ~5-10% performance increase with guest > >>>>network throughput by avoiding bounce buffers and also saw dom0 tcp > >>>>streaming performance go from ~6Gb/s to over 9Gb/s on my test setup > >>>>with a 10Gb NIC. > >>>OK, but that would not be the dma_alloc_coherent ones then? That sounds > >>>more like the generic TCP mechanism allocated 64KB pages instead of 4KB > >>>and used those. > >>> > >>>Did you try looking at this hack that Ian proposed a long time ago > >>>to verify that it is said problem? > >>> > >>>https://lkml.org/lkml/2013/9/4/540 > >>> > >>Yes I had seen that and intially had the same reaction but the > >>change was relatively recent and not relevant. I *think* all the > >>coherent allocations are ok since the swiotlb makes them contiguous. > >>The problem comes with the use of the streaming api. As one example > >>with jumbo frames enabled a driver might use larger rx buffers which > >>triggers the problem. > >> > >>I think the right thing to do is to make the dma streaming api work > >>better with larger buffers on dom0. That way it works across all > >OK. > >>drivers and device types regardless of how they were designed. > >Can you point me to an example of the DMA streaming API? > > > >I am not sure if you mean 'streaming API' as scatter gather operations > >using DMA API? > > > >Is there a particular easy way for me to reproduce this. I have > >to say I hadn't enabled Jumbo frame on my box since I am not even > >sure if the switch I have can do it. Is there a idiots-punch-list > >of how to reproduce this? > > > >Thanks! 
> > By streaming API I'm just referring to drivers that use > dma_map_single/dma_unmap_single on every buffer instead of using > coherent allocations. So not related to sg in my case. If you want > an example of this you can look at the bnx2x Broadcom driver. To > reproduce this at a minimum you'll need to have: > > 1) Enough dom0 memory so it overlaps with PCI space and gets > remapped by Linux at boot Hm? Could you give a bit details? As in is the: [ 0.000000] Allocating PCI resources starting at 7f800000 (gap: 7f800000:7c800000) value? As in that value should be in the PCI space and I am not sure how your dom0 memory overlaps? If you do say dom0_mem=max:3G the kernel will balloon out of the MMIO regions and the gaps (so PCI space) and put that memory past the 4GB. So the MMIO regions end up being MMIO regions. > 2) A driver that uses dma_map_single/dma_unmap_single OK, > 3) Large enough buffers so that they span page boundaries Um, right, so I think the get_order hack that was posted would help in that so you would not span page boundaries? > > Things that may help with 3 are enabling jumbos and various offload > settings in either guests or dom0. If you booted baremetal with 'iommu=soft swiotlb=force' the same problem should show up - at least based on the 2) and 3) issue. Well, except that there are no guests but one should be able to trigger this. What do you use for driving traffic? iperf with certain parameters? Thanks! > > >>>>>>>--msw > >>>>>>It's the dma streaming api I've noticed the problem with, so > >>>>>>dma_map_single(). Applicable swiotlb code would be > >>>>>>xen_swiotlb_map_page() and range_straddles_page_boundary(). So yes > >>>>>>for larger buffers it can cause bouncing. > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-31 14:15 ` Konrad Rzeszutek Wilk @ 2014-04-01 3:25 ` Matthew Rushton 2014-04-01 10:48 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 55+ messages in thread From: Matthew Rushton @ 2014-04-01 3:25 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Keir Fraser, Matt Wilson, Matt Wilson, Tim Deegan, Jan Beulich, Andrew Cooper, xen-devel On 03/31/14 07:15, Konrad Rzeszutek Wilk wrote: > On Fri, Mar 28, 2014 at 03:06:23PM -0700, Matthew Rushton wrote: >> On 03/28/14 10:02, Konrad Rzeszutek Wilk wrote: >>> On Wed, Mar 26, 2014 at 03:15:42PM -0700, Matthew Rushton wrote: >>>> On 03/26/14 10:56, Konrad Rzeszutek Wilk wrote: >>>>> On Wed, Mar 26, 2014 at 10:47:44AM -0700, Matthew Rushton wrote: >>>>>> On 03/26/14 09:36, Konrad Rzeszutek Wilk wrote: >>>>>>> On Wed, Mar 26, 2014 at 08:59:04AM -0700, Matthew Rushton wrote: >>>>>>>> On 03/26/14 08:15, Matt Wilson wrote: >>>>>>>>> On Wed, Mar 26, 2014 at 11:08:01AM -0400, Konrad Rzeszutek Wilk wrote: >>>>>>>>>> Could you elaborate a bit more on the use-case please? >>>>>>>>>> My understanding is that most drivers use a scatter gather list - in which >>>>>>>>>> case it does not matter if the underlaying MFNs in the PFNs spare are >>>>>>>>>> not contingous. >>>>>>>>>> >>>>>>>>>> But I presume the issue you are hitting is with drivers doing dma_map_page >>>>>>>>>> and the page is not 4KB but rather large (compound page). Is that the >>>>>>>>>> problem you have observed? >>>>>>>>> Drivers are using very large size arguments to dma_alloc_coherent() >>>>>>>>> for things like RX and TX descriptor rings. >>>>>>> Large size like larger than 512kB? That would also cause problems >>>>>>> on baremetal then when swiotlb is activated I believe. >>>>>> I was looking at network IO performance so the buffers would not >>>>>> have been that large. I think large in this context is relative to >>>>>> the 4k page size and the odds of the buffer spanning a page >>>>>> boundary. For context I saw ~5-10% performance increase with guest >>>>>> network throughput by avoiding bounce buffers and also saw dom0 tcp >>>>>> streaming performance go from ~6Gb/s to over 9Gb/s on my test setup >>>>>> with a 10Gb NIC. >>>>> OK, but that would not be the dma_alloc_coherent ones then? That sounds >>>>> more like the generic TCP mechanism allocated 64KB pages instead of 4KB >>>>> and used those. >>>>> >>>>> Did you try looking at this hack that Ian proposed a long time ago >>>>> to verify that it is said problem? >>>>> >>>>> https://lkml.org/lkml/2013/9/4/540 >>>>> >>>> Yes I had seen that and intially had the same reaction but the >>>> change was relatively recent and not relevant. I *think* all the >>>> coherent allocations are ok since the swiotlb makes them contiguous. >>>> The problem comes with the use of the streaming api. As one example >>>> with jumbo frames enabled a driver might use larger rx buffers which >>>> triggers the problem. >>>> >>>> I think the right thing to do is to make the dma streaming api work >>>> better with larger buffers on dom0. That way it works across all >>> OK. >>>> drivers and device types regardless of how they were designed. >>> Can you point me to an example of the DMA streaming API? >>> >>> I am not sure if you mean 'streaming API' as scatter gather operations >>> using DMA API? >>> >>> Is there a particular easy way for me to reproduce this. I have >>> to say I hadn't enabled Jumbo frame on my box since I am not even >>> sure if the switch I have can do it. 
Is there a idiots-punch-list >>> of how to reproduce this? >>> >>> Thanks! >> By streaming API I'm just referring to drivers that use >> dma_map_single/dma_unmap_single on every buffer instead of using >> coherent allocations. So not related to sg in my case. If you want >> an example of this you can look at the bnx2x Broadcom driver. To >> reproduce this at a minimum you'll need to have: >> >> 1) Enough dom0 memory so it overlaps with PCI space and gets >> remapped by Linux at boot > Hm? Could you give a bit details? As in is the: > > [ 0.000000] Allocating PCI resources starting at 7f800000 (gap: 7f800000:7c800000) > > value? > > As in that value should be in the PCI space and I am not sure > how your dom0 memory overlaps? If you do say dom0_mem=max:3G > the kernel will balloon out of the MMIO regions and the gaps (so PCI space) > and put that memory past the 4GB. So the MMIO regions end up > being MMIO regions. You should see the message from xen_do_chunk() about adding pages back. Something along the lines of: Populating 380000-401fb6 pfn range: 542250 pages added These pages get added in reverse order (mfns reversed) without my proposed Xen change. >> 2) A driver that uses dma_map_single/dma_unmap_single > OK, >> 3) Large enough buffers so that they span page boundaries > Um, right, so I think the get_order hack that was posted would > help in that so you would not span page boundaries? That patch doesn't apply in my case but in principal you're right, any change that would decrease buffers spanning page boundaries would limit bounce buffer usage. >> Things that may help with 3 are enabling jumbos and various offload >> settings in either guests or dom0. > If you booted baremetal with 'iommu=soft swiotlb=force' the same > problem should show up - at least based on the 2) and 3) issue. > > Well, except that there are no guests but one should be able to trigger > this. If that forces the use of bounce buffers than it would be a similar net result if you wanted to see the performance overhead of doing the copies. > What do you use for driving traffic? iperf with certain parameters? I was using netperf. There weren't any magic params to trigger this. I believe with the default tcp stream test I ran into the issue. > > Thanks! Are there any concerns about the proposed Xen change as a reasonable work around for the current implementation? Thank you! >>>>>>>>> --msw >>>>>>>> It's the dma streaming api I've noticed the problem with, so >>>>>>>> dma_map_single(). Applicable swiotlb code would be >>>>>>>> xen_swiotlb_map_page() and range_straddles_page_boundary(). So yes >>>>>>>> for larger buffers it can cause bouncing. ^ permalink raw reply [flat|nested] 55+ messages in thread
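Pulling the reproduction hints from this exchange together, a rough recipe looks like the following; the dom0 memory size, interface name, and peer host are placeholders to adapt to the test machine.

  # Baremetal approximation of the bounce-buffer path:
  #   boot Linux with:   iommu=soft swiotlb=force
  #
  # Xen/dom0 case:
  #   Xen command line:  dom0_mem=16384M    (enough that dom0's PFN range
  #                      overlaps the PCI hole and gets remapped at boot)
  #   ip link set dev eth0 mtu 9000         # jumbo frames -> rx buffers span pages
  #   netperf -H <peer> -t TCP_STREAM       # default tcp stream test is enough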
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-01 3:25 ` Matthew Rushton @ 2014-04-01 10:48 ` Konrad Rzeszutek Wilk 2014-04-01 12:22 ` Tim Deegan 0 siblings, 1 reply; 55+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-04-01 10:48 UTC (permalink / raw) To: Matthew Rushton Cc: Keir Fraser, Matt Wilson, Matt Wilson, Tim Deegan, Jan Beulich, Andrew Cooper, xen-devel On Mon, Mar 31, 2014 at 08:25:43PM -0700, Matthew Rushton wrote: > On 03/31/14 07:15, Konrad Rzeszutek Wilk wrote: > >On Fri, Mar 28, 2014 at 03:06:23PM -0700, Matthew Rushton wrote: > >>On 03/28/14 10:02, Konrad Rzeszutek Wilk wrote: > >>>On Wed, Mar 26, 2014 at 03:15:42PM -0700, Matthew Rushton wrote: > >>>>On 03/26/14 10:56, Konrad Rzeszutek Wilk wrote: > >>>>>On Wed, Mar 26, 2014 at 10:47:44AM -0700, Matthew Rushton wrote: > >>>>>>On 03/26/14 09:36, Konrad Rzeszutek Wilk wrote: > >>>>>>>On Wed, Mar 26, 2014 at 08:59:04AM -0700, Matthew Rushton wrote: > >>>>>>>>On 03/26/14 08:15, Matt Wilson wrote: > >>>>>>>>>On Wed, Mar 26, 2014 at 11:08:01AM -0400, Konrad Rzeszutek Wilk wrote: > >>>>>>>>>>Could you elaborate a bit more on the use-case please? > >>>>>>>>>>My understanding is that most drivers use a scatter gather list - in which > >>>>>>>>>>case it does not matter if the underlaying MFNs in the PFNs spare are > >>>>>>>>>>not contingous. > >>>>>>>>>> > >>>>>>>>>>But I presume the issue you are hitting is with drivers doing dma_map_page > >>>>>>>>>>and the page is not 4KB but rather large (compound page). Is that the > >>>>>>>>>>problem you have observed? > >>>>>>>>>Drivers are using very large size arguments to dma_alloc_coherent() > >>>>>>>>>for things like RX and TX descriptor rings. > >>>>>>>Large size like larger than 512kB? That would also cause problems > >>>>>>>on baremetal then when swiotlb is activated I believe. > >>>>>>I was looking at network IO performance so the buffers would not > >>>>>>have been that large. I think large in this context is relative to > >>>>>>the 4k page size and the odds of the buffer spanning a page > >>>>>>boundary. For context I saw ~5-10% performance increase with guest > >>>>>>network throughput by avoiding bounce buffers and also saw dom0 tcp > >>>>>>streaming performance go from ~6Gb/s to over 9Gb/s on my test setup > >>>>>>with a 10Gb NIC. > >>>>>OK, but that would not be the dma_alloc_coherent ones then? That sounds > >>>>>more like the generic TCP mechanism allocated 64KB pages instead of 4KB > >>>>>and used those. > >>>>> > >>>>>Did you try looking at this hack that Ian proposed a long time ago > >>>>>to verify that it is said problem? > >>>>> > >>>>>https://lkml.org/lkml/2013/9/4/540 > >>>>> > >>>>Yes I had seen that and intially had the same reaction but the > >>>>change was relatively recent and not relevant. I *think* all the > >>>>coherent allocations are ok since the swiotlb makes them contiguous. > >>>>The problem comes with the use of the streaming api. As one example > >>>>with jumbo frames enabled a driver might use larger rx buffers which > >>>>triggers the problem. > >>>> > >>>>I think the right thing to do is to make the dma streaming api work > >>>>better with larger buffers on dom0. That way it works across all > >>>OK. > >>>>drivers and device types regardless of how they were designed. > >>>Can you point me to an example of the DMA streaming API? > >>> > >>>I am not sure if you mean 'streaming API' as scatter gather operations > >>>using DMA API? > >>> > >>>Is there a particular easy way for me to reproduce this. 
I have > >>>to say I hadn't enabled Jumbo frame on my box since I am not even > >>>sure if the switch I have can do it. Is there a idiots-punch-list > >>>of how to reproduce this? > >>> > >>>Thanks! > >>By streaming API I'm just referring to drivers that use > >>dma_map_single/dma_unmap_single on every buffer instead of using > >>coherent allocations. So not related to sg in my case. If you want > >>an example of this you can look at the bnx2x Broadcom driver. To > >>reproduce this at a minimum you'll need to have: > >> > >>1) Enough dom0 memory so it overlaps with PCI space and gets > >>remapped by Linux at boot > >Hm? Could you give a bit details? As in is the: > > > >[ 0.000000] Allocating PCI resources starting at 7f800000 (gap: 7f800000:7c800000) > > > >value? > > > >As in that value should be in the PCI space and I am not sure > >how your dom0 memory overlaps? If you do say dom0_mem=max:3G > >the kernel will balloon out of the MMIO regions and the gaps (so PCI space) > >and put that memory past the 4GB. So the MMIO regions end up > >being MMIO regions. > > You should see the message from xen_do_chunk() about adding pages > back. Something along the lines of: > > Populating 380000-401fb6 pfn range: 542250 pages added > > These pages get added in reverse order (mfns reversed) without my > proposed Xen change. > > >>2) A driver that uses dma_map_single/dma_unmap_single > >OK, > >>3) Large enough buffers so that they span page boundaries > >Um, right, so I think the get_order hack that was posted would > >help in that so you would not span page boundaries? > > That patch doesn't apply in my case but in principal you're right, > any change that would decrease buffers spanning page boundaries > would limit bounce buffer usage. > > >>Things that may help with 3 are enabling jumbos and various offload > >>settings in either guests or dom0. > >If you booted baremetal with 'iommu=soft swiotlb=force' the same > >problem should show up - at least based on the 2) and 3) issue. > > > >Well, except that there are no guests but one should be able to trigger > >this. > > If that forces the use of bounce buffers than it would be a similar > net result if you wanted to see the performance overhead of doing > the copies. > > >What do you use for driving traffic? iperf with certain parameters? > > I was using netperf. There weren't any magic params to trigger this. > I believe with the default tcp stream test I ran into the issue. > > > > > >Thanks! > > Are there any concerns about the proposed Xen change as a reasonable > work around for the current implementation? Thank you! So I finally understood what the concern was about it - the balloon mechanics get the pages in worst possible order. I am wondeirng if there is something on the Linux side we can do to tell Xen to give them to use in the proper order? Could we swap the order of xen_do_chunk so it starts from the end and goes to start? Would that help? Or maybe do an array of 512 chunks (I had an prototype patch like that floating around to speed this up)? > > >>>>>>>>>--msw > >>>>>>>>It's the dma streaming api I've noticed the problem with, so > >>>>>>>>dma_map_single(). Applicable swiotlb code would be > >>>>>>>>xen_swiotlb_map_page() and range_straddles_page_boundary(). So yes > >>>>>>>>for larger buffers it can cause bouncing. > ^ permalink raw reply [flat|nested] 55+ messages in thread
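For reference, the "array of 512 chunks" idea amounts to passing XENMEM_populate_physmap a batch of frames per hypercall rather than one page at a time. A rough sketch under that assumption is below; the batch size and function name are arbitrary, error handling is trimmed, and batching by itself only cuts the hypercall count -- the MFN ordering is still whatever the hypervisor allocator hands back.

#include <linux/kernel.h>
#include <xen/interface/memory.h>
#include <asm/xen/hypercall.h>
#include <asm/xen/page.h>

#define EXAMPLE_BATCH 512

/* Populate [start_pfn, start_pfn + count) with order-0 pages, EXAMPLE_BATCH
 * extents per XENMEM_populate_physmap call.  On return the frame list holds
 * the MFNs Xen chose, which must be written back into the P2M. */
static unsigned long example_populate_batched(unsigned long start_pfn, unsigned long count)
{
	static xen_pfn_t frames[EXAMPLE_BATCH];
	struct xen_memory_reservation res = {
		.extent_order = 0,
		.domid = DOMID_SELF,
	};
	unsigned long done = 0;

	while (done < count) {
		unsigned long i, n = min(count - done, (unsigned long)EXAMPLE_BATCH);
		int rc;

		for (i = 0; i < n; i++)
			frames[i] = start_pfn + done + i;	/* PFNs to back */

		set_xen_guest_handle(res.extent_start, frames);
		res.nr_extents = n;

		rc = HYPERVISOR_memory_op(XENMEM_populate_physmap, &res);
		if (rc <= 0)
			break;

		for (i = 0; i < (unsigned long)rc; i++)		/* frames[] now holds MFNs */
			set_phys_to_machine(start_pfn + done + i, frames[i]);
		done += rc;
	}
	return done;
}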
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-01 10:48 ` Konrad Rzeszutek Wilk @ 2014-04-01 12:22 ` Tim Deegan 2014-04-02 0:17 ` Matthew Rushton 0 siblings, 1 reply; 55+ messages in thread From: Tim Deegan @ 2014-04-01 12:22 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Keir Fraser, Matt Wilson, Matthew Rushton, Matt Wilson, Jan Beulich, Andrew Cooper, xen-devel At 06:48 -0400 on 01 Apr (1396331306), Konrad Rzeszutek Wilk wrote: > On Mon, Mar 31, 2014 at 08:25:43PM -0700, Matthew Rushton wrote: > > Are there any concerns about the proposed Xen change as a reasonable > > work around for the current implementation? Thank you! > > So I finally understood what the concern was about it - the balloon > mechanics get the pages in worst possible order. I am wondeirng if there > is something on the Linux side we can do to tell Xen to give them to use > in the proper order? The best way, at least from Xen's point of view, is to explicitly allocate contiguous pages in the cases where it'll make a difference AIUI linux already does this for some classes of dma-able memory. > Could we swap the order of xen_do_chunk so it starts from the end and > goes to start? Would that help? As long as we don't also change the default allocation order in Xen. :) In general, linux shouldn't rely on the order that Xen allocates memory, as that might change later. If the current API can't do what's needed, maybe we can add another allocator hypercall or flag? Cheers, Tim. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-01 12:22 ` Tim Deegan @ 2014-04-02 0:17 ` Matthew Rushton 2014-04-02 7:52 ` Jan Beulich 0 siblings, 1 reply; 55+ messages in thread From: Matthew Rushton @ 2014-04-02 0:17 UTC (permalink / raw) To: Tim Deegan, Konrad Rzeszutek Wilk Cc: Keir Fraser, Jan Beulich, Matt Wilson, Matt Wilson, Andrew Cooper, xen-devel On 04/01/14 05:22, Tim Deegan wrote: > At 06:48 -0400 on 01 Apr (1396331306), Konrad Rzeszutek Wilk wrote: >> On Mon, Mar 31, 2014 at 08:25:43PM -0700, Matthew Rushton wrote: >>> Are there any concerns about the proposed Xen change as a reasonable >>> work around for the current implementation? Thank you! >> So I finally understood what the concern was about it - the balloon >> mechanics get the pages in worst possible order. I am wondeirng if there >> is something on the Linux side we can do to tell Xen to give them to use >> in the proper order? > The best way, at least from Xen's point of view, is to explicitly > allocate contiguous pages in the cases where it'll make a difference > AIUI linux already does this for some classes of dma-able memory. I'm in agreement that if any change is made to Linux it should be to make as large as possible allocations and back off accordingly. I suppose another approach could be to add a boot option to not reallocate at all. >> Could we swap the order of xen_do_chunk so it starts from the end and >> goes to start? Would that help? > As long as we don't also change the default allocation order in > Xen. :) In general, linux shouldn't rely on the order that Xen > allocates memory, as that might change later. If the current API > can't do what's needed, maybe we can add another allocator > hypercall or flag? Agree on not relying on the order in the long run. A new hypercall or flag seems like overkill right now. The question for me comes down to my proposed change which is more simple and solves the short term problem or investing time in reworking the Linux code to make large allocations. > > Cheers, > > Tim. ^ permalink raw reply [flat|nested] 55+ messages in thread
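A sketch of that "as large as possible, back off accordingly" shape, under the assumption that the remap path asks for 2MB (order-9) extents first and drops to smaller orders when Xen cannot satisfy them. The names and starting order are illustrative, and the per-page P2M updates for each successful extent are elided (they would mirror the order-0 case).

#include <xen/interface/memory.h>
#include <asm/xen/hypercall.h>

static void example_populate_with_backoff(unsigned long start_pfn, unsigned long count)
{
	unsigned long pfn = start_pfn, end = start_pfn + count;
	unsigned int order = 9;				/* try 2MB extents first */

	while (pfn < end) {
		xen_pfn_t frame = pfn;
		struct xen_memory_reservation res = {
			.nr_extents = 1,
			.domid = DOMID_SELF,
		};
		int rc;

		/* never request past the end of the range being populated */
		while (order && pfn + (1UL << order) > end)
			order--;
		res.extent_order = order;

		set_xen_guest_handle(res.extent_start, &frame);
		rc = HYPERVISOR_memory_op(XENMEM_populate_physmap, &res);
		if (rc == 1)
			pfn += 1UL << order;		/* got one machine-contiguous extent */
		else if (order > 0)
			order--;			/* back off and retry this pfn smaller */
		else
			break;				/* even a single page failed */
	}
}

Each successful non-zero-order extent gives PFNs backed by consecutive MFNs, which is exactly what keeps dma_map_single() off the bounce path; the trade-off raised later in the thread is that large contiguous requests can also exhaust the memory Xen keeps for DMA-style allocations.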
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-02 0:17 ` Matthew Rushton @ 2014-04-02 7:52 ` Jan Beulich 2014-04-02 10:06 ` Ian Campbell 0 siblings, 1 reply; 55+ messages in thread From: Jan Beulich @ 2014-04-02 7:52 UTC (permalink / raw) To: Matthew Rushton Cc: Keir Fraser, Andrew Cooper, Tim Deegan, Matt Wilson, Matt Wilson, xen-devel >>> On 02.04.14 at 02:17, <mvrushton@gmail.com> wrote: > On 04/01/14 05:22, Tim Deegan wrote: >> As long as we don't also change the default allocation order in >> Xen. :) In general, linux shouldn't rely on the order that Xen >> allocates memory, as that might change later. If the current API >> can't do what's needed, maybe we can add another allocator >> hypercall or flag? > > Agree on not relying on the order in the long run. A new hypercall or > flag seems like overkill right now. The question for me comes down to my > proposed change which is more simple and solves the short term problem > or investing time in reworking the Linux code to make large allocations. I think it has become pretty clear by now that we'd rather not alter the hypervisor allocator for a purpose like this. Jan ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-02 7:52 ` Jan Beulich @ 2014-04-02 10:06 ` Ian Campbell 2014-04-02 10:15 ` Jan Beulich 0 siblings, 1 reply; 55+ messages in thread From: Ian Campbell @ 2014-04-02 10:06 UTC (permalink / raw) To: Jan Beulich Cc: Keir Fraser, Matthew Rushton, Andrew Cooper, Tim Deegan, Matt Wilson, Matt Wilson, xen-devel On Wed, 2014-04-02 at 08:52 +0100, Jan Beulich wrote: > >>> On 02.04.14 at 02:17, <mvrushton@gmail.com> wrote: > > On 04/01/14 05:22, Tim Deegan wrote: > >> As long as we don't also change the default allocation order in > >> Xen. :) In general, linux shouldn't rely on the order that Xen > >> allocates memory, as that might change later. If the current API > >> can't do what's needed, maybe we can add another allocator > >> hypercall or flag? > > > > Agree on not relying on the order in the long run. A new hypercall or > > flag seems like overkill right now. The question for me comes down to my > > proposed change which is more simple and solves the short term problem > > or investing time in reworking the Linux code to make large allocations. > > I think it has become pretty clear by now that we'd rather not alter > the hypervisor allocator for a purpose like this. Does it even actually solve the problem? It seems like it is just deferring it until sufficient fragmentation has occurred in the system. All its really done is make the eventual issue much harder to debug. Ian. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-02 10:06 ` Ian Campbell @ 2014-04-02 10:15 ` Jan Beulich 2014-04-02 10:20 ` Ian Campbell 0 siblings, 1 reply; 55+ messages in thread From: Jan Beulich @ 2014-04-02 10:15 UTC (permalink / raw) To: Ian Campbell Cc: Keir Fraser, Matthew Rushton, AndrewCooper, Tim Deegan, Matt Wilson, Matt Wilson, xen-devel >>> On 02.04.14 at 12:06, <Ian.Campbell@citrix.com> wrote: > On Wed, 2014-04-02 at 08:52 +0100, Jan Beulich wrote: >> >>> On 02.04.14 at 02:17, <mvrushton@gmail.com> wrote: >> > On 04/01/14 05:22, Tim Deegan wrote: >> >> As long as we don't also change the default allocation order in >> >> Xen. :) In general, linux shouldn't rely on the order that Xen >> >> allocates memory, as that might change later. If the current API >> >> can't do what's needed, maybe we can add another allocator >> >> hypercall or flag? >> > >> > Agree on not relying on the order in the long run. A new hypercall or >> > flag seems like overkill right now. The question for me comes down to my >> > proposed change which is more simple and solves the short term problem >> > or investing time in reworking the Linux code to make large allocations. >> >> I think it has become pretty clear by now that we'd rather not alter >> the hypervisor allocator for a purpose like this. > > Does it even actually solve the problem? It seems like it is just > deferring it until sufficient fragmentation has occurred in the system. > All its really done is make the eventual issue much harder to debug. Wasn't this largely for Dom0 (in which case fragmentation shouldn't matter yet)? Jan ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-02 10:15 ` Jan Beulich @ 2014-04-02 10:20 ` Ian Campbell 2014-04-09 22:21 ` Matthew Rushton 0 siblings, 1 reply; 55+ messages in thread From: Ian Campbell @ 2014-04-02 10:20 UTC (permalink / raw) To: Jan Beulich Cc: Keir Fraser, Matthew Rushton, AndrewCooper, Tim Deegan, Matt Wilson, Matt Wilson, xen-devel On Wed, 2014-04-02 at 11:15 +0100, Jan Beulich wrote: > >>> On 02.04.14 at 12:06, <Ian.Campbell@citrix.com> wrote: > > On Wed, 2014-04-02 at 08:52 +0100, Jan Beulich wrote: > >> >>> On 02.04.14 at 02:17, <mvrushton@gmail.com> wrote: > >> > On 04/01/14 05:22, Tim Deegan wrote: > >> >> As long as we don't also change the default allocation order in > >> >> Xen. :) In general, linux shouldn't rely on the order that Xen > >> >> allocates memory, as that might change later. If the current API > >> >> can't do what's needed, maybe we can add another allocator > >> >> hypercall or flag? > >> > > >> > Agree on not relying on the order in the long run. A new hypercall or > >> > flag seems like overkill right now. The question for me comes down to my > >> > proposed change which is more simple and solves the short term problem > >> > or investing time in reworking the Linux code to make large allocations. > >> > >> I think it has become pretty clear by now that we'd rather not alter > >> the hypervisor allocator for a purpose like this. > > > > Does it even actually solve the problem? It seems like it is just > > deferring it until sufficient fragmentation has occurred in the system. > > All its really done is make the eventual issue much harder to debug. > > Wasn't this largely for Dom0 (in which case fragmentation shouldn't > matter yet)? Dom0 ballooning breaks any assumptions you might make about relying on early allocations. Ian. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-02 10:20 ` Ian Campbell @ 2014-04-09 22:21 ` Matthew Rushton 2014-04-10 6:14 ` Jan Beulich 2014-04-11 17:05 ` Konrad Rzeszutek Wilk 0 siblings, 2 replies; 55+ messages in thread From: Matthew Rushton @ 2014-04-09 22:21 UTC (permalink / raw) To: Ian Campbell, Jan Beulich Cc: Keir Fraser, AndrewCooper, Tim Deegan, Matt Wilson, Matt Wilson, xen-devel On 04/02/14 03:20, Ian Campbell wrote: > On Wed, 2014-04-02 at 11:15 +0100, Jan Beulich wrote: >>>>> On 02.04.14 at 12:06, <Ian.Campbell@citrix.com> wrote: >>> On Wed, 2014-04-02 at 08:52 +0100, Jan Beulich wrote: >>>>>>> On 02.04.14 at 02:17, <mvrushton@gmail.com> wrote: >>>>> On 04/01/14 05:22, Tim Deegan wrote: >>>>>> As long as we don't also change the default allocation order in >>>>>> Xen. :) In general, linux shouldn't rely on the order that Xen >>>>>> allocates memory, as that might change later. If the current API >>>>>> can't do what's needed, maybe we can add another allocator >>>>>> hypercall or flag? >>>>> Agree on not relying on the order in the long run. A new hypercall or >>>>> flag seems like overkill right now. The question for me comes down to my >>>>> proposed change which is more simple and solves the short term problem >>>>> or investing time in reworking the Linux code to make large allocations. >>>> I think it has become pretty clear by now that we'd rather not alter >>>> the hypervisor allocator for a purpose like this. OK understood see below. >>> Does it even actually solve the problem? It seems like it is just >>> deferring it until sufficient fragmentation has occurred in the system. >>> All its really done is make the eventual issue much harder to debug. >> Wasn't this largely for Dom0 (in which case fragmentation shouldn't >> matter yet)? > Dom0 ballooning breaks any assumptions you might make about relying on > early allocations. I think you're missing the point. I'm not arguing that this change is a general purpose solution to guarantee that dom0 is contiguous. Fragmentation can exist even if dom0 asks for larger allocations like it should (which the balloon driver does I believe). What the change does do is solve a real problem in the current Linux PCI remapping implementation which happens during dom0 intialization. If the allocation strategy is arbitrary why not make the proposed hypervisor change to make existing Linux implementations behave better and in addition fix the problem in Linux so moving forward things are safe? > Ian. > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-09 22:21 ` Matthew Rushton @ 2014-04-10 6:14 ` Jan Beulich 2014-04-11 20:20 ` Matthew Rushton 2014-04-11 17:05 ` Konrad Rzeszutek Wilk 1 sibling, 1 reply; 55+ messages in thread From: Jan Beulich @ 2014-04-10 6:14 UTC (permalink / raw) To: Matthew Rushton Cc: Keir Fraser, Ian Campbell, AndrewCooper, Tim Deegan, Matt Wilson, Matt Wilson, xen-devel >>> On 10.04.14 at 00:21, <mvrushton@gmail.com> wrote: > On 04/02/14 03:20, Ian Campbell wrote: >> Dom0 ballooning breaks any assumptions you might make about relying on >> early allocations. > > I think you're missing the point. I'm not arguing that this change is a > general purpose solution to guarantee that dom0 is contiguous. > Fragmentation can exist even if dom0 asks for larger allocations like it > should (which the balloon driver does I believe). What the change does > do is solve a real problem in the current Linux PCI remapping > implementation which happens during dom0 intialization. If the > allocation strategy is arbitrary why not make the proposed hypervisor > change to make existing Linux implementations behave better and in > addition fix the problem in Linux so moving forward things are safe? Apart from all other arguments speaking against this, did you consider that altering the hypervisor behavior may adversely affect some other Dom0-capable OS? Problems in Linux should, as said before, get fixed in Linux. If older versions are affected, stable backports should subsequently be requested/done. Jan ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-10 6:14 ` Jan Beulich @ 2014-04-11 20:20 ` Matthew Rushton 0 siblings, 0 replies; 55+ messages in thread From: Matthew Rushton @ 2014-04-11 20:20 UTC (permalink / raw) To: Jan Beulich Cc: Keir Fraser, Ian Campbell, AndrewCooper, Tim Deegan, Matt Wilson, Matt Wilson, xen-devel On 04/09/14 23:14, Jan Beulich wrote: >>>> On 10.04.14 at 00:21, <mvrushton@gmail.com> wrote: >> On 04/02/14 03:20, Ian Campbell wrote: >>> Dom0 ballooning breaks any assumptions you might make about relying on >>> early allocations. >> I think you're missing the point. I'm not arguing that this change is a >> general purpose solution to guarantee that dom0 is contiguous. >> Fragmentation can exist even if dom0 asks for larger allocations like it >> should (which the balloon driver does I believe). What the change does >> do is solve a real problem in the current Linux PCI remapping >> implementation which happens during dom0 intialization. If the >> allocation strategy is arbitrary why not make the proposed hypervisor >> change to make existing Linux implementations behave better and in >> addition fix the problem in Linux so moving forward things are safe? > Apart from all other arguments speaking against this, did you > consider that altering the hypervisor behavior may adversely > affect some other Dom0-capable OS? Sure I've been considering the more intuitive dom0 implementation of allocating memory low to high and looking at things pragmatically. > > Problems in Linux should, as said before, get fixed in Linux. If > older versions are affected, stable backports should subsequently > be requested/done. > > Jan > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-09 22:21 ` Matthew Rushton 2014-04-10 6:14 ` Jan Beulich @ 2014-04-11 17:05 ` Konrad Rzeszutek Wilk 2014-04-11 20:28 ` Matthew Rushton 2014-04-13 21:32 ` Tim Deegan 1 sibling, 2 replies; 55+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-04-11 17:05 UTC (permalink / raw) To: Matthew Rushton Cc: Keir Fraser, Ian Campbell, AndrewCooper, Tim Deegan, Jan Beulich, Matt Wilson, Matt Wilson, xen-devel On Wed, Apr 09, 2014 at 03:21:38PM -0700, Matthew Rushton wrote: > On 04/02/14 03:20, Ian Campbell wrote: > >On Wed, 2014-04-02 at 11:15 +0100, Jan Beulich wrote: > >>>>>On 02.04.14 at 12:06, <Ian.Campbell@citrix.com> wrote: > >>>On Wed, 2014-04-02 at 08:52 +0100, Jan Beulich wrote: > >>>>>>>On 02.04.14 at 02:17, <mvrushton@gmail.com> wrote: > >>>>>On 04/01/14 05:22, Tim Deegan wrote: > >>>>>>As long as we don't also change the default allocation order in > >>>>>>Xen. :) In general, linux shouldn't rely on the order that Xen > >>>>>>allocates memory, as that might change later. If the current API > >>>>>>can't do what's needed, maybe we can add another allocator > >>>>>>hypercall or flag? > >>>>>Agree on not relying on the order in the long run. A new hypercall or > >>>>>flag seems like overkill right now. The question for me comes down to my > >>>>>proposed change which is more simple and solves the short term problem > >>>>>or investing time in reworking the Linux code to make large allocations. > >>>>I think it has become pretty clear by now that we'd rather not alter > >>>>the hypervisor allocator for a purpose like this. > > OK understood see below. > > >>>Does it even actually solve the problem? It seems like it is just > >>>deferring it until sufficient fragmentation has occurred in the system. > >>>All its really done is make the eventual issue much harder to debug. > >>Wasn't this largely for Dom0 (in which case fragmentation shouldn't > >>matter yet)? > >Dom0 ballooning breaks any assumptions you might make about relying on > >early allocations. > > I think you're missing the point. I'm not arguing that this change > is a general purpose solution to guarantee that dom0 is contiguous. > Fragmentation can exist even if dom0 asks for larger allocations > like it should (which the balloon driver does I believe). What the > change does do is solve a real problem in the current Linux PCI > remapping implementation which happens during dom0 intialization. If > the allocation strategy is arbitrary why not make the proposed > hypervisor change to make existing Linux implementations behave > better and in addition fix the problem in Linux so moving forward > things are safe? I think Tim was OK with that - as long as it was based on a flag - meaning when we do the increase_reservation call we use an extra flag to ask for contingous PFNs. > > >Ian. > > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-11 17:05 ` Konrad Rzeszutek Wilk @ 2014-04-11 20:28 ` Matthew Rushton 2014-04-12 1:34 ` Konrad Rzeszutek Wilk 2014-04-13 21:32 ` Tim Deegan 1 sibling, 1 reply; 55+ messages in thread From: Matthew Rushton @ 2014-04-11 20:28 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Keir Fraser, Ian Campbell, AndrewCooper, Tim Deegan, Jan Beulich, Matt Wilson, Matt Wilson, xen-devel On 04/11/14 10:05, Konrad Rzeszutek Wilk wrote: > On Wed, Apr 09, 2014 at 03:21:38PM -0700, Matthew Rushton wrote: >> On 04/02/14 03:20, Ian Campbell wrote: >>> On Wed, 2014-04-02 at 11:15 +0100, Jan Beulich wrote: >>>>>>> On 02.04.14 at 12:06, <Ian.Campbell@citrix.com> wrote: >>>>> On Wed, 2014-04-02 at 08:52 +0100, Jan Beulich wrote: >>>>>>>>> On 02.04.14 at 02:17, <mvrushton@gmail.com> wrote: >>>>>>> On 04/01/14 05:22, Tim Deegan wrote: >>>>>>>> As long as we don't also change the default allocation order in >>>>>>>> Xen. :) In general, linux shouldn't rely on the order that Xen >>>>>>>> allocates memory, as that might change later. If the current API >>>>>>>> can't do what's needed, maybe we can add another allocator >>>>>>>> hypercall or flag? >>>>>>> Agree on not relying on the order in the long run. A new hypercall or >>>>>>> flag seems like overkill right now. The question for me comes down to my >>>>>>> proposed change which is more simple and solves the short term problem >>>>>>> or investing time in reworking the Linux code to make large allocations. >>>>>> I think it has become pretty clear by now that we'd rather not alter >>>>>> the hypervisor allocator for a purpose like this. >> OK understood see below. >> >>>>> Does it even actually solve the problem? It seems like it is just >>>>> deferring it until sufficient fragmentation has occurred in the system. >>>>> All its really done is make the eventual issue much harder to debug. >>>> Wasn't this largely for Dom0 (in which case fragmentation shouldn't >>>> matter yet)? >>> Dom0 ballooning breaks any assumptions you might make about relying on >>> early allocations. >> I think you're missing the point. I'm not arguing that this change >> is a general purpose solution to guarantee that dom0 is contiguous. >> Fragmentation can exist even if dom0 asks for larger allocations >> like it should (which the balloon driver does I believe). What the >> change does do is solve a real problem in the current Linux PCI >> remapping implementation which happens during dom0 intialization. If >> the allocation strategy is arbitrary why not make the proposed >> hypervisor change to make existing Linux implementations behave >> better and in addition fix the problem in Linux so moving forward >> things are safe? > I think Tim was OK with that - as long as it was based on a flag - meaning > when we do the increase_reservation call we use an extra flag > to ask for contingous PFNs. OK the extra flag feels a little dirty to me but it would solve the problem. What are your thoughts on changing Linux to make higher order allocations or more minimally adding a boot parameter to not remap the memory at all for those that care about performance? I know the Linux code is already fairly complex and your preference was not to make it worse. >>> Ian. >>> >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 55+ messages in thread
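For concreteness only: no such flag exists in the Xen ABI, but the "extra flag" being discussed would presumably look something like a new mem_flags bit next to the existing XENMEMF_* definitions, which the dom0 populate/increase-reservation path could set when it wants ascending MFNs. Everything below, including the flag name and bit position, is invented for illustration.

#include <xen/interface/memory.h>
#include <asm/xen/hypercall.h>

/* Hypothetical flag -- not part of the real ABI. */
#define XENMEMF_contig_hint	(1U << 18)

static int example_populate_one(unsigned long gpfn)
{
	xen_pfn_t frame = gpfn;
	struct xen_memory_reservation res = {
		.nr_extents = 1,
		.extent_order = 0,
		.mem_flags = XENMEMF_contig_hint,	/* hypothetical: ask for ascending MFNs */
		.domid = DOMID_SELF,
	};

	set_xen_guest_handle(res.extent_start, &frame);
	return HYPERVISOR_memory_op(XENMEM_populate_physmap, &res);
}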
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-11 20:28 ` Matthew Rushton @ 2014-04-12 1:34 ` Konrad Rzeszutek Wilk 0 siblings, 0 replies; 55+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-04-12 1:34 UTC (permalink / raw) To: Matthew Rushton Cc: Keir Fraser, Ian Campbell, AndrewCooper, Tim Deegan, Jan Beulich, Matt Wilson, Matt Wilson, xen-devel On Fri, Apr 11, 2014 at 01:28:45PM -0700, Matthew Rushton wrote: > On 04/11/14 10:05, Konrad Rzeszutek Wilk wrote: > >On Wed, Apr 09, 2014 at 03:21:38PM -0700, Matthew Rushton wrote: > >>On 04/02/14 03:20, Ian Campbell wrote: > >>>On Wed, 2014-04-02 at 11:15 +0100, Jan Beulich wrote: > >>>>>>>On 02.04.14 at 12:06, <Ian.Campbell@citrix.com> wrote: > >>>>>On Wed, 2014-04-02 at 08:52 +0100, Jan Beulich wrote: > >>>>>>>>>On 02.04.14 at 02:17, <mvrushton@gmail.com> wrote: > >>>>>>>On 04/01/14 05:22, Tim Deegan wrote: > >>>>>>>>As long as we don't also change the default allocation order in > >>>>>>>>Xen. :) In general, linux shouldn't rely on the order that Xen > >>>>>>>>allocates memory, as that might change later. If the current API > >>>>>>>>can't do what's needed, maybe we can add another allocator > >>>>>>>>hypercall or flag? > >>>>>>>Agree on not relying on the order in the long run. A new hypercall or > >>>>>>>flag seems like overkill right now. The question for me comes down to my > >>>>>>>proposed change which is more simple and solves the short term problem > >>>>>>>or investing time in reworking the Linux code to make large allocations. > >>>>>>I think it has become pretty clear by now that we'd rather not alter > >>>>>>the hypervisor allocator for a purpose like this. > >>OK understood see below. > >> > >>>>>Does it even actually solve the problem? It seems like it is just > >>>>>deferring it until sufficient fragmentation has occurred in the system. > >>>>>All its really done is make the eventual issue much harder to debug. > >>>>Wasn't this largely for Dom0 (in which case fragmentation shouldn't > >>>>matter yet)? > >>>Dom0 ballooning breaks any assumptions you might make about relying on > >>>early allocations. > >>I think you're missing the point. I'm not arguing that this change > >>is a general purpose solution to guarantee that dom0 is contiguous. > >>Fragmentation can exist even if dom0 asks for larger allocations > >>like it should (which the balloon driver does I believe). What the > >>change does do is solve a real problem in the current Linux PCI > >>remapping implementation which happens during dom0 intialization. If > >>the allocation strategy is arbitrary why not make the proposed > >>hypervisor change to make existing Linux implementations behave > >>better and in addition fix the problem in Linux so moving forward > >>things are safe? > >I think Tim was OK with that - as long as it was based on a flag - meaning > >when we do the increase_reservation call we use an extra flag > >to ask for contingous PFNs. > > OK the extra flag feels a little dirty to me but it would solve the > problem. What are your thoughts on changing Linux to make higher > order allocations or more minimally adding a boot parameter to not > remap the memory at all for those that care about performance? I Oh, so just leave it ballooned down? I presume you can get the same exact behavior if you have your dom0_mem=max:X value tweaked just right? And your E820 does not look like swiss cheese. > know the Linux code is already fairly complex and your preference > was not to make it worse. > > >>>Ian. 
> >>> > >> > >>_______________________________________________ > >>Xen-devel mailing list > >>Xen-devel@lists.xen.org > >>http://lists.xen.org/xen-devel > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-11 17:05 ` Konrad Rzeszutek Wilk 2014-04-11 20:28 ` Matthew Rushton @ 2014-04-13 21:32 ` Tim Deegan 2014-04-14 8:51 ` Jan Beulich 1 sibling, 1 reply; 55+ messages in thread From: Tim Deegan @ 2014-04-13 21:32 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Keir Fraser, Matt Wilson, Matthew Rushton, AndrewCooper, Jan Beulich, Matt Wilson, xen-devel, Ian Campbell At 13:05 -0400 on 11 Apr (1397217936), Konrad Rzeszutek Wilk wrote: > On Wed, Apr 09, 2014 at 03:21:38PM -0700, Matthew Rushton wrote: > > On 04/02/14 03:20, Ian Campbell wrote: > > >On Wed, 2014-04-02 at 11:15 +0100, Jan Beulich wrote: > > >>>>>On 02.04.14 at 12:06, <Ian.Campbell@citrix.com> wrote: > > >>>On Wed, 2014-04-02 at 08:52 +0100, Jan Beulich wrote: > > >>>>>>>On 02.04.14 at 02:17, <mvrushton@gmail.com> wrote: > > >>>>>On 04/01/14 05:22, Tim Deegan wrote: > > >>>>>>As long as we don't also change the default allocation order in > > >>>>>>Xen. :) In general, linux shouldn't rely on the order that Xen > > >>>>>>allocates memory, as that might change later. If the current API > > >>>>>>can't do what's needed, maybe we can add another allocator > > >>>>>>hypercall or flag? > > >>>>>Agree on not relying on the order in the long run. A new hypercall or > > >>>>>flag seems like overkill right now. The question for me comes down to my > > >>>>>proposed change which is more simple and solves the short term problem > > >>>>>or investing time in reworking the Linux code to make large allocations. > > >>>>I think it has become pretty clear by now that we'd rather not alter > > >>>>the hypervisor allocator for a purpose like this. > > > > OK understood see below. > > > > >>>Does it even actually solve the problem? It seems like it is just > > >>>deferring it until sufficient fragmentation has occurred in the system. > > >>>All its really done is make the eventual issue much harder to debug. > > >>Wasn't this largely for Dom0 (in which case fragmentation shouldn't > > >>matter yet)? > > >Dom0 ballooning breaks any assumptions you might make about relying on > > >early allocations. > > > > I think you're missing the point. I'm not arguing that this change > > is a general purpose solution to guarantee that dom0 is contiguous. > > Fragmentation can exist even if dom0 asks for larger allocations > > like it should (which the balloon driver does I believe). What the > > change does do is solve a real problem in the current Linux PCI > > remapping implementation which happens during dom0 intialization. If > > the allocation strategy is arbitrary why not make the proposed > > hypervisor change to make existing Linux implementations behave > > better and in addition fix the problem in Linux so moving forward > > things are safe? > > I think Tim was OK with that - as long as it was based on a flag - meaning > when we do the increase_reservation call we use an extra flag > to ask for contingous PFNs. That's not quite what I meant to say. I think that: (a) Making this change would be OK, as it should be harmless and happens to help people running older linux kernels. That comes with the caveats I mentioned: dom0 should not be relying on this and we (xen) reserve the right to change it later even if that makes unfixed linux dom0 slow again. We also shouldn't make this change on debug builds (to catch cases where a guest relies on the new behaviour for _correctness_). 
(b) The right thing to do is to fix linux so that it asks for contiguous memory in cases where that matters. AFAICT that would involve allocating in larger areas than 1 page. (c) If for some reason the current hypercall API is not sufficient for dom0 to get what it wants, we should consider adding some new operation/flag/mode somewhere. But since AFAIK there's already another path in linux that allocates contiguous DMA buffers for device drivers, presumably this isn't the case. Cheers, Tim. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-13 21:32 ` Tim Deegan @ 2014-04-14 8:51 ` Jan Beulich 2014-04-14 14:40 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 55+ messages in thread From: Jan Beulich @ 2014-04-14 8:51 UTC (permalink / raw) To: Konrad Rzeszutek Wilk, Tim Deegan Cc: Keir Fraser, Ian Campbell, Matthew Rushton, AndrewCooper, Matt Wilson, Matt Wilson, xen-devel >>> On 13.04.14 at 23:32, <tim@xen.org> wrote: > (c) If for some reason the current hypercall API is not sufficient > for dom0 to get what it wants, we should consider adding some new > operation/flag/mode somewhere. But since AFAIK there's already > another path in linux that allocates contiguous DMA buffers > for device drivers, presumably this isn't the case. And it should be kept in mind that requesting contiguous memory shouldn't be done at will, as it may end up exhausting the portion of memory intended for DMA-style allocations (SWIOTLB / DMA- coherent allocations in Linux). I.e. neither Dom0 nor DomU should be trying to populate large parts of their memory with contiguous allocation requests to the hypervisor. They may, if they so desire, go and re-arrange their P2M mapping (solely based on what they got handed by doing order-0 allocations). Jan ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-14 8:51 ` Jan Beulich @ 2014-04-14 14:40 ` Konrad Rzeszutek Wilk 2014-04-14 15:34 ` Jan Beulich 0 siblings, 1 reply; 55+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-04-14 14:40 UTC (permalink / raw) To: Jan Beulich Cc: Keir Fraser, Ian Campbell, Matthew Rushton, AndrewCooper, Tim Deegan, Matt Wilson, Matt Wilson, xen-devel On Mon, Apr 14, 2014 at 09:51:34AM +0100, Jan Beulich wrote: > >>> On 13.04.14 at 23:32, <tim@xen.org> wrote: > > (c) If for some reason the current hypercall API is not sufficient > > for dom0 to get what it wants, we should consider adding some new > > operation/flag/mode somewhere. But since AFAIK there's already > > another path in linux that allocates contiguous DMA buffers > > for device drivers, presumably this isn't the case. > > And it should be kept in mind that requesting contiguous memory > shouldn't be done at will, as it may end up exhausting the portion > of memory intended for DMA-style allocations (SWIOTLB / DMA- > coherent allocations in Linux). I.e. neither Dom0 nor DomU should > be trying to populate large parts of their memory with contiguous > allocation requests to the hypervisor. They may, if they so desire, > go and re-arrange their P2M mapping (solely based on what they > got handed by doing order-0 allocations). I did try that at some point - and it did not work. The reason for trying this was that during the E820 parsing we would find the MMIO holes/gaps and instead of doing the 'XENMEM_decrease_reservation'/ 'XENMEM_populate_physmap' dance I thought I could just swap the P2M entries. That was OK, but the M2P lookup table was not too thrilled with this. Perhaps I should have used another hypercall to re-arrange the M2P? I think I did try 'XENMEM_exchange' but that is not the right call either. Perhaps I should use XENMEM_remove_from_physmap/XENMEM_add_to_physmap combo ? > > Jan > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-14 14:40 ` Konrad Rzeszutek Wilk @ 2014-04-14 15:34 ` Jan Beulich 2014-04-16 14:15 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 55+ messages in thread From: Jan Beulich @ 2014-04-14 15:34 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Keir Fraser, Ian Campbell, Matthew Rushton, AndrewCooper, Tim Deegan, Matt Wilson, Matt Wilson, xen-devel >>> On 14.04.14 at 16:40, <konrad.wilk@oracle.com> wrote: > That was OK, but the M2P lookup table was not too thrilled with this. > Perhaps I should have used another hypercall to re-arrange the M2P? > I think I did try 'XENMEM_exchange' but that is not the right call either. Yeah, that's allocating new pages in exchange for your old ones. Not really what you want. > Perhaps I should use XENMEM_remove_from_physmap/XENMEM_add_to_physmap > combo ? A pair of MMU_MACHPHYS_UPDATE operations would seem to be the right way of doing this (along with respective kernel internal accounting like set_phys_to_machine(), and perhaps a pair of update_va_mapping operations if the 1:1 map is already in place at that time, and you care about which page contents appears at which virtual address). Jan ^ permalink raw reply [flat|nested] 55+ messages in thread
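Seen from the dom0 side, that recipe is roughly the sketch below: swap which MFN backs each of two PFNs by updating the M2P with a pair of MMU_MACHPHYS_UPDATE operations and mirroring the change in the kernel's P2M. Error handling is omitted, the function name is invented, and the update_va_mapping step mentioned above is left as a comment since it only matters once the frames are mapped in the 1:1 region.

#include <linux/types.h>
#include <xen/interface/xen.h>
#include <asm/xen/hypercall.h>
#include <asm/xen/page.h>

static int example_swap_pfn_backing(unsigned long pfn_a, unsigned long pfn_b)
{
	unsigned long mfn_a = pfn_to_mfn(pfn_a);
	unsigned long mfn_b = pfn_to_mfn(pfn_b);
	struct mmu_update m2p[2];
	int rc;

	/* M2P side: ptr carries the machine address of the frame plus the
	 * MMU_MACHPHYS_UPDATE command, val is the new PFN for that frame. */
	m2p[0].ptr = ((u64)mfn_a << PAGE_SHIFT) | MMU_MACHPHYS_UPDATE;
	m2p[0].val = pfn_b;
	m2p[1].ptr = ((u64)mfn_b << PAGE_SHIFT) | MMU_MACHPHYS_UPDATE;
	m2p[1].val = pfn_a;

	rc = HYPERVISOR_mmu_update(m2p, 2, NULL, DOMID_SELF);
	if (rc)
		return rc;

	/* P2M side: keep the kernel's view in sync with what Xen now holds. */
	set_phys_to_machine(pfn_a, mfn_b);
	set_phys_to_machine(pfn_b, mfn_a);

	/* If these frames are already in the 1:1 mapping, a pair of
	 * HYPERVISOR_update_va_mapping() calls would go here as well. */
	return 0;
}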
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-14 15:34 ` Jan Beulich @ 2014-04-16 14:15 ` Konrad Rzeszutek Wilk 2014-04-17 1:34 ` Matthew Rushton 2014-05-07 23:16 ` Matthew Rushton 0 siblings, 2 replies; 55+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-04-16 14:15 UTC (permalink / raw) Cc: Keir Fraser, Ian Campbell, Matthew Rushton, AndrewCooper, Tim Deegan, Matt Wilson, Matt Wilson, xen-devel On Mon, Apr 14, 2014 at 04:34:47PM +0100, Jan Beulich wrote: > >>> On 14.04.14 at 16:40, <konrad.wilk@oracle.com> wrote: > > That was OK, but the M2P lookup table was not too thrilled with this. > > Perhaps I should have used another hypercall to re-arrange the M2P? > > I think I did try 'XENMEM_exchange' but that is not the right call either. > > Yeah, that's allocating new pages in exchange for your old ones. Not > really what you want. > > > Perhaps I should use XENMEM_remove_from_physmap/XENMEM_add_to_physmap > > combo ? > > A pair of MMU_MACHPHYS_UPDATE operations would seem to be the > right way of doing this (along with respective kernel internal accounting > like set_phys_to_machine(), and perhaps a pair of update_va_mapping > operations if the 1:1 map is already in place at that time, and you care > about which page contents appears at which virtual address). OK. Matt & Matthew - my plate is quite filled and I fear that in the next three weeks there is not going to be much time to code up a prototype. Would either one of you be willing to take a crack at this? It would be neat as we could remove a lot of the balloon increase/decrease code in arch/x86/xen/setup.c. Thanks! > > Jan > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-16 14:15 ` Konrad Rzeszutek Wilk @ 2014-04-17 1:34 ` Matthew Rushton 0 siblings, 0 replies; 55+ messages in thread From: Matthew Rushton @ 2014-04-17 1:34 UTC (permalink / raw) To: Konrad Rzeszutek Wilk, msw Cc: Keir Fraser, Ian Campbell, Andrew Cooper, Tim Deegan, Matt Wilson, xen-devel On 04/16/14 07:15, Konrad Rzeszutek Wilk wrote: > On Mon, Apr 14, 2014 at 04:34:47PM +0100, Jan Beulich wrote: >>>>> On 14.04.14 at 16:40, <konrad.wilk@oracle.com> wrote: >>> That was OK, but the M2P lookup table was not too thrilled with this. >>> Perhaps I should have used another hypercall to re-arrange the M2P? >>> I think I did try 'XENMEM_exchange' but that is not the right call either. >> Yeah, that's allocating new pages in exchange for your old ones. Not >> really what you want. >> >>> Perhaps I should use XENMEM_remove_from_physmap/XENMEM_add_to_physmap >>> combo ? >> A pair of MMU_MACHPHYS_UPDATE operations would seem to be the >> right way of doing this (along with respective kernel internal accounting >> like set_phys_to_machine(), and perhaps a pair of update_va_mapping >> operations if the 1:1 map is already in place at that time, and you care >> about which page contents appears at which virtual address). > OK. > > Matt & Matthew - my plate is quite filled and I fear that in the next three > weeks there is not going to be much time to code up a prototype. > > Would either one of you be willing to take a crack at this? It would > be neat as we could remove a lot of the balloon increase/decrease code > in arch/x86/xen/setup.c. > > Thanks! Yeah sure. It won't be immediately but I should be able to do that. >> Jan >> ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-04-16 14:15 ` Konrad Rzeszutek Wilk 2014-04-17 1:34 ` Matthew Rushton @ 2014-05-07 23:16 ` Matthew Rushton 2014-05-08 18:05 ` Konrad Rzeszutek Wilk 2014-05-14 15:06 ` Konrad Rzeszutek Wilk 1 sibling, 2 replies; 55+ messages in thread From: Matthew Rushton @ 2014-05-07 23:16 UTC (permalink / raw) To: Konrad Rzeszutek Wilk, msw Cc: Keir Fraser, Ian Campbell, AndrewCooper, Tim Deegan, Matt Wilson, xen-devel On 04/16/14 07:15, Konrad Rzeszutek Wilk wrote: > On Mon, Apr 14, 2014 at 04:34:47PM +0100, Jan Beulich wrote: >>>>> On 14.04.14 at 16:40, <konrad.wilk@oracle.com> wrote: >>> That was OK, but the M2P lookup table was not too thrilled with this. >>> Perhaps I should have used another hypercall to re-arrange the M2P? >>> I think I did try 'XENMEM_exchange' but that is not the right call either. >> Yeah, that's allocating new pages in exchange for your old ones. Not >> really what you want. >> >>> Perhaps I should use XENMEM_remove_from_physmap/XENMEM_add_to_physmap >>> combo ? >> A pair of MMU_MACHPHYS_UPDATE operations would seem to be the >> right way of doing this (along with respective kernel internal accounting >> like set_phys_to_machine(), and perhaps a pair of update_va_mapping >> operations if the 1:1 map is already in place at that time, and you care >> about which page contents appears at which virtual address). > OK. > > Matt & Matthew - my plate is quite filled and I fear that in the next three > weeks there is not going to be much time to code up a prototype. > > Would either one of you be willing to take a crack at this? It would > be neat as we could remove a lot of the balloon increase/decrease code > in arch/x86/xen/setup.c. > > Thanks! I have a first pass at this. Just need to test it and should have something ready sometime next week or so. >> Jan >> ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-05-07 23:16 ` Matthew Rushton @ 2014-05-08 18:05 ` Konrad Rzeszutek Wilk 2014-05-14 15:06 ` Konrad Rzeszutek Wilk 1 sibling, 0 replies; 55+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-05-08 18:05 UTC (permalink / raw) To: Matthew Rushton Cc: Keir Fraser, Ian Campbell, AndrewCooper, Tim Deegan, msw, Matt Wilson, xen-devel On Wed, May 07, 2014 at 04:16:14PM -0700, Matthew Rushton wrote: > On 04/16/14 07:15, Konrad Rzeszutek Wilk wrote: > >On Mon, Apr 14, 2014 at 04:34:47PM +0100, Jan Beulich wrote: > >>>>>On 14.04.14 at 16:40, <konrad.wilk@oracle.com> wrote: > >>>That was OK, but the M2P lookup table was not too thrilled with this. > >>>Perhaps I should have used another hypercall to re-arrange the M2P? > >>>I think I did try 'XENMEM_exchange' but that is not the right call either. > >>Yeah, that's allocating new pages in exchange for your old ones. Not > >>really what you want. > >> > >>>Perhaps I should use XENMEM_remove_from_physmap/XENMEM_add_to_physmap > >>>combo ? > >>A pair of MMU_MACHPHYS_UPDATE operations would seem to be the > >>right way of doing this (along with respective kernel internal accounting > >>like set_phys_to_machine(), and perhaps a pair of update_va_mapping > >>operations if the 1:1 map is already in place at that time, and you care > >>about which page contents appears at which virtual address). > >OK. > > > >Matt & Matthew - my plate is quite filled and I fear that in the next three > >weeks there is not going to be much time to code up a prototype. > > > >Would either one of you be willing to take a crack at this? It would > >be neat as we could remove a lot of the balloon increase/decrease code > >in arch/x86/xen/setup.c. > > > >Thanks! > > I have a first pass at this. Just need to test it and should have something > ready sometime next week or so. Woohoo! Thanks for the update! > > >>Jan > >> > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-05-07 23:16 ` Matthew Rushton 2014-05-08 18:05 ` Konrad Rzeszutek Wilk @ 2014-05-14 15:06 ` Konrad Rzeszutek Wilk 2014-05-20 19:26 ` Matthew Rushton 1 sibling, 1 reply; 55+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-05-14 15:06 UTC (permalink / raw) To: Matthew Rushton Cc: Keir Fraser, Ian Campbell, Andrew Cooper, Tim Deegan, msw, Matt Wilson, xen-devel On Wed, May 07, 2014 at 04:16:14PM -0700, Matthew Rushton wrote: > On 04/16/14 07:15, Konrad Rzeszutek Wilk wrote: > >On Mon, Apr 14, 2014 at 04:34:47PM +0100, Jan Beulich wrote: > >>>>>On 14.04.14 at 16:40, <konrad.wilk@oracle.com> wrote: > >>>That was OK, but the M2P lookup table was not too thrilled with this. > >>>Perhaps I should have used another hypercall to re-arrange the M2P? > >>>I think I did try 'XENMEM_exchange' but that is not the right call either. > >>Yeah, that's allocating new pages in exchange for your old ones. Not > >>really what you want. > >> > >>>Perhaps I should use XENMEM_remove_from_physmap/XENMEM_add_to_physmap > >>>combo ? > >>A pair of MMU_MACHPHYS_UPDATE operations would seem to be the > >>right way of doing this (along with respective kernel internal accounting > >>like set_phys_to_machine(), and perhaps a pair of update_va_mapping > >>operations if the 1:1 map is already in place at that time, and you care > >>about which page contents appears at which virtual address). > >OK. > > > >Matt & Matthew - my plate is quite filled and I fear that in the next three > >weeks there is not going to be much time to code up a prototype. > > > >Would either one of you be willing to take a crack at this? It would > >be neat as we could remove a lot of the balloon increase/decrease code > >in arch/x86/xen/setup.c. > > > >Thanks! > > I have a first pass at this. Just need to test it and should have something > ready sometime next week or so. Daniel pointed me to this commit: commit 2e2fb75475c2fc74c98100f1468c8195fee49f3b Author: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Date: Fri Apr 6 10:07:11 2012 -0400 xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM .. The other solution (that did not work) was to transplant the MFN in the P2M tree - the ones that were going to be freed were put in the E820_RAM regions past the nr_pages. But the modifications to the M2P array (the other side of creating PTEs) were not carried away. As the hypervisor is the only one capable of modifying that and the only two hypercalls that would do this are: the update_va_mapping (which won't work, as during initial bootup only PFNs up to nr_pages are mapped in the guest) or via the populate hypercall. Where I talk about the 'update_va_mapping' - and I seem to think that it would not work (due to the nr_pages limit). I don't actually remember the details - so I might have been incorrect (hopefully!?). > > >>Jan > >> ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-05-14 15:06 ` Konrad Rzeszutek Wilk @ 2014-05-20 19:26 ` Matthew Rushton 2014-05-23 19:00 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 55+ messages in thread From: Matthew Rushton @ 2014-05-20 19:26 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Keir Fraser, Ian Campbell, AndrewCooper, Tim Deegan, mrushton, msw, Matt Wilson, xen-devel On 05/14/14 08:06, Konrad Rzeszutek Wilk wrote: > On Wed, May 07, 2014 at 04:16:14PM -0700, Matthew Rushton wrote: >> On 04/16/14 07:15, Konrad Rzeszutek Wilk wrote: >>> On Mon, Apr 14, 2014 at 04:34:47PM +0100, Jan Beulich wrote: >>>>>>> On 14.04.14 at 16:40, <konrad.wilk@oracle.com> wrote: >>>>> That was OK, but the M2P lookup table was not too thrilled with this. >>>>> Perhaps I should have used another hypercall to re-arrange the M2P? >>>>> I think I did try 'XENMEM_exchange' but that is not the right call either. >>>> Yeah, that's allocating new pages in exchange for your old ones. Not >>>> really what you want. >>>> >>>>> Perhaps I should use XENMEM_remove_from_physmap/XENMEM_add_to_physmap >>>>> combo ? >>>> A pair of MMU_MACHPHYS_UPDATE operations would seem to be the >>>> right way of doing this (along with respective kernel internal accounting >>>> like set_phys_to_machine(), and perhaps a pair of update_va_mapping >>>> operations if the 1:1 map is already in place at that time, and you care >>>> about which page contents appears at which virtual address). >>> OK. >>> >>> Matt & Matthew - my plate is quite filled and I fear that in the next three >>> weeks there is not going to be much time to code up a prototype. >>> >>> Would either one of you be willing to take a crack at this? It would >>> be neat as we could remove a lot of the balloon increase/decrease code >>> in arch/x86/xen/setup.c. >>> >>> Thanks! >> I have a first pass at this. Just need to test it and should have something >> ready sometime next week or so. > Daniel pointed me to this commit: > ommit 2e2fb75475c2fc74c98100f1468c8195fee49f3b > Author: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > Date: Fri Apr 6 10:07:11 2012 -0400 > > xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM > .. > The other solution (that did not work) was to transplant the MFN in > the P2M tree - the ones that were going to be freed were put in > the E820_RAM regions past the nr_pages. But the modifications to the > M2P array (the other side of creating PTEs) were not carried away. > As the hypervisor is the only one capable of modifying that and the > only two hypercalls that would do this are: the update_va_mapping > (which won't work, as during initial bootup only PFNs up to nr_pages > are mapped in the guest) or via the populate hypercall. > > Where I talk about the 'update_va_mapping' - and I seem to think > that it would not work (due to the nr_pages limit). I don't actually > remember the details - so I might have been incorrect (hopefully!?). > Ok I finally have something I'm happy with using the mmu_update hypercall and placing things in the existing E820 map. I don't think the update_va_mapping hypercall is necessary. It ended up being a little more complicated than I originally thought to handle not allocating additional p2m leaf nodes. I'm going on vacation here shortly and can post it when I get back. >>>> Jan >>>> ^ permalink raw reply [flat|nested] 55+ messages in thread
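For the "placing things in the existing E820 map" half of that, the kind of walk involved might look like the sketch below; the helper name and the way the E820 entries reach it are assumptions, and the p2m-leaf-node bookkeeping Matthew calls the complicated part is deliberately not shown.

#include <linux/init.h>
#include <linux/pfn.h>
#include <asm/e820.h>

/*
 * Illustrative only: find the first E820_RAM pfn at or above min_pfn,
 * i.e. a slot in the existing E820 map that a released frame could be
 * remapped into.  Returns ~0UL if no such pfn exists.
 */
static unsigned long __init next_ram_pfn(const struct e820entry *map,
                                         int nr_entries,
                                         unsigned long min_pfn)
{
        int i;

        for (i = 0; i < nr_entries; i++) {
                unsigned long start, end;

                if (map[i].type != E820_RAM)
                        continue;

                start = PFN_UP(map[i].addr);
                end = PFN_DOWN(map[i].addr + map[i].size);
                if (start < min_pfn)
                        start = min_pfn;
                if (start < end)
                        return start;
        }

        return ~0UL;
}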
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-05-20 19:26 ` Matthew Rushton @ 2014-05-23 19:00 ` Konrad Rzeszutek Wilk 2014-06-04 22:25 ` Matthew Rushton 0 siblings, 1 reply; 55+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-05-23 19:00 UTC (permalink / raw) To: Matthew Rushton Cc: Keir Fraser, Ian Campbell, AndrewCooper, Tim Deegan, mrushton, msw, Matt Wilson, xen-devel On Tue, May 20, 2014 at 12:26:57PM -0700, Matthew Rushton wrote: > On 05/14/14 08:06, Konrad Rzeszutek Wilk wrote: > >On Wed, May 07, 2014 at 04:16:14PM -0700, Matthew Rushton wrote: > >>On 04/16/14 07:15, Konrad Rzeszutek Wilk wrote: > >>>On Mon, Apr 14, 2014 at 04:34:47PM +0100, Jan Beulich wrote: > >>>>>>>On 14.04.14 at 16:40, <konrad.wilk@oracle.com> wrote: > >>>>>That was OK, but the M2P lookup table was not too thrilled with this. > >>>>>Perhaps I should have used another hypercall to re-arrange the M2P? > >>>>>I think I did try 'XENMEM_exchange' but that is not the right call either. > >>>>Yeah, that's allocating new pages in exchange for your old ones. Not > >>>>really what you want. > >>>> > >>>>>Perhaps I should use XENMEM_remove_from_physmap/XENMEM_add_to_physmap > >>>>>combo ? > >>>>A pair of MMU_MACHPHYS_UPDATE operations would seem to be the > >>>>right way of doing this (along with respective kernel internal accounting > >>>>like set_phys_to_machine(), and perhaps a pair of update_va_mapping > >>>>operations if the 1:1 map is already in place at that time, and you care > >>>>about which page contents appears at which virtual address). > >>>OK. > >>> > >>>Matt & Matthew - my plate is quite filled and I fear that in the next three > >>>weeks there is not going to be much time to code up a prototype. > >>> > >>>Would either one of you be willing to take a crack at this? It would > >>>be neat as we could remove a lot of the balloon increase/decrease code > >>>in arch/x86/xen/setup.c. > >>> > >>>Thanks! > >>I have a first pass at this. Just need to test it and should have something > >>ready sometime next week or so. > >Daniel pointed me to this commit: > >ommit 2e2fb75475c2fc74c98100f1468c8195fee49f3b > >Author: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > >Date: Fri Apr 6 10:07:11 2012 -0400 > > > > xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM > >.. > > The other solution (that did not work) was to transplant the MFN in > > the P2M tree - the ones that were going to be freed were put in > > the E820_RAM regions past the nr_pages. But the modifications to the > > M2P array (the other side of creating PTEs) were not carried away. > > As the hypervisor is the only one capable of modifying that and the > > only two hypercalls that would do this are: the update_va_mapping > > (which won't work, as during initial bootup only PFNs up to nr_pages > > are mapped in the guest) or via the populate hypercall. > > > >Where I talk about the 'update_va_mapping' - and I seem to think > >that it would not work (due to the nr_pages limit). I don't actually > >remember the details - so I might have been incorrect (hopefully!?). > > > > Ok I finally have something I'm happy with using the mmu_update hypercall > and placing things in the existing E820 map. I don't think the > update_va_mapping hypercall is necessary. It ended up being a little more > complicated than I originally thought to handle not allocating additional > p2m leaf nodes. I'm going on vacation here shortly and can post it when I > get back. 
Enjoy the vacation and I am looking forward to seeing the patches when you come back! Thank you! > > >>>>Jan > >>>> > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-05-23 19:00 ` Konrad Rzeszutek Wilk @ 2014-06-04 22:25 ` Matthew Rushton 2014-06-05 9:32 ` David Vrabel 0 siblings, 1 reply; 55+ messages in thread From: Matthew Rushton @ 2014-06-04 22:25 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Keir Fraser, Ian Campbell, AndrewCooper, Tim Deegan, mrushton, msw, Matt Wilson, xen-devel On 05/23/14 12:00, Konrad Rzeszutek Wilk wrote: > On Tue, May 20, 2014 at 12:26:57PM -0700, Matthew Rushton wrote: >> On 05/14/14 08:06, Konrad Rzeszutek Wilk wrote: >>> On Wed, May 07, 2014 at 04:16:14PM -0700, Matthew Rushton wrote: >>>> On 04/16/14 07:15, Konrad Rzeszutek Wilk wrote: >>>>> On Mon, Apr 14, 2014 at 04:34:47PM +0100, Jan Beulich wrote: >>>>>>>>> On 14.04.14 at 16:40, <konrad.wilk@oracle.com> wrote: >>>>>>> That was OK, but the M2P lookup table was not too thrilled with this. >>>>>>> Perhaps I should have used another hypercall to re-arrange the M2P? >>>>>>> I think I did try 'XENMEM_exchange' but that is not the right call either. >>>>>> Yeah, that's allocating new pages in exchange for your old ones. Not >>>>>> really what you want. >>>>>> >>>>>>> Perhaps I should use XENMEM_remove_from_physmap/XENMEM_add_to_physmap >>>>>>> combo ? >>>>>> A pair of MMU_MACHPHYS_UPDATE operations would seem to be the >>>>>> right way of doing this (along with respective kernel internal accounting >>>>>> like set_phys_to_machine(), and perhaps a pair of update_va_mapping >>>>>> operations if the 1:1 map is already in place at that time, and you care >>>>>> about which page contents appears at which virtual address). >>>>> OK. >>>>> >>>>> Matt & Matthew - my plate is quite filled and I fear that in the next three >>>>> weeks there is not going to be much time to code up a prototype. >>>>> >>>>> Would either one of you be willing to take a crack at this? It would >>>>> be neat as we could remove a lot of the balloon increase/decrease code >>>>> in arch/x86/xen/setup.c. >>>>> >>>>> Thanks! >>>> I have a first pass at this. Just need to test it and should have something >>>> ready sometime next week or so. >>> Daniel pointed me to this commit: >>> ommit 2e2fb75475c2fc74c98100f1468c8195fee49f3b >>> Author: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >>> Date: Fri Apr 6 10:07:11 2012 -0400 >>> >>> xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM >>> .. >>> The other solution (that did not work) was to transplant the MFN in >>> the P2M tree - the ones that were going to be freed were put in >>> the E820_RAM regions past the nr_pages. But the modifications to the >>> M2P array (the other side of creating PTEs) were not carried away. >>> As the hypervisor is the only one capable of modifying that and the >>> only two hypercalls that would do this are: the update_va_mapping >>> (which won't work, as during initial bootup only PFNs up to nr_pages >>> are mapped in the guest) or via the populate hypercall. >>> >>> Where I talk about the 'update_va_mapping' - and I seem to think >>> that it would not work (due to the nr_pages limit). I don't actually >>> remember the details - so I might have been incorrect (hopefully!?). >>> >> Ok I finally have something I'm happy with using the mmu_update hypercall >> and placing things in the existing E820 map. I don't think the >> update_va_mapping hypercall is necessary. It ended up being a little more >> complicated than I originally thought to handle not allocating additional >> p2m leaf nodes. 
I'm going on vacation here shortly and can post it when I >> get back. > Enjoy the vacation and I am looking forward to seeing the patches when you > come back! > > Thank you! Sent patch to lkml late last week ([PATCH] xen/setup: Remap Xen Identity Mapped RAM). Will cc xen-devel on further correspondence. -Matt >>>>>> Jan >>>>>> ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-06-04 22:25 ` Matthew Rushton @ 2014-06-05 9:32 ` David Vrabel 0 siblings, 0 replies; 55+ messages in thread From: David Vrabel @ 2014-06-05 9:32 UTC (permalink / raw) To: Matthew Rushton, Konrad Rzeszutek Wilk Cc: Keir Fraser, Ian Campbell, AndrewCooper, Tim Deegan, mrushton, msw, Matt Wilson, xen-devel On 04/06/14 23:25, Matthew Rushton wrote: > > Sent patch to lkml late last week ([PATCH] xen/setup: Remap Xen Identity > Mapped RAM). Will cc xen-devel on further correspondance. Please repost, Cc'ing the Linux Xen maintainers and xen-devel. David ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC PATCH] page_alloc: use first half of higher order chunks when halving 2014-03-26 15:15 ` Matt Wilson 2014-03-26 15:59 ` Matthew Rushton @ 2014-03-26 16:34 ` Konrad Rzeszutek Wilk 1 sibling, 0 replies; 55+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-03-26 16:34 UTC (permalink / raw) To: Matt Wilson Cc: Keir Fraser, Matt Wilson, Matthew Rushton, Andrew Cooper, Tim Deegan, Jan Beulich, xen-devel On Wed, Mar 26, 2014 at 05:15:08PM +0200, Matt Wilson wrote: > On Wed, Mar 26, 2014 at 11:08:01AM -0400, Konrad Rzeszutek Wilk wrote: > > > > Could you elaborate a bit more on the use-case please? > > My understanding is that most drivers use a scatter gather list - in which > > case it does not matter if the underlying MFNs in the PFN space are > > not contiguous. > > > > But I presume the issue you are hitting is with drivers doing dma_map_page > > and the page is not 4KB but rather large (compound page). Is that the > > problem you have observed? > > Drivers are using very large size arguments to dma_alloc_coherent() > for things like RX and TX descriptor rings. OK, but that call ends up using chunks from the SWIOTLB buffer which is contiguously allocated. That shouldn't be a problem? > > --msw ^ permalink raw reply [flat|nested] 55+ messages in thread
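As background for the dma_alloc_coherent() point: a driver typically grabs its whole RX or TX descriptor ring in a single coherent allocation, roughly as in the sketch below (the structure and function names are made up; only dma_alloc_coherent() itself is the real API), which is why the entire region has to look machine-contiguous to the device.

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/gfp.h>

/* Hypothetical descriptor ring; not from any real driver. */
struct demo_ring {
        void *desc;             /* CPU virtual address of the ring */
        dma_addr_t dma;         /* bus/machine address given to the device */
        size_t size;
};

static int demo_alloc_ring(struct device *dev, struct demo_ring *ring,
                           unsigned int entries, size_t desc_size)
{
        /* Easily tens or hundreds of kilobytes in one contiguous chunk. */
        ring->size = entries * desc_size;
        ring->desc = dma_alloc_coherent(dev, ring->size, &ring->dma,
                                        GFP_KERNEL);
        return ring->desc ? 0 : -ENOMEM;
}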