* alloc_heap_pages is low efficient with more CPUs
From: tupeng212 @ 2012-10-11 15:18 UTC (permalink / raw)
To: xen-devel
I am confused by a problem:
I have a blade with 64 physical CPUs and 64G of physical RAM, and defined only one VM with 1 CPU and 40G of RAM.
The first time I started the VM it took just 3s, but the second start took 30s.
By adding log prints I located a place in the hypervisor that costs too much time,
occupying 98% of the whole starting time.
xen/common/page_alloc.c
/* Allocate 2^@order contiguous pages. */
static struct page_info *alloc_heap_pages(
    unsigned int zone_lo, unsigned int zone_hi,
    unsigned int node, unsigned int order, unsigned int memflags)
{
    ...
    for ( i = 0; i < (1 << order); i++ )
    {
        if ( pg[i].u.free.need_tlbflush )
        {
            /* Add in extra CPUs that need flushing because of this page. */
            cpus_andnot(extra_cpus_mask, cpu_online_map, mask);
            tlbflush_filter(extra_cpus_mask, pg[i].tlbflush_timestamp);
            cpus_or(mask, mask, extra_cpus_mask);
        }
    }
    ...
}
1 On the first start most pages have need_tlbflush=0, so this costs little; on the second start most of the RAM has already been used,
which makes need_tlbflush=true for most pages, so it costs much more.
2 But I repeated the same experiment on another blade with 16 physical CPUs and 64G of physical RAM, and there the second
start cost only 3s. After tracing both second starts, I found that the number of times the
pg[i].u.free.need_tlbflush branch is entered is the same; it is the number of CPUs that makes the difference.
3 The code I pasted computes a mask of the CPUs whose TLBs should be flushed. I traced the values during the starting period:
cpus_andnot(extra_cpus_mask, cpu_online_map, mask);
//after, mask=0, cpu_online_map=0xFFFFFFFFFFFFFFFF, extra_cpus_mask=0xFFFFFFFFFFFFFFFF
tlbflush_filter(extra_cpus_mask, pg[i].tlbflush_timestamp);
//after, mask=0, extra_cpus_mask=0
cpus_or(mask, mask, extra_cpus_mask);
//after, mask=0, extra_cpus_mask=0
Every time it starts with mask=0 and ends with the same result, mask=0, so no CPU's TLB actually has to be flushed,
which makes the whole computation seem meaningless during the starting process.
4 The problem is that this seemingly meaningless code is very time-consuming: the more CPUs there are, the more time it costs. It takes 30s here on my blade
with 64 CPUs, so some solution is needed to improve the efficiency.
tupeng
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
* Re: alloc_heap_pages is low efficient with more CPUs
From: Keir Fraser @ 2012-10-11 15:41 UTC (permalink / raw)
To: tupeng212, xen-devel
Not sure I understand. With 16 CPUs you find domain startup takes 3s always.
With 64 CPUs you find it takes 3s first time, then 30s in future? And this
is due to cost of tlbflush_filter() (not actual TLB flushes, because you
always end up with mask=0)? If tlbflush_filter() were that expensive I'd
expect the 16-CPU case to have slowdown after the first domain startup, too.
-- Keir
On 11/10/2012 16:18, "tupeng212" <tupeng212@gmail.com> wrote:
> I am confused with a problem:
> I have a blade with 64 physical CPUs and 64G physical RAM, and defined only
> one VM with 1 CPU and 40G RAM.
> For the first time I started the VM, it just took 3s, But for the second
> starting it took 30s.
> After studied it by printing log, I have located a place in the hypervisor
> where cost too much time,
> occupied 98% of the whole starting time.
>
> xen/common/page_alloc.c
> /* Allocate 2^@order contiguous pages. */
> static struct page_info *alloc_heap_pages(
> unsigned int zone_lo, unsigned int zone_hi,
> unsigned int node, unsigned int order, unsigned int memflags)
> {
> if ( pg[i].u.free.need_tlbflush )
> {
> /* Add in extra CPUs that need flushing because of this page. */
> cpus_andnot(extra_cpus_mask, cpu_online_map, mask);
> tlbflush_filter(extra_cpus_mask, pg[i].tlbflush_timestamp);
> cpus_or(mask, mask, extra_cpus_mask);
> }
>
> }
>
> 1 in the first starting, most of need_tlbflush=NULL, so cost little; in the
> second one, most of RAM have been used
> thus makes need_tlbflush=true, so cost much.
>
> 2 but I repeated the same experiment in another blade which contains 16
> physical CPUs and 64G physical RAM, the second
> starting cost 3s. After I traced the process between the two second
> startings, found that count entering into the judgement of
> pg[i].u.free.need_tlbflush is the same, but number of CPUs leads to the
> difference.
>
> 3 The code I pasted aims to compute a mask to determine whether it should
> flush CPU's TLB. I traced the values in starting period below:
>
>>
>> cpus_andnot(extra_cpus_mask, cpu_online_map, mask);
>>
>> //after, mask=0, cpu_online_map=0xFFFFFFFFFFFFFFFF,
>> extra_cpus_mask=0xFFFFFFFFFFFFFFFF
>>
>> tlbflush_filter(extra_cpus_mask, pg[i].tlbflush_timestamp);
>>
>> //after, mask=0, extra_cpus_mask=0
>>
>> cpus_or(mask, mask, extra_cpus_mask);
>>
>> //after, mask=0, extra_cpus_mask=0
>
> every time it starts with mask=0, and ends with the same result mask=0, so
> leads to flush CPU's TLB definitely,
> which seems meaningless in the staring process.
>
> 4 The problem is, this seemed meaningless code is very time-consuming, CPUs
> get more, time costs more, it takes 30s here in my blade
> with 64 CPUs which may need some solution to improve the efficiency.
>
>
> tupeng
>
>
* Re: alloc_heap_pages is low efficient with more CPUs
From: tupeng212 @ 2012-10-12 12:24 UTC (permalink / raw)
To: Keir Fraser, xen-devel
With 16 CPUs you find domain startup takes 3s always.
: No.
With 16 CPUs: first start 0.3s, second start 3s.
With 64 CPUs: first start 3s, second start 30s.
With 64 CPUs you find it takes 3s first time, then 30s in future?
: Yes
And this is due to cost of tlbflush_filter() (not actual TLB flushes, because you always end up with mask=0)?
: Yes, most of the time is spent in the tlbflush_filter() call inside that branch.
The TLB flush itself is really very fast; it just sends an IPI to the CPUs concerned.
In the starting process's allocations it always ends up with mask=0, so the filtering seems needless.
If tlbflush_filter() were that expensive I’d expect the 16-CPU case to have slowdown after the first domain startup, too.
: Yes, you are right; the 16-CPU case slows down too after its first startup.
The reason is clear. I have discussed it with others, and there is no doubt that tlbflush_filter() is inefficient,
but I don't know how to improve it.
I also used Xen oprofile and found the following functions are called most frequently:
alloc_heap_pages: 40%
__next_cpu: 20%
others: 0.x%
.....
alloc_heap_pages -> tlbflush_filter -> for_each_cpu_mask -> next_cpu -> __next_cpu
It seems that stepping through the CPUs one at a time is expensive.
* Re: alloc_heap_pages is low efficient with more CPUs
From: Mukesh Rathor @ 2012-10-12 22:39 UTC (permalink / raw)
To: tupeng212; +Cc: Keir Fraser, xen-devel
On Fri, 12 Oct 2012 20:24:16 +0800
tupeng212 <tupeng212@gmail.com> wrote:
> With 16 CPUs you find domain startup takes 3s always.
> :No
> With 16CPUs, first 0.3s, second 3s
> With 64CPUs, first 3s, second 30s.
>
> With 64 CPUs you find it takes 3s first time, then 30s in future?
> : Yes
...........
I had seen this a while ago; it was always page scrubbing. That
algorithm was changed IIRC, so I am not sure if it could still be the cause.
My 2 cents,
thanks
Mukesh
* Re: alloc_heap_pages is low efficient with more CPUs
From: Keir Fraser @ 2012-10-13 6:03 UTC (permalink / raw)
To: Mukesh Rathor, tupeng212; +Cc: xen-devel
On 12/10/2012 23:39, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote:
> On Fri, 12 Oct 2012 20:24:16 +0800
> tupeng212 <tupeng212@gmail.com> wrote:
>
>> With 16 CPUs you find domain startup takes 3s always.
>> :No
>> With 16CPUs, first 0.3s, second 3s
>> With 64CPUs, first 3s, second 30s.
>>
>> With 64 CPUs you find it takes 3s first time, then 30s in future?
>> : Yes
>
> ...........
>
> I had seen this a while ago; it was always page scrubbing. That
> algorithm was changed IIRC, so I am not sure if it could still be the cause.
Tupeng,
If the tlbflush_filter() and cpumask_or() lines are commented out from the
if(need_tlbflush) block in alloc_heap_pages(), what are the domain creation
times like then? By the way it looks like you are not using xen-unstable or
xen-4.2, can you try with one of these later versions of Xen?
-- Keir
> My 2 cents,
> thanks
> Mukesh
* Re: alloc_heap_pages is low efficient with more CPUs
From: tupeng212 @ 2012-10-13 6:46 UTC (permalink / raw)
To: Keir Fraser; +Cc: xen-devel
What if you replace tlbflush_filter() call with cpus_clear(&extra_cpus_mask)?
: You mean just clearing it? Maybe a little drastic... you'd want to do it at some other place.
I assume you see lots of looping in one of those two functions, but only single-page-at-a-time calls into alloc_domheap_pages()->alloc_heap_pages()?
: In populate_physmap, all allocations are 2M in size:
static void populate_physmap(struct memop_args *a)
{
    for ( i = a->nr_done; i < a->nr_extents; i++ )
    {
        /* a->extent_order = 9, so always a 2M extent */
        page = alloc_domheap_pages(d, a->extent_order, a->memflags); /* -> alloc_heap_pages */
    }
    /* you mean move that block and the TLB flush here, outside the loop? */
}
tupeng212
From: Keir Fraser
Date: 2012-10-13 14:30
To: tupeng212
Subject: Re: [Xen-devel] alloc_heap_pages is low efficient with more CPUs
What if you replace tlbflush_filter() call with cpus_clear(&extra_cpus_mask)? Seems a bit silly to do, but I’d like to know how much a few cpumask operations per page is costing (most are of course much quicker than tlbflush_filter as they operate on 64 bits per iteration, rather than one bit per iteration).
If that is suitably fast, I think we can have a go at fixing this by pulling the TLB-flush logic out of alloc_heap_pages() and into the loops in increase_reservation() and populate_physmap() in common/memory.c. I assume you see lots of looping in one of those two functions, but only single-page-at-a-time calls into alloc_domheap_pages()->alloc_heap_pages()?
-- Keir
On 13/10/2012 07:21, "tupeng212" <tupeng212@gmail.com> wrote:
If the tlbflush_filter() and cpumask_or() lines are commented out from the
if(need_tlbflush) block in alloc_heap_pages(), what are the domain creation
times like then?
: You mean removing that code from alloc_heap_pages and then trying it.
I didn't do exactly that, but I measured the whole time of the if(need_tlbflush) block
using time1=NOW() ... block ... time2=NOW(); the result is in ns (1s = 10^9 ns).
It occupies a high fraction of the whole time: the whole starting time is 30s, of which this block accounts for about 29s.
By the way it looks like you are not using xen-unstable or
xen-4.2, can you try with one of these later versions of Xen?
: Fortunately, other groups here have to use xen-4.2, so we repeated this experiment on
that source code too; it changed nothing, and the second start still takes very long.
tupeng
* Re: alloc_heap_pages is low efficient with more CPUs
From: tupeng212 @ 2012-10-13 7:20 UTC (permalink / raw)
To: Keir Fraser; +Cc: xen-devel
/* Allocate 2^@order contiguous pages. */
static struct page_info *alloc_heap_pages(
    unsigned int zone_lo, unsigned int zone_hi,
    unsigned int node, unsigned int order, unsigned int memflags)
{
    ...
    cpus_clear(mask);
    for ( i = 0; i < (1 << order); i++ )
    {
        if ( pg[i].u.free.need_tlbflush )
        {
            /* Add in extra CPUs that need flushing because of this page. */
            cpus_andnot(extra_cpus_mask, cpu_online_map, mask);
            tlbflush_filter(extra_cpus_mask, pg[i].tlbflush_timestamp);
            cpus_or(mask, mask, extra_cpus_mask);
        }
    }
    ...
    if ( unlikely(!cpus_empty(mask)) )
    {
        perfc_incr(need_flush_tlb_flush);
        flush_tlb_mask(&mask);
    }
    return pg;
}
Is this equivalent to: if, at pg[(1<<order)-1].u.free.need_tlbflush, we compute mask=0, then flush the TLB with that mask?
* Re: alloc_heap_pages is low efficient with more CPUs
From: Keir Fraser @ 2012-10-13 8:59 UTC (permalink / raw)
To: tupeng212; +Cc: xen-devel
If the allocations are 2M size, we can do better quite easily I think.
Please try the attached patch.
-- Keir
On 13/10/2012 07:46, "tupeng212" <tupeng212@gmail.com> wrote:
> What if you replace tlbflush_filter() call with cpus_clear(&extra_cpus_mask)?
> : you mean just clear it, maybe a little violent.., you 'd like to do it at
> any other place.
>
> I assume you see lots of looping in one of those two functions, but only
> single-page-at-a-time calls into alloc_domheap_pages()->alloc_heap_pages()?
> : In populate_physmap, all pages are 2M size,
> static void populate_physmap(struct memop_args *a)
> {
> for ( i = a->nr_done; i < a->nr_extents; i++ )
> {
> page = alloc_domheap_pages(d, a->extent_order, a->memflags) ->alloc_heap_pages
> ; //a->extent_order = 9, always 2M size
> }
> //you mean move that block and TLB-flush here to avoid for loop ?
> }
>
> tupeng212
>
> From: Keir Fraser <mailto:keir.xen@gmail.com>
> Date: 2012-10-13 14:30
> To: tupeng212 <mailto:tupeng212@gmail.com>
> Subject: Re: [Xen-devel] alloc_heap_pages is low efficient with more CPUs
> What if you replace tlbflush_filter() call with cpus_clear(&extra_cpus_mask)?
> Seems a bit silly to do, but I'd like to know how much a few cpumask
> operations per page is costing (most are of course much quicker than
> tlbflush_filter as they operate on 64 bits per iteration, rather than one bit
> per iteration).
>
> If that is suitably fast, I think we can have a go at fixing this by pulling
> the TLB-flush logic out of alloc_heap_pages() and into the loops in
> increase_reservation() and populate_physmap() in common/memory.c. I assume
> you see lots of looping in one of those two functions, but only
> single-page-at-a-time calls into alloc_domheap_pages()->alloc_heap_pages()?
>
> -- Keir
>
> On 13/10/2012 07:21, "tupeng212" <tupeng212@gmail.com> wrote:
>
>> If the tlbflush_filter() and cpumask_or() lines are commented out from the
>> if(need_tlbflush) block in alloc_heap_pages(), what are the domain creation
>> times like then?
>> : You mean removing these code from alloc_heap_pages, then try it.
>> I didn't do it as you said, but I calculated the whole time of
>> if(need_tlbflush) block
>> by using time1=NOW() ...block ... time2=NOW(), time=time2-time1, its unit is
>> ns, and s = ns * 10^9
>> it occupy high rate of the whole time. whole starting time is 30s, then this
>> block may be 29s.
>>
>> By the way it looks like you are not using xen-unstable or
>> xen-4.2, can you try with one of these later versions of Xen?
>> : fortunately, other groups have to use xen-4.2, we have repeated this
>> experiment on
>> that source code too, it changed nothing, time is still very long in second
>> starting.
>>
>>
>>
>> tupeng
>>
>>
>>
>
[-- Attachment #2: 00-reduce-tlbflush_filter --]
[-- Type: application/octet-stream, Size: 2242 bytes --]
diff -r a15596a619ed xen/common/page_alloc.c
--- a/xen/common/page_alloc.c Thu Oct 04 10:44:43 2012 +0200
+++ b/xen/common/page_alloc.c Sat Oct 13 09:57:26 2012 +0100
@@ -303,9 +303,10 @@ static struct page_info *alloc_heap_page
     unsigned int first_node, i, j, zone = 0, nodemask_retry = 0;
     unsigned int node = (uint8_t)((memflags >> _MEMF_node) - 1);
     unsigned long request = 1UL << order;
-    cpumask_t extra_cpus_mask, mask;
     struct page_info *pg;
     nodemask_t nodemask = (d != NULL ) ? d->node_affinity : node_online_map;
+    bool_t need_tlbflush = 0;
+    uint32_t tlbflush_timestamp = 0;
 
     if ( node == NUMA_NO_NODE )
     {
@@ -417,20 +418,19 @@ static struct page_info *alloc_heap_page
     if ( d != NULL )
         d->last_alloc_node = node;
 
-    cpus_clear(mask);
-
     for ( i = 0; i < (1 << order); i++ )
     {
         /* Reference count must continuously be zero for free pages. */
         BUG_ON(pg[i].count_info != PGC_state_free);
         pg[i].count_info = PGC_state_inuse;
 
-        if ( pg[i].u.free.need_tlbflush )
+        if ( pg[i].u.free.need_tlbflush &&
+             (pg[i].tlbflush_timestamp <= tlbflush_current_time()) &&
+             (!need_tlbflush ||
+              (pg[i].tlbflush_timestamp > tlbflush_timestamp)) )
         {
-            /* Add in extra CPUs that need flushing because of this page. */
-            cpus_andnot(extra_cpus_mask, cpu_online_map, mask);
-            tlbflush_filter(extra_cpus_mask, pg[i].tlbflush_timestamp);
-            cpus_or(mask, mask, extra_cpus_mask);
+            need_tlbflush = 1;
+            tlbflush_timestamp = pg[i].tlbflush_timestamp;
         }
 
         /* Initialise fields which have other uses for free pages. */
@@ -440,10 +440,15 @@ static struct page_info *alloc_heap_page
 
     spin_unlock(&heap_lock);
 
-    if ( unlikely(!cpus_empty(mask)) )
+    if ( need_tlbflush )
     {
-        perfc_incr(need_flush_tlb_flush);
-        flush_tlb_mask(&mask);
+        cpumask_t mask = cpu_online_map;
+        tlbflush_filter(mask, tlbflush_timestamp);
+        if ( !cpus_empty(mask) )
+        {
+            perfc_incr(need_flush_tlb_flush);
+            flush_tlb_mask(&mask);
+        }
     }
 
     return pg;
* Re: alloc_heap_pages is low efficient with more CPUs
From: Keir Fraser @ 2012-10-13 8:59 UTC (permalink / raw)
To: tupeng212; +Cc: xen-devel
On 13/10/2012 08:20, "tupeng212" <tupeng212@gmail.com> wrote:
> /* Allocate 2^@order contiguous pages. */
> static struct page_info *alloc_heap_pages(
> unsigned int zone_lo, unsigned int zone_hi,
> unsigned int node, unsigned int order, unsigned int memflags)
> {
> cpus_clear(mask);
> for ( i = 0; i < (1 << order); i++ )
> {
> if ( pg[i].u.free.need_tlbflush )
> {
> /* Add in extra CPUs that need flushing because of this page. */
> cpus_andnot(extra_cpus_mask, cpu_online_map, mask);
> tlbflush_filter(extra_cpus_mask, pg[i].tlbflush_timestamp);
> cpus_or(mask, mask, extra_cpus_mask);
> }
> }
> if ( unlikely(!cpus_empty(mask)) )
> {
> perfc_incr(need_flush_tlb_flush);
> flush_tlb_mask(&mask);
> }
> return pg;
> }
>
> Is this equal to : in pg[1<<order-1].u.free.need_tlbflush, if we compute
> mask=0, then flush TLB mask ?
No!
-- Keir
* Re: alloc_heap_pages is low efficient with more CPUs
From: tupeng212 @ 2012-10-15 13:27 UTC (permalink / raw)
To: Keir Fraser; +Cc: xen-devel
Please try the attached patch.
: Great! You have done a good job; the needless time drops dramatically, and the start now takes only 1s.
If nobody has an objection, I suggest you commit this patch.
* Re: alloc_heap_pages is low efficient with more CPUs
From: Keir Fraser @ 2012-10-15 15:45 UTC (permalink / raw)
To: tupeng212; +Cc: Jan Beulich, xen-devel
On 15/10/2012 14:27, "tupeng212" <tupeng212@gmail.com> wrote:
> Please try the attached patch.
> : Great! you have done a good job, needless time decreases badly to 1s.
>
> If anybody has no proposal, I suggest you to commit this patch.
I have applied it to xen-unstable. It probably makes sense to put it in 4.1
and 4.2 as well (cc'ed Jan, and attaching the backport for 4.1 again).
-- Keir
[-- Attachment #2: 00-reduce-tlbflush_filter --]
[-- Type: application/octet-stream, Size: 2242 bytes --]
diff -r a15596a619ed xen/common/page_alloc.c
--- a/xen/common/page_alloc.c Thu Oct 04 10:44:43 2012 +0200
+++ b/xen/common/page_alloc.c Sat Oct 13 09:57:26 2012 +0100
@@ -303,9 +303,10 @@ static struct page_info *alloc_heap_page
     unsigned int first_node, i, j, zone = 0, nodemask_retry = 0;
     unsigned int node = (uint8_t)((memflags >> _MEMF_node) - 1);
     unsigned long request = 1UL << order;
-    cpumask_t extra_cpus_mask, mask;
     struct page_info *pg;
     nodemask_t nodemask = (d != NULL ) ? d->node_affinity : node_online_map;
+    bool_t need_tlbflush = 0;
+    uint32_t tlbflush_timestamp = 0;
 
     if ( node == NUMA_NO_NODE )
     {
@@ -417,20 +418,19 @@ static struct page_info *alloc_heap_page
     if ( d != NULL )
         d->last_alloc_node = node;
 
-    cpus_clear(mask);
-
     for ( i = 0; i < (1 << order); i++ )
     {
         /* Reference count must continuously be zero for free pages. */
         BUG_ON(pg[i].count_info != PGC_state_free);
         pg[i].count_info = PGC_state_inuse;
 
-        if ( pg[i].u.free.need_tlbflush )
+        if ( pg[i].u.free.need_tlbflush &&
+             (pg[i].tlbflush_timestamp <= tlbflush_current_time()) &&
+             (!need_tlbflush ||
+              (pg[i].tlbflush_timestamp > tlbflush_timestamp)) )
         {
-            /* Add in extra CPUs that need flushing because of this page. */
-            cpus_andnot(extra_cpus_mask, cpu_online_map, mask);
-            tlbflush_filter(extra_cpus_mask, pg[i].tlbflush_timestamp);
-            cpus_or(mask, mask, extra_cpus_mask);
+            need_tlbflush = 1;
+            tlbflush_timestamp = pg[i].tlbflush_timestamp;
        }
 
         /* Initialise fields which have other uses for free pages. */
@@ -440,10 +440,15 @@ static struct page_info *alloc_heap_page
 
     spin_unlock(&heap_lock);
 
-    if ( unlikely(!cpus_empty(mask)) )
+    if ( need_tlbflush )
     {
-        perfc_incr(need_flush_tlb_flush);
-        flush_tlb_mask(&mask);
+        cpumask_t mask = cpu_online_map;
+        tlbflush_filter(mask, tlbflush_timestamp);
+        if ( !cpus_empty(mask) )
+        {
+            perfc_incr(need_flush_tlb_flush);
+            flush_tlb_mask(&mask);
+        }
     }
 
     return pg;
* Re: alloc_heap_pages is low efficient with more CPUs
From: Jan Beulich @ 2012-10-16 7:51 UTC (permalink / raw)
To: Keir Fraser; +Cc: tupeng212, xen-devel
>>> On 15.10.12 at 17:45, Keir Fraser <keir@xen.org> wrote:
> On 15/10/2012 14:27, "tupeng212" <tupeng212@gmail.com> wrote:
>
>> Please try the attached patch.
>> : Great! you have done a good job, needless time decreases badly to 1s.
>>
>> If anybody has no proposal, I suggest you to commit this patch.
>
> I have applied it to xen-unstable. It probably makes sense to put it in 4.1
> and 4.2 as well (cc'ed Jan, and attaching the backport for 4.1 again).
Will do, but do you have an explanation how this simple, memory
only operation (64 CPUs isn't that many) has this dramatic an
effect on performance. Are we bouncing cache lines this badly? If
so, which one(s)? I don't see what would be written frequently
from multiple CPUs here - tlbflush_filter() itself only reads global
variables, but never writes them.
Jan
* Re: alloc_heap_pages is low efficient with more CPUs
From: Keir Fraser @ 2012-10-16 8:03 UTC (permalink / raw)
To: Jan Beulich; +Cc: tupeng212, xen-devel
On 16/10/2012 08:51, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>> On 15.10.12 at 17:45, Keir Fraser <keir@xen.org> wrote:
>> On 15/10/2012 14:27, "tupeng212" <tupeng212@gmail.com> wrote:
>>
>>> Please try the attached patch.
>>> : Great! you have done a good job, needless time decreases badly to 1s.
>>>
>>> If anybody has no proposal, I suggest you to commit this patch.
>>
>> I have applied it to xen-unstable. It probably makes sense to put it in 4.1
>> and 4.2 as well (cc'ed Jan, and attaching the backport for 4.1 again).
>
> Will do, but do you have an explanation how this simple, memory
> only operation (64 CPUs isn't that many) has this dramatic an
> effect on performance. Are we bouncing cache lines this badly? If
> so, which one(s)? I don't see what would be written frequently
> from multiple CPUs here - tlbflush_filter() itself only reads global
> variables, but never writes them.
It's just the small factors multiplying up. A 40G domain is 10M page
allocations, each of which does 64x per-cpu cpumask bitops and timestamp
compares. That's going on for a billion (10^9) times round
tlbflush_filter()'s loop. Each iteration need only take a few CPU cycles for
the effect to actually be noticeable. If the stuff being touched were not in
the cache (which of course it is, since this path is so hot and not touching
much memory) it would probably take an hour to create the domain!
-- Keir
> Jan
>
* Re: alloc_heap_pages is low efficient with more CPUs
From: Jan Beulich @ 2012-10-16 8:28 UTC (permalink / raw)
To: Keir Fraser; +Cc: tupeng212, xen-devel
>>> On 16.10.12 at 10:03, Keir Fraser <keir.xen@gmail.com> wrote:
> On 16/10/2012 08:51, "Jan Beulich" <JBeulich@suse.com> wrote:
>
>>>>> On 15.10.12 at 17:45, Keir Fraser <keir@xen.org> wrote:
>>> On 15/10/2012 14:27, "tupeng212" <tupeng212@gmail.com> wrote:
>>>
>>>> Please try the attached patch.
>>>> : Great! you have done a good job, needless time decreases badly to 1s.
>>>>
>>>> If anybody has no proposal, I suggest you to commit this patch.
>>>
>>> I have applied it to xen-unstable. It probably makes sense to put it in 4.1
>>> and 4.2 as well (cc'ed Jan, and attaching the backport for 4.1 again).
>>
>> Will do, but do you have an explanation how this simple, memory
>> only operation (64 CPUs isn't that many) has this dramatic an
>> effect on performance. Are we bouncing cache lines this badly? If
>> so, which one(s)? I don't see what would be written frequently
>> from multiple CPUs here - tlbflush_filter() itself only reads global
>> variables, but never writes them.
>
> It's just the small factors multiplying up. A 40G domain is 10M page
> allocations, each of which does 64x per-cpu cpumask bitops and timestamp
> compares. That's going on for a billion (10^9) times round
> tlbflush_filter()s loop. Each iteration need only take a few CPU cycles for
> the effect to actually be noticeable. If the stuff being touched were not in
> the cache (which of course it is, since this path is so hot and not touching
> much memory) it would probably take an hour to create the domain!
Which means that your TODO remark in the change description
is more than a nice-to-have, since
- in the fallback case using 4k allocations your change doesn't
win anything
- even larger domains would still suffer from this very problem
(albeit in 4.2 we try 1G allocations first, so your change should
have an even more visible effect there, as long as falling back
to smaller chunks isn't needed)
One question of course is whether, for sufficiently large allocation
requests on sufficiently large systems, trying to avoid the TLB
flush is worth it at all, considering that determining where to skip
the flushes is costing so much more than the flushes themselves.
Jan
* Re: alloc_heap_pages is low efficient with more CPUs
From: Keir Fraser @ 2012-10-16 8:53 UTC (permalink / raw)
To: Jan Beulich; +Cc: tupeng212, xen-devel
On 16/10/2012 09:28, "Jan Beulich" <JBeulich@suse.com> wrote:
>> It's just the small factors multiplying up. A 40G domain is 10M page
>> allocations, each of which does 64x per-cpu cpumask bitops and timestamp
>> compares. That's going on for a billion (10^9) times round
>> tlbflush_filter()s loop. Each iteration need only take a few CPU cycles for
>> the effect to actually be noticeable. If the stuff being touched were not in
>> the cache (which of course it is, since this path is so hot and not touching
>> much memory) it would probably take an hour to create the domain!
>
> Which means that your TODO remark in the change description
> is more than a nice-to-have, since
> - in the fallback case using 4k allocations your change doesn't
> win anything
> - even larger domains would still suffer from this very problem
> (albeit in 4.2 we try 1G allocations first, so your change should
> have an even more visible effect there, as long as falling back
> to smaller chunks isn't needed)
Agreed. This is a simple starting point which is also easy to backport
though, and solves the immediate observed problem.
> One question of course is whether, for sufficiently large allocation
> requests on sufficiently large systems, trying to avoid the TLB
> flush is worth it at all, considering that determining where to skip
> the flushes is costing so much more than the flushes themselves.
It might be a simpler option. I'll see how doing it the filtering way looks.
-- Keir