Poor HVM performance with 8 vcpus

* Poor HVM performance with 8 vcpus
@ 2009-10-07  6:55 Juergen Gross
  2009-10-07  7:26 ` Keir Fraser
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Juergen Gross @ 2009-10-07  6:55 UTC (permalink / raw)
  To: xen-devel@lists.xensource.com

Hi,

we've got massive performance problems running an 8-vcpu HVM guest (BS2000)
under XEN (xen 3.3.1).

With a specific benchmark producing a rather high load on memory management
operations (lots of process creation/deletion and memory allocation) the 8
vcpu performance was worse than the 4 vcpu performance. On other platforms
(/390, MIPS, SPARC) this benchmark scaled rather well with the number of cpus.

The software performance counters of XEN seemed to point to the shadow lock
being the reason. I modified the hypervisor to gather some lock statistics
(patch will be sent soon) and found that the shadow lock really is the
bottleneck. On average, 4 vcpus are waiting to get the lock!

Is this a known issue?
Is there a chance to split the shadow lock into sub-locks or to use a
reader/writer lock instead?
I just wanted to ask before trying to understand all of the shadow code :-)

Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Otto-Hahn-Ring 6                        Internet: ts.fujitsu.com
D-81739 Muenchen                 Company details: ts.fujitsu.com/imprint.html

* Re: Poor HVM performance with 8 vcpus
  2009-10-07  6:55 Poor HVM performance with 8 vcpus Juergen Gross
@ 2009-10-07  7:26 ` Keir Fraser
  2009-10-07  7:49   ` Juergen Gross
  2009-10-07  7:56 ` Ian Pratt
  2009-10-07 16:37 ` Gianluca Guida
  2 siblings, 1 reply; 29+ messages in thread
From: Keir Fraser @ 2009-10-07  7:26 UTC (permalink / raw)
  To: Juergen Gross, xen-devel@lists.xensource.com; +Cc: Tim Deegan

Hi Juergen,

Tim Deegan is the man for this stuff (cc'ed) - you don't want to get too
involved in the shadow code without syncing with him first. My
understanding, however, is that shadow code is currently designed with
scalability up to only about 4 VCPUs in mind. The expectation is that, as
users want to scale wider than that, they will typically be upgrading to
modern many-core processors with hardware assistance (Intel EPT, AMD NPT).

If you don't fit into that scenario, perhaps we can find you some
lowish-hanging fruit to improve parallelism. Big changes in shadow code
could be scary for us due to the likely nasty bug tail!

 -- Keir


* Re: Poor HVM performance with 8 vcpus
  2009-10-07  7:26 ` Keir Fraser
@ 2009-10-07  7:49   ` Juergen Gross
  0 siblings, 0 replies; 29+ messages in thread
From: Juergen Gross @ 2009-10-07  7:49 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Tim Deegan, xen-devel@lists.xensource.com

Hi Keir,

thanks for the quick answer.

Keir Fraser wrote:
> Hi Juergen,
> 
> Tim Deegan is the man for this stuff (cc'ed) - you don't want to get too
> involved in the shadow code without syncing with him first. My

:-)

> understanding, however, is that shadow code is currently designed with
> scalability up to only about 4 VCPUs in mind. The expectation is that, as
> users want to scale wider than that, they will typically be upgrading to
> modern many-core processors with hardware assistance (Intel EPT, AMD NPT).

Okay. We plan to do this as soon as Nehalem-EX is available. Right now we are
testing on 4 socket Dunnington systems (24 cores) and found the issue.
This will be a problem for us if Nehalem-EX becomes available much later than
currently planned. So I wanted to check for possible enhancements in XEN before
this might happen.

> 
> If you don't fit into that scenario, perhaps we can find you some
> lowish-hanging fruit to improve parallelism. Big changes in shadow code
> could be scary for us due to the likely nasty bug tail!

I understand this. Let's see if some rather local changes could improve the
performance.


Juergen


-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Otto-Hahn-Ring 6                        Internet: ts.fujitsu.com
D-81739 Muenchen                 Company details: ts.fujitsu.com/imprint.html

* RE: Poor HVM performance with 8 vcpus
  2009-10-07  6:55 Poor HVM performance with 8 vcpus Juergen Gross
  2009-10-07  7:26 ` Keir Fraser
@ 2009-10-07  7:56 ` Ian Pratt
  2009-10-07  8:08   ` James Harper
  2009-10-07 16:37 ` Gianluca Guida
  2 siblings, 1 reply; 29+ messages in thread
From: Ian Pratt @ 2009-10-07  7:56 UTC (permalink / raw)
  To: Juergen Gross, xen-devel@lists.xensource.com; +Cc: Ian Pratt

> With a specific benchmark producing a rather high load on memory
> management
> operations (lots of process creation/deletion and memory allocation) the 8
> vcpu performance was worse than the 4 vcpu performance. On other platforms
> (/390, MIPS, SPARC) this benchmark scaled rather well with the number of
> cpus.
> 
> The result of the usage of the software performance counters of XEN seemed
> to point to the shadow lock being the reason. I modified the Hypervisor to
> gather some lock statistics (patch will be sent soon) and found that the
> shadow lock is really the bottleneck. On average 4 vcpus are waiting to
> get the lock!

At various points in the shadow pagetable code, Xen needs to be able to find all the writeable mappings (PTEs) of a particular page. Rather than maintaining a data structure that maps a frame number to its list of PTEs, we've found that it is generally quicker to use a heuristic. The heuristic knows where to look to find writeable mappings in a number of common OSes. For example, it knows to look in the direct-mapped (1:1) kernel address regions in Linux, or the recursive linear mapping in Windows. If the heuristics fail, Xen resorts to a brute-force search.

Unless BS2000 just happens to use the exact same virtual memory layout as one of the supported OSes, the heuristic will be failing. The brute-force search is rather slow, so the shadow lock is held for an extended period, resulting in lock convoys on SMP guests.

The quick fix is to add a heuristic for BS2000. However, the list of heuristics is getting a bit unmanageable, and they're currently dumbly tried in-order. Given the user-base size of BS2000, Keir is likely to insist the heuristic for BS2000 is the last to be tried :)

At the very least it would be good to have a predictor which figured out which of the several heuristics should actually be used for a given VM. A simple "try whichever one worked last time first" should work fine.

Even smarter would be to just have heuristics for the two general classes of mapping (1:1 and recursive), and have the code automatically figure out the starting virtual address being used for a given guest.
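
A minimal sketch of the "try whichever one worked last time first" idea, assuming each heuristic is a function that reports success or failure for a given frame. The names (`heuristic_fn`, `struct guest_heuristics`, `find_writeable_mapping`) and the toy heuristics are hypothetical and are not taken from the Xen source:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature: returns 1 if the heuristic located the
 * writeable mapping of frame `gfn`, 0 otherwise. */
typedef int (*heuristic_fn)(unsigned long gfn);

#define MAX_HEURISTICS 8

/* Per-guest ordering, so each VM remembers what worked for it. */
struct guest_heuristics {
    heuristic_fn fn[MAX_HEURISTICS];
    size_t count;
};

/* Try the heuristics in order.  On a hit, move the winner to the front
 * so it is tried first next time.  Returns the number of heuristics
 * tried, or 0 if all failed and the caller must fall back to the
 * brute-force search. */
static size_t find_writeable_mapping(struct guest_heuristics *g,
                                     unsigned long gfn)
{
    assert(g != NULL && g->count <= MAX_HEURISTICS);

    for (size_t i = 0; i < g->count; i++) {
        heuristic_fn h = g->fn[i];
        if (!h(gfn))
            continue;
        /* Shift entries 0..i-1 down one slot, then put the winner first. */
        for (size_t j = i; j > 0; j--)
            g->fn[j] = g->fn[j - 1];
        g->fn[0] = h;
        return i + 1;
    }
    return 0;
}

/* Toy stand-ins for real layout-specific heuristics. */
static int guess_linux_direct_map(unsigned long gfn) { (void)gfn; return 0; }
static int guess_windows_recursive(unsigned long gfn) { (void)gfn; return 0; }
static int guess_bs2000(unsigned long gfn) { return gfn != 0; }
```

With the three toy heuristics registered in the order shown, the first lookup tries all three before the last one hits; every later lookup for the same guest succeeds on the first try. Keeping the order per guest means a late-listed OS only pays the extra cost once.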

All fun stuff.

Ian

* RE: Poor HVM performance with 8 vcpus
  2009-10-07  7:56 ` Ian Pratt
@ 2009-10-07  8:08   ` James Harper
  2009-10-07  8:13     ` Ian Pratt
                       ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: James Harper @ 2009-10-07  8:08 UTC (permalink / raw)
  To: Ian Pratt, Juergen Gross, xen-devel

> At the very least it would be good to have a predictor which figured
> out which of the several heuristics should actually be used for a
> given VM. A simple "try whichever one worked last time first" should
> work fine.
>
> Even smarter would be to just have heuristics for the two general
> classes of mapping (1:1 and recursive), and have the code
> automatically figure out the starting virtual address being used for
> a given guest.
>

Are there any other of these heuristics tucked away in xen? Would there
be any benefit to specifying the OS being virtualised in the config? Eg
"os=windows"?

James

* RE: Poor HVM performance with 8 vcpus
  2009-10-07  8:08   ` James Harper
@ 2009-10-07  8:13     ` Ian Pratt
  2009-10-07  8:31       ` Juergen Gross
  2009-10-07  8:17     ` Keir Fraser
  2009-10-07  9:12     ` Tim Deegan
  2 siblings, 1 reply; 29+ messages in thread
From: Ian Pratt @ 2009-10-07  8:13 UTC (permalink / raw)
  To: James Harper, Juergen Gross, xen-devel@lists.xensource.com; +Cc: Ian Pratt

> Are there any other of these heuristics tucked away in xen? Would there
> be any benefit to specifying the OS being virtualised in the config? Eg
> "os=windows"?

That's really the only one that springs to mind.

Given how simple a predictor would be, I think that's the best approach.

Setting the OS in the config file is a little tricky as in a couple of instances you'd want to know the exact version (linux moved the 64b direct mapped region at some point during 2.6 development).

Ian

* Re: Poor HVM performance with 8 vcpus
  2009-10-07  8:08   ` James Harper
  2009-10-07  8:13     ` Ian Pratt
@ 2009-10-07  8:17     ` Keir Fraser
  2009-10-07  9:12     ` Tim Deegan
  2 siblings, 0 replies; 29+ messages in thread
From: Keir Fraser @ 2009-10-07  8:17 UTC (permalink / raw)
  To: James Harper, Ian Pratt, Juergen Gross,
	xen-devel@lists.xensource.com

On 07/10/2009 09:08, "James Harper" <james.harper@bendigoit.com.au> wrote:

> Are there any other of these heuristics tucked away in xen? Would there
> be any benefit to specifying the OS being virtualised in the config? Eg
> "os=windows"?

There aren't really any others. The timer_mode config is along the same
lines I suppose, in that you would set it depending on guest kernel time
handling behaviour. In the specific case of shadow-code heuristics, however,
auto-detecting the OS type by methods such as try-previous-successful-first
would be the sensible way to go.

 -- Keir

* Re: Poor HVM performance with 8 vcpus
  2009-10-07  8:13     ` Ian Pratt
@ 2009-10-07  8:31       ` Juergen Gross
  0 siblings, 0 replies; 29+ messages in thread
From: Juergen Gross @ 2009-10-07  8:31 UTC (permalink / raw)
  To: Ian Pratt; +Cc: James Harper, xen-devel@lists.xensource.com

Ian Pratt wrote:
>  
>> Are there any other of these heuristics tucked away in xen? Would there
>> be any benefit to specifying the OS being virtualised in the config? Eg
>> "os=windows"?
> 
> That's really the only one that springs to mind.
> 
> Given how simple a predictor would be, I think that's the best approach.
> 
> Setting the OS in the config file is a little tricky as in a couple of instances you'd want to know the exact version (linux moved the 64b direct mapped region at some point during 2.6 development).

I think it is a bad idea in most cases to use a basic parameter such as
"os=xyz" to tune some specific behaviour. It may be useful for setting some
default parameters, but there should be an explicit parameter as well, e.g.
"shadow_heuristic=direct-mapped@address"
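
To make the idea concrete, here is a sketch of how a value of that form could be parsed. The option does not exist in Xen; the `recursive` keyword, the struct layout and the function name are assumptions made up for this illustration:

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

enum mapping_kind { MAPPING_DIRECT, MAPPING_RECURSIVE };

struct heuristic_hint {
    enum mapping_kind kind;
    unsigned long long base;   /* starting virtual address */
};

/* Parse "<kind>@<address>", e.g. "direct-mapped@0x1000".
 * Returns 0 on success, -1 on malformed input (out is left untouched). */
static int parse_heuristic_hint(const char *s, struct heuristic_hint *out)
{
    assert(s != NULL && out != NULL);

    const char *at = strchr(s, '@');
    if (at == NULL)
        return -1;

    /* Match the part before '@' against the known kinds. */
    size_t len = (size_t)(at - s);
    enum mapping_kind kind;
    if (len == strlen("direct-mapped") &&
        strncmp(s, "direct-mapped", len) == 0)
        kind = MAPPING_DIRECT;
    else if (len == strlen("recursive") &&
             strncmp(s, "recursive", len) == 0)
        kind = MAPPING_RECURSIVE;
    else
        return -1;

    /* Require a digit so strtoull cannot accept a sign or whitespace. */
    if (!isdigit((unsigned char)at[1]))
        return -1;

    char *end;
    errno = 0;
    unsigned long long base = strtoull(at + 1, &end, 0); /* 0x.. or decimal */
    if (errno != 0 || *end != '\0')
        return -1;

    out->kind = kind;
    out->base = base;
    return 0;
}
```

An explicit hint like this avoids the version problem raised for "os=...": the administrator states the mapping class and base address directly instead of relying on a table keyed by OS name.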


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Otto-Hahn-Ring 6                        Internet: ts.fujitsu.com
D-81739 Muenchen                 Company details: ts.fujitsu.com/imprint.html

* Re: Poor HVM performance with 8 vcpus
  2009-10-07  8:08   ` James Harper
  2009-10-07  8:13     ` Ian Pratt
  2009-10-07  8:17     ` Keir Fraser
@ 2009-10-07  9:12     ` Tim Deegan
  2009-10-07  9:40       ` Juergen Gross
  2 siblings, 1 reply; 29+ messages in thread
From: Tim Deegan @ 2009-10-07  9:12 UTC (permalink / raw)
  To: James Harper; +Cc: Ian Pratt, xen-devel@lists.xensource.com, Juergen Gross

At 09:08 +0100 on 07 Oct (1254906487), James Harper wrote:
> > At the very least it would be good to have a predictor which figured
> > out which of the several heuristics should actually be used for a
> > given VM. A simple "try whichever one worked last time first" should
> > work fine.
> >
> > Even smarter would be to just have heuristics for the two general
> > classes of mapping (1:1 and recursive), and have the code
> > automatically figure out the starting virtual address being used for
> > a given guest.
> >
> 
> Are there any other of these heuristics tucked away in xen? Would there
> be any benefit to specifying the OS being virtualised in the config? Eg
> "os=windows"?

It would be better to allow the specific heuristic to be specified in
the Xen interface (e.g. that it's a recursive pagetable at a particular
address, or a one-to-one mapping).  Which isn't to say the python layer
couldn't put some syntactic sugar on it. 

But the bulk of the win will be had from adding BS2000 to the list of
heuristics.  There's probably some benefit in making the heuristic list
pull-to-front, too.

Automatically detecting 1:1 mappings and linear pagetable schemes would
be fun and is probably the Right Thing[tm], but making sure it works
with all the OSes that currently work (e.g. all HALs of all Windows
versions) will be a significant investment in time. :)

Also, before getting too stuck into this it'd be worth running once more
with performance counters enabled and checking that this is actually
your problem!  You should see a much higher number for "shadow writeable
brute-force" running BS2000 than running Windows.

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

* Re: Poor HVM performance with 8 vcpus
  2009-10-07  9:12     ` Tim Deegan
@ 2009-10-07  9:40       ` Juergen Gross
  2009-10-07 10:11         ` George Dunlap
  2009-10-07 10:14         ` Tim Deegan
  0 siblings, 2 replies; 29+ messages in thread
From: Juergen Gross @ 2009-10-07  9:40 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Ian Pratt, James Harper, xen-devel@lists.xensource.com

Tim Deegan wrote:
> At 09:08 +0100 on 07 Oct (1254906487), James Harper wrote:
>>> At the very least it would be good to have a predictor which figured
>>> out which of the several heuristics should actually be used for a
>>> given VM. A simple "try whichever one worked last time first" should
>>> work fine.
>>>
>>> Even smarter would be to just have heuristics for the two general
>>> classes of mapping (1:1 and recursive), and have the code
>>> automatically figure out the starting virtual address being used for
>>> a given guest.
>>>
>> Are there any other of these heuristics tucked away in xen? Would there
>> be any benefit to specifying the OS being virtualised in the config? Eg
>> "os=windows"?
> 
> It would be better to allow the specific heuristic to be specified in
> the Xen interface (e.g. that it's a recursive pagetable at a particular
> address, or a one-to-one mapping).  Which isn't to say the python layer
> couldn't put some syntactic sugar on it. 
> 
> But the bulk of the win will be had from adding BS2000 to the list of
> heuristics.  There's probably some benefit in making the heuristic list
> pull-to-front, too.
> 
> Automatically detecting 1:1 mappings and linear pagetable schemes would
> be fun and is probably the Right Thing[tm], but making sure it works
> with all the OSes that currently work (e.g. all HALs of all Windows
> versions) will be a significant investment in time. :)
> 
> Also, before getting too stuck into this it'd be worth running once more
> with performance counters enabled and checking that this is actually
> your problem!  You should see a much higher number for "shadow writeable
> brute-force" running BS2000 than running Windows.

I still had the numbers for a test with 6 vcpus, which already showed severe
performance degradation. I edited the numbers a little bit to show only the
counters for the cpus running BS2000 and no other domain. The test ran for
60 seconds.

calls to shadow_alloc              438     427     424     480     436     422
number of shadow pages in use     2765    2151    2386    2509    4885    1391
calls to shadow_free               168     132     185     144     181     105
calls to shadow_fault            65271   69132   60495   53756   73363   52449
shadow_fault fast path n/p        7347    8081    6713    6134    8521    6112
shadow_fault fast path error        14      12      15       3      13      11
shadow_fault really guest fault  24004   25723   22815   19709   27049   19190
shadow_fault emulates a write     1045     949    1018     995    1015     901
shadow_fault fast emulate          424     361     449     348     387     314
shadow_fault fixed fault         32503   34264   29624   26689   36641   26096
calls to shadow_validate_gl2e      875     748     917     731     795     667
calls to shadow_validate_gl3e      481     456     443     491     489     446
calls to shadow_validate_gl4e      104      97      95     112     105      95
calls to shadow_hash_lookup    2109654 2203254 2228896 2245849 2164727 2309059
shadow hash hit in bucket head 2012828 2111164 2161113 2177591 2104099 2242458
shadow hash misses                 851     840     841     910     852     838
calls to get_shadow_status     2110031 2202828 2228769 2246689 2164213 2309241
calls to shadow_hash_insert        438     436     428     481     437     430
calls to shadow_hash_delete        168     150     185     154     202     128
shadow removes write access        335     324     329     385     330     336
shadow writeable: linux high       130     139     152     155     138     149
shadow writeable: sl1p           14508   15402   12961   11823   16474   11472
shadow writeable brute-force       205     185     177     230     192     187
shadow unshadows for fork/exit       9      12      12      12      18      12
shadow unshadows a page             10      13      13      13      19      13
shadow walks guest tables       647527  727336  649397  646601  659655  621289
shadow checks gwalk                526     544     535     550     614     554
shadow flush tlb by rem wr perm    235     233     229     268     238     237
shadow emulates invlpg           14688   15499   14604   12630   16627   11370
shadow OOS fixup adds            14467   15335   13059   11840   16624   11339
shadow OOS unsyncs               14467   15335   13058   11840   16624   11339
shadow OOS evictions               566     449     565     369     589     336
shadow OOS resyncs               14510   15407   12964   11828   16478   11481

I don't think the "shadow writeable brute-force" is the problem.
get_shadow_status seems to be a more critical candidate.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Otto-Hahn-Ring 6                        Internet: ts.fujitsu.com
D-81739 Muenchen                 Company details: ts.fujitsu.com/imprint.html

* Re: Poor HVM performance with 8 vcpus
  2009-10-07  9:40       ` Juergen Gross
@ 2009-10-07 10:11         ` George Dunlap
  2009-10-07 11:45           ` Juergen Gross
  2009-10-07 10:14         ` Tim Deegan
  1 sibling, 1 reply; 29+ messages in thread
From: George Dunlap @ 2009-10-07 10:11 UTC (permalink / raw)
  To: Juergen Gross
  Cc: xen-devel@lists.xensource.com, Ian Pratt, James Harper,
	Tim Deegan

Juergen,

I think this problem is a good candidate for xentrace/xenalyze.  If
you take a 30-second trace (xentrace -D -e all -T 30
/tmp/[traceid].trace) while the benchmark is at its heaviest, and then
analyze it using xenalyze
(http://xenbits.xensource.com/ext/xenalyze.hg), it should show up
whether the shadow performance is due to brute-force search or
something else.

If you're using 3.3, you'll have to apply the back-patch to xenalyze
to make it work properly.

If you post the summary output (xenalyze -s [traceid].trace >
[traceid].summary), I can help interpret it.

 -George

* Re: Poor HVM performance with 8 vcpus
  2009-10-07  9:40       ` Juergen Gross
  2009-10-07 10:11         ` George Dunlap
@ 2009-10-07 10:14         ` Tim Deegan
  2009-10-07 12:32           ` Juergen Gross
  1 sibling, 1 reply; 29+ messages in thread
From: Tim Deegan @ 2009-10-07 10:14 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Ian Pratt, James Harper, xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 4044 bytes --]

At 10:40 +0100 on 07 Oct (1254912041), Juergen Gross wrote:
> calls to shadow_alloc              438     427     424     480     436     422
> number of shadow pages in use     2765    2151    2386    2509    4885    1391
> calls to shadow_free               168     132     185     144     181     105
> calls to shadow_fault            65271   69132   60495   53756   73363   52449
> shadow_fault fast path n/p        7347    8081    6713    6134    8521    6112
> shadow_fault fast path error        14      12      15       3      13      11
> shadow_fault really guest fault  24004   25723   22815   19709   27049   19190
> shadow_fault emulates a write     1045     949    1018     995    1015     901
> shadow_fault fast emulate          424     361     449     348     387     314
> shadow_fault fixed fault         32503   34264   29624   26689   36641   26096
> calls to shadow_validate_gl2e      875     748     917     731     795     667
> calls to shadow_validate_gl3e      481     456     443     491     489     446
> calls to shadow_validate_gl4e      104      97      95     112     105      95
> calls to shadow_hash_lookup    2109654 2203254 2228896 2245849 2164727 2309059
> shadow hash hit in bucket head 2012828 2111164 2161113 2177591 2104099 2242458
> shadow hash misses                 851     840     841     910     852     838
> calls to get_shadow_status     2110031 2202828 2228769 2246689 2164213 2309241
> calls to shadow_hash_insert        438     436     428     481     437     430
> calls to shadow_hash_delete        168     150     185     154     202     128
> shadow removes write access        335     324     329     385     330     336
> shadow writeable: linux high       130     139     152     155     138     149
> shadow writeable: sl1p           14508   15402   12961   11823   16474   11472
> shadow writeable brute-force       205     185     177     230     192     187
> shadow unshadows for fork/exit       9      12      12      12      18      12
> shadow unshadows a page             10      13      13      13      19      13
> shadow walks guest tables       647527  727336  649397  646601  659655  621289
> shadow checks gwalk                526     544     535     550     614     554
> shadow flush tlb by rem wr perm    235     233     229     268     238     237
> shadow emulates invlpg           14688   15499   14604   12630   16627   11370
> shadow OOS fixup adds            14467   15335   13059   11840   16624   11339
> shadow OOS unsyncs               14467   15335   13058   11840   16624   11339
> shadow OOS evictions               566     449     565     369     589     336
> shadow OOS resyncs               14510   15407   12964   11828   16478   11481
> 
> I don't think the "shadow writable brute-force" is the problem.
> get_shadow_status seems to be a more critical candidate.

get_shadow_status is a simple hash lookup to find the shadow of a frame;
it's expected to happen multiple times per pagefault.  Even so those
numbers look high.  ~10k guest PT walks per CPU per second, each causing
3-4 shadow hash lookups.  That's much higher than the number of
pagefaults.  
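
For illustration, the ratios behind that observation can be recomputed from the first column of the counters quoted in this thread. This is only a sketch: the helper names are hypothetical and not part of Xen, and the inputs are the "calls to shadow_fault", "shadow walks guest tables" and "calls to shadow_hash_lookup" values for the first cpu.

```c
#include <assert.h>

/* Hypothetical helper (not part of Xen): ratio of two event counters. */
static double counter_ratio(unsigned long numerator, unsigned long denominator)
{
    assert(denominator != 0);
    return (double)numerator / (double)denominator;
}

/* First-column values copied from the counter dump in this thread. */
static const unsigned long shadow_fault_calls  = 65271;
static const unsigned long guest_table_walks   = 647527;
static const unsigned long shadow_hash_lookups = 2109654;

/* Shadow hash lookups caused by each guest pagetable walk. */
static double lookups_per_walk(void)
{
    return counter_ratio(shadow_hash_lookups, guest_table_walks);
}

/* Guest pagetable walks performed per call to shadow_fault. */
static double walks_per_fault(void)
{
    return counter_ratio(guest_table_walks, shadow_fault_calls);
}
```

With these inputs the sketch gives roughly 3.3 lookups per walk and roughly 9.9 walks per shadow_fault call, which is consistent with the "3-4 lookups per walk" estimate and with walks being far more frequent than faults.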

I take it you reset the performance counters at the start of that run?

Are there any other numbers (outside the shadow stats) that are up
around 600k/cpu?  

I wonder whether this is caused by pagetables changing under our feet in
the shadow fault handler -- in order to avoid taking the shadow lock too
often, the fault handler walks the pagetables first, then takes the lock
and double-checks its work.   If the other cpus are aggressively writing
to the pagetables that this CPU is running on that could cause the
pagefault handler to spin, locking and unlocking the shadow lock as it
goes.   I would expect to see the OOS unsyncs number much higher if that
was the case though.  
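
The walk-first, lock-and-recheck pattern described above can be modelled with a small single-threaded toy. This is a sketch under stated assumptions, not the Xen code: every name below is made up, a "generation" counter stands in for whatever makes a walk stale, and writes by other CPUs are simulated by a countdown. The keep_lock_on_rewalk flag mirrors the idea of the attached patch, which stops dropping the lock before going back to rewalk.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in Xen. */
struct toy_state {
    unsigned generation;        /* bumped whenever the "guest pagetable" changes */
    unsigned pending_writes;    /* simulated writes by other CPUs still to come */
    bool     locked;            /* stands in for the shadow lock */
    unsigned lock_acquisitions; /* how often the lock was taken */
    unsigned walks;             /* how often the tables were walked */
};

/* Walk the tables and return a snapshot; then let one simulated
 * remote write land, as if another CPU raced with us. */
static unsigned toy_walk(struct toy_state *s)
{
    unsigned snapshot = s->generation;
    s->walks++;
    if (s->pending_writes > 0) {
        s->pending_writes--;
        s->generation++;
    }
    return snapshot;
}

static void toy_lock(struct toy_state *s)
{
    assert(!s->locked);
    s->locked = true;
    s->lock_acquisitions++;
}

static void toy_unlock(struct toy_state *s)
{
    assert(s->locked);
    s->locked = false;
}

/* Handle one fault. If keep_lock_on_rewalk is false the lock is dropped
 * before every retry; if true it stays held across retries. */
static void toy_fault(struct toy_state *s, bool keep_lock_on_rewalk)
{
    for (;;) {
        unsigned snapshot = toy_walk(s);  /* lock-free on the first pass */
        if (!s->locked)
            toy_lock(s);
        if (snapshot == s->generation)
            break;                        /* walk still valid: go fix the shadow */
        if (!keep_lock_on_rewalk)
            toy_unlock(s);
    }
    toy_unlock(s);
}
```

In this toy, three racing writes cause four walks either way; the lock is taken four times when it is dropped on every rewalk and only once when it is kept.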

Can you try the attached patch and see if it makes any difference?

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

[-- Attachment #2: rewalk-with-lock --]
[-- Type: text/plain, Size: 1075 bytes --]

diff -r f1fb228b43ac xen/arch/x86/mm/shadow/multi.c
--- a/xen/arch/x86/mm/shadow/multi.c	Mon Aug 31 09:45:27 2009 +0100
+++ b/xen/arch/x86/mm/shadow/multi.c	Wed Oct 07 11:08:44 2009 +0100
@@ -3266,7 +3266,10 @@
                 regs->error_code | PFEC_page_present);
 #endif /* (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB) */
 
-    shadow_lock(d);
+    /* We can reach here with the lock held if we took the "rewalk" path.
+     * We checked for recursive faults already, so this is OK. */
+    if ( !shadow_locked_by_me(d) )
+        shadow_lock(d);
 
     /* Make sure there is enough free shadow memory to build a chain of
      * shadow tables. (We never allocate a top-level shadow on this path,
@@ -3298,7 +3301,6 @@
     /* Second bit set: Resynced a page. Re-walk needed. */
     if ( rc & GW_RMWR_REWALK )
     {
-        shadow_unlock(d);
         goto rewalk;
     }
 #endif /* OOS */
@@ -3306,7 +3308,6 @@
     if ( !shadow_check_gwalk(v, va, &gw) )
     {
         perfc_incr(shadow_inconsistent_gwalk);
-        shadow_unlock(d);
         goto rewalk;
     }
 

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Poor HVM performance with 8 vcpus
  2009-10-07 10:11         ` George Dunlap
@ 2009-10-07 11:45           ` Juergen Gross
  2009-10-07 13:44             ` George Dunlap
       [not found]             ` <de76405a0910070627s7585c587l8753e40d1d2b77b9@mail.gmail.com>
  0 siblings, 2 replies; 29+ messages in thread
From: Juergen Gross @ 2009-10-07 11:45 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Pratt, xen-devel@lists.xensource.com, Tim Deegan,
	James Harper

George Dunlap wrote:
> Juergen,
> 
> I think this problem is a good candidate for xentrace/xenalyze.  If
> you take a 30-second trace (xentrace -D -e all -T 30
> /tmp/[traceid].trace) while the benchmark is at its heaviest, and then
> analyze it using xenalyze
> (http://xenbits.xensource.com/ext/xenalyze.hg), it should show up
> whether the shadow performance is due to brute-force search or
> something else.
> 
> If you're using 3.3, you'll have to apply the back-patch to xenalyze
> to make it work properly.

The patches don't apply cleanly, and the build fails with errors even without
the patches due to incorrect format strings.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Otto-Hahn-Ring 6                        Internet: ts.fujitsu.com
D-81739 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Poor HVM performance with 8 vcpus
  2009-10-07 10:14         ` Tim Deegan
@ 2009-10-07 12:32           ` Juergen Gross
  0 siblings, 0 replies; 29+ messages in thread
From: Juergen Gross @ 2009-10-07 12:32 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Ian Pratt, James Harper, xen-devel@lists.xensource.com

Tim Deegan wrote:
> At 10:40 +0100 on 07 Oct (1254912041), Juergen Gross wrote:
>> calls to shadow_alloc              438     427     424     480     436     422
>> number of shadow pages in use     2765    2151    2386    2509    4885    1391
>> calls to shadow_free               168     132     185     144     181     105
>> calls to shadow_fault            65271   69132   60495   53756   73363   52449
>> shadow_fault fast path n/p        7347    8081    6713    6134    8521    6112
>> shadow_fault fast path error        14      12      15       3      13      11
>> shadow_fault really guest fault  24004   25723   22815   19709   27049   19190
>> shadow_fault emulates a write     1045     949    1018     995    1015     901
>> shadow_fault fast emulate          424     361     449     348     387     314
>> shadow_fault fixed fault         32503   34264   29624   26689   36641   26096
>> calls to shadow_validate_gl2e      875     748     917     731     795     667
>> calls to shadow_validate_gl3e      481     456     443     491     489     446
>> calls to shadow_validate_gl4e      104      97      95     112     105      95
>> calls to shadow_hash_lookup    2109654 2203254 2228896 2245849 2164727 2309059
>> shadow hash hit in bucket head 2012828 2111164 2161113 2177591 2104099 2242458
>> shadow hash misses                 851     840     841     910     852     838
>> calls to get_shadow_status     2110031 2202828 2228769 2246689 2164213 2309241
>> calls to shadow_hash_insert        438     436     428     481     437     430
>> calls to shadow_hash_delete        168     150     185     154     202     128
>> shadow removes write access        335     324     329     385     330     336
>> shadow writeable: linux high       130     139     152     155     138     149
>> shadow writeable: sl1p           14508   15402   12961   11823   16474   11472
>> shadow writeable brute-force       205     185     177     230     192     187
>> shadow unshadows for fork/exit       9      12      12      12      18      12
>> shadow unshadows a page             10      13      13      13      19      13
>> shadow walks guest tables       647527  727336  649397  646601  659655  621289
>> shadow checks gwalk                526     544     535     550     614     554
>> shadow flush tlb by rem wr perm    235     233     229     268     238     237
>> shadow emulates invlpg           14688   15499   14604   12630   16627   11370
>> shadow OOS fixup adds            14467   15335   13059   11840   16624   11339
>> shadow OOS unsyncs               14467   15335   13058   11840   16624   11339
>> shadow OOS evictions               566     449     565     369     589     336
>> shadow OOS resyncs               14510   15407   12964   11828   16478   11481
>>
>> I don't think the "shadow writable brute-force" is the problem.
>> get_shadow_status seems to be a more critical candidate.
> 
> get_shadow_status is a simple hash lookup to find the shadow of a frame;
> it's expected to happen multiple times per pagefault.  Even so those
> numbers look high.  ~10k guest PT walks per CPU per second, each causing
> 3-4 shadow hash lookups.  That's much higher than the number of
> pagefaults.  
> 
> I take it you reset the performance counters at the start of that run?

Yes.

> 
> Are there any other numbers (outside the shadow stats) that are up
> around 600k/cpu?

No. All other counts are below 100k.

> 
> I wonder whether this is caused by pagetables changing under our feet in
> the shadow fault handler -- in order to avoid taking the shadow lock too
> often, the fault handler walks the pagetables first, then takes the lock
> and double-checks its work.   If the other cpus are aggressively writing
> to the pagetables that this CPU is running on that could cause the
> pagefault handler to spin, locking and unlocking the shadow lock as it
> goes.   I would expect to see the OOS unsyncs number much higher if that
> was the case though.  
> 
> Can you try the attached patch and see if it makes any difference?

Not really. There are still 4 vcpus waiting at the shadow lock; performance
counters follow (the test ran with 8 vcpus this time; performance was 2% higher
than without the patch, but the measurement uncertainty is of the same order).
I copied only the counters for 1 cpu, as the others are similar. The counters
are smaller than before because performance was lower than with 6 vcpus, and
lock profiling costs about 20% performance under this lock load (the previous
measurement had no lock profiling support).

calls to shadow_alloc               700
number of shadow pages in use      1986
calls to shadow_free                 48
calls to shadow_fault             33508
shadow_fault fast path n/p         3785
shadow_fault fast path error         10
shadow_fault really guest fault   11290
shadow_fault emulates a write      1313
shadow_fault fast emulate           257
shadow_fault fixed fault          16853
calls to shadow_validate_gl2e       604
calls to shadow_validate_gl3e       786
calls to shadow_validate_gl4e       179
calls to shadow_hash_lookup     1079773
shadow hash hit in bucket head  1030265
shadow hash misses                 1258
calls to get_shadow_status      1079233
calls to shadow_hash_insert         700
calls to shadow_hash_delete          48
shadow removes write access         650
shadow writeable: linux high        624
shadow writeable: sl1p             7459
shadow writeable brute-force         26
shadow unshadows for fork/exit        1
shadow unshadows a page               1
shadow walks guest tables        393663
shadow checks gwalk                1251
shadow flush tlb by rem wr perm     439
shadow emulates invlpg             6212
shadow OOS fixup adds              7330
shadow OOS unsyncs                 7328
shadow OOS evictions                175
shadow OOS resyncs                 7459


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Otto-Hahn-Ring 6                        Internet: ts.fujitsu.com
D-81739 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Poor HVM performance with 8 vcpus
  2009-10-07 11:45           ` Juergen Gross
@ 2009-10-07 13:44             ` George Dunlap
       [not found]             ` <de76405a0910070627s7585c587l8753e40d1d2b77b9@mail.gmail.com>
  1 sibling, 0 replies; 29+ messages in thread
From: George Dunlap @ 2009-10-07 13:44 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Ian Pratt, xen-devel@lists.xensource.com, Tim Deegan,
	James Harper

Format strings... are you building this on 64-bit, then?  If so, would
you be willing to help me test a 64-bit-friendly version, if I do the
work of making xenalyze build cleanly on both 32-bit and 64-bit?  I
don't have a 64-bit environment handy, and don't have time to
set one up just for this purpose.
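
As a general illustration of this class of problem (the thread does not show which format strings in xenalyze failed, so this is an assumption about the typical case): printing a uint64_t with a hard-coded length modifier such as "%llu" is correct on one word size and draws a format warning or error on the other. The standard <inttypes.h> macros avoid that. The helper name below is hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a 64-bit trace value portably.
 * Returns the number of characters written (excluding the NUL). */
static int format_u64(char *buf, size_t len, uint64_t value)
{
    assert(buf != NULL && len > 0);
    /* PRIu64 expands to the conversion that matches uint64_t on the
     * platform being compiled for, so the same source builds cleanly
     * on both 32-bit and 64-bit targets. */
    return snprintf(buf, len, "%" PRIu64, value);
}
```

A caller would pass a buffer of at least 21 bytes, which is enough for the largest 64-bit value plus the terminating NUL.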

Sorry for the patches not applying cleanly.  I've now updated the
back-patches, so if you do a pull they should apply.  Let me know if
you have any trouble.

 -George

On Wed, Oct 7, 2009 at 12:45 PM, Juergen Gross
<juergen.gross@ts.fujitsu.com> wrote:
> George Dunlap wrote:
>> Juergen,
>>
>> I think this problem is a good candidate for xentrace/xenalyze.  If
>> you take a 30-second trace (xentrace -D -e all -T 30
>> /tmp/[traceid].trace) while the benchmark is at its heaviest, and then
>> analyze it using xenalyze
>> (http://xenbits.xensource.com/ext/xenalyze.hg), it should show up
>> whether the shadow performance is due to brute-force search or
>> something else.
>>
>> If you're using 3.3, you'll have to apply the back-patch to xenalyze
>> to make it work properly.
>
> Patches don't apply cleanly, build fails with error even without patches due
> to incorrect format strings.
>
>
> Juergen
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Poor HVM performance with 8 vcpus
       [not found]               ` <4ACC9C40.3030503@ts.fujitsu.com>
@ 2009-10-07 14:24                 ` George Dunlap
  2009-10-08  5:00                   ` Juergen Gross
  0 siblings, 1 reply; 29+ messages in thread
From: George Dunlap @ 2009-10-07 14:24 UTC (permalink / raw)
  To: Juergen Gross, xen-devel@lists.xensource.com

Juergen Gross wrote:
> Uh, this is binary data, isn't it?
> I could imagine this is possible by using some kind of format description file
> specific to the producing xen version which is generated during hypervisor
> build.
> But this would require quite a bit of work...
>   
And you'd still need to make code changes for different features based 
on whether a given bit of data was in the format file or not.  Not to 
mention that we already have a sort-of description file 
(xenformat_formats) that doesn't get updated when traces change... 
(although I'm definitely an offender there).

Regarding your trace file... can you send me the file you're having 
trouble with?

Thanks,
 -George

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Poor HVM performance with 8 vcpus
  2009-10-07  6:55 Poor HVM performance with 8 vcpus Juergen Gross
  2009-10-07  7:26 ` Keir Fraser
  2009-10-07  7:56 ` Ian Pratt
@ 2009-10-07 16:37 ` Gianluca Guida
  2009-10-08  7:10   ` Juergen Gross
  2 siblings, 1 reply; 29+ messages in thread
From: Gianluca Guida @ 2009-10-07 16:37 UTC (permalink / raw)
  To: Juergen Gross; +Cc: xen-devel@lists.xensource.com

Hi,

On Wed, Oct 7, 2009 at 8:55 AM, Juergen Gross
<juergen.gross@ts.fujitsu.com> wrote:
> we've got massive performance problems running a 8 vcpu HVM-guest (BS2000)
> under XEN (xen 3.3.1).
>
> With a specific benchmark producing a rather high load on memory management
> operations (lots of process creation/deletion and memory allocation) the 8
> vcpu performance was worse than the 4 vcpu performance. On other platforms
> (/390, MIPS, SPARC) this benchmark scaled rather well with the number of cpus.
>
> The result of the usage of the software performance counters of XEN seemed
> to point to the shadow lock being the reason. I modified the Hypervisor to
> gather some lock statistics (patch will be sent soon) and found that the
> shadow lock is really the bottleneck. On average 4 vcpus are waiting to get
> the lock!
>
> Is this a known issue?

Actually, I think so. The OOS optimization is widely known not to
scale well at 8 vcpus in its current state, since its weak point is
that the CR3 switching time increases linearly with the number of cpus.
If you have a lot of process switches together with a lot of PTE writes
(as seems to be the case for your benchmark) then that's probably
the cause.

Could you try disabling the OOS optimization from the
SHADOW_OPTIMIZATIONS definition?
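
As a sketch of what such a compile-time change amounts to: the patch context earlier in the thread shows code guarded by "(SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB)", so the optimizations are evidently selected by a bitmask. The bit values and the TOY_ names below are illustrative assumptions, since the real definitions are not shown in this thread.

```c
#include <assert.h>

/* Illustrative values only: the names carry a TOY_ prefix because the
 * real bit assignments in the Xen tree are not shown in this thread. */
#define TOY_SHOPT_VIRTUAL_TLB   0x040u
#define TOY_SHOPT_OUT_OF_SYNC   0x100u
#define TOY_SHOPT_ALL           0x1ffu

/* Same idea as editing the definition by hand: all bits except OOS. */
#define TOY_SHADOW_OPTIMIZATIONS (TOY_SHOPT_ALL & ~TOY_SHOPT_OUT_OF_SYNC)

/* Returns nonzero if the given optimization bit is compiled in. */
static int toy_opt_enabled(unsigned bit)
{
    return (TOY_SHADOW_OPTIMIZATIONS & bit) != 0;
}
```

Because the mask is a compile-time constant, code guarded by the cleared bit disappears from the build entirely, which is why this is a rebuild-and-test experiment rather than a runtime switch.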

Thanks,
Gianluca

> Is there a chance to split the shadow lock into sub-locks or to use a
> reader/writer lock instead?
> I just wanted to ask before trying to understand all of the shadow code :-)
>
>
> Juergen
>



-- 
It was a type of people I did not know, I found them very strange and
they did not inspire confidence at all. Later I learned that I had been
introduced to electronic engineers.
                                                  E. W. Dijkstra

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Poor HVM performance with 8 vcpus
  2009-10-07 14:24                 ` George Dunlap
@ 2009-10-08  5:00                   ` Juergen Gross
  0 siblings, 0 replies; 29+ messages in thread
From: Juergen Gross @ 2009-10-08  5:00 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel@lists.xensource.com

George Dunlap wrote:
> Juergen Gross wrote:
>> Uh, this is binary data, isn't it?
>> I could imagine this is possible by using some kind of format
>> description file
>> specific to the producing xen version which is generated during
>> hypervisor
>> build.
>> But this would require quite a bit of work...
>>   
> And you'd still need to make code changes for different features based
> on whether a given bit of data was in the format file or not.  Not to
> mention that we already have a sort-of description file
> (xenformat_formats) that doesn't get updated when traces change...
> (although I'm definitely an offender there).

This cries for usage of XML ;-)

> 
> Regarding your trace file... can you send me the file you're having
> trouble with?

590 MB per mail isn't a good idea...
Do you have a server where I could upload it?


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Otto-Hahn-Ring 6                        Internet: ts.fujitsu.com
D-81739 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Poor HVM performance with 8 vcpus
  2009-10-07 16:37 ` Gianluca Guida
@ 2009-10-08  7:10   ` Juergen Gross
  2009-10-14  8:16     ` Juergen Gross
  0 siblings, 1 reply; 29+ messages in thread
From: Juergen Gross @ 2009-10-08  7:10 UTC (permalink / raw)
  To: Gianluca Guida; +Cc: xen-devel@lists.xensource.com

Hi,

Gianluca Guida wrote:
> Hi,
> 
> On Wed, Oct 7, 2009 at 8:55 AM, Juergen Gross
> <juergen.gross@ts.fujitsu.com> wrote:
>> we've got massive performance problems running a 8 vcpu HVM-guest (BS2000)
>> under XEN (xen 3.3.1).
>>
>> With a specific benchmark producing a rather high load on memory management
>> operations (lots of process creation/deletion and memory allocation) the 8
>> vcpu performance was worse than the 4 vcpu performance. On other platforms
>> (/390, MIPS, SPARC) this benchmark scaled rather well with the number of cpus.
>>
>> The result of the usage of the software performance counters of XEN seemed
>> to point to the shadow lock being the reason. I modified the Hypervisor to
>> gather some lock statistics (patch will be sent soon) and found that the
>> shadow lock is really the bottleneck. On average 4 vcpus are waiting to get
>> the lock!
>>
>> Is this a known issue?
> 
> Actually, I think so. The OOS optimization is widely known not to
> scale well at 8 vcpus in its current state, since its weak point is
> that the CR3 switching time increases linearly with the number of cpus.
> If you have a lot of process switches together with a lot of PTE writes
> (as seems to be the case for your benchmark) then that's probably
> the cause.
> 
> Could you try disabling the OOS optimization from the
> SHADOW_OPTIMIZATIONS definition?

Great!
First performance data looks okay!
We will have to run different benchmarks in different configurations, but I
think you gave an excellent hint. :-)


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Otto-Hahn-Ring 6                        Internet: ts.fujitsu.com
D-81739 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Poor HVM performance with 8 vcpus
  2009-10-08  7:10   ` Juergen Gross
@ 2009-10-14  8:16     ` Juergen Gross
  2009-10-14  8:35       ` Keir Fraser
                         ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Juergen Gross @ 2009-10-14  8:16 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Gianluca Guida, xen-devel@lists.xensource.com

Gianluca,

as the performance of BS2000 seems to be hit by OOS optimization, I'm
thinking of making a patch to disable this feature by a domain parameter.

Is there a way to do this without having to change all places where the
#if statements are placed?
I think there should be some central routines where adding an "if" could
be enough (setting oos_active to 0 seems not to be enough, I fear).

Do you have any hint?


Juergen

Juergen Gross wrote:
> Hi,
> 
> Gianluca Guida wrote:
>> Hi,
>>
>> On Wed, Oct 7, 2009 at 8:55 AM, Juergen Gross
>> <juergen.gross@ts.fujitsu.com> wrote:
>>> we've got massive performance problems running a 8 vcpu HVM-guest (BS2000)
>>> under XEN (xen 3.3.1).
>>>
>>> With a specific benchmark producing a rather high load on memory management
>>> operations (lots of process creation/deletion and memory allocation) the 8
>>> vcpu performance was worse than the 4 vcpu performance. On other platforms
>>> (/390, MIPS, SPARC) this benchmark scaled rather well with the number of cpus.
>>>
>>> The result of the usage of the software performance counters of XEN seemed
>>> to point to the shadow lock being the reason. I modified the Hypervisor to
>>> gather some lock statistics (patch will be sent soon) and found that the
>>> shadow lock is really the bottleneck. On average 4 vcpus are waiting to get
>>> the lock!
>>>
>>> Is this a known issue?
>> Actually, I think so. The OOS optimization is widely known not to
>> scale well at 8 vcpus in its current state, since its weak point is
>> that the CR3 switching time increases linearly with the number of cpus.
>> If you have a lot of process switches together with a lot of PTE writes
>> (as seems to be the case for your benchmark) then that's probably
>> the cause.
>>
>> Could you try disabling the OOS optimization from the
>> SHADOW_OPTIMIZATIONS definition?
> 
> Great!
> First performance data looks okay!
> We will have to run different benchmarks in different configurations, but I
> think you gave an excellent hint. :-)


-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Otto-Hahn-Ring 6                        Internet: ts.fujitsu.com
D-81739 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Poor HVM performance with 8 vcpus
  2009-10-14  8:16     ` Juergen Gross
@ 2009-10-14  8:35       ` Keir Fraser
  2009-10-14  9:11         ` Juergen Gross
  2009-10-14 10:16         ` Gianluca Guida
  2009-10-14  8:41       ` Tim Deegan
  2009-10-14 11:35       ` Gianluca Guida
  2 siblings, 2 replies; 29+ messages in thread
From: Keir Fraser @ 2009-10-14  8:35 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Gianluca Guida, xen-devel@lists.xensource.com

On 14/10/2009 09:16, "Juergen Gross" <juergen.gross@ts.fujitsu.com> wrote:

> as the performance of BS2000 seems to be hit by OOS optimization, I'm
> thinking of making a patch to disable this feature by a domain parameter.
> 
> Is there a way to do this without having to change all places where the
> #if statements are placed?
> I think there should be some central routines where adding an "if" could
> be enough (setting oos_active to 0 seems not to be enough, I fear).
> 
> Do you have any hint?

How about disabling it for domains with more than four VCPUs? Have you
measured performance with OOS for 1-4 VCPU guests? This is perhaps not
something that needs to be baked into guest configs.

 -- Keir

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Poor HVM performance with 8 vcpus
  2009-10-14  8:16     ` Juergen Gross
  2009-10-14  8:35       ` Keir Fraser
@ 2009-10-14  8:41       ` Tim Deegan
  2009-10-14  9:17         ` Juergen Gross
  2009-10-14 11:35       ` Gianluca Guida
  2 siblings, 1 reply; 29+ messages in thread
From: Tim Deegan @ 2009-10-14  8:41 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Gianluca Guida, xen-devel@lists.xensource.com

At 09:16 +0100 on 14 Oct (1255511785), Juergen Gross wrote:
> as the performance of BS2000 seems to be hit by OOS optimization, I'm
> thinking of making a patch to disable this feature by a domain parameter.
> 
> Is there a way to do this without having to change all places where the
> #if statements are placed?
> I think there should be some central routines where adding an "if" could
> be enough (setting oos_active to 0 seems not to be enough, I fear).
> 
> Do you have any hint?

The simplest way is to cause sh_unsync() to immediately return 0.  That
won't be quite as fast as #defining it all away but will avoid the
expensive paths that cause lock contention.  You can add your flag to
the big if statement that's already there to avoid unsafe cases.
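
A minimal sketch of that suggestion, under loud assumptions: the real sh_unsync() signature, its existing safety checks and the domain structure are not shown in this thread, so everything below is a toy stand-in with made-up names. It only illustrates adding a per-domain flag to the condition that already makes the function return 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; the real structures and the real sh_unsync() signature
 * are not shown in this thread. */
struct toy_domain {
    bool oos_disabled;   /* hypothetical per-domain switch */
};

struct toy_page {
    bool safe_to_unsync; /* stands in for the existing safety checks */
    bool out_of_sync;
};

/* Returns 1 if the page was allowed to go out of sync, 0 otherwise. */
static int toy_sh_unsync(const struct toy_domain *d, struct toy_page *pg)
{
    /* The new flag joins the existing "is this safe?" condition, so a
     * disabled domain bails out before any expensive work. */
    if (d->oos_disabled || !pg->safe_to_unsync)
        return 0;

    pg->out_of_sync = true;
    return 1;
}
```

The trade-off is the one described above: the check costs a branch at run time instead of compiling the feature away, but the decision can then be made per domain.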

Incidentally, although your benchmark does poorly on 8 VCPUs it might be
worth trying a less aggressively targeted benchmark -- we found on
Windows VMs that more realistic tests (e.g. Sysmark) still showed a
slight improvement from the OOS optimization at 8 vcpus.

Cheers,

Tim.

> Juergen Gross wrote:
> > Hi,
> > 
> > Gianluca Guida wrote:
> >> Hi,
> >>
> >> On Wed, Oct 7, 2009 at 8:55 AM, Juergen Gross
> >> <juergen.gross@ts.fujitsu.com> wrote:
> >>> we've got massive performance problems running a 8 vcpu HVM-guest (BS2000)
> >>> under XEN (xen 3.3.1).
> >>>
> >>> With a specific benchmark producing a rather high load on memory management
> >>> operations (lots of process creation/deletion and memory allocation) the 8
> >>> vcpu performance was worse than the 4 vcpu performance. On other platforms
> >>> (/390, MIPS, SPARC) this benchmark scaled rather well with the number of cpus.
> >>>
> >>> The result of the usage of the software performance counters of XEN seemed
> >>> to point to the shadow lock being the reason. I modified the Hypervisor to
> >>> gather some lock statistics (patch will be sent soon) and found that the
> >>> shadow lock is really the bottleneck. On average 4 vcpus are waiting to get
> >>> the lock!
> >>>
> >>> Is this a known issue?
> >> Actually, I think so. The OOS optimization is widely known not to
> >> scale well at 8 vcpus in its current state, since its weak point is
> >> that the CR3 switching time increases linearly with the number of cpus.
> >> If you have a lot of process switches together with a lot of PTE writes
> >> (as seems to be the case for your benchmark) then that's probably
> >> the cause.
> >>
> >> Could you try disabling the OOS optimization from the
> >> SHADOW_OPTIMIZATIONS definition?
> > 
> > Great!
> > First performance data looks okay!
> > We will have to run different benchmarks in different configurations, but I
> > think you gave an excellent hint. :-)
> 
> 

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Poor HVM performance with 8 vcpus
  2009-10-14  8:35       ` Keir Fraser
@ 2009-10-14  9:11         ` Juergen Gross
  2009-10-14 10:16         ` Gianluca Guida
  1 sibling, 0 replies; 29+ messages in thread
From: Juergen Gross @ 2009-10-14  9:11 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Gianluca Guida, xen-devel@lists.xensource.com

Keir Fraser wrote:
> On 14/10/2009 09:16, "Juergen Gross" <juergen.gross@ts.fujitsu.com> wrote:
> 
>> as the performance of BS2000 seems to be hit by OOS optimization, I'm
>> thinking of making a patch to disable this feature by a domain parameter.
>>
>> Is there a way to do this without having to change all places where the
>> #if statements are placed?
>> I think there should be some central routines where adding an "if" could
>> be enough (setting oos_active to 0 seems not to be enough, I fear).
>>
>> Do you have any hint?
> 
> How about disabling it for domains with more than four VCPUs? Have you
> measured performance with OOS for 1-4 VCPU guests? This is perhaps not
> something that needs to be baked into guest configs.

The same benchmark with 4 vcpus showed an improvement of about 6% with OOS
disabled. A 1 vcpu BS2000 showed no change in performance.

And as Tim writes: there are systems with more than 4 vcpus which are still
faster with OOS optimization active.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Otto-Hahn-Ring 6                        Internet: ts.fujitsu.com
D-81739 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Poor HVM performance with 8 vcpus
  2009-10-14  8:41       ` Tim Deegan
@ 2009-10-14  9:17         ` Juergen Gross
  0 siblings, 0 replies; 29+ messages in thread
From: Juergen Gross @ 2009-10-14  9:17 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Gianluca Guida, xen-devel@lists.xensource.com

Tim Deegan wrote:
> At 09:16 +0100 on 14 Oct (1255511785), Juergen Gross wrote:
>> as the performance of BS2000 seems to be hit by OOS optimization, I'm
>> thinking of making a patch to disable this feature by a domain parameter.
>>
>> Is there a way to do this without having to change all places where the
>> #if statements are placed?
>> I think there should be some central routines where adding an "if" could
>> be enough (setting oos_active to 0 seems not to be enough, I fear).
>>
>> Do you have any hint?
> 
> The simplest way is to cause sh_unsync() to immediately return 0.  That
> won't be quite as fast as #defining it all away but will avoid the
> expensive paths that cause lock contention.  You can add your flag to
> the big if statement that's already there to avoid unsafe cases.

Thanks, I'll try.

> 
> Incidentally, although your benchmark does poorly on 8 VCPUs it might be
> worth trying a less aggressively targeted benchmark -- we found on
> Windows VMs that more realistic tests (e.g. Sysmark) still showed a
> slight improvement from the OOS optimization at 8 vcpus.

The benchmark is designed to be a realistic simulation of a typical customer
batch load. Many BS2000 customers would see similar performance effects.


Juergen



* Re: Poor HVM performance with 8 vcpus
  2009-10-14  8:35       ` Keir Fraser
  2009-10-14  9:11         ` Juergen Gross
@ 2009-10-14 10:16         ` Gianluca Guida
  2009-10-14 10:44           ` Juergen Gross
  1 sibling, 1 reply; 29+ messages in thread
From: Gianluca Guida @ 2009-10-14 10:16 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Tim Deegan, Juergen Gross, xen-devel@lists.xensource.com

Ah, those good old OOS talks. I fear I am going to fail in my attempt
to be laconic.

On Wed, Oct 14, 2009 at 10:35 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 14/10/2009 09:16, "Juergen Gross" <juergen.gross@ts.fujitsu.com> wrote:
>
>> as the performance of BS2000 seems to be hit by OOS optimization, I'm
>> thinking of making a patch to disable this feature by a domain parameter.
>>
>> Is there a way to do this without having to change all places where the
>> #if statements are placed?
>> I think there should be some central routines where adding an "if" could
>> be enough (setting oos_active to 0 seems not to be enough, I fear).
>>
>> Do you have any hint?
>
> How about disabling it for domains with more than four VCPUs? Have you
> measured performance with OOS for 1-4 VCPU guests? This is perhaps not
> something that needs to be baked into guest configs.

In general, shadow code loses performance as the number of vcpus
increases (>=4) because of the single shadow lock. Getting rid of the
shadow lock, i.e. having per-vcpu shadows, wouldn't help, since it
would make the most common operation, removing writable access to
guest pages, much slower.
But the two algorithms (always in-sync vs. OOS) show their
performance penalties in two different areas. In a scenario where
guests do a lot of PTE writes (read: Windows in most of its operations)
the in-sync approach is penalized more, because emulation is
slow and needs the shadow lock. Scenarios where guests tend to
have many dirty CR3 switches (that is, CR3 switches with freshly
written PTEs, as with Juergen's benchmark and the famous
Windows parallel DDK build) are penalized more by the OOS
algorithm.

Disabling OOS for domains with more than 4 vcpus might be a good idea,
but it is not necessarily optimal. Furthermore, I always understood that
good practice for VM performance is to have many small VMs instead of
one VM eating all of the host's CPUs, at least when shadow code is on.
With big VMs, EPT/NPT has always been the best approach, since even with
a lot of TLB misses, the system was definitely lock-free for most of the
VM's life.

Creating a per-domain switch should be a good idea, but a more generic
(and correct) approach would be to have a dynamic policy for OOSing
pages, in which we would stop putting pages OOS when we realize that
we are resyncing too many pages on CR3 switches. This was taken into
consideration during the development of OOS, but it was finally
discarded because performance was decent and big VMs were not in the
range of interest.
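
To make the dynamic policy a bit more concrete, here is a rough sketch
under stated assumptions. Nothing like this exists in the shadow code as
discussed here; the `oos_policy` struct, the fixed window of CR3 switches
and the threshold are invented for illustration, and picking good values
would be the hard part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy state; not taken from the Xen sources. */
struct oos_policy {
    unsigned int window;        /* CR3 switches per evaluation window */
    unsigned int resync_limit;  /* tolerated average resyncs per switch */
    unsigned int switches;      /* CR3 switches seen in this window */
    unsigned int resynced;      /* pages resynced in this window */
    bool allow_oos;             /* current decision */
};

void policy_init(struct oos_policy *p, unsigned int window,
                 unsigned int resync_limit)
{
    assert(p != NULL && window > 0);
    p->window = window;
    p->resync_limit = resync_limit;
    p->switches = 0;
    p->resynced = 0;
    p->allow_oos = true;        /* start with the optimization enabled */
}

/* Call once per CR3 switch with the number of pages that were resynced. */
void policy_note_cr3_switch(struct oos_policy *p, unsigned int pages)
{
    assert(p != NULL);
    p->switches++;
    p->resynced += pages;

    if (p->switches == p->window) {
        /* Too many resyncs on average: stop letting pages go OOS. */
        p->allow_oos = p->resynced <= p->resync_limit * p->window;
        p->switches = 0;
        p->resynced = 0;
    }
}
```

Note that once OOS is switched off nothing gets resynced, so the next
window turns it back on. In this sketch that acts as a periodic probe; a
real policy would probably want some hysteresis to avoid flapping.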

Yes, definitely away from spartan wit. But I hope this clarifies the issue.

Thanks,
Gianluca

-- 
It was a type of people I did not know, I found them very strange and
they did not inspire confidence at all. Later I learned that I had been
introduced to electronic engineers.
                                                  E. W. Dijkstra


* Re: Poor HVM performance with 8 vcpus
  2009-10-14 10:16         ` Gianluca Guida
@ 2009-10-14 10:44           ` Juergen Gross
  2009-10-14 10:49             ` Keir Fraser
  0 siblings, 1 reply; 29+ messages in thread
From: Juergen Gross @ 2009-10-14 10:44 UTC (permalink / raw)
  To: Gianluca Guida; +Cc: Tim Deegan, xen-devel@lists.xensource.com, Keir Fraser

Gianluca Guida wrote:
> Ah, those good old OOS talks. I fear I am going to fail in my attempt
> to be laconic.

:-)

> 
> On Wed, Oct 14, 2009 at 10:35 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
>> On 14/10/2009 09:16, "Juergen Gross" <juergen.gross@ts.fujitsu.com> wrote:
>>
>>> as the performance of BS2000 seems to be hit by OOS optimization, I'm
>>> thinking of making a patch to disable this feature by a domain parameter.
>>>
>>> Is there a way to do this without having to change all places where the
>>> #if statements are placed?
>>> I think there should be some central routines where adding an "if" could
>>> be enough (setting oos_active to 0 seems not to be enough, I fear).
>>>
>>> Do you have any hint?
>> How about disabling it for domains with more than four VCPUs? Have you
>> measured performance with OOS for 1-4 VCPU guests? This is perhaps not
>> something that needs to be baked into guest configs.
> 
> In general, shadow code loses performance as the number of vcpus
> increases (>=4) because of the single shadow lock. Getting rid of the
> shadow lock, i.e. having per-vcpu shadows, wouldn't help, since it
> would make the most common operation, removing writable access to
> guest pages, much slower.
> But the two algorithms (always in-sync vs. OOS) show their
> performance penalties in two different areas. In a scenario where
> guests do a lot of PTE writes (read: Windows in most of its operations)
> the in-sync approach is penalized more, because emulation is
> slow and needs the shadow lock. Scenarios where guests tend to
> have many dirty CR3 switches (that is, CR3 switches with freshly
> written PTEs, as with Juergen's benchmark and the famous
> Windows parallel DDK build) are penalized more by the OOS
> algorithm.
> 
> Disabling OOS for domains with more than 4 vcpus might be a good idea,
> but it is not necessarily optimal. Furthermore, I always understood that
> good practice for VM performance is to have many small VMs instead of
> one VM eating all of the host's CPUs, at least when shadow code is on.
> With big VMs, EPT/NPT has always been the best approach, since even with
> a lot of TLB misses, the system was definitely lock-free for most of the
> VM's life.
> 
> Creating a per-domain switch should be a good idea, but a more generic
> (and correct) approach would be to have a dynamic policy for OOSing
> pages, in which we would stop putting pages OOS when we realize that
> we are resyncing too many pages on CR3 switches. This was taken into
> consideration during the development of OOS, but it was finally
> discarded because performance was decent and big VMs were not in the
> range of interest.
> 
> Yes, definitely away from spartan wit. But I hope this clarifies the issue.

It really does.

I think I'll start with a per-domain switch and leave the generic approach
to the specialists. ;-)

If, however, Keir rejects such a switch, I could try the generic solution,
but I think this solution would need very much work to find the correct
parameters.


Juergen



* Re: Poor HVM performance with 8 vcpus
  2009-10-14 10:44           ` Juergen Gross
@ 2009-10-14 10:49             ` Keir Fraser
  0 siblings, 0 replies; 29+ messages in thread
From: Keir Fraser @ 2009-10-14 10:49 UTC (permalink / raw)
  To: Juergen Gross, Gianluca Guida; +Cc: Tim Deegan, xen-devel@lists.xensource.com

On 14/10/2009 11:44, "Juergen Gross" <juergen.gross@ts.fujitsu.com> wrote:

>> Yes, definitely away from spartan wit. But I hope this clarifies the issue.
> 
> It really does.
> 
> I think I'll start with a per-domain switch and leave the generic approach
> to the specialists. ;-)
> 
> If, however, Keir rejects such a switch, I could try the generic solution,
> but I think this solution would need very much work to find the correct
> parameters.

Obviously, punting the question to the user or admin is much better. ;-)

A per-domain switch is fine with me.

 -- Keir


* Re: Poor HVM performance with 8 vcpus
  2009-10-14  8:16     ` Juergen Gross
  2009-10-14  8:35       ` Keir Fraser
  2009-10-14  8:41       ` Tim Deegan
@ 2009-10-14 11:35       ` Gianluca Guida
  2009-10-14 11:43         ` Juergen Gross
  2 siblings, 1 reply; 29+ messages in thread
From: Gianluca Guida @ 2009-10-14 11:35 UTC (permalink / raw)
  To: Juergen Gross; +Cc: xen-devel@lists.xensource.com

On Wed, Oct 14, 2009 at 9:16 AM, Juergen Gross
<juergen.gross@ts.fujitsu.com> wrote:
> Gianluca,
>
> as the performance of BS2000 seems to be hit by OOS optimization, I'm
> thinking of making a patch to disable this feature by a domain parameter.
>
> Is there a way to do this without having to change all places where the
> #if statements are placed?
> I think there should be some central routines where adding an "if" could
> be enough (setting oos_active to 0 seems not to be enough, I fear).

It should be. Setting oos_active to zero (if you prevent the shadow code
from setting it to 1 when we enable paging) will prevent any page from
going OOS. This is exactly what Tim's suggestion was.

Have you tried it?
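
As a rough illustration only (a model, not the actual shadow code): the
idea is that the per-vcpu oos_active flag is decided once, at the point
where paging is enabled, and everything else just reads it. The `sketch_`
types and functions and the `oos_off` switch are hypothetical names; the
vcpu-count default mirrors Keir's "more than four VCPUs" suggestion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the domain and vcpu state. */
struct sketch_domain {
    bool oos_off;                 /* assumed per-domain switch */
};

struct sketch_vcpu {
    const struct sketch_domain *domain;
    bool oos_active;              /* mirrors the flag discussed above */
};

/* Possible default for the switch: off for guests with > 4 vcpus. */
bool sketch_default_oos_off(unsigned int nr_vcpus)
{
    return nr_vcpus > 4;
}

/* Models the point where paging is enabled and oos_active would be set. */
void sketch_enable_paging(struct sketch_vcpu *v)
{
    assert(v != NULL && v->domain != NULL);
    /* Without the switch this would unconditionally become true. */
    v->oos_active = !v->domain->oos_off;
}

/* With oos_active left at zero, no page may go out of sync. */
bool sketch_may_unsync(const struct sketch_vcpu *v)
{
    assert(v != NULL);
    return v->oos_active;
}
```

Whether the real code has other places that set or assume oos_active is
exactly the kind of thing that would need checking before relying on this.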

Thanks,
Gianluca



* Re: Poor HVM performance with 8 vcpus
  2009-10-14 11:35       ` Gianluca Guida
@ 2009-10-14 11:43         ` Juergen Gross
  0 siblings, 0 replies; 29+ messages in thread
From: Juergen Gross @ 2009-10-14 11:43 UTC (permalink / raw)
  To: Gianluca Guida; +Cc: xen-devel@lists.xensource.com

Gianluca Guida wrote:
> On Wed, Oct 14, 2009 at 9:16 AM, Juergen Gross
> <juergen.gross@ts.fujitsu.com> wrote:
>> Gianluca,
>>
>> as the performance of BS2000 seems to be hit by OOS optimization, I'm
>> thinking of making a patch to disable this feature by a domain parameter.
>>
>> Is there a way to do this without having to change all places where the
>> #if statements are placed?
>> I think there should be some central routines where adding an "if" could
>> be enough (setting oos_active to 0 seems not to be enough, I fear).
> 
> It should be. Setting oos_active to zero (if you prevent the shadow code
> from setting it to 1 when we enable paging) will prevent any page from
> going OOS. This is exactly what Tim's suggestion was.
> 
> Have you tried it?

Not yet. I wanted to be sure, as I suspected some inconsistencies could be
hiding without showing up at once.


Juergen



Thread overview: 29+ messages
2009-10-07  6:55 Poor HVM performance with 8 vcpus Juergen Gross
2009-10-07  7:26 ` Keir Fraser
2009-10-07  7:49   ` Juergen Gross
2009-10-07  7:56 ` Ian Pratt
2009-10-07  8:08   ` James Harper
2009-10-07  8:13     ` Ian Pratt
2009-10-07  8:31       ` Juergen Gross
2009-10-07  8:17     ` Keir Fraser
2009-10-07  9:12     ` Tim Deegan
2009-10-07  9:40       ` Juergen Gross
2009-10-07 10:11         ` George Dunlap
2009-10-07 11:45           ` Juergen Gross
2009-10-07 13:44             ` George Dunlap
     [not found]             ` <de76405a0910070627s7585c587l8753e40d1d2b77b9@mail.gmail.com>
     [not found]               ` <4ACC9C40.3030503@ts.fujitsu.com>
2009-10-07 14:24                 ` George Dunlap
2009-10-08  5:00                   ` Juergen Gross
2009-10-07 10:14         ` Tim Deegan
2009-10-07 12:32           ` Juergen Gross
2009-10-07 16:37 ` Gianluca Guida
2009-10-08  7:10   ` Juergen Gross
2009-10-14  8:16     ` Juergen Gross
2009-10-14  8:35       ` Keir Fraser
2009-10-14  9:11         ` Juergen Gross
2009-10-14 10:16         ` Gianluca Guida
2009-10-14 10:44           ` Juergen Gross
2009-10-14 10:49             ` Keir Fraser
2009-10-14  8:41       ` Tim Deegan
2009-10-14  9:17         ` Juergen Gross
2009-10-14 11:35       ` Gianluca Guida
2009-10-14 11:43         ` Juergen Gross
