Poor SMP performance pv_ops domU

* Poor SMP performance pv_ops domU
@ 2010-05-18 17:34 John Morrison
  2010-05-18 18:38 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 5+ messages in thread
From: John Morrison @ 2010-05-18 17:34 UTC (permalink / raw)
  To: xen-devel

Hi,

Over the last year we have tried many times to get acceptable performance from pv_ops kernels.

Tests were run with 1, 2, 4 and 8 cores. The more cores, the lower the score.

Inside the domU all cores are visible, and top -s shows all cores in use.
xentop in dom0 never shows over 99% CPU.

The 2.6.18.8-xenU kernel shows over 700% CPU, and its scores are about 8x the pv_ops scores.

Any ideas?


John


1 core

BYTE UNIX Benchmarks (Version 4.1-wht.2)
System -- Linux test 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
/dev/xvda1           141110136   1066476 132875660   1% /

Start Benchmark Run: Tue May 18 13:54:54 BST 2010
 13:54:54 up 0 min,  1 user,  load average: 0.00, 0.00, 0.00

End Benchmark Run: Tue May 18 14:06:12 BST 2010
 14:06:12 up 11 min,  2 users,  load average: 11.48, 5.20, 2.43


                     INDEX VALUES
TEST                                        BASELINE     RESULT      INDEX

Dhrystone 2 using register variables        376783.7  8950813.0      237.6
Double-Precision Whetstone                      83.1     2103.7      253.2
Execl Throughput                               188.3     1568.4       83.3
File Copy 1024 bufsize 2000 maxblocks         2672.0    64198.0      240.3
File Copy 256 bufsize 500 maxblocks           1077.0    17781.0      165.1
File Read 4096 bufsize 8000 maxblocks        15382.0   643717.0      418.5
Pipe-based Context Switching                 15448.6    85379.4       55.3
Pipe Throughput                             111814.6   478490.1       42.8
Process Creation                               569.3     3329.6       58.5
Shell Scripts (8 concurrent)                    44.8      380.7       85.0
System Call Overhead                        114433.5   498712.3       43.6
                                                                 =========
     FINAL SCORE                                                     114.1
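
As a cross-check of how the table columns relate, here is a minimal sketch. It assumes each INDEX is 10 x RESULT / BASELINE and that FINAL SCORE is the geometric mean of the indexes; neither formula is stated in the thread, but both reproduce the 1-core numbers above to within rounding. The names `ONE_CORE`, `index` and `final_score` are illustrative only.

```python
import math

# (baseline, result) pairs copied from the 1-core table above.
ONE_CORE = {
    "Dhrystone 2 using register variables": (376783.7, 8950813.0),
    "Double-Precision Whetstone": (83.1, 2103.7),
    "Execl Throughput": (188.3, 1568.4),
    "File Copy 1024 bufsize 2000 maxblocks": (2672.0, 64198.0),
    "File Copy 256 bufsize 500 maxblocks": (1077.0, 17781.0),
    "File Read 4096 bufsize 8000 maxblocks": (15382.0, 643717.0),
    "Pipe-based Context Switching": (15448.6, 85379.4),
    "Pipe Throughput": (111814.6, 478490.1),
    "Process Creation": (569.3, 3329.6),
    "Shell Scripts (8 concurrent)": (44.8, 380.7),
    "System Call Overhead": (114433.5, 498712.3),
}


def index(baseline, result):
    """Assumed index formula: ten times the result-to-baseline ratio."""
    return 10.0 * result / baseline


def final_score(rows):
    """Assumed final score: geometric mean of the per-test indexes."""
    indexes = [index(b, r) for b, r in rows.values()]
    return math.exp(sum(math.log(i) for i in indexes) / len(indexes))


if __name__ == "__main__":
    for name, (baseline, result) in ONE_CORE.items():
        print(f"{name:45s} {index(baseline, result):8.1f}")
    print(f"{'FINAL SCORE':45s} {final_score(ONE_CORE):8.1f}")
```

Because the score is a geometric mean, a large drop in a single sub-test moves the final score only modestly, so the per-test indexes are more informative than the totals.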

2-cores

==============================================================
BYTE UNIX Benchmarks (Version 4.1-wht.2)
System -- Linux test 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
/dev/xvda1           141110136   1066548 132875588   1% /

Start Benchmark Run: Tue May 18 14:07:27 BST 2010
 14:07:27 up 0 min,  1 user,  load average: 0.00, 0.00, 0.00

End Benchmark Run: Tue May 18 14:18:04 BST 2010
 14:18:04 up 10 min,  1 user,  load average: 12.78, 5.53, 2.49


                     INDEX VALUES
TEST                                        BASELINE     RESULT      INDEX

Dhrystone 2 using register variables        376783.7 10124838.6      268.7
Double-Precision Whetstone                      83.1     1188.7      143.0
Execl Throughput                               188.3     1596.2       84.8
File Copy 1024 bufsize 2000 maxblocks         2672.0    58323.0      218.3
File Copy 256 bufsize 500 maxblocks           1077.0    17776.0      165.1
File Read 4096 bufsize 8000 maxblocks        15382.0   568217.0      369.4
Pipe-based Context Switching                 15448.6    86111.3       55.7
Pipe Throughput                             111814.6   469957.8       42.0
Process Creation                               569.3     3298.1       57.9
Shell Scripts (8 concurrent)                    44.8      378.9       84.6
System Call Overhead                        114433.5   532828.4       46.6
                                                                 =========
     FINAL SCORE                                                     107.9

4-cores

==============================================================
BYTE UNIX Benchmarks (Version 4.1-wht.2)
System -- Linux test 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
/dev/xvda1           141110136   1066628 132875508   1% /

Start Benchmark Run: Tue May 18 14:19:17 BST 2010
 14:19:17 up 0 min,  1 user,  load average: 0.00, 0.00, 0.00

End Benchmark Run: Tue May 18 14:29:53 BST 2010
 14:29:53 up 10 min,  1 user,  load average: 13.59, 6.35, 2.97


                     INDEX VALUES
TEST                                        BASELINE     RESULT      INDEX

Dhrystone 2 using register variables        376783.7 10185429.8      270.3
Double-Precision Whetstone                      83.1      759.8       91.4
Execl Throughput                               188.3     1386.2       73.6
File Copy 1024 bufsize 2000 maxblocks         2672.0    62331.0      233.3
File Copy 256 bufsize 500 maxblocks           1077.0    16492.0      153.1
File Read 4096 bufsize 8000 maxblocks        15382.0   563402.0      366.3
Pipe-based Context Switching                 15448.6    87176.0       56.4
Pipe Throughput                             111814.6   481068.1       43.0
Process Creation                               569.3     3128.9       55.0
Shell Scripts (8 concurrent)                    44.8      394.9       88.1
System Call Overhead                        114433.5   539996.1       47.2
                                                                 =========
     FINAL SCORE                                                     102.6

8-cores

==============================================================
BYTE UNIX Benchmarks (Version 4.1-wht.2, 8 threads)
System -- Linux test 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux
/dev/xvda1           141110136   1066680 132875456   1% /

Start Benchmark Run: Tue May 18 14:30:59 BST 2010
 14:30:59 up 0 min,  1 user,  load average: 0.07, 0.02, 0.00

End Benchmark Run: Tue May 18 14:42:52 BST 2010
 14:42:52 up 12 min,  1 user,  load average: 25.56, 10.84, 4.96


                     INDEX VALUES
TEST                                        BASELINE     RESULT      INDEX

Dhrystone 2 using register variables        376783.7  9972130.3      264.7
Double-Precision Whetstone                      83.1      755.2       90.9
Execl Throughput                               188.3     1584.7       84.2
File Copy 1024 bufsize 2000 maxblocks         2672.0    58981.0      220.7
File Copy 256 bufsize 500 maxblocks           1077.0    16904.0      157.0
File Read 4096 bufsize 8000 maxblocks        15382.0   557735.0      362.6
Pipe-based Context Switching                 15448.6    80738.2       52.3
Pipe Throughput                             111814.6   450891.2       40.3
Process Creation                               569.3     2948.5       51.8
Shell Scripts (8 concurrent)                    44.8      378.1       84.4
System Call Overhead                        114433.5   537443.2       47.0
                                                                 =========
     FINAL SCORE                                                     100.9



--
Professional hosting without compromise
www.clustered.net


* Re: Poor SMP performance pv_ops domU
  2010-05-18 17:34 Poor SMP performance pv_ops domU John Morrison
@ 2010-05-18 18:38 ` Jeremy Fitzhardinge
  2010-05-19 16:24   ` John Morrison
  0 siblings, 1 reply; 5+ messages in thread
From: Jeremy Fitzhardinge @ 2010-05-18 18:38 UTC (permalink / raw)
  To: John Morrison; +Cc: xen-devel

On 05/18/2010 10:34 AM, John Morrison wrote:
> Hi,
>
> Over the last year we have tried many times to get acceptable performance from pv_ops kernels.
>
> Tests were run with 1, 2, 4 and 8 cores. The more cores, the lower the score.
>
> Inside the domU all cores are visible, and top -s shows all cores in use.
> xentop in dom0 never shows over 99% CPU.
>
> The 2.6.18.8-xenU kernel shows over 700% CPU, and its scores are about 8x the pv_ops scores.
>
> Any ideas?
>   

Well, I guess some kind of bad serialization is going on in there, and
it should be fairly obvious with a bit of examination.

Have you tried building your own pvops domU kernels?  Does enabling PV
spinlocks make any difference?  Also enabling some of the lock
debugging/profiling/contention monitoring stuff may give useful results.

Can you post the corresponding 2.6.18 results?  Are there specific
sub-tests which show the effect more strongly than the others?
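
One way to look for such sub-tests in the numbers already posted is to compare the 1-core and 8-core INDEX columns. The sketch below does only that; the values are copied from the tables at the start of the thread, and the names `INDEXES` and `rank_by_ratio` are illustrative.

```python
# (1-core index, 8-core index) copied from the posted pv_ops tables.
INDEXES = {
    "Dhrystone 2 using register variables": (237.6, 264.7),
    "Double-Precision Whetstone": (253.2, 90.9),
    "Execl Throughput": (83.3, 84.2),
    "File Copy 1024 bufsize 2000 maxblocks": (240.3, 220.7),
    "File Copy 256 bufsize 500 maxblocks": (165.1, 157.0),
    "File Read 4096 bufsize 8000 maxblocks": (418.5, 362.6),
    "Pipe-based Context Switching": (55.3, 52.3),
    "Pipe Throughput": (42.8, 40.3),
    "Process Creation": (58.5, 51.8),
    "Shell Scripts (8 concurrent)": (85.0, 84.4),
    "System Call Overhead": (43.6, 47.0),
}


def rank_by_ratio(indexes):
    """Return (test, 8-core/1-core ratio) pairs, worst ratio first."""
    ratios = [(name, eight / one) for name, (one, eight) in indexes.items()]
    return sorted(ratios, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, ratio in rank_by_ratio(INDEXES):
        print(f"{ratio:5.2f}  {name}")
```

On the posted data the Double-Precision Whetstone index falls furthest as cores are added, while most other sub-tests stay within roughly 15% of their 1-core value.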

How does the 2.6.32 kernel fare when booted native?

Thanks,
    J


* Re: Poor SMP performance pv_ops domU
  2010-05-18 18:38 ` Jeremy Fitzhardinge
@ 2010-05-19 16:24   ` John Morrison
  2010-05-19 17:44     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 5+ messages in thread
From: John Morrison @ 2010-05-19 16:24 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: xen-devel

I've tried various kernels today; pv_ops seems to use only 1 core out of 8.

Enabling PV spinlocks makes no difference.

The thing that sticks out most is that I cannot get the dom0 (xen-3.4.2) to show more than about 99.7% CPU usage for any pv_ops kernel.

#!/usr/bin/perl

while () {}

Running 8 of these loads 2.6.18.8-xenU to nearly 800% CPU as shown in dom0.
Running the same 8 in any pv_ops kernel only gets as high as about 99.7%.

Inside both the pv_ops and xenU kernels, top -s shows all 8 cores being used.


John


* Re: Poor SMP performance pv_ops domU
  2010-05-19 16:24   ` John Morrison
@ 2010-05-19 17:44     ` Jeremy Fitzhardinge
       [not found]       ` <7AA26B35-634A-41B4-AD2E-54E3F33BD4BA@clustered.net>
  0 siblings, 1 reply; 5+ messages in thread
From: Jeremy Fitzhardinge @ 2010-05-19 17:44 UTC (permalink / raw)
  To: John Morrison; +Cc: xen-devel

On 05/19/2010 09:24 AM, John Morrison wrote:
> I've tried various kernels today; pv_ops seems to use only 1 core out of 8.
>
> Enabling PV spinlocks makes no difference.
>
> The thing that sticks out most is that I cannot get the dom0 (xen-3.4.2) to show more than about 99.7% CPU usage for any pv_ops kernel.
>
> #!/usr/bin/perl
>
> while () {}
>
> Running 8 of these loads 2.6.18.8-xenU to nearly 800% CPU as shown in dom0.
> Running the same 8 in any pv_ops kernel only gets as high as about 99.7%.
>   

What tool are you using to show CPU use?

> Inside both the pv_ops and xenU kernels, top -s shows all 8 cores being used.
>   

I tried to reproduce this:

   1. I created a 4 vcpu pvops PV domain (4 pcpu host)
   2. Confirmed that all 4 vcpus are present with "cat /proc/cpuinfo" in
      the domain
   3. Ran 4 instances of ``perl -e "while(){}"&'' in the domain
   4. "top" within the domain shows 99% overall user time, no stolen
      time, with the perl processes each using 99% cpu time
   5. in dom0 "watch -n 1 xl vcpu-list <domain>" shows all 4 vcpus are
      consuming 1 vcpu second per second
   6. running a spin loop in dom0 makes top within the domain show
      16-25% stolen time
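
The stolen-time figure in steps 4 and 6 can also be read directly from /proc/stat inside the domain. The following is a minimal sketch, assuming the usual layout of the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal, then optional guest fields); the function names and the sample lines in the usage note are hypothetical.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    if len(values) < 8:
        raise ValueError("no steal field present")
    return values


def steal_percent(before, after):
    """Share of elapsed CPU ticks reported as steal between two samples."""
    b = parse_cpu_line(before)
    a = parse_cpu_line(after)
    deltas = [y - x for x, y in zip(b, a)]
    # Sum user..steal only; guest time is already accounted within user/nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


if __name__ == "__main__":
    import time

    def read_cpu_line():
        with open("/proc/stat") as stat:
            return stat.readline()

    first = read_cpu_line()
    time.sleep(1)
    print(f"steal: {steal_percent(first, read_cpu_line()):.1f}%")
```

With the made-up samples `"cpu 100 0 50 800 10 0 0 40 0"` and `"cpu 160 0 60 860 10 0 0 60 0"`, 20 of the 150 elapsed ticks are steal, so the function reports about 13.3%.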

Aside from top showing "99%" rather than ~400% as one might expect, it
all seems OK, and it looks like the vcpus are actually getting all the
CPU they're asking for.  I think the 99 vs 400 difference is just a
change in how the kernel shows its accounting (since there's been a lot
of change in that area between .18 and .32, including a whole new
scheduler).

If you're seeing a real performance regression between .18 and .32,
that's interesting, but it would be useful to make sure you're comparing
apples to apples; in particular, to separate any effect of Linux's own
changes from .18 -> .32 from the effect of pvops vs xenU.

So, things to try:

    * make sure all the vcpus are actually enabled within your domain;
      if you're adding them after the domain has booted, you need to make
      sure they get hot-plugged properly
    * make sure you don't have any expensive debug options enabled in
      your kernel config
    * run your benchmark on the 2.6.32 kernel booted native and compare
      it to pvops running under xen
    * compare it with the Novell 2.6.32 non-pvops kernel
    * try pinning the vcpus to physical cpus to eliminate any Xen
      scheduler effects
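
For the first item in the list above, a small sketch of one possible check: it parses the CPU list format used by /sys/devices/system/cpu/online (for example "0-3,6") and reports which of the expected vcpus are not online. The helper names and the expected count of 8 are illustrative assumptions, not something taken from the thread.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,6' into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def missing_vcpus(online_text, expected_count):
    """Return the vcpu numbers below expected_count that are not online."""
    online = set(parse_cpu_list(online_text))
    return [cpu for cpu in range(expected_count) if cpu not in online]


if __name__ == "__main__":
    with open("/sys/devices/system/cpu/online") as online_file:
        text = online_file.read()
    # 8 is a placeholder for the number of vcpus configured for the domain.
    print("online:", parse_cpu_list(text))
    print("missing:", missing_vcpus(text, 8))
```

An empty "missing" list means every expected vcpu is online in the guest; it says nothing about how the hypervisor schedules them.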

Thanks,
    J


* Re: Poor SMP performance pv_ops domU
       [not found]       ` <7AA26B35-634A-41B4-AD2E-54E3F33BD4BA@clustered.net>
@ 2010-05-19 19:48         ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 5+ messages in thread
From: Jeremy Fitzhardinge @ 2010-05-19 19:48 UTC (permalink / raw)
  To: John Morrison; +Cc: Xen-devel

(Re-added cc: xen-devel)

On 05/19/2010 12:41 PM, John Morrison wrote:
> xentop for the cpu usage.
>
> We see the performance of a single core in domU when running a pv_ops kernel.
> Reboot domU with 2.6.18.8-xenU and performance jumps nearly 8-fold.
>   

Could you reproduce my experiment?  If you look at the CPU time
accumulated by each vcpu, is it incrementing at less than 1 vcpu
second/second?
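
The rate in question is simple arithmetic on two samples of each vcpu's cumulative CPU time, such as the per-vcpu time reported by "xl vcpu-list". A minimal sketch follows; the function name and the sample numbers in the usage note are hypothetical, not measurements from this thread.

```python
def vcpu_rates(first, second, interval_s):
    """CPU seconds consumed per wall-clock second for each vcpu.

    first and second map vcpu number to cumulative CPU time in seconds,
    sampled interval_s seconds apart.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return {
        vcpu: (second[vcpu] - first[vcpu]) / interval_s
        for vcpu in first
        if vcpu in second
    }


if __name__ == "__main__":
    # Hypothetical samples taken 10 seconds apart.
    earlier = {0: 100.0, 1: 50.0}
    later = {0: 110.0, 1: 51.0}
    for vcpu, rate in sorted(vcpu_rates(earlier, later, 10.0).items()):
        print(f"vcpu {vcpu}: {rate:.2f} cpu-seconds per second")
```

A vcpu running a spin loop with a physical CPU to itself should come out close to 1.0; a value well below that would suggest the vcpu is idle or not being scheduled.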

> Pinned all 8 CPUs; still the same results.
>
> Tried bare metal: much better results.
>   

What do you mean by "much better"?  How does it compare to domU 2.6.18?

> We have seen this over the last 18 months on all pv kernels we try.
>
> It's not any specific kernel; all pv kernels we try show the same performance impact.
>   

Do you mean pvops, or all PV Xen kernels?  How do the recent Novell
Xenlinux kernels perform?  Have you verified there are no expensive
debug options enabled?
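
One way to check for debug options is to scan the guest kernel's .config. The sketch below is only illustrative: the option list is an assumption about which settings tend to be costly (lock debugging, lockdep, lock statistics, page-allocation debugging), not a list given in this thread, and the function names are hypothetical.

```python
# Assumed shortlist of debug options that commonly add runtime overhead.
EXPENSIVE_DEBUG_OPTIONS = (
    "CONFIG_DEBUG_SPINLOCK",
    "CONFIG_DEBUG_MUTEXES",
    "CONFIG_DEBUG_LOCK_ALLOC",
    "CONFIG_PROVE_LOCKING",
    "CONFIG_LOCK_STAT",
    "CONFIG_DEBUG_PAGEALLOC",
)


def enabled_options(config_text):
    """Return the set of options set to y or m in a kernel .config."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blanks and comments such as '# CONFIG_FOO is not set'.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and value in ("y", "m"):
            enabled.add(name)
    return enabled


def expensive_debug_options(config_text):
    """List the shortlisted debug options that are enabled."""
    enabled = enabled_options(config_text)
    return [opt for opt in EXPENSIVE_DEBUG_OPTIONS if opt in enabled]


if __name__ == "__main__":
    import sys

    # Usage: pass the path to the kernel .config as the only argument.
    with open(sys.argv[1]) as config_file:
        print(expensive_debug_options(config_file.read()))
```

An empty result only rules out the options in the shortlist; other settings may still matter.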

BTW, is it a 32 or 64-bit guest?

    J

