RE: MPI benchmark performance gap between native linux anddomU

* RE: MPI benchmark performance gap between native linux anddomU
@ 2005-04-05  2:07 Santos, Jose Renato G (Jose Renato Santos)
  2005-04-05  8:59 ` Keir Fraser
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Santos, Jose Renato G (Jose Renato Santos) @ 2005-04-05  2:07 UTC
  To: xuehai zhang, Xen-devel; +Cc: Turner, Yoshio, Aravind Menon, G John Janakiraman

  Hi,

  We had a similar network problem in the past. We were using a TCP
benchmark instead of MPI but I believe your problem is probably the same
as the one we encountered.
  It took us a while to get to the bottom of this and we only identified
the reason for this behavior after we ported oprofile to Xen and did
some performance profiling experiments.

  Here is a brief explanation of the problem we found and the solution
that worked for us.
  Xenolinux allocates a full page (4KB) for each socket buffer instead
of using just MTU bytes as in traditional Linux. This is necessary to
enable page exchanges between the guest and the I/O domains. The side
effect is that the memory set aside for socket buffers is not used very
efficiently. Even if packets have the maximum MTU size (typically 1500
bytes for Ethernet), the total buffer utilization is very low (at most
slightly higher than 35%). If packets arrive faster than they are
processed at the receiver side, they will exhaust the receive buffer
before the TCP advertised window is reached. (By default, Linux uses a
TCP advertised window equal to 75% of the receive buffer size. In
standard Linux, this is typically sufficient to stop packet transmission
at the sender before running out of receive buffers. The same is not
true in Xen due to the inefficient use of socket buffers.) When a packet
arrives and there is no receive buffer available, TCP tries to free
socket buffer space by eliminating socket buffer fragmentation (i.e.
eliminating wasted buffer space). This is done at the cost of an extra
copy of all receive buffers into new, compacted socket buffers. This
introduces overhead and reduces throughput when the CPU is the
bottleneck, which seems to be your case.
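
To make the numbers above concrete, here is a minimal sketch of the
arithmetic in Python. It assumes each received packet is charged a full
4096-byte page and carries at most 1500 bytes, as described above. The
87380-byte receive buffer is only an assumed example value, and the
function names are illustrative.

```python
PAGE_SIZE = 4096   # bytes charged per packet (a full page, per the text)
MTU = 1500         # at most this many data bytes per packet (typical Ethernet MTU)
RCVBUF = 87380     # ASSUMPTION: example receive buffer size in bytes


def buffer_utilization(payload, allocated):
    """Fraction of the allocated buffer space that holds packet data."""
    return payload / allocated


def payload_when_buffer_full(rcvbuf, payload, allocated):
    """Data bytes held once whole `allocated`-byte buffers have used up rcvbuf."""
    packets = rcvbuf // allocated
    return packets * payload


utilization = buffer_utilization(MTU, PAGE_SIZE)
advertised = int(RCVBUF * 0.75)  # default advertised window: 75% of the buffer
held = payload_when_buffer_full(RCVBUF, MTU, PAGE_SIZE)

print(f"utilization: {utilization:.1%}")
print(f"advertised window: {advertised} bytes, data held when full: {held} bytes")
```

Under these assumptions the buffer is exhausted after far fewer data
bytes than the advertised window permits, so the sender is not stopped
in time.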

This problem is not very frequent because modern CPUs are fast enough to
receive packets at Gigabit speeds, so the receive buffer does not fill
up. However, the problem may arise when using slower machines and/or
when the workload consumes a lot of CPU cycles, as scientific MPI
applications do, for example. In your case you have both factors
against you.

The solution to this problem is trivial. You just have to change the TCP
advertised window of your guest to a lower value. In our case, we used
25% of the receive buffer size and that was sufficient to eliminate the
problem. This can be done using the following command:

    echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale

(The default value of 2 corresponds to 75% of the receive buffer, and -2
corresponds to 25%.)
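
For reference, the mapping from the sysctl value to a fraction of the
receive buffer can be sketched as follows. As far as I know this mirrors
the kernel's tcp_win_from_space() helper in kernels of this era, but
treat it as an illustration rather than the exact kernel code. The
Python function name is my own.

```python
def window_from_space(space, adv_win_scale):
    """Bytes of `space` usable as advertised window for a given tcp_adv_win_scale.

    Sketch of the rule: a positive scale reserves space / 2**scale for
    overhead; a zero or negative scale advertises space / 2**(-scale).
    """
    if adv_win_scale <= 0:
        return space >> (-adv_win_scale)
    return space - (space >> adv_win_scale)


space = 65536  # example buffer size in bytes, chosen only for round numbers
for scale in (2, 1, -1, -2):
    window = window_from_space(space, scale)
    print(f"tcp_adv_win_scale={scale:>2}: {window} bytes ({window / space:.0%})")
```

With a scale of 2 the window is three quarters of the buffer, and with
-2 it is one quarter, which matches the percentages quoted above.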

Please let me know if this improves your results. You should still see
lower throughput in Xen than in traditional Linux, but hopefully the gap
will be smaller. You should also try running your experiments in
domain 0. This will give better throughput, although still lower than
traditional Linux.
I am curious to know whether this has any effect in your experiments.
Please post the new results if it does.

Thanks

Renato

> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com 
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of 
> xuehai zhang
> Sent: Monday, April 04, 2005 4:19 PM
> To: Xen-devel@lists.xensource.com
> Subject: [Xen-devel] MPI benchmark performance gap between 
> native linux anddomU
> 
> 
> 
> Hi all,
> 
> I did the following experiments to explore MPI application
> execution performance both on native Linux machines and inside
> unprivileged Xen user domains. I use 8 machines with identical
> HW configurations (498.756 MHz dual CPU, 512MB memory, on a
> 10MB/sec LAN) and I use the Pallas MPI Benchmarks (PMB).
>
> Experiment 1: I boot all 8 nodes with native Linux (nosmp,
> kernel 2.4.29) and use all of them for the PMB tests.
>
> Experiment 2: I boot all 8 nodes with Xen running and start a
> single user domain (port 2.6.10, using file-backed VBD) on
> each node with 360MB memory. Then I run the same PMB tests
> among these 8 user domains.
>
> The experiment results show that running the same MPI benchmark
> in user domains usually results in worse (sometimes much worse)
> performance than on native Linux machines. The following are the
> results of the PMB SendRecv benchmark for both experiments
> (Table 1 and Table 2 report throughput and latency
> respectively). As you may notice, SendRecv can achieve
> 14.9 MB/sec throughput on native Linux machines but reaches a
> maximum of only 7.07 MB/sec when running inside user domains.
> The latency results also show a big gap.
>
> Clearly, there is a difference between the memory available on
> the native Linux machine of Experiment 1 (512MB) and in the user
> domain of Experiment 2 (360MB; it cannot go higher because dom0
> is started with 128MB memory). However, I don't think this is
> the main cause of the performance gap because the tested
> message sizes are much smaller than both memory sizes.
>
> I would appreciate your help if you have had a similar
> experience and want to share your insights.
>
> BTW, if you are not familiar with the PMB SendRecv benchmark,
> you can find a detailed explanation at
> http://people.cs.uchicago.edu/~hai/PMB-MPI1.pdf (see section 4.3.1).
>
> Thanks in advance for your help.
> 
> Xuehai
> 
> 
> P.S. Table 1: SendRecv throughput (MB/sec) performance
> 
> Message_Size(bytes)    Experiment_1    Experiment_2
> 0                      0               0
> 1                      0               0
> 2                      0               0
> 4                      0               0
> 8                      0.04            0.01
> 16                     0.16            0.01
> 32                     0.34            0.02
> 64                     0.65            0.04
> 128                    1.17            0.09
> 256                    2.15            0.59
> 512                    3.4             1.23
> 1K                     5.29            2.57
> 2K                     7.68            3.5
> 4K                     10.7            4.96
> 8K                     13.35           7.07
> 16K                    14.9            3.77
> 32K                    9.85            3.68
> 64K                    5.06            3.02
> 128K                   7.91            4.94
> 256K                   7.85            5.25
> 512K                   7.93            6.11
> 1M                     7.85            6.5
> 2M                     8.18            5.44
> 4M                     7.55            4.93
> 
> Table 2: SendRecv latency (millisec) performance
> 
> Message_Size(bytes)    Experiment_1    Experiment_2
> 0                   1979.6        3010.96
> 1                   1724.16       3218.88
> 2                   1669.65       3185.3
> 4                   1637.26       3055.67
> 8                   406.77        2966.17
> 16                  185.76        2777.89
> 32                  181.06        2791.06
> 64                  189.12        2940.82
> 128                 210.51        2716.3
> 256                 227.36        843.94
> 512                 287.28        796.71
> 1K                  368.72        758.19
> 2K                  508.65        1144.24
> 4K                  730.59        1612.66
> 8K                  1170.22       2471.65
> 16K                 2096.86       8300.18
> 32K                 6340.45       17017.99
> 64K                 24640.78      41264.5
> 128K                31709.09      50608.97
> 256K                63680.67      94918.13
> 512K                125531.7      162168.47
> 1M                  251566.94     321451.02
> 2M                  477431.32     707981
> 4M                  997768.35     1503987.61
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
> 


* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-05  2:07 MPI benchmark performance gap between native linux anddomU Santos, Jose Renato G (Jose Renato Santos)
@ 2005-04-05  8:59 ` Keir Fraser
  2005-04-05 22:22 ` Nivedita Singhvi
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Keir Fraser @ 2005-04-05  8:59 UTC
  To: Santos, Jose Renato G (Jose Renato Santos)
  Cc: Turner, Yoshio, Xen-devel, Aravind Menon, xuehai zhang,
	G John Janakiraman


On 5 Apr 2005, at 03:07, Santos, Jose Renato G (Jose Renato Santos) 
wrote:

>  Here is a brief explanation of the problem we found and the solution
> that worked for us.
>   Xenolinux allocates a full page (4KB) to store socket buffers instead
> of using just MTU bytes as in traditional linux. This is necessary to
> enable page exchanges between the guest and the I/O domains. The side
> effect of this is that memory space used for socket buffers is not
> very efficient.

This is true, but these days we lie to the network stack about how big 
the skb data area is. The 'truesize' field, which is what I think is 
used for socket buffer accounting, will be around 1600 bytes, not 4096. 
So I would expect the old trick of reducing the receive windows not to 
work: but if it does then that is very interesting!

  -- Keir


* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-05  2:07 MPI benchmark performance gap between native linux anddomU Santos, Jose Renato G (Jose Renato Santos)
  2005-04-05  8:59 ` Keir Fraser
@ 2005-04-05 22:22 ` Nivedita Singhvi
  2005-04-05 22:23 ` xuehai zhang
  2005-04-05 22:34 ` xuehai zhang
  3 siblings, 0 replies; 24+ messages in thread
From: Nivedita Singhvi @ 2005-04-05 22:22 UTC
  To: Santos, Jose Renato G (Jose Renato Santos)
  Cc: Turner, Yoshio, Aravind Menon, Xen-devel, xuehai zhang,
	G John Janakiraman

Santos, Jose Renato G (Jose Renato Santos) wrote:

>   Hi,
> 
>   We had a similar network problem in the past. We were using a TCP
> benchmark instead of MPI but I believe your problem is probably the same
> as the one we encountered.
>   It took us a while to get to the bottom of this and we only identified
> the reason for this behavior after we ported oprofile to Xen and did
> some performance profiling experiments.

Hello! Was this on the 2.6 kernel? Would you be able to
share the oprofile port? It would be very handy indeed
right now. (I was told by a few people that someone
was porting oprofile, and I believe some status went by
on the list, but I haven't seen it yet...)

>   Here is a brief explanation of the problem we found and the solution
> that worked for us.
>   Xenolinux allocates a full page (4KB) to store socket buffers instead
> of using just MTU bytes as in traditional linux. This is necessary to
> enable page exchanges between the guest and the I/O domains. The side
> effect of this is that memory space used for  socket buffers is not very
> efficient. Even if packets have the maximum MTU size (typically 1500
> bytes for Ethernet) the total buffer utilization is very low ( at most
> just slightly  higher than 35%). If packets arrive faster than they are
> processed at the receiver side, they will exhaust the receiver buffer

Most small connections (say up to 3-4K) involve only 3 to 5 segments,
and so the TCP window never really opens fully.  On longer-lived
connections, it does help very much to have a large buffer.

> before the TCP advertised window is reached (By default Linux uses a TCP
> advertised window equal to 75% of the receive buffer size. In standard
> Linux, this is typically sufficient to stop packet transmission at the
> sender before running out of receive buffers. The same is not true in
> Xen due to inefficient use of socket buffers). When a packet arrives and
> there is no receive buffer available, TCP tries to free socket buffer
> space by eliminating socket buffer fragmentation (i.e. eliminating
> wasted buffer space). This is done at the cost of an extra copy of all
> receive buffer to new compacted socket buffers. This introduces overhead
> and reduces throughput when the CPU is the bottleneck, which seems to be
> your case.

/proc/net/netstat will show a counter of just how many times this
happens (RcvPruned). Would be interesting if that was significant.
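
For anyone who wants to check this quickly, here is a minimal sketch of
reading that counter in Python. It relies on /proc/net/netstat being
laid out as pairs of lines (a line of counter names followed by a line
of values with the same prefix). The sample text and the function names
are made up for illustration.

```python
# Made-up sample in the /proc/net/netstat layout: names line, then values line.
SAMPLE = (
    "TcpExt: SyncookiesSent PruneCalled RcvPruned OfoPruned\n"
    "TcpExt: 0 12 3 0\n"
    "IpExt: InNoRoutes InTruncatedPkts\n"
    "IpExt: 0 0\n"
)


def parse_netstat(text):
    """Parse netstat-style text into {prefix: {counter_name: value}}."""
    stats = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # Walk the lines two at a time: header (names) and then values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        value_prefix, numbers = values.split(":", 1)
        if prefix != value_prefix:
            continue  # unexpected layout, skip this pair
        stats[prefix] = dict(zip(names.split(), (int(n) for n in numbers.split())))
    return stats


def rcv_pruned(text):
    """Return the TcpExt RcvPruned counter, or None if it is not present."""
    return parse_netstat(text).get("TcpExt", {}).get("RcvPruned")


print("RcvPruned in sample:", rcv_pruned(SAMPLE))
```

On a real guest, pass the contents of /proc/net/netstat instead of
SAMPLE and compare the value before and after a benchmark run.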

> This problem is not very frequent because modern CPUs are fast enough to
> receive packets at Gigabit speeds and the receive buffer does not fill
> up. However the problem may arise when using slower machines and/or when
> the workload consumes a lot of CPU cycles, such as for example
> scientific MPI applications. In your case in you have both factors
> against you.


> The solution to this problem is trivial. You just have to change the TCP
> advertised window of your guest to a lower value. In our case, we used
> 25% of the receive buffer size and that was sufficient  to eliminate the
> problem. This can be done using the following command

>>echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale

How much did this improve your results by? And wouldn't
making the default and maximum socket buffers larger by,
say, 5 times be more effective (other than for those
applications that already use setsockopt() to set their
buffers to some size, but not a large enough one)?
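
To illustrate the setsockopt() case, here is a minimal sketch of how an
application sets its own receive buffer, written in Python for brevity.
The factor of 5 is just the example figure from the question above and
the helper name is illustrative. The kernel may cap the request (on
Linux at net.core.rmem_max), and Linux reports back twice the value it
accepted to account for bookkeeping overhead.

```python
import socket


def set_receive_buffer(sock, nbytes):
    """Request a receive buffer of nbytes; return the size the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    default_size = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    effective = set_receive_buffer(s, 5 * default_size)
    print(f"default {default_size} bytes, after request {effective} bytes")
```

An application that sets the buffer this way before connecting is not
affected by later changes to the system-wide defaults.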

> (The default 2 corresponds to 75% of receive buffer, and -2 corresponds
> to 25%)
> 
> Please let me know if this improve your results. You should still see a
> degradation in throughput when comparing xen to traditional linux, but
> hopefully you should be able to see better throughputs. You should also
> try running your experiments in domain 0. This will give better
> throughput although still lower than traditional linux.
> I am curious to know if this have any effect in your experiments.
> Please, post the new results if this has any effect in your results

Yep, me too..

thanks,
Nivedita


* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-05  2:07 MPI benchmark performance gap between native linux anddomU Santos, Jose Renato G (Jose Renato Santos)
  2005-04-05  8:59 ` Keir Fraser
  2005-04-05 22:22 ` Nivedita Singhvi
@ 2005-04-05 22:23 ` xuehai zhang
  2005-04-05 22:34 ` xuehai zhang
  3 siblings, 0 replies; 24+ messages in thread
From: xuehai zhang @ 2005-04-05 22:23 UTC
  To: Santos, Jose Renato G (Jose Renato Santos), m+Ian.Pratt
  Cc: Turner, Yoshio, Aravind Menon, Xen-devel, G John Janakiraman

Hi Ian and Jose,

Based on your suggestions, I did two more experiments. In the first (tagged "domU-B" in the tables 
below), I changed tcp_adv_win_scale in domU to -2 (the default is 2). In the second (tagged "dom0" 
in the tables below), I repeated the experiment in dom0 (with only dom0 running). The following 
tables contain the results of these two new experiments plus the two old ones from my previous 
email (tagged "native-linux" and "domU-A").

I have the following observations from the results:

1. Decreasing the TCP advertised window ("domU-B") does not improve performance; if anything, it 
is slightly slower than "domU-A".

2. Generally, performance in dom0 ("dom0" column) is very close to (slightly lower than) 
performance on native Linux ("native-linux" column). However, in certain situations dom0 
outperforms native Linux, for example in throughput when the message size is 64KB and in latency 
when the message size is 1, 2, 4, or 8 bytes.

3. The performance gap between domU and dom0 is big, similar to the gap between domU and native 
Linux.

BTW, each reported data point in the following tables is the average over 10 runs of the same 
experiment. I forgot to mention that in the experiments using user domains, the 8 domUs form a 
private network and each domU is assigned a private network IP (for example, 192.168.254.X).

Xuehai

*********************************
*SendRecv Throughput(Mbytes/sec)*
*********************************

Msg Size(bytes)    native-linux   dom0          domU-A         domU-B
         0         0              0.00          0              0.00
         1         0              0.01          0              0.00
         2         0              0.01          0              0.00
         4         0              0.03          0              0.00
         8         0.04           0.05          0.01           0.01
        16         0.16           0.11          0.01           0.01
        32         0.34           0.21          0.02           0.02
        64         0.65           0.42          0.04           0.04
       128         1.17           0.79          0.09           0.10
       256         2.15           1.44          0.59           0.58
       512         3.4            2.39          1.23           1.22
      1024         5.29           3.79          2.57           2.50
      2048         7.68           5.30          3.5            3.44
      4096         10.7           8.51          4.96           5.23
      8192         13.35          11.06         7.07           6.00
     16384         14.9           13.60         3.77           4.62
     32768         9.85           11.13         3.68           4.34
     65536         5.06           9.06          3.02           3.14
    131072         7.91           7.61          4.94           5.04
    262144         7.85           7.65          5.25           5.29
    524288         7.93           7.77          6.11           5.40
   1048576         7.85           7.82          6.5            5.62
   2097152         8.18           7.35          5.44           5.32
   4194304         7.55           6.88          4.93           4.92

*********************************
*   SendRecv Latency(millisec)  *
*********************************

Msg Size(bytes)    native-linux   dom0          domU-A         domU-B
         0         1979.6         1920.83       3010.96        3246.71
         1         1724.16        397.27        3218.88        3219.63
         2         1669.65        297.58        3185.3         3298.86
         4         1637.26        285.27        3055.67        3222.34
         8         406.77         282.78        2966.17        3001.24
        16         185.76         283.87        2777.89        2761.90
        32         181.06         284.75        2791.06        2798.77
        64         189.12         293.93        2940.82        3043.55
       128         210.51         310.47        2716.3         2495.83
       256         227.36         338.13        843.94         853.86
       512         287.28         408.14        796.71         805.51
      1024         368.72         515.59        758.19         786.67
      2048         508.65         737.12        1144.24        1150.66
      4096         730.59         917.97        1612.66        1516.35
      8192         1170.22        1411.94       2471.65        2650.17
     16384         2096.86        2297.19       8300.18        6857.13
     32768         6340.45        5619.56       17017.99       14392.36
     65536         24640.78       13787.31      41264.5        39871.19
    131072         31709.09       32797.52      50608.97       49533.68
    262144         63680.67       65174.67      94918.13       94157.30
    524288         125531.7       128116.73     162168.47      189307.05
   1048576         251566.94      252257.55     321451.02      361714.44
   2097152         477431.32      527432.60     707981         728504.38
   4194304         997768.35      1108898.61    1503987.61     1534795.56
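
The observations listed before the tables can be spot-checked by 
normalizing each column to native Linux. Here is a minimal sketch in 
Python that uses four rows copied from the throughput table above. The 
choice of rows and the helper name are only for illustration.

```python
# Throughput in MB/sec, copied from the SendRecv throughput table above.
COLUMNS = ("native-linux", "dom0", "domU-A", "domU-B")
THROUGHPUT = {
    8192:    (13.35, 11.06, 7.07, 6.00),
    16384:   (14.9, 13.60, 3.77, 4.62),
    65536:   (5.06, 9.06, 3.02, 3.14),
    1048576: (7.85, 7.82, 6.5, 5.62),
}


def relative_to_native(rows):
    """Return {message_size: {column: throughput / native-linux throughput}}."""
    result = {}
    for size, values in rows.items():
        native = values[0]
        result[size] = {name: v / native for name, v in zip(COLUMNS, values)}
    return result


for size, ratios in relative_to_native(THROUGHPUT).items():
    line = "  ".join(f"{name}={ratio:.2f}" for name, ratio in ratios.items())
    print(f"{size:>8}  {line}")
```

A ratio above 1.0 in the dom0 column marks a message size where dom0 
beat native Linux, as it does in the 65536-byte row.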


* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-05  2:07 MPI benchmark performance gap between native linux anddomU Santos, Jose Renato G (Jose Renato Santos)
                   ` (2 preceding siblings ...)
  2005-04-05 22:23 ` xuehai zhang
@ 2005-04-05 22:34 ` xuehai zhang
  2005-04-05 22:53   ` Nivedita Singhvi
  3 siblings, 1 reply; 24+ messages in thread
From: xuehai zhang @ 2005-04-05 22:34 UTC
  To: Santos, Jose Renato G (Jose Renato Santos); +Cc: Xen-devel

Santos, Jose Renato G (Jose Renato Santos) wrote:
> The solution to this problem is trivial. You just have to change the TCP
> advertised window of your guest to a lower value. In our case, we used
> 25% of the receive buffer size and that was sufficient  to eliminate the
> problem. This can be done using the following command
> 
> 
>>echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale

In my experiments, I noticed that the above change doesn't persist across reboots (every reboot 
changes the value back to 2, the default value for Debian Sarge 3.1). Is there a way to make the 
change permanent?

Thanks.

Xuehai

> (The default 2 corresponds to 75% of receive buffer, and -2 corresponds
> to 25%)
> 
> Please let me know if this improve your results. You should still see a
> degradation in throughput when comparing xen to traditional linux, but
> hopefully you should be able to see better throughputs. You should also
> try running your experiments in domain 0. This will give better
> throughput although still lower than traditional linux.
> I am curious to know if this have any effect in your experiments.
> Please, post the new results if this has any effect in your results
> 
> Thanks
> 
> Renato
> 
> 
>  
>  
> 
>>-----Original Message-----
>>From: xen-devel-bounces@lists.xensource.com 
>>[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of 
>>xuehai zhang
>>Sent: Monday, April 04, 2005 4:19 PM
>>To: Xen-devel@lists.xensource.com
>>Subject: [Xen-devel] MPI benchmark performance gap between 
>>native linux anddomU
>>
>>
>>
>>Hi all,
>>
>>I did the following experiments to explore the MPI 
>>application execution performance on both native linux 
>>machines and inside of unprivileged Xen user domains. I use 
>>8 machines with identical HW configurations (498.756 MHz dual 
>>CPU, 512MB memory, on a 10MB/sec LAN) and I use Pallas MPI 
>>Benchmarks (PMB).
>>
>>Experiment 1: I boot all 8 nodes with native linux (nosmp, 
>>kernel 2.4.29) and use all of them for PMB tests.
>>
>>Experiment 2: I boot all 8 nodes with Xen running and start a 
>>single user domain (port 2.6.10,using file-backed VBD) on 
>>each node with 360MB memory. Then I run the same PMB tests 
>>among these 8 user domains.
>>
>>The experiment results show that running the same MPI benchmark in 
>>user domains usually results in worse (sometimes much worse) 
>>performance compared with native linux machines. The 
>>following are the results for the PMB SendRecv benchmark for both 
>>experiments (table 1 and table 2 report throughput and latency 
>>respectively). As you may notice, SendRecv can achieve a 
>>14.9MB/sec throughput on native linux machines but only a 
>>maximum of 7.07 MB/sec throughput when running inside user 
>>domains. The latency results also show a big gap.
>>
>>Clearly, there is a difference between the memory used in the 
>>native linux machine of Experiment 1 (512MB) and in the user 
>>domain (360MB, cannot go higher because dom0 started with 
>>128MB memory) of Experiment 2. However, I don't think it is 
>>the main cause of the performance gap because the tested 
>>message sizes are much smaller than both memory sizes.
>>
>>I would appreciate your help if you have had a similar experience 
>>and want to share your insights.
>>
>>BTW, if you are not familiar with the PMB SendRecv benchmark, you 
>>can find a detailed explanation at 
>>http://people.cs.uchicago.edu/~hai/PMB-MPI1.pdf (see section 4.3.1).
>>
>>Thanks in advance for your help.
>>
>>Xuehai
>>
>>
>>P.S. Table 1: SendRecv throughput (MB/sec) performance
>>
>>Message_Size(bytes)    Experiment_1    Experiment_2
>>0                   0             0
>>1                   0             0
>>2                   0             0
>>4                   0             0
>>8                   0.04          0.01
>>16                  0.16          0.01
>>32                  0.34          0.02
>>64                  0.65          0.04
>>128                 1.17          0.09
>>256                 2.15          0.59
>>512                 3.4           1.23
>>1K                  5.29          2.57
>>2K                  7.68          3.5
>>4K                  10.7          4.96
>>8K                  13.35         7.07
>>16K                 14.9          3.77
>>32K                 9.85          3.68
>>64K                 5.06          3.02
>>128K                7.91          4.94
>>256K                7.85          5.25
>>512K                7.93          6.11
>>1M                  7.85          6.5
>>2M                  8.18          5.44
>>4M                  7.55          4.93
>>
>>Table 2: SendRecv latency (millisec) performance
>>
>>Message_Size(bytes)    Experiment_1    Experiment_2
>>0                   1979.6        3010.96
>>1                   1724.16       3218.88
>>2                   1669.65       3185.3
>>4                   1637.26       3055.67
>>8                   406.77        2966.17
>>16                  185.76        2777.89
>>32                  181.06        2791.06
>>64                  189.12        2940.82
>>128                 210.51        2716.3
>>256                 227.36        843.94
>>512                 287.28        796.71
>>1K                  368.72        758.19
>>2K                  508.65        1144.24
>>4K                  730.59        1612.66
>>8K                  1170.22       2471.65
>>16K                 2096.86       8300.18
>>32K                 6340.45       17017.99
>>64K                 24640.78      41264.5
>>128K                31709.09      50608.97
>>256K                63680.67      94918.13
>>512K                125531.7      162168.47
>>1M                  251566.94     321451.02
>>2M                  477431.32     707981
>>4M                  997768.35     1503987.61
>>
>>
>>
>>_______________________________________________
>>Xen-devel mailing list
>>Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-05 22:34 ` xuehai zhang
@ 2005-04-05 22:53   ` Nivedita Singhvi
  2005-04-05 22:58     ` Nivedita Singhvi
  0 siblings, 1 reply; 24+ messages in thread
From: Nivedita Singhvi @ 2005-04-05 22:53 UTC (permalink / raw)
  To: xuehai zhang; +Cc: Xen-devel, Santos, Jose Renato G (Jose Renato Santos)

xuehai zhang wrote:


>>> echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale
> 
> 
> In my experiments, I noticed that the above change doesn't persist across 
> reboots (every reboot resets the value to 2, the default on 
> Debian Sarge 3.1). Is there a way to make the change permanent?


You can edit /etc/sysctl.conf and add the following entry:

net.ipv4.tcp_adv_win_scale = -2

Or you can put in a sysctl -w net.ipv4.tcp_adv_win_scale -2
into some appropriate /etc/init.d startup script.

thanks,
Nivedita

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-05 22:53   ` Nivedita Singhvi
@ 2005-04-05 22:58     ` Nivedita Singhvi
  0 siblings, 0 replies; 24+ messages in thread
From: Nivedita Singhvi @ 2005-04-05 22:58 UTC (permalink / raw)
  To: Nivedita Singhvi
  Cc: Xen-devel, xuehai zhang,
	Santos, Jose Renato G (Jose Renato Santos)

Nivedita Singhvi wrote:

> Or you can put in a sysctl -w net.ipv4.tcp_adv_win_scale -2

grrr...

sysctl -w net.ipv4.tcp_adv_win_scale=-2
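
For reference, the 75% and 25% figures quoted earlier in the thread follow
from how the kernel turns receive-buffer space into an advertised window.
Below is a minimal Python sketch of that mapping, assuming the rule described
in the kernel's ip-sysctl documentation for kernels of this era (a positive
value holds back space/2^scale as overhead, a zero or negative value
advertises space/2^-scale). The function name and the 64 KB buffer are
illustrative and not taken from the thread.

```python
def advertised_window(space: int, adv_win_scale: int) -> int:
    """Model of how tcp_adv_win_scale maps buffer space to window space.

    Assumption: mirrors the documented 2.6-era rule; this is not kernel code.
    """
    if adv_win_scale <= 0:
        # Zero or negative: advertise space / 2^(-scale).
        return space >> (-adv_win_scale)
    # Positive: hold back space / 2^scale as buffering overhead.
    return space - (space >> adv_win_scale)


if __name__ == "__main__":
    rcvbuf = 65536  # hypothetical receive buffer size in bytes
    for scale in (2, -2):
        win = advertised_window(rcvbuf, scale)
        print(f"tcp_adv_win_scale={scale:>2}: {win} bytes "
              f"({100 * win // rcvbuf}% of the buffer)")
```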

thanks,
Nivedita

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: MPI benchmark performance gap between native linux anddomU
@ 2005-04-06  7:23 Ian Pratt
  0 siblings, 0 replies; 24+ messages in thread
From: Ian Pratt @ 2005-04-06  7:23 UTC (permalink / raw)
  To: Ian Pratt, Santos, Jose Renato G (Jose Renato Santos),
	xuehai zhang
  Cc: Turner, Yoshio, Aravind Menon, Xen-devel, G John Janakiraman

> If you're on an SMP with the dom0 and domU's on different 
> CPUs (and have CPU to burn) then you might get a performance 
> improvement by artificially capping some of the natural 
> batching to just a couple of packets. You could try modifying 
> netback's net_rx_action to send the notification through to 
> netfront more eagerly. This will help get the latency down, 
> at the cost of burning more CPU.

To be clearer, modify netback's net_rx_action as follows to kick the
frontend after every packet. I expect this might help for some of the
larger message sizes. Kicking every packet may be overdoing it, so you
might want to adjust to every Nth, using the rx_notify array to store
the number of packets queued per netfront driver.

Overall, the MPI SendRecv benchmark is an absolute worst case scenario
for s/w virtualization. Any 'optimisations' we add will be at the
expense of reduced CPU efficiency, possibly resulting in reduced
bandwidth for many users. The best solution to this is to use a 'smart
NIC' or HCA (such as the Arsenic GigE we developed) that can deliver
packets directly to VMs. I expect we'll see a number of such NICs on the
market before too long, and they'll be great for Xen.

Ian

        evtchn = netif->evtchn;
        id = netif->rx->ring[MASK_NETIF_RX_IDX(netif->rx_resp_prod)].req.id;
        if ( make_rx_response(netif, id, status, mdata, size) &&
             (rx_notify[evtchn] == 0) )
        {
-            rx_notify[evtchn] = 1;
-            notify_list[notify_nr++] = evtchn;
+            notify_via_evtchn(evtchn);
        }

        netif_put(netif);
        dev_kfree_skb(skb);
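
The "every Nth packet" variant mentioned above can be sketched as a small
Python model, purely to illustrate the counting logic. It is not netback
code: the class, the callback, and the default N are hypothetical, and only
the idea of keeping a per-event-channel packet count (the role rx_notify
would play) comes from the description above.

```python
from collections import defaultdict
from typing import Callable


class NotifyBatcher:
    """Toy model: kick a frontend after every Nth queued packet."""

    def __init__(self, notify: Callable[[int], None], every_n: int = 4):
        if every_n < 1:
            raise ValueError("every_n must be at least 1")
        self.notify = notify              # stands in for notify_via_evtchn()
        self.every_n = every_n            # hypothetical batching threshold
        self.pending = defaultdict(int)   # evtchn -> packets not yet signalled

    def packet_queued(self, evtchn: int) -> None:
        """Record one response queued for the frontend behind evtchn."""
        self.pending[evtchn] += 1
        if self.pending[evtchn] >= self.every_n:
            self.flush(evtchn)

    def flush(self, evtchn: int) -> None:
        """Signal the frontend if it has packets that were not signalled."""
        if self.pending[evtchn] > 0:
            self.pending[evtchn] = 0
            self.notify(evtchn)

    def end_of_batch(self) -> None:
        """Signal every frontend with leftovers so no packet waits forever."""
        for evtchn in list(self.pending):
            self.flush(evtchn)


if __name__ == "__main__":
    kicks = []
    batcher = NotifyBatcher(kicks.append, every_n=2)
    for _ in range(5):
        batcher.packet_queued(7)
    batcher.end_of_batch()
    print(kicks)
```

Setting every_n to 1 corresponds to one kick per packet, as in the patch
above; larger values trade some latency for fewer notifications.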

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: MPI benchmark performance gap between native linux anddomU
@ 2005-04-06  7:08 Ian Pratt
  0 siblings, 0 replies; 24+ messages in thread
From: Ian Pratt @ 2005-04-06  7:08 UTC (permalink / raw)
  To: Santos, Jose Renato G (Jose Renato Santos), xuehai zhang
  Cc: Turner, Yoshio, Aravind Menon, Xen-devel, G John Janakiraman

>   I believe your problem is due to a higher network latency 
> in Xen. Your formula to compute throughput uses the inverse 
> of round trip latency (if I understood it correctly). This 
> probably means that your application is sensitive to the 
> round trip latency. Your latency measurements show a higher 
> value for domainU and this is the reason for the lower 
> throughput.  I am not sure but it is possible that network 
> interrupts or event notifications in the inter-domain channel 
> are being coalesced and causing longer latency. Keir, do 
> event notifications get coalesced in the inter-domain I/O 
> channel for networking?

There's no timeout-based coalescing right now, so we'll be pushing
through packets as soon as the sending party empties its own work
queue.[*]

If you're on an SMP with the dom0 and domU's on different CPUs (and have
CPU to burn) then you might get a performance improvement by
artificially capping some of the natural batching to just a couple of
packets. You could try modifying netback's net_rx_action to send the
notification through to netfront more eagerly. This will help get the
latency down, at the cost of burning more CPU.

Ian

[*] We actually need to add some timeout-based coalescing to make true
inter-VM communication work more efficiently (i.e. two VMs on the same
node talking to each other rather than out over the network). We'll
probably need to have some heuristic to detect when we're entering a
'high bandwidth regime' and only then enable the timeout-forced batching.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: MPI benchmark performance gap between native linux anddomU
@ 2005-04-06  0:37 Santos, Jose Renato G (Jose Renato Santos)
  2005-04-06  4:24 ` xuehai zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Santos, Jose Renato G (Jose Renato Santos) @ 2005-04-06  0:37 UTC (permalink / raw)
  To: xuehai zhang, m+Ian.Pratt
  Cc: Turner, Yoshio, Aravind Menon, Xen-devel, G John Janakiraman


  Xuehai,

  Thanks for posting your new results. In fact it seems that your
problem is not the same as the one we encountered.
  
  I believe your problem is due to a higher network latency in Xen. Your
formula to compute throughput uses the inverse of round trip latency (if
I understood it correctly). This probably means that your application is
sensitive to the round trip latency. Your latency measurements show a
higher value for domainU and this is the reason for the lower
throughput.  I am not sure but it is possible that network interrupts or
event notifications in the inter-domain channel are being coalesced and
causing longer latency. Keir, do event notifications get coalesced in
the inter-domain I/O channel for networking?

  Renato
  

>> -----Original Message-----
>> From: xuehai zhang [mailto:hai@cs.uchicago.edu] 
>> Sent: Tuesday, April 05, 2005 3:23 PM
>> To: Santos, Jose Renato G (Jose Renato Santos); 
>> m+Ian.Pratt@cl.cam.ac.uk
>> Cc: Xen-devel@lists.xensource.com; Aravind Menon; Turner, 
>> Yoshio; G John Janakiraman
>> Subject: Re: [Xen-devel] MPI benchmark performance gap 
>> between native linux anddomU
>> 
>> 
>> Hi Ian and Jose,
>> 
>> Based on your suggestions, I did two more experiments: one 
>> (with tag "domU-B" in table below) is 
>> changing the TCP advertised window of domU to -2 (the 
>> default is 2) and the other (with tag "dom0" 
>> in table below) is to repeat the experiment in dom0 (only 
>> dom0 is running). The following table 
>> contains the results from these two new experiments plus two 
>> old ones (with tags "native-linux" and 
>> "domU-A" in table below) in my previous email.
>> 
>> I have the following observations from the results:
>> 
>> 1. Decreasing the TCP advertised window scaling ("domU-B") doesn't 
>> improve performance and in fact slightly slows it down 
>> (compared with "domU-A").
>> 
>> 2. Generally, the performance of running the experiments in 
>> dom0 ("dom0" column) is very close to (slightly below) the 
>> performance on native linux ("native-linux" column). However, in 
>> certain situations dom0 outperforms native linux, for example 
>> the throughput at a message size of 64KB and the latency at 
>> message sizes of 1, 2, 4, or 8 bytes.
>> 
>> 3. The performance gap between domU and dom0 is big, 
>> similar to the gap between domU and native linux.
>> 
>> BTW, each reported data point in the following table is the 
>> average over 10 runs of the same experiment. I forgot to 
>> mention that in the experiments using user domains, the 8 domUs 
>> form a private network and each domU is assigned a private 
>> network IP (for example, 192.168.254.X).
>> 
>> Xuehai
>> 
>> *********************************
>> *SendRecv Throughput(Mbytes/sec)*
>> *********************************
>> 
>> Msg Size(bytes)    native-linux   dom0          domU-A         domU-B
>>          0         0              0.00          0              0.00
>>          1         0              0.01          0              0.00
>>          2         0              0.01          0              0.00
>>          4         0              0.03          0              0.00
>>          8         0.04           0.05          0.01           0.01
>>         16         0.16           0.11          0.01           0.01
>>         32         0.34           0.21          0.02           0.02
>>         64         0.65           0.42          0.04           0.04
>>        128         1.17           0.79          0.09           0.10
>>        256         2.15           1.44          0.59           0.58
>>        512         3.4            2.39          1.23           1.22
>>       1024         5.29           3.79          2.57           2.50
>>       2048         7.68           5.30          3.5            3.44
>>       4096         10.7           8.51          4.96           5.23
>>       8192         13.35          11.06         7.07           6.00
>>      16384         14.9           13.60         3.77           4.62
>>      32768         9.85           11.13         3.68           4.34
>>      65536         5.06           9.06          3.02           3.14
>>     131072         7.91           7.61          4.94           5.04
>>     262144         7.85           7.65          5.25           5.29
>>     524288         7.93           7.77          6.11           5.40
>>    1048576         7.85           7.82          6.5            5.62
>>    2097152         8.18           7.35          5.44           5.32
>>    4194304         7.55           6.88          4.93           4.92
>> 
>> *********************************
>> *   SendRecv Latency(millisec)  *
>> *********************************
>> 
>> Msg Size(bytes)    native-linux   dom0          domU-A         domU-B
>>          0         1979.6         1920.83       3010.96        3246.71
>>          1         1724.16        397.27        3218.88        3219.63
>>          2         1669.65        297.58        3185.3         3298.86
>>          4         1637.26        285.27        3055.67        3222.34
>>          8         406.77         282.78        2966.17        3001.24
>>         16         185.76         283.87        2777.89        2761.90
>>         32         181.06         284.75        2791.06        2798.77
>>         64         189.12         293.93        2940.82        3043.55
>>        128         210.51         310.47        2716.3         2495.83
>>        256         227.36         338.13        843.94         853.86
>>        512         287.28         408.14        796.71         805.51
>>       1024         368.72         515.59        758.19         786.67
>>       2048         508.65         737.12        1144.24        1150.66
>>       4096         730.59         917.97        1612.66        1516.35
>>       8192         1170.22        1411.94       2471.65        2650.17
>>      16384         2096.86        2297.19       8300.18        6857.13
>>      32768         6340.45        5619.56       17017.99       14392.36
>>      65536         24640.78       13787.31      41264.5        39871.19
>>     131072         31709.09       32797.52      50608.97       49533.68
>>     262144         63680.67       65174.67      94918.13       94157.30
>>     524288         125531.7       128116.73     162168.47      189307.05
>>    1048576         251566.94      252257.55     321451.02      361714.44
>>    2097152         477431.32      527432.60     707981         728504.38
>>    4194304         997768.35      1108898.61    1503987.61     1534795.56
>> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-06  0:37 Santos, Jose Renato G (Jose Renato Santos)
@ 2005-04-06  4:24 ` xuehai zhang
  0 siblings, 0 replies; 24+ messages in thread
From: xuehai zhang @ 2005-04-06  4:24 UTC (permalink / raw)
  To: Santos, Jose Renato G (Jose Renato Santos)
  Cc: m+Ian.Pratt, Xen-devel, Aravind Menon, G John Janakiraman,
	Turner, Yoshio

Jose,

Thank you for your help to diagnose the problem!

I kinda agree with you that the problem is due to the network latency. The throughput calculation of the 
SendRecv benchmark is directly related to the latency; the following is its formula 
(where #_of_messages is 2, the unit of message_size is bytes, and the unit of latency is 
milliseconds):
	throughput = ((#_of_messages * message_size) / 2^20) / (latency / 10^6)
So, the performance gap really comes from the higher latency in domU. It is true that PMB's 
SendRecv benchmark is sensitive to the round trip latency. I would very much like to hear Keir's comments on 
the behavior of event notifications in the inter-domain I/O channel for networking.
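
As a worked check of the formula, here is a short Python sketch. It reads the
two constants in the formula as 2^20 (bytes per MB) and 10^6 (microseconds per
second), so the latency argument is taken in microseconds; with that reading
the native-linux rows of the SendRecv tables are reproduced closely. The
function name is made up for illustration.

```python
def sendrecv_throughput(message_size: int, latency_usec: float,
                        n_messages: int = 2) -> float:
    """Throughput in MB/sec from message size (bytes) and latency.

    Assumption: latency is in microseconds, as the 10^6 divisor implies.
    """
    if latency_usec <= 0:
        raise ValueError("latency must be positive")
    megabytes = (n_messages * message_size) / 2**20
    seconds = latency_usec / 10**6
    return megabytes / seconds


if __name__ == "__main__":
    # (message size, latency) pairs from the native-linux SendRecv columns.
    for size, latency in ((8192, 1170.22), (16384, 2096.86)):
        print(size, round(sendrecv_throughput(size, latency), 2))
```

The throughput table reports 13.35 and 14.9 MB/sec for these two native-linux
rows. Rows with noisier latencies need not match as closely, since each table
entry is an average over several runs.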

BTW, as I stated in my previous emails, besides the SendRecv benchmark, I also have results for 11 other PMB 
benchmarks on both native linux and domU. The following are the PingPing results (between 2 
nodes) from my experiments. As you can see, the performance gap is not as big as for SendRecv, and the 
performance is very close in several test cases. Part of the reason might be that 
only two nodes are used and only the one-way latency is used to calculate the latency and 
throughput values.

Best,
Xuehai

P.S.
Note: each reported data point in the following table is the
average over 10 runs of the same experiment, as with SendRecv.

PingPing Throughput (MB/sec)
Msg-size(bytes) #repetitions      native-linux    domU
             0         1000         0.00           0.00
             1         1000         0.01           0.00
             2         1000         0.01           0.01
             4         1000         0.02           0.01
             8         1000         0.04           0.02
            16         1000         0.09           0.04
            32         1000         0.17           0.09
            64         1000         0.33           0.17
           128         1000         0.65           0.33
           256         1000         1.19           0.62
           512         1000         1.95           1.06
          1024         1000         2.80           1.73
          2048         1000         3.74           2.52
          4096         1000         5.38           3.77
          8192         1000         6.49           4.79
         16384         1000         7.45           4.97
         32768         1000         6.74           5.27
         65536          640         5.89           3.07
        131072          320         5.27           3.11
        262144          160         5.09           3.88
        524288           80         5.00           4.84
       1048576           40         4.95           4.91
       2097152           20         4.94           4.89
       4194304           10         4.93           4.92

PingPing Latency/Startup (usec)
Msg-size(bytes) #repetitions      native-linux     domU
             0         1000       172.78          342.89
             1         1000       176.12          346.23
             2         1000       173.48          344.20
             4         1000       177.05          346.15
             8         1000       177.54          343.56
            16         1000       178.71          346.47
            32         1000       176.71          351.25
            64         1000       183.83          359.41
           128         1000       188.09          371.94
           256         1000       204.64          393.79
           512         1000       250.63          462.45
          1024         1000       349.20          565.03
          2048         1000       521.56          773.63
          4096         1000       726.62         1036.23
          8192         1000      1204.54         1630.43
         16384         1000      2097.42         3143.95
         32768         1000      4633.77         5930.04
         65536          640     10604.54        20335.55
        131072          320     23717.61        40174.68
        262144          160     49146.14        64505.20
        524288           80     99962.09       103390.30
       1048576           40    202000.30       203478.00
       2097152           20    404857.10       408950.55
       4194304           10    812047.60       813135.50


^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: MPI benchmark performance gap between native linux anddomU
@ 2005-04-06  0:17 Santos, Jose Renato G (Jose Renato Santos)
  0 siblings, 0 replies; 24+ messages in thread
From: Santos, Jose Renato G (Jose Renato Santos) @ 2005-04-06  0:17 UTC (permalink / raw)
  To: Nivedita Singhvi, Bin Ren, Andrew Theurer
  Cc: Turner, Yoshio, Aravind Menon, Xen-devel, G John Janakiraman


  Nivedita, Bin, Andrew and all interested in Xenoprof

  We should be posting the xenoprof patches in a few days.
  We are doing some final cleanup of the code. Please be patient a little
longer.

  Thanks

  Renato 

>> -----Original Message-----
>> From: Nivedita Singhvi [mailto:niv@us.ibm.com] 
>> Sent: Tuesday, April 05, 2005 3:23 PM
>> To: Santos, Jose Renato G (Jose Renato Santos)
>> Cc: xuehai zhang; Xen-devel@lists.xensource.com; Turner, 
>> Yoshio; Aravind Menon; G John Janakiraman
>> Subject: Re: [Xen-devel] MPI benchmark performance gap 
>> between native linux anddomU
>> 
>> 
>> Santos, Jose Renato G (Jose Renato Santos) wrote:
>> 
>> >   Hi,
>> > 
>> >   We had a similar network problem in the past. We were 
>> using a TCP 
>> > benchmark instead of MPI but I believe your problem is 
>> probably the 
>> > same as the one we encountered.
>> >   It took us a while to get to the bottom of this and we only 
>> > identified the reason for this behavior after we ported 
>> oprofile to 
>> > Xen and did some performance profiling experiments.
>> 
>> Hello! Was this on the 2.6 kernel? Would you be able to
>> share the oprofile port? It would be very handy indeed
>> right now. (I was told by a few people that someone
>> was porting oprofile and I believe there was some status
>> on the list that went by) but haven't seen it yet...
>> 
>> >   Here is a brief explanation of the problem we found and 
>> the solution 
>> > that worked for us.
>> >   Xenolinux allocates a full page (4KB) to store socket buffers 
>> > instead of using just MTU bytes as in traditional linux. This is 
>> > necessary to enable page exchanges between the guest and the I/O 
>> > domains. The side effect of this is that memory space used 
>> for  socket 
>> > buffers is not very efficient. Even if packets have the 
>> maximum MTU 
>> > size (typically 1500 bytes for Ethernet) the total buffer 
>> utilization 
>> > is very low ( at most just slightly  higher than 35%). If packets 
>> > arrive faster than they are processed at the receiver 
>> side, they will 
>> > exhaust the receiver buffer
>> 
>> Most small connections (say up to 3 - 4K) involve only 3 to 5 
>> segments, and so the tcp window never really opens fully.  
>> On longer lived connections, it does help very much to have 
>> a large buffer.
>> 
>> > before the TCP advertised window is reached (By default Linux uses a
>> > TCP advertised window equal to 75% of the receive buffer size. In
>> > standard Linux, this is typically sufficient to stop packet
>> > transmission at the sender before running out of receive buffers. The
>> > same is not true in Xen due to inefficient use of socket buffers).
>> > When a packet arrives and there is no receive buffer available, TCP
>> > tries to free socket buffer space by eliminating socket buffer
>> > fragmentation (i.e. eliminating wasted buffer space). This is done at
>> > the cost of an extra copy of all receive buffer to new compacted
>> > socket buffers. This introduces overhead and reduces throughput when
>> > the CPU is the bottleneck, which seems to be your case.
>> 
>> /proc/net/netstat will show a counter of just how many times 
>> this happens (RcvPruned). Would be interesting if that was 
>> significant.
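
For anyone who wants to check that counter, here is a minimal sketch of pulling RcvPruned out of /proc/net/netstat-style text. It assumes the usual layout of that file (pairs of lines: a line of counter names, then a line of values with the same prefix). The function name and the abbreviated sample text are made up for illustration; real files carry many more counters.

```python
def parse_netstat(text: str) -> dict:
    """Parse /proc/net/netstat-style text into {prefix: {counter: value}}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    stats = {}
    # Lines come in pairs: a header line of names, then a line of values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        stats[prefix] = dict(zip(names.split(), map(int, numbers.split())))
    return stats


# Abbreviated, made-up sample; not output from any machine in this thread.
sample = """TcpExt: PruneCalled RcvPruned OfoPruned
TcpExt: 12 3 0
"""
print(parse_netstat(sample)["TcpExt"]["RcvPruned"])
```

On a live system the text would come from reading /proc/net/netstat before and after a benchmark run, then comparing the two RcvPruned values.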
>> 
>> > This problem is not very frequent because modern CPUs are fast enough
>> > to receive packets at Gigabit speeds and the receive buffer does not
>> > fill up. However the problem may arise when using slower machines
>> > and/or when the workload consumes a lot of CPU cycles, such as for
>> > example scientific MPI applications. In your case you have both
>> > factors against you.
>> 
>> 
>> > The solution to this problem is trivial. You just have to change the
>> > TCP advertised window of your guest to a lower value. In our case, we
>> > used 25% of the receive buffer size and that was sufficient to
>> > eliminate the problem. This can be done using the following command
>> 
>> >>echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale
>> 
>> How much did this improve your results by? And wouldn't
>> making the default and maximum socket buffers larger by,
>> say, 5 times be more effective (other than for those
>> applications that already use setsockopt() to set their
>> buffers to some size that is still not large enough)?
>> 
>> > (The default 2 corresponds to 75% of receive buffer, and -2 
>> > corresponds to 25%)
>> > 
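
The 75% versus 25% argument above can be checked with a little arithmetic. This is only a sketch under stated assumptions: the window rule below is how Linux kernels of that era are understood to interpret tcp_adv_win_scale (positive scale: space minus space >> scale; otherwise space >> -scale), every packet is charged a full 4096-byte page, and packets are MTU-sized. The function names and the 64 KB buffer are hypothetical.

```python
PAGE_SIZE = 4096   # bytes charged per packet when each skb occupies a full page
MTU = 1500         # typical Ethernet MTU, as mentioned in the thread


def advertised_window(space: int, adv_win_scale: int) -> int:
    """Advertised window derived from receive-buffer space (assumed kernel rule)."""
    if adv_win_scale <= 0:
        return space >> -adv_win_scale
    return space - (space >> adv_win_scale)


def buffer_needed(window: int, bytes_per_packet: int = PAGE_SIZE) -> int:
    """Buffer memory consumed if the sender fills the window with MTU-sized packets."""
    packets = -(-window // MTU)  # ceiling division
    return packets * bytes_per_packet


rcvbuf = 65536  # hypothetical receive buffer size in bytes
print("utilization per page:", round(MTU / PAGE_SIZE, 3))
for scale in (2, -2):
    window = advertised_window(rcvbuf, scale)
    needed = buffer_needed(window)
    verdict = "overflows" if needed > rcvbuf else "fits"
    print(scale, window, needed, verdict)
```

With these assumptions a 75% window needs roughly twice the buffer that exists, while a 25% window fits, which matches the explanation quoted above.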
>> > Please let me know if this improves your results. You should still see
>> > a degradation in throughput when comparing xen to traditional linux,
>> > but hopefully you should be able to see better throughputs. You should
>> > also try running your experiments in domain 0. This will give better
>> > throughput although still lower than traditional linux. I am curious
>> > to know if this has any effect in your experiments. Please, post the
>> > new results if this has any effect in your results
>> 
>> Yep, me too..
>> 
>> thanks,
>> Nivedita
>> 
>> 
>> 


* RE: MPI benchmark performance gap between native linux anddomU
@ 2005-04-05 23:59 Santos, Jose Renato G (Jose Renato Santos)
  0 siblings, 0 replies; 24+ messages in thread
From: Santos, Jose Renato G (Jose Renato Santos) @ 2005-04-05 23:59 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Turner, Yoshio, Xen-devel, Aravind Menon, xuehai zhang,
	G John Janakiraman



>> -----Original Message-----
>> From: Keir Fraser [mailto:Keir.Fraser@cl.cam.ac.uk] 
>> Sent: Tuesday, April 05, 2005 8:47 AM
>> To: Santos, Jose Renato G (Jose Renato Santos)
>> Cc: Turner, Yoshio; Aravind Menon; 
>> Xen-devel@lists.xensource.com; xuehai zhang; G John Janakiraman
>> Subject: Re: [Xen-devel] MPI benchmark performance gap 
>> between native linux anddomU
>> 
>> 
>> 
>> On 5 Apr 2005, at 16:23, Santos, Jose Renato G (Jose Renato Santos) 
>> wrote:
>> 
>> > In which version was the 'truesize' field changed to report less
>> > than a page?
>> >   We were using 2.0.3 when we found this problem.
>> >   I agree this trick will prevent the early overflow of the receive
>> > buffer.
>> >   However, I wonder whether lying to the kernel about the true size
>> > of the buffer has other side effects.
>> >   Would bad things happen if the kernel believes that it is using
>> > less memory than it really is?
>> >   For example, would it be possible for the kernel to exhaust memory
>> > for a network-intensive application with a large number of open
>> > connections?
>> 
>> I guess it would be easier to provoke trouble, but in any case the
>> default advertised window and socket buffer allocation are not affected
>> dynamically by system-wide memory pressure. Per-sockbuf limits are set
>> to a 'suitable default' at boot-time according to amount of RAM
>> detected, but after that they have to be manually reset by the user.
>> 

  The question is whether this 'suitable default' may no longer be
suitable because of the 'truesize' lying trick.
  This is probably a non-issue in most installations, but it may be a
latent error for network-intensive applications. Just keep this in the
back of your mind in case an out-of-memory problem for network
applications arises in the future.
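
To put rough numbers on that concern, here is a small sketch. It assumes each received packet really holds one 4096-byte page while only about 1600 bytes of 'truesize' are charged to the socket (the two figures mentioned in this thread). The per-socket limit and the number of sockets are hypothetical, chosen only to make the ratio visible.

```python
PAGE_SIZE = 4096          # real memory held per packet (one page)
REPORTED_TRUESIZE = 1600  # approximate bytes charged to the socket per packet


def real_memory_used(accounted_bytes: int) -> int:
    """Real memory behind a given amount of accounted socket-buffer memory."""
    packets = accounted_bytes // REPORTED_TRUESIZE
    return packets * PAGE_SIZE


rcvbuf_limit = 64000  # hypothetical per-socket receive limit, in bytes
sockets = 1000        # hypothetical number of busy connections

accounted = rcvbuf_limit * sockets
real = real_memory_used(rcvbuf_limit) * sockets
print(accounted, real, real / accounted)
```

Under these assumptions the kernel could hold about 2.56 times more memory in receive buffers than its accounting shows, which is the scale of the latent risk described above.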

>> So I don't think we are breaking any carefully-tuned
>> dynamically-balanced memory allocation algorithms here. :-)
>>
>> By setting the true size (4kB) we are far more likely to throw network
>> performance off, as the TCP stack will not have been tuned with such
>> large packet overheads in mind.
>> 
>>   -- Keir
>> 
>> 


* RE: MPI benchmark performance gap between native linux anddomU
@ 2005-04-05 15:23 Santos, Jose Renato G (Jose Renato Santos)
  2005-04-05 15:47 ` Keir Fraser
  0 siblings, 1 reply; 24+ messages in thread
From: Santos, Jose Renato G (Jose Renato Santos) @ 2005-04-05 15:23 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Turner, Yoshio, Xen-devel, Aravind Menon, xuehai zhang,
	G John Janakiraman


  Keir,

  In which version was the 'truesize' field changed to report less than
a page?
  We were using 2.0.3 when we found this problem.
  I agree this trick will prevent the early overflow of the receive
buffer.
  However, I wonder whether lying to the kernel about the true size of
the buffer has other side effects.
  Would bad things happen if the kernel believes that it is using less
memory than it really is?
  For example, would it be possible for the kernel to exhaust memory for
a network-intensive application with a large number of open connections?

  Renato


>> -----Original Message-----
>> From: Keir Fraser [mailto:Keir.Fraser@cl.cam.ac.uk] 
>> Sent: Tuesday, April 05, 2005 1:59 AM
>> To: Santos, Jose Renato G (Jose Renato Santos)
>> Cc: Aravind Menon; Turner, Yoshio; 
>> Xen-devel@lists.xensource.com; xuehai zhang; G John Janakiraman
>> Subject: Re: [Xen-devel] MPI benchmark performance gap 
>> between native linux anddomU
>> 
>> 
>> 
>> On 5 Apr 2005, at 03:07, Santos, Jose Renato G (Jose Renato Santos) 
>> wrote:
>> 
>> >  Here is a brief explanation of the problem we found and the solution
>> > that worked for us.
>> >   Xenolinux allocates a full page (4KB) to store socket buffers
>> > instead of using just MTU bytes as in traditional linux. This is
>> > necessary to enable page exchanges between the guest and the I/O
>> > domains. The side effect of this is that memory space used for socket
>> > buffers is not very efficient.
>> 
>> This is true, but these days we lie to the network stack about how big
>> the skb data area is. The 'truesize' field, which is what I think is
>> used for socket buffer accounting, will be around 1600 bytes, not 4096.
>> So I would expect the old trick of reducing the receive windows not to
>> work: but if it does then that is very interesting!
>> 
>>   -- Keir
>> 
>> 


* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-05 15:23 Santos, Jose Renato G (Jose Renato Santos)
@ 2005-04-05 15:47 ` Keir Fraser
  0 siblings, 0 replies; 24+ messages in thread
From: Keir Fraser @ 2005-04-05 15:47 UTC (permalink / raw)
  To: Santos, Jose Renato G (Jose Renato Santos)
  Cc: Turner, Yoshio, Xen-devel, Aravind Menon, xuehai zhang,
	G John Janakiraman

On 5 Apr 2005, at 16:23, Santos, Jose Renato G (Jose Renato Santos) 
wrote:

> In which version was the 'truesize' field changed to report less than
> a page?
>   We were using 2.0.3 when we found this problem.
>   I agree this trick will prevent the early overflow of the receive
> buffer.
>   However, I wonder whether lying to the kernel about the true size of
> the buffer has other side effects.
>   Would bad things happen if the kernel believes that it is using less
> memory than it really is?
>   For example, would it be possible for the kernel to exhaust memory for
> a network-intensive application with a large number of open connections?

I guess it would be easier to provoke trouble, but in any case the 
default advertised window and socket buffer allocation are not affected 
dynamically by system-wide memory pressure. Per-sockbuf limits are set 
to a 'suitable default' at boot-time according to amount of RAM 
detected, but after that they have to be manually reset by the user.

So I don't think we are breaking any carefully-tuned 
dynamically-balanced memory allocation algorithms here. :-)

By setting the true size (4kB) we are far more likely to throw network 
performance off, as the TCP stack will not have been tuned with such 
large packet overheads in mind.

  -- Keir


* RE: MPI benchmark performance gap between native linux anddomU
@ 2005-04-05  3:10 Santos, Jose Renato G (Jose Renato Santos)
  2005-04-05  5:27 ` xuehai zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Santos, Jose Renato G (Jose Renato Santos) @ 2005-04-05  3:10 UTC (permalink / raw)
  To: Santos, Jose Renato G (Jose Renato Santos), xuehai zhang,
	Xen-devel
  Cc: Turner, Yoshio, Aravind Menon, G John Janakiraman


  I guess I overlooked the rates you reported in your post.
  Looking now at your rates carefully I got somewhat confused. When you
say MB/sec do you mean Megabyte/sec or Megabit/sec? In any case these
are much lower rates than in our case (we were using a gigabit network).
Now, I am starting to think that your problem might be different than
ours, but it does not hurt to try changing the advertised window, just
in case.
  Also, the numbers you report are inconsistent. You mention that your
network is 10 MB/s, and that native linux achieves 14.9 MB/s. How is it
possible to achieve a throughput higher than the network bandwidth?
Could you please clarify?

Thanks

Renato


> -----Original Message-----
> From: Santos, Jose Renato G (Jose Renato Santos) 
> Sent: Monday, April 04, 2005 7:07 PM
> To: 'xuehai zhang'; Xen-devel@lists.xensource.com
> Cc: Aravind Menon; Turner, Yoshio; G John Janakiraman
> Subject: RE: [Xen-devel] MPI benchmark performance gap 
> between native linux anddomU
> 
> 
> 
>   Hi,
> 
>   We had a similar network problem in the past. We were using 
> a TCP benchmark instead of MPI but I believe your problem is 
> probably the same as the one we encountered.
>   It took us a while to get to the bottom of this and we only 
> identified the reason for this behavior after we ported 
> oprofile to Xen and did some performance profiling experiments.
> 
>   Here is a brief explanation of the problem we found and the 
> solution that worked for us.
>   Xenolinux allocates a full page (4KB) to store socket 
> buffers instead of using just MTU bytes as in traditional 
> linux. This is necessary to enable page exchanges between the 
> guest and the I/O domains. The side effect of this is that 
> memory space used for  socket buffers is not very efficient. 
> Even if packets have the maximum MTU size (typically 1500 
> bytes for Ethernet) the total buffer utilization is very low 
> ( at most just slightly  higher than 35%). If packets arrive 
> faster than they are processed at the receiver side, they 
> will exhaust the receiver buffer before the TCP advertised 
> window is reached (By default Linux uses a TCP advertised 
> window equal to 75% of the receive buffer size. In standard 
> Linux, this is typically sufficient to stop packet 
> transmission at the sender before running out of receive 
> buffers. The same is not true in Xen due to inefficient use 
> of socket buffers). When a packet arrives and there is no 
> receive buffer available, TCP tries to free socket buffer 
> space by eliminating socket buffer fragmentation (i.e. 
> eliminating wasted buffer space). This is done at the cost of 
> an extra copy of all receive buffer to new compacted socket 
> buffers. This introduces overhead and reduces throughput when 
> the CPU is the bottleneck, which seems to be your case.
> 
> This problem is not very frequent because modern CPUs are 
> fast enough to receive packets at Gigabit speeds and the 
> receive buffer does not fill up. However the problem may 
> arise when using slower machines and/or when the workload 
> consumes a lot of CPU cycles, such as for example scientific 
> MPI applications. In your case you have both factors against you.
> 
> The solution to this problem is trivial. You just have to 
> change the TCP advertised window of your guest to a lower 
> value. In our case, we used 25% of the receive buffer size 
> and that was sufficient  to eliminate the  problem. This can 
> be done using the following command
> 
> > echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale
> 
> (The default 2 corresponds to 75% of receive buffer, and -2 
> corresponds to 25%)
> 
> Please let me know if this improves your results. You should 
> still see a degradation in throughput when comparing xen to 
> traditional linux, but hopefully you should be able to see 
> better throughputs. You should also try running your 
> experiments in domain 0. This will give better throughput 
> although still lower than traditional linux. I am curious to 
> know if this has any effect in your experiments. Please, 
> post the new results if this has any effect in your results
> 
> Thanks
> 
> Renato
> 
> 
>  
>  
> > -----Original Message-----
> > From: xen-devel-bounces@lists.xensource.com
> > [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of 
> > xuehai zhang
> > Sent: Monday, April 04, 2005 4:19 PM
> > To: Xen-devel@lists.xensource.com
> > Subject: [Xen-devel] MPI benchmark performance gap between 
> > native linux anddomU
> > 
> > 
> > 
> > Hi all,
> > 
> > I did the following experiments to explore the MPI
> > application execution performance on both native linux 
> > machines and inside of unprivileged Xen user domains. I use 
> > 8 machines with identical HW configurations (498.756 MHz dual 
> > CPU, 512MB memory, on a 10MB/sec LAN) and I use Pallas MPI 
> > Benchmarks (PMB).
> > 
> > Experiment 1: I boot all 8 nodes with native linux (nosmp,
> > kernel 2.4.29) and use all of them for PMB tests.
> > 
> > Experiment 2: I boot all 8 nodes with Xen running and start a
> > single user domain (port 2.6.10,using file-backed VBD) on 
> > each node with 360MB memory. Then I run the same PMB tests 
> > among these 8 user domains.
> > 
> > The experiment results show that running the same MPI benchmark in
> > user domains usually results in worse (sometimes very bad)
> > performance compared with native linux machines. The
> > following are the results for PMB SendRecv benchmark for both
> > experiments (table1 and table2 report throughput and latency
> > respectively). As you may notice, SendRecv can achieve a
> > 14.9MB/sec throughput on native linux machines but can get a
> > maximum 7.07 MB/sec throughput if running inside of user
> > domains. The latency results also show a big gap.
> > 
> > Clearly, there is difference between the memory used in the
> > native linux machine of Experiment 1 (512MB) and in the user 
> > domain (360MB, can not go higher because dom0 started with 
> > 128MB memory) of Experiment 2. However, I don't think it is 
> > the main cause of the performance gap because the tested 
> > message sizes are much smaller than both memory sizes.
> > 
> > I would appreciate your help if you have had a similar experience
> > and want to share your insights.
> > 
> > BTW, if you are not familiar with PMB SendRecv benchmark, you
> > can find a detailed explanation at
> > http://people.cs.uchicago.edu/~hai/PMB-MPI1.pdf (see section 4.3.1).
> > 
> > Thanks in advance for your help.
> > 
> > Xuehai
> > 
> > 
> > P.S. Table 1: SendRecv throughput (MB/sec) performance
> > 
> > Message_Size(bytes)    Experiment_1    Experiment_2
> > 0                      0               0
> > 1                      0               0
> > 2                      0               0
> > 4                      0               0
> > 8                      0.04            0.01
> > 16                     0.16            0.01
> > 32                     0.34            0.02
> > 64                     0.65            0.04
> > 128                    1.17            0.09
> > 256                    2.15            0.59
> > 512                    3.4             1.23
> > 1K                     5.29            2.57
> > 2K                     7.68            3.5
> > 4K                     10.7            4.96
> > 8K                     13.35           7.07
> > 16K                    14.9            3.77
> > 32K                    9.85            3.68
> > 64K                    5.06            3.02
> > 128K                   7.91            4.94
> > 256K                   7.85            5.25
> > 512K                   7.93            6.11
> > 1M                     7.85            6.5
> > 2M                     8.18            5.44
> > 4M                     7.55            4.93
> > 
> > Table 2: SendRecv latency (millisec) performance
> > 
> > Message_Size(bytes)    Experiment_1    Experiment_2
> > 0                   1979.6        3010.96
> > 1                   1724.16       3218.88
> > 2                   1669.65       3185.3
> > 4                   1637.26       3055.67
> > 8                   406.77        2966.17
> > 16                  185.76        2777.89
> > 32                  181.06        2791.06
> > 64                  189.12        2940.82
> > 128                 210.51        2716.3
> > 256                 227.36        843.94
> > 512                 287.28        796.71
> > 1K                  368.72        758.19
> > 2K                  508.65        1144.24
> > 4K                  730.59        1612.66
> > 8K                  1170.22       2471.65
> > 16K                 2096.86       8300.18
> > 32K                 6340.45       17017.99
> > 64K                 24640.78      41264.5
> > 128K                31709.09      50608.97
> > 256K                63680.67      94918.13
> > 512K                125531.7      162168.47
> > 1M                  251566.94     321451.02
> > 2M                  477431.32     707981
> > 4M                  997768.35     1503987.61
> > 
> > 
> > 
> > 
> 


* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-05  3:10 Santos, Jose Renato G (Jose Renato Santos)
@ 2005-04-05  5:27 ` xuehai zhang
  0 siblings, 0 replies; 24+ messages in thread
From: xuehai zhang @ 2005-04-05  5:27 UTC (permalink / raw)
  To: Santos, Jose Renato G (Jose Renato Santos)
  Cc: Turner, Yoshio, Aravind Menon, Xen-devel, G John Janakiraman

Jose,

Thank you so much for your valuable information.

>   I guess I overlooked the rates you reported in your post.
>   Looking now at your rates carefully I got somewhat confused. When you
> say MB/sec do you mean Megabyte/sec or Megabit/sec? 

It is Megabyte/sec (2^20 bytes)

> In any case these
> are much lower rates than in our case (we were using a gigabit network).
> Now, I am starting to think that your problem might be different than
> ours, but it does not hurt to try changing the advertised window, just
> in case.

I will try your suggestion and send out an update tomorrow morning.

>   Also, the numbers you report are inconsistent. You mention that your
> network is 10 MB/s, and that native linux achieves 14.9 MB/s. How is it
> possible to achieve a throughput higher than the network bandwidth?
> Could you please clarify?

Yes, it is a little confusing. It is due to the calculation of SendRecv's throughput. If you take a 
look at the PMB user manual (following the link in my previous email), the throughput is defined as 
2X/1.048576/time (X is the message size in bytes and time is the latency). So, it is a weighted 
throughput and could go beyond 10MB/s, which is the max bandwidth of the network.
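
As a quick check of that definition, here is a minimal sketch. It takes the constant as 1.048576 (2^20 / 10^6) and assumes the time is in microseconds. That unit is an inference rather than something stated in the thread: the Experiment_1 rows of the two tables only reproduce each other when the latency column is read as microseconds. The function name is made up.

```python
def sendrecv_throughput_mb_s(message_bytes: int, time_usec: float) -> float:
    """Weighted SendRecv throughput: 2X / 1.048576 / time.

    The factor 2 counts the bytes sent plus the bytes received, which is
    why the result can exceed the one-way bandwidth of the link.
    """
    return 2 * message_bytes / 1.048576 / time_usec


# Experiment_1 rows taken from the tables in the original post.
print(round(sendrecv_throughput_mb_s(16 * 1024, 2096.86), 2))  # table: 14.9
print(round(sendrecv_throughput_mb_s(8 * 1024, 1170.22), 2))   # table: 13.35
```

The Experiment_2 rows do not reproduce as exactly, so the two tables may not come from the same timing statistic in that case.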

Thanks.

Xuehai


* RE: MPI benchmark performance gap between native linux anddomU
@ 2005-04-04 23:30 Ian Pratt
  2005-04-04 23:43 ` xuehai zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Ian Pratt @ 2005-04-04 23:30 UTC (permalink / raw)
  To: xuehai zhang, Xen-devel

> I did the following experiments to explore the MPI 
> application execution performance on both native linux 
> machines and inside of unprivileged Xen user domains. I use 
> 8 machines with identical HW configurations (498.756 MHz dual 
> CPU, 512MB memory, on a 10MB/sec LAN) and I use Pallas MPI 
> Benchmarks (PMB).

> The experiment results show that running the same MPI benchmark in
> user domains usually results in worse (sometimes very bad)
> performance compared with native linux machines. The
> following are the results for PMB SendRecv benchmark for both
> experiments (table1 and table2 report throughput and latency
> respectively). As you may notice, SendRecv can achieve a
> 14.9MB/sec throughput on native linux machines but can get a
> maximum 7.07 MB/sec throughput if running inside of user
> domains. The latency results also show a big gap.

> I would appreciate your help if you have had a similar experience
> and want to share your insights.

Xen (or any kind of virtualization) is not particularly well suited to
MPI type applications, at least unless you're using InfiniBand or some
other smart NIC that avoids having to use dom0 to do the IO
virtualization.

However, the results you are seeing are lower than I'd expect.

Are you running dom0 and the domU on the same CPU or different CPUs? How
does changing this affect the results?

Also, are you sure the MTU is the same in all cases?

Further, please can you repeat the experiments with just a dom0 running
on each node.

Thanks,
Ian


* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-04 23:30 Ian Pratt
@ 2005-04-04 23:43 ` xuehai zhang
  2005-04-05 22:29   ` xuehai zhang
  0 siblings, 1 reply; 24+ messages in thread
From: xuehai zhang @ 2005-04-04 23:43 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Xen-devel

Ian,

Thanks for the quick response! Explanations for your comments are inline below.

>>I did the following experiments to explore the MPI 
>>application execution performance on both native linux 
>>machines and inside of unprivileged Xen user domains. I use 
>>8 machines with identical HW configurations (498.756 MHz dual 
>>CPU, 512MB memory, on a 10MB/sec LAN) and I use Pallas MPI 
>>Benchmarks (PMB).
> 
> 
>>The experiment results show that running the same MPI benchmark in
>>user domains usually results in worse (sometimes very bad)
>>performance compared with native linux machines. The
>>following are the results for PMB SendRecv benchmark for both
>>experiments (table1 and table2 report throughput and latency
>>respectively). As you may notice, SendRecv can achieve a
>>14.9MB/sec throughput on native linux machines but can get a
>>maximum 7.07 MB/sec throughput if running inside of user
>>domains. The latency results also show a big gap.
> 
> 
>>I would appreciate your help if you have had a similar experience
>>and want to share your insights.
> 
> 
> Xen (or any kind of virtualization) is not particularly well suited to
> MPI type applications, at least unless you're using InfiniBand or some
> other smart NIC that avoids having to use dom0 to do the IO
> virtualization.
> 
> However, the results you are seeing are lower than I'd expect.
> 
> Are you running dom0 and the domU on the same CPU or different CPUs? How
> does changing this affect the results?

I did not specify "cpu" option in Xen's configuration file, so I think both dom0 and domU run on the 
  same CPU (1st CPU). I will try to assign them to different CPUs later.

> Also, are you sure the MTU is the same in all cases?

The outputs of "ifconfig" show MTU is 1500 in all cases.

> Further, please can you repeat the experiments with just a dom0 running
> on each node.

I will do it and update you later.

Thanks again for the help.

Xuehai


* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-04 23:43 ` xuehai zhang
@ 2005-04-05 22:29   ` xuehai zhang
  2005-04-05 22:34     ` Mark Williamson
  0 siblings, 1 reply; 24+ messages in thread
From: xuehai zhang @ 2005-04-05 22:29 UTC (permalink / raw)
  To: xuehai zhang; +Cc: Ian Pratt, Xen-devel

xuehai zhang wrote:
> Ian,
> 
> Thanks for the quick response! Explanations for your comments are inline
> below.
> 
>>> I did the following experiments to explore the MPI application 
>>> execution performance on both native linux machines and inside of 
>>> unprivileged Xen user domains. I use 8 machines with identical HW 
>>> configurations (498.756 MHz dual CPU, 512MB memory, on a 10MB/sec 
>>> LAN) and I use Pallas MPI Benchmarks (PMB).
>>
>>
>>
>>> The experiment results show that running the same MPI benchmark in user
>>> domains usually results in worse (sometimes very bad) performance
>>> compared with native linux machines. The following are the
>>> results for PMB SendRecv benchmark for both experiments (table1 and
>>> table2 report throughput and latency respectively). As you may
>>> notice, SendRecv can achieve a 14.9MB/sec throughput on native linux
>>> machines but can get a maximum 7.07 MB/sec throughput if running
>>> inside of user domains. The latency results also show a big gap.
>>
>>
>>
>>> I would appreciate your help if you have had a similar experience and
>>> want to share your insights.
>>
>>
>>
>> Xen (or any kind of virtualization) is not particularly well suited to
>> MPI type applications, at least unless you're using InfiniBand or some
>> other smart NIC that avoids having to use dom0 to do the IO
>> virtualization.
>>
>> However, the results you are seeing are lower than I'd expect.
>>
>> Are you running dom0 and the domU on the same CPU or different CPUs? How
>> does changing this affect the results?
> 
> 
> I did not specify "cpu" option in Xen's configuration file, so I think 
> both dom0 and domU run on the  same CPU (1st CPU). I will try to assign 
> them to different CPUs later.

I think I said something wrong here. If I do not specify "cpu" option in Xen config file, I observe
Xen usually assigns the 2nd CPU to domU while running dom0 on the 1st CPU.

>> Also, are you sure the MTU is the same in all cases?
> 
> 
> The outputs of "ifconfig" show MTU is 1500 in all cases.
> 
>> Further, please can you repeat the experiments with just a dom0 running
>> on each node.
> 
> 
> I will do it and update you later.
> 
> Thanks again for the help.
> 
> Xuehai
> 
> 


* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-05 22:29   ` xuehai zhang
@ 2005-04-05 22:34     ` Mark Williamson
  2005-04-05 22:39       ` xuehai zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Mark Williamson @ 2005-04-05 22:34 UTC (permalink / raw)
  To: xen-devel; +Cc: Ian Pratt, xuehai zhang

On Tuesday 05 April 2005 23:29, xuehai zhang wrote:
> > I did not specify "cpu" option in Xen's configuration file, so I think
> > both dom0 and domU run on the  same CPU (1st CPU). I will try to assign
> > them to different CPUs later.
>
> I think I said something wrong here. If I do not specify "cpu" option in
> Xen config file, I observe Xen usually assigns the 2nd CPU to domU while
> running dom0 on the 1st CPU.

Last time I looked, the default was to assign in a round robin fashion.  i.e. 
the next domain you create will be on the 1st CPU (with dom0) unless you 
explicitly specify otherwise - watch out this doesn't confuse your testing.

Cheers,
Mark


* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-05 22:34     ` Mark Williamson
@ 2005-04-05 22:39       ` xuehai zhang
  2005-04-05 22:43         ` Mark Williamson
  0 siblings, 1 reply; 24+ messages in thread
From: xuehai zhang @ 2005-04-05 22:39 UTC (permalink / raw)
  To: Mark Williamson; +Cc: Ian Pratt, xen-devel

Mark,

Thanks for the advice. I will explicitly specify "cpu" option in domU's config file.
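
A minimal sketch of what that setting could look like. It assumes, as I understand Xen of this generation, that domU configuration files are Python scripts and that the "cpu" option discussed in this thread takes a physical CPU index counted from 0. The value is hypothetical and chosen for a two-CPU machine like the ones described here.

```python
# Fragment of a domU configuration file (hypothetical value).
# Assumption: "cpu" selects the physical CPU by index, counting from 0,
# so 1 would pin the domain to the second CPU and leave the first to dom0.
cpu = 1
```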

So far, I think my experiments are not affected by this. In my experiments, I only run at most 1 
domU besides dom0. I think the domU will be assigned to the 2nd CPU even if the assignment policy is 
round robin.

Xuehai

Mark Williamson wrote:
> On Tuesday 05 April 2005 23:29, xuehai zhang wrote:
> 
>>>I did not specify "cpu" option in Xen's configuration file, so I think
>>>both dom0 and domU run on the  same CPU (1st CPU). I will try to assign
>>>them to different CPUs later.
>>
>>I think I said something wrong here. If I do not specify "cpu" option in
>>Xen config file, I observe Xen usually assigns the 2nd CPU to domU while
>>running dom0 on the 1st CPU.
> 
> 
> Last time I looked, the default was to assign in a round robin fashion.  i.e. 
> the next domain you create will be on the 1st CPU (with dom0) unless you 
> explicitly specify otherwise - watch out this doesn't confuse your testing.
> 
> Cheers,
> Mark
> 


* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-05 22:39       ` xuehai zhang
@ 2005-04-05 22:43         ` Mark Williamson
  2005-04-06  4:25           ` xuehai zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Mark Williamson @ 2005-04-05 22:43 UTC (permalink / raw)
  To: xuehai zhang; +Cc: Ian Pratt, xen-devel, Mark Williamson

> Thanks for the advice. I will explicitly specify "cpu" option in domU's
> config file.

Probably good practice although, as you rightly point out, it doesn't matter 
in this case.

> So far, I think my experiments are not affected by this. In my experiments,
> I only run at most 1 domU besides dom0. I think the domU will be assigned
> to the 2nd CPU even if the assignment policy is round robin.

Ah yes, the code's been updated to choose the least loaded CPU.  Should be 
fine then.

Cheers,
Mark

> Xuehai
>
> Mark Williamson wrote:
> > On Tuesday 05 April 2005 23:29, xuehai zhang wrote:
> >>>I did not specify "cpu" option in Xen's configuration file, so I think
> >>>both dom0 and domU run on the  same CPU (1st CPU). I will try to assign
> >>>them to different CPUs later.
> >>
> >>I think I said something wrong here. If I do not specify "cpu" option in
> >>Xen config file, I observe Xen usually assigns the 2nd CPU to domU while
> >>running dom0 on the 1st CPU.
> >
> > Last time I looked, the default was to assign in a round robin fashion. 
> > i.e. the next domain you create will be on the 1st CPU (with dom0) unless
> > you explicitly specify otherwise - watch out this doesn't confuse your
> > testing.
> >
> > Cheers,
> > Mark


* Re: MPI benchmark performance gap between native linux anddomU
  2005-04-05 22:43         ` Mark Williamson
@ 2005-04-06  4:25           ` xuehai zhang
  0 siblings, 0 replies; 24+ messages in thread
From: xuehai zhang @ 2005-04-06  4:25 UTC (permalink / raw)
  To: Mark Williamson; +Cc: xen-devel, Mark Williamson

Mark Williamson wrote:
>>Thanks for the advice. I will explicitly specify "cpu" option in domU's
>>config file.
> 
> 
> Probably good practice although, as you rightly point out, it doesn't matter 
> in this case.
> 
> 
>>So far, I think my experiments are not affected by this. In my experiments,
>>I only run at most 1 domU besides dom0. I think the domU will be assigned
>>to the 2nd CPU even if the assignment policy is round robin.
> 
> 
> Ah yes, the code's been updated to choose the least loaded CPU.  Should be 
> fine then.

Nice to know it. Thanks for the input.
Xuehai


Thread overview: 24+ messages
2005-04-05  2:07 MPI benchmark performance gap between native linux anddomU Santos, Jose Renato G (Jose Renato Santos)
2005-04-05  8:59 ` Keir Fraser
2005-04-05 22:22 ` Nivedita Singhvi
2005-04-05 22:23 ` xuehai zhang
2005-04-05 22:34 ` xuehai zhang
2005-04-05 22:53   ` Nivedita Singhvi
2005-04-05 22:58     ` Nivedita Singhvi
  -- strict thread matches above, loose matches on Subject: below --
2005-04-06  7:23 Ian Pratt
2005-04-06  7:08 Ian Pratt
2005-04-06  0:37 Santos, Jose Renato G (Jose Renato Santos)
2005-04-06  4:24 ` xuehai zhang
2005-04-06  0:17 Santos, Jose Renato G (Jose Renato Santos)
2005-04-05 23:59 Santos, Jose Renato G (Jose Renato Santos)
2005-04-05 15:23 Santos, Jose Renato G (Jose Renato Santos)
2005-04-05 15:47 ` Keir Fraser
2005-04-05  3:10 Santos, Jose Renato G (Jose Renato Santos)
2005-04-05  5:27 ` xuehai zhang
2005-04-04 23:30 Ian Pratt
2005-04-04 23:43 ` xuehai zhang
2005-04-05 22:29   ` xuehai zhang
2005-04-05 22:34     ` Mark Williamson
2005-04-05 22:39       ` xuehai zhang
2005-04-05 22:43         ` Mark Williamson
2005-04-06  4:25           ` xuehai zhang
