From mboxrd@z Thu Jan  1 00:00:00 1970
From: xuehai zhang <hai@cs.uchicago.edu>
Subject: Re: MPI benchmark performance gap between native linux
	and domU
Date: Tue, 05 Apr 2005 17:34:06 -0500
Message-ID: <4253125E.4060209@cs.uchicago.edu>
References: <6C21311CEE34E049B74CC0EF339464B902FB1D@cacexc12.americas.cpqcorp.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <6C21311CEE34E049B74CC0EF339464B902FB1D@cacexc12.americas.cpqcorp.net>
List-Unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: "Santos, Jose Renato G (Jose Renato Santos)" <joserenato.santos@hp.com>
Cc: Xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

Santos, Jose Renato G (Jose Renato Santos) wrote:
>   Hi,
> 
>   We had a similar network problem in the past. We were using a TCP
> benchmark instead of MPI but I believe your problem is probably the same
> as the one we encountered.
>   It took us a while to get to the bottom of this and we only identified
> the reason for this behavior after we ported oprofile to Xen and did
> some performance profiling experiments.
> 
>   Here is a brief explanation of the problem we found and the solution
> that worked for us.
>   Xenolinux allocates a full page (4KB) to store socket buffers instead
> of using just MTU bytes as in traditional linux. This is necessary to
> enable page exchanges between the guest and the I/O domains. The side
> effect of this is that memory space used for socket buffers is not used
> efficiently. Even if packets have the maximum MTU size (typically 1500
> bytes for Ethernet) the total buffer utilization is very low (at most
> just slightly higher than 35%). If packets arrive faster than they are
> processed at the receiver side, they will exhaust the receiver buffer
> before the TCP advertised window is reached (By default Linux uses a TCP
> advertised window equal to 75% of the receive buffer size. In standard
> Linux, this is typically sufficient to stop packet transmission at the
> sender before running out of receive buffers. The same is not true in
> Xen due to inefficient use of socket buffers). When a packet arrives and
> there is no receive buffer available, TCP tries to free socket buffer
> space by eliminating socket buffer fragmentation (i.e. eliminating
> wasted buffer space). This is done at the cost of an extra copy of all
> receive buffers into new, compacted socket buffers. This introduces overhead
> and reduces throughput when the CPU is the bottleneck, which seems to be
> your case.
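The buffer arithmetic above can be checked with a rough sketch. The 4096-byte page and 1500-byte MTU come from the explanation; the 64 KB receive buffer, the function names, and the simplification that each packet is charged exactly one page are illustrative assumptions, not the kernel's real accounting.

```python
PAGE_SIZE = 4096  # bytes allocated per socket buffer in XenoLinux (from the text)
MTU = 1500        # typical Ethernet MTU in bytes (from the text)


def buffer_utilization(payload: int, alloc: int = PAGE_SIZE) -> float:
    """Fraction of an allocated buffer that actually holds packet data."""
    return payload / alloc


def packets_until_full(rcvbuf: int, alloc: int = PAGE_SIZE) -> int:
    """Packets that fit when each one is charged a whole allocation (simplified)."""
    return rcvbuf // alloc


def data_held_when_full(rcvbuf: int, payload: int = MTU,
                        alloc: int = PAGE_SIZE) -> int:
    """Payload bytes queued at the moment the receive buffer is exhausted."""
    return packets_until_full(rcvbuf, alloc) * payload


rcvbuf = 64 * 1024            # hypothetical receive buffer size
window_75 = rcvbuf * 3 // 4   # advertised window at 75% of the buffer
window_25 = rcvbuf // 4       # advertised window at 25% of the buffer

print(f"utilization per packet: {buffer_utilization(MTU):.1%}")
print(f"payload queued when buffer is full: {data_held_when_full(rcvbuf)} bytes")
print(f"75% window: {window_75} bytes, 25% window: {window_25} bytes")
```

Under these assumptions the buffer fills after fewer payload bytes than the 75% window allows, so the sender is never told to stop in time, whereas a 25% window closes before the buffer is exhausted.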
> 
> This problem is not very frequent because modern CPUs are fast enough to
> receive packets at Gigabit speeds and the receive buffer does not fill
> up. However, the problem may arise when using slower machines and/or when
> the workload consumes a lot of CPU cycles, such as scientific MPI
> applications. In your case, you have both factors
> against you.
> 
> The solution to this problem is trivial. You just have to change the TCP
> advertised window of your guest to a lower value. In our case, we used
> 25% of the receive buffer size and that was sufficient to eliminate the
> problem. This can be done using the following command:
> 
> 
>>echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale

In my experiments, I noticed that the above change does not persist across reboots (every reboot 
resets the value to 2, the default on Debian Sarge 3.1). Is there a way to make the change 
permanent?
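
One approach that may work, assuming the guest applies /etc/sysctl.conf at boot (I have not confirmed this on this setup), is to record the setting there under the key net.ipv4.tcp_adv_win_scale. The helper below is only a sketch: the name persist_sysctl and its behaviour are hypothetical, and it takes the file path as a parameter so it can be tried on a scratch file first.

```python
from pathlib import Path


def persist_sysctl(key: str, value: str, conf_path: str) -> bool:
    """Append 'key = value' to a sysctl.conf-style file.

    Returns False without writing if the file already contains exactly this
    setting, True otherwise. An existing line with a different value is left
    in place; the appended line comes later in the file.
    """
    path = Path(conf_path)
    lines = path.read_text().splitlines() if path.exists() else []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue  # skip comments and lines that are not assignments
        k, v = (part.strip() for part in stripped.split("=", 1))
        if k == key and v == value:
            return False  # already recorded, nothing to do
    lines.append(f"{key} = {value}")
    path.write_text("\n".join(lines) + "\n")
    return True


# Intended use (needs root, so it is not executed here):
# persist_sysctl("net.ipv4.tcp_adv_win_scale", "-2", "/etc/sysctl.conf")
```

Running it twice on the same file should add the line only once.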

Thanks.

Xuehai

> (The default 2 corresponds to 75% of receive buffer, and -2 corresponds
> to 25%)
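
To see how the two values map to 75% and 25%, here is a small sketch of the rule as I understand it from Linux kernels of this era: a positive scale subtracts space/2^scale from the buffer space, and a zero or negative scale divides the space by 2^(-scale). The function name is borrowed from the kernel helper but this is a Python approximation, and the 64 KB buffer is a made-up example value.

```python
def tcp_win_from_space(space: int, adv_win_scale: int) -> int:
    """Advertised window derived from receive buffer space (approximation)."""
    if adv_win_scale <= 0:
        # Zero or negative scale: window = space / 2^(-scale)
        return space >> (-adv_win_scale)
    # Positive scale: window = space - space / 2^scale
    return space - (space >> adv_win_scale)


rcvbuf = 64 * 1024  # hypothetical receive buffer size in bytes
for scale in (2, -2):
    window = tcp_win_from_space(rcvbuf, scale)
    print(f"tcp_adv_win_scale={scale}: window={window} bytes "
          f"({window / rcvbuf:.0%} of buffer)")
```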
> 
> Please let me know if this improves your results. You should still see a
> degradation in throughput when comparing xen to traditional linux, but
> hopefully you should be able to see better throughput. You should also
> try running your experiments in domain 0. This will give better
> throughput although still lower than traditional linux.
> I am curious to know if this has any effect in your experiments.
> Please post the new results if it does.
> 
> Thanks
> 
> Renato
> 
> 
>  
>  
> 
>>-----Original Message-----
>>From: xen-devel-bounces@lists.xensource.com 
>>[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of 
>>xuehai zhang
>>Sent: Monday, April 04, 2005 4:19 PM
>>To: Xen-devel@lists.xensource.com
>>Subject: [Xen-devel] MPI benchmark performance gap between 
>>native linux and domU
>>
>>
>>
>>Hi all,
>>
>>I did the following experiments to explore the MPI 
>>application execution performance on both native linux 
>>machines and inside of unprivileged Xen user domains. I use 
>>8 machines with identical HW configurations (498.756 MHz dual 
>>CPU, 512MB memory, on a 10MB/sec LAN) and I use Pallas MPI 
>>Benchmarks (PMB).
>>
>>Experiment 1: I boot all 8 nodes with native linux (nosmp, 
>>kernel 2.4.29) and use all of them for PMB tests.
>>
>>Experiment 2: I boot all 8 nodes with Xen running and start a 
>>single user domain (port 2.6.10,using file-backed VBD) on 
>>each node with 360MB memory. Then I run the same PMB tests 
>>among these 8 user domains.
>>
>>The experiment results show that running the same MPI benchmark in 
>>user domains usually results in worse (sometimes much worse) 
>>performance compared with native linux machines. The 
>>following are the results for PMB SendRecv benchmark for both 
>>experiments (table1 and table2 report throughput and latency 
>>respectively). As you may notice, SendRecv can achieve a 
>>14.9MB/sec throughput on native linux machines but only a 
>>maximum of 7.07 MB/sec throughput when running inside of user 
>>domains. The latency results also show a big gap.
>>
>>Clearly, there is a difference between the memory used in the 
>>native linux machine of Experiment 1 (512MB) and in the user 
>>domain (360MB, cannot go higher because dom0 started with 
>>128MB memory) of Experiment 2. However, I don't think it is 
>>the main cause of the performance gap because the tested 
>>message sizes are much smaller than both memory sizes.
>>
>>I would appreciate your help if you have had a similar experience 
>>and want to share your insights.
>>
>>BTW, if you are not familiar with the PMB SendRecv benchmark, you 
>>can find a detailed explanation at 
>>http://people.cs.uchicago.edu/~hai/PMB-MPI1.pdf (see section 4.3.1).
>>
>>Thanks in advance for your help.
>>
>>Xuehai
>>
>>
>>P.S. Table 1: SendRecv throughput (MB/sec) performance
>>
>>Message_Size(bytes)    Experiment_1    Experiment_2
>>0                      0               0
>>1                      0               0
>>2                      0               0
>>4                      0               0
>>8                      0.04            0.01
>>16                     0.16            0.01
>>32                     0.34            0.02
>>64                     0.65            0.04
>>128                    1.17            0.09
>>256                    2.15            0.59
>>512                    3.4             1.23
>>1K                     5.29            2.57
>>2K                     7.68            3.5
>>4K                     10.7            4.96
>>8K                     13.35           7.07
>>16K                    14.9            3.77
>>32K                    9.85            3.68
>>64K                    5.06            3.02
>>128K                   7.91            4.94
>>256K                   7.85            5.25
>>512K                   7.93            6.11
>>1M                     7.85            6.5
>>2M                     8.18            5.44
>>4M                     7.55            4.93
>>
>>Table 2: SendRecv latency (microsec) performance
>>
>>Message_Size(bytes)    Experiment_1    Experiment_2
>>0                   1979.6        3010.96
>>1                   1724.16       3218.88
>>2                   1669.65       3185.3
>>4                   1637.26       3055.67
>>8                   406.77        2966.17
>>16                  185.76        2777.89
>>32                  181.06        2791.06
>>64                  189.12        2940.82
>>128                 210.51        2716.3
>>256                 227.36        843.94
>>512                 287.28        796.71
>>1K                  368.72        758.19
>>2K                  508.65        1144.24
>>4K                  730.59        1612.66
>>8K                  1170.22       2471.65
>>16K                 2096.86       8300.18
>>32K                 6340.45       17017.99
>>64K                 24640.78      41264.5
>>128K                31709.09      50608.97
>>256K                63680.67      94918.13
>>512K                125531.7      162168.47
>>1M                  251566.94     321451.02
>>2M                  477431.32     707981
>>4M                  997768.35     1503987.61
>>