From: xuehai zhang
Subject: Re: MPI benchmark performance gap between native linux anddomU
Date: Tue, 05 Apr 2005 17:34:06 -0500
Message-ID: <4253125E.4060209@cs.uchicago.edu>
References: <6C21311CEE34E049B74CC0EF339464B902FB1D@cacexc12.americas.cpqcorp.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
In-Reply-To: <6C21311CEE34E049B74CC0EF339464B902FB1D@cacexc12.americas.cpqcorp.net>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: "Santos, Jose Renato G (Jose Renato Santos)"
Cc: Xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

Santos, Jose Renato G (Jose Renato Santos) wrote:
> Hi,
>
> We had a similar network problem in the past. We were using a TCP
> benchmark instead of MPI but I believe your problem is probably the same
> as the one we encountered.
> It took us a while to get to the bottom of this and we only identified
> the reason for this behavior after we ported oprofile to Xen and did
> some performance profiling experiments.
>
> Here is a brief explanation of the problem we found and the solution
> that worked for us.
> Xenolinux allocates a full page (4KB) to store socket buffers instead
> of using just MTU bytes as in traditional linux. This is necessary to
> enable page exchanges between the guest and the I/O domains. The side
> effect of this is that memory space used for socket buffers is not very
> efficient. Even if packets have the maximum MTU size (typically 1500
> bytes for Ethernet) the total buffer utilization is very low (at most
> just slightly higher than 35%).
> If packets arrive faster than they are
> processed at the receiver side, they will exhaust the receiver buffer
> before the TCP advertised window is reached (By default Linux uses a TCP
> advertised window equal to 75% of the receive buffer size. In standard
> Linux, this is typically sufficient to stop packet transmission at the
> sender before running out of receive buffers. The same is not true in
> Xen due to inefficient use of socket buffers). When a packet arrives and
> there is no receive buffer available, TCP tries to free socket buffer
> space by eliminating socket buffer fragmentation (i.e. eliminating
> wasted buffer space). This is done at the cost of an extra copy of all
> receive buffers to new compacted socket buffers. This introduces overhead
> and reduces throughput when the CPU is the bottleneck, which seems to be
> your case.
>
> This problem is not very frequent because modern CPUs are fast enough to
> receive packets at Gigabit speeds and the receive buffer does not fill
> up. However the problem may arise when using slower machines and/or when
> the workload consumes a lot of CPU cycles, such as for example
> scientific MPI applications. In your case you have both factors
> against you.
>
> The solution to this problem is trivial. You just have to change the TCP
> advertised window of your guest to a lower value. In our case, we used
> 25% of the receive buffer size and that was sufficient to eliminate the
> problem. This can be done using the following command
>
> >>echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale

In my experiments, I notice that the above change doesn't persist across
reboots (every reboot changes the value back to 2, the default value
for Debian Sarge 3.1). Is there a way to make the change permanent?

Thanks.

Xuehai

> (The default 2 corresponds to 75% of receive buffer, and -2 corresponds
> to 25%)
>
> Please let me know if this improves your results.
> You should still see a
> degradation in throughput when comparing xen to traditional linux, but
> hopefully you should be able to see better throughputs. You should also
> try running your experiments in domain 0. This will give better
> throughput although still lower than traditional linux.
> I am curious to know if this has any effect on your experiments.
> Please post the new results if it does.
>
> Thanks
>
> Renato
>
>
>>-----Original Message-----
>>From: xen-devel-bounces@lists.xensource.com
>>[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of
>>xuehai zhang
>>Sent: Monday, April 04, 2005 4:19 PM
>>To: Xen-devel@lists.xensource.com
>>Subject: [Xen-devel] MPI benchmark performance gap between
>>native linux anddomU
>>
>>
>>Hi all,
>>
>>I did the following experiments to explore the MPI
>>application execution performance on both native linux
>>machines and inside of unprivileged Xen user domains. I use
>>8 machines with identical HW configurations (498.756 MHz dual
>>CPU, 512MB memory, on a 10MB/sec LAN) and I use Pallas MPI
>>Benchmarks (PMB).
>>
>>Experiment 1: I boot all 8 nodes with native linux (nosmp,
>>kernel 2.4.29) and use all of them for PMB tests.
>>
>>Experiment 2: I boot all 8 nodes with Xen running and start a
>>single user domain (port 2.6.10, using file-backed VBD) on
>>each node with 360MB memory. Then I run the same PMB tests
>>among these 8 user domains.
>>
>>The experiment results show that running the same MPI benchmark in
>>user domains usually results in worse (sometimes very bad)
>>performance compared with native linux machines. The
>>following are the results for PMB SendRecv benchmark for both
>>experiments (table1 and table2 report throughput and latency
>>respectively). As you may notice, SendRecv can achieve a
>>14.9MB/sec throughput on native linux machines but can get a
>>maximum 7.07 MB/sec throughput if running inside of user
>>domains. The latency results also have a big gap.
>>
>>Clearly, there is a difference between the memory used in the
>>native linux machine of Experiment 1 (512MB) and in the user
>>domain (360MB, cannot go higher because dom0 started with
>>128MB memory) of Experiment 2. However, I don't think it is
>>the main cause of the performance gap because the tested
>>message sizes are much smaller than both memory sizes.
>>
>>I will appreciate your help if you have had a similar experience
>>and want to share your insights.
>>
>>BTW, if you are not familiar with PMB SendRecv benchmark, you
>>can find a detailed explanation at
>>http://people.cs.uchicago.edu/~hai/PMB-MPI1.pdf (see section 4.3.1).
>>
>>Thanks in advance for your help.
>>
>>Xuehai
>>
>>
>>P.S. Table 1: SendRecv throughput (MB/sec) performance
>>
>>Message_Size(bytes)  Experiment_1  Experiment_2
>>0                    0             0
>>1                    0             0
>>2                    0             0
>>4                    0             0
>>8                    0.04          0.01
>>16                   0.16          0.01
>>32                   0.34          0.02
>>64                   0.65          0.04
>>128                  1.17          0.09
>>256                  2.15          0.59
>>512                  3.4           1.23
>>1K                   5.29          2.57
>>2K                   7.68          3.5
>>4K                   10.7          4.96
>>8K                   13.35         7.07
>>16K                  14.9          3.77
>>32K                  9.85          3.68
>>64K                  5.06          3.02
>>128K                 7.91          4.94
>>256K                 7.85          5.25
>>512K                 7.93          6.11
>>1M                   7.85          6.5
>>2M                   8.18          5.44
>>4M                   7.55          4.93
>>
>>Table 2: SendRecv latency (millisec) performance
>>
>>Message_Size(bytes)  Experiment_1  Experiment_2
>>0                    1979.6        3010.96
>>1                    1724.16       3218.88
>>2                    1669.65       3185.3
>>4                    1637.26       3055.67
>>8                    406.77        2966.17
>>16                   185.76        2777.89
>>32                   181.06        2791.06
>>64                   189.12        2940.82
>>128                  210.51        2716.3
>>256                  227.36        843.94
>>512                  287.28        796.71
>>1K                   368.72        758.19
>>2K                   508.65        1144.24
>>4K                   730.59        1612.66
>>8K                   1170.22       2471.65
>>16K                  2096.86       8300.18
>>32K                  6340.45       17017.99
>>64K                  24640.78      41264.5
>>128K                 31709.09      50608.97
>>256K                 63680.67      94918.13
>>512K                 125531.7      162168.47
>>1M                   251566.94     321451.02
>>2M                   477431.32     707981
>>4M                   997768.35     1503987.61
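
For readers following the arithmetic in Renato's explanation, here is a rough Python sketch of why a 75% advertised window can overrun the receive buffer when every packet occupies a full 4KB page, and why 25% avoids it. It is only an illustration under stated assumptions: the 65536-byte receive buffer is a hypothetical value chosen for round numbers, the function names are made up for this example, the shift-based rule is assumed to match how the kernel interprets tcp_adv_win_scale, and per-packet bookkeeping overhead is ignored.

```python
PAGE_BYTES = 4096   # Xenolinux stores each packet in a full page (from the thread)
MTU_BYTES = 1500    # typical Ethernet MTU (from the thread)
RCVBUF = 65536      # hypothetical receive buffer size, chosen for round numbers


def advertised_window(space: int, adv_win_scale: int) -> int:
    """Bytes of a `space`-byte receive buffer offered as the TCP window.

    Assumed rule: a positive scale holds back space / 2**scale,
    a zero or negative scale offers only space / 2**(-scale).
    """
    if adv_win_scale <= 0:
        return space >> (-adv_win_scale)
    return space - (space >> adv_win_scale)


def memory_for_window(window: int, bytes_per_packet: int) -> int:
    """Buffer memory consumed if the sender fills `window` with MTU-sized
    packets and each packet occupies `bytes_per_packet` of buffer memory."""
    packets = -(-window // MTU_BYTES)  # ceiling division
    return packets * bytes_per_packet


if __name__ == "__main__":
    # Best-case use of a page by one full-size packet: about 36.6%
    print(f"page utilization: {MTU_BYTES / PAGE_BYTES:.1%}")
    for scale in (2, -2):
        window = advertised_window(RCVBUF, scale)
        used = memory_for_window(window, PAGE_BYTES)
        verdict = "overruns" if used > RCVBUF else "fits in"
        print(f"scale {scale:+d}: window {window} bytes, "
              f"needs {used} bytes, {verdict} the {RCVBUF}-byte buffer")
```

With these assumed numbers, scale 2 advertises 49152 bytes, which takes 33 full-size packets and 135168 bytes of page-sized buffers, more than twice the buffer. Scale -2 advertises 16384 bytes, which takes 11 packets and 45056 bytes, so the sender stops before the buffer is exhausted.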