Re: MPI benchmark performance gap between native linux and domU


From: Nivedita Singhvi <niv@us.ibm.com>
To: "Santos, Jose Renato G (Jose Renato Santos)" <joserenato.santos@hp.com>
Cc: "Turner, Yoshio" <yoshio_turner@hp.com>,
	Aravind Menon <aravind.menon@epfl.ch>,
	Xen-devel@lists.xensource.com, xuehai zhang <hai@cs.uchicago.edu>,
	G John Janakiraman <john@arivalai.hpl.hp.com>
Subject: Re: MPI benchmark performance gap between native linux and domU
Date: Tue, 05 Apr 2005 15:22:43 -0700
Message-ID: <42530FB3.1050901@us.ibm.com>
In-Reply-To: <6C21311CEE34E049B74CC0EF339464B902FB1D@cacexc12.americas.cpqcorp.net>

Santos, Jose Renato G (Jose Renato Santos) wrote:

>   Hi,
> 
>   We had a similar network problem in the past. We were using a TCP
> benchmark instead of MPI but I believe your problem is probably the same
> as the one we encountered.
>   It took us a while to get to the bottom of this and we only identified
> the reason for this behavior after we ported oprofile to Xen and did
> some performance profiling experiments.

Hello! Was this on the 2.6 kernel? Would you be able to
share the oprofile port? It would be very handy indeed
right now. A few people told me that someone was porting
oprofile, and I believe a status update went by on the
list, but I haven't seen the port itself yet...

>   Here is a brief explanation of the problem we found and the solution
> that worked for us.
>   Xenolinux allocates a full page (4KB) to store socket buffers instead
> of using just MTU bytes as in traditional Linux. This is necessary to
> enable page exchanges between the guest and the I/O domains. The side
> effect is that the memory set aside for socket buffers is not used very
> efficiently. Even if packets have the maximum MTU size (typically 1500
> bytes for Ethernet), the total buffer utilization is very low (at most
> just slightly higher than 35%). If packets arrive faster than they are
> processed at the receiver side, they will exhaust the receiver buffer

Most small connections (say up to 3-4 KB) involve only 3 to 5 segments,
and so the TCP window never really opens fully. On longer-lived
connections, it does help very much to have a large buffer.

> before the TCP advertised window is reached. (By default, Linux uses a
> TCP advertised window equal to 75% of the receive buffer size. In standard
> Linux, this is typically sufficient to stop packet transmission at the
> sender before running out of receive buffers. The same is not true in
> Xen due to inefficient use of socket buffers.) When a packet arrives and
> there is no receive buffer available, TCP tries to free socket buffer
> space by eliminating socket buffer fragmentation (i.e. eliminating
> wasted buffer space). This is done at the cost of an extra copy of all
> receive buffers into new, compacted socket buffers. This introduces
> overhead and reduces throughput when the CPU is the bottleneck, which
> seems to be your case.
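
The arithmetic in the quoted explanation can be sketched as follows. This is a simplified model, not kernel code: it assumes one full MTU-sized packet per 4 KB page and ignores per-buffer bookkeeping overhead. The function names are made up for illustration.

```python
# Simplified model of the buffer accounting described above.
# Assumptions: one packet per page, no sk_buff bookkeeping overhead.
PAGE_SIZE = 4096  # bytes charged per socket buffer in the Xen guest (from the text)
MTU = 1500        # typical Ethernet MTU in bytes (from the text)


def payload_utilization(packet_bytes, buffer_bytes=PAGE_SIZE):
    """Fraction of each buffer that actually carries packet data."""
    return packet_bytes / buffer_bytes


def window_is_safe(adv_window_fraction, packet_bytes=MTU, buffer_bytes=PAGE_SIZE):
    """True if the sender hits the advertised window before the receiver
    runs out of buffer memory, under the one-packet-per-page model."""
    return adv_window_fraction <= payload_utilization(packet_bytes, buffer_bytes)


print(round(payload_utilization(MTU) * 100, 1))  # 36.6 (percent)
print(window_is_safe(0.75))  # False: memory is exhausted before the window closes
print(window_is_safe(0.25))  # True: the window closes first
```

Under these assumptions, any advertised window above roughly 36% of the receive buffer lets the sender overrun the receiver's memory, which is consistent with 75% failing and 25% working.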

/proc/net/netstat will show a counter of just how many times this
happens (RcvPruned). It would be interesting to see whether that
count is significant.
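
A minimal sketch for pulling that counter out of /proc/net/netstat, assuming the usual layout of paired lines (one line of names, one line of values, both starting with the same prefix such as "TcpExt:"). RcvPruned is the counter named above; PruneCalled and TCPRcvCollapsed are related counters that may or may not be present on a given kernel, so the helper returns None for any that are missing. The function names are illustrative.

```python
import os


def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {"Prefix.Name": int}."""
    counters = {}
    lines = text.strip().splitlines()
    # The file alternates: a header line of names, then a line of values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, nums = values.split(":", 1)
        for name, num in zip(names.split(), nums.split()):
            counters[f"{prefix}.{name}"] = int(num)
    return counters


def prune_counters(path="/proc/net/netstat"):
    """Return the receive-queue pruning counters, or None where absent."""
    with open(path) as f:
        counters = parse_netstat(f.read())
    wanted = ("PruneCalled", "RcvPruned", "TCPRcvCollapsed")
    return {name: counters.get(f"TcpExt.{name}") for name in wanted}


if __name__ == "__main__" and os.path.exists("/proc/net/netstat"):
    print(prune_counters())
```

Sampling these values before and after a benchmark run and taking the difference gives the count attributable to that run.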

> This problem is not very frequent because modern CPUs are fast enough to
> receive packets at Gigabit speeds and the receive buffer does not fill
> up. However the problem may arise when using slower machines and/or when
> the workload consumes a lot of CPU cycles, as is the case with, for
> example, scientific MPI applications. In your case you have both
> factors against you.


> The solution to this problem is trivial. You just have to change the TCP
> advertised window of your guest to a lower value. In our case, we used
> 25% of the receive buffer size and that was sufficient to eliminate the
> problem. This can be done using the following command:

>   echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale
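
For reference, the relationship between tcp_adv_win_scale and the advertised fraction can be sketched like this. It is modelled on the shift-based calculation used by kernels of this era (a positive scale reserves 1/2^scale of the buffer as overhead, a zero or negative scale advertises 1/2^|scale| of it). Treat it as an illustration under that assumption, since later kernels compute the window differently. The function name is made up for this sketch.

```python
def window_from_space(space, adv_win_scale):
    """Advertised window, in bytes, for a receive buffer of `space` bytes.

    Assumption: shift-based rule as in 2.6-era kernels.
    """
    if adv_win_scale <= 0:
        # Zero or negative scale: advertise space / 2^|scale|.
        return space >> (-adv_win_scale)
    # Positive scale: hold back space / 2^scale as overhead.
    return space - (space >> adv_win_scale)


space = 65536  # example receive buffer size in bytes (arbitrary)
print(window_from_space(space, 2) / space)   # 0.75
print(window_from_space(space, -2) / space)  # 0.25
```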

How much did this improve your results? And wouldn't
making the default and maximum socket buffer sizes
larger by, say, 5 times be more effective (except for
applications that already use setsockopt() to set their
buffers to a size that is still not large enough)?
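
As an illustration of the per-application route mentioned here, a socket can request a larger receive buffer with setsockopt(). This is a minimal sketch: the 87380-byte baseline is only an assumed default for the example, and the kernel may adjust or cap the requested value (on Linux the cap is net.core.rmem_max), so the code reads the granted size back instead of trusting the request.

```python
import socket


def make_socket_with_rcvbuf(requested_bytes):
    """Create a TCP socket and request a receive buffer of the given size.

    Returns the socket and the size the kernel reports afterwards, which
    may differ from the request.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    granted = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return s, granted


ASSUMED_DEFAULT = 87380          # hypothetical baseline, for illustration only
requested = 5 * ASSUMED_DEFAULT  # "5 times larger", as suggested above
sock, granted = make_socket_with_rcvbuf(requested)
print(f"requested {requested} bytes, kernel reports {granted} bytes")
sock.close()
```

The buffer should be set before connect() or listen() so that it is in effect when the connection's window is negotiated.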

> (The default 2 corresponds to 75% of receive buffer, and -2 corresponds
> to 25%)
> 
> Please let me know if this improves your results. You should still see a
> degradation in throughput when comparing Xen to traditional Linux, but
> hopefully you should be able to see better throughput. You should also
> try running your experiments in domain 0. This will give better
> throughput, although still lower than traditional Linux.
> I am curious to know if this has any effect on your experiments.
> Please post the new results if it does.

Yep, me too..

thanks,
Nivedita
