public inbox for linux-kernel@vger.kernel.org
* [lmbench] tcp bandwidth on athlon
@ 2002-07-21 13:21 rwhron
  2002-07-21 16:44 ` Alan Cox
  0 siblings, 1 reply; 5+ messages in thread
From: rwhron @ 2002-07-21 13:21 UTC (permalink / raw)
  To: linux-kernel

I ran oprofile with bw_tcp and retired instructions on athlon showed:

samples       %-age  symbol name
903640      75.4825  csum_partial_copy_generic

In Carl Staelin and Larry McVoy's 98 Usenix paper they wrote:

"It is interesting to compare pipes with TCP because the TCP  benchmark  is
identical to the pipe benchmark except for the transport mechanism.  Ideally,
the TCP bandwidth would be as good as the pipe bandwidth.  It is  not  widely
known  that  the  majority of the TCP cost is in the bcopy, the checksum, and
the network interface driver.  The checksum and  the  driver  may  be  safely
eliminated  in  the loopback case and if the costs have been eliminated, then
TCP should be just as fast as pipes.  From the pipe and TCP results [...]
it is easy to see that Solaris and HP-UX have done this optimization."

Here are some recent Linux kernels:

Processor                Pipe        TCP
Athlon/1330             840.66      73.75 (or 150 MB/sec - see below)
k6-2/475                 65.15      52.45
PIII * 1/700 Xeon       539.73     446.16 

I tried compiling the athlon kernel without X86_USE_PPRO_CHECKSUM
but that didn't really change tcp bandwidth.

kernel                   Pipe      TCP  
2.4.19rc2aa1            860.97    74.27
2.4.19rc2aa1-nocsum     853.18    74.16

[topic shift]

There was a change in bw_tcp.c that has a 2x impact on
the computed bandwidth.  I have two versions:

ls -gl LM*/src/bw_tcp.c
-r--r--r--    1 rwhron     3553 Jul 23  2001 LMbench.old/src/bw_tcp.c
-r--r--r--    1 rwhron     3799 Sep 27  2001 LMbench2/src/bw_tcp.c

Both LMbench trees have the same version:

#define MAJOR   2
#define MINOR   -13     /* negative is alpha, it "increases" */

ident doesn't specify a version in bw_tcp.c, but diff shows
a difference.

This is the newer bw_tcp on an Athlon 1330.

/bw_tcp localhost
server: nbytes=10485760
initial bandwidth measurement: move=10485760, usecs=117291: 89.40 MB/sec
move=693633024, XFERSIZE=65536
server: nbytes=693633024
Socket bandwidth using localhost: 75.85 MB/sec

And the older bw_tcp compiled with same gcc same kernel on athlon:

/bw_tcp localhost
Socket bandwidth using localhost: 150.21 MB/sec

-- 
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [lmbench] tcp bandwidth on athlon
  2002-07-21 13:21 [lmbench] tcp bandwidth on athlon rwhron
@ 2002-07-21 16:44 ` Alan Cox
  0 siblings, 0 replies; 5+ messages in thread
From: Alan Cox @ 2002-07-21 16:44 UTC (permalink / raw)
  To: rwhron; +Cc: linux-kernel

On Sun, 2002-07-21 at 14:21, rwhron@earthlink.net wrote:
> In Carl Staelin and Larry McVoy's 98 Usenix paper they wrote:
> 
> "It is interesting to compare pipes with TCP because the TCP  benchmark  is
> identical to the pipe benchmark except for the transport mechanism.  Ideally,
> the TCP bandwidth would be as good as the pipe bandwidth.  It is  not  widely
> known  that  the  majority of the TCP cost is in the bcopy, the checksum, and
> the network interface driver.  The checksum and  the  driver  may  be  safely

The paper, however, ignored something else we do, which is why you see
csum_partial_copy_generic. On a modern processor the cost of fetching
and storing memory is so high compared to the throughput of the
processor that it is actually much more effective to fold the copy and
checksum together. Generally the copy/checksum runs at the same speed
as a pure copy anyway.



* Re: [lmbench] tcp bandwidth on athlon
@ 2002-07-21 19:18 Nivedita Singhvi
  0 siblings, 0 replies; 5+ messages in thread
From: Nivedita Singhvi @ 2002-07-21 19:18 UTC (permalink / raw)
  To: rwhron; +Cc: linux-kernel


> I ran oprofile with bw_tcp and retired instructions on 
> athlon showed:

> samples       %-age  symbol name
> 903640      75.4825  csum_partial_copy_generic

> In Carl Staelin and Larry McVoy's 98 Usenix paper they 
> wrote:
>
> It is interesting to compare pipes with TCP because the TCP
> benchmark  is identical to the pipe benchmark except for the
> transport mechanism.  Ideally, the TCP bandwidth would be as 
> good as the pipe bandwidth. It is not widely known that the

Well, TCP will have a little more overhead than a pipe, since
the network stack has to take care of a few more things.

> majority of the TCP cost is in the bcopy, the checksum, and
> the network interface driver.  The checksum and  the  driver
> may  be  safely eliminated in the loopback case and if the 
> costs have been eliminated, then TCP should be just as fast 
> as pipes.  From the pipe and TCP results [...]
> it is easy to see that Solaris and HP-UX have done this 
> optimization.

Much has happened since 1998: hardware checksum offload and
sendfile, amongst others. Note that this checksum is performed
while copying the data from/to user space (though not that
often in the rx code path). On the rx side for TCP we don't
look at the checksum at all, since the loopback driver will
have set ip_summed to CHECKSUM_UNNECESSARY; on the send side
we don't bother eliminating it, because we have to do the
copy in any case. The difference would be marginal, although
perhaps worth having. I had a patch for this which showed no
difference in a VolanoMark (yes, I know) benchmark.

> Processor                Pipe        TCP
> Athlon/1330             840.66      73.75 (or 150 MB/sec - see below)
> k6-2/475                 65.15      52.45
> PIII * 1/700 Xeon       539.73     446.16 

Hmm, so if K6 and Xeon can scrounge up 80% of pipe
performance, why is the Athlon an order of magnitude off
at 8%? How did your Athlon perform in other tests relative
to these other procs?

> I tried compiling the athlon kernel without 
> X86_USE_PPRO_CHECKSUM but that didn't really change 
> tcp bandwidth.

> kernel                   Pipe      TCP  
> 2.4.19rc2aa1            860.97    74.27
> 2.4.19rc2aa1-nocsum     853.18    74.16

Well, that would simply change how the checksum is
calculated, and in this case, I believe the substantial
latency is from the copy.

> There was a change in bw_tcp.c that has a 2x impact on
> the computed bandwidth.  I have two versions:

> ls -gl LM*/src/bw_tcp.c
> -r--r--r--    1 rwhron     3553 Jul 23  2001 LMbench.old/src/bw_tcp.c
> -r--r--r--    1 rwhron     3799 Sep 27  2001 LMbench2/src/bw_tcp.c

I only see the bitkeeper version, which is almost a year old, online;
where is the later version from?

> Both LMbench trees have the same version:
...

> ident doesn't specify a version in bw_tcp.c, but diff shows
> a difference.

A change in your test causes a 2x difference in performance,
and you don't give us the diff? :) :)

> Socket bandwidth using localhost: 75.85 MB/sec 
...
> Socket bandwidth using localhost: 150.21 MB/sec

Where are the complete profiles from these runs?
Also, any chance you have network stats before/after?
I looked on your site but couldn't find the bw_tcp
runs.


thanks,
Nivedita





* Re: [lmbench] tcp bandwidth on athlon
       [not found] <1027279106.3d3b0902a9209@imap.linux.ibm.com.suse.lists.linux.kernel>
@ 2002-07-21 19:30 ` Andi Kleen
  0 siblings, 0 replies; 5+ messages in thread
From: Andi Kleen @ 2002-07-21 19:30 UTC (permalink / raw)
  To: Nivedita Singhvi; +Cc: linux-kernel, rwhron

Nivedita Singhvi <niv@us.ibm.com> writes:

> Hmm, so if K6 and Xeon can scrounge up 80% of pipe
> performance, why is the Athlon an order of magnitude off
> at 8%? How did your Athlon perform in other tests relative
> to these other procs?

The pipe test basically tests copy_from_user()/copy_to_user().
The standard implementation of these macros (essentially rep ; movsl)
doesn't exploit the Athlon very well - it is not good at this
instruction. AFAIK Intel CPUs have a faster microcode
implementation for it.

You could likely do better on Athlon with a copy*user that uses 
an unrolled loop with explicit movls or even SSE.
[similar to the implementation the x86-64 port uses, but without
the NT instructions]

-Andi


* Re: [lmbench] tcp bandwidth on athlon
@ 2002-07-22 22:39 rwhron
  0 siblings, 0 replies; 5+ messages in thread
From: rwhron @ 2002-07-22 22:39 UTC (permalink / raw)
  To: niv; +Cc: linux-kernel

Nivedita Singhvi wrote:
> How did your Athlon perform in other tests relative
> to these other procs?

TCP bandwidth was the only really strange result.
Side by side lmbench for three processors is at:
http://home.earthlink.net/~rwhron/kernel/lmbench_comparison.html

> I only see the bitkeeper version thats almost a year old, online,
> where is the later version from?

The earlier version of bw_tcp is from lmbench-2.0-patch1.tgz
and the later version is from lmbench-2.0-patch2.tgz.

> Also, any chance you have network stats before/after?

I only ran oprofile on the athlon using "localhost", to get
an idea of the hot functions.  oprofile wasn't running during
the lmbench runs in the link above.

-- 
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html



end of thread, other threads:[~2002-07-22 22:37 UTC | newest]
