* Re: tcp bw in 2.6
[not found] ` <alpine.LFD.0.999.0709291050200.3579@woody.linux-foundation.org>
@ 2007-10-02 0:59 ` Larry McVoy
2007-10-02 2:14 ` Linus Torvalds
2007-10-02 10:52 ` Herbert Xu
0 siblings, 2 replies; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 0:59 UTC (permalink / raw)
To: Linus Torvalds; +Cc: davem, wscott, netdev
On Sat, Sep 29, 2007 at 11:02:32AM -0700, Linus Torvalds wrote:
> On Sat, 29 Sep 2007, Larry McVoy wrote:
> > I haven't kept up on switch technology but in the past they were much
> > better than you are thinking. The Kalpana switch that I had modified
> > to support vlans (invented by yours truly), did not store and forward,
> > it was cut through and could handle any load that was theoretically
> > possible within about 1%.
>
> Hey, you may well be right. Maybe my assumptions about cutting corners are
> just cynical and pessimistic.
So I got a netgear switch and it works fine. But my tests are busted.
Catching netdev up: I'm trying to optimize traffic to a server that has
a gbit interface; I moved to a 24-port netgear that is all 10/100/1000,
and I have a pile of clients to act as load generators.
I can do this on each of the clients:
dd if=/dev/zero bs=1024000 | rsh work "dd of=/dev/null"
and that cranks up to about 47K packets/second, which is about 70MB/sec.
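[Sanity check: at roughly 1448-1460 data bytes per full-size segment, 47K
packets/sec works out to about 68-69MB/sec, consistent with the ~70MB/sec
figure.]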
One of my clients also has gigabit, so I played around with just that
one, and it (itanium running hpux w/ broadcom gigabit) can push the load
as well. One weird thing is that it depends on the direction the
data is flowing: if the hp is sending I get 46MB/sec; if linux is
sending I get 18MB/sec. Weird. Linux is debian, running
Linux work 2.6.18-5-k7 #1 SMP Thu Aug 30 02:52:31 UTC 2007 i686
and dual e1000 cards:
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
I wrote a tiny little program to try to emulate this, and I can't get
it to do as well. I've tracked it down, I think, to the read side.
The server sources and the client sinks; the server looks like:
11689 accept(3, {sa_family=AF_INET, sin_port=htons(49376), sin_addr=inet_addr("10.3.1.38")}, [16]) = 4
11689 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1048576], 4) = 0
11689 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
11689 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7ddf708) = 11694
11689 close(4) = 0
11689 accept(3, <unfinished ...>
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
...
but the client looks like
connect(3, {sa_family=AF_INET, sin_port=htons(31235), sin_addr=inet_addr("10.3.9.1")}, 16) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1448
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1448
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
which I suspect may be the problem.
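[For reference, 1448 bytes is most likely one full-size segment's payload on a
1500-byte MTU with TCP timestamps enabled (1460-byte MSS minus the 12-byte
timestamp option), and 2896 is exactly two such segments, so each read() above
is draining only one or two packets.]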
I played around with SO_RCVBUF/SO_SNDBUF and that didn't help. So, any
ideas why a simple dd piped through rsh is kicking my ass? It must be
something simple, but my test program is tiny and does nothing weird
that I can see.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
* Re: tcp bw in 2.6
2007-10-02 0:59 ` tcp bw in 2.6 Larry McVoy
@ 2007-10-02 2:14 ` Linus Torvalds
2007-10-02 2:20 ` Larry McVoy
2007-10-02 10:52 ` Herbert Xu
1 sibling, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2007-10-02 2:14 UTC (permalink / raw)
To: Larry McVoy; +Cc: davem, wscott, netdev
On Mon, 1 Oct 2007, Larry McVoy wrote:
>
> but the client looks like
>
> connect(3, {sa_family=AF_INET, sin_port=htons(31235), sin_addr=inet_addr("10.3.9.1")}, 16) = 0
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1448
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
..
This is exactly what I'd expect if the machine is *not* under excessive
load.
The system calls are fast enough that the latency for the TCP stack is
roughly on the same scale as the time it takes to receive one new packet.
And since a socket read returns as soon as it has any data (it does not
wait until it has filled the whole buffer), you get exactly that "one or
two packets" pattern.
If you were really CPU-limited or under load from other programs, more
packets would come in while you're in the read path, and you'd get
bigger reads.
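[A minimal sketch (not from the original mail) of how to watch this directly,
where "fd" is whatever connected socket the test program uses: sink the stream
and report the average bytes returned per read(). An average near one or two
segments means the reader is keeping up; a CPU-bound reader would see much
larger values.]

#include <stdio.h>
#include <unistd.h>

static void sink_and_tally(int fd)
{
    static char buf[1 << 20];
    long long bytes = 0, reads = 0;
    int n;

    /* Count how many bytes each read() call actually returned. */
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        bytes += n;
        reads++;
    }
    printf("%lld bytes in %lld reads, avg %lld bytes/read\n",
        bytes, reads, reads ? bytes / reads : 0);
}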
But do a tcpdump both ways, and see (for example) if the TCP window is
much bigger going the other way.
Linus
* Re: tcp bw in 2.6
2007-10-02 2:14 ` Linus Torvalds
@ 2007-10-02 2:20 ` Larry McVoy
2007-10-02 3:50 ` David Miller
` (2 more replies)
0 siblings, 3 replies; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 2:20 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Larry McVoy, davem, wscott, netdev
On Mon, Oct 01, 2007 at 07:14:37PM -0700, Linus Torvalds wrote:
>
>
> On Mon, 1 Oct 2007, Larry McVoy wrote:
> >
> > but the client looks like
> >
> > connect(3, {sa_family=AF_INET, sin_port=htons(31235), sin_addr=inet_addr("10.3.9.1")}, 16) = 0
> > read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
> > read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1448
> > read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
> ..
>
> This is exactly what I'd expect if the machine is *not* under excessive
> load.
That's fine, but why is it that my trivial program can't do as well as
dd | rsh dd?
A short summary is "can someone please post a test program that sources
and sinks data at the wire speed?" because apparently I'm too old and
clueless to write such a thing.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
* Re: tcp bw in 2.6
2007-10-02 2:20 ` Larry McVoy
@ 2007-10-02 3:50 ` David Miller
2007-10-02 4:23 ` Larry McVoy
2007-10-02 15:06 ` John Heffner
2007-10-02 17:14 ` Rick Jones
2 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2007-10-02 3:50 UTC (permalink / raw)
To: lm; +Cc: torvalds, wscott, netdev
From: lm@bitmover.com (Larry McVoy)
Date: Mon, 1 Oct 2007 19:20:59 -0700
> A short summary is "can someone please post a test program that sources
> and sinks data at the wire speed?" because apparently I'm too old and
> clueless to write such a thing.
You're not showing us your test program, so there is no way we
can help you out.
My initial inclination, even without that critical information,
is to ask whether you are setting any socket options in any way.
In particular, SO_RCVLOWAT can have a large effect here; if you're
setting it to something, that would explain why dd is doing better. A
lot of people link to "helper libraries" with interfaces that set up
sockets with all sorts of socket option settings by default; try not
to use such things if possible.
You also shouldn't dork at all with the receive and send buffer sizes.
They are adjusted dynamically by the kernel as the window grows. But
if you set them to specific values, this dynamic logic is turned off.
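[A quick way to rule both of those out (a sketch, not from the original mail):
have the test program dump what the kernel actually has for the socket before
the transfer starts; anything a helper library set behind your back shows up
here.]

#include <stdio.h>
#include <sys/socket.h>

static void dump_sockopts(int fd)
{
    int val;
    socklen_t len;

    len = sizeof(val);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &val, &len) == 0)
        printf("SO_RCVLOWAT %d\n", val);    /* default is 1 */
    len = sizeof(val);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, &len) == 0)
        printf("SO_RCVBUF   %d\n", val);
    len = sizeof(val);
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len) == 0)
        printf("SO_SNDBUF   %d\n", val);
}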
* Re: tcp bw in 2.6
2007-10-02 3:50 ` David Miller
@ 2007-10-02 4:23 ` Larry McVoy
0 siblings, 0 replies; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 4:23 UTC (permalink / raw)
To: David Miller; +Cc: lm, torvalds, wscott, netdev
[-- Attachment #1: Type: text/plain, Size: 1858 bytes --]
On Mon, Oct 01, 2007 at 08:50:50PM -0700, David Miller wrote:
> From: lm@bitmover.com (Larry McVoy)
> Date: Mon, 1 Oct 2007 19:20:59 -0700
>
> > A short summary is "can someone please post a test program that sources
> > and sinks data at the wire speed?" because apparently I'm too old and
> > clueless to write such a thing.
>
> You're not showing us your test program so there is no way we
> can help you out.
Attached. Drop it into an lmbench tree and build it.
> My initial inclination, even without that critical information,
> is to ask whether you are setting any socket options in way?
The only ones I was playing with were SO_RCVBUF/SO_SNDBUF; I tried
disabling that, and I tried playing with the read/write size. Didn't
help.
> In particular, SO_RCVLOWAT can have a large effect here, if you're
> setting it to something, that would explain why dd is doing better. A
> lot of people link to "helper libraries" with interfaces to setup
> sockets with all sorts of socket option settings by default, try not
> using such things if possible.
Agreed. That was my first thought as well: I must have been doing
something that messed up the defaults. But you did get the strace
output, and there wasn't anything weird there.
> You also shouldn't dork at all with the receive and send buffer sizes.
> They are adjusted dynamically by the kernel as the window grows. But
> if you set them to specific values, this dynamic logic is turned off.
Yeah, dorking with those is left over from the bad old days of '95
when lmbench was first shipped. But I turned all of that off and it
made no difference.
So feel free to show me where I'm an idiot in the code, but if you
can't, then what would rock would be a little send.c / recv.c that
demonstrated filling the pipe.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
[-- Attachment #2: bytes_tcp.c --]
[-- Type: text/x-csrc, Size: 2278 bytes --]
/*
* bytes_tcp.c - simple TCP bandwidth source/sink
*
* server usage: bytes_tcp -s
* client usage: bytes_tcp hostname [msgsize]
*
* Copyright (c) 1994 Larry McVoy.
* Copyright (c) 2002 Carl Staelin. Distributed under the FSF GPL with
* additional restriction that results may be published only if
* (1) the benchmark is unmodified, and
* (2) the version in the sccsid below is included in the report.
* Support for this development by Sun Microsystems is gratefully acknowledged.
*/
char *id = "$Id$\n";
#include "bench.h"
#define XFER (1024*1024)
int server_main(int ac, char **av);
int client_main(int ac, char **av);
void source(int data);
void
transfer(int get, int server, char *buf)
{
    int c = 0;  /* initialized so the error check below is defined
                 * even if we never enter the read loop */

    while ((get > 0) && (c = read(server, buf, XFER)) > 0) {
        get -= c;
    }
    if (c < 0) {
        perror("bytes_tcp: transfer: read failed");
        exit(4);
    }
}
/* ARGSUSED */
int
client_main(int ac, char **av)
{
    int server;
    int get = 256 << 20;
    char buf[XFER];
    char *usage = "usage: %s -remotehost OR %s remotehost [msgsize]\n";

    if (ac != 2 && ac != 3) {
        (void)fprintf(stderr, usage, av[0], av[0]);
        exit(0);
    }
    if (ac == 3) get = bytes(av[2]);
    server = tcp_connect(av[1], TCP_DATA+1, SOCKOPT_READ|SOCKOPT_REUSE);
    if (server < 0) {
        perror("bytes_tcp: could not open socket to server");
        exit(2);
    }
    transfer(get, server, buf);
    close(server);
    exit(0);
    /*NOTREACHED*/
}
void
child()
{
    wait(0);
    signal(SIGCHLD, child);
}
/* ARGSUSED */
int
server_main(int ac, char **av)
{
    int data, newdata;

    signal(SIGCHLD, child);
    data = tcp_server(TCP_DATA+1, SOCKOPT_READ|SOCKOPT_WRITE|SOCKOPT_REUSE);
    for ( ;; ) {
        newdata = tcp_accept(data, SOCKOPT_WRITE|SOCKOPT_READ);
        switch (fork()) {
        case -1:
            perror("fork");
            break;
        case 0:
            source(newdata);
            exit(0);
        default:
            close(newdata);
            break;
        }
    }
}
void
source(int data)
{
    char buf[XFER]; /* never initialized; only the byte count matters
                     * for a bandwidth test */

    while (write(data, buf, sizeof(buf)) > 0)
        ;
}
int
main(int ac, char **av)
{
    char *usage = "Usage: %s -s OR %s -serverhost OR %s serverhost [msgsize]\n";

    if (ac < 2 || 3 < ac) {
        fprintf(stderr, usage, av[0], av[0], av[0]);
        exit(1);
    }
    if (ac == 2 && !strcmp(av[1], "-s")) {
        if (fork() == 0) server_main(ac, av);
        exit(0);
    } else {
        client_main(ac, av);
    }
    return(0);
}
[-- Attachment #3: lib_tcp.c --]
[-- Type: text/x-csrc, Size: 5094 bytes --]
/*
* tcp_lib.c - routines for managing TCP connections.
*
* Positive port/program numbers are RPC ports, negative ones are TCP ports.
*
* Copyright (c) 1994-1996 Larry McVoy.
*/
#define _LIB /* bench.h needs this */
#include "bench.h"
/*
* Get a TCP socket, bind it, figure out the port,
* and advertise the port as program "prog".
*
* XXX - it would be nice if you could advertise ascii strings.
*/
int
tcp_server(int prog, int rdwr)
{
int sock;
struct sockaddr_in s;
#ifdef LIBTCP_VERBOSE
fprintf(stderr, "tcp_server(%u, %u)\n", prog, rdwr);
#endif
if ((sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
perror("socket");
exit(1);
}
sock_optimize(sock, rdwr);
bzero((void*)&s, sizeof(s));
s.sin_family = AF_INET;
if (prog < 0) {
s.sin_port = htons(-prog);
}
if (bind(sock, (struct sockaddr*)&s, sizeof(s)) < 0) {
perror("bind");
exit(2);
}
if (listen(sock, 100) < 0) {
perror("listen");
exit(4);
}
if (prog > 0) {
#ifdef LIBTCP_VERBOSE
fprintf(stderr, "Server port %d\n", sockport(sock));
#endif
(void)pmap_unset((u_long)prog, (u_long)1);
if (!pmap_set((u_long)prog, (u_long)1, (u_long)IPPROTO_TCP,
(unsigned short)sockport(sock))) {
perror("pmap_set");
exit(5);
}
}
return (sock);
}
/*
* Unadvertise the socket
*/
int
tcp_done(int prog)
{
if (prog > 0) {
pmap_unset((u_long)prog, (u_long)1);
}
return (0);
}
/*
* Accept a connection and return it
*/
int
tcp_accept(int sock, int rdwr)
{
struct sockaddr_in s;
int newsock, namelen;
namelen = sizeof(s);
bzero((void*)&s, namelen);
retry:
if ((newsock = accept(sock, (struct sockaddr*)&s, &namelen)) < 0) {
if (errno == EINTR)
goto retry;
perror("accept");
exit(6);
}
#ifdef LIBTCP_VERBOSE
fprintf(stderr, "Server newsock port %d\n", sockport(newsock));
#endif
sock_optimize(newsock, rdwr);
return (newsock);
}
/*
* Connect to the TCP socket advertised as "prog" on "host" and
* return the connected socket.
*
* Hacked Thu Oct 27 1994 to cache pmap_getport calls. This saves
* about 4000 usecs in loopback lat_connect calls. I suppose we
* should time gethostbyname() & pmap_getprot(), huh?
*/
int
tcp_connect(char *host, int prog, int rdwr)
{
static struct hostent *h;
static struct sockaddr_in s;
static u_short save_port;
static u_long save_prog;
static char *save_host;
int sock;
static int tries = 0;
if ((sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
perror("socket");
exit(1);
}
if (rdwr & SOCKOPT_PID) {
static unsigned short port;
struct sockaddr_in sin;
if (!port) {
port = (unsigned short)(getpid() << 4);
if (port < 1024) {
port += 1024;
}
}
do {
port++;
bzero((void*)&sin, sizeof(sin));
sin.sin_family = AF_INET;
sin.sin_port = htons(port);
} while (bind(sock, (struct sockaddr*)&sin, sizeof(sin)) == -1);
}
#ifdef LIBTCP_VERBOSE
else {
struct sockaddr_in sin;
bzero((void*)&sin, sizeof(sin));
sin.sin_family = AF_INET;
if (bind(sock, (struct sockaddr*)&sin, sizeof(sin)) < 0) {
perror("bind");
exit(2);
}
}
fprintf(stderr, "Client port %d\n", sockport(sock));
#endif
sock_optimize(sock, rdwr);
if (!h || host != save_host || prog != save_prog) {
save_host = host; /* XXX - counting on them not
* changing it - benchmark only.
*/
save_prog = prog;
if (!(h = gethostbyname(host))) {
perror(host);
exit(2);
}
bzero((void *) &s, sizeof(s));
s.sin_family = AF_INET;
bcopy((void*)h->h_addr, (void *)&s.sin_addr, h->h_length);
if (prog > 0) {
save_port = pmap_getport(&s, prog,
(u_long)1, IPPROTO_TCP);
if (!save_port) {
perror("lib TCP: No port found");
exit(3);
}
#ifdef LIBTCP_VERBOSE
fprintf(stderr, "Server port %d\n", save_port);
#endif
s.sin_port = htons(save_port);
} else {
s.sin_port = htons(-prog);
}
}
if (connect(sock, (struct sockaddr*)&s, sizeof(s)) < 0) {
if (errno == ECONNRESET || errno == ECONNREFUSED) {
close(sock);
if (++tries > 10) return(-1);
return (tcp_connect(host, prog, rdwr));
}
perror("connect");
exit(4);
}
tries = 0;
return (sock);
}
#define LIBTCP_VERBOSE
void
sock_optimize(int sock, int flags)
{
    return; /* early return: all of the option setting below is
             * currently disabled, so the kernel's defaults (and its
             * buffer autotuning) are left alone */
    if (flags & SOCKOPT_READ) {
        int sockbuf = SOCKBUF;

        while (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &sockbuf,
            sizeof(int))) {
            sockbuf >>= 1;
        }
#ifdef LIBTCP_VERBOSE
        fprintf(stderr, "sockopt %d: RCV: %dK\n", sock, sockbuf>>10);
#endif
    }
    if (flags & SOCKOPT_WRITE) {
        int sockbuf = SOCKBUF;

        while (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sockbuf,
            sizeof(int))) {
            sockbuf >>= 1;
        }
#ifdef LIBTCP_VERBOSE
        fprintf(stderr, "sockopt %d: SND: %dK\n", sock, sockbuf>>10);
#endif
    }
    if (flags & SOCKOPT_REUSE) {
        int val = 1;

        if (setsockopt(sock, SOL_SOCKET,
            SO_REUSEADDR, &val, sizeof(val)) == -1) {
            perror("SO_REUSEADDR");
        }
    }
}
int
sockport(int s)
{
int namelen;
struct sockaddr_in sin;
namelen = sizeof(sin);
if (getsockname(s, (struct sockaddr *)&sin, &namelen) < 0) {
perror("getsockname");
return(-1);
}
return ((int)ntohs(sin.sin_port));
}
* Re: tcp bw in 2.6
2007-10-02 0:59 ` tcp bw in 2.6 Larry McVoy
2007-10-02 2:14 ` Linus Torvalds
@ 2007-10-02 10:52 ` Herbert Xu
2007-10-02 15:09 ` Larry McVoy
1 sibling, 1 reply; 56+ messages in thread
From: Herbert Xu @ 2007-10-02 10:52 UTC (permalink / raw)
To: Larry McVoy; +Cc: torvalds, davem, wscott, netdev
Larry McVoy <lm@bitmover.com> wrote:
>
> One of my clients also has gigabit so I played around with just that
> one and it (itanium running hpux w/ broadcom gigabit) can push the load
> as well. One weird thing is that it is dependent on the direction the
> data is flowing. If the hp is sending then I get 46MB/sec, if linux is
> sending then I get 18MB/sec. Weird. Linux is debian, running
First of all, check the CPU load on both sides to see if either
of them is saturating. If the CPU's fine, then look at the tcpdump
output to see if both receivers are using the same window settings.
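[On the Linux sender this can also be watched from the socket itself, using
the Linux-specific TCP_INFO getsockopt; a minimal sketch (not from the
original mail), where "fd" is the connected socket:]

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>        /* struct tcp_info, TCP_INFO (Linux only) */

static void dump_tcp_info(int fd)
{
    struct tcp_info ti;
    socklen_t len = sizeof(ti);

    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0) {
        perror("TCP_INFO");
        return;
    }
    printf("rtt %uus cwnd %u segs mss %u retrans %u wscale %d/%d\n",
        ti.tcpi_rtt, ti.tcpi_snd_cwnd, ti.tcpi_snd_mss,
        ti.tcpi_retrans, ti.tcpi_snd_wscale, ti.tcpi_rcv_wscale);
}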
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: tcp bw in 2.6
2007-10-02 2:20 ` Larry McVoy
2007-10-02 3:50 ` David Miller
@ 2007-10-02 15:06 ` John Heffner
2007-10-02 17:14 ` Rick Jones
2 siblings, 0 replies; 56+ messages in thread
From: John Heffner @ 2007-10-02 15:06 UTC (permalink / raw)
To: lm, Linus Torvalds, davem, wscott, netdev
[-- Attachment #1: Type: text/plain, Size: 417 bytes --]
Larry McVoy wrote:
> A short summary is "can someone please post a test program that sources
> and sinks data at the wire speed?" because apparently I'm too old and
> clueless to write such a thing.
Here's a simple reference TCP source/sink that I've used for years.
For example, on a couple of gigabit machines:
$ ./tcpsend -t10 dew
Sent 1240415312 bytes in 10.033101 seconds
Throughput: 123632294 B/s
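[That is about 124MB/sec, i.e. roughly 0.99Gbit/sec of TCP payload.]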
-John
[-- Attachment #2: discard.c --]
[-- Type: text/plain, Size: 2332 bytes --]
/*
* discard.c
* A simple discard server.
*
* Copyright 2003 John Heffner.
*/
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/poll.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <sys/param.h>
#include <netinet/in.h>
#if 0
#define RATELIMIT
#define RATE 100000 /* bytes/sec */
#define WAIT_TIME (1000000/HZ-1)
#define READ_SIZE (RATE/HZ)
#else
#define READ_SIZE (1024*1024)
#endif
void child_handler(int sig)
{
int status;
wait(&status);
}
int main(int argc, char *argv[])
{
int port = 9000;
int lfd;
struct sockaddr_in laddr;
int newfd;
struct sockaddr_in newaddr;
int pid;
socklen_t len;
if (argc > 2) {
fprintf(stderr, "usage: discard [port]\n");
exit(1);
}
if (argc == 2) {
if (sscanf(argv[1], "%d", &port) != 1 || port < 0 || port > 65535) {
fprintf(stderr, "discard: error: not a port number\n");
exit(1);
}
}
if (signal(SIGCHLD, child_handler) == SIG_ERR) {
perror("signal");
exit(1);
}
memset(&laddr, 0, sizeof (laddr));
laddr.sin_family = AF_INET;
laddr.sin_port = htons(port);
laddr.sin_addr.s_addr = INADDR_ANY;
if ((lfd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
perror("socket");
exit(1);
}
if (bind(lfd, (struct sockaddr *)&laddr, sizeof (laddr)) != 0) {
perror("bind");
exit(1);
}
if (listen(lfd, 5) != 0) {
perror("listen");
exit(1);
}
for (;;) {
len = sizeof (newaddr); /* value-result argument; was used uninitialized */
if ((newfd = accept(lfd, (struct sockaddr *)&newaddr, &len)) < 0) {
if (errno == EINTR)
continue;
perror("accept");
exit(1);
}
if ((pid = fork()) < 0) {
perror("fork");
exit(1);
} else if (pid == 0) {
int n;
char buf[READ_SIZE];
int64_t data_rcvd = 0;
struct timeval stime, etime;
float time;
gettimeofday(&stime, NULL);
while ((n = read(newfd, buf, READ_SIZE)) > 0) {
data_rcvd += n;
#ifdef RATELIMIT
usleep(WAIT_TIME);
#endif
}
gettimeofday(&etime, NULL);
close(newfd);
time = (float)(1000000*(etime.tv_sec - stime.tv_sec) + etime.tv_usec - stime.tv_usec) / 1000000.0;
printf("Received %lld bytes in %f seconds\n", (long long)data_rcvd, time);
printf("Throughput: %d B/s\n", (int)((float)data_rcvd / time));
exit(0);
}
close(newfd);
}
return 1;
}
[-- Attachment #3: tcpsend.c --]
[-- Type: text/plain, Size: 6268 bytes --]
/*
* tcpsend.c
* Send pseudo-random data through a TCP connection.
*
* Copyright 2003 John Heffner.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <signal.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/stat.h>
#ifdef __linux__
#include <sys/sendfile.h>
#endif
#define SNDSIZE (1024 * 10)
#define BUFSIZE (1024 * 1024)
#define max(a,b) (a > b ? a : b)
#define min(a,b) (a < b ? a : b)
int time_done = 0;
int interrupt_done = 0;
struct timeval starttime;
void int_handler(int sig)
{
interrupt_done = 1;
}
void alarm_handler(int sig)
{
time_done = 1;
}
static void usage_error(int err) {
fprintf(stderr, "usage: tcpsend [-z] [-b max_bytes] [-t max_time] hostname [port]\n");
exit(err);
}
static void cleanup_exit(int fd, char *filename, int status)
{
if (fd > 0)
close(fd);
if (filename)
unlink(filename);
exit(status);
}
int main(int argc, char *argv[])
{
char *hostname = "localhost";
int port = 9000;
int max_time = -1;
int max_bytes = -1;
int zerocopy = 0;
int sockfd;
struct sockaddr_in addr;
struct hostent *hent;
struct sigaction act;
int i;
int arg_state;
char *tmp;
int add;
char *buf;
int64_t data_sent;
int n;
off_t start;
int amt;
struct timeval etime;
float time;
int err;
char *namebuf = NULL;
int fd = -1;
/* Read in args */
if (argc == 2 && strcmp(argv[1], "-h") == 0)
usage_error(0);
for (arg_state = 0, i = 1; i < argc; i++) {
if (argv[i][0] == '-') {
if (arg_state != 0)
usage_error(1);
if (strlen(argv[i]) < 2)
usage_error(1);
add = 0;
if (argv[i][1] == 'z') {
zerocopy = 1;
} else if (argv[i][1] == 'b' ||
argv[i][1] == 't') {
if (strlen(argv[i]) > 2) {
tmp = &(argv[i][2]);
} else {
add = 1;
if (i + 1 >= argc)
usage_error(1);
tmp = argv[i + 1];
}
if (argv[i][1] == 'b') {
if (sscanf(tmp, "%d", &max_bytes) != 1 ||
max_bytes < 0)
usage_error(1);
} else {
if (sscanf(tmp, "%d", &max_time) != 1 ||
max_time < 0)
usage_error(1);
}
} else {
usage_error(1);
}
i += add;
} else {
switch (arg_state) {
case 0:
arg_state = 1;
hostname = argv[i];
break;
case 1:
arg_state = 2;
if (sscanf(argv[i], "%d", &port) != 1 ||
port < 0 || port > 65535)
usage_error(1);
break;
default:
usage_error(1);
}
}
}
if (arg_state < 1)
usage_error(1);
#ifndef __linux__
if (zerocopy) {
fprintf(stderr, "Zero-copy is only supported under Linux.\n");
exit(1);
}
#endif
/* Set up addr struct from hostname and port */
if ((hent = gethostbyname(hostname)) == NULL) {
fprintf(stderr, "tcpsend: gethostbyname error\n");
exit(1);
}
memset(&addr, 0, sizeof (addr));
addr.sin_family = AF_INET;
memcpy(&addr.sin_addr, hent->h_addr_list[0], 4);
addr.sin_port = htons(port);
/* Create buffer and fill with random data */
if (gettimeofday(&starttime, NULL) < 0) {
perror("gettimeofday");
exit(1);
}
srand((unsigned int)(starttime.tv_usec + 1000000 * starttime.tv_sec));
if ((buf = (char *)malloc(BUFSIZE)) == NULL) {
fprintf(stderr, "malloc failed\n");
exit(1);
}
for (i = 0; i < BUFSIZE; i += sizeof (int)) {
*(int *)&buf[i] = rand();
}
if (zerocopy) {
if ((namebuf = malloc(64)) == NULL) {
fprintf(stderr, "malloc failed\n");
exit(1);
}
sprintf(namebuf, "/tmp/tcpsend%d", getpid());
if ((fd = open(namebuf, O_RDWR | O_CREAT, 0600)) < 0) {
perror("open");
exit(1);
}
for (amt = BUFSIZE; amt > 0; ) {
if ((n = write(fd, buf, amt)) < 0) {
perror("write");
cleanup_exit(fd, namebuf, 1);
}
amt -= n;
}
}
/* Open connection */
if ((sockfd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
perror("socket");
cleanup_exit(fd, namebuf, 1);
}
if (connect(sockfd, (struct sockaddr *)&addr, sizeof (addr)) != 0) {
perror("connect");
cleanup_exit(fd, namebuf, 1);
}
/* Set up signal handlers */
if (max_time >= 0) {
if (sigaction(SIGALRM, NULL, &act) != 0) {
perror("sigaction: SIGALRM");
cleanup_exit(fd, namebuf, 1);
}
act.sa_handler = alarm_handler;
act.sa_flags = 0;
if (sigaction(SIGALRM, &act, NULL) != 0) {
perror("sigaction: SIGALRM");
cleanup_exit(fd, namebuf, 1);
}
alarm(max_time);
}
if (sigaction(SIGINT, NULL, &act) != 0) {
perror("sigaction: SIGINT");
cleanup_exit(fd, namebuf, 1);
}
act.sa_handler = int_handler;
act.sa_flags = 0;
if (sigaction(SIGINT, &act, NULL) != 0) {
perror("sigaction: SIGINT");
cleanup_exit(fd, namebuf, 1);
}
/* Send random data until we hit a max */
data_sent = 0;
while ((max_bytes < 0 ? 1 : data_sent < max_bytes) &&
!time_done && !interrupt_done) {
start = rand() / (RAND_MAX / (BUFSIZE - SNDSIZE) + 1);
if (max_bytes < 0)
amt = SNDSIZE;
else
amt = min(SNDSIZE, max_bytes - data_sent);
if (zerocopy) {
#ifdef __linux__
if ((n = sendfile(sockfd, fd, &start, amt)) < 0 && errno != EINTR) {
perror("sendfile");
cleanup_exit(fd, namebuf, 1);
} else if (n == 0) {
fprintf(stderr, "tcpsend: socket unexpectedly closed\n");
cleanup_exit(fd, namebuf, 1);
}
#endif
} else {
if ((n = write(sockfd, &buf[start], amt)) < 0 && errno != EINTR) {
perror("write");
cleanup_exit(fd, namebuf, 1);
} else if (n == 0) {
fprintf(stderr, "tcpsend: socket unexpectedly closed\n");
cleanup_exit(fd, namebuf, 1);
}
}
data_sent += n;
}
/* Close the socket and wait for the remote host to close */
if (shutdown(sockfd, SHUT_WR) != 0) {
perror("shutdown");
cleanup_exit(fd, namebuf, 1);
}
err = read(sockfd, buf, 1);
if (err < 0) {
perror("read");
cleanup_exit(fd, namebuf, 1);
} else if (err > 0) {
fprintf(stderr, "warning: data read on socket\n");
}
gettimeofday(&etime, NULL);
time = (float)(1000000*(etime.tv_sec - starttime.tv_sec) +
etime.tv_usec - starttime.tv_usec) / 1000000.0;
printf("Sent %lld bytes in %f seconds\n", (long long)data_sent, time);
printf("Throughput: %d B/s\n", (int)((float)data_sent / time));
cleanup_exit(fd, namebuf, 0);
return 0;
}
[-- Attachment #4: Makefile --]
[-- Type: text/plain, Size: 75 bytes --]
CFLAGS = -g -O2 -Wall
all: tcpsend discard
clean:
rm -f tcpsend discard
* Re: tcp bw in 2.6
2007-10-02 10:52 ` Herbert Xu
@ 2007-10-02 15:09 ` Larry McVoy
2007-10-02 15:41 ` Larry McVoy
` (2 more replies)
0 siblings, 3 replies; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 15:09 UTC (permalink / raw)
To: Herbert Xu; +Cc: Larry McVoy, torvalds, davem, wscott, netdev
On Tue, Oct 02, 2007 at 06:52:54PM +0800, Herbert Xu wrote:
> > One of my clients also has gigabit so I played around with just that
> > one and it (itanium running hpux w/ broadcom gigabit) can push the load
> > as well. One weird thing is that it is dependent on the direction the
> > data is flowing. If the hp is sending then I get 46MB/sec, if linux is
> > sending then I get 18MB/sec. Weird. Linux is debian, running
>
> First of all check the CPU load on both sides to see if either
> of them is saturating. If the CPU's fine then look at the tcpdump
> output to see if both receivers are using the same window settings.
tcpdump is a good idea; take a look at this. The window starts out
at 46 and never opens up in my test case, but in the rsh case it
starts out the same and does open up. Ideas?
08:08:06.033305 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: S 2756874880:2756874880(0) win 32768 <mss 1460,wscale 0,nop>
08:08:06.033335 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: S 3360532803:3360532803(0) ack 2756874881 win 5840 <mss 1460,nop,wscale 7>
08:08:06.047924 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 1 win 32768
08:08:06.048218 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 1:2921(2920) ack 1 win 46
08:08:06.048426 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 1461 win 32768
08:08:06.048446 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 2921:5841(2920) ack 1 win 46
08:08:06.048673 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 4381 win 32768
08:08:06.048684 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 5841:10221(4380) ack 1 win 46
08:08:06.049047 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 8761 win 32768
08:08:06.049057 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 10221:16061(5840) ack 1 win 46
08:08:06.049422 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 14601 win 32768
08:08:06.049429 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 16061:18981(2920) ack 1 win 46
08:08:06.049462 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 18981:20441(1460) ack 1 win 46
08:08:06.049484 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 20441:23361(2920) ack 1 win 46
08:08:06.049924 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 21901 win 32768
08:08:06.049943 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 23361:32121(8760) ack 1 win 46
08:08:06.050549 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 30661 win 32768
08:08:06.050559 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 32121:39421(7300) ack 1 win 46
08:08:06.050592 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 39421:40881(1460) ack 1 win 46
08:08:06.050614 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 40881:42341(1460) ack 1 win 46
08:08:06.051170 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 40881 win 32768
08:08:06.051188 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 42341:54021(11680) ack 1 win 46
08:08:06.051923 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 52561 win 32768
08:08:06.051932 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 54021:58401(4380) ack 1 win 46
08:08:06.051942 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 58401:67161(8760) ack 1 win 46
08:08:06.052671 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 65701 win 32768
08:08:06.052680 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 67161:74461(7300) ack 1 win 46
08:08:06.052719 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 74461:77381(2920) ack 1 win 46
08:08:06.052752 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 77381:81761(4380) ack 1 win 46
08:08:06.053549 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 80301 win 32768
08:08:06.053566 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 81761:97821(16060) ack 1 win 46
08:08:06.054423 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 96361 win 32768
08:08:06.054433 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 97821:113881(16060) ack 1 win 46
08:08:06.054476 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 113881:115341(1460) ack 1 win 46
08:08:06.055422 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 113881 win 32768
08:08:06.055438 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 115341:131401(16060) ack 1 win 46
08:08:06.056421 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 131401 win 32768
08:08:06.056432 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 131401:147461(16060) ack 1 win 46
08:08:06.117889 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 147461 win 32768
08:08:06.117897 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 147461:163521(16060) ack 1 win 46
08:08:06.118392 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 148921 win 32768
08:08:06.118405 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 163521:173741(10220) ack 1 win 46
08:08:06.118640 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 151841 win 32768
08:08:06.118768 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 156221 win 32768
08:08:06.118775 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 173741:179581(5840) ack 1 win 46
08:08:06.118783 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 162061 win 32768
08:08:06.118793 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 179581:191261(11680) ack 1 win 46
08:08:06.119388 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 169361 win 32768
08:08:06.119644 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 178121 win 32768
08:08:06.119654 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 191261:195641(4380) ack 1 win 46
08:08:06.119665 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 195641:210241(14600) ack 1 win 46
08:08:06.120265 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 188341 win 32768
08:08:06.120274 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 210241:211701(1460) ack 1 win 46
08:08:06.121137 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 200021 win 32768
08:08:06.121148 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 211701:227761(16060) ack 1 win 46
08:08:06.121763 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 213161 win 32768
08:08:06.121772 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 227761:243821(16060) ack 1 win 46
08:08:06.122385 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 227761 win 32768
08:08:06.122396 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 243821:259881(16060) ack 1 win 46
08:08:06.123260 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 243821 win 32768
08:08:06.123269 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 259881:275941(16060) ack 1 win 46
08:08:06.124008 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 259881 win 32768
08:08:06.124021 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 275941:292001(16060) ack 1 win 46
08:08:06.124884 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 275941 win 32768
08:08:06.124894 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 292001:308061(16060) ack 1 win 46
08:08:06.125637 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 292001 win 32768
08:08:06.125647 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 308061:324121(16060) ack 1 win 46
08:08:06.126512 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 308061 win 32768
08:08:06.126522 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 324121:340181(16060) ack 1 win 46
08:08:06.127383 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 324121 win 32768
08:08:06.127393 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 340181:356241(16060) ack 1 win 46
08:08:06.128135 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 340181 win 32768
08:08:06.128146 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 356241:372301(16060) ack 1 win 46
08:08:06.129010 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 356241 win 32768
08:08:06.129020 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 372301:388361(16060) ack 1 win 46
08:08:06.129761 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 372301 win 32768
08:08:06.129770 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 388361:404421(16060) ack 1 win 46
08:08:06.130636 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 388361 win 32768
08:08:06.130645 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 404421:420481(16060) ack 1 win 46
08:08:06.131510 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 404421 win 32768
08:08:06.131521 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 420481:436541(16060) ack 1 win 46
08:08:06.132130 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 420481 win 32768
08:08:06.132140 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 436541:452601(16060) ack 1 win 46
08:08:06.132886 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 436541 win 32768
08:08:06.132895 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 452601:468661(16060) ack 1 win 46
08:08:06.133754 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 452601 win 32768
08:08:06.133765 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 468661:484721(16060) ack 1 win 46
08:08:06.134630 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 468661 win 32768
08:08:06.134640 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 484721:500781(16060) ack 1 win 46
08:08:06.135384 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 484721 win 32768
08:08:06.135395 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 500781:516841(16060) ack 1 win 46
08:08:06.136258 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 500781 win 32768
08:08:06.136272 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 516841:532901(16060) ack 1 win 46
08:08:06.137006 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 516841 win 32768
08:08:06.137016 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 532901:548961(16060) ack 1 win 46
08:08:06.137880 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 532901 win 32768
08:08:06.137891 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 548961:565021(16060) ack 1 win 46
08:08:06.138756 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 548961 win 32768
08:08:06.138768 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 565021:581081(16060) ack 1 win 46
08:08:06.139505 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 565021 win 32768
08:08:18.842450 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 3613368208 win 32768
08:08:18.844056 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 16061:32121(16060) ack 0 win 46
08:08:18.843057 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 16061 win 32768
08:08:18.843069 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 32121:48181(16060) ack 0 win 46
08:08:18.843932 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 32121 win 32768
08:08:18.843942 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 48181:64241(16060) ack 0 win 46
08:08:18.844681 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 48181 win 32768
08:08:18.844690 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 64241:80301(16060) ack 0 win 46
08:08:18.845556 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 64241 win 32768
08:08:18.845566 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 80301:96361(16060) ack 0 win 46
08:08:18.846304 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 80301 win 32768
08:08:18.846313 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 96361:112421(16060) ack 0 win 46
08:08:18.847178 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 96361 win 32768
08:08:18.847187 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 112421:128481(16060) ack 0 win 46
08:08:18.848053 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 112421 win 32768
08:08:18.848063 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 128481:144541(16060) ack 0 win 46
08:08:18.848941 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 128481 win 32768
08:08:18.848952 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 144541:160601(16060) ack 0 win 46
08:08:18.849553 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 144541 win 32768
08:08:18.849561 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 160601:176661(16060) ack 0 win 46
08:08:18.850306 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 160601 win 32768
08:08:18.850316 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 176661:192721(16060) ack 0 win 46
08:08:18.851174 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 176661 win 32768
08:08:18.851182 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 192721:208781(16060) ack 0 win 46
08:08:18.852055 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 192721 win 32768
08:08:18.852064 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 208781:224841(16060) ack 0 win 46
08:08:18.852802 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 208781 win 32768
08:08:18.852810 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 224841:240901(16060) ack 0 win 46
08:08:18.853677 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 224841 win 32768
08:08:18.853687 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 240901:256961(16060) ack 0 win 46
08:08:18.854427 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 240901 win 32768
08:08:18.854436 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 256961:273021(16060) ack 0 win 46
08:08:18.855302 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 256961 win 32768
08:08:18.855311 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 273021:289081(16060) ack 0 win 46
08:08:18.856048 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 273021 win 32768
08:08:18.856058 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 289081:305141(16060) ack 0 win 46
08:08:18.856925 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 289081 win 32768
08:08:18.856934 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 305141:321201(16060) ack 0 win 46
08:08:18.857800 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 305141 win 32768
08:08:18.857809 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 321201:337261(16060) ack 0 win 46
08:08:18.858548 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 321201 win 32768
08:08:18.858556 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 337261:353321(16060) ack 0 win 46
08:08:18.859424 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 337261 win 32768
08:08:18.859432 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 353321:369381(16060) ack 0 win 46
08:08:18.860045 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 353321 win 32768
08:08:18.860054 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 369381:385441(16060) ack 0 win 46
08:08:18.860799 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 369381 win 32768
08:08:18.860807 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 385441:401501(16060) ack 0 win 46
08:08:18.861673 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 385441 win 32768
08:08:18.861683 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 401501:417561(16060) ack 0 win 46
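[A note on reading the trace, assuming standard RFC 1323 window scaling:
tcpdump prints the raw 16-bit window field, and each side's field is scaled
by the wscale that side sent in its SYN. So the constant "win 46" on the
work-cluster packets is Linux's own receive window for the (idle) hp->linux
direction, 46 << 7 = 5888 bytes, and it never grows because no data flows
that way. The window that bounds the linux->hp transfer is the HP side's
"win 32768" with wscale 0, i.e. a fixed 32KB, which caps that flow at roughly
32KB per round trip. The arithmetic, as a tiny sketch:]

#include <stdio.h>

int main(void)
{
    /* raw window fields from the dump, wscale from the SYN exchange */
    unsigned hp_win = 32768, hp_wscale = 0; /* hp-ia64's acks */
    unsigned lx_win = 46, lx_wscale = 7;    /* work-cluster's data packets */

    printf("hp-ia64 advertises %u bytes\n", hp_win << hp_wscale);
    printf("linux   advertises %u bytes\n", lx_win << lx_wscale);
    return 0;
}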
* Re: tcp bw in 2.6
2007-10-02 15:09 ` Larry McVoy
@ 2007-10-02 15:41 ` Larry McVoy
2007-10-02 16:25 ` Larry McVoy
` (2 more replies)
2007-10-02 18:29 ` John Heffner
2007-10-02 19:27 ` Linus Torvalds
2 siblings, 3 replies; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 15:41 UTC (permalink / raw)
To: lm, Herbert Xu, torvalds, davem, wscott, netdev
Interesting data point. My test case is like this:
    server
        bind
        listen
        while (newsock = accept...)
            transfer()

    client
        connect
        transfer
If the server side is the source of the data, i.e., its transfer is a
write loop, then I get the bad behaviour. If I switch them so the data
flows in the other direction, then it works: I go from about 14K pkt/sec
to 43K pkt/sec.
Can anyone else reproduce this? I can extract the test case from lmbench
so it is standalone, but I suspect that any test case will do it. I'll
try with the one that John sent. Yup: s/read/write/ and s/write/read/
in his two files at the appropriate places, and I get exactly the same
behaviour.
So is this a bug or intentional?
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
* Re: tcp bw in 2.6
2007-10-02 15:41 ` Larry McVoy
@ 2007-10-02 16:25 ` Larry McVoy
2007-10-02 16:47 ` Stephen Hemminger
2007-10-02 16:34 ` Linus Torvalds
2007-10-02 16:48 ` Ben Greear
2 siblings, 1 reply; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 16:25 UTC (permalink / raw)
To: lm, Herbert Xu, torvalds, davem, wscott, netdev
> If the server side is the source of the data, i.e, it's transfer is a
> write loop, then I get the bad behaviour.
> ...
> So is this a bug or intentional?
For whatever it is worth, I believe that we used to get better performance
from the same hardware. My guess is that it changed somewhere between
2.6.15-1-k7 and 2.6.18-5-k7.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
* Re: tcp bw in 2.6
2007-10-02 15:41 ` Larry McVoy
2007-10-02 16:25 ` Larry McVoy
@ 2007-10-02 16:34 ` Linus Torvalds
2007-10-02 16:48 ` Larry McVoy
2007-10-02 16:48 ` Ben Greear
2 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2007-10-02 16:34 UTC (permalink / raw)
To: Larry McVoy; +Cc: Herbert Xu, davem, wscott, netdev
On Tue, 2 Oct 2007, Larry McVoy wrote:
> Interesting data point. My test case is like this:
>
> server
> bind
> listen
> while (newsock = accept...)
> transfer()
>
> client
> connect
> transfer
>
> If the server side is the source of the data, i.e, it's transfer is a
> write loop, then I get the bad behaviour. If I switch them so the data
> flows in the other direction, then it works, I go from about 14K pkt/sec
> to 43K pkt/sec.
Sounds like accept() possibly initializes slightly different socket
parameters than connect() does.
On the other hand, different network cards will simply have different
behaviour (some due to hardware, some due to driver differences), so I
hope you also switched the processes around and/or used identically
configured machines (and the port configuration on switches could matter,
of course, so it's really best to switch the processes around, to make
sure that the *only* difference is whether the socket was set up by
accept() vs connect()).
> So is this a bug or intentional?
Sounds like a bug to me, modulo the above caveat of making sure that it's
not some hw/driver/switch kind of difference.
Linus
* Re: tcp bw in 2.6
2007-10-02 16:25 ` Larry McVoy
@ 2007-10-02 16:47 ` Stephen Hemminger
2007-10-02 16:49 ` Larry McVoy
2007-10-15 12:40 ` Daniel Schaffrath
0 siblings, 2 replies; 56+ messages in thread
From: Stephen Hemminger @ 2007-10-02 16:47 UTC (permalink / raw)
To: Larry McVoy; +Cc: lm, Herbert Xu, torvalds, davem, wscott, netdev
On Tue, 2 Oct 2007 09:25:34 -0700
lm@bitmover.com (Larry McVoy) wrote:
> > If the server side is the source of the data, i.e, it's transfer is a
> > write loop, then I get the bad behaviour.
> > ...
> > So is this a bug or intentional?
>
> For whatever it is worth, I believed that we used to get better performance
> from the same hardware. My guess is that it changed somewhere between
> 2.6.15-1-k7 and 2.6.18-5-k7.
For the period from 2.6.15 to 2.6.18, the kernel by default enabled TCP
Appropriate Byte Counting. This caused bad performance on applications that
did small writes.
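[The setting in question is the net.ipv4.tcp_abc sysctl. A minimal sketch
(not from the original mail) for checking it from a test program, on kernels
that have it:]

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_abc", "r");
    int abc;

    if (f && fscanf(f, "%d", &abc) == 1)
        printf("tcp_abc = %d (0 = byte counting off)\n", abc);
    else
        printf("no tcp_abc sysctl on this kernel\n");
    if (f)
        fclose(f);
    return 0;
}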
--
Stephen Hemminger <shemminger@linux-foundation.org>
* Re: tcp bw in 2.6
2007-10-02 15:41 ` Larry McVoy
2007-10-02 16:25 ` Larry McVoy
2007-10-02 16:34 ` Linus Torvalds
@ 2007-10-02 16:48 ` Ben Greear
2007-10-02 17:11 ` Larry McVoy
2 siblings, 1 reply; 56+ messages in thread
From: Ben Greear @ 2007-10-02 16:48 UTC (permalink / raw)
To: lm, Herbert Xu, torvalds, davem, wscott, netdev
Larry McVoy wrote:
> Interesting data point. My test case is like this:
>
> server
> bind
> listen
> while (newsock = accept...)
> transfer()
>
> client
> connect
> transfer
>
> If the server side is the source of the data, i.e, it's transfer is a
> write loop, then I get the bad behaviour. If I switch them so the data
> flows in the other direction, then it works, I go from about 14K pkt/sec
> to 43K pkt/sec.
>
> Can anyone else reproduce this? I can extract the test case from lmbench
> so it is standalone but I suspect that any test case will do it. I'll
> try with the one that John sent. Yup, s/read/write/ and s/write/read/
> in his two files at the appropriate places and I get exactly the same
> behaviour.
>
> So is this a bug or intentional?
>
I have a more complex configuration & application, but I don't see this
problem in my testing. Using e1000 NICs and modern hardware I can set up
a connection between two machines and run 800+Mbps in both directions, or
near line speed in one direction if the other direction is mostly silent.
I am purposefully setting the socket send/rx buffers, as well as twiddling
with the tcp and netdev related tunables. If you want, I can email these
tweaks to you.
NICs and busses have a huge impact on performance, so make sure those
are good.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: tcp bw in 2.6
2007-10-02 16:34 ` Linus Torvalds
@ 2007-10-02 16:48 ` Larry McVoy
2007-10-02 21:16 ` David Miller
0 siblings, 1 reply; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 16:48 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Larry McVoy, Herbert Xu, davem, wscott, netdev
Isn't this something so straightforward that you would have tests for it?
This is the basic FTP server loop; doesn't someone have a big machine with
10gig cards that tests that sending/receiving data doesn't regress?
> Sounds like a bug to me, modulo the above caveat of making sure that it's
> not some hw/driver/switch kind of difference.
Pretty unlikely given that we've changed the switch, the card works fine
in the other direction, and I'm 95% sure that we used to get better perf
before we switched to a more recent kernel.
I'll try and find some other gig ether cards and try them.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
* Re: tcp bw in 2.6
2007-10-02 16:47 ` Stephen Hemminger
@ 2007-10-02 16:49 ` Larry McVoy
2007-10-02 17:10 ` Stephen Hemminger
2007-10-15 12:40 ` Daniel Schaffrath
1 sibling, 1 reply; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 16:49 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Larry McVoy, Herbert Xu, torvalds, davem, wscott, netdev
On Tue, Oct 02, 2007 at 09:47:26AM -0700, Stephen Hemminger wrote:
> On Tue, 2 Oct 2007 09:25:34 -0700
> lm@bitmover.com (Larry McVoy) wrote:
>
> > > If the server side is the source of the data, i.e, it's transfer is a
> > > write loop, then I get the bad behaviour.
> > > ...
> > > So is this a bug or intentional?
> >
> > For whatever it is worth, I believed that we used to get better performance
> > from the same hardware. My guess is that it changed somewhere between
> > 2.6.15-1-k7 and 2.6.18-5-k7.
>
> For the period from 2.6.15 to 2.6.18, the kernel by default enabled TCP
> Appropriate Byte Counting. This caused bad performance on applications that
> did small writes.
It's doing 1MB writes.
Is there a sockopt to turn that off? Or /proc or something?
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
* Re: tcp bw in 2.6
2007-10-02 16:49 ` Larry McVoy
@ 2007-10-02 17:10 ` Stephen Hemminger
0 siblings, 0 replies; 56+ messages in thread
From: Stephen Hemminger @ 2007-10-02 17:10 UTC (permalink / raw)
To: Larry McVoy; +Cc: Larry McVoy, Herbert Xu, torvalds, davem, wscott, netdev
On Tue, 2 Oct 2007 09:49:52 -0700
lm@bitmover.com (Larry McVoy) wrote:
> On Tue, Oct 02, 2007 at 09:47:26AM -0700, Stephen Hemminger wrote:
> > On Tue, 2 Oct 2007 09:25:34 -0700
> > lm@bitmover.com (Larry McVoy) wrote:
> >
> > > > If the server side is the source of the data, i.e, it's transfer is a
> > > > write loop, then I get the bad behaviour.
> > > > ...
> > > > So is this a bug or intentional?
> > >
> > > For whatever it is worth, I believed that we used to get better performance
> > > from the same hardware. My guess is that it changed somewhere between
> > > 2.6.15-1-k7 and 2.6.18-5-k7.
> >
> > For the period from 2.6.15 to 2.6.18, the kernel by default enabled TCP
> > Appropriate Byte Counting. This caused bad performance on applications that
> > did small writes.
>
> It's doing 1MB writes.
>
> Is there a sockopt to turn that off? Or /proc or something?
sysctl -w net.ipv4.tcp_abc=0
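[To make that stick across reboots, the equivalent line can go in
/etc/sysctl.conf: net.ipv4.tcp_abc = 0]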
--
Stephen Hemminger <shemminger@linux-foundation.org>
* Re: tcp bw in 2.6
2007-10-02 16:48 ` Ben Greear
@ 2007-10-02 17:11 ` Larry McVoy
2007-10-02 17:18 ` Ben Greear
0 siblings, 1 reply; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 17:11 UTC (permalink / raw)
To: Ben Greear; +Cc: lm, Herbert Xu, torvalds, davem, wscott, netdev
> I have a more complex configuration & application, but I don't see this
> problem in my testing. Using e1000 nics and modern hardware
I'm using a similar setup; what kernel are you using?
> I am purposefully setting the socket send/rx buffers, as well has
> twiddling with the tcp and netdev related tunables.
Ben sent those to me (see below); they didn't make any difference.
I tried diddling the socket send/recv buffers to 10MB, and that
didn't help. The defaults didn't help. 1MB didn't help and 64K
didn't help.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
* Re: tcp bw in 2.6
2007-10-02 2:20 ` Larry McVoy
2007-10-02 3:50 ` David Miller
2007-10-02 15:06 ` John Heffner
@ 2007-10-02 17:14 ` Rick Jones
2007-10-02 17:20 ` Larry McVoy
2007-10-03 7:19 ` Bill Fink
2 siblings, 2 replies; 56+ messages in thread
From: Rick Jones @ 2007-10-02 17:14 UTC (permalink / raw)
To: Larry McVoy; +Cc: Linus Torvalds, davem, wscott, netdev
Larry McVoy wrote:
> A short summary is "can someone please post a test program that sources
> and sinks data at the wire speed?" because apparently I'm too old and
> clueless to write such a thing.
http://www.netperf.org/svn/netperf2/trunk/
:)
WRT the different speeds in each direction talking with HP-UX, perhaps there is
an interaction between the Linux TCP stack (TSO perhaps) and HP-UX's ACK
avoidance heuristics. If that is the case, tweaking tcp_deferred_ack_max with
ndd on the HP-UX system might yield different results.
I don't recall if the igelan (broadcom) driver in HP-UX attempts to auto-tune
the interrupt throttling. I do believe the iether (intel) driver in HP-UX does.
That can be altered via lanadmin -X mumble... commands.
Later (although later than a 2.6.18 kernel IIRC) e1000 drivers do try to
auto-tune the interrupt throttling and one can see oscillations when an e1000
driver is talking to an e1000 driver. I think that can only be changed via the
InterruptThrottleRate e1000 module parameter in that era of kernel - not sure if
the Intel folks have that available via ethtool on contemporary kernels now or not.
WRT the small program making a setsockopt(SO_*BUF) call going slower than the
rsh, does rsh make the setsockopt() call, or does it bend itself to the will of
the linux stack's autotuning? What happens if your small program does not make
setsockopt(SO_*BUF) calls?
Other misc observations of variable value:
*) depending on the quantity of CPU around, and the type of test one is running,
results can be better/worse depending on the CPU to which you bind the
application. Latency tends to be best when running on the same core that takes
interrupts from the NIC; bulk transfer can be better when running on a different
core, although generally better when that is a different core on the same chip. These
days the throughput stuff is more easily seen on 10G, but the netperf service
demand changes are still visible on 1G.
*) agreement with the observation that the small recv calls suggest that the
application is keeping up with the network. I doubt that SO_*BUF settings would
change that, but perhaps setting watermarks might (wild ass guess). The
watermarks will do nothing on HP-UX though (IIRC).
rick jones
^ permalink raw reply [flat|nested] 56+ messages in thread
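(For reference, a minimal source/sink pair along the lines Larry is asking for can be
sketched in one small C file. The port number and 1MB buffer below are arbitrary
choices, not the program actually used in this thread, and everything beyond basic
error handling is stripped out:

/* "./a.out -s" sources data (listens, accepts, writes zeros forever);
 * "./a.out <server-ip>" sinks data (connects, reads forever).
 * Minimal sketch only - no SO_*BUF calls, so the stack autotunes. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define PORT  31235
#define BUFSZ (1024 * 1024)
static char buf[BUFSZ];                 /* stays zeroed, like /dev/zero */

int main(int argc, char **argv)
{
	struct sockaddr_in a;
	int s, c, one = 1;

	memset(&a, 0, sizeof(a));
	a.sin_family = AF_INET;
	a.sin_port = htons(PORT);

	if (argc > 1 && strcmp(argv[1], "-s") == 0) {
		/* source side: listen, accept one connection, write forever */
		a.sin_addr.s_addr = htonl(INADDR_ANY);
		s = socket(AF_INET, SOCK_STREAM, 0);
		setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
		if (bind(s, (struct sockaddr *)&a, sizeof(a)) || listen(s, 1)) {
			perror("bind/listen");
			return 1;
		}
		c = accept(s, NULL, NULL);
		while (write(c, buf, BUFSZ) > 0)
			;
	} else if (argc > 1) {
		/* sink side: connect and read forever */
		a.sin_addr.s_addr = inet_addr(argv[1]);
		c = socket(AF_INET, SOCK_STREAM, 0);
		if (connect(c, (struct sockaddr *)&a, sizeof(a))) {
			perror("connect");
			return 1;
		}
		while (read(c, buf, BUFSZ) > 0)
			;
	} else {
		fprintf(stderr, "usage: %s -s | %s <server-ip>\n", argv[0], argv[0]);
		return 1;
	}
	return 0;
}

Run "-s" on the box under test and the bare-IP form on the peer, then swap which
side runs "-s" to compare directions; it deliberately makes no SO_SNDBUF/SO_RCVBUF
calls so the stack's autotuning applies, which is the variant being debated above.)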
* Re: tcp bw in 2.6
2007-10-02 17:11 ` Larry McVoy
@ 2007-10-02 17:18 ` Ben Greear
2007-10-02 17:21 ` Larry McVoy
0 siblings, 1 reply; 56+ messages in thread
From: Ben Greear @ 2007-10-02 17:18 UTC (permalink / raw)
To: lm, Ben Greear, Herbert Xu, torvalds, davem, wscott, netdev
Larry McVoy wrote:
>> I have a more complex configuration & application, but I don't see this
>> problem in my testing. Using e1000 nics and modern hardware
>>
>
> I'm using a similar setup, what kernel are you using?
>
I'm currently on 2.6.20, and have also tried 10gbe nics on 2.6.23 with
good results. At least for my app, performance has been pretty steady
at least as far back as the .18 kernels, and probably before....
I do 64k or smaller writes & reads, and non-blocking IO (not sure if that
would matter..but I do :)
Have you tried something like ttcp, iperf, or even regular ftp?
Checked your nics to make sure they have no errors and are negotiated
to full duplex?
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply [flat|nested] 56+ messages in thread
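(The duplex and error checks Ben suggests map onto a couple of standard commands;
a sketch, with the interface name assumed:

ethtool eth0 | egrep 'Speed|Duplex'   # expect 1000Mb/s, Full
ethtool -S eth0 | grep -i err         # driver statistics / error counters
ifconfig eth0                         # generic RX/TX errors, drops, overruns

A half-duplex negotiation or a climbing error counter could produce this kind of
one-directional slowdown.)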
* Re: tcp bw in 2.6
2007-10-02 17:14 ` Rick Jones
@ 2007-10-02 17:20 ` Larry McVoy
2007-10-02 18:01 ` Rick Jones
2007-10-03 7:19 ` Bill Fink
1 sibling, 1 reply; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 17:20 UTC (permalink / raw)
To: Rick Jones; +Cc: Larry McVoy, Linus Torvalds, davem, wscott, netdev
On Tue, Oct 02, 2007 at 10:14:11AM -0700, Rick Jones wrote:
> Larry McVoy wrote:
> >A short summary is "can someone please post a test program that sources
> >and sinks data at the wire speed?" because apparently I'm too old and
> >clueless to write such a thing.
>
> WRT the different speeds in each direction talking with HP-UX, perhaps
> there is an interaction between the Linux TCP stack (TSO perhaps) and
> HP-UX's ACK avoidance heuristics. If that is the case, tweaking
> tcp_deferred_ack_max with ndd on the HP-UX system might yield different
> results.
I doubt it because I see the same sort of behaviour when I have a group
of Linux clients talking to the server. The HP box is in the mix
simply because it has a gigabit card and that makes driving the load
simpler. But if I do several loads from 100Mbit clients I get the same
packet throughput.
> WRT the small program making a setsockopt(SO_*BUF) call going slower than
> the rsh, does rsh make the setsockopt() call, or does it bend itself to the
> will of the linux stack's autotuning? What happens if your small program
> does not make setsockopt(SO_*BUF) calls?
I haven't tracked down if rsh does that but I've tried doing it with
values of default, 64K, 1MB, and 10MB with no difference.
> *) depending on the quantity of CPU around, and the type of test one is
These are fast CPUs and they are running at 93% idle while running the test.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 17:18 ` Ben Greear
@ 2007-10-02 17:21 ` Larry McVoy
2007-10-02 17:54 ` Stephen Hemminger
0 siblings, 1 reply; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 17:21 UTC (permalink / raw)
To: Ben Greear; +Cc: lm, Herbert Xu, torvalds, davem, wscott, netdev
> I'm currently on 2.6.20, and have also tried 10gbe nics on 2.6.23 with
My guess is that it is a bug in the debian 2.6.18 kernel.
> Have you tried something like ttcp, iperf, or even regular ftp?
Yeah, I've factored out the code since BitKeeper, my test program,
and John's test program all exhibit the same behaviour. Also switched
switches.
> Checked your nics to make sure they have no errors and are negotiated
> to full duplex?
Yup and yup.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 17:21 ` Larry McVoy
@ 2007-10-02 17:54 ` Stephen Hemminger
2007-10-02 18:35 ` Larry McVoy
0 siblings, 1 reply; 56+ messages in thread
From: Stephen Hemminger @ 2007-10-02 17:54 UTC (permalink / raw)
To: Larry McVoy; +Cc: Ben Greear, lm, Herbert Xu, torvalds, davem, wscott, netdev
On Tue, 2 Oct 2007 10:21:55 -0700
lm@bitmover.com (Larry McVoy) wrote:
> > I'm currently on 2.6.20, and have also tried 10gbe nics on 2.6.23 with
>
> My guess is that it is a bug in the debian 2.6.18 kernel.
>
> > Have you tried something like ttcp, iperf, or even regular ftp?
>
> Yeah, I've factored out the code since BitKeeper, my test program,
> and John's test program all exhibit the same behaviour. Also switched
> switches.
>
> > Checked your nics to make sure they have no errors and are negotiated
> > to full duplex?
>
> Yup and yup.
Make sure you don't have slab debugging turned on. It kills performance.
--
Stephen Hemminger <shemminger@linux-foundation.org>
^ permalink raw reply [flat|nested] 56+ messages in thread
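(A quick way to check that on a distro binary kernel, assuming the config file is
shipped alongside the kernel the way Debian does:

grep DEBUG_SLAB /boot/config-$(uname -r)
# CONFIG_DEBUG_SLAB=y would be the performance killer; "is not set" is what you want
zcat /proc/config.gz | grep DEBUG_SLAB   # alternative, only if CONFIG_IKCONFIG_PROC is enabled
)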
* Re: tcp bw in 2.6
2007-10-02 17:20 ` Larry McVoy
@ 2007-10-02 18:01 ` Rick Jones
2007-10-02 18:40 ` Larry McVoy
0 siblings, 1 reply; 56+ messages in thread
From: Rick Jones @ 2007-10-02 18:01 UTC (permalink / raw)
To: Larry McVoy; +Cc: Linus Torvalds, davem, wscott, netdev
has anyone already asked whether link-layer flow-control is enabled?
rick jones
^ permalink raw reply [flat|nested] 56+ messages in thread
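(Link-layer (802.3x pause) flow control can be checked and toggled per interface
with ethtool; a sketch, interface name assumed:

ethtool -a eth0                 # shows Autonegotiate / RX / TX pause state
ethtool -A eth0 rx off tx off   # disable pause frames for a test run

Whether the switch generates or honours pause frames is a separate setting on the
switch itself.)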
* Re: tcp bw in 2.6
2007-10-02 15:09 ` Larry McVoy
2007-10-02 15:41 ` Larry McVoy
@ 2007-10-02 18:29 ` John Heffner
2007-10-02 19:07 ` Larry McVoy
2007-10-02 19:27 ` Linus Torvalds
2 siblings, 1 reply; 56+ messages in thread
From: John Heffner @ 2007-10-02 18:29 UTC (permalink / raw)
To: lm, Herbert Xu, torvalds, davem, wscott, netdev
Larry McVoy wrote:
> On Tue, Oct 02, 2007 at 06:52:54PM +0800, Herbert Xu wrote:
>>> One of my clients also has gigabit so I played around with just that
>>> one and it (itanium running hpux w/ broadcom gigabit) can push the load
>>> as well. One weird thing is that it is dependent on the direction the
>>> data is flowing. If the hp is sending then I get 46MB/sec, if linux is
>>> sending then I get 18MB/sec. Weird. Linux is debian, running
>> First of all check the CPU load on both sides to see if either
>> of them is saturating. If the CPU's fine then look at the tcpdump
>> output to see if both receivers are using the same window settings.
>
> tcpdump is a good idea, take a look at this. The window starts out
> at 46 and never opens up in my test case, but in the rsh case it
> starts out the same but does open up. Ideas?
(Binary tcpdumps are always better than ascii.)
The window on the sender (linux box) starts at 46. It doesn't open up,
but it's not receiving data so it doesn't matter, and you don't expect
it to. The HP box always announces a window of 32768.
Looks like you have TSO enabled. Does it behave differently if it's
disabled? I think Rick Jones is on to something with the HP ack
avoidance. Looks like a pretty low ack ratio, and it might not be
interacting well with TSO, especially at such a small window size.
-John
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 17:54 ` Stephen Hemminger
@ 2007-10-02 18:35 ` Larry McVoy
0 siblings, 0 replies; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 18:35 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Larry McVoy, Ben Greear, Herbert Xu, torvalds, davem, wscott,
netdev
> Make sure you don't have slab debugging turned on. It kills performance.
It's a stock debian kernel, so unless they turn it on it's off.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 18:01 ` Rick Jones
@ 2007-10-02 18:40 ` Larry McVoy
2007-10-02 19:47 ` Rick Jones
2007-10-02 21:32 ` David Miller
0 siblings, 2 replies; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 18:40 UTC (permalink / raw)
To: Rick Jones; +Cc: Larry McVoy, Linus Torvalds, davem, wscott, netdev
On Tue, Oct 02, 2007 at 11:01:47AM -0700, Rick Jones wrote:
> has anyone already asked whether link-layer flow-control is enabled?
I doubt it, the same test works fine in one direction and poorly in the other.
Wouldn't the flow control squelch either way?
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 18:29 ` John Heffner
@ 2007-10-02 19:07 ` Larry McVoy
2007-10-02 19:29 ` Linus Torvalds
2007-10-02 19:33 ` Larry McVoy
0 siblings, 2 replies; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 19:07 UTC (permalink / raw)
To: John Heffner; +Cc: lm, Herbert Xu, torvalds, davem, wscott, netdev
> Looks like you have TSO enabled. Does it behave differently if it's
> disabled?
It cranks the interrupts/sec up to 8K instead of 5K. No difference in
performance other than that.
> I think Rick Jones is on to something with the HP ack avoidance.
I sincerely doubt it. I'm only using the HP box because it has gigabit
so it's a single connection. I can produce almost identical results by
doing the same sorts of tests with several linux clients. One direction
goes fast and the other goes slow.
3x performance difference depending on the direction of data flow:
# Server is receiving, goes fast
$ for i in 22 24 25 26; do rsh -n glibc$i dd if=/dev/zero|dd of=/dev/null & done
load free cach swap pgin pgou dk0 dk1 dk2 dk3 ipkt opkt int ctx usr sys idl
0.98 0 0 0 0 0 0 0 0 0 30K 15K 8.1K 68K 12 66 22
0.98 0 0 0 0 0 0 0 0 0 29K 15K 8.2K 67K 11 64 25
0.98 0 0 0 0 0 0 0 0 0 29K 15K 8.2K 67K 12 66 22
# Server is sending, goes slow
$ for i in 22 24 25 26; do dd if=/dev/zero|rsh glibc$i dd of=/dev/null & done
load free cach swap pgin pgou dk0 dk1 dk2 dk3 ipkt opkt int ctx usr sys idl
1.06 0 0 0 0 0 0 0 0 0 5.0K 10K 4.4K 8.4K 21 17 62
0.97 0 0 0 0 0 0 0 0 0 5.1K 10K 4.4K 8.9K 2 15 83
0.97 0 0 0 0 0 0 0 0 0 5.0K 10K 4.4K 8.6K 21 26 53
$ for i in 22 24 25 26; do rsh glibc$i cat /etc/motd; done | grep Welcome
Welcome to redhat71.bitmover.com, a 2Ghz Athlon running Red Hat 7.1.
Welcome to glibc24.bitmover.com, a 1.2Ghz Athlon running SUSE 10.1.
Welcome to glibc25.bitmover.com, a 2Ghz Athlon running Fedora Core 6
Welcome to glibc26.bitmover.com, a 2Ghz Athlon running Fedora Core 7
$ for i in 22 24 25 26; do rsh glibc$i uname -r; done
2.4.2-2
2.6.16.13-4-default
2.6.18-1.2798.fc6
2.6.22.4-65.fc7
No HP in the mix. It's got nothing to do with hp, nor to do with rsh, it
has everything to do with the direction the data is flowing.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 15:09 ` Larry McVoy
2007-10-02 15:41 ` Larry McVoy
2007-10-02 18:29 ` John Heffner
@ 2007-10-02 19:27 ` Linus Torvalds
2007-10-02 19:53 ` Rick Jones
2007-10-02 20:33 ` David Miller
2 siblings, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2007-10-02 19:27 UTC (permalink / raw)
To: Larry McVoy; +Cc: Herbert Xu, davem, wscott, netdev
On Tue, 2 Oct 2007, Larry McVoy wrote:
>
> tcpdump is a good idea, take a look at this. The window starts out
> at 46 and never opens up in my test case, but in the rsh case it
> starts out the same but does open up. Ideas?
I don't think that's an issue, since you only send one way. The window
opening up only matters for the receiver. Also, you missed the "wscale=7"
at the beginning, so the window of "46" looks like it actually is 5888 (ie
fits four segments - and it's not grown because it never gets any data).
However, I think this is some strange TSO artifact:
...
> 08:08:18.843942 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 48181:64241(16060) ack 0 win 46
> 08:08:18.844681 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 48181 win 32768
> 08:08:18.844690 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 64241:80301(16060) ack 0 win 46
> 08:08:18.845556 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 64241 win 32768
> 08:08:18.845566 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 80301:96361(16060) ack 0 win 46
> 08:08:18.846304 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 80301 win 32768
...
We see a single packet containing 16060 bytes, which seems to be because
of TSO on the sending side (you did your tcpdump on the sender, no?), so
it will actually be broken up into 11 1460-byte regular frames by the
network card, since they started out agreeing on a standard 1460-byte MSS.
So the above is not a jumbo frame, it just kind of looks like one when you
capture it on the sender side.
And maybe a 32kB window is not big enough when it causes the networking
code to basically just have a single packet outstanding.
I also would have expected more ACK's from the HP box. It's been a long
time since I did TCP, but I thought the rule was still that you were
supposed to ACK at least every other full frame - but the HP box is acking
roughly every 16K (and it's *not* always at TSO boundaries: the earlier
ACK's in the sequence are at 1460-byte packet boundaries, but it does seem
to end up getting into that pattern later on).
So I'm wondering if we get into some bad pattern with the networking code
trying to make big TSO packets for e1000, but because they are *so* big
that there's only room for two such packets per window, you don't get into
any smooth pattern with lots of outstanding packets, but it starts
stuttering.
Larry, try turning off TSO. Or rather, make the kernel use a smaller limit
for the large packets. The easiest way to do that should be to just change
the value in /proc/sys/net/ipv4/tcp_tso_win_divisor. It defaults to 3, try
doing
echo 6 > /proc/sys/net/ipv4/tcp_tso_win_divisor
and see if that changes anything.
And maybe I'm just whistling in the dark. In fact, it looks like for you
it's not 3, but 2 (window of 32768, but the TSO frames are half the size).
So maybe I'm just totally confused and I'm not reading that tcp dump
correctly at all!
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
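(For anyone following along: on kernels of this vintage TSO can also be inspected
and toggled per interface with ethtool, in addition to the divisor knob Linus
mentions; eth0 below is just a placeholder for whichever port carries the test
traffic:

ethtool -k eth0                    # shows "tcp segmentation offload: on/off"
ethtool -K eth0 tso off            # disable TSO for the next run
echo 6 > /proc/sys/net/ipv4/tcp_tso_win_divisor

Disabling TSO outright and changing the divisor are independent experiments;
comparing both against the baseline helps separate TSO sizing effects from the
window-size limit.)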
* Re: tcp bw in 2.6
2007-10-02 19:07 ` Larry McVoy
@ 2007-10-02 19:29 ` Linus Torvalds
2007-10-02 20:31 ` David Miller
2007-10-02 19:33 ` Larry McVoy
1 sibling, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2007-10-02 19:29 UTC (permalink / raw)
To: Larry McVoy; +Cc: John Heffner, Herbert Xu, davem, wscott, netdev
On Tue, 2 Oct 2007, Larry McVoy wrote:
>
> No HP in the mix. It's got nothing to do with hp, nor to do with rsh, it
> has everything to do with the direction the data is flowing.
Can you tcpdump both cases and send snippets (both of steady-state, and
the initial connect)?
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
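(A capture recipe along those lines, as a sketch - interface, port and packet
counts are placeholders, and -s 0 keeps whole frames so the dumps stay binary
rather than ascii:

tcpdump -i eth0 -s 0 -c 100 -w connect.pcap port 8888 &   # start before the transfer: connection setup
sleep 10
tcpdump -i eth0 -s 0 -c 100 -w steady.pcap port 8888      # steady state, after the window has opened

Repeating the pair with the data flowing the other way gives the traces for both
cases.)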
* Re: tcp bw in 2.6
2007-10-02 19:07 ` Larry McVoy
2007-10-02 19:29 ` Linus Torvalds
@ 2007-10-02 19:33 ` Larry McVoy
2007-10-02 19:53 ` John Heffner
1 sibling, 1 reply; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 19:33 UTC (permalink / raw)
To: Larry McVoy, John Heffner, Herbert Xu, torvalds, davem, wscott,
netdev
More data, we've conclusively eliminated the card / cpu from the mix.
We've got 2 ia64 boxes with e1000 interfaces. One box is running
linux 2.6.12 and the other is running hpux 11.
I made sure the linux one was running at gigabit and reran the tests
from the linux/ia64 <=> hp/ia64. Same results, when linux sends
it is slow, when it receives it is fast.
And note carefully: we've removed hpux from the equation, we can do
the same tests from linux to multiple linux clients and see the same
thing, sending from the server is slow, receiving on the server is
fast.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 18:40 ` Larry McVoy
@ 2007-10-02 19:47 ` Rick Jones
2007-10-02 21:32 ` David Miller
1 sibling, 0 replies; 56+ messages in thread
From: Rick Jones @ 2007-10-02 19:47 UTC (permalink / raw)
To: Larry McVoy; +Cc: Linus Torvalds, davem, wscott, netdev
Larry McVoy wrote:
> On Tue, Oct 02, 2007 at 11:01:47AM -0700, Rick Jones wrote:
>
>>has anyone already asked whether link-layer flow-control is enabled?
>
>
> I doubt it, the same test works fine in one direction and poorly in the other.
> Wouldn't the flow control squelch either way?
While I am often guilty of it, a wise old engineer tried to teach me that the
proper spelling is ass-u-me :) I wouldn't count on it hitting in both
directions, depends on the specifics of the situation.
WRT the HP-UX ACK avoidance heuristic, the default HP-UX socket buffer/window is
32768, and tcp_deferred_ack_max defaults to 22. That isn't really all that good
a combination - with a window of 32768, a deferred ack value of 11 would be better.
You could also go ahead and try it with a value of 2. Or, bump the window
size defaults - tcp_recv_hiwater_def and tcp_xmit_hiwater_def - to say 65535 or
128K or something - or use the setsockopt() calls to effect that.
rick jones
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 19:27 ` Linus Torvalds
@ 2007-10-02 19:53 ` Rick Jones
2007-10-02 20:33 ` David Miller
1 sibling, 0 replies; 56+ messages in thread
From: Rick Jones @ 2007-10-02 19:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Larry McVoy, Herbert Xu, davem, wscott, netdev
> I also would have expected more ACK's from the HP box. It's been a long
> time since I did TCP, but I thought the rule was still that you were
> supposed to ACK at least every other full frame - but the HP box is acking
> roughly every 16K (and it's *not* always at TSO boundaries: the earlier
> ACK's in the sequence are at 1460-byte packet boundaries, but it does seem
> to end up getting into that pattern later on).
Drift...
The RFCs say "SHOULD" (emphasis theirs) rather than "MUST."
Both HP-UX and Solaris have rather robust ACK avoidance heuristics to cut down
on the CPU overhead of bulk transfers. (That they both have them stems from
their being cousins, sharing a common TCP stack ancestor long ago - both of
course have been diverging since then).
rick jones
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 19:33 ` Larry McVoy
@ 2007-10-02 19:53 ` John Heffner
2007-10-02 20:14 ` Larry McVoy
0 siblings, 1 reply; 56+ messages in thread
From: John Heffner @ 2007-10-02 19:53 UTC (permalink / raw)
To: lm, John Heffner, Herbert Xu, torvalds, davem, wscott, netdev
Larry McVoy wrote:
> More data, we've conclusively eliminated the card / cpu from the mix.
> We've got 2 ia64 boxes with e1000 interfaces. One box is running
> linux 2.6.12 and the other is running hpux 11.
>
> I made sure the linux one was running at gigabit and reran the tests
> from the linux/ia64 <=> hp/ia64. Same results, when linux sends
> it is slow, when it receives it is fast.
>
> And note carefully: we've removed hpux from the equation, we can do
> the same tests from linux to multiple linux clients and see the same
> thing, sending from the server is slow, receiving on the server is
> fast.
I think I'm still missing some basic data here (probably because this
thread did not originate on netdev). Let me try to nail down some of
the basics. You have a linux ia64 box (running 2.6.12 or 2.6.18?) that
sends slowly, and receives faster, but not quite at 1 Gbps? And this is
true regardless of which peer it sends or receives from? And the
behavior is different depending on which kernel? How, and which kernel
versions? Do you have other hardware running the same kernel that
behaves the same or differently?
Have you done ethernet cable tests? Have you tried measuring the udp
sending rate? (Iperf can do this.) Are there any error counters on the
interface?
-John
^ permalink raw reply [flat|nested] 56+ messages in thread
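(The UDP rate measurement John mentions looks roughly like this with iperf; the
900 Mbit/s offered load and 30 second run are arbitrary choices:

iperf -s -u                                # on the receiver
iperf -c <receiver-ip> -u -b 900M -t 30    # on the sender

If UDP can fill the wire in both directions while TCP cannot, the link and NICs
are largely ruled out and the problem is in the TCP sending path.)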
* Re: tcp bw in 2.6
2007-10-02 19:53 ` John Heffner
@ 2007-10-02 20:14 ` Larry McVoy
2007-10-02 20:40 ` Rick Jones
2007-10-02 20:42 ` Wayne Scott
0 siblings, 2 replies; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 20:14 UTC (permalink / raw)
To: John Heffner; +Cc: lm, Herbert Xu, torvalds, davem, wscott, netdev
> I think I'm still missing some basic data here (probably because this
> thread did not originate on netdev). Let me try to nail down some of
> the basics. You have a linux ia64 box (running 2.6.12 or 2.6.18?) that
> sends slowly, and receives faster, but not quite at 1 Gbps? And this is
> true regardless of which peer it sends or receives from? And the
> behavior is different depending on which kernel? How, and which kernel
> versions? Do you have other hardware running the same kernel that
> behaves the same or differently?
just got off the phone with Linus and he thinks the side that does
the accept is the problem side, i.e., if you are the server, you do the
accept, and you send the data, you'll go slow. But as I'm writing this
I realize he's wrong, because it is the combination of accept & send.
accept & recv goes fast.
A trivial way to see the problem is to take two linux boxes, on each
apt-get install rsh-client rsh-server
set up your .rhosts,
and then do
dd if=/dev/zero count=100000 | rsh OTHER_BOX dd of=/dev/null
rsh OTHER_BOX dd if=/dev/zero count=100000 | dd of=/dev/null
See if you get balanced results. For me, I get 45MB/sec one way, and
15-19MB/sec the other way.
I've tried the same test linux - linux and linux - hpux. Same results.
The test setup I have is
work: 2ghz x 2 Athlons, e1000, 2.6.18
ia64: 900mhz Itanium, e1000, 2.6.12
hp-ia64:900mhz Itanium, e1000, hpux 11
glibc*: 1-2ghz athlons running various linux releases
all connected through a netgear 724T 10/100/1000 switch (a linksys showed
identical results).
I tested
work <-> hp-ia64
work <-> ia64
ia64 <-> hp-ia64
and in all cases, one direction worked fast and the other didn't.
It would be good if people tried the same simple test. You have to
use rsh, ssh will slow things down way too much.
Alternatively, take your favorite test programs, such as John's,
and make a second pair that reverses the direction the data is
sent. So one pair is server sends, the other is server receives,
try both. That's where we started, BitKeeper, my stripped down test,
and John's test all exhibit the same behavior. And the rsh test
is just a really simple way to demonstrate it.
Wayne, Linus asked for tcp dumps from just one side: capture the first 100
packets, then wait 10 seconds or so for the window to open up, and then grab
a snapshot of another 100 packets. Do that for both directions
and send them to the list. Can you do that? I want to get lunch, I'm
starving.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 19:29 ` Linus Torvalds
@ 2007-10-02 20:31 ` David Miller
0 siblings, 0 replies; 56+ messages in thread
From: David Miller @ 2007-10-02 20:31 UTC (permalink / raw)
To: torvalds; +Cc: lm, jheffner, herbert, wscott, netdev
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 2 Oct 2007 12:29:50 -0700 (PDT)
> On Tue, 2 Oct 2007, Larry McVoy wrote:
> >
> > No HP in the mix. It's got nothing to do with hp, nor to do with rsh, it
> > has everything to do with the direction the data is flowing.
>
> Can you tcpdump both cases and send snippets (both of steady-state, and
> the initial connect)?
Another thing I'd like to see is if something more recent than 2.6.18
also reproduces the problem.
It could be just some bug we've fixed in the past year :)
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 19:27 ` Linus Torvalds
2007-10-02 19:53 ` Rick Jones
@ 2007-10-02 20:33 ` David Miller
2007-10-02 20:44 ` Roland Dreier
2007-10-02 21:21 ` Larry McVoy
1 sibling, 2 replies; 56+ messages in thread
From: David Miller @ 2007-10-02 20:33 UTC (permalink / raw)
To: torvalds; +Cc: lm, herbert, wscott, netdev
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 2 Oct 2007 12:27:53 -0700 (PDT)
> We see a single packet containing 16060 bytes, which seems to be because
> of TSO on the sending side (you did your tcpdump on the sender, no?), so
> it will actually be broken up into 11 1460-byte regular frames by the
> network card, since they started out agreeing on a standard 1460-byte MSS.
> So the above is not a jumbo frame, it just kind of looks like one when you
> capture it on the sender side.
>
> And maybe a 32kB window is not big enough when it causes the networking
> code to basically just have a single packet outstanding.
We fixed a lot of bugs in TSO last year.
It would be really great to see numbers with a more recent kernel
than 2.6.18
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 20:14 ` Larry McVoy
@ 2007-10-02 20:40 ` Rick Jones
2007-10-02 20:42 ` Wayne Scott
1 sibling, 0 replies; 56+ messages in thread
From: Rick Jones @ 2007-10-02 20:40 UTC (permalink / raw)
To: Larry McVoy; +Cc: John Heffner, Herbert Xu, torvalds, davem, wscott, netdev
[-- Attachment #1: Type: text/plain, Size: 2725 bytes --]
> Alternatively, take your favorite test programs, such as John's,
> and make a second pair that reverses the direction the data is
> sent. So one pair is server sends, the other is server receives,
> try both. That's where we started, BitKeeper, my stripped down test,
> and John's test all exhibit the same behavior. And the rsh test
> is just a really simple way to demonstrate it.
Netperf TCP_STREAM - server receives. TCP_MAERTS (STREAM backwards) - server sends:
[root@hpcpc106 ~]# netperf -H 192.168.2.107
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.107
(192.168.2.107) port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 87380 87380 10.17 941.46
[root@hpcpc106 ~]# netperf -H 192.168.2.107 -t TCP_MAERTS
TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.107
(192.168.2.107) port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 87380 87380 10.15 941.35
The above took all the defaults for socket buffers and such.
[root@hpcpc106 ~]# uname -a
Linux hpcpc106.cup.hp.com 2.6.18-8.el5 #1 SMP Fri Jan 26 14:16:09 EST 2007 ia64
ia64 ia64 GNU/Linux
[root@hpcpc106 ~]# ethtool -i eth2
driver: e1000
version: 7.2.7-k2-NAPI
firmware-version: N/A
bus-info: 0000:06:01.0
between a pair of 1.6 GHz itanium2 montecito rx2660's with a dual-port HP A9900A
(Intel 82546GB) in slot 3 of the io cage on each. Connection is actually
back-to-back rather than through a switch. I'm afraid I've nothing older installed.
sysctl settings attached
Where I do have things connected via a switch (HP ProCurve 3500 IIRC, perhaps a
2724) is through the core BCM5704:
[root@hpcpc106 netperf2_work]# netperf -H hpcpc107
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to hpcpc107.cup.hp.com
(16.89.84.107) port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 87380 87380 10.03 941.41
[root@hpcpc106 netperf2_work]# netperf -H hpcpc107 -t TCP_MAERTS
TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to hpcpc107.cup.hp.com
(16.89.84.107) port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 87380 87380 10.03 941.37
[root@hpcpc106 netperf2_work]# ethtool -i eth0
driver: tg3
version: 3.65-rh
firmware-version: 5704-v3.27
bus-info: 0000:01:02.0
rick jones
[-- Attachment #2: sysctl.net.txt --]
[-- Type: text/plain, Size: 17816 bytes --]
net.ipv6.conf.eth2.router_probe_interval = 60
net.ipv6.conf.eth2.accept_ra_rtr_pref = 1
net.ipv6.conf.eth2.accept_ra_pinfo = 1
net.ipv6.conf.eth2.accept_ra_defrtr = 1
net.ipv6.conf.eth2.max_addresses = 16
net.ipv6.conf.eth2.max_desync_factor = 600
net.ipv6.conf.eth2.regen_max_retry = 5
net.ipv6.conf.eth2.temp_prefered_lft = 86400
net.ipv6.conf.eth2.temp_valid_lft = 604800
net.ipv6.conf.eth2.use_tempaddr = 0
net.ipv6.conf.eth2.force_mld_version = 0
net.ipv6.conf.eth2.router_solicitation_delay = 1
net.ipv6.conf.eth2.router_solicitation_interval = 4
net.ipv6.conf.eth2.router_solicitations = 3
net.ipv6.conf.eth2.dad_transmits = 1
net.ipv6.conf.eth2.autoconf = 1
net.ipv6.conf.eth2.accept_redirects = 1
net.ipv6.conf.eth2.accept_ra = 1
net.ipv6.conf.eth2.mtu = 1500
net.ipv6.conf.eth2.hop_limit = 64
net.ipv6.conf.eth2.forwarding = 0
net.ipv6.conf.eth0.router_probe_interval = 60
net.ipv6.conf.eth0.accept_ra_rtr_pref = 1
net.ipv6.conf.eth0.accept_ra_pinfo = 1
net.ipv6.conf.eth0.accept_ra_defrtr = 1
net.ipv6.conf.eth0.max_addresses = 16
net.ipv6.conf.eth0.max_desync_factor = 600
net.ipv6.conf.eth0.regen_max_retry = 5
net.ipv6.conf.eth0.temp_prefered_lft = 86400
net.ipv6.conf.eth0.temp_valid_lft = 604800
net.ipv6.conf.eth0.use_tempaddr = 0
net.ipv6.conf.eth0.force_mld_version = 0
net.ipv6.conf.eth0.router_solicitation_delay = 1
net.ipv6.conf.eth0.router_solicitation_interval = 4
net.ipv6.conf.eth0.router_solicitations = 3
net.ipv6.conf.eth0.dad_transmits = 1
net.ipv6.conf.eth0.autoconf = 1
net.ipv6.conf.eth0.accept_redirects = 1
net.ipv6.conf.eth0.accept_ra = 1
net.ipv6.conf.eth0.mtu = 1500
net.ipv6.conf.eth0.hop_limit = 64
net.ipv6.conf.eth0.forwarding = 0
net.ipv6.conf.default.router_probe_interval = 60
net.ipv6.conf.default.accept_ra_rtr_pref = 1
net.ipv6.conf.default.accept_ra_pinfo = 1
net.ipv6.conf.default.accept_ra_defrtr = 1
net.ipv6.conf.default.max_addresses = 16
net.ipv6.conf.default.max_desync_factor = 600
net.ipv6.conf.default.regen_max_retry = 5
net.ipv6.conf.default.temp_prefered_lft = 86400
net.ipv6.conf.default.temp_valid_lft = 604800
net.ipv6.conf.default.use_tempaddr = 0
net.ipv6.conf.default.force_mld_version = 0
net.ipv6.conf.default.router_solicitation_delay = 1
net.ipv6.conf.default.router_solicitation_interval = 4
net.ipv6.conf.default.router_solicitations = 3
net.ipv6.conf.default.dad_transmits = 1
net.ipv6.conf.default.autoconf = 1
net.ipv6.conf.default.accept_redirects = 1
net.ipv6.conf.default.accept_ra = 1
net.ipv6.conf.default.mtu = 1280
net.ipv6.conf.default.hop_limit = 64
net.ipv6.conf.default.forwarding = 0
net.ipv6.conf.all.router_probe_interval = 60
net.ipv6.conf.all.accept_ra_rtr_pref = 1
net.ipv6.conf.all.accept_ra_pinfo = 1
net.ipv6.conf.all.accept_ra_defrtr = 1
net.ipv6.conf.all.max_addresses = 16
net.ipv6.conf.all.max_desync_factor = 600
net.ipv6.conf.all.regen_max_retry = 5
net.ipv6.conf.all.temp_prefered_lft = 86400
net.ipv6.conf.all.temp_valid_lft = 604800
net.ipv6.conf.all.use_tempaddr = 0
net.ipv6.conf.all.force_mld_version = 0
net.ipv6.conf.all.router_solicitation_delay = 1
net.ipv6.conf.all.router_solicitation_interval = 4
net.ipv6.conf.all.router_solicitations = 3
net.ipv6.conf.all.dad_transmits = 1
net.ipv6.conf.all.autoconf = 1
net.ipv6.conf.all.accept_redirects = 1
net.ipv6.conf.all.accept_ra = 1
net.ipv6.conf.all.mtu = 1280
net.ipv6.conf.all.hop_limit = 64
net.ipv6.conf.all.forwarding = 0
net.ipv6.conf.lo.router_probe_interval = 60
net.ipv6.conf.lo.accept_ra_rtr_pref = 1
net.ipv6.conf.lo.accept_ra_pinfo = 1
net.ipv6.conf.lo.accept_ra_defrtr = 1
net.ipv6.conf.lo.max_addresses = 16
net.ipv6.conf.lo.max_desync_factor = 600
net.ipv6.conf.lo.regen_max_retry = 5
net.ipv6.conf.lo.temp_prefered_lft = 86400
net.ipv6.conf.lo.temp_valid_lft = 604800
net.ipv6.conf.lo.use_tempaddr = -1
net.ipv6.conf.lo.force_mld_version = 0
net.ipv6.conf.lo.router_solicitation_delay = 1
net.ipv6.conf.lo.router_solicitation_interval = 4
net.ipv6.conf.lo.router_solicitations = 3
net.ipv6.conf.lo.dad_transmits = 1
net.ipv6.conf.lo.autoconf = 1
net.ipv6.conf.lo.accept_redirects = 1
net.ipv6.conf.lo.accept_ra = 1
net.ipv6.conf.lo.mtu = 16436
net.ipv6.conf.lo.hop_limit = 64
net.ipv6.conf.lo.forwarding = 0
net.ipv6.neigh.eth2.base_reachable_time_ms = 30000
net.ipv6.neigh.eth2.retrans_time_ms = 1000
net.ipv6.neigh.eth2.locktime = 0
net.ipv6.neigh.eth2.proxy_delay = 800
net.ipv6.neigh.eth2.anycast_delay = 1000
net.ipv6.neigh.eth2.proxy_qlen = 64
net.ipv6.neigh.eth2.unres_qlen = 3
net.ipv6.neigh.eth2.gc_stale_time = 60
net.ipv6.neigh.eth2.delay_first_probe_time = 5
net.ipv6.neigh.eth2.retrans_time = 1000
net.ipv6.neigh.eth2.app_solicit = 0
net.ipv6.neigh.eth2.ucast_solicit = 3
net.ipv6.neigh.eth2.mcast_solicit = 3
net.ipv6.neigh.eth0.base_reachable_time_ms = 30000
net.ipv6.neigh.eth0.retrans_time_ms = 1000
net.ipv6.neigh.eth0.locktime = 0
net.ipv6.neigh.eth0.proxy_delay = 800
net.ipv6.neigh.eth0.anycast_delay = 1000
net.ipv6.neigh.eth0.proxy_qlen = 64
net.ipv6.neigh.eth0.unres_qlen = 3
net.ipv6.neigh.eth0.gc_stale_time = 60
net.ipv6.neigh.eth0.delay_first_probe_time = 5
net.ipv6.neigh.eth0.retrans_time = 1000
net.ipv6.neigh.eth0.app_solicit = 0
net.ipv6.neigh.eth0.ucast_solicit = 3
net.ipv6.neigh.eth0.mcast_solicit = 3
net.ipv6.neigh.lo.base_reachable_time_ms = 30000
net.ipv6.neigh.lo.retrans_time_ms = 1000
net.ipv6.neigh.lo.locktime = 0
net.ipv6.neigh.lo.proxy_delay = 800
net.ipv6.neigh.lo.anycast_delay = 1000
net.ipv6.neigh.lo.proxy_qlen = 64
net.ipv6.neigh.lo.unres_qlen = 3
net.ipv6.neigh.lo.gc_stale_time = 60
net.ipv6.neigh.lo.delay_first_probe_time = 5
net.ipv6.neigh.lo.retrans_time = 1000
net.ipv6.neigh.lo.app_solicit = 0
net.ipv6.neigh.lo.ucast_solicit = 3
net.ipv6.neigh.lo.mcast_solicit = 3
net.ipv6.neigh.default.base_reachable_time_ms = 30000
net.ipv6.neigh.default.retrans_time_ms = 1000
net.ipv6.neigh.default.gc_thresh3 = 1024
net.ipv6.neigh.default.gc_thresh2 = 512
net.ipv6.neigh.default.gc_thresh1 = 128
net.ipv6.neigh.default.gc_interval = 30
net.ipv6.neigh.default.locktime = 0
net.ipv6.neigh.default.proxy_delay = 800
net.ipv6.neigh.default.anycast_delay = 1000
net.ipv6.neigh.default.proxy_qlen = 64
net.ipv6.neigh.default.unres_qlen = 3
net.ipv6.neigh.default.gc_stale_time = 60
net.ipv6.neigh.default.delay_first_probe_time = 5
net.ipv6.neigh.default.retrans_time = 1000
net.ipv6.neigh.default.app_solicit = 0
net.ipv6.neigh.default.ucast_solicit = 3
net.ipv6.neigh.default.mcast_solicit = 3
net.ipv6.mld_max_msf = 64
net.ipv6.ip6frag_secret_interval = 600
net.ipv6.ip6frag_time = 60
net.ipv6.ip6frag_low_thresh = 196608
net.ipv6.ip6frag_high_thresh = 262144
net.ipv6.bindv6only = 0
net.ipv6.icmp.ratelimit = 1000
net.ipv6.route.gc_min_interval_ms = 500
net.ipv6.route.min_adv_mss = 1
net.ipv6.route.mtu_expires = 600
net.ipv6.route.gc_elasticity = 0
net.ipv6.route.gc_interval = 30
net.ipv6.route.gc_timeout = 60
net.ipv6.route.gc_min_interval = 0
net.ipv6.route.max_size = 4096
net.ipv6.route.gc_thresh = 1024
net.unix.max_dgram_qlen = 10
net.token-ring.rif_timeout = 600000
net.ipv4.conf.eth3.promote_secondaries = 0
net.ipv4.conf.eth3.force_igmp_version = 0
net.ipv4.conf.eth3.disable_policy = 0
net.ipv4.conf.eth3.disable_xfrm = 0
net.ipv4.conf.eth3.arp_accept = 0
net.ipv4.conf.eth3.arp_ignore = 1
net.ipv4.conf.eth3.arp_announce = 0
net.ipv4.conf.eth3.arp_filter = 1
net.ipv4.conf.eth3.tag = 0
net.ipv4.conf.eth3.log_martians = 0
net.ipv4.conf.eth3.bootp_relay = 0
net.ipv4.conf.eth3.medium_id = 0
net.ipv4.conf.eth3.proxy_arp = 0
net.ipv4.conf.eth3.accept_source_route = 0
net.ipv4.conf.eth3.send_redirects = 1
net.ipv4.conf.eth3.rp_filter = 1
net.ipv4.conf.eth3.shared_media = 1
net.ipv4.conf.eth3.secure_redirects = 1
net.ipv4.conf.eth3.accept_redirects = 1
net.ipv4.conf.eth3.mc_forwarding = 0
net.ipv4.conf.eth3.forwarding = 0
net.ipv4.conf.eth2.promote_secondaries = 0
net.ipv4.conf.eth2.force_igmp_version = 0
net.ipv4.conf.eth2.disable_policy = 0
net.ipv4.conf.eth2.disable_xfrm = 0
net.ipv4.conf.eth2.arp_accept = 0
net.ipv4.conf.eth2.arp_ignore = 1
net.ipv4.conf.eth2.arp_announce = 0
net.ipv4.conf.eth2.arp_filter = 1
net.ipv4.conf.eth2.tag = 0
net.ipv4.conf.eth2.log_martians = 0
net.ipv4.conf.eth2.bootp_relay = 0
net.ipv4.conf.eth2.medium_id = 0
net.ipv4.conf.eth2.proxy_arp = 0
net.ipv4.conf.eth2.accept_source_route = 0
net.ipv4.conf.eth2.send_redirects = 1
net.ipv4.conf.eth2.rp_filter = 1
net.ipv4.conf.eth2.shared_media = 1
net.ipv4.conf.eth2.secure_redirects = 1
net.ipv4.conf.eth2.accept_redirects = 1
net.ipv4.conf.eth2.mc_forwarding = 0
net.ipv4.conf.eth2.forwarding = 0
net.ipv4.conf.eth0.promote_secondaries = 0
net.ipv4.conf.eth0.force_igmp_version = 0
net.ipv4.conf.eth0.disable_policy = 0
net.ipv4.conf.eth0.disable_xfrm = 0
net.ipv4.conf.eth0.arp_accept = 0
net.ipv4.conf.eth0.arp_ignore = 1
net.ipv4.conf.eth0.arp_announce = 0
net.ipv4.conf.eth0.arp_filter = 1
net.ipv4.conf.eth0.tag = 0
net.ipv4.conf.eth0.log_martians = 0
net.ipv4.conf.eth0.bootp_relay = 0
net.ipv4.conf.eth0.medium_id = 0
net.ipv4.conf.eth0.proxy_arp = 0
net.ipv4.conf.eth0.accept_source_route = 0
net.ipv4.conf.eth0.send_redirects = 1
net.ipv4.conf.eth0.rp_filter = 1
net.ipv4.conf.eth0.shared_media = 1
net.ipv4.conf.eth0.secure_redirects = 1
net.ipv4.conf.eth0.accept_redirects = 1
net.ipv4.conf.eth0.mc_forwarding = 0
net.ipv4.conf.eth0.forwarding = 0
net.ipv4.conf.lo.promote_secondaries = 0
net.ipv4.conf.lo.force_igmp_version = 0
net.ipv4.conf.lo.disable_policy = 1
net.ipv4.conf.lo.disable_xfrm = 1
net.ipv4.conf.lo.arp_accept = 0
net.ipv4.conf.lo.arp_ignore = 0
net.ipv4.conf.lo.arp_announce = 0
net.ipv4.conf.lo.arp_filter = 0
net.ipv4.conf.lo.tag = 0
net.ipv4.conf.lo.log_martians = 0
net.ipv4.conf.lo.bootp_relay = 0
net.ipv4.conf.lo.medium_id = 0
net.ipv4.conf.lo.proxy_arp = 0
net.ipv4.conf.lo.accept_source_route = 1
net.ipv4.conf.lo.send_redirects = 1
net.ipv4.conf.lo.rp_filter = 0
net.ipv4.conf.lo.shared_media = 1
net.ipv4.conf.lo.secure_redirects = 1
net.ipv4.conf.lo.accept_redirects = 1
net.ipv4.conf.lo.mc_forwarding = 0
net.ipv4.conf.lo.forwarding = 0
net.ipv4.conf.default.promote_secondaries = 0
net.ipv4.conf.default.force_igmp_version = 0
net.ipv4.conf.default.disable_policy = 0
net.ipv4.conf.default.disable_xfrm = 0
net.ipv4.conf.default.arp_accept = 0
net.ipv4.conf.default.arp_ignore = 1
net.ipv4.conf.default.arp_announce = 0
net.ipv4.conf.default.arp_filter = 1
net.ipv4.conf.default.tag = 0
net.ipv4.conf.default.log_martians = 0
net.ipv4.conf.default.bootp_relay = 0
net.ipv4.conf.default.medium_id = 0
net.ipv4.conf.default.proxy_arp = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.conf.default.send_redirects = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.shared_media = 1
net.ipv4.conf.default.secure_redirects = 1
net.ipv4.conf.default.accept_redirects = 1
net.ipv4.conf.default.mc_forwarding = 0
net.ipv4.conf.default.forwarding = 0
net.ipv4.conf.all.promote_secondaries = 0
net.ipv4.conf.all.force_igmp_version = 0
net.ipv4.conf.all.disable_policy = 0
net.ipv4.conf.all.disable_xfrm = 0
net.ipv4.conf.all.arp_accept = 0
net.ipv4.conf.all.arp_ignore = 0
net.ipv4.conf.all.arp_announce = 0
net.ipv4.conf.all.arp_filter = 0
net.ipv4.conf.all.tag = 0
net.ipv4.conf.all.log_martians = 0
net.ipv4.conf.all.bootp_relay = 0
net.ipv4.conf.all.medium_id = 0
net.ipv4.conf.all.proxy_arp = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.all.send_redirects = 1
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.all.shared_media = 1
net.ipv4.conf.all.secure_redirects = 1
net.ipv4.conf.all.accept_redirects = 1
net.ipv4.conf.all.mc_forwarding = 0
net.ipv4.conf.all.forwarding = 0
net.ipv4.neigh.eth3.base_reachable_time_ms = 30000
net.ipv4.neigh.eth3.retrans_time_ms = 1000
net.ipv4.neigh.eth3.locktime = 1000
net.ipv4.neigh.eth3.proxy_delay = 800
net.ipv4.neigh.eth3.anycast_delay = 1000
net.ipv4.neigh.eth3.proxy_qlen = 64
net.ipv4.neigh.eth3.unres_qlen = 3
net.ipv4.neigh.eth3.gc_stale_time = 60
net.ipv4.neigh.eth3.delay_first_probe_time = 5
net.ipv4.neigh.eth3.base_reachable_time = 30
net.ipv4.neigh.eth3.retrans_time = 1000
net.ipv4.neigh.eth3.app_solicit = 0
net.ipv4.neigh.eth3.ucast_solicit = 3
net.ipv4.neigh.eth3.mcast_solicit = 3
net.ipv4.neigh.eth2.base_reachable_time_ms = 30000
net.ipv4.neigh.eth2.retrans_time_ms = 1000
net.ipv4.neigh.eth2.locktime = 1000
net.ipv4.neigh.eth2.proxy_delay = 800
net.ipv4.neigh.eth2.anycast_delay = 1000
net.ipv4.neigh.eth2.proxy_qlen = 64
net.ipv4.neigh.eth2.unres_qlen = 3
net.ipv4.neigh.eth2.gc_stale_time = 60
net.ipv4.neigh.eth2.delay_first_probe_time = 5
net.ipv4.neigh.eth2.base_reachable_time = 30
net.ipv4.neigh.eth2.retrans_time = 1000
net.ipv4.neigh.eth2.app_solicit = 0
net.ipv4.neigh.eth2.ucast_solicit = 3
net.ipv4.neigh.eth2.mcast_solicit = 3
net.ipv4.neigh.eth0.base_reachable_time_ms = 30000
net.ipv4.neigh.eth0.retrans_time_ms = 1000
net.ipv4.neigh.eth0.locktime = 1000
net.ipv4.neigh.eth0.proxy_delay = 800
net.ipv4.neigh.eth0.anycast_delay = 1000
net.ipv4.neigh.eth0.proxy_qlen = 64
net.ipv4.neigh.eth0.unres_qlen = 3
net.ipv4.neigh.eth0.gc_stale_time = 60
net.ipv4.neigh.eth0.delay_first_probe_time = 5
net.ipv4.neigh.eth0.base_reachable_time = 30
net.ipv4.neigh.eth0.retrans_time = 1000
net.ipv4.neigh.eth0.app_solicit = 0
net.ipv4.neigh.eth0.ucast_solicit = 3
net.ipv4.neigh.eth0.mcast_solicit = 3
net.ipv4.neigh.lo.base_reachable_time_ms = 30000
net.ipv4.neigh.lo.retrans_time_ms = 1000
net.ipv4.neigh.lo.locktime = 1000
net.ipv4.neigh.lo.proxy_delay = 800
net.ipv4.neigh.lo.anycast_delay = 1000
net.ipv4.neigh.lo.proxy_qlen = 64
net.ipv4.neigh.lo.unres_qlen = 3
net.ipv4.neigh.lo.gc_stale_time = 60
net.ipv4.neigh.lo.delay_first_probe_time = 5
net.ipv4.neigh.lo.base_reachable_time = 30
net.ipv4.neigh.lo.retrans_time = 1000
net.ipv4.neigh.lo.app_solicit = 0
net.ipv4.neigh.lo.ucast_solicit = 3
net.ipv4.neigh.lo.mcast_solicit = 3
net.ipv4.neigh.default.base_reachable_time_ms = 30000
net.ipv4.neigh.default.retrans_time_ms = 1000
net.ipv4.neigh.default.gc_thresh3 = 1024
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_interval = 30
net.ipv4.neigh.default.locktime = 1000
net.ipv4.neigh.default.proxy_delay = 800
net.ipv4.neigh.default.anycast_delay = 1000
net.ipv4.neigh.default.proxy_qlen = 64
net.ipv4.neigh.default.unres_qlen = 3
net.ipv4.neigh.default.gc_stale_time = 60
net.ipv4.neigh.default.delay_first_probe_time = 5
net.ipv4.neigh.default.base_reachable_time = 30
net.ipv4.neigh.default.retrans_time = 1000
net.ipv4.neigh.default.app_solicit = 0
net.ipv4.neigh.default.ucast_solicit = 3
net.ipv4.neigh.default.mcast_solicit = 3
net.ipv4.cipso_rbm_strictvalid = 1
net.ipv4.cipso_rbm_optfmt = 0
net.ipv4.cipso_cache_bucket_size = 10
net.ipv4.cipso_cache_enable = 1
net.ipv4.tcp_slow_start_after_idle = 1
net.ipv4.tcp_dma_copybreak = 4096
net.ipv4.tcp_workaround_signed_windows = 0
net.ipv4.tcp_base_mss = 512
net.ipv4.tcp_mtu_probing = 0
net.ipv4.tcp_abc = 0
net.ipv4.tcp_congestion_control = bic
net.ipv4.tcp_tso_win_divisor = 3
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_no_metrics_save = 0
net.ipv4.ipfrag_max_dist = 64
net.ipv4.ipfrag_secret_interval = 600
net.ipv4.tcp_low_latency = 0
net.ipv4.tcp_frto = 0
net.ipv4.tcp_tw_reuse = 0
net.ipv4.icmp_ratemask = 6168
net.ipv4.icmp_ratelimit = 1000
net.ipv4.tcp_adv_win_scale = 2
net.ipv4.tcp_app_win = 31
net.ipv4.tcp_rmem = 4096 87380 20971520
net.ipv4.tcp_wmem = 4096 87380 20971520
net.ipv4.tcp_mem = 49152 65536 98304
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_ecn = 0
net.ipv4.tcp_reordering = 3
net.ipv4.tcp_fack = 1
net.ipv4.tcp_orphan_retries = 0
net.ipv4.inet_peer_gc_maxtime = 120
net.ipv4.inet_peer_gc_mintime = 10
net.ipv4.inet_peer_maxttl = 600
net.ipv4.inet_peer_minttl = 120
net.ipv4.inet_peer_threshold = 65664
net.ipv4.igmp_max_msf = 10
net.ipv4.igmp_max_memberships = 20
net.ipv4.route.secret_interval = 600
net.ipv4.route.min_adv_mss = 256
net.ipv4.route.min_pmtu = 552
net.ipv4.route.mtu_expires = 600
net.ipv4.route.gc_elasticity = 8
net.ipv4.route.error_burst = 5000
net.ipv4.route.error_cost = 1000
net.ipv4.route.redirect_silence = 20480
net.ipv4.route.redirect_number = 9
net.ipv4.route.redirect_load = 20
net.ipv4.route.gc_interval = 60
net.ipv4.route.gc_timeout = 300
net.ipv4.route.gc_min_interval_ms = 500
net.ipv4.route.gc_min_interval = 0
net.ipv4.route.max_size = 8388608
net.ipv4.route.gc_thresh = 524288
net.ipv4.route.max_delay = 10
net.ipv4.route.min_delay = 2
net.ipv4.icmp_errors_use_inbound_ifaddr = 0
net.ipv4.icmp_ignore_bogus_error_responses = 1
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_echo_ignore_all = 0
net.ipv4.ip_local_port_range = 32768 61000
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_rfc1337 = 0
net.ipv4.tcp_stdurg = 0
net.ipv4.tcp_abort_on_overflow = 0
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_retries2 = 15
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.ipfrag_time = 30
net.ipv4.ip_dynaddr = 0
net.ipv4.ipfrag_low_thresh = 196608
net.ipv4.ipfrag_high_thresh = 262144
net.ipv4.tcp_max_tw_buckets = 180000
net.ipv4.tcp_max_orphans = 16384
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_syn_retries = 5
net.ipv4.ip_nonlocal_bind = 0
net.ipv4.ip_no_pmtu_disc = 0
net.ipv4.ip_default_ttl = 64
net.ipv4.ip_forward = 0
net.ipv4.tcp_retrans_collapse = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.core.netdev_budget = 300
net.core.somaxconn = 128
net.core.xfrm_aevent_rseqth = 2
net.core.xfrm_aevent_etime = 10
net.core.optmem_max = 20480
net.core.message_burst = 10
net.core.message_cost = 5
net.core.netdev_max_backlog = 10000
net.core.dev_weight = 64
net.core.rmem_default = 126976
net.core.wmem_default = 126976
net.core.rmem_max = 20971520
net.core.wmem_max = 20971520
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 20:14 ` Larry McVoy
2007-10-02 20:40 ` Rick Jones
@ 2007-10-02 20:42 ` Wayne Scott
2007-10-02 21:56 ` Linus Torvalds
1 sibling, 1 reply; 56+ messages in thread
From: Wayne Scott @ 2007-10-02 20:42 UTC (permalink / raw)
To: lm; +Cc: jheffner, herbert, torvalds, davem, netdev
[-- Attachment #1: Type: Text/Plain, Size: 999 bytes --]
From: lm@bitmover.com (Larry McVoy)
> Wayne, Linus asked for tcp dumps from just one side, with the first 100
> packets and then wait 10 seconds or so for the window to open up, and then
> a snap shot of the another 100 packets. Do that for both directions
> and send them to the list. Can you do that? I want to get lunch, I'm
> starving.
OK attached are 4 raw tcpdumps of 1000 packets each. One from the
start and one from steady state.
The slow set was done like this:
on ia64: netcat -l -p8888 > /dev/null
on work: netcat ia64 8888 < /dev/zero
the traces were done on work, with slow1 started right before the
netcat on work was executed, and slow2 started after it achieved
steady state at 18MB/s.
The fast set was done like this:
on work: netcat -l -p8888 > /dev/null
on ia64: netcat ia64 8888 < /dev/zero
the traces were done on work, with fast1 started right before the
netcat on ia64 was executed, and fast2 started after it achieved
steady state at 42MB/s.
-Wayne
[-- Attachment #2: tcpdumps.lm.bz2 --]
[-- Type: Application/Octet-Stream, Size: 60715 bytes --]
^ permalink raw reply [flat|nested] 56+ messages in thread
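(Reading the raw traces back just needs tcpdump again; a sketch, with the file
names being whatever the captures were saved as:

tcpdump -n -r slow1 | head -100    # -n skips DNS lookups so addresses and windows read cleanly
tcpdump -n -r fast1 | head -100

The window and wscale fields in that output are what the analysis below keys on.)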
* Re: tcp bw in 2.6
2007-10-02 20:33 ` David Miller
@ 2007-10-02 20:44 ` Roland Dreier
2007-10-02 21:21 ` Larry McVoy
1 sibling, 0 replies; 56+ messages in thread
From: Roland Dreier @ 2007-10-02 20:44 UTC (permalink / raw)
To: David Miller; +Cc: torvalds, lm, herbert, wscott, netdev
> It would be really great to see numbers with a more recent kernel
> than 2.6.18
FWIW Debian has binaries for 2.6.21 in testing and for 2.6.22 in
unstable so it should be very easy for Larry to try at least those.
- R.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 16:48 ` Larry McVoy
@ 2007-10-02 21:16 ` David Miller
2007-10-02 21:26 ` Larry McVoy
0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2007-10-02 21:16 UTC (permalink / raw)
To: lm; +Cc: torvalds, herbert, wscott, netdev
From: lm@bitmover.com (Larry McVoy)
Date: Tue, 2 Oct 2007 09:48:58 -0700
> Isn't this something so straightforward that you would have tests for it?
> This is the basic FTP server loop, doesn't someone have a big machine with
> 10gig cards and test that sending/recving data doesn't regress?
Nobody is really doing this, or they aren't talking about it.
Sometimes the crash fixes and other work completely consumes us. Add
in travel to conferences and real life, and it's no surprise stuff
like this slips through the cracks.
We absolutely depend upon people like you to report when there are
anomalies like this. It's the only thing that scales.
FWIW I have a t1000 Niagara box and an Ultra45 going through a netgear
gigabit switch. I'm getting 85MB/sec in one direction and 10MB/sec in
the other (using bw_tcp from lmbench3). Both are using identical
broadcom tigon3 gigabit chips and identical current kernels so that is
a truly strange result.
I'll investigate, it may be the same thing you're seeing.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 20:33 ` David Miller
2007-10-02 20:44 ` Roland Dreier
@ 2007-10-02 21:21 ` Larry McVoy
2007-10-03 21:13 ` Pekka Pietikainen
1 sibling, 1 reply; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 21:21 UTC (permalink / raw)
To: David Miller; +Cc: torvalds, lm, herbert, wscott, netdev
> We fixed a lot of bugs in TSO last year.
>
> It would be really great to see numbers with a more recent kernel
> than 2.6.18
More data, sky2 works fine (really really fine, like 79MB/sec) between
Linux dylan.bitmover.com 2.6.18.1 #5 SMP Mon Oct 23 17:36:00 PDT 2006 i686
Linux steele 2.6.20-16-generic #2 SMP Sun Sep 23 18:31:23 UTC 2007 x86_64
So this is looking like an e1000 bug. I'll try to upgrade the kernel on
the ia64 box and see what happens.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 21:16 ` David Miller
@ 2007-10-02 21:26 ` Larry McVoy
2007-10-02 21:47 ` David Miller
0 siblings, 1 reply; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 21:26 UTC (permalink / raw)
To: David Miller; +Cc: lm, torvalds, herbert, wscott, netdev
On Tue, Oct 02, 2007 at 02:16:56PM -0700, David Miller wrote:
> We absolutely depend upon people like you to report when there are
> anomalies like this. It's the only thing that scales.
Well cool, finally doing something useful :)
Is the issue that there's no test setup? Because this does seem like something we'd
want to have work well.
> FWIW I have a t1000 Niagara box and an Ultra45 going through a netgear
> gigabit switch. I'm getting 85MB/sec in one direction and 10MB/sec in
> the other (using bw_tcp from lmbench3).
Note that bw_tcp mucks with SND/RCVBUF. It probably shouldn't, it's been
12 years since that code went in there and I dunno if it is still needed.
> Both are using identical
> broadcom tigon3 gigabit chips and identical current kernels so that is
> a truly strange result.
>
> I'll investigate, it may be the same thing you're seeing.
Wow, sounds very similar. In my case I was seeing pretty close to 3x
consistently. You're more like 8x, but I was all e1000 not broadcom.
And note that sky2 doesn't have this problem. Does the broadcom do TSO?
And sky2 not? I noticed a much higher CPU load for sky2.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 18:40 ` Larry McVoy
2007-10-02 19:47 ` Rick Jones
@ 2007-10-02 21:32 ` David Miller
1 sibling, 0 replies; 56+ messages in thread
From: David Miller @ 2007-10-02 21:32 UTC (permalink / raw)
To: lm; +Cc: rick.jones2, torvalds, wscott, netdev
From: lm@bitmover.com (Larry McVoy)
Date: Tue, 2 Oct 2007 11:40:32 -0700
> I doubt it, the same test works fine in one direction and poorly in the other.
> Wouldn't the flow control squelch either way?
HW controls for these things are typically:
1) Generates flow control frames
2) Listens for them
So you can have flow control operational in one direction
and not the other.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 21:26 ` Larry McVoy
@ 2007-10-02 21:47 ` David Miller
2007-10-02 22:17 ` Rick Jones
0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2007-10-02 21:47 UTC (permalink / raw)
To: lm; +Cc: torvalds, herbert, wscott, netdev
From: lm@bitmover.com (Larry McVoy)
Date: Tue, 2 Oct 2007 14:26:08 -0700
> And note that sky2 doesn't have this problem. Does the broadcom do TSO?
> And sky2 not? I noticed a much higher CPU load for sky2.
Yes the broadcoms (the revisions I have) do TSO and it is enabled
on both sides.
Which makes the mis-matched performance even stranger :)
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 20:42 ` Wayne Scott
@ 2007-10-02 21:56 ` Linus Torvalds
0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2007-10-02 21:56 UTC (permalink / raw)
To: Wayne Scott; +Cc: lm, jheffner, herbert, davem, netdev
On Tue, 2 Oct 2007, Wayne Scott wrote:
>
> The slow set was done like this:
>
> on ia64: netcat -l -p8888 > /dev/null
> on work: netcat ia64 8888 < /dev/zero
That sounds wrong. Larry claims the slow case is when the side that did
"accept()" does the sending, the above has the listener just reading.
> The fast set was done like this:
>
> on work: netcat -l -p8888 > /dev/null
> on ia64: netcat ia64 8888 < /dev/zero
This one is guaranteed wrong too, since you have the listener reading
(fine), but the sender now doesn't go over the network at all, but sends to
itself.
That said, let's assume that only your description was bogus, the TCP
dumps themselves are ok.
I find the window scaling differences interesting. This is the opening of
the fast sequence from the receiver:
13:35:13.929349 IP 10.3.1.1.ddi-tcp-1 > 10.3.1.10.58415: S 2592471184:2592471184(0) ack 3363219397 win 5792 <mss 1460,sackOK,timestamp 174966955 3714830794,nop,wscale 7>
13:35:13.929702 IP 10.3.1.1.ddi-tcp-1 > 10.3.1.10.58415: . ack 1449 win 68 <nop,nop,timestamp 174966955 3714830795>
13:35:13.929712 IP 10.3.1.1.ddi-tcp-1 > 10.3.1.10.58415: . ack 2897 win 91 <nop,nop,timestamp 174966955 3714830795>
13:35:13.929724 IP 10.3.1.1.ddi-tcp-1 > 10.3.1.10.58415: . ack 4345 win 114 <nop,nop,timestamp 174966955 3714830795>
13:35:13.929941 IP 10.3.1.1.ddi-tcp-1 > 10.3.1.10.58415: . ack 5793 win 136 <nop,nop,timestamp 174966955 3714830795>
13:35:13.929951 IP 10.3.1.1.ddi-tcp-1 > 10.3.1.10.58415: . ack 7241 win 159 <nop,nop,timestamp 174966955 3714830795>
13:35:13.929960 IP 10.3.1.1.ddi-tcp-1 > 10.3.1.10.58415: . ack 8689 win 181 <nop,nop,timestamp 174966955 3714830795>
13:35:13.929970 IP 10.3.1.1.ddi-tcp-1 > 10.3.1.10.58415: . ack 10137 win 204 <nop,nop,timestamp 174966955 3714830795>
13:35:13.929981 IP 10.3.1.1.ddi-tcp-1 > 10.3.1.10.58415: . ack 11585 win 227 <nop,nop,timestamp 174966955 3714830795>
13:35:13.929992 IP 10.3.1.1.ddi-tcp-1 > 10.3.1.10.58415: . ack 13033 win 249 <nop,nop,timestamp 174966955 3714830795>
13:35:13.930331 IP 10.3.1.1.ddi-tcp-1 > 10.3.1.10.58415: . ack 14481 win 272 <nop,nop,timestamp 174966955 3714830795>
...
ie we use a window scale of 7, and we started with a window of 5792 bytes,
and after ten packets it has grown to 272<<7 (34816) bytes.
The slow case is
13:34:16.761034 IP 10.3.1.10.ddi-tcp-1 > 10.3.1.1.49864: S 3299922549:3299922549(0) ack 2548837296 win 5792 <mss 1460,sackOK,timestamp 3714772254 174952667,nop,wscale 2>
13:34:16.761533 IP 10.3.1.10.ddi-tcp-1 > 10.3.1.1.49864: . ack 1449 win 2172 <nop,nop,timestamp 3714772255 174952667>
13:34:16.761553 IP 10.3.1.10.ddi-tcp-1 > 10.3.1.1.49864: . ack 2897 win 2896 <nop,nop,timestamp 3714772255 174952667>
13:34:16.761782 IP 10.3.1.10.ddi-tcp-1 > 10.3.1.1.49864: . ack 4345 win 3620 <nop,nop,timestamp 3714772255 174952667>
13:34:16.761908 IP 10.3.1.10.ddi-tcp-1 > 10.3.1.1.49864: . ack 5793 win 4344 <nop,nop,timestamp 3714772255 174952667>
13:34:16.761916 IP 10.3.1.10.ddi-tcp-1 > 10.3.1.1.49864: . ack 7241 win 5068 <nop,nop,timestamp 3714772255 174952667>
13:34:16.762157 IP 10.3.1.10.ddi-tcp-1 > 10.3.1.1.49864: . ack 8689 win 5792 <nop,nop,timestamp 3714772255 174952667>
13:34:16.762164 IP 10.3.1.10.ddi-tcp-1 > 10.3.1.1.49864: . ack 10137 win 6516 <nop,nop,timestamp 3714772255 174952667>
13:34:16.762283 IP 10.3.1.10.ddi-tcp-1 > 10.3.1.1.49864: . ack 11585 win 7240 <nop,nop,timestamp 3714772256 174952667>
13:34:16.762290 IP 10.3.1.10.ddi-tcp-1 > 10.3.1.1.49864: . ack 13033 win 7964 <nop,nop,timestamp 3714772256 174952667>
13:34:16.762303 IP 10.3.1.10.ddi-tcp-1 > 10.3.1.1.49864: . ack 14481 win 8688 <nop,nop,timestamp 3714772256 174952667>
...
so after the same ten packets, it too has grown to about the same
size (8688<<2 = 34752 bytes).
But the slow case has a smaller window scale, and it actually stops
opening the window at that point: the window stays at 8688<<2 for a long
time (and eventually grows to 9412<<2 and then 16652<<2 in the steady
state, and is basically limited at that 66kB window size).
But the fast one that had a window scale of 7 can keep growing, and will
do so quite aggressively. It grows the window to (1442<<7 = 180kB) in the
first fifty packets.
But in your dump, it doesn't seem to be about who is listening and who is
connecting. It seems to be about the fact that your machine 10.3.1.10 uses
a window scale of 2, while 10.3.1.1 uses a scale of 7.
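For illustration, the effective window is just the advertised win field
shifted left by the negotiated scale; a minimal sketch of that arithmetic,
plugging in the numbers quoted above:

#include <stdio.h>

/* Effective receive window = advertised "win" << negotiated window scale.
 * The values below are the ones tcpdump printed in the traces above. */
int main(void)
{
    unsigned int fast_win = 272,  fast_scale = 7;   /* fast direction */
    unsigned int slow_win = 8688, slow_scale = 2;   /* slow direction */

    printf("fast: %u << %u = %u bytes\n",
           fast_win, fast_scale, fast_win << fast_scale);   /* 34816 */
    printf("slow: %u << %u = %u bytes\n",
           slow_win, slow_scale, slow_win << slow_scale);   /* 34752 */

    /* With wscale 2 the 16-bit win field tops out at 65535 << 2,
     * about 256kB, and in this trace it stalls near 66kB; with
     * wscale 7 the same field can advertise up to roughly 8MB. */
    return 0;
}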
Linus
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 21:47 ` David Miller
@ 2007-10-02 22:17 ` Rick Jones
2007-10-02 22:32 ` David Miller
0 siblings, 1 reply; 56+ messages in thread
From: Rick Jones @ 2007-10-02 22:17 UTC (permalink / raw)
To: David Miller; +Cc: lm, torvalds, herbert, wscott, netdev
David Miller wrote:
> From: lm@bitmover.com (Larry McVoy)
> Date: Tue, 2 Oct 2007 14:26:08 -0700
>
>
>>And note that sky2 doesn't have this problem. Does the broadcom do TSO?
>>And sky2 not? I noticed a much higher CPU load for sky2.
>
>
> Yes the broadcoms (the revisions I have) do TSO and it is enabled
> on both sides.
>
> Which makes the mis-matched performance even stranger :)
Stranger still, with a mix of a 2.6.23-rc5ish kernel and a net-2.6.24 one
(pulled oh middle of last week?) I get link-rate and I see no asymmetry between
TCP_STREAM and TCP_MAERTS over an "e1000" link with no switch or tg3 with a
ProCurve on my rx2660's.
I can also run bw_tcp from lmbench 3.0a8 and get 106 MB/s.
I don't have a netgear switch to try in all this...
rick jones
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 22:17 ` Rick Jones
@ 2007-10-02 22:32 ` David Miller
2007-10-02 22:36 ` Larry McVoy
0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2007-10-02 22:32 UTC (permalink / raw)
To: rick.jones2; +Cc: lm, torvalds, herbert, wscott, netdev
From: Rick Jones <rick.jones2@hp.com>
Date: Tue, 02 Oct 2007 15:17:35 -0700
> Stranger still, with a mix of a 2.6.23-rc5ish kernel and a net-2.6.24 one
> (pulled oh middle of last week?) I get link-rate and I see no asymmetry between
> TCP_STREAM and TCP_MAERTS over an "e1000" link with no switch or tg3 with a
> ProCurve on my rx2660's.
>
> I can also run bw_tcp from lmbench 3.0a8 and get 106 MB/s.
>
> I don't have a netgear switch to try in all this...
I'm starting to have a theory about what the bad case might
be.
A strong sender going to an even stronger receiver which can
pull out packets into the process as fast as they arrive.
This might be part of what keeps the receive window from
growing.
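One quick way to poke at that theory (a throwaway sketch, nothing from this
thread; the port number and the delay knob are arbitrary, and error handling
is omitted) is a sink that either drains the socket as fast as it can or
sleeps a bit between reads, while a tcpdump on the side shows whether
letting data queue changes how the advertised window grows:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    int delay_us = argc > 1 ? atoi(argv[1]) : 0; /* 0 = drain as fast as possible */
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = { .sin_family = AF_INET,
                             .sin_port = htons(8888),
                             .sin_addr.s_addr = INADDR_ANY };
    static char buf[1 << 20];

    bind(s, (struct sockaddr *)&a, sizeof(a));
    listen(s, 1);
    int c = accept(s, NULL, NULL);
    while (read(c, buf, sizeof(buf)) > 0)
        if (delay_us)
            usleep(delay_us); /* let data queue up in the receive buffer */
    return 0;
}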
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 22:32 ` David Miller
@ 2007-10-02 22:36 ` Larry McVoy
2007-10-02 22:59 ` Rick Jones
2007-10-03 8:02 ` David Miller
0 siblings, 2 replies; 56+ messages in thread
From: Larry McVoy @ 2007-10-02 22:36 UTC (permalink / raw)
To: David Miller; +Cc: rick.jones2, lm, torvalds, herbert, wscott, netdev
On Tue, Oct 02, 2007 at 03:32:16PM -0700, David Miller wrote:
> I'm starting to have a theory about what the bad case might
> be.
>
> A strong sender going to an even stronger receiver which can
> pull out packets into the process as fast as they arrive.
> This might be part of what keeps the receive window from
> growing.
I can back you up on that. When I straced the receiving side that goes
slowly, all the reads were short, like 1-2K. In the direction that works
well, the reads were a lot larger, as I recall.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 22:36 ` Larry McVoy
@ 2007-10-02 22:59 ` Rick Jones
2007-10-03 8:02 ` David Miller
1 sibling, 0 replies; 56+ messages in thread
From: Rick Jones @ 2007-10-02 22:59 UTC (permalink / raw)
To: Larry McVoy; +Cc: David Miller, torvalds, herbert, wscott, netdev
Larry McVoy wrote:
> On Tue, Oct 02, 2007 at 03:32:16PM -0700, David Miller wrote:
>
>>I'm starting to have a theory about what the bad case might
>>be.
>>
>>A strong sender going to an even stronger receiver which can
>>pull out packets into the process as fast as they arrive.
>>This might be part of what keeps the receive window from
>>growing.
>
>
> I can back you up on that. When I straced the receiving side that goes
> slowly, all the reads were short, like 1-2K. In the direction that works
> well, the reads were a lot larger, as I recall.
Indeed I was getting more like 8K on each recv() call per netperf's -v 2 stats,
but the system was more than fast enough to stay ahead of the traffic. On the
hunch that it was the interrupt throttling that was keeping the recvs large,
rather than the speed of the system(s), I nuked the InterruptThrottleRate to 0
and was able to get between 1900 and 2300 byte recvs on the TCP_STREAM and
TCP_MAERTS tests, and still had 940 Mbit/s in each direction.
hpcpc106:~# netperf -H 192.168.7.107 -t TCP_STREAM -v 2 -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.7.107
(192.168.7.107) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
 87380  87380   87380   10.02       940.95   10.75    21.65    3.743   7.540
Alignment      Offset         Bytes     Bytes       Sends    Bytes     Recvs
Local  Remote  Local  Remote  Xfered    Per                  Per
Send   Recv    Send   Recv              Send (avg)           Recv (avg)
    8      8       0      0  1.179e+09  87386.29     13491   1965.77   599729
Maximum
Segment
Size (bytes)
1448
hpcpc106:~# netperf -H 192.168.7.107 -t TCP_MAERTS -v 2 -c -C
TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.7.107
(192.168.7.107) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
 87380  87380   87380   10.02       940.82   20.44    10.61    7.117   3.696
Alignment      Offset         Bytes     Bytes       Recvs    Bytes     Sends
Local  Remote  Local  Remote  Xfered    Per                  Per
Recv   Send    Recv   Send              Recv (avg)           Send (avg)
    8      8       0      0  1.178e+09   2352.26    500931   87380.00  13485
Maximum
Segment
Size (bytes)
1448
the systems above had four 1.6 GHz cores; netperf reports CPU as 0 to 100%
regardless of core count.
and then my systems with the 3.0 GHz cores:
[root@s9 netperf2_trunk]# netperf -H sweb20 -v 2 -t TCP_STREAM -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sweb20.cup.hp.com
(16.89.133.20) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
 87380  16384   16384   10.03       941.37    6.40    13.26    2.229   4.615
Alignment      Offset         Bytes     Bytes       Sends    Bytes     Recvs
Local  Remote  Local  Remote  Xfered    Per                  Per
Send   Recv    Send   Recv              Send (avg)           Recv (avg)
    8      8       0      0  1.18e+09   16384.06     72035   1453.85   811793
Maximum
Segment
Size (bytes)
1448
[root@s9 netperf2_trunk]# netperf -H sweb20 -v 2 -t TCP_MAERTS -c -C
TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sweb20.cup.hp.com
(16.89.133.20) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
 87380  16384   16384   10.03       941.35   12.13     5.80    4.221   2.018
Alignment      Offset         Bytes     Bytes       Recvs    Bytes     Sends
Local  Remote  Local  Remote  Xfered    Per                  Per
Recv   Send    Recv   Send              Recv (avg)           Send (avg)
    8      8       0      0  1.181e+09   1452.38    812953   16384.00  72065
Maximum
Segment
Size (bytes)
1448
rick jones
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 17:14 ` Rick Jones
2007-10-02 17:20 ` Larry McVoy
@ 2007-10-03 7:19 ` Bill Fink
1 sibling, 0 replies; 56+ messages in thread
From: Bill Fink @ 2007-10-03 7:19 UTC (permalink / raw)
To: Rick Jones; +Cc: Larry McVoy, Linus Torvalds, davem, wscott, netdev
Tangential aside:
On Tue, 02 Oct 2007, Rick Jones wrote:
> *) depending on the quantity of CPU around, and the type of test one is running,
> results can be better/worse depending on the CPU to which you bind the
> application. Latency tends to be best when running on the same core as takes
> interrupts from the NIC, bulk transfer can be better when running on a different
> core, although generally better when a different core on the same chip. These
> days the throughput stuff is more easily seen on 10G, but the netperf service
> demand changes are still visible on 1G.
Interesting. I was going to say that I've generally had the opposite
experience when it comes to bulk data transfers, which is what I would
expect due to CPU caching effects, but that perhaps it's motherboard/NIC/
driver dependent. But in testing I just did I discovered it's even
MTU dependent (most of my normal testing is always with 9000-byte
jumbo frames).
With Myricom 10-GigE NICs, NIC interrupts on CPU 0 and nuttcp app
running on CPU 1 (both transmit and receive sides), and using 9000-byte
jumbo frames:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
10078.5000 MB / 10.02 sec = 8437.5396 Mbps 100 %TX 99 %RX
With Myricom 10-GigE NICs, and both NIC interrupts and nuttcp app
on CPU 0 (both transmit and receive sides), again using 9000-byte
jumbo frames:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11817.8750 MB / 10.00 sec = 9909.7537 Mbps 100 %TX 74 %RX
Same tests repeated with standard 1500-byte Ethernet MTU:
With Myricom 10-GigE NICs, NIC interrupts on CPU 0 and nuttcp app
running on CPU 1 (both transmit and receive sides), and using
standard 1500-byte Ethernet MTU:
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5685.9375 MB / 10.00 sec = 4768.0951 Mbps 99 %TX 98 %RX
With Myricom 10-GigE NICs, and both NIC interrupts and nuttcp app
on CPU 0 (both transmit and receive sides), again using standard
1500-byte Ethernet MTU:
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4974.0625 MB / 10.03 sec = 4161.6015 Mbps 100 %TX 100 %RX
Now back to your regularly scheduled programming. :-)
-Bill
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 22:36 ` Larry McVoy
2007-10-02 22:59 ` Rick Jones
@ 2007-10-03 8:02 ` David Miller
1 sibling, 0 replies; 56+ messages in thread
From: David Miller @ 2007-10-03 8:02 UTC (permalink / raw)
To: lm; +Cc: rick.jones2, torvalds, herbert, wscott, netdev
From: lm@bitmover.com (Larry McVoy)
Date: Tue, 2 Oct 2007 15:36:44 -0700
> On Tue, Oct 02, 2007 at 03:32:16PM -0700, David Miller wrote:
> > I'm starting to have a theory about what the bad case might
> > be.
> >
> > A strong sender going to an even stronger receiver which can
> > pull out packets into the process as fast as they arrive.
> > This might be part of what keeps the receive window from
> > growing.
>
> I can back you up on that. When I straced the receiving side that goes
> slowly, all the reads were short, like 1-2K. In the direction that works
> well, the reads were a lot larger, as I recall.
My issue turns out to be hardware specific too.
The two Broadcom 5714 onboard NICs on my Niagara t1000 give bad packet
receive performance for some reason; the other two, which are Broadcom
5704s, are perfectly fine. I'll figure out what the problem is,
probably some misprogrammed register in either the chip or the bridge
it's behind.
The UDP stream test of netperf is great for isolating TCP/TSO vs.
hardware issues. If you can't saturate the pipe or the cpu with
the UDP stream test, it's likely a hardware issue.
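(Something along the lines of "netperf -H <host> -t UDP_STREAM", optionally
with "-- -m 1472" to keep each datagram under the Ethernet MTU; see the
netperf manual for the exact test-specific options.)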
The cpu utilization and service demand numbers provided, on both
send and receive, are really useful for diagnosing problems like
this.
Rick deserves several beers for his work on this cool toy. :)
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 21:21 ` Larry McVoy
@ 2007-10-03 21:13 ` Pekka Pietikainen
2007-10-03 21:23 ` Larry McVoy
0 siblings, 1 reply; 56+ messages in thread
From: Pekka Pietikainen @ 2007-10-03 21:13 UTC (permalink / raw)
To: lm, David Miller, torvalds, herbert, wscott, netdev
On Tue, Oct 02, 2007 at 02:21:32PM -0700, Larry McVoy wrote:
> More data, sky2 works fine (really really fine, like 79MB/sec) between
> Linux dylan.bitmover.com 2.6.18.1 #5 SMP Mon Oct 23 17:36:00 PDT 2006 i686
> Linux steele 2.6.20-16-generic #2 SMP Sun Sep 23 18:31:23 UTC 2007 x86_64
>
> So this is looking like a e1000 bug. I'll try to upgrade the kernel on
> the ia64 box and see what happens.
A few notes to the discussion. I've seen one e1000 "bug" that ended up being
a crappy AMD pre-opteron SMP chipset with a totally useless PCI bus
implementation, which limited performance quite a bit, totally depending on
what you plugged in and in which slot. 10-euro milk-and-bread-store
32-bit/33 MHz GigE NICs actually were better than server-class e1000s
in those, but weren't that great either.
One thing worth trying out is recv(..., MSG_TRUNC) on the receiver; that
tests the theoretical sender maximum performance much better (but memory
bandwidth vs. GigE is much higher these days than it was in 2001, so maybe
it's not that useful anymore).
Check your interrupt rates for the interface. You shouldn't be getting
anywhere near 1 interrupt/packet. If you are, something is badly wrong :).
Running getsockopt(...TCP_INFO) every few secs on the socket and printing
that out can be useful too. That gives you both sides' idea on what the
tcp windows etc. are.
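A minimal sketch of that, for a connected TCP socket fd (field names as in
the glibc/linux tcp headers; which fields are actually filled in depends on
the kernel version):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Call this every few seconds on the data socket and print the result. */
static void show_tcp_info(int fd)
{
    struct tcp_info ti;
    socklen_t len = sizeof(ti);

    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0) {
        perror("getsockopt(TCP_INFO)");
        return;
    }
    printf("snd_wscale=%u rcv_wscale=%u cwnd=%u segs mss=%u rtt=%u us rcv_space=%u\n",
           ti.tcpi_snd_wscale, ti.tcpi_rcv_wscale, ti.tcpi_snd_cwnd,
           ti.tcpi_snd_mss, ti.tcpi_rtt, ti.tcpi_rcv_space);
}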
My favourite tool is a home-made thing called yantt, btw
(http://www.ee.oulu.fi/~pp/yantt.tgz). It needs lots of cleanup love:
it mucks with the window sizes by default, since in the 2.4 days you really
had to do that to get any kind of performance, and the help text is wrong.
But it's pretty easy to hack to try out new ideas and to use
sendfile/MSG_TRUNC/TCP_INFO etc.
Netperf is the kitchen sink of network benchmark tools. But trying out a few
tiny things with it is not fun at all; I tried and quickly decided to
write my own tool for my master's thesis work ;-)
Oh. Don't measure CPU usage with top. Use a cyclesoaker (google for
cyclesoak, I included akpm's with yantt) :-)
And yes. TCP stacks do have bugs, especially when things get outside the
equipment most people have. Having a dedicated transatlantic 2.5Gbps
connection found a really fun one a long time ago ;)
--
Pekka Pietikainen
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-03 21:13 ` Pekka Pietikainen
@ 2007-10-03 21:23 ` Larry McVoy
2007-10-03 21:50 ` Pekka Pietikainen
0 siblings, 1 reply; 56+ messages in thread
From: Larry McVoy @ 2007-10-03 21:23 UTC (permalink / raw)
To: Pekka Pietikainen; +Cc: lm, David Miller, torvalds, herbert, wscott, netdev
> A few notes to the discussion. I've seen one e1000 "bug" that ended up being
> a crappy AMD pre-opteron SMP chipset with a totally useless PCI bus
> implementation, which limited performance quite a bit-totally depending on
> what you plugged in and in which slot. 10e milk-and-bread-store
> 32/33 gige nics actually were better than server-class e1000's
> in those, but weren't that great either.
That could well be my problem; this is a dual processor (not core) athlon
(not opteron) Tyan motherboard, if I recall correctly.
> Check your interrupt rates for the interface. You shouldn't be getting
> anywhere near 1 interrupt/packet. If you are, something is badly wrong :).
The acks (because I'm sending) are about 1.5 packets/interrupt.
When this box is receiving it's moving about 3x as much data
and has a _lower_ (absolute, not per packet) interrupt load.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-03 21:23 ` Larry McVoy
@ 2007-10-03 21:50 ` Pekka Pietikainen
0 siblings, 0 replies; 56+ messages in thread
From: Pekka Pietikainen @ 2007-10-03 21:50 UTC (permalink / raw)
To: lm, David Miller, torvalds, herbert, wscott, netdev
On Wed, Oct 03, 2007 at 02:23:58PM -0700, Larry McVoy wrote:
> > A few notes to the discussion. I've seen one e1000 "bug" that ended up being
> > a crappy AMD pre-opteron SMP chipset with a totally useless PCI bus
> > implementation, which limited performance quite a bit-totally depending on
> > what you plugged in and in which slot. 10e milk-and-bread-store
> > 32/33 gige nics actually were better than server-class e1000's
> > in those, but weren't that great either.
>
> That could well be my problem; this is a dual processor (not core) athlon
> (not opteron) Tyan motherboard, if I recall correctly.
If it's AMD760/768MPX, here's some relevant discussion:
http://lkml.org/lkml/2002/7/18/292
http://www.ussg.iu.edu/hypermail/linux/kernel/0307.1/1109.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0307.1/1154.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0307.1/1212.html
http://forums.2cpu.com/showthread.php?s=&threadid=31211
>
> > Check your interrupt rates for the interface. You shouldn't be getting
> > anywhere near 1 interrupt/packet. If you are, something is badly wrong :).
>
> The acks (because I'm sending) are about 1.5 packets/interrupt.
> When this box is receiving it's moving about 3x as much data
> and has a _lower_ (absolute, not per packet) interrupt load.
Probably not a problem then, since those acks probably cover many
sent packets. Current interrupt mitigation schemes are pretty
dynamic, balancing between latency and bulk performance, so the acks
might be fine (thousands vs. tens of thousands/sec).
--
Pekka Pietikainen
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-02 16:47 ` Stephen Hemminger
2007-10-02 16:49 ` Larry McVoy
@ 2007-10-15 12:40 ` Daniel Schaffrath
2007-10-15 15:49 ` Stephen Hemminger
1 sibling, 1 reply; 56+ messages in thread
From: Daniel Schaffrath @ 2007-10-15 12:40 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Larry McVoy, Herbert Xu, torvalds, davem, wscott, Linux NetDev
On 2007/10/02 , at 18:47, Stephen Hemminger wrote:
> On Tue, 2 Oct 2007 09:25:34 -0700
> lm@bitmover.com (Larry McVoy) wrote:
>
>>> If the server side is the source of the data, i.e. its transfer is a
>>> write loop, then I get the bad behaviour.
>>> ...
>>> So is this a bug or intentional?
>>
>> For whatever it is worth, I believed that we used to get better
>> performance from the same hardware. My guess is that it changed
>> somewhere between 2.6.15-1-k7 and 2.6.18-5-k7.
>
> For the period from 2.6.15 to 2.6.18, the kernel by default enabled
> TCP Appropriate Byte Counting. This caused bad performance on
> applications that did small writes.
Stephen, maybe you can provide me with some specifics here?
Thanks a lot!!
Daniel
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: tcp bw in 2.6
2007-10-15 12:40 ` Daniel Schaffrath
@ 2007-10-15 15:49 ` Stephen Hemminger
0 siblings, 0 replies; 56+ messages in thread
From: Stephen Hemminger @ 2007-10-15 15:49 UTC (permalink / raw)
To: Daniel Schaffrath
Cc: Larry McVoy, Herbert Xu, torvalds, davem, wscott, Linux NetDev
On Mon, 15 Oct 2007 14:40:25 +0200
Daniel Schaffrath <danielschaffrath@mac.com> wrote:
> On 2007/10/02 , at 18:47, Stephen Hemminger wrote:
>
> > On Tue, 2 Oct 2007 09:25:34 -0700
> > lm@bitmover.com (Larry McVoy) wrote:
> >
> >>> If the server side is the source of the data, i.e. its transfer is a
> >>> write loop, then I get the bad behaviour.
> >>> ...
> >>> So is this a bug or intentional?
> >>
> >> For whatever it is worth, I believed that we used to get better
> >> performance from the same hardware. My guess is that it changed
> >> somewhere between 2.6.15-1-k7 and 2.6.18-5-k7.
> >
> > For the period from 2.6.15 to 2.6.18, the kernel by default enabled
> > TCP Appropriate Byte Counting. This caused bad performance on
> > applications that did small writes.
> Stephen, maybe you can provide me with some specifics here?
>
> Thanks a lot!!
> Daniel
>
Read RFC 3465 for an explanation of TCP ABC.
What happens is that applications that do multiple small writes
will end up using up their window. Typically these applications are not
streaming enough data to grow the congestion window, so they get held
after 4 writes until an ACK comes back. The fix for the application
(which also helps on other OSes and TCP versions) is to use a call
like writev() or sendmsg() to aggregate the small header blocks together
into a single send, as in the sketch below.
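For instance, a generic sketch (not tied to any particular application in
this thread) that turns a small header write plus a payload write into one
send:

#include <sys/uio.h>
#include <unistd.h>

/* Hand header and payload to the kernel in one writev() so TCP sees a
 * single larger write instead of two tiny ones.  A real caller would
 * loop on short writes. */
static ssize_t send_hdr_and_payload(int fd, const void *hdr, size_t hdrlen,
                                    const void *payload, size_t paylen)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,     .iov_len = hdrlen },
        { .iov_base = (void *)payload, .iov_len = paylen },
    };
    return writev(fd, iov, 2);
}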
--
Stephen Hemminger <shemminger@linux-foundation.org>
^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads:[~2007-10-15 15:50 UTC | newest]
Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20070929142517.EC6AB5FB21@work.bitmover.com>
[not found] ` <alpine.LFD.0.999.0709290914410.3579@woody.linux-foundation.org>
[not found] ` <20070929172639.GB7037@bitmover.com>
[not found] ` <alpine.LFD.0.999.0709291050200.3579@woody.linux-foundation.org>
2007-10-02 0:59 ` tcp bw in 2.6 Larry McVoy
2007-10-02 2:14 ` Linus Torvalds
2007-10-02 2:20 ` Larry McVoy
2007-10-02 3:50 ` David Miller
2007-10-02 4:23 ` Larry McVoy
2007-10-02 15:06 ` John Heffner
2007-10-02 17:14 ` Rick Jones
2007-10-02 17:20 ` Larry McVoy
2007-10-02 18:01 ` Rick Jones
2007-10-02 18:40 ` Larry McVoy
2007-10-02 19:47 ` Rick Jones
2007-10-02 21:32 ` David Miller
2007-10-03 7:19 ` Bill Fink
2007-10-02 10:52 ` Herbert Xu
2007-10-02 15:09 ` Larry McVoy
2007-10-02 15:41 ` Larry McVoy
2007-10-02 16:25 ` Larry McVoy
2007-10-02 16:47 ` Stephen Hemminger
2007-10-02 16:49 ` Larry McVoy
2007-10-02 17:10 ` Stephen Hemminger
2007-10-15 12:40 ` Daniel Schaffrath
2007-10-15 15:49 ` Stephen Hemminger
2007-10-02 16:34 ` Linus Torvalds
2007-10-02 16:48 ` Larry McVoy
2007-10-02 21:16 ` David Miller
2007-10-02 21:26 ` Larry McVoy
2007-10-02 21:47 ` David Miller
2007-10-02 22:17 ` Rick Jones
2007-10-02 22:32 ` David Miller
2007-10-02 22:36 ` Larry McVoy
2007-10-02 22:59 ` Rick Jones
2007-10-03 8:02 ` David Miller
2007-10-02 16:48 ` Ben Greear
2007-10-02 17:11 ` Larry McVoy
2007-10-02 17:18 ` Ben Greear
2007-10-02 17:21 ` Larry McVoy
2007-10-02 17:54 ` Stephen Hemminger
2007-10-02 18:35 ` Larry McVoy
2007-10-02 18:29 ` John Heffner
2007-10-02 19:07 ` Larry McVoy
2007-10-02 19:29 ` Linus Torvalds
2007-10-02 20:31 ` David Miller
2007-10-02 19:33 ` Larry McVoy
2007-10-02 19:53 ` John Heffner
2007-10-02 20:14 ` Larry McVoy
2007-10-02 20:40 ` Rick Jones
2007-10-02 20:42 ` Wayne Scott
2007-10-02 21:56 ` Linus Torvalds
2007-10-02 19:27 ` Linus Torvalds
2007-10-02 19:53 ` Rick Jones
2007-10-02 20:33 ` David Miller
2007-10-02 20:44 ` Roland Dreier
2007-10-02 21:21 ` Larry McVoy
2007-10-03 21:13 ` Pekka Pietikainen
2007-10-03 21:23 ` Larry McVoy
2007-10-03 21:50 ` Pekka Pietikainen
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).