* tbench wrt. loopback TSO
From: David Miller @ 2008-10-16 0:14 UTC
To: netdev; +Cc: zbr, efault, mingo, a.p.zijlstra, herbert

I got curious about this aspect of the investigation so I wanted to
see it first-hand :-)

To be honest, this reported effect of disabling TSO in the loopback
driver surprised me, because:

1) If the benchmark is doing small writes, TSO should have zero
   effect.  The TSO logic won't kick in.

2) If larger-than-MTU writes are being done, TSO should help, and
   this is supported by other benchmarks :-)

So I ran some tbench cases both with and without the NETIF_F_TSO
setting in drivers/net/loopback.c.

On my 64-cpu 1.2GHz Niagara-2 box I obtained these results:

1) For a simple 2-thread run (tbench 2 localhost) the results stayed
   the same both with and without TSO enabled on loopback: about
   77 MB/sec.

2) For a large 64-thread run (tbench 64 localhost) the results
   improved with TSO enabled in the loopback driver.  Without TSO I
   got 1.5 GB/sec, and with TSO I got 1.75 GB/sec throughput.

I double-checked this on a more traditional style of processor, my
workstation, which has 2 UltraSPARC-IIIi chips running at 1.5GHz.
This setup matched case #1 above: for "tbench 2 localhost" I got the
same result, 138 MB/sec, both with and without TSO enabled.

And these results all make total sense to me.  tbench does mostly
small transfers (for which TSO should make absolutely no difference),
but it does a small number of large ones as well.  And as you up the
thread count, the large-transfer cases factor more and more into the
results.

If this TSO setting is causing a performance decrease on some
systems, we should find out why.  I'll try some of my x86 systems
here to see if I can reproduce it.

It just doesn't make any sense that TSO could make any kind of
negative difference; it can only help or have no effect at all!

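For context, the experiment above amounts to editing the feature
flags that loopback_setup() sets in drivers/net/loopback.c.  The
exact flag set varies by kernel version, but in kernels of this era
it looked roughly like the sketch below (approximate, not a verbatim
copy of any particular release); removing NETIF_F_TSO from the mask
is what disables TSO on lo:

	/* drivers/net/loopback.c, loopback_setup() -- sketch only */
	dev->features		= NETIF_F_FRAGLIST
				  | NETIF_F_TSO		/* drop this line to disable TSO */
				  | NETIF_F_NO_CSUM
				  | NETIF_F_HIGHDMA
				  | NETIF_F_LLTX;
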
* non-TCP tbench (was Re: tbench wrt. loopback TSO)
From: David Miller @ 2008-10-17 3:49 UTC
To: netdev; +Cc: zbr, efault, mingo, a.p.zijlstra, herbert

Just for fun, and to continue my own investigation, I wrote two more
variants of tbench to try to isolate and compare the possible
influences on the results.

The first is a "nop" variant; it just does memcpy()'s of the sizes
the TCP transfers would have been.

The second is a "pipe" variant, which uses pipes instead of TCP
sockets.

Here is a dbench patch:

diff -u --recursive --new-file vanilla/dbench-4.0/Makefile.in dbench-4.0/Makefile.in
--- vanilla/dbench-4.0/Makefile.in	2008-02-17 16:49:25.000000000 -0800
+++ dbench-4.0/Makefile.in	2008-10-16 20:47:57.000000000 -0700
@@ -17,9 +17,11 @@
 
 DB_OBJS = fileio.o util.o dbench.o child.o system.o snprintf.o
 TB_OBJS = sockio.o util.o dbench.o child.o socklib.o snprintf.o
+NOP_OBJS = nopio.o util.o dbench.o child.o socklib.o snprintf.o
+PIPE_OBJS = pipeio.o util.o dbench.o child.o snprintf.o
 SRV_OBJS = util.o tbench_srv.o socklib.o
 
-all: dbench tbench tbench_srv
+all: dbench tbench tbench_srv nop pbench
 
 dbench: $(DB_OBJS)
 	$(CC) -o $@ $(DB_OBJS) $(LIBS)
@@ -30,6 +32,12 @@
 tbench_srv: $(SRV_OBJS)
 	$(CC) -o $@ $(SRV_OBJS) $(LIBS)
 
+nop: $(NOP_OBJS)
+	$(CC) -o $@ $(NOP_OBJS) $(LIBS)
+
+pbench: $(PIPE_OBJS)
+	$(CC) -o $@ $(PIPE_OBJS) $(LIBS)
+
 # Careful here: don't install client.txt over itself.
 install: all
 	${INSTALLCMD} -d $(bindir) $(datadir) $(mandir)
@@ -40,7 +48,7 @@
 	ln -sf dbench.1 $(mandir)/tbench_srv.1
 
 clean:
-	rm -f *.o *~ dbench tbench tbench_srv
+	rm -f *.o *~ dbench tbench tbench_srv nop pbench
 
 proto:
 	./mkproto.pl *.c > proto.h
diff -u --recursive --new-file vanilla/dbench-4.0/nopio.c dbench-4.0/nopio.c
--- vanilla/dbench-4.0/nopio.c	1969-12-31 16:00:00.000000000 -0800
+++ dbench-4.0/nopio.c	2008-10-16 20:48:18.000000000 -0700
@@ -0,0 +1,197 @@
+#include "dbench.h"
+
+struct nopio {
+	char dstbuf[70000];
+	char srcbuf[70000];
+};
+
+static void do_packets(struct child_struct *child, int send_size, int recv_size)
+{
+	struct nopio *nopio = (struct nopio *) child->private;
+
+	/* Client --> Server */
+	memcpy(nopio->dstbuf, nopio->srcbuf, send_size);
+	memcpy(nopio->srcbuf, nopio->dstbuf, send_size);
+
+	/* Server --> Client */
+	memcpy(nopio->srcbuf, nopio->dstbuf, recv_size);
+	memcpy(nopio->dstbuf, nopio->srcbuf, recv_size);
+}
+
+void nb_setup(struct child_struct *child)
+{
+	struct nopio *nopio;
+	nopio = calloc(1, sizeof(struct nopio));
+	child->private = nopio;
+	child->rate.last_time = timeval_current();
+	child->rate.last_bytes = 0;
+
+	do_packets(child, 8, 8);
+}
+
+void nb_unlink(struct child_struct *child, const char *fname, int attr, const char *status)
+{
+	(void)child;
+	(void)attr;
+	(void)status;
+	do_packets(child, 39+2+strlen(fname)*2+2, 39);
+}
+
+void nb_mkdir(struct child_struct *child, const char *dname, const char *status)
+{
+	(void)child;
+	(void)status;
+	do_packets(child, 39+2+strlen(dname)*2+2, 39);
+}
+
+void nb_rmdir(struct child_struct *child, const char *fname, const char *status)
+{
+	(void)child;
+	(void)status;
+	do_packets(child, 39+2+strlen(fname)*2+2, 39);
+}
+
+void nb_createx(struct child_struct *child, const char *fname,
+		uint32_t create_options, uint32_t create_disposition, int fnum,
+		const char *status)
+{
+	(void)child;
+	(void)create_options;
+	(void)create_disposition;
+	(void)fnum;
+	(void)status;
+	do_packets(child, 70+2+strlen(fname)*2+2, 39+12*4);
+}
+
+void nb_writex(struct child_struct *child, int handle, int offset,
+	       int size, int ret_size, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)ret_size;
+	(void)status;
+	do_packets(child, 39+20+size, 39+16);
+	child->bytes += size;
+}
+
+void nb_readx(struct child_struct *child, int handle, int offset,
+	      int size, int ret_size, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)size;
+	(void)status;
+	do_packets(child, 39+20, 39+20+ret_size);
+	child->bytes += ret_size;
+}
+
+void nb_close(struct child_struct *child, int handle, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)status;
+	do_packets(child, 39+8, 39);
+}
+
+void nb_rename(struct child_struct *child, const char *old, const char *new, const char *status)
+{
+	(void)child;
+	(void)status;
+	do_packets(child, 39+8+2*strlen(old)+2*strlen(new), 39);
+}
+
+void nb_flush(struct child_struct *child, int handle, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)status;
+	do_packets(child, 39+2, 39);
+}
+
+void nb_qpathinfo(struct child_struct *child, const char *fname, int level,
+		  const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)status;
+	do_packets(child, 39+16+2*strlen(fname), 39+32);
+}
+
+void nb_qfileinfo(struct child_struct *child, int handle, int level, const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)handle;
+	(void)status;
+	do_packets(child, 39+20, 39+32);
+}
+
+void nb_qfsinfo(struct child_struct *child, int level, const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)status;
+	do_packets(child, 39+20, 39+32);
+}
+
+void nb_findfirst(struct child_struct *child, const char *fname, int level, int maxcnt,
+		  int count, const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)maxcnt;
+	(void)status;
+	do_packets(child, 39+20+strlen(fname)*2, 39+90*count);
+}
+
+void nb_cleanup(struct child_struct *child)
+{
+	(void)child;
+}
+
+void nb_deltree(struct child_struct *child, const char *dname)
+{
+	(void)child;
+	(void)dname;
+}
+
+void nb_sfileinfo(struct child_struct *child, int handle, int level, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)level;
+	(void)status;
+	do_packets(child, 39+32, 39+8);
+}
+
+void nb_lockx(struct child_struct *child, int handle, uint32_t offset, int size,
+	      const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)size;
+	(void)status;
+	do_packets(child, 39+12, 39);
+}
+
+void nb_unlockx(struct child_struct *child,
+		int handle, uint32_t offset, int size, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)size;
+	(void)status;
+	do_packets(child, 39+12, 39);
+}
+
+void nb_sleep(struct child_struct *child, int usec, const char *status)
+{
+	(void)child;
+	(void)usec;
+	(void)status;
+	usleep(usec);
+}
diff -u --recursive --new-file vanilla/dbench-4.0/pipeio.c dbench-4.0/pipeio.c
--- vanilla/dbench-4.0/pipeio.c	1969-12-31 16:00:00.000000000 -0800
+++ dbench-4.0/pipeio.c	2008-10-16 20:42:31.000000000 -0700
@@ -0,0 +1,310 @@
+#include "dbench.h"
+
+static int pipe_read(int s, char *buf, int size)
+{
+	int total=0;
+
+	while (size) {
+		int r = read(s, buf, size);
+		if (r <= 0) {
+			if (r == -1) perror("pipe_read");
+			break;
+		}
+		buf += r;
+		size -= r;
+		total += r;
+	}
+	return total;
+}
+
+static int pipe_write(int s, char *buf, int size)
+{
+	int total=0;
+
+	while (size) {
+		int r = write(s, buf, size);
+		if (r <= 0) {
+			if (r == -1) perror("pipe_write");
+			break;
+		}
+		buf += r;
+		size -= r;
+		total += r;
+	}
+	return total;
+}
+
+struct pipeio {
+	char buf[70000];
+	int readfd, writefd;
+};
+
+static void do_packets(struct child_struct *child, int send_size, int recv_size)
+{
+	struct pipeio *pipeio = (struct pipeio *) child->private;
+	uint32 *ubuf = (uint32 *) pipeio->buf;
+
+	ubuf[0] = htonl(send_size-4);
+	ubuf[1] = htonl(recv_size-4);
+
+	if (pipe_write(pipeio->writefd, pipeio->buf, send_size) != send_size) {
+		printf("error writing %d bytes\n", (int)send_size);
+		exit(1);
+	}
+
+	if (pipe_read(pipeio->readfd, pipeio->buf, 4) != 4) {
+		printf("error reading header\n");
+		exit(1);
+	}
+
+	if (ntohl(ubuf[0]) != (unsigned)(recv_size-4)) {
+		printf("lost sync (%d %d)\n",
+		       (int)recv_size-4, (int)ntohl(ubuf[0]));
+		exit(1);
+	}
+
+	if (pipe_read(pipeio->readfd, pipeio->buf, recv_size-4) !=
+	    recv_size-4) {
+		printf("error reading %d bytes\n", (int)recv_size-4);
+		exit(1);
+	}
+
+	if (ntohl(ubuf[0]) != (unsigned)(recv_size-4)) {
+		printf("lost sync (%d %d)\n",
+		       (int)recv_size-4, (int)ntohl(ubuf[0]));
+	}
+}
+
+static void server(int readfd, int writefd)
+{
+	char buf[70000];
+	unsigned *ibuf = (unsigned *)buf;
+	uint32_t n;
+
+	signal(SIGPIPE, SIG_IGN);
+
+	printf("^"); fflush(stdout);
+
+	while (1) {
+		if (pipe_read(readfd, buf, 4) != 4)
+			break;
+		n = ntohl(ibuf[0]);
+		if (n+4 >= sizeof(buf)) {
+			printf("overflow in server!\n");
+			exit(1);
+		}
+		if (pipe_read(readfd, buf+4, n) != (int)n)
+			break;
+		n = ntohl(ibuf[1]);
+		ibuf[0] = htonl(n);
+		if (pipe_write(writefd, buf, n+4) != (int)(n+4))
+			break;
+	}
+}
+
+void nb_setup(struct child_struct *child)
+{
+	struct pipeio *pipeio;
+	int fds_one[2];
+	int fds_two[2];
+
+	if (pipe(fds_one) == -1) {
+		perror("pipe1");
+		exit(1);
+	}
+
+	if (pipe(fds_two) == -1) {
+		perror("pipe2");
+		exit(1);
+	}
+
+	if (fork() == 0) {
+		setpgrp();
+		close(fds_one[1]);
+		close(fds_two[0]);
+		server(fds_one[0], fds_two[1]);
+		close(fds_one[0]);
+		close(fds_two[1]);
+		exit(0);
+	}
+
+	close(fds_one[0]);
+	close(fds_two[1]);
+
+	pipeio = calloc(1, sizeof(struct pipeio));
+	pipeio->readfd = fds_two[0];
+	pipeio->writefd = fds_one[1];
+	child->private = pipeio;
+	child->rate.last_time = timeval_current();
+	child->rate.last_bytes = 0;
+
+	do_packets(child, 8, 8);
+}
+
+void nb_unlink(struct child_struct *child, const char *fname, int attr, const char *status)
+{
+	(void)child;
+	(void)attr;
+	(void)status;
+	do_packets(child, 39+2+strlen(fname)*2+2, 39);
+}
+
+void nb_mkdir(struct child_struct *child, const char *dname, const char *status)
+{
+	(void)child;
+	(void)status;
+	do_packets(child, 39+2+strlen(dname)*2+2, 39);
+}
+
+void nb_rmdir(struct child_struct *child, const char *fname, const char *status)
+{
+	(void)child;
+	(void)status;
+	do_packets(child, 39+2+strlen(fname)*2+2, 39);
+}
+
+void nb_createx(struct child_struct *child, const char *fname,
+		uint32_t create_options, uint32_t create_disposition, int fnum,
+		const char *status)
+{
+	(void)child;
+	(void)create_options;
+	(void)create_disposition;
+	(void)fnum;
+	(void)status;
+	do_packets(child, 70+2+strlen(fname)*2+2, 39+12*4);
+}
+
+void nb_writex(struct child_struct *child, int handle, int offset,
+	       int size, int ret_size, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)ret_size;
+	(void)status;
+	do_packets(child, 39+20+size, 39+16);
+	child->bytes += size;
+}
+
+void nb_readx(struct child_struct *child, int handle, int offset,
+	      int size, int ret_size, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)size;
+	(void)status;
+	do_packets(child, 39+20, 39+20+ret_size);
+	child->bytes += ret_size;
+}
+
+void nb_close(struct child_struct *child, int handle, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)status;
+	do_packets(child, 39+8, 39);
+}
+
+void nb_rename(struct child_struct *child, const char *old, const char *new, const char *status)
+{
+	(void)child;
+	(void)status;
+	do_packets(child, 39+8+2*strlen(old)+2*strlen(new), 39);
+}
+
+void nb_flush(struct child_struct *child, int handle, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)status;
+	do_packets(child, 39+2, 39);
+}
+
+void nb_qpathinfo(struct child_struct *child, const char *fname, int level,
+		  const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)status;
+	do_packets(child, 39+16+2*strlen(fname), 39+32);
+}
+
+void nb_qfileinfo(struct child_struct *child, int handle, int level, const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)handle;
+	(void)status;
+	do_packets(child, 39+20, 39+32);
+}
+
+void nb_qfsinfo(struct child_struct *child, int level, const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)status;
+	do_packets(child, 39+20, 39+32);
+}
+
+void nb_findfirst(struct child_struct *child, const char *fname, int level, int maxcnt,
+		  int count, const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)maxcnt;
+	(void)status;
+	do_packets(child, 39+20+strlen(fname)*2, 39+90*count);
+}
+
+void nb_cleanup(struct child_struct *child)
+{
+	(void)child;
+}
+
+void nb_deltree(struct child_struct *child, const char *dname)
+{
+	(void)child;
+	(void)dname;
+}
+
+void nb_sfileinfo(struct child_struct *child, int handle, int level, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)level;
+	(void)status;
+	do_packets(child, 39+32, 39+8);
+}
+
+void nb_lockx(struct child_struct *child, int handle, uint32_t offset, int size,
+	      const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)size;
+	(void)status;
+	do_packets(child, 39+12, 39);
+}
+
+void nb_unlockx(struct child_struct *child,
+		int handle, uint32_t offset, int size, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)size;
+	(void)status;
+	do_packets(child, 39+12, 39);
+}
+
+void nb_sleep(struct child_struct *child, int usec, const char *status)
+{
+	(void)child;
+	(void)usec;
+	(void)status;
+	fprintf(stderr, "%d: usleep(%d)\n", child->id, usec);
+	usleep(usec);
+}

* Re: tbench wrt. loopback TSO
From: Evgeniy Polyakov @ 2008-10-26 12:34 UTC
To: David Miller; +Cc: netdev, efault, mingo, a.p.zijlstra, herbert

Hi.

On Wed, Oct 15, 2008 at 05:14:08PM -0700, David Miller (davem@davemloft.net) wrote:
> 
> I got curious about this aspect of the investigation so I wanted
> to see it first-hand :-)
> 
> To be honest, this reported effect of disabling TSO in the loopback
> driver surprised me, because:
> 
> 1) If the benchmark is doing small writes, TSO should have zero
>    effect.  The TSO logic won't kick in.

But GSO will still try to build a huge packet; won't that overhead
outweigh the gains?

That's what I got with the current tree for 8 threads on a 4-way
32-bit Xeon box (2 physical CPUs) with 8 GB of RAM:

	gso/tso off: 361.367
	tso/gso on:  354.635

Disabled/enabled via ethtool: -K tso off/on gso off/on

> 2) If larger-than-MTU writes are being done, TSO should help, and
>    this is supported by other benchmarks :-)

Yes, that's where it is useful.

-- 
	Evgeniy Polyakov

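The ethtool command flips the same offload bits that can also be set
programmatically through the SIOCETHTOOL ioctl; a minimal sketch of
turning TSO and GSO off on lo (needs CAP_NET_ADMIN, error handling
kept to a minimum):

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

static int set_offload(int fd, const char *dev, __u32 cmd, __u32 on)
{
	struct ethtool_value ev = { .cmd = cmd, .data = on };
	struct ifreq ifr;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
	ifr.ifr_data = (char *)&ev;
	return ioctl(fd, SIOCETHTOOL, &ifr);
}

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	/* ETHTOOL_STSO / ETHTOOL_SGSO set the TSO / GSO flags */
	if (fd < 0 ||
	    set_offload(fd, "lo", ETHTOOL_STSO, 0) ||
	    set_offload(fd, "lo", ETHTOOL_SGSO, 0))
		perror("SIOCETHTOOL");
	return 0;
}
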
* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-10-27 1:59 UTC
To: zbr; +Cc: netdev, efault, mingo, a.p.zijlstra, herbert

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Sun, 26 Oct 2008 15:34:42 +0300

> On Wed, Oct 15, 2008 at 05:14:08PM -0700, David Miller (davem@davemloft.net) wrote:
> > 
> > I got curious about this aspect of the investigation so I wanted
> > to see it first-hand :-)
> > 
> > To be honest, this reported effect of disabling TSO in the loopback
> > driver surprised me, because:
> > 
> > 1) If the benchmark is doing small writes, TSO should have zero
> >    effect.  The TSO logic won't kick in.
> 
> But GSO will still try to build a huge packet; won't that overhead
> outweigh the gains?
> 
> That's what I got with the current tree for 8 threads on a 4-way
> 32-bit Xeon box (2 physical CPUs) with 8 GB of RAM:
> 
> 	gso/tso off: 361.367
> 	tso/gso on:  354.635
> 
> Disabled/enabled via ethtool: -K tso off/on gso off/on

Yes, it might do this, in which case tcp_tso_should_defer() simply
needs some tweaking.

* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-10-27 7:49 UTC
To: zbr, David Miller; +Cc: Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Sun, 26 Oct 2008, David Miller wrote:

> From: Evgeniy Polyakov <zbr@ioremap.net>
> Date: Sun, 26 Oct 2008 15:34:42 +0300
> 
> > But GSO will still try to build a huge packet; won't that overhead
> > outweigh the gains?
> > 
> > That's what I got with the current tree for 8 threads on a 4-way
> > 32-bit Xeon box (2 physical CPUs) with 8 GB of RAM:
> > 
> > 	gso/tso off: 361.367
> > 	tso/gso on:  354.635
> > 
> > Disabled/enabled via ethtool: -K tso off/on gso off/on
> 
> Yes, it might do this, in which case tcp_tso_should_defer() simply
> needs some tweaking.

Good point.  Evgeniy, could you first verify whether a simple
"goto send_now;" at the beginning there recovers it all...

-- 
 i.

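The test being proposed is roughly the following one-liner (a sketch
against the tcp_output.c of that era, not a patch actually posted in
this thread):

--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 send_win, cong_win, limit, in_flight;
 
+	goto send_now;	/* test hack: never defer, always send immediately */
+
 	if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)
 		goto send_now;
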
* Re: tbench wrt. loopback TSO
From: Evgeniy Polyakov @ 2008-10-27 14:13 UTC
To: Ilpo Järvinen; +Cc: David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Mon, Oct 27, 2008 at 09:49:21AM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > > That's what I got with the current tree for 8 threads on a 4-way
> > > 32-bit Xeon box (2 physical CPUs) with 8 GB of RAM:
> > > 
> > > 	gso/tso off: 361.367
> > > 	tso/gso on:  354.635
> > 
> > Yes, it might do this, in which case tcp_tso_should_defer() simply
> > needs some tweaking.
> 
> Good point.  Evgeniy, could you first verify whether a simple
> "goto send_now;" at the beginning there recovers it all...

With a direct goto at the beginning of tcp_tso_should_defer(), on a
git tree at commit 4403b406 (current as of yesterday night), I got
the following numbers:

with goto:
	tso/gso on:  358.976, 357.699
	tso/gso off: 368.016, 367.079

no goto:
	tso/gso on:  353.656, 354.791
	tso/gso off: 364.8, 365.85

So tcp_tso_should_defer() adds additional overhead not accounted for
by the tso/gso changes alone.

-- 
	Evgeniy Polyakov

* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-10-27 15:19 UTC
To: Evgeniy Polyakov; +Cc: David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Mon, 27 Oct 2008, Evgeniy Polyakov wrote:

> On Mon, Oct 27, 2008 at 09:49:21AM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > > > That's what I got with the current tree for 8 threads on a 4-way
> > > > 32-bit Xeon box (2 physical CPUs) with 8 GB of RAM:
> > > > 
> > > > 	gso/tso off: 361.367
> > > > 	tso/gso on:  354.635
> > > 
> > > Yes, it might do this, in which case tcp_tso_should_defer() simply
> > > needs some tweaking.
> > 
> > Good point.  Evgeniy, could you first verify whether a simple
> > "goto send_now;" at the beginning there recovers it all...
> 
> With a direct goto at the beginning of tcp_tso_should_defer(), on a
> git tree at commit 4403b406 (current as of yesterday night), I got
> the following numbers:
> 
> with goto:
> 	tso/gso on:  358.976, 357.699
> 	tso/gso off: 368.016, 367.079
> 
> no goto:
> 	tso/gso on:  353.656, 354.791
> 	tso/gso off: 364.8, 365.85
> 
> So tcp_tso_should_defer() adds additional overhead not accounted for
> by the tso/gso changes alone.

I noticed that tcp_current_mss() does a modulo when TSO is enabled,
and that it is being called from a non-trivial number of places,
though obviously only part of them should be relevant here.  ...I'm
not sure if that can show up, though.

-- 
 i.

* Re: tbench wrt. loopback TSO
From: Evgeniy Polyakov @ 2008-10-27 17:03 UTC
To: Ilpo Järvinen; +Cc: David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Mon, Oct 27, 2008 at 05:19:17PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> I noticed that tcp_current_mss() does a modulo when TSO is enabled,
> and that it is being called from a non-trivial number of places,
> though obviously only part of them should be relevant here.  ...I'm
> not sure if that can show up, though.

That's why, whenever I see a modulo, something screams inside me...
I used the following patch (without the goto in
tcp_tso_should_defer()):

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e4c5ac9..8ee7597 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1075,7 +1075,6 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed)
 			       tp->tcp_header_len);
 
 		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
-		xmit_size_goal -= (xmit_size_goal % mss_now);
 	}
 	tp->xmit_size_goal = xmit_size_goal;

	tso/gso on:  361.866, 362.662
	tso/gso off: 370.038, 368.768

So this is another hidden improvement, not accounted for by the
tso/gso toggle alone.  Another modulo sits in tcp_mss_split_point().
Also found one in the scheduler's find_next_best_node(), which is
under CONFIG_NUMA though.

-- 
	Evgeniy Polyakov

* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-10-27 18:39 UTC
To: zbr; +Cc: ilpo.jarvinen, netdev, efault, mingo, a.p.zijlstra, herbert

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Mon, 27 Oct 2008 20:03:14 +0300

> On Mon, Oct 27, 2008 at 05:19:17PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > I noticed that tcp_current_mss() does a modulo when TSO is enabled,
> > and that it is being called from a non-trivial number of places,
> > though obviously only part of them should be relevant here.  ...I'm
> > not sure if that can show up, though.
> 
> That's why, whenever I see a modulo, something screams inside me...
> I used the following patch (without the goto in
> tcp_tso_should_defer()):

Indeed, those modulos are the worst part of the TSO code.  I'll see
what I can do to get rid of them; I hate them too. :-)

One idea immediately occurs to me.  Since we're effectively limited
to a 64K TSO frame, and the MSS is some value smaller than that, we
can probably get away with a reciprocal divide.  Even using a 16-bit
inverse value would suffice, so we wouldn't need to use u64's like
some other pieces of code do.  A u32 would be enough.

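A sketch of the idea (names hypothetical): precompute a scaled
inverse of the MSS whenever it changes, after which every n % mss
becomes a multiply, a shift and a subtract:

/* Hypothetical sketch of a reciprocal divide for n < 64K.  inv
 * approximates (1 << 16) / mss; see the follow-up below for why
 * 16 bits of precision turns out to be insufficient. */
static inline u32 mss_mod(u32 n, u32 mss, u32 inv)
{
	u32 q = (n * inv) >> 16;	/* approximates n / mss */

	return n - q * mss;		/* equals n % mss only when q is exact */
}

/* recomputed once per MSS change: */
u32 inv = ((1 << 16) + mss - 1) / mss;
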
* Re: tbench wrt. loopback TSO
From: Evgeniy Polyakov @ 2008-10-27 19:35 UTC
To: David Miller; +Cc: ilpo.jarvinen, netdev, efault, mingo, a.p.zijlstra, herbert

On Mon, Oct 27, 2008 at 11:39:04AM -0700, David Miller (davem@davemloft.net) wrote:
> One idea immediately occurs to me.  Since we're effectively limited
> to a 64K TSO frame, and the MSS is some value smaller than that, we
> can probably get away with a reciprocal divide.  Even using a 16-bit
> inverse value would suffice, so we wouldn't need to use u64's like
> some other pieces of code do.  A u32 would be enough.

But why do we need to trim those last bytes in the first place?
Would a binary AND with 0xffff be enough?  Or is that what you
mean? :)

-- 
	Evgeniy Polyakov

* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-10-27 19:37 UTC
To: zbr; +Cc: ilpo.jarvinen, netdev, efault, mingo, a.p.zijlstra, herbert

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Mon, 27 Oct 2008 22:35:02 +0300

> On Mon, Oct 27, 2008 at 11:39:04AM -0700, David Miller (davem@davemloft.net) wrote:
> > One idea immediately occurs to me.  Since we're effectively limited
> > to a 64K TSO frame, and the MSS is some value smaller than that, we
> > can probably get away with a reciprocal divide.  Even using a 16-bit
> > inverse value would suffice, so we wouldn't need to use u64's like
> > some other pieces of code do.  A u32 would be enough.
> 
> But why do we need to trim those last bytes in the first place?
> Would a binary AND with 0xffff be enough?  Or is that what you
> mean? :)

We need to trim because we want to send full-sized frames onto the
network; those trailing bytes would make a sub-MSS sized frame
instead of coalescing with the next round of user sendmsg() data.

* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-11-05 11:42 UTC
To: zbr; +Cc: ilpo.jarvinen, netdev, efault, mingo, a.p.zijlstra, herbert

From: David Miller <davem@davemloft.net>
Date: Mon, 27 Oct 2008 11:39:04 -0700 (PDT)

> One idea immediately occurs to me.  Since we're effectively limited
> to a 64K TSO frame, and the MSS is some value smaller than that, we
> can probably get away with a reciprocal divide.  Even using a 16-bit
> inverse value would suffice, so we wouldn't need to use u64's like
> some other pieces of code do.  A u32 would be enough.

I couldn't get anywhere with this idea.

The problem is that 16 bits provides too little precision for
accurate division by multiplication.

For example, for a divisor of 1500 and a shift of 16, neither 43 nor
44 provides an accurate calculation.

So we'd need to use a u64 and a shift of 32, and on 32-bit cpus that
could be expensive; but perhaps not as expensive as the divide.

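The failure is easy to check exhaustively in userspace; a small
sketch (assuming a shift of 16, divisor 1500, and all dividends
below 64K):

#include <stdio.h>

int main(void)
{
	/* 65536 / 1500 = 43.69..., so 43 and 44 are the two candidates */
	for (unsigned int inv = 43; inv <= 44; inv++)
		for (unsigned int n = 0; n < 65536; n++)
			if ((n * inv) >> 16 != n / 1500) {
				printf("inv=%u: first wrong at n=%u\n", inv, n);
				break;
			}
	return 0;
}

It reports the first mismatch at n=1500 for 43 (quotient 0 instead
of 1) and at n=1490 for 44 (quotient 1 instead of 0), so neither
constant works.
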
* Re: tbench wrt. loopback TSO
From: Evgeniy Polyakov @ 2008-11-05 11:49 UTC
To: David Miller; +Cc: ilpo.jarvinen, netdev, efault, mingo, a.p.zijlstra, herbert

On Wed, Nov 05, 2008 at 03:42:24AM -0800, David Miller (davem@davemloft.net) wrote:
> > One idea immediately occurs to me.  Since we're effectively limited
> > to a 64K TSO frame, and the MSS is some value smaller than that, we
> > can probably get away with a reciprocal divide.  Even using a 16-bit
> > inverse value would suffice, so we wouldn't need to use u64's like
> > some other pieces of code do.  A u32 would be enough.
> 
> I couldn't get anywhere with this idea.
> 
> The problem is that 16 bits provides too little precision for
> accurate division by multiplication.
> 
> For example, for a divisor of 1500 and a shift of 16, neither 43 nor
> 44 provides an accurate calculation.
> 
> So we'd need to use a u64 and a shift of 32, and on 32-bit cpus that
> could be expensive; but perhaps not as expensive as the divide.

But what if we just removed that trimming altogether?  It may result
in a not fully filled TSO frame, but who cares?  We copy the data
from userspace anyway; it would just add another, potentially small,
copy at sending time.

-- 
	Evgeniy Polyakov

* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-11-05 11:54 UTC
To: zbr; +Cc: ilpo.jarvinen, netdev, efault, mingo, a.p.zijlstra, herbert

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Wed, 5 Nov 2008 14:49:00 +0300

> But what if we just removed that trimming altogether?  It may result
> in a not fully filled TSO frame, but who cares?  We copy the data
> from userspace anyway; it would just add another, potentially small,
> copy at sending time.

Ok, the only part I care about is whether taking away this modulus
will result in a sub-MSS sized packet on the wire, which is
undesirable.

The send logic will prevent this, right?

If so, yes, I agree, we can delete this.

* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-11-05 12:04 UTC
To: David Miller; +Cc: zbr, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Wed, 5 Nov 2008, David Miller wrote:

> From: Evgeniy Polyakov <zbr@ioremap.net>
> Date: Wed, 5 Nov 2008 14:49:00 +0300
> 
> > But what if we just removed that trimming altogether?  It may
> > result in a not fully filled TSO frame, but who cares?  We copy
> > the data from userspace anyway; it would just add another,
> > potentially small, copy at sending time.
> 
> Ok, the only part I care about is whether taking away this modulus
> will result in a sub-MSS sized packet on the wire, which is
> undesirable.
> 
> The send logic will prevent this, right?

I don't see what currently would do that.  We'd need to re-segment
skbs there then...

> If so, yes, I agree, we can delete this.

With something similar to those blobs we discussed when talking
about SACK data structures, this might be provided by machinery we
need anyway.  But currently it gets quite non-trivial, since in order
to fill a segment we would need input from two skbs, which probably
requires making another skb; otherwise the stack downstream cannot
cope, afaik.

-- 
 i.

* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-11-05 12:09 UTC
To: ilpo.jarvinen; +Cc: zbr, netdev, efault, mingo, a.p.zijlstra, herbert

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Wed, 5 Nov 2008 14:04:14 +0200 (EET)

> On Wed, 5 Nov 2008, David Miller wrote:
> 
> > From: Evgeniy Polyakov <zbr@ioremap.net>
> > Date: Wed, 5 Nov 2008 14:49:00 +0300
> > 
> > > But what if we just removed that trimming altogether?  It may
> > > result in a not fully filled TSO frame, but who cares?  We copy
> > > the data from userspace anyway; it would just add another,
> > > potentially small, copy at sending time.
> > 
> > Ok, the only part I care about is whether taking away this modulus
> > will result in a sub-MSS sized packet on the wire, which is
> > undesirable.
> > 
> > The send logic will prevent this, right?
> 
> I don't see what currently would do that.  We'd need to re-segment
> skbs there then...

Look at tcp_mss_split_point() in tcp_output.c.

I think it ends up doing exactly this.

'needed' is set to min(skb->len, window), and the return value is
"needed - (needed % mss_now)".

The caller splits up the TSO segment as needed to satisfy this
returned limit.

That should effectively chop off the sub-MSS part.

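For reference, tcp_mss_split_point() circa this thread looked
approximately like this (reconstructed from the discussion; details
may differ slightly from any particular release):

static unsigned int tcp_mss_split_point(struct sock *sk, struct sk_buff *skb,
					unsigned int mss_now, unsigned int cwnd)
{
	struct tcp_sock *tp = tcp_sk(sk);
	u32 needed, window, cwnd_len;

	window = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;
	cwnd_len = mss_now * cwnd;

	if (likely(cwnd_len <= window && skb != tcp_write_queue_tail(sk)))
		return cwnd_len;

	needed = min(skb->len, window);
	if (skb == tcp_write_queue_tail(sk) && cwnd_len <= needed)
		return cwnd_len;

	/* the modulo under discussion: trim to a whole number of MSSes */
	return needed - needed % mss_now;
}
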
* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-11-05 12:25 UTC
To: David Miller; +Cc: zbr, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Wed, 5 Nov 2008, David Miller wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Wed, 5 Nov 2008 14:04:14 +0200 (EET)
> 
> > > Ok, the only part I care about is whether taking away this
> > > modulus will result in a sub-MSS sized packet on the wire, which
> > > is undesirable.
> > > 
> > > The send logic will prevent this, right?
> > 
> > I don't see what currently would do that.  We'd need to re-segment
> > skbs there then...
> 
> Look at tcp_mss_split_point() in tcp_output.c.
> 
> I think it ends up doing exactly this.
> 
> 'needed' is set to min(skb->len, window), and the return value is
> "needed - (needed % mss_now)".
> 
> The caller splits up the TSO segment as needed to satisfy this
> returned limit.
> 
> That should effectively chop off the sub-MSS part.

Yes, yes, I know, as I wrote that part :-).  But how does it prevent
sub-MSS segments on the wire?  Answer: it is intended for a different
case and doesn't prevent sub-MSS segments, but solves some window
limitation issues!  It just makes sure we won't add yet another
sub-MSS segment on top of the splitting we are forced to do anyway.

In the case of curr_mss != write_time_mss, we'll be sending those
< mss segments for each skb; normally this happens over a single RTT,
so it's not that bad a problem (and btw, that modulo won't even be
executed if you have enough window; the splitting then happens on a
layer below, except for the last segment of course, where we'd do
Nagle).

The problem is that we'd need to _re-segment with the next skb_,
since the MSS boundary and the skb boundary would basically
constantly be running out of sync.  That won't get done currently by
anything.

-- 
 i.

* Re: tbench wrt. loopback TSO
From: Evgeniy Polyakov @ 2008-11-05 13:04 UTC
To: Ilpo Järvinen; +Cc: David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Wed, Nov 05, 2008 at 02:25:57PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> The problem is that we'd need to _re-segment with the next skb_,
> since the MSS boundary and the skb boundary would basically
> constantly be running out of sync.  That won't get done currently by
> anything.

Btw, what's so wrong with having one sub-MSS frame per TSO frame?

-- 
	Evgeniy Polyakov

* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-11-05 13:33 UTC
To: Evgeniy Polyakov; +Cc: David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Wed, 5 Nov 2008, Evgeniy Polyakov wrote:

> On Wed, Nov 05, 2008 at 02:25:57PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > The problem is that we'd need to _re-segment with the next skb_,
> > since the MSS boundary and the skb boundary would basically
> > constantly be running out of sync.  That won't get done currently
> > by anything.
> 
> Btw, what's so wrong with having one sub-MSS frame per TSO frame?

I personally don't consider that to be a big deal...  I suppose some
see it as a bad thing because of the slightly larger header-to-data
ratio...  which is significant only if you can saturate the link (or
have unbounded bandwidth, such as with lo), so slower links are more
affected than high-speed ones...

-- 
 i.

* Re: tbench wrt. loopback TSO
From: Rick Jones @ 2008-11-05 18:48 UTC
To: Ilpo Järvinen; +Cc: Evgeniy Polyakov, David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

Ilpo Järvinen wrote:
> On Wed, 5 Nov 2008, Evgeniy Polyakov wrote:
> 
> > Btw, what's so wrong with having one sub-MSS frame per TSO frame?
> 
> I personally don't consider that to be a big deal...  I suppose some
> see it as a bad thing because of the slightly larger header-to-data
> ratio...  which is significant only if you can saturate the link (or
> have unbounded bandwidth, such as with lo), so slower links are more
> affected than high-speed ones...

Can't say that I tend to "like" sub-MSS segments out there in a bulk
transfer, but some pseudorandom thoughts:

The worst case would be one full MSS plus a single byte, getting us
an average segment of roughly (MSS+1)/2.  It only gets better from
there: (2MSS+1)/3, (3MSS+1)/4, etc.

Ignoring the TSO case for a moment: if there is congestion window and
receiver window available, and a user makes a > MSS send that isn't
an integral multiple of the MSS, we don't delay the last sub-MSS
segment, do we?

rick jones

* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-11-05 19:46 UTC
To: Rick Jones; +Cc: Evgeniy Polyakov, David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Wed, 5 Nov 2008, Rick Jones wrote:

> Ilpo Järvinen wrote:
> > I personally don't consider that to be a big deal...  I suppose
> > some see it as a bad thing because of the slightly larger
> > header-to-data ratio...  which is significant only if you can
> > saturate the link (or have unbounded bandwidth, such as with lo),
> > so slower links are more affected than high-speed ones...
> 
> Can't say that I tend to "like" sub-MSS segments out there in a bulk
> transfer, but some pseudorandom thoughts:
> 
> The worst case would be one full MSS plus a single byte, getting us
> an average segment of roughly (MSS+1)/2.  It only gets better from
> there: (2MSS+1)/3, (3MSS+1)/4, etc.

...Note that the likelihood of such a pathological 1-byte case is not
that high if one is sending pages...  For malicious purposes one
could always use TCP_NODELAY anyway to force similar small segments,
so it's hardly worth considering here.  For the most sensible case,
full pages, the resulting segment counts are (for an MSS of 1460 and
1448):

	pages	mss 1460	mss 1448
	 1	 2.80548	 2.82873
	 2	 5.61096	 5.65746
	 3	 8.41644	 8.48619
	 4	11.2219		11.3149
	 5	14.0274		14.1436
	 6	16.8329		16.9724
	 7	19.6384		19.8011
	 8	22.4438		22.6298
	 9	25.2493		25.4586
	10	28.0548		28.2873
	11	30.8603		31.116
	12	33.6658		33.9448
	13	36.4712		36.7735
	14	39.2767		39.6022
	15	42.0822		42.4309
	16	44.8877		45.2597

The worst case seems to be 5 pages with an MSS of 1460, which yields
a 40-byte payload in the final segment.

> Ignoring the TSO case for a moment: if there is congestion window
> and receiver window available, and a user makes a > MSS send that
> isn't an integral multiple of the MSS, we don't delay the last
> sub-MSS segment, do we?

Without TSO, only Nagle could prevent sending that sub-MSS portion,
so the answer depends on what the in-flight window consists of.

With TSO, I guess this falls under tcp_tso_should_defer() first...
And then, as far as the MSS-splitter (that was quoted in this thread
by DaveM) is concerned, we send just the full segments if there's
enough room in the receiver window, and let tso/gso handle the
splitting of that sub-MSS segment if necessary...

...Except for that last segment, which is important for keeping track
of Nagle, and we currently don't handle Nagle correctly if the
segment in question would have len > mss (as noted earlier, this
could be changed, though, after auditing those len < mss checks).  So
it will basically be the same as without TSO (depends on the
in-flight window).

We _won't do_ Nagle for the intermediate sub-MSS segments (a design
decision which predates the time I've been tracking kernel
development); for obvious reasons it couldn't just be enabled for
them currently, because then Nagle would basically stall the
transfer, e.g. when the opposite direction needs to change its MSS
due to reporting SACK blocks.

In general, those middle sub-MSS skbs only occur if the MSS changes,
as skbs are currently split based on the MSS already at write time.

-- 
 i.

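The table above is easy to regenerate; a small sketch that also
prints the sub-MSS tail in bytes for each page count:

#include <stdio.h>

int main(void)
{
	static const int mss[2] = { 1460, 1448 };

	for (int pages = 1; pages <= 16; pages++) {
		int bytes = pages * 4096;

		/* fractional segment count and leftover sub-MSS bytes */
		printf("%2d  %8.6g (%4d left)  %8.6g (%4d left)\n", pages,
		       (double)bytes / mss[0], bytes % mss[0],
		       (double)bytes / mss[1], bytes % mss[1]);
	}
	return 0;
}

For 5 pages and an MSS of 1460 the leftover is 20480 - 14*1460 = 40
bytes, matching the worst case called out above.
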
* Re: tbench wrt. loopback TSO
From: Rick Jones @ 2008-11-05 21:06 UTC
To: Ilpo Järvinen; +Cc: Evgeniy Polyakov, David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

> > Ignoring the TSO case for a moment: if there is congestion window
> > and receiver window available, and a user makes a > MSS send that
> > isn't an integral multiple of the MSS, we don't delay the last
> > sub-MSS segment, do we?
> 
> Without TSO, only Nagle could prevent sending that sub-MSS portion,
> so the answer depends on what the in-flight window consists of.

I always thought that Nagle was supposed to be implemented as if it
were on a user send()-by-send() basis and not segment by segment, in
which case no send() > MSS would be delayed by Nagle.

rick jones

* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-10-27 22:17 UTC
To: Evgeniy Polyakov; +Cc: David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Mon, 27 Oct 2008, Evgeniy Polyakov wrote:

> On Mon, Oct 27, 2008 at 05:19:17PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > I noticed that tcp_current_mss() does a modulo when TSO is enabled,
> > and that it is being called from a non-trivial number of places,
> > though obviously only part of them should be relevant here.  ...I'm
> > not sure if that can show up, though.
> 
> That's why, whenever I see a modulo, something screams inside me...
> I used the following patch (without the goto in
> tcp_tso_should_defer()):
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index e4c5ac9..8ee7597 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1075,7 +1075,6 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed)
>  			       tp->tcp_header_len);
>  
>  		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
> -		xmit_size_goal -= (xmit_size_goal % mss_now);
>  	}
>  	tp->xmit_size_goal = xmit_size_goal;
> 
> 	tso/gso on:  361.866, 362.662
> 	tso/gso off: 370.038, 368.768
> 
> So this is another hidden improvement, not accounted for by the
> tso/gso toggle alone.  Another modulo sits in tcp_mss_split_point().

I know it's there, but it should not execute that often.  If you have
pcount == 1 (len <= mss implies that), it won't even execute, and the
rest of the cases involve handling the rwin-limited and Nagle cases.
I suppose we could make Nagle work without splitting the skb to
sub-MSS (this needs some auditing to find out whether something other
than the snd_sml setting assumes skb->len < mss; probably the Nagle
check as well, but I don't remember without looking).

-- 
 i.

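The pcount == 1 short-circuit is visible in the caller; the relevant
lines of tcp_write_xmit() at the time looked roughly like this
(approximate, from memory of that era's tcp_output.c):

		limit = mss_now;
		if (tso_segs > 1)
			limit = tcp_mss_split_point(sk, skb, mss_now,
						    cwnd_quota);

		if (skb->len > limit &&
		    unlikely(tso_fragment(sk, skb, limit, mss_now)))
			break;
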
* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-10-31 8:14 UTC
To: ilpo.jarvinen; +Cc: zbr, netdev, efault, mingo, a.p.zijlstra, herbert

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue, 28 Oct 2008 00:17:00 +0200 (EET)

> On Mon, 27 Oct 2008, Evgeniy Polyakov wrote:
> 
> > So this is another hidden improvement, not accounted for by the
> > tso/gso toggle alone.  Another modulo sits in tcp_mss_split_point().
> 
> I know it's there, but it should not execute that often.

It happens on every sendmsg() call, and all tbench does is small
send, small recv, repeat.

So with TSO on, a small increase in performance is no surprise, as we
will save some cycles often enough.

* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-10-31 9:16 UTC
To: David Miller; +Cc: zbr, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Fri, 31 Oct 2008, David Miller wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Tue, 28 Oct 2008 00:17:00 +0200 (EET)
> 
> > > Another modulo sits in tcp_mss_split_point().
> > 
> > I know it's there, but it should not execute that often.
> 
> It happens on every sendmsg() call, and all tbench does is small
> send, small recv, repeat.

This is not true for the mss_split_point case, which is what I was
speaking of; I even explained this in the part following that first
sentence, which you chose to snip away:

  If you have pcount == 1 (len <= mss implies that), it won't even
  execute, and the rest of the cases involve handling the
  rwin-limited and Nagle cases.  I suppose we could make Nagle work
  without splitting the skb to sub-MSS (this needs some auditing to
  find out whether something other than the snd_sml setting assumes
  skb->len < mss; probably the Nagle check as well, but I don't
  remember without looking).

To reiterate: with a small send you have pcount == 1, and that won't
execute any of the mss_split_point code!

The tcp_current_mss code is a different beast (if you meant that; I
didn't :-)).  Yes, it executes every so often, but I was under the
impression that we agreed on that already. :-)

> So with TSO on, a small increase in performance is no surprise, as
> we will save some cycles often enough.

Also, tcp_send_fin() seems to do tcp_current_mss(sk, 1), so it's at
least two modulos per transaction...

-- 
 i.

* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-10-31 9:47 UTC
To: ilpo.jarvinen; +Cc: zbr, netdev, efault, mingo, a.p.zijlstra, herbert

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Fri, 31 Oct 2008 11:16:21 +0200 (EET)

> To reiterate: with a small send you have pcount == 1, and that won't
> execute any of the mss_split_point code!
> 
> The tcp_current_mss code is a different beast (if you meant that; I
> didn't :-)).  Yes, it executes every so often, but I was under the
> impression that we agreed on that already. :-)

My bad, tcp_current_mss() is what I thought your first sentence was
about.

> Also, tcp_send_fin() seems to do tcp_current_mss(sk, 1), so it's at
> least two modulos per transaction...

True, but not for tbench. :-)

For tbench that one is benign, as it will only execute once for each
thread during the entire benchmark.  Each thread starts up
immediately, opens its connection, and then goes back and forth with
send/recv over and over again for each SAMBA transaction being
simulated.  Then, at the end, each per-thread socket is closed.
