* tbench wrt. loopback TSO
From: David Miller @ 2008-10-16 0:14 UTC
To: netdev; +Cc: zbr, efault, mingo, a.p.zijlstra, herbert

I got curious about this aspect of the investigation so I wanted to
see it first-hand :-)

To be honest, this reported effect of disabling TSO in the loopback
driver surprised me, because:

1) If the benchmark is doing small writes, TSO should have zero
   effect.  The TSO logic won't kick in.

2) If larger-than-MTU writes are being done, TSO should help, and
   this is supported by other benchmarks :-)

So I ran some tbench cases both with and without the NETIF_F_TSO
setting in drivers/net/loopback.c.

On my 64-cpu 1.2GHz Niagara-2 box I obtained these results:

1) For a simple 2-thread run (tbench 2 localhost) the results stayed
   the same both with and without TSO enabled on loopback: about
   77 MB/sec.

2) For a large 64-thread run (tbench 64 localhost) the results
   improved with TSO enabled in the loopback driver.  Without TSO I
   got 1.5 GB/sec, and with TSO I got 1.75 GB/sec throughput.

I double-checked this on a more traditional style of processor, my
workstation, which has 2 UltraSPARC-IIIi chips running at 1.5GHz.
This setup matched case #1 above: for "tbench 2 localhost" I got the
same result, 138 MB/sec, both with and without TSO enabled.

And these results all make total sense to me.  tbench does mostly
small transfers (for which TSO should make absolutely no difference),
but it does a small number of large ones as well.  And as you up the
thread count, the large-transfer cases factor more and more into the
results.

If this TSO setting is causing a performance decrease on some
systems, we should find out why.  I'll try some of my x86 systems
here to see if I can reproduce it.

It just doesn't make any sense that TSO could make any kind of
negative difference; it can only help or have no effect at all!

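For context, the experiment above amounts to editing the feature
flags that loopback_setup() sets in drivers/net/loopback.c.  The
exact flag set varies by kernel version, but in kernels of this era
it looked roughly like the sketch below (approximate, not a verbatim
copy of any particular release); removing NETIF_F_TSO from the mask
is what disables TSO on lo:

	/* drivers/net/loopback.c, loopback_setup() -- sketch only */
	dev->features		= NETIF_F_FRAGLIST
				  | NETIF_F_TSO		/* drop this line to disable TSO */
				  | NETIF_F_NO_CSUM
				  | NETIF_F_HIGHDMA
				  | NETIF_F_LLTX;
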
* non-TCP tbench (was Re: tbench wrt. loopback TSO)
From: David Miller @ 2008-10-17 3:49 UTC
To: netdev; +Cc: zbr, efault, mingo, a.p.zijlstra, herbert

Just for fun, and to continue my own investigation, I wrote two more
variants of tbench to try to isolate and compare the possible
influences on the results.

The first is a "nop" variant; it just does memcpy()'s of the sizes
the TCP transfers would have been.

The second is a "pipe" variant, which uses pipes instead of TCP
sockets.

Here is a dbench patch:

diff -u --recursive --new-file vanilla/dbench-4.0/Makefile.in dbench-4.0/Makefile.in
--- vanilla/dbench-4.0/Makefile.in	2008-02-17 16:49:25.000000000 -0800
+++ dbench-4.0/Makefile.in	2008-10-16 20:47:57.000000000 -0700
@@ -17,9 +17,11 @@
 
 DB_OBJS = fileio.o util.o dbench.o child.o system.o snprintf.o
 TB_OBJS = sockio.o util.o dbench.o child.o socklib.o snprintf.o
+NOP_OBJS = nopio.o util.o dbench.o child.o socklib.o snprintf.o
+PIPE_OBJS = pipeio.o util.o dbench.o child.o snprintf.o
 SRV_OBJS = util.o tbench_srv.o socklib.o
 
-all: dbench tbench tbench_srv
+all: dbench tbench tbench_srv nop pbench
 
 dbench: $(DB_OBJS)
 	$(CC) -o $@ $(DB_OBJS) $(LIBS)
@@ -30,6 +32,12 @@
 tbench_srv: $(SRV_OBJS)
 	$(CC) -o $@ $(SRV_OBJS) $(LIBS)
 
+nop: $(NOP_OBJS)
+	$(CC) -o $@ $(NOP_OBJS) $(LIBS)
+
+pbench: $(PIPE_OBJS)
+	$(CC) -o $@ $(PIPE_OBJS) $(LIBS)
+
 # Careful here: don't install client.txt over itself.
 install: all
 	${INSTALLCMD} -d $(bindir) $(datadir) $(mandir)
@@ -40,7 +48,7 @@
 	ln -sf dbench.1 $(mandir)/tbench_srv.1
 
 clean:
-	rm -f *.o *~ dbench tbench tbench_srv
+	rm -f *.o *~ dbench tbench tbench_srv nop pbench
 
 proto:
 	./mkproto.pl *.c > proto.h
diff -u --recursive --new-file vanilla/dbench-4.0/nopio.c dbench-4.0/nopio.c
--- vanilla/dbench-4.0/nopio.c	1969-12-31 16:00:00.000000000 -0800
+++ dbench-4.0/nopio.c	2008-10-16 20:48:18.000000000 -0700
@@ -0,0 +1,197 @@
+#include "dbench.h"
+
+struct nopio {
+	char dstbuf[70000];
+	char srcbuf[70000];
+};
+
+static void do_packets(struct child_struct *child, int send_size, int recv_size)
+{
+	struct nopio *nopio = (struct nopio *) child->private;
+
+	/* Client --> Server */
+	memcpy(nopio->dstbuf, nopio->srcbuf, send_size);
+	memcpy(nopio->srcbuf, nopio->dstbuf, send_size);
+
+	/* Server --> Client */
+	memcpy(nopio->srcbuf, nopio->dstbuf, recv_size);
+	memcpy(nopio->dstbuf, nopio->srcbuf, recv_size);
+}
+
+void nb_setup(struct child_struct *child)
+{
+	struct nopio *nopio;
+	nopio = calloc(1, sizeof(struct nopio));
+	child->private = nopio;
+	child->rate.last_time = timeval_current();
+	child->rate.last_bytes = 0;
+
+	do_packets(child, 8, 8);
+}
+
+void nb_unlink(struct child_struct *child, const char *fname, int attr, const char *status)
+{
+	(void)child;
+	(void)attr;
+	(void)status;
+	do_packets(child, 39+2+strlen(fname)*2+2, 39);
+}
+
+void nb_mkdir(struct child_struct *child, const char *dname, const char *status)
+{
+	(void)child;
+	(void)status;
+	do_packets(child, 39+2+strlen(dname)*2+2, 39);
+}
+
+void nb_rmdir(struct child_struct *child, const char *fname, const char *status)
+{
+	(void)child;
+	(void)status;
+	do_packets(child, 39+2+strlen(fname)*2+2, 39);
+}
+
+void nb_createx(struct child_struct *child, const char *fname,
+		uint32_t create_options, uint32_t create_disposition, int fnum,
+		const char *status)
+{
+	(void)child;
+	(void)create_options;
+	(void)create_disposition;
+	(void)fnum;
+	(void)status;
+	do_packets(child, 70+2+strlen(fname)*2+2, 39+12*4);
+}
+
+void nb_writex(struct child_struct *child, int handle, int offset,
+	       int size, int ret_size, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)ret_size;
+	(void)status;
+	do_packets(child, 39+20+size, 39+16);
+	child->bytes += size;
+}
+
+void nb_readx(struct child_struct *child, int handle, int offset,
+	      int size, int ret_size, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)size;
+	(void)status;
+	do_packets(child, 39+20, 39+20+ret_size);
+	child->bytes += ret_size;
+}
+
+void nb_close(struct child_struct *child, int handle, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)status;
+	do_packets(child, 39+8, 39);
+}
+
+void nb_rename(struct child_struct *child, const char *old, const char *new, const char *status)
+{
+	(void)child;
+	(void)status;
+	do_packets(child, 39+8+2*strlen(old)+2*strlen(new), 39);
+}
+
+void nb_flush(struct child_struct *child, int handle, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)status;
+	do_packets(child, 39+2, 39);
+}
+
+void nb_qpathinfo(struct child_struct *child, const char *fname, int level,
+		  const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)status;
+	do_packets(child, 39+16+2*strlen(fname), 39+32);
+}
+
+void nb_qfileinfo(struct child_struct *child, int handle, int level, const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)handle;
+	(void)status;
+	do_packets(child, 39+20, 39+32);
+}
+
+void nb_qfsinfo(struct child_struct *child, int level, const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)status;
+	do_packets(child, 39+20, 39+32);
+}
+
+void nb_findfirst(struct child_struct *child, const char *fname, int level, int maxcnt,
+		  int count, const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)maxcnt;
+	(void)status;
+	do_packets(child, 39+20+strlen(fname)*2, 39+90*count);
+}
+
+void nb_cleanup(struct child_struct *child)
+{
+	(void)child;
+}
+
+void nb_deltree(struct child_struct *child, const char *dname)
+{
+	(void)child;
+	(void)dname;
+}
+
+void nb_sfileinfo(struct child_struct *child, int handle, int level, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)level;
+	(void)status;
+	do_packets(child, 39+32, 39+8);
+}
+
+void nb_lockx(struct child_struct *child, int handle, uint32_t offset, int size,
+	      const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)size;
+	(void)status;
+	do_packets(child, 39+12, 39);
+}
+
+void nb_unlockx(struct child_struct *child,
+		int handle, uint32_t offset, int size, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)size;
+	(void)status;
+	do_packets(child, 39+12, 39);
+}
+
+void nb_sleep(struct child_struct *child, int usec, const char *status)
+{
+	(void)child;
+	(void)usec;
+	(void)status;
+	usleep(usec);
+}
diff -u --recursive --new-file vanilla/dbench-4.0/pipeio.c dbench-4.0/pipeio.c
--- vanilla/dbench-4.0/pipeio.c	1969-12-31 16:00:00.000000000 -0800
+++ dbench-4.0/pipeio.c	2008-10-16 20:42:31.000000000 -0700
@@ -0,0 +1,310 @@
+#include "dbench.h"
+
+static int pipe_read(int s, char *buf, int size)
+{
+	int total=0;
+
+	while (size) {
+		int r = read(s, buf, size);
+		if (r <= 0) {
+			if (r == -1) perror("pipe_read");
+			break;
+		}
+		buf += r;
+		size -= r;
+		total += r;
+	}
+	return total;
+}
+
+static int pipe_write(int s, char *buf, int size)
+{
+	int total=0;
+
+	while (size) {
+		int r = write(s, buf, size);
+		if (r <= 0) {
+			if (r == -1) perror("pipe_write");
+			break;
+		}
+		buf += r;
+		size -= r;
+		total += r;
+	}
+	return total;
+}
+
+struct pipeio {
+	char buf[70000];
+	int readfd, writefd;
+};
+
+static void do_packets(struct child_struct *child, int send_size, int recv_size)
+{
+	struct pipeio *pipeio = (struct pipeio *) child->private;
+	uint32 *ubuf = (uint32 *) pipeio->buf;
+
+	ubuf[0] = htonl(send_size-4);
+	ubuf[1] = htonl(recv_size-4);
+
+	if (pipe_write(pipeio->writefd, pipeio->buf, send_size) != send_size) {
+		printf("error writing %d bytes\n", (int)send_size);
+		exit(1);
+	}
+
+	if (pipe_read(pipeio->readfd, pipeio->buf, 4) != 4) {
+		printf("error reading header\n");
+		exit(1);
+	}
+
+	if (ntohl(ubuf[0]) != (unsigned)(recv_size-4)) {
+		printf("lost sync (%d %d)\n",
+		       (int)recv_size-4, (int)ntohl(ubuf[0]));
+		exit(1);
+	}
+
+	if (pipe_read(pipeio->readfd, pipeio->buf, recv_size-4) !=
+	    recv_size-4) {
+		printf("error reading %d bytes\n", (int)recv_size-4);
+		exit(1);
+	}
+
+	if (ntohl(ubuf[0]) != (unsigned)(recv_size-4)) {
+		printf("lost sync (%d %d)\n",
+		       (int)recv_size-4, (int)ntohl(ubuf[0]));
+	}
+}
+
+static void server(int readfd, int writefd)
+{
+	char buf[70000];
+	unsigned *ibuf = (unsigned *)buf;
+	uint32_t n;
+
+	signal(SIGPIPE, SIG_IGN);
+
+	printf("^"); fflush(stdout);
+
+	while (1) {
+		if (pipe_read(readfd, buf, 4) != 4)
+			break;
+		n = ntohl(ibuf[0]);
+		if (n+4 >= sizeof(buf)) {
+			printf("overflow in server!\n");
+			exit(1);
+		}
+		if (pipe_read(readfd, buf+4, n) != (int)n)
+			break;
+		n = ntohl(ibuf[1]);
+		ibuf[0] = htonl(n);
+		if (pipe_write(writefd, buf, n+4) != (int)(n+4))
+			break;
+	}
+}
+
+void nb_setup(struct child_struct *child)
+{
+	struct pipeio *pipeio;
+	int fds_one[2];
+	int fds_two[2];
+
+	if (pipe(fds_one) == -1) {
+		perror("pipe1");
+		exit(1);
+	}
+
+	if (pipe(fds_two) == -1) {
+		perror("pipe2");
+		exit(1);
+	}
+
+	if (fork() == 0) {
+		setpgrp();
+		close(fds_one[1]);
+		close(fds_two[0]);
+		server(fds_one[0], fds_two[1]);
+		close(fds_one[0]);
+		close(fds_two[1]);
+		exit(0);
+	}
+
+	close(fds_one[0]);
+	close(fds_two[1]);
+
+	pipeio = calloc(1, sizeof(struct pipeio));
+	pipeio->readfd = fds_two[0];
+	pipeio->writefd = fds_one[1];
+	child->private = pipeio;
+	child->rate.last_time = timeval_current();
+	child->rate.last_bytes = 0;
+
+	do_packets(child, 8, 8);
+}
+
+void nb_unlink(struct child_struct *child, const char *fname, int attr, const char *status)
+{
+	(void)child;
+	(void)attr;
+	(void)status;
+	do_packets(child, 39+2+strlen(fname)*2+2, 39);
+}
+
+void nb_mkdir(struct child_struct *child, const char *dname, const char *status)
+{
+	(void)child;
+	(void)status;
+	do_packets(child, 39+2+strlen(dname)*2+2, 39);
+}
+
+void nb_rmdir(struct child_struct *child, const char *fname, const char *status)
+{
+	(void)child;
+	(void)status;
+	do_packets(child, 39+2+strlen(fname)*2+2, 39);
+}
+
+void nb_createx(struct child_struct *child, const char *fname,
+		uint32_t create_options, uint32_t create_disposition, int fnum,
+		const char *status)
+{
+	(void)child;
+	(void)create_options;
+	(void)create_disposition;
+	(void)fnum;
+	(void)status;
+	do_packets(child, 70+2+strlen(fname)*2+2, 39+12*4);
+}
+
+void nb_writex(struct child_struct *child, int handle, int offset,
+	       int size, int ret_size, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)ret_size;
+	(void)status;
+	do_packets(child, 39+20+size, 39+16);
+	child->bytes += size;
+}
+
+void nb_readx(struct child_struct *child, int handle, int offset,
+	      int size, int ret_size, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)size;
+	(void)status;
+	do_packets(child, 39+20, 39+20+ret_size);
+	child->bytes += ret_size;
+}
+
+void nb_close(struct child_struct *child, int handle, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)status;
+	do_packets(child, 39+8, 39);
+}
+
+void nb_rename(struct child_struct *child, const char *old, const char *new, const char *status)
+{
+	(void)child;
+	(void)status;
+	do_packets(child, 39+8+2*strlen(old)+2*strlen(new), 39);
+}
+
+void nb_flush(struct child_struct *child, int handle, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)status;
+	do_packets(child, 39+2, 39);
+}
+
+void nb_qpathinfo(struct child_struct *child, const char *fname, int level,
+		  const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)status;
+	do_packets(child, 39+16+2*strlen(fname), 39+32);
+}
+
+void nb_qfileinfo(struct child_struct *child, int handle, int level, const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)handle;
+	(void)status;
+	do_packets(child, 39+20, 39+32);
+}
+
+void nb_qfsinfo(struct child_struct *child, int level, const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)status;
+	do_packets(child, 39+20, 39+32);
+}
+
+void nb_findfirst(struct child_struct *child, const char *fname, int level, int maxcnt,
+		  int count, const char *status)
+{
+	(void)child;
+	(void)level;
+	(void)maxcnt;
+	(void)status;
+	do_packets(child, 39+20+strlen(fname)*2, 39+90*count);
+}
+
+void nb_cleanup(struct child_struct *child)
+{
+	(void)child;
+}
+
+void nb_deltree(struct child_struct *child, const char *dname)
+{
+	(void)child;
+	(void)dname;
+}
+
+void nb_sfileinfo(struct child_struct *child, int handle, int level, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)level;
+	(void)status;
+	do_packets(child, 39+32, 39+8);
+}
+
+void nb_lockx(struct child_struct *child, int handle, uint32_t offset, int size,
+	      const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)size;
+	(void)status;
+	do_packets(child, 39+12, 39);
+}
+
+void nb_unlockx(struct child_struct *child,
+		int handle, uint32_t offset, int size, const char *status)
+{
+	(void)child;
+	(void)handle;
+	(void)offset;
+	(void)size;
+	(void)status;
+	do_packets(child, 39+12, 39);
+}
+
+void nb_sleep(struct child_struct *child, int usec, const char *status)
+{
+	(void)child;
+	(void)usec;
+	(void)status;
+	fprintf(stderr, "%d: usleep(%d)\n", child->id, usec);
+	usleep(usec);
+}

* Re: tbench wrt. loopback TSO
From: Evgeniy Polyakov @ 2008-10-26 12:34 UTC
To: David Miller; +Cc: netdev, efault, mingo, a.p.zijlstra, herbert

Hi.

On Wed, Oct 15, 2008 at 05:14:08PM -0700, David Miller (davem@davemloft.net) wrote:
> 
> I got curious about this aspect of the investigation so I wanted
> to see it first-hand :-)
> 
> To be honest, this reported effect of disabling TSO in the loopback
> driver surprised me, because:
> 
> 1) If the benchmark is doing small writes, TSO should have zero
>    effect.  The TSO logic won't kick in.

But GSO will still try to build a huge packet; won't that overhead
outweigh the gains?

That's what I got with the current tree for 8 threads on a 4-way
32-bit Xeon box (2 physical CPUs) with 8 GB of RAM:

	gso/tso off: 361.367
	tso/gso on:  354.635

Disabled/enabled via ethtool: -K tso off/on gso off/on

> 2) If larger-than-MTU writes are being done, TSO should help, and
>    this is supported by other benchmarks :-)

Yes, that's where it is useful.

-- 
	Evgeniy Polyakov

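The ethtool command flips the same offload bits that can also be set
programmatically through the SIOCETHTOOL ioctl; a minimal sketch of
turning TSO and GSO off on lo (needs CAP_NET_ADMIN, error handling
kept to a minimum):

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

static int set_offload(int fd, const char *dev, __u32 cmd, __u32 on)
{
	struct ethtool_value ev = { .cmd = cmd, .data = on };
	struct ifreq ifr;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
	ifr.ifr_data = (char *)&ev;
	return ioctl(fd, SIOCETHTOOL, &ifr);
}

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	/* ETHTOOL_STSO / ETHTOOL_SGSO set the TSO / GSO flags */
	if (fd < 0 ||
	    set_offload(fd, "lo", ETHTOOL_STSO, 0) ||
	    set_offload(fd, "lo", ETHTOOL_SGSO, 0))
		perror("SIOCETHTOOL");
	return 0;
}
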
* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-10-27 1:59 UTC
To: zbr; +Cc: netdev, efault, mingo, a.p.zijlstra, herbert

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Sun, 26 Oct 2008 15:34:42 +0300

> On Wed, Oct 15, 2008 at 05:14:08PM -0700, David Miller (davem@davemloft.net) wrote:
> > 
> > I got curious about this aspect of the investigation so I wanted
> > to see it first-hand :-)
> > 
> > To be honest, this reported effect of disabling TSO in the loopback
> > driver surprised me, because:
> > 
> > 1) If the benchmark is doing small writes, TSO should have zero
> >    effect.  The TSO logic won't kick in.
> 
> But GSO will still try to build a huge packet; won't that overhead
> outweigh the gains?
> 
> That's what I got with the current tree for 8 threads on a 4-way
> 32-bit Xeon box (2 physical CPUs) with 8 GB of RAM:
> 
> 	gso/tso off: 361.367
> 	tso/gso on:  354.635
> 
> Disabled/enabled via ethtool: -K tso off/on gso off/on

Yes, it might do this, in which case tcp_tso_should_defer() simply
needs some tweaking.

* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-10-27 7:49 UTC
To: zbr, David Miller; +Cc: Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Sun, 26 Oct 2008, David Miller wrote:

> From: Evgeniy Polyakov <zbr@ioremap.net>
> Date: Sun, 26 Oct 2008 15:34:42 +0300
> 
> > But GSO will still try to build a huge packet; won't that overhead
> > outweigh the gains?
> > 
> > That's what I got with the current tree for 8 threads on a 4-way
> > 32-bit Xeon box (2 physical CPUs) with 8 GB of RAM:
> > 
> > 	gso/tso off: 361.367
> > 	tso/gso on:  354.635
> > 
> > Disabled/enabled via ethtool: -K tso off/on gso off/on
> 
> Yes, it might do this, in which case tcp_tso_should_defer() simply
> needs some tweaking.

Good point.  Evgeniy, could you first verify whether a simple
"goto send_now;" at the beginning there recovers it all...

-- 
 i.

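The test being proposed is roughly the following one-liner (a sketch
against the tcp_output.c of that era, not a patch actually posted in
this thread):

--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 send_win, cong_win, limit, in_flight;
 
+	goto send_now;	/* test hack: never defer, always send immediately */
+
 	if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)
 		goto send_now;
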
* Re: tbench wrt. loopback TSO
From: Evgeniy Polyakov @ 2008-10-27 14:13 UTC
To: Ilpo Järvinen; +Cc: David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Mon, Oct 27, 2008 at 09:49:21AM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > > That's what I got with the current tree for 8 threads on a 4-way
> > > 32-bit Xeon box (2 physical CPUs) with 8 GB of RAM:
> > > 
> > > 	gso/tso off: 361.367
> > > 	tso/gso on:  354.635
> > 
> > Yes, it might do this, in which case tcp_tso_should_defer() simply
> > needs some tweaking.
> 
> Good point.  Evgeniy, could you first verify whether a simple
> "goto send_now;" at the beginning there recovers it all...

With a direct goto at the beginning of tcp_tso_should_defer(), on a
git tree at commit 4403b406 (current as of yesterday night), I got
the following numbers:

with goto:
	tso/gso on:  358.976, 357.699
	tso/gso off: 368.016, 367.079

no goto:
	tso/gso on:  353.656, 354.791
	tso/gso off: 364.8, 365.85

So tcp_tso_should_defer() adds additional overhead not accounted for
by the tso/gso changes alone.

-- 
	Evgeniy Polyakov

* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-10-27 15:19 UTC
To: Evgeniy Polyakov; +Cc: David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Mon, 27 Oct 2008, Evgeniy Polyakov wrote:

> On Mon, Oct 27, 2008 at 09:49:21AM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > > > That's what I got with the current tree for 8 threads on a 4-way
> > > > 32-bit Xeon box (2 physical CPUs) with 8 GB of RAM:
> > > > 
> > > > 	gso/tso off: 361.367
> > > > 	tso/gso on:  354.635
> > > 
> > > Yes, it might do this, in which case tcp_tso_should_defer() simply
> > > needs some tweaking.
> > 
> > Good point.  Evgeniy, could you first verify whether a simple
> > "goto send_now;" at the beginning there recovers it all...
> 
> With a direct goto at the beginning of tcp_tso_should_defer(), on a
> git tree at commit 4403b406 (current as of yesterday night), I got
> the following numbers:
> 
> with goto:
> 	tso/gso on:  358.976, 357.699
> 	tso/gso off: 368.016, 367.079
> 
> no goto:
> 	tso/gso on:  353.656, 354.791
> 	tso/gso off: 364.8, 365.85
> 
> So tcp_tso_should_defer() adds additional overhead not accounted for
> by the tso/gso changes alone.

I noticed that tcp_current_mss() does a modulo when TSO is enabled,
and that it is being called from a non-trivial number of places,
though obviously only part of them should be relevant here.  ...I'm
not sure if that can show up, though.

-- 
 i.

* Re: tbench wrt. loopback TSO
From: Evgeniy Polyakov @ 2008-10-27 17:03 UTC
To: Ilpo Järvinen; +Cc: David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Mon, Oct 27, 2008 at 05:19:17PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> I noticed that tcp_current_mss() does a modulo when TSO is enabled,
> and that it is being called from a non-trivial number of places,
> though obviously only part of them should be relevant here.  ...I'm
> not sure if that can show up, though.

That's why, whenever I see a modulo, something screams inside me...
I used the following patch (without the goto in
tcp_tso_should_defer()):

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e4c5ac9..8ee7597 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1075,7 +1075,6 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed)
 			       tp->tcp_header_len);
 
 		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
-		xmit_size_goal -= (xmit_size_goal % mss_now);
 	}
 	tp->xmit_size_goal = xmit_size_goal;

	tso/gso on:  361.866, 362.662
	tso/gso off: 370.038, 368.768

So this is another hidden improvement, not accounted for by the
tso/gso toggle alone.  Another modulo sits in tcp_mss_split_point().
Also found one in the scheduler's find_next_best_node(), which is
under CONFIG_NUMA though.

-- 
	Evgeniy Polyakov

* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-10-27 18:39 UTC
To: zbr; +Cc: ilpo.jarvinen, netdev, efault, mingo, a.p.zijlstra, herbert

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Mon, 27 Oct 2008 20:03:14 +0300

> On Mon, Oct 27, 2008 at 05:19:17PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > I noticed that tcp_current_mss() does a modulo when TSO is enabled,
> > and that it is being called from a non-trivial number of places,
> > though obviously only part of them should be relevant here.  ...I'm
> > not sure if that can show up, though.
> 
> That's why, whenever I see a modulo, something screams inside me...
> I used the following patch (without the goto in
> tcp_tso_should_defer()):

Indeed, those modulos are the worst part of the TSO code.  I'll see
what I can do to get rid of them; I hate them too. :-)

One idea immediately occurs to me.  Since we're effectively limited
to a 64K TSO frame, and the MSS is some value smaller than that, we
can probably get away with a reciprocal divide.  Even using a 16-bit
inverse value would suffice, so we wouldn't need to use u64's like
some other pieces of code do.  A u32 would be enough.

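A sketch of the idea (names hypothetical): precompute a scaled
inverse of the MSS whenever it changes, after which every n % mss
becomes a multiply, a shift and a subtract:

/* Hypothetical sketch of a reciprocal divide for n < 64K.  inv
 * approximates (1 << 16) / mss; see the follow-up below for why
 * 16 bits of precision turns out to be insufficient. */
static inline u32 mss_mod(u32 n, u32 mss, u32 inv)
{
	u32 q = (n * inv) >> 16;	/* approximates n / mss */

	return n - q * mss;		/* equals n % mss only when q is exact */
}

/* recomputed once per MSS change: */
u32 inv = ((1 << 16) + mss - 1) / mss;
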
* Re: tbench wrt. loopback TSO
From: Evgeniy Polyakov @ 2008-10-27 19:35 UTC
To: David Miller; +Cc: ilpo.jarvinen, netdev, efault, mingo, a.p.zijlstra, herbert

On Mon, Oct 27, 2008 at 11:39:04AM -0700, David Miller (davem@davemloft.net) wrote:
> One idea immediately occurs to me.  Since we're effectively limited
> to a 64K TSO frame, and the MSS is some value smaller than that, we
> can probably get away with a reciprocal divide.  Even using a 16-bit
> inverse value would suffice, so we wouldn't need to use u64's like
> some other pieces of code do.  A u32 would be enough.

But why do we need to trim those last bytes in the first place?
Would a binary AND with 0xffff be enough?  Or is that what you
mean? :)

-- 
	Evgeniy Polyakov

* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-10-27 19:37 UTC
To: zbr; +Cc: ilpo.jarvinen, netdev, efault, mingo, a.p.zijlstra, herbert

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Mon, 27 Oct 2008 22:35:02 +0300

> On Mon, Oct 27, 2008 at 11:39:04AM -0700, David Miller (davem@davemloft.net) wrote:
> > One idea immediately occurs to me.  Since we're effectively limited
> > to a 64K TSO frame, and the MSS is some value smaller than that, we
> > can probably get away with a reciprocal divide.  Even using a 16-bit
> > inverse value would suffice, so we wouldn't need to use u64's like
> > some other pieces of code do.  A u32 would be enough.
> 
> But why do we need to trim those last bytes in the first place?
> Would a binary AND with 0xffff be enough?  Or is that what you
> mean? :)

We need to trim because we want to send full-sized frames onto the
network; those trailing bytes would make a sub-MSS sized frame
instead of coalescing with the next round of user sendmsg() data.

* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-11-05 11:42 UTC
To: zbr; +Cc: ilpo.jarvinen, netdev, efault, mingo, a.p.zijlstra, herbert

From: David Miller <davem@davemloft.net>
Date: Mon, 27 Oct 2008 11:39:04 -0700 (PDT)

> One idea immediately occurs to me.  Since we're effectively limited
> to a 64K TSO frame, and the MSS is some value smaller than that, we
> can probably get away with a reciprocal divide.  Even using a 16-bit
> inverse value would suffice, so we wouldn't need to use u64's like
> some other pieces of code do.  A u32 would be enough.

I couldn't get anywhere with this idea.

The problem is that 16 bits provides too little precision for
accurate division by multiplication.

For example, for a divisor of 1500 and a shift of 16, neither 43 nor
44 provides an accurate calculation.

So we'd need to use a u64 and a shift of 32, and on 32-bit cpus that
could be expensive; but perhaps not as expensive as the divide.

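The failure is easy to check exhaustively in userspace; a small
sketch (assuming a shift of 16, divisor 1500, and all dividends
below 64K):

#include <stdio.h>

int main(void)
{
	/* 65536 / 1500 = 43.69..., so 43 and 44 are the two candidates */
	for (unsigned int inv = 43; inv <= 44; inv++)
		for (unsigned int n = 0; n < 65536; n++)
			if ((n * inv) >> 16 != n / 1500) {
				printf("inv=%u: first wrong at n=%u\n", inv, n);
				break;
			}
	return 0;
}

It reports the first mismatch at n=1500 for 43 (quotient 0 instead
of 1) and at n=1490 for 44 (quotient 1 instead of 0), so neither
constant works.
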
* Re: tbench wrt. loopback TSO
From: Evgeniy Polyakov @ 2008-11-05 11:49 UTC
To: David Miller; +Cc: ilpo.jarvinen, netdev, efault, mingo, a.p.zijlstra, herbert

On Wed, Nov 05, 2008 at 03:42:24AM -0800, David Miller (davem@davemloft.net) wrote:
> > One idea immediately occurs to me.  Since we're effectively limited
> > to a 64K TSO frame, and the MSS is some value smaller than that, we
> > can probably get away with a reciprocal divide.  Even using a 16-bit
> > inverse value would suffice, so we wouldn't need to use u64's like
> > some other pieces of code do.  A u32 would be enough.
> 
> I couldn't get anywhere with this idea.
> 
> The problem is that 16 bits provides too little precision for
> accurate division by multiplication.
> 
> For example, for a divisor of 1500 and a shift of 16, neither 43 nor
> 44 provides an accurate calculation.
> 
> So we'd need to use a u64 and a shift of 32, and on 32-bit cpus that
> could be expensive; but perhaps not as expensive as the divide.

But what if we just removed that trimming altogether?  It may result
in a not fully filled TSO frame, but who cares?  We copy the data
from userspace anyway; it would just add another, potentially small,
copy at sending time.

-- 
	Evgeniy Polyakov

* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-11-05 11:54 UTC
To: zbr; +Cc: ilpo.jarvinen, netdev, efault, mingo, a.p.zijlstra, herbert

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Wed, 5 Nov 2008 14:49:00 +0300

> But what if we just removed that trimming altogether?  It may result
> in a not fully filled TSO frame, but who cares?  We copy the data
> from userspace anyway; it would just add another, potentially small,
> copy at sending time.

Ok, the only part I care about is whether taking away this modulus
will result in a sub-MSS sized packet on the wire, which is
undesirable.

The send logic will prevent this, right?

If so, yes, I agree, we can delete this.

* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-11-05 12:04 UTC
To: David Miller; +Cc: zbr, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Wed, 5 Nov 2008, David Miller wrote:

> From: Evgeniy Polyakov <zbr@ioremap.net>
> Date: Wed, 5 Nov 2008 14:49:00 +0300
> 
> > But what if we just removed that trimming altogether?  It may
> > result in a not fully filled TSO frame, but who cares?  We copy
> > the data from userspace anyway; it would just add another,
> > potentially small, copy at sending time.
> 
> Ok, the only part I care about is whether taking away this modulus
> will result in a sub-MSS sized packet on the wire, which is
> undesirable.
> 
> The send logic will prevent this, right?

I don't see what currently would do that.  We'd need to re-segment
skbs there then...

> If so, yes, I agree, we can delete this.

With something similar to those blobs we discussed when talking
about SACK data structures, this might be provided by machinery we
need anyway.  But currently it gets quite non-trivial, since in order
to fill a segment we would need input from two skbs, which probably
requires making another skb; otherwise the stack downstream cannot
cope, afaik.

-- 
 i.

* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-11-05 12:09 UTC
To: ilpo.jarvinen; +Cc: zbr, netdev, efault, mingo, a.p.zijlstra, herbert

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Wed, 5 Nov 2008 14:04:14 +0200 (EET)

> On Wed, 5 Nov 2008, David Miller wrote:
> 
> > From: Evgeniy Polyakov <zbr@ioremap.net>
> > Date: Wed, 5 Nov 2008 14:49:00 +0300
> > 
> > > But what if we just removed that trimming altogether?  It may
> > > result in a not fully filled TSO frame, but who cares?  We copy
> > > the data from userspace anyway; it would just add another,
> > > potentially small, copy at sending time.
> > 
> > Ok, the only part I care about is whether taking away this modulus
> > will result in a sub-MSS sized packet on the wire, which is
> > undesirable.
> > 
> > The send logic will prevent this, right?
> 
> I don't see what currently would do that.  We'd need to re-segment
> skbs there then...

Look at tcp_mss_split_point() in tcp_output.c.

I think it ends up doing exactly this.

'needed' is set to min(skb->len, window), and the return value is
"needed - (needed % mss_now)".

The caller splits up the TSO segment as needed to satisfy this
returned limit.

That should effectively chop off the sub-MSS part.

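For reference, tcp_mss_split_point() circa this thread looked
approximately like this (reconstructed from the discussion; details
may differ slightly from any particular release):

static unsigned int tcp_mss_split_point(struct sock *sk, struct sk_buff *skb,
					unsigned int mss_now, unsigned int cwnd)
{
	struct tcp_sock *tp = tcp_sk(sk);
	u32 needed, window, cwnd_len;

	window = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;
	cwnd_len = mss_now * cwnd;

	if (likely(cwnd_len <= window && skb != tcp_write_queue_tail(sk)))
		return cwnd_len;

	needed = min(skb->len, window);
	if (skb == tcp_write_queue_tail(sk) && cwnd_len <= needed)
		return cwnd_len;

	/* the modulo under discussion: trim to a whole number of MSSes */
	return needed - needed % mss_now;
}
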
* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-11-05 12:25 UTC
To: David Miller; +Cc: zbr, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Wed, 5 Nov 2008, David Miller wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Wed, 5 Nov 2008 14:04:14 +0200 (EET)
> 
> > > Ok, the only part I care about is whether taking away this
> > > modulus will result in a sub-MSS sized packet on the wire, which
> > > is undesirable.
> > > 
> > > The send logic will prevent this, right?
> > 
> > I don't see what currently would do that.  We'd need to re-segment
> > skbs there then...
> 
> Look at tcp_mss_split_point() in tcp_output.c.
> 
> I think it ends up doing exactly this.
> 
> 'needed' is set to min(skb->len, window), and the return value is
> "needed - (needed % mss_now)".
> 
> The caller splits up the TSO segment as needed to satisfy this
> returned limit.
> 
> That should effectively chop off the sub-MSS part.

Yes, yes, I know, as I wrote that part :-).  But how does it prevent
sub-MSS segments on the wire?  Answer: it is intended for a different
case and doesn't prevent sub-MSS segments, but solves some window
limitation issues!  It just makes sure we won't add yet another
sub-MSS segment on top of the splitting we are forced to do anyway.

In the case of curr_mss != write_time_mss, we'll be sending those
< mss segments for each skb; normally this happens over a single RTT,
so it's not that bad a problem (and btw, that modulo won't even be
executed if you have enough window; the splitting then happens on a
layer below, except for the last segment of course, where we'd do
Nagle).

The problem is that we'd need to _re-segment with the next skb_,
since the MSS boundary and the skb boundary would basically
constantly be running out of sync.  That won't get done currently by
anything.

-- 
 i.

* Re: tbench wrt. loopback TSO
From: Evgeniy Polyakov @ 2008-11-05 13:04 UTC
To: Ilpo Järvinen; +Cc: David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Wed, Nov 05, 2008 at 02:25:57PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> The problem is that we'd need to _re-segment with the next skb_,
> since the MSS boundary and the skb boundary would basically
> constantly be running out of sync.  That won't get done currently by
> anything.

Btw, what's so wrong with having one sub-MSS frame per TSO frame?

-- 
	Evgeniy Polyakov

* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-11-05 13:33 UTC
To: Evgeniy Polyakov; +Cc: David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Wed, 5 Nov 2008, Evgeniy Polyakov wrote:

> On Wed, Nov 05, 2008 at 02:25:57PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > The problem is that we'd need to _re-segment with the next skb_,
> > since the MSS boundary and the skb boundary would basically
> > constantly be running out of sync.  That won't get done currently
> > by anything.
> 
> Btw, what's so wrong with having one sub-MSS frame per TSO frame?

I personally don't consider that to be a big deal...  I suppose some
see it as a bad thing because of the slightly larger header-to-data
ratio...  which is significant only if you can saturate the link (or
have unbounded bandwidth, such as with lo), so slower links are more
affected than high-speed ones...

-- 
 i.

* Re: tbench wrt. loopback TSO
From: Rick Jones @ 2008-11-05 18:48 UTC
To: Ilpo Järvinen; +Cc: Evgeniy Polyakov, David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

Ilpo Järvinen wrote:
> On Wed, 5 Nov 2008, Evgeniy Polyakov wrote:
> 
> > Btw, what's so wrong with having one sub-MSS frame per TSO frame?
> 
> I personally don't consider that to be a big deal...  I suppose some
> see it as a bad thing because of the slightly larger header-to-data
> ratio...  which is significant only if you can saturate the link (or
> have unbounded bandwidth, such as with lo), so slower links are more
> affected than high-speed ones...

Can't say that I tend to "like" sub-MSS segments out there in a bulk
transfer, but some pseudorandom thoughts:

The worst case would be one full MSS plus a single byte, getting us
an average segment of roughly (MSS+1)/2.  It only gets better from
there: (2MSS+1)/3, (3MSS+1)/4, etc.

Ignoring the TSO case for a moment: if there is congestion window and
receiver window available, and a user makes a > MSS send that isn't
an integral multiple of the MSS, we don't delay the last sub-MSS
segment, do we?

rick jones

* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-11-05 19:46 UTC
To: Rick Jones; +Cc: Evgeniy Polyakov, David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Wed, 5 Nov 2008, Rick Jones wrote:

> Ilpo Järvinen wrote:
> > I personally don't consider that to be a big deal...  I suppose
> > some see it as a bad thing because of the slightly larger
> > header-to-data ratio...  which is significant only if you can
> > saturate the link (or have unbounded bandwidth, such as with lo),
> > so slower links are more affected than high-speed ones...
> 
> Can't say that I tend to "like" sub-MSS segments out there in a bulk
> transfer, but some pseudorandom thoughts:
> 
> The worst case would be one full MSS plus a single byte, getting us
> an average segment of roughly (MSS+1)/2.  It only gets better from
> there: (2MSS+1)/3, (3MSS+1)/4, etc.

...Note that the likelihood of such a pathological 1-byte case is not
that high if one is sending pages...  For malicious purposes one
could always use TCP_NODELAY anyway to force similar small segments,
so it's hardly worth considering here.  For the most sensible case,
full pages, the resulting segment counts are (for an MSS of 1460 and
1448):

	pages	mss 1460	mss 1448
	 1	 2.80548	 2.82873
	 2	 5.61096	 5.65746
	 3	 8.41644	 8.48619
	 4	11.2219		11.3149
	 5	14.0274		14.1436
	 6	16.8329		16.9724
	 7	19.6384		19.8011
	 8	22.4438		22.6298
	 9	25.2493		25.4586
	10	28.0548		28.2873
	11	30.8603		31.116
	12	33.6658		33.9448
	13	36.4712		36.7735
	14	39.2767		39.6022
	15	42.0822		42.4309
	16	44.8877		45.2597

The worst case seems to be 5 pages with an MSS of 1460, which yields
a 40-byte payload in the final segment.

> Ignoring the TSO case for a moment: if there is congestion window
> and receiver window available, and a user makes a > MSS send that
> isn't an integral multiple of the MSS, we don't delay the last
> sub-MSS segment, do we?

Without TSO, only Nagle could prevent sending that sub-MSS portion,
so the answer depends on what the in-flight window consists of.

With TSO, I guess this falls under tcp_tso_should_defer() first...
And then, as far as the MSS-splitter (that was quoted in this thread
by DaveM) is concerned, we send just the full segments if there's
enough room in the receiver window, and let tso/gso handle the
splitting of that sub-MSS segment if necessary...

...Except for that last segment, which is important for keeping track
of Nagle, and we currently don't handle Nagle correctly if the
segment in question would have len > mss (as noted earlier, this
could be changed, though, after auditing those len < mss checks).  So
it will basically be the same as without TSO (depends on the
in-flight window).

We _won't do_ Nagle for the intermediate sub-MSS segments (a design
decision which predates the time I've been tracking kernel
development); for obvious reasons it couldn't just be enabled for
them currently, because then Nagle would basically stall the
transfer, e.g. when the opposite direction needs to change its MSS
due to reporting SACK blocks.

In general, those middle sub-MSS skbs only occur if the MSS changes,
as skbs are currently split based on the MSS already at write time.

-- 
 i.

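The table above is easy to regenerate; a small sketch that also
prints the sub-MSS tail in bytes for each page count:

#include <stdio.h>

int main(void)
{
	static const int mss[2] = { 1460, 1448 };

	for (int pages = 1; pages <= 16; pages++) {
		int bytes = pages * 4096;

		/* fractional segment count and leftover sub-MSS bytes */
		printf("%2d  %8.6g (%4d left)  %8.6g (%4d left)\n", pages,
		       (double)bytes / mss[0], bytes % mss[0],
		       (double)bytes / mss[1], bytes % mss[1]);
	}
	return 0;
}

For 5 pages and an MSS of 1460 the leftover is 20480 - 14*1460 = 40
bytes, matching the worst case called out above.
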
* Re: tbench wrt. loopback TSO
From: Rick Jones @ 2008-11-05 21:06 UTC
To: Ilpo Järvinen; +Cc: Evgeniy Polyakov, David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

> > Ignoring the TSO case for a moment: if there is congestion window
> > and receiver window available, and a user makes a > MSS send that
> > isn't an integral multiple of the MSS, we don't delay the last
> > sub-MSS segment, do we?
> 
> Without TSO, only Nagle could prevent sending that sub-MSS portion,
> so the answer depends on what the in-flight window consists of.

I always thought that Nagle was supposed to be implemented as if it
were on a user send()-by-send() basis and not segment by segment, in
which case no send() > MSS would be delayed by Nagle.

rick jones

* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-10-27 22:17 UTC
To: Evgeniy Polyakov; +Cc: David Miller, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Mon, 27 Oct 2008, Evgeniy Polyakov wrote:

> On Mon, Oct 27, 2008 at 05:19:17PM +0200, Ilpo Järvinen (ilpo.jarvinen@helsinki.fi) wrote:
> > I noticed that tcp_current_mss() does a modulo when TSO is enabled,
> > and that it is being called from a non-trivial number of places,
> > though obviously only part of them should be relevant here.  ...I'm
> > not sure if that can show up, though.
> 
> That's why, whenever I see a modulo, something screams inside me...
> I used the following patch (without the goto in
> tcp_tso_should_defer()):
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index e4c5ac9..8ee7597 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1075,7 +1075,6 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed)
>  			       tp->tcp_header_len);
>  
>  		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
> -		xmit_size_goal -= (xmit_size_goal % mss_now);
>  	}
>  	tp->xmit_size_goal = xmit_size_goal;
> 
> 	tso/gso on:  361.866, 362.662
> 	tso/gso off: 370.038, 368.768
> 
> So this is another hidden improvement, not accounted for by the
> tso/gso toggle alone.  Another modulo sits in tcp_mss_split_point().

I know it's there, but it should not execute that often.  If you have
pcount == 1 (len <= mss implies that), it won't even execute, and the
rest of the cases involve handling the rwin-limited and Nagle cases.
I suppose we could make Nagle work without splitting the skb to
sub-MSS (this needs some auditing to find out whether something other
than the snd_sml setting assumes skb->len < mss; probably the Nagle
check as well, but I don't remember without looking).

-- 
 i.

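The pcount == 1 short-circuit is visible in the caller; the relevant
lines of tcp_write_xmit() at the time looked roughly like this
(approximate, from memory of that era's tcp_output.c):

		limit = mss_now;
		if (tso_segs > 1)
			limit = tcp_mss_split_point(sk, skb, mss_now,
						    cwnd_quota);

		if (skb->len > limit &&
		    unlikely(tso_fragment(sk, skb, limit, mss_now)))
			break;
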
* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-10-31 8:14 UTC
To: ilpo.jarvinen; +Cc: zbr, netdev, efault, mingo, a.p.zijlstra, herbert

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue, 28 Oct 2008 00:17:00 +0200 (EET)

> On Mon, 27 Oct 2008, Evgeniy Polyakov wrote:
> 
> > So this is another hidden improvement, not accounted for by the
> > tso/gso toggle alone.  Another modulo sits in tcp_mss_split_point().
> 
> I know it's there, but it should not execute that often.

It happens on every sendmsg() call, and all tbench does is small
send, small recv, repeat.

So with TSO on, a small increase in performance is no surprise, as we
will save some cycles often enough.

* Re: tbench wrt. loopback TSO
From: Ilpo Järvinen @ 2008-10-31 9:16 UTC
To: David Miller; +Cc: zbr, Netdev, efault, mingo, a.p.zijlstra, Herbert Xu

On Fri, 31 Oct 2008, David Miller wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Tue, 28 Oct 2008 00:17:00 +0200 (EET)
> 
> > > Another modulo sits in tcp_mss_split_point().
> > 
> > I know it's there, but it should not execute that often.
> 
> It happens on every sendmsg() call, and all tbench does is small
> send, small recv, repeat.

This is not true for the mss_split_point case, which is what I was
speaking of; I even explained this in the part following that first
sentence, which you chose to snip away:

  If you have pcount == 1 (len <= mss implies that), it won't even
  execute, and the rest of the cases involve handling the
  rwin-limited and Nagle cases.  I suppose we could make Nagle work
  without splitting the skb to sub-MSS (this needs some auditing to
  find out whether something other than the snd_sml setting assumes
  skb->len < mss; probably the Nagle check as well, but I don't
  remember without looking).

To reiterate: with a small send you have pcount == 1, and that won't
execute any of the mss_split_point code!

The tcp_current_mss code is a different beast (if you meant that; I
didn't :-)).  Yes, it executes every so often, but I was under the
impression that we agreed on that already. :-)

> So with TSO on, a small increase in performance is no surprise, as
> we will save some cycles often enough.

Also, tcp_send_fin() seems to do tcp_current_mss(sk, 1), so it's at
least two modulos per transaction...

-- 
 i.

* Re: tbench wrt. loopback TSO
From: David Miller @ 2008-10-31 9:47 UTC
To: ilpo.jarvinen; +Cc: zbr, netdev, efault, mingo, a.p.zijlstra, herbert

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Fri, 31 Oct 2008 11:16:21 +0200 (EET)

> To reiterate: with a small send you have pcount == 1, and that won't
> execute any of the mss_split_point code!
> 
> The tcp_current_mss code is a different beast (if you meant that; I
> didn't :-)).  Yes, it executes every so often, but I was under the
> impression that we agreed on that already. :-)

My bad, tcp_current_mss() is what I thought your first sentence was
about.

> Also, tcp_send_fin() seems to do tcp_current_mss(sk, 1), so it's at
> least two modulos per transaction...

True, but not for tbench. :-)

For tbench that one is benign, as it will only execute once for each
thread during the entire benchmark.  Each thread starts up
immediately, opens its connection, and then goes back and forth with
send/recv over and over again for each SAMBA transaction being
simulated.  Then, at the end, each per-thread socket is closed.
