From mboxrd@z Thu Jan 1 00:00:00 1970
From: "David S. Miller"
Subject: Re: [patch] e1000 TSO parameter
Date: Mon, 14 Jul 2003 22:38:22 -0700
Sender: netdev-bounce@oss.sgi.com
Message-ID: <20030714223822.23b78f9b.davem@redhat.com>
References: <20030714214510.17e02a9f.davem@redhat.com> <16147.37268.946613.965075@napali.hpl.hp.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: davidm@napali.hpl.hp.com, scott.feldman@intel.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com
Return-path:
To: davidm@hpl.hp.com
In-Reply-To: <16147.37268.946613.965075@napali.hpl.hp.com>
Errors-to: netdev-bounce@oss.sgi.com
List-Id: netdev.vger.kernel.org

On Mon, 14 Jul 2003 22:31:00 -0700 David Mosberger wrote:

> With TSO enabled:
>
> ftp> get big.iso /dev/null
> local: /dev/null remote: big.iso
> 200 PORT command successful.
> 150 Opening BINARY mode data connection for 'big.iso' (2038628352 bytes).
> 226 Transfer complete.
> 2038628352 bytes received in 21.16 secs (94070.2 kB/s)
>
> ftp server CPU utilization: ~ 15%
>
> So we get almost 15% of throughput drop. This was with plain "netkit
> fptd". AFAIK, it does a simple read/write loop (not sendfile()).

When we use TSO for non-sendfile() applications it really stresses memory allocations. We do these 64K+ kmalloc()'s for each packet we construct.

But I don't think that's what is happening here. Rather, the PCI controller is "talking" to the CPU's L2 cache with coherency transactions on all the data of every packet going to the chip.

Whereas with a sendfile() type setup, the PCI controller is going straight to main memory for the data part of the packets, since the CPU is unlikely to have each page cache page in its L2 caches.

In the sendmsg() case, it's virtually guaranteed that the CPU will have all the packet data in its L2 cache in an unshared-modified state.

I know how this can be fixed: can you use L2-bypassing stores in your csum_and_copy_from_user() and copy_from_user() implementations like we do on sparc64? That would exactly eliminate this situation where the card is talking to the CPU's L2 cache for all the data during the PCI DMA transaction on the send side.
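
To make "L2-bypassing stores" a bit more concrete, here is a rough userspace sketch of a copy routine that writes its destination with cache-bypassing (non-temporal) stores. It is only an analogy under stated assumptions: it uses x86 SSE2 intrinsics, not the sparc64 or ia64 kernel code discussed in this thread, and copy_nocache() is a hypothetical name that exists in no kernel tree. On builds without SSE2 it simply falls back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/*
 * Hypothetical helper (illustration only): copy n bytes from src to dst,
 * using non-temporal stores for the bulk of the destination so the copied
 * data is hinted not to stay in the CPU caches.
 */
static void copy_nocache(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

#if defined(__SSE2__)
    /* Head: plain byte stores until the destination is 16-byte aligned,
     * because _mm_stream_si128 requires an aligned destination. */
    while (n > 0 && ((uintptr_t)d & 15) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: unaligned 16-byte loads, aligned non-temporal 16-byte stores. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        n -= 16;
    }

    /* Order the non-temporal stores before any later stores, e.g. before
     * a device would be told that the buffer is ready. */
    _mm_sfence();
#endif

    /* Tail (or the entire copy when SSE2 is not available). */
    memcpy(d, s, n);
}
```

A caller would use it like memcpy(), e.g. copy_nocache(pkt_buf, user_buf, len), where pkt_buf and user_buf are made-up names. The point of the design is only that the destination lines are not left modified in the cache, so a later DMA read of the buffer can be served from main memory.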