* [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
@ 2010-01-15 5:33 Krishna Kumar
2010-01-15 8:36 ` David Miller
2010-01-15 18:13 ` Rick Jones
0 siblings, 2 replies; 24+ messages in thread
From: Krishna Kumar @ 2010-01-15 5:33 UTC (permalink / raw)
To: davem; +Cc: ilpo.jarvinen, netdev, eric.dumazet, Krishna Kumar
From: Krishna Kumar <krkumar2@in.ibm.com>
Remove the inline skb data handling in tcp_sendmsg(). For the few
devices that don't support NETIF_F_SG, dev_queue_xmit will call
skb_linearize, passing the penalty on to those slow devices (the
following drivers do not support NETIF_F_SG: 8139cp.c, amd8111e.c,
dl2k.c, dm9000.c, dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c,
ps3_gelic_net.c, r8169.c, rionet.c, spider_net.c, tsi108_eth.c,
veth.c, via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
cris/eth_v10.c).
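For reference, the fallback this relies on is the non-SG linearize step
in the core xmit path. A rough sketch (paraphrased, not verbatim kernel
code, and the wrapper name below is only for illustration):

/*
 * If the egress device cannot do scatter-gather, flatten the paged
 * skb into a single linear buffer before the driver sees it. The
 * highmem/DMA checks of the real code are omitted here.
 */
static int linearize_if_non_sg(struct sk_buff *skb, struct net_device *dev)
{
	if (skb_shinfo(skb)->nr_frags &&
	    !(dev->features & NETIF_F_SG) &&
	    __skb_linearize(skb))
		return -ENOMEM;		/* caller frees and drops the skb */

	return 0;
}

So a non-SG driver still gets a linear skb, just built later and at its
own cost.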
This patch does not affect devices that support SG but turn it off
via ethtool after register_netdev.
I ran the following test cases with iperf - #threads: 1 4 8 16 32
64 128 192 256; I/O sizes: 256 4K 16K 64K. Each test case runs for
1 minute and is repeated for 5 iterations. The total test run time
is 6 hours. The system is a 4-processor Opteron with a Chelsio
10 Gbps NIC. Results (BW figures are the aggregate across the 5
iterations, in Mbps):
-------------------------------------------------------
#Process I/O Size Org-BW New-BW %-change
-------------------------------------------------------
1 256 2098 2147 2.33
1 4K 14057 14269 1.50
1 16K 25984 27317 5.13
1 64K 25920 27539 6.24
4 256 1895 2117 11.71
4 4K 10699 15649 46.26
4 16K 25675 26723 4.08
4 64K 27026 27545 1.92
8 256 1816 1966 8.25
8 4K 9511 12754 34.09
8 16K 25177 25281 0.41
8 64K 26288 26878 2.24
16 256 1893 2142 13.15
16 4K 16370 15805 -3.45
16 16K 25986 25890 -0.36
16 64K 26925 28036 4.12
32 256 2061 2038 -1.11
32 4K 10765 12253 13.82
32 16K 26802 28613 6.75
32 64K 28433 27739 -2.44
64 256 1885 2088 10.76
64 4K 10534 15778 49.78
64 16K 26745 28130 5.17
64 64K 29153 28708 -1.52
128 256 1884 2023 7.37
128 4K 9446 13732 45.37
128 16K 27013 27086 0.27
128 64K 26151 27933 6.81
192 256 2000 2094 4.70
192 4K 14260 13479 -5.47
192 16K 25545 27478 7.56
192 64K 26497 26482 -0.05
256 256 1947 1955 0.41
256 4K 9828 12265 24.79
256 16K 25087 24977 -0.43
256 64K 26715 27997 4.79
-------------------------------------------------------
Total: - 600071 634906 5.80
-------------------------------------------------------
Please review if the idea is acceptable.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
net/ipv4/tcp.c | 172 ++++++++++++++++++-----------------------------
1 file changed, 66 insertions(+), 106 deletions(-)
diff -ruNp org/net/ipv4/tcp.c new/net/ipv4/tcp.c
--- org/net/ipv4/tcp.c 2010-01-13 10:43:27.000000000 +0530
+++ new/net/ipv4/tcp.c 2010-01-13 10:43:37.000000000 +0530
@@ -876,26 +876,6 @@ ssize_t tcp_sendpage(struct socket *sock
#define TCP_PAGE(sk) (sk->sk_sndmsg_page)
#define TCP_OFF(sk) (sk->sk_sndmsg_off)
-static inline int select_size(struct sock *sk, int sg)
-{
- struct tcp_sock *tp = tcp_sk(sk);
- int tmp = tp->mss_cache;
-
- if (sg) {
- if (sk_can_gso(sk))
- tmp = 0;
- else {
- int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
-
- if (tmp >= pgbreak &&
- tmp <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
- tmp = pgbreak;
- }
- }
-
- return tmp;
-}
-
int tcp_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
size_t size)
{
@@ -905,7 +885,7 @@ int tcp_sendmsg(struct kiocb *iocb, stru
struct sk_buff *skb;
int iovlen, flags;
int mss_now, size_goal;
- int sg, err, copied;
+ int err, copied;
long timeo;
lock_sock(sk);
@@ -933,8 +913,6 @@ int tcp_sendmsg(struct kiocb *iocb, stru
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
goto out_err;
- sg = sk->sk_route_caps & NETIF_F_SG;
-
while (--iovlen >= 0) {
int seglen = iov->iov_len;
unsigned char __user *from = iov->iov_base;
@@ -944,6 +922,8 @@ int tcp_sendmsg(struct kiocb *iocb, stru
while (seglen > 0) {
int copy = 0;
int max = size_goal;
+ int merge, i, off;
+ struct page *page;
skb = tcp_write_queue_tail(sk);
if (tcp_send_head(sk)) {
@@ -954,14 +934,11 @@ int tcp_sendmsg(struct kiocb *iocb, stru
if (copy <= 0) {
new_segment:
- /* Allocate new segment. If the interface is SG,
- * allocate skb fitting to single page.
- */
+ /* Allocate new segment with a single page */
if (!sk_stream_memory_free(sk))
goto wait_for_sndbuf;
- skb = sk_stream_alloc_skb(sk,
- select_size(sk, sg),
+ skb = sk_stream_alloc_skb(sk, 0,
sk->sk_allocation);
if (!skb)
goto wait_for_memory;
@@ -981,84 +958,77 @@ new_segment:
if (copy > seglen)
copy = seglen;
- /* Where to copy to? */
- if (skb_tailroom(skb) > 0) {
- /* We have some space in skb head. Superb! */
- if (copy > skb_tailroom(skb))
- copy = skb_tailroom(skb);
- if ((err = skb_add_data(skb, from, copy)) != 0)
- goto do_fault;
- } else {
- int merge = 0;
- int i = skb_shinfo(skb)->nr_frags;
- struct page *page = TCP_PAGE(sk);
- int off = TCP_OFF(sk);
-
- if (skb_can_coalesce(skb, i, page, off) &&
- off != PAGE_SIZE) {
- /* We can extend the last page
- * fragment. */
- merge = 1;
- } else if (i == MAX_SKB_FRAGS || !sg) {
- /* Need to add new fragment and cannot
- * do this because interface is non-SG,
- * or because all the page slots are
- * busy. */
- tcp_mark_push(tp, skb);
- goto new_segment;
- } else if (page) {
- if (off == PAGE_SIZE) {
- put_page(page);
- TCP_PAGE(sk) = page = NULL;
- off = 0;
- }
- } else
+ merge = 0;
+ i = skb_shinfo(skb)->nr_frags;
+ page = TCP_PAGE(sk);
+ off = TCP_OFF(sk);
+
+ if (skb_can_coalesce(skb, i, page, off) &&
+ off != PAGE_SIZE) {
+ /* We can extend the last page
+ * fragment. */
+ merge = 1;
+ } else if (i == MAX_SKB_FRAGS) {
+ /*
+ * Need to add new fragment and cannot
+ * do this because all the page slots are
+ * busy. For the (rare) non-SG devices,
+ * dev_queue_xmit handles this skb.
+ */
+ tcp_mark_push(tp, skb);
+ goto new_segment;
+ } else if (page) {
+ if (off == PAGE_SIZE) {
+ put_page(page);
+ TCP_PAGE(sk) = page = NULL;
off = 0;
+ }
+ } else
+ off = 0;
- if (copy > PAGE_SIZE - off)
- copy = PAGE_SIZE - off;
+ if (copy > PAGE_SIZE - off)
+ copy = PAGE_SIZE - off;
- if (!sk_wmem_schedule(sk, copy))
- goto wait_for_memory;
+ if (!sk_wmem_schedule(sk, copy))
+ goto wait_for_memory;
- if (!page) {
- /* Allocate new cache page. */
- if (!(page = sk_stream_alloc_page(sk)))
- goto wait_for_memory;
- }
+ if (!page) {
+ /* Allocate new cache page. */
+ if (!(page = sk_stream_alloc_page(sk)))
+ goto wait_for_memory;
+ }
- /* Time to copy data. We are close to
- * the end! */
- err = skb_copy_to_page(sk, from, skb, page,
- off, copy);
- if (err) {
- /* If this page was new, give it to the
- * socket so it does not get leaked.
- */
- if (!TCP_PAGE(sk)) {
- TCP_PAGE(sk) = page;
- TCP_OFF(sk) = 0;
- }
- goto do_error;
+ /* Time to copy data. We are close to
+ * the end! */
+ err = skb_copy_to_page(sk, from, skb, page,
+ off, copy);
+ if (err) {
+ /* If this page was new, give it to the
+ * socket so it does not get leaked.
+ */
+ if (!TCP_PAGE(sk)) {
+ TCP_PAGE(sk) = page;
+ TCP_OFF(sk) = 0;
}
+ goto do_error;
+ }
- /* Update the skb. */
- if (merge) {
- skb_shinfo(skb)->frags[i - 1].size +=
- copy;
- } else {
- skb_fill_page_desc(skb, i, page, off, copy);
- if (TCP_PAGE(sk)) {
- get_page(page);
- } else if (off + copy < PAGE_SIZE) {
- get_page(page);
- TCP_PAGE(sk) = page;
- }
+ /* Update the skb. */
+ if (merge) {
+ skb_shinfo(skb)->frags[i - 1].size +=
+ copy;
+ } else {
+ skb_fill_page_desc(skb, i, page, off, copy);
+ if (TCP_PAGE(sk)) {
+ get_page(page);
+ } else if (off + copy < PAGE_SIZE) {
+ get_page(page);
+ TCP_PAGE(sk) = page;
}
-
- TCP_OFF(sk) = off + copy;
}
+ TCP_OFF(sk) = off + copy;
+
if (!copied)
TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;
@@ -1101,16 +1071,6 @@ out:
release_sock(sk);
return copied;
-do_fault:
- if (!skb->len) {
- tcp_unlink_write_queue(skb, sk);
- /* It is the one place in all of TCP, except connection
- * reset, where we can be unlinking the send_head.
- */
- tcp_check_send_head(sk, skb);
- sk_wmem_free_skb(sk, skb);
- }
-
do_error:
if (copied)
goto out;
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 5:33 [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices? Krishna Kumar
@ 2010-01-15 8:36 ` David Miller
2010-01-15 8:50 ` Simon Kagstrom
` (2 more replies)
2010-01-15 18:13 ` Rick Jones
1 sibling, 3 replies; 24+ messages in thread
From: David Miller @ 2010-01-15 8:36 UTC (permalink / raw)
To: krkumar2; +Cc: ilpo.jarvinen, netdev, eric.dumazet
From: Krishna Kumar <krkumar2@in.ibm.com>
Date: Fri, 15 Jan 2010 11:03:52 +0530
> From: Krishna Kumar <krkumar2@in.ibm.com>
>
> Remove inline skb data in tcp_sendmsg(). For the few devices that
> don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> and pass the penalty to those slow devices (the following drivers
> do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> cris/eth_v10.c).
I was really surprised to see r8169.c in that list.
It even has all the code in its ->ndo_start_xmit() method
to build fragments properly and handle segmented SKBs; it
simply doesn't set NETIF_F_SG in dev->features for whatever
reason.
Bonding is on your list, but it does indeed support NETIF_F_SG
as long as all of its slaves do. See bond_compute_features()
and how it uses netdev_increment_features() over the slaves.
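Conceptually, leaving the real mask handling aside, the effect is
something like the hypothetical helper below (illustration only; the
real logic lives in bond_compute_features()):

static unsigned long bond_keeps_sg(struct net_device *slaves[], int n)
{
	unsigned long features = NETIF_F_SG;	/* start optimistic */
	int i;

	for (i = 0; i < n; i++)
		features &= slaves[i]->features; /* AND across all slaves */

	return features & NETIF_F_SG;	/* non-zero iff every slave has SG */
}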
Anyways...
> This patch does not affect devices that support SG but turn off
> via ethtool after register_netdev.
>
> I ran the following test cases with iperf - #threads: 1 4 8 16 32
> 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case runs for
> 1 minute, repeat 5 iterations. Total test run time is 6 hours.
> System is 4-proc Opteron, with a Chelsio 10gbps NIC. Results (BW
> figures are the aggregate across 5 iterations in mbps):
...
> Please review if the idea is acceptable.
>
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
So how bad does it kill performance for a chip that doesn't
support NETIF_F_SG?
That's what people will complain about if this goes in.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 8:36 ` David Miller
@ 2010-01-15 8:50 ` Simon Kagstrom
2010-01-15 8:52 ` David Miller
2010-01-15 9:03 ` Krishna Kumar2
[not found] ` <OF1B0853DD.2824263E-ON652576AC.002FA9D4-652576AC.0030AC47@LocalDomain>
2 siblings, 1 reply; 24+ messages in thread
From: Simon Kagstrom @ 2010-01-15 8:50 UTC (permalink / raw)
To: David Miller; +Cc: krkumar2, ilpo.jarvinen, netdev, eric.dumazet
On Fri, 15 Jan 2010 00:36:36 -0800 (PST)
David Miller <davem@davemloft.net> wrote:
> > Remove inline skb data in tcp_sendmsg(). For the few devices that
> > don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> > and pass the penalty to those slow devices (the following drivers
> > do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> > dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> > r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> > via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> > cris/eth_v10.c).
>
> I was really surprised to see r8169.c in that list.
>
> It even has all the code in it's ->ndo_start_xmit() method
> to build fragments properly and handle segmented SKBs, it
> simply doesn't set NETIF_F_SG in dev->features for whatever
> reason.
The same thing goes for via-velocity.c; SG can be turned on via
ethtool there, though (ethtool_op_set_sg).
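If memory serves, that generic helper is just a feature-bit toggle,
roughly (check net/core/ethtool.c for the exact version):

int ethtool_op_set_sg(struct net_device *dev, u32 data)
{
	if (data)
		dev->features |= NETIF_F_SG;	/* enable scatter-gather */
	else
		dev->features &= ~NETIF_F_SG;	/* disable it */

	return 0;
}

so a driver whose hardware can do SG only needs to wire this up in its
ethtool_ops.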
// Simon
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 8:50 ` Simon Kagstrom
@ 2010-01-15 8:52 ` David Miller
2010-01-15 9:00 ` Simon Kagstrom
0 siblings, 1 reply; 24+ messages in thread
From: David Miller @ 2010-01-15 8:52 UTC (permalink / raw)
To: simon.kagstrom; +Cc: krkumar2, ilpo.jarvinen, netdev, eric.dumazet
From: Simon Kagstrom <simon.kagstrom@netinsight.net>
Date: Fri, 15 Jan 2010 09:50:49 +0100
> On Fri, 15 Jan 2010 00:36:36 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
>
>> > Remove inline skb data in tcp_sendmsg(). For the few devices that
>> > don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
>> > and pass the penalty to those slow devices (the following drivers
>> > do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
>> > dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
>> > r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
>> > via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
>> > cris/eth_v10.c).
>>
>> I was really surprised to see r8169.c in that list.
>>
>> It even has all the code in it's ->ndo_start_xmit() method
>> to build fragments properly and handle segmented SKBs, it
>> simply doesn't set NETIF_F_SG in dev->features for whatever
>> reason.
>
> The same thing goes for via-velocity.c, it's turned on via ethtool
> though (ethtool_op_set_sg).
Indeed, see my reply to Krishna's ethtool_op_set_sg() patch.
I think it's a cruddy way to do things; SG ought to be on by
default always unless it is defective. And if it's defective,
support should be removed entirely.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 8:52 ` David Miller
@ 2010-01-15 9:00 ` Simon Kagstrom
2010-01-15 9:04 ` David Miller
0 siblings, 1 reply; 24+ messages in thread
From: Simon Kagstrom @ 2010-01-15 9:00 UTC (permalink / raw)
To: David Miller; +Cc: krkumar2, ilpo.jarvinen, netdev, eric.dumazet
On Fri, 15 Jan 2010 00:52:24 -0800 (PST)
David Miller <davem@davemloft.net> wrote:
> > The same thing goes for via-velocity.c, it's turned on via ethtool
> > though (ethtool_op_set_sg).
>
> Indeed, see my reply to Krishna's ethtool_op_set_sg() patch.
>
> I think it's a cruddy way to do things, SG ought to be on by
> default always unless it is defective. And if it's defective
> support should be removed entirely.
I kept it off by default since I didn't see any big improvement in my
tests (negative in some, positive in some). But I suppose you're
right.
// Simon
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 9:00 ` Simon Kagstrom
@ 2010-01-15 9:04 ` David Miller
0 siblings, 0 replies; 24+ messages in thread
From: David Miller @ 2010-01-15 9:04 UTC (permalink / raw)
To: simon.kagstrom; +Cc: krkumar2, ilpo.jarvinen, netdev, eric.dumazet
From: Simon Kagstrom <simon.kagstrom@netinsight.net>
Date: Fri, 15 Jan 2010 10:00:11 +0100
> On Fri, 15 Jan 2010 00:52:24 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
>
>> > The same thing goes for via-velocity.c, it's turned on via ethtool
>> > though (ethtool_op_set_sg).
>>
>> Indeed, see my reply to Krishna's ethtool_op_set_sg() patch.
>>
>> I think it's a cruddy way to do things, SG ought to be on by
>> default always unless it is defective. And if it's defective
>> support should be removed entirely.
>
> I kept it off by default since I didn't see any big improvement in my
> tests (negative in some, positive in some). But I suppose you're
> right though.
Well, it has to provide significantly better performance for
sendfile() (especially wrt. cpu utilization) because we avoid the copy
out of the page cache pages entirely.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 8:36 ` David Miller
2010-01-15 8:50 ` Simon Kagstrom
@ 2010-01-15 9:03 ` Krishna Kumar2
[not found] ` <OF1B0853DD.2824263E-ON652576AC.002FA9D4-652576AC.0030AC47@LocalDomain>
2 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-15 9:03 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, ilpo.jarvinen, netdev, netdev-owner
> David Miller <davem@davemloft.net>
>
> Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
>
> From: Krishna Kumar <krkumar2@in.ibm.com>
> Date: Fri, 15 Jan 2010 11:03:52 +0530
>
> > From: Krishna Kumar <krkumar2@in.ibm.com>
> >
> > Remove inline skb data in tcp_sendmsg(). For the few devices that
> > don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> > and pass the penalty to those slow devices (the following drivers
> > do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> > dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> > r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> > via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> > cris/eth_v10.c).
>
> I was really surprised to see r8169.c in that list.
>
> It even has all the code in it's ->ndo_start_xmit() method
> to build fragments properly and handle segmented SKBs, it
> simply doesn't set NETIF_F_SG in dev->features for whatever
> reason.
I didn't notice this driver but had checked via-velocity and
found that it too had support but was not setting this bit.
> Bonding it on your list, but it does indeed support NETIF_F_SG
> as long as all of it's slaves do. See bond_compute_features()
> and how it uses netdev_increment_features() over the slaves.
>
> Anyways...
>
> > This patch does not affect devices that support SG but turn off
> > via ethtool after register_netdev.
> >
> > I ran the following test cases with iperf - #threads: 1 4 8 16 32
> > 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case runs for
> > 1 minute, repeat 5 iterations. Total test run time is 6 hours.
> > System is 4-proc Opteron, with a Chelsio 10gbps NIC. Results (BW
> > figures are the aggregate across 5 iterations in mbps):
> ...
> > Please review if the idea is acceptable.
> >
> > Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
>
> So how bad does it kill performance for a chip that doesn't
> support NETIF_F_SG?
>
> That's what people will complain about if this goes in.
I don't know how much performance will drop for those 15-18 drivers
listed above, since I don't have access to those devices.
Thanks,
- KK
[parent not found: <OF1B0853DD.2824263E-ON652576AC.002FA9D4-652576AC.0030AC47@LocalDomain>]
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
[not found] ` <OF1B0853DD.2824263E-ON652576AC.002FA9D4-652576AC.0030AC47@LocalDomain>
@ 2010-01-15 9:20 ` Krishna Kumar2
2010-01-15 9:18 ` David Miller
0 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-15 9:20 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: David Miller, eric.dumazet, ilpo.jarvinen, netdev
Sorry, I sent off the reply sooner than I wanted to; continued at the
end...
Krishna Kumar2/India/IBM wrote on 01/15/2010 02:33:25 PM:
> Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
>
> > David Miller <davem@davemloft.net>
> >
> > Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
> >
> > From: Krishna Kumar <krkumar2@in.ibm.com>
> > Date: Fri, 15 Jan 2010 11:03:52 +0530
> >
> > > From: Krishna Kumar <krkumar2@in.ibm.com>
> > >
> > > Remove inline skb data in tcp_sendmsg(). For the few devices that
> > > don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> > > and pass the penalty to those slow devices (the following drivers
> > > do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> > > dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> > > r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> > > via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> > > cris/eth_v10.c).
> >
> > I was really surprised to see r8169.c in that list.
> >
> > It even has all the code in it's ->ndo_start_xmit() method
> > to build fragments properly and handle segmented SKBs, it
> > simply doesn't set NETIF_F_SG in dev->features for whatever
> > reason.
> I didn't notice this driver but had checked via-velocity and
> found that it too had support but was not setting this bit.
>
> > Bonding it on your list, but it does indeed support NETIF_F_SG
> > as long as all of it's slaves do. See bond_compute_features()
> > and how it uses netdev_increment_features() over the slaves.
> >
> > Anyways...
> >
> > > This patch does not affect devices that support SG but turn off
> > > via ethtool after register_netdev.
> > >
> > > I ran the following test cases with iperf - #threads: 1 4 8 16 32
> > > 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case runs for
> > > 1 minute, repeat 5 iterations. Total test run time is 6 hours.
> > > System is 4-proc Opteron, with a Chelsio 10gbps NIC. Results (BW
> > > figures are the aggregate across 5 iterations in mbps):
> > ...
> > > Please review if the idea is acceptable.
> > >
> > > Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> >
> > So how bad does it kill performance for a chip that doesn't
> > support NETIF_F_SG?
> >
> > That's what people will complain about if this goes in.
>
> I don't know how much performance will drop for those 15-18 drivers
> listed above, since I don't have access to those devices.
I wonder if there is some other way to test it. I could test it on
the card I have, cxgbe, by turning F_SG off with ethtool, and then
comparing this patch with the existing code (both with F_SG off).
Will that be enough to get an idea, or can I not assume this is
representative of real non-SG drivers? I am sure there is a
degradation, and I mentioned that part as a "penalty" for those
drivers in my patch.
Thanks,
- KK
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 9:20 ` Krishna Kumar2
@ 2010-01-15 9:18 ` David Miller
2010-01-20 12:19 ` Krishna Kumar2
0 siblings, 1 reply; 24+ messages in thread
From: David Miller @ 2010-01-15 9:18 UTC (permalink / raw)
To: krkumar2; +Cc: eric.dumazet, ilpo.jarvinen, netdev
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Fri, 15 Jan 2010 14:50:04 +0530
> I wonder if there is some other way to test it. I could test it on
> the card I have, cxgbe, by ethtool F_SG off, and then testing
> this patch with existing code (both with ethtool F_SG off)? Will
> that be enough to get an idea, or I cannot assume this is
> reasonable for real non-sg drivers? I am sure there is a
> degradation, and mentioned that part as a "penalty" for those
> drivers in my patch.
I think such a test would provide useful data by which to judge this
change.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 9:18 ` David Miller
@ 2010-01-20 12:19 ` Krishna Kumar2
2010-01-21 9:25 ` David Miller
0 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-20 12:19 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, ilpo.jarvinen, netdev
Hi Dave,
> David Miller <davem@davemloft.net> wrote on 01/15/2010 02:48:29 PM:
>
> Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
>
> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Fri, 15 Jan 2010 14:50:04 +0530
>
> > I wonder if there is some other way to test it. I could test it on
> > the card I have, cxgbe, by ethtool F_SG off, and then testing
> > this patch with existing code (both with ethtool F_SG off)? Will
> > that be enough to get an idea, or I cannot assume this is
> > reasonable for real non-sg drivers? I am sure there is a
> > degradation, and mentioned that part as a "penalty" for those
> > drivers in my patch.
>
> I think such a test would provide useful data by which to judge this
> change.
I had to remove the F_SG flag from the cxgb3 driver (using ethtool
didn't show any difference in performance since GSO was enabled
on the device due to register_netdev setting it). Testing shows a
drop of 25% in performance with this patch for a non-SG device;
the extra alloc/memcpy is showing up.
For the SG driver, I get a good performance gain (not anywhere
close to 25%, though). What do you suggest?
Thanks,
- KK
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-20 12:19 ` Krishna Kumar2
@ 2010-01-21 9:25 ` David Miller
2010-01-21 9:41 ` Herbert Xu
0 siblings, 1 reply; 24+ messages in thread
From: David Miller @ 2010-01-21 9:25 UTC (permalink / raw)
To: krkumar2; +Cc: eric.dumazet, ilpo.jarvinen, netdev
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Wed, 20 Jan 2010 17:49:18 +0530
> I had to remove the F_SG flag from cxgb3 driver (using ethtool
> didn't show any difference in performance since GSO was enabled
> on the device due to register_netdev setting it). Testing show a
> drop of 25% in performance with this patch for non-SG device,
> the extra alloc/memcpy is showing up.
>
> For the SG driver, I get a good performace gain (not anywhere
> close to 25% though). What do you suggest?
I don't think we can add your change if it hurts non-SG
devices that much.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-21 9:25 ` David Miller
@ 2010-01-21 9:41 ` Herbert Xu
2010-01-27 7:12 ` Krishna Kumar2
[not found] ` <OF7EA723DA.DC2FF4FC-ON652576B8.002064CB-652576B8.00267739@LocalDomain>
0 siblings, 2 replies; 24+ messages in thread
From: Herbert Xu @ 2010-01-21 9:41 UTC (permalink / raw)
To: David Miller; +Cc: krkumar2, eric.dumazet, ilpo.jarvinen, netdev
David Miller <davem@davemloft.net> wrote:
> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Wed, 20 Jan 2010 17:49:18 +0530
>
>> I had to remove the F_SG flag from cxgb3 driver (using ethtool
>> didn't show any difference in performance since GSO was enabled
>> on the device due to register_netdev setting it). Testing show a
>> drop of 25% in performance with this patch for non-SG device,
>> the extra alloc/memcpy is showing up.
>>
>> For the SG driver, I get a good performace gain (not anywhere
>> close to 25% though). What do you suggest?
>
> I don't think we can add your change if it hurts non-SG
> devices that much.
Wait, we need to be careful when testing this. Non-SG devices
do actually benefit from TSO which they otherwise cannot access.
If you unset the F_SG bit, then that would disable TSO too. So
you need to enable GSO to compensate. So Krishna, did you check
with tcpdump to see if GSO was really enabled with SG off?
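The reason GSO compensates is visible in sk_setup_caps(); paraphrasing
it from memory (so not guaranteed verbatim), the socket is given
SG-equivalent capability whenever GSO is usable:

void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
{
	__sk_dst_set(sk, dst);
	sk->sk_route_caps = dst->dev->features;
	if (sk->sk_route_caps & NETIF_F_GSO)
		sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
	if (sk_can_gso(sk)) {
		if (dst->header_len) {
			sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
		} else {
			/* software GSO implies SG as far as TCP is concerned */
			sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
			sk->sk_gso_max_size = dst->dev->gso_max_size;
		}
	}
}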
IIRC when I did a similar test with e1000 back when I wrote this
the performance of GSO with SG off was pretty much the same as
no GSO with SG off.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-21 9:41 ` Herbert Xu
@ 2010-01-27 7:12 ` Krishna Kumar2
2010-01-29 9:06 ` Herbert Xu
[not found] ` <OF7EA723DA.DC2FF4FC-ON652576B8.002064CB-652576B8.00267739@LocalDomain>
1 sibling, 1 reply; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-27 7:12 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, eric.dumazet, ilpo.jarvinen, netdev
Hi Herbert,
> Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/21/2010 03:11 PM
Sorry for the late response.
> >> I had to remove the F_SG flag from cxgb3 driver (using ethtool
> >> didn't show any difference in performance since GSO was enabled
> >> on the device due to register_netdev setting it). Testing show a
> >> drop of 25% in performance with this patch for non-SG device,
> >> the extra alloc/memcpy is showing up.
> >>
> >> For the SG driver, I get a good performace gain (not anywhere
> >> close to 25% though). What do you suggest?
> >
> > I don't think we can add your change if it hurts non-SG
> > devices that much.
>
> Wait, we need to be careful when testing this. Non-SG devices
> do actually benefit from TSO which they otherwise cannot access.
>
> If you unset the F_SG bit, then that would disable TSO too. So
> you need to enable GSO to compensate. So Krishna, did you check
> with tcpdump to see if GSO was really enabled with SG off?
OK, I unset F_SG and set F_GSO (in the driver). With this, tcpdump shows
GSO is enabled - the TCP packet size builds up to 65160 bytes.
I ran 5 serial netperfs with 16K and another 5 serial netperfs
with 64K I/O sizes, and the aggregate result is:
0. Driver unsets F_SG but sets F_GSO:
Original code with 16K: 19471.65
New code with 16K: 19409.70
Original code with 64K: 21357.23
New code with 64K: 22050.42
To recap the other tests I did today:
1. Driver unsets F_SG, and with GSO off
Original code with 16K: 10123.56
New code with 16K: 7111.12
Original code with 64K: 11568.99
New code with 64K: 7611.37
2. Driver unsets F_SG and uses ethtool to set GSO:
Original code with 16K: 18864.38
New code with 16K: 18465.54
Original code with 64K: 21005.43
New code with 64K: 22529.24
Thanks,
- KK
> IIRC when I did a similar test with e1000 back when I wrote this
> the performance of GSO with SG off was pretty much the same as
> no GSO with SG off.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-27 7:12 ` Krishna Kumar2
@ 2010-01-29 9:06 ` Herbert Xu
2010-01-29 11:15 ` Krishna Kumar2
0 siblings, 1 reply; 24+ messages in thread
From: Herbert Xu @ 2010-01-29 9:06 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: David Miller, eric.dumazet, ilpo.jarvinen, netdev
On Wed, Jan 27, 2010 at 12:42:12PM +0530, Krishna Kumar2 wrote:
>
> OK, I unset F_SG and set F_GSO (in driver). With this, tcpdump shows
> GSO is enabled - the tcp packet sizes builds up to 65160 bytes.
>
> I ran 5 serial netperf's with 16K and another 5 serial netperfs
> with 64K I/O sizes, and the aggregate result is:
>
> 0. Driver unsets F_SG but sets F_GSO:
> Original code with 16K: 19471.65
> New code with 16K: 19409.70
> Original code with 64K: 21357.23
> New code with 64K: 22050.42
OK this is more in line with what I was expecting, namely that
enabling GSO is actually beneficial even without SG.
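(The win comes from traversing the stack and qdisc once per ~64K
super-packet and only segmenting at the very end. A rough sketch, with
an invented wrapper name and loosely following the dev_hard_start_xmit()
GSO path rather than quoting it:)

static int xmit_gso(struct sk_buff *skb, struct net_device *dev)
{
	struct sk_buff *segs, *nskb;

	/* Software segmentation just before the driver; for a non-SG
	 * device the segments also come out linear. */
	segs = skb_gso_segment(skb, dev->features);
	if (IS_ERR(segs))
		return PTR_ERR(segs);

	while ((nskb = segs) != NULL) {
		segs = nskb->next;
		nskb->next = NULL;
		dev->netdev_ops->ndo_start_xmit(nskb, dev); /* errors elided */
	}
	consume_skb(skb);	/* the original super-skb is no longer needed */
	return 0;
}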
It would be good to get the CPU utilisation figures so we can
see the complete picture.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-29 9:06 ` Herbert Xu
@ 2010-01-29 11:15 ` Krishna Kumar2
2010-01-29 11:33 ` Herbert Xu
2010-01-29 19:56 ` Rick Jones
0 siblings, 2 replies; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-29 11:15 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, eric.dumazet, ilpo.jarvinen, netdev
> Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/29/2010 02:36:25 PM:
>
> > I ran 5 serial netperf's with 16K and another 5 serial netperfs
> > with 64K I/O sizes, and the aggregate result is:
> >
> > 0. Driver unsets F_SG but sets F_GSO:
> > Original code with 16K: 19471.65
> > New code with 16K: 19409.70
> > Original code with 64K: 21357.23
> > New code with 64K: 22050.42
>
> OK this is more in line with what I was expecting, namely that
> enabling GSO is actually beneficial even without SG.
>
> It would be good to get the CPU utilisation figures so we can
> see the complete picture.
Same 5 runs of single netperf's:
0. Driver unsets F_SG but sets F_GSO:
Org (16K): BW: 18180.71 SD: 13.485
New (16K): BW: 18113.15 SD: 13.551
Org (64K): BW: 21980.28 SD: 10.306
New (64K): BW: 21386.59 SD: 10.447
1. Driver unsets F_SG, and with GSO off
Org (16K): BW: 10894.62 SD: 26.591
New (16K): BW: 7262.10 SD: 35.340
Org (64K): BW: 12396.41 SD: 23.357
New (64K): BW: 7853.02 SD: 32.405
2. Driver unsets F_SG and uses ethtool to set GSO:
Org (16K): BW: 18094.11 SD: 13.603
New (16K): BW: 17952.38 SD: 13.743
Org (64K): BW: 21540.78 SD: 10.771
New (64K): BW: 21818.35 SD: 10.598
> > I should have mentioned this too - if I unset F_SG in the
> > cxgb3 driver and nothing else, ethtool -k still shows GSO
> > is set, and tcpdump shows max packet size is 1448. If I
> > additionally set GSO in driver, then ethtool still has the
> > same output, but tcpdump shows max packet size of 65160.
>
> This sounds like a bug.
Yes, an ethtool bug (version 6). In test case #1 above, I
wrote that GSO is off, but ethtool "thinks" it is on
(after a modprobe -r cxgb3; modprobe cxgb3). So for test #2,
I simply ran "ethtool ... gso on", and GSO was then really on
in the kernel, explaining the better results.
thanks,
- KK
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-29 11:15 ` Krishna Kumar2
@ 2010-01-29 11:33 ` Herbert Xu
2010-01-29 11:50 ` Krishna Kumar2
2010-01-29 20:02 ` Rick Jones
2010-01-29 19:56 ` Rick Jones
1 sibling, 2 replies; 24+ messages in thread
From: Herbert Xu @ 2010-01-29 11:33 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: David Miller, eric.dumazet, ilpo.jarvinen, netdev
On Fri, Jan 29, 2010 at 04:45:01PM +0530, Krishna Kumar2 wrote:
>
> Same 5 runs of single netperf's:
>
> 0. Driver unsets F_SG but sets F_GSO:
> Org (16K): BW: 18180.71 SD: 13.485
> New (16K): BW: 18113.15 SD: 13.551
> Org (64K): BW: 21980.28 SD: 10.306
> New (64K): BW: 21386.59 SD: 10.447
>
> 1. Driver unsets F_SG, and with GSO off
> Org (16K): BW: 10894.62 SD: 26.591
> New (16K): BW: 7262.10 SD: 35.340
> Org (64K): BW: 12396.41 SD: 23.357
> New (64K): BW: 7853.02 SD: 32.405
>
>
> 2. Driver unsets F_SG and uses ethtool to set GSO:
> Org (16K): BW: 18094.11 SD: 13.603
> New (16K): BW: 17952.38 SD: 13.743
> Org (64K): BW: 21540.78 SD: 10.771
> New (64K): BW: 21818.35 SD: 10.598
Hmm, any idea what is causing case 0 to be different from case 2?
In particular, the 64K performance in case 0 appears to be a
regression but in case 2 it's showing up as an improvement.
AFAICS these two cases should produce identical results, or is
this just jitter across tests?
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-29 11:33 ` Herbert Xu
@ 2010-01-29 11:50 ` Krishna Kumar2
2010-01-29 20:02 ` Rick Jones
1 sibling, 0 replies; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-29 11:50 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, eric.dumazet, ilpo.jarvinen, netdev
Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/29/2010 05:03:46 PM:
>
> > Same 5 runs of single netperf's:
> >
> > 0. Driver unsets F_SG but sets F_GSO:
> > Org (16K): BW: 18180.71 SD: 13.485
> > New (16K): BW: 18113.15 SD: 13.551
> > Org (64K): BW: 21980.28 SD: 10.306
> > New (64K): BW: 21386.59 SD: 10.447
> >
> > 1. Driver unsets F_SG, and with GSO off
> > Org (16K): BW: 10894.62 SD: 26.591
> > New (16K): BW: 7262.10 SD: 35.340
> > Org (64K): BW: 12396.41 SD: 23.357
> > New (64K): BW: 7853.02 SD: 32.405
> >
> >
> > 2. Driver unsets F_SG and uses ethtool to set GSO:
> > Org (16K): BW: 18094.11 SD: 13.603
> > New (16K): BW: 17952.38 SD: 13.743
> > Org (64K): BW: 21540.78 SD: 10.771
> > New (64K): BW: 21818.35 SD: 10.598
>
> Hmm, any idea what is causing case 0 to be different from case 2?
> In particular, the 64K performance in case 0 appears to be a
> regression but in case 2 it's showing up as an improvement.
>
> AFAICS these two cases should produce identical results, or is
> this just jitter across tests?
You are right about the jitter. I have run this many times; most
of the time #0 and #2 are almost identical, but sometimes they vary
a bit.
Also about my earlier ethtool comment:
> Yes, an ethtool bug (version 6). The test case #1 above, I
> have written that GSO is off but ethtool "thinks" it is on
> (after a modprobe -r cxgb3; modprobe cxgb3). So for test #2,
> I simply run "ethtool ... gso on", and GSO is now really on
> in the kernel, explaining the better results.
Hmmm, I had a bad ethtool, it seems. I built the latest one to
debug this problem, and it shows the settings correctly.
thanks,
- KK
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-29 11:33 ` Herbert Xu
2010-01-29 11:50 ` Krishna Kumar2
@ 2010-01-29 20:02 ` Rick Jones
1 sibling, 0 replies; 24+ messages in thread
From: Rick Jones @ 2010-01-29 20:02 UTC (permalink / raw)
To: Herbert Xu
Cc: Krishna Kumar2, David Miller, eric.dumazet, ilpo.jarvinen, netdev
Herbert Xu wrote:
> On Fri, Jan 29, 2010 at 04:45:01PM +0530, Krishna Kumar2 wrote:
>
>>Same 5 runs of single netperf's:
>>
>>0. Driver unsets F_SG but sets F_GSO:
>> Org (16K): BW: 18180.71 SD: 13.485
>> New (16K): BW: 18113.15 SD: 13.551
>> Org (64K): BW: 21980.28 SD: 10.306
>> New (64K): BW: 21386.59 SD: 10.447
>>
>>1. Driver unsets F_SG, and with GSO off
>> Org (16K): BW: 10894.62 SD: 26.591
>> New (16K): BW: 7262.10 SD: 35.340
>> Org (64K): BW: 12396.41 SD: 23.357
>> New (64K): BW: 7853.02 SD: 32.405
>>
>>
>>2. Driver unsets F_SG and uses ethtool to set GSO:
>> Org (16K): BW: 18094.11 SD: 13.603
>> New (16K): BW: 17952.38 SD: 13.743
>> Org (64K): BW: 21540.78 SD: 10.771
>> New (64K): BW: 21818.35 SD: 10.598
>
>
> Hmm, any idea what is causing case 0 to be different from case 2?
> In particular, the 64K performance in case 0 appears to be a
> regression but in case 2 it's showing up as an improvement.
>
> AFAICS these two cases should produce identical results, or is
> this just jitter across tests?
To get some idea of run-to-run variation, when one does not want to run
multiple explicit netperf commands and do the statistical work
afterwards, one can add global command line arguments to netperf:
netperf ... -i 30,3 -I 99,<width> ...
which tells netperf to run at least 3 iterations (that is the minimum
minimum netperf will do) and no more than 30 iterations (that is the
maximum maximum netperf will do), attempting to be 99% confident that
the mean for throughput (and the CPU utilization if -c and/or -C are
present and a global -r is not) is within +/- width/2%. For example:
netperf -H remote -i 30,3 -I 99,0.5 -c -C
will attempt to be 99% certain that the means it reports for throughput
and for local and remote CPU utilization are within +/- 0.25% of the
actual means. If, after 30 iterations, it has not achieved that
confidence, it will emit warnings giving the widths of the confidence
intervals it did achieve.
rick jones
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-29 11:15 ` Krishna Kumar2
2010-01-29 11:33 ` Herbert Xu
@ 2010-01-29 19:56 ` Rick Jones
1 sibling, 0 replies; 24+ messages in thread
From: Rick Jones @ 2010-01-29 19:56 UTC (permalink / raw)
To: Krishna Kumar2
Cc: Herbert Xu, David Miller, eric.dumazet, ilpo.jarvinen, netdev
Krishna Kumar2 wrote:
>>Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/29/2010 02:36:25 PM:
>>
>>
>>>I ran 5 serial netperf's with 16K and another 5 serial netperfs
>>>with 64K I/O sizes, and the aggregate result is:
>>>
>>>0. Driver unsets F_SG but sets F_GSO:
>>> Original code with 16K: 19471.65
>>> New code with 16K: 19409.70
>>> Original code with 64K: 21357.23
>>> New code with 64K: 22050.42
>>
>>OK this is more in line with what I was expecting, namely that
>>enabling GSO is actually beneficial even without SG.
>>
>>It would be good to get the CPU utilisation figures so we can
>>see the complete picture.
>
>
> Same 5 runs of single netperf's:
>
> 0. Driver unsets F_SG but sets F_GSO:
> Org (16K): BW: 18180.71 SD: 13.485
> New (16K): BW: 18113.15 SD: 13.551
> Org (64K): BW: 21980.28 SD: 10.306
> New (64K): BW: 21386.59 SD: 10.447
>
> 1. Driver unsets F_SG, and with GSO off
> Org (16K): BW: 10894.62 SD: 26.591
> New (16K): BW: 7262.10 SD: 35.340
> Org (64K): BW: 12396.41 SD: 23.357
> New (64K): BW: 7853.02 SD: 32.405
>
>
> 2. Driver unsets F_SG and uses ethtool to set GSO:
> Org (16K): BW: 18094.11 SD: 13.603
> New (16K): BW: 17952.38 SD: 13.743
> Org (64K): BW: 21540.78 SD: 10.771
> New (64K): BW: 21818.35 SD: 10.598
Just a slight change in service demand there... For those unfamiliar,
service demand in netperf is the microseconds of non-idle CPU time per
KB of data transferred. Smaller is better.
happy benchmarking,
rick jones
[parent not found: <OF7EA723DA.DC2FF4FC-ON652576B8.002064CB-652576B8.00267739@LocalDomain>]
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
[not found] ` <OF7EA723DA.DC2FF4FC-ON652576B8.002064CB-652576B8.00267739@LocalDomain>
@ 2010-01-27 9:42 ` Krishna Kumar2
2010-01-29 9:07 ` Herbert Xu
0 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-27 9:42 UTC (permalink / raw)
To: Krishna Kumar2
Cc: David Miller, eric.dumazet, Herbert Xu, ilpo.jarvinen, netdev
> Krishna Kumar2/India/IBM wrote on 01/27/2010 12:42 PM
>
> > Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/21/2010 03:11 PM
>
> Sorry for the late response.
>
> > >> I had to remove the F_SG flag from cxgb3 driver (using ethtool
> > >> didn't show any difference in performance since GSO was enabled
> > >> on the device due to register_netdev setting it). Testing show a
> > >> drop of 25% in performance with this patch for non-SG device,
> > >> the extra alloc/memcpy is showing up.
> > >>
> > >> For the SG driver, I get a good performace gain (not anywhere
> > >> close to 25% though). What do you suggest?
> > >
> > > I don't think we can add your change if it hurts non-SG
> > > devices that much.
> >
> > Wait, we need to be careful when testing this. Non-SG devices
> > do actually benefit from TSO which they otherwise cannot access.
> >
> > If you unset the F_SG bit, then that would disable TSO too. So
> > you need to enable GSO to compensate. So Krishna, did you check
> > with tcpdump to see if GSO was really enabled with SG off?
>
> OK, I unset F_SG and set F_GSO (in driver). With this, tcpdump shows
> GSO is enabled - the tcp packet sizes builds up to 65160 bytes.
I should have mentioned this too - if I unset F_SG in the
cxgb3 driver and nothing else, ethtool -k still shows GSO
is set, and tcpdump shows max packet size is 1448. If I
additionally set GSO in driver, then ethtool still has the
same output, but tcpdump shows max packet size of 65160.
Thanks,
- KK
> I ran 5 serial netperf's with 16K and another 5 serial netperfs
> with 64K I/O sizes, and the aggregate result is:
>
> 0. Driver unsets F_SG but sets F_GSO:
> Original code with 16K: 19471.65
> New code with 16K: 19409.70
> Original code with 64K: 21357.23
> New code with 64K: 22050.42
>
> To recap the other tests I did today:
>
> 1. Driver unsets F_SG, and with GSO off
> Original code with 16K: 10123.56
> New code with 16K: 7111.12
> Original code with 64K: 11568.99
> New code with 64K: 7611.37
>
> 2. Driver unsets F_SG and uses ethtool to set GSO:
> Original code with 16K: 18864.38
> New code with 16K: 18465.54
> Original code with 64K: 21005.43
> New code with 64K: 22529.24
>
> Thanks,
>
> - KK
>
> > IIRC when I did a similar test with e1000 back when I wrote this
> > the performance of GSO with SG off was pretty much the same as
> > no GSO with SG off.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-27 9:42 ` Krishna Kumar2
@ 2010-01-29 9:07 ` Herbert Xu
0 siblings, 0 replies; 24+ messages in thread
From: Herbert Xu @ 2010-01-29 9:07 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: David Miller, eric.dumazet, ilpo.jarvinen, netdev
On Wed, Jan 27, 2010 at 03:12:48PM +0530, Krishna Kumar2 wrote:
>
> I should have mentioned this too - if I unset F_SG in the
> cxgb3 driver and nothing else, ethtool -k still shows GSO
> is set, and tcpdump shows max packet size is 1448. If I
> additionally set GSO in driver, then ethtool still has the
> same output, but tcpdump shows max packet size of 65160.
This sounds like a bug.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 5:33 [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices? Krishna Kumar
2010-01-15 8:36 ` David Miller
@ 2010-01-15 18:13 ` Rick Jones
2010-01-16 6:38 ` Krishna Kumar2
1 sibling, 1 reply; 24+ messages in thread
From: Rick Jones @ 2010-01-15 18:13 UTC (permalink / raw)
To: Krishna Kumar; +Cc: davem, ilpo.jarvinen, netdev, eric.dumazet
Krishna Kumar wrote:
> From: Krishna Kumar <krkumar2@in.ibm.com>
>
> Remove inline skb data in tcp_sendmsg(). For the few devices that
> don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> and pass the penalty to those slow devices (the following drivers
> do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> cris/eth_v10.c).
>
> This patch does not affect devices that support SG but turn off
> via ethtool after register_netdev.
>
> I ran the following test cases with iperf - #threads: 1 4 8 16 32
> 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case runs for
> 1 minute, repeat 5 iterations. Total test run time is 6 hours.
> System is 4-proc Opteron, with a Chelsio 10gbps NIC. Results (BW
> figures are the aggregate across 5 iterations in mbps):
>
> -------------------------------------------------------
> #Process I/O Size Org-BW New-BW %-change
> -------------------------------------------------------
> 1 256 2098 2147 2.33
> 1 4K 14057 14269 1.50
> 1 16K 25984 27317 5.13
> 1 64K 25920 27539 6.24
> ...
> 256 256 1947 1955 0.41
> 256 4K 9828 12265 24.79
> 256 16K 25087 24977 -0.43
> 256 64K 26715 27997 4.79
> -------------------------------------------------------
> Total: - 600071 634906 5.80
> -------------------------------------------------------
Does bandwidth alone convey the magnitude of the change? I would think that
would only be the case if the CPU(s) were 100% utilized, and perhaps not even
completely then. At the risk of a shameless plug, it's not for nothing that
netperf reports service demand :)
I would think that change in service demand (CPU per unit of work) would be
something one wants to see.
Also, the world does not run on bandwidth alone, so small packet performance and
any delta there would be good to have.
Multiple process tests may not be as easy in netperf as they are in iperf, but under:
ftp://ftp.netperf.org/netperf/misc
I have a single-stream test script I use called runemomni.sh and an example of
its output, as well as an aggregate script I use called runemomniagg2.sh - I'll
post an example of its output there as soon as I finish some runs. The script
presumes one has ./configure'd netperf:
./configure --enable-burst --enable-omni ...
The netperf omni tests still ass-u-me that the CPU util each measures is all his
own, which means the service demands from aggregate tests require some
post-processing fixup.
http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance
happy benchmarking,
rick jones
FWIW, service demand and pps performance may be even more important for non-SG
devices because they may be slow 1 Gig devices and still hit link-rate on a bulk
throughput test even with a non-trivial increase in CPU util. However, a
non-trivial hit in CPU util may rather change the pps performance.
PPS - there is a *lot* of output in those omni test results - best viewed with a
spreadsheet program.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 18:13 ` Rick Jones
@ 2010-01-16 6:38 ` Krishna Kumar2
2010-01-19 17:41 ` Rick Jones
0 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-16 6:38 UTC (permalink / raw)
To: Rick Jones; +Cc: davem, eric.dumazet, ilpo.jarvinen, netdev
Rick Jones <rick.jones2@hp.com> wrote on 01/15/2010 11:43:06 PM:
>
> > Remove inline skb data in tcp_sendmsg(). For the few devices that
> > don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> > and pass the penalty to those slow devices (the following drivers
> > do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> > dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> > r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> > via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> > cris/eth_v10.c).
> >
> > This patch does not affect devices that support SG but turn off
> > via ethtool after register_netdev.
> >
> > I ran the following test cases with iperf - #threads: 1 4 8 16 32
> > 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case runs for
> > 1 minute, repeat 5 iterations. Total test run time is 6 hours.
> > System is 4-proc Opteron, with a Chelsio 10gbps NIC. Results (BW
> > figures are the aggregate across 5 iterations in mbps):
> >
> > -------------------------------------------------------
> > #Process I/O Size Org-BW New-BW %-change
> > -------------------------------------------------------
> > 1 256 2098 2147 2.33
> > 1 4K 14057 14269 1.50
> > 1 16K 25984 27317 5.13
> > 1 64K 25920 27539 6.24
> > ...
> > 256 256 1947 1955 0.41
> > 256 4K 9828 12265 24.79
> > 256 16K 25087 24977 -0.43
> > 256 64K 26715 27997 4.79
> > -------------------------------------------------------
> > Total: - 600071 634906 5.80
> > -------------------------------------------------------
>
> Does bandwidth alone convey the magnitude of the change? I would think
> that would only be the case if the CPU(s) were 100% utilized, and
> perhaps not even completely then. At the risk of a shameless plug, it's
> not for nothing that netperf reports service demand :)
>
> I would think that change in service demand (CPU per unit of work) would
> be something one wants to see.
>
> Also, the world does not run on bandwidth alone, so small packet
> performance and
> any delta there would be good to have.
>
> Multiple process tests may not be as easy in netperf as it is in
> iperf, but under:
>
> ftp://ftp.netperf.org/netperf/misc
>
> I have a single-stream test script I use called runemomni.sh and an
> example of
> its output, as well as an aggregate script I use called
> runemomniagg2.sh - I'll
> post an example of its output there as soon as I finish some runs.
I usually run netperf for a smaller number of threads and aggregate
the output using some scripts. I will try what you suggested above,
and see if I can get consistent results for higher numbers of
processes. Thanks for the links.
- KK
> The script
> presumes one has ./configure'd netperf:
>
> ./configure --enable-burst --enable-omni ...
>
> The netperf omni tests still ass-u-me that the CPU util each
> measures is all his
> own, which means the service demands from aggrgate tests require some
> post-processing fixup.
>
> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-
> Netperf-to-Measure-Aggregate-Performance
>
> happy benchmarking,
>
> rick jones
>
> FWIW, service demand and pps performance may be even more important
> for non-SG
> devices because they may be slow 1 Gig devices and still hit link-
> rate on a bulk
> throughput test even with a non-trivial increase in CPU util. However, a
> non-trivial hit in CPU util may rather change the pps performance.
>
> PPS - there is a *lot* of output in those omni test results - best
> viewed with a
> spreadsheet program.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-16 6:38 ` Krishna Kumar2
@ 2010-01-19 17:41 ` Rick Jones
0 siblings, 0 replies; 24+ messages in thread
From: Rick Jones @ 2010-01-19 17:41 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: davem, eric.dumazet, ilpo.jarvinen, netdev
Krishna Kumar2 wrote:
> Rick Jones <rick.jones2@hp.com> wrote on 01/15/2010 11:43:06 PM:
>>Multiple process tests may not be as easy in netperf as it is in
>>iperf, but under:
>>
>>ftp://ftp.netperf.org/netperf/misc
>>
>> I have a single-stream test script I use called runemomni.sh and an
>> example of its output, as well as an aggregate script I use called
>> runemomniagg2.sh - I'll post an example of its output there as
>> soon as I finish some runs.
>
>
> I usually run netperf for smaller number of threads and aggregate
> the output using some scripts. I will try what you suggested above,
> and see if I can get consistent results for higher number of
> processes. Thanks for the links.
You're welcome - the output of the aggregate script is up there too now.
Generally I run three systems for the aggregate tests and then include
the TCP_MAERTs stuff, but in this case, since I had only two otherwise
identical systems, TCP_MAERTs would have been redundant.
rick jones
Thread overview: 24+ messages:
2010-01-15 5:33 [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices? Krishna Kumar
2010-01-15 8:36 ` David Miller
2010-01-15 8:50 ` Simon Kagstrom
2010-01-15 8:52 ` David Miller
2010-01-15 9:00 ` Simon Kagstrom
2010-01-15 9:04 ` David Miller
2010-01-15 9:03 ` Krishna Kumar2
[not found] ` <OF1B0853DD.2824263E-ON652576AC.002FA9D4-652576AC.0030AC47@LocalDomain>
2010-01-15 9:20 ` Krishna Kumar2
2010-01-15 9:18 ` David Miller
2010-01-20 12:19 ` Krishna Kumar2
2010-01-21 9:25 ` David Miller
2010-01-21 9:41 ` Herbert Xu
2010-01-27 7:12 ` Krishna Kumar2
2010-01-29 9:06 ` Herbert Xu
2010-01-29 11:15 ` Krishna Kumar2
2010-01-29 11:33 ` Herbert Xu
2010-01-29 11:50 ` Krishna Kumar2
2010-01-29 20:02 ` Rick Jones
2010-01-29 19:56 ` Rick Jones
[not found] ` <OF7EA723DA.DC2FF4FC-ON652576B8.002064CB-652576B8.00267739@LocalDomain>
2010-01-27 9:42 ` Krishna Kumar2
2010-01-29 9:07 ` Herbert Xu
2010-01-15 18:13 ` Rick Jones
2010-01-16 6:38 ` Krishna Kumar2
2010-01-19 17:41 ` Rick Jones