* [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
@ 2010-01-15 5:33 Krishna Kumar
2010-01-15 8:36 ` David Miller
2010-01-15 18:13 ` Rick Jones
0 siblings, 2 replies; 24+ messages in thread
From: Krishna Kumar @ 2010-01-15 5:33 UTC (permalink / raw)
To: davem; +Cc: ilpo.jarvinen, netdev, eric.dumazet, Krishna Kumar
From: Krishna Kumar <krkumar2@in.ibm.com>
Remove the inline skb data handling in tcp_sendmsg(). For the few
devices that don't support NETIF_F_SG, dev_queue_xmit will call
skb_linearize, passing the penalty on to those slow devices (the
following drivers do not support NETIF_F_SG: 8139cp.c, amd8111e.c,
dl2k.c, dm9000.c, dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c,
ps3_gelic_net.c, r8169.c, rionet.c, spider_net.c, tsi108_eth.c,
veth.c, via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
cris/eth_v10.c).
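For reference, the fallback this relies on is the non-SG linearize step
in the core xmit path. A rough sketch (paraphrased, not verbatim kernel
code, and the wrapper name below is only for illustration):

/*
 * If the egress device cannot do scatter-gather, flatten the paged
 * skb into a single linear buffer before the driver sees it. The
 * highmem/DMA checks of the real code are omitted here.
 */
static int linearize_if_non_sg(struct sk_buff *skb, struct net_device *dev)
{
	if (skb_shinfo(skb)->nr_frags &&
	    !(dev->features & NETIF_F_SG) &&
	    __skb_linearize(skb))
		return -ENOMEM;		/* caller frees and drops the skb */

	return 0;
}

So a non-SG driver still gets a linear skb, just built later and at its
own cost.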
This patch does not affect devices that support SG but turn it off
via ethtool after register_netdev.
I ran the following test cases with iperf - #threads: 1 4 8 16 32
64 128 192 256; I/O sizes: 256 4K 16K 64K. Each test case runs for
1 minute and is repeated for 5 iterations. The total test run time
is 6 hours. The system is a 4-processor Opteron with a Chelsio
10 Gbps NIC. Results (BW figures are the aggregate across the 5
iterations, in Mbps):
-------------------------------------------------------
#Process I/O Size Org-BW New-BW %-change
-------------------------------------------------------
1 256 2098 2147 2.33
1 4K 14057 14269 1.50
1 16K 25984 27317 5.13
1 64K 25920 27539 6.24
4 256 1895 2117 11.71
4 4K 10699 15649 46.26
4 16K 25675 26723 4.08
4 64K 27026 27545 1.92
8 256 1816 1966 8.25
8 4K 9511 12754 34.09
8 16K 25177 25281 0.41
8 64K 26288 26878 2.24
16 256 1893 2142 13.15
16 4K 16370 15805 -3.45
16 16K 25986 25890 -0.36
16 64K 26925 28036 4.12
32 256 2061 2038 -1.11
32 4K 10765 12253 13.82
32 16K 26802 28613 6.75
32 64K 28433 27739 -2.44
64 256 1885 2088 10.76
64 4K 10534 15778 49.78
64 16K 26745 28130 5.17
64 64K 29153 28708 -1.52
128 256 1884 2023 7.37
128 4K 9446 13732 45.37
128 16K 27013 27086 0.27
128 64K 26151 27933 6.81
192 256 2000 2094 4.70
192 4K 14260 13479 -5.47
192 16K 25545 27478 7.56
192 64K 26497 26482 -0.05
256 256 1947 1955 0.41
256 4K 9828 12265 24.79
256 16K 25087 24977 -0.43
256 64K 26715 27997 4.79
-------------------------------------------------------
Total: - 600071 634906 5.80
-------------------------------------------------------
Please review if the idea is acceptable.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
net/ipv4/tcp.c | 172 ++++++++++++++++++-----------------------------
1 file changed, 66 insertions(+), 106 deletions(-)
diff -ruNp org/net/ipv4/tcp.c new/net/ipv4/tcp.c
--- org/net/ipv4/tcp.c 2010-01-13 10:43:27.000000000 +0530
+++ new/net/ipv4/tcp.c 2010-01-13 10:43:37.000000000 +0530
@@ -876,26 +876,6 @@ ssize_t tcp_sendpage(struct socket *sock
#define TCP_PAGE(sk) (sk->sk_sndmsg_page)
#define TCP_OFF(sk) (sk->sk_sndmsg_off)
-static inline int select_size(struct sock *sk, int sg)
-{
- struct tcp_sock *tp = tcp_sk(sk);
- int tmp = tp->mss_cache;
-
- if (sg) {
- if (sk_can_gso(sk))
- tmp = 0;
- else {
- int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
-
- if (tmp >= pgbreak &&
- tmp <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
- tmp = pgbreak;
- }
- }
-
- return tmp;
-}
-
int tcp_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
size_t size)
{
@@ -905,7 +885,7 @@ int tcp_sendmsg(struct kiocb *iocb, stru
struct sk_buff *skb;
int iovlen, flags;
int mss_now, size_goal;
- int sg, err, copied;
+ int err, copied;
long timeo;
lock_sock(sk);
@@ -933,8 +913,6 @@ int tcp_sendmsg(struct kiocb *iocb, stru
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
goto out_err;
- sg = sk->sk_route_caps & NETIF_F_SG;
-
while (--iovlen >= 0) {
int seglen = iov->iov_len;
unsigned char __user *from = iov->iov_base;
@@ -944,6 +922,8 @@ int tcp_sendmsg(struct kiocb *iocb, stru
while (seglen > 0) {
int copy = 0;
int max = size_goal;
+ int merge, i, off;
+ struct page *page;
skb = tcp_write_queue_tail(sk);
if (tcp_send_head(sk)) {
@@ -954,14 +934,11 @@ int tcp_sendmsg(struct kiocb *iocb, stru
if (copy <= 0) {
new_segment:
- /* Allocate new segment. If the interface is SG,
- * allocate skb fitting to single page.
- */
+ /* Allocate new segment with a single page */
if (!sk_stream_memory_free(sk))
goto wait_for_sndbuf;
- skb = sk_stream_alloc_skb(sk,
- select_size(sk, sg),
+ skb = sk_stream_alloc_skb(sk, 0,
sk->sk_allocation);
if (!skb)
goto wait_for_memory;
@@ -981,84 +958,77 @@ new_segment:
if (copy > seglen)
copy = seglen;
- /* Where to copy to? */
- if (skb_tailroom(skb) > 0) {
- /* We have some space in skb head. Superb! */
- if (copy > skb_tailroom(skb))
- copy = skb_tailroom(skb);
- if ((err = skb_add_data(skb, from, copy)) != 0)
- goto do_fault;
- } else {
- int merge = 0;
- int i = skb_shinfo(skb)->nr_frags;
- struct page *page = TCP_PAGE(sk);
- int off = TCP_OFF(sk);
-
- if (skb_can_coalesce(skb, i, page, off) &&
- off != PAGE_SIZE) {
- /* We can extend the last page
- * fragment. */
- merge = 1;
- } else if (i == MAX_SKB_FRAGS || !sg) {
- /* Need to add new fragment and cannot
- * do this because interface is non-SG,
- * or because all the page slots are
- * busy. */
- tcp_mark_push(tp, skb);
- goto new_segment;
- } else if (page) {
- if (off == PAGE_SIZE) {
- put_page(page);
- TCP_PAGE(sk) = page = NULL;
- off = 0;
- }
- } else
+ merge = 0;
+ i = skb_shinfo(skb)->nr_frags;
+ page = TCP_PAGE(sk);
+ off = TCP_OFF(sk);
+
+ if (skb_can_coalesce(skb, i, page, off) &&
+ off != PAGE_SIZE) {
+ /* We can extend the last page
+ * fragment. */
+ merge = 1;
+ } else if (i == MAX_SKB_FRAGS) {
+ /*
+ * Need to add new fragment and cannot
+ * do this because all the page slots are
+ * busy. For the (rare) non-SG devices,
+ * dev_queue_xmit handles this skb.
+ */
+ tcp_mark_push(tp, skb);
+ goto new_segment;
+ } else if (page) {
+ if (off == PAGE_SIZE) {
+ put_page(page);
+ TCP_PAGE(sk) = page = NULL;
off = 0;
+ }
+ } else
+ off = 0;
- if (copy > PAGE_SIZE - off)
- copy = PAGE_SIZE - off;
+ if (copy > PAGE_SIZE - off)
+ copy = PAGE_SIZE - off;
- if (!sk_wmem_schedule(sk, copy))
- goto wait_for_memory;
+ if (!sk_wmem_schedule(sk, copy))
+ goto wait_for_memory;
- if (!page) {
- /* Allocate new cache page. */
- if (!(page = sk_stream_alloc_page(sk)))
- goto wait_for_memory;
- }
+ if (!page) {
+ /* Allocate new cache page. */
+ if (!(page = sk_stream_alloc_page(sk)))
+ goto wait_for_memory;
+ }
- /* Time to copy data. We are close to
- * the end! */
- err = skb_copy_to_page(sk, from, skb, page,
- off, copy);
- if (err) {
- /* If this page was new, give it to the
- * socket so it does not get leaked.
- */
- if (!TCP_PAGE(sk)) {
- TCP_PAGE(sk) = page;
- TCP_OFF(sk) = 0;
- }
- goto do_error;
+ /* Time to copy data. We are close to
+ * the end! */
+ err = skb_copy_to_page(sk, from, skb, page,
+ off, copy);
+ if (err) {
+ /* If this page was new, give it to the
+ * socket so it does not get leaked.
+ */
+ if (!TCP_PAGE(sk)) {
+ TCP_PAGE(sk) = page;
+ TCP_OFF(sk) = 0;
}
+ goto do_error;
+ }
- /* Update the skb. */
- if (merge) {
- skb_shinfo(skb)->frags[i - 1].size +=
- copy;
- } else {
- skb_fill_page_desc(skb, i, page, off, copy);
- if (TCP_PAGE(sk)) {
- get_page(page);
- } else if (off + copy < PAGE_SIZE) {
- get_page(page);
- TCP_PAGE(sk) = page;
- }
+ /* Update the skb. */
+ if (merge) {
+ skb_shinfo(skb)->frags[i - 1].size +=
+ copy;
+ } else {
+ skb_fill_page_desc(skb, i, page, off, copy);
+ if (TCP_PAGE(sk)) {
+ get_page(page);
+ } else if (off + copy < PAGE_SIZE) {
+ get_page(page);
+ TCP_PAGE(sk) = page;
}
-
- TCP_OFF(sk) = off + copy;
}
+ TCP_OFF(sk) = off + copy;
+
if (!copied)
TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;
@@ -1101,16 +1071,6 @@ out:
release_sock(sk);
return copied;
-do_fault:
- if (!skb->len) {
- tcp_unlink_write_queue(skb, sk);
- /* It is the one place in all of TCP, except connection
- * reset, where we can be unlinking the send_head.
- */
- tcp_check_send_head(sk, skb);
- sk_wmem_free_skb(sk, skb);
- }
-
do_error:
if (copied)
goto out;
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 5:33 [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices? Krishna Kumar
@ 2010-01-15 8:36 ` David Miller
2010-01-15 8:50 ` Simon Kagstrom
` (2 more replies)
2010-01-15 18:13 ` Rick Jones
1 sibling, 3 replies; 24+ messages in thread
From: David Miller @ 2010-01-15 8:36 UTC (permalink / raw)
To: krkumar2; +Cc: ilpo.jarvinen, netdev, eric.dumazet
From: Krishna Kumar <krkumar2@in.ibm.com>
Date: Fri, 15 Jan 2010 11:03:52 +0530
> From: Krishna Kumar <krkumar2@in.ibm.com>
>
> Remove inline skb data in tcp_sendmsg(). For the few devices that
> don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> and pass the penalty to those slow devices (the following drivers
> do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> cris/eth_v10.c).
I was really surprised to see r8169.c in that list.
It even has all the code in its ->ndo_start_xmit() method
to build fragments properly and handle segmented SKBs; it
simply doesn't set NETIF_F_SG in dev->features for whatever
reason.
Bonding is on your list, but it does indeed support NETIF_F_SG
as long as all of its slaves do. See bond_compute_features()
and how it uses netdev_increment_features() over the slaves.
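Conceptually, leaving the real mask handling aside, the effect is
something like the hypothetical helper below (illustration only; the
real logic lives in bond_compute_features()):

static unsigned long bond_keeps_sg(struct net_device *slaves[], int n)
{
	unsigned long features = NETIF_F_SG;	/* start optimistic */
	int i;

	for (i = 0; i < n; i++)
		features &= slaves[i]->features; /* AND across all slaves */

	return features & NETIF_F_SG;	/* non-zero iff every slave has SG */
}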
Anyways...
> This patch does not affect devices that support SG but turn off
> via ethtool after register_netdev.
>
> I ran the following test cases with iperf - #threads: 1 4 8 16 32
> 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case runs for
> 1 minute, repeat 5 iterations. Total test run time is 6 hours.
> System is 4-proc Opteron, with a Chelsio 10gbps NIC. Results (BW
> figures are the aggregate across 5 iterations in mbps):
...
> Please review if the idea is acceptable.
>
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
So how bad does it kill performance for a chip that doesn't
support NETIF_F_SG?
That's what people will complain about if this goes in.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 8:36 ` David Miller
@ 2010-01-15 8:50 ` Simon Kagstrom
2010-01-15 8:52 ` David Miller
2010-01-15 9:03 ` Krishna Kumar2
[not found] ` <OF1B0853DD.2824263E-ON652576AC.002FA9D4-652576AC.0030AC47@LocalDomain>
2 siblings, 1 reply; 24+ messages in thread
From: Simon Kagstrom @ 2010-01-15 8:50 UTC (permalink / raw)
To: David Miller; +Cc: krkumar2, ilpo.jarvinen, netdev, eric.dumazet
On Fri, 15 Jan 2010 00:36:36 -0800 (PST)
David Miller <davem@davemloft.net> wrote:
> > Remove inline skb data in tcp_sendmsg(). For the few devices that
> > don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> > and pass the penalty to those slow devices (the following drivers
> > do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> > dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> > r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> > via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> > cris/eth_v10.c).
>
> I was really surprised to see r8169.c in that list.
>
> It even has all the code in it's ->ndo_start_xmit() method
> to build fragments properly and handle segmented SKBs, it
> simply doesn't set NETIF_F_SG in dev->features for whatever
> reason.
The same thing goes for via-velocity.c; SG can be turned on via
ethtool there, though (ethtool_op_set_sg).
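If memory serves, that generic helper is just a feature-bit toggle,
roughly (check net/core/ethtool.c for the exact version):

int ethtool_op_set_sg(struct net_device *dev, u32 data)
{
	if (data)
		dev->features |= NETIF_F_SG;	/* enable scatter-gather */
	else
		dev->features &= ~NETIF_F_SG;	/* disable it */

	return 0;
}

so a driver whose hardware can do SG only needs to wire this up in its
ethtool_ops.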
// Simon
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 8:50 ` Simon Kagstrom
@ 2010-01-15 8:52 ` David Miller
2010-01-15 9:00 ` Simon Kagstrom
0 siblings, 1 reply; 24+ messages in thread
From: David Miller @ 2010-01-15 8:52 UTC (permalink / raw)
To: simon.kagstrom; +Cc: krkumar2, ilpo.jarvinen, netdev, eric.dumazet
From: Simon Kagstrom <simon.kagstrom@netinsight.net>
Date: Fri, 15 Jan 2010 09:50:49 +0100
> On Fri, 15 Jan 2010 00:36:36 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
>
>> > Remove inline skb data in tcp_sendmsg(). For the few devices that
>> > don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
>> > and pass the penalty to those slow devices (the following drivers
>> > do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
>> > dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
>> > r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
>> > via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
>> > cris/eth_v10.c).
>>
>> I was really surprised to see r8169.c in that list.
>>
>> It even has all the code in it's ->ndo_start_xmit() method
>> to build fragments properly and handle segmented SKBs, it
>> simply doesn't set NETIF_F_SG in dev->features for whatever
>> reason.
>
> The same thing goes for via-velocity.c, it's turned on via ethtool
> though (ethtool_op_set_sg).
Indeed, see my reply to Krishna's ethtool_op_set_sg() patch.
I think it's a cruddy way to do things; SG ought to be on by
default always unless it is defective. And if it's defective,
support should be removed entirely.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 8:52 ` David Miller
@ 2010-01-15 9:00 ` Simon Kagstrom
2010-01-15 9:04 ` David Miller
0 siblings, 1 reply; 24+ messages in thread
From: Simon Kagstrom @ 2010-01-15 9:00 UTC (permalink / raw)
To: David Miller; +Cc: krkumar2, ilpo.jarvinen, netdev, eric.dumazet
On Fri, 15 Jan 2010 00:52:24 -0800 (PST)
David Miller <davem@davemloft.net> wrote:
> > The same thing goes for via-velocity.c, it's turned on via ethtool
> > though (ethtool_op_set_sg).
>
> Indeed, see my reply to Krishna's ethtool_op_set_sg() patch.
>
> I think it's a cruddy way to do things, SG ought to be on by
> default always unless it is defective. And if it's defective
> support should be removed entirely.
I kept it off by default since I didn't see any big improvement in my
tests (negative in some, positive in some). But I suppose you're
right.
// Simon
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 9:00 ` Simon Kagstrom
@ 2010-01-15 9:04 ` David Miller
0 siblings, 0 replies; 24+ messages in thread
From: David Miller @ 2010-01-15 9:04 UTC (permalink / raw)
To: simon.kagstrom; +Cc: krkumar2, ilpo.jarvinen, netdev, eric.dumazet
From: Simon Kagstrom <simon.kagstrom@netinsight.net>
Date: Fri, 15 Jan 2010 10:00:11 +0100
> On Fri, 15 Jan 2010 00:52:24 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
>
>> > The same thing goes for via-velocity.c, it's turned on via ethtool
>> > though (ethtool_op_set_sg).
>>
>> Indeed, see my reply to Krishna's ethtool_op_set_sg() patch.
>>
>> I think it's a cruddy way to do things, SG ought to be on by
>> default always unless it is defective. And if it's defective
>> support should be removed entirely.
>
> I kept it off by default since I didn't see any big improvement in my
> tests (negative in some, positive in some). But I suppose you're
> right though.
Well, it has to provide significantly better performance for
sendfile() (especially wrt. cpu utilization) because we avoid the copy
out of the page cache pages entirely.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 8:36 ` David Miller
2010-01-15 8:50 ` Simon Kagstrom
@ 2010-01-15 9:03 ` Krishna Kumar2
[not found] ` <OF1B0853DD.2824263E-ON652576AC.002FA9D4-652576AC.0030AC47@LocalDomain>
2 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-15 9:03 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, ilpo.jarvinen, netdev, netdev-owner
> David Miller <davem@davemloft.net>
>
> Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
>
> From: Krishna Kumar <krkumar2@in.ibm.com>
> Date: Fri, 15 Jan 2010 11:03:52 +0530
>
> > From: Krishna Kumar <krkumar2@in.ibm.com>
> >
> > Remove inline skb data in tcp_sendmsg(). For the few devices that
> > don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> > and pass the penalty to those slow devices (the following drivers
> > do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> > dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> > r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> > via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> > cris/eth_v10.c).
>
> I was really surprised to see r8169.c in that list.
>
> It even has all the code in it's ->ndo_start_xmit() method
> to build fragments properly and handle segmented SKBs, it
> simply doesn't set NETIF_F_SG in dev->features for whatever
> reason.
I didn't notice this driver but had checked via-velocity and
found that it too had support but was not setting this bit.
> Bonding it on your list, but it does indeed support NETIF_F_SG
> as long as all of it's slaves do. See bond_compute_features()
> and how it uses netdev_increment_features() over the slaves.
>
> Anyways...
>
> > This patch does not affect devices that support SG but turn off
> > via ethtool after register_netdev.
> >
> > I ran the following test cases with iperf - #threads: 1 4 8 16 32
> > 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case runs for
> > 1 minute, repeat 5 iterations. Total test run time is 6 hours.
> > System is 4-proc Opteron, with a Chelsio 10gbps NIC. Results (BW
> > figures are the aggregate across 5 iterations in mbps):
> ...
> > Please review if the idea is acceptable.
> >
> > Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
>
> So how bad does it kill performance for a chip that doesn't
> support NETIF_F_SG?
>
> That's what people will complain about if this goes in.
I don't know how much performance will drop for those 15-18 drivers
listed above, since I don't have access to those devices.
Thanks,
- KK
[parent not found: <OF1B0853DD.2824263E-ON652576AC.002FA9D4-652576AC.0030AC47@LocalDomain>]
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
[not found] ` <OF1B0853DD.2824263E-ON652576AC.002FA9D4-652576AC.0030AC47@LocalDomain>
@ 2010-01-15 9:20 ` Krishna Kumar2
2010-01-15 9:18 ` David Miller
0 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-15 9:20 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: David Miller, eric.dumazet, ilpo.jarvinen, netdev
Sorry, I sent off the reply sooner than I wanted to; continued at the
end...
Krishna Kumar2/India/IBM wrote on 01/15/2010 02:33:25 PM:
> Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
>
> > David Miller <davem@davemloft.net>
> >
> > Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
> >
> > From: Krishna Kumar <krkumar2@in.ibm.com>
> > Date: Fri, 15 Jan 2010 11:03:52 +0530
> >
> > > From: Krishna Kumar <krkumar2@in.ibm.com>
> > >
> > > Remove inline skb data in tcp_sendmsg(). For the few devices that
> > > don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> > > and pass the penalty to those slow devices (the following drivers
> > > do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> > > dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> > > r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> > > via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> > > cris/eth_v10.c).
> >
> > I was really surprised to see r8169.c in that list.
> >
> > It even has all the code in it's ->ndo_start_xmit() method
> > to build fragments properly and handle segmented SKBs, it
> > simply doesn't set NETIF_F_SG in dev->features for whatever
> > reason.
> I didn't notice this driver but had checked via-velocity and
> found that it too had support but was not setting this bit.
>
> > Bonding it on your list, but it does indeed support NETIF_F_SG
> > as long as all of it's slaves do. See bond_compute_features()
> > and how it uses netdev_increment_features() over the slaves.
> >
> > Anyways...
> >
> > > This patch does not affect devices that support SG but turn off
> > > via ethtool after register_netdev.
> > >
> > > I ran the following test cases with iperf - #threads: 1 4 8 16 32
> > > 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case runs for
> > > 1 minute, repeat 5 iterations. Total test run time is 6 hours.
> > > System is 4-proc Opteron, with a Chelsio 10gbps NIC. Results (BW
> > > figures are the aggregate across 5 iterations in mbps):
> > ...
> > > Please review if the idea is acceptable.
> > >
> > > Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> >
> > So how bad does it kill performance for a chip that doesn't
> > support NETIF_F_SG?
> >
> > That's what people will complain about if this goes in.
>
> I don't know how much performance will drop for those 15-18 drivers
> listed above, since I don't have access to those devices.
I wonder if there is some other way to test it. I could test it on
the card I have, cxgbe, by turning F_SG off with ethtool, and then
comparing this patch with the existing code (both with F_SG off).
Will that be enough to get an idea, or can I not assume this is
representative of real non-SG drivers? I am sure there is a
degradation, and I mentioned that part as a "penalty" for those
drivers in my patch.
Thanks,
- KK
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 9:20 ` Krishna Kumar2
@ 2010-01-15 9:18 ` David Miller
2010-01-20 12:19 ` Krishna Kumar2
0 siblings, 1 reply; 24+ messages in thread
From: David Miller @ 2010-01-15 9:18 UTC (permalink / raw)
To: krkumar2; +Cc: eric.dumazet, ilpo.jarvinen, netdev
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Fri, 15 Jan 2010 14:50:04 +0530
> I wonder if there is some other way to test it. I could test it on
> the card I have, cxgbe, by ethtool F_SG off, and then testing
> this patch with existing code (both with ethtool F_SG off)? Will
> that be enough to get an idea, or I cannot assume this is
> reasonable for real non-sg drivers? I am sure there is a
> degradation, and mentioned that part as a "penalty" for those
> drivers in my patch.
I think such a test would provide useful data by which to judge this
change.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 9:18 ` David Miller
@ 2010-01-20 12:19 ` Krishna Kumar2
2010-01-21 9:25 ` David Miller
0 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-20 12:19 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, ilpo.jarvinen, netdev
Hi Dave,
> David Miller <davem@davemloft.net> wrote on 01/15/2010 02:48:29 PM:
>
> Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
>
> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Fri, 15 Jan 2010 14:50:04 +0530
>
> > I wonder if there is some other way to test it. I could test it on
> > the card I have, cxgbe, by ethtool F_SG off, and then testing
> > this patch with existing code (both with ethtool F_SG off)? Will
> > that be enough to get an idea, or I cannot assume this is
> > reasonable for real non-sg drivers? I am sure there is a
> > degradation, and mentioned that part as a "penalty" for those
> > drivers in my patch.
>
> I think such a test would provide useful data by which to judge this
> change.
I had to remove the F_SG flag from the cxgb3 driver (using ethtool
didn't show any difference in performance since GSO was enabled
on the device due to register_netdev setting it). Testing shows a
drop of 25% in performance with this patch for a non-SG device;
the extra alloc/memcpy is showing up.
For the SG driver, I get a good performance gain (not anywhere
close to 25%, though). What do you suggest?
Thanks,
- KK
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-20 12:19 ` Krishna Kumar2
@ 2010-01-21 9:25 ` David Miller
2010-01-21 9:41 ` Herbert Xu
0 siblings, 1 reply; 24+ messages in thread
From: David Miller @ 2010-01-21 9:25 UTC (permalink / raw)
To: krkumar2; +Cc: eric.dumazet, ilpo.jarvinen, netdev
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Wed, 20 Jan 2010 17:49:18 +0530
> I had to remove the F_SG flag from cxgb3 driver (using ethtool
> didn't show any difference in performance since GSO was enabled
> on the device due to register_netdev setting it). Testing show a
> drop of 25% in performance with this patch for non-SG device,
> the extra alloc/memcpy is showing up.
>
> For the SG driver, I get a good performace gain (not anywhere
> close to 25% though). What do you suggest?
I don't think we can add your change if it hurts non-SG
devices that much.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-21 9:25 ` David Miller
@ 2010-01-21 9:41 ` Herbert Xu
2010-01-27 7:12 ` Krishna Kumar2
[not found] ` <OF7EA723DA.DC2FF4FC-ON652576B8.002064CB-652576B8.00267739@LocalDomain>
0 siblings, 2 replies; 24+ messages in thread
From: Herbert Xu @ 2010-01-21 9:41 UTC (permalink / raw)
To: David Miller; +Cc: krkumar2, eric.dumazet, ilpo.jarvinen, netdev
David Miller <davem@davemloft.net> wrote:
> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Wed, 20 Jan 2010 17:49:18 +0530
>
>> I had to remove the F_SG flag from cxgb3 driver (using ethtool
>> didn't show any difference in performance since GSO was enabled
>> on the device due to register_netdev setting it). Testing show a
>> drop of 25% in performance with this patch for non-SG device,
>> the extra alloc/memcpy is showing up.
>>
>> For the SG driver, I get a good performace gain (not anywhere
>> close to 25% though). What do you suggest?
>
> I don't think we can add your change if it hurts non-SG
> devices that much.
Wait, we need to be careful when testing this. Non-SG devices
do actually benefit from TSO which they otherwise cannot access.
If you unset the F_SG bit, then that would disable TSO too. So
you need to enable GSO to compensate. So Krishna, did you check
with tcpdump to see if GSO was really enabled with SG off?
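The reason GSO compensates is visible in sk_setup_caps(); paraphrasing
it from memory (so not guaranteed verbatim), the socket is given
SG-equivalent capability whenever GSO is usable:

void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
{
	__sk_dst_set(sk, dst);
	sk->sk_route_caps = dst->dev->features;
	if (sk->sk_route_caps & NETIF_F_GSO)
		sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
	if (sk_can_gso(sk)) {
		if (dst->header_len) {
			sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
		} else {
			/* software GSO implies SG as far as TCP is concerned */
			sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
			sk->sk_gso_max_size = dst->dev->gso_max_size;
		}
	}
}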
IIRC when I did a similar test with e1000 back when I wrote this
the performance of GSO with SG off was pretty much the same as
no GSO with SG off.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-21 9:41 ` Herbert Xu
@ 2010-01-27 7:12 ` Krishna Kumar2
2010-01-29 9:06 ` Herbert Xu
[not found] ` <OF7EA723DA.DC2FF4FC-ON652576B8.002064CB-652576B8.00267739@LocalDomain>
1 sibling, 1 reply; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-27 7:12 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, eric.dumazet, ilpo.jarvinen, netdev
Hi Herbert,
> Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/21/2010 03:11 PM
Sorry for the late response.
> >> I had to remove the F_SG flag from cxgb3 driver (using ethtool
> >> didn't show any difference in performance since GSO was enabled
> >> on the device due to register_netdev setting it). Testing show a
> >> drop of 25% in performance with this patch for non-SG device,
> >> the extra alloc/memcpy is showing up.
> >>
> >> For the SG driver, I get a good performace gain (not anywhere
> >> close to 25% though). What do you suggest?
> >
> > I don't think we can add your change if it hurts non-SG
> > devices that much.
>
> Wait, we need to be careful when testing this. Non-SG devices
> do actually benefit from TSO which they otherwise cannot access.
>
> If you unset the F_SG bit, then that would disable TSO too. So
> you need to enable GSO to compensate. So Krishna, did you check
> with tcpdump to see if GSO was really enabled with SG off?
OK, I unset F_SG and set F_GSO (in the driver). With this, tcpdump shows
GSO is enabled - the TCP packet size builds up to 65160 bytes.
I ran 5 serial netperfs with 16K and another 5 serial netperfs
with 64K I/O sizes, and the aggregate result is:
0. Driver unsets F_SG but sets F_GSO:
Original code with 16K: 19471.65
New code with 16K: 19409.70
Original code with 64K: 21357.23
New code with 64K: 22050.42
To recap the other tests I did today:
1. Driver unsets F_SG, and with GSO off
Original code with 16K: 10123.56
New code with 16K: 7111.12
Original code with 64K: 11568.99
New code with 64K: 7611.37
2. Driver unsets F_SG and uses ethtool to set GSO:
Original code with 16K: 18864.38
New code with 16K: 18465.54
Original code with 64K: 21005.43
New code with 64K: 22529.24
Thanks,
- KK
> IIRC when I did a similar test with e1000 back when I wrote this
> the performance of GSO with SG off was pretty much the same as
> no GSO with SG off.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-27 7:12 ` Krishna Kumar2
@ 2010-01-29 9:06 ` Herbert Xu
2010-01-29 11:15 ` Krishna Kumar2
0 siblings, 1 reply; 24+ messages in thread
From: Herbert Xu @ 2010-01-29 9:06 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: David Miller, eric.dumazet, ilpo.jarvinen, netdev
On Wed, Jan 27, 2010 at 12:42:12PM +0530, Krishna Kumar2 wrote:
>
> OK, I unset F_SG and set F_GSO (in driver). With this, tcpdump shows
> GSO is enabled - the tcp packet sizes builds up to 65160 bytes.
>
> I ran 5 serial netperf's with 16K and another 5 serial netperfs
> with 64K I/O sizes, and the aggregate result is:
>
> 0. Driver unsets F_SG but sets F_GSO:
> Original code with 16K: 19471.65
> New code with 16K: 19409.70
> Original code with 64K: 21357.23
> New code with 64K: 22050.42
OK this is more in line with what I was expecting, namely that
enabling GSO is actually beneficial even without SG.
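(The win comes from traversing the stack and qdisc once per ~64K
super-packet and only segmenting at the very end. A rough sketch, with
an invented wrapper name and loosely following the dev_hard_start_xmit()
GSO path rather than quoting it:)

static int xmit_gso(struct sk_buff *skb, struct net_device *dev)
{
	struct sk_buff *segs, *nskb;

	/* Software segmentation just before the driver; for a non-SG
	 * device the segments also come out linear. */
	segs = skb_gso_segment(skb, dev->features);
	if (IS_ERR(segs))
		return PTR_ERR(segs);

	while ((nskb = segs) != NULL) {
		segs = nskb->next;
		nskb->next = NULL;
		dev->netdev_ops->ndo_start_xmit(nskb, dev); /* errors elided */
	}
	consume_skb(skb);	/* the original super-skb is no longer needed */
	return 0;
}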
It would be good to get the CPU utilisation figures so we can
see the complete picture.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-29 9:06 ` Herbert Xu
@ 2010-01-29 11:15 ` Krishna Kumar2
2010-01-29 11:33 ` Herbert Xu
2010-01-29 19:56 ` Rick Jones
0 siblings, 2 replies; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-29 11:15 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, eric.dumazet, ilpo.jarvinen, netdev
> Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/29/2010 02:36:25 PM:
>
> > I ran 5 serial netperf's with 16K and another 5 serial netperfs
> > with 64K I/O sizes, and the aggregate result is:
> >
> > 0. Driver unsets F_SG but sets F_GSO:
> > Original code with 16K: 19471.65
> > New code with 16K: 19409.70
> > Original code with 64K: 21357.23
> > New code with 64K: 22050.42
>
> OK this is more in line with what I was expecting, namely that
> enabling GSO is actually beneficial even without SG.
>
> It would be good to get the CPU utilisation figures so we can
> see the complete picture.
Same 5 runs of single netperf's:
0. Driver unsets F_SG but sets F_GSO:
Org (16K): BW: 18180.71 SD: 13.485
New (16K): BW: 18113.15 SD: 13.551
Org (64K): BW: 21980.28 SD: 10.306
New (64K): BW: 21386.59 SD: 10.447
1. Driver unsets F_SG, and with GSO off
Org (16K): BW: 10894.62 SD: 26.591
New (16K): BW: 7262.10 SD: 35.340
Org (64K): BW: 12396.41 SD: 23.357
New (64K): BW: 7853.02 SD: 32.405
2. Driver unsets F_SG and uses ethtool to set GSO:
Org (16K): BW: 18094.11 SD: 13.603
New (16K): BW: 17952.38 SD: 13.743
Org (64K): BW: 21540.78 SD: 10.771
New (64K): BW: 21818.35 SD: 10.598
> > I should have mentioned this too - if I unset F_SG in the
> > cxgb3 driver and nothing else, ethtool -k still shows GSO
> > is set, and tcpdump shows max packet size is 1448. If I
> > additionally set GSO in driver, then ethtool still has the
> > same output, but tcpdump shows max packet size of 65160.
>
> This sounds like a bug.
Yes, an ethtool bug (version 6). In test case #1 above, I
wrote that GSO is off, but ethtool "thinks" it is on
(after a modprobe -r cxgb3; modprobe cxgb3). So for test #2,
I simply ran "ethtool ... gso on", and GSO was then really on
in the kernel, explaining the better results.
thanks,
- KK
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-29 11:15 ` Krishna Kumar2
@ 2010-01-29 11:33 ` Herbert Xu
2010-01-29 11:50 ` Krishna Kumar2
2010-01-29 20:02 ` Rick Jones
2010-01-29 19:56 ` Rick Jones
1 sibling, 2 replies; 24+ messages in thread
From: Herbert Xu @ 2010-01-29 11:33 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: David Miller, eric.dumazet, ilpo.jarvinen, netdev
On Fri, Jan 29, 2010 at 04:45:01PM +0530, Krishna Kumar2 wrote:
>
> Same 5 runs of single netperf's:
>
> 0. Driver unsets F_SG but sets F_GSO:
> Org (16K): BW: 18180.71 SD: 13.485
> New (16K): BW: 18113.15 SD: 13.551
> Org (64K): BW: 21980.28 SD: 10.306
> New (64K): BW: 21386.59 SD: 10.447
>
> 1. Driver unsets F_SG, and with GSO off
> Org (16K): BW: 10894.62 SD: 26.591
> New (16K): BW: 7262.10 SD: 35.340
> Org (64K): BW: 12396.41 SD: 23.357
> New (64K): BW: 7853.02 SD: 32.405
>
>
> 2. Driver unsets F_SG and uses ethtool to set GSO:
> Org (16K): BW: 18094.11 SD: 13.603
> New (16K): BW: 17952.38 SD: 13.743
> Org (64K): BW: 21540.78 SD: 10.771
> New (64K): BW: 21818.35 SD: 10.598
Hmm, any idea what is causing case 0 to be different from case 2?
In particular, the 64K performance in case 0 appears to be a
regression but in case 2 it's showing up as an improvement.
AFAICS these two cases should produce identical results, or is
this just jitter across tests?
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-29 11:33 ` Herbert Xu
@ 2010-01-29 11:50 ` Krishna Kumar2
2010-01-29 20:02 ` Rick Jones
1 sibling, 0 replies; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-29 11:50 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, eric.dumazet, ilpo.jarvinen, netdev
Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/29/2010 05:03:46 PM:
>
> > Same 5 runs of single netperf's:
> >
> > 0. Driver unsets F_SG but sets F_GSO:
> > Org (16K): BW: 18180.71 SD: 13.485
> > New (16K): BW: 18113.15 SD: 13.551
> > Org (64K): BW: 21980.28 SD: 10.306
> > New (64K): BW: 21386.59 SD: 10.447
> >
> > 1. Driver unsets F_SG, and with GSO off
> > Org (16K): BW: 10894.62 SD: 26.591
> > New (16K): BW: 7262.10 SD: 35.340
> > Org (64K): BW: 12396.41 SD: 23.357
> > New (64K): BW: 7853.02 SD: 32.405
> >
> >
> > 2. Driver unsets F_SG and uses ethtool to set GSO:
> > Org (16K): BW: 18094.11 SD: 13.603
> > New (16K): BW: 17952.38 SD: 13.743
> > Org (64K): BW: 21540.78 SD: 10.771
> > New (64K): BW: 21818.35 SD: 10.598
>
> Hmm, any idea what is causing case 0 to be different from case 2?
> In particular, the 64K performance in case 0 appears to be a
> regression but in case 2 it's showing up as an improvement.
>
> AFAICS these two cases should produce identical results, or is
> this just jitter across tests?
You are right about the jitter. I have run this many times; most
of the time #0 and #2 are almost identical, but sometimes they vary
a bit.
Also about my earlier ethtool comment:
> Yes, an ethtool bug (version 6). The test case #1 above, I
> have written that GSO is off but ethtool "thinks" it is on
> (after a modprobe -r cxgb3; modprobe cxgb3). So for test #2,
> I simply run "ethtool ... gso on", and GSO is now really on
> in the kernel, explaining the better results.
Hmmm, I had a bad ethtool, it seems. I built the latest one to
debug this problem, and it shows the settings correctly.
thanks,
- KK
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-29 11:33 ` Herbert Xu
2010-01-29 11:50 ` Krishna Kumar2
@ 2010-01-29 20:02 ` Rick Jones
1 sibling, 0 replies; 24+ messages in thread
From: Rick Jones @ 2010-01-29 20:02 UTC (permalink / raw)
To: Herbert Xu
Cc: Krishna Kumar2, David Miller, eric.dumazet, ilpo.jarvinen, netdev
Herbert Xu wrote:
> On Fri, Jan 29, 2010 at 04:45:01PM +0530, Krishna Kumar2 wrote:
>
>>Same 5 runs of single netperf's:
>>
>>0. Driver unsets F_SG but sets F_GSO:
>> Org (16K): BW: 18180.71 SD: 13.485
>> New (16K): BW: 18113.15 SD: 13.551
>> Org (64K): BW: 21980.28 SD: 10.306
>> New (64K): BW: 21386.59 SD: 10.447
>>
>>1. Driver unsets F_SG, and with GSO off
>> Org (16K): BW: 10894.62 SD: 26.591
>> New (16K): BW: 7262.10 SD: 35.340
>> Org (64K): BW: 12396.41 SD: 23.357
>> New (64K): BW: 7853.02 SD: 32.405
>>
>>
>>2. Driver unsets F_SG and uses ethtool to set GSO:
>> Org (16K): BW: 18094.11 SD: 13.603
>> New (16K): BW: 17952.38 SD: 13.743
>> Org (64K): BW: 21540.78 SD: 10.771
>> New (64K): BW: 21818.35 SD: 10.598
>
>
> Hmm, any idea what is causing case 0 to be different from case 2?
> In particular, the 64K performance in case 0 appears to be a
> regression but in case 2 it's showing up as an improvement.
>
> AFAICS these two cases should produce identical results, or is
> this just jitter across tests?
To get some idea of run-to-run variation, when one does not want to run
multiple explicit netperf commands and do the statistical work
afterwards, one can add global command line arguments to netperf:
netperf ... -i 30,3 -I 99,<width> ...
which tells netperf to run at least 3 iterations (that is the minimum
minimum netperf will do) and no more than 30 iterations (that is the
maximum maximum netperf will do), attempting to be 99% confident that
the mean for throughput (and the CPU utilization if -c and/or -C are
present and a global -r is not) is within +/- width/2%. For example:
netperf -H remote -i 30,3 -I 99,0.5 -c -C
will attempt to be 99% certain that the means it reports for throughput
and for local and remote CPU utilization are within +/- 0.25% of the
actual means. If, after 30 iterations, it has not achieved that
confidence, it will emit warnings giving the widths of the confidence
intervals it did achieve.
rick jones
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-29 11:15 ` Krishna Kumar2
2010-01-29 11:33 ` Herbert Xu
@ 2010-01-29 19:56 ` Rick Jones
1 sibling, 0 replies; 24+ messages in thread
From: Rick Jones @ 2010-01-29 19:56 UTC (permalink / raw)
To: Krishna Kumar2
Cc: Herbert Xu, David Miller, eric.dumazet, ilpo.jarvinen, netdev
Krishna Kumar2 wrote:
>>Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/29/2010 02:36:25 PM:
>>
>>
>>>I ran 5 serial netperf's with 16K and another 5 serial netperfs
>>>with 64K I/O sizes, and the aggregate result is:
>>>
>>>0. Driver unsets F_SG but sets F_GSO:
>>> Original code with 16K: 19471.65
>>> New code with 16K: 19409.70
>>> Original code with 64K: 21357.23
>>> New code with 64K: 22050.42
>>
>>OK this is more in line with what I was expecting, namely that
>>enabling GSO is actually beneficial even without SG.
>>
>>It would be good to get the CPU utilisation figures so we can
>>see the complete picture.
>
>
> Same 5 runs of single netperf's:
>
> 0. Driver unsets F_SG but sets F_GSO:
> Org (16K): BW: 18180.71 SD: 13.485
> New (16K): BW: 18113.15 SD: 13.551
> Org (64K): BW: 21980.28 SD: 10.306
> New (64K): BW: 21386.59 SD: 10.447
>
> 1. Driver unsets F_SG, and with GSO off
> Org (16K): BW: 10894.62 SD: 26.591
> New (16K): BW: 7262.10 SD: 35.340
> Org (64K): BW: 12396.41 SD: 23.357
> New (64K): BW: 7853.02 SD: 32.405
>
>
> 2. Driver unsets F_SG and uses ethtool to set GSO:
> Org (16K): BW: 18094.11 SD: 13.603
> New (16K): BW: 17952.38 SD: 13.743
> Org (64K): BW: 21540.78 SD: 10.771
> New (64K): BW: 21818.35 SD: 10.598
Just a slight change in service demand there... For those unfamiliar,
service demand in netperf is the microseconds of non-idle CPU time per
KB of data transferred. Smaller is better.
happy benchmarking,
rick jones
[parent not found: <OF7EA723DA.DC2FF4FC-ON652576B8.002064CB-652576B8.00267739@LocalDomain>]
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
[not found] ` <OF7EA723DA.DC2FF4FC-ON652576B8.002064CB-652576B8.00267739@LocalDomain>
@ 2010-01-27 9:42 ` Krishna Kumar2
2010-01-29 9:07 ` Herbert Xu
0 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-27 9:42 UTC (permalink / raw)
To: Krishna Kumar2
Cc: David Miller, eric.dumazet, Herbert Xu, ilpo.jarvinen, netdev
> Krishna Kumar2/India/IBM wrote on 01/27/2010 12:42 PM
>
> > Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/21/2010 03:11 PM
>
> Sorry for the late response.
>
> > >> I had to remove the F_SG flag from cxgb3 driver (using ethtool
> > >> didn't show any difference in performance since GSO was enabled
> > >> on the device due to register_netdev setting it). Testing show a
> > >> drop of 25% in performance with this patch for non-SG device,
> > >> the extra alloc/memcpy is showing up.
> > >>
> > >> For the SG driver, I get a good performace gain (not anywhere
> > >> close to 25% though). What do you suggest?
> > >
> > > I don't think we can add your change if it hurts non-SG
> > > devices that much.
> >
> > Wait, we need to be careful when testing this. Non-SG devices
> > do actually benefit from TSO which they otherwise cannot access.
> >
> > If you unset the F_SG bit, then that would disable TSO too. So
> > you need to enable GSO to compensate. So Krishna, did you check
> > with tcpdump to see if GSO was really enabled with SG off?
>
> OK, I unset F_SG and set F_GSO (in driver). With this, tcpdump shows
> GSO is enabled - the tcp packet sizes builds up to 65160 bytes.
I should have mentioned this too - if I unset F_SG in the
cxgb3 driver and nothing else, ethtool -k still shows GSO
is set, and tcpdump shows max packet size is 1448. If I
additionally set GSO in driver, then ethtool still has the
same output, but tcpdump shows max packet size of 65160.
Thanks,
- KK
> I ran 5 serial netperf's with 16K and another 5 serial netperfs
> with 64K I/O sizes, and the aggregate result is:
>
> 0. Driver unsets F_SG but sets F_GSO:
> Original code with 16K: 19471.65
> New code with 16K: 19409.70
> Original code with 64K: 21357.23
> New code with 64K: 22050.42
>
> To recap the other tests I did today:
>
> 1. Driver unsets F_SG, and with GSO off
> Original code with 16K: 10123.56
> New code with 16K: 7111.12
> Original code with 64K: 11568.99
> New code with 64K: 7611.37
>
> 2. Driver unsets F_SG and uses ethtool to set GSO:
> Original code with 16K: 18864.38
> New code with 16K: 18465.54
> Original code with 64K: 21005.43
> New code with 64K: 22529.24
>
> Thanks,
>
> - KK
>
> > IIRC when I did a similar test with e1000 back when I wrote this
> > the performance of GSO with SG off was pretty much the same as
> > no GSO with SG off.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-27 9:42 ` Krishna Kumar2
@ 2010-01-29 9:07 ` Herbert Xu
0 siblings, 0 replies; 24+ messages in thread
From: Herbert Xu @ 2010-01-29 9:07 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: David Miller, eric.dumazet, ilpo.jarvinen, netdev
On Wed, Jan 27, 2010 at 03:12:48PM +0530, Krishna Kumar2 wrote:
>
> I should have mentioned this too - if I unset F_SG in the
> cxgb3 driver and nothing else, ethtool -k still shows GSO
> is set, and tcpdump shows max packet size is 1448. If I
> additionally set GSO in driver, then ethtool still has the
> same output, but tcpdump shows max packet size of 65160.
This sounds like a bug.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 5:33 [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices? Krishna Kumar
2010-01-15 8:36 ` David Miller
@ 2010-01-15 18:13 ` Rick Jones
2010-01-16 6:38 ` Krishna Kumar2
1 sibling, 1 reply; 24+ messages in thread
From: Rick Jones @ 2010-01-15 18:13 UTC (permalink / raw)
To: Krishna Kumar; +Cc: davem, ilpo.jarvinen, netdev, eric.dumazet
Krishna Kumar wrote:
> From: Krishna Kumar <krkumar2@in.ibm.com>
>
> Remove inline skb data in tcp_sendmsg(). For the few devices that
> don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> and pass the penalty to those slow devices (the following drivers
> do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> cris/eth_v10.c).
>
> This patch does not affect devices that support SG but turn off
> via ethtool after register_netdev.
>
> I ran the following test cases with iperf - #threads: 1 4 8 16 32
> 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case runs for
> 1 minute, repeat 5 iterations. Total test run time is 6 hours.
> System is 4-proc Opteron, with a Chelsio 10gbps NIC. Results (BW
> figures are the aggregate across 5 iterations in mbps):
>
> -------------------------------------------------------
> #Process I/O Size Org-BW New-BW %-change
> -------------------------------------------------------
> 1 256 2098 2147 2.33
> 1 4K 14057 14269 1.50
> 1 16K 25984 27317 5.13
> 1 64K 25920 27539 6.24
> ...
> 256 256 1947 1955 0.41
> 256 4K 9828 12265 24.79
> 256 16K 25087 24977 -0.43
> 256 64K 26715 27997 4.79
> -------------------------------------------------------
> Total: - 600071 634906 5.80
> -------------------------------------------------------
Does bandwidth alone convey the magnitude of the change? I would think that
would only be the case if the CPU(s) were 100% utilized, and perhaps not even
completely then. At the risk of a shameless plug, it's not for nothing that
netperf reports service demand :)
I would think that change in service demand (CPU per unit of work) would be
something one wants to see.
Also, the world does not run on bandwidth alone, so small packet performance and
any delta there would be good to have.
Multiple process tests may not be as easy in netperf as they are in iperf, but under:
ftp://ftp.netperf.org/netperf/misc
I have a single-stream test script I use called runemomni.sh and an example of
its output, as well as an aggregate script I use called runemomniagg2.sh - I'll
post an example of its output there as soon as I finish some runs. The script
presumes one has ./configure'd netperf:
./configure --enable-burst --enable-omni ...
The netperf omni tests still ass-u-me that the CPU util each measures is all his
own, which means the service demands from aggregate tests require some
post-processing fixup.
http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance
happy benchmarking,
rick jones
FWIW, service demand and pps performance may be even more important for non-SG
devices because they may be slow 1 Gig devices and still hit link-rate on a bulk
throughput test even with a non-trivial increase in CPU util. However, a
non-trivial hit in CPU util may rather change the pps performance.
PPS - there is a *lot* of output in those omni test results - best viewed with a
spreadsheet program.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-15 18:13 ` Rick Jones
@ 2010-01-16 6:38 ` Krishna Kumar2
2010-01-19 17:41 ` Rick Jones
0 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar2 @ 2010-01-16 6:38 UTC (permalink / raw)
To: Rick Jones; +Cc: davem, eric.dumazet, ilpo.jarvinen, netdev
Rick Jones <rick.jones2@hp.com> wrote on 01/15/2010 11:43:06 PM:
>
> > Remove inline skb data in tcp_sendmsg(). For the few devices that
> > don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> > and pass the penalty to those slow devices (the following drivers
> > do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> > dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> > r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> > via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> > cris/eth_v10.c).
> >
> > This patch does not affect devices that support SG but turn off
> > via ethtool after register_netdev.
> >
> > I ran the following test cases with iperf - #threads: 1 4 8 16 32
> > 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case runs for
> > 1 minute, repeat 5 iterations. Total test run time is 6 hours.
> > System is 4-proc Opteron, with a Chelsio 10gbps NIC. Results (BW
> > figures are the aggregate across 5 iterations in mbps):
> >
> > -------------------------------------------------------
> > #Process I/O Size Org-BW New-BW %-change
> > -------------------------------------------------------
> > 1 256 2098 2147 2.33
> > 1 4K 14057 14269 1.50
> > 1 16K 25984 27317 5.13
> > 1 64K 25920 27539 6.24
> > ...
> > 256 256 1947 1955 0.41
> > 256 4K 9828 12265 24.79
> > 256 16K 25087 24977 -0.43
> > 256 64K 26715 27997 4.79
> > -------------------------------------------------------
> > Total: - 600071 634906 5.80
> > -------------------------------------------------------
>
> Does bandwidth alone convey the magnitude of the change? I would think
> that would only be the case if the CPU(s) were 100% utilized, and
> perhaps not even completely then. At the risk of a shameless plug, it's
> not for nothing that netperf reports service demand :)
>
> I would think that change in service demand (CPU per unit of work) would
> be something one wants to see.
>
> Also, the world does not run on bandwidth alone, so small packet
> performance and
> any delta there would be good to have.
>
> Multiple process tests may not be as easy in netperf as it is in
> iperf, but under:
>
> ftp://ftp.netperf.org/netperf/misc
>
> I have a single-stream test script I use called runemomni.sh and an
> example of
> its output, as well as an aggregate script I use called
> runemomniagg2.sh - I'll
> post an example of its output there as soon as I finish some runs.
I usually run netperf for a smaller number of threads and aggregate
the output using some scripts. I will try what you suggested above,
and see if I can get consistent results for higher numbers of
processes. Thanks for the links.
- KK
> The script
> presumes one has ./configure'd netperf:
>
> ./configure --enable-burst --enable-omni ...
>
> The netperf omni tests still ass-u-me that the CPU util each
> measures is all his
> own, which means the service demands from aggrgate tests require some
> post-processing fixup.
>
> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-
> Netperf-to-Measure-Aggregate-Performance
>
> happy benchmarking,
>
> rick jones
>
> FWIW, service demand and pps performance may be even more important
> for non-SG
> devices because they may be slow 1 Gig devices and still hit link-
> rate on a bulk
> throughput test even with a non-trivial increase in CPU util. However, a
> non-trivial hit in CPU util may rather change the pps performance.
>
> PPS - there is a *lot* of output in those omni test results - best
> viewed with a
> spreadsheet program.
* Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
2010-01-16 6:38 ` Krishna Kumar2
@ 2010-01-19 17:41 ` Rick Jones
0 siblings, 0 replies; 24+ messages in thread
From: Rick Jones @ 2010-01-19 17:41 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: davem, eric.dumazet, ilpo.jarvinen, netdev
Krishna Kumar2 wrote:
> Rick Jones <rick.jones2@hp.com> wrote on 01/15/2010 11:43:06 PM:
>>Multiple process tests may not be as easy in netperf as it is in
>>iperf, but under:
>>
>>ftp://ftp.netperf.org/netperf/misc
>>
>> I have a single-stream test script I use called runemomni.sh and an
>> example of its output, as well as an aggregate script I use called
>> runemomniagg2.sh - I'll post an example of its output there as
>> soon as I finish some runs.
>
>
> I usually run netperf for smaller number of threads and aggregate
> the output using some scripts. I will try what you suggested above,
> and see if I can get consistent results for higher number of
> processes. Thanks for the links.
You're welcome - the output of the aggregate script is up there too now.
Generally I run three systems for the aggregate tests and then include
the TCP_MAERTs stuff, but in this case, since I had only two otherwise
identical systems, TCP_MAERTs would have been redundant.
rick jones
Thread overview: 24+ messages:
2010-01-15 5:33 [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices? Krishna Kumar
2010-01-15 8:36 ` David Miller
2010-01-15 8:50 ` Simon Kagstrom
2010-01-15 8:52 ` David Miller
2010-01-15 9:00 ` Simon Kagstrom
2010-01-15 9:04 ` David Miller
2010-01-15 9:03 ` Krishna Kumar2
[not found] ` <OF1B0853DD.2824263E-ON652576AC.002FA9D4-652576AC.0030AC47@LocalDomain>
2010-01-15 9:20 ` Krishna Kumar2
2010-01-15 9:18 ` David Miller
2010-01-20 12:19 ` Krishna Kumar2
2010-01-21 9:25 ` David Miller
2010-01-21 9:41 ` Herbert Xu
2010-01-27 7:12 ` Krishna Kumar2
2010-01-29 9:06 ` Herbert Xu
2010-01-29 11:15 ` Krishna Kumar2
2010-01-29 11:33 ` Herbert Xu
2010-01-29 11:50 ` Krishna Kumar2
2010-01-29 20:02 ` Rick Jones
2010-01-29 19:56 ` Rick Jones
[not found] ` <OF7EA723DA.DC2FF4FC-ON652576B8.002064CB-652576B8.00267739@LocalDomain>
2010-01-27 9:42 ` Krishna Kumar2
2010-01-29 9:07 ` Herbert Xu
2010-01-15 18:13 ` Rick Jones
2010-01-16 6:38 ` Krishna Kumar2
2010-01-19 17:41 ` Rick Jones