* [PATCH] LRO ack aggregation
From: Andrew Gallatin @ 2007-10-23 15:11 UTC
To: netdev; +Cc: ossthema
[-- Attachment #1: Type: text/plain, Size: 3466 bytes --]
Hi,
We recently did some performance comparisons between the new inet_lro
LRO support in the kernel, and our Myri10GE in-driver LRO.
For receive, we found they were nearly identical. However, for
transmit, we found that Myri10GE's LRO shows much lower CPU
utilization. We traced the CPU utilization difference to our driver
LRO aggregating TCP acks, and the inet_lro module not doing so.
I've attached a patch which adds support to inet_lro for aggregating
pure acks. Aggregating pure acks (segments with TCP_PAYLOAD_LENGTH ==
0) entails freeing the skb (or putting the page in the frags case).
The patch also handles trimming (typical for 54-byte pure ack frames
which have been padded to the Ethernet minimum 60-byte frame size).
In the frags case, I tried to keep things simple by only doing the
trim when the entire frame fits in the first frag. To be safe, I
ensure that the padding is all 0 (or, more exactly, was some pattern
whose checksum is -0) so that it doesn't impact hardware checksums.
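As an aside, the arithmetic here is: 14 bytes of Ethernet header + 20
bytes of IPv4 + 20 bytes of TCP = 54 bytes, so a pure ack picks up 6 pad
bytes on the way to the 60-byte minimum. The small stand-alone program
below (an illustration, not part of the patch; it only mimics the
folding done by the kernel's ip_compute_csum()) shows why a folded
checksum of 0xffff over the pad bytes means the padding is
checksum-neutral and safe to trim:
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stddef.h>
/* same folding idea as the kernel's ip_compute_csum(): one's-complement
 * sum of the buffer, folded to 16 bits and inverted */
static uint16_t fold_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)(~sum & 0xffff);
}
int main(void)
{
	uint8_t pad[6];		/* the 6 pad bytes of a 54->60 byte runt */
	memset(pad, 0, sizeof(pad));
	/* all-zero padding folds to 0xffff ("-0"): trimming it cannot
	 * change a checksum computed over the padded frame */
	printf("zero pad:  0x%04x\n", fold_csum(pad, sizeof(pad)));
	pad[3] = 0x5a;
	/* anything else would perturb the checksum, so the trim is skipped */
	printf("dirty pad: 0x%04x\n", fold_csum(pad, sizeof(pad)));
	return 0;
}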
This patch also fixes a small bug in the skb LRO path dealing with
vlans that I found when doing my own testing. Specifically, in the
desc->active case, the existing code always fails the
lro_tcp_ip_check() for NICs without LRO_F_EXTRACT_VLAN_ID, because it
fails to subtract the vlan_hdr_len from the skb->len.
Jan-Bernd Themann (ossthema@de.ibm.com) has tested the patch using
the eHEA driver (skb codepath), and I have tested it using Myri10GE
(both frags and skb codepath).
Using a pair of identical low-end 2.0GHz Athlon 64 X2 3800+ machines
with Myri10GE 10GbE NICs, I ran 10 iterations of netperf TCP_SENDFILE
tests, taking the median run for comparison purposes. The receiver
was running Myri10GE + patched inet_lro:
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to rome-my (192.168.1.16) port 0 AF_INET : cpu bind
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
Myri10GE driver-specific LRO:
 87380  65536  65536    60.02      9442.65   16.24    69.31    0.282   1.203
Myri10GE + unpatched inet_lro:
 87380  65536  65536    60.02      9442.88   20.10    69.11    0.349   1.199
Myri10GE + patched inet_lro:
 87380  65536  65536    60.02      9443.30   16.95    68.97    0.294   1.197
The important bit here is the sender's CPU utilization, and service
demand (cost per byte). As you can see, without aggregating acks,
the overhead on the sender is roughly 20% higher, even when sending to
a receiver which uses LRO. The differences are even more dramatic
when sending to a receiver which does not use LRO (and hence sends
more frequent acks).
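For anyone less used to reading netperf output: service demand is just
CPU time burned per unit of data moved. The throwaway program below (an
illustration added here, assuming the utilization column reflects both
cores of the 2-core sender) reproduces the 0.282 / 0.349 / 0.294 us/KB
sender figures in the table above:
#include <stdio.h>
/* us of CPU per KB sent = (util% / 100) * ncpus * 1e6 us  /  (KB/s) */
static double service_demand(double mbits_per_sec, double util_pct, double ncpus)
{
	double kb_per_sec = mbits_per_sec * 1e6 / 8.0 / 1024.0;
	double cpu_us_per_sec = util_pct / 100.0 * ncpus * 1e6;
	return cpu_us_per_sec / kb_per_sec;
}
int main(void)
{
	/* sender-side throughput and utilization rows from the table above */
	printf("driver LRO:    %.3f us/KB\n", service_demand(9442.65, 16.24, 2));
	printf("unpatched LRO: %.3f us/KB\n", service_demand(9442.88, 20.10, 2));
	printf("patched LRO:   %.3f us/KB\n", service_demand(9443.30, 16.95, 2));
	return 0;
}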
Below is the same benchmark, run between a pair of 4-way 3.0GHz Xeon
5160 machines (Dell 2950) with Myri10GE NICs. The receiver is running
Solaris 10U4, which does not do LRO, and is acking at approximately
8:1 (or ~100K acks/sec):
Myri10GE driver-specific LRO:
196712  65536  65536    60.01      9280.09    7.14    45.37    0.252   1.602
Myri10GE + unpatched inet_lro:
196712  65536  65536    60.01      8530.80   10.51    44.60    0.404   1.713
Myri10GE + patched inet_lro:
196712  65536  65536    60.00      9249.65    7.21    45.90    0.255   1.626
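A quick sanity check on the ~100K acks/sec figure quoted for the Solaris
receiver (an added back-of-the-envelope calculation; the standard
1500-byte MTU is an assumption):
#include <stdio.h>
int main(void)
{
	double gbits_per_sec = 9.28;	/* roughly the Solaris-receiver runs */
	double mtu_bytes = 1500.0;	/* assumed standard Ethernet MTU */
	double segs_per_ack = 8.0;	/* the ~8:1 ack ratio quoted above */
	double segs_per_sec = gbits_per_sec * 1e9 / 8.0 / mtu_bytes;
	double acks_per_sec = segs_per_sec / segs_per_ack;
	/* ~773K data segments/s and ~97K acks/s, i.e. "~100K acks/sec" */
	printf("%.0f segments/s, %.0f acks/s\n", segs_per_sec, acks_per_sec);
	return 0;
}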
Signed-off-by: Andrew Gallatin <gallatin@myri.com>
Andrew Gallatin
Myricom Inc.
[-- Attachment #2: ack_aggr.diff --]
[-- Type: text/plain, Size: 3789 bytes --]
diff --git a/net/ipv4/inet_lro.c b/net/ipv4/inet_lro.c
index ac3b1d3..eba145b 100644
--- a/net/ipv4/inet_lro.c
+++ b/net/ipv4/inet_lro.c
@@ -58,9 +58,6 @@ static int lro_tcp_ip_check(struct iphdr
 	if (ntohs(iph->tot_len) != len)
 		return -1;
-	if (TCP_PAYLOAD_LENGTH(iph, tcph) == 0)
-		return -1;
-
 	if (iph->ihl != IPH_LEN_WO_OPTIONS)
 		return -1;
@@ -223,6 +220,11 @@ static void lro_add_packet(struct net_lr
 	lro_add_common(lro_desc, iph, tcph, tcp_data_len);
+	if (tcp_data_len == 0) {
+		dev_kfree_skb_any(skb);
+		return;
+	}
+
 	skb_pull(skb, (skb->len - tcp_data_len));
 	parent->truesize += skb->truesize;
@@ -244,6 +246,11 @@ static void lro_add_frags(struct net_lro
 	lro_add_common(lro_desc, iph, tcph, tcp_data_len);
+	if (tcp_data_len == 0) {
+		put_page(skb_frags[0].page);
+		return;
+	}
+
 	skb->truesize += truesize;
 	skb_frags[0].page_offset += hlen;
@@ -338,6 +345,8 @@ static int __lro_proc_skb(struct net_lro
 	struct tcphdr *tcph;
 	u64 flags;
 	int vlan_hdr_len = 0;
+	int pkt_len;
+	int trim;
 	if (!lro_mgr->get_skb_header
 	    || lro_mgr->get_skb_header(skb, (void *)&iph, (void *)&tcph,
@@ -355,6 +364,17 @@ static int __lro_proc_skb(struct net_lro
 	    && !test_bit(LRO_F_EXTRACT_VLAN_ID, &lro_mgr->features))
 		vlan_hdr_len = VLAN_HLEN;
+	/* strip padding from runts iff the padding is all zero */
+	pkt_len = vlan_hdr_len + ntohs(iph->tot_len);
+	trim = skb->len - pkt_len;
+	if (trim > 0) {
+		u8 *pad = skb_tail_pointer(skb) - trim;
+		if (unlikely(ip_compute_csum(pad, trim) != 0xffff)) {
+			goto out;
+		}
+		skb_trim(skb, skb->len - trim);
+	}
+
 	if (!lro_desc->active) { /* start new lro session */
 		if (lro_tcp_ip_check(iph, tcph, skb->len - vlan_hdr_len, NULL))
 			goto out;
@@ -368,7 +388,7 @@ static int __lro_proc_skb(struct net_lro
 	if (lro_desc->tcp_next_seq != ntohl(tcph->seq))
 		goto out2;
-	if (lro_tcp_ip_check(iph, tcph, skb->len, lro_desc))
+	if (lro_tcp_ip_check(iph, tcph, skb->len - vlan_hdr_len, lro_desc))
 		goto out2;
 	lro_add_packet(lro_desc, skb, iph, tcph);
@@ -412,18 +432,20 @@ static struct sk_buff *lro_gen_skb(struc
 	memcpy(skb->data, mac_hdr, hdr_len);
-	skb_frags = skb_shinfo(skb)->frags;
-	while (data_len > 0) {
-		*skb_frags = *frags;
-		data_len -= frags->size;
-		skb_frags++;
-		frags++;
-		skb_shinfo(skb)->nr_frags++;
+	if (skb->data_len == 0) {
+		put_page(frags[0].page);
+	} else {
+		skb_frags = skb_shinfo(skb)->frags;
+		while (data_len > 0) {
+			*skb_frags = *frags;
+			data_len -= frags->size;
+			skb_frags++;
+			frags++;
+			skb_shinfo(skb)->nr_frags++;
+		}
+		skb_shinfo(skb)->frags[0].page_offset += hdr_len;
+		skb_shinfo(skb)->frags[0].size -= hdr_len;
 	}
-
-	skb_shinfo(skb)->frags[0].page_offset += hdr_len;
-	skb_shinfo(skb)->frags[0].size -= hdr_len;
-
 	skb->ip_summed = ip_summed;
 	skb->csum = sum;
 	skb->protocol = eth_type_trans(skb, lro_mgr->dev);
@@ -445,6 +467,8 @@ static struct sk_buff *__lro_proc_segmen
 	int mac_hdr_len;
 	int hdr_len = LRO_MAX_PG_HLEN;
 	int vlan_hdr_len = 0;
+	int pkt_len;
+	int trim;
 	if (!lro_mgr->get_frag_header
 	    || lro_mgr->get_frag_header(frags, (void *)&mac_hdr, (void *)&iph,
@@ -463,6 +487,19 @@ static struct sk_buff *__lro_proc_segmen
 	if (!lro_desc)
 		goto out1;
+	/* strip padding from runts iff the padding is all zero */
+	pkt_len = mac_hdr_len + ntohs(iph->tot_len);
+	trim = len - pkt_len;
+	if (trim > 0 && pkt_len <= frags->size) {
+		u8 *pad = page_address(frags->page) + frags->page_offset +
+			  pkt_len;
+		if (unlikely(ip_compute_csum(pad, trim) != 0xffff))
+			goto out1;
+		frags->size -= trim;
+		len -= trim;
+		true_size -= trim;
+	}
+
 	if (!lro_desc->active) { /* start new lro session */
 		if (lro_tcp_ip_check(iph, tcph, len - mac_hdr_len, NULL))
 			goto out1;
* Re: [PATCH] LRO ack aggregation
From: David Miller @ 2007-11-20 5:09 UTC
To: gallatin; +Cc: netdev, ossthema
From: Andrew Gallatin <gallatin@myri.com>
Date: Tue, 23 Oct 2007 11:11:55 -0400
> I've attached a patch which adds support to inet_lro for aggregating
> pure acks.
I've applied this patch to net-2.6.25... but!
This needs some serious thinking. What this patch ends up doing is creating
big stretch-ACKs, and those can hurt performance.
Stretch ACKs are particularly harmful when either the receiver is cpu
weak (lacking enough cpu power to fill the pipe completely no matter
what optimizations are applied) or when there is packet loss (less
feedback information and ACK clocking).
It also means that the sender will be more bursty, because it will now
swallow ACKs covering huge portions of the send window, and then have
large chunks of its send queue that it can send out all at once.
Fundamentally, I really don't like this change, it batches to the
point where it begins to erode the natural ACK clocking of TCP, and I
therefore am very likely to revert it before merging to Linus.
* Re: [PATCH] LRO ack aggregation
From: Herbert Xu @ 2007-11-20 6:09 UTC
To: David Miller; +Cc: gallatin, netdev, ossthema
David Miller <davem@davemloft.net> wrote:
>
> Fundamentally, I really don't like this change, it batches to the
> point where it begins to erode the natural ACK clocking of TCP, and I
> therefore am very likely to revert it before merging to Linus.
Perhaps make it a tunable that defaults to off?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
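A minimal sketch of what Herbert's off-by-default tunable could look
like, assuming a plain module parameter on inet_lro; the name
lro_aggregate_acks is invented for illustration and is not part of any
posted patch:
/* hypothetical sketch, not from any posted patch: an off-by-default
 * module parameter gating ack aggregation inside inet_lro */
#include <linux/module.h>
#include <linux/moduleparam.h>
static int lro_aggregate_acks __read_mostly;	/* 0 = off by default */
module_param(lro_aggregate_acks, int, 0644);
MODULE_PARM_DESC(lro_aggregate_acks,
		 "Merge pure TCP acks in inet_lro (default: off)");
/* lro_tcp_ip_check() could then keep rejecting zero-payload segments
 * unless the knob is enabled, e.g.:
 *
 *	if (TCP_PAYLOAD_LENGTH(iph, tcph) == 0 && !lro_aggregate_acks)
 *		return -1;
 */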
* Re: [PATCH] LRO ack aggregation
From: David Miller @ 2007-11-20 6:22 UTC
To: herbert; +Cc: gallatin, netdev, ossthema
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 20 Nov 2007 14:09:18 +0800
> David Miller <davem@davemloft.net> wrote:
> >
> > Fundamentally, I really don't like this change, it batches to the
> > point where it begins to erode the natural ACK clocking of TCP, and I
> > therefore am very likely to revert it before merging to Linus.
>
> Perhaps make it a tunable that defaults to off?
That's one idea.
But if it's there, the risk is that it ends up always being turned on
by distribution vendors, making our off-by-default pointless, and the
internet gets crapped up with misbehaving Linux TCP stacks anyway.
* Re: [PATCH] LRO ack aggregation
From: Andrew Gallatin @ 2007-11-20 11:47 UTC
To: David Miller; +Cc: herbert, netdev, ossthema
David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Tue, 20 Nov 2007 14:09:18 +0800
>
>> David Miller <davem@davemloft.net> wrote:
>>> Fundamentally, I really don't like this change, it batches to the
>>> point where it begins to erode the natural ACK clocking of TCP, and I
>>> therefore am very likely to revert it before merging to Linus.
>> Perhaps make it a tunable that defaults to off?
>
> That's one idea.
I'd certainly prefer the option to have a tunable to having our
customers see performance regressions when they switch to
the kernel's LRO.
> But if it's there, the risk is that it ends up always being turned on
> by distribution vendors, making our off-by-default pointless, and the
> internet gets crapped up with misbehaving Linux TCP stacks anyway.
If vendors are going to pick this up, there is the risk of them just
applying this patch (which currently has no tunable to disable it),
leaving their users stuck with it enabled. At least with a tunable,
it would be easy for them to turn it off. And the comments surrounding
it could make it clear why it should default to off.
FWIW, we've seen TCP perform well in a WAN setting using our NICs and
our LRO which does this ack aggregation. For example, the last 2
Supercomputing "Bandwidth Challenges" (making the most of a 10Gb/s
WAN connection) were won by teams using our NICs, with drivers that
did this sort of ack aggregation.
Drew
* Re: [PATCH] LRO ack aggregation
From: David Miller @ 2007-11-20 11:55 UTC
To: gallatin; +Cc: herbert, netdev, ossthema
From: Andrew Gallatin <gallatin@myri.com>
Date: Tue, 20 Nov 2007 06:47:57 -0500
> David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Tue, 20 Nov 2007 14:09:18 +0800
> >
> >> David Miller <davem@davemloft.net> wrote:
> >>> Fundamentally, I really don't like this change, it batches to the
> >>> point where it begins to erode the natural ACK clocking of TCP, and I
> >>> therefore am very likely to revert it before merging to Linus.
> >> Perhaps make it a tunable that defaults to off?
> >
> > That's one idea.
>
> I'd certainly prefer the option to have a tunable to having our
> customers see performance regressions when they switch to
> the kernel's LRO.
Please qualify this because by itself it's an inaccurate statement.
It would cause a performance regression in situations where there is
nearly no packet loss, no packet reordering, and the receiver has
strong enough cpu power.
Because in fact, some "customers" might see performance regressions
by using this ack aggregation code. In particular if they are
talking to the real internet at all.
> If vendors are going to pick this up, there is the risk of them just
> applying this patch (which currently has no tunable to disable it),
> leaving their users stuck with it enabled.
I will watch out for this and make sure to advise them strongly not to
do this.
So you can be sure this won't happen. :-)
> FWIW, we've seen TCP perform well in a WAN setting using our NICs and
> our LRO which does this ack aggregation. For example, the last 2
> Supercomputing "Bandwidth Challenges" (making the most of a 10Gb/s
> WAN connection) were won by teams using our NICs, with drivers that
> did this sort of ack aggregation.
And basically nearly no packet loss, which just supports my objections
to this even more. And this doesn't even begin to consider the RX
CPU-limited cases, where again ACK stretching hurts a lot.
The bandwidth challenge cases are not very realistic at all, and are
about as far from the realities of real internet traffic as you can
get.
Show me something over real backbones, talking to hundreds or thousands
of clients scattered all over the world. That's what people will be
using these high-end NICs for in front-facing services, and that's where
loss happens and stretch ACKs hurt performance.
ACK stretching is bad bad bad for everything outside of some well
controlled test network bubble.
* Re: [PATCH] LRO ack aggregation
From: Andrew Gallatin @ 2007-11-20 13:27 UTC
To: David Miller; +Cc: herbert, netdev, ossthema
David Miller wrote:
> From: Andrew Gallatin <gallatin@myri.com>
> Date: Tue, 20 Nov 2007 06:47:57 -0500
>
>> David Miller wrote:
>> > From: Herbert Xu <herbert@gondor.apana.org.au>
>> > Date: Tue, 20 Nov 2007 14:09:18 +0800
>> >
>> >> David Miller <davem@davemloft.net> wrote:
>> >>> Fundamentally, I really don't like this change, it batches to the
>> >>> point where it begins to erode the natural ACK clocking of TCP, and I
>> >>> therefore am very likely to revert it before merging to Linus.
>> >> Perhaps make it a tunable that defaults to off?
>> >
>> > That's one idea.
>>
>> I'd certainly prefer the option to have a tunable to having our
>> customers see performance regressions when they switch to
>> the kernel's LRO.
>
> Please qualify this because by itself it's an inaccurate statement.
>
> It would cause a performance regression in situations where there is
> nearly no packet loss, no packet reordering, and the receiver has
> strong enough cpu power.
Yes, a regression of nearly 1Gb/s in some cases as I mentioned
when I submitted the patch.
<....>
> Show me something over real backbones, talking to hundreds or thousands
> of clients scattered all over the world. That's what people will be
> using these high-end NICs for in front-facing services, and that's where
> loss happens and stretch ACKs hurt performance.
>
I can't. I think most 10GbE on endstations is used either in the
server room, or on dedicated links. My experience with 10GbE users is
limited to my interactions with people using our NICs who contact our
support. Of those, I can recall only a tiny handful who were using
10GbE on a normal internet facing connection (and the ones I dealt
with were actually running a different OS). The vast majority were in
a well controlled, lossless environment. It is quite ironic. The
very fact that I cannot provide you with examples of internet facing
people using LRO (w/ack aggr) in more normal applications tends to
support my point that most 10GbE users seem to be in lossless
environments.
> ACK stretching is bad bad bad for everything outside of some well
> controlled test network bubble.
I just want those in the bubble to continue to have the best performance
possible in their situation. If it is a tunable that defaults to off,
that is great.
Hmm.. rather than a global tunable, what if it was a
network driver managed tunable which toggled a flag in the
lro_mgr features? Would that be better?
Drew
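A rough sketch of the driver-managed toggle Drew describes, purely for
illustration: LRO_F_AGGREGATE_ACKS is a hypothetical feature bit that
does not exist in mainline inet_lro.h, and the check shown is only one
possible placement:
/* hypothetical sketch only: LRO_F_AGGREGATE_ACKS is an invented feature
 * bit, not something that exists in mainline inet_lro.h */
#include <linux/bitops.h>
#include <linux/inet_lro.h>
#define LRO_F_AGGREGATE_ACKS	3	/* hypothetical bit number */
/* a driver that wants stretch acks would opt in at init time */
static void hypothetical_driver_setup_lro(struct net_lro_mgr *lro_mgr)
{
	set_bit(LRO_F_AGGREGATE_ACKS, &lro_mgr->features);
}
/* ...and inet_lro would only merge zero-payload segments when the bit
 * is set, mirroring the existing LRO_F_EXTRACT_VLAN_ID test, e.g.:
 *
 *	if (tcp_data_len == 0 &&
 *	    !test_bit(LRO_F_AGGREGATE_ACKS, &lro_mgr->features))
 *		goto out;
 */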
* Re: [PATCH] LRO ack aggregation
From: Evgeniy Polyakov @ 2007-11-20 13:35 UTC
To: Andrew Gallatin; +Cc: David Miller, herbert, netdev, ossthema
Hi.
On Tue, Nov 20, 2007 at 08:27:05AM -0500, Andrew Gallatin (gallatin@myri.com) wrote:
> Hmm.. rather than a global tunable, what if it was a
> network driver managed tunable which toggled a flag in the
> lro_mgr features? Would that be better?
What about ethtool control to set LRO_simple and LRO_ACK_aggregation?
--
Evgeniy Polyakov
* Re: [PATCH] LRO ack aggregation
From: Herbert Xu @ 2007-11-20 13:50 UTC
To: Evgeniy Polyakov; +Cc: Andrew Gallatin, David Miller, netdev, ossthema
On Tue, Nov 20, 2007 at 04:35:09PM +0300, Evgeniy Polyakov wrote:
>
> On Tue, Nov 20, 2007 at 08:27:05AM -0500, Andrew Gallatin (gallatin@myri.com) wrote:
> > Hmm.. rather than a global tunable, what if it was a
> > network driver managed tunable which toggled a flag in the
> > lro_mgr features? Would that be better?
>
> What about ethtool control to set LRO_simple and LRO_ACK_aggregation?
I have two concerns about this:
1) That same option can still be turned on by distros.
2) This doesn't make sense because the code is actually in the
core networking stack.
I'm particularly unhappy about 2) because I don't want to be in a
situation down the track where every driver is going to add this
option so that they're not left behind in the arms race.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH] LRO ack aggregation
From: Evgeniy Polyakov @ 2007-11-20 14:03 UTC
To: Herbert Xu; +Cc: Andrew Gallatin, David Miller, netdev, ossthema
On Tue, Nov 20, 2007 at 09:50:56PM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> On Tue, Nov 20, 2007 at 04:35:09PM +0300, Evgeniy Polyakov wrote:
> >
> > On Tue, Nov 20, 2007 at 08:27:05AM -0500, Andrew Gallatin (gallatin@myri.com) wrote:
> > > Hmm.. rather than a global tunable, what if it was a
> > > network driver managed tunable which toggled a flag in the
> > > lro_mgr features? Would that be better?
> >
> > What about ethtool control to set LRO_simple and LRO_ACK_aggregation?
>
> I have two concerns about this:
>
> 1) That same option can still be turned on by distros.
FC and Debian turn on hardware checksum offloading in e1000, and I have
a card where this results in more than a 10% performance _decrease_.
I do not know why, but I'm able to run a script which disables it via
ethtool.
> 2) This doesn't make sense because the code is actually in the
> core networking stack.
It depends. Software LRO can be controlled by a simple procfs switch,
but what about a hardware one? I recall it was pointed out a number of
times that hardware LRO is possible and likely being implemented in
some ASICs.
> I'm particularly unhappy about 2) because I don't want to be in a
> situation down the track where every driver is going to add this
> option so that they're not left behind in the arms race.
For software LRO I agree, but this looks exactly like the GSO/TSO case
and an additional tweak for software GSO. Having it per-system is fine,
and I believe no one should ever care that some distro will do bad/good
things with it. Actually, we already have so many tricky options in
procfs which can kill performance...
--
Evgeniy Polyakov
* Re: [PATCH] LRO ack aggregation
From: Herbert Xu @ 2007-11-20 14:08 UTC
To: Evgeniy Polyakov; +Cc: Andrew Gallatin, David Miller, netdev, ossthema
On Tue, Nov 20, 2007 at 05:03:12PM +0300, Evgeniy Polyakov wrote:
>
> For software LRO I agree, but this looks exactly like the GSO/TSO case
> and an additional tweak for software GSO. Having it per-system is fine,
> and I believe no one should ever care that some distro will do bad/good
> things with it. Actually, we already have so many tricky options in
> procfs which can kill performance...
Right, if you're doing it such that the same option automatically
shows up for every driver that uses software LRO then my second
concern goes away.
Of course we still have the problem with the option in general
that Dave raised. That is, this may cause the proliferation of
TCP receiver behaviour that may be undesirable.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH] LRO ack aggregation
From: Evgeniy Polyakov @ 2007-11-20 14:37 UTC
To: Herbert Xu; +Cc: Andrew Gallatin, David Miller, netdev, ossthema
On Tue, Nov 20, 2007 at 10:08:31PM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> Of course we still have the problem with the option in general
> that Dave raised. That is, this may cause the proliferation of
> TCP receiver behaviour that may be undesirable.
Yes, it results in bursts of traffic because of the delayed acks
accumulated in the sender's LRO engine. But in the first place, if the
receiver is slow, then it will send acks slowly and they will be
accumulated slowly, changing not only seq/ack numbers but also timings,
which is equivalent to increasing the length of the pipe between users.
TCP is able to balance on this edge. I'm sure it depends on workload,
but heavy bulk transfers, where only LRO with and without ack
aggregation can win, are quite usual on long pipes with high
performance numbers.
Until it is tested, I doubt it is possible to say whether it is 100%
good or bad, so my proposal is to write the code, make it tunable from
userspace, turn it off by default, and allow people to test the change.
--
Evgeniy Polyakov
* Re: [PATCH] LRO ack aggregation
From: Bill Fink @ 2007-11-21 6:15 UTC
To: Andrew Gallatin; +Cc: David Miller, herbert, netdev, ossthema
On Tue, 20 Nov 2007, Andrew Gallatin wrote:
> David Miller wrote:
> > From: Andrew Gallatin <gallatin@myri.com>
> > Date: Tue, 20 Nov 2007 06:47:57 -0500
> >
> >> David Miller wrote:
> >> > From: Herbert Xu <herbert@gondor.apana.org.au>
> >> > Date: Tue, 20 Nov 2007 14:09:18 +0800
> >> >
> >> >> David Miller <davem@davemloft.net> wrote:
> >> >>> Fundamentally, I really don't like this change, it batches to the
> >> >>> point where it begins to erode the natural ACK clocking of TCP, and I
> >> >>> therefore am very likely to revert it before merging to Linus.
I have mixed feelings about this topic. In general I agree with the
importance of maintaining the natural ACK clocking of TCP for normal
usage. But there may also be some special cases that could benefit
significantly from such a new LRO pure ACK aggregation feature. The
rest of my comments are in support of such a new feature, although
I haven't completely made up my own mind yet about the tradeoffs
involved in implementing such a new capability (good arguments are
being made on both sides).
> >> >> Perhaps make it a tunable that defaults to off?
> >> >
> >> > That's one idea.
> >>
> >> I'd certainly prefer the option to have a tunable to having our
> >> customers see performance regressions when they switch to
> >> the kernel's LRO.
> >
> > Please qualify this because by itself it's an inaccurate statement.
> >
> > It would cause a performance regression in situations where there is
> > nearly no packet loss, no packet reordering, and the receiver has
> > strong enough cpu power.
You are basically describing the HPC universe, which, while not the
multitudes of the general Internet, is a very real and valid special
community of interest where maximum performance is critical.
For example, we're starting to see dynamic provisioning of dedicated
10-GigE lambda paths to meet various HPC requirements, just for the
purpose of ensuring "nearly no packet loss, no packet reordering".
See for example Internet2's Dynamic Circuit Network (DCN).
In the general Internet case, many smaller flows tend to be aggregated
together up to perhaps a 10-GigE interface, while in the HPC universe,
there tend to be fewer, but much higher individual bandwidth flows.
But both are totally valid usage scenarios. So a tunable that defaults
to off for the general case makes sense to me.
> Yes, a regression of nearly 1Gb/s in some cases as I mentioned
> when I submitted the patch.
Which is a significant performance penalty. But the CPU savings may
be an even more important benefit.
> <....>
>
> > Show me something over real backbones, talking to hundreds or thousands
> > of clients scattered all over the world. That's what people will be
> > using these high-end NICs for in front-facing services, and that's where
> > loss happens and stretch ACKs hurt performance.
The HPC universe uses real backbones, just not the general Internet
backbones. Their backbones are engineered to have the characteristics
required for enabling very high performance applications.
And if performance would take a hit in the general Internet 10-GigE
server case, and that's clearly documented and understood, I don't
see what incentive the distros would have to enable the tunable for
their normal users, since why would they want to cause poorer
performance relative to other distros that stuck with the recommended
default. The special HPC users could easily enable the option if it
was desired and proven beneficial in their environment.
> I can't. I think most 10GbE on endstations is used either in the
> server room, or on dedicated links. My experience with 10GbE users is
> limited to my interactions with people using our NICs who contact our
> support. Of those, I can recall only a tiny handful who were using
> 10GbE on a normal internet facing connection (and the ones I dealt
> with were actually running a different OS). The vast majority were in
> a well controlled, lossless environment. It is quite ironic. The
> very fact that I cannot provide you with examples of internet facing
> people using LRO (w/ack aggr) in more normal applications tends to
> support my point that most 10GbE users seem to be in lossless
> environments.
Most use of 10-GigE that I'm familiar with is related to the HPC
universe, but then that's the environment I work in. I'm sure that
over time the use of 10-GigE in general Internet facing servers
will predominate, since that's where the great mass of users is.
But I would argue that that doesn't make it the sole usage arena
that matters.
> > ACK stretching is bad bad bad for everything outside of some well
> > controlled test network bubble.
It's not just for network bubbles. That's where the technology tends
to first be shaken out, but the real goal is use in real-world,
production HPC environments.
> I just want those in the bubble to continue to have the best performance
> possible in their situation. If it is a tunable that defaults to off,
> that is great.
I totally agree, and think that the tunable (defaulting to off),
allows both the general Internet and HPC users to meet their goals.
> Hmm.. rather than a global tunable, what if it was a
> network driver managed tunable which toggled a flag in the
> lro_mgr features? Would that be better?
I like that idea. In some of the configurations I deal with, a system
might have a special 10-GigE interface connected to a dedicated 10-GigE
HPC network, and also a regular GigE normal Internet connection. So
the new LRO feature could be enabled on the 10-GigE HPC interface and
left disabled on the normal GigE Internet interface.
-Bill
* Re: [PATCH] LRO ack aggregation
From: Rick Jones @ 2007-11-20 19:45 UTC
To: David Miller; +Cc: gallatin, netdev, ossthema
David Miller wrote:
> From: Andrew Gallatin <gallatin@myri.com>
> Date: Tue, 23 Oct 2007 11:11:55 -0400
>
>
>>I've attached a patch which adds support to inet_lro for aggregating
>>pure acks.
>
>
> I've applied this patch to net-2.6.25... but!
>
> This needs some serious thinking. What this patch ends up doing is creating
> big stretch-ACKs, and those can hurt performance.
>
> Stretch ACKs are particularly harmful when either the receiver is cpu
> weak (lacking enough cpu power to fill the pipe completely no matter
> what optimizations are applied) or when there is packet loss (less
> feedback information and ACK clocking).
>
> It also means that the sender will be more bursty, because it will now
> swallow ACKs covering huge portions of the send window, and then have
> large chunks of its send queue that it can send out all at once.
>
> Fundamentally, I really don't like this change, it batches to the
> point where it begins to erode the natural ACK clocking of TCP, and I
> therefore am very likely to revert it before merging to Linus.
Sounds like one might as well go ahead and implement HP-UX/Solaris-like
ACK sending avoidance at the receiver and not bother with LRO-ACK on the
sender.
In some experiments a while back I thought I saw that LRO on the
receiver was causing him to send fewer ACKs already? IIRC that was with
a Myricom card; perhaps I was fooled by the ACK LRO it was doing itself.
rick jones
* Re: [PATCH] LRO ack aggregation
From: David Miller @ 2007-11-20 22:27 UTC
To: rick.jones2; +Cc: gallatin, netdev, ossthema
From: Rick Jones <rick.jones2@hp.com>
Date: Tue, 20 Nov 2007 11:45:54 -0800
> Sounds like one might as well go ahead and implement HP-UX/Solaris-like
> ACK sending avoidance at the receiver and not bother with LRO-ACK on the
> sender.
>
> In some experiments a while back I thought I saw that LRO on the
> receiver was causing him to send fewer ACKs already? IIRC that was with
> a Myricom card; perhaps I was fooled by the ACK LRO it was doing itself.
Linux used to do aggressive ACK deferral, especially when the ucopy
code paths triggered.
I removed that code because I had several scenarios where it hurt more
than it helped performance, and the IETF has explicitly stated in
several documents the (proven) perils of such stretch ACKs.