* Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-11 19:59 UTC
To: netdev

Hi,

I'm trying to understand how Linux produces TCP acknowledgements for
segments received via LRO/GRO.

As far as I can see, the network driver uses GRO to collect several
received packets into one big super-skb, which is then handled during
just one call to tcp_v4_rcv(). This will eventually result in the
sending of at most one ACK packet for the entire GRO packet.

Conventional wisdom (RFC 5681) says that a receiver should send at
least one ACK for every two full-sized data segments received. The
sending TCP needs these ACKs to update its congestion window (e.g.
during slow start).

It seems to me that the current implementation in Linux may send just
one ACK for a large number of received segments. This would be a
deviation from the standard, and as a result the sender's congestion
window would grow much more slowly than intended.

Maybe I misunderstand something in the network code (likely). Could
someone please explain to me how this ACK issue is handled?

Thanks,
Joris van Rantwijk.

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Question about LRO/GRO and TCP acknowledgements
From: Ben Hutchings @ 2011-06-12 3:43 UTC
To: Joris van Rantwijk; +Cc: netdev

On Sat, 2011-06-11 at 21:59 +0200, Joris van Rantwijk wrote:
> [...]
> It seems to me that the current implementation in Linux may send
> just one ACK for a large number of received segments. This would
> be a deviation from the standard. As a result the congestion
> window of the sender would grow much slower than intended.

This was a problem in older versions of Linux (and still is on other
network stacks that aren't aware of LRO).

> Maybe I misunderstand something in the network code (likely).
> Could someone please explain me how this ACK issue is handled?

LRO implementations (and GRO) are expected to put the actual segment
size in skb_shinfo(skb)->gso_size on the aggregated skb. TCP will then
use that rather than the aggregated payload size when deciding whether
to defer an ACK.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
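[Editor's note: Ben's point — that TCP consults the per-segment gso_size rather than the aggregated skb length when deciding whether an ACK may be deferred — can be illustrated with a toy model. This is a simplified sketch of the kernel's ACK-deferral test, not kernel code; the threshold condition is reduced to its essence and the window-update and pingpong conditions are omitted.]

```python
def ack_now(unacked_bytes, rcv_mss):
    """Send an ACK immediately once more than one full-sized segment
    is pending (a simplified model of TCP's delayed-ACK bypass)."""
    return unacked_bytes > rcv_mss

bundle = 8 * 1448  # eight MSS-sized frames coalesced into one GRO skb

# With gso_size exported, rcv_mss stays at the real segment size, so
# the bundle immediately exceeds one segment and an ACK goes out:
print(ack_now(bundle, 1448))    # True

# If TCP instead learned the aggregated size as its "MSS", the same
# bundle would look like a single segment and the ACK would be delayed:
print(ack_now(bundle, bundle))  # False
```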
* Re: Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-12 7:51 UTC
To: Ben Hutchings; +Cc: netdev

On 2011-06-12, Ben Hutchings <bhutchings@solarflare.com> wrote:
> LRO implementations (and GRO) are expected to put the actual segment
> size in skb_shinfo(skb)->gso_size on the aggregated skb. TCP will
> then use that rather than the aggregated payload size when deciding
> whether to defer an ACK.

Thanks. I see that gso_size is indeed being used for MSS calculations
instead of the total GRO size.

However, I'm not sure that this completely answers my question. I am
not so much concerned about quick ACK vs. delayed ACK. Instead, I'm
looking at the total number of ACKs transmitted. The sender depends on
the _number_ of ACKs to update its congestion window.

As far as I can see, the current code will send just one ACK per
coalesced GRO bundle, while the sender expects one ACK per two
segments.

Thanks,
Joris.
* Re: Question about LRO/GRO and TCP acknowledgements
From: Eric Dumazet @ 2011-06-12 9:07 UTC
To: Joris van Rantwijk; +Cc: Ben Hutchings, netdev

Le dimanche 12 juin 2011 à 09:51 +0200, Joris van Rantwijk a écrit :
> [...]
> As far as I can see, current code will send just one ACK per coalesced
> GRO bundle, while the sender expects one ACK per two segments.

One ACK carries an implicit ack for _all_ previous segments. If the
sender only 'counts' ACKs, it is a bit dumb...

10:05:02.755146 IP 192.168.20.110.57736 > 192.168.20.108.53563: SWE 96444459:96444459(0) win 14600 <mss 1460,sackOK,timestamp 12174491 0,nop,wscale 8>
10:05:02.755242 IP 192.168.20.108.53563 > 192.168.20.110.57736: SE 1849523184:1849523184(0) ack 96444460 win 14480 <mss 1460,sackOK,timestamp 15334585 12174491,nop,wscale 7>
10:05:02.755310 IP 192.168.20.110.57736 > 192.168.20.108.53563: . ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755369 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 1:1449(1448) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755417 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 1449 win 136 <nop,nop,timestamp 15334585 12174491>
10:05:02.755428 IP 192.168.20.110.57736 > 192.168.20.108.53563: P 1449:8689(7240) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755476 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 8689 win 159 <nop,nop,timestamp 15334585 12174491>
10:05:02.755482 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 8689:13033(4344) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755529 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 13033 win 181 <nop,nop,timestamp 15334585 12174491>
10:05:02.755535 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 13033:14481(1448) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755582 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 14481 win 204 <nop,nop,timestamp 15334585 12174491>
10:05:02.755588 IP 192.168.20.110.57736 > 192.168.20.108.53563: P 14481:16385(1904) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755635 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 16385 win 227 <nop,nop,timestamp 15334585 12174491>
10:05:02.755641 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 16385:23625(7240) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755689 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 23625 win 249 <nop,nop,timestamp 15334585 12174491>
10:05:02.755695 IP 192.168.20.110.57736 > 192.168.20.108.53563: P 23625:26521(2896) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755742 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 26521 win 272 <nop,nop,timestamp 15334585 12174491>
10:05:02.755750 IP 192.168.20.110.57736 > 192.168.20.108.53563: P 26521:33761(7240) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755796 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 33761 win 295 <nop,nop,timestamp 15334585 12174491>
10:05:02.755802 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 33761:39553(5792) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755849 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 39553 win 259 <nop,nop,timestamp 15334585 12174491>
* Re: Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-12 9:30 UTC
To: Eric Dumazet; +Cc: netdev

On 2011-06-12, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > As far as I can see, current code will send just one ACK per
> > coalesced GRO bundle, while the sender expects one ACK per two
> > segments.
>
> One ACK carries an implicit ack for _all_ previous segments. If sender
> only 'counts' ACKs, it is a bit dumb...

It may be dumb, but it's what the RFCs recommend and it's what Linux
implements.

RFC 5681:
  "During slow start, a TCP increments cwnd by at most SMSS bytes for
  each ACK received that cumulatively acknowledges new data."

In Linux, each incoming ACK causes one call to tcp_cong_avoid(), which
causes one call to tcp_slow_start() - assuming the connection is in
slow start - which increases the congestion window by one MSS. Am I
mistaken?

Please note I'm talking about managing the congestion window. Of
course I agree that each ACK implicitly covers all previous segments
for the purpose of retransmission management. But congestion
management is a different story.

Joris.
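[Editor's note: Joris's concern can be made concrete with a small simulation of per-ACK slow start. This is a sketch under stated assumptions — one cwnd increment per ACK (as in tcp_slow_start()), a fixed receiver ACK ratio, no losses, and no delayed-ACK timers — not a model of the full stack.]

```python
def slow_start(cwnd, rounds, segs_per_ack):
    """Grow cwnd (in segments) over a number of RTT rounds, adding
    one segment per ACK received in each round."""
    for _ in range(rounds):
        acks = max(1, cwnd // segs_per_ack)  # ACKs generated this round
        cwnd += acks
    return cwnd

# RFC 5681 receiver, one ACK per two segments:
print(slow_start(10, 5, segs_per_ack=2))  # 73

# Receiver that sends one ACK per 8-segment coalesced bundle:
print(slow_start(10, 5, segs_per_ack=8))  # 15
```

Under these assumptions the window after five round trips differs by nearly a factor of five, which is the theoretical slowdown Joris is describing.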
* Re: Question about LRO/GRO and TCP acknowledgements
From: Eric Dumazet @ 2011-06-12 10:48 UTC
To: Joris van Rantwijk; +Cc: netdev

Le dimanche 12 juin 2011 à 11:30 +0200, Joris van Rantwijk a écrit :
> It may be dumb, but it's what the RFCs recommend and it's what Linux
> implements.
>
> RFC 5681:
>   "During slow start, a TCP increments cwnd by at most SMSS bytes for
>   each ACK received that cumulatively acknowledges new data."

Note the RFC also says:

  "The RECOMMENDED way to increase cwnd during congestion avoidance is
  to count the number of bytes that have been acknowledged by ACKs for
  new data."

So your concern is more a sender-side implementation missing this
recommendation, not GRO per se...

GRO kicks in when the receiver sees a train of consecutive frames in
its NAPI run. To really reduce the number of ACKs, you need to receive
3 frames in a very short time.

This leads to the RTT rule:

  "Note that during congestion avoidance, cwnd MUST NOT be increased
  by more than SMSS bytes per RTT"

So GRO, by lowering the number of ACKs, can help the sender not waste
its time on extra ACKs.
* Re: Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-12 11:24 UTC
To: Eric Dumazet; +Cc: netdev

On 2011-06-12, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Note also RFC says:
>   The RECOMMENDED way to increase cwnd during congestion avoidance is
>   to count the number of bytes that have been acknowledged by ACKs
>   for new data.

This is during the congestion avoidance phase. I'm actually more
concerned about the slow start phase, but congestion avoidance may
also be an issue.

By the way, Linux does not implement the recommended (byte-counting)
method by default. It can be enabled through the sysctl tcp_abc, which
is off by default. Also:

  "Byte counting during congestion avoidance is also recommended,
  while the method from [RFC2581] and other safe methods are still
  allowed."

> So your concern is more a Sender side implementation missing this
> recommendation, not GRO per se...

Not really. The same RFC says:

  "Specifically, an ACK SHOULD be generated for at least every second
  full-sized segment, ..."

Sender-side behaviour is just my argument for the practical importance
of this issue. But sender-side arguments are not an excuse for the
receiver to deviate from its own recommended behaviour.

> GRO kicks in when the receiver receives a train of consecutive frames
> in its NAPI run. To really reduce the number of ACKs, you need to
> receive 3 frames in a very short time.
>
> This leads to the RTT rule: "Note that during congestion avoidance,
> cwnd MUST NOT be increased by more than SMSS bytes per RTT"

But this RTT rule is already taken into account in the code that
increases cwnd during congestion avoidance. That code _assumes_ that
the receiver sends one ACK per two segments. If the receiver sends
fewer ACKs, the congestion window will grow too slowly.

> So GRO, by lowering the number of ACKs, can help the sender not waste
> its time on extra ACKs.

I can see how the world might have been a better place if every sender
implemented Appropriate Byte Counting and TCP receivers were allowed
to send fewer ACKs. However, current reality is that ABC is optional,
disabled by default in Linux, and receivers are recommended to send
one ACK per two segments.

I suspect that GRO currently hurts the throughput of isolated TCP
connections. This is based on a purely theoretical argument. I may be
wrong, and I have absolutely no data to confirm my suspicion.

If you can point out the flaw in my reasoning, I would be greatly
relieved. Until then, I remain concerned that there may be something
wrong with GRO and TCP ACKs.

Joris.
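[Editor's note: the ABC policy being debated (RFC 3465, behind the tcp_abc sysctl at the time) can be compared with segment counting in a few lines. This is an illustrative sketch of the two slow-start increase rules only, with SMSS fixed at 1448 bytes; it is not Linux's actual cwnd arithmetic.]

```python
SMSS = 1448

def increase_packet_counting(n_acks):
    # one SMSS of growth per ACK, regardless of how much data
    # each cumulative ACK covers
    return n_acks * SMSS

def increase_abc(bytes_acked, limit=2):
    # RFC 3465: cwnd += min(bytes_acked, L*SMSS) per ACK, L = 2
    return min(bytes_acked, limit * SMSS)

# One cumulative ACK covering an 8-segment GRO bundle:
print(increase_packet_counting(1))  # 1448: segment counting
print(increase_abc(8 * SMSS))       # 2896: ABC recovers some growth
# A non-coalescing receiver ACKing every 2nd segment would yield:
print(increase_packet_counting(4))  # 5792
```

This is why the thread keeps circling back to ABC: it softens, but does not eliminate, the cwnd-growth loss from coalesced ACKs.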
* Re: Question about LRO/GRO and TCP acknowledgements
From: Alexander Zimmermann @ 2011-06-12 12:01 UTC
To: Joris van Rantwijk; +Cc: Eric Dumazet, netdev

Hi Joris,

On 12.06.2011, at 13:24, Joris van Rantwijk wrote:
> By the way, Linux does not implement the recommended (byte-counting)
> method by default. It can be enabled through sysctl tcp_abc, which is
> off by default.

See http://kerneltrap.org/mailarchive/linux-netdev/2010/3/3/6271114

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//
* Re: Question about LRO/GRO and TCP acknowledgements
From: Eric Dumazet @ 2011-06-12 14:57 UTC
To: Joris van Rantwijk; +Cc: netdev

Le dimanche 12 juin 2011 à 13:24 +0200, Joris van Rantwijk a écrit :
> Not really. The same RFC says:
>   Specifically, an ACK SHOULD be generated for at least every
>   second full-sized segment, ...

Well, SHOULD is not MUST.

> I can see how the world might have been a better place if every sender
> implemented Appropriate Byte Counting and TCP receivers were allowed
> to send fewer ACKs. However, current reality is that ABC is optional,
> disabled by default in Linux, and receivers are recommended to send
> one ACK per two segments.

ABC might be nice for stacks that use byte counters for cwnd. We use
segments.

> I suspect that GRO currently hurts the throughput of isolated TCP
> connections. This is based on a purely theoretical argument. I may be
> wrong, and I have absolutely no data to confirm my suspicion.
>
> If you can point out the flaw in my reasoning, I would be greatly
> relieved. Until then, I remain concerned that there may be something
> wrong with GRO and TCP ACKs.

Think of GRO as a receiver facility against stress/load, typically in
a datacenter.

GRO kicks in only when the receiver is overloaded, and it can then
coalesce several frames before they are handled by the TCP stack in
one run.

If the receiver is so loaded that more than 2 frames are coalesced in
a NAPI run, it certainly helps to not allow the sender to increase its
cwnd by more than one SMSS. We are probably right before packet drops
anyway.
* Re: Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-12 19:37 UTC
To: Eric Dumazet; +Cc: netdev

On 2011-06-12, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Think of GRO as a receiver facility against stress/load, typically in
> a datacenter.
>
> GRO kicks in only when the receiver is overloaded, and it can then
> coalesce several frames before they are handled by the TCP stack in
> one run.

Ok, it now becomes clear to me that I have a different scenario in
mind than GRO was designed to handle. I'm interested in LRO as a
method to sustain 1 Gbit through a single TCP connection on a slow
embedded computer.

> If the receiver is so loaded that more than 2 frames are coalesced in
> a NAPI run, it certainly helps to not allow the sender to increase its
> cwnd by more than one SMSS. We are probably right before packet drops
> anyway.

Right. So unlike TSO, GRO is not a transparent, generally applicable
performance improvement. It's more like a form of graceful
degradation, helping a server to sustain overall throughput when it is
already swamped in TCP traffic.

Thanks for your clarification. This has certainly cleared up some
confusion on my side.

Joris.
* Re: Question about LRO/GRO and TCP acknowledgements
From: Ilpo Järvinen @ 2011-06-14 10:53 UTC
To: Joris van Rantwijk; +Cc: Eric Dumazet, Netdev

On Sun, 12 Jun 2011, Joris van Rantwijk wrote:
> Ok, it now becomes clear to me that I have a different scenario in
> mind than GRO was designed to handle. I'm interested in LRO as a
> method to sustain 1 Gbit through a single TCP connection on a slow
> embedded computer.
> [...]
> Thanks for your clarification. This has certainly cleared up some
> confusion on my side.

BTW, it wouldn't be impossible to create all those "missing" ACKs at
the TCP layer relatively cheaply when receiving the GRO'ed super
segment. I'm certainly not opposed to you coming up with such a patch
that does the minimal work needed at the TCP layer, but I think it
also requires some TSO/GSO-related problem solving, because TSO/GSO as
it stands won't let you create the kind of super-ACKs we'd want to
send out in that single go.

-- 
 i.
* Re: Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-14 19:37 UTC
To: Ilpo Järvinen; +Cc: Netdev

On 2011-06-14, "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> BTW, it wouldn't be impossible to create all those "missing" ACKs at
> the TCP layer relatively cheaply when receiving the GRO'ed super
> segment. [...]

Your super-ACK idea is similar to the solution presented in this
paper:
http://www.usenix.org/event/usenix08/tech/full_papers/menon/menon_html/

Actually, I started looking at the GRO code after reading that paper,
hoping to find that Linux has a better way to deal with ACKs.

The super ACK doesn't look easy. It must contain all the different
ack_seq values to avoid tripping duplicate-ACK detection. Ideally, the
ack_seq values would match real seq values from the received segments.

I don't currently have a setup where I could test these kinds of
changes, so this doesn't seem like a job for me. At least not right
now.

Thanks,
Joris.
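[Editor's note: the shape of these "missing" ACKs is easy to enumerate, even if generating them cheaply in the stack is not. Below is a sketch of the ack_seq values a non-coalescing, ACK-every-two-segments receiver would have produced for one GRO bundle; `synth_ack_seqs` is a hypothetical helper for illustration, not kernel code.]

```python
def synth_ack_seqs(start_seq, total_len, mss):
    """ack_seq values for one ACK per two full segments, plus a final
    ACK covering any tail. Each value lands on a real segment
    boundary, so none would trip duplicate-ACK detection."""
    acks, end, seq = [], start_seq + total_len, start_seq
    while seq < end:
        seq = min(seq + 2 * mss, end)
        acks.append(seq)
    return acks

# The 7240-byte (5 x 1448) bundle seen in Eric's trace:
print(synth_ack_seqs(1, 7240, 1448))  # [2897, 5793, 7241]
```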
* Re: Question about LRO/GRO and TCP acknowledgements
From: Rick Jones @ 2011-06-13 17:55 UTC
To: Eric Dumazet; +Cc: Joris van Rantwijk, netdev

On Sun, 2011-06-12 at 16:57 +0200, Eric Dumazet wrote:
> [...]
> Think of GRO as a receiver facility against stress/load, typically in
> a datacenter.
>
> GRO kicks in only when the receiver is overloaded, and it can then
> coalesce several frames before they are handled by the TCP stack in
> one run.

How is that affected by interrupt coalescing in the NIC and the
sending side doing TSO (and so, ostensibly, sending back-to-back
frames)? Are we assured that a NIC is updating its completion pointer
on the rx ring continuously rather than just before a coalesced
interrupt? Does GRO "never" kick in over a 1GbE link (making the
handwaving assumption that cores today are >> faster than a 1GbE link
on a bulk transfer)?

It was just a quick and dirty test, but it does seem there is a
positive hit from GRO being enabled on a 1GbE link on a system with
"fast processors":

raj@tardy:~/netperf2_trunk$ sudo ethtool -K eth1 gro off
raj@tardy:~/netperf2_trunk$ src/netperf -t TCP_MAERTS -H 192.168.1.3 -i 10,3 -c -- -k foo
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.3 (192.168.1.3) port 0 AF_INET : +/-2.500% @ 99% conf. : histogram : demo
THROUGHPUT=935.07
LOCAL_INTERFACE_NAME=eth1
LOCAL_CPU_UTIL=16.64
LOCAL_SD=5.830

raj@tardy:~/netperf2_trunk$ sudo ethtool -K eth1 gro on
raj@tardy:~/netperf2_trunk$ src/netperf -t TCP_MAERTS -H 192.168.1.3 -i 10,3 -c -- -k foo
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.3 (192.168.1.3) port 0 AF_INET : +/-2.500% @ 99% conf. : histogram : demo
THROUGHPUT=934.81
LOCAL_INTERFACE_NAME=eth1
LOCAL_CPU_UTIL=16.21
LOCAL_SD=5.684

raj@tardy:~/netperf2_trunk$ uname -a
Linux tardy 2.6.35-28-generic #50-Ubuntu SMP Fri Mar 18 18:42:20 UTC 2011 x86_64 GNU/Linux

The receiver system here has a 3.07 GHz W3550 in it and eth1 is a port
on an Intel 82571EB-based four-port card.

raj@tardy:~/netperf2_trunk$ ethtool -i eth1
driver: e1000e
version: 1.0.2-k4
firmware-version: 5.10-2
bus-info: 0000:2a:00.0

> If the receiver is so loaded that more than 2 frames are coalesced in
> a NAPI run, it certainly helps to not allow the sender to increase its
> cwnd by more than one SMSS. We are probably right before packet drops
> anyway.

If we are indeed statistically certain we are right before packet
drops (or, I suppose, asserting pause), then shouldn't ECN get set by
the GRO code?

rick
* Re: Question about LRO/GRO and TCP acknowledgements
From: Rick Jones @ 2011-06-13 17:34 UTC
To: Joris van Rantwijk; +Cc: netdev

On Sat, 2011-06-11 at 21:59 +0200, Joris van Rantwijk wrote:
> [...]
> It seems to me that the current implementation in Linux may send
> just one ACK for a large number of received segments. This would
> be a deviation from the standard. As a result the congestion
> window of the sender would grow much slower than intended.

FWIW, HP-UX and Solaris stacks (perhaps others) have had ACK-avoidance
heuristics in place since the mid-to-late 1990s, and the ACK avoidance
of LRO/GRO is quite similar "on the wire." Have those heuristics been
stellar? Probably not, but they've not, it seems, caused massive
problems, and when one has TSO and LRO/GRO the overhead of
ACK-every-other-MSS processing becomes non-trivial. Even more so when
copy-avoidance is present.

I'd go with increasing cwnd by the bytes ACKed, not the ACK count.

rick jones