* Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-11 19:59 UTC
To: netdev

Hi,

I'm trying to understand how Linux produces TCP acknowledgements for
segments received via LRO/GRO.

As far as I can see, the network driver uses GRO to collect several
received packets into one big super-skb, which is then handled during
just one call to tcp_v4_rcv(). This will eventually result in the
sending of at most one ACK packet for the entire GRO packet.

Conventional wisdom (RFC 5681) says that a receiver should send at
least one ACK for every two full-sized data segments received. The
sending TCP needs these ACKs to update its congestion window (e.g.
during slow start).

It seems to me that the current implementation in Linux may send just
one ACK for a large number of received segments. This would be a
deviation from the standard, and as a result the sender's congestion
window would grow much more slowly than intended.

Maybe I misunderstand something in the network code (likely). Could
someone please explain to me how this ACK issue is handled?

Thanks,
Joris van Rantwijk.

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Question about LRO/GRO and TCP acknowledgements
From: Ben Hutchings @ 2011-06-12 3:43 UTC
To: Joris van Rantwijk; +Cc: netdev

On Sat, 2011-06-11 at 21:59 +0200, Joris van Rantwijk wrote:
> [...]
> It seems to me that the current implementation in Linux may send
> just one ACK for a large number of received segments. This would
> be a deviation from the standard. As a result the congestion
> window of the sender would grow much slower than intended.

This was a problem in older versions of Linux (and still is on other
network stacks that aren't aware of LRO).

> Maybe I misunderstand something in the network code (likely).
> Could someone please explain me how this ACK issue is handled?

LRO implementations (and GRO) are expected to put the actual segment
size in skb_shinfo(skb)->gso_size on the aggregated skb. TCP will then
use that rather than the aggregated payload size when deciding whether
to defer an ACK.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
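[Editor's note: Ben's point — that TCP consults the per-segment gso_size rather than the aggregated skb length when deciding whether an ACK may be deferred — can be illustrated with a toy model. This is a simplified sketch of the kernel's ACK-deferral test, not kernel code; the threshold condition is reduced to its essence and the window-update and pingpong conditions are omitted.]

```python
def ack_now(unacked_bytes, rcv_mss):
    """Send an ACK immediately once more than one full-sized segment
    is pending (a simplified model of TCP's delayed-ACK bypass)."""
    return unacked_bytes > rcv_mss

bundle = 8 * 1448  # eight MSS-sized frames coalesced into one GRO skb

# With gso_size exported, rcv_mss stays at the real segment size, so
# the bundle immediately exceeds one segment and an ACK goes out:
print(ack_now(bundle, 1448))    # True

# If TCP instead learned the aggregated size as its "MSS", the same
# bundle would look like a single segment and the ACK would be delayed:
print(ack_now(bundle, bundle))  # False
```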
* Re: Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-12 7:51 UTC
To: Ben Hutchings; +Cc: netdev

On 2011-06-12, Ben Hutchings <bhutchings@solarflare.com> wrote:
> LRO implementations (and GRO) are expected to put the actual segment
> size in skb_shinfo(skb)->gso_size on the aggregated skb. TCP will
> then use that rather than the aggregated payload size when deciding
> whether to defer an ACK.

Thanks. I see that gso_size is indeed being used for MSS calculations
instead of the total GRO size.

However, I'm not sure that this completely answers my question. I am
not so much concerned about quick ACK vs. delayed ACK. Instead, I'm
looking at the total number of ACKs transmitted. The sender depends on
the _number_ of ACKs to update its congestion window.

As far as I can see, the current code will send just one ACK per
coalesced GRO bundle, while the sender expects one ACK per two
segments.

Thanks,
Joris.
* Re: Question about LRO/GRO and TCP acknowledgements
From: Eric Dumazet @ 2011-06-12 9:07 UTC
To: Joris van Rantwijk; +Cc: Ben Hutchings, netdev

Le dimanche 12 juin 2011 à 09:51 +0200, Joris van Rantwijk a écrit :
> [...]
> As far as I can see, current code will send just one ACK per coalesced
> GRO bundle, while the sender expects one ACK per two segments.

One ACK carries an implicit ack for _all_ previous segments. If the
sender only 'counts' ACKs, it is a bit dumb...

10:05:02.755146 IP 192.168.20.110.57736 > 192.168.20.108.53563: SWE 96444459:96444459(0) win 14600 <mss 1460,sackOK,timestamp 12174491 0,nop,wscale 8>
10:05:02.755242 IP 192.168.20.108.53563 > 192.168.20.110.57736: SE 1849523184:1849523184(0) ack 96444460 win 14480 <mss 1460,sackOK,timestamp 15334585 12174491,nop,wscale 7>
10:05:02.755310 IP 192.168.20.110.57736 > 192.168.20.108.53563: . ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755369 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 1:1449(1448) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755417 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 1449 win 136 <nop,nop,timestamp 15334585 12174491>
10:05:02.755428 IP 192.168.20.110.57736 > 192.168.20.108.53563: P 1449:8689(7240) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755476 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 8689 win 159 <nop,nop,timestamp 15334585 12174491>
10:05:02.755482 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 8689:13033(4344) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755529 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 13033 win 181 <nop,nop,timestamp 15334585 12174491>
10:05:02.755535 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 13033:14481(1448) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755582 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 14481 win 204 <nop,nop,timestamp 15334585 12174491>
10:05:02.755588 IP 192.168.20.110.57736 > 192.168.20.108.53563: P 14481:16385(1904) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755635 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 16385 win 227 <nop,nop,timestamp 15334585 12174491>
10:05:02.755641 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 16385:23625(7240) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755689 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 23625 win 249 <nop,nop,timestamp 15334585 12174491>
10:05:02.755695 IP 192.168.20.110.57736 > 192.168.20.108.53563: P 23625:26521(2896) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755742 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 26521 win 272 <nop,nop,timestamp 15334585 12174491>
10:05:02.755750 IP 192.168.20.110.57736 > 192.168.20.108.53563: P 26521:33761(7240) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755796 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 33761 win 295 <nop,nop,timestamp 15334585 12174491>
10:05:02.755802 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 33761:39553(5792) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755849 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 39553 win 259 <nop,nop,timestamp 15334585 12174491>
* Re: Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-12 9:30 UTC
To: Eric Dumazet; +Cc: netdev

On 2011-06-12, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > As far as I can see, current code will send just one ACK per
> > coalesced GRO bundle, while the sender expects one ACK per two
> > segments.
>
> One ACK carries an implicit ack for _all_ previous segments. If sender
> only 'counts' ACKs, it is a bit dumb...

It may be dumb, but it's what the RFCs recommend and it's what Linux
implements.

RFC 5681:
  "During slow start, a TCP increments cwnd by at most SMSS bytes for
  each ACK received that cumulatively acknowledges new data."

In Linux, each incoming ACK causes one call to tcp_cong_avoid(), which
causes one call to tcp_slow_start() - assuming the connection is in
slow start - which increases the congestion window by one MSS. Am I
mistaken?

Please note I'm talking about managing the congestion window. Of
course I agree that each ACK implicitly covers all previous segments
for the purpose of retransmission management. But congestion
management is a different story.

Joris.
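[Editor's note: Joris's concern can be made concrete with a small simulation of per-ACK slow start. This is a sketch under stated assumptions — one cwnd increment per ACK (as in tcp_slow_start()), a fixed receiver ACK ratio, no losses, and no delayed-ACK timers — not a model of the full stack.]

```python
def slow_start(cwnd, rounds, segs_per_ack):
    """Grow cwnd (in segments) over a number of RTT rounds, adding
    one segment per ACK received in each round."""
    for _ in range(rounds):
        acks = max(1, cwnd // segs_per_ack)  # ACKs generated this round
        cwnd += acks
    return cwnd

# RFC 5681 receiver, one ACK per two segments:
print(slow_start(10, 5, segs_per_ack=2))  # 73

# Receiver that sends one ACK per 8-segment coalesced bundle:
print(slow_start(10, 5, segs_per_ack=8))  # 15
```

Under these assumptions the window after five round trips differs by nearly a factor of five, which is the theoretical slowdown Joris is describing.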
* Re: Question about LRO/GRO and TCP acknowledgements
From: Eric Dumazet @ 2011-06-12 10:48 UTC
To: Joris van Rantwijk; +Cc: netdev

Le dimanche 12 juin 2011 à 11:30 +0200, Joris van Rantwijk a écrit :
> It may be dumb, but it's what the RFCs recommend and it's what Linux
> implements.
>
> RFC 5681:
>   "During slow start, a TCP increments cwnd by at most SMSS bytes for
>   each ACK received that cumulatively acknowledges new data."

Note the RFC also says:

  "The RECOMMENDED way to increase cwnd during congestion avoidance is
  to count the number of bytes that have been acknowledged by ACKs for
  new data."

So your concern is more a sender-side implementation missing this
recommendation, not GRO per se...

GRO kicks in when the receiver sees a train of consecutive frames in
its NAPI run. To really reduce the number of ACKs, you need to receive
3 frames in a very short time.

This leads to the RTT rule:

  "Note that during congestion avoidance, cwnd MUST NOT be increased
  by more than SMSS bytes per RTT"

So GRO, by lowering the number of ACKs, can help the sender not waste
its time on extra ACKs.
* Re: Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-12 11:24 UTC
To: Eric Dumazet; +Cc: netdev

On 2011-06-12, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Note also RFC says:
>   The RECOMMENDED way to increase cwnd during congestion avoidance is
>   to count the number of bytes that have been acknowledged by ACKs
>   for new data.

This is during the congestion avoidance phase. I'm actually more
concerned about the slow start phase, but congestion avoidance may
also be an issue.

By the way, Linux does not implement the recommended (byte-counting)
method by default. It can be enabled through the sysctl tcp_abc, which
is off by default. Also:

  "Byte counting during congestion avoidance is also recommended,
  while the method from [RFC2581] and other safe methods are still
  allowed."

> So your concern is more a Sender side implementation missing this
> recommendation, not GRO per se...

Not really. The same RFC says:

  "Specifically, an ACK SHOULD be generated for at least every second
  full-sized segment, ..."

Sender-side behaviour is just my argument for the practical importance
of this issue. But sender-side arguments are not an excuse for the
receiver to deviate from its own recommended behaviour.

> GRO kicks in when the receiver receives a train of consecutive frames
> in its NAPI run. To really reduce the number of ACKs, you need to
> receive 3 frames in a very short time.
>
> This leads to the RTT rule: "Note that during congestion avoidance,
> cwnd MUST NOT be increased by more than SMSS bytes per RTT"

But this RTT rule is already taken into account in the code that
increases cwnd during congestion avoidance. That code _assumes_ that
the receiver sends one ACK per two segments. If the receiver sends
fewer ACKs, the congestion window will grow too slowly.

> So GRO, by lowering the number of ACKs, can help the sender not waste
> its time on extra ACKs.

I can see how the world might have been a better place if every sender
implemented Appropriate Byte Counting and TCP receivers were allowed
to send fewer ACKs. However, current reality is that ABC is optional,
disabled by default in Linux, and receivers are recommended to send
one ACK per two segments.

I suspect that GRO currently hurts the throughput of isolated TCP
connections. This is based on a purely theoretical argument. I may be
wrong, and I have absolutely no data to confirm my suspicion.

If you can point out the flaw in my reasoning, I would be greatly
relieved. Until then, I remain concerned that there may be something
wrong with GRO and TCP ACKs.

Joris.
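[Editor's note: the ABC policy being debated (RFC 3465, behind the tcp_abc sysctl at the time) can be compared with segment counting in a few lines. This is an illustrative sketch of the two slow-start increase rules only, with SMSS fixed at 1448 bytes; it is not Linux's actual cwnd arithmetic.]

```python
SMSS = 1448

def increase_packet_counting(n_acks):
    # one SMSS of growth per ACK, regardless of how much data
    # each cumulative ACK covers
    return n_acks * SMSS

def increase_abc(bytes_acked, limit=2):
    # RFC 3465: cwnd += min(bytes_acked, L*SMSS) per ACK, L = 2
    return min(bytes_acked, limit * SMSS)

# One cumulative ACK covering an 8-segment GRO bundle:
print(increase_packet_counting(1))  # 1448: segment counting
print(increase_abc(8 * SMSS))       # 2896: ABC recovers some growth
# A non-coalescing receiver ACKing every 2nd segment would yield:
print(increase_packet_counting(4))  # 5792
```

This is why the thread keeps circling back to ABC: it softens, but does not eliminate, the cwnd-growth loss from coalesced ACKs.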
* Re: Question about LRO/GRO and TCP acknowledgements
From: Alexander Zimmermann @ 2011-06-12 12:01 UTC
To: Joris van Rantwijk; +Cc: Eric Dumazet, netdev

Hi Joris,

On 12.06.2011, at 13:24, Joris van Rantwijk wrote:
> By the way, Linux does not implement the recommended (byte-counting)
> method by default. It can be enabled through sysctl tcp_abc, which is
> off by default.

See http://kerneltrap.org/mailarchive/linux-netdev/2010/3/3/6271114

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//
* Re: Question about LRO/GRO and TCP acknowledgements
From: Eric Dumazet @ 2011-06-12 14:57 UTC
To: Joris van Rantwijk; +Cc: netdev

Le dimanche 12 juin 2011 à 13:24 +0200, Joris van Rantwijk a écrit :
> Not really. The same RFC says:
>   Specifically, an ACK SHOULD be generated for at least every
>   second full-sized segment, ...

Well, SHOULD is not MUST.

> I can see how the world might have been a better place if every sender
> implemented Appropriate Byte Counting and TCP receivers were allowed
> to send fewer ACKs. However, current reality is that ABC is optional,
> disabled by default in Linux, and receivers are recommended to send
> one ACK per two segments.

ABC might be nice for stacks that use byte counters for cwnd. We use
segments.

> I suspect that GRO currently hurts the throughput of isolated TCP
> connections. This is based on a purely theoretical argument. I may be
> wrong, and I have absolutely no data to confirm my suspicion.
>
> If you can point out the flaw in my reasoning, I would be greatly
> relieved. Until then, I remain concerned that there may be something
> wrong with GRO and TCP ACKs.

Think of GRO as a receiver facility against stress/load, typically in
a datacenter.

GRO kicks in only when the receiver is overloaded, and it can then
coalesce several frames before they are handled by the TCP stack in
one run.

If the receiver is so loaded that more than 2 frames are coalesced in
a NAPI run, it certainly helps to not allow the sender to increase its
cwnd by more than one SMSS. We are probably right before packet drops
anyway.
* Re: Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-12 19:37 UTC
To: Eric Dumazet; +Cc: netdev

On 2011-06-12, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Think of GRO as a receiver facility against stress/load, typically in
> a datacenter.
>
> GRO kicks in only when the receiver is overloaded, and it can then
> coalesce several frames before they are handled by the TCP stack in
> one run.

Ok, it now becomes clear to me that I have a different scenario in
mind than GRO was designed to handle. I'm interested in LRO as a
method to sustain 1 Gbit through a single TCP connection on a slow
embedded computer.

> If the receiver is so loaded that more than 2 frames are coalesced in
> a NAPI run, it certainly helps to not allow the sender to increase its
> cwnd by more than one SMSS. We are probably right before packet drops
> anyway.

Right. So unlike TSO, GRO is not a transparent, generally applicable
performance improvement. It's more like a form of graceful
degradation, helping a server to sustain overall throughput when it is
already swamped in TCP traffic.

Thanks for your clarification. This has certainly cleared up some
confusion on my side.

Joris.
* Re: Question about LRO/GRO and TCP acknowledgements
From: Ilpo Järvinen @ 2011-06-14 10:53 UTC
To: Joris van Rantwijk; +Cc: Eric Dumazet, Netdev

On Sun, 12 Jun 2011, Joris van Rantwijk wrote:
> Ok, it now becomes clear to me that I have a different scenario in
> mind than GRO was designed to handle. I'm interested in LRO as a
> method to sustain 1 Gbit through a single TCP connection on a slow
> embedded computer.
> [...]
> Thanks for your clarification. This has certainly cleared up some
> confusion on my side.

BTW, it wouldn't be impossible to create all those "missing" ACKs at
the TCP layer relatively cheaply when receiving the GRO'ed super
segment. I'm certainly not opposed to you coming up with such a patch
that does the minimal work needed at the TCP layer, but I think it
also requires some TSO/GSO-related problem solving, because TSO/GSO as
it stands won't let you create the kind of super-ACKs we'd want to
send out in that single go.

-- 
 i.
* Re: Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-14 19:37 UTC
To: Ilpo Järvinen; +Cc: Netdev

On 2011-06-14, "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> BTW, it wouldn't be impossible to create all those "missing" ACKs at
> the TCP layer relatively cheaply when receiving the GRO'ed super
> segment. [...]

Your super-ACK idea is similar to the solution presented in this
paper:
http://www.usenix.org/event/usenix08/tech/full_papers/menon/menon_html/

Actually, I started looking at the GRO code after reading that paper,
hoping to find that Linux has a better way to deal with ACKs.

The super ACK doesn't look easy. It must contain all the different
ack_seq values to avoid tripping duplicate-ACK detection. Ideally, the
ack_seq values would match real seq values from the received segments.

I don't currently have a setup where I could test these kinds of
changes, so this doesn't seem like a job for me. At least not right
now.

Thanks,
Joris.
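[Editor's note: the shape of these "missing" ACKs is easy to enumerate, even if generating them cheaply in the stack is not. Below is a sketch of the ack_seq values a non-coalescing, ACK-every-two-segments receiver would have produced for one GRO bundle; `synth_ack_seqs` is a hypothetical helper for illustration, not kernel code.]

```python
def synth_ack_seqs(start_seq, total_len, mss):
    """ack_seq values for one ACK per two full segments, plus a final
    ACK covering any tail. Each value lands on a real segment
    boundary, so none would trip duplicate-ACK detection."""
    acks, end, seq = [], start_seq + total_len, start_seq
    while seq < end:
        seq = min(seq + 2 * mss, end)
        acks.append(seq)
    return acks

# The 7240-byte (5 x 1448) bundle seen in Eric's trace:
print(synth_ack_seqs(1, 7240, 1448))  # [2897, 5793, 7241]
```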
* Re: Question about LRO/GRO and TCP acknowledgements
From: Rick Jones @ 2011-06-13 17:55 UTC
To: Eric Dumazet; +Cc: Joris van Rantwijk, netdev

On Sun, 2011-06-12 at 16:57 +0200, Eric Dumazet wrote:
> [...]
> Think of GRO as a receiver facility against stress/load, typically in
> a datacenter.
>
> GRO kicks in only when the receiver is overloaded, and it can then
> coalesce several frames before they are handled by the TCP stack in
> one run.

How is that affected by interrupt coalescing in the NIC and the
sending side doing TSO (and so, ostensibly, sending back-to-back
frames)? Are we assured that a NIC is updating its completion pointer
on the rx ring continuously rather than just before a coalesced
interrupt? Does GRO "never" kick in over a 1GbE link (making the
handwaving assumption that cores today are >> faster than a 1GbE link
on a bulk transfer)?

It was just a quick and dirty test, but it does seem there is a
positive hit from GRO being enabled on a 1GbE link on a system with
"fast processors":

raj@tardy:~/netperf2_trunk$ sudo ethtool -K eth1 gro off
raj@tardy:~/netperf2_trunk$ src/netperf -t TCP_MAERTS -H 192.168.1.3 -i 10,3 -c -- -k foo
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.3 (192.168.1.3) port 0 AF_INET : +/-2.500% @ 99% conf. : histogram : demo
THROUGHPUT=935.07
LOCAL_INTERFACE_NAME=eth1
LOCAL_CPU_UTIL=16.64
LOCAL_SD=5.830

raj@tardy:~/netperf2_trunk$ sudo ethtool -K eth1 gro on
raj@tardy:~/netperf2_trunk$ src/netperf -t TCP_MAERTS -H 192.168.1.3 -i 10,3 -c -- -k foo
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.3 (192.168.1.3) port 0 AF_INET : +/-2.500% @ 99% conf. : histogram : demo
THROUGHPUT=934.81
LOCAL_INTERFACE_NAME=eth1
LOCAL_CPU_UTIL=16.21
LOCAL_SD=5.684

raj@tardy:~/netperf2_trunk$ uname -a
Linux tardy 2.6.35-28-generic #50-Ubuntu SMP Fri Mar 18 18:42:20 UTC 2011 x86_64 GNU/Linux

The receiver system here has a 3.07 GHz W3550 in it and eth1 is a port
on an Intel 82571EB-based four-port card.

raj@tardy:~/netperf2_trunk$ ethtool -i eth1
driver: e1000e
version: 1.0.2-k4
firmware-version: 5.10-2
bus-info: 0000:2a:00.0

> If the receiver is so loaded that more than 2 frames are coalesced in
> a NAPI run, it certainly helps to not allow the sender to increase its
> cwnd by more than one SMSS. We are probably right before packet drops
> anyway.

If we are indeed statistically certain we are right before packet
drops (or, I suppose, asserting pause), then shouldn't ECN get set by
the GRO code?

rick
* Re: Question about LRO/GRO and TCP acknowledgements
From: Rick Jones @ 2011-06-13 17:34 UTC
To: Joris van Rantwijk; +Cc: netdev

On Sat, 2011-06-11 at 21:59 +0200, Joris van Rantwijk wrote:
> [...]
> It seems to me that the current implementation in Linux may send
> just one ACK for a large number of received segments. This would
> be a deviation from the standard. As a result the congestion
> window of the sender would grow much slower than intended.

FWIW, HP-UX and Solaris stacks (perhaps others) have had ACK-avoidance
heuristics in place since the mid-to-late 1990s, and the ACK avoidance
of LRO/GRO is quite similar "on the wire." Have those heuristics been
stellar? Probably not, but they've not, it seems, caused massive
problems, and when one has TSO and LRO/GRO the overhead of
ACK-every-other-MSS processing becomes non-trivial. Even more so when
copy-avoidance is present.

I'd go with increasing cwnd by the bytes ACKed, not the ACK count.

rick jones