From: Lawrence Brakmo
Subject: [PATCH net-next 0/2] tcp: fix high tail latencies in DCTCP
Date: Fri, 29 Jun 2018 18:48:13 -0700
Message-ID: <20180630014815.2881895-1-brakmo@fb.com>
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Neal Cardwell,
    Yuchung Cheng, Steve Ibanez, Eric Dumazet

We have observed high tail latencies when using DCTCP for RPCs as
compared to using Cubic. For example, in one setup there are 2 hosts
sending to a 3rd one, with each sender having 3 flows (1 stream,
1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
table shows the 99% and 99.9% latencies for both Cubic and dctcp:

            Cubic 99%   Cubic 99.9%   dctcp 99%   dctcp 99.9%
 1MB RPCs      2.6ms        5.5ms        43ms        208ms
10KB RPCs      1.1ms        1.3ms        53ms        212ms

Looking at tcpdump traces, we found two causes for the latency:

  1) RTOs caused by the receiver sending a dup ACK and not ACKing
     the last (and only) packet sent.
  2) Delaying ACKs when the sender has a cwnd of 1, so everything
     pauses for the duration of the delayed ACK.

The first patch fixes the cause of the dup ACKs: DCTCP state was not
updated when an ACK that was initially delayed was later sent with a
data packet (a sketch of the idea appears after the diffstat below).

The second patch ensures that an ACK is sent immediately when a
CWR-marked packet arrives (second sketch below).

With the patches, the latencies for DCTCP now look like:

            dctcp 99%   dctcp 99.9%
 1MB RPCs      4.8ms        6.5ms
10KB RPCs      143us        184us

Note that while the 1MB RPC tail latencies are higher than Cubic's,
the 10KB latencies are much smaller than Cubic's. These patches fix
issues on the receiver, but tcpdump traces indicate there is an
opportunity to also fix an issue at the sender that adds about 3ms
to the tail latencies.

The following trace shows the issue that triggers an RTO (fixed by
these patches):

   Host A sends the last packets of the request
   Host B receives them, and the last packet is marked with
      congestion (CE)
   Host B sends ACKs for packets not marked with congestion
   Host B sends data packet with reply and ACK for packet marked
      with congestion (TCP flag ECE)
   Host A receives ACKs with no ECE flag
   Host A receives data packet with ACK for the last packet of
      request and which has TCP ECE bit set
   Host A sends 1st data packet of the next request with TCP flag CWR
   Host B receives the packet (as seen in tcpdump at B), no CE flag
   Host B sends a dup ACK that also has the TCP ECE flag
   Host A RTO timer fires!
   Host A retransmits the packet
   Host A receives an ACK for everything it has sent (i.e. Host B
      did receive the 1st packet of the request)
   Host A sends more packets…

[PATCH net-next 1/2] tcp: notify when a delayed ack is sent
[PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

 net/ipv4/tcp_input.c  | 25 +++++++++++++++++--------
 net/ipv4/tcp_output.c |  2 ++
 2 files changed, 19 insertions(+), 8 deletions(-)
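
To map the patch 1 description onto the code: below is a minimal
sketch of the idea, not the actual patch. It assumes the
tcp_event_ack_sent() helper in net/ipv4/tcp_output.c and the
CA_EVENT_NON_DELAYED_ACK congestion-control event that DCTCP consumed
in net-next of this era; the exact hunk in the real patch may differ.

/* Sketch of the patch 1 idea (not the actual patch).
 * tcp_event_ack_sent() runs whenever an outgoing segment carries an
 * ACK, including a data packet that piggybacks an ACK that had been
 * sitting in the delayed-ACK timer. Notifying the congestion control
 * here lets DCTCP clear its delayed-ACK bookkeeping, so it does not
 * later emit the stale dup ACK described in the trace above.
 */
static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
{
	/* New: if a delayed ACK was scheduled, tell the CA module that
	 * the ACK is going out now (helper names per net-next, mid-2018).
	 */
	if (inet_csk_ack_scheduled(sk))
		tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);

	tcp_dec_quickack_mode(sk, pkts);
	inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
}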
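
And a sketch of the patch 2 idea in net/ipv4/tcp_input.c, again
hedged: it assumes the tcp_ecn_accept_cwr() helper and the
tcp_enter_quickack_mode() helper with its max_quickacks argument from
net-next of that era; the switch of the first parameter to struct
sock * and the count of 1 are assumptions of the sketch.

/* Sketch of the patch 2 idea (not the actual patch).
 * A CWR-marked packet means the sender has just reduced its cwnd,
 * possibly to a single packet. If the receiver delays this ACK and
 * the sender has nothing else in flight, the connection stalls for
 * the whole delayed-ACK timeout, which is what patch 2 avoids.
 */
static void tcp_ecn_accept_cwr(struct sock *sk, const struct sk_buff *skb)
{
	if (tcp_hdr(skb)->cwr) {
		tcp_sk(sk)->ecn_flags &= ~TCP_ECN_DEMAND_CWR;

		/* Force the next ACK out immediately instead of
		 * delaying it; "1" (one immediate ACK) is an assumption.
		 */
		tcp_enter_quickack_mode(sk, 1);
	}
}

Either way, the intended effect matches the trace above: Host B ACKs
the CWR-marked packet right away rather than waiting out the
delayed-ACK timer while the sender sits at a cwnd of 1.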