Date: Thu, 25 Apr 2019 15:13:26 +0200
From: Bruno Prémont
To: Eric Dumazet
Cc: richard.purdie@linuxfoundation.org, Neal Cardwell, Yuchung Cheng,
 "David S. Miller", netdev, Alexander Kanavin, Bruce Ashfield
Subject: Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
Message-ID: <20190425151326.305b5c70@pluto.restena.lu>
In-Reply-To: <20190425095531.4fd8c53f@pluto.restena.lu>
References: <85aabf9d4f41b6c57629e736993233f80a037e59.camel@linuxfoundation.org>
 <20190424165150.1420b046@pluto.restena.lu>
 <20190425095531.4fd8c53f@pluto.restena.lu>
X-Mailing-List: netdev@vger.kernel.org

Hi Eric,

On Thu, 25 Apr 2019 09:55:31 +0200 Bruno Prémont wrote:
> Hi Eric,
>
> On Wed, 24 Apr 2019 08:47:27 -0700 Eric Dumazet wrote:
> > On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont wrote:
> > >
> > > Hi Eric,
> > >
> > > I'm seeing issues with this patch as well, not as regularly as
> > > Richard does, but still (up to roughly one in 30-50 TCP sessions).
> > >
> > > In my case I have a virtual machine (on VMware) with this patch where
> > > NGINX, as a reverse proxy, misses part (the end) of the payload from
> > > its upstream and times out on the upstream connection (while according
> > > to tcpdump all packets, including the upstream's FIN, were sent and the
> > > upstream did get ACKs from the VM).
> > >
> > > From what browsers get from NGINX it feels as if, at some point,
> > > reading from the socket or waiting for data using select() never
> > > returned the data that had arrived, as more than just the EOF is
> > > missing.
> > >
> > > The upstream is a hardware machine in the same subnet.
> > >
> > > My VM is using a VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> > > as network adapter, which lists the following features:
> > >
> >
> > Hi Bruno.
> >
> > I suspect an EPOLLIN notification being lost by the application.
> >
> > The fact that the TCP backlog contains 1 instead of 2+ packets should
> > not change stack behavior; this packet should land in the socket
> > receive queue eventually.
> >
> > Are you using epoll() in Edge Trigger mode? You mention select(), but
> > select() is a rather old and inefficient API.
>
> nginx is using epoll (c.f. http://nginx.org/en/docs/events.html)
>
> For the source, see here:
> https://trac.nginx.org/nginx/browser/nginx/src/event/modules/ngx_epoll_module.c?rev=ebf8c9686b8ce7428f975d8a567935ea3722da70

I've been trying to reproduce while running nginx under strace (with
verbose logging), but have not been successful. It looks like tracing
modifies the timing in a way that prevents the issue.
Without verbose logging I could hit it (though it needed far more
perseverance).

What I do have for a working session under strace is:

socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 12
ioctl(12, FIONBIO, [1]) = 0
epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221158320, u64=139682147558832}}) = 0
connect(12, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "2001:a18:1:6::217", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
epoll_wait(9, [{EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57230) = 1
getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
writev(12, [{""..., 830}, {""..., 115}], 2) = 945
epoll_wait(9, [{EPOLLIN|EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57220) = 1
recvfrom(12, ""..., 4096, 0, NULL, NULL) = 4096
readv(12, [{""..., 4096}], 1) = 4096
readv(12, [{""..., 4096}], 1) = 4096
readv(12, [{""..., 4096}], 1) = 4096
readv(12, [{""..., 4096}], 1) = 4096
readv(12, [{""..., 4096}], 1) = 4096
readv(12, [{""..., 4096}], 1) = 3855
epoll_wait(9, [{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1221158320, u64=139682147558832}}], 512, 57190) = 1
readv(12, [{"", 241}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}], 7) = 0
close(12) = 0

Under a failing session I got:

socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
ioctl(6, FIONBIO, [1]) = 0
epoll_ctl(86, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221159521, u64=139682147560033}}) = 0
connect(6, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("158.64.2.217")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
writev(6, [{""..., 830}, {""..., 120}], 2) = 950
epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
recvfrom(6, ""..., 4096, 0, NULL, NULL) = 1740
epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
readv(6, [{""..., 2356}], 1) = 2356
readv(6, [{""..., 4096}], 1) = 3362
epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
readv(6, [{""..., 734}, {""..., 4096}], 2) = 4830
readv(6, [{""..., 4096}], 1) = 4096
readv(6, [{""..., 4096}], 1) = 4096
readv(6, [{""..., 4096}], 1) = 4096
readv(6, [{""..., 4096}], 1) = 3619
epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
... some other socket with activity and epoll_wait calls for 50 seconds
epoll_wait(86, [], 512, 3331) = 0
close(6) = 0

So EPOLLRDHUP never seems to happen here.

As a side-note, in the strace output (over lots of sessions) I see almost
no readv() returning -1 EAGAIN, except for the first readv() following
recvfrom().

I hope this helps in looking for the cause, as does the fact that not
collapsing FIN packets prevents the issue.

Cheers,
Bruno

> > Could you watch/report the output of " ss -temoi " for the frozen TCP flow?
>
> Here it is, three distinct reproducing attempts:
> State  Recv-Q  Send-Q  Local Address:Port          Peer Address:Port
> ESTAB  0       0       158.64.2.228:44248          158.64.2.217:webcache    uid:83 ino:13245 sk:87 <->
>        skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.24/0.118 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28381 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 482.7Mbps lastsnd:46810 lastrcv:46790 lastack:46790 pacing_rate 965.3Mbps delivery_rate 74.3Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.156
>
> ESTAB  0       0       2001:a18:1:6::228:33572     2001:a18:1:6::217:webcache   uid:83 ino:16699 sk:e1 <->
>        skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.231/0.11 ato:40 mss:1428 rcvmss:1428 advmss:1428 cwnd:10 bytes_acked:948 bytes_received:28474 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 494.5Mbps lastsnd:8380 lastrcv:8360 lastack:8360 pacing_rate 989.1Mbps delivery_rate 71.0Mbps app_limited rcv_rtt:1.109 rcv_space:14280 minrtt:0.161
>
> ESTAB  0       0       158.64.2.228:44578          158.64.2.217:webcache    uid:83 ino:17628 sk:12c <->
>        skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.279/0.136 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28481 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 415.2Mbps lastsnd:11360 lastrcv:11330 lastack:11340 pacing_rate 828.2Mbps delivery_rate 61.9Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.187
>
> From the nginx debug logging I don't get a real clue, though it seems that
> for working connections the last event obtained is 2005 (EPOLLMSG |
> EPOLLWRBAND | EPOLLWRNORM | EPOLLRDBAND | EPOLLRDNORM | EPOLLHUP |
> EPOLLIN | EPOLLOUT), previous ones being 5, while for failing connections
> the last event seen is 5 (EPOLLIN | EPOLLOUT).
>
> > This might give us a clue about packets being dropped, say that the
> > accumulated packet became too big.
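
For reference, below is a minimal, hypothetical sketch (not nginx's actual
event loop) of the drain-until-EAGAIN pattern an edge-triggered (EPOLLET)
epoll consumer relies on. It only illustrates the API contract visible in
the strace output above: once a readiness edge has been drained, nothing
more is read until the next EPOLLIN/EPOLLRDHUP edge arrives, so if that
final edge never comes, the remaining data and the EOF are never picked up.

/*
 * Hypothetical, standalone sketch (not nginx's code) of the
 * drain-until-EAGAIN pattern used with edge-triggered epoll:
 * each readiness edge must be drained until read() returns
 * EAGAIN or 0, because no further notification is guaranteed
 * for data that is already queued.
 */
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 1 if drained (EAGAIN), 0 on EOF, -1 on error. */
static int drain(int fd)
{
	char buf[4096];

	for (;;) {
		ssize_t n = read(fd, buf, sizeof(buf));

		if (n > 0)
			continue;	/* consume and keep reading        */
		if (n == 0)
			return 0;	/* EOF: the peer's FIN was seen    */
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			return 1;	/* fully drained, wait for an edge */
		return -1;
	}
}

/* sockfd must be a connected, non-blocking socket. */
int watch(int sockfd)
{
	struct epoll_event ev = {
		.events = EPOLLIN | EPOLLRDHUP | EPOLLET,
		.data.fd = sockfd,
	};
	int epfd = epoll_create1(0);

	if (epfd < 0)
		return -1;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev) < 0) {
		close(epfd);
		return -1;
	}

	for (;;) {
		struct epoll_event out;
		int n = epoll_wait(epfd, &out, 1, -1);

		if (n <= 0)
			break;			/* error (or timeout)     */
		if (drain(out.data.fd) <= 0)
			break;			/* EOF or error: finished */
	}
	close(epfd);
	return 0;
}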
>
> The following minor patch (might be white-space mangled) does prevent the
> issue happening for me:
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 4904250a9aac..c102cd367c79 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1667,7 +1667,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>  	if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
>  	    TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
>  	    ((TCP_SKB_CB(tail)->tcp_flags |
> -	      TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
> +	      TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_URG | TCPHDR_FIN)) ||
>  	    ((TCP_SKB_CB(tail)->tcp_flags ^
>  	      TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
>  #ifdef CONFIG_TLS_DEVICE
>
> Cheers,
> Bruno
>
> > > rx-checksumming: on
> > > tx-checksumming: on
> > > tx-checksum-ipv4: off [fixed]
> > > tx-checksum-ip-generic: on
> > > tx-checksum-ipv6: off [fixed]
> > > tx-checksum-fcoe-crc: off [fixed]
> > > tx-checksum-sctp: off [fixed]
> > > scatter-gather: on
> > > tx-scatter-gather: on
> > > tx-scatter-gather-fraglist: off [fixed]
> > > tcp-segmentation-offload: on
> > > tx-tcp-segmentation: on
> > > tx-tcp-ecn-segmentation: off [fixed]
> > > tx-tcp-mangleid-segmentation: off
> > > tx-tcp6-segmentation: on
> > > udp-fragmentation-offload: off
> > > generic-segmentation-offload: on
> > > generic-receive-offload: on
> > > large-receive-offload: on
> > > rx-vlan-offload: on
> > > tx-vlan-offload: on
> > > ntuple-filters: off [fixed]
> > > receive-hashing: off [fixed]
> > > highdma: on
> > > rx-vlan-filter: on [fixed]
> > > vlan-challenged: off [fixed]
> > > tx-lockless: off [fixed]
> > > netns-local: off [fixed]
> > > tx-gso-robust: off [fixed]
> > > tx-fcoe-segmentation: off [fixed]
> > > tx-gre-segmentation: off [fixed]
> > > tx-gre-csum-segmentation: off [fixed]
> > > tx-ipxip4-segmentation: off [fixed]
> > > tx-ipxip6-segmentation: off [fixed]
> > > tx-udp_tnl-segmentation: off [fixed]
> > > tx-udp_tnl-csum-segmentation: off [fixed]
> > > tx-gso-partial: off [fixed]
> > > tx-sctp-segmentation: off [fixed]
> > > tx-esp-segmentation: off [fixed]
> > > tx-udp-segmentation: off [fixed]
> > > fcoe-mtu: off [fixed]
> > > tx-nocache-copy: off
> > > loopback: off [fixed]
> > > rx-fcs: off [fixed]
> > > rx-all: off [fixed]
> > > tx-vlan-stag-hw-insert: off [fixed]
> > > rx-vlan-stag-hw-parse: off [fixed]
> > > rx-vlan-stag-filter: off [fixed]
> > > l2-fwd-offload: off [fixed]
> > > hw-tc-offload: off [fixed]
> > > esp-hw-offload: off [fixed]
> > > esp-tx-csum-hw-offload: off [fixed]
> > > rx-udp_tunnel-port-offload: off [fixed]
> > > tls-hw-tx-offload: off [fixed]
> > > tls-hw-rx-offload: off [fixed]
> > > rx-gro-hw: off [fixed]
> > > tls-hw-record: off [fixed]
> > >
> > > I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.
> > >
> > > Cheers,
> > > Bruno
> > >
> > > On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@linuxfoundation.org wrote:
> > > > Hi,
> > > >
> > > > I've been chasing down why a python test from the python3 testsuite
> > > > started failing and it seems to point to this kernel change in the
> > > > networking stack.
> > > >
> > > > In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> > > > test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> > > > 5.0-rc1 but not 4.18, 4.19 or 4.20).
> > > > The reproducer is:
> > > >
> > > > $ python3 -m test test_httplib -v
> > > > == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> > > > == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> > > > == cwd: /var/volatile/tmp/test_python_288
> > > > == CPU count: 1
> > > > == encodings: locale=UTF-8, FS=utf-8
> > > > [...]
> > > > test_response_fileno (test.test_httplib.BasicTest) ...
> > > >
> > > > and it hangs in test_response_fileno.
> > > >
> > > > The test in question comes from Lib/test/test_httplib.py in the python
> > > > source tree and the code is:
> > > >
> > > >     def test_response_fileno(self):
> > > >         # Make sure fd returned by fileno is valid.
> > > >         serv = socket.socket(
> > > >             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
> > > >         self.addCleanup(serv.close)
> > > >         serv.bind((HOST, 0))
> > > >         serv.listen()
> > > >
> > > >         result = None
> > > >         def run_server():
> > > >             [conn, address] = serv.accept()
> > > >             with conn, conn.makefile("rb") as reader:
> > > >                 # Read the request header until a blank line
> > > >                 while True:
> > > >                     line = reader.readline()
> > > >                     if not line.rstrip(b"\r\n"):
> > > >                         break
> > > >                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
> > > >                 nonlocal result
> > > >                 result = reader.read()
> > > >
> > > >         thread = threading.Thread(target=run_server)
> > > >         thread.start()
> > > >         self.addCleanup(thread.join, float(1))
> > > >         conn = client.HTTPConnection(*serv.getsockname())
> > > >         conn.request("CONNECT", "dummy:1234")
> > > >         response = conn.getresponse()
> > > >         try:
> > > >             self.assertEqual(response.status, client.OK)
> > > >             s = socket.socket(fileno=response.fileno())
> > > >             try:
> > > >                 s.sendall(b"proxied data\n")
> > > >             finally:
> > > >                 s.detach()
> > > >         finally:
> > > >             response.close()
> > > >             conn.close()
> > > >         thread.join()
> > > >         self.assertEqual(result, b"proxied data\n")
> > > >
> > > > I was hoping someone with more understanding of the networking stack
> > > > could look at this and tell whether it's a bug in the python test, the
> > > > kernel change, or otherwise give a pointer to where the problem might
> > > > be? I'll freely admit this is not an area I know much about.
> > > >
> > > > Cheers,
> > > >
> > > > Richard
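
Coming back to the coalescing check touched by the patch quoted further up:
the guard in tcp_add_backlog() (net/ipv4/tcp_ipv4.c) looks roughly like the
sketch below, shown with the proposed TCPHDR_FIN addition and annotated
with comments. It is paraphrased and abbreviated (the real check continues
with further comparisons, e.g. the CONFIG_TLS_DEVICE case visible in the
hunk above), not a verbatim copy of the kernel source. The effect of adding
TCPHDR_FIN is that a FIN-carrying segment is never folded into the previous
backlog skb:

	/* Abbreviated, annotated sketch of the backlog coalescing guard,
	 * with the proposed TCPHDR_FIN addition; not verbatim kernel code.
	 */
	if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||	 /* not contiguous with tail skb */
	    TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield || /* TOS/ECN byte differs */
	    ((TCP_SKB_CB(tail)->tcp_flags |
	      TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_URG | TCPHDR_FIN)) ||  /* never merge URG/FIN segments */
	    ((TCP_SKB_CB(tail)->tcp_flags ^
	      TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)))	 /* ECN flags must match */
		goto no_coalesce;	/* keep skb as its own packet in the backlog */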