Date: Wed, 24 Apr 2019 16:51:50 +0200
From: Bruno Prémont
To: Eric Dumazet
Cc:
    richard.purdie@linuxfoundation.org, Neal Cardwell, Yuchung Cheng,
    "David S. Miller", netdev@vger.kernel.org, Alexander Kanavin,
    Bruce Ashfield
Subject: Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
Message-ID: <20190424165150.1420b046@pluto.restena.lu>
In-Reply-To: <85aabf9d4f41b6c57629e736993233f80a037e59.camel@linuxfoundation.org>
References: <85aabf9d4f41b6c57629e736993233f80a037e59.camel@linuxfoundation.org>
X-Mailing-List: netdev@vger.kernel.org

Hi Eric,

I'm seeing issues with this patch as well, though less regularly than
Richard: up to about one in 30-50 TCP sessions is affected.

In my case I have a virtual machine (on VMware) running a kernel with this
patch, where NGINX acting as a reverse proxy misses the final part of the
payload from its upstream and times out on the upstream connection.
According to tcpdump, all packets including the upstream's FIN were sent,
and the upstream did get ACKs from the VM.

From what browsers receive from NGINX, it looks as if at some point reading
from the socket, or waiting for data using select(), never returned data
that had arrived, since more than just the EOF is missing.

The upstream is a hardware machine in the same subnet.
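
The symptom described above can be illustrated with a minimal sketch of a
select()-based read loop. This is not NGINX's actual code; the function name
`read_until_eof`, the timeout, and the buffer size are hypothetical choices
for illustration. It returns whatever data arrived and whether EOF was seen,
so a stalled connection shows up as a short payload with no EOF.

```python
import select
import socket


def read_until_eof(sock, timeout=5.0, bufsize=4096):
    """Read from sock until EOF and return (data, saw_eof).

    saw_eof is False when select() timed out before the peer's EOF was
    observed, i.e. the reader stalled waiting for data or FIN.
    """
    chunks = []
    while True:
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            # Neither data nor EOF became readable within the timeout.
            return b"".join(chunks), False
        chunk = sock.recv(bufsize)
        if not chunk:
            # Zero-length read: the peer closed its side (EOF).
            return b"".join(chunks), True
        chunks.append(chunk)
```

On a healthy connection where the peer sends its payload and closes, the
second element of the result is True; a False there, together with a payload
shorter than expected, matches what I described.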

My VM is using VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01) as
network adapter which lists the following features:

rx-checksumming: on
tx-checksumming: on
    tx-checksum-ipv4: off [fixed]
    tx-checksum-ip-generic: on
    tx-checksum-ipv6: off [fixed]
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: off [fixed]
    tx-tcp-mangleid-segmentation: off
    tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]

I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.
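
For comparing a feature list like the one above against another machine, a
small parser can reduce the text to the features that are switched on. This
is only a sketch: `parse_offload_features` is a hypothetical helper name, it
assumes the "name: on|off [fixed]" layout shown above, and it parses text
only, it does not invoke ethtool itself.

```python
def parse_offload_features(text):
    """Map feature name -> (enabled, fixed) from "name: on|off [fixed]" lines."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        words = value.split()
        if not sep or not words or words[0] not in ("on", "off"):
            # Skip blank lines and headers that carry no on/off value.
            continue
        features[name] = (words[0] == "on", "[fixed]" in words)
    return features


def enabled_features(text):
    """Return the sorted names of all features reported as "on"."""
    parsed = parse_offload_features(text)
    return sorted(name for name, (enabled, _) in parsed.items() if enabled)
```

Feeding the list above through `enabled_features` and doing the same on a
second host gives two short lists that are easy to diff.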

Cheers,
Bruno

On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@linuxfoundation.org wrote:
> Hi,
>
> I've been chasing down why a python test from the python3 testsuite
> started failing and it seems to point to this kernel change in the
> networking stack.
>
> In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> 5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:
>
> $ python3 -m test test_httplib -v
> == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> == cwd: /var/volatile/tmp/test_python_288
> == CPU count: 1
> == encodings: locale=UTF-8, FS=utf-8
> [...]
> test_response_fileno (test.test_httplib.BasicTest) ...
>
> and it hangs in test_response_fileno.
>
> The test in question comes from Lib/test/test_httplib.py in the python
> source tree and the code is:
>
>     def test_response_fileno(self):
>         # Make sure fd returned by fileno is valid.
>         serv = socket.socket(
>             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
>         self.addCleanup(serv.close)
>         serv.bind((HOST, 0))
>         serv.listen()
>
>         result = None
>         def run_server():
>             [conn, address] = serv.accept()
>             with conn, conn.makefile("rb") as reader:
>                 # Read the request header until a blank line
>                 while True:
>                     line = reader.readline()
>                     if not line.rstrip(b"\r\n"):
>                         break
>                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
>                 nonlocal result
>                 result = reader.read()
>
>         thread = threading.Thread(target=run_server)
>         thread.start()
>         self.addCleanup(thread.join, float(1))
>         conn = client.HTTPConnection(*serv.getsockname())
>         conn.request("CONNECT", "dummy:1234")
>         response = conn.getresponse()
>         try:
>             self.assertEqual(response.status, client.OK)
>             s = socket.socket(fileno=response.fileno())
>             try:
>                 s.sendall(b"proxied data\n")
>             finally:
>                 s.detach()
>         finally:
>             response.close()
>             conn.close()
>         thread.join()
>         self.assertEqual(result, b"proxied data\n")
>
> I was hoping someone with more understanding of the networking stack
> could look at this and tell whether it's a bug in the python test, the
> kernel change or otherwise give a pointer to where the problem might
> be? I'll freely admit this is not an area I know much about.
>
> Cheers,
>
> Richard
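
To estimate how often the exchange in the quoted test fails on a given
kernel, the same sequence can be run in a loop outside the test suite. The
following is a sketch under assumptions, not the CPython test itself: it
uses plain sockets instead of http.client, binds to 127.0.0.1, and adds
timeouts so a stalled run is counted rather than hanging forever. The names
`run_once` and `run_many` are hypothetical.

```python
import socket
import threading


def run_once(timeout=5.0):
    """Run one CONNECT-style exchange over loopback.

    Returns True if the server read the proxied payload followed by EOF
    before the timeout, False otherwise.
    """
    serv = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
    serv.bind(("127.0.0.1", 0))
    serv.listen()
    serv.settimeout(timeout)
    result = {}

    def run_server():
        try:
            conn, _ = serv.accept()
            conn.settimeout(timeout)
            with conn, conn.makefile("rb") as reader:
                # Read the request header until a blank line (or EOF).
                while reader.readline().rstrip(b"\r\n"):
                    pass
                conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
                # Returns at EOF; raises a timeout error if EOF never shows up.
                result["data"] = reader.read()
        except OSError as exc:
            # Timeouts and connection errors both count as a failed run.
            result["error"] = exc

    thread = threading.Thread(target=run_server)
    thread.start()
    try:
        with socket.create_connection(serv.getsockname(), timeout=timeout) as c:
            c.sendall(b"CONNECT dummy:1234 HTTP/1.1\r\nHost: dummy:1234\r\n\r\n")
            # Wait for the end of the response header.
            buf = b""
            while b"\r\n\r\n" not in buf:
                chunk = c.recv(4096)
                if not chunk:
                    break
                buf += chunk
            # Payload immediately followed by close, as in the quoted test.
            c.sendall(b"proxied data\n")
    except OSError:
        pass
    finally:
        # Every blocking call in the server thread has a timeout, so this ends.
        thread.join()
        serv.close()
    return result.get("data") == b"proxied data\n"


def run_many(attempts=100, timeout=5.0):
    """Return (succeeded, failed) counts over the given number of attempts."""
    failed = sum(1 for _ in range(attempts) if not run_once(timeout))
    return attempts - failed, failed
```

Calling `run_many(200)` returns a pair of counts; comparing the failed count
between a kernel with and without the patch gives a rough failure rate
without going through the whole test suite.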