From mboxrd@z Thu Jan 1 00:00:00 1970 From: Risto Pajula Subject: IP (rtl8169) forwarding bug (performance) Date: Mon, 10 Dec 2018 01:28:09 +0200 Message-ID: References: <462e25db-8aad-7687-31e5-fb812d8daeaa@gmail.com> <51078d4c-17de-9c9d-4ba2-07a4b8e73575@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Cc: netdev@vger.kernel.org To: "David S. Miller" , Alexey Kuznetsov , Realtek linux nic maintainers , hkallweit1@gmail.com Return-path: Received: from mail-lf1-f65.google.com ([209.85.167.65]:42639 "EHLO mail-lf1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726515AbeLIX2O (ORCPT ); Sun, 9 Dec 2018 18:28:14 -0500 Received: by mail-lf1-f65.google.com with SMTP id l10so6654942lfh.9 for ; Sun, 09 Dec 2018 15:28:12 -0800 (PST) In-Reply-To: Content-Language: en-US Sender: netdev-owner@vger.kernel.org List-ID: Hello. Old subject: "Re: IP fragmentation performance and don't fragment bug when forwarding I have now been tracing the kernel and finding the bug seems difficult. I think the bug is combination of several things, likely cause is that it only occurs with rtl8169 and how it is using the netdevapi/DQL api. From my investigations seems that following happens: *Burst of frames is received from internet. (eth0) *These are accepted and forwarded to the (eth1) *DQL hits the limit, this causes the scheduling for the device to be stopped (__QUEUE_STATE_STACK_XOFF) *This and combination of some timing causes the performance degradation, messages are only scheduled for transmission in soft Irq context. I still do now know, why the DQL hits the limits.. I think it should not, maybe something undesirable first happens with rtl8169. I inserted following trace printk functions to the code.. rtl8169_start_xmit: trace_printk("ORP33 %s rtl8169_start_xmit len: %d entry: %u cur_tx: %u frags: %d dirty_tx: %u tx_left: %u\n" , tp->dev->name, skb->len, entry, tp->cur_tx, frags, tp->dirty_tx, (tp->cur_tx - tp->dirty_tx) ); rtl_tx: trace_printk("ORP33 %s rtl_tx len: %d entry: %u cur_tx: %u frags: %d dirty_tx: %u tx_left: %u\n" , tp->dev->name, tx_skb->skb->len, tp->cur_tx % NUM_TX_DESC, tp->cur_tx, 0, dirty_tx, tx_left ); Here is some grepped output only filtering the ping packet in two different situations: //trace when downloading ubuntu (~20MB/s), only 3 secs fit to buffer. ping is all the time normal.... //orig tracing.out3754 117493: -0 [000] ..s. 5235.407116: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 57 cur_tx: 3747641 frags: 0 dirty_tx: 3747640 tx_left: 1 118704: -0 [000] ..s. 5235.416057: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 20 cur_tx: 3747668 frags: 0 dirty_tx: 3747667 tx_left: 1 346319: hping3-2871 [000] .... 5236.002456: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 8 cur_tx: 3754312 frags: 0 dirty_tx: 3754312 tx_left: 0 346337: hping3-2871 [000] ..s. 5236.002475: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 9 cur_tx: 3754313 frags: 0 dirty_tx: 3754312 tx_left: 1 1277155: hping3-2871 [001] .... 5237.002627: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 38 cur_tx: 3766502 frags: 0 dirty_tx: 3766502 tx_left: 0 1277173: -0 [000] ..s. 5237.002646: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 39 cur_tx: 3766503 frags: 0 dirty_tx: 3766502 tx_left: 1 //imcp messages from route to intenal network when TCP stream is on. tracing.out3236/eth1_ring.txt is the original //no idea why the first frame is tripled... 200792: hping3-2871 [001] .... 3239.593384: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 34 cur_tx: 2132770 frags: 0 dirty_tx: 2132770 tx_left: 0 201578: hping3-2871 [001] .... 3240.593625: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 40 cur_tx: 2132776 frags: 0 dirty_tx: 2132776 tx_left: 0 341061: -0 [000] ..s. 3241.738011: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 52 cur_tx: 2133684 frags: 0 dirty_tx: 2133680 tx_left: 4 342112: -0 [000] ..s. 3241.745297: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 59 cur_tx: 2133691 frags: 0 dirty_tx: 2133684 tx_left: 7 500126: -0 [000] ..s. 3242.739500: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 49 cur_tx: 2134513 frags: 0 dirty_tx: 2134505 tx_left: 8 501739: -0 [000] ..s. 3242.751654: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 58 cur_tx: 2134522 frags: 0 dirty_tx: 2134513 tx_left: 9 646220: -0 [000] ..s. 3243.745761: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 41 cur_tx: 2135273 frags: 0 dirty_tx: 2135267 tx_left: 6 647633: -0 [000] ..s. 3243.755486: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 49 cur_tx: 2135281 frags: 0 dirty_tx: 2135273 tx_left: 8 802878: -0 [000] ..s. 3244.739947: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 23 cur_tx: 2136087 frags: 0 dirty_tx: 2136081 tx_left: 6 804298: -0 [000] ..s. 3244.749677: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 31 cur_tx: 2136095 frags: 0 dirty_tx: 2136087 tx_left: 8 961190: -0 [000] ..s. 3245.746217: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 16 cur_tx: 2136912 frags: 0 dirty_tx: 2136906 tx_left: 6 962610: -0 [000] ..s. 3245.755946: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 24 cur_tx: 2136920 frags: 0 dirty_tx: 2136912 tx_left: 8 1118044: -0 [000] ..s. 3246.740336: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 62 cur_tx: 2137726 frags: 0 dirty_tx: 2137720 tx_left: 6 1119443: -0 [000] ..s. 3246.750047: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 5 cur_tx: 2137733 frags: 0 dirty_tx: 2137726 tx_left: 7 1264047: -0 [000] ..s. 3247.709202: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 29 cur_tx: 2138525 frags: 0 dirty_tx: 2138518 tx_left: 7 1264740: -0 [000] ..s. 3247.718918: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 33 cur_tx: 2138529 frags: 0 dirty_tx: 2138525 tx_left: 4 1419958: -0 [000] ..s. 3248.736436: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 48 cur_tx: 2139312 frags: 0 dirty_tx: 2139305 tx_left: 7 1421551: -0 [000] ..s. 3248.746189: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 56 cur_tx: 2139320 frags: 0 dirty_tx: 2139312 tx_left: 8 1578746: -0 [000] ..s. 3249.742702: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 46 cur_tx: 2140142 frags: 0 dirty_tx: 2140135 tx_left: 7 1580194: -0 [000] ..s. 3249.752457: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 54 cur_tx: 2140150 frags: 0 dirty_tx: 2140142 tx_left: 8 1729597: -0 [000] ..s. 3250.746839: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 36 cur_tx: 2140964 frags: 0 dirty_tx: 2140957 tx_left: 7 1731015: -0 [000] ..s. 3250.756594: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 43 cur_tx: 2140971 frags: 0 dirty_tx: 2140964 tx_left: 7 1881028: -0 [000] ..s. 3251.740991: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 19 cur_tx: 2141779 frags: 0 dirty_tx: 2141772 tx_left: 7 1881381: -0 [000] ..s. 3251.750718: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 26 cur_tx: 2141786 frags: 0 dirty_tx: 2141779 tx_left: 7 //something happens, the output queue empties.. now it is fast, and rtl8169_start_xmit is not called from soft_irq 1896178: hping3-2871 [001] .... 3252.595994: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 23 cur_tx: 2141911 frags: 0 dirty_tx: 2141911 tx_left: 0 1896196: -0 [000] ..s. 3252.596055: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 24 cur_tx: 2141912 frags: 0 dirty_tx: 2141911 tx_left: 1 1923136: hping3-2871 [001] .... 3253.596227: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 50 cur_tx: 2142066 frags: 0 dirty_tx: 2142066 tx_left: 0 1923154: -0 [000] ..s. 3253.596289: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 51 cur_tx: 2142067 frags: 0 dirty_tx: 2142066 tx_left: 1 1951633: hping3-2871 [001] .... 3254.596459: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 5 cur_tx: 2142213 frags: 0 dirty_tx: 2142213 tx_left: 0 1951651: -0 [000] ..s. 3254.596521: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 6 cur_tx: 2142214 frags: 0 dirty_tx: 2142213 tx_left: 1 1954472: hping3-2871 [001] .... 3255.596700: rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 15 cur_tx: 2142223 frags: 0 dirty_tx: 2142223 tx_left: 0 1954490: -0 [000] ..s. 3255.596763: rtl8169_poll: ORP33 eth1 rtl_tx len: 54 entry: 16 cur_tx: 2142224 frags: 0 dirty_tx: 2142223 tx_left: 1 //until it soon, again is slow.. BR. Risto On 7.12.2018 16:46, Risto Pajula wrote: > Hello. > > I have been to track the poor forwarding latency to the TCP Window scale > options. The Netgem device uses rather large windows scale options > (x256) and I have been able to reproduce the routers poor forwarding > latency also with linux box running in the internal network and changing > the net.ipv4.tcp_rmem to a large value and thus changing the TCP window > scaling options to larger ones. I still do not have clue why this causes > the forwarfing in the linux kernel to block? Maybe something in the > connection tracking....? > > > With the ICMP timestamp messages I have been able to also pinpoint that > the latency is caused in the eth1 sending side (the following hping3 > example is run in the router toward the internal network... > > > xxx:/usr/src/linux-4.20-rc2 # hping3 192.168.0.112 --icmp --icmp-ts -V > using eth1, addr: 192.168.0.1, MTU: 1500 > HPING 192.168.0.112 (eth1 192.168.0.112): icmp mode set, 28 headers + 0 > data bytes > len=46 ip=192.168.0.112 ttl=64 id=49464 tos=0 iplen=40 > icmp_seq=0 rtt=7.9 ms > ICMP timestamp: Originate=52294891 Receive=52294895 Transmit=52294895 > ICMP timestamp RTT tsrtt=7 > > len=46 ip=192.168.0.112 ttl=64 id=49795 tos=0 iplen=40 > icmp_seq=1 rtt=235.9 ms > ICMP timestamp: Originate=52295891 Receive=52296128 Transmit=52296128 > ICMP timestamp RTT tsrtt=235 > > len=46 ip=192.168.0.112 ttl=64 id=49941 tos=0 iplen=40 > icmp_seq=2 rtt=3.8 ms > ICMP timestamp: Originate=52296891 Receive=52296895 Transmit=52296895 > ICMP timestamp RTT tsrtt=3 > > len=46 ip=192.168.0.112 ttl=64 id=50685 tos=0 iplen=40 > icmp_seq=3 rtt=47.8 ms > ICMP timestamp: Originate=52297891 Receive=52297940 Transmit=52297940 > ICMP timestamp RTT tsrtt=47 > > len=46 ip=192.168.0.112 ttl=64 id=51266 tos=0 iplen=40 > icmp_seq=4 rtt=7.7 ms > ICMP timestamp: Originate=52298891 Receive=52298895 Transmit=52298895 > ICMP timestamp RTT tsrtt=7 > > len=46 ip=192.168.0.112 ttl=64 id=52245 tos=0 iplen=40 > icmp_seq=5 rtt=3.7 ms > ICMP timestamp: Originate=52299891 Receive=52299895 Transmit=52299895 > ICMP timestamp RTT tsrtt=3 > > ^C > --- 192.168.0.112 hping statistic --- > 6 packets tramitted, 6 packets received, 0% packet loss > round-trip min/avg/max = 3.7/51.1/235.9 ms > > > > BR. > Risto > > On 2.12.2018 23:32, Risto Pajula wrote: >> Hello. >> >> You can most likely ignore the "DF Bit, mtu bug when forwarding" case. >> There isn't actually big IP packets on the wire, instead there is >> burst of packets on the wire, which are combined by the GRO... And >> thus dropping them should not happen. Sorry about the invalid bug report. >> >> However the poor latency from intenal network to the internet still >> remain, both GRO enabled and disabled. I will try to study further... >> >> >> BR. >> Risto >> >> >> On 2.12.2018 14:01, Risto Pajula wrote: >>> Hello. >>> >>> I have encountered a weird performance problem in Linux IP >>> fragmentation when using video streaming services behind the NAT. >>> Also I have studied a possible bug in the DF bit (don't fragment) >>> handling when forwarding the IP packets. >>> >>> First the system setup description: >>> >>> [host1]-int lan-(eth1)[linux router](eth0)-extlan-[fibre >>> router]-internet >>> >>> where: >>> host1: is a Netgem N7800 "cable box" for online video streaming >>> services provided by local telco (Can access Netflix, HBO nordic, >>> "live TV", etc.) >>> linux router: Linux computer with Dualcore Intel Celeron G1840, >>> running currently Linux kernel 4.20.0-rc2, and openSUSE Leap 15.0 >>> eth1: Linux Routers internal (NAT) interface, 192.168.0.1/24 network, >>> mtu set to 1500, RTL8169sb/8110sb >>> eth0: Linux Routers internet facing interface, public ip address, mtu >>> set to 1500,  RTL8168evl/8111evl >>> fibre router: Alcatel Lucent fibre router (I-241G-Q), directly >>> connected to the eth0 of the Linux router. >>> >>> And now when using the Netgem N7800 with online video services >>> (Netflix, HBO nordic, etc) the Linux router will receive very BIG IP >>> packets in the eth0 upto ~20kB, this seems to lead to the following >>> problems in the Linux IP stack. >>> >>> IP fragmentation performance: >>> When the Linux router receives these large IP packets in the eth0 >>> everything works, but it seems that them cause very large performance >>> degradation from internal network to the internet regarding the >>> latency when the IP fragmentation is performed. The ping latency from >>> internal network to the internel network increases from stable >>> 15ms-20ms up to 700-800ms AND also the ping from the internal network >>> to the linux router eth1 (192.168.0.). However up link works >>> perfectly, the ping is still stable when streaming the online >>> services (From linux router to the internet). It seems that the IP >>> fragmentation is somehow blocking the eth1 reception or transmission >>> for very long time (which it shouldn't). I'm able to test and debug >>> the issue further, but advice regarding where to look would be >>> appreciated. >>> >>> >>> DF Bit, mtu bug when forwarding: >>> I have started to study the above mentioned problem and have found a >>> possible bug in the DF bit and mtu handling in IP forwarding. The BIG >>> packets received from streaming services all have the "DF bit" set >>> and the question is that should we be forwarding them at all as that >>> would result them being fragmented? Apparently we currently are... I >>> have traced this down to the ip_forward.c function ip_exceeds_mtu(), >>> and the following patch seems to fix that. >>> >>> --- net/ipv4/ip_forward.c.orig  2018-12-02 11:09:32.764320780 +0200 >>> +++ net/ipv4/ip_forward.c       2018-12-02 12:53:25.031232347 +0200 >>> @@ -49,7 +49,7 @@ static bool ip_exceeds_mtu(const struct >>>                 return false; >>> >>>         /* original fragment exceeds mtu and DF is set */ >>> -       if (unlikely(IPCB(skb)->frag_max_size > mtu)) >>> +        if (unlikely(skb->len > mtu)) >>>                 return true; >>> >>>         if (skb->ignore_df) >>> >>> >>> This seems to work (in some ways) - after the change IP packets that >>> are too large to the internal network get dropped and we are sending >>> "ICMP Destination unreachable, The datagram is too big" messages to >>> the originator (as we should?). However it seems that not all >>> services really like this... Netflix behaves as expected and ping is >>> stable from internal network to the internet, but for example HBO >>> nordic will not work anymore (too little buffering? Retransimissions >>> not working?). So it seems the original issue should be also fixed >>> (And the fragmention should be allowed?). >>> >>> >>> >>> Any advice would be appreciated. Thanks! >>> >>> PS. Watching TV was not this intensive 20 years ago :) >>>