Message-ID: <5C9549B4.8020503@mail.wizbit.be>
Date: Fri, 22 Mar 2019 21:46:44 +0100
From: Bram Yvahk
To: Steffen Klassert
CC: herbert@gondor.apana.org.au, davem@davemloft.net, netdev@vger.kernel.org
Subject: Re: [PATCH ipsec/vti 0/2] Fragmentation of IPv4 in VTI
References: <1552865877-13401-1-git-send-email-bram-yvahk@mail.wizbit.be>
 <20190321151630.GA25855@gauss3.secunet.de>
 <5C93D910.1080008@mail.wizbit.be>
In-Reply-To: <5C93D910.1080008@mail.wizbit.be>

Bram Yvahk wrote:
> Steffen Klassert wrote:
>> On Sun, Mar 17, 2019 at 11:37:55PM +0000, Bram Yvahk wrote:
>>> We've experienced an issue with VTI when the path-mtu is smaller than the size
>>> of the "client" packet.
>>>
>>> What happens: IPv4 packet from the client (i.e. another system in the LAN)
>>> attempts to transmit some data; IPv4 header shows that 'DF' bit is not set but
>>> still the client receives ICMPv4 "need-to-frag" message [which the client does
>>> not expect and ignores].
>>>
>>> Example:
>>>   $ ping -s 1300 -M dont -c5 192.168.235.2
>>>   PING 192.168.235.3 (192.168.235.3) 1300(1328) bytes of data.
>>>   From 192.168.236.254 icmp_seq=1 Frag needed and DF set (mtu = 1214)
>>>   From 192.168.236.254 icmp_seq=2 Frag needed and DF set (mtu = 1214)
>>>   From 192.168.236.254 icmp_seq=3 Frag needed and DF set (mtu = 1214)
>>>   From 192.168.236.254 icmp_seq=4 Frag needed and DF set (mtu = 1214)
>>>   From 192.168.236.254 icmp_seq=5 Frag needed and DF set (mtu = 1214)
>>>
>>>   --- 192.168.235.3 ping statistics ---
>>>   5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 3999ms
>> Hm, this works here. Can you show how you setup the vti device?
>> Some tunnel configuration options (set ttl etc.) force to have
>> the DF bit set.
>
> I will provide these details tomorrow.
> What I can say is that ttl was set to inherit.
>

vti device is created (on Gateway A) using:
  $ ip tun add name vti0 mode vti ikey 1 okey 1 local
  $ ip link show dev vti0
  46: vti0@NONE: mtu 1480 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
      link/ipip brd 0.0.0.0
  $ ip tun show name vti0
  vti0: ip/ip remote any local ttl inherit key 1

[I've also done setup with mtu 1400 - all remains the same]

xfrm state:
  src dst
      proto esp spi 0xcd76a4a9 reqid 16389 mode tunnel
      replay-window 32 flag nopmtudisc af-unspec
      auth-trunc hmac(sha1) 0x08e1ce16b1f7f9039f9cc7421cf61010c029efc3 96
      enc cbc(aes) 0x22c7aacd9680a10a52b0c5670b7d850c35ba17f7c7dc6c963252cdc311b1f4d5
      anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
  src dst
      proto esp spi 0x8f2988c7 reqid 16389 mode tunnel
      replay-window 32 flag nopmtudisc af-unspec
      auth-trunc hmac(sha1) 0x229bbe490606ddcc6a68332babd498001591c6bf 96
      enc cbc(aes) 0xd598dba419bfc45232580e54d517aae6a77c3328a51ebb3321802b89cc51ae43
      anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000

(same behaviour with/without nopmtudisc; nopmtudisc only makes a
difference for packets from 'client A' that *do* have the DF bit set)

>
> When testing this there is one important bit - which in hindsight I
> should've included in the previous message - the (IPsec) Gateway A
> needs to know the path-mtu to (IPsec) Gateway B.
>
> Some ways to accomplish this:
> - transmit an ICMP with DF bit set and a larger packet size from
>   Gateway A to Gateway B
> - ensure the "nopmtudisc" option is *not* set in the xfrm state
>   and then let client A transmit an ICMP *with* DF bit set to
>   client B. [when "nopmtudisc" is set then all outgoing IPv4 ESP
>   packets have the DF bit cleared, when "nopmtudisc" is not set then
>   DF bit is copied from the client packet]
>
> For testing purposes I recommend to do the ping from Gateway A to
> Gateway B. (Otherwise tcpdumps/traffic get a bit more confusing.)
>
> A more in-depth description of what happens:
>
> Setup:
> ======
>
> |----------|   |-----------|   |-------|   |-----------|   |----------|
> | client A |---| Gateway A |---| Hop H |---| Gateway B |---| client B |
> |----------|   |-----------|   |-------|   |-----------|   |----------|
>
> - testing with linux 4.14.95 (setup with more recent kernel is WIP)
> - link mtu between client A and Gateway A: 1500
> - link mtu between Gateway A and Hop H: 1500
> - link mtu between Hop H and Gateway B: 1280
> - link mtu between Gateway B and client B: 1500
> - path-mtu between Gateway A and Gateway B: 1280
> - IPsec tunnel over *IPv4* between Gateway A and Gateway B
> - tunneling IPv4 over the IPsec tunnel
> - testing with VTI
>
> Scenario:
> ==========
>
> Before starting it's important to ensure that:
> - Gateway A does *not* know the path-mtu to Gateway B
> - Client A does *not* know the path-mtu to Gateway B

On Gateway A:
  $ ip route get
  via dev eth1 src uid 0
      cache
  => no mtu shown --> path-mtu not yet known

>
> * Step 1: client A: $ ping -M dont -s 1300 ip_of_client_B
>   - IPv4 ICMP packet of client A does not have DF bit set
>   - IPv4 ESP packet of Gateway A does not have DF bit set
>   - Hop H receives an IPv4 ESP packet that is too large for link-mtu
>     between Hop H and Gateway B: it fragments the IPv4 ESP packet.
>   - Gateway B receives 2 IPv4 fragmented packets
>   - (Client B receives one IPv4 ICMP packet from client A)

tcpdump on Gateway A:
  - from client A it receives:
      IP (tos 0x0, ttl 64, id 46797, offset 0, flags [none], proto ICMP (1), length 1328)
          client_A > client_B: ICMP echo request, id 6855, seq 1, length 1308
  - it transmits (to Gateway B):
      IP (tos 0x0, ttl 64, id 10932, offset 0, flags [none], proto ESP (50), length 1400)
          gateway_A > gateway_B: ESP(spi=0x8f2988c7,seq=0x3), length 1380

tcpdump on Gateway B:
  - it receives (from Gateway A):
      IP (tos 0x0, ttl 63, id 10932, offset 0, flags [+], proto ESP (50), length 1276)
          gateway_A > gateway_B: ESP(spi=0x8f2988c7,seq=0x3), length 1256
      IP (tos 0x0, ttl 63, id 10932, offset 1256, flags [none], proto ESP (50), length 144)
          gateway_A > gateway_B: ip-proto-50
  - it transmits (to client B):
      IP (tos 0x0, ttl 62, id 46797, offset 0, flags [none], proto ICMP (1), length 1328)
          client_A > client_B: ICMP echo request, id 6855, seq 1, length 1308

=> Hop H fragmented the IPv4 packets. This is expected: DF bit is not
   set on ESP packets and Gateway A does not know path-mtu to Gateway B

>
> * Step 2: Gateway A: $ ping -M do -s 1300 ip_of_gateway_B
>   - IPv4 ICMP packet of Gateway A does have DF bit set
>   - Gateway A receives a 'need to frag' ICMP from Hop H

tcpdump on Gateway A:
  - it transmits (local packet - to Gateway B):
      IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 1328)
          gateway_A > gateway_B: ICMP echo request, id 28176, seq 1, length 1308
  - it receives (from Hop H):
      IP (tos 0xc0, ttl 64, id 52788, offset 0, flags [none], proto ICMP (1), length 576)
          hop_H > gateway_A: ICMP 1.1.235.254 unreachable - need to frag (mtu 1280), length 556
              IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 1328)
                  gateway_A > gateway_B: ICMP echo request, id 28176, seq 1, length 1308

=> Hop H sends a need-to-frag with the mtu. This is expected: DF bit is
   set on ICMP packet so Hop H should not fragment.
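
As a side note on step 1: the two fragment lengths in the Gateway B capture (1276 and 144) are what plain IPv4 fragmentation of a 1400-byte packet over a 1280-byte link should produce. The sketch below reproduces that arithmetic. It assumes a 20-byte IPv4 header without options, and the function name is only illustrative; it does not model anything VTI or xfrm specific.

```python
IP_HEADER = 20  # bytes; assumes an IPv4 header without options


def fragment_lengths(total_length, link_mtu, header=IP_HEADER):
    """Return the total length (header included) of each IPv4 fragment."""
    if total_length <= link_mtu:
        return [total_length]  # fits on the link, no fragmentation needed
    payload = total_length - header
    # Every fragment except the last must carry a multiple of 8 payload bytes,
    # because the fragment offset field counts in 8-byte units.
    chunk = (link_mtu - header) // 8 * 8
    lengths = []
    while payload > 0:
        part = min(chunk, payload)
        lengths.append(part + header)
        payload -= part
    return lengths


# ESP packet from step 1 (length 1400) crossing the Hop H -> Gateway B link
print(fragment_lengths(1400, 1280))  # [1276, 144]
```

The first fragment carries 1256 payload bytes, which matches the "offset 1256" shown on the second fragment in the capture.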

on Gateway A:
  $ ip route get
  via dev eth1 src uid 0
      cache expires 17sec mtu 1280
  => path-mtu known to be 1280

> * Step 3: client A: $ ping -M dont -s 1300 ip_of_client_B
>   - IPv4 ICMP packet of client A does not have DF bit set
>   - Gateway A: it processes this packet in VTI module and detects that
>     packet size > path-mtu and then sends a 'need to frag' ICMP to
>     client A. [this is the code I patched]

tcpdump on Gateway A:
  - from client A it receives:
      IP (tos 0x0, ttl 64, id 46798, offset 0, flags [none], proto ICMP (1), length 1328)
          client_A > client_B: ICMP echo request, id 7063, seq 1, length 1308
  - it transmits to client A:
      IP (tos 0xc0, ttl 64, id 59290, offset 0, flags [none], proto ICMP (1), length 576)
          gateway_A > client_A: ICMP client_B unreachable - need to frag (mtu 1214), length 556
              IP (tos 0x0, ttl 63, id 46798, offset 0, flags [none], proto ICMP (1), length 1328)
                  client_A > client_B: ICMP echo request, id 7063, seq 1, length 1308

>
> => the critical bit in the above is that Gateway A learns
>    the path-mtu to Gateway B. If it doesn't then it keeps
>    assuming path-mtu is 1500 and the check in VTI will not
>    trigger (since path-mtu of 1500 > packet size)
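
For reference, the numbers in the captures (ESP packet length 1400 for a 1328-byte inner packet, and "mtu 1214" in the ICMP for a path-mtu of 1280) are consistent with the usual ESP tunnel-mode overhead for cbc(aes) with hmac(sha1) truncated to 96 bits. The sketch below is only arithmetic under those assumed overheads; the constants and function names are illustrative and this is not the kernel's code path.

```python
# Assumed per-packet overheads for ESP tunnel mode over IPv4
IP_HEADER = 20   # outer IPv4 header without options
ESP_HEADER = 8   # SPI + sequence number
IV_LEN = 16      # AES-CBC initialisation vector
BLOCK = 16       # AES block size; encrypted part is padded to a multiple of it
TRAILER = 2      # pad-length byte + next-header byte
ICV_LEN = 12     # HMAC-SHA1 truncated to 96 bits


def esp_packet_length(inner_length):
    """Length of the outer IPv4 ESP packet carrying an inner packet."""
    # Round inner packet + trailer up to the next cipher block boundary.
    padded = -(-(inner_length + TRAILER) // BLOCK) * BLOCK
    return IP_HEADER + ESP_HEADER + IV_LEN + padded + ICV_LEN


def tunnel_mtu(path_mtu):
    """Largest inner packet whose ESP packet still fits in path_mtu."""
    room = path_mtu - IP_HEADER - ESP_HEADER - IV_LEN - ICV_LEN
    return room // BLOCK * BLOCK - TRAILER


def exceeds_tunnel_mtu(inner_length, path_mtu):
    """Size comparison described in step 3 (ignores the DF bit)."""
    return inner_length > tunnel_mtu(path_mtu)


print(esp_packet_length(1328))         # 1400, as in the step 1 capture
print(tunnel_mtu(1280))                # 1214, as in the ICMP of step 3
print(exceeds_tunnel_mtu(1328, 1280))  # True: path-mtu learned in step 2
print(exceeds_tunnel_mtu(1328, 1500))  # False: path-mtu not yet known
```

With a path-mtu of 1500 the comparison does not trigger for the 1328-byte ping, which lines up with the observation that the problem only shows once Gateway A has learned the path-mtu of 1280.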