From mboxrd@z Thu Jan 1 00:00:00 1970
From: Flavio Leitner
Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
Date: Tue, 2 Dec 2014 19:32:32 -0200
Message-ID: <20141202213232.GC5344@t520.home>
References: <1417156385-18276-1-git-send-email-fan.du@intel.com>
 <1417158128.3268.2@smtp.corp.redhat.com>
 <5A90DA2E42F8AE43BC4A093BF0678848DED92B@SHSMSX104.ccr.corp.intel.com>
 <20141202154425.GA5344@t520.home>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: "Du, Fan", Jason Wang, "netdev@vger.kernel.org", "davem@davemloft.net", "fw@strlen.de"
To: Jesse Gross
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:51759 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754007AbaLBVcn (ORCPT ); Tue, 2 Dec 2014 16:32:43 -0500
Content-Disposition: inline
In-Reply-To:
Sender: netdev-owner@vger.kernel.org
List-ID:

On Tue, Dec 02, 2014 at 10:06:53AM -0800, Jesse Gross wrote:
> On Tue, Dec 2, 2014 at 7:44 AM, Flavio Leitner wrote:
> > On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:
> >>
> >>
> >> >-----Original Message-----
> >> >From: Jason Wang [mailto:jasowang@redhat.com]
> >> >Sent: Friday, November 28, 2014 3:02 PM
> >> >To: Du, Fan
> >> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
> >> >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
> >> >
> >> >
> >> >
> >> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du wrote:
> >> >> Test scenario: two KVM guests sitting in different hosts communicate
> >> >> to each other with a vxlan tunnel.
> >> >>
> >> >> All interface MTU is default 1500 Bytes, from guest point of view, its
> >> >> skb gso_size could be as bigger as 1448Bytes, however after guest skb
> >> >> goes through vxlan encapuslation, individual segments length of a gso
> >> >> packet could exceed physical NIC MTU 1500, which will be lost at
> >> >> recevier side.
> >> >>
> >> >> So it's possible in virtualized environment, locally created skb len
> >> >> after encapslation could be bigger than underlayer MTU. In such case,
> >> >> it's reasonable to do GSO first, then fragment any packet bigger than
> >> >> MTU as possible.
> >> >>
> >> >> +---------------+ TX     RX +---------------+
> >> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
> >> >> +-+-----------+-+           +-+-----------+-+
> >> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
> >> >>   +-----------+               +-----------+
> >> >>         |                           |
> >> >>         v tap0                 tap0 v
> >> >>   +-----------+               +-----------+
> >> >>   | ovs bridge|               | ovs bridge|
> >> >>   +-----------+               +-----------+
> >> >>         | vxlan               vxlan |
> >> >>         v                           v
> >> >>   +-----------+               +-----------+
> >> >>   |    NIC    |   <------>    |    NIC    |
> >> >>   +-----------+               +-----------+
> >> >>
> >> >> Steps to reproduce:
> >> >> 1. Using kernel builtin openvswitch module to setup ovs bridge.
> >> >> 2. Runing iperf without -M, communication will stuck.
> >> >
> >> >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
> >> >believe.
> >>
> >> Problem here is host stack push local over-sized gso skb down to NIC, and perform GSO there
> >> without any further ip segmentation.
> >>
> >> Reasonable behavior is do gso first at ip level, if gso-ed skb is bigger than MTU && df is set,
> >> Then push ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to sender to adjust mtu.
> >>
> >> For PMTU to work, that's another issue I will try to address later on.
> >>
> >> >>
> >> >>
> >> >> Signed-off-by: Fan Du
> >> >> ---
> >> >>  net/ipv4/ip_output.c | 7 ++++---
> >> >>  1 files changed, 4 insertions(+), 3 deletions(-)
> >> >>
> >> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index
> >> >> bc6471d..558b5f8 100644
> >> >> --- a/net/ipv4/ip_output.c
> >> >> +++ b/net/ipv4/ip_output.c
> >> >> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff
> >> >> *skb)
> >> >>  	struct sk_buff *segs;
> >> >>  	int ret = 0;
> >> >>
> >> >> -	/* common case: locally created skb or seglen is <= mtu */
> >> >> -	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
> >> >> -	      skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> >> >> +	/* Both locally created skb and forwarded skb could exceed
> >> >> +	 * MTU size, so make a unified rule for them all.
> >> >> +	 */
> >> >> +	if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> >> >>  		return ip_finish_output2(skb);
> >
> >
> > Are you using kernel's vxlan device or openvswitch's vxlan device?
> >
> > Because for kernel's vxlan devices the MTU accounts for the header
> > overhead so I believe your patch would work. However, the MTU is
> > not visible for the ovs's vxlan devices, so that wouldn't work.
>
> This is being called after the tunnel code, so the MTU that is being
> looked at in all cases is the physical device's. Since the packet has
> already been encapsulated, tunnel header overhead is already accounted
> for in skb_gso_network_seglen() and this should be fine for both OVS
> and non-OVS cases.

Right, it didn't work on my first try and that explanation came to mind.

Anyway, I am testing this with containers instead of VMs, so I am using
veth and not Virtio-net.

This is the actual stack trace:

[...]
  do_output
  ovs_vport_send
  vxlan_tnl_send
  vxlan_xmit_skb
  udp_tunnel_xmit_skb
  iptunnel_xmit
   \ skb_scrub_packet => skb->ignore_df = 0;
  ip_local_out_sk
  ip_output
  ip_finish_output (_gso is inlined)
  ip_fragment

and in ip_fragment() it does:

	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
		     (IPCB(skb)->frag_max_size &&
		      IPCB(skb)->frag_max_size > mtu))) {
		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
			  htonl(mtu));
		kfree_skb(skb);
		return -EMSGSIZE;
	}

Since IP_DF is set and skb->ignore_df is reset to 0, in my case the
packet is dropped and an ICMP message is sent back. The connection
remains stuck as before.

Doesn't virtio-net set the DF bit?

Thanks,
fbl