From mboxrd@z Thu Jan 1 00:00:00 1970
From: Fan Du
Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
Date: Mon, 05 Jan 2015 14:02:58 +0800
Message-ID: <54AA2912.6090903@gmail.com>
References: <1417156385-18276-1-git-send-email-fan.du@intel.com>
 <1417158128.3268.2@smtp.corp.redhat.com>
 <5A90DA2E42F8AE43BC4A093BF0678848DED92B@SHSMSX104.ccr.corp.intel.com>
 <20141201135225.GA16814@casper.infradead.org>
 <20141202154839.GB5344@t520.home>
 <20141202170927.GA9457@casper.infradead.org>
 <20141202173401.GB4126@redhat.com>
 <20141202174158.GB9457@casper.infradead.org>
 <5A90DA2E42F8AE43BC4A093BF0678848DEDFDB@SHSMSX104.ccr.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: "Michael S. Tsirkin" , 'Jason Wang' , "netdev@vger.kernel.org" , "fw@strlen.de" , "dev@openvswitch.org" , "pshelar@nicira.com"
To: "Du, Fan" , Thomas Graf , "davem@davemloft.net" , "jesse@nicira.com"
Return-path:
Received: from mail-pd0-f173.google.com ([209.85.192.173]:33299 "EHLO mail-pd0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750959AbbAEGGd (ORCPT ); Mon, 5 Jan 2015 01:06:33 -0500
Received: by mail-pd0-f173.google.com with SMTP id ft15so27444854pdb.18 for ; Sun, 04 Jan 2015 22:06:33 -0800 (PST)
In-Reply-To: <5A90DA2E42F8AE43BC4A093BF0678848DEDFDB@SHSMSX104.ccr.corp.intel.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

On 2014-12-03 10:31, Du, Fan wrote:
>
>
>> -----Original Message-----
>> From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
>> Sent: Wednesday, December 3, 2014 1:42 AM
>> To: Michael S. Tsirkin
>> Cc: Du, Fan; 'Jason Wang'; netdev@vger.kernel.org; davem@davemloft.net;
>> fw@strlen.de; dev@openvswitch.org; jesse@nicira.com; pshelar@nicira.com
>> Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>>
>> On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>>>> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>>>>> What about containers or any other virtualization environment that
>>>>> doesn't use Virtio?
>>>>
>>>> The host can dictate the MTU in that case for both veth or OVS
>>>> internal which would be primary container plumbing techniques.
>>>
>>> It typically can't do this easily for VMs with emulated devices:
>>> real ethernet uses a fixed MTU.
>>>
>>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>>> unrelated optimization.
>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>>
>> PMTU discovery only resolves the issue if an actual IP stack is running inside the
>> VM. This may not be the case at all.
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Some thoughts here:
>
> Think otherwise, this is indeed what host stack should forge a ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED
> message with _inner_ skb network and transport header, do whatever type of encapsulation,
> and thereafter push such packet upward to Guest/Container, which make them feel, the intermediate node
> or the peer send such message. PMTU should be expected to work correct.
> And such behavior should be shared by all other encapsulation tech if they are also suffered.

Hi David, Jesse and Thomas

As discussed here: https://www.marc.info/?l=linux-netdev&m=141764712631150&w=4
and quoting Jesse:

My proposal would be something like this:
 * For L2, reduce the VM MTU to the lowest common denominator on the segment.
 * For L3, use path MTU discovery or fragment inner packet (i.e.
   normal routing behavior).
 * As a last resort (such as if using an old version of virtio in the
   guest), fragment the tunnel packet.

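To make the L3 case in the proposal concrete, here is a rough user-space sketch of the size arithmetic involved. It is not kernel code: the helper names and the SKETCH_* constants are made up for illustration, and it assumes an IPv4 outer header without options and a VXLAN-over-UDP tunnel.

```c
/* Header sizes in bytes. All names here are hypothetical; IPv4 without
 * options is an assumption of this sketch.
 */
enum {
	SKETCH_ETH_HDR_LEN   = 14,	/* inner Ethernet header carried in the tunnel */
	SKETCH_IPV4_HDR_LEN  = 20,
	SKETCH_UDP_HDR_LEN   = 8,
	SKETCH_VXLAN_HDR_LEN = 8,
};

/* Outer IPv4 + UDP + VXLAN overhead added by encapsulation. */
unsigned int sketch_outer_overhead(void)
{
	return SKETCH_IPV4_HDR_LEN + SKETCH_UDP_HDR_LEN + SKETCH_VXLAN_HDR_LEN;
}

/* Length of one GSO segment after encapsulation: outer overhead, plus the
 * inner L2/L3/L4 headers, plus the per-segment payload (gso_size).
 */
unsigned int sketch_encap_seglen(unsigned int inner_hdrlen, unsigned int gso_size)
{
	return sketch_outer_overhead() + inner_hdrlen + gso_size;
}

/* Returns 1 when an encapsulated segment would not fit the physical MTU. */
int sketch_exceeds_phys_mtu(unsigned int inner_hdrlen, unsigned int gso_size,
			    unsigned int phys_mtu)
{
	return sketch_encap_seglen(inner_hdrlen, gso_size) > phys_mtu;
}

/* Physical MTU minus the outer overhead only. */
unsigned int sketch_mtu_minus_outer(unsigned int phys_mtu)
{
	return phys_mtu - sketch_outer_overhead();
}

/* Largest inner IP packet that fits once the inner Ethernet header is
 * counted as well.
 */
unsigned int sketch_max_inner_ip_mtu(unsigned int phys_mtu)
{
	return phys_mtu - sketch_outer_overhead() - SKETCH_ETH_HDR_LEN;
}
```

With a 1500-byte physical MTU and a TCP/IPv4 inner frame (14 + 20 + 20 = 54 bytes of inner headers) with gso_size 1448, each encapsulated segment would be 36 + 54 + 1448 = 1538 bytes, so it would not fit. Subtracting only the outer overhead gives 1464; since the inner Ethernet header also travels inside the tunnel, the largest inner IP packet that fits under these assumptions is 1450, so which of the two values should be advertised is worth checking.
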
For L2, it's an administrative action.
For L3, the PMTU approach looks better: once the sender is alerted to the reduced MTU,
the packet size after encapsulation will not exceed the physical MTU, so no additional
fragmentation effort is needed.
For "As a last resort... fragment the tunnel packet", the original patch:
https://www.marc.info/?l=linux-netdev&m=141715655024090&w=4 did the job, but it seems
it's not welcomed.

The raw patch below adopts the PMTU approach, please review! Any comments/suggestions
are welcome.

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index e9f81d4..4d1b221 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1771,6 +1771,130 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 		tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
 		ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
 
+		if (skb_is_gso(skb)) {
+			unsigned int inner_l234_hdrlen;
+			unsigned int outer_l34_hdrlen;
+			unsigned int gso_seglen;
+			struct net_device *phy_dev = rt->dst.dev;
+
+			inner_l234_hdrlen = skb_transport_header(skb) - skb_mac_header(skb);
+			if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))
+				inner_l234_hdrlen += tcp_hdrlen(skb);
+			if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP)
+				inner_l234_hdrlen += sizeof(struct udphdr);
+
+			outer_l34_hdrlen = sizeof(struct iphdr) + sizeof(struct udphdr) + sizeof(struct vxlanhdr);
+			/* gso_seglen is the GSO-ed skb packet len, adjust gso_size
+			 * to fit into physical netdev MTU
+			 */
+			gso_seglen = outer_l34_hdrlen + inner_l234_hdrlen + skb_shinfo(skb)->gso_size;
+			if (gso_seglen > phy_dev->mtu) {
+				struct sk_buff *reply;
+				struct ethhdr *orig_eth;
+				struct ethhdr *new_eth;
+				struct ethhdr *tnl_eth;
+				struct iphdr *orig_ip;
+				struct iphdr *new_ip;
+				struct iphdr *tnl_ip;
+				struct icmphdr *new_icmp;
+				unsigned int room;
+				unsigned int data_len;
+				unsigned int reply_l234_hdrlen;
+				unsigned int vxlan_tnl_hdrlen;
+				struct vxlanhdr *vxh;
+				struct udphdr *uh;
+				__wsum csum;
+
+				/* How much room to store original message */
+				room = (skb->len > 576) ? 576 : skb->len;
+				room -= sizeof(struct iphdr) + sizeof(struct icmphdr);
+
+				/* Ethernet payload len */
+				data_len = skb->len - skb_network_offset(skb);
+				if (data_len > room)
+					data_len = room;
+
+				reply_l234_hdrlen = LL_RESERVED_SPACE(phy_dev) + phy_dev->needed_tailroom +
+					sizeof(struct iphdr) + sizeof(struct icmphdr);
+				vxlan_tnl_hdrlen = LL_RESERVED_SPACE(phy_dev) + phy_dev->needed_tailroom +
+					sizeof(struct iphdr) + sizeof(struct udphdr) + sizeof(struct vxlanhdr);
+
+				reply = alloc_skb(vxlan_tnl_hdrlen + reply_l234_hdrlen + data_len, GFP_ATOMIC);
+				reply->dev = phy_dev;
+				skb_reserve(reply, vxlan_tnl_hdrlen + reply_l234_hdrlen);
+
+				new_icmp = (struct icmphdr *)__skb_push(reply, sizeof(struct icmphdr));
+				new_icmp->type = ICMP_DEST_UNREACH;
+				new_icmp->code = ICMP_FRAG_NEEDED;
+				new_icmp->un.frag.mtu = htons(phy_dev->mtu - outer_l34_hdrlen);
+				new_icmp->checksum = 0;
+
+				new_ip = (struct iphdr *)__skb_push(reply, sizeof(struct iphdr));
+				orig_ip = ip_hdr(skb);
+				new_ip->ihl = 5;
+				new_ip->version = 4;
+				new_ip->ttl = 32;
+				new_ip->tos = 1;
+				new_ip->protocol = IPPROTO_ICMP;
+				new_ip->saddr = orig_ip->daddr;
+				new_ip->daddr = orig_ip->saddr;
+				new_ip->frag_off = 0;
+				new_ip->tot_len = htons(sizeof(struct iphdr) + sizeof(struct icmphdr) + data_len);
+				ip_send_check(new_ip);
+
+				new_eth = (struct ethhdr *)__skb_push(reply, sizeof(struct ethhdr));
+				orig_eth = eth_hdr(skb);
+				ether_addr_copy(new_eth->h_dest, orig_eth->h_source);
+				ether_addr_copy(new_eth->h_source, orig_eth->h_dest);
+				new_eth->h_proto = htons(ETH_P_IP);
+				reply->ip_summed = CHECKSUM_UNNECESSARY;
+				reply->pkt_type = PACKET_HOST;
+				reply->protocol = htons(ETH_P_IP);
+				memcpy(skb_put(reply, data_len), skb_network_header(skb), data_len);
+				new_icmp->checksum = csum_fold(csum_partial(new_icmp, sizeof(struct icmphdr) + data_len, 0));
+
+				/* vxlan encapsulation */
+				vxh = (struct vxlanhdr *)__skb_push(reply, sizeof(*vxh));
+				vxh->vx_flags = htonl(VXLAN_FLAGS);
+				vxh->vx_vni = htonl(vni << 8);
+
+				__skb_push(reply, sizeof(*uh));
+				skb_reset_transport_header(reply);
+				uh = udp_hdr(reply);
+				uh->dest = dst_port;
+				uh->source = src_port;
+				uh->len = htons(reply->len);
+				uh->check = 0;
+				csum = skb_checksum(reply, 0, reply->len, 0);
+				uh->check = udp_v4_check(reply->len, fl4.saddr, dst->sin.sin_addr.s_addr, csum);
+
+				tnl_ip = (struct iphdr *)__skb_push(reply, sizeof(struct iphdr));
+				skb_reset_network_header(reply);
+				tnl_ip->ihl = 5;
+				tnl_ip->version = 4;
+				tnl_ip->ttl = 32;
+				tnl_ip->tos = 1;
+				tnl_ip->protocol = IPPROTO_UDP;
+				tnl_ip->saddr = dst->sin.sin_addr.s_addr;
+				tnl_ip->daddr = fl4.saddr;
+				tnl_ip->frag_off = 0;
+				tnl_ip->tot_len = htons(reply->len);
+				ip_send_check(tnl_ip);
+
+				/* fill with nonsense mac header */
+				tnl_eth = (struct ethhdr *)__skb_push(reply, sizeof(struct ethhdr));
+				skb_reset_mac_header(reply);
+				orig_eth = eth_hdr(skb);
+				ether_addr_copy(tnl_eth->h_dest, orig_eth->h_source);
+				ether_addr_copy(tnl_eth->h_source, orig_eth->h_dest);
+				tnl_eth->h_proto = htons(ETH_P_IP);
+				__skb_pull(reply, skb_network_offset(reply));
+
+				/* push encapsulated ICMP message back to sender */
+				netif_rx_ni(reply);
+			}
+		}
 		err = vxlan_xmit_skb(vxlan->vn_sock, rt, skb,
 				     fl4.saddr, dst->sin.sin_addr.s_addr,
 				     tos, ttl, df, src_port, dst_port,
-- 
No zuo no die but I have to try.