From mboxrd@z Thu Jan 1 00:00:00 1970
From: Fan Du
Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
Date: Tue, 06 Jan 2015 17:34:11 +0800
Message-ID: <54ABAC13.9070402@gmail.com>
References: <1417156385-18276-1-git-send-email-fan.du@intel.com>
 <1417158128.3268.2@smtp.corp.redhat.com>
 <5A90DA2E42F8AE43BC4A093BF0678848DED92B@SHSMSX104.ccr.corp.intel.com>
 <20141201135225.GA16814@casper.infradead.org>
 <20141202154839.GB5344@t520.home>
 <20141202170927.GA9457@casper.infradead.org>
 <20141202173401.GB4126@redhat.com>
 <20141202174158.GB9457@casper.infradead.org>
 <5A90DA2E42F8AE43BC4A093BF0678848DEDFDB@SHSMSX104.ccr.corp.intel.com>
 <54AA2912.6090903@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Cc: "Du, Fan", Thomas Graf, "davem@davemloft.net", "Michael S. Tsirkin",
 Jason Wang, "netdev@vger.kernel.org", "fw@strlen.de",
 "dev@openvswitch.org", "pshelar@nicira.com"
To: Jesse Gross
Return-path:
Received: from mail-pa0-f49.google.com ([209.85.220.49]:32884 "EHLO
 mail-pa0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1753825AbbAFJeX (ORCPT ); Tue, 6 Jan 2015 04:34:23 -0500
Received: by mail-pa0-f49.google.com with SMTP id eu11so30472677pac.22
 for ; Tue, 06 Jan 2015 01:34:22 -0800 (PST)
In-Reply-To:
Sender: netdev-owner@vger.kernel.org
List-ID:

On 2015/1/6 1:58, Jesse Gross wrote:
> On Mon, Jan 5, 2015 at 1:02 AM, Fan Du wrote:
>> On 2014-12-03 10:31, Du, Fan wrote:
>>
>>>
>>>> -----Original Message-----
>>>> From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
>>>> Sent: Wednesday, December 3, 2014 1:42 AM
>>>> To: Michael S. Tsirkin
>>>> Cc: Du, Fan; 'Jason Wang'; netdev@vger.kernel.org; davem@davemloft.net;
>>>> fw@strlen.de; dev@openvswitch.org; jesse@nicira.com; pshelar@nicira.com
>>>> Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than
>>>> MTU
>>>>
>>>> On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>>>>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>>>>>> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>>>>>>> What about containers or any other virtualization environment that
>>>>>>> doesn't use Virtio?
>>>>>>
>>>>>> The host can dictate the MTU in that case for both veth or OVS
>>>>>> internal which would be primary container plumbing techniques.
>>>>>
>>>>> It typically can't do this easily for VMs with emulated devices:
>>>>> real ethernet uses a fixed MTU.
>>>>>
>>>>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>>>>> unrelated optimization.
>>>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>>>>
>>>> PMTU discovery only resolves the issue if an actual IP stack is running
>>>> inside the VM. This may not be the case at all.
>>>                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>> Some thoughts here:
>>>
>>> Think otherwise, this is indeed what host stack should forge a
>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message with _inner_ skb network and
>>> transport header, do whatever type of encapsulation, and thereafter push
>>> such packet upward to Guest/Container, which make them feel, the
>>> intermediate node or the peer send such message. PMTU should be expected
>>> to work correct.
>>> And such behavior should be shared by all other encapsulation tech if they
>>> are also suffered.
>>
>> Hi David, Jesse and Thomas
>>
>> As discussed in here:
>> https://www.marc.info/?l=linux-netdev&m=141764712631150&w=4 and
>> quotes from Jesse:
>> My proposal would be something like this:
>>  * For L2, reduce the VM MTU to the lowest common denominator on the
>>    segment.
>>  * For L3, use path MTU discovery or fragment inner packet (i.e.
>>    normal routing behavior).
>>  * As a last resort (such as if using an old version of virtio in the
>>    guest), fragment the tunnel packet.
>>
>>
>> For L2, it's a administrative action
>> For L3, PMTU approach looks better, because once the sender is alerted the
>> reduced MTU, packet size after encapsulation will not exceed physical MTU,
>> so no additional fragments efforts needed.
>> For "As a last resort... fragment the tunnel packet", the original patch:
>> https://www.marc.info/?l=linux-netdev&m=141715655024090&w=4 did the job, but
>> seems it's not welcomed.
> This needs to be properly integrated into IP processing if it is to
> work correctly.

Do you mean the original patch in this thread? Yes, it works correctly in
my cloud env. If you have any other concerns, please let me know. :)

> One of the reasons for only doing path MTU discovery
> for L3 is that it operates seamlessly as part of normal operation -
> there is no need to forge addresses or potentially generate ICMP when
> on an L2 network. However, this ignores the IP handling that is going
> on (note that in OVS it is possible for L3 to be implemented as a set
> of flows coming from a controller).
>
> It also should not be VXLAN specific or duplicate VXLAN encapsulation
> code. As this is happening before encapsulation, the generated ICMP
> does not need to be encapsulated either if it is created in the right
> location.

Yes, I agree. GRE shares the same issue in its code flow.
Pushing the ICMP message back without encapsulating it, and without sending
it down to the physical device, is possible. The "right location", as far
as I know, could only be in ovs_vport_send. In addition, this probably
requires a wrapper around the route lookup for GRE/VXLAN: once the underlay
device MTU is obtained from the routing information, calculating the
reduced MTU becomes feasible.
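
To make the reduced-MTU arithmetic concrete, here is a rough user-space sketch, not kernel code. It assumes an IPv4 underlay, a VXLAN tunnel carrying Ethernet frames, and a plain GRE tunnel with no key or checksum. The names encap_kind, encap_overhead and reduced_mtu are hypothetical and exist only for this illustration; in a real implementation the underlay MTU would come from the route lookup described above.

```c
/* Hypothetical encapsulation kinds for this sketch; not kernel identifiers. */
enum encap_kind { ENCAP_VXLAN_IPV4, ENCAP_GRE_IPV4 };

/* Smallest MTU an IPv4 link must support (RFC 791). */
#define IPV4_MIN_MTU 68

/* Bytes the tunnel adds in front of the inner IP packet, counted
 * against the IP MTU of the underlay device. */
static int encap_overhead(enum encap_kind kind)
{
    switch (kind) {
    case ENCAP_VXLAN_IPV4:
        /* outer IPv4 + UDP + VXLAN header + inner Ethernet header */
        return 20 + 8 + 8 + 14;
    case ENCAP_GRE_IPV4:
        /* outer IPv4 + base GRE header (no key, no checksum) */
        return 20 + 4;
    }
    return 0;
}

/* MTU to advertise to the guest in ICMP_FRAG_NEEDED, or -1 if the
 * underlay MTU is too small to carry any valid inner IPv4 packet. */
static int reduced_mtu(int underlay_mtu, enum encap_kind kind)
{
    int mtu = underlay_mtu - encap_overhead(kind);

    return mtu >= IPV4_MIN_MTU ? mtu : -1;
}
```

With a 1500-byte underlay, this gives 1450 for VXLAN and 1476 for GRE. Optional fields (GRE key, outer VLAN tag, IPv6 underlay) would each add to the overhead and are left out here.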