Netdev List
* Re: [PATCH] bonding: move ipoib_header_ops to vmlinux
From: Wengang Wang @ 2014-12-03  1:50 UTC (permalink / raw)
  To: David Miller, jay.vosburgh; +Cc: ogerlitz, netdev, linux-rdma
In-Reply-To: <54752D4C.7000603@oracle.com>

Hi David and Jay,

Then what about the change in this patch?

thanks,
wengang

On 2014-11-26 09:30, Wengang wrote:
> On 2014-11-26 02:44, David Miller wrote:
>> From: Jay Vosburgh <jay.vosburgh@canonical.com>
>> Date: Tue, 25 Nov 2014 10:41:17 -0800
>>
>>> Or Gerlitz <ogerlitz@mellanox.com> wrote:
>>>
>>>> On 11/25/2014 8:07 AM, David Miller wrote:
>>>>> IPOIB should not work over bonding as it requires that the device
>>>>> use ARPHRD_ETHER.
>>>> Hi Dave,
>>>>
>>>> IPoIB devices can be enslaved to both bonding and teaming in their 
>>>> HA mode,
>>>> the bond device type becomes ARPHRD_INFINIBAND when this happens.
>>>     The point was that pktgen disallows ARPHRD_INFINIBAND, not that
>>> bonding does.
>>>
>>>     Pktgen specifically checks for type != ARPHRD_ETHER, so the
>>> IPoIB bond should not be able to be used with pktgen.  My suspicion is
>>> that pktgen is being configured on the bond first, then an IPoIB slave
>>> is added to the bond; this would change its type in a way that pktgen
>>> wouldn't notice.
>> +1
>
> I think it goes this way:
>
> 1) bond_master is ready
> 2) bond_enslave() enslaves an IPoIB interface, calling bond_setup_by_slave()
> 3) bond_setup_by_slave() then changes the master type to ARPHRD_INFINIBAND.
>
> The code looks like this:
>
> /* enslave device <slave> to bond device <master> */
> int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
> {
> 	<snip>...
> 	/* set bonding device ether type by slave - bonding netdevices are
> 	 * created with ether_setup, so when the slave type is not ARPHRD_ETHER
> 	 * there is a need to override some of the type dependent attribs/funcs.
> 	 *
> 	 * bond ether type mutual exclusion - don't allow slaves of dissimilar
> 	 * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond
> 	 */
> 	if (!bond_has_slaves(bond)) {
> 		if (bond_dev->type != slave_dev->type) {
> 			<snip>...
> 			if (slave_dev->type != ARPHRD_ETHER)
> 				bond_setup_by_slave(bond_dev, slave_dev);
> 			else {
> 				ether_setup(bond_dev);
> 				bond_dev->priv_flags &= ~IFF_TX_SKB_SHARING;
> 			}
>
> 			call_netdevice_notifiers(NETDEV_POST_TYPE_CHANGE,
> 						 bond_dev);
> 		}
> 	<snip>...
> }
>
> static void bond_setup_by_slave(struct net_device *bond_dev,
> 				struct net_device *slave_dev)
> {
> 	bond_dev->header_ops = slave_dev->header_ops;
>
> 	bond_dev->type = slave_dev->type;
> 	bond_dev->hard_header_len = slave_dev->hard_header_len;
> 	bond_dev->addr_len = slave_dev->addr_len;
>
> 	memcpy(bond_dev->broadcast, slave_dev->broadcast,
> 	       slave_dev->addr_len);
> }
>
> thanks
> wengang
> -- 
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
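
The ordering problem suspected in the quoted discussion (a type check done once, followed by a later type change on enslave) can be modeled in a few lines of plain C. This is only a toy sketch, not kernel code: `toy_dev`, `toy_pktgen_attach()`, `toy_enslave()` and `toy_stale_user()` are hypothetical stand-ins, and the two `ARPHRD_*` values are copied from `<linux/if_arp.h>`.

```c
#include <stdbool.h>

/* Values as defined in <linux/if_arp.h>. */
#define ARPHRD_ETHER      1
#define ARPHRD_INFINIBAND 32

/* Hypothetical, heavily reduced stand-in for struct net_device. */
struct toy_dev {
	int type;
	int slave_count;
	bool pktgen_attached;
};

/* Models a one-time check at configuration time: only Ethernet passes. */
static bool toy_pktgen_attach(struct toy_dev *dev)
{
	if (dev->type != ARPHRD_ETHER)
		return false;
	dev->pktgen_attached = true;
	return true;
}

/* Models the first enslave: the bond takes over the slave's type. */
static void toy_enslave(struct toy_dev *bond, const struct toy_dev *slave)
{
	if (bond->slave_count == 0 && bond->type != slave->type)
		bond->type = slave->type;
	bond->slave_count++;
}

/* True when a user attached earlier now sits on a non-Ethernet device. */
static bool toy_stale_user(const struct toy_dev *dev)
{
	return dev->pktgen_attached && dev->type != ARPHRD_ETHER;
}
```

Attaching to a fresh bond succeeds because its type is still ARPHRD_ETHER; after an ARPHRD_INFINIBAND device is enslaved, `toy_stale_user()` reports the mismatch that the earlier check could no longer catch.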


* Re: [PATCH net-next] tcp: Add TCP tracer
From: Stephen Hemminger @ 2014-12-03  1:51 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: davem, netdev
In-Reply-To: <1417552662-16398-1-git-send-email-kafai@fb.com>

On Tue, 2 Dec 2014 12:37:42 -0800
Martin KaFai Lau <kafai@fb.com> wrote:

> diff --git a/include/uapi/linux/tcp_trace.h b/include/uapi/linux/tcp_trace.h
> index 2644f7f..d913a3c 100644
> --- a/include/uapi/linux/tcp_trace.h
> +++ b/include/uapi/linux/tcp_trace.h
> @@ -22,11 +22,11 @@ struct tcp_stats {
>  	__u32	other_segs_retrans;
>  	__u32	other_octets_retrans;
>  	__u32	loss_segs_retrans;
> -	__u32	loss_octects_retrans;
> +	__u32	loss_octets_retrans;
>  	__u32	segs_in;
>  	__u32	data_segs_in;
> -	__u64	rtt_sample_us;
>  	__u64	data_octets_in;
> +	__u64	rtt_sample_us;
>  	__u64	max_rtt_us;
>  	__u64	min_rtt_us;
>  	__u64   sum_rtt_us;
> @@ -64,9 +64,4 @@ struct tcp_trace_stats {
>          struct tcp_stats stats;
>  } __packed;

You can't change exposed kernel API like that.
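
To illustrate the objection: a program compiled against the old header keeps reading fields at the old offsets. The sketch below uses two reduced, hypothetical structs (`stats_old`, `stats_new`) that keep only the fields around the swap; they are not the real `struct tcp_stats`, and `old_reader_rtt()` is an invented helper.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reduced layout before the change (hypothetical excerpt). */
struct stats_old {
	uint32_t segs_in;
	uint32_t data_segs_in;
	uint64_t rtt_sample_us;
	uint64_t data_octets_in;
};

/* Same fields after the two 64-bit members are swapped. */
struct stats_new {
	uint32_t segs_in;
	uint32_t data_segs_in;
	uint64_t data_octets_in;
	uint64_t rtt_sample_us;
};

_Static_assert(sizeof(struct stats_old) == sizeof(struct stats_new),
	       "the swap keeps the size, so nothing flags the mismatch");

/* An "old" binary interpreting a record produced with the new layout. */
static uint64_t old_reader_rtt(const struct stats_new *rec)
{
	struct stats_old view;

	memcpy(&view, rec, sizeof(view));
	return view.rtt_sample_us;	/* actually holds data_octets_in */
}
```

The record size is unchanged, so the old reader silently reports the byte counter as an RTT sample. Renaming a field, as the same hunk does, additionally breaks any source that refers to the old name.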


* linux-next: manual merge of the net-next tree with the kselftest-fixes tree
From: Stephen Rothwell @ 2014-12-03  1:52 UTC (permalink / raw)
  To: David Miller, netdev, Shuah Khan
  Cc: linux-next, linux-kernel, Michael Ellerman, stephen hemminger

[-- Attachment #1: Type: text/plain, Size: 1021 bytes --]

Hi all,

Today's linux-next merge of the net-next tree got a conflict in
include/uapi/linux/Kbuild between commit 3f4994cfc15f ("kcmp: Move
kcmp.h into uapi") from the kselftest-fixes tree and commit
df32dd2054b6 ("uapi: resort Kbuild entries") from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --cc include/uapi/linux/Kbuild
index d78fecf216bf,a1e8175cc488..000000000000
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@@ -211,12 -211,11 +211,12 @@@ header-y += ivtv.
  header-y += ixjuser.h
  header-y += jffs2.h
  header-y += joystick.h
 +header-y += kcmp.h
- header-y += kd.h
  header-y += kdev_t.h
- header-y += kernel-page-flags.h
- header-y += kernel.h
+ header-y += kd.h
  header-y += kernelcapi.h
+ header-y += kernel.h
+ header-y += kernel-page-flags.h
  header-y += kexec.h
  header-y += keyboard.h
  header-y += keyctl.h



* Re: [PATCH net-next] ipv6: remove useless spin_lock/spin_unlock
From: Eric Dumazet @ 2014-12-03  1:57 UTC (permalink / raw)
  To: Duan Jiong; +Cc: David Miller, netdev
In-Reply-To: <547E6832.4070403@cn.fujitsu.com>

On Wed, 2014-12-03 at 09:32 +0800, Duan Jiong wrote:
> xchg is atomic, so there is no need to use spin_lock/spin_unlock
> to protect it.
> 
> Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
> ---
>  net/ipv6/ipv6_sockglue.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
> index e1a9583..92ca907 100644
> --- a/net/ipv6/ipv6_sockglue.c
> +++ b/net/ipv6/ipv6_sockglue.c
> @@ -112,9 +112,7 @@ struct ipv6_txoptions *ipv6_update_options(struct sock *sk,
>  		}
>  		opt = xchg(&inet6_sk(sk)->opt, opt);
>  	} else {
> -		spin_lock(&sk->sk_dst_lock);
>  		opt = xchg(&inet6_sk(sk)->opt, opt);
> -		spin_unlock(&sk->sk_dst_lock);
>  	}
>  	sk_dst_reset(sk);
>  

Why keep two copies of opt = xchg(&inet6_sk(sk)->opt, opt); then?


* RE: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: Du, Fan @ 2014-12-03  1:58 UTC (permalink / raw)
  To: Flavio Leitner, Jesse Gross
  Cc: Jason Wang, netdev@vger.kernel.org, davem@davemloft.net,
	fw@strlen.de, Du, Fan
In-Reply-To: <20141202213232.GC5344@t520.home>



>-----Original Message-----
>From: Flavio Leitner [mailto:fbl@redhat.com]
>Sent: Wednesday, December 3, 2014 5:33 AM
>To: Jesse Gross
>Cc: Du, Fan; Jason Wang; netdev@vger.kernel.org; davem@davemloft.net;
>fw@strlen.de
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>On Tue, Dec 02, 2014 at 10:06:53AM -0800, Jesse Gross wrote:
>> On Tue, Dec 2, 2014 at 7:44 AM, Flavio Leitner <fbl@redhat.com> wrote:
>> > On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:
>> >>
>> >>
>> >> >-----Original Message-----
>> >> >From: Jason Wang [mailto:jasowang@redhat.com]
>> >> >Sent: Friday, November 28, 2014 3:02 PM
>> >> >To: Du, Fan
>> >> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du,
>> >> >Fan
>> >> >Subject: Re: [PATCH net] gso: do GSO for local skb with size
>> >> >bigger than MTU
>> >> >
>> >> >
>> >> >
>> >> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
>> >> >> Test scenario: two KVM guests sitting on different hosts
>> >> >> communicate with each other over a vxlan tunnel.
>> >> >>
>> >> >> All interface MTUs are the default 1500 bytes. From the guest's
>> >> >> point of view, its skb gso_size can be as big as 1448 bytes; however,
>> >> >> after the guest skb goes through vxlan encapsulation, individual
>> >> >> segments of a gso packet can exceed the physical NIC MTU of
>> >> >> 1500 and will be lost at the receiver side.
>> >> >>
>> >> >> So in a virtualized environment it is possible for the length of a
>> >> >> locally created skb after encapsulation to exceed the underlying
>> >> >> MTU. In such a case, it's reasonable to do GSO first, then fragment
>> >> >> any packet bigger than the MTU where possible.
>> >> >>
>> >> >> +---------------+ TX     RX +---------------+
>> >> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
>> >> >> +-+-----------+-+           +-+-----------+-+
>> >> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
>> >> >>   +-----------+               +-----------+
>> >> >>        |                            |
>> >> >>        v tap0                  tap0 v
>> >> >>   +-----------+               +-----------+
>> >> >>   | ovs bridge|               | ovs bridge|
>> >> >>   +-----------+               +-----------+
>> >> >>        | vxlan                vxlan |
>> >> >>        v                            v
>> >> >>   +-----------+               +-----------+
>> >> >>   |    NIC    |    <------>   |    NIC    |
>> >> >>   +-----------+               +-----------+
>> >> >>
>> >> >> Steps to reproduce:
>> >> >>  1. Use the kernel's built-in openvswitch module to set up the ovs bridge.
>> >> >>  2. Run iperf without -M; the connection gets stuck.
>> >> >
>> >> >Is this issue specific to ovs or ipv4? Path MTU discovery should
>> >> >help in this case I believe.
>> >>
>> >> The problem here is that the host stack pushes a local over-sized gso
>> >> skb down to the NIC and performs GSO there without any further IP
>> >> segmentation.
>> >>
>> >> Reasonable behavior is to do GSO first at the IP level; if the gso-ed
>> >> skb is bigger than the MTU && DF is set, then push an
>> >> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to the sender to
>> >> adjust the MTU.
>> >>
>> >> For PMTU to work, that's another issue I will try to address later on.
>> >>
>> >> >>
>> >> >>
>> >> >> Signed-off-by: Fan Du <fan.du@intel.com>
>> >> >> ---
>> >> >>  net/ipv4/ip_output.c |    7 ++++---
>> >> >>  1 files changed, 4 insertions(+), 3 deletions(-)
>> >> >>
>> >> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>> >> >> index bc6471d..558b5f8 100644
>> >> >> --- a/net/ipv4/ip_output.c
>> >> >> +++ b/net/ipv4/ip_output.c
>> >> >> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>> >> >>    struct sk_buff *segs;
>> >> >>    int ret = 0;
>> >> >>
>> >> >> -  /* common case: locally created skb or seglen is <= mtu */
>> >> >> -  if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
>> >> >> -        skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> >> +  /* Both locally created skb and forwarded skb could exceed
>> >> >> +   * MTU size, so make a unified rule for them all.
>> >> >> +   */
>> >> >> +  if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> >>            return ip_finish_output2(skb);
>> >
>> >
>> > Are you using kernel's vxlan device or openvswitch's vxlan device?
>> >
>> > Because for kernel's vxlan devices the MTU accounts for the header
>> > overhead so I believe your patch would work.  However, the MTU is
>> > not visible for the ovs's vxlan devices, so that wouldn't work.
>>
>> This is being called after the tunnel code, so the MTU that is being
>> looked at in all cases is the physical device's. Since the packet has
>> already been encapsulated, tunnel header overhead is already accounted
>> for in skb_gso_network_seglen() and this should be fine for both OVS
>> and non-OVS cases.
>
>Right, it didn't work on my first try and that explanation came to mind.
>
>Anyway, I am testing this with containers instead of VMs, so I am using veth and
>not Virtio-net.
>
>This is the actual stack trace:
>
>[...]
>  do_output
>  ovs_vport_send
>  vxlan_tnl_send
>  vxlan_xmit_skb
>  udp_tunnel_xmit_skb
>  iptunnel_xmit
>   \ skb_scrub_packet => skb->ignore_df = 0;
>  ip_local_out_sk
>  ip_output
>  ip_finish_output (_gso is inlined)
>  ip_fragment
>
>and on ip_fragment() it does:
>
>	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
>		     (IPCB(skb)->frag_max_size &&
>		      IPCB(skb)->frag_max_size > mtu))) {
>		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
>		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
>			  htonl(mtu));
>		kfree_skb(skb);
>		return -EMSGSIZE;
>	}
>
>Since IP_DF is set and skb->ignore_df is reset to 0, in my case the packet is
>dropped and an ICMP is sent back. The connection remains stuck as before.
>Doesn't virtio-net set DF bit?

Thanks for giving it a try and seeing what really happens.

You are almost there! ip_fragment() honors IP_DF, and that bit is controlled by the vxlan interface.
In a practical environment, the vxlan interface should take the conservative approach and allow
fragmentation, by appending "options: df_default=false" when the vxlan interface is created.

Why allow fragmentation? Because a Guest or Container may send over-MTU-sized packets downwards,
and the host is expected to be prepared for that. This is just what happens in real-world cloud environments.
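
The drop condition quoted above from ip_fragment() can be restated as a small standalone predicate. This is a rough sketch in plain C, not kernel code: `toy_pkt` and `toy_must_drop()` are hypothetical names, and only the fields tested by the quoted check are modeled.

```c
#include <stdbool.h>

/* Hypothetical reduction of the state tested by the quoted check. */
struct toy_pkt {
	bool df_set;			/* IP_DF set in iph->frag_off */
	bool ignore_df;			/* skb->ignore_df */
	unsigned int frag_max_size;	/* IPCB(skb)->frag_max_size, 0 if unset */
};

/* True when the packet is dropped and ICMP_FRAG_NEEDED is sent back. */
static bool toy_must_drop(const struct toy_pkt *p, unsigned int mtu)
{
	return (p->df_set && !p->ignore_df) ||
	       (p->frag_max_size && p->frag_max_size > mtu);
}
```

With DF set and ignore_df cleared (the situation after skb_scrub_packet() in the quoted trace) the predicate is true and the packet is dropped. If the outer header is created without DF, which is what "df_default=false" is meant to achieve, the first clause no longer fires and the packet can be fragmented instead.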


>Thanks,
>fbl


* Re: [PATCH net-next] ipv6: remove useless spin_lock/spin_unlock
From: Duan Jiong @ 2014-12-03  2:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1417571879.5303.83.camel@edumazet-glaptop2.roam.corp.google.com>

On 12/03/2014 09:57 AM, Eric Dumazet wrote:
> On Wed, 2014-12-03 at 09:32 +0800, Duan Jiong wrote:
>> xchg is atomic, so there is no need to use spin_lock/spin_unlock
>> to protect it.
>>
>> Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
>> ---
>>  net/ipv6/ipv6_sockglue.c | 2 --
>>  1 file changed, 2 deletions(-)
>>
>> diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
>> index e1a9583..92ca907 100644
>> --- a/net/ipv6/ipv6_sockglue.c
>> +++ b/net/ipv6/ipv6_sockglue.c
>> @@ -112,9 +112,7 @@ struct ipv6_txoptions *ipv6_update_options(struct sock *sk,
>>  		}
>>  		opt = xchg(&inet6_sk(sk)->opt, opt);
>>  	} else {
>> -		spin_lock(&sk->sk_dst_lock);
>>  		opt = xchg(&inet6_sk(sk)->opt, opt);
>> -		spin_unlock(&sk->sk_dst_lock);
>>  	}
>>  	sk_dst_reset(sk);
>>  
> 
> Why keep two copies of opt = xchg(&inet6_sk(sk)->opt, opt); then?
> 

Thanks for the reminder, I didn't notice that.
The else branch can be removed and opt = xchg(&inet6_sk(sk)->opt, opt); moved
out of the if statement. I will send a v2.

Thanks,
  Duan
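
For readers less familiar with xchg(): it installs the new pointer and hands back the old one in a single atomic step, which is the argument the patch relies on for dropping the lock around that one statement. Below is a user-space sketch that uses C11 `atomic_exchange` as a stand-in for the kernel's xchg(); `struct opts`, `current_opts` and `update_options()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical placeholder for struct ipv6_txoptions. */
struct opts {
	int id;
};

/* Slot holding the currently installed options (starts out NULL). */
static _Atomic(struct opts *) current_opts;

/* Installs new_opt and returns whatever was installed before it. */
static struct opts *update_options(struct opts *new_opt)
{
	return atomic_exchange(&current_opts, new_opt);
}
```

Because the swap and the read of the old value are one operation, a single call placed after the if block covers both branches, which is the shape proposed for v2.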

> 
> 
> 


* Netdev 0.1 Call for Proposals
From: Richard Guy Briggs @ 2014-12-03  2:28 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	lwn.net-WsCbK3Kd604hZSLhaPfTA7DIv+2fBXtB,
	netdev01-wool9L35kiczKOhml7GhPkB+6BGkLq7r

Netdev 0.1 Call for Proposals
-----------------------------

Netdev 0.1 (year 0, conference 1) is a community-driven conference
geared towards Linux netheads. Linux kernel networking and user 
space utilization of the interfaces to the Linux kernel networking
subsystem are the focus.

There are 4 phases/formats to Netdev 0.1

1) Workshops (day 1)

The workshop format is inspired by Netconf and the wireless
mini-summits, with workshops being centered around existing
networking subsystems. Workshops are intended to be an extension of
the mailing list in the sense that many times previous
discussions from the mailing list (or that could otherwise have
happened there) are taken to the round-table to simplify the
decision-making process.

The networking subsystem maintainer(s) should at least prepare a
list of agenda items well before the workshop takes place to allow 
participants to come prepared; this makes the discussions most productive.
Sometimes brain-storming sessions will also be appropriate where
being prepared is less important, for example for discussions
around new user requirements this can be very valuable.

At the workshop meeting itself discussions prevail and notes are
later sent back to the mailing list; presentations are typically
- at the discretion of the chairs - only used where needed to
clarify a problem statement for discussion.

The sitting format is round-table.

2) BOFs (day 1)

BOFs are sessions with a potential to become a workshop in a future
Netdev conference. The lifetime of a BOF may be only one or two 
Netdev conference gatherings. We discourage perpetual BOFs.
BOFs don't need to have an existing networking subsystem or mailing list.
BOFs also don't need to strive to be upgraded to be a Workshop
in the future. Their longevity could only be one conference.
The sitting format could vary and be either lecture or round table format
depending on the proposal.

3) Tutorials (day 2)

Tutorials are generally about 2 hours long (or more at the discretion
of the proposal).
Tutorials are educational in nature and are presented in a classroom 
format with a specific educational outcome for the attendees.

4) Paper proposals (days 3 and 4)

These are classical conference paper + presentations.
Presentations are 30 minutes long with an additional 15 minutes for Q&A
presented in a lecture format.
We will require paper submissions for these sessions. The committee 
believes that a paper submission raises the quality of the presentations
and makes it easier to build on presented ideas in the future.

The Netdev conference this year is structured to be 50% by-invitation 
and 50% submission. We are making sure that we reach out to speakers 
who have interesting relevant topics because we recognize most of 
these folks would typically not be submitting papers to a conference.
The invitation will be made by the technical committee to the individual
speakers for workshop, paper and tutorial sessions.

This call for papers is for the 50% submission portion of the 
conference for paper submissions, tutorials and workshops.
We *highly discourage* submission of recycled talks.

Current technical focus topics include:
- wireless
- performance analysis, debugging and improvement
- networking hardware and offload
- netfilter
- traffic control
- different networking layers (L2/3, etc)
- Internet of things
- security
- additional topics can be suggested

Unlike other conferences, we are going to try and accommodate as many
submissions as possible - but please stay within the relevant topic focus 
and tie to Linux networking to make it easier for the technical committee
to provide quick feedback. In order to give a talk you must be 
registered. If your proposal is accepted you will not be charged 
a conference fee or your conference fee will be refunded to you 
when your talk gets accepted.

We expect a minimum of 2 parallel tracks, but likely more depending on the
quantity of submissions, in all phases, i.e. during tutorials,
workshops and main talks.

Why you should submit a proposal
---------------------------------
If you yearn for the old community tech driven conferences where 
you mingle with fellow geeks (only these would be Linux networking
geeks) then this would be it. There will be no marketing flashy 
openings. There will just be a pure feed of Linux networking.
Netdev 0.1 will be held back to back with Netconf 2015, the 
by-invite Linux kernel networking workshop 
(http://vger.kernel.org/netconf2015.html). 
So gurus of all sorts will be there mingling and giving talks.
While there will be heavy Linux kernel influence we expect a lot
of user space presence as well. 

How to submit a proposal
------------------------
Send email to  netdev01-wool9L35kiczKOhml7GhPkB+6BGkLq7r@public.gmane.org with a paragraph or
two of your proposal.
For paper proposals, if your submission is accepted we will provide
you a template to use.
A minimum of two pages is needed so as to allow people to skip the
burden of writing a large paper. The maximum page limit is 10 pages.

Location:
---------
Downtown Ottawa, Canada
www.netdev01.org

Important Dates:
----------------
December 02, 2014	 	Call for Papers opens
December 10, 2014	 	Registration opens
January 10, 2015	 	Call for sessions deadline
January 20, 2015	 	Conference schedule announced
February 14-17, 2015            Conference days

Please register as soon as registration opens up on December 10.
Registering helps us plan properly for numbers of attendees,
ensuring venue sizes and supplies are appropriate without
wasting resources.



	slainte mhath, RGB

--
Richard Guy Briggs               --  ~\    -- ~\            <hpv.tricolour.net>
<www.TriColour.ca>                 --  \___   o \@       @       Ride yer bike!
Ottawa, ON, CANADA                  --  Lo_>__M__\\/\%__\\/\%
Vote! -- <greenparty.ca>_____GTVS6#790__(*)__(*)________(*)(*)_________________


* RE: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: Du, Fan @ 2014-12-03  2:31 UTC (permalink / raw)
  To: Thomas Graf, Michael S. Tsirkin
  Cc: 'Jason Wang', netdev@vger.kernel.org, davem@davemloft.net,
	fw@strlen.de, dev@openvswitch.org, jesse@nicira.com,
	pshelar@nicira.com, Du, Fan
In-Reply-To: <20141202174158.GB9457@casper.infradead.org>



>-----Original Message-----
>From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
>Sent: Wednesday, December 3, 2014 1:42 AM
>To: Michael S. Tsirkin
>Cc: Du, Fan; 'Jason Wang'; netdev@vger.kernel.org; davem@davemloft.net;
>fw@strlen.de; dev@openvswitch.org; jesse@nicira.com; pshelar@nicira.com
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>> > > What about containers or any other virtualization environment that
>> > > doesn't use Virtio?
>> >
>> > The host can dictate the MTU in that case for both veth or OVS
>> > internal which would be primary container plumbing techniques.
>>
>> It typically can't do this easily for VMs with emulated devices:
>> real ethernet uses a fixed MTU.
>>
>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>> unrelated optimization.
>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>
>PMTU discovery only resolves the issue if an actual IP stack is running inside the
>VM. This may not be the case at all.
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some thoughts here:

Looking at it the other way around: this is exactly the case where the host stack should forge an
ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message using the _inner_ skb's network and transport headers,
apply whatever type of encapsulation is in use, and then push that packet up to the Guest/Container,
so that it looks as if an intermediate node or the peer had sent the message. PMTU discovery should
then be expected to work correctly.
The same behavior should be shared by all other encapsulation technologies that suffer from this problem.
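
As a rough illustration of which MTU such a forged ICMP_FRAG_NEEDED message would have to carry, here is the arithmetic for VXLAN over IPv4. This is a sketch under assumptions: the helper names and `TOY_*` constants are hypothetical, IP options and VLAN tags are ignored, and nothing in this thread specifies how the value would actually be computed.

```c
#include <stdint.h>

/* Header sizes in bytes for VXLAN over IPv4 without IP options. */
enum {
	TOY_OUTER_IPV4_HLEN = 20,
	TOY_OUTER_UDP_HLEN  = 8,
	TOY_VXLAN_HLEN      = 8,
	TOY_INNER_ETH_HLEN  = 14,
};

/* Largest inner IP packet that still fits the underlay MTU once
 * encapsulated. Assumes underlay_mtu is larger than the 50-byte overhead.
 */
static uint32_t toy_inner_mtu(uint32_t underlay_mtu)
{
	return underlay_mtu - (TOY_OUTER_IPV4_HLEN + TOY_OUTER_UDP_HLEN +
			       TOY_VXLAN_HLEN + TOY_INNER_ETH_HLEN);
}

/* Size of the outer IP packet produced for a given inner IP packet. */
static uint32_t toy_outer_len(uint32_t inner_ip_len)
{
	return inner_ip_len + TOY_INNER_ETH_HLEN + TOY_VXLAN_HLEN +
	       TOY_OUTER_UDP_HLEN + TOY_OUTER_IPV4_HLEN;
}
```

With a 1500-byte underlay, a full 1500-byte inner packet grows to 1550 bytes on the wire, while an inner MTU of 1450 bytes would keep the encapsulated packet within 1500.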


>I agree that exposing an MTU towards the guest is not applicable in all situations,
>in particular because it is difficult to decide what MTU to expose. It is a relatively
>elegant solution in a lot of virtualization host cases hooked up to an orchestration
>system though.


* [PATCH net-next v2] ipv6: remove useless spin_lock/spin_unlock
From: Duan Jiong @ 2014-12-03  2:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <547E6832.4070403@cn.fujitsu.com>


xchg is atomic, so there is no need to use spin_lock/spin_unlock
to protect it. Also remove the redundant
opt = xchg(&inet6_sk(sk)->opt, opt); statement.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
---
v2: remove the redundant opt = xchg(&inet6_sk(sk)->opt, opt); statement.

 net/ipv6/ipv6_sockglue.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index e1a9583..66980d8 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -110,12 +110,8 @@ struct ipv6_txoptions *ipv6_update_options(struct sock *sk,
 			icsk->icsk_ext_hdr_len = opt->opt_flen + opt->opt_nflen;
 			icsk->icsk_sync_mss(sk, icsk->icsk_pmtu_cookie);
 		}
-		opt = xchg(&inet6_sk(sk)->opt, opt);
-	} else {
-		spin_lock(&sk->sk_dst_lock);
-		opt = xchg(&inet6_sk(sk)->opt, opt);
-		spin_unlock(&sk->sk_dst_lock);
 	}
+	opt = xchg(&inet6_sk(sk)->opt, opt);
 	sk_dst_reset(sk);
 
 	return opt;
-- 
1.8.3.1


* [PATCH net-next v2] rtnetlink: delay RTM_DELLINK notification until after ndo_uninit()
From: Mahesh Bandewar @ 2014-12-03  2:43 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Roopa Prabhu, Toshiaki Makita,
	Mahesh Bandewar

The commit 56bfa7ee7c ("unregister_netdevice : move RTM_DELLINK to
until after ndo_uninit") tried to do this earlier, but in doing so
it created a problem. Unfortunately the delayed rtmsg_ifinfo() also
delayed the call to fill_info(). This translated into asking the driver
to remove its private state and then querying that private state, which
could have catastrophic consequences.

This change breaks rtmsg_ifinfo() into two parts: the first takes a
precise snapshot of the device by calling fill_info() before
ndo_uninit() is called, and the second sends the notification using
the collected snapshot.

The problem was noticed when the last link is deleted from an ipvlan
device: the port has already been freed, and the subsequent .fill_info()
call tries to get its info from the port.
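
The snapshot-then-send ordering described above can be sketched outside the kernel as follows. All names here (`toy_dev`, `toy_msg`, `toy_build_msg()`, `toy_uninit()`, `toy_send_msg()`) are hypothetical stand-ins rather than the functions added by this patch, and a heap buffer takes the place of the skb.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical device whose private state is released by uninit. */
struct toy_dev {
	char name[16];
	int *port_id;		/* private state, NULL after toy_uninit() */
};

/* Hypothetical notification payload, standing in for the skb. */
struct toy_msg {
	char text[64];
};

/* Step 1: take the snapshot while the private state still exists. */
static struct toy_msg *toy_build_msg(const struct toy_dev *dev)
{
	struct toy_msg *msg;

	if (!dev->port_id)
		return NULL;
	msg = malloc(sizeof(*msg));
	if (!msg)
		return NULL;
	snprintf(msg->text, sizeof(msg->text), "DELLINK %s port=%d",
		 dev->name, *dev->port_id);
	return msg;
}

/* Driver teardown: the private state is gone after this. */
static void toy_uninit(struct toy_dev *dev)
{
	free(dev->port_id);
	dev->port_id = NULL;
}

/* Step 2: deliver the snapshot; the device is not queried again. */
static void toy_send_msg(struct toy_msg *msg, char *out, size_t outlen)
{
	snprintf(out, outlen, "%s", msg->text);
	free(msg);
}
```

Calling toy_build_msg() before toy_uninit() and toy_send_msg() afterwards keeps the notification after teardown without ever reading freed state; building the message after teardown fails in this model, which mirrors the situation behind the warning below.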

kernel: [  255.139429] ------------[ cut here ]------------
kernel: [  255.139439] WARNING: CPU: 12 PID: 11173 at net/core/rtnetlink.c:2238 rtmsg_ifinfo+0x100/0x110()
kernel: [  255.139493] Modules linked in: ipvlan bonding w1_therm ds2482 wire cdc_acm ehci_pci ehci_hcd i2c_dev i2c_i801 i2c_core msr cpuid bnx2x ptp pps_core mdio libcrc32c
kernel: [  255.139513] CPU: 12 PID: 11173 Comm: ip Not tainted 3.18.0-smp-DEV #167
kernel: [  255.139514] Hardware name: Intel RML,PCH/Ibis_QC_18, BIOS 1.0.10 05/15/2012
kernel: [  255.139515]  0000000000000009 ffff880851b6b828 ffffffff815d87f4 00000000000000e0
kernel: [  255.139516]  0000000000000000 ffff880851b6b868 ffffffff8109c29c 0000000000000000
kernel: [  255.139518]  00000000ffffffa6 00000000000000d0 ffffffff81aaf580 0000000000000011
kernel: [  255.139520] Call Trace:
kernel: [  255.139527]  [<ffffffff815d87f4>] dump_stack+0x46/0x58
kernel: [  255.139531]  [<ffffffff8109c29c>] warn_slowpath_common+0x8c/0xc0
kernel: [  255.139540]  [<ffffffff8109c2ea>] warn_slowpath_null+0x1a/0x20
kernel: [  255.139544]  [<ffffffff8150d570>] rtmsg_ifinfo+0x100/0x110
kernel: [  255.139547]  [<ffffffff814f78b5>] rollback_registered_many+0x1d5/0x2d0
kernel: [  255.139549]  [<ffffffff814f79cf>] unregister_netdevice_many+0x1f/0xb0
kernel: [  255.139551]  [<ffffffff8150acab>] rtnl_dellink+0xbb/0x110
kernel: [  255.139553]  [<ffffffff8150da90>] rtnetlink_rcv_msg+0xa0/0x240
kernel: [  255.139557]  [<ffffffff81329283>] ? rhashtable_lookup_compare+0x43/0x80
kernel: [  255.139558]  [<ffffffff8150d9f0>] ? __rtnl_unlock+0x20/0x20
kernel: [  255.139562]  [<ffffffff8152cb11>] netlink_rcv_skb+0xb1/0xc0
kernel: [  255.139563]  [<ffffffff8150a495>] rtnetlink_rcv+0x25/0x40
kernel: [  255.139565]  [<ffffffff8152c398>] netlink_unicast+0x178/0x230
kernel: [  255.139567]  [<ffffffff8152c75f>] netlink_sendmsg+0x30f/0x420
kernel: [  255.139571]  [<ffffffff814e0b0c>] sock_sendmsg+0x9c/0xd0
kernel: [  255.139575]  [<ffffffff811d1d7f>] ? rw_copy_check_uvector+0x6f/0x130
kernel: [  255.139577]  [<ffffffff814e11c9>] ? copy_msghdr_from_user+0x139/0x1b0
kernel: [  255.139578]  [<ffffffff814e1774>] ___sys_sendmsg+0x304/0x310
kernel: [  255.139581]  [<ffffffff81198723>] ? handle_mm_fault+0xca3/0xde0
kernel: [  255.139585]  [<ffffffff811ebc4c>] ? destroy_inode+0x3c/0x70
kernel: [  255.139589]  [<ffffffff8108e6ec>] ? __do_page_fault+0x20c/0x500
kernel: [  255.139597]  [<ffffffff811e8336>] ? dput+0xb6/0x190
kernel: [  255.139606]  [<ffffffff811f05f6>] ? mntput+0x26/0x40
kernel: [  255.139611]  [<ffffffff811d2b94>] ? __fput+0x174/0x1e0
kernel: [  255.139613]  [<ffffffff814e2129>] __sys_sendmsg+0x49/0x90
kernel: [  255.139615]  [<ffffffff814e2182>] SyS_sendmsg+0x12/0x20
kernel: [  255.139617]  [<ffffffff815df092>] system_call_fastpath+0x12/0x17
kernel: [  255.139619] ---[ end trace 5e6703e87d984f6b ]---

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Cc: David S. Miller <davem@davemloft.net>
---
v1:
	Initial version
v2:
	Keep the rtmsg_ifinfo() return type as it is but break the function into
	two minimizing the changes all over places

 include/linux/rtnetlink.h |  5 +++++
 net/core/dev.c            | 12 +++++++++---
 net/core/rtnetlink.c      | 27 +++++++++++++++++++++++----
 3 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 6cacbce1a06c..19dc0bce9c2b 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -17,6 +17,11 @@ extern int rtnl_put_cacheinfo(struct sk_buff *skb, struct dst_entry *dst,
 			      u32 id, long expires, u32 error);
 
 void rtmsg_ifinfo(int type, struct net_device *dev, unsigned change, gfp_t flags);
+struct sk_buff *rtmsg_ifinfo_build_skb(int type, struct net_device *dev,
+				       unsigned change, gfp_t flags);
+void rtmsg_ifinfo_send(struct sk_buff *skb, struct net_device *dev,
+		       gfp_t flags);
+
 
 /* RTNL is used as a global lock for all changes to network configuration  */
 extern void rtnl_lock(void);
diff --git a/net/core/dev.c b/net/core/dev.c
index ac4836241a96..98f6563b68b6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5925,6 +5925,8 @@ static void rollback_registered_many(struct list_head *head)
 	synchronize_net();
 
 	list_for_each_entry(dev, head, unreg_list) {
+		struct sk_buff *skb = NULL;
+
 		/* Shutdown queueing discipline. */
 		dev_shutdown(dev);
 
@@ -5934,6 +5936,11 @@ static void rollback_registered_many(struct list_head *head)
 		*/
 		call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
 
+		if (!dev->rtnl_link_ops ||
+		    dev->rtnl_link_state == RTNL_LINK_INITIALIZED)
+			skb = rtmsg_ifinfo_build_skb(RTM_DELLINK, dev, ~0U,
+						     GFP_KERNEL);
+
 		/*
 		 *	Flush the unicast and multicast chains
 		 */
@@ -5943,9 +5950,8 @@ static void rollback_registered_many(struct list_head *head)
 		if (dev->netdev_ops->ndo_uninit)
 			dev->netdev_ops->ndo_uninit(dev);
 
-		if (!dev->rtnl_link_ops ||
-		    dev->rtnl_link_state == RTNL_LINK_INITIALIZED)
-			rtmsg_ifinfo(RTM_DELLINK, dev, ~0U, GFP_KERNEL);
+		if (skb)
+			rtmsg_ifinfo_send(skb, dev, GFP_KERNEL);
 
 		/* Notifier chain MUST detach us all upper devices. */
 		WARN_ON(netdev_has_any_upper_dev(dev));
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index b9b7dfaf202b..fddddebc1aa6 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2220,8 +2220,16 @@ static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
-void rtmsg_ifinfo(int type, struct net_device *dev, unsigned int change,
-		  gfp_t flags)
+void rtmsg_ifinfo_send(struct sk_buff *skb, struct net_device *dev, gfp_t flags)
+{
+	struct net *net = dev_net(dev);
+
+	rtnl_notify(skb, net, 0, RTNLGRP_LINK, NULL, flags);
+}
+EXPORT_SYMBOL(rtmsg_ifinfo_send);
+
+struct sk_buff *rtmsg_ifinfo_build_skb(int type, struct net_device *dev,
+				       unsigned int change, gfp_t flags)
 {
 	struct net *net = dev_net(dev);
 	struct sk_buff *skb;
@@ -2239,11 +2247,22 @@ void rtmsg_ifinfo(int type, struct net_device *dev, unsigned int change,
 		kfree_skb(skb);
 		goto errout;
 	}
-	rtnl_notify(skb, net, 0, RTNLGRP_LINK, NULL, flags);
-	return;
+	return skb;
 errout:
 	if (err < 0)
 		rtnl_set_sk_err(net, RTNLGRP_LINK, err);
+	return NULL;
+}
+EXPORT_SYMBOL(rtmsg_ifinfo_build_skb);
+
+void rtmsg_ifinfo(int type, struct net_device *dev, unsigned int change,
+		  gfp_t flags)
+{
+	struct sk_buff *skb;
+
+	skb = rtmsg_ifinfo_build_skb(type, dev, change, flags);
+	if (skb)
+		rtmsg_ifinfo_send(skb, dev, flags);
 }
 EXPORT_SYMBOL(rtmsg_ifinfo);
 
-- 
2.2.0.rc0.207.ga3a616c

* [PATCHv11 net-next 1/2] openvswitch: Refactor ovs_nla_fill_match().
From: Joe Stringer @ 2014-12-03  2:56 UTC (permalink / raw)
  To: netdev; +Cc: dev, linux-kernel

Refactor the ovs_nla_fill_match() function into separate netlink
serialization functions ovs_nla_put_{unmasked_key,masked_key,mask}().
Modify ovs_nla_put_flow() to handle attribute nesting and expose the
'is_mask' parameter: all callers need to nest the flow, and each caller
knows best whether it is serializing a mask.
The next patch will be the first user of ovs_nla_put_masked_key().
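
To illustrate the shape of the refactoring, here is a minimal userspace
sketch, not kernel code. It assumes a toy byte buffer in place of an
sk_buff and omits netlink alignment padding. Every toy_* name is
hypothetical and exists only for this example. The point it tries to
show is that the wrapper owns the nesting while the caller chooses the
outer attribute type and the is_mask value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for a netlink message buffer; NOT the kernel's sk_buff. */
struct toy_buf {
	uint8_t data[256];
	size_t len;
};

/* Toy attribute header: total length (header included) and type. */
struct toy_attr {
	uint16_t len;
	uint16_t type;
};

/* Append one attribute with payload; returns 0, or -1 if it does not fit. */
static int toy_put(struct toy_buf *b, uint16_t type, const void *p, size_t n)
{
	struct toy_attr hdr = { (uint16_t)(sizeof(hdr) + n), type };

	if (b->len + sizeof(hdr) + n > sizeof(b->data))
		return -1;
	memcpy(b->data + b->len, &hdr, sizeof(hdr));
	memcpy(b->data + b->len + sizeof(hdr), p, n);
	b->len += sizeof(hdr) + n;
	return 0;
}

/* Open a nested attribute; returns the header offset, or -1 if full. */
static long toy_nest_start(struct toy_buf *b, uint16_t type)
{
	struct toy_attr hdr = { 0, type };
	long off = (long)b->len;

	if (b->len + sizeof(hdr) > sizeof(b->data))
		return -1;
	memcpy(b->data + b->len, &hdr, sizeof(hdr));
	b->len += sizeof(hdr);
	return off;
}

/* Close a nested attribute by patching its length field. */
static void toy_nest_end(struct toy_buf *b, long off)
{
	struct toy_attr hdr;

	assert(off >= 0 && (size_t)off + sizeof(hdr) <= b->len);
	memcpy(&hdr, b->data + off, sizeof(hdr));
	hdr.len = (uint16_t)(b->len - (size_t)off);
	memcpy(b->data + off, &hdr, sizeof(hdr));
}

/* Inner serializer: emits the fields and knows nothing about nesting. */
static int toy_put_fields(struct toy_buf *b, uint32_t value, int is_mask)
{
	uint8_t flag = is_mask ? 1 : 0;

	if (toy_put(b, 1, &value, sizeof(value)))
		return -1;
	return toy_put(b, 2, &flag, sizeof(flag));
}

/* Wrapper: nests the fields under the attribute type the caller chose. */
static int toy_put_flow(struct toy_buf *b, uint16_t attr, uint32_t value,
			int is_mask)
{
	long nest = toy_nest_start(b, attr);

	if (nest < 0)
		return -1;
	if (toy_put_fields(b, value, is_mask))
		return -1;
	toy_nest_end(b, nest);
	return 0;
}
```

With this split, helpers comparable to ovs_nla_put_unmasked_key() and
ovs_nla_put_mask() reduce to one-line calls that differ only in the
arguments they pass to the wrapper.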

Signed-off-by: Joe Stringer <joestringer@nicira.com>
---
v11: Shift netlink serialization of key/mask to flow_netlink.c
     Add put_{unmasked_key,key,mask} helpers.
     Perform nesting in ovs_nla_put_flow().
v10: First post.
---
 net/openvswitch/datapath.c     |   41 ++++++------------------------------
 net/openvswitch/flow_netlink.c |   45 +++++++++++++++++++++++++++++++++++++---
 net/openvswitch/flow_netlink.h |    8 +++++--
 3 files changed, 54 insertions(+), 40 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 332b5a0..b2a3796 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -462,10 +462,8 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
 			     0, upcall_info->cmd);
 	upcall->dp_ifindex = dp_ifindex;
 
-	nla = nla_nest_start(user_skb, OVS_PACKET_ATTR_KEY);
-	err = ovs_nla_put_flow(key, key, user_skb);
+	err = ovs_nla_put_flow(key, key, OVS_PACKET_ATTR_KEY, false, user_skb);
 	BUG_ON(err);
-	nla_nest_end(user_skb, nla);
 
 	if (upcall_info->userdata)
 		__nla_put(user_skb, OVS_PACKET_ATTR_USERDATA,
@@ -676,37 +674,6 @@ static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts)
 }
 
 /* Called with ovs_mutex or RCU read lock. */
-static int ovs_flow_cmd_fill_match(const struct sw_flow *flow,
-				   struct sk_buff *skb)
-{
-	struct nlattr *nla;
-	int err;
-
-	/* Fill flow key. */
-	nla = nla_nest_start(skb, OVS_FLOW_ATTR_KEY);
-	if (!nla)
-		return -EMSGSIZE;
-
-	err = ovs_nla_put_flow(&flow->unmasked_key, &flow->unmasked_key, skb);
-	if (err)
-		return err;
-
-	nla_nest_end(skb, nla);
-
-	/* Fill flow mask. */
-	nla = nla_nest_start(skb, OVS_FLOW_ATTR_MASK);
-	if (!nla)
-		return -EMSGSIZE;
-
-	err = ovs_nla_put_flow(&flow->key, &flow->mask->key, skb);
-	if (err)
-		return err;
-
-	nla_nest_end(skb, nla);
-	return 0;
-}
-
-/* Called with ovs_mutex or RCU read lock. */
 static int ovs_flow_cmd_fill_stats(const struct sw_flow *flow,
 				   struct sk_buff *skb)
 {
@@ -787,7 +754,11 @@ static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
 
 	ovs_header->dp_ifindex = dp_ifindex;
 
-	err = ovs_flow_cmd_fill_match(flow, skb);
+	err = ovs_nla_put_unmasked_key(flow, skb);
+	if (err)
+		goto error;
+
+	err = ovs_nla_put_mask(flow, skb);
 	if (err)
 		goto error;
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index df3c7f2..7bb571f 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -1131,12 +1131,12 @@ int ovs_nla_get_flow_metadata(const struct nlattr *attr,
 	return metadata_from_nlattrs(&match, &attrs, a, false, log);
 }
 
-int ovs_nla_put_flow(const struct sw_flow_key *swkey,
-		     const struct sw_flow_key *output, struct sk_buff *skb)
+int __ovs_nla_put_flow(const struct sw_flow_key *swkey,
+		       const struct sw_flow_key *output, bool is_mask,
+		       struct sk_buff *skb)
 {
 	struct ovs_key_ethernet *eth_key;
 	struct nlattr *nla, *encap;
-	bool is_mask = (swkey != output);
 
 	if (nla_put_u32(skb, OVS_KEY_ATTR_RECIRC_ID, output->recirc_id))
 		goto nla_put_failure;
@@ -1346,6 +1346,45 @@ nla_put_failure:
 	return -EMSGSIZE;
 }
 
+int ovs_nla_put_flow(const struct sw_flow_key *swkey,
+		     const struct sw_flow_key *output, int attr, bool is_mask,
+		     struct sk_buff *skb)
+{
+	int err;
+	struct nlattr *nla;
+
+	nla = nla_nest_start(skb, attr);
+	if (!nla)
+		return -EMSGSIZE;
+	err = __ovs_nla_put_flow(swkey, output, is_mask, skb);
+	if (err)
+		return err;
+	nla_nest_end(skb, nla);
+
+	return 0;
+}
+
+/* Called with ovs_mutex or RCU read lock. */
+int ovs_nla_put_unmasked_key(const struct sw_flow *flow, struct sk_buff *skb)
+{
+	return ovs_nla_put_flow(&flow->unmasked_key, &flow->unmasked_key,
+				OVS_FLOW_ATTR_KEY, false, skb);
+}
+
+/* Called with ovs_mutex or RCU read lock. */
+int ovs_nla_put_masked_key(const struct sw_flow *flow, struct sk_buff *skb)
+{
+	return ovs_nla_put_flow(&flow->mask->key, &flow->key,
+				OVS_FLOW_ATTR_KEY, false, skb);
+}
+
+/* Called with ovs_mutex or RCU read lock. */
+int ovs_nla_put_mask(const struct sw_flow *flow, struct sk_buff *skb)
+{
+	return ovs_nla_put_flow(&flow->key, &flow->mask->key,
+				OVS_FLOW_ATTR_MASK, true, skb);
+}
+
 #define MAX_ACTIONS_BUFSIZE	(32 * 1024)
 
 static struct sw_flow_actions *nla_alloc_flow_actions(int size, bool log)
diff --git a/net/openvswitch/flow_netlink.h b/net/openvswitch/flow_netlink.h
index 577f12b..ea54564 100644
--- a/net/openvswitch/flow_netlink.h
+++ b/net/openvswitch/flow_netlink.h
@@ -43,11 +43,15 @@ size_t ovs_key_attr_size(void);
 void ovs_match_init(struct sw_flow_match *match,
 		    struct sw_flow_key *key, struct sw_flow_mask *mask);
 
-int ovs_nla_put_flow(const struct sw_flow_key *,
-		     const struct sw_flow_key *, struct sk_buff *);
+int ovs_nla_put_flow(const struct sw_flow_key *, const struct sw_flow_key *,
+		     int attr, bool is_mask, struct sk_buff *);
 int ovs_nla_get_flow_metadata(const struct nlattr *, struct sw_flow_key *,
 			      bool log);
 
+int ovs_nla_put_unmasked_key(const struct sw_flow *flow, struct sk_buff *skb);
+int ovs_nla_put_masked_key(const struct sw_flow *flow, struct sk_buff *skb);
+int ovs_nla_put_mask(const struct sw_flow *flow, struct sk_buff *skb);
+
 int ovs_nla_get_match(struct sw_flow_match *, const struct nlattr *key,
 		      const struct nlattr *mask, bool log);
 int ovs_nla_put_egress_tunnel_key(struct sk_buff *,
-- 
1.7.10.4

* [PATCHv11 net-next 2/2] openvswitch: Add support for unique flow IDs.
From: Joe Stringer @ 2014-12-03  2:56 UTC (permalink / raw)
  To: netdev; +Cc: pshelar, linux-kernel, dev
In-Reply-To: <1417575363-13770-1-git-send-email-joestringer@nicira.com>

Previously, flows were manipulated by userspace specifying a full,
unmasked flow key. This adds significant burden onto flow
serialization/deserialization, particularly when dumping flows.

This patch adds an alternative way to refer to flows using a
variable-length "unique flow identifier" (UFID). At flow setup time,
userspace may specify a UFID for a flow, which is stored with the flow
and inserted into a separate table for lookup, in addition to the
standard flow table. Flows created using a UFID must be fetched or
deleted using the UFID.
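
As a rough illustration of the second index, the following userspace
sketch stores each flow once and lets it be found either by key or by
UFID. It is a sketch under assumptions: a linear array stands in for the
two hash tables, a u32 stands in for the flow key, and all toy_* names
are hypothetical. Unlike the patch, it rejects duplicates outright
instead of treating them as updates, and it does not enforce that UFID
flows are fetched only by UFID.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_UFID  256	/* mirrors the 1-256 octet limit */
#define TOY_MAX_FLOWS 8

struct toy_flow {
	uint32_t key;			/* stand-in for the masked flow key */
	size_t ufid_len;		/* 0 means "no UFID" */
	uint8_t ufid[TOY_MAX_UFID];
};

struct toy_table {
	struct toy_flow flows[TOY_MAX_FLOWS];
	size_t count;
};

/* Look a flow up by UFID; a zero-length UFID never matches. */
static struct toy_flow *toy_lookup_ufid(struct toy_table *t,
					const uint8_t *ufid, size_t len)
{
	size_t i;

	if (!len)
		return NULL;
	for (i = 0; i < t->count; i++) {
		struct toy_flow *f = &t->flows[i];

		if (f->ufid_len == len && !memcmp(f->ufid, ufid, len))
			return f;
	}
	return NULL;
}

/* Look a flow up by its key. */
static struct toy_flow *toy_lookup_key(struct toy_table *t, uint32_t key)
{
	size_t i;

	for (i = 0; i < t->count; i++)
		if (t->flows[i].key == key)
			return &t->flows[i];
	return NULL;
}

/* Insert a flow; check the UFID index first, then the key index. */
static int toy_insert(struct toy_table *t, uint32_t key,
		      const uint8_t *ufid, size_t len)
{
	struct toy_flow *f;

	if (len > TOY_MAX_UFID || t->count == TOY_MAX_FLOWS)
		return -1;
	if (toy_lookup_ufid(t, ufid, len) || toy_lookup_key(t, key))
		return -1;

	f = &t->flows[t->count++];
	f->key = key;
	f->ufid_len = len;
	if (len)
		memcpy(f->ufid, ufid, len);
	return 0;
}
```

The order of the duplicate check (UFID first, then key) follows the
"Check if this is a duplicate flow" hunk in ovs_flow_cmd_new().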

All flow dump operations may now be made more terse with OVS_UFID_F_*
flags. For example, the OVS_UFID_F_OMIT_KEY flag allows responses to
omit the flow key from a datapath operation if the flow has a
corresponding UFID. This significantly reduces the time spent assembling
and transacting netlink messages. With all OVS_UFID_F_OMIT_* flags
enabled, the datapath only returns the UFID and statistics for each flow
during flow dump, increasing ovs-vswitchd revalidator performance by up
to 50%.
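
The effect of the flags on a reply can be summarized with a small
userspace sketch. It is an illustration only: the flag values are copied
from the uapi hunk in this patch, while toy_reply_attrs() and the TOY_*
bits are hypothetical names that model which attributes
ovs_flow_cmd_fill_info() would emit, as far as the hunk below shows.

```c
#include <stdint.h>

/* Flag values as defined in this patch's uapi header hunk. */
#define OVS_UFID_F_OMIT_KEY      (1 << 0)
#define OVS_UFID_F_OMIT_MASK     (1 << 1)
#define OVS_UFID_F_OMIT_ACTIONS  (1 << 2)

/* Hypothetical bits naming the attributes a reply may carry. */
enum toy_emit {
	TOY_UFID    = 1 << 0,
	TOY_KEY     = 1 << 1,
	TOY_MASK    = 1 << 2,
	TOY_STATS   = 1 << 3,
	TOY_ACTIONS = 1 << 4,
};

/* Return the set of attributes emitted for one flow. */
static unsigned int toy_reply_attrs(int has_ufid, uint32_t ufid_flags)
{
	unsigned int out = 0;

	/* The flow identifier always goes out: UFID if present, else key. */
	out |= has_ufid ? TOY_UFID : TOY_KEY;

	/* A UFID flow additionally reports its key unless asked not to. */
	if (has_ufid && !(ufid_flags & OVS_UFID_F_OMIT_KEY))
		out |= TOY_KEY;

	if (!(ufid_flags & OVS_UFID_F_OMIT_MASK))
		out |= TOY_MASK;

	/* Statistics are never omitted. */
	out |= TOY_STATS;

	if (!(ufid_flags & OVS_UFID_F_OMIT_ACTIONS))
		out |= TOY_ACTIONS;

	return out;
}
```

With all three flags set, a UFID flow reduces to the UFID plus
statistics, which matches the dump behaviour described above. A flow
without a UFID still carries its key, since the key is then its only
identifier.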

Signed-off-by: Joe Stringer <joestringer@nicira.com>
---
v11: Separate UFID and unmasked key from sw_flow.
     Modify interface to remove nested UFID attributes.
     Only allow UFIDs between 1-256 octets.
     Move UFID nla fetch helpers to flow_netlink.h.
     Perform complete nlmsg_parsing in ovs_flow_cmd_dump().
     Check UFID table for flows with duplicate UFID at flow setup.
     Tidy up mask/key/ufid insertion into flow_table.
     Rebase.
v10: Ignore flow_key in requests if UFID is specified.
     Only allow UFID flows to be indexed by UFID.
     Only allow non-UFID flows to be indexed by unmasked flow key.
     Unite the unmasked_key and ufid+ufid_hash in 'struct sw_flow'.
     Don't periodically rehash the UFID table.
     Resize the UFID table independently from the flow table.
     Modify table_destroy() to iterate once and delete from both tables.
     Fix UFID memory leak in flow_free().
     Remove kernel-only UFIDs for non-UFID cases.
     Rename "OVS_UFID_F_SKIP_*" -> "OVS_UFID_F_OMIT_*"
     Update documentation.
     Rebase.
v9: No change.
v8: Rename UID -> UFID "unique flow identifier".
    Fix null dereference when adding flow without uid or mask.
    If a UFID but no match is specified, and lookup fails, return ENOENT.
    Rebase.
v7: Remove OVS_DP_F_INDEX_BY_UID.
    Rework UID serialisation for variable-length UID.
    Log error if uid not specified and OVS_UID_F_SKIP_KEY is set.
    Rebase against "probe" logging changes.
v6: Fix documentation for supporting UIDs between 32-128 bits.
    Minor style fixes.
    Rebase.
v5: No change.
v4: Fix memory leaks.
    Log when triggering the older userspace issue above.
v3: Initial post.
---
 Documentation/networking/openvswitch.txt |   13 ++
 include/uapi/linux/openvswitch.h         |   20 +++
 net/openvswitch/datapath.c               |  241 +++++++++++++++++++-----------
 net/openvswitch/flow.h                   |   16 +-
 net/openvswitch/flow_netlink.c           |   63 +++++++-
 net/openvswitch/flow_netlink.h           |    4 +
 net/openvswitch/flow_table.c             |  204 +++++++++++++++++++------
 net/openvswitch/flow_table.h             |    7 +
 8 files changed, 437 insertions(+), 131 deletions(-)

diff --git a/Documentation/networking/openvswitch.txt b/Documentation/networking/openvswitch.txt
index 37c20ee..b3b9ac6 100644
--- a/Documentation/networking/openvswitch.txt
+++ b/Documentation/networking/openvswitch.txt
@@ -131,6 +131,19 @@ performs best-effort detection of overlapping wildcarded flows and may reject
 some but not all of them. However, this behavior may change in future versions.
 
 
+Unique flow identifiers
+-----------------------
+
+An alternative to using the original match portion of a key as the handle for
+flow identification is a unique flow identifier, or "UFID". UFIDs are optional
+for both the kernel and user space program.
+
+User space programs that support UFID are expected to provide it during flow
+setup in addition to the flow, then refer to the flow using the UFID for all
+future operations. The kernel is not required to index flows by the original
+flow key if a UFID is specified.
+
+
 Basic rule for evolving flow keys
 ---------------------------------
 
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 3a6dcaa..80db129 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -444,6 +444,14 @@ struct ovs_key_nd {
  * a wildcarded match. Omitting attribute is treated as wildcarding all
  * corresponding fields. Optional for all requests. If not present,
  * all flow key bits are exact match bits.
+ * @OVS_FLOW_ATTR_UFID: A value between 1-256 octets specifying a unique
+ * identifier for the flow. Causes the flow to be indexed by this value rather
+ * than the value of the %OVS_FLOW_ATTR_KEY attribute. Optional for all
+ * requests. Present in notifications if the flow was created with this
+ * attribute.
+ * @OVS_FLOW_ATTR_UFID_FLAGS: A 32-bit value of OR'd %OVS_UFID_F_*
+ * flags that provide alternative semantics for flow installation and
+ * retrieval. Optional for all requests.
  *
  * These attributes follow the &struct ovs_header within the Generic Netlink
  * payload for %OVS_FLOW_* commands.
@@ -459,12 +467,24 @@ enum ovs_flow_attr {
 	OVS_FLOW_ATTR_MASK,      /* Sequence of OVS_KEY_ATTR_* attributes. */
 	OVS_FLOW_ATTR_PROBE,     /* Flow operation is a feature probe, error
 				  * logging should be suppressed. */
+	OVS_FLOW_ATTR_UFID,      /* Variable length unique flow identifier. */
+	OVS_FLOW_ATTR_UFID_FLAGS,/* u32 of OVS_UFID_F_*. */
 	__OVS_FLOW_ATTR_MAX
 };
 
 #define OVS_FLOW_ATTR_MAX (__OVS_FLOW_ATTR_MAX - 1)
 
 /**
+ * Omit attributes for notifications.
+ *
+ * If a datapath request contains an %OVS_UFID_F_OMIT_* flag, then the datapath
+ * may omit the corresponding %OVS_FLOW_ATTR_* from the response.
+ */
+#define OVS_UFID_F_OMIT_KEY      (1 << 0)
+#define OVS_UFID_F_OMIT_MASK     (1 << 1)
+#define OVS_UFID_F_OMIT_ACTIONS  (1 << 2)
+
+/**
  * enum ovs_sample_attr - Attributes for %OVS_ACTION_ATTR_SAMPLE action.
  * @OVS_SAMPLE_ATTR_PROBABILITY: 32-bit fraction of packets to sample with
  * @OVS_ACTION_ATTR_SAMPLE.  A value of 0 samples no packets, a value of
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index b2a3796..d54e920 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -65,6 +65,8 @@ static struct genl_family dp_packet_genl_family;
 static struct genl_family dp_flow_genl_family;
 static struct genl_family dp_datapath_genl_family;
 
+static const struct nla_policy flow_policy[];
+
 static const struct genl_multicast_group ovs_dp_flow_multicast_group = {
 	.name = OVS_FLOW_MCGROUP,
 };
@@ -662,11 +664,18 @@ static void get_dp_stats(const struct datapath *dp, struct ovs_dp_stats *stats,
 	}
 }
 
-static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts)
+static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts,
+				    const struct sw_flow_id *sfid)
 {
+	size_t sfid_len = 0;
+
+	if (sfid && sfid->ufid_len)
+		sfid_len = nla_total_size(sfid->ufid_len);
+
 	return NLMSG_ALIGN(sizeof(struct ovs_header))
 		+ nla_total_size(ovs_key_attr_size()) /* OVS_FLOW_ATTR_KEY */
 		+ nla_total_size(ovs_key_attr_size()) /* OVS_FLOW_ATTR_MASK */
+		+ sfid_len /* OVS_FLOW_ATTR_UFID */
 		+ nla_total_size(sizeof(struct ovs_flow_stats)) /* OVS_FLOW_ATTR_STATS */
 		+ nla_total_size(1) /* OVS_FLOW_ATTR_TCP_FLAGS */
 		+ nla_total_size(8) /* OVS_FLOW_ATTR_USED */
@@ -741,7 +750,7 @@ static int ovs_flow_cmd_fill_actions(const struct sw_flow *flow,
 /* Called with ovs_mutex or RCU read lock. */
 static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
 				  struct sk_buff *skb, u32 portid,
-				  u32 seq, u32 flags, u8 cmd)
+				  u32 seq, u32 flags, u8 cmd, u32 ufid_flags)
 {
 	const int skb_orig_len = skb->len;
 	struct ovs_header *ovs_header;
@@ -754,21 +763,35 @@ static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
 
 	ovs_header->dp_ifindex = dp_ifindex;
 
-	err = ovs_nla_put_unmasked_key(flow, skb);
+	if (flow->ufid)
+		err = nla_put(skb, OVS_FLOW_ATTR_UFID, flow->ufid->ufid_len,
+			      flow->ufid->ufid);
+	else
+		err = ovs_nla_put_unmasked_key(flow, skb);
 	if (err)
 		goto error;
 
-	err = ovs_nla_put_mask(flow, skb);
-	if (err)
-		goto error;
+	if (!(ufid_flags & OVS_UFID_F_OMIT_KEY) && flow->ufid) {
+		err = ovs_nla_put_masked_key(flow, skb);
+		if (err)
+			goto error;
+	}
+
+	if (!(ufid_flags & OVS_UFID_F_OMIT_MASK)) {
+		err = ovs_nla_put_mask(flow, skb);
+		if (err)
+			goto error;
+	}
 
 	err = ovs_flow_cmd_fill_stats(flow, skb);
 	if (err)
 		goto error;
 
-	err = ovs_flow_cmd_fill_actions(flow, skb, skb_orig_len);
-	if (err)
-		goto error;
+	if (!(ufid_flags & OVS_UFID_F_OMIT_ACTIONS)) {
+		err = ovs_flow_cmd_fill_actions(flow, skb, skb_orig_len);
+		if (err)
+			goto error;
+	}
 
 	return genlmsg_end(skb, ovs_header);
 
@@ -779,6 +802,7 @@ error:
 
 /* May not be called with RCU read lock. */
 static struct sk_buff *ovs_flow_cmd_alloc_info(const struct sw_flow_actions *acts,
+					       const struct sw_flow_id *sfid,
 					       struct genl_info *info,
 					       bool always)
 {
@@ -787,7 +811,8 @@ static struct sk_buff *ovs_flow_cmd_alloc_info(const struct sw_flow_actions *act
 	if (!always && !ovs_must_notify(&dp_flow_genl_family, info, 0))
 		return NULL;
 
-	skb = genlmsg_new_unicast(ovs_flow_cmd_msg_size(acts), info, GFP_KERNEL);
+	skb = genlmsg_new_unicast(ovs_flow_cmd_msg_size(acts, sfid), info,
+				  GFP_KERNEL);
 	if (!skb)
 		return ERR_PTR(-ENOMEM);
 
@@ -798,19 +823,19 @@ static struct sk_buff *ovs_flow_cmd_alloc_info(const struct sw_flow_actions *act
 static struct sk_buff *ovs_flow_cmd_build_info(const struct sw_flow *flow,
 					       int dp_ifindex,
 					       struct genl_info *info, u8 cmd,
-					       bool always)
+					       bool always, u32 ufid_flags)
 {
 	struct sk_buff *skb;
 	int retval;
 
-	skb = ovs_flow_cmd_alloc_info(ovsl_dereference(flow->sf_acts), info,
-				      always);
+	skb = ovs_flow_cmd_alloc_info(ovsl_dereference(flow->sf_acts),
+				      flow->ufid, info, always);
 	if (IS_ERR_OR_NULL(skb))
 		return skb;
 
 	retval = ovs_flow_cmd_fill_info(flow, dp_ifindex, skb,
 					info->snd_portid, info->snd_seq, 0,
-					cmd);
+					cmd, ufid_flags);
 	BUG_ON(retval < 0);
 	return skb;
 }
@@ -819,12 +844,14 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 {
 	struct nlattr **a = info->attrs;
 	struct ovs_header *ovs_header = info->userhdr;
-	struct sw_flow *flow, *new_flow;
+	struct sw_flow *flow = NULL, *new_flow;
 	struct sw_flow_mask mask;
 	struct sk_buff *reply;
 	struct datapath *dp;
+	struct sw_flow_key key;
 	struct sw_flow_actions *acts;
 	struct sw_flow_match match;
+	u32 ufid_flags = ovs_nla_get_ufid_flags(a[OVS_FLOW_ATTR_UFID_FLAGS]);
 	int error;
 	bool log = !a[OVS_FLOW_ATTR_PROBE];
 
@@ -849,13 +876,30 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	}
 
 	/* Extract key. */
-	ovs_match_init(&match, &new_flow->unmasked_key, &mask);
+	ovs_match_init(&match, &key, &mask);
 	error = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY],
 				  a[OVS_FLOW_ATTR_MASK], log);
 	if (error)
 		goto err_kfree_flow;
 
-	ovs_flow_mask_key(&new_flow->key, &new_flow->unmasked_key, &mask);
+	ovs_flow_mask_key(&new_flow->key, &key, &mask);
+
+	/* Extract flow id. */
+	error = ovs_nla_copy_ufid(a[OVS_FLOW_ATTR_UFID], &new_flow->ufid, log);
+	if (error)
+		goto err_kfree_flow;
+	if (!new_flow->ufid) {
+		struct sw_flow_key *new_key;
+
+		new_key = kmalloc(sizeof(*new_flow->unmasked_key), GFP_KERNEL);
+		if (new_key) {
+			memcpy(new_key, &key, sizeof(key));
+			new_flow->unmasked_key = new_key;
+		} else {
+			error = -ENOMEM;
+			goto err_kfree_flow;
+		}
+	}
 
 	/* Validate actions. */
 	error = ovs_nla_copy_actions(a[OVS_FLOW_ATTR_ACTIONS], &new_flow->key,
@@ -865,7 +909,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 		goto err_kfree_flow;
 	}
 
-	reply = ovs_flow_cmd_alloc_info(acts, info, false);
+	reply = ovs_flow_cmd_alloc_info(acts, new_flow->ufid, info, false);
 	if (IS_ERR(reply)) {
 		error = PTR_ERR(reply);
 		goto err_kfree_acts;
@@ -877,8 +921,12 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 		error = -ENODEV;
 		goto err_unlock_ovs;
 	}
+
 	/* Check if this is a duplicate flow */
-	flow = ovs_flow_tbl_lookup(&dp->table, &new_flow->unmasked_key);
+	if (new_flow->ufid)
+		flow = ovs_flow_tbl_lookup_ufid(&dp->table, new_flow->ufid);
+	if (!flow)
+		flow = ovs_flow_tbl_lookup(&dp->table, &new_flow->key);
 	if (likely(!flow)) {
 		rcu_assign_pointer(new_flow->sf_acts, acts);
 
@@ -894,7 +942,8 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 						       ovs_header->dp_ifindex,
 						       reply, info->snd_portid,
 						       info->snd_seq, 0,
-						       OVS_FLOW_CMD_NEW);
+						       OVS_FLOW_CMD_NEW,
+						       ufid_flags);
 			BUG_ON(error < 0);
 		}
 		ovs_unlock();
@@ -912,11 +961,13 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 			error = -EEXIST;
 			goto err_unlock_ovs;
 		}
-		/* The unmasked key has to be the same for flow updates. */
-		if (unlikely(!ovs_flow_cmp_unmasked_key(flow, &match))) {
-			/* Look for any overlapping flow. */
+		/* The flow identifier has to be the same for flow updates.
+		 * Look for any overlapping flow.
+		 */
+		if (!flow->ufid &&
+		    unlikely(!ovs_flow_cmp_unmasked_key(flow, &match))) {
 			flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
-			if (!flow) {
+			if (unlikely(!flow)) {
 				error = -ENOENT;
 				goto err_unlock_ovs;
 			}
@@ -930,7 +981,8 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 						       ovs_header->dp_ifindex,
 						       reply, info->snd_portid,
 						       info->snd_seq, 0,
-						       OVS_FLOW_CMD_NEW);
+						       OVS_FLOW_CMD_NEW,
+						       ufid_flags);
 			BUG_ON(error < 0);
 		}
 		ovs_unlock();
@@ -980,45 +1032,34 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 	struct nlattr **a = info->attrs;
 	struct ovs_header *ovs_header = info->userhdr;
 	struct sw_flow_key key;
-	struct sw_flow *flow;
+	struct sw_flow *flow = NULL;
 	struct sw_flow_mask mask;
 	struct sk_buff *reply = NULL;
 	struct datapath *dp;
 	struct sw_flow_actions *old_acts = NULL, *acts = NULL;
 	struct sw_flow_match match;
+	struct sw_flow_id *ufid;
+	u32 ufid_flags = ovs_nla_get_ufid_flags(a[OVS_FLOW_ATTR_UFID_FLAGS]);
 	int error;
 	bool log = !a[OVS_FLOW_ATTR_PROBE];
 
-	/* Extract key. */
-	error = -EINVAL;
-	if (!a[OVS_FLOW_ATTR_KEY]) {
+	/* Extract identifier. Take a copy to avoid "Wframe-larger-than=1024"
+	 * warning.
+	 */
+	error = ovs_nla_copy_ufid(a[OVS_FLOW_ATTR_UFID], &ufid, log);
+	if (error)
+		return error;
+	if (a[OVS_FLOW_ATTR_KEY]) {
+		ovs_match_init(&match, &key, &mask);
+		error = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY],
+					  a[OVS_FLOW_ATTR_MASK], log);
+	} else if (!ufid) {
 		OVS_NLERR(log, "Flow key attribute not present in set flow.");
-		goto error;
+		error = -EINVAL;
 	}
-
-	ovs_match_init(&match, &key, &mask);
-	error = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY],
-				  a[OVS_FLOW_ATTR_MASK], log);
 	if (error)
 		goto error;
 
-	/* Validate actions. */
-	if (a[OVS_FLOW_ATTR_ACTIONS]) {
-		acts = get_flow_actions(a[OVS_FLOW_ATTR_ACTIONS], &key, &mask,
-					log);
-		if (IS_ERR(acts)) {
-			error = PTR_ERR(acts);
-			goto error;
-		}
-
-		/* Can allocate before locking if have acts. */
-		reply = ovs_flow_cmd_alloc_info(acts, info, false);
-		if (IS_ERR(reply)) {
-			error = PTR_ERR(reply);
-			goto err_kfree_acts;
-		}
-	}
-
 	ovs_lock();
 	dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
 	if (unlikely(!dp)) {
@@ -1026,33 +1067,34 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 		goto err_unlock_ovs;
 	}
 	/* Check that the flow exists. */
-	flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
+	if (ufid)
+		flow = ovs_flow_tbl_lookup_ufid(&dp->table, ufid);
+	else
+		flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
 	if (unlikely(!flow)) {
 		error = -ENOENT;
 		goto err_unlock_ovs;
 	}
 
-	/* Update actions, if present. */
-	if (likely(acts)) {
+	/* Validate and update actions. */
+	if (a[OVS_FLOW_ATTR_ACTIONS]) {
+		acts = get_flow_actions(a[OVS_FLOW_ATTR_ACTIONS], &flow->key,
+					flow->mask, log);
+		if (IS_ERR(acts)) {
+			error = PTR_ERR(acts);
+			goto err_unlock_ovs;
+		}
+
 		old_acts = ovsl_dereference(flow->sf_acts);
 		rcu_assign_pointer(flow->sf_acts, acts);
+	}
 
-		if (unlikely(reply)) {
-			error = ovs_flow_cmd_fill_info(flow,
-						       ovs_header->dp_ifindex,
-						       reply, info->snd_portid,
-						       info->snd_seq, 0,
-						       OVS_FLOW_CMD_NEW);
-			BUG_ON(error < 0);
-		}
-	} else {
-		/* Could not alloc without acts before locking. */
-		reply = ovs_flow_cmd_build_info(flow, ovs_header->dp_ifindex,
-						info, OVS_FLOW_CMD_NEW, false);
-		if (unlikely(IS_ERR(reply))) {
-			error = PTR_ERR(reply);
-			goto err_unlock_ovs;
-		}
+	reply = ovs_flow_cmd_build_info(flow, ovs_header->dp_ifindex,
+					info, OVS_FLOW_CMD_NEW, false,
+					ufid_flags);
+	if (unlikely(IS_ERR(reply))) {
+		error = PTR_ERR(reply);
+		goto err_unlock_ovs;
 	}
 
 	/* Clear stats. */
@@ -1070,9 +1112,9 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 err_unlock_ovs:
 	ovs_unlock();
 	kfree_skb(reply);
-err_kfree_acts:
 	kfree(acts);
 error:
+	kfree(ufid);
 	return error;
 }
 
@@ -1085,17 +1127,23 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info)
 	struct sw_flow *flow;
 	struct datapath *dp;
 	struct sw_flow_match match;
+	struct sw_flow_id ufid;
+	u32 ufid_flags = ovs_nla_get_ufid_flags(a[OVS_FLOW_ATTR_UFID_FLAGS]);
 	int err;
 	bool log = !a[OVS_FLOW_ATTR_PROBE];
 
-	if (!a[OVS_FLOW_ATTR_KEY]) {
+	err = ovs_nla_get_ufid(a[OVS_FLOW_ATTR_UFID], &ufid, log);
+	if (err)
+		return err;
+	if (a[OVS_FLOW_ATTR_KEY]) {
+		ovs_match_init(&match, &key, NULL);
+		err = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY], NULL,
+					log);
+	} else if (!ufid.ufid_len) {
 		OVS_NLERR(log,
 			  "Flow get message rejected, Key attribute missing.");
-		return -EINVAL;
+		err = -EINVAL;
 	}
-
-	ovs_match_init(&match, &key, NULL);
-	err = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY], NULL, log);
 	if (err)
 		return err;
 
@@ -1106,14 +1154,17 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info)
 		goto unlock;
 	}
 
-	flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
+	if (ufid.ufid_len)
+		flow = ovs_flow_tbl_lookup_ufid(&dp->table, &ufid);
+	else
+		flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
 	if (!flow) {
 		err = -ENOENT;
 		goto unlock;
 	}
 
 	reply = ovs_flow_cmd_build_info(flow, ovs_header->dp_ifindex, info,
-					OVS_FLOW_CMD_NEW, true);
+					OVS_FLOW_CMD_NEW, true, ufid_flags);
 	if (IS_ERR(reply)) {
 		err = PTR_ERR(reply);
 		goto unlock;
@@ -1132,13 +1183,18 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	struct ovs_header *ovs_header = info->userhdr;
 	struct sw_flow_key key;
 	struct sk_buff *reply;
-	struct sw_flow *flow;
+	struct sw_flow *flow = NULL;
 	struct datapath *dp;
 	struct sw_flow_match match;
+	struct sw_flow_id ufid;
+	u32 ufid_flags = ovs_nla_get_ufid_flags(a[OVS_FLOW_ATTR_UFID_FLAGS]);
 	int err;
 	bool log = !a[OVS_FLOW_ATTR_PROBE];
 
-	if (likely(a[OVS_FLOW_ATTR_KEY])) {
+	err = ovs_nla_get_ufid(a[OVS_FLOW_ATTR_UFID], &ufid, log);
+	if (err)
+		return err;
+	if (a[OVS_FLOW_ATTR_KEY]) {
 		ovs_match_init(&match, &key, NULL);
 		err = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY], NULL,
 					log);
@@ -1153,12 +1209,15 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
 		goto unlock;
 	}
 
-	if (unlikely(!a[OVS_FLOW_ATTR_KEY])) {
+	if (unlikely(!a[OVS_FLOW_ATTR_KEY] && !ufid.ufid_len)) {
 		err = ovs_flow_tbl_flush(&dp->table);
 		goto unlock;
 	}
 
-	flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
+	if (ufid.ufid_len)
+		flow = ovs_flow_tbl_lookup_ufid(&dp->table, &ufid);
+	else
+		flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
 	if (unlikely(!flow)) {
 		err = -ENOENT;
 		goto unlock;
@@ -1168,14 +1227,15 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	ovs_unlock();
 
 	reply = ovs_flow_cmd_alloc_info((const struct sw_flow_actions __force *) flow->sf_acts,
-					info, false);
+					flow->ufid, info, false);
 	if (likely(reply)) {
 		if (likely(!IS_ERR(reply))) {
 			rcu_read_lock();	/*To keep RCU checker happy. */
 			err = ovs_flow_cmd_fill_info(flow, ovs_header->dp_ifindex,
 						     reply, info->snd_portid,
 						     info->snd_seq, 0,
-						     OVS_FLOW_CMD_DEL);
+						     OVS_FLOW_CMD_DEL,
+						     ufid_flags);
 			rcu_read_unlock();
 			BUG_ON(err < 0);
 
@@ -1194,9 +1254,18 @@ unlock:
 
 static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
+	struct nlattr *a[__OVS_FLOW_ATTR_MAX];
 	struct ovs_header *ovs_header = genlmsg_data(nlmsg_data(cb->nlh));
 	struct table_instance *ti;
 	struct datapath *dp;
+	u32 ufid_flags;
+	int err;
+
+	err = nlmsg_parse(cb->nlh, GENL_HDRLEN + dp_flow_genl_family.hdrsize,
+			  a, dp_flow_genl_family.maxattr, flow_policy);
+	if (err)
+		return err;
+	ufid_flags = ovs_nla_get_ufid_flags(a[OVS_FLOW_ATTR_UFID_FLAGS]);
 
 	rcu_read_lock();
 	dp = get_dp_rcu(sock_net(skb->sk), ovs_header->dp_ifindex);
@@ -1219,7 +1288,7 @@ static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
 		if (ovs_flow_cmd_fill_info(flow, ovs_header->dp_ifindex, skb,
 					   NETLINK_CB(cb->skb).portid,
 					   cb->nlh->nlmsg_seq, NLM_F_MULTI,
-					   OVS_FLOW_CMD_NEW) < 0)
+					   OVS_FLOW_CMD_NEW, ufid_flags) < 0)
 			break;
 
 		cb->args[0] = bucket;
@@ -1235,6 +1304,8 @@ static const struct nla_policy flow_policy[OVS_FLOW_ATTR_MAX + 1] = {
 	[OVS_FLOW_ATTR_ACTIONS] = { .type = NLA_NESTED },
 	[OVS_FLOW_ATTR_CLEAR] = { .type = NLA_FLAG },
 	[OVS_FLOW_ATTR_PROBE] = { .type = NLA_FLAG },
+	[OVS_FLOW_ATTR_UFID] = { .type = NLA_UNSPEC },
+	[OVS_FLOW_ATTR_UFID_FLAGS] = { .type = NLA_U32 },
 };
 
 static const struct genl_ops dp_flow_genl_ops[] = {
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index a8b30f3..7f31dbf 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -197,6 +197,13 @@ struct sw_flow_match {
 	struct sw_flow_mask *mask;
 };
 
+#define MAX_UFID_LENGTH 256
+
+struct sw_flow_id {
+	u32 ufid_len;
+	u32 ufid[MAX_UFID_LENGTH / 4];
+};
+
 struct sw_flow_actions {
 	struct rcu_head rcu;
 	u32 actions_len;
@@ -213,13 +220,16 @@ struct flow_stats {
 
 struct sw_flow {
 	struct rcu_head rcu;
-	struct hlist_node hash_node[2];
-	u32 hash;
+	struct {
+		struct hlist_node node[2];
+		u32 hash;
+	} flow_hash, ufid_hash;
 	int stats_last_writer;		/* NUMA-node id of the last writer on
 					 * 'stats[0]'.
 					 */
 	struct sw_flow_key key;
-	struct sw_flow_key unmasked_key;
+	struct sw_flow_id *ufid;
+	struct sw_flow_key *unmasked_key; /* Only valid if 'ufid' is NULL. */
 	struct sw_flow_mask *mask;
 	struct sw_flow_actions __rcu *sf_acts;
 	struct flow_stats __rcu *stats[]; /* One for each NUMA node.  First one
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 7bb571f..56a5d2e 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -1095,6 +1095,67 @@ free_newmask:
 	return err;
 }
 
+static size_t get_ufid_size(const struct nlattr *attr, bool log)
+{
+	if (!attr)
+		return 0;
+	if (!nla_len(attr)) {
+		OVS_NLERR(log, "Flow ufid must be at least 1 octet");
+		return -EINVAL;
+	}
+	if (nla_len(attr) > MAX_UFID_LENGTH) {
+		OVS_NLERR(log, "Flow ufid size %u bytes exceeds max",
+			  nla_len(attr));
+		return -EINVAL;
+	}
+
+	return nla_len(attr);
+}
+
+/* Initializes 'flow->ufid'. */
+int ovs_nla_get_ufid(const struct nlattr *attr, struct sw_flow_id *sfid,
+		     bool log)
+{
+	int len;	/* signed, so a negative errno is caught by 'len <= 0' */
+
+	sfid->ufid_len = 0;
+	len = get_ufid_size(attr, log);
+	if (len <= 0)
+		return len;
+
+	sfid->ufid_len = len;
+	memcpy(sfid->ufid, nla_data(attr), len);
+
+	return 0;
+}
+
+int ovs_nla_copy_ufid(const struct nlattr *attr, struct sw_flow_id **sfid,
+		      bool log)
+{
+	struct sw_flow_id *new_sfid = NULL;
+	int len;	/* signed, so a negative errno is caught by 'len <= 0' */
+
+	*sfid = NULL;
+	len = get_ufid_size(attr, log);
+	if (len <= 0)
+		return len;
+
+	new_sfid = kmalloc(sizeof(*new_sfid), GFP_KERNEL);
+	if (!new_sfid)
+		return -ENOMEM;
+
+	new_sfid->ufid_len = len;
+	memcpy(new_sfid->ufid, nla_data(attr), len);
+	*sfid = new_sfid;
+
+	return 0;
+}
+
+u32 ovs_nla_get_ufid_flags(const struct nlattr *attr)
+{
+	return attr ? nla_get_u32(attr) : 0;
+}
+
 /**
  * ovs_nla_get_flow_metadata - parses Netlink attributes into a flow key.
  * @key: Receives extracted in_port, priority, tun_key and skb_mark.
@@ -1367,7 +1428,7 @@ int ovs_nla_put_flow(const struct sw_flow_key *swkey,
 /* Called with ovs_mutex or RCU read lock. */
 int ovs_nla_put_unmasked_key(const struct sw_flow *flow, struct sk_buff *skb)
 {
-	return ovs_nla_put_flow(&flow->unmasked_key, &flow->unmasked_key,
+	return ovs_nla_put_flow(flow->unmasked_key, flow->unmasked_key,
 				OVS_FLOW_ATTR_KEY, false, skb);
 }
 
diff --git a/net/openvswitch/flow_netlink.h b/net/openvswitch/flow_netlink.h
index ea54564..4f1bd7a 100644
--- a/net/openvswitch/flow_netlink.h
+++ b/net/openvswitch/flow_netlink.h
@@ -57,6 +57,10 @@ int ovs_nla_get_match(struct sw_flow_match *, const struct nlattr *key,
 int ovs_nla_put_egress_tunnel_key(struct sk_buff *,
 				  const struct ovs_tunnel_info *);
 
+int ovs_nla_get_ufid(const struct nlattr *, struct sw_flow_id *, bool log);
+int ovs_nla_copy_ufid(const struct nlattr *, struct sw_flow_id **, bool log);
+u32 ovs_nla_get_ufid_flags(const struct nlattr *attr);
+
 int ovs_nla_copy_actions(const struct nlattr *attr,
 			 const struct sw_flow_key *key,
 			 struct sw_flow_actions **sfa, bool log);
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index e0a7fef..7287805 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -85,6 +85,8 @@ struct sw_flow *ovs_flow_alloc(void)
 
 	flow->sf_acts = NULL;
 	flow->mask = NULL;
+	flow->ufid = NULL;
+	flow->unmasked_key = NULL;
 	flow->stats_last_writer = NUMA_NO_NODE;
 
 	/* Initialize the default stat node. */
@@ -139,6 +141,8 @@ static void flow_free(struct sw_flow *flow)
 {
 	int node;
 
+	kfree(flow->ufid);
+	kfree(flow->unmasked_key);
 	kfree((struct sw_flow_actions __force *)flow->sf_acts);
 	for_each_node(node)
 		if (flow->stats[node])
@@ -200,18 +204,28 @@ static struct table_instance *table_instance_alloc(int new_size)
 
 int ovs_flow_tbl_init(struct flow_table *table)
 {
-	struct table_instance *ti;
+	struct table_instance *ti, *ufid_ti;
 
 	ti = table_instance_alloc(TBL_MIN_BUCKETS);
 
 	if (!ti)
 		return -ENOMEM;
 
+	ufid_ti = table_instance_alloc(TBL_MIN_BUCKETS);
+	if (!ufid_ti)
+		goto free_ti;
+
 	rcu_assign_pointer(table->ti, ti);
+	rcu_assign_pointer(table->ufid_ti, ufid_ti);
 	INIT_LIST_HEAD(&table->mask_list);
 	table->last_rehash = jiffies;
 	table->count = 0;
+	table->ufid_count = 0;
 	return 0;
+
+free_ti:
+	__table_instance_destroy(ti);
+	return -ENOMEM;
 }
 
 static void flow_tbl_destroy_rcu_cb(struct rcu_head *rcu)
@@ -221,13 +235,16 @@ static void flow_tbl_destroy_rcu_cb(struct rcu_head *rcu)
 	__table_instance_destroy(ti);
 }
 
-static void table_instance_destroy(struct table_instance *ti, bool deferred)
+static void table_instance_destroy(struct table_instance *ti,
+				   struct table_instance *ufid_ti,
+				   bool deferred)
 {
 	int i;
 
 	if (!ti)
 		return;
 
+	BUG_ON(!ufid_ti);
 	if (ti->keep_flows)
 		goto skip_flows;
 
@@ -236,18 +253,24 @@ static void table_instance_destroy(struct table_instance *ti, bool deferred)
 		struct hlist_head *head = flex_array_get(ti->buckets, i);
 		struct hlist_node *n;
 		int ver = ti->node_ver;
+		int ufid_ver = ufid_ti->node_ver;
 
-		hlist_for_each_entry_safe(flow, n, head, hash_node[ver]) {
-			hlist_del_rcu(&flow->hash_node[ver]);
+		hlist_for_each_entry_safe(flow, n, head, flow_hash.node[ver]) {
+			hlist_del_rcu(&flow->flow_hash.node[ver]);
+			if (flow->ufid)
+				hlist_del_rcu(&flow->ufid_hash.node[ufid_ver]);
 			ovs_flow_free(flow, deferred);
 		}
 	}
 
 skip_flows:
-	if (deferred)
+	if (deferred) {
 		call_rcu(&ti->rcu, flow_tbl_destroy_rcu_cb);
-	else
+		call_rcu(&ufid_ti->rcu, flow_tbl_destroy_rcu_cb);
+	} else {
 		__table_instance_destroy(ti);
+		__table_instance_destroy(ufid_ti);
+	}
 }
 
 /* No need for locking this function is called from RCU callback or
@@ -256,8 +279,9 @@ skip_flows:
 void ovs_flow_tbl_destroy(struct flow_table *table)
 {
 	struct table_instance *ti = rcu_dereference_raw(table->ti);
+	struct table_instance *ufid_ti = rcu_dereference_raw(table->ufid_ti);
 
-	table_instance_destroy(ti, false);
+	table_instance_destroy(ti, ufid_ti, false);
 }
 
 struct sw_flow *ovs_flow_tbl_dump_next(struct table_instance *ti,
@@ -272,7 +296,7 @@ struct sw_flow *ovs_flow_tbl_dump_next(struct table_instance *ti,
 	while (*bucket < ti->n_buckets) {
 		i = 0;
 		head = flex_array_get(ti->buckets, *bucket);
-		hlist_for_each_entry_rcu(flow, head, hash_node[ver]) {
+		hlist_for_each_entry_rcu(flow, head, flow_hash.node[ver]) {
 			if (i < *last) {
 				i++;
 				continue;
@@ -294,16 +318,26 @@ static struct hlist_head *find_bucket(struct table_instance *ti, u32 hash)
 				(hash & (ti->n_buckets - 1)));
 }
 
-static void table_instance_insert(struct table_instance *ti, struct sw_flow *flow)
+static void table_instance_insert(struct table_instance *ti,
+				  struct sw_flow *flow)
+{
+	struct hlist_head *head;
+
+	head = find_bucket(ti, flow->flow_hash.hash);
+	hlist_add_head_rcu(&flow->flow_hash.node[ti->node_ver], head);
+}
+
+static void ufid_table_instance_insert(struct table_instance *ti,
+				       struct sw_flow *flow)
 {
 	struct hlist_head *head;
 
-	head = find_bucket(ti, flow->hash);
-	hlist_add_head_rcu(&flow->hash_node[ti->node_ver], head);
+	head = find_bucket(ti, flow->ufid_hash.hash);
+	hlist_add_head_rcu(&flow->ufid_hash.node[ti->node_ver], head);
 }
 
 static void flow_table_copy_flows(struct table_instance *old,
-				  struct table_instance *new)
+				  struct table_instance *new, bool ufid)
 {
 	int old_ver;
 	int i;
@@ -318,15 +352,21 @@ static void flow_table_copy_flows(struct table_instance *old,
 
 		head = flex_array_get(old->buckets, i);
 
-		hlist_for_each_entry(flow, head, hash_node[old_ver])
-			table_instance_insert(new, flow);
+		if (ufid)
+			hlist_for_each_entry(flow, head,
+					     ufid_hash.node[old_ver])
+				ufid_table_instance_insert(new, flow);
+		else
+			hlist_for_each_entry(flow, head,
+					     flow_hash.node[old_ver])
+				table_instance_insert(new, flow);
 	}
 
 	old->keep_flows = true;
 }
 
 static struct table_instance *table_instance_rehash(struct table_instance *ti,
-					    int n_buckets)
+						    int n_buckets, bool ufid)
 {
 	struct table_instance *new_ti;
 
@@ -334,27 +374,37 @@ static struct table_instance *table_instance_rehash(struct table_instance *ti,
 	if (!new_ti)
 		return NULL;
 
-	flow_table_copy_flows(ti, new_ti);
-
+	flow_table_copy_flows(ti, new_ti, ufid);
 	return new_ti;
 }
 
 int ovs_flow_tbl_flush(struct flow_table *flow_table)
 {
-	struct table_instance *old_ti;
-	struct table_instance *new_ti;
+	struct table_instance *old_ti, *new_ti;
+	struct table_instance *old_ufid_ti, *new_ufid_ti;
 
-	old_ti = ovsl_dereference(flow_table->ti);
 	new_ti = table_instance_alloc(TBL_MIN_BUCKETS);
 	if (!new_ti)
 		return -ENOMEM;
+	new_ufid_ti = table_instance_alloc(TBL_MIN_BUCKETS);
+	if (!new_ufid_ti)
+		goto err_free_ti;
+
+	old_ti = ovsl_dereference(flow_table->ti);
+	old_ufid_ti = ovsl_dereference(flow_table->ufid_ti);
 
 	rcu_assign_pointer(flow_table->ti, new_ti);
+	rcu_assign_pointer(flow_table->ufid_ti, new_ufid_ti);
 	flow_table->last_rehash = jiffies;
 	flow_table->count = 0;
+	flow_table->ufid_count = 0;
 
-	table_instance_destroy(old_ti, true);
+	table_instance_destroy(old_ti, old_ufid_ti, true);
 	return 0;
+
+err_free_ti:
+	__table_instance_destroy(new_ti);
+	return -ENOMEM;
 }
 
 static u32 flow_hash(const struct sw_flow_key *key, int key_start,
@@ -407,7 +457,8 @@ bool ovs_flow_cmp_unmasked_key(const struct sw_flow *flow,
 	int key_start = flow_key_start(key);
 	int key_end = match->range.end;
 
-	return cmp_key(&flow->unmasked_key, key, key_start, key_end);
+	BUG_ON(flow->ufid);
+	return cmp_key(flow->unmasked_key, key, key_start, key_end);
 }
 
 static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
@@ -424,10 +475,9 @@ static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
 	ovs_flow_mask_key(&masked_key, unmasked, mask);
 	hash = flow_hash(&masked_key, key_start, key_end);
 	head = find_bucket(ti, hash);
-	hlist_for_each_entry_rcu(flow, head, hash_node[ti->node_ver]) {
-		if (flow->mask == mask && flow->hash == hash &&
-		    flow_cmp_masked_key(flow, &masked_key,
-					  key_start, key_end))
+	hlist_for_each_entry_rcu(flow, head, flow_hash.node[ti->node_ver]) {
+		if (flow->mask == mask && flow->flow_hash.hash == hash &&
+		    flow_cmp_masked_key(flow, &masked_key, key_start, key_end))
 			return flow;
 	}
 	return NULL;
@@ -469,7 +519,40 @@ struct sw_flow *ovs_flow_tbl_lookup_exact(struct flow_table *tbl,
 	/* Always called under ovs-mutex. */
 	list_for_each_entry(mask, &tbl->mask_list, list) {
 		flow = masked_flow_lookup(ti, match->key, mask);
-		if (flow && ovs_flow_cmp_unmasked_key(flow, match))  /* Found */
+		if (flow && !flow->ufid &&
+		    ovs_flow_cmp_unmasked_key(flow, match))
+			return flow;
+	}
+	return NULL;
+}
+
+static u32 ufid_hash(const struct sw_flow_id *sfid)
+{
+	return arch_fast_hash(sfid->ufid, sfid->ufid_len, 0);
+}
+
+bool ovs_flow_cmp_ufid(const struct sw_flow *flow,
+		       const struct sw_flow_id *sfid)
+{
+	if (flow->ufid->ufid_len != sfid->ufid_len)
+		return false;
+
+	return !memcmp(flow->ufid->ufid, sfid->ufid, sfid->ufid_len);
+}
+
+struct sw_flow *ovs_flow_tbl_lookup_ufid(struct flow_table *tbl,
+					 const struct sw_flow_id *ufid)
+{
+	struct table_instance *ti = rcu_dereference_ovsl(tbl->ufid_ti);
+	struct sw_flow *flow;
+	struct hlist_head *head;
+	u32 hash;
+
+	hash = ufid_hash(ufid);
+	head = find_bucket(ti, hash);
+	hlist_for_each_entry_rcu(flow, head, ufid_hash.node[ti->node_ver]) {
+		if (flow->ufid_hash.hash == hash &&
+		    ovs_flow_cmp_ufid(flow, ufid))
 			return flow;
 	}
 	return NULL;
@@ -486,9 +569,10 @@ int ovs_flow_tbl_num_masks(const struct flow_table *table)
 	return num;
 }
 
-static struct table_instance *table_instance_expand(struct table_instance *ti)
+static struct table_instance *table_instance_expand(struct table_instance *ti,
+						    bool ufid)
 {
-	return table_instance_rehash(ti, ti->n_buckets * 2);
+	return table_instance_rehash(ti, ti->n_buckets * 2, ufid);
 }
 
 /* Remove 'mask' from the mask list, if it is not needed any more. */
@@ -513,10 +597,15 @@ static void flow_mask_remove(struct flow_table *tbl, struct sw_flow_mask *mask)
 void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow)
 {
 	struct table_instance *ti = ovsl_dereference(table->ti);
+	struct table_instance *ufid_ti = ovsl_dereference(table->ufid_ti);
 
 	BUG_ON(table->count == 0);
-	hlist_del_rcu(&flow->hash_node[ti->node_ver]);
+	hlist_del_rcu(&flow->flow_hash.node[ti->node_ver]);
 	table->count--;
+	if (flow->ufid) {
+		hlist_del_rcu(&flow->ufid_hash.node[ufid_ti->node_ver]);
+		table->ufid_count--;
+	}
 
 	/* RCU delete the mask. 'flow->mask' is not NULLed, as it should be
 	 * accessible as long as the RCU read lock is held.
@@ -585,34 +674,65 @@ static int flow_mask_insert(struct flow_table *tbl, struct sw_flow *flow,
 }
 
 /* Must be called with OVS mutex held. */
-int ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow,
-			const struct sw_flow_mask *mask)
+static void flow_key_insert(struct flow_table *table, struct sw_flow *flow)
 {
 	struct table_instance *new_ti = NULL;
 	struct table_instance *ti;
-	int err;
 
-	err = flow_mask_insert(table, flow, mask);
-	if (err)
-		return err;
-
-	flow->hash = flow_hash(&flow->key, flow->mask->range.start,
-			flow->mask->range.end);
+	flow->flow_hash.hash = flow_hash(&flow->key, flow->mask->range.start,
+					 flow->mask->range.end);
 	ti = ovsl_dereference(table->ti);
 	table_instance_insert(ti, flow);
 	table->count++;
 
 	/* Expand table, if necessary, to make room. */
 	if (table->count > ti->n_buckets)
-		new_ti = table_instance_expand(ti);
+		new_ti = table_instance_expand(ti, false);
 	else if (time_after(jiffies, table->last_rehash + REHASH_INTERVAL))
-		new_ti = table_instance_rehash(ti, ti->n_buckets);
+		new_ti = table_instance_rehash(ti, ti->n_buckets, false);
 
 	if (new_ti) {
 		rcu_assign_pointer(table->ti, new_ti);
-		table_instance_destroy(ti, true);
+		call_rcu(&ti->rcu, flow_tbl_destroy_rcu_cb);
 		table->last_rehash = jiffies;
 	}
+}
+
+/* Must be called with OVS mutex held. */
+static void flow_ufid_insert(struct flow_table *table, struct sw_flow *flow)
+{
+	struct table_instance *ti;
+
+	flow->ufid_hash.hash = ufid_hash(flow->ufid);
+	ti = ovsl_dereference(table->ufid_ti);
+	ufid_table_instance_insert(ti, flow);
+	table->ufid_count++;
+
+	/* Expand table, if necessary, to make room. */
+	if (table->ufid_count > ti->n_buckets) {
+		struct table_instance *new_ti;
+
+		new_ti = table_instance_expand(ti, true);
+		if (new_ti) {
+			rcu_assign_pointer(table->ufid_ti, new_ti);
+			call_rcu(&ti->rcu, flow_tbl_destroy_rcu_cb);
+		}
+	}
+}
+
+/* Must be called with OVS mutex held. */
+int ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow,
+			const struct sw_flow_mask *mask)
+{
+	int err;
+
+	err = flow_mask_insert(table, flow, mask);
+	if (err)
+		return err;
+	flow_key_insert(table, flow);
+	if (flow->ufid)
+		flow_ufid_insert(table, flow);
+
 	return 0;
 }
 
diff --git a/net/openvswitch/flow_table.h b/net/openvswitch/flow_table.h
index 309fa64..454ef92 100644
--- a/net/openvswitch/flow_table.h
+++ b/net/openvswitch/flow_table.h
@@ -47,9 +47,11 @@ struct table_instance {
 
 struct flow_table {
 	struct table_instance __rcu *ti;
+	struct table_instance __rcu *ufid_ti;
 	struct list_head mask_list;
 	unsigned long last_rehash;
 	unsigned int count;
+	unsigned int ufid_count;
 };
 
 extern struct kmem_cache *flow_stats_cache;
@@ -78,8 +80,13 @@ struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *,
 				    const struct sw_flow_key *);
 struct sw_flow *ovs_flow_tbl_lookup_exact(struct flow_table *tbl,
 					  const struct sw_flow_match *match);
+struct sw_flow *ovs_flow_tbl_lookup_ufid(struct flow_table *,
+					 const struct sw_flow_id *);
+
 bool ovs_flow_cmp_unmasked_key(const struct sw_flow *flow,
 			       const struct sw_flow_match *match);
+bool ovs_flow_cmp_ufid(const struct sw_flow *flow,
+		       const struct sw_flow_id *sfid);
 
 void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
 		       const struct sw_flow_mask *mask);
-- 
1.7.10.4
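
The UFID length validation in the patch above can be tried outside the
kernel. What follows is only a minimal userspace sketch, not the patch's
code: every sketch_* name and the 16-byte limit are assumptions made for
illustration (the real MAX_UFID_LENGTH is defined in a part of the patch
not shown here), and whether a UFID of exactly the maximum length is
accepted is likewise an assumption. The sketch uses a signed return type
so that negative error codes are still negative when the caller tests
them with "<= 0".

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limit for this sketch only. */
#define SKETCH_MAX_UFID_LENGTH 16

struct sketch_flow_id {
	size_t ufid_len;
	unsigned char ufid[SKETCH_MAX_UFID_LENGTH];
};

/* Returns the validated length, 0 if no UFID was supplied, or a negative
 * errno.  The return type is signed so error codes stay negative.
 */
static int sketch_ufid_size(const void *data, size_t len)
{
	if (!data)
		return 0;
	if (len == 0 || len > SKETCH_MAX_UFID_LENGTH)
		return -EINVAL;
	return (int)len;
}

/* Copies a validated UFID into 'sfid'.  Returns 0 on success or when no
 * UFID is present, and a negative errno on invalid input.
 */
static int sketch_get_ufid(const void *data, size_t len,
			   struct sketch_flow_id *sfid)
{
	int n;

	sfid->ufid_len = 0;
	n = sketch_ufid_size(data, len);
	if (n <= 0)
		return n;

	sfid->ufid_len = (size_t)n;
	memcpy(sfid->ufid, data, (size_t)n);
	return 0;
}
```

With an unsigned length variable the "<= 0" test can only ever match
zero, so keeping the intermediate value signed is what lets the error
path run before memcpy().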

^ permalink raw reply related

* [PATCH] iputils: multiply sndbuf size by sending icmp times
From: Li Wang @ 2014-12-03  2:58 UTC (permalink / raw)
  To: hideaki.yoshifuji, netdev

$ ping -i 0.1 198.168.5.200 -W 1
PING 128.224.124.76 (128.224.124.76) 56(84) bytes of data.
ping: sendmsg: No buffer space available
From 128.224.124.205 icmp_seq=1 Destination Host Unreachable

When pinging a nonexistent IP address on the same subnet,
ping first causes ARP packets to be sent.
There is a limit on the ARP packets for the same ping.

On linux-2.6, the limit is 3 ARP packets,
so the size limit is (3*sizeof(arp packet)).

On linux-3.x, the limit on ARP packets is 64k in size,
so it may exceed the socket's sndbuf.
The Linux kernel improved this limit.

When the customer uses the "-i 0.1 -W 1" options, ping sends 20 ICMP packets.
At the same time, it sends 20 ARP packets.
That does not exceed the linux-3.x limit,
but it does exceed ping's socket sndbuf (324):
setsockopt(3, SOL_SOCKET, SO_SNDBUF, [324], 4) = 0

So, auto-resize sndbuf according to the number of ARP packets.
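
The numbers above can be reproduced with a small standalone sketch of
the arithmetic. It is not the iputils source: the sketch_* names are
made up for illustration, and two things are assumptions here, namely
that lingertime and interval are held in milliseconds and that SCHINT()
clamps its argument to a 10 ms minimum.

```c
/* Assumed to mirror SCHINT()/MININTERVAL in iputils; values in ms. */
#define SKETCH_MININTERVAL 10
#define SKETCH_SCHINT(a) \
	(((a) <= SKETCH_MININTERVAL) ? SKETCH_MININTERVAL : (a))

/* Send-buffer size as computed in ping.c before this patch. */
static int sketch_base_hold(int datalen, int optlen)
{
	int hold = datalen + 8;	/* payload plus 8-byte ICMP header */

	hold += ((hold + 511) / 512) * (optlen + 20 + 16 + 64 + 160);
	return hold;
}

/* Send-buffer size after the extra multiplication added by this patch. */
static int sketch_scaled_hold(int datalen, int optlen,
			      int lingertime_ms, int interval_ms)
{
	return sketch_base_hold(datalen, optlen) *
	       (lingertime_ms / SKETCH_SCHINT(interval_ms / 2));
}
```

For the default 56-byte payload with no IP options,
sketch_base_hold(56, 0) gives 324, matching the setsockopt() call.
Under the millisecond assumption, "-i 0.1 -W 1" gives a factor of
1000 / 50 = 20, so sketch_scaled_hold(56, 0, 1000, 100) gives 6480.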

Signed-off-by: Li Wang <li.wang@windriver.com>
---
 ping.c  |    1 +
 ping6.c |    1 +
 2 files changed, 2 insertions(+)

diff --git a/ping.c b/ping.c
index c0366cd..46ca6df 100644
--- a/ping.c
+++ b/ping.c
@@ -531,6 +531,7 @@ main(int argc, char **argv)
 	 * Actually, for small datalen's it depends on kernel side a lot. */
 	hold = datalen + 8;
 	hold += ((hold+511)/512)*(optlen + 20 + 16 + 64 + 160);
+	hold *= lingertime/SCHINT(interval/2);
 	sock_setbufs(icmp_sock, hold);
 
 	if (broadcast_pings) {
diff --git a/ping6.c b/ping6.c
index 6d83462..58843a9 100644
--- a/ping6.c
+++ b/ping6.c
@@ -1086,6 +1086,7 @@ int main(int argc, char *argv[])
 	 * Actually, for small datalen's it depends on kernel side a lot. */
 	hold = datalen+8;
 	hold += ((hold+511)/512)*(40+16+64+160);
+	hold *= lingertime/SCHINT(interval/2);
 	sock_setbufs(icmp_sock, hold);
 
 #ifdef __linux__
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH net-next v2] rtnetlink: delay RTM_DELLINK notification until after ndo_uninit()
From: Eric Dumazet @ 2014-12-03  3:02 UTC (permalink / raw)
  To: Mahesh Bandewar
  Cc: netdev, David Miller, Eric Dumazet, Roopa Prabhu, Toshiaki Makita
In-Reply-To: <1417574617-27560-1-git-send-email-maheshb@google.com>

On Tue, 2014-12-02 at 18:43 -0800, Mahesh Bandewar wrote:

> +
> +void rtmsg_ifinfo(int type, struct net_device *dev, unsigned int change,
> +		  gfp_t flags)
> +{
> +	struct sk_buff *skb;
> +
> +	skb = rtmsg_ifinfo_build_skb(type, dev, change, flags);
> +	if (!skb)

	if (skb)

> +		rtmsg_ifinfo_send(skb, dev, flags);
>  }
>  EXPORT_SYMBOL(rtmsg_ifinfo);
>  

^ permalink raw reply

* Re: linux-next Problems with VPN tunnel - no packets sent
From: Jason Wang @ 2014-12-03  3:12 UTC (permalink / raw)
  To: Valdis Kletnieks; +Cc: Herbert Xu, davem, netdev, linux-kernel
In-Reply-To: <17929.1417552899@turing-police.cc.vt.edu>



On Wed, Dec 3, 2014 at 4:41 AM, Valdis Kletnieks <Valdis.Kletnieks@vt.edu> wrote:
> Recent linux-next has broken my Juniper VPN client.  The tunnel gets created,
> routes get added, but trying to actually send packets across results in packets
> just disappearing. 'ifconfig' consistently reports exactly 1 packet sent (even
> after a 'ping' command or similar should have sent multiple packets).
> 
> tun0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1400
>         inet 172.27.1.40  netmask 255.255.255.255  destination 172.27.1.40
>         unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 500  (UNSPEC)
>         RX packets 1  bytes 355 (355.0 B)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 1  bytes 61 (61.0 B)
>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> 
> Still broken in next-20141201, and bisection fingers this commit:
> 
> commit e0b46d0ee9c240c7430a47e9b0365674d4a04522
> Author: Herbert Xu <herbert@gondor.apana.org.au>
> Date:   Fri Nov 7 21:22:23 2014 +0800
> 
>     tun: Use iovec iterators
> 
>     This patch removes the use of skb_copy_datagram_const_iovec in
>     favour of the iovec iterator-based skb_copy_datagram_iter.
> 
> This commit is in the kernel, and does *not* fix the problem:
> 
> commit 8c847d254146d32c86574a1b16923ff91bb784dd
> Author: Jason Wang <jasowang@redhat.com>
> Date:   Thu Nov 13 16:54:14 2014 +0800
> 
>     tun: fix issues of iovec iterators using in tun_put_user()
> 
> So there's apparently additional issues that Jason didn't address. I tried to
> revert Herbert's patch for testing, but there's at least 5 or 6 other patches
> that need reverting first, so I abandoned that unless it becomes necessary...
> 
> What's the best way to proceed?

See another fix from Herbert; it probably fixes your issue:

http://marc.info/?l=linux-netdev&m=141734182302021&w=2

^ permalink raw reply

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: David Miller @ 2014-12-03  3:23 UTC (permalink / raw)
  To: fan.du; +Cc: netdev, fw
In-Reply-To: <1417156385-18276-1-git-send-email-fan.du@intel.com>

From: Fan Du <fan.du@intel.com>
Date: Fri, 28 Nov 2014 14:33:05 +0800

> Test scenario: two KVM guests sitting on different
> hosts communicate with each other over a vxlan tunnel.
> 
> All interface MTUs are the default 1500 bytes. From the guest's point
> of view, its skb gso_size can be as big as 1448 bytes. However, after
> the guest skb goes through vxlan encapsulation, individual segments
> of a GSO packet can exceed the physical NIC MTU of 1500 and will be
> lost at the receiver side.
> 
> So in a virtualized environment, the length of a locally created
> skb after encapsulation can be bigger than the underlying
> MTU. In such a case, it's reasonable to do GSO first,
> then fragment any packet bigger than the MTU where possible.
> 
> +---------------+ TX     RX +---------------+
> |   KVM Guest   | -> ... -> |   KVM Guest   |
> +-+-----------+-+           +-+-----------+-+
>   |Qemu/VirtIO|               |Qemu/VirtIO|
>   +-----------+               +-----------+
>        |                            |
>        v tap0                  tap0 v
>   +-----------+               +-----------+
>   | ovs bridge|               | ovs bridge|
>   +-----------+               +-----------+
>        | vxlan                vxlan |
>        v                            v
>   +-----------+               +-----------+
>   |    NIC    |    <------>   |    NIC    |
>   +-----------+               +-----------+
> 
> Steps to reproduce:
>  1. Use the kernel's built-in openvswitch module to set up an ovs bridge.
>  2. Run iperf without -M; communication will get stuck.
> 
> Signed-off-by: Fan Du <fan.du@intel.com>

I really don't like this at all.

If guest sees a 1500 byte MTU, that's its link layer MTU and it had
better be able to send 1500 byte packets onto the "wire".

If you cannot properly propagate the vxlan encapsulation overhead back
into the guest's MTU you must hide this problem from the rest of our
stack somehow.

Nothing we create inside the host should need the change that you
are making.
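
To put numbers on the sizes under discussion, here is a rough sketch of
the header arithmetic. It assumes an IPv4 outer header without options,
an untagged inner Ethernet frame, and TCP with the timestamp option in
the guest; the sketch_* names are invented for this example and do not
exist in the kernel.

```c
enum {
	SKETCH_ETH_HLEN   = 14,	/* Ethernet header, no VLAN tag */
	SKETCH_IPV4_HLEN  = 20,	/* IPv4 header, no options */
	SKETCH_UDP_HLEN   = 8,
	SKETCH_VXLAN_HLEN = 8,
	SKETCH_TCP_HLEN   = 20,	/* TCP header, no options */
	SKETCH_TCP_TS_LEN = 12,	/* padded TCP timestamp option */
};

/* Largest TCP payload per segment for a given guest MTU. */
static int sketch_gso_size(int guest_mtu)
{
	return guest_mtu - SKETCH_IPV4_HLEN - SKETCH_TCP_HLEN -
	       SKETCH_TCP_TS_LEN;
}

/* Size of the outer IP packet once an inner IP packet of 'inner_ip_len'
 * bytes has been wrapped in inner Ethernet, vxlan, UDP and outer IPv4.
 */
static int sketch_vxlan_outer_len(int inner_ip_len)
{
	return inner_ip_len + SKETCH_ETH_HLEN + SKETCH_VXLAN_HLEN +
	       SKETCH_UDP_HLEN + SKETCH_IPV4_HLEN;
}

/* Guest MTU at which a full-sized inner packet still fits 'phys_mtu'. */
static int sketch_guest_mtu_for(int phys_mtu)
{
	return phys_mtu - sketch_vxlan_outer_len(0);
}
```

Under these assumptions a 1500 byte guest MTU yields the 1448 byte
gso_size quoted above, a full-sized inner packet becomes 1550 bytes on
the physical link, and the guest would need an MTU of 1450 for its
packets to fit a 1500 byte physical MTU.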

^ permalink raw reply

* Re: [PATCH] xen-netfront: Remove BUGs on paged skb data which crosses a page boundary
From: David Miller @ 2014-12-03  3:25 UTC (permalink / raw)
  To: seth.forshee
  Cc: konrad.wilk, boris.ostrovsky, david.vrabel, zoltan.kiss,
	eric.dumazet, stefan.bader, xen-devel, netdev, linux-kernel
In-Reply-To: <1416968904-70874-1-git-send-email-seth.forshee@canonical.com>

From: Seth Forshee <seth.forshee@canonical.com>
Date: Tue, 25 Nov 2014 20:28:24 -0600

> These BUGs can be erroneously triggered by frags which refer to
> tail pages within a compound page. The data in these pages may
> overrun the hardware page while still being contained within the
> compound page, but since compound_order() evaluates to 0 for tail
> pages the assertion fails. The code already iterates through
> subsequent pages correctly in this scenario, so the BUGs are
> unnecessary and can be removed.
> 
> Fixes: f36c374782e4 ("xen/netfront: handle compound page fragments on transmit")
> Cc: <stable@vger.kernel.org> # 3.7+
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

Applied, thanks.
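
For readers following along, the idea of walking a fragment that runs
past the end of its first page can be sketched in plain C. This is not
the xen-netfront code; the sketch_* names and the fixed 4096 byte page
size are assumptions for illustration only.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u

/* Counts the page-sized pieces needed to cover 'len' bytes that start
 * 'offset' bytes into a fragment.  The offset may point beyond the
 * first page, and the data may continue into following pages.
 */
static unsigned int sketch_count_slots(size_t offset, size_t len)
{
	unsigned int slots = 0;

	/* Only the position within the starting page matters. */
	offset %= SKETCH_PAGE_SIZE;

	while (len > 0) {
		size_t chunk = SKETCH_PAGE_SIZE - offset;

		if (chunk > len)
			chunk = len;
		slots++;
		len -= chunk;
		offset = 0;	/* later pages are used from their start */
	}
	return slots;
}
```

A fragment that starts 4000 bytes into a page and is 200 bytes long
needs two pieces, which is a legitimate layout when the pages belong to
one compound page rather than a condition to BUG on.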

^ permalink raw reply

* Re: [RFC PATCH 0/3] net: Alloc NAPI page frags from their own pool
From: David Miller @ 2014-12-03  3:30 UTC (permalink / raw)
  To: alexander.h.duyck; +Cc: netdev, brouer, jeffrey.t.kirsher, eric.dumazet, ast
In-Reply-To: <20141126235900.1617.10008.stgit@ahduyck-vm-fedora20>

From: Alexander Duyck <alexander.h.duyck@redhat.com>
Date: Wed, 26 Nov 2014 16:05:50 -0800

> This patch series implements a means of allocating page fragments without
> the need for the local_irq_save/restore in __netdev_alloc_frag.  By doing
> this I am able to decrease packet processing time by 11ns per packet in my
> test environment.

No fundamental objections from me.

^ permalink raw reply

* RE: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: Du, Fan @ 2014-12-03  3:32 UTC (permalink / raw)
  To: David Miller; +Cc: netdev@vger.kernel.org, fw@strlen.de, Du, Fan
In-Reply-To: <20141202.192311.1226452173523245977.davem@davemloft.net>



>-----Original Message-----
>From: David Miller [mailto:davem@davemloft.net]
>Sent: Wednesday, December 3, 2014 11:23 AM
>To: Du, Fan
>Cc: netdev@vger.kernel.org; fw@strlen.de
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>From: Fan Du <fan.du@intel.com>
>Date: Fri, 28 Nov 2014 14:33:05 +0800
>
>> Test scenario: two KVM guests sitting on different hosts communicate
>> with each other over a vxlan tunnel.
>>
>> All interface MTUs are the default 1500 bytes. From the guest's point
>> of view, its skb gso_size can be as big as 1448 bytes. However, after
>> the guest skb goes through vxlan encapsulation, individual segments of
>> a GSO packet can exceed the physical NIC MTU of 1500 and will be lost
>> at the receiver side.
>>
>> So in a virtualized environment, the length of a locally created skb
>> after encapsulation can be bigger than the underlying MTU. In such a
>> case, it's reasonable to do GSO first, then fragment any packet bigger
>> than the MTU where possible.
>>
>> +---------------+ TX     RX +---------------+
>> |   KVM Guest   | -> ... -> |   KVM Guest   |
>> +-+-----------+-+           +-+-----------+-+
>>   |Qemu/VirtIO|               |Qemu/VirtIO|
>>   +-----------+               +-----------+
>>        |                            |
>>        v tap0                  tap0 v
>>   +-----------+               +-----------+
>>   | ovs bridge|               | ovs bridge|
>>   +-----------+               +-----------+
>>        | vxlan                vxlan |
>>        v                            v
>>   +-----------+               +-----------+
>>   |    NIC    |    <------>   |    NIC    |
>>   +-----------+               +-----------+
>>
>> Steps to reproduce:
>>  1. Use the kernel's built-in openvswitch module to set up an ovs bridge.
>>  2. Run iperf without -M; communication will get stuck.
>>
>> Signed-off-by: Fan Du <fan.du@intel.com>
>
>I really don't like this at all.
>
>If guest sees a 1500 byte MTU, that's its link layer MTU and it had better be able to
>send 1500 byte packets onto the "wire".

This patch makes that happen exactly as you put it.

>If you cannot properly propagate the vxlan encapsulation overhead back into the
>guest's MTU you must hide this problem from the rest of our stack somehow.

Again, this patch hides the problem so that the guest believes it can send packets with an MTU of 1500 bytes.

>Nothing we create inside the host should need the change that you are making.

^ permalink raw reply

* Re: [PATCH net-next v2] rtnetlink: delay RTM_DELLINK notification until after ndo_uninit()
From: Toshiaki Makita @ 2014-12-03  3:41 UTC (permalink / raw)
  To: Mahesh Bandewar, netdev; +Cc: David Miller, Eric Dumazet, Roopa Prabhu
In-Reply-To: <1417574617-27560-1-git-send-email-maheshb@google.com>

On 2014/12/03 11:43, Mahesh Bandewar wrote:
> The commit 56bfa7ee7c ("unregister_netdevice : move RTM_DELLINK to
> until after ndo_uninit") tried to do this earlier, but in doing so
> it created a problem. Unfortunately, the delayed rtmsg_ifinfo() also
> delayed the call to fill_info(). This translated into asking the driver
> to remove its private state and then querying that private state, which
> could have catastrophic consequences.
> 
> This change breaks rtmsg_ifinfo() into two parts: the first takes a
> precise snapshot of the device by calling fill_info() before calling
> ndo_uninit(), and the second sends the notification using the
> collected snapshot.
> 
> This was noticed when the last link is deleted from an ipvlan device:
> the device has already freed the port, and the subsequent .fill_info()
> call tries to get the info from that port.
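
The ordering described above (snapshot first, teardown second,
notification last) can be shown with a small userspace sketch. It is
not the rtnetlink code: the sketch_* names and the string "message"
are hypothetical stand-ins for the skb built from fill_info().

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct sketch_dev {
	char name[16];
	char *priv;	/* driver-private state, freed by uninit */
};

/* Part 1: capture everything the notification needs while the private
 * state is still valid.  Returns NULL if the allocation fails.
 */
static char *sketch_build_msg(const struct sketch_dev *dev)
{
	size_t n = strlen(dev->name) + strlen(dev->priv) + 2;
	char *msg = malloc(n);

	if (msg)
		snprintf(msg, n, "%s:%s", dev->name, dev->priv);
	return msg;
}

/* Stand-in for ndo_uninit(): tears down the private state. */
static void sketch_uninit(struct sketch_dev *dev)
{
	free(dev->priv);
	dev->priv = NULL;
}

/* Part 2: deliver the snapshot taken earlier, then release it. */
static void sketch_send_msg(char *msg, char *out, size_t outlen)
{
	snprintf(out, outlen, "%s", msg);
	free(msg);
}

static void sketch_unregister(struct sketch_dev *dev, char *out,
			      size_t outlen)
{
	char *msg = sketch_build_msg(dev);	/* before uninit */

	sketch_uninit(dev);
	if (msg)		/* send only if the build succeeded */
		sketch_send_msg(msg, out, outlen);
}
```

Because the message is built before sketch_uninit() runs, the send step
never touches the freed private state.
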
> 
...
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> Report-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

s/Report-by/Reported-by/

Thanks,
Toshiaki Makita

^ permalink raw reply

* Re: [PATCH] cxgb4: Fill in supported link mode for SFP modules
From: David Miller @ 2014-12-03  3:58 UTC (permalink / raw)
  To: hariprasad; +Cc: netdev, leedom, nirranjan
In-Reply-To: <1417179914-5806-1-git-send-email-hariprasad@chelsio.com>

From: Hariprasad Shenai <hariprasad@chelsio.com>
Date: Fri, 28 Nov 2014 18:35:14 +0530

> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>

Applied, thanks.

^ permalink raw reply

* Re: [patch net-next v5 00/21] introduce rocker switch driver with hardware accelerated datapath api - phase 1: bridge fdb offload
From: David Miller @ 2014-12-03  4:02 UTC (permalink / raw)
  To: jiri
  Cc: netdev, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck, john.ronciak, mleitner, shrijeet,
	gospo, bcrl, hemal
In-Reply-To: <1417181672-11531-1-git-send-email-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Fri, 28 Nov 2014 14:34:11 +0100

> This patchset is just the first phase of switch and switch-ish device
> support api in kernel. Note that the api will extend.

Series applied, thanks everyone for all of their hard work on this.

^ permalink raw reply

* Re: [PATCH, regression against -rc6] net/stmmac: fix one more regression from filter bins setting
From: David Miller @ 2014-12-03  4:32 UTC (permalink / raw)
  To: arnd; +Cc: chenhc, netdev, khilman, peppe.cavallaro, olof, linux-arm-kernel
In-Reply-To: <5807934.FFPy2gmHBj@wuerfel>

From: Arnd Bergmann <arnd@arndb.de>
Date: Sat, 29 Nov 2014 17:26:42 +0100

> On Saturday 29 November 2014 23:44:23 陈华才 wrote:
>> Hi, Arnd,
>> 
>> Maybe this patch is better?
>> http://www.spinics.net/lists/netdev/msg306413.html
> 
> Yes, that would work too. I also checked that my version works now.
> 
> I'm fine with either one, as long as a fix makes it into 3.18.

It's in the 'net' tree, I'll try to get it to Linus soon.

^ permalink raw reply

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: David Miller @ 2014-12-03  4:35 UTC (permalink / raw)
  To: fan.du; +Cc: netdev, fw
In-Reply-To: <5A90DA2E42F8AE43BC4A093BF0678848DEE09A@SHSMSX104.ccr.corp.intel.com>

From: "Du, Fan" <fan.du@intel.com>
Date: Wed, 3 Dec 2014 03:32:46 +0000

>>If guest sees a 1500 byte MTU, that's its link layer MTU and it had better be able to
>>send 1500 byte packets onto the "wire".
> 
> This patch makes that happen exactly as you put it.
> 
>>If you cannot properly propagate the vxlan encapsulation overhead back into the
>>guest's MTU you must hide this problem from the rest of our stack somehow.
> 
> Again, this patch hides the problem so that the guest believes it can send packets with an MTU of 1500 bytes.

I said make the guest see the real MTU, not hide the real MTU by
fragmenting or spitting ICMP PMTU messages back.

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2014-12-03  4:39 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) Fill in ethtool link parameters for all link types in cxgb4,
   from Hariprasad Shenai.

2) Fix probe regressions in stmmac driver, from Huacai Chen.

3) Network namespace leaks on errors in rtnetlink, from Nicolas
   Dichtel.

4) Remove erroneous BUG check which can actually trigger
   legitimately, in xen-netfront.  From Seth Forshee.

5) Validate length of IFLA_BOND_ARP_IP_TARGET netlink
   attributes, from Thomas Graf.

Please pull, thanks a lot.

The following changes since commit 7a5a4f978750756755dc839014e13d1b088ccc8e:

  Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (2014-11-29 10:49:24 -0800)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master

for you to fetch changes up to 4c2d518695338801110bc166eece6aa02822b0b4:

  cxgb4: Fill in supported link mode for SFP modules (2014-12-02 19:57:49 -0800)

----------------------------------------------------------------
Hariprasad Shenai (1):
      cxgb4: Fill in supported link mode for SFP modules

Huacai Chen (1):
      stmmac: platform: Move plat_dat checking earlier

Mitsuhiro Kimura (2):
      sh_eth: Fix skb alloc size and alignment adjust rule.
      sh_eth: Fix sleeping function called from invalid context

Nicolas Dichtel (1):
      rtnetlink: release net refcnt on error in do_setlink()

Seth Forshee (1):
      xen-netfront: Remove BUGs on paged skb data which crosses a page boundary

Thomas Graf (1):
      bond: Check length of IFLA_BOND_ARP_IP_TARGET attributes

 drivers/net/bonding/bond_netlink.c                    |    7 ++++++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c       |    8 ++++++--
 drivers/net/ethernet/renesas/sh_eth.c                 |   96 ++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------------
 drivers/net/ethernet/renesas/sh_eth.h                 |    5 +++--
 drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c |   18 +++++++++---------
 drivers/net/xen-netfront.c                            |    5 -----
 net/core/rtnetlink.c                                  |    1 +
 7 files changed, 73 insertions(+), 67 deletions(-)

^ permalink raw reply

