Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCHv3 net-next 3/3] vxlan: virtual extensible lan
From: Stephen Hemminger @ 2012-09-24 20:27 UTC (permalink / raw)
  To: John Fastabend; +Cc: Eric Dumazet, David Miller, Chris Wright, netdev
In-Reply-To: <5060C189.8090803@intel.com>

On Mon, 24 Sep 2012 13:24:41 -0700
John Fastabend <john.r.fastabend@intel.com> wrote:

> [...]
> 
> On 9/24/2012 1:02 PM, Stephen Hemminger wrote:
> > +
> > +/* Transmit local packets over Vxlan
> > + *
> > + * Outer IP header inherits ECN and DF from inner header.
> > + * Outer UDP destination is the VXLAN assigned port.
> > + *           source port is based on hash of flow if available
> > + *                       otherwise use a random value
> > + */
> > +static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
> > +{
> > +	struct vxlan_dev *vxlan = netdev_priv(dev);
> > +	struct rtable *rt;
> > +	const struct ethhdr *eth;
> > +	const struct iphdr *old_iph;
> > +	struct iphdr *iph;
> > +	struct vxlanhdr *vxh;
> > +	struct udphdr *uh;
> > +	struct flowi4 fl4;
> > +	struct vxlan_fdb *f;
> > +	unsigned int pkt_len = skb->len;
> > +	unsigned int mtu;
> > +	u32 hash;
> > +	__be32 dst;
> > +	__be16 df = 0;
> > +	__u8 tos, ttl;
> > +	int err;
> > +
> 
> [...]
> 
> > +	err = ip_local_out(skb);
> > +	if (likely(net_xmit_eval(err) == 0)) {
> > +		struct vxlan_stats *stats = this_cpu_ptr(vxlan->stats);
> > +
> > +		u64_stats_update_begin(&stats->syncp);
> > +		stats->tx_packets++;
> > +		stats->tx_bytes += pkt_len;
> 
> Should pkt_len include the outer headers?

It doesn't for GRE and related tunnels.

^ permalink raw reply

* Re: [RFC PATCHv2 bridge 7/7] bridge: Add the ability to show dump the vlan map from a bridge port
From: Vlad Yasevich @ 2012-09-24 20:29 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Stephen Hemminger, David Miller, netdev
In-Reply-To: <1348511955.10741.4.camel@deadeye.wl.decadent.org.uk>

On 09/24/2012 02:39 PM, Ben Hutchings wrote:
> On Mon, 2012-09-24 at 09:49 -0400, Vlad Yasevich wrote:
>> On 09/22/2012 04:05 PM, Stephen Hemminger wrote:
>>> On Sat, 22 Sep 2012 13:27:47 -0400 (EDT)
>>> David Miller <davem@davemloft.net> wrote:
>>>
>>>> From: Ben Hutchings <bhutchings@solarflare.com>
>>>> Date: Sat, 22 Sep 2012 18:15:32 +0100
>>>>
>>>>> On Wed, 2012-09-19 at 08:42 -0400, Vlad Yasevich wrote:
>>>>>> Using the RTM_GETLINK dump the vlan map of a given bridge port.
>>>>> [...]
>>>>>
>>>>> This enlarges the RTM_GETLINK response quite a bit.  I think perhaps
>>>>> this should be optional, like IFLA_VFINFO_LIST is now.
>>>>
>>>> Completely agreed.
>>>
>>> Since most users won't use it, it should be not necessary to include
>>> it if the map is blank.
>>>
>>
>> Response is included only for AF_BRIDGE calls (proably used by STP
>> implementation) and only when vlan filtering is configured.
>>
>> I do see the point though and can add a filter for this.
>
> I thought the per-AF stuff was included by default (or at least what
> most clients ask for).  If not then this is probably fine.
> IFLA_VFINFO_LIST was causing glibc's queries to fail and that must be
> avoided.
>
> Ben.
>

Not just that.  This is a PF_BRIDGE family request.  Note the
	err = __rtnl_register(PF_BRIDGE, RTM_GETLINK, NULL,
                               br_dump_ifinfo, NULL);

in br_netlink_init.

-vlad

^ permalink raw reply

* Re: [PATCH net-net] net: loopback: set default mtu to 64K
From: Eric Dumazet @ 2012-09-24 20:33 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120924.162442.1381532664983403691.davem@redhat.com>

On Mon, 2012-09-24 at 16:24 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 24 Sep 2012 10:28:59 +0200
> 
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > loopback current mtu of 16436 bytes allows no more than 3 MSS TCP
> > segments per frame, or 48 Kbytes. Changing mtu to 64K allows TCP
> > stack to build large frames and significantly reduces stack overhead.
> > 
> > Performance boost on bulk TCP transferts can be up to 30 %, partly
> > because we now have one ACK message for two 64KB segments, and a lower
> > probability of hitting /proc/sys/net/ipv4/tcp_reordering default limit.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> This gives a nice %13 improvement on my SPARC-T4 box as well.
> 
> Applied, thanks Eric.

the biggest increase is with

netperf -- -m 63K

I reach 50000 Mbytes/second

^ permalink raw reply

* Re: [PATCH net-net] net: loopback: set default mtu to 64K
From: David Miller @ 2012-09-24 20:38 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1348518796.26828.1651.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 24 Sep 2012 22:33:16 +0200

> On Mon, 2012-09-24 at 16:24 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Mon, 24 Sep 2012 10:28:59 +0200
>> 
>> > From: Eric Dumazet <edumazet@google.com>
>> > 
>> > loopback current mtu of 16436 bytes allows no more than 3 MSS TCP
>> > segments per frame, or 48 Kbytes. Changing mtu to 64K allows TCP
>> > stack to build large frames and significantly reduces stack overhead.
>> > 
>> > Performance boost on bulk TCP transferts can be up to 30 %, partly
>> > because we now have one ACK message for two 64KB segments, and a lower
>> > probability of hitting /proc/sys/net/ipv4/tcp_reordering default limit.
>> > 
>> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> 
>> This gives a nice %13 improvement on my SPARC-T4 box as well.
>> 
>> Applied, thanks Eric.
> 
> the biggest increase is with
> 
> netperf -- -m 63K
> 
> I reach 50000 Mbytes/second

I tested with lmbench bw_tcp, which gives me about 1308 MB/sec.

And indeed that netperf command line gives superior numbers,
about 11747 MB/sec

^ permalink raw reply

* Re: [PATCH net-next v3] net: use a per task frag allocator
From: David Miller @ 2012-09-24 20:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: subramanian.vijay, netdev, bhutchings, alexander.h.duyck
In-Reply-To: <1348477482.26828.235.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 24 Sep 2012 11:04:42 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> We currently use a per socket order-0 page cache for tcp_sendmsg()
> operations.
> 
> This page is used to build fragments for skbs.
> 
> Its done to increase probability of coalescing small write() into
> single segments in skbs still in write queue (not yet sent)
> 
> But it wastes a lot of memory for applications handling many mostly
> idle sockets, since each socket holds one page in sk->sk_sndmsg_page
> 
> Its also quite inefficient to build TSO 64KB packets, because we need
> about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
> page allocator more than wanted.
> 
> This patch adds a per task frag allocator and uses bigger pages,
> if available. An automatic fallback is done in case of memory pressure.
> 
> (up to 32768 bytes per frag, thats order-3 pages on x86)
> 
> This increases TCP stream performance by 20% on loopback device,
> but also benefits on other network devices, since 8x less frags are
> mapped on transmit and unmapped on tx completion. Alexander Duyck
> mentioned a probable performance win on systems with IOMMU enabled.
> 
> Its possible some SG enabled hardware cant cope with bigger fragments,
> but their ndo_start_xmit() should already handle this, splitting a
> fragment in sub fragments, since some arches have PAGE_SIZE=65536
> 
> Successfully tested on various ethernet devices.
> (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

I'm going to apply this, nice work Eric.

I'll also take care of the trailing whitespace pointed out by others.

Thanks again.

^ permalink raw reply

* Re: [PATCHv2 net-next 3/3] vxlan: virtual extensible lan
From: Eric Dumazet @ 2012-09-24 20:41 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, Chris Wright, netdev
In-Reply-To: <20120924132637.17d89210@nehalam.linuxnetplumber.net>

On Mon, 2012-09-24 at 13:26 -0700, Stephen Hemminger wrote:
> On Mon, 24 Sep 2012 22:09:50 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 

> > 
> > I am not sure why you remove IFF_XMIT_DST_RELEASE here
> 
> Copied from GRE. What is impact either way? 

GRE needs skb_dst(skb) to be non NULL if tiph->daddr == NULL

but I am not sure your code needs this.

This is a performance issue, for packets waiting in a Qdisc : we'll need
to force a reference (and a cache line false sharing)

^ permalink raw reply

* Re: [PATCH] net: mipsnet: Remove the MIPSsim Ethernet driver.
From: David Miller @ 2012-09-24 20:41 UTC (permalink / raw)
  To: sjhill; +Cc: linux-mips, netdev, ralf
In-Reply-To: <1348498036-30257-1-git-send-email-sjhill@mips.com>

From: "Steven J. Hill" <sjhill@mips.com>
Date: Mon, 24 Sep 2012 09:47:16 -0500

> From: "Steven J. Hill" <sjhill@mips.com>
> 
> The MIPSsim platform is no longer supported or used. This patch
> deletes the Ethernet driver.
> 
> Signed-off-by: Steven J. Hill <sjhill@mips.com>

I'll apply this to net-next, thanks.

^ permalink raw reply

* Re: [PATCH net-next] filter: add XOR instruction for use with X/K
From: David Miller @ 2012-09-24 20:50 UTC (permalink / raw)
  To: eric.dumazet; +Cc: dxchgb, netdev
In-Reply-To: <1348494228.26828.784.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 24 Sep 2012 15:43:48 +0200

> On Mon, 2012-09-24 at 14:23 +0200, Daniel Borkmann wrote:
>> BPF_S_ANC_ALU_XOR_X has been added a while ago, but as an 'ancillary'
>> operation that is invoked through a negative offset in K within BPF
>> load operations. Since BPF_MOD has recently been added, BPF_XOR should
>> also be part of the common ALU operations. Removing BPF_S_ANC_ALU_XOR_X
>> might not be an option since this is exposed to user space.
> 
> Please note we dont expose BPF_S_ANC_ALU_XOR_X to user space.
> 
> We expose SKF_AD_ALU_XOR_X instead.
> 
> But it seems easier to leave it to keep this patch small (not touching
> various JIT implementations, even if followup are welcomed)
> 
> Acked-by: Eric Dumazet <edumazet@google.com>

I applied this, fixing the commit message to refer to SKF_AD_ALU_XOR_X
instead of BPF_S_ANC_ALU_XOR_X.

^ permalink raw reply

* Re: kernel BUG at kernel/timer.c:748!
From: David Miller @ 2012-09-24 20:53 UTC (permalink / raw)
  To: eric.dumazet; +Cc: davej, ycheng, ja, netdev
In-Reply-To: <1348506011.26828.1195.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 24 Sep 2012 19:00:11 +0200

> Signed-off-by: Eric Dumazet <edumazet@google.com>

I know you meant "From: Eric Dumazet <edumazet@google.com>"
here :-)

> [PATCH] net: guard tcp_set_keepalive() to tcp sockets
> 
> Its possible to use RAW sockets to get a crash in 
> tcp_set_keepalive() / sk_reset_timer()
> 
> Fix is to make sure socket is a SOCK_STREAM one.
> 
> Reported-by: Dave Jones <davej@redhat.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied and queued up for -stable, thanks Eric.

^ permalink raw reply

* Re: kernel BUG at kernel/timer.c:748!
From: David Miller @ 2012-09-24 20:53 UTC (permalink / raw)
  To: davej; +Cc: eric.dumazet, ycheng, ja, netdev
In-Reply-To: <20120924181152.GA31183@redhat.com>

From: Dave Jones <davej@redhat.com>
Date: Mon, 24 Sep 2012 14:11:52 -0400

> tangent: Is there any kind of networking correctness suite
> other than fuzz testers like isic etc ? 

If people want to work on this I'm willing to maintain the
tree.

^ permalink raw reply

* Re: [PATCH net-next] x86: bpf_jit_comp: add XOR instruction for BPF JIT
From: David Miller @ 2012-09-24 20:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: dxchgb, netdev
In-Reply-To: <1348508647.26828.1300.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 24 Sep 2012 19:44:07 +0200

> On Mon, 2012-09-24 at 19:34 +0200, Daniel Borkmann wrote:
>> This patch is a follow-up for patch "filter: add XOR instruction for use
>> with X/K" that implements BPF x86 JIT parts for the BPF XOR operation.
>> 
>> Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
>> ---
>>  arch/x86/net/bpf_jit_comp.c |    9 +++++++++
>>  1 files changed, 9 insertions(+), 0 deletions(-)
> 
> Acked-by: Eric Dumazet <edumazet@google.com>

Applied, thanks everyone.

^ permalink raw reply

* Re: [PATCH net-next 3/3] vxlan: virtual extensible lan
From: Chris Wright @ 2012-09-24 20:58 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, Chris Wright, netdev
In-Reply-To: <20120924185050.162920909@vyatta.com>

* Stephen Hemminger (shemminger@vyatta.com) wrote:
> This is an implementation of Virtual eXtensible Local Area Network
> as described in draft RFC:
>   http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02
> 
> The driver integrates a Virtual Tunnel Endpoint (VTEP) functionality
> that learns MAC to IP address mapping. 
> 
> This implementation has not been tested for Interoperation with
> other equipment.

I'm working on doing some interop

> --- a/drivers/net/Kconfig	2012-09-24 10:56:57.080291529 -0700
> +++ b/drivers/net/Kconfig	2012-09-24 11:08:02.865416523 -0700
> @@ -149,6 +149,19 @@ config MACVTAP
>  	  To compile this driver as a module, choose M here: the module
>  	  will be called macvtap.
>  
> +config VXLAN
> +       tristate "Virtual eXtensible Local Area Network (VXLAN)"
> +       depends on EXPERIMENTAL
> +       ---help---
> +	  This allows one to create vxlan virtual interfaces that provide
> +	  Layer 2 Networks over Layer 3 Networks. VXLAN is often used
> +	  to tunnel virtual network infrastructure in virtualized environments.
> +	  For more information see:
> +	    http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02
> +
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called macvlan.
                         ^^^^^^^
Cut 'n paste error, s/macvlan/vxlan/

> +/* Add static entry (via netlink) */
> +static int vxlan_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
> +			 struct net_device *dev,
> +			 const unsigned char *addr, u16 flags)
> +{
> +	struct vxlan_dev *vxlan = netdev_priv(dev);
> +	__be32 ip;
> +	int err;
> +
> +	if (tb[NDA_DST] == NULL)
> +		return -EINVAL;
> +
> +	if (nla_len(tb[NDA_DST]) != sizeof(__be32))
> +		return -EAFNOSUPPORT;
> +
> +	ip = nla_get_be32(tb[NDA_DST]);
> +
> +	spin_lock_bh(&vxlan->hash_lock);
> +	err = vxlan_fdb_create(vxlan, addr, ip, VXLAN_FDB_PERM);

Any reason to force permanent when created from userspace?

> +static bool vxlan_group_used(struct vxlan_net *vn,
> +			     const struct vxlan_dev *this)
> +{
> +	const struct vxlan_dev *vxlan;
> +	struct hlist_node *node;
> +	unsigned h;
> +
> +	for (h = 0; h < VNI_HASH_SIZE; ++h)
> +		hlist_for_each_entry(vxlan, node, &vn->vni_list[h], hlist) {

is walking this chain only protected with rtnl?

> +/* Propogate ECN from outer IP header to tunneled packet */
> +static inline void vxlan_ecn_decap(const struct iphdr *iph, struct sk_buff *skb)
> +{
> +	if (INET_ECN_is_ce(iph->tos)) {
> +		if (skb->protocol == htons(ETH_P_IP))
> +			IP_ECN_set_ce(ip_hdr(skb));
> +		else if (skb->protocol == htons(ETH_P_IPV6))
> +			IP6_ECN_set_ce(ipv6_hdr(skb));
> +	}
> +}
<snip>
> +/* Propogate ECN bits out */
> +static inline u8 vxlan_ecn_encap(u8 tos,
> +				 const struct iphdr *iph,
> +				 const struct sk_buff *skb)
> +{
> +	u8 inner = vxlan_get_dsfield(iph, skb);
> +
> +	return INET_ECN_encapsulate(tos, inner);
> +}

Goal is to be RFC 6040 compliant, and it looks like some edge cases aren't
met, for example, should drop on decap when inner is not supporting ECN
and outer has set CE.

<snip>
> +/* Callback from net/ipv4/udp.c to receive packets */
> +	/* Mark socket as an encapsulation socket. */
> +	udp_sk(sk)->encap_type = UDP_ENCAP_L2TPINUDP;

I don't think we need this particular encap_type value, just != 0

> +	udp_sk(sk)->encap_rcv = vxlan_udp_encap_recv;
> +	udp_encap_enable();

^ permalink raw reply

* Re: kernel BUG at kernel/timer.c:748!
From: Eric Dumazet @ 2012-09-24 21:01 UTC (permalink / raw)
  To: David Miller; +Cc: davej, ycheng, ja, netdev
In-Reply-To: <20120924.165307.848844032342514886.davem@davemloft.net>

On Mon, 2012-09-24 at 16:53 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 24 Sep 2012 19:00:11 +0200
> 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> I know you meant "From: Eric Dumazet <edumazet@google.com>"
> here :-)

Oh well, I should use some automatic tool ;)

Thanks

^ permalink raw reply

* Re: [PATCH net-next] filter: add XOR instruction for use with X/K
From: Daniel Borkmann @ 2012-09-24 21:02 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev
In-Reply-To: <20120924.165005.1246466996061307302.davem@davemloft.net>

David Miller <davem@davemloft.net> [2012-09-24 16:50:05 -0400] wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 24 Sep 2012 15:43:48 +0200
> > On Mon, 2012-09-24 at 14:23 +0200, Daniel Borkmann wrote:
> >> BPF_S_ANC_ALU_XOR_X has been added a while ago, but as an 'ancillary'
> >> operation that is invoked through a negative offset in K within BPF
> >> load operations. Since BPF_MOD has recently been added, BPF_XOR should
> >> also be part of the common ALU operations. Removing BPF_S_ANC_ALU_XOR_X
> >> might not be an option since this is exposed to user space.
> > 
> > Please note we dont expose BPF_S_ANC_ALU_XOR_X to user space.
> > 
> > We expose SKF_AD_ALU_XOR_X instead.
> > 
> > But it seems easier to leave it to keep this patch small (not touching
> > various JIT implementations, even if followup are welcomed)
> > 
> > Acked-by: Eric Dumazet <edumazet@google.com>
> 
> I applied this, fixing the commit message to refer to SKF_AD_ALU_XOR_X
> instead of BPF_S_ANC_ALU_XOR_X.

Thanks David!

^ permalink raw reply

* Re: [PATCH net-next 3/3] vxlan: virtual extensible lan
From: Stephen Hemminger @ 2012-09-24 21:11 UTC (permalink / raw)
  To: Chris Wright; +Cc: David Miller, netdev
In-Reply-To: <20120924205822.GI26494@x200.localdomain>

On Mon, 24 Sep 2012 13:58:22 -0700
Chris Wright <chrisw@redhat.com> wrote:

> * Stephen Hemminger (shemminger@vyatta.com) wrote:
> > This is an implementation of Virtual eXtensible Local Area Network
> > as described in draft RFC:
> >   http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02
> > 
> > The driver integrates a Virtual Tunnel Endpoint (VTEP) functionality
> > that learns MAC to IP address mapping. 
> > 
> > This implementation has not been tested for Interoperation with
> > other equipment.
> 
> I'm working on doing some interop
> 
> > --- a/drivers/net/Kconfig	2012-09-24 10:56:57.080291529 -0700
> > +++ b/drivers/net/Kconfig	2012-09-24 11:08:02.865416523 -0700
> > @@ -149,6 +149,19 @@ config MACVTAP
> >  	  To compile this driver as a module, choose M here: the module
> >  	  will be called macvtap.
> >  
> > +config VXLAN
> > +       tristate "Virtual eXtensible Local Area Network (VXLAN)"
> > +       depends on EXPERIMENTAL
> > +       ---help---
> > +	  This allows one to create vxlan virtual interfaces that provide
> > +	  Layer 2 Networks over Layer 3 Networks. VXLAN is often used
> > +	  to tunnel virtual network infrastructure in virtualized environments.
> > +	  For more information see:
> > +	    http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02
> > +
> > +	  To compile this driver as a module, choose M here: the module
> > +	  will be called macvlan.
>                          ^^^^^^^
> Cut 'n paste error, s/macvlan/vxlan/
> 
> > +/* Add static entry (via netlink) */
> > +static int vxlan_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
> > +			 struct net_device *dev,
> > +			 const unsigned char *addr, u16 flags)
> > +{
> > +	struct vxlan_dev *vxlan = netdev_priv(dev);
> > +	__be32 ip;
> > +	int err;
> > +
> > +	if (tb[NDA_DST] == NULL)
> > +		return -EINVAL;
> > +
> > +	if (nla_len(tb[NDA_DST]) != sizeof(__be32))
> > +		return -EAFNOSUPPORT;
> > +
> > +	ip = nla_get_be32(tb[NDA_DST]);
> > +
> > +	spin_lock_bh(&vxlan->hash_lock);
> > +	err = vxlan_fdb_create(vxlan, addr, ip, VXLAN_FDB_PERM);
> 
> Any reason to force permanent when created from userspace?

Should use neighbour flag (NUD_PERMANENT) instead.

> 
> > +static bool vxlan_group_used(struct vxlan_net *vn,
> > +			     const struct vxlan_dev *this)
> > +{
> > +	const struct vxlan_dev *vxlan;
> > +	struct hlist_node *node;
> > +	unsigned h;
> > +
> > +	for (h = 0; h < VNI_HASH_SIZE; ++h)
> > +		hlist_for_each_entry(vxlan, node, &vn->vni_list[h], hlist) {
> 
> is walking this chain only protected with rtnl?

Yes. that should be enough, only used when creating new vxlan
to avoid joining same group twice.

> 
> > +/* Propogate ECN from outer IP header to tunneled packet */
> > +static inline void vxlan_ecn_decap(const struct iphdr *iph, struct sk_buff *skb)
> > +{
> > +	if (INET_ECN_is_ce(iph->tos)) {
> > +		if (skb->protocol == htons(ETH_P_IP))
> > +			IP_ECN_set_ce(ip_hdr(skb));
> > +		else if (skb->protocol == htons(ETH_P_IPV6))
> > +			IP6_ECN_set_ce(ipv6_hdr(skb));
> > +	}
> > +}
> <snip>
> > +/* Propogate ECN bits out */
> > +static inline u8 vxlan_ecn_encap(u8 tos,
> > +				 const struct iphdr *iph,
> > +				 const struct sk_buff *skb)
> > +{
> > +	u8 inner = vxlan_get_dsfield(iph, skb);
> > +
> > +	return INET_ECN_encapsulate(tos, inner);
> > +}
> 
> Goal is to be RFC 6040 compliant, and it looks like some edge cases aren't
> met, for example, should drop on decap when inner is not supporting ECN
> and outer has set CE.

The code was taken from existing GRE in Linux.
Looks like both VXLAN and GRE need to handle that.


> 
> <snip>
> > +/* Callback from net/ipv4/udp.c to receive packets */
> > +	/* Mark socket as an encapsulation socket. */
> > +	udp_sk(sk)->encap_type = UDP_ENCAP_L2TPINUDP;
> 
> I don't think we need this particular encap_type value, just != 0

Is there any value in defining new value?


> 
> > +	udp_sk(sk)->encap_rcv = vxlan_udp_encap_recv;
> > +	udp_encap_enable();

^ permalink raw reply

* Re: [PATCH net-next 3/3] vxlan: virtual extensible lan
From: Chris Wright @ 2012-09-24 21:22 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Chris Wright, David Miller, netdev
In-Reply-To: <20120924141133.3c97e9de@nehalam.linuxnetplumber.net>

* Stephen Hemminger (shemminger@vyatta.com) wrote:
> Chris Wright <chrisw@redhat.com> wrote:
> > > +/* Propogate ECN bits out */
> > > +static inline u8 vxlan_ecn_encap(u8 tos,
> > > +				 const struct iphdr *iph,
> > > +				 const struct sk_buff *skb)
> > > +{
> > > +	u8 inner = vxlan_get_dsfield(iph, skb);
> > > +
> > > +	return INET_ECN_encapsulate(tos, inner);
> > > +}
> > 
> > Goal is to be RFC 6040 compliant, and it looks like some edge cases aren't
> > met, for example, should drop on decap when inner is not supporting ECN
> > and outer has set CE.
> 
> The code was taken from existing GRE in Linux.
> Looks like both VXLAN and GRE need to handle that.

Right.

> > > +/* Callback from net/ipv4/udp.c to receive packets */
> > > +	/* Mark socket as an encapsulation socket. */
> > > +	udp_sk(sk)->encap_type = UDP_ENCAP_L2TPINUDP;
> > 
> > I don't think we need this particular encap_type value, just != 0
> 
> Is there any value in defining new value?

No, more the propagation of L2TP constant starts to look like cargo cult
programming.  IOW, no special meaning in this context.

^ permalink raw reply

* Re: [PATCH v3] ucc_geth: Lockless xmit
From: Francois Romieu @ 2012-09-24 21:10 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: netdev
In-Reply-To: <OF39977762.2BDD4BB8-ONC1257A82.0032A26B-C1257A82.00350A53@transmode.se>

Joakim Tjernlund <joakim.tjernlund@transmode.se> :
[...]
> I don't get it. The skb test is there just for one special case, when
> the BD ring is empty the (bd_status & T_R) == 0 will be true as well so
> one need something more than the bd_status test.

Sure but the converse is not true : (bd_status & T_R) == 0 && skb does not
mean that the skb has been sent. It happens when said skb is about to be
given to the hardware by hard_start_xmit as well.

-- 
Ueimor

^ permalink raw reply

* [PATCH RFC 0/2] Multiple VLAN Registration Protocol (IEEE 802.1Q-2011)
From: David Ward @ 2012-09-24 21:07 UTC (permalink / raw)
  To: netdev; +Cc: Patrick McHardy, Eric Dumazet, Philip Foulkes, David Ward

The Linux kernel currently implements the GARP VLAN Registration 
Protocol (GVRP) from IEEE 802.1Q-1998 (applicant-only participant). 
When the GVRP flag is set for a VLAN interface on a Linux host, the 
host advertises its membership in the VLAN to the attached bridge/
switch, so that it is not necessary to manually configure the bridge/
switch port to participate in the VLAN.

GVRP has been superseded by the Multiple VLAN Registration Protocol 
(MVRP) in IEEE 802.1Q-2011, which addresses scalability concerns about 
the earlier protocol.  The following patches add support for MVRP to 
the Linux kernel and iproute2 utility. They are based largely off of 
the existing implementation of GVRP, but have been modified for the 
new PDU structure and state machine.

Please let me know if you have any comments. This implementation was 
tested with two Juniper EX4200 switches and found to work as expected. 
If you have access to other switches that implement MVRP, I would of 
course welcome additional testing.

Signed-off-by: David Ward <david.ward@ll.mit.edu>

David Ward (2):
  net/802: Implement Multiple Registration Protocol (MRP)
  net/8021q: Implement Multiple VLAN Registration Protocol (MVRP)

 include/linux/if_ether.h  |    1 +
 include/linux/if_vlan.h   |    1 +
 include/linux/netdevice.h |    2 +
 include/net/mrp.h         |  142 ++++++++
 net/802/Kconfig           |    3 +
 net/802/Makefile          |    1 +
 net/802/mrp.c             |  872 +++++++++++++++++++++++++++++++++++++++++++++
 net/8021q/Kconfig         |   11 +
 net/8021q/Makefile        |    1 +
 net/8021q/vlan.c          |   27 ++-
 net/8021q/vlan.h          |   16 +
 net/8021q/vlan_dev.c      |   12 +-
 net/8021q/vlan_mvrp.c     |   72 ++++
 net/8021q/vlan_netlink.c  |    2 +-
 14 files changed, 1156 insertions(+), 7 deletions(-)
 create mode 100644 include/net/mrp.h
 create mode 100644 net/802/mrp.c
 create mode 100644 net/8021q/vlan_mvrp.c

-- 
1.7.4.1

^ permalink raw reply

* [PATCH RFC iproute2] iplink_vlan: Add flag for Multiple VLAN Registration Protocol (MVRP)
From: David Ward @ 2012-09-24 21:07 UTC (permalink / raw)
  To: netdev; +Cc: Patrick McHardy, Eric Dumazet, Philip Foulkes, David Ward
In-Reply-To: <1348520855-8810-1-git-send-email-david.ward@ll.mit.edu>


Signed-off-by: David Ward <david.ward@ll.mit.edu>
---
 include/linux/if_vlan.h |    1 +
 ip/iplink_vlan.c        |   12 +++++++++++-
 2 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 5b808eb..38052c2 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -34,6 +34,7 @@ enum vlan_flags {
 	VLAN_FLAG_REORDER_HDR	= 0x1,
 	VLAN_FLAG_GVRP		= 0x2,
 	VLAN_FLAG_LOOSE_BINDING	= 0x4,
+	VLAN_FLAG_MVRP		= 0x8,
 };
 
 enum vlan_name_types {
diff --git a/ip/iplink_vlan.c b/ip/iplink_vlan.c
index 97af8d6..4d155b6 100644
--- a/ip/iplink_vlan.c
+++ b/ip/iplink_vlan.c
@@ -26,7 +26,7 @@ static void explain(void)
 		"\n"
 		"VLANID := 0-4095\n"
 		"FLAG-LIST := [ FLAG-LIST ] FLAG\n"
-		"FLAG := [ reorder_hdr { on | off } ] [ gvrp { on | off } ]\n"
+		"FLAG := [ reorder_hdr { on | off } ] [ gvrp { on | off } ] [ mvrp { on | off } ]\n"
 		"        [ loose_binding { on | off } ]\n"
 		"QOS-MAP := [ QOS-MAP ] QOS-MAPPING\n"
 		"QOS-MAPPING := FROM:TO\n"
@@ -103,6 +103,15 @@ static int vlan_parse_opt(struct link_util *lu, int argc, char **argv,
 				flags.flags &= ~VLAN_FLAG_GVRP;
 			else
 				return on_off("gvrp");
+		} else if (matches(*argv, "mvrp") == 0) {
+			NEXT_ARG();
+			flags.mask |= VLAN_FLAG_MVRP;
+			if (strcmp(*argv, "on") == 0)
+				flags.flags |= VLAN_FLAG_MVRP;
+			else if (strcmp(*argv, "off") == 0)
+				flags.flags &= ~VLAN_FLAG_MVRP;
+			else
+				return on_off("mvrp");
 		} else if (matches(*argv, "loose_binding") == 0) {
 			NEXT_ARG();
 			flags.mask |= VLAN_FLAG_LOOSE_BINDING;
@@ -166,6 +175,7 @@ static void vlan_print_flags(FILE *fp, __u32 flags)
 		}
 	_PF(REORDER_HDR);
 	_PF(GVRP);
+	_PF(MVRP);
 	_PF(LOOSE_BINDING);
 #undef _PF
 	if (flags)
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH RFC 2/2] net/8021q: Implement Multiple VLAN Registration Protocol (MVRP)
From: David Ward @ 2012-09-24 21:07 UTC (permalink / raw)
  To: netdev; +Cc: Patrick McHardy, Eric Dumazet, Philip Foulkes, David Ward
In-Reply-To: <1348520855-8810-1-git-send-email-david.ward@ll.mit.edu>

Initial implementation of the Multiple VLAN Registration Protocol
(MVRP) from IEEE 802.1Q-2011, based on the existing implementation
of the GARP VLAN Registration Protocol (GVRP).

Signed-off-by: David Ward <david.ward@ll.mit.edu>
---
 include/linux/if_ether.h |    1 +
 include/linux/if_vlan.h  |    1 +
 net/8021q/Kconfig        |   11 +++++++
 net/8021q/Makefile       |    1 +
 net/8021q/vlan.c         |   27 ++++++++++++++---
 net/8021q/vlan.h         |   16 ++++++++++
 net/8021q/vlan_dev.c     |   12 +++++++-
 net/8021q/vlan_mvrp.c    |   72 ++++++++++++++++++++++++++++++++++++++++++++++
 net/8021q/vlan_netlink.c |    2 +-
 9 files changed, 136 insertions(+), 7 deletions(-)
 create mode 100644 net/8021q/vlan_mvrp.c

diff --git a/include/linux/if_ether.h b/include/linux/if_ether.h
index 167ce5b..6476705 100644
--- a/include/linux/if_ether.h
+++ b/include/linux/if_ether.h
@@ -82,6 +82,7 @@
 #define ETH_P_802_EX1	0x88B5		/* 802.1 Local Experimental 1.  */
 #define ETH_P_TIPC	0x88CA		/* TIPC 			*/
 #define ETH_P_8021AH	0x88E7          /* 802.1ah Backbone Service Tag */
+#define ETH_P_MVRP	0x88F5		/* 802.1Q MVRP			*/
 #define ETH_P_1588	0x88F7		/* IEEE 1588 Timesync */
 #define ETH_P_FCOE	0x8906		/* Fibre Channel over Ethernet  */
 #define ETH_P_TDLS	0x890D          /* TDLS */
diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index e6ff12d..576444f 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -383,6 +383,7 @@ enum vlan_flags {
 	VLAN_FLAG_REORDER_HDR	= 0x1,
 	VLAN_FLAG_GVRP		= 0x2,
 	VLAN_FLAG_LOOSE_BINDING	= 0x4,
+	VLAN_FLAG_MVRP		= 0x8,
 };
 
 enum vlan_name_types {
diff --git a/net/8021q/Kconfig b/net/8021q/Kconfig
index fa073a5..8f7517d 100644
--- a/net/8021q/Kconfig
+++ b/net/8021q/Kconfig
@@ -27,3 +27,14 @@ config VLAN_8021Q_GVRP
 	  automatic propagation of registered VLANs to switches.
 
 	  If unsure, say N.
+
+config VLAN_8021Q_MVRP
+	bool "MVRP (Multiple VLAN Registration Protocol) support"
+	depends on VLAN_8021Q
+	select MRP
+	help
+	  Select this to enable MVRP end-system support. MVRP is used for
+	  automatic propagation of registered VLANs to switches; it
+	  supersedes GVRP and is not backwards-compatible.
+
+	  If unsure, say N.
diff --git a/net/8021q/Makefile b/net/8021q/Makefile
index 9f4f174..7bc8db0 100644
--- a/net/8021q/Makefile
+++ b/net/8021q/Makefile
@@ -6,5 +6,6 @@ obj-$(CONFIG_VLAN_8021Q)		+= 8021q.o
 
 8021q-y					:= vlan.o vlan_dev.o vlan_netlink.o
 8021q-$(CONFIG_VLAN_8021Q_GVRP)		+= vlan_gvrp.o
+8021q-$(CONFIG_VLAN_8021Q_MVRP)		+= vlan_mvrp.o
 8021q-$(CONFIG_PROC_FS)			+= vlanproc.o
 
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 9096bcb..290e377 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -95,6 +95,8 @@ void unregister_vlan_dev(struct net_device *dev, struct list_head *head)
 
 	grp->nr_vlan_devs--;
 
+	if (vlan->flags & VLAN_FLAG_MVRP)
+		vlan_mvrp_request_leave(dev);
 	if (vlan->flags & VLAN_FLAG_GVRP)
 		vlan_gvrp_request_leave(dev);
 
@@ -105,8 +107,10 @@ void unregister_vlan_dev(struct net_device *dev, struct list_head *head)
 	 */
 	unregister_netdevice_queue(dev, head);
 
-	if (grp->nr_vlan_devs == 0)
+	if (grp->nr_vlan_devs == 0) {
+		vlan_mvrp_uninit_applicant(real_dev);
 		vlan_gvrp_uninit_applicant(real_dev);
+	}
 
 	/* Get rid of the vlan's reference to real_dev */
 	dev_put(real_dev);
@@ -156,15 +160,18 @@ int register_vlan_dev(struct net_device *dev)
 		err = vlan_gvrp_init_applicant(real_dev);
 		if (err < 0)
 			goto out_vid_del;
+		err = vlan_mvrp_init_applicant(real_dev);
+		if (err < 0)
+			goto out_uninit_gvrp;
 	}
 
 	err = vlan_group_prealloc_vid(grp, vlan_id);
 	if (err < 0)
-		goto out_uninit_applicant;
+		goto out_uninit_mvrp;
 
 	err = register_netdevice(dev);
 	if (err < 0)
-		goto out_uninit_applicant;
+		goto out_uninit_mvrp;
 
 	/* Account for reference in struct vlan_dev_priv */
 	dev_hold(real_dev);
@@ -180,7 +187,10 @@ int register_vlan_dev(struct net_device *dev)
 
 	return 0;
 
-out_uninit_applicant:
+out_uninit_mvrp:
+	if (grp->nr_vlan_devs == 0)
+		vlan_mvrp_uninit_applicant(real_dev);
+out_uninit_gvrp:
 	if (grp->nr_vlan_devs == 0)
 		vlan_gvrp_uninit_applicant(real_dev);
 out_vid_del:
@@ -651,13 +661,19 @@ static int __init vlan_proto_init(void)
 	if (err < 0)
 		goto err3;
 
-	err = vlan_netlink_init();
+	err = vlan_mvrp_init();
 	if (err < 0)
 		goto err4;
 
+	err = vlan_netlink_init();
+	if (err < 0)
+		goto err5;
+
 	vlan_ioctl_set(vlan_ioctl_handler);
 	return 0;
 
+err5:
+	vlan_mvrp_uninit();
 err4:
 	vlan_gvrp_uninit();
 err3:
@@ -678,6 +694,7 @@ static void __exit vlan_cleanup_module(void)
 	unregister_pernet_subsys(&vlan_net_ops);
 	rcu_barrier(); /* Wait for completion of call_rcu()'s */
 
+	vlan_mvrp_uninit();
 	vlan_gvrp_uninit();
 }
 
diff --git a/net/8021q/vlan.h b/net/8021q/vlan.h
index a4886d9..670f1e8 100644
--- a/net/8021q/vlan.h
+++ b/net/8021q/vlan.h
@@ -171,6 +171,22 @@ static inline int vlan_gvrp_init(void) { return 0; }
 static inline void vlan_gvrp_uninit(void) {}
 #endif
 
+#ifdef CONFIG_VLAN_8021Q_MVRP
+extern int vlan_mvrp_request_join(const struct net_device *dev);
+extern void vlan_mvrp_request_leave(const struct net_device *dev);
+extern int vlan_mvrp_init_applicant(struct net_device *dev);
+extern void vlan_mvrp_uninit_applicant(struct net_device *dev);
+extern int vlan_mvrp_init(void);
+extern void vlan_mvrp_uninit(void);
+#else
+static inline int vlan_mvrp_request_join(const struct net_device *dev) { return 0; }
+static inline void vlan_mvrp_request_leave(const struct net_device *dev) {}
+static inline int vlan_mvrp_init_applicant(struct net_device *dev) { return 0; }
+static inline void vlan_mvrp_uninit_applicant(struct net_device *dev) {}
+static inline int vlan_mvrp_init(void) { return 0; }
+static inline void vlan_mvrp_uninit(void) {}
+#endif
+
 extern const char vlan_fullname[];
 extern const char vlan_version[];
 extern int vlan_netlink_init(void);
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 4024424..f547faa 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -261,7 +261,7 @@ int vlan_dev_change_flags(const struct net_device *dev, u32 flags, u32 mask)
 	u32 old_flags = vlan->flags;
 
 	if (mask & ~(VLAN_FLAG_REORDER_HDR | VLAN_FLAG_GVRP |
-		     VLAN_FLAG_LOOSE_BINDING))
+		     VLAN_FLAG_LOOSE_BINDING | VLAN_FLAG_MVRP))
 		return -EINVAL;
 
 	vlan->flags = (old_flags & ~mask) | (flags & mask);
@@ -272,6 +272,13 @@ int vlan_dev_change_flags(const struct net_device *dev, u32 flags, u32 mask)
 		else
 			vlan_gvrp_request_leave(dev);
 	}
+
+	if (netif_running(dev) && (vlan->flags ^ old_flags) & VLAN_FLAG_MVRP) {
+		if (vlan->flags & VLAN_FLAG_MVRP)
+			vlan_mvrp_request_join(dev);
+		else
+			vlan_mvrp_request_leave(dev);
+	}
 	return 0;
 }
 
@@ -312,6 +319,9 @@ static int vlan_dev_open(struct net_device *dev)
 	if (vlan->flags & VLAN_FLAG_GVRP)
 		vlan_gvrp_request_join(dev);
 
+	if (vlan->flags & VLAN_FLAG_MVRP)
+		vlan_mvrp_request_join(dev);
+
 	if (netif_carrier_ok(real_dev))
 		netif_carrier_on(dev);
 	return 0;
diff --git a/net/8021q/vlan_mvrp.c b/net/8021q/vlan_mvrp.c
new file mode 100644
index 0000000..d9ec1d5
--- /dev/null
+++ b/net/8021q/vlan_mvrp.c
@@ -0,0 +1,72 @@
+/*
+ *	IEEE 802.1Q Multiple VLAN Registration Protocol (MVRP)
+ *
+ *	Copyright (c) 2012 Massachusetts Institute of Technology
+ *
+ *	Adapted from code in net/8021q/vlan_gvrp.c
+ *	Copyright (c) 2008 Patrick McHardy <kaber@trash.net>
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	version 2 as published by the Free Software Foundation.
+ */
+#include <linux/types.h>
+#include <linux/if_ether.h>
+#include <linux/if_vlan.h>
+#include <net/mrp.h>
+#include "vlan.h"
+
+#define MRP_MVRP_ADDRESS	{ 0x01, 0x80, 0xc2, 0x00, 0x00, 0x21 }
+
+enum mvrp_attributes {
+	MVRP_ATTR_INVALID,
+	MVRP_ATTR_VID,
+	__MVRP_ATTR_MAX
+};
+#define MVRP_ATTR_MAX	(__MVRP_ATTR_MAX - 1)
+
+static struct mrp_application vlan_mrp_app __read_mostly = {
+	.type		= MRP_APPLICATION_MVRP,
+	.maxattr	= MVRP_ATTR_MAX,
+	.pkttype.type	= htons(ETH_P_MVRP),
+	.group_address	= MRP_MVRP_ADDRESS,
+	.version	= 0,
+};
+
+int vlan_mvrp_request_join(const struct net_device *dev)
+{
+	const struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
+	__be16 vlan_id = htons(vlan->vlan_id);
+
+	return mrp_request_join(vlan->real_dev, &vlan_mrp_app,
+				&vlan_id, sizeof(vlan_id), MVRP_ATTR_VID);
+}
+
+void vlan_mvrp_request_leave(const struct net_device *dev)
+{
+	const struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
+	__be16 vlan_id = htons(vlan->vlan_id);
+
+	mrp_request_leave(vlan->real_dev, &vlan_mrp_app,
+			  &vlan_id, sizeof(vlan_id), MVRP_ATTR_VID);
+}
+
+int vlan_mvrp_init_applicant(struct net_device *dev)
+{
+	return mrp_init_applicant(dev, &vlan_mrp_app);
+}
+
+void vlan_mvrp_uninit_applicant(struct net_device *dev)
+{
+	mrp_uninit_applicant(dev, &vlan_mrp_app);
+}
+
+int __init vlan_mvrp_init(void)
+{
+	return mrp_register_application(&vlan_mrp_app);
+}
+
+void vlan_mvrp_uninit(void)
+{
+	mrp_unregister_application(&vlan_mrp_app);
+}
diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index 708c80e..1789658 100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -62,7 +62,7 @@ static int vlan_validate(struct nlattr *tb[], struct nlattr *data[])
 		flags = nla_data(data[IFLA_VLAN_FLAGS]);
 		if ((flags->flags & flags->mask) &
 		    ~(VLAN_FLAG_REORDER_HDR | VLAN_FLAG_GVRP |
-		      VLAN_FLAG_LOOSE_BINDING))
+		      VLAN_FLAG_LOOSE_BINDING | VLAN_FLAG_MVRP))
 			return -EINVAL;
 	}
 
-- 
1.7.4.1

^ permalink raw reply related

* Re: [PATCH 3/3 RESEND] phy/micrel: Add missing header to micrel_phy.h
From: Marek Vasut @ 2012-09-24 21:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, david.choi, nobuhiro.iwamatsu.yj
In-Reply-To: <20120924.160059.1705079970561628883.davem@davemloft.net>

Dear David Miller,

> From: Marek Vasut <marex@denx.de>
> Date: Mon, 24 Sep 2012 04:58:51 +0200
> 
> > The license header was missing in micrel_phy.h . This patch adds
> > one.
> > 
> > Signed-off-by: Marek Vasut <marex@denx.de>
> 
> Applied.

Thank you and thanks for guiding me!

Best regards,
Marek Vasut

^ permalink raw reply

* Re: [PATCH RFC 1/2] net/802: Implement Multiple Registration Protocol (MRP)
From: Eric Dumazet @ 2012-09-24 21:36 UTC (permalink / raw)
  To: David Ward; +Cc: netdev, Patrick McHardy, Philip Foulkes
In-Reply-To: <1348520855-8810-2-git-send-email-david.ward@ll.mit.edu>

On Mon, 2012-09-24 at 17:07 -0400, David Ward wrote:

...

> +static int mrp_pdu_parse_end_mark(struct sk_buff *skb)
> +{
> +	__be16 *endmark;
> +
> +	if (!pskb_may_pull(skb, sizeof(*endmark)))
> +		return -1;
> +	endmark = (__be16 *)skb->data;
> +	if (get_unaligned(endmark) == MRP_END_MARK) {

You might try get_unaligned_be16(skb->data) ...

> +		skb_pull(skb, sizeof(*endmark));
> +		return -1;
> +	}
> +	return 0;
> +}
> +

...

> +
> +static int mrp_pdu_parse_msg(struct mrp_applicant *app, struct sk_buff *skb)
> +{
> +	if (!pskb_may_pull(skb, sizeof(*mrp_cb(skb)->mh)))
> +		return -1;
> +	mrp_cb(skb)->mh = (struct mrp_msg_hdr *)skb->data;

here you store a pointer to skb->data, in skb>cb[]

> +	if (mrp_cb(skb)->mh->attrtype == 0 ||
> +	    mrp_cb(skb)->mh->attrtype > app->app->maxattr ||
> +	    mrp_cb(skb)->mh->attrlen == 0)
> +		return -1;
> +	if (sizeof(struct mrp_skb_cb) + mrp_cb(skb)->mh->attrlen >
> +	    FIELD_SIZEOF(struct sk_buff, cb))
> +		return -1;
> +	skb_pull(skb, sizeof(*mrp_cb(skb)->mh));
> +
> +	while (skb->len > 0) {
> +		if (mrp_pdu_parse_end_mark(skb) < 0)
> +			break;

but skb->head can be reallocated by other pskb_may_pull() calls
done in mrp_pdu_parse_end_mark() (and elsewhere)

So skb->cb[] might contain a pointer to a freed memory.

> +		if (mrp_pdu_parse_vecattr(app, skb) < 0)
> +			return -1;
> +	}
> +	return 0;
> +}
> +

You might consider using skb_header_pointer() instead of
pskb_may_pull()/skb_pull(), it might be easier when you have such nested
headers...

^ permalink raw reply

* [PATCH RFC 1/2] net/802: Implement Multiple Registration Protocol (MRP)
From: David Ward @ 2012-09-24 21:07 UTC (permalink / raw)
  To: netdev; +Cc: Patrick McHardy, Eric Dumazet, Philip Foulkes, David Ward
In-Reply-To: <1348520855-8810-1-git-send-email-david.ward@ll.mit.edu>

Initial implementation of the Multiple Registration Protocol (MRP)
from IEEE 802.1Q-2011, based on the existing implementation of the
Generic Attribute Registration Protocol (GARP).

Signed-off-by: David Ward <david.ward@ll.mit.edu>
---
 include/linux/netdevice.h |    2 +
 include/net/mrp.h         |  142 ++++++++
 net/802/Kconfig           |    3 +
 net/802/Makefile          |    1 +
 net/802/mrp.c             |  872 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 1020 insertions(+), 0 deletions(-)
 create mode 100644 include/net/mrp.h
 create mode 100644 net/802/mrp.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6c131f0..704da15 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1289,6 +1289,8 @@ struct net_device {
 	};
 	/* GARP */
 	struct garp_port __rcu	*garp_port;
+	/* MRP */
+	struct mrp_port __rcu	*mrp_port;
 
 	/* class/net/name entry */
 	struct device		dev;
diff --git a/include/net/mrp.h b/include/net/mrp.h
new file mode 100644
index 0000000..69677d9
--- /dev/null
+++ b/include/net/mrp.h
@@ -0,0 +1,142 @@
+#ifndef _NET_MRP_H
+#define _NET_MRP_H
+
+#define MRP_END_MARK		0x0
+
+struct mrp_pdu_hdr {
+	u8	version;
+};
+
+struct mrp_msg_hdr {
+	u8	attrtype;
+	u8	attrlen;
+};
+
+struct mrp_vecattr_hdr {
+	__be16	lenflags;
+	unsigned char	firstattrvalue[];
+#define MRP_VECATTR_HDR_LEN_MASK cpu_to_be16(0x1FFF)
+#define MRP_VECATTR_HDR_FLAG_LA cpu_to_be16(0x2000)
+};
+
+enum mrp_vecattr_event {
+	MRP_NEW,
+	MRP_JOIN_IN,
+	MRP_IN,
+	MRP_JOIN_MT,
+	MRP_MT,
+	MRP_LV,
+};
+
+struct mrp_skb_cb {
+	struct mrp_msg_hdr	*mh;
+	struct mrp_vecattr_hdr	*vah;
+	unsigned char		attrvalue[];
+};
+
+static inline struct mrp_skb_cb *mrp_cb(struct sk_buff *skb)
+{
+	BUILD_BUG_ON(sizeof(struct mrp_skb_cb) >
+		     FIELD_SIZEOF(struct sk_buff, cb));
+	return (struct mrp_skb_cb *)skb->cb;
+}
+
+enum mrp_applicant_state {
+	MRP_APPLICANT_INVALID,
+	MRP_APPLICANT_VO,
+	MRP_APPLICANT_VP,
+	MRP_APPLICANT_VN,
+	MRP_APPLICANT_AN,
+	MRP_APPLICANT_AA,
+	MRP_APPLICANT_QA,
+	MRP_APPLICANT_LA,
+	MRP_APPLICANT_AO,
+	MRP_APPLICANT_QO,
+	MRP_APPLICANT_AP,
+	MRP_APPLICANT_QP,
+	__MRP_APPLICANT_MAX
+};
+#define MRP_APPLICANT_MAX	(__MRP_APPLICANT_MAX - 1)
+
+enum mrp_event {
+	MRP_EVENT_NEW,
+	MRP_EVENT_JOIN,
+	MRP_EVENT_LV,
+	MRP_EVENT_TX,
+	MRP_EVENT_R_NEW,
+	MRP_EVENT_R_JOIN_IN,
+	MRP_EVENT_R_IN,
+	MRP_EVENT_R_JOIN_MT,
+	MRP_EVENT_R_MT,
+	MRP_EVENT_R_LV,
+	MRP_EVENT_R_LA,
+	MRP_EVENT_REDECLARE,
+	MRP_EVENT_PERIODIC,
+	__MRP_EVENT_MAX
+};
+#define MRP_EVENT_MAX		(__MRP_EVENT_MAX - 1)
+
+enum mrp_tx_action {
+	MRP_TX_ACTION_NONE,
+	MRP_TX_ACTION_S_NEW,
+	MRP_TX_ACTION_S_JOIN_IN,
+	MRP_TX_ACTION_S_JOIN_IN_OPTIONAL,
+	MRP_TX_ACTION_S_IN_OPTIONAL,
+	MRP_TX_ACTION_S_LV,
+};
+
+struct mrp_attr {
+	struct rb_node			node;
+	enum mrp_applicant_state	state;
+	u8				type;
+	u8				len;
+	unsigned char			value[];
+};
+
+enum mrp_applications {
+	MRP_APPLICATION_MVRP,
+	__MRP_APPLICATION_MAX
+};
+#define MRP_APPLICATION_MAX	(__MRP_APPLICATION_MAX - 1)
+
+struct mrp_application {
+	enum mrp_applications	type;
+	unsigned int		maxattr;
+	struct packet_type	pkttype;
+	unsigned char		group_address[ETH_ALEN];
+	u8			version;
+};
+
+struct mrp_applicant {
+	struct mrp_application	*app;
+	struct net_device	*dev;
+	struct timer_list	join_timer;
+
+	spinlock_t		lock;
+	struct sk_buff_head	queue;
+	struct sk_buff		*pdu;
+	struct rb_root		mad;
+	struct rcu_head		rcu;
+};
+
+struct mrp_port {
+	struct mrp_applicant __rcu	*applicants[MRP_APPLICATION_MAX + 1];
+	struct rcu_head			rcu;
+};
+
+extern int	mrp_register_application(struct mrp_application *app);
+extern void	mrp_unregister_application(struct mrp_application *app);
+
+extern int	mrp_init_applicant(struct net_device *dev,
+				    struct mrp_application *app);
+extern void	mrp_uninit_applicant(struct net_device *dev,
+				      struct mrp_application *app);
+
+extern int	mrp_request_join(const struct net_device *dev,
+				  const struct mrp_application *app,
+				  const void *value, u8 len, u8 type);
+extern void	mrp_request_leave(const struct net_device *dev,
+				   const struct mrp_application *app,
+				   const void *value, u8 len, u8 type);
+
+#endif /* _NET_MRP_H */
diff --git a/net/802/Kconfig b/net/802/Kconfig
index be33d27..80d4bf7 100644
--- a/net/802/Kconfig
+++ b/net/802/Kconfig
@@ -5,3 +5,6 @@ config STP
 config GARP
 	tristate
 	select STP
+
+config MRP
+	tristate
diff --git a/net/802/Makefile b/net/802/Makefile
index a30d6e3..37e654d 100644
--- a/net/802/Makefile
+++ b/net/802/Makefile
@@ -11,3 +11,4 @@ obj-$(CONFIG_IPX)	+= p8022.o psnap.o p8023.o
 obj-$(CONFIG_ATALK)	+= p8022.o psnap.o
 obj-$(CONFIG_STP)	+= stp.o
 obj-$(CONFIG_GARP)	+= garp.o
+obj-$(CONFIG_MRP)	+= mrp.o
diff --git a/net/802/mrp.c b/net/802/mrp.c
new file mode 100644
index 0000000..dddc3e0
--- /dev/null
+++ b/net/802/mrp.c
@@ -0,0 +1,872 @@
+/*
+ *	IEEE 802.1Q Multiple Registration Protocol (MRP)
+ *
+ *	Copyright (c) 2012 Massachusetts Institute of Technology
+ *
+ *	Adapted from code in net/802/garp.c
+ *	Copyright (c) 2008 Patrick McHardy <kaber@trash.net>
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	version 2 as published by the Free Software Foundation.
+ */
+#include <linux/kernel.h>
+#include <linux/timer.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/rtnetlink.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <net/mrp.h>
+#include <asm/unaligned.h>
+
+static unsigned int mrp_join_time __read_mostly = 200;
+module_param(mrp_join_time, uint, 0644);
+MODULE_PARM_DESC(mrp_join_time, "Join time in ms (default 200ms)");
+MODULE_LICENSE("GPL");
+
+static const u8
+mrp_applicant_state_table[MRP_APPLICANT_MAX + 1][MRP_EVENT_MAX + 1] = {
+	[MRP_APPLICANT_VO] = {
+		[MRP_EVENT_NEW]		= MRP_APPLICANT_VN,
+		[MRP_EVENT_JOIN]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_LV]		= MRP_APPLICANT_VO,
+		[MRP_EVENT_TX]		= MRP_APPLICANT_VO,
+		[MRP_EVENT_R_NEW]	= MRP_APPLICANT_VO,
+		[MRP_EVENT_R_JOIN_IN]	= MRP_APPLICANT_AO,
+		[MRP_EVENT_R_IN]	= MRP_APPLICANT_VO,
+		[MRP_EVENT_R_JOIN_MT]	= MRP_APPLICANT_VO,
+		[MRP_EVENT_R_MT]	= MRP_APPLICANT_VO,
+		[MRP_EVENT_R_LV]	= MRP_APPLICANT_VO,
+		[MRP_EVENT_R_LA]	= MRP_APPLICANT_VO,
+		[MRP_EVENT_REDECLARE]	= MRP_APPLICANT_VO,
+		[MRP_EVENT_PERIODIC]	= MRP_APPLICANT_VO,
+	},
+	[MRP_APPLICANT_VP] = {
+		[MRP_EVENT_NEW]		= MRP_APPLICANT_VN,
+		[MRP_EVENT_JOIN]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_LV]		= MRP_APPLICANT_VO,
+		[MRP_EVENT_TX]		= MRP_APPLICANT_AA,
+		[MRP_EVENT_R_NEW]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_R_JOIN_IN]	= MRP_APPLICANT_AP,
+		[MRP_EVENT_R_IN]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_R_JOIN_MT]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_R_MT]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_R_LV]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_R_LA]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_REDECLARE]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_PERIODIC]	= MRP_APPLICANT_VP,
+	},
+	[MRP_APPLICANT_VN] = {
+		[MRP_EVENT_NEW]		= MRP_APPLICANT_VN,
+		[MRP_EVENT_JOIN]	= MRP_APPLICANT_VN,
+		[MRP_EVENT_LV]		= MRP_APPLICANT_LA,
+		[MRP_EVENT_TX]		= MRP_APPLICANT_AN,
+		[MRP_EVENT_R_NEW]	= MRP_APPLICANT_VN,
+		[MRP_EVENT_R_JOIN_IN]	= MRP_APPLICANT_VN,
+		[MRP_EVENT_R_IN]	= MRP_APPLICANT_VN,
+		[MRP_EVENT_R_JOIN_MT]	= MRP_APPLICANT_VN,
+		[MRP_EVENT_R_MT]	= MRP_APPLICANT_VN,
+		[MRP_EVENT_R_LV]	= MRP_APPLICANT_VN,
+		[MRP_EVENT_R_LA]	= MRP_APPLICANT_VN,
+		[MRP_EVENT_REDECLARE]	= MRP_APPLICANT_VN,
+		[MRP_EVENT_PERIODIC]	= MRP_APPLICANT_VN,
+	},
+	[MRP_APPLICANT_AN] = {
+		[MRP_EVENT_NEW]		= MRP_APPLICANT_AN,
+		[MRP_EVENT_JOIN]	= MRP_APPLICANT_AN,
+		[MRP_EVENT_LV]		= MRP_APPLICANT_LA,
+		[MRP_EVENT_TX]		= MRP_APPLICANT_QA,
+		[MRP_EVENT_R_NEW]	= MRP_APPLICANT_AN,
+		[MRP_EVENT_R_JOIN_IN]	= MRP_APPLICANT_AN,
+		[MRP_EVENT_R_IN]	= MRP_APPLICANT_AN,
+		[MRP_EVENT_R_JOIN_MT]	= MRP_APPLICANT_AN,
+		[MRP_EVENT_R_MT]	= MRP_APPLICANT_AN,
+		[MRP_EVENT_R_LV]	= MRP_APPLICANT_VN,
+		[MRP_EVENT_R_LA]	= MRP_APPLICANT_VN,
+		[MRP_EVENT_REDECLARE]	= MRP_APPLICANT_VN,
+		[MRP_EVENT_PERIODIC]	= MRP_APPLICANT_AN,
+	},
+	[MRP_APPLICANT_AA] = {
+		[MRP_EVENT_NEW]		= MRP_APPLICANT_VN,
+		[MRP_EVENT_JOIN]	= MRP_APPLICANT_AA,
+		[MRP_EVENT_LV]		= MRP_APPLICANT_LA,
+		[MRP_EVENT_TX]		= MRP_APPLICANT_QA,
+		[MRP_EVENT_R_NEW]	= MRP_APPLICANT_AA,
+		[MRP_EVENT_R_JOIN_IN]	= MRP_APPLICANT_QA,
+		[MRP_EVENT_R_IN]	= MRP_APPLICANT_AA,
+		[MRP_EVENT_R_JOIN_MT]	= MRP_APPLICANT_AA,
+		[MRP_EVENT_R_MT]	= MRP_APPLICANT_AA,
+		[MRP_EVENT_R_LV]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_R_LA]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_REDECLARE]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_PERIODIC]	= MRP_APPLICANT_AA,
+	},
+	[MRP_APPLICANT_QA] = {
+		[MRP_EVENT_NEW]		= MRP_APPLICANT_VN,
+		[MRP_EVENT_JOIN]	= MRP_APPLICANT_QA,
+		[MRP_EVENT_LV]		= MRP_APPLICANT_LA,
+		[MRP_EVENT_TX]		= MRP_APPLICANT_QA,
+		[MRP_EVENT_R_NEW]	= MRP_APPLICANT_QA,
+		[MRP_EVENT_R_JOIN_IN]	= MRP_APPLICANT_QA,
+		[MRP_EVENT_R_IN]	= MRP_APPLICANT_QA,
+		[MRP_EVENT_R_JOIN_MT]	= MRP_APPLICANT_AA,
+		[MRP_EVENT_R_MT]	= MRP_APPLICANT_AA,
+		[MRP_EVENT_R_LV]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_R_LA]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_REDECLARE]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_PERIODIC]	= MRP_APPLICANT_AA,
+	},
+	[MRP_APPLICANT_LA] = {
+		[MRP_EVENT_NEW]		= MRP_APPLICANT_VN,
+		[MRP_EVENT_JOIN]	= MRP_APPLICANT_AA,
+		[MRP_EVENT_LV]		= MRP_APPLICANT_LA,
+		[MRP_EVENT_TX]		= MRP_APPLICANT_VO,
+		[MRP_EVENT_R_NEW]	= MRP_APPLICANT_LA,
+		[MRP_EVENT_R_JOIN_IN]	= MRP_APPLICANT_LA,
+		[MRP_EVENT_R_IN]	= MRP_APPLICANT_LA,
+		[MRP_EVENT_R_JOIN_MT]	= MRP_APPLICANT_LA,
+		[MRP_EVENT_R_MT]	= MRP_APPLICANT_LA,
+		[MRP_EVENT_R_LV]	= MRP_APPLICANT_LA,
+		[MRP_EVENT_R_LA]	= MRP_APPLICANT_LA,
+		[MRP_EVENT_REDECLARE]	= MRP_APPLICANT_LA,
+		[MRP_EVENT_PERIODIC]	= MRP_APPLICANT_LA,
+	},
+	[MRP_APPLICANT_AO] = {
+		[MRP_EVENT_NEW]		= MRP_APPLICANT_VN,
+		[MRP_EVENT_JOIN]	= MRP_APPLICANT_AP,
+		[MRP_EVENT_LV]		= MRP_APPLICANT_AO,
+		[MRP_EVENT_TX]		= MRP_APPLICANT_AO,
+		[MRP_EVENT_R_NEW]	= MRP_APPLICANT_AO,
+		[MRP_EVENT_R_JOIN_IN]	= MRP_APPLICANT_QO,
+		[MRP_EVENT_R_IN]	= MRP_APPLICANT_AO,
+		[MRP_EVENT_R_JOIN_MT]	= MRP_APPLICANT_AO,
+		[MRP_EVENT_R_MT]	= MRP_APPLICANT_AO,
+		[MRP_EVENT_R_LV]	= MRP_APPLICANT_VO,
+		[MRP_EVENT_R_LA]	= MRP_APPLICANT_VO,
+		[MRP_EVENT_REDECLARE]	= MRP_APPLICANT_VO,
+		[MRP_EVENT_PERIODIC]	= MRP_APPLICANT_AO,
+	},
+	[MRP_APPLICANT_QO] = {
+		[MRP_EVENT_NEW]		= MRP_APPLICANT_VN,
+		[MRP_EVENT_JOIN]	= MRP_APPLICANT_QP,
+		[MRP_EVENT_LV]		= MRP_APPLICANT_QO,
+		[MRP_EVENT_TX]		= MRP_APPLICANT_QO,
+		[MRP_EVENT_R_NEW]	= MRP_APPLICANT_QO,
+		[MRP_EVENT_R_JOIN_IN]	= MRP_APPLICANT_QO,
+		[MRP_EVENT_R_IN]	= MRP_APPLICANT_QO,
+		[MRP_EVENT_R_JOIN_MT]	= MRP_APPLICANT_AO,
+		[MRP_EVENT_R_MT]	= MRP_APPLICANT_AO,
+		[MRP_EVENT_R_LV]	= MRP_APPLICANT_VO,
+		[MRP_EVENT_R_LA]	= MRP_APPLICANT_VO,
+		[MRP_EVENT_REDECLARE]	= MRP_APPLICANT_VO,
+		[MRP_EVENT_PERIODIC]	= MRP_APPLICANT_QO,
+	},
+	[MRP_APPLICANT_AP] = {
+		[MRP_EVENT_NEW]		= MRP_APPLICANT_VN,
+		[MRP_EVENT_JOIN]	= MRP_APPLICANT_AP,
+		[MRP_EVENT_LV]		= MRP_APPLICANT_AO,
+		[MRP_EVENT_TX]		= MRP_APPLICANT_QA,
+		[MRP_EVENT_R_NEW]	= MRP_APPLICANT_AP,
+		[MRP_EVENT_R_JOIN_IN]	= MRP_APPLICANT_QP,
+		[MRP_EVENT_R_IN]	= MRP_APPLICANT_AP,
+		[MRP_EVENT_R_JOIN_MT]	= MRP_APPLICANT_AP,
+		[MRP_EVENT_R_MT]	= MRP_APPLICANT_AP,
+		[MRP_EVENT_R_LV]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_R_LA]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_REDECLARE]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_PERIODIC]	= MRP_APPLICANT_AP,
+	},
+	[MRP_APPLICANT_QP] = {
+		[MRP_EVENT_NEW]		= MRP_APPLICANT_VN,
+		[MRP_EVENT_JOIN]	= MRP_APPLICANT_QP,
+		[MRP_EVENT_LV]		= MRP_APPLICANT_QO,
+		[MRP_EVENT_TX]		= MRP_APPLICANT_QP,
+		[MRP_EVENT_R_NEW]	= MRP_APPLICANT_QP,
+		[MRP_EVENT_R_JOIN_IN]	= MRP_APPLICANT_QP,
+		[MRP_EVENT_R_IN]	= MRP_APPLICANT_QP,
+		[MRP_EVENT_R_JOIN_MT]	= MRP_APPLICANT_AP,
+		[MRP_EVENT_R_MT]	= MRP_APPLICANT_AP,
+		[MRP_EVENT_R_LV]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_R_LA]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_REDECLARE]	= MRP_APPLICANT_VP,
+		[MRP_EVENT_PERIODIC]	= MRP_APPLICANT_AP,
+	},
+};
+
+static const u8
+mrp_tx_action_table[MRP_APPLICANT_MAX + 1] = {
+	[MRP_APPLICANT_VO] = MRP_TX_ACTION_S_IN_OPTIONAL,
+	[MRP_APPLICANT_VP] = MRP_TX_ACTION_S_JOIN_IN,
+	[MRP_APPLICANT_VN] = MRP_TX_ACTION_S_NEW,
+	[MRP_APPLICANT_AN] = MRP_TX_ACTION_S_NEW,
+	[MRP_APPLICANT_AA] = MRP_TX_ACTION_S_JOIN_IN,
+	[MRP_APPLICANT_QA] = MRP_TX_ACTION_S_JOIN_IN_OPTIONAL,
+	[MRP_APPLICANT_LA] = MRP_TX_ACTION_S_LV,
+	[MRP_APPLICANT_AO] = MRP_TX_ACTION_S_IN_OPTIONAL,
+	[MRP_APPLICANT_QO] = MRP_TX_ACTION_S_IN_OPTIONAL,
+	[MRP_APPLICANT_AP] = MRP_TX_ACTION_S_JOIN_IN,
+	[MRP_APPLICANT_QP] = MRP_TX_ACTION_S_IN_OPTIONAL,
+};
+
+static void mrp_attrvalue_inc(void *value, u8 len)
+{
+	u8 *v = (u8 *)value;
+
+	/* Add 1 to the last byte. If it becomes zero,
+	 * go to the previous byte and repeat. */
+	while (len > 0 && !++v[--len])
+		;
+}
+
+static int mrp_attr_cmp(const struct mrp_attr *attr,
+			 const void *value, u8 len, u8 type)
+{
+	if (attr->type != type)
+		return attr->type - type;
+	if (attr->len != len)
+		return attr->len - len;
+	return memcmp(attr->value, value, len);
+}
+
+static struct mrp_attr *mrp_attr_lookup(const struct mrp_applicant *app,
+					const void *value, u8 len, u8 type)
+{
+	struct rb_node *parent = app->mad.rb_node;
+	struct mrp_attr *attr;
+	int d;
+
+	while (parent) {
+		attr = rb_entry(parent, struct mrp_attr, node);
+		d = mrp_attr_cmp(attr, value, len, type);
+		if (d > 0)
+			parent = parent->rb_left;
+		else if (d < 0)
+			parent = parent->rb_right;
+		else
+			return attr;
+	}
+	return NULL;
+}
+
+static struct mrp_attr *mrp_attr_create(struct mrp_applicant *app,
+					const void *value, u8 len, u8 type)
+{
+	struct rb_node *parent = NULL, **p = &app->mad.rb_node;
+	struct mrp_attr *attr;
+	int d;
+
+	while (*p) {
+		parent = *p;
+		attr = rb_entry(parent, struct mrp_attr, node);
+		d = mrp_attr_cmp(attr, value, len, type);
+		if (d > 0)
+			p = &parent->rb_left;
+		else if (d < 0)
+			p = &parent->rb_right;
+		else {
+			/* The attribute already exists; re-use it. */
+			return attr;
+		}
+	}
+	attr = kmalloc(sizeof(*attr) + len, GFP_ATOMIC);
+	if (!attr)
+		return attr;
+	attr->state = MRP_APPLICANT_VO;
+	attr->type  = type;
+	attr->len   = len;
+	memcpy(attr->value, value, len);
+
+	rb_link_node(&attr->node, parent, p);
+	rb_insert_color(&attr->node, &app->mad);
+	return attr;
+}
+
+static void mrp_attr_destroy(struct mrp_applicant *app, struct mrp_attr *attr)
+{
+	rb_erase(&attr->node, &app->mad);
+	kfree(attr);
+}
+
+static int mrp_pdu_init(struct mrp_applicant *app)
+{
+	struct sk_buff *skb;
+	struct mrp_pdu_hdr *ph;
+
+	skb = alloc_skb(app->dev->mtu + LL_RESERVED_SPACE(app->dev),
+			GFP_ATOMIC);
+	if (!skb)
+		return -ENOMEM;
+
+	skb->dev = app->dev;
+	skb->protocol = app->app->pkttype.type;
+	skb_reserve(skb, LL_RESERVED_SPACE(app->dev));
+	skb_reset_network_header(skb);
+	skb_reset_transport_header(skb);
+
+	ph = (struct mrp_pdu_hdr *)__skb_put(skb, sizeof(*ph));
+	ph->version = app->app->version;
+
+	app->pdu = skb;
+	return 0;
+}
+
+static int mrp_pdu_append_end_mark(struct mrp_applicant *app)
+{
+	__be16 *endmark;
+
+	if (skb_tailroom(app->pdu) < sizeof(*endmark))
+		return -1;
+	endmark = (__be16 *)__skb_put(app->pdu, sizeof(*endmark));
+	put_unaligned(MRP_END_MARK, endmark);
+	return 0;
+}
+
+static void mrp_pdu_queue(struct mrp_applicant *app)
+{
+	if (!app->pdu)
+		return;
+
+	if (mrp_cb(app->pdu)->mh)
+		mrp_pdu_append_end_mark(app);
+	mrp_pdu_append_end_mark(app);
+
+	dev_hard_header(app->pdu, app->dev, ntohs(app->app->pkttype.type),
+			app->app->group_address, app->dev->dev_addr,
+			app->pdu->len);
+
+	skb_queue_tail(&app->queue, app->pdu);
+	app->pdu = NULL;
+}
+
+static void mrp_queue_xmit(struct mrp_applicant *app)
+{
+	struct sk_buff *skb;
+
+	while ((skb = skb_dequeue(&app->queue)))
+		dev_queue_xmit(skb);
+}
+
+static int mrp_pdu_append_msg_hdr(struct mrp_applicant *app,
+				  u8 attrtype, u8 attrlen)
+{
+	struct mrp_msg_hdr *mh;
+
+	if (mrp_cb(app->pdu)->mh) {
+		if (mrp_pdu_append_end_mark(app) < 0)
+			return -1;
+		mrp_cb(app->pdu)->mh = NULL;
+		mrp_cb(app->pdu)->vah = NULL;
+	}
+
+	if (skb_tailroom(app->pdu) < sizeof(*mh))
+		return -1;
+	mh = (struct mrp_msg_hdr *)__skb_put(app->pdu, sizeof(*mh));
+	mh->attrtype = attrtype;
+	mh->attrlen = attrlen;
+	mrp_cb(app->pdu)->mh = mh;
+	return 0;
+}
+
+static int mrp_pdu_append_vecattr_hdr(struct mrp_applicant *app,
+				      const void *firstattrvalue, u8 attrlen)
+{
+	struct mrp_vecattr_hdr *vah;
+
+	if (skb_tailroom(app->pdu) < sizeof(*vah) + attrlen)
+		return -1;
+	vah = (struct mrp_vecattr_hdr *)__skb_put(app->pdu,
+						  sizeof(*vah) + attrlen);
+	put_unaligned(0, &vah->lenflags);
+	memcpy(vah->firstattrvalue, firstattrvalue, attrlen);
+	mrp_cb(app->pdu)->vah = vah;
+	memcpy(mrp_cb(app->pdu)->attrvalue, firstattrvalue, attrlen);
+	return 0;
+}
+
+static int mrp_pdu_append_event(struct mrp_applicant *app,
+				const struct mrp_attr *attr,
+				enum mrp_vecattr_event event)
+{
+	u16 len, pos;
+	u8 *events;
+	int err;
+again:
+	if (!app->pdu) {
+		err = mrp_pdu_init(app);
+		if (err < 0)
+			return err;
+	}
+
+	/* If there is no Message header in the PDU, or the Message header is
+	 * for a different attribute type, add an EndMark (if necessary) and a
+	 * new Message header to the PDU. */
+	if (!mrp_cb(app->pdu)->mh ||
+	    mrp_cb(app->pdu)->mh->attrtype != attr->type ||
+	    mrp_cb(app->pdu)->mh->attrlen != attr->len) {
+		if (mrp_pdu_append_msg_hdr(app, attr->type, attr->len) < 0)
+			goto queue;
+	}
+
+	/* If there is no VectorAttribute header for this Message in the PDU,
+	 * or this attribute's value does not sequentially follow the previous
+	 * attribute's value, add a new VectorAttribute header to the PDU. */
+	if (!mrp_cb(app->pdu)->vah ||
+	    memcmp(mrp_cb(app->pdu)->attrvalue, attr->value, attr->len)) {
+		if (mrp_pdu_append_vecattr_hdr(app, attr->value, attr->len) < 0)
+			goto queue;
+	}
+
+	len = be16_to_cpu(get_unaligned(&mrp_cb(app->pdu)->vah->lenflags));
+	pos = len % 3;
+
+	/* Events are packed into Vectors in the PDU, three to a byte. Add a
+	 * byte to the end of the Vector if necessary. */
+	if (!pos) {
+		if (skb_tailroom(app->pdu) < sizeof(u8))
+			goto queue;
+		events = (u8 *)__skb_put(app->pdu, sizeof(u8));
+	} else {
+		events = (u8 *)(skb_tail_pointer(app->pdu) - sizeof(u8));
+	}
+
+	switch (pos) {
+	case 0:
+		*events = event * 36;
+		break;
+	case 1:
+		*events += event * 6;
+		break;
+	case 2:
+		*events += event;
+		break;
+	default:
+		WARN_ON(1);
+	}
+
+	/* Increment the length of the VectorAttribute in the PDU, as well as
+	 * the value of the next attribute that would continue its Vector. */
+	put_unaligned(cpu_to_be16(++len), &mrp_cb(app->pdu)->vah->lenflags);
+	mrp_attrvalue_inc(mrp_cb(app->pdu)->attrvalue, attr->len);
+
+	return 0;
+
+queue:
+	mrp_pdu_queue(app);
+	goto again;
+}
+
+static void mrp_attr_event(struct mrp_applicant *app,
+			   struct mrp_attr *attr, enum mrp_event event)
+{
+	enum mrp_applicant_state state;
+
+	state = mrp_applicant_state_table[attr->state][event];
+	if (state == MRP_APPLICANT_INVALID) {
+		WARN_ON(1);
+		return;
+	}
+
+	if (event == MRP_EVENT_TX) {
+		/* When appending the attribute fails, don't update its state
+		 * in order to retry at the next TX event. */
+
+		switch (mrp_tx_action_table[attr->state]) {
+		case MRP_TX_ACTION_NONE:
+		case MRP_TX_ACTION_S_JOIN_IN_OPTIONAL:
+		case MRP_TX_ACTION_S_IN_OPTIONAL:
+			break;
+		case MRP_TX_ACTION_S_NEW:
+			if (mrp_pdu_append_event(app, attr, MRP_NEW) < 0)
+				return;
+			break;
+		case MRP_TX_ACTION_S_JOIN_IN:
+			if (mrp_pdu_append_event(app, attr, MRP_JOIN_IN) < 0)
+				return;
+			break;
+		case MRP_TX_ACTION_S_LV:
+			if (mrp_pdu_append_event(app, attr, MRP_LV) < 0)
+				return;
+			/* As a pure applicant, sending a leave message
+			 * implies that the attribute was unregistered and
+			 * can be destroyed. */
+			mrp_attr_destroy(app, attr);
+			return;
+		default:
+			WARN_ON(1);
+		}
+	}
+
+	attr->state = state;
+}
+
+int mrp_request_join(const struct net_device *dev,
+		     const struct mrp_application *appl,
+		     const void *value, u8 len, u8 type)
+{
+	struct mrp_port *port = rtnl_dereference(dev->mrp_port);
+	struct mrp_applicant *app = rtnl_dereference(
+		port->applicants[appl->type]);
+	struct mrp_attr *attr;
+
+	if (sizeof(struct mrp_skb_cb) + len >
+	    FIELD_SIZEOF(struct sk_buff, cb))
+		return -ENOMEM;
+
+	spin_lock_bh(&app->lock);
+	attr = mrp_attr_create(app, value, len, type);
+	if (!attr) {
+		spin_unlock_bh(&app->lock);
+		return -ENOMEM;
+	}
+	mrp_attr_event(app, attr, MRP_EVENT_JOIN);
+	spin_unlock_bh(&app->lock);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mrp_request_join);
+
+void mrp_request_leave(const struct net_device *dev,
+		       const struct mrp_application *appl,
+		       const void *value, u8 len, u8 type)
+{
+	struct mrp_port *port = rtnl_dereference(dev->mrp_port);
+	struct mrp_applicant *app = rtnl_dereference(
+		port->applicants[appl->type]);
+	struct mrp_attr *attr;
+
+	if (sizeof(struct mrp_skb_cb) + len >
+	    FIELD_SIZEOF(struct sk_buff, cb))
+		return;
+
+	spin_lock_bh(&app->lock);
+	attr = mrp_attr_lookup(app, value, len, type);
+	if (!attr) {
+		spin_unlock_bh(&app->lock);
+		return;
+	}
+	mrp_attr_event(app, attr, MRP_EVENT_LV);
+	spin_unlock_bh(&app->lock);
+}
+EXPORT_SYMBOL_GPL(mrp_request_leave);
+
+static void mrp_mad_event(struct mrp_applicant *app, enum mrp_event event)
+{
+	struct rb_node *node, *next;
+	struct mrp_attr *attr;
+
+	for (node = rb_first(&app->mad);
+	     next = node ? rb_next(node) : NULL, node != NULL;
+	     node = next) {
+		attr = rb_entry(node, struct mrp_attr, node);
+		mrp_attr_event(app, attr, event);
+	}
+}
+
+static void mrp_join_timer_arm(struct mrp_applicant *app)
+{
+	unsigned long delay;
+
+	delay = (u64)msecs_to_jiffies(mrp_join_time) * net_random() >> 32;
+	mod_timer(&app->join_timer, jiffies + delay);
+}
+
+static void mrp_join_timer(unsigned long data)
+{
+	struct mrp_applicant *app = (struct mrp_applicant *)data;
+
+	spin_lock(&app->lock);
+	mrp_mad_event(app, MRP_EVENT_TX);
+	mrp_pdu_queue(app);
+	spin_unlock(&app->lock);
+
+	mrp_queue_xmit(app);
+	mrp_join_timer_arm(app);
+}
+
+static int mrp_pdu_parse_end_mark(struct sk_buff *skb)
+{
+	__be16 *endmark;
+
+	if (!pskb_may_pull(skb, sizeof(*endmark)))
+		return -1;
+	endmark = (__be16 *)skb->data;
+	if (get_unaligned(endmark) == MRP_END_MARK) {
+		skb_pull(skb, sizeof(*endmark));
+		return -1;
+	}
+	return 0;
+}
+
+static void mrp_pdu_parse_vecattr_event(struct mrp_applicant *app,
+					struct sk_buff *skb,
+					enum mrp_vecattr_event vaevent)
+{
+	struct mrp_attr *attr;
+	enum mrp_event event;
+
+	attr = mrp_attr_lookup(app, mrp_cb(skb)->attrvalue,
+			       mrp_cb(skb)->mh->attrlen,
+			       mrp_cb(skb)->mh->attrtype);
+	if (attr == NULL)
+		return;
+
+	switch (vaevent) {
+	case MRP_NEW:
+		event = MRP_EVENT_R_NEW;
+		break;
+	case MRP_JOIN_IN:
+		event = MRP_EVENT_R_JOIN_IN;
+		break;
+	case MRP_IN:
+		event = MRP_EVENT_R_IN;
+		break;
+	case MRP_JOIN_MT:
+		event = MRP_EVENT_R_JOIN_MT;
+		break;
+	case MRP_MT:
+		event = MRP_EVENT_R_MT;
+		break;
+	case MRP_LV:
+		event = MRP_EVENT_R_LV;
+		break;
+	default:
+		return;
+	}
+
+	mrp_attr_event(app, attr, event);
+}
+
+static int mrp_pdu_parse_vecattr(struct mrp_applicant *app,
+				 struct sk_buff *skb)
+{
+	u16 valen;
+	u8 vaevents, vaevent;
+
+	if (!pskb_may_pull(skb, sizeof(*mrp_cb(skb)->vah) +
+				mrp_cb(skb)->mh->attrlen))
+		return -1;
+	mrp_cb(skb)->vah = (struct mrp_vecattr_hdr *)skb->data;
+	if (get_unaligned(&mrp_cb(skb)->vah->lenflags) &
+	    MRP_VECATTR_HDR_FLAG_LA)
+		mrp_mad_event(app, MRP_EVENT_R_LA);
+	valen = be16_to_cpu(get_unaligned(&mrp_cb(skb)->vah->lenflags) &
+			    MRP_VECATTR_HDR_LEN_MASK);
+
+	/* The VectorAttribute structure in a PDU carries event information
+	 * about one or more attributes having consecutive values. Only the
+	 * value for the first attribute is contained in the structure. So
+	 * we make a copy of that value, and then increment it each time we
+	 * advance to the next event in its Vector. */
+	memcpy(mrp_cb(skb)->attrvalue,
+	       mrp_cb(skb)->vah->firstattrvalue,
+	       mrp_cb(skb)->mh->attrlen);
+	skb_pull(skb, sizeof(*mrp_cb(skb)->vah) +
+		      mrp_cb(skb)->mh->attrlen);
+
+	/* In a VectorAttribute, the Vector contains events which are packed
+	 * three to a byte. We process one byte of the Vector at a time. */
+	while (valen > 0) {
+		if (!pskb_may_pull(skb, sizeof(u8)))
+			return -1;
+		vaevents = *(u8 *)skb->data;
+		skb_pull(skb, sizeof(u8));
+
+		/* Extract and process the first event. */
+		vaevent = vaevents / 36;
+		vaevents %= 36;
+		if (vaevent >= 6) {
+			/* The byte is malformed; stop processing. */
+			return -1;
+		}
+		mrp_pdu_parse_vecattr_event(app, skb, vaevent);
+
+		/* If present, extract and process the second event. */
+		if (!--valen)
+			break;
+		mrp_attrvalue_inc(mrp_cb(skb)->attrvalue,
+				  mrp_cb(skb)->mh->attrlen);
+		vaevent = vaevents / 6;
+		vaevents %= 6;
+		mrp_pdu_parse_vecattr_event(app, skb, vaevent);
+
+		/* If present, extract and process the third event. */
+		if (!--valen)
+			break;
+		mrp_attrvalue_inc(mrp_cb(skb)->attrvalue,
+				  mrp_cb(skb)->mh->attrlen);
+		vaevent = vaevents;
+		mrp_pdu_parse_vecattr_event(app, skb, vaevent);
+	}
+	return 0;
+}
+
+static int mrp_pdu_parse_msg(struct mrp_applicant *app, struct sk_buff *skb)
+{
+	if (!pskb_may_pull(skb, sizeof(*mrp_cb(skb)->mh)))
+		return -1;
+	mrp_cb(skb)->mh = (struct mrp_msg_hdr *)skb->data;
+	if (mrp_cb(skb)->mh->attrtype == 0 ||
+	    mrp_cb(skb)->mh->attrtype > app->app->maxattr ||
+	    mrp_cb(skb)->mh->attrlen == 0)
+		return -1;
+	if (sizeof(struct mrp_skb_cb) + mrp_cb(skb)->mh->attrlen >
+	    FIELD_SIZEOF(struct sk_buff, cb))
+		return -1;
+	skb_pull(skb, sizeof(*mrp_cb(skb)->mh));
+
+	while (skb->len > 0) {
+		if (mrp_pdu_parse_end_mark(skb) < 0)
+			break;
+		if (mrp_pdu_parse_vecattr(app, skb) < 0)
+			return -1;
+	}
+	return 0;
+}
+
+int mrp_rcv(struct sk_buff *skb, struct net_device *dev,
+	    struct packet_type *pt, struct net_device *orig_dev)
+{
+	struct mrp_application *appl = container_of(pt, struct mrp_application,
+						    pkttype);
+	struct mrp_port *port;
+	struct mrp_applicant *app;
+	const struct mrp_pdu_hdr *ph;
+
+	/*
+	 * If the interface is in promiscuous mode, drop the packet if
+	 * it was unicast to another host.
+	 */
+	if (unlikely(skb->pkt_type == PACKET_OTHERHOST))
+		goto out;
+	skb = skb_share_check(skb, GFP_ATOMIC);
+	if (unlikely(!skb))
+		goto out;
+	port = rcu_dereference(dev->mrp_port);
+	if (unlikely(!port))
+		goto out;
+	app = rcu_dereference(port->applicants[appl->type]);
+	if (unlikely(!app))
+		goto out;
+
+	if (!pskb_may_pull(skb, sizeof(*ph)))
+		goto out;
+	ph = (struct mrp_pdu_hdr *)skb->data;
+	if (ph->version != app->app->version)
+		goto out;
+	skb_pull(skb, sizeof(*ph));
+
+	spin_lock(&app->lock);
+	while (skb->len > 0) {
+		if (mrp_pdu_parse_end_mark(skb) < 0)
+			break;
+		if (mrp_pdu_parse_msg(app, skb) < 0)
+			break;
+	}
+	spin_unlock(&app->lock);
+out:
+	kfree_skb(skb);
+	return 0;
+}
+
+static int mrp_init_port(struct net_device *dev)
+{
+	struct mrp_port *port;
+
+	port = kzalloc(sizeof(*port), GFP_KERNEL);
+	if (!port)
+		return -ENOMEM;
+	rcu_assign_pointer(dev->mrp_port, port);
+	return 0;
+}
+
+static void mrp_release_port(struct net_device *dev)
+{
+	struct mrp_port *port = rtnl_dereference(dev->mrp_port);
+	unsigned int i;
+
+	for (i = 0; i <= MRP_APPLICATION_MAX; i++) {
+		if (rtnl_dereference(port->applicants[i]))
+			return;
+	}
+	RCU_INIT_POINTER(dev->mrp_port, NULL);
+	kfree_rcu(port, rcu);
+}
+
+int mrp_init_applicant(struct net_device *dev, struct mrp_application *appl)
+{
+	struct mrp_applicant *app;
+	int err;
+
+	ASSERT_RTNL();
+
+	if (!rtnl_dereference(dev->mrp_port)) {
+		err = mrp_init_port(dev);
+		if (err < 0)
+			goto err1;
+	}
+
+	err = -ENOMEM;
+	app = kzalloc(sizeof(*app), GFP_KERNEL);
+	if (!app)
+		goto err2;
+
+	err = dev_mc_add(dev, appl->group_address);
+	if (err < 0)
+		goto err3;
+
+	app->dev = dev;
+	app->app = appl;
+	app->mad = RB_ROOT;
+	spin_lock_init(&app->lock);
+	skb_queue_head_init(&app->queue);
+	rcu_assign_pointer(dev->mrp_port->applicants[appl->type], app);
+	setup_timer(&app->join_timer, mrp_join_timer, (unsigned long)app);
+	mrp_join_timer_arm(app);
+	return 0;
+
+err3:
+	kfree(app);
+err2:
+	mrp_release_port(dev);
+err1:
+	return err;
+}
+EXPORT_SYMBOL_GPL(mrp_init_applicant);
+
+void mrp_uninit_applicant(struct net_device *dev, struct mrp_application *appl)
+{
+	struct mrp_port *port = rtnl_dereference(dev->mrp_port);
+	struct mrp_applicant *app = rtnl_dereference(
+		port->applicants[appl->type]);
+
+	ASSERT_RTNL();
+
+	RCU_INIT_POINTER(port->applicants[appl->type], NULL);
+
+	/* Delete timer and generate a final TX event to flush out
+	 * all pending messages before the applicant is gone. */
+	del_timer_sync(&app->join_timer);
+	mrp_mad_event(app, MRP_EVENT_TX);
+	mrp_pdu_queue(app);
+	mrp_queue_xmit(app);
+
+	dev_mc_del(dev, appl->group_address);
+	kfree_rcu(app, rcu);
+	mrp_release_port(dev);
+}
+EXPORT_SYMBOL_GPL(mrp_uninit_applicant);
+
+int mrp_register_application(struct mrp_application *appl)
+{
+	appl->pkttype.func = mrp_rcv;
+	dev_add_pack(&appl->pkttype);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mrp_register_application);
+
+void mrp_unregister_application(struct mrp_application *appl)
+{
+	dev_remove_pack(&appl->pkttype);
+}
+EXPORT_SYMBOL_GPL(mrp_unregister_application);
-- 
1.7.4.1

^ permalink raw reply related

* [RFC] gre: conform to RFC6040 ECN progogation
From: Stephen Hemminger @ 2012-09-24 21:44 UTC (permalink / raw)
  To: Chris Wright; +Cc: David Miller, netdev
In-Reply-To: <20120924212226.GJ26494@x200.localdomain>

Linux GRE was likely written before this RFC and therefore does not
conform to one of the rules in Section 4.2.  Default Tunnel Egress Behaviour.

The new code addresses:
 o If the inner ECN field is Not-ECT, the decapsulator MUST NOT
      propagate any other ECN codepoint onwards.  This is because the
      inner Not-ECT marking is set by transports that rely on dropped
      packets as an indication of congestion and would not understand or
      respond to any other ECN codepoint [RFC4774].  Specifically:

      *  If the inner ECN field is Not-ECT and the outer ECN field is
         CE, the decapsulator MUST drop the packet.

      *  If the inner ECN field is Not-ECT and the outer ECN field is
         Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the
         outgoing packet with the ECN field cleared to Not-ECT.

This was caught by Chris Wright while reviewing VXLAN.
This code has not been tested with real ECN through tunnel.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/net/ipv4/ip_gre.c	2012-09-21 08:45:55.948772761 -0700
+++ b/net/ipv4/ip_gre.c	2012-09-24 14:35:54.666185603 -0700
@@ -567,15 +567,16 @@ out:
 	rcu_read_unlock();
 }
 
-static inline void ipgre_ecn_decapsulate(const struct iphdr *iph, struct sk_buff *skb)
+static int ipgre_ecn_decapsulate(const struct iphdr *iph, struct sk_buff *skb)
 {
 	if (INET_ECN_is_ce(iph->tos)) {
 		if (skb->protocol == htons(ETH_P_IP)) {
-			IP_ECN_set_ce(ip_hdr(skb));
+			return IP_ECN_set_ce(ip_hdr(skb));
 		} else if (skb->protocol == htons(ETH_P_IPV6)) {
-			IP6_ECN_set_ce(ipv6_hdr(skb));
+			return IP6_ECN_set_ce(ipv6_hdr(skb));
 		}
 	}
+	return 1;
 }
 
 static inline u8
@@ -703,17 +704,18 @@ static int ipgre_rcv(struct sk_buff *skb
 			skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
 		}
 
+		__skb_tunnel_rx(skb, tunnel->dev);
+
+		skb_reset_network_header(skb);
+		if (!ipgre_ecn_decapsulate(iph, skb))
+			goto drop;
+
 		tstats = this_cpu_ptr(tunnel->dev->tstats);
 		u64_stats_update_begin(&tstats->syncp);
 		tstats->rx_packets++;
 		tstats->rx_bytes += skb->len;
 		u64_stats_update_end(&tstats->syncp);
 
-		__skb_tunnel_rx(skb, tunnel->dev);
-
-		skb_reset_network_header(skb);
-		ipgre_ecn_decapsulate(iph, skb);
-
 		netif_rx(skb);
 
 		rcu_read_unlock();

^ permalink raw reply

* [PATCHv4 net-next] vxlan: virtual extensible lan
From: Stephen Hemminger @ 2012-09-24 21:50 UTC (permalink / raw)
  To: Chris Wright; +Cc: David Miller, netdev
In-Reply-To: <20120924205822.GI26494@x200.localdomain>

This is an implementation of Virtual eXtensible Local Area Network
as described in draft RFC:
  http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02

The driver integrates a Virtual Tunnel Endpoint (VTEP) functionality
that learns MAC to IP address mapping. 

This implementation has not been tested for Interoperation with
other equipment. Still needs more testing of static entry manipulation.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
v4 - fix ecn and set state of fdb entries
v3 - fix ordering of change versus migration message
v2 - fix use of ip header after pskb_may_pull

 Documentation/networking/vxlan.txt |   36 +
 drivers/net/Kconfig                |   13 
 drivers/net/Makefile               |    1 
 drivers/net/vxlan.c                | 1180 +++++++++++++++++++++++++++++++++++++
 include/linux/if_link.h            |   14 
 5 files changed, 1244 insertions(+)

--- a/drivers/net/Kconfig	2012-09-24 10:56:57.080291529 -0700
+++ b/drivers/net/Kconfig	2012-09-24 12:43:25.298239026 -0700
@@ -149,6 +149,19 @@ config MACVTAP
 	  To compile this driver as a module, choose M here: the module
 	  will be called macvtap.
 
+config VXLAN
+       tristate "Virtual eXtensible Local Area Network (VXLAN)"
+       depends on EXPERIMENTAL
+       ---help---
+	  This allows one to create vxlan virtual interfaces that provide
+	  Layer 2 Networks over Layer 3 Networks. VXLAN is often used
+	  to tunnel virtual network infrastructure in virtualized environments.
+	  For more information see:
+	    http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called vxlan.
+
 config NETCONSOLE
 	tristate "Network console logging support"
 	---help---
--- a/drivers/net/Makefile	2012-09-24 10:56:57.080291529 -0700
+++ b/drivers/net/Makefile	2012-09-24 11:08:02.865416523 -0700
@@ -21,6 +21,7 @@ obj-$(CONFIG_NET_TEAM) += team/
 obj-$(CONFIG_TUN) += tun.o
 obj-$(CONFIG_VETH) += veth.o
 obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
+obj-$(CONFIG_VXLAN) += vxlan.o
 
 #
 # Networking Drivers
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ b/drivers/net/vxlan.c	2012-09-24 14:33:00.687939813 -0700
@@ -0,0 +1,1182 @@
+/*
+ * VXLAN: Virtual eXtensiable Local Area Network
+ *
+ * Copyright (c) 2012 Vyatta Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * TODO
+ *  - use IANA UDP port number (when defined)
+ *  - IPv6 (not in RFC)
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/rculist.h>
+#include <linux/netdevice.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/udp.h>
+#include <linux/igmp.h>
+#include <linux/etherdevice.h>
+#include <linux/if_ether.h>
+#include <linux/version.h>
+#include <linux/hash.h>
+#include <net/ip.h>
+#include <net/icmp.h>
+#include <net/udp.h>
+#include <net/rtnetlink.h>
+#include <net/route.h>
+#include <net/dsfield.h>
+#include <net/inet_ecn.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+
+#define VXLAN_VERSION	"0.0"
+
+#define VNI_HASH_BITS	10
+#define VNI_HASH_SIZE	(1<<VNI_HASH_BITS)
+#define FDB_HASH_BITS	8
+#define FDB_HASH_SIZE	(1<<FDB_HASH_BITS)
+
+#define VXLAN_N_VID	(1u << 24)
+#define VXLAN_VID_MASK	(VXLAN_N_VID - 1)
+
+#define FDB_AGE_INTERVAL (10 * HZ)	/* rescan interval */
+#define FDB_AGE_TIME	 (300 * HZ)	/* drop if not used in 5 min */
+
+#define VXLAN_FLAGS 0x08000000	/* struct vxlanhdr.vx_flags required value. */
+
+/* VXLAN protocol header */
+struct vxlanhdr {
+	__be32 vx_flags;
+	__be32 vx_vni;
+};
+
+#define VXLAN_HEADROOM (sizeof(struct iphdr)		\
+			+ sizeof(struct udphdr)		\
+			+ sizeof(struct vxlanhdr))
+
+/* UDP port for VXLAN traffic. */
+static unsigned int vxlan_port __read_mostly = 8472;
+module_param_named(port, vxlan_port, uint, 0);
+MODULE_PARM_DESC(vxlan_port, "Destination UDP port");
+
+/* per-net private data for this module */
+static unsigned int vxlan_net_id;
+struct vxlan_net {
+	struct socket	  *sock;	/* UDP encap socket */
+	struct hlist_head vni_list[VNI_HASH_SIZE];
+};
+
+/* Forwarding table entry */
+struct vxlan_fdb {
+	struct hlist_node hlist;	/* linked list of entries */
+	struct rcu_head	  rcu;
+	unsigned long	  updated;	/* jiffies */
+	unsigned long	  used;
+	__be32		  remote_ip;
+	u16		  state;	/* see ndm_state */
+	u8		  eth_addr[ETH_ALEN];
+};
+
+/* Per-cpu network traffic stats */
+struct vxlan_stats {
+	u64			rx_packets;
+	u64			rx_bytes;
+	u64			tx_packets;
+	u64			tx_bytes;
+	struct u64_stats_sync	syncp;
+};
+
+/* Pseudo network device */
+struct vxlan_dev {
+	struct hlist_node hlist;
+	struct net_device *dev;
+	struct vxlan_stats __percpu *stats;
+	__u32		  vni;		/* virtual network id */
+	__be32	          gaddr;	/* multicast group */
+	__be32		  saddr;	/* source address */
+	unsigned int      link;		/* link to multicast over */
+	__u8		  tos;		/* TOS override */
+	__u8		  ttl;
+	bool		  learn;
+
+	struct timer_list age_timer;
+	spinlock_t	  hash_lock;
+	struct hlist_head fdb_head[FDB_HASH_SIZE];
+};
+
+/* salt for hash table */
+static u32 vxlan_salt __read_mostly;
+
+static inline struct hlist_head *vni_head(struct net *net, u32 id)
+{
+	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
+
+	return &vn->vni_list[hash_32(id, VNI_HASH_BITS)];
+}
+
+/* Look up VNI in a per net namespace table */
+static struct vxlan_dev *vxlan_find_vni(struct net *net, u32 id)
+{
+	struct vxlan_dev *vxlan;
+	struct hlist_node *node;
+
+	hlist_for_each_entry_rcu(vxlan, node, vni_head(net, id), hlist) {
+		if (vxlan->vni == id)
+			return vxlan;
+	}
+
+	return NULL;
+}
+
+/* Fill in neighbour message in skbuff. */
+static int vxlan_fdb_info(struct sk_buff *skb, struct vxlan_dev *vxlan,
+			   const struct vxlan_fdb *fdb,
+			   u32 portid, u32 seq, int type, unsigned int flags)
+{
+	unsigned long now = jiffies;
+	struct nda_cacheinfo ci;
+	struct nlmsghdr *nlh;
+	struct ndmsg *ndm;
+
+	nlh = nlmsg_put(skb, portid, seq, type, sizeof(*ndm), flags);
+	if (nlh == NULL)
+		return -EMSGSIZE;
+
+	ndm = nlmsg_data(nlh);
+	memset(ndm, 0, sizeof(*ndm));
+	ndm->ndm_family	= AF_BRIDGE;
+	ndm->ndm_state = fdb->state;
+
+	if (nla_put(skb, NDA_LLADDR, ETH_ALEN, &fdb->eth_addr))
+		goto nla_put_failure;
+
+	if (nla_put_be32(skb, NDA_DST, fdb->remote_ip))
+		goto nla_put_failure;
+
+	ci.ndm_used	 = jiffies_to_clock_t(now - fdb->used);
+	ci.ndm_confirmed = 0;
+	ci.ndm_updated	 = jiffies_to_clock_t(now - fdb->updated);
+	ci.ndm_refcnt	 = 0;
+
+	if (nla_put(skb, NDA_CACHEINFO, sizeof(ci), &ci))
+		goto nla_put_failure;
+
+	return nlmsg_end(skb, nlh);
+
+nla_put_failure:
+	nlmsg_cancel(skb, nlh);
+	return -EMSGSIZE;
+}
+
+static inline size_t vxlan_nlmsg_size(void)
+{
+	return NLMSG_ALIGN(sizeof(struct ndmsg))
+		+ nla_total_size(ETH_ALEN) /* NDA_LLADDR */
+		+ nla_total_size(sizeof(__be32)) /* NDA_DST */
+		+ nla_total_size(sizeof(struct nda_cacheinfo));
+}
+
+static void vxlan_fdb_notify(struct vxlan_dev *vxlan,
+			     const struct vxlan_fdb *fdb, int type)
+{
+	struct net *net = dev_net(vxlan->dev);
+	struct sk_buff *skb;
+	int err = -ENOBUFS;
+
+	skb = nlmsg_new(vxlan_nlmsg_size(), GFP_ATOMIC);
+	if (skb == NULL)
+		goto errout;
+
+	err = vxlan_fdb_info(skb, vxlan, fdb, 0, 0, type, 0);
+	if (err < 0) {
+		/* -EMSGSIZE implies BUG in vxlan_nlmsg_size() */
+		WARN_ON(err == -EMSGSIZE);
+		kfree_skb(skb);
+		goto errout;
+	}
+
+	rtnl_notify(skb, net, 0, RTNLGRP_NEIGH, NULL, GFP_ATOMIC);
+	return;
+errout:
+	if (err < 0)
+		rtnl_set_sk_err(net, RTNLGRP_NEIGH, err);
+}
+
+/* Hash Ethernet address */
+static u32 eth_hash(const unsigned char *addr)
+{
+	/* could be optimized for unaligned access */
+	u32 a = addr[5] << 8 | addr[4];
+	u32 b = addr[3] << 24 | addr[2] << 16 | addr[1] << 8 | addr[0];
+
+	return jhash_2words(a, b, vxlan_salt);
+}
+
+/* Hash chain to use given mac address */
+static inline struct hlist_head *vxlan_fdb_head(struct vxlan_dev *vxlan,
+						const u8 *mac)
+{
+	return &vxlan->fdb_head[hash_32(eth_hash(mac), FDB_HASH_BITS)];
+}
+
+/* Look up Ethernet address in forwarding table */
+static struct vxlan_fdb *vxlan_find_mac(struct vxlan_dev *vxlan,
+					const u8 *mac)
+
+{
+	struct hlist_head *head = vxlan_fdb_head(vxlan, mac);
+	struct vxlan_fdb *f;
+	struct hlist_node *node;
+
+	hlist_for_each_entry_rcu(f, node, head, hlist) {
+		if (compare_ether_addr(mac, f->eth_addr) == 0)
+			return f;
+	}
+
+	return NULL;
+}
+
+/* Add new entry to forwarding table -- assumes lock held */
+static int vxlan_fdb_create(struct vxlan_dev *vxlan,
+			    const u8 *mac, __be32 ip,
+			    __u16 state, __u16 flags)
+{
+	struct vxlan_fdb *f;
+
+	f = vxlan_find_mac(vxlan, mac);
+	if (f) {
+		if (flags & NLM_F_EXCL) {
+			netdev_dbg(vxlan->dev,
+				   "lost race to create %pM\n", mac);
+			return -EEXIST;
+		}
+	} else {
+		if (!(flags & NLM_F_CREATE))
+			return -ENOENT;
+
+		netdev_dbg(vxlan->dev, "add %pM -> %pI4\n", mac, &ip);
+		f = kmalloc(sizeof(*f), GFP_ATOMIC);
+		if (!f)
+			return -ENOMEM;
+
+		f->remote_ip = ip;
+		memcpy(f->eth_addr, mac, ETH_ALEN);
+		hlist_add_head_rcu(&f->hlist,
+				   vxlan_fdb_head(vxlan, mac));
+		f->state = 0;
+	}
+
+	if (f->state != state) {
+		f->state = state;
+		f->updated = f->used = jiffies;
+		vxlan_fdb_notify(vxlan, f, RTM_NEWNEIGH);
+	}
+
+	return 0;
+}
+
+static void vxlan_fdb_destroy(struct vxlan_dev *vxlan, struct vxlan_fdb *f)
+{
+	netdev_dbg(vxlan->dev,
+		    "delete %pM\n", f->eth_addr);
+
+	vxlan_fdb_notify(vxlan, f, RTM_DELNEIGH);
+
+	hlist_del_rcu(&f->hlist);
+	kfree_rcu(f, rcu);
+}
+
+/* Add static entry (via netlink) */
+static int vxlan_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
+			 struct net_device *dev,
+			 const unsigned char *addr, u16 flags)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	__be32 ip;
+	int err;
+
+	if (!(ndm->ndm_state & (NUD_PERMANENT|NUD_REACHABLE))) {
+		pr_info("RTM_NEWNEIGH with invalid state %#x\n", ndm->ndm_state);
+		return -EINVAL;
+	}
+
+	if (tb[NDA_DST] == NULL)
+		return -EINVAL;
+
+	if (nla_len(tb[NDA_DST]) != sizeof(__be32))
+		return -EAFNOSUPPORT;
+
+	ip = nla_get_be32(tb[NDA_DST]);
+
+	spin_lock_bh(&vxlan->hash_lock);
+	err = vxlan_fdb_create(vxlan, addr, ip, ndm->ndm_state, flags);
+	spin_unlock_bh(&vxlan->hash_lock);
+
+	return err;
+}
+
+/* Delete entry (via netlink) */
+static int vxlan_fdb_delete(struct ndmsg *ndm, struct net_device *dev,
+			    const unsigned char *addr)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct vxlan_fdb *f;
+	int err = -ENOENT;
+
+	spin_lock_bh(&vxlan->hash_lock);
+	f = vxlan_find_mac(vxlan, addr);
+	if (f) {
+		vxlan_fdb_destroy(vxlan, f);
+		err = 0;
+	}
+	spin_unlock_bh(&vxlan->hash_lock);
+
+	return err;
+}
+
+/* Dump forwarding table */
+static int vxlan_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
+			  struct net_device *dev, int idx)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	unsigned int h;
+
+	for (h = 0; h < FDB_HASH_SIZE; ++h) {
+		struct vxlan_fdb *f;
+		struct hlist_node *n;
+
+		hlist_for_each_entry_rcu(f, n, &vxlan->fdb_head[h], hlist) {
+			if (idx < cb->args[0])
+				goto skip;
+
+			if (vxlan_fdb_info(skb, vxlan, f,
+					    NETLINK_CB(cb->skb).portid,
+					    cb->nlh->nlmsg_seq,
+					    RTM_NEWNEIGH,
+					    NLM_F_MULTI) < 0)
+				break;
+skip:
+			++idx;
+		}
+	}
+
+	return idx;
+}
+
+/* Watch incoming packets to learn mapping between Ethernet address
+ * and Tunnel endpoint.
+ */
+static void vxlan_snoop(struct net_device *dev,
+			__be32 src_ip, const u8 *src_mac)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct vxlan_fdb *f;
+	int err;
+
+	f = vxlan_find_mac(vxlan, src_mac);
+	if (likely(f)) {
+		f->used = jiffies;
+		if (likely(f->remote_ip == src_ip))
+			return;
+
+		if (net_ratelimit())
+			netdev_info(dev,
+				    "%pM migrated from %pI4 to %pI4\n",
+				    src_mac, &f->remote_ip, &src_ip);
+
+		f->remote_ip = src_ip;
+		f->updated = jiffies;
+	} else {
+		/* learned new entry */
+		spin_lock(&vxlan->hash_lock);
+		err = vxlan_fdb_create(vxlan, src_mac, src_ip,
+				       NUD_REACHABLE,
+				       NLM_F_EXCL|NLM_F_CREATE);
+		spin_unlock(&vxlan->hash_lock);
+	}
+}
+
+
+/* See if multicast group is already in use by other ID */
+static bool vxlan_group_used(struct vxlan_net *vn,
+			     const struct vxlan_dev *this)
+{
+	const struct vxlan_dev *vxlan;
+	struct hlist_node *node;
+	unsigned h;
+
+	for (h = 0; h < VNI_HASH_SIZE; ++h)
+		hlist_for_each_entry(vxlan, node, &vn->vni_list[h], hlist) {
+			if (vxlan == this)
+				continue;
+
+			if (!netif_running(vxlan->dev))
+				continue;
+
+			if (vxlan->gaddr == this->gaddr)
+				return true;
+		}
+
+	return false;
+}
+
+/* kernel equivalent to IP_ADD_MEMBERSHIP */
+static int vxlan_join_group(struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);
+	struct sock *sk = vn->sock->sk;
+	struct ip_mreqn mreq = {
+		.imr_multiaddr.s_addr = vxlan->gaddr,
+	};
+	int err;
+
+	if (vxlan_group_used(vn, vxlan))
+		return 0;
+
+	/* Need to drop RTNL to call multicast join */
+	rtnl_unlock();
+	lock_sock(sk);
+	err = ip_mc_join_group(sk, &mreq);
+	release_sock(sk);
+	rtnl_lock();
+
+	return err;
+}
+
+
+/* kernel equivalent to IP_DROP_MEMBERSHIP */
+static int vxlan_leave_group(struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);
+	int err = 0;
+	struct sock *sk = vn->sock->sk;
+	struct ip_mreqn mreq = {
+		.imr_multiaddr.s_addr = vxlan->gaddr,
+	};
+
+	if (vxlan_group_used(vn, vxlan))
+		return 0;
+
+	/* Need to drop RTNL to call multicast leave */
+	rtnl_unlock();
+	lock_sock(sk);
+	err = ip_mc_leave_group(sk, &mreq);
+	release_sock(sk);
+	rtnl_lock();
+
+	return err;
+}
+
+/* Propogate ECN from outer IP header to tunneled packet */
+static inline int vxlan_ecn_decap(const struct iphdr *iph, struct sk_buff *skb)
+{
+	if (unlikely(INET_ECN_is_ce(iph->tos))) {
+		if (skb->protocol == htons(ETH_P_IP))
+			return IP_ECN_set_ce(ip_hdr(skb));
+		else if (skb->protocol == htons(ETH_P_IPV6))
+			return IP6_ECN_set_ce(ipv6_hdr(skb));
+	}
+	return 1;
+}
+
+/* Callback from net/ipv4/udp.c to receive packets */
+static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
+{
+	struct iphdr *oip;
+	struct vxlanhdr *vxh;
+	struct vxlan_dev *vxlan;
+	struct vxlan_stats *stats;
+	__u32 vni;
+
+	/* pop off outer UDP header */
+	__skb_pull(skb, sizeof(struct udphdr));
+
+	/* Need Vxlan and inner Ethernet header to be present */
+	if (!pskb_may_pull(skb,
+			   sizeof(struct vxlanhdr) + sizeof(struct ethhdr)))
+		goto error;
+
+	oip = ip_hdr(skb);
+
+	/* Drop packets with reserved bits set */
+	vxh = (struct vxlanhdr *) skb->data;
+	if (vxh->vx_flags != htonl(VXLAN_FLAGS) ||
+	    (vxh->vx_vni & htonl(0xff)))
+		goto error;
+
+	__skb_pull(skb, sizeof(struct vxlanhdr));
+
+	/* Is this VNI defined? */
+	vni = ntohl(vxh->vx_vni) >> 8;
+	vxlan = vxlan_find_vni(sock_net(sk), vni);
+	if (!vxlan)
+		goto drop;
+
+	/* Ignore packets if device is not up */
+	if (!netif_running(vxlan->dev))
+		goto drop;
+
+	/* Re-examine inner Ethernet packet */
+	skb->protocol = eth_type_trans(skb, vxlan->dev);
+	skb->ip_summed = CHECKSUM_NONE;
+
+	/* Ignore packet loops (and multicast echo) */
+	if (compare_ether_addr(eth_hdr(skb)->h_source,
+			       vxlan->dev->dev_addr) == 0)
+		goto drop;
+
+	/* Check for multicast group configuration errors */
+	if (IN_MULTICAST(ntohl(oip->daddr)) &&
+	    oip->daddr != vxlan->gaddr) {
+		if (net_ratelimit())
+			netdev_notice(vxlan->dev,
+				      "group address %pI4 does not match\n",
+				      &oip->daddr);
+		goto drop;
+	}
+
+	if (vxlan->learn)
+		vxlan_snoop(skb->dev, oip->saddr, eth_hdr(skb)->h_source);
+
+	__skb_tunnel_rx(skb, vxlan->dev);
+	skb_reset_network_header(skb);
+	if (!vxlan_ecn_decap(oip, skb))
+		goto drop;
+
+	stats = this_cpu_ptr(vxlan->stats);
+	u64_stats_update_begin(&stats->syncp);
+	stats->rx_packets++;
+	stats->rx_bytes += skb->len;
+	u64_stats_update_end(&stats->syncp);
+
+	netif_rx(skb);
+
+	return 0;
+error:
+	/* Put UDP header back */
+	__skb_push(skb, sizeof(struct udphdr));
+
+	return 1;
+drop:
+	/* Consume bad packet */
+	kfree_skb(skb);
+	return 0;
+}
+
+/* Extract dsfield from inner protocol */
+static inline u8 vxlan_get_dsfield(const struct iphdr *iph,
+				   const struct sk_buff *skb)
+{
+	if (skb->protocol == htons(ETH_P_IP))
+		return iph->tos;
+	else if (skb->protocol == htons(ETH_P_IPV6))
+		return ipv6_get_dsfield((const struct ipv6hdr *)iph);
+	else
+		return 0;
+}
+
+/* Propogate ECN bits out */
+static inline u8 vxlan_ecn_encap(u8 tos,
+				 const struct iphdr *iph,
+				 const struct sk_buff *skb)
+{
+	u8 inner = vxlan_get_dsfield(iph, skb);
+
+	return INET_ECN_encapsulate(tos, inner);
+}
+
+/* Transmit local packets over Vxlan
+ *
+ * Outer IP header inherits ECN and DF from inner header.
+ * Outer UDP destination is the VXLAN assigned port.
+ *           source port is based on hash of flow if available
+ *                       otherwise use a random value
+ */
+static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct rtable *rt;
+	const struct ethhdr *eth;
+	const struct iphdr *old_iph;
+	struct iphdr *iph;
+	struct vxlanhdr *vxh;
+	struct udphdr *uh;
+	struct flowi4 fl4;
+	struct vxlan_fdb *f;
+	unsigned int pkt_len = skb->len;
+	unsigned int mtu;
+	u32 hash;
+	__be32 dst;
+	__be16 df = 0;
+	__u8 tos, ttl;
+	int err;
+
+	/* Need space for new headers (invalidates iph ptr) */
+	if (skb_cow_head(skb, VXLAN_HEADROOM))
+		goto drop;
+
+	eth = (void *)skb->data;
+	old_iph = ip_hdr(skb);
+
+	if (!is_multicast_ether_addr(eth->h_dest) &&
+	    (f = vxlan_find_mac(vxlan, eth->h_dest)))
+		dst = f->remote_ip;
+	else if (vxlan->gaddr) {
+		dst = vxlan->gaddr;
+	} else
+		goto drop;
+
+	ttl = vxlan->ttl;
+	if (!ttl && IN_MULTICAST(ntohl(dst)))
+		ttl = 1;
+
+	tos = vxlan->tos;
+	if (tos == 1)
+		tos = vxlan_get_dsfield(old_iph, skb);
+
+	hash = skb_get_rxhash(skb);
+
+	rt = ip_route_output_gre(dev_net(dev), &fl4, dst,
+				 vxlan->saddr, vxlan->vni,
+				 RT_TOS(tos), vxlan->link);
+	if (IS_ERR(rt)) {
+		netdev_dbg(dev, "no route to %pI4\n", &dst);
+		dev->stats.tx_carrier_errors++;
+		goto tx_error;
+	}
+
+	if (rt->dst.dev == dev) {
+		netdev_dbg(dev, "circular route to %pI4\n", &dst);
+		ip_rt_put(rt);
+		dev->stats.collisions++;
+		goto tx_error;
+	}
+
+	mtu = dst_mtu(&rt->dst) - VXLAN_HEADROOM;
+	/* Do PMTU */
+	if (skb->protocol == htons(ETH_P_IP)) {
+		df |= old_iph->frag_off & htons(IP_DF);
+		if (df && mtu < pkt_len) {
+			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
+				  htonl(mtu));
+			ip_rt_put(rt);
+			goto tx_error;
+		}
+	}
+#if IS_ENABLED(CONFIG_IPV6)
+	else if (skb->protocol == htons(ETH_P_IPV6)) {
+		if (mtu >= IPV6_MIN_MTU && mtu < pkt_len) {
+			icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
+			ip_rt_put(rt);
+			goto tx_error;
+		}
+	}
+#endif
+
+	memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
+	IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED |
+			      IPSKB_REROUTED);
+	skb_dst_drop(skb);
+	skb_dst_set(skb, &rt->dst);
+
+	vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
+	vxh->vx_flags = htonl(VXLAN_FLAGS);
+	vxh->vx_vni = htonl(vxlan->vni << 8);
+
+	__skb_push(skb, sizeof(*uh));
+	skb_reset_transport_header(skb);
+	uh = udp_hdr(skb);
+
+	uh->dest = htons(vxlan_port);
+	uh->source = hash ? :random32();
+
+	uh->len = htons(skb->len);
+	uh->check = 0;
+
+	__skb_push(skb, sizeof(*iph));
+	skb_reset_network_header(skb);
+	iph		= ip_hdr(skb);
+	iph->version	= 4;
+	iph->ihl	= sizeof(struct iphdr) >> 2;
+	iph->frag_off	= df;
+	iph->protocol	= IPPROTO_UDP;
+	iph->tos	= vxlan_ecn_encap(tos, old_iph, skb);
+	iph->daddr	= fl4.daddr;
+	iph->saddr	= fl4.saddr;
+	iph->ttl	= ttl ? : ip4_dst_hoplimit(&rt->dst);
+
+	/* See __IPTUNNEL_XMIT */
+	skb->ip_summed = CHECKSUM_NONE;
+	ip_select_ident(iph, &rt->dst, NULL);
+
+	err = ip_local_out(skb);
+	if (likely(net_xmit_eval(err) == 0)) {
+		struct vxlan_stats *stats = this_cpu_ptr(vxlan->stats);
+
+		u64_stats_update_begin(&stats->syncp);
+		stats->tx_packets++;
+		stats->tx_bytes += pkt_len;
+		u64_stats_update_end(&stats->syncp);
+	} else {
+		dev->stats.tx_errors++;
+		dev->stats.tx_aborted_errors++;
+	}
+	return NETDEV_TX_OK;
+
+drop:
+	dev->stats.tx_dropped++;
+	goto tx_free;
+
+tx_error:
+	dev->stats.tx_errors++;
+tx_free:
+	dev_kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+/* Walk the forwarding table and purge stale entries */
+static void vxlan_cleanup(unsigned long arg)
+{
+	struct vxlan_dev *vxlan = (struct vxlan_dev *) arg;
+	unsigned long next_timer = jiffies + FDB_AGE_INTERVAL;
+	unsigned int h;
+
+	if (!netif_running(vxlan->dev))
+		return;
+
+	spin_lock_bh(&vxlan->hash_lock);
+	for (h = 0; h < FDB_HASH_SIZE; ++h) {
+		struct hlist_node *p, *n;
+		hlist_for_each_safe(p, n, &vxlan->fdb_head[h]) {
+			struct vxlan_fdb *f
+				= container_of(p, struct vxlan_fdb, hlist);
+			unsigned long timeout = f->used + FDB_AGE_TIME;
+
+			if (f->state == NUD_PERMANENT)
+				continue;
+
+			if (time_before_eq(timeout, jiffies)) {
+				netdev_dbg(vxlan->dev,
+					   "garbage collect %pM\n",
+					   f->eth_addr);
+				f->state = NUD_STALE;
+				vxlan_fdb_destroy(vxlan, f);
+			} else if (time_before(timeout, next_timer))
+				next_timer = timeout;
+		}
+	}
+	spin_unlock_bh(&vxlan->hash_lock);
+
+	mod_timer(&vxlan->age_timer, next_timer);
+}
+
+/* Setup stats when device is created */
+static int vxlan_init(struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+
+	vxlan->stats = alloc_percpu(struct vxlan_stats);
+	if (!vxlan->stats)
+		return -ENOMEM;
+
+	return 0;
+}
+
+/* Start ageing timer and join group when device is brought up */
+static int vxlan_open(struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	int err;
+
+	if (vxlan->gaddr) {
+		err = vxlan_join_group(dev);
+		if (err)
+			return err;
+	}
+
+	mod_timer(&vxlan->age_timer, jiffies + FDB_AGE_INTERVAL);
+	return 0;
+}
+
+/* Purge the forwarding table */
+static void vxlan_flush(struct vxlan_dev *vxlan)
+{
+	unsigned h;
+
+	spin_lock_bh(&vxlan->hash_lock);
+	for (h = 0; h < FDB_HASH_SIZE; ++h) {
+		struct hlist_node *p, *n;
+		hlist_for_each_safe(p, n, &vxlan->fdb_head[h]) {
+			struct vxlan_fdb *f
+				= container_of(p, struct vxlan_fdb, hlist);
+			vxlan_fdb_destroy(vxlan, f);
+		}
+	}
+	spin_unlock_bh(&vxlan->hash_lock);
+}
+
+/* Cleanup timer and forwarding table on shutdown */
+static int vxlan_stop(struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+
+	if (vxlan->gaddr)
+		vxlan_leave_group(dev);
+
+	del_timer_sync(&vxlan->age_timer);
+
+	vxlan_flush(vxlan);
+
+	return 0;
+}
+
+/* Merge per-cpu statistics */
+static struct rtnl_link_stats64 *vxlan_stats64(struct net_device *dev,
+					       struct rtnl_link_stats64 *stats)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct vxlan_stats tmp, sum = { 0 };
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		unsigned int start;
+		const struct vxlan_stats *stats
+			= per_cpu_ptr(vxlan->stats, cpu);
+
+		do {
+			start = u64_stats_fetch_begin_bh(&stats->syncp);
+			memcpy(&tmp, stats, sizeof(tmp));
+		} while (u64_stats_fetch_retry_bh(&stats->syncp, start));
+
+		sum.tx_bytes   += tmp.tx_bytes;
+		sum.tx_packets += tmp.tx_packets;
+		sum.rx_bytes   += tmp.rx_bytes;
+		sum.rx_packets += tmp.rx_packets;
+	}
+
+	stats->tx_bytes   = sum.tx_bytes;
+	stats->tx_packets = sum.tx_packets;
+	stats->rx_bytes   = sum.rx_bytes;
+	stats->rx_packets = sum.rx_packets;
+
+	stats->tx_dropped = dev->stats.tx_dropped;
+	stats->tx_errors  = dev->stats.tx_errors;
+	stats->tx_carrier_errors  = dev->stats.tx_carrier_errors;
+	stats->collisions  = dev->stats.collisions;
+
+	return stats;
+}
+
+/* Stub, nothing needs to be done. */
+static void vxlan_set_multicast_list(struct net_device *dev)
+{
+}
+
+static const struct net_device_ops vxlan_netdev_ops = {
+	.ndo_init		= vxlan_init,
+	.ndo_open		= vxlan_open,
+	.ndo_stop		= vxlan_stop,
+	.ndo_start_xmit		= vxlan_xmit,
+	.ndo_get_stats64	= vxlan_stats64,
+	.ndo_set_rx_mode	= vxlan_set_multicast_list,
+	.ndo_change_mtu		= eth_change_mtu,
+	.ndo_validate_addr	= eth_validate_addr,
+	.ndo_set_mac_address	= eth_mac_addr,
+	.ndo_fdb_add		= vxlan_fdb_add,
+	.ndo_fdb_del		= vxlan_fdb_delete,
+	.ndo_fdb_dump		= vxlan_fdb_dump,
+};
+
+/* Info for udev, that this is a virtual tunnel endpoint */
+static struct device_type vxlan_type = {
+	.name = "vxlan",
+};
+
+static void vxlan_free(struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+
+	free_percpu(vxlan->stats);
+	free_netdev(dev);
+}
+
+/* Initialize the device structure. */
+static void vxlan_setup(struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	unsigned h;
+
+	eth_hw_addr_random(dev);
+	ether_setup(dev);
+
+	dev->netdev_ops = &vxlan_netdev_ops;
+	dev->destructor = vxlan_free;
+	SET_NETDEV_DEVTYPE(dev, &vxlan_type);
+
+	dev->tx_queue_len = 0;
+	dev->features	|= NETIF_F_LLTX;
+	dev->features	|= NETIF_F_NETNS_LOCAL;
+	dev->priv_flags	&= ~IFF_XMIT_DST_RELEASE;
+
+	spin_lock_init(&vxlan->hash_lock);
+
+	init_timer_deferrable(&vxlan->age_timer);
+	vxlan->age_timer.function = vxlan_cleanup;
+	vxlan->age_timer.data = (unsigned long) vxlan;
+
+	vxlan->dev = dev;
+
+	for (h = 0; h < FDB_HASH_SIZE; ++h)
+		INIT_HLIST_HEAD(&vxlan->fdb_head[h]);
+}
+
+static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = {
+	[IFLA_VXLAN_ID]		= { .type = NLA_U32 },
+	[IFLA_VXLAN_GROUP]	= { .len = FIELD_SIZEOF(struct iphdr, daddr) },
+	[IFLA_VXLAN_LINK]	= { .type = NLA_U32 },
+	[IFLA_VXLAN_LOCAL]	= { .len = FIELD_SIZEOF(struct iphdr, saddr) },
+	[IFLA_VXLAN_TOS]	= { .type = NLA_U8 },
+	[IFLA_VXLAN_LEARNING]	= { .type = NLA_U8 },
+};
+
+static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	if (tb[IFLA_ADDRESS]) {
+		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN) {
+			pr_debug("invalid link address (not ethernet)\n");
+			return -EINVAL;
+		}
+
+		if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS]))) {
+			pr_debug("invalid all zero ethernet address\n");
+			return -EADDRNOTAVAIL;
+		}
+	}
+
+	if (!data)
+		return -EINVAL;
+
+	if (data[IFLA_VXLAN_ID]) {
+		__u32 id = nla_get_u32(data[IFLA_VXLAN_ID]);
+		if (id >= VXLAN_VID_MASK)
+			return -ERANGE;
+	}
+
+	if (data[IFLA_VXLAN_GROUP]) {
+		__be32 gaddr = nla_get_be32(data[IFLA_VXLAN_GROUP]);
+		if (!IN_MULTICAST(ntohl(gaddr))) {
+			pr_debug("group address is not IPv4 multicast\n");
+			return -EADDRNOTAVAIL;
+		}
+	}
+
+	return 0;
+}
+
+static int vxlan_newlink(struct net *net, struct net_device *dev,
+			 struct nlattr *tb[], struct nlattr *data[])
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	__u32 vni;
+	int err;
+
+	if (!data[IFLA_VXLAN_ID])
+		return -EINVAL;
+
+	vni = nla_get_u32(data[IFLA_VXLAN_ID]);
+	if (vxlan_find_vni(net, vni)) {
+		pr_info("duplicate VNI %u\n", vni);
+		return -EEXIST;
+	}
+	vxlan->vni = vni;
+
+	if (data[IFLA_VXLAN_GROUP])
+		vxlan->gaddr = nla_get_be32(data[IFLA_VXLAN_GROUP]);
+
+	if (data[IFLA_VXLAN_LOCAL])
+		vxlan->saddr = nla_get_be32(data[IFLA_VXLAN_LOCAL]);
+
+	if (data[IFLA_VXLAN_LINK])
+		vxlan->link = nla_get_u32(data[IFLA_VXLAN_LINK]);
+
+	if (data[IFLA_VXLAN_TOS])
+		vxlan->tos  = nla_get_u8(data[IFLA_VXLAN_TOS]);
+
+	if (!data[IFLA_VXLAN_LEARNING] || nla_get_u8(data[IFLA_VXLAN_LEARNING]))
+		vxlan->learn = true;
+
+	err = register_netdevice(dev);
+	if (!err)
+		hlist_add_head_rcu(&vxlan->hlist, vni_head(net, vxlan->vni));
+
+	return err;
+}
+
+static void vxlan_dellink(struct net_device *dev, struct list_head *head)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+
+	hlist_del_rcu(&vxlan->hlist);
+
+	unregister_netdevice_queue(dev, head);
+}
+
+static size_t vxlan_get_size(const struct net_device *dev)
+{
+
+	return nla_total_size(sizeof(__u32)) +	/* IFLA_VXLAN_ID */
+		nla_total_size(sizeof(__be32)) +/* IFLA_VXLAN_GROUP */
+		nla_total_size(sizeof(__u32)) +	/* IFLA_VXLAN_LINK */
+		nla_total_size(sizeof(__be32))+	/* IFLA_VXLAN_LOCAL */
+		nla_total_size(sizeof(__u8)) +	/* IFLA_VXLAN_TTL */
+		nla_total_size(sizeof(__u8)) +	/* IFLA_VXLAN_TOS */
+		nla_total_size(sizeof(__u8)) +	/* IFLA_VXLAN_LEARNING */
+		0;
+}
+
+static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+	const struct vxlan_dev *vxlan = netdev_priv(dev);
+
+	if (nla_put_u32(skb, IFLA_VXLAN_ID, vxlan->vni))
+		goto nla_put_failure;
+
+	if (vxlan->gaddr && nla_put_u32(skb, IFLA_VXLAN_GROUP, vxlan->gaddr))
+		goto nla_put_failure;
+
+	if (vxlan->link && nla_put_u32(skb, IFLA_VXLAN_LINK, vxlan->link))
+		goto nla_put_failure;
+
+	if (vxlan->saddr && nla_put_u32(skb, IFLA_VXLAN_LOCAL, vxlan->saddr))
+		goto nla_put_failure;
+
+	if (nla_put_u8(skb, IFLA_VXLAN_TTL, vxlan->ttl) ||
+	    nla_put_u8(skb, IFLA_VXLAN_TOS, vxlan->tos) ||
+	    nla_put_u8(skb, IFLA_VXLAN_LEARNING, vxlan->learn))
+		goto nla_put_failure;
+
+	return 0;
+
+nla_put_failure:
+	return -EMSGSIZE;
+}
+
+static struct rtnl_link_ops vxlan_link_ops __read_mostly = {
+	.kind		= "vxlan",
+	.maxtype	= IFLA_VXLAN_MAX,
+	.policy		= vxlan_policy,
+	.priv_size	= sizeof(struct vxlan_dev),
+	.setup		= vxlan_setup,
+	.validate	= vxlan_validate,
+	.newlink	= vxlan_newlink,
+	.dellink	= vxlan_dellink,
+	.get_size	= vxlan_get_size,
+	.fill_info	= vxlan_fill_info,
+};
+
+static __net_init int vxlan_init_net(struct net *net)
+{
+	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
+	struct sock *sk;
+	struct sockaddr_in addr;
+	int rc;
+	unsigned h;
+
+	/* Create UDP socket for encapsulation receive. */
+	rc = sock_create_kern(AF_INET, SOCK_DGRAM, IPPROTO_UDP, &vn->sock);
+	if (rc < 0) {
+		pr_debug("UDP socket create failed\n");
+		return rc;
+	}
+
+	addr.sin_family = AF_INET;
+	addr.sin_addr.s_addr = INADDR_ANY;
+	addr.sin_port = htons(vxlan_port);
+
+	rc = kernel_bind(vn->sock, (struct sockaddr *) &addr, sizeof(addr));
+	if (rc < 0) {
+		pr_debug("bind for port %u failed %d\n", vxlan_port, rc);
+		sock_release(vn->sock);
+		vn->sock = NULL;
+		return rc;
+	}
+
+	/* Disable multicast loopback */
+	sk = vn->sock->sk;
+	inet_sk(sk)->mc_loop = 0;
+
+	/* Mark socket as an encapsulation socket. */
+	udp_sk(sk)->encap_type = 1;
+	udp_sk(sk)->encap_rcv = vxlan_udp_encap_recv;
+	udp_encap_enable();
+
+	for (h = 0; h < VNI_HASH_SIZE; ++h)
+		INIT_HLIST_HEAD(&vn->vni_list[h]);
+
+	return 0;
+}
+
+static __net_exit void vxlan_exit_net(struct net *net)
+{
+	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
+
+	if (vn->sock) {
+		sock_release(vn->sock);
+		vn->sock = NULL;
+	}
+}
+
+static struct pernet_operations vxlan_net_ops = {
+	.init = vxlan_init_net,
+	.exit = vxlan_exit_net,
+	.id   = &vxlan_net_id,
+	.size = sizeof(struct vxlan_net),
+};
+
+static int __init vxlan_init_module(void)
+{
+	int rc;
+
+	get_random_bytes(&vxlan_salt, sizeof(vxlan_salt));
+
+	rc = register_pernet_device(&vxlan_net_ops);
+	if (rc)
+		goto out1;
+
+	rc = rtnl_link_register(&vxlan_link_ops);
+	if (rc)
+		goto out2;
+
+	return 0;
+
+out2:
+	unregister_pernet_device(&vxlan_net_ops);
+out1:
+	return rc;
+}
+module_init(vxlan_init_module);
+
+static void __exit vxlan_cleanup_module(void)
+{
+	rtnl_link_unregister(&vxlan_link_ops);
+	unregister_pernet_device(&vxlan_net_ops);
+}
+module_exit(vxlan_cleanup_module);
+
+MODULE_LICENSE("GPL");
+MODULE_VERSION(VXLAN_VERSION);
+MODULE_AUTHOR("Stephen Hemminger <shemminger@vyatta.com>");
+MODULE_ALIAS_RTNL_LINK("vxlan");
--- a/include/linux/if_link.h	2012-09-24 10:56:57.080291529 -0700
+++ b/include/linux/if_link.h	2012-09-24 11:08:02.869416484 -0700
@@ -272,6 +272,20 @@ enum macvlan_mode {
 
 #define MACVLAN_FLAG_NOPROMISC	1
 
+/* VXLAN section */
+enum {
+	IFLA_VXLAN_UNSPEC,
+	IFLA_VXLAN_ID,
+	IFLA_VXLAN_GROUP,
+	IFLA_VXLAN_LINK,
+	IFLA_VXLAN_LOCAL,
+	IFLA_VXLAN_TTL,
+	IFLA_VXLAN_TOS,
+	IFLA_VXLAN_LEARNING,
+	__IFLA_VXLAN_MAX
+};
+#define IFLA_VXLAN_MAX	(__IFLA_VXLAN_MAX - 1)
+
 /* SR-IOV virtual function management section */
 
 enum {
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ b/Documentation/networking/vxlan.txt	2012-09-24 11:08:02.869416484 -0700
@@ -0,0 +1,36 @@
+Virtual eXtensible Local Area Networking documentation
+======================================================
+
+The VXLAN protocol is a tunnelling protocol that is designed to
+solve the problem of limited number of available VLAN's (4096).
+With VXLAN identifier is expanded to 24 bits.
+
+It is a draft RFC standard, that is implemented by Cisco Nexus,
+Vmware and Brocade. The protocol runs over UDP using a single
+destination port (still not standardized by IANA).
+This document describes the Linux kernel tunnel device,
+there is also an implantation of VXLAN for Openvswitch.
+
+Unlike most tunnels, a VXLAN is a 1 to N network, not just point
+to point. A VXLAN device can either dynamically learn the IP address
+of the other end, in a manner similar to a learning bridge, or the
+forwarding entries can be configured statically.
+
+The management of vxlan is done in a similar fashion to it's
+too closest neighbors GRE and VLAN. Configuring VXLAN requires
+the version of iproute2 that matches the kernel release
+where VXLAN was first merged upstream.
+
+1. Create vxlan device
+  # ip li add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1
+
+This creates a new device (vxlan0). The device uses the
+the multicast group 239.1.1.1 over eth1 to handle packets where
+no entry is in the forwarding table.
+
+2. Delete vxlan device
+  # ip link delete vxlan0
+
+3. Show vxlan info
+  # ip -d show vxlan0
+

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox