[RFC PATCH net-next 0/5] bridge: per vlan lwt and dst

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support
@ 2017-01-21  5:46 Roopa Prabhu
  2017-01-21  5:46 ` [RFC PATCH net-next 1/5] ip_tunnels: new IP_TUNNEL_INFO_BRIDGE flag for ip_tunnel_info mode Roopa Prabhu
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-21  5:46 UTC (permalink / raw)
  To: netdev; +Cc: davem, stephen, nikolay, tgraf, hannes, jbenc, pshelar, dsa, hadi

From: Roopa Prabhu <roopa@cumulusnetworks.com>

High level summary:
lwt and dst_metadata/collect_metadata have enabled vxlan l3 deployments
to use a single vxlan netdev for multiple vnis eliminating the scalability
problem with using a single vxlan netdev per vni. This series tries to
do the same for vxlan netdevs in pure l2 bridged networks.
Use-case/deployment and details are below.

Deployment scerario details:
As we know VXLAN is used to build layer 2 virtual networks across the
underlay layer3 infrastructure. A VXLAN tunnel endpoint (VTEP)
originates and terminates VXLAN tunnels. And a VTEP can be a TOR switch
or a vswitch in the hypervisor. This patch series mainly
focuses on the TOR switch configured as a Vtep. Vxlan segment ID (vni)
along with vlan id is used to identify layer 2 segments in a vxlan
overlay network. Vxlan bridging is the function provided by Vteps to terminate
vxlan tunnels and map the vxlan vni to traditional end host vlan. This is
covered in the "VXLAN Deployment Scenarios" in sections 6 and 6.1 in RFC 7348.
To provide vxlan bridging function, a vtep has to map vlan to a vni. The rfc
says that the ingress VTEP device shall remove the IEEE 802.1Q VLAN tag in
the original Layer 2 packet if there is one before encapsulating the packet
into the VXLAN format to transmit it through the underlay network. The remote
VTEP devices have information about the VLAN in which the packet will be
placed based on their own VLAN-to-VXLAN VNI mapping configurations.

Existing solution:
Without this patch series one can deploy such a vtep configuration by
by adding the local ports and vxlan netdevs into a vlan filtering bridge.
The local ports are configured as trunk ports carrying all vlans.
A vxlan netdev per vni is added to the bridge. Vlan mapping to vni is
achieved by configuring the vlan as pvid on the corresponding vxlan netdev.
The vxlan netdev only receives traffic corresponding to the vlan it is mapped
to. This configuration maps traffic belonging to a vlan to the corresponding
vxlan segment.

          -----------------------------------
         |              bridge               |
         |                                   |
          -----------------------------------
            |100,200       |100 (pvid)    |200 (pvid)
            |              |              |
           swp1          vxlan1000      vxlan2000

This provides the required vxlan bridging function but poses a
scalability problem with using a single vxlan netdev for each vni.

Solution in this patch series:
The Goal is to use a single vxlan device to carry all vnis similar
to the vxlan collect metadata mode but vxlan driver still carrying all
the forwarding information.
- vxlan driver changes:
    - enable collect metadata mode device to be used with learning,
      replication, fdb
    - A single fdb table hashed by (mac, vni)
    - rx path already has the vni
    - tx path expects a vni in the packet with dst_metadata and vxlan
      driver has all the forwarding information for the vni in the
      dst_metadata.

- Bridge driver changes: per vlan LWT and dst_metadata support:
    - Our use case is vxlan and 1-1 mapping between vlan and vni, but I have
      kept the api generic for any tunnel info
    - Uapi to configure/unconfigure/dump per vlan tunnel data
    - new bridge port flag to turn this feature on/off. off by default
    - ingress hook:
        - if port is a lwt tunnel port, use tunnel info in
          attached dst_metadata to map it to a local vlan
    - egress hook:
        - if port is a lwt tunnel port, use tunnel info attached to vlan
          to set dst_metadata on the skb

Other approaches tried and vetoed:
- tc vlan push/pop and tunnel metadata dst:
    - posses a tc rule scalability problem (2 rules per vni)
    - cannot handle the case where a packet needs to be replicated to
      multiple vxlan remote tunnel end-points.. which the vxlan driver
      can do today by having multiple remote destinations per fdb.
- making vxlan driver understand vlan-vni mapping:
    - I had a series almost ready with this one but soon realized
      it duplicated a lot of vlan handling code in the vxlan driver

This series is briefly tested for functionality. Sending it out as RFC while
I continue to test it more. There are some rough edges which I am in the process
of fixing.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>

Roopa Prabhu (5):
  ip_tunnels: new IP_TUNNEL_INFO_BRIDGE flag for ip_tunnel_info mode
  vxlan: make COLLECT_METADATA mode bridge friendly
  bridge: uapi: add per vlan tunnel info
  bridge: vlan lwt and dst_metadata netlink support
  bridge: vlan lwt dst_metadata hooks in ingress and egress paths

 drivers/net/vxlan.c            |  209 ++++++++++++--------
 include/linux/if_bridge.h      |    1 +
 include/net/ip_tunnels.h       |    1 +
 include/uapi/linux/if_bridge.h |   11 ++
 include/uapi/linux/if_link.h   |    1 +
 include/uapi/linux/neighbour.h |    1 +
 net/bridge/br_input.c          |    5 +
 net/bridge/br_netlink.c        |  410 ++++++++++++++++++++++++++++++++++------
 net/bridge/br_private.h        |   22 +++
 net/bridge/br_vlan.c           |  193 ++++++++++++++++++-
 10 files changed, 717 insertions(+), 137 deletions(-)

-- 
1.7.10.4

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH net-next 1/5] ip_tunnels: new IP_TUNNEL_INFO_BRIDGE flag for ip_tunnel_info mode
  2017-01-21  5:46 [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support Roopa Prabhu
@ 2017-01-21  5:46 ` Roopa Prabhu
  2017-01-21  5:46 ` [RFC PATCH net-next 2/5] vxlan: make COLLECT_METADATA mode bridge friendly Roopa Prabhu
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-21  5:46 UTC (permalink / raw)
  To: netdev; +Cc: davem, stephen, nikolay, tgraf, hannes, jbenc, pshelar, dsa, hadi

From: Roopa Prabhu <roopa@cumulusnetworks.com>

New ip_tunnel_info flag to represent bridged tunnel metadata.
Used by bridge driver later in the series to pass per vlan dst
metadata to bridge ports.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 include/net/ip_tunnels.h |    1 +
 1 file changed, 1 insertion(+)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 3d4ca4d..9505679 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -58,6 +58,7 @@ struct ip_tunnel_key {
 /* Flags for ip_tunnel_info mode. */
 #define IP_TUNNEL_INFO_TX	0x01	/* represents tx tunnel parameters */
 #define IP_TUNNEL_INFO_IPV6	0x02	/* key contains IPv6 addresses */
+#define IP_TUNNEL_INFO_BRIDGE	0x04	/* represents a bridged tunnel id */
 
 /* Maximum tunnel options length. */
 #define IP_TUNNEL_OPTS_MAX					\
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH net-next 2/5] vxlan: make COLLECT_METADATA mode bridge friendly
  2017-01-21  5:46 [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support Roopa Prabhu
  2017-01-21  5:46 ` [RFC PATCH net-next 1/5] ip_tunnels: new IP_TUNNEL_INFO_BRIDGE flag for ip_tunnel_info mode Roopa Prabhu
@ 2017-01-21  5:46 ` Roopa Prabhu
  2017-01-22 11:40   ` Nikolay Aleksandrov
  2017-01-21  5:46 ` [RFC PATCH net-next 3/5] bridge: uapi: add per vlan tunnel info Roopa Prabhu
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-21  5:46 UTC (permalink / raw)
  To: netdev; +Cc: davem, stephen, nikolay, tgraf, hannes, jbenc, pshelar, dsa, hadi

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This patch series makes vxlan COLLECT_METADATA mode bridge
and layer2 network friendly. Vxlan COLLECT_METADATA mode today
solves the per-vni netdev scalability problem in l3 networks.
When vxlan collect metadata device participates in bridging
vlan to vn-segments, It can only get the vlan mapped vni in
the xmit tunnel dst metadata. It will need the vxlan driver to
continue learn, hold forwarding state and remote destination
information similar to how it already does for non COLLECT_METADATA
vxlan netdevices today.

Changes introduced by this patch:
    - allow learning and forwarding database state to vxlan netdev in
      COLLECT_METADATA mode. Current behaviour is not changed
      by default. tunnel info flag IP_TUNNEL_INFO_BRIDGE is used
      to support the new bridge friendly mode.
    - A single fdb table hashed by (mac, vni) to allow fdb entries with
      multiple vnis in the same fdb table
    - rx path already has the vni
    - tx path expects a vni in the packet with dst_metadata
    - prior to this series, fdb remote_dsts carried remote vni and
      the vxlan device carrying the fdb table represented the
      source vni. With the vxlan device now representing multiple vnis,
      this patch adds a src vni attribute to the fdb entry. The remote
      vni already uses NDA_VNI attribute. This patch introduces
      NDA_SRC_VNI netlink attribute to represent the src vni in a multi
      vni fdb table.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 drivers/net/vxlan.c            |  209 +++++++++++++++++++++++++---------------
 include/uapi/linux/neighbour.h |    1 +
 2 files changed, 134 insertions(+), 76 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ca7196c..fb114b3 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -57,6 +57,8 @@
 
 static const u8 all_zeros_mac[ETH_ALEN + 2];
 
+static u32 fdb_salt __read_mostly;
+
 static int vxlan_sock_add(struct vxlan_dev *vxlan);
 
 /* per-network namespace private data for this module */
@@ -75,6 +77,7 @@ struct vxlan_fdb {
 	struct list_head  remotes;
 	u8		  eth_addr[ETH_ALEN];
 	u16		  state;	/* see ndm_state */
+	__be32		  vni;
 	u8		  flags;	/* see ndm_flags */
 };
 
@@ -302,6 +305,10 @@ static int vxlan_fdb_info(struct sk_buff *skb, struct vxlan_dev *vxlan,
 	if (rdst->remote_vni != vxlan->default_dst.remote_vni &&
 	    nla_put_u32(skb, NDA_VNI, be32_to_cpu(rdst->remote_vni)))
 		goto nla_put_failure;
+	if ((vxlan->flags & VXLAN_F_COLLECT_METADATA) && fdb->vni &&
+	    nla_put_u32(skb, NDA_SRC_VNI,
+			be32_to_cpu(fdb->vni)))
+		goto nla_put_failure;
 	if (rdst->remote_ifindex &&
 	    nla_put_u32(skb, NDA_IFINDEX, rdst->remote_ifindex))
 		goto nla_put_failure;
@@ -400,34 +407,50 @@ static u32 eth_hash(const unsigned char *addr)
 	return hash_64(value, FDB_HASH_BITS);
 }
 
+static u32 eth_vni_hash(const unsigned char *addr, __be32 vni)
+{
+	/* use 1 byte of OUI and 3 bytes of NIC */
+	u32 key = get_unaligned((u32 *)(addr + 2));
+	return jhash_2words(key, vni, fdb_salt) & (FDB_HASH_SIZE - 1);
+}
+
 /* Hash chain to use given mac address */
 static inline struct hlist_head *vxlan_fdb_head(struct vxlan_dev *vxlan,
-						const u8 *mac)
+						const u8 *mac, __be32 vni)
 {
-	return &vxlan->fdb_head[eth_hash(mac)];
+	if (vxlan->flags & VXLAN_F_COLLECT_METADATA)
+		return &vxlan->fdb_head[eth_vni_hash(mac, vni)];
+	else
+		return &vxlan->fdb_head[eth_hash(mac)];
 }
 
 /* Look up Ethernet address in forwarding table */
 static struct vxlan_fdb *__vxlan_find_mac(struct vxlan_dev *vxlan,
-					const u8 *mac)
+					  const u8 *mac, __be32 vni)
 {
-	struct hlist_head *head = vxlan_fdb_head(vxlan, mac);
+	struct hlist_head *head = vxlan_fdb_head(vxlan, mac, vni);
 	struct vxlan_fdb *f;
 
 	hlist_for_each_entry_rcu(f, head, hlist) {
-		if (ether_addr_equal(mac, f->eth_addr))
-			return f;
+		if (ether_addr_equal(mac, f->eth_addr)) {
+			if (vxlan->flags & VXLAN_F_COLLECT_METADATA) {
+				if (vni == f->vni)
+					return f;
+			} else {
+				return f;
+			}
+		}
 	}
 
 	return NULL;
 }
 
 static struct vxlan_fdb *vxlan_find_mac(struct vxlan_dev *vxlan,
-					const u8 *mac)
+					const u8 *mac, __be32 vni)
 {
 	struct vxlan_fdb *f;
 
-	f = __vxlan_find_mac(vxlan, mac);
+	f = __vxlan_find_mac(vxlan, mac, vni);
 	if (f)
 		f->used = jiffies;
 
@@ -605,15 +628,15 @@ static int vxlan_gro_complete(struct sock *sk, struct sk_buff *skb, int nhoff)
 static int vxlan_fdb_create(struct vxlan_dev *vxlan,
 			    const u8 *mac, union vxlan_addr *ip,
 			    __u16 state, __u16 flags,
-			    __be16 port, __be32 vni, __u32 ifindex,
-			    __u8 ndm_flags)
+			    __be16 port, __be32 src_vni, __be32 vni,
+			    __u32 ifindex, __u8 ndm_flags)
 {
 	struct vxlan_rdst *rd = NULL;
 	struct vxlan_fdb *f;
 	int notify = 0;
 	int rc;
 
-	f = __vxlan_find_mac(vxlan, mac);
+	f = __vxlan_find_mac(vxlan, mac, src_vni);
 	if (f) {
 		if (flags & NLM_F_EXCL) {
 			netdev_dbg(vxlan->dev,
@@ -670,6 +693,7 @@ static int vxlan_fdb_create(struct vxlan_dev *vxlan,
 		f->state = state;
 		f->flags = ndm_flags;
 		f->updated = f->used = jiffies;
+		f->vni = src_vni;
 		INIT_LIST_HEAD(&f->remotes);
 		memcpy(f->eth_addr, mac, ETH_ALEN);
 
@@ -681,7 +705,7 @@ static int vxlan_fdb_create(struct vxlan_dev *vxlan,
 
 		++vxlan->addrcnt;
 		hlist_add_head_rcu(&f->hlist,
-				   vxlan_fdb_head(vxlan, mac));
+				   vxlan_fdb_head(vxlan, mac, src_vni));
 	}
 
 	if (notify) {
@@ -718,8 +742,8 @@ static void vxlan_fdb_destroy(struct vxlan_dev *vxlan, struct vxlan_fdb *f)
 }
 
 static int vxlan_fdb_parse(struct nlattr *tb[], struct vxlan_dev *vxlan,
-			   union vxlan_addr *ip, __be16 *port, __be32 *vni,
-			   u32 *ifindex)
+			   union vxlan_addr *ip, __be16 *port, __be32 *src_vni,
+			   __be32 *vni, u32 *ifindex)
 {
 	struct net *net = dev_net(vxlan->dev);
 	int err;
@@ -757,6 +781,14 @@ static int vxlan_fdb_parse(struct nlattr *tb[], struct vxlan_dev *vxlan,
 		*vni = vxlan->default_dst.remote_vni;
 	}
 
+	if (tb[NDA_SRC_VNI]) {
+		if (nla_len(tb[NDA_SRC_VNI]) != sizeof(u32))
+			return -EINVAL;
+		*src_vni = cpu_to_be32(nla_get_u32(tb[NDA_SRC_VNI]));
+	} else {
+		*src_vni = vxlan->default_dst.remote_vni;
+	}
+
 	if (tb[NDA_IFINDEX]) {
 		struct net_device *tdev;
 
@@ -782,7 +814,7 @@ static int vxlan_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
 	/* struct net *net = dev_net(vxlan->dev); */
 	union vxlan_addr ip;
 	__be16 port;
-	__be32 vni;
+	__be32 src_vni, vni;
 	u32 ifindex;
 	int err;
 
@@ -795,7 +827,7 @@ static int vxlan_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
 	if (tb[NDA_DST] == NULL)
 		return -EINVAL;
 
-	err = vxlan_fdb_parse(tb, vxlan, &ip, &port, &vni, &ifindex);
+	err = vxlan_fdb_parse(tb, vxlan, &ip, &port, &src_vni, &vni, &ifindex);
 	if (err)
 		return err;
 
@@ -804,36 +836,23 @@ static int vxlan_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
 
 	spin_lock_bh(&vxlan->hash_lock);
 	err = vxlan_fdb_create(vxlan, addr, &ip, ndm->ndm_state, flags,
-			       port, vni, ifindex, ndm->ndm_flags);
+			       port, src_vni, vni, ifindex, ndm->ndm_flags);
 	spin_unlock_bh(&vxlan->hash_lock);
 
 	return err;
 }
 
-/* Delete entry (via netlink) */
-static int vxlan_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
-			    struct net_device *dev,
-			    const unsigned char *addr, u16 vid)
+static int __vxlan_fdb_delete(struct vxlan_dev *vxlan, const unsigned char *addr,
+			      union vxlan_addr ip, __be16 port, __be32 src_vni,
+			      u32 vni, u32 ifindex, u16 vid)
 {
-	struct vxlan_dev *vxlan = netdev_priv(dev);
 	struct vxlan_fdb *f;
 	struct vxlan_rdst *rd = NULL;
-	union vxlan_addr ip;
-	__be16 port;
-	__be32 vni;
-	u32 ifindex;
-	int err;
-
-	err = vxlan_fdb_parse(tb, vxlan, &ip, &port, &vni, &ifindex);
-	if (err)
-		return err;
+	int err = -ENOENT;
 
-	err = -ENOENT;
-
-	spin_lock_bh(&vxlan->hash_lock);
-	f = vxlan_find_mac(vxlan, addr);
+	f = vxlan_find_mac(vxlan, addr, src_vni);
 	if (!f)
-		goto out;
+		return err;
 
 	if (!vxlan_addr_any(&ip)) {
 		rd = vxlan_fdb_find_rdst(f, &ip, port, vni, ifindex);
@@ -841,8 +860,6 @@ static int vxlan_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
 			goto out;
 	}
 
-	err = 0;
-
 	/* remove a destination if it's not the only one on the list,
 	 * otherwise destroy the fdb entry
 	 */
@@ -856,6 +873,28 @@ static int vxlan_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
 	vxlan_fdb_destroy(vxlan, f);
 
 out:
+	return 0;
+}
+
+/* Delete entry (via netlink) */
+static int vxlan_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
+			    struct net_device *dev,
+			    const unsigned char *addr, u16 vid)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	union vxlan_addr ip;
+	__be32 src_vni, vni;
+	__be16 port;
+	u32 ifindex;
+	int err;
+
+	err = vxlan_fdb_parse(tb, vxlan, &ip, &port, &src_vni, &vni, &ifindex);
+	if (err)
+		return err;
+
+	spin_lock_bh(&vxlan->hash_lock);
+	err = __vxlan_fdb_delete(vxlan, addr, ip, port, src_vni, vni, ifindex,
+				 vid);
 	spin_unlock_bh(&vxlan->hash_lock);
 
 	return err;
@@ -901,12 +940,13 @@ static int vxlan_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
  * Return true if packet is bogus and should be dropped.
  */
 static bool vxlan_snoop(struct net_device *dev,
-			union vxlan_addr *src_ip, const u8 *src_mac)
+			union vxlan_addr *src_ip, const u8 *src_mac,
+			__be32 vni)
 {
 	struct vxlan_dev *vxlan = netdev_priv(dev);
 	struct vxlan_fdb *f;
 
-	f = vxlan_find_mac(vxlan, src_mac);
+	f = vxlan_find_mac(vxlan, src_mac, vni);
 	if (likely(f)) {
 		struct vxlan_rdst *rdst = first_remote_rcu(f);
 
@@ -930,13 +970,15 @@ static bool vxlan_snoop(struct net_device *dev,
 		spin_lock(&vxlan->hash_lock);
 
 		/* close off race between vxlan_flush and incoming packets */
-		if (netif_running(dev))
+		if (netif_running(dev)) {
 			vxlan_fdb_create(vxlan, src_mac, src_ip,
 					 NUD_REACHABLE,
 					 NLM_F_EXCL|NLM_F_CREATE,
 					 vxlan->cfg.dst_port,
+					 vni,
 					 vxlan->default_dst.remote_vni,
 					 0, NTF_SELF);
+		}
 		spin_unlock(&vxlan->hash_lock);
 	}
 
@@ -1202,7 +1244,7 @@ static bool vxlan_parse_gpe_hdr(struct vxlanhdr *unparsed,
 
 static bool vxlan_set_mac(struct vxlan_dev *vxlan,
 			  struct vxlan_sock *vs,
-			  struct sk_buff *skb)
+			  struct sk_buff *skb, __be32 vni)
 {
 	union vxlan_addr saddr;
 
@@ -1226,7 +1268,7 @@ static bool vxlan_set_mac(struct vxlan_dev *vxlan,
 	}
 
 	if ((vxlan->flags & VXLAN_F_LEARN) &&
-	    vxlan_snoop(skb->dev, &saddr, eth_hdr(skb)->h_source))
+	    vxlan_snoop(skb->dev, &saddr, eth_hdr(skb)->h_source, vni))
 		return false;
 
 	return true;
@@ -1268,6 +1310,7 @@ static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
 	__be16 protocol = htons(ETH_P_TEB);
 	bool raw_proto = false;
 	void *oiph;
+	__be32 vni = 0;
 
 	/* Need UDP and VXLAN header to be present */
 	if (!pskb_may_pull(skb, VXLAN_HLEN))
@@ -1289,7 +1332,12 @@ static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
 	if (!vs)
 		goto drop;
 
-	vxlan = vxlan_vs_find_vni(vs, vxlan_vni(vxlan_hdr(skb)->vx_vni));
+	vni = vxlan_vni(vxlan_hdr(skb)->vx_vni);
+
+	if ((vs->flags & VXLAN_F_COLLECT_METADATA) && !vni)
+		goto drop;
+
+	vxlan = vxlan_vs_find_vni(vs, vni);
 	if (!vxlan)
 		goto drop;
 
@@ -1307,7 +1355,6 @@ static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
 			goto drop;
 
 	if (vxlan_collect_metadata(vs)) {
-		__be32 vni = vxlan_vni(vxlan_hdr(skb)->vx_vni);
 		struct metadata_dst *tun_dst;
 
 		tun_dst = udp_tun_rx_dst(skb, vxlan_get_sk_family(vs), TUNNEL_KEY,
@@ -1345,7 +1392,7 @@ static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
 	}
 
 	if (!raw_proto) {
-		if (!vxlan_set_mac(vxlan, vs, skb))
+		if (!vxlan_set_mac(vxlan, vs, skb, vni))
 			goto drop;
 	} else {
 		skb_reset_mac_header(skb);
@@ -1377,7 +1424,7 @@ static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
 	return 0;
 }
 
-static int arp_reduce(struct net_device *dev, struct sk_buff *skb)
+static int arp_reduce(struct net_device *dev, struct sk_buff *skb, __be32 vni)
 {
 	struct vxlan_dev *vxlan = netdev_priv(dev);
 	struct arphdr *parp;
@@ -1424,7 +1471,7 @@ static int arp_reduce(struct net_device *dev, struct sk_buff *skb)
 			goto out;
 		}
 
-		f = vxlan_find_mac(vxlan, n->ha);
+		f = vxlan_find_mac(vxlan, n->ha, vni);
 		if (f && vxlan_addr_any(&(first_remote_rcu(f)->remote_ip))) {
 			/* bridge-local neighbor */
 			neigh_release(n);
@@ -1548,7 +1595,7 @@ static struct sk_buff *vxlan_na_create(struct sk_buff *request,
 	return reply;
 }
 
-static int neigh_reduce(struct net_device *dev, struct sk_buff *skb)
+static int neigh_reduce(struct net_device *dev, struct sk_buff *skb, __be32 vni)
 {
 	struct vxlan_dev *vxlan = netdev_priv(dev);
 	struct nd_msg *msg;
@@ -1585,7 +1632,7 @@ static int neigh_reduce(struct net_device *dev, struct sk_buff *skb)
 			goto out;
 		}
 
-		f = vxlan_find_mac(vxlan, n->ha);
+		f = vxlan_find_mac(vxlan, n->ha, vni);
 		if (f && vxlan_addr_any(&(first_remote_rcu(f)->remote_ip))) {
 			/* bridge-local neighbor */
 			neigh_release(n);
@@ -1906,7 +1953,7 @@ static struct dst_entry *vxlan6_get_route(struct vxlan_dev *vxlan,
 
 /* Bypass encapsulation if the destination is local */
 static void vxlan_encap_bypass(struct sk_buff *skb, struct vxlan_dev *src_vxlan,
-			       struct vxlan_dev *dst_vxlan)
+			       struct vxlan_dev *dst_vxlan, __be32 vni)
 {
 	struct pcpu_sw_netstats *tx_stats, *rx_stats;
 	union vxlan_addr loopback;
@@ -1932,7 +1979,7 @@ static void vxlan_encap_bypass(struct sk_buff *skb, struct vxlan_dev *src_vxlan,
 	}
 
 	if (dst_vxlan->flags & VXLAN_F_LEARN)
-		vxlan_snoop(skb->dev, &loopback, eth_hdr(skb)->h_source);
+		vxlan_snoop(skb->dev, &loopback, eth_hdr(skb)->h_source, vni);
 
 	u64_stats_update_begin(&tx_stats->syncp);
 	tx_stats->tx_packets++;
@@ -1976,7 +2023,7 @@ static int encap_bypass_if_local(struct sk_buff *skb, struct net_device *dev,
 
 			return -ENOENT;
 		}
-		vxlan_encap_bypass(skb, vxlan, dst_vxlan);
+		vxlan_encap_bypass(skb, vxlan, dst_vxlan, vni);
 		return 1;
 	}
 
@@ -1984,7 +2031,8 @@ static int encap_bypass_if_local(struct sk_buff *skb, struct net_device *dev,
 }
 
 static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
-			   struct vxlan_rdst *rdst, bool did_rsc)
+			   __be32 default_vni, struct vxlan_rdst *rdst,
+			   bool did_rsc)
 {
 	struct dst_cache *dst_cache;
 	struct ip_tunnel_info *info;
@@ -2011,14 +2059,14 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 		if (vxlan_addr_any(dst)) {
 			if (did_rsc) {
 				/* short-circuited back to local bridge */
-				vxlan_encap_bypass(skb, vxlan, vxlan);
+				vxlan_encap_bypass(skb, vxlan, vxlan, default_vni);
 				return;
 			}
 			goto drop;
 		}
 
 		dst_port = rdst->remote_port ? rdst->remote_port : vxlan->cfg.dst_port;
-		vni = rdst->remote_vni;
+		vni = (rdst->remote_vni) ? : default_vni;
 		src = &vxlan->cfg.saddr;
 		dst_cache = &rdst->dst_cache;
 		md->gbp = skb->mark;
@@ -2173,23 +2221,29 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
 	bool did_rsc = false;
 	struct vxlan_rdst *rdst, *fdst = NULL;
 	struct vxlan_fdb *f;
+	__be32 vni = 0;
 
 	info = skb_tunnel_info(skb);
 
 	skb_reset_mac_header(skb);
 
 	if (vxlan->flags & VXLAN_F_COLLECT_METADATA) {
-		if (info && info->mode & IP_TUNNEL_INFO_TX)
-			vxlan_xmit_one(skb, dev, NULL, false);
-		else
-			kfree_skb(skb);
-		return NETDEV_TX_OK;
+		if (info && info->mode & IP_TUNNEL_INFO_BRIDGE &&
+		    info->mode & IP_TUNNEL_INFO_TX) {
+			vni = tunnel_id_to_key32(info->key.tun_id);
+		} else {
+			if (info && info->mode & IP_TUNNEL_INFO_TX)
+				vxlan_xmit_one(skb, dev, vni, NULL, false);
+			else
+				kfree_skb(skb);
+			return NETDEV_TX_OK;
+		}
 	}
 
 	if (vxlan->flags & VXLAN_F_PROXY) {
 		eth = eth_hdr(skb);
 		if (ntohs(eth->h_proto) == ETH_P_ARP)
-			return arp_reduce(dev, skb);
+			return arp_reduce(dev, skb, vni);
 #if IS_ENABLED(CONFIG_IPV6)
 		else if (ntohs(eth->h_proto) == ETH_P_IPV6 &&
 			 pskb_may_pull(skb, sizeof(struct ipv6hdr)
@@ -2200,13 +2254,13 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
 				msg = (struct nd_msg *)skb_transport_header(skb);
 				if (msg->icmph.icmp6_code == 0 &&
 				    msg->icmph.icmp6_type == NDISC_NEIGHBOUR_SOLICITATION)
-					return neigh_reduce(dev, skb);
+					return neigh_reduce(dev, skb, vni);
 		}
 #endif
 	}
 
 	eth = eth_hdr(skb);
-	f = vxlan_find_mac(vxlan, eth->h_dest);
+	f = vxlan_find_mac(vxlan, eth->h_dest, vni);
 	did_rsc = false;
 
 	if (f && (f->flags & NTF_ROUTER) && (vxlan->flags & VXLAN_F_RSC) &&
@@ -2214,11 +2268,11 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
 	     ntohs(eth->h_proto) == ETH_P_IPV6)) {
 		did_rsc = route_shortcircuit(dev, skb);
 		if (did_rsc)
-			f = vxlan_find_mac(vxlan, eth->h_dest);
+			f = vxlan_find_mac(vxlan, eth->h_dest, vni);
 	}
 
 	if (f == NULL) {
-		f = vxlan_find_mac(vxlan, all_zeros_mac);
+		f = vxlan_find_mac(vxlan, all_zeros_mac, vni);
 		if (f == NULL) {
 			if ((vxlan->flags & VXLAN_F_L2MISS) &&
 			    !is_multicast_ether_addr(eth->h_dest))
@@ -2239,11 +2293,11 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 		skb1 = skb_clone(skb, GFP_ATOMIC);
 		if (skb1)
-			vxlan_xmit_one(skb1, dev, rdst, did_rsc);
+			vxlan_xmit_one(skb1, dev, vni, rdst, did_rsc);
 	}
 
 	if (fdst)
-		vxlan_xmit_one(skb, dev, fdst, did_rsc);
+		vxlan_xmit_one(skb, dev, vni, fdst, did_rsc);
 	else
 		kfree_skb(skb);
 	return NETDEV_TX_OK;
@@ -2307,12 +2361,12 @@ static int vxlan_init(struct net_device *dev)
 	return 0;
 }
 
-static void vxlan_fdb_delete_default(struct vxlan_dev *vxlan)
+static void vxlan_fdb_delete_default(struct vxlan_dev *vxlan, __be32 vni)
 {
 	struct vxlan_fdb *f;
 
 	spin_lock_bh(&vxlan->hash_lock);
-	f = __vxlan_find_mac(vxlan, all_zeros_mac);
+	f = __vxlan_find_mac(vxlan, all_zeros_mac, vni);
 	if (f)
 		vxlan_fdb_destroy(vxlan, f);
 	spin_unlock_bh(&vxlan->hash_lock);
@@ -2322,7 +2376,7 @@ static void vxlan_uninit(struct net_device *dev)
 {
 	struct vxlan_dev *vxlan = netdev_priv(dev);
 
-	vxlan_fdb_delete_default(vxlan);
+	vxlan_fdb_delete_default(vxlan, vxlan->cfg.vni);
 
 	free_percpu(dev->tstats);
 }
@@ -2536,6 +2590,8 @@ static void vxlan_setup(struct net_device *dev)
 	dev->vlan_features = dev->features;
 	dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
 	dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+	dev->hw_features |= NETIF_F_HW_VLAN_CTAG_TX;
+	dev->features |= dev->hw_features;
 	netif_keep_dst(dev);
 	dev->priv_flags |= IFF_NO_QUEUE;
 
@@ -2921,6 +2977,7 @@ static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
 				       NLM_F_EXCL|NLM_F_CREATE,
 				       vxlan->cfg.dst_port,
 				       vxlan->default_dst.remote_vni,
+				       vxlan->default_dst.remote_vni,
 				       vxlan->default_dst.remote_ifindex,
 				       NTF_SELF);
 		if (err)
@@ -2929,7 +2986,7 @@ static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
 
 	err = register_netdevice(dev);
 	if (err) {
-		vxlan_fdb_delete_default(vxlan);
+		vxlan_fdb_delete_default(vxlan, vxlan->cfg.vni);
 		return err;
 	}
 
@@ -3023,19 +3080,19 @@ static int vxlan_newlink(struct net *src_net, struct net_device *dev,
 		conf.flags |= VXLAN_F_UDP_ZERO_CSUM_TX;
 
 	if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX] &&
-	    nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
+	    !nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
 		conf.flags |= VXLAN_F_UDP_ZERO_CSUM6_TX;
 
 	if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX] &&
-	    nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]))
+	    !nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]))
 		conf.flags |= VXLAN_F_UDP_ZERO_CSUM6_RX;
 
 	if (data[IFLA_VXLAN_REMCSUM_TX] &&
-	    nla_get_u8(data[IFLA_VXLAN_REMCSUM_TX]))
+	    !nla_get_u8(data[IFLA_VXLAN_REMCSUM_TX]))
 		conf.flags |= VXLAN_F_REMCSUM_TX;
 
 	if (data[IFLA_VXLAN_REMCSUM_RX] &&
-	    nla_get_u8(data[IFLA_VXLAN_REMCSUM_RX]))
+	    !nla_get_u8(data[IFLA_VXLAN_REMCSUM_RX]))
 		conf.flags |= VXLAN_F_REMCSUM_RX;
 
 	if (data[IFLA_VXLAN_GBP])
diff --git a/include/uapi/linux/neighbour.h b/include/uapi/linux/neighbour.h
index bd99a8d..f3d16db 100644
--- a/include/uapi/linux/neighbour.h
+++ b/include/uapi/linux/neighbour.h
@@ -26,6 +26,7 @@ enum {
 	NDA_IFINDEX,
 	NDA_MASTER,
 	NDA_LINK_NETNSID,
+	NDA_SRC_VNI,
 	__NDA_MAX
 };
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH net-next 3/5] bridge: uapi: add per vlan tunnel info
  2017-01-21  5:46 [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support Roopa Prabhu
  2017-01-21  5:46 ` [RFC PATCH net-next 1/5] ip_tunnels: new IP_TUNNEL_INFO_BRIDGE flag for ip_tunnel_info mode Roopa Prabhu
  2017-01-21  5:46 ` [RFC PATCH net-next 2/5] vxlan: make COLLECT_METADATA mode bridge friendly Roopa Prabhu
@ 2017-01-21  5:46 ` Roopa Prabhu
  2017-01-21 16:59   ` Roopa Prabhu
  2017-01-21  5:46 ` [RFC PATCH net-next 4/5] bridge: vlan lwt and dst_metadata netlink support Roopa Prabhu
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-21  5:46 UTC (permalink / raw)
  To: netdev; +Cc: davem, stephen, nikolay, tgraf, hannes, jbenc, pshelar, dsa, hadi

From: Roopa Prabhu <roopa@cumulusnetworks.com>

New netlink api to associate tunnel info per vlan.
This is used by bridge driver to send tunnel metadata to
bridge ports in LWT tunnel dst metadata mode.

One example use for this is a vxlan bridging gateway or vtep
which maps vlans to vn-segments (or vnis). User can configure
per-vlan tunnel information which the bridge driver can use
to bridge vlan into the corresponding vn-segment.

This patch also introduces a bridge port flag IFLA_BRPORT_LWT_VLAN
to enable this feature on a tunnel bridge port. It is off by default.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 include/uapi/linux/if_bridge.h |   11 +++++++++++
 include/uapi/linux/if_link.h   |    1 +
 2 files changed, 12 insertions(+)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index ab92bca..a9e6244 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -118,6 +118,7 @@ enum {
 	IFLA_BRIDGE_FLAGS,
 	IFLA_BRIDGE_MODE,
 	IFLA_BRIDGE_VLAN_INFO,
+	IFLA_BRIDGE_VLAN_TUNNEL_INFO,
 	__IFLA_BRIDGE_MAX,
 };
 #define IFLA_BRIDGE_MAX (__IFLA_BRIDGE_MAX - 1)
@@ -134,6 +135,16 @@ struct bridge_vlan_info {
 	__u16 vid;
 };
 
+enum {
+	IFLA_BRIDGE_VLAN_TUNNEL_UNSPEC,
+	IFLA_BRIDGE_VLAN_TUNNEL_ID,
+	IFLA_BRIDGE_VLAN_TUNNEL_VID,
+	IFLA_BRIDGE_VLAN_TUNNEL_FLAGS,
+	__IFLA_BRIDGE_VLAN_TUNNEL_MAX,
+};
+
+#define IFLA_BRIDGE_VLAN_TUNNEL_MAX (__IFLA_BRIDGE_VLAN_TUNNEL_MAX - 1)
+
 struct bridge_vlan_xstats {
 	__u64 rx_bytes;
 	__u64 rx_packets;
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 184b16e..f0356b9 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -321,6 +321,7 @@ enum {
 	IFLA_BRPORT_MULTICAST_ROUTER,
 	IFLA_BRPORT_PAD,
 	IFLA_BRPORT_MCAST_FLOOD,
+	IFLA_BRPORT_LWT_VLAN,
 	__IFLA_BRPORT_MAX
 };
 #define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1)
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH net-next 4/5] bridge: vlan lwt and dst_metadata netlink support
  2017-01-21  5:46 [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support Roopa Prabhu
                   ` (2 preceding siblings ...)
  2017-01-21  5:46 ` [RFC PATCH net-next 3/5] bridge: uapi: add per vlan tunnel info Roopa Prabhu
@ 2017-01-21  5:46 ` Roopa Prabhu
  2017-01-22 12:05   ` Nikolay Aleksandrov
  2017-01-23  0:22   ` Rosen, Rami
  2017-01-21  5:46 ` [RFC PATCH net-next 5/5] bridge: vlan lwt dst_metadata hooks in ingress and egress paths Roopa Prabhu
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-21  5:46 UTC (permalink / raw)
  To: netdev; +Cc: davem, stephen, nikolay, tgraf, hannes, jbenc, pshelar, dsa, hadi

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This patch adds support to attach per vlan tunnel info dst
metadata. This enables bridge driver to map vlan to tunnel_info
at ingress and egress

The initial use case is vlan to vni bridging, but the api is generic
to extend to any tunnel_info in the future:
    - Uapi to configure/unconfigure/dump per vlan tunnel data
    - netlink functions to configure vlan and tunnel_info mapping
    - Introduces bridge port flag BR_LWT_VLAN to enable attach/detach
    dst_metadata to bridged packets on ports.

Use case:
example use for this is a vxlan bridging gateway or vtep
which maps vlans to vn-segments (or vnis). User can configure
per-vlan tunnel information which the bridge driver can use
to bridge vlan into the corresponding tunnel.

CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
CC'ing Nikolay for some more eyes as he has been trying to keep the
bridge driver fast path lite.

 include/linux/if_bridge.h |    1 +
 net/bridge/br_input.c     |    1 +
 net/bridge/br_netlink.c   |  410 ++++++++++++++++++++++++++++++++++++++-------
 net/bridge/br_private.h   |   18 ++
 net/bridge/br_vlan.c      |  138 ++++++++++++++-
 5 files changed, 507 insertions(+), 61 deletions(-)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index c6587c0..36ff611 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -46,6 +46,7 @@ struct br_ip_list {
 #define BR_LEARNING_SYNC	BIT(9)
 #define BR_PROXYARP_WIFI	BIT(10)
 #define BR_MCAST_FLOOD		BIT(11)
+#define BR_LWT_VLAN		BIT(12)
 
 #define BR_DEFAULT_AGEING_TIME	(300 * HZ)
 
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 855b72f..83f356f 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -20,6 +20,7 @@
 #include <net/arp.h>
 #include <linux/export.h>
 #include <linux/rculist.h>
+#include <net/dst_metadata.h>
 #include "br_private.h"
 
 /* Hook for brouter */
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 71c7453..df997ad 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -17,17 +17,30 @@
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <uapi/linux/if_bridge.h>
+#include <net/dst_metadata.h>
 
 #include "br_private.h"
 #include "br_private_stp.h"
 
-static int __get_num_vlan_infos(struct net_bridge_vlan_group *vg,
-				u32 filter_mask)
+static size_t br_get_vlan_tinfo_size(void)
 {
+	return nla_total_size(0) + /* nest IFLA_BRIDGE_VLAN_TUNNEL_INFO */
+		  nla_total_size(sizeof(u32)) + /* IFLA_BRIDGE_VLAN_TUNNEL_ID */
+		  nla_total_size(sizeof(u16)) + /* IFLA_BRIDGE_VLAN_TUNNEL_VID */
+		  nla_total_size(sizeof(u16)); /* IFLA_BRIDGE_VLAN_TUNNEL_FLAGS */
+}
+
+static int __get_num_vlan_infos(struct net_bridge_port *p,
+				struct net_bridge_vlan_group *vg,
+				u32 filter_mask, int *num_vtinfos)
+{
+	struct net_bridge_vlan *vbegin = NULL, *vend = NULL;
+	struct net_bridge_vlan *vtbegin = NULL, *vtend = NULL;
 	struct net_bridge_vlan *v;
-	u16 vid_range_start = 0, vid_range_end = 0, vid_range_flags = 0;
+	bool get_tinfos = (p && p->flags & BR_LWT_VLAN) ? true: false;
+	bool vcontinue, vtcontinue;
+	int num_vinfos = 0;
 	u16 flags, pvid;
-	int num_vlans = 0;
 
 	if (!(filter_mask & RTEXT_FILTER_BRVLAN_COMPRESSED))
 		return 0;
@@ -36,6 +49,8 @@ static int __get_num_vlan_infos(struct net_bridge_vlan_group *vg,
 	/* Count number of vlan infos */
 	list_for_each_entry_rcu(v, &vg->vlan_list, vlist) {
 		flags = 0;
+		vcontinue = false;
+		vtcontinue = false;
 		/* only a context, bridge vlan not activated */
 		if (!br_vlan_should_use(v))
 			continue;
@@ -45,47 +60,79 @@ static int __get_num_vlan_infos(struct net_bridge_vlan_group *vg,
 		if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)
 			flags |= BRIDGE_VLAN_INFO_UNTAGGED;
 
-		if (vid_range_start == 0) {
-			goto initvars;
-		} else if ((v->vid - vid_range_end) == 1 &&
-			flags == vid_range_flags) {
-			vid_range_end = v->vid;
+		if (!vbegin) {
+			vbegin = v;
+			vend = v;
+			vcontinue = true;
+		} else if ((v->vid - vend->vid) == 1 &&
+			flags == vbegin->flags) {
+			vend = v;
+			vcontinue = true;
+		}
+
+		if (!vcontinue) {
+			if ((vend->vid - vbegin->vid) > 0)
+				num_vinfos += 2;
+			else
+				num_vinfos += 1;
+		}
+
+		if (!get_tinfos && !v->tinfo.tunnel_id)
 			continue;
-		} else {
-			if ((vid_range_end - vid_range_start) > 0)
-				num_vlans += 2;
+
+		if (!vtbegin) {
+			vtbegin = v;
+			vtend = v;
+			vtcontinue = true;
+		} else if ((v->vid - vtend->vid) == 1 &&
+		    vlan_tunnel_id_isrange(vtend, v)) {
+			vtend = v;
+			vtcontinue = true;
+		}
+
+		if (!vtcontinue) {
+			if ((vtend->vid - vtbegin->vid) > 0)
+				num_vtinfos += 2;
 			else
-				num_vlans += 1;
+				num_vtinfos += 1;
+			vbegin = NULL;
+			vend = NULL;
 		}
-initvars:
-		vid_range_start = v->vid;
-		vid_range_end = v->vid;
-		vid_range_flags = flags;
 	}
 
-	if (vid_range_start != 0) {
-		if ((vid_range_end - vid_range_start) > 0)
-			num_vlans += 2;
+	if (vbegin) {
+		if ((vend->vid - vbegin->vid) > 0)
+			num_vinfos += 2;
 		else
-			num_vlans += 1;
+			num_vinfos += 1;
 	}
 
-	return num_vlans;
+	if (get_tinfos && vtbegin && vtbegin->tinfo.tunnel_id) {
+		if ((vtend->vid - vtbegin->vid) > 0)
+			*num_vtinfos += 2;
+		else
+			*num_vtinfos += 1;
+	}
+
+	return num_vinfos;
 }
 
-static int br_get_num_vlan_infos(struct net_bridge_vlan_group *vg,
-				 u32 filter_mask)
+static int br_get_num_vlan_infos(struct net_bridge_port *p,
+				 struct net_bridge_vlan_group *vg,
+				 int *num_tinfos, u32 filter_mask)
 {
 	int num_vlans;
 
 	if (!vg)
 		return 0;
 
-	if (filter_mask & RTEXT_FILTER_BRVLAN)
+	if (filter_mask & RTEXT_FILTER_BRVLAN) {
+		*num_tinfos = vg->num_vlans;
 		return vg->num_vlans;
+	}
 
 	rcu_read_lock();
-	num_vlans = __get_num_vlan_infos(vg, filter_mask);
+	num_vlans = __get_num_vlan_infos(p, vg, filter_mask, num_tinfos);
 	rcu_read_unlock();
 
 	return num_vlans;
@@ -95,9 +142,10 @@ static size_t br_get_link_af_size_filtered(const struct net_device *dev,
 					   u32 filter_mask)
 {
 	struct net_bridge_vlan_group *vg = NULL;
-	struct net_bridge_port *p;
+	struct net_bridge_port *p = NULL;
 	struct net_bridge *br;
-	int num_vlan_infos;
+	int num_vlan_infos, num_vlan_tinfos = 0;
+	size_t retsize = 0;
 
 	rcu_read_lock();
 	if (br_port_exists(dev)) {
@@ -107,11 +155,15 @@ static size_t br_get_link_af_size_filtered(const struct net_device *dev,
 		br = netdev_priv(dev);
 		vg = br_vlan_group_rcu(br);
 	}
-	num_vlan_infos = br_get_num_vlan_infos(vg, filter_mask);
+	num_vlan_infos = br_get_num_vlan_infos(p, vg, &num_vlan_tinfos,
+					       filter_mask);
 	rcu_read_unlock();
 
 	/* Each VLAN is returned in bridge_vlan_info along with flags */
-	return num_vlan_infos * nla_total_size(sizeof(struct bridge_vlan_info));
+	retsize =  num_vlan_infos * nla_total_size(sizeof(struct bridge_vlan_info)) +
+			num_vlan_tinfos * br_get_vlan_tinfo_size();
+
+	return retsize;
 }
 
 static inline size_t br_port_info_size(void)
@@ -191,7 +243,8 @@ static int br_port_fill_attrs(struct sk_buff *skb,
 	    nla_put_u16(skb, IFLA_BRPORT_NO, p->port_no) ||
 	    nla_put_u8(skb, IFLA_BRPORT_TOPOLOGY_CHANGE_ACK,
 		       p->topology_change_ack) ||
-	    nla_put_u8(skb, IFLA_BRPORT_CONFIG_PENDING, p->config_pending))
+	    nla_put_u8(skb, IFLA_BRPORT_CONFIG_PENDING, p->config_pending) ||
+	    nla_put_u8(skb, IFLA_BRPORT_LWT_VLAN, !!(p->flags & BR_LWT_VLAN)))
 		return -EMSGSIZE;
 
 	timerval = br_timer_value(&p->message_age_timer);
@@ -216,6 +269,34 @@ static int br_port_fill_attrs(struct sk_buff *skb,
 	return 0;
 }
 
+static int br_fill_vlan_tinfo(struct sk_buff *skb, u16 vid,
+                              __be64 tunnel_id, u16 flags)
+{
+    __be32 tid = tunnel_id_to_key32(tunnel_id);
+    struct nlattr *tmap;
+
+	tmap = nla_nest_start(skb, IFLA_BRIDGE_VLAN_TUNNEL_INFO);
+	if (!tmap)
+		return -EMSGSIZE;
+	if (nla_put_u32(skb, IFLA_BRIDGE_VLAN_TUNNEL_ID,
+            be32_to_cpu(tid)))
+		goto nla_put_failure;
+	if (nla_put_u16(skb, IFLA_BRIDGE_VLAN_TUNNEL_VID,
+			vid))
+		goto nla_put_failure;
+	if (nla_put_u16(skb, IFLA_BRIDGE_VLAN_TUNNEL_FLAGS,
+			flags))
+		goto nla_put_failure;
+	nla_nest_end(skb, tmap);
+
+	return 0;
+
+nla_put_failure:
+	nla_nest_cancel(skb, tmap);
+
+	return -EMSGSIZE;
+}
+
 static int br_fill_ifvlaninfo_range(struct sk_buff *skb, u16 vid_start,
 				    u16 vid_end, u16 flags)
 {
@@ -249,20 +330,24 @@ static int br_fill_ifvlaninfo_range(struct sk_buff *skb, u16 vid_start,
 }
 
 static int br_fill_ifvlaninfo_compressed(struct sk_buff *skb,
+					 struct net_bridge_port *p,
 					 struct net_bridge_vlan_group *vg)
 {
+	struct net_bridge_vlan *vbegin = NULL, *vend = NULL;
+	struct net_bridge_vlan *vtbegin = NULL, *vtend = NULL;
+	bool fill_tinfos = (p && p->flags & BR_LWT_VLAN) ? true: false;
 	struct net_bridge_vlan *v;
-	u16 vid_range_start = 0, vid_range_end = 0, vid_range_flags = 0;
+	bool vcontinue, vtcontinue;
 	u16 flags, pvid;
-	int err = 0;
+	int err;
 
-	/* Pack IFLA_BRIDGE_VLAN_INFO's for every vlan
-	 * and mark vlan info with begin and end flags
-	 * if vlaninfo represents a range
-	 */
 	pvid = br_get_pvid(vg);
+	/* Count number of vlan infos */
 	list_for_each_entry_rcu(v, &vg->vlan_list, vlist) {
 		flags = 0;
+		vcontinue = false;
+		vtcontinue = false;
+		/* only a context, bridge vlan not activated */
 		if (!br_vlan_should_use(v))
 			continue;
 		if (v->vid == pvid)
@@ -271,44 +356,103 @@ static int br_fill_ifvlaninfo_compressed(struct sk_buff *skb,
 		if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)
 			flags |= BRIDGE_VLAN_INFO_UNTAGGED;
 
-		if (vid_range_start == 0) {
-			goto initvars;
-		} else if ((v->vid - vid_range_end) == 1 &&
-			flags == vid_range_flags) {
-			vid_range_end = v->vid;
-			continue;
-		} else {
-			err = br_fill_ifvlaninfo_range(skb, vid_range_start,
-						       vid_range_end,
-						       vid_range_flags);
+		if (!vbegin) {
+			vbegin = v;
+			vend = v;
+			vcontinue = true;
+		} else if ((v->vid - vend->vid) == 1 &&
+			flags == vbegin->flags) {
+			vend = v;
+			vcontinue = true;
+		}
+
+		if (!vcontinue) {
+			err = br_fill_ifvlaninfo_range(skb,
+						       vbegin->vid,
+						       vend->vid,
+						       vbegin->flags);
 			if (err)
 				return err;
+			vbegin = vend = NULL;
+		}
+
+		if (!fill_tinfos || !v->tinfo.tunnel_id)
+			continue;
+
+		if (!vtbegin) {
+			vtbegin = v;
+			vtend = v;
+			vtcontinue = true;
+		} else if ((v->vid - vtend->vid) == 1 &&
+		    vlan_tunnel_id_isrange(vtend, v)) {
+			vtend = v;
+			vtcontinue = true;
 		}
 
-initvars:
-		vid_range_start = v->vid;
-		vid_range_end = v->vid;
-		vid_range_flags = flags;
+		if (!vtcontinue && vtbegin->tinfo.tunnel_id) {
+			if ((vtend->vid - vtbegin->vid) > 0) {
+				err = br_fill_vlan_tinfo(skb, vbegin->vid,
+							 vbegin->tinfo.tunnel_id,
+						BRIDGE_VLAN_INFO_RANGE_BEGIN);
+				if (err)
+					return err;
+				err = br_fill_vlan_tinfo(skb, vend->vid,
+							 vend->tinfo.tunnel_id,
+						BRIDGE_VLAN_INFO_RANGE_END);
+				if (err)
+					return err;
+			} else {
+				err = br_fill_vlan_tinfo(skb, vbegin->vid,
+							 vbegin->tinfo.tunnel_id,
+							 0);
+				if (err)
+					return err;
+			}
+			vbegin = NULL;
+			vend = NULL;
+		}
 	}
 
-	if (vid_range_start != 0) {
-		/* Call it once more to send any left over vlans */
-		err = br_fill_ifvlaninfo_range(skb, vid_range_start,
-					       vid_range_end,
-					       vid_range_flags);
+	if (vbegin) {
+		err = br_fill_ifvlaninfo_range(skb, vbegin->vid,
+					       vend->vid,
+					       vbegin->flags);
 		if (err)
 			return err;
 	}
 
+	if (fill_tinfos && vtbegin && vtbegin->tinfo.tunnel_id) {
+		if ((vtend->vid - vtbegin->vid) > 0) {
+			err = br_fill_vlan_tinfo(skb, vbegin->vid,
+						 vbegin->tinfo.tunnel_id,
+						 BRIDGE_VLAN_INFO_RANGE_BEGIN);
+			if (err)
+				return err;
+			err = br_fill_vlan_tinfo(skb, vend->vid,
+						 vend->tinfo.tunnel_id,
+						 BRIDGE_VLAN_INFO_RANGE_END);
+			if (err)
+				return err;
+		} else {
+			err = br_fill_vlan_tinfo(skb, vbegin->vid,
+						 vbegin->tinfo.tunnel_id, 0);
+			if (err)
+				return err;
+		}
+	}
+
 	return 0;
 }
 
 static int br_fill_ifvlaninfo(struct sk_buff *skb,
+			      struct net_bridge_port *p,
 			      struct net_bridge_vlan_group *vg)
 {
 	struct bridge_vlan_info vinfo;
 	struct net_bridge_vlan *v;
+	bool fill_tinfos = (p && p->flags & BR_LWT_VLAN) ? true : false;
 	u16 pvid;
+	int err;
 
 	pvid = br_get_pvid(vg);
 	list_for_each_entry_rcu(v, &vg->vlan_list, vlist) {
@@ -326,6 +470,14 @@ static int br_fill_ifvlaninfo(struct sk_buff *skb,
 		if (nla_put(skb, IFLA_BRIDGE_VLAN_INFO,
 			    sizeof(vinfo), &vinfo))
 			goto nla_put_failure;
+
+		if (!fill_tinfos || !v->tinfo.tunnel_id)
+			continue;
+
+		err = br_fill_vlan_tinfo(skb, v->vid,
+					 v->tinfo.tunnel_id, 0);
+		if (err)
+			return err;
 	}
 
 	return 0;
@@ -411,9 +563,9 @@ static int br_fill_ifinfo(struct sk_buff *skb,
 			goto nla_put_failure;
 		}
 		if (filter_mask & RTEXT_FILTER_BRVLAN_COMPRESSED)
-			err = br_fill_ifvlaninfo_compressed(skb, vg);
+			err = br_fill_ifvlaninfo_compressed(skb, port, vg);
 		else
-			err = br_fill_ifvlaninfo(skb, vg);
+			err = br_fill_ifvlaninfo(skb, port, vg);
 		rcu_read_unlock();
 		if (err)
 			goto nla_put_failure;
@@ -514,6 +666,127 @@ static int br_vlan_info(struct net_bridge *br, struct net_bridge_port *p,
 	return err;
 }
 
+static const struct nla_policy vlan_tunnel_policy[IFLA_BRIDGE_VLAN_TUNNEL_MAX + 1] = {
+	[IFLA_BRIDGE_VLAN_TUNNEL_ID]= { .type = NLA_U32 },
+	[IFLA_BRIDGE_VLAN_TUNNEL_VID] = { .type = NLA_U16 },
+	[IFLA_BRIDGE_VLAN_TUNNEL_FLAGS] = { .type = NLA_U16 },
+};
+
+static int br_add_vlan_tunnel_info(struct net_bridge *br,
+				   struct net_bridge_port *p, int cmd,
+				   u16 vid, u32 tun_id)
+{
+	int err;
+
+	switch (cmd) {
+	case RTM_SETLINK:
+		if (p) {
+			/* if the MASTER flag is set this will act on the global
+			 * per-VLAN entry as well
+			 */
+			err = nbp_vlan_tunnel_info_add(p, vid, tun_id);
+			if (err)
+				break;
+		} else {
+			return -EINVAL;
+		}
+
+		break;
+
+	case RTM_DELLINK:
+		if (p)
+			nbp_vlan_tunnel_info_delete(p, vid);
+		else
+			return -EINVAL;
+		break;
+	}
+
+	return 0;
+}
+
+struct vtunnel_info {
+	u32	tunid;
+	u16	vid;
+	u16	flags;
+};
+
+static int br_parse_vlan_tunnel_info(struct nlattr *attr,
+				     struct vtunnel_info *tinfo)
+{
+	struct nlattr *tb[IFLA_BRIDGE_VLAN_TUNNEL_MAX + 1];
+	u32 tun_id;
+	u16 vid, flags;
+	int err;
+
+	err = nla_parse_nested(tb, IFLA_BRIDGE_VLAN_TUNNEL_MAX,
+			       attr, vlan_tunnel_policy);
+	if (err < 0)
+		return err;
+
+	if (tb[IFLA_BRIDGE_VLAN_TUNNEL_ID])
+		tun_id = nla_get_u32(tb[IFLA_BRIDGE_VLAN_TUNNEL_ID]);
+	else
+		return -EINVAL;
+
+	if (tb[IFLA_BRIDGE_VLAN_TUNNEL_VID]) {
+		vid = nla_get_u16(tb[IFLA_BRIDGE_VLAN_TUNNEL_VID]);
+		if (vid >= VLAN_VID_MASK)
+			return -ERANGE;
+	} else {
+		return -EINVAL;
+	}
+
+	if (tb[IFLA_BRIDGE_VLAN_TUNNEL_FLAGS])
+		flags = nla_get_u16(tb[IFLA_BRIDGE_VLAN_TUNNEL_FLAGS]);
+
+	tinfo->tunid = tun_id;
+	tinfo->vid = vid;
+	tinfo->flags = flags;
+
+	return 0;
+}
+
+static int br_process_vlan_tunnel_info(struct net_bridge *br,
+				       struct net_bridge_port *p, int cmd,
+				       struct vtunnel_info *tinfo_curr,
+				       struct vtunnel_info *tinfo_last)
+{
+	int t, v;
+	int err;
+
+	if (tinfo_curr->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN) {
+		if (tinfo_last->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN)
+			return -EINVAL;
+		memcpy(tinfo_last, tinfo_curr, sizeof(struct vtunnel_info));
+	} else if (tinfo_curr->flags & BRIDGE_VLAN_INFO_RANGE_END) {
+		if (!(tinfo_last->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN))
+			return -EINVAL;
+		if ((tinfo_curr->vid - tinfo_last->vid) !=
+		    (tinfo_curr->tunid - tinfo_last->tunid))
+			return -EINVAL;
+		/* XXX: tun id and vlan id attrs must be same
+		 */
+		t = tinfo_last->tunid;
+		for (v = tinfo_last->vid; v <= tinfo_curr->vid; v++) {
+			err = br_add_vlan_tunnel_info(br, p, cmd,
+							  v, t);
+			if (err)
+				return err;
+			t++;
+		}
+		memset(tinfo_last, 0, sizeof(struct vtunnel_info));
+		memset(tinfo_curr, 0, sizeof(struct vtunnel_info));
+	} else {
+		err = br_add_vlan_tunnel_info(br, p, cmd,
+					      tinfo_curr->vid,
+					      tinfo_curr->tunid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
 static int br_afspec(struct net_bridge *br,
 		     struct net_bridge_port *p,
 		     struct nlattr *af_spec,
@@ -522,10 +795,30 @@ static int br_afspec(struct net_bridge *br,
 	struct bridge_vlan_info *vinfo_start = NULL;
 	struct bridge_vlan_info *vinfo = NULL;
 	struct nlattr *attr;
+	struct vtunnel_info tinfo_last = {
+		.tunid = 0,
+		.vid = 0,
+		.flags = 0};
+	struct vtunnel_info tinfo_curr = {
+		.tunid = 0,
+		.vid = 0,
+		.flags = 0};
 	int err = 0;
 	int rem;
 
 	nla_for_each_nested(attr, af_spec, rem) {
+		if (nla_type(attr) == IFLA_BRIDGE_VLAN_TUNNEL_INFO) {
+			err = br_parse_vlan_tunnel_info(attr, &tinfo_curr);
+			if (err)
+				return err;
+			err = br_process_vlan_tunnel_info(br, p, cmd,
+							  &tinfo_curr,
+							  &tinfo_last);
+			if (err)
+				return err;
+			continue;
+		}
+
 		if (nla_type(attr) != IFLA_BRIDGE_VLAN_INFO)
 			continue;
 		if (nla_len(attr) != sizeof(struct bridge_vlan_info))
@@ -638,6 +931,7 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[])
 	br_set_port_flag(p, tb, IFLA_BRPORT_MCAST_FLOOD, BR_MCAST_FLOOD);
 	br_set_port_flag(p, tb, IFLA_BRPORT_PROXYARP, BR_PROXYARP);
 	br_set_port_flag(p, tb, IFLA_BRPORT_PROXYARP_WIFI, BR_PROXYARP_WIFI);
+	br_set_port_flag(p, tb, IFLA_BRPORT_LWT_VLAN, BR_LWT_VLAN);
 
 	if (tb[IFLA_BRPORT_COST]) {
 		err = br_stp_set_path_cost(p, nla_get_u32(tb[IFLA_BRPORT_COST]));
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 8ce621e..f68e360 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -91,6 +91,11 @@ struct br_vlan_stats {
 	struct u64_stats_sync syncp;
 };
 
+struct br_tunnel_info {
+	__be64			tunnel_id;
+	struct metadata_dst	*tunnel_dst;
+};
+
 /**
  * struct net_bridge_vlan - per-vlan entry
  *
@@ -113,6 +118,7 @@ struct br_vlan_stats {
  */
 struct net_bridge_vlan {
 	struct rhash_head		vnode;
+	struct rhash_head		tnode;
 	u16				vid;
 	u16				flags;
 	struct br_vlan_stats __percpu	*stats;
@@ -124,6 +130,9 @@ struct net_bridge_vlan {
 		atomic_t		refcnt;
 		struct net_bridge_vlan	*brvlan;
 	};
+
+	struct br_tunnel_info		tinfo;
+
 	struct list_head		vlist;
 
 	struct rcu_head			rcu;
@@ -145,6 +154,7 @@ struct net_bridge_vlan {
  */
 struct net_bridge_vlan_group {
 	struct rhashtable		vlan_hash;
+	struct rhashtable		tunnel_hash;
 	struct list_head		vlan_list;
 	u16				num_vlans;
 	u16				pvid;
@@ -786,6 +796,14 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
 int nbp_get_num_vlan_infos(struct net_bridge_port *p, u32 filter_mask);
 void br_vlan_get_stats(const struct net_bridge_vlan *v,
 		       struct br_vlan_stats *stats);
+int __vlan_tunnel_info_add(struct net_bridge_vlan_group *vg,
+			   struct net_bridge_vlan *vlan, u32 tun_id);
+int __vlan_tunnel_info_del(struct net_bridge_vlan_group *vg,
+                           struct net_bridge_vlan *vlan);
+int nbp_vlan_tunnel_info_delete(struct net_bridge_port *port, u16 vid);
+int nbp_vlan_tunnel_info_add(struct net_bridge_port *port, u16 vid, u32 tun_id);
+bool vlan_tunnel_id_isrange(struct net_bridge_vlan *v_end,
+			    struct net_bridge_vlan *v);
 
 static inline struct net_bridge_vlan_group *br_vlan_group(
 					const struct net_bridge *br)
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index b6de4f4..2040f08 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -3,6 +3,7 @@
 #include <linux/rtnetlink.h>
 #include <linux/slab.h>
 #include <net/switchdev.h>
+#include <net/dst_metadata.h>
 
 #include "br_private.h"
 
@@ -31,6 +32,31 @@ static struct net_bridge_vlan *br_vlan_lookup(struct rhashtable *tbl, u16 vid)
 	return rhashtable_lookup_fast(tbl, &vid, br_vlan_rht_params);
 }
 
+static inline int br_vlan_tunid_cmp(struct rhashtable_compare_arg *arg,
+				    const void *ptr)
+{
+	const struct net_bridge_vlan *vle = ptr;
+	__be64 tunid = *(__be64 *)arg->key;
+
+	return vle->tinfo.tunnel_id != tunid;
+}
+
+static const struct rhashtable_params br_vlan_tunnel_rht_params = {
+	.head_offset = offsetof(struct net_bridge_vlan, tnode),
+	.key_offset = offsetof(struct net_bridge_vlan, tinfo.tunnel_id),
+	.key_len = sizeof(__be64),
+	.nelem_hint = 3,
+	.locks_mul = 1,
+	.obj_cmpfn = br_vlan_tunid_cmp,
+	.automatic_shrinking = true,
+};
+
+static struct net_bridge_vlan *br_vlan_tunnel_lookup(struct rhashtable *tbl,
+						     u64 tunnel_id)
+{
+	return rhashtable_lookup_fast(tbl, &tunnel_id, br_vlan_tunnel_rht_params);
+}
+
 static void __vlan_add_pvid(struct net_bridge_vlan_group *vg, u16 vid)
 {
 	if (vg->pvid == vid)
@@ -325,6 +351,7 @@ static void __vlan_group_free(struct net_bridge_vlan_group *vg)
 {
 	WARN_ON(!list_empty(&vg->vlan_list));
 	rhashtable_destroy(&vg->vlan_hash);
+	rhashtable_destroy(&vg->tunnel_hash);
 	kfree(vg);
 }
 
@@ -613,6 +640,8 @@ int br_vlan_delete(struct net_bridge *br, u16 vid)
 	br_fdb_find_delete_local(br, NULL, br->dev->dev_addr, vid);
 	br_fdb_delete_by_port(br, NULL, vid, 0);
 
+	__vlan_tunnel_info_del(vg, v);
+
 	return __vlan_del(v);
 }
 
@@ -918,6 +947,9 @@ int br_vlan_init(struct net_bridge *br)
 	ret = rhashtable_init(&vg->vlan_hash, &br_vlan_rht_params);
 	if (ret)
 		goto err_rhtbl;
+	ret = rhashtable_init(&vg->tunnel_hash, &br_vlan_tunnel_rht_params);
+	if (ret)
+		goto err_rhtbl2;
 	INIT_LIST_HEAD(&vg->vlan_list);
 	br->vlan_proto = htons(ETH_P_8021Q);
 	br->default_pvid = 1;
@@ -932,6 +964,8 @@ int br_vlan_init(struct net_bridge *br)
 	return ret;
 
 err_vlan_add:
+	rhashtable_destroy(&vg->tunnel_hash);
+err_rhtbl2:
 	rhashtable_destroy(&vg->vlan_hash);
 err_rhtbl:
 	kfree(vg);
@@ -960,7 +994,10 @@ int nbp_vlan_init(struct net_bridge_port *p)
 
 	ret = rhashtable_init(&vg->vlan_hash, &br_vlan_rht_params);
 	if (ret)
-		goto err_rhtbl;
+		goto err_rhtbl1;
+	ret = rhashtable_init(&vg->tunnel_hash, &br_vlan_tunnel_rht_params);
+	if (ret)
+		goto err_rhtbl2;
 	INIT_LIST_HEAD(&vg->vlan_list);
 	rcu_assign_pointer(p->vlgrp, vg);
 	if (p->br->default_pvid) {
@@ -976,9 +1013,11 @@ int nbp_vlan_init(struct net_bridge_port *p)
 err_vlan_add:
 	RCU_INIT_POINTER(p->vlgrp, NULL);
 	synchronize_rcu();
-	rhashtable_destroy(&vg->vlan_hash);
+	rhashtable_destroy(&vg->tunnel_hash);
 err_vlan_enabled:
-err_rhtbl:
+err_rhtbl2:
+	rhashtable_destroy(&vg->vlan_hash);
+err_rhtbl1:
 	kfree(vg);
 
 	goto out;
@@ -1081,3 +1120,96 @@ void br_vlan_get_stats(const struct net_bridge_vlan *v,
 		stats->tx_packets += txpackets;
 	}
 }
+
+bool vlan_tunnel_id_isrange(struct net_bridge_vlan *v_end,
+			    struct net_bridge_vlan *v)
+{
+	/* XXX: check other tunnel attributes */
+	return (be32_to_cpu(tunnel_id_to_key32(v_end->tinfo.tunnel_id)) -
+		be32_to_cpu(tunnel_id_to_key32(v->tinfo.tunnel_id)) == 1);
+}
+
+int __vlan_tunnel_info_add(struct net_bridge_vlan_group *vg,
+			   struct net_bridge_vlan *vlan, u32 tun_id)
+{
+	struct metadata_dst *metadata = NULL;
+	__be64 key = key32_to_tunnel_id(cpu_to_be32(tun_id));
+	int err;
+
+	if (vlan->tinfo.tunnel_dst)
+		return -EEXIST;
+
+	metadata = __ip_tun_set_dst(0, 0, 0, 0, 0, TUNNEL_KEY,
+				    key, 0);
+	if (!metadata)
+		return -EINVAL;
+
+	metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX | IP_TUNNEL_INFO_BRIDGE;
+	vlan->tinfo.tunnel_dst = metadata;
+	vlan->tinfo.tunnel_id = key;
+
+	err = rhashtable_lookup_insert_fast(&vg->tunnel_hash, &vlan->tnode,
+					    br_vlan_tunnel_rht_params);
+	if (err)
+		goto out;
+
+	return 0;
+out:
+	dst_release(&vlan->tinfo.tunnel_dst->dst);
+
+	return err;
+}
+
+int __vlan_tunnel_info_del(struct net_bridge_vlan_group *vg,
+			   struct net_bridge_vlan *vlan)
+{
+	if (vlan->tinfo.tunnel_dst) {
+		vlan->tinfo.tunnel_id = 0;
+		dst_release(&vlan->tinfo.tunnel_dst->dst);
+
+		rhashtable_remove_fast(&vg->tunnel_hash, &vlan->vnode,
+				       br_vlan_tunnel_rht_params);
+	}
+
+	return 0;
+}
+
+/* Must be protected by RTNL.
+ * Must be called with vid in range from 1 to 4094 inclusive.
+ */
+int nbp_vlan_tunnel_info_add(struct net_bridge_port *port, u16 vid, u32 tun_id)
+{
+	struct net_bridge_vlan_group *vg;
+	struct net_bridge_vlan *vlan;
+
+	ASSERT_RTNL();
+
+	vg = nbp_vlan_group(port);
+	vlan = br_vlan_find(vg, vid);
+	if (!vlan)
+		return -EINVAL;
+
+	__vlan_tunnel_info_add(vg, vlan, tun_id);
+
+	return 0;
+}
+
+/* Must be protected by RTNL.
+ * Must be called with vid in range from 1 to 4094 inclusive.
+ */
+int nbp_vlan_tunnel_info_delete(struct net_bridge_port *port, u16 vid)
+{
+	struct net_bridge_vlan_group *vg;
+	struct net_bridge_vlan *v;
+
+	ASSERT_RTNL();
+
+	vg = nbp_vlan_group(port);
+	v = br_vlan_find(vg, vid);
+	if (!v)
+		return -ENOENT;
+
+	__vlan_tunnel_info_del(vg, v);
+
+	return 0;
+}
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH net-next 5/5] bridge: vlan lwt dst_metadata hooks in ingress and egress paths
  2017-01-21  5:46 [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support Roopa Prabhu
                   ` (3 preceding siblings ...)
  2017-01-21  5:46 ` [RFC PATCH net-next 4/5] bridge: vlan lwt and dst_metadata netlink support Roopa Prabhu
@ 2017-01-21  5:46 ` Roopa Prabhu
  2017-01-22 12:15   ` Nikolay Aleksandrov
  2017-01-23  8:08 ` [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support Jiri Pirko
  2017-01-24 15:47 ` Stephen Hemminger
  6 siblings, 1 reply; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-21  5:46 UTC (permalink / raw)
  To: netdev; +Cc: davem, stephen, nikolay, tgraf, hannes, jbenc, pshelar, dsa, hadi

From: Roopa Prabhu <roopa@cumulusnetworks.com>

- ingress hook:
    - if port is a lwt tunnel port, use tunnel info in
      attached dst_metadata to map it to a local vlan
- egress hook:
    - if port is a lwt tunnel port, use tunnel info attached to
      vlan to set dst_metadata on the skb

CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
CC'ing Nikolay for some more eyes as he has been trying to keep the
bridge driver fast path lite.

 net/bridge/br_input.c   |    4 ++++
 net/bridge/br_private.h |    4 ++++
 net/bridge/br_vlan.c    |   55 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 63 insertions(+)

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 83f356f..96602a1 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -262,6 +262,10 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
 		return RX_HANDLER_CONSUMED;
 
 	p = br_port_get_rcu(skb->dev);
+	if (p->flags & BR_LWT_VLAN) {
+		if (br_handle_ingress_vlan_tunnel(skb, p, nbp_vlan_group_rcu(p)))
+			goto drop;
+	}
 
 	if (unlikely(is_link_local_ether_addr(dest))) {
 		u16 fwd_mask = p->br->group_fwd_mask_required;
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index f68e360..68a23c5 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -804,6 +804,10 @@ int __vlan_tunnel_info_del(struct net_bridge_vlan_group *vg,
 int nbp_vlan_tunnel_info_add(struct net_bridge_port *port, u16 vid, u32 tun_id);
 bool vlan_tunnel_id_isrange(struct net_bridge_vlan *v_end,
 			    struct net_bridge_vlan *v);
+int br_handle_ingress_vlan_tunnel(struct sk_buff *skb, struct net_bridge_port *p,
+				  struct net_bridge_vlan_group *vg);
+int br_handle_egress_vlan_tunnel(struct sk_buff *skb,
+				 struct net_bridge_vlan *vlan);
 
 static inline struct net_bridge_vlan_group *br_vlan_group(
 					const struct net_bridge *br)
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 2040f08..6cf2344 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -405,6 +405,11 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
 
 	if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)
 		skb->vlan_tci = 0;
+
+	if (br_handle_egress_vlan_tunnel(skb, v)) {
+		kfree_skb(skb);
+		return NULL;
+	}
 out:
 	return skb;
 }
@@ -1213,3 +1218,53 @@ int nbp_vlan_tunnel_info_delete(struct net_bridge_port *port, u16 vid)
 
 	return 0;
 }
+
+int br_handle_ingress_vlan_tunnel(struct sk_buff *skb,
+				  struct net_bridge_port *p,
+				  struct net_bridge_vlan_group *vg)
+{
+	struct ip_tunnel_info *tinfo = skb_tunnel_info(skb);
+	struct net_bridge_vlan *vlan;
+
+	if (!vg || !tinfo)
+		return 0;
+
+	/* if already tagged, ignore */
+	if (skb_vlan_tagged(skb))
+		return 0;
+
+	/* lookup vid, given tunnel id */
+	vlan = br_vlan_tunnel_lookup(&vg->tunnel_hash, tinfo->key.tun_id);
+	if (!vlan)
+		return 0;
+
+	skb_dst_drop(skb);
+
+	__vlan_hwaccel_put_tag(skb, p->br->vlan_proto, vlan->vid);
+
+	return 0;
+}
+
+int br_handle_egress_vlan_tunnel(struct sk_buff *skb,
+				 struct net_bridge_vlan *vlan)
+{
+	__be32 tun_id;
+	int err;
+
+	if (!vlan || !vlan->tinfo.tunnel_id)
+		return 0;
+
+	if (unlikely(!skb_vlan_tag_present(skb)))
+		return 0;
+
+	skb_dst_drop(skb);
+	tun_id = tunnel_id_to_key32(vlan->tinfo.tunnel_id);
+
+	err = skb_vlan_pop(skb);
+	if (err)
+		return err;
+
+	skb_dst_set(skb, dst_clone(&vlan->tinfo.tunnel_dst->dst));
+
+	return 0;
+}
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 3/5] bridge: uapi: add per vlan tunnel info
  2017-01-21  5:46 ` [RFC PATCH net-next 3/5] bridge: uapi: add per vlan tunnel info Roopa Prabhu
@ 2017-01-21 16:59   ` Roopa Prabhu
  0 siblings, 0 replies; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-21 16:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, stephen, nikolay, tgraf, hannes, jbenc, pshelar, dsa, hadi

On 1/20/17, 9:46 PM, Roopa Prabhu wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>
> New netlink api to associate tunnel info per vlan.
> This is used by bridge driver to send tunnel metadata to
> bridge ports in LWT tunnel dst metadata mode.
>
> One example use for this is a vxlan bridging gateway or vtep
> which maps vlans to vn-segments (or vnis). User can configure
> per-vlan tunnel information which the bridge driver can use
> to bridge vlan into the corresponding vn-segment.
>
> This patch also introduces a bridge port flag IFLA_BRPORT_LWT_VLAN
> to enable this feature on a tunnel bridge port. It is off by default.
>
>
since it does not use the LWT infra, I plan to drop any LWT references and call
the new features and modes plain collect metadata or dst_metadata mode in the next version.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 2/5] vxlan: make COLLECT_METADATA mode bridge friendly
  2017-01-21  5:46 ` [RFC PATCH net-next 2/5] vxlan: make COLLECT_METADATA mode bridge friendly Roopa Prabhu
@ 2017-01-22 11:40   ` Nikolay Aleksandrov
  2017-01-22 15:18     ` Roopa Prabhu
  0 siblings, 1 reply; 23+ messages in thread
From: Nikolay Aleksandrov @ 2017-01-22 11:40 UTC (permalink / raw)
  To: Roopa Prabhu, netdev
  Cc: davem, stephen, tgraf, hannes, jbenc, pshelar, dsa, hadi

On 21/01/17 06:46, Roopa Prabhu wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> 
> This patch series makes vxlan COLLECT_METADATA mode bridge
> and layer2 network friendly. Vxlan COLLECT_METADATA mode today
> solves the per-vni netdev scalability problem in l3 networks.
> When vxlan collect metadata device participates in bridging
> vlan to vn-segments, It can only get the vlan mapped vni in
> the xmit tunnel dst metadata. It will need the vxlan driver to
> continue learn, hold forwarding state and remote destination
> information similar to how it already does for non COLLECT_METADATA
> vxlan netdevices today.
> 
> Changes introduced by this patch:
>     - allow learning and forwarding database state to vxlan netdev in
>       COLLECT_METADATA mode. Current behaviour is not changed
>       by default. tunnel info flag IP_TUNNEL_INFO_BRIDGE is used
>       to support the new bridge friendly mode.
>     - A single fdb table hashed by (mac, vni) to allow fdb entries with
>       multiple vnis in the same fdb table
>     - rx path already has the vni
>     - tx path expects a vni in the packet with dst_metadata
>     - prior to this series, fdb remote_dsts carried remote vni and
>       the vxlan device carrying the fdb table represented the
>       source vni. With the vxlan device now representing multiple vnis,
>       this patch adds a src vni attribute to the fdb entry. The remote
>       vni already uses NDA_VNI attribute. This patch introduces
>       NDA_SRC_VNI netlink attribute to represent the src vni in a multi
>       vni fdb table.
> 
> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> ---
[snip]
> @@ -2173,23 +2221,29 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
>  	bool did_rsc = false;
>  	struct vxlan_rdst *rdst, *fdst = NULL;
>  	struct vxlan_fdb *f;
> +	__be32 vni = 0;
>  
>  	info = skb_tunnel_info(skb);
>  
>  	skb_reset_mac_header(skb);
>  
>  	if (vxlan->flags & VXLAN_F_COLLECT_METADATA) {
> -		if (info && info->mode & IP_TUNNEL_INFO_TX)
> -			vxlan_xmit_one(skb, dev, NULL, false);
> -		else
> -			kfree_skb(skb);
> -		return NETDEV_TX_OK;
> +		if (info && info->mode & IP_TUNNEL_INFO_BRIDGE &&
> +		    info->mode & IP_TUNNEL_INFO_TX) {

nit: parentheses around the IP_TUNNEL_INFO_TX check

> +			vni = tunnel_id_to_key32(info->key.tun_id);
> +		} else {
> +			if (info && info->mode & IP_TUNNEL_INFO_TX)

nit: parentheses around the IP_TUNNEL_INFO_TX check

> +				vxlan_xmit_one(skb, dev, vni, NULL, false);
> +			else
> +				kfree_skb(skb);
> +			return NETDEV_TX_OK;
> +		}
>  	}
>  
>  	if (vxlan->flags & VXLAN_F_PROXY) {
>  		eth = eth_hdr(skb);
>  		if (ntohs(eth->h_proto) == ETH_P_ARP)
> -			return arp_reduce(dev, skb);
> +			return arp_reduce(dev, skb, vni);
>  #if IS_ENABLED(CONFIG_IPV6)
>  		else if (ntohs(eth->h_proto) == ETH_P_IPV6 &&
>  			 pskb_may_pull(skb, sizeof(struct ipv6hdr)
> @@ -2200,13 +2254,13 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
>  				msg = (struct nd_msg *)skb_transport_header(skb);
>  				if (msg->icmph.icmp6_code == 0 &&
>  				    msg->icmph.icmp6_type == NDISC_NEIGHBOUR_SOLICITATION)
> -					return neigh_reduce(dev, skb);
> +					return neigh_reduce(dev, skb, vni);
>  		}
>  #endif
>  	}
>  
>  	eth = eth_hdr(skb);
> -	f = vxlan_find_mac(vxlan, eth->h_dest);
> +	f = vxlan_find_mac(vxlan, eth->h_dest, vni);
>  	did_rsc = false;
>  
>  	if (f && (f->flags & NTF_ROUTER) && (vxlan->flags & VXLAN_F_RSC) &&
> @@ -2214,11 +2268,11 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
>  	     ntohs(eth->h_proto) == ETH_P_IPV6)) {
>  		did_rsc = route_shortcircuit(dev, skb);
>  		if (did_rsc)
> -			f = vxlan_find_mac(vxlan, eth->h_dest);
> +			f = vxlan_find_mac(vxlan, eth->h_dest, vni);
>  	}
>  
>  	if (f == NULL) {
> -		f = vxlan_find_mac(vxlan, all_zeros_mac);
> +		f = vxlan_find_mac(vxlan, all_zeros_mac, vni);
>  		if (f == NULL) {
>  			if ((vxlan->flags & VXLAN_F_L2MISS) &&
>  			    !is_multicast_ether_addr(eth->h_dest))
> @@ -2239,11 +2293,11 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
>  		}
>  		skb1 = skb_clone(skb, GFP_ATOMIC);
>  		if (skb1)
> -			vxlan_xmit_one(skb1, dev, rdst, did_rsc);
> +			vxlan_xmit_one(skb1, dev, vni, rdst, did_rsc);
>  	}
>  
>  	if (fdst)
> -		vxlan_xmit_one(skb, dev, fdst, did_rsc);
> +		vxlan_xmit_one(skb, dev, vni, fdst, did_rsc);
>  	else
>  		kfree_skb(skb);
>  	return NETDEV_TX_OK;
> @@ -2307,12 +2361,12 @@ static int vxlan_init(struct net_device *dev)
>  	return 0;
>  }
>  
> -static void vxlan_fdb_delete_default(struct vxlan_dev *vxlan)
> +static void vxlan_fdb_delete_default(struct vxlan_dev *vxlan, __be32 vni)
>  {
>  	struct vxlan_fdb *f;
>  
>  	spin_lock_bh(&vxlan->hash_lock);
> -	f = __vxlan_find_mac(vxlan, all_zeros_mac);
> +	f = __vxlan_find_mac(vxlan, all_zeros_mac, vni);
>  	if (f)
>  		vxlan_fdb_destroy(vxlan, f);
>  	spin_unlock_bh(&vxlan->hash_lock);
> @@ -2322,7 +2376,7 @@ static void vxlan_uninit(struct net_device *dev)
>  {
>  	struct vxlan_dev *vxlan = netdev_priv(dev);
>  
> -	vxlan_fdb_delete_default(vxlan);
> +	vxlan_fdb_delete_default(vxlan, vxlan->cfg.vni);
>  
>  	free_percpu(dev->tstats);
>  }
> @@ -2536,6 +2590,8 @@ static void vxlan_setup(struct net_device *dev)
>  	dev->vlan_features = dev->features;
>  	dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
>  	dev->hw_features |= NETIF_F_GSO_SOFTWARE;
> +	dev->hw_features |= NETIF_F_HW_VLAN_CTAG_TX;
> +	dev->features |= dev->hw_features;
>  	netif_keep_dst(dev);
>  	dev->priv_flags |= IFF_NO_QUEUE;
>  
> @@ -2921,6 +2977,7 @@ static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
>  				       NLM_F_EXCL|NLM_F_CREATE,
>  				       vxlan->cfg.dst_port,
>  				       vxlan->default_dst.remote_vni,
> +				       vxlan->default_dst.remote_vni,
>  				       vxlan->default_dst.remote_ifindex,
>  				       NTF_SELF);
>  		if (err)
> @@ -2929,7 +2986,7 @@ static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
>  
>  	err = register_netdevice(dev);
>  	if (err) {
> -		vxlan_fdb_delete_default(vxlan);
> +		vxlan_fdb_delete_default(vxlan, vxlan->cfg.vni);
>  		return err;
>  	}
>  
> @@ -3023,19 +3080,19 @@ static int vxlan_newlink(struct net *src_net, struct net_device *dev,
>  		conf.flags |= VXLAN_F_UDP_ZERO_CSUM_TX;
>  
>  	if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX] &&
> -	    nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
> +	    !nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
>  		conf.flags |= VXLAN_F_UDP_ZERO_CSUM6_TX;
>  
>  	if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX] &&
> -	    nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]))
> +	    !nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]))
>  		conf.flags |= VXLAN_F_UDP_ZERO_CSUM6_RX;
>  
>  	if (data[IFLA_VXLAN_REMCSUM_TX] &&
> -	    nla_get_u8(data[IFLA_VXLAN_REMCSUM_TX]))
> +	    !nla_get_u8(data[IFLA_VXLAN_REMCSUM_TX]))
>  		conf.flags |= VXLAN_F_REMCSUM_TX;
>  
>  	if (data[IFLA_VXLAN_REMCSUM_RX] &&
> -	    nla_get_u8(data[IFLA_VXLAN_REMCSUM_RX]))
> +	    !nla_get_u8(data[IFLA_VXLAN_REMCSUM_RX]))
>  		conf.flags |= VXLAN_F_REMCSUM_RX;

Aren't these going to break user-space ? 

>  
>  	if (data[IFLA_VXLAN_GBP])
> diff --git a/include/uapi/linux/neighbour.h b/include/uapi/linux/neighbour.h
> index bd99a8d..f3d16db 100644
> --- a/include/uapi/linux/neighbour.h
> +++ b/include/uapi/linux/neighbour.h
> @@ -26,6 +26,7 @@ enum {
>  	NDA_IFINDEX,
>  	NDA_MASTER,
>  	NDA_LINK_NETNSID,
> +	NDA_SRC_VNI,
>  	__NDA_MAX
>  };
>  
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 4/5] bridge: vlan lwt and dst_metadata netlink support
  2017-01-21  5:46 ` [RFC PATCH net-next 4/5] bridge: vlan lwt and dst_metadata netlink support Roopa Prabhu
@ 2017-01-22 12:05   ` Nikolay Aleksandrov
  2017-01-22 15:23     ` Roopa Prabhu
  2017-01-23  0:22   ` Rosen, Rami
  1 sibling, 1 reply; 23+ messages in thread
From: Nikolay Aleksandrov @ 2017-01-22 12:05 UTC (permalink / raw)
  To: Roopa Prabhu, netdev
  Cc: davem, stephen, tgraf, hannes, jbenc, pshelar, dsa, hadi

On 21/01/17 06:46, Roopa Prabhu wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> 
> This patch adds support to attach per vlan tunnel info dst
> metadata. This enables bridge driver to map vlan to tunnel_info
> at ingress and egress
> 
> The initial use case is vlan to vni bridging, but the api is generic
> to extend to any tunnel_info in the future:
>     - Uapi to configure/unconfigure/dump per vlan tunnel data
>     - netlink functions to configure vlan and tunnel_info mapping
>     - Introduces bridge port flag BR_LWT_VLAN to enable attach/detach
>     dst_metadata to bridged packets on ports.
> 
> Use case:
> example use for this is a vxlan bridging gateway or vtep
> which maps vlans to vn-segments (or vnis). User can configure
> per-vlan tunnel information which the bridge driver can use
> to bridge vlan into the corresponding tunnel.
> 
> CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> ---
> CC'ing Nikolay for some more eyes as he has been trying to keep the
> bridge driver fast path lite.
> 
>  include/linux/if_bridge.h |    1 +
>  net/bridge/br_input.c     |    1 +
>  net/bridge/br_netlink.c   |  410 ++++++++++++++++++++++++++++++++++++++-------
>  net/bridge/br_private.h   |   18 ++
>  net/bridge/br_vlan.c      |  138 ++++++++++++++-
>  5 files changed, 507 insertions(+), 61 deletions(-)
> 
> diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
> index c6587c0..36ff611 100644
> --- a/include/linux/if_bridge.h
> +++ b/include/linux/if_bridge.h
> @@ -46,6 +46,7 @@ struct br_ip_list {
>  #define BR_LEARNING_SYNC	BIT(9)
>  #define BR_PROXYARP_WIFI	BIT(10)
>  #define BR_MCAST_FLOOD		BIT(11)
> +#define BR_LWT_VLAN		BIT(12)
>  
>  #define BR_DEFAULT_AGEING_TIME	(300 * HZ)
>  
> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> index 855b72f..83f356f 100644
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -20,6 +20,7 @@
>  #include <net/arp.h>
>  #include <linux/export.h>
>  #include <linux/rculist.h>
> +#include <net/dst_metadata.h>
>  #include "br_private.h"
>  
>  /* Hook for brouter */
> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index 71c7453..df997ad 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -17,17 +17,30 @@
>  #include <net/net_namespace.h>
>  #include <net/sock.h>
>  #include <uapi/linux/if_bridge.h>
> +#include <net/dst_metadata.h>
>  
>  #include "br_private.h"
>  #include "br_private_stp.h"
>  
> -static int __get_num_vlan_infos(struct net_bridge_vlan_group *vg,
> -				u32 filter_mask)
> +static size_t br_get_vlan_tinfo_size(void)
>  {
> +	return nla_total_size(0) + /* nest IFLA_BRIDGE_VLAN_TUNNEL_INFO */
> +		  nla_total_size(sizeof(u32)) + /* IFLA_BRIDGE_VLAN_TUNNEL_ID */
> +		  nla_total_size(sizeof(u16)) + /* IFLA_BRIDGE_VLAN_TUNNEL_VID */
> +		  nla_total_size(sizeof(u16)); /* IFLA_BRIDGE_VLAN_TUNNEL_FLAGS */
> +}
> +
> +static int __get_num_vlan_infos(struct net_bridge_port *p,
> +				struct net_bridge_vlan_group *vg,
> +				u32 filter_mask, int *num_vtinfos)
> +{
> +	struct net_bridge_vlan *vbegin = NULL, *vend = NULL;
> +	struct net_bridge_vlan *vtbegin = NULL, *vtend = NULL;
>  	struct net_bridge_vlan *v;
> -	u16 vid_range_start = 0, vid_range_end = 0, vid_range_flags = 0;
> +	bool get_tinfos = (p && p->flags & BR_LWT_VLAN) ? true: false;
> +	bool vcontinue, vtcontinue;
> +	int num_vinfos = 0;
>  	u16 flags, pvid;
> -	int num_vlans = 0;
>  
>  	if (!(filter_mask & RTEXT_FILTER_BRVLAN_COMPRESSED))
>  		return 0;
> @@ -36,6 +49,8 @@ static int __get_num_vlan_infos(struct net_bridge_vlan_group *vg,
>  	/* Count number of vlan infos */
>  	list_for_each_entry_rcu(v, &vg->vlan_list, vlist) {
>  		flags = 0;
> +		vcontinue = false;
> +		vtcontinue = false;
>  		/* only a context, bridge vlan not activated */
>  		if (!br_vlan_should_use(v))
>  			continue;
> @@ -45,47 +60,79 @@ static int __get_num_vlan_infos(struct net_bridge_vlan_group *vg,
>  		if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)
>  			flags |= BRIDGE_VLAN_INFO_UNTAGGED;
>  
> -		if (vid_range_start == 0) {
> -			goto initvars;
> -		} else if ((v->vid - vid_range_end) == 1 &&
> -			flags == vid_range_flags) {
> -			vid_range_end = v->vid;
> +		if (!vbegin) {
> +			vbegin = v;
> +			vend = v;
> +			vcontinue = true;
> +		} else if ((v->vid - vend->vid) == 1 &&
> +			flags == vbegin->flags) {
> +			vend = v;
> +			vcontinue = true;
> +		}
> +
> +		if (!vcontinue) {
> +			if ((vend->vid - vbegin->vid) > 0)
> +				num_vinfos += 2;
> +			else
> +				num_vinfos += 1;
> +		}
> +
> +		if (!get_tinfos && !v->tinfo.tunnel_id)
>  			continue;
> -		} else {
> -			if ((vid_range_end - vid_range_start) > 0)
> -				num_vlans += 2;
> +
> +		if (!vtbegin) {
> +			vtbegin = v;
> +			vtend = v;
> +			vtcontinue = true;
> +		} else if ((v->vid - vtend->vid) == 1 &&
> +		    vlan_tunnel_id_isrange(vtend, v)) {
> +			vtend = v;
> +			vtcontinue = true;
> +		}
> +
> +		if (!vtcontinue) {
> +			if ((vtend->vid - vtbegin->vid) > 0)
> +				num_vtinfos += 2;
>  			else
> -				num_vlans += 1;
> +				num_vtinfos += 1;
> +			vbegin = NULL;
> +			vend = NULL;
>  		}
> -initvars:
> -		vid_range_start = v->vid;
> -		vid_range_end = v->vid;
> -		vid_range_flags = flags;
>  	}
>  
> -	if (vid_range_start != 0) {
> -		if ((vid_range_end - vid_range_start) > 0)
> -			num_vlans += 2;
> +	if (vbegin) {
> +		if ((vend->vid - vbegin->vid) > 0)
> +			num_vinfos += 2;
>  		else
> -			num_vlans += 1;
> +			num_vinfos += 1;
>  	}
>  
> -	return num_vlans;
> +	if (get_tinfos && vtbegin && vtbegin->tinfo.tunnel_id) {
> +		if ((vtend->vid - vtbegin->vid) > 0)
> +			*num_vtinfos += 2;
> +		else
> +			*num_vtinfos += 1;
> +	}
> +
> +	return num_vinfos;
>  }

I think this whole function should be broken into at least a few. It's really difficult
to parse what's going on.

>  
> -static int br_get_num_vlan_infos(struct net_bridge_vlan_group *vg,
> -				 u32 filter_mask)
> +static int br_get_num_vlan_infos(struct net_bridge_port *p,
> +				 struct net_bridge_vlan_group *vg,
> +				 int *num_tinfos, u32 filter_mask)
>  {
>  	int num_vlans;
>  
>  	if (!vg)
>  		return 0;
>  
> -	if (filter_mask & RTEXT_FILTER_BRVLAN)
> +	if (filter_mask & RTEXT_FILTER_BRVLAN) {
> +		*num_tinfos = vg->num_vlans;
>  		return vg->num_vlans;
> +	}
>  
>  	rcu_read_lock();
> -	num_vlans = __get_num_vlan_infos(vg, filter_mask);
> +	num_vlans = __get_num_vlan_infos(p, vg, filter_mask, num_tinfos);
>  	rcu_read_unlock();
>  
>  	return num_vlans;
> @@ -95,9 +142,10 @@ static size_t br_get_link_af_size_filtered(const struct net_device *dev,
>  					   u32 filter_mask)
>  {
>  	struct net_bridge_vlan_group *vg = NULL;
> -	struct net_bridge_port *p;
> +	struct net_bridge_port *p = NULL;
>  	struct net_bridge *br;
> -	int num_vlan_infos;
> +	int num_vlan_infos, num_vlan_tinfos = 0;
> +	size_t retsize = 0;
>  
>  	rcu_read_lock();
>  	if (br_port_exists(dev)) {
> @@ -107,11 +155,15 @@ static size_t br_get_link_af_size_filtered(const struct net_device *dev,
>  		br = netdev_priv(dev);
>  		vg = br_vlan_group_rcu(br);
>  	}
> -	num_vlan_infos = br_get_num_vlan_infos(vg, filter_mask);
> +	num_vlan_infos = br_get_num_vlan_infos(p, vg, &num_vlan_tinfos,
> +					       filter_mask);
>  	rcu_read_unlock();
>  
>  	/* Each VLAN is returned in bridge_vlan_info along with flags */
> -	return num_vlan_infos * nla_total_size(sizeof(struct bridge_vlan_info));
> +	retsize =  num_vlan_infos * nla_total_size(sizeof(struct bridge_vlan_info)) +
> +			num_vlan_tinfos * br_get_vlan_tinfo_size();
> +
> +	return retsize;
>  }
>  
>  static inline size_t br_port_info_size(void)
> @@ -191,7 +243,8 @@ static int br_port_fill_attrs(struct sk_buff *skb,
>  	    nla_put_u16(skb, IFLA_BRPORT_NO, p->port_no) ||
>  	    nla_put_u8(skb, IFLA_BRPORT_TOPOLOGY_CHANGE_ACK,
>  		       p->topology_change_ack) ||
> -	    nla_put_u8(skb, IFLA_BRPORT_CONFIG_PENDING, p->config_pending))
> +	    nla_put_u8(skb, IFLA_BRPORT_CONFIG_PENDING, p->config_pending) ||
> +	    nla_put_u8(skb, IFLA_BRPORT_LWT_VLAN, !!(p->flags & BR_LWT_VLAN)))
>  		return -EMSGSIZE;
>  
>  	timerval = br_timer_value(&p->message_age_timer);
> @@ -216,6 +269,34 @@ static int br_port_fill_attrs(struct sk_buff *skb,
>  	return 0;
>  }
>  
> +static int br_fill_vlan_tinfo(struct sk_buff *skb, u16 vid,
> +                              __be64 tunnel_id, u16 flags)
> +{
> +    __be32 tid = tunnel_id_to_key32(tunnel_id);
> +    struct nlattr *tmap;
> +
> +	tmap = nla_nest_start(skb, IFLA_BRIDGE_VLAN_TUNNEL_INFO);
> +	if (!tmap)
> +		return -EMSGSIZE;
> +	if (nla_put_u32(skb, IFLA_BRIDGE_VLAN_TUNNEL_ID,
> +            be32_to_cpu(tid)))
> +		goto nla_put_failure;
> +	if (nla_put_u16(skb, IFLA_BRIDGE_VLAN_TUNNEL_VID,
> +			vid))
> +		goto nla_put_failure;
> +	if (nla_put_u16(skb, IFLA_BRIDGE_VLAN_TUNNEL_FLAGS,
> +			flags))
> +		goto nla_put_failure;
> +	nla_nest_end(skb, tmap);
> +
> +	return 0;
> +
> +nla_put_failure:
> +	nla_nest_cancel(skb, tmap);
> +
> +	return -EMSGSIZE;
> +}
> +
>  static int br_fill_ifvlaninfo_range(struct sk_buff *skb, u16 vid_start,
>  				    u16 vid_end, u16 flags)
>  {
> @@ -249,20 +330,24 @@ static int br_fill_ifvlaninfo_range(struct sk_buff *skb, u16 vid_start,
>  }
>  
>  static int br_fill_ifvlaninfo_compressed(struct sk_buff *skb,
> +					 struct net_bridge_port *p,
>  					 struct net_bridge_vlan_group *vg)
>  {
> +	struct net_bridge_vlan *vbegin = NULL, *vend = NULL;
> +	struct net_bridge_vlan *vtbegin = NULL, *vtend = NULL;
> +	bool fill_tinfos = (p && p->flags & BR_LWT_VLAN) ? true: false;
>  	struct net_bridge_vlan *v;
> -	u16 vid_range_start = 0, vid_range_end = 0, vid_range_flags = 0;
> +	bool vcontinue, vtcontinue;
>  	u16 flags, pvid;
> -	int err = 0;
> +	int err;
>  
> -	/* Pack IFLA_BRIDGE_VLAN_INFO's for every vlan
> -	 * and mark vlan info with begin and end flags
> -	 * if vlaninfo represents a range
> -	 */
>  	pvid = br_get_pvid(vg);
> +	/* Count number of vlan infos */
>  	list_for_each_entry_rcu(v, &vg->vlan_list, vlist) {
>  		flags = 0;
> +		vcontinue = false;
> +		vtcontinue = false;
> +		/* only a context, bridge vlan not activated */
>  		if (!br_vlan_should_use(v))
>  			continue;
>  		if (v->vid == pvid)
> @@ -271,44 +356,103 @@ static int br_fill_ifvlaninfo_compressed(struct sk_buff *skb,
>  		if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)
>  			flags |= BRIDGE_VLAN_INFO_UNTAGGED;
>  
> -		if (vid_range_start == 0) {
> -			goto initvars;
> -		} else if ((v->vid - vid_range_end) == 1 &&
> -			flags == vid_range_flags) {
> -			vid_range_end = v->vid;
> -			continue;
> -		} else {
> -			err = br_fill_ifvlaninfo_range(skb, vid_range_start,
> -						       vid_range_end,
> -						       vid_range_flags);
> +		if (!vbegin) {
> +			vbegin = v;
> +			vend = v;
> +			vcontinue = true;
> +		} else if ((v->vid - vend->vid) == 1 &&
> +			flags == vbegin->flags) {
> +			vend = v;
> +			vcontinue = true;
> +		}
> +
> +		if (!vcontinue) {
> +			err = br_fill_ifvlaninfo_range(skb,
> +						       vbegin->vid,
> +						       vend->vid,
> +						       vbegin->flags);
>  			if (err)
>  				return err;
> +			vbegin = vend = NULL;
> +		}
> +
> +		if (!fill_tinfos || !v->tinfo.tunnel_id)
> +			continue;
> +
> +		if (!vtbegin) {
> +			vtbegin = v;
> +			vtend = v;
> +			vtcontinue = true;
> +		} else if ((v->vid - vtend->vid) == 1 &&
> +		    vlan_tunnel_id_isrange(vtend, v)) {
> +			vtend = v;
> +			vtcontinue = true;
>  		}
>  
> -initvars:
> -		vid_range_start = v->vid;
> -		vid_range_end = v->vid;
> -		vid_range_flags = flags;
> +		if (!vtcontinue && vtbegin->tinfo.tunnel_id) {
> +			if ((vtend->vid - vtbegin->vid) > 0) {
> +				err = br_fill_vlan_tinfo(skb, vbegin->vid,
> +							 vbegin->tinfo.tunnel_id,
> +						BRIDGE_VLAN_INFO_RANGE_BEGIN);
> +				if (err)
> +					return err;
> +				err = br_fill_vlan_tinfo(skb, vend->vid,
> +							 vend->tinfo.tunnel_id,
> +						BRIDGE_VLAN_INFO_RANGE_END);
> +				if (err)
> +					return err;
> +			} else {
> +				err = br_fill_vlan_tinfo(skb, vbegin->vid,
> +							 vbegin->tinfo.tunnel_id,
> +							 0);
> +				if (err)
> +					return err;
> +			}
> +			vbegin = NULL;
> +			vend = NULL;
> +		}
>  	}
>  
> -	if (vid_range_start != 0) {
> -		/* Call it once more to send any left over vlans */
> -		err = br_fill_ifvlaninfo_range(skb, vid_range_start,
> -					       vid_range_end,
> -					       vid_range_flags);
> +	if (vbegin) {
> +		err = br_fill_ifvlaninfo_range(skb, vbegin->vid,
> +					       vend->vid,
> +					       vbegin->flags);
>  		if (err)
>  			return err;
>  	}
>  
> +	if (fill_tinfos && vtbegin && vtbegin->tinfo.tunnel_id) {
> +		if ((vtend->vid - vtbegin->vid) > 0) {
> +			err = br_fill_vlan_tinfo(skb, vbegin->vid,
> +						 vbegin->tinfo.tunnel_id,
> +						 BRIDGE_VLAN_INFO_RANGE_BEGIN);
> +			if (err)
> +				return err;
> +			err = br_fill_vlan_tinfo(skb, vend->vid,
> +						 vend->tinfo.tunnel_id,
> +						 BRIDGE_VLAN_INFO_RANGE_END);
> +			if (err)
> +				return err;
> +		} else {
> +			err = br_fill_vlan_tinfo(skb, vbegin->vid,
> +						 vbegin->tinfo.tunnel_id, 0);
> +			if (err)
> +				return err;
> +		}
> +	}
> +
>  	return 0;
>  }

Maybe look into breaking this one, too. It's getting huge and hard to follow..

>  
>  static int br_fill_ifvlaninfo(struct sk_buff *skb,
> +			      struct net_bridge_port *p,
>  			      struct net_bridge_vlan_group *vg)
>  {
>  	struct bridge_vlan_info vinfo;
>  	struct net_bridge_vlan *v;
> +	bool fill_tinfos = (p && p->flags & BR_LWT_VLAN) ? true : false;
>  	u16 pvid;
> +	int err;
>  
>  	pvid = br_get_pvid(vg);
>  	list_for_each_entry_rcu(v, &vg->vlan_list, vlist) {
> @@ -326,6 +470,14 @@ static int br_fill_ifvlaninfo(struct sk_buff *skb,
>  		if (nla_put(skb, IFLA_BRIDGE_VLAN_INFO,
>  			    sizeof(vinfo), &vinfo))
>  			goto nla_put_failure;
> +
> +		if (!fill_tinfos || !v->tinfo.tunnel_id)
> +			continue;
> +
> +		err = br_fill_vlan_tinfo(skb, v->vid,
> +					 v->tinfo.tunnel_id, 0);
> +		if (err)
> +			return err;
>  	}
>  
>  	return 0;
> @@ -411,9 +563,9 @@ static int br_fill_ifinfo(struct sk_buff *skb,
>  			goto nla_put_failure;
>  		}
>  		if (filter_mask & RTEXT_FILTER_BRVLAN_COMPRESSED)
> -			err = br_fill_ifvlaninfo_compressed(skb, vg);
> +			err = br_fill_ifvlaninfo_compressed(skb, port, vg);
>  		else
> -			err = br_fill_ifvlaninfo(skb, vg);
> +			err = br_fill_ifvlaninfo(skb, port, vg);
>  		rcu_read_unlock();
>  		if (err)
>  			goto nla_put_failure;
> @@ -514,6 +666,127 @@ static int br_vlan_info(struct net_bridge *br, struct net_bridge_port *p,
>  	return err;
>  }
>  
> +static const struct nla_policy vlan_tunnel_policy[IFLA_BRIDGE_VLAN_TUNNEL_MAX + 1] = {
> +	[IFLA_BRIDGE_VLAN_TUNNEL_ID]= { .type = NLA_U32 },
> +	[IFLA_BRIDGE_VLAN_TUNNEL_VID] = { .type = NLA_U16 },
> +	[IFLA_BRIDGE_VLAN_TUNNEL_FLAGS] = { .type = NLA_U16 },
> +};
> +
> +static int br_add_vlan_tunnel_info(struct net_bridge *br,
> +				   struct net_bridge_port *p, int cmd,
> +				   u16 vid, u32 tun_id)
> +{
> +	int err;
> +
> +	switch (cmd) {
> +	case RTM_SETLINK:
> +		if (p) {
> +			/* if the MASTER flag is set this will act on the global
> +			 * per-VLAN entry as well
> +			 */
> +			err = nbp_vlan_tunnel_info_add(p, vid, tun_id);
> +			if (err)
> +				break;
> +		} else {
> +			return -EINVAL;
> +		}
> +
> +		break;
> +
> +	case RTM_DELLINK:
> +		if (p)
> +			nbp_vlan_tunnel_info_delete(p, vid);
> +		else
> +			return -EINVAL;
> +		break;
> +	}

so if (!p) return -einval in the beginning ? :-)

> +
> +	return 0;
> +}
> +
> +struct vtunnel_info {
> +	u32	tunid;
> +	u16	vid;
> +	u16	flags;
> +};
> +
> +static int br_parse_vlan_tunnel_info(struct nlattr *attr,
> +				     struct vtunnel_info *tinfo)
> +{
> +	struct nlattr *tb[IFLA_BRIDGE_VLAN_TUNNEL_MAX + 1];
> +	u32 tun_id;
> +	u16 vid, flags;
> +	int err;
> +
> +	err = nla_parse_nested(tb, IFLA_BRIDGE_VLAN_TUNNEL_MAX,
> +			       attr, vlan_tunnel_policy);
> +	if (err < 0)
> +		return err;
> +
> +	if (tb[IFLA_BRIDGE_VLAN_TUNNEL_ID])
> +		tun_id = nla_get_u32(tb[IFLA_BRIDGE_VLAN_TUNNEL_ID]);
> +	else
> +		return -EINVAL;
> +
> +	if (tb[IFLA_BRIDGE_VLAN_TUNNEL_VID]) {
> +		vid = nla_get_u16(tb[IFLA_BRIDGE_VLAN_TUNNEL_VID]);
> +		if (vid >= VLAN_VID_MASK)
> +			return -ERANGE;

!vid || 

> +	} else {
> +		return -EINVAL;
> +	}

these attr checks can be moved in the beginning, usually there's a check for existence
of the mandatory attributes, then you can continue just using them and avoid these
"else" statements

> +
> +	if (tb[IFLA_BRIDGE_VLAN_TUNNEL_FLAGS])
> +		flags = nla_get_u16(tb[IFLA_BRIDGE_VLAN_TUNNEL_FLAGS]);
> +
> +	tinfo->tunid = tun_id;
> +	tinfo->vid = vid;
> +	tinfo->flags = flags;
> +
> +	return 0;
> +}
> +
> +static int br_process_vlan_tunnel_info(struct net_bridge *br,
> +				       struct net_bridge_port *p, int cmd,
> +				       struct vtunnel_info *tinfo_curr,
> +				       struct vtunnel_info *tinfo_last)
> +{
> +	int t, v;
> +	int err;
> +
> +	if (tinfo_curr->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN) {
> +		if (tinfo_last->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN)
> +			return -EINVAL;
> +		memcpy(tinfo_last, tinfo_curr, sizeof(struct vtunnel_info));
> +	} else if (tinfo_curr->flags & BRIDGE_VLAN_INFO_RANGE_END) {
> +		if (!(tinfo_last->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN))
> +			return -EINVAL;
> +		if ((tinfo_curr->vid - tinfo_last->vid) !=
> +		    (tinfo_curr->tunid - tinfo_last->tunid))
> +			return -EINVAL;
> +		/* XXX: tun id and vlan id attrs must be same
> +		 */
> +		t = tinfo_last->tunid;
> +		for (v = tinfo_last->vid; v <= tinfo_curr->vid; v++) {
> +			err = br_add_vlan_tunnel_info(br, p, cmd,
> +							  v, t);
> +			if (err)
> +				return err;
> +			t++;
> +		}
> +		memset(tinfo_last, 0, sizeof(struct vtunnel_info));
> +		memset(tinfo_curr, 0, sizeof(struct vtunnel_info));
> +	} else {
> +		err = br_add_vlan_tunnel_info(br, p, cmd,
> +					      tinfo_curr->vid,
> +					      tinfo_curr->tunid);
> +		if (err)
> +			return err;
> +	}
> +
> +	return 0;
> +}
> +
>  static int br_afspec(struct net_bridge *br,
>  		     struct net_bridge_port *p,
>  		     struct nlattr *af_spec,
> @@ -522,10 +795,30 @@ static int br_afspec(struct net_bridge *br,
>  	struct bridge_vlan_info *vinfo_start = NULL;
>  	struct bridge_vlan_info *vinfo = NULL;
>  	struct nlattr *attr;
> +	struct vtunnel_info tinfo_last = {
> +		.tunid = 0,
> +		.vid = 0,
> +		.flags = 0};
> +	struct vtunnel_info tinfo_curr = {
> +		.tunid = 0,
> +		.vid = 0,
> +		.flags = 0};

just { } should be enough to zero the structs

>  	int err = 0;
>  	int rem;
>  
>  	nla_for_each_nested(attr, af_spec, rem) {
> +		if (nla_type(attr) == IFLA_BRIDGE_VLAN_TUNNEL_INFO) {
> +			err = br_parse_vlan_tunnel_info(attr, &tinfo_curr);
> +			if (err)
> +				return err;
> +			err = br_process_vlan_tunnel_info(br, p, cmd,
> +							  &tinfo_curr,
> +							  &tinfo_last);
> +			if (err)
> +				return err;
> +			continue;
> +		}
> +
>  		if (nla_type(attr) != IFLA_BRIDGE_VLAN_INFO)
>  			continue;
>  		if (nla_len(attr) != sizeof(struct bridge_vlan_info))
> @@ -638,6 +931,7 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[])
>  	br_set_port_flag(p, tb, IFLA_BRPORT_MCAST_FLOOD, BR_MCAST_FLOOD);
>  	br_set_port_flag(p, tb, IFLA_BRPORT_PROXYARP, BR_PROXYARP);
>  	br_set_port_flag(p, tb, IFLA_BRPORT_PROXYARP_WIFI, BR_PROXYARP_WIFI);
> +	br_set_port_flag(p, tb, IFLA_BRPORT_LWT_VLAN, BR_LWT_VLAN);
>  
>  	if (tb[IFLA_BRPORT_COST]) {
>  		err = br_stp_set_path_cost(p, nla_get_u32(tb[IFLA_BRPORT_COST]));
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index 8ce621e..f68e360 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -91,6 +91,11 @@ struct br_vlan_stats {
>  	struct u64_stats_sync syncp;
>  };
>  
> +struct br_tunnel_info {
> +	__be64			tunnel_id;
> +	struct metadata_dst	*tunnel_dst;
> +};
> +
>  /**
>   * struct net_bridge_vlan - per-vlan entry
>   *
> @@ -113,6 +118,7 @@ struct br_vlan_stats {
>   */
>  struct net_bridge_vlan {
>  	struct rhash_head		vnode;
> +	struct rhash_head		tnode;
>  	u16				vid;
>  	u16				flags;
>  	struct br_vlan_stats __percpu	*stats;
> @@ -124,6 +130,9 @@ struct net_bridge_vlan {
>  		atomic_t		refcnt;
>  		struct net_bridge_vlan	*brvlan;
>  	};
> +
> +	struct br_tunnel_info		tinfo;
> +
>  	struct list_head		vlist;
>  
>  	struct rcu_head			rcu;
> @@ -145,6 +154,7 @@ struct net_bridge_vlan {
>   */
>  struct net_bridge_vlan_group {
>  	struct rhashtable		vlan_hash;
> +	struct rhashtable		tunnel_hash;
>  	struct list_head		vlan_list;
>  	u16				num_vlans;
>  	u16				pvid;
> @@ -786,6 +796,14 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
>  int nbp_get_num_vlan_infos(struct net_bridge_port *p, u32 filter_mask);
>  void br_vlan_get_stats(const struct net_bridge_vlan *v,
>  		       struct br_vlan_stats *stats);
> +int __vlan_tunnel_info_add(struct net_bridge_vlan_group *vg,
> +			   struct net_bridge_vlan *vlan, u32 tun_id);
> +int __vlan_tunnel_info_del(struct net_bridge_vlan_group *vg,
> +                           struct net_bridge_vlan *vlan);
> +int nbp_vlan_tunnel_info_delete(struct net_bridge_port *port, u16 vid);
> +int nbp_vlan_tunnel_info_add(struct net_bridge_port *port, u16 vid, u32 tun_id);
> +bool vlan_tunnel_id_isrange(struct net_bridge_vlan *v_end,
> +			    struct net_bridge_vlan *v);
>  
>  static inline struct net_bridge_vlan_group *br_vlan_group(
>  					const struct net_bridge *br)
> diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
> index b6de4f4..2040f08 100644
> --- a/net/bridge/br_vlan.c
> +++ b/net/bridge/br_vlan.c
> @@ -3,6 +3,7 @@
>  #include <linux/rtnetlink.h>
>  #include <linux/slab.h>
>  #include <net/switchdev.h>
> +#include <net/dst_metadata.h>
>  
>  #include "br_private.h"
>  
> @@ -31,6 +32,31 @@ static struct net_bridge_vlan *br_vlan_lookup(struct rhashtable *tbl, u16 vid)
>  	return rhashtable_lookup_fast(tbl, &vid, br_vlan_rht_params);
>  }
>  
> +static inline int br_vlan_tunid_cmp(struct rhashtable_compare_arg *arg,
> +				    const void *ptr)
> +{
> +	const struct net_bridge_vlan *vle = ptr;
> +	__be64 tunid = *(__be64 *)arg->key;
> +
> +	return vle->tinfo.tunnel_id != tunid;
> +}
> +
> +static const struct rhashtable_params br_vlan_tunnel_rht_params = {
> +	.head_offset = offsetof(struct net_bridge_vlan, tnode),
> +	.key_offset = offsetof(struct net_bridge_vlan, tinfo.tunnel_id),
> +	.key_len = sizeof(__be64),
> +	.nelem_hint = 3,
> +	.locks_mul = 1,
> +	.obj_cmpfn = br_vlan_tunid_cmp,
> +	.automatic_shrinking = true,
> +};
> +
> +static struct net_bridge_vlan *br_vlan_tunnel_lookup(struct rhashtable *tbl,
> +						     u64 tunnel_id)
> +{
> +	return rhashtable_lookup_fast(tbl, &tunnel_id, br_vlan_tunnel_rht_params);
> +}
> +
>  static void __vlan_add_pvid(struct net_bridge_vlan_group *vg, u16 vid)
>  {
>  	if (vg->pvid == vid)
> @@ -325,6 +351,7 @@ static void __vlan_group_free(struct net_bridge_vlan_group *vg)
>  {
>  	WARN_ON(!list_empty(&vg->vlan_list));
>  	rhashtable_destroy(&vg->vlan_hash);
> +	rhashtable_destroy(&vg->tunnel_hash);
>  	kfree(vg);
>  }
>  
> @@ -613,6 +640,8 @@ int br_vlan_delete(struct net_bridge *br, u16 vid)
>  	br_fdb_find_delete_local(br, NULL, br->dev->dev_addr, vid);
>  	br_fdb_delete_by_port(br, NULL, vid, 0);
>  
> +	__vlan_tunnel_info_del(vg, v);
> +
>  	return __vlan_del(v);
>  }
>  
> @@ -918,6 +947,9 @@ int br_vlan_init(struct net_bridge *br)
>  	ret = rhashtable_init(&vg->vlan_hash, &br_vlan_rht_params);
>  	if (ret)
>  		goto err_rhtbl;
> +	ret = rhashtable_init(&vg->tunnel_hash, &br_vlan_tunnel_rht_params);
> +	if (ret)
> +		goto err_rhtbl2;
>  	INIT_LIST_HEAD(&vg->vlan_list);
>  	br->vlan_proto = htons(ETH_P_8021Q);
>  	br->default_pvid = 1;
> @@ -932,6 +964,8 @@ int br_vlan_init(struct net_bridge *br)
>  	return ret;
>  
>  err_vlan_add:
> +	rhashtable_destroy(&vg->tunnel_hash);
> +err_rhtbl2:
>  	rhashtable_destroy(&vg->vlan_hash);
>  err_rhtbl:
>  	kfree(vg);
> @@ -960,7 +994,10 @@ int nbp_vlan_init(struct net_bridge_port *p)
>  
>  	ret = rhashtable_init(&vg->vlan_hash, &br_vlan_rht_params);
>  	if (ret)
> -		goto err_rhtbl;
> +		goto err_rhtbl1;
> +	ret = rhashtable_init(&vg->tunnel_hash, &br_vlan_tunnel_rht_params);
> +	if (ret)
> +		goto err_rhtbl2;
>  	INIT_LIST_HEAD(&vg->vlan_list);
>  	rcu_assign_pointer(p->vlgrp, vg);
>  	if (p->br->default_pvid) {
> @@ -976,9 +1013,11 @@ int nbp_vlan_init(struct net_bridge_port *p)
>  err_vlan_add:
>  	RCU_INIT_POINTER(p->vlgrp, NULL);
>  	synchronize_rcu();
> -	rhashtable_destroy(&vg->vlan_hash);
> +	rhashtable_destroy(&vg->tunnel_hash);
>  err_vlan_enabled:
> -err_rhtbl:
> +err_rhtbl2:
> +	rhashtable_destroy(&vg->vlan_hash);
> +err_rhtbl1:
>  	kfree(vg);
>  
>  	goto out;
> @@ -1081,3 +1120,96 @@ void br_vlan_get_stats(const struct net_bridge_vlan *v,
>  		stats->tx_packets += txpackets;
>  	}
>  }
> +
> +bool vlan_tunnel_id_isrange(struct net_bridge_vlan *v_end,
> +			    struct net_bridge_vlan *v)
> +{
> +	/* XXX: check other tunnel attributes */
> +	return (be32_to_cpu(tunnel_id_to_key32(v_end->tinfo.tunnel_id)) -
> +		be32_to_cpu(tunnel_id_to_key32(v->tinfo.tunnel_id)) == 1);
> +}
> +
> +int __vlan_tunnel_info_add(struct net_bridge_vlan_group *vg,
> +			   struct net_bridge_vlan *vlan, u32 tun_id)
> +{
> +	struct metadata_dst *metadata = NULL;
> +	__be64 key = key32_to_tunnel_id(cpu_to_be32(tun_id));
> +	int err;
> +
> +	if (vlan->tinfo.tunnel_dst)
> +		return -EEXIST;
> +
> +	metadata = __ip_tun_set_dst(0, 0, 0, 0, 0, TUNNEL_KEY,
> +				    key, 0);
> +	if (!metadata)
> +		return -EINVAL;
> +
> +	metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX | IP_TUNNEL_INFO_BRIDGE;
> +	vlan->tinfo.tunnel_dst = metadata;
> +	vlan->tinfo.tunnel_id = key;
> +
> +	err = rhashtable_lookup_insert_fast(&vg->tunnel_hash, &vlan->tnode,
> +					    br_vlan_tunnel_rht_params);
> +	if (err)
> +		goto out;
> +
> +	return 0;
> +out:
> +	dst_release(&vlan->tinfo.tunnel_dst->dst);
> +
> +	return err;
> +}
> +
> +int __vlan_tunnel_info_del(struct net_bridge_vlan_group *vg,
> +			   struct net_bridge_vlan *vlan)
> +{
> +	if (vlan->tinfo.tunnel_dst) {
> +		vlan->tinfo.tunnel_id = 0;
> +		dst_release(&vlan->tinfo.tunnel_dst->dst);
> +
> +		rhashtable_remove_fast(&vg->tunnel_hash, &vlan->vnode,
> +				       br_vlan_tunnel_rht_params);
> +	}
> +
> +	return 0;
> +}

I think all of these should be static, if I haven't missed something I don't see them
being used anywhere else

> +
> +/* Must be protected by RTNL.
> + * Must be called with vid in range from 1 to 4094 inclusive.
> + */
> +int nbp_vlan_tunnel_info_add(struct net_bridge_port *port, u16 vid, u32 tun_id)
> +{
> +	struct net_bridge_vlan_group *vg;
> +	struct net_bridge_vlan *vlan;
> +
> +	ASSERT_RTNL();
> +
> +	vg = nbp_vlan_group(port);
> +	vlan = br_vlan_find(vg, vid);
> +	if (!vlan)
> +		return -EINVAL;
> +
> +	__vlan_tunnel_info_add(vg, vlan, tun_id);
> +
> +	return 0;
> +}
> +
> +/* Must be protected by RTNL.
> + * Must be called with vid in range from 1 to 4094 inclusive.
> + */
> +int nbp_vlan_tunnel_info_delete(struct net_bridge_port *port, u16 vid)
> +{
> +	struct net_bridge_vlan_group *vg;
> +	struct net_bridge_vlan *v;
> +
> +	ASSERT_RTNL();
> +
> +	vg = nbp_vlan_group(port);
> +	v = br_vlan_find(vg, vid);
> +	if (!v)
> +		return -ENOENT;
> +
> +	__vlan_tunnel_info_del(vg, v);
> +
> +	return 0;
> +}
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 5/5] bridge: vlan lwt dst_metadata hooks in ingress and egress paths
  2017-01-21  5:46 ` [RFC PATCH net-next 5/5] bridge: vlan lwt dst_metadata hooks in ingress and egress paths Roopa Prabhu
@ 2017-01-22 12:15   ` Nikolay Aleksandrov
  2017-01-22 15:27     ` Roopa Prabhu
  0 siblings, 1 reply; 23+ messages in thread
From: Nikolay Aleksandrov @ 2017-01-22 12:15 UTC (permalink / raw)
  To: Roopa Prabhu, netdev
  Cc: davem, stephen, tgraf, hannes, jbenc, pshelar, dsa, hadi

On 21/01/17 06:46, Roopa Prabhu wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> 
> - ingress hook:
>     - if port is a lwt tunnel port, use tunnel info in
>       attached dst_metadata to map it to a local vlan
> - egress hook:
>     - if port is a lwt tunnel port, use tunnel info attached to
>       vlan to set dst_metadata on the skb
> 
> CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> ---
> CC'ing Nikolay for some more eyes as he has been trying to keep the
> bridge driver fast path lite.
> 
>  net/bridge/br_input.c   |    4 ++++
>  net/bridge/br_private.h |    4 ++++
>  net/bridge/br_vlan.c    |   55 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 63 insertions(+)
> 
> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> index 83f356f..96602a1 100644
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -262,6 +262,10 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
>  		return RX_HANDLER_CONSUMED;
>  
>  	p = br_port_get_rcu(skb->dev);
> +	if (p->flags & BR_LWT_VLAN) {
> +		if (br_handle_ingress_vlan_tunnel(skb, p, nbp_vlan_group_rcu(p)))
> +			goto drop;
> +	}

Is there any reason to do this so early (perhaps netfilter?) ? If not, you can push it to the vlan __allowed_ingress
(and rename that function to something else, it does a hundred additional things)
and avoid this check for all packets if vlans are disabled, thus people using non-vlan filtering
bridge won't have an additional test in their fast path

>  
>  	if (unlikely(is_link_local_ether_addr(dest))) {
>  		u16 fwd_mask = p->br->group_fwd_mask_required;
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index f68e360..68a23c5 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -804,6 +804,10 @@ int __vlan_tunnel_info_del(struct net_bridge_vlan_group *vg,
>  int nbp_vlan_tunnel_info_add(struct net_bridge_port *port, u16 vid, u32 tun_id);
>  bool vlan_tunnel_id_isrange(struct net_bridge_vlan *v_end,
>  			    struct net_bridge_vlan *v);
> +int br_handle_ingress_vlan_tunnel(struct sk_buff *skb, struct net_bridge_port *p,
> +				  struct net_bridge_vlan_group *vg);
> +int br_handle_egress_vlan_tunnel(struct sk_buff *skb,
> +				 struct net_bridge_vlan *vlan);
>  
>  static inline struct net_bridge_vlan_group *br_vlan_group(
>  					const struct net_bridge *br)
> diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
> index 2040f08..6cf2344 100644
> --- a/net/bridge/br_vlan.c
> +++ b/net/bridge/br_vlan.c
> @@ -405,6 +405,11 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
>  
>  	if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)
>  		skb->vlan_tci = 0;
> +
> +	if (br_handle_egress_vlan_tunnel(skb, v)) {
> +		kfree_skb(skb);
> +		return NULL;
> +	}
>  out:
>  	return skb;
>  }
> @@ -1213,3 +1218,53 @@ int nbp_vlan_tunnel_info_delete(struct net_bridge_port *port, u16 vid)
>  
>  	return 0;
>  }
> +
> +int br_handle_ingress_vlan_tunnel(struct sk_buff *skb,
> +				  struct net_bridge_port *p,
> +				  struct net_bridge_vlan_group *vg)
> +{
> +	struct ip_tunnel_info *tinfo = skb_tunnel_info(skb);
> +	struct net_bridge_vlan *vlan;
> +
> +	if (!vg || !tinfo)
> +		return 0;
> +
> +	/* if already tagged, ignore */
> +	if (skb_vlan_tagged(skb))
> +		return 0;
> +
> +	/* lookup vid, given tunnel id */
> +	vlan = br_vlan_tunnel_lookup(&vg->tunnel_hash, tinfo->key.tun_id);
> +	if (!vlan)
> +		return 0;
> +
> +	skb_dst_drop(skb);
> +
> +	__vlan_hwaccel_put_tag(skb, p->br->vlan_proto, vlan->vid);
> +
> +	return 0;
> +}
> +
> +int br_handle_egress_vlan_tunnel(struct sk_buff *skb,
> +				 struct net_bridge_vlan *vlan)
> +{
> +	__be32 tun_id;
> +	int err;
> +
> +	if (!vlan || !vlan->tinfo.tunnel_id)
> +		return 0;
> +
> +	if (unlikely(!skb_vlan_tag_present(skb)))
> +		return 0;
> +
> +	skb_dst_drop(skb);
> +	tun_id = tunnel_id_to_key32(vlan->tinfo.tunnel_id);
> +
> +	err = skb_vlan_pop(skb);
> +	if (err)
> +		return err;
> +
> +	skb_dst_set(skb, dst_clone(&vlan->tinfo.tunnel_dst->dst));
> +
> +	return 0;
> +}
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 2/5] vxlan: make COLLECT_METADATA mode bridge friendly
  2017-01-22 11:40   ` Nikolay Aleksandrov
@ 2017-01-22 15:18     ` Roopa Prabhu
  0 siblings, 0 replies; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-22 15:18 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: netdev, davem, stephen, tgraf, hannes, jbenc, pshelar, dsa, hadi

On 1/22/17, 3:40 AM, Nikolay Aleksandrov wrote:
> On 21/01/17 06:46, Roopa Prabhu wrote:
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> This patch series makes vxlan COLLECT_METADATA mode bridge
>> and layer2 network friendly. Vxlan COLLECT_METADATA mode today
>> solves the per-vni netdev scalability problem in l3 networks.
>> When vxlan collect metadata device participates in bridging
>> vlan to vn-segments, It can only get the vlan mapped vni in
>> the xmit tunnel dst metadata. It will need the vxlan driver to
>> continue learn, hold forwarding state and remote destination
>> information similar to how it already does for non COLLECT_METADATA
>> vxlan netdevices today.
>>
>> Changes introduced by this patch:
>>     - allow learning and forwarding database state to vxlan netdev in
>>       COLLECT_METADATA mode. Current behaviour is not changed
>>       by default. tunnel info flag IP_TUNNEL_INFO_BRIDGE is used
>>       to support the new bridge friendly mode.
>>     - A single fdb table hashed by (mac, vni) to allow fdb entries with
>>       multiple vnis in the same fdb table
>>     - rx path already has the vni
>>     - tx path expects a vni in the packet with dst_metadata
>>     - prior to this series, fdb remote_dsts carried remote vni and
>>       the vxlan device carrying the fdb table represented the
>>       source vni. With the vxlan device now representing multiple vnis,
>>       this patch adds a src vni attribute to the fdb entry. The remote
>>       vni already uses NDA_VNI attribute. This patch introduces
>>       NDA_SRC_VNI netlink attribute to represent the src vni in a multi
>>       vni fdb table.
>>
>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>> ---
> [snip]
>> @@ -2173,23 +2221,29 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
>>  	bool did_rsc = false;
>>  	struct vxlan_rdst *rdst, *fdst = NULL;
>>  	struct vxlan_fdb *f;
>> +	__be32 vni = 0;
>>  
>>  	info = skb_tunnel_info(skb);
>>  
>>  	skb_reset_mac_header(skb);
>>  
>>  	if (vxlan->flags & VXLAN_F_COLLECT_METADATA) {
>> -		if (info && info->mode & IP_TUNNEL_INFO_TX)
>> -			vxlan_xmit_one(skb, dev, NULL, false);
>> -		else
>> -			kfree_skb(skb);
>> -		return NETDEV_TX_OK;
>> +		if (info && info->mode & IP_TUNNEL_INFO_BRIDGE &&
>> +		    info->mode & IP_TUNNEL_INFO_TX) {
> nit: parentheses around the IP_TUNNEL_INFO_TX check
>
>> +			vni = tunnel_id_to_key32(info->key.tun_id);
>> +		} else {
>> +			if (info && info->mode & IP_TUNNEL_INFO_TX)
> nit: parentheses around the IP_TUNNEL_INFO_TX check

ack
>> +				vxlan_xmit_one(skb, dev, vni, NULL, false);
>> +			else
>> +				kfree_skb(skb);
>> +			return NETDEV_TX_OK;
>> +		}
>>  	}
>>  
>>  	if (vxlan->flags & VXLAN_F_PROXY) {
>>  		eth = eth_hdr(skb);
>>  		if (ntohs(eth->h_proto) == ETH_P_ARP)
>> -			return arp_reduce(dev, skb);
>> +			return arp_reduce(dev, skb, vni);
>>  #if IS_ENABLED(CONFIG_IPV6)
>>  		else if (ntohs(eth->h_proto) == ETH_P_IPV6 &&
>>  			 pskb_may_pull(skb, sizeof(struct ipv6hdr)
>> @@ -2200,13 +2254,13 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
>>  				msg = (struct nd_msg *)skb_transport_header(skb);
>>  				if (msg->icmph.icmp6_code == 0 &&
>>  				    msg->icmph.icmp6_type == NDISC_NEIGHBOUR_SOLICITATION)
>> -					return neigh_reduce(dev, skb);
>> +					return neigh_reduce(dev, skb, vni);
>>  		}
>>  #endif
>>  	}
>>  
>>  	eth = eth_hdr(skb);
>> -	f = vxlan_find_mac(vxlan, eth->h_dest);
>> +	f = vxlan_find_mac(vxlan, eth->h_dest, vni);
>>  	did_rsc = false;
>>  
>>  	if (f && (f->flags & NTF_ROUTER) && (vxlan->flags & VXLAN_F_RSC) &&
>> @@ -2214,11 +2268,11 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
>>  	     ntohs(eth->h_proto) == ETH_P_IPV6)) {
>>  		did_rsc = route_shortcircuit(dev, skb);
>>  		if (did_rsc)
>> -			f = vxlan_find_mac(vxlan, eth->h_dest);
>> +			f = vxlan_find_mac(vxlan, eth->h_dest, vni);
>>  	}
>>  
>>  	if (f == NULL) {
>> -		f = vxlan_find_mac(vxlan, all_zeros_mac);
>> +		f = vxlan_find_mac(vxlan, all_zeros_mac, vni);
>>  		if (f == NULL) {
>>  			if ((vxlan->flags & VXLAN_F_L2MISS) &&
>>  			    !is_multicast_ether_addr(eth->h_dest))
>> @@ -2239,11 +2293,11 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
>>  		}
>>  		skb1 = skb_clone(skb, GFP_ATOMIC);
>>  		if (skb1)
>> -			vxlan_xmit_one(skb1, dev, rdst, did_rsc);
>> +			vxlan_xmit_one(skb1, dev, vni, rdst, did_rsc);
>>  	}
>>  
>>  	if (fdst)
>> -		vxlan_xmit_one(skb, dev, fdst, did_rsc);
>> +		vxlan_xmit_one(skb, dev, vni, fdst, did_rsc);
>>  	else
>>  		kfree_skb(skb);
>>  	return NETDEV_TX_OK;
>> @@ -2307,12 +2361,12 @@ static int vxlan_init(struct net_device *dev)
>>  	return 0;
>>  }
>>  
>> -static void vxlan_fdb_delete_default(struct vxlan_dev *vxlan)
>> +static void vxlan_fdb_delete_default(struct vxlan_dev *vxlan, __be32 vni)
>>  {
>>  	struct vxlan_fdb *f;
>>  
>>  	spin_lock_bh(&vxlan->hash_lock);
>> -	f = __vxlan_find_mac(vxlan, all_zeros_mac);
>> +	f = __vxlan_find_mac(vxlan, all_zeros_mac, vni);
>>  	if (f)
>>  		vxlan_fdb_destroy(vxlan, f);
>>  	spin_unlock_bh(&vxlan->hash_lock);
>> @@ -2322,7 +2376,7 @@ static void vxlan_uninit(struct net_device *dev)
>>  {
>>  	struct vxlan_dev *vxlan = netdev_priv(dev);
>>  
>> -	vxlan_fdb_delete_default(vxlan);
>> +	vxlan_fdb_delete_default(vxlan, vxlan->cfg.vni);
>>  
>>  	free_percpu(dev->tstats);
>>  }
>> @@ -2536,6 +2590,8 @@ static void vxlan_setup(struct net_device *dev)
>>  	dev->vlan_features = dev->features;
>>  	dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
>>  	dev->hw_features |= NETIF_F_GSO_SOFTWARE;
>> +	dev->hw_features |= NETIF_F_HW_VLAN_CTAG_TX;
>> +	dev->features |= dev->hw_features;
>>  	netif_keep_dst(dev);
>>  	dev->priv_flags |= IFF_NO_QUEUE;
>>  
>> @@ -2921,6 +2977,7 @@ static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
>>  				       NLM_F_EXCL|NLM_F_CREATE,
>>  				       vxlan->cfg.dst_port,
>>  				       vxlan->default_dst.remote_vni,
>> +				       vxlan->default_dst.remote_vni,
>>  				       vxlan->default_dst.remote_ifindex,
>>  				       NTF_SELF);
>>  		if (err)
>> @@ -2929,7 +2986,7 @@ static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
>>  
>>  	err = register_netdevice(dev);
>>  	if (err) {
>> -		vxlan_fdb_delete_default(vxlan);
>> +		vxlan_fdb_delete_default(vxlan, vxlan->cfg.vni);
>>  		return err;
>>  	}
>>  
>> @@ -3023,19 +3080,19 @@ static int vxlan_newlink(struct net *src_net, struct net_device *dev,
>>  		conf.flags |= VXLAN_F_UDP_ZERO_CSUM_TX;
>>  
>>  	if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX] &&
>> -	    nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
>> +	    !nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]))
>>  		conf.flags |= VXLAN_F_UDP_ZERO_CSUM6_TX;
>>  
>>  	if (data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX] &&
>> -	    nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]))
>> +	    !nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]))
>>  		conf.flags |= VXLAN_F_UDP_ZERO_CSUM6_RX;
>>  
>>  	if (data[IFLA_VXLAN_REMCSUM_TX] &&
>> -	    nla_get_u8(data[IFLA_VXLAN_REMCSUM_TX]))
>> +	    !nla_get_u8(data[IFLA_VXLAN_REMCSUM_TX]))
>>  		conf.flags |= VXLAN_F_REMCSUM_TX;
>>  
>>  	if (data[IFLA_VXLAN_REMCSUM_RX] &&
>> -	    nla_get_u8(data[IFLA_VXLAN_REMCSUM_RX]))
>> +	    !nla_get_u8(data[IFLA_VXLAN_REMCSUM_RX]))
>>  		conf.flags |= VXLAN_F_REMCSUM_RX;
> Aren't these going to break user-space ? 

correct... ignore these. Not intentional. these were from an incorrect merge with an earlier changelink patch i had.
Did not realize these had crept it.

thanks.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 4/5] bridge: vlan lwt and dst_metadata netlink support
  2017-01-22 12:05   ` Nikolay Aleksandrov
@ 2017-01-22 15:23     ` Roopa Prabhu
  0 siblings, 0 replies; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-22 15:23 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: netdev, davem, stephen, tgraf, hannes, jbenc, pshelar, dsa, hadi

On 1/22/17, 4:05 AM, Nikolay Aleksandrov wrote:
> On 21/01/17 06:46, Roopa Prabhu wrote:
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> This patch adds support to attach per vlan tunnel info dst
>> metadata. This enables bridge driver to map vlan to tunnel_info
>> at ingress and egress
>>
>> The initial use case is vlan to vni bridging, but the api is generic
>> to extend to any tunnel_info in the future:
>>     - Uapi to configure/unconfigure/dump per vlan tunnel data
>>     - netlink functions to configure vlan and tunnel_info mapping
>>     - Introduces bridge port flag BR_LWT_VLAN to enable attach/detach
>>     dst_metadata to bridged packets on ports.
>>
>> Use case:
>> example use for this is a vxlan bridging gateway or vtep
>> which maps vlans to vn-segments (or vnis). User can configure
>> per-vlan tunnel information which the bridge driver can use
>> to bridge vlan into the corresponding tunnel.
>>
>> CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>> ---
>> CC'ing Nikolay for some more eyes as he has been trying to keep the
>> bridge driver fast path lite.
>>
>>  include/linux/if_bridge.h |    1 +
>>  net/bridge/br_input.c     |    1 +
>>  net/bridge/br_netlink.c   |  410 ++++++++++++++++++++++++++++++++++++++-------
>>  net/bridge/br_private.h   |   18 ++
>>  net/bridge/br_vlan.c      |  138 ++++++++++++++-
>>  5 files changed, 507 insertions(+), 61 deletions(-)
>>
>> diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
>> index c6587c0..36ff611 100644
>> --- a/include/linux/if_bridge.h
>> +++ b/include/linux/if_bridge.h
>> @@ -46,6 +46,7 @@ struct br_ip_list {
>>  #define BR_LEARNING_SYNC	BIT(9)
>>  #define BR_PROXYARP_WIFI	BIT(10)
>>  #define BR_MCAST_FLOOD		BIT(11)
>> +#define BR_LWT_VLAN		BIT(12)
>>  
>>  #define BR_DEFAULT_AGEING_TIME	(300 * HZ)
>>  
>> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
>> index 855b72f..83f356f 100644
>> --- a/net/bridge/br_input.c
>> +++ b/net/bridge/br_input.c
>> @@ -20,6 +20,7 @@
>>  #include <net/arp.h>
>>  #include <linux/export.h>
>>  #include <linux/rculist.h>
>> +#include <net/dst_metadata.h>
>>  #include "br_private.h"
>>  
>>  /* Hook for brouter */
>> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
>> index 71c7453..df997ad 100644
>> --- a/net/bridge/br_netlink.c
>> +++ b/net/bridge/br_netlink.c
>> @@ -17,17 +17,30 @@
>>  #include <net/net_namespace.h>
>>  #include <net/sock.h>
>>  #include <uapi/linux/if_bridge.h>
>> +#include <net/dst_metadata.h>
>>  
>>  #include "br_private.h"
>>  #include "br_private_stp.h"
>>  
>> -static int __get_num_vlan_infos(struct net_bridge_vlan_group *vg,
>> -				u32 filter_mask)
>> +static size_t br_get_vlan_tinfo_size(void)
>>  {
>> +	return nla_total_size(0) + /* nest IFLA_BRIDGE_VLAN_TUNNEL_INFO */
>> +		  nla_total_size(sizeof(u32)) + /* IFLA_BRIDGE_VLAN_TUNNEL_ID */
>> +		  nla_total_size(sizeof(u16)) + /* IFLA_BRIDGE_VLAN_TUNNEL_VID */
>> +		  nla_total_size(sizeof(u16)); /* IFLA_BRIDGE_VLAN_TUNNEL_FLAGS */
>> +}
>> +
>> +static int __get_num_vlan_infos(struct net_bridge_port *p,
>> +				struct net_bridge_vlan_group *vg,
>> +				u32 filter_mask, int *num_vtinfos)
>> +{
>> +	struct net_bridge_vlan *vbegin = NULL, *vend = NULL;
>> +	struct net_bridge_vlan *vtbegin = NULL, *vtend = NULL;
>>  	struct net_bridge_vlan *v;
>> -	u16 vid_range_start = 0, vid_range_end = 0, vid_range_flags = 0;
>> +	bool get_tinfos = (p && p->flags & BR_LWT_VLAN) ? true: false;
>> +	bool vcontinue, vtcontinue;
>> +	int num_vinfos = 0;
>>  	u16 flags, pvid;
>> -	int num_vlans = 0;
>>  
>>  	if (!(filter_mask & RTEXT_FILTER_BRVLAN_COMPRESSED))
>>  		return 0;
>> @@ -36,6 +49,8 @@ static int __get_num_vlan_infos(struct net_bridge_vlan_group *vg,
>>  	/* Count number of vlan infos */
>>  	list_for_each_entry_rcu(v, &vg->vlan_list, vlist) {
>>  		flags = 0;
>> +		vcontinue = false;
>> +		vtcontinue = false;
>>  		/* only a context, bridge vlan not activated */
>>  		if (!br_vlan_should_use(v))
>>  			continue;
>> @@ -45,47 +60,79 @@ static int __get_num_vlan_infos(struct net_bridge_vlan_group *vg,
>>  		if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)
>>  			flags |= BRIDGE_VLAN_INFO_UNTAGGED;
>>  
>> -		if (vid_range_start == 0) {
>> -			goto initvars;
>> -		} else if ((v->vid - vid_range_end) == 1 &&
>> -			flags == vid_range_flags) {
>> -			vid_range_end = v->vid;
>> +		if (!vbegin) {
>> +			vbegin = v;
>> +			vend = v;
>> +			vcontinue = true;
>> +		} else if ((v->vid - vend->vid) == 1 &&
>> +			flags == vbegin->flags) {
>> +			vend = v;
>> +			vcontinue = true;
>> +		}
>> +
>> +		if (!vcontinue) {
>> +			if ((vend->vid - vbegin->vid) > 0)
>> +				num_vinfos += 2;
>> +			else
>> +				num_vinfos += 1;
>> +		}
>> +
>> +		if (!get_tinfos && !v->tinfo.tunnel_id)
>>  			continue;
>> -		} else {
>> -			if ((vid_range_end - vid_range_start) > 0)
>> -				num_vlans += 2;
>> +
>> +		if (!vtbegin) {
>> +			vtbegin = v;
>> +			vtend = v;
>> +			vtcontinue = true;
>> +		} else if ((v->vid - vtend->vid) == 1 &&
>> +		    vlan_tunnel_id_isrange(vtend, v)) {
>> +			vtend = v;
>> +			vtcontinue = true;
>> +		}
>> +
>> +		if (!vtcontinue) {
>> +			if ((vtend->vid - vtbegin->vid) > 0)
>> +				num_vtinfos += 2;
>>  			else
>> -				num_vlans += 1;
>> +				num_vtinfos += 1;
>> +			vbegin = NULL;
>> +			vend = NULL;
>>  		}
>> -initvars:
>> -		vid_range_start = v->vid;
>> -		vid_range_end = v->vid;
>> -		vid_range_flags = flags;
>>  	}
>>  
>> -	if (vid_range_start != 0) {
>> -		if ((vid_range_end - vid_range_start) > 0)
>> -			num_vlans += 2;
>> +	if (vbegin) {
>> +		if ((vend->vid - vbegin->vid) > 0)
>> +			num_vinfos += 2;
>>  		else
>> -			num_vlans += 1;
>> +			num_vinfos += 1;
>>  	}
>>  
>> -	return num_vlans;
>> +	if (get_tinfos && vtbegin && vtbegin->tinfo.tunnel_id) {
>> +		if ((vtend->vid - vtbegin->vid) > 0)
>> +			*num_vtinfos += 2;
>> +		else
>> +			*num_vtinfos += 1;
>> +	}
>> +
>> +	return num_vinfos;
>>  }
> I think this whole function should be broken into at least a few. It's really difficult
> to parse what's going on.

ack, the vlan range handling is a bit painful and getting uglier. I will see what i can do.
I have added a few more things in this area in my latest code...ie reject tunnel info sets
if the per port flag is not set and fix some error handling paths...will send out next set
early next week.

>
>>  
>> -static int br_get_num_vlan_infos(struct net_bridge_vlan_group *vg,
>> -				 u32 filter_mask)
>> +static int br_get_num_vlan_infos(struct net_bridge_port *p,
>> +				 struct net_bridge_vlan_group *vg,
>> +				 int *num_tinfos, u32 filter_mask)
>>  {
>>  	int num_vlans;
>>  
>>  	if (!vg)
>>  		return 0;
>>  
>> -	if (filter_mask & RTEXT_FILTER_BRVLAN)
>> +	if (filter_mask & RTEXT_FILTER_BRVLAN) {
>> +		*num_tinfos = vg->num_vlans;
>>  		return vg->num_vlans;
>> +	}
>>  
>>  	rcu_read_lock();
>> -	num_vlans = __get_num_vlan_infos(vg, filter_mask);
>> +	num_vlans = __get_num_vlan_infos(p, vg, filter_mask, num_tinfos);
>>  	rcu_read_unlock();
>>  
>>  	return num_vlans;
>> @@ -95,9 +142,10 @@ static size_t br_get_link_af_size_filtered(const struct net_device *dev,
>>  					   u32 filter_mask)
>>  {
>>  	struct net_bridge_vlan_group *vg = NULL;
>> -	struct net_bridge_port *p;
>> +	struct net_bridge_port *p = NULL;
>>  	struct net_bridge *br;
>> -	int num_vlan_infos;
>> +	int num_vlan_infos, num_vlan_tinfos = 0;
>> +	size_t retsize = 0;
>>  
>>  	rcu_read_lock();
>>  	if (br_port_exists(dev)) {
>> @@ -107,11 +155,15 @@ static size_t br_get_link_af_size_filtered(const struct net_device *dev,
>>  		br = netdev_priv(dev);
>>  		vg = br_vlan_group_rcu(br);
>>  	}
>> -	num_vlan_infos = br_get_num_vlan_infos(vg, filter_mask);
>> +	num_vlan_infos = br_get_num_vlan_infos(p, vg, &num_vlan_tinfos,
>> +					       filter_mask);
>>  	rcu_read_unlock();
>>  
>>  	/* Each VLAN is returned in bridge_vlan_info along with flags */
>> -	return num_vlan_infos * nla_total_size(sizeof(struct bridge_vlan_info));
>> +	retsize =  num_vlan_infos * nla_total_size(sizeof(struct bridge_vlan_info)) +
>> +			num_vlan_tinfos * br_get_vlan_tinfo_size();
>> +
>> +	return retsize;
>>  }
>>  
>>  static inline size_t br_port_info_size(void)
>> @@ -191,7 +243,8 @@ static int br_port_fill_attrs(struct sk_buff *skb,
>>  	    nla_put_u16(skb, IFLA_BRPORT_NO, p->port_no) ||
>>  	    nla_put_u8(skb, IFLA_BRPORT_TOPOLOGY_CHANGE_ACK,
>>  		       p->topology_change_ack) ||
>> -	    nla_put_u8(skb, IFLA_BRPORT_CONFIG_PENDING, p->config_pending))
>> +	    nla_put_u8(skb, IFLA_BRPORT_CONFIG_PENDING, p->config_pending) ||
>> +	    nla_put_u8(skb, IFLA_BRPORT_LWT_VLAN, !!(p->flags & BR_LWT_VLAN)))
>>  		return -EMSGSIZE;
>>  
>>  	timerval = br_timer_value(&p->message_age_timer);
>> @@ -216,6 +269,34 @@ static int br_port_fill_attrs(struct sk_buff *skb,
>>  	return 0;
>>  }
>>  
>> +static int br_fill_vlan_tinfo(struct sk_buff *skb, u16 vid,
>> +                              __be64 tunnel_id, u16 flags)
>> +{
>> +    __be32 tid = tunnel_id_to_key32(tunnel_id);
>> +    struct nlattr *tmap;
>> +
>> +	tmap = nla_nest_start(skb, IFLA_BRIDGE_VLAN_TUNNEL_INFO);
>> +	if (!tmap)
>> +		return -EMSGSIZE;
>> +	if (nla_put_u32(skb, IFLA_BRIDGE_VLAN_TUNNEL_ID,
>> +            be32_to_cpu(tid)))
>> +		goto nla_put_failure;
>> +	if (nla_put_u16(skb, IFLA_BRIDGE_VLAN_TUNNEL_VID,
>> +			vid))
>> +		goto nla_put_failure;
>> +	if (nla_put_u16(skb, IFLA_BRIDGE_VLAN_TUNNEL_FLAGS,
>> +			flags))
>> +		goto nla_put_failure;
>> +	nla_nest_end(skb, tmap);
>> +
>> +	return 0;
>> +
>> +nla_put_failure:
>> +	nla_nest_cancel(skb, tmap);
>> +
>> +	return -EMSGSIZE;
>> +}
>> +
>>  static int br_fill_ifvlaninfo_range(struct sk_buff *skb, u16 vid_start,
>>  				    u16 vid_end, u16 flags)
>>  {
>> @@ -249,20 +330,24 @@ static int br_fill_ifvlaninfo_range(struct sk_buff *skb, u16 vid_start,
>>  }
>>  
>>  static int br_fill_ifvlaninfo_compressed(struct sk_buff *skb,
>> +					 struct net_bridge_port *p,
>>  					 struct net_bridge_vlan_group *vg)
>>  {
>> +	struct net_bridge_vlan *vbegin = NULL, *vend = NULL;
>> +	struct net_bridge_vlan *vtbegin = NULL, *vtend = NULL;
>> +	bool fill_tinfos = (p && p->flags & BR_LWT_VLAN) ? true: false;
>>  	struct net_bridge_vlan *v;
>> -	u16 vid_range_start = 0, vid_range_end = 0, vid_range_flags = 0;
>> +	bool vcontinue, vtcontinue;
>>  	u16 flags, pvid;
>> -	int err = 0;
>> +	int err;
>>  
>> -	/* Pack IFLA_BRIDGE_VLAN_INFO's for every vlan
>> -	 * and mark vlan info with begin and end flags
>> -	 * if vlaninfo represents a range
>> -	 */
>>  	pvid = br_get_pvid(vg);
>> +	/* Count number of vlan infos */
>>  	list_for_each_entry_rcu(v, &vg->vlan_list, vlist) {
>>  		flags = 0;
>> +		vcontinue = false;
>> +		vtcontinue = false;
>> +		/* only a context, bridge vlan not activated */
>>  		if (!br_vlan_should_use(v))
>>  			continue;
>>  		if (v->vid == pvid)
>> @@ -271,44 +356,103 @@ static int br_fill_ifvlaninfo_compressed(struct sk_buff *skb,
>>  		if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)
>>  			flags |= BRIDGE_VLAN_INFO_UNTAGGED;
>>  
>> -		if (vid_range_start == 0) {
>> -			goto initvars;
>> -		} else if ((v->vid - vid_range_end) == 1 &&
>> -			flags == vid_range_flags) {
>> -			vid_range_end = v->vid;
>> -			continue;
>> -		} else {
>> -			err = br_fill_ifvlaninfo_range(skb, vid_range_start,
>> -						       vid_range_end,
>> -						       vid_range_flags);
>> +		if (!vbegin) {
>> +			vbegin = v;
>> +			vend = v;
>> +			vcontinue = true;
>> +		} else if ((v->vid - vend->vid) == 1 &&
>> +			flags == vbegin->flags) {
>> +			vend = v;
>> +			vcontinue = true;
>> +		}
>> +
>> +		if (!vcontinue) {
>> +			err = br_fill_ifvlaninfo_range(skb,
>> +						       vbegin->vid,
>> +						       vend->vid,
>> +						       vbegin->flags);
>>  			if (err)
>>  				return err;
>> +			vbegin = vend = NULL;
>> +		}
>> +
>> +		if (!fill_tinfos || !v->tinfo.tunnel_id)
>> +			continue;
>> +
>> +		if (!vtbegin) {
>> +			vtbegin = v;
>> +			vtend = v;
>> +			vtcontinue = true;
>> +		} else if ((v->vid - vtend->vid) == 1 &&
>> +		    vlan_tunnel_id_isrange(vtend, v)) {
>> +			vtend = v;
>> +			vtcontinue = true;
>>  		}
>>  
>> -initvars:
>> -		vid_range_start = v->vid;
>> -		vid_range_end = v->vid;
>> -		vid_range_flags = flags;
>> +		if (!vtcontinue && vtbegin->tinfo.tunnel_id) {
>> +			if ((vtend->vid - vtbegin->vid) > 0) {
>> +				err = br_fill_vlan_tinfo(skb, vbegin->vid,
>> +							 vbegin->tinfo.tunnel_id,
>> +						BRIDGE_VLAN_INFO_RANGE_BEGIN);
>> +				if (err)
>> +					return err;
>> +				err = br_fill_vlan_tinfo(skb, vend->vid,
>> +							 vend->tinfo.tunnel_id,
>> +						BRIDGE_VLAN_INFO_RANGE_END);
>> +				if (err)
>> +					return err;
>> +			} else {
>> +				err = br_fill_vlan_tinfo(skb, vbegin->vid,
>> +							 vbegin->tinfo.tunnel_id,
>> +							 0);
>> +				if (err)
>> +					return err;
>> +			}
>> +			vbegin = NULL;
>> +			vend = NULL;
>> +		}
>>  	}
>>  
>> -	if (vid_range_start != 0) {
>> -		/* Call it once more to send any left over vlans */
>> -		err = br_fill_ifvlaninfo_range(skb, vid_range_start,
>> -					       vid_range_end,
>> -					       vid_range_flags);
>> +	if (vbegin) {
>> +		err = br_fill_ifvlaninfo_range(skb, vbegin->vid,
>> +					       vend->vid,
>> +					       vbegin->flags);
>>  		if (err)
>>  			return err;
>>  	}
>>  
>> +	if (fill_tinfos && vtbegin && vtbegin->tinfo.tunnel_id) {
>> +		if ((vtend->vid - vtbegin->vid) > 0) {
>> +			err = br_fill_vlan_tinfo(skb, vbegin->vid,
>> +						 vbegin->tinfo.tunnel_id,
>> +						 BRIDGE_VLAN_INFO_RANGE_BEGIN);
>> +			if (err)
>> +				return err;
>> +			err = br_fill_vlan_tinfo(skb, vend->vid,
>> +						 vend->tinfo.tunnel_id,
>> +						 BRIDGE_VLAN_INFO_RANGE_END);
>> +			if (err)
>> +				return err;
>> +		} else {
>> +			err = br_fill_vlan_tinfo(skb, vbegin->vid,
>> +						 vbegin->tinfo.tunnel_id, 0);
>> +			if (err)
>> +				return err;
>> +		}
>> +	}
>> +
>>  	return 0;
>>  }
> Maybe look into breaking this one, too. It's getting huge and hard to follow..
>
>>  
>>  static int br_fill_ifvlaninfo(struct sk_buff *skb,
>> +			      struct net_bridge_port *p,
>>  			      struct net_bridge_vlan_group *vg)
>>  {
>>  	struct bridge_vlan_info vinfo;
>>  	struct net_bridge_vlan *v;
>> +	bool fill_tinfos = (p && p->flags & BR_LWT_VLAN) ? true : false;
>>  	u16 pvid;
>> +	int err;
>>  
>>  	pvid = br_get_pvid(vg);
>>  	list_for_each_entry_rcu(v, &vg->vlan_list, vlist) {
>> @@ -326,6 +470,14 @@ static int br_fill_ifvlaninfo(struct sk_buff *skb,
>>  		if (nla_put(skb, IFLA_BRIDGE_VLAN_INFO,
>>  			    sizeof(vinfo), &vinfo))
>>  			goto nla_put_failure;
>> +
>> +		if (!fill_tinfos || !v->tinfo.tunnel_id)
>> +			continue;
>> +
>> +		err = br_fill_vlan_tinfo(skb, v->vid,
>> +					 v->tinfo.tunnel_id, 0);
>> +		if (err)
>> +			return err;
>>  	}
>>  
>>  	return 0;
>> @@ -411,9 +563,9 @@ static int br_fill_ifinfo(struct sk_buff *skb,
>>  			goto nla_put_failure;
>>  		}
>>  		if (filter_mask & RTEXT_FILTER_BRVLAN_COMPRESSED)
>> -			err = br_fill_ifvlaninfo_compressed(skb, vg);
>> +			err = br_fill_ifvlaninfo_compressed(skb, port, vg);
>>  		else
>> -			err = br_fill_ifvlaninfo(skb, vg);
>> +			err = br_fill_ifvlaninfo(skb, port, vg);
>>  		rcu_read_unlock();
>>  		if (err)
>>  			goto nla_put_failure;
>> @@ -514,6 +666,127 @@ static int br_vlan_info(struct net_bridge *br, struct net_bridge_port *p,
>>  	return err;
>>  }
>>  
>> +static const struct nla_policy vlan_tunnel_policy[IFLA_BRIDGE_VLAN_TUNNEL_MAX + 1] = {
>> +	[IFLA_BRIDGE_VLAN_TUNNEL_ID]= { .type = NLA_U32 },
>> +	[IFLA_BRIDGE_VLAN_TUNNEL_VID] = { .type = NLA_U16 },
>> +	[IFLA_BRIDGE_VLAN_TUNNEL_FLAGS] = { .type = NLA_U16 },
>> +};
>> +
>> +static int br_add_vlan_tunnel_info(struct net_bridge *br,
>> +				   struct net_bridge_port *p, int cmd,
>> +				   u16 vid, u32 tun_id)
>> +{
>> +	int err;
>> +
>> +	switch (cmd) {
>> +	case RTM_SETLINK:
>> +		if (p) {
>> +			/* if the MASTER flag is set this will act on the global
>> +			 * per-VLAN entry as well
>> +			 */
>> +			err = nbp_vlan_tunnel_info_add(p, vid, tun_id);
>> +			if (err)
>> +				break;
>> +		} else {
>> +			return -EINVAL;
>> +		}
>> +
>> +		break;
>> +
>> +	case RTM_DELLINK:
>> +		if (p)
>> +			nbp_vlan_tunnel_info_delete(p, vid);
>> +		else
>> +			return -EINVAL;
>> +		break;
>> +	}
> so if (!p) return -einval in the beginning ? :-)
>
>> +
>> +	return 0;
>> +}
>> +
>> +struct vtunnel_info {
>> +	u32	tunid;
>> +	u16	vid;
>> +	u16	flags;
>> +};
>> +
>> +static int br_parse_vlan_tunnel_info(struct nlattr *attr,
>> +				     struct vtunnel_info *tinfo)
>> +{
>> +	struct nlattr *tb[IFLA_BRIDGE_VLAN_TUNNEL_MAX + 1];
>> +	u32 tun_id;
>> +	u16 vid, flags;
>> +	int err;
>> +
>> +	err = nla_parse_nested(tb, IFLA_BRIDGE_VLAN_TUNNEL_MAX,
>> +			       attr, vlan_tunnel_policy);
>> +	if (err < 0)
>> +		return err;
>> +
>> +	if (tb[IFLA_BRIDGE_VLAN_TUNNEL_ID])
>> +		tun_id = nla_get_u32(tb[IFLA_BRIDGE_VLAN_TUNNEL_ID]);
>> +	else
>> +		return -EINVAL;
>> +
>> +	if (tb[IFLA_BRIDGE_VLAN_TUNNEL_VID]) {
>> +		vid = nla_get_u16(tb[IFLA_BRIDGE_VLAN_TUNNEL_VID]);
>> +		if (vid >= VLAN_VID_MASK)
>> +			return -ERANGE;
> !vid || 
>
>> +	} else {
>> +		return -EINVAL;
>> +	}
> these attr checks can be moved in the beginning, usually there's a check for existence
> of the mandatory attributes, then you can continue just using them and avoid these
> "else" statements
>
>> +
>> +	if (tb[IFLA_BRIDGE_VLAN_TUNNEL_FLAGS])
>> +		flags = nla_get_u16(tb[IFLA_BRIDGE_VLAN_TUNNEL_FLAGS]);
>> +
>> +	tinfo->tunid = tun_id;
>> +	tinfo->vid = vid;
>> +	tinfo->flags = flags;
>> +
>> +	return 0;
>> +}
>> +
>> +static int br_process_vlan_tunnel_info(struct net_bridge *br,
>> +				       struct net_bridge_port *p, int cmd,
>> +				       struct vtunnel_info *tinfo_curr,
>> +				       struct vtunnel_info *tinfo_last)
>> +{
>> +	int t, v;
>> +	int err;
>> +
>> +	if (tinfo_curr->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN) {
>> +		if (tinfo_last->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN)
>> +			return -EINVAL;
>> +		memcpy(tinfo_last, tinfo_curr, sizeof(struct vtunnel_info));
>> +	} else if (tinfo_curr->flags & BRIDGE_VLAN_INFO_RANGE_END) {
>> +		if (!(tinfo_last->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN))
>> +			return -EINVAL;
>> +		if ((tinfo_curr->vid - tinfo_last->vid) !=
>> +		    (tinfo_curr->tunid - tinfo_last->tunid))
>> +			return -EINVAL;
>> +		/* XXX: tun id and vlan id attrs must be same
>> +		 */
>> +		t = tinfo_last->tunid;
>> +		for (v = tinfo_last->vid; v <= tinfo_curr->vid; v++) {
>> +			err = br_add_vlan_tunnel_info(br, p, cmd,
>> +							  v, t);
>> +			if (err)
>> +				return err;
>> +			t++;
>> +		}
>> +		memset(tinfo_last, 0, sizeof(struct vtunnel_info));
>> +		memset(tinfo_curr, 0, sizeof(struct vtunnel_info));
>> +	} else {
>> +		err = br_add_vlan_tunnel_info(br, p, cmd,
>> +					      tinfo_curr->vid,
>> +					      tinfo_curr->tunid);
>> +		if (err)
>> +			return err;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  static int br_afspec(struct net_bridge *br,
>>  		     struct net_bridge_port *p,
>>  		     struct nlattr *af_spec,
>> @@ -522,10 +795,30 @@ static int br_afspec(struct net_bridge *br,
>>  	struct bridge_vlan_info *vinfo_start = NULL;
>>  	struct bridge_vlan_info *vinfo = NULL;
>>  	struct nlattr *attr;
>> +	struct vtunnel_info tinfo_last = {
>> +		.tunid = 0,
>> +		.vid = 0,
>> +		.flags = 0};
>> +	struct vtunnel_info tinfo_curr = {
>> +		.tunid = 0,
>> +		.vid = 0,
>> +		.flags = 0};
> just { } should be enough to zero the structs

ack
>
>>  	int err = 0;
>>  	int rem;
>>  
>>  	nla_for_each_nested(attr, af_spec, rem) {
>> +		if (nla_type(attr) == IFLA_BRIDGE_VLAN_TUNNEL_INFO) {
>> +			err = br_parse_vlan_tunnel_info(attr, &tinfo_curr);
>> +			if (err)
>> +				return err;
>> +			err = br_process_vlan_tunnel_info(br, p, cmd,
>> +							  &tinfo_curr,
>> +							  &tinfo_last);
>> +			if (err)
>> +				return err;
>> +			continue;
>> +		}
>> +
>>  		if (nla_type(attr) != IFLA_BRIDGE_VLAN_INFO)
>>  			continue;
>>  		if (nla_len(attr) != sizeof(struct bridge_vlan_info))
>> @@ -638,6 +931,7 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[])
>>  	br_set_port_flag(p, tb, IFLA_BRPORT_MCAST_FLOOD, BR_MCAST_FLOOD);
>>  	br_set_port_flag(p, tb, IFLA_BRPORT_PROXYARP, BR_PROXYARP);
>>  	br_set_port_flag(p, tb, IFLA_BRPORT_PROXYARP_WIFI, BR_PROXYARP_WIFI);
>> +	br_set_port_flag(p, tb, IFLA_BRPORT_LWT_VLAN, BR_LWT_VLAN);
>>  
>>  	if (tb[IFLA_BRPORT_COST]) {
>>  		err = br_stp_set_path_cost(p, nla_get_u32(tb[IFLA_BRPORT_COST]));
>> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
>> index 8ce621e..f68e360 100644
>> --- a/net/bridge/br_private.h
>> +++ b/net/bridge/br_private.h
>> @@ -91,6 +91,11 @@ struct br_vlan_stats {
>>  	struct u64_stats_sync syncp;
>>  };
>>  
>> +struct br_tunnel_info {
>> +	__be64			tunnel_id;
>> +	struct metadata_dst	*tunnel_dst;
>> +};
>> +
>>  /**
>>   * struct net_bridge_vlan - per-vlan entry
>>   *
>> @@ -113,6 +118,7 @@ struct br_vlan_stats {
>>   */
>>  struct net_bridge_vlan {
>>  	struct rhash_head		vnode;
>> +	struct rhash_head		tnode;
>>  	u16				vid;
>>  	u16				flags;
>>  	struct br_vlan_stats __percpu	*stats;
>> @@ -124,6 +130,9 @@ struct net_bridge_vlan {
>>  		atomic_t		refcnt;
>>  		struct net_bridge_vlan	*brvlan;
>>  	};
>> +
>> +	struct br_tunnel_info		tinfo;
>> +
>>  	struct list_head		vlist;
>>  
>>  	struct rcu_head			rcu;
>> @@ -145,6 +154,7 @@ struct net_bridge_vlan {
>>   */
>>  struct net_bridge_vlan_group {
>>  	struct rhashtable		vlan_hash;
>> +	struct rhashtable		tunnel_hash;
>>  	struct list_head		vlan_list;
>>  	u16				num_vlans;
>>  	u16				pvid;
>> @@ -786,6 +796,14 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
>>  int nbp_get_num_vlan_infos(struct net_bridge_port *p, u32 filter_mask);
>>  void br_vlan_get_stats(const struct net_bridge_vlan *v,
>>  		       struct br_vlan_stats *stats);
>> +int __vlan_tunnel_info_add(struct net_bridge_vlan_group *vg,
>> +			   struct net_bridge_vlan *vlan, u32 tun_id);
>> +int __vlan_tunnel_info_del(struct net_bridge_vlan_group *vg,
>> +                           struct net_bridge_vlan *vlan);
>> +int nbp_vlan_tunnel_info_delete(struct net_bridge_port *port, u16 vid);
>> +int nbp_vlan_tunnel_info_add(struct net_bridge_port *port, u16 vid, u32 tun_id);
>> +bool vlan_tunnel_id_isrange(struct net_bridge_vlan *v_end,
>> +			    struct net_bridge_vlan *v);
>>  
>>  static inline struct net_bridge_vlan_group *br_vlan_group(
>>  					const struct net_bridge *br)
>> diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
>> index b6de4f4..2040f08 100644
>> --- a/net/bridge/br_vlan.c
>> +++ b/net/bridge/br_vlan.c
>> @@ -3,6 +3,7 @@
>>  #include <linux/rtnetlink.h>
>>  #include <linux/slab.h>
>>  #include <net/switchdev.h>
>> +#include <net/dst_metadata.h>
>>  
>>  #include "br_private.h"
>>  
>> @@ -31,6 +32,31 @@ static struct net_bridge_vlan *br_vlan_lookup(struct rhashtable *tbl, u16 vid)
>>  	return rhashtable_lookup_fast(tbl, &vid, br_vlan_rht_params);
>>  }
>>  
>> +static inline int br_vlan_tunid_cmp(struct rhashtable_compare_arg *arg,
>> +				    const void *ptr)
>> +{
>> +	const struct net_bridge_vlan *vle = ptr;
>> +	__be64 tunid = *(__be64 *)arg->key;
>> +
>> +	return vle->tinfo.tunnel_id != tunid;
>> +}
>> +
>> +static const struct rhashtable_params br_vlan_tunnel_rht_params = {
>> +	.head_offset = offsetof(struct net_bridge_vlan, tnode),
>> +	.key_offset = offsetof(struct net_bridge_vlan, tinfo.tunnel_id),
>> +	.key_len = sizeof(__be64),
>> +	.nelem_hint = 3,
>> +	.locks_mul = 1,
>> +	.obj_cmpfn = br_vlan_tunid_cmp,
>> +	.automatic_shrinking = true,
>> +};
>> +
>> +static struct net_bridge_vlan *br_vlan_tunnel_lookup(struct rhashtable *tbl,
>> +						     u64 tunnel_id)
>> +{
>> +	return rhashtable_lookup_fast(tbl, &tunnel_id, br_vlan_tunnel_rht_params);
>> +}
>> +
>>  static void __vlan_add_pvid(struct net_bridge_vlan_group *vg, u16 vid)
>>  {
>>  	if (vg->pvid == vid)
>> @@ -325,6 +351,7 @@ static void __vlan_group_free(struct net_bridge_vlan_group *vg)
>>  {
>>  	WARN_ON(!list_empty(&vg->vlan_list));
>>  	rhashtable_destroy(&vg->vlan_hash);
>> +	rhashtable_destroy(&vg->tunnel_hash);
>>  	kfree(vg);
>>  }
>>  
>> @@ -613,6 +640,8 @@ int br_vlan_delete(struct net_bridge *br, u16 vid)
>>  	br_fdb_find_delete_local(br, NULL, br->dev->dev_addr, vid);
>>  	br_fdb_delete_by_port(br, NULL, vid, 0);
>>  
>> +	__vlan_tunnel_info_del(vg, v);
>> +
>>  	return __vlan_del(v);
>>  }
>>  
>> @@ -918,6 +947,9 @@ int br_vlan_init(struct net_bridge *br)
>>  	ret = rhashtable_init(&vg->vlan_hash, &br_vlan_rht_params);
>>  	if (ret)
>>  		goto err_rhtbl;
>> +	ret = rhashtable_init(&vg->tunnel_hash, &br_vlan_tunnel_rht_params);
>> +	if (ret)
>> +		goto err_rhtbl2;
>>  	INIT_LIST_HEAD(&vg->vlan_list);
>>  	br->vlan_proto = htons(ETH_P_8021Q);
>>  	br->default_pvid = 1;
>> @@ -932,6 +964,8 @@ int br_vlan_init(struct net_bridge *br)
>>  	return ret;
>>  
>>  err_vlan_add:
>> +	rhashtable_destroy(&vg->tunnel_hash);
>> +err_rhtbl2:
>>  	rhashtable_destroy(&vg->vlan_hash);
>>  err_rhtbl:
>>  	kfree(vg);
>> @@ -960,7 +994,10 @@ int nbp_vlan_init(struct net_bridge_port *p)
>>  
>>  	ret = rhashtable_init(&vg->vlan_hash, &br_vlan_rht_params);
>>  	if (ret)
>> -		goto err_rhtbl;
>> +		goto err_rhtbl1;
>> +	ret = rhashtable_init(&vg->tunnel_hash, &br_vlan_tunnel_rht_params);
>> +	if (ret)
>> +		goto err_rhtbl2;
>>  	INIT_LIST_HEAD(&vg->vlan_list);
>>  	rcu_assign_pointer(p->vlgrp, vg);
>>  	if (p->br->default_pvid) {
>> @@ -976,9 +1013,11 @@ int nbp_vlan_init(struct net_bridge_port *p)
>>  err_vlan_add:
>>  	RCU_INIT_POINTER(p->vlgrp, NULL);
>>  	synchronize_rcu();
>> -	rhashtable_destroy(&vg->vlan_hash);
>> +	rhashtable_destroy(&vg->tunnel_hash);
>>  err_vlan_enabled:
>> -err_rhtbl:
>> +err_rhtbl2:
>> +	rhashtable_destroy(&vg->vlan_hash);
>> +err_rhtbl1:
>>  	kfree(vg);
>>  
>>  	goto out;
>> @@ -1081,3 +1120,96 @@ void br_vlan_get_stats(const struct net_bridge_vlan *v,
>>  		stats->tx_packets += txpackets;
>>  	}
>>  }
>> +
>> +bool vlan_tunnel_id_isrange(struct net_bridge_vlan *v_end,
>> +			    struct net_bridge_vlan *v)
>> +{
>> +	/* XXX: check other tunnel attributes */
>> +	return (be32_to_cpu(tunnel_id_to_key32(v_end->tinfo.tunnel_id)) -
>> +		be32_to_cpu(tunnel_id_to_key32(v->tinfo.tunnel_id)) == 1);
>> +}
>> +
>> +int __vlan_tunnel_info_add(struct net_bridge_vlan_group *vg,
>> +			   struct net_bridge_vlan *vlan, u32 tun_id)
>> +{
>> +	struct metadata_dst *metadata = NULL;
>> +	__be64 key = key32_to_tunnel_id(cpu_to_be32(tun_id));
>> +	int err;
>> +
>> +	if (vlan->tinfo.tunnel_dst)
>> +		return -EEXIST;
>> +
>> +	metadata = __ip_tun_set_dst(0, 0, 0, 0, 0, TUNNEL_KEY,
>> +				    key, 0);
>> +	if (!metadata)
>> +		return -EINVAL;
>> +
>> +	metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX | IP_TUNNEL_INFO_BRIDGE;
>> +	vlan->tinfo.tunnel_dst = metadata;
>> +	vlan->tinfo.tunnel_id = key;
>> +
>> +	err = rhashtable_lookup_insert_fast(&vg->tunnel_hash, &vlan->tnode,
>> +					    br_vlan_tunnel_rht_params);
>> +	if (err)
>> +		goto out;
>> +
>> +	return 0;
>> +out:
>> +	dst_release(&vlan->tinfo.tunnel_dst->dst);
>> +
>> +	return err;
>> +}
>> +
>> +int __vlan_tunnel_info_del(struct net_bridge_vlan_group *vg,
>> +			   struct net_bridge_vlan *vlan)
>> +{
>> +	if (vlan->tinfo.tunnel_dst) {
>> +		vlan->tinfo.tunnel_id = 0;
>> +		dst_release(&vlan->tinfo.tunnel_dst->dst);
>> +
>> +		rhashtable_remove_fast(&vg->tunnel_hash, &vlan->vnode,
>> +				       br_vlan_tunnel_rht_params);
>> +	}
>> +
>> +	return 0;
>> +}
> I think all of these should be static, if I haven't missed something I don't see them
> being used anywhere else.

yep, moved most of them to static....in the latest version.
>
>> +
>> +/* Must be protected by RTNL.
>> + * Must be called with vid in range from 1 to 4094 inclusive.
>> + */
>> +int nbp_vlan_tunnel_info_add(struct net_bridge_port *port, u16 vid, u32 tun_id)
>> +{
>> +	struct net_bridge_vlan_group *vg;
>> +	struct net_bridge_vlan *vlan;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	vg = nbp_vlan_group(port);
>> +	vlan = br_vlan_find(vg, vid);
>> +	if (!vlan)
>> +		return -EINVAL;
>> +
>> +	__vlan_tunnel_info_add(vg, vlan, tun_id);
>> +
>> +	return 0;
>> +}
>> +
>> +/* Must be protected by RTNL.
>> + * Must be called with vid in range from 1 to 4094 inclusive.
>> + */
>> +int nbp_vlan_tunnel_info_delete(struct net_bridge_port *port, u16 vid)
>> +{
>> +	struct net_bridge_vlan_group *vg;
>> +	struct net_bridge_vlan *v;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	vg = nbp_vlan_group(port);
>> +	v = br_vlan_find(vg, vid);
>> +	if (!v)
>> +		return -ENOENT;
>> +
>> +	__vlan_tunnel_info_del(vg, v);
>> +
>> +	return 0;
>> +}
>>
thanks

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 5/5] bridge: vlan lwt dst_metadata hooks in ingress and egress paths
  2017-01-22 12:15   ` Nikolay Aleksandrov
@ 2017-01-22 15:27     ` Roopa Prabhu
  0 siblings, 0 replies; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-22 15:27 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: netdev, davem, stephen, tgraf, hannes, jbenc, pshelar, dsa, hadi

On 1/22/17, 4:15 AM, Nikolay Aleksandrov wrote:
> On 21/01/17 06:46, Roopa Prabhu wrote:
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> - ingress hook:
>>     - if port is a lwt tunnel port, use tunnel info in
>>       attached dst_metadata to map it to a local vlan
>> - egress hook:
>>     - if port is a lwt tunnel port, use tunnel info attached to
>>       vlan to set dst_metadata on the skb
>>
>> CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>> ---
>> CC'ing Nikolay for some more eyes as he has been trying to keep the
>> bridge driver fast path lite.
>>
>>  net/bridge/br_input.c   |    4 ++++
>>  net/bridge/br_private.h |    4 ++++
>>  net/bridge/br_vlan.c    |   55 +++++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 63 insertions(+)
>>
>> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
>> index 83f356f..96602a1 100644
>> --- a/net/bridge/br_input.c
>> +++ b/net/bridge/br_input.c
>> @@ -262,6 +262,10 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
>>  		return RX_HANDLER_CONSUMED;
>>  
>>  	p = br_port_get_rcu(skb->dev);
>> +	if (p->flags & BR_LWT_VLAN) {
>> +		if (br_handle_ingress_vlan_tunnel(skb, p, nbp_vlan_group_rcu(p)))
>> +			goto drop;
>> +	}
> Is there any reason to do this so early (perhaps netfilter?) ? If not, you can push it to the vlan __allowed_ingress
> (and rename that function to something else, it does a hundred additional things)
> and avoid this check for all packets if vlans are disabled, thus people using non-vlan filtering
> bridge won't have an additional test in their fast path
>
>
yes, forgot to mention it in the commit log. I had it close to __allowed_ingress in my first version...had to move it up here
because br_nf_pre_routing/br_nf_pre_routing_finish reset the dst...and hence already late..

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC PATCH net-next 4/5] bridge: vlan lwt and dst_metadata netlink support
  2017-01-21  5:46 ` [RFC PATCH net-next 4/5] bridge: vlan lwt and dst_metadata netlink support Roopa Prabhu
  2017-01-22 12:05   ` Nikolay Aleksandrov
@ 2017-01-23  0:22   ` Rosen, Rami
  2017-01-23 15:39     ` Roopa Prabhu
  1 sibling, 1 reply; 23+ messages in thread
From: Rosen, Rami @ 2017-01-23  0:22 UTC (permalink / raw)
  To: Roopa Prabhu, netdev@vger.kernel.org
  Cc: davem@davemloft.net, stephen@networkplumber.org,
	nikolay@cumulusnetworks.com, tgraf@suug.ch,
	hannes@stressinduktion.org, jbenc@redhat.com, pshelar@ovn.org,
	dsa@cumulusnetworks.com, hadi@mojatatu.com

Hi, Roopa,

Two minor comments:

The parameter br is not used in the br_add_vlan_tunnel_info() method, it should be removed:

+static int br_add_vlan_tunnel_info(struct net_bridge *br,
+				   struct net_bridge_port *p, int cmd,
+				   u16 vid, u32 tun_id)
+{
+	int err;
+
+	switch (cmd) {
+	case RTM_SETLINK:
+		if (p) {
+			/* if the MASTER flag is set this will act on the global
+			 * per-VLAN entry as well
+			 */
+			err = nbp_vlan_tunnel_info_add(p, vid, tun_id);
+			if (err)
+				break;
+		} else {
+			return -EINVAL;
+		}
+
+		break;
+
+	case RTM_DELLINK:
+		if (p)
+			nbp_vlan_tunnel_info_delete(p, vid);
+		else
+			return -EINVAL;
+		break;
+	}
+
+	return 0;
+}
+

The parameter br is used inside br_process_vlan_tunnel_info() only in the two 
Cases, when br_add_vlan_tunnel_info() is invoked. Since we saw earlier that it should be removed from br_add_vlan_tunnel_info(), it should also be removed from br_process_vlan_tunnel_info() as it is not needed anymore:

+static int br_process_vlan_tunnel_info(struct net_bridge *br,
+				       struct net_bridge_port *p, int cmd,
+				       struct vtunnel_info *tinfo_curr,
+				       struct vtunnel_info *tinfo_last) {
+	int t, v;
+	int err;
+
+	if (tinfo_curr->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN) {
+		if (tinfo_last->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN)
+			return -EINVAL;
+		memcpy(tinfo_last, tinfo_curr, sizeof(struct vtunnel_info));
+	} else if (tinfo_curr->flags & BRIDGE_VLAN_INFO_RANGE_END) {
+		if (!(tinfo_last->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN))
+			return -EINVAL;
+		if ((tinfo_curr->vid - tinfo_last->vid) !=
+		    (tinfo_curr->tunid - tinfo_last->tunid))
+			return -EINVAL;
+		/* XXX: tun id and vlan id attrs must be same
+		 */
+		t = tinfo_last->tunid;
+		for (v = tinfo_last->vid; v <= tinfo_curr->vid; v++) {
+			err = br_add_vlan_tunnel_info(br, p, cmd,
+							  v, t);
+			if (err)
+				return err;
+			t++;
+		}
+		memset(tinfo_last, 0, sizeof(struct vtunnel_info));
+		memset(tinfo_curr, 0, sizeof(struct vtunnel_info));
+	} else {
+		err = br_add_vlan_tunnel_info(br, p, cmd,
+					      tinfo_curr->vid,
+					      tinfo_curr->tunid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+

Regards,
Rami Rosen

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support
  2017-01-21  5:46 [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support Roopa Prabhu
                   ` (4 preceding siblings ...)
  2017-01-21  5:46 ` [RFC PATCH net-next 5/5] bridge: vlan lwt dst_metadata hooks in ingress and egress paths Roopa Prabhu
@ 2017-01-23  8:08 ` Jiri Pirko
  2017-01-23  8:51   ` Jiri Benc
  2017-01-24 15:47 ` Stephen Hemminger
  6 siblings, 1 reply; 23+ messages in thread
From: Jiri Pirko @ 2017-01-23  8:08 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: netdev, davem, stephen, nikolay, tgraf, hannes, jbenc, pshelar,
	dsa, hadi

Sat, Jan 21, 2017 at 06:46:51AM CET, roopa@cumulusnetworks.com wrote:
>From: Roopa Prabhu <roopa@cumulusnetworks.com>
>
>High level summary:
>lwt and dst_metadata/collect_metadata have enabled vxlan l3 deployments
>to use a single vxlan netdev for multiple vnis eliminating the scalability
>problem with using a single vxlan netdev per vni. This series tries to
>do the same for vxlan netdevs in pure l2 bridged networks.
>Use-case/deployment and details are below.
>
>Deployment scerario details:
>As we know VXLAN is used to build layer 2 virtual networks across the
>underlay layer3 infrastructure. A VXLAN tunnel endpoint (VTEP)
>originates and terminates VXLAN tunnels. And a VTEP can be a TOR switch
>or a vswitch in the hypervisor. This patch series mainly
>focuses on the TOR switch configured as a Vtep. Vxlan segment ID (vni)
>along with vlan id is used to identify layer 2 segments in a vxlan
>overlay network. Vxlan bridging is the function provided by Vteps to terminate
>vxlan tunnels and map the vxlan vni to traditional end host vlan. This is
>covered in the "VXLAN Deployment Scenarios" in sections 6 and 6.1 in RFC 7348.
>To provide vxlan bridging function, a vtep has to map vlan to a vni. The rfc
>says that the ingress VTEP device shall remove the IEEE 802.1Q VLAN tag in
>the original Layer 2 packet if there is one before encapsulating the packet
>into the VXLAN format to transmit it through the underlay network. The remote
>VTEP devices have information about the VLAN in which the packet will be
>placed based on their own VLAN-to-VXLAN VNI mapping configurations.
>
>Existing solution:
>Without this patch series one can deploy such a vtep configuration by
>by adding the local ports and vxlan netdevs into a vlan filtering bridge.
>The local ports are configured as trunk ports carrying all vlans.
>A vxlan netdev per vni is added to the bridge. Vlan mapping to vni is
>achieved by configuring the vlan as pvid on the corresponding vxlan netdev.
>The vxlan netdev only receives traffic corresponding to the vlan it is mapped
>to. This configuration maps traffic belonging to a vlan to the corresponding
>vxlan segment.
>
>          -----------------------------------
>         |              bridge               |
>         |                                   |
>          -----------------------------------
>            |100,200       |100 (pvid)    |200 (pvid)
>            |              |              |
>           swp1          vxlan1000      vxlan2000
>                    
>This provides the required vxlan bridging function but poses a
>scalability problem with using a single vxlan netdev for each vni.
>
>Solution in this patch series:
>The Goal is to use a single vxlan device to carry all vnis similar
>to the vxlan collect metadata mode but vxlan driver still carrying all
>the forwarding information.
>- vxlan driver changes:
>    - enable collect metadata mode device to be used with learning,
>      replication, fdb
>    - A single fdb table hashed by (mac, vni)
>    - rx path already has the vni
>    - tx path expects a vni in the packet with dst_metadata and vxlan
>      driver has all the forwarding information for the vni in the
>      dst_metadata.
>
>- Bridge driver changes: per vlan LWT and dst_metadata support:
>    - Our use case is vxlan and 1-1 mapping between vlan and vni, but I have
>      kept the api generic for any tunnel info
>    - Uapi to configure/unconfigure/dump per vlan tunnel data
>    - new bridge port flag to turn this feature on/off. off by default
>    - ingress hook:
>        - if port is a lwt tunnel port, use tunnel info in
>          attached dst_metadata to map it to a local vlan
>    - egress hook:
>        - if port is a lwt tunnel port, use tunnel info attached to vlan
>          to set dst_metadata on the skb
>
>Other approaches tried and vetoed:
>- tc vlan push/pop and tunnel metadata dst:
>    - posses a tc rule scalability problem (2 rules per vni)

Why it is a problem?


>    - cannot handle the case where a packet needs to be replicated to
>      multiple vxlan remote tunnel end-points.. which the vxlan driver
>      can do today by having multiple remote destinations per fdb.

Can't you just extend the tc to support this?


To me, looks like the tc is the correct place to hangle this. Then, the
user can use it for multiple cases of forwarding, including bridge,
tc-mirred, ovs and others. Putting this in bridge somehow seems wrong in
this light. Also, the bridge code is polluted enough as it is. I this we
should be super-picky to add another code there.

Can you please elaborate more why can't have this as a re-usable TC solution?


Thanks.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support
  2017-01-23  8:08 ` [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support Jiri Pirko
@ 2017-01-23  8:51   ` Jiri Benc
  2017-01-23 16:13     ` Roopa Prabhu
  0 siblings, 1 reply; 23+ messages in thread
From: Jiri Benc @ 2017-01-23  8:51 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Roopa Prabhu, netdev, davem, stephen, nikolay, tgraf, hannes,
	pshelar, dsa, hadi

On Mon, 23 Jan 2017 09:08:05 +0100, Jiri Pirko wrote:
> Sat, Jan 21, 2017 at 06:46:51AM CET, roopa@cumulusnetworks.com wrote:
> >Other approaches tried and vetoed:
> >- tc vlan push/pop and tunnel metadata dst:
> >    - posses a tc rule scalability problem (2 rules per vni)
> 
> Why it is a problem?

Wanted to ask exactly the same question.

> >    - cannot handle the case where a packet needs to be replicated to
> >      multiple vxlan remote tunnel end-points.. which the vxlan driver
> >      can do today by having multiple remote destinations per fdb.
> 
> Can't you just extend the tc to support this?

+1

> To me, looks like the tc is the correct place to hangle this. Then, the
> user can use it for multiple cases of forwarding, including bridge,
> tc-mirred, ovs and others. Putting this in bridge somehow seems wrong in
> this light. Also, the bridge code is polluted enough as it is. I this we
> should be super-picky to add another code there.

Completely agreed.

 Jiri

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 4/5] bridge: vlan lwt and dst_metadata netlink support
  2017-01-23  0:22   ` Rosen, Rami
@ 2017-01-23 15:39     ` Roopa Prabhu
  0 siblings, 0 replies; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-23 15:39 UTC (permalink / raw)
  To: Rosen, Rami
  Cc: netdev@vger.kernel.org, davem@davemloft.net,
	stephen@networkplumber.org, nikolay@cumulusnetworks.com,
	tgraf@suug.ch, hannes@stressinduktion.org, jbenc@redhat.com,
	pshelar@ovn.org, dsa@cumulusnetworks.com, hadi@mojatatu.com

On 1/22/17, 4:22 PM, Rosen, Rami wrote:
> Hi, Roopa,
>
> Two minor comments:
>
> The parameter br is not used in the br_add_vlan_tunnel_info() method, it should be removed:

Thanks Rami. will take care of this in the next version.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support
  2017-01-23  8:51   ` Jiri Benc
@ 2017-01-23 16:13     ` Roopa Prabhu
  2017-01-23 16:24       ` Jiri Benc
       [not found]       ` <CAJ3xEMiC5xJ+rex8xMnyuGj5QKj+sYA9A6JjOM0xQaZraFSHig@mail.gmail.com>
  0 siblings, 2 replies; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-23 16:13 UTC (permalink / raw)
  To: Jiri Benc
  Cc: Jiri Pirko, netdev, davem, stephen, nikolay, tgraf, hannes,
	pshelar, dsa, hadi

On 1/23/17, 12:51 AM, Jiri Benc wrote:
> On Mon, 23 Jan 2017 09:08:05 +0100, Jiri Pirko wrote:
>> Sat, Jan 21, 2017 at 06:46:51AM CET, roopa@cumulusnetworks.com wrote:
>>> Other approaches tried and vetoed:
>>> - tc vlan push/pop and tunnel metadata dst:
>>>    - posses a tc rule scalability problem (2 rules per vni)
>> Why it is a problem?
> Wanted to ask exactly the same question.
>
>>>    - cannot handle the case where a packet needs to be replicated to
>>>      multiple vxlan remote tunnel end-points.. which the vxlan driver
>>>      can do today by having multiple remote destinations per fdb.
>> Can't you just extend the tc to support this?
> +1
>
>> To me, looks like the tc is the correct place to hangle this. Then, the
>> user can use it for multiple cases of forwarding, including bridge,
>> tc-mirred, ovs and others. Putting this in bridge somehow seems wrong in
>> this light. Also, the bridge code is polluted enough as it is. I this we
>> should be super-picky to add another code there.
> Completely agreed.
>

The problem is, When you use the Linux bridge for vlan configuration and vlan filtering, having
additional vlan config in some other subsystem is a bit awkward. Its the same argument where
tc and netfilter subsystems have so much overlap...but they do because, each subsystem has to
have the missing functionality for completeness....cannot expect the user to configure a few rules
in tc and a few others in netfilter. In this case, I cannot expect the user/app to configure vlan filtering
in one place and have additional vlan to tunnel filtering in another subsystem. Its duplicating vlan
configuration in multiple places.

Also, the goal is to reduce the number of vxlan devices from say 4k to 1. I don't think replacing
it with 8k (egress + ingress) rules is going in the right direction.

bigger picture/context... With bgp now being deployed as a controller for
l2 ethernet vpn solutions (https://tools.ietf.org/html/draft-ietf-bess-evpn-overlay-07), popular routing
suites like quagga, are looking at using the Linux api for L2 configuration.
And, a 'vlan-to-tunid' mapping is a very common configuration in L2 ethernet vpn configurations.
With the bridge driver being the center of vlan configuration in such bridged networks,
having all vlan configuration in one place makes sense. Also, quagga now has a single api
to get the 'vlan-to-tunid' mapping. Telling quagga to look at tc filtering rules to derive this
mapping is not inline with the rest of the L2 api ..(when you use the Linux bridge ..).

Regarding piling this on to the bridge driver:
- It is using existing dst metadata infra + two hooks disabled by default.
- I started this with vlan-to-vxlan map in the vxlan driver (regret spending time on it)..
I ended up duplicating a lot of vlan handling code that the bridge driver all-ready had in the vxlan driver.
Hence bridge driver is the right place for this ...when you are using the bridge driver for vlan filtering.
- Besides, having it in the bridge driver ..enables the bridge driver for future other
 l2 evpn dataplanes (vxlan just happens to be one of them i am working on currently).

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support
  2017-01-23 16:13     ` Roopa Prabhu
@ 2017-01-23 16:24       ` Jiri Benc
  2017-01-24  0:00         ` Roopa Prabhu
       [not found]       ` <CAJ3xEMiC5xJ+rex8xMnyuGj5QKj+sYA9A6JjOM0xQaZraFSHig@mail.gmail.com>
  1 sibling, 1 reply; 23+ messages in thread
From: Jiri Benc @ 2017-01-23 16:24 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jiri Pirko, netdev, davem, stephen, nikolay, tgraf, hannes,
	pshelar, dsa, hadi

On Mon, 23 Jan 2017 08:13:30 -0800, Roopa Prabhu wrote:
> And, a 'vlan-to-tunid' mapping is a very common configuration in L2 ethernet vpn configurations.

You have one particular and narrow use case in mind and are proposing a
rather large patchset to add support for that (and only that) single
use case, while we already have a generic mechanism in place to address
this and many similar (and dissimilar, too) use cases. That doesn't
sound right.

If the current generic mechanisms have bottlenecks for your use case,
let's work on removing those bottlenecks. That way, everybody benefits,
not just a single use case.

Thanks,

 Jiri

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support
  2017-01-23 16:24       ` Jiri Benc
@ 2017-01-24  0:00         ` Roopa Prabhu
  0 siblings, 0 replies; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-24  0:00 UTC (permalink / raw)
  To: Jiri Benc
  Cc: Jiri Pirko, netdev, davem, stephen, nikolay, tgraf, hannes,
	pshelar, dsa, hadi

On 1/23/17, 8:24 AM, Jiri Benc wrote:
> On Mon, 23 Jan 2017 08:13:30 -0800, Roopa Prabhu wrote:
>> And, a 'vlan-to-tunid' mapping is a very common configuration in L2 ethernet vpn configurations.
> You have one particular and narrow use case in mind and are proposing a
> rather large patchset to add support for that (and only that) single
> use case, while we already have a generic mechanism in place to address
> this and many similar (and dissimilar, too) use cases. That doesn't
> sound right.
Let me clarify:
the generic mechanism you are talking about is dst_metadata infra. Any subsystem can use it.
tc vlan and dst_metadata wrapper/filter provide a creative way to use it inside the tc subsystem and is very
useful for people using tc all-around.
What I am proposing here is hooks in bridge to use the dst_metadata for pure L2 networks who
use the bridge driver. This is similar to how we have lwt plugged into the L3 (routing) code.
If you are using the bridge driver for vlan config and filtering, I don't see why one
 has to duplicate vlan config using tc. Its painful trying to deploy l2 networks with vlan config spanning
multiple subsystems and apis.

Regarding the patch-set size, let me give you a breakdown:
If i used tc for passing dst_metadata (assume 4k vlans that are participating in l2 ethernet vpn):
(a) configure bridging/vlan filtering using bridge driver (4k vlans)
(b) configure tc rules to map vlans to tunnel-id (Additional patch to tc to only allow tunnel-id in dst_metadata: ingress + egress = 8k tc rules)
(c) vxlan driver patch to make it bridge friendly (my patch in this series is required regardless if i use tc or bridge driver for dst_metadata because vxlan driver learns and needs to carry the forwarding information database)
(d) ethernet vpn controller (quagga bgp) looks at 'bridge api + vxlan api + tc filtering rules'
           

My current series:
(a) configure bridging/vlan filtering using bridge driver (4k vlans with tunnel info)
(b) vxlan driver patch to make it bridge friendly (my patch in this series is required regardless if
i use tc or bridge driver for dst_metadata because vxlan driver learns and needs to carry the forwarding information database)
(c) ethernet vpn controller (quagga bgp) looks at 'bridge api + vxlan api'


And btw, most of the functions that i am adding in the bridge driver are related to vlan range handling.
vlan ranges code is tricky and i am trying to also support vlan-tunnelid mapping in ranges, and i have tried
to rewrite my own vlan range code (added long back) to include tunnel info. The rest is just use of the dst_metadata infra
to store and use  dst_metadata per vlan.


>
> If the current generic mechanisms have bottlenecks for your use case,
> let's work on removing those bottlenecks. That way, everybody benefits,
> not just a single use case.
For people using all tc, the tc wrapper for dst_metadata is a good fit.
I see my series as still using the generic 'dst_metadata' mechanism/infra for a newer use case.
like i say above, I see this similar to how we have plugged dst_metadata into the L3 (routing) code.
This does it in the bridging code (for L2 networks).

Thanks,
Roopa

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support
       [not found]       ` <CAJ3xEMiC5xJ+rex8xMnyuGj5QKj+sYA9A6JjOM0xQaZraFSHig@mail.gmail.com>
@ 2017-01-24  0:09         ` Roopa Prabhu
  0 siblings, 0 replies; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-24  0:09 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Jiri Benc, Jiri Pirko, Linux Netdev List, David Miller,
	Stephen Hemminger, Nikolay Aleksandrov, Thomas Graf,
	Hannes Frederic Sowa, pravin shelar, David Ahern,
	Jamal Hadi Salim

On 1/23/17, 9:03 AM, Or Gerlitz wrote:
> On Mon, Jan 23, 2017 at 6:13 PM, Roopa Prabhu <roopa@cumulusnetworks.com>
> wrote:
>
>> Also, the goal is to reduce the number of vxlan devices from say 4k to 1.
>> I don't think replacing it with 8k (egress + ingress) rules is going in the
>> right direction.
>>
> Can't you take advantage of the shared vxlan device configuration
> introduced throughout the LWT work such that you have single device dealing
> with many tunnels? why?
>
I tried to cover this in my initial paragraph in the cover letter:
"lwt and dst_metadata/collect_metadata have enabled vxlan l3 deployments to use a 'single vxlan
netdev for multiple vnis' eliminating the scalability problem with using a 'single vxlan netdev per vni'.
This series tries to do the same for vxlan netdevs in pure l2 bridged networks. Use-case/deployment and
details are below." there is more in the cover letter on this.

There is no route pointing to the vxlan device here. vxlan device is a bridged port. And it bridges local host ports to remote vxlan tunnels
vlan-to-vxlan.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support
  2017-01-21  5:46 [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support Roopa Prabhu
                   ` (5 preceding siblings ...)
  2017-01-23  8:08 ` [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support Jiri Pirko
@ 2017-01-24 15:47 ` Stephen Hemminger
  2017-01-25 17:08   ` Roopa Prabhu
  6 siblings, 1 reply; 23+ messages in thread
From: Stephen Hemminger @ 2017-01-24 15:47 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: netdev, davem, nikolay, tgraf, hannes, jbenc, pshelar, dsa, hadi

On Fri, 20 Jan 2017 21:46:51 -0800
Roopa Prabhu <roopa@cumulusnetworks.com> wrote:

> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> 
> High level summary:
> lwt and dst_metadata/collect_metadata have enabled vxlan l3 deployments
> to use a single vxlan netdev for multiple vnis eliminating the scalability
> problem with using a single vxlan netdev per vni. This series tries to
> do the same for vxlan netdevs in pure l2 bridged networks.
> Use-case/deployment and details are below.
> 
> Deployment scerario details:
> As we know VXLAN is used to build layer 2 virtual networks across the
> underlay layer3 infrastructure. A VXLAN tunnel endpoint (VTEP)
> originates and terminates VXLAN tunnels. And a VTEP can be a TOR switch
> or a vswitch in the hypervisor. This patch series mainly
> focuses on the TOR switch configured as a Vtep. Vxlan segment ID (vni)
> along with vlan id is used to identify layer 2 segments in a vxlan
> overlay network. Vxlan bridging is the function provided by Vteps to terminate
> vxlan tunnels and map the vxlan vni to traditional end host vlan. This is
> covered in the "VXLAN Deployment Scenarios" in sections 6 and 6.1 in RFC 7348.
> To provide vxlan bridging function, a vtep has to map vlan to a vni. The rfc
> says that the ingress VTEP device shall remove the IEEE 802.1Q VLAN tag in
> the original Layer 2 packet if there is one before encapsulating the packet
> into the VXLAN format to transmit it through the underlay network. The remote
> VTEP devices have information about the VLAN in which the packet will be
> placed based on their own VLAN-to-VXLAN VNI mapping configurations.
> 
> Existing solution:
> Without this patch series one can deploy such a vtep configuration by
> by adding the local ports and vxlan netdevs into a vlan filtering bridge.
> The local ports are configured as trunk ports carrying all vlans.
> A vxlan netdev per vni is added to the bridge. Vlan mapping to vni is
> achieved by configuring the vlan as pvid on the corresponding vxlan netdev.
> The vxlan netdev only receives traffic corresponding to the vlan it is mapped
> to. This configuration maps traffic belonging to a vlan to the corresponding
> vxlan segment.
> 
>           -----------------------------------
>          |              bridge               |
>          |                                   |
>           -----------------------------------
>             |100,200       |100 (pvid)    |200 (pvid)
>             |              |              |
>            swp1          vxlan1000      vxlan2000
>                     
> This provides the required vxlan bridging function but poses a
> scalability problem with using a single vxlan netdev for each vni.
> 
> Solution in this patch series:
> The Goal is to use a single vxlan device to carry all vnis similar
> to the vxlan collect metadata mode but vxlan driver still carrying all
> the forwarding information.
> - vxlan driver changes:
>     - enable collect metadata mode device to be used with learning,
>       replication, fdb
>     - A single fdb table hashed by (mac, vni)
>     - rx path already has the vni
>     - tx path expects a vni in the packet with dst_metadata and vxlan
>       driver has all the forwarding information for the vni in the
>       dst_metadata.
> 
> - Bridge driver changes: per vlan LWT and dst_metadata support:
>     - Our use case is vxlan and 1-1 mapping between vlan and vni, but I have
>       kept the api generic for any tunnel info
>     - Uapi to configure/unconfigure/dump per vlan tunnel data
>     - new bridge port flag to turn this feature on/off. off by default
>     - ingress hook:
>         - if port is a lwt tunnel port, use tunnel info in
>           attached dst_metadata to map it to a local vlan
>     - egress hook:
>         - if port is a lwt tunnel port, use tunnel info attached to vlan
>           to set dst_metadata on the skb
> 
> Other approaches tried and vetoed:
> - tc vlan push/pop and tunnel metadata dst:
>     - posses a tc rule scalability problem (2 rules per vni)
>     - cannot handle the case where a packet needs to be replicated to
>       multiple vxlan remote tunnel end-points.. which the vxlan driver
>       can do today by having multiple remote destinations per fdb.
> - making vxlan driver understand vlan-vni mapping:
>     - I had a series almost ready with this one but soon realized
>       it duplicated a lot of vlan handling code in the vxlan driver
> 
> This series is briefly tested for functionality. Sending it out as RFC while
> I continue to test it more. There are some rough edges which I am in the process
> of fixing.
> 
> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> 
> Roopa Prabhu (5):
>   ip_tunnels: new IP_TUNNEL_INFO_BRIDGE flag for ip_tunnel_info mode
>   vxlan: make COLLECT_METADATA mode bridge friendly
>   bridge: uapi: add per vlan tunnel info
>   bridge: vlan lwt and dst_metadata netlink support
>   bridge: vlan lwt dst_metadata hooks in ingress and egress paths
> 
>  drivers/net/vxlan.c            |  209 ++++++++++++--------
>  include/linux/if_bridge.h      |    1 +
>  include/net/ip_tunnels.h       |    1 +
>  include/uapi/linux/if_bridge.h |   11 ++
>  include/uapi/linux/if_link.h   |    1 +
>  include/uapi/linux/neighbour.h |    1 +
>  net/bridge/br_input.c          |    5 +
>  net/bridge/br_netlink.c        |  410 ++++++++++++++++++++++++++++++++++------
>  net/bridge/br_private.h        |   22 +++
>  net/bridge/br_vlan.c           |  193 ++++++++++++++++++-
>  10 files changed, 717 insertions(+), 137 deletions(-)
> 

Yes this is a complex issue, but I am concerned about the added code bloat
and complexity. The bridge which is already a mess with netfilter, multicast, vlan
filtering etc.The current code not modular and grows with each feature.
At the same time, the same bridge functionality is used in its simplest form by
all the network virtualization and container technolgies.

Maybe do it in OpenVswitch or figure out a way to do customization with BPF?

more complex than it should

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support
  2017-01-24 15:47 ` Stephen Hemminger
@ 2017-01-25 17:08   ` Roopa Prabhu
  0 siblings, 0 replies; 23+ messages in thread
From: Roopa Prabhu @ 2017-01-25 17:08 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, davem, nikolay, tgraf, hannes, jbenc, pshelar, dsa, hadi

On 1/24/17, 7:47 AM, Stephen Hemminger wrote:
> On Fri, 20 Jan 2017 21:46:51 -0800
> Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> High level summary:
>> lwt and dst_metadata/collect_metadata have enabled vxlan l3 deployments
>> to use a single vxlan netdev for multiple vnis eliminating the scalability
>> problem with using a single vxlan netdev per vni. This series tries to
>> do the same for vxlan netdevs in pure l2 bridged networks.
>> Use-case/deployment and details are below.
>>
>> Deployment scerario details:
>> As we know VXLAN is used to build layer 2 virtual networks across the
>> underlay layer3 infrastructure. A VXLAN tunnel endpoint (VTEP)
>> originates and terminates VXLAN tunnels. And a VTEP can be a TOR switch
>> or a vswitch in the hypervisor. This patch series mainly
>> focuses on the TOR switch configured as a Vtep. Vxlan segment ID (vni)
>> along with vlan id is used to identify layer 2 segments in a vxlan
>> overlay network. Vxlan bridging is the function provided by Vteps to terminate
>> vxlan tunnels and map the vxlan vni to traditional end host vlan. This is
>> covered in the "VXLAN Deployment Scenarios" in sections 6 and 6.1 in RFC 7348.
>> To provide vxlan bridging function, a vtep has to map vlan to a vni. The rfc
>> says that the ingress VTEP device shall remove the IEEE 802.1Q VLAN tag in
>> the original Layer 2 packet if there is one before encapsulating the packet
>> into the VXLAN format to transmit it through the underlay network. The remote
>> VTEP devices have information about the VLAN in which the packet will be
>> placed based on their own VLAN-to-VXLAN VNI mapping configurations.
>>
>> Existing solution:
>> Without this patch series one can deploy such a vtep configuration by
>> by adding the local ports and vxlan netdevs into a vlan filtering bridge.
>> The local ports are configured as trunk ports carrying all vlans.
>> A vxlan netdev per vni is added to the bridge. Vlan mapping to vni is
>> achieved by configuring the vlan as pvid on the corresponding vxlan netdev.
>> The vxlan netdev only receives traffic corresponding to the vlan it is mapped
>> to. This configuration maps traffic belonging to a vlan to the corresponding
>> vxlan segment.
>>
>>           -----------------------------------
>>          |              bridge               |
>>          |                                   |
>>           -----------------------------------
>>             |100,200       |100 (pvid)    |200 (pvid)
>>             |              |              |
>>            swp1          vxlan1000      vxlan2000
>>                     
>> This provides the required vxlan bridging function but poses a
>> scalability problem with using a single vxlan netdev for each vni.
>>
>> Solution in this patch series:
>> The Goal is to use a single vxlan device to carry all vnis similar
>> to the vxlan collect metadata mode but vxlan driver still carrying all
>> the forwarding information.
>> - vxlan driver changes:
>>     - enable collect metadata mode device to be used with learning,
>>       replication, fdb
>>     - A single fdb table hashed by (mac, vni)
>>     - rx path already has the vni
>>     - tx path expects a vni in the packet with dst_metadata and vxlan
>>       driver has all the forwarding information for the vni in the
>>       dst_metadata.
>>
>> - Bridge driver changes: per vlan LWT and dst_metadata support:
>>     - Our use case is vxlan and 1-1 mapping between vlan and vni, but I have
>>       kept the api generic for any tunnel info
>>     - Uapi to configure/unconfigure/dump per vlan tunnel data
>>     - new bridge port flag to turn this feature on/off. off by default
>>     - ingress hook:
>>         - if port is a lwt tunnel port, use tunnel info in
>>           attached dst_metadata to map it to a local vlan
>>     - egress hook:
>>         - if port is a lwt tunnel port, use tunnel info attached to vlan
>>           to set dst_metadata on the skb
>>
>> Other approaches tried and vetoed:
>> - tc vlan push/pop and tunnel metadata dst:
>>     - posses a tc rule scalability problem (2 rules per vni)
>>     - cannot handle the case where a packet needs to be replicated to
>>       multiple vxlan remote tunnel end-points.. which the vxlan driver
>>       can do today by having multiple remote destinations per fdb.
>> - making vxlan driver understand vlan-vni mapping:
>>     - I had a series almost ready with this one but soon realized
>>       it duplicated a lot of vlan handling code in the vxlan driver
>>
>> This series is briefly tested for functionality. Sending it out as RFC while
>> I continue to test it more. There are some rough edges which I am in the process
>> of fixing.
>>
>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> Roopa Prabhu (5):
>>   ip_tunnels: new IP_TUNNEL_INFO_BRIDGE flag for ip_tunnel_info mode
>>   vxlan: make COLLECT_METADATA mode bridge friendly
>>   bridge: uapi: add per vlan tunnel info
>>   bridge: vlan lwt and dst_metadata netlink support
>>   bridge: vlan lwt dst_metadata hooks in ingress and egress paths
>>
>>  drivers/net/vxlan.c            |  209 ++++++++++++--------
>>  include/linux/if_bridge.h      |    1 +
>>  include/net/ip_tunnels.h       |    1 +
>>  include/uapi/linux/if_bridge.h |   11 ++
>>  include/uapi/linux/if_link.h   |    1 +
>>  include/uapi/linux/neighbour.h |    1 +
>>  net/bridge/br_input.c          |    5 +
>>  net/bridge/br_netlink.c        |  410 ++++++++++++++++++++++++++++++++++------
>>  net/bridge/br_private.h        |   22 +++
>>  net/bridge/br_vlan.c           |  193 ++++++++++++++++++-
>>  10 files changed, 717 insertions(+), 137 deletions(-)
>>
> Yes this is a complex issue, but I am concerned about the added code bloat
> and complexity. 
not sure if you saw my last response on this thread. Understand the concern, I have given
the break down of code changes in one of the responses here:
http://marc.info/?l=linux-netdev&m=148521603908551&w=2

The bridge changes are mainly just parsing and storing dst metadata per vlan
for ethernet vpn tunneling. The vxlan changes are the next step in collect_metadata,
ie to allow learning. Regardless, It will be good for the vxlan driver to support vni in
the fdb table ie mux on (mac, vni) for the single vxlan device case.

> The bridge which is already a mess with netfilter, multicast, vlan
> filtering etc.The current code not modular and grows with each feature.
> At the same time, the same bridge functionality is used in its simplest form by
> all the network virtualization and container technolgies.
>
> Maybe do it in OpenVswitch or figure out a way to do customization with BPF?
>
> more complex than it should

We use the bridge driver for bridging and vlan filtering. Like i mentioned in some
of my earlier responses, routing daemons participating in ethernet vpn tunneling via bgp,
are now looking at the bridge driver api and forwarding database.
Vlan to tunnel-id mapping is a very common api in the networking world.

Unfortunately, Moving away from the bridge driver or using customized bpf is not an option for us.
On a networking box,  vlan-to-tunnel-id mapping is used by many networking apps as well, and hence cannot be
hidden in a BPF.

An option is to move this out of the bridge driver into the vxlan driver..., but I did try that and it did not seem right
duplicating all that vlan info in the vxlan driver. This patch series keeps it generic for future use with other ethernet dataplane
protocols.

I will refactor it some more and try to keep this code isolated in the bridge driver
 in my next version and we can discuss some more.

Thanks for the review.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-01-25 17:08 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-01-21  5:46 [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support Roopa Prabhu
2017-01-21  5:46 ` [RFC PATCH net-next 1/5] ip_tunnels: new IP_TUNNEL_INFO_BRIDGE flag for ip_tunnel_info mode Roopa Prabhu
2017-01-21  5:46 ` [RFC PATCH net-next 2/5] vxlan: make COLLECT_METADATA mode bridge friendly Roopa Prabhu
2017-01-22 11:40   ` Nikolay Aleksandrov
2017-01-22 15:18     ` Roopa Prabhu
2017-01-21  5:46 ` [RFC PATCH net-next 3/5] bridge: uapi: add per vlan tunnel info Roopa Prabhu
2017-01-21 16:59   ` Roopa Prabhu
2017-01-21  5:46 ` [RFC PATCH net-next 4/5] bridge: vlan lwt and dst_metadata netlink support Roopa Prabhu
2017-01-22 12:05   ` Nikolay Aleksandrov
2017-01-22 15:23     ` Roopa Prabhu
2017-01-23  0:22   ` Rosen, Rami
2017-01-23 15:39     ` Roopa Prabhu
2017-01-21  5:46 ` [RFC PATCH net-next 5/5] bridge: vlan lwt dst_metadata hooks in ingress and egress paths Roopa Prabhu
2017-01-22 12:15   ` Nikolay Aleksandrov
2017-01-22 15:27     ` Roopa Prabhu
2017-01-23  8:08 ` [RFC PATCH net-next 0/5] bridge: per vlan lwt and dst_metadata support Jiri Pirko
2017-01-23  8:51   ` Jiri Benc
2017-01-23 16:13     ` Roopa Prabhu
2017-01-23 16:24       ` Jiri Benc
2017-01-24  0:00         ` Roopa Prabhu
     [not found]       ` <CAJ3xEMiC5xJ+rex8xMnyuGj5QKj+sYA9A6JjOM0xQaZraFSHig@mail.gmail.com>
2017-01-24  0:09         ` Roopa Prabhu
2017-01-24 15:47 ` Stephen Hemminger
2017-01-25 17:08   ` Roopa Prabhu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).