Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next] net: mvpp2: phylink support
From: Antoine Tenart @ 2017-09-25 13:06 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Antoine Tenart, davem, andrew, gregory.clement, thomas.petazzoni,
	miquel.raynal, nadavh, linux-kernel, mw, stefanc, netdev
In-Reply-To: <20170925121343.GO20805@n2100.armlinux.org.uk>

On Mon, Sep 25, 2017 at 01:13:43PM +0100, Russell King - ARM Linux wrote:
> On Mon, Sep 25, 2017 at 01:53:03PM +0200, Antoine Tenart wrote:
> > On Mon, Sep 25, 2017 at 11:45:32AM +0100, Russell King - ARM Linux wrote:
> > > Can you describe what the GoP link IRQ is doing please?
> > 
> > In cases where there is no PHY connected to the MAC and no SFP cage is
> > used. One example is when a SOHO switch is connected directly to a
> > serdes lane. In such cases we still need to have a minimal link
> > management. The GoP link interrupt helps doing so as it raises when the
> > serdes is in sync and AN succeeded.
> 
> Isn't this just like a fixed link scenario, or an in-band
> autonegotiation scenario (both of which phylink supports natively)?
> 
> The situation on Clearfog with the 88E6176 switch is pretty similar -
> a switch connected directly via serdes to the MAC.  Currently, we
> configure stuff there as a fixed link, but in actual fact the 88E6176
> is configured to run the CPU facing port in 1000base-X mode, and with
> appropriate tweaks, switching phylink to 1000base-X mode also works.

Hmm, I think you're right, we should be able to represent the link
between the MAC and the switch as a fixed link. And when it's not fixed,
it could be done with in-band AN. I cannot test this myself but I've
asked someone who can to.

Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH net v2 2/3] net: mvpp2: fix port list indexing
From: Antoine Tenart @ 2017-09-25 13:09 UTC (permalink / raw)
  To: davem
  Cc: Yan Markman, andrew, gregory.clement, thomas.petazzoni,
	miquel.raynal, nadavh, linux, linux-kernel, mw, stefanc, netdev,
	Antoine Tenart
In-Reply-To: <20170925125948.13507-3-antoine.tenart@free-electrons.com>

On Mon, Sep 25, 2017 at 02:59:47PM +0200, Antoine Tenart wrote:
> From: Yan Markman <ymarkman@marvell.com>
> 
> The private port_list array has a list of pointers to mvpp2_port
> instances. This list is allocated given the number of ports enabled in
> the device tree, but the pointers are set using the port-id property. If
> on a single port is enabled, the port_list array will be of size 1, but
> when registering the port, if its id is not 0 the driver will crash.
> Other crashes were encountered in various situations.
> 
> This fixes the issue by using an index not equal to the value of the
> port-id property.
> 
> Fixes: 3f518509dedc ("ethernet: Add new driver for Marvell Armada 375 network unit")

With,

Signed-off-by: Yan Markman <ymarkman@marvell.com>

I don't know why it was removed, but this SoB should be added back.

Antoine

> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
> ---
>  drivers/net/ethernet/marvell/mvpp2.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
> index da04939a2748..b2f99df81e9c 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -7504,7 +7504,7 @@ static void mvpp2_port_copy_mac_addr(struct net_device *dev, struct mvpp2 *priv,
>  /* Ports initialization */
>  static int mvpp2_port_probe(struct platform_device *pdev,
>  			    struct device_node *port_node,
> -			    struct mvpp2 *priv)
> +			    struct mvpp2 *priv, int index)
>  {
>  	struct device_node *phy_node;
>  	struct phy *comphy;
> @@ -7678,7 +7678,7 @@ static int mvpp2_port_probe(struct platform_device *pdev,
>  	}
>  	netdev_info(dev, "Using %s mac address %pM\n", mac_from, dev->dev_addr);
>  
> -	priv->port_list[id] = port;
> +	priv->port_list[index] = port;
>  	return 0;
>  
>  err_free_port_pcpu:
> @@ -8013,10 +8013,12 @@ static int mvpp2_probe(struct platform_device *pdev)
>  	}
>  
>  	/* Initialize ports */
> +	i = 0;
>  	for_each_available_child_of_node(dn, port_node) {
> -		err = mvpp2_port_probe(pdev, port_node, priv);
> +		err = mvpp2_port_probe(pdev, port_node, priv, i);
>  		if (err < 0)
>  			goto err_mg_clk;
> +		i++;
>  	}
>  
>  	platform_set_drvdata(pdev, priv);
> -- 
> 2.13.5
> 

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH net v2 1/3] net: mvpp2: fix parsing fragmentation detection
From: Antoine Tenart @ 2017-09-25 13:10 UTC (permalink / raw)
  To: davem
  Cc: Stefan Chulski, andrew, gregory.clement, thomas.petazzoni,
	miquel.raynal, nadavh, linux, linux-kernel, mw, netdev,
	Antoine Tenart
In-Reply-To: <20170925125948.13507-2-antoine.tenart@free-electrons.com>

On Mon, Sep 25, 2017 at 02:59:46PM +0200, Antoine Tenart wrote:
> From: Stefan Chulski <stefanc@marvell.com>
> 
> Parsing fragmentation detection failed due to wrong configured
> parser TCAM entry's. Some traffic was marked as fragmented in RX
> descriptor, even it wasn't IP fragmented. The hardware also failed to
> calculate checksums which lead to use software checksum and caused
> performance degradation.
> 
> Fixes: 3f518509dedc ("ethernet: Add new driver for Marvell Armada 375 network unit")

With,

Signed-off-by: Stefan Chulski <stefanc@marvell.com>

I don't know why this SoB was removed but it should be added back.

Antoine

> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
> ---
>  drivers/net/ethernet/marvell/mvpp2.c | 20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
> index dd0ee2691c86..da04939a2748 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -676,6 +676,7 @@ enum mvpp2_tag_type {
>  #define MVPP2_PRS_RI_L3_MCAST			BIT(15)
>  #define MVPP2_PRS_RI_L3_BCAST			(BIT(15) | BIT(16))
>  #define MVPP2_PRS_RI_IP_FRAG_MASK		0x20000
> +#define MVPP2_PRS_RI_IP_FRAG_TRUE		BIT(17)
>  #define MVPP2_PRS_RI_UDF3_MASK			0x300000
>  #define MVPP2_PRS_RI_UDF3_RX_SPECIAL		BIT(21)
>  #define MVPP2_PRS_RI_L4_PROTO_MASK		0x1c00000
> @@ -2315,7 +2316,7 @@ static int mvpp2_prs_ip4_proto(struct mvpp2 *priv, unsigned short proto,
>  	    (proto != IPPROTO_IGMP))
>  		return -EINVAL;
>  
> -	/* Fragmented packet */
> +	/* Not fragmented packet */
>  	tid = mvpp2_prs_tcam_first_free(priv, MVPP2_PE_FIRST_FREE_TID,
>  					MVPP2_PE_LAST_FREE_TID);
>  	if (tid < 0)
> @@ -2334,8 +2335,12 @@ static int mvpp2_prs_ip4_proto(struct mvpp2 *priv, unsigned short proto,
>  				  MVPP2_PRS_SRAM_OP_SEL_UDF_ADD);
>  	mvpp2_prs_sram_ai_update(&pe, MVPP2_PRS_IPV4_DIP_AI_BIT,
>  				 MVPP2_PRS_IPV4_DIP_AI_BIT);
> -	mvpp2_prs_sram_ri_update(&pe, ri | MVPP2_PRS_RI_IP_FRAG_MASK,
> -				 ri_mask | MVPP2_PRS_RI_IP_FRAG_MASK);
> +	mvpp2_prs_sram_ri_update(&pe, ri, ri_mask | MVPP2_PRS_RI_IP_FRAG_MASK);
> +
> +	mvpp2_prs_tcam_data_byte_set(&pe, 2, 0x00,
> +				     MVPP2_PRS_TCAM_PROTO_MASK_L);
> +	mvpp2_prs_tcam_data_byte_set(&pe, 3, 0x00,
> +				     MVPP2_PRS_TCAM_PROTO_MASK);
>  
>  	mvpp2_prs_tcam_data_byte_set(&pe, 5, proto, MVPP2_PRS_TCAM_PROTO_MASK);
>  	mvpp2_prs_tcam_ai_update(&pe, 0, MVPP2_PRS_IPV4_DIP_AI_BIT);
> @@ -2346,7 +2351,7 @@ static int mvpp2_prs_ip4_proto(struct mvpp2 *priv, unsigned short proto,
>  	mvpp2_prs_shadow_set(priv, pe.index, MVPP2_PRS_LU_IP4);
>  	mvpp2_prs_hw_write(priv, &pe);
>  
> -	/* Not fragmented packet */
> +	/* Fragmented packet */
>  	tid = mvpp2_prs_tcam_first_free(priv, MVPP2_PE_FIRST_FREE_TID,
>  					MVPP2_PE_LAST_FREE_TID);
>  	if (tid < 0)
> @@ -2358,8 +2363,11 @@ static int mvpp2_prs_ip4_proto(struct mvpp2 *priv, unsigned short proto,
>  	pe.sram.word[MVPP2_PRS_SRAM_RI_CTRL_WORD] = 0x0;
>  	mvpp2_prs_sram_ri_update(&pe, ri, ri_mask);
>  
> -	mvpp2_prs_tcam_data_byte_set(&pe, 2, 0x00, MVPP2_PRS_TCAM_PROTO_MASK_L);
> -	mvpp2_prs_tcam_data_byte_set(&pe, 3, 0x00, MVPP2_PRS_TCAM_PROTO_MASK);
> +	mvpp2_prs_sram_ri_update(&pe, ri | MVPP2_PRS_RI_IP_FRAG_TRUE,
> +				 ri_mask | MVPP2_PRS_RI_IP_FRAG_MASK);
> +
> +	mvpp2_prs_tcam_data_byte_set(&pe, 2, 0x00, 0x0);
> +	mvpp2_prs_tcam_data_byte_set(&pe, 3, 0x00, 0x0);
>  
>  	/* Update shadow table and hw entry */
>  	mvpp2_prs_shadow_set(priv, pe.index, MVPP2_PRS_LU_IP4);
> -- 
> 2.13.5
> 

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

^ permalink raw reply

* Re: [patch net-next v2 06/12] net: mroute: Check if rule is a default rule
From: Yotam Gigi @ 2017-09-25 13:37 UTC (permalink / raw)
  To: Nikolay Aleksandrov, Jiri Pirko, Yunsheng Lin
  Cc: netdev, davem, idosch, mlxsw, andrew
In-Reply-To: <fc1f3324-d2fc-f95c-50d2-25773f3c7683@cumulusnetworks.com>

On 09/25/2017 01:02 PM, Nikolay Aleksandrov wrote:
> On 25/09/17 12:45, Jiri Pirko wrote:
>> Mon, Sep 25, 2017 at 03:28:21AM CEST, linyunsheng@huawei.com wrote:
>>> Hi, Jiri
>>>
>>> On 2017/9/25 1:22, Jiri Pirko wrote:
>>>> From: Yotam Gigi <yotamg@mellanox.com>
>>>>
>>>> When the ipmr starts, it adds one default FIB rule that matches all packets
>>>> and sends them to the DEFAULT (multicast) FIB table. A more complex rule
>>>> can be added by user to specify that for a specific interface, a packet
>>>> should be look up at either an arbitrary table or according to the l3mdev
>>>> of the interface.
>>>>
>>>> For drivers willing to offload the ipmr logic into a hardware but don't
>>>> want to offload all the FIB rules functionality, provide a function that
>>>> can indicate whether the FIB rule is the default multicast rule, thus only
>>>> one routing table is needed.
>>>>
>>>> This way, a driver can register to the FIB notification chain, get
>>>> notifications about FIB rules added and trigger some kind of an internal
>>>> abort mechanism when a non default rule is added by the user.
>>>>
>>>> Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
>>>> Reviewed-by: Ido Schimmel <idosch@mellanox.com>
>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>> ---
>>>>  include/linux/mroute.h |  7 +++++++
>>>>  net/ipv4/ipmr.c        | 10 ++++++++++
>>>>  2 files changed, 17 insertions(+)
>>>>
>>>> diff --git a/include/linux/mroute.h b/include/linux/mroute.h
>>>> index 5566580..b072a84 100644
>>>> --- a/include/linux/mroute.h
>>>> +++ b/include/linux/mroute.h
>>>> @@ -5,6 +5,7 @@
>>>>  #include <linux/pim.h>
>>>>  #include <linux/rhashtable.h>
>>>>  #include <net/sock.h>
>>>> +#include <net/fib_rules.h>
>>>>  #include <net/fib_notifier.h>
>>>>  #include <uapi/linux/mroute.h>
>>>>  
>>>> @@ -19,6 +20,7 @@ int ip_mroute_getsockopt(struct sock *, int, char __user *, int __user *);
>>>>  int ipmr_ioctl(struct sock *sk, int cmd, void __user *arg);
>>>>  int ipmr_compat_ioctl(struct sock *sk, unsigned int cmd, void __user *arg);
>>>>  int ip_mr_init(void);
>>>> +bool ipmr_rule_default(const struct fib_rule *rule);
>>>>  #else
>>>>  static inline int ip_mroute_setsockopt(struct sock *sock, int optname,
>>>>  				       char __user *optval, unsigned int optlen)
>>>> @@ -46,6 +48,11 @@ static inline int ip_mroute_opt(int opt)
>>>>  {
>>>>  	return 0;
>>>>  }
>>>> +
>>>> +static inline bool ipmr_rule_default(const struct fib_rule *rule)
>>>> +{
>>>> +	return true;
>>>> +}
>>>>  #endif
>>>>  
>>>>  struct vif_device {
>>>> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
>>>> index 2a795d2..a714f55 100644
>>>> --- a/net/ipv4/ipmr.c
>>>> +++ b/net/ipv4/ipmr.c
>>>> @@ -320,6 +320,16 @@ static unsigned int ipmr_rules_seq_read(struct net *net)
>>>>  }
>>>>  #endif
>>>>  
>>>> +bool ipmr_rule_default(const struct fib_rule *rule)
>>>> +{
>>>> +#if IS_ENABLED(CONFIG_FIB_RULES)
>>>> +	return fib_rule_matchall(rule) && rule->table == RT_TABLE_DEFAULT;
>>>> +#else
>>>> +	return true;
>>>> +#endif
>>> In patch 02, You have the following, can you do the same for the above?
>>> +#ifdef CONFIG_IP_MROUTE
>>> +void ipmr_cache_free(struct mfc_cache *mfc_cache);
>>> +#else
>>> +static inline void ipmr_cache_free(struct mfc_cache *mfc_cache)
>>> +{
>>> +}
>>> +#endif
>> I don't believe this is necessary. The solution you described is often
>> used in headers. But here, I'm ok with the current code.
>>
> +1


Hmm, when re-looking at it, I think I will just use the already existing
#ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES other than adding a new one. It selects
the CONFIG_FIB_RULES, and if CONFIG_IP_MROUTE_MULTIPLE_TABLES is not defined
than only default rules can exist for the IPMR family.

I will fix it for v3.



>
>>>> +}
>>>> +EXPORT_SYMBOL(ipmr_rule_default);
>>>> +
>>>>  static inline int ipmr_hash_cmp(struct rhashtable_compare_arg *arg,
>>>>  				const void *ptr)
>>>>  {
>>>>

^ permalink raw reply

* [PATCH net-next v9] openvswitch: enable NSH support
From: Yi Yang @ 2017-09-25 14:16 UTC (permalink / raw)
  To: netdev; +Cc: dev, jbenc, e, davem, Yi Yang

v8->v9
 - Fix build error reported by daily intel build
   because nsh module isn't selected by openvswitch

v7->v8
 - Rework nested value and mask for OVS_KEY_ATTR_NSH
 - Change pop_nsh to adapt to nsh kernel module
 - Fix many issues per comments from Jiri Benc

v6->v7
 - Remove NSH GSO patches in v6 because Jiri Benc
   reworked it as another patch series and they have
   been merged.
 - Change it to adapt to nsh kernel module added by NSH
   GSO patch series

v5->v6
 - Fix the rest comments for v4.
 - Add NSH GSO support for VxLAN-gpe + NSH and
   Eth + NSH.

v4->v5
 - Fix many comments by Jiri Benc and Eric Garver
   for v4.

v3->v4
 - Add new NSH match field ttl
 - Update NSH header to the latest format
   which will be final format and won't change
   per its author's confirmation.
 - Fix comments for v3.

v2->v3
 - Change OVS_KEY_ATTR_NSH to nested key to handle
   length-fixed attributes and length-variable
   attriubte more flexibly.
 - Remove struct ovs_action_push_nsh completely
 - Add code to handle nested attribute for SET_MASKED
 - Change PUSH_NSH to use the nested OVS_KEY_ATTR_NSH
   to transfer NSH header data.
 - Fix comments and coding style issues by Jiri and Eric

v1->v2
 - Change encap_nsh and decap_nsh to push_nsh and pop_nsh
 - Dynamically allocate struct ovs_action_push_nsh for
   length-variable metadata.

OVS master and 2.8 branch has merged NSH userspace
patch series, this patch is to enable NSH support
in kernel data path in order that OVS can support
NSH in compat mode by porting this.

Signed-off-by: Yi Yang <yi.y.yang@intel.com>
---
 include/net/nsh.h                |   3 +
 include/uapi/linux/openvswitch.h |  29 ++++
 net/nsh/nsh.c                    |  53 +++++++
 net/openvswitch/Kconfig          |   1 +
 net/openvswitch/actions.c        | 112 ++++++++++++++
 net/openvswitch/flow.c           |  51 ++++++
 net/openvswitch/flow.h           |  11 ++
 net/openvswitch/flow_netlink.c   | 324 ++++++++++++++++++++++++++++++++++++++-
 net/openvswitch/flow_netlink.h   |   5 +
 9 files changed, 588 insertions(+), 1 deletion(-)

diff --git a/include/net/nsh.h b/include/net/nsh.h
index a1eaea2..b886d33 100644
--- a/include/net/nsh.h
+++ b/include/net/nsh.h
@@ -304,4 +304,7 @@ static inline void nsh_set_flags_ttl_len(struct nshhdr *nsh, u8 flags,
 			NSH_FLAGS_MASK | NSH_TTL_MASK | NSH_LEN_MASK);
 }
 
+int skb_push_nsh(struct sk_buff *skb, const struct nshhdr *src_nsh_hdr);
+int skb_pop_nsh(struct sk_buff *skb);
+
 #endif /* __NET_NSH_H */
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 156ee4c..c1a785c 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -333,6 +333,7 @@ enum ovs_key_attr {
 	OVS_KEY_ATTR_CT_LABELS,	/* 16-octet connection tracking label */
 	OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4,   /* struct ovs_key_ct_tuple_ipv4 */
 	OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6,   /* struct ovs_key_ct_tuple_ipv6 */
+	OVS_KEY_ATTR_NSH,       /* Nested set of ovs_nsh_key_* */
 
 #ifdef __KERNEL__
 	OVS_KEY_ATTR_TUNNEL_INFO,  /* struct ip_tunnel_info */
@@ -491,6 +492,30 @@ struct ovs_key_ct_tuple_ipv6 {
 	__u8   ipv6_proto;
 };
 
+enum ovs_nsh_key_attr {
+	OVS_NSH_KEY_ATTR_UNSPEC,
+	OVS_NSH_KEY_ATTR_BASE,  /* struct ovs_nsh_key_base. */
+	OVS_NSH_KEY_ATTR_MD1,   /* struct ovs_nsh_key_md1. */
+	OVS_NSH_KEY_ATTR_MD2,   /* variable-length octets for MD type 2. */
+	__OVS_NSH_KEY_ATTR_MAX
+};
+
+#define OVS_NSH_KEY_ATTR_MAX (__OVS_NSH_KEY_ATTR_MAX - 1)
+
+struct ovs_nsh_key_base {
+	__u8 flags;
+	__u8 ttl;
+	__u8 mdtype;
+	__u8 np;
+	__be32 path_hdr;
+};
+
+#define NSH_MD1_CONTEXT_SIZE 4
+
+struct ovs_nsh_key_md1 {
+	__be32 context[NSH_MD1_CONTEXT_SIZE];
+};
+
 /**
  * enum ovs_flow_attr - attributes for %OVS_FLOW_* commands.
  * @OVS_FLOW_ATTR_KEY: Nested %OVS_KEY_ATTR_* attributes specifying the flow
@@ -806,6 +831,8 @@ struct ovs_action_push_eth {
  * packet.
  * @OVS_ACTION_ATTR_POP_ETH: Pop the outermost Ethernet header off the
  * packet.
+ * @OVS_ACTION_ATTR_PUSH_NSH: push NSH header to the packet.
+ * @OVS_ACTION_ATTR_POP_NSH: pop the outermost NSH header off the packet.
  *
  * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not all
  * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
@@ -835,6 +862,8 @@ enum ovs_action_attr {
 	OVS_ACTION_ATTR_TRUNC,        /* u32 struct ovs_action_trunc. */
 	OVS_ACTION_ATTR_PUSH_ETH,     /* struct ovs_action_push_eth. */
 	OVS_ACTION_ATTR_POP_ETH,      /* No argument. */
+	OVS_ACTION_ATTR_PUSH_NSH,     /* Nested OVS_NSH_KEY_ATTR_*. */
+	OVS_ACTION_ATTR_POP_NSH,      /* No argument. */
 
 	__OVS_ACTION_ATTR_MAX,	      /* Nothing past this will be accepted
 				       * from userspace. */
diff --git a/net/nsh/nsh.c b/net/nsh/nsh.c
index 58fb827..54334ca 100644
--- a/net/nsh/nsh.c
+++ b/net/nsh/nsh.c
@@ -14,6 +14,59 @@
 #include <net/nsh.h>
 #include <net/tun_proto.h>
 
+int skb_push_nsh(struct sk_buff *skb, const struct nshhdr *src_nsh_hdr)
+{
+	struct nshhdr *nsh_hdr;
+	size_t length = nsh_hdr_len(src_nsh_hdr);
+	u8 next_proto;
+
+	if (skb->mac_len) {
+		next_proto = TUN_P_ETHERNET;
+	} else {
+		next_proto = tun_p_from_eth_p(skb->protocol);
+		if (!next_proto)
+			return -EAFNOSUPPORT;
+	}
+
+	/* Add the NSH header */
+	if (skb_cow_head(skb, length) < 0)
+		return -ENOMEM;
+
+	skb_push(skb, length);
+	nsh_hdr = (struct nshhdr *)(skb->data);
+	memcpy(nsh_hdr, src_nsh_hdr, length);
+	nsh_hdr->np = next_proto;
+
+	skb->protocol = htons(ETH_P_NSH);
+	skb_reset_mac_header(skb);
+	skb_reset_mac_len(skb);
+	skb_reset_network_header(skb);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(skb_push_nsh);
+
+int skb_pop_nsh(struct sk_buff *skb)
+{
+	struct nshhdr *nsh_hdr = (struct nshhdr *)(skb->data);
+	size_t length;
+	u16 inner_proto;
+
+	inner_proto = tun_p_to_eth_p(nsh_hdr->np);
+	if (!inner_proto)
+		return -EAFNOSUPPORT;
+
+	length = nsh_hdr_len(nsh_hdr);
+	skb_pull(skb, length);
+	skb_reset_mac_header(skb);
+	skb_reset_mac_len(skb);
+	skb_reset_network_header(skb);
+	skb->protocol = inner_proto;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(skb_pop_nsh);
+
 static struct sk_buff *nsh_gso_segment(struct sk_buff *skb,
 				       netdev_features_t features)
 {
diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index ce94729..2650205 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -14,6 +14,7 @@ config OPENVSWITCH
 	select MPLS
 	select NET_MPLS_GSO
 	select DST_CACHE
+	select NET_NSH
 	---help---
 	  Open vSwitch is a multilayer Ethernet switch targeted at virtualized
 	  environments.  In addition to supporting a variety of features
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index a54a556..d026b85 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -43,6 +43,7 @@
 #include "flow.h"
 #include "conntrack.h"
 #include "vport.h"
+#include "flow_netlink.h"
 
 struct deferred_action {
 	struct sk_buff *skb;
@@ -380,6 +381,45 @@ static int push_eth(struct sk_buff *skb, struct sw_flow_key *key,
 	return 0;
 }
 
+static int push_nsh(struct sk_buff *skb, struct sw_flow_key *key,
+		    const struct nshhdr *src_nsh_hdr)
+{
+	int err;
+
+	err = skb_push_nsh(skb, src_nsh_hdr);
+	if (err)
+		return err;
+
+	key->eth.type = htons(ETH_P_NSH);
+
+	/* safe right before invalidate_flow_key */
+	key->mac_proto = MAC_PROTO_NONE;
+	invalidate_flow_key(key);
+	return 0;
+}
+
+static int pop_nsh(struct sk_buff *skb, struct sw_flow_key *key)
+{
+	int err;
+
+	if (ovs_key_mac_proto(key) != MAC_PROTO_NONE ||
+	    skb->protocol != htons(ETH_P_NSH)) {
+		return -EINVAL;
+	}
+
+	err = skb_pop_nsh(skb);
+	if (err)
+		return err;
+
+	/* safe right before invalidate_flow_key */
+	if (skb->protocol == htons(ETH_P_TEB))
+		key->mac_proto = MAC_PROTO_ETHERNET;
+	else
+		key->mac_proto = MAC_PROTO_NONE;
+	invalidate_flow_key(key);
+	return 0;
+}
+
 static void update_ip_l4_checksum(struct sk_buff *skb, struct iphdr *nh,
 				  __be32 addr, __be32 new_addr)
 {
@@ -602,6 +642,59 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
 	return 0;
 }
 
+static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
+		   const struct nlattr *a)
+{
+	struct nshhdr *nsh_hdr;
+	int err;
+	u8 flags;
+	u8 ttl;
+	int i;
+
+	struct ovs_key_nsh key;
+	struct ovs_key_nsh mask;
+
+	err = nsh_key_from_nlattr(a, &key, &mask);
+	if (err)
+		return err;
+
+	err = skb_ensure_writable(skb, skb_network_offset(skb) +
+				  sizeof(struct nshhdr));
+	if (unlikely(err))
+		return err;
+
+	nsh_hdr = (struct nshhdr *)skb_network_header(skb);
+
+	flags = nsh_get_flags(nsh_hdr);
+	flags = OVS_MASKED(flags, key.flags, mask.flags);
+	flow_key->nsh.flags = flags;
+	ttl = nsh_get_ttl(nsh_hdr);
+	ttl = OVS_MASKED(ttl, key.ttl, mask.ttl);
+	flow_key->nsh.ttl = ttl;
+	nsh_set_flags_and_ttl(nsh_hdr, flags, ttl);
+	nsh_hdr->path_hdr = OVS_MASKED(nsh_hdr->path_hdr, key.path_hdr,
+				       mask.path_hdr);
+	flow_key->nsh.path_hdr = nsh_hdr->path_hdr;
+	switch (nsh_hdr->mdtype) {
+	case NSH_M_TYPE1:
+		for (i = 0; i < NSH_MD1_CONTEXT_SIZE; i++) {
+			nsh_hdr->md1.context[i] =
+			    OVS_MASKED(nsh_hdr->md1.context[i], key.context[i],
+				       mask.context[i]);
+		}
+		memcpy(flow_key->nsh.context, nsh_hdr->md1.context,
+		       sizeof(nsh_hdr->md1.context));
+		break;
+	case NSH_M_TYPE2:
+		memset(flow_key->nsh.context, 0,
+		       sizeof(flow_key->nsh.context));
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
 /* Must follow skb_ensure_writable() since that can move the skb data. */
 static void set_tp_port(struct sk_buff *skb, __be16 *port,
 			__be16 new_port, __sum16 *check)
@@ -1024,6 +1117,10 @@ static int execute_masked_set_action(struct sk_buff *skb,
 				   get_mask(a, struct ovs_key_ethernet *));
 		break;
 
+	case OVS_KEY_ATTR_NSH:
+		err = set_nsh(skb, flow_key, a);
+		break;
+
 	case OVS_KEY_ATTR_IPV4:
 		err = set_ipv4(skb, flow_key, nla_data(a),
 			       get_mask(a, struct ovs_key_ipv4 *));
@@ -1210,6 +1307,21 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 		case OVS_ACTION_ATTR_POP_ETH:
 			err = pop_eth(skb, key);
 			break;
+
+		case OVS_ACTION_ATTR_PUSH_NSH: {
+			u8 buffer[NSH_HDR_MAX_LEN];
+			struct nshhdr *nsh_hdr = (struct nshhdr *)buffer;
+			const struct nshhdr *src_nsh_hdr = nsh_hdr;
+
+			nsh_hdr_from_nlattr(nla_data(a), nsh_hdr,
+					    NSH_HDR_MAX_LEN);
+			err = push_nsh(skb, key, src_nsh_hdr);
+			break;
+		}
+
+		case OVS_ACTION_ATTR_POP_NSH:
+			err = pop_nsh(skb, key);
+			break;
 		}
 
 		if (unlikely(err)) {
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 8c94cef..67fb6d9 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -46,6 +46,7 @@
 #include <net/ipv6.h>
 #include <net/mpls.h>
 #include <net/ndisc.h>
+#include <net/nsh.h>
 
 #include "conntrack.h"
 #include "datapath.h"
@@ -490,6 +491,52 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
 	return 0;
 }
 
+static int parse_nsh(struct sk_buff *skb, struct sw_flow_key *key)
+{
+	struct nshhdr *nsh_hdr;
+	unsigned int nh_ofs = skb_network_offset(skb);
+	u8 version, length;
+	int err;
+
+	err = check_header(skb, nh_ofs + NSH_BASE_HDR_LEN);
+	if (unlikely(err))
+		return err;
+
+	nsh_hdr = (struct nshhdr *)skb_network_header(skb);
+	version = nsh_get_ver(nsh_hdr);
+	length = nsh_hdr_len(nsh_hdr);
+
+	if (version != 0)
+		return -EINVAL;
+
+	err = check_header(skb, nh_ofs + length);
+	if (unlikely(err))
+		return err;
+
+	nsh_hdr = (struct nshhdr *)skb_network_header(skb);
+	key->nsh.flags = nsh_get_flags(nsh_hdr);
+	key->nsh.ttl = nsh_get_ttl(nsh_hdr);
+	key->nsh.mdtype = nsh_hdr->mdtype;
+	key->nsh.np = nsh_hdr->np;
+	key->nsh.path_hdr = nsh_hdr->path_hdr;
+	switch (key->nsh.mdtype) {
+	case NSH_M_TYPE1:
+		if (length != NSH_M_TYPE1_LEN)
+			return -EINVAL;
+		memcpy(key->nsh.context, nsh_hdr->md1.context,
+		       sizeof(nsh_hdr->md1));
+		break;
+	case NSH_M_TYPE2:
+		memset(key->nsh.context, 0,
+		       sizeof(nsh_hdr->md1));
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 /**
  * key_extract - extracts a flow key from an Ethernet frame.
  * @skb: sk_buff that contains the frame, with skb->data pointing to the
@@ -735,6 +782,10 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
 				memset(&key->tp, 0, sizeof(key->tp));
 			}
 		}
+	} else if (key->eth.type == htons(ETH_P_NSH)) {
+		error = parse_nsh(skb, key);
+		if (error)
+			return error;
 	}
 	return 0;
 }
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 1875bba..6a3cd9c 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -35,6 +35,7 @@
 #include <net/inet_ecn.h>
 #include <net/ip_tunnels.h>
 #include <net/dst_metadata.h>
+#include <net/nsh.h>
 
 struct sk_buff;
 
@@ -66,6 +67,15 @@ struct vlan_head {
 	(offsetof(struct sw_flow_key, recirc_id) +	\
 	FIELD_SIZEOF(struct sw_flow_key, recirc_id))
 
+struct ovs_key_nsh {
+	u8 flags;
+	u8 ttl;
+	u8 mdtype;
+	u8 np;
+	__be32 path_hdr;
+	__be32 context[NSH_MD1_CONTEXT_SIZE];
+};
+
 struct sw_flow_key {
 	u8 tun_opts[IP_TUNNEL_OPTS_MAX];
 	u8 tun_opts_len;
@@ -144,6 +154,7 @@ struct sw_flow_key {
 			};
 		} ipv6;
 	};
+	struct ovs_key_nsh nsh;         /* network service header */
 	struct {
 		/* Connection tracking fields not packed above. */
 		struct {
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index e8eb427..278bbb3 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -48,6 +48,7 @@
 #include <net/ndisc.h>
 #include <net/mpls.h>
 #include <net/vxlan.h>
+#include <net/tun_proto.h>
 
 #include "flow_netlink.h"
 
@@ -78,9 +79,11 @@ static bool actions_may_change_flow(const struct nlattr *actions)
 		case OVS_ACTION_ATTR_HASH:
 		case OVS_ACTION_ATTR_POP_ETH:
 		case OVS_ACTION_ATTR_POP_MPLS:
+		case OVS_ACTION_ATTR_POP_NSH:
 		case OVS_ACTION_ATTR_POP_VLAN:
 		case OVS_ACTION_ATTR_PUSH_ETH:
 		case OVS_ACTION_ATTR_PUSH_MPLS:
+		case OVS_ACTION_ATTR_PUSH_NSH:
 		case OVS_ACTION_ATTR_PUSH_VLAN:
 		case OVS_ACTION_ATTR_SAMPLE:
 		case OVS_ACTION_ATTR_SET:
@@ -322,12 +325,27 @@ size_t ovs_tun_key_attr_size(void)
 		+ nla_total_size(2);   /* OVS_TUNNEL_KEY_ATTR_TP_DST */
 }
 
+size_t ovs_nsh_key_attr_size(void)
+{
+	/* Whenever adding new OVS_NSH_KEY_ FIELDS, we should consider
+	 * updating this function.
+	 */
+	return  nla_total_size(NSH_BASE_HDR_LEN) /* OVS_NSH_KEY_ATTR_BASE */
+		/* OVS_NSH_KEY_ATTR_MD1 and OVS_NSH_KEY_ATTR_MD2 are
+		 * mutually exclusive, so the bigger one can cover
+		 * the small one.
+		 *
+		 * OVS_NSH_KEY_ATTR_MD2
+		 */
+		+ nla_total_size(NSH_CTX_HDRS_MAX_LEN);
+}
+
 size_t ovs_key_attr_size(void)
 {
 	/* Whenever adding new OVS_KEY_ FIELDS, we should consider
 	 * updating this function.
 	 */
-	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 28);
+	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 29);
 
 	return    nla_total_size(4)   /* OVS_KEY_ATTR_PRIORITY */
 		+ nla_total_size(0)   /* OVS_KEY_ATTR_TUNNEL */
@@ -341,6 +359,8 @@ size_t ovs_key_attr_size(void)
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_CT_MARK */
 		+ nla_total_size(16)  /* OVS_KEY_ATTR_CT_LABELS */
 		+ nla_total_size(40)  /* OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6 */
+		+ nla_total_size(0)   /* OVS_KEY_ATTR_NSH */
+		  + ovs_nsh_key_attr_size()
 		+ nla_total_size(12)  /* OVS_KEY_ATTR_ETHERNET */
 		+ nla_total_size(2)   /* OVS_KEY_ATTR_ETHERTYPE */
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_VLAN */
@@ -373,6 +393,13 @@ static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
 	[OVS_TUNNEL_KEY_ATTR_IPV6_DST]      = { .len = sizeof(struct in6_addr) },
 };
 
+static const struct ovs_len_tbl
+ovs_nsh_key_attr_lens[OVS_NSH_KEY_ATTR_MAX + 1] = {
+	[OVS_NSH_KEY_ATTR_BASE] = { .len = sizeof(struct ovs_nsh_key_base) },
+	[OVS_NSH_KEY_ATTR_MD1]  = { .len = sizeof(struct ovs_nsh_key_md1) },
+	[OVS_NSH_KEY_ATTR_MD2]  = { .len = OVS_ATTR_VARIABLE },
+};
+
 /* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute.  */
 static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 	[OVS_KEY_ATTR_ENCAP]	 = { .len = OVS_ATTR_NESTED },
@@ -405,6 +432,8 @@ static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 		.len = sizeof(struct ovs_key_ct_tuple_ipv4) },
 	[OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6] = {
 		.len = sizeof(struct ovs_key_ct_tuple_ipv6) },
+	[OVS_KEY_ATTR_NSH]       = { .len = OVS_ATTR_NESTED,
+				     .next = ovs_nsh_key_attr_lens, },
 };
 
 static bool check_attr_len(unsigned int attr_len, unsigned int expected_len)
@@ -1179,6 +1208,222 @@ static int metadata_from_nlattrs(struct net *net, struct sw_flow_match *match,
 	return 0;
 }
 
+int nsh_hdr_from_nlattr(const struct nlattr *attr,
+			struct nshhdr *nsh, size_t size)
+{
+	struct nlattr *a;
+	int rem;
+	u8 flags = 0;
+	u8 ttl = 0;
+	int mdlen = 0;
+
+	/* validate_nsh has check this, so we needn't do duplicate check here
+	 */
+	nla_for_each_nested(a, attr, rem) {
+		int type = nla_type(a);
+
+		switch (type) {
+		case OVS_NSH_KEY_ATTR_BASE: {
+			const struct ovs_nsh_key_base *base =
+				(struct ovs_nsh_key_base *)nla_data(a);
+			flags = base->flags;
+			ttl = base->ttl;
+			nsh->np = base->np;
+			nsh->mdtype = base->mdtype;
+			nsh->path_hdr = base->path_hdr;
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD1: {
+			const struct ovs_nsh_key_md1 *md1 =
+				(struct ovs_nsh_key_md1 *)nla_data(a);
+			struct nsh_md1_ctx *md1_dst = &nsh->md1;
+
+			mdlen = nla_len(a);
+			memcpy(md1_dst, md1, mdlen);
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD2: {
+			const struct u8 *md2 = nla_data(a);
+			struct nsh_md2_tlv *md2_dst = &nsh->md2;
+
+			mdlen = nla_len(a);
+			memcpy(md2_dst, md2, mdlen);
+			break;
+		}
+		default:
+			return -EINVAL;
+		}
+	}
+
+	/* nsh header length  = NSH_BASE_HDR_LEN + mdlen */
+	nsh->ver_flags_ttl_len = 0;
+	nsh_set_flags_ttl_len(nsh, flags, ttl, NSH_BASE_HDR_LEN + mdlen);
+
+	return 0;
+}
+
+int nsh_key_from_nlattr(const struct nlattr *attr,
+			struct ovs_key_nsh *nsh, struct ovs_key_nsh *nsh_mask)
+{
+	struct nlattr *a;
+	int rem;
+
+	/* validate_nsh has check this, so we needn't do duplicate check here
+	 */
+	nla_for_each_nested(a, attr, rem) {
+		int type = nla_type(a);
+
+		switch (type) {
+		case OVS_NSH_KEY_ATTR_BASE: {
+			const struct ovs_nsh_key_base *base =
+				(struct ovs_nsh_key_base *)nla_data(a);
+			const struct ovs_nsh_key_base *base_mask = base + 1;
+
+			memcpy(nsh, base, sizeof(*base));
+			memcpy(nsh, base_mask, sizeof(*base_mask));
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD1: {
+			const struct ovs_nsh_key_md1 *md1 =
+				(struct ovs_nsh_key_md1 *)nla_data(a);
+			const struct ovs_nsh_key_md1 *md1_mask = md1 + 1;
+
+			memcpy(nsh->context, md1->context, sizeof(*md1));
+			memcpy(nsh_mask->context, md1_mask->context,
+			       sizeof(*md1_mask));
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD2:
+			/* Not supported yet */
+			return -ENOTSUPP;
+		default:
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int nsh_key_put_from_nlattr(const struct nlattr *attr,
+				   struct sw_flow_match *match, bool is_mask,
+				   bool is_push_nsh, bool log)
+{
+	struct nlattr *a;
+	int rem;
+	bool has_base = false;
+	bool has_md1 = false;
+	bool has_md2 = false;
+	u8 mdtype = 0;
+	int mdlen = 0;
+
+	if (unlikely(is_push_nsh && is_mask))
+		return -EINVAL;
+
+	nla_for_each_nested(a, attr, rem) {
+		int type = nla_type(a);
+		int i;
+
+		if (type > OVS_NSH_KEY_ATTR_MAX) {
+			OVS_NLERR(log, "nsh attr %d is out of range max %d",
+				  type, OVS_NSH_KEY_ATTR_MAX);
+			return -EINVAL;
+		}
+
+		if (!check_attr_len(nla_len(a),
+				    ovs_nsh_key_attr_lens[type].len)) {
+			OVS_NLERR(
+			    log,
+			    "nsh attr %d has unexpected len %d expected %d",
+			    type,
+			    nla_len(a),
+			    ovs_nsh_key_attr_lens[type].len
+			);
+			return -EINVAL;
+		}
+
+		switch (type) {
+		case OVS_NSH_KEY_ATTR_BASE: {
+			const struct ovs_nsh_key_base *base =
+				(struct ovs_nsh_key_base *)nla_data(a);
+
+			has_base = true;
+			mdtype = base->mdtype;
+			SW_FLOW_KEY_PUT(match, nsh.flags,
+					base->flags, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.ttl,
+					base->ttl, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.mdtype,
+					base->mdtype, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.np,
+					base->np, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.path_hdr,
+					base->path_hdr, is_mask);
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD1: {
+			const struct ovs_nsh_key_md1 *md1 =
+				(struct ovs_nsh_key_md1 *)nla_data(a);
+
+			has_md1 = true;
+			for (i = 0; i < NSH_MD1_CONTEXT_SIZE; i++)
+				SW_FLOW_KEY_PUT(match, nsh.context[i],
+						md1->context[i], is_mask);
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD2:
+			if (!is_push_nsh) /* Not supported MD type 2 yet */
+				return -ENOTSUPP;
+
+			has_md2 = true;
+			mdlen = nla_len(a);
+			if ((mdlen > NSH_CTX_HDRS_MAX_LEN) ||
+			    (mdlen <= 0)) {
+				WARN_ON_ONCE(1);
+				return -EINVAL;
+			}
+			break;
+		default:
+			OVS_NLERR(log, "Unknown nsh attribute %d",
+				  type);
+			return -EINVAL;
+		}
+	}
+
+	if (rem > 0) {
+		OVS_NLERR(log, "nsh attribute has %d unknown bytes.", rem);
+		return -EINVAL;
+	}
+
+	if (has_md1 && has_md2) {
+		OVS_NLERR(
+		    1,
+		    "invalid nsh attribute: md1 and md2 are exclusive."
+		);
+		return -EINVAL;
+	}
+
+	if (!is_mask) {
+		if ((has_md1 && mdtype != NSH_M_TYPE1) ||
+		    (has_md2 && mdtype != NSH_M_TYPE2)) {
+			OVS_NLERR(1, "nsh attribute has unmatched MD type %d.",
+				  mdtype);
+			return -EINVAL;
+		}
+
+		if (is_push_nsh &&
+		    (!has_base || (!has_md1 && !has_md2))) {
+			OVS_NLERR(
+			    1,
+			    "push nsh attributes are invalid for type %d.",
+			    mdtype
+			);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
 static int ovs_key_from_nlattrs(struct net *net, struct sw_flow_match *match,
 				u64 attrs, const struct nlattr **a,
 				bool is_mask, bool log)
@@ -1306,6 +1551,13 @@ static int ovs_key_from_nlattrs(struct net *net, struct sw_flow_match *match,
 		attrs &= ~(1 << OVS_KEY_ATTR_ARP);
 	}
 
+	if (attrs & (1 << OVS_KEY_ATTR_NSH)) {
+		if (nsh_key_put_from_nlattr(a[OVS_KEY_ATTR_NSH], match,
+					    is_mask, false, log) < 0)
+			return -EINVAL;
+		attrs &= ~(1 << OVS_KEY_ATTR_NSH);
+	}
+
 	if (attrs & (1 << OVS_KEY_ATTR_MPLS)) {
 		const struct ovs_key_mpls *mpls_key;
 
@@ -1622,6 +1874,40 @@ static int ovs_nla_put_vlan(struct sk_buff *skb, const struct vlan_head *vh,
 	return 0;
 }
 
+static int nsh_key_to_nlattr(const struct ovs_key_nsh *nsh, bool is_mask,
+			     struct sk_buff *skb)
+{
+	struct nlattr *start;
+	struct ovs_nsh_key_base base;
+	struct ovs_nsh_key_md1 md1;
+
+	memcpy(&base, nsh, sizeof(base));
+
+	if (is_mask || nsh->mdtype == NSH_M_TYPE1)
+		memcpy(md1.context, nsh->context, sizeof(md1));
+
+	start = nla_nest_start(skb, OVS_KEY_ATTR_NSH);
+	if (!start)
+		return -EMSGSIZE;
+
+	if (nla_put(skb, OVS_NSH_KEY_ATTR_BASE, sizeof(base), &base))
+		goto nla_put_failure;
+
+	if (is_mask || nsh->mdtype == NSH_M_TYPE1) {
+		if (nla_put(skb, OVS_NSH_KEY_ATTR_MD1, sizeof(md1), &md1))
+			goto nla_put_failure;
+	}
+
+	/* Don't support MD type 2 yet */
+
+	nla_nest_end(skb, start);
+
+	return 0;
+
+nla_put_failure:
+	return -EMSGSIZE;
+}
+
 static int __ovs_nla_put_key(const struct sw_flow_key *swkey,
 			     const struct sw_flow_key *output, bool is_mask,
 			     struct sk_buff *skb)
@@ -1750,6 +2036,9 @@ static int __ovs_nla_put_key(const struct sw_flow_key *swkey,
 		ipv6_key->ipv6_tclass = output->ip.tos;
 		ipv6_key->ipv6_hlimit = output->ip.ttl;
 		ipv6_key->ipv6_frag = output->ip.frag;
+	} else if (swkey->eth.type == htons(ETH_P_NSH)) {
+		if (nsh_key_to_nlattr(&output->nsh, is_mask, skb))
+			goto nla_put_failure;
 	} else if (swkey->eth.type == htons(ETH_P_ARP) ||
 		   swkey->eth.type == htons(ETH_P_RARP)) {
 		struct ovs_key_arp *arp_key;
@@ -2242,6 +2531,19 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	return err;
 }
 
+static bool validate_nsh(const struct nlattr *attr, bool is_mask,
+			 bool is_push_nsh, bool log)
+{
+	struct sw_flow_match match;
+	struct sw_flow_key key;
+	int ret = 0;
+
+	ovs_match_init(&match, &key, true, NULL);
+	ret = nsh_key_put_from_nlattr(attr, &match, is_mask,
+				      is_push_nsh, log);
+	return ((ret != 0) ? false : true);
+}
+
 /* Return false if there are any non-masked bits set.
  * Mask follows data immediately, before any netlink padding.
  */
@@ -2384,6 +2686,11 @@ static int validate_set(const struct nlattr *a,
 
 		break;
 
+	case OVS_KEY_ATTR_NSH:
+		if (!validate_nsh(nla_data(a), masked, false, log))
+			return -EINVAL;
+		break;
+
 	default:
 		return -EINVAL;
 	}
@@ -2482,6 +2789,8 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 			[OVS_ACTION_ATTR_TRUNC] = sizeof(struct ovs_action_trunc),
 			[OVS_ACTION_ATTR_PUSH_ETH] = sizeof(struct ovs_action_push_eth),
 			[OVS_ACTION_ATTR_POP_ETH] = 0,
+			[OVS_ACTION_ATTR_PUSH_NSH] = (u32)-1,
+			[OVS_ACTION_ATTR_POP_NSH] = 0,
 		};
 		const struct ovs_action_push_vlan *vlan;
 		int type = nla_type(a);
@@ -2636,6 +2945,19 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 			mac_proto = MAC_PROTO_ETHERNET;
 			break;
 
+		case OVS_ACTION_ATTR_PUSH_NSH:
+			mac_proto = MAC_PROTO_NONE;
+			if (!validate_nsh(nla_data(a), false, true, true))
+				return -EINVAL;
+			break;
+
+		case OVS_ACTION_ATTR_POP_NSH:
+			if (key->nsh.np == TUN_P_ETHERNET)
+				mac_proto = MAC_PROTO_ETHERNET;
+			else
+				mac_proto = MAC_PROTO_NONE;
+			break;
+
 		default:
 			OVS_NLERR(log, "Unknown Action type %d", type);
 			return -EINVAL;
diff --git a/net/openvswitch/flow_netlink.h b/net/openvswitch/flow_netlink.h
index 929c665..4b80083 100644
--- a/net/openvswitch/flow_netlink.h
+++ b/net/openvswitch/flow_netlink.h
@@ -79,4 +79,9 @@ int ovs_nla_put_actions(const struct nlattr *attr,
 void ovs_nla_free_flow_actions(struct sw_flow_actions *);
 void ovs_nla_free_flow_actions_rcu(struct sw_flow_actions *);
 
+int nsh_key_from_nlattr(const struct nlattr *attr, struct ovs_key_nsh *nsh,
+			struct ovs_key_nsh *nsh_mask);
+int nsh_hdr_from_nlattr(const struct nlattr *attr, struct nshhdr *src_nsh_hdr,
+			size_t size);
+
 #endif /* flow_netlink.h */
-- 
2.5.5

^ permalink raw reply related

* RE: [PATCH v4 4/9] em28xx: fix em28xx_dvb_init for KASAN
From: David Laight @ 2017-09-25 14:41 UTC (permalink / raw)
  To: 'Arnd Bergmann', Mauro Carvalho Chehab
  Cc: Jiri Pirko, Arend van Spriel, Kalle Valo, David S. Miller,
	Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	Masahiro Yamada, Michal Marek, Andrew Morton, Kees Cook,
	Geert Uytterhoeven, Greg Kroah-Hartman,
	linux-media@vger.kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, linux-wireless@vger.kernel.org
In-Reply-To: <20170922212930.620249-5-arnd@arndb.de>

From: Arnd Bergmann
> Sent: 22 September 2017 22:29
...
> It seems that this is triggered in part by using strlcpy(), which the
> compiler doesn't recognize as copying at most 'len' bytes, since strlcpy
> is not part of the C standard.

Neither is strncpy().

It'll almost certainly be a marker in a header file somewhere,
so it should be possibly to teach it about other functions.

	David

^ permalink raw reply

* Re: [PATCH] r8152:  add Linksys USB3GIGV1 id
From: Oliver Neukum @ 2017-09-25 14:51 UTC (permalink / raw)
  To: Grant Grundler, Hayes Wang; +Cc: David S . Miller, LKML, linux-usb, netdev
In-Reply-To: <20170922190604.108155-1-grundler@chromium.org>

Am Freitag, den 22.09.2017, 12:06 -0700 schrieb Grant Grundler:
> This Linksys dongle by default comes up in cdc_ether mode.
> This patch allows r8152 to claim the device:
>    Bus 002 Device 002: ID 13b1:0041 Linksys

Hi,

have you tested this in case cdc_ether is for some reason already
loaded? The patch seems to enable r8152 but does not disable
cdc_ether.

	Regards
		Oliver

^ permalink raw reply

* Re: [PATCH net v2] sctp: Fix a big endian bug in sctp_diag_dump()
From: Xin Long @ 2017-09-25 15:00 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Vlad Yasevich, Neil Horman, David S. Miller, linux-sctp,
	network dev, kernel-janitors
In-Reply-To: <20170925101926.db4f6x4hblh7tcvo@mwanda>

On Mon, Sep 25, 2017 at 6:19 PM, Dan Carpenter <dan.carpenter@oracle.com> wrote:
> The sctp_for_each_transport() function takes an pointer to int.  The
> cb->args[] array holds longs so it's only using the high 32 bits.  It
> works on little endian system but will break on big endian 64 bit
> machines.
>
> Fixes: d25adbeb0cdb ("sctp: fix an use-after-free issue in sctp_sock_dump")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> ---
> v2: The v1 patch changed the function to take a long pointer, but v2
>     just changes the caller.
>
> diff --git a/net/sctp/sctp_diag.c b/net/sctp/sctp_diag.c
> index 22ed01a76b19..a72a7d925d46 100644
> --- a/net/sctp/sctp_diag.c
> +++ b/net/sctp/sctp_diag.c
> @@ -463,6 +463,7 @@ static void sctp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
>                 .r = r,
>                 .net_admin = netlink_net_capable(cb->skb, CAP_NET_ADMIN),
>         };
> +       int pos = cb->args[2];
>
>         /* eps hashtable dumps
>          * args:
> @@ -493,7 +494,8 @@ static void sctp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
>                 goto done;
>
>         sctp_for_each_transport(sctp_sock_filter, sctp_sock_dump,
> -                               net, (int *)&cb->args[2], &commp);
> +                               net, &pos, &commp);
> +       cb->args[2] = pos;
>
>  done:
>         cb->args[1] = cb->args[4];

Reviewed-by: Xin Long <lucien.xin@gmail.com>

^ permalink raw reply

* Re: [PATCH net-next 0/7] nfp: flower vxlan tunnel offload
From: Or Gerlitz @ 2017-09-25 15:25 UTC (permalink / raw)
  To: Simon Horman; +Cc: David Miller, Jakub Kicinski, Linux Netdev List, oss-drivers
In-Reply-To: <1506335021-32024-1-git-send-email-simon.horman@netronome.com>

On Mon, Sep 25, 2017 at 1:23 PM, Simon Horman
<simon.horman@netronome.com> wrote:
> From: Simon Horman <simon.horman@netronome.com>
>
> John says:
>
> This patch set allows offloading of TC flower match and set tunnel fields
> to the NFP. The initial focus is on VXLAN traffic. Due to the current
> state of the NFP firmware, only VXLAN traffic on well known port 4789 is
> handled. The match and action fields must explicity set this value to be
> supported. Tunnel end point information is also offloaded to the NFP for
> both encapsulation and decapsulation. The NFP expects 3 separate data sets
> to be supplied.

> For decapsulation, 2 separate lists exist; a list of MAC addresses
> referenced by an index comprised of the port number, and a list of IP
> addresses. These IP addresses are not connected to a MAC or port.

Do these IP addresses exist on the host kernel SW stack? can the same
set of TC rules be fully functional and generate the same traffic
pattern when set to run in SW (skip_hw)?


> The MAC
> addresses can be written as a block or one at a time (because they have an
> index, previous values can be overwritten) while the IP addresses are
> always written as a list of all the available IPs. Because the MAC address
> used as a tunnel end point may be associated with a physical port or may
> be a virtual netdev like an OVS bridge, we do not know which addresses
> should be offloaded. For this reason, all MAC addresses of active netdevs
> are offloaded to the NFP. A notifier checks for changes to any currently
> offloaded MACs or any new netdevs that may occur. For IP addresses, the
> tunnel end point used in the rules is known as the destination IP address
> must be specified in the flower classifier rule. When a new IP address
> appears in a rule, the IP address is offloaded. The IP is removed from the
> offloaded list when all rules matching on that IP are deleted.
>
> For encapsulation, a next hop table is updated on the NFP that contains
> the source/dest IPs, MACs and egress port. These are written individually
> when requested. If the NFP tries to encapsulate a packet but does not know
> the next hop, then is sends a request to the host. The host carries out a
> route lookup and populates the given entry on the NFP table. A notifier
> also exists to check for any links changing or going down in the kernel
> next hop table. If an offloaded next hop entry is removed from the kernel
> then it is also removed on the NFP.
>
> The NFP periodically sends a message to the host telling it which tunnel
> ports have packets egressing the system. The host uses this information to
> update the used value in the neighbour entry. This means that, rather than
> expire when it times out, the kernel will send an ARP to check if the link
> is still live. From an NFP perspective, this means that valid entries will
> not be removed from its next hop table.
>
> John Hurley (7):
>   nfp: add helper to get flower cmsg length
>   nfp: compile flower vxlan tunnel metadata match fields
>   nfp: compile flower vxlan tunnel set actions
>   nfp: offload flower vxlan endpoint MAC addresses
>   nfp: offload vxlan IPv4 endpoints of flower rules
>   nfp: flower vxlan neighbour offload
>   nfp: flower vxlan neighbour keep-alive
>
>  drivers/net/ethernet/netronome/nfp/Makefile        |   3 +-
>  drivers/net/ethernet/netronome/nfp/flower/action.c | 169 ++++-
>  drivers/net/ethernet/netronome/nfp/flower/cmsg.c   |  16 +-
>  drivers/net/ethernet/netronome/nfp/flower/cmsg.h   |  87 ++-
>  drivers/net/ethernet/netronome/nfp/flower/main.c   |  13 +
>  drivers/net/ethernet/netronome/nfp/flower/main.h   |  35 +
>  drivers/net/ethernet/netronome/nfp/flower/match.c  |  75 +-
>  .../net/ethernet/netronome/nfp/flower/metadata.c   |   2 +-
>  .../net/ethernet/netronome/nfp/flower/offload.c    |  74 +-
>  .../ethernet/netronome/nfp/flower/tunnel_conf.c    | 811 +++++++++++++++++++++
>  10 files changed, 1243 insertions(+), 42 deletions(-)
>  create mode 100644 drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
>
> --
> 2.1.4
>

^ permalink raw reply

* [PATCH net] inetpeer: fix RCU lookup() again
From: Eric Dumazet @ 2017-09-25 15:40 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

My prior fix was not complete, as we were dereferencing a pointer
three times per node, not twice as I initially thought.

Fixes: 4cc5b44b29a9 ("inetpeer: fix RCU lookup()")
Fixes: b145425f269a ("inetpeer: remove AVL implementation in favor of RB tree")
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/inetpeer.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index e7eb590c86ce2b33654c17c61619de74ff07bfd1..b20c8ac640811e1b4c5416134181dd77675db878 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -128,9 +128,9 @@ static struct inet_peer *lookup(const struct inetpeer_addr *daddr,
 			break;
 		}
 		if (cmp == -1)
-			pp = &(*pp)->rb_left;
+			pp = &next->rb_left;
 		else
-			pp = &(*pp)->rb_right;
+			pp = &next->rb_right;
 	}
 	*parent_p = parent;
 	*pp_p = pp;

^ permalink raw reply related

* Re: [PATCH net-next v5 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map
From: Yonghong Song @ 2017-09-25 15:41 UTC (permalink / raw)
  To: Alexei Starovoitov, David Miller, peterz
  Cc: rostedt, daniel, netdev, kernel-team
In-Reply-To: <b9753171-8e70-92f2-9852-8d3a33f2c07c@fb.com>



On 9/21/17 8:15 AM, Alexei Starovoitov wrote:
> On 9/20/17 4:07 PM, David Miller wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>> Date: Wed, 20 Sep 2017 19:26:51 +0200
>>
>>> Dave, could we have this in a topic tree of sorts, because I have a
>>> pending series to rework all the timekeeping and it might be nice to not
>>> have sfr run into all sorts of conflicts.
>>
>> If you want to merge it into your tree that's fine.
>>
>> But it means that any further development done on top of this
>> work will need to go via you as well.
> 
> can we merge this set of patches into both net-next and tip ?
> We did such tricks in the past to avoid merge conflicts and
> everything went fine during merge window.

Hi, Peter,

Any suggestion for merging this patch? Alexei's idea sounds
reasonable?

Thanks,

Yonghong

^ permalink raw reply

* [PATCH net-next] inetpeer: speed up inetpeer_invalidate_tree()
From: Eric Dumazet @ 2017-09-25 16:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

As measured in my prior patch ("sch_netem: faster rb tree removal"),
rbtree_postorder_for_each_entry_safe() is nice looking but much slower
than using rb_next() directly, except when tree is small enough
to fit in CPU caches (then the cost is the same)

From: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/inetpeer.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index e7eb590c86ce2b33654c17c61619de74ff07bfd1..6e5626cc366c150b13fd44a8e36169c2fb54476d 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -284,14 +284,17 @@ EXPORT_SYMBOL(inet_peer_xrlim_allow);
 
 void inetpeer_invalidate_tree(struct inet_peer_base *base)
 {
-	struct inet_peer *p, *n;
+	struct rb_node *p = rb_first(&base->rb_root);
 
-	rbtree_postorder_for_each_entry_safe(p, n, &base->rb_root, rb_node) {
-		inet_putpeer(p);
+	while (p) {
+		struct inet_peer *peer = rb_entry(p, struct inet_peer, rb_node);
+
+		p = rb_next(p);
+		rb_erase(&peer->rb_node, &base->rb_root);
+		inet_putpeer(peer);
 		cond_resched();
 	}
 
-	base->rb_root = RB_ROOT;
 	base->total = 0;
 }
 EXPORT_SYMBOL(inetpeer_invalidate_tree);

^ permalink raw reply related

* Re: [PATCH net-next] sch_netem: faster rb tree removal
From: David Ahern @ 2017-09-25 16:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Stephen Hemminger
In-Reply-To: <1506317240.6617.5.camel@edumazet-glaptop3.roam.corp.google.com>

On 9/24/17 11:27 PM, Eric Dumazet wrote:
> On Sun, 2017-09-24 at 20:05 -0600, David Ahern wrote:
>> On 9/24/17 7:57 PM, David Ahern wrote:
> 
>>> Hi Eric:
>>>
>>> I'm guessing the cost is in the rb_first and rb_next computations. Did
>>> you consider something like this:
>>>
>>>         struct rb_root *root
>>>         struct rb_node **p = &root->rb_node;
>>>
>>>         while (*p != NULL) {
>>>                 struct foobar *fb;
>>>
>>>                 fb = container_of(*p, struct foobar, rb_node);
>>>                 // fb processing
>> 		  rb_erase(&nh->rb_node, root);
>>
>>>                 p = &root->rb_node;
>>>         }
>>>
>>
>> Oops, dropped the rb_erase in my consolidating the code to this snippet.
> 
> Hi David
> 
> This gives about same numbers than method_1
> 
> I tried with 10^7 skbs in the tree :
> 
> Your suggestion takes 66ns per skb, while the one I chose takes 37ns per
> skb.

Thanks for the test.

I made a simple program this morning and ran it under perf. With the
above suggestion the rb_erase has a high cost because it always deletes
the root node. Your method 1 has a high cost on rb_first which is
expected given its definition and it is run on each removal. Both
options increase in time with the number of entries in the tree.

Your method 2 is fairly constant from 10,000 entries to 10M entries
which makes sense: a one time cost at finding rb_first and then always
removing a bottom node so rb_erase is light.

As for the change:

Acked-by: David Ahern <dsahern@gmail.com>

^ permalink raw reply

* Re: [PATCH,v3,net-next 1/2] tun: enable NAPI for TUN/TAP driver
From: Mahesh Bandewar (महेश बंडेवार) @ 2017-09-25 16:23 UTC (permalink / raw)
  To: Petar Penkov
  Cc: linux-netdev, Eric Dumazet, Willem Bruijn, David Miller,
	Petar Bozhidarov Penkov
In-Reply-To: <20170922204915.7889-2-peterpenkov96@gmail.com>

On Fri, Sep 22, 2017 at 1:49 PM, Petar Penkov <peterpenkov96@gmail.com> wrote:
> Changes TUN driver to use napi_gro_receive() upon receiving packets
> rather than netif_rx_ni(). Adds flag IFF_NAPI that enables these
> changes and operation is not affected if the flag is disabled.  SKBs
> are constructed upon packet arrival and are queued to be processed
> later.
>
> The new path was evaluated with a benchmark with the following setup:
> Open two tap devices and a receiver thread that reads in a loop for
> each device. Start one sender thread and pin all threads to different
> CPUs. Send 1M minimum UDP packets to each device and measure sending
> time for each of the sending methods:
>         napi_gro_receive():     4.90s
>         netif_rx_ni():          4.90s
>         netif_receive_skb():    7.20s
>
> Signed-off-by: Petar Penkov <peterpenkov96@gmail.com>

Thank you Petar.

Acked-by: Mahesh Bandewar <maheshb@google.com>

> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Mahesh Bandewar <maheshb@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: davem@davemloft.net
> Cc: ppenkov@stanford.edu
> ---
>  drivers/net/tun.c           | 133 +++++++++++++++++++++++++++++++++++++++-----
>  include/uapi/linux/if_tun.h |   1 +
>  2 files changed, 119 insertions(+), 15 deletions(-)
>

^ permalink raw reply

* Re: [PATCH,v3,net-next 2/2] tun: enable napi_gro_frags() for TUN/TAP driver
From: Mahesh Bandewar (महेश बंडेवार) @ 2017-09-25 16:27 UTC (permalink / raw)
  To: Petar Penkov
  Cc: linux-netdev, Eric Dumazet, Willem Bruijn, David Miller,
	Petar Bozhidarov Penkov
In-Reply-To: <20170922204915.7889-3-peterpenkov96@gmail.com>

On Fri, Sep 22, 2017 at 1:49 PM, Petar Penkov <peterpenkov96@gmail.com> wrote:
> Add a TUN/TAP receive mode that exercises the napi_gro_frags()
> interface. This mode is available only in TAP mode, as the interface
> expects packets with Ethernet headers.
>
> Furthermore, packets follow the layout of the iovec_iter that was
> received. The first iovec is the linear data, and every one after the
> first is a fragment. If there are more fragments than the max number,
> drop the packet. Additionally, invoke eth_get_headlen() to exercise flow
> dissector code and to verify that the header resides in the linear data.
>
> The napi_gro_frags() mode requires setting the IFF_NAPI_FRAGS option.
> This is imposed because this mode is intended for testing via tools like
> syzkaller and packetdrill, and the increased flexibility it provides can
> introduce security vulnerabilities. This flag is accepted only if the
> device is in TAP mode and has the IFF_NAPI flag set as well. This is
> done because both of these are explicit requirements for correct
> operation in this mode.
>
> Signed-off-by: Petar Penkov <peterpenkov96@gmail.com>

Thank you Petar.

Acked-by: Mahesh Bandewar <maheshb@google,com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Mahesh Bandewar <maheshb@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: davem@davemloft.net
> Cc: ppenkov@stanford.edu
> ---
>  drivers/net/tun.c           | 134 ++++++++++++++++++++++++++++++++++++++++++--
>  include/uapi/linux/if_tun.h |   1 +
>  2 files changed, 129 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index f16407242b18..9880b3bc8fa5 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -75,6 +75,7 @@
>  #include <linux/skb_array.h>
>  #include <linux/bpf.h>
>  #include <linux/bpf_trace.h>
> +#include <linux/mutex.h>
>
>  #include <linux/uaccess.h>
>
> @@ -121,7 +122,8 @@ do {                                                                \
>  #define TUN_VNET_BE     0x40000000
>
>  #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
> -                     IFF_MULTI_QUEUE | IFF_NAPI)
> +                     IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)
> +
>  #define GOODCOPY_LEN 128
>
>  #define FLT_EXACT_COUNT 8
> @@ -173,6 +175,7 @@ struct tun_file {
>                 unsigned int ifindex;
>         };
>         struct napi_struct napi;
> +       struct mutex napi_mutex;        /* Protects access to the above napi */
>         struct list_head next;
>         struct tun_struct *detached;
>         struct skb_array tx_array;
> @@ -277,6 +280,7 @@ static void tun_napi_init(struct tun_struct *tun, struct tun_file *tfile,
>                 netif_napi_add(tun->dev, &tfile->napi, tun_napi_poll,
>                                NAPI_POLL_WEIGHT);
>                 napi_enable(&tfile->napi);
> +               mutex_init(&tfile->napi_mutex);
>         }
>  }
>
> @@ -292,6 +296,11 @@ static void tun_napi_del(struct tun_struct *tun, struct tun_file *tfile)
>                 netif_napi_del(&tfile->napi);
>  }
>
> +static bool tun_napi_frags_enabled(const struct tun_struct *tun)
> +{
> +       return READ_ONCE(tun->flags) & IFF_NAPI_FRAGS;
> +}
> +
>  #ifdef CONFIG_TUN_VNET_CROSS_LE
>  static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
>  {
> @@ -1036,7 +1045,8 @@ static void tun_poll_controller(struct net_device *dev)
>          * supports polling, which enables bridge devices in virt setups to
>          * still use netconsole
>          * If NAPI is enabled, however, we need to schedule polling for all
> -        * queues.
> +        * queues unless we are using napi_gro_frags(), which we call in
> +        * process context and not in NAPI context.
>          */
>         struct tun_struct *tun = netdev_priv(dev);
>
> @@ -1044,6 +1054,9 @@ static void tun_poll_controller(struct net_device *dev)
>                 struct tun_file *tfile;
>                 int i;
>
> +               if (tun_napi_frags_enabled(tun))
> +                       return;
> +
>                 rcu_read_lock();
>                 for (i = 0; i < tun->numqueues; i++) {
>                         tfile = rcu_dereference(tun->tfiles[i]);
> @@ -1266,6 +1279,64 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
>         return mask;
>  }
>
> +static struct sk_buff *tun_napi_alloc_frags(struct tun_file *tfile,
> +                                           size_t len,
> +                                           const struct iov_iter *it)
> +{
> +       struct sk_buff *skb;
> +       size_t linear;
> +       int err;
> +       int i;
> +
> +       if (it->nr_segs > MAX_SKB_FRAGS + 1)
> +               return ERR_PTR(-ENOMEM);
> +
> +       local_bh_disable();
> +       skb = napi_get_frags(&tfile->napi);
> +       local_bh_enable();
> +       if (!skb)
> +               return ERR_PTR(-ENOMEM);
> +
> +       linear = iov_iter_single_seg_count(it);
> +       err = __skb_grow(skb, linear);
> +       if (err)
> +               goto free;
> +
> +       skb->len = len;
> +       skb->data_len = len - linear;
> +       skb->truesize += skb->data_len;
> +
> +       for (i = 1; i < it->nr_segs; i++) {
> +               size_t fragsz = it->iov[i].iov_len;
> +               unsigned long offset;
> +               struct page *page;
> +               void *data;
> +
> +               if (fragsz == 0 || fragsz > PAGE_SIZE) {
> +                       err = -EINVAL;
> +                       goto free;
> +               }
> +
> +               local_bh_disable();
> +               data = napi_alloc_frag(fragsz);
> +               local_bh_enable();
> +               if (!data) {
> +                       err = -ENOMEM;
> +                       goto free;
> +               }
> +
> +               page = virt_to_head_page(data);
> +               offset = data - page_address(page);
> +               skb_fill_page_desc(skb, i - 1, page, offset, fragsz);
> +       }
> +
> +       return skb;
> +free:
> +       /* frees skb and all frags allocated with napi_alloc_frag() */
> +       napi_free_frags(&tfile->napi);
> +       return ERR_PTR(err);
> +}
> +
>  /* prepad is the amount to reserve at front.  len is length after that.
>   * linear is a hint as to how much to copy (usually headers). */
>  static struct sk_buff *tun_alloc_skb(struct tun_file *tfile,
> @@ -1478,6 +1549,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>         int err;
>         u32 rxhash;
>         int skb_xdp = 1;
> +       bool frags = tun_napi_frags_enabled(tun);
>
>         if (!(tun->dev->flags & IFF_UP))
>                 return -EIO;
> @@ -1535,7 +1607,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>                         zerocopy = true;
>         }
>
> -       if (tun_can_build_skb(tun, tfile, len, noblock, zerocopy)) {
> +       if (!frags && tun_can_build_skb(tun, tfile, len, noblock, zerocopy)) {
>                 /* For the packet that is not easy to be processed
>                  * (e.g gso or jumbo packet), we will do it at after
>                  * skb was created with generic XDP routine.
> @@ -1556,10 +1628,24 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>                                 linear = tun16_to_cpu(tun, gso.hdr_len);
>                 }
>
> -               skb = tun_alloc_skb(tfile, align, copylen, linear, noblock);
> +               if (frags) {
> +                       mutex_lock(&tfile->napi_mutex);
> +                       skb = tun_napi_alloc_frags(tfile, copylen, from);
> +                       /* tun_napi_alloc_frags() enforces a layout for the skb.
> +                        * If zerocopy is enabled, then this layout will be
> +                        * overwritten by zerocopy_sg_from_iter().
> +                        */
> +                       zerocopy = false;
> +               } else {
> +                       skb = tun_alloc_skb(tfile, align, copylen, linear,
> +                                           noblock);
> +               }
> +
>                 if (IS_ERR(skb)) {
>                         if (PTR_ERR(skb) != -EAGAIN)
>                                 this_cpu_inc(tun->pcpu_stats->rx_dropped);
> +                       if (frags)
> +                               mutex_unlock(&tfile->napi_mutex);
>                         return PTR_ERR(skb);
>                 }
>
> @@ -1571,6 +1657,11 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>                 if (err) {
>                         this_cpu_inc(tun->pcpu_stats->rx_dropped);
>                         kfree_skb(skb);
> +                       if (frags) {
> +                               tfile->napi.skb = NULL;
> +                               mutex_unlock(&tfile->napi_mutex);
> +                       }
> +
>                         return -EFAULT;
>                 }
>         }
> @@ -1578,6 +1669,11 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>         if (virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun))) {
>                 this_cpu_inc(tun->pcpu_stats->rx_frame_errors);
>                 kfree_skb(skb);
> +               if (frags) {
> +                       tfile->napi.skb = NULL;
> +                       mutex_unlock(&tfile->napi_mutex);
> +               }
> +
>                 return -EINVAL;
>         }
>
> @@ -1603,7 +1699,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>                 skb->dev = tun->dev;
>                 break;
>         case IFF_TAP:
> -               skb->protocol = eth_type_trans(skb, tun->dev);
> +               if (!frags)
> +                       skb->protocol = eth_type_trans(skb, tun->dev);
>                 break;
>         }
>
> @@ -1638,7 +1735,23 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>
>         rxhash = __skb_get_hash_symmetric(skb);
>
> -       if (tun->flags & IFF_NAPI) {
> +       if (frags) {
> +               /* Exercise flow dissector code path. */
> +               u32 headlen = eth_get_headlen(skb->data, skb_headlen(skb));
> +
> +               if (headlen > skb_headlen(skb) || headlen < ETH_HLEN) {
> +                       this_cpu_inc(tun->pcpu_stats->rx_dropped);
> +                       napi_free_frags(&tfile->napi);
> +                       mutex_unlock(&tfile->napi_mutex);
> +                       WARN_ON(1);
> +                       return -ENOMEM;
> +               }
> +
> +               local_bh_disable();
> +               napi_gro_frags(&tfile->napi);
> +               local_bh_enable();
> +               mutex_unlock(&tfile->napi_mutex);
> +       } else if (tun->flags & IFF_NAPI) {
>                 struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
>                 int queue_len;
>
> @@ -2061,6 +2174,15 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>         if (tfile->detached)
>                 return -EINVAL;
>
> +       if ((ifr->ifr_flags & IFF_NAPI_FRAGS)) {
> +               if (!capable(CAP_NET_ADMIN))
> +                       return -EPERM;
> +
> +               if (!(ifr->ifr_flags & IFF_NAPI) ||
> +                   (ifr->ifr_flags & TUN_TYPE_MASK) != IFF_TAP)
> +                       return -EINVAL;
> +       }
> +
>         dev = __dev_get_by_name(net, ifr->ifr_name);
>         if (dev) {
>                 if (ifr->ifr_flags & IFF_TUN_EXCL)
> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> index 30b6184884eb..365ade5685c9 100644
> --- a/include/uapi/linux/if_tun.h
> +++ b/include/uapi/linux/if_tun.h
> @@ -61,6 +61,7 @@
>  #define IFF_TUN                0x0001
>  #define IFF_TAP                0x0002
>  #define IFF_NAPI       0x0010
> +#define IFF_NAPI_FRAGS 0x0020
>  #define IFF_NO_PI      0x1000
>  /* This flag has no real effect */
>  #define IFF_ONE_QUEUE  0x2000
> --
> 2.11.0
>

^ permalink raw reply

* Re: [Patch net-next] net_sched: use idr to allocate bpf filter handles
From: Cong Wang @ 2017-09-25 16:46 UTC (permalink / raw)
  To: David Miller
  Cc: Linux Kernel Network Developers, Daniel Borkmann, Chris Mi,
	Jamal Hadi Salim
In-Reply-To: <20170923.171534.1037317649165173409.davem@davemloft.net>

On Sat, Sep 23, 2017 at 5:15 PM, David Miller <davem@davemloft.net> wrote:
> If I read the code properly, the existing code allows IDs from '1' to
> and including '0x7ffffffe', whereas the new code allows up to and
> including '0x7ffffffff'.
>
> Whether intentional or not this is a change in behavior.  If it's OK,
> it should be at least documented either in a comment or in the
> commit log itself.
>
> Similarly for your other scheduler IDR changes.

Good catch! Obviously off-by-one. I agree we should keep
the old behavior no matter the reason.

I will update the patches.

Thanks.

^ permalink raw reply

* Re: [Patch net-next] net_sched: use idr to allocate bpf filter handles
From: Cong Wang @ 2017-09-25 16:49 UTC (permalink / raw)
  To: David Miller
  Cc: Linux Kernel Network Developers, Daniel Borkmann, Chris Mi,
	Jamal Hadi Salim
In-Reply-To: <CAM_iQpVOnzqAs6sT8=n+n2zFUNd_K4evph3=68NL1CtBN=AAmg@mail.gmail.com>

On Mon, Sep 25, 2017 at 9:46 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Sat, Sep 23, 2017 at 5:15 PM, David Miller <davem@davemloft.net> wrote:
>> If I read the code properly, the existing code allows IDs from '1' to
>> and including '0x7ffffffe', whereas the new code allows up to and
>> including '0x7ffffffff'.
>>
>> Whether intentional or not this is a change in behavior.  If it's OK,
>> it should be at least documented either in a comment or in the
>> commit log itself.
>>
>> Similarly for your other scheduler IDR changes.
>
> Good catch! Obviously off-by-one. I agree we should keep
> the old behavior no matter the reason.
>
> I will update the patches.
>

By the way, same problem for commit fe2502e49b58606580c77.
:)

^ permalink raw reply

* Re: [PATCH net-next] sch_netem: faster rb tree removal
From: Eric Dumazet @ 2017-09-25 16:52 UTC (permalink / raw)
  To: David Ahern; +Cc: David Miller, netdev, Stephen Hemminger
In-Reply-To: <ce328dfd-0b54-883a-36a8-91ec344f86ad@gmail.com>

On Mon, 2017-09-25 at 10:14 -0600, David Ahern wrote:

> Thanks for the test.
> 
> I made a simple program this morning and ran it under perf. With the
> above suggestion the rb_erase has a high cost because it always deletes
> the root node. Your method 1 has a high cost on rb_first which is
> expected given its definition and it is run on each removal. Both
> options increase in time with the number of entries in the tree.
> 
> Your method 2 is fairly constant from 10,000 entries to 10M entries
> which makes sense: a one time cost at finding rb_first and then always
> removing a bottom node so rb_erase is light.
> 
> As for the change:
> 
> Acked-by: David Ahern <dsahern@gmail.com>

Thanks a lot for double checking !

^ permalink raw reply

* Re: [PATCH net-next 0/7] nfp: flower vxlan tunnel offload
From: Simon Horman @ 2017-09-25 17:04 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Miller, Jakub Kicinski, Linux Netdev List, oss-drivers,
	John Hurley
In-Reply-To: <CAJ3xEMgfRuD2Y-P2eTKmXpciiopA8js4zbO_nNOA6J62sn--mw@mail.gmail.com>

On Mon, Sep 25, 2017 at 06:25:03PM +0300, Or Gerlitz wrote:
> On Mon, Sep 25, 2017 at 1:23 PM, Simon Horman
> <simon.horman@netronome.com> wrote:
> > From: Simon Horman <simon.horman@netronome.com>
> >
> > John says:
> >
> > This patch set allows offloading of TC flower match and set tunnel fields
> > to the NFP. The initial focus is on VXLAN traffic. Due to the current
> > state of the NFP firmware, only VXLAN traffic on well known port 4789 is
> > handled. The match and action fields must explicity set this value to be
> > supported. Tunnel end point information is also offloaded to the NFP for
> > both encapsulation and decapsulation. The NFP expects 3 separate data sets
> > to be supplied.
> 
> > For decapsulation, 2 separate lists exist; a list of MAC addresses
> > referenced by an index comprised of the port number, and a list of IP
> > addresses. These IP addresses are not connected to a MAC or port.
> 
> Do these IP addresses exist on the host kernel SW stack? can the same
> set of TC rules be fully functional and generate the same traffic
> pattern when set to run in SW (skip_hw)?

Hi Or,

I asked John (now CCed) about this and his response was:

The MAC addresses are extracted from the netdevs already loaded in the
kernel and are monitored for any changes. The IP addresses are slightly
different in that they are extracted from the rules themselves. We make the
assumption that, if a packet is decapsulated at the end point and a match
is attempted on the IP address, that this IP address should be recognised
in the kernel. That being the case, the same traffic pattern should be
witnessed if the skip_hw flag is applied.

^ permalink raw reply

* [Patch net-next v2] net_sched: use idr to allocate bpf filter handles
From: Cong Wang @ 2017-09-25 17:13 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Daniel Borkmann, Chris Mi, Jamal Hadi Salim

Instead of calling cls_bpf_get() in a loop to find
a unused handle, just switch to idr API to allocate
new handles.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_bpf.c | 57 ++++++++++++++++++++++++++---------------------------
 1 file changed, 28 insertions(+), 29 deletions(-)

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 520c5027646a..e19c3c3c27cc 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -17,6 +17,7 @@
 #include <linux/skbuff.h>
 #include <linux/filter.h>
 #include <linux/bpf.h>
+#include <linux/idr.h>
 
 #include <net/rtnetlink.h>
 #include <net/pkt_cls.h>
@@ -32,7 +33,7 @@ MODULE_DESCRIPTION("TC BPF based classifier");
 
 struct cls_bpf_head {
 	struct list_head plist;
-	u32 hgen;
+	struct idr handle_idr;
 	struct rcu_head rcu;
 };
 
@@ -238,6 +239,7 @@ static int cls_bpf_init(struct tcf_proto *tp)
 		return -ENOBUFS;
 
 	INIT_LIST_HEAD_RCU(&head->plist);
+	idr_init(&head->handle_idr);
 	rcu_assign_pointer(tp->root, head);
 
 	return 0;
@@ -264,6 +266,9 @@ static void cls_bpf_delete_prog_rcu(struct rcu_head *rcu)
 
 static void __cls_bpf_delete(struct tcf_proto *tp, struct cls_bpf_prog *prog)
 {
+	struct cls_bpf_head *head = rtnl_dereference(tp->root);
+
+	idr_remove_ext(&head->handle_idr, prog->handle);
 	cls_bpf_stop_offload(tp, prog);
 	list_del_rcu(&prog->link);
 	tcf_unbind_filter(tp, &prog->res);
@@ -287,6 +292,7 @@ static void cls_bpf_destroy(struct tcf_proto *tp)
 	list_for_each_entry_safe(prog, tmp, &head->plist, link)
 		__cls_bpf_delete(tp, prog);
 
+	idr_destroy(&head->handle_idr);
 	kfree_rcu(head, rcu);
 }
 
@@ -421,27 +427,6 @@ static int cls_bpf_set_parms(struct net *net, struct tcf_proto *tp,
 	return 0;
 }
 
-static u32 cls_bpf_grab_new_handle(struct tcf_proto *tp,
-				   struct cls_bpf_head *head)
-{
-	unsigned int i = 0x80000000;
-	u32 handle;
-
-	do {
-		if (++head->hgen == 0x7FFFFFFF)
-			head->hgen = 1;
-	} while (--i > 0 && cls_bpf_get(tp, head->hgen));
-
-	if (unlikely(i == 0)) {
-		pr_err("Insufficient number of handles\n");
-		handle = 0;
-	} else {
-		handle = head->hgen;
-	}
-
-	return handle;
-}
-
 static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 			  struct tcf_proto *tp, unsigned long base,
 			  u32 handle, struct nlattr **tca,
@@ -451,6 +436,7 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 	struct cls_bpf_prog *oldprog = *arg;
 	struct nlattr *tb[TCA_BPF_MAX + 1];
 	struct cls_bpf_prog *prog;
+	unsigned long idr_index;
 	int ret;
 
 	if (tca[TCA_OPTIONS] == NULL)
@@ -476,21 +462,30 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 		}
 	}
 
-	if (handle == 0)
-		prog->handle = cls_bpf_grab_new_handle(tp, head);
-	else
+	if (handle == 0) {
+		ret = idr_alloc_ext(&head->handle_idr, prog, &idr_index,
+				    1, 0x7FFFFFFF, GFP_KERNEL);
+		if (ret)
+			goto errout;
+		prog->handle = idr_index;
+	} else {
+		if (!oldprog) {
+			ret = idr_alloc_ext(&head->handle_idr, prog, &idr_index,
+					    handle, handle + 1, GFP_KERNEL);
+			if (ret)
+				goto errout;
+		}
 		prog->handle = handle;
-	if (prog->handle == 0) {
-		ret = -EINVAL;
-		goto errout;
 	}
 
 	ret = cls_bpf_set_parms(net, tp, prog, base, tb, tca[TCA_RATE], ovr);
 	if (ret < 0)
-		goto errout;
+		goto errout_idr;
 
 	ret = cls_bpf_offload(tp, prog, oldprog);
 	if (ret) {
+		if (!oldprog)
+			idr_remove_ext(&head->handle_idr, prog->handle);
 		__cls_bpf_delete_prog(prog);
 		return ret;
 	}
@@ -499,6 +494,7 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 		prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
 
 	if (oldprog) {
+		idr_replace_ext(&head->handle_idr, prog, handle);
 		list_replace_rcu(&oldprog->link, &prog->link);
 		tcf_unbind_filter(tp, &oldprog->res);
 		call_rcu(&oldprog->rcu, cls_bpf_delete_prog_rcu);
@@ -509,6 +505,9 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 	*arg = prog;
 	return 0;
 
+errout_idr:
+	if (!oldprog)
+		idr_remove_ext(&head->handle_idr, prog->handle);
 errout:
 	tcf_exts_destroy(&prog->exts);
 	kfree(prog);
-- 
2.13.0

^ permalink raw reply related

* [Patch net-next v2] net_sched: use idr to allocate basic filter handles
From: Cong Wang @ 2017-09-25 17:13 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Chris Mi, Jamal Hadi Salim

Instead of calling basic_get() in a loop to find
a unused handle, just switch to idr API to allocate
new handles.

Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_basic.c | 37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index d89ebafd2239..cfeb6f158566 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -17,13 +17,14 @@
 #include <linux/errno.h>
 #include <linux/rtnetlink.h>
 #include <linux/skbuff.h>
+#include <linux/idr.h>
 #include <net/netlink.h>
 #include <net/act_api.h>
 #include <net/pkt_cls.h>
 
 struct basic_head {
-	u32			hgenerator;
 	struct list_head	flist;
+	struct idr		handle_idr;
 	struct rcu_head		rcu;
 };
 
@@ -78,6 +79,7 @@ static int basic_init(struct tcf_proto *tp)
 	if (head == NULL)
 		return -ENOBUFS;
 	INIT_LIST_HEAD(&head->flist);
+	idr_init(&head->handle_idr);
 	rcu_assign_pointer(tp->root, head);
 	return 0;
 }
@@ -99,8 +101,10 @@ static void basic_destroy(struct tcf_proto *tp)
 	list_for_each_entry_safe(f, n, &head->flist, link) {
 		list_del_rcu(&f->link);
 		tcf_unbind_filter(tp, &f->res);
+		idr_remove_ext(&head->handle_idr, f->handle);
 		call_rcu(&f->rcu, basic_delete_filter);
 	}
+	idr_destroy(&head->handle_idr);
 	kfree_rcu(head, rcu);
 }
 
@@ -111,6 +115,7 @@ static int basic_delete(struct tcf_proto *tp, void *arg, bool *last)
 
 	list_del_rcu(&f->link);
 	tcf_unbind_filter(tp, &f->res);
+	idr_remove_ext(&head->handle_idr, f->handle);
 	call_rcu(&f->rcu, basic_delete_filter);
 	*last = list_empty(&head->flist);
 	return 0;
@@ -154,6 +159,7 @@ static int basic_change(struct net *net, struct sk_buff *in_skb,
 	struct nlattr *tb[TCA_BASIC_MAX + 1];
 	struct basic_filter *fold = (struct basic_filter *) *arg;
 	struct basic_filter *fnew;
+	unsigned long idr_index;
 
 	if (tca[TCA_OPTIONS] == NULL)
 		return -EINVAL;
@@ -179,30 +185,31 @@ static int basic_change(struct net *net, struct sk_buff *in_skb,
 	err = -EINVAL;
 	if (handle) {
 		fnew->handle = handle;
-	} else if (fold) {
-		fnew->handle = fold->handle;
+		if (!fold) {
+			err = idr_alloc_ext(&head->handle_idr, fnew, &idr_index,
+					    handle, handle + 1, GFP_KERNEL);
+			if (err)
+				goto errout;
+		}
 	} else {
-		unsigned int i = 0x80000000;
-		do {
-			if (++head->hgenerator == 0x7FFFFFFF)
-				head->hgenerator = 1;
-		} while (--i > 0 && basic_get(tp, head->hgenerator));
-
-		if (i <= 0) {
-			pr_err("Insufficient number of handles\n");
+		err = idr_alloc_ext(&head->handle_idr, fnew, &idr_index,
+				    1, 0x7FFFFFFF, GFP_KERNEL);
+		if (err)
 			goto errout;
-		}
-
-		fnew->handle = head->hgenerator;
+		fnew->handle = idr_index;
 	}
 
 	err = basic_set_parms(net, tp, fnew, base, tb, tca[TCA_RATE], ovr);
-	if (err < 0)
+	if (err < 0) {
+		if (!fold)
+			idr_remove_ext(&head->handle_idr, fnew->handle);
 		goto errout;
+	}
 
 	*arg = fnew;
 
 	if (fold) {
+		idr_replace_ext(&head->handle_idr, fnew, fnew->handle);
 		list_replace_rcu(&fold->link, &fnew->link);
 		tcf_unbind_filter(tp, &fold->res);
 		call_rcu(&fold->rcu, basic_delete_filter);
-- 
2.13.0

^ permalink raw reply related

* [Patch net-next v2] net_sched: use idr to allocate u32 filter handles
From: Cong Wang @ 2017-09-25 17:13 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Chris Mi, Jamal Hadi Salim

Instead of calling u32_lookup_ht() in a loop to find
a unused handle, just switch to idr API to allocate
new handles. u32 filters are special as the handle
could contain a hash table id and a key id, so we
need two IDR to allocate each of them.

Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_u32.c | 108 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 67 insertions(+), 41 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 10b8d851fc6b..094d224411a9 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -46,6 +46,7 @@
 #include <net/act_api.h>
 #include <net/pkt_cls.h>
 #include <linux/netdevice.h>
+#include <linux/idr.h>
 
 struct tc_u_knode {
 	struct tc_u_knode __rcu	*next;
@@ -82,6 +83,7 @@ struct tc_u_hnode {
 	struct tc_u_common	*tp_c;
 	int			refcnt;
 	unsigned int		divisor;
+	struct idr		handle_idr;
 	struct rcu_head		rcu;
 	/* The 'ht' field MUST be the last field in structure to allow for
 	 * more entries allocated at end of structure.
@@ -93,7 +95,7 @@ struct tc_u_common {
 	struct tc_u_hnode __rcu	*hlist;
 	struct Qdisc		*q;
 	int			refcnt;
-	u32			hgenerator;
+	struct idr		handle_idr;
 	struct hlist_node	hnode;
 	struct rcu_head		rcu;
 };
@@ -311,19 +313,19 @@ static void *u32_get(struct tcf_proto *tp, u32 handle)
 	return u32_lookup_key(ht, handle);
 }
 
-static u32 gen_new_htid(struct tc_u_common *tp_c)
+static u32 gen_new_htid(struct tc_u_common *tp_c, struct tc_u_hnode *ptr)
 {
-	int i = 0x800;
+	unsigned long idr_index;
+	int err;
 
-	/* hgenerator only used inside rtnl lock it is safe to increment
+	/* This is only used inside rtnl lock it is safe to increment
 	 * without read _copy_ update semantics
 	 */
-	do {
-		if (++tp_c->hgenerator == 0x7FF)
-			tp_c->hgenerator = 1;
-	} while (--i > 0 && u32_lookup_ht(tp_c, (tp_c->hgenerator|0x800)<<20));
-
-	return i > 0 ? (tp_c->hgenerator|0x800)<<20 : 0;
+	err = idr_alloc_ext(&tp_c->handle_idr, ptr, &idr_index,
+			    1, 0x7FF, GFP_KERNEL);
+	if (err)
+		return 0;
+	return (u32)(idr_index | 0x800) << 20;
 }
 
 static struct hlist_head *tc_u_common_hash;
@@ -366,8 +368,9 @@ static int u32_init(struct tcf_proto *tp)
 		return -ENOBUFS;
 
 	root_ht->refcnt++;
-	root_ht->handle = tp_c ? gen_new_htid(tp_c) : 0x80000000;
+	root_ht->handle = tp_c ? gen_new_htid(tp_c, root_ht) : 0x80000000;
 	root_ht->prio = tp->prio;
+	idr_init(&root_ht->handle_idr);
 
 	if (tp_c == NULL) {
 		tp_c = kzalloc(sizeof(*tp_c), GFP_KERNEL);
@@ -377,6 +380,7 @@ static int u32_init(struct tcf_proto *tp)
 		}
 		tp_c->q = tp->q;
 		INIT_HLIST_NODE(&tp_c->hnode);
+		idr_init(&tp_c->handle_idr);
 
 		h = tc_u_hash(tp);
 		hlist_add_head(&tp_c->hnode, &tc_u_common_hash[h]);
@@ -565,6 +569,7 @@ static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht)
 					 rtnl_dereference(n->next));
 			tcf_unbind_filter(tp, &n->res);
 			u32_remove_hw_knode(tp, n->handle);
+			idr_remove_ext(&ht->handle_idr, n->handle);
 			call_rcu(&n->rcu, u32_delete_key_freepf_rcu);
 		}
 	}
@@ -586,6 +591,8 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht)
 	     hn = &phn->next, phn = rtnl_dereference(*hn)) {
 		if (phn == ht) {
 			u32_clear_hw_hnode(tp, ht);
+			idr_destroy(&ht->handle_idr);
+			idr_remove_ext(&tp_c->handle_idr, ht->handle);
 			RCU_INIT_POINTER(*hn, ht->next);
 			kfree_rcu(ht, rcu);
 			return 0;
@@ -633,6 +640,7 @@ static void u32_destroy(struct tcf_proto *tp)
 			kfree_rcu(ht, rcu);
 		}
 
+		idr_destroy(&tp_c->handle_idr);
 		kfree(tp_c);
 	}
 
@@ -701,27 +709,21 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last)
 	return ret;
 }
 
-#define NR_U32_NODE (1<<12)
-static u32 gen_new_kid(struct tc_u_hnode *ht, u32 handle)
+static u32 gen_new_kid(struct tc_u_hnode *ht, u32 htid)
 {
-	struct tc_u_knode *n;
-	unsigned long i;
-	unsigned long *bitmap = kzalloc(BITS_TO_LONGS(NR_U32_NODE) * sizeof(unsigned long),
-					GFP_KERNEL);
-	if (!bitmap)
-		return handle | 0xFFF;
-
-	for (n = rtnl_dereference(ht->ht[TC_U32_HASH(handle)]);
-	     n;
-	     n = rtnl_dereference(n->next))
-		set_bit(TC_U32_NODE(n->handle), bitmap);
-
-	i = find_next_zero_bit(bitmap, NR_U32_NODE, 0x800);
-	if (i >= NR_U32_NODE)
-		i = find_next_zero_bit(bitmap, NR_U32_NODE, 1);
+	unsigned long idr_index;
+	u32 start = htid | 0x800;
+	u32 max = htid | 0xFFF;
+	u32 min = htid;
+
+	if (idr_alloc_ext(&ht->handle_idr, NULL, &idr_index,
+			  start, max + 1, GFP_KERNEL)) {
+		if (idr_alloc_ext(&ht->handle_idr, NULL, &idr_index,
+				  min + 1, max + 1, GFP_KERNEL))
+			return max;
+	}
 
-	kfree(bitmap);
-	return handle | (i >= NR_U32_NODE ? 0xFFF : i);
+	return (u32)idr_index;
 }
 
 static const struct nla_policy u32_policy[TCA_U32_MAX + 1] = {
@@ -806,6 +808,7 @@ static void u32_replace_knode(struct tcf_proto *tp, struct tc_u_common *tp_c,
 		if (pins->handle == n->handle)
 			break;
 
+	idr_replace_ext(&ht->handle_idr, n, n->handle);
 	RCU_INIT_POINTER(n->next, pins->next);
 	rcu_assign_pointer(*ins, n);
 }
@@ -937,22 +940,33 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 			return -EINVAL;
 		if (TC_U32_KEY(handle))
 			return -EINVAL;
-		if (handle == 0) {
-			handle = gen_new_htid(tp->data);
-			if (handle == 0)
-				return -ENOMEM;
-		}
 		ht = kzalloc(sizeof(*ht) + divisor*sizeof(void *), GFP_KERNEL);
 		if (ht == NULL)
 			return -ENOBUFS;
+		if (handle == 0) {
+			handle = gen_new_htid(tp->data, ht);
+			if (handle == 0) {
+				kfree(ht);
+				return -ENOMEM;
+			}
+		} else {
+			err = idr_alloc_ext(&tp_c->handle_idr, ht, NULL,
+					    handle, handle + 1, GFP_KERNEL);
+			if (err) {
+				kfree(ht);
+				return err;
+			}
+		}
 		ht->tp_c = tp_c;
 		ht->refcnt = 1;
 		ht->divisor = divisor;
 		ht->handle = handle;
 		ht->prio = tp->prio;
+		idr_init(&ht->handle_idr);
 
 		err = u32_replace_hw_hnode(tp, ht, flags);
 		if (err) {
+			idr_remove_ext(&tp_c->handle_idr, handle);
 			kfree(ht);
 			return err;
 		}
@@ -986,24 +1000,33 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 		if (TC_U32_HTID(handle) && TC_U32_HTID(handle^htid))
 			return -EINVAL;
 		handle = htid | TC_U32_NODE(handle);
+		err = idr_alloc_ext(&ht->handle_idr, NULL, NULL,
+				    handle, handle + 1,
+				    GFP_KERNEL);
+		if (err)
+			return err;
 	} else
 		handle = gen_new_kid(ht, htid);
 
-	if (tb[TCA_U32_SEL] == NULL)
-		return -EINVAL;
+	if (tb[TCA_U32_SEL] == NULL) {
+		err = -EINVAL;
+		goto erridr;
+	}
 
 	s = nla_data(tb[TCA_U32_SEL]);
 
 	n = kzalloc(sizeof(*n) + s->nkeys*sizeof(struct tc_u32_key), GFP_KERNEL);
-	if (n == NULL)
-		return -ENOBUFS;
+	if (n == NULL) {
+		err = -ENOBUFS;
+		goto erridr;
+	}
 
 #ifdef CONFIG_CLS_U32_PERF
 	size = sizeof(struct tc_u32_pcnt) + s->nkeys * sizeof(u64);
 	n->pf = __alloc_percpu(size, __alignof__(struct tc_u32_pcnt));
 	if (!n->pf) {
-		kfree(n);
-		return -ENOBUFS;
+		err = -ENOBUFS;
+		goto errfree;
 	}
 #endif
 
@@ -1066,9 +1089,12 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 errout:
 	tcf_exts_destroy(&n->exts);
 #ifdef CONFIG_CLS_U32_PERF
+errfree:
 	free_percpu(n->pf);
 #endif
 	kfree(n);
+erridr:
+	idr_remove_ext(&ht->handle_idr, handle);
 	return err;
 }
 
-- 
2.13.0

^ permalink raw reply related

* [PATCH net] ipv6: remove incorrect WARN_ON() in fib6_del()
From: Wei Wang @ 2017-09-25 17:35 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Eric Dumazet, Martin KaFai Lau, Wei Wang

From: Wei Wang <weiwan@google.com>

fib6_del() generates WARN_ON() when rt->dst.obsolete > 0. This does not
make sense because it is possible that the route passed in is already
deleted by some other thread and rt->dst.obsolete is set to
DST_OBSOLETE_DEAD.
So this commit deletes this WARN_ON() and also remove the
"#ifdef RT6_DEBUG >= 2" condition so that if the route is already
obsolete, we return right at the beginning of fib6_del().

Syzkaller hit this WARN_ON() in the following call trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:52
 panic+0x1e4/0x417 kernel/panic.c:180
 __warn+0x1c4/0x1d9 kernel/panic.c:541
 report_bug+0x211/0x2d0 lib/bug.c:183
 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
 do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
 do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
 invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:846
RIP: 0010:fib6_del+0x947/0xca0 net/ipv6/ip6_fib.c:1477
RSP: 0018:ffff8801db2074d8 EFLAGS: 00010206
RAX: ffff8801d1500080 RBX: ffff8801d01638c0 RCX: 0000000000000000
RDX: 0000000000000100 RSI: ffff8801db207650 RDI: ffff8801d0163924
RBP: ffff8801db2075f0 R08: ffffffff86df5f98 R09: 0000000000000002
R10: ffff8801db2074b8 R11: 1ffff1003a2a026b R12: dffffc0000000000
R13: ffff8801db207650 R14: ffff8801a0748180 R15: 1ffff1003b640ea5
 __ip6_del_rt+0xc7/0x120 net/ipv6/route.c:2136
 ip6_del_rt+0x132/0x1a0 net/ipv6/route.c:2149
 ip6_link_failure+0x244/0x380 net/ipv6/route.c:1359
 dst_link_failure include/net/dst.h:454 [inline]
 ndisc_error_report+0xae/0x180 net/ipv6/ndisc.c:682
 neigh_invalidate+0x225/0x530 net/core/neighbour.c:883
 neigh_timer_handler+0x883/0xca0 net/core/neighbour.c:969
 call_timer_fn+0x233/0x830 kernel/time/timer.c:1268
 expire_timers kernel/time/timer.c:1307 [inline]
 __run_timers+0x7fd/0xb90 kernel/time/timer.c:1601
 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
 __do_softirq+0x2f5/0xba3 kernel/softirq.c:284
 invoke_softirq kernel/softirq.c:364 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:638 [inline]
 smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:1044
 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:702
RIP: 0010:arch_local_irq_enable arch/x86/include/asm/paravirt.h:824 [inline]
RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:168 [inline]
RIP: 0010:_raw_spin_unlock_irq+0x56/0x70 kernel/locking/spinlock.c:199
RSP: 0018:ffff8801d0407040 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff10
RAX: dffffc0000000000 RBX: ffff8801db225780 RCX: 0000000000000000
RDX: 1ffffffff0b59433 RSI: 0000000000000001 RDI: ffffffff85aca198
RBP: ffff8801d0407048 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801c6820400
R13: 1ffff1003a080e11 R14: ffff8801d1500080 R15: ffff8801d1500080
 </IRQ>
 finish_lock_switch kernel/sched/sched.h:1334 [inline]
 finish_task_switch+0x1d3/0x740 kernel/sched/core.c:2638
 context_switch kernel/sched/core.c:2774 [inline]
 __schedule+0x8f0/0x2070 kernel/sched/core.c:3332
 schedule+0x108/0x440 kernel/sched/core.c:3391
 schedule_hrtimeout_range_clock+0x23e/0x810 kernel/time/hrtimer.c:1708
 schedule_hrtimeout_range+0x2a/0x40 kernel/time/hrtimer.c:1753
 poll_schedule_timeout+0x10f/0x1f0 fs/select.c:242
 do_select+0x11ea/0x1710 fs/select.c:581
 core_sys_select+0x480/0x960 fs/select.c:655
 do_pselect fs/select.c:732 [inline]
 SYSC_pselect6 fs/select.c:773 [inline]
 SyS_pselect6+0x54a/0x650 fs/select.c:758
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x45f181
RSP: 002b:00007f91306e1db0 EFLAGS: 00000246 ORIG_RAX: 000000000000010e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000045f181
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000086 R08: 00007f91306e1db0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdd9621670
R13: 00007f91306e29c0 R14: 00007f9130eac040 R15: 0000000000000003

Note: there is no Fixes tag because this bug was introduced long ago.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv6/ip6_fib.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index e5308d7cbd75..693bcd7ef6d2 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1592,13 +1592,7 @@ int fib6_del(struct rt6_info *rt, struct nl_info *info)
 	struct net *net = info->nl_net;
 	struct rt6_info **rtp;
 
-#if RT6_DEBUG >= 2
-	if (rt->dst.obsolete > 0) {
-		WARN_ON(fn);
-		return -ENOENT;
-	}
-#endif
-	if (!fn || rt == net->ipv6.ip6_null_entry)
+	if (!fn || rt->dst.obsolete > 0 || rt == net->ipv6.ip6_null_entry)
 		return -ENOENT;
 
 	WARN_ON(!(fn->fn_flags & RTN_RTINFO));
-- 
2.14.1.821.g8fa685d3b7-goog

^ permalink raw reply related

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: Andy Gospodarek @ 2017-09-25 18:10 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: davem, alexei.starovoitov, john.fastabend, peter.waskiewicz.jr,
	jakub.kicinski, netdev, mchan
In-Reply-To: <458f9c13ab58abb1a15627906d03c33c42b02a7c.1506297988.git.daniel@iogearbox.net>

On Mon, Sep 25, 2017 at 02:25:51AM +0200, Daniel Borkmann wrote:
> This work enables generic transfer of metadata from XDP into skb. The
> basic idea is that we can make use of the fact that the resulting skb
> must be linear and already comes with a larger headroom for supporting
> bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
> on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
> for adjusting a new pointer called xdp->data_meta. Thus, the packet has
> a flexible and programmable room for meta data, followed by the actual
> packet data. struct xdp_buff is therefore laid out that we first point
> to data_hard_start, then data_meta directly prepended to data followed
> by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
> account whether we have meta data already prepended and if so, memmove()s
> this along with the given offset provided there's enough room.
> 
> xdp->data_meta is optional and programs are not required to use it. The
> rationale is that when we process the packet in XDP (e.g. as DoS filter),
> we can push further meta data along with it for the XDP_PASS case, and
> give the guarantee that a clsact ingress BPF program on the same device
> can pick this up for further post-processing. Since we work with skb
> there, we can also set skb->mark, skb->priority or other skb meta data
> out of BPF, thus having this scratch space generic and programmable
> allows for more flexibility than defining a direct 1:1 transfer of
> potentially new XDP members into skb (it's also more efficient as we
> don't need to initialize/handle each of such new members).  The facility
> also works together with GRO aggregation. The scratch space at the head
> of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
> yet supporting xdp->data_meta can simply be set up with xdp->data_meta
> as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
> such that the subsequent match against xdp->data for later access is
> guaranteed to fail.
> 
> The verifier treats xdp->data_meta/xdp->data the same way as we treat
> xdp->data/xdp->data_end pointer comparisons. The requirement for doing
> the compare against xdp->data is that it hasn't been modified from it's
> original address we got from ctx access. It may have a range marking
> already from prior successful xdp->data/xdp->data_end pointer comparisons
> though.

First, thanks for this detailed description.  It was helpful to read
along with the patches.

My only concern about this area being generic is that you are now in a
state where any bpf program must know about all the bpf programs in the
receive pipeline before it can properly parse what is stored in the
meta-data and add it to an skb (or perform any other action).
Especially if each program adds it's own meta-data along the way.

Maybe this isn't a big concern based on the number of users of this
today, but it just starts to seem like a concern as there are these
hints being passed between layers that are challenging to track due to a
lack of a standard format for passing data between.

The main reason I bring this up is that Michael and I had discussed and
designed a way for drivers to communicate between each other that rx
resources could be freed after a tx completion on an XDP_REDIRECT
action.  Much like this code, it involved adding an new element to
struct xdp_md that could point to the important information.  Now that
there is a generic way to handle this, it would seem nice to be able to
leverage it, but I'm not sure how reliable this meta-data area would be
without the ability to mark it in some manner.

For additional background, the minimum amount of data needed in the case
Michael and I were discussing was really 2 words.  One to serve as a
pointer to an rx_ring structure and one to have a counter to the rx
producer entry.  This data could be acessed by the driver processing the
tx completions and callback to the driver that received the frame off the wire
to perform any needed processing.  (For those curious this would also require a
new callback/netdev op to act on this data stored in the XDP buffer.)

IIUC, I could use this meta_data area to store this information, but would it
also be useful to create some type field/marker that could also be stored in
the meta_data to indicate what type of information is there?  I hate to propose
such a thing as it may add unneeded complexity, but I just wanted to make sure
to say something before it was too late as there may be more users of this
right away.  :-)

> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c      |   1 +
>  drivers/net/ethernet/cavium/thunder/nicvf_main.c   |   1 +
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c        |   1 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c      |   1 +
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c         |   1 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   1 +
>  .../net/ethernet/netronome/nfp/nfp_net_common.c    |   1 +
>  drivers/net/ethernet/qlogic/qede/qede_fp.c         |   1 +
>  drivers/net/tun.c                                  |   1 +
>  drivers/net/virtio_net.c                           |   2 +
>  include/linux/bpf.h                                |   1 +
>  include/linux/filter.h                             |  21 +++-
>  include/linux/skbuff.h                             |  68 +++++++++++-
>  include/uapi/linux/bpf.h                           |  13 ++-
>  kernel/bpf/verifier.c                              | 114 ++++++++++++++++-----
>  net/bpf/test_run.c                                 |   1 +
>  net/core/dev.c                                     |  31 +++++-
>  net/core/filter.c                                  |  77 +++++++++++++-
>  net/core/skbuff.c                                  |   2 +
>  19 files changed, 297 insertions(+), 42 deletions(-)
[...]

^ permalink raw reply

* [PATCH iproute2] tc: fix ipv6 filter selector attribute for some prefix lengths
From: Yulia Kartseva @ 2017-09-25 18:12 UTC (permalink / raw)
  To: netdev; +Cc: shemminger

Wrong TCA_U32_SEL attribute packing if prefixLen AND 0x1f equals 0x1f.
These are  /31, /63, /95 and /127 prefix lengths.

Example:
# tc filter add dev eth0 protocol ipv6 parent b: prio 2307 u32 match
ip6 dst face:b00f::/31
# tc filter show dev eth0
filter parent b: protocol ipv6 pref 2307 u32
filter parent b: protocol ipv6 pref 2307 u32 fh 800: ht divisor 1
filter parent b: protocol ipv6 pref 2307 u32 fh 800::800 order 2048
key ht 800 bkt 0
  match faceb00f/ffffffff at 24


The correct match would be "faceb00e/fffffffe": don't count the last
bit of the 4th byte as the network prefix. With fix:

# tc filter show dev eth0
filter parent b: protocol ipv6 pref 2307 u32
filter parent b: protocol ipv6 pref 2307 u32 fh 800: ht divisor 1
filter parent b: protocol ipv6 pref 2307 u32 fh 800::800 order 2048
key ht 800 bkt 0
  match faceb00e/fffffffe at 24

 tc/f_u32.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tc/f_u32.c b/tc/f_u32.c
index 5815be9..14b9588 100644
--- a/tc/f_u32.c
+++ b/tc/f_u32.c
@@ -385,8 +385,7 @@ static int parse_ip6_addr(int *argc_p, char ***argv_p,

  plen = addr.bitlen;
  for (i = 0; i < plen; i += 32) {
- /* if (((i + 31) & ~0x1F) <= plen) { */
- if (i + 31 <= plen) {
+ if (i + 31 < plen) {
  res = pack_key(sel, addr.data[i / 32],
        0xFFFFFFFF, off + 4 * (i / 32), offmask);
  if (res < 0)

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox