Netdev List


* Re: linux-next: build failure after merge of the net-next tree
From: Florian Fainelli @ 2017-09-29  2:07 UTC (permalink / raw)
  To: Stephen Rothwell, David Miller, Networking
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List,
	Vivien Didelot
In-Reply-To: <20170929113635.3337c026@canb.auug.org.au>

On 09/28/17 at 18:36, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the net-next tree, today's linux-next build (arm
> multi_v7_defconfig) failed like this:
> 
> net/dsa/slave.c: In function 'dsa_slave_create':
> net/dsa/slave.c:1191:18: error: 'struct dsa_slave_priv' has no member named 'phy'
>   phy_disconnect(p->phy);
>                   ^
> 
> Caused by commit
> 
>   0115dcd1787d ("net: dsa: use slave device phydev")
> 
> Interacting with commit
> 
>   e804441cfe0b ("net: dsa: Fix network device registration order")
> 
> from the net tree.
> 
> I applied the following merge fix patch (which I am not sure about):

Your resolution looks fine to me, thanks Stephen!

> 
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Fri, 29 Sep 2017 11:28:45 +1000
> Subject: [PATCH] net: dsa: merge fix patch for removal of phy
> 
> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
> ---
>  net/dsa/slave.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index 8869954485db..9191c929c6c8 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -1188,7 +1188,7 @@ int dsa_slave_create(struct dsa_port *port, const char *name)
>  	return 0;
>  
>  out_phy:
> -	phy_disconnect(p->phy);
> +	phy_disconnect(slave_dev->phydev);
>  	if (of_phy_is_fixed_link(p->dp->dn))
>  		of_phy_deregister_fixed_link(p->dp->dn);
>  out_free:
> 


-- 
Florian


* Re: [lkp-robot] [mac80211] 31e9170bde: hwsim.sta_dynamic_down_up.fail
From: Xiang Gao @ 2017-09-29  2:21 UTC (permalink / raw)
  To: kernel test robot
  Cc: Herbert Xu, David S. Miller, Johannes Berg, linux-crypto,
	linux-kernel, linux-wireless, netdev, lkp
In-Reply-To: <20170928080614.GZ17200@yexl-desktop>

Thanks, I will look into it.
Xiang Gao


2017-09-28 4:06 GMT-04:00 kernel test robot <xiaolong.ye@intel.com>:
>
> FYI, we noticed the following commit:
>
> commit: 31e9170bdeb6ebe66426337b4e2b9924683a412b ("mac80211: aead api to reduce redundancy")
> url: https://github.com/0day-ci/linux/commits/Xiang-Gao/mac80211-aead-api-to-reduce-redundancy/20170926-053110
> base: https://git.kernel.org/cgit/linux/kernel/git/jberg/mac80211-next.git master
>
> in testcase: hwsim
> with following parameters:
>
>         group: hwsim-10
>
>
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 2G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
> 2017-09-27 16:04:27     ./run-tests.py sta_dynamic_down_up
> DEV: wlan0: 02:00:00:00:00:00
> DEV: wlan1: 02:00:00:00:01:00
> DEV: wlan2: 02:00:00:00:02:00
> APDEV: wlan3
> APDEV: wlan4
> START sta_dynamic_down_up 1/1
> Test: Dynamically added wpa_supplicant interface down/up
> Starting AP wlan3
> Create a dynamic wpa_supplicant interface and connect
> Connect STA wlan5 to AP
> dev1->dev2 unicast data delivery failed
> Traceback (most recent call last):
>   File "./run-tests.py", line 453, in main
>     t(dev, apdev)
>   File "/lkp/benchmarks/hwsim/tests/hwsim/test_sta_dynamic.py", line 122, in test_sta_dynamic_down_up
>     hwsim_utils.test_connectivity(wpas, hapd)
>   File "/lkp/benchmarks/hwsim/tests/hwsim/hwsim_utils.py", line 165, in test_connectivity
>     raise Exception(last_err)
> Exception: dev1->dev2 unicast data delivery failed
> FAIL sta_dynamic_down_up 5.397413 2017-09-27 16:04:32.540689
> passed 0 test case(s)
> skipped 0 test case(s)
> failed tests: sta_dynamic_down_up
>
>
>
> To reproduce:
>
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp qemu -k <bzImage> job-script  # job-script is attached in this email
>
>
>
> Thanks,
> Xiaolong



* Re: [net-next PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Alexei Starovoitov @ 2017-09-29  3:21 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, Jason Wang, mchan,
	John Fastabend, peter.waskiewicz.jr, Daniel Borkmann,
	Andy Gospodarek, hannes
In-Reply-To: <150660342793.2808.10838498581615265043.stgit@firesoul>

On Thu, Sep 28, 2017 at 02:57:08PM +0200, Jesper Dangaard Brouer wrote:
> The 'cpumap' is primary used as a backend map for XDP BPF helper
> call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
> 
> This patch implement the main part of the map.  It is not connected to
> the XDP redirect system yet, and no SKB allocation are done yet.
> 
> The main concern in this patch is to ensure the datapath can run
> without any locking.  This adds complexity to the setup and tear-down
> procedure, which assumptions are extra carefully documented in the
> code comments.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/bpf_types.h      |    1 
>  include/uapi/linux/bpf.h       |    1 
>  kernel/bpf/Makefile            |    1 
>  kernel/bpf/cpumap.c            |  547 ++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c           |    8 +
>  tools/include/uapi/linux/bpf.h |    1 
>  6 files changed, 558 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/cpumap.c
> 
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 6f1a567667b8..814c1081a4a9 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -41,4 +41,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
>  #ifdef CONFIG_STREAM_PARSER
>  BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
>  #endif
> +BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
>  #endif
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e43491ac4823..f14e15702533 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -111,6 +111,7 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_HASH_OF_MAPS,
>  	BPF_MAP_TYPE_DEVMAP,
>  	BPF_MAP_TYPE_SOCKMAP,
> +	BPF_MAP_TYPE_CPUMAP,
>  };
>  
>  enum bpf_prog_type {
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 897daa005b23..dba0bd33a43c 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -4,6 +4,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
>  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
>  ifeq ($(CONFIG_NET),y)
>  obj-$(CONFIG_BPF_SYSCALL) += devmap.o
> +obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
>  ifeq ($(CONFIG_STREAM_PARSER),y)
>  obj-$(CONFIG_BPF_SYSCALL) += sockmap.o
>  endif
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> new file mode 100644
> index 000000000000..f0948af82e65
> --- /dev/null
> +++ b/kernel/bpf/cpumap.c
> @@ -0,0 +1,547 @@
> +/* bpf/cpumap.c
> + *
> + * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
> + * Released under terms in GPL version 2.  See COPYING.
> + */
> +
> +/* The 'cpumap' is primary used as a backend map for XDP BPF helper
> + * call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
> + *
> + * Unlike devmap which redirect XDP frames out another NIC device,
> + * this map type redirect raw XDP frames to another CPU.  The remote
> + * CPU will do SKB-allocation and call the normal network stack.
> + *
> + * This is a scalability and isolation mechanism, that allow
> + * separating the early driver network XDP layer, from the rest of the
> + * netstack, and assigning dedicated CPUs for this stage.  This
> + * basically allows for 10G wirespeed pre-filtering via bpf.
> + */
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include <linux/ptr_ring.h>
> +
> +#include <linux/sched.h>
> +#include <linux/workqueue.h>
> +#include <linux/kthread.h>
> +
> +/*
> + * General idea: XDP packets getting XDP redirected to another CPU,
> + * will maximum be stored/queued for one driver ->poll() call.  It is
> + * guaranteed that setting flush bit and flush operation happen on
> + * same CPU.  Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
> + * which queue in bpf_cpu_map_entry contains packets.
> + */
> +
> +#define CPU_MAP_BULK_SIZE 8  /* 8 == one cacheline on 64-bit archs */
> +struct xdp_bulk_queue {
> +	void *q[CPU_MAP_BULK_SIZE];
> +	unsigned int count;
> +};
> +
> +/* Struct for every remote "destination" CPU in map */
> +struct bpf_cpu_map_entry {
> +	u32 cpu;    /* kthread CPU and map index */
> +	int map_id; /* Back reference to map */
> +	u32 qsize;  /* Redundant queue size for map lookup */
> +
> +	/* XDP can run multiple RX-ring queues, need __percpu enqueue store */
> +	struct xdp_bulk_queue __percpu *bulkq;
> +
> +	/* Queue with potential multi-producers, and single-consumer kthread */
> +	struct ptr_ring *queue;
> +	struct task_struct *kthread;
> +	struct work_struct kthread_stop_wq;
> +
> +	atomic_t refcnt; /* Control when this struct can be free'ed */
> +	struct rcu_head rcu;
> +};
> +
> +struct bpf_cpu_map {
> +	struct bpf_map map;
> +	/* Below members specific for map type */
> +	struct bpf_cpu_map_entry **cpu_map;
> +	unsigned long __percpu *flush_needed;
> +};
> +
> +static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
> +			     struct xdp_bulk_queue *bq);
> +
> +static u64 cpu_map_bitmap_size(const union bpf_attr *attr)
> +{
> +	return BITS_TO_LONGS(attr->max_entries) * sizeof(unsigned long);
> +}
> +
> +static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
> +{
> +	struct bpf_cpu_map *cmap;
> +	u64 cost;
> +	int err;
> +
> +	/* check sanity of attributes */
> +	if (attr->max_entries == 0 || attr->key_size != 4 ||
> +	    attr->value_size != 4 || attr->map_flags & ~BPF_F_NUMA_NODE)
> +		return ERR_PTR(-EINVAL);
> +
> +	cmap = kzalloc(sizeof(*cmap), GFP_USER);
> +	if (!cmap)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/* mandatory map attributes */
> +	cmap->map.map_type = attr->map_type;
> +	cmap->map.key_size = attr->key_size;
> +	cmap->map.value_size = attr->value_size;
> +	cmap->map.max_entries = attr->max_entries;
> +	cmap->map.map_flags = attr->map_flags;
> +	cmap->map.numa_node = bpf_map_attr_numa_node(attr);
> +
> +	/* make sure page count doesn't overflow */
> +	cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *);
> +	cost += cpu_map_bitmap_size(attr) * num_possible_cpus();
> +	if (cost >= U32_MAX - PAGE_SIZE)
> +		goto free_cmap;
> +	cmap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	/* if map size is larger than memlock limit, reject it early */
> +	err = bpf_map_precharge_memlock(cmap->map.pages);
> +	if (err)
> +		goto free_cmap;
> +
> +	/* A per cpu bitfield with a bit per possible CPU in map  */
> +	cmap->flush_needed = __alloc_percpu(cpu_map_bitmap_size(attr),
> +					    __alignof__(unsigned long));
> +	if (!cmap->flush_needed)
> +		goto free_cmap;
> +
> +	/* Alloc array for possible remote "destination" CPUs */
> +	cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
> +					   sizeof(struct bpf_cpu_map_entry *),
> +					   cmap->map.numa_node);
> +	if (!cmap->cpu_map)
> +		goto free_cmap;
> +
> +	return &cmap->map;
> +free_cmap:
> +	free_percpu(cmap->flush_needed);
> +	kfree(cmap);
> +	return ERR_PTR(-ENOMEM);
> +}
> +
> +void __cpu_map_queue_destructor(void *ptr)
> +{
> +	/* For now, just catch this as an error */
> +	if (!ptr)
> +		return;
> +	pr_err("ERROR: %s() cpu_map queue was not empty\n", __func__);
> +	page_frag_free(ptr);
> +}
> +
> +static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> +{
> +	if (atomic_dec_and_test(&rcpu->refcnt)) {
> +		/* The queue should be empty at this point */
> +		ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor);
> +		kfree(rcpu->queue);
> +		kfree(rcpu);
> +	}
> +}
> +
> +static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> +{
> +	atomic_inc(&rcpu->refcnt);
> +}
> +
> +/* called from workqueue, to workaround syscall using preempt_disable */
> +static void cpu_map_kthread_stop(struct work_struct *work)
> +{
> +	struct bpf_cpu_map_entry *rcpu;
> +
> +	rcpu = container_of(work, struct bpf_cpu_map_entry, kthread_stop_wq);
> +	synchronize_rcu(); /* wait for flush in __cpu_map_entry_free() */
> +	kthread_stop(rcpu->kthread); /* calls put_cpu_map_entry */
> +}
> +
> +static int cpu_map_kthread_run(void *data)
> +{
> +	struct bpf_cpu_map_entry *rcpu = data;
> +
> +	set_current_state(TASK_INTERRUPTIBLE);
> +	while (!kthread_should_stop()) {
> +		struct xdp_pkt *xdp_pkt;
> +
> +		schedule();
> +		/* Do work */
> +		while ((xdp_pkt = ptr_ring_consume(rcpu->queue))) {
> +			/* For now just "refcnt-free" */
> +			page_frag_free(xdp_pkt);
> +		}
> +		__set_current_state(TASK_INTERRUPTIBLE);
> +	}
> +	put_cpu_map_entry(rcpu);
> +
> +	__set_current_state(TASK_RUNNING);
> +	return 0;
> +}
> +
> +struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu, int map_id)
> +{
> +	gfp_t gfp = GFP_ATOMIC|__GFP_NOWARN;
> +	struct bpf_cpu_map_entry *rcpu;
> +	int numa, err;
> +
> +	/* Have map->numa_node, but choose node of redirect target CPU */
> +	numa = cpu_to_node(cpu);
> +
> +	rcpu = kzalloc_node(sizeof(*rcpu), gfp, numa);
> +	if (!rcpu)
> +		return NULL;
> +
> +	/* Alloc percpu bulkq */
> +	rcpu->bulkq = __alloc_percpu_gfp(sizeof(*rcpu->bulkq),
> +					 sizeof(void *), gfp);
> +	if (!rcpu->bulkq)
> +		goto fail;
> +
> +	/* Alloc queue */
> +	rcpu->queue = kzalloc_node(sizeof(*rcpu->queue), gfp, numa);
> +	if (!rcpu->queue)
> +		goto fail;
> +
> +	err = ptr_ring_init(rcpu->queue, qsize, gfp);
> +	if (err)
> +		goto fail;
> +	rcpu->qsize = qsize;
> +
> +	/* Setup kthread */
> +	rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, rcpu, numa,
> +					       "cpumap/%d/map:%d", cpu, map_id);
> +	if (IS_ERR(rcpu->kthread))
> +		goto fail;
> +
> +	/* Make sure kthread runs on a single CPU */
> +	kthread_bind(rcpu->kthread, cpu);

Is there a check that max_entries <= num_possible_cpus()? I couldn't find it.
Otherwise, won't it end up binding to an impossible CPU?
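
For illustration only, the kind of guard being asked about could look like
the userspace sketch below. The helper name cpu_map_check_max_entries() and
its nr_possible_cpus parameter are hypothetical and do not appear in the
posted patch; in the kernel the bound would presumably come from
num_possible_cpus().

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper, not part of the posted patch: reject a map whose
 * index space is empty or extends past the number of CPUs that can ever
 * exist, so a later kthread_bind() is never handed an impossible CPU.
 */
static int cpu_map_check_max_entries(uint32_t max_entries,
				     uint32_t nr_possible_cpus)
{
	if (max_entries == 0)
		return -EINVAL;	/* an empty map is meaningless */
	if (max_entries > nr_possible_cpus)
		return -E2BIG;	/* index could name a CPU that cannot exist */
	return 0;
}
```

Since possible CPU ids need not be contiguous, a per-key test at update
time (for example with cpu_possible()) may be the more precise variant;
which one fits this patch is an open question here.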

> +	wake_up_process(rcpu->kthread);

In general the whole thing looks like the 'threaded NAPI' that Hannes was
proposing some time back. I liked it back then and I like it now.
I don't remember what the objections were back then.
Something scheduler related?
Adding Hannes.

Still curious about the questions I asked in the other thread
on what's causing it to be so much better than RPS.


* Re: [PATCH v4 2/2] ip_tunnel: add mpls over gre encapsulation
From: Tom Herbert @ 2017-09-29  4:11 UTC (permalink / raw)
  To: Amine Kherbouche; +Cc: Linux Kernel Network Developers, xeb, roopa, equinox
In-Reply-To: <2e611d0f6e0c39ff54bfe464cdf9cf6eeb7843e1.1506590878.git.amine.kherbouche@6wind.com>

On Thu, Sep 28, 2017 at 2:34 AM, Amine Kherbouche
<amine.kherbouche@6wind.com> wrote:
> This commit introduces the MPLSoGRE support (RFC 4023), using ip tunnel
> API.
>
> Encap:
>   - Add a new iptunnel type mpls.
>   - Share tx path: gre type mpls loaded from skb->protocol.
>
> Decap:
>   - pull gre hdr and call mpls_forward().
>
> Signed-off-by: Amine Kherbouche <amine.kherbouche@6wind.com>
> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> ---
>  include/net/gre.h              |  1 +
>  include/uapi/linux/if_tunnel.h |  1 +
>  net/ipv4/gre_demux.c           | 27 +++++++++++++++++++++++++++
>  net/ipv4/ip_gre.c              |  3 +++
>  net/ipv6/ip6_gre.c             |  3 +++
>  net/mpls/af_mpls.c             | 36 ++++++++++++++++++++++++++++++++++++
>  6 files changed, 71 insertions(+)
>
> diff --git a/include/net/gre.h b/include/net/gre.h
> index d25d836..aa3c4d3 100644
> --- a/include/net/gre.h
> +++ b/include/net/gre.h
> @@ -35,6 +35,7 @@ struct net_device *gretap_fb_dev_create(struct net *net, const char *name,
>                                        u8 name_assign_type);
>  int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
>                      bool *csum_err, __be16 proto, int nhs);
> +int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len);
>
>  static inline int gre_calc_hlen(__be16 o_flags)
>  {
> diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
> index 2e52088..a2f48c0 100644
> --- a/include/uapi/linux/if_tunnel.h
> +++ b/include/uapi/linux/if_tunnel.h
> @@ -84,6 +84,7 @@ enum tunnel_encap_types {
>         TUNNEL_ENCAP_NONE,
>         TUNNEL_ENCAP_FOU,
>         TUNNEL_ENCAP_GUE,
> +       TUNNEL_ENCAP_MPLS,
>  };
>
>  #define TUNNEL_ENCAP_FLAG_CSUM         (1<<0)
> diff --git a/net/ipv4/gre_demux.c b/net/ipv4/gre_demux.c
> index b798862..40484a3 100644
> --- a/net/ipv4/gre_demux.c
> +++ b/net/ipv4/gre_demux.c
> @@ -23,6 +23,9 @@
>  #include <linux/netdevice.h>
>  #include <linux/if_tunnel.h>
>  #include <linux/spinlock.h>
> +#if IS_ENABLED(CONFIG_MPLS)
> +#include <linux/mpls.h>
> +#endif
>  #include <net/protocol.h>
>  #include <net/gre.h>
>
> @@ -122,6 +125,30 @@ int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
>  }
>  EXPORT_SYMBOL(gre_parse_header);
>
> +#if IS_ENABLED(CONFIG_MPLS)
> +int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len)
> +{
> +       if (unlikely(!pskb_may_pull(skb, gre_hdr_len)))
> +               goto drop;
> +
> +       /* Pop GRE hdr and reset the skb */
> +       skb_pull(skb, gre_hdr_len);
> +       skb_reset_network_header(skb);
> +

I don't see why MPLS/GRE needs to be a special case in gre_rcv. Can't
we just follow the normal processing path, which calls the proto ops
handler for the protocol in the GRE header? Also, if protocol-specific
code is added to the rcv function, that most likely means we need to
update the related offloads as well (granted, MPLS doesn't support
GRO, but it looks like it supports GSO). Additionally, we'd need to
consider whether the flow dissector needs a similar special case (I will
point out that my recently posted patches there eliminated TEB as the one
special case in GRE dissection).
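
To make the contrast concrete, here is a minimal userspace sketch of
table-driven dispatch on the inner protocol, as opposed to an if-branch per
protocol in the receive function. All names in it (inner_proto_handler,
gre_inner_dispatch, the demo_* handlers, RX_HANDLED/RX_DROP) are
hypothetical and are not the kernel's GRE API; only the two EtherType
values are real.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP	0x0800	/* IPv4 EtherType */
#define ETH_P_MPLS_UC	0x8847	/* MPLS unicast EtherType */

#define RX_HANDLED	0
#define RX_DROP		1

typedef int (*inner_rcv_t)(const uint8_t *payload, size_t len);

/* Hypothetical registration record: one handler per inner protocol */
struct inner_proto_handler {
	uint16_t proto;		/* host byte order in this sketch */
	inner_rcv_t rcv;
};

/* Demo handlers: only check that a minimal header would fit */
static int demo_ip_rcv(const uint8_t *payload, size_t len)
{
	(void)payload;
	return len >= 20 ? RX_HANDLED : RX_DROP;
}

static int demo_mpls_rcv(const uint8_t *payload, size_t len)
{
	(void)payload;
	return len >= 4 ? RX_HANDLED : RX_DROP;
}

static const struct inner_proto_handler handlers[] = {
	{ ETH_P_IP,      demo_ip_rcv },
	{ ETH_P_MPLS_UC, demo_mpls_rcv },
};

/* The receive path stays protocol-agnostic: look up and call */
static int gre_inner_dispatch(uint16_t proto, const uint8_t *payload,
			      size_t len)
{
	size_t i;

	for (i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++)
		if (handlers[i].proto == proto)
			return handlers[i].rcv(payload, len);
	return RX_DROP;		/* no handler registered */
}
```

With this shape, adding MPLS means adding a table entry rather than
touching the receive function itself.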

Thanks,
Tom

> +       return mpls_forward(skb, skb->dev, NULL, NULL);
> +drop:
> +       kfree_skb(skb);
> +       return NET_RX_DROP;
> +}
> +#else
> +int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len)
> +{
> +       kfree_skb(skb);
> +       return NET_RX_DROP;
> +}
> +#endif
> +EXPORT_SYMBOL(mpls_gre_rcv);
> +
>  static int gre_rcv(struct sk_buff *skb)
>  {
>         const struct gre_protocol *proto;
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index 9cee986..7a50e4f 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -412,6 +412,9 @@ static int gre_rcv(struct sk_buff *skb)
>                         return 0;
>         }
>
> +       if (unlikely(tpi.proto == htons(ETH_P_MPLS_UC)))
> +               return mpls_gre_rcv(skb, hdr_len);
> +
>         if (ipgre_rcv(skb, &tpi, hdr_len) == PACKET_RCVD)
>                 return 0;
>
> diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> index c82d41e..440efb1 100644
> --- a/net/ipv6/ip6_gre.c
> +++ b/net/ipv6/ip6_gre.c
> @@ -476,6 +476,9 @@ static int gre_rcv(struct sk_buff *skb)
>         if (hdr_len < 0)
>                 goto drop;
>
> +       if (unlikely(tpi.proto == htons(ETH_P_MPLS_UC)))
> +               return mpls_gre_rcv(skb, hdr_len);
> +
>         if (iptunnel_pull_header(skb, hdr_len, tpi.proto, false))
>                 goto drop;
>
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 36ea2ad..4274243 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
> @@ -16,6 +16,7 @@
>  #include <net/arp.h>
>  #include <net/ip_fib.h>
>  #include <net/netevent.h>
> +#include <net/ip_tunnels.h>
>  #include <net/netns/generic.h>
>  #if IS_ENABLED(CONFIG_IPV6)
>  #include <net/ipv6.h>
> @@ -39,6 +40,36 @@ static int one = 1;
>  static int label_limit = (1 << 20) - 1;
>  static int ttl_max = 255;
>
> +#if IS_ENABLED(CONFIG_NET_IP_TUNNEL)
> +size_t ipgre_mpls_encap_hlen(struct ip_tunnel_encap *e)
> +{
> +       return sizeof(struct mpls_shim_hdr);
> +}
> +
> +static const struct ip_tunnel_encap_ops mpls_iptun_ops = {
> +       .encap_hlen     = ipgre_mpls_encap_hlen,
> +};
> +
> +static int ipgre_tunnel_encap_add_mpls_ops(void)
> +{
> +       return ip_tunnel_encap_add_ops(&mpls_iptun_ops, TUNNEL_ENCAP_MPLS);
> +}
> +
> +static void ipgre_tunnel_encap_del_mpls_ops(void)
> +{
> +       ip_tunnel_encap_del_ops(&mpls_iptun_ops, TUNNEL_ENCAP_MPLS);
> +}
> +#else
> +static int ipgre_tunnel_encap_add_mpls_ops(void)
> +{
> +       return 0;
> +}
> +
> +static void ipgre_tunnel_encap_del_mpls_ops(void)
> +{
> +}
> +#endif
> +
>  static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
>                        struct nlmsghdr *nlh, struct net *net, u32 portid,
>                        unsigned int nlm_flags);
> @@ -2486,6 +2517,10 @@ static int __init mpls_init(void)
>                       0);
>         rtnl_register(PF_MPLS, RTM_GETNETCONF, mpls_netconf_get_devconf,
>                       mpls_netconf_dump_devconf, 0);
> +       err = ipgre_tunnel_encap_add_mpls_ops();
> +       if (err)
> +               pr_err("Can't add mpls over gre tunnel ops\n");
> +
>         err = 0;
>  out:
>         return err;
> @@ -2503,6 +2538,7 @@ static void __exit mpls_exit(void)
>         dev_remove_pack(&mpls_packet_type);
>         unregister_netdevice_notifier(&mpls_dev_notifier);
>         unregister_pernet_subsys(&mpls_net_ops);
> +       ipgre_tunnel_encap_del_mpls_ops();
>  }
>  module_exit(mpls_exit);
>
> --
> 2.1.4
>


* Re: [PATCH net-next] tcp: fix under-evaluated ssthresh in TCP Vegas
From: David Miller @ 2017-09-29  5:07 UTC (permalink / raw)
  To: tranviethoang.vn; +Cc: netdev, hoang.tran, kuznet, yoshfuji, linux-kernel
In-Reply-To: <1506529940-2143-1-git-send-email-hoang.tran@uclouvain.be>

From: Hoang Tran <tranviethoang.vn@gmail.com>
Date: Wed, 27 Sep 2017 18:30:58 +0200

> With the commit 76174004a0f19785 (tcp: do not slow start when cwnd equals
> ssthresh), the comparison to the reduced cwnd in tcp_vegas_ssthresh() would
> under-evaluate the ssthresh.
> 
> Signed-off-by: Hoang Tran <hoang.tran@uclouvain.be>

Applied, thank you.


* Re: [patch net-next 1/7] skbuff: Add the offload_mr_fwd_mark field
From: Jiri Pirko @ 2017-09-29  6:05 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, davem, yotamg, idosch, mlxsw, nikolay, dsa, edumazet,
	willemb, johannes.berg, dcaratti, pabeni, daniel, f.fainelli, fw,
	gfree.wind
In-Reply-To: <20170928174903.GE14940@lunn.ch>

Thu, Sep 28, 2017 at 07:49:03PM CEST, andrew@lunn.ch wrote:
>On Thu, Sep 28, 2017 at 07:34:09PM +0200, Jiri Pirko wrote:
>> From: Yotam Gigi <yotamg@mellanox.com>
>> 
>> Similarly to the offload_fwd_mark field, the offload_mr_fwd_mark field is
>> used to allow partial offloading of MFC multicast routes.
>
>> The reason why the already existing "offload_fwd_mark" bit cannot be used
>> is that a switchdev driver would want to make the distinction between a
>> packet that has already gone through L2 forwarding but did not go through
>> multicast forwarding, and a packet that has already gone through both L2
>> and multicast forwarding.
>
>Hi Jiri
>
>So we are talking about l2 vs l3. So why not call this
>offload_l3_fwd_mark?
>
>Is there anything really specific to multicast here?

Currently it is; I'm not sure whether it is going to be used for anything
else later on. If it is, it could be renamed very easily.
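
As a simplified model of the distinction described in the commit message,
the two marks can be thought of as independent bits. This is a userspace
sketch; struct pkt_marks and the helpers are hypothetical and much simpler
than the real skb handling.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two per-packet marks */
struct pkt_marks {
	unsigned int offload_fwd_mark:1;	/* L2 forwarding done in hardware */
	unsigned int offload_mr_fwd_mark:1;	/* multicast routing done in hardware */
};

static struct pkt_marks make_marks(bool l2_done, bool mr_done)
{
	struct pkt_marks m = {
		.offload_fwd_mark = l2_done,
		.offload_mr_fwd_mark = mr_done,
	};
	return m;
}

/* Software still has to bridge the packet if hardware did not */
static bool sw_needs_l2_forward(struct pkt_marks m)
{
	return !m.offload_fwd_mark;
}

/* Software still has to multicast-route the packet if hardware did not */
static bool sw_needs_mr_forward(struct pkt_marks m)
{
	return !m.offload_mr_fwd_mark;
}
```

A packet with only offload_fwd_mark set is the partial-offload case: L2
forwarding is already done, but multicast routing is still left to software.
A single bit could not express that state.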


>
>   Thanks
>      Andrew


* Re: [RFC PATCH v3 7/7] i40e: Enable cloud filters via tc-flower
From: Jiri Pirko @ 2017-09-29  6:20 UTC (permalink / raw)
  To: Nambiar, Amritha
  Cc: intel-wired-lan, jeffrey.t.kirsher, alexander.h.duyck, netdev,
	mlxsw, alexander.duyck@gmail.com, Jamal Hadi Salim, Cong Wang
In-Reply-To: <dd18a4bd-f2fc-002b-2ef9-01de9a5a4162@intel.com>

Thu, Sep 28, 2017 at 09:22:15PM CEST, amritha.nambiar@intel.com wrote:
>On 9/14/2017 1:00 AM, Nambiar, Amritha wrote:
>> On 9/13/2017 6:26 AM, Jiri Pirko wrote:
>>> Wed, Sep 13, 2017 at 11:59:50AM CEST, amritha.nambiar@intel.com wrote:
>>>> This patch enables tc-flower based hardware offloads. tc flower
>>>> filter provided by the kernel is configured as driver specific
>>>> cloud filter. The patch implements functions and admin queue
>>>> commands needed to support cloud filters in the driver and
>>>> adds cloud filters to configure these tc-flower filters.
>>>>
>>>> The only action supported is to redirect packets to a traffic class
>>>> on the same device.
>>>
>>> So basically you are not doing redirect, you are just setting tclass for
>>> matched packets, right? Why you use mirred for this? I think that
>>> you might consider extending g_act for that:
>>>
>>> # tc filter add dev eth0 protocol ip ingress \
>>>   prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw \
>>>   action tclass 0
>>>
>> Yes, this doesn't work like a typical egress redirect, but is aimed at
>> forwarding the matched packets to a different queue-group/traffic class
>> on the same device, so some sort-of ingress redirect in the hardware. I
>> possibly may not need the mirred-redirect as you say, I'll look into the
>> g_act way of doing this with a new gact tc action.
>> 
>
>I was looking at introducing a new gact tclass action to TC. In the HW
>offload path, this sets a traffic class value for certain matched
>packets so they will be processed in a queue belonging to the traffic class.
>
># tc filter add dev eth0 protocol ip parent ffff:\
>  prio 2 flower dst_ip 192.168.3.5/32\
>  ip_proto udp dst_port 25 skip_sw\
>  action tclass 2
>
>But, I'm having trouble defining what this action means in the kernel
>datapath. For ingress, this action could just take the default path and
>do nothing and only have meaning in the HW offloaded path. For egress,

Sounds ok.


>certain qdiscs like 'multiq' and 'prio' could use this 'tclass' value
>for band selection, while the 'mqprio' qdisc selects the traffic class
>based on the skb priority in netdev_pick_tx(), so what would this action
>mean for the 'mqprio' qdisc?

I don't see why this action would have any special meaning for specific
qdiscs. The qdiscs already have mechanisms for band mapping. I don't see
why we should mix that up with the tclass action.

Also, you can use the tclass action on qdisc clsact egress to do band
mapping. That would be symmetrical with ingress.


>
>It looks like the 'prio' qdisc uses band selection based on the
>'classid', so I was thinking of using the 'classid' through the cls
>flower filter and offload it to HW for the traffic class index, this way
>we would have the same behavior in HW offload and SW fallback and there
>would be no need for a separate tc action.
>
>In HW:
># tc filter add dev eth0 protocol ip parent ffff:\
>  prio 2 flower dst_ip 192.168.3.5/32\
>  ip_proto udp dst_port 25 skip_sw classid 1:2\
>
>filter pref 2 flower chain 0
>filter pref 2 flower chain 0 handle 0x1 classid 1:2
>  eth_type ipv4
>  ip_proto udp
>  dst_ip 192.168.3.5
>  dst_port 25
>  skip_sw
>  in_hw
>
>This will be used to route packets to traffic class 2.
>
>In SW:
># tc filter add dev eth0 protocol ip parent ffff:\
>  prio 2 flower dst_ip 192.168.3.5/32\
>  ip_proto udp dst_port 25 skip_hw classid 1:2
>
>filter pref 2 flower chain 0
>filter pref 2 flower chain 0 handle 0x1 classid 1:2
>  eth_type ipv4
>  ip_proto udp
>  dst_ip 192.168.3.5
>  dst_port 25
>  skip_hw
>  not_in_hw
>
>>>
>>>>
>>>> # tc qdisc add dev eth0 ingress
>>>> # ethtool -K eth0 hw-tc-offload on
>>>>
>>>> # tc filter add dev eth0 protocol ip parent ffff:\
>>>>  prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw\
>>>>  action mirred ingress redirect dev eth0 tclass 0
>>>>
>>>> # tc filter add dev eth0 protocol ip parent ffff:\
>>>>  prio 2 flower dst_ip 192.168.3.5/32\
>>>>  ip_proto udp dst_port 25 skip_sw\
>>>>  action mirred ingress redirect dev eth0 tclass 1
>>>>
>>>> # tc filter add dev eth0 protocol ipv6 parent ffff:\
>>>>  prio 3 flower dst_ip fe8::200:1\
>>>>  ip_proto udp dst_port 66 skip_sw\
>>>>  action mirred ingress redirect dev eth0 tclass 1
>>>>
>>>> Delete tc flower filter:
>>>> Example:
>>>>
>>>> # tc filter del dev eth0 parent ffff: prio 3 handle 0x1 flower
>>>> # tc filter del dev eth0 parent ffff:
>>>>
>>>> Flow Director Sideband is disabled while configuring cloud filters
>>>> via tc-flower and until any cloud filter exists.
>>>>
>>>> Unsupported matches when cloud filters are added using enhanced
>>>> big buffer cloud filter mode of underlying switch include:
>>>> 1. source port and source IP
>>>> 2. Combined MAC address and IP fields.
>>>> 3. Not specifying L4 port
>>>>
>>>> These filter matches can however be used to redirect traffic to
>>>> the main VSI (tc 0) which does not require the enhanced big buffer
>>>> cloud filter support.
>>>>
>>>> v3: Cleaned up some lengthy function names. Changed ipv6 address to
>>>> __be32 array instead of u8 array. Used macro for IP version. Minor
>>>> formatting changes.
>>>> v2:
>>>> 1. Moved I40E_SWITCH_MODE_MASK definition to i40e_type.h
>>>> 2. Moved dev_info for add/deleting cloud filters in else condition
>>>> 3. Fixed some format specifier in dev_err logs
>>>> 4. Refactored i40e_get_capabilities to take an additional
>>>>   list_type parameter and use it to query device and function
>>>>   level capabilities.
>>>> 5. Fixed parsing tc redirect action to check for the is_tcf_mirred_tc()
>>>>   to verify if redirect to a traffic class is supported.
>>>> 6. Added comments for Geneve fix in cloud filter big buffer AQ
>>>>   function definitions.
>>>> 7. Cleaned up setup_tc interface to rebase and work with Jiri's
>>>>   updates, separate function to process tc cls flower offloads.
>>>> 8. Changes to make Flow Director Sideband and Cloud filters mutually
>>>>   exclusive.
>>>>
>>>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>>>> Signed-off-by: Kiran Patil <kiran.patil@intel.com>
>>>> Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
>>>> Signed-off-by: Jingjing Wu <jingjing.wu@intel.com>
>>>> ---
>>>> drivers/net/ethernet/intel/i40e/i40e.h             |   49 +
>>>> drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  |    3 
>>>> drivers/net/ethernet/intel/i40e/i40e_common.c      |  189 ++++
>>>> drivers/net/ethernet/intel/i40e/i40e_main.c        |  971 +++++++++++++++++++-
>>>> drivers/net/ethernet/intel/i40e/i40e_prototype.h   |   16 
>>>> drivers/net/ethernet/intel/i40e/i40e_type.h        |    1 
>>>> .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h    |    3 
>>>> 7 files changed, 1202 insertions(+), 30 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
>>>> index 6018fb6..b110519 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>>>> @@ -55,6 +55,8 @@
>>>> #include <linux/net_tstamp.h>
>>>> #include <linux/ptp_clock_kernel.h>
>>>> #include <net/pkt_cls.h>
>>>> +#include <net/tc_act/tc_gact.h>
>>>> +#include <net/tc_act/tc_mirred.h>
>>>> #include "i40e_type.h"
>>>> #include "i40e_prototype.h"
>>>> #include "i40e_client.h"
>>>> @@ -252,9 +254,52 @@ struct i40e_fdir_filter {
>>>> 	u32 fd_id;
>>>> };
>>>>
>>>> +#define IPV4_VERSION 4
>>>> +#define IPV6_VERSION 6
>>>> +
>>>> +#define I40E_CLOUD_FIELD_OMAC	0x01
>>>> +#define I40E_CLOUD_FIELD_IMAC	0x02
>>>> +#define I40E_CLOUD_FIELD_IVLAN	0x04
>>>> +#define I40E_CLOUD_FIELD_TEN_ID	0x08
>>>> +#define I40E_CLOUD_FIELD_IIP	0x10
>>>> +
>>>> +#define I40E_CLOUD_FILTER_FLAGS_OMAC	I40E_CLOUD_FIELD_OMAC
>>>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC	I40E_CLOUD_FIELD_IMAC
>>>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN	(I40E_CLOUD_FIELD_IMAC | \
>>>> +						 I40E_CLOUD_FIELD_IVLAN)
>>>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_TEN_ID	(I40E_CLOUD_FIELD_IMAC | \
>>>> +						 I40E_CLOUD_FIELD_TEN_ID)
>>>> +#define I40E_CLOUD_FILTER_FLAGS_OMAC_TEN_ID_IMAC (I40E_CLOUD_FIELD_OMAC | \
>>>> +						  I40E_CLOUD_FIELD_IMAC | \
>>>> +						  I40E_CLOUD_FIELD_TEN_ID)
>>>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN_TEN_ID (I40E_CLOUD_FIELD_IMAC | \
>>>> +						   I40E_CLOUD_FIELD_IVLAN | \
>>>> +						   I40E_CLOUD_FIELD_TEN_ID)
>>>> +#define I40E_CLOUD_FILTER_FLAGS_IIP	I40E_CLOUD_FIELD_IIP
>>>> +
>>>> struct i40e_cloud_filter {
>>>> 	struct hlist_node cloud_node;
>>>> 	unsigned long cookie;
>>>> +	/* cloud filter input set follows */
>>>> +	u8 dst_mac[ETH_ALEN];
>>>> +	u8 src_mac[ETH_ALEN];
>>>> +	__be16 vlan_id;
>>>> +	__be32 dst_ip;
>>>> +	__be32 src_ip;
>>>> +	__be32 dst_ipv6[4];
>>>> +	__be32 src_ipv6[4];
>>>> +	__be16 dst_port;
>>>> +	__be16 src_port;
>>>> +	u32 ip_version;
>>>> +	u8 ip_proto;	/* IPPROTO value */
>>>> +	/* L4 port type: src or destination port */
>>>> +#define I40E_CLOUD_FILTER_PORT_SRC	0x01
>>>> +#define I40E_CLOUD_FILTER_PORT_DEST	0x02
>>>> +	u8 port_type;
>>>> +	u32 tenant_id;
>>>> +	u8 flags;
>>>> +#define I40E_CLOUD_TNL_TYPE_NONE	0xff
>>>> +	u8 tunnel_type;
>>>> 	u16 seid;	/* filter control */
>>>> };
>>>>
>>>> @@ -491,6 +536,8 @@ struct i40e_pf {
>>>> #define I40E_FLAG_LINK_DOWN_ON_CLOSE_ENABLED	BIT(27)
>>>> #define I40E_FLAG_SOURCE_PRUNING_DISABLED	BIT(28)
>>>> #define I40E_FLAG_TC_MQPRIO			BIT(29)
>>>> +#define I40E_FLAG_FD_SB_INACTIVE		BIT(30)
>>>> +#define I40E_FLAG_FD_SB_TO_CLOUD_FILTER		BIT(31)
>>>>
>>>> 	struct i40e_client_instance *cinst;
>>>> 	bool stat_offsets_loaded;
>>>> @@ -573,6 +620,8 @@ struct i40e_pf {
>>>> 	u16 phy_led_val;
>>>>
>>>> 	u16 override_q_count;
>>>> +	u16 last_sw_conf_flags;
>>>> +	u16 last_sw_conf_valid_flags;
>>>> };
>>>>
>>>> /**
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>>>> index 2e567c2..feb3d42 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>>>> @@ -1392,6 +1392,9 @@ struct i40e_aqc_cloud_filters_element_data {
>>>> 		struct {
>>>> 			u8 data[16];
>>>> 		} v6;
>>>> +		struct {
>>>> +			__le16 data[8];
>>>> +		} raw_v6;
>>>> 	} ipaddr;
>>>> 	__le16	flags;
>>>> #define I40E_AQC_ADD_CLOUD_FILTER_SHIFT			0
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c
>>>> index 9567702..d9c9665 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_common.c
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
>>>> @@ -5434,5 +5434,194 @@ i40e_add_pinfo_to_list(struct i40e_hw *hw,
>>>>
>>>> 	status = i40e_aq_write_ppp(hw, (void *)sec, sec->data_end,
>>>> 				   track_id, &offset, &info, NULL);
>>>> +
>>>> +	return status;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_aq_add_cloud_filters
>>>> + * @hw: pointer to the hardware structure
>>>> + * @seid: VSI seid to add cloud filters to
>>>> + * @filters: Buffer which contains the filters to be added
>>>> + * @filter_count: number of filters contained in the buffer
>>>> + *
>>>> + * Set the cloud filters for a given VSI.  The contents of the
>>>> + * i40e_aqc_cloud_filters_element_data are filled in by the caller
>>>> + * of the function.
>>>> + *
>>>> + **/
>>>> +enum i40e_status_code
>>>> +i40e_aq_add_cloud_filters(struct i40e_hw *hw, u16 seid,
>>>> +			  struct i40e_aqc_cloud_filters_element_data *filters,
>>>> +			  u8 filter_count)
>>>> +{
>>>> +	struct i40e_aq_desc desc;
>>>> +	struct i40e_aqc_add_remove_cloud_filters *cmd =
>>>> +	(struct i40e_aqc_add_remove_cloud_filters *)&desc.params.raw;
>>>> +	enum i40e_status_code status;
>>>> +	u16 buff_len;
>>>> +
>>>> +	i40e_fill_default_direct_cmd_desc(&desc,
>>>> +					  i40e_aqc_opc_add_cloud_filters);
>>>> +
>>>> +	buff_len = filter_count * sizeof(*filters);
>>>> +	desc.datalen = cpu_to_le16(buff_len);
>>>> +	desc.flags |= cpu_to_le16((u16)(I40E_AQ_FLAG_BUF | I40E_AQ_FLAG_RD));
>>>> +	cmd->num_filters = filter_count;
>>>> +	cmd->seid = cpu_to_le16(seid);
>>>> +
>>>> +	status = i40e_asq_send_command(hw, &desc, filters, buff_len, NULL);
>>>> +
>>>> +	return status;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_aq_add_cloud_filters_bb
>>>> + * @hw: pointer to the hardware structure
>>>> + * @seid: VSI seid to add cloud filters to
>>>> + * @filters: Buffer which contains the filters in big buffer to be added
>>>> + * @filter_count: number of filters contained in the buffer
>>>> + *
>>>> + * Set the big buffer cloud filters for a given VSI.  The contents of the
>>>> + * i40e_aqc_cloud_filters_element_bb are filled in by the caller of the
>>>> + * function.
>>>> + *
>>>> + **/
>>>> +i40e_status
>>>> +i40e_aq_add_cloud_filters_bb(struct i40e_hw *hw, u16 seid,
>>>> +			     struct i40e_aqc_cloud_filters_element_bb *filters,
>>>> +			     u8 filter_count)
>>>> +{
>>>> +	struct i40e_aq_desc desc;
>>>> +	struct i40e_aqc_add_remove_cloud_filters *cmd =
>>>> +	(struct i40e_aqc_add_remove_cloud_filters *)&desc.params.raw;
>>>> +	i40e_status status;
>>>> +	u16 buff_len;
>>>> +	int i;
>>>> +
>>>> +	i40e_fill_default_direct_cmd_desc(&desc,
>>>> +					  i40e_aqc_opc_add_cloud_filters);
>>>> +
>>>> +	buff_len = filter_count * sizeof(*filters);
>>>> +	desc.datalen = cpu_to_le16(buff_len);
>>>> +	desc.flags |= cpu_to_le16((u16)(I40E_AQ_FLAG_BUF | I40E_AQ_FLAG_RD));
>>>> +	cmd->num_filters = filter_count;
>>>> +	cmd->seid = cpu_to_le16(seid);
>>>> +	cmd->big_buffer_flag = I40E_AQC_ADD_CLOUD_CMD_BB;
>>>> +
>>>> +	for (i = 0; i < filter_count; i++) {
>>>> +		u16 tnl_type;
>>>> +		u32 ti;
>>>> +
>>>> +		tnl_type = (le16_to_cpu(filters[i].element.flags) &
>>>> +			   I40E_AQC_ADD_CLOUD_TNL_TYPE_MASK) >>
>>>> +			   I40E_AQC_ADD_CLOUD_TNL_TYPE_SHIFT;
>>>> +
>>>> +		/* For Geneve, the VNI must be placed at an offset shifted by
>>>> +		 * one byte relative to the Tenant ID offset used for the
>>>> +		 * other tunnel types.
>>>> +		 */
>>>> +		if (tnl_type == I40E_AQC_ADD_CLOUD_TNL_TYPE_GENEVE) {
>>>> +			ti = le32_to_cpu(filters[i].element.tenant_id);
>>>> +			filters[i].element.tenant_id = cpu_to_le32(ti << 8);
>>>> +		}
>>>> +	}
>>>> +
>>>> +	status = i40e_asq_send_command(hw, &desc, filters, buff_len, NULL);
>>>> +
>>>> +	return status;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_aq_rem_cloud_filters
>>>> + * @hw: pointer to the hardware structure
>>>> + * @seid: VSI seid to remove cloud filters from
>>>> + * @filters: Buffer which contains the filters to be removed
>>>> + * @filter_count: number of filters contained in the buffer
>>>> + *
>>>> + * Remove the cloud filters for a given VSI.  The contents of the
>>>> + * i40e_aqc_cloud_filters_element_data are filled in by the caller
>>>> + * of the function.
>>>> + *
>>>> + **/
>>>> +enum i40e_status_code
>>>> +i40e_aq_rem_cloud_filters(struct i40e_hw *hw, u16 seid,
>>>> +			  struct i40e_aqc_cloud_filters_element_data *filters,
>>>> +			  u8 filter_count)
>>>> +{
>>>> +	struct i40e_aq_desc desc;
>>>> +	struct i40e_aqc_add_remove_cloud_filters *cmd =
>>>> +	(struct i40e_aqc_add_remove_cloud_filters *)&desc.params.raw;
>>>> +	enum i40e_status_code status;
>>>> +	u16 buff_len;
>>>> +
>>>> +	i40e_fill_default_direct_cmd_desc(&desc,
>>>> +					  i40e_aqc_opc_remove_cloud_filters);
>>>> +
>>>> +	buff_len = filter_count * sizeof(*filters);
>>>> +	desc.datalen = cpu_to_le16(buff_len);
>>>> +	desc.flags |= cpu_to_le16((u16)(I40E_AQ_FLAG_BUF | I40E_AQ_FLAG_RD));
>>>> +	cmd->num_filters = filter_count;
>>>> +	cmd->seid = cpu_to_le16(seid);
>>>> +
>>>> +	status = i40e_asq_send_command(hw, &desc, filters, buff_len, NULL);
>>>> +
>>>> +	return status;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_aq_rem_cloud_filters_bb
>>>> + * @hw: pointer to the hardware structure
>>>> + * @seid: VSI seid to remove cloud filters from
>>>> + * @filters: Buffer which contains the filters in big buffer to be removed
>>>> + * @filter_count: number of filters contained in the buffer
>>>> + *
>>>> + * Remove the big buffer cloud filters for a given VSI.  The contents of the
>>>> + * i40e_aqc_cloud_filters_element_bb are filled in by the caller of the
>>>> + * function.
>>>> + *
>>>> + **/
>>>> +i40e_status
>>>> +i40e_aq_rem_cloud_filters_bb(struct i40e_hw *hw, u16 seid,
>>>> +			     struct i40e_aqc_cloud_filters_element_bb *filters,
>>>> +			     u8 filter_count)
>>>> +{
>>>> +	struct i40e_aq_desc desc;
>>>> +	struct i40e_aqc_add_remove_cloud_filters *cmd =
>>>> +	(struct i40e_aqc_add_remove_cloud_filters *)&desc.params.raw;
>>>> +	i40e_status status;
>>>> +	u16 buff_len;
>>>> +	int i;
>>>> +
>>>> +	i40e_fill_default_direct_cmd_desc(&desc,
>>>> +					  i40e_aqc_opc_remove_cloud_filters);
>>>> +
>>>> +	buff_len = filter_count * sizeof(*filters);
>>>> +	desc.datalen = cpu_to_le16(buff_len);
>>>> +	desc.flags |= cpu_to_le16((u16)(I40E_AQ_FLAG_BUF | I40E_AQ_FLAG_RD));
>>>> +	cmd->num_filters = filter_count;
>>>> +	cmd->seid = cpu_to_le16(seid);
>>>> +	cmd->big_buffer_flag = I40E_AQC_ADD_CLOUD_CMD_BB;
>>>> +
>>>> +	for (i = 0; i < filter_count; i++) {
>>>> +		u16 tnl_type;
>>>> +		u32 ti;
>>>> +
>>>> +		tnl_type = (le16_to_cpu(filters[i].element.flags) &
>>>> +			   I40E_AQC_ADD_CLOUD_TNL_TYPE_MASK) >>
>>>> +			   I40E_AQC_ADD_CLOUD_TNL_TYPE_SHIFT;
>>>> +
>>>> +		/* For Geneve, the VNI must be placed at an offset shifted by
>>>> +		 * one byte relative to the Tenant ID offset used for the
>>>> +		 * other tunnel types.
>>>> +		 */
>>>> +		if (tnl_type == I40E_AQC_ADD_CLOUD_TNL_TYPE_GENEVE) {
>>>> +			ti = le32_to_cpu(filters[i].element.tenant_id);
>>>> +			filters[i].element.tenant_id = cpu_to_le32(ti << 8);
>>>> +		}
>>>> +	}
>>>> +
>>>> +	status = i40e_asq_send_command(hw, &desc, filters, buff_len, NULL);
>>>> +
>>>> 	return status;
>>>> }
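
The Geneve handling in the two _bb helpers above can be sketched in user space as follows. This is a minimal sketch on host-order values only: the TNL_TYPE_* macros are hypothetical stand-ins (not the real I40E_AQC_ADD_CLOUD_TNL_TYPE_* values), adjust_tenant_id() is a name made up for illustration, and the le16/le32 conversions the driver performs are left out.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the I40E_AQC_ADD_CLOUD_TNL_TYPE_* macros;
 * the real values live in i40e_adminq_cmd.h.
 */
#define TNL_TYPE_SHIFT   9
#define TNL_TYPE_MASK    (0xFu << TNL_TYPE_SHIFT)
#define TNL_TYPE_GENEVE  2u

/* Return the tenant ID as it would be handed to firmware: shifted left
 * by one byte for Geneve, unchanged for any other tunnel type.
 */
static inline uint32_t adjust_tenant_id(uint16_t flags, uint32_t tenant_id)
{
	uint16_t tnl_type = (flags & TNL_TYPE_MASK) >> TNL_TYPE_SHIFT;

	if (tnl_type == TNL_TYPE_GENEVE)
		return tenant_id << 8;
	return tenant_id;
}
```

With these stand-in values, a Geneve filter carrying VNI 0x123456 would be sent as 0x12345600, while any other tunnel type keeps 0x00123456.
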
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>>> index afcf08a..96ee608 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>>> @@ -69,6 +69,15 @@ static int i40e_reset(struct i40e_pf *pf);
>>>> static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired);
>>>> static void i40e_fdir_sb_setup(struct i40e_pf *pf);
>>>> static int i40e_veb_get_bw_info(struct i40e_veb *veb);
>>>> +static int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
>>>> +				     struct i40e_cloud_filter *filter,
>>>> +				     bool add);
>>>> +static int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
>>>> +					     struct i40e_cloud_filter *filter,
>>>> +					     bool add);
>>>> +static int i40e_get_capabilities(struct i40e_pf *pf,
>>>> +				 enum i40e_admin_queue_opc list_type);
>>>> +
>>>>
>>>> /* i40e_pci_tbl - PCI Device ID Table
>>>>  *
>>>> @@ -5478,7 +5487,11 @@ int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate)
>>>>  **/
>>>> static void i40e_remove_queue_channels(struct i40e_vsi *vsi)
>>>> {
>>>> +	enum i40e_admin_queue_err last_aq_status;
>>>> +	struct i40e_cloud_filter *cfilter;
>>>> 	struct i40e_channel *ch, *ch_tmp;
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	struct hlist_node *node;
>>>> 	int ret, i;
>>>>
>>>> 	/* Reset rss size that was stored when reconfiguring rss for
>>>> @@ -5519,6 +5532,29 @@ static void i40e_remove_queue_channels(struct i40e_vsi *vsi)
>>>> 				 "Failed to reset tx rate for ch->seid %u\n",
>>>> 				 ch->seid);
>>>>
>>>> +		/* delete cloud filters associated with this channel */
>>>> +		hlist_for_each_entry_safe(cfilter, node,
>>>> +					  &pf->cloud_filter_list, cloud_node) {
>>>> +			if (cfilter->seid != ch->seid)
>>>> +				continue;
>>>> +
>>>> +			hash_del(&cfilter->cloud_node);
>>>> +			if (cfilter->dst_port)
>>>> +				ret = i40e_add_del_cloud_filter_big_buf(vsi,
>>>> +									cfilter,
>>>> +									false);
>>>> +			else
>>>> +				ret = i40e_add_del_cloud_filter(vsi, cfilter,
>>>> +								false);
>>>> +			last_aq_status = pf->hw.aq.asq_last_status;
>>>> +			if (ret)
>>>> +				dev_info(&pf->pdev->dev,
>>>> +					 "Failed to delete cloud filter, err %s aq_err %s\n",
>>>> +					 i40e_stat_str(&pf->hw, ret),
>>>> +					 i40e_aq_str(&pf->hw, last_aq_status));
>>>> +			kfree(cfilter);
>>>> +		}
>>>> +
>>>> 		/* delete VSI from FW */
>>>> 		ret = i40e_aq_delete_element(&vsi->back->hw, ch->seid,
>>>> 					     NULL);
>>>> @@ -5970,6 +6006,74 @@ static bool i40e_setup_channel(struct i40e_pf *pf, struct i40e_vsi *vsi,
>>>> }
>>>>
>>>> /**
>>>> + * i40e_validate_and_set_switch_mode - sets up switch mode correctly
>>>> + * @vsi: ptr to VSI which has PF backing
>>>> + * @l4type: true for TCP and false for UDP
>>>> + * @port_type: true if port is destination and false if port is source
>>>> + *
>>>> + * Sets up the switch mode if it needs to be changed and checks that
>>>> + * the current mode is one of the allowed modes.
>>>> + **/
>>>> +static int i40e_validate_and_set_switch_mode(struct i40e_vsi *vsi, bool l4type,
>>>> +					     bool port_type)
>>>> +{
>>>> +	u8 mode;
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	struct i40e_hw *hw = &pf->hw;
>>>> +	int ret;
>>>> +
>>>> +	ret = i40e_get_capabilities(pf, i40e_aqc_opc_list_dev_capabilities);
>>>> +	if (ret)
>>>> +		return -EINVAL;
>>>> +
>>>> +	if (hw->dev_caps.switch_mode) {
>>>> +		/* if switch mode is set, support mode2 (non-tunneled for
>>>> +		 * cloud filter) for now
>>>> +		 */
>>>> +		u32 switch_mode = hw->dev_caps.switch_mode &
>>>> +							I40E_SWITCH_MODE_MASK;
>>>> +		if (switch_mode >= I40E_NVM_IMAGE_TYPE_MODE1) {
>>>> +			if (switch_mode == I40E_NVM_IMAGE_TYPE_MODE2)
>>>> +				return 0;
>>>> +			dev_err(&pf->pdev->dev,
>>>> +				"Invalid switch_mode (%d), only non-tunneled mode for cloud filter is supported\n",
>>>> +				hw->dev_caps.switch_mode);
>>>> +			return -EINVAL;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	/* port_type: true for destination port and false for source port
>>>> +	 * For now, supports only destination port type
>>>> +	 */
>>>> +	if (!port_type) {
>>>> +		dev_err(&pf->pdev->dev, "src port type not supported\n");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +
>>>> +	/* Set Bit 7 to be valid */
>>>> +	mode = I40E_AQ_SET_SWITCH_BIT7_VALID;
>>>> +
>>>> +	/* Set L4type for both TCP and UDP support */
>>>> +	mode |= I40E_AQ_SET_SWITCH_L4_TYPE_BOTH;
>>>> +
>>>> +	/* Set cloud filter mode */
>>>> +	mode |= I40E_AQ_SET_SWITCH_MODE_NON_TUNNEL;
>>>> +
>>>> +	/* Prep mode field for set_switch_config */
>>>> +	ret = i40e_aq_set_switch_config(hw, pf->last_sw_conf_flags,
>>>> +					pf->last_sw_conf_valid_flags,
>>>> +					mode, NULL);
>>>> +	if (ret && hw->aq.asq_last_status != I40E_AQ_RC_ESRCH)
>>>> +		dev_err(&pf->pdev->dev,
>>>> +			"couldn't set switch config bits, err %s aq_err %s\n",
>>>> +			i40e_stat_str(hw, ret),
>>>> +			i40e_aq_str(hw,
>>>> +				    hw->aq.asq_last_status));
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +/**
>>>>  * i40e_create_queue_channel - function to create channel
>>>>  * @vsi: VSI to be configured
>>>>  * @ch: ptr to channel (it contains channel specific params)
>>>> @@ -6735,13 +6839,726 @@ static int i40e_setup_tc(struct net_device *netdev, void *type_data)
>>>> 	return ret;
>>>> }
>>>>
>>>> +/**
>>>> + * i40e_set_cld_element - sets cloud filter element data
>>>> + * @filter: cloud filter rule
>>>> + * @cld: ptr to cloud filter element data
>>>> + *
>>>> + * This is a helper function to copy data into the cloud filter element
>>>> + **/
>>>> +static inline void
>>>> +i40e_set_cld_element(struct i40e_cloud_filter *filter,
>>>> +		     struct i40e_aqc_cloud_filters_element_data *cld)
>>>> +{
>>>> +	int i, j;
>>>> +	u32 ipa;
>>>> +
>>>> +	memset(cld, 0, sizeof(*cld));
>>>> +	ether_addr_copy(cld->outer_mac, filter->dst_mac);
>>>> +	ether_addr_copy(cld->inner_mac, filter->src_mac);
>>>> +
>>>> +	if (filter->ip_version == IPV6_VERSION) {
>>>> +#define IPV6_MAX_INDEX	(ARRAY_SIZE(filter->dst_ipv6) - 1)
>>>> +		for (i = 0, j = 0; i < 4; i++, j += 2) {
>>>> +			ipa = be32_to_cpu(filter->dst_ipv6[IPV6_MAX_INDEX - i]);
>>>> +			ipa = cpu_to_le32(ipa);
>>>> +			memcpy(&cld->ipaddr.raw_v6.data[j], &ipa, 4);
>>>> +		}
>>>> +	} else {
>>>> +		ipa = be32_to_cpu(filter->dst_ip);
>>>> +		memcpy(&cld->ipaddr.v4.data, &ipa, 4);
>>>> +	}
>>>> +
>>>> +	cld->inner_vlan = cpu_to_le16(ntohs(filter->vlan_id));
>>>> +
>>>> +	/* tenant_id is not supported by FW now, once the support is enabled
>>>> +	 * fill the cld->tenant_id with cpu_to_le32(filter->tenant_id)
>>>> +	 */
>>>> +	if (filter->tenant_id)
>>>> +		return;
>>>> +}
>>>> +
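
As I read the IPv6 branch of i40e_set_cld_element() above, the four 32-bit words are copied in reverse order and each word is stored little-endian, so the 16 address bytes end up fully reversed. A minimal, endian-independent sketch of that layout follows; ipv6_to_cld_layout() is a hypothetical name used only for illustration and the byte arithmetic stands in for be32_to_cpu()/cpu_to_le32().

```c
#include <stdint.h>

/* Copy a 16-byte IPv6 address (network byte order) into out[] using the
 * layout the loop above appears to produce: 32-bit words in reverse
 * order, each word stored little-endian.
 */
static inline void ipv6_to_cld_layout(const uint8_t in[16], uint8_t out[16])
{
	int i;

	for (i = 0; i < 4; i++) {
		const uint8_t *w = &in[(3 - i) * 4];
		/* be32_to_cpu() equivalent, independent of host endianness */
		uint32_t ipa = ((uint32_t)w[0] << 24) | ((uint32_t)w[1] << 16) |
			       ((uint32_t)w[2] << 8) | (uint32_t)w[3];

		/* cpu_to_le32() followed by memcpy(), written out by byte */
		out[i * 4 + 0] = (uint8_t)(ipa & 0xff);
		out[i * 4 + 1] = (uint8_t)((ipa >> 8) & 0xff);
		out[i * 4 + 2] = (uint8_t)((ipa >> 16) & 0xff);
		out[i * 4 + 3] = (uint8_t)((ipa >> 24) & 0xff);
	}
}
```

Feeding it the bytes 0, 1, ..., 15 should yield 15, 14, ..., 0, which is an easy way to check the layout against what the firmware expects.
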
>>>> +/**
>>>> + * i40e_add_del_cloud_filter - Add/del cloud filter
>>>> + * @vsi: pointer to VSI
>>>> + * @filter: cloud filter rule
>>>> + * @add: if true, add, if false, delete
>>>> + *
>>>> + * Add or delete a cloud filter for a specific flow spec.
>>>> + * Returns 0 if the filter was successfully added or deleted.
>>>> + **/
>>>> +static int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
>>>> +				     struct i40e_cloud_filter *filter, bool add)
>>>> +{
>>>> +	struct i40e_aqc_cloud_filters_element_data cld_filter;
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	int ret;
>>>> +	static const u16 flag_table[128] = {
>>>> +		[I40E_CLOUD_FILTER_FLAGS_OMAC]  =
>>>> +			I40E_AQC_ADD_CLOUD_FILTER_OMAC,
>>>> +		[I40E_CLOUD_FILTER_FLAGS_IMAC]  =
>>>> +			I40E_AQC_ADD_CLOUD_FILTER_IMAC,
>>>> +		[I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN]  =
>>>> +			I40E_AQC_ADD_CLOUD_FILTER_IMAC_IVLAN,
>>>> +		[I40E_CLOUD_FILTER_FLAGS_IMAC_TEN_ID] =
>>>> +			I40E_AQC_ADD_CLOUD_FILTER_IMAC_TEN_ID,
>>>> +		[I40E_CLOUD_FILTER_FLAGS_OMAC_TEN_ID_IMAC] =
>>>> +			I40E_AQC_ADD_CLOUD_FILTER_OMAC_TEN_ID_IMAC,
>>>> +		[I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN_TEN_ID] =
>>>> +			I40E_AQC_ADD_CLOUD_FILTER_IMAC_IVLAN_TEN_ID,
>>>> +		[I40E_CLOUD_FILTER_FLAGS_IIP] =
>>>> +			I40E_AQC_ADD_CLOUD_FILTER_IIP,
>>>> +	};
>>>> +
>>>> +	if (filter->flags >= ARRAY_SIZE(flag_table))
>>>> +		return I40E_ERR_CONFIG;
>>>> +
>>>> +	/* copy element needed to add cloud filter from filter */
>>>> +	i40e_set_cld_element(filter, &cld_filter);
>>>> +
>>>> +	if (filter->tunnel_type != I40E_CLOUD_TNL_TYPE_NONE)
>>>> +		cld_filter.flags = cpu_to_le16(filter->tunnel_type <<
>>>> +					     I40E_AQC_ADD_CLOUD_TNL_TYPE_SHIFT);
>>>> +
>>>> +	if (filter->ip_version == IPV6_VERSION)
>>>> +		cld_filter.flags |= cpu_to_le16(flag_table[filter->flags] |
>>>> +						I40E_AQC_ADD_CLOUD_FLAGS_IPV6);
>>>> +	else
>>>> +		cld_filter.flags |= cpu_to_le16(flag_table[filter->flags] |
>>>> +						I40E_AQC_ADD_CLOUD_FLAGS_IPV4);
>>>> +
>>>> +	if (add)
>>>> +		ret = i40e_aq_add_cloud_filters(&pf->hw, filter->seid,
>>>> +						&cld_filter, 1);
>>>> +	else
>>>> +		ret = i40e_aq_rem_cloud_filters(&pf->hw, filter->seid,
>>>> +						&cld_filter, 1);
>>>> +	if (ret)
>>>> +		dev_dbg(&pf->pdev->dev,
>>>> +			"Failed to %s cloud filter using l4 port %u, err %d aq_err %d\n",
>>>> +			add ? "add" : "delete", filter->dst_port, ret,
>>>> +			pf->hw.aq.asq_last_status);
>>>> +	else
>>>> +		dev_info(&pf->pdev->dev,
>>>> +			 "%s cloud filter for VSI: %d\n",
>>>> +			 add ? "Added" : "Deleted", filter->seid);
>>>> +	return ret;
>>>> +}
>>>> +
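
The flag_table lookup in i40e_add_del_cloud_filter() above is a sparse table indexed by the OR of the field bits. A minimal sketch of the same pattern follows, assuming made-up firmware codes (the FW_* enum is hypothetical; the real values are the I40E_AQC_ADD_CLOUD_FILTER_* macros) and a hypothetical helper name, cloud_flags_to_fw(). The field bits are copied from the i40e.h hunk.

```c
#include <stdint.h>

/* Field bits, copied from the i40e.h hunk in this patch. */
#define CLOUD_FIELD_OMAC	0x01
#define CLOUD_FIELD_IMAC	0x02
#define CLOUD_FIELD_IVLAN	0x04
#define CLOUD_FIELD_IIP		0x10

/* Hypothetical firmware filter-type codes, for illustration only. */
enum { FW_OMAC = 1, FW_IMAC, FW_IMAC_IVLAN, FW_IIP };

/* Sparse table: entries not listed here are zero-initialized. */
static const uint16_t flag_table[128] = {
	[CLOUD_FIELD_OMAC] = FW_OMAC,
	[CLOUD_FIELD_IMAC] = FW_IMAC,
	[CLOUD_FIELD_IMAC | CLOUD_FIELD_IVLAN] = FW_IMAC_IVLAN,
	[CLOUD_FIELD_IIP] = FW_IIP,
};

/* Returns the firmware code, 0 for a combination with no table entry,
 * or -1 when flags would index past the end of the table.
 */
static inline int cloud_flags_to_fw(uint8_t flags)
{
	if (flags >= sizeof(flag_table) / sizeof(flag_table[0]))
		return -1;
	return flag_table[flags];
}
```

One thing the sketch makes visible: a combination without an entry (for example OMAC together with IIP, 0x11) maps to 0 rather than to an error. As far as I can tell from the hunk, the driver only performs the bounds check, so a caller may want to treat 0 as unsupported.
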
>>>> +/**
>>>> + * i40e_add_del_cloud_filter_big_buf - Add/del cloud filter using big_buf
>>>> + * @vsi: pointer to VSI
>>>> + * @filter: cloud filter rule
>>>> + * @add: if true, add, if false, delete
>>>> + *
>>>> + * Add or delete a cloud filter for a specific flow spec using big buffer.
>>>> + * Returns 0 if the filter was successfully added or deleted.
>>>> + **/
>>>> +static int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
>>>> +					     struct i40e_cloud_filter *filter,
>>>> +					     bool add)
>>>> +{
>>>> +	struct i40e_aqc_cloud_filters_element_bb cld_filter;
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	int ret;
>>>> +
>>>> +	/* Valid outer and inner MAC addresses together are not supported */
>>>> +	if (is_valid_ether_addr(filter->dst_mac) &&
>>>> +	    is_valid_ether_addr(filter->src_mac))
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* Make sure the port is specified, otherwise bail out: a channel
>>>> +	 * specific cloud filter needs the L4 port to be non-zero
>>>> +	 */
>>>> +	if (!filter->dst_port)
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* adding filter using src_port/src_ip is not supported at this stage */
>>>> +	if (filter->src_port || filter->src_ip ||
>>>> +	    !ipv6_addr_any((struct in6_addr *)&filter->src_ipv6))
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* copy element needed to add cloud filter from filter */
>>>> +	i40e_set_cld_element(filter, &cld_filter.element);
>>>> +
>>>> +	if (is_valid_ether_addr(filter->dst_mac) ||
>>>> +	    is_valid_ether_addr(filter->src_mac) ||
>>>> +	    is_multicast_ether_addr(filter->dst_mac) ||
>>>> +	    is_multicast_ether_addr(filter->src_mac)) {
>>>> +		/* MAC + IP : unsupported mode */
>>>> +		if (filter->dst_ip)
>>>> +			return -EINVAL;
>>>> +
>>>> +		/* since we validated that L4 port must be valid before
>>>> +		 * we get here, start with respective "flags" value
>>>> +		 * and update if vlan is present or not
>>>> +		 */
>>>> +		cld_filter.element.flags =
>>>> +			cpu_to_le16(I40E_AQC_ADD_CLOUD_FILTER_MAC_PORT);
>>>> +
>>>> +		if (filter->vlan_id) {
>>>> +			cld_filter.element.flags =
>>>> +			cpu_to_le16(I40E_AQC_ADD_CLOUD_FILTER_MAC_VLAN_PORT);
>>>> +		}
>>>> +
>>>> +	} else if (filter->dst_ip || filter->ip_version == IPV6_VERSION) {
>>>> +		cld_filter.element.flags =
>>>> +				cpu_to_le16(I40E_AQC_ADD_CLOUD_FILTER_IP_PORT);
>>>> +		if (filter->ip_version == IPV6_VERSION)
>>>> +			cld_filter.element.flags |=
>>>> +				cpu_to_le16(I40E_AQC_ADD_CLOUD_FLAGS_IPV6);
>>>> +		else
>>>> +			cld_filter.element.flags |=
>>>> +				cpu_to_le16(I40E_AQC_ADD_CLOUD_FLAGS_IPV4);
>>>> +	} else {
>>>> +		dev_err(&pf->pdev->dev,
>>>> +			"either mac or ip has to be valid for cloud filter\n");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +
>>>> +	/* Now copy L4 port in Byte 6..7 in general fields */
>>>> +	cld_filter.general_fields[I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD0] =
>>>> +						be16_to_cpu(filter->dst_port);
>>>> +
>>>> +	if (add) {
>>>> +		bool proto_type, port_type;
>>>> +
>>>> +		proto_type = (filter->ip_proto == IPPROTO_TCP) ? true : false;
>>>> +		port_type = (filter->port_type & I40E_CLOUD_FILTER_PORT_DEST) ?
>>>> +			     true : false;
>>>> +
>>>> +		/* For now, src port based cloud filter for channel is not
>>>> +		 * supported
>>>> +		 */
>>>> +		if (!port_type) {
>>>> +			dev_err(&pf->pdev->dev,
>>>> +				"unsupported port type (src port)\n");
>>>> +			return -EOPNOTSUPP;
>>>> +		}
>>>> +
>>>> +		/* Validate current device switch mode, change if necessary */
>>>> +		ret = i40e_validate_and_set_switch_mode(vsi, proto_type,
>>>> +							port_type);
>>>> +		if (ret) {
>>>> +			dev_err(&pf->pdev->dev,
>>>> +				"failed to set switch mode, ret %d\n",
>>>> +				ret);
>>>> +			return ret;
>>>> +		}
>>>> +
>>>> +		ret = i40e_aq_add_cloud_filters_bb(&pf->hw, filter->seid,
>>>> +						   &cld_filter, 1);
>>>> +	} else {
>>>> +		ret = i40e_aq_rem_cloud_filters_bb(&pf->hw, filter->seid,
>>>> +						   &cld_filter, 1);
>>>> +	}
>>>> +
>>>> +	if (ret)
>>>> +		dev_dbg(&pf->pdev->dev,
>>>> +			"Failed to %s cloud filter(big buffer) err %d aq_err %d\n",
>>>> +			add ? "add" : "delete", ret, pf->hw.aq.asq_last_status);
>>>> +	else
>>>> +		dev_info(&pf->pdev->dev,
>>>> +			 "%s cloud filter for VSI: %d, L4 port: %d\n",
>>>> +			 add ? "Added" : "Deleted", filter->seid,
>>>> +			 ntohs(filter->dst_port));
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_parse_cls_flower - Parse tc flower filters provided by kernel
>>>> + * @vsi: Pointer to VSI
>>>> + * @f: Pointer to struct tc_cls_flower_offload
>>>> + * @filter: Pointer to cloud filter structure
>>>> + *
>>>> + **/
>>>> +static int i40e_parse_cls_flower(struct i40e_vsi *vsi,
>>>> +				 struct tc_cls_flower_offload *f,
>>>> +				 struct i40e_cloud_filter *filter)
>>>> +{
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	u16 addr_type = 0;
>>>> +	u8 field_flags = 0;
>>>> +
>>>> +	if (f->dissector->used_keys &
>>>> +	    ~(BIT(FLOW_DISSECTOR_KEY_CONTROL) |
>>>> +	      BIT(FLOW_DISSECTOR_KEY_BASIC) |
>>>> +	      BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
>>>> +	      BIT(FLOW_DISSECTOR_KEY_VLAN) |
>>>> +	      BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS) |
>>>> +	      BIT(FLOW_DISSECTOR_KEY_IPV6_ADDRS) |
>>>> +	      BIT(FLOW_DISSECTOR_KEY_PORTS) |
>>>> +	      BIT(FLOW_DISSECTOR_KEY_ENC_KEYID))) {
>>>> +		dev_err(&pf->pdev->dev, "Unsupported key used: 0x%x\n",
>>>> +			f->dissector->used_keys);
>>>> +		return -EOPNOTSUPP;
>>>> +	}
>>>> +
>>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ENC_KEYID)) {
>>>> +		struct flow_dissector_key_keyid *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_ENC_KEYID,
>>>> +						  f->key);
>>>> +
>>>> +		struct flow_dissector_key_keyid *mask =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_ENC_KEYID,
>>>> +						  f->mask);
>>>> +
>>>> +		if (mask->keyid != 0)
>>>> +			field_flags |= I40E_CLOUD_FIELD_TEN_ID;
>>>> +
>>>> +		filter->tenant_id = be32_to_cpu(key->keyid);
>>>> +	}
>>>> +
>>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_BASIC)) {
>>>> +		struct flow_dissector_key_basic *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_BASIC,
>>>> +						  f->key);
>>>> +
>>>> +		filter->ip_proto = key->ip_proto;
>>>> +	}
>>>> +
>>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
>>>> +		struct flow_dissector_key_eth_addrs *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_ETH_ADDRS,
>>>> +						  f->key);
>>>> +
>>>> +		struct flow_dissector_key_eth_addrs *mask =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_ETH_ADDRS,
>>>> +						  f->mask);
>>>> +
>>>> +		/* use is_broadcast and is_zero to check for all 0xff or 0 */
>>>> +		if (!is_zero_ether_addr(mask->dst)) {
>>>> +			if (is_broadcast_ether_addr(mask->dst)) {
>>>> +				field_flags |= I40E_CLOUD_FIELD_OMAC;
>>>> +			} else {
>>>> +				dev_err(&pf->pdev->dev, "Bad ether dest mask %pM\n",
>>>> +					mask->dst);
>>>> +				return I40E_ERR_CONFIG;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		if (!is_zero_ether_addr(mask->src)) {
>>>> +			if (is_broadcast_ether_addr(mask->src)) {
>>>> +				field_flags |= I40E_CLOUD_FIELD_IMAC;
>>>> +			} else {
>>>> +				dev_err(&pf->pdev->dev, "Bad ether src mask %pM\n",
>>>> +					mask->src);
>>>> +				return I40E_ERR_CONFIG;
>>>> +			}
>>>> +		}
>>>> +		ether_addr_copy(filter->dst_mac, key->dst);
>>>> +		ether_addr_copy(filter->src_mac, key->src);
>>>> +	}
>>>> +
>>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_VLAN)) {
>>>> +		struct flow_dissector_key_vlan *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_VLAN,
>>>> +						  f->key);
>>>> +		struct flow_dissector_key_vlan *mask =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_VLAN,
>>>> +						  f->mask);
>>>> +
>>>> +		if (mask->vlan_id) {
>>>> +			if (mask->vlan_id == VLAN_VID_MASK) {
>>>> +				field_flags |= I40E_CLOUD_FIELD_IVLAN;
>>>> +
>>>> +			} else {
>>>> +				dev_err(&pf->pdev->dev, "Bad vlan mask 0x%04x\n",
>>>> +					mask->vlan_id);
>>>> +				return I40E_ERR_CONFIG;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		filter->vlan_id = cpu_to_be16(key->vlan_id);
>>>> +	}
>>>> +
>>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_CONTROL)) {
>>>> +		struct flow_dissector_key_control *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_CONTROL,
>>>> +						  f->key);
>>>> +
>>>> +		addr_type = key->addr_type;
>>>> +	}
>>>> +
>>>> +	if (addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS) {
>>>> +		struct flow_dissector_key_ipv4_addrs *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_IPV4_ADDRS,
>>>> +						  f->key);
>>>> +		struct flow_dissector_key_ipv4_addrs *mask =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_IPV4_ADDRS,
>>>> +						  f->mask);
>>>> +
>>>> +		if (mask->dst) {
>>>> +			if (mask->dst == cpu_to_be32(0xffffffff)) {
>>>> +				field_flags |= I40E_CLOUD_FIELD_IIP;
>>>> +			} else {
>>>> +				dev_err(&pf->pdev->dev, "Bad ip dst mask 0x%08x\n",
>>>> +					be32_to_cpu(mask->dst));
>>>> +				return I40E_ERR_CONFIG;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		if (mask->src) {
>>>> +			if (mask->src == cpu_to_be32(0xffffffff)) {
>>>> +				field_flags |= I40E_CLOUD_FIELD_IIP;
>>>> +			} else {
>>>> +				dev_err(&pf->pdev->dev, "Bad ip src mask 0x%08x\n",
>>>> +					be32_to_cpu(mask->src));
>>>> +				return I40E_ERR_CONFIG;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		if (field_flags & I40E_CLOUD_FIELD_TEN_ID) {
>>>> +			dev_err(&pf->pdev->dev, "Tenant id not allowed for ip filter\n");
>>>> +			return I40E_ERR_CONFIG;
>>>> +		}
>>>> +		filter->dst_ip = key->dst;
>>>> +		filter->src_ip = key->src;
>>>> +		filter->ip_version = IPV4_VERSION;
>>>> +	}
>>>> +
>>>> +	if (addr_type == FLOW_DISSECTOR_KEY_IPV6_ADDRS) {
>>>> +		struct flow_dissector_key_ipv6_addrs *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_IPV6_ADDRS,
>>>> +						  f->key);
>>>> +		struct flow_dissector_key_ipv6_addrs *mask =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_IPV6_ADDRS,
>>>> +						  f->mask);
>>>> +
>>>> +		/* src and dest IPV6 address should not be LOOPBACK
>>>> +		 * (0:0:0:0:0:0:0:1), which can be represented as ::1
>>>> +		 */
>>>> +		if (ipv6_addr_loopback(&key->dst) ||
>>>> +		    ipv6_addr_loopback(&key->src)) {
>>>> +			dev_err(&pf->pdev->dev,
>>>> +				"Bad ipv6, addr is LOOPBACK\n");
>>>> +			return I40E_ERR_CONFIG;
>>>> +		}
>>>> +		if (!ipv6_addr_any(&mask->dst) || !ipv6_addr_any(&mask->src))
>>>> +			field_flags |= I40E_CLOUD_FIELD_IIP;
>>>> +
>>>> +		memcpy(&filter->src_ipv6, &key->src.s6_addr32,
>>>> +		       sizeof(filter->src_ipv6));
>>>> +		memcpy(&filter->dst_ipv6, &key->dst.s6_addr32,
>>>> +		       sizeof(filter->dst_ipv6));
>>>> +
>>>> +		/* mark it as IPv6 filter, to be used later */
>>>> +		filter->ip_version = IPV6_VERSION;
>>>> +	}
>>>> +
>>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_PORTS)) {
>>>> +		struct flow_dissector_key_ports *key =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_PORTS,
>>>> +						  f->key);
>>>> +		struct flow_dissector_key_ports *mask =
>>>> +			skb_flow_dissector_target(f->dissector,
>>>> +						  FLOW_DISSECTOR_KEY_PORTS,
>>>> +						  f->mask);
>>>> +
>>>> +		if (mask->src) {
>>>> +			if (mask->src == cpu_to_be16(0xffff)) {
>>>> +				field_flags |= I40E_CLOUD_FIELD_IIP;
>>>> +			} else {
>>>> +				dev_err(&pf->pdev->dev, "Bad src port mask 0x%04x\n",
>>>> +					be16_to_cpu(mask->src));
>>>> +				return I40E_ERR_CONFIG;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		if (mask->dst) {
>>>> +			if (mask->dst == cpu_to_be16(0xffff)) {
>>>> +				field_flags |= I40E_CLOUD_FIELD_IIP;
>>>> +			} else {
>>>> +				dev_err(&pf->pdev->dev, "Bad dst port mask 0x%04x\n",
>>>> +					be16_to_cpu(mask->dst));
>>>> +				return I40E_ERR_CONFIG;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		filter->dst_port = key->dst;
>>>> +		filter->src_port = key->src;
>>>> +
>>>> +		/* For now, only supports destination port */
>>>> +		filter->port_type |= I40E_CLOUD_FILTER_PORT_DEST;
>>>> +
>>>> +		switch (filter->ip_proto) {
>>>> +		case IPPROTO_TCP:
>>>> +		case IPPROTO_UDP:
>>>> +			break;
>>>> +		default:
>>>> +			dev_err(&pf->pdev->dev,
>>>> +				"Only UDP and TCP transport are supported\n");
>>>> +			return -EINVAL;
>>>> +		}
>>>> +	}
>>>> +	filter->flags = field_flags;
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_handle_redirect_action - Forward to a traffic class on the device
>>>> + * @vsi: Pointer to VSI
>>>> + * @ifindex: ifindex of the device to forward to
>>>> + * @tc: traffic class index on the device
>>>> + * @filter: Pointer to cloud filter structure
>>>> + *
>>>> + **/
>>>> +static int i40e_handle_redirect_action(struct i40e_vsi *vsi, int ifindex, u8 tc,
>>>> +				       struct i40e_cloud_filter *filter)
>>>> +{
>>>> +	struct i40e_channel *ch, *ch_tmp;
>>>> +
>>>> +	/* redirect to a traffic class on the same device */
>>>> +	if (vsi->netdev->ifindex == ifindex) {
>>>> +		if (tc == 0) {
>>>> +			filter->seid = vsi->seid;
>>>> +			return 0;
>>>> +		} else if (vsi->tc_config.enabled_tc & BIT(tc)) {
>>>> +			if (!filter->dst_port) {
>>>> +				dev_err(&vsi->back->pdev->dev,
>>>> +					"Specify destination port to redirect to traffic class that is not default\n");
>>>> +				return -EINVAL;
>>>> +			}
>>>> +			if (list_empty(&vsi->ch_list))
>>>> +				return -EINVAL;
>>>> +			list_for_each_entry_safe(ch, ch_tmp, &vsi->ch_list,
>>>> +						 list) {
>>>> +				if (ch->seid == vsi->tc_seid_map[tc])
>>>> +					filter->seid = ch->seid;
>>>> +			}
>>>> +			return 0;
>>>> +		}
>>>> +	}
>>>> +	return -EINVAL;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_parse_tc_actions - Parse tc actions
>>>> + * @vsi: Pointer to VSI
>>>> + * @exts: Pointer to struct tcf_exts holding the actions
>>>> + * @filter: Pointer to cloud filter structure
>>>> + *
>>>> + **/
>>>> +static int i40e_parse_tc_actions(struct i40e_vsi *vsi, struct tcf_exts *exts,
>>>> +				 struct i40e_cloud_filter *filter)
>>>> +{
>>>> +	const struct tc_action *a;
>>>> +	LIST_HEAD(actions);
>>>> +	int err;
>>>> +
>>>> +	if (!tcf_exts_has_actions(exts))
>>>> +		return -EINVAL;
>>>> +
>>>> +	tcf_exts_to_list(exts, &actions);
>>>> +	list_for_each_entry(a, &actions, list) {
>>>> +		/* Drop action */
>>>> +		if (is_tcf_gact_shot(a)) {
>>>> +			dev_err(&vsi->back->pdev->dev,
>>>> +				"Cloud filters do not support the drop action.\n");
>>>> +			return -EOPNOTSUPP;
>>>> +		}
>>>> +
>>>> +		/* Redirect to a traffic class on the same device */
>>>> +		if (!is_tcf_mirred_egress_redirect(a) && is_tcf_mirred_tc(a)) {
>>>> +			int ifindex = tcf_mirred_ifindex(a);
>>>> +			u8 tc = tcf_mirred_tc(a);
>>>> +
>>>> +			err = i40e_handle_redirect_action(vsi, ifindex, tc,
>>>> +							  filter);
>>>> +			if (err == 0)
>>>> +				return err;
>>>> +		}
>>>> +	}
>>>> +	return -EINVAL;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_configure_clsflower - Configure tc flower filters
>>>> + * @vsi: Pointer to VSI
>>>> + * @cls_flower: Pointer to struct tc_cls_flower_offload
>>>> + *
>>>> + **/
>>>> +static int i40e_configure_clsflower(struct i40e_vsi *vsi,
>>>> +				    struct tc_cls_flower_offload *cls_flower)
>>>> +{
>>>> +	struct i40e_cloud_filter *filter = NULL;
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	int err = 0;
>>>> +
>>>> +	if (test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state) ||
>>>> +	    test_bit(__I40E_RESET_INTR_RECEIVED, pf->state))
>>>> +		return -EBUSY;
>>>> +
>>>> +	if (pf->fdir_pf_active_filters ||
>>>> +	    (!hlist_empty(&pf->fdir_filter_list))) {
>>>> +		dev_err(&vsi->back->pdev->dev,
>>>> +			"Flow Director Sideband filters exist, turn ntuple off to configure cloud filters\n");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +
>>>> +	if (vsi->back->flags & I40E_FLAG_FD_SB_ENABLED) {
>>>> +		dev_err(&vsi->back->pdev->dev,
>>>> +			"Disable Flow Director Sideband, configuring Cloud filters via tc-flower\n");
>>>> +		vsi->back->flags &= ~I40E_FLAG_FD_SB_ENABLED;
>>>> +		vsi->back->flags |= I40E_FLAG_FD_SB_TO_CLOUD_FILTER;
>>>> +	}
>>>> +
>>>> +	filter = kzalloc(sizeof(*filter), GFP_KERNEL);
>>>> +	if (!filter)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	filter->cookie = cls_flower->cookie;
>>>> +
>>>> +	err = i40e_parse_cls_flower(vsi, cls_flower, filter);
>>>> +	if (err < 0)
>>>> +		goto err;
>>>> +
>>>> +	err = i40e_parse_tc_actions(vsi, cls_flower->exts, filter);
>>>> +	if (err < 0)
>>>> +		goto err;
>>>> +
>>>> +	/* Add cloud filter */
>>>> +	if (filter->dst_port)
>>>> +		err = i40e_add_del_cloud_filter_big_buf(vsi, filter, true);
>>>> +	else
>>>> +		err = i40e_add_del_cloud_filter(vsi, filter, true);
>>>> +
>>>> +	if (err) {
>>>> +		dev_err(&pf->pdev->dev,
>>>> +			"Failed to add cloud filter, err %s\n",
>>>> +			i40e_stat_str(&pf->hw, err));
>>>> +		err = i40e_aq_rc_to_posix(err, pf->hw.aq.asq_last_status);
>>>> +		goto err;
>>>> +	}
>>>> +
>>>> +	/* add filter to the ordered list */
>>>> +	INIT_HLIST_NODE(&filter->cloud_node);
>>>> +
>>>> +	hlist_add_head(&filter->cloud_node, &pf->cloud_filter_list);
>>>> +
>>>> +	pf->num_cloud_filters++;
>>>> +
>>>> +	return err;
>>>> +err:
>>>> +	kfree(filter);
>>>> +	return err;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_find_cloud_filter - Find the cloud filter in the list
>>>> + * @vsi: Pointer to VSI
>>>> + * @cookie: filter specific cookie
>>>> + *
>>>> + **/
>>>> +static struct i40e_cloud_filter *i40e_find_cloud_filter(struct i40e_vsi *vsi,
>>>> +							unsigned long *cookie)
>>>> +{
>>>> +	struct i40e_cloud_filter *filter = NULL;
>>>> +	struct hlist_node *node2;
>>>> +
>>>> +	hlist_for_each_entry_safe(filter, node2,
>>>> +				  &vsi->back->cloud_filter_list, cloud_node)
>>>> +		if (!memcmp(cookie, &filter->cookie, sizeof(filter->cookie)))
>>>> +			return filter;
>>>> +	return NULL;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_delete_clsflower - Remove tc flower filters
>>>> + * @vsi: Pointer to VSI
>>>> + * @cls_flower: Pointer to struct tc_cls_flower_offload
>>>> + *
>>>> + **/
>>>> +static int i40e_delete_clsflower(struct i40e_vsi *vsi,
>>>> +				 struct tc_cls_flower_offload *cls_flower)
>>>> +{
>>>> +	struct i40e_cloud_filter *filter = NULL;
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	int err = 0;
>>>> +
>>>> +	filter = i40e_find_cloud_filter(vsi, &cls_flower->cookie);
>>>> +
>>>> +	if (!filter)
>>>> +		return -EINVAL;
>>>> +
>>>> +	hash_del(&filter->cloud_node);
>>>> +
>>>> +	if (filter->dst_port)
>>>> +		err = i40e_add_del_cloud_filter_big_buf(vsi, filter, false);
>>>> +	else
>>>> +		err = i40e_add_del_cloud_filter(vsi, filter, false);
>>>> +	if (err) {
>>>> +		kfree(filter);
>>>> +		dev_err(&pf->pdev->dev,
>>>> +			"Failed to delete cloud filter, err %s\n",
>>>> +			i40e_stat_str(&pf->hw, err));
>>>> +		return i40e_aq_rc_to_posix(err, pf->hw.aq.asq_last_status);
>>>> +	}
>>>> +
>>>> +	kfree(filter);
>>>> +	pf->num_cloud_filters--;
>>>> +
>>>> +	if (!pf->num_cloud_filters)
>>>> +		if ((pf->flags & I40E_FLAG_FD_SB_TO_CLOUD_FILTER) &&
>>>> +		    !(pf->flags & I40E_FLAG_FD_SB_INACTIVE)) {
>>>> +			pf->flags |= I40E_FLAG_FD_SB_ENABLED;
>>>> +			pf->flags &= ~I40E_FLAG_FD_SB_TO_CLOUD_FILTER;
>>>> +			pf->flags &= ~I40E_FLAG_FD_SB_INACTIVE;
>>>> +		}
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i40e_setup_tc_cls_flower - flower classifier offloads
>>>> + * @netdev: net device to configure
>>>> + * @cls_flower: Pointer to struct tc_cls_flower_offload
>>>> + **/
>>>> +static int i40e_setup_tc_cls_flower(struct net_device *netdev,
>>>> +				    struct tc_cls_flower_offload *cls_flower)
>>>> +{
>>>> +	struct i40e_netdev_priv *np = netdev_priv(netdev);
>>>> +	struct i40e_vsi *vsi = np->vsi;
>>>> +
>>>> +	if (!is_classid_clsact_ingress(cls_flower->common.classid) ||
>>>> +	    cls_flower->common.chain_index)
>>>> +		return -EOPNOTSUPP;
>>>> +
>>>> +	switch (cls_flower->command) {
>>>> +	case TC_CLSFLOWER_REPLACE:
>>>> +		return i40e_configure_clsflower(vsi, cls_flower);
>>>> +	case TC_CLSFLOWER_DESTROY:
>>>> +		return i40e_delete_clsflower(vsi, cls_flower);
>>>> +	case TC_CLSFLOWER_STATS:
>>>> +		return -EOPNOTSUPP;
>>>> +	default:
>>>> +		return -EINVAL;
>>>> +	}
>>>> +}
>>>> +
>>>> static int __i40e_setup_tc(struct net_device *netdev, enum tc_setup_type type,
>>>> 			   void *type_data)
>>>> {
>>>> -	if (type != TC_SETUP_MQPRIO)
>>>> +	switch (type) {
>>>> +	case TC_SETUP_MQPRIO:
>>>> +		return i40e_setup_tc(netdev, type_data);
>>>> +	case TC_SETUP_CLSFLOWER:
>>>> +		return i40e_setup_tc_cls_flower(netdev, type_data);
>>>> +	default:
>>>> 		return -EOPNOTSUPP;
>>>> -
>>>> -	return i40e_setup_tc(netdev, type_data);
>>>> +	}
>>>> }
>>>>
>>>> /**
>>>> @@ -6939,6 +7756,13 @@ static void i40e_cloud_filter_exit(struct i40e_pf *pf)
>>>> 		kfree(cfilter);
>>>> 	}
>>>> 	pf->num_cloud_filters = 0;
>>>> +
>>>> +	if ((pf->flags & I40E_FLAG_FD_SB_TO_CLOUD_FILTER) &&
>>>> +	    !(pf->flags & I40E_FLAG_FD_SB_INACTIVE)) {
>>>> +		pf->flags |= I40E_FLAG_FD_SB_ENABLED;
>>>> +		pf->flags &= ~I40E_FLAG_FD_SB_TO_CLOUD_FILTER;
>>>> +		pf->flags &= ~I40E_FLAG_FD_SB_INACTIVE;
>>>> +	}
>>>> }
>>>>
>>>> /**
>>>> @@ -8046,7 +8870,8 @@ static int i40e_reconstitute_veb(struct i40e_veb *veb)
>>>>  * i40e_get_capabilities - get info about the HW
>>>>  * @pf: the PF struct
>>>>  **/
>>>> -static int i40e_get_capabilities(struct i40e_pf *pf)
>>>> +static int i40e_get_capabilities(struct i40e_pf *pf,
>>>> +				 enum i40e_admin_queue_opc list_type)
>>>> {
>>>> 	struct i40e_aqc_list_capabilities_element_resp *cap_buf;
>>>> 	u16 data_size;
>>>> @@ -8061,9 +8886,8 @@ static int i40e_get_capabilities(struct i40e_pf *pf)
>>>>
>>>> 		/* this loads the data into the hw struct for us */
>>>> 		err = i40e_aq_discover_capabilities(&pf->hw, cap_buf, buf_len,
>>>> -					    &data_size,
>>>> -					    i40e_aqc_opc_list_func_capabilities,
>>>> -					    NULL);
>>>> +						    &data_size, list_type,
>>>> +						    NULL);
>>>> 		/* data loaded, buffer no longer needed */
>>>> 		kfree(cap_buf);
>>>>
>>>> @@ -8080,26 +8904,44 @@ static int i40e_get_capabilities(struct i40e_pf *pf)
>>>> 		}
>>>> 	} while (err);
>>>>
>>>> -	if (pf->hw.debug_mask & I40E_DEBUG_USER)
>>>> -		dev_info(&pf->pdev->dev,
>>>> -			 "pf=%d, num_vfs=%d, msix_pf=%d, msix_vf=%d, fd_g=%d, fd_b=%d, pf_max_q=%d num_vsi=%d\n",
>>>> -			 pf->hw.pf_id, pf->hw.func_caps.num_vfs,
>>>> -			 pf->hw.func_caps.num_msix_vectors,
>>>> -			 pf->hw.func_caps.num_msix_vectors_vf,
>>>> -			 pf->hw.func_caps.fd_filters_guaranteed,
>>>> -			 pf->hw.func_caps.fd_filters_best_effort,
>>>> -			 pf->hw.func_caps.num_tx_qp,
>>>> -			 pf->hw.func_caps.num_vsis);
>>>> -
>>>> +	if (pf->hw.debug_mask & I40E_DEBUG_USER) {
>>>> +		if (list_type == i40e_aqc_opc_list_func_capabilities) {
>>>> +			dev_info(&pf->pdev->dev,
>>>> +				 "pf=%d, num_vfs=%d, msix_pf=%d, msix_vf=%d, fd_g=%d, fd_b=%d, pf_max_q=%d num_vsi=%d\n",
>>>> +				 pf->hw.pf_id, pf->hw.func_caps.num_vfs,
>>>> +				 pf->hw.func_caps.num_msix_vectors,
>>>> +				 pf->hw.func_caps.num_msix_vectors_vf,
>>>> +				 pf->hw.func_caps.fd_filters_guaranteed,
>>>> +				 pf->hw.func_caps.fd_filters_best_effort,
>>>> +				 pf->hw.func_caps.num_tx_qp,
>>>> +				 pf->hw.func_caps.num_vsis);
>>>> +		} else if (list_type == i40e_aqc_opc_list_dev_capabilities) {
>>>> +			dev_info(&pf->pdev->dev,
>>>> +				 "switch_mode=0x%04x, function_valid=0x%08x\n",
>>>> +				 pf->hw.dev_caps.switch_mode,
>>>> +				 pf->hw.dev_caps.valid_functions);
>>>> +			dev_info(&pf->pdev->dev,
>>>> +				 "SR-IOV=%d, num_vfs for all function=%u\n",
>>>> +				 pf->hw.dev_caps.sr_iov_1_1,
>>>> +				 pf->hw.dev_caps.num_vfs);
>>>> +			dev_info(&pf->pdev->dev,
>>>> +				 "num_vsis=%u, num_rx:%u, num_tx=%u\n",
>>>> +				 pf->hw.dev_caps.num_vsis,
>>>> +				 pf->hw.dev_caps.num_rx_qp,
>>>> +				 pf->hw.dev_caps.num_tx_qp);
>>>> +		}
>>>> +	}
>>>> +	if (list_type == i40e_aqc_opc_list_func_capabilities) {
>>>> #define DEF_NUM_VSI (1 + (pf->hw.func_caps.fcoe ? 1 : 0) \
>>>> 		       + pf->hw.func_caps.num_vfs)
>>>> -	if (pf->hw.revision_id == 0 && (DEF_NUM_VSI > pf->hw.func_caps.num_vsis)) {
>>>> -		dev_info(&pf->pdev->dev,
>>>> -			 "got num_vsis %d, setting num_vsis to %d\n",
>>>> -			 pf->hw.func_caps.num_vsis, DEF_NUM_VSI);
>>>> -		pf->hw.func_caps.num_vsis = DEF_NUM_VSI;
>>>> +		if (pf->hw.revision_id == 0 &&
>>>> +		    (pf->hw.func_caps.num_vsis < DEF_NUM_VSI)) {
>>>> +			dev_info(&pf->pdev->dev,
>>>> +				 "got num_vsis %d, setting num_vsis to %d\n",
>>>> +				 pf->hw.func_caps.num_vsis, DEF_NUM_VSI);
>>>> +			pf->hw.func_caps.num_vsis = DEF_NUM_VSI;
>>>> +		}
>>>> 	}
>>>> -
>>>> 	return 0;
>>>> }
>>>>
>>>> @@ -8141,6 +8983,7 @@ static void i40e_fdir_sb_setup(struct i40e_pf *pf)
>>>> 		if (!vsi) {
>>>> 			dev_info(&pf->pdev->dev, "Couldn't create FDir VSI\n");
>>>> 			pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
>>>> +			pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>> 			return;
>>>> 		}
>>>> 	}
>>>> @@ -8163,6 +9006,48 @@ static void i40e_fdir_teardown(struct i40e_pf *pf)
>>>> }
>>>>
>>>> /**
>>>> + * i40e_rebuild_cloud_filters - Rebuilds cloud filters for VSIs
>>>> + * @vsi: PF main vsi
>>>> + * @seid: seid of main or channel VSIs
>>>> + *
>>>> + * Rebuilds cloud filters associated with main VSI and channel VSIs if they
>>>> + * existed before reset
>>>> + **/
>>>> +static int i40e_rebuild_cloud_filters(struct i40e_vsi *vsi, u16 seid)
>>>> +{
>>>> +	struct i40e_cloud_filter *cfilter;
>>>> +	struct i40e_pf *pf = vsi->back;
>>>> +	struct hlist_node *node;
>>>> +	i40e_status ret;
>>>> +
>>>> +	/* Add cloud filters back if they exist */
>>>> +	if (hlist_empty(&pf->cloud_filter_list))
>>>> +		return 0;
>>>> +
>>>> +	hlist_for_each_entry_safe(cfilter, node, &pf->cloud_filter_list,
>>>> +				  cloud_node) {
>>>> +		if (cfilter->seid != seid)
>>>> +			continue;
>>>> +
>>>> +		if (cfilter->dst_port)
>>>> +			ret = i40e_add_del_cloud_filter_big_buf(vsi, cfilter,
>>>> +								true);
>>>> +		else
>>>> +			ret = i40e_add_del_cloud_filter(vsi, cfilter, true);
>>>> +
>>>> +		if (ret) {
>>>> +			dev_dbg(&pf->pdev->dev,
>>>> +				"Failed to rebuild cloud filter, err %s aq_err %s\n",
>>>> +				i40e_stat_str(&pf->hw, ret),
>>>> +				i40e_aq_str(&pf->hw,
>>>> +					    pf->hw.aq.asq_last_status));
>>>> +			return ret;
>>>> +		}
>>>> +	}
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/**
>>>>  * i40e_rebuild_channels - Rebuilds channel VSIs if they existed before reset
>>>>  * @vsi: PF main vsi
>>>>  *
>>>> @@ -8199,6 +9084,13 @@ static int i40e_rebuild_channels(struct i40e_vsi *vsi)
>>>> 						I40E_BW_CREDIT_DIVISOR,
>>>> 				ch->seid);
>>>> 		}
>>>> +		ret = i40e_rebuild_cloud_filters(vsi, ch->seid);
>>>> +		if (ret) {
>>>> +			dev_dbg(&vsi->back->pdev->dev,
>>>> +				"Failed to rebuild cloud filters for channel VSI %u\n",
>>>> +				ch->seid);
>>>> +			return ret;
>>>> +		}
>>>> 	}
>>>> 	return 0;
>>>> }
>>>> @@ -8365,7 +9257,7 @@ static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired)
>>>> 		i40e_verify_eeprom(pf);
>>>>
>>>> 	i40e_clear_pxe_mode(hw);
>>>> -	ret = i40e_get_capabilities(pf);
>>>> +	ret = i40e_get_capabilities(pf, i40e_aqc_opc_list_func_capabilities);
>>>> 	if (ret)
>>>> 		goto end_core_reset;
>>>>
>>>> @@ -8482,6 +9374,10 @@ static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired)
>>>> 			goto end_unlock;
>>>> 	}
>>>>
>>>> +	ret = i40e_rebuild_cloud_filters(vsi, vsi->seid);
>>>> +	if (ret)
>>>> +		goto end_unlock;
>>>> +
>>>> 	/* PF Main VSI is rebuild by now, go ahead and rebuild channel VSIs
>>>> 	 * for this main VSI if they exist
>>>> 	 */
>>>> @@ -9404,6 +10300,7 @@ static int i40e_init_msix(struct i40e_pf *pf)
>>>> 	    (pf->num_fdsb_msix == 0)) {
>>>> 		dev_info(&pf->pdev->dev, "Sideband Flowdir disabled, not enough MSI-X vectors\n");
>>>> 		pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
>>>> +		pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>> 	}
>>>> 	if ((pf->flags & I40E_FLAG_VMDQ_ENABLED) &&
>>>> 	    (pf->num_vmdq_msix == 0)) {
>>>> @@ -9521,6 +10418,7 @@ static int i40e_init_interrupt_scheme(struct i40e_pf *pf)
>>>> 				       I40E_FLAG_FD_SB_ENABLED	|
>>>> 				       I40E_FLAG_FD_ATR_ENABLED	|
>>>> 				       I40E_FLAG_VMDQ_ENABLED);
>>>> +			pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>>
>>>> 			/* rework the queue expectations without MSIX */
>>>> 			i40e_determine_queue_usage(pf);
>>>> @@ -10263,9 +11161,13 @@ bool i40e_set_ntuple(struct i40e_pf *pf, netdev_features_t features)
>>>> 		/* Enable filters and mark for reset */
>>>> 		if (!(pf->flags & I40E_FLAG_FD_SB_ENABLED))
>>>> 			need_reset = true;
>>>> -		/* enable FD_SB only if there is MSI-X vector */
>>>> -		if (pf->num_fdsb_msix > 0)
>>>> +		/* enable FD_SB only if there is MSI-X vector and no cloud
>>>> +		 * filters exist
>>>> +		 */
>>>> +		if (pf->num_fdsb_msix > 0 && !pf->num_cloud_filters) {
>>>> 			pf->flags |= I40E_FLAG_FD_SB_ENABLED;
>>>> +			pf->flags &= ~I40E_FLAG_FD_SB_INACTIVE;
>>>> +		}
>>>> 	} else {
>>>> 		/* turn off filters, mark for reset and clear SW filter list */
>>>> 		if (pf->flags & I40E_FLAG_FD_SB_ENABLED) {
>>>> @@ -10274,6 +11176,8 @@ bool i40e_set_ntuple(struct i40e_pf *pf, netdev_features_t features)
>>>> 		}
>>>> 		pf->flags &= ~(I40E_FLAG_FD_SB_ENABLED |
>>>> 			       I40E_FLAG_FD_SB_AUTO_DISABLED);
>>>> +		pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>> +
>>>> 		/* reset fd counters */
>>>> 		pf->fd_add_err = 0;
>>>> 		pf->fd_atr_cnt = 0;
>>>> @@ -10857,7 +11761,8 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
>>>> 		netdev->hw_features |= NETIF_F_NTUPLE;
>>>> 	hw_features = hw_enc_features		|
>>>> 		      NETIF_F_HW_VLAN_CTAG_TX	|
>>>> -		      NETIF_F_HW_VLAN_CTAG_RX;
>>>> +		      NETIF_F_HW_VLAN_CTAG_RX	|
>>>> +		      NETIF_F_HW_TC;
>>>>
>>>> 	netdev->hw_features |= hw_features;
>>>>
>>>> @@ -12159,8 +13064,10 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, bool reinit)
>>>> 	*/
>>>>
>>>> 	if ((pf->hw.pf_id == 0) &&
>>>> -	    !(pf->flags & I40E_FLAG_TRUE_PROMISC_SUPPORT))
>>>> +	    !(pf->flags & I40E_FLAG_TRUE_PROMISC_SUPPORT)) {
>>>> 		flags = I40E_AQ_SET_SWITCH_CFG_PROMISC;
>>>> +		pf->last_sw_conf_flags = flags;
>>>> +	}
>>>>
>>>> 	if (pf->hw.pf_id == 0) {
>>>> 		u16 valid_flags;
>>>> @@ -12176,6 +13083,7 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, bool reinit)
>>>> 					     pf->hw.aq.asq_last_status));
>>>> 			/* not a fatal problem, just keep going */
>>>> 		}
>>>> +		pf->last_sw_conf_valid_flags = valid_flags;
>>>> 	}
>>>>
>>>> 	/* first time setup */
>>>> @@ -12273,6 +13181,7 @@ static void i40e_determine_queue_usage(struct i40e_pf *pf)
>>>> 			       I40E_FLAG_DCB_ENABLED	|
>>>> 			       I40E_FLAG_SRIOV_ENABLED	|
>>>> 			       I40E_FLAG_VMDQ_ENABLED);
>>>> +		pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>> 	} else if (!(pf->flags & (I40E_FLAG_RSS_ENABLED |
>>>> 				  I40E_FLAG_FD_SB_ENABLED |
>>>> 				  I40E_FLAG_FD_ATR_ENABLED |
>>>> @@ -12287,6 +13196,7 @@ static void i40e_determine_queue_usage(struct i40e_pf *pf)
>>>> 			       I40E_FLAG_FD_ATR_ENABLED	|
>>>> 			       I40E_FLAG_DCB_ENABLED	|
>>>> 			       I40E_FLAG_VMDQ_ENABLED);
>>>> +		pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>> 	} else {
>>>> 		/* Not enough queues for all TCs */
>>>> 		if ((pf->flags & I40E_FLAG_DCB_CAPABLE) &&
>>>> @@ -12310,6 +13220,7 @@ static void i40e_determine_queue_usage(struct i40e_pf *pf)
>>>> 			queues_left -= 1; /* save 1 queue for FD */
>>>> 		} else {
>>>> 			pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
>>>> +			pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>> 			dev_info(&pf->pdev->dev, "not enough queues for Flow Director. Flow Director feature is disabled\n");
>>>> 		}
>>>> 	}
>>>> @@ -12613,7 +13524,7 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>>>> 		dev_warn(&pdev->dev, "This device is a pre-production adapter/LOM. Please be aware there may be issues with your hardware. If you are experiencing problems please contact your Intel or hardware representative who provided you with this hardware.\n");
>>>>
>>>> 	i40e_clear_pxe_mode(hw);
>>>> -	err = i40e_get_capabilities(pf);
>>>> +	err = i40e_get_capabilities(pf, i40e_aqc_opc_list_func_capabilities);
>>>> 	if (err)
>>>> 		goto err_adminq_setup;
>>>>
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_prototype.h b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
>>>> index 92869f5..3bb6659 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_prototype.h
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
>>>> @@ -283,6 +283,22 @@ i40e_status i40e_aq_query_switch_comp_bw_config(struct i40e_hw *hw,
>>>> 		struct i40e_asq_cmd_details *cmd_details);
>>>> i40e_status i40e_aq_resume_port_tx(struct i40e_hw *hw,
>>>> 				   struct i40e_asq_cmd_details *cmd_details);
>>>> +i40e_status
>>>> +i40e_aq_add_cloud_filters_bb(struct i40e_hw *hw, u16 seid,
>>>> +			     struct i40e_aqc_cloud_filters_element_bb *filters,
>>>> +			     u8 filter_count);
>>>> +enum i40e_status_code
>>>> +i40e_aq_add_cloud_filters(struct i40e_hw *hw, u16 vsi,
>>>> +			  struct i40e_aqc_cloud_filters_element_data *filters,
>>>> +			  u8 filter_count);
>>>> +enum i40e_status_code
>>>> +i40e_aq_rem_cloud_filters(struct i40e_hw *hw, u16 vsi,
>>>> +			  struct i40e_aqc_cloud_filters_element_data *filters,
>>>> +			  u8 filter_count);
>>>> +i40e_status
>>>> +i40e_aq_rem_cloud_filters_bb(struct i40e_hw *hw, u16 seid,
>>>> +			     struct i40e_aqc_cloud_filters_element_bb *filters,
>>>> +			     u8 filter_count);
>>>> i40e_status i40e_read_lldp_cfg(struct i40e_hw *hw,
>>>> 			       struct i40e_lldp_variables *lldp_cfg);
>>>> /* i40e_common */
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h b/drivers/net/ethernet/intel/i40e/i40e_type.h
>>>> index c019f46..af38881 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_type.h
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_type.h
>>>> @@ -287,6 +287,7 @@ struct i40e_hw_capabilities {
>>>> #define I40E_NVM_IMAGE_TYPE_MODE1	0x6
>>>> #define I40E_NVM_IMAGE_TYPE_MODE2	0x7
>>>> #define I40E_NVM_IMAGE_TYPE_MODE3	0x8
>>>> +#define I40E_SWITCH_MODE_MASK		0xF
>>>>
>>>> 	u32  management_mode;
>>>> 	u32  mng_protocols_over_mctp;
>>>> diff --git a/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
>>>> index b8c78bf..4fe27f0 100644
>>>> --- a/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
>>>> +++ b/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
>>>> @@ -1360,6 +1360,9 @@ struct i40e_aqc_cloud_filters_element_data {
>>>> 		struct {
>>>> 			u8 data[16];
>>>> 		} v6;
>>>> +		struct {
>>>> +			__le16 data[8];
>>>> +		} raw_v6;
>>>> 	} ipaddr;
>>>> 	__le16	flags;
>>>> #define I40E_AQC_ADD_CLOUD_FILTER_SHIFT			0
>>>>
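
The Flow Director sideband hand-off that the quoted hunks spread over several functions can be easier to follow in isolation. What follows is only a minimal userspace sketch, not driver code: the bit values, struct toy_pf and both toy_* helpers are hypothetical stand-ins, and the sketch folds the flag flip into the add path for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions; the real definitions live in the driver. */
#define FD_SB_ENABLED          (1u << 0)
#define FD_SB_INACTIVE         (1u << 1)
#define FD_SB_TO_CLOUD_FILTER  (1u << 2)

struct toy_pf {
	uint32_t flags;
	unsigned int num_cloud_filters;
};

/* Adding a cloud filter takes the hardware over from sideband and
 * remembers that sideband was on, so it can be handed back later.
 */
static void toy_add_cloud_filter(struct toy_pf *pf)
{
	assert(pf);
	if (pf->flags & FD_SB_ENABLED) {
		pf->flags &= ~FD_SB_ENABLED;
		pf->flags |= FD_SB_TO_CLOUD_FILTER;
	}
	pf->num_cloud_filters++;
}

/* Removing the last cloud filter hands sideband back, unless sideband
 * was switched off independently (FD_SB_INACTIVE) in the meantime.
 */
static void toy_del_cloud_filter(struct toy_pf *pf)
{
	assert(pf && pf->num_cloud_filters > 0);
	if (--pf->num_cloud_filters)
		return;
	if ((pf->flags & FD_SB_TO_CLOUD_FILTER) &&
	    !(pf->flags & FD_SB_INACTIVE)) {
		pf->flags |= FD_SB_ENABLED;
		pf->flags &= ~FD_SB_TO_CLOUD_FILTER;
	}
}
```

Read this way, clearing the INACTIVE bit inside a branch that already requires that bit to be clear appears to be a no-op, which may be worth double-checking in the quoted hunks.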

^ permalink raw reply

* Re: [pull request][net 00/11] Mellanox, mlx5 fixes 2017-09-28
From: David Miller @ 2017-09-29  5:22 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <20170928044132.30940-1-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Thu, 28 Sep 2017 07:41:21 +0300

> This series provides misc fixes for the mlx5 driver.
> 
> Please pull and let me know if there's any problem.

Pulled.

> for -stable:
>   net/mlx5e: IPoIB, Fix access to invalid memory address (Kernels >= 4.12)

Queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next v9] openvswitch: enable NSH support
From: Yang, Yi @ 2017-09-29  6:40 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: Jiri Benc, netdev@vger.kernel.org, dev@openvswitch.org, e@erig.me,
	davem@davemloft.net, Jan Scheurich
In-Reply-To: <CAOrHB_CQyokdTWeoj02RENPP5miq5Arx6goCNQ9ZPbUeTu_MeQ@mail.gmail.com>

On Fri, Sep 29, 2017 at 02:28:38AM +0800, Pravin Shelar wrote:
> On Tue, Sep 26, 2017 at 6:39 PM, Yang, Yi <yi.y.yang@intel.com> wrote:
> > On Tue, Sep 26, 2017 at 06:49:14PM +0800, Jiri Benc wrote:
> >> On Tue, 26 Sep 2017 12:55:39 +0800, Yang, Yi wrote:
> >> > After push_nsh, the packet won't be recirculated to the flow pipeline,
> >> > so key->eth.type must be set explicitly here. For pop_nsh, the packet
> >> > will be recirculated to the flow pipeline and reparsed, so
> >> > key->eth.type will be set in the packet parse function and we needn't
> >> > handle it in pop_nsh.
> >>
> >> This seems to be a very different approach than what we currently have.
> >> Looking at the code, the requirement after "destructive" actions such
> >> as pushing or popping headers is to recirculate.
> >
> > This is an optimization proposed by Jan Scheurich: recirculating after
> > push_nsh would hurt performance, while recirculating after pop_nsh is
> > unavoidable. So also cc jan.scheurich@ericsson.com.
> >
> > Actually, all the keys from before push_nsh are still there after push_nsh,
> > and push_nsh has updated all the nsh keys, so recirculating remains avoidable.
> >
> 
> 
> We should keep the existing model for this patch. Later you can submit
> an optimization patch with specific use cases and performance
> improvements, so that we can evaluate code complexity and benefits.

Ok, I'll remove the below line in push_nsh and send out v11, thanks.

	key->eth.type = htons(ETH_P_NSH);

> 
> >>
> >> Setting key->eth.type to satisfy conditions in the output path without
> >> updating the rest of the key looks very hacky and fragile to me. There
> >> might be other conditions and dependencies that are not obvious.
> >> I don't think the code was written with such a code path in mind.
> >>
> >> I'd like to hear what Pravin thinks about this.
> >>
> >>  Jiri

^ permalink raw reply

* Re: [Patch net-next] net_sched: use idr to allocate u32 filter handles
From: Simon Horman @ 2017-09-29  6:46 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers, Chris Mi, Jamal Hadi Salim
In-Reply-To: <CAM_iQpWbiA4QZb8j8fau7kMhzeLpRppEMcmB80byqTUryPheOw@mail.gmail.com>

On Thu, Sep 28, 2017 at 03:19:05PM -0700, Cong Wang wrote:
> On Thu, Sep 28, 2017 at 12:34 AM, Simon Horman
> <simon.horman@netronome.com> wrote:
> > Hi Cong,
> >
> > this looks like a nice enhancement to me. Did you measure any performance
> > benefit from it?  Perhaps it could be described in the changelog. I also
> > have a more detailed question below.
> 
> No, I was inspired by commit c15ab236d69d; I didn't measure it.

Perhaps it would be nice to note that in the changelog.

> >> ---
> >>  net/sched/cls_u32.c | 108 ++++++++++++++++++++++++++++++++--------------------
> >>  1 file changed, 67 insertions(+), 41 deletions(-)
> >>
> >> diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> >> index 10b8d851fc6b..316b8a791b13 100644
> >> --- a/net/sched/cls_u32.c
> >> +++ b/net/sched/cls_u32.c
> >> @@ -46,6 +46,7 @@
> >
> > ...
> >
> >> @@ -937,22 +940,33 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
> >>                       return -EINVAL;
> >>               if (TC_U32_KEY(handle))
> >>                       return -EINVAL;
> >> -             if (handle == 0) {
> >> -                     handle = gen_new_htid(tp->data);
> >> -                     if (handle == 0)
> >> -                             return -ENOMEM;
> >> -             }
> >>               ht = kzalloc(sizeof(*ht) + divisor*sizeof(void *), GFP_KERNEL);
> >>               if (ht == NULL)
> >>                       return -ENOBUFS;
> >> +             if (handle == 0) {
> >> +                     handle = gen_new_htid(tp->data, ht);
> >> +                     if (handle == 0) {
> >> +                             kfree(ht);
> >> +                             return -ENOMEM;
> >> +                     }
> >> +             } else {
> >> +                     err = idr_alloc_ext(&tp_c->handle_idr, ht, NULL,
> >> +                                         handle, handle + 1, GFP_KERNEL);
> >> +                     if (err) {
> >> +                             kfree(ht);
> >> +                             return err;
> >> +                     }
> >
> > The above seems to check that handle is not already in use and mark it as
> > in use. But I don't see that logic in the code prior to this patch.
> > Am I missing something? If not perhaps this portion should be a separate
> > patch or described in the changelog.
> 
> The logic is in the upper layer, tc_ctl_tfilter(). It tries to get a filter
> by handle (if non-zero), and errors out if we are creating a new filter
> with the same handle.
> 
> At the point you quote above, 'n' is already NULL and 'handle' is non-zero,
> which means no existing filter has the same handle, so it is safe to just
> mark it as in-use.

Thanks for the clarification, that seems fine to me.

Reviewed-by: Simon Horman <simon.horman@netronome.com>
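
For readers following the handle discussion, the pattern in the quoted hunk is "reserve exactly this ID, or fail if it is taken", expressed as an allocation over the one-element range [handle, handle + 1). The following is only a userspace sketch of that idea: struct toy_idr, toy_idr_alloc() and toy_claim_handle() are hypothetical names, the fixed-size table is an assumption made for brevity, and nothing here reflects how the kernel idr is actually implemented.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_IDS 64	/* hypothetical table size for the sketch */

struct toy_idr {
	void *slot[TOY_MAX_IDS];	/* NULL means the ID is free */
};

/* Reserve the lowest free ID in [start, end) for ptr.
 * Returns 0 (and stores the ID in *index, if given) on success,
 * -ENOSPC when every ID in the range is taken.
 */
static int toy_idr_alloc(struct toy_idr *idr, void *ptr, unsigned long *index,
			 unsigned long start, unsigned long end)
{
	unsigned long id;

	assert(idr);
	if (!ptr || end > TOY_MAX_IDS)
		return -EINVAL;
	for (id = start; id < end; id++) {
		if (!idr->slot[id]) {
			idr->slot[id] = ptr;
			if (index)
				*index = id;
			return 0;
		}
	}
	return -ENOSPC;
}

/* Claim one caller-chosen handle: the range has a single member, so the
 * call either reserves exactly that handle or reports it as in use.
 */
static int toy_claim_handle(struct toy_idr *idr, void *ptr,
			    unsigned long handle)
{
	return toy_idr_alloc(idr, ptr, NULL, handle, handle + 1);
}
```

With this shape, a duplicate handle is rejected by the allocator itself, independently of whatever check the upper layer has already done.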

^ permalink raw reply

* Re: [net-next PATCH 0/5] New bpf cpumap type for XDP_REDIRECT
From: Jesper Dangaard Brouer @ 2017-09-29  6:53 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, Jason Wang, mchan,
	John Fastabend, peter.waskiewicz.jr, Daniel Borkmann,
	Alexei Starovoitov, Andy Gospodarek, edumazet, brouer
In-Reply-To: <59CD7B94.8010103@iogearbox.net>

On Fri, 29 Sep 2017 00:45:40 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 09/28/2017 02:57 PM, Jesper Dangaard Brouer wrote:
> > Introducing a new way to redirect XDP frames.  Notice how no driver
> > changes are necessary given the design of XDP_REDIRECT.
> >
> > This redirect map type is called 'cpumap', as it allows redirecting
> > XDP frames to remote CPUs.  The remote CPU will do the SKB allocation
> > and start the network stack invocation on that CPU.
> >
> > This is a scalability and isolation mechanism that allows separating
> > the early driver network XDP layer from the rest of the netstack, and
> > assigning dedicated CPUs for this stage.  The sysadmin controls/configures
> > the RX-CPU to NIC-RX queue mapping (as usual) via procfs smp_affinity, and
> > how many queues are configured via ethtool --set-channels.  Benchmarks
> > show that a single CPU can handle approx 11Mpps.  Thus, assigning only
> > two NIC RX-queues (and two CPUs) is sufficient for handling 10Gbit/s
> > wirespeed smallest packet 14.88Mpps.  Reducing the number of queues
> > has the advantage that more packets are available in "bulk" per hard
> > interrupt[1].
> >
> > [1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
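
The 14.88Mpps figure and the "two CPUs" conclusion follow from simple arithmetic, which can be sketched as below. The helper names are hypothetical; the 84-byte on-wire cost assumes a 64-byte minimum frame plus 8 bytes of preamble/SFD and a 12-byte inter-frame gap, and the 11Mpps per-CPU rate is the benchmark number quoted above, not something this sketch measures.

```c
#include <assert.h>
#include <stdint.h>

/* On-wire cost of one minimum-size Ethernet frame, in bytes:
 * 64 (frame incl. FCS) + 8 (preamble and SFD) + 12 (inter-frame gap).
 */
#define MIN_FRAME_WIRE_BYTES (64u + 8u + 12u)

/* Maximum packets per second at a given link rate for minimum-size frames. */
static uint64_t max_pps(uint64_t link_bits_per_sec)
{
	return link_bits_per_sec / (MIN_FRAME_WIRE_BYTES * 8u);
}

/* CPUs needed when each CPU sustains per_cpu_pps; rounds up. */
static unsigned int cpus_needed(uint64_t total_pps, uint64_t per_cpu_pps)
{
	assert(per_cpu_pps > 0);
	return (unsigned int)((total_pps + per_cpu_pps - 1) / per_cpu_pps);
}
```

For a 10Gbit/s link, max_pps(10000000000ULL) gives 14880952, and cpus_needed() with 11000000 per CPU rounds up to 2.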
> >
> > Use-cases:
> >
> > 1. End-host based pre-filtering for DDoS mitigation.  This is fast
> >     enough to allow software to see and filter all packets at wirespeed.
> >     Thus, no packets get silently dropped by hardware.
> >
> > 2. Given that NIC HW can distribute packets unevenly across RX queues,
> >     this mechanism can be used for redistributing load across CPUs.  This
> >     usually happens when HW is unaware of a new protocol.  This
> >     resembles RPS (Receive Packet Steering), just faster, but with more
> >     responsibility placed on the BPF program for correct steering.
> >
> > 3. Auto-scaling or power saving via only activating the appropriate
> >     number of remote CPUs for handling the current load.  The cpumap
> >     tracepoints can function as a feedback loop for this purpose.  
> 
> Interesting work, thanks! Still digesting the code a bit. I think
> it pretty much goes into the direction that Eric describes in his
> netdev paper quoted above; not on a generic level though but specific
> to XDP at least; theoretically XDP could just run transparently on
> the CPU doing the filtering, and raw buffers are handed to remote
> CPU with similar batching, but it would need some different config
> interface at minimum.

Good that you noticed this is (implicitly) implementing RX bulking, which
is where much of the performance gain originates.

It is true, I am inspired by Eric's paper (I love it). Do notice that
this is not blocking or interfering with Eric's/others' continued work in
this area.  This implementation just shows that the section "break the
pipe!" idea works very well for XDP.

More on config knobs below.

> Shouldn't we take the CPU(s) running XDP on the RX queues out from
> the normal process scheduler, so that we have a guarantee that user
> space or unrelated kernel tasks cannot interfere with them anymore,
> and we could then turn them into busy polling eventually (e.g. as
> long as XDP is running there and once off could put them back into
> normal scheduling domain transparently)?

We should be careful not to invent networking config knobs that belong
to other parts of the kernel, like the scheduler.  We already have the
ability to control where IRQs land via procfs smp_affinity.  And if
you want CPU isolation, we can use the boot cmdline
"isolcpus" (hint: like DPDK recommends/uses for zero-loss configs).  It is
the userspace tool (or sysadmin) loading the XDP program that is
responsible for having configured the CPU smp_affinity alignment.

Making NAPI busy-poll is out of scope for this patchset. Someone
should work on this separately.  It would just help/improve this kind
of scheme.

I actually think it would be more relevant to put the "remote" CPUs
in the 'cpumap' into a separate scheduler group, to implement stuff
like auto-scaling and power-saving.

> What about RPS/RFS in the sense that once you punt them to remote
> CPU, could we reuse application locality information so they'd end
> up on the right CPU in the first place (w/o backlog detour), or is
> the intent to rather disable it and have some own orchestration
> with relation to the CPU map?

An advanced bpf orchestration could basically implement what you
describe, combined with a userspace-side tool that pins applications
(e.g. via taskset).  To know when a task can move between CPUs, you use the
tracepoints to see when the CPU queue is empty (hint: time_limit=true
and processed=0).
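
A minimal sketch of that check on the userspace side, assuming the tool
has already collected the two tracepoint fields into its own records.
The struct, the function names and the "last N samples" policy are all
hypothetical, not part of the patchset:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace record holding the two tracepoint fields
 * named in the email; this is not a kernel structure.
 */
struct cpumap_sample {
	bool time_limit;
	unsigned int processed;
};

/* One sample hints at an empty queue when time_limit is set and
 * nothing was processed.
 */
bool sample_is_idle(const struct cpumap_sample *s)
{
	return s->time_limit && s->processed == 0;
}

/* Assumed policy: only treat the CPU queue as drained when the most
 * recent `need` samples were all idle.
 */
bool queue_is_drained(const struct cpumap_sample *samples, size_t count,
		      size_t need)
{
	size_t i;

	assert(samples != NULL || count == 0);
	if (need == 0 || count < need)
		return false;
	for (i = count - need; i < count; i++)
		if (!sample_is_idle(&samples[i]))
			return false;
	return true;
}
```

Requiring several idle samples in a row is just one way to avoid moving
a task on a single quiet interval.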

For now, I'm not targeting such advanced use-cases.  My main target is
a customer that has double-tagged VLANs, and ixgbe cannot RSS
distribute these, thus they all end up on queue 0.  And as I
demonstrated (in another email), RPS is too slow to fix this.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [net-next:master 332/339] kernel//bpf/syscall.c:1404:23: warning: cast to pointer from integer of different size
From: kbuild test robot @ 2017-09-29  6:54 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: kbuild-all, netdev

[-- Attachment #1: Type: text/plain, Size: 3572 bytes --]

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   fa8fefaa678ea390b873195d19c09930da84a4bb
commit: cb4d2b3f03d8eed90be3a194e5b54b734ec4bbe9 [332/339] bpf: Add name, load_time, uid and map_ids to bpf_prog_info
config: blackfin-allmodconfig (attached as .config)
compiler: bfin-uclinux-gcc (GCC) 6.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        git checkout cb4d2b3f03d8eed90be3a194e5b54b734ec4bbe9
        # save the attached .config to linux build tree
        make.cross ARCH=blackfin 

All warnings (new ones prefixed by >>):

   kernel//bpf/syscall.c: In function 'bpf_prog_get_info_by_fd':
>> kernel//bpf/syscall.c:1404:23: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
      u32 *user_map_ids = (u32 *)info.map_ids;
                          ^

vim +1404 kernel//bpf/syscall.c

  1371	
  1372	static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
  1373					   const union bpf_attr *attr,
  1374					   union bpf_attr __user *uattr)
  1375	{
  1376		struct bpf_prog_info __user *uinfo = u64_to_user_ptr(attr->info.info);
  1377		struct bpf_prog_info info = {};
  1378		u32 info_len = attr->info.info_len;
  1379		char __user *uinsns;
  1380		u32 ulen;
  1381		int err;
  1382	
  1383		err = check_uarg_tail_zero(uinfo, sizeof(info), info_len);
  1384		if (err)
  1385			return err;
  1386		info_len = min_t(u32, sizeof(info), info_len);
  1387	
  1388		if (copy_from_user(&info, uinfo, info_len))
  1389			return -EFAULT;
  1390	
  1391		info.type = prog->type;
  1392		info.id = prog->aux->id;
  1393		info.load_time = prog->aux->load_time;
  1394		info.created_by_uid = from_kuid_munged(current_user_ns(),
  1395						       prog->aux->user->uid);
  1396	
  1397		memcpy(info.tag, prog->tag, sizeof(prog->tag));
  1398		memcpy(info.name, prog->aux->name, sizeof(prog->aux->name));
  1399	
  1400		ulen = info.nr_map_ids;
  1401		info.nr_map_ids = prog->aux->used_map_cnt;
  1402		ulen = min_t(u32, info.nr_map_ids, ulen);
  1403		if (ulen) {
> 1404			u32 *user_map_ids = (u32 *)info.map_ids;
  1405			u32 i;
  1406	
  1407			for (i = 0; i < ulen; i++)
  1408				if (put_user(prog->aux->used_maps[i]->id,
  1409					     &user_map_ids[i]))
  1410					return -EFAULT;
  1411		}
  1412	
  1413		if (!capable(CAP_SYS_ADMIN)) {
  1414			info.jited_prog_len = 0;
  1415			info.xlated_prog_len = 0;
  1416			goto done;
  1417		}
  1418	
  1419		ulen = info.jited_prog_len;
  1420		info.jited_prog_len = prog->jited_len;
  1421		if (info.jited_prog_len && ulen) {
  1422			uinsns = u64_to_user_ptr(info.jited_prog_insns);
  1423			ulen = min_t(u32, info.jited_prog_len, ulen);
  1424			if (copy_to_user(uinsns, prog->bpf_func, ulen))
  1425				return -EFAULT;
  1426		}
  1427	
  1428		ulen = info.xlated_prog_len;
  1429		info.xlated_prog_len = bpf_prog_insn_size(prog);
  1430		if (info.xlated_prog_len && ulen) {
  1431			uinsns = u64_to_user_ptr(info.xlated_prog_insns);
  1432			ulen = min_t(u32, info.xlated_prog_len, ulen);
  1433			if (copy_to_user(uinsns, prog->insnsi, ulen))
  1434				return -EFAULT;
  1435		}
  1436	
  1437	done:
  1438		if (copy_to_user(uinfo, &info, info_len) ||
  1439		    put_user(info_len, &uattr->info.info_len))
  1440			return -EFAULT;
  1441	
  1442		return 0;
  1443	}
  1444	
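
The warning at line 1404 comes from casting a 64-bit integer field
directly to a pointer on a target with 32-bit pointers.  The listing
already uses u64_to_user_ptr() for the same job at line 1376.  The
sketch below shows the underlying idea in plain C: go through uintptr_t
so the integer is narrowed to pointer width explicitly.  The function
names are made up for illustration and this is not the kernel helper
itself:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: turn a 64-bit value carrying an address into a
 * pointer.  The intermediate uintptr_t cast makes the width change
 * explicit, which avoids -Wint-to-pointer-cast on 32-bit targets.
 */
void *u64_to_ptr(uint64_t addr)
{
	/* Only meaningful if the value fits in a pointer on this target. */
	assert(addr <= UINTPTR_MAX);
	return (void *)(uintptr_t)addr;
}

/* Hypothetical inverse: widen a pointer into a 64-bit field. */
uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}
```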

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 45901 bytes --]

^ permalink raw reply

* Re: net: macb: fail when there's no PHY
From: Harini Katakam @ 2017-09-29  7:05 UTC (permalink / raw)
  To: Brandon Streiff; +Cc: Grant Edwards, Florian Fainelli, netdev
In-Reply-To: <1506029751-25249-1-git-send-email-brandon.streiff@ni.com>

Hi Brandon,

On Fri, Sep 22, 2017 at 3:05 AM, Brandon Streiff <brandon.streiff@ni.com> wrote:
>> On Thu, Sep 21, 2017 at 01:05:57PM -0700, Florian Fainelli wrote:
<snip>
>
> I have a board that's in a similar boat. My workaround was to undo
> portions of dacdbb4dfc1a with the following patch; this lets me still
> use fixed-link and have MDIO (to configure a switch), but not require
> a PHY.
>
> There was a patch set last year by Harini Katakam ("net: macb: Add MDIO
> driver for accessing multiple PHY devices") that might ultimately be a
> better approach to tackling this problem, although I haven't seen any
> further chatter on it.

That patch was not backward compatible and I was trying to find a better
solution for a common MDIO bus.
I plan to send a new series next month and work on the review comments.

Regards,
Harini

^ permalink raw reply

* [PATCH net-next v11] openvswitch: enable NSH support
From: Yi Yang @ 2017-09-29  7:03 UTC (permalink / raw)
  To: netdev; +Cc: dev, jbenc, e, davem, pshelar, Yi Yang

v10->v11
 - Fix the three remaining disputed comments from v9
   that were not fixed in v10.

v9->v10
 - Change struct ovs_key_nsh to
       struct ovs_nsh_key_base base;
       __be32 context[NSH_MD1_CONTEXT_SIZE];
 - Fix new comments for v9

v8->v9
 - Fix build error reported by daily intel build
   because nsh module isn't selected by openvswitch

v7->v8
 - Rework nested value and mask for OVS_KEY_ATTR_NSH
 - Change pop_nsh to adapt to nsh kernel module
 - Fix many issues per comments from Jiri Benc

v6->v7
 - Remove NSH GSO patches in v6 because Jiri Benc
   reworked it as another patch series and they have
   been merged.
 - Change it to adapt to nsh kernel module added by NSH
   GSO patch series

v5->v6
 - Fix the remaining comments for v4.
 - Add NSH GSO support for VxLAN-gpe + NSH and
   Eth + NSH.

v4->v5
 - Fix many comments by Jiri Benc and Eric Garver
   for v4.

v3->v4
 - Add new NSH match field ttl
 - Update NSH header to the latest format
   which will be final format and won't change
   per its author's confirmation.
 - Fix comments for v3.

v2->v3
 - Change OVS_KEY_ATTR_NSH to nested key to handle
   fixed-length attributes and variable-length
   attributes more flexibly.
 - Remove struct ovs_action_push_nsh completely
 - Add code to handle nested attribute for SET_MASKED
 - Change PUSH_NSH to use the nested OVS_KEY_ATTR_NSH
   to transfer NSH header data.
 - Fix comments and coding style issues by Jiri and Eric

v1->v2
 - Change encap_nsh and decap_nsh to push_nsh and pop_nsh
 - Dynamically allocate struct ovs_action_push_nsh for
   length-variable metadata.

The OVS master and 2.8 branches have merged the NSH userspace
patch series; this patch enables NSH support
in the kernel data path so that OVS can support
NSH in compat mode by porting this.

Signed-off-by: Yi Yang <yi.y.yang@intel.com>
---
 include/net/nsh.h                |   3 +
 include/uapi/linux/openvswitch.h |  29 ++++
 net/nsh/nsh.c                    |  59 ++++++++
 net/openvswitch/Kconfig          |   1 +
 net/openvswitch/actions.c        | 109 ++++++++++++++
 net/openvswitch/flow.c           |  51 +++++++
 net/openvswitch/flow.h           |   7 +
 net/openvswitch/flow_netlink.c   | 317 ++++++++++++++++++++++++++++++++++++++-
 net/openvswitch/flow_netlink.h   |   5 +
 9 files changed, 580 insertions(+), 1 deletion(-)

diff --git a/include/net/nsh.h b/include/net/nsh.h
index a1eaea2..c03b089 100644
--- a/include/net/nsh.h
+++ b/include/net/nsh.h
@@ -304,4 +304,7 @@ static inline void nsh_set_flags_ttl_len(struct nshhdr *nsh, u8 flags,
 			NSH_FLAGS_MASK | NSH_TTL_MASK | NSH_LEN_MASK);
 }
 
+int skb_push_nsh(struct sk_buff *skb, const struct nshhdr *nh);
+int skb_pop_nsh(struct sk_buff *skb);
+
 #endif /* __NET_NSH_H */
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 156ee4c..c1a785c 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -333,6 +333,7 @@ enum ovs_key_attr {
 	OVS_KEY_ATTR_CT_LABELS,	/* 16-octet connection tracking label */
 	OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4,   /* struct ovs_key_ct_tuple_ipv4 */
 	OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6,   /* struct ovs_key_ct_tuple_ipv6 */
+	OVS_KEY_ATTR_NSH,       /* Nested set of ovs_nsh_key_* */
 
 #ifdef __KERNEL__
 	OVS_KEY_ATTR_TUNNEL_INFO,  /* struct ip_tunnel_info */
@@ -491,6 +492,30 @@ struct ovs_key_ct_tuple_ipv6 {
 	__u8   ipv6_proto;
 };
 
+enum ovs_nsh_key_attr {
+	OVS_NSH_KEY_ATTR_UNSPEC,
+	OVS_NSH_KEY_ATTR_BASE,  /* struct ovs_nsh_key_base. */
+	OVS_NSH_KEY_ATTR_MD1,   /* struct ovs_nsh_key_md1. */
+	OVS_NSH_KEY_ATTR_MD2,   /* variable-length octets for MD type 2. */
+	__OVS_NSH_KEY_ATTR_MAX
+};
+
+#define OVS_NSH_KEY_ATTR_MAX (__OVS_NSH_KEY_ATTR_MAX - 1)
+
+struct ovs_nsh_key_base {
+	__u8 flags;
+	__u8 ttl;
+	__u8 mdtype;
+	__u8 np;
+	__be32 path_hdr;
+};
+
+#define NSH_MD1_CONTEXT_SIZE 4
+
+struct ovs_nsh_key_md1 {
+	__be32 context[NSH_MD1_CONTEXT_SIZE];
+};
+
 /**
  * enum ovs_flow_attr - attributes for %OVS_FLOW_* commands.
  * @OVS_FLOW_ATTR_KEY: Nested %OVS_KEY_ATTR_* attributes specifying the flow
@@ -806,6 +831,8 @@ struct ovs_action_push_eth {
  * packet.
  * @OVS_ACTION_ATTR_POP_ETH: Pop the outermost Ethernet header off the
  * packet.
+ * @OVS_ACTION_ATTR_PUSH_NSH: push NSH header to the packet.
+ * @OVS_ACTION_ATTR_POP_NSH: pop the outermost NSH header off the packet.
  *
  * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not all
  * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
@@ -835,6 +862,8 @@ enum ovs_action_attr {
 	OVS_ACTION_ATTR_TRUNC,        /* u32 struct ovs_action_trunc. */
 	OVS_ACTION_ATTR_PUSH_ETH,     /* struct ovs_action_push_eth. */
 	OVS_ACTION_ATTR_POP_ETH,      /* No argument. */
+	OVS_ACTION_ATTR_PUSH_NSH,     /* Nested OVS_NSH_KEY_ATTR_*. */
+	OVS_ACTION_ATTR_POP_NSH,      /* No argument. */
 
 	__OVS_ACTION_ATTR_MAX,	      /* Nothing past this will be accepted
 				       * from userspace. */
diff --git a/net/nsh/nsh.c b/net/nsh/nsh.c
index 58fb827..5e4f937 100644
--- a/net/nsh/nsh.c
+++ b/net/nsh/nsh.c
@@ -14,6 +14,65 @@
 #include <net/nsh.h>
 #include <net/tun_proto.h>
 
+int skb_push_nsh(struct sk_buff *skb, const struct nshhdr *src_nsh_hdr)
+{
+	struct nshhdr *nsh_hdr;
+	size_t length = nsh_hdr_len(src_nsh_hdr);
+	u8 next_proto;
+
+	if (skb->mac_len) {
+		next_proto = TUN_P_ETHERNET;
+	} else {
+		next_proto = tun_p_from_eth_p(skb->protocol);
+		if (!next_proto)
+			return -EAFNOSUPPORT;
+	}
+
+	/* Add the NSH header */
+	if (skb_cow_head(skb, length) < 0)
+		return -ENOMEM;
+
+	skb_push(skb, length);
+	nsh_hdr = (struct nshhdr *)(skb->data);
+	memcpy(nsh_hdr, src_nsh_hdr, length);
+	nsh_hdr->np = next_proto;
+
+	skb->protocol = htons(ETH_P_NSH);
+	skb_reset_mac_header(skb);
+	skb_reset_network_header(skb);
+	skb_reset_mac_len(skb);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(skb_push_nsh);
+
+int skb_pop_nsh(struct sk_buff *skb)
+{
+	int err;
+	struct nshhdr *nsh_hdr = (struct nshhdr *)(skb->data);
+	size_t length;
+	__be16 inner_proto;
+
+	err = skb_ensure_writable(skb, skb_network_offset(skb) +
+				       sizeof(struct nshhdr));
+	if (unlikely(err))
+		return err;
+
+	inner_proto = tun_p_to_eth_p(nsh_hdr->np);
+	if (!inner_proto)
+		return -EAFNOSUPPORT;
+
+	length = nsh_hdr_len(nsh_hdr);
+	skb_pull(skb, length);
+	skb_reset_mac_header(skb);
+	skb_reset_network_header(skb);
+	skb_reset_mac_len(skb);
+	skb->protocol = inner_proto;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(skb_pop_nsh);
+
 static struct sk_buff *nsh_gso_segment(struct sk_buff *skb,
 				       netdev_features_t features)
 {
diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index ce94729..2650205 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -14,6 +14,7 @@ config OPENVSWITCH
 	select MPLS
 	select NET_MPLS_GSO
 	select DST_CACHE
+	select NET_NSH
 	---help---
 	  Open vSwitch is a multilayer Ethernet switch targeted at virtualized
 	  environments.  In addition to supporting a variety of features
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index a54a556..27b579b7 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -43,6 +43,7 @@
 #include "flow.h"
 #include "conntrack.h"
 #include "vport.h"
+#include "flow_netlink.h"
 
 struct deferred_action {
 	struct sk_buff *skb;
@@ -380,6 +381,43 @@ static int push_eth(struct sk_buff *skb, struct sw_flow_key *key,
 	return 0;
 }
 
+static int push_nsh(struct sk_buff *skb, struct sw_flow_key *key,
+		    const struct nshhdr *nh)
+{
+	int err;
+
+	err = skb_push_nsh(skb, nh);
+	if (err)
+		return err;
+
+	/* safe right before invalidate_flow_key */
+	key->mac_proto = MAC_PROTO_NONE;
+	invalidate_flow_key(key);
+	return 0;
+}
+
+static int pop_nsh(struct sk_buff *skb, struct sw_flow_key *key)
+{
+	int err;
+
+	if (ovs_key_mac_proto(key) != MAC_PROTO_NONE ||
+	    skb->protocol != htons(ETH_P_NSH)) {
+		return -EINVAL;
+	}
+
+	err = skb_pop_nsh(skb);
+	if (err)
+		return err;
+
+	/* safe right before invalidate_flow_key */
+	if (skb->protocol == htons(ETH_P_TEB))
+		key->mac_proto = MAC_PROTO_ETHERNET;
+	else
+		key->mac_proto = MAC_PROTO_NONE;
+	invalidate_flow_key(key);
+	return 0;
+}
+
 static void update_ip_l4_checksum(struct sk_buff *skb, struct iphdr *nh,
 				  __be32 addr, __be32 new_addr)
 {
@@ -602,6 +640,59 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
 	return 0;
 }
 
+static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
+		   const struct nlattr *a)
+{
+	struct nshhdr *nh;
+	int err;
+	u8 flags;
+	u8 ttl;
+	int i;
+
+	struct ovs_key_nsh key;
+	struct ovs_key_nsh mask;
+
+	err = nsh_key_from_nlattr(a, &key, &mask);
+	if (err)
+		return err;
+
+	err = skb_ensure_writable(skb, skb_network_offset(skb) +
+				       sizeof(struct nshhdr));
+	if (unlikely(err))
+		return err;
+
+	nh = nsh_hdr(skb);
+
+	flags = nsh_get_flags(nh);
+	flags = OVS_MASKED(flags, key.base.flags, mask.base.flags);
+	flow_key->nsh.base.flags = flags;
+	ttl = nsh_get_ttl(nh);
+	ttl = OVS_MASKED(ttl, key.base.ttl, mask.base.ttl);
+	flow_key->nsh.base.ttl = ttl;
+	nsh_set_flags_and_ttl(nh, flags, ttl);
+	nh->path_hdr = OVS_MASKED(nh->path_hdr, key.base.path_hdr,
+				  mask.base.path_hdr);
+	flow_key->nsh.base.path_hdr = nh->path_hdr;
+	switch (nh->mdtype) {
+	case NSH_M_TYPE1:
+		for (i = 0; i < NSH_MD1_CONTEXT_SIZE; i++) {
+			nh->md1.context[i] =
+			    OVS_MASKED(nh->md1.context[i], key.context[i],
+				       mask.context[i]);
+		}
+		memcpy(flow_key->nsh.context, nh->md1.context,
+		       sizeof(nh->md1.context));
+		break;
+	case NSH_M_TYPE2:
+		memset(flow_key->nsh.context, 0,
+		       sizeof(flow_key->nsh.context));
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
 /* Must follow skb_ensure_writable() since that can move the skb data. */
 static void set_tp_port(struct sk_buff *skb, __be16 *port,
 			__be16 new_port, __sum16 *check)
@@ -1024,6 +1115,10 @@ static int execute_masked_set_action(struct sk_buff *skb,
 				   get_mask(a, struct ovs_key_ethernet *));
 		break;
 
+	case OVS_KEY_ATTR_NSH:
+		err = set_nsh(skb, flow_key, a);
+		break;
+
 	case OVS_KEY_ATTR_IPV4:
 		err = set_ipv4(skb, flow_key, nla_data(a),
 			       get_mask(a, struct ovs_key_ipv4 *));
@@ -1210,6 +1305,20 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 		case OVS_ACTION_ATTR_POP_ETH:
 			err = pop_eth(skb, key);
 			break;
+
+		case OVS_ACTION_ATTR_PUSH_NSH: {
+			u8 buffer[NSH_HDR_MAX_LEN];
+			struct nshhdr *nh = (struct nshhdr *)buffer;
+
+			nsh_hdr_from_nlattr(nla_data(a), nh,
+					    NSH_HDR_MAX_LEN);
+			err = push_nsh(skb, key, (const struct nshhdr *)nh);
+			break;
+		}
+
+		case OVS_ACTION_ATTR_POP_NSH:
+			err = pop_nsh(skb, key);
+			break;
 		}
 
 		if (unlikely(err)) {
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 8c94cef..5970805 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -46,6 +46,7 @@
 #include <net/ipv6.h>
 #include <net/mpls.h>
 #include <net/ndisc.h>
+#include <net/nsh.h>
 
 #include "conntrack.h"
 #include "datapath.h"
@@ -490,6 +491,52 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
 	return 0;
 }
 
+static int parse_nsh(struct sk_buff *skb, struct sw_flow_key *key)
+{
+	struct nshhdr *nh;
+	unsigned int nh_ofs = skb_network_offset(skb);
+	u8 version, length;
+	int err;
+
+	err = check_header(skb, nh_ofs + NSH_BASE_HDR_LEN);
+	if (unlikely(err))
+		return err;
+
+	nh = nsh_hdr(skb);
+	version = nsh_get_ver(nh);
+	length = nsh_hdr_len(nh);
+
+	if (version != 0)
+		return -EINVAL;
+
+	err = check_header(skb, nh_ofs + length);
+	if (unlikely(err))
+		return err;
+
+	nh = (struct nshhdr *)skb_network_header(skb);
+	key->nsh.base.flags = nsh_get_flags(nh);
+	key->nsh.base.ttl = nsh_get_ttl(nh);
+	key->nsh.base.mdtype = nh->mdtype;
+	key->nsh.base.np = nh->np;
+	key->nsh.base.path_hdr = nh->path_hdr;
+	switch (key->nsh.base.mdtype) {
+	case NSH_M_TYPE1:
+		if (length != NSH_M_TYPE1_LEN)
+			return -EINVAL;
+		memcpy(key->nsh.context, nh->md1.context,
+		       sizeof(nh->md1));
+		break;
+	case NSH_M_TYPE2:
+		memset(key->nsh.context, 0,
+		       sizeof(nh->md1));
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 /**
  * key_extract - extracts a flow key from an Ethernet frame.
  * @skb: sk_buff that contains the frame, with skb->data pointing to the
@@ -735,6 +782,10 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
 				memset(&key->tp, 0, sizeof(key->tp));
 			}
 		}
+	} else if (key->eth.type == htons(ETH_P_NSH)) {
+		error = parse_nsh(skb, key);
+		if (error)
+			return error;
 	}
 	return 0;
 }
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 1875bba..8eeae749 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -35,6 +35,7 @@
 #include <net/inet_ecn.h>
 #include <net/ip_tunnels.h>
 #include <net/dst_metadata.h>
+#include <net/nsh.h>
 
 struct sk_buff;
 
@@ -66,6 +67,11 @@ struct vlan_head {
 	(offsetof(struct sw_flow_key, recirc_id) +	\
 	FIELD_SIZEOF(struct sw_flow_key, recirc_id))
 
+struct ovs_key_nsh {
+	struct ovs_nsh_key_base base;
+	__be32 context[NSH_MD1_CONTEXT_SIZE];
+};
+
 struct sw_flow_key {
 	u8 tun_opts[IP_TUNNEL_OPTS_MAX];
 	u8 tun_opts_len;
@@ -144,6 +150,7 @@ struct sw_flow_key {
 			};
 		} ipv6;
 	};
+	struct ovs_key_nsh nsh;         /* network service header */
 	struct {
 		/* Connection tracking fields not packed above. */
 		struct {
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index e8eb427..77613f4 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -48,6 +48,7 @@
 #include <net/ndisc.h>
 #include <net/mpls.h>
 #include <net/vxlan.h>
+#include <net/tun_proto.h>
 
 #include "flow_netlink.h"
 
@@ -78,9 +79,11 @@ static bool actions_may_change_flow(const struct nlattr *actions)
 		case OVS_ACTION_ATTR_HASH:
 		case OVS_ACTION_ATTR_POP_ETH:
 		case OVS_ACTION_ATTR_POP_MPLS:
+		case OVS_ACTION_ATTR_POP_NSH:
 		case OVS_ACTION_ATTR_POP_VLAN:
 		case OVS_ACTION_ATTR_PUSH_ETH:
 		case OVS_ACTION_ATTR_PUSH_MPLS:
+		case OVS_ACTION_ATTR_PUSH_NSH:
 		case OVS_ACTION_ATTR_PUSH_VLAN:
 		case OVS_ACTION_ATTR_SAMPLE:
 		case OVS_ACTION_ATTR_SET:
@@ -322,12 +325,27 @@ size_t ovs_tun_key_attr_size(void)
 		+ nla_total_size(2);   /* OVS_TUNNEL_KEY_ATTR_TP_DST */
 }
 
+size_t ovs_nsh_key_attr_size(void)
+{
+	/* Whenever adding new OVS_NSH_KEY_ FIELDS, we should consider
+	 * updating this function.
+	 */
+	return  nla_total_size(NSH_BASE_HDR_LEN) /* OVS_NSH_KEY_ATTR_BASE */
+		/* OVS_NSH_KEY_ATTR_MD1 and OVS_NSH_KEY_ATTR_MD2 are
+		 * mutually exclusive, so the bigger one can cover
+		 * the small one.
+		 *
+		 * OVS_NSH_KEY_ATTR_MD2
+		 */
+		+ nla_total_size(NSH_CTX_HDRS_MAX_LEN);
+}
+
 size_t ovs_key_attr_size(void)
 {
 	/* Whenever adding new OVS_KEY_ FIELDS, we should consider
 	 * updating this function.
 	 */
-	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 28);
+	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 29);
 
 	return    nla_total_size(4)   /* OVS_KEY_ATTR_PRIORITY */
 		+ nla_total_size(0)   /* OVS_KEY_ATTR_TUNNEL */
@@ -341,6 +359,8 @@ size_t ovs_key_attr_size(void)
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_CT_MARK */
 		+ nla_total_size(16)  /* OVS_KEY_ATTR_CT_LABELS */
 		+ nla_total_size(40)  /* OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6 */
+		+ nla_total_size(0)   /* OVS_KEY_ATTR_NSH */
+		  + ovs_nsh_key_attr_size()
 		+ nla_total_size(12)  /* OVS_KEY_ATTR_ETHERNET */
 		+ nla_total_size(2)   /* OVS_KEY_ATTR_ETHERTYPE */
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_VLAN */
@@ -373,6 +393,13 @@ static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
 	[OVS_TUNNEL_KEY_ATTR_IPV6_DST]      = { .len = sizeof(struct in6_addr) },
 };
 
+static const struct ovs_len_tbl
+ovs_nsh_key_attr_lens[OVS_NSH_KEY_ATTR_MAX + 1] = {
+	[OVS_NSH_KEY_ATTR_BASE] = { .len = sizeof(struct ovs_nsh_key_base) },
+	[OVS_NSH_KEY_ATTR_MD1]  = { .len = sizeof(struct ovs_nsh_key_md1) },
+	[OVS_NSH_KEY_ATTR_MD2]  = { .len = OVS_ATTR_VARIABLE },
+};
+
 /* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute.  */
 static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 	[OVS_KEY_ATTR_ENCAP]	 = { .len = OVS_ATTR_NESTED },
@@ -405,6 +432,8 @@ static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 		.len = sizeof(struct ovs_key_ct_tuple_ipv4) },
 	[OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6] = {
 		.len = sizeof(struct ovs_key_ct_tuple_ipv6) },
+	[OVS_KEY_ATTR_NSH]       = { .len = OVS_ATTR_NESTED,
+				     .next = ovs_nsh_key_attr_lens, },
 };
 
 static bool check_attr_len(unsigned int attr_len, unsigned int expected_len)
@@ -1179,6 +1208,221 @@ static int metadata_from_nlattrs(struct net *net, struct sw_flow_match *match,
 	return 0;
 }
 
+int nsh_hdr_from_nlattr(const struct nlattr *attr,
+			struct nshhdr *nh, size_t size)
+{
+	struct nlattr *a;
+	int rem;
+	u8 flags = 0;
+	u8 ttl = 0;
+	int mdlen = 0;
+
+	/* validate_nsh has checked this, so we needn't check again here
+	 */
+	nla_for_each_nested(a, attr, rem) {
+		int type = nla_type(a);
+
+		switch (type) {
+		case OVS_NSH_KEY_ATTR_BASE: {
+			const struct ovs_nsh_key_base *base = nla_data(a);
+
+			flags = base->flags;
+			ttl = base->ttl;
+			nh->np = base->np;
+			nh->mdtype = base->mdtype;
+			nh->path_hdr = base->path_hdr;
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD1: {
+			const struct ovs_nsh_key_md1 *md1 = nla_data(a);
+
+			mdlen = nla_len(a);
+			memcpy(&nh->md1, md1, mdlen);
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD2: {
+			const u8 *md2 = nla_data(a);
+
+			mdlen = nla_len(a);
+			memcpy(&nh->md2, md2, mdlen);
+			break;
+		}
+		default:
+			return -EINVAL;
+		}
+	}
+
+	/* nsh header length  = NSH_BASE_HDR_LEN + mdlen */
+	nh->ver_flags_ttl_len = 0;
+	nsh_set_flags_ttl_len(nh, flags, ttl, NSH_BASE_HDR_LEN + mdlen);
+
+	return 0;
+}
+
+int nsh_key_from_nlattr(const struct nlattr *attr,
+			struct ovs_key_nsh *nsh, struct ovs_key_nsh *nsh_mask)
+{
+	struct nlattr *a;
+	int rem;
+
+	/* validate_nsh has checked this, so we needn't check again here
+	 */
+	nla_for_each_nested(a, attr, rem) {
+		int type = nla_type(a);
+
+		switch (type) {
+		case OVS_NSH_KEY_ATTR_BASE: {
+			const struct ovs_nsh_key_base *base = nla_data(a);
+			const struct ovs_nsh_key_base *base_mask = base + 1;
+
+			nsh->base = *base;
+			nsh_mask->base = *base_mask;
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD1: {
+			const struct ovs_nsh_key_md1 *md1 =
+				(struct ovs_nsh_key_md1 *)nla_data(a);
+			const struct ovs_nsh_key_md1 *md1_mask = md1 + 1;
+
+			memcpy(nsh->context, md1->context, sizeof(*md1));
+			memcpy(nsh_mask->context, md1_mask->context,
+			       sizeof(*md1_mask));
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD2:
+			/* Not supported yet */
+			return -ENOTSUPP;
+		default:
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int nsh_key_put_from_nlattr(const struct nlattr *attr,
+				   struct sw_flow_match *match, bool is_mask,
+				   bool is_push_nsh, bool log)
+{
+	struct nlattr *a;
+	int rem;
+	bool has_base = false;
+	bool has_md1 = false;
+	bool has_md2 = false;
+	u8 mdtype = 0;
+	int mdlen = 0;
+
+	if (WARN_ON(is_push_nsh && is_mask))
+		return -EINVAL;
+
+	nla_for_each_nested(a, attr, rem) {
+		int type = nla_type(a);
+		int i;
+
+		if (type > OVS_NSH_KEY_ATTR_MAX) {
+			OVS_NLERR(log, "nsh attr %d is out of range max %d",
+				  type, OVS_NSH_KEY_ATTR_MAX);
+			return -EINVAL;
+		}
+
+		if (!check_attr_len(nla_len(a),
+				    ovs_nsh_key_attr_lens[type].len)) {
+			OVS_NLERR(
+			    log,
+			    "nsh attr %d has unexpected len %d expected %d",
+			    type,
+			    nla_len(a),
+			    ovs_nsh_key_attr_lens[type].len
+			);
+			return -EINVAL;
+		}
+
+		switch (type) {
+		case OVS_NSH_KEY_ATTR_BASE: {
+			const struct ovs_nsh_key_base *base =
+				(struct ovs_nsh_key_base *)nla_data(a);
+
+			has_base = true;
+			mdtype = base->mdtype;
+			SW_FLOW_KEY_PUT(match, nsh.base.flags,
+					base->flags, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.base.ttl,
+					base->ttl, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.base.mdtype,
+					base->mdtype, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.base.np,
+					base->np, is_mask);
+			SW_FLOW_KEY_PUT(match, nsh.base.path_hdr,
+					base->path_hdr, is_mask);
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD1: {
+			const struct ovs_nsh_key_md1 *md1 =
+				(struct ovs_nsh_key_md1 *)nla_data(a);
+
+			has_md1 = true;
+			for (i = 0; i < NSH_MD1_CONTEXT_SIZE; i++)
+				SW_FLOW_KEY_PUT(match, nsh.context[i],
+						md1->context[i], is_mask);
+			break;
+		}
+		case OVS_NSH_KEY_ATTR_MD2:
+			if (!is_push_nsh) /* Not supported MD type 2 yet */
+				return -ENOTSUPP;
+
+			has_md2 = true;
+			mdlen = nla_len(a);
+			if (mdlen > NSH_CTX_HDRS_MAX_LEN || mdlen <= 0) {
+				OVS_NLERR(
+				    log,
+				    "Invalid MD length %d for MD type %d",
+				    mdlen,
+				    mdtype
+				);
+				return -EINVAL;
+			}
+			break;
+		default:
+			OVS_NLERR(log, "Unknown nsh attribute %d",
+				  type);
+			return -EINVAL;
+		}
+	}
+
+	if (rem > 0) {
+		OVS_NLERR(log, "nsh attribute has %d unknown bytes.", rem);
+		return -EINVAL;
+	}
+
+	if (has_md1 && has_md2) {
+		OVS_NLERR(
+		    1,
+		    "invalid nsh attribute: md1 and md2 are exclusive."
+		);
+		return -EINVAL;
+	}
+
+	if (!is_mask) {
+		if ((has_md1 && mdtype != NSH_M_TYPE1) ||
+		    (has_md2 && mdtype != NSH_M_TYPE2)) {
+			OVS_NLERR(1, "nsh attribute has unmatched MD type %d.",
+				  mdtype);
+			return -EINVAL;
+		}
+
+		if (is_push_nsh &&
+		    (!has_base || (!has_md1 && !has_md2))) {
+			OVS_NLERR(
+			    1,
+			    "push_nsh: missing base or metadata attributes"
+			);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
 static int ovs_key_from_nlattrs(struct net *net, struct sw_flow_match *match,
 				u64 attrs, const struct nlattr **a,
 				bool is_mask, bool log)
@@ -1306,6 +1550,13 @@ static int ovs_key_from_nlattrs(struct net *net, struct sw_flow_match *match,
 		attrs &= ~(1 << OVS_KEY_ATTR_ARP);
 	}
 
+	if (attrs & (1 << OVS_KEY_ATTR_NSH)) {
+		if (nsh_key_put_from_nlattr(a[OVS_KEY_ATTR_NSH], match,
+					    is_mask, false, log) < 0)
+			return -EINVAL;
+		attrs &= ~(1 << OVS_KEY_ATTR_NSH);
+	}
+
 	if (attrs & (1 << OVS_KEY_ATTR_MPLS)) {
 		const struct ovs_key_mpls *mpls_key;
 
@@ -1622,6 +1873,34 @@ static int ovs_nla_put_vlan(struct sk_buff *skb, const struct vlan_head *vh,
 	return 0;
 }
 
+static int nsh_key_to_nlattr(const struct ovs_key_nsh *nsh, bool is_mask,
+			     struct sk_buff *skb)
+{
+	struct nlattr *start;
+
+	start = nla_nest_start(skb, OVS_KEY_ATTR_NSH);
+	if (!start)
+		return -EMSGSIZE;
+
+	if (nla_put(skb, OVS_NSH_KEY_ATTR_BASE, sizeof(nsh->base), &nsh->base))
+		goto nla_put_failure;
+
+	if (is_mask || nsh->base.mdtype == NSH_M_TYPE1) {
+		if (nla_put(skb, OVS_NSH_KEY_ATTR_MD1,
+			    sizeof(nsh->context), nsh->context))
+			goto nla_put_failure;
+	}
+
+	/* Don't support MD type 2 yet */
+
+	nla_nest_end(skb, start);
+
+	return 0;
+
+nla_put_failure:
+	return -EMSGSIZE;
+}
+
 static int __ovs_nla_put_key(const struct sw_flow_key *swkey,
 			     const struct sw_flow_key *output, bool is_mask,
 			     struct sk_buff *skb)
@@ -1750,6 +2029,9 @@ static int __ovs_nla_put_key(const struct sw_flow_key *swkey,
 		ipv6_key->ipv6_tclass = output->ip.tos;
 		ipv6_key->ipv6_hlimit = output->ip.ttl;
 		ipv6_key->ipv6_frag = output->ip.frag;
+	} else if (swkey->eth.type == htons(ETH_P_NSH)) {
+		if (nsh_key_to_nlattr(&output->nsh, is_mask, skb))
+			goto nla_put_failure;
 	} else if (swkey->eth.type == htons(ETH_P_ARP) ||
 		   swkey->eth.type == htons(ETH_P_RARP)) {
 		struct ovs_key_arp *arp_key;
@@ -2242,6 +2524,19 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	return err;
 }
 
+static bool validate_nsh(const struct nlattr *attr, bool is_mask,
+			 bool is_push_nsh, bool log)
+{
+	struct sw_flow_match match;
+	struct sw_flow_key key;
+	int ret = 0;
+
+	ovs_match_init(&match, &key, true, NULL);
+	ret = nsh_key_put_from_nlattr(attr, &match, is_mask,
+				      is_push_nsh, log);
+	return !ret;
+}
+
 /* Return false if there are any non-masked bits set.
  * Mask follows data immediately, before any netlink padding.
  */
@@ -2384,6 +2679,11 @@ static int validate_set(const struct nlattr *a,
 
 		break;
 
+	case OVS_KEY_ATTR_NSH:
+		if (!validate_nsh(nla_data(a), masked, false, log))
+			return -EINVAL;
+		break;
+
 	default:
 		return -EINVAL;
 	}
@@ -2482,6 +2782,8 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 			[OVS_ACTION_ATTR_TRUNC] = sizeof(struct ovs_action_trunc),
 			[OVS_ACTION_ATTR_PUSH_ETH] = sizeof(struct ovs_action_push_eth),
 			[OVS_ACTION_ATTR_POP_ETH] = 0,
+			[OVS_ACTION_ATTR_PUSH_NSH] = (u32)-1,
+			[OVS_ACTION_ATTR_POP_NSH] = 0,
 		};
 		const struct ovs_action_push_vlan *vlan;
 		int type = nla_type(a);
@@ -2636,6 +2938,19 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 			mac_proto = MAC_PROTO_ETHERNET;
 			break;
 
+		case OVS_ACTION_ATTR_PUSH_NSH:
+			mac_proto = MAC_PROTO_NONE;
+			if (!validate_nsh(nla_data(a), false, true, true))
+				return -EINVAL;
+			break;
+
+		case OVS_ACTION_ATTR_POP_NSH:
+			if (key->nsh.base.np == TUN_P_ETHERNET)
+				mac_proto = MAC_PROTO_ETHERNET;
+			else
+				mac_proto = MAC_PROTO_NONE;
+			break;
+
 		default:
 			OVS_NLERR(log, "Unknown Action type %d", type);
 			return -EINVAL;
diff --git a/net/openvswitch/flow_netlink.h b/net/openvswitch/flow_netlink.h
index 929c665..6657606 100644
--- a/net/openvswitch/flow_netlink.h
+++ b/net/openvswitch/flow_netlink.h
@@ -79,4 +79,9 @@ int ovs_nla_put_actions(const struct nlattr *attr,
 void ovs_nla_free_flow_actions(struct sw_flow_actions *);
 void ovs_nla_free_flow_actions_rcu(struct sw_flow_actions *);
 
+int nsh_key_from_nlattr(const struct nlattr *attr, struct ovs_key_nsh *nsh,
+			struct ovs_key_nsh *nsh_mask);
+int nsh_hdr_from_nlattr(const struct nlattr *attr, struct nshhdr *nh,
+			size_t size);
+
 #endif /* flow_netlink.h */
-- 
2.5.5

^ permalink raw reply related

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: Jesper Dangaard Brouer @ 2017-09-29  7:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Daniel Borkmann, peter.waskiewicz.jr,
	jakub.kicinski, netdev, Andy Gospodarek, brouer
In-Reply-To: <20170927173233.tuqlutz6t2gwdk53@ast-mbp>

On Wed, 27 Sep 2017 10:32:36 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Wed, Sep 27, 2017 at 04:54:57PM +0200, Jesper Dangaard Brouer wrote:
> > On Wed, 27 Sep 2017 06:35:40 -0700
> > John Fastabend <john.fastabend@gmail.com> wrote:
> >   
> > > On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote:  
> > > > On Tue, 26 Sep 2017 21:58:53 +0200
> > > > Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > >     
> > > >> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
> > > >> [...]    
> > > >>> I'm currently implementing a cpumap type, that transfers raw XDP frames
> > > >>> to another CPU, and the SKB is allocated on the remote CPU.  (It
> > > >>> actually works extremely well).      
> > > >>
> > > >> Meaning you let all the XDP_PASS packets get processed on a
> > > >> different CPU, so you can reserve the whole CPU just for
> > > >> prefiltering, right?     
> > > > 
> > > > Yes, exactly.  Except I use the XDP_REDIRECT action to steer packets.
> > > > The trick is using the map-flush point, to transfer packets in bulk to
> > > > the remote CPU (single call IPC is too slow), but at the same time
> > > > flush single packets if NAPI didn't see a bulk.
> > > >     
> > > >> Do you have some numbers to share at this point, just curious when
> > > >> you mention it works extremely well.    
> > > > 
> > > > Sure... I've done a lot of benchmarking on this patchset ;-)
> > > > > I have a benchmark program called xdp_redirect_cpu [1][2], that collects
> > > > > stats via tracepoints (atm I'm limiting bulking to 8 packets, and have
> > > > > tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns)
> > > > 
> > > >  [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
> > > >  [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c
> > > > 
> > > > Here I'm installing a DDoS program that drops UDP port 9 (pktgen
> > > > packets) on RX CPU=0.  I'm forcing my netperf to hit the same CPU, that
> > > > the 11.9Mpps DDoS attack is hitting.
> > > > 
> > > > Running XDP/eBPF prog_num:4
> > > > XDP-cpumap      CPU:to  pps            drop-pps    extra-info
> > > > XDP-RX          0       12,030,471     11,966,982  0          
> > > > XDP-RX          total   12,030,471     11,966,982 
> > > > cpumap-enqueue    0:2   63,488         0           0          
> > > > cpumap-enqueue  sum:2   63,488         0           0          
> > > > cpumap_kthread  2       63,488         0           3          time_exceed
> > > > cpumap_kthread  total   63,488         0           0          
> > > > redirect_err    total   0              0          
> > > > 
> > > > $ netperf -H 172.16.0.2 -t TCP_CRR  -l 10 -D1 -T5,5 -- -r 1024,1024
> > > > Local /Remote
> > > > Socket Size   Request  Resp.   Elapsed  Trans.
> > > > Send   Recv   Size     Size    Time     Rate         
> > > > bytes  Bytes  bytes    bytes   secs.    per sec   
> > > > 
> > > > 16384  87380  1024     1024    10.00    12735.97   
> > > > 16384  87380 
> > > > 
> > > > The netperf TCP_CRR performance is the same, without XDP loaded.
> > > >     
> > > 
> > > Just curious could you also try this with RPS enabled (or does this have
> > > RPS enabled). RPS should effectively do the same thing but higher in the
> > > stack. I'm curious what the delta would be. Might be another interesting
> > > case and fairly easy to setup if you already have the above scripts.  
> > 
> > Yes, I'm essentially competing with RPS, thus such a comparison is very
> > relevant...
> > 
> > This is only a 6-CPU system. Allocate 2 CPUs to RPS receive and let the
> > other 4 CPUs process packets.
> > 
> > Summary of RPS (Receive Packet Steering) performance:
> >  * End result is 6.3 Mpps max performance
> >  * netperf TCP_CRR is 1 trans/sec.
> >  * Each RX-RPS CPU stalls at ~3.2Mpps.
> > 
> > The full test report below with setup:
> > 
> > The mask needed::
> > 
> >  perl -e 'printf "%b\n",0x3C'
> >  111100
> > 
> > RPS setup::
> > 
> >  sudo sh -c 'echo 32768 > /proc/sys/net/core/rps_sock_flow_entries'
> > 
> >  for N in $(seq 0 5) ; do \
> >    sudo sh -c "echo 8192 > /sys/class/net/ixgbe1/queues/rx-$N/rps_flow_cnt" ; \
> >    sudo sh -c "echo 3c > /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus" ; \
> >    grep -H . /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus ; \
> >  done
> > 
> > Reduce RX queues to two ::
> > 
> >  ethtool -L ixgbe1 combined 2
> > 
> > IRQ align to CPU numbers::
> > 
> >  $ ~/setup01.sh
> >  Not root, running with sudo
> >   --- Disable Ethernet flow-control ---
> >  rx unmodified, ignoring
> >  tx unmodified, ignoring
> >  no pause parameters changed, aborting
> >  rx unmodified, ignoring
> >  tx unmodified, ignoring
> >  no pause parameters changed, aborting
> >   --- Align IRQs ---
> >  /proc/irq/54/ixgbe1-TxRx-0/../smp_affinity_list:0
> >  /proc/irq/55/ixgbe1-TxRx-1/../smp_affinity_list:1
> >  /proc/irq/56/ixgbe1/../smp_affinity_list:0-5
> > 
> > $ grep -H . /sys/class/net/ixgbe1/queues/rx-*/rps_cpus
> > /sys/class/net/ixgbe1/queues/rx-0/rps_cpus:3c
> > /sys/class/net/ixgbe1/queues/rx-1/rps_cpus:3c
> > 
> > Generator is sending: 12,715,782 tx_packets /sec
> > 
> >  ./pktgen_sample04_many_flows.sh -vi ixgbe2 -m 00:1b:21:bb:9a:84 \
> >     -d 172.16.0.2 -t8
> > 
> > $ nstat > /dev/null && sleep 1 && nstat
> > #kernel
> > IpInReceives                    6346544            0.0
> > IpInDelivers                    6346544            0.0
> > IpOutRequests                   1020               0.0
> > IcmpOutMsgs                     1020               0.0
> > IcmpOutDestUnreachs             1020               0.0
> > IcmpMsgOutType3                 1020               0.0
> > UdpNoPorts                      6346898            0.0
> > IpExtInOctets                   291964714          0.0
> > IpExtOutOctets                  73440              0.0
> > IpExtInNoECTPkts                6347063            0.0
> > 
> > $ mpstat -P ALL -u -I SCPU -I SUM
> > 
> > Average:     CPU    %usr   %nice    %sys   %irq   %soft  %idle
> > Average:     all    0.00    0.00    0.00   0.42   72.97  26.61
> > Average:       0    0.00    0.00    0.00   0.17   99.83   0.00
> > Average:       1    0.00    0.00    0.00   0.17   99.83   0.00
> > Average:       2    0.00    0.00    0.00   0.67   60.37  38.96
> > Average:       3    0.00    0.00    0.00   0.67   58.70  40.64
> > Average:       4    0.00    0.00    0.00   0.67   59.53  39.80
> > Average:       5    0.00    0.00    0.00   0.67   58.93  40.40
> > 
> > Average:     CPU    intr/s
> > Average:     all 152067.22
> > Average:       0  50064.73
> > Average:       1  50089.35
> > Average:       2  45095.17
> > Average:       3  44875.04
> > Average:       4  44906.32
> > Average:       5  45152.08
> > 
> > Average:     CPU     TIMER/s   NET_TX/s   NET_RX/s TASKLET/s  SCHED/s     RCU/s
> > Average:       0      609.48       0.17   49431.28      0.00     2.66     21.13
> > Average:       1      567.55       0.00   49498.00      0.00     2.66     21.13
> > Average:       2      998.34       0.00   43941.60      4.16    82.86     68.22
> > Average:       3      540.60       0.17   44140.27      0.00    85.52    108.49
> > Average:       4      537.27       0.00   44219.63      0.00    84.53     64.89
> > Average:       5      530.78       0.17   44445.59      0.00    85.02     90.52
> > 
> > From mpstat it looks like it is the RX-RPS CPUs that are the bottleneck.
> > 
> > Show adapter(s) (ixgbe1) statistics (ONLY that changed!)
> > Ethtool(ixgbe1) stat:     11109531 (   11,109,531) <= fdir_miss /sec
> > Ethtool(ixgbe1) stat:    380632356 (  380,632,356) <= rx_bytes /sec
> > Ethtool(ixgbe1) stat:    812792611 (  812,792,611) <= rx_bytes_nic /sec
> > Ethtool(ixgbe1) stat:      1753550 (    1,753,550) <= rx_missed_errors /sec
> > Ethtool(ixgbe1) stat:      4602487 (    4,602,487) <= rx_no_dma_resources /sec
> > Ethtool(ixgbe1) stat:      6343873 (    6,343,873) <= rx_packets /sec
> > Ethtool(ixgbe1) stat:     10946441 (   10,946,441) <= rx_pkts_nic /sec
> > Ethtool(ixgbe1) stat:    190287853 (  190,287,853) <= rx_queue_0_bytes /sec
> > Ethtool(ixgbe1) stat:      3171464 (    3,171,464) <= rx_queue_0_packets /sec
> > Ethtool(ixgbe1) stat:    190344503 (  190,344,503) <= rx_queue_1_bytes /sec
> > Ethtool(ixgbe1) stat:      3172408 (    3,172,408) <= rx_queue_1_packets /sec
> > 
> > Notice, each RX-CPU can only process 3.1Mpps.
> > 
> > RPS RX-CPU(0):
> > 
> >  # Overhead  CPU  Symbol
> >  # ........  ...  .......................................
> >  #
> >     11.72%  000  [k] ixgbe_poll
> >     11.29%  000  [k] _raw_spin_lock
> >     10.35%  000  [k] dev_gro_receive
> >      8.36%  000  [k] __build_skb
> >      7.35%  000  [k] __skb_get_hash
> >      6.22%  000  [k] enqueue_to_backlog
> >      5.89%  000  [k] __skb_flow_dissect
> >      4.43%  000  [k] inet_gro_receive
> >      4.19%  000  [k] ___slab_alloc
> >      3.90%  000  [k] queued_spin_lock_slowpath
> >      3.85%  000  [k] kmem_cache_alloc
> >      3.06%  000  [k] build_skb
> >      2.66%  000  [k] get_rps_cpu
> >      2.57%  000  [k] napi_gro_receive
> >      2.34%  000  [k] eth_type_trans
> >      1.81%  000  [k] __cmpxchg_double_slab.isra.61
> >      1.47%  000  [k] ixgbe_alloc_rx_buffers
> >      1.43%  000  [k] get_partial_node.isra.81
> >      0.84%  000  [k] swiotlb_sync_single
> >      0.74%  000  [k] udp4_gro_receive
> >      0.73%  000  [k] netif_receive_skb_internal
> >      0.72%  000  [k] udp_gro_receive
> >      0.63%  000  [k] skb_gro_reset_offset
> >      0.49%  000  [k] __skb_flow_get_ports
> >      0.48%  000  [k] llist_add_batch
> >      0.36%  000  [k] swiotlb_sync_single_for_cpu
> >      0.34%  000  [k] __slab_alloc
> > 
> > 
> > Remote RPS-CPU(3) getting packets::
> > 
> >  # Overhead  CPU  Symbol
> >  # ........  ...  ..............................................
> >  #
> >     33.02%  003  [k] poll_idle
> >     10.99%  003  [k] __netif_receive_skb_core
> >     10.45%  003  [k] page_frag_free
> >      8.49%  003  [k] ip_rcv
> >      4.19%  003  [k] fib_table_lookup
> >      2.84%  003  [k] __udp4_lib_rcv
> >      2.81%  003  [k] __slab_free

Notice slow-path of SLUB

> >      2.23%  003  [k] __udp4_lib_lookup
> >      2.09%  003  [k] ip_route_input_rcu
> >      2.07%  003  [k] kmem_cache_free
> >      2.06%  003  [k] udp_v4_early_demux
> >      1.73%  003  [k] ip_rcv_finish  
> 
> Very interesting data.

You removed some of the more interesting parts of the perf-report, which
showed us hitting more of the SLUB slowpath for SKBs.  The slowpath
consists of many separate function calls, thus it doesn't bubble to the
top (the FlameGraph tool shows them more easily).

> So above perf report compares to xdp-redirect-cpu this one:
> Perf top on a CPU(3) that has to alloc and free SKBs etc.
> 
> # Overhead  CPU  Symbol
> # ........  ...  .......................................
> #
>     15.51%  003  [k] fib_table_lookup
>      8.91%  003  [k] cpu_map_kthread_run
>      8.04%  003  [k] build_skb
>      7.88%  003  [k] page_frag_free
>      5.13%  003  [k] kmem_cache_alloc
>      4.76%  003  [k] ip_route_input_rcu
>      4.59%  003  [k] kmem_cache_free
>      4.02%  003  [k] __udp4_lib_rcv
>      3.20%  003  [k] fib_validate_source
>      3.02%  003  [k] __netif_receive_skb_core
>      3.02%  003  [k] udp_v4_early_demux
>      2.90%  003  [k] ip_rcv
>      2.80%  003  [k] ip_rcv_finish
> 
> right?
> and in RPS case the consumer cpu is 33% idle whereas in redirect-cpu
> you can load it up all the way.
> Am I interpreting all this correctly that with RPS cpu0 cannot
> distribute the packets to other cpus fast enough and that's
> a bottleneck?

Yes, exactly. The work needed on the RPS cpu0 is simply too much.

> whereas in redirect-cpu you're doing early packet distribution
> before skb alloc?

Yes, the main point is to reduce the CPU cycles spent on the packet by
doing early packet distribution.

> So in other words with redirect-cpu all consumer cpus are doing
> skb alloc and in RPS cpu0 is allocating skbs for all ?

Yes.

> and that's where 6M->12M performance gain comes from?

Yes, basically.  There are many small things that help this along, like
the cpumap case always hitting the SLUB fastpath.  Another big thing is
bulking. It is sort of hidden, but the XDP_REDIRECT flush mechanism is
implementing the RX bulking (which I've been "screaming" about for the
last couple of years! ;-))
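
To make the bulking point concrete, here is a minimal user-space sketch in
C of a staging queue that is flushed either when a full bulk has
accumulated or at the end of a poll cycle.  The names (bulk_queue,
bq_enqueue, bq_flush, BULK_SIZE) and the counters standing in for the
handoff to the remote CPU are assumptions for illustration only, not the
cpumap implementation; the bulk size of 8 is the limit mentioned earlier
in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define BULK_SIZE 8  /* bulk limit mentioned in the thread */

/* Hypothetical per-RX-CPU staging queue; not the kernel's data structure. */
struct bulk_queue {
	void *pkts[BULK_SIZE];
	size_t count;        /* packets currently staged */
	size_t flush_calls;  /* number of (expensive) handoffs to the remote CPU */
	size_t pkts_flushed; /* total packets handed off */
};

/* Hand all staged packets to the remote CPU in one operation. */
void bq_flush(struct bulk_queue *bq)
{
	if (bq->count == 0)
		return;
	/* A real implementation would enqueue to a ring and wake the consumer. */
	bq->flush_calls++;
	bq->pkts_flushed += bq->count;
	bq->count = 0;
}

/* Stage one packet; flush as soon as a full bulk has accumulated. */
void bq_enqueue(struct bulk_queue *bq, void *pkt)
{
	assert(bq->count < BULK_SIZE);
	bq->pkts[bq->count++] = pkt;
	if (bq->count == BULK_SIZE)
		bq_flush(bq);
}
```

Calling bq_enqueue() for every redirected packet and bq_flush() once at
the end of the poll cycle means 20 packets cost three handoffs (8 + 8 + 4)
rather than 20, while a lone packet is still flushed without waiting for a
full bulk.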

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* RE: [PATCH net-next v9] openvswitch: enable NSH support
From: Jan Scheurich @ 2017-09-29  7:10 UTC (permalink / raw)
  To: Yang, Yi, Pravin Shelar
  Cc: Jiri Benc, netdev@vger.kernel.org, dev@openvswitch.org, e@erig.me,
	davem@davemloft.net
In-Reply-To: <20170929064058.GA16145@localhost.localdomain>

> From: Yang, Yi [mailto:yi.y.yang@intel.com]
> Sent: Friday, 29 September, 2017 08:41
> To: Pravin Shelar <pshelar@ovn.org>
> Cc: Jiri Benc <jbenc@redhat.com>; netdev@vger.kernel.org; dev@openvswitch.org; e@erig.me; davem@davemloft.net; Jan Scheurich
> <jan.scheurich@ericsson.com>
> Subject: Re: [PATCH net-next v9] openvswitch: enable NSH support
> 
> On Fri, Sep 29, 2017 at 02:28:38AM +0800, Pravin Shelar wrote:
> > On Tue, Sep 26, 2017 at 6:39 PM, Yang, Yi <yi.y.yang@intel.com> wrote:
> > > On Tue, Sep 26, 2017 at 06:49:14PM +0800, Jiri Benc wrote:
> > >> On Tue, 26 Sep 2017 12:55:39 +0800, Yang, Yi wrote:
> > >> > After push_nsh, the packet won't be recirculated to flow pipeline, so
> > >> > key->eth.type must be set explicitly here, but for pop_nsh, the packet
> > >> > will be recirculated to flow pipeline, it will be reparsed, so
> > >> > key->eth.type will be set in packet parse function, we needn't handle it
> > >> > in pop_nsh.
> > >>
> > >> This seems to be a very different approach than what we currently have.
> > >> Looking at the code, the requirement after "destructive" actions such
> > >> as pushing or popping headers is to recirculate.
> > >
> > > > This is an optimization proposed by Jan Scheurich; recirculating after push_nsh
> > > > will impact performance, while recirculating after pop_nsh is unavoidable. So
> > > > also cc jan.scheurich@ericsson.com.
> > > >
> > > > Actually all the keys before push_nsh are still there after push_nsh,
> > > > push_nsh has updated all the nsh keys, so recirculating remains avoidable.
> > >
> >
> >
> > We should keep existing model for this patch. Later you can submit
> > optimization patch with specific use cases and performance
> > improvement. So that we can evaluate code complexity and benefits.
> 
> Ok, I'll remove the below line in push_nsh and send out v11, thanks.
> 
> 	key->eth.type = htons(ETH_P_NSH);

The optimization Yi refers to only affects the slow path translation. 

OVS 2.8 does not trigger an immediate recirculation after translating 
encap(nsh,...). There is no need to do so as the flow key of the resulting packet 
can be determined from the encap() action and its properties. Translation 
continues with the rewritten flow key and subsequent OpenFlow actions will 
typically set the new fields in the new NSH header. The push_nsh datapath action 
(including all NSH header fields) is only generated at the next commit, e.g. for 
output, cloning, recirculation, encap/decap or another destructive change of 
the flow key.

The implementation of push_nsh in the user-space datapath does not update
the miniflow (key) of the packet, only the packet data and some metadata. 
If the packet needs to be looked up again the slow path triggers recirculation
to re-parse the packet. There should be no need for the datapath push_nsh 
action to try to update the flow key.

BR, Jan

^ permalink raw reply

* Re: [PATCH net-next v9] openvswitch: enable NSH support
From: Yang, Yi @ 2017-09-29  7:15 UTC (permalink / raw)
  To: Jan Scheurich
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Jiri Benc,
	e@erig.me, davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org
In-Reply-To: <CFF8EF42F1132E4CBE2BF0AB6C21C58D7881A337-hqolJogE5njKJFWPz4pdheaU1rCVNFv4@public.gmane.org>

On Fri, Sep 29, 2017 at 07:10:52AM +0000, Jan Scheurich wrote:
> > From: Yang, Yi [mailto:yi.y.yang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org]
> > Sent: Friday, 29 September, 2017 08:41
> > To: Pravin Shelar <pshelar-LZ6Gd1LRuIk@public.gmane.org>
> > Cc: Jiri Benc <jbenc-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>; netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org; e@erig.me; davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org; Jan Scheurich
> > <jan.scheurich-IzeFyvvaP7pWk0Htik3J/w@public.gmane.org>
> > Subject: Re: [PATCH net-next v9] openvswitch: enable NSH support
> > 
> > On Fri, Sep 29, 2017 at 02:28:38AM +0800, Pravin Shelar wrote:
> > > On Tue, Sep 26, 2017 at 6:39 PM, Yang, Yi <yi.y.yang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> > > > On Tue, Sep 26, 2017 at 06:49:14PM +0800, Jiri Benc wrote:
> > > >> On Tue, 26 Sep 2017 12:55:39 +0800, Yang, Yi wrote:
> > > >> > After push_nsh, the packet won't be recirculated to flow pipeline, so
> > > >> > key->eth.type must be set explicitly here, but for pop_nsh, the packet
> > > >> > will be recirculated to flow pipeline, it will be reparsed, so
> > > >> > key->eth.type will be set in packet parse function, we needn't handle it
> > > >> > in pop_nsh.
> > > >>
> > > >> This seems to be a very different approach than what we currently have.
> > > >> Looking at the code, the requirement after "destructive" actions such
> > > >> as pushing or popping headers is to recirculate.
> > > >
> > > > > This is an optimization proposed by Jan Scheurich; recirculating after push_nsh
> > > > > will impact performance, while recirculating after pop_nsh is unavoidable. So
> > > > > also cc jan.scheurich-IzeFyvvaP7oU04JRNCRQjg@public.gmane.org
> > > > >
> > > > > Actually all the keys before push_nsh are still there after push_nsh,
> > > > > push_nsh has updated all the nsh keys, so recirculating remains avoidable.
> > > >
> > >
> > >
> > > We should keep existing model for this patch. Later you can submit
> > > optimization patch with specific use cases and performance
> > > improvement. So that we can evaluate code complexity and benefits.
> > 
> > Ok, I'll remove the below line in push_nsh and send out v11, thanks.
> > 
> > 	key->eth.type = htons(ETH_P_NSH);
> 
> The optimization Yi refers to only affects the slow path translation. 
> 
> OVS 2.8 does not trigger an immediate recirculation after translating 
> encap(nsh,...). There is no need to do so as the flow key of the resulting packet 
> can be determined from the encap() action and its properties. Translation 
> continues with the rewritten flow key and subsequent OpenFlow actions will 
> typically set the new fields in the new NSH header. The push_nsh datapath action 
> (including all NSH header fields) is only generated at the next commit, e.g. for 
> output, cloning, recirculation, encap/decap or another destructive change of 
> the flow key.
> 
> The implementation of push_nsh in the user-space datapath does not update
> the miniflow (key) of the packet, only the packet data and some metadata. 
> If the packet needs to be looked up again the slow path triggers recirculation
> to re-parse the packet. There should be no need for the datapath push_nsh 
> action to try to update the flow key.

Thanks Jan for the clarification. It still works after removing that
line: our flows didn't match on it after push_nsh, since the packet is
output to the VxLAN-gpe port after push_nsh. I'm not sure if we can match
dl_type and NSH fields if we don't output and don't recirculate.

> 
> BR, Jan

^ permalink raw reply

* Re: [PATCH net-next v9] openvswitch: enable NSH support
From: Jan Scheurich @ 2017-09-29  7:27 UTC (permalink / raw)
  To: Yang, Yi
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Jiri Benc,
	e@erig.me, davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org
In-Reply-To: <20170929071553.GA19053-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>

> > The optimization Yi refers to only affects the slow path translation.
> >
> > OVS 2.8 does not trigger an immediate recirculation after translating
> > encap(nsh,...). There is no need to do so as the flow key of the resulting packet
> > can be determined from the encap() action and its properties. Translation
> > continues with the rewritten flow key and subsequent OpenFlow actions will
> > typically set the new fields in the new NSH header. The push_nsh datapath action
> > (including all NSH header fields) is only generated at the next commit, e.g. for
> > output, cloning, recirculation, encap/decap or another destructive change of
> > the flow key.
> >
> > The implementation of push_nsh in the user-space datapath does not update
> > the miniflow (key) of the packet, only the packet data and some metadata.
> > If the packet needs to be looked up again the slow path triggers recirculation
> > to re-parse the packet. There should be no need for the datapath push_nsh
> > action to try to update the flow key.
> 
> Thanks Jan for the clarification. It still works after removing that
> line: our flows didn't match on it after push_nsh, since the packet is
> output to the VxLAN-gpe port after push_nsh. I'm not sure if we can match
> dl_type and NSH fields if we don't output and don't recirculate.

No worries, a packet cannot be matched again in the datapath unless it is 
recirculated. And recirculation today always implies re-parsing. 

In the future we want to look into possibilities to optimize performance of 
recirculation, for example by skipping the parsing stage if it is unnecessary.
For that we may need to invalidate the flow key in packet metadata when
the packet is modified without corresponding update of the key itself. But that
is music of the future.
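
The idea of invalidating the flow key could look roughly like the C sketch
below.  Everything here (the pkt struct, the key_valid flag and the
function names) is hypothetical and only illustrates the control flow; it
is not taken from OVS or from the kernel datapath.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_NSH 0x894F /* NSH EtherType */

/* Hypothetical packet with a cached flow key; for illustration only. */
struct pkt {
	uint16_t wire_eth_type;   /* what a parser would find in the packet data */
	uint16_t key_eth_type;    /* cached flow key field */
	bool key_valid;           /* false once data changed without a key update */
	unsigned int parse_count; /* how often the packet was (re-)parsed */
};

/* Re-extract the key from the packet data. */
void pkt_parse(struct pkt *p)
{
	p->key_eth_type = p->wire_eth_type;
	p->key_valid = true;
	p->parse_count++;
}

/* Destructive action: modify packet data, leave the key alone, mark it stale. */
void pkt_push_nsh(struct pkt *p)
{
	p->wire_eth_type = ETH_P_NSH;
	p->key_valid = false;
}

/* Recirculate: re-parse only when the cached key is stale. */
void pkt_recirculate(struct pkt *p)
{
	if (!p->key_valid)
		pkt_parse(p);
	assert(p->key_valid);
}
```

In this sketch a recirculation directly after parsing skips the parser,
while a recirculation after pkt_push_nsh() re-parses and picks up the new
EtherType.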

/Jan

^ permalink raw reply

* Re: [net-next PATCH 3/5] bpf: cpumap xdp_buff to skb conversion and allocation
From: Jesper Dangaard Brouer @ 2017-09-29  7:46 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, Jason Wang, mchan,
	John Fastabend, peter.waskiewicz.jr, Daniel Borkmann,
	Alexei Starovoitov, Andy Gospodarek, brouer
In-Reply-To: <59CD83DD.4060603@iogearbox.net>

On Fri, 29 Sep 2017 01:21:01 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 09/28/2017 02:57 PM, Jesper Dangaard Brouer wrote:
> [...]
> > +/* Convert xdp_buff to xdp_pkt */
> > +static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
> > +{
> > +	struct xdp_pkt *xdp_pkt;
> > +	int headroom;
> > +
> > +	/* Assure headroom is available for storing info */
> > +	headroom = xdp->data - xdp->data_hard_start;
> > +	if (headroom < sizeof(*xdp_pkt))
> > +		return NULL;
> > +
> > +	/* Store info in top of packet */
> > +	xdp_pkt = xdp->data_hard_start;  
> 
> (You'd also need to handle data_meta here if set, and for below
> cpu_map_build_skb(), e.g. headroom is data_meta-data_hard_start.)

I'll look into this.  The data_meta patchset was in-flight while I
rebased this.
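
As a rough illustration of the adjustment Daniel describes, the following
user-space C sketch computes the headroom from data_meta when metadata is
present.  The struct and function names (buf_view, usable_headroom) are
hypothetical, and using NULL to mean "no metadata" is an assumption of
this sketch; the kernel's own representation may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the pointers in an xdp_buff. */
struct buf_view {
	unsigned char *data_hard_start; /* start of the buffer */
	unsigned char *data_meta;       /* start of metadata, NULL if unused (sketch assumption) */
	unsigned char *data;            /* start of packet data */
};

/*
 * Headroom available for storing info at the top of the buffer.
 * With metadata present, it sits directly in front of the packet data,
 * so only the space before data_meta is free.
 */
size_t usable_headroom(const struct buf_view *b)
{
	const unsigned char *first_used = b->data_meta ? b->data_meta : b->data;

	assert(first_used >= b->data_hard_start);
	return (size_t)(first_used - b->data_hard_start);
}
```

The result would then be compared against the size of the info struct, in
the same way the quoted patch compares headroom against sizeof(*xdp_pkt).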

> > +	xdp_pkt->data = xdp->data;
> > +	xdp_pkt->len  = xdp->data_end - xdp->data;
> > +	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
> > +
> > +	return xdp_pkt;
> > +}
> > +
> > +static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
> > +					 struct xdp_pkt *xdp_pkt)
> > +{
> > +	unsigned int frame_size;
> > +	void *pkt_data_start;
> > +	struct sk_buff *skb;
> > +
> > +	/* build_skb need to place skb_shared_info after SKB end, and
> > +	 * also want to know the memory "truesize".  Thus, need to  
> [...]
> >   static int cpu_map_kthread_run(void *data)
> >   {
> > +	const unsigned long busy_poll_jiffies = usecs_to_jiffies(2000);
> > +	unsigned long time_limit = jiffies + busy_poll_jiffies;
> >   	struct bpf_cpu_map_entry *rcpu = data;
> > +	unsigned int empty_cnt = 0;
> >
> >   	set_current_state(TASK_INTERRUPTIBLE);
> >   	while (!kthread_should_stop()) {
> > +		unsigned int processed = 0, drops = 0;
> >   		struct xdp_pkt *xdp_pkt;
> >
> > -		schedule();
> > -		/* Do work */
> > -		while ((xdp_pkt = ptr_ring_consume(rcpu->queue))) {
> > -			/* For now just "refcnt-free" */
> > -			page_frag_free(xdp_pkt);
> > +		/* Release CPU reschedule checks */
> > +		if ((time_after_eq(jiffies, time_limit) || empty_cnt > 25) &&
> > +		    __ptr_ring_empty(rcpu->queue)) {
> > +			empty_cnt++;
> > +			schedule();
> > +			time_limit = jiffies + busy_poll_jiffies;
> > +			WARN_ON(smp_processor_id() != rcpu->cpu);
> > +		} else {
> > +			cond_resched();
> >   		}
> > +
> > +		/* Process packets in rcpu->queue */
> > +		local_bh_disable();
> > +		/*
> > +		 * The bpf_cpu_map_entry is single consumer, with this
> > +		 * kthread CPU pinned. Lockless access to ptr_ring
> > +		 * consume side valid as no-resize allowed of queue.
> > +		 */
> > +		while ((xdp_pkt = __ptr_ring_consume(rcpu->queue))) {
> > +			struct sk_buff *skb;
> > +			int ret;
> > +
> > +			/* Allow busy polling again */
> > +			empty_cnt = 0;
> > +
> > +			skb = cpu_map_build_skb(rcpu, xdp_pkt);
> > +			if (!skb) {
> > +				page_frag_free(xdp_pkt);
> > +				continue;
> > +			}
> > +
> > +			/* Inject into network stack */
> > +			ret = netif_receive_skb(skb);  
> 
> Have you looked into whether it's feasible to reuse GRO
> engine here as well?

This is the first step. I'll work on adding the GRO-engine later. And
it should be feasible.  There are plenty of optimizations in this area
that can be done later ;-)

> 
> > +			if (ret == NET_RX_DROP)
> > +				drops++;
> > +
> > +			/* Limit BH-disable period */
> > +			if (++processed == 8)
> > +				break;
> > +		}
> > +		local_bh_enable();
> > +
> >   		__set_current_state(TASK_INTERRUPTIBLE);
> >   	}
> >   	put_cpu_map_entry(rcpu);  
> [...]



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [net-next PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Hannes Frederic Sowa @ 2017-09-29  7:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, netdev, jakub.kicinski,
	Michael S. Tsirkin, Jason Wang, mchan, John Fastabend,
	peter.waskiewicz.jr, Daniel Borkmann, Andy Gospodarek, pabeni,
	edumazet
In-Reply-To: <20170929032146.vs5v454wjs4niu4k@ast-mbp>

[adding Paolo, Eric]

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Sep 28, 2017 at 02:57:08PM +0200, Jesper Dangaard Brouer wrote:

[...]

>> +	wake_up_process(rcpu->kthread);
>
> In general the whole thing looks like 'threaded NAPI' that Hannes was
> proposing some time back. I liked it back then and I like it now.
> I don't remember what were the objections back then.
> Something scheduler related?
> Adding Hannes.

Yes.

The main objection from Eric at that time was that user space now starts
to compete with the threaded NAPI threads depending on process
priorities, which are under control of user space. Softirq always runs
first to end. Networking could starve because a process with higher
priority is runnable. At that time Eric found a way to fix the
particular problem, which resulted in commit 4cd13c21b207e80d. Pinning
and other control is also possible from user space, causing more complex
tuning setups, and problems will be harder to debug.

In particular, after Eric's patch threaded NAPI proved itself to be no
longer useful, because his patch successfully deferred work to
ksoftirqd more reliably, thus allowing the UDP rx queue to get drained by
user space.

> Still curious about the questions I asked in the other thread
> on what's causing it to be so much better than RPS

My guess is that RPS uses an expensive IPI to notify the remote
softirq. The batching size on RPS depends on how many packets could get
worked on during one softirq invocation on the source CPU until we wake
up remote CPU(s!), if they are not constantly running.

^ permalink raw reply

* Re: [PATCH v4 net-next 0/8] flow_dissector: Protocol specific flow dissector offload
From: Hannes Frederic Sowa @ 2017-09-29  7:58 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, rohit
In-Reply-To: <20170928235230.22158-1-tom@quantonium.net>

Tom Herbert <tom@quantonium.net> writes:

> This patch set adds a new offload type to perform flow dissection for
> specific protocols (either by EtherType or by IP protocol). This is
> primarily useful to crack open UDP encapsulations (like VXLAN, GUE) for
> the purposes of parsing the encapsulated packet.
>
> Items in this patch set:
> - Create new protocol case in __skb_dissect for ETH_P_TEB. This is based
>   on the code in the GRE dissect function and the special handling in
>   GRE can now be removed (it sets protocol to ETH_P_TEB and returns so
>   goto proto_again is done)
> - Add infrastructure for protocol specific flow dissection offload
> - Add infrastructure to perform UDP flow dissection. Uses same model of
>   GRO where a flow_dissect callback can be associated with a UDP
>   socket
> - Use the infrastructure to support flow dissection of VXLAN and GUE
>
> Tested:
>
> Forced RPS to call flow dissection for VXLAN, FOU, and GUE. Observed
> that inner packet was being properly dissected.

I have the feeling that this patch series changes the behavior of flower
and thus causes uAPI problems.

flower seems to use the flow dissector results for parsing the inner
packets. In case of vxlan in vxlan encapsulation, which seems to become
more common (sigh!) you let part of the flow specification match on the
most inner header, while the flower ingress filter might want to match
inside the first encapsulation only.

^ permalink raw reply

* [PATCH net v1 1/1] tipc: use only positive error codes in messages
From: Parthasarathy Bhuvaragan @ 2017-09-29  8:02 UTC (permalink / raw)
  To: davem
  Cc: netdev, tipc-discussion, jon.maloy, maloy, ying.xue,
	parthasarathy.bhuvaragan

In commit e3a77561e7d32 ("tipc: split up function tipc_msg_eval()"),
we have updated the function tipc_msg_lookup_dest() to set the error
codes to negative values at destination lookup failures. Thus when
the function sets the error code to -TIPC_ERR_NO_NAME, it is inserted
into the 4-bit error field of the message header as 0xf instead of
TIPC_ERR_NO_NAME (1). The value 0xf is an unknown error code.

In this commit, we set only positive error codes.

Fixes: e3a77561e7d32 ("tipc: split up function tipc_msg_eval()")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
---
 net/tipc/msg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index 6ef379f004ac..121e59a1d0e7 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -551,7 +551,7 @@ bool tipc_msg_lookup_dest(struct net *net, struct sk_buff *skb, int *err)
 		return false;
 	if (msg_errcode(msg))
 		return false;
-	*err = -TIPC_ERR_NO_NAME;
+	*err = TIPC_ERR_NO_NAME;
 	if (skb_linearize(skb))
 		return false;
 	msg = buf_msg(skb);
-- 
2.1.4
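
For illustration, the truncation described in the commit message above can
be reproduced in a few lines of user-space C.  The helper name
errcode_field() and the explicit 0xf mask are assumptions of this sketch,
standing in for how a 4-bit header field stores a value; TIPC_ERR_NO_NAME
is 1 as stated in the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define TIPC_ERR_NO_NAME 1

/* Store an error code in a 4-bit field: only the low four bits survive. */
uint32_t errcode_field(int err)
{
	return (uint32_t)err & 0xfu;
}

/* Demonstrates the bug and the fix. */
void errcode_demo(void)
{
	/* Negative code: -1 truncated to four bits is 0xf, an unknown code. */
	assert(errcode_field(-TIPC_ERR_NO_NAME) == 0xfu);
	/* Positive code: the intended value reaches the header. */
	assert(errcode_field(TIPC_ERR_NO_NAME) == 1u);
}
```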

^ permalink raw reply related

* Re: [PATCH net-next 0/5] bpf: Extend bpf_{prog,map}_info
From: David Miller @ 2017-09-29  5:17 UTC (permalink / raw)
  To: kafai; +Cc: netdev, ast, daniel, kernel-team
In-Reply-To: <20170927213756.1254938-1-kafai@fb.com>

From: Martin KaFai Lau <kafai@fb.com>
Date: Wed, 27 Sep 2017 14:37:51 -0700

> This patch series adds more fields to bpf_prog_info and bpf_map_info.
> Please see individual patch for details.

Great to see progress in the area of eBPF introspection.

Series applied, thanks.

^ permalink raw reply
