Netdev List


* [PATCH net-next v4 3/4] bpf: BPF for lightweight tunnel infrastructure
From: Thomas Graf @ 2016-11-30 16:10 UTC
  To: davem; +Cc: netdev, alexei.starovoitov, daniel, tom, roopa, hannes
In-Reply-To: <cover.1480522144.git.tgraf@suug.ch>

Registers new BPF program types which correspond to the LWT hooks:
  - BPF_PROG_TYPE_LWT_IN   => dst_input()
  - BPF_PROG_TYPE_LWT_OUT  => dst_output()
  - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()

The separate program types are required to differentiate between the
capabilities each LWT hook allows:

 * Programs attached to dst_input() or dst_output() are restricted and
   may only read the data of an skb. This prevents modification and
   possible invalidation of already validated packet headers on receive
   and the construction of illegal headers while the IP headers are
   still being assembled.

 * Programs attached to lwtunnel_xmit() are allowed to modify packet
   content as well as to prepend an L2 header via a newly introduced
   helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
   invoked after the IP header has been assembled completely.
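
As a rough userspace model of the restriction described above (not the
kernel code; the enum and function names are hypothetical stand-ins for
the verifier check), the per-hook write permission could be sketched as:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the three LWT program types. */
enum lwt_hook { LWT_HOOK_IN, LWT_HOOK_OUT, LWT_HOOK_XMIT };

/* Hypothetical stand-in for the kind of packet access requested. */
enum pkt_access { PKT_READ, PKT_WRITE };

/* Models the rule from the commit message: programs on dst_input() and
 * dst_output() are read-only, programs on lwtunnel_xmit() may also write.
 */
static bool lwt_may_access_pkt(enum lwt_hook hook, enum pkt_access access)
{
	switch (hook) {
	case LWT_HOOK_IN:
	case LWT_HOOK_OUT:
		return access == PKT_READ;
	case LWT_HOOK_XMIT:
		return true;
	}
	return false;
}
```

In this model, a write request from an input or output program is the
only combination that is refused.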

All BPF programs receive an skb with L3 headers attached and may return
one of the following return codes:

 BPF_OK - Continue routing as per nexthop
 BPF_DROP - Drop skb and return EPERM
 BPF_REDIRECT - Redirect skb to device as per redirect() helper.
                (Only valid in lwtunnel_xmit() context)

The return codes are binary compatible with their TC_ACT_*
counterparts, which keeps them compatible with existing SCHED_CLS and
SCHED_ACT programs.
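
A simplified userspace sketch of how these return codes are meant to be
interpreted follows. It is a model only: the function name
lwt_handle_verdict() is hypothetical, the TC_ACT_* values are repeated
locally so the sketch builds without kernel headers, and freeing the skb
and the actual redirect are left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Values as defined by this patch in include/uapi/linux/bpf.h. */
enum bpf_ret_code { BPF_OK = 0, BPF_DROP = 2, BPF_REDIRECT = 7 };

/* TC_ACT_* values, repeated here only for the compatibility check. */
enum { TC_ACT_OK = 0, TC_ACT_SHOT = 2, TC_ACT_REDIRECT = 7 };

_Static_assert((int)BPF_OK == (int)TC_ACT_OK, "BPF_OK != TC_ACT_OK");
_Static_assert((int)BPF_DROP == (int)TC_ACT_SHOT, "BPF_DROP != TC_ACT_SHOT");
_Static_assert((int)BPF_REDIRECT == (int)TC_ACT_REDIRECT,
	       "BPF_REDIRECT != TC_ACT_REDIRECT");

/* Hypothetical model of the verdict handling: returns BPF_OK to continue
 * routing, BPF_REDIRECT if the skb was handed off, or a negative errno
 * if the skb was dropped.
 */
static int lwt_handle_verdict(int ret, bool can_redirect)
{
	switch (ret) {
	case BPF_OK:
		return BPF_OK;
	case BPF_REDIRECT:
		/* Redirect is only honoured in lwtunnel_xmit() context;
		 * elsewhere it is ignored and routing continues.
		 */
		return can_redirect ? BPF_REDIRECT : BPF_OK;
	case BPF_DROP:
		return -EPERM;
	default:
		/* Unknown return values lead to packet loss. */
		return -EINVAL;
	}
}
```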

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/linux/filter.h        |   2 +-
 include/uapi/linux/bpf.h      |  32 +++-
 include/uapi/linux/lwtunnel.h |  23 +++
 kernel/bpf/verifier.c         |  14 +-
 net/Kconfig                   |   8 +
 net/core/Makefile             |   1 +
 net/core/filter.c             | 173 ++++++++++++++++++
 net/core/lwt_bpf.c            | 396 ++++++++++++++++++++++++++++++++++++++++++
 net/core/lwtunnel.c           |   2 +
 9 files changed, 646 insertions(+), 5 deletions(-)
 create mode 100644 net/core/lwt_bpf.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7f246a2..7ba6446 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -438,7 +438,7 @@ struct xdp_buff {
 };
 
 /* compute the linear packet data range [data, data_end) which
- * will be accessed by cls_bpf and act_bpf programs
+ * will be accessed by cls_bpf, act_bpf and lwt programs
  */
 static inline void bpf_compute_data_end(struct sk_buff *skb)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1370a9d..22ac827 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -101,6 +101,9 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_XDP,
 	BPF_PROG_TYPE_PERF_EVENT,
 	BPF_PROG_TYPE_CGROUP_SKB,
+	BPF_PROG_TYPE_LWT_IN,
+	BPF_PROG_TYPE_LWT_OUT,
+	BPF_PROG_TYPE_LWT_XMIT,
 };
 
 enum bpf_attach_type {
@@ -409,6 +412,16 @@ union bpf_attr {
  *
  * int bpf_get_numa_node_id()
  *     Return: Id of current NUMA node.
+ *
+ * int bpf_skb_change_head()
+ *     Grows headroom of skb and adjusts MAC header offset accordingly.
+ *     Will extend/reallocate as required automatically.
+ *     May change skb data pointer and will thus invalidate any check
+ *     performed for direct packet access.
+ *     @skb: pointer to skb
+ *     @len: length of header to be pushed in front
+ *     @flags: Flags (unused for now)
+ *     Return: 0 on success or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -453,7 +466,8 @@ union bpf_attr {
 	FN(skb_pull_data),		\
 	FN(csum_update),		\
 	FN(set_hash_invalid),		\
-	FN(get_numa_node_id),
+	FN(get_numa_node_id),		\
+	FN(skb_change_head),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -537,6 +551,22 @@ struct bpf_tunnel_key {
 	__u32 tunnel_label;
 };
 
+/* Generic BPF return codes which all BPF program types may support.
+ * The values are binary compatible with their TC_ACT_* counterparts to
+ * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
+ * programs.
+ *
+ * XDP is handled separately, see XDP_*.
+ */
+enum bpf_ret_code {
+	BPF_OK = 0,
+	/* 1 reserved */
+	BPF_DROP = 2,
+	/* 3-6 reserved */
+	BPF_REDIRECT = 7,
+	/* >127 are reserved for prog type specific return codes */
+};
+
 /* User return codes for XDP prog type.
  * A valid XDP program must return one of these defined values. All other
  * return codes are reserved for future use. Unknown return codes will result
diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index 453cc62..92724cba 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -10,6 +10,7 @@ enum lwtunnel_encap_types {
 	LWTUNNEL_ENCAP_ILA,
 	LWTUNNEL_ENCAP_IP6,
 	LWTUNNEL_ENCAP_SEG6,
+	LWTUNNEL_ENCAP_BPF,
 	__LWTUNNEL_ENCAP_MAX,
 };
 
@@ -43,4 +44,26 @@ enum lwtunnel_ip6_t {
 
 #define LWTUNNEL_IP6_MAX (__LWTUNNEL_IP6_MAX - 1)
 
+enum {
+	LWT_BPF_PROG_UNSPEC,
+	LWT_BPF_PROG_FD,
+	LWT_BPF_PROG_NAME,
+	__LWT_BPF_PROG_MAX,
+};
+
+#define LWT_BPF_PROG_MAX (__LWT_BPF_PROG_MAX - 1)
+
+enum {
+	LWT_BPF_UNSPEC,
+	LWT_BPF_IN,
+	LWT_BPF_OUT,
+	LWT_BPF_XMIT,
+	LWT_BPF_XMIT_HEADROOM,
+	__LWT_BPF_MAX,
+};
+
+#define LWT_BPF_MAX (__LWT_BPF_MAX - 1)
+
+#define LWT_BPF_MAX_HEADROOM 256
+
 #endif /* _UAPI_LWTUNNEL_H_ */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8740c5f..8135cb1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -633,12 +633,19 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno, int off,
 #define MAX_PACKET_OFF 0xffff
 
 static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
-				       const struct bpf_call_arg_meta *meta)
+				       const struct bpf_call_arg_meta *meta,
+				       enum bpf_access_type t)
 {
 	switch (env->prog->type) {
+	case BPF_PROG_TYPE_LWT_IN:
+	case BPF_PROG_TYPE_LWT_OUT:
+		/* dst_input() and dst_output() can't write for now */
+		if (t == BPF_WRITE)
+			return false;
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
 	case BPF_PROG_TYPE_XDP:
+	case BPF_PROG_TYPE_LWT_XMIT:
 		if (meta)
 			return meta->pkt_access;
 
@@ -837,7 +844,7 @@ static int check_mem_access(struct bpf_verifier_env *env, u32 regno, int off,
 			err = check_stack_read(state, off, size, value_regno);
 		}
 	} else if (state->regs[regno].type == PTR_TO_PACKET) {
-		if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL)) {
+		if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL, t)) {
 			verbose("cannot write into packet\n");
 			return -EACCES;
 		}
@@ -970,7 +977,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 		return 0;
 	}
 
-	if (type == PTR_TO_PACKET && !may_access_direct_pkt_data(env, meta)) {
+	if (type == PTR_TO_PACKET &&
+	    !may_access_direct_pkt_data(env, meta, BPF_READ)) {
 		verbose("helper access to the packet is not allowed\n");
 		return -EACCES;
 	}
diff --git a/net/Kconfig b/net/Kconfig
index 7b6cd34..a100500 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -402,6 +402,14 @@ config LWTUNNEL
 	  weight tunnel endpoint. Tunnel encapsulation parameters are stored
 	  with light weight tunnel state associated with fib routes.
 
+config LWTUNNEL_BPF
+	bool "Execute BPF program as route nexthop action"
+	depends on LWTUNNEL
+	default y if LWTUNNEL=y
+	---help---
+	  Allows running BPF programs as a nexthop action following a route
+	  lookup for incoming and outgoing packets.
+
 config DST_CACHE
 	bool
 	default n
diff --git a/net/core/Makefile b/net/core/Makefile
index d6508c2..f6761b6 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_NET_PTP_CLASSIFY) += ptp_classifier.o
 obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
 obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
 obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
+obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o
 obj-$(CONFIG_DST_CACHE) += dst_cache.o
 obj-$(CONFIG_HWBM) += hwbm.o
 obj-$(CONFIG_NET_DEVLINK) += devlink.o
diff --git a/net/core/filter.c b/net/core/filter.c
index 698a262..1c4d0fa 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1689,6 +1689,12 @@ static int __bpf_redirect_no_mac(struct sk_buff *skb, struct net_device *dev,
 static int __bpf_redirect_common(struct sk_buff *skb, struct net_device *dev,
 				 u32 flags)
 {
+	/* Verify that a link layer header is carried */
+	if (unlikely(skb->mac_header >= skb->network_header)) {
+		kfree_skb(skb);
+		return -ERANGE;
+	}
+
 	bpf_push_mac_rcsum(skb);
 	return flags & BPF_F_INGRESS ?
 	       __bpf_rx_skb(dev, skb) : __bpf_tx_skb(dev, skb);
@@ -2188,12 +2194,53 @@ static const struct bpf_func_proto bpf_skb_change_tail_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_skb_change_head, struct sk_buff *, skb, u32, head_room,
+	   u64, flags)
+{
+	u32 max_len = __bpf_skb_max_len(skb);
+	u32 new_len = skb->len + head_room;
+	int ret;
+
+	if (unlikely(flags || (!skb_is_gso(skb) && new_len > max_len) ||
+		     new_len < skb->len))
+		return -EINVAL;
+
+	ret = skb_cow(skb, head_room);
+	if (likely(!ret)) {
+		/* Idea for this helper is that we currently only
+		 * allow to expand on mac header. This means that
+		 * skb->protocol, network header, etc. stay as is.
+		 * Compared to bpf_skb_change_tail(), we're more
+		 * flexible due to not needing to linearize or
+		 * reset GSO. Intention for this helper is to be
+		 * used by an L3 skb that needs to push mac header
+		 * for redirection into L2 device.
+		 */
+		__skb_push(skb, head_room);
+		memset(skb->data, 0, head_room);
+		skb_reset_mac_header(skb);
+	}
+
+	bpf_compute_data_end(skb);
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_skb_change_head_proto = {
+	.func		= bpf_skb_change_head,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 bool bpf_helper_changes_skb_data(void *func)
 {
 	if (func == bpf_skb_vlan_push ||
 	    func == bpf_skb_vlan_pop ||
 	    func == bpf_skb_store_bytes ||
 	    func == bpf_skb_change_proto ||
+	    func == bpf_skb_change_head ||
 	    func == bpf_skb_change_tail ||
 	    func == bpf_skb_pull_data ||
 	    func == bpf_l3_csum_replace ||
@@ -2639,6 +2686,68 @@ cg_skb_func_proto(enum bpf_func_id func_id)
 	}
 }
 
+static const struct bpf_func_proto *
+lwt_inout_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_load_bytes:
+		return &bpf_skb_load_bytes_proto;
+	case BPF_FUNC_skb_pull_data:
+		return &bpf_skb_pull_data_proto;
+	case BPF_FUNC_csum_diff:
+		return &bpf_csum_diff_proto;
+	case BPF_FUNC_get_cgroup_classid:
+		return &bpf_get_cgroup_classid_proto;
+	case BPF_FUNC_get_route_realm:
+		return &bpf_get_route_realm_proto;
+	case BPF_FUNC_get_hash_recalc:
+		return &bpf_get_hash_recalc_proto;
+	case BPF_FUNC_perf_event_output:
+		return &bpf_skb_event_output_proto;
+	case BPF_FUNC_get_smp_processor_id:
+		return &bpf_get_smp_processor_id_proto;
+	case BPF_FUNC_skb_under_cgroup:
+		return &bpf_skb_under_cgroup_proto;
+	default:
+		return sk_filter_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
+lwt_xmit_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_get_tunnel_key:
+		return &bpf_skb_get_tunnel_key_proto;
+	case BPF_FUNC_skb_set_tunnel_key:
+		return bpf_get_skb_set_tunnel_proto(func_id);
+	case BPF_FUNC_skb_get_tunnel_opt:
+		return &bpf_skb_get_tunnel_opt_proto;
+	case BPF_FUNC_skb_set_tunnel_opt:
+		return bpf_get_skb_set_tunnel_proto(func_id);
+	case BPF_FUNC_redirect:
+		return &bpf_redirect_proto;
+	case BPF_FUNC_clone_redirect:
+		return &bpf_clone_redirect_proto;
+	case BPF_FUNC_skb_change_tail:
+		return &bpf_skb_change_tail_proto;
+	case BPF_FUNC_skb_change_head:
+		return &bpf_skb_change_head_proto;
+	case BPF_FUNC_skb_store_bytes:
+		return &bpf_skb_store_bytes_proto;
+	case BPF_FUNC_csum_update:
+		return &bpf_csum_update_proto;
+	case BPF_FUNC_l3_csum_replace:
+		return &bpf_l3_csum_replace_proto;
+	case BPF_FUNC_l4_csum_replace:
+		return &bpf_l4_csum_replace_proto;
+	case BPF_FUNC_set_hash_invalid:
+		return &bpf_set_hash_invalid_proto;
+	default:
+		return lwt_inout_func_proto(func_id);
+	}
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
 	if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2676,6 +2785,39 @@ static bool sk_filter_is_valid_access(int off, int size,
 	return __is_valid_access(off, size, type);
 }
 
+static bool lwt_is_valid_access(int off, int size,
+				enum bpf_access_type type,
+				enum bpf_reg_type *reg_type)
+{
+	switch (off) {
+	case offsetof(struct __sk_buff, tc_classid):
+		return false;
+	}
+
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case offsetof(struct __sk_buff, mark):
+		case offsetof(struct __sk_buff, priority):
+		case offsetof(struct __sk_buff, cb[0]) ...
+		     offsetof(struct __sk_buff, cb[4]):
+			break;
+		default:
+			return false;
+		}
+	}
+
+	switch (off) {
+	case offsetof(struct __sk_buff, data):
+		*reg_type = PTR_TO_PACKET;
+		break;
+	case offsetof(struct __sk_buff, data_end):
+		*reg_type = PTR_TO_PACKET_END;
+		break;
+	}
+
+	return __is_valid_access(off, size, type);
+}
+
 static int tc_cls_act_prologue(struct bpf_insn *insn_buf, bool direct_write,
 			       const struct bpf_prog *prog)
 {
@@ -3007,6 +3149,19 @@ static const struct bpf_verifier_ops cg_skb_ops = {
 	.convert_ctx_access	= sk_filter_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops lwt_inout_ops = {
+	.get_func_proto		= lwt_inout_func_proto,
+	.is_valid_access	= lwt_is_valid_access,
+	.convert_ctx_access	= sk_filter_convert_ctx_access,
+};
+
+static const struct bpf_verifier_ops lwt_xmit_ops = {
+	.get_func_proto		= lwt_xmit_func_proto,
+	.is_valid_access	= lwt_is_valid_access,
+	.convert_ctx_access	= sk_filter_convert_ctx_access,
+	.gen_prologue		= tc_cls_act_prologue,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops	= &sk_filter_ops,
 	.type	= BPF_PROG_TYPE_SOCKET_FILTER,
@@ -3032,6 +3187,21 @@ static struct bpf_prog_type_list cg_skb_type __read_mostly = {
 	.type	= BPF_PROG_TYPE_CGROUP_SKB,
 };
 
+static struct bpf_prog_type_list lwt_in_type __read_mostly = {
+	.ops	= &lwt_inout_ops,
+	.type	= BPF_PROG_TYPE_LWT_IN,
+};
+
+static struct bpf_prog_type_list lwt_out_type __read_mostly = {
+	.ops	= &lwt_inout_ops,
+	.type	= BPF_PROG_TYPE_LWT_OUT,
+};
+
+static struct bpf_prog_type_list lwt_xmit_type __read_mostly = {
+	.ops	= &lwt_xmit_ops,
+	.type	= BPF_PROG_TYPE_LWT_XMIT,
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
@@ -3039,6 +3209,9 @@ static int __init register_sk_filter_ops(void)
 	bpf_register_prog_type(&sched_act_type);
 	bpf_register_prog_type(&xdp_type);
 	bpf_register_prog_type(&cg_skb_type);
+	bpf_register_prog_type(&lwt_in_type);
+	bpf_register_prog_type(&lwt_out_type);
+	bpf_register_prog_type(&lwt_xmit_type);
 
 	return 0;
 }
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
new file mode 100644
index 0000000..71bb3e2
--- /dev/null
+++ b/net/core/lwt_bpf.c
@@ -0,0 +1,396 @@
+/* Copyright (c) 2016 Thomas Graf <tgraf@tgraf.ch>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <net/lwtunnel.h>
+
+struct bpf_lwt_prog {
+	struct bpf_prog *prog;
+	char *name;
+};
+
+struct bpf_lwt {
+	struct bpf_lwt_prog in;
+	struct bpf_lwt_prog out;
+	struct bpf_lwt_prog xmit;
+	int family;
+};
+
+#define MAX_PROG_NAME 256
+
+static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
+{
+	return (struct bpf_lwt *)lwt->data;
+}
+
+#define NO_REDIRECT false
+#define CAN_REDIRECT true
+
+static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
+		       struct dst_entry *dst, bool can_redirect)
+{
+	int ret;
+
+	/* Preempt disable is needed to protect per-cpu redirect_info between
+	 * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
+	 * access to maps strictly require a rcu_read_lock() for protection,
+	 * mixing with BH RCU lock doesn't work.
+	 */
+	preempt_disable();
+	rcu_read_lock();
+	bpf_compute_data_end(skb);
+	ret = bpf_prog_run_save_cb(lwt->prog, skb);
+	rcu_read_unlock();
+
+	switch (ret) {
+	case BPF_OK:
+		break;
+
+	case BPF_REDIRECT:
+		if (unlikely(!can_redirect)) {
+			pr_warn_once("Illegal redirect return code in prog %s\n",
+				     lwt->name ? : "<unknown>");
+			ret = BPF_OK;
+		} else {
+			ret = skb_do_redirect(skb);
+			if (ret == 0)
+				ret = BPF_REDIRECT;
+		}
+		break;
+
+	case BPF_DROP:
+		kfree_skb(skb);
+		ret = -EPERM;
+		break;
+
+	default:
+		pr_warn_once("bpf-lwt: Illegal return value %u, expect packet loss\n", ret);
+		kfree_skb(skb);
+		ret = -EINVAL;
+		break;
+	}
+
+	preempt_enable();
+
+	return ret;
+}
+
+static int bpf_input(struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct bpf_lwt *bpf;
+	int ret;
+
+	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
+	if (bpf->in.prog) {
+		ret = run_lwt_bpf(skb, &bpf->in, dst, NO_REDIRECT);
+		if (ret < 0)
+			return ret;
+	}
+
+	if (unlikely(!dst->lwtstate->orig_input)) {
+		pr_warn_once("orig_input not set on dst for prog %s\n",
+			     bpf->in.name);
+		kfree_skb(skb);
+		return -EINVAL;
+	}
+
+	return dst->lwtstate->orig_input(skb);
+}
+
+static int bpf_output(struct net *net, struct sock *sk, struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct bpf_lwt *bpf;
+	int ret;
+
+	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
+	if (bpf->out.prog) {
+		ret = run_lwt_bpf(skb, &bpf->out, dst, NO_REDIRECT);
+		if (ret < 0)
+			return ret;
+	}
+
+	if (unlikely(!dst->lwtstate->orig_output)) {
+		pr_warn_once("orig_output not set on dst for prog %s\n",
+			     bpf->out.name);
+		kfree_skb(skb);
+		return -EINVAL;
+	}
+
+	return dst->lwtstate->orig_output(net, sk, skb);
+}
+
+static int xmit_check_hhlen(struct sk_buff *skb)
+{
+	int hh_len = skb_dst(skb)->dev->hard_header_len;
+
+	if (skb_headroom(skb) < hh_len) {
+		int nhead = HH_DATA_ALIGN(hh_len - skb_headroom(skb));
+
+		if (pskb_expand_head(skb, nhead, 0, GFP_ATOMIC))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int bpf_xmit(struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct bpf_lwt *bpf;
+
+	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
+	if (bpf->xmit.prog) {
+		int ret;
+
+		ret = run_lwt_bpf(skb, &bpf->xmit, dst, CAN_REDIRECT);
+		switch (ret) {
+		case BPF_OK:
+			/* If the header was expanded, headroom might be too
+			 * small for L2 header to come, expand as needed.
+			 */
+			ret = xmit_check_hhlen(skb);
+			if (unlikely(ret))
+				return ret;
+
+			return LWTUNNEL_XMIT_CONTINUE;
+		case BPF_REDIRECT:
+			return LWTUNNEL_XMIT_DONE;
+		default:
+			return ret;
+		}
+	}
+
+	return LWTUNNEL_XMIT_CONTINUE;
+}
+
+static void bpf_lwt_prog_destroy(struct bpf_lwt_prog *prog)
+{
+	if (prog->prog)
+		bpf_prog_put(prog->prog);
+
+	kfree(prog->name);
+}
+
+static void bpf_destroy_state(struct lwtunnel_state *lwt)
+{
+	struct bpf_lwt *bpf = bpf_lwt_lwtunnel(lwt);
+
+	bpf_lwt_prog_destroy(&bpf->in);
+	bpf_lwt_prog_destroy(&bpf->out);
+	bpf_lwt_prog_destroy(&bpf->xmit);
+}
+
+static const struct nla_policy bpf_prog_policy[LWT_BPF_PROG_MAX + 1] = {
+	[LWT_BPF_PROG_FD]   = { .type = NLA_U32, },
+	[LWT_BPF_PROG_NAME] = { .type = NLA_NUL_STRING,
+				.len = MAX_PROG_NAME },
+};
+
+static int bpf_parse_prog(struct nlattr *attr, struct bpf_lwt_prog *prog,
+			  enum bpf_prog_type type)
+{
+	struct nlattr *tb[LWT_BPF_PROG_MAX + 1];
+	struct bpf_prog *p;
+	int ret;
+	u32 fd;
+
+	ret = nla_parse_nested(tb, LWT_BPF_PROG_MAX, attr, bpf_prog_policy);
+	if (ret < 0)
+		return ret;
+
+	if (!tb[LWT_BPF_PROG_FD] || !tb[LWT_BPF_PROG_NAME])
+		return -EINVAL;
+
+	prog->name = nla_memdup(tb[LWT_BPF_PROG_NAME], GFP_KERNEL);
+	if (!prog->name)
+		return -ENOMEM;
+
+	fd = nla_get_u32(tb[LWT_BPF_PROG_FD]);
+	p = bpf_prog_get_type(fd, type);
+	if (IS_ERR(p))
+		return PTR_ERR(p);
+
+	prog->prog = p;
+
+	return 0;
+}
+
+static const struct nla_policy bpf_nl_policy[LWT_BPF_MAX + 1] = {
+	[LWT_BPF_IN]		= { .type = NLA_NESTED, },
+	[LWT_BPF_OUT]		= { .type = NLA_NESTED, },
+	[LWT_BPF_XMIT]		= { .type = NLA_NESTED, },
+	[LWT_BPF_XMIT_HEADROOM]	= { .type = NLA_U32 },
+};
+
+static int bpf_build_state(struct net_device *dev, struct nlattr *nla,
+			   unsigned int family, const void *cfg,
+			   struct lwtunnel_state **ts)
+{
+	struct nlattr *tb[LWT_BPF_MAX + 1];
+	struct lwtunnel_state *newts;
+	struct bpf_lwt *bpf;
+	int ret;
+
+	if (family != AF_INET && family != AF_INET6)
+		return -EAFNOSUPPORT;
+
+	ret = nla_parse_nested(tb, LWT_BPF_MAX, nla, bpf_nl_policy);
+	if (ret < 0)
+		return ret;
+
+	if (!tb[LWT_BPF_IN] && !tb[LWT_BPF_OUT] && !tb[LWT_BPF_XMIT])
+		return -EINVAL;
+
+	newts = lwtunnel_state_alloc(sizeof(*bpf));
+	if (!newts)
+		return -ENOMEM;
+
+	newts->type = LWTUNNEL_ENCAP_BPF;
+	bpf = bpf_lwt_lwtunnel(newts);
+
+	if (tb[LWT_BPF_IN]) {
+		newts->flags |= LWTUNNEL_STATE_INPUT_REDIRECT;
+		ret = bpf_parse_prog(tb[LWT_BPF_IN], &bpf->in,
+				     BPF_PROG_TYPE_LWT_IN);
+		if (ret < 0)
+			goto errout;
+	}
+
+	if (tb[LWT_BPF_OUT]) {
+		newts->flags |= LWTUNNEL_STATE_OUTPUT_REDIRECT;
+		ret = bpf_parse_prog(tb[LWT_BPF_OUT], &bpf->out,
+				     BPF_PROG_TYPE_LWT_OUT);
+		if (ret < 0)
+			goto errout;
+	}
+
+	if (tb[LWT_BPF_XMIT]) {
+		newts->flags |= LWTUNNEL_STATE_XMIT_REDIRECT;
+		ret = bpf_parse_prog(tb[LWT_BPF_XMIT], &bpf->xmit,
+				     BPF_PROG_TYPE_LWT_XMIT);
+		if (ret < 0)
+			goto errout;
+	}
+
+	if (tb[LWT_BPF_XMIT_HEADROOM]) {
+		u32 headroom = nla_get_u32(tb[LWT_BPF_XMIT_HEADROOM]);
+
+		if (headroom > LWT_BPF_MAX_HEADROOM) {
+			ret = -ERANGE;
+			goto errout;
+		}
+
+		newts->headroom = headroom;
+	}
+
+	bpf->family = family;
+	*ts = newts;
+
+	return 0;
+
+errout:
+	bpf_destroy_state(newts);
+	kfree(newts);
+	return ret;
+}
+
+static int bpf_fill_lwt_prog(struct sk_buff *skb, int attr,
+			     struct bpf_lwt_prog *prog)
+{
+	struct nlattr *nest;
+
+	if (!prog->prog)
+		return 0;
+
+	nest = nla_nest_start(skb, attr);
+	if (!nest)
+		return -EMSGSIZE;
+
+	if (prog->name &&
+	    nla_put_string(skb, LWT_BPF_PROG_NAME, prog->name))
+		return -EMSGSIZE;
+
+	return nla_nest_end(skb, nest);
+}
+
+static int bpf_fill_encap_info(struct sk_buff *skb, struct lwtunnel_state *lwt)
+{
+	struct bpf_lwt *bpf = bpf_lwt_lwtunnel(lwt);
+
+	if (bpf_fill_lwt_prog(skb, LWT_BPF_IN, &bpf->in) < 0 ||
+	    bpf_fill_lwt_prog(skb, LWT_BPF_OUT, &bpf->out) < 0 ||
+	    bpf_fill_lwt_prog(skb, LWT_BPF_XMIT, &bpf->xmit) < 0)
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+static int bpf_encap_nlsize(struct lwtunnel_state *lwtstate)
+{
+	int nest_len = nla_total_size(sizeof(struct nlattr)) +
+		       nla_total_size(MAX_PROG_NAME) + /* LWT_BPF_PROG_NAME */
+		       0;
+
+	return nest_len + /* LWT_BPF_IN */
+	       nest_len + /* LWT_BPF_OUT */
+	       nest_len + /* LWT_BPF_XMIT */
+	       0;
+}
+
+int bpf_lwt_prog_cmp(struct bpf_lwt_prog *a, struct bpf_lwt_prog *b)
+{
+	/* FIXME:
+	 * The LWT state is currently rebuilt for delete requests which
+	 * results in a new bpf_prog instance. Comparing names for now.
+	 */
+	if (!a->name && !b->name)
+		return 0;
+
+	if (!a->name || !b->name)
+		return 1;
+
+	return strcmp(a->name, b->name);
+}
+
+static int bpf_encap_cmp(struct lwtunnel_state *a, struct lwtunnel_state *b)
+{
+	struct bpf_lwt *a_bpf = bpf_lwt_lwtunnel(a);
+	struct bpf_lwt *b_bpf = bpf_lwt_lwtunnel(b);
+
+	return bpf_lwt_prog_cmp(&a_bpf->in, &b_bpf->in) ||
+	       bpf_lwt_prog_cmp(&a_bpf->out, &b_bpf->out) ||
+	       bpf_lwt_prog_cmp(&a_bpf->xmit, &b_bpf->xmit);
+}
+
+static const struct lwtunnel_encap_ops bpf_encap_ops = {
+	.build_state	= bpf_build_state,
+	.destroy_state	= bpf_destroy_state,
+	.input		= bpf_input,
+	.output		= bpf_output,
+	.xmit		= bpf_xmit,
+	.fill_encap	= bpf_fill_encap_info,
+	.get_encap_size = bpf_encap_nlsize,
+	.cmp_encap	= bpf_encap_cmp,
+};
+
+static int __init bpf_lwt_init(void)
+{
+	return lwtunnel_encap_add_ops(&bpf_encap_ops, LWTUNNEL_ENCAP_BPF);
+}
+
+subsys_initcall(bpf_lwt_init)
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 03976e9..a5d4e86 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -41,6 +41,8 @@ static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type)
 		return "ILA";
 	case LWTUNNEL_ENCAP_SEG6:
 		return "SEG6";
+	case LWTUNNEL_ENCAP_BPF:
+		return "BPF";
 	case LWTUNNEL_ENCAP_IP6:
 	case LWTUNNEL_ENCAP_IP:
 	case LWTUNNEL_ENCAP_NONE:
-- 
2.7.4

* [PATCH net-next v4 4/4] bpf: Add tests and samples for LWT-BPF
From: Thomas Graf @ 2016-11-30 16:10 UTC
  To: davem; +Cc: netdev, alexei.starovoitov, daniel, tom, roopa, hannes
In-Reply-To: <cover.1480522144.git.tgraf@suug.ch>

Adds a series of tests to verify the functionality of attaching
BPF programs at LWT hooks.

Also adds a sample which collects a histogram of packet sizes which
pass through an LWT hook.

$ ./lwt_len_hist.sh
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.253.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    39857.69
       1 -> 1        : 0        |                                      |
       2 -> 3        : 0        |                                      |
       4 -> 7        : 0        |                                      |
       8 -> 15       : 0        |                                      |
      16 -> 31       : 0        |                                      |
      32 -> 63       : 22       |                                      |
      64 -> 127      : 98       |                                      |
     128 -> 255      : 213      |                                      |
     256 -> 511      : 1444251  |********                              |
     512 -> 1023     : 660610   |***                                   |
    1024 -> 2047     : 535241   |**                                    |
    2048 -> 4095     : 19       |                                      |
    4096 -> 8191     : 180      |                                      |
    8192 -> 16383    : 5578023  |************************************* |
   16384 -> 32767    : 632099   |***                                   |
   32768 -> 65535    : 6575     |                                      |
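
The power-of-two buckets shown above can be reproduced in userspace with
a small sketch such as the following. It assumes each packet length is
binned by floor(log2(len)); the helper names bucket_index(),
bucket_low() and bucket_high() are hypothetical and not part of the
sample.

```c
#include <stdint.h>

/* Index of the highest set bit, i.e. floor(log2(v)); 0 for v == 0. */
static unsigned int bucket_index(uint64_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Smallest length that falls into bucket idx (idx must be below 64). */
static uint64_t bucket_low(unsigned int idx)
{
	return 1ULL << idx;
}

/* Largest length that falls into bucket idx. */
static uint64_t bucket_high(unsigned int idx)
{
	/* Avoid shifting by 64, which is undefined in C. */
	if (idx >= 63)
		return UINT64_MAX;
	return (1ULL << (idx + 1)) - 1;
}
```

With this binning, a 300 byte packet lands in the "256 -> 511" row and a
16384 byte packet in the "16384 -> 32767" row.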

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 samples/bpf/Makefile            |   4 +
 samples/bpf/bpf_helpers.h       |   4 +
 samples/bpf/lwt_len_hist.sh     |  37 ++++
 samples/bpf/lwt_len_hist_kern.c |  82 +++++++++
 samples/bpf/lwt_len_hist_user.c |  76 ++++++++
 samples/bpf/test_lwt_bpf.c      | 253 +++++++++++++++++++++++++
 samples/bpf/test_lwt_bpf.sh     | 399 ++++++++++++++++++++++++++++++++++++++++
 7 files changed, 855 insertions(+)
 create mode 100755 samples/bpf/lwt_len_hist.sh
 create mode 100644 samples/bpf/lwt_len_hist_kern.c
 create mode 100644 samples/bpf/lwt_len_hist_user.c
 create mode 100644 samples/bpf/test_lwt_bpf.c
 create mode 100755 samples/bpf/test_lwt_bpf.sh

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 22b6407e..3161f82 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -29,6 +29,7 @@ hostprogs-y += test_current_task_under_cgroup
 hostprogs-y += trace_event
 hostprogs-y += sampleip
 hostprogs-y += tc_l2_redirect
+hostprogs-y += lwt_len_hist
 
 test_lru_dist-objs := test_lru_dist.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
@@ -59,6 +60,7 @@ test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \
 trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
 sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
 tc_l2_redirect-objs := bpf_load.o libbpf.o tc_l2_redirect_user.o
+lwt_len_hist-objs := bpf_load.o libbpf.o lwt_len_hist_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -89,6 +91,7 @@ always += xdp2_kern.o
 always += test_current_task_under_cgroup_kern.o
 always += trace_event_kern.o
 always += sampleip_kern.o
+always += lwt_len_hist_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(objtree)/tools/testing/selftests/bpf/
@@ -117,6 +120,7 @@ HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
 HOSTLOADLIBES_trace_event += -lelf
 HOSTLOADLIBES_sampleip += -lelf
 HOSTLOADLIBES_tc_l2_redirect += -l elf
+HOSTLOADLIBES_lwt_len_hist += -l elf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 90f44bd..a246c61 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -80,6 +80,8 @@ struct bpf_map_def {
 	unsigned int map_flags;
 };
 
+static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) =
+	(void *) BPF_FUNC_skb_load_bytes;
 static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from, int len, int flags) =
 	(void *) BPF_FUNC_skb_store_bytes;
 static int (*bpf_l3_csum_replace)(void *ctx, int off, int from, int to, int flags) =
@@ -88,6 +90,8 @@ static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flag
 	(void *) BPF_FUNC_l4_csum_replace;
 static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
 	(void *) BPF_FUNC_skb_under_cgroup;
+static int (*bpf_skb_change_head)(void *, int len, int flags) =
+	(void *) BPF_FUNC_skb_change_head;
 
 #if defined(__x86_64__)
 
diff --git a/samples/bpf/lwt_len_hist.sh b/samples/bpf/lwt_len_hist.sh
new file mode 100755
index 0000000..7d56774
--- /dev/null
+++ b/samples/bpf/lwt_len_hist.sh
@@ -0,0 +1,37 @@
+#!/bin/bash
+
+NS1=lwt_ns1
+VETH0=tst_lwt1a
+VETH1=tst_lwt1b
+
+TRACE_ROOT=/sys/kernel/debug/tracing
+
+function cleanup {
+	ip route del 192.168.253.2/32 dev $VETH0 2> /dev/null
+	ip link del $VETH0 2> /dev/null
+	ip link del $VETH1 2> /dev/null
+	ip netns exec $NS1 killall netserver
+	ip netns delete $NS1 2> /dev/null
+}
+
+cleanup
+
+ip netns add $NS1
+ip link add $VETH0 type veth peer name $VETH1
+ip link set dev $VETH0 up
+ip addr add 192.168.253.1/24 dev $VETH0
+ip link set $VETH1 netns $NS1
+ip netns exec $NS1 ip link set dev $VETH1 up
+ip netns exec $NS1 ip addr add 192.168.253.2/24 dev $VETH1
+ip netns exec $NS1 netserver
+
+echo 1 > ${TRACE_ROOT}/tracing_on
+cp /dev/null ${TRACE_ROOT}/trace
+ip route add 192.168.253.2/32 encap bpf out obj lwt_len_hist_kern.o section len_hist dev $VETH0
+netperf -H 192.168.253.2 -t TCP_STREAM
+cat ${TRACE_ROOT}/trace | grep -v '^#'
+./lwt_len_hist
+cleanup
+echo 0 > ${TRACE_ROOT}/tracing_on
+
+exit 0
diff --git a/samples/bpf/lwt_len_hist_kern.c b/samples/bpf/lwt_len_hist_kern.c
new file mode 100644
index 0000000..df75383
--- /dev/null
+++ b/samples/bpf/lwt_len_hist_kern.c
@@ -0,0 +1,82 @@
+/* Copyright (c) 2016 Thomas Graf <tgraf@tgraf.ch>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/ip.h>
+#include <uapi/linux/in.h>
+#include "bpf_helpers.h"
+
+# define printk(fmt, ...)						\
+		({							\
+			char ____fmt[] = fmt;				\
+			bpf_trace_printk(____fmt, sizeof(____fmt),	\
+				     ##__VA_ARGS__);			\
+		})
+
+struct bpf_elf_map {
+	__u32 type;
+	__u32 size_key;
+	__u32 size_value;
+	__u32 max_elem;
+	__u32 flags;
+	__u32 id;
+	__u32 pinning;
+};
+
+struct bpf_elf_map SEC("maps") lwt_len_hist_map = {
+	.type = BPF_MAP_TYPE_PERCPU_HASH,
+	.size_key = sizeof(__u64),
+	.size_value = sizeof(__u64),
+	.pinning = 2,
+	.max_elem = 1024,
+};
+
+static unsigned int log2(unsigned int v)
+{
+	unsigned int r;
+	unsigned int shift;
+
+	r = (v > 0xFFFF) << 4; v >>= r;
+	shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
+	shift = (v > 0xF) << 2; v >>= shift; r |= shift;
+	shift = (v > 0x3) << 1; v >>= shift; r |= shift;
+	r |= (v >> 1);
+	return r;
+}
+
+static unsigned int log2l(unsigned long v)
+{
+	unsigned int hi = v >> 32;
+	if (hi)
+		return log2(hi) + 32;
+	else
+		return log2(v);
+}
+
+SEC("len_hist")
+int do_len_hist(struct __sk_buff *skb)
+{
+	__u64 *value, key, init_val = 1;
+
+	key = log2l(skb->len);
+
+	value = bpf_map_lookup_elem(&lwt_len_hist_map, &key);
+	if (value)
+		__sync_fetch_and_add(value, 1);
+	else
+		bpf_map_update_elem(&lwt_len_hist_map, &key, &init_val, BPF_ANY);
+
+	return BPF_OK;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/lwt_len_hist_user.c b/samples/bpf/lwt_len_hist_user.c
new file mode 100644
index 0000000..05d783f
--- /dev/null
+++ b/samples/bpf/lwt_len_hist_user.c
@@ -0,0 +1,76 @@
+#include <linux/unistd.h>
+#include <linux/bpf.h>
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <arpa/inet.h>
+
+#include "libbpf.h"
+#include "bpf_util.h"
+
+#define MAX_INDEX 64
+#define MAX_STARS 38
+
+static void stars(char *str, long val, long max, int width)
+{
+	int i;
+
+	for (i = 0; i < (width * val / max) - 1 && i < width - 1; i++)
+		str[i] = '*';
+	if (val > max)
+		str[i - 1] = '+';
+	str[i] = '\0';
+}
+
+int main(int argc, char **argv)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	const char *map_filename = "/sys/fs/bpf/tc/globals/lwt_len_hist_map";
+	uint64_t values[nr_cpus], sum, max_value = 0, data[MAX_INDEX] = {};
+	uint64_t key = 0, next_key, max_key = 0;
+	char starstr[MAX_STARS];
+	int i, map_fd;
+
+	map_fd = bpf_obj_get(map_filename);
+	if (map_fd < 0) {
+		fprintf(stderr, "bpf_obj_get(%s): %s(%d)\n",
+			map_filename, strerror(errno), errno);
+		return -1;
+	}
+
+	while (bpf_get_next_key(map_fd, &key, &next_key) == 0) {
+		if (next_key >= MAX_INDEX) {
+			fprintf(stderr, "Key %lu out of bounds\n", next_key);
+			continue;
+		}
+
+		bpf_lookup_elem(map_fd, &next_key, values);
+
+		sum = 0;
+		for (i = 0; i < nr_cpus; i++)
+			sum += values[i];
+
+		data[next_key] = sum;
+		if (sum && next_key > max_key)
+			max_key = next_key;
+
+		if (sum > max_value)
+			max_value = sum;
+
+		key = next_key;
+	}
+
+	for (i = 1; i <= max_key + 1; i++) {
+		stars(starstr, data[i - 1], max_value, MAX_STARS);
+		printf("%8ld -> %-8ld : %-8ld |%-*s|\n",
+		       (1l << i) >> 1, (1l << i) - 1, data[i - 1],
+		       MAX_STARS, starstr);
+	}
+
+	close(map_fd);
+
+	return 0;
+}
diff --git a/samples/bpf/test_lwt_bpf.c b/samples/bpf/test_lwt_bpf.c
new file mode 100644
index 0000000..bacc801
--- /dev/null
+++ b/samples/bpf/test_lwt_bpf.c
@@ -0,0 +1,253 @@
+/* Copyright (c) 2016 Thomas Graf <tgraf@tgraf.ch>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <stdint.h>
+#include <stddef.h>
+#include <linux/bpf.h>
+#include <linux/ip.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/icmpv6.h>
+#include <linux/if_ether.h>
+#include "bpf_helpers.h"
+#include <string.h>
+
+# define printk(fmt, ...)						\
+		({							\
+			char ____fmt[] = fmt;				\
+			bpf_trace_printk(____fmt, sizeof(____fmt),	\
+				     ##__VA_ARGS__);			\
+		})
+
+#define CB_MAGIC 1234
+
+/* Test: Pass all packets through */
+SEC("nop")
+int do_nop(struct __sk_buff *skb)
+{
+	return BPF_OK;
+}
+
+/* Test: Verify context information can be accessed */
+SEC("test_ctx")
+int do_test_ctx(struct __sk_buff *skb)
+{
+	skb->cb[0] = CB_MAGIC;
+	printk("len %d hash %d protocol %d\n", skb->len, skb->hash,
+	       skb->protocol);
+	printk("cb %d ingress_ifindex %d ifindex %d\n", skb->cb[0],
+	       skb->ingress_ifindex, skb->ifindex);
+
+	return BPF_OK;
+}
+
+/* Test: Ensure skb->cb[] buffer is cleared */
+SEC("test_cb")
+int do_test_cb(struct __sk_buff *skb)
+{
+	printk("cb0: %x cb1: %x cb2: %x\n", skb->cb[0], skb->cb[1],
+	       skb->cb[2]);
+	printk("cb3: %x cb4: %x\n", skb->cb[3], skb->cb[4]);
+
+	return BPF_OK;
+}
+
+/* Test: Verify skb data can be read */
+SEC("test_data")
+int do_test_data(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct iphdr *iph = data;
+
+	if (data + sizeof(*iph) > data_end) {
+		printk("packet truncated\n");
+		return BPF_DROP;
+	}
+
+	printk("src: %x dst: %x\n", iph->saddr, iph->daddr);
+
+	return BPF_OK;
+}
+
+#define IP_CSUM_OFF offsetof(struct iphdr, check)
+#define IP_DST_OFF offsetof(struct iphdr, daddr)
+#define IP_SRC_OFF offsetof(struct iphdr, saddr)
+#define IP_PROTO_OFF offsetof(struct iphdr, protocol)
+#define TCP_CSUM_OFF offsetof(struct tcphdr, check)
+#define UDP_CSUM_OFF offsetof(struct udphdr, check)
+#define IS_PSEUDO 0x10
+
+static inline int rewrite(struct __sk_buff *skb, uint32_t old_ip,
+			  uint32_t new_ip, int rw_daddr)
+{
+	int ret, off = 0, flags = IS_PSEUDO;
+	uint8_t proto;
+
+	ret = bpf_skb_load_bytes(skb, IP_PROTO_OFF, &proto, 1);
+	if (ret < 0) {
+		printk("bpf_skb_load_bytes failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	switch (proto) {
+	case IPPROTO_TCP:
+		off = TCP_CSUM_OFF;
+		break;
+
+	case IPPROTO_UDP:
+		off = UDP_CSUM_OFF;
+		flags |= BPF_F_MARK_MANGLED_0;
+		break;
+
+	case IPPROTO_ICMPV6:
+		off = offsetof(struct icmp6hdr, icmp6_cksum);
+		break;
+	}
+
+	if (off) {
+		ret = bpf_l4_csum_replace(skb, off, old_ip, new_ip,
+					  flags | sizeof(new_ip));
+		if (ret < 0) {
+			printk("bpf_l4_csum_replace failed: %d\n", ret);
+			return BPF_DROP;
+		}
+	}
+
+	ret = bpf_l3_csum_replace(skb, IP_CSUM_OFF, old_ip, new_ip, sizeof(new_ip));
+	if (ret < 0) {
+		printk("bpf_l3_csum_replace failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	if (rw_daddr)
+		ret = bpf_skb_store_bytes(skb, IP_DST_OFF, &new_ip, sizeof(new_ip), 0);
+	else
+		ret = bpf_skb_store_bytes(skb, IP_SRC_OFF, &new_ip, sizeof(new_ip), 0);
+
+	if (ret < 0) {
+		printk("bpf_skb_store_bytes() failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	return BPF_OK;
+}
+
+/* Test: Verify skb data can be modified */
+SEC("test_rewrite")
+int do_test_rewrite(struct __sk_buff *skb)
+{
+	uint32_t old_ip, new_ip = 0x3fea8c0;
+	int ret;
+
+	ret = bpf_skb_load_bytes(skb, IP_DST_OFF, &old_ip, 4);
+	if (ret < 0) {
+		printk("bpf_skb_load_bytes failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	if (old_ip == 0x2fea8c0) {
+		printk("out: rewriting from %x to %x\n", old_ip, new_ip);
+		return rewrite(skb, old_ip, new_ip, 1);
+	}
+
+	return BPF_OK;
+}
+
+static inline int __do_push_ll_and_redirect(struct __sk_buff *skb)
+{
+	uint64_t smac = SRC_MAC, dmac = DST_MAC;
+	int ret, ifindex = DST_IFINDEX;
+	struct ethhdr ehdr;
+
+	ret = bpf_skb_change_head(skb, 14, 0);
+	if (ret < 0) {
+		printk("skb_change_head() failed: %d\n", ret);
+	}
+
+	ehdr.h_proto = __constant_htons(ETH_P_IP);
+	memcpy(&ehdr.h_source, &smac, 6);
+	memcpy(&ehdr.h_dest, &dmac, 6);
+
+	ret = bpf_skb_store_bytes(skb, 0, &ehdr, sizeof(ehdr), 0);
+	if (ret < 0) {
+		printk("skb_store_bytes() failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	return bpf_redirect(ifindex, 0);
+}
+
+SEC("push_ll_and_redirect_silent")
+int do_push_ll_and_redirect_silent(struct __sk_buff *skb)
+{
+	return __do_push_ll_and_redirect(skb);
+}
+
+SEC("push_ll_and_redirect")
+int do_push_ll_and_redirect(struct __sk_buff *skb)
+{
+	int ret, ifindex = DST_IFINDEX;
+
+	ret = __do_push_ll_and_redirect(skb);
+	if (ret >= 0)
+		printk("redirected to %d\n", ifindex);
+
+	return ret;
+}
+
+static inline void __fill_garbage(struct __sk_buff *skb)
+{
+	uint64_t f = 0xFFFFFFFFFFFFFFFF;
+
+	bpf_skb_store_bytes(skb, 0, &f, sizeof(f), 0);
+	bpf_skb_store_bytes(skb, 8, &f, sizeof(f), 0);
+	bpf_skb_store_bytes(skb, 16, &f, sizeof(f), 0);
+	bpf_skb_store_bytes(skb, 24, &f, sizeof(f), 0);
+	bpf_skb_store_bytes(skb, 32, &f, sizeof(f), 0);
+	bpf_skb_store_bytes(skb, 40, &f, sizeof(f), 0);
+	bpf_skb_store_bytes(skb, 48, &f, sizeof(f), 0);
+	bpf_skb_store_bytes(skb, 56, &f, sizeof(f), 0);
+	bpf_skb_store_bytes(skb, 64, &f, sizeof(f), 0);
+	bpf_skb_store_bytes(skb, 72, &f, sizeof(f), 0);
+	bpf_skb_store_bytes(skb, 80, &f, sizeof(f), 0);
+	bpf_skb_store_bytes(skb, 88, &f, sizeof(f), 0);
+}
+
+SEC("fill_garbage")
+int do_fill_garbage(struct __sk_buff *skb)
+{
+	__fill_garbage(skb);
+	printk("Set initial 96 bytes of header to FF\n");
+	return BPF_OK;
+}
+
+SEC("fill_garbage_and_redirect")
+int do_fill_garbage_and_redirect(struct __sk_buff *skb)
+{
+	int ifindex = DST_IFINDEX;
+	__fill_garbage(skb);
+	printk("redirected to %d\n", ifindex);
+	return bpf_redirect(ifindex, 0);
+}
+
+/* Test: Drop all packets */
+SEC("drop_all")
+int do_drop_all(struct __sk_buff *skb)
+{
+	printk("dropping with: %d\n", BPF_DROP);
+	return BPF_DROP;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_lwt_bpf.sh b/samples/bpf/test_lwt_bpf.sh
new file mode 100755
index 0000000..a695ae2
--- /dev/null
+++ b/samples/bpf/test_lwt_bpf.sh
@@ -0,0 +1,399 @@
+#!/bin/bash
+
+# Uncomment to see generated bytecode
+#VERBOSE=verbose
+
+NS1=lwt_ns1
+NS2=lwt_ns2
+VETH0=tst_lwt1a
+VETH1=tst_lwt1b
+VETH2=tst_lwt2a
+VETH3=tst_lwt2b
+IPVETH0="192.168.254.1"
+IPVETH1="192.168.254.2"
+IPVETH1b="192.168.254.3"
+
+IPVETH2="192.168.111.1"
+IPVETH3="192.168.111.2"
+
+IP_LOCAL="192.168.99.1"
+
+TRACE_ROOT=/sys/kernel/debug/tracing
+
+function lookup_mac()
+{
+	set +x
+	if [ ! -z "$2" ]; then
+		MAC=$(ip netns exec $2 ip link show $1 | grep ether | awk '{print $2}')
+	else
+		MAC=$(ip link show $1 | grep ether | awk '{print $2}')
+	fi
+	MAC="${MAC//:/}"
+	echo "0x${MAC:10:2}${MAC:8:2}${MAC:6:2}${MAC:4:2}${MAC:2:2}${MAC:0:2}"
+	set -x
+}
+
+function cleanup {
+	set +ex
+	rm test_lwt_bpf.o 2> /dev/null
+	ip link del $VETH0 2> /dev/null
+	ip link del $VETH1 2> /dev/null
+	ip link del $VETH2 2> /dev/null
+	ip link del $VETH3 2> /dev/null
+	ip netns exec $NS1 killall netserver
+	ip netns delete $NS1 2> /dev/null
+	ip netns delete $NS2 2> /dev/null
+	set -ex
+}
+
+function setup_one_veth {
+	ip netns add $1
+	ip link add $2 type veth peer name $3
+	ip link set dev $2 up
+	ip addr add $4/24 dev $2
+	ip link set $3 netns $1
+	ip netns exec $1 ip link set dev $3 up
+	ip netns exec $1 ip addr add $5/24 dev $3
+
+	if [ "$6" ]; then
+		ip netns exec $1 ip addr add $6/32 dev $3
+	fi
+}
+
+function get_trace {
+	set +x
+	cat ${TRACE_ROOT}/trace | grep -v '^#'
+	set -x
+}
+
+function cleanup_routes {
+	ip route del ${IPVETH1}/32 dev $VETH0 2> /dev/null || true
+	ip route del table local local ${IP_LOCAL}/32 dev lo 2> /dev/null || true
+}
+
+function install_test {
+	cleanup_routes
+	cp /dev/null ${TRACE_ROOT}/trace
+
+	OPTS="encap bpf headroom 14 $1 obj test_lwt_bpf.o section $2 $VERBOSE"
+
+	if [ "$1" == "in" ];  then
+		ip route add table local local ${IP_LOCAL}/32 $OPTS dev lo
+	else
+		ip route add ${IPVETH1}/32 $OPTS dev $VETH0
+	fi
+}
+
+function remove_prog {
+	if [ "$1" == "in" ];  then
+		ip route del table local local ${IP_LOCAL}/32 dev lo
+	else
+		ip route del ${IPVETH1}/32 dev $VETH0
+	fi
+}
+
+function filter_trace {
+	# Add newline to allow starting EXPECT= variables on newline
+	NL=$'\n'
+	echo "${NL}$*" | sed -e 's/^.*: : //g'
+}
+
+function expect_fail {
+	set +x
+	echo "FAIL:"
+	echo "Expected: $1"
+	echo "Got: $2"
+	set -x
+	exit 1
+}
+
+function match_trace {
+	set +x
+	RET=0
+	TRACE=$1
+	EXPECT=$2
+	GOT="$(filter_trace "$TRACE")"
+
+	[ "$GOT" != "$EXPECT" ] && {
+		expect_fail "$EXPECT" "$GOT"
+		RET=1
+	}
+	set -x
+	return $RET
+}
+
+function test_start {
+	set +x
+	echo "----------------------------------------------------------------"
+	echo "Starting test: $*"
+	echo "----------------------------------------------------------------"
+	set -x
+}
+
+function failure {
+	get_trace
+	echo "FAIL: $*"
+	exit 1
+}
+
+function test_ctx_xmit {
+	test_start "test_ctx on lwt xmit"
+	install_test xmit test_ctx
+	ping -c 3 $IPVETH1 || {
+		failure "test_ctx xmit: packets are dropped"
+	}
+	match_trace "$(get_trace)" "
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 0 ifindex $DST_IFINDEX
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 0 ifindex $DST_IFINDEX
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 0 ifindex $DST_IFINDEX" || exit 1
+	remove_prog xmit
+}
+
+function test_ctx_out {
+	test_start "test_ctx on lwt out"
+	install_test out test_ctx
+	ping -c 3 $IPVETH1 || {
+		failure "test_ctx out: packets are dropped"
+	}
+	match_trace "$(get_trace)" "
+len 84 hash 0 protocol 0
+cb 1234 ingress_ifindex 0 ifindex 0
+len 84 hash 0 protocol 0
+cb 1234 ingress_ifindex 0 ifindex 0
+len 84 hash 0 protocol 0
+cb 1234 ingress_ifindex 0 ifindex 0" || exit 1
+	remove_prog out
+}
+
+function test_ctx_in {
+	test_start "test_ctx on lwt in"
+	install_test in test_ctx
+	ping -c 3 $IP_LOCAL || {
+		failure "test_ctx in: packets are dropped"
+	}
+	# We will see both request & reply packets as the packets will
+	# be from $IP_LOCAL => $IP_LOCAL
+	match_trace "$(get_trace)" "
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1" || exit 1
+	remove_prog in
+}
+
+function test_data {
+	test_start "test_data on lwt $1"
+	install_test $1 test_data
+	ping -c 3 $IPVETH1 || {
+		failure "test_data ${1}: packets are dropped"
+	}
+	match_trace "$(get_trace)" "
+src: 1fea8c0 dst: 2fea8c0
+src: 1fea8c0 dst: 2fea8c0
+src: 1fea8c0 dst: 2fea8c0" || exit 1
+	remove_prog $1
+}
+
+function test_data_in {
+	test_start "test_data on lwt in"
+	install_test in test_data
+	ping -c 3 $IP_LOCAL || {
+		failure "test_data in: packets are dropped"
+	}
+	# We will see both request & reply packets as the packets will
+	# be from $IP_LOCAL => $IP_LOCAL
+	match_trace "$(get_trace)" "
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0" || exit 1
+	remove_prog in
+}
+
+function test_cb {
+	test_start "test_cb on lwt $1"
+	install_test $1 test_cb
+	ping -c 3 $IPVETH1 || {
+		failure "test_cb ${1}: packets are dropped"
+	}
+	match_trace "$(get_trace)" "
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0" || exit 1
+	remove_prog $1
+}
+
+function test_cb_in {
+	test_start "test_cb on lwt in"
+	install_test in test_cb
+	ping -c 3 $IP_LOCAL || {
+		failure "test_cb in: packets are dropped"
+	}
+	# We will see both request & reply packets as the packets will
+	# be from $IP_LOCAL => $IP_LOCAL
+	match_trace "$(get_trace)" "
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0" || exit 1
+	remove_prog in
+}
+
+function test_drop_all {
+	test_start "test_drop_all on lwt $1"
+	install_test $1 drop_all
+	ping -c 3 $IPVETH1 && {
+		failure "test_drop_all ${1}: Unexpected success of ping"
+	}
+	match_trace "$(get_trace)" "
+dropping with: 2
+dropping with: 2
+dropping with: 2" || exit 1
+	remove_prog $1
+}
+
+function test_drop_all_in {
+	test_start "test_drop_all on lwt in"
+	install_test in drop_all
+	ping -c 3 $IP_LOCAL && {
+		failure "test_drop_all in: Unexpected success of ping"
+	}
+	match_trace "$(get_trace)" "
+dropping with: 2
+dropping with: 2
+dropping with: 2" || exit 1
+	remove_prog in
+}
+
+function test_push_ll_and_redirect {
+	test_start "test_push_ll_and_redirect on lwt xmit"
+	install_test xmit push_ll_and_redirect
+	ping -c 3 $IPVETH1 || {
+		failure "Redirected packets appear to be dropped"
+	}
+	match_trace "$(get_trace)" "
+redirected to $DST_IFINDEX
+redirected to $DST_IFINDEX
+redirected to $DST_IFINDEX" || exit 1
+	remove_prog xmit
+}
+
+function test_no_l2_and_redirect {
+	test_start "test_no_l2_and_redirect on lwt xmit"
+	install_test xmit fill_garbage_and_redirect
+	ping -c 3 $IPVETH1 && {
+		failure "Unexpected success despite lack of L2 header"
+	}
+	match_trace "$(get_trace)" "
+redirected to $DST_IFINDEX
+redirected to $DST_IFINDEX
+redirected to $DST_IFINDEX" || exit 1
+	remove_prog xmit
+}
+
+function test_rewrite {
+	test_start "test_rewrite on lwt xmit"
+	install_test xmit test_rewrite
+	ping -c 3 $IPVETH1 || {
+		failure "Rewritten packets appear to be dropped"
+	}
+	match_trace "$(get_trace)" "
+out: rewriting from 2fea8c0 to 3fea8c0
+out: rewriting from 2fea8c0 to 3fea8c0
+out: rewriting from 2fea8c0 to 3fea8c0" || exit 1
+	remove_prog xmit
+}
+
+function test_fill_garbage {
+	test_start "test_fill_garbage on lwt xmit"
+	install_test xmit fill_garbage
+	ping -c 3 $IPVETH1 && {
+		failure "test_fill_garbage: Unexpected success of ping"
+	}
+	match_trace "$(get_trace)" "
+Set initial 96 bytes of header to FF
+Set initial 96 bytes of header to FF
+Set initial 96 bytes of header to FF" || exit 1
+	remove_prog xmit
+}
+
+function test_netperf_nop {
+	test_start "test_netperf_nop on lwt xmit"
+	install_test xmit nop
+	netperf -H $IPVETH1 -t TCP_STREAM || {
+		failure "packets appear to be dropped"
+	}
+	match_trace "$(get_trace)" ""|| exit 1
+	remove_prog xmit
+}
+
+function test_netperf_redirect {
+	test_start "test_netperf_redirect on lwt xmit"
+	install_test xmit push_ll_and_redirect_silent
+	netperf -H $IPVETH1 -t TCP_STREAM || {
+		failure "Redirected packets appear to be dropped"
+	}
+	match_trace "$(get_trace)" ""|| exit 1
+	remove_prog xmit
+}
+
+cleanup
+setup_one_veth $NS1 $VETH0 $VETH1 $IPVETH0 $IPVETH1 $IPVETH1b
+setup_one_veth $NS2 $VETH2 $VETH3 $IPVETH2 $IPVETH3
+ip netns exec $NS1 netserver
+echo 1 > ${TRACE_ROOT}/tracing_on
+
+DST_MAC=$(lookup_mac $VETH1 $NS1)
+SRC_MAC=$(lookup_mac $VETH0)
+DST_IFINDEX=$(cat /sys/class/net/$VETH0/ifindex)
+
+CLANG_OPTS="-O2 -target bpf -I ../include/"
+CLANG_OPTS+=" -DSRC_MAC=$SRC_MAC -DDST_MAC=$DST_MAC -DDST_IFINDEX=$DST_IFINDEX"
+clang $CLANG_OPTS -c test_lwt_bpf.c -o test_lwt_bpf.o
+
+test_ctx_xmit
+test_ctx_out
+test_ctx_in
+test_data "xmit"
+test_data "out"
+test_data_in
+test_cb "xmit"
+test_cb "out"
+test_cb_in
+test_drop_all "xmit"
+test_drop_all "out"
+test_drop_all_in
+test_rewrite
+test_push_ll_and_redirect
+test_no_l2_and_redirect
+test_fill_garbage
+test_netperf_nop
+test_netperf_redirect
+
+cleanup
+echo 0 > ${TRACE_ROOT}/tracing_on
+exit 0
-- 
2.7.4
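
For readers following the lwt_len_hist sample above: the BPF program keys its
per-CPU map by floor(log2(skb->len)), and the user-space tool prints bucket
"key" as the length range 2^key .. 2^(key+1)-1. Below is a minimal user-space
sketch of that bucketing, assuming nothing beyond standard C. The names
log2_u32(), log2_u64() and bucket_range() are hypothetical and exist only for
this illustration; they are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(v)) for a 32-bit value, using the same shift-and-compare
 * steps as log2() in lwt_len_hist_kern.c. Returns 0 for v == 0.
 */
static unsigned int log2_u32(uint32_t v)
{
	unsigned int r, shift;

	r = (v > 0xFFFF) << 4; v >>= r;
	shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
	shift = (v > 0xF) << 2; v >>= shift; r |= shift;
	shift = (v > 0x3) << 1; v >>= shift; r |= shift;
	r |= (v >> 1);
	return r;
}

/* 64-bit wrapper: use the upper half if any of its bits are set. */
static unsigned int log2_u64(uint64_t v)
{
	uint32_t hi = v >> 32;

	return hi ? log2_u32(hi) + 32 : log2_u32((uint32_t)v);
}

/* Hypothetical helper: inclusive length range counted in bucket `key`. */
static void bucket_range(unsigned int key, uint64_t *lo, uint64_t *hi)
{
	assert(key < 64);
	*lo = 1ULL << key;
	*hi = (*lo << 1) - 1;	/* unsigned wrap makes key == 63 well defined */
}
```

With this sketch, the 84-byte ICMP packets seen in the ping tests would land in
bucket 6 (lengths 64 to 127), and a 1500-byte packet in bucket 10 (1024 to
2047).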

^ permalink raw reply related

* Re: [PATCH net-next v5 2/3] bpf: Add new cgroup attach type to enable sock modifications
From: David Ahern @ 2016-11-30 16:24 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <20161130054154.GC31581@ast-mbp.thefacebook.com>

On 11/29/16 10:41 PM, Alexei Starovoitov wrote:
> I don't see a complexity. It was straightforward for skb bitfields,
> but if there is some unforeseen issue, it's better to tackle it now
> otherwise the feature may never come and this 'infra for sockets' will
> stay as 'infra for vrf only' and I'm struggling to convince myself that it's ok.
> So I'll try to tweak this patch to add these 3 fields...
> 

Regardless of whether the change is attached to this patch set or sent as a follow-on, I will make 2 separate patches -- 1 that adds the fields to bpf_sock and updates the filter.c code and a second patch that adds a test case and automation for it. There is no reason to combine it into 1 large patch.

If you think the use case to fail socket create based on family/proto/type checks is valid, it is valid regardless of when the patch comes.

I take your comment to mean you believe if I don't do it now then it won't get done. Am I over-reading your comment? Do you really believe it?

^ permalink raw reply

* [PATCH net-next V2 1/7] net/mlx5e: Implement Fragmented Work Queue (WQ)
From: Saeed Mahameed @ 2016-11-30 15:59 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed
In-Reply-To: <1480521583-12755-1-git-send-email-saeedm@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>

Add a new type, struct mlx5_frag_buf, which is used to allocate fragmented
buffers rather than contiguous ones, and make the Completion Queues (CQs)
use it, as they are big (default of 2MB per CQ in Striding RQ).

This fixes failures of the type:
"mlx5e_open_locked: mlx5e_open_channels failed, -12"
which occur when dma_zalloc_coherent cannot find enough contiguous coherent
memory to satisfy the driver's request as the user tries to set up more or
larger rings.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/alloc.c   | 66 +++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 10 ++--
 drivers/net/ethernet/mellanox/mlx5/core/wq.c      | 26 ++++++---
 drivers/net/ethernet/mellanox/mlx5/core/wq.h      | 18 +++++--
 include/linux/mlx5/driver.h                       | 11 ++++
 6 files changed, 116 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/alloc.c b/drivers/net/ethernet/mellanox/mlx5/core/alloc.c
index 2c6e3c7..44791de 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/alloc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/alloc.c
@@ -106,6 +106,63 @@ void mlx5_buf_free(struct mlx5_core_dev *dev, struct mlx5_buf *buf)
 }
 EXPORT_SYMBOL_GPL(mlx5_buf_free);
 
+int mlx5_frag_buf_alloc_node(struct mlx5_core_dev *dev, int size,
+			     struct mlx5_frag_buf *buf, int node)
+{
+	int i;
+
+	buf->size = size;
+	buf->npages = 1 << get_order(size);
+	buf->page_shift = PAGE_SHIFT;
+	buf->frags = kcalloc(buf->npages, sizeof(struct mlx5_buf_list),
+			     GFP_KERNEL);
+	if (!buf->frags)
+		goto err_out;
+
+	for (i = 0; i < buf->npages; i++) {
+		struct mlx5_buf_list *frag = &buf->frags[i];
+		int frag_sz = min_t(int, size, PAGE_SIZE);
+
+		frag->buf = mlx5_dma_zalloc_coherent_node(dev, frag_sz,
+							  &frag->map, node);
+		if (!frag->buf)
+			goto err_free_buf;
+		if (frag->map & ((1 << buf->page_shift) - 1)) {
+			dma_free_coherent(&dev->pdev->dev, frag_sz,
+					  buf->frags[i].buf, buf->frags[i].map);
+			mlx5_core_warn(dev, "unexpected map alignment: 0x%p, page_shift=%d\n",
+				       (void *)frag->map, buf->page_shift);
+			goto err_free_buf;
+		}
+		size -= frag_sz;
+	}
+
+	return 0;
+
+err_free_buf:
+	while (i--)
+		dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, buf->frags[i].buf,
+				  buf->frags[i].map);
+	kfree(buf->frags);
+err_out:
+	return -ENOMEM;
+}
+
+void mlx5_frag_buf_free(struct mlx5_core_dev *dev, struct mlx5_frag_buf *buf)
+{
+	int size = buf->size;
+	int i;
+
+	for (i = 0; i < buf->npages; i++) {
+		int frag_sz = min_t(int, size, PAGE_SIZE);
+
+		dma_free_coherent(&dev->pdev->dev, frag_sz, buf->frags[i].buf,
+				  buf->frags[i].map);
+		size -= frag_sz;
+	}
+	kfree(buf->frags);
+}
+
 static struct mlx5_db_pgdir *mlx5_alloc_db_pgdir(struct mlx5_core_dev *dev,
 						 int node)
 {
@@ -230,3 +287,12 @@ void mlx5_fill_page_array(struct mlx5_buf *buf, __be64 *pas)
 	}
 }
 EXPORT_SYMBOL_GPL(mlx5_fill_page_array);
+
+void mlx5_fill_page_frag_array(struct mlx5_frag_buf *buf, __be64 *pas)
+{
+	int i;
+
+	for (i = 0; i < buf->npages; i++)
+		pas[i] = cpu_to_be64(buf->frags[i].map);
+}
+EXPORT_SYMBOL_GPL(mlx5_fill_page_frag_array);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 442dbc3..f16f7fb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -286,7 +286,7 @@ struct mlx5e_cq {
 	u16                        decmprs_wqe_counter;
 
 	/* control */
-	struct mlx5_wq_ctrl        wq_ctrl;
+	struct mlx5_frag_wq_ctrl   wq_ctrl;
 } ____cacheline_aligned_in_smp;
 
 struct mlx5e_rq;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 6b492ca..ba25cd3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1201,7 +1201,7 @@ static int mlx5e_create_cq(struct mlx5e_channel *c,
 
 static void mlx5e_destroy_cq(struct mlx5e_cq *cq)
 {
-	mlx5_wq_destroy(&cq->wq_ctrl);
+	mlx5_cqwq_destroy(&cq->wq_ctrl);
 }
 
 static int mlx5e_enable_cq(struct mlx5e_cq *cq, struct mlx5e_cq_param *param)
@@ -1218,7 +1218,7 @@ static int mlx5e_enable_cq(struct mlx5e_cq *cq, struct mlx5e_cq_param *param)
 	int err;
 
 	inlen = MLX5_ST_SZ_BYTES(create_cq_in) +
-		sizeof(u64) * cq->wq_ctrl.buf.npages;
+		sizeof(u64) * cq->wq_ctrl.frag_buf.npages;
 	in = mlx5_vzalloc(inlen);
 	if (!in)
 		return -ENOMEM;
@@ -1227,15 +1227,15 @@ static int mlx5e_enable_cq(struct mlx5e_cq *cq, struct mlx5e_cq_param *param)
 
 	memcpy(cqc, param->cqc, sizeof(param->cqc));
 
-	mlx5_fill_page_array(&cq->wq_ctrl.buf,
-			     (__be64 *)MLX5_ADDR_OF(create_cq_in, in, pas));
+	mlx5_fill_page_frag_array(&cq->wq_ctrl.frag_buf,
+				  (__be64 *)MLX5_ADDR_OF(create_cq_in, in, pas));
 
 	mlx5_vector2eqn(mdev, param->eq_ix, &eqn, &irqn_not_used);
 
 	MLX5_SET(cqc,   cqc, cq_period_mode, param->cq_period_mode);
 	MLX5_SET(cqc,   cqc, c_eqn,         eqn);
 	MLX5_SET(cqc,   cqc, uar_page,      mcq->uar->index);
-	MLX5_SET(cqc,   cqc, log_page_size, cq->wq_ctrl.buf.page_shift -
+	MLX5_SET(cqc,   cqc, log_page_size, cq->wq_ctrl.frag_buf.page_shift -
 					    MLX5_ADAPTER_PAGE_SHIFT);
 	MLX5_SET64(cqc, cqc, dbr_addr,      cq->wq_ctrl.db.dma);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wq.c b/drivers/net/ethernet/mellanox/mlx5/core/wq.c
index 821a087..921673c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/wq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/wq.c
@@ -101,13 +101,15 @@ int mlx5_wq_cyc_create(struct mlx5_core_dev *mdev, struct mlx5_wq_param *param,
 
 int mlx5_cqwq_create(struct mlx5_core_dev *mdev, struct mlx5_wq_param *param,
 		     void *cqc, struct mlx5_cqwq *wq,
-		     struct mlx5_wq_ctrl *wq_ctrl)
+		     struct mlx5_frag_wq_ctrl *wq_ctrl)
 {
 	int err;
 
-	wq->log_stride = 6 + MLX5_GET(cqc, cqc, cqe_sz);
-	wq->log_sz = MLX5_GET(cqc, cqc, log_cq_size);
-	wq->sz_m1 = (1 << wq->log_sz) - 1;
+	wq->log_stride	= 6 + MLX5_GET(cqc, cqc, cqe_sz);
+	wq->log_sz	= MLX5_GET(cqc, cqc, log_cq_size);
+	wq->sz_m1	= (1 << wq->log_sz) - 1;
+	wq->log_frag_strides = PAGE_SHIFT - wq->log_stride;
+	wq->frag_sz_m1	= (1 << wq->log_frag_strides) - 1;
 
 	err = mlx5_db_alloc_node(mdev, &wq_ctrl->db, param->db_numa_node);
 	if (err) {
@@ -115,14 +117,16 @@ int mlx5_cqwq_create(struct mlx5_core_dev *mdev, struct mlx5_wq_param *param,
 		return err;
 	}
 
-	err = mlx5_buf_alloc_node(mdev, mlx5_cqwq_get_byte_size(wq),
-				  &wq_ctrl->buf, param->buf_numa_node);
+	err = mlx5_frag_buf_alloc_node(mdev, mlx5_cqwq_get_byte_size(wq),
+				       &wq_ctrl->frag_buf,
+				       param->buf_numa_node);
 	if (err) {
-		mlx5_core_warn(mdev, "mlx5_buf_alloc_node() failed, %d\n", err);
+		mlx5_core_warn(mdev, "mlx5_frag_buf_alloc_node() failed, %d\n",
+			       err);
 		goto err_db_free;
 	}
 
-	wq->buf = wq_ctrl->buf.direct.buf;
+	wq->frag_buf = wq_ctrl->frag_buf;
 	wq->db  = wq_ctrl->db.db;
 
 	wq_ctrl->mdev = mdev;
@@ -184,3 +188,9 @@ void mlx5_wq_destroy(struct mlx5_wq_ctrl *wq_ctrl)
 	mlx5_buf_free(wq_ctrl->mdev, &wq_ctrl->buf);
 	mlx5_db_free(wq_ctrl->mdev, &wq_ctrl->db);
 }
+
+void mlx5_cqwq_destroy(struct mlx5_frag_wq_ctrl *wq_ctrl)
+{
+	mlx5_frag_buf_free(wq_ctrl->mdev, &wq_ctrl->frag_buf);
+	mlx5_db_free(wq_ctrl->mdev, &wq_ctrl->db);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wq.h b/drivers/net/ethernet/mellanox/mlx5/core/wq.h
index 6c2a8f9..d8afed8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/wq.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/wq.h
@@ -47,6 +47,12 @@ struct mlx5_wq_ctrl {
 	struct mlx5_db		db;
 };
 
+struct mlx5_frag_wq_ctrl {
+	struct mlx5_core_dev	*mdev;
+	struct mlx5_frag_buf	frag_buf;
+	struct mlx5_db		db;
+};
+
 struct mlx5_wq_cyc {
 	void			*buf;
 	__be32			*db;
@@ -55,12 +61,14 @@ struct mlx5_wq_cyc {
 };
 
 struct mlx5_cqwq {
-	void			*buf;
+	struct mlx5_frag_buf	frag_buf;
 	__be32			*db;
 	u32			sz_m1;
+	u32			frag_sz_m1;
 	u32			cc; /* consumer counter */
 	u8			log_sz;
 	u8			log_stride;
+	u8			log_frag_strides;
 };
 
 struct mlx5_wq_ll {
@@ -81,7 +89,7 @@ u32 mlx5_wq_cyc_get_size(struct mlx5_wq_cyc *wq);
 
 int mlx5_cqwq_create(struct mlx5_core_dev *mdev, struct mlx5_wq_param *param,
 		     void *cqc, struct mlx5_cqwq *wq,
-		     struct mlx5_wq_ctrl *wq_ctrl);
+		     struct mlx5_frag_wq_ctrl *wq_ctrl);
 u32 mlx5_cqwq_get_size(struct mlx5_cqwq *wq);
 
 int mlx5_wq_ll_create(struct mlx5_core_dev *mdev, struct mlx5_wq_param *param,
@@ -90,6 +98,7 @@ int mlx5_wq_ll_create(struct mlx5_core_dev *mdev, struct mlx5_wq_param *param,
 u32 mlx5_wq_ll_get_size(struct mlx5_wq_ll *wq);
 
 void mlx5_wq_destroy(struct mlx5_wq_ctrl *wq_ctrl);
+void mlx5_cqwq_destroy(struct mlx5_frag_wq_ctrl *wq_ctrl);
 
 static inline u16 mlx5_wq_cyc_ctr2ix(struct mlx5_wq_cyc *wq, u16 ctr)
 {
@@ -116,7 +125,10 @@ static inline u32 mlx5_cqwq_get_ci(struct mlx5_cqwq *wq)
 
 static inline void *mlx5_cqwq_get_wqe(struct mlx5_cqwq *wq, u32 ix)
 {
-	return wq->buf + (ix << wq->log_stride);
+	unsigned int frag = (ix >> wq->log_frag_strides);
+
+	return wq->frag_buf.frags[frag].buf +
+		((wq->frag_sz_m1 & ix) << wq->log_stride);
 }
 
 static inline u32 mlx5_cqwq_get_wrap_cnt(struct mlx5_cqwq *wq)
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 68b85ef..0ae5536 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -318,6 +318,13 @@ struct mlx5_buf {
 	u8			page_shift;
 };
 
+struct mlx5_frag_buf {
+	struct mlx5_buf_list	*frags;
+	int			npages;
+	int			size;
+	u8			page_shift;
+};
+
 struct mlx5_eq_tasklet {
 	struct list_head list;
 	struct list_head process_list;
@@ -822,6 +829,9 @@ int mlx5_buf_alloc_node(struct mlx5_core_dev *dev, int size,
 			struct mlx5_buf *buf, int node);
 int mlx5_buf_alloc(struct mlx5_core_dev *dev, int size, struct mlx5_buf *buf);
 void mlx5_buf_free(struct mlx5_core_dev *dev, struct mlx5_buf *buf);
+int mlx5_frag_buf_alloc_node(struct mlx5_core_dev *dev, int size,
+			     struct mlx5_frag_buf *buf, int node);
+void mlx5_frag_buf_free(struct mlx5_core_dev *dev, struct mlx5_frag_buf *buf);
 struct mlx5_cmd_mailbox *mlx5_alloc_cmd_mailbox_chain(struct mlx5_core_dev *dev,
 						      gfp_t flags, int npages);
 void mlx5_free_cmd_mailbox_chain(struct mlx5_core_dev *dev,
@@ -866,6 +876,7 @@ void mlx5_unregister_debugfs(void);
 int mlx5_eq_init(struct mlx5_core_dev *dev);
 void mlx5_eq_cleanup(struct mlx5_core_dev *dev);
 void mlx5_fill_page_array(struct mlx5_buf *buf, __be64 *pas);
+void mlx5_fill_page_frag_array(struct mlx5_frag_buf *frag_buf, __be64 *pas);
 void mlx5_cq_completion(struct mlx5_core_dev *dev, u32 cqn);
 void mlx5_rsc_event(struct mlx5_core_dev *dev, u32 rsn, int event_type);
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
-- 
2.7.4


* [PATCH net-next V2 6/7] net/mlx5e: Refactor tc del flow to accept mlx5e_tc_flow instance
From: Saeed Mahameed @ 2016-11-30 15:59 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed
In-Reply-To: <1480521583-12755-1-git-send-email-saeedm@mellanox.com>

From: Roi Dayan <roid@mellanox.com>

Change the function that deletes an offloaded TC rule to take a
struct mlx5e_tc_flow instance, which contains both the flow
handle and the flow attributes. This is a cleanup needed for
downstream patches; it doesn't change any functionality.

Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 4d71445..3875c1c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -143,18 +143,17 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
 }
 
 static void mlx5e_tc_del_flow(struct mlx5e_priv *priv,
-			      struct mlx5_flow_handle *rule,
-			      struct mlx5_esw_flow_attr *attr)
+			      struct mlx5e_tc_flow *flow)
 {
 	struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
 	struct mlx5_fc *counter = NULL;
 
-	counter = mlx5_flow_rule_counter(rule);
+	counter = mlx5_flow_rule_counter(flow->rule);
 
-	mlx5_del_flow_rules(rule);
+	mlx5_del_flow_rules(flow->rule);
 
 	if (esw && esw->mode == SRIOV_OFFLOADS)
-		mlx5_eswitch_del_vlan_action(esw, attr);
+		mlx5_eswitch_del_vlan_action(esw, flow->attr);
 
 	mlx5_fc_destroy(priv->mdev, counter);
 
@@ -1005,7 +1004,7 @@ int mlx5e_delete_flower(struct mlx5e_priv *priv,
 
 	rhashtable_remove_fast(&tc->ht, &flow->node, tc->ht_params);
 
-	mlx5e_tc_del_flow(priv, flow->rule, flow->attr);
+	mlx5e_tc_del_flow(priv, flow);
 
 	if (flow->attr->action & MLX5_FLOW_CONTEXT_ACTION_ENCAP)
 		mlx5e_detach_encap(priv, flow);
@@ -1065,7 +1064,7 @@ static void _mlx5e_tc_del_flow(void *ptr, void *arg)
 	struct mlx5e_tc_flow *flow = ptr;
 	struct mlx5e_priv *priv = arg;
 
-	mlx5e_tc_del_flow(priv, flow->rule, flow->attr);
+	mlx5e_tc_del_flow(priv, flow);
 	kfree(flow);
 }
 
-- 
2.7.4


* [PATCH net-next V2 4/7] net/mlx5e: Remove redundant hashtable lookup in configure flower
From: Saeed Mahameed @ 2016-11-30 15:59 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed
In-Reply-To: <1480521583-12755-1-git-send-email-saeedm@mellanox.com>

From: Roi Dayan <roid@mellanox.com>

We will never find a flow with the same cookie, as cls_flower always
allocates a new flow and the cookie is the allocated memory address.
The hashtable lookup in configure flower is therefore redundant.

Fixes: e3a2b7ed018e ("net/mlx5e: Support offload cls_flower with drop action")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 26 +++++++------------------
 1 file changed, 7 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 4d06fab..dd6d954 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -915,25 +915,17 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv, __be16 protocol,
 	u32 flow_tag, action;
 	struct mlx5e_tc_flow *flow;
 	struct mlx5_flow_spec *spec;
-	struct mlx5_flow_handle *old = NULL;
-	struct mlx5_esw_flow_attr *old_attr = NULL;
 	struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
 
 	if (esw && esw->mode == SRIOV_OFFLOADS)
 		fdb_flow = true;
 
-	flow = rhashtable_lookup_fast(&tc->ht, &f->cookie,
-				      tc->ht_params);
-	if (flow) {
-		old = flow->rule;
-		old_attr = flow->attr;
-	} else {
-		if (fdb_flow)
-			flow = kzalloc(sizeof(*flow) + sizeof(struct mlx5_esw_flow_attr),
-				       GFP_KERNEL);
-		else
-			flow = kzalloc(sizeof(*flow), GFP_KERNEL);
-	}
+	if (fdb_flow)
+		flow = kzalloc(sizeof(*flow) +
+			       sizeof(struct mlx5_esw_flow_attr),
+			       GFP_KERNEL);
+	else
+		flow = kzalloc(sizeof(*flow), GFP_KERNEL);
 
 	spec = mlx5_vzalloc(sizeof(*spec));
 	if (!spec || !flow) {
@@ -970,17 +962,13 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv, __be16 protocol,
 	if (err)
 		goto err_del_rule;
 
-	if (old)
-		mlx5e_tc_del_flow(priv, old, old_attr);
-
 	goto out;
 
 err_del_rule:
 	mlx5_del_flow_rules(flow->rule);
 
 err_free:
-	if (!old)
-		kfree(flow);
+	kfree(flow);
 out:
 	kvfree(spec);
 	return err;
-- 
2.7.4


* [PATCH net-next V2 0/7] Mellanox 100G mlx5 updates 2016-11-29
From: Saeed Mahameed @ 2016-11-30 15:59 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed

Hi Dave,

The following series from Tariq and Roi provides some critical fixes
and updates for the mlx5e driver.

From Tariq:
 - Fix driver coherent memory huge allocation issues by fragmenting
   completion queues, in a way that is transparent to the netdev driver by
   providing a new buffer type "mlx5_frag_buf" with the same access API.
 - Create UMR MKey per RQ to have better scalability.
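
The fragmented-buffer idea in the first item can be sketched in plain
userspace C. This is only an illustration under stated assumptions, not
the driver code: the names frag_buf, frag_buf_alloc, frag_buf_free and
frag_buf_get are hypothetical, malloc-style allocation stands in for
coherent DMA memory, and the number of entries per fragment is assumed to
be a power of two. The index split follows the same shape as
mlx5_cqwq_get_wqe() in patch 1/7: the high bits of the index select the
fragment, the low bits select the entry inside it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace model of a fragmented queue buffer: instead of
 * one large contiguous allocation, entries live in several small chunks.
 */
struct frag_buf {
	uint8_t **frags;           /* one pointer per fragment */
	unsigned int nfrags;       /* fragments allocated so far */
	unsigned int log_stride;   /* log2(entry size in bytes) */
	unsigned int log_frag_ent; /* log2(entries per fragment) */
};

static void frag_buf_free(struct frag_buf *b)
{
	/* Only fragments [0, nfrags) were allocated, so free exactly those. */
	for (unsigned int i = 0; i < b->nfrags; i++)
		free(b->frags[i]);
	free(b->frags);
	b->frags = NULL;
	b->nfrags = 0;
}

static int frag_buf_alloc(struct frag_buf *b, unsigned int log_entries,
			  unsigned int log_stride, unsigned int log_frag_ent)
{
	unsigned int want;

	assert(log_entries >= log_frag_ent);
	want = 1u << (log_entries - log_frag_ent);

	b->log_stride = log_stride;
	b->log_frag_ent = log_frag_ent;
	b->nfrags = 0;
	b->frags = calloc(want, sizeof(*b->frags));
	if (!b->frags)
		return -1;

	for (unsigned int i = 0; i < want; i++) {
		b->frags[i] = calloc(1u << log_frag_ent, 1u << log_stride);
		if (!b->frags[i]) {
			frag_buf_free(b); /* unwind what was allocated */
			return -1;
		}
		b->nfrags++;
	}
	return 0;
}

/* Same access API as a contiguous buffer: index in, entry pointer out. */
static void *frag_buf_get(const struct frag_buf *b, uint32_t ix)
{
	uint32_t frag = ix >> b->log_frag_ent;
	uint32_t off = ix & ((1u << b->log_frag_ent) - 1);

	return b->frags[frag] + ((size_t)off << b->log_stride);
}
```

Callers keep indexing entries by a flat index, which is what makes the
change transparent to the code that consumes the queue.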

From Roi:
 - Some fixes for the encap-decap support and tc flower added lately to the
   mlx5e driver.

v1->v2:
 - Fix start index in error flow of mlx5_frag_buf_alloc_node, pointed out by Eric.

This series was generated against commit:
31ac1c19455f ("geneve: fix ip_hdr_len reserved for geneve6 tunnel.")

Thanks,
Saeed.

Roi Dayan (4):
  net/mlx5e: Remove redundant hashtable lookup in configure flower
  net/mlx5e: Correct cleanup order when deleting offloaded TC rules
  net/mlx5e: Refactor tc del flow to accept mlx5e_tc_flow instance
  net/mlx5e: Remove flow encap entry in the correct place

Tariq Toukan (3):
  net/mlx5e: Implement Fragmented Work Queue (WQ)
  net/mlx5e: Move function mlx5e_create_umr_mkey
  net/mlx5e: Create UMR MKey per RQ

 drivers/net/ethernet/mellanox/mlx5/core/alloc.c    |  66 +++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  14 +--
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  12 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 121 +++++++++++----------
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    |  82 ++++++--------
 drivers/net/ethernet/mellanox/mlx5/core/wq.c       |  26 +++--
 drivers/net/ethernet/mellanox/mlx5/core/wq.h       |  18 ++-
 include/linux/mlx5/driver.h                        |  11 ++
 8 files changed, 215 insertions(+), 135 deletions(-)

-- 
2.7.4


* [PATCH net-next V2 5/7] net/mlx5e: Correct cleanup order when deleting offloaded TC rules
From: Saeed Mahameed @ 2016-11-30 15:59 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed
In-Reply-To: <1480521583-12755-1-git-send-email-saeedm@mellanox.com>

From: Roi Dayan <roid@mellanox.com>

According to the reverse unwinding principle, at delete time we should
first delete the steering rule and only then handle the vlan
deletion from the eswitch.
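
A minimal standalone sketch of the ordering, for illustration only. The
function names and the trace buffer are hypothetical, and it assumes the
add path sets up the vlan action before installing the steering rule, so
that teardown has to mirror it in reverse:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical trace buffer so the call order can be inspected. */
static char trace[128];

static void step(const char *name)
{
	/* Make sure the name plus the ';' separator still fits. */
	assert(strlen(trace) + strlen(name) + 1 < sizeof(trace));
	strcat(trace, name);
	strcat(trace, ";");
}

/* Setup: the rule depends on the vlan action, so the action comes first. */
static void flow_add(void)
{
	step("add_vlan_action");
	step("add_rule");
}

/* Teardown mirrors setup in reverse: remove the dependent object first. */
static void flow_del(void)
{
	step("del_rule");
	step("del_vlan_action");
}
```

Calling flow_add() followed by flow_del() leaves the four steps in the
trace in nested order, with the rule removed before the vlan action it
was installed on top of.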

Fixes: 8b32580df1cb ("net/mlx5e: Add TC vlan action for SRIOV offloads")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index dd6d954..4d71445 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -151,11 +151,11 @@ static void mlx5e_tc_del_flow(struct mlx5e_priv *priv,
 
 	counter = mlx5_flow_rule_counter(rule);
 
+	mlx5_del_flow_rules(rule);
+
 	if (esw && esw->mode == SRIOV_OFFLOADS)
 		mlx5_eswitch_del_vlan_action(esw, attr);
 
-	mlx5_del_flow_rules(rule);
-
 	mlx5_fc_destroy(priv->mdev, counter);
 
 	if (!mlx5e_tc_num_filters(priv) && (priv->fs.tc.t)) {
-- 
2.7.4


* [PATCH net-next V2 7/7] net/mlx5e: Remove flow encap entry in the correct place
From: Saeed Mahameed @ 2016-11-30 15:59 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed
In-Reply-To: <1480521583-12755-1-git-send-email-saeedm@mellanox.com>

From: Roi Dayan <roid@mellanox.com>

Handling of the flow encap entry belongs inside tc del flow
and is only relevant for offloaded eswitch TC rules.

Fixes: 11a457e9b6c1 ("net/mlx5e: Add basic TC tunnel set action for SRIOV offloads")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 43 +++++++++++++------------
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 3875c1c..f07ef8c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -142,6 +142,24 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
 	return mlx5_eswitch_add_offloaded_rule(esw, spec, attr);
 }
 
+static void mlx5e_detach_encap(struct mlx5e_priv *priv,
+			       struct mlx5e_tc_flow *flow) {
+	struct list_head *next = flow->encap.next;
+
+	list_del(&flow->encap);
+	if (list_empty(next)) {
+		struct mlx5_encap_entry *e;
+
+		e = list_entry(next, struct mlx5_encap_entry, flows);
+		if (e->n) {
+			mlx5_encap_dealloc(priv->mdev, e->encap_id);
+			neigh_release(e->n);
+		}
+		hlist_del_rcu(&e->encap_hlist);
+		kfree(e);
+	}
+}
+
 static void mlx5e_tc_del_flow(struct mlx5e_priv *priv,
 			      struct mlx5e_tc_flow *flow)
 {
@@ -152,8 +170,11 @@ static void mlx5e_tc_del_flow(struct mlx5e_priv *priv,
 
 	mlx5_del_flow_rules(flow->rule);
 
-	if (esw && esw->mode == SRIOV_OFFLOADS)
+	if (esw && esw->mode == SRIOV_OFFLOADS) {
 		mlx5_eswitch_del_vlan_action(esw, flow->attr);
+		if (flow->attr->action & MLX5_FLOW_CONTEXT_ACTION_ENCAP)
+			mlx5e_detach_encap(priv, flow);
+	}
 
 	mlx5_fc_destroy(priv->mdev, counter);
 
@@ -973,24 +994,6 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv, __be16 protocol,
 	return err;
 }
 
-static void mlx5e_detach_encap(struct mlx5e_priv *priv,
-			       struct mlx5e_tc_flow *flow) {
-	struct list_head *next = flow->encap.next;
-
-	list_del(&flow->encap);
-	if (list_empty(next)) {
-		struct mlx5_encap_entry *e;
-
-		e = list_entry(next, struct mlx5_encap_entry, flows);
-		if (e->n) {
-			mlx5_encap_dealloc(priv->mdev, e->encap_id);
-			neigh_release(e->n);
-		}
-		hlist_del_rcu(&e->encap_hlist);
-		kfree(e);
-	}
-}
-
 int mlx5e_delete_flower(struct mlx5e_priv *priv,
 			struct tc_cls_flower_offload *f)
 {
@@ -1006,8 +1009,6 @@ int mlx5e_delete_flower(struct mlx5e_priv *priv,
 
 	mlx5e_tc_del_flow(priv, flow);
 
-	if (flow->attr->action & MLX5_FLOW_CONTEXT_ACTION_ENCAP)
-		mlx5e_detach_encap(priv, flow);
 
 	kfree(flow);
 
-- 
2.7.4


* [PATCH net-next V2 2/7] net/mlx5e: Move function mlx5e_create_umr_mkey
From: Saeed Mahameed @ 2016-11-30 15:59 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed
In-Reply-To: <1480521583-12755-1-git-send-email-saeedm@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>

In the next patch we are going to create a UMR MKey per RQ, so we need
mlx5e_create_umr_mkey defined before mlx5e_create_rq.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 74 +++++++++++------------
 1 file changed, 37 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index ba25cd3..49ca30b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -471,6 +471,43 @@ static void mlx5e_rq_free_mpwqe_info(struct mlx5e_rq *rq)
 	kfree(rq->mpwqe.info);
 }
 
+static int mlx5e_create_umr_mkey(struct mlx5e_priv *priv)
+{
+	struct mlx5_core_dev *mdev = priv->mdev;
+	u64 npages = MLX5E_REQUIRED_MTTS(priv->profile->max_nch(mdev),
+					 BIT(MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW));
+	int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
+	void *mkc;
+	u32 *in;
+	int err;
+
+	in = mlx5_vzalloc(inlen);
+	if (!in)
+		return -ENOMEM;
+
+	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
+
+	npages = min_t(u32, ALIGN(U16_MAX, 4) * 2, npages);
+
+	MLX5_SET(mkc, mkc, free, 1);
+	MLX5_SET(mkc, mkc, umr_en, 1);
+	MLX5_SET(mkc, mkc, lw, 1);
+	MLX5_SET(mkc, mkc, lr, 1);
+	MLX5_SET(mkc, mkc, access_mode, MLX5_MKC_ACCESS_MODE_MTT);
+
+	MLX5_SET(mkc, mkc, qpn, 0xffffff);
+	MLX5_SET(mkc, mkc, pd, mdev->mlx5e_res.pdn);
+	MLX5_SET64(mkc, mkc, len, npages << PAGE_SHIFT);
+	MLX5_SET(mkc, mkc, translations_octword_size,
+		 MLX5_MTT_OCTW(npages));
+	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
+
+	err = mlx5_core_create_mkey(mdev, &priv->umr_mkey, in, inlen);
+
+	kvfree(in);
+	return err;
+}
+
 static int mlx5e_create_rq(struct mlx5e_channel *c,
 			   struct mlx5e_rq_param *param,
 			   struct mlx5e_rq *rq)
@@ -3625,43 +3662,6 @@ static void mlx5e_destroy_q_counter(struct mlx5e_priv *priv)
 	mlx5_core_dealloc_q_counter(priv->mdev, priv->q_counter);
 }
 
-static int mlx5e_create_umr_mkey(struct mlx5e_priv *priv)
-{
-	struct mlx5_core_dev *mdev = priv->mdev;
-	u64 npages = MLX5E_REQUIRED_MTTS(priv->profile->max_nch(mdev),
-					 BIT(MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW));
-	int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
-	void *mkc;
-	u32 *in;
-	int err;
-
-	in = mlx5_vzalloc(inlen);
-	if (!in)
-		return -ENOMEM;
-
-	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
-
-	npages = min_t(u32, ALIGN(U16_MAX, 4) * 2, npages);
-
-	MLX5_SET(mkc, mkc, free, 1);
-	MLX5_SET(mkc, mkc, umr_en, 1);
-	MLX5_SET(mkc, mkc, lw, 1);
-	MLX5_SET(mkc, mkc, lr, 1);
-	MLX5_SET(mkc, mkc, access_mode, MLX5_MKC_ACCESS_MODE_MTT);
-
-	MLX5_SET(mkc, mkc, qpn, 0xffffff);
-	MLX5_SET(mkc, mkc, pd, mdev->mlx5e_res.pdn);
-	MLX5_SET64(mkc, mkc, len, npages << PAGE_SHIFT);
-	MLX5_SET(mkc, mkc, translations_octword_size,
-		 MLX5_MTT_OCTW(npages));
-	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
-
-	err = mlx5_core_create_mkey(mdev, &priv->umr_mkey, in, inlen);
-
-	kvfree(in);
-	return err;
-}
-
 static void mlx5e_nic_init(struct mlx5_core_dev *mdev,
 			   struct net_device *netdev,
 			   const struct mlx5e_profile *profile,
-- 
2.7.4


* [PATCH net-next V2 3/7] net/mlx5e: Create UMR MKey per RQ
From: Saeed Mahameed @ 2016-11-30 15:59 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed
In-Reply-To: <1480521583-12755-1-git-send-email-saeedm@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>

In the Striding RQ implementation, we used a single UMR
(User-Mode Memory Registration) memory key for all RQs.
When the product of the number of RQs and their size gets high,
we hit the limit of a u16 field in FW.

Here we move to using a UMR memory key per RQ, so we can
scale to any number of rings, with the maximum buffer
size in each.
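
The arithmetic behind the limit can be sketched as follows. This is an
illustration only: ALIGN_UP, required_mtts and fits_u16 are hypothetical
names, PAGES_PER_WQE = 64 is an assumed value, and the 64 WQEs per RQ used
below are made-up inputs. The octword computation and the u16 comparison
follow the shape of MLX5_MTT_OCTW and the pre-patch MLX5E_VALID_NUM_MTTS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ALIGN_UP(x, a)   ((((x) + (a) - 1) / (a)) * (a))
#define PAGES_PER_WQE    64u /* assumed value, for illustration only */
#define MTT_OCTW(npages) (ALIGN_UP((npages), 8u) / 2u)

/* MTT entries needed when one key covers `rqs` RQs of `wqes` WQEs each. */
static uint64_t required_mtts(uint64_t rqs, uint64_t wqes)
{
	uint64_t per_wqe = ALIGN_UP(PAGES_PER_WQE, 8u);

	/* Keep the illustrative inputs small enough to avoid overflow. */
	assert(rqs <= UINT32_MAX / per_wqe && wqes <= UINT32_MAX);
	return rqs * wqes * per_wqe;
}

/* The translation size, counted in octwords, must fit a u16 field. */
static bool fits_u16(uint64_t num_mtts)
{
	return MTT_OCTW(num_mtts) <= UINT16_MAX;
}
```

With these assumed numbers, a single shared key stops fitting somewhere
between 16 and 32 RQs, while required_mtts(1, 64) for a per-RQ key stays
far below the limit no matter how many RQs exist.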

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       | 12 ++---
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 12 +----
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 53 ++++++++++++----------
 3 files changed, 35 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index f16f7fb..63dd639 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -77,9 +77,9 @@
 						 MLX5_MPWRQ_WQE_PAGE_ORDER)
 
 #define MLX5_MTT_OCTW(npages) (ALIGN(npages, 8) / 2)
-#define MLX5E_REQUIRED_MTTS(rqs, wqes)\
-	(rqs * wqes * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8))
-#define MLX5E_VALID_NUM_MTTS(num_mtts) (MLX5_MTT_OCTW(num_mtts) <= U16_MAX)
+#define MLX5E_REQUIRED_MTTS(wqes)		\
+	(wqes * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8))
+#define MLX5E_VALID_NUM_MTTS(num_mtts) (MLX5_MTT_OCTW(num_mtts) - 1 <= U16_MAX)
 
 #define MLX5_UMR_ALIGN				(2048)
 #define MLX5_MPWRQ_SMALL_PACKET_THRESHOLD	(128)
@@ -347,7 +347,6 @@ struct mlx5e_rq {
 		struct {
 			struct mlx5e_mpw_info *info;
 			void                  *mtt_no_align;
-			u32                    mtt_offset;
 		} mpwqe;
 	};
 	struct {
@@ -382,6 +381,7 @@ struct mlx5e_rq {
 	u32                    rqn;
 	struct mlx5e_channel  *channel;
 	struct mlx5e_priv     *priv;
+	struct mlx5_core_mkey  umr_mkey;
 } ____cacheline_aligned_in_smp;
 
 struct mlx5e_umr_dma_info {
@@ -689,7 +689,6 @@ struct mlx5e_priv {
 
 	unsigned long              state;
 	struct mutex               state_lock; /* Protects Interface state */
-	struct mlx5_core_mkey      umr_mkey;
 	struct mlx5e_rq            drop_rq;
 
 	struct mlx5e_channel     **channel;
@@ -838,8 +837,7 @@ static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
 
 static inline u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
 {
-	return rq->mpwqe.mtt_offset +
-		wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
+	return wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
 }
 
 static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index aa963d7..352462a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -499,8 +499,7 @@ static int mlx5e_set_ringparam(struct net_device *dev,
 		return -EINVAL;
 	}
 
-	num_mtts = MLX5E_REQUIRED_MTTS(priv->params.num_channels,
-				       rx_pending_wqes);
+	num_mtts = MLX5E_REQUIRED_MTTS(rx_pending_wqes);
 	if (priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ &&
 	    !MLX5E_VALID_NUM_MTTS(num_mtts)) {
 		netdev_info(dev, "%s: rx_pending (%d) request can't be satisfied, try to reduce.\n",
@@ -565,7 +564,6 @@ static int mlx5e_set_channels(struct net_device *dev,
 	unsigned int count = ch->combined_count;
 	bool arfs_enabled;
 	bool was_opened;
-	u32 num_mtts;
 	int err = 0;
 
 	if (!count) {
@@ -584,14 +582,6 @@ static int mlx5e_set_channels(struct net_device *dev,
 		return -EINVAL;
 	}
 
-	num_mtts = MLX5E_REQUIRED_MTTS(count, BIT(priv->params.log_rq_size));
-	if (priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ &&
-	    !MLX5E_VALID_NUM_MTTS(num_mtts)) {
-		netdev_info(dev, "%s: rx count (%d) request can't be satisfied, try to reduce.\n",
-			    __func__, count);
-		return -EINVAL;
-	}
-
 	if (priv->params.num_channels == count)
 		return 0;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 49ca30b..84a4adb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -471,24 +471,25 @@ static void mlx5e_rq_free_mpwqe_info(struct mlx5e_rq *rq)
 	kfree(rq->mpwqe.info);
 }
 
-static int mlx5e_create_umr_mkey(struct mlx5e_priv *priv)
+static int mlx5e_create_umr_mkey(struct mlx5e_priv *priv,
+				 u64 npages, u8 page_shift,
+				 struct mlx5_core_mkey *umr_mkey)
 {
 	struct mlx5_core_dev *mdev = priv->mdev;
-	u64 npages = MLX5E_REQUIRED_MTTS(priv->profile->max_nch(mdev),
-					 BIT(MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW));
 	int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
 	void *mkc;
 	u32 *in;
 	int err;
 
+	if (!MLX5E_VALID_NUM_MTTS(npages))
+		return -EINVAL;
+
 	in = mlx5_vzalloc(inlen);
 	if (!in)
 		return -ENOMEM;
 
 	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
 
-	npages = min_t(u32, ALIGN(U16_MAX, 4) * 2, npages);
-
 	MLX5_SET(mkc, mkc, free, 1);
 	MLX5_SET(mkc, mkc, umr_en, 1);
 	MLX5_SET(mkc, mkc, lw, 1);
@@ -497,17 +498,25 @@ static int mlx5e_create_umr_mkey(struct mlx5e_priv *priv)
 
 	MLX5_SET(mkc, mkc, qpn, 0xffffff);
 	MLX5_SET(mkc, mkc, pd, mdev->mlx5e_res.pdn);
-	MLX5_SET64(mkc, mkc, len, npages << PAGE_SHIFT);
+	MLX5_SET64(mkc, mkc, len, npages << page_shift);
 	MLX5_SET(mkc, mkc, translations_octword_size,
 		 MLX5_MTT_OCTW(npages));
-	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
+	MLX5_SET(mkc, mkc, log_page_size, page_shift);
 
-	err = mlx5_core_create_mkey(mdev, &priv->umr_mkey, in, inlen);
+	err = mlx5_core_create_mkey(mdev, umr_mkey, in, inlen);
 
 	kvfree(in);
 	return err;
 }
 
+static int mlx5e_create_rq_umr_mkey(struct mlx5e_rq *rq)
+{
+	struct mlx5e_priv *priv = rq->priv;
+	u64 num_mtts = MLX5E_REQUIRED_MTTS(BIT(priv->params.log_rq_size));
+
+	return mlx5e_create_umr_mkey(priv, num_mtts, PAGE_SHIFT, &rq->umr_mkey);
+}
+
 static int mlx5e_create_rq(struct mlx5e_channel *c,
 			   struct mlx5e_rq_param *param,
 			   struct mlx5e_rq *rq)
@@ -564,18 +573,20 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 		rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
 		rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
 
-		rq->mpwqe.mtt_offset = c->ix *
-			MLX5E_REQUIRED_MTTS(1, BIT(priv->params.log_rq_size));
-
 		rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz);
 		rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
 
 		rq->buff.wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
 		byte_count = rq->buff.wqe_sz;
-		rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
-		err = mlx5e_rq_alloc_mpwqe_info(rq, c);
+
+		err = mlx5e_create_rq_umr_mkey(rq);
 		if (err)
 			goto err_rq_wq_destroy;
+		rq->mkey_be = cpu_to_be32(rq->umr_mkey.key);
+
+		err = mlx5e_rq_alloc_mpwqe_info(rq, c);
+		if (err)
+			goto err_destroy_umr_mkey;
 		break;
 	default: /* MLX5_WQ_TYPE_LINKED_LIST */
 		rq->dma_info = kzalloc_node(wq_sz * sizeof(*rq->dma_info),
@@ -626,6 +637,9 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 
 	return 0;
 
+err_destroy_umr_mkey:
+	mlx5_core_destroy_mkey(mdev, &rq->umr_mkey);
+
 err_rq_wq_destroy:
 	if (rq->xdp_prog)
 		bpf_prog_put(rq->xdp_prog);
@@ -644,6 +658,7 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		mlx5e_rq_free_mpwqe_info(rq);
+		mlx5_core_destroy_mkey(rq->priv->mdev, &rq->umr_mkey);
 		break;
 	default: /* MLX5_WQ_TYPE_LINKED_LIST */
 		kfree(rq->dma_info);
@@ -3868,15 +3883,9 @@ int mlx5e_attach_netdev(struct mlx5_core_dev *mdev, struct net_device *netdev)
 	profile = priv->profile;
 	clear_bit(MLX5E_STATE_DESTROYING, &priv->state);
 
-	err = mlx5e_create_umr_mkey(priv);
-	if (err) {
-		mlx5_core_err(mdev, "create umr mkey failed, %d\n", err);
-		goto out;
-	}
-
 	err = profile->init_tx(priv);
 	if (err)
-		goto err_destroy_umr_mkey;
+		goto out;
 
 	err = mlx5e_open_drop_rq(priv);
 	if (err) {
@@ -3916,9 +3925,6 @@ int mlx5e_attach_netdev(struct mlx5_core_dev *mdev, struct net_device *netdev)
 err_cleanup_tx:
 	profile->cleanup_tx(priv);
 
-err_destroy_umr_mkey:
-	mlx5_core_destroy_mkey(mdev, &priv->umr_mkey);
-
 out:
 	return err;
 }
@@ -3967,7 +3973,6 @@ void mlx5e_detach_netdev(struct mlx5_core_dev *mdev, struct net_device *netdev)
 	profile->cleanup_rx(priv);
 	mlx5e_close_drop_rq(priv);
 	profile->cleanup_tx(priv);
-	mlx5_core_destroy_mkey(priv->mdev, &priv->umr_mkey);
 	cancel_delayed_work_sync(&priv->update_stats_work);
 }
 
-- 
2.7.4


* Re: [WIP] net+mlx4: auto doorbell
From: Saeed Mahameed @ 2016-11-30 16:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Rick Jones, Linux Netdev List,
	Saeed Mahameed, Tariq Toukan
In-Reply-To: <1480520661.18162.177.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Nov 30, 2016 at 5:44 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 15:50 +0200, Saeed Mahameed wrote:
>> On Tue, Nov 29, 2016 at 8:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
>> >
>> >
>> >> Not sure if this has been tried before, but the doorbell avoidance could
>> >> be done by the driver itself, because it knows a TX completion will come
>> >> shortly (well... if softirqs are not delayed too much !)
>> >>
>> >> Doorbell would be forced only if :
>> >>
>> >> (    "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
>> >> OR
>> >> ( too many [1] packets were put in TX ring buffer, no point deferring
>> >> more)
>> >>
>> >> Start the pump, but once it is started, let the doorbells be done by
>> >> TX completion.
>> >>
>> >> ndo_start_xmit and TX completion handler would have to maintain a shared
>> >> state describing if packets were ready but doorbell deferred.
>> >>
>> >>
>> >> Note that TX completion means "if at least one packet was drained",
>> >> otherwise busy polling, constantly calling napi->poll() would force a
>> >> doorbell too soon for devices sharing a NAPI for both RX and TX.
>> >>
>> >> But then, maybe busy poll would like to force a doorbell...
>> >>
>> >> I could try these ideas on mlx4 shortly.
>> >>
>> >>
>> >> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
>> >
>> > I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
>> > not used.
>>
>> Hi Eric, Nice Idea indeed and we need something like this,
>> today we almost don't exploit the TX bulking at all.
>>
>> But please see below, I am not sure different contexts should share
>> the doorbell ringing, it is really risky.
>>
>> >  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    2
>> >  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |   90 +++++++++++------
>> >  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    4
>> >  include/linux/netdevice.h                    |    1
>> >  net/core/net-sysfs.c                         |   18 +++
>> >  5 files changed, 83 insertions(+), 32 deletions(-)
>> >
>> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> > index 6562f78b07f4..fbea83218fc0 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> > @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>> >
>> >         if (polled) {
>> >                 if (doorbell_pending)
>> > -                       mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
>> > +                       mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
>> >
>> >                 mlx4_cq_set_ci(&cq->mcq);
>> >                 wmb(); /* ensure HW sees CQ consumer before we post new buffers */
>> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> > index 4b597dca5c52..affebb435679 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> > @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
>> >         ring->size = size;
>> >         ring->size_mask = size - 1;
>> >         ring->sp_stride = stride;
>> > -       ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
>> > +       ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
>> >
>> >         tmp = size * sizeof(struct mlx4_en_tx_info);
>> >         ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
>> > @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
>> >         ring->sp_cqn = cq;
>> >         ring->prod = 0;
>> >         ring->cons = 0xffffffff;
>> > +       ring->ncons = 0;
>> >         ring->last_nr_txbb = 1;
>> >         memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
>> >         memset(ring->buf, 0, ring->buf_size);
>> > @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
>> >                        MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
>> >  }
>> >
>> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
>> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
>> >  {
>> > -       return ring->prod - ring->cons > ring->full_size;
>> > +       return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
>> >  }
>> >
>> >  static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
>> > @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>> >
>> >         /* Skip last polled descriptor */
>> >         ring->cons += ring->last_nr_txbb;
>> > +       ring->ncons += ring->last_nr_txbb;
>> >         en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
>> >                  ring->cons, ring->prod);
>> >
>> > @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>> >                                                 !!(ring->cons & ring->size), 0,
>> >                                                 0 /* Non-NAPI caller */);
>> >                 ring->cons += ring->last_nr_txbb;
>> > +               ring->ncons += ring->last_nr_txbb;
>> >                 cnt++;
>> >         }
>> >
>> > @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>> >         return cnt;
>> >  }
>> >
>> > +void mlx4_en_xmit_doorbell(const struct net_device *dev,
>> > +                          struct mlx4_en_tx_ring *ring)
>> > +{
>> > +
>> > +       if (dev->doorbell_opt & 1) {
>> > +               u32 oval = READ_ONCE(ring->prod_bell);
>> > +               u32 nval = READ_ONCE(ring->prod);
>> > +
>> > +               if (oval == nval)
>> > +                       return;
>> > +
>> > +               /* I can not tell yet if a cmpxchg() is needed or not */
>> > +               if (dev->doorbell_opt & 2)
>> > +                       WRITE_ONCE(ring->prod_bell, nval);
>> > +               else
>> > +                       if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
>> > +                               return;
>> > +       }
>> > +       /* Since there is no iowrite*_native() that writes the
>> > +        * value as is, without byteswapping - using the one
>> > +        * the doesn't do byteswapping in the relevant arch
>> > +        * endianness.
>> > +        */
>> > +#if defined(__LITTLE_ENDIAN)
>> > +       iowrite32(
>> > +#else
>> > +       iowrite32be(
>> > +#endif
>> > +                 ring->doorbell_qpn,
>> > +                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
>> > +}
>> > +
>> >  static bool mlx4_en_process_tx_cq(struct net_device *dev,
>> >                                   struct mlx4_en_cq *cq, int napi_budget)
>> >  {
>> > @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
>> >         wmb();
>> >
>> >         /* we want to dirty this cache line once */
>> > -       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
>> > -       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
>> > +       WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
>> > +       ring_cons += txbbs_skipped;
>> > +       WRITE_ONCE(ring->cons, ring_cons);
>> > +       WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
>> > +
>> > +       if (dev->doorbell_opt)
>> > +               mlx4_en_xmit_doorbell(dev, ring);
>> >
>> >         if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
>> >                 return done < budget;
>> > @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
>> >         __iowrite64_copy(dst, src, bytecnt / 8);
>> >  }
>> >
>> > -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
>> > -{
>> > -       wmb();
>>
>> you missed/removed this "wmb()" in the new function, why ? where did
>> you compensate for it ?
>
> I removed it because I had a cmpxchg() there if the barrier was needed.
>

Cool, so the answer is yes: the barrier is needed in order for the HW
to see the last step of mlx4_en_tx_write_desc(), where we write the
ownership bit (which marks this descriptor as valid for HW processing):

    tx_desc->ctrl.owner_opcode = op_own;

Ringing the doorbell without this wmb() might cause the HW to miss that
last packet.
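
The ordering requirement described above can be sketched in user space. This is only an analogy, not driver code: a C11 release fence between two CPU-visible stores stands in for wmb(), which in the kernel also orders the descriptor write against the MMIO doorbell write. All demo_* names are hypothetical and do not exist in mlx4.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a TX descriptor. */
struct demo_desc {
	uint32_t payload;
	_Atomic uint32_t owner;	/* nonzero: descriptor handed to the "HW" */
};

/* Hypothetical stand-in for the doorbell register. */
static _Atomic uint32_t demo_doorbell;

/* Producer: fill the descriptor, set the ownership bit, then ring. */
static void demo_post(struct demo_desc *d, uint32_t payload, uint32_t prod)
{
	d->payload = payload;
	atomic_store_explicit(&d->owner, 1, memory_order_relaxed);
	/* Plays the role of wmb(): the stores above must be visible
	 * before the doorbell store below.
	 */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&demo_doorbell, prod, memory_order_relaxed);
}

/* Consumer ("HW"): returns 1 and fills *payload only if the doorbell
 * was rung and the descriptor is owned by the consumer.
 */
static int demo_poll(struct demo_desc *d, uint32_t *payload)
{
	if (atomic_load_explicit(&demo_doorbell, memory_order_relaxed) == 0)
		return 0;
	atomic_thread_fence(memory_order_acquire);
	if (!atomic_load_explicit(&d->owner, memory_order_relaxed))
		return 0;
	*payload = d->payload;
	return 1;
}
```

Without the release fence in demo_post(), a consumer could observe the doorbell value before the ownership store, which is the "HW misses the last packet" case.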

> My patch is a WIP, where you can set the bit 2 to ask to replace the
> cmpxchg() by a simple write, only for performance testing/comparisons.
>
>
>>
>> > -       /* Since there is no iowrite*_native() that writes the
>> > -        * value as is, without byteswapping - using the one
>> > -        * the doesn't do byteswapping in the relevant arch
>> > -        * endianness.
>> > -        */
>> > -#if defined(__LITTLE_ENDIAN)
>> > -       iowrite32(
>> > -#else
>> > -       iowrite32be(
>> > -#endif
>> > -                 ring->doorbell_qpn,
>> > -                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
>> > -}
>> >
>> >  static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>> >                                   struct mlx4_en_tx_desc *tx_desc,
>> >                                   union mlx4_wqe_qpn_vlan qpn_vlan,
>> >                                   int desc_size, int bf_index,
>> >                                   __be32 op_own, bool bf_ok,
>> > -                                 bool send_doorbell)
>> > +                                 bool send_doorbell,
>> > +                                 const struct net_device *dev, int nr_txbb)
>> >  {
>> >         tx_desc->ctrl.qpn_vlan = qpn_vlan;
>> >
>> > @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>> >
>> >                 wmb();
>> >
>> > +               ring->prod += nr_txbb;
>> >                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
>> >                              desc_size);
>> >
>> > @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>> >                  */
>> >                 dma_wmb();
>> >                 tx_desc->ctrl.owner_opcode = op_own;
>> > +               ring->prod += nr_txbb;
>> >                 if (send_doorbell)
>> > -                       mlx4_en_xmit_doorbell(ring);
>> > +                       mlx4_en_xmit_doorbell(dev, ring);
>> >                 else
>> >                         ring->xmit_more++;
>> >         }
>> > @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>> >                         op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
>> >         }
>> >
>> > -       ring->prod += nr_txbb;
>> > -
>> >         /* If we used a bounce buffer then copy descriptor back into place */
>> >         if (unlikely(bounce))
>> >                 tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
>> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>> >         }
>> >         send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>> >
>> > +       /* Doorbell avoidance : We can omit doorbell if we know a TX completion
>> > +        * will happen shortly.
>> > +        */
>> > +       if (send_doorbell &&
>> > +           dev->doorbell_opt &&
>> > +           (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
>>
>> Alexei already expressed his worries about synchronization, and i
>> think here (in this exact line) sits the problem,
>> what about if exactly at this point the TX completion handler just
>> finished and rang the last doorbell,
>> you didn't write the new TX descriptor yet (mlx4_en_tx_write_desc), so
>> the last doorbell from the CQ handler missed it.
>> even if you wrote the TX descriptor before the doorbell decision here,
>> you will need a memory barrier to make sure the HW sees
>> the new packet, which was typically done before ringing the doorbell.
>
>
> My patch is a WIP, meaning it is not completed ;)
>
> Surely we can find a non racy way to handle this.

The question is: can it be done without locking? Maybe ring the
doorbell every N msecs, just in case.
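
For reference, the test in the quoted hunk, (s32)(prod_bell - ncons) > 0, is the usual wraparound-tolerant comparison of two free-running 32-bit counters. A minimal sketch follows; the demo_* names are hypothetical, and what the driver does when the condition holds is assumed from the "we can omit doorbell" comment in the hunk. It shows only the arithmetic and does not address the race discussed here.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if counter a is ahead of counter b. Valid as long as the two
 * counters are less than 2^31 apart. The unsigned-to-signed cast is
 * implementation-defined in ISO C; common compilers wrap modulo 2^32.
 */
static bool demo_seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/* Hypothetical wrapper mirroring the condition in the quoted hunk:
 * a doorbell was about to be sent, the optimization is enabled, and
 * the last rung producer index is still ahead of the consumer.
 */
static bool demo_can_skip_doorbell(bool send_doorbell, bool doorbell_opt,
				   uint32_t prod_bell, uint32_t ncons)
{
	return send_doorbell && doorbell_opt &&
	       demo_seq_after(prod_bell, ncons);
}
```

A plain prod_bell > ncons would give the wrong answer once prod_bell wraps past 0xffffffff while ncons has not.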

>
>
>> All in all, this is risky business :),  the right way to go is to
>> force the upper layer to use xmit-more and delay doorbells/use bulking
>> but from the same context
>> (xmit routine).  For example see Achiad's suggestion (attached in
>> Jesper's response), he used stop queue to force the stack to queue up
>> packets (TX bulking)
>> which would set xmit-more and will use the next completion to release
>> the "stopped" ring TXQ rather than hit the doorbell on behalf of it.
>
>
>
> Well, you depend on having a higher level queue like a qdisc.
>
> Some users do not use a qdisc.
> If you stop the queue, they no longer can send anything -> drops.
>

In this case, I think they should implement their own bulking (pktgen
is not a good example).
XDP, however, can predict whether it has more packets to xmit, as long
as all of them fall in the same NAPI cycle.
Others should try to do the same.


* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Ido Schimmel @ 2016-11-30 16:32 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <eb7ddecd-d6b4-dfce-5990-9e933227c862@stressinduktion.org>

Hi Hannes,

On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote:
> On 30.11.2016 11:09, Jiri Pirko wrote:
> > From: Ido Schimmel <idosch@mellanox.com>
> > 
> > Make sure the device has a complete view of the FIB tables by invoking
> > their dump during module init.
> > 
> > Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> > Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> > ---
> >  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 23 ++++++++++++++++++++++
> >  1 file changed, 23 insertions(+)
> > 
> > diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> > index 14bed1d..d176047 100644
> > --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> > +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> > @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
> >  	return NOTIFY_DONE;
> >  }
> >  
> > +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
> > +{
> > +	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
> > +
> > +	/* Flush pending FIB notifications and then flush the device's
> > +	 * table before requesting another dump. Do that with RTNL held,
> > +	 * as FIB notification block is already registered.
> > +	 */
> > +	mlxsw_core_flush_owq();
> > +	rtnl_lock();
> > +	mlxsw_sp_router_fib_flush(mlxsw_sp);
> > +	rtnl_unlock();
> > +}
> > +
> >  int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
> >  {
> > +	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
> >  	int err;
> >  
> >  	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
> > @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
> >  
> >  	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
> >  	register_fib_notifier(&mlxsw_sp->fib_nb);
> 
> Sorry to pick in here again:
> 
> There is a race here. You need to protect the registration of the fib
> notifier as well by the sequence counter. Updates here are not ordered
> in relation to this code below.

You mean updates that can be received after you registered the notifier
and until the dump started? I'm aware of that and that's OK. This
listener should be able to handle duplicates.

I've a follow up patchset that introduces a new event in switchdev
notification chain called SWITCHDEV_SYNC, which is sent when port
netdevs are enslaved / released  from a master device (points in time
where kernel<->device can get out of sync). It will invoke
re-propagation of configuration from different parts of the stack
(e.g. bridge driver, 8021q driver, fib/neigh code), which can result
in duplicates.

> I think just move the register notification into the fib_notifier_dump
> function, rename it to fib_notifier_init and use it here:

I separated the two on purpose. For example, rocker only needs to
register notifier, but doesn't need the dump.

> 
> > +	if (!fib_notifier_dump(&mlxsw_sp->fib_nb, &init_net, cb)) {
> > +		err = -EBUSY;
> > +		goto err_fib_notifier_dump;
> > +	}
> > +
> >  	return 0;
> 
> Thanks,
> Hannes
> 


* Re: DSA vs. SWTICHDEV ?
From: Joakim Tjernlund @ 2016-11-30 16:35 UTC (permalink / raw)
  To: andrew@lunn.ch; +Cc: netdev@vger.kernel.org
In-Reply-To: <20161130152503.GE21645@lunn.ch>

On Wed, 2016-11-30 at 16:25 +0100, Andrew Lunn wrote:
> On Wed, Nov 30, 2016 at 02:30:43PM +0000, Joakim Tjernlund wrote:
> > On Wed, 2016-11-30 at 14:52 +0100, Andrew Lunn wrote:
> > > On Wed, Nov 30, 2016 at 08:50:34AM +0000, Joakim Tjernlund wrote:
> > > > I am trying to wrap my head around these two "devices" and have a hard time telling them apart.
> > > > We are looking at adding a fairly large switch (over PCIe) to our board and from what I can tell
> > > > switchdev is the new way to do it but DSA is still there. Is it possible to just list
> > > > how they differ?
> > > 
> > > Hi Joakim
> > 
> > Hi Andrew, thanks for answering
> > 
> > > 
> > > If the interface you use to send frames from the host to the switch is
> > > PCIe, you probably want to use switchdev directly.
> > 
> > OK, we will have a few ethernet I/F's connected too but these should be used
> > as normal interfaces just interfacing a switch.
> 
> That does not make much sense.
> 
> Maybe time to backtrack a bit. The Linux concept for switch/router
> chips is that they are just hardware accelerators for what Linux can
> already do in software. Each port of the switch is just a normal Linux
> interface.  ip link show will list each port. ip addr add can be used
> to add an IP address to the interface.  You want to switch frames
> between two ports: Create a linux bridge and put the interfaces into
> it. Via switchdev you get a call into the hardware to accelerate
> this. If the hardware cannot accelerate it, it is done in software as
> normal.  Want to combine two ports into a trunk: Add a team interface
> and make the port interfaces slaves of the team interface. Via
> switchdev, you ask the hardware to accelerate this. If it cannot, it
> is done in software.
> 
> So back to your connecting a few host interfaces to the switch. This is
> logically putting a cable between two interfaces on the same host. You
> are making a loopback. Why do that? Sure it is possible, but it is an
> odd architecture.

This is an embedded system with several boards in a subrack.
Each board has an eth I/F connected to a switch to communicate with the others.
One of the boards will also house the actual switch device and manage the switch.
Then the normal app just communicates over the physical eth I/F like any other board
in the system. There is a "manage switch app" which brings the switch up and partitions
phys VLANs etc. (each phys I/F would be a separate domain, so no loop).

I guess I could skip the phys I/F and have the switch app create a virtual eth0 I/F over PCIe
instead to save eth MACs, but the above is safer should there be some problems/limitations in
switchdev. Also, switchdev does not exist in u-boot, so it would be a lot of effort to
get a working eth I/F inside u-boot.
I can still create a bridge I/F etc. should I need to.
Does the above make sense to you?
 
> 
> > And switchdev can do all this over PCIe instead? Can you have a
> > switch tree in switchdev too?
> 
> Mellanox says so, but i don't think they have actually implemented it.

Not impl. any of DSA's features? What can you do with a Mellanox switch then?

 Jocke


* Re: [PATCH net-next 8/8] net/mlx5e: Support adding ingress tc rule when egress device flag is set
From: John Fastabend @ 2016-11-30 16:36 UTC (permalink / raw)
  To: Hadar Hen Zion, David S. Miller
  Cc: netdev, Saeed Mahameed, Jiri Pirko, Amir Vadai, Or Gerlitz,
	Roi Dayan
In-Reply-To: <1480516895-29545-9-git-send-email-hadarh@mellanox.com>

On 16-11-30 06:41 AM, Hadar Hen Zion wrote:
> When ndo_setup_tc is called with an egress_dev flag set, it means that
> the ndo call was executed on the mirred action (egress) device and not
> on the ingress device.
> 
> In order to support this kind of ndo_setup_tc call, and insert the
> correct decap rule to the hardware, the uplink device on the same eswitch
> should be found.
> 
> Currently, we use this resolution between the mirred device and the
> uplink on the same eswitch to offload vxlan shared device decap rules.
> 
> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
> ---

Hi Hadar,

I started to dig through these patches and the last series here,

   Re: [PATCH net-next 00/13] Mellanox 100G SRIOV offloads tunnel_key
set/release

Can you explain how these two are related? I'm guessing in that first
series the actual redirect action to a tunnel device was being ignored?

Does this series clean up that bit of software/hardware alignment?

Thanks,
John


>  drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> index 0868677..8503788 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> @@ -289,6 +289,14 @@ static int mlx5e_rep_ndo_setup_tc(struct net_device *dev, u32 handle,
>  	if (TC_H_MAJ(handle) != TC_H_MAJ(TC_H_INGRESS))
>  		return -EOPNOTSUPP;
>  
> +	if (tc->egress_dev) {
> +		struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
> +		struct net_device *uplink_dev = mlx5_eswitch_get_uplink_netdev(esw);
> +
> +		return uplink_dev->netdev_ops->ndo_setup_tc(uplink_dev, handle,
> +							    proto, tc);
> +	}
> +
>  	switch (tc->type) {
>  	case TC_SETUP_CLSFLOWER:
>  		switch (tc->cls_flower->command) {
> 


* Re: [PATCH net 5/7] net: ethernet: stmmac: dwmac-meson8b: fix probe error path
From: Kevin Hilman @ 2016-11-30 16:43 UTC (permalink / raw)
  To: Johan Hovold
  Cc: David S. Miller, Giuseppe Cavallaro, Alexandre Torgue,
	Joachim Eastwood, Carlo Caione, Maxime Coquelin, Maxime Ripard,
	Chen-Yu Tsai, netdev, linux-kernel
In-Reply-To: <1480516195-27696-6-git-send-email-johan@kernel.org>

Johan Hovold <johan@kernel.org> writes:

> Make sure to disable clocks before returning on late probe errors.
>
> Fixes: 566e82516253 ("net: stmmac: add a glue driver for the Amlogic
> Meson 8b / GXBB DWMAC")
> Signed-off-by: Johan Hovold <johan@kernel.org>

Acked-by: Kevin Hilman <khilman@baylibre.com>


* Re: qed, qedi patchset submission
From: Martin K. Petersen @ 2016-11-30 16:45 UTC (permalink / raw)
  To: Arun Easi; +Cc: David Miller, Martin K. Petersen, linux-scsi, netdev
In-Reply-To: <alpine.LRH.2.00.1611291452030.28058@mvluser05.qlc.com>

>>>>> "Arun" == Arun Easi <arun.easi@cavium.com> writes:

Arun,

Arun> So far, we have been posting qedi changes split into functional
Arun> blocks, for review, but was not bisectable. With Martin ok to our
Arun> request to squash all patches while committing to tree, we were
Arun> wondering if we should post the qedi patches squashed, with all
Arun> the Reviewed-by added, or continue to post as before?

I guess it depends how things can be split up in a bisectable fashion.

If the net/ pieces can be completely separated from the scsi/ pieces
maybe it would be best to have two patches?

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH] net: ipv4: Don't crash if passing a null sk to ip_rt_update_pmtu.
From: Lorenzo Colitti @ 2016-11-30 16:46 UTC (permalink / raw)
  To: netdev@vger.kernel.org
In-Reply-To: <1480442207-43618-1-git-send-email-lorenzo@google.com>

On Tue, Nov 29, 2016 at 9:56 AM, Lorenzo Colitti <lorenzo@google.com> wrote:
> Commit e2d118a1cb5e ("net: inet: Support UID-based routing in IP
> protocols.") made __build_flow_key call sock_net(sk) to determine
> the network namespace of the passed-in socket. This crashes if sk
> is NULL.

Since I missed this in the patch description: this is targeted to
net-next (the code it fixes is not in net yet). Also:

Fixes: e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.")


* Re: Regression: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Saeed Mahameed @ 2016-11-30 16:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jesper Dangaard Brouer, David Miller, netdev, Tariq Toukan
In-Reply-To: <1480521514.18162.191.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Nov 30, 2016 at 5:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 15:08 +0100, Jesper Dangaard Brouer wrote:
>> On Fri, 25 Nov 2016 07:46:20 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> > From: Eric Dumazet <edumazet@google.com>
>>
>> Ended up-in net-next as:
>>
>>  commit 40931b85113dad7881d49e8759e5ad41d30a5e6c
>>  Author: Eric Dumazet <edumazet@google.com>
>>  Date:   Fri Nov 25 07:46:20 2016 -0800
>>
>>     mlx4: give precise rx/tx bytes/packets counters
>>
>>     mlx4 stats are chaotic because a deferred work queue is responsible
>>     to update them every 250 ms.
>>
>> Likely after this patch I get this crash (below), when rebooting my machine.
>> Looks like a device removal order thing.
>> Tested with net-next at commit 93ba22225504.
>>
>> [...]
>> [ 1967.248453] mlx5_core 0000:02:00.1: Shutdown was called
>> [ 1967.854556] mlx5_core 0000:02:00.0: Shutdown was called
>> [ 1968.443015] e1000e: EEE TX LPI TIMER: 00000011
>> [ 1968.484676] sd 3:0:0:0: [sda] Synchronizing SCSI cache
>> [ 1968.528354] mlx4_core 0000:01:00.0: mlx4_shutdown was called
>> [ 1968.534054] mlx4_en: mlx4p1: Close port called
>> [ 1968.571156] mlx4_en 0000:01:00.0: removed PHC
>> [ 1968.575677] mlx4_en: mlx4p2: Close port called
>> [ 1969.506602] BUG: unable to handle kernel NULL pointer dereference at 0000000000000d08
>> [ 1969.514530] IP: [<ffffffffa0127ca4>] mlx4_en_fold_software_stats.part.1+0x34/0xb0 [mlx4_en]
>> [ 1969.522963] PGD 0 [ 1969.524803]
>> [ 1969.526332] Oops: 0000 [#1] PREEMPT SMP
>> [ 1969.530201] Modules linked in: coretemp kvm_intel kvm irqbypass intel_cstate mxm_wmi i2c_i801 intel_rapl_perf i2c_smbus sg pcspkr i2c_core shpchp nfsd wmi video acpi_pad auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables mlx4_en e1000e mlx5_core ptp serio_raw sd_mod mlx4_core pps_core devlink hid_generic
>> [ 1969.559616] CPU: 3 PID: 3104 Comm: kworker/3:1 Not tainted 4.9.0-rc6-net-next3-01390-g93ba22225504 #12
>> [ 1969.568984] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015
>> [ 1969.578877] Workqueue: events linkwatch_event
>> [ 1969.583285] task: ffff8803f42a0000 task.stack: ffff88040b2d0000
>> [ 1969.589238] RIP: 0010:[<ffffffffa0127ca4>]  [<ffffffffa0127ca4>] mlx4_en_fold_software_stats.part.1+0x34/0xb0 [mlx4_en]
>> [ 1969.600102] RSP: 0018:ffff88040b2d3bd8  EFLAGS: 00010282
>> [ 1969.605442] RAX: ffff8803f432efc8 RBX: ffff8803f4320000 RCX: 0000000000000000
>> [ 1969.612604] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8803f4320000
>> [ 1969.619772] RBP: ffff88040b2d3bd8 R08: 000000000000000c R09: ffff8803f432f000
>> [ 1969.626938] R10: 0000000000000000 R11: ffff88040d64ac00 R12: ffff8803e5aff8dc
>> [ 1969.634104] R13: ffff8803f4320a28 R14: ffff8803e5aff800 R15: 0000000000000000
>> [ 1969.641273] FS:  0000000000000000(0000) GS:ffff88041fac0000(0000) knlGS:0000000000000000
>> [ 1969.649422] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1969.655197] CR2: 0000000000000d08 CR3: 0000000001c07000 CR4: 00000000001406e0
>> [ 1969.662366] Stack:
>> [ 1969.664412]  ffff88040b2d3be8 ffffffffa0127f8e ffff88040b2d3c10 ffffffffa012a23b
>> [ 1969.671948]  ffff8803e5aff8dc ffff8803f4320000 ffff8803f4320000 ffff88040b2d3c30
>> [ 1969.679478]  ffffffff8160ae29 ffff8803e5aff8d8 ffff8804088ff300 ffff88040b2d3c58
>> [ 1969.687001] Call Trace:
>> [ 1969.689484]  [<ffffffffa0127f8e>] mlx4_en_fold_software_stats+0x1e/0x20 [mlx4_en]
>> [ 1969.697026]  [<ffffffffa012a23b>] mlx4_en_get_stats64+0x2b/0x50 [mlx4_en]
>> [ 1969.703844]  [<ffffffff8160ae29>] dev_get_stats+0x39/0xa0
>> [ 1969.709274]  [<ffffffff81622470>] rtnl_fill_stats+0x40/0x130
>> [ 1969.714968]  [<ffffffff8162631b>] rtnl_fill_ifinfo+0x55b/0x1010
>> [ 1969.720921]  [<ffffffff816285d3>] rtmsg_ifinfo_build_skb+0x73/0xd0
>> [ 1969.727136]  [<ffffffff81628646>] rtmsg_ifinfo.part.25+0x16/0x50
>> [ 1969.733176]  [<ffffffff81628698>] rtmsg_ifinfo+0x18/0x20
>> [ 1969.738522]  [<ffffffff8160e947>] netdev_state_change+0x47/0x50
>> [ 1969.744478]  [<ffffffff81629018>] linkwatch_do_dev+0x38/0x50
>> [ 1969.750170]  [<ffffffff81629257>] __linkwatch_run_queue+0xe7/0x160
>> [ 1969.756385]  [<ffffffff816292f5>] linkwatch_event+0x25/0x30
>> [ 1969.761991]  [<ffffffff8107b3cb>] process_one_work+0x15b/0x460
>> [ 1969.767857]  [<ffffffff8107b71e>] worker_thread+0x4e/0x480
>> [ 1969.773378]  [<ffffffff8107b6d0>] ? process_one_work+0x460/0x460
>> [ 1969.779420]  [<ffffffff8107b6d0>] ? process_one_work+0x460/0x460
>> [ 1969.785460]  [<ffffffff810811ea>] kthread+0xca/0xe0
>> [ 1969.790372]  [<ffffffff81081120>] ? kthread_worker_fn+0x120/0x120
>> [ 1969.796495]  [<ffffffff817302d2>] ret_from_fork+0x22/0x30
>> [ 1969.801924] Code: 00 00 55 48 89 e5 85 d2 0f 84 90 00 00 00 83 ea 01 31 c9 31 f6 48 8d 87 c0 ef 00 00 4c 8d 8c d7 c8 ef 00 00 48 8b 10 48 83 c0 08 <4c> 8b 82 08 0d 00 00 48 8b 92 00 0d 00 00 4c 01 c6 48 01 d1 4c
>> [ 1969.821969] RIP  [<ffffffffa0127ca4>] mlx4_en_fold_software_stats.part.1+0x34/0xb0 [mlx4_en]
>> [ 1969.830486]  RSP <ffff88040b2d3bd8>
>> [ 1969.834002] CR2: 0000000000000d08
>> [ 1969.837440] ---[ end trace 80b9fbc1e7baed9b ]---
>> [ 1969.842102] Kernel panic - not syncing: Fatal exception in interrupt
>> [ 1969.848520] Kernel Offset: disabled
>> [ 1969.852050] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
>> (END)
>
> Hi Jesper.
>
> Thanks for the report.
>
> Then we have a bug in the driver, deleting some memory too soon.

No! It has always been this way: the cached stats are always there (never deleted).
We just stop caching once the device is down; nothing is deleted too soon.

>
> If we depend on having proper stats at device dismantle, we need to keep

We had, and still have, the proper stats. They are the ones that
mlx4_en_fold_software_stats() is trying to cache into (they always
exist), but the ones that you are trying to read from (the mlx4 rings)
are gone!

This bug is totally new and, as I warned, it is another symptom of
the real root cause (can't sleep while reading stats).

Eric, what do you suggest? Keep pre-allocated MAX_RINGS stats and
always iterate over all of them to query stats?
What if you have one ring, none, or 1K? How would you know how many to query?


* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-11-30 16:49 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161130163229.rkxvuwukgg35ktrx@splinter.mtl.com>

On 30.11.2016 17:32, Ido Schimmel wrote:
> Hi Hannes,
> 
> On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote:
>> On 30.11.2016 11:09, Jiri Pirko wrote:
>>> From: Ido Schimmel <idosch@mellanox.com>
>>>
>>> Make sure the device has a complete view of the FIB tables by invoking
>>> their dump during module init.
>>>
>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>> ---
>>>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 23 ++++++++++++++++++++++
>>>  1 file changed, 23 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>> index 14bed1d..d176047 100644
>>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
>>>  	return NOTIFY_DONE;
>>>  }
>>>  
>>> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
>>> +{
>>> +	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
>>> +
>>> +	/* Flush pending FIB notifications and then flush the device's
>>> +	 * table before requesting another dump. Do that with RTNL held,
>>> +	 * as FIB notification block is already registered.
>>> +	 */
>>> +	mlxsw_core_flush_owq();
>>> +	rtnl_lock();
>>> +	mlxsw_sp_router_fib_flush(mlxsw_sp);
>>> +	rtnl_unlock();
>>> +}
>>> +
>>>  int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>  {
>>> +	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
>>>  	int err;
>>>  
>>>  	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
>>> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>  
>>>  	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
>>>  	register_fib_notifier(&mlxsw_sp->fib_nb);
>>
>> Sorry to pick in here again:
>>
>> There is a race here. You need to protect the registration of the fib
>> notifier as well by the sequence counter. Updates here are not ordered
>> in relation to this code below.
> 
> You mean updates that can be received after you registered the notifier
> and until the dump started? I'm aware of that and that's OK. This
> listener should be able to handle duplicates.

I am not concerned about duplicates, but about ordering: if a delete
is processed before the add that the dump generates from the RCU code,
you will add the node to hw while it is already deleted in the software
path. You will probably ignore the delete because nothing is installed
in hw yet, and later add the node that was actually deleted, just
reordered because the delete happened on another CPU, no?
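
The reordering concern above can be illustrated with a toy model. This is a sketch under stated assumptions, not mlxsw code: a listener keeps one "installed" flag per route and ignores deletes for routes it has not installed. All demo_* names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

enum demo_ev_type { DEMO_ADD, DEMO_DEL };

struct demo_event {
	enum demo_ev_type type;
	unsigned int route;	/* index into a small route table */
};

#define DEMO_ROUTES 8

/* Apply events in the order given. An add marks the route installed;
 * a delete clears it, which is a no-op if it was never installed.
 */
static void demo_apply(bool installed[DEMO_ROUTES],
		       const struct demo_event *ev, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (ev[i].route >= DEMO_ROUTES)
			continue;
		installed[ev[i].route] = (ev[i].type == DEMO_ADD);
	}
}
```

Feeding { ADD 3, DEL 3 } leaves route 3 absent, while the reordered { DEL 3, ADD 3 } leaves it installed, so the hardware table would keep a route the software path has already removed. Handling duplicates does not help with this case.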

> I've a follow up patchset that introduces a new event in switchdev
> notification chain called SWITCHDEV_SYNC, which is sent when port
> netdevs are enslaved / released  from a master device (points in time
> where kernel<->device can get out of sync). It will invoke
> re-propagation of configuration from different parts of the stack
> (e.g. bridge driver, 8021q driver, fib/neigh code), which can result
> in duplicates.

Okay, understood. I wonder how we can actually protect against
accidental abort calls. E.g. if I start to inject routes into my routing
domain, how can I make sure the box doesn't die once I have inserted
enough routes? Do we need to touch quagga etc.?

Thanks,
Hannes


* Re: [net-next PATCH v3 6/6] virtio_net: xdp, add slowpath case for non contiguous buffers
From: John Fastabend @ 2016-11-30 16:50 UTC (permalink / raw)
  To: Jakub Kicinski, Michael S. Tsirkin
  Cc: eric.dumazet, daniel, shm, davem, tgraf, alexei.starovoitov,
	john.r.fastabend, netdev, bblanco, brouer
In-Reply-To: <20161130143031.2fc64ab4@jkicinski-Precision-T1700>

On 16-11-30 06:30 AM, Jakub Kicinski wrote:
> [add MST]
> 

Thanks, and sorry MST. I did a cut'n'paste of an old list of CC's and
missed that you were not on the list.

[...]

>> +	memcpy(page_address(page) + page_off, page_address(p) + offset, *len);
>> +	while (--num_buf) {
>> +		unsigned int buflen;
>> +		unsigned long ctx;
>> +		void *buf;
>> +		int off;
>> +
>> +		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &buflen);
>> +		if (unlikely(!ctx))
>> +			goto err_buf;
>> +
>> +		buf = mergeable_ctx_to_buf_address(ctx);
>> +		p = virt_to_head_page(buf);
>> +		off = buf - page_address(p);
>> +
>> +		memcpy(page_address(page) + page_off,
>> +		       page_address(p) + off, buflen);
>> +		page_off += buflen;
> 
> Could malicious user potentially submit a frame bigger than MTU?

Well, presumably if the MTU is greater than PAGE_SIZE the xdp program
would not have been loaded. And the malicious user in this case would
have to be qemu, and it seems everything is already lost if qemu
is trying to attack its VM.

But this is a good point, because it looks like there is nothing in
virtio or qemu that drops frames with MTU greater than the virtio
configured setting. Maybe Michael can confirm this or I'll poke at it
more. I think qemu should drop these frames in general.

So I think adding a guard here is sensible; I'll go ahead and do that.
Also, the MTU guard at set_xdp time needs to account for header length.

Thanks, nice catch.
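
A minimal user-space sketch of the kind of guard meant here, assuming the linearized frame must fit in a single page. It is not the virtio_net change itself; DEMO_PAGE_SIZE, struct demo_buf and demo_linearize() are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096

struct demo_buf {
	const void *data;
	size_t len;
};

/* Copy n buffers back to back into dst, which must hold
 * DEMO_PAGE_SIZE bytes, starting at page_off. Returns the final
 * offset, or -1 as soon as a buffer would not fit in the page.
 */
static long demo_linearize(unsigned char *dst, size_t page_off,
			   const struct demo_buf *bufs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		/* Guard: written as a subtraction so it cannot overflow. */
		if (page_off > DEMO_PAGE_SIZE ||
		    bufs[i].len > DEMO_PAGE_SIZE - page_off)
			return -1;
		memcpy(dst + page_off, bufs[i].data, bufs[i].len);
		page_off += bufs[i].len;
	}
	return (long)page_off;
}
```

The check runs before each memcpy(), so an oversized frame is rejected without writing past the destination page.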

> 
>> +	}
>> +
>> +	*len = page_off;
>> +	return page;
>> +err_buf:
>> +	__free_pages(page, 0);
>> +	return NULL;
>> +}
>> +
>>  static struct sk_buff *receive_mergeable(struct net_device *dev,
>>  					 struct virtnet_info *vi,
>>  					 struct receive_queue *rq,
>> @@ -469,21 +519,37 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>  	rcu_read_lock();
>>  	xdp_prog = rcu_dereference(rq->xdp_prog);
>>  	if (xdp_prog) {
>> +		struct page *xdp_page;
>>  		u32 act;
>>  
>>  		if (num_buf > 1) {
>>  			bpf_warn_invalid_xdp_buffer();
>> -			goto err_xdp;
>> +
>> +			/* linearize data for XDP */
>> +			xdp_page = xdp_linearize_page(rq, num_buf,
>> +						      page, offset, &len);
>> +			if (!xdp_page)
>> +				goto err_xdp;
>> +			offset = len;
>> +		} else {
>> +			xdp_page = page;
>>  		}
>>  
>> -		act = do_xdp_prog(vi, xdp_prog, page, offset, len);
>> +		act = do_xdp_prog(vi, xdp_prog, xdp_page, offset, len);
>>  		switch (act) {
>>  		case XDP_PASS:
>> +			if (unlikely(xdp_page != page))
>> +				__free_pages(xdp_page, 0);
>>  			break;
>>  		case XDP_TX:
>> +			if (unlikely(xdp_page != page))
>> +				goto err_xdp;
>> +			rcu_read_unlock();
> 
> Only if there is a reason for v4 - this unlock could go to the previous
> patch.
> 

Sure will do this.


* Re: DSA vs. SWTICHDEV ?
From: Andrew Lunn @ 2016-11-30 16:55 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: netdev@vger.kernel.org
In-Reply-To: <1480523716.3563.144.camel@infinera.com>

> This is an embedded system with several boards in a subrack.
> Each board has eth I/F connected to a switch to communicate with each other.
> One of the board will also house the actual switch device and manage the switch.
> Then the normal app just communicates over the physical eth I/F like any other board
> in the system. There is a "manage switch app" which brings the switch up and partition
> phys VLANs etc. (each phys I/F would be a a separate domain so no loop)

So you are planning on throwing away the "manage switch app", and just
using standard linux networking commands? That is what switchdev is all
about really: throwing away the vendor SDK for the switch, making a
switch just a bunch of interfaces on the host which you manage as
normal interfaces.

> I guess I could skip the phys I/F and have the switch app create a virtual eth0 I/F over PCIe

No need to create this interface. It will exist if you go the
switchdev route.

> > > And switchdev can do all this over PCIe instead? Can you have a
> > > switch tree in switchdev too?
> > 
> > > Mellanox says so, but i don't think they have actually implemented it.
> > 
> > Not impl. any of DSAs features? What can you do with a Mellanox switch then?

They don't have a tree of switches, as far as I know. Just a single
switch. But DSA does support a tree of switches; that is what the D in
DSA means, distributed. And there are a couple of boards which have 2
to 4 switches in a tree.

I think this is partially down to market segments. Mellanox's market is
top of rack switches. High port count, very high bandwidth. DSA is
more for wireless access points, set top boxes, generally up to 7 ports
of 1Gbps, and a few custom embedded products which need more ports and
so build a tree of switches.

      Andrew

* Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure
From: John Fastabend @ 2016-11-30 16:57 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Thomas Graf, davem, netdev, daniel, tom, roopa, hannes
In-Reply-To: <20161130053735.GB31581@ast-mbp.thefacebook.com>

On 16-11-29 09:37 PM, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 06:52:36PM -0800, John Fastabend wrote:
>> On 16-11-29 04:15 PM, Alexei Starovoitov wrote:
>>> On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
>>>> Registers new BPF program types which correspond to the LWT hooks:
>>>>   - BPF_PROG_TYPE_LWT_IN   => dst_input()
>>>>   - BPF_PROG_TYPE_LWT_OUT  => dst_output()
>>>>   - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
>>>>
>>>> The separate program types are required to differentiate between the
>>>> capabilities each LWT hook allows:
>>>>
>>>>  * Programs attached to dst_input() or dst_output() are restricted and
>>>>    may only read the data of an skb. This prevents modification and
>>>>    possible invalidation of already validated packet headers on receive
>>>>    and the construction of illegal headers while the IP headers are
>>>>    still being assembled.
>>>>
>>>>  * Programs attached to lwtunnel_xmit() are allowed to modify packet
>>>>    content as well as prepending an L2 header via a newly introduced
>>>>    helper bpf_skb_push(). This is safe as lwtunnel_xmit() is invoked
>>>>    after the IP header has been assembled completely.
>>>>
>>>> All BPF programs receive an skb with L3 headers attached and may return
>>>> one of the following error codes:
>>>>
>>>>  BPF_OK - Continue routing as per nexthop
>>>>  BPF_DROP - Drop skb and return EPERM
>>>>  BPF_REDIRECT - Redirect skb to device as per redirect() helper.
>>>>                 (Only valid in lwtunnel_xmit() context)
>>>>
>>>> The return codes are binary compatible with their TC_ACT_
>>>> relatives to ease compatibility.
>>>>
>>>> Signed-off-by: Thomas Graf <tgraf@suug.ch>
>>> ...
>>>> +#define LWT_BPF_MAX_HEADROOM 128
>>>
>>> why 128?
>>> btw I'm thinking for XDP to use 256, so metadata can be stored in there.
>>>
>>
>> hopefully not too off-topic but for XDP I would like to see this get
> 
> definitely off-topic. lwt->headroom is existing concept. Too late
> to do anything about it.
> 
>> passed down with the program. It would be more generic and drivers could
>> configure the headroom on demand and more importantly verify that a
>> program pushing data is not going to fail at runtime.
> 
> For xdp I think it will be problematic, since we'd have to check for
> that value at prog array access to make sure tailcalls are not broken.
> Mix and match won't be possible.
> So what does 'configure the headroom on demand' buy us?
> Isn't it much easier to tell all drivers "always reserve this much" ?
> We burn the page anyway.
> If it's configurable per driver, then we'd need an api
> to retrieve it. Yet the program author doesn't care what the value is.
> If the program needs to do udp encap, it will try to do it. No matter what.
> If xdp_adjust_head() helper fails, the program will likely decide
> to drop the packet. In some cases it may decide to punt to stack
> for further processing, but for high performance dataplane code
> it's highly unlikely.
> If it's configurable to something that is not on a cache line boundary,
> hw dma performance may suffer and so on.
> So I see only cons in such 'configurable headroom' and propose
> to have fixed 256 bytes headroom for XDP
> which is enough for any sensible encap and metadata.
> 

OK, I'm convinced; let it be fixed at some conservative value.
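
To make the "always reserve this much" argument above concrete, here is a
minimal userspace sketch, not kernel code. It models a buffer with a fixed
256-byte headroom and a program that tries to prepend an encap header and
drops the packet if the adjustment fails. struct pkt_buf, pkt_init(),
pkt_adjust_head(), encap() and the 64-byte cache line size are all
hypothetical names and values chosen for illustration; only the 256-byte
figure comes from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative values only. */
#define XDP_FIXED_HEADROOM 256u /* fixed headroom proposed in the thread */
#define CACHE_LINE_SIZE     64u /* assumption: a common cache line size */

/* Hypothetical stand-in for a packet buffer; not the kernel's xdp_buff. */
struct pkt_buf {
	uint8_t *hard_start; /* start of the reserved headroom */
	uint8_t *data;       /* current start of packet data */
	uint8_t *data_end;   /* one past the last packet byte */
};

enum verdict { VERDICT_DROP, VERDICT_TX };

/* Place len bytes of packet data behind the fixed headroom. */
static void pkt_init(struct pkt_buf *p, uint8_t *page, size_t len)
{
	/* A fixed value can be chosen once to sit on a cache line boundary. */
	assert(XDP_FIXED_HEADROOM % CACHE_LINE_SIZE == 0);
	p->hard_start = page;
	p->data = page + XDP_FIXED_HEADROOM;
	p->data_end = p->data + len;
}

/*
 * Move the start of packet data by delta bytes; a negative delta grows
 * the packet into the headroom. Returns 0 on success, -1 if the move
 * would leave the headroom or consume all packet data.
 */
static int pkt_adjust_head(struct pkt_buf *p, int delta)
{
	ptrdiff_t headroom = p->data - p->hard_start;
	ptrdiff_t data_len = p->data_end - p->data;

	if (delta < 0 && -(ptrdiff_t)delta > headroom)
		return -1;
	if (delta > 0 && (ptrdiff_t)delta >= data_len)
		return -1;
	p->data += delta;
	return 0;
}

/* Try to prepend hdr_len bytes of (zeroed) header; drop on failure. */
static enum verdict encap(struct pkt_buf *p, size_t hdr_len)
{
	if (pkt_adjust_head(p, -(int)hdr_len) != 0)
		return VERDICT_DROP;
	memset(p->data, 0, hdr_len);
	return VERDICT_TX;
}
```

In this model the program never asks how much headroom is available; it
just attempts the encap and treats failure as a drop, which is the
behaviour described in the quoted reply.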

* Re: BUG() can be hit in tcp_collapse()
From: Vladis Dronov @ 2016-11-30 17:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, stable, Marco Grassi
In-Reply-To: <1716309808.12143903.1478869689618.JavaMail.zimbra@redhat.com>

Hello, Eric, Marco, all,

This is JFYI and a follow-up message.

Further investigation was done to find the Linux kernel commit that introduced the
flaw. It turns out that earlier Linux kernel versions are vulnerable too, down to
v3.6-rc1. This was hidden because 'net.ipv4.tcp_fastopen' was set to 0 by default; the
flaw has been easier to notice since kernel v3.12, where commit 0d41cca490 changed the
default to 1. With 'net.ipv4.tcp_fastopen' set to 1, previous Linux kernels (including
RHEL-7 ones) are also vulnerable.
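
As a rough aid for checking whether a given host is in the exposed
configuration, the sketch below parses the text of the sysctl (for example
the contents of /proc/sys/net/ipv4/tcp_fastopen) and tests the client-side
bit. It assumes, per the kernel's ip-sysctl documentation, that bit 0x1 is
the client enable; tfo_client_enabled() is a hypothetical helper name, and
reading the file is left to the caller.

```c
#include <stdlib.h>

/* Client-side enable bit of net.ipv4.tcp_fastopen (assumed: 0x1). */
#define TFO_CLIENT_ENABLE 0x1

/*
 * Returns 1 if the client bit is set in the sysctl text, 0 if it is
 * clear, and -1 if the text does not start with a number.
 */
static int tfo_client_enabled(const char *sysctl_text)
{
	char *end;
	long val = strtol(sysctl_text, &end, 0);

	if (end == sysctl_text)
		return -1;
	return (val & TFO_CLIENT_ENABLE) ? 1 : 0;
}
```

A value of 0 matches the pre-v3.12 default described above, and a value of
1 matches the default after commit 0d41cca490.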

The bug has been present since the tcp-fastopen feature was introduced in kernel
v3.6-rc1. The first commit at which the reproducer panics the kernel with
net.ipv4.tcp_fastopen=1 set is cf60af03ca, which is part of the commit sequence
2100c8d2d9..67da22d23f introducing the net-tcp-fastopen feature:

$ git bisect bad cf60af03ca4e71134206809ea892e49b92a88896
cf60af03ca4e71134206809ea892e49b92a88896 is the first bad commit
commit cf60af03ca4e71134206809ea892e49b92a88896
Author: Yuchung Cheng <ycheng@google.com>
Date:   Thu Jul 19 06:43:09 2012 +0000

So, ideally, the upstream commit ac6e780070, which fixes the bug, should have carried a
"Fixes: cf60af03ca" tag. Unfortunately, this investigation was not completed by the
time the patch was accepted upstream, and I do not see any other way to add this
information except as a note in a comment in the related code, which seems weird.

Best regards,
Vladis Dronov | Red Hat, Inc. | Product Security Engineer

* Re: [PATCH v2] vxlan: fix a potential issue when create a new vxlan fdb entry.
From: David Miller @ 2016-11-30 17:03 UTC (permalink / raw)
  To: yanhaishuang; +Cc: jbenc, hannes, pshelar, netdev, linux-kernel
In-Reply-To: <1480384776-8252-1-git-send-email-yanhaishuang@cmss.chinamobile.com>

From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Date: Tue, 29 Nov 2016 09:59:36 +0800

> vxlan_fdb_append may return an error, so add the proper check;
> otherwise it will cause a memory leak.
> 
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
> 
> Changes in v2:
>   - Unnecessary to initialize rc to zero.

Applied, thanks.
