[PATCH v3 net-next 0/4]: Allow head adjustment in XDP prog

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 net-next 0/4]: Allow head adjustment in XDP prog
@ 2016-12-07  5:31 Martin KaFai Lau
  2016-12-07  5:31 ` [PATCH v3 net-next 1/4] bpf: xdp: " Martin KaFai Lau
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Martin KaFai Lau @ 2016-12-07  5:31 UTC (permalink / raw)
  To: netdev
  Cc: Alexei Starovoitov, Brenden Blanco, Daniel Borkmann, David Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Saeed Mahameed, Tariq Toukan, Kernel Team

This series adds a helper to allow head adjusting in XDP prog.  mlx4
driver has been modified to support this feature.  An example is written
to encapsulate a packet with an IPv4/v6 header and then XDP_TX it
out.

v3:
1. Check if the driver supports head adjustment before
   setting the xdp_prog fd to the device in patch 1of4.
2. Remove the page alignment assumption on the data_hard_start.
   Instead, add data_hard_start to the struct xdp_buff and the
   driver has to fill it if it supports head adjustment.
3. Keep the wire MTU as before in mlx4
4. Set map0_byte_count to PAGE_SIZE in patch 3of4

v2:
1. Make a variable name change in bpf_xdp_adjust_head() in patch 1
2. Ensure no less than ETH_HLEN data in bpf_xdp_adjust_head() in patch 1
3. Some clarifications in commit log messages of patch 2 and 3

Thanks,
Martin

Martin KaFai Lau (4):
  bpf: xdp: Allow head adjustment in XDP prog
  mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs
  mlx4: xdp: Reserve headroom for receiving packet when XDP prog is
    active
  bpf: xdp: Add XDP example for head adjustment

 arch/powerpc/net/bpf_jit_comp64.c                  |   4 +-
 arch/s390/net/bpf_jit_comp.c                       |   2 +-
 arch/x86/net/bpf_jit_comp.c                        |   2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c     |  32 ++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         |  70 +++---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c         |   9 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h       |   3 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   3 +
 .../net/ethernet/netronome/nfp/nfp_net_common.c    |   3 +
 drivers/net/ethernet/qlogic/qede/qede_main.c       |   3 +
 include/linux/filter.h                             |   6 +-
 include/linux/netdevice.h                          |  12 +
 include/uapi/linux/bpf.h                           |  11 +-
 kernel/bpf/core.c                                  |   2 +-
 kernel/bpf/syscall.c                               |   2 +
 kernel/bpf/verifier.c                              |   2 +-
 net/core/dev.c                                     |   9 +
 net/core/filter.c                                  |  28 ++-
 samples/bpf/Makefile                               |   4 +
 samples/bpf/bpf_helpers.h                          |   2 +
 samples/bpf/bpf_load.c                             |  94 ++++++++
 samples/bpf/bpf_load.h                             |   1 +
 samples/bpf/xdp1_user.c                            |  93 --------
 samples/bpf/xdp_tx_iptnl_common.h                  |  37 +++
 samples/bpf/xdp_tx_iptnl_kern.c                    | 232 +++++++++++++++++++
 samples/bpf/xdp_tx_iptnl_user.c                    | 253 +++++++++++++++++++++
 26 files changed, 774 insertions(+), 145 deletions(-)
 create mode 100644 samples/bpf/xdp_tx_iptnl_common.h
 create mode 100644 samples/bpf/xdp_tx_iptnl_kern.c
 create mode 100644 samples/bpf/xdp_tx_iptnl_user.c

-- 
2.5.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v3 net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog
  2016-12-07  5:31 [PATCH v3 net-next 0/4]: Allow head adjustment in XDP prog Martin KaFai Lau
@ 2016-12-07  5:31 ` Martin KaFai Lau
  2016-12-07  9:32   ` Daniel Borkmann
  2016-12-07  5:31 ` [PATCH v3 net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs Martin KaFai Lau
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Martin KaFai Lau @ 2016-12-07  5:31 UTC (permalink / raw)
  To: netdev
  Cc: Alexei Starovoitov, Brenden Blanco, Daniel Borkmann, David Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Saeed Mahameed, Tariq Toukan, Kernel Team

This patch allows XDP prog to extend/remove the packet
data at the head (like adding or removing header).  It is
done by adding a new XDP helper bpf_xdp_adjust_head().

It also renames bpf_helper_changes_skb_data() to
bpf_helper_changes_pkt_data() to better reflect
that XDP prog does not work on skb.

To avoid breaking unsupported drivers, this patch
also does the needed checking before setting
the xdp_prog fd to the device.  It is done by
1) Adding a XDP_QUERY_FEATURES command
2) Adding one "xdp_adjust_head" bit to bpf_prog

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 arch/powerpc/net/bpf_jit_comp64.c                  |  4 ++--
 arch/s390/net/bpf_jit_comp.c                       |  2 +-
 arch/x86/net/bpf_jit_comp.c                        |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c     |  3 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  3 +++
 .../net/ethernet/netronome/nfp/nfp_net_common.c    |  3 +++
 drivers/net/ethernet/qlogic/qede/qede_main.c       |  3 +++
 include/linux/filter.h                             |  6 +++--
 include/linux/netdevice.h                          | 12 ++++++++++
 include/uapi/linux/bpf.h                           | 11 ++++++++-
 kernel/bpf/core.c                                  |  2 +-
 kernel/bpf/syscall.c                               |  2 ++
 kernel/bpf/verifier.c                              |  2 +-
 net/core/dev.c                                     |  9 +++++++
 net/core/filter.c                                  | 28 ++++++++++++++++++++--
 15 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
index 0fe98a567125..73a5cf18fd84 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -766,7 +766,7 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
 			func = (u8 *) __bpf_call_base + imm;
 
 			/* Save skb pointer if we need to re-cache skb data */
-			if (bpf_helper_changes_skb_data(func))
+			if (bpf_helper_changes_pkt_data(func))
 				PPC_BPF_STL(3, 1, bpf_jit_stack_local(ctx));
 
 			bpf_jit_emit_func_call(image, ctx, (u64)func);
@@ -775,7 +775,7 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
 			PPC_MR(b2p[BPF_REG_0], 3);
 
 			/* refresh skb cache */
-			if (bpf_helper_changes_skb_data(func)) {
+			if (bpf_helper_changes_pkt_data(func)) {
 				/* reload skb pointer to r3 */
 				PPC_BPF_LL(3, 1, bpf_jit_stack_local(ctx));
 				bpf_jit_emit_skb_loads(image, ctx);
diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
index bee281f3163d..167b31b186c1 100644
--- a/arch/s390/net/bpf_jit_comp.c
+++ b/arch/s390/net/bpf_jit_comp.c
@@ -981,7 +981,7 @@ static noinline int bpf_jit_insn(struct bpf_jit *jit, struct bpf_prog *fp, int i
 		EMIT2(0x0d00, REG_14, REG_W1);
 		/* lgr %b0,%r2: load return value into %b0 */
 		EMIT4(0xb9040000, BPF_REG_0, REG_2);
-		if (bpf_helper_changes_skb_data((void *)func)) {
+		if (bpf_helper_changes_pkt_data((void *)func)) {
 			jit->seen |= SEEN_SKB_CHANGE;
 			/* lg %b1,ST_OFF_SKBP(%r15) */
 			EMIT6_DISP_LH(0xe3000000, 0x0004, BPF_REG_1, REG_0,
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index fe04a04dab8e..e76d1af60f7a 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -853,7 +853,7 @@ xadd:			if (is_imm8(insn->off))
 			func = (u8 *) __bpf_call_base + imm32;
 			jmp_offset = func - (image + addrs[i]);
 			if (seen_ld_abs) {
-				reload_skb_data = bpf_helper_changes_skb_data(func);
+				reload_skb_data = bpf_helper_changes_pkt_data(func);
 				if (reload_skb_data) {
 					EMIT1(0x57); /* push %rdi */
 					jmp_offset += 22; /* pop, mov, sub, mov */
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 49a81f1fc1d6..6261157f444e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2794,6 +2794,9 @@ static int mlx4_xdp(struct net_device *dev, struct netdev_xdp *xdp)
 	case XDP_QUERY_PROG:
 		xdp->prog_attached = mlx4_xdp_attached(dev);
 		return 0;
+	case XDP_QUERY_FEATURES:
+		xdp->features = 0;
+		return 0;
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 9def5cc378a3..f7a6b6b56d30 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3262,6 +3262,9 @@ static int mlx5e_xdp(struct net_device *dev, struct netdev_xdp *xdp)
 	case XDP_QUERY_PROG:
 		xdp->prog_attached = mlx5e_xdp_attached(dev);
 		return 0;
+	case XDP_QUERY_FEATURES:
+		xdp->features = 0;
+		return 0;
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 00d9a03be31d..89c95bb91503 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -2981,6 +2981,9 @@ static int nfp_net_xdp(struct net_device *netdev, struct netdev_xdp *xdp)
 	case XDP_QUERY_PROG:
 		xdp->prog_attached = !!nn->xdp_prog;
 		return 0;
+	case XDP_QUERY_FEATURES:
+		xdp->features = 0;
+		return 0;
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index cf1dd1436d93..d52dd83ae8a1 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -2525,6 +2525,9 @@ static int qede_xdp(struct net_device *dev, struct netdev_xdp *xdp)
 	case XDP_QUERY_PROG:
 		xdp->prog_attached = !!edev->xdp_prog;
 		return 0;
+	case XDP_QUERY_FEATURES:
+		xdp->features = 0;
+		return 0;
 	default:
 		return -EINVAL;
 	}
diff --git a/include/linux/filter.h b/include/linux/filter.h
index f078d2b1cff6..6a1658308612 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -406,7 +406,8 @@ struct bpf_prog {
 	u16			jited:1,	/* Is our filter JIT'ed? */
 				gpl_compatible:1, /* Is filter GPL compatible? */
 				cb_access:1,	/* Is control block accessed? */
-				dst_needed:1;	/* Do we need dst entry? */
+				dst_needed:1,	/* Do we need dst entry? */
+				xdp_adjust_head:1; /* Adjusting pkt head? */
 	kmemcheck_bitfield_end(meta);
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	u32			len;		/* Number of filter blocks */
@@ -440,6 +441,7 @@ struct bpf_skb_data_end {
 struct xdp_buff {
 	void *data;
 	void *data_end;
+	void *data_hard_start;
 };
 
 /* compute the linear packet data range [data, data_end) which
@@ -595,7 +597,7 @@ void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
 u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
-bool bpf_helper_changes_skb_data(void *func);
+bool bpf_helper_changes_pkt_data(void *func);
 
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1ff5ea6e1221..786ad7c67215 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -30,6 +30,7 @@
 #include <linux/delay.h>
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
+#include <linux/bitops.h>
 #include <asm/cache.h>
 #include <asm/byteorder.h>
 
@@ -805,6 +806,13 @@ struct tc_to_netdev {
 	bool egress_dev;
 };
 
+/* Driver must allow a XDP prog to extend header by
+ * up to XDP_PACKET_HEADROOM.  It must also fill out
+ * the data_hard_start value in struct xdp_buff
+ * before calling out the xdp_prog.
+ */
+#define XDP_F_ADJUST_HEAD	BIT(0)
+
 /* These structures hold the attributes of xdp state that are being passed
  * to the netdevice through the xdp op.
  */
@@ -821,6 +829,8 @@ enum xdp_netdev_command {
 	 * return true if a program is currently attached and running.
 	 */
 	XDP_QUERY_PROG,
+	/* Check what XDP features are supported by a device */
+	XDP_QUERY_FEATURES,
 };
 
 struct netdev_xdp {
@@ -830,6 +840,8 @@ struct netdev_xdp {
 		struct bpf_prog *prog;
 		/* XDP_QUERY_PROG */
 		bool prog_attached;
+		/* XDP_QUERY_FEATURES */
+		u32 features;
 	};
 };
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6123d9b8e828..0eb0e87dbe9f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -424,6 +424,12 @@ union bpf_attr {
  *     @len: length of header to be pushed in front
  *     @flags: Flags (unused for now)
  *     Return: 0 on success or negative error
+ *
+ * int bpf_xdp_adjust_head(xdp_md, delta)
+ *     Adjust the xdp_md.data by delta
+ *     @xdp_md: pointer to xdp_md
+ *     @delta: An positive/negative integer to be added to xdp_md.data
+ *     Return: 0 on success or negative on error
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -469,7 +475,8 @@ union bpf_attr {
 	FN(csum_update),		\
 	FN(set_hash_invalid),		\
 	FN(get_numa_node_id),		\
-	FN(skb_change_head),
+	FN(skb_change_head),		\
+	FN(xdp_adjust_head),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -576,6 +583,8 @@ struct bpf_sock {
 	__u32 protocol;
 };
 
+#define XDP_PACKET_HEADROOM 256
+
 /* User return codes for XDP prog type.
  * A valid XDP program must return one of these defined values. All other
  * return codes are reserved for future use. Unknown return codes will result
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index bdcc9f4ba767..83e0d153b0b4 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1143,7 +1143,7 @@ struct bpf_prog * __weak bpf_int_jit_compile(struct bpf_prog *prog)
 	return prog;
 }
 
-bool __weak bpf_helper_changes_skb_data(void *func)
+bool __weak bpf_helper_changes_pkt_data(void *func)
 {
 	return false;
 }
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c0d2b423ce93..add93ecd7e69 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -579,6 +579,8 @@ static void fixup_bpf_calls(struct bpf_prog *prog)
 				prog->dst_needed = 1;
 			if (insn->imm == BPF_FUNC_get_prandom_u32)
 				bpf_user_rnd_init_once();
+			if (insn->imm == BPF_FUNC_xdp_adjust_head)
+				prog->xdp_adjust_head = 1;
 			if (insn->imm == BPF_FUNC_tail_call) {
 				/* mark bpf_tail_call as different opcode
 				 * to avoid conditional branch in
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index cb37339ca0da..f5fa326518c9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1216,7 +1216,7 @@ static int check_call(struct bpf_verifier_env *env, int func_id)
 		return -EINVAL;
 	}
 
-	changes_data = bpf_helper_changes_skb_data(fn->func);
+	changes_data = bpf_helper_changes_pkt_data(fn->func);
 
 	memset(&meta, 0, sizeof(meta));
 	meta.pkt_access = fn->pkt_access;
diff --git a/net/core/dev.c b/net/core/dev.c
index bffb5253e778..90696f7e6b59 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6722,6 +6722,15 @@ int dev_change_xdp_fd(struct net_device *dev, int fd, u32 flags)
 		prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
 		if (IS_ERR(prog))
 			return PTR_ERR(prog);
+
+		xdp.command = XDP_QUERY_FEATURES;
+		err = ops->ndo_xdp(dev, &xdp);
+		if (err)
+			return err;
+
+		if (prog->xdp_adjust_head &&
+		    !(xdp.features & XDP_F_ADJUST_HEAD))
+			return -ENOTSUPP;
 	}
 
 	memset(&xdp, 0, sizeof(xdp));
diff --git a/net/core/filter.c b/net/core/filter.c
index b751202e12f8..b1461708a977 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2234,7 +2234,28 @@ static const struct bpf_func_proto bpf_skb_change_head_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-bool bpf_helper_changes_skb_data(void *func)
+BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
+{
+	void *data = xdp->data + offset;
+
+	if (unlikely(data < xdp->data_hard_start ||
+		     data > xdp->data_end - ETH_HLEN))
+		return -EINVAL;
+
+	xdp->data = data;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
+	.func		= bpf_xdp_adjust_head,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+};
+
+bool bpf_helper_changes_pkt_data(void *func)
 {
 	if (func == bpf_skb_vlan_push ||
 	    func == bpf_skb_vlan_pop ||
@@ -2244,7 +2265,8 @@ bool bpf_helper_changes_skb_data(void *func)
 	    func == bpf_skb_change_tail ||
 	    func == bpf_skb_pull_data ||
 	    func == bpf_l3_csum_replace ||
-	    func == bpf_l4_csum_replace)
+	    func == bpf_l4_csum_replace ||
+	    func == bpf_xdp_adjust_head)
 		return true;
 
 	return false;
@@ -2670,6 +2692,8 @@ xdp_func_proto(enum bpf_func_id func_id)
 		return &bpf_xdp_event_output_proto;
 	case BPF_FUNC_get_smp_processor_id:
 		return &bpf_get_smp_processor_id_proto;
+	case BPF_FUNC_xdp_adjust_head:
+		return &bpf_xdp_adjust_head_proto;
 	default:
 		return sk_filter_func_proto(func_id);
 	}
-- 
2.5.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs
  2016-12-07  5:31 [PATCH v3 net-next 0/4]: Allow head adjustment in XDP prog Martin KaFai Lau
  2016-12-07  5:31 ` [PATCH v3 net-next 1/4] bpf: xdp: " Martin KaFai Lau
@ 2016-12-07  5:31 ` Martin KaFai Lau
  2016-12-07  5:31 ` [PATCH v3 net-next 3/4] mlx4: xdp: Reserve headroom for receiving packet when XDP prog is active Martin KaFai Lau
  2016-12-07  5:31 ` [PATCH v3 net-next 4/4] bpf: xdp: Add XDP example for head adjustment Martin KaFai Lau
  3 siblings, 0 replies; 13+ messages in thread
From: Martin KaFai Lau @ 2016-12-07  5:31 UTC (permalink / raw)
  To: netdev
  Cc: Alexei Starovoitov, Brenden Blanco, Daniel Borkmann, David Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Saeed Mahameed, Tariq Toukan, Kernel Team

When XDP is active in mlx4, mlx4 is using one page/pkt.
At the same time (i.e. when XDP is active), it is currently
limiting MTU to be FRAG_SZ0 - ETH_HLEN - (2 * VLAN_HLEN)
which is 1514 in x86.  AFAICT, we can at least raise the MTU
limit up to PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) which this
patch is doing.  It will be useful in the next patch which
allows XDP program to extend the packet by adding new header(s).

Note: In the earlier XDP patches, there is already existing guard
to ensure the page/pkt scheme only applies when XDP is active
in mlx4.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 28 +++++++++++-----
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 46 ++++++++++++++------------
 2 files changed, 44 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 6261157f444e..5482591688f8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -51,6 +51,8 @@
 #include "mlx4_en.h"
 #include "en_port.h"
 
+#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN)))
+
 int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -2249,6 +2251,19 @@ void mlx4_en_destroy_netdev(struct net_device *dev)
 	free_netdev(dev);
 }
 
+static bool mlx4_en_check_xdp_mtu(struct net_device *dev, int mtu)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+
+	if (mtu > MLX4_EN_MAX_XDP_MTU) {
+		en_err(priv, "mtu:%d > max:%d when XDP prog is attached\n",
+		       mtu, MLX4_EN_MAX_XDP_MTU);
+		return false;
+	}
+
+	return true;
+}
+
 static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -2258,11 +2273,10 @@ static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
 	en_dbg(DRV, priv, "Change MTU called - current:%d new:%d\n",
 		 dev->mtu, new_mtu);
 
-	if (priv->tx_ring_num[TX_XDP] && MLX4_EN_EFF_MTU(new_mtu) > FRAG_SZ0) {
-		en_err(priv, "MTU size:%d requires frags but XDP running\n",
-		       new_mtu);
-		return -EOPNOTSUPP;
-	}
+	if (priv->tx_ring_num[TX_XDP] &&
+	    !mlx4_en_check_xdp_mtu(dev, new_mtu))
+		return -ENOTSUPP;
+
 	dev->mtu = new_mtu;
 
 	if (netif_running(dev)) {
@@ -2710,10 +2724,8 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 		return 0;
 	}
 
-	if (priv->num_frags > 1) {
-		en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
+	if (!mlx4_en_check_xdp_mtu(dev, dev->mtu))
 		return -EOPNOTSUPP;
-	}
 
 	tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
 	if (!tmp)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 6562f78b07f4..23e9d04d1ef4 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1164,37 +1164,39 @@ static const int frag_sizes[] = {
 
 void mlx4_en_calc_rx_buf(struct net_device *dev)
 {
-	enum dma_data_direction dma_dir = PCI_DMA_FROMDEVICE;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int eff_mtu = MLX4_EN_EFF_MTU(dev->mtu);
-	int order = MLX4_EN_ALLOC_PREFER_ORDER;
-	u32 align = SMP_CACHE_BYTES;
-	int buf_size = 0;
 	int i = 0;
 
 	/* bpf requires buffers to be set up as 1 packet per page.
 	 * This only works when num_frags == 1.
 	 */
 	if (priv->tx_ring_num[TX_XDP]) {
-		dma_dir = PCI_DMA_BIDIRECTIONAL;
-		/* This will gain efficient xdp frame recycling at the expense
-		 * of more costly truesize accounting
+		priv->frag_info[0].order = 0;
+		priv->frag_info[0].frag_size = eff_mtu;
+		priv->frag_info[0].frag_prefix_size = 0;
+		/* This will gain efficient xdp frame recycling at the
+		 * expense of more costly truesize accounting
 		 */
-		align = PAGE_SIZE;
-		order = 0;
-	}
-
-	while (buf_size < eff_mtu) {
-		priv->frag_info[i].order = order;
-		priv->frag_info[i].frag_size =
-			(eff_mtu > buf_size + frag_sizes[i]) ?
-				frag_sizes[i] : eff_mtu - buf_size;
-		priv->frag_info[i].frag_prefix_size = buf_size;
-		priv->frag_info[i].frag_stride =
-				ALIGN(priv->frag_info[i].frag_size, align);
-		priv->frag_info[i].dma_dir = dma_dir;
-		buf_size += priv->frag_info[i].frag_size;
-		i++;
+		priv->frag_info[0].frag_stride = PAGE_SIZE;
+		priv->frag_info[0].dma_dir = PCI_DMA_BIDIRECTIONAL;
+		i = 1;
+	} else {
+		int buf_size = 0;
+
+		while (buf_size < eff_mtu) {
+			priv->frag_info[i].order = MLX4_EN_ALLOC_PREFER_ORDER;
+			priv->frag_info[i].frag_size =
+				(eff_mtu > buf_size + frag_sizes[i]) ?
+					frag_sizes[i] : eff_mtu - buf_size;
+			priv->frag_info[i].frag_prefix_size = buf_size;
+			priv->frag_info[i].frag_stride =
+				ALIGN(priv->frag_info[i].frag_size,
+				      SMP_CACHE_BYTES);
+			priv->frag_info[i].dma_dir = PCI_DMA_FROMDEVICE;
+			buf_size += priv->frag_info[i].frag_size;
+			i++;
+		}
 	}
 
 	priv->num_frags = i;
-- 
2.5.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 net-next 3/4] mlx4: xdp: Reserve headroom for receiving packet when XDP prog is active
  2016-12-07  5:31 [PATCH v3 net-next 0/4]: Allow head adjustment in XDP prog Martin KaFai Lau
  2016-12-07  5:31 ` [PATCH v3 net-next 1/4] bpf: xdp: " Martin KaFai Lau
  2016-12-07  5:31 ` [PATCH v3 net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs Martin KaFai Lau
@ 2016-12-07  5:31 ` Martin KaFai Lau
  2016-12-07  5:31 ` [PATCH v3 net-next 4/4] bpf: xdp: Add XDP example for head adjustment Martin KaFai Lau
  3 siblings, 0 replies; 13+ messages in thread
From: Martin KaFai Lau @ 2016-12-07  5:31 UTC (permalink / raw)
  To: netdev
  Cc: Alexei Starovoitov, Brenden Blanco, Daniel Borkmann, David Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Saeed Mahameed, Tariq Toukan, Kernel Team

Reserve XDP_PACKET_HEADROOM for packet and enable bpf_xdp_adjust_head()
support.  This patch only affects the code path when XDP is active.

After testing, the tx_dropped counter is incremented if the xdp_prog sends
more than wire MTU.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  5 +++--
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 24 ++++++++++++++++++------
 drivers/net/ethernet/mellanox/mlx4/en_tx.c     |  9 +++++----
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  3 ++-
 4 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 5482591688f8..36b9bb042778 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -51,7 +51,8 @@
 #include "mlx4_en.h"
 #include "en_port.h"
 
-#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN)))
+#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) - \
+				   XDP_PACKET_HEADROOM))
 
 int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 {
@@ -2807,7 +2808,7 @@ static int mlx4_xdp(struct net_device *dev, struct netdev_xdp *xdp)
 		xdp->prog_attached = mlx4_xdp_attached(dev);
 		return 0;
 	case XDP_QUERY_FEATURES:
-		xdp->features = 0;
+		xdp->features = XDP_F_ADJUST_HEAD;
 		return 0;
 	default:
 		return -EINVAL;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 23e9d04d1ef4..3c37e216bbf3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -96,7 +96,6 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
 	struct mlx4_en_rx_alloc page_alloc[MLX4_EN_MAX_RX_FRAGS];
 	const struct mlx4_en_frag_info *frag_info;
 	struct page *page;
-	dma_addr_t dma;
 	int i;
 
 	for (i = 0; i < priv->num_frags; i++) {
@@ -115,9 +114,10 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
 
 	for (i = 0; i < priv->num_frags; i++) {
 		frags[i] = ring_alloc[i];
-		dma = ring_alloc[i].dma + ring_alloc[i].page_offset;
+		frags[i].page_offset += priv->frag_info[i].rx_headroom;
+		rx_desc->data[i].addr = cpu_to_be64(frags[i].dma +
+						    frags[i].page_offset);
 		ring_alloc[i] = page_alloc[i];
-		rx_desc->data[i].addr = cpu_to_be64(dma);
 	}
 
 	return 0;
@@ -250,7 +250,8 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 
 	if (ring->page_cache.index > 0) {
 		frags[0] = ring->page_cache.buf[--ring->page_cache.index];
-		rx_desc->data[0].addr = cpu_to_be64(frags[0].dma);
+		rx_desc->data[0].addr = cpu_to_be64(frags[0].dma +
+						    frags[0].page_offset);
 		return 0;
 	}
 
@@ -889,6 +890,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		if (xdp_prog) {
 			struct xdp_buff xdp;
 			dma_addr_t dma;
+			void *orig_data;
 			u32 act;
 
 			dma = be64_to_cpu(rx_desc->data[0].addr);
@@ -896,11 +898,19 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						priv->frag_info[0].frag_size,
 						DMA_FROM_DEVICE);
 
-			xdp.data = page_address(frags[0].page) +
-							frags[0].page_offset;
+			xdp.data_hard_start = page_address(frags[0].page);
+			xdp.data = xdp.data_hard_start + frags[0].page_offset;
 			xdp.data_end = xdp.data + length;
+			orig_data = xdp.data;
 
 			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+			if (xdp.data != orig_data) {
+				length = xdp.data_end - xdp.data;
+				frags[0].page_offset = xdp.data -
+					xdp.data_hard_start;
+			}
+
 			switch (act) {
 			case XDP_PASS:
 				break;
@@ -1180,6 +1190,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 		 */
 		priv->frag_info[0].frag_stride = PAGE_SIZE;
 		priv->frag_info[0].dma_dir = PCI_DMA_BIDIRECTIONAL;
+		priv->frag_info[0].rx_headroom = XDP_PACKET_HEADROOM;
 		i = 1;
 	} else {
 		int buf_size = 0;
@@ -1194,6 +1205,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 				ALIGN(priv->frag_info[i].frag_size,
 				      SMP_CACHE_BYTES);
 			priv->frag_info[i].dma_dir = PCI_DMA_FROMDEVICE;
+			priv->frag_info[i].rx_headroom = 0;
 			buf_size += priv->frag_info[i].frag_size;
 			i++;
 		}
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 4b597dca5c52..5886ad78058f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -354,7 +354,7 @@ u32 mlx4_en_recycle_tx_desc(struct mlx4_en_priv *priv,
 	struct mlx4_en_rx_alloc frame = {
 		.page = tx_info->page,
 		.dma = tx_info->map0_dma,
-		.page_offset = 0,
+		.page_offset = XDP_PACKET_HEADROOM,
 		.page_size = PAGE_SIZE,
 	};
 
@@ -1132,7 +1132,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 	tx_info->page = frame->page;
 	frame->page = NULL;
 	tx_info->map0_dma = dma;
-	tx_info->map0_byte_count = length;
+	tx_info->map0_byte_count = PAGE_SIZE;
 	tx_info->nr_txbb = nr_txbb;
 	tx_info->nr_bytes = max_t(unsigned int, length, ETH_ZLEN);
 	tx_info->data_offset = (void *)data - (void *)tx_desc;
@@ -1141,9 +1141,10 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 	tx_info->linear = 1;
 	tx_info->inl = 0;
 
-	dma_sync_single_for_device(priv->ddev, dma, length, PCI_DMA_TODEVICE);
+	dma_sync_single_range_for_device(priv->ddev, dma, frame->page_offset,
+					 length, PCI_DMA_TODEVICE);
 
-	data->addr = cpu_to_be64(dma);
+	data->addr = cpu_to_be64(dma + frame->page_offset);
 	data->lkey = ring->mr_key;
 	dma_wmb();
 	data->byte_count = cpu_to_be32(length);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 20a936428f4a..ba1c6cd0cc79 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -475,7 +475,8 @@ struct mlx4_en_frag_info {
 	u16 frag_prefix_size;
 	u32 frag_stride;
 	enum dma_data_direction dma_dir;
-	int order;
+	u16 order;
+	u16 rx_headroom;
 };
 
 #ifdef CONFIG_MLX4_EN_DCB
-- 
2.5.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 net-next 4/4] bpf: xdp: Add XDP example for head adjustment
  2016-12-07  5:31 [PATCH v3 net-next 0/4]: Allow head adjustment in XDP prog Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2016-12-07  5:31 ` [PATCH v3 net-next 3/4] mlx4: xdp: Reserve headroom for receiving packet when XDP prog is active Martin KaFai Lau
@ 2016-12-07  5:31 ` Martin KaFai Lau
  2016-12-07 10:34   ` Jesper Dangaard Brouer
  3 siblings, 1 reply; 13+ messages in thread
From: Martin KaFai Lau @ 2016-12-07  5:31 UTC (permalink / raw)
  To: netdev
  Cc: Alexei Starovoitov, Brenden Blanco, Daniel Borkmann, David Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Saeed Mahameed, Tariq Toukan, Kernel Team

The XDP prog checks if the incoming packet matches any VIP:PORT
combination in the BPF hashmap.  If it is, it will encapsulate
the packet with a IPv4/v6 header as instructed by the value of
the BPF hashmap and then XDP_TX it out.

The VIP:PORT -> IP-Encap-Info can be specified by the cmd args
of the user prog.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 samples/bpf/Makefile              |   4 +
 samples/bpf/bpf_helpers.h         |   2 +
 samples/bpf/bpf_load.c            |  94 ++++++++++++++
 samples/bpf/bpf_load.h            |   1 +
 samples/bpf/xdp1_user.c           |  93 --------------
 samples/bpf/xdp_tx_iptnl_common.h |  37 ++++++
 samples/bpf/xdp_tx_iptnl_kern.c   | 232 ++++++++++++++++++++++++++++++++++
 samples/bpf/xdp_tx_iptnl_user.c   | 253 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 623 insertions(+), 93 deletions(-)
 create mode 100644 samples/bpf/xdp_tx_iptnl_common.h
 create mode 100644 samples/bpf/xdp_tx_iptnl_kern.c
 create mode 100644 samples/bpf/xdp_tx_iptnl_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 00cd3081c038..f78e0ef6ff10 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -33,6 +33,7 @@ hostprogs-y += trace_event
 hostprogs-y += sampleip
 hostprogs-y += tc_l2_redirect
 hostprogs-y += lwt_len_hist
+hostprogs-y += xdp_tx_iptnl
 
 test_lru_dist-objs := test_lru_dist.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
@@ -67,6 +68,7 @@ trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
 sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
 tc_l2_redirect-objs := bpf_load.o libbpf.o tc_l2_redirect_user.o
 lwt_len_hist-objs := bpf_load.o libbpf.o lwt_len_hist_user.o
+xdp_tx_iptnl-objs := bpf_load.o libbpf.o xdp_tx_iptnl_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -99,6 +101,7 @@ always += test_current_task_under_cgroup_kern.o
 always += trace_event_kern.o
 always += sampleip_kern.o
 always += lwt_len_hist_kern.o
+always += xdp_tx_iptnl_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/testing/selftests/bpf/
@@ -129,6 +132,7 @@ HOSTLOADLIBES_trace_event += -lelf
 HOSTLOADLIBES_sampleip += -lelf
 HOSTLOADLIBES_tc_l2_redirect += -l elf
 HOSTLOADLIBES_lwt_len_hist += -l elf
+HOSTLOADLIBES_xdp_tx_iptnl += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 8370a6e3839d..faaffe2e139a 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -57,6 +57,8 @@ static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, int size) =
 	(void *) BPF_FUNC_skb_set_tunnel_opt;
 static unsigned long long (*bpf_get_prandom_u32)(void) =
 	(void *) BPF_FUNC_get_prandom_u32;
+static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
+	(void *) BPF_FUNC_xdp_adjust_head;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 49b45ccbe153..e30b6de94f2e 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -12,6 +12,10 @@
 #include <linux/bpf.h>
 #include <linux/filter.h>
 #include <linux/perf_event.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <sys/types.h>
+#include <sys/socket.h>
 #include <sys/syscall.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>
@@ -450,3 +454,93 @@ struct ksym *ksym_search(long key)
 	/* out of range. return _stext */
 	return &syms[0];
 }
+
+int set_link_xdp_fd(int ifindex, int fd)
+{
+	struct sockaddr_nl sa;
+	int sock, seq = 0, len, ret = -1;
+	char buf[4096];
+	struct nlattr *nla, *nla_xdp;
+	struct {
+		struct nlmsghdr  nh;
+		struct ifinfomsg ifinfo;
+		char             attrbuf[64];
+	} req;
+	struct nlmsghdr *nh;
+	struct nlmsgerr *err;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	memset(&req, 0, sizeof(req));
+	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+	req.nh.nlmsg_type = RTM_SETLINK;
+	req.nh.nlmsg_pid = 0;
+	req.nh.nlmsg_seq = ++seq;
+	req.ifinfo.ifi_family = AF_UNSPEC;
+	req.ifinfo.ifi_index = ifindex;
+	nla = (struct nlattr *)(((char *)&req)
+				+ NLMSG_ALIGN(req.nh.nlmsg_len));
+	nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
+
+	nla_xdp = (struct nlattr *)((char *)nla + NLA_HDRLEN);
+	nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
+	nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+	memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+	nla->nla_len = NLA_HDRLEN + nla_xdp->nla_len;
+
+	req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+		printf("send to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	len = recv(sock, buf, sizeof(buf), 0);
+	if (len < 0) {
+		printf("recv from netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+	     nh = NLMSG_NEXT(nh, len)) {
+		if (nh->nlmsg_pid != getpid()) {
+			printf("Wrong pid %d, expected %d\n",
+			       nh->nlmsg_pid, getpid());
+			goto cleanup;
+		}
+		if (nh->nlmsg_seq != seq) {
+			printf("Wrong seq %d, expected %d\n",
+			       nh->nlmsg_seq, seq);
+			goto cleanup;
+		}
+		switch (nh->nlmsg_type) {
+		case NLMSG_ERROR:
+			err = (struct nlmsgerr *)NLMSG_DATA(nh);
+			if (!err->error)
+				continue;
+			printf("nlmsg error %s\n", strerror(-err->error));
+			goto cleanup;
+		case NLMSG_DONE:
+			break;
+		}
+	}
+
+	ret = 0;
+
+cleanup:
+	close(sock);
+	return ret;
+}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 4adeeef53ad6..fb46a421ab41 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -31,4 +31,5 @@ struct ksym {
 
 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
+int set_link_xdp_fd(int ifindex, int fd);
 #endif
diff --git a/samples/bpf/xdp1_user.c b/samples/bpf/xdp1_user.c
index 2b2150d6d6f7..5f040a0d7712 100644
--- a/samples/bpf/xdp1_user.c
+++ b/samples/bpf/xdp1_user.c
@@ -5,111 +5,18 @@
  * License as published by the Free Software Foundation.
  */
 #include <linux/bpf.h>
-#include <linux/netlink.h>
-#include <linux/rtnetlink.h>
 #include <assert.h>
 #include <errno.h>
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
-#include <sys/socket.h>
 #include <unistd.h>
 
 #include "bpf_load.h"
 #include "bpf_util.h"
 #include "libbpf.h"
 
-static int set_link_xdp_fd(int ifindex, int fd)
-{
-	struct sockaddr_nl sa;
-	int sock, seq = 0, len, ret = -1;
-	char buf[4096];
-	struct nlattr *nla, *nla_xdp;
-	struct {
-		struct nlmsghdr  nh;
-		struct ifinfomsg ifinfo;
-		char             attrbuf[64];
-	} req;
-	struct nlmsghdr *nh;
-	struct nlmsgerr *err;
-
-	memset(&sa, 0, sizeof(sa));
-	sa.nl_family = AF_NETLINK;
-
-	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
-	if (sock < 0) {
-		printf("open netlink socket: %s\n", strerror(errno));
-		return -1;
-	}
-
-	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
-		printf("bind to netlink: %s\n", strerror(errno));
-		goto cleanup;
-	}
-
-	memset(&req, 0, sizeof(req));
-	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
-	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
-	req.nh.nlmsg_type = RTM_SETLINK;
-	req.nh.nlmsg_pid = 0;
-	req.nh.nlmsg_seq = ++seq;
-	req.ifinfo.ifi_family = AF_UNSPEC;
-	req.ifinfo.ifi_index = ifindex;
-	nla = (struct nlattr *)(((char *)&req)
-				+ NLMSG_ALIGN(req.nh.nlmsg_len));
-	nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
-
-	nla_xdp = (struct nlattr *)((char *)nla + NLA_HDRLEN);
-	nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
-	nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
-	memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
-	nla->nla_len = NLA_HDRLEN + nla_xdp->nla_len;
-
-	req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
-
-	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
-		printf("send to netlink: %s\n", strerror(errno));
-		goto cleanup;
-	}
-
-	len = recv(sock, buf, sizeof(buf), 0);
-	if (len < 0) {
-		printf("recv from netlink: %s\n", strerror(errno));
-		goto cleanup;
-	}
-
-	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
-	     nh = NLMSG_NEXT(nh, len)) {
-		if (nh->nlmsg_pid != getpid()) {
-			printf("Wrong pid %d, expected %d\n",
-			       nh->nlmsg_pid, getpid());
-			goto cleanup;
-		}
-		if (nh->nlmsg_seq != seq) {
-			printf("Wrong seq %d, expected %d\n",
-			       nh->nlmsg_seq, seq);
-			goto cleanup;
-		}
-		switch (nh->nlmsg_type) {
-		case NLMSG_ERROR:
-			err = (struct nlmsgerr *)NLMSG_DATA(nh);
-			if (!err->error)
-				continue;
-			printf("nlmsg error %s\n", strerror(-err->error));
-			goto cleanup;
-		case NLMSG_DONE:
-			break;
-		}
-	}
-
-	ret = 0;
-
-cleanup:
-	close(sock);
-	return ret;
-}
-
 static int ifindex;
 
 static void int_exit(int sig)
diff --git a/samples/bpf/xdp_tx_iptnl_common.h b/samples/bpf/xdp_tx_iptnl_common.h
new file mode 100644
index 000000000000..dd12cc35110f
--- /dev/null
+++ b/samples/bpf/xdp_tx_iptnl_common.h
@@ -0,0 +1,37 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _SAMPLES_BPF_XDP_TX_IPTNL_COMMON_H
+#define _SAMPLES_BPF_XDP_TX_IPTNL_COMMON_H
+
+#include <linux/types.h>
+
+#define MAX_IPTNL_ENTRIES 256U
+
+struct vip {
+	union {
+		__u32 v6[4];
+		__u32 v4;
+	} daddr;
+	__u16 dport;
+	__u16 family;
+	__u8 protocol;
+};
+
+struct iptnl_info {
+	union {
+		__u32 v6[4];
+		__u32 v4;
+	} saddr;
+	union {
+		__u32 v6[4];
+		__u32 v4;
+	} daddr;
+	__u16 family;
+	__u8 dmac[6];
+};
+
+#endif
diff --git a/samples/bpf/xdp_tx_iptnl_kern.c b/samples/bpf/xdp_tx_iptnl_kern.c
new file mode 100644
index 000000000000..d88c064175aa
--- /dev/null
+++ b/samples/bpf/xdp_tx_iptnl_kern.c
@@ -0,0 +1,232 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+#include "xdp_tx_iptnl_common.h"
+
+struct bpf_map_def SEC("maps") rxcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u64),
+	.max_entries = 256,
+};
+
+struct bpf_map_def SEC("maps") vip2tnl = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(struct vip),
+	.value_size = sizeof(struct iptnl_info),
+	.max_entries = MAX_IPTNL_ENTRIES,
+};
+
+static __always_inline void count_tx(u32 protocol)
+{
+	u64 *rxcnt_count;
+
+	rxcnt_count = bpf_map_lookup_elem(&rxcnt, &protocol);
+	if (rxcnt_count)
+		*rxcnt_count += 1;
+}
+
+static __always_inline int get_dport(void *trans_data, void *data_end,
+				     u8 protocol)
+{
+	struct tcphdr *th;
+	struct udphdr *uh;
+
+	switch (protocol) {
+	case IPPROTO_TCP:
+		th = (struct tcphdr *)trans_data;
+		if (th + 1 > data_end)
+			return -1;
+		return th->dest;
+	case IPPROTO_UDP:
+		uh = (struct udphdr *)trans_data;
+		if (uh + 1 > data_end)
+			return -1;
+		return uh->dest;
+	default:
+		return 0;
+	}
+}
+
+static __always_inline void set_ethhdr(struct ethhdr *new_eth,
+				       const struct ethhdr *old_eth,
+				       const struct iptnl_info *tnl,
+				       __be16 h_proto)
+{
+	memcpy(new_eth->h_source, old_eth->h_dest, sizeof(new_eth->h_source));
+	memcpy(new_eth->h_dest, tnl->dmac, sizeof(new_eth->h_dest));
+	new_eth->h_proto = h_proto;
+}
+
+static __always_inline int handle_ipv4(struct xdp_md *xdp)
+{
+	void *data_end = (void *)(long)xdp->data_end;
+	void *data = (void *)(long)xdp->data;
+	struct iptnl_info *tnl;
+	struct ethhdr *new_eth;
+	struct ethhdr *old_eth;
+	struct iphdr *iph = data + sizeof(struct ethhdr);
+	u16 *next_iph_u16;
+	u16 payload_len;
+	struct vip vip = {};
+	int dport;
+	u32 csum = 0;
+	int i;
+
+	if (iph + 1 > data_end)
+		return XDP_DROP;
+
+	dport = get_dport(iph + 1, data_end, iph->protocol);
+	if (dport == -1)
+		return XDP_DROP;
+
+	vip.protocol = iph->protocol;
+	vip.family = AF_INET;
+	vip.daddr.v4 = iph->daddr;
+	vip.dport = dport;
+	payload_len = ntohs(iph->tot_len);
+
+	tnl = bpf_map_lookup_elem(&vip2tnl, &vip);
+	/* It only does v4-in-v4 */
+	if (!tnl || tnl->family != AF_INET)
+		return XDP_PASS;
+
+	/* The vip key is found.  Add an IP header and send it out */
+
+	if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(struct iphdr)))
+		return XDP_DROP;
+
+	data = (void *)(long)xdp->data;
+	data_end = (void *)(long)xdp->data_end;
+
+	new_eth = data;
+	iph = data + sizeof(*new_eth);
+	old_eth = data + sizeof(*iph);
+
+	if (new_eth + 1 > data_end ||
+	    old_eth + 1 > data_end ||
+	    iph + 1 > data_end)
+		return XDP_DROP;
+
+	set_ethhdr(new_eth, old_eth, tnl, htons(ETH_P_IP));
+
+	iph->version = 4;
+	iph->ihl = sizeof(*iph) >> 2;
+	iph->frag_off =	0;
+	iph->protocol = IPPROTO_IPIP;
+	iph->check = 0;
+	iph->tos = 0;
+	iph->tot_len = htons(payload_len + sizeof(*iph));
+	iph->daddr = tnl->daddr.v4;
+	iph->saddr = tnl->saddr.v4;
+	iph->ttl = 8;
+
+	next_iph_u16 = (u16 *)iph;
+#pragma clang loop unroll(full)
+	for (i = 0; i < sizeof(*iph) >> 1; i++)
+		csum += *next_iph_u16++;
+
+	iph->check = ~((csum & 0xffff) + (csum >> 16));
+
+	count_tx(vip.protocol);
+
+	return XDP_TX;
+}
+
+static __always_inline int handle_ipv6(struct xdp_md *xdp)
+{
+	void *data_end = (void *)(long)xdp->data_end;
+	void *data = (void *)(long)xdp->data;
+	struct iptnl_info *tnl;
+	struct ethhdr *new_eth;
+	struct ethhdr *old_eth;
+	struct ipv6hdr *ip6h = data + sizeof(struct ethhdr);
+	__u16 payload_len;
+	struct vip vip = {};
+	int dport;
+
+	if (ip6h + 1 > data_end)
+		return XDP_DROP;
+
+	dport = get_dport(ip6h + 1, data_end, ip6h->nexthdr);
+	if (dport == -1)
+		return XDP_DROP;
+
+	vip.protocol = ip6h->nexthdr;
+	vip.family = AF_INET6;
+	memcpy(vip.daddr.v6, ip6h->daddr.s6_addr32, sizeof(vip.daddr));
+	vip.dport = dport;
+	payload_len = ip6h->payload_len;
+
+	tnl = bpf_map_lookup_elem(&vip2tnl, &vip);
+	/* It only does v6-in-v6 */
+	if (!tnl || tnl->family != AF_INET6)
+		return XDP_PASS;
+
+	/* The vip key is found.  Add an IP header and send it out */
+
+	if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(struct ipv6hdr)))
+		return XDP_DROP;
+
+	data = (void *)(long)xdp->data;
+	data_end = (void *)(long)xdp->data_end;
+
+	new_eth = data;
+	ip6h = data + sizeof(*new_eth);
+	old_eth = data + sizeof(*ip6h);
+
+	if (new_eth + 1 > data_end ||
+	    old_eth + 1 > data_end ||
+	    ip6h + 1 > data_end)
+		return XDP_DROP;
+
+	set_ethhdr(new_eth, old_eth, tnl, htons(ETH_P_IPV6));
+
+	ip6h->version = 6;
+	ip6h->priority = 0;
+	memset(ip6h->flow_lbl, 0, sizeof(ip6h->flow_lbl));
+	ip6h->payload_len = htons(ntohs(payload_len) + sizeof(*ip6h));
+	ip6h->nexthdr = IPPROTO_IPV6;
+	ip6h->hop_limit = 8;
+	memcpy(ip6h->saddr.s6_addr32, tnl->saddr.v6, sizeof(tnl->saddr.v6));
+	memcpy(ip6h->daddr.s6_addr32, tnl->daddr.v6, sizeof(tnl->daddr.v6));
+
+	count_tx(vip.protocol);
+
+	return XDP_TX;
+}
+
+SEC("xdp_tx_iptnl")
+int _xdp_tx_iptnl(struct xdp_md *xdp)
+{
+	void *data_end = (void *)(long)xdp->data_end;
+	void *data = (void *)(long)xdp->data;
+	struct ethhdr *eth = data;
+	__u16 h_proto;
+
+	if (eth + 1 > data_end)
+		return XDP_DROP;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == htons(ETH_P_IP))
+		return handle_ipv4(xdp);
+	else if (h_proto == htons(ETH_P_IPV6))
+
+		return handle_ipv6(xdp);
+	else
+		return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_tx_iptnl_user.c b/samples/bpf/xdp_tx_iptnl_user.c
new file mode 100644
index 000000000000..9aeef7579af4
--- /dev/null
+++ b/samples/bpf/xdp_tx_iptnl_user.c
@@ -0,0 +1,253 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/bpf.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/resource.h>
+#include <arpa/inet.h>
+#include <netinet/ether.h>
+#include <unistd.h>
+#include <time.h>
+#include "bpf_load.h"
+#include "libbpf.h"
+#include "bpf_util.h"
+#include "xdp_tx_iptnl_common.h"
+
+#define STATS_INTERVAL_S 2U
+
+static int ifindex = -1;
+
+static void int_exit(int sig)
+{
+	if (ifindex > -1)
+		set_link_xdp_fd(ifindex, -1);
+	exit(0);
+}
+
+/* simple per-protocol drop counter
+ */
+static void poll_stats(unsigned int kill_after_s)
+{
+	const unsigned int nr_protos = 256;
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	time_t started_at = time(NULL);
+	__u64 values[nr_cpus], prev[nr_protos][nr_cpus];
+	__u32 proto;
+	int i;
+
+	memset(prev, 0, sizeof(prev));
+
+	while (!kill_after_s || time(NULL) - started_at <= kill_after_s) {
+		sleep(STATS_INTERVAL_S);
+
+		for (proto = 0; proto < nr_protos; proto++) {
+			__u64 sum = 0;
+
+			assert(bpf_lookup_elem(map_fd[0], &proto, values) == 0);
+			for (i = 0; i < nr_cpus; i++)
+				sum += (values[i] - prev[proto][i]);
+
+			if (sum)
+				printf("proto %u: sum:%10llu pkts, rate:%10llu pkts/s\n",
+				       proto, sum, sum / STATS_INTERVAL_S);
+			memcpy(prev[proto], values, sizeof(values));
+		}
+	}
+}
+
+static void usage(const char *cmd)
+{
+	printf("Usage: %s [...]\n", cmd);
+	printf("    -i <ifindex> Interface Index\n");
+	printf("    -a <vip-service-address> IPv4 or IPv6\n");
+	printf("    -p <vip-service-port> A port range (e.g. 433-444) is also allowed\n");
+	printf("    -s <source-ip> Used in the IPTunnel Header\n");
+	printf("    -d <dest-ip> Used in the IPTunnel header>\n");
+	printf("    -m <dest-MAC> Used in sending the IP Tunneled pkt>\n");
+	printf("    -T <stop-after-X-seconds> Default: 0 (forever)\n");
+	printf("    -P <IP-Protocol> Default is TCP\n");
+	printf("    -h Display this help\n");
+}
+
+static int parse_ipstr(const char *ipstr, unsigned int *addr)
+{
+	if (inet_pton(AF_INET6, ipstr, addr) == 1) {
+		return AF_INET6;
+	} else if (inet_pton(AF_INET, ipstr, addr) == 1) {
+		addr[1] = addr[2] = addr[3] = 0;
+		return AF_INET;
+	}
+
+	fprintf(stderr, "%s is an invalid IP\n", ipstr);
+	return AF_UNSPEC;
+}
+
+static int parse_ports(const char *port_str, int *min_port, int *max_port)
+{
+	char *end;
+	long tmp_min_port;
+	long tmp_max_port;
+
+	tmp_min_port = strtol(optarg, &end, 10);
+	if (tmp_min_port < 1 || tmp_min_port > 65535) {
+		fprintf(stderr, "Invalid port(s):%s\n", optarg);
+		return 1;
+	}
+
+	if (*end == '-') {
+		end++;
+		tmp_max_port = strtol(end, NULL, 10);
+		if (tmp_max_port < 1 || tmp_max_port > 65535) {
+			fprintf(stderr, "Invalid port(s):%s\n", optarg);
+			return 1;
+		}
+	} else {
+		tmp_max_port = tmp_min_port;
+	}
+
+	if (tmp_min_port > tmp_max_port) {
+		fprintf(stderr, "Invalid port(s):%s\n", optarg);
+		return 1;
+	}
+
+	if (tmp_max_port - tmp_min_port + 1 > MAX_IPTNL_ENTRIES) {
+		fprintf(stderr, "Port range (%s) is larger than %u\n",
+			port_str, MAX_IPTNL_ENTRIES);
+		return 1;
+	}
+	*min_port = tmp_min_port;
+	*max_port = tmp_max_port;
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	unsigned char opt_flags[256] = {};
+	unsigned int kill_after_s = 0;
+	const char *optstr = "i:a:p:s:d:m:T:P:";
+	int min_port = 0, max_port = 0;
+	struct iptnl_info tnl = {};
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	struct vip vip = {};
+	char filename[256];
+	int opt;
+	int i;
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");
+		return 1;
+	}
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	if (!prog_fd[0]) {
+		printf("load_bpf_file: %s\n", strerror(errno));
+		return 1;
+	}
+
+	tnl.family = AF_UNSPEC;
+	vip.protocol = IPPROTO_TCP;
+
+	for (i = 0; i < strlen(optstr); i++)
+		if ('a' <= optstr[i] && optstr[i] <= 'z')
+			opt_flags[(unsigned char)optstr[i]] = 1;
+
+	while ((opt = getopt(argc, argv, optstr)) != -1) {
+		unsigned short family;
+		unsigned int *v6;
+
+		switch (opt) {
+		case 'i':
+			ifindex = atoi(optarg);
+			break;
+		case 'a':
+			vip.family = parse_ipstr(optarg, vip.daddr.v6);
+			if (vip.family == AF_UNSPEC)
+				return 1;
+			break;
+		case 'p':
+			if (parse_ports(optarg, &min_port, &max_port))
+				return 1;
+			break;
+		case 'P':
+			vip.protocol = atoi(optarg);
+			break;
+		case 's':
+		case 'd':
+			if (opt == 's')
+				v6 = tnl.saddr.v6;
+			else
+				v6 = tnl.daddr.v6;
+
+			family = parse_ipstr(optarg, v6);
+			if (family == AF_UNSPEC)
+				return 1;
+			if (tnl.family == AF_UNSPEC) {
+				tnl.family = family;
+			} else if (tnl.family != family) {
+				fprintf(stderr,
+					"The IP version of the src and dst addresses used in the IP encapsulation does not match\n");
+				return 1;
+			}
+			break;
+		case 'm':
+			if (!ether_aton_r(optarg,
+					  (struct ether_addr *)tnl.dmac)) {
+				fprintf(stderr, "Invalid mac address:%s\n",
+					optarg);
+				return 1;
+			}
+			break;
+		case 'T':
+			kill_after_s = atoi(optarg);
+			break;
+		default:
+			usage(argv[0]);
+			return 1;
+		}
+		opt_flags[opt] = 0;
+	}
+
+	for (i = 0; i < strlen(optstr); i++) {
+		if (opt_flags[(unsigned int)optstr[i]]) {
+			fprintf(stderr, "Missing argument -%c\n", optstr[i]);
+			usage(argv[0]);
+			return 1;
+		}
+	}
+
+	signal(SIGINT, int_exit);
+
+	while (min_port <= max_port) {
+		vip.dport = htons(min_port++);
+		if (bpf_update_elem(map_fd[1], &vip, &tnl, BPF_NOEXIST)) {
+			perror("bpf_update_elem(&vip2tnl)");
+			return 1;
+		}
+	}
+
+	if (set_link_xdp_fd(ifindex, prog_fd[0]) < 0) {
+		printf("link set xdp fd failed\n");
+		return 1;
+	}
+
+	poll_stats(kill_after_s);
+
+	set_link_xdp_fd(ifindex, -1);
+
+	return 0;
+}
-- 
2.5.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog
  2016-12-07  5:31 ` [PATCH v3 net-next 1/4] bpf: xdp: " Martin KaFai Lau
@ 2016-12-07  9:32   ` Daniel Borkmann
  2016-12-07 11:41     ` Jakub Kicinski
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Borkmann @ 2016-12-07  9:32 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, Brenden Blanco, David Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Saeed Mahameed,
	Tariq Toukan, Kernel Team

Hi Martin,

On 12/07/2016 06:31 AM, Martin KaFai Lau wrote:
[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 49a81f1fc1d6..6261157f444e 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2794,6 +2794,9 @@ static int mlx4_xdp(struct net_device *dev, struct netdev_xdp *xdp)
>   	case XDP_QUERY_PROG:
>   		xdp->prog_attached = mlx4_xdp_attached(dev);
>   		return 0;
> +	case XDP_QUERY_FEATURES:
> +		xdp->features = 0;
> +		return 0;
>   	default:
>   		return -EINVAL;
>   	}
[...]
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 1ff5ea6e1221..786ad7c67215 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -30,6 +30,7 @@
>   #include <linux/delay.h>
>   #include <linux/atomic.h>
>   #include <linux/prefetch.h>
> +#include <linux/bitops.h>
>   #include <asm/cache.h>
>   #include <asm/byteorder.h>
>
> @@ -805,6 +806,13 @@ struct tc_to_netdev {
>   	bool egress_dev;
>   };
>
> +/* Driver must allow a XDP prog to extend header by
> + * up to XDP_PACKET_HEADROOM.  It must also fill out
> + * the data_hard_start value in struct xdp_buff
> + * before calling out the xdp_prog.
> + */
> +#define XDP_F_ADJUST_HEAD	BIT(0)
> +
>   /* These structures hold the attributes of xdp state that are being passed
>    * to the netdevice through the xdp op.
>    */
> @@ -821,6 +829,8 @@ enum xdp_netdev_command {
>   	 * return true if a program is currently attached and running.
>   	 */
>   	XDP_QUERY_PROG,
> +	/* Check what XDP features are supported by a device */
> +	XDP_QUERY_FEATURES,
>   };
>
>   struct netdev_xdp {
> @@ -830,6 +840,8 @@ struct netdev_xdp {
>   		struct bpf_prog *prog;
>   		/* XDP_QUERY_PROG */
>   		bool prog_attached;
> +		/* XDP_QUERY_FEATURES */
> +		u32 features;
>   	};
>   };
>
[...]
> diff --git a/net/core/dev.c b/net/core/dev.c
> index bffb5253e778..90696f7e6b59 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6722,6 +6722,15 @@ int dev_change_xdp_fd(struct net_device *dev, int fd, u32 flags)
>   		prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
>   		if (IS_ERR(prog))
>   			return PTR_ERR(prog);
> +
> +		xdp.command = XDP_QUERY_FEATURES;
> +		err = ops->ndo_xdp(dev, &xdp);
> +		if (err)
> +			return err;
> +
> +		if (prog->xdp_adjust_head &&
> +		    !(xdp.features & XDP_F_ADJUST_HEAD))
> +			return -ENOTSUPP;
>   	}
>
>   	memset(&xdp, 0, sizeof(xdp));

I think this interface wrt feature flags is rather odd. Why can't this be
done the usual/expected way we already have today for drivers with NETIF_F_*
flags?

We have include/linux/netdev_features.h, there, we add all NETIF_F_XDP_*
feature flags that the device would then select during init, perhaps some of
them in future might depend on a certain setups, etc, calculating them in a
separate ndo_xdp() seems odd also in the sense that in-kernel users always
need to call ops->ndo_xdp() with XDP_QUERY_FEATURES instead of just simply
doing the test on dev->features & NETIF_F_XDP_* directly. This is global to
the device anyway and doesn't need to be stored somewhere in private data
area.

I see nothing wrong if this is exposed/made visible in the usual way through
ethtool -k as well. I guess at least that would be the expected way to query
for such driver capabilities.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 4/4] bpf: xdp: Add XDP example for head adjustment
  2016-12-07  5:31 ` [PATCH v3 net-next 4/4] bpf: xdp: Add XDP example for head adjustment Martin KaFai Lau
@ 2016-12-07 10:34   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-07 10:34 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Alexei Starovoitov, Brenden Blanco, Daniel Borkmann,
	David Miller, Jakub Kicinski, John Fastabend, Saeed Mahameed,
	Tariq Toukan, Kernel Team, brouer

On Tue, 6 Dec 2016 21:31:54 -0800
Martin KaFai Lau <kafai@fb.com> wrote:

> The XDP prog checks if the incoming packet matches any VIP:PORT
> combination in the BPF hashmap.  If it is, it will encapsulate
> the packet with a IPv4/v6 header as instructed by the value of
> the BPF hashmap and then XDP_TX it out.
> 
> The VIP:PORT -> IP-Encap-Info can be specified by the cmd args
> of the user prog.
> 
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  samples/bpf/Makefile              |   4 +
>  samples/bpf/bpf_helpers.h         |   2 +
>  samples/bpf/bpf_load.c            |  94 ++++++++++++++
>  samples/bpf/bpf_load.h            |   1 +
>  samples/bpf/xdp1_user.c           |  93 --------------
>  samples/bpf/xdp_tx_iptnl_common.h |  37 ++++++
>  samples/bpf/xdp_tx_iptnl_kern.c   | 232 ++++++++++++++++++++++++++++++++++
>  samples/bpf/xdp_tx_iptnl_user.c   | 253 ++++++++++++++++++++++++++++++++++++++

I got confused by the file name "iptnl", I didn't realize this was
short for iptunnel, before after reading the actually XDP program code.

These are "samples" XDP programs that normal people are expected to
find/discover, could we name it "xdp_tx_tunnel" or "xdp_tx_iptunnel"?
(To guide peoples search for this)

I will likely add a "xdp_tx_vlan" example as I have a customer use-case
that needs this for DDoS scrubbing[1]

[1] http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/use-cases/xdp_use_case_ddos_scrubber.html#forward-clean-traffic

[...]
> diff --git a/samples/bpf/xdp_tx_iptnl_kern.c b/samples/bpf/xdp_tx_iptnl_kern.c
> new file mode 100644
> index 000000000000..d88c064175aa
> --- /dev/null
> +++ b/samples/bpf/xdp_tx_iptnl_kern.c
> @@ -0,0 +1,232 @@
> +/* Copyright (c) 2016 Facebook
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.

Can we add short description of the program, to let readers know if
this is the sample they are looking for. Max 3 lines, like:

 This program demonstrate how XDP does packet header adjustment, here
 by adding an encapsulation tunnel header based on a BPF hashmap.

> + */
> +#include <uapi/linux/bpf.h>
> +#include <linux/in.h>
> +#include <linux/if_ether.h>
> +#include <linux/if_packet.h>
> +#include <linux/if_vlan.h>
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include "bpf_helpers.h"
> +#include "xdp_tx_iptnl_common.h"
> +
> +struct bpf_map_def SEC("maps") rxcnt = {
> +	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
> +	.key_size = sizeof(__u32),
> +	.value_size = sizeof(__u64),
> +	.max_entries = 256,
> +};
> +
> +struct bpf_map_def SEC("maps") vip2tnl = {
> +	.type = BPF_MAP_TYPE_HASH,
> +	.key_size = sizeof(struct vip),
> +	.value_size = sizeof(struct iptnl_info),
> +	.max_entries = MAX_IPTNL_ENTRIES,
> +};

[...]

> diff --git a/samples/bpf/xdp_tx_iptnl_user.c b/samples/bpf/xdp_tx_iptnl_user.c
> new file mode 100644
> index 000000000000..9aeef7579af4
> --- /dev/null
> +++ b/samples/bpf/xdp_tx_iptnl_user.c
> @@ -0,0 +1,253 @@
> +/* Copyright (c) 2016 Facebook
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
[...]
> +
> +static void usage(const char *cmd)
> +{

Wondering if there should be a descriptive header, that says e.g. 
"XDP tunnel sample" or if command filename "xdp_tx_iptunnel" or
"xdp_tx_tunnel" would be descriptive enough.


> +	printf("Usage: %s [...]\n", cmd);
> +	printf("    -i <ifindex> Interface Index\n");
> +	printf("    -a <vip-service-address> IPv4 or IPv6\n");
> +	printf("    -p <vip-service-port> A port range (e.g. 433-444) is also allowed\n");
> +	printf("    -s <source-ip> Used in the IPTunnel Header\n");
> +	printf("    -d <dest-ip> Used in the IPTunnel header>\n");
> +	printf("    -m <dest-MAC> Used in sending the IP Tunneled pkt>\n");
> +	printf("    -T <stop-after-X-seconds> Default: 0 (forever)\n");
> +	printf("    -P <IP-Protocol> Default is TCP\n");
> +	printf("    -h Display this help\n");
> +}

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog
  2016-12-07  9:32   ` Daniel Borkmann
@ 2016-12-07 11:41     ` Jakub Kicinski
  2016-12-07 13:34       ` Daniel Borkmann
  2016-12-07 16:37       ` Alexei Starovoitov
  0 siblings, 2 replies; 13+ messages in thread
From: Jakub Kicinski @ 2016-12-07 11:41 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Martin KaFai Lau, netdev, Alexei Starovoitov, Brenden Blanco,
	David Miller, Jesper Dangaard Brouer, John Fastabend,
	Saeed Mahameed, Tariq Toukan, Kernel Team

On Wed, 07 Dec 2016 10:32:19 +0100, Daniel Borkmann wrote:
> Hi Martin,
> 
> On 12/07/2016 06:31 AM, Martin KaFai Lau wrote:
> [...]
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> > index 49a81f1fc1d6..6261157f444e 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> > @@ -2794,6 +2794,9 @@ static int mlx4_xdp(struct net_device *dev, struct netdev_xdp *xdp)
> >   	case XDP_QUERY_PROG:
> >   		xdp->prog_attached = mlx4_xdp_attached(dev);
> >   		return 0;
> > +	case XDP_QUERY_FEATURES:
> > +		xdp->features = 0;
> > +		return 0;
> >   	default:
> >   		return -EINVAL;
> >   	}  
> [...]
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 1ff5ea6e1221..786ad7c67215 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -30,6 +30,7 @@
> >   #include <linux/delay.h>
> >   #include <linux/atomic.h>
> >   #include <linux/prefetch.h>
> > +#include <linux/bitops.h>
> >   #include <asm/cache.h>
> >   #include <asm/byteorder.h>
> >
> > @@ -805,6 +806,13 @@ struct tc_to_netdev {
> >   	bool egress_dev;
> >   };
> >
> > +/* Driver must allow a XDP prog to extend header by
> > + * up to XDP_PACKET_HEADROOM.  It must also fill out
> > + * the data_hard_start value in struct xdp_buff
> > + * before calling out the xdp_prog.
> > + */
> > +#define XDP_F_ADJUST_HEAD	BIT(0)
> > +
> >   /* These structures hold the attributes of xdp state that are being passed
> >    * to the netdevice through the xdp op.
> >    */
> > @@ -821,6 +829,8 @@ enum xdp_netdev_command {
> >   	 * return true if a program is currently attached and running.
> >   	 */
> >   	XDP_QUERY_PROG,
> > +	/* Check what XDP features are supported by a device */
> > +	XDP_QUERY_FEATURES,
> >   };
> >
> >   struct netdev_xdp {
> > @@ -830,6 +840,8 @@ struct netdev_xdp {
> >   		struct bpf_prog *prog;
> >   		/* XDP_QUERY_PROG */
> >   		bool prog_attached;
> > +		/* XDP_QUERY_FEATURES */
> > +		u32 features;
> >   	};
> >   };
> >  
> [...]
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index bffb5253e778..90696f7e6b59 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6722,6 +6722,15 @@ int dev_change_xdp_fd(struct net_device *dev, int fd, u32 flags)
> >   		prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
> >   		if (IS_ERR(prog))
> >   			return PTR_ERR(prog);
> > +
> > +		xdp.command = XDP_QUERY_FEATURES;
> > +		err = ops->ndo_xdp(dev, &xdp);
> > +		if (err)
> > +			return err;
> > +
> > +		if (prog->xdp_adjust_head &&
> > +		    !(xdp.features & XDP_F_ADJUST_HEAD))
> > +			return -ENOTSUPP;
> >   	}
> >
> >   	memset(&xdp, 0, sizeof(xdp));  
> 
> I think this interface wrt feature flags is rather odd. Why can't this be
> done the usual/expected way we already have today for drivers with NETIF_F_*
> flags?
>
> We have include/linux/netdev_features.h, there, we add all NETIF_F_XDP_*
> feature flags that the device would then select during init, perhaps some of
> them in future might depend on a certain setups, etc, calculating them in a
> separate ndo_xdp() seems odd also in the sense that in-kernel users always
> need to call ops->ndo_xdp() with XDP_QUERY_FEATURES instead of just simply
> doing the test on dev->features & NETIF_F_XDP_* directly. This is global to
> the device anyway and doesn't need to be stored somewhere in private data
> area.

If I may offer one potential disadvantage of just using netdev
features :)
- if we ever want to report something more than flags (say the length
of headroom) we will need another interface.  People who care about
memory savings may also get upset if we extend struct netdevice given
there is no way to compile XDP out, that would be an argument for
keeping the ndo invocation.

> I see nothing wrong if this is exposed/made visible in the usual way through
> ethtool -k as well. I guess at least that would be the expected way to query
> for such driver capabilities.

+1 on exposing this to user space.  Whether via ethtool -k or a
separate XDP-specific netlink message is mostly a question of whether
we expect the need to expose more complex capabilities than bits.

Thanks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog
  2016-12-07 11:41     ` Jakub Kicinski
@ 2016-12-07 13:34       ` Daniel Borkmann
  2016-12-07 16:37       ` Alexei Starovoitov
  1 sibling, 0 replies; 13+ messages in thread
From: Daniel Borkmann @ 2016-12-07 13:34 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Martin KaFai Lau, netdev, Alexei Starovoitov, Brenden Blanco,
	David Miller, Jesper Dangaard Brouer, John Fastabend,
	Saeed Mahameed, Tariq Toukan, Kernel Team

On 12/07/2016 12:41 PM, Jakub Kicinski wrote:
> On Wed, 07 Dec 2016 10:32:19 +0100, Daniel Borkmann wrote:
>> On 12/07/2016 06:31 AM, Martin KaFai Lau wrote:
>> [...]
>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>>> index 49a81f1fc1d6..6261157f444e 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>>> @@ -2794,6 +2794,9 @@ static int mlx4_xdp(struct net_device *dev, struct netdev_xdp *xdp)
>>>    	case XDP_QUERY_PROG:
>>>    		xdp->prog_attached = mlx4_xdp_attached(dev);
>>>    		return 0;
>>> +	case XDP_QUERY_FEATURES:
>>> +		xdp->features = 0;
>>> +		return 0;
>>>    	default:
>>>    		return -EINVAL;
>>>    	}
>> [...]
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index 1ff5ea6e1221..786ad7c67215 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -30,6 +30,7 @@
>>>    #include <linux/delay.h>
>>>    #include <linux/atomic.h>
>>>    #include <linux/prefetch.h>
>>> +#include <linux/bitops.h>
>>>    #include <asm/cache.h>
>>>    #include <asm/byteorder.h>
>>>
>>> @@ -805,6 +806,13 @@ struct tc_to_netdev {
>>>    	bool egress_dev;
>>>    };
>>>
>>> +/* Driver must allow a XDP prog to extend header by
>>> + * up to XDP_PACKET_HEADROOM.  It must also fill out
>>> + * the data_hard_start value in struct xdp_buff
>>> + * before calling out the xdp_prog.
>>> + */
>>> +#define XDP_F_ADJUST_HEAD	BIT(0)
>>> +
>>>    /* These structures hold the attributes of xdp state that are being passed
>>>     * to the netdevice through the xdp op.
>>>     */
>>> @@ -821,6 +829,8 @@ enum xdp_netdev_command {
>>>    	 * return true if a program is currently attached and running.
>>>    	 */
>>>    	XDP_QUERY_PROG,
>>> +	/* Check what XDP features are supported by a device */
>>> +	XDP_QUERY_FEATURES,
>>>    };
>>>
>>>    struct netdev_xdp {
>>> @@ -830,6 +840,8 @@ struct netdev_xdp {
>>>    		struct bpf_prog *prog;
>>>    		/* XDP_QUERY_PROG */
>>>    		bool prog_attached;
>>> +		/* XDP_QUERY_FEATURES */
>>> +		u32 features;
>>>    	};
>>>    };
>>>
>> [...]
>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>> index bffb5253e778..90696f7e6b59 100644
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -6722,6 +6722,15 @@ int dev_change_xdp_fd(struct net_device *dev, int fd, u32 flags)
>>>    		prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
>>>    		if (IS_ERR(prog))
>>>    			return PTR_ERR(prog);

Ohh, by the way, here you fetch the prog, grabbing a reference.

>>> +
>>> +		xdp.command = XDP_QUERY_FEATURES;
>>> +		err = ops->ndo_xdp(dev, &xdp);
>>> +		if (err)

Therefore ... bpf_prog_put() ...

>>> +			return err;
>>> +
>>> +		if (prog->xdp_adjust_head &&
>>> +		    !(xdp.features & XDP_F_ADJUST_HEAD))

... same here, otherwise we leak it!

>>> +			return -ENOTSUPP;
>>>    	}
>>>
>>>    	memset(&xdp, 0, sizeof(xdp));
>>
>> I think this interface wrt feature flags is rather odd. Why can't this be
>> done the usual/expected way we already have today for drivers with NETIF_F_*
>> flags?
>>
>> We have include/linux/netdev_features.h, there, we add all NETIF_F_XDP_*
>> feature flags that the device would then select during init, perhaps some of
>> them in future might depend on a certain setups, etc, calculating them in a
>> separate ndo_xdp() seems odd also in the sense that in-kernel users always
>> need to call ops->ndo_xdp() with XDP_QUERY_FEATURES instead of just simply
>> doing the test on dev->features & NETIF_F_XDP_* directly. This is global to
>> the device anyway and doesn't need to be stored somewhere in private data
>> area.
>
> If I may offer one potential disadvantage of just using netdev
> features :)
> - if we ever want to report something more than flags (say the length
> of headroom) we will need another interface.  People who care about

Okay, but do we want XDP_QUERY_FEATURES to be a 'super-interface' returning
everything? I mean depending on what comes up in future, I'd rather imagine
that this is still partitioned a bit further, so that f.e. queries where the
driver would need to take some state lock are only required if the caller of
ndo_xdp() is really interested in that. Some of the features might simply be
bit flags, though, some others, if the flag is set, might need a query down
to the driver.

> memory savings may also get upset if we extend struct netdevice given
> there is no way to compile XDP out, that would be an argument for
> keeping the ndo invocation.

If this is a specific concern also regarding dev feature flags, then fair
enough. Just found it odd to have an extra ndo_xdp() call for it where they
could be stored in the dev directly instead. I don't know if we ever need to
pass dev pointer via struct xdp_buff to a helper function and query anything
from there, but worst case this would then need to be changed a bit.

>> I see nothing wrong if this is exposed/made visible in the usual way through
>> ethtool -k as well. I guess at least that would be the expected way to query
>> for such driver capabilities.
>
> +1 on exposing this to user space.  Whether via ethtool -k or a
> separate XDP-specific netlink message is mostly a question of whether
> we expect the need to expose more complex capabilities than bits.
>
> Thanks!
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog
  2016-12-07 11:41     ` Jakub Kicinski
  2016-12-07 13:34       ` Daniel Borkmann
@ 2016-12-07 16:37       ` Alexei Starovoitov
  2016-12-07 17:04         ` David Miller
  2016-12-07 17:26         ` Martin KaFai Lau
  1 sibling, 2 replies; 13+ messages in thread
From: Alexei Starovoitov @ 2016-12-07 16:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, Martin KaFai Lau, netdev, Alexei Starovoitov,
	Brenden Blanco, David Miller, Jesper Dangaard Brouer,
	John Fastabend, Saeed Mahameed, Tariq Toukan, Kernel Team

On Wed, Dec 07, 2016 at 11:41:12AM +0000, Jakub Kicinski wrote:
> > I see nothing wrong if this is exposed/made visible in the usual way through
> > ethtool -k as well. I guess at least that would be the expected way to query
> > for such driver capabilities.
> 
> +1 on exposing this to user space.  Whether via ethtool -k or a
> separate XDP-specific netlink message is mostly a question of whether
> we expect the need to expose more complex capabilities than bits.

I'm very much against using NETIF_F_ flags and exposing this to user space.
I see this xdp feature flag as temporary workaround until all drivers
support adjust_head() helper. It is very much a fundamental feature for xdp.
Without being able to add/remove headers the usability of xdp becomes very limited.

If you guys dont like extra ndo_xdp command, I'd suggest to do
"if (prog->xdp_adjust_head)" check in the driver and if driver doesn't
support it yet, just fail XDP_SETUP_PROG command.
imo that will be more flexible interface, since in the future drivers
can fail on different combination of features and simple boolean flag
unlikely to serve us for long time.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog
  2016-12-07 16:37       ` Alexei Starovoitov
@ 2016-12-07 17:04         ` David Miller
  2016-12-07 17:14           ` Daniel Borkmann
  2016-12-07 17:26         ` Martin KaFai Lau
  1 sibling, 1 reply; 13+ messages in thread
From: David Miller @ 2016-12-07 17:04 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: kubakici, daniel, kafai, netdev, ast, bblanco, brouer,
	john.fastabend, saeedm, tariqt, kernel-team

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Wed, 7 Dec 2016 08:37:58 -0800

> On Wed, Dec 07, 2016 at 11:41:12AM +0000, Jakub Kicinski wrote:
>> > I see nothing wrong if this is exposed/made visible in the usual way through
>> > ethtool -k as well. I guess at least that would be the expected way to query
>> > for such driver capabilities.
>> 
>> +1 on exposing this to user space.  Whether via ethtool -k or a
>> separate XDP-specific netlink message is mostly a question of whether
>> we expect the need to expose more complex capabilities than bits.
> 
> I'm very much against using NETIF_F_ flags and exposing this to user space.
> I see this xdp feature flag as temporary workaround until all drivers
> support adjust_head() helper. It is very much a fundamental feature for xdp.
> Without being able to add/remove headers the usability of xdp becomes very limited.
> 
> If you guys dont like extra ndo_xdp command, I'd suggest to do
> "if (prog->xdp_adjust_head)" check in the driver and if driver doesn't
> support it yet, just fail XDP_SETUP_PROG command.
> imo that will be more flexible interface, since in the future drivers
> can fail on different combination of features and simple boolean flag
> unlikely to serve us for long time.

Indeed, if the eventual plan is to have all drivers be required to
support a fundamental set of XDP features then exporting this in any
way to userspace is not a good idea.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog
  2016-12-07 17:04         ` David Miller
@ 2016-12-07 17:14           ` Daniel Borkmann
  0 siblings, 0 replies; 13+ messages in thread
From: Daniel Borkmann @ 2016-12-07 17:14 UTC (permalink / raw)
  To: David Miller, alexei.starovoitov
  Cc: kubakici, kafai, netdev, ast, bblanco, brouer, john.fastabend,
	saeedm, tariqt, kernel-team

On 12/07/2016 06:04 PM, David Miller wrote:
> From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Date: Wed, 7 Dec 2016 08:37:58 -0800
>
>> On Wed, Dec 07, 2016 at 11:41:12AM +0000, Jakub Kicinski wrote:
>>>> I see nothing wrong if this is exposed/made visible in the usual way through
>>>> ethtool -k as well. I guess at least that would be the expected way to query
>>>> for such driver capabilities.
>>>
>>> +1 on exposing this to user space.  Whether via ethtool -k or a
>>> separate XDP-specific netlink message is mostly a question of whether
>>> we expect the need to expose more complex capabilities than bits.
>>
>> I'm very much against using NETIF_F_ flags and exposing this to user space.
>> I see this xdp feature flag as temporary workaround until all drivers
>> support adjust_head() helper. It is very much a fundamental feature for xdp.
>> Without being able to add/remove headers the usability of xdp becomes very limited.
>>
>> If you guys dont like extra ndo_xdp command, I'd suggest to do
>> "if (prog->xdp_adjust_head)" check in the driver and if driver doesn't
>> support it yet, just fail XDP_SETUP_PROG command.
>> imo that will be more flexible interface, since in the future drivers
>> can fail on different combination of features and simple boolean flag
>> unlikely to serve us for long time.
>
> Indeed, if the eventual plan is to have all drivers be required to
> support a fundamental set of XDP features then exporting this in any
> way to userspace is not a good idea.

Agreed, if that is required anyway, then much better and simpler to just
return the -ENOTSUPP from the XDP_SETUP_PROG handling of each driver that
way.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog
  2016-12-07 16:37       ` Alexei Starovoitov
  2016-12-07 17:04         ` David Miller
@ 2016-12-07 17:26         ` Martin KaFai Lau
  1 sibling, 0 replies; 13+ messages in thread
From: Martin KaFai Lau @ 2016-12-07 17:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jakub Kicinski, Daniel Borkmann, netdev, Alexei Starovoitov,
	Brenden Blanco, David Miller, Jesper Dangaard Brouer,
	John Fastabend, Saeed Mahameed, Tariq Toukan, Kernel Team

On Wed, Dec 07, 2016 at 08:37:58AM -0800, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2016 at 11:41:12AM +0000, Jakub Kicinski wrote:
> > > I see nothing wrong if this is exposed/made visible in the usual way through
> > > ethtool -k as well. I guess at least that would be the expected way to query
> > > for such driver capabilities.
> >
> > +1 on exposing this to user space.  Whether via ethtool -k or a
> > separate XDP-specific netlink message is mostly a question of whether
> > we expect the need to expose more complex capabilities than bits.
>
> I'm very much against using NETIF_F_ flags and exposing this to user space.
> I see this xdp feature flag as temporary workaround until all drivers
> support adjust_head() helper. It is very much a fundamental feature for xdp.
> Without being able to add/remove headers the usability of xdp becomes very limited.
>
> If you guys dont like extra ndo_xdp command, I'd suggest to do
> "if (prog->xdp_adjust_head)" check in the driver and if driver doesn't
> support it yet, just fail XDP_SETUP_PROG command.
> imo that will be more flexible interface, since in the future drivers
> can fail on different combination of features and simple boolean flag
> unlikely to serve us for long time.
It makes sense that adjust_head() will eventually be supported by
all xdp-capable driver.  If that is the case, lets check
prog->xdp_adjust_head inside the driver instead of adding
another ndo_xdp command which will become unuseful very soon.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-12-07 17:26 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-12-07  5:31 [PATCH v3 net-next 0/4]: Allow head adjustment in XDP prog Martin KaFai Lau
2016-12-07  5:31 ` [PATCH v3 net-next 1/4] bpf: xdp: " Martin KaFai Lau
2016-12-07  9:32   ` Daniel Borkmann
2016-12-07 11:41     ` Jakub Kicinski
2016-12-07 13:34       ` Daniel Borkmann
2016-12-07 16:37       ` Alexei Starovoitov
2016-12-07 17:04         ` David Miller
2016-12-07 17:14           ` Daniel Borkmann
2016-12-07 17:26         ` Martin KaFai Lau
2016-12-07  5:31 ` [PATCH v3 net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs Martin KaFai Lau
2016-12-07  5:31 ` [PATCH v3 net-next 3/4] mlx4: xdp: Reserve headroom for receiving packet when XDP prog is active Martin KaFai Lau
2016-12-07  5:31 ` [PATCH v3 net-next 4/4] bpf: xdp: Add XDP example for head adjustment Martin KaFai Lau
2016-12-07 10:34   ` Jesper Dangaard Brouer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).