Netdev List
 help / color / mirror / Atom feed
* [PATCH v2 0/2] IR decoding using BPF
From: Sean Young @ 2018-05-15 18:50 UTC (permalink / raw)
  To: linux-media, linux-kernel, Alexei Starovoitov,
	Mauro Carvalho Chehab, Daniel Borkmann, netdev, Matthias Reichl,
	Devin Heitmueller, Y Song

The kernel IR decoders support the most widely used IR protocols, but there
are many protocols which are not supported[1]. For example, the
lirc-remotes[2] repo has over 2700 remotes, many of which are not supported
by rc-core. There is a "long tail" of unsupported IR protocols.

IR encoding is done in such a way that some simple circuit can decode it;
therefore, bpf is ideal.

In order to support all these protocols, here we have bpf based IR decoding.
The idea is that user-space can define a decoder in bpf, attach it to
the rc device through the lirc chardev.

Separate work is underway to extend ir-keytable to have an extensive library
of bpf-based decoders, and a much expanded library of rc keymaps.

Another future application would be to compile IRP[3] to a IR BPF program, and
so support virtually every remote without having to write a decoder for each.

Thanks,

Sean Young

[1] http://www.hifi-remote.com/wiki/index.php?title=DecodeIR
[2] https://sourceforge.net/p/lirc-remotes/code/ci/master/tree/remotes/
[3] http://www.hifi-remote.com/wiki/index.php?title=IRP_Notation

Changes since v1:
 - Code review comments from Y Song <ys114321@gmail.com> and
   Randy Dunlap <rdunlap@infradead.org>
 - Re-wrote sample bpf to be selftest
 - Renamed RAWIR_DECODER -> RAWIR_EVENT (Kconfig, context, bpf prog type)
 - Rebase on bpf-next
 - Introduced bpf_rawir_event context structure with simpler access checking

Sean Young (2):
  media: rc: introduce BPF_PROG_RAWIR_EVENT
  bpf: add selftest for rawir_event type program

 drivers/media/rc/Kconfig                      |  10 +
 drivers/media/rc/Makefile                     |   1 +
 drivers/media/rc/bpf-rawir-event.c            | 322 ++++++++++++++++++
 drivers/media/rc/lirc_dev.c                   |  28 ++
 drivers/media/rc/rc-core-priv.h               |  19 ++
 drivers/media/rc/rc-ir-raw.c                  |   3 +
 include/linux/bpf_rcdev.h                     |  30 ++
 include/linux/bpf_types.h                     |   3 +
 include/uapi/linux/bpf.h                      |  55 ++-
 kernel/bpf/syscall.c                          |   7 +
 tools/bpf/bpftool/prog.c                      |   1 +
 tools/include/uapi/linux/bpf.h                |  55 ++-
 tools/lib/bpf/libbpf.c                        |   1 +
 tools/testing/selftests/bpf/Makefile          |   7 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   6 +
 tools/testing/selftests/bpf/test_rawir.sh     |  37 ++
 .../selftests/bpf/test_rawir_event_kern.c     |  26 ++
 .../selftests/bpf/test_rawir_event_user.c     |  90 +++++
 18 files changed, 696 insertions(+), 5 deletions(-)
 create mode 100644 drivers/media/rc/bpf-rawir-event.c
 create mode 100644 include/linux/bpf_rcdev.h
 create mode 100755 tools/testing/selftests/bpf/test_rawir.sh
 create mode 100644 tools/testing/selftests/bpf/test_rawir_event_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_rawir_event_user.c

-- 
2.17.0

^ permalink raw reply

* [PATCH v2 1/2] media: rc: introduce BPF_PROG_RAWIR_EVENT
From: Sean Young @ 2018-05-15 18:50 UTC (permalink / raw)
  To: linux-media, linux-kernel, Alexei Starovoitov,
	Mauro Carvalho Chehab, Daniel Borkmann, netdev, Matthias Reichl,
	Devin Heitmueller, Y Song
In-Reply-To: <cover.1526409690.git.sean@mess.org>

Add support for BPF_PROG_RAWIR_EVENT. This type of BPF program can call
rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report
that the last key should be repeated.

The bpf program can be attached to using the bpf(BPF_PROG_ATTACH) syscall;
the target_fd must be the /dev/lircN device.

Signed-off-by: Sean Young <sean@mess.org>
---
 drivers/media/rc/Kconfig           |  10 +
 drivers/media/rc/Makefile          |   1 +
 drivers/media/rc/bpf-rawir-event.c | 322 +++++++++++++++++++++++++++++
 drivers/media/rc/lirc_dev.c        |  28 +++
 drivers/media/rc/rc-core-priv.h    |  19 ++
 drivers/media/rc/rc-ir-raw.c       |   3 +
 include/linux/bpf_rcdev.h          |  30 +++
 include/linux/bpf_types.h          |   3 +
 include/uapi/linux/bpf.h           |  55 ++++-
 kernel/bpf/syscall.c               |   7 +
 10 files changed, 477 insertions(+), 1 deletion(-)
 create mode 100644 drivers/media/rc/bpf-rawir-event.c
 create mode 100644 include/linux/bpf_rcdev.h

diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig
index eb2c3b6eca7f..55747af5b978 100644
--- a/drivers/media/rc/Kconfig
+++ b/drivers/media/rc/Kconfig
@@ -25,6 +25,16 @@ config LIRC
 	   passes raw IR to and from userspace, which is needed for
 	   IR transmitting (aka "blasting") and for the lirc daemon.
 
+config BPF_RAWIR_EVENT
+	bool "Enable attaching BPF programs to lirc devices"
+	depends on BPF_SYSCALL
+	depends on RC_CORE=y
+	depends on LIRC
+	help
+	   Enable this option to make it possible to load custom IR
+	   decoders written in BPF. These decoders are type
+	   BPF_PROG_TYPE_RAW_IR_EVENT.
+
 menuconfig RC_DECODERS
 	bool "Remote controller decoders"
 	depends on RC_CORE
diff --git a/drivers/media/rc/Makefile b/drivers/media/rc/Makefile
index 2e1c87066f6c..74907823bef8 100644
--- a/drivers/media/rc/Makefile
+++ b/drivers/media/rc/Makefile
@@ -5,6 +5,7 @@ obj-y += keymaps/
 obj-$(CONFIG_RC_CORE) += rc-core.o
 rc-core-y := rc-main.o rc-ir-raw.o
 rc-core-$(CONFIG_LIRC) += lirc_dev.o
+rc-core-$(CONFIG_BPF_RAWIR_EVENT) += bpf-rawir-event.o
 obj-$(CONFIG_IR_NEC_DECODER) += ir-nec-decoder.o
 obj-$(CONFIG_IR_RC5_DECODER) += ir-rc5-decoder.o
 obj-$(CONFIG_IR_RC6_DECODER) += ir-rc6-decoder.o
diff --git a/drivers/media/rc/bpf-rawir-event.c b/drivers/media/rc/bpf-rawir-event.c
new file mode 100644
index 000000000000..8007841977d6
--- /dev/null
+++ b/drivers/media/rc/bpf-rawir-event.c
@@ -0,0 +1,322 @@
+// SPDX-License-Identifier: GPL-2.0
+// bpf-rawir-event.c - handles bpf
+//
+// Copyright (C) 2018 Sean Young <sean@mess.org>
+
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include "rc-core-priv.h"
+
+/*
+ * BPF interface for raw IR
+ */
+const struct bpf_prog_ops rawir_event_prog_ops = {
+};
+
+BPF_CALL_1(bpf_rc_repeat, struct bpf_rawir_event*, event)
+{
+	struct ir_raw_event_ctrl *ctrl;
+
+	ctrl = container_of(event, struct ir_raw_event_ctrl, bpf_rawir_event);
+
+	rc_repeat(ctrl->dev);
+
+	return 0;
+}
+
+static const struct bpf_func_proto rc_repeat_proto = {
+	.func	   = bpf_rc_repeat,
+	.gpl_only  = true, /* rc_repeat is EXPORT_SYMBOL_GPL */
+	.ret_type  = RET_INTEGER,
+	.arg1_type = ARG_PTR_TO_CTX,
+};
+
+BPF_CALL_4(bpf_rc_keydown, struct bpf_rawir_event*, event, u32, protocol,
+	   u32, scancode, u32, toggle)
+{
+	struct ir_raw_event_ctrl *ctrl;
+
+	ctrl = container_of(event, struct ir_raw_event_ctrl, bpf_rawir_event);
+
+	rc_keydown(ctrl->dev, protocol, scancode, toggle != 0);
+
+	return 0;
+}
+
+static const struct bpf_func_proto rc_keydown_proto = {
+	.func	   = bpf_rc_keydown,
+	.gpl_only  = true, /* rc_keydown is EXPORT_SYMBOL_GPL */
+	.ret_type  = RET_INTEGER,
+	.arg1_type = ARG_PTR_TO_CTX,
+	.arg2_type = ARG_ANYTHING,
+	.arg3_type = ARG_ANYTHING,
+	.arg4_type = ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto *
+rawir_event_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_rc_repeat:
+		return &rc_repeat_proto;
+	case BPF_FUNC_rc_keydown:
+		return &rc_keydown_proto;
+	case BPF_FUNC_map_lookup_elem:
+		return &bpf_map_lookup_elem_proto;
+	case BPF_FUNC_map_update_elem:
+		return &bpf_map_update_elem_proto;
+	case BPF_FUNC_map_delete_elem:
+		return &bpf_map_delete_elem_proto;
+	case BPF_FUNC_ktime_get_ns:
+		return &bpf_ktime_get_ns_proto;
+	case BPF_FUNC_tail_call:
+		return &bpf_tail_call_proto;
+	case BPF_FUNC_get_prandom_u32:
+		return &bpf_get_prandom_u32_proto;
+	default:
+		return NULL;
+	}
+}
+
+static bool rawir_event_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	/* struct bpf_rawir_event has two u32 fields */
+	if (type == BPF_WRITE)
+		return false;
+
+	if (size != sizeof(__u32))
+		return false;
+
+	if (!(off == offsetof(struct bpf_rawir_event, duration) ||
+	      off == offsetof(struct bpf_rawir_event, type)))
+		return false;
+
+	return true;
+}
+
+const struct bpf_verifier_ops rawir_event_verifier_ops = {
+	.get_func_proto  = rawir_event_func_proto,
+	.is_valid_access = rawir_event_is_valid_access
+};
+
+#define BPF_MAX_PROGS 64
+
+static int rc_dev_bpf_attach(struct rc_dev *rcdev, struct bpf_prog *prog)
+{
+	struct ir_raw_event_ctrl *raw;
+	struct bpf_prog_array __rcu *old_array;
+	struct bpf_prog_array *new_array;
+	int ret, i, size;
+
+	if (rcdev->driver_type != RC_DRIVER_IR_RAW)
+		return -EINVAL;
+
+	ret = mutex_lock_interruptible(&rcdev->lock);
+	if (ret)
+		return ret;
+
+	raw = rcdev->raw;
+
+	if (raw->progs) {
+		size = bpf_prog_array_length(raw->progs);
+		for (i = 0; i < size; i++) {
+			if (prog == raw->progs->progs[i]) {
+				ret = -EEXIST;
+				goto out;
+			}
+		}
+
+		if (size >= BPF_MAX_PROGS) {
+			ret = -E2BIG;
+			goto out;
+		}
+	}
+
+	old_array = raw->progs;
+	ret = bpf_prog_array_copy(old_array, NULL, prog, &new_array);
+	if (ret < 0)
+		goto out;
+
+	rcu_assign_pointer(raw->progs, new_array);
+	bpf_prog_array_free(old_array);
+out:
+	mutex_unlock(&rcdev->lock);
+	return ret;
+}
+
+static int rc_dev_bpf_detach(struct rc_dev *rcdev, struct bpf_prog *prog)
+{
+	struct ir_raw_event_ctrl *raw;
+	struct bpf_prog_array __rcu *old_array;
+	struct bpf_prog_array *new_array;
+	int ret;
+
+	if (rcdev->driver_type != RC_DRIVER_IR_RAW)
+		return -EINVAL;
+
+	ret = mutex_lock_interruptible(&rcdev->lock);
+	if (ret)
+		return ret;
+
+	raw = rcdev->raw;
+
+	old_array = raw->progs;
+	ret = bpf_prog_array_copy(old_array, prog, NULL, &new_array);
+	if (ret < 0) {
+		bpf_prog_array_delete_safe(old_array, prog);
+	} else {
+		rcu_assign_pointer(raw->progs, new_array);
+		bpf_prog_array_free(old_array);
+	}
+
+	bpf_prog_put(prog);
+	mutex_unlock(&rcdev->lock);
+	return 0;
+}
+
+void rc_dev_bpf_run(struct rc_dev *rcdev, struct ir_raw_event ev)
+{
+	struct ir_raw_event_ctrl *raw = rcdev->raw;
+
+	if (!raw->progs)
+		return;
+
+	if (unlikely(ev.carrier_report)) {
+		raw->bpf_rawir_event.carrier = ev.carrier;
+		raw->bpf_rawir_event.type = BPF_RAWIR_EVENT_CARRIER;
+	} else {
+		raw->bpf_rawir_event.duration = ev.duration;
+
+		if (ev.pulse)
+			raw->bpf_rawir_event.type = BPF_RAWIR_EVENT_PULSE;
+		else if (ev.timeout)
+			raw->bpf_rawir_event.type = BPF_RAWIR_EVENT_TIMEOUT;
+		else if (ev.reset)
+			raw->bpf_rawir_event.type = BPF_RAWIR_EVENT_RESET;
+		else
+			raw->bpf_rawir_event.type = BPF_RAWIR_EVENT_SPACE;
+	}
+
+	BPF_PROG_RUN_ARRAY(raw->progs, &raw->bpf_rawir_event, BPF_PROG_RUN);
+}
+
+void rc_dev_bpf_put(struct rc_dev *rcdev)
+{
+	struct bpf_prog_array *progs = rcdev->raw->progs;
+	int i, size;
+
+	if (!progs)
+		return;
+
+	size = bpf_prog_array_length(progs);
+	for (i = 0; i < size; i++)
+		bpf_prog_put(progs->progs[i]);
+
+	bpf_prog_array_free(rcdev->raw->progs);
+}
+
+int rc_dev_prog_attach(const union bpf_attr *attr)
+{
+	struct bpf_prog *prog;
+	struct rc_dev *rcdev;
+	int ret;
+
+	if (attr->attach_flags)
+		return -EINVAL;
+
+	prog = bpf_prog_get_type(attr->attach_bpf_fd,
+				 BPF_PROG_TYPE_RAWIR_EVENT);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	rcdev = rc_dev_get_from_fd(attr->target_fd);
+	if (IS_ERR(rcdev)) {
+		bpf_prog_put(prog);
+		return PTR_ERR(rcdev);
+	}
+
+	ret = rc_dev_bpf_attach(rcdev, prog);
+	if (ret)
+		bpf_prog_put(prog);
+
+	put_device(&rcdev->dev);
+
+	return ret;
+}
+
+int rc_dev_prog_detach(const union bpf_attr *attr)
+{
+	struct bpf_prog *prog;
+	struct rc_dev *rcdev;
+	int ret;
+
+	if (attr->attach_flags)
+		return -EINVAL;
+
+	prog = bpf_prog_get_type(attr->attach_bpf_fd,
+				 BPF_PROG_TYPE_RAWIR_EVENT);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	rcdev = rc_dev_get_from_fd(attr->target_fd);
+	if (IS_ERR(rcdev)) {
+		bpf_prog_put(prog);
+		return PTR_ERR(rcdev);
+	}
+
+	ret = rc_dev_bpf_detach(rcdev, prog);
+
+	bpf_prog_put(prog);
+	put_device(&rcdev->dev);
+
+	return ret;
+}
+
+int rc_dev_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
+{
+	__u32 __user *prog_ids = u64_to_user_ptr(attr->query.prog_ids);
+	struct bpf_prog_array *progs;
+	struct rc_dev *rcdev;
+	u32 cnt, flags = 0;
+	int ret;
+
+	if (attr->query.query_flags)
+		return -EINVAL;
+
+	rcdev = rc_dev_get_from_fd(attr->query.target_fd);
+	if (IS_ERR(rcdev))
+		return PTR_ERR(rcdev);
+
+	if (rcdev->driver_type != RC_DRIVER_IR_RAW) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = mutex_lock_interruptible(&rcdev->lock);
+	if (ret)
+		goto out;
+
+	progs = rcdev->raw->progs;
+	cnt = progs ? bpf_prog_array_length(progs) : 0;
+
+	if (copy_to_user(&uattr->query.prog_cnt, &cnt, sizeof(cnt))) {
+		ret = -EFAULT;
+		goto out;
+	}
+	if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (attr->query.prog_cnt != 0 && prog_ids && cnt)
+		ret = bpf_prog_array_copy_to_user(progs, prog_ids, cnt);
+
+out:
+	mutex_unlock(&rcdev->lock);
+	put_device(&rcdev->dev);
+
+	return ret;
+}
diff --git a/drivers/media/rc/lirc_dev.c b/drivers/media/rc/lirc_dev.c
index 24e9fbb80e81..c3028d0366d1 100644
--- a/drivers/media/rc/lirc_dev.c
+++ b/drivers/media/rc/lirc_dev.c
@@ -20,6 +20,7 @@
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <linux/device.h>
+#include <linux/file.h>
 #include <linux/idr.h>
 #include <linux/poll.h>
 #include <linux/sched.h>
@@ -816,4 +817,31 @@ void __exit lirc_dev_exit(void)
 	unregister_chrdev_region(lirc_base_dev, RC_DEV_MAX);
 }
 
+struct rc_dev *rc_dev_get_from_fd(int fd)
+{
+	struct rc_dev *dev;
+	struct file *f;
+
+	f = fget_raw(fd);
+	if (!f)
+		return ERR_PTR(-EBADF);
+
+	if (!S_ISCHR(f->f_inode->i_mode) ||
+	    imajor(f->f_inode) != MAJOR(lirc_base_dev)) {
+		fput(f);
+		return ERR_PTR(-EBADF);
+	}
+
+	dev = container_of(f->f_inode->i_cdev, struct rc_dev, lirc_cdev);
+	if (!dev->registered) {
+		fput(f);
+		return ERR_PTR(-ENODEV);
+	}
+
+	get_device(&dev->dev);
+	fput(f);
+
+	return dev;
+}
+
 MODULE_ALIAS("lirc_dev");
diff --git a/drivers/media/rc/rc-core-priv.h b/drivers/media/rc/rc-core-priv.h
index e0e6a17460f6..6db579f425f1 100644
--- a/drivers/media/rc/rc-core-priv.h
+++ b/drivers/media/rc/rc-core-priv.h
@@ -13,6 +13,7 @@
 #define	MAX_IR_EVENT_SIZE	512
 
 #include <linux/slab.h>
+#include <uapi/linux/bpf.h>
 #include <media/rc-core.h>
 
 /**
@@ -57,6 +58,11 @@ struct ir_raw_event_ctrl {
 	/* raw decoder state follows */
 	struct ir_raw_event prev_ev;
 	struct ir_raw_event this_ev;
+
+#ifdef CONFIG_BPF_RAWIR_EVENT
+	struct bpf_rawir_event		bpf_rawir_event;
+	struct bpf_prog_array		*progs;
+#endif
 	struct nec_dec {
 		int state;
 		unsigned count;
@@ -288,6 +294,7 @@ void ir_lirc_raw_event(struct rc_dev *dev, struct ir_raw_event ev);
 void ir_lirc_scancode_event(struct rc_dev *dev, struct lirc_scancode *lsc);
 int ir_lirc_register(struct rc_dev *dev);
 void ir_lirc_unregister(struct rc_dev *dev);
+struct rc_dev *rc_dev_get_from_fd(int fd);
 #else
 static inline int lirc_dev_init(void) { return 0; }
 static inline void lirc_dev_exit(void) {}
@@ -299,4 +306,16 @@ static inline int ir_lirc_register(struct rc_dev *dev) { return 0; }
 static inline void ir_lirc_unregister(struct rc_dev *dev) { }
 #endif
 
+/*
+ * bpf interface
+ */
+#ifdef CONFIG_BPF_RAWIR_EVENT
+void rc_dev_bpf_put(struct rc_dev *dev);
+void rc_dev_bpf_run(struct rc_dev *dev, struct ir_raw_event ev);
+#else
+static inline void rc_dev_bpf_put(struct rc_dev *dev) { }
+static inline void rc_dev_bpf_run(struct rc_dev *dev, struct ir_raw_event ev)
+{ }
+#endif
+
 #endif /* _RC_CORE_PRIV */
diff --git a/drivers/media/rc/rc-ir-raw.c b/drivers/media/rc/rc-ir-raw.c
index 374f83105a23..25828f15faec 100644
--- a/drivers/media/rc/rc-ir-raw.c
+++ b/drivers/media/rc/rc-ir-raw.c
@@ -32,6 +32,7 @@ static int ir_raw_event_thread(void *data)
 				    handler->protocols || !handler->protocols)
 					handler->decode(raw->dev, ev);
 			ir_lirc_raw_event(raw->dev, ev);
+			rc_dev_bpf_run(raw->dev, ev);
 			raw->prev_ev = ev;
 		}
 		mutex_unlock(&ir_raw_handler_lock);
@@ -623,6 +624,8 @@ void ir_raw_event_unregister(struct rc_dev *dev)
 			handler->raw_unregister(dev);
 	mutex_unlock(&ir_raw_handler_lock);
 
+	rc_dev_bpf_put(dev);
+
 	ir_raw_event_free(dev);
 }
 
diff --git a/include/linux/bpf_rcdev.h b/include/linux/bpf_rcdev.h
new file mode 100644
index 000000000000..17a30f30436a
--- /dev/null
+++ b/include/linux/bpf_rcdev.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _BPF_RCDEV_H
+#define _BPF_RCDEV_H
+
+#include <linux/bpf.h>
+#include <uapi/linux/bpf.h>
+
+#ifdef CONFIG_BPF_RAWIR_EVENT
+int rc_dev_prog_attach(const union bpf_attr *attr);
+int rc_dev_prog_detach(const union bpf_attr *attr);
+int rc_dev_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr);
+#else
+static inline int rc_dev_prog_attach(const union bpf_attr *attr)
+{
+	return -EINVAL;
+}
+
+static inline int rc_dev_prog_detach(const union bpf_attr *attr)
+{
+	return -EINVAL;
+}
+
+static inline int rc_dev_prog_query(const union bpf_attr *attr,
+				    union bpf_attr __user *uattr)
+{
+	return -EINVAL;
+}
+#endif
+
+#endif /* _BPF_RCDEV_H */
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index d7df1b323082..667d1d557090 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -25,6 +25,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 #ifdef CONFIG_CGROUP_BPF
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
 #endif
+#ifdef CONFIG_BPF_RAWIR_EVENT
+BPF_PROG_TYPE(BPF_PROG_TYPE_RAWIR_EVENT, rawir_event)
+#endif
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 02e4112510f8..8ba1be825af0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -140,6 +140,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_MSG,
 	BPF_PROG_TYPE_RAW_TRACEPOINT,
 	BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+	BPF_PROG_TYPE_RAWIR_EVENT,
 };
 
 enum bpf_attach_type {
@@ -157,6 +158,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_INET6_CONNECT,
 	BPF_CGROUP_INET4_POST_BIND,
 	BPF_CGROUP_INET6_POST_BIND,
+	BPF_RAWIR_EVENT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1855,6 +1857,35 @@ union bpf_attr {
  *             Egress device index on success, 0 if packet needs to continue
  *             up the stack for further processing or a negative error in case
  *             of failure.
+ *
+ * int bpf_rc_keydown(void *ctx, u32 protocol, u32 scancode, u32 toggle)
+ *	Description
+ *		Report decoded scancode with toggle value. For use in
+ *		BPF_PROG_TYPE_RAWIR_EVENT, to report a successfully
+ *		decoded scancode. This is will generate a keydown event,
+ *		and a keyup event once the scancode is no longer repeated.
+ *
+ *		*ctx* pointer to bpf_rawir_event, *protocol* is decoded
+ *		protocol (see RC_PROTO_* enum).
+ *
+ *		Some protocols include a toggle bit, in case the button
+ *		was released and pressed again between consecutive scancodes,
+ *		copy this bit into *toggle* if it exists, else set to 0.
+ *
+ *     Return
+ *		Always return 0 (for now)
+ *
+ * int bpf_rc_repeat(void *ctx)
+ *	Description
+ *		Repeat the last decoded scancode; some IR protocols like
+ *		NEC have a special IR message for repeat last button,
+ *		in case user is holding a button down; the scancode is
+ *		not repeated.
+ *
+ *		*ctx* pointer to bpf_rawir_event.
+ *
+ *     Return
+ *		Always return 0 (for now)
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -1926,7 +1957,9 @@ union bpf_attr {
 	FN(skb_get_xfrm_state),		\
 	FN(get_stack),			\
 	FN(skb_load_bytes_relative),	\
-	FN(fib_lookup),
+	FN(fib_lookup),			\
+	FN(rc_repeat),			\
+	FN(rc_keydown),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -1993,6 +2026,26 @@ enum bpf_hdr_start_off {
 	BPF_HDR_START_NET,
 };
 
+/*
+ * user accessible mirror of in-kernel ir_raw_event
+ */
+#define BPF_RAWIR_EVENT_SPACE		0
+#define BPF_RAWIR_EVENT_PULSE		1
+#define BPF_RAWIR_EVENT_TIMEOUT		2
+#define BPF_RAWIR_EVENT_RESET		3
+#define BPF_RAWIR_EVENT_CARRIER		4
+#define BPF_RAWIR_EVENT_DUTY_CYCLE	5
+
+struct bpf_rawir_event {
+	union {
+		__u32	duration;
+		__u32	carrier;
+		__u32	duty_cycle;
+	};
+
+	__u32	type;
+};
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e2aeb5e89f44..75c089f407c8 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -11,6 +11,7 @@
  */
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
+#include <linux/bpf_rcdev.h>
 #include <linux/btf.h>
 #include <linux/syscalls.h>
 #include <linux/slab.h>
@@ -1567,6 +1568,8 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_SK_SKB_STREAM_PARSER:
 	case BPF_SK_SKB_STREAM_VERDICT:
 		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, true);
+	case BPF_RAWIR_EVENT:
+		return rc_dev_prog_attach(attr);
 	default:
 		return -EINVAL;
 	}
@@ -1637,6 +1640,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_SK_SKB_STREAM_PARSER:
 	case BPF_SK_SKB_STREAM_VERDICT:
 		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, false);
+	case BPF_RAWIR_EVENT:
+		return rc_dev_prog_detach(attr);
 	default:
 		return -EINVAL;
 	}
@@ -1684,6 +1689,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_CGROUP_SOCK_OPS:
 	case BPF_CGROUP_DEVICE:
 		break;
+	case BPF_RAWIR_EVENT:
+		return rc_dev_prog_query(attr, uattr);
 	default:
 		return -EINVAL;
 	}
-- 
2.17.0

^ permalink raw reply related

* [PATCH v2 2/2] bpf: add selftest for rawir_event type program
From: Sean Young @ 2018-05-15 18:50 UTC (permalink / raw)
  To: linux-media, linux-kernel, Alexei Starovoitov,
	Mauro Carvalho Chehab, Daniel Borkmann, netdev, Matthias Reichl,
	Devin Heitmueller, Y Song
In-Reply-To: <cover.1526409690.git.sean@mess.org>

This is simple test over rc-loopback.

Signed-off-by: Sean Young <sean@mess.org>
---
 tools/bpf/bpftool/prog.c                      |  1 +
 tools/include/uapi/linux/bpf.h                | 55 +++++++++++-
 tools/lib/bpf/libbpf.c                        |  1 +
 tools/testing/selftests/bpf/Makefile          |  7 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |  6 ++
 tools/testing/selftests/bpf/test_rawir.sh     | 37 ++++++++
 .../selftests/bpf/test_rawir_event_kern.c     | 26 ++++++
 .../selftests/bpf/test_rawir_event_user.c     | 90 +++++++++++++++++++
 8 files changed, 219 insertions(+), 4 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/test_rawir.sh
 create mode 100644 tools/testing/selftests/bpf/test_rawir_event_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_rawir_event_user.c

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 9bdfdf2d3fbe..8889a4ee8577 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -71,6 +71,7 @@ static const char * const prog_type_name[] = {
 	[BPF_PROG_TYPE_SK_MSG]		= "sk_msg",
 	[BPF_PROG_TYPE_RAW_TRACEPOINT]	= "raw_tracepoint",
 	[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
+	[BPF_PROG_TYPE_RAWIR_EVENT]	= "rawir_event",
 };
 
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 02e4112510f8..8ba1be825af0 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -140,6 +140,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_MSG,
 	BPF_PROG_TYPE_RAW_TRACEPOINT,
 	BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+	BPF_PROG_TYPE_RAWIR_EVENT,
 };
 
 enum bpf_attach_type {
@@ -157,6 +158,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_INET6_CONNECT,
 	BPF_CGROUP_INET4_POST_BIND,
 	BPF_CGROUP_INET6_POST_BIND,
+	BPF_RAWIR_EVENT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1855,6 +1857,35 @@ union bpf_attr {
  *             Egress device index on success, 0 if packet needs to continue
  *             up the stack for further processing or a negative error in case
  *             of failure.
+ *
+ * int bpf_rc_keydown(void *ctx, u32 protocol, u32 scancode, u32 toggle)
+ *	Description
+ *		Report decoded scancode with toggle value. For use in
+ *		BPF_PROG_TYPE_RAWIR_EVENT, to report a successfully
+ *		decoded scancode. This is will generate a keydown event,
+ *		and a keyup event once the scancode is no longer repeated.
+ *
+ *		*ctx* pointer to bpf_rawir_event, *protocol* is decoded
+ *		protocol (see RC_PROTO_* enum).
+ *
+ *		Some protocols include a toggle bit, in case the button
+ *		was released and pressed again between consecutive scancodes,
+ *		copy this bit into *toggle* if it exists, else set to 0.
+ *
+ *     Return
+ *		Always return 0 (for now)
+ *
+ * int bpf_rc_repeat(void *ctx)
+ *	Description
+ *		Repeat the last decoded scancode; some IR protocols like
+ *		NEC have a special IR message for repeat last button,
+ *		in case user is holding a button down; the scancode is
+ *		not repeated.
+ *
+ *		*ctx* pointer to bpf_rawir_event.
+ *
+ *     Return
+ *		Always return 0 (for now)
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -1926,7 +1957,9 @@ union bpf_attr {
 	FN(skb_get_xfrm_state),		\
 	FN(get_stack),			\
 	FN(skb_load_bytes_relative),	\
-	FN(fib_lookup),
+	FN(fib_lookup),			\
+	FN(rc_repeat),			\
+	FN(rc_keydown),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -1993,6 +2026,26 @@ enum bpf_hdr_start_off {
 	BPF_HDR_START_NET,
 };
 
+/*
+ * user accessible mirror of in-kernel ir_raw_event
+ */
+#define BPF_RAWIR_EVENT_SPACE		0
+#define BPF_RAWIR_EVENT_PULSE		1
+#define BPF_RAWIR_EVENT_TIMEOUT		2
+#define BPF_RAWIR_EVENT_RESET		3
+#define BPF_RAWIR_EVENT_CARRIER		4
+#define BPF_RAWIR_EVENT_DUTY_CYCLE	5
+
+struct bpf_rawir_event {
+	union {
+		__u32	duration;
+		__u32	carrier;
+		__u32	duty_cycle;
+	};
+
+	__u32	type;
+};
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index df54c4c9e48a..372269e9053d 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1455,6 +1455,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
 	case BPF_PROG_TYPE_SK_MSG:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
+	case BPF_PROG_TYPE_RAWIR_EVENT:
 		return false;
 	case BPF_PROG_TYPE_UNSPEC:
 	case BPF_PROG_TYPE_KPROBE:
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 438d4f93875b..d4b7a410c333 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -24,7 +24,7 @@ urandom_read: urandom_read.c
 # Order correspond to 'make run_tests' order
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \
 	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
-	test_sock test_btf test_sockmap
+	test_sock test_btf test_sockmap test_rawir_event_user
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \
 	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o     \
@@ -33,7 +33,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
 	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
 	test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \
-	test_get_stack_rawtp.o
+	test_get_stack_rawtp.o test_rawir_event_kern.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -42,7 +42,8 @@ TEST_PROGS := test_kmod.sh \
 	test_xdp_meta.sh \
 	test_offload.py \
 	test_sock_addr.sh \
-	test_tunnel.sh
+	test_tunnel.sh \
+	test_rawir.sh
 
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 2375d06c706b..641af70048bd 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -106,6 +106,12 @@ static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
 static int (*bpf_fib_lookup)(void *ctx, struct bpf_fib_lookup *params,
 			     int plen, __u32 flags) =
 	(void *) BPF_FUNC_fib_lookup;
+static int (*bpf_rc_repeat)(void *ctx) =
+	(void *) BPF_FUNC_rc_repeat;
+static int (*bpf_rc_keydown)(void *ctx, unsigned protocol, unsigned scancode,
+			     unsigned toggle) =
+	(void *) BPF_FUNC_rc_keydown;
+
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_rawir.sh b/tools/testing/selftests/bpf/test_rawir.sh
new file mode 100755
index 000000000000..30529e829af1
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_rawir.sh
@@ -0,0 +1,37 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# Test bpf_rawir_event over rc-loopback. Steps for the test:
+#
+# 1. Find the /dev/lircN device for rc-loopback
+# 2. Attach bpf_rawir_event program which decodes some IR.
+# 3. Send some IR to the same IR device; since it is loopback, this will
+#    end up in the bpf program
+# 4. bpf program should decode IR and report keycode
+# 5. We can read keycode from same /dev/lirc device
+
+GREEN='\033[0;92m'
+RED='\033[0;31m'
+NC='\033[0m' # No Color
+
+modprobe rc-loopback
+
+for i in /sys/class/rc/rc*
+do
+	if grep -q DRV_NAME=rc-loopback $i/uevent
+	then
+		LIRCDEV=$(grep DEVNAME= $i/lirc*/uevent | sed sQDEVNAME=Q/dev/Q)
+	fi
+done
+
+if [ -n $LIRCDEV ];
+then
+	TYPE=rawir_event
+	./test_rawir_event_user $LIRCDEV
+	ret=$?
+        if [ $ret -ne 0 ]; then
+                echo -e ${RED}"FAIL: $TYPE"${NC}
+                return 1
+        fi
+        echo -e ${GREEN}"PASS: $TYPE"${NC}
+fi
diff --git a/tools/testing/selftests/bpf/test_rawir_event_kern.c b/tools/testing/selftests/bpf/test_rawir_event_kern.c
new file mode 100644
index 000000000000..33ba5d30af62
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_rawir_event_kern.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+// test ir decoder
+//
+// Copyright (C) 2018 Sean Young <sean@mess.org>
+
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+
+SEC("rawir_event")
+int bpf_decoder(struct bpf_rawir_event *e)
+{
+	if (e->type == BPF_RAWIR_EVENT_PULSE) {
+		/*
+		 * The lirc interface is microseconds, but here we receive
+		 * nanoseconds.
+		 */
+		int microseconds = e->duration / 1000;
+
+		if (microseconds & 0x10000)
+			bpf_rc_keydown(e, 0x40, microseconds & 0xffff, 0);
+	}
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_rawir_event_user.c b/tools/testing/selftests/bpf/test_rawir_event_user.c
new file mode 100644
index 000000000000..0d1f711b28f1
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_rawir_event_user.c
@@ -0,0 +1,90 @@
+// SPDX-License-Identifier: GPL-2.0
+// test ir decoder
+//
+// Copyright (C) 2018 Sean Young <sean@mess.org>
+
+#include <linux/bpf.h>
+#include <linux/lirc.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <unistd.h>
+#include <poll.h>
+#include <libgen.h>
+#include <sys/resource.h>
+#include <sys/types.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+
+#include "bpf_util.h"
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+int main(int argc, char **argv)
+{
+	struct bpf_object *obj;
+	int ret, lircfd, progfd, mode;
+	int testir = 0x1dead;
+
+	if (argc != 2) {
+		printf("Usage: %s /dev/lircN\n", argv[0]);
+		return 2;
+	}
+
+	ret = bpf_prog_load("test_rawir_event_kern.o",
+			    BPF_PROG_TYPE_RAWIR_EVENT, &obj, &progfd);
+	if (ret) {
+		printf("Failed to load bpf program\n");
+		return 1;
+	}
+
+	lircfd = open(argv[1], O_RDWR | O_NONBLOCK);
+	if (lircfd == -1) {
+		printf("failed to open lirc device %s: %m\n", argv[1]);
+		return 1;
+	}
+
+	mode = LIRC_MODE_SCANCODE;
+	if (ioctl(lircfd, LIRC_SET_REC_MODE, &mode)) {
+		printf("failed to set rec mode: %m\n");
+		return 1;
+	}
+
+	ret = bpf_prog_attach(progfd, lircfd, BPF_RAWIR_EVENT, 0);
+	if (ret) {
+		printf("Failed to attach bpf to lirc device: %m\n");
+		return 1;
+	}
+
+	/* Write raw IR */
+	ret = write(lircfd, &testir, sizeof(testir));
+	if (ret != sizeof(testir)) {
+		printf("Failed to send test IR message: %m\n");
+		return 1;
+	}
+
+	struct pollfd pfd = { .fd = lircfd, .events = POLLIN };
+
+	poll(&pfd, 1, 100);
+
+	struct lirc_scancode lsc;
+
+	/* Read decoded IR */
+	ret = read(lircfd, &lsc, sizeof(lsc));
+	if (ret != sizeof(lsc)) {
+		printf("Failed to read decoded IR: %m\n");
+		return 1;
+	}
+
+	if (lsc.scancode != 0xdead || lsc.rc_proto != 64) {
+		printf("Incorrect scancode decoded\n");
+		return 1;
+	}
+
+	return 0;
+}
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks
From: Qing Huang @ 2018-05-15 18:53 UTC (permalink / raw)
  To: Tariq Toukan, davem, haakon.bugge, yanjun.zhu
  Cc: netdev, linux-rdma, linux-kernel
In-Reply-To: <1ded7d49-0ba2-3594-f840-74d7cf37a0eb@mellanox.com>



On 5/15/2018 2:19 AM, Tariq Toukan wrote:
>
>
> On 14/05/2018 7:41 PM, Qing Huang wrote:
>>
>>
>> On 5/13/2018 2:00 AM, Tariq Toukan wrote:
>>>
>>>
>>> On 11/05/2018 10:23 PM, Qing Huang wrote:
>>>> When a system is under memory presure (high usage with fragments),
>>>> the original 256KB ICM chunk allocations will likely trigger kernel
>>>> memory management to enter slow path doing memory compact/migration
>>>> ops in order to complete high order memory allocations.
>>>>
>>>> When that happens, user processes calling uverb APIs may get stuck
>>>> for more than 120s easily even though there are a lot of free pages
>>>> in smaller chunks available in the system.
>>>>
>>>> Syslog:
>>>> ...
>>>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>>>> oracle_205573_e:205573 blocked for more than 120 seconds.
>>>> ...
>>>>
>>>> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>>>>
>>>> However in order to support smaller ICM chunk size, we need to fix
>>>> another issue in large size kcalloc allocations.
>>>>
>>>> E.g.
>>>> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
>>>> size, each ICM chunk can only hold 512 mtt entries (8 bytes for 
>>>> each mtt
>>>> entry). So we need a 16MB allocation for a table->icm pointer array to
>>>> hold 2M pointers which can easily cause kcalloc to fail.
>>>>
>>>> The solution is to use vzalloc to replace kcalloc. There is no need
>>>> for contiguous memory pages for a driver meta data structure (no need
>>>> of DMA ops).
>>>>
>>>> Signed-off-by: Qing Huang <qing.huang@oracle.com>
>>>> Acked-by: Daniel Jurgens <danielj@mellanox.com>
>>>> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
>>>> ---
>>>> v2 -> v1: adjusted chunk size to reflect different architectures.
>>>>
>>>>   drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
>>>>   1 file changed, 7 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
>>>> b/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>> index a822f7a..ccb62b8 100644
>>>> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>> @@ -43,12 +43,12 @@
>>>>   #include "fw.h"
>>>>     /*
>>>> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
>>>> - * per chunk.
>>>> + * We allocate in page size (default 4KB on many archs) chunks to 
>>>> avoid high
>>>> + * order memory allocations in fragmented/high usage memory 
>>>> situation.
>>>>    */
>>>>   enum {
>>>> -    MLX4_ICM_ALLOC_SIZE    = 1 << 18,
>>>> -    MLX4_TABLE_CHUNK_SIZE    = 1 << 18
>>>> +    MLX4_ICM_ALLOC_SIZE    = 1 << PAGE_SHIFT,
>>>> +    MLX4_TABLE_CHUNK_SIZE    = 1 << PAGE_SHIFT
>>>
>>> Which is actually PAGE_SIZE.
>>
>> Yes, we wanted to avoid high order memory allocations.
>>
>
> Then please use PAGE_SIZE instead.

PAGE_SIZE is usually defined as 1 << PAGE_SHIFT. So I think PAGE_SHIFT 
is actually more appropriate here.


>
>>> Also, please add a comma at the end of the last entry.
>>
>> Hmm..., followed the existing code style and checkpatch.pl didn't 
>> complain about the comma.
>>
>
> I am in favor of having a comma also after the last element, so that 
> when another enum element is added we do not modify this line again, 
> which would falsely affect git blame.
>
> I know it didn't exist before your patch, but once we're here, let's 
> do it.

I'm okay either way. If adding an extra comma is preferred by many 
people, someone should update checkpatch.pl to enforce it. :)

>
>>>
>>>>   };
>>>>     static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct 
>>>> mlx4_icm_chunk *chunk)
>>>> @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, 
>>>> struct mlx4_icm_table *table,
>>>>       obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
>>>>       num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
>>>>   -    table->icm      = kcalloc(num_icm, sizeof(*table->icm), 
>>>> GFP_KERNEL);
>>>> +    table->icm      = vzalloc(num_icm * sizeof(*table->icm));
>>>
>>> Why not kvzalloc ?
>>
>> I think table->icm pointer array doesn't really need physically 
>> contiguous memory. Sometimes high order
>> memory allocation by kmalloc variants may trigger slow path and cause 
>> tasks to be blocked.
>>
>
> This is control path so it is less latency-sensitive.
> Let's not produce unnecessary degradation here, please call kvzalloc 
> so we maintain a similar behavior when contiguous memory is available, 
> and a fallback for resiliency.

No sure what exactly degradation is caused by vzalloc here. I think it's 
better to keep physically contiguous pages
to other requests which really need them. Besides slow path/mem 
compacting can be really expensive.

>
>> Thanks,
>> Qing
>>
>>>
>>>>       if (!table->icm)
>>>>           return -ENOMEM;
>>>>       table->virt     = virt;
>>>> @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, 
>>>> struct mlx4_icm_table *table,
>>>>               mlx4_free_icm(dev, table->icm[i], use_coherent);
>>>>           }
>>>>   -    kfree(table->icm);
>>>> +    vfree(table->icm);
>>>>         return -ENOMEM;
>>>>   }
>>>> @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev 
>>>> *dev, struct mlx4_icm_table *table)
>>>>               mlx4_free_icm(dev, table->icm[i], table->coherent);
>>>>           }
>>>>   -    kfree(table->icm);
>>>> +    vfree(table->icm);
>>>>   }
>>>>
>>>
>>> Thanks for your patch.
>>>
>>> I need to verify there is no dramatic performance degradation here.
>>> You can prepare and send a v3 in the meanwhile.
>>>
>>> Thanks,
>>> Tariq
>>> -- 
>>> To unsubscribe from this list: send the line "unsubscribe 
>>> linux-rdma" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH bpf-next v6 0/4] Hash support for sock
From: Daniel Borkmann @ 2018-05-15 18:55 UTC (permalink / raw)
  To: John Fastabend, ast; +Cc: netdev, davem
In-Reply-To: <1526317219-7752-1-git-send-email-john.fastabend@gmail.com>

On 05/14/2018 07:00 PM, John Fastabend wrote:
> In the original sockmap implementation we got away with using an
> array similar to devmap. However, unlike devmap where an ifindex
> has a nice 1:1 function into the map we have found some use cases
> with sockets that need to be referenced using longer keys.
> 
> This series adds support for a sockhash map reusing as much of
> the sockmap code as possible. I made the decision to add sockhash
> specific helpers vs trying to generalize the existing helpers
> because (a) they have sockmap in the name and (b) the keys are
> different types. I prefer to be explicit here rather than play
> type games or do something else tricky.
> 
> To test this we duplicate all the sockmap testing except swap out
> the sockmap with a sockhash.
> 
> v2: fix file stats and add v2 tag
> v3: move tool updates into test patch, move bpftool updates into
>     its own patch, and fixup the test patch stats to catch the
>     renamed file and provide only diffs +/- on that.
> v4: Add documentation to UAPI bpf.h
> v5: Add documentation to tools UAPI bpf.h
> v6: 'git add' test_sockhash_kern.c which was previously missing
>     but was not causing issues because of typo in test script,
>     noticed by Daniel. After this the git format-patch -M option
>     no longer tracks the rename of the test_sockmap_kern files for
>     some reason. I guess the diff has exceeded some threshold.
> 
> Just a note I pushed Dave's Acks through v4 into v5 due to small
> size of changes.
> 
> John Fastabend (4):
>   bpf: sockmap, refactor sockmap routines to work with hashmap
>   bpf: sockmap, add hash map support
>   bpf: selftest additions for SOCKHASH
>   bpf: bpftool, support for sockhash

Applied to bpf-next, thanks John! Couple of comments in the patches.

^ permalink raw reply

* Re: [PATCH bpf-next v6 2/4] bpf: sockmap, add hash map support
From: Daniel Borkmann @ 2018-05-15 19:01 UTC (permalink / raw)
  To: John Fastabend, ast; +Cc: netdev, davem
In-Reply-To: <1526317219-7752-3-git-send-email-john.fastabend@gmail.com>

On 05/14/2018 07:00 PM, John Fastabend wrote:
> Sockmap is currently backed by an array and enforces keys to be
> four bytes. This works well for many use cases and was originally
> modeled after devmap which also uses four bytes keys. However,
> this has become limiting in larger use cases where a hash would
> be more appropriate. For example users may want to use the 5-tuple
> of the socket as the lookup key.
> 
> To support this add hash support.
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> Acked-by: David S. Miller <davem@davemloft.net>
> ---
>  include/linux/bpf.h       |   8 +
>  include/linux/bpf_types.h |   1 +
>  include/uapi/linux/bpf.h  |  52 ++++-
>  kernel/bpf/core.c         |   1 +
>  kernel/bpf/sockmap.c      | 494 ++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/bpf/verifier.c     |  14 +-
>  net/core/filter.c         |  58 ++++++
>  7 files changed, 610 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index a38e474..ed0122b 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -668,6 +668,7 @@ static inline void bpf_map_offload_map_free(struct bpf_map *map)
>  
>  #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_INET)
>  struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
> +struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key);
>  int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type);
>  #else
>  static inline struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
> @@ -675,6 +676,12 @@ static inline struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
>  	return NULL;
>  }
>  
> +static inline struct sock  *__sock_hash_lookup_elem(struct bpf_map *map,
> +						    void *key)
> +{
> +	return NULL;
> +}
> +
>  static inline int sock_map_prog(struct bpf_map *map,
>  				struct bpf_prog *prog,
>  				u32 type)
> @@ -724,6 +731,7 @@ static inline void __xsk_map_flush(struct bpf_map *map)
>  extern const struct bpf_func_proto bpf_get_stackid_proto;
>  extern const struct bpf_func_proto bpf_get_stack_proto;
>  extern const struct bpf_func_proto bpf_sock_map_update_proto;
> +extern const struct bpf_func_proto bpf_sock_hash_update_proto;
>  
>  /* Shared helpers among cBPF and eBPF. */
>  void bpf_user_rnd_init_once(void);
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index d7df1b32..b67f879 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -47,6 +47,7 @@
>  BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
>  #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_INET)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops)
>  #endif
>  BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
>  #if defined(CONFIG_XDP_SOCKETS)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 02e4112..1205d86 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -118,6 +118,7 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_SOCKMAP,
>  	BPF_MAP_TYPE_CPUMAP,
>  	BPF_MAP_TYPE_XSKMAP,
> +	BPF_MAP_TYPE_SOCKHASH,
>  };
>  
>  enum bpf_prog_type {
> @@ -1855,6 +1856,52 @@ struct bpf_stack_build_id {
>   *             Egress device index on success, 0 if packet needs to continue
>   *             up the stack for further processing or a negative error in case
>   *             of failure.
> + * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags)

When you rebase please fix this up properly next time and add a newline in between
the helpers. I fixed this up while applying.

> + *	Description
> + *		Add an entry to, or update a sockhash *map* referencing sockets.
> + *		The *skops* is used as a new value for the entry associated to
> + *		*key*. *flags* is one of:
> + *
> + *		**BPF_NOEXIST**
> + *			The entry for *key* must not exist in the map.
> + *		**BPF_EXIST**
> + *			The entry for *key* must already exist in the map.
> + *		**BPF_ANY**
> + *			No condition on the existence of the entry for *key*.
> + *
> + *		If the *map* has eBPF programs (parser and verdict), those will
> + *		be inherited by the socket being added. If the socket is
> + *		already attached to eBPF programs, this results in an error.
> + *	Return
> + *		0 on success, or a negative error in case of failure.
[...]
> +static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
> +{
> +	struct bpf_htab *htab;
> +	int i, err;
> +	u64 cost;
> +
> +	if (!capable(CAP_NET_ADMIN))
> +		return ERR_PTR(-EPERM);
> +
> +	/* check sanity of attributes */
> +	if (attr->max_entries == 0 || attr->value_size != 4 ||
> +	    attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
> +		return ERR_PTR(-EINVAL);
> +
> +	err = bpf_tcp_ulp_register();
> +	if (err && err != -EEXIST)
> +		return ERR_PTR(err);
> +
> +	htab = kzalloc(sizeof(*htab), GFP_USER);
> +	if (!htab)
> +		return ERR_PTR(-ENOMEM);
> +
> +	bpf_map_init_from_attr(&htab->map, attr);
> +
> +	htab->n_buckets = roundup_pow_of_two(htab->map.max_entries);
> +	htab->elem_size = sizeof(struct htab_elem) +
> +			  round_up(htab->map.key_size, 8);
> +
> +	if (htab->n_buckets == 0 ||
> +	    htab->n_buckets > U32_MAX / sizeof(struct bucket))
> +		goto free_htab;
> +
> +	cost = (u64) htab->n_buckets * sizeof(struct bucket) +
> +	       (u64) htab->elem_size * htab->map.max_entries;
> +
> +	if (cost >= U32_MAX - PAGE_SIZE)
> +		goto free_htab;

Also above two goto free_htab were buggy! Error after bpf_tcp_ulp_register()
is either 0 or -EEXIST, if it's 0, then when you bail out here this will cause
a NULL pointer dereference in subsequent map setup paths. Adapting the same
logic from sock_map_alloc() would have been enough. I've added a err = -EINVAL
to the first occasion above this time.

> +	htab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
> +	err = bpf_map_precharge_memlock(htab->map.pages);
> +	if (err)
> +		goto free_htab;
> +
> +	err = -ENOMEM;
> +	htab->buckets = bpf_map_area_alloc(
> +				htab->n_buckets * sizeof(struct bucket),
> +				htab->map.numa_node);
> +	if (!htab->buckets)
> +		goto free_htab;
> +
> +	for (i = 0; i < htab->n_buckets; i++) {
> +		INIT_HLIST_HEAD(&htab->buckets[i].head);
> +		raw_spin_lock_init(&htab->buckets[i].lock);
> +	}
> +
> +	return &htab->map;
> +free_htab:
> +	kfree(htab);
> +	return ERR_PTR(err);
> +}

^ permalink raw reply

* [RFC PATCH bpf-next 00/12] AF_XDP, zero-copy support
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

This RFC introduces zerocopy (ZC) support for AF_XDP. Programs using
AF_XDP sockets will now receive RX packets without any copies and can
also transmit packets without incurring any copies. No modifications
to the application are needed, but the NIC driver needs to be modified
to support ZC. If ZC is not supported by the driver, the modes
introduced in the AF_XDP patch will be used. Using ZC in our
micro benchmarks results in significantly improved performance as can
be seen in the performance section later in this cover letter.

Note that we did not post this as a proper patch set as suggested by
Alexei due to mainly one reason. The i40e modifications need to be
fully and properly implemented (we need support for dynamically
creating and removing queues in the driver), split up in multiple
patches, then reviewed and QA:ed by the Intel NIC team before they can
become a proper patch. We just did not have time to finish all of this
in this merge window. 

Alexei had two concerns in conjunction with adding ZC support to
AF_XDP: show that the user interface holds and can deliver good
performance for ZC and that the driver interfaces for ZC are good. We
think that this patch set shows that we have addressed the first
issue: performance is good and there is no change to the uapi. But
please take a look at the code and see if you like the ZC interfaces
that was the second concern.

Note that for an untrusted application, HW packet steering to a
specific queue pair (the one associated with the application) is a
requirement when using ZC, as the application would otherwise be able
to see other user space processes' packets. If the HW cannot support
the required packet steering you need to use the XDP_SKB mode or the
XDP_DRV mode without ZC turned on. The XSKMAP introduced in the AF_XDP
patch set can be used to do load balancing in that case.

For benchmarking, you can use the xdpsock application from the AF_XDP
patch set without any modifications. Say that you would like your UDP
traffic from port 4242 to end up in queue 16, that we will enable
AF_XDP on. Here, we use ethtool for this:

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode with zerocopy can then be
done using:

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores which gives a total of 28, but only two cores are used in these
experiments. One for TR/RX and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
8192MB and with 8 of those DIMMs in the system we have 64 GB of total
memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
NIC is Intel I40E 40Gbit/s using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64
and 1500 byte packets, generated by a commercial packet generator HW
outputing packets at full 40 Gbit/s line rate. The results are without
retpoline so that we can compare against previous numbers. 

AF_XDP performance 64 byte packets. Results from the AF_XDP V3 patch
set are also reported for ease of reference.

Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
rxdrop       2.9*       9.6*       21.5
txpush       2.6*       -          21.6
l2fwd        1.9*       2.5*       15.0

* From AF_XDP V3 patch set and cover letter.

AF_XDP performance 1500 byte packets:
Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
rxdrop       2.1        3.3       3.3
l2fwd        1.4        1.8       3.1

So why do we not get higher values for RX similar to the 34 Mpps we
had in AF_PACKET V4? We made an experiment running the rxdrop
benchmark without using the xdp_do_redirect/flush infrastructure nor
using an XDP program (all traffic on a queue goes to one
socket). Instead the driver acts directly on the AF_XDP socket. With
this we got 36.9 Mpps, a significant improvement without any change to
the uapi. So not forcing users to have an XDP program if they do not
need it, might be a good idea. This measurement is actually higher
than what we got with AF_PACKET V4.

XDP performance on our system as a base line:

64 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      32.3M  0

1500 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      3.3M    0

The structure of the patch set is as follows:

Patch 1: Removes rebind support. Complicated to support for ZC,
         so will not be supported for AF_XDP in any mode at this
         point. Will be a follow up patch for the AF_XDP patch set.
Patches 2-4: Plumbing for AF_XDP ZC support
Patches 5-6: AF_XDP ZC for RX
Patches 7-8: AF_XDP ZC for TX
Patch 9: Minor performance fix for the sample application. ZC will
         work with nearly as good performance without this.
Patch 10-12: ZC support for i40e. Should be broken out in smaller
             pieces as pre-patches.

We based this patch set on bpf-next commit f2467c2dbc01
("selftests/bpf: make sure build-id is on")

To do for this RFC to become a patch set:

* Implement dynamic creation and deletion of queues in the i40e driver

* Properly splitting up the i40e changes

* Have the Intel NIC team review the i40e changes from at least an
  architecture point of view

* Implement a more fair scheduling policy for multiple XSKs that share
  an umem for TX. This can be combined with a batching API for
  xsk_umem_consume_tx.

We are planning on joining the iovisor call on Wednesday if you would
like to have a chat with us about this.

Thanks: Björn and Magnus

Björn Töpel (8):
  xsk: remove rebind support
  xsk: moved struct xdp_umem definition
  xsk: introduce xdp_umem_frame
  net: xdp: added bpf_netdev_command XDP_SETUP_XSK_UMEM
  xdp: add MEM_TYPE_ZERO_COPY
  xsk: add zero-copy support for Rx
  i40e: added queue pair disable/enable functions
  i40e: implement AF_XDP zero-copy support for Rx

Magnus Karlsson (4):
  net: added netdevice operation for Tx
  xsk: wire upp Tx zero-copy functions
  samples/bpf: minor *_nb_free performance fix
  i40e: implement Tx zero-copy

 drivers/net/ethernet/intel/i40e/i40e.h      |  20 +
 drivers/net/ethernet/intel/i40e/i40e_main.c | 458 +++++++++++++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 635 +++++++++++++++++++++++++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  36 +-
 include/linux/netdevice.h                   |  13 +
 include/net/xdp.h                           |  10 +
 include/net/xdp_sock.h                      |  45 +-
 net/core/xdp.c                              |  47 +-
 net/xdp/xdp_umem.c                          | 112 ++++-
 net/xdp/xdp_umem.h                          |  42 +-
 net/xdp/xdp_umem_props.h                    |  23 -
 net/xdp/xsk.c                               | 162 +++++--
 net/xdp/xsk_queue.h                         |  35 +-
 samples/bpf/xdpsock_user.c                  |   8 +-
 14 files changed, 1458 insertions(+), 188 deletions(-)
 delete mode 100644 net/xdp/xdp_umem_props.h

-- 
2.14.1

^ permalink raw reply

* [RFC PATCH bpf-next 01/12] xsk: remove rebind support
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan
In-Reply-To: <20180515190615.23099-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

Supporting rebind, i.e. after a successful bind the process can call
bind again without closing the socket, makes the setup state machine
more complex. Let us constrain the state space, but not supporting
rebind.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/xsk.c | 30 +++++++++---------------------
 1 file changed, 9 insertions(+), 21 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 009c5af5bba5..e59ca8e2618d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -236,14 +236,6 @@ static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
 	return 0;
 }
 
-static void __xsk_release(struct xdp_sock *xs)
-{
-	/* Wait for driver to stop using the xdp socket. */
-	synchronize_net();
-
-	dev_put(xs->dev);
-}
-
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
@@ -260,7 +252,9 @@ static int xsk_release(struct socket *sock)
 	local_bh_enable();
 
 	if (xs->dev) {
-		__xsk_release(xs);
+		/* Wait for driver to stop using the xdp socket. */
+		synchronize_net();
+		dev_put(xs->dev);
 		xs->dev = NULL;
 	}
 
@@ -294,9 +288,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 {
 	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
 	struct sock *sk = sock->sk;
-	struct net_device *dev, *dev_curr;
 	struct xdp_sock *xs = xdp_sk(sk);
-	struct xdp_umem *old_umem = NULL;
+	struct net_device *dev;
 	int err = 0;
 
 	if (addr_len < sizeof(struct sockaddr_xdp))
@@ -305,7 +298,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		return -EINVAL;
 
 	mutex_lock(&xs->mutex);
-	dev_curr = xs->dev;
+	if (xs->dev) {
+		err = -EBUSY;
+		goto out_release;
+	}
+
 	dev = dev_get_by_index(sock_net(sk), sxdp->sxdp_ifindex);
 	if (!dev) {
 		err = -ENODEV;
@@ -352,7 +349,6 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		}
 
 		xdp_get_umem(umem_xs->umem);
-		old_umem = xs->umem;
 		xs->umem = umem_xs->umem;
 		sockfd_put(sock);
 	} else if (!xs->umem || !xdp_umem_validate_queues(xs->umem)) {
@@ -364,14 +360,6 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		xskq_set_umem(xs->umem->cq, &xs->umem->props);
 	}
 
-	/* Rebind? */
-	if (dev_curr && (dev_curr != dev ||
-			 xs->queue_id != sxdp->sxdp_queue_id)) {
-		__xsk_release(xs);
-		if (old_umem)
-			xdp_put_umem(old_umem);
-	}
-
 	xs->dev = dev;
 	xs->queue_id = sxdp->sxdp_queue_id;
 
-- 
2.14.1

^ permalink raw reply related

* [RFC PATCH bpf-next 02/12] xsk: moved struct xdp_umem definition
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan
In-Reply-To: <20180515190615.23099-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

Moved struct xdp_umem to xdp_sock.h, in order to prepare for zero-copy
support.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h   | 27 ++++++++++++++++++++++++++-
 net/xdp/xdp_umem.c       |  1 +
 net/xdp/xdp_umem.h       | 25 +------------------------
 net/xdp/xdp_umem_props.h | 23 -----------------------
 net/xdp/xsk_queue.h      |  3 +--
 5 files changed, 29 insertions(+), 50 deletions(-)
 delete mode 100644 net/xdp/xdp_umem_props.h

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 185f4928fbda..c959aa43fb01 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -15,12 +15,37 @@
 #ifndef _LINUX_XDP_SOCK_H
 #define _LINUX_XDP_SOCK_H
 
+#include <linux/workqueue.h>
+#include <linux/if_xdp.h>
 #include <linux/mutex.h>
+#include <linux/mm.h>
 #include <net/sock.h>
 
 struct net_device;
 struct xsk_queue;
-struct xdp_umem;
+
+struct xdp_umem_props {
+	u32 frame_size;
+	u32 nframes;
+};
+
+struct xdp_umem {
+	struct xsk_queue *fq;
+	struct xsk_queue *cq;
+	struct page **pgs;
+	struct xdp_umem_props props;
+	u32 npgs;
+	u32 frame_headroom;
+	u32 nfpp_mask;
+	u32 nfpplog2;
+	u32 frame_size_log2;
+	struct user_struct *user;
+	struct pid *pid;
+	unsigned long address;
+	size_t size;
+	atomic_t users;
+	struct work_struct work;
+};
 
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 2b47a1dd7c6c..7cc162799744 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -22,6 +22,7 @@
 #include <linux/mm.h>
 
 #include "xdp_umem.h"
+#include "xsk_queue.h"
 
 #define XDP_UMEM_MIN_FRAME_SIZE 2048
 
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 7e0b2fab8522..32ad59b7322f 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -15,30 +15,7 @@
 #ifndef XDP_UMEM_H_
 #define XDP_UMEM_H_
 
-#include <linux/mm.h>
-#include <linux/if_xdp.h>
-#include <linux/workqueue.h>
-
-#include "xsk_queue.h"
-#include "xdp_umem_props.h"
-
-struct xdp_umem {
-	struct xsk_queue *fq;
-	struct xsk_queue *cq;
-	struct page **pgs;
-	struct xdp_umem_props props;
-	u32 npgs;
-	u32 frame_headroom;
-	u32 nfpp_mask;
-	u32 nfpplog2;
-	u32 frame_size_log2;
-	struct user_struct *user;
-	struct pid *pid;
-	unsigned long address;
-	size_t size;
-	atomic_t users;
-	struct work_struct work;
-};
+#include <net/xdp_sock.h>
 
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
 {
diff --git a/net/xdp/xdp_umem_props.h b/net/xdp/xdp_umem_props.h
deleted file mode 100644
index 77fb5daf29f3..000000000000
--- a/net/xdp/xdp_umem_props.h
+++ /dev/null
@@ -1,23 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0
- * XDP user-space packet buffer
- * Copyright(c) 2018 Intel Corporation.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
- * more details.
- */
-
-#ifndef XDP_UMEM_PROPS_H_
-#define XDP_UMEM_PROPS_H_
-
-struct xdp_umem_props {
-	u32 frame_size;
-	u32 nframes;
-};
-
-#endif /* XDP_UMEM_PROPS_H_ */
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 7aa9a535db0e..599a8d43c69a 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -17,8 +17,7 @@
 
 #include <linux/types.h>
 #include <linux/if_xdp.h>
-
-#include "xdp_umem_props.h"
+#include <net/xdp_sock.h>
 
 #define RX_BATCH_SIZE 16
 
-- 
2.14.1

^ permalink raw reply related

* [RFC PATCH bpf-next 03/12] xsk: introduce xdp_umem_frame
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan
In-Reply-To: <20180515190615.23099-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

The xdp_umem_frame holds the address for a frame. Trade memory for
faster lookup. Later, we'll add DMA address here as well.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h |  8 +++++---
 net/xdp/xdp_umem.c     | 29 +++++++++++++++++++++++++----
 net/xdp/xdp_umem.h     |  9 +--------
 3 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index c959aa43fb01..09068c4f068e 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -29,16 +29,18 @@ struct xdp_umem_props {
 	u32 nframes;
 };
 
+struct xdp_umem_frame {
+	void *addr;
+};
+
 struct xdp_umem {
 	struct xsk_queue *fq;
 	struct xsk_queue *cq;
+	struct xdp_umem_frame *frames;
 	struct page **pgs;
 	struct xdp_umem_props props;
 	u32 npgs;
 	u32 frame_headroom;
-	u32 nfpp_mask;
-	u32 nfpplog2;
-	u32 frame_size_log2;
 	struct user_struct *user;
 	struct pid *pid;
 	unsigned long address;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 7cc162799744..b426cbe3151a 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -92,6 +92,9 @@ static void xdp_umem_release(struct xdp_umem *umem)
 		umem->pgs = NULL;
 	}
 
+	kfree(umem->frames);
+	umem->frames = NULL;
+
 	xdp_umem_unaccount_pages(umem);
 out:
 	kfree(umem);
@@ -181,7 +184,8 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 {
 	u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
 	u64 addr = mr->addr, size = mr->len;
-	unsigned int nframes, nfpp;
+	u32 nfpplog2, frame_size_log2;
+	unsigned int nframes, nfpp, i;
 	int size_chk, err;
 
 	if (!umem)
@@ -234,9 +238,6 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	umem->pgs = NULL;
 	umem->user = NULL;
 
-	umem->frame_size_log2 = ilog2(frame_size);
-	umem->nfpp_mask = nfpp - 1;
-	umem->nfpplog2 = ilog2(nfpp);
 	atomic_set(&umem->users, 1);
 
 	err = xdp_umem_account_pages(umem);
@@ -246,6 +247,26 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	err = xdp_umem_pin_pages(umem);
 	if (err)
 		goto out_account;
+
+	umem->frames = kcalloc(nframes, sizeof(*umem->frames), GFP_KERNEL);
+	if (!umem->frames) {
+		err = -ENOMEM;
+		goto out_account;
+	}
+
+	frame_size_log2 = ilog2(frame_size);
+	nfpplog2 = ilog2(nfpp);
+	for (i = 0; i < nframes; i++) {
+		u64 pg, off;
+		char *data;
+
+		pg = i >> nfpplog2;
+		off = (i & (nfpp - 1)) << frame_size_log2;
+
+		data = page_address(umem->pgs[pg]);
+		umem->frames[i].addr = data + off;
+	}
+
 	return 0;
 
 out_account:
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 32ad59b7322f..0a969384af93 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -19,14 +19,7 @@
 
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
 {
-	u64 pg, off;
-	char *data;
-
-	pg = idx >> umem->nfpplog2;
-	off = (idx & umem->nfpp_mask) << umem->frame_size_log2;
-
-	data = page_address(umem->pgs[pg]);
-	return data + off;
+	return umem->frames[idx].addr;
 }
 
 static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
-- 
2.14.1

^ permalink raw reply related

* [RFC PATCH bpf-next 04/12] net: xdp: added bpf_netdev_command XDP_SETUP_XSK_UMEM
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan
In-Reply-To: <20180515190615.23099-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

Extend ndo_bpf with a new command used for register an UMEM to a
queue_id of a netdev.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/netdevice.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 03ed492c4e14..2084536ad4af 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -817,10 +817,12 @@ enum bpf_netdev_command {
 	BPF_OFFLOAD_DESTROY,
 	BPF_OFFLOAD_MAP_ALLOC,
 	BPF_OFFLOAD_MAP_FREE,
+	XDP_SETUP_XSK_UMEM,
 };
 
 struct bpf_prog_offload_ops;
 struct netlink_ext_ack;
+struct xdp_umem;
 
 struct netdev_bpf {
 	enum bpf_netdev_command command;
@@ -851,6 +853,11 @@ struct netdev_bpf {
 		struct {
 			struct bpf_offloaded_map *offmap;
 		};
+		/* XDP_SETUP_XSK_UMEM */
+		struct {
+			struct xdp_umem *umem;
+			u16 queue_id;
+		} xsk;
 	};
 };
 
-- 
2.14.1

^ permalink raw reply related

* [RFC PATCH bpf-next 05/12] xdp: add MEM_TYPE_ZERO_COPY
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan
In-Reply-To: <20180515190615.23099-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

Here, a new type of allocator support is added to the XDP return
API. A zero-copy allocated xdp_buff cannot be converted to an
xdp_frame. Instead is the buff has to be copied. This is not supported
at all in this commit.

Also, an opaque "handle" is added to xdp_buff. This can be used as a
context for the zero-copy allocator implementation.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp.h | 10 ++++++++++
 net/core/xdp.c    | 47 ++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 0b689cf561c7..e9eee37cddd6 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -37,6 +37,7 @@ enum xdp_mem_type {
 	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
 	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
 	MEM_TYPE_PAGE_POOL,
+	MEM_TYPE_ZERO_COPY,
 	MEM_TYPE_MAX,
 };
 
@@ -47,6 +48,10 @@ struct xdp_mem_info {
 
 struct page_pool;
 
+struct zero_copy_allocator {
+	void (*free)(struct zero_copy_allocator *, unsigned long);
+};
+
 struct xdp_rxq_info {
 	struct net_device *dev;
 	u32 queue_index;
@@ -59,6 +64,7 @@ struct xdp_buff {
 	void *data_end;
 	void *data_meta;
 	void *data_hard_start;
+	unsigned long handle;
 	struct xdp_rxq_info *rxq;
 };
 
@@ -82,6 +88,10 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	int metasize;
 	int headroom;
 
+	// XXX implement clone, copy, use "native" MEM_TYPE
+	if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
+		return NULL;
+
 	/* Assure headroom is available for storing info */
 	headroom = xdp->data - xdp->data_hard_start;
 	metasize = xdp->data - xdp->data_meta;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index bf6758f74339..4e11895b8cd9 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -31,6 +31,7 @@ struct xdp_mem_allocator {
 	union {
 		void *allocator;
 		struct page_pool *page_pool;
+		struct zero_copy_allocator *zc_alloc;
 	};
 	struct rhash_head node;
 	struct rcu_head rcu;
@@ -261,7 +262,7 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 	xdp_rxq->mem.type = type;
 
 	if (!allocator) {
-		if (type == MEM_TYPE_PAGE_POOL)
+		if (type == MEM_TYPE_PAGE_POOL || type == MEM_TYPE_ZERO_COPY)
 			return -EINVAL; /* Setup time check page_pool req */
 		return 0;
 	}
@@ -308,9 +309,11 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
-static void xdp_return(void *data, struct xdp_mem_info *mem)
+void xdp_return_frame(struct xdp_frame *xdpf)
 {
+	struct xdp_mem_info *mem = &xdpf->mem;
 	struct xdp_mem_allocator *xa;
+	void *data = xdpf->data;
 	struct page *page;
 
 	switch (mem->type) {
@@ -336,16 +339,46 @@ static void xdp_return(void *data, struct xdp_mem_info *mem)
 		/* Not possible, checked in xdp_rxq_info_reg_mem_model() */
 		break;
 	}
-}
 
-void xdp_return_frame(struct xdp_frame *xdpf)
-{
-	xdp_return(xdpf->data, &xdpf->mem);
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame);
 
 void xdp_return_buff(struct xdp_buff *xdp)
 {
-	xdp_return(xdp->data, &xdp->rxq->mem);
+	struct xdp_mem_info *mem = &xdp->rxq->mem;
+	struct xdp_mem_allocator *xa;
+	void *data = xdp->data;
+	struct page *page;
+
+	switch (mem->type) {
+	case MEM_TYPE_ZERO_COPY:
+		rcu_read_lock();
+		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
+		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
+		xa->zc_alloc->free(xa->zc_alloc, xdp->handle);
+		rcu_read_unlock();
+		break;
+	case MEM_TYPE_PAGE_POOL:
+		rcu_read_lock();
+		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
+		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
+		page = virt_to_head_page(data);
+		if (xa)
+			page_pool_put_page(xa->page_pool, page);
+		else
+			put_page(page);
+		rcu_read_unlock();
+		break;
+	case MEM_TYPE_PAGE_SHARED:
+		page_frag_free(data);
+		break;
+	case MEM_TYPE_PAGE_ORDER0:
+		page = virt_to_page(data); /* Assumes order0 page*/
+		put_page(page);
+		break;
+	default:
+		/* Not possible, checked in xdp_rxq_info_reg_mem_model() */
+		break;
+	}
 }
 EXPORT_SYMBOL_GPL(xdp_return_buff);
-- 
2.14.1

^ permalink raw reply related

* [RFC PATCH bpf-next 06/12] xsk: add zero-copy support for Rx
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan
In-Reply-To: <20180515190615.23099-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

Extend the xsk_rcv to support the new MEM_TYPE_ZERO_COPY memory, and
wireup ndo_bpf call in bind.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h |  7 +++++
 net/xdp/xdp_umem.c     | 60 +++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xdp_umem.h     |  3 +++
 net/xdp/xsk.c          | 69 ++++++++++++++++++++++++++++++++++++++++----------
 4 files changed, 125 insertions(+), 14 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 09068c4f068e..644684eb2caf 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -31,6 +31,7 @@ struct xdp_umem_props {
 
 struct xdp_umem_frame {
 	void *addr;
+	dma_addr_t dma;
 };
 
 struct xdp_umem {
@@ -47,6 +48,8 @@ struct xdp_umem {
 	size_t size;
 	atomic_t users;
 	struct work_struct work;
+	struct net_device *dev;
+	u16 queue_id;
 };
 
 struct xdp_sock {
@@ -69,6 +72,10 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
+
+u32 *xsk_umem_peek_id(struct xdp_umem *umem);
+void xsk_umem_discard_id(struct xdp_umem *umem);
+
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index b426cbe3151a..f70cdaa2ef4d 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -26,6 +26,64 @@
 
 #define XDP_UMEM_MIN_FRAME_SIZE 2048
 
+int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
+			u16 queue_id)
+{
+	struct netdev_bpf bpf;
+	int err;
+
+	if (umem->dev) {
+		if (dev != umem->dev || queue_id != umem->queue_id)
+			return -EBUSY;
+		return 0;
+	}
+
+	dev_hold(dev);
+	if (dev->netdev_ops->ndo_bpf) {
+		bpf.command = XDP_SETUP_XSK_UMEM;
+		bpf.xsk.umem = umem;
+		bpf.xsk.queue_id = queue_id;
+
+		rtnl_lock();
+		err = dev->netdev_ops->ndo_bpf(dev, &bpf);
+		rtnl_unlock();
+
+		if (err) {
+			dev_put(dev);
+			return 0;
+		}
+
+		umem->dev = dev;
+		umem->queue_id = queue_id;
+		return 0;
+	}
+
+	dev_put(dev);
+	return 0;
+}
+
+void xdp_umem_clear_dev(struct xdp_umem *umem)
+{
+	struct netdev_bpf bpf;
+	int err;
+
+	if (umem->dev) {
+		bpf.command = XDP_SETUP_XSK_UMEM;
+		bpf.xsk.umem = NULL;
+		bpf.xsk.queue_id = umem->queue_id;
+
+		rtnl_lock();
+		err = umem->dev->netdev_ops->ndo_bpf(umem->dev, &bpf);
+		rtnl_unlock();
+
+		if (err)
+			WARN(1, "failed to disable umem!\n");
+
+		dev_put(umem->dev);
+		umem->dev = NULL;
+	}
+}
+
 int xdp_umem_create(struct xdp_umem **umem)
 {
 	*umem = kzalloc(sizeof(**umem), GFP_KERNEL);
@@ -66,6 +124,8 @@ static void xdp_umem_release(struct xdp_umem *umem)
 	struct task_struct *task;
 	struct mm_struct *mm;
 
+	xdp_umem_clear_dev(umem);
+
 	if (umem->fq) {
 		xskq_destroy(umem->fq);
 		umem->fq = NULL;
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 0a969384af93..3bb96d156b40 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -34,4 +34,7 @@ void xdp_get_umem(struct xdp_umem *umem);
 void xdp_put_umem(struct xdp_umem *umem);
 int xdp_umem_create(struct xdp_umem **umem);
 
+int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
+			u16 queue_id);
+
 #endif /* XDP_UMEM_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index e59ca8e2618d..a0cf9c042ed2 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -43,6 +43,18 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+u32 *xsk_umem_peek_id(struct xdp_umem *umem)
+{
+	return xskq_peek_id(umem->fq);
+}
+EXPORT_SYMBOL(xsk_umem_peek_id);
+
+void xsk_umem_discard_id(struct xdp_umem *umem)
+{
+	xskq_discard_id(umem->fq);
+}
+EXPORT_SYMBOL(xsk_umem_discard_id);
+
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 {
 	return !!xs->rx;
@@ -50,40 +62,54 @@ bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	u32 *id, len = xdp->data_end - xdp->data;
+	u32 *id, len;
 	void *buffer;
 	int err = 0;
 
-	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
-		return -EINVAL;
-
 	id = xskq_peek_id(xs->umem->fq);
 	if (!id)
 		return -ENOSPC;
 
 	buffer = xdp_umem_get_data_with_headroom(xs->umem, *id);
+	len = xdp->data_end - xdp->data;
 	memcpy(buffer, xdp->data, len);
 	err = xskq_produce_batch_desc(xs->rx, *id, len,
 				      xs->umem->frame_headroom);
-	if (!err)
+	if (!err) {
 		xskq_discard_id(xs->umem->fq);
+		xdp_return_buff(xdp);
+		return 0;
+	}
 
+	xs->rx_dropped++;
 	return err;
 }
 
-int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
+	u16 off = xdp->data - xdp->data_hard_start;
+	u32 len = xdp->data_end - xdp->data;
 	int err;
 
-	err = __xsk_rcv(xs, xdp);
-	if (likely(!err))
+	err = xskq_produce_batch_desc(xs->rx, (u32)xdp->handle, len,
+				      xs->umem->frame_headroom + off);
+	if (err) {
 		xdp_return_buff(xdp);
-	else
 		xs->rx_dropped++;
+	}
 
 	return err;
 }
 
+int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
+		return -EINVAL;
+
+	return (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY) ?
+		__xsk_rcv_zc(xs, xdp) : __xsk_rcv(xs, xdp);
+}
+
 void xsk_flush(struct xdp_sock *xs)
 {
 	xskq_produce_flush_desc(xs->rx);
@@ -92,14 +118,26 @@ void xsk_flush(struct xdp_sock *xs)
 
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	int err;
+	u32 *id, len;
+	void *buffer;
+	int err = 0;
 
-	err = __xsk_rcv(xs, xdp);
-	if (!err)
+	id = xskq_peek_id(xs->umem->fq);
+	if (!id)
+		return -ENOSPC;
+
+	buffer = xdp_umem_get_data_with_headroom(xs->umem, *id);
+	len = xdp->data_end - xdp->data;
+	memcpy(buffer, xdp->data, len);
+	err = xskq_produce_batch_desc(xs->rx, *id, len,
+				      xs->umem->frame_headroom);
+	if (!err) {
+		xskq_discard_id(xs->umem->fq);
 		xsk_flush(xs);
-	else
-		xs->rx_dropped++;
+		return 0;
+	}
 
+	xs->rx_dropped++;
 	return err;
 }
 
@@ -362,6 +400,9 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 
 	xs->dev = dev;
 	xs->queue_id = sxdp->sxdp_queue_id;
+	err = xdp_umem_assign_dev(xs->umem, dev, xs->queue_id);
+	if (err)
+		goto out_unlock;
 
 	xskq_set_umem(xs->rx, &xs->umem->props);
 	xskq_set_umem(xs->tx, &xs->umem->props);
-- 
2.14.1

^ permalink raw reply related

* [RFC PATCH bpf-next 07/12] net: added netdevice operation for Tx
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	intel-wired-lan
In-Reply-To: <20180515190615.23099-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

Added ndo_xsk_async_xmit. This ndo "kicks" the netdev to start pull
userland Tx frames from a NAPI context.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/linux/netdevice.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2084536ad4af..8f4292dc6670 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1385,6 +1385,12 @@ struct net_device_ops {
 	int			(*ndo_xdp_xmit)(struct net_device *dev,
 						struct xdp_frame *xdp);
 	void			(*ndo_xdp_flush)(struct net_device *dev);
+	/* AF_XDP Tx function. NB! That in the PoC we take ownership
+	 * of the XDP Tx rings, so you wont be able to XDP_REDIRECT
+	 * there...
+	 */
+	int			(*ndo_xsk_async_xmit)(struct net_device *dev,
+						      u32 queue_id);
 };
 
 /**
-- 
2.14.1

^ permalink raw reply related

* [RFC PATCH bpf-next 08/12] xsk: wire upp Tx zero-copy functions
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	intel-wired-lan
In-Reply-To: <20180515190615.23099-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here we add the functionality required to support zero-copy Tx, and
also exposes various zero-copy related functions to for the netdevs.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/net/xdp_sock.h | 11 +++++++-
 net/xdp/xdp_umem.c     | 66 ++++++++++++++++++++++++++++++-----------------
 net/xdp/xdp_umem.h     |  9 +++++--
 net/xdp/xsk.c          | 69 ++++++++++++++++++++++++++++++++++++++++----------
 net/xdp/xsk_queue.h    | 32 ++++++++++++++++++++++-
 5 files changed, 146 insertions(+), 41 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 644684eb2caf..6d89fe84674e 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -18,6 +18,7 @@
 #include <linux/workqueue.h>
 #include <linux/if_xdp.h>
 #include <linux/mutex.h>
+#include <linux/spinlock.h>
 #include <linux/mm.h>
 #include <net/sock.h>
 
@@ -49,6 +50,9 @@ struct xdp_umem {
 	atomic_t users;
 	struct work_struct work;
 	struct net_device *dev;
+	bool zc;
+	spinlock_t xsk_list_lock;
+	struct list_head xsk_list;
 	u16 queue_id;
 };
 
@@ -61,6 +65,8 @@ struct xdp_sock {
 	struct list_head flush_node;
 	u16 queue_id;
 	struct xsk_queue *tx ____cacheline_aligned_in_smp;
+	struct list_head list;
+	bool zc;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 	u64 rx_dropped;
@@ -73,9 +79,12 @@ int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
 
+/* Used from netdev driver */
 u32 *xsk_umem_peek_id(struct xdp_umem *umem);
 void xsk_umem_discard_id(struct xdp_umem *umem);
-
+void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
+bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
+			 u32 *len, u16 *offset);
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index f70cdaa2ef4d..b904786ac836 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -27,42 +27,49 @@
 #define XDP_UMEM_MIN_FRAME_SIZE 2048
 
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
-			u16 queue_id)
+			u16 queue_id, struct list_head *list_entry)
 {
 	struct netdev_bpf bpf;
+	unsigned long flags;
 	int err;
 
 	if (umem->dev) {
 		if (dev != umem->dev || queue_id != umem->queue_id)
 			return -EBUSY;
-		return 0;
-	}
-
-	dev_hold(dev);
-	if (dev->netdev_ops->ndo_bpf) {
-		bpf.command = XDP_SETUP_XSK_UMEM;
-		bpf.xsk.umem = umem;
-		bpf.xsk.queue_id = queue_id;
-
-		rtnl_lock();
-		err = dev->netdev_ops->ndo_bpf(dev, &bpf);
-		rtnl_unlock();
-
-		if (err) {
+	} else {
+		dev_hold(dev);
+
+		if (dev->netdev_ops->ndo_bpf) {
+			bpf.command = XDP_SETUP_XSK_UMEM;
+			bpf.xsk.umem = umem;
+			bpf.xsk.queue_id = queue_id;
+
+			rtnl_lock();
+			err = dev->netdev_ops->ndo_bpf(dev, &bpf);
+			rtnl_unlock();
+
+			if (err) {
+				dev_put(dev);
+				goto fallback;
+			}
+
+			umem->dev = dev;
+			umem->queue_id = queue_id;
+			umem->zc = true;
+		} else {
 			dev_put(dev);
-			return 0;
 		}
-
-		umem->dev = dev;
-		umem->queue_id = queue_id;
-		return 0;
 	}
 
-	dev_put(dev);
+fallback:
+	spin_lock_irqsave(&umem->xsk_list_lock, flags);
+	list_add_rcu(list_entry, &umem->xsk_list);
+	spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
+
 	return 0;
 }
 
-void xdp_umem_clear_dev(struct xdp_umem *umem)
+static void xdp_umem_clear_dev(struct xdp_umem *umem)
 {
 	struct netdev_bpf bpf;
 	int err;
@@ -172,11 +179,22 @@ void xdp_get_umem(struct xdp_umem *umem)
 	atomic_inc(&umem->users);
 }
 
-void xdp_put_umem(struct xdp_umem *umem)
+void xdp_put_umem(struct xdp_umem *umem, struct xdp_sock *xs)
 {
+	unsigned long flags;
+
 	if (!umem)
 		return;
 
+	if (xs->dev) {
+		spin_lock_irqsave(&umem->xsk_list_lock, flags);
+		list_del_rcu(&xs->list);
+		spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
+
+		if (umem->zc)
+			synchronize_net();
+	}
+
 	if (atomic_dec_and_test(&umem->users)) {
 		INIT_WORK(&umem->work, xdp_umem_release_deferred);
 		schedule_work(&umem->work);
@@ -297,6 +315,8 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	umem->npgs = size / PAGE_SIZE;
 	umem->pgs = NULL;
 	umem->user = NULL;
+	INIT_LIST_HEAD(&umem->xsk_list);
+	spin_lock_init(&umem->xsk_list_lock);
 
 	atomic_set(&umem->users, 1);
 
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 3bb96d156b40..5687748a9be3 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -22,6 +22,11 @@ static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
 	return umem->frames[idx].addr;
 }
 
+static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u32 idx)
+{
+	return umem->frames[idx].dma;
+}
+
 static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
 						    u32 idx)
 {
@@ -31,10 +36,10 @@ static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
 int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
 void xdp_get_umem(struct xdp_umem *umem);
-void xdp_put_umem(struct xdp_umem *umem);
+void xdp_put_umem(struct xdp_umem *umem, struct xdp_sock *xs);
 int xdp_umem_create(struct xdp_umem **umem);
 
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
-			u16 queue_id);
+			u16 queue_id, struct list_head *list_entry);
 
 #endif /* XDP_UMEM_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a0cf9c042ed2..ac979026671f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -30,6 +30,7 @@
 #include <linux/uaccess.h>
 #include <linux/net.h>
 #include <linux/netdevice.h>
+#include <linux/rculist.h>
 #include <net/xdp_sock.h>
 #include <net/xdp.h>
 
@@ -141,6 +142,49 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	return err;
 }
 
+void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
+{
+	xskq_produce_flush_id_n(umem->cq, nb_entries);
+}
+EXPORT_SYMBOL(xsk_umem_complete_tx);
+
+bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
+			 u32 *len, u16 *offset)
+{
+	struct xdp_desc desc;
+	struct xdp_sock *xs;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
+		if (!xskq_peek_desc(xs->tx, &desc))
+			continue;
+
+		if (xskq_produce_id_lazy(umem->cq, desc.idx))
+			goto out;
+
+		*dma = xdp_umem_get_dma(umem, desc.idx);
+		*len = desc.len;
+		*offset = desc.offset;
+
+		xskq_discard_desc(xs->tx);
+		rcu_read_unlock();
+		return true;
+	}
+
+out:
+	rcu_read_unlock();
+	return false;
+}
+EXPORT_SYMBOL(xsk_umem_consume_tx);
+
+static int xsk_zc_xmit(struct sock *sk)
+{
+	struct xdp_sock *xs = xdp_sk(sk);
+	struct net_device *dev = xs->dev;
+
+	return dev->netdev_ops->ndo_xsk_async_xmit(dev, xs->queue_id);
+}
+
 static void xsk_destruct_skb(struct sk_buff *skb)
 {
 	u32 id = (u32)(long)skb_shinfo(skb)->destructor_arg;
@@ -154,7 +198,6 @@ static void xsk_destruct_skb(struct sk_buff *skb)
 static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 			    size_t total_len)
 {
-	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
 	u32 max_batch = TX_BATCH_SIZE;
 	struct xdp_sock *xs = xdp_sk(sk);
 	bool sent_frame = false;
@@ -164,8 +207,6 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 
 	if (unlikely(!xs->tx))
 		return -ENOBUFS;
-	if (need_wait)
-		return -EOPNOTSUPP;
 
 	mutex_lock(&xs->mutex);
 
@@ -184,12 +225,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 		}
 
 		len = desc.len;
-		if (unlikely(len > xs->dev->mtu)) {
-			err = -EMSGSIZE;
-			goto out;
-		}
-
-		skb = sock_alloc_send_skb(sk, len, !need_wait, &err);
+		skb = sock_alloc_send_skb(sk, len, 1, &err);
 		if (unlikely(!skb)) {
 			err = -EAGAIN;
 			goto out;
@@ -232,6 +268,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 
 static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 {
+	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
 	struct sock *sk = sock->sk;
 	struct xdp_sock *xs = xdp_sk(sk);
 
@@ -239,8 +276,10 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 		return -ENXIO;
 	if (unlikely(!(xs->dev->flags & IFF_UP)))
 		return -ENETDOWN;
+	if (need_wait)
+		return -EOPNOTSUPP;
 
-	return xsk_generic_xmit(sk, m, total_len);
+	return (xs->zc) ? xsk_zc_xmit(sk) : xsk_generic_xmit(sk, m, total_len);
 }
 
 static unsigned int xsk_poll(struct file *file, struct socket *sock,
@@ -398,12 +437,14 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		xskq_set_umem(xs->umem->cq, &xs->umem->props);
 	}
 
-	xs->dev = dev;
-	xs->queue_id = sxdp->sxdp_queue_id;
-	err = xdp_umem_assign_dev(xs->umem, dev, xs->queue_id);
+	err = xdp_umem_assign_dev(xs->umem, dev, sxdp->sxdp_queue_id,
+				  &xs->list);
 	if (err)
 		goto out_unlock;
 
+	xs->dev = dev;
+	xs->zc = xs->umem->zc;
+	xs->queue_id = sxdp->sxdp_queue_id;
 	xskq_set_umem(xs->rx, &xs->umem->props);
 	xskq_set_umem(xs->tx, &xs->umem->props);
 
@@ -612,7 +653,7 @@ static void xsk_destruct(struct sock *sk)
 
 	xskq_destroy(xs->rx);
 	xskq_destroy(xs->tx);
-	xdp_put_umem(xs->umem);
+	xdp_put_umem(xs->umem, xs);
 
 	sk_refcnt_debug_dec(sk);
 }
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 599a8d43c69a..5533bf32a254 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -17,9 +17,11 @@
 
 #include <linux/types.h>
 #include <linux/if_xdp.h>
+#include <linux/cache.h>
 #include <net/xdp_sock.h>
 
 #define RX_BATCH_SIZE 16
+#define LAZY_UPDATE_THRESHOLD 128
 
 struct xsk_queue {
 	struct xdp_umem_props umem_props;
@@ -53,9 +55,14 @@ static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
 	return (entries > dcnt) ? dcnt : entries;
 }
 
+static inline u32 xskq_nb_free_lazy(struct xsk_queue *q, u32 producer)
+{
+	return q->nentries - (producer - q->cons_tail);
+}
+
 static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
 {
-	u32 free_entries = q->nentries - (producer - q->cons_tail);
+	u32 free_entries = xskq_nb_free_lazy(q, producer);
 
 	if (free_entries >= dcnt)
 		return free_entries;
@@ -119,6 +126,9 @@ static inline int xskq_produce_id(struct xsk_queue *q, u32 id)
 {
 	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
 
+	if (xskq_nb_free(q, q->prod_tail, LAZY_UPDATE_THRESHOLD) == 0)
+		return -ENOSPC;
+
 	ring->desc[q->prod_tail++ & q->ring_mask] = id;
 
 	/* Order producer and data */
@@ -128,6 +138,26 @@ static inline int xskq_produce_id(struct xsk_queue *q, u32 id)
 	return 0;
 }
 
+static inline int xskq_produce_id_lazy(struct xsk_queue *q, u32 id)
+{
+	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+
+	if (xskq_nb_free(q, q->prod_head, LAZY_UPDATE_THRESHOLD) == 0)
+		return -ENOSPC;
+
+	ring->desc[q->prod_head++ & q->ring_mask] = id;
+	return 0;
+}
+
+static inline void xskq_produce_flush_id_n(struct xsk_queue *q, u32 nb_entries)
+{
+	/* Order producer and data */
+	smp_wmb();
+
+	q->prod_tail += nb_entries;
+	WRITE_ONCE(q->ring->producer, q->prod_tail);
+}
+
 static inline int xskq_reserve_id(struct xsk_queue *q)
 {
 	if (xskq_nb_free(q, q->prod_head, 1) == 0)
-- 
2.14.1

^ permalink raw reply related

* [RFC PATCH bpf-next 10/12] i40e: added queue pair disable/enable functions
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan
In-Reply-To: <20180515190615.23099-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

Queue pair enable/disable plumbing.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 251 ++++++++++++++++++++++++++++
 1 file changed, 251 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index c8659fbd7111..b4c23cf3979c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11799,6 +11799,257 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
 	return 0;
 }
 
+/**
+ * i40e_enter_busy_conf - Enters busy config state
+ * @vsi: vsi
+ *
+ * Returns 0 on success, <0 for failure.
+ **/
+static int i40e_enter_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+	int timeout = 50;
+
+	while (test_and_set_bit(__I40E_CONFIG_BUSY, pf->state)) {
+		timeout--;
+		if (!timeout)
+			return -EBUSY;
+		usleep_range(1000, 2000);
+	}
+
+	return 0;
+}
+
+/**
+ * i40e_exit_busy_conf - Exits busy config state
+ * @vsi: vsi
+ **/
+static void i40e_exit_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+
+	clear_bit(__I40E_CONFIG_BUSY, pf->state);
+}
+
+/**
+ * i40e_queue_pair_reset_stats - Resets all statistics for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_reset_stats(struct i40e_vsi *vsi, int queue_pair)
+{
+	memset(&vsi->rx_rings[queue_pair]->rx_stats, 0,
+	       sizeof(vsi->rx_rings[queue_pair]->rx_stats));
+	memset(&vsi->tx_rings[queue_pair]->stats, 0,
+	       sizeof(vsi->tx_rings[queue_pair]->stats));
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		memset(&vsi->xdp_rings[queue_pair]->stats, 0,
+		       sizeof(vsi->xdp_rings[queue_pair]->stats));
+	}
+}
+
+/**
+ * i40e_queue_pair_clean_rings - Cleans all the rings of a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_clean_rings(struct i40e_vsi *vsi, int queue_pair)
+{
+	i40e_clean_tx_ring(vsi->tx_rings[queue_pair]);
+	if (i40e_enabled_xdp_vsi(vsi))
+		i40e_clean_tx_ring(vsi->xdp_rings[queue_pair]);
+	i40e_clean_rx_ring(vsi->rx_rings[queue_pair]);
+}
+
+/**
+ * i40e_queue_pair_control_napi - Enables/disables NAPI for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ **/
+static void i40e_queue_pair_control_napi(struct i40e_vsi *vsi, int queue_pair,
+					 bool enable)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_q_vector *q_vector = rxr->q_vector;
+
+	if (!vsi->netdev)
+		return;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (q_vector->rx.ring || q_vector->tx.ring) {
+		if (enable)
+			napi_enable(&q_vector->napi);
+		else
+			napi_disable(&q_vector->napi);
+	}
+}
+
+/**
+ * i40e_queue_pair_control_rings - Enables/disables all rings for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_control_rings(struct i40e_vsi *vsi, int queue_pair,
+					 bool enable)
+{
+	struct i40e_pf *pf = vsi->back;
+	int pf_q, ret = 0;
+
+	pf_q = vsi->base_queue + queue_pair;
+	ret = i40e_control_wait_tx_q(vsi->seid, pf, pf_q,
+				     false /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	i40e_control_rx_q(pf, pf_q, enable);
+	ret = i40e_pf_rxq_wait(pf, pf_q, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Rx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	/* Due to HW errata, on Rx disable only, the register can
+ * indicate done before it really is. Needs 50ms to be sure
+ */
+	if (!enable)
+		mdelay(50);
+
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return ret;
+
+	ret = i40e_control_wait_tx_q(vsi->seid, pf,
+				     pf_q + vsi->alloc_queue_pairs,
+				     true /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d XDP Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+	}
+
+	return ret;
+}
+
+/**
+ * i40e_queue_pair_enable_irq - Enables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_queue_pair_enable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED)
+		i40e_irq_dynamic_enable(vsi, rxr->q_vector->v_idx);
+	else
+		i40e_irq_dynamic_enable_icr0(pf);
+
+	i40e_flush(hw);
+}
+
+/**
+ * i40e_queue_pair_disable_irq - Disables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_queue_pair_disable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* For simplicity, instead of removing the qp interrupt causes
+ * from the interrupt linked list, we simply disable the interrupt, and
+ * leave the list intact.
+ *
+ * All rings in a qp belong to the same qvector.
+ */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED) {
+		u32 intpf = vsi->base_vector + rxr->q_vector->v_idx;
+
+		wr32(hw, I40E_PFINT_DYN_CTLN(intpf - 1), 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->msix_entries[intpf].vector);
+	} else {
+		/* Legacy and MSI mode - this stops all interrupt handling */
+		wr32(hw, I40E_PFINT_ICR0_ENA, 0);
+		wr32(hw, I40E_PFINT_DYN_CTL0, 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->pdev->irq);
+	}
+}
+
+/**
+ * i40e_queue_pair_disable - Disables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_enter_busy_conf(vsi);
+	if (err)
+		return err;
+
+	i40e_queue_pair_disable_irq(vsi, queue_pair);
+	err = i40e_queue_pair_control_rings(vsi, queue_pair,
+					    false /* disable */);
+	i40e_queue_pair_control_napi(vsi, queue_pair, false /* disable */);
+	i40e_queue_pair_clean_rings(vsi, queue_pair);
+	i40e_queue_pair_reset_stats(vsi, queue_pair);
+
+	return err;
+}
+
+/**
+ * i40e_queue_pair_enable - Enables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_configure_tx_ring(vsi->tx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		err = i40e_configure_tx_ring(vsi->xdp_rings[queue_pair]);
+		if (err)
+			return err;
+	}
+
+	err = i40e_configure_rx_ring(vsi->rx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	err = i40e_queue_pair_control_rings(vsi, queue_pair, true /* enable */);
+	i40e_queue_pair_control_napi(vsi, queue_pair, true /* enable */);
+	i40e_queue_pair_enable_irq(vsi, queue_pair);
+
+	i40e_exit_busy_conf(vsi);
+
+	return err;
+}
+
 /**
  * i40e_xdp - implements ndo_bpf for i40e
  * @dev: netdevice
-- 
2.14.1

^ permalink raw reply related

* [RFC PATCH bpf-next 11/12] i40e: implement AF_XDP zero-copy support for Rx
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang, intel-wired-lan
In-Reply-To: <20180515190615.23099-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

A lot of things here. First we add support for the new
XDP_SETUP_XSK_UMEM command in ndo_bpf. This allows the AF_XDP socket
to pass a UMEM to the driver. The driver will then DMA map all the
frames in the UMEM for the driver. Next, the Rx code will allocate
frames from the UMEM fill queue, instead of the regular page
allocator.

Externally, for the rest of the XDP code, the driver the driver
internal UMEM allocator will appear as a MEM_TYPE_ZERO_COPY.

Keep in mind that having frames coming from userland requires some
extra care taken when passing them to the regular kernel stack. In
these cases the ZC frame must be copied.

The commit also introduces a completely new clean_rx_irq/allocator
functions for zero-copy, and means (functions pointers) to set
allocators and clean_rx functions.

Finally, a lot of this are *not* implemented here. To mention some:

* No passing to the stack via XDP_PASS (clone/copy to skb).
* No XDP redirect to other than sockets (convert_to_xdp_frame does not
  clone the frame yet).

And yes, too much C&P and too big commit. :-)

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e.h      |  20 ++
 drivers/net/ethernet/intel/i40e/i40e_main.c | 202 +++++++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 400 ++++++++++++++++++++++++++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  30 ++-
 4 files changed, 619 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 7a80652e2500..e6ee6c9bf094 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -786,6 +786,12 @@ struct i40e_vsi {
 
 	/* VSI specific handlers */
 	irqreturn_t (*irq_handler)(int irq, void *data);
+
+	/* AF_XDP zero-copy */
+	struct xdp_umem **xsk_umems;
+	u16 num_xsk_umems_used;
+	u16 num_xsk_umems;
+
 } ____cacheline_internodealigned_in_smp;
 
 struct i40e_netdev_priv {
@@ -1090,6 +1096,20 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
 	return !!vsi->xdp_prog;
 }
 
+static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
+{
+	bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
+	int qid = ring->queue_index;
+
+	if (ring_is_xdp(ring))
+		qid -= ring->vsi->alloc_queue_pairs;
+
+	if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
+		return NULL;
+
+	return ring->vsi->xsk_umems[qid];
+}
+
 int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
 int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
 int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b4c23cf3979c..dc3d668a741e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5,6 +5,7 @@
 #include <linux/of_net.h>
 #include <linux/pci.h>
 #include <linux/bpf.h>
+#include <net/xdp_sock.h>
 
 /* Local includes */
 #include "i40e.h"
@@ -3054,6 +3055,9 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 	i40e_status err = 0;
 	u32 qtx_ctl = 0;
 
+	if (ring_is_xdp(ring))
+		ring->xsk_umem = i40e_xsk_umem(ring);
+
 	/* some ATR related tx ring init */
 	if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
 		ring->atr_sample_rate = vsi->back->atr_sample_rate;
@@ -3163,13 +3167,31 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	struct i40e_hw *hw = &vsi->back->hw;
 	struct i40e_hmc_obj_rxq rx_ctx;
 	i40e_status err = 0;
+	int ret;
 
 	bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
 
 	/* clear the context structure first */
 	memset(&rx_ctx, 0, sizeof(rx_ctx));
 
-	ring->rx_buf_len = vsi->rx_buf_len;
+	ring->xsk_umem = i40e_xsk_umem(ring);
+	if (ring->xsk_umem) {
+		ring->clean_rx_irq = i40e_clean_rx_irq_zc;
+		ring->alloc_rx_buffers = i40e_alloc_rx_buffers_zc;
+		ring->rx_buf_len = ring->xsk_umem->props.frame_size -
+				   ring->xsk_umem->frame_headroom -
+				   XDP_PACKET_HEADROOM;
+		ring->zca.free = i40e_zca_free;
+		ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+						 MEM_TYPE_ZERO_COPY,
+						 &ring->zca);
+		if (ret)
+			return ret;
+	} else {
+		ring->clean_rx_irq = i40e_clean_rx_irq;
+		ring->alloc_rx_buffers = i40e_alloc_rx_buffers;
+		ring->rx_buf_len = vsi->rx_buf_len;
+	}
 
 	rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
 				    BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
@@ -3225,7 +3247,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
 	writel(0, ring->tail);
 
-	i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
+	ring->alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
 
 	return 0;
 }
@@ -12050,6 +12072,179 @@ static int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair)
 	return err;
 }
 
+static int i40e_alloc_xsk_umems(struct i40e_vsi *vsi)
+{
+	if (vsi->xsk_umems)
+		return 0;
+
+	vsi->num_xsk_umems_used = 0;
+	vsi->num_xsk_umems = vsi->alloc_queue_pairs;
+	vsi->xsk_umems = kcalloc(vsi->num_xsk_umems, sizeof(*vsi->xsk_umems),
+				 GFP_KERNEL);
+	if (!vsi->xsk_umems) {
+		vsi->num_xsk_umems = 0;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int i40e_add_xsk_umem(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			     u16 qid)
+{
+	int err;
+
+	err = i40e_alloc_xsk_umems(vsi);
+	if (err)
+		return err;
+
+	vsi->xsk_umems[qid] = umem;
+	vsi->num_xsk_umems_used++;
+
+	return 0;
+}
+
+static void i40e_remove_xsk_umem(struct i40e_vsi *vsi, u16 qid)
+{
+	vsi->xsk_umems[qid] = NULL;
+	vsi->num_xsk_umems_used--;
+
+	if (vsi->num_xsk_umems == 0) {
+		kfree(vsi->xsk_umems);
+		vsi->xsk_umems = NULL;
+		vsi->num_xsk_umems = 0;
+	}
+}
+
+static int i40e_xsk_umem_dma_map(struct i40e_vsi *vsi, struct xdp_umem *umem)
+{
+	struct i40e_pf *pf = vsi->back;
+	struct device *dev;
+	unsigned int i, j;
+	dma_addr_t dma;
+
+	dev = &pf->pdev->dev;
+
+	for (i = 0; i < umem->props.nframes; i++) {
+		dma = dma_map_single_attrs(dev, umem->frames[i].addr,
+					   umem->props.frame_size,
+					   DMA_BIDIRECTIONAL, I40E_RX_DMA_ATTR);
+		if (dma_mapping_error(dev, dma))
+			goto out_unmap;
+
+		umem->frames[i].dma = dma;
+	}
+
+	return 0;
+
+out_unmap:
+	for (j = 0; j < i; j++) {
+		dma_unmap_single_attrs(dev, umem->frames[i].dma,
+				       umem->props.frame_size,
+				       DMA_BIDIRECTIONAL,
+				       I40E_RX_DMA_ATTR);
+		umem->frames[i].dma = 0;
+	}
+
+	return -1;
+}
+
+static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, struct xdp_umem *umem)
+{
+	struct i40e_pf *pf = vsi->back;
+	struct device *dev;
+	unsigned int i;
+
+	dev = &pf->pdev->dev;
+
+	for (i = 0; i < umem->props.nframes; i++) {
+		dma_unmap_single_attrs(dev, umem->frames[i].dma,
+				       umem->props.frame_size,
+				       DMA_BIDIRECTIONAL,
+				       I40E_RX_DMA_ATTR);
+
+		umem->frames[i].dma = 0;
+	}
+}
+
+static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
+				u16 qid)
+{
+	bool if_running;
+	int err;
+
+	if (vsi->type != I40E_VSI_MAIN)
+		return -EINVAL;
+
+	if (qid >= vsi->num_queue_pairs)
+		return -EINVAL;
+
+	if (vsi->xsk_umems && vsi->xsk_umems[qid])
+		return -EBUSY;
+
+	err = i40e_xsk_umem_dma_map(vsi, umem);
+	if (err)
+		return err;
+
+	if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
+
+	if (if_running) {
+		err = i40e_queue_pair_disable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	err = i40e_add_xsk_umem(vsi, umem, qid);
+	if (err)
+		return err;
+
+	if (if_running) {
+		err = i40e_queue_pair_enable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int i40e_xsk_umem_disable(struct i40e_vsi *vsi, u16 qid)
+{
+	bool if_running;
+	int err;
+
+	if (!vsi->xsk_umems || qid >= vsi->num_xsk_umems ||
+	    !vsi->xsk_umems[qid])
+		return -EINVAL;
+
+	if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi);
+
+	if (if_running) {
+		err = i40e_queue_pair_disable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	i40e_xsk_umem_dma_unmap(vsi, vsi->xsk_umems[qid]);
+	i40e_remove_xsk_umem(vsi, qid);
+
+	if (if_running) {
+		err = i40e_queue_pair_enable(vsi, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
+			       u16 qid)
+{
+	if (umem)
+		return i40e_xsk_umem_enable(vsi, umem, qid);
+
+	return i40e_xsk_umem_disable(vsi, qid);
+}
+
 /**
  * i40e_xdp - implements ndo_bpf for i40e
  * @dev: netdevice
@@ -12071,6 +12266,9 @@ static int i40e_xdp(struct net_device *dev,
 		xdp->prog_attached = i40e_enabled_xdp_vsi(vsi);
 		xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0;
 		return 0;
+	case XDP_SETUP_XSK_UMEM:
+		return i40e_xsk_umem_setup(vsi, xdp->xsk.umem,
+					   xdp->xsk.queue_id);
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 5efa68de935b..f89ac524652c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -5,6 +5,7 @@
 #include <net/busy_poll.h>
 #include <linux/bpf_trace.h>
 #include <net/xdp.h>
+#include <net/xdp_sock.h>
 #include "i40e.h"
 #include "i40e_trace.h"
 #include "i40e_prototype.h"
@@ -1373,31 +1374,35 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 	}
 
 	/* Free all the Rx ring sk_buffs */
-	for (i = 0; i < rx_ring->count; i++) {
-		struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
+	if (!rx_ring->xsk_umem) {
+		for (i = 0; i < rx_ring->count; i++) {
+			struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
 
-		if (!rx_bi->page)
-			continue;
-
-		/* Invalidate cache lines that may have been written to by
-		 * device so that we avoid corrupting memory.
-		 */
-		dma_sync_single_range_for_cpu(rx_ring->dev,
-					      rx_bi->dma,
-					      rx_bi->page_offset,
-					      rx_ring->rx_buf_len,
-					      DMA_FROM_DEVICE);
-
-		/* free resources associated with mapping */
-		dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
-				     i40e_rx_pg_size(rx_ring),
-				     DMA_FROM_DEVICE,
-				     I40E_RX_DMA_ATTR);
-
-		__page_frag_cache_drain(rx_bi->page, rx_bi->pagecnt_bias);
+			if (!rx_bi->page)
+				continue;
 
-		rx_bi->page = NULL;
-		rx_bi->page_offset = 0;
+			/* Invalidate cache lines that may have been
+			 * written to by device so that we avoid
+			 * corrupting memory.
+			 */
+			dma_sync_single_range_for_cpu(rx_ring->dev,
+						      rx_bi->dma,
+						      rx_bi->page_offset,
+						      rx_ring->rx_buf_len,
+						      DMA_FROM_DEVICE);
+
+			/* free resources associated with mapping */
+			dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
+					     i40e_rx_pg_size(rx_ring),
+					     DMA_FROM_DEVICE,
+					     I40E_RX_DMA_ATTR);
+
+			__page_frag_cache_drain(rx_bi->page,
+						rx_bi->pagecnt_bias);
+
+			rx_bi->page = NULL;
+			rx_bi->page_offset = 0;
+		}
 	}
 
 	bi_size = sizeof(struct i40e_rx_buffer) * rx_ring->count;
@@ -2214,8 +2219,6 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 	if (!xdp_prog)
 		goto xdp_out;
 
-	prefetchw(xdp->data_hard_start); /* xdp_frame write */
-
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
@@ -2284,7 +2287,7 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
  *
  * Returns amount of work completed
  **/
-static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
+int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 {
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 	struct sk_buff *skb = rx_ring->skb;
@@ -2426,6 +2429,349 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 	return failure ? budget : (int)total_rx_packets;
 }
 
+static struct sk_buff *i40e_run_xdp_zc(struct i40e_ring *rx_ring,
+				       struct xdp_buff *xdp)
+{
+	int err, result = I40E_XDP_PASS;
+	struct i40e_ring *xdp_ring;
+	struct bpf_prog *xdp_prog;
+	u32 act;
+
+	rcu_read_lock();
+	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
+
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	switch (act) {
+	case XDP_PASS:
+		break;
+	case XDP_TX:
+		xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
+		result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
+		break;
+	case XDP_REDIRECT:
+		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+		result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+	case XDP_ABORTED:
+		trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
+		/* fallthrough -- handle aborts by dropping packet */
+	case XDP_DROP:
+		result = I40E_XDP_CONSUMED;
+		break;
+	}
+
+	rcu_read_unlock();
+	return ERR_PTR(-result);
+}
+
+static bool i40e_alloc_frame_zc(struct i40e_ring *rx_ring,
+				struct i40e_rx_buffer *bi)
+{
+	struct xdp_umem *umem = rx_ring->xsk_umem;
+	void *addr = bi->addr;
+	u32 *id;
+
+	if (addr) {
+		rx_ring->rx_stats.page_reuse_count++;
+		return true;
+	}
+
+	id = xsk_umem_peek_id(umem);
+	if (unlikely(!id)) {
+		rx_ring->rx_stats.alloc_page_failed++;
+		return false;
+	}
+
+	bi->dma = umem->frames[*id].dma + umem->frame_headroom +
+		  XDP_PACKET_HEADROOM;
+	bi->addr = umem->frames[*id].addr + umem->frame_headroom +
+		  XDP_PACKET_HEADROOM;
+	bi->id = *id;
+
+	xsk_umem_discard_id(umem);
+	return true;
+}
+
+bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count)
+{
+	u16 ntu = rx_ring->next_to_use;
+	union i40e_rx_desc *rx_desc;
+	struct i40e_rx_buffer *bi;
+
+	rx_desc = I40E_RX_DESC(rx_ring, ntu);
+	bi = &rx_ring->rx_bi[ntu];
+
+	do {
+		if (!i40e_alloc_frame_zc(rx_ring, bi))
+			goto no_buffers;
+
+		/* sync the buffer for use by the device */
+		dma_sync_single_range_for_device(rx_ring->dev, bi->dma, 0,
+						 rx_ring->rx_buf_len,
+						 DMA_BIDIRECTIONAL);
+
+		/* Refresh the desc even if buffer_addrs didn't change
+		 * because each write-back erases this info.
+		 */
+		rx_desc->read.pkt_addr = cpu_to_le64(bi->dma);
+
+		rx_desc++;
+		bi++;
+		ntu++;
+		if (unlikely(ntu == rx_ring->count)) {
+			rx_desc = I40E_RX_DESC(rx_ring, 0);
+			bi = rx_ring->rx_bi;
+			ntu = 0;
+		}
+
+		/* clear the status bits for the next_to_use descriptor */
+		rx_desc->wb.qword1.status_error_len = 0;
+
+		cleaned_count--;
+	} while (cleaned_count);
+
+	if (rx_ring->next_to_use != ntu)
+		i40e_release_rx_desc(rx_ring, ntu);
+
+	return false;
+
+no_buffers:
+	if (rx_ring->next_to_use != ntu)
+		i40e_release_rx_desc(rx_ring, ntu);
+
+	/* make sure to come back via polling to try again after
+	 * allocation failure
+	 */
+	return true;
+}
+
+static struct i40e_rx_buffer *i40e_get_rx_buffer_zc(struct i40e_ring *rx_ring,
+						    const unsigned int size)
+{
+	struct i40e_rx_buffer *rx_buffer;
+
+	rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
+
+	/* we are reusing so sync this buffer for CPU use */
+	dma_sync_single_range_for_cpu(rx_ring->dev,
+				      rx_buffer->dma, 0,
+				      size,
+				      DMA_BIDIRECTIONAL);
+
+	return rx_buffer;
+}
+
+static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
+				    struct i40e_rx_buffer *old_buff)
+{
+	struct i40e_rx_buffer *new_buff;
+	u16 nta = rx_ring->next_to_alloc;
+
+	new_buff = &rx_ring->rx_bi[nta];
+
+	/* update, and store next to alloc */
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	/* transfer page from old buffer to new buffer */
+	new_buff->dma  = old_buff->dma;
+	new_buff->addr = old_buff->addr;
+	new_buff->id   = old_buff->id;
+}
+
+/* Called from the XDP return API in NAPI context. */
+void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
+{
+	struct i40e_rx_buffer *new_buff;
+	struct i40e_ring *rx_ring;
+	u16 nta;
+
+	rx_ring = container_of(alloc, struct i40e_ring, zca);
+	nta = rx_ring->next_to_alloc;
+
+	new_buff = &rx_ring->rx_bi[nta];
+
+	/* update, and store next to alloc */
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	new_buff->dma  = rx_ring->xsk_umem->frames[handle].dma;
+	new_buff->addr = rx_ring->xsk_umem->frames[handle].addr;
+	new_buff->id   = (u32)handle;
+}
+
+static struct sk_buff *i40e_zc_frame_to_skb(struct i40e_ring *rx_ring,
+					    struct i40e_rx_buffer *rx_buffer,
+					    struct xdp_buff *xdp)
+{
+	// XXX implement alloc skb and copy
+	i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+	return NULL;
+}
+
+static void i40e_clean_programming_status_zc(struct i40e_ring *rx_ring,
+					     union i40e_rx_desc *rx_desc,
+					     u64 qw)
+{
+	struct i40e_rx_buffer *rx_buffer;
+	u32 ntc = rx_ring->next_to_clean;
+	u8 id;
+
+	/* fetch, update, and store next to clean */
+	rx_buffer = &rx_ring->rx_bi[ntc++];
+	ntc = (ntc < rx_ring->count) ? ntc : 0;
+	rx_ring->next_to_clean = ntc;
+
+	prefetch(I40E_RX_DESC(rx_ring, ntc));
+
+	/* place unused page back on the ring */
+	i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+	rx_ring->rx_stats.page_reuse_count++;
+
+	/* clear contents of buffer_info */
+	rx_buffer->addr = NULL;
+
+	id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
+		  I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
+
+	if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
+		i40e_fd_handle_status(rx_ring, rx_desc, id);
+}
+
+int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
+{
+	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
+	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
+	bool failure = false, xdp_xmit = false;
+	struct sk_buff *skb;
+	struct xdp_buff xdp;
+
+	xdp.rxq = &rx_ring->xdp_rxq;
+
+	while (likely(total_rx_packets < (unsigned int)budget)) {
+		struct i40e_rx_buffer *rx_buffer;
+		union i40e_rx_desc *rx_desc;
+		unsigned int size;
+		u16 vlan_tag;
+		u8 rx_ptype;
+		u64 qword;
+		u32 ntc;
+
+		/* return some buffers to hardware, one at a time is too slow */
+		if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
+			failure = failure ||
+				  i40e_alloc_rx_buffers_zc(rx_ring,
+							   cleaned_count);
+			cleaned_count = 0;
+		}
+
+		rx_desc = I40E_RX_DESC(rx_ring, rx_ring->next_to_clean);
+
+		/* status_error_len will always be zero for unused descriptors
+		 * because it's cleared in cleanup, and overlaps with hdr_addr
+		 * which is always zero because packet split isn't used, if the
+		 * hardware wrote DD then the length will be non-zero
+		 */
+		qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+
+		/* This memory barrier is needed to keep us from reading
+		 * any other fields out of the rx_desc until we have
+		 * verified the descriptor has been written back.
+		 */
+		dma_rmb();
+
+		if (unlikely(i40e_rx_is_programming_status(qword))) {
+			i40e_clean_programming_status_zc(rx_ring, rx_desc,
+							 qword);
+			cleaned_count++;
+			continue;
+		}
+		size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
+		       I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
+		if (!size)
+			break;
+
+		rx_buffer = i40e_get_rx_buffer_zc(rx_ring, size);
+
+		/* retrieve a buffer from the ring */
+		xdp.data = rx_buffer->addr;
+		xdp_set_data_meta_invalid(&xdp);
+		xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
+		xdp.data_end = xdp.data + size;
+		xdp.handle = rx_buffer->id;
+
+		skb = i40e_run_xdp_zc(rx_ring, &xdp);
+
+		if (IS_ERR(skb)) {
+			if (PTR_ERR(skb) == -I40E_XDP_TX)
+				xdp_xmit = true;
+			else
+				i40e_reuse_rx_buffer_zc(rx_ring, rx_buffer);
+			total_rx_bytes += size;
+			total_rx_packets++;
+		} else {
+			skb = i40e_zc_frame_to_skb(rx_ring, rx_buffer, &xdp);
+			if (!skb) {
+				rx_ring->rx_stats.alloc_buff_failed++;
+				break;
+			}
+		}
+
+		rx_buffer->addr = NULL;
+		cleaned_count++;
+
+		/* don't care about non-EOP frames in XDP mode */
+		ntc = rx_ring->next_to_clean + 1;
+		ntc = (ntc < rx_ring->count) ? ntc : 0;
+		rx_ring->next_to_clean = ntc;
+		prefetch(I40E_RX_DESC(rx_ring, ntc));
+
+		if (i40e_cleanup_headers(rx_ring, skb, rx_desc)) {
+			skb = NULL;
+			continue;
+		}
+
+		/* probably a little skewed due to removing CRC */
+		total_rx_bytes += skb->len;
+
+		qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+		rx_ptype = (qword & I40E_RXD_QW1_PTYPE_MASK) >>
+			   I40E_RXD_QW1_PTYPE_SHIFT;
+
+		/* populate checksum, VLAN, and protocol */
+		i40e_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
+
+		vlan_tag = (qword & BIT(I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) ?
+			   le16_to_cpu(rx_desc->wb.qword0.lo_dword.l2tag1) : 0;
+
+		i40e_receive_skb(rx_ring, skb, vlan_tag);
+		skb = NULL;
+
+		/* update budget accounting */
+		total_rx_packets++;
+	}
+
+	if (xdp_xmit) {
+		struct i40e_ring *xdp_ring =
+			rx_ring->vsi->xdp_rings[rx_ring->queue_index];
+
+		i40e_xdp_ring_update_tail(xdp_ring);
+		xdp_do_flush_map();
+	}
+
+	u64_stats_update_begin(&rx_ring->syncp);
+	rx_ring->stats.packets += total_rx_packets;
+	rx_ring->stats.bytes += total_rx_bytes;
+	u64_stats_update_end(&rx_ring->syncp);
+	rx_ring->q_vector->rx.total_packets += total_rx_packets;
+	rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
+
+	/* guarantee a trip back through this routine if there was a failure */
+	return failure ? budget : (int)total_rx_packets;
+}
+
 static inline u32 i40e_buildreg_itr(const int type, u16 itr)
 {
 	u32 val;
@@ -2576,7 +2922,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	budget_per_ring = max(budget/q_vector->num_ringpairs, 1);
 
 	i40e_for_each_ring(ring, q_vector->rx) {
-		int cleaned = i40e_clean_rx_irq(ring, budget_per_ring);
+		int cleaned = ring->clean_rx_irq(ring, budget_per_ring);
 
 		work_done += cleaned;
 		/* if we clean as many as budgeted, we must not be done */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index fdd2c55f03a6..9d5d9862e9f1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -296,13 +296,22 @@ struct i40e_tx_buffer {
 
 struct i40e_rx_buffer {
 	dma_addr_t dma;
-	struct page *page;
+	union {
+		struct {
+			struct page *page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
-	__u32 page_offset;
+			__u32 page_offset;
 #else
-	__u16 page_offset;
+			__u16 page_offset;
 #endif
-	__u16 pagecnt_bias;
+			__u16 pagecnt_bias;
+		};
+		struct {
+			/* for umem */
+			void *addr;
+			u32 id;
+		};
+	};
 };
 
 struct i40e_queue_stats {
@@ -344,6 +353,8 @@ enum i40e_ring_state_t {
 #define I40E_RX_SPLIT_TCP_UDP 0x4
 #define I40E_RX_SPLIT_SCTP    0x8
 
+void i40e_zc_recycle(struct zero_copy_allocator *alloc, unsigned long handle);
+
 /* struct that defines a descriptor ring, associated with a VSI */
 struct i40e_ring {
 	struct i40e_ring *next;		/* pointer to next ring in q_vector */
@@ -414,6 +425,12 @@ struct i40e_ring {
 
 	struct i40e_channel *ch;
 	struct xdp_rxq_info xdp_rxq;
+
+	int (*clean_rx_irq)(struct i40e_ring *, int);
+	bool (*alloc_rx_buffers)(struct i40e_ring *, u16);
+	struct xdp_umem *xsk_umem;
+
+	struct zero_copy_allocator zca; /* ZC allocator anchor */
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
@@ -474,6 +491,7 @@ static inline unsigned int i40e_rx_pg_order(struct i40e_ring *ring)
 #define i40e_rx_pg_size(_ring) (PAGE_SIZE << i40e_rx_pg_order(_ring))
 
 bool i40e_alloc_rx_buffers(struct i40e_ring *rxr, u16 cleaned_count);
+bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count);
 netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
 void i40e_clean_tx_ring(struct i40e_ring *tx_ring);
 void i40e_clean_rx_ring(struct i40e_ring *rx_ring);
@@ -489,6 +507,9 @@ int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 bool __i40e_chk_linearize(struct sk_buff *skb);
 int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf);
 void i40e_xdp_flush(struct net_device *dev);
+int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
+int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
+void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
@@ -575,4 +596,5 @@ static inline struct netdev_queue *txring_txq(const struct i40e_ring *ring)
 {
 	return netdev_get_tx_queue(ring->netdev, ring->queue_index);
 }
+
 #endif /* _I40E_TXRX_H_ */
-- 
2.14.1

^ permalink raw reply related

* [RFC PATCH bpf-next 09/12] samples/bpf: minor *_nb_free performance fix
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	intel-wired-lan
In-Reply-To: <20180515190615.23099-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 samples/bpf/xdpsock_user.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 7fe60f6f7d53..50b9cabba4e8 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -159,15 +159,15 @@ static const char pkt_data[] =
 
 static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb)
 {
-	u32 free_entries = q->size - (q->cached_prod - q->cached_cons);
+	u32 free_entries = q->cached_cons - q->cached_prod;
 
 	if (free_entries >= nb)
 		return free_entries;
 
 	/* Refresh the local tail pointer */
-	q->cached_cons = q->ring->ptrs.consumer;
+	q->cached_cons = q->ring->ptrs.consumer + q->size;
 
-	return q->size - (q->cached_prod - q->cached_cons);
+	return q->cached_cons - q->cached_prod;
 }
 
 static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs)
@@ -432,6 +432,7 @@ static struct xdp_umem *xdp_umem_configure(int sfd)
 
 	umem->fq.mask = FQ_NUM_DESCS - 1;
 	umem->fq.size = FQ_NUM_DESCS;
+	umem->fq.cached_cons = FQ_NUM_DESCS;
 
 	umem->cq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
 			     CQ_NUM_DESCS * sizeof(u32),
@@ -514,6 +515,7 @@ static struct xdpsock *xsk_configure(struct xdp_umem *umem)
 
 	xsk->tx.mask = NUM_DESCS - 1;
 	xsk->tx.size = NUM_DESCS;
+	xsk->tx.cached_cons = NUM_DESCS;
 
 	sxdp.sxdp_family = PF_XDP;
 	sxdp.sxdp_ifindex = opt_ifindex;
-- 
2.14.1

^ permalink raw reply related

* [RFC PATCH bpf-next 12/12] i40e: implement Tx zero-copy
From: Björn Töpel @ 2018-05-15 19:06 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	intel-wired-lan
In-Reply-To: <20180515190615.23099-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, the zero-copy ndo is implemented. As a shortcut, the existing
XDP Tx rings are used for zero-copy. This means that and XDP program
cannot redirect to an AF_XDP enabled XDP Tx ring.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |   7 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 235 +++++++++++++++++++++++-----
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |   6 +
 3 files changed, 212 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index dc3d668a741e..91f8e892179a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3055,8 +3055,12 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 	i40e_status err = 0;
 	u32 qtx_ctl = 0;
 
-	if (ring_is_xdp(ring))
+	ring->clean_tx_irq = i40e_clean_tx_irq;
+	if (ring_is_xdp(ring)) {
 		ring->xsk_umem = i40e_xsk_umem(ring);
+		if (ring->xsk_umem)
+			ring->clean_tx_irq = i40e_clean_tx_irq_zc;
+	}
 
 	/* some ATR related tx ring init */
 	if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
@@ -12309,6 +12313,7 @@ static const struct net_device_ops i40e_netdev_ops = {
 	.ndo_bpf		= i40e_xdp,
 	.ndo_xdp_xmit		= i40e_xdp_xmit,
 	.ndo_xdp_flush		= i40e_xdp_flush,
+	.ndo_xsk_async_xmit	= i40e_xsk_async_xmit,
 };
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f89ac524652c..17c067556aba 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -649,9 +649,13 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
 	if (!tx_ring->tx_bi)
 		return;
 
-	/* Free all the Tx ring sk_buffs */
-	for (i = 0; i < tx_ring->count; i++)
-		i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
+	/* Cleanup only needed for non XSK TX ZC rings */
+	if (!tx_ring->xsk_umem) {
+		/* Free all the Tx ring sk_buffs */
+		for (i = 0; i < tx_ring->count; i++)
+			i40e_unmap_and_free_tx_resource(tx_ring,
+							&tx_ring->tx_bi[i]);
+	}
 
 	bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
 	memset(tx_ring->tx_bi, 0, bi_size);
@@ -768,8 +772,139 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
 	}
 }
 
+static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
+{
+	/* Force memory writes to complete before letting h/w
+	 * know there are new descriptors to fetch.
+	 */
+	wmb();
+	writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
+}
+
 #define WB_STRIDE 4
 
+static void i40e_update_stats_and_arm_wb(struct i40e_ring *tx_ring,
+					 struct i40e_vsi *vsi,
+					 unsigned int total_packets,
+					 unsigned int total_bytes,
+					 int budget)
+{
+	u64_stats_update_begin(&tx_ring->syncp);
+	tx_ring->stats.bytes += total_bytes;
+	tx_ring->stats.packets += total_packets;
+	u64_stats_update_end(&tx_ring->syncp);
+	tx_ring->q_vector->tx.total_bytes += total_bytes;
+	tx_ring->q_vector->tx.total_packets += total_packets;
+
+	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
+		/* check to see if there are < 4 descriptors
+		 * waiting to be written back, then kick the hardware to force
+		 * them to be written back in case we stay in NAPI.
+		 * In this mode on X722 we do not enable Interrupt.
+		 */
+		unsigned int j = i40e_get_tx_pending(tx_ring, false);
+
+		if (budget &&
+		    ((j / WB_STRIDE) == 0) && (j > 0) &&
+		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
+		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
+			tx_ring->arm_wb = true;
+	}
+}
+
+/* Returns true if the work is finished */
+static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
+{
+	bool work_done = true, xdp_xmit = false;
+	struct i40e_tx_buffer *tx_bi;
+	struct i40e_tx_desc *tx_desc;
+	dma_addr_t dma;
+	u16 offset;
+	u32 len;
+
+	while (budget-- > 0) {
+		if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
+			xdp_ring->tx_stats.tx_busy++;
+			work_done = false;
+			break;
+		}
+
+		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len,
+					 &offset))
+			break;
+
+		xdp_xmit = true;
+		dma_sync_single_for_device(xdp_ring->dev, dma, len,
+					   DMA_BIDIRECTIONAL);
+
+		tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
+		tx_bi->bytecount = len;
+		tx_bi->gso_segs = 1;
+
+		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
+		tx_desc->buffer_addr = cpu_to_le64(dma);
+		tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC
+							| I40E_TX_DESC_CMD_EOP,
+							  0, len, 0);
+
+		xdp_ring->next_to_use++;
+		if (xdp_ring->next_to_use == xdp_ring->count)
+			xdp_ring->next_to_use = 0;
+	}
+
+	/* Request an interrupt for the last frame and bump tail ptr. */
+	if (xdp_xmit) {
+		tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
+						 I40E_TXD_QW1_CMD_SHIFT);
+		i40e_xdp_ring_update_tail(xdp_ring);
+	}
+
+	return !!budget && work_done;
+}
+
+bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
+			  struct i40e_ring *tx_ring, int napi_budget)
+{
+	unsigned int total_bytes = 0, total_packets = 0;
+	struct xdp_umem *umem = tx_ring->xsk_umem;
+	u32 head_idx = i40e_get_head(tx_ring);
+	unsigned int budget = vsi->work_limit;
+	bool work_done = true, xmit_done;
+	u32 completed_frames;
+	u32 frames_ready;
+
+	if (head_idx < tx_ring->next_to_clean)
+		head_idx += tx_ring->count;
+	frames_ready = head_idx - tx_ring->next_to_clean;
+
+	if (frames_ready == 0) {
+		goto out_xmit;
+	} else if (frames_ready > budget) {
+		completed_frames = budget;
+		work_done = false;
+	} else {
+		completed_frames = frames_ready;
+	}
+
+	/* XXX Need to be calculated. */
+	/*total_bytes += tx_buf->bytecount;*/
+	total_packets += completed_frames;
+
+	tx_ring->next_to_clean += completed_frames;
+	if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
+		tx_ring->next_to_clean -= tx_ring->count;
+
+	xsk_umem_complete_tx(umem, completed_frames);
+
+	i40e_update_stats_and_arm_wb(tx_ring, vsi, total_packets,
+				     total_bytes, budget);
+
+out_xmit:
+	xmit_done = i40e_xmit_zc(tx_ring, budget);
+
+	return work_done && xmit_done;
+}
+
 /**
  * i40e_clean_tx_irq - Reclaim resources after transmit completes
  * @vsi: the VSI we care about
@@ -778,8 +913,8 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
  *
  * Returns true if there's any budget left (e.g. the clean is finished)
  **/
-static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
-			      struct i40e_ring *tx_ring, int napi_budget)
+bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
+		       struct i40e_ring *tx_ring, int napi_budget)
 {
 	u16 i = tx_ring->next_to_clean;
 	struct i40e_tx_buffer *tx_buf;
@@ -874,27 +1009,9 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 
 	i += tx_ring->count;
 	tx_ring->next_to_clean = i;
-	u64_stats_update_begin(&tx_ring->syncp);
-	tx_ring->stats.bytes += total_bytes;
-	tx_ring->stats.packets += total_packets;
-	u64_stats_update_end(&tx_ring->syncp);
-	tx_ring->q_vector->tx.total_bytes += total_bytes;
-	tx_ring->q_vector->tx.total_packets += total_packets;
 
-	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
-		/* check to see if there are < 4 descriptors
-		 * waiting to be written back, then kick the hardware to force
-		 * them to be written back in case we stay in NAPI.
-		 * In this mode on X722 we do not enable Interrupt.
-		 */
-		unsigned int j = i40e_get_tx_pending(tx_ring, false);
-
-		if (budget &&
-		    ((j / WB_STRIDE) == 0) && (j > 0) &&
-		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
-		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
-			tx_ring->arm_wb = true;
-	}
+	i40e_update_stats_and_arm_wb(tx_ring, vsi, total_packets,
+				     total_bytes, budget);
 
 	if (ring_is_xdp(tx_ring))
 		return !!budget;
@@ -2266,15 +2383,6 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
 #endif
 }
 
-static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
-{
-	/* Force memory writes to complete before letting h/w
-	 * know there are new descriptors to fetch.
-	 */
-	wmb();
-	writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
-}
-
 /**
  * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
  * @rx_ring: rx descriptor ring to transact packets on
@@ -2904,10 +3012,11 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	 * budget and be more aggressive about cleaning up the Tx descriptors.
 	 */
 	i40e_for_each_ring(ring, q_vector->tx) {
-		if (!i40e_clean_tx_irq(vsi, ring, budget)) {
+		if (!ring->clean_tx_irq(vsi, ring, budget)) {
 			clean_complete = false;
 			continue;
 		}
+
 		arm_wb |= ring->arm_wb;
 		ring->arm_wb = false;
 	}
@@ -3810,6 +3919,30 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 	return -1;
 }
 
+/**
+ * i40e_napi_is_scheduled - If napi is running, set the NAPIF_STATE_MISSED
+ * @n: napi context
+ *
+ * Returns true if NAPI is scheduled.
+ **/
+static bool i40e_napi_is_scheduled(struct napi_struct *n)
+{
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		if (val & NAPIF_STATE_DISABLE)
+			return true;
+
+		if (!(val & NAPIF_STATE_SCHED))
+			return false;
+
+		new = val | NAPIF_STATE_MISSED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	return true;
+}
+
 /**
  * i40e_xmit_xdp_ring - transmits an XDP buffer to an XDP Tx ring
  * @xdp: data to transmit
@@ -4025,6 +4158,9 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
 		return -ENXIO;
 
+	if (vsi->xdp_rings[queue_index]->xsk_umem)
+		return -ENXIO;
+
 	err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
 	if (err != I40E_XDP_TX)
 		return -ENOSPC;
@@ -4048,5 +4184,34 @@ void i40e_xdp_flush(struct net_device *dev)
 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
 		return;
 
+	if (vsi->xdp_rings[queue_index]->xsk_umem)
+		return;
+
 	i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
 }
+
+int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id)
+{
+	struct i40e_netdev_priv *np = netdev_priv(dev);
+	struct i40e_vsi *vsi = np->vsi;
+	struct i40e_ring *ring;
+
+	if (test_bit(__I40E_VSI_DOWN, vsi->state))
+		return -ENETDOWN;
+
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return -ENXIO;
+
+	if (queue_id >= vsi->num_queue_pairs)
+		return -ENXIO;
+
+	if (!vsi->xdp_rings[queue_id]->xsk_umem)
+		return -ENXIO;
+
+	ring = vsi->xdp_rings[queue_id];
+
+	if (!i40e_napi_is_scheduled(&ring->q_vector->napi))
+		i40e_force_wb(vsi, ring->q_vector);
+
+	return 0;
+}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 9d5d9862e9f1..ea1cac00cad4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -428,6 +428,7 @@ struct i40e_ring {
 
 	int (*clean_rx_irq)(struct i40e_ring *, int);
 	bool (*alloc_rx_buffers)(struct i40e_ring *, u16);
+	bool (*clean_tx_irq)(struct i40e_vsi *, struct i40e_ring *, int);
 	struct xdp_umem *xsk_umem;
 
 	struct zero_copy_allocator zca; /* ZC allocator anchor */
@@ -510,6 +511,11 @@ void i40e_xdp_flush(struct net_device *dev);
 int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget);
 int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget);
 void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
+bool i40e_clean_tx_irq_zc(struct i40e_vsi *vsi,
+			  struct i40e_ring *tx_ring, int napi_budget);
+bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
+		       struct i40e_ring *tx_ring, int napi_budget);
+int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
-- 
2.14.1

^ permalink raw reply related

* Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks
From: Eric Dumazet @ 2018-05-15 19:08 UTC (permalink / raw)
  To: Qing Huang, Tariq Toukan, davem, haakon.bugge, yanjun.zhu
  Cc: netdev, linux-rdma, linux-kernel
In-Reply-To: <f4c5d8d3-1eea-8566-9921-a9dc48435f66@oracle.com>



On 05/15/2018 11:53 AM, Qing Huang wrote:
> 
>> This is control path so it is less latency-sensitive.
>> Let's not produce unnecessary degradation here, please call kvzalloc so we maintain a similar behavior when contiguous memory is available, and a fallback for resiliency.
> 
> No sure what exactly degradation is caused by vzalloc here. I think it's better to keep physically contiguous pages
> to other requests which really need them. Besides slow path/mem compacting can be really expensive.
>

Just use kvzalloc(), and you get the benefit of having contiguous memory if available,
without expensive compact phase.

This thing _automatically_ falls back to vmalloc(), thus your problem will be solved.

If you are not sure, trust others.

^ permalink raw reply

* Re: [PATCH bpf-next v6 1/4] bpf: sockmap, refactor sockmap routines to work with hashmap
From: Daniel Borkmann @ 2018-05-15 19:19 UTC (permalink / raw)
  To: John Fastabend, ast; +Cc: netdev, davem
In-Reply-To: <1526317219-7752-2-git-send-email-john.fastabend@gmail.com>

On 05/14/2018 07:00 PM, John Fastabend wrote:
[...]
> +static int __sock_map_ctx_update_elem(struct bpf_map *map,
> +				      struct bpf_sock_progs *progs,
> +				      struct sock *sock,
> +				      struct sock **map_link,
> +				      void *key)
>  {
[...]
>  	 * sock being added. If the sock is already attached to BPF programs
>  	 * this results in an error.
>  	 */
> -	verdict = READ_ONCE(stab->bpf_verdict);
> -	parse = READ_ONCE(stab->bpf_parse);
> -	tx_msg = READ_ONCE(stab->bpf_tx_msg);
> +	verdict = READ_ONCE(progs->bpf_verdict);
> +	parse = READ_ONCE(progs->bpf_parse);
> +	tx_msg = READ_ONCE(progs->bpf_tx_msg);
>  
>  	if (parse && verdict) {
>  		/* bpf prog refcnt may be zero if a concurrent attach operation
> @@ -1703,11 +1691,11 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
>  		 * we increment the refcnt. If this is the case abort with an
>  		 * error.
>  		 */
> -		verdict = bpf_prog_inc_not_zero(stab->bpf_verdict);
> +		verdict = bpf_prog_inc_not_zero(progs->bpf_verdict);
>  		if (IS_ERR(verdict))
>  			return PTR_ERR(verdict);
>  
> -		parse = bpf_prog_inc_not_zero(stab->bpf_parse);
> +		parse = bpf_prog_inc_not_zero(progs->bpf_parse);
>  		if (IS_ERR(parse)) {
>  			bpf_prog_put(verdict);
>  			return PTR_ERR(parse);
> @@ -1715,7 +1703,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
>  	}
>  
>  	if (tx_msg) {
> -		tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg);
> +		tx_msg = bpf_prog_inc_not_zero(progs->bpf_tx_msg);
>  		if (IS_ERR(tx_msg)) {
>  			if (verdict)
>  				bpf_prog_put(verdict);

Not directly related to the patch since it doesn't change the logic, but
it feels something is not quite right here (unless I'm missing something).
The code is as follows:

[...]
        verdict = READ_ONCE(progs->bpf_verdict);
        parse = READ_ONCE(progs->bpf_parse);
        tx_msg = READ_ONCE(progs->bpf_tx_msg);

        if (parse && verdict) {
                /* bpf prog refcnt may be zero if a concurrent attach operation
                 * removes the program after the above READ_ONCE() but before
                 * we increment the refcnt. If this is the case abort with an
                 * error.
                 */
                verdict = bpf_prog_inc_not_zero(progs->bpf_verdict);
                if (IS_ERR(verdict))
                        return PTR_ERR(verdict);

                parse = bpf_prog_inc_not_zero(progs->bpf_parse);
                if (IS_ERR(parse)) {
                        bpf_prog_put(verdict);
                        return PTR_ERR(parse);
                }
        }

        if (tx_msg) {
                tx_msg = bpf_prog_inc_not_zero(progs->bpf_tx_msg);
                if (IS_ERR(tx_msg)) {
                        if (verdict)
                                bpf_prog_put(verdict);
                        if (parse)
                                bpf_prog_put(parse);
                        return PTR_ERR(tx_msg);
                }
        }
[...]

As you say in the comment above the function wrt locking notes that the
__sock_map_ctx_update_elem() can be called concurrently.

  All operations operate on sock_map using cmpxchg and xchg operations to ensure we
  do not get stale references. Any reads into the map must be done with READ_ONCE()
  because of this.

You initially use the READ_ONCE() on the verdict/parse/tx_msg, but later on when
grabbing the reference you use again progs->bpf_verdict/bpf_parse/bpf_tx_msg which
would potentially refetch it, but if updates would happen concurrently e.g. to the
three progs, they could be NULL in the mean-time, no? bpf_prog_inc_not_zero() would
then crash. Why are not the ones used that you fetched previously via READ_ONCE()
for taking the ref?

The second question I had is that verdict/parse/tx_msg are updated independently
from each other and each could be NULL or non-NULL. What if, say, parse is NULL
and verdict as well as tx_msg is non-NULL and the bpf_prog_inc_not_zero() on the
tx_msg prog fails. Doesn't this cause a use-after-free since a ref on verdict wasn't
taken earlier but the bpf_prog_put() will cause accidental misbalance/free of the
progs?

It would probably help to clarify the locking comment a bit more if indeed the
above should be okay as is.

Thanks,
Daniel

^ permalink raw reply

* Poor TCP performance with XPS enabled after scrubbing skb
From: Flavio Leitner @ 2018-05-15 19:31 UTC (permalink / raw)
  To: netdev; +Cc: Paolo Abeni

Hi,

There is a significant throughput issue (~50% drop) for a single TCP
stream when the skb is scrubbed and XPS is enabled.

If I turn CONFIG_XPS off, then the issue never happens and the test
reaches line rate.  The same happens if I echo 0 to tx-*/xps_cpus.

It looks like that when the skb is scrubbed, there is no more reference
to the struct sock, which forces XPS to use a TX queue mapped to the
running CPU. However, since there is no mapping between RX queue and
TX queue, the returning traffic usually ends up in another CPU. This
other CPU process the skb and if the stack needs to send something,
then we have two TX queues being used in parallel for the same stream
and TCP seems to not like that (Out-Of-Order, dup ACKS, retransmissions..)

The test environment is quite simple. The iperf/iperf3 -s can be
just a NIC with IP address.  The peer running iperf/iperf3 -c needs
to use veth (scrub the packet), so create a pair, attach one end
to a linux bridge with the NIC and add the IP address to the other
end:
      Bridge
NIC ---/  \--- veth0 ---- veth1 [ IP address ]

Paolo and I discussed the issue and we came up with a patch[1] that
supports the explanation above. It may not be the best way to fix the
problem though, so for now consider it just as an experiment :-)

Kernel net-next updated with today's:
commit f3002c1374fb2367c9d8dbb28852791ef90d2bac
Date:   Mon May 14 08:14:49 2018 -0400


Default config (CONFIG_XPS on)
# iperf -c 192.168.1.2 -t 30
------------------------------------------------------------
Client connecting to 192.168.1.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.1 port 40332 connected with 192.168.1.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  16.8 GBytes  4.80 Gbits/sec


# ./xps_disable.sh; iperf -c 192.168.1.2 -t 30
------------------------------------------------------------
Client connecting to 192.168.1.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.1 port 40334 connected with 192.168.1.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  32.2 GBytes  9.21 Gbits/sec


[root@dell-r430-23 ~]# ./xps_restore.sh; iperf -c 192.168.1.2 -t 30
------------------------------------------------------------
Client connecting to 192.168.1.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.1 port 40336 connected with 192.168.1.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  16.0 GBytes  4.59 Gbits/sec


Experimental patch applied and XPS functioning:

# iperf -c 192.168.1.2 -t 30
------------------------------------------------------------
Client connecting to 192.168.1.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.1 port 34202 connected with 192.168.1.2 port
5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  32.2 GBytes  9.21 Gbits/sec


Sometimes the return traffic ends up in the same CPU running iperf -c.
When that happens, the same TX queue is used and I see line rate.

The issue always happen with MLX and be2net NICs, but so far I am
unable to reproduce with i40e, though I could see two TX queues being
used in parallel as in other cases.

[1]
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 71c72a9..482d046 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -31,9 +31,10 @@
 
 /*		0 - Reserved to indicate value not set
  *     1..NR_CPUS - Reserved for sender_cpu
- *  NR_CPUS+1..~0 - Region available for NAPI IDs
+ *      NR_CPUS+1 - Scrubbed packet, do not use XPS
+ *  NR_CPUS+2..~0 - Region available for NAPI IDs
  */
-#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))
+#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 2))
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 
diff --git a/net/core/dev.c b/net/core/dev.c
index af0558b..5567d4f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3398,6 +3398,9 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	struct xps_map *map;
 	int queue_index = -1;
 
+	if (skb->sender_cpu ==  (u32)(NR_CPUS + 1))
+		return -1;
+
 	rcu_read_lock();
 	dev_maps = rcu_dereference(dev->xps_maps);
 	if (dev_maps) {
@@ -3459,7 +3462,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 #ifdef CONFIG_XPS
 	u32 sender_cpu = skb->sender_cpu - 1;
 
-	if (sender_cpu >= (u32)NR_CPUS)
+	if (sender_cpu >= (u32)NR_CPUS + 1)
 		skb->sender_cpu = raw_smp_processor_id() + 1;
 #endif
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 345b518..99040a0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4898,6 +4898,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 	ipvs_reset(skb);
 	skb_orphan(skb);
 	skb->mark = 0;
+	skb->sender_cpu = (u32)(NR_CPUS + 1);
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);
 

-- 
Flavio

^ permalink raw reply related

* Re: [PATCH v2 iproute2-next 1/3] rdma: update rdma_netlink.h to get new driver attributes
From: David Ahern @ 2018-05-15 19:37 UTC (permalink / raw)
  To: Steve Wise, leon; +Cc: stephen, netdev, linux-rdma
In-Reply-To: <ef3d41db66cf503f717f92f6e595c065ea4dbe4e.1526312594.git.swise@opengridcomputing.com>

On 5/14/18 9:42 AM, Steve Wise wrote:
> diff --git a/rdma/include/uapi/rdma/rdma_netlink.h b/rdma/include/uapi/rdma/rdma_netlink.h
> index 60416ed..40be0d8 100644
> --- a/rdma/include/uapi/rdma/rdma_netlink.h
> +++ b/rdma/include/uapi/rdma/rdma_netlink.h
> @@ -249,10 +249,22 @@ enum rdma_nldev_command {
>  	RDMA_NLDEV_NUM_OPS
>  };
>  
> +enum {
> +	RDMA_NLDEV_ATTR_ENTRY_STRLEN = 16,
> +};
> +
> +enum rdma_nldev_print_type {
> +	RDMA_NLDEV_PRINT_TYPE_UNSPEC,
> +	RDMA_NLDEV_PRINT_TYPE_HEX,
> +};
> +
>  enum rdma_nldev_attr {
>  	/* don't change the order or add anything between, this is ABI! */

I asked this before and did not get a response. As the comment above
states with an emphasis (!) ...

>  	RDMA_NLDEV_ATTR_UNSPEC,
>  
> +	/* Pad attribute for 64b alignment */
> +	RDMA_NLDEV_ATTR_PAD = RDMA_NLDEV_ATTR_UNSPEC,
> +

... are you really adding new attributes in the middle?

>  	/* Identifier for ib_device */
>  	RDMA_NLDEV_ATTR_DEV_INDEX,		/* u32 */
>  
> @@ -387,6 +399,20 @@ enum rdma_nldev_attr {
>  	RDMA_NLDEV_ATTR_RES_PD_ENTRY,		/* nested table */
>  	RDMA_NLDEV_ATTR_RES_LOCAL_DMA_LKEY,	/* u32 */
>  	RDMA_NLDEV_ATTR_RES_UNSAFE_GLOBAL_RKEY,	/* u32 */
> +	/*
> +	 * driver-specific attributes.
> +	 */
> +	RDMA_NLDEV_ATTR_DRIVER,			/* nested table */
> +	RDMA_NLDEV_ATTR_DRIVER_ENTRY,		/* nested table */
> +	RDMA_NLDEV_ATTR_DRIVER_STRING,		/* string */
> +	/*
> +	 * u8 values from enum rdma_nldev_print_type
> +	 */
> +	RDMA_NLDEV_ATTR_DRIVER_PRINT_TYPE,	/* u8 */
> +	RDMA_NLDEV_ATTR_DRIVER_S32,		/* s32 */
> +	RDMA_NLDEV_ATTR_DRIVER_U32,		/* u32 */
> +	RDMA_NLDEV_ATTR_DRIVER_S64,		/* s64 */
> +	RDMA_NLDEV_ATTR_DRIVER_U64,		/* u64 */

and again here.

>  
>  	/*
>  	 * Provides logical name and index of netdevice which is
> 

^ permalink raw reply

* Re: [PATCH net-next] sched: cls: enable verbose logging
From: Jakub Kicinski @ 2018-05-15 19:43 UTC (permalink / raw)
  To: Cong Wang
  Cc: Marcelo Ricardo Leitner, Linux Kernel Network Developers,
	David Ahern, Stephen Hemminger, Jiri Pirko, Alexander Aring,
	Jamal Hadi Salim
In-Reply-To: <CAM_iQpVjNOt4g1PY3VOJJPn=XOSE3G3ZqUHEC-H+M66GRDCMTg@mail.gmail.com>

On Mon, 14 May 2018 22:31:46 -0700, Cong Wang wrote:
> On Mon, May 14, 2018 at 1:47 PM, Marcelo Ricardo Leitner wrote:
> > On Mon, May 14, 2018 at 01:30:53PM -0700, Cong Wang wrote:  
> >> On Sun, May 13, 2018 at 1:44 PM, Marcelo Ricardo Leitner wrote:  
> >> > Currently, when the rule is not to be exclusively executed by the
> >> > hardware, extack is not passed along and offloading failures don't
> >> > get logged. The idea was that hardware failures are okay because the
> >> > rule will get executed in software then and this way it doesn't confuse
> >> > unware users.
> >> >
> >> > But this is not helpful in case one needs to understand why a certain
> >> > rule failed to get offloaded. Considering it may have been a temporary
> >> > failure, like resources exceeded or so, reproducing it later and knowing
> >> > that it is triggering the same reason may be challenging.  
> >>
> >> I fail to understand why you need a flag here, IOW, why not just pass
> >> extack unconditionally?  
> >
> > Because (as discussed in the RFC[1], should have linked it here) it
> > could confuse users that are not aware of offloading and, in other
> > cases, it can be just noise (like it would be right now for ebpf,
> > which is mostly used in sw-path).
> >
> > 1.https://www.mail-archive.com/netdev@vger.kernel.org/msg223016.html  
> 
> My point is that a TC filter flag should be used for a filter attribute,
> logging is apparently not a part of filter. At least, put it into HW offloading,
> not in TC filter.
> 
> I know DaveM hates module parameters, but a module parameter here
> is more suitable than a TC filter flag.

Do you mean we should add a global cls_flower parameter to enable
verbose HW offload messages?  I'm not sure where "HW offloading" is.

I agree with you in principle, this could be made a "per application
context" flag.  Perhaps to be set on the socket.  But our existing
flags are per-request so it makes sense to do the same here IMHO.

^ permalink raw reply

* Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks
From: Qing Huang @ 2018-05-15 19:45 UTC (permalink / raw)
  To: Eric Dumazet, Tariq Toukan, davem, haakon.bugge, yanjun.zhu
  Cc: netdev, linux-rdma, linux-kernel
In-Reply-To: <f71278c2-a111-1bf6-2ff3-d9a8dd46bfbe@gmail.com>



On 5/15/2018 12:08 PM, Eric Dumazet wrote:
>
> On 05/15/2018 11:53 AM, Qing Huang wrote:
>>> This is control path so it is less latency-sensitive.
>>> Let's not produce unnecessary degradation here, please call kvzalloc so we maintain a similar behavior when contiguous memory is available, and a fallback for resiliency.
>> No sure what exactly degradation is caused by vzalloc here. I think it's better to keep physically contiguous pages
>> to other requests which really need them. Besides slow path/mem compacting can be really expensive.
>>
> Just use kvzalloc(), and you get the benefit of having contiguous memory if available,
> without expensive compact phase.
>
> This thing _automatically_ falls back to vmalloc(), thus your problem will be solved.
>
> If you are not sure, trust others.

Thanks for the review. There are many places in kernel and applications 
where physically contiguous pages are needed.
We saw quite a few issues when there were not enough contiguous phy mem 
available. My main concern here is that why
using physically contiguous pages when they are not really needed?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox