Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH RFC 1/2] bpf: add a longest prefix match trie map implementation
From: Daniel Mack @ 2016-12-14 15:43 UTC (permalink / raw)
  To: ast; +Cc: dh.herrmann, daniel, netdev, davem, Daniel Mack
In-Reply-To: <20161214154336.17639-1-daniel@zonque.org>

This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
---
 include/uapi/linux/bpf.h |   7 +
 kernel/bpf/Makefile      |   2 +-
 kernel/bpf/lpm_trie.c    | 491 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 499 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/lpm_trie.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87..d564277 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,6 +63,12 @@ struct bpf_insn {
 	__s32	imm;		/* signed immediate constant */
 };
 
+/* Key of an a BPF_MAP_TYPE_LPM_TRIE entry */
+struct bpf_lpm_trie_key {
+	__u32	prefixlen;	/* up to 32 for AF_INET, 128 for AF_INET6 */
+	__u8	data[0];	/* Arbitrary size */
+};
+
 /* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
 	BPF_MAP_CREATE,
@@ -89,6 +95,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_CGROUP_ARRAY,
 	BPF_MAP_TYPE_LRU_HASH,
 	BPF_MAP_TYPE_LRU_PERCPU_HASH,
+	BPF_MAP_TYPE_LPM_TRIE,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1276474..e1ce4f4 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,7 +1,7 @@
 obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
new file mode 100644
index 0000000..cae759d
--- /dev/null
+++ b/kernel/bpf/lpm_trie.c
@@ -0,0 +1,491 @@
+/*
+ * Longest prefix match list implementation
+ *
+ * Copyright (c) 2016 Daniel Mack
+ * Copyright (c) 2016 David Herrmann
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/bpf.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/vmalloc.h>
+#include <net/ipv6.h>
+
+/* Intermediate node */
+#define LPM_TREE_NODE_FLAG_IM BIT(0)
+
+struct lpm_trie_node;
+
+struct lpm_trie_node {
+	struct rcu_head rcu;
+	struct lpm_trie_node	*child[2];
+	u32			prefixlen;
+	u32			flags;
+	u64			value;
+	u8			data[0];
+};
+
+struct lpm_trie {
+	struct bpf_map		map;
+	struct lpm_trie_node	*root;
+	size_t			n_entries;
+	size_t			max_prefixlen;
+	size_t			data_size;
+	spinlock_t		lock;
+};
+
+/*
+ * This trie implements a longest prefix match algorithm that can be used to
+ * match IP addresses to a stored set of ranges.
+ *
+ * Data stored in @data of struct bpf_lpm_key and struct lpm_trie_node is
+ * interpreted as big endian, so data[0] stores the most significant byte.
+ *
+ * Match ranges are internally stored in instances of struct lpm_trie_node
+ * which each contain their prefix length as well as two pointers that may
+ * lead to more nodes containing more specific matches. Each node also stores
+ * a value that is defined by and returned to userspace via the update_elem
+ * and lookup functions.
+ *
+ * For instance, let's start with a trie that was created with a prefix length
+ * of 32, so it can be used for IPv4 addresses, and one single element that
+ * matches 192.168.0.0/16. The data array would hence contain
+ * [0xc0, 0xa8, 0x00, 0x00] in big-endian notation. This documentation will
+ * stick to IP-address notation for readability though.
+ *
+ * As the trie is empty initially, the new node (1) will be places as root
+ * node, denoted as (R) in the example below. As there are no other node, both
+ * child pointers are %NULL.
+ *
+ *              +----------------+
+ *              |       (1)  (R) |
+ *              | 192.168.0.0/16 |
+ *              |    value: 1    |
+ *              |   [0]    [1]   |
+ *              +----------------+
+ *
+ * Next, let's add a new node (2) matching 192.168.0.0/24. As there is already
+ * a node with the same data and a smaller prefix (ie, a less specific one),
+ * node (2) will become a child of (1). In child index depends on the next bit
+ * that is outside of that (1) matches, and that bit is 0, so (2) will be
+ * child[0] of (1):
+ *
+ *              +----------------+
+ *              |       (1)  (R) |
+ *              | 192.168.0.0/16 |
+ *              |    value: 1    |
+ *              |   [0]    [1]   |
+ *              +----------------+
+ *                   |
+ *    +----------------+
+ *    |       (2)      |
+ *    | 192.168.0.0/24 |
+ *    |    value: 2    |
+ *    |   [0]    [1]   |
+ *    +----------------+
+ *
+ * The child[1] slot of (1) could be filled with another node which has bit #17
+ * (the next bit after the ones that (1) matches on) set to 1. For instance,
+ * 192.168.128.0/24:
+ *
+ *              +----------------+
+ *              |       (1)  (R) |
+ *              | 192.168.0.0/16 |
+ *              |    value: 1    |
+ *              |   [0]    [1]   |
+ *              +----------------+
+ *                   |      |
+ *    +----------------+  +------------------+
+ *    |       (2)      |  |        (3)       |
+ *    | 192.168.0.0/24 |  | 192.168.128.0/24 |
+ *    |    value: 2    |  |     value: 3     |
+ *    |   [0]    [1]   |  |    [0]    [1]    |
+ *    +----------------+  +------------------+
+ *
+ * Let's add another node (4) to the game for 192.168.1.0/24. In order to place
+ * it, node (1) is looked at first, and because (4) of the semantics laid out
+ * above (bit #17 is 0), it would normally be attached to (1) as child[0].
+ * However, that slot is already allocated, so a new node is needed in between.
+ * That node is does not have a value attached to it and it will never be
+ * returned to users as result of a lookup. It is only there to differenciate
+ * the traversal further. It will get a prefix as wide as necessary to
+ * distinguish its two children:
+ *
+ *                      +----------------+
+ *                      |       (1)  (R) |
+ *                      | 192.168.0.0/16 |
+ *                      |    value: 1    |
+ *                      |   [0]    [1]   |
+ *                      +----------------+
+ *                           |      |
+ *            +----------------+  +------------------+
+ *            |       (4)  (I) |  |        (3)       |
+ *            | 192.168.0.0/23 |  | 192.168.128.0/24 |
+ *            |    value: ---  |  |     value: 3     |
+ *            |   [0]    [1]   |  |    [0]    [1]    |
+ *            +----------------+  +------------------+
+ *                 |      |
+ *  +----------------+  +----------------+
+ *  |       (2)      |  |       (5)      |
+ *  | 192.168.0.0/24 |  | 192.168.1.0/24 |
+ *  |    value: 2    |  |     value: 5   |
+ *  |   [0]    [1]   |  |   [0]    [1]   |
+ *  +----------------+  +----------------+
+ *
+ * 192.168.1.1/32 would be a child of (5) etc.
+ *
+ * An intermediate node will be turned into a 'real' node on demand. In the
+ * example above, (4) would be re-used if 192.168.0.0/23 is added to the trie.
+ *
+ * A fully populated trie would have a height of 32 nodes, as the trie was
+ * created with a prefix length of 32.
+ *
+ * The lookup starts at the root node. If the current node matches and if there
+ * is a child that can be used to become more specific, the trie is traversed
+ * downwards. The last node in the traversal that is a non-intermediate one is
+ * returned.
+ */
+
+static inline int extract_bit(const u8 *data, size_t index)
+{
+	return !!(data[index / 8] & (1 << (7 - (index % 8))));
+}
+
+/**
+ * longest_prefix_match() - determine the longest prefix
+ * @trie:	The trie to get internal sizes from
+ * @node:	The node to operate on
+ * @key:	The key to compare to @node
+ *
+ * Determine the longest prefix of @node that matches the bits in @key.
+ */
+static size_t longest_prefix_match(const struct lpm_trie *trie,
+				   const struct lpm_trie_node *node,
+				   const struct bpf_lpm_trie_key *key)
+{
+	size_t prefixlen = 0;
+	int i;
+
+	for (i = 0; i < trie->data_size; i++) {
+		size_t b;
+
+		b = 8 - fls(node->data[i] ^ key->data[i]);
+		prefixlen += b;
+
+		if (prefixlen >= node->prefixlen || prefixlen >= key->prefixlen)
+			return min(node->prefixlen, key->prefixlen);
+
+		if (b < 8)
+			break;
+	}
+
+	return prefixlen;
+}
+
+/* Called from syscall or from eBPF program */
+static void *trie_lookup_elem(struct bpf_map *map, void *_key)
+{
+	struct lpm_trie_node *node, *found = NULL;
+	struct bpf_lpm_trie_key *key = _key;
+	struct lpm_trie *trie =
+		container_of(map, struct lpm_trie, map);
+
+	/* Start walking the trie from the root node ... */
+
+	for (node = rcu_dereference(trie->root); node;) {
+		unsigned int next_bit;
+		size_t matchlen;
+
+		/*
+		 * Determine the longest prefix of @node that matches @key.
+		 * If it's the maximum possible prefix for this trie, we have
+		 * an exact match and can return it directly.
+		 */
+		matchlen = longest_prefix_match(trie, node, key);
+		if (matchlen == trie->max_prefixlen)
+			return &node->value;
+
+		/*
+		 * If the number of bits that match is smaller than the prefix
+		 * length of @node, bail out and return the node we have seen
+		 * last in the traversal (ie, the parent).
+		 */
+		if (matchlen < node->prefixlen)
+			break;
+
+		/*
+		 * Consider this node as return candidate unless it is an
+		 * artificially added intermediate one
+		 */
+		if (!(node->flags & LPM_TREE_NODE_FLAG_IM))
+			found = node;
+
+		/*
+		 * If the node match is fully satisfied, let's see if we can
+		 * become more specific. Determine the next bit in the key and
+		 * traverse down.
+		 */
+		next_bit = extract_bit(key->data, node->prefixlen);
+		node = rcu_dereference(node->child[next_bit]);
+	}
+
+	return found ? &found->value : NULL;
+}
+
+static struct lpm_trie_node *lpm_trie_node_alloc(size_t data_size)
+{
+	return kmalloc(sizeof(struct lpm_trie_node) + data_size,
+		       GFP_ATOMIC | __GFP_NOWARN);
+}
+
+/**
+ *_lpm_trie_find_target_node() - locate a spot to put a new node
+ * @trie:	The trie to walk
+ * @key:	The key to find a slot for
+ * @node_ret:	Return variable for a node slot
+ *
+ * Find a slot to put a new node for @key, and return it in @node_ret.
+ *
+ * If the target location is an empty child of an existing node, or the
+ * root is unused, a pointer to that empty spot is returned in @node_ret
+ * and 0 is returned by the function.
+ *
+ * Otherwise, if a node is detected that conflicts with @key, that conflicting
+ * node is returned in @node_ret. The caller should then replace that node with
+ * an intermediate node. In this case, the longest prefix match between the
+ * existing node and @key is returned.
+ */
+static size_t find_target_node(struct lpm_trie *trie,
+			       struct bpf_lpm_trie_key *key,
+			       struct lpm_trie_node ***node_ret)
+{
+	struct lpm_trie_node **node = &trie->root;
+	size_t matchlen = 0;
+
+	while (*node) {
+		unsigned int next_bit;
+
+		matchlen = longest_prefix_match(trie, *node, key);
+
+		if ((*node)->prefixlen != matchlen ||
+		    (*node)->prefixlen == key->prefixlen ||
+		    (*node)->prefixlen == trie->max_prefixlen)
+			break;
+
+		next_bit = extract_bit(key->data, (*node)->prefixlen);
+		node = &(*node)->child[next_bit];
+	}
+
+	*node_ret = node;
+
+	return *node ? matchlen : 0;
+}
+
+/* Called from syscall or from eBPF program */
+static int trie_update_elem(struct bpf_map *map,
+			    void *_key, void *value, u64 flags)
+{
+	struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
+	struct lpm_trie_node **node, *im_node, *new_node = NULL;
+	struct bpf_lpm_trie_key *key = _key;
+	size_t matchlen;
+	int ret = 0;
+
+	if (key->prefixlen > trie->max_prefixlen)
+		return -EINVAL;
+
+	spin_lock(&trie->lock);
+
+	/* Allocate and fill a new node */
+
+	if (trie->n_entries == trie->map.max_entries) {
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	new_node = lpm_trie_node_alloc(trie->data_size);
+	if (!new_node) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	trie->n_entries++;
+	new_node->value = *(u64 *) value;
+	new_node->prefixlen = key->prefixlen;
+	new_node->flags = 0;
+	new_node->child[0] = NULL;
+	new_node->child[1] = NULL;
+	memcpy(new_node->data, key->data, trie->data_size);
+
+	/*
+	 * Now find a place to attach the new node. find_target_node()
+	 * either returned an empty slot (the root or an empty leaf), or the
+	 * closest match, in which case an intermediate node has to be created
+	 * and installed.
+	 */
+	matchlen = find_target_node(trie, key, &node);
+	if (!*node) {
+		rcu_assign_pointer(*node, new_node);
+		goto out;
+	}
+
+	/*
+	 * If the node we got back as target already exists, replace it
+	 * new_node, which already has the correct data array and value set.
+	 * If the node that is replaced is an intermediate one, turn it into a
+	 * 'real' node.
+	 */
+	if ((*node)->prefixlen == matchlen) {
+		struct lpm_trie_node *tmp;
+
+		new_node->child[0] = (*node)->child[0];
+		new_node->child[1] = (*node)->child[1];
+
+		tmp = rcu_dereference(*node);
+		if (!(tmp->flags & LPM_TREE_NODE_FLAG_IM))
+			trie->n_entries--;
+
+		rcu_assign_pointer(*node, new_node);
+		kfree_rcu(tmp, rcu);
+
+		goto out;
+	}
+
+	/*
+	 * If the new node matches the prefix completely, it must be an
+	 * inserted as an ancestor. Simply insert it between @node and @*node.
+	 */
+	if (matchlen == key->prefixlen) {
+		new_node->child[extract_bit((*node)->data, matchlen)] = *node;
+		rcu_assign_pointer(*node, new_node);
+		goto out;
+	}
+
+	/* Create an intermediate node and place it inbetween */
+	im_node = lpm_trie_node_alloc(trie->data_size);
+	if (!im_node) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	im_node->prefixlen = matchlen;
+	im_node->flags |= LPM_TREE_NODE_FLAG_IM;
+	memcpy(im_node->data, (*node)->data, trie->data_size);
+
+	/* Now determine which child to install in which slot */
+	if (extract_bit(key->data, matchlen)) {
+		im_node->child[0] = *node;
+		im_node->child[1] = new_node;
+	} else {
+		im_node->child[0] = new_node;
+		im_node->child[1] = *node;
+	}
+
+	/* Finally, assign the intermediate node to the determined spot */
+	rcu_assign_pointer(*node, im_node);
+
+out:
+	if (ret) {
+		if (new_node)
+			trie->n_entries--;
+
+		kfree(new_node);
+		kfree(im_node);
+	}
+
+	spin_unlock(&trie->lock);
+
+	return ret;
+}
+
+static struct bpf_map *trie_alloc(union bpf_attr *attr)
+{
+	struct lpm_trie *trie;
+
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 || attr->map_flags ||
+	    attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1   ||
+	    attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 ||
+	    attr->value_size != sizeof(u64))
+		return ERR_PTR(-EINVAL);
+
+	trie = kzalloc(sizeof(*trie), GFP_USER | __GFP_NOWARN);
+	if (!trie)
+		return NULL;
+
+	/* copy mandatory map attributes */
+	trie->map.map_type = attr->map_type;
+	trie->map.key_size = attr->key_size;
+	trie->map.value_size = attr->value_size;
+	trie->map.max_entries = attr->max_entries;
+	trie->data_size = attr->key_size -
+				offsetof(struct bpf_lpm_trie_key, data);
+	trie->max_prefixlen = trie->data_size * 8;
+
+	spin_lock_init(&trie->lock);
+
+	return &trie->map;
+}
+
+static void trie_free(struct bpf_map *map)
+{
+	struct lpm_trie_node **node;
+	struct lpm_trie *trie =
+		container_of(map, struct lpm_trie, map);
+
+	spin_lock(&trie->lock);
+
+	/*
+	 * Always start at the root and walk down to a node that has no
+	 * children. Then free that node, nullify its parent pointer and
+	 * start over.
+	 */
+
+	for (;;) {
+		node = &trie->root;
+		if (!*node)
+			break;
+
+		for (;;) {
+			if ((*node)->child[0]) {
+				node = &(*node)->child[0];
+				continue;
+			}
+
+			if ((*node)->child[1]) {
+				node = &(*node)->child[1];
+				continue;
+			}
+
+			kfree(*node);
+			*node = NULL;
+			break;
+		}
+	}
+
+	spin_unlock(&trie->lock);
+}
+
+static const struct bpf_map_ops trie_ops = {
+	.map_alloc = trie_alloc,
+	.map_free = trie_free,
+	.map_lookup_elem = trie_lookup_elem,
+	.map_update_elem = trie_update_elem,
+};
+
+static struct bpf_map_type_list trie_type __read_mostly = {
+	.ops = &trie_ops,
+	.type = BPF_MAP_TYPE_LPM_TRIE,
+};
+
+static int __init register_trie_map(void)
+{
+	bpf_register_map_type(&trie_type);
+	return 0;
+}
+late_initcall(register_trie_map);
-- 
2.9.3

^ permalink raw reply related

* [PATCH RFC 0/2] bpf: add longest prefix match map
From: Daniel Mack @ 2016-12-14 15:43 UTC (permalink / raw)
  To: ast; +Cc: dh.herrmann, daniel, netdev, davem, Daniel Mack

This patch set adds longest prefix match algorithm that can be used to
match IP addresses to a stored set of ranges. It is exposed as a bpf
map type.

Internally, data is stored in an unbalanced tree of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Not that this has nothing to do with fib or fib6 and is in no way meant
to replace or share code with it. It's rather a much simpler
implementation that is specifically written with bpf maps in mind.

Patch 1/2 adds the implementation, and 2/2 an extensive test suite.

Feedback is much appreciated.

Thanks,
Daniel

Daniel Mack (1):
  bpf: add a longest prefix match trie map implementation

David Herrmann (1):
  bpf: Add tests for the lpm trie map

 include/uapi/linux/bpf.h                   |   7 +
 kernel/bpf/Makefile                        |   2 +-
 kernel/bpf/lpm_trie.c                      | 491 +++++++++++++++++++++++++++++
 tools/testing/selftests/bpf/.gitignore     |   1 +
 tools/testing/selftests/bpf/Makefile       |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 348 ++++++++++++++++++++
 6 files changed, 850 insertions(+), 3 deletions(-)
 create mode 100644 kernel/bpf/lpm_trie.c
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

-- 
2.9.3

^ permalink raw reply

* [PATCH RFC 2/2] bpf: Add tests for the lpm trie map
From: Daniel Mack @ 2016-12-14 15:43 UTC (permalink / raw)
  To: ast; +Cc: dh.herrmann, daniel, netdev, davem, Daniel Mack
In-Reply-To: <20161214154336.17639-1-daniel@zonque.org>

From: David Herrmann <dh.herrmann@gmail.com>

The first part of this program runs randomized tests against the
lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm)
based on simple, linear, single linked lists. The implementation
should be pretty straightforward.

Based on tlpm, this inserts randomized data into bpf-lpm-maps and
verifies the trie-based bpf-map implementation behaves the same way
as tlpm.

The second part uses 'real world' IPv4 and IPv6 addresses and tests
the trie with those.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 tools/testing/selftests/bpf/.gitignore     |   1 +
 tools/testing/selftests/bpf/Makefile       |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 348 +++++++++++++++++++++++++++++
 3 files changed, 351 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 071431b..d3b1c9b 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -1,3 +1,4 @@
 test_verifier
 test_maps
 test_lru_map
+test_lpm_map
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 7a5f245..064a3e5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,8 +1,8 @@
 CFLAGS += -Wall -O2 -I../../../../usr/include
 
-test_objs = test_verifier test_maps test_lru_map
+test_objs = test_verifier test_maps test_lru_map test_lpm_map
 
-TEST_PROGS := test_verifier test_maps test_lru_map test_kmod.sh
+TEST_PROGS := test_verifier test_maps test_lru_map test_lpm_map test_kmod.sh
 TEST_FILES := $(test_objs)
 
 all: $(test_objs)
diff --git a/tools/testing/selftests/bpf/test_lpm_map.c b/tools/testing/selftests/bpf/test_lpm_map.c
new file mode 100644
index 0000000..08db750
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lpm_map.c
@@ -0,0 +1,348 @@
+/*
+ * Randomized tests for eBPF longest-prefix-match maps
+ *
+ * This program runs randomized tests against the lpm-bpf-map. It implements a
+ * "Trivial Longest Prefix Match" (tlpm) based on simple, linear, singly linked
+ * lists. The implementation should be pretty straightforward.
+ *
+ * Based on tlpm, this inserts randomized data into bpf-lpm-maps and verifies
+ * the trie-based bpf-map implementation behaves the same way as tlpm.
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <inttypes.h>
+#include <linux/bpf.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+#include <arpa/inet.h>
+
+#include "bpf_sys.h"
+#include "bpf_util.h"
+
+struct tlpm_node {
+	struct tlpm_node *next;
+	size_t n_bits;
+	uint8_t key[];
+};
+
+static struct tlpm_node *tlpm_add(struct tlpm_node *list,
+				  const uint8_t *key,
+				  size_t n_bits)
+{
+	struct tlpm_node *node;
+	size_t n;
+
+	/* add new entry with @key/@n_bits to @list and return new head */
+
+	n = (n_bits + 7) / 8;
+	node = malloc(sizeof(*node) + n);
+	assert(node);
+
+	node->next = list;
+	node->n_bits = n_bits;
+	memcpy(node->key, key, n);
+
+	return node;
+}
+
+static void tlpm_clear(struct tlpm_node *list)
+{
+	struct tlpm_node *node;
+
+	/* free all entries in @list */
+
+	while ((node = list)) {
+		list = list->next;
+		free(node);
+	}
+}
+
+static struct tlpm_node *tlpm_match(struct tlpm_node *list,
+				    const uint8_t *key,
+				    size_t n_bits)
+{
+	struct tlpm_node *best = NULL;
+	size_t i;
+
+	/*
+	 * Perform longest prefix-match on @key/@n_bits. That is, iterate all
+	 * entries and match each prefix against @key. Remember the "best"
+	 * entry we find (i.e., the longest prefix that matches) and return it
+	 * to the caller when done.
+	 */
+
+	for ( ; list; list = list->next) {
+		for (i = 0; i < n_bits && i < list->n_bits; ++i) {
+			if ((key[i / 8] & (1 << (7 - i % 8))) !=
+			    (list->key[i / 8] & (1 << (7 - i % 8))))
+				break;
+		}
+
+		if (i >= list->n_bits) {
+			if (!best || i > best->n_bits)
+				best = list;
+		}
+	}
+
+	return best;
+}
+
+static void test_lpm_basic(void)
+{
+	struct tlpm_node *list = NULL, *t1, *t2;
+
+	/* very basic, static tests to verify tlpm works as expected */
+
+	assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+
+	t1 = list = tlpm_add(list, (uint8_t[]){ 0xff }, 8);
+	assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+	assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 16));
+	assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0x00 }, 16));
+	assert(!tlpm_match(list, (uint8_t[]){ 0x7f }, 8));
+	assert(!tlpm_match(list, (uint8_t[]){ 0xfe }, 8));
+	assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 7));
+
+	t2 = list = tlpm_add(list, (uint8_t[]){ 0xff, 0xff }, 16);
+	assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+	assert(t2 == tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 16));
+	assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 15));
+	assert(!tlpm_match(list, (uint8_t[]){ 0x7f, 0xff }, 16));
+
+	tlpm_clear(list);
+}
+
+static void test_lpm_order(void)
+{
+	struct tlpm_node *t1, *t2, *l1 = NULL, *l2 = NULL;
+	size_t i, j;
+
+	/*
+	 * Verify the tlpm implementation works correctly regardless of the
+	 * order of entries. Insert a random set of entries into @l1, and copy
+	 * the same data in reverse order into @l2. Then verify a lookup of
+	 * random keys will yield the same result in both sets.
+	 */
+
+	for (i = 0; i < (1 << 12); ++i)
+		l1 = tlpm_add(l1, (uint8_t[]){
+					rand() % 0xff,
+					rand() % 0xff,
+				}, rand() % 16 + 1);
+
+	for (t1 = l1; t1; t1 = t1->next)
+		l2 = tlpm_add(l2, t1->key, t1->n_bits);
+
+	for (i = 0; i < (1 << 8); ++i) {
+		uint8_t key[] = { rand() % 0xff, rand() % 0xff };
+
+		t1 = tlpm_match(l1, key, 16);
+		t2 = tlpm_match(l2, key, 16);
+
+		assert(!t1 == !t2);
+		if (t1) {
+			assert(t1->n_bits == t2->n_bits);
+			for (j = 0; j < t1->n_bits; ++j)
+				assert((t1->key[j / 8] & (1 << (7 - j % 8))) ==
+				       (t2->key[j / 8] & (1 << (7 - j % 8))));
+		}
+	}
+
+	tlpm_clear(l1);
+	tlpm_clear(l2);
+}
+
+static void test_lpm_map(void)
+{
+	size_t i, j, n_matches, n_nodes, n_lookups;
+	struct tlpm_node *t, *list = NULL;
+	struct bpf_lpm_trie_key *key;
+	uint8_t value[8] = {};
+	int r, map;
+
+	/*
+	 * Compare behavior of tlpm vs. bpf-lpm. Create a randomized set of
+	 * prefixes and insert it into both tlpm and bpf-lpm. Then run some
+	 * randomized lookups and verify both maps return the same result.
+	 */
+
+	n_matches = 0;
+	n_nodes = 1 << 8;
+	n_lookups = 1 << 16;
+
+	key = alloca(sizeof(*key) + 4);
+	memset(key, 0, sizeof(*key) + 4);
+
+	map = bpf_map_create(BPF_MAP_TYPE_LPM_TRIE,
+			     sizeof(*key) + 4,
+			     sizeof(value),
+			     4096,
+			     0);
+	assert(map >= 0);
+
+	for (i = 0; i < n_nodes; ++i) {
+		value[0] = rand() & 0xff;
+		value[1] = rand() & 0xff;
+		value[2] = rand() & 0xff;
+		value[3] = rand() & 0xff;
+		value[4] = rand() % 33;
+
+		list = tlpm_add(list, value, value[4]);
+
+		key->prefixlen = value[4];
+		memcpy(key->data, value, 4);
+		r = bpf_map_update(map, key, value, 0);
+		assert(!r);
+	}
+
+	for (i = 0; i < n_lookups; ++i) {
+		uint8_t data[] = {
+			rand() % 0xff,
+			rand() % 0xff,
+			rand() % 0xff,
+			rand() % 0xff
+		};
+
+		t = tlpm_match(list, data, 32);
+
+		key->prefixlen = 32;
+		memcpy(key->data, data, 4);
+		r = bpf_map_lookup(map, key, value);
+		assert(!r || errno == ENOENT);
+		assert(!t == !!r);
+
+		if (t) {
+			++n_matches;
+			assert(t->n_bits == value[4]);
+			for (j = 0; j < t->n_bits; ++j)
+				assert((t->key[j / 8] & (1 << (7 - j % 8))) ==
+				       (value[j / 8] & (1 << (7 - j % 8))));
+		}
+	}
+
+	close(map);
+	tlpm_clear(list);
+
+	/*
+	 * With 255 random nodes in the map, we are pretty likely to match
+	 * something on every lookup. For statistics, use this:
+	 *
+	 *     printf("  nodes: %zu\n"
+	 *            "lookups: %zu\n"
+	 *            "matches: %zu\n", n_nodes, n_lookups, n_matches);
+	 */
+}
+
+/* Test the implementation with some 'real world' examples */
+
+static void test_lpm_ipaddr(void)
+{
+	struct bpf_lpm_trie_key *key_ipv4;
+	struct bpf_lpm_trie_key *key_ipv6;
+	size_t key_size_ipv4;
+	size_t key_size_ipv6;
+	int map_fd_ipv4;
+	int map_fd_ipv6;
+	__u64 value;
+
+	key_size_ipv4 = sizeof(*key_ipv4) + sizeof(__u32);
+	key_size_ipv6 = sizeof(*key_ipv6) + sizeof(__u32) * 4;
+	key_ipv4 = alloca(key_size_ipv4);
+	key_ipv6 = alloca(key_size_ipv6);
+
+	map_fd_ipv4 = bpf_map_create(BPF_MAP_TYPE_LPM_TRIE,
+				     key_size_ipv4, sizeof(value),
+				     100, 0);
+	assert(map_fd_ipv4 >= 0);
+
+	map_fd_ipv6 = bpf_map_create(BPF_MAP_TYPE_LPM_TRIE,
+				     key_size_ipv6, sizeof(value),
+				     100, 0);
+	assert(map_fd_ipv6 >= 0);
+
+	/* Fill data some IPv4 and IPv6 address ranges */
+	value = 1;
+	key_ipv4->prefixlen = 16;
+	inet_pton(AF_INET, "192.168.0.0", key_ipv4->data);
+	assert(bpf_map_update(map_fd_ipv4, key_ipv4, &value, 0) == 0);
+
+	value = 2;
+	key_ipv4->prefixlen = 24;
+	inet_pton(AF_INET, "192.168.0.0", key_ipv4->data);
+	assert(bpf_map_update(map_fd_ipv4, key_ipv4, &value, 0) == 0);
+
+	value = 3;
+	key_ipv4->prefixlen = 24;
+	inet_pton(AF_INET, "192.168.128.0", key_ipv4->data);
+	assert(bpf_map_update(map_fd_ipv4, key_ipv4, &value, 0) == 0);
+
+	value = 5;
+	key_ipv4->prefixlen = 24;
+	inet_pton(AF_INET, "192.168.1.0", key_ipv4->data);
+	assert(bpf_map_update(map_fd_ipv4, key_ipv4, &value, 0) == 0);
+
+	value = 4;
+	key_ipv4->prefixlen = 23;
+	inet_pton(AF_INET, "192.168.0.0", key_ipv4->data);
+	assert(bpf_map_update(map_fd_ipv4, key_ipv4, &value, 0) == 0);
+
+	value = 0xdeadbeef;
+	key_ipv6->prefixlen = 64;
+	inet_pton(AF_INET6, "2a00:1450:4001:814::200e", key_ipv6->data);
+	assert(bpf_map_update(map_fd_ipv6, key_ipv6, &value, 0) == 0);
+
+	/* Set tprefixlen to maximum for lookups */
+	key_ipv4->prefixlen = 32;
+	key_ipv6->prefixlen = 128;
+
+	/* Test some lookups that should come back with a value */
+	inet_pton(AF_INET, "192.168.128.23", key_ipv4->data);
+	assert(bpf_map_lookup(map_fd_ipv4, key_ipv4, &value) == 0);
+	assert(value == 3);
+
+	inet_pton(AF_INET, "192.168.0.1", key_ipv4->data);
+	assert(bpf_map_lookup(map_fd_ipv4, key_ipv4, &value) == 0);
+	assert(value == 2);
+
+	inet_pton(AF_INET6, "2a00:1450:4001:814::", key_ipv6->data);
+	assert(bpf_map_lookup(map_fd_ipv6, key_ipv6, &value) == 0);
+	assert(value == 0xdeadbeef);
+
+	inet_pton(AF_INET6, "2a00:1450:4001:814::1", key_ipv6->data);
+	assert(bpf_map_lookup(map_fd_ipv6, key_ipv6, &value) == 0);
+	assert(value == 0xdeadbeef);
+
+	/* Test some lookups that should not match any entry */
+	inet_pton(AF_INET, "10.0.0.1", key_ipv4->data);
+	assert(bpf_map_lookup(map_fd_ipv4, key_ipv4, &value) == -1 &&
+	       errno == ENOENT);
+
+	inet_pton(AF_INET, "11.11.11.11", key_ipv4->data);
+	assert(bpf_map_lookup(map_fd_ipv4, key_ipv4, &value) == -1 &&
+	       errno == ENOENT);
+
+	inet_pton(AF_INET6, "2a00:ffff::", key_ipv6->data);
+	assert(bpf_map_lookup(map_fd_ipv6, key_ipv6, &value) == -1 &&
+	       errno == ENOENT);
+
+	close(map_fd_ipv4);
+	close(map_fd_ipv6);
+}
+
+int main(void)
+{
+	/* we want predictable, pseudo random tests */
+	srand(0xf00ba1);
+
+	test_lpm_basic();
+	test_lpm_order();
+	test_lpm_map();
+	test_lpm_ipaddr();
+
+	printf("test_lpm: OK\n");
+	return 0;
+}
-- 
2.9.3

^ permalink raw reply related

* Re: dsa: handling more than 1 cpu port
From: John Crispin @ 2016-12-14 15:45 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Florian Fainelli, netdev@vger.kernel.org
In-Reply-To: <20161214110058.GE27370@lunn.ch>



On 14/12/2016 12:00, Andrew Lunn wrote:
> On Wed, Dec 14, 2016 at 11:35:30AM +0100, John Crispin wrote:
>>
>>
>> On 14/12/2016 11:31, Andrew Lunn wrote:
>>> On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
>>>> Hi Andrew,
>>>>
>>>> switches supported by qca8k have 2 gmacs that we can wire an external
>>>> mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
>>>> switch. Thw switch itself is aware of one of the MACs being the cpu port
>>>> and expects this to be port/mac0. Using the other will break the
>>>> hardware offloading features.
>>>
>>> Just to be sure here. There is no way to use the second port connected
>>> to the CPU as a CPU port?
>>
>> both macs are considered cpu ports and both allow for the tag to be
>> injected. for HW NAT/routing/... offloading to work, the lan ports neet
>> to trunk via port0 and not port6 however.
> 
> Maybe you can do a restricted version of the generic solution. LAN
> ports are mapped to cpu port0. WAN port to cpu port 6?
> 
>>> The Marvell chips do allow this. So i developed a proof of concept
>>> which had a mapping between cpu ports and slave ports. slave port X
>>> should you cpu port y for its traffic. This never got past proof of
>>> concept. 
>>>
>>> If this can be made to work for qca8k, i would prefer having this
>>> general concept, than specific hacks for pass through.
>>
>> oh cool, can you send those patches my way please ? how do you configure
>> this from userland ? does the cpu port get its on swdev which i just add
>> to my lan bridge and then add the 2nd cpu port to the wan bridge ?
> 
> https://github.com/lunn/linux/tree/v4.1-rc4-net-next-multiple-cpu
> 
> You don't configure anything from userland. Which was actually a
> criticism. It is in device tree. But my solution is generic. Having
> one WAN port and four bridges LAN ports is a pure marketing
> requirement. The hardware will happily do two WAN ports and 3 LAN
> ports, for example. And the switch is happy to take traffic for the
> WAN port and a LAN port over the same CPU port, and keep the traffic
> separate. So we can have some form of load balancing. We are not
> limited to 1->1, 1->4, we can do 1->2, 1->3 to increase the overall
> performance. And to the user it is all transparent.
> 
> This PoC is for the old DSA binding. The new binding makes it easier
> to express this. Which is one of the reasons for the new binding.
> 
> 	Andrew
> 

Hi Andrew,

spent some time looking at this and thinking about possible solutions.
my initial idea was to detect which cpu port to based on the cpu port
being included inside the bridge. however that wont allow us to control
ports using the tag outside of a bridge. i think that your approach is
the only sane one. we could add a sysfs interface later, allowing us to
change the default cpu port <-> mappings, but the device tree needs to
include some sane defaults. i'll use your patches as a base for a series.

	John

^ permalink raw reply

* ipv6: handle -EFAULT from skb_copy_bits
From: Dave Jones @ 2016-12-14 15:47 UTC (permalink / raw)
  To: netdev

It seems to be possible to craft a packet for sendmsg that triggers
the -EFAULT path in skb_copy_bits resulting in a BUG_ON that looks like:

RIP: 0010:[<ffffffff817c6390>] [<ffffffff817c6390>] rawv6_sendmsg+0xc30/0xc40
RSP: 0018:ffff881f6c4a7c18  EFLAGS: 00010282
RAX: 00000000fffffff2 RBX: ffff881f6c681680 RCX: 0000000000000002
RDX: ffff881f6c4a7cf8 RSI: 0000000000000030 RDI: ffff881fed0f6a00
RBP: ffff881f6c4a7da8 R08: 0000000000000000 R09: 0000000000000009
R10: ffff881fed0f6a00 R11: 0000000000000009 R12: 0000000000000030
R13: ffff881fed0f6a00 R14: ffff881fee39ba00 R15: ffff881fefa93a80

Call Trace:
 [<ffffffff8118ba23>] ? unmap_page_range+0x693/0x830
 [<ffffffff81772697>] inet_sendmsg+0x67/0xa0
 [<ffffffff816d93f8>] sock_sendmsg+0x38/0x50
 [<ffffffff816d982f>] SYSC_sendto+0xef/0x170
 [<ffffffff816da27e>] SyS_sendto+0xe/0x10
 [<ffffffff81002910>] do_syscall_64+0x50/0xa0
 [<ffffffff817f7cbc>] entry_SYSCALL64_slow_path+0x25/0x25

Handle this in rawv6_push_pending_frames and jump to the failure path.

Signed-off-by: Dave Jones <davej@codemonkey.org.uk>

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 291ebc260e70..35aa82faa052 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -591,7 +591,9 @@ static int rawv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6,
 	}
 
 	offset += skb_transport_offset(skb);
-	BUG_ON(skb_copy_bits(skb, offset, &csum, 2));
+	err = skb_copy_bits(skb, offset, &csum, 2);
+	if (err < 0)
+		goto out;
 
 	/* in case cksum was not initialized */
 	if (unlikely(csum))

^ permalink raw reply related

* ravb/sh_eth/b44: BUG: sleeping function called from invalid context
From: Geert Uytterhoeven @ 2016-12-14 16:12 UTC (permalink / raw)
  To: Sergei Shtylyov, Michael Chan; +Cc: Linux-Renesas, netdev@vger.kernel.org

Hi,

When CONFIG_DEBUG_ATOMIC_SLEEP=y, running "ethtool -s eth0 speed 100"
on Salvator-X gives:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:97
in_atomic(): 1, irqs_disabled(): 128, pid: 1683, name: ethtool
CPU: 0 PID: 1683 Comm: ethtool Tainted: G        W
4.9.0-salvator-x-00426-g326519df42c65007-dirty #976
Hardware name: Renesas Salvator-X board based on r8a7796 (DT)
Call trace:
[<ffffff8008089400>] dump_backtrace+0x0/0x208
[<ffffff800808961c>] show_stack+0x14/0x1c
[<ffffff8008233424>] dump_stack+0x94/0xb4
[<ffffff80080c377c>] ___might_sleep+0x108/0x11c
[<ffffff80080c3814>] __might_sleep+0x84/0x94
[<ffffff800855db0c>] mutex_lock+0x24/0x40
[<ffffff800837610c>] phy_start_aneg+0x20/0x130
[<ffffff80083763b8>] phy_ethtool_ksettings_set+0xd0/0xe8
[<ffffff8008386724>] ravb_set_link_ksettings+0x4c/0xa4
[<ffffff80084a7b94>] ethtool_set_settings+0xec/0xfc
[<ffffff80084aa918>] dev_ethtool+0x188/0x17c4
[<ffffff80084bce3c>] dev_ioctl+0x53c/0x6b8
[<ffffff8008488acc>] sock_do_ioctl.constprop.45+0x3c/0x4c
[<ffffff80084897b4>] sock_ioctl+0x33c/0x370
[<ffffff8008171878>] vfs_ioctl+0x20/0x38
[<ffffff8008172198>] do_vfs_ioctl+0x844/0x954
[<ffffff80081722ec>] SyS_ioctl+0x44/0x68
[<ffffff80080830f0>] el0_svc_naked+0x24/0x28

ravb_set_link_ksettings() calls phy_ethtool_ksettings_set() with a spinlock
held and interrupts disabled, while phy_start_aneg() tries to obtain a mutex.

The same issue is present in sh_eth_set_link_ksettings() (verified) and
b44_set_link_ksettings() (code inspection).

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
From: Arvind Yadav @ 2016-12-14 16:25 UTC (permalink / raw)
  To: peter.chen, fw, david.daney; +Cc: netdev, linux-kernel

Here, If devm_ioremap will fail. It will return NULL.
Kernel can run into a NULL-pointer dereference.
This error check will avoid NULL pointer dereference.

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
---
 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
index 4ab404f..33c2fec 100644
--- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
+++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
@@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct platform_device *pdev)
 	p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
 	p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
 					   p->agl_prt_ctl_size);
+	if (!p->mix || !p->agl || !p->agl_prt_ctl) {
+		dev_err(&pdev->dev, "failed to map I/O memory\n");
+		result = -ENOMEM;
+		goto err;
+	}
+
 	spin_lock_init(&p->lock);
 
 	skb_queue_head_init(&p->tx_list);
-- 
2.7.4

^ permalink raw reply related

* Re: [v1] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
From: arvind Yadav @ 2016-12-14 16:30 UTC (permalink / raw)
  To: kbuild test robot; +Cc: kbuild-all, peter.chen, fw, netdev, linux-kernel
In-Reply-To: <201612142225.IaytQyYf%fengguang.wu@intel.com>

Sorry for build failure. I have send new changes. Which does not
this failure.

Thanks
-Arvind

On Wednesday 14 December 2016 08:10 PM, kbuild test robot wrote:
> Hi Arvind,
>
> [auto build test ERROR on net-next/master]
> [also build test ERROR on v4.9 next-20161214]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Arvind-Yadav/net-ethernet-cavium-octeon-octeon_mgmt-Handle-return-NULL-error-from-devm_ioremap/20161213-224624
> config: mips-cavium_octeon_defconfig (attached as .config)
> compiler: mips64-linux-gnuabi64-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
>          wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
>          chmod +x ~/bin/make.cross
>          # save the attached .config to linux build tree
>          make.cross ARCH=mips
>
> All errors (new ones prefixed by >>):
>
>     drivers/net/ethernet/cavium/octeon/octeon_mgmt.c: In function 'octeon_mgmt_probe':
>>> drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1473:11: error: 'dev' undeclared (first use in this function)
>        dev_err(dev, "failed to map I/O memory\n");
>                ^~~
>     drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1473:11: note: each undeclared identifier is reported only once for each function it appears in
>
> vim +/dev +1473 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
>
>    1467	
>    1468		p->mix = (u64)devm_ioremap(&pdev->dev, p->mix_phys, p->mix_size);
>    1469		p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
>    1470		p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
>    1471						   p->agl_prt_ctl_size);
>    1472		if (!p->mix || !p->agl || !p->agl_prt_ctl) {
>> 1473			dev_err(dev, "failed to map I/O memory\n");
>    1474			result = -ENOMEM;
>    1475			goto err;
>    1476		}
>
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: John Fastabend @ 2016-12-14 16:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, cl, rppt, netdev, linux-mm, willemdebruijn.kernel,
	bjorn.topel, magnus.karlsson, alexander.duyck, mgorman, tom,
	bblanco, tariqt, saeedm, jesse.brandeburg, METH, vyasevich
In-Reply-To: <20161214103914.3a9ebbbf@redhat.com>

On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
> On Tue, 13 Dec 2016 12:08:21 -0800
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>> On 16-12-13 11:53 AM, David Miller wrote:
>>> From: John Fastabend <john.fastabend@gmail.com>
>>> Date: Tue, 13 Dec 2016 09:43:59 -0800
>>>   
>>>> What does "zero-copy send packet-pages to the application/socket that
>>>> requested this" mean? At the moment on x86 page-flipping appears to be
>>>> more expensive than memcpy (I can post some data shortly) and shared
>>>> memory was proposed and rejected for security reasons when we were
>>>> working on bifurcated driver.  
>>>
>>> The whole idea is that we map all the active RX ring pages into
>>> userspace from the start.
>>>
>>> And just how Jesper's page pool work will avoid DMA map/unmap,
>>> it will also avoid changing the userspace mapping of the pages
>>> as well.
>>>
>>> Thus avoiding the TLB/VM overhead altogether.
>>>   
> 
> Exactly.  It is worth mentioning that pages entering the page pool need
> to be cleared (measured cost 143 cycles), in order to not leak any
> kernel info.  The primary focus of this design is to make sure not to
> leak kernel info to userspace, but with an "exclusive" mode also
> support isolation between applications.
> 
> 
>> I get this but it requires applications to be isolated. The pages from
>> a queue can not be shared between multiple applications in different
>> trust domains. And the application has to be cooperative meaning it
>> can't "look" at data that has not been marked by the stack as OK. In
>> these schemes we tend to end up with something like virtio/vhost or
>> af_packet.
> 
> I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first
> two would require CAP_NET_ADMIN privileges.  All modes have a trust
> domain id, that need to match e.g. when page reach the socket.

Even mode 3 should required cap_net_admin we don't want userspace to
grab queues off the nic without it IMO.

> 
> Mode-1 "Shared": Application choose lowest isolation level, allowing
>  multiple application to mmap VMA area.

My only point here is applications can read each others data and all
applications need to cooperate for example one app could try to write
continuously to read only pages causing faults and what not. This is
all non standard and doesn't play well with cgroups and "normal"
applications. It requires a new orchestration model.

I'm a bit skeptical of the use case but I know of a handful of reasons
to use this model. Maybe take a look at the ivshmem implementation in
DPDK.

Also this still requires a hardware filter to push "application" traffic
onto reserved queues/pages as far as I can tell.

> 
> Mode-2 "Single-user": Application request it want to be the only user
>  of the RX queue.  This blocks other application to mmap VMA area.
> 

Assuming data is read-only sharing with the stack is possibly OK :/. I
guess you would need to pools of memory for data and skb so you don't
leak skb into user space.

The devils in the details here. There are lots of hooks in the kernel
that can for example push the packet with a 'redirect' tc action for
example. And letting an app "read" data or impact performance of an
unrelated application is wrong IMO. Stacked devices also provide another
set of details that are a bit difficult to track down see all the
hardware offload efforts.

I assume all these concerns are shared between mode-1 and mode-2

> Mode-3 "Exclusive": Application request to own RX queue.  Packets are
>  no longer allowed for normal netstack delivery.
> 

I have patches for this mode already but haven't pushed them due to
an alternative solution using VFIO.

> Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
> still allowed to travel netstack and thus can contain packet data from
> other normal applications.  This is part of the design, to share the
> NIC between netstack and an accelerated userspace application using RX
> zero-copy delivery.
> 

I don't think this is acceptable to be honest. Letting an application
potentially read/impact other arbitrary applications on the system
seems like a non-starter even with CAP_NET_ADMIN. At least this was
the conclusion from bifurcated driver work some time ago.

> 
>> Any ACLs/filtering/switching/headers need to be done in hardware or
>> the application trust boundaries are broken.
> 
> The software solution outlined allow the application to make the choice
> of what trust boundary it wants.
> 
> The "exclusive" mode-3 make most sense together with HW filters.
> Already today, we support creating a new RX queue based on ethtool
> ntuple HW filter and then you simply attach your application that queue
> in mode-3, and have full isolation.
> 

Still pretty fuzzy on why mode-1 and mode-2 do not need hw filters?
Without hardware filters we have no way of knowing who/what data is
put in the page.

>  
>> If the above can not be met then a copy is needed. What I am trying
>> to tease out is the above comment along with other statements like
>> this "can be done with out HW filter features".
> 
> Does this address your concerns?
> 

I think we need to enforce strong isolation. An application should not
be able to read data or impact other applications. I gather this is
the case per comment about normal applications in mode-2. A slightly
weaker statement would be to say applications can only impace/read data
of other applications in their domain. This might be OK as well.

.John

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 4/3] random: use siphash24 instead of md5 for get_random_int/long
From: Theodore Ts'o @ 2016-12-14 16:37 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Netdev, David Miller, Linus Torvalds,
	kernel-hardening@lists.openwall.com, LKML, George Spelvin,
	Scott Bauer, Andi Kleen, Andy Lutomirski, Greg KH, Eric Biggers,
	linux-crypto, Jean-Philippe Aumasson
In-Reply-To: <20161214031037.25498-1-Jason@zx2c4.com>

On Wed, Dec 14, 2016 at 04:10:37AM +0100, Jason A. Donenfeld wrote:
> This duplicates the current algorithm for get_random_int/long, but uses
> siphash24 instead. This comes with several benefits. It's certainly
> faster and more cryptographically secure than MD5. This patch also
> hashes the pid, entropy, and timestamp as fixed width fields, in order
> to increase diffusion.
> 
> The previous md5 algorithm used a per-cpu md5 state, which caused
> successive calls to the function to chain upon each other. While it's
> not entirely clear that this kind of chaining is absolutely necessary
> when using a secure PRF like siphash24, it can't hurt, and the timing of
> the call chain does add a degree of natural entropy. So, in keeping with
> this design, instead of the massive per-cpu 64-byte md5 state, there is
> instead a per-cpu previously returned value for chaining.
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>

The original reason for get_random_int was because the original
urandom algorithms were too slow.  When we moved to chacha20, which is
must faster, I didn't think to revisit get_random_int() and
get_random_long().

One somewhat undesirable aspect of the current algorithm is that we
never change random_int_secret.  So I've been toying with the
following, which is 4 times faster than md5.  (I haven't tried
benchmarking against siphash yet.)

[    3.606139] random benchmark!!
[    3.606276] get_random_int # cycles: 326578
[    3.606317] get_random_int_new # cycles: 95438
[    3.607423] get_random_bytes # cycles: 2653388

     	       			  	  - Ted

P.S.  It's interesting to note that siphash24 and chacha20 are both
add-rotate-xor based algorithms.

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d6876d506220..be172ea75799 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1681,6 +1681,38 @@ static int rand_initialize(void)
 }
 early_initcall(rand_initialize);
 
+static unsigned int get_random_int_new(void);
+
+static int rand_benchmark(void)
+{
+	cycles_t start,finish;
+	int i, out;
+
+	pr_crit("random benchmark!!\n");
+	start = get_cycles();
+	for (i = 0; i < 1000; i++) {
+		get_random_int();
+	}
+	finish = get_cycles();
+	pr_err("get_random_int # cycles: %llu\n", finish - start);
+
+	start = get_cycles();
+	for (i = 0; i < 1000; i++) {
+		get_random_int_new();
+	}
+	finish = get_cycles();
+	pr_err("get_random_int_new # cycles: %llu\n", finish - start);
+
+	start = get_cycles();
+	for (i = 0; i < 1000; i++) {
+		get_random_bytes(&out, sizeof(out));
+	}
+	finish = get_cycles();
+	pr_err("get_random_bytes # cycles: %llu\n", finish - start);
+	return 0;
+}
+device_initcall(rand_benchmark);
+
 #ifdef CONFIG_BLOCK
 void rand_initialize_disk(struct gendisk *disk)
 {
@@ -2064,8 +2096,10 @@ unsigned int get_random_int(void)
 	__u32 *hash;
 	unsigned int ret;
 
+#if 0	// force slow path
 	if (arch_get_random_int(&ret))
 		return ret;
+#endif
 
 	hash = get_cpu_var(get_random_int_hash);
 
@@ -2100,6 +2134,38 @@ unsigned long get_random_long(void)
 }
 EXPORT_SYMBOL(get_random_long);
 
+struct random_buf {
+	__u8 buf[CHACHA20_BLOCK_SIZE];
+	int ptr;
+};
+
+static DEFINE_PER_CPU(struct random_buf, batched_entropy);
+
+static void get_batched_entropy(void *buf, int n)
+{
+	struct random_buf *p;
+
+	p = &get_cpu_var(batched_entropy);
+
+	if ((p->ptr == 0) ||
+	    (p->ptr + n >= CHACHA20_BLOCK_SIZE)) {
+		extract_crng(p->buf);
+		p->ptr = 0;
+	}
+	BUG_ON(n > CHACHA20_BLOCK_SIZE);
+	memcpy(buf, p->buf, n);
+	p->ptr += n;
+	put_cpu_var(batched_entropy);
+}
+
+static unsigned int get_random_int_new(void)
+{
+	int	ret;
+
+	get_batched_entropy(&ret, sizeof(ret));
+	return ret;
+}
+
 /**
  * randomize_page - Generate a random, page aligned address
  * @start:	The smallest acceptable address the caller will take.

^ permalink raw reply related

* Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE
From: Theodore Ts'o @ 2016-12-14 16:42 UTC (permalink / raw)
  To: Wei Xu; +Cc: jasowang, netdev, mst, nhorman, davem
In-Reply-To: <95568017-e7b1-df95-ec49-f38ae8eb1b14@redhat.com>

On Wed, Dec 14, 2016 at 12:24:43PM +0800, Wei Xu wrote:
> 
> BTW, although this is a guest issue, is there anyway to view the GCE
> host kernel or qemu(if it is) version?

No, there isn't, as far as I know.

    	  	    	     - Ted

^ permalink raw reply

* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Alexander Duyck @ 2016-12-14 16:45 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jesper Dangaard Brouer, David Miller, Christoph Lameter, rppt,
	Netdev, linux-mm, willemdebruijn.kernel, Björn Töpel,
	magnus.karlsson, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Brandeburg, Jesse, METH,
	Vlad Yasevich
In-Reply-To: <5851740A.2080806@gmail.com>

On Wed, Dec 14, 2016 at 8:32 AM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
>> On Tue, 13 Dec 2016 12:08:21 -0800
>> John Fastabend <john.fastabend@gmail.com> wrote:
>>
>>> On 16-12-13 11:53 AM, David Miller wrote:
>>>> From: John Fastabend <john.fastabend@gmail.com>
>>>> Date: Tue, 13 Dec 2016 09:43:59 -0800
>>>>
>>>>> What does "zero-copy send packet-pages to the application/socket that
>>>>> requested this" mean? At the moment on x86 page-flipping appears to be
>>>>> more expensive than memcpy (I can post some data shortly) and shared
>>>>> memory was proposed and rejected for security reasons when we were
>>>>> working on bifurcated driver.
>>>>
>>>> The whole idea is that we map all the active RX ring pages into
>>>> userspace from the start.
>>>>
>>>> And just how Jesper's page pool work will avoid DMA map/unmap,
>>>> it will also avoid changing the userspace mapping of the pages
>>>> as well.
>>>>
>>>> Thus avoiding the TLB/VM overhead altogether.
>>>>
>>
>> Exactly.  It is worth mentioning that pages entering the page pool need
>> to be cleared (measured cost 143 cycles), in order to not leak any
>> kernel info.  The primary focus of this design is to make sure not to
>> leak kernel info to userspace, but with an "exclusive" mode also
>> support isolation between applications.
>>
>>
>>> I get this but it requires applications to be isolated. The pages from
>>> a queue can not be shared between multiple applications in different
>>> trust domains. And the application has to be cooperative meaning it
>>> can't "look" at data that has not been marked by the stack as OK. In
>>> these schemes we tend to end up with something like virtio/vhost or
>>> af_packet.
>>
>> I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first
>> two would require CAP_NET_ADMIN privileges.  All modes have a trust
>> domain id, that need to match e.g. when page reach the socket.
>
> Even mode 3 should required cap_net_admin we don't want userspace to
> grab queues off the nic without it IMO.
>
>>
>> Mode-1 "Shared": Application choose lowest isolation level, allowing
>>  multiple application to mmap VMA area.
>
> My only point here is applications can read each others data and all
> applications need to cooperate for example one app could try to write
> continuously to read only pages causing faults and what not. This is
> all non standard and doesn't play well with cgroups and "normal"
> applications. It requires a new orchestration model.
>
> I'm a bit skeptical of the use case but I know of a handful of reasons
> to use this model. Maybe take a look at the ivshmem implementation in
> DPDK.
>
> Also this still requires a hardware filter to push "application" traffic
> onto reserved queues/pages as far as I can tell.
>
>>
>> Mode-2 "Single-user": Application request it want to be the only user
>>  of the RX queue.  This blocks other application to mmap VMA area.
>>
>
> Assuming data is read-only sharing with the stack is possibly OK :/. I
> guess you would need to pools of memory for data and skb so you don't
> leak skb into user space.
>
> The devils in the details here. There are lots of hooks in the kernel
> that can for example push the packet with a 'redirect' tc action for
> example. And letting an app "read" data or impact performance of an
> unrelated application is wrong IMO. Stacked devices also provide another
> set of details that are a bit difficult to track down see all the
> hardware offload efforts.
>
> I assume all these concerns are shared between mode-1 and mode-2
>
>> Mode-3 "Exclusive": Application request to own RX queue.  Packets are
>>  no longer allowed for normal netstack delivery.
>>
>
> I have patches for this mode already but haven't pushed them due to
> an alternative solution using VFIO.
>
>> Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
>> still allowed to travel netstack and thus can contain packet data from
>> other normal applications.  This is part of the design, to share the
>> NIC between netstack and an accelerated userspace application using RX
>> zero-copy delivery.
>>
>
> I don't think this is acceptable to be honest. Letting an application
> potentially read/impact other arbitrary applications on the system
> seems like a non-starter even with CAP_NET_ADMIN. At least this was
> the conclusion from bifurcated driver work some time ago.

I agree.  This is a no-go from the performance perspective as well.
At a minimum you would have to be zeroing out the page between uses to
avoid leaking data, and that assumes that the program we are sending
the pages to is slightly well behaved.  If we think zeroing out an
sk_buff is expensive wait until we are trying to do an entire 4K page.

I think we are stuck with having to use a HW filter to split off
application traffic to a specific ring, and then having to share the
memory between the application and the kernel on that ring only.  Any
other approach just opens us up to all sorts of security concerns
since it would be possible for the application to try to read and
possibly write any data it wants into the buffers.

- Alex

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Christoph Lameter @ 2016-12-14 17:00 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, John Fastabend, Mike Rapoport,
	netdev@vger.kernel.org, linux-mm, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth, Vladislav Yasevich
In-Reply-To: <8aea213f-2739-9bd3-3a6a-668b759336ae@stressinduktion.org>

On Tue, 13 Dec 2016, Hannes Frederic Sowa wrote:

> > Interesting.  So you even imagine sockets registering memory regions
> > with the NIC.  If we had a proper NIC HW filter API across the drivers,
> > to register the steering rule (like ibv_create_flow), this would be
> > doable, but we don't (DPDK actually have an interesting proposal[1])
>
> On a side note, this is what windows does with RIO ("registered I/O").
> Maybe you want to look at the API to get some ideas: allocating and
> pinning down memory in user space and registering that with sockets to
> get zero-copy IO.

Yup that is also what I think. Regarding the memory registration and flow
steering for user space RX/TX ring please look at the qpair model
implemented by the RDMA subsystem in the kernel. The memory semantics are
clearly established there and have been in use for more than a decade.

^ permalink raw reply

* [PATCH net 0/2] net/sched: cls_flower: Fix mask handling
From: Paul Blakey @ 2016-12-14 17:00 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Jiri Pirko, Or Gerlitz, Roi Dayan, Shahar Klein, Hadar Hen Zion,
	Paul Blakey

Hi,
The series fix how the mask is being handled.
Thanks.

Paul Blakey (2):
  net/sched: cls_flower: Use mask for addr_type
  net/sched: cls_flower: Use masked key when calling HW offloads

 net/sched/cls_flower.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

-- 
1.8.3.1

^ permalink raw reply

* [PATCH net 1/2] net/sched: cls_flower: Use mask for addr_type
From: Paul Blakey @ 2016-12-14 17:00 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Jiri Pirko, Or Gerlitz, Roi Dayan, Shahar Klein, Hadar Hen Zion,
	Paul Blakey
In-Reply-To: <1481734858-37474-1-git-send-email-paulb@mellanox.com>

When addr_type is set, mask should also be set.

Fixes: 66530bdf85eb ('sched,cls_flower: set key address type when present')
Fixes: bc3103f1ed40 ('net/sched: cls_flower: Classify packet in ip tunnels')
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 net/sched/cls_flower.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index e040c51..9758f5a 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -509,6 +509,7 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 
 	if (tb[TCA_FLOWER_KEY_IPV4_SRC] || tb[TCA_FLOWER_KEY_IPV4_DST]) {
 		key->control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+		mask->control.addr_type = ~0;
 		fl_set_key_val(tb, &key->ipv4.src, TCA_FLOWER_KEY_IPV4_SRC,
 			       &mask->ipv4.src, TCA_FLOWER_KEY_IPV4_SRC_MASK,
 			       sizeof(key->ipv4.src));
@@ -517,6 +518,7 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 			       sizeof(key->ipv4.dst));
 	} else if (tb[TCA_FLOWER_KEY_IPV6_SRC] || tb[TCA_FLOWER_KEY_IPV6_DST]) {
 		key->control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
+		mask->control.addr_type = ~0;
 		fl_set_key_val(tb, &key->ipv6.src, TCA_FLOWER_KEY_IPV6_SRC,
 			       &mask->ipv6.src, TCA_FLOWER_KEY_IPV6_SRC_MASK,
 			       sizeof(key->ipv6.src));
@@ -571,6 +573,7 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 	if (tb[TCA_FLOWER_KEY_ENC_IPV4_SRC] ||
 	    tb[TCA_FLOWER_KEY_ENC_IPV4_DST]) {
 		key->enc_control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+		mask->enc_control.addr_type = ~0;
 		fl_set_key_val(tb, &key->enc_ipv4.src,
 			       TCA_FLOWER_KEY_ENC_IPV4_SRC,
 			       &mask->enc_ipv4.src,
@@ -586,6 +589,7 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 	if (tb[TCA_FLOWER_KEY_ENC_IPV6_SRC] ||
 	    tb[TCA_FLOWER_KEY_ENC_IPV6_DST]) {
 		key->enc_control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
+		mask->enc_control.addr_type = ~0;
 		fl_set_key_val(tb, &key->enc_ipv6.src,
 			       TCA_FLOWER_KEY_ENC_IPV6_SRC,
 			       &mask->enc_ipv6.src,
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net 2/2] net/sched: cls_flower: Use masked key when calling HW offloads
From: Paul Blakey @ 2016-12-14 17:00 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Jiri Pirko, Or Gerlitz, Roi Dayan, Shahar Klein, Hadar Hen Zion,
	Paul Blakey
In-Reply-To: <1481734858-37474-1-git-send-email-paulb@mellanox.com>

Zero bits on the mask signify a "don't care" on the corresponding bits
in key. Some HWs require those bits on the key to be zero. Since these
bits are masked anyway, it's okay to provide the masked key to all
drivers.

Fixes: 5b33f48842fa ('net/flower: Introduce hardware offload support')
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 net/sched/cls_flower.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 9758f5a..35ac28d 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -252,7 +252,7 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
 	offload.cookie = (unsigned long)f;
 	offload.dissector = dissector;
 	offload.mask = mask;
-	offload.key = &f->key;
+	offload.key = &f->mkey;
 	offload.exts = &f->exts;
 
 	tc->type = TC_SETUP_CLSFLOWER;
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH] IB/mlx4: avoid a -Wmaybe-uninitialize warning
From: Doug Ledford @ 2016-12-14 17:13 UTC (permalink / raw)
  To: Arnd Bergmann, Yishai Hadas
  Cc: David S. Miller, Jack Morgenstein, Or Gerlitz, Eran Ben Elisha,
	Moshe Shemesh, Christophe Jaillet, Moni Shoua,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161025161632.411899-1-arnd-r2nGTMty4D4@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 712 bytes --]

On 10/25/2016 12:16 PM, Arnd Bergmann wrote:
> There is an old warning about mlx4_SW2HW_EQ_wrapper on x86:
> 
> ethernet/mellanox/mlx4/resource_tracker.c: In function ‘mlx4_SW2HW_EQ_wrapper’:
> ethernet/mellanox/mlx4/resource_tracker.c:3071:10: error: ‘eq’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
> 
> The problem here is that gcc won't track the state of the variable
> across a spin_unlock. Moving the assignment out of the lock is
> safe here and avoids the warning.
> 
> Signed-off-by: Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org>

Thanks, applied.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG Key ID: 0E572FDD


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply

* Re: [net-next PATCH v5 1/6] net: virtio dynamically disable/enable LRO
From: John Fastabend @ 2016-12-14 17:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: daniel, shm, davem, tgraf, alexei.starovoitov, john.r.fastabend,
	netdev, brouer
In-Reply-To: <20161214152924-mutt-send-email-mst@kernel.org>

On 16-12-14 05:31 AM, Michael S. Tsirkin wrote:
> On Thu, Dec 08, 2016 at 04:04:58PM -0800, John Fastabend wrote:
>> On 16-12-08 01:36 PM, Michael S. Tsirkin wrote:
>>> On Wed, Dec 07, 2016 at 12:11:11PM -0800, John Fastabend wrote:
>>>> This adds support for dynamically setting the LRO feature flag. The
>>>> message to control guest features in the backend uses the
>>>> CTRL_GUEST_OFFLOADS msg type.
>>>>
>>>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>>>> ---

[...]

>>>>  
>>>>  static void virtnet_config_changed_work(struct work_struct *work)
>>>> @@ -1815,6 +1846,12 @@ static int virtnet_probe(struct virtio_device *vdev)
>>>>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
>>>>  		dev->features |= NETIF_F_RXCSUM;
>>>>  
>>>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) &&
>>>> +	    virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6)) {
>>>> +		dev->features |= NETIF_F_LRO;
>>>> +		dev->hw_features |= NETIF_F_LRO;
>>>
>>> So the issue is I think that the virtio "LRO" isn't really
>>> LRO, it's typically just GRO forwarded to guests.
>>> So these are easily re-split along MTU boundaries,
>>> which makes it ok to forward these across bridges.
>>>
>>> It's not nice that we don't document this in the spec,
>>> but it's the reality and people rely on this.
>>>
>>> For now, how about doing a custom thing and just disable/enable
>>> it as XDP is attached/detached?
>>
>> The annoying part about doing this is ethtool will say that it is fixed
>> yet it will be changed by seemingly unrelated operation. I'm not sure I
>> like the idea to start automatically configuring the link via xdp_set.
> 
> I really don't like the idea of dropping performance
> by a factor of 3 for people bridging two virtio net
> interfaces.
> 
> So how about a simple approach for now, just disable
> XDP if GUEST_TSO is enabled?
> 
> We can discuss better approaches in next version.
> 

So the proposal is to add a check in XDP setup so that

  if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO{4|6})
	return -ENOPSUPP;

Or whatever is the most appropriate return code? Then we can
disable TSO via qemu-system with guest_tso4=off,guest_tso6=off for
XDP use cases.

Sounds like a reasonable start to me. I'll make the change should this
go through DaveMs net-next tree or do you want it on virtio tree? Either
is fine with me.

Thanks,
John

^ permalink raw reply

* Re: [Query] Delayed vxlan socket creation?
From: Cong Wang @ 2016-12-14 17:24 UTC (permalink / raw)
  To: Du, Fan; +Cc: netdev@vger.kernel.org, mrjana@gmail.com
In-Reply-To: <5A90DA2E42F8AE43BC4A093BF06788481A9457F1@SHSMSX103.ccr.corp.intel.com>

On Tue, Dec 13, 2016 at 11:49 PM, Du, Fan <fan.du@intel.com> wrote:
> Hi
>
> I'm interested to one Docker issue[1] which looks like related to kernel vxlan socket creation
> as described in the thread. From my limited knowledge here, socket creation is synchronous ,
> and after the *socket* syscall, the sock handle will be valid and ready to linkup.

You need to read the code. vxlan tunnel is a UDP tunnel, it needs a kernel
socket (and a port) to setup UDP communication, unlike GRE tunnel etc.

>
> Somehow I'm not sure the detailed scenario here, and which/how possible commit fix?
> Thanks!
>
> Quoted analysis:
> --------------------------------------------------------------------------
> (Found in kernel 3.13)
> The issue happens because in older kernels when a vxlan interface is created,
> the socket creation is queued up in a worker thread which actually creates
> the socket. But this needs to happen before we bring up the link on the vxlan interface.
> If for some chance, the worker thread hasn't completed the creation of the socket
> before we did link up then when we do link up the kernel checks if the socket was
> created and if not it will return ENOTCONN. This was a bug in the kernel which got fixed
> in later kernels. That is why retrying with a timer fixes the issue.

This was introduced by commit 1c51a9159ddefa5119724a4c7da3fd3ef44b68d5
and later fixed by commit 56ef9c909b40483d2c8cb63fcbf83865f162d5ec.

^ permalink raw reply

* Re: [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
From: David Daney @ 2016-12-14 17:32 UTC (permalink / raw)
  To: Arvind Yadav, peter.chen, fw, david.daney; +Cc: netdev, linux-kernel
In-Reply-To: <1481732732-6892-1-git-send-email-arvind.yadav.cs@gmail.com>

On 12/14/2016 08:25 AM, Arvind Yadav wrote:
> Here, If devm_ioremap will fail. It will return NULL.
> Kernel can run into a NULL-pointer dereference.
> This error check will avoid NULL pointer dereference.
>

Have you ever seen this failure in the wild?

How was the patch tested?

Thanks,
David Daney


> Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
> ---
>  drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
> index 4ab404f..33c2fec 100644
> --- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
> +++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
> @@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct platform_device *pdev)
>  	p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
>  	p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
>  					   p->agl_prt_ctl_size);
> +	if (!p->mix || !p->agl || !p->agl_prt_ctl) {
> +		dev_err(&pdev->dev, "failed to map I/O memory\n");
> +		result = -ENOMEM;
> +		goto err;
> +	}
> +
>  	spin_lock_init(&p->lock);
>
>  	skb_queue_head_init(&p->tx_list);
>

^ permalink raw reply

* RE: Designing a safe RX-zero-copy Memory Model for Networking
From: David Laight @ 2016-12-14 17:37 UTC (permalink / raw)
  To: 'Christoph Lameter', Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, John Fastabend, Mike Rapoport,
	netdev@vger.kernel.org, linux-mm, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth, Vladislav Yasevich
In-Reply-To: <alpine.DEB.2.20.1612141059020.20959@east.gentwo.org>

From: Christoph Lameter
> Sent: 14 December 2016 17:00
> On Tue, 13 Dec 2016, Hannes Frederic Sowa wrote:
> 
> > > Interesting.  So you even imagine sockets registering memory regions
> > > with the NIC.  If we had a proper NIC HW filter API across the drivers,
> > > to register the steering rule (like ibv_create_flow), this would be
> > > doable, but we don't (DPDK actually have an interesting proposal[1])
> >
> > On a side note, this is what windows does with RIO ("registered I/O").
> > Maybe you want to look at the API to get some ideas: allocating and
> > pinning down memory in user space and registering that with sockets to
> > get zero-copy IO.
> 
> Yup that is also what I think. Regarding the memory registration and flow
> steering for user space RX/TX ring please look at the qpair model
> implemented by the RDMA subsystem in the kernel. The memory semantics are
> clearly established there and have been in use for more than a decade.

Isn't there a bigger problem for transmit?
If the kernel is doing ANY validation on the frames it must copy the
data to memory the application cannot modify before doing the validation.
Otherwise the application could change the data afterwards.

	David


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
From: Jason A. Donenfeld @ 2016-12-14 17:49 UTC (permalink / raw)
  To: David Laight
  Cc: Hannes Frederic Sowa, Netdev, kernel-hardening@lists.openwall.com,
	Andi Kleen, LKML, Linux Crypto Mailing List
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB023F7BD@AcuExch.aculab.com>

On Wed, Dec 14, 2016 at 3:47 PM, David Laight <David.Laight@aculab.com> wrote:
> Just remove the __packed and ensure that the structure is 'nice'.
> This includes ensuring there is no 'tail padding'.
> In some cases you'll need to put the port number into a 32bit field.

I'd rather not. There's no point in wasting extra cycles on hashing
useless data, just for some particular syntactic improvement. __packed
__aligned(8) will do what we want perfectly, I think.

> I'd also require that the key be aligned.

Yep, I'll certainly do this for the siphash24_aligned version in the v3.

^ permalink raw reply

* Re: [PATCH] infiniband: nes: nes_nic: use new api ethtool_{get|set}_link_ksettings
From: Doug Ledford @ 2016-12-14 17:52 UTC (permalink / raw)
  To: Philippe Reynes, faisal.latif-ral2JQCrhuEAvxtiuMwx3w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
	hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1477409387-25056-1-git-send-email-tremyfr-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 369 bytes --]

On 10/25/2016 11:29 AM, Philippe Reynes wrote:
> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes <tremyfr-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Thanks, applied.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG Key ID: 0E572FDD


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply

* Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
From: David Miller @ 2016-12-14 17:56 UTC (permalink / raw)
  To: Jason
  Cc: David.Laight, netdev, kernel-hardening, ak, linux-kernel,
	linux-crypto
In-Reply-To: <CAHmME9pEM=cDC5S=j1BU2oCF8-WdnbRfiVojcet4rXcRLcpJRw@mail.gmail.com>

From: "Jason A. Donenfeld" <Jason@zx2c4.com>
Date: Wed, 14 Dec 2016 13:53:10 +0100

> In all current uses of __packed in the code, I think the impact is
> precisely zero, because all structures have members in descending
> order of size, with each member being a perfect multiple of the one
> below it. The __packed is therefore just there for safety, in case
> somebody comes in and screws everything up by sticking a u8 in
> between.

Just marking the structure __packed, whether necessary or not, makes
the compiler assume that the members are not aligned and causes
byte-by-byte accesses to be performed for words.

Never, _ever_, use __packed unless absolutely necessary, it pessimizes
the code on cpus that require proper alignment of types.

^ permalink raw reply

* Re: [kernel-hardening] Re: [PATCH 4/3] random: use siphash24 instead of md5 for get_random_int/long
From: Jason A. Donenfeld @ 2016-12-14 17:58 UTC (permalink / raw)
  To: kernel-hardening, Theodore Ts'o, Jason A. Donenfeld, Netdev,
	David Miller, Linus Torvalds, LKML, George Spelvin, Scott Bauer,
	Andi Kleen, Andy Lutomirski, Greg KH, Eric Biggers,
	Linux Crypto Mailing List, Jean-Philippe Aumasson
In-Reply-To: <20161214163731.luj2dzmnihcuhn5p@thunk.org>

Hey Ted,

On Wed, Dec 14, 2016 at 5:37 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> One somewhat undesirable aspect of the current algorithm is that we
> never change random_int_secret.

Why exactly would this be a problem? So long as the secret is kept
secret, the PRF is secure. If an attacker can read arbitrary kernel
memory, there are much much bigger issues to be concerned about. As
well, the "chaining" variable I introduce ensures that the random
numbers are, per-cpu, related to the uniqueness of timing of
subsequent calls.

> So I've been toying with the
> following, which is 4 times faster than md5.  (I haven't tried
> benchmarking against siphash yet.)
>
> [    3.606139] random benchmark!!
> [    3.606276] get_random_int # cycles: 326578
> [    3.606317] get_random_int_new # cycles: 95438
> [    3.607423] get_random_bytes # cycles: 2653388

Cool, I'll benchmark it against the siphash implementation. I like
what you did with batching up lots of chacha output, and doling it out
bit by bit. I suspect this will be quite fast, because with chacha20
you get an entire block.

> P.S.  It's interesting to note that siphash24 and chacha20 are both
> add-rotate-xor based algorithms.

Quite! Lots of nice shiny things are turning out be be ARX -- ChaCha,
BLAKE2, Siphash, NORX. The simplicity is really appealing.

Jason

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox