* [PATCH v2] Optimize lookup of /0 xfrm policies
From: Yannick Brosseau @ 2018-08-31 22:18 UTC (permalink / raw)
To: steffen.klassert, herbert, davem, netdev
Cc: linux-kernel, kernel-team, Yannick Brosseau
Currently, all the xfrm policies that are not /32 end up in
the inexact policies linked list which take a long time to lookup.
We can optimize the case where we have a /0 prefix in the policy, which
means we can match any address to that part.
We do this by putting those policies in the direct hash table after
zeroing the address part.
At lookup time, we do an additional lookup with the packet address
and either the destination or source address zeroed out.
We still call xfrm_policy_match to validate that the packet match the
selector.
In our tests, with this optimization we reduce softirq cpu utilisation
from about 40% to 7% with 3k policies.
Signed-off-by: Yannick Brosseau <scientist@fb.com>
---
net/xfrm/xfrm_hash.h | 10 +++++
net/xfrm/xfrm_policy.c | 87 +++++++++++++++++++++++++++++++++++++++++-
2 files changed, 95 insertions(+), 2 deletions(-)
diff --git a/net/xfrm/xfrm_hash.h b/net/xfrm/xfrm_hash.h
index 61be810389d8..40997fb5336d 100644
--- a/net/xfrm/xfrm_hash.h
+++ b/net/xfrm/xfrm_hash.h
@@ -145,6 +145,16 @@ static inline unsigned int __sel_hash(const struct xfrm_selector *sel,
const xfrm_address_t *saddr = &sel->saddr;
unsigned int h = 0;
+ /* A selector with a prefixlen of zero can basically be ignored in
+ * the matching. To speed up the lookup, let's hash it without those
+ * component. In the lookup, we'll do an additional check for a zero
+ * daddr and a zero saddr.
+ */
+ if (sel->prefixlen_d == 0)
+ dbits = 0;
+ if (sel->prefixlen_s == 0)
+ sbits = 0;
+
switch (family) {
case AF_INET:
if (sel->prefixlen_d < dbits ||
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 3110c3fbee20..a1ca78900ffc 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1096,8 +1096,10 @@ static struct xfrm_policy *xfrm_policy_lookup_bytype(struct net *net, u8 type,
int err;
struct xfrm_policy *pol, *ret;
const xfrm_address_t *daddr, *saddr;
+ static const xfrm_address_t zero_addr = {0};
+
struct hlist_head *chain;
- unsigned int sequence;
+ unsigned int sequence, first_sequence;
u32 priority;
daddr = xfrm_flowi_daddr(fl, family);
@@ -1112,6 +1114,7 @@ static struct xfrm_policy *xfrm_policy_lookup_bytype(struct net *net, u8 type,
chain = policy_hash_direct(net, daddr, saddr, family, dir);
} while (read_seqcount_retry(&xfrm_policy_hash_generation, sequence));
+ first_sequence = sequence;
priority = ~0U;
ret = NULL;
hlist_for_each_entry_rcu(pol, chain, bydst) {
@@ -1129,6 +1132,86 @@ static struct xfrm_policy *xfrm_policy_lookup_bytype(struct net *net, u8 type,
break;
}
}
+
+ /* Do an additional lookup for saddr == 0, since we stored source
+ * selector with a prefix len of 0 that way in the bydst hash
+ */
+ do {
+ sequence = read_seqcount_begin(&xfrm_policy_hash_generation);
+ chain = policy_hash_direct(net, daddr, &zero_addr, family, dir);
+ } while (read_seqcount_retry(&xfrm_policy_hash_generation, sequence));
+
+ hlist_for_each_entry_rcu(pol, chain, bydst) {
+ if ((pol->priority >= priority) && ret)
+ break;
+
+ err = xfrm_policy_match(pol, fl, type, family, dir, if_id);
+ if (err) {
+ if (err == -ESRCH)
+ continue;
+ else {
+ ret = ERR_PTR(err);
+ goto fail;
+ }
+ } else {
+ ret = pol;
+ priority = ret->priority;
+ break;
+ }
+ }
+
+ /* Do an additional lookup for daddr == 0, since we stored dest
+ * selector with a prefix len of 0 that way in the bydst hash
+ */
+ do {
+ sequence = read_seqcount_begin(&xfrm_policy_hash_generation);
+ chain = policy_hash_direct(net, &zero_addr, saddr, family, dir);
+ } while (read_seqcount_retry(&xfrm_policy_hash_generation, sequence));
+
+ hlist_for_each_entry_rcu(pol, chain, bydst) {
+ if ((pol->priority >= priority) && ret)
+ break;
+
+ err = xfrm_policy_match(pol, fl, type, family, dir, if_id);
+ if (err) {
+ if (err == -ESRCH)
+ continue;
+ else {
+ ret = ERR_PTR(err);
+ goto fail;
+ }
+ } else {
+ ret = pol;
+ priority = ret->priority;
+ break;
+ }
+ }
+
+ /* Do an additional lookup for both saddr and daddr == 0 */
+ do {
+ sequence = read_seqcount_begin(&xfrm_policy_hash_generation);
+ chain = policy_hash_direct(net, &zero_addr, &zero_addr, family, dir);
+ } while (read_seqcount_retry(&xfrm_policy_hash_generation, sequence));
+
+ hlist_for_each_entry_rcu(pol, chain, bydst) {
+ if ((pol->priority >= priority) && ret)
+ break;
+
+ err = xfrm_policy_match(pol, fl, type, family, dir, if_id);
+ if (err) {
+ if (err == -ESRCH)
+ continue;
+ else {
+ ret = ERR_PTR(err);
+ goto fail;
+ }
+ } else {
+ ret = pol;
+ priority = ret->priority;
+ break;
+ }
+ }
+
chain = &net->xfrm.policy_inexact[dir];
hlist_for_each_entry_rcu(pol, chain, bydst) {
if ((pol->priority >= priority) && ret)
@@ -1148,7 +1231,7 @@ static struct xfrm_policy *xfrm_policy_lookup_bytype(struct net *net, u8 type,
}
}
- if (read_seqcount_retry(&xfrm_policy_hash_generation, sequence))
+ if (read_seqcount_retry(&xfrm_policy_hash_generation, first_sequence))
goto retry;
if (ret && !xfrm_pol_hold_rcu(ret))
--
2.18.0
^ permalink raw reply related
* [RFC PATCH bpf-next v2 4/4] selftests/bpf: add test cases for queue and stack maps
From: Mauricio Vasquez B @ 2018-08-31 21:48 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, netdev
Two types of tests are done:
- test_maps: only userspace api.
- test_progs: userspace api and ebpf helpers.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
tools/lib/bpf/bpf.c | 12 ++
tools/lib/bpf/bpf.h | 1
tools/testing/selftests/bpf/Makefile | 2
tools/testing/selftests/bpf/bpf_helpers.h | 7 +
tools/testing/selftests/bpf/test_maps.c | 101 ++++++++++++++++++++
tools/testing/selftests/bpf/test_progs.c | 99 ++++++++++++++++++++
tools/testing/selftests/bpf/test_queue_map.c | 4 +
tools/testing/selftests/bpf/test_queue_stack_map.h | 59 ++++++++++++
tools/testing/selftests/bpf/test_stack_map.c | 4 +
9 files changed, 288 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/bpf/test_queue_map.c
create mode 100644 tools/testing/selftests/bpf/test_queue_stack_map.h
create mode 100644 tools/testing/selftests/bpf/test_stack_map.c
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 60aa4ca8b2c5..7056b2eb554d 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -286,6 +286,18 @@ int bpf_map_lookup_elem(int fd, const void *key, void *value)
return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
+int bpf_map_lookup_and_delete_elem(int fd, const void *key, const void *value)
+{
+ union bpf_attr attr;
+
+ bzero(&attr, sizeof(attr));
+ attr.map_fd = fd;
+ attr.key = ptr_to_u64(key);
+ attr.value = ptr_to_u64(value);
+
+ return sys_bpf(BPF_MAP_LOOKUP_AND_DELETE_ELEM, &attr, sizeof(attr));
+}
+
int bpf_map_delete_elem(int fd, const void *key)
{
union bpf_attr attr;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 6f38164b2618..6134ed9517d3 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -86,6 +86,7 @@ int bpf_map_update_elem(int fd, const void *key, const void *value,
__u64 flags);
int bpf_map_lookup_elem(int fd, const void *key, void *value);
+int bpf_map_lookup_and_delete_elem(int fd, const void *key, const void *value);
int bpf_map_delete_elem(int fd, const void *key);
int bpf_map_get_next_key(int fd, const void *key, void *next_key);
int bpf_obj_pin(int fd, const char *pathname);
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index fff7fb1285fc..3c773a66aa5f 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
- test_skb_cgroup_id_kern.o
+ test_skb_cgroup_id_kern.o test_queue_map.o test_stack_map.o
# Order correspond to 'make run_tests' order
TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index e4be7730222d..05fb5ed90b89 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -16,6 +16,13 @@ static int (*bpf_map_update_elem)(void *map, void *key, void *value,
(void *) BPF_FUNC_map_update_elem;
static int (*bpf_map_delete_elem)(void *map, void *key) =
(void *) BPF_FUNC_map_delete_elem;
+static int (*bpf_map_push_elem)(void *map, void *value,
+ unsigned long long flags) =
+ (void *) BPF_FUNC_map_push_elem;
+static void *(*bpf_map_pop_elem)(void *map) =
+ (void *) BPF_FUNC_map_pop_elem;
+static void *(*bpf_map_peek_elem)(void *map) =
+ (void *) BPF_FUNC_map_peek_elem;
static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
(void *) BPF_FUNC_probe_read;
static unsigned long long (*bpf_ktime_get_ns)(void) =
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 6f54f84144a0..754871c7c8b4 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -15,6 +15,7 @@
#include <string.h>
#include <assert.h>
#include <stdlib.h>
+#include <time.h>
#include <sys/wait.h>
#include <sys/socket.h>
@@ -471,6 +472,102 @@ static void test_devmap(int task, void *data)
close(fd);
}
+static void test_queuemap(int task, void *data)
+{
+ const int MAP_SIZE = 32;
+ __u32 vals[MAP_SIZE + MAP_SIZE/2], val;
+ int fd, i;
+
+ /* Fill test values to be used */
+ for (i = 0; i < MAP_SIZE + MAP_SIZE/2; i++)
+ vals[i] = rand();
+
+ fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0, sizeof(val), MAP_SIZE,
+ map_flags);
+ if (fd < 0) {
+ printf("Failed to create queuemap '%s'!\n", strerror(errno));
+ exit(1);
+ }
+
+ /* Push MAP_SIZE elements */
+ for (i = 0; i < MAP_SIZE; i++)
+ assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
+
+ /* Check that element cannot be pushed due to max_entries limit */
+ assert(bpf_map_update_elem(fd, NULL, &val, 0) == -1 &&
+ errno == E2BIG);
+
+ /* Peek element */
+ assert(bpf_map_lookup_elem(fd, NULL, &val) == 0 && val == vals[0]);
+
+ /* Replace half elements */
+ for (i = MAP_SIZE; i < MAP_SIZE + MAP_SIZE/2; i++)
+ assert(bpf_map_update_elem(fd, NULL, &vals[i], BPF_EXIST) == 0);
+
+ /* Pop all elements */
+ for (i = MAP_SIZE/2; i < MAP_SIZE + MAP_SIZE/2; i++)
+ assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
+ val == vals[i]);
+
+ /* Check that there are not elements left */
+ assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == -1 &&
+ errno == ENOENT);
+
+ /* Check that non supported functions set errno to EINVAL */
+ assert(bpf_map_delete_elem(fd, NULL) == -1 && errno == EINVAL);
+ assert(bpf_map_get_next_key(fd, NULL, NULL) == -1 && errno == EINVAL);
+
+ close(fd);
+}
+
+static void test_stackmap(int task, void *data)
+{
+ const int MAP_SIZE = 32;
+ __u32 vals[MAP_SIZE + MAP_SIZE/2], val;
+ int fd, i;
+
+ /* Fill test values to be used */
+ for (i = 0; i < MAP_SIZE + MAP_SIZE/2; i++)
+ vals[i] = rand();
+
+ fd = bpf_create_map(BPF_MAP_TYPE_STACK, 0, sizeof(val), MAP_SIZE,
+ map_flags);
+ if (fd < 0) {
+ printf("Failed to create stackmap '%s'!\n", strerror(errno));
+ exit(1);
+ }
+
+ /* Push MAP_SIZE elements */
+ for (i = 0; i < MAP_SIZE; i++)
+ assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
+
+ /* Check that element cannot be pushed due to max_entries limit */
+ assert(bpf_map_update_elem(fd, NULL, &val, 0) == -1 &&
+ errno == E2BIG);
+
+ /* Peek element */
+ assert(bpf_map_lookup_elem(fd, NULL, &val) == 0 && val == vals[i - 1]);
+
+ /* Replace half elements */
+ for (i = MAP_SIZE; i < MAP_SIZE + MAP_SIZE/2; i++)
+ assert(bpf_map_update_elem(fd, NULL, &vals[i], BPF_EXIST) == 0);
+
+ /* Pop all elements */
+ for (i = MAP_SIZE + MAP_SIZE/2 - 1; i >= MAP_SIZE/2; i--)
+ assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
+ val == vals[i]);
+
+ /* Check that there are not elements left */
+ assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == -1 &&
+ errno == ENOENT);
+
+ /* Check that non supported functions set errno to EINVAL */
+ assert(bpf_map_delete_elem(fd, NULL) == -1 && errno == EINVAL);
+ assert(bpf_map_get_next_key(fd, NULL, NULL) == -1 && errno == EINVAL);
+
+ close(fd);
+}
+
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <arpa/inet.h>
@@ -1420,7 +1517,9 @@ static void run_all_tests(void)
test_arraymap_percpu_many_keys();
test_devmap(0, NULL);
+ test_queuemap(0, NULL);
test_sockmap(0, NULL);
+ test_stackmap(0, NULL);
test_map_large();
test_map_parallel();
@@ -1434,6 +1533,8 @@ static void run_all_tests(void)
int main(void)
{
+ srand(time(NULL));
+
map_flags = 0;
run_all_tests();
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 0ef68204c84b..ae719348d858 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1698,8 +1698,105 @@ static void test_task_fd_query_tp(void)
"sys_enter_read");
}
+enum {
+ QUEUE,
+ STACK,
+};
+
+static void test_queue_stack_map(int type)
+{
+ const int MAP_SIZE = 32;
+ __u32 vals[MAP_SIZE], duration, retval, size, val;
+ int i, err, prog_fd, map_in_fd, map_out_fd;
+ char file[32], buf[128];
+ struct bpf_object *obj;
+ struct iphdr *iph = (void *)buf + sizeof(struct ethhdr);
+
+ /* Fill test values to be used */
+ for (i = 0; i < MAP_SIZE; i++)
+ vals[i] = rand();
+
+ if (type == QUEUE)
+ strncpy(file, "./test_queue_map.o", sizeof(file));
+ else if (type == STACK)
+ strncpy(file, "./test_stack_map.o", sizeof(file));
+ else
+ return;
+
+ err = bpf_prog_load(file, BPF_PROG_TYPE_SCHED_CLS, &obj, &prog_fd);
+ if (err) {
+ error_cnt++;
+ return;
+ }
+
+ map_in_fd = bpf_find_map(__func__, obj, "map_in");
+ if (map_in_fd < 0)
+ goto out;
+
+ map_out_fd = bpf_find_map(__func__, obj, "map_out");
+ if (map_out_fd < 0)
+ goto out;
+
+ /* Push 32 elements to the input map */
+ for (i = 0; i < MAP_SIZE; i++) {
+ err = bpf_map_update_elem(map_in_fd, NULL, &vals[i], 0);
+ if (err) {
+ error_cnt++;
+ goto out;
+ }
+ }
+
+ /* The eBPF program pushes iph.saddr in the output map,
+ * pops the input map and saves this value in iph.daddr
+ */
+ for (i = 0; i < MAP_SIZE; i++) {
+ if (type == QUEUE) {
+ val = vals[i];
+ pkt_v4.iph.saddr = vals[i] * 5;
+ } else if (type == STACK) {
+ val = vals[MAP_SIZE - 1 - i];
+ pkt_v4.iph.saddr = vals[MAP_SIZE - 1 - i] * 5;
+ }
+
+ err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
+ buf, &size, &retval, &duration);
+ if (err || retval || size != sizeof(pkt_v4) ||
+ iph->daddr != val)
+ break;
+ }
+
+ CHECK(err || retval || size != sizeof(pkt_v4) || iph->daddr != val,
+ "bpf_map_pop_elem",
+ "err %d errno %d retval %d size %d iph->daddr %u\n",
+ err, errno, retval, size, iph->daddr);
+
+ /* queue is empty, program should return TC_ACT_SHOT */
+ err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
+ buf, &size, &retval, &duration);
+ CHECK(err || retval != 2 /* TC_ACT_SHOT */|| size != sizeof(pkt_v4),
+ "check-queue-stack-map-empty",
+ "err %d errno %d retval %d size %d\n",
+ err, errno, retval, size);
+
+ /* check that the program pushed elements correctly */
+ for (i = 0; i < MAP_SIZE; i++) {
+ err = bpf_map_lookup_and_delete_elem(map_out_fd, NULL, &val);
+ if (err || val != vals[i] * 5)
+ break;
+ }
+
+ CHECK(i != MAP_SIZE && (err || val != vals[i] * 5),
+ "bpf_map_push_elem", "err %d value %u\n", err, val);
+
+out:
+ pkt_v4.iph.saddr = 0;
+ bpf_object__close(obj);
+}
+
int main(void)
{
+ srand(time(NULL));
+
jit_enabled = is_jit_enabled();
test_pkt_access();
@@ -1719,6 +1816,8 @@ int main(void)
test_get_stack_raw_tp();
test_task_fd_query_rawtp();
test_task_fd_query_tp();
+ test_queue_stack_map(QUEUE);
+ test_queue_stack_map(STACK);
printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
diff --git a/tools/testing/selftests/bpf/test_queue_map.c b/tools/testing/selftests/bpf/test_queue_map.c
new file mode 100644
index 000000000000..3fdb3b9cb038
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_queue_map.c
@@ -0,0 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2017 Politecnico di Torino
+#define MAP_TYPE BPF_MAP_TYPE_QUEUE
+#include "test_queue_stack_map.h"
diff --git a/tools/testing/selftests/bpf/test_queue_stack_map.h b/tools/testing/selftests/bpf/test_queue_stack_map.h
new file mode 100644
index 000000000000..2f86081753a1
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_queue_stack_map.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+// Copyright (c) 2017 Politecnico di Torino
+#include <stddef.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include <linux/pkt_cls.h>
+#include "bpf_helpers.h"
+
+int _version SEC("version") = 1;
+
+struct bpf_map_def __attribute__ ((section("maps"), used)) map_in = {
+ .type = MAP_TYPE,
+ .key_size = 0,
+ .value_size = sizeof(__u32),
+ .max_entries = 32,
+ .map_flags = 0,
+};
+
+struct bpf_map_def __attribute__ ((section("maps"), used)) map_out = {
+ .type = MAP_TYPE,
+ .key_size = 0,
+ .value_size = sizeof(__u32),
+ .max_entries = 32,
+ .map_flags = 0,
+};
+
+SEC("test")
+int _test(struct __sk_buff *skb)
+{
+ void *data_end = (void *)(long)skb->data_end;
+ void *data = (void *)(long)skb->data;
+ struct ethhdr *eth = (struct ethhdr *)(data);
+ int err;
+
+ if (eth + 1 > data_end)
+ return TC_ACT_SHOT;
+
+ struct iphdr *iph = (struct iphdr *)(eth + 1);
+
+ if (iph + 1 > data_end)
+ return TC_ACT_SHOT;
+
+ __u32 *in = bpf_map_pop_elem(&map_in);
+
+ if (!in)
+ return TC_ACT_SHOT;
+
+ iph->daddr = *in;
+
+ err = bpf_map_push_elem(&map_out, &iph->saddr, 0);
+ if (err)
+ return TC_ACT_SHOT;
+
+ return TC_ACT_OK;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_stack_map.c b/tools/testing/selftests/bpf/test_stack_map.c
new file mode 100644
index 000000000000..5be9159b2927
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_stack_map.c
@@ -0,0 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2017 Politecnico di Torino
+#define MAP_TYPE BPF_MAP_TYPE_STACK
+#include "test_queue_stack_map.h"
^ permalink raw reply related
* [RFC PATCH bpf-next v2 3/4] Sync uapi/bpf.h to tools/include
From: Mauricio Vasquez B @ 2018-08-31 21:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, netdev
Sync both files.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
tools/include/uapi/linux/bpf.h | 36 ++++++++++++++++++++++++++++++++----
1 file changed, 32 insertions(+), 4 deletions(-)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4eba27..0a5b904ba42f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -103,6 +103,7 @@ enum bpf_cmd {
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
BPF_TASK_FD_QUERY,
+ BPF_MAP_LOOKUP_AND_DELETE_ELEM,
};
enum bpf_map_type {
@@ -127,6 +128,8 @@ enum bpf_map_type {
BPF_MAP_TYPE_SOCKHASH,
BPF_MAP_TYPE_CGROUP_STORAGE,
BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
+ BPF_MAP_TYPE_QUEUE,
+ BPF_MAP_TYPE_STACK,
};
enum bpf_prog_type {
@@ -459,6 +462,28 @@ union bpf_attr {
* Return
* 0 on success, or a negative error in case of failure.
*
+ * int bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
+ * Description
+ * Push an element *value* in *map*. *flags* is one of:
+ *
+ * **BPF_EXIST**
+ * If the queue/stack is full, the oldest element is removed to
+ * make room for this.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * void *bpf_map_pop_elem(struct bpf_map *map)
+ * Description
+ * Pop an element from *map*.
+ * Return
+ * Pointer to the element of *NULL* if there is not any.
+ *
+ * void *bpf_map_peek_elem(struct bpf_map *map)
+ * Description
+ * Return an element from *map* without removing it.
+ * Return
+ * Pointer to the element of *NULL* if there is not any.
+ *
* int bpf_probe_read(void *dst, u32 size, const void *src)
* Description
* For tracing programs, safely attempt to read *size* bytes from
@@ -786,14 +811,14 @@ union bpf_attr {
*
* int ret;
* struct bpf_tunnel_key key = {};
- *
+ *
* ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
* if (ret < 0)
* return TC_ACT_SHOT; // drop packet
- *
+ *
* if (key.remote_ipv4 != 0x0a000001)
* return TC_ACT_SHOT; // drop packet
- *
+ *
* return TC_ACT_OK; // accept packet
*
* This interface can also be used with all encapsulation devices
@@ -2226,7 +2251,10 @@ union bpf_attr {
FN(get_current_cgroup_id), \
FN(get_local_storage), \
FN(sk_select_reuseport), \
- FN(skb_ancestor_cgroup_id),
+ FN(skb_ancestor_cgroup_id), \
+ FN(map_push_elem), \
+ FN(map_pop_elem), \
+ FN(map_peek_elem),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
^ permalink raw reply related
* [PATCH net] bpf: Fix bpf_msg_pull_data()
From: Tushar Dave @ 2018-08-31 21:45 UTC (permalink / raw)
To: john.fastabend, ast, daniel, davem, netdev; +Cc: sowmini.varadhan
Helper bpf_msg_pull_data() mistakenly reuses variable 'offset' while
linearizing multiple scatterlist elements. Variable 'offset' is used
to find first starting scatterlist element
i.e. msg->data = sg_virt(&sg[first_sg]) + start - offset"
Use different variable name while linearizing multiple scatterlist
elements so that value contained in variable 'offset' won't get
overwritten.
Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
---
net/core/filter.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index 2c7801f..aecdeba 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2292,7 +2292,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
BPF_CALL_4(bpf_msg_pull_data,
struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
{
- unsigned int len = 0, offset = 0, copy = 0;
+ unsigned int len = 0, offset = 0, copy = 0, poffset = 0;
int bytes = end - start, bytes_sg_total;
struct scatterlist *sg = msg->sg_data;
int first_sg, last_sg, i, shift;
@@ -2348,16 +2348,15 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
if (unlikely(!page))
return -ENOMEM;
p = page_address(page);
- offset = 0;
i = first_sg;
do {
from = sg_virt(&sg[i]);
len = sg[i].length;
- to = p + offset;
+ to = p + poffset;
memcpy(to, from, len);
- offset += len;
+ poffset += len;
sg[i].length = 0;
put_page(sg_page(&sg[i]));
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH net] ebpf: fix bpf_msg_pull_data
From: Tushar Dave @ 2018-08-31 21:41 UTC (permalink / raw)
To: Daniel Borkmann, john.fastabend, ast, davem, netdev; +Cc: sowmini.varadhan
In-Reply-To: <1fa58735-f87b-be0a-7119-2b45277ad064@iogearbox.net>
On 08/31/2018 05:15 AM, Daniel Borkmann wrote:
> On 08/31/2018 10:37 AM, Tushar Dave wrote:
>> On 08/30/2018 12:20 AM, Daniel Borkmann wrote:
>>> On 08/30/2018 02:21 AM, Tushar Dave wrote:
>>>> On 08/29/2018 05:07 PM, Tushar Dave wrote:
>>>>> While doing some preliminary testing it is found that bpf helper
>>>>> bpf_msg_pull_data does not calculate the data and data_end offset
>>>>> correctly. Fix it!
>>>>>
>>>>> Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
>>>>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>>>>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>>>>> ---
>>>>> net/core/filter.c | 38 +++++++++++++++++++++++++-------------
>>>>> 1 file changed, 25 insertions(+), 13 deletions(-)
>>>>>
>>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>>> index c25eb36..3eeb3d6 100644
>>>>> --- a/net/core/filter.c
>>>>> +++ b/net/core/filter.c
>>>>> @@ -2285,7 +2285,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>>>> BPF_CALL_4(bpf_msg_pull_data,
>>>>> struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
>>>>> {
>>>>> - unsigned int len = 0, offset = 0, copy = 0;
>>>>> + unsigned int len = 0, offset = 0, copy = 0, off = 0;
>>>>> struct scatterlist *sg = msg->sg_data;
>>>>> int first_sg, last_sg, i, shift;
>>>>> unsigned char *p, *to, *from;
>>>>> @@ -2299,22 +2299,30 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>>>> i = msg->sg_start;
>>>>> do {
>>>>> len = sg[i].length;
>>>>> - offset += len;
>>>>> if (start < offset + len)
>>>>> break;
>>>>> + offset += len;
>>>>> i++;
>>>>> if (i == MAX_SKB_FRAGS)
>>>>> i = 0;
>>>>> - } while (i != msg->sg_end);
>>>>> + } while (i <= msg->sg_end);
>>>
>>> I don't think this condition is correct, msg operates as a scatterlist ring,
>>> so sg_end may very well be < current i when there's a wrap-around in the
>>> traversal ... more below.
>>
>> I'm wondering then how this is suppose to work in case sg list is not
>> ring! For RDS, We have sg list that is not a ring. More below.
>>
>>>
>>>>> + /* return error if start is out of range */
>>>>> if (unlikely(start >= offset + len))
>>>>> return -EINVAL;
>>>>> - if (!msg->sg_copy[i] && bytes <= len)
>>>>> - goto out;
>>>>> + /* return error if i is last entry in sglist and end is out of range */
>>>>> + if (msg->sg_copy[i] && end > offset + len)
>>>>> + return -EINVAL;
>>>>> first_sg = i;
>>>>> + /* if i is not last entry in sg list and end (i.e start + bytes) is
>>>>> + * within this sg[i] then goto out and calculate data and data_end
>>>>> + */
>>>>> + if (!msg->sg_copy[i] && end <= offset + len)
>>>>> + goto out;
>>>>> +
>>>>> /* At this point we need to linearize multiple scatterlist
>>>>> * elements or a single shared page. Either way we need to
>>>>> * copy into a linear buffer exclusively owned by BPF. Then
>>>>> @@ -2330,9 +2338,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>>>> i++;
>>>>> if (i == MAX_SKB_FRAGS)
>>>>> i = 0;
>>>>> - if (bytes < copy)
>>>>> + if (end < copy)
>>>>> break;
>>>>> - } while (i != msg->sg_end);
>>>>> + } while (i <= msg->sg_end);
>>>>> +
>>>>> + /* return error if i is last entry in sglist and end is out of range */
>>>>> + if (i > msg->sg_end && end > offset + copy)
>>>>> + return -EINVAL;
>>>>> +
>>>>> last_sg = i;
>>>>> if (unlikely(copy < end - start))
>>>>> @@ -2342,23 +2355,22 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>>>> if (unlikely(!page))
>>>>> return -ENOMEM;
>>>>> p = page_address(page);
>>>>> - offset = 0;
>>>>> i = first_sg;
>>>>> do {
>>>>> from = sg_virt(&sg[i]);
>>>>> len = sg[i].length;
>>>>> - to = p + offset;
>>>>> + to = p + off;
>>>>> memcpy(to, from, len);
>>>>> - offset += len;
>>>>> + off += len;
>>>>> sg[i].length = 0;
>>>>> put_page(sg_page(&sg[i]));
>>>>> i++;
>>>>> if (i == MAX_SKB_FRAGS)
>>>>> i = 0;
>>>>> - } while (i != last_sg);
>>>>> + } while (i < last_sg);
>>>>> sg[first_sg].length = copy;
>>>>> sg_set_page(&sg[first_sg], page, copy, 0);
>>>>> @@ -2380,7 +2392,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>>>> else
>>>>> move_from = i + shift;
>>>>> - if (move_from == msg->sg_end)
>>>>> + if (move_from > msg->sg_end)
>>>>> break;
>>>>> sg[i] = sg[move_from];
>>>>> @@ -2396,7 +2408,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>>>> if (msg->sg_end < 0)
>>>>> msg->sg_end += MAX_SKB_FRAGS;
>>>>> out:
>>>>> - msg->data = sg_virt(&sg[i]) + start - offset;
>>>>> + msg->data = sg_virt(&sg[first_sg]) + start - offset;
>>>>> msg->data_end = msg->data + bytes;
>>>>> return 0;
>>>>>
>>>>
>>>> Please discard this patch. I just noticed that Daniel Borkmann sent some similar fixes for bpf_msg_pull_data.
>>>
>>> Yeah I've been looking at these recently as well, please make sure you test
>>> with the below fixes included to see if there's anything left:
>>
>> I tested the latest net tree which has all the fixes you mentioned and I
>> am still seeing issues.
>>
>> As I already mentioned before on RFC v3 thread, we need to be careful
>> reusing 'offset' while linearizing multiple scatterlist
>> elements.
>> Variable 'offset' is used to calculate the 'msg->data'
>> i.e. msg->data = sg_virt(&sg[first_sg]) + start - offset"
>>
>> We should use different variable name (e.g. off) while linearizing
>> multiple scatterlist elements or a single shared page so that we don't
>> overwrite 'offset' that has correct value already been calculated.
>
> Agree, sigh, that's also buggy, please submit the below as proper patch,
> thanks!
Sent. More below.
>
>> A code change for this would look like:
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 60a29ca..076ca09 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2326,7 +2326,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> BPF_CALL_4(bpf_msg_pull_data,
>> struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
>> {
>> - unsigned int len = 0, offset = 0, copy = 0;
>> + unsigned int len = 0, offset = 0, copy = 0, off;
>
> Nit: probably better name it 'poffset' since this is relative to p.
>
>> int bytes = end - start, bytes_sg_total;
>> struct scatterlist *sg = msg->sg_data;
>> int first_sg, last_sg, i, shift;
>> @@ -2354,6 +2354,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> * account for the headroom.
>> */
>> bytes_sg_total = start - offset + bytes;
>>
>> if (!msg->sg_copy[i] && bytes_sg_total <= len)
>> goto out;
>>
>> @@ -2382,18 +2384,18 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> if (unlikely(!page))
>> return -ENOMEM;
>> p = page_address(page);
>> - offset = 0;
>> + off = 0;
>>
>> i = first_sg;
>> do {
>> from = sg_virt(&sg[i]);
>> len = sg[i].length;
>> - to = p + offset;
>> + to = p + off;
>>
>> memcpy(to, from, len);
>> - offset += len;
>> + off += len;
>> sg[i].length = 0;
>> put_page(sg_page(&sg[i]));
>>
>> sk_msg_iter_var(i);
>> } while (i != last_sg);
>>
>>
>> Apart from above bug fix, I see issues in below 2 cases:
>
> Ok.
>
>> case 1:
>> =======
>> For things like RDS, where sg list is not a ring how to know end of packet?
>>
>> For example I have RDS packet of length 8192 Bytes which is in
>> scatterlist like
>> sge[0].lenghth = 1400
>> sge[1].length = 1448
>> sge[2].length = 1448
>> sge[3].length = 1448
>> sge[4].length = 1448
>> sge[5].length = 1000
>>
>> Now when I execute "bpf_msg_pull_data(msg, 8190, 8196, 0)" on above RDS
>> packet, I expect bpf_msg_pull_data to return error because end is out of
>> bound e.g. 8196 > 8192. Instead current code wraps it to sge index 0!
>> That is if I dump first 6 bytes starting with 'start' I get byte value
>> at offset 8190, 8191 from sge[5] and value at offset 8192, 8193, 8194
>> 8195 bytes from sge[0] from offset 0 ,1 ,2 ,3 respectively - which is
>> not expected!
>
> For the sg ring case which is what is upstream today, it will find start_sg
> as 5 and last_sg as 6. And therefore bail out in bytes_sg_total > copy test
> since 1004 > 1000. sge[5] offset is 7192 with 1000 bytes len, so start is at
> offset 998 from there and end at 1004. So I get -EINVAL from the mentioned
> test.
>
> Is msg->sg_start and msg->sg_end set properly in your RDS case? As long as
> there's no wrap-around the helper should be usable also in RDS case. I'm not
> sure though why you see above oob.
>
>> case 2:
>> =======
>> Same example of RDS packet as above, when I execute
>> "bpf_msg_pull_data(msg, 7191, 8192, 0)"
>> I get -EINVAL which is *wrong*.
>>
>> Debugging this, I found, right before we enter below do while loop the values of variables are:
>> i = 4
>> first_sg = 4
>> last_sg= 5
>> start=7191
>> bytes_sg_total=2448
>> offset=5744
>> copy=0
>> len=1448
>>
>> do {
>> copy += sg[i].length;
>> sk_msg_iter_var(i);
>> if (bytes_sg_total <= copy)
>> break;
>> } while (i != msg->sg_end);
>> last_sg = i;
>>
>> However, above loop execute only one time because after one execution,
>> i=5 and msg->sg_end=5. That makes condition "while (i != msg->sg_end)"
>> false. So when loop exit, the values are:
>> i = 5
>> first_sg=4
>> last_sg= 5
>> start=7191
>> bytes_sg_total=2448 <<<<<<
>> offset=5744
>> copy=1448 <<<<<<
>> len=1448
>>
>> And that makes below condition true
>> if (unlikely(bytes_sg_total > copy))
>> return -EINVAL;
>
> Hmm, is your msg->sg_end correct? I'm getting start_sg of 4 and last_sg of 6.
> So it breaks on the bytes_sg_total <= copy test, and then it merges sges 4 and 5
> into 4 with 2448 length while dropping sge 5. How do you set up sk_msg_buff?
>
>> There may be one or more similar corner scenarios like case 1 and 2
>> which can be fixed however I am not sure how can we fix so that we can
>> get it right for sg ring and sg list?
>
> Imho, it should be sufficient to have invariant of msg->sg_start < msg->sg_end
> and having msg->sg_start point to the starting sge while msg->sg_end to the
> last sge slot + 1.
Yup! I was thinking same :)
For RDS, I have "msg->sg_end = sg_nents(sg) - 1". That makes msg->sg_end
= 5 for above example, and therefore -EINVAL.
I changed "msg->sg_end = sg_nents(sg)" and it works without errors (i.e.
equivalent to last sge slot + 1)
I have ran tests with various RDS packet size pulling data with
different start and end values, and all of the tests passed without any
issues.
BTW, I'll send RFC v4 for RDS as soon as all bpf_msg_pull_data fixes
will get pulled into net-next tree.
Thanks.
-Tushar
>
>> Thanks.
>> -Tushar
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/core/filter.c?id=5b24109b0563d45094c470684c1f8cea1af269f8
>>> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/core/filter.c?id=0e06b227c5221dd51b5569de93f3b9f532be4a32
>>> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/core/filter.c?id=2e43f95dd8ee62bc8bf57f2afac37fbd70c8d565
>>> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/core/filter.c?id=a8cf76a9023bc6709b1361d06bb2fae5227b9d68
>>>
>>> Thanks,
>>> Daniel
>>>
>
^ permalink raw reply
* [RFC PATCH bpf-next v2 1/4] bpf: add bpf queue and stack maps
From: Mauricio Vasquez B @ 2018-08-31 21:25 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, netdev
In-Reply-To: <153575074884.30050.17670029209466860207.stgit@kernel>
Implement two new kind of maps that support the peek, push and pop
operations.
A use case for this is to keep track of a pool of elements, like
network ports in a SNAT.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
include/linux/bpf.h | 8 +
include/linux/bpf_types.h | 2
include/uapi/linux/bpf.h | 36 ++++
kernel/bpf/Makefile | 2
kernel/bpf/helpers.c | 44 +++++
kernel/bpf/queue_stack_maps.c | 353 +++++++++++++++++++++++++++++++++++++++++
kernel/bpf/syscall.c | 96 +++++++++++
kernel/bpf/verifier.c | 6 +
net/core/filter.c | 6 +
9 files changed, 543 insertions(+), 10 deletions(-)
create mode 100644 kernel/bpf/queue_stack_maps.c
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 523481a3471b..1d39b9096d9f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -39,6 +39,11 @@ struct bpf_map_ops {
void *(*map_lookup_elem)(struct bpf_map *map, void *key);
int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 flags);
int (*map_delete_elem)(struct bpf_map *map, void *key);
+ void *(*map_lookup_and_delete_elem)(struct bpf_map *map, void *key);
+
+ /* funcs callable from eBPF programs */
+ void *(*map_lookup_or_init_elem)(struct bpf_map *map, void *key,
+ void *value);
/* funcs called by prog_array and perf_event_array map */
void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file,
@@ -806,6 +811,9 @@ static inline int bpf_fd_reuseport_array_update_elem(struct bpf_map *map,
extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
extern const struct bpf_func_proto bpf_map_update_elem_proto;
extern const struct bpf_func_proto bpf_map_delete_elem_proto;
+extern const struct bpf_func_proto bpf_map_push_elem_proto;
+extern const struct bpf_func_proto bpf_map_pop_elem_proto;
+extern const struct bpf_func_proto bpf_map_peek_elem_proto;
extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c090e7c0..8d955f11f1cd 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -67,3 +67,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
#endif
#endif
+BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, queue_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4eba27..0a5b904ba42f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -103,6 +103,7 @@ enum bpf_cmd {
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
BPF_TASK_FD_QUERY,
+ BPF_MAP_LOOKUP_AND_DELETE_ELEM,
};
enum bpf_map_type {
@@ -127,6 +128,8 @@ enum bpf_map_type {
BPF_MAP_TYPE_SOCKHASH,
BPF_MAP_TYPE_CGROUP_STORAGE,
BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
+ BPF_MAP_TYPE_QUEUE,
+ BPF_MAP_TYPE_STACK,
};
enum bpf_prog_type {
@@ -459,6 +462,28 @@ union bpf_attr {
* Return
* 0 on success, or a negative error in case of failure.
*
+ * int bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
+ * Description
+ * Push an element *value* in *map*. *flags* is one of:
+ *
+ * **BPF_EXIST**
+ * If the queue/stack is full, the oldest element is removed to
+ * make room for this.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * void *bpf_map_pop_elem(struct bpf_map *map)
+ * Description
+ * Pop an element from *map*.
+ * Return
+ * Pointer to the element of *NULL* if there is not any.
+ *
+ * void *bpf_map_peek_elem(struct bpf_map *map)
+ * Description
+ * Return an element from *map* without removing it.
+ * Return
+ * Pointer to the element of *NULL* if there is not any.
+ *
* int bpf_probe_read(void *dst, u32 size, const void *src)
* Description
* For tracing programs, safely attempt to read *size* bytes from
@@ -786,14 +811,14 @@ union bpf_attr {
*
* int ret;
* struct bpf_tunnel_key key = {};
- *
+ *
* ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
* if (ret < 0)
* return TC_ACT_SHOT; // drop packet
- *
+ *
* if (key.remote_ipv4 != 0x0a000001)
* return TC_ACT_SHOT; // drop packet
- *
+ *
* return TC_ACT_OK; // accept packet
*
* This interface can also be used with all encapsulation devices
@@ -2226,7 +2251,10 @@ union bpf_attr {
FN(get_current_cgroup_id), \
FN(get_local_storage), \
FN(sk_select_reuseport), \
- FN(skb_ancestor_cgroup_id),
+ FN(skb_ancestor_cgroup_id), \
+ FN(map_push_elem), \
+ FN(map_pop_elem), \
+ FN(map_peek_elem),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 0488b8258321..17afae9e65f3 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -3,7 +3,7 @@ obj-y := core.o
obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
-obj-$(CONFIG_BPF_SYSCALL) += local_storage.o
+obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
obj-$(CONFIG_BPF_SYSCALL) += disasm.o
obj-$(CONFIG_BPF_SYSCALL) += btf.o
ifeq ($(CONFIG_NET),y)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 1991466b8327..c1ff567b69f1 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -76,6 +76,50 @@ const struct bpf_func_proto bpf_map_delete_elem_proto = {
.arg2_type = ARG_PTR_TO_MAP_KEY,
};
+BPF_CALL_3(bpf_map_push_elem, struct bpf_map *, map, void *, value, u64, flags)
+{
+ WARN_ON_ONCE(!rcu_read_lock_held());
+ return map->ops->map_update_elem(map, NULL, value, flags);
+}
+
+const struct bpf_func_proto bpf_map_push_elem_proto = {
+ .func = bpf_map_push_elem,
+ .gpl_only = false,
+ .pkt_access = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_CONST_MAP_PTR,
+ .arg2_type = ARG_PTR_TO_MAP_VALUE,
+ .arg3_type = ARG_ANYTHING,
+};
+
+BPF_CALL_1(bpf_map_pop_elem, struct bpf_map *, map)
+{
+ WARN_ON_ONCE(!rcu_read_lock_held());
+ return (unsigned long) map->ops->map_lookup_and_delete_elem(map, NULL);
+}
+
+const struct bpf_func_proto bpf_map_pop_elem_proto = {
+ .func = bpf_map_pop_elem,
+ .gpl_only = false,
+ .pkt_access = true,
+ .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
+ .arg1_type = ARG_CONST_MAP_PTR,
+};
+
+BPF_CALL_1(bpf_map_peek_elem, struct bpf_map *, map)
+{
+ WARN_ON_ONCE(!rcu_read_lock_held());
+ return (unsigned long) map->ops->map_lookup_elem(map, NULL);
+}
+
+const struct bpf_func_proto bpf_map_peek_elem_proto = {
+ .func = bpf_map_pop_elem,
+ .gpl_only = false,
+ .pkt_access = true,
+ .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
+ .arg1_type = ARG_CONST_MAP_PTR,
+};
+
const struct bpf_func_proto bpf_get_prandom_u32_proto = {
.func = bpf_user_rnd_u32,
.gpl_only = false,
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
new file mode 100644
index 000000000000..97e652bf6df5
--- /dev/null
+++ b/kernel/bpf/queue_stack_maps.c
@@ -0,0 +1,353 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * queue_stack_maps.c: BPF queue and stack maps
+ *
+ * Copyright (c) 2018 Politecnico di Torino
+ */
+#include <linux/bpf.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include "percpu_freelist.h"
+
+#define QUEUE_STACK_CREATE_FLAG_MASK \
+ (BPF_F_NO_PREALLOC | BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
+
+enum queue_type {
+ QUEUE,
+ STACK,
+};
+
+struct bpf_queue {
+ struct bpf_map map;
+ struct list_head head;
+ struct pcpu_freelist freelist;
+ void *nodes;
+ enum queue_type type;
+ raw_spinlock_t lock;
+ atomic_t count;
+ u32 node_size;
+};
+
+struct queue_node {
+ struct pcpu_freelist_node fnode;
+ struct bpf_queue *queue;
+ struct list_head list;
+ struct rcu_head rcu;
+ char element[0] __aligned(8);
+};
+
+static bool queue_map_is_prealloc(struct bpf_queue *queue)
+{
+ return !(queue->map.map_flags & BPF_F_NO_PREALLOC);
+}
+
+/* Called from syscall */
+static int queue_map_alloc_check(union bpf_attr *attr)
+{
+ /* check sanity of attributes */
+ if (attr->max_entries == 0 || attr->key_size != 0 ||
+ attr->value_size == 0 ||
+ attr->map_flags & ~QUEUE_STACK_CREATE_FLAG_MASK)
+ return -EINVAL;
+
+ if (attr->value_size > KMALLOC_MAX_SIZE)
+ /* if value_size is bigger, the user space won't be able to
+ * access the elements.
+ */
+ return -E2BIG;
+
+ return 0;
+}
+
+static int prealloc_init(struct bpf_queue *queue)
+{
+ u32 node_size = sizeof(struct queue_node) +
+ round_up(queue->map.value_size, 8);
+ u32 num_entries = queue->map.max_entries;
+ int err;
+
+ queue->nodes = bpf_map_area_alloc(node_size * num_entries,
+ queue->map.numa_node);
+ if (!queue->nodes)
+ return -ENOMEM;
+
+ err = pcpu_freelist_init(&queue->freelist);
+ if (err)
+ goto free_nodes;
+
+ pcpu_freelist_populate(&queue->freelist,
+ queue->nodes +
+ offsetof(struct queue_node, fnode),
+ node_size, num_entries);
+
+ return 0;
+
+free_nodes:
+ bpf_map_area_free(queue->nodes);
+ return err;
+}
+
+static void prealloc_destroy(struct bpf_queue *queue)
+{
+ bpf_map_area_free(queue->nodes);
+ pcpu_freelist_destroy(&queue->freelist);
+}
+
+static struct bpf_map *queue_map_alloc(union bpf_attr *attr)
+{
+ struct bpf_queue *queue;
+ u64 cost = sizeof(*queue);
+ int ret;
+
+ queue = kzalloc(sizeof(*queue), GFP_USER);
+ if (!queue)
+ return ERR_PTR(-ENOMEM);
+
+ bpf_map_init_from_attr(&queue->map, attr);
+
+ queue->node_size = sizeof(struct queue_node) +
+ round_up(attr->value_size, 8);
+ cost += (u64) attr->max_entries * queue->node_size;
+ if (cost >= U32_MAX - PAGE_SIZE) {
+ ret = -E2BIG;
+ goto free_queue;
+ }
+
+ queue->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+
+ ret = bpf_map_precharge_memlock(queue->map.pages);
+ if (ret)
+ goto free_queue;
+
+ INIT_LIST_HEAD(&queue->head);
+
+ raw_spin_lock_init(&queue->lock);
+
+ if (queue->map.map_type == BPF_MAP_TYPE_QUEUE)
+ queue->type = QUEUE;
+ else if (queue->map.map_type == BPF_MAP_TYPE_STACK)
+ queue->type = STACK;
+
+ if (queue_map_is_prealloc(queue))
+ ret = prealloc_init(queue);
+ if (ret)
+ goto free_queue;
+
+ return &queue->map;
+
+free_queue:
+ kfree(queue);
+ return ERR_PTR(ret);
+}
+
+/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
+static void queue_map_free(struct bpf_map *map)
+{
+ struct bpf_queue *queue = container_of(map, struct bpf_queue, map);
+ struct queue_node *l;
+
+ /* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
+ * so the programs (can be more than one that used this map) were
+ * disconnected from events. Wait for outstanding critical sections in
+ * these programs to complete
+ */
+ synchronize_rcu();
+
+ /* some of queue_elem_free_rcu() callbacks for elements of this map may
+ * not have executed. Wait for them.
+ */
+ rcu_barrier();
+ if (!queue_map_is_prealloc(queue))
+ list_for_each_entry(l, &queue->head, list) {
+ list_del(&l->list);
+ kfree(l);
+ }
+ else
+ prealloc_destroy(queue);
+ kfree(queue);
+}
+
+static void queue_elem_free_rcu(struct rcu_head *head)
+{
+ struct queue_node *l = container_of(head, struct queue_node, rcu);
+ struct bpf_queue *queue = l->queue;
+
+ /* must increment bpf_prog_active to avoid kprobe+bpf triggering while
+ * we're calling kfree, otherwise deadlock is possible if kprobes
+ * are placed somewhere inside of slub
+ */
+ preempt_disable();
+ __this_cpu_inc(bpf_prog_active);
+ if (queue_map_is_prealloc(queue))
+ pcpu_freelist_push(&queue->freelist, &l->fnode);
+ else
+ kfree(l);
+ __this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+}
+
+static void *__queue_map_lookup(struct bpf_map *map, bool delete)
+{
+ struct bpf_queue *queue = container_of(map, struct bpf_queue, map);
+ unsigned long flags;
+ struct queue_node *node;
+
+ raw_spin_lock_irqsave(&queue->lock, flags);
+
+ if (list_empty(&queue->head)) {
+ raw_spin_unlock_irqrestore(&queue->lock, flags);
+ return NULL;
+ }
+
+ node = list_first_entry(&queue->head, struct queue_node, list);
+
+ if (delete) {
+ if (!queue_map_is_prealloc(queue))
+ atomic_dec(&queue->count);
+
+ list_del(&node->list);
+ call_rcu(&node->rcu, queue_elem_free_rcu);
+ }
+
+ raw_spin_unlock_irqrestore(&queue->lock, flags);
+ return &node->element;
+}
+
+/* Called from syscall or from eBPF program */
+static void *queue_map_lookup_elem(struct bpf_map *map, void *key)
+{
+ return __queue_map_lookup(map, false);
+}
+
+/* Called from syscall or from eBPF program */
+static void *queue_map_lookup_and_delete_elem(struct bpf_map *map, void *key)
+{
+ return __queue_map_lookup(map, true);
+}
+
+static struct queue_node *queue_map_delete_oldest_node(struct bpf_queue *queue)
+{
+ struct queue_node *node = NULL;
+ unsigned long irq_flags;
+
+ raw_spin_lock_irqsave(&queue->lock, irq_flags);
+
+ if (list_empty(&queue->head))
+ goto out;
+
+ switch (queue->type) {
+ case QUEUE:
+ node = list_first_entry(&queue->head, struct queue_node, list);
+ break;
+ case STACK:
+ node = list_last_entry(&queue->head, struct queue_node, list);
+ break;
+ default:
+ goto out;
+ }
+
+ list_del(&node->list);
+out:
+ raw_spin_unlock_irqrestore(&queue->lock, irq_flags);
+ return node;
+}
+
+/* Called from syscall or from eBPF program */
+static int queue_map_update_elem(struct bpf_map *map, void *key, void *value,
+ u64 flags)
+{
+ struct bpf_queue *queue = container_of(map, struct bpf_queue, map);
+ unsigned long irq_flags;
+ struct queue_node *new;
+ /* BPF_EXIST is used to force making room for a new element in case the
+ * map is full
+ */
+ bool replace = (flags & BPF_EXIST);
+
+ /* Check supported flags for queue and stack maps */
+ if (flags & BPF_NOEXIST || flags > BPF_EXIST)
+ return -EINVAL;
+
+again:
+ if (!queue_map_is_prealloc(queue)) {
+ if (atomic_inc_return(&queue->count) > queue->map.max_entries) {
+ atomic_dec(&queue->count);
+ if (!replace)
+ return -E2BIG;
+ new = queue_map_delete_oldest_node(queue);
+ /* It is possible that in the meanwhile the queue/stack
+ * became empty and there was not an 'oldest' element
+ * to delete. In that case, try again
+ */
+ if (!new)
+ goto again;
+ } else {
+ new = kmalloc_node(queue->node_size,
+ GFP_ATOMIC | __GFP_NOWARN,
+ queue->map.numa_node);
+ if (!new) {
+ atomic_dec(&queue->count);
+ return -ENOMEM;
+ }
+ }
+ } else {
+ struct pcpu_freelist_node *l;
+
+ l = pcpu_freelist_pop(&queue->freelist);
+ if (!l) {
+ if (!replace)
+ return -E2BIG;
+ new = queue_map_delete_oldest_node(queue);
+ if (!new)
+ /* TODO: This should goto again, but this causes an
+ * infinite loop when the elements are not being
+ * returned to the free list by the
+ * queue_elem_free_rcu() callback
+ */
+ return -ENOMEM;
+ } else
+ new = container_of(l, struct queue_node, fnode);
+ }
+
+ memcpy(new->element, value, queue->map.value_size);
+ new->queue = queue;
+
+ raw_spin_lock_irqsave(&queue->lock, irq_flags);
+ switch (queue->type) {
+ case QUEUE:
+ list_add_tail(&new->list, &queue->head);
+ break;
+
+ case STACK:
+ list_add(&new->list, &queue->head);
+ break;
+ }
+ raw_spin_unlock_irqrestore(&queue->lock, irq_flags);
+
+ return 0;
+}
+
+/* Called from syscall or from eBPF program */
+static int queue_map_delete_elem(struct bpf_map *map, void *key)
+{
+ return -EINVAL;
+}
+
+/* Called from syscall */
+static int queue_map_get_next_key(struct bpf_map *map, void *key,
+ void *next_key)
+{
+ return -EINVAL;
+}
+
+const struct bpf_map_ops queue_map_ops = {
+ .map_alloc_check = queue_map_alloc_check,
+ .map_alloc = queue_map_alloc,
+ .map_free = queue_map_free,
+ .map_lookup_elem = queue_map_lookup_elem,
+ .map_lookup_and_delete_elem = queue_map_lookup_and_delete_elem,
+ .map_update_elem = queue_map_update_elem,
+ .map_delete_elem = queue_map_delete_elem,
+ .map_get_next_key = queue_map_get_next_key,
+};
+
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8339d81cba1d..7703795cb4e4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -652,6 +652,17 @@ int __weak bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
return -ENOTSUPP;
}
+static void *__bpf_copy_key(void *ukey, u64 key_size)
+{
+ if (key_size)
+ return memdup_user(ukey, key_size);
+
+ if (ukey)
+ return ERR_PTR(-EINVAL);
+
+ return NULL;
+}
+
/* last field in 'union bpf_attr' used by this command */
#define BPF_MAP_LOOKUP_ELEM_LAST_FIELD value
@@ -679,7 +690,7 @@ static int map_lookup_elem(union bpf_attr *attr)
goto err_put;
}
- key = memdup_user(ukey, map->key_size);
+ key = __bpf_copy_key(ukey, map->key_size);
if (IS_ERR(key)) {
err = PTR_ERR(key);
goto err_put;
@@ -767,7 +778,7 @@ static int map_update_elem(union bpf_attr *attr)
goto err_put;
}
- key = memdup_user(ukey, map->key_size);
+ key = __bpf_copy_key(ukey, map->key_size);
if (IS_ERR(key)) {
err = PTR_ERR(key);
goto err_put;
@@ -865,7 +876,7 @@ static int map_delete_elem(union bpf_attr *attr)
goto err_put;
}
- key = memdup_user(ukey, map->key_size);
+ key = __bpf_copy_key(ukey, map->key_size);
if (IS_ERR(key)) {
err = PTR_ERR(key);
goto err_put;
@@ -917,7 +928,7 @@ static int map_get_next_key(union bpf_attr *attr)
}
if (ukey) {
- key = memdup_user(ukey, map->key_size);
+ key = __bpf_copy_key(ukey, map->key_size);
if (IS_ERR(key)) {
err = PTR_ERR(key);
goto err_put;
@@ -958,6 +969,80 @@ static int map_get_next_key(union bpf_attr *attr)
return err;
}
+#define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
+
+static int map_lookup_and_delete_elem(union bpf_attr *attr)
+{
+ void __user *ukey = u64_to_user_ptr(attr->key);
+ void __user *uvalue = u64_to_user_ptr(attr->value);
+ int ufd = attr->map_fd;
+ struct bpf_map *map;
+ void *key, *value, *ptr;
+ u32 value_size;
+ struct fd f;
+ int err;
+
+ if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM))
+ return -EINVAL;
+
+ f = fdget(ufd);
+ map = __bpf_map_get(f);
+ if (IS_ERR(map))
+ return PTR_ERR(map);
+
+ if (!(f.file->f_mode & FMODE_CAN_WRITE)) {
+ err = -EPERM;
+ goto err_put;
+ }
+
+ key = __bpf_copy_key(ukey, map->key_size);
+ if (IS_ERR(key)) {
+ err = PTR_ERR(key);
+ goto err_put;
+ }
+
+ value_size = map->value_size;
+
+ err = -ENOMEM;
+ value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
+ if (!value)
+ goto free_key;
+
+ err = -EFAULT;
+ if (copy_from_user(value, uvalue, value_size) != 0)
+ goto free_value;
+
+ /* must increment bpf_prog_active to avoid kprobe+bpf triggering from
+ * inside bpf map update or delete otherwise deadlocks are possible
+ */
+ preempt_disable();
+ __this_cpu_inc(bpf_prog_active);
+ rcu_read_lock();
+ ptr = map->ops->map_lookup_and_delete_elem(map, key);
+ if (ptr)
+ memcpy(value, ptr, value_size);
+ rcu_read_unlock();
+ err = ptr ? 0 : -ENOENT;
+ __this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+
+ if (err)
+ goto free_value;
+
+ if (copy_to_user(uvalue, value, value_size) != 0)
+ goto free_value;
+
+ err = 0;
+
+free_value:
+ kfree(value);
+free_key:
+ kfree(key);
+err_put:
+ fdput(f);
+ return err;
+}
+
static const struct bpf_prog_ops * const bpf_prog_types[] = {
#define BPF_PROG_TYPE(_id, _name) \
[_id] = & _name ## _prog_ops,
@@ -2418,6 +2503,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
case BPF_TASK_FD_QUERY:
err = bpf_task_fd_query(&attr, uattr);
break;
+ case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
+ err = map_lookup_and_delete_elem(&attr);
+ break;
default:
err = -EINVAL;
break;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ca90679a7fe5..5bd67feb2f07 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2035,6 +2035,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
verbose(env, "invalid map_ptr to access map->key\n");
return -EACCES;
}
+
err = check_helper_mem_access(env, regno,
meta->map_ptr->key_size, false,
NULL);
@@ -2454,7 +2455,10 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
if (func_id != BPF_FUNC_tail_call &&
func_id != BPF_FUNC_map_lookup_elem &&
func_id != BPF_FUNC_map_update_elem &&
- func_id != BPF_FUNC_map_delete_elem)
+ func_id != BPF_FUNC_map_delete_elem &&
+ func_id != BPF_FUNC_map_push_elem &&
+ func_id != BPF_FUNC_map_pop_elem &&
+ func_id != BPF_FUNC_map_peek_elem)
return 0;
if (meta->map_ptr == NULL) {
diff --git a/net/core/filter.c b/net/core/filter.c
index fd423ce3da34..e7860914ffc7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4830,6 +4830,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
return &bpf_map_update_elem_proto;
case BPF_FUNC_map_delete_elem:
return &bpf_map_delete_elem_proto;
+ case BPF_FUNC_map_push_elem:
+ return &bpf_map_push_elem_proto;
+ case BPF_FUNC_map_pop_elem:
+ return &bpf_map_pop_elem_proto;
+ case BPF_FUNC_map_peek_elem:
+ return &bpf_map_peek_elem_proto;
case BPF_FUNC_get_prandom_u32:
return &bpf_get_prandom_u32_proto;
case BPF_FUNC_get_smp_processor_id:
^ permalink raw reply related
* [RFC PATCH bpf-next v2 2/4] bpf: restrict use of peek/push/pop
From: Mauricio Vasquez B @ 2018-08-31 21:26 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, netdev
In-Reply-To: <153575074884.30050.17670029209466860207.stgit@kernel>
Restrict the use of peek, push and pop helpers only to queue and stack
maps.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
kernel/bpf/verifier.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5bd67feb2f07..9e177ff4a3b9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2172,6 +2172,13 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
if (func_id != BPF_FUNC_sk_select_reuseport)
goto error;
break;
+ case BPF_MAP_TYPE_QUEUE:
+ case BPF_MAP_TYPE_STACK:
+ if (func_id != BPF_FUNC_map_peek_elem &&
+ func_id != BPF_FUNC_map_pop_elem &&
+ func_id != BPF_FUNC_map_push_elem)
+ goto error;
+ break;
default:
break;
}
@@ -2227,6 +2234,13 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY)
goto error;
break;
+ case BPF_FUNC_map_peek_elem:
+ case BPF_FUNC_map_pop_elem:
+ case BPF_FUNC_map_push_elem:
+ if (map->map_type != BPF_MAP_TYPE_QUEUE &&
+ map->map_type != BPF_MAP_TYPE_STACK)
+ goto error;
+ break;
default:
break;
}
^ permalink raw reply related
* [RFC PATCH bpf-next v2 0/4] Implement bpf queue/stack maps
From: Mauricio Vasquez B @ 2018-08-31 21:25 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, netdev
In some applications this is needed have a pool of free elements, like for
example the list of free L4 ports in a SNAT. None of the current maps allow
to do it as it is not possibleto get an any element without having they key
it is associated to.
This patchset implements two new kind of eBPF maps: queue and stack.
Those maps provide to eBPF programs the peek, push and pop operations, and for
userspace applications a new bpf_map_lookup_and_delete_elem() is added.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
I am sending this as an RFC because there is still an issue I am not sure how
to solve.
The queue/stack maps have a linked list for saving the nodes, and a
preallocation schema based on the pcpu_freelist already implemented and used
in the htabmap. Each time an element is pushed into the map, a node from the
pcpu_freelist is taken and then added to the linked list.
The pop operation takes and *removes* the first node from the linked list, then
it uses *call_rcu* to postpose freeing the node, i.e, the node is only returned
to the pcpu_freelist when the rcu callback is executed. This is needed because
an element returned by the pop() operation should remain valid for the whole
duration of the eBPF program.
The problem is that elements are not immediately returned to the free list, so
in some cases the push operation could fail because there are not free nodes
in the pcpu_freelist.
The following code snippet exposes that problem.
...
/* Push MAP_SIZE elements */
for (i = 0; i < MAP_SIZE; i++)
assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
/* Pop all elements */
for (i = 0; i < MAP_SIZE; i++)
assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
val == vals[i]);
// sleep(1) <-- If I put this sleep, everything works.
/* Push MAP_SIZE elements */
for (i = 0; i < MAP_SIZE; i++)
assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
^^^
This fails because there are not available elements in pcpu_freelist
...
I think a possible solution is to oversize the pcpu_freelist (no idea by how
much, maybe double or, or make it 1.5 time the max elements in the map?)
I also have concerns about it, it would waste that memory in many cases and
this is also probably that it doesn't solve the issue because that code snippet
is puhsing and popping elements too fast, so even if the pcpu_freelist is much
large a certain time instant all the elements could be used.
Is this really an important issue?
Any idea of how to solve it?
Thanks,
---
Mauricio Vasquez B (4):
bpf: add bpf queue and stack maps
bpf: restrict use of peek/push/pop
Sync uapi/bpf.h to tools/include
selftests/bpf: add test cases for queue and stack maps
include/linux/bpf.h | 8
include/linux/bpf_types.h | 2
include/uapi/linux/bpf.h | 36 ++
kernel/bpf/Makefile | 2
kernel/bpf/helpers.c | 44 ++
kernel/bpf/queue_stack_maps.c | 353 ++++++++++++++++++++
kernel/bpf/syscall.c | 96 +++++
kernel/bpf/verifier.c | 20 +
net/core/filter.c | 6
tools/include/uapi/linux/bpf.h | 36 ++
tools/lib/bpf/bpf.c | 12 +
tools/lib/bpf/bpf.h | 1
tools/testing/selftests/bpf/Makefile | 2
tools/testing/selftests/bpf/bpf_helpers.h | 7
tools/testing/selftests/bpf/test_maps.c | 101 ++++++
tools/testing/selftests/bpf/test_progs.c | 99 ++++++
tools/testing/selftests/bpf/test_queue_map.c | 4
tools/testing/selftests/bpf/test_queue_stack_map.h | 59 +++
tools/testing/selftests/bpf/test_stack_map.c | 4
19 files changed, 877 insertions(+), 15 deletions(-)
create mode 100644 kernel/bpf/queue_stack_maps.c
create mode 100644 tools/testing/selftests/bpf/test_queue_map.c
create mode 100644 tools/testing/selftests/bpf/test_queue_stack_map.h
create mode 100644 tools/testing/selftests/bpf/test_stack_map.c
^ permalink raw reply
* Re: [PATCH net-next 0/5] rtnetlink: add IFA_IF_NETNSID for RTM_GETADDR
From: Christian Brauner @ 2018-09-01 1:34 UTC (permalink / raw)
To: Kirill Tkhai
Cc: netdev, linux-kernel, davem, kuznet, yoshfuji, pombredanne,
kstewart, gregkh, dsahern, fw, lucien.xin, jakub.kicinski, jbenc,
nicolas.dichtel
In-Reply-To: <20180830144544.tpross4jd6awou4u@gmail.com>
On Thu, Aug 30, 2018 at 04:45:45PM +0200, Christian Brauner wrote:
> On Thu, Aug 30, 2018 at 11:49:31AM +0300, Kirill Tkhai wrote:
> > On 29.08.2018 21:13, Christian Brauner wrote:
> > > Hi Kirill,
> > >
> > > Thanks for the question!
> > >
> > > On Wed, Aug 29, 2018 at 11:30:37AM +0300, Kirill Tkhai wrote:
> > >> Hi, Christian,
> > >>
> > >> On 29.08.2018 02:18, Christian Brauner wrote:
> > >>> From: Christian Brauner <christian@brauner.io>
> > >>>
> > >>> Hey,
> > >>>
> > >>> A while back we introduced and enabled IFLA_IF_NETNSID in
> > >>> RTM_{DEL,GET,NEW}LINK requests (cf. [1], [2], [3], [4], [5]). This has led
> > >>> to signficant performance increases since it allows userspace to avoid
> > >>> taking the hit of a setns(netns_fd, CLONE_NEWNET), then getting the
> > >>> interfaces from the netns associated with the netns_fd. Especially when a
> > >>> lot of network namespaces are in use, using setns() becomes increasingly
> > >>> problematic when performance matters.
> > >>
> > >> could you please give a real example, when setns()+socket(AF_NETLINK) cause
> > >> problems with the performance? You should do this only once on application
> > >> startup, and then you have created netlink sockets in any net namespaces you
> > >> need. What is the problem here?
> > >
> > > So we have a daemon (LXD) that is often running thousands of containers.
> > > When users issue a lxc list request against the daemon it returns a list
> > > of all containers including all of the interfaces and addresses for each
> > > container. To retrieve those addresses we currently rely on setns() +
> > > getifaddrs() for each of those containers. That has horrible
> > > performance.
> >
> > Could you please provide some numbers showing that setns()
> > introduces signify performance decrease in the application?
>
> Sure, might take a few days++ though since I'm traveling.
Hey Kirill,
As promised here's some code [1] that compares performance. I basically
did a setns() to the network namespace and called getifaddrs() and
compared this to the scenario where I use the newly introduced property.
I did this 1 million times and calculated the mean getifaddrs()
retrieval time based on that.
My patch cuts the time in half.
brauner@wittgenstein:~/netns_getifaddrs$ ./getifaddrs_perf 0 1178
Mean time in microseconds (netnsid): 81
Mean time in microseconds (setns): 162
Christian
I'm only appending the main file since the netsns_getifaddrs() code I
used is pretty long:
[1]:
#define _GNU_SOURCE
#define __STDC_FORMAT_MACROS
#include <fcntl.h>
#include <inttypes.h>
#include <linux/types.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>
#include "netns_getifaddrs.h"
#include "print_getifaddrs.h"
#define ITERATIONS 1000000
#define SEC_TO_MICROSEC(x) (1000000 * (x))
int main(int argc, char *argv[])
{
int i, ret;
__s32 netns_id;
pid_t netns_pid;
char path[1024];
intmax_t times[ITERATIONS];
struct timeval t1, t2;
intmax_t time_in_mcs;
int fret = EXIT_FAILURE;
intmax_t sum = 0;
int host_netns_fd = -1, netns_fd = -1;
struct ifaddrs *ifaddrs = NULL;
if (argc != 3)
goto on_error;
netns_id = atoi(argv[1]);
netns_pid = atoi(argv[2]);
printf("%d\n", netns_id);
printf("%d\n", netns_pid);
for (i = 0; i < ITERATIONS; i++) {
ret = gettimeofday(&t1, NULL);
if (ret < 0)
goto on_error;
ret = netns_getifaddrs(&ifaddrs, netns_id);
freeifaddrs(ifaddrs);
if (ret < 0)
goto on_error;
ret = gettimeofday(&t2, NULL);
if (ret < 0)
goto on_error;
time_in_mcs = (SEC_TO_MICROSEC(t2.tv_sec) + t2.tv_usec) -
(SEC_TO_MICROSEC(t1.tv_sec) + t1.tv_usec);
times[i] = time_in_mcs;
}
for (i = 0; i < ITERATIONS; i++)
sum += times[i];
printf("Mean time in microseconds (netnsid): %ju\n",
sum / ITERATIONS);
ret = snprintf(path, sizeof(path), "/proc/%d/ns/net", netns_pid);
if (ret < 0 || (size_t)ret >= sizeof(path))
goto on_error;
netns_fd = open(path, O_RDONLY | O_CLOEXEC);
if (netns_fd < 0)
goto on_error;
host_netns_fd = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
if (host_netns_fd < 0)
goto on_error;
memset(times, 0, sizeof(times));
for (i = 0; i < ITERATIONS; i++) {
ret = gettimeofday(&t1, NULL);
if (ret < 0)
goto on_error;
ret = setns(netns_fd, CLONE_NEWNET);
if (ret < 0)
goto on_error;
ret = getifaddrs(&ifaddrs);
freeifaddrs(ifaddrs);
if (ret < 0)
goto on_error;
ret = gettimeofday(&t2, NULL);
if (ret < 0)
goto on_error;
ret = setns(host_netns_fd, CLONE_NEWNET);
if (ret < 0)
goto on_error;
time_in_mcs = (SEC_TO_MICROSEC(t2.tv_sec) + t2.tv_usec) -
(SEC_TO_MICROSEC(t1.tv_sec) + t1.tv_usec);
times[i] = time_in_mcs;
}
for (i = 0; i < ITERATIONS; i++)
sum += times[i];
printf("Mean time in microseconds (setns): %ju\n",
sum / ITERATIONS);
fret = EXIT_SUCCESS;
on_error:
if (netns_fd >= 0)
close(netns_fd);
if (host_netns_fd >= 0)
close(host_netns_fd);
exit(fret);
}
>
> >
> > I worry about all this just because of netlink interface is
> > already overloaded, while this patch makes it even less modular.
>
> Introducing the IFA_IF_NETNSID property will not make the netlink
> interface less modular. It is a clean, RTM_*ADDR-request specific
> property using network namespace identifiers which we discussed in prior
> patches are the way to go forward.
>
> You can already get interfaces via GETLINK from another network
> namespaces than the one you reside in (Which we enabled just a few
> months back.) but you can't do the same for GETADDR. Those two are
> almost always used together. When you want to get the links you usually
> also want to get the addresses associated with it right after.
> In a prior discussion we agreed that network namespace identifiers are
> the way to go forward but that any other propery, i.e. PIDs and fds
> should never be ported into other parts of the codebase and that is
> indeed something I agree with.
>
> > In case of one day we finally reach rtnl unscalability trap,
> > every common interface like this may be a decisive nail in
> > a coffin lid of possibility to overwrite everything.
> >
> > > The problem with what you're proposing is that the daemon would need to
> > > cache a socket file descriptor for each container which is something
> > > that we unfortunately cannot do since we can't excessively cache file
> > > descriptors because we can easily hit the open file limit. We also
> > > refrain from caching file descriptors for a long time for security
> > > reasons.
> > >
> > > For the case where users just request a list of the interfaces we
> > > can already use RTM_GETLINK + IFLA_IF_NETNS which has way better
> > > performance. But we can't do the same with RTM_GETADDR requests which
> > > was an oversight on my part when I wrote the original patchset for the
> > > RTM_*LINK requests. This just rectifies this and aligns RTM_GETLINK +
> > > RTM_GETADDR.
> > > Based on this patchset I have written a userspace POC that is basically
> > > a netns namespace aware getifaddr() or - as I like to call it -
> > > netns_getifaddr().
> > >
> > >>
> > >>> Usually, RTML_GETLINK requests are followed by RTM_GETADDR requests (cf.
> > >>> getifaddrs() style functions and friends). But currently, RTM_GETADDR
> > >>> requests do not support a similar property like IFLA_IF_NETNSID for
> > >>> RTM_*LINK requests.
> > >>> This is problematic since userspace can retrieve interfaces from another
> > >>> network namespace by sending a IFLA_IF_NETNSID property along but
> > >>> RTM_GETLINK request but is still forced to use the legacy setns() style of
> > >>> retrieving interfaces in RTM_GETADDR requests.
> > >>>
> > >>> The goal of this series is to make it possible to perform RTM_GETADDR
> > >>> requests on different network namespaces. To this end a new IFA_IF_NETNSID
> > >>> property for RTM_*ADDR requests is introduced. It can be used to send a
> > >>> network namespace identifier along in RTM_*ADDR requests. The network
> > >>> namespace identifier will be used to retrieve the target network namespace
> > >>> in which the request is supposed to be fulfilled. This aligns the behavior
> > >>> of RTM_*ADDR requests with the behavior of RTM_*LINK requests.
> > >>>
> > >>> Security:
> > >>> - The caller must have assigned a valid network namespace identifier for
> > >>> the target network namespace.
> > >>> - The caller must have CAP_NET_ADMIN in the owning user namespace of the
> > >>> target network namespace.
> > >>>
> > >>> Thanks!
> > >>> Christian
> > >>>
> > >>> [1]: commit 7973bfd8758d ("rtnetlink: remove check for IFLA_IF_NETNSID")
> > >>> [2]: commit 5bb8ed075428 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK")
> > >>> [3]: commit b61ad68a9fe8 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK")
> > >>> [4]: commit c310bfcb6e1b ("rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK")
> > >>> [5]: commit 7c4f63ba8243 ("rtnetlink: enable IFLA_IF_NETNSID in do_setlink()")
> > >>>
> > >>> Christian Brauner (5):
> > >>> rtnetlink: add rtnl_get_net_ns_capable()
> > >>> if_addr: add IFA_IF_NETNSID
> > >>> ipv4: enable IFA_IF_NETNSID for RTM_GETADDR
> > >>> ipv6: enable IFA_IF_NETNSID for RTM_GETADDR
> > >>> rtnetlink: move type calculation out of loop
> > >>>
> > >>> include/net/rtnetlink.h | 1 +
> > >>> include/uapi/linux/if_addr.h | 1 +
> > >>> net/core/rtnetlink.c | 15 +++++---
> > >>> net/ipv4/devinet.c | 38 +++++++++++++++-----
> > >>> net/ipv6/addrconf.c | 70 ++++++++++++++++++++++++++++--------
> > >>> 5 files changed, 97 insertions(+), 28 deletions(-)
> > >>>
^ permalink raw reply
* Cutting out
From: Jimmy Wilson @ 2018-08-31 11:34 UTC (permalink / raw)
To: netdev
Hi,
Have you received my email from last week?
I would like to speak with the person who manage your photos for your
company?
We are here to provide you all kinds of imaging editing.
What we can provide you:
Cutting out for photos
Clipping path for photos
Masking for photos
Retouching for your photos
Retouching for the Beauty Model and Portraits.
We have 20 staffs in house and daily basis 1000 images can be processed.
We give testing for your photos.
Thanks,
Jimmy Wilson
^ permalink raw reply
* Re: [net-next v2 00/15][pull request] 40GbE Intel Wired LAN Driver Updates 2018-08-30
From: David Miller @ 2018-08-31 21:08 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20180830211147.17008-1-jeffrey.t.kirsher@intel.com>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Thu, 30 Aug 2018 14:11:32 -0700
> This series contains updates to i40e, i40evf and virtchnl.
Pulled, thanks Jeff.
^ permalink raw reply
* Re: [PATCH v2 1/2] IB/ipoib: Use dev_port to expose network interface port numbers
From: Doug Ledford @ 2018-08-31 20:58 UTC (permalink / raw)
To: Arseny Maslennikov; +Cc: linux-rdma, Jason Gunthorpe, netdev
In-Reply-To: <20180831085740.GA22172@cello.Dlink>
[-- Attachment #1: Type: text/plain, Size: 3249 bytes --]
On Fri, 2018-08-31 at 11:57 +0300, Arseny Maslennikov wrote:
> On Thu, Aug 30, 2018 at 04:17:58PM -0400, Doug Ledford wrote:
> > On Thu, 2018-08-30 at 21:22 +0300, Arseny Maslennikov wrote:
> > > Some InfiniBand network devices have multiple ports on the same PCI
> > > function. This initializes the `dev_port' sysfs field of those
> > > network interfaces with their port number.
> > >
> > > Prior to this the kernel erroneously used the `dev_id' sysfs
> > > field of those network interfaces to convey the port number to userspace.
> > >
> > > The use of `dev_id' was considered correct until Linux 3.15,
> > > when another field, `dev_port', was defined for this particular
> > > purpose and `dev_id' was reserved for distinguishing stacked ifaces
> > > (e.g: VLANs) with the same hardware address as their parent device.
> > >
> > > Similar fixes to net/mlx4_en and many other drivers, which started
> > > exporting this information through `dev_id' before 3.15, were accepted
> > > into the kernel 4 years ago.
> > > See 76a066f2a2a0 (`net/mlx4_en: Expose port number through sysfs').
> > >
> > > Signed-off-by: Arseny Maslennikov <ar@cs.msu.ru>
> > > ---
> > > drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > index e3d28f9ad9c0..ba16a63ee303 100644
> > > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > @@ -1880,7 +1880,7 @@ static int ipoib_parent_init(struct net_device *ndev)
> > > sizeof(union ib_gid));
> > >
> > > SET_NETDEV_DEV(priv->dev, priv->ca->dev.parent);
> > > - priv->dev->dev_id = priv->port - 1;
> > > + priv->dev->dev_port = priv->port - 1;
> >
> > I don't know that we can't do this. At least not yet. Expose the new
> > item to make us compliant with the new docs, and deprecate the old sysfs
> > item, but we can't just yank the old item. Existing tools/scripts might
> > (probably) rely on it (existing tools already special case IPoIB
> > interfaces and we'll need to make sure they don't special case this
> > element too).
>
> I'm good with keeping both items for a (probably long) while to not break
> things. But how exactly should we notify users of the deprecation, so they
> don't special case this again? A comment in the code seems too little —
> everyone's obviously too busy to look there and stumble upon that.
> A distinct notice in the doc seems too much. I can't think of another place
> for the deprecation notice where people would take note of it, however.
>
> Anyway: would it be OK to just restore both items and put a small note in
> dev_id's doc entry? If yes, I'll then send a v3.
A warn_on_once in the code so that when someone reads the dev_id entry,
we get a deprecation warning in the dmesg output at info level would be
my suggestion. Have it output the command name as part of the warning
so we know what tools are using it.
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH net-next] rds: store socket timestamps as ktime_t
From: David Miller @ 2018-09-01 1:04 UTC (permalink / raw)
To: arnd
Cc: santosh.shilimkar, sowmini.varadhan, willemb, ka-cheong.poon,
s.mesoraca16, avinash.repaka, edumazet, netdev, linux-rdma,
rds-devel, linux-kernel
In-Reply-To: <20180829154732.844217-1-arnd@arndb.de>
From: Arnd Bergmann <arnd@arndb.de>
Date: Wed, 29 Aug 2018 17:47:19 +0200
> rds is the last in-kernel user of the old do_gettimeofday()
> function. Convert it over to ktime_get_real() to make it
> work more like the generic socket timestamps, and to let
> us kill off do_gettimeofday().
>
> A follow-up patch will have to change the user space interface
> to deal better with 32-bit tasks, which may use an incompatible
> layout for 'struct timespec'.
>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH net-next 0/5] rtnetlink: add IFA_IF_NETNSID for RTM_GETADDR
From: David Miller @ 2018-09-01 0:58 UTC (permalink / raw)
To: nicolas.dichtel
Cc: christian.brauner, ktkhai, netdev, linux-kernel, kuznet, yoshfuji,
pombredanne, kstewart, gregkh, dsahern, fw, lucien.xin,
jakub.kicinski, jbenc
In-Reply-To: <ba6a1b18-959e-e307-45fa-e0da9d6f0b0f@6wind.com>
From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Thu, 30 Aug 2018 17:49:12 +0200
> Maybe I would choose a more generic name for the attribute,
> something that can be used in other netlink families (xfrm,
> netfilter, ...) also. What about IFA_TARGET_NSID?
Agreed, let's at least try to think ahead about potential future
uses :)
^ permalink raw reply
* Re: [net-next] ipv6:sr: Use the right next-hop value for neighbor discovery in End.DX4
From: David Miller @ 2018-09-01 0:46 UTC (permalink / raw)
To: amsalam20; +Cc: dlebrun, netdev, linux-kernel
In-Reply-To: <20180825141541.960-1-amsalam20@gmail.com>
From: Ahmed Abdelsalam <amsalam20@gmail.com>
Date: Sat, 25 Aug 2018 16:15:41 +0200
> This patch makes sure that "rt_gateway" has the value of
> the next-hop configured from the control plane.
You can't change the rt_gateway field of an existing route,
this will change the value for unrelated users trying to
reach the same destination.
^ permalink raw reply
* Re: [PATCH v2 net-next] liquidio: remove set but not used variable 'irh'
From: Felix Manlunas @ 2018-08-31 20:16 UTC (permalink / raw)
To: YueHaibing
Cc: Derek Chickles, Satanand Burla, Felix Manlunas, Raghu Vatsavayi,
David S. Miller, netdev, kernel-janitors
In-Reply-To: <1535717036-88514-1-git-send-email-yuehaibing@huawei.com>
On Fri, Aug 31, 2018 at 12:03:56PM +0000, YueHaibing wrote:
> Fixes gcc '-Wunused-but-set-variable' warning:
>
> drivers/net/ethernet/cavium/liquidio/request_manager.c: In function 'lio_process_iq_request_list':
> drivers/net/ethernet/cavium/liquidio/request_manager.c:383:27: warning:
> variable 'irh' set but not used [-Wunused-but-set-variable]
>
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> ---
> v2: fix patch description,remove 'cHECK-'
> ---
> drivers/net/ethernet/cavium/liquidio/request_manager.c | 9 ---------
> 1 file changed, 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/cavium/liquidio/request_manager.c b/drivers/net/ethernet/cavium/liquidio/request_manager.c
> index bd0153e..c6f4cbd 100644
> --- a/drivers/net/ethernet/cavium/liquidio/request_manager.c
> +++ b/drivers/net/ethernet/cavium/liquidio/request_manager.c
> @@ -380,7 +380,6 @@ static inline void __copy_cmd_into_iq(struct octeon_instr_queue *iq,
> u32 inst_count = 0;
> unsigned int pkts_compl = 0, bytes_compl = 0;
> struct octeon_soft_command *sc;
> - struct octeon_instr_irh *irh;
> unsigned long flags;
>
> while (old != iq->octeon_read_index) {
> @@ -402,14 +401,6 @@ static inline void __copy_cmd_into_iq(struct octeon_instr_queue *iq,
> case REQTYPE_RESP_NET:
> case REQTYPE_SOFT_COMMAND:
> sc = buf;
> -
> - if (OCTEON_CN23XX_PF(oct) || OCTEON_CN23XX_VF(oct))
> - irh = (struct octeon_instr_irh *)
> - &sc->cmd.cmd3.irh;
> - else
> - irh = (struct octeon_instr_irh *)
> - &sc->cmd.cmd2.irh;
> -
> /* We're expecting a response from Octeon.
> * It's up to lio_process_ordered_list() to
> * process sc. Add sc to the ordered soft
>
Acked-by: Felix Manlunas <felix.manlunas@cavium.com>
^ permalink raw reply
* Re: [PATCH v6 1/4] gpiolib: Pass bitmaps, not integer arrays, to get/set array
From: Peter Rosin @ 2018-09-01 0:23 UTC (permalink / raw)
To: Janusz Krzysztofik, Linus Walleij
Cc: Jonathan Corbet, Miguel Ojeda Sandonis, Peter Korsgaard,
Ulf Hansson, Andrew Lunn, Florian Fainelli, David S. Miller,
Dominik Brodowski, Greg Kroah-Hartman, Kishon Vijay Abraham I,
Lars-Peter Clausen, Michael Hennerich, Jonathan Cameron,
Hartmut Knaack, Peter Meerwald-Stadler, Jiri Slaby, Willy Tarreau,
Geert Uytterhoeven, linux-doc, linux-i2
In-Reply-To: <20180831225616.29221-2-jmkrzyszt@gmail.com>
On 2018-09-01 00:56, Janusz Krzysztofik wrote:
> Most users of get/set array functions iterate consecutive bits of data,
> usually a single integer, while processing array of results obtained
> from, or building an array of values to be passed to those functions.
> Save time wasted on those iterations by changing the functions' API to
> accept bitmaps.
>
> All current users are updated as well.
>
> More benefits from the change are expected as soon as planned support
> for accepting/passing those bitmaps directly from/to respective GPIO
> chip callbacks if applicable is implemented.
> diff --git a/drivers/i2c/muxes/i2c-mux-gpio.c b/drivers/i2c/muxes/i2c-mux-gpio.c
> index 401308e3d036..e28ddc20000d 100644
> --- a/drivers/i2c/muxes/i2c-mux-gpio.c
> +++ b/drivers/i2c/muxes/i2c-mux-gpio.c
> @@ -22,18 +22,16 @@ struct gpiomux {
> struct i2c_mux_gpio_platform_data data;
> unsigned gpio_base;
> struct gpio_desc **gpios;
> - int *values;
> };
>
> static void i2c_mux_gpio_set(const struct gpiomux *mux, unsigned val)
> {
> - int i;
> -
> - for (i = 0; i < mux->data.n_gpios; i++)
> - mux->values[i] = (val >> i) & 1;
> + DECLARE_BITMAP(value_bitmap, mux->data.n_gpios);
Picking a random driver for this comment, it applies to many of them.
I think this creates a VLA? Can't you, for the bit-count, just go with
BITS_PER_TYPE(unsigned)? Or whatever is appropriate for the driver
in question.
Also, I find that where you use DECLARE_BITMAP, the _bitmap suffix is
just noise and I would very much like to zap it.
Cheers,
Peter
> +
> + *value_bitmap = val;
>
> gpiod_set_array_value_cansleep(mux->data.n_gpios,
> - mux->gpios, mux->values);
> + mux->gpios, value_bitmap);
> }
>
> static int i2c_mux_gpio_select(struct i2c_mux_core *muxc, u32 chan)
> @@ -182,15 +180,13 @@ static int i2c_mux_gpio_probe(struct platform_device *pdev)
> return -EPROBE_DEFER;
>
> muxc = i2c_mux_alloc(parent, &pdev->dev, mux->data.n_values,
> - mux->data.n_gpios * sizeof(*mux->gpios) +
> - mux->data.n_gpios * sizeof(*mux->values), 0,
> + mux->data.n_gpios * sizeof(*mux->gpios), 0,
> i2c_mux_gpio_select, NULL);
> if (!muxc) {
> ret = -ENOMEM;
> goto alloc_failed;
> }
> mux->gpios = muxc->priv;
> - mux->values = (int *)(mux->gpios + mux->data.n_gpios);
> muxc->priv = mux;
>
> platform_set_drvdata(pdev, muxc);
^ permalink raw reply
* Re: phys_port_id in switchdev mode?
From: Marcelo Ricardo Leitner @ 2018-08-31 20:13 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Florian Fainelli, Or Gerlitz, Simon Horman, Andy Gospodarek,
mchan@broadcom.com, Jiri Pirko, Alexander Duyck, Frederick Botha,
nick viljoen, netdev@vger.kernel.org
In-Reply-To: <20180828204351.34fe457f@cakuba.netronome.com>
On Tue, Aug 28, 2018 at 08:43:51PM +0200, Jakub Kicinski wrote:
> Ugh, CC: netdev..
>
> On Tue, 28 Aug 2018 20:05:39 +0200, Jakub Kicinski wrote:
> > Hi!
> >
> > I wonder if we can use phys_port_id in switchdev to group together
> > interfaces of a single PCI PF? Here is the problem:
On Mellanox cards, this is already possible via phys_switch_id, as
each PF has its own phys_switch_id. So all VFs with a given
phys_switch_id belong to the PF with that same phys_switch_id.
I understand this is a vendor-specific design, but if you have the
same phys_switch_id across PFs, does it really matter on which PF the
VF was created on?
Marcelo
^ permalink raw reply
* Re: linux-next: Signed-off-by missing for commit in the bpf-next tree
From: Daniel Borkmann @ 2018-09-01 0:06 UTC (permalink / raw)
To: Stephen Rothwell, Alexei Starovoitov, Networking
Cc: Linux-Next Mailing List, Linux Kernel Mailing List
In-Reply-To: <20180901050751.3714fc63@canb.auug.org.au>
On 08/31/2018 09:07 PM, Stephen Rothwell wrote:
> Hi all,
>
> Commits
>
> 0f93c3995c93 ("bpf: add TCP_SAVE_SYN/TCP_SAVED_SYN sample program")
> 9452048c7940 ("bpf: add TCP_SAVE_SYN/TCP_SAVED_SYN options for bpf_(set|get)sockopt")
> 175eba5962b4 ("xdp: remove redundant variable 'headroom'")
>
> are missing a Signed-off-by from their committers.
Fixed up, thanks!
^ permalink raw reply
* Re: [PATCH v2 15/29] net: split eth_platform_get_mac_address() into subroutines
From: Brian Norris @ 2018-08-31 19:54 UTC (permalink / raw)
To: Bartosz Golaszewski
Cc: Jonathan Corbet, Sekhar Nori, Kevin Hilman, Russell King,
Arnd Bergmann, Greg Kroah-Hartman, David Woodhouse,
Boris Brezillon, Marek Vasut, Richard Weinberger,
Grygorii Strashko, David S . Miller, Srinivas Kandagatla, Naren,
Mauro Carvalho Chehab, Andrew Morton, Lukas Wunner, Dan Carpenter,
Florian Fainelli
In-Reply-To: <20180810080526.27207-16-brgl@bgdev.pl>
Hi,
I responded to the cover letter, but I'll put some notes here too.
On Fri, Aug 10, 2018 at 10:05:12AM +0200, Bartosz Golaszewski wrote:
> From: Bartosz Golaszewski <bgolaszewski@baylibre.com>
>
> We want do add more sources from which to read the MAC address. In
> order to avoid bloating this function too much, start by splitting it
> into subroutines, each of which takes care of reading the MAC from
> one source.
>
> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
> ---
> net/ethernet/eth.c | 44 ++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 36 insertions(+), 8 deletions(-)
>
> diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
> index 7f08105402c8..d01dd55de037 100644
> --- a/net/ethernet/eth.c
> +++ b/net/ethernet/eth.c
> @@ -525,22 +525,50 @@ unsigned char * __weak arch_get_platform_mac_address(void)
> return NULL;
> }
>
> -int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr)
> +static int mac_address_from_of(struct device *dev, u8 *mac_addr)
> {
> const unsigned char *addr;
> - struct device_node *dp;
> + struct device_node *np;
>
> - dp = dev->of_node;
> - addr = NULL;
> - if (dp)
> - addr = of_get_mac_address(dp);
> - if (!addr)
> - addr = arch_get_platform_mac_address();
> + np = dev->of_node;
> + if (!np)
> + return -ENODEV;
>
> + addr = of_get_mac_address(np);
> if (!addr)
> return -ENODEV;
>
> + if (!addr || !is_valid_ether_addr(addr))
> + return -ENODEV;
> +
> ether_addr_copy(mac_addr, addr);
> return 0;
> }
> +
> +static int mac_address_from_arch(u8 *mac_addr)
> +{
> + const unsigned char *addr;
> +
> + addr = arch_get_platform_mac_address();
> + if (!addr || !is_valid_ether_addr(addr))
> + return -ENODEV;
> +
> + ether_addr_copy(mac_addr, addr);
> + return 0;
> +}
> +
> +int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr)
> +{
> + int rv;
> +
> + rv = mac_address_from_of(dev, mac_addr);
This could just be replaced with device_get_mac_address(). Then you
should be extending either fwnode_get_device_mac_address() or
device_get_mac_addres() in the next patch, not
eth_platform_get_mac_address().
Regards,
Brian
> + if (!rv)
> + return 0;
> +
> + rv = mac_address_from_arch(mac_addr);
> + if (!rv)
> + return 0;
> +
> + return -ENODEV;
> +}
> EXPORT_SYMBOL(eth_platform_get_mac_address);
> --
> 2.18.0
>
^ permalink raw reply
* Re: [PATCH v2 00/29] at24: remove at24_platform_data
From: Brian Norris @ 2018-08-31 19:46 UTC (permalink / raw)
To: Bartosz Golaszewski
Cc: Jonathan Corbet, Sekhar Nori, Kevin Hilman, Russell King,
Arnd Bergmann, Greg Kroah-Hartman, David Woodhouse,
Boris Brezillon, Marek Vasut, Richard Weinberger,
Grygorii Strashko, David S . Miller, Srinivas Kandagatla, Naren,
Mauro Carvalho Chehab, Andrew Morton, Lukas Wunner, Dan Carpenter,
Florian Fainelli
In-Reply-To: <20180810080526.27207-1-brgl@bgdev.pl>
Hi,
On Fri, Aug 10, 2018 at 10:04:57AM +0200, Bartosz Golaszewski wrote:
> Most boards use the EEPROM to store the MAC address. This series adds
> support for cell lookups to the nvmem framework, registers relevant
> cells for all users, adds nvmem support to eth_platform_get_mac_address(),
> converts davinci_emac driver to using it and replaces at24_platform_data
> with device properties.
We already have:
of_get_nvmem_mac_address() (which does exactly what you're adding,
except it's DT specific)
of_get_mac_address()
fwnode_get_mac_address()
device_get_mac_address()
and now you've taught me that this exists too:
eth_platform_get_mac_address()
These mostly don't share code, and with your series, they'll start to
diverge even more as to what they support. Can you please help rectify
that, instead of widening the gap?
For instance, you can delete most of eth_platform_get_mac_address() and
replace it with device_get_mac_address() [1]. And you could add your new
stuff to fwnode_get_mac_address().
And important part to note here is that you code isn't just useful for
ethernet -- it could be useful for Wifi devices too. So IMO, sticking it
only in an "eth" function is the wrong move.
Brian
[1] arch_get_platform_mac_address() is the only part I wouldn't want to
replicate into a truly generic helper. The following should be a no-op
refactor, AIUI:
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index ee28440f57c5..b738651f71e1 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -47,12 +47,12 @@
#include <linux/inet.h>
#include <linux/ip.h>
#include <linux/netdevice.h>
+#include <linux/property.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/if_ether.h>
-#include <linux/of_net.h>
#include <linux/pci.h>
#include <net/dst.h>
#include <net/arp.h>
@@ -528,19 +528,11 @@ unsigned char * __weak arch_get_platform_mac_address(void)
int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr)
{
const unsigned char *addr;
- struct device_node *dp;
- if (dev_is_pci(dev))
- dp = pci_device_to_OF_node(to_pci_dev(dev));
- else
- dp = dev->of_node;
-
- addr = NULL;
- if (dp)
- addr = of_get_mac_address(dp);
- if (!addr)
- addr = arch_get_platform_mac_address();
+ if (device_get_mac_address(dev, mac_addr, ETH_ALEN))
+ return 0;
+ addr = arch_get_platform_mac_address();
if (!addr)
return -ENODEV;
^ permalink raw reply related
* Adding path
From: Jimmy Wilson @ 2018-08-31 13:15 UTC (permalink / raw)
To: netdev
Hi,
Have you received my email from last week?
I would like to speak with the person who manage your photos for your
company?
We are here to provide you all kinds of imaging editing.
What we can provide you:
Cutting out for photos
Clipping path for photos
Masking for photos
Retouching for your photos
Retouching for the Beauty Model and Portraits.
We have 20 staffs in house and daily basis 1000 images can be processed.
We give testing for your photos.
Thanks,
Jimmy Wilson
^ permalink raw reply
* [PATCH net-next] net: dsa: b53: Provide sensible defaults
From: Florian Fainelli @ 2018-08-31 19:29 UTC (permalink / raw)
To: netdev
Cc: Florian Fainelli, Andrew Lunn, Vivien Didelot, David S. Miller,
open list
The SRAB driver is the default way to communicate with the integrated
switch on iProc platforms and the MMAP driver is the way to communicate
with the integrated switch on DSL BCM63xx and CM BCM33xx.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
drivers/net/dsa/b53/Kconfig | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/dsa/b53/Kconfig b/drivers/net/dsa/b53/Kconfig
index 2f988216dab9..37745f4bf4f6 100644
--- a/drivers/net/dsa/b53/Kconfig
+++ b/drivers/net/dsa/b53/Kconfig
@@ -23,6 +23,7 @@ config B53_MDIO_DRIVER
config B53_MMAP_DRIVER
tristate "B53 MMAP connected switch driver"
depends on B53 && HAS_IOMEM
+ default BCM63XX || BMIPS_GENERIC
help
Select to enable support for memory-mapped switches like the BCM63XX
integrated switches.
@@ -30,6 +31,7 @@ config B53_MMAP_DRIVER
config B53_SRAB_DRIVER
tristate "B53 SRAB connected switch driver"
depends on B53 && HAS_IOMEM
+ default ARCH_BCM_IPROC
help
Select to enable support for memory-mapped Switch Register Access
Bridge Registers (SRAB) like it is found on the BCM53010
--
2.17.1
^ permalink raw reply related
* [PATCH] cxgb4: fix abort_req_rss6 struct
From: Steve Wise @ 2018-08-31 18:52 UTC (permalink / raw)
To: netdev; +Cc: jgg, dledford, davem, linux-rdma
Remove the incorrect WR_HDR field which can cause a misinterpretation
of this CPL by ULDs.
Fixes: a3cdaa69e4ae ("cxgb4: Adds CPL support for Shared Receive Queues")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
Dave, Doug, and Jason,
I request this merge through the rdma repo since the only user of this
structure is iw_cxgb4.
---
drivers/net/ethernet/chelsio/cxgb4/t4_msg.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h b/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
index b8f75a22fb6c..f152da1ce046 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
@@ -753,7 +753,6 @@ struct cpl_abort_req_rss {
};
struct cpl_abort_req_rss6 {
- WR_HDR;
union opcode_tid ot;
__be32 srqidx_status;
};
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH] i40e: mark expected switch fall-through
From: Gustavo A. R. Silva @ 2018-08-31 22:57 UTC (permalink / raw)
To: Kirsher, Jeffrey T, David S. Miller
Cc: intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <61CC2BC414934749BD9F5BF3D5D94044950056B7@ORSMSX112.amr.corp.intel.com>
On 8/30/18 4:09 PM, Kirsher, Jeffrey T wrote:
>> -----Original Message-----
>> From: Gustavo A. R. Silva [mailto:gustavo@embeddedor.com]
>> Sent: Thursday, August 30, 2018 11:50
>> To: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; David S. Miller
>> <davem@davemloft.net>
>> Cc: intel-wired-lan@lists.osuosl.org; netdev@vger.kernel.org; linux-
>> kernel@vger.kernel.org; Gustavo A. R. Silva <gustavo@embeddedor.com>
>> Subject: [PATCH] i40e: mark expected switch fall-through
>>
>> In preparation to enabling -Wimplicit-fallthrough, mark switch cases where
>> we are expecting to fall through.
>>
>> Addresses-Coverity-ID: 1473099 ("Missing break in switch")
>> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
>> ---
>> drivers/net/ethernet/intel/i40e/i40e_xsk.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> I have picked this up Dave.
>
Thanks, Jeffrey.
--
Gustavo
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox