* [RFC PATCH net-next v2 01/15] bpf: BPF support for socket ops
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
2017-06-16 12:07 ` Daniel Borkmann
2017-06-15 20:08 ` [RFC PATCH net-next v2 02/15] bpf: program to load socketops BPF programs Lawrence Brakmo
` (13 subsequent siblings)
14 siblings, 1 reply; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
Created a new BPF program type, BPF_PROG_TYPE_SOCKET_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.). Currently there is
functionality to load one global BPF program of this type which can be
called at appropriate times to set relevant connection parameters such
as buffer sizes, SYN and SYN-ACK RTOs, etc., based on connection
information such as IP addresses, port numbers, etc.
Although there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopt), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (e.g. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, rules that act on the distance (or RTT)
between the hosts are much easier to express than route metric rules
and can be global. Finally, unlike setsockopt, it does not require
application changes and it can be updated easily at any time.
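As an illustration of the probabilistic case, a program of this type
could look roughly like the sketch below. BPF_SOCKET_OPS_TIMEOUT_INIT
is a placeholder op number and the meaning of the reply is defined by
the op; this patch itself only defines BPF_SOCKET_OPS_VOID, with real
ops arriving later in the series:

	#include <uapi/linux/bpf.h>
	#include "bpf_helpers.h"

	/* Placeholder op number, not defined by this patch */
	#define BPF_SOCKET_OPS_TIMEOUT_INIT 1

	SEC("sockops")
	int bpf_ab_test(struct bpf_socket_ops *skops)
	{
		int rv = -1;	/* < 0: op not handled, caller keeps default */

		if (skops->op == BPF_SOCKET_OPS_TIMEOUT_INIT) {
			/* Experiment on ~10% of flows, leave the other
			 * ~90% on the kernel default for comparison.
			 */
			if (bpf_get_prandom_u32() % 10 == 0)
				rv = 10;	/* alternate value under test */
		}
		return rv;
	}

	char _license[] SEC("license") = "GPL";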
I plan to add support for loading per-cgroup socket ops BPF programs in
the near future. One question is whether I should add this functionality
into David Ahern's BPF_PROG_TYPE_CGROUP_SOCK or create a new cgroup bpf
type. Whereas the current cgroup_sock type expects to be called only once
during a connection's lifetime, the new socket_ops type could be called
multiple times; for example, before sending a SYN or SYN-ACK to set an
appropriate timeout, when the connection is established to set
congestion control, etc. As a result, it has an "op" field to specify
the type of operation requested.
The purpose of this new program type is to simplify setting connection
parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
easy to use Facebook's internal IPv6 addresses to determine if both hosts
of a connection are in the same datacenter. Therefore, it is easy to
write a BPF program to choose a small SYN RTO value when both hosts are
in the same datacenter.
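A sketch of such a program, assuming (purely for illustration) that
same-datacenter hosts share the upper 64 bits of their IPv6 addresses
and reusing the placeholder op number from the sketch above:

	#include <uapi/linux/bpf.h>
	#include "bpf_helpers.h"

	#define BPF_SOCKET_OPS_TIMEOUT_INIT 1	/* placeholder op number */

	SEC("sockops")
	int bpf_syn_rto(struct bpf_socket_ops *skops)
	{
		if (skops->op != BPF_SOCKET_OPS_TIMEOUT_INIT)
			return -1;

		/* Same-datacenter heuristic: both IPv6 addresses share
		 * the upper 64 bits (the ip6 arrays hold four 32-bit
		 * words each).
		 */
		if (skops->family == 10 /* AF_INET6 */ &&
		    skops->remote_ip6[0] == skops->local_ip6[0] &&
		    skops->remote_ip6[1] == skops->local_ip6[1])
			return 10;	/* small SYN RTO for intra-DC flows */

		return -1;		/* keep the kernel default otherwise */
	}

	char _license[] SEC("license") = "GPL";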
This patch only contains the framework to support the new BPF program
type; following patches add the functionality to set various connection
parameters.
This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
and a new bpf syscall command to load a new program of this type:
BPF_PROG_LOAD_SOCKET_OPS.
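From user space, installing a program via the new command could look
roughly like this (a sketch; attr.bpf_fd is the field consumed by the
syscall.c hunk below, and the samples in this series go through
bpf_load.c instead):

	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/bpf.h>

	/* prog_fd: fd of an already loaded BPF_PROG_TYPE_SOCKET_OPS
	 * program. Per this patch, passing fd == 0 removes the current
	 * global program.
	 */
	static int set_global_sockops_prog(int prog_fd)
	{
		union bpf_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.bpf_fd = prog_fd;

		return syscall(__NR_bpf, BPF_PROG_LOAD_SOCKET_OPS,
			       &attr, sizeof(attr));
	}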
Two new corresponding structs (one for the kernel, one for the user/BPF
program):
/* kernel version */
struct bpf_socket_ops_kern {
struct sock *sk;
__u32 is_req_sock:1;
__u32 op;
union {
__u32 reply;
__u32 replylong[4];
};
};
/* user version */
struct bpf_socket_ops {
__u32 op;
union {
__u32 reply;
__u32 replylong[4];
};
__u32 family;
__u32 remote_ip4;
__u32 local_ip4;
__u32 remote_ip6[4];
__u32 local_ip6[4];
__u32 remote_port;
__u32 local_port;
};
Currently there are two types of ops. The first type expects the BPF
program to return a value which is then used by the caller (or a
negative value to indicate the operation is not supported). The second
type expects state changes to be done by the BPF program, for example
through a setsockopt BPF helper function, and they ignore the return
value.
The reply fields of the bpf_socket_ops struct are there in case a BPF
program needs to return a value larger than a single integer.
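On the caller side, the first kind of op would be consumed roughly as
in this sketch, using the tcp_call_bpf() helper added to tcp.h below
(the function name, op number and units here are illustrative only):

	/* Illustrative caller in the TCP stack (hypothetical helper) */
	static void tcp_init_syn_rto(struct sock *sk)
	{
		int timeout = tcp_call_bpf(sk, false, 1 /* placeholder op */);

		if (timeout > 0)
			inet_csk(sk)->icsk_rto = timeout;	/* BPF-supplied */
		/* timeout < 0: no program loaded or op not handled;
		 * keep the kernel default
		 */
	}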
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
include/linux/bpf.h | 6 ++
include/linux/bpf_types.h | 1 +
include/linux/filter.h | 10 +++
include/net/tcp.h | 27 ++++++++
include/uapi/linux/bpf.h | 28 +++++++++
kernel/bpf/syscall.c | 2 +
net/core/Makefile | 3 +-
net/core/filter.c | 157 ++++++++++++++++++++++++++++++++++++++++++++++
net/core/sock_bpfops.c | 67 ++++++++++++++++++++
samples/bpf/bpf_load.c | 13 +++-
10 files changed, 310 insertions(+), 4 deletions(-)
create mode 100644 net/core/sock_bpfops.c
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1bcbf0a..e164f94 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -362,4 +362,10 @@ extern const struct bpf_func_proto bpf_get_stackid_proto;
void bpf_user_rnd_init_once(void);
u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+/* socket_ops related */
+struct sock;
+struct bpf_socket_ops_kern;
+
+int bpf_socket_ops_set_prog(int fd);
+int bpf_socket_ops_call(struct bpf_socket_ops_kern *bpf_socket);
#endif /* _LINUX_BPF_H */
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 03bf223..ca69d10 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -10,6 +10,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock_prog_ops)
BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout_prog_ops)
BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout_prog_ops)
BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit_prog_ops)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_OPS, socket_ops_prog_ops)
#endif
#ifdef CONFIG_BPF_EVENTS
BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe_prog_ops)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1fa26dc..102e881 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -898,4 +898,14 @@ static inline int bpf_tell_extensions(void)
return SKF_AD_MAX;
}
+struct bpf_socket_ops_kern {
+ struct sock *sk;
+ u32 is_req_sock:1;
+ u32 op;
+ union {
+ u32 reply;
+ u32 replylong[4];
+ };
+};
+
#endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3ab677d..9ad0d80 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -46,6 +46,9 @@
#include <linux/seq_file.h>
#include <linux/memcontrol.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
extern struct inet_hashinfo tcp_hashinfo;
extern struct percpu_counter tcp_orphan_count;
@@ -1991,4 +1994,28 @@ static inline void tcp_listendrop(const struct sock *sk)
enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer);
+/* Call BPF_SOCKET_OPS program that returns an int. If the return value
+ * is < 0, then the BPF op failed (for example if the loaded BPF
+ * program does not support the chosen operation or there is no BPF
+ * program loaded).
+ */
+#ifdef CONFIG_BPF
+static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
+{
+ struct bpf_socket_ops_kern socket_ops;
+
+ memset(&socket_ops, 0, sizeof(socket_ops));
+ socket_ops.sk = sk;
+ socket_ops.is_req_sock = is_req_sock ? 1 : 0;
+ socket_ops.op = op;
+
+ return bpf_socket_ops_call(&socket_ops);
+}
+#else
+static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
+{
+ return -1;
+}
+#endif
+
#endif /* _TCP_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f94b48b..1540336 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -87,6 +87,7 @@ enum bpf_cmd {
BPF_PROG_GET_FD_BY_ID,
BPF_MAP_GET_FD_BY_ID,
BPF_OBJ_GET_INFO_BY_FD,
+ BPF_PROG_LOAD_SOCKET_OPS,
};
enum bpf_map_type {
@@ -120,6 +121,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LWT_IN,
BPF_PROG_TYPE_LWT_OUT,
BPF_PROG_TYPE_LWT_XMIT,
+ BPF_PROG_TYPE_SOCKET_OPS,
};
enum bpf_attach_type {
@@ -720,4 +722,30 @@ struct bpf_map_info {
__u32 map_flags;
} __attribute__((aligned(8)));
+/* User bpf_socket_ops struct to access socket values and specify request ops
+ * and their replies.
+ * New fields can only be added at the end of this structure
+ */
+struct bpf_socket_ops {
+ __u32 op;
+ union {
+ __u32 reply;
+ __u32 replylong[4];
+ };
+ __u32 family;
+ __u32 remote_ip4;
+ __u32 local_ip4;
+ __u32 remote_ip6[4];
+ __u32 local_ip6[4];
+ __u32 remote_port;
+ __u32 local_port;
+};
+
+/* List of known BPF socket_ops operators.
+ * New entries can only be added at the end
+ */
+enum {
+ BPF_SOCKET_OPS_VOID,
+};
+
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8942c82..5024b97 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1458,6 +1458,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
break;
case BPF_OBJ_GET_INFO_BY_FD:
err = bpf_obj_get_info_by_fd(&attr, uattr);
break;
+ case BPF_PROG_LOAD_SOCKET_OPS:
+ err = bpf_socket_ops_set_prog(attr.bpf_fd);
+ break;
default:
err = -EINVAL;
diff --git a/net/core/Makefile b/net/core/Makefile
index 79f9479..5d711c2 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -9,7 +9,8 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
obj-y += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
- sock_diag.o dev_ioctl.o tso.o sock_reuseport.o
+ sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
+ sock_bpfops.o
obj-$(CONFIG_XFRM) += flow.o
obj-y += net-sysfs.o
diff --git a/net/core/filter.c b/net/core/filter.c
index 60ed6f3..7466f55 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3095,6 +3095,37 @@ void bpf_warn_invalid_xdp_action(u32 act)
}
EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
+static bool __is_valid_socket_ops_access(int off, int size,
+ enum bpf_access_type type)
+{
+ if (off < 0 || off >= sizeof(struct bpf_socket_ops))
+ return false;
+ /* The verifier guarantees that size > 0. */
+ if (off % size != 0)
+ return false;
+ if (size != sizeof(__u32))
+ return false;
+
+ return true;
+}
+
+static bool socket_ops_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ enum bpf_reg_type *reg_type)
+{
+ if (type == BPF_WRITE) {
+ switch (off) {
+ case offsetof(struct bpf_socket_ops, op) ...
+ offsetof(struct bpf_socket_ops, replylong[3]):
+ break;
+ default:
+ return false;
+ }
+ }
+
+ return __is_valid_socket_ops_access(off, size, type);
+}
+
static u32 bpf_convert_ctx_access(enum bpf_access_type type,
const struct bpf_insn *si,
struct bpf_insn *insn_buf,
@@ -3364,6 +3395,126 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
return insn - insn_buf;
}
+static u32 socket_ops_convert_ctx_access(enum bpf_access_type type,
+ const struct bpf_insn *si,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog)
+{
+ struct bpf_insn *insn = insn_buf;
+ int off;
+
+ switch (si->off) {
+ case offsetof(struct bpf_socket_ops, op) ...
+ offsetof(struct bpf_socket_ops, replylong[3]):
+ off = si->off;
+ off -= offsetof(struct bpf_socket_ops, op);
+ off += offsetof(struct bpf_socket_ops_kern, op);
+ if (type == BPF_WRITE)
+ *insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
+ off);
+ else
+ *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+ off);
+ break;
+
+ case offsetof(struct bpf_socket_ops, family):
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_family) != 2);
+
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+ struct bpf_socket_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_socket_ops_kern, sk));
+ *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common, skc_family));
+ break;
+
+ case offsetof(struct bpf_socket_ops, remote_ip4):
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_daddr) != 4);
+
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+ struct bpf_socket_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_socket_ops_kern, sk));
+ *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common, skc_daddr));
+ *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
+ break;
+
+ case offsetof(struct bpf_socket_ops, local_ip4):
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_rcv_saddr) != 4);
+
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+ struct bpf_socket_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_socket_ops_kern, sk));
+ *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common,
+ skc_rcv_saddr));
+ *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
+ break;
+
+ case offsetof(struct bpf_socket_ops, remote_ip6[0]) ...
+ offsetof(struct bpf_socket_ops, remote_ip6[3]):
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
+ skc_v6_daddr.s6_addr32[0]) != 4);
+
+ off = si->off;
+ off -= offsetof(struct bpf_socket_ops, remote_ip6[0]);
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+ struct bpf_socket_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_socket_ops_kern, sk));
+ *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common,
+ skc_v6_daddr.s6_addr32[0]) +
+ off);
+ *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
+ break;
+
+ case offsetof(struct bpf_socket_ops, local_ip6[0]) ...
+ offsetof(struct bpf_socket_ops, local_ip6[3]):
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
+ skc_v6_rcv_saddr.s6_addr32[0]) != 4);
+
+ off = si->off;
+ off -= offsetof(struct bpf_socket_ops, local_ip6[0]);
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+ struct bpf_socket_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_socket_ops_kern, sk));
+ *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common,
+ skc_v6_rcv_saddr.s6_addr32[0]) +
+ off);
+ *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
+ break;
+
+ case offsetof(struct bpf_socket_ops, remote_port):
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_dport) != 2);
+
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+ struct bpf_socket_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_socket_ops_kern, sk));
+ *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common, skc_dport));
+ *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 16);
+ break;
+
+ case offsetof(struct bpf_socket_ops, local_port):
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_num) != 2);
+
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+ struct bpf_socket_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_socket_ops_kern, sk));
+ *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common, skc_num));
+ break;
+ }
+ return insn - insn_buf;
+}
+
const struct bpf_verifier_ops sk_filter_prog_ops = {
.get_func_proto = sk_filter_func_proto,
.is_valid_access = sk_filter_is_valid_access,
@@ -3413,6 +3564,12 @@ const struct bpf_verifier_ops cg_sock_prog_ops = {
.convert_ctx_access = sock_filter_convert_ctx_access,
};
+const struct bpf_verifier_ops socket_ops_prog_ops = {
+ .get_func_proto = bpf_base_func_proto,
+ .is_valid_access = socket_ops_is_valid_access,
+ .convert_ctx_access = socket_ops_convert_ctx_access,
+};
+
int sk_detach_filter(struct sock *sk)
{
int ret = -ENOENT;
diff --git a/net/core/sock_bpfops.c b/net/core/sock_bpfops.c
new file mode 100644
index 0000000..8f8daa5
--- /dev/null
+++ b/net/core/sock_bpfops.c
@@ -0,0 +1,67 @@
+/*
+ * BPF support for sockets
+ *
+ * Copyright (c) 2016 Lawrence Brakmo <brakmo@fb.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ */
+
+#include <net/sock.h>
+#include <linux/skbuff.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/errno.h>
+#ifdef CONFIG_NET_NS
+#include <net/net_namespace.h>
+#include <linux/proc_ns.h>
+#endif
+
+/* Global BPF program for sockets */
+static struct bpf_prog *bpf_socket_ops_prog;
+static DEFINE_RWLOCK(bpf_socket_ops_lock);
+
+int bpf_socket_ops_set_prog(int fd)
+{
+ int err = 0;
+
+ write_lock(&bpf_socket_ops_lock);
+ if (bpf_socket_ops_prog) {
+ bpf_prog_put(bpf_socket_ops_prog);
+ bpf_socket_ops_prog = NULL;
+ }
+
+ /* fd of zero is used as a signal to remove the current
+ * bpf_socket_ops_prog.
+ */
+ if (fd == 0) {
+ write_unlock(&bpf_socket_ops_lock);
+ return 1;
+ }
+
+ bpf_socket_ops_prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_OPS);
+ if (IS_ERR(bpf_socket_ops_prog)) {
+ bpf_prog_put(bpf_socket_ops_prog);
+ bpf_socket_ops_prog = NULL;
+ err = 1;
+ }
+ write_unlock(&bpf_socket_ops_lock);
+ return err;
+}
+
+int bpf_socket_ops_call(struct bpf_socket_ops_kern *bpf_socket)
+{
+ int ret;
+
+ read_lock(&bpf_socket_ops_lock);
+ if (bpf_socket_ops_prog) {
+ rcu_read_lock();
+ ret = (int)BPF_PROG_RUN(bpf_socket_ops_prog, bpf_socket);
+ rcu_read_unlock();
+ } else {
+ ret = -1;
+ }
+ read_unlock(&bpf_socket_ops_lock);
+ return ret;
+}
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index a91c57d..c18d713 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -64,6 +64,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
bool is_cgroup_skb = strncmp(event, "cgroup/skb", 10) == 0;
bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0;
+ bool is_sockops = strncmp(event, "sockops", 7) == 0;
size_t insns_cnt = size / sizeof(struct bpf_insn);
enum bpf_prog_type prog_type;
char buf[256];
@@ -89,6 +90,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
prog_type = BPF_PROG_TYPE_CGROUP_SKB;
} else if (is_cgroup_sk) {
prog_type = BPF_PROG_TYPE_CGROUP_SOCK;
+ } else if (is_sockops) {
+ prog_type = BPF_PROG_TYPE_SOCKET_OPS;
} else {
printf("Unknown event '%s'\n", event);
return -1;
@@ -106,8 +109,11 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
return 0;
- if (is_socket) {
- event += 6;
+ if (is_socket || is_sockops) {
+ if (is_socket)
+ event += 6;
+ else
+ event += 7;
if (*event != '/')
return 0;
event++;
@@ -560,7 +566,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
memcmp(shname, "xdp", 3) == 0 ||
memcmp(shname, "perf_event", 10) == 0 ||
memcmp(shname, "socket", 6) == 0 ||
- memcmp(shname, "cgroup/", 7) == 0)
+ memcmp(shname, "cgroup/", 7) == 0 ||
+ memcmp(shname, "sockops", 7) == 0)
load_and_attach(shname, data->d_buf, data->d_size);
}
--
2.9.3
* Re: [RFC PATCH net-next v2 01/15] bpf: BPF support for socket ops
2017-06-15 20:08 ` [RFC PATCH net-next v2 01/15] " Lawrence Brakmo
@ 2017-06-16 12:07 ` Daniel Borkmann
2017-06-16 23:41 ` Lawrence Brakmo
2017-06-17 21:48 ` Lawrence Brakmo
0 siblings, 2 replies; 29+ messages in thread
From: Daniel Borkmann @ 2017-06-16 12:07 UTC (permalink / raw)
To: Lawrence Brakmo, netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern
On 06/15/2017 10:08 PM, Lawrence Brakmo wrote:
> Created a new BPF program type, BPF_PROG_TYPE_SOCKET_OPS, and a corresponding
> struct that allows BPF programs of this type to access some of the
> socket's fields (such as IP addresses, ports, etc.). Currently there is
> functionality to load one global BPF program of this type which can be
> called at appropriate times to set relevant connection parameters such
> as buffer sizes, SYN and SYN-ACK RTOs, etc., based on connection
> information such as IP addresses, port numbers, etc.
>
> Although there are already 3 mechanisms to set parameters (sysctls,
> route metrics and setsockopt), this new mechanism provides some
> distinct advantages. Unlike sysctls, it can set parameters per
> connection. In contrast to route metrics, it can also use port numbers
> and information provided by a user level program. In addition, it could
> set parameters probabilistically for evaluation purposes (e.g. do
> something different on 10% of the flows and compare results with the
> other 90% of the flows). Also, in cases where IPv6 addresses contain
> geographic information, rules that act on the distance (or RTT)
> between the hosts are much easier to express than route metric rules
> and can be global. Finally, unlike setsockopt, it does not require
> application changes and it can be updated easily at any time.
>
> I plan to add support for loading per-cgroup socket ops BPF programs in
> the near future. One question is whether I should add this functionality
> into David Ahern's BPF_PROG_TYPE_CGROUP_SOCK or create a new cgroup bpf
> type. Whereas the current cgroup_sock type expects to be called only once
> during a connection's lifetime, the new socket_ops type could be called
> multiple times; for example, before sending a SYN or SYN-ACK to set an
> appropriate timeout, when the connection is established to set
> congestion control, etc. As a result, it has an "op" field to specify
> the type of operation requested.
>
> The purpose of this new program type is to simplify setting connection
> parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
> easy to use Facebook's internal IPv6 addresses to determine if both hosts
> of a connection are in the same datacenter. Therefore, it is easy to
> write a BPF program to choose a small SYN RTO value when both hosts are
> in the same datacenter.
>
> This patch only contains the framework to support the new BPF program
> type; following patches add the functionality to set various connection
> parameters.
>
> This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
> and a new bpf syscall command to load a new program of this type:
> BPF_PROG_LOAD_SOCKET_OPS.
>
> Two new corresponding structs (one for the kernel, one for the user/BPF
> program):
>
> /* kernel version */
> struct bpf_socket_ops_kern {
> struct sock *sk;
> __u32 is_req_sock:1;
> __u32 op;
> union {
> __u32 reply;
> __u32 replylong[4];
> };
> };
>
> /* user version */
> struct bpf_socket_ops {
> __u32 op;
> union {
> __u32 reply;
> __u32 replylong[4];
> };
> __u32 family;
> __u32 remote_ip4;
> __u32 local_ip4;
> __u32 remote_ip6[4];
> __u32 local_ip6[4];
> __u32 remote_port;
> __u32 local_port;
> };
Above and ...
struct bpf_sock {
__u32 bound_dev_if;
__u32 family;
__u32 type;
__u32 protocol;
};
... would result in two BPF sock user versions. It's okayish, but
given struct bpf_sock is quite generic, couldn't we merge the members
from struct bpf_socket_ops into struct bpf_sock instead?
Idea would be that sock_filter_is_valid_access() for cgroups would
then check off < 0 || off + size > offsetofend(struct bpf_sock, protocol)
to disallow new members, and your socket_ops_is_valid_access() could
allow and xlate the full range. The family member is already duplicate
and the others could then be accessed from these kind of BPF progs as
well, plus we have a single user representation similar as with __sk_buff
that multiple types will use.
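For illustration, the two checks could look roughly like the sketch
below (write restrictions and ctx translation omitted; offsetofend()
bounds what cgroup sock progs may touch while socket_ops progs would
see the full, merged struct):

	/* cgroup/sock progs: only members up to and including
	 * 'protocol' remain visible
	 */
	static bool sock_filter_is_valid_access(int off, int size,
						enum bpf_access_type type,
						enum bpf_reg_type *reg_type)
	{
		if (off < 0 ||
		    off + size > offsetofend(struct bpf_sock, protocol))
			return false;
		return off % size == 0 && size == sizeof(__u32);
	}

	/* socket_ops progs: allow (and xlate) the full struct,
	 * including the members merged in from bpf_socket_ops
	 */
	static bool socket_ops_is_valid_access(int off, int size,
					       enum bpf_access_type type,
					       enum bpf_reg_type *reg_type)
	{
		return off >= 0 && off < sizeof(struct bpf_sock) &&
		       off % size == 0 && size == sizeof(__u32);
	}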
> Currently there are two types of ops. The first type expects the BPF
> program to return a value which is then used by the caller (or a
> negative value to indicate the operation is not supported). The second
> type expects state changes to be done by the BPF program, for example
> through a setsockopt BPF helper function, and they ignore the return
> value.
>
> The reply fields of the bpf_socket_ops struct are there in case a BPF
> program needs to return a value larger than a single integer.
>
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
> ---
> include/linux/bpf.h | 6 ++
> include/linux/bpf_types.h | 1 +
> include/linux/filter.h | 10 +++
> include/net/tcp.h | 27 ++++++++
> include/uapi/linux/bpf.h | 28 +++++++++
> kernel/bpf/syscall.c | 2 +
> net/core/Makefile | 3 +-
> net/core/filter.c | 157 ++++++++++++++++++++++++++++++++++++++++++++++
> net/core/sock_bpfops.c | 67 ++++++++++++++++++++
> samples/bpf/bpf_load.c | 13 +++-
> 10 files changed, 310 insertions(+), 4 deletions(-)
> create mode 100644 net/core/sock_bpfops.c
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 1bcbf0a..e164f94 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -362,4 +362,10 @@ extern const struct bpf_func_proto bpf_get_stackid_proto;
> void bpf_user_rnd_init_once(void);
> u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>
> +/* socket_ops related */
> +struct sock;
Is this needed? You're using it in struct bpf_socket_ops_kern in
filter.h, where we already have struct sock;
> +struct bpf_socket_ops_kern;
> +
> +int bpf_socket_ops_set_prog(int fd);
> +int bpf_socket_ops_call(struct bpf_socket_ops_kern *bpf_socket);
> #endif /* _LINUX_BPF_H */
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 03bf223..ca69d10 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -10,6 +10,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock_prog_ops)
> BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout_prog_ops)
> BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout_prog_ops)
> BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit_prog_ops)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_OPS, socket_ops_prog_ops)
> #endif
> #ifdef CONFIG_BPF_EVENTS
> BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe_prog_ops)
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 1fa26dc..102e881 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -898,4 +898,14 @@ static inline int bpf_tell_extensions(void)
> return SKF_AD_MAX;
> }
>
> +struct bpf_socket_ops_kern {
> + struct sock *sk;
> + u32 is_req_sock:1;
> + u32 op;
> + union {
> + u32 reply;
> + u32 replylong[4];
> + };
> +};
> +
> #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 3ab677d..9ad0d80 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -46,6 +46,9 @@
> #include <linux/seq_file.h>
> #include <linux/memcontrol.h>
>
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +
> extern struct inet_hashinfo tcp_hashinfo;
>
> extern struct percpu_counter tcp_orphan_count;
> @@ -1991,4 +1994,28 @@ static inline void tcp_listendrop(const struct sock *sk)
>
> enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer);
>
> +/* Call BPF_SOCKET_OPS program that returns an int. If the return value
> + * is < 0, then the BPF op failed (for example if the loaded BPF
> + * program does not support the chosen operation or there is no BPF
> + * program loaded).
> + */
> +#ifdef CONFIG_BPF
> +static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
> +{
> + struct bpf_socket_ops_kern socket_ops;
> +
> + memset(&socket_ops, 0, sizeof(socket_ops));
> + socket_ops.sk = sk;
> + socket_ops.is_req_sock = is_req_sock ? 1 : 0;
Is is_req_sock actually used here in this patch (apart from setting it)?
Not seeing that BPF prog will access it, if it also shouldn't access it,
then bool type would be better.
> + socket_ops.op = op;
> +
> + return bpf_socket_ops_call(&socket_ops);
> +}
> +#else
> +static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
> +{
> + return -1;
> +}
> +#endif
> +
> #endif /* _TCP_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f94b48b..1540336 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -87,6 +87,7 @@ enum bpf_cmd {
> BPF_PROG_GET_FD_BY_ID,
> BPF_MAP_GET_FD_BY_ID,
> BPF_OBJ_GET_INFO_BY_FD,
> + BPF_PROG_LOAD_SOCKET_OPS,
Why is this as an extra cmd needed, not extending BPF_PROG_ATTACH
and BPF_PROG_DETACH that we already have?
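(For reference, reusing attach could look roughly like the sketch
below, with BPF_SOCKET_OPS as a hypothetical new bpf_attach_type; how
target_fd would be used, if at all, for a global program is exactly
the open question.)

	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.attach_bpf_fd = prog_fd;		/* loaded socket_ops prog */
	attr.attach_type = BPF_SOCKET_OPS;	/* hypothetical attach type */
	syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));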
> };
>
> enum bpf_map_type {
> @@ -120,6 +121,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_LWT_IN,
> BPF_PROG_TYPE_LWT_OUT,
> BPF_PROG_TYPE_LWT_XMIT,
> + BPF_PROG_TYPE_SOCKET_OPS,
(Nit: maybe BPF_PROG_TYPE_SOCK_OPS given we already have a *_SOCK type.)
> };
>
> enum bpf_attach_type {
> @@ -720,4 +722,30 @@ struct bpf_map_info {
> __u32 map_flags;
> } __attribute__((aligned(8)));
>
> +/* User bpf_socket_ops struct to access socket values and specify request ops
> + * and their replies.
> + * New fields can only be added at the end of this structure
> + */
> +struct bpf_socket_ops {
> + __u32 op;
> + union {
> + __u32 reply;
> + __u32 replylong[4];
> + };
> + __u32 family;
> + __u32 remote_ip4;
> + __u32 local_ip4;
> + __u32 remote_ip6[4];
> + __u32 local_ip6[4];
> + __u32 remote_port;
> + __u32 local_port;
> +};
> +
> +/* List of known BPF socket_ops operators.
> + * New entries can only be added at the end
> + */
> +enum {
> + BPF_SOCKET_OPS_VOID,
> +};
> +
> #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8942c82..5024b97 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1458,6 +1458,8 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> break;
> case BPF_OBJ_GET_INFO_BY_FD:
> err = bpf_obj_get_info_by_fd(&attr, uattr);
> + case BPF_PROG_LOAD_SOCKET_OPS:
> + err = bpf_socket_ops_set_prog(attr.bpf_fd);
> break;
(Comment above.)
> default:
> err = -EINVAL;
> diff --git a/net/core/Makefile b/net/core/Makefile
> index 79f9479..5d711c2 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -9,7 +9,8 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
>
> obj-y += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
> neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
> - sock_diag.o dev_ioctl.o tso.o sock_reuseport.o
> + sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
> + sock_bpfops.o
>
> obj-$(CONFIG_XFRM) += flow.o
> obj-y += net-sysfs.o
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 60ed6f3..7466f55 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3095,6 +3095,37 @@ void bpf_warn_invalid_xdp_action(u32 act)
> }
> EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
>
> +static bool __is_valid_socket_ops_access(int off, int size,
> + enum bpf_access_type type)
Nit: type unused here
> +{
> + if (off < 0 || off >= sizeof(struct bpf_socket_ops))
> + return false;
> + /* The verifier guarantees that size > 0. */
> + if (off % size != 0)
> + return false;
> + if (size != sizeof(__u32))
> + return false;
> +
> + return true;
> +}
> +
> +static bool socket_ops_is_valid_access(int off, int size,
> + enum bpf_access_type type,
> + enum bpf_reg_type *reg_type)
> +{
> + if (type == BPF_WRITE) {
> + switch (off) {
> + case offsetof(struct bpf_socket_ops, op) ...
> + offsetof(struct bpf_socket_ops, replylong[3]):
> + break;
> + default:
> + return false;
> + }
> + }
> +
> + return __is_valid_socket_ops_access(off, size, type);
> +}
> +
> static u32 bpf_convert_ctx_access(enum bpf_access_type type,
> const struct bpf_insn *si,
> struct bpf_insn *insn_buf,
> @@ -3364,6 +3395,126 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
> return insn - insn_buf;
> }
>
> +static u32 socket_ops_convert_ctx_access(enum bpf_access_type type,
> + const struct bpf_insn *si,
> + struct bpf_insn *insn_buf,
> + struct bpf_prog *prog)
> +{
> + struct bpf_insn *insn = insn_buf;
> + int off;
> +
> + switch (si->off) {
> + case offsetof(struct bpf_socket_ops, op) ...
> + offsetof(struct bpf_socket_ops, replylong[3]):
Could we also add a BUILD_BUG_ON() here asserting that kernel
representation has same FIELD_SIZEOF() as bpf_socket_ops.
> + off = si->off;
> + off -= offsetof(struct bpf_socket_ops, op);
> + off += offsetof(struct bpf_socket_ops_kern, op);
> + if (type == BPF_WRITE)
> + *insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
> + off);
> + else
> + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> + off);
> + break;
> +
> + case offsetof(struct bpf_socket_ops, family):
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_family) != 2);
> +
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_socket_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_socket_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common, skc_family));
> + break;
> +
> + case offsetof(struct bpf_socket_ops, remote_ip4):
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_daddr) != 4);
> +
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_socket_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_socket_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common, skc_daddr));
> + *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
> + break;
> +
> + case offsetof(struct bpf_socket_ops, local_ip4):
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_rcv_saddr) != 4);
> +
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_socket_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_socket_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common,
> + skc_rcv_saddr));
> + *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
> + break;
> +
> + case offsetof(struct bpf_socket_ops, remote_ip6[0]) ...
> + offsetof(struct bpf_socket_ops, remote_ip6[3]):
Nit: indent
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
> + skc_v6_daddr.s6_addr32[0]) != 4);
> +
> + off = si->off;
> + off -= offsetof(struct bpf_socket_ops, remote_ip6[0]);
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_socket_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_socket_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common,
> + skc_v6_daddr.s6_addr32[0]) +
> + off);
> + *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
> + break;
> +
> + case offsetof(struct bpf_socket_ops, local_ip6[0]) ...
> + offsetof(struct bpf_socket_ops, local_ip6[3]):
Nit: indent
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
> + skc_v6_rcv_saddr.s6_addr32[0]) != 4);
> +
> + off = si->off;
> + off -= offsetof(struct bpf_socket_ops, local_ip6[0]);
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_socket_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_socket_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common,
> + skc_v6_rcv_saddr.s6_addr32[0]) +
> + off);
> + *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
> + break;
> +
> + case offsetof(struct bpf_socket_ops, remote_port):
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_dport) != 2);
> +
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_socket_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_socket_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common, skc_dport));
> + *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 16);
> + break;
> +
> + case offsetof(struct bpf_socket_ops, local_port):
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_num) != 2);
> +
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_socket_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_socket_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common, skc_num));
> + break;
> + }
> + return insn - insn_buf;
> +}
> +
> const struct bpf_verifier_ops sk_filter_prog_ops = {
> .get_func_proto = sk_filter_func_proto,
> .is_valid_access = sk_filter_is_valid_access,
> @@ -3413,6 +3564,12 @@ const struct bpf_verifier_ops cg_sock_prog_ops = {
> .convert_ctx_access = sock_filter_convert_ctx_access,
> };
>
> +const struct bpf_verifier_ops socket_ops_prog_ops = {
> + .get_func_proto = bpf_base_func_proto,
> + .is_valid_access = socket_ops_is_valid_access,
> + .convert_ctx_access = socket_ops_convert_ctx_access,
> +};
> +
> int sk_detach_filter(struct sock *sk)
> {
> int ret = -ENOENT;
> diff --git a/net/core/sock_bpfops.c b/net/core/sock_bpfops.c
> new file mode 100644
> index 0000000..8f8daa5
> --- /dev/null
> +++ b/net/core/sock_bpfops.c
> @@ -0,0 +1,67 @@
> +/*
> + * BPF support for sockets
> + *
> + * Copyright (c) 2016 Lawrence Brakmo <brakmo@fb.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2
> + * as published by the Free Software Foundation.
> + */
> +
> +#include <net/sock.h>
> +#include <linux/skbuff.h>
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include <linux/errno.h>
> +#ifdef CONFIG_NET_NS
> +#include <net/net_namespace.h>
> +#include <linux/proc_ns.h>
> +#endif
> +
> +/* Global BPF program for sockets */
> +static struct bpf_prog *bpf_socket_ops_prog;
> +static DEFINE_RWLOCK(bpf_socket_ops_lock);
> +
> +int bpf_socket_ops_set_prog(int fd)
> +{
> + int err = 0;
> +
> + write_lock(&bpf_socket_ops_lock);
> + if (bpf_socket_ops_prog) {
> + bpf_prog_put(bpf_socket_ops_prog);
> + bpf_socket_ops_prog = NULL;
> + }
> +
> + /* fd of zero is used as a signal to remove the current
> + * bpf_socket_ops_prog.
> + */
> + if (fd == 0) {
Can we make the fd related semantics similar to dev_change_xdp_fd()?
> + write_unlock(&bpf_socket_ops_lock);
> + return 1;
> + }
> +
> + bpf_socket_ops_prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_OPS);
> + if (IS_ERR(bpf_socket_ops_prog)) {
> + bpf_prog_put(bpf_socket_ops_prog);
This will crash the kernel, passing err value to bpf_prog_put().
> + bpf_socket_ops_prog = NULL;
> + err = 1;
> + }
> + write_unlock(&bpf_socket_ops_lock);
If all this gets a bit rearranged as in dev_change_xdp_fd(), we can
RCUify bpf_socket_ops_prog.
> + return err;
> +}
> +
> +int bpf_socket_ops_call(struct bpf_socket_ops_kern *bpf_socket)
> +{
> + int ret;
> +
> + read_lock(&bpf_socket_ops_lock);
> + if (bpf_socket_ops_prog) {
> + rcu_read_lock();
> + ret = (int)BPF_PROG_RUN(bpf_socket_ops_prog, bpf_socket);
Cast not needed.
> + rcu_read_unlock();
> + } else {
> + ret = -1;
> + }
> + read_unlock(&bpf_socket_ops_lock);
See comment wrt RCU for bpf_socket_ops_prog, then read_lock()
can be dropped.
> + return ret;
> +}
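(Roughly, the RCU rearrangement suggested above could look like the
sketch below; the fd < 0 removal convention and the error codes are
illustrative, mirroring dev_change_xdp_fd() style.)

	static DEFINE_MUTEX(bpf_socket_ops_mutex);
	static struct bpf_prog __rcu *bpf_socket_ops_prog;

	int bpf_socket_ops_set_prog(int fd)
	{
		struct bpf_prog *prog = NULL, *old;

		/* fd < 0 removes the currently installed program */
		if (fd >= 0) {
			prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_OPS);
			if (IS_ERR(prog))
				return PTR_ERR(prog);
		}

		mutex_lock(&bpf_socket_ops_mutex);
		old = rcu_dereference_protected(bpf_socket_ops_prog,
				lockdep_is_held(&bpf_socket_ops_mutex));
		rcu_assign_pointer(bpf_socket_ops_prog, prog);
		mutex_unlock(&bpf_socket_ops_mutex);

		if (old)
			bpf_prog_put(old);
		return 0;
	}

	int bpf_socket_ops_call(struct bpf_socket_ops_kern *bpf_socket)
	{
		const struct bpf_prog *prog;
		int ret = -1;

		rcu_read_lock();
		prog = rcu_dereference(bpf_socket_ops_prog);
		if (prog)
			ret = BPF_PROG_RUN(prog, bpf_socket);
		rcu_read_unlock();
		return ret;
	}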
* Re: [RFC PATCH net-next v2 01/15] bpf: BPF support for socket ops
2017-06-16 12:07 ` Daniel Borkmann
@ 2017-06-16 23:41 ` Lawrence Brakmo
2017-06-19 18:44 ` Daniel Borkmann
2017-06-17 21:48 ` Lawrence Brakmo
1 sibling, 1 reply; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-16 23:41 UTC (permalink / raw)
To: Daniel Borkmann, netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern
On 6/16/17, 5:07 AM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:
On 06/15/2017 10:08 PM, Lawrence Brakmo wrote:
> Created a new BPF program type, BPF_PROG_TYPE_SOCKET_OPS, and a corresponding
> struct that allows BPF programs of this type to access some of the
> socket's fields (such as IP addresses, ports, etc.). Currently there is
> functionality to load one global BPF program of this type which can be
> called at appropriate times to set relevant connection parameters such
> as buffer sizes, SYN and SYN-ACK RTOs, etc., based on connection
> information such as IP addresses, port numbers, etc.
>
> Although there are already 3 mechanisms to set parameters (sysctls,
> route metrics and setsockopt), this new mechanism provides some
> distinct advantages. Unlike sysctls, it can set parameters per
> connection. In contrast to route metrics, it can also use port numbers
> and information provided by a user level program. In addition, it could
> set parameters probabilistically for evaluation purposes (e.g. do
> something different on 10% of the flows and compare results with the
> other 90% of the flows). Also, in cases where IPv6 addresses contain
> geographic information, rules that act on the distance (or RTT)
> between the hosts are much easier to express than route metric rules
> and can be global. Finally, unlike setsockopt, it does not require
> application changes and it can be updated easily at any time.
>
> I plan to add support for loading per-cgroup socket ops BPF programs in
> the near future. One question is whether I should add this functionality
> into David Ahern's BPF_PROG_TYPE_CGROUP_SOCK or create a new cgroup bpf
> type. Whereas the current cgroup_sock type expects to be called only once
> during a connection's lifetime, the new socket_ops type could be called
> multiple times; for example, before sending a SYN or SYN-ACK to set an
> appropriate timeout, when the connection is established to set
> congestion control, etc. As a result, it has an "op" field to specify
> the type of operation requested.
>
> The purpose of this new program type is to simplify setting connection
> parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
> easy to use Facebook's internal IPv6 addresses to determine if both hosts
> of a connection are in the same datacenter. Therefore, it is easy to
> write a BPF program to choose a small SYN RTO value when both hosts are
> in the same datacenter.
>
> This patch only contains the framework to support the new BPF program
> type; following patches add the functionality to set various connection
> parameters.
>
> This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
> and a new bpf syscall command to load a new program of this type:
> BPF_PROG_LOAD_SOCKET_OPS.
>
> Two new corresponding structs (one for the kernel, one for the user/BPF
> program):
>
> /* kernel version */
> struct bpf_socket_ops_kern {
> struct sock *sk;
> __u32 is_req_sock:1;
> __u32 op;
> union {
> __u32 reply;
> __u32 replylong[4];
> };
> };
>
> /* user version */
> struct bpf_socket_ops {
> __u32 op;
> union {
> __u32 reply;
> __u32 replylong[4];
> };
> __u32 family;
> __u32 remote_ip4;
> __u32 local_ip4;
> __u32 remote_ip6[4];
> __u32 local_ip6[4];
> __u32 remote_port;
> __u32 local_port;
> };
Above and ...
struct bpf_sock {
__u32 bound_dev_if;
__u32 family;
__u32 type;
__u32 protocol;
};
... would result in two BPF sock user versions. It's okayish, but
given struct bpf_sock is quite generic, couldn't we merge the members
from struct bpf_socket_ops into struct bpf_sock instead?
Idea would be that sock_filter_is_valid_access() for cgroups would
then check off < 0 || off + size > offsetofend(struct bpf_sock, protocol)
to disallow new members, and your socket_ops_is_valid_access() could
allow and xlate the full range. The family member is already duplicate
and the others could then be accessed from these kind of BPF progs as
well, plus we have a single user representation similar as with __sk_buff
that multiple types will use.
I see. You are saying have one struct in common but still keep the two
PROG_TYPES? That makes sense. Do we really need two different
is_valid_access functions? Both types should be able to see all
the fields (otherwise adding new fields becomes messy).
> Currently there are two types of ops. The first type expects the BPF
> program to return a value which is then used by the caller (or a
> negative value to indicate the operation is not supported). The second
> type expects state changes to be done by the BPF program, for example
> through a setsockopt BPF helper function, and they ignore the return
> value.
>
> The reply fields of the bpf_socket_ops struct are there in case a BPF
> program needs to return a value larger than a single integer.
>
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
> ---
> include/linux/bpf.h | 6 ++
> include/linux/bpf_types.h | 1 +
> include/linux/filter.h | 10 +++
> include/net/tcp.h | 27 ++++++++
> include/uapi/linux/bpf.h | 28 +++++++++
> kernel/bpf/syscall.c | 2 +
> net/core/Makefile | 3 +-
> net/core/filter.c | 157 ++++++++++++++++++++++++++++++++++++++++++++++
> net/core/sock_bpfops.c | 67 ++++++++++++++++++++
> samples/bpf/bpf_load.c | 13 +++-
> 10 files changed, 310 insertions(+), 4 deletions(-)
> create mode 100644 net/core/sock_bpfops.c
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 1bcbf0a..e164f94 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -362,4 +362,10 @@ extern const struct bpf_func_proto bpf_get_stackid_proto;
> void bpf_user_rnd_init_once(void);
> u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>
> +/* socket_ops related */
> +struct sock;
Is this needed? You're using it in struct bpf_socket_ops_kern in
filter.h, where we already have struct sock;
Done
> +struct bpf_socket_ops_kern;
> +
> +int bpf_socket_ops_set_prog(int fd);
> +int bpf_socket_ops_call(struct bpf_socket_ops_kern *bpf_socket);
> #endif /* _LINUX_BPF_H */
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 03bf223..ca69d10 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -10,6 +10,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock_prog_ops)
> BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout_prog_ops)
> BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout_prog_ops)
> BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit_prog_ops)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_OPS, socket_ops_prog_ops)
> #endif
> #ifdef CONFIG_BPF_EVENTS
> BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe_prog_ops)
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 1fa26dc..102e881 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -898,4 +898,14 @@ static inline int bpf_tell_extensions(void)
> return SKF_AD_MAX;
> }
>
> +struct bpf_socket_ops_kern {
> + struct sock *sk;
> + u32 is_req_sock:1;
> + u32 op;
> + union {
> + u32 reply;
> + u32 replylong[4];
> + };
> +};
> +
> #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 3ab677d..9ad0d80 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -46,6 +46,9 @@
> #include <linux/seq_file.h>
> #include <linux/memcontrol.h>
>
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +
> extern struct inet_hashinfo tcp_hashinfo;
>
> extern struct percpu_counter tcp_orphan_count;
> @@ -1991,4 +1994,28 @@ static inline void tcp_listendrop(const struct sock *sk)
>
> enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer);
>
> +/* Call BPF_SOCKET_OPS program that returns an int. If the return value
> + * is < 0, then the BPF op failed (for example if the loaded BPF
> + * program does not support the chosen operation or there is no BPF
> + * program loaded).
> + */
> +#ifdef CONFIG_BPF
> +static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
> +{
> + struct bpf_socket_ops_kern socket_ops;
> +
> + memset(&socket_ops, 0, sizeof(socket_ops));
> + socket_ops.sk = sk;
> + socket_ops.is_req_sock = is_req_sock ? 1 : 0;
Is is_req_sock actually used here in this patch (apart from setting it)?
Not seeing that BPF prog will access it, if it also shouldn't access it,
then bool type would be better.
The only reason I used a bit was in case I wanted to add more fields later on.
Does it make sense or should I just use bool?
> + socket_ops.op = op;
> +
> + return bpf_socket_ops_call(&socket_ops);
> +}
> +#else
> +static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
> +{
> + return -1;
> +}
> +#endif
> +
> #endif /* _TCP_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f94b48b..1540336 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -87,6 +87,7 @@ enum bpf_cmd {
> BPF_PROG_GET_FD_BY_ID,
> BPF_MAP_GET_FD_BY_ID,
> BPF_OBJ_GET_INFO_BY_FD,
> + BPF_PROG_LOAD_SOCKET_OPS,
Why is this as an extra cmd needed, not extending BPF_PROG_ATTACH
and BPF_PROG_DETACH that we already have?
I’ll look into it. At the time I did it in the way I thought was easier.
> };
>
> enum bpf_map_type {
> @@ -120,6 +121,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_LWT_IN,
> BPF_PROG_TYPE_LWT_OUT,
> BPF_PROG_TYPE_LWT_XMIT,
> + BPF_PROG_TYPE_SOCKET_OPS,
(Nit: maybe BPF_PROG_TYPE_SOCK_OPS given we already have a *_SOCK type.)
Done
> };
>
> enum bpf_attach_type {
> @@ -720,4 +722,30 @@ struct bpf_map_info {
> __u32 map_flags;
> } __attribute__((aligned(8)));
>
> +/* User bpf_socket_ops struct to access socket values and specify request ops
> + * and their replies.
> + * New fields can only be added at the end of this structure
> + */
> +struct bpf_socket_ops {
> + __u32 op;
> + union {
> + __u32 reply;
> + __u32 replylong[4];
> + };
> + __u32 family;
> + __u32 remote_ip4;
> + __u32 local_ip4;
> + __u32 remote_ip6[4];
> + __u32 local_ip6[4];
> + __u32 remote_port;
> + __u32 local_port;
> +};
> +
> +/* List of known BPF socket_ops operators.
> + * New entries can only be added at the end
> + */
> +enum {
> + BPF_SOCKET_OPS_VOID,
> +};
> +
> #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8942c82..5024b97 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1458,6 +1458,8 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> break;
> case BPF_OBJ_GET_INFO_BY_FD:
> err = bpf_obj_get_info_by_fd(&attr, uattr);
> + case BPF_PROG_LOAD_SOCKET_OPS:
> + err = bpf_socket_ops_set_prog(attr.bpf_fd);
> break;
(Comment above.)
Done
> default:
> err = -EINVAL;
> diff --git a/net/core/Makefile b/net/core/Makefile
> index 79f9479..5d711c2 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -9,7 +9,8 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
>
> obj-y += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
> neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
> - sock_diag.o dev_ioctl.o tso.o sock_reuseport.o
> + sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
> + sock_bpfops.o
>
> obj-$(CONFIG_XFRM) += flow.o
> obj-y += net-sysfs.o
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 60ed6f3..7466f55 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3095,6 +3095,37 @@ void bpf_warn_invalid_xdp_action(u32 act)
> }
> EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
>
> +static bool __is_valid_socket_ops_access(int off, int size,
> + enum bpf_access_type type)
Nit: type unused here
Done
> +{
> + if (off < 0 || off >= sizeof(struct bpf_socket_ops))
> + return false;
> + /* The verifier guarantees that size > 0. */
> + if (off % size != 0)
> + return false;
> + if (size != sizeof(__u32))
> + return false;
> +
> + return true;
> +}
> +
> +static bool socket_ops_is_valid_access(int off, int size,
> + enum bpf_access_type type,
> + enum bpf_reg_type *reg_type)
> +{
> + if (type == BPF_WRITE) {
> + switch (off) {
> + case offsetof(struct bpf_socket_ops, op) ...
> + offsetof(struct bpf_socket_ops, replylong[3]):
> + break;
> + default:
> + return false;
> + }
> + }
> +
> + return __is_valid_socket_ops_access(off, size, type);
> +}
> +
> static u32 bpf_convert_ctx_access(enum bpf_access_type type,
> const struct bpf_insn *si,
> struct bpf_insn *insn_buf,
> @@ -3364,6 +3395,126 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
> return insn - insn_buf;
> }
>
> +static u32 socket_ops_convert_ctx_access(enum bpf_access_type type,
> + const struct bpf_insn *si,
> + struct bpf_insn *insn_buf,
> + struct bpf_prog *prog)
> +{
> + struct bpf_insn *insn = insn_buf;
> + int off;
> +
> + switch (si->off) {
> + case offsetof(struct bpf_socket_ops, op) ...
> + offsetof(struct bpf_socket_ops, replylong[3]):
Could we also add a BUILD_BUG_ON() here asserting that kernel
representation has same FIELD_SIZEOF() as bpf_socket_ops.
Done
> + off = si->off;
> + off -= offsetof(struct bpf_socket_ops, op);
> + off += offsetof(struct bpf_socket_ops_kern, op);
> + if (type == BPF_WRITE)
> + *insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
> + off);
> + else
> + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> + off);
> + break;
> +
> + case offsetof(struct bpf_socket_ops, family):
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_family) != 2);
> +
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_socket_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_socket_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common, skc_family));
> + break;
> +
> + case offsetof(struct bpf_socket_ops, remote_ip4):
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_daddr) != 4);
> +
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_socket_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_socket_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common, skc_daddr));
> + *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
> + break;
> +
> + case offsetof(struct bpf_socket_ops, local_ip4):
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_rcv_saddr) != 4);
> +
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_socket_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_socket_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common,
> + skc_rcv_saddr));
> + *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
> + break;
> +
> + case offsetof(struct bpf_socket_ops, remote_ip6[0]) ...
> + offsetof(struct bpf_socket_ops, remote_ip6[3]):
Nit: indent
Done
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
> + skc_v6_daddr.s6_addr32[0]) != 4);
> +
> + off = si->off;
> + off -= offsetof(struct bpf_socket_ops, remote_ip6[0]);
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_socket_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_socket_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common,
> + skc_v6_daddr.s6_addr32[0]) +
> + off);
> + *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
> + break;
> +
> + case offsetof(struct bpf_socket_ops, local_ip6[0]) ...
> + offsetof(struct bpf_socket_ops, local_ip6[3]):
Nit: indent
Done
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
> + skc_v6_rcv_saddr.s6_addr32[0]) != 4);
> +
> + off = si->off;
> + off -= offsetof(struct bpf_socket_ops, local_ip6[0]);
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_socket_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_socket_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common,
> + skc_v6_rcv_saddr.s6_addr32[0]) +
> + off);
> + *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
> + break;
> +
> + case offsetof(struct bpf_socket_ops, remote_port):
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_dport) != 2);
> +
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_socket_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_socket_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common, skc_dport));
> + *insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 16);
> + break;
> +
> + case offsetof(struct bpf_socket_ops, local_port):
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_num) != 2);
> +
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_socket_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_socket_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common, skc_num));
> + break;
> + }
> + return insn - insn_buf;
> +}
> +
> const struct bpf_verifier_ops sk_filter_prog_ops = {
> .get_func_proto = sk_filter_func_proto,
> .is_valid_access = sk_filter_is_valid_access,
> @@ -3413,6 +3564,12 @@ const struct bpf_verifier_ops cg_sock_prog_ops = {
> .convert_ctx_access = sock_filter_convert_ctx_access,
> };
>
> +const struct bpf_verifier_ops socket_ops_prog_ops = {
> + .get_func_proto = bpf_base_func_proto,
> + .is_valid_access = socket_ops_is_valid_access,
> + .convert_ctx_access = socket_ops_convert_ctx_access,
> +};
> +
> int sk_detach_filter(struct sock *sk)
> {
> int ret = -ENOENT;
> diff --git a/net/core/sock_bpfops.c b/net/core/sock_bpfops.c
> new file mode 100644
> index 0000000..8f8daa5
> --- /dev/null
> +++ b/net/core/sock_bpfops.c
> @@ -0,0 +1,67 @@
> +/*
> + * BPF support for sockets
> + *
> + * Copyright (c) 2016 Lawrence Brakmo <brakmo@fb.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2
> + * as published by the Free Software Foundation.
> + */
> +
> +#include <net/sock.h>
> +#include <linux/skbuff.h>
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include <linux/errno.h>
> +#ifdef CONFIG_NET_NS
> +#include <net/net_namespace.h>
> +#include <linux/proc_ns.h>
> +#endif
> +
> +/* Global BPF program for sockets */
> +static struct bpf_prog *bpf_socket_ops_prog;
> +static DEFINE_RWLOCK(bpf_socket_ops_lock);
> +
> +int bpf_socket_ops_set_prog(int fd)
> +{
> + int err = 0;
> +
> + write_lock(&bpf_socket_ops_lock);
> + if (bpf_socket_ops_prog) {
> + bpf_prog_put(bpf_socket_ops_prog);
> + bpf_socket_ops_prog = NULL;
> + }
> +
> + /* fd of zero is used as a signal to remove the current
> + * bpf_socket_ops_prog.
> + */
> + if (fd == 0) {
Can we make the fd related semantics similar to dev_change_xdp_fd()?
Do you mean that removing the program should be fd < 0 instead of == 0?
> + write_unlock(&bpf_socket_ops_lock);
> + return 1;
> + }
> +
> + bpf_socket_ops_prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_OPS);
> + if (IS_ERR(bpf_socket_ops_prog)) {
> + bpf_prog_put(bpf_socket_ops_prog);
This will crash the kernel, passing err value to bpf_prog_put().
That explains some problems I saw in the past if fd was zero.
Fixed.
> + bpf_socket_ops_prog = NULL;
> + err = 1;
> + }
> + write_unlock(&bpf_socket_ops_lock);
If all this gets a bit rearranged as in dev_change_xdp_fd(), we can
RCUify bpf_socket_ops_prog.
Not sure I fully understand, but will give it a shot.
> + return err;
> +}
> +
> +int bpf_socket_ops_call(struct bpf_socket_ops_kern *bpf_socket)
> +{
> + int ret;
> +
> + read_lock(&bpf_socket_ops_lock);
> + if (bpf_socket_ops_prog) {
> + rcu_read_lock();
> + ret = (int)BPF_PROG_RUN(bpf_socket_ops_prog, bpf_socket);
Cast not needed.
Done
> + rcu_read_unlock();
> + } else {
> + ret = -1;
> + }
> + read_unlock(&bpf_socket_ops_lock);
See comment wrt RCU for bpf_socket_ops_prog, then read_lock()
can be dropped.
Will give it a shot.
> + return ret;
> +}
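For reference, a minimal sketch of that rework (hypothetical, not the
code that was eventually merged): fd < 0 detaches as in
dev_change_xdp_fd(), the pointer is swapped before the old reference is
dropped, and since bpf_prog_put() defers the actual free via call_rcu(),
callers only need rcu_read_lock():

/* Hypothetical sketch: RCU-protected global sockops program */
static struct bpf_prog __rcu *bpf_socket_ops_prog;
static DEFINE_MUTEX(bpf_socket_ops_mutex);

int bpf_socket_ops_set_prog(int fd)
{
	struct bpf_prog *prog = NULL, *old_prog;

	if (fd >= 0) {	/* fd < 0 means detach */
		prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_OPS);
		if (IS_ERR(prog))
			return PTR_ERR(prog);
	}

	mutex_lock(&bpf_socket_ops_mutex);
	old_prog = rcu_dereference_protected(bpf_socket_ops_prog,
				lockdep_is_held(&bpf_socket_ops_mutex));
	rcu_assign_pointer(bpf_socket_ops_prog, prog);
	mutex_unlock(&bpf_socket_ops_mutex);

	/* safe: the program itself is freed via call_rcu() */
	if (old_prog)
		bpf_prog_put(old_prog);
	return 0;
}

int bpf_socket_ops_call(struct bpf_socket_ops_kern *bpf_socket)
{
	struct bpf_prog *prog;
	int ret = -1;

	rcu_read_lock();
	prog = rcu_dereference(bpf_socket_ops_prog);
	if (prog)
		ret = BPF_PROG_RUN(prog, bpf_socket);
	rcu_read_unlock();
	return ret;
}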
Daniel, thank you for the feedback.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH net-next v2 01/15] bpf: BPF support for socket ops
2017-06-16 23:41 ` Lawrence Brakmo
@ 2017-06-19 18:44 ` Daniel Borkmann
2017-06-19 20:49 ` Lawrence Brakmo
0 siblings, 1 reply; 29+ messages in thread
From: Daniel Borkmann @ 2017-06-19 18:44 UTC (permalink / raw)
To: Lawrence Brakmo, netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern
On 06/17/2017 01:41 AM, Lawrence Brakmo wrote:
> On 6/16/17, 5:07 AM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:
[...]
> I see. You are saying have one struct in common but still keep the two
> PROG_TYPES? That makes sense. Do we really need two different
> is_valid_access functions? Both types should be able to see all
> the fields (otherwise adding new fields becomes messy).
Would probably leave the two is_valid_access() separate initially, and
once people ask for it we could potentially open this up to some of
the other fields that are available at that time.
> > Currently there are two types of ops. The first type expects the BPF
> > program to return a value which is then used by the caller (or a
> > negative value to indicate the operation is not supported). The second
> > type expects state changes to be done by the BPF program, for example
> > through a setsockopt BPF helper function, and they ignore the return
> > value.
[...]
> > +/* Call BPF_SOCKET_OPS program that returns an int. If the return value
> > + * is < 0, then the BPF op failed (for example if the loaded BPF
> > + * program does not support the chosen operation or there is no BPF
> > + * program loaded).
> > + */
> > +#ifdef CONFIG_BPF
> > +static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
> > +{
> > + struct bpf_socket_ops_kern socket_ops;
> > +
> > + memset(&socket_ops, 0, sizeof(socket_ops));
> > + socket_ops.sk = sk;
> > + socket_ops.is_req_sock = is_req_sock ? 1 : 0;
>
> Is is_req_sock actually used here in this patch (apart from setting it)?
> Not seeing that BPF prog will access it, if it also shouldn't access it,
> then bool type would be better.
>
> The only reason I used a bit was in case I wanted to add more fields later on.
> Does it make sense or should I just use bool?
Didn't know that, but I think starting out with bool seems a bit
cleaner, if needed we could later still switch to bitfield.
> > + socket_ops.op = op;
> > +
> > + return bpf_socket_ops_call(&socket_ops);
> > +}
[...]
> > +/* Global BPF program for sockets */
> > +static struct bpf_prog *bpf_socket_ops_prog;
> > +static DEFINE_RWLOCK(bpf_socket_ops_lock);
> > +
> > +int bpf_socket_ops_set_prog(int fd)
> > +{
> > + int err = 0;
> > +
> > + write_lock(&bpf_socket_ops_lock);
> > + if (bpf_socket_ops_prog) {
> > + bpf_prog_put(bpf_socket_ops_prog);
> > + bpf_socket_ops_prog = NULL;
> > + }
> > +
> > + /* fd of zero is used as a signal to remove the current
> > + * bpf_socket_ops_prog.
> > + */
> > + if (fd == 0) {
>
> Can we make the fd related semantics similar to dev_change_xdp_fd()?
>
> Do you mean remove program is fd < 0 instead of == 0?
Yes, that and also the ordering of dropping the ref of the existing
bpf_socket_ops_prog program with setting the new one, so you can
convert bpf_socket_ops_prog to RCU more easily.
> > + write_unlock(&bpf_socket_ops_lock);
> > + return 1;
> > + }
> > +
> > + bpf_socket_ops_prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_OPS);
> > + if (IS_ERR(bpf_socket_ops_prog)) {
> > + bpf_prog_put(bpf_socket_ops_prog);
>
> This will crash the kernel, passing err value to bpf_prog_put().
[...]
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH net-next v2 01/15] bpf: BPF support for socket ops
2017-06-19 18:44 ` Daniel Borkmann
@ 2017-06-19 20:49 ` Lawrence Brakmo
0 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-19 20:49 UTC (permalink / raw)
To: Daniel Borkmann, netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern
On 6/19/17, 11:44 AM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:
On 06/17/2017 01:41 AM, Lawrence Brakmo wrote:
> On 6/16/17, 5:07 AM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:
[...]
> I see. You are saying have one struct in common but still keep the two
> PROG_TYPES? That makes sense. Do we really need two different
> is_valid_access functions? Both types should be able to see all
> the fields (otherwise adding new fields becomes messy).
Would probably leave the two is_valid_access() separate initially, and
once people ask for it we could potentially open this up to some of
the other fields that are available at that time.
As discussed in the other thread, I will keep the 2 structs.
> > Currently there are two types of ops. The first type expects the BPF
> > program to return a value which is then used by the caller (or a
> > negative value to indicate the operation is not supported). The second
> > type expects state changes to be done by the BPF program, for example
> > through a setsockopt BPF helper function, and they ignore the return
> > value.
[...]
> > +/* Call BPF_SOCKET_OPS program that returns an int. If the return value
> > + * is < 0, then the BPF op failed (for example if the loaded BPF
> > + * program does not support the chosen operation or there is no BPF
> > + * program loaded).
> > + */
> > +#ifdef CONFIG_BPF
> > +static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
> > +{
> > + struct bpf_socket_ops_kern socket_ops;
> > +
> > + memset(&socket_ops, 0, sizeof(socket_ops));
> > + socket_ops.sk = sk;
> > + socket_ops.is_req_sock = is_req_sock ? 1 : 0;
>
> Is is_req_sock actually used here in this patch (apart from setting it)?
> Not seeing that BPF prog will access it, if it also shouldn't access it,
> then bool type would be better.
>
> The only reason I used a bit was in case I wanted to add more fields later on.
> Does it make sense or should I just use bool?
Didn't know that, but I think starting out with bool seems a bit
cleaner, if needed we could later still switch to bitfield.
Done.
> > + socket_ops.op = op;
> > +
> > + return bpf_socket_ops_call(&socket_ops);
> > +}
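With that change the kernel-side struct would become roughly the
following (hypothetical sketch):

struct bpf_socket_ops_kern {
	struct sock *sk;
	bool is_req_sock;
	__u32 op;
	union {
		__u32 reply;
		__u32 replylong[4];
	};
};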
[...]
> > +/* Global BPF program for sockets */
> > +static struct bpf_prog *bpf_socket_ops_prog;
> > +static DEFINE_RWLOCK(bpf_socket_ops_lock);
> > +
> > +int bpf_socket_ops_set_prog(int fd)
> > +{
> > + int err = 0;
> > +
> > + write_lock(&bpf_socket_ops_lock);
> > + if (bpf_socket_ops_prog) {
> > + bpf_prog_put(bpf_socket_ops_prog);
> > + bpf_socket_ops_prog = NULL;
> > + }
> > +
> > + /* fd of zero is used as a signal to remove the current
> > + * bpf_socket_ops_prog.
> > + */
> > + if (fd == 0) {
>
> Can we make the fd related semantics similar to dev_change_xdp_fd()?
>
> Do you mean remove program is fd < 0 instead of == 0?
Yes, that and also the ordering of dropping the ref of the existing
bpf_socket_ops_prog program with setting the new one, so you can
convert bpf_socket_ops_prog to RCU more easily.
I made lots of changes to how we set/attach the global_sock_ops program
affecting the files kernel/bpf/syscall.c, net/core/sock_bpfops.c and
samples/bpf/tcp_bpf.c. The patch set will be submitted later today.
> > + write_unlock(&bpf_socket_ops_lock);
> > + return 1;
> > + }
> > +
> > + bpf_socket_ops_prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_OPS);
> > + if (IS_ERR(bpf_socket_ops_prog)) {
> > + bpf_prog_put(bpf_socket_ops_prog);
>
> This will crash the kernel, passing err value to bpf_prog_put().
[...]
Thanks again for the feedback.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH net-next v2 01/15] bpf: BPF support for socket ops
2017-06-16 12:07 ` Daniel Borkmann
2017-06-16 23:41 ` Lawrence Brakmo
@ 2017-06-17 21:48 ` Lawrence Brakmo
2017-06-19 18:52 ` Daniel Borkmann
1 sibling, 1 reply; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-17 21:48 UTC (permalink / raw)
To: Daniel Borkmann, netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern
On 6/16/17, 5:07 AM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:
On 06/15/2017 10:08 PM, Lawrence Brakmo wrote:
> Two new corresponding structs (one for the kernel one for the user/BPF
> program):
>
> /* kernel version */
> struct bpf_socket_ops_kern {
> struct sock *sk;
> __u32 is_req_sock:1;
> __u32 op;
> union {
> __u32 reply;
> __u32 replylong[4];
> };
> };
>
> /* user version */
> struct bpf_socket_ops {
> __u32 op;
> union {
> __u32 reply;
> __u32 replylong[4];
> };
> __u32 family;
> __u32 remote_ip4;
> __u32 local_ip4;
> __u32 remote_ip6[4];
> __u32 local_ip6[4];
> __u32 remote_port;
> __u32 local_port;
> };
Above and ...
struct bpf_sock {
__u32 bound_dev_if;
__u32 family;
__u32 type;
__u32 protocol;
};
... would result in two BPF sock user versions. It's okayish, but
given struct bpf_sock is quite generic, couldn't we merge the members
from struct bpf_socket_ops into struct bpf_sock instead?
Idea would be that sock_filter_is_valid_access() for cgroups would
then check off < 0 || off + size > offsetofend(struct bpf_sock, protocol)
to disallow new members, and your socket_ops_is_valid_access() could
allow and xlate the full range. The family member is already duplicate
and the others could then be accessed from these kind of BPF progs as
well, plus we have a single user representation similar as with __sk_buff
that multiple types will use.
I am concerned that it could make usage more confusing. One type of
sock program (cgroup) could only use a subset of the fields while the
other type (socket_ops) could use all (or a different subset). Then what
happens if there is a need to add a new field to cgroup type sock program?
In addition, in the near future I will have a patch to attach socket_ops
programs to cgroups.
I'd rather leave it as it is.
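For concreteness, the offsetofend() range check Daniel describes above
would look roughly like this (a hypothetical sketch, not adopted here;
the verifier callback signature is simplified and the write handling
for the sockops reply area is omitted):

/* cgroup sock programs: reject any member added after 'protocol' */
static bool sock_filter_is_valid_access(int off, int size,
					enum bpf_access_type type)
{
	if (type != BPF_READ)
		return false;
	if (off < 0 || off + size > offsetofend(struct bpf_sock, protocol))
		return false;
	return off % size == 0;
}

/* sockops programs: allow (and xlate) the full struct */
static bool socket_ops_is_valid_access(int off, int size,
				       enum bpf_access_type type)
{
	if (off < 0 || off + size > sizeof(struct bpf_socket_ops))
		return false;
	return off % size == 0;
}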
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH net-next v2 01/15] bpf: BPF support for socket ops
2017-06-17 21:48 ` Lawrence Brakmo
@ 2017-06-19 18:52 ` Daniel Borkmann
2017-06-19 20:49 ` Lawrence Brakmo
0 siblings, 1 reply; 29+ messages in thread
From: Daniel Borkmann @ 2017-06-19 18:52 UTC (permalink / raw)
To: Lawrence Brakmo, netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern
On 06/17/2017 11:48 PM, Lawrence Brakmo wrote:
> On 6/16/17, 5:07 AM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:
>
> On 06/15/2017 10:08 PM, Lawrence Brakmo wrote:
> > Two new corresponding structs (one for the kernel one for the user/BPF
> > program):
> >
> > /* kernel version */
> > struct bpf_socket_ops_kern {
> > struct sock *sk;
> > __u32 is_req_sock:1;
> > __u32 op;
> > union {
> > __u32 reply;
> > __u32 replylong[4];
> > };
> > };
> >
> > /* user version */
> > struct bpf_socket_ops {
> > __u32 op;
> > union {
> > __u32 reply;
> > __u32 replylong[4];
> > };
> > __u32 family;
> > __u32 remote_ip4;
> > __u32 local_ip4;
> > __u32 remote_ip6[4];
> > __u32 local_ip6[4];
> > __u32 remote_port;
> > __u32 local_port;
> > };
>
> Above and ...
>
> struct bpf_sock {
> __u32 bound_dev_if;
> __u32 family;
> __u32 type;
> __u32 protocol;
> };
>
> ... would result in two BPF sock user versions. It's okayish, but
> given struct bpf_sock is quite generic, couldn't we merge the members
> from struct bpf_socket_ops into struct bpf_sock instead?
>
> Idea would be that sock_filter_is_valid_access() for cgroups would
> then check off < 0 || off + size > offsetofend(struct bpf_sock, protocol)
> to disallow new members, and your socket_ops_is_valid_access() could
> allow and xlate the full range. The family member is already duplicate
> and the others could then be accessed from these kind of BPF progs as
> well, plus we have a single user representation similar as with __sk_buff
> that multiple types will use.
>
> I am concerned that it could make usage more confusing. One type of
> sock program (cgroup) could only use a subset of the fields while the
> other type (socket_ops) could use all (or a different subset). Then what
> happens if there is a need to add a new field to cgroup type sock program?
> In addition, in the near future I will have a patch to attach socket_ops
> programs to cgroups.
> I'd rather leave it as it is.
Okay, I'm fine with that as well. For the __sk_buff, we also have the
case that some members are not available for all program types like
tc_classid, so it's similar there. But if indeed the majority of members
cannot be supported for the most part (?) then having different structs
seems okay, probably easier to use, but we should try hard to not end
up with 10 different uapi socket structs that apply to program types
working on sockets in one way or another.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH net-next v2 01/15] bpf: BPF support for socket ops
2017-06-19 18:52 ` Daniel Borkmann
@ 2017-06-19 20:49 ` Lawrence Brakmo
0 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-19 20:49 UTC (permalink / raw)
To: Daniel Borkmann, netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern
On 6/19/17, 11:52 AM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:
On 06/17/2017 11:48 PM, Lawrence Brakmo wrote:
> On 6/16/17, 5:07 AM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:
>
> On 06/15/2017 10:08 PM, Lawrence Brakmo wrote:
> > Two new corresponding structs (one for the kernel one for the user/BPF
> > program):
> >
> > /* kernel version */
> > struct bpf_socket_ops_kern {
> > struct sock *sk;
> > __u32 is_req_sock:1;
> > __u32 op;
> > union {
> > __u32 reply;
> > __u32 replylong[4];
> > };
> > };
> >
> > /* user version */
> > struct bpf_socket_ops {
> > __u32 op;
> > union {
> > __u32 reply;
> > __u32 replylong[4];
> > };
> > __u32 family;
> > __u32 remote_ip4;
> > __u32 local_ip4;
> > __u32 remote_ip6[4];
> > __u32 local_ip6[4];
> > __u32 remote_port;
> > __u32 local_port;
> > };
>
> Above and ...
>
> struct bpf_sock {
> __u32 bound_dev_if;
> __u32 family;
> __u32 type;
> __u32 protocol;
> };
>
> ... would result in two BPF sock user versions. It's okayish, but
> given struct bpf_sock is quite generic, couldn't we merge the members
> from struct bpf_socket_ops into struct bpf_sock instead?
>
> Idea would be that sock_filter_is_valid_access() for cgroups would
> then check off < 0 || off + size > offsetofend(struct bpf_sock, protocol)
> to disallow new members, and your socket_ops_is_valid_access() could
> allow and xlate the full range. The family member is already duplicate
> and the others could then be accessed from these kind of BPF progs as
> well, plus we have a single user representation similar as with __sk_buff
> that multiple types will use.
>
> I am concerned that it could make usage more confusing. One type of
> sock program (cgroup) could only use a subset of the fields while the
> other type (socket_ops) could use all (or a different subset). Then what
> happens if there is a need to add a new field to cgroup type sock program?
> In addition, in the near future I will have a patch to attach socket_ops
> programs to cgroups.
> I'd rather leave it as it is.
Okay, I'm fine with that as well. For the __sk_buff, we also have the
case that some members are not available for all program types like
tc_classid, so it's similar there. But if indeed the majority of members
cannot be supported for the most part (?) then having different structs
seems okay, probably easier to use, but we should try hard to not end
up with 10 different uapi socket structs that apply to program types
working on sockets in one way or another.
Agree 100%.
^ permalink raw reply [flat|nested] 29+ messages in thread
* [RFC PATCH net-next v2 02/15] bpf: program to load socketops BPF programs
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 01/15] " Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 03/15] bpf: Support for per connection SYN/SYN-ACK RTOs Lawrence Brakmo
` (12 subsequent siblings)
14 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
The program tcp_bpf can be used to remove the current global sockops program
and to load/replace sockops BPF programs. There is also an option to
print the bpf trace buffer (for debugging purposes).
USAGE:
./tcp_bpf [-r] [-l] [<pname>]
WHERE:
-r remove the currently loaded socketops BPF program
not needed if loading a new program
-l print BPF trace buffer. Used when loading a new program
<pname> name of BPF sockops program to load
if <pname> does not end in ".o", then "_kern.o" is appended
example: using tcp_rto will load tcp_rto_kern.o
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
samples/bpf/Makefile | 3 ++
samples/bpf/tcp_bpf.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 84 insertions(+)
create mode 100644 samples/bpf/tcp_bpf.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a0561dc..ed6bc75 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -36,6 +36,7 @@ hostprogs-y += lwt_len_hist
hostprogs-y += xdp_tx_iptunnel
hostprogs-y += test_map_in_map
hostprogs-y += per_socket_stats_example
+hostprogs-y += tcp_bpf
# Libbpf dependencies
LIBBPF := ../../tools/lib/bpf/bpf.o
@@ -52,6 +53,7 @@ tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
+tcp_bpf-objs := bpf_load.o $(LIBBPF) tcp_bpf.o
test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
@@ -130,6 +132,7 @@ HOSTLOADLIBES_tracex4 += -lelf -lrt
HOSTLOADLIBES_tracex5 += -lelf
HOSTLOADLIBES_tracex6 += -lelf
HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
+HOSTLOADLIBES_tcp_bpf += -lelf
HOSTLOADLIBES_test_probe_write_user += -lelf
HOSTLOADLIBES_trace_output += -lelf -lrt
HOSTLOADLIBES_lathist += -lelf
diff --git a/samples/bpf/tcp_bpf.c b/samples/bpf/tcp_bpf.c
new file mode 100644
index 0000000..9de18ea
--- /dev/null
+++ b/samples/bpf/tcp_bpf.c
@@ -0,0 +1,81 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+#include <unistd.h>
+#include <linux/unistd.h>
+
+static void usage(char *pname)
+{
+ printf("USAGE:\n %s [-r] [-l] <pname>\n", pname);
+ printf("WHERE:\n");
+ printf(" -r remove current loaded socketops BPF program\n");
+ printf(" not needed if loading a new program\n");
+ printf(" -l print out BPF log buffer\n");
+ printf(" <pname> name of BPF sockeops program to load\n");
+ printf(" if <pname> does not end in \".o\", then \"_kern.o\" "
+ "is appended\n");
+ printf(" example: using tcp1 will load tcp1_kern.o\n");
+ printf("\n");
+ exit(1);
+}
+
+int main(int argc, char **argv)
+{
+ union bpf_attr attr;
+ int k, logFlag = 0;
+ int error = -1;
+ char fn[500];
+	memset(&attr, 0, sizeof(attr));	/* zero unused attr fields */
+ if (argc <= 1)
+ usage(argv[0]);
+ for (k = 1; k < argc; k++) {
+ if (!strcmp(argv[k], "-r")) {
+ /* An fd of zero is used as a signal to remove the
+ * current SOCKET_OPS program
+ */
+ attr.bpf_fd = 0;
+ syscall(__NR_bpf, BPF_PROG_LOAD_SOCKET_OPS, &attr,
+ sizeof(attr));
+ } else if (!strcmp(argv[k], "-l")) {
+ logFlag = 1;
+ } else if (!strcmp(argv[k], "-h")) {
+ usage(argv[0]);
+ } else if (argv[k][0] == '-') {
+ printf("Error, unknown flag: %s\n", argv[k]);
+ exit(2);
+ } else if (strlen(argv[k]) > 450) {
+ printf("Error, program name too long %d\n",
+ (int) strlen(argv[k]));
+ exit(3);
+ } else {
+ if (!strcmp(argv[k]+strlen(argv[k])-2, ".o"))
+ strcpy(fn, argv[k]);
+ else
+ sprintf(fn, "%s_kern.o", argv[k]);
+ if (logFlag)
+ printf("loading bpf file:%s\n", fn);
+ if (load_bpf_file(fn)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+ if (logFlag) {
+ printf("TCP BPF Loaded %s\n", fn);
+ printf("%s\n", bpf_log_buf);
+ }
+ attr.bpf_fd = prog_fd[0];
+ error = syscall(__NR_bpf, BPF_PROG_LOAD_SOCKET_OPS,
+ &attr, sizeof(attr));
+ if (error) {
+ printf("ERROR: syscall(BPF_PROG_SOCKET_OPS: %d\n",
+ error);
+ return 2;
+ }
+ if (logFlag)
+ read_trace_pipe();
+ }
+ }
+ return 0;
+}
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH net-next v2 03/15] bpf: Support for per connection SYN/SYN-ACK RTOs
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 01/15] " Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 02/15] bpf: program to load socketops BPF programs Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 04/15] bpf: Sample bpf program to set " Lawrence Brakmo
` (11 subsequent siblings)
14 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
This patch adds support for setting per-connection SYN and
SYN-ACK RTOs from within a BPF_SOCKET_OPS program. For example,
to set small RTOs when it is known both hosts are within a
datacenter.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
include/net/tcp.h | 11 +++++++++++
include/uapi/linux/bpf.h | 3 +++
net/ipv4/tcp_input.c | 3 ++-
net/ipv4/tcp_output.c | 2 +-
4 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9ad0d80..a726486 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2018,4 +2018,15 @@ static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
}
#endif
+static inline u32 tcp_timeout_init(struct sock *sk, bool is_req_sock)
+{
+ int timeout;
+
+ timeout = tcp_call_bpf(sk, is_req_sock, BPF_SOCKET_OPS_TIMEOUT_INIT);
+
+ if (timeout <= 0)
+ timeout = TCP_TIMEOUT_INIT;
+ return timeout;
+}
+
#endif /* _TCP_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1540336..039f327 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -746,6 +746,9 @@ struct bpf_socket_ops {
*/
enum {
BPF_SOCKET_OPS_VOID,
+ BPF_SOCKET_OPS_TIMEOUT_INIT, /* Should return SYN-RTO value to use or
+ * -1 if default value should be used
+ */
};
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2ab7e2f..0867b05 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6406,7 +6406,8 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
} else {
tcp_rsk(req)->tfo_listener = false;
if (!want_cookie)
- inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ inet_csk_reqsk_queue_hash_add(sk, req,
+ tcp_timeout_init((struct sock *)req, true));
af_ops->send_synack(sk, dst, &fl, req, &foc,
!want_cookie ? TCP_SYNACK_NORMAL :
TCP_SYNACK_COOKIE);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9a9c395..5e478a1 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3327,7 +3327,7 @@ static void tcp_connect_init(struct sock *sk)
tp->rcv_wup = tp->rcv_nxt;
tp->copied_seq = tp->rcv_nxt;
- inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+ inet_csk(sk)->icsk_rto = tcp_timeout_init(sk, false);
inet_csk(sk)->icsk_retransmits = 0;
tcp_clear_retrans(tp);
}
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH net-next v2 04/15] bpf: Sample bpf program to set SYN/SYN-ACK RTOs
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
` (2 preceding siblings ...)
2017-06-15 20:08 ` [RFC PATCH net-next v2 03/15] bpf: Support for per connection SYN/SYN-ACK RTOs Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 05/15] bpf: Support for setting initial receive window Lawrence Brakmo
` (10 subsequent siblings)
14 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
The sample BPF program, tcp_synrto_kern.c, sets the SYN and SYN-ACK
RTOs to 10ms when both hosts are within the same datacenter (i.e.
small RTTs) in an environment where common IPv6 prefixes indicate
both hosts are in the same data center.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
samples/bpf/Makefile | 1 +
samples/bpf/tcp_synrto_kern.c | 54 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+)
create mode 100644 samples/bpf/tcp_synrto_kern.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index ed6bc75..21cb016 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -113,6 +113,7 @@ always += lwt_len_hist_kern.o
always += xdp_tx_iptunnel_kern.o
always += test_map_in_map_kern.o
always += cookie_uid_helper_example.o
+always += tcp_synrto_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_synrto_kern.c b/samples/bpf/tcp_synrto_kern.c
new file mode 100644
index 0000000..03579cb
--- /dev/null
+++ b/samples/bpf/tcp_synrto_kern.c
@@ -0,0 +1,54 @@
+/*
+ * BPF program to set SYN and SYN-ACK RTOs to 10ms when using IPv6 addresses
+ * and the first 5.5 bytes of the IPv6 addresses are the same (in this example
+ * that means both hosts are in the same datacenter).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_synrto(struct bpf_socket_ops *skops)
+{
+ char fmt1[] = "BPF command: %d\n";
+ char fmt2[] = " Returning %d\n";
+ int rv = -1;
+ int op;
+
+ /* For testing purposes, only execute rest of BPF program
+ * if neither port number is 55601
+ */
+ if (skops->remote_port != 55601 && skops->local_port != 55601)
+ return -1;
+
+ op = (int) skops->op;
+
+#ifdef DEBUG
+ bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+ /* Check for TIMEOUT_INIT operation and IPv6 addresses */
+ if (op == BPF_SOCKET_OPS_TIMEOUT_INIT &&
+ skops->family == AF_INET6) {
+
+ /* If the first 5.5 bytes of the IPv6 address are the same
+ * then both hosts are in the same datacenter
+ * so use an RTO of 10ms
+ */
+ if (skops->local_ip6[0] == skops->remote_ip6[0] &&
+ (skops->local_ip6[1] & 0xfff00000) ==
+ (skops->remote_ip6[1] & 0xfff00000))
+ rv = 10;
+ }
+#ifdef DEBUG
+ bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+ return rv;
+}
+char _license[] SEC("license") = "GPL";
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH net-next v2 05/15] bpf: Support for setting initial receive window
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
` (3 preceding siblings ...)
2017-06-15 20:08 ` [RFC PATCH net-next v2 04/15] bpf: Sample bpf program to set " Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 06/15] bpf: Sample bpf program to set initial window Lawrence Brakmo
` (9 subsequent siblings)
14 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
This patch adds support for setting the initial advertized window from
within a BPF_SOCKET_OPS program. This can be used to support larger
initial cwnd values in environments where it is known to be safe.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
include/net/tcp.h | 10 ++++++++++
include/uapi/linux/bpf.h | 4 ++++
net/ipv4/tcp_minisocks.c | 9 ++++++++-
net/ipv4/tcp_output.c | 7 ++++++-
4 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a726486..29c27dc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2029,4 +2029,14 @@ static inline u32 tcp_timeout_init(struct sock *sk, bool is_req_sock)
return timeout;
}
+static inline u32 tcp_rwnd_init_bpf(struct sock *sk, bool is_req_sock)
+{
+ int rwnd;
+
+ rwnd = tcp_call_bpf(sk, is_req_sock, BPF_SOCKET_OPS_RWND_INIT);
+
+ if (rwnd < 0)
+ rwnd = 0;
+ return rwnd;
+}
#endif /* _TCP_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 039f327..d945336 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -749,6 +749,10 @@ enum {
BPF_SOCKET_OPS_TIMEOUT_INIT, /* Should return SYN-RTO value to use or
* -1 if default value should be used
*/
+ BPF_SOCKET_OPS_RWND_INIT, /* Should return initial advertized
+ * window (in packets) or -1 if default
+ * value should be used
+ */
};
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index d30ee31..bbaf3c6 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -351,6 +351,7 @@ void tcp_openreq_init_rwin(struct request_sock *req,
int full_space = tcp_full_space(sk_listener);
u32 window_clamp;
__u8 rcv_wscale;
+ u32 rcv_wnd;
int mss;
mss = tcp_mss_clamp(tp, dst_metric_advmss(dst));
@@ -363,6 +364,12 @@ void tcp_openreq_init_rwin(struct request_sock *req,
(req->rsk_window_clamp > full_space || req->rsk_window_clamp == 0))
req->rsk_window_clamp = full_space;
+ rcv_wnd = tcp_rwnd_init_bpf((struct sock *)req, true);
+ if (rcv_wnd == 0)
+ rcv_wnd = dst_metric(dst, RTAX_INITRWND);
+ else if (full_space < rcv_wnd * mss)
+ full_space = rcv_wnd * mss;
+
/* tcp_full_space because it is guaranteed to be the first packet */
tcp_select_initial_window(full_space,
mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
@@ -370,7 +377,7 @@ void tcp_openreq_init_rwin(struct request_sock *req,
&req->rsk_window_clamp,
ireq->wscale_ok,
&rcv_wscale,
- dst_metric(dst, RTAX_INITRWND));
+ rcv_wnd);
ireq->rcv_wscale = rcv_wscale;
}
EXPORT_SYMBOL(tcp_openreq_init_rwin);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5e478a1..e5f623f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3267,6 +3267,7 @@ static void tcp_connect_init(struct sock *sk)
const struct dst_entry *dst = __sk_dst_get(sk);
struct tcp_sock *tp = tcp_sk(sk);
__u8 rcv_wscale;
+ u32 rcv_wnd;
/* We'll fix this up when we get a response from the other end.
* See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT.
@@ -3300,13 +3301,17 @@ static void tcp_connect_init(struct sock *sk)
(tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
tp->window_clamp = tcp_full_space(sk);
+ rcv_wnd = tcp_rwnd_init_bpf(sk, false);
+ if (rcv_wnd == 0)
+ rcv_wnd = dst_metric(dst, RTAX_INITRWND);
+
tcp_select_initial_window(tcp_full_space(sk),
tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
&tp->rcv_wnd,
&tp->window_clamp,
sock_net(sk)->ipv4.sysctl_tcp_window_scaling,
&rcv_wscale,
- dst_metric(dst, RTAX_INITRWND));
+ rcv_wnd);
tp->rx_opt.rcv_wscale = rcv_wscale;
tp->rcv_ssthresh = tp->rcv_wnd;
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH net-next v2 06/15] bpf: Sample bpf program to set initial window
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
` (4 preceding siblings ...)
2017-06-15 20:08 ` [RFC PATCH net-next v2 05/15] bpf: Support for setting initial receive window Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 07/15] bpf: Add setsockopt helper function to bpf Lawrence Brakmo
` (8 subsequent siblings)
14 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
The sample bpf program, tcp_rwnd_kern.c, sets the initial
advertized window to 40 packets in an environment where
distinct IPv6 prefixes indicate that both hosts are not
in the same data center.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
samples/bpf/Makefile | 1 +
samples/bpf/tcp_rwnd_kern.c | 55 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 56 insertions(+)
create mode 100644 samples/bpf/tcp_rwnd_kern.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 21cb016..9aca209 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -114,6 +114,7 @@ always += xdp_tx_iptunnel_kern.o
always += test_map_in_map_kern.o
always += cookie_uid_helper_example.o
always += tcp_synrto_kern.o
+always += tcp_rwnd_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_rwnd_kern.c b/samples/bpf/tcp_rwnd_kern.c
new file mode 100644
index 0000000..f01b6d4
--- /dev/null
+++ b/samples/bpf/tcp_rwnd_kern.c
@@ -0,0 +1,55 @@
+/*
+ * BPF program to set initial receive window to 40 packets when using IPv6
+ * and the first 5.5 bytes of the IPv6 addresses are not the same (in this
+ * example that means both hosts are not in the same datacenter).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_rwnd(struct bpf_socket_ops *skops)
+{
+ char fmt1[] = "BPF command: %d\n";
+ char fmt2[] = " Returning %d\n";
+ int rv = -1;
+ int op;
+
+ /* For testing purposes, only execute rest of BPF program
+ * if neither port number is 55601
+ */
+ if (skops->remote_port != 55601 && skops->local_port != 55601)
+ return -1;
+
+ op = (int) skops->op;
+
+#ifdef DEBUG
+ bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+ /* Check for RWND_INIT operation and IPv6 addresses */
+ if (op == BPF_SOCKET_OPS_RWND_INIT &&
+ skops->family == AF_INET6) {
+
+ /* If the first 5.5 bytes of the IPv6 address are not the same
+ * then both hosts are not in the same datacenter
+ * so use a larger initial advertized window (40 packets)
+ */
+ if (skops->local_ip6[0] != skops->remote_ip6[0] ||
+ (skops->local_ip6[1] & 0xfffff000) !=
+		    (skops->remote_ip6[1] & 0xfffff000)) {
+			rv = 40;
+		}
+ }
+#ifdef DEBUG
+ bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+ return rv;
+}
+char _license[] SEC("license") = "GPL";
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH net-next v2 07/15] bpf: Add setsockopt helper function to bpf
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
` (5 preceding siblings ...)
2017-06-15 20:08 ` [RFC PATCH net-next v2 06/15] bpf: Sample bpf program to set initial window Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
2017-06-16 13:27 ` Daniel Borkmann
2017-06-15 20:08 ` [RFC PATCH net-next v2 08/15] bpf: Add TCP connection BPF callbacks Lawrence Brakmo
` (7 subsequent siblings)
14 siblings, 1 reply; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
Added support for calling a subset of socket setsockopts from
BPF_PROG_TYPE_SOCKET_OPS programs. The code was duplicated rather
than making the changes to call the socket setsockopt function because
the changes required would have been larger.
The ops supported are:
SO_RCVBUF
SO_SNDBUF
SO_MAX_PACING_RATE
SO_PRIORITY
SO_RCVLOWAT
SO_MARK
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
include/uapi/linux/bpf.h | 14 ++++++++-
net/core/filter.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++-
samples/bpf/bpf_helpers.h | 3 ++
3 files changed, 92 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d945336..8accb4d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -520,6 +520,17 @@ union bpf_attr {
* Set full skb->hash.
* @skb: pointer to skb
* @hash: hash to set
+ *
+ * int bpf_setsockopt(bpf_socket, level, optname, optval, optlen)
+ * Calls setsockopt. Not all opts are available, only those with
+ * integer optvals plus TCP_CONGESTION.
+ * Supported levels: SOL_SOCKET and IPPROTO_TCP
+ * @bpf_socket: pointer to bpf_socket
+ * @level: SOL_SOCKET or IPPROTO_TCP
+ * @optname: option name
+ * @optval: pointer to option value
+ * @optlen: length of optval in bytes
+ * Return: 0 or negative error
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -570,7 +581,8 @@ union bpf_attr {
FN(probe_read_str), \
FN(get_socket_cookie), \
FN(get_socket_uid), \
- FN(set_hash),
+ FN(set_hash), \
+ FN(setsockopt),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 7466f55..9ff567c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -54,6 +54,7 @@
#include <net/dst.h>
#include <net/sock_reuseport.h>
#include <net/busy_poll.h>
+#include <net/tcp.h>
/**
* sk_filter_trim_cap - run a packet through a socket filter
@@ -2671,6 +2672,69 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
.arg1_type = ARG_PTR_TO_CTX,
};
+BPF_CALL_5(bpf_setsockopt, struct bpf_socket_ops_kern *, bpf_socket,
+ int, level, int, optname, char *, optval, int, optlen)
+{
+ struct sock *sk = bpf_socket->sk;
+ int ret = 0;
+ int val;
+
+ if (bpf_socket->is_req_sock)
+ return -EINVAL;
+
+ if (level == SOL_SOCKET) {
+ /* Only some socketops are supported */
+ val = *((int *)optval);
+
+ switch (optname) {
+ case SO_RCVBUF:
+ sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+ sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF);
+ break;
+ case SO_SNDBUF:
+ sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
+ sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
+ break;
+ case SO_MAX_PACING_RATE:
+ sk->sk_max_pacing_rate = val;
+ sk->sk_pacing_rate = min(sk->sk_pacing_rate,
+ sk->sk_max_pacing_rate);
+ break;
+ case SO_PRIORITY:
+ sk->sk_priority = val;
+ break;
+ case SO_RCVLOWAT:
+ if (val < 0)
+ val = INT_MAX;
+ sk->sk_rcvlowat = val ? : 1;
+ break;
+ case SO_MARK:
+ sk->sk_mark = val;
+ break;
+ default:
+ ret = -EINVAL;
+ }
+ } else if (level == SOL_TCP &&
+ bpf_socket->sk->sk_prot->setsockopt == tcp_setsockopt) {
+ /* Place holder */
+ ret = -EINVAL;
+ } else {
+ ret = -EINVAL;
+ }
+ return ret;
+}
+
+static const struct bpf_func_proto bpf_setsockopt_proto = {
+ .func = bpf_setsockopt,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_ANYTHING,
+ .arg4_type = ARG_PTR_TO_MEM,
+ .arg5_type = ARG_CONST_SIZE_OR_ZERO,
+};
+
static const struct bpf_func_proto *
bpf_base_func_proto(enum bpf_func_id func_id)
{
@@ -2822,6 +2886,17 @@ lwt_inout_func_proto(enum bpf_func_id func_id)
}
static const struct bpf_func_proto *
+ socket_ops_func_proto(enum bpf_func_id func_id)
+{
+ switch (func_id) {
+ case BPF_FUNC_setsockopt:
+ return &bpf_setsockopt_proto;
+ default:
+ return bpf_base_func_proto(func_id);
+ }
+}
+
+static const struct bpf_func_proto *
lwt_xmit_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
@@ -3565,7 +3640,7 @@ const struct bpf_verifier_ops cg_sock_prog_ops = {
};
const struct bpf_verifier_ops socket_ops_prog_ops = {
- .get_func_proto = bpf_base_func_proto,
+ .get_func_proto = socket_ops_func_proto,
.is_valid_access = socket_ops_is_valid_access,
.convert_ctx_access = socket_ops_convert_ctx_access,
};
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index f4840b8..d50ac34 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -60,6 +60,9 @@ static unsigned long long (*bpf_get_prandom_u32)(void) =
(void *) BPF_FUNC_get_prandom_u32;
static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
(void *) BPF_FUNC_xdp_adjust_head;
+static int (*bpf_setsockopt)(void *ctx, int level, int optname, void *optval,
+ int optlen) =
+ (void *) BPF_FUNC_setsockopt;
/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
--
2.9.3
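As a usage sketch (hypothetical: the program name is made up and it
assumes the BPF_SOCKET_OPS_ACTIVE_ESTABLISHED_CB op that a later patch
in this series adds), a sockops program could bump the send buffer once
an active connection is established:

/* Hypothetical sample using the new helper; mirrors the include set
 * of the other sockops samples in this series.
 */
#include <uapi/linux/bpf.h>
#include <linux/socket.h>
#include "bpf_helpers.h"

SEC("sockops")
int bpf_sndbuf(struct bpf_socket_ops *skops)
{
	int bufsize = 1500000;
	int rv = 0;

	if (skops->op == BPF_SOCKET_OPS_ACTIVE_ESTABLISHED_CB)
		rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF,
				    &bufsize, sizeof(bufsize));
	return rv;
}
char _license[] SEC("license") = "GPL";	/* helper is gpl_only */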
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [RFC PATCH net-next v2 07/15] bpf: Add setsockopt helper function to bpf
2017-06-15 20:08 ` [RFC PATCH net-next v2 07/15] bpf: Add setsockopt helper function to bpf Lawrence Brakmo
@ 2017-06-16 13:27 ` Daniel Borkmann
2017-06-17 23:17 ` Lawrence Brakmo
0 siblings, 1 reply; 29+ messages in thread
From: Daniel Borkmann @ 2017-06-16 13:27 UTC (permalink / raw)
To: Lawrence Brakmo, netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern
On 06/15/2017 10:08 PM, Lawrence Brakmo wrote:
> Added support for calling a subset of socket setsockopts from
> BPF_PROG_TYPE_SOCKET_OPS programs. The code was duplicated rather
> than making the changes to call the socket setsockopt function because
> the changes required would have been larger.
>
> The ops supported are:
> SO_RCVBUF
> SO_SNDBUF
> SO_MAX_PACING_RATE
> SO_PRIORITY
> SO_RCVLOWAT
> SO_MARK
>
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
> ---
> include/uapi/linux/bpf.h | 14 ++++++++-
> net/core/filter.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++-
> samples/bpf/bpf_helpers.h | 3 ++
> 3 files changed, 92 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d945336..8accb4d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -520,6 +520,17 @@ union bpf_attr {
> * Set full skb->hash.
> * @skb: pointer to skb
> * @hash: hash to set
> + *
> + * int bpf_setsockopt(bpf_socket, level, optname, optval, optlen)
> + * Calls setsockopt. Not all opts are available, only those with
> + * integer optvals plus TCP_CONGESTION.
> + * Supported levels: SOL_SOCKET and IPPROTO_TCP
> + * @bpf_socket: pointer to bpf_socket
> + * @level: SOL_SOCKET or IPPROTO_TCP
> + * @optname: option name
> + * @optval: pointer to option value
> + * @optlen: length of optval in bytes
> + * Return: 0 or negative error
> */
> #define __BPF_FUNC_MAPPER(FN) \
> FN(unspec), \
> @@ -570,7 +581,8 @@ union bpf_attr {
> FN(probe_read_str), \
> FN(get_socket_cookie), \
> FN(get_socket_uid), \
> - FN(set_hash),
> + FN(set_hash), \
> + FN(setsockopt),
>
> /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> * function eBPF program intends to call
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7466f55..9ff567c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -54,6 +54,7 @@
> #include <net/dst.h>
> #include <net/sock_reuseport.h>
> #include <net/busy_poll.h>
> +#include <net/tcp.h>
>
> /**
> * sk_filter_trim_cap - run a packet through a socket filter
> @@ -2671,6 +2672,69 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
> .arg1_type = ARG_PTR_TO_CTX,
> };
>
> +BPF_CALL_5(bpf_setsockopt, struct bpf_socket_ops_kern *, bpf_socket,
> + int, level, int, optname, char *, optval, int, optlen)
> +{
> + struct sock *sk = bpf_socket->sk;
> + int ret = 0;
> + int val;
> +
> + if (bpf_socket->is_req_sock)
> + return -EINVAL;
> +
> + if (level == SOL_SOCKET) {
> + /* Only some socketops are supported */
> + val = *((int *)optval);
> +
> + switch (optname) {
> + case SO_RCVBUF:
> + sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> + sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> + break;
> + case SO_SNDBUF:
> + sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
> + sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
> + break;
> + case SO_MAX_PACING_RATE:
> + sk->sk_max_pacing_rate = val;
> + sk->sk_pacing_rate = min(sk->sk_pacing_rate,
> + sk->sk_max_pacing_rate);
> + break;
> + case SO_PRIORITY:
> + sk->sk_priority = val;
> + break;
> + case SO_RCVLOWAT:
> + if (val < 0)
> + val = INT_MAX;
> + sk->sk_rcvlowat = val ? : 1;
> + break;
> + case SO_MARK:
> + sk->sk_mark = val;
> + break;
Likely all of the above, I think we could make programs even more
flexible if these are members in the context struct itself, so
that socket_ops_convert_ctx_access() would translate them inline
w/o even needing a helper call with the bonus of having read and
write access.
But what about socket lock that we otherwise have in sock_setsockopt()?
Could we add a sock_owned_by_me(sk) to tcp_call_bpf() to really assert
this for all cases, so that lockdep complains should it ever be not
the case?
> + default:
> + ret = -EINVAL;
> + }
> + } else if (level == SOL_TCP &&
> + bpf_socket->sk->sk_prot->setsockopt == tcp_setsockopt) {
> + /* Place holder */
> + ret = -EINVAL;
> + } else {
> + ret = -EINVAL;
> + }
> + return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_setsockopt_proto = {
> + .func = bpf_setsockopt,
> + .gpl_only = true,
> + .ret_type = RET_INTEGER,
> + .arg1_type = ARG_PTR_TO_CTX,
> + .arg2_type = ARG_ANYTHING,
> + .arg3_type = ARG_ANYTHING,
> + .arg4_type = ARG_PTR_TO_MEM,
> + .arg5_type = ARG_CONST_SIZE_OR_ZERO,
> +};
> +
> static const struct bpf_func_proto *
> bpf_base_func_proto(enum bpf_func_id func_id)
> {
> @@ -2822,6 +2886,17 @@ lwt_inout_func_proto(enum bpf_func_id func_id)
> }
>
> static const struct bpf_func_proto *
> + socket_ops_func_proto(enum bpf_func_id func_id)
> +{
> + switch (func_id) {
> + case BPF_FUNC_setsockopt:
> + return &bpf_setsockopt_proto;
> + default:
> + return bpf_base_func_proto(func_id);
> + }
> +}
> +
> +static const struct bpf_func_proto *
> lwt_xmit_func_proto(enum bpf_func_id func_id)
> {
> switch (func_id) {
> @@ -3565,7 +3640,7 @@ const struct bpf_verifier_ops cg_sock_prog_ops = {
> };
>
> const struct bpf_verifier_ops socket_ops_prog_ops = {
> - .get_func_proto = bpf_base_func_proto,
> + .get_func_proto = socket_ops_func_proto,
> .is_valid_access = socket_ops_is_valid_access,
> .convert_ctx_access = socket_ops_convert_ctx_access,
> };
> diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
> index f4840b8..d50ac34 100644
> --- a/samples/bpf/bpf_helpers.h
> +++ b/samples/bpf/bpf_helpers.h
> @@ -60,6 +60,9 @@ static unsigned long long (*bpf_get_prandom_u32)(void) =
> (void *) BPF_FUNC_get_prandom_u32;
> static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
> (void *) BPF_FUNC_xdp_adjust_head;
> +static int (*bpf_setsockopt)(void *ctx, int level, int optname, void *optval,
> + int optlen) =
> + (void *) BPF_FUNC_setsockopt;
>
> /* llvm builtin functions that eBPF C program may use to
> * emit BPF_LD_ABS and BPF_LD_IND instructions
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH net-next v2 07/15] bpf: Add setsockopt helper function to bpf
2017-06-16 13:27 ` Daniel Borkmann
@ 2017-06-17 23:17 ` Lawrence Brakmo
0 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-17 23:17 UTC (permalink / raw)
To: Daniel Borkmann, netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern
On 6/16/17, 6:27 AM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:
On 06/15/2017 10:08 PM, Lawrence Brakmo wrote:
> Added support for calling a subset of socket setsockopts from
> BPF_PROG_TYPE_SOCKET_OPS programs. The code was duplicated rather
> than making the changes to call the socket setsockopt function because
> the changes required would have been larger.
>
> The ops supported are:
> SO_RCVBUF
> SO_SNDBUF
> SO_MAX_PACING_RATE
> SO_PRIORITY
> SO_RCVLOWAT
> SO_MARK
>
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
> ---
> include/uapi/linux/bpf.h | 14 ++++++++-
> net/core/filter.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++-
> samples/bpf/bpf_helpers.h | 3 ++
> 3 files changed, 92 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d945336..8accb4d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -520,6 +520,17 @@ union bpf_attr {
> * Set full skb->hash.
> * @skb: pointer to skb
> * @hash: hash to set
> + *
> + * int bpf_setsockopt(bpf_socket, level, optname, optval, optlen)
> + * Calls setsockopt. Not all opts are available, only those with
> + * integer optvals plus TCP_CONGESTION.
> + * Supported levels: SOL_SOCKET and IPPROTO_TCP
> + * @bpf_socket: pointer to bpf_socket
> + * @level: SOL_SOCKET or IPPROTO_TCP
> + * @optname: option name
> + * @optval: pointer to option value
> + * @optlen: length of optval in bytes
> + * Return: 0 or negative error
> */
> #define __BPF_FUNC_MAPPER(FN) \
> FN(unspec), \
> @@ -570,7 +581,8 @@ union bpf_attr {
> FN(probe_read_str), \
> FN(get_socket_cookie), \
> FN(get_socket_uid), \
> - FN(set_hash),
> + FN(set_hash), \
> + FN(setsockopt),
>
> /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> * function eBPF program intends to call
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7466f55..9ff567c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -54,6 +54,7 @@
> #include <net/dst.h>
> #include <net/sock_reuseport.h>
> #include <net/busy_poll.h>
> +#include <net/tcp.h>
>
> /**
> * sk_filter_trim_cap - run a packet through a socket filter
> @@ -2671,6 +2672,69 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
> .arg1_type = ARG_PTR_TO_CTX,
> };
>
> +BPF_CALL_5(bpf_setsockopt, struct bpf_socket_ops_kern *, bpf_socket,
> + int, level, int, optname, char *, optval, int, optlen)
> +{
> + struct sock *sk = bpf_socket->sk;
> + int ret = 0;
> + int val;
> +
> + if (bpf_socket->is_req_sock)
> + return -EINVAL;
> +
> + if (level == SOL_SOCKET) {
> + /* Only some socketops are supported */
> + val = *((int *)optval);
> +
> + switch (optname) {
> + case SO_RCVBUF:
> + sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> + sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> + break;
> + case SO_SNDBUF:
> + sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
> + sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
> + break;
> + case SO_MAX_PACING_RATE:
> + sk->sk_max_pacing_rate = val;
> + sk->sk_pacing_rate = min(sk->sk_pacing_rate,
> + sk->sk_max_pacing_rate);
> + break;
> + case SO_PRIORITY:
> + sk->sk_priority = val;
> + break;
> + case SO_RCVLOWAT:
> + if (val < 0)
> + val = INT_MAX;
> + sk->sk_rcvlowat = val ? : 1;
> + break;
> + case SO_MARK:
> + sk->sk_mark = val;
> + break;
Likely all of the above, I think we could make programs even more
flexible if these are members in the context struct itself, so
that socket_ops_convert_ctx_access() would translate them inline
w/o even needing a helper call with the bonus of having read and
write access.
I initially refactored sock_setsockopt and tcp_setsockopt so they
could be called from bpf programs, but it turned out not to be
worth it since a lot of them cannot be called from there.
These operations should not be called too often (per sock lifetime)
so I was not too concerned about efficiency. In addition, some of
these are not viable if the sock is a req_sock, and I didn’t think the
performance gain would be worth the added complexity.
I already have other enhancements in mind that will require
access to fields in the tcp_sock structure. I was planning to
implement those as you proposed because they would be
accessed more often.
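To make the inline-translation idea concrete, a hypothetical
"priority" ctx member (read side only; a write would also need a
scratch register, since si->src_reg carries the stored value) could be
xlated in socket_ops_convert_ctx_access() like this:

	case offsetof(struct bpf_socket_ops, priority):
		BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_priority) != 4);

		/* dereference skops->sk, then load sk->sk_priority */
		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
					struct bpf_socket_ops_kern, sk),
				      si->dst_reg, si->src_reg,
				      offsetof(struct bpf_socket_ops_kern, sk));
		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
				      offsetof(struct sock, sk_priority));
		break;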
But what about socket lock that we otherwise have in sock_setsockopt()?
My understanding is that the lock is needed because the sock_setsockopt()
function is called from user space. The sock_ops BPF programs are being
called from within the network stack and I thought a lock was not needed
in this case. See, for example, the call to bh_lock_sock_nested() in tcp_v4_rcv().
Could we add a sock_owned_by_me(sk) to tcp_call_bpf() to really assert
this for all cases, so that lockdep complains should it ever be not
the case?
Done (if it is not a req_sock).
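In code, that assertion would look roughly like this (a sketch of the
revised tcp_call_bpf(), assuming the switch of is_req_sock to bool from
the other subthread):

static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
{
	struct bpf_socket_ops_kern socket_ops;

	/* req_socks are not locked the usual way; only assert
	 * ownership for full sockets
	 */
	if (!is_req_sock)
		sock_owned_by_me(sk);

	memset(&socket_ops, 0, sizeof(socket_ops));
	socket_ops.sk = sk;
	socket_ops.is_req_sock = is_req_sock;
	socket_ops.op = op;

	return bpf_socket_ops_call(&socket_ops);
}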
> + default:
> + ret = -EINVAL;
> + }
> + } else if (level == SOL_TCP &&
> + bpf_socket->sk->sk_prot->setsockopt == tcp_setsockopt) {
> + /* Place holder */
> + ret = -EINVAL;
> + } else {
> + ret = -EINVAL;
> + }
> + return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_setsockopt_proto = {
> + .func = bpf_setsockopt,
> + .gpl_only = true,
> + .ret_type = RET_INTEGER,
> + .arg1_type = ARG_PTR_TO_CTX,
> + .arg2_type = ARG_ANYTHING,
> + .arg3_type = ARG_ANYTHING,
> + .arg4_type = ARG_PTR_TO_MEM,
> + .arg5_type = ARG_CONST_SIZE_OR_ZERO,
> +};
> +
> static const struct bpf_func_proto *
> bpf_base_func_proto(enum bpf_func_id func_id)
> {
> @@ -2822,6 +2886,17 @@ lwt_inout_func_proto(enum bpf_func_id func_id)
> }
>
> static const struct bpf_func_proto *
> + socket_ops_func_proto(enum bpf_func_id func_id)
> +{
> + switch (func_id) {
> + case BPF_FUNC_setsockopt:
> + return &bpf_setsockopt_proto;
> + default:
> + return bpf_base_func_proto(func_id);
> + }
> +}
> +
> +static const struct bpf_func_proto *
> lwt_xmit_func_proto(enum bpf_func_id func_id)
> {
> switch (func_id) {
> @@ -3565,7 +3640,7 @@ const struct bpf_verifier_ops cg_sock_prog_ops = {
> };
>
> const struct bpf_verifier_ops socket_ops_prog_ops = {
> - .get_func_proto = bpf_base_func_proto,
> + .get_func_proto = socket_ops_func_proto,
> .is_valid_access = socket_ops_is_valid_access,
> .convert_ctx_access = socket_ops_convert_ctx_access,
> };
> diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
> index f4840b8..d50ac34 100644
> --- a/samples/bpf/bpf_helpers.h
> +++ b/samples/bpf/bpf_helpers.h
> @@ -60,6 +60,9 @@ static unsigned long long (*bpf_get_prandom_u32)(void) =
> (void *) BPF_FUNC_get_prandom_u32;
> static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
> (void *) BPF_FUNC_xdp_adjust_head;
> +static int (*bpf_setsockopt)(void *ctx, int level, int optname, void *optval,
> + int optlen) =
> + (void *) BPF_FUNC_setsockopt;
>
> /* llvm builtin functions that eBPF C program may use to
> * emit BPF_LD_ABS and BPF_LD_IND instructions
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* [RFC PATCH net-next v2 08/15] bpf: Add TCP connection BPF callbacks
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
` (6 preceding siblings ...)
2017-06-15 20:08 ` [RFC PATCH net-next v2 07/15] bpf: Add setsockopt helper function to bpf Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 09/15] bpf: Sample BPF program to set buffer sizes Lawrence Brakmo
` (6 subsequent siblings)
14 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
Added callbacks to the BPF SOCKET_OPS type program before an active
connection is initialized and after a passive or active connection is
established.
The following patch demonstrates how they can be used to set send and
receive buffer sizes.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
include/uapi/linux/bpf.h | 11 +++++++++++
net/ipv4/tcp_fastopen.c | 1 +
net/ipv4/tcp_input.c | 4 +++-
net/ipv4/tcp_output.c | 1 +
4 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8accb4d..c3490d3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -765,6 +765,17 @@ enum {
* window (in packets) or -1 if default
* value should be used
*/
+ BPF_SOCKET_OPS_TCP_CONNECT_CB, /* Calls BPF program right before an
+ * active connection is initialized
+ */
+ BPF_SOCKET_OPS_ACTIVE_ESTABLISHED_CB, /* Calls BPF program when an
+ * active connection is
+ * established
+ */
+ BPF_SOCKET_OPS_PASSIVE_ESTABLISHED_CB, /* Calls BPF program when a
+ * passive connection is
+ * established
+ */
};
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 4af82b9..c3ec4ec 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -221,6 +221,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
tcp_init_congestion_control(child);
tcp_mtup_init(child);
tcp_init_metrics(child);
+ tcp_call_bpf(child, false, BPF_SOCKET_OPS_PASSIVE_ESTABLISHED_CB);
tcp_init_buffer_space(child);
tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0867b05..e0d688a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5571,7 +5571,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
icsk->icsk_af_ops->rebuild_header(sk);
tcp_init_metrics(sk);
-
+ tcp_call_bpf(sk, false, BPF_SOCKET_OPS_ACTIVE_ESTABLISHED_CB);
tcp_init_congestion_control(sk);
/* Prevent spurious tcp_cwnd_restart() on first data
@@ -5977,6 +5977,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
} else {
/* Make sure socket is routed, for correct metrics. */
icsk->icsk_af_ops->rebuild_header(sk);
+ tcp_call_bpf(sk, false,
+ BPF_SOCKET_OPS_PASSIVE_ESTABLISHED_CB);
tcp_init_congestion_control(sk);
tcp_mtup_init(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e5f623f..9124d3d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3445,6 +3445,7 @@ int tcp_connect(struct sock *sk)
struct sk_buff *buff;
int err;
+ tcp_call_bpf(sk, false, BPF_SOCKET_OPS_TCP_CONNECT_CB);
tcp_connect_init(sk);
if (unlikely(tp->repair)) {
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH net-next v2 09/15] bpf: Sample BPF program to set buffer sizes
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
` (7 preceding siblings ...)
2017-06-15 20:08 ` [RFC PATCH net-next v2 08/15] bpf: Add TCP connection BPF callbacks Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 10/15] bpf: Add support for changing congestion control Lawrence Brakmo
` (5 subsequent siblings)
14 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
This patch contains a BPF program to set initial receive window to
40 packets and send and receive buffers to 1.5MB. This would usually
be done after doing appropriate checks that indicate the hosts are
far enough away (i.e. large RTT).
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
samples/bpf/Makefile | 1 +
samples/bpf/tcp_bufs_kern.c | 71 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 72 insertions(+)
create mode 100644 samples/bpf/tcp_bufs_kern.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 9aca209..942c7c7 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -115,6 +115,7 @@ always += test_map_in_map_kern.o
always += cookie_uid_helper_example.o
always += tcp_synrto_kern.o
always += tcp_rwnd_kern.o
+always += tcp_bufs_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_bufs_kern.c b/samples/bpf/tcp_bufs_kern.c
new file mode 100644
index 0000000..37b2923
--- /dev/null
+++ b/samples/bpf/tcp_bufs_kern.c
@@ -0,0 +1,71 @@
+/*
+ * BPF program to set initial receive window to 40 packets and send
+ * and receive buffers to 1.5MB. This would usually be done after
+ * doing appropriate checks that indicate the hosts are far enough
+ * away (i.e. large RTT).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_bufs(struct bpf_socket_ops *skops)
+{
+ char fmt1[] = "BPF command: %d\n";
+ char fmt2[] = " Returning %d\n";
+ int bufsize = 1500000;
+ int rwnd_init = 40;
+ int rv = 0;
+ int op;
+
+ /* For testing purposes, only execute the rest of the BPF program
+ * if neither port number is 55601
+ */
+ if (skops->remote_port != 55601 && skops->local_port != 55601)
+ return -1;
+
+ op = (int) skops->op;
+
+#ifdef DEBUG
+ bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+ /* Usually there would be a check to ensure the hosts are far
+ * from each other so it makes sense to increase buffer sizes
+ */
+ switch (op) {
+ case BPF_SOCKET_OPS_RWND_INIT:
+ rv = rwnd_init;
+ break;
+ case BPF_SOCKET_OPS_TCP_CONNECT_CB:
+ /* Set sndbuf and rcvbuf of active connections; fold both setsockopt results into rv for the debug printout */
+ rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF, &bufsize,
+ sizeof(bufsize));
+ rv = -rv*100 + bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
+ &bufsize, sizeof(bufsize));
+ break;
+ case BPF_SOCKET_OPS_ACTIVE_ESTABLISHED_CB:
+ /* Nothing to do */
+ break;
+ case BPF_SOCKET_OPS_PASSIVE_ESTABLISHED_CB:
+ /* Set sndbuf and rcvbuf of passive connections */
+ rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF, &bufsize,
+ sizeof(bufsize));
+ rv = -rv*100 + bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
+ &bufsize, sizeof(bufsize));
+ break;
+ default:
+ rv = -1;
+ }
+#ifdef DEBUG
+ bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+ return rv;
+}
+char _license[] SEC("license") = "GPL";
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH net-next v2 10/15] bpf: Add support for changing congestion control
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
` (8 preceding siblings ...)
2017-06-15 20:08 ` [RFC PATCH net-next v2 09/15] bpf: Sample BPF program to set buffer sizes Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
2017-06-16 13:58 ` Daniel Borkmann
2017-06-15 20:08 ` [RFC PATCH net-next v2 11/15] bpf: Sample BPF program to set " Lawrence Brakmo
` (4 subsequent siblings)
14 siblings, 1 reply; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
Added support for changing congestion control for SOCKET_OPS BPF
programs through the setsockopt bpf helper function. It also adds
a new SOCKET_OPS op, BPF_SOCKET_OPS_NEEDS_ECN, that is needed for
congestion controls, like dctcp, that need to enable ECN in the
SYN packets.
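From a sockops program the two pieces combine as in the following sketch
(the complete sample is patch 11 of this series):

        char cong[] = "dctcp";

        switch (skops->op) {
        case BPF_SOCKET_OPS_NEEDS_ECN:
                /* ask the stack to negotiate ECN in the SYN/SYN-ACK */
                return 1;
        case BPF_SOCKET_OPS_ACTIVE_ESTABLISHED_CB:
        case BPF_SOCKET_OPS_PASSIVE_ESTABLISHED_CB:
                return bpf_setsockopt(skops, SOL_TCP, TCP_CONGESTION,
                                      cong, sizeof(cong));
        }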
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
include/net/tcp.h | 9 ++++++++-
include/uapi/linux/bpf.h | 3 +++
net/core/filter.c | 11 +++++++++--
net/ipv4/tcp.c | 2 +-
net/ipv4/tcp_cong.c | 15 ++++++++++-----
net/ipv4/tcp_input.c | 3 ++-
net/ipv4/tcp_output.c | 8 +++++---
7 files changed, 38 insertions(+), 13 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 29c27dc..371b1bd 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1001,7 +1001,9 @@ void tcp_get_default_congestion_control(char *name);
void tcp_get_available_congestion_control(char *buf, size_t len);
void tcp_get_allowed_congestion_control(char *buf, size_t len);
int tcp_set_allowed_congestion_control(char *allowed);
-int tcp_set_congestion_control(struct sock *sk, const char *name);
+int tcp_set_congestion_control(struct sock *sk, const char *name, bool load);
+void tcp_reinit_congestion_control(struct sock *sk,
+ const struct tcp_congestion_ops *ca);
u32 tcp_slow_start(struct tcp_sock *tp, u32 acked);
void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked);
@@ -2039,4 +2041,9 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk, bool is_req_sock)
rwnd = 0;
return rwnd;
}
+
+static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
+{
+ return (tcp_call_bpf(sk, true, BPF_SOCKET_OPS_NEEDS_ECN) == 1);
+}
#endif /* _TCP_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c3490d3..8d1d2b7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -776,6 +776,9 @@ enum {
* passive connection is
* established
*/
+ BPF_SOCKET_OPS_NEEDS_ECN, /* If connection's congestion control
+ * needs ECN
+ */
};
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index 9ff567c..4325aba 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2716,8 +2716,15 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_socket_ops_kern *, bpf_socket,
}
} else if (level == SOL_TCP &&
bpf_socket->sk->sk_prot->setsockopt == tcp_setsockopt) {
- /* Place holder */
- ret = -EINVAL;
+ if (optname == TCP_CONGESTION) {
+ ret = tcp_set_congestion_control(sk, optval, false);
+ if (!ret && bpf_socket->op > BPF_SOCKET_OPS_NEEDS_ECN)
+ /* replacing an existing ca */
+ tcp_reinit_congestion_control(sk,
+ inet_csk(sk)->icsk_ca_ops);
+ } else {
+ ret = -EINVAL;
+ }
} else {
ret = -EINVAL;
}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cc8fd8b..07984ea 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2478,7 +2478,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
name[val] = 0;
lock_sock(sk);
- err = tcp_set_congestion_control(sk, name);
+ err = tcp_set_congestion_control(sk, name, true);
release_sock(sk);
return err;
}
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 324c9bc..e6f3717 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -189,8 +189,8 @@ void tcp_init_congestion_control(struct sock *sk)
INET_ECN_dontxmit(sk);
}
-static void tcp_reinit_congestion_control(struct sock *sk,
- const struct tcp_congestion_ops *ca)
+void tcp_reinit_congestion_control(struct sock *sk,
+ const struct tcp_congestion_ops *ca)
{
struct inet_connection_sock *icsk = inet_csk(sk);
@@ -334,7 +334,7 @@ int tcp_set_allowed_congestion_control(char *val)
}
/* Change congestion control for socket */
-int tcp_set_congestion_control(struct sock *sk, const char *name)
+int tcp_set_congestion_control(struct sock *sk, const char *name, bool load)
{
struct inet_connection_sock *icsk = inet_csk(sk);
const struct tcp_congestion_ops *ca;
@@ -344,7 +344,10 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
return -EPERM;
rcu_read_lock();
- ca = __tcp_ca_find_autoload(name);
+ if (!load)
+ ca = tcp_ca_find(name);
+ else
+ ca = __tcp_ca_find_autoload(name);
/* No change asking for existing value */
if (ca == icsk->icsk_ca_ops) {
icsk->icsk_ca_setsockopt = 1;
@@ -352,8 +355,10 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
}
if (!ca)
err = -ENOENT;
+ else if (!load)
+ icsk->icsk_ca_ops = ca;
else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
- ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
+ ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
err = -EPERM;
else if (!try_module_get(ca->owner))
err = -EBUSY;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e0d688a..14c889f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6192,7 +6192,8 @@ static void tcp_ecn_create_request(struct request_sock *req,
ecn_ok = net->ipv4.sysctl_tcp_ecn || ecn_ok_dst;
if ((!ect && ecn_ok) || tcp_ca_needs_ecn(listen_sk) ||
- (ecn_ok_dst & DST_FEATURE_ECN_CA))
+ (ecn_ok_dst & DST_FEATURE_ECN_CA) ||
+ tcp_bpf_ca_needs_ecn((struct sock *)req))
inet_rsk(req)->ecn_ok = 1;
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9124d3d..30660ee 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -316,7 +316,8 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_CWR;
if (!(tp->ecn_flags & TCP_ECN_OK))
TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ECE;
- else if (tcp_ca_needs_ecn(sk))
+ else if (tcp_ca_needs_ecn(sk) ||
+ tcp_bpf_ca_needs_ecn(sk))
INET_ECN_xmit(sk);
}
@@ -324,8 +325,9 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
+ bool bpf_needs_ecn = tcp_bpf_ca_needs_ecn(sk);
bool use_ecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
- tcp_ca_needs_ecn(sk);
+ tcp_ca_needs_ecn(sk) || bpf_needs_ecn;
if (!use_ecn) {
const struct dst_entry *dst = __sk_dst_get(sk);
@@ -339,7 +341,7 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
if (use_ecn) {
TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
tp->ecn_flags = TCP_ECN_OK;
- if (tcp_ca_needs_ecn(sk))
+ if (tcp_ca_needs_ecn(sk) || bpf_needs_ecn)
INET_ECN_xmit(sk);
}
}
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [RFC PATCH net-next v2 10/15] bpf: Add support for changing congestion control
2017-06-15 20:08 ` [RFC PATCH net-next v2 10/15] bpf: Add support for changing congestion control Lawrence Brakmo
@ 2017-06-16 13:58 ` Daniel Borkmann
2017-06-18 2:39 ` Lawrence Brakmo
0 siblings, 1 reply; 29+ messages in thread
From: Daniel Borkmann @ 2017-06-16 13:58 UTC (permalink / raw)
To: Lawrence Brakmo, netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern
On 06/15/2017 10:08 PM, Lawrence Brakmo wrote:
> Added support for changing congestion control for SOCKET_OPS BPF
> programs through the setsockopt bpf helper function. It also adds
> a new SOCKET_OPS op, BPF_SOCKET_OPS_NEEDS_ECN, that is needed for
> congestion controls, like dctcp, that need to enable ECN in the
> SYN packets.
>
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
I like the use-case! ;)
[...]
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 9ff567c..4325aba 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2716,8 +2716,15 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_socket_ops_kern *, bpf_socket,
> }
> } else if (level == SOL_TCP &&
> bpf_socket->sk->sk_prot->setsockopt == tcp_setsockopt) {
> - /* Place holder */
> - ret = -EINVAL;
> + if (optname == TCP_CONGESTION) {
> + ret = tcp_set_congestion_control(sk, optval, false);
> + if (!ret && bpf_socket->op > BPF_SOCKET_OPS_NEEDS_ECN)
> + /* replacing an existing ca */
> + tcp_reinit_congestion_control(sk,
> + inet_csk(sk)->icsk_ca_ops);
Can you elaborate on the semantics of BPF_SOCKET_OPS_NEEDS_ECN?
It's called from tcp_bpf_ca_needs_ecn(), so if someone sets a cong ctl
from that callback, or any other bpf_socket->op > BPF_SOCKET_OPS_NEEDS_ECN
then we call tcp_reinit_congestion_control()? Why 'bpf_socket->op > ..'?
Could the needs_ecn implementation detail be hidden altogether from the
BPF prog?
> + } else {
> + ret = -EINVAL;
> + }
> } else {
> ret = -EINVAL;
> }
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index cc8fd8b..07984ea 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2478,7 +2478,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
> name[val] = 0;
>
> lock_sock(sk);
> - err = tcp_set_congestion_control(sk, name);
> + err = tcp_set_congestion_control(sk, name, true);
> release_sock(sk);
> return err;
> }
> diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
> index 324c9bc..e6f3717 100644
> --- a/net/ipv4/tcp_cong.c
> +++ b/net/ipv4/tcp_cong.c
> @@ -189,8 +189,8 @@ void tcp_init_congestion_control(struct sock *sk)
> INET_ECN_dontxmit(sk);
> }
>
> -static void tcp_reinit_congestion_control(struct sock *sk,
> - const struct tcp_congestion_ops *ca)
> +void tcp_reinit_congestion_control(struct sock *sk,
> + const struct tcp_congestion_ops *ca)
> {
> struct inet_connection_sock *icsk = inet_csk(sk);
>
> @@ -334,7 +334,7 @@ int tcp_set_allowed_congestion_control(char *val)
> }
>
> /* Change congestion control for socket */
> -int tcp_set_congestion_control(struct sock *sk, const char *name)
> +int tcp_set_congestion_control(struct sock *sk, const char *name, bool load)
> {
> struct inet_connection_sock *icsk = inet_csk(sk);
> const struct tcp_congestion_ops *ca;
> @@ -344,7 +344,10 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
> return -EPERM;
>
> rcu_read_lock();
> - ca = __tcp_ca_find_autoload(name);
> + if (!load)
> + ca = tcp_ca_find(name);
> + else
> + ca = __tcp_ca_find_autoload(name);
From BPF program side, we call with !load since we're not allowed
to sleep under RCU, that's correct ...
> /* No change asking for existing value */
> if (ca == icsk->icsk_ca_ops) {
> icsk->icsk_ca_setsockopt = 1;
> @@ -352,8 +355,10 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
> }
> if (!ca)
> err = -ENOENT;
> + else if (!load)
> + icsk->icsk_ca_ops = ca;
... but don't we also need to hold a module ref in this case as done
below?
Meaning, tcp_ca_find() could return a ca that was previously loaded
to the tcp_cong_list as a module, then resulting in a ref count imbalance
when set from BPF?
> else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
> - ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
> + ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
Nit: slipped in?
> err = -EPERM;
> else if (!try_module_get(ca->owner))
> err = -EBUSY;
[...]
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH net-next v2 10/15] bpf: Add support for changing congestion control
2017-06-16 13:58 ` Daniel Borkmann
@ 2017-06-18 2:39 ` Lawrence Brakmo
2017-06-19 22:34 ` Daniel Borkmann
0 siblings, 1 reply; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-18 2:39 UTC (permalink / raw)
To: Daniel Borkmann, netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern
On 6/16/17, 6:58 AM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:
On 06/15/2017 10:08 PM, Lawrence Brakmo wrote:
> Added support for changing congestion control for SOCKET_OPS BPF
> programs through the setsockopt bpf helper function. It also adds
> a new SOCKET_OPS op, BPF_SOCKET_OPS_NEEDS_ECN, that is needed for
> congestion controls, like dctcp, that need to enable ECN in the
> SYN packets.
>
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
I like the use-case! ;)
I’m trying to generalize some things I hardcoded in our kernels.
[...]
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 9ff567c..4325aba 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2716,8 +2716,15 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_socket_ops_kern *, bpf_socket,
> }
> } else if (level == SOL_TCP &&
> bpf_socket->sk->sk_prot->setsockopt == tcp_setsockopt) {
> - /* Place holder */
> - ret = -EINVAL;
> + if (optname == TCP_CONGESTION) {
> + ret = tcp_set_congestion_control(sk, optval, false);
> + if (!ret && bpf_socket->op > BPF_SOCKET_OPS_NEEDS_ECN)
> + /* replacing an existing ca */
> + tcp_reinit_congestion_control(sk,
> + inet_csk(sk)->icsk_ca_ops);
Can you elaborate on the semantics of BPF_SOCKET_OPS_NEEDS_ECN?
It's called from tcp_bpf_ca_needs_ecn(), so if someone sets a cong ctl
from that callback, or any other bpf_socket->op > BPF_SOCKET_OPS_NEEDS_ECN
then we call tcp_reinit_congestion_control()? Why 'bpf_socket->op > ..'?
Sorry, this is not very clear. This deals with the issue that TCP_CONGESTION can
be called either before a congestion control has been initialized, in which case
there is no need to reinit it, or after one has been initialized, in which case
reinit must be called. It just happens that ops <= BPF_SOCKET_OPS_NEEDS_ECN are
called before initialization, so there is no need to reinit for them.
Could the needs_ecn implementation detail be hidden altogether from the
BPF prog?
Yes, but it would be somewhat messy. The NEEDS_ECN call happens early in
connection establishment. We could instead ask with a SET_CONG op, where the
BPF program could return NULL (no change) or a name (e.g. dctcp), and the
kernel would then need to determine if ECN is needed for it. But there would
need to be another call later on to actually set it. Overall more overhead.
I would rather add more documentation to the example to make it
clearer.
> + } else {
> + ret = -EINVAL;
> + }
> } else {
> ret = -EINVAL;
> }
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index cc8fd8b..07984ea 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2478,7 +2478,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
> name[val] = 0;
>
> lock_sock(sk);
> - err = tcp_set_congestion_control(sk, name);
> + err = tcp_set_congestion_control(sk, name, true);
> release_sock(sk);
> return err;
> }
> diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
> index 324c9bc..e6f3717 100644
> --- a/net/ipv4/tcp_cong.c
> +++ b/net/ipv4/tcp_cong.c
> @@ -189,8 +189,8 @@ void tcp_init_congestion_control(struct sock *sk)
> INET_ECN_dontxmit(sk);
> }
>
> -static void tcp_reinit_congestion_control(struct sock *sk,
> - const struct tcp_congestion_ops *ca)
> +void tcp_reinit_congestion_control(struct sock *sk,
> + const struct tcp_congestion_ops *ca)
> {
> struct inet_connection_sock *icsk = inet_csk(sk);
>
> @@ -334,7 +334,7 @@ int tcp_set_allowed_congestion_control(char *val)
> }
>
> /* Change congestion control for socket */
> -int tcp_set_congestion_control(struct sock *sk, const char *name)
> +int tcp_set_congestion_control(struct sock *sk, const char *name, bool load)
> {
> struct inet_connection_sock *icsk = inet_csk(sk);
> const struct tcp_congestion_ops *ca;
> @@ -344,7 +344,10 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
> return -EPERM;
>
> rcu_read_lock();
> - ca = __tcp_ca_find_autoload(name);
> + if (!load)
> + ca = tcp_ca_find(name);
> + else
> + ca = __tcp_ca_find_autoload(name);
From BPF program side, we call with !load since we're not allowed
to sleep under RCU, that's correct ...
> /* No change asking for existing value */
> if (ca == icsk->icsk_ca_ops) {
> icsk->icsk_ca_setsockopt = 1;
> @@ -352,8 +355,10 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
> }
> if (!ca)
> err = -ENOENT;
> + else if (!load)
> + icsk->icsk_ca_ops = ca;
... but don't we also need to hold a module ref in this case as done
below?
Meaning, tcp_ca_find() could return a ca that was previously loaded
to the tcp_cong_list as a module, then resulting in a ref count imbalance
when set from BPF?
As I mentioned above, this can be called before congestion control has been
initialized (op <= BPF_SOCKET_OPS_NEEDS_ECN) in which case
tcp_init_congestion_control will be called later. If op > ..OPS_NEEDS_ECN
then bpf_setsockopt() will call the reinit_congestion_control().
But this points to an issue where someone else could call
tcp_set_congestion_control() with load == false not knowing they
need to call either init or reinit. I will add a comment to the function
to make it clear.
> else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
> - ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
> + ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
Nit: slipped in?
Fixed.
> err = -EPERM;
> else if (!try_module_get(ca->owner))
> err = -EBUSY;
[...]
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH net-next v2 10/15] bpf: Add support for changing congestion control
2017-06-18 2:39 ` Lawrence Brakmo
@ 2017-06-19 22:34 ` Daniel Borkmann
2017-06-20 0:35 ` Lawrence Brakmo
0 siblings, 1 reply; 29+ messages in thread
From: Daniel Borkmann @ 2017-06-19 22:34 UTC (permalink / raw)
To: Lawrence Brakmo, netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern
On 06/18/2017 04:39 AM, Lawrence Brakmo wrote:
> On 6/16/17, 6:58 AM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:
[...]
> > /* Change congestion control for socket */
> > -int tcp_set_congestion_control(struct sock *sk, const char *name)
> > +int tcp_set_congestion_control(struct sock *sk, const char *name, bool load)
> > {
> > struct inet_connection_sock *icsk = inet_csk(sk);
> > const struct tcp_congestion_ops *ca;
> > @@ -344,7 +344,10 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
> > return -EPERM;
> >
> > rcu_read_lock();
> > - ca = __tcp_ca_find_autoload(name);
> > + if (!load)
> > + ca = tcp_ca_find(name);
> > + else
> > + ca = __tcp_ca_find_autoload(name);
>
> From BPF program side, we call with !load since we're not allowed
> to sleep under RCU, that's correct ...
>
> > /* No change asking for existing value */
> > if (ca == icsk->icsk_ca_ops) {
> > icsk->icsk_ca_setsockopt = 1;
> > @@ -352,8 +355,10 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
> > }
> > if (!ca)
> > err = -ENOENT;
> > + else if (!load)
> > + icsk->icsk_ca_ops = ca;
>
> ... but don't we also need to hold a module ref in this case as done
> below?
>
> Meaning, tcp_ca_find() could return a ca that was previously loaded
> to the tcp_cong_list as a module, then resulting in a ref count imbalance
> when set from BPF?
>
> As I mentioned above, this can be called before congestion control has been
> initialized (op <= BPF_SOCKET_OPS_NEEDS_ECN) in which case
> tcp_init_congestion_control will be called later. If op > ..OPS_NEEDS_ECN
> then bpf_setsockopt() will call the reinit_congestion_control().
>
> But this points to an issue where someone else could call
> tcp_set_congestion_control() with load == false not knowing they
> need to call either init or reinit. I will add a comment to the function
> to make it clear.
Hm, I'm not sure it answers my question. What I meant was that from BPF
prog, you're setting tcp_set_congestion_control(..., false) so if
tcp_ca_find() returns a ca that was loaded earlier from a module
(so it becomes available in tcp_cong_list), the above...
[...]
else if (!load)
icsk->icsk_ca_ops = ca;
[...]
... will basically prevent the later try_module_get() on the ca. So any
later tcp_reinit_congestion_control() or tcp_init_congestion_control()
will still run not having the refcount held on the owner module. Meaning
a module unload would let the machine crash due to the refcnt imbalance?
What am I missing?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH net-next v2 10/15] bpf: Add support for changing congestion control
2017-06-19 22:34 ` Daniel Borkmann
@ 2017-06-20 0:35 ` Lawrence Brakmo
0 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-20 0:35 UTC (permalink / raw)
To: Daniel Borkmann, netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern
On 6/19/17, 3:34 PM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:
On 06/18/2017 04:39 AM, Lawrence Brakmo wrote:
> On 6/16/17, 6:58 AM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:
[...]
> > /* Change congestion control for socket */
> > -int tcp_set_congestion_control(struct sock *sk, const char *name)
> > +int tcp_set_congestion_control(struct sock *sk, const char *name, bool load)
> > {
> > struct inet_connection_sock *icsk = inet_csk(sk);
> > const struct tcp_congestion_ops *ca;
> > @@ -344,7 +344,10 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
> > return -EPERM;
> >
> > rcu_read_lock();
> > - ca = __tcp_ca_find_autoload(name);
> > + if (!load)
> > + ca = tcp_ca_find(name);
> > + else
> > + ca = __tcp_ca_find_autoload(name);
>
> From BPF program side, we call with !load since we're not allowed
> to sleep under RCU, that's correct ...
>
> > /* No change asking for existing value */
> > if (ca == icsk->icsk_ca_ops) {
> > icsk->icsk_ca_setsockopt = 1;
> > @@ -352,8 +355,10 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
> > }
> > if (!ca)
> > err = -ENOENT;
> > + else if (!load)
> > + icsk->icsk_ca_ops = ca;
>
> ... but don't we also need to hold a module ref in this case as done
> below?
>
> Meaning, tcp_ca_find() could return a ca that was previously loaded
> > to the tcp_cong_list as a module, then resulting in a ref count imbalance
> when set from BPF?
>
> As I mentioned above, this can be called before congestion has been
> initialized (op <= BPF_SOCKET_OPS_NEEDS_ECN) in which case
> tcp_init_congestion_control will be called later. If op > ..OPS_NEEDS_ECN
> then bpf_setsockopt() will call the reinit_congestion_control().
>
> But this points to an issue where someone else could call
> tcp_set_congestion_control() with load == false not knowing they
> need to call either init or reinit. I will add a comment to the function
> to make it clear.
Hm, I'm not sure it answers my question. What I meant was that from BPF
prog, you're setting tcp_set_congestion_control(..., false) so if
> tcp_ca_find() returns a ca that was loaded earlier from a module
(so it becomes available in tcp_cong_list), the above...
[...]
else if (!load)
icsk->icsk_ca_ops = ca;
[...]
... will basically prevent the later try_module_get() on the ca. So any
later tcp_reinit_congestion_control() or tcp_init_congestion_control()
will still run not having the refcount held on the owner module. Meaning
a module unload would let the machine crash due to the refcnt imbalance?
What am I missing?
Nothing, you are correct. I was mistakenly thinking that the refcount update
was being done in tcp_init_congestion_control. Done.
^ permalink raw reply [flat|nested] 29+ messages in thread
* [RFC PATCH net-next v2 11/15] bpf: Sample BPF program to set congestion control
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
` (9 preceding siblings ...)
2017-06-15 20:08 ` [RFC PATCH net-next v2 10/15] bpf: Add support for changing congestion control Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 12/15] bpf: Adds support for setting initial cwnd Lawrence Brakmo
` (3 subsequent siblings)
14 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
Sample BPF program that sets the congestion control to dctcp when both hosts
are within the same datacenter. In this example, that is assumed to be the
case when the first 5.5 bytes (44 bits) of their IPv6 addresses are the same,
which is what the 0xfff00000 mask on the second address word checks.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
samples/bpf/Makefile | 1 +
samples/bpf/tcp_cong_kern.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 69 insertions(+)
create mode 100644 samples/bpf/tcp_cong_kern.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 942c7c7..eb324e0 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -116,6 +116,7 @@ always += cookie_uid_helper_example.o
always += tcp_synrto_kern.o
always += tcp_rwnd_kern.o
always += tcp_bufs_kern.o
+always += tcp_cong_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_cong_kern.c b/samples/bpf/tcp_cong_kern.c
new file mode 100644
index 0000000..24a3bc4
--- /dev/null
+++ b/samples/bpf/tcp_cong_kern.c
@@ -0,0 +1,68 @@
+/*
+ * BPF program to set congestion control to dctcp when both hosts are
+ * in the same datacenter (as determined by IPv6 prefix).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/tcp.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_cong(struct bpf_socket_ops *skops)
+{
+ char fmt1[] = "BPF command: %d\n";
+ char fmt2[] = " Returning %d\n";
+ char cong[] = "dctcp";
+ int rv = 0;
+ int op;
+
+ /* For testing purposes, only execute the rest of the BPF program
+ * if neither port number is 55601
+ */
+ if (skops->remote_port != 55601 && skops->local_port != 55601)
+ return -1;
+
+ op = (int) skops->op;
+
+#ifdef DEBUG
+ bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+ /* Check if both hosts are in the same datacenter. For this
+ * example they are if the first 5.5 bytes of their IPv6 addresses
+ * are the same.
+ */
+ if (skops->family == AF_INET6 &&
+ skops->local_ip6[0] == skops->remote_ip6[0] &&
+ (skops->local_ip6[1] & 0xfff00000) ==
+ (skops->remote_ip6[1] & 0xfff00000)) {
+ switch (op) {
+ case BPF_SOCKET_OPS_NEEDS_ECN:
+ rv = 1;
+ break;
+ case BPF_SOCKET_OPS_ACTIVE_ESTABLISHED_CB:
+ rv = bpf_setsockopt(skops, SOL_TCP, TCP_CONGESTION,
+ cong, sizeof(cong));
+ break;
+ case BPF_SOCKET_OPS_PASSIVE_ESTABLISHED_CB:
+ rv = bpf_setsockopt(skops, SOL_TCP, TCP_CONGESTION,
+ cong, sizeof(cong));
+ break;
+ default:
+ rv = -1;
+ }
+ } else {
+ rv = -1;
+ }
+#ifdef DEBUG
+ bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+ return rv;
+}
+char _license[] SEC("license") = "GPL";
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH net-next v2 12/15] bpf: Adds support for setting initial cwnd
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
` (10 preceding siblings ...)
2017-06-15 20:08 ` [RFC PATCH net-next v2 11/15] bpf: Sample BPF program to set " Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 13/15] bpf: Sample BPF program to set " Lawrence Brakmo
` (2 subsequent siblings)
14 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_IW, which sets the
initial congestion window. This can be used when the hosts are far
apart (large RTTs) and it is safe to start with a large initial cwnd.
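Usage from a sockops program is a one-liner once the connection is
established; a sketch (the full sample is the next patch):

        int iw = 40;
        int rv;

        rv = bpf_setsockopt(skops, SOL_TCP, TCP_BPF_IW, &iw, sizeof(iw));

Note that the kernel only accepts the new value while it is safe to
apply it: the value must be positive and no data segments may have been
sent yet (tp->data_segs_out == 0).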
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
include/uapi/linux/bpf.h | 2 ++
net/core/filter.c | 14 +++++++++++++-
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8d1d2b7..ecb28e2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -781,4 +781,6 @@ enum {
*/
};
+#define TCP_BPF_IW 1001 /* Set TCP initial congestion window */
+
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index 4325aba..a1d9214 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2723,7 +2723,19 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_socket_ops_kern *, bpf_socket,
tcp_reinit_congestion_control(sk,
inet_csk(sk)->icsk_ca_ops);
} else {
- ret = -EINVAL;
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ val = *((int *)optval);
+ switch (optname) {
+ case TCP_BPF_IW:
+ if (val <= 0 || tp->data_segs_out > 0)
+ ret = -EINVAL;
+ else
+ tp->snd_cwnd = val;
+ break;
+ default:
+ ret = -EINVAL;
+ }
}
} else {
ret = -EINVAL;
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH net-next v2 13/15] bpf: Sample BPF program to set initial cwnd
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
` (11 preceding siblings ...)
2017-06-15 20:08 ` [RFC PATCH net-next v2 12/15] bpf: Adds support for setting initial cwnd Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 14/15] bpf: Adds support for setting sndcwnd clamp Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 15/15] bpf: Sample bpf program to set " Lawrence Brakmo
14 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
Sample BPF program that assumes hosts are far away (i.e. large RTTs)
and sets the initial cwnd and initial receive window to 40 packets,
and the send and receive buffers to 1.5MB.
In practice there would be a test to ensure the hosts are actually
far enough away.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
samples/bpf/Makefile | 1 +
samples/bpf/tcp_iw_kern.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 74 insertions(+)
create mode 100644 samples/bpf/tcp_iw_kern.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index eb324e0..3ec96a0 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -117,6 +117,7 @@ always += tcp_synrto_kern.o
always += tcp_rwnd_kern.o
always += tcp_bufs_kern.o
always += tcp_cong_kern.o
+always += tcp_iw_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_iw_kern.c b/samples/bpf/tcp_iw_kern.c
new file mode 100644
index 0000000..07962c8
--- /dev/null
+++ b/samples/bpf/tcp_iw_kern.c
@@ -0,0 +1,73 @@
+/*
+ * BPF program to set initial congestion window and initial receive
+ * window to 40 packets and send and receive buffers to 1.5MB. This
+ * would usually be done after doing appropriate checks that indicate
+ * the hosts are far enough away (i.e. large RTT).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_iw(struct bpf_socket_ops *skops)
+{
+ char fmt1[] = "BPF command: %d\n";
+ char fmt2[] = " Returning %d\n";
+ int bufsize = 1500000;
+ int rwnd_init = 40;
+ int iw = 40;
+ int rv = 0;
+ int op;
+
+ /* For testing purposes, only execute the rest of the BPF program
+ * if neither port number is 55601
+ */
+ if (skops->remote_port != 55601 && skops->local_port != 55601)
+ return -1;
+
+ op = (int) skops->op;
+
+#ifdef DEBUG
+ bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+ /* Usually there would be a check to ensure the hosts are far
+ * from each other so it makes sense to increase buffer sizes
+ */
+ switch (op) {
+ case BPF_SOCKET_OPS_RWND_INIT:
+ rv = rwnd_init;
+ break;
+ case BPF_SOCKET_OPS_TCP_CONNECT_CB:
+ /* Set sndbuf and rcvbuf of active connections */
+ rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF, &bufsize,
+ sizeof(bufsize));
+ rv = -rv*100 + bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
+ &bufsize, sizeof(bufsize));
+ break;
+ case BPF_SOCKET_OPS_ACTIVE_ESTABLISHED_CB:
+ rv = bpf_setsockopt(skops, SOL_TCP, TCP_BPF_IW, &iw,
+ sizeof(iw));
+ break;
+ case BPF_SOCKET_OPS_PASSIVE_ESTABLISHED_CB:
+ /* Set sndbuf and rcvbuf of passive connections */
+ rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF, &bufsize,
+ sizeof(bufsize));
+ rv = -rv*100 + bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
+ &bufsize, sizeof(bufsize));
+ break;
+ default:
+ rv = -1;
+ }
+#ifdef DEBUG
+ bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+ return rv;
+}
+char _license[] SEC("license") = "GPL";
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH net-next v2 14/15] bpf: Adds support for setting sndcwnd clamp
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
` (12 preceding siblings ...)
2017-06-15 20:08 ` [RFC PATCH net-next v2 13/15] bpf: Sample BPF program to set " Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
2017-06-15 20:08 ` [RFC PATCH net-next v2 15/15] bpf: Sample bpf program to set " Lawrence Brakmo
14 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_SNDCWND_CLAMP, which
sets the maximum send congestion window (and the slow start threshold).
It is useful to limit the sndcwnd when the hosts are close to each other
(small RTT).
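A sketch of the intended use from a sockops program (the full sample is
the next patch):

        int clamp = 100;
        int rv;

        rv = bpf_setsockopt(skops, SOL_TCP, TCP_BPF_SNDCWND_CLAMP,
                            &clamp, sizeof(clamp));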
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
include/uapi/linux/bpf.h | 1 +
net/core/filter.c | 8 ++++++++
2 files changed, 9 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ecb28e2..4d033eb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -782,5 +782,6 @@ enum {
};
#define TCP_BPF_IW 1001 /* Set TCP initial congestion window */
+#define TCP_BPF_SNDCWND_CLAMP 1002 /* Set sndcwnd_clamp */
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index a1d9214..f484fef 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2733,6 +2733,14 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_socket_ops_kern *, bpf_socket,
else
tp->snd_cwnd = val;
break;
+ case TCP_BPF_SNDCWND_CLAMP:
+ if (val <= 0) {
+ ret = -EINVAL;
+ } else {
+ tp->snd_cwnd_clamp = val;
+ tp->snd_ssthresh = val;
+ }
+ break;
default:
ret = -EINVAL;
}
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH net-next v2 15/15] bpf: Sample bpf program to set sndcwnd clamp
2017-06-15 20:08 [RFC PATCH net-next v2 00/15] bpf: BPF support for socket ops Lawrence Brakmo
` (13 preceding siblings ...)
2017-06-15 20:08 ` [RFC PATCH net-next v2 14/15] bpf: Adds support for setting sndcwnd clamp Lawrence Brakmo
@ 2017-06-15 20:08 ` Lawrence Brakmo
14 siblings, 0 replies; 29+ messages in thread
From: Lawrence Brakmo @ 2017-06-15 20:08 UTC (permalink / raw)
To: netdev
Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
David Ahern
Sample BPF program, tcp_clamp_kern.c, to demonstrate the use
of setting the sndcwnd clamp. This program assumes that if the
first 5.5 bytes of the hosts' IPv6 addresses are the same, then
the hosts are in the same datacenter and sets the sndcwnd clamp to
100 packets, SYN and SYN-ACK RTOs to 10ms and send/receive buffer
sizes to 150KB.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
samples/bpf/Makefile | 1 +
samples/bpf/tcp_clamp_kern.c | 88 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 89 insertions(+)
create mode 100644 samples/bpf/tcp_clamp_kern.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 3ec96a0..59975c3 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -118,6 +118,7 @@ always += tcp_rwnd_kern.o
always += tcp_bufs_kern.o
always += tcp_cong_kern.o
always += tcp_iw_kern.o
+always += tcp_clamp_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_clamp_kern.c b/samples/bpf/tcp_clamp_kern.c
new file mode 100644
index 0000000..e96eadd
--- /dev/null
+++ b/samples/bpf/tcp_clamp_kern.c
@@ -0,0 +1,88 @@
+/*
+ * Sample BPF program to set send and receive buffers to 150KB, sndcwnd clamp
+ * to 100 packets and SYN and SYN_ACK RTOs to 10ms when both hosts are within
+ * the same datacenter. For this example, we assume they are within the same
+ * datacenter when the first 5.5 bytes of their IPv6 addresses are the same.
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_clamp(struct bpf_socket_ops *skops)
+{
+ char fmt1[] = "BPF command: %d\n";
+ char fmt2[] = " Returning %d\n";
+ int bufsize = 150000;
+ int to_init = 10;
+ int clamp = 100;
+ int rv = 0;
+ int op;
+
+ /* For testing purposes, only execute the rest of the BPF program
+ * if neither port number is 55601
+ */
+ if (skops->remote_port != 55601 && skops->local_port != 55601)
+ return -1;
+
+ op = (int) skops->op;
+
+#ifdef DEBUG
+ bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+ /* Check that both hosts are within same datacenter. For this example
+ * it is the case when the first 5.5 bytes of their IPv6 addresses are
+ * the same.
+ */
+ if (skops->family == AF_INET6 &&
+ skops->local_ip6[0] == skops->remote_ip6[0] &&
+ (skops->local_ip6[1] & 0xfff00000) ==
+ (skops->remote_ip6[1] & 0xfff00000)) {
+ switch (op) {
+ case BPF_SOCKET_OPS_TIMEOUT_INIT:
+ rv = to_init;
+ break;
+ case BPF_SOCKET_OPS_TCP_CONNECT_CB:
+ /* Set sndbuf and rcvbuf of active connections */
+ rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF,
+ &bufsize, sizeof(bufsize));
+ rv = -rv*100 + bpf_setsockopt(skops, SOL_SOCKET,
+ SO_RCVBUF, &bufsize,
+ sizeof(bufsize));
+ break;
+ case BPF_SOCKET_OPS_ACTIVE_ESTABLISHED_CB:
+ rv = bpf_setsockopt(skops, SOL_TCP,
+ TCP_BPF_SNDCWND_CLAMP,
+ &clamp, sizeof(clamp));
+ break;
+ case BPF_SOCKET_OPS_PASSIVE_ESTABLISHED_CB:
+ /* Set sndbuf and rcvbuf of passive connections */
+ rv = bpf_setsockopt(skops, SOL_TCP,
+ TCP_BPF_SNDCWND_CLAMP,
+ &clamp, sizeof(clamp));
+ rv = -rv*100 + bpf_setsockopt(skops, SOL_SOCKET,
+ SO_SNDBUF, &bufsize,
+ sizeof(bufsize));
+ rv = -rv*200 + bpf_setsockopt(skops, SOL_SOCKET,
+ SO_RCVBUF, &bufsize,
+ sizeof(bufsize));
+ break;
+ default:
+ rv = -1;
+ }
+ } else {
+ rv = -1;
+ }
+#ifdef DEBUG
+ bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+ return rv;
+}
+char _license[] SEC("license") = "GPL";
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread