* [bpf-next PATCH 0/7] Add BPF_PROG_TYPE_SK_MSG and attach pt
@ 2018-01-12 18:09 John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 1/7] net: add a UID to use for ULP socket assignment John Fastabend
` (6 more replies)
0 siblings, 7 replies; 18+ messages in thread
From: John Fastabend @ 2018-01-12 18:09 UTC (permalink / raw)
To: borkmann, john.fastabend, ast; +Cc: netdev, kafai
This series includes four setup patches (1-4) to allow the ULP layer and
sockmap refcounting to support another ULP type. There is one small
change (5 lines!) on the TCP side in patch 4. This adds a flag so
that the ULP layer can inform the TCP stack not to mark the frags
as shared. When the ULP layer "owns" the frags and 'gives' them
to the TCP stack we know there can be no additional updates to the
data from the user side.
Patch 5 is the bulk of the work. It adds a new program type,
BPF_PROG_TYPE_SK_MSG, a sockmap program attach type,
BPF_SK_MSG_VERDICT, and a new ULP layer (TCP_ULP_BPF) to allow BPF
programs to be run on sendmsg/sendfile system calls and inspect the data.
For now only TCP is supported; once sockmap is extended, other protocols
can be added. See the patch description for the details. After this
patch users can attach BPF policy and monitoring programs to socket
send hooks.
Finally, patches 6 and 7 add tests to test_maps and the verifier
in selftests/bpf so we get some automated coverage. One open
question I have is whether we should move the samples/sockmap program
into selftests/bpf and start running it to get even more coverage
on the automated side. That can be pushed as an independent patch
set.
---
John Fastabend (7):
net: add a UID to use for ULP socket assignment
sock: make static tls function alloc_sg generic sock helper
sockmap: convert refcnt to an atomic refcnt
net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG
bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
bpf: add map tests for BPF_PROG_TYPE_SK_MSG
bpf: add verifier tests for BPF_PROG_TYPE_SK_MSG
include/linux/bpf.h | 1
include/linux/bpf_types.h | 1
include/linux/filter.h | 10
include/linux/socket.h | 1
include/net/sock.h | 4
include/net/tcp.h | 7
include/uapi/linux/bpf.h | 28 +
kernel/bpf/sockmap.c | 504 +++++++++++++++++++-
kernel/bpf/syscall.c | 14 -
kernel/bpf/verifier.c | 5
net/core/filter.c | 106 ++++
net/core/sock.c | 56 ++
net/ipv4/tcp.c | 4
net/ipv4/tcp_ulp.c | 51 ++
net/tls/tls_main.c | 1
net/tls/tls_sw.c | 69 ---
tools/include/uapi/linux/bpf.h | 16 +
tools/testing/selftests/bpf/Makefile | 3
tools/testing/selftests/bpf/bpf_helpers.h | 2
tools/testing/selftests/bpf/sockmap_parse_prog.c | 15 +
tools/testing/selftests/bpf/sockmap_verdict_prog.c | 7
tools/testing/selftests/bpf/test_maps.c | 54 ++
tools/testing/selftests/bpf/test_verifier.c | 54 ++
23 files changed, 913 insertions(+), 100 deletions(-)
* [bpf-next PATCH 1/7] net: add a UID to use for ULP socket assignment
2018-01-12 18:09 [bpf-next PATCH 0/7] Add BPF_PROG_TYPE_SK_MSG and attach pt John Fastabend
@ 2018-01-12 18:10 ` John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 2/7] sock: make static tls function alloc_sg generic sock helper John Fastabend
` (5 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2018-01-12 18:10 UTC (permalink / raw)
To: borkmann, john.fastabend, ast; +Cc: netdev, kafai
Create a UID field and enum that can be used to assign ULPs to
sockets. This saves a set of string comparisons if the ULP id
is known.
For sockmap, which is added in the next patches, a ULP is used
as a TX hook for TCP sockets. In that case the ULP is assigned at
map insert time from the kernel side, so the ULP is already known
and the named lookup is not needed.
Also remove the pr_notice; the user gets an error code back and should
check that rather than rely on logs.
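As a rough sketch of the intended in-kernel use (the real user is the
sockmap code added in patch 5, and TCP_ULP_BPF is only introduced
there), assigning a ULP by id rather than by name looks like:

	/* Sketch only: assign a known ULP without the string lookup.
	 * Returns -EEXIST if a ULP is already attached and -ENOENT if
	 * the id is not registered.
	 */
	err = tcp_set_ulp_id(sk, TCP_ULP_BPF);
	if (err)
		return err;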
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
include/net/tcp.h | 5 +++++
net/ipv4/tcp_ulp.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++-----
net/tls/tls_main.c | 1 +
3 files changed, 52 insertions(+), 5 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6939e69..a99ceb8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1982,6 +1982,10 @@ static inline void tcp_listendrop(const struct sock *sk)
#define TCP_ULP_MAX 128
#define TCP_ULP_BUF_MAX (TCP_ULP_NAME_MAX*TCP_ULP_MAX)
+enum {
+ TCP_ULP_TLS,
+};
+
struct tcp_ulp_ops {
struct list_head list;
@@ -1990,6 +1994,7 @@ struct tcp_ulp_ops {
/* cleanup ulp */
void (*release)(struct sock *sk);
+ int uid;
char name[TCP_ULP_NAME_MAX];
struct module *owner;
};
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 6bb9e14..58cadc5 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -29,6 +29,18 @@ static struct tcp_ulp_ops *tcp_ulp_find(const char *name)
return NULL;
}
+static struct tcp_ulp_ops *tcp_ulp_find_id(const int ulp)
+{
+ struct tcp_ulp_ops *e;
+
+ list_for_each_entry_rcu(e, &tcp_ulp_list, list) {
+ if (e->uid == ulp)
+ return e;
+ }
+
+ return NULL;
+}
+
static const struct tcp_ulp_ops *__tcp_ulp_find_autoload(const char *name)
{
const struct tcp_ulp_ops *ulp = NULL;
@@ -51,6 +63,19 @@ static const struct tcp_ulp_ops *__tcp_ulp_find_autoload(const char *name)
return ulp;
}
+static const struct tcp_ulp_ops *__tcp_ulp_lookup(const int uid)
+{
+ const struct tcp_ulp_ops *ulp = NULL;
+
+ rcu_read_lock();
+ ulp = tcp_ulp_find_id(uid);
+ if (!ulp || !try_module_get(ulp->owner))
+ ulp = NULL;
+
+ rcu_read_unlock();
+ return ulp;
+}
+
/* Attach new upper layer protocol to the list
* of available protocols.
*/
@@ -59,13 +84,10 @@ int tcp_register_ulp(struct tcp_ulp_ops *ulp)
int ret = 0;
spin_lock(&tcp_ulp_list_lock);
- if (tcp_ulp_find(ulp->name)) {
- pr_notice("%s already registered or non-unique name\n",
- ulp->name);
+ if (tcp_ulp_find(ulp->name))
ret = -EEXIST;
- } else {
+ else
list_add_tail_rcu(&ulp->list, &tcp_ulp_list);
- }
spin_unlock(&tcp_ulp_list_lock);
return ret;
@@ -133,3 +155,22 @@ int tcp_set_ulp(struct sock *sk, const char *name)
icsk->icsk_ulp_ops = ulp_ops;
return 0;
}
+
+int tcp_set_ulp_id(struct sock *sk, int ulp)
+{
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ const struct tcp_ulp_ops *ulp_ops;
+ int err;
+
+ if (icsk->icsk_ulp_ops)
+ return -EEXIST;
+
+ ulp_ops = __tcp_ulp_lookup(ulp);
+ if (!ulp_ops)
+ return -ENOENT;
+
+ err = ulp_ops->init(sk);
+ if (!err)
+ icsk->icsk_ulp_ops = ulp_ops;
+ return err;
+}
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index e07ee3a..1563a9e 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -473,6 +473,7 @@ static int tls_init(struct sock *sk)
static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = {
.name = "tls",
+ .uid = TCP_ULP_TLS,
.owner = THIS_MODULE,
.init = tls_init,
};
* [bpf-next PATCH 2/7] sock: make static tls function alloc_sg generic sock helper
2018-01-12 18:09 [bpf-next PATCH 0/7] Add BPF_PROG_TYPE_SK_MSG and attach pt John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 1/7] net: add a UID to use for ULP socket assignment John Fastabend
@ 2018-01-12 18:10 ` John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 3/7] sockmap: convert refcnt to an atomic refcnt John Fastabend
` (4 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2018-01-12 18:10 UTC (permalink / raw)
To: borkmann, john.fastabend, ast; +Cc: netdev, kafai
The TLS ULP module builds scatterlists from a sock using
sk_page_frag_refill(). This is going to be useful for other ULPs
so move it into net/core/sock.c for more general use.
In the process remove the useless goto at the end of the while loop.
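For illustration, a caller of the new helper looks roughly like the
TLS code below; this is only a sketch of the calling convention:

	struct scatterlist sg[MAX_SKB_FRAGS];
	unsigned int sg_size = 0;
	int sg_num = 0, err;

	sg_init_table(sg, MAX_SKB_FRAGS);

	/* Allocate page frags for up to 'bytes' bytes; on return sg_num
	 * and sg_size reflect what was actually set up, which may be
	 * short on -ENOMEM or -ENOSPC.
	 */
	err = sk_alloc_sg(sk, bytes, sg, &sg_num, &sg_size, 0);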
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
include/net/sock.h | 4 +++
net/core/sock.c | 56 ++++++++++++++++++++++++++++++++++++++++++
net/tls/tls_sw.c | 69 +++++-----------------------------------------------
3 files changed, 67 insertions(+), 62 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 66fd395..2808343 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2148,6 +2148,10 @@ static inline struct page_frag *sk_page_frag(struct sock *sk)
bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag);
+int sk_alloc_sg(struct sock *sk, int len, struct scatterlist *sg,
+ int *sg_num_elem, unsigned int *sg_size,
+ int first_coalesce);
+
/*
* Default write policy as shown to user space via poll/select/SIGIO
*/
diff --git a/net/core/sock.c b/net/core/sock.c
index 72d14b2..e84c03f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2237,6 +2237,62 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
}
EXPORT_SYMBOL(sk_page_frag_refill);
+int sk_alloc_sg(struct sock *sk, int len, struct scatterlist *sg,
+ int *sg_num_elem, unsigned int *sg_size,
+ int first_coalesce)
+{
+ struct page_frag *pfrag;
+ unsigned int size = *sg_size;
+ int num_elem = *sg_num_elem, use = 0, rc = 0;
+ struct scatterlist *sge;
+ unsigned int orig_offset;
+
+ len -= size;
+ pfrag = sk_page_frag(sk);
+
+ while (len > 0) {
+ if (!sk_page_frag_refill(sk, pfrag)) {
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ use = min_t(int, len, pfrag->size - pfrag->offset);
+
+ if (!sk_wmem_schedule(sk, use)) {
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ sk_mem_charge(sk, use);
+ size += use;
+ orig_offset = pfrag->offset;
+ pfrag->offset += use;
+
+ sge = sg + num_elem - 1;
+ if (num_elem > first_coalesce && sg_page(sg) == pfrag->page &&
+ sg->offset + sg->length == orig_offset) {
+ sg->length += use;
+ } else {
+ sge++;
+ sg_unmark_end(sge);
+ sg_set_page(sge, pfrag->page, use, orig_offset);
+ get_page(pfrag->page);
+ ++num_elem;
+ if (num_elem == MAX_SKB_FRAGS) {
+ rc = -ENOSPC;
+ break;
+ }
+ }
+
+ len -= use;
+ }
+out:
+ *sg_size = size;
+ *sg_num_elem = num_elem;
+ return rc;
+}
+EXPORT_SYMBOL(sk_alloc_sg);
+
static void __lock_sock(struct sock *sk)
__releases(&sk->sk_lock.slock)
__acquires(&sk->sk_lock.slock)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 73d1921..dabdd1a 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -87,71 +87,16 @@ static void trim_both_sgl(struct sock *sk, int target_size)
target_size);
}
-static int alloc_sg(struct sock *sk, int len, struct scatterlist *sg,
- int *sg_num_elem, unsigned int *sg_size,
- int first_coalesce)
-{
- struct page_frag *pfrag;
- unsigned int size = *sg_size;
- int num_elem = *sg_num_elem, use = 0, rc = 0;
- struct scatterlist *sge;
- unsigned int orig_offset;
-
- len -= size;
- pfrag = sk_page_frag(sk);
-
- while (len > 0) {
- if (!sk_page_frag_refill(sk, pfrag)) {
- rc = -ENOMEM;
- goto out;
- }
-
- use = min_t(int, len, pfrag->size - pfrag->offset);
-
- if (!sk_wmem_schedule(sk, use)) {
- rc = -ENOMEM;
- goto out;
- }
-
- sk_mem_charge(sk, use);
- size += use;
- orig_offset = pfrag->offset;
- pfrag->offset += use;
-
- sge = sg + num_elem - 1;
- if (num_elem > first_coalesce && sg_page(sg) == pfrag->page &&
- sg->offset + sg->length == orig_offset) {
- sg->length += use;
- } else {
- sge++;
- sg_unmark_end(sge);
- sg_set_page(sge, pfrag->page, use, orig_offset);
- get_page(pfrag->page);
- ++num_elem;
- if (num_elem == MAX_SKB_FRAGS) {
- rc = -ENOSPC;
- break;
- }
- }
-
- len -= use;
- }
- goto out;
-
-out:
- *sg_size = size;
- *sg_num_elem = num_elem;
- return rc;
-}
-
static int alloc_encrypted_sg(struct sock *sk, int len)
{
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
int rc = 0;
- rc = alloc_sg(sk, len, ctx->sg_encrypted_data,
- &ctx->sg_encrypted_num_elem, &ctx->sg_encrypted_size, 0);
+ rc = sk_alloc_sg(sk, len,
+ ctx->sg_encrypted_data,
+ &ctx->sg_encrypted_num_elem,
+ &ctx->sg_encrypted_size, 0);
return rc;
}
@@ -162,9 +107,9 @@ static int alloc_plaintext_sg(struct sock *sk, int len)
struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
int rc = 0;
- rc = alloc_sg(sk, len, ctx->sg_plaintext_data,
- &ctx->sg_plaintext_num_elem, &ctx->sg_plaintext_size,
- tls_ctx->pending_open_record_frags);
+ rc = sk_alloc_sg(sk, len, ctx->sg_plaintext_data,
+ &ctx->sg_plaintext_num_elem, &ctx->sg_plaintext_size,
+ tls_ctx->pending_open_record_frags);
return rc;
}
* [bpf-next PATCH 3/7] sockmap: convert refcnt to an atomic refcnt
2018-01-12 18:09 [bpf-next PATCH 0/7] Add BPF_PROG_TYPE_SK_MSG and attach pt John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 1/7] net: add a UID to use for ULP socket assignment John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 2/7] sock: make static tls function alloc_sg generic sock helper John Fastabend
@ 2018-01-12 18:10 ` John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG John Fastabend
` (3 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2018-01-12 18:10 UTC (permalink / raw)
To: borkmann, john.fastabend, ast; +Cc: netdev, kafai
The sockmap refcnt has up until now been protected by the
sk_callback_lock(), so it has not needed any locking of its
own. The counter itself tracks the lifetime of the psock object.
Sockets in a sockmap have a lifetime that is independent of the
map they are part of. This is possible because a single socket may
be in multiple maps. When this happens we can only release the
psock data associated with the socket when the refcnt reaches
zero. There are three paths that decrement the sock reference on
delete: first, the normal sockmap path, where the user deletes the
socket from the map; second, when the map itself is removed and all
sockets in it are removed (the delete path is similar to the first
case); and third, an asynchronous socket event such as the socket
closing. The last case handles removing sockets that are no longer
available.
For completeness, although the inc case does not pose any problems in
this patch series, it only happens when a psock is added to a map.
Next we plan to add another socket prog type to handle policy and
monitoring on the TX path. When we do this we will need to keep a
reference count open across the sendmsg/sendpage call, and taking
the sk_callback_lock() there (on every send) seems less than ideal;
the send path may also sleep in cases where we hit memory pressure.
Instead of dealing with these issues in some clever way, simply make
the reference count a refcount_t type and use proper atomic ops.
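A sketch of the intended TX-path usage (this is how the later sk_msg
patch uses it, shown here only to motivate the atomic refcnt):

	/* Take a psock reference outside the callback lock for the
	 * duration of a send; refcount_inc_not_zero() fails if the
	 * psock is already being torn down.
	 */
	if (!refcount_inc_not_zero(&psock->refcnt))
		return tcp_sendmsg(sk, msg, size);

	/* ... build and send the data ... */

	smap_release_sock(psock, sk);	/* drops the ref, frees at zero */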
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
kernel/bpf/sockmap.c | 21 +++++++++------------
1 file changed, 9 insertions(+), 12 deletions(-)
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 3f662ee..972608f 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -62,8 +62,7 @@ struct smap_psock_map_entry {
struct smap_psock {
struct rcu_head rcu;
- /* refcnt is used inside sk_callback_lock */
- u32 refcnt;
+ refcount_t refcnt;
/* datapath variables */
struct sk_buff_head rxqueue;
@@ -346,14 +345,12 @@ static void smap_destroy_psock(struct rcu_head *rcu)
static void smap_release_sock(struct smap_psock *psock, struct sock *sock)
{
- psock->refcnt--;
- if (psock->refcnt)
- return;
-
- smap_stop_sock(psock, sock);
- clear_bit(SMAP_TX_RUNNING, &psock->state);
- rcu_assign_sk_user_data(sock, NULL);
- call_rcu_sched(&psock->rcu, smap_destroy_psock);
+ if (refcount_dec_and_test(&psock->refcnt)) {
+ smap_stop_sock(psock, sock);
+ clear_bit(SMAP_TX_RUNNING, &psock->state);
+ rcu_assign_sk_user_data(sock, NULL);
+ call_rcu_sched(&psock->rcu, smap_destroy_psock);
+ }
}
static int smap_parse_func_strparser(struct strparser *strp,
@@ -485,7 +482,7 @@ static struct smap_psock *smap_init_psock(struct sock *sock,
INIT_WORK(&psock->tx_work, smap_tx_work);
INIT_WORK(&psock->gc_work, smap_gc_work);
INIT_LIST_HEAD(&psock->maps);
- psock->refcnt = 1;
+ refcount_set(&psock->refcnt, 1);
rcu_assign_sk_user_data(sock, psock);
sock_hold(sock);
@@ -745,7 +742,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
err = -EBUSY;
goto out_progs;
}
- psock->refcnt++;
+ refcount_inc(&psock->refcnt);
} else {
psock = smap_init_psock(sock, stab);
if (IS_ERR(psock)) {
* [bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG
2018-01-12 18:09 [bpf-next PATCH 0/7] Add BPF_PROG_TYPE_SK_MSG and attach pt John Fastabend
` (2 preceding siblings ...)
2018-01-12 18:10 ` [bpf-next PATCH 3/7] sockmap: convert refcnt to an atomic refcnt John Fastabend
@ 2018-01-12 18:10 ` John Fastabend
2018-01-12 20:10 ` Eric Dumazet
2018-01-12 18:11 ` [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data John Fastabend
` (2 subsequent siblings)
6 siblings, 1 reply; 18+ messages in thread
From: John Fastabend @ 2018-01-12 18:10 UTC (permalink / raw)
To: borkmann, john.fastabend, ast; +Cc: netdev, kafai
When calling do_tcp_sendpages() from inside the kernel and we know the
data has no references from the user side, we can omit the
SKBTX_SHARED_FRAG flag. This patch adds an internal flag,
MSG_NO_SHARED_FRAGS, that can be used to omit setting SKBTX_SHARED_FRAG.
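A sketch of an in-kernel caller that owns the frags (the real user is
the BPF ULP added in patch 5):

	/* Frags are kernel-owned, no userspace references, so tell
	 * do_tcp_sendpages() not to mark them shared.
	 */
	flags |= MSG_NO_SHARED_FRAGS;
	ret = do_tcp_sendpages(sk, page, offset, size, flags);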
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
include/linux/socket.h | 1 +
net/ipv4/tcp.c | 4 +++-
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 9286a5a..add9360 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -287,6 +287,7 @@ struct ucred {
#define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
#define MSG_BATCH 0x40000 /* sendmmsg(): more messages coming */
#define MSG_EOF MSG_FIN
+#define MSG_NO_SHARED_FRAGS 0x80000 /* sendpage() internal : page frags are not shared */
#define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */
#define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7ac583a..56c6f49 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -995,7 +995,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
get_page(page);
skb_fill_page_desc(skb, i, page, offset, copy);
}
- skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
+
+ if (!(flags & MSG_NO_SHARED_FRAGS))
+ skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
skb->len += copy;
skb->data_len += copy;
* [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
2018-01-12 18:09 [bpf-next PATCH 0/7] Add BPF_PROG_TYPE_SK_MSG and attach pt John Fastabend
` (3 preceding siblings ...)
2018-01-12 18:10 ` [bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG John Fastabend
@ 2018-01-12 18:11 ` John Fastabend
2018-01-17 2:25 ` Alexei Starovoitov
2018-01-17 22:04 ` Martin KaFai Lau
2018-01-12 18:11 ` [bpf-next PATCH 6/7] bpf: add map tests for BPF_PROG_TYPE_SK_MSG John Fastabend
2018-01-12 18:11 ` [bpf-next PATCH 7/7] bpf: add verifier " John Fastabend
6 siblings, 2 replies; 18+ messages in thread
From: John Fastabend @ 2018-01-12 18:11 UTC (permalink / raw)
To: borkmann, john.fastabend, ast; +Cc: netdev, kafai
This implements a BPF ULP layer to allow policy enforcement and
monitoring at the socket layer. In order to support this a new
program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
the sendmsg/sendpage hook. To attach the policy to sockets a
sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
Similar to previous sockmap usage, when a sock is added to a
sockmap via a map update and the map has a BPF_SK_MSG_VERDICT
program attached, the BPF ULP layer is created on the
socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
every msg in the sendmsg case and page/offset in the sendpage case.
BPF_PROG_TYPE_SK_MSG Semantics/API:
BPF_PROG_TYPE_SK_MSG supports only two return codes, SK_PASS and
SK_DROP. Returning SK_DROP frees the copied data in the sendmsg
case and leaves the data untouched in the sendpage case. Both cases
return -EACCES to the user. Returning SK_PASS will allow the msg to
be sent.
In the sendmsg case data is copied into kernel space buffers before
running the BPF program. In the sendpage case data is never copied.
The implication is that users may change the data after the BPF program
runs in the sendpage case. (A flag will be added shortly to force a
copy when the copy must always be performed.)
The verdict from the BPF_PROG_TYPE_SK_MSG program applies to the entire
msg in the sendmsg() case and to the entire page/offset in the sendpage
case. This avoids ambiguity on how to handle mixed return codes in the
sendmsg case. The readable/writeable data provided to the program
in the sendmsg case may not be the entire message; in fact, for
large sends this is likely the case. The data range that can be
read is part of the sk_msg_md structure. This is because, similar
to the tc cls_bpf case, the data is stored in a scatter-gather list.
Future work will address this shortcoming to allow users to pull
in more data if needed (similar to TC BPF).
The helper msg_redirect_map() can be used to select the socket to
send the data on. It is used similarly to existing redirect use
cases and allows policy to redirect msgs.
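For reference, a minimal SK_MSG program might look roughly like the
following. This is a sketch only: the section name, map name and policy
are made up, and it assumes a bpf_msg_redirect_map() wrapper for the new
helper; the selftests referenced at the end of this description are the
authoritative examples.

	SEC("sk_msg1")
	int bpf_prog_tx(struct sk_msg_md *msg)
	{
		void *data_end = (void *)(long)msg->data_end;
		void *data = (void *)(long)msg->data;
		__u8 *d = data;

		/* Only part of the msg may be directly readable, so
		 * bounds check against data_end before touching bytes.
		 */
		if (data + 1 > data_end)
			return SK_PASS;

		/* Toy policy: drop msgs starting with a zero byte,
		 * otherwise redirect to the sock at key 0 in sock_map_tx
		 * (a hypothetical BPF_MAP_TYPE_SOCKMAP declared elsewhere).
		 */
		if (d[0] == 0)
			return SK_DROP;

		return bpf_msg_redirect_map(msg, &sock_map_tx, 0, 0);
	}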
Pseudo code simple example:
The basic logic to attach a program to a sockmap is as follows,
// load the programs
bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
&obj, &msg_prog);
// lookup the sockmap
bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
// get fd for sockmap
map_fd_msg = bpf_map__fd(bpf_map_msg);
// attach program to sockmap
bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
Adding sockets to the map is done in the normal way,
// Add a socket 'fd' to sockmap at location 'i'
bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);
After the above any socket attached to "my_sock_map", in this case
'fd', will run the BPF msg verdict program (msg_prog) on every
sendmsg and sendpage system call.
For a complete example see BPF selftests bpf/sockmap_tcp_msg_*.c and
test_maps.c
Implementation notes:
It seemed simplest, to me at least, to use a refcnt to ensure the
psock is not lost across the sendmsg copy into the sg list, the BPF
program running on the data in sg_data, and the final pass to the TCP
stack. Some performance testing may show a better method to do this and
avoid the refcnt cost, but for now use the simpler method.
Another item that will come after basic support is in place is
supporting the MSG_MORE flag. At the moment we call sendpages even if
the MSG_MORE flag is set. An enhancement would be to collect the
pages into a larger scatterlist and pass it down the stack. Notice that
bpf_tcp_sendmsg() could support this with some additional state saved
across sendmsg calls. I built the code to support this without having
to do refactoring work. Other flags TBD include the ZEROCOPY flag.
Yet another detail that needs some thought is the size of the
scatterlist. Currently, we use MAX_SKB_FRAGS simply because it was
already being used in the TLS case. Future work to improve the kernel
sk APIs to tune this depending on workload may be useful. This is a
trade-off between memory usage and B/s performance.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
include/linux/bpf.h | 1
include/linux/bpf_types.h | 1
include/linux/filter.h | 10 +
include/net/tcp.h | 2
include/uapi/linux/bpf.h | 28 +++
kernel/bpf/sockmap.c | 485 ++++++++++++++++++++++++++++++++++++++++++++-
kernel/bpf/syscall.c | 14 +
kernel/bpf/verifier.c | 5
net/core/filter.c | 106 ++++++++++
9 files changed, 638 insertions(+), 14 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9e03046..14cdb4d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -21,6 +21,7 @@
struct perf_event;
struct bpf_prog;
struct bpf_map;
+struct sock;
/* map is generic key/value storage optionally accesible by eBPF programs */
struct bpf_map_ops {
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 19b8349..5e2e8a4 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -13,6 +13,7 @@
BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
#endif
#ifdef CONFIG_BPF_EVENTS
BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 425056c..f1e9833 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -507,6 +507,15 @@ struct xdp_buff {
struct xdp_rxq_info *rxq;
};
+struct sk_msg_buff {
+ void *data;
+ void *data_end;
+ struct scatterlist sg_data[MAX_SKB_FRAGS];
+ __u32 key;
+ __u32 flags;
+ struct bpf_map *map;
+};
+
/* Compute the linear packet data range [data, data_end) which
* will be accessed by various program types (cls_bpf, act_bpf,
* lwt, ...). Subsystems allowing direct data access must (!)
@@ -769,6 +778,7 @@ int xdp_do_redirect(struct net_device *dev,
void bpf_warn_invalid_xdp_action(u32 act);
struct sock *do_sk_redirect_map(struct sk_buff *skb);
+struct sock *do_msg_redirect_map(struct sk_msg_buff *md);
#ifdef CONFIG_BPF_JIT
extern int bpf_jit_enable;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a99ceb8..7f56c3c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1984,6 +1984,7 @@ static inline void tcp_listendrop(const struct sock *sk)
enum {
TCP_ULP_TLS,
+ TCP_ULP_BPF,
};
struct tcp_ulp_ops {
@@ -2001,6 +2002,7 @@ struct tcp_ulp_ops {
int tcp_register_ulp(struct tcp_ulp_ops *type);
void tcp_unregister_ulp(struct tcp_ulp_ops *type);
int tcp_set_ulp(struct sock *sk, const char *name);
+int tcp_set_ulp_id(struct sock *sk, const int ulp);
void tcp_get_available_ulp(char *buf, size_t len);
void tcp_cleanup_ulp(struct sock *sk);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 405317f..bf649ae 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -133,6 +133,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SOCK_OPS,
BPF_PROG_TYPE_SK_SKB,
BPF_PROG_TYPE_CGROUP_DEVICE,
+ BPF_PROG_TYPE_SK_MSG,
};
enum bpf_attach_type {
@@ -143,6 +144,7 @@ enum bpf_attach_type {
BPF_SK_SKB_STREAM_PARSER,
BPF_SK_SKB_STREAM_VERDICT,
BPF_CGROUP_DEVICE,
+ BPF_SK_MSG_VERDICT,
__MAX_BPF_ATTACH_TYPE
};
@@ -687,6 +689,15 @@ enum bpf_attach_type {
* int bpf_override_return(pt_regs, rc)
* @pt_regs: pointer to struct pt_regs
* @rc: the return value to set
+ *
+ * int bpf_msg_redirect_map(map, key, flags)
+ * Redirect msg to a sock in map using key as a lookup key for the
+ * sock in map.
+ * @map: pointer to sockmap
+ * @key: key to lookup sock in map
+ * @flags: reserved for future use
+ * Return: SK_PASS
+ *
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -747,7 +758,8 @@ enum bpf_attach_type {
FN(perf_event_read_value), \
FN(perf_prog_read_value), \
FN(getsockopt), \
- FN(override_return),
+ FN(override_return), \
+ FN(msg_redirect_map),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -909,6 +921,20 @@ enum sk_action {
SK_PASS,
};
+/* User return codes for SK_MSG prog type. */
+enum sk_msg_action {
+ SK_MSG_DROP = 0,
+ SK_MSG_PASS,
+};
+
+/* user accessible metadata for SK_MSG packet hook, new fields must
+ * be added to the end of this structure
+ */
+struct sk_msg_md {
+ __u32 data;
+ __u32 data_end;
+};
+
#define BPF_TAG_SIZE 8
struct bpf_prog_info {
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 972608f..5793f3a 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -38,6 +38,7 @@
#include <linux/skbuff.h>
#include <linux/workqueue.h>
#include <linux/list.h>
+#include <linux/mm.h>
#include <net/strparser.h>
#include <net/tcp.h>
@@ -47,6 +48,7 @@
struct bpf_stab {
struct bpf_map map;
struct sock **sock_map;
+ struct bpf_prog *bpf_tx_msg;
struct bpf_prog *bpf_parse;
struct bpf_prog *bpf_verdict;
};
@@ -74,6 +76,7 @@ struct smap_psock {
struct sk_buff *save_skb;
struct strparser strp;
+ struct bpf_prog *bpf_tx_msg;
struct bpf_prog *bpf_parse;
struct bpf_prog *bpf_verdict;
struct list_head maps;
@@ -90,6 +93,8 @@ struct smap_psock {
void (*save_state_change)(struct sock *sk);
};
+static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
+
static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
{
return rcu_dereference_sk_user_data(sk);
@@ -99,8 +104,439 @@ enum __sk_action {
__SK_DROP = 0,
__SK_PASS,
__SK_REDIRECT,
+ __SK_NONE,
};
+static int memcopy_from_iter(struct sock *sk, struct scatterlist *sg,
+ int sg_num, struct iov_iter *from, int bytes)
+{
+ int i, rc = 0;
+
+ for (i = 0; i < sg_num; ++i) {
+ int copy = sg[i].length;
+ char *to = sg_virt(&sg[i]);
+
+ if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY)
+ rc = copy_from_iter_nocache(to, copy, from);
+ else
+ rc = copy_from_iter(to, copy, from);
+
+ if (rc != copy) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ bytes -= copy;
+ if (!bytes)
+ break;
+ }
+out:
+ return rc;
+}
+
+static int bpf_tcp_push(struct sock *sk, struct scatterlist *sg,
+ int *sg_end, int flags, bool charge)
+{
+ int sendpage_flags = flags | MSG_SENDPAGE_NOTLAST;
+ int offset, ret = 0;
+ struct page *p;
+ size_t size;
+
+ size = sg->length;
+ offset = sg->offset;
+
+ while (1) {
+ if (sg_is_last(sg))
+ sendpage_flags = flags;
+
+ tcp_rate_check_app_limited(sk);
+ p = sg_page(sg);
+retry:
+ ret = do_tcp_sendpages(sk, p, offset, size, sendpage_flags);
+ if (ret != size) {
+ if (ret > 0) {
+ offset += ret;
+ size -= ret;
+ goto retry;
+ }
+
+ if (charge)
+ sk_mem_uncharge(sk,
+ sg->length - size - sg->offset);
+
+ sg->offset = offset;
+ sg->length = size;
+ return ret;
+ }
+
+ put_page(p);
+ if (charge)
+ sk_mem_uncharge(sk, sg->length);
+ *sg_end += 1;
+ sg = sg_next(sg);
+ if (!sg)
+ break;
+
+ offset = sg->offset;
+ size = sg->length;
+ }
+
+ return 0;
+}
+
+static inline void bpf_compute_data_pointers_sg(struct sk_msg_buff *md)
+{
+ md->data = sg_virt(md->sg_data);
+ md->data_end = md->data + md->sg_data->length;
+}
+
+static void return_mem_sg(struct sock *sk, struct scatterlist *sg, int end)
+{
+ int i;
+
+ for (i = 0; i < end; ++i)
+ sk_mem_uncharge(sk, sg[i].length);
+}
+
+static int free_sg(struct sock *sk, struct scatterlist *sg, int start, int len)
+{
+ int i, free = 0;
+
+ for (i = start; i < len; ++i) {
+ free += sg[i].length;
+ sk_mem_uncharge(sk, sg[i].length);
+ put_page(sg_page(&sg[i]));
+ }
+
+ return free;
+}
+
+static unsigned int smap_do_tx_msg(struct sock *sk,
+ struct smap_psock *psock,
+ struct sk_msg_buff *md)
+{
+ struct bpf_prog *prog;
+ unsigned int rc, _rc;
+
+ preempt_disable();
+ rcu_read_lock();
+
+ /* If the policy was removed mid-send then default to 'accept' */
+ prog = READ_ONCE(psock->bpf_tx_msg);
+ if (unlikely(!prog)) {
+ _rc = SK_PASS;
+ goto verdict;
+ }
+
+ bpf_compute_data_pointers_sg(md);
+ _rc = (*prog->bpf_func)(md, prog->insnsi);
+
+verdict:
+ rcu_read_unlock();
+ preempt_enable();
+
+ /* Moving return codes from UAPI namespace into internal namespace */
+ rc = ((_rc == SK_PASS) ?
+ (md->map ? __SK_REDIRECT : __SK_PASS) :
+ __SK_DROP);
+
+ return rc;
+}
+
+static int bpf_tcp_sendmsg_do_redirect(struct scatterlist *sg, int sg_num,
+ struct sk_msg_buff *md, int flags)
+{
+ int i, sg_curr = 0, err, free;
+ struct smap_psock *psock;
+ struct sock *sk;
+
+ rcu_read_lock();
+ sk = do_msg_redirect_map(md);
+ if (unlikely(!sk))
+ goto out_rcu;
+
+ psock = smap_psock_sk(sk);
+ if (unlikely(!psock))
+ goto out_rcu;
+
+ if (!refcount_inc_not_zero(&psock->refcnt))
+ goto out_rcu;
+
+ rcu_read_unlock();
+ lock_sock(sk);
+ err = bpf_tcp_push(sk, sg, &sg_curr, flags, false);
+ if (unlikely(err))
+ goto out;
+ release_sock(sk);
+ smap_release_sock(psock, sk);
+ return 0;
+out_rcu:
+ rcu_read_unlock();
+out:
+ for (i = sg_curr; i < sg_num; ++i) {
+ free += sg[i].length;
+ put_page(sg_page(&sg[i]));
+ }
+ return free;
+}
+
+static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
+{
+ int err = 0, eval = __SK_NONE, sg_size = 0, sg_num = 0;
+ int flags = msg->msg_flags | MSG_NO_SHARED_FRAGS;
+ struct sk_msg_buff md = {0};
+ struct smap_psock *psock;
+ size_t copy, copied = 0;
+ struct scatterlist *sg;
+ long timeo;
+
+ sg = md.sg_data;
+ sg_init_table(sg, MAX_SKB_FRAGS);
+
+	/* It's possible a sock event or user removed the psock _but_ the ops
+ * have not been reprogrammed yet so we get here. In this case fallback
+ * to tcp_sendmsg. Note this only works because we _only_ ever allow
+ * a single ULP there is no hierarchy here.
+ */
+ rcu_read_lock();
+ psock = smap_psock_sk(sk);
+ if (unlikely(!psock)) {
+ rcu_read_unlock();
+ return tcp_sendmsg(sk, msg, size);
+ }
+
+	/* Increment the psock refcnt to ensure it's not released while sending a
+ * message. Required because sk lookup and bpf programs are used in
+ * separate rcu critical sections. Its OK if we lose the map entry
+ * but we can't lose the sock reference, possible when the refcnt hits
+ * zero and garbage collection calls sock_put().
+ */
+ if (!refcount_inc_not_zero(&psock->refcnt)) {
+ rcu_read_unlock();
+ return tcp_sendmsg(sk, msg, size);
+ }
+
+ rcu_read_unlock();
+
+ lock_sock(sk);
+ timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+
+ while (msg_data_left(msg)) {
+ int sg_curr;
+
+ if (sk->sk_err) {
+ err = sk->sk_err;
+ goto out_err;
+ }
+
+ copy = msg_data_left(msg);
+ if (!sk_stream_memory_free(sk))
+ goto wait_for_sndbuf;
+
+ /* sg_size indicates bytes already allocated and sg_num
+ * is last sg element used. This is used when alloc_sg
+ * partially allocates a scatterlist and then is sent
+ * to wait for memory. In normal case (no memory pressure)
+		 * both sg_num and sg_size are zero.
+ */
+ copy = copy - sg_size;
+ err = sk_alloc_sg(sk, copy, sg, &sg_num, &sg_size, 0);
+ if (err) {
+ if (err != -ENOSPC)
+ goto wait_for_memory;
+ copy = sg_size;
+ }
+
+ err = memcopy_from_iter(sk, sg, sg_num, &msg->msg_iter, copy);
+ if (err < 0) {
+ free_sg(sk, sg, 0, sg_num);
+ goto out_err;
+ }
+
+ copied += copy;
+
+ /* If msg is larger than MAX_SKB_FRAGS we can send multiple
+ * scatterlists per msg. However BPF decisions apply to the
+ * entire msg.
+ */
+ if (eval == __SK_NONE)
+ eval = smap_do_tx_msg(sk, psock, &md);
+
+ switch (eval) {
+ case __SK_PASS:
+ sg_mark_end(sg + sg_num - 1);
+ err = bpf_tcp_push(sk, sg, &sg_curr, flags, true);
+ if (unlikely(err)) {
+ copied -= free_sg(sk, sg, sg_curr, sg_num);
+ goto out_err;
+ }
+ break;
+ case __SK_REDIRECT:
+ sg_mark_end(sg + sg_num - 1);
+ goto do_redir;
+ case __SK_DROP:
+ default:
+ copied -= free_sg(sk, sg, 0, sg_num);
+ goto out_err;
+ }
+
+ sg_num = 0;
+ sg_size = 0;
+ continue;
+wait_for_sndbuf:
+ set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+wait_for_memory:
+ err = sk_stream_wait_memory(sk, &timeo);
+ if (err)
+ goto out_err;
+ }
+out_err:
+ if (err < 0)
+ err = sk_stream_error(sk, msg->msg_flags, err);
+ release_sock(sk);
+ smap_release_sock(psock, sk);
+ return copied ? copied : err;
+
+do_redir:
+ /* To avoid deadlock with multiple socks all doing redirects to
+ * each other we must first drop the current sock lock and release
+ * the psock. Then get the redirect socket (assuming it still
+	 * exists), take its lock, and finally do the send here. If the
+ * redirect fails there is nothing to do, we don't want to blame
+ * the sender for remote socket failures. Instead we simply
+ * continue making forward progress.
+ */
+ return_mem_sg(sk, sg, sg_num);
+ release_sock(sk);
+ smap_release_sock(psock, sk);
+ copied -= bpf_tcp_sendmsg_do_redirect(sg, sg_num, &md, flags);
+ return copied;
+}
+
+static int bpf_tcp_sendpage_do_redirect(struct page *page, int offset,
+ size_t size, int flags,
+ struct sk_msg_buff *md)
+{
+ struct smap_psock *psock;
+ struct sock *sk;
+ int rc;
+
+ rcu_read_lock();
+ sk = do_msg_redirect_map(md);
+ if (unlikely(!sk))
+ goto out_rcu;
+
+ psock = smap_psock_sk(sk);
+ if (unlikely(!psock))
+ goto out_rcu;
+
+ if (!refcount_inc_not_zero(&psock->refcnt))
+ goto out_rcu;
+
+ rcu_read_unlock();
+
+ lock_sock(sk);
+ rc = tcp_sendpage_locked(sk, page, offset, size, flags);
+ release_sock(sk);
+
+ smap_release_sock(psock, sk);
+ return rc;
+out_rcu:
+ rcu_read_unlock();
+ return -EINVAL;
+}
+
+static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
+ int offset, size_t size, int flags)
+{
+ struct smap_psock *psock;
+ int rc, _rc = __SK_PASS;
+ struct bpf_prog *prog;
+ struct sk_msg_buff md;
+
+ preempt_disable();
+ rcu_read_lock();
+ psock = smap_psock_sk(sk);
+ if (unlikely(!psock))
+ goto verdict;
+
+ /* If the policy was removed mid-send then default to 'accept' */
+ prog = READ_ONCE(psock->bpf_tx_msg);
+ if (unlikely(!prog))
+ goto verdict;
+
+ /* Calculate pkt data pointers and run BPF program */
+ md.data = page_address(page) + offset;
+ md.data_end = md.data + size;
+ _rc = (*prog->bpf_func)(&md, prog->insnsi);
+
+verdict:
+ rcu_read_unlock();
+ preempt_enable();
+
+ /* Moving return codes from UAPI namespace into internal namespace */
+ rc = ((_rc == SK_PASS) ? __SK_PASS : __SK_DROP);
+
+ switch (rc) {
+ case __SK_PASS:
+ lock_sock(sk);
+ rc = tcp_sendpage_locked(sk, page, offset, size, flags);
+ release_sock(sk);
+ break;
+ case __SK_REDIRECT:
+ smap_release_sock(psock, sk);
+ rc = bpf_tcp_sendpage_do_redirect(page, offset, size, flags,
+ &md);
+ break;
+ case __SK_DROP:
+ default:
+ rc = -EACCES;
+ }
+
+ return rc;
+}
+
+static int bpf_tcp_msg_add(struct smap_psock *psock,
+ struct sock *sk,
+ struct bpf_prog *tx_msg)
+{
+ struct bpf_prog *orig_tx_msg;
+
+ orig_tx_msg = xchg(&psock->bpf_tx_msg, tx_msg);
+ if (orig_tx_msg)
+ bpf_prog_put(orig_tx_msg);
+
+ return tcp_set_ulp_id(sk, TCP_ULP_BPF);
+}
+
+struct proto tcp_bpf_proto;
+static int bpf_tcp_init(struct sock *sk)
+{
+ sk->sk_prot = &tcp_bpf_proto;
+ return 0;
+}
+
+static void bpf_tcp_release(struct sock *sk)
+{
+ sk->sk_prot = &tcp_prot;
+}
+
+static struct tcp_ulp_ops bpf_tcp_ulp_ops __read_mostly = {
+ .name = "bpf_tcp",
+ .uid = TCP_ULP_BPF,
+ .owner = NULL,
+ .init = bpf_tcp_init,
+ .release = bpf_tcp_release,
+};
+
+static int bpf_tcp_ulp_register(void)
+{
+ tcp_bpf_proto = tcp_prot;
+ tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
+ tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
+ return tcp_register_ulp(&bpf_tcp_ulp_ops);
+}
+
static int smap_verdict_func(struct smap_psock *psock, struct sk_buff *skb)
{
struct bpf_prog *prog = READ_ONCE(psock->bpf_verdict);
@@ -165,8 +601,6 @@ static void smap_report_sk_error(struct smap_psock *psock, int err)
sk->sk_error_report(sk);
}
-static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
-
/* Called with lock_sock(sk) held */
static void smap_state_change(struct sock *sk)
{
@@ -317,6 +751,7 @@ static void smap_write_space(struct sock *sk)
static void smap_stop_sock(struct smap_psock *psock, struct sock *sk)
{
+ tcp_cleanup_ulp(sk);
if (!psock->strp_enabled)
return;
sk->sk_data_ready = psock->save_data_ready;
@@ -384,7 +819,6 @@ static int smap_parse_func_strparser(struct strparser *strp,
return rc;
}
-
static int smap_read_sock_done(struct strparser *strp, int err)
{
return err;
@@ -456,6 +890,8 @@ static void smap_gc_work(struct work_struct *w)
bpf_prog_put(psock->bpf_parse);
if (psock->bpf_verdict)
bpf_prog_put(psock->bpf_verdict);
+ if (psock->bpf_tx_msg)
+ bpf_prog_put(psock->bpf_tx_msg);
list_for_each_entry_safe(e, tmp, &psock->maps, list) {
list_del(&e->list);
@@ -491,8 +927,7 @@ static struct smap_psock *smap_init_psock(struct sock *sock,
static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
{
- struct bpf_stab *stab;
- int err = -EINVAL;
+ struct bpf_stab *stab; int err = -EINVAL;
u64 cost;
if (!capable(CAP_NET_ADMIN))
@@ -506,6 +941,10 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
if (attr->value_size > KMALLOC_MAX_SIZE)
return ERR_PTR(-E2BIG);
+ err = bpf_tcp_ulp_register();
+ if (err && err != -EEXIST)
+ return ERR_PTR(err);
+
stab = kzalloc(sizeof(*stab), GFP_USER);
if (!stab)
return ERR_PTR(-ENOMEM);
@@ -590,6 +1029,8 @@ static void sock_map_free(struct bpf_map *map)
bpf_prog_put(stab->bpf_verdict);
if (stab->bpf_parse)
bpf_prog_put(stab->bpf_parse);
+ if (stab->bpf_tx_msg)
+ bpf_prog_put(stab->bpf_tx_msg);
sock_map_remove_complete(stab);
}
@@ -684,7 +1125,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
{
struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
struct smap_psock_map_entry *e = NULL;
- struct bpf_prog *verdict, *parse;
+ struct bpf_prog *verdict, *parse, *tx_msg;
struct sock *osock, *sock;
struct smap_psock *psock;
u32 i = *(u32 *)key;
@@ -710,6 +1151,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
*/
verdict = READ_ONCE(stab->bpf_verdict);
parse = READ_ONCE(stab->bpf_parse);
+ tx_msg = READ_ONCE(stab->bpf_tx_msg);
if (parse && verdict) {
/* bpf prog refcnt may be zero if a concurrent attach operation
@@ -728,6 +1170,17 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
}
}
+ if (tx_msg) {
+ tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg);
+ if (IS_ERR(tx_msg)) {
+ if (verdict)
+ bpf_prog_put(verdict);
+ if (parse)
+ bpf_prog_put(parse);
+ return PTR_ERR(tx_msg);
+ }
+ }
+
write_lock_bh(&sock->sk_callback_lock);
psock = smap_psock_sk(sock);
@@ -742,7 +1195,14 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
err = -EBUSY;
goto out_progs;
}
- refcount_inc(&psock->refcnt);
+ if (READ_ONCE(psock->bpf_tx_msg) && tx_msg) {
+ err = -EBUSY;
+ goto out_progs;
+ }
+ if (!refcount_inc_not_zero(&psock->refcnt)) {
+ err = -EAGAIN;
+ goto out_progs;
+ }
} else {
psock = smap_init_psock(sock, stab);
if (IS_ERR(psock)) {
@@ -763,6 +1223,12 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
/* 3. At this point we have a reference to a valid psock that is
* running. Attach any BPF programs needed.
*/
+ if (tx_msg) {
+ err = bpf_tcp_msg_add(psock, sock, tx_msg);
+ if (err)
+ goto out_free;
+ }
+
if (parse && verdict && !psock->strp_enabled) {
err = smap_init_sock(psock, sock);
if (err)
@@ -798,6 +1264,8 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
bpf_prog_put(verdict);
if (parse)
bpf_prog_put(parse);
+ if (tx_msg)
+ bpf_prog_put(tx_msg);
write_unlock_bh(&sock->sk_callback_lock);
kfree(e);
return err;
@@ -812,6 +1280,9 @@ int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
return -EINVAL;
switch (type) {
+ case BPF_SK_MSG_VERDICT:
+ orig = xchg(&stab->bpf_tx_msg, prog);
+ break;
case BPF_SK_SKB_STREAM_PARSER:
orig = xchg(&stab->bpf_parse, prog);
break;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ebf0fb2..d32f093 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1267,7 +1267,8 @@ static int bpf_obj_get(const union bpf_attr *attr)
#define BPF_PROG_ATTACH_LAST_FIELD attach_flags
-static int sockmap_get_from_fd(const union bpf_attr *attr, bool attach)
+static int sockmap_get_from_fd(const union bpf_attr *attr,
+ int type, bool attach)
{
struct bpf_prog *prog = NULL;
int ufd = attr->target_fd;
@@ -1281,8 +1282,7 @@ static int sockmap_get_from_fd(const union bpf_attr *attr, bool attach)
return PTR_ERR(map);
if (attach) {
- prog = bpf_prog_get_type(attr->attach_bpf_fd,
- BPF_PROG_TYPE_SK_SKB);
+ prog = bpf_prog_get_type(attr->attach_bpf_fd, type);
if (IS_ERR(prog)) {
fdput(f);
return PTR_ERR(prog);
@@ -1334,9 +1334,11 @@ static int bpf_prog_attach(const union bpf_attr *attr)
case BPF_CGROUP_DEVICE:
ptype = BPF_PROG_TYPE_CGROUP_DEVICE;
break;
+ case BPF_SK_MSG_VERDICT:
+ return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_MSG, true);
case BPF_SK_SKB_STREAM_PARSER:
case BPF_SK_SKB_STREAM_VERDICT:
- return sockmap_get_from_fd(attr, true);
+ return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, true);
default:
return -EINVAL;
}
@@ -1389,9 +1391,11 @@ static int bpf_prog_detach(const union bpf_attr *attr)
case BPF_CGROUP_DEVICE:
ptype = BPF_PROG_TYPE_CGROUP_DEVICE;
break;
+ case BPF_SK_MSG_VERDICT:
+ return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_MSG, false);
case BPF_SK_SKB_STREAM_PARSER:
case BPF_SK_SKB_STREAM_VERDICT:
- return sockmap_get_from_fd(attr, false);
+ return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, false);
default:
return -EINVAL;
}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a2b2112..15c5c2a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1240,6 +1240,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
case BPF_PROG_TYPE_XDP:
case BPF_PROG_TYPE_LWT_XMIT:
case BPF_PROG_TYPE_SK_SKB:
+ case BPF_PROG_TYPE_SK_MSG:
if (meta)
return meta->pkt_access;
@@ -2041,7 +2042,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
case BPF_MAP_TYPE_SOCKMAP:
if (func_id != BPF_FUNC_sk_redirect_map &&
func_id != BPF_FUNC_sock_map_update &&
- func_id != BPF_FUNC_map_delete_elem)
+ func_id != BPF_FUNC_map_delete_elem &&
+ func_id != BPF_FUNC_msg_redirect_map)
goto error;
break;
default:
@@ -2079,6 +2081,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
goto error;
break;
case BPF_FUNC_sk_redirect_map:
+ case BPF_FUNC_msg_redirect_map:
if (map->map_type != BPF_MAP_TYPE_SOCKMAP)
goto error;
break;
diff --git a/net/core/filter.c b/net/core/filter.c
index acdb94c..ca87b8d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1881,6 +1881,44 @@ struct sock *do_sk_redirect_map(struct sk_buff *skb)
.arg4_type = ARG_ANYTHING,
};
+BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg_buff *, msg,
+ struct bpf_map *, map, u32, key, u64, flags)
+{
+ /* If user passes invalid input drop the packet. */
+ if (unlikely(flags))
+ return SK_DROP;
+
+ msg->key = key;
+ msg->flags = flags;
+ msg->map = map;
+
+ return SK_PASS;
+}
+
+struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
+{
+ struct sock *sk = NULL;
+
+ if (msg->map) {
+ sk = __sock_map_lookup_elem(msg->map, msg->key);
+
+ msg->key = 0;
+ msg->map = NULL;
+ }
+
+ return sk;
+}
+
+static const struct bpf_func_proto bpf_msg_redirect_map_proto = {
+ .func = bpf_msg_redirect_map,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_CONST_MAP_PTR,
+ .arg3_type = ARG_ANYTHING,
+ .arg4_type = ARG_ANYTHING,
+};
+
BPF_CALL_1(bpf_get_cgroup_classid, const struct sk_buff *, skb)
{
return task_get_classid(skb);
@@ -3513,6 +3551,16 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
}
}
+static const struct bpf_func_proto *sk_msg_func_proto(enum bpf_func_id func_id)
+{
+ switch (func_id) {
+ case BPF_FUNC_msg_redirect_map:
+ return &bpf_msg_redirect_map_proto;
+ default:
+ return bpf_base_func_proto(func_id);
+ }
+}
+
static const struct bpf_func_proto *sk_skb_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
@@ -3892,6 +3940,32 @@ static bool sk_skb_is_valid_access(int off, int size,
return bpf_skb_is_valid_access(off, size, type, info);
}
+static bool sk_msg_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ struct bpf_insn_access_aux *info)
+{
+ if (type == BPF_WRITE)
+ return false;
+
+ switch (off) {
+ case offsetof(struct sk_msg_md, data):
+ info->reg_type = PTR_TO_PACKET;
+ break;
+ case offsetof(struct sk_msg_md, data_end):
+ info->reg_type = PTR_TO_PACKET_END;
+ break;
+ }
+
+ if (off < 0 || off >= sizeof(struct sk_msg_md))
+ return false;
+ if (off % size != 0)
+ return false;
+ if (size != sizeof(__u32))
+ return false;
+
+ return true;
+}
+
static u32 bpf_convert_ctx_access(enum bpf_access_type type,
const struct bpf_insn *si,
struct bpf_insn *insn_buf,
@@ -4522,6 +4596,29 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
return insn - insn_buf;
}
+static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
+ const struct bpf_insn *si,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog, u32 *target_size)
+{
+ struct bpf_insn *insn = insn_buf;
+
+ switch (si->off) {
+ case offsetof(struct sk_msg_md, data):
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data),
+ si->dst_reg, si->src_reg,
+ offsetof(struct sk_msg_buff, data));
+ break;
+ case offsetof(struct sk_msg_md, data_end):
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data_end),
+ si->dst_reg, si->src_reg,
+ offsetof(struct sk_msg_buff, data_end));
+ break;
+ }
+
+ return insn - insn_buf;
+}
+
const struct bpf_verifier_ops sk_filter_verifier_ops = {
.get_func_proto = sk_filter_func_proto,
.is_valid_access = sk_filter_is_valid_access,
@@ -4611,6 +4708,15 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
const struct bpf_prog_ops sk_skb_prog_ops = {
};
+const struct bpf_verifier_ops sk_msg_verifier_ops = {
+ .get_func_proto = sk_msg_func_proto,
+ .is_valid_access = sk_msg_is_valid_access,
+ .convert_ctx_access = sk_msg_convert_ctx_access,
+};
+
+const struct bpf_prog_ops sk_msg_prog_ops = {
+};
+
int sk_detach_filter(struct sock *sk)
{
int ret = -ENOENT;
* [bpf-next PATCH 6/7] bpf: add map tests for BPF_PROG_TYPE_SK_MSG
2018-01-12 18:09 [bpf-next PATCH 0/7] Add BPF_PROG_TYPE_SK_MSG and attach pt John Fastabend
` (4 preceding siblings ...)
2018-01-12 18:11 ` [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data John Fastabend
@ 2018-01-12 18:11 ` John Fastabend
2018-01-12 18:11 ` [bpf-next PATCH 7/7] bpf: add verifier " John Fastabend
6 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2018-01-12 18:11 UTC (permalink / raw)
To: borkmann, john.fastabend, ast; +Cc: netdev, kafai
Add map tests that attach BPF_PROG_TYPE_SK_MSG programs to a sockmap.
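The test loads a new sockmap_tcp_msg_prog.o object (see the Makefile
change below); its source file is not shown in this mail. A minimal
program of that shape might look like the following sketch (not the
actual file from the series):

	#include <linux/bpf.h>
	#include "bpf_helpers.h"

	SEC("sk_msg1")
	int bpf_prog1(struct sk_msg_md *msg)
	{
		/* Accept every msg; enough to exercise attach/detach
		 * and the sendmsg hook in test_maps.
		 */
		return SK_PASS;
	}

	char _license[] SEC("license") = "GPL";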
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
tools/include/uapi/linux/bpf.h | 16 ++++++
tools/testing/selftests/bpf/Makefile | 3 +
tools/testing/selftests/bpf/bpf_helpers.h | 2 +
tools/testing/selftests/bpf/sockmap_parse_prog.c | 15 +++++-
tools/testing/selftests/bpf/sockmap_verdict_prog.c | 7 +++
tools/testing/selftests/bpf/test_maps.c | 54 +++++++++++++++++++-
6 files changed, 90 insertions(+), 7 deletions(-)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4e8c60a..131d541 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -133,6 +133,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SOCK_OPS,
BPF_PROG_TYPE_SK_SKB,
BPF_PROG_TYPE_CGROUP_DEVICE,
+ BPF_PROG_TYPE_SK_MSG,
};
enum bpf_attach_type {
@@ -143,6 +144,7 @@ enum bpf_attach_type {
BPF_SK_SKB_STREAM_PARSER,
BPF_SK_SKB_STREAM_VERDICT,
BPF_CGROUP_DEVICE,
+ BPF_SK_MSG_VERDICT,
__MAX_BPF_ATTACH_TYPE
};
@@ -906,6 +908,20 @@ enum sk_action {
SK_PASS,
};
+/* User return codes for SK_MSG prog type. */
+enum sk_msg_action {
+ SK_MSG_DROP = 0,
+ SK_MSG_PASS,
+};
+
+/* user accessible metadata for SK_MSG packet hook, new fields must
+ * be added to the end of this structure
+ */
+struct sk_msg_md {
+ __u32 data;
+ __u32 data_end;
+};
+
#define BPF_TAG_SIZE 8
struct bpf_prog_info {
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index a8aa7e2..e399ca3 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -19,7 +19,8 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \
sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
- test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o
+ test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
+ sockmap_tcp_msg_prog.o
TEST_PROGS := test_kmod.sh test_xdp_redirect.sh test_xdp_meta.sh \
test_offload.py
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 33cb00e..997c95e 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -121,6 +121,8 @@ static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
(void *) BPF_FUNC_skb_under_cgroup;
static int (*bpf_skb_change_head)(void *, int len, int flags) =
(void *) BPF_FUNC_skb_change_head;
+static int (*bpf_skb_pull_data)(void *, int len) =
+ (void *) BPF_FUNC_skb_pull_data;
/* Scan the ARCH passed in from ARCH env variable (see Makefile) */
#if defined(__TARGET_ARCH_x86)
diff --git a/tools/testing/selftests/bpf/sockmap_parse_prog.c b/tools/testing/selftests/bpf/sockmap_parse_prog.c
index a1dec2b..0f92858 100644
--- a/tools/testing/selftests/bpf/sockmap_parse_prog.c
+++ b/tools/testing/selftests/bpf/sockmap_parse_prog.c
@@ -20,14 +20,25 @@ int bpf_prog1(struct __sk_buff *skb)
__u32 lport = skb->local_port;
__u32 rport = skb->remote_port;
__u8 *d = data;
+ __u32 len = (__u32) data_end - (__u32) data;
+ int err;
- if (data + 10 > data_end)
- return skb->len;
+ if (data + 10 > data_end) {
+ err = bpf_skb_pull_data(skb, 10);
+ if (err)
+ return SK_DROP;
+
+ data_end = (void *)(long)skb->data_end;
+ data = (void *)(long)skb->data;
+ if (data + 10 > data_end)
+ return SK_DROP;
+ }
/* This write/read is a bit pointless but tests the verifier and
* strparser handler for read/write pkt data and access into sk
* fields.
*/
+ d = data;
d[7] = 1;
return skb->len;
}
diff --git a/tools/testing/selftests/bpf/sockmap_verdict_prog.c b/tools/testing/selftests/bpf/sockmap_verdict_prog.c
index d7bea97..2ce7634 100644
--- a/tools/testing/selftests/bpf/sockmap_verdict_prog.c
+++ b/tools/testing/selftests/bpf/sockmap_verdict_prog.c
@@ -26,6 +26,13 @@ struct bpf_map_def SEC("maps") sock_map_tx = {
.max_entries = 20,
};
+struct bpf_map_def SEC("maps") sock_map_msg = {
+ .type = BPF_MAP_TYPE_SOCKMAP,
+ .key_size = sizeof(int),
+ .value_size = sizeof(int),
+ .max_entries = 20,
+};
+
struct bpf_map_def SEC("maps") sock_map_break = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(int),
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 040356e..312af5f 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -461,14 +461,15 @@ static void test_devmap(int task, void *data)
#include <linux/err.h>
#define SOCKMAP_PARSE_PROG "./sockmap_parse_prog.o"
#define SOCKMAP_VERDICT_PROG "./sockmap_verdict_prog.o"
+#define SOCKMAP_TCP_MSG_PROG "./sockmap_tcp_msg_prog.o"
static void test_sockmap(int tasks, void *data)
{
- int one = 1, map_fd_rx, map_fd_tx, map_fd_break, s, sc, rc;
- struct bpf_map *bpf_map_rx, *bpf_map_tx, *bpf_map_break;
+ int one = 1, map_fd_rx, map_fd_tx, map_fd_msg, map_fd_break, s, sc, rc;
+ struct bpf_map *bpf_map_rx, *bpf_map_tx, *bpf_map_msg, *bpf_map_break;
int ports[] = {50200, 50201, 50202, 50204};
int err, i, fd, udp, sfd[6] = {0xdeadbeef};
u8 buf[20] = {0x0, 0x5, 0x3, 0x2, 0x1, 0x0};
- int parse_prog, verdict_prog;
+ int parse_prog, verdict_prog, msg_prog;
struct sockaddr_in addr;
struct bpf_object *obj;
struct timeval to;
@@ -581,6 +582,12 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
+ err = bpf_prog_attach(-1, fd, BPF_SK_MSG_VERDICT, 0);
+ if (!err) {
+ printf("Failed invalid msg verdict prog attach\n");
+ goto out_sockmap;
+ }
+
err = bpf_prog_attach(-1, fd, __MAX_BPF_ATTACH_TYPE, 0);
if (!err) {
printf("Failed unknown prog attach\n");
@@ -599,6 +606,12 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
+ err = bpf_prog_detach(fd, BPF_SK_MSG_VERDICT);
+ if (err) {
+ printf("Failed empty msg verdict prog detach\n");
+ goto out_sockmap;
+ }
+
err = bpf_prog_detach(fd, __MAX_BPF_ATTACH_TYPE);
if (!err) {
printf("Detach invalid prog successful\n");
@@ -613,6 +626,13 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
+ err = bpf_prog_load(SOCKMAP_TCP_MSG_PROG,
+ BPF_PROG_TYPE_SK_MSG, &obj, &msg_prog);
+ if (err) {
+ printf("Failed to load SK_SKB msg prog\n");
+ goto out_sockmap;
+ }
+
err = bpf_prog_load(SOCKMAP_VERDICT_PROG,
BPF_PROG_TYPE_SK_SKB, &obj, &verdict_prog);
if (err) {
@@ -628,7 +648,7 @@ static void test_sockmap(int tasks, void *data)
map_fd_rx = bpf_map__fd(bpf_map_rx);
if (map_fd_rx < 0) {
- printf("Failed to get map fd\n");
+ printf("Failed to get map rx fd\n");
goto out_sockmap;
}
@@ -644,6 +664,18 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
+ bpf_map_msg = bpf_object__find_map_by_name(obj, "sock_map_msg");
+ if (IS_ERR(bpf_map_msg)) {
+ printf("Failed to load map msg from msg_verdict prog\n");
+ goto out_sockmap;
+ }
+
+ map_fd_msg = bpf_map__fd(bpf_map_msg);
+ if (map_fd_msg < 0) {
+ printf("Failed to get map msg fd\n");
+ goto out_sockmap;
+ }
+
bpf_map_break = bpf_object__find_map_by_name(obj, "sock_map_break");
if (IS_ERR(bpf_map_break)) {
printf("Failed to load map tx from verdict prog\n");
@@ -677,6 +709,12 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
+ err = bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
+ if (err) {
+ printf("Failed msg verdict bpf prog attach\n");
+ goto out_sockmap;
+ }
+
err = bpf_prog_attach(verdict_prog, map_fd_rx,
__MAX_BPF_ATTACH_TYPE, 0);
if (!err) {
@@ -716,6 +754,14 @@ static void test_sockmap(int tasks, void *data)
}
}
+ /* Put sfd[2] (sending fd below) into msg map to test sendmsg bpf */
+ i = 0;
+ err = bpf_map_update_elem(map_fd_msg, &i, &sfd[2], BPF_ANY);
+ if (err) {
+ printf("Failed map_fd_msg update sockmap %i\n", err);
+ goto out_sockmap;
+ }
+
/* Test map send/recv */
for (i = 0; i < 2; i++) {
buf[0] = i;
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [bpf-next PATCH 7/7] bpf: add verifier tests for BPF_PROG_TYPE_SK_MSG
2018-01-12 18:09 [bpf-next PATCH 0/7] Add BPF_PROG_TYPE_SK_MSG and attach pt John Fastabend
` (5 preceding siblings ...)
2018-01-12 18:11 ` [bpf-next PATCH 6/7] bpf: add map tests for BPF_PROG_TYPE_SK_MSG John Fastabend
@ 2018-01-12 18:11 ` John Fastabend
6 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2018-01-12 18:11 UTC (permalink / raw)
To: borkmann, john.fastabend, ast; +Cc: netdev, kafai
Test reads and writes for BPF_PROG_TYPE_SK_MSG.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
tools/testing/selftests/bpf/test_verifier.c | 54 +++++++++++++++++++++++++++
1 file changed, 54 insertions(+)
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 5438479..013791b 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -1227,6 +1227,60 @@ struct test_val {
.prog_type = BPF_PROG_TYPE_SK_SKB,
},
{
+ "direct packet read for SK_MSG",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+ offsetof(struct sk_msg_md, data)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+ offsetof(struct sk_msg_md, data_end)),
+ BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+ BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+ BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_SK_MSG,
+ },
+ {
+ "direct packet write for SK_MSG",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+ offsetof(struct sk_msg_md, data)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+ offsetof(struct sk_msg_md, data_end)),
+ BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+ BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+ BPF_STX_MEM(BPF_B, BPF_REG_2, BPF_REG_2, 0),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_SK_MSG,
+ },
+ {
+ "overlapping checks for direct packet access SK_MSG",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+ offsetof(struct sk_msg_md, data)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+ offsetof(struct sk_msg_md, data_end)),
+ BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+ BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 4),
+ BPF_MOV64_REG(BPF_REG_1, BPF_REG_2),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 6),
+ BPF_JMP_REG(BPF_JGT, BPF_REG_1, BPF_REG_3, 1),
+ BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_2, 6),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_SK_MSG,
+ },
+ {
"check skb->mark is not writeable by sockets",
.insns = {
BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_1,
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG
2018-01-12 18:10 ` [bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG John Fastabend
@ 2018-01-12 20:10 ` Eric Dumazet
2018-01-12 20:26 ` John Fastabend
0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2018-01-12 20:10 UTC (permalink / raw)
To: John Fastabend, borkmann, ast; +Cc: netdev, kafai
On Fri, 2018-01-12 at 10:10 -0800, John Fastabend wrote:
> When calling do_tcp_sendpages() from in kernel and we know the data
> has no references from user side we can omit SKBTX_SHARED_FRAG flag.
> This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used
> to omit setting SKBTX_SHARED_FRAG.
>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
> include/linux/socket.h | 1 +
> net/ipv4/tcp.c | 4 +++-
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 9286a5a..add9360 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -287,6 +287,7 @@ struct ucred {
> #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
> #define MSG_BATCH 0x40000 /* sendmmsg(): more messages coming */
> #define MSG_EOF MSG_FIN
> +#define MSG_NO_SHARED_FRAGS 0x80000 /* sendpage() internal : page frags are not shared */
>
> #define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */
> #define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 7ac583a..56c6f49 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -995,7 +995,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> get_page(page);
> skb_fill_page_desc(skb, i, page, offset, copy);
> }
> - skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
> +
> + if (!(flags & MSG_NO_SHARED_FRAGS))
> + skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
>
> skb->len += copy;
> skb->data_len += copy;
What would prevent user space from using this flag ?
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG
2018-01-12 20:10 ` Eric Dumazet
@ 2018-01-12 20:26 ` John Fastabend
2018-01-12 20:46 ` Eric Dumazet
0 siblings, 1 reply; 18+ messages in thread
From: John Fastabend @ 2018-01-12 20:26 UTC (permalink / raw)
To: Eric Dumazet, borkmann, ast; +Cc: netdev, kafai
On 01/12/2018 12:10 PM, Eric Dumazet wrote:
> On Fri, 2018-01-12 at 10:10 -0800, John Fastabend wrote:
>> When calling do_tcp_sendpages() from in kernel and we know the data
>> has no references from user side we can omit SKBTX_SHARED_FRAG flag.
>> This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used
>> to omit setting SKBTX_SHARED_FRAG.
>>
>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> ---
>> include/linux/socket.h | 1 +
>> net/ipv4/tcp.c | 4 +++-
>> 2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/socket.h b/include/linux/socket.h
>> index 9286a5a..add9360 100644
>> --- a/include/linux/socket.h
>> +++ b/include/linux/socket.h
>> @@ -287,6 +287,7 @@ struct ucred {
>> #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
>> #define MSG_BATCH 0x40000 /* sendmmsg(): more messages coming */
>> #define MSG_EOF MSG_FIN
>> +#define MSG_NO_SHARED_FRAGS 0x80000 /* sendpage() internal : page frags are not shared */
>>
>> #define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */
>> #define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 7ac583a..56c6f49 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -995,7 +995,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
>> get_page(page);
>> skb_fill_page_desc(skb, i, page, offset, copy);
>> }
>> - skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
>> +
>> + if (!(flags & MSG_NO_SHARED_FRAGS))
>> + skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
>>
>> skb->len += copy;
>> skb->data_len += copy;
>
> What would prevent user space from using this flag ?
>
Nothing in the current patches. So user could set this, change the data,
and then presumably get incorrect checksums with bad timing. Seems like
this should be blocked so we don't allow users to try and send bad csums.
How about masking the flags coming from userland? Alternatively could add
a bool to do_tcp_sendpages().
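Purely as a sketch of the masking option (illustrative only; the exact
call site and flag set would need checking against the tree):

	/* hypothetical: strip internal-only sendpage flags so userspace
	 * can never request MSG_NO_SHARED_FRAGS itself
	 */
	flags &= ~MSG_NO_SHARED_FRAGS;
	ret = do_tcp_sendpages(sk, page, offset, size, flags);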
.John
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG
2018-01-12 20:26 ` John Fastabend
@ 2018-01-12 20:46 ` Eric Dumazet
2018-01-12 21:11 ` John Fastabend
0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2018-01-12 20:46 UTC (permalink / raw)
To: John Fastabend, borkmann, ast; +Cc: netdev, kafai
On Fri, 2018-01-12 at 12:26 -0800, John Fastabend wrote:
> On 01/12/2018 12:10 PM, Eric Dumazet wrote:
> > On Fri, 2018-01-12 at 10:10 -0800, John Fastabend wrote:
> > > When calling do_tcp_sendpages() from in kernel and we know the data
> > > has no references from user side we can omit SKBTX_SHARED_FRAG flag.
> > > This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used
> > > to omit setting SKBTX_SHARED_FRAG.
> > >
> > > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > > ---
> > > include/linux/socket.h | 1 +
> > > net/ipv4/tcp.c | 4 +++-
> > > 2 files changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > index 9286a5a..add9360 100644
> > > --- a/include/linux/socket.h
> > > +++ b/include/linux/socket.h
> > > @@ -287,6 +287,7 @@ struct ucred {
> > > #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
> > > #define MSG_BATCH 0x40000 /* sendmmsg(): more messages coming */
> > > #define MSG_EOF MSG_FIN
> > > +#define MSG_NO_SHARED_FRAGS 0x80000 /* sendpage() internal : page frags are not shared */
> > >
> > > #define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */
> > > #define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index 7ac583a..56c6f49 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -995,7 +995,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> > > get_page(page);
> > > skb_fill_page_desc(skb, i, page, offset, copy);
> > > }
> > > - skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
> > > +
> > > + if (!(flags & MSG_NO_SHARED_FRAGS))
> > > + skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
> > >
> > > skb->len += copy;
> > > skb->data_len += copy;
> >
> > What would prevent user space from using this flag ?
> >
>
> Nothing in the current patches. So user could set this, change the data,
> and then presumably get incorrect checksums with bad timing. Seems like
> this should be blocked so we don't allow users to try and send bad csums.
Are you sure user can set it ? How would this happen ?
It would be nice to check (sorry I was lazy/busy and did not check
before asking)
> How about masking the flags coming from userland? Alternatively could add
> a bool to do_tcp_sendpages().
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG
2018-01-12 20:46 ` Eric Dumazet
@ 2018-01-12 21:11 ` John Fastabend
0 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2018-01-12 21:11 UTC (permalink / raw)
To: Eric Dumazet, borkmann, ast; +Cc: netdev, kafai
On 01/12/2018 12:46 PM, Eric Dumazet wrote:
> On Fri, 2018-01-12 at 12:26 -0800, John Fastabend wrote:
>> On 01/12/2018 12:10 PM, Eric Dumazet wrote:
>>> On Fri, 2018-01-12 at 10:10 -0800, John Fastabend wrote:
>>>> When calling do_tcp_sendpages() from in kernel and we know the data
>>>> has no references from user side we can omit SKBTX_SHARED_FRAG flag.
>>>> This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used
>>>> to omit setting SKBTX_SHARED_FRAG.
>>>>
>>>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>>>> ---
>>>> include/linux/socket.h | 1 +
>>>> net/ipv4/tcp.c | 4 +++-
>>>> 2 files changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/socket.h b/include/linux/socket.h
>>>> index 9286a5a..add9360 100644
>>>> --- a/include/linux/socket.h
>>>> +++ b/include/linux/socket.h
>>>> @@ -287,6 +287,7 @@ struct ucred {
>>>> #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
>>>> #define MSG_BATCH 0x40000 /* sendmmsg(): more messages coming */
>>>> #define MSG_EOF MSG_FIN
>>>> +#define MSG_NO_SHARED_FRAGS 0x80000 /* sendpage() internal : page frags are not shared */
>>>>
>>>> #define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */
>>>> #define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */
>>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>>>> index 7ac583a..56c6f49 100644
>>>> --- a/net/ipv4/tcp.c
>>>> +++ b/net/ipv4/tcp.c
>>>> @@ -995,7 +995,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
>>>> get_page(page);
>>>> skb_fill_page_desc(skb, i, page, offset, copy);
>>>> }
>>>> - skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
>>>> +
>>>> + if (!(flags & MSG_NO_SHARED_FRAGS))
>>>> + skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
>>>>
>>>> skb->len += copy;
>>>> skb->data_len += copy;
>>>
>>> What would prevent user space from using this flag ?
>>>
>>
>> Nothing in the current patches. So user could set this, change the data,
>> and then presumably get incorrect checksums with bad timing. Seems like
>> this should be blocked so we don't allow users to try and send bad csums.
>
> Are you sure user can set it ? How would this happen ?
>
Ah OK I thought you might have a path that I missed. Just
rechecked and I don't see any paths where user flags get
to sendpage without being masked.
> It would be nice to check (sorry I was lazy/busy and did not check
> before asking)
No problem.
The splice path using pipe_to_sendpage() already masks the
flags before sendpage is called. The only other call sites I
see are in o2net and lowcomms; in both places the flags are
hard-coded in-kernel.
So we should be safe.
>> How about masking the flags coming from userland? Alternatively could add
>> a bool to do_tcp_sendpages().
>>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
2018-01-12 18:11 ` [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data John Fastabend
@ 2018-01-17 2:25 ` Alexei Starovoitov
2018-01-17 5:49 ` John Fastabend
2018-01-17 22:04 ` Martin KaFai Lau
1 sibling, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2018-01-17 2:25 UTC (permalink / raw)
To: John Fastabend; +Cc: borkmann, ast, netdev, kafai
On Fri, Jan 12, 2018 at 10:11:11AM -0800, John Fastabend wrote:
> This implements a BPF ULP layer to allow policy enforcement and
> monitoring at the socket layer. In order to support this a new
> program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
> the sendmsg/sendpage hook. To attach the policy to sockets a
> sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
>
> Similar to previous sockmap usages when a sock is added to a
> sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
> program type attached then the BPF ULP layer is created on the
> socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
> every msg in sendmsg case and page/offset in sendpage case.
>
> BPF_PROG_TYPE_SK_MSG Semantics/API:
>
> BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
> SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
> case and in the sendpage case leaves the data untouched. Both cases
> return -EACESS to the user. Returning SK_PASS will allow the msg to
> be sent.
>
> In the sendmsg case data is copied into kernel space buffers before
> running the BPF program. In the sendpage case data is never copied.
> The implication being users may change data after BPF programs run in
> the sendpage case. (A flag will be added to always copy shortly
> if the copy must always be performed).
>
> The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
> in the sendmsg() case and the entire page/offset in the sendpage case.
> This avoid ambiguity on how to handle mixed return codes in the
> sendmsg case. The readable/writeable data provided to the program
> in the sendmsg case may not be the entire message, in fact for
> large sends this is likely the case. The data range that can be
> read is part of the sk_msg_md structure. This is because similar
> to the tc bpf_cls case the data is stored in a scatter gather list.
> Future work will address this short-coming to allow users to pull
> in more data if needed (similar to TC BPF).
>
> The helper msg_redirect_map() can be used to select the socket to
> send the data on. This is used similar to existing redirect use
> cases. This allows policy to redirect msgs.
>
> Pseudo code simple example:
>
> The basic logic to attach a program to a socket is as follows,
>
> // load the programs
> bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
> &obj, &msg_prog);
>
> // lookup the sockmap
> bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
>
> // get fd for sockmap
> map_fd_msg = bpf_map__fd(bpf_map_msg);
>
> // attach program to sockmap
> bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
>
> Adding sockets to the map is done in the normal way,
>
> // Add a socket 'fd' to sockmap at location 'i'
> bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);
>
> After the above any socket attached to "my_sock_map", in this case
> 'fd', will run the BPF msg verdict program (msg_prog) on every
> sendmsg and sendpage system call.
>
> For a complete example see BPF selftests bpf/sockmap_tcp_msg_*.c and
> test_maps.c
>
> Implementation notes:
>
> It seemed the simplest, to me at least, to use a refcnt to ensure
> psock is not lost across the sendmsg copy into the sg, the bpf program
> running on the data in sg_data, and the final pass to the TCP stack.
> Some performance testing may show a better method to do this and avoid
> the refcnt cost, but for now use the simpler method.
>
> Another item that will come after basic support is in place is
> supporting MSG_MORE flag. At the moment we call sendpages even if
> the MSG_MORE flag is set. An enhancement would be to collect the
> pages into a larger scatterlist and pass down the stack. Notice that
> bpf_tcp_sendmsg() could support this with some additional state saved
> across sendmsg calls. I built the code to support this without having
> to do refactoring work. Other flags TBD include ZEROCOPY flag.
>
> Yet another detail that needs some thought is the size of scatterlist.
> Currently, we use MAX_SKB_FRAGS simply because this was being used
> already in the TLS case. Future work to improve the kernel sk APIs to
> tune this depending on workload may be useful. This is a trade-off
> between memory usage and B/s performance.
>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
overall design looks clean. imo huge improvement from first version.
Few nits:
> ---
> include/linux/bpf.h | 1
> include/linux/bpf_types.h | 1
> include/linux/filter.h | 10 +
> include/net/tcp.h | 2
> include/uapi/linux/bpf.h | 28 +++
> kernel/bpf/sockmap.c | 485 ++++++++++++++++++++++++++++++++++++++++++++-
> kernel/bpf/syscall.c | 14 +
> kernel/bpf/verifier.c | 5
> net/core/filter.c | 106 ++++++++++
> 9 files changed, 638 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 9e03046..14cdb4d 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -21,6 +21,7 @@
> struct perf_event;
> struct bpf_prog;
> struct bpf_map;
> +struct sock;
>
> /* map is generic key/value storage optionally accesible by eBPF programs */
> struct bpf_map_ops {
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 19b8349..5e2e8a4 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -13,6 +13,7 @@
> BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
> BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
> BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
> #endif
> #ifdef CONFIG_BPF_EVENTS
> BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 425056c..f1e9833 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -507,6 +507,15 @@ struct xdp_buff {
> struct xdp_rxq_info *rxq;
> };
>
> +struct sk_msg_buff {
> + void *data;
> + void *data_end;
> + struct scatterlist sg_data[MAX_SKB_FRAGS];
> + __u32 key;
> + __u32 flags;
> + struct bpf_map *map;
> +};
> +
> /* Compute the linear packet data range [data, data_end) which
> * will be accessed by various program types (cls_bpf, act_bpf,
> * lwt, ...). Subsystems allowing direct data access must (!)
> @@ -769,6 +778,7 @@ int xdp_do_redirect(struct net_device *dev,
> void bpf_warn_invalid_xdp_action(u32 act);
>
> struct sock *do_sk_redirect_map(struct sk_buff *skb);
> +struct sock *do_msg_redirect_map(struct sk_msg_buff *md);
>
> #ifdef CONFIG_BPF_JIT
> extern int bpf_jit_enable;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index a99ceb8..7f56c3c 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1984,6 +1984,7 @@ static inline void tcp_listendrop(const struct sock *sk)
>
> enum {
> TCP_ULP_TLS,
> + TCP_ULP_BPF,
> };
>
> struct tcp_ulp_ops {
> @@ -2001,6 +2002,7 @@ struct tcp_ulp_ops {
> int tcp_register_ulp(struct tcp_ulp_ops *type);
> void tcp_unregister_ulp(struct tcp_ulp_ops *type);
> int tcp_set_ulp(struct sock *sk, const char *name);
> +int tcp_set_ulp_id(struct sock *sk, const int ulp);
> void tcp_get_available_ulp(char *buf, size_t len);
> void tcp_cleanup_ulp(struct sock *sk);
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 405317f..bf649ae 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -133,6 +133,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_SOCK_OPS,
> BPF_PROG_TYPE_SK_SKB,
> BPF_PROG_TYPE_CGROUP_DEVICE,
> + BPF_PROG_TYPE_SK_MSG,
> };
>
> enum bpf_attach_type {
> @@ -143,6 +144,7 @@ enum bpf_attach_type {
> BPF_SK_SKB_STREAM_PARSER,
> BPF_SK_SKB_STREAM_VERDICT,
> BPF_CGROUP_DEVICE,
> + BPF_SK_MSG_VERDICT,
> __MAX_BPF_ATTACH_TYPE
> };
>
> @@ -687,6 +689,15 @@ enum bpf_attach_type {
> * int bpf_override_return(pt_regs, rc)
> * @pt_regs: pointer to struct pt_regs
> * @rc: the return value to set
> + *
> + * int bpf_msg_redirect_map(map, key, flags)
> + * Redirect msg to a sock in map using key as a lookup key for the
> + * sock in map.
> + * @map: pointer to sockmap
> + * @key: key to lookup sock in map
> + * @flags: reserved for future use
> + * Return: SK_PASS
> + *
> */
> #define __BPF_FUNC_MAPPER(FN) \
> FN(unspec), \
> @@ -747,7 +758,8 @@ enum bpf_attach_type {
> FN(perf_event_read_value), \
> FN(perf_prog_read_value), \
> FN(getsockopt), \
> - FN(override_return),
> + FN(override_return), \
> + FN(msg_redirect_map),
>
> /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> * function eBPF program intends to call
> @@ -909,6 +921,20 @@ enum sk_action {
> SK_PASS,
> };
>
> +/* User return codes for SK_MSG prog type. */
> +enum sk_msg_action {
> + SK_MSG_DROP = 0,
> + SK_MSG_PASS,
> +};
> +
> +/* user accessible metadata for SK_MSG packet hook, new fields must
> + * be added to the end of this structure
> + */
> +struct sk_msg_md {
> + __u32 data;
> + __u32 data_end;
> +};
> +
> #define BPF_TAG_SIZE 8
>
> struct bpf_prog_info {
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 972608f..5793f3a 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -38,6 +38,7 @@
> #include <linux/skbuff.h>
> #include <linux/workqueue.h>
> #include <linux/list.h>
> +#include <linux/mm.h>
> #include <net/strparser.h>
> #include <net/tcp.h>
>
> @@ -47,6 +48,7 @@
> struct bpf_stab {
> struct bpf_map map;
> struct sock **sock_map;
> + struct bpf_prog *bpf_tx_msg;
> struct bpf_prog *bpf_parse;
> struct bpf_prog *bpf_verdict;
> };
> @@ -74,6 +76,7 @@ struct smap_psock {
> struct sk_buff *save_skb;
>
> struct strparser strp;
> + struct bpf_prog *bpf_tx_msg;
> struct bpf_prog *bpf_parse;
> struct bpf_prog *bpf_verdict;
> struct list_head maps;
> @@ -90,6 +93,8 @@ struct smap_psock {
> void (*save_state_change)(struct sock *sk);
> };
>
> +static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
> +
> static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
> {
> return rcu_dereference_sk_user_data(sk);
> @@ -99,8 +104,439 @@ enum __sk_action {
> __SK_DROP = 0,
> __SK_PASS,
> __SK_REDIRECT,
> + __SK_NONE,
> };
>
> +static int memcopy_from_iter(struct sock *sk, struct scatterlist *sg,
> + int sg_num, struct iov_iter *from, int bytes)
> +{
> + int i, rc = 0;
> +
> + for (i = 0; i < sg_num; ++i) {
> + int copy = sg[i].length;
> + char *to = sg_virt(&sg[i]);
> +
> + if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY)
> + rc = copy_from_iter_nocache(to, copy, from);
> + else
> + rc = copy_from_iter(to, copy, from);
> +
> + if (rc != copy) {
> + rc = -EFAULT;
> + goto out;
> + }
> +
> + bytes -= copy;
> + if (!bytes)
> + break;
> + }
> +out:
> + return rc;
> +}
> +
> +static int bpf_tcp_push(struct sock *sk, struct scatterlist *sg,
> + int *sg_end, int flags, bool charge)
> +{
> + int sendpage_flags = flags | MSG_SENDPAGE_NOTLAST;
> + int offset, ret = 0;
> + struct page *p;
> + size_t size;
> +
> + size = sg->length;
> + offset = sg->offset;
> +
> + while (1) {
> + if (sg_is_last(sg))
> + sendpage_flags = flags;
> +
> + tcp_rate_check_app_limited(sk);
> + p = sg_page(sg);
> +retry:
> + ret = do_tcp_sendpages(sk, p, offset, size, sendpage_flags);
> + if (ret != size) {
> + if (ret > 0) {
> + offset += ret;
> + size -= ret;
> + goto retry;
> + }
> +
> + if (charge)
> + sk_mem_uncharge(sk,
> + sg->length - size - sg->offset);
should the bool argument be called 'uncharge' instead ?
> +
> + sg->offset = offset;
> + sg->length = size;
> + return ret;
> + }
> +
> + put_page(p);
> + if (charge)
> + sk_mem_uncharge(sk, sg->length);
> + *sg_end += 1;
> + sg = sg_next(sg);
> + if (!sg)
> + break;
> +
> + offset = sg->offset;
> + size = sg->length;
> + }
> +
> + return 0;
> +}
> +
> +static inline void bpf_compute_data_pointers_sg(struct sk_msg_buff *md)
> +{
> + md->data = sg_virt(md->sg_data);
> + md->data_end = md->data + md->sg_data->length;
> +}
> +
> +static void return_mem_sg(struct sock *sk, struct scatterlist *sg, int end)
> +{
> + int i;
> +
> + for (i = 0; i < end; ++i)
> + sk_mem_uncharge(sk, sg[i].length);
> +}
> +
> +static int free_sg(struct sock *sk, struct scatterlist *sg, int start, int len)
> +{
> + int i, free = 0;
> +
> + for (i = start; i < len; ++i) {
> + free += sg[i].length;
> + sk_mem_uncharge(sk, sg[i].length);
> + put_page(sg_page(&sg[i]));
> + }
> +
> + return free;
> +}
> +
> +static unsigned int smap_do_tx_msg(struct sock *sk,
> + struct smap_psock *psock,
> + struct sk_msg_buff *md)
> +{
> + struct bpf_prog *prog;
> + unsigned int rc, _rc;
> +
> + preempt_disable();
> + rcu_read_lock();
> +
> + /* If the policy was removed mid-send then default to 'accept' */
> + prog = READ_ONCE(psock->bpf_tx_msg);
> + if (unlikely(!prog)) {
> + _rc = SK_PASS;
> + goto verdict;
> + }
> +
> + bpf_compute_data_pointers_sg(md);
> + _rc = (*prog->bpf_func)(md, prog->insnsi);
> +
> +verdict:
> + rcu_read_unlock();
> + preempt_enable();
> +
> + /* Moving return codes from UAPI namespace into internal namespace */
> + rc = ((_rc == SK_PASS) ?
> + (md->map ? __SK_REDIRECT : __SK_PASS) :
> + __SK_DROP);
> +
> + return rc;
> +}
> +
> +static int bpf_tcp_sendmsg_do_redirect(struct scatterlist *sg, int sg_num,
> + struct sk_msg_buff *md, int flags)
> +{
> + int i, sg_curr = 0, err, free;
> + struct smap_psock *psock;
> + struct sock *sk;
> +
> + rcu_read_lock();
> + sk = do_msg_redirect_map(md);
> + if (unlikely(!sk))
> + goto out_rcu;
> +
> + psock = smap_psock_sk(sk);
> + if (unlikely(!psock))
> + goto out_rcu;
> +
> + if (!refcount_inc_not_zero(&psock->refcnt))
> + goto out_rcu;
> +
> + rcu_read_unlock();
> + lock_sock(sk);
> + err = bpf_tcp_push(sk, sg, &sg_curr, flags, false);
> + if (unlikely(err))
> + goto out;
> + release_sock(sk);
> + smap_release_sock(psock, sk);
> + return 0;
> +out_rcu:
> + rcu_read_unlock();
> +out:
> + for (i = sg_curr; i < sg_num; ++i) {
> + free += sg[i].length;
> + put_page(sg_page(&sg[i]));
> + }
> + return free;
error path keeps rcu_lock and sk locked?
> +}
> +
> +static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> +{
> + int err = 0, eval = __SK_NONE, sg_size = 0, sg_num = 0;
> + int flags = msg->msg_flags | MSG_NO_SHARED_FRAGS;
> + struct sk_msg_buff md = {0};
> + struct smap_psock *psock;
> + size_t copy, copied = 0;
> + struct scatterlist *sg;
> + long timeo;
> +
> + sg = md.sg_data;
> + sg_init_table(sg, MAX_SKB_FRAGS);
> +
> + /* Its possible a sock event or user removed the psock _but_ the ops
> + * have not been reprogrammed yet so we get here. In this case fallback
> + * to tcp_sendmsg. Note this only works because we _only_ ever allow
> + * a single ULP there is no hierarchy here.
> + */
> + rcu_read_lock();
> + psock = smap_psock_sk(sk);
> + if (unlikely(!psock)) {
> + rcu_read_unlock();
> + return tcp_sendmsg(sk, msg, size);
> + }
> +
> + /* Increment the psock refcnt to ensure its not released while sending a
> + * message. Required because sk lookup and bpf programs are used in
> + * separate rcu critical sections. Its OK if we lose the map entry
> + * but we can't lose the sock reference, possible when the refcnt hits
> + * zero and garbage collection calls sock_put().
> + */
> + if (!refcount_inc_not_zero(&psock->refcnt)) {
> + rcu_read_unlock();
> + return tcp_sendmsg(sk, msg, size);
> + }
> +
> + rcu_read_unlock();
> +
> + lock_sock(sk);
> + timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
> +
> + while (msg_data_left(msg)) {
> + int sg_curr;
> +
> + if (sk->sk_err) {
> + err = sk->sk_err;
> + goto out_err;
> + }
> +
> + copy = msg_data_left(msg);
> + if (!sk_stream_memory_free(sk))
> + goto wait_for_sndbuf;
> +
> + /* sg_size indicates bytes already allocated and sg_num
> + * is last sg element used. This is used when alloc_sg
> + * partially allocates a scatterlist and then is sent
> + * to wait for memory. In normal case (no memory pressure)
> + * both sg_nun and sg_size are zero.
> + */
> + copy = copy - sg_size;
> + err = sk_alloc_sg(sk, copy, sg, &sg_num, &sg_size, 0);
> + if (err) {
> + if (err != -ENOSPC)
> + goto wait_for_memory;
> + copy = sg_size;
> + }
> +
> + err = memcopy_from_iter(sk, sg, sg_num, &msg->msg_iter, copy);
> + if (err < 0) {
> + free_sg(sk, sg, 0, sg_num);
> + goto out_err;
> + }
> +
> + copied += copy;
> +
> + /* If msg is larger than MAX_SKB_FRAGS we can send multiple
> + * scatterlists per msg. However BPF decisions apply to the
> + * entire msg.
> + */
> + if (eval == __SK_NONE)
> + eval = smap_do_tx_msg(sk, psock, &md);
it seems sk_alloc_sg() will put 64k bytes into sg_data,
but this program will see only first SG ?
and it's typically going to be one page only ?
then what's the value of waiting for MAX_SKB_FRAGS ?
> +
> + switch (eval) {
> + case __SK_PASS:
> + sg_mark_end(sg + sg_num - 1);
> + err = bpf_tcp_push(sk, sg, &sg_curr, flags, true);
> + if (unlikely(err)) {
> + copied -= free_sg(sk, sg, sg_curr, sg_num);
> + goto out_err;
> + }
> + break;
> + case __SK_REDIRECT:
> + sg_mark_end(sg + sg_num - 1);
> + goto do_redir;
> + case __SK_DROP:
> + default:
> + copied -= free_sg(sk, sg, 0, sg_num);
> + goto out_err;
> + }
> +
> + sg_num = 0;
> + sg_size = 0;
> + continue;
> +wait_for_sndbuf:
> + set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> +wait_for_memory:
> + err = sk_stream_wait_memory(sk, &timeo);
> + if (err)
> + goto out_err;
> + }
> +out_err:
> + if (err < 0)
> + err = sk_stream_error(sk, msg->msg_flags, err);
> + release_sock(sk);
> + smap_release_sock(psock, sk);
> + return copied ? copied : err;
> +
> +do_redir:
> + /* To avoid deadlock with multiple socks all doing redirects to
> + * each other we must first drop the current sock lock and release
> + * the psock. Then get the redirect socket (assuming it still
> + * exists), take it's lock, and finally do the send here. If the
> + * redirect fails there is nothing to do, we don't want to blame
> + * the sender for remote socket failures. Instead we simply
> + * continue making forward progress.
> + */
> + return_mem_sg(sk, sg, sg_num);
> + release_sock(sk);
> + smap_release_sock(psock, sk);
> + copied -= bpf_tcp_sendmsg_do_redirect(sg, sg_num, &md, flags);
> + return copied;
> +}
> +
> +static int bpf_tcp_sendpage_do_redirect(struct page *page, int offset,
> + size_t size, int flags,
> + struct sk_msg_buff *md)
> +{
> + struct smap_psock *psock;
> + struct sock *sk;
> + int rc;
> +
> + rcu_read_lock();
> + sk = do_msg_redirect_map(md);
> + if (unlikely(!sk))
> + goto out_rcu;
> +
> + psock = smap_psock_sk(sk);
> + if (unlikely(!psock))
> + goto out_rcu;
> +
> + if (!refcount_inc_not_zero(&psock->refcnt))
> + goto out_rcu;
> +
> + rcu_read_unlock();
> +
> + lock_sock(sk);
> + rc = tcp_sendpage_locked(sk, page, offset, size, flags);
> + release_sock(sk);
> +
> + smap_release_sock(psock, sk);
> + return rc;
> +out_rcu:
> + rcu_read_unlock();
> + return -EINVAL;
> +}
> +
> +static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
> + int offset, size_t size, int flags)
> +{
> + struct smap_psock *psock;
> + int rc, _rc = __SK_PASS;
> + struct bpf_prog *prog;
> + struct sk_msg_buff md;
> +
> + preempt_disable();
> + rcu_read_lock();
> + psock = smap_psock_sk(sk);
> + if (unlikely(!psock))
> + goto verdict;
> +
> + /* If the policy was removed mid-send then default to 'accept' */
> + prog = READ_ONCE(psock->bpf_tx_msg);
> + if (unlikely(!prog))
> + goto verdict;
> +
> + /* Calculate pkt data pointers and run BPF program */
> + md.data = page_address(page) + offset;
> + md.data_end = md.data + size;
> + _rc = (*prog->bpf_func)(&md, prog->insnsi);
> +
> +verdict:
> + rcu_read_unlock();
> + preempt_enable();
> +
> + /* Moving return codes from UAPI namespace into internal namespace */
> + rc = ((_rc == SK_PASS) ? __SK_PASS : __SK_DROP);
> +
> + switch (rc) {
> + case __SK_PASS:
> + lock_sock(sk);
> + rc = tcp_sendpage_locked(sk, page, offset, size, flags);
> + release_sock(sk);
> + break;
> + case __SK_REDIRECT:
> + smap_release_sock(psock, sk);
> + rc = bpf_tcp_sendpage_do_redirect(page, offset, size, flags,
> + &md);
looks like this path wasn't tested,
since above rc = ...; line cannot return REDIRECT...
probably should be common helper for both tcp_bpf_*() funcs
to call into bpf and convert rc.
> + break;
> + case __SK_DROP:
> + default:
> + rc = -EACCES;
> + }
> +
> + return rc;
> +}
> +
> +static int bpf_tcp_msg_add(struct smap_psock *psock,
> + struct sock *sk,
> + struct bpf_prog *tx_msg)
> +{
> + struct bpf_prog *orig_tx_msg;
> +
> + orig_tx_msg = xchg(&psock->bpf_tx_msg, tx_msg);
> + if (orig_tx_msg)
> + bpf_prog_put(orig_tx_msg);
the function is replacing the program. why is it called bpf_tcp_msg_add ?
> +
> + return tcp_set_ulp_id(sk, TCP_ULP_BPF);
> +}
> +
> +struct proto tcp_bpf_proto;
> +static int bpf_tcp_init(struct sock *sk)
> +{
> + sk->sk_prot = &tcp_bpf_proto;
> + return 0;
> +}
> +
> +static void bpf_tcp_release(struct sock *sk)
> +{
> + sk->sk_prot = &tcp_prot;
> +}
> +
> +static struct tcp_ulp_ops bpf_tcp_ulp_ops __read_mostly = {
> + .name = "bpf_tcp",
> + .uid = TCP_ULP_BPF,
> + .owner = NULL,
> + .init = bpf_tcp_init,
> + .release = bpf_tcp_release,
> +};
> +
> +static int bpf_tcp_ulp_register(void)
> +{
> + tcp_bpf_proto = tcp_prot;
> + tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
> + tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
> + return tcp_register_ulp(&bpf_tcp_ulp_ops);
I don't see corresponding tcp_unregister_ulp().
> +}
> +
> static int smap_verdict_func(struct smap_psock *psock, struct sk_buff *skb)
> {
> struct bpf_prog *prog = READ_ONCE(psock->bpf_verdict);
> @@ -165,8 +601,6 @@ static void smap_report_sk_error(struct smap_psock *psock, int err)
> sk->sk_error_report(sk);
> }
>
> -static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
> -
> /* Called with lock_sock(sk) held */
> static void smap_state_change(struct sock *sk)
> {
> @@ -317,6 +751,7 @@ static void smap_write_space(struct sock *sk)
>
> static void smap_stop_sock(struct smap_psock *psock, struct sock *sk)
> {
> + tcp_cleanup_ulp(sk);
> if (!psock->strp_enabled)
> return;
> sk->sk_data_ready = psock->save_data_ready;
> @@ -384,7 +819,6 @@ static int smap_parse_func_strparser(struct strparser *strp,
> return rc;
> }
>
> -
> static int smap_read_sock_done(struct strparser *strp, int err)
> {
> return err;
> @@ -456,6 +890,8 @@ static void smap_gc_work(struct work_struct *w)
> bpf_prog_put(psock->bpf_parse);
> if (psock->bpf_verdict)
> bpf_prog_put(psock->bpf_verdict);
> + if (psock->bpf_tx_msg)
> + bpf_prog_put(psock->bpf_tx_msg);
>
> list_for_each_entry_safe(e, tmp, &psock->maps, list) {
> list_del(&e->list);
> @@ -491,8 +927,7 @@ static struct smap_psock *smap_init_psock(struct sock *sock,
>
> static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
> {
> - struct bpf_stab *stab;
> - int err = -EINVAL;
> + struct bpf_stab *stab; int err = -EINVAL;
> u64 cost;
>
> if (!capable(CAP_NET_ADMIN))
> @@ -506,6 +941,10 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
> if (attr->value_size > KMALLOC_MAX_SIZE)
> return ERR_PTR(-E2BIG);
>
> + err = bpf_tcp_ulp_register();
> + if (err && err != -EEXIST)
> + return ERR_PTR(err);
> +
> stab = kzalloc(sizeof(*stab), GFP_USER);
> if (!stab)
> return ERR_PTR(-ENOMEM);
> @@ -590,6 +1029,8 @@ static void sock_map_free(struct bpf_map *map)
> bpf_prog_put(stab->bpf_verdict);
> if (stab->bpf_parse)
> bpf_prog_put(stab->bpf_parse);
> + if (stab->bpf_tx_msg)
> + bpf_prog_put(stab->bpf_tx_msg);
>
> sock_map_remove_complete(stab);
> }
> @@ -684,7 +1125,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
> {
> struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
> struct smap_psock_map_entry *e = NULL;
> - struct bpf_prog *verdict, *parse;
> + struct bpf_prog *verdict, *parse, *tx_msg;
> struct sock *osock, *sock;
> struct smap_psock *psock;
> u32 i = *(u32 *)key;
> @@ -710,6 +1151,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
> */
> verdict = READ_ONCE(stab->bpf_verdict);
> parse = READ_ONCE(stab->bpf_parse);
> + tx_msg = READ_ONCE(stab->bpf_tx_msg);
>
> if (parse && verdict) {
> /* bpf prog refcnt may be zero if a concurrent attach operation
> @@ -728,6 +1170,17 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
> }
> }
>
> + if (tx_msg) {
> + tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg);
prog_inc_not_zero() looks scary here.
Why is 'not_zero' necessary?
> + if (IS_ERR(tx_msg)) {
> + if (verdict)
> + bpf_prog_put(verdict);
> + if (parse)
> + bpf_prog_put(parse);
> + return PTR_ERR(tx_msg);
> + }
> + }
> +
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
2018-01-17 2:25 ` Alexei Starovoitov
@ 2018-01-17 5:49 ` John Fastabend
2018-01-17 6:20 ` Alexei Starovoitov
0 siblings, 1 reply; 18+ messages in thread
From: John Fastabend @ 2018-01-17 5:49 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: borkmann, ast, netdev, kafai
On 01/16/2018 06:25 PM, Alexei Starovoitov wrote:
> On Fri, Jan 12, 2018 at 10:11:11AM -0800, John Fastabend wrote:
>> This implements a BPF ULP layer to allow policy enforcement and
>> monitoring at the socket layer. In order to support this a new
>> program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
>> the sendmsg/sendpage hook. To attach the policy to sockets a
>> sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
[...]
>
> overall design looks clean. imo huge improvement from first version.
>
Great, thanks for the review.
> Few nits:
>
[...]
>> +
>> +static int bpf_tcp_push(struct sock *sk, struct scatterlist *sg,
>> + int *sg_end, int flags, bool charge)
>> +{
>> + int sendpage_flags = flags | MSG_SENDPAGE_NOTLAST;
>> + int offset, ret = 0;
>> + struct page *p;
>> + size_t size;
>> +
>> + size = sg->length;
>> + offset = sg->offset;
>> +
>> + while (1) {
>> + if (sg_is_last(sg))
>> + sendpage_flags = flags;
>> +
>> + tcp_rate_check_app_limited(sk);
>> + p = sg_page(sg);
>> +retry:
>> + ret = do_tcp_sendpages(sk, p, offset, size, sendpage_flags);
>> + if (ret != size) {
>> + if (ret > 0) {
>> + offset += ret;
>> + size -= ret;
>> + goto retry;
>> + }
>> +
>> + if (charge)
>> + sk_mem_uncharge(sk,
>> + sg->length - size - sg->offset);
>
> should the bool argument be called 'uncharge' instead ?
>
Agreed that would be a better name, will update.
>> +
>> + sg->offset = offset;
>> + sg->length = size;
>> + return ret;
>> + }
>> +
>> + put_page(p);
>> + if (charge)
>> + sk_mem_uncharge(sk, sg->length);
>> + *sg_end += 1;
>> + sg = sg_next(sg);
>> + if (!sg)
>> + break;
>> +
>> + offset = sg->offset;
>> + size = sg->length;
>> + }
>> +
>> + return 0;
>> +}
[...]
>> +static int bpf_tcp_sendmsg_do_redirect(struct scatterlist *sg, int sg_num,
>> + struct sk_msg_buff *md, int flags)
>> +{
>> + int i, sg_curr = 0, err, free;
>> + struct smap_psock *psock;
>> + struct sock *sk;
>> +
>> + rcu_read_lock();
>> + sk = do_msg_redirect_map(md);
>> + if (unlikely(!sk))
>> + goto out_rcu;
>> +
>> + psock = smap_psock_sk(sk);
>> + if (unlikely(!psock))
>> + goto out_rcu;
>> +
>> + if (!refcount_inc_not_zero(&psock->refcnt))
>> + goto out_rcu;
>> +
>> + rcu_read_unlock();
>> + lock_sock(sk);
>> + err = bpf_tcp_push(sk, sg, &sg_curr, flags, false);
>> + if (unlikely(err))
>> + goto out;
>> + release_sock(sk);
>> + smap_release_sock(psock, sk);
>> + return 0;
>> +out_rcu:
>> + rcu_read_unlock();
>> +out:
>> + for (i = sg_curr; i < sg_num; ++i) {
>> + free += sg[i].length;
>> + put_page(sg_page(&sg[i]));
>> + }
>> + return free;
>
> error path keeps rcu_lock and sk locked?
>
Yep, although it looks like rcu_read_unlock() is OK because it's
released above the call, but the
if (unlikely(err))
goto err
needs to be moved below the smap_release_sock(). Thanks!
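Roughly, against the snippet quoted above (illustrative only):

	lock_sock(sk);
	err = bpf_tcp_push(sk, sg, &sg_curr, flags, false);
	release_sock(sk);
	smap_release_sock(psock, sk);
	if (unlikely(err))
		goto out;
	return 0;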
>> +}
>> +
[...]
>> + while (msg_data_left(msg)) {
>> + int sg_curr;
>> +
>> + if (sk->sk_err) {
>> + err = sk->sk_err;
>> + goto out_err;
>> + }
>> +
>> + copy = msg_data_left(msg);
>> + if (!sk_stream_memory_free(sk))
>> + goto wait_for_sndbuf;
>> +
>> + /* sg_size indicates bytes already allocated and sg_num
>> + * is last sg element used. This is used when alloc_sg
>> + * partially allocates a scatterlist and then is sent
>> + * to wait for memory. In normal case (no memory pressure)
>> + * both sg_nun and sg_size are zero.
>> + */
>> + copy = copy - sg_size;
>> + err = sk_alloc_sg(sk, copy, sg, &sg_num, &sg_size, 0);
>> + if (err) {
>> + if (err != -ENOSPC)
>> + goto wait_for_memory;
>> + copy = sg_size;
>> + }
>> +
>> + err = memcopy_from_iter(sk, sg, sg_num, &msg->msg_iter, copy);
>> + if (err < 0) {
>> + free_sg(sk, sg, 0, sg_num);
>> + goto out_err;
>> + }
>> +
>> + copied += copy;
>> +
>> + /* If msg is larger than MAX_SKB_FRAGS we can send multiple
>> + * scatterlists per msg. However BPF decisions apply to the
>> + * entire msg.
>> + */
>> + if (eval == __SK_NONE)
>> + eval = smap_do_tx_msg(sk, psock, &md);
>
> it seems sk_alloc_sg() will put 64k bytes into sg_data,
Yep, up to 64k will be copied from the msg into sg data.
> but this program will see only first SG ?
Correct, to read further into the msg we would need to have a helper
or some other way to catch reads/writes past the first 4k and read
the next sg. We have the same limitation in cls_bpf.
I have started a patch on top of this series with the current working
title msg_apply_bytes(int bytes). This would let us apply a verdict to
a set number of bytes instead of the entire msg. By calling
msg_apply_bytes(data_end - data) a user who needs to read an entire msg
could do this in 4k chunks until the entire msg is passed through the
bpf prog.
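A rough usage sketch (purely illustrative -- msg_apply_bytes() is only
a working title here and the final signature is still open; fragment
assumes the usual selftest bpf_helpers.h setup):

	SEC("sk_msg")
	int msg_prog(struct sk_msg_md *msg)
	{
		void *data = (void *)(long)msg->data;
		void *data_end = (void *)(long)msg->data_end;

		/* hypothetical helper: apply this verdict only to the bytes
		 * currently readable; the rest of the msg re-enters the prog
		 */
		msg_apply_bytes(data_end - data);
		return SK_PASS;
	}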
> and it's typically going to be one page only ?
yep
> then what's the value of waiting for MAX_SKB_FRAGS ?
>
It's not waiting for MAX_SKB_FRAGS, it's simply copying up to MAX_SKB_FRAGS
pages in one call if possible. It seems better to me to run this loop
over as much data as we can.
>> +
>> + switch (eval) {
>> + case __SK_PASS:
>> + sg_mark_end(sg + sg_num - 1);
>> + err = bpf_tcp_push(sk, sg, &sg_curr, flags, true);
>> + if (unlikely(err)) {
>> + copied -= free_sg(sk, sg, sg_curr, sg_num);
>> + goto out_err;
>> + }
>> + break;
>> + case __SK_REDIRECT:
>> + sg_mark_end(sg + sg_num - 1);
>> + goto do_redir;
>> + case __SK_DROP:
>> + default:
>> + copied -= free_sg(sk, sg, 0, sg_num);
>> + goto out_err;
>> + }
[...]
>> + /* Calculate pkt data pointers and run BPF program */
>> + md.data = page_address(page) + offset;
>> + md.data_end = md.data + size;
>> + _rc = (*prog->bpf_func)(&md, prog->insnsi);
>> +
>> +verdict:
>> + rcu_read_unlock();
>> + preempt_enable();
>> +
>> + /* Moving return codes from UAPI namespace into internal namespace */
>> + rc = ((_rc == SK_PASS) ? __SK_PASS : __SK_DROP);
>> +
>> + switch (rc) {
>> + case __SK_PASS:
>> + lock_sock(sk);
>> + rc = tcp_sendpage_locked(sk, page, offset, size, flags);
>> + release_sock(sk);
>> + break;
>> + case __SK_REDIRECT:
>> + smap_release_sock(psock, sk);
>> + rc = bpf_tcp_sendpage_do_redirect(page, offset, size, flags,
>> + &md);
>
> looks like this path wasn't tested,
:/ yep adding test now.
> since above rc = ...; line cannot return REDIRECT...
> probably should be common helper for both tcp_bpf_*() funcs
> to call into bpf and convert rc.
>
will do in v2 thanks.
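For reference, one possible shape for that shared helper, mirroring the
rc conversion smap_do_tx_msg() already does (name invented here):

	static unsigned int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
	{
		/* move return codes from UAPI namespace into internal namespace */
		return ((_rc == SK_PASS) ?
			(md->map ? __SK_REDIRECT : __SK_PASS) :
			__SK_DROP);
	}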
>> + break;
>> + case __SK_DROP:
>> + default:
>> + rc = -EACCES;
>> + }
>> +
>> + return rc;
>> +}
>> +
>> +static int bpf_tcp_msg_add(struct smap_psock *psock,
>> + struct sock *sk,
>> + struct bpf_prog *tx_msg)
>> +{
>> + struct bpf_prog *orig_tx_msg;
>> +
>> + orig_tx_msg = xchg(&psock->bpf_tx_msg, tx_msg);
>> + if (orig_tx_msg)
>> + bpf_prog_put(orig_tx_msg);
>
> the function is replacing the program. why is it called bpf_tcp_msg_add ?
>
bad name will rename.
>> +
>> + return tcp_set_ulp_id(sk, TCP_ULP_BPF);
>> +}
>> +
[...]
>> +static int bpf_tcp_ulp_register(void)
>> +{
>> + tcp_bpf_proto = tcp_prot;
>> + tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
>> + tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
>> + return tcp_register_ulp(&bpf_tcp_ulp_ops);
>
> I don't see corresponding tcp_unregister_ulp().
>
There is none. tcp_register_ulp() adds the bpf_tcp_ulp to the list of
available ULPs and never removes it. To remove it we would have to
keep a ref count on the reg/unreg calls. This would require a couple
more patches to the ULP infra and not sure it hurts to leave the ULP
reference around...
Maybe a follow on patch? Or else it could be a patch in front of this
patch.
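Roughly, a refcounted variant could look like the below (hypothetical
sketch only, not something in this series):

	static DEFINE_MUTEX(bpf_tcp_ulp_lock);
	static int bpf_tcp_ulp_users;

	static int bpf_tcp_ulp_get(void)
	{
		int err = 0;

		mutex_lock(&bpf_tcp_ulp_lock);
		if (!bpf_tcp_ulp_users++) {
			err = bpf_tcp_ulp_register();
			if (err)
				bpf_tcp_ulp_users--;
		}
		mutex_unlock(&bpf_tcp_ulp_lock);
		return err;
	}

	static void bpf_tcp_ulp_put(void)
	{
		mutex_lock(&bpf_tcp_ulp_lock);
		if (!--bpf_tcp_ulp_users)
			tcp_unregister_ulp(&bpf_tcp_ulp_ops);
		mutex_unlock(&bpf_tcp_ulp_lock);
	}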
>> @@ -710,6 +1151,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
>> */
>> verdict = READ_ONCE(stab->bpf_verdict);
>> parse = READ_ONCE(stab->bpf_parse);
>> + tx_msg = READ_ONCE(stab->bpf_tx_msg);
>>
>> if (parse && verdict) {
>> /* bpf prog refcnt may be zero if a concurrent attach operation
>> @@ -728,6 +1170,17 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
>> }
>> }
>>
>> + if (tx_msg) {
>> + tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg);
>
> prog_inc_not_zero() looks scary here.
> Why is 'not_zero' necessary?
>
Same reason we use inc_not_zero variants with the verdict/parse
programs,
/* bpf prog refcnt may be zero if a concurrent attach operation
* removes the program after the above READ_ONCE() but before
* we increment the refcnt. If this is the case abort with an
* error.
*/
>> + if (IS_ERR(tx_msg)) {
>> + if (verdict)
>> + bpf_prog_put(verdict);
>> + if (parse)
>> + bpf_prog_put(parse);
>> + return PTR_ERR(tx_msg);
>> + }
>> + }
>> +
Thanks!
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
2018-01-17 5:49 ` John Fastabend
@ 2018-01-17 6:20 ` Alexei Starovoitov
2018-01-17 20:32 ` John Fastabend
0 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2018-01-17 6:20 UTC (permalink / raw)
To: John Fastabend; +Cc: borkmann, ast, netdev, kafai
On Tue, Jan 16, 2018 at 09:49:16PM -0800, John Fastabend wrote:
>
> > but this program will see only first SG ?
>
> Correct, to read further into the msg we would need to have a helper
> or some other way to catch reads/writes past the first 4k and read
> the next sg. We have the same limitation in cls_bpf.
>
> I have started a patch on top of this series with the current working
> title msg_apply_bytes(int bytes). This would let us apply a verdict to
> a set number of bytes instead of the entire msg. By calling
> msg_apply_bytes(data_end - data) a user who needs to read an entire msg
> could do this in 4k chunks until the entire msg is passed through the
> bpf prog.
good idea.
I think it would be good to add this helper as part of this patch set
to make sure there is a way for user to look through the whole
tcp stream if the program really wants to.
I understand that program cannot examine every byte anyway
due to lack of loops and helpers, but this part of sockmap api
should still provide an interface from day one.
One example would be the program parsing http2 or similar
where in the header it sees length. Then it can do
msg_apply_bytes(length)
to skip the bytes it processed, but still continue within
the same 64Kbyte chunk when 0 < length < 64k
> > and it's typically going to be one page only ?
>
> yep
>
> > then what's the value of waiting for MAX_SKB_FRAGS ?
> >
> It's not waiting for MAX_SKB_FRAGS, it's simply copying up to MAX_SKB_FRAGS
> pages in one call if possible. It seems better to me to run this loop
> over as much data as we can.
agree on trying to do MAX_SKB_FRAGS as a 'processing unit',
but program should still be able to skip or redirect
parts of the bytes and not the whole 64k chunk.
From program point of view it should never see or worry about
SG list boundaries whereas right now it seems that below code
is dealing with them (though program doesn't know where sg ends):
> +
> + switch (eval) {
> + case __SK_PASS:
> + sg_mark_end(sg + sg_num - 1);
> + err = bpf_tcp_push(sk, sg, &sg_curr, flags, true);
> + if (unlikely(err)) {
> + copied -= free_sg(sk, sg, sg_curr, sg_num);
> + goto out_err;
> + }
> + break;
> + case __SK_REDIRECT:
> + sg_mark_end(sg + sg_num - 1);
> + goto do_redir;
...
> >> +static int bpf_tcp_ulp_register(void)
> >> +{
> >> + tcp_bpf_proto = tcp_prot;
> >> + tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
> >> + tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
> >> + return tcp_register_ulp(&bpf_tcp_ulp_ops);
> >
> > I don't see corresponding tcp_unregister_ulp().
> >
>
> There is none. tcp_register_ulp() adds the bpf_tcp_ulp to the list of
> available ULPs and never removes it. To remove it we would have to
> keep a ref count on the reg/unreg calls. This would require a couple
> more patches to the ULP infra and not sure it hurts to leave the ULP
> reference around...
>
> Maybe a follow on patch? Or else it could be a patch in front of this
> patch.
I see. I'm ok with leaving that for later.
It doesn't hurt to keep it registered. Please add a comment though.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
2018-01-17 6:20 ` Alexei Starovoitov
@ 2018-01-17 20:32 ` John Fastabend
0 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2018-01-17 20:32 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: borkmann, ast, netdev, kafai
On 01/16/2018 10:20 PM, Alexei Starovoitov wrote:
> On Tue, Jan 16, 2018 at 09:49:16PM -0800, John Fastabend wrote:
>>
>>> but this program will see only first SG ?
>>
>> Correct, to read further into the msg we would need to have a helper
>> or some other way to catch reads/writes past the first 4k and read
>> the next sg. We have the same limitation in cls_bpf.
>>
>> I have started a patch on top of this series with the current working
>> title msg_apply_bytes(int bytes). This would let us apply a verdict to
>> a set number of bytes instead of the entire msg. By calling
>> msg_apply_bytes(data_end - data) a user who needs to read an entire msg
>> could do this in 4k chunks until the entire msg is passed through the
>> bpf prog.
>
> good idea.
> I think it would be good to add this helper as part of this patch set
> to make sure there is a way for user to look through the whole
> tcp stream if the program really wants to.
Sure I'll add it to this series.
> I understand that program cannot examine every byte anyway
> due to lack of loops and helpers, but this part of sockmap api
> should still provide an interface from day one.
> One example would be the program parsing http2 or similar
> where in the header it sees length. Then it can do
> msg_apply_bytes(length)
> to skip the bytes it processed, but still continue within
> the same 64Kbyte chunk when 0 < length < 64k
>
Yep, this is my use case.
>>> and it's typically going to be one page only ?
>>
>> yep
>>
>>> then what's the value of waiting for MAX_SKB_FRAGS ?
>>>
>> It's not waiting for MAX_SKB_FRAGS, it's simply copying up to MAX_SKB_FRAGS
>> pages in one call if possible. It seems better to me to run this loop
>> over as much data as we can.
>
> agree on trying to do MAX_SKB_FRAGS as a 'processing unit',
> but program should still be able to skip or redirect
> parts of the bytes and not the whole 64k chunk.
> From program point of view it should never see or worry about
> SG list boundaries whereas right now it seems that below code
> is dealing with them (though program doesn't know where sg ends):
>
The apply_bytes() helper allows skipping or redirecting parts of
the bytes rather than the whole 64k chunk.
>> +
>> + switch (eval) {
>> + case __SK_PASS:
>> + sg_mark_end(sg + sg_num - 1);
All this does is tell the TCP sendpage call that after this sg entry
we don't have any more data, so go ahead and flush it.
>> + err = bpf_tcp_push(sk, sg, &sg_curr, flags, true);
>> + if (unlikely(err)) {
>> + copied -= free_sg(sk, sg, sg_curr, sg_num);
>> + goto out_err;
>> + }
>> + break;
>> + case __SK_REDIRECT:
>> + sg_mark_end(sg + sg_num - 1);
>> + goto do_redir;
> ...
>>>> +static int bpf_tcp_ulp_register(void)
>>>> +{
>>>> + tcp_bpf_proto = tcp_prot;
>>>> + tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
>>>> + tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
>>>> + return tcp_register_ulp(&bpf_tcp_ulp_ops);
>>>
>>> I don't see corresponding tcp_unregister_ulp().
>>>
>>
>> There is none. tcp_register_ulp() adds the bpf_tcp_ulp to the list of
>> available ULPs and never removes it. To remove it we would have to
>> keep a ref count on the reg/unreg calls. This would require a couple
>> more patches to the ULP infra and not sure it hurts to leave the ULP
>> reference around...
>>
>> Maybe a follow on patch? Or else it could be a patch in front of this
>> patch.
>
> I see. I'm ok with leaving that for later.
> It doesn't hurt to keep it registered. Please add a comment though.
>
OK will add comment.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
2018-01-12 18:11 ` [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data John Fastabend
2018-01-17 2:25 ` Alexei Starovoitov
@ 2018-01-17 22:04 ` Martin KaFai Lau
2018-01-18 17:27 ` John Fastabend
1 sibling, 1 reply; 18+ messages in thread
From: Martin KaFai Lau @ 2018-01-17 22:04 UTC (permalink / raw)
To: John Fastabend; +Cc: borkmann, ast, netdev
On Fri, Jan 12, 2018 at 10:11:11AM -0800, John Fastabend wrote:
> This implements a BPF ULP layer to allow policy enforcement and
> monitoring at the socket layer. In order to support this a new
> program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
> the sendmsg/sendpage hook. To attach the policy to sockets a
> sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
>
> Similar to previous sockmap usages when a sock is added to a
> sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
> program type attached then the BPF ULP layer is created on the
> socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
> every msg in sendmsg case and page/offset in sendpage case.
>
> BPF_PROG_TYPE_SK_MSG Semantics/API:
>
> BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
> SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
> case and in the sendpage case leaves the data untouched. Both cases
> return -EACESS to the user. Returning SK_PASS will allow the msg to
> be sent.
>
> In the sendmsg case data is copied into kernel space buffers before
> running the BPF program. In the sendpage case data is never copied.
> The implication being users may change data after BPF programs run in
> the sendpage case. (A flag will be added to always copy shortly
> if the copy must always be performed).
>
> The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
> in the sendmsg() case and the entire page/offset in the sendpage case.
> This avoids ambiguity on how to handle mixed return codes in the
> sendmsg case. The readable/writeable data provided to the program
> in the sendmsg case may not be the entire message, in fact for
> large sends this is likely the case. The data range that can be
> read is part of the sk_msg_md structure. This is because similar
> to the tc bpf_cls case the data is stored in a scatter gather list.
> Future work will address this short-coming to allow users to pull
> in more data if needed (similar to TC BPF).
>
> The helper msg_redirect_map() can be used to select the socket to
> send the data on. This is used similarly to existing redirect use
> cases. This allows policy to redirect msgs.
>
> Pseudo code simple example:
>
> The basic logic to attach a program to a socket is as follows,
>
> // load the programs
> bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
> &obj, &msg_prog);
>
> // lookup the sockmap
> bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
>
> // get fd for sockmap
> map_fd_msg = bpf_map__fd(bpf_map_msg);
>
> // attach program to sockmap
> bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
>
> Adding sockets to the map is done in the normal way,
>
> // Add a socket 'fd' to sockmap at location 'i'
> bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);
>
> After the above any socket attached to "my_sock_map", in this case
> 'fd', will run the BPF msg verdict program (msg_prog) on every
> sendmsg and sendpage system call.
>
> For a complete example see BPF selftests bpf/sockmap_tcp_msg_*.c and
> test_maps.c
>
> Implementation notes:
>
> It seemed the simplest, to me at least, to use a refcnt to ensure
> psock is not lost across the sendmsg copy into the sg, the bpf program
> running on the data in sg_data, and the final pass to the TCP stack.
> Some performance testing may show a better method to do this and avoid
> the refcnt cost, but for now use the simpler method.
>
> Another item that will come after basic support is in place is
> supporting the MSG_MORE flag. At the moment we call sendpages even if
> the MSG_MORE flag is set. An enhancement would be to collect the
> pages into a larger scatterlist and pass it down the stack. Notice that
> bpf_tcp_sendmsg() could support this with some additional state saved
> across sendmsg calls. I built the code to support this without having
> to do refactoring work. Other flags TBD include the ZEROCOPY flag.
>
> Yet another detail that needs some thought is the size of the scatterlist.
> Currently, we use MAX_SKB_FRAGS simply because this was being used
> already in the TLS case. Future work to improve the kernel sk APIs to
> tune this depending on workload may be useful. This is a trade-off
> between memory usage and B/s performance.
Some minor comments/nits below:
>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
> include/linux/bpf.h | 1
> include/linux/bpf_types.h | 1
> include/linux/filter.h | 10 +
> include/net/tcp.h | 2
> include/uapi/linux/bpf.h | 28 +++
> kernel/bpf/sockmap.c | 485 ++++++++++++++++++++++++++++++++++++++++++++-
> kernel/bpf/syscall.c | 14 +
> kernel/bpf/verifier.c | 5
> net/core/filter.c | 106 ++++++++++
> 9 files changed, 638 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 9e03046..14cdb4d 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -21,6 +21,7 @@
> struct perf_event;
> struct bpf_prog;
> struct bpf_map;
> +struct sock;
>
> /* map is generic key/value storage optionally accesible by eBPF programs */
> struct bpf_map_ops {
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 19b8349..5e2e8a4 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -13,6 +13,7 @@
> BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
> BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
> BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
> #endif
> #ifdef CONFIG_BPF_EVENTS
> BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 425056c..f1e9833 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -507,6 +507,15 @@ struct xdp_buff {
> struct xdp_rxq_info *rxq;
> };
>
> +struct sk_msg_buff {
> + void *data;
> + void *data_end;
> + struct scatterlist sg_data[MAX_SKB_FRAGS];
> + __u32 key;
> + __u32 flags;
> + struct bpf_map *map;
> +};
> +
> /* Compute the linear packet data range [data, data_end) which
> * will be accessed by various program types (cls_bpf, act_bpf,
> * lwt, ...). Subsystems allowing direct data access must (!)
> @@ -769,6 +778,7 @@ int xdp_do_redirect(struct net_device *dev,
> void bpf_warn_invalid_xdp_action(u32 act);
>
> struct sock *do_sk_redirect_map(struct sk_buff *skb);
> +struct sock *do_msg_redirect_map(struct sk_msg_buff *md);
>
> #ifdef CONFIG_BPF_JIT
> extern int bpf_jit_enable;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index a99ceb8..7f56c3c 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1984,6 +1984,7 @@ static inline void tcp_listendrop(const struct sock *sk)
>
> enum {
> TCP_ULP_TLS,
> + TCP_ULP_BPF,
> };
>
> struct tcp_ulp_ops {
> @@ -2001,6 +2002,7 @@ struct tcp_ulp_ops {
> int tcp_register_ulp(struct tcp_ulp_ops *type);
> void tcp_unregister_ulp(struct tcp_ulp_ops *type);
> int tcp_set_ulp(struct sock *sk, const char *name);
> +int tcp_set_ulp_id(struct sock *sk, const int ulp);
> void tcp_get_available_ulp(char *buf, size_t len);
> void tcp_cleanup_ulp(struct sock *sk);
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 405317f..bf649ae 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -133,6 +133,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_SOCK_OPS,
> BPF_PROG_TYPE_SK_SKB,
> BPF_PROG_TYPE_CGROUP_DEVICE,
> + BPF_PROG_TYPE_SK_MSG,
> };
>
> enum bpf_attach_type {
> @@ -143,6 +144,7 @@ enum bpf_attach_type {
> BPF_SK_SKB_STREAM_PARSER,
> BPF_SK_SKB_STREAM_VERDICT,
> BPF_CGROUP_DEVICE,
> + BPF_SK_MSG_VERDICT,
> __MAX_BPF_ATTACH_TYPE
> };
>
> @@ -687,6 +689,15 @@ enum bpf_attach_type {
> * int bpf_override_return(pt_regs, rc)
> * @pt_regs: pointer to struct pt_regs
> * @rc: the return value to set
> + *
> + * int bpf_msg_redirect_map(map, key, flags)
> + * Redirect msg to a sock in map using key as a lookup key for the
> + * sock in map.
> + * @map: pointer to sockmap
> + * @key: key to lookup sock in map
> + * @flags: reserved for future use
> + * Return: SK_PASS
> + *
> */
> #define __BPF_FUNC_MAPPER(FN) \
> FN(unspec), \
> @@ -747,7 +758,8 @@ enum bpf_attach_type {
> FN(perf_event_read_value), \
> FN(perf_prog_read_value), \
> FN(getsockopt), \
> - FN(override_return),
> + FN(override_return), \
> + FN(msg_redirect_map),
>
> /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> * function eBPF program intends to call
> @@ -909,6 +921,20 @@ enum sk_action {
> SK_PASS,
> };
>
> +/* User return codes for SK_MSG prog type. */
> +enum sk_msg_action {
> + SK_MSG_DROP = 0,
> + SK_MSG_PASS,
> +};
> +
> +/* user accessible metadata for SK_MSG packet hook, new fields must
> + * be added to the end of this structure
> + */
> +struct sk_msg_md {
> + __u32 data;
> + __u32 data_end;
> +};
> +
> #define BPF_TAG_SIZE 8
>
> struct bpf_prog_info {
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 972608f..5793f3a 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -38,6 +38,7 @@
> #include <linux/skbuff.h>
> #include <linux/workqueue.h>
> #include <linux/list.h>
> +#include <linux/mm.h>
> #include <net/strparser.h>
> #include <net/tcp.h>
>
> @@ -47,6 +48,7 @@
> struct bpf_stab {
> struct bpf_map map;
> struct sock **sock_map;
> + struct bpf_prog *bpf_tx_msg;
> struct bpf_prog *bpf_parse;
> struct bpf_prog *bpf_verdict;
> };
> @@ -74,6 +76,7 @@ struct smap_psock {
> struct sk_buff *save_skb;
>
> struct strparser strp;
> + struct bpf_prog *bpf_tx_msg;
> struct bpf_prog *bpf_parse;
> struct bpf_prog *bpf_verdict;
> struct list_head maps;
> @@ -90,6 +93,8 @@ struct smap_psock {
> void (*save_state_change)(struct sock *sk);
> };
>
> +static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
> +
> static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
> {
> return rcu_dereference_sk_user_data(sk);
> @@ -99,8 +104,439 @@ enum __sk_action {
> __SK_DROP = 0,
> __SK_PASS,
> __SK_REDIRECT,
> + __SK_NONE,
> };
>
> +static int memcopy_from_iter(struct sock *sk, struct scatterlist *sg,
> + int sg_num, struct iov_iter *from, int bytes)
> +{
> + int i, rc = 0;
> +
> + for (i = 0; i < sg_num; ++i) {
> + int copy = sg[i].length;
> + char *to = sg_virt(&sg[i]);
> +
> + if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY)
> + rc = copy_from_iter_nocache(to, copy, from);
> + else
> + rc = copy_from_iter(to, copy, from);
> +
> + if (rc != copy) {
> + rc = -EFAULT;
> + goto out;
> + }
> +
> + bytes -= copy;
> + if (!bytes)
> + break;
> + }
> +out:
> + return rc;
> +}
> +
> +static int bpf_tcp_push(struct sock *sk, struct scatterlist *sg,
> + int *sg_end, int flags, bool charge)
> +{
> + int sendpage_flags = flags | MSG_SENDPAGE_NOTLAST;
> + int offset, ret = 0;
> + struct page *p;
> + size_t size;
> +
> + size = sg->length;
> + offset = sg->offset;
> +
> + while (1) {
> + if (sg_is_last(sg))
> + sendpage_flags = flags;
> +
> + tcp_rate_check_app_limited(sk);
> + p = sg_page(sg);
> +retry:
> + ret = do_tcp_sendpages(sk, p, offset, size, sendpage_flags);
> + if (ret != size) {
> + if (ret > 0) {
> + offset += ret;
> + size -= ret;
> + goto retry;
> + }
> +
> + if (charge)
> + sk_mem_uncharge(sk,
> + sg->length - size - sg->offset);
> +
> + sg->offset = offset;
> + sg->length = size;
> + return ret;
> + }
> +
> + put_page(p);
> + if (charge)
> + sk_mem_uncharge(sk, sg->length);
> + *sg_end += 1;
> + sg = sg_next(sg);
> + if (!sg)
> + break;
> +
> + offset = sg->offset;
> + size = sg->length;
> + }
> +
> + return 0;
> +}
> +
> +static inline void bpf_compute_data_pointers_sg(struct sk_msg_buff *md)
> +{
> + md->data = sg_virt(md->sg_data);
> + md->data_end = md->data + md->sg_data->length;
> +}
> +
> +static void return_mem_sg(struct sock *sk, struct scatterlist *sg, int end)
> +{
> + int i;
> +
> + for (i = 0; i < end; ++i)
> + sk_mem_uncharge(sk, sg[i].length);
> +}
> +
> +static int free_sg(struct sock *sk, struct scatterlist *sg, int start, int len)
> +{
> + int i, free = 0;
> +
> + for (i = start; i < len; ++i) {
> + free += sg[i].length;
> + sk_mem_uncharge(sk, sg[i].length);
> + put_page(sg_page(&sg[i]));
> + }
> +
> + return free;
> +}
> +
> +static unsigned int smap_do_tx_msg(struct sock *sk,
> + struct smap_psock *psock,
> + struct sk_msg_buff *md)
> +{
> + struct bpf_prog *prog;
> + unsigned int rc, _rc;
> +
> + preempt_disable();
Why is preempt_disable() needed?
> + rcu_read_lock();
> +
> + /* If the policy was removed mid-send then default to 'accept' */
> + prog = READ_ONCE(psock->bpf_tx_msg);
> + if (unlikely(!prog)) {
> + _rc = SK_PASS;
> + goto verdict;
> + }
> +
> + bpf_compute_data_pointers_sg(md);
> + _rc = (*prog->bpf_func)(md, prog->insnsi);
> +
> +verdict:
> + rcu_read_unlock();
> + preempt_enable();
> +
> + /* Moving return codes from UAPI namespace into internal namespace */
> + rc = ((_rc == SK_PASS) ?
> + (md->map ? __SK_REDIRECT : __SK_PASS) :
> + __SK_DROP);
> +
> + return rc;
> +}
> +
> +static int bpf_tcp_sendmsg_do_redirect(struct scatterlist *sg, int sg_num,
> + struct sk_msg_buff *md, int flags)
> +{
> + int i, sg_curr = 0, err, free;
> + struct smap_psock *psock;
> + struct sock *sk;
> +
> + rcu_read_lock();
> + sk = do_msg_redirect_map(md);
> + if (unlikely(!sk))
> + goto out_rcu;
> +
> + psock = smap_psock_sk(sk);
> + if (unlikely(!psock))
> + goto out_rcu;
> +
> + if (!refcount_inc_not_zero(&psock->refcnt))
> + goto out_rcu;
> +
> + rcu_read_unlock();
> + lock_sock(sk);
> + err = bpf_tcp_push(sk, sg, &sg_curr, flags, false);
> + if (unlikely(err))
> + goto out;
> + release_sock(sk);
> + smap_release_sock(psock, sk);
> + return 0;
> +out_rcu:
> + rcu_read_unlock();
> +out:
> + for (i = sg_curr; i < sg_num; ++i) {
> + free += sg[i].length;
free is not init.
> + put_page(sg_page(&sg[i]));
> + }
> + return free;
> +}
> +
> +static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> +{
> + int err = 0, eval = __SK_NONE, sg_size = 0, sg_num = 0;
> + int flags = msg->msg_flags | MSG_NO_SHARED_FRAGS;
> + struct sk_msg_buff md = {0};
> + struct smap_psock *psock;
> + size_t copy, copied = 0;
> + struct scatterlist *sg;
> + long timeo;
> +
> + sg = md.sg_data;
> + sg_init_table(sg, MAX_SKB_FRAGS);
> +
> + /* Its possible a sock event or user removed the psock _but_ the ops
> + * have not been reprogrammed yet so we get here. In this case fallback
> + * to tcp_sendmsg. Note this only works because we _only_ ever allow
> + * a single ULP there is no hierarchy here.
> + */
> + rcu_read_lock();
> + psock = smap_psock_sk(sk);
> + if (unlikely(!psock)) {
> + rcu_read_unlock();
> + return tcp_sendmsg(sk, msg, size);
> + }
> +
> + /* Increment the psock refcnt to ensure its not released while sending a
> + * message. Required because sk lookup and bpf programs are used in
> + * separate rcu critical sections. Its OK if we lose the map entry
> + * but we can't lose the sock reference, possible when the refcnt hits
> + * zero and garbage collection calls sock_put().
> + */
> + if (!refcount_inc_not_zero(&psock->refcnt)) {
> + rcu_read_unlock();
> + return tcp_sendmsg(sk, msg, size);
> + }
> +
> + rcu_read_unlock();
> +
> + lock_sock(sk);
> + timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
> +
> + while (msg_data_left(msg)) {
> + int sg_curr;
> +
> + if (sk->sk_err) {
> + err = sk->sk_err;
> + goto out_err;
> + }
> +
> + copy = msg_data_left(msg);
> + if (!sk_stream_memory_free(sk))
> + goto wait_for_sndbuf;
> +
> + /* sg_size indicates bytes already allocated and sg_num
> + * is last sg element used. This is used when alloc_sg
s/alloc_sg/sk_alloc_sg/
> + * partially allocates a scatterlist and then is sent
> + * to wait for memory. In normal case (no memory pressure)
> + * both sg_nun and sg_size are zero.
s/sg_nun/sg_num/
> + */
> + copy = copy - sg_size;
> + err = sk_alloc_sg(sk, copy, sg, &sg_num, &sg_size, 0);
> + if (err) {
> + if (err != -ENOSPC)
> + goto wait_for_memory;
> + copy = sg_size;
> + }
> +
> + err = memcopy_from_iter(sk, sg, sg_num, &msg->msg_iter, copy);
> + if (err < 0) {
> + free_sg(sk, sg, 0, sg_num);
> + goto out_err;
> + }
> +
> + copied += copy;
> +
> + /* If msg is larger than MAX_SKB_FRAGS we can send multiple
> + * scatterlists per msg. However BPF decisions apply to the
> + * entire msg.
> + */
> + if (eval == __SK_NONE)
> + eval = smap_do_tx_msg(sk, psock, &md);
> +
> + switch (eval) {
> + case __SK_PASS:
> + sg_mark_end(sg + sg_num - 1);
> + err = bpf_tcp_push(sk, sg, &sg_curr, flags, true);
> + if (unlikely(err)) {
> + copied -= free_sg(sk, sg, sg_curr, sg_num);
> + goto out_err;
> + }
> + break;
> + case __SK_REDIRECT:
> + sg_mark_end(sg + sg_num - 1);
> + goto do_redir;
> + case __SK_DROP:
> + default:
> + copied -= free_sg(sk, sg, 0, sg_num);
> + goto out_err;
> + }
> +
> + sg_num = 0;
> + sg_size = 0;
> + continue;
> +wait_for_sndbuf:
> + set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> +wait_for_memory:
> + err = sk_stream_wait_memory(sk, &timeo);
> + if (err)
> + goto out_err;
> + }
> +out_err:
> + if (err < 0)
> + err = sk_stream_error(sk, msg->msg_flags, err);
> + release_sock(sk);
> + smap_release_sock(psock, sk);
> + return copied ? copied : err;
> +
> +do_redir:
> + /* To avoid deadlock with multiple socks all doing redirects to
> + * each other we must first drop the current sock lock and release
> + * the psock. Then get the redirect socket (assuming it still
> + * exists), take it's lock, and finally do the send here. If the
> + * redirect fails there is nothing to do, we don't want to blame
> + * the sender for remote socket failures. Instead we simply
> + * continue making forward progress.
> + */
> + return_mem_sg(sk, sg, sg_num);
> + release_sock(sk);
> + smap_release_sock(psock, sk);
> + copied -= bpf_tcp_sendmsg_do_redirect(sg, sg_num, &md, flags);
> + return copied;
For the __SK_REDIRECT case, before returning, should 'msg_data_left(msg)' be
checked first? Or will msg_data_left(msg) always be 0 here?
> +}
> +
> +static int bpf_tcp_sendpage_do_redirect(struct page *page, int offset,
> + size_t size, int flags,
> + struct sk_msg_buff *md)
> +{
> + struct smap_psock *psock;
> + struct sock *sk;
> + int rc;
> +
> + rcu_read_lock();
> + sk = do_msg_redirect_map(md);
> + if (unlikely(!sk))
> + goto out_rcu;
> +
> + psock = smap_psock_sk(sk);
> + if (unlikely(!psock))
> + goto out_rcu;
> +
> + if (!refcount_inc_not_zero(&psock->refcnt))
> + goto out_rcu;
> +
> + rcu_read_unlock();
> +
> + lock_sock(sk);
> + rc = tcp_sendpage_locked(sk, page, offset, size, flags);
> + release_sock(sk);
> +
> + smap_release_sock(psock, sk);
> + return rc;
> +out_rcu:
> + rcu_read_unlock();
> + return -EINVAL;
> +}
> +
> +static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
> + int offset, size_t size, int flags)
> +{
> + struct smap_psock *psock;
> + int rc, _rc = __SK_PASS;
> + struct bpf_prog *prog;
> + struct sk_msg_buff md;
> +
> + preempt_disable();
> + rcu_read_lock();
> + psock = smap_psock_sk(sk);
> + if (unlikely(!psock))
> + goto verdict;
> +
> + /* If the policy was removed mid-send then default to 'accept' */
> + prog = READ_ONCE(psock->bpf_tx_msg);
> + if (unlikely(!prog))
> + goto verdict;
> +
> + /* Calculate pkt data pointers and run BPF program */
> + md.data = page_address(page) + offset;
> + md.data_end = md.data + size;
> + _rc = (*prog->bpf_func)(&md, prog->insnsi);
> +
> +verdict:
> + rcu_read_unlock();
> + preempt_enable();
> +
> + /* Moving return codes from UAPI namespace into internal namespace */
> + rc = ((_rc == SK_PASS) ? __SK_PASS : __SK_DROP);
> +
> + switch (rc) {
> + case __SK_PASS:
> + lock_sock(sk);
> + rc = tcp_sendpage_locked(sk, page, offset, size, flags);
> + release_sock(sk);
> + break;
> + case __SK_REDIRECT:
> + smap_release_sock(psock, sk);
Is smap_release_sock() only needed in the __SK_REDIRECT case?
> + rc = bpf_tcp_sendpage_do_redirect(page, offset, size, flags,
> + &md);
> + break;
> + case __SK_DROP:
> + default:
> + rc = -EACCES;
> + }
> +
> + return rc;
> +}
> +
> +static int bpf_tcp_msg_add(struct smap_psock *psock,
> + struct sock *sk,
> + struct bpf_prog *tx_msg)
> +{
> + struct bpf_prog *orig_tx_msg;
> +
> + orig_tx_msg = xchg(&psock->bpf_tx_msg, tx_msg);
> + if (orig_tx_msg)
> + bpf_prog_put(orig_tx_msg);
> +
> + return tcp_set_ulp_id(sk, TCP_ULP_BPF);
> +}
> +
> +struct proto tcp_bpf_proto;
> +static int bpf_tcp_init(struct sock *sk)
> +{
> + sk->sk_prot = &tcp_bpf_proto;
> + return 0;
> +}
> +
> +static void bpf_tcp_release(struct sock *sk)
> +{
> + sk->sk_prot = &tcp_prot;
> +}
> +
> +static struct tcp_ulp_ops bpf_tcp_ulp_ops __read_mostly = {
> + .name = "bpf_tcp",
> + .uid = TCP_ULP_BPF,
> + .owner = NULL,
> + .init = bpf_tcp_init,
> + .release = bpf_tcp_release,
> +};
> +
> +static int bpf_tcp_ulp_register(void)
> +{
> + tcp_bpf_proto = tcp_prot;
> + tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
> + tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
> + return tcp_register_ulp(&bpf_tcp_ulp_ops);
> +}
> +
> static int smap_verdict_func(struct smap_psock *psock, struct sk_buff *skb)
> {
> struct bpf_prog *prog = READ_ONCE(psock->bpf_verdict);
> @@ -165,8 +601,6 @@ static void smap_report_sk_error(struct smap_psock *psock, int err)
> sk->sk_error_report(sk);
> }
>
> -static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
> -
> /* Called with lock_sock(sk) held */
> static void smap_state_change(struct sock *sk)
> {
> @@ -317,6 +751,7 @@ static void smap_write_space(struct sock *sk)
>
> static void smap_stop_sock(struct smap_psock *psock, struct sock *sk)
> {
> + tcp_cleanup_ulp(sk);
> if (!psock->strp_enabled)
> return;
> sk->sk_data_ready = psock->save_data_ready;
> @@ -384,7 +819,6 @@ static int smap_parse_func_strparser(struct strparser *strp,
> return rc;
> }
>
> -
> static int smap_read_sock_done(struct strparser *strp, int err)
> {
> return err;
> @@ -456,6 +890,8 @@ static void smap_gc_work(struct work_struct *w)
> bpf_prog_put(psock->bpf_parse);
> if (psock->bpf_verdict)
> bpf_prog_put(psock->bpf_verdict);
> + if (psock->bpf_tx_msg)
> + bpf_prog_put(psock->bpf_tx_msg);
>
> list_for_each_entry_safe(e, tmp, &psock->maps, list) {
> list_del(&e->list);
> @@ -491,8 +927,7 @@ static struct smap_psock *smap_init_psock(struct sock *sock,
>
> static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
> {
> - struct bpf_stab *stab;
> - int err = -EINVAL;
> + struct bpf_stab *stab; int err = -EINVAL;
> u64 cost;
>
> if (!capable(CAP_NET_ADMIN))
> @@ -506,6 +941,10 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
> if (attr->value_size > KMALLOC_MAX_SIZE)
> return ERR_PTR(-E2BIG);
>
> + err = bpf_tcp_ulp_register();
> + if (err && err != -EEXIST)
> + return ERR_PTR(err);
> +
> stab = kzalloc(sizeof(*stab), GFP_USER);
> if (!stab)
> return ERR_PTR(-ENOMEM);
> @@ -590,6 +1029,8 @@ static void sock_map_free(struct bpf_map *map)
> bpf_prog_put(stab->bpf_verdict);
> if (stab->bpf_parse)
> bpf_prog_put(stab->bpf_parse);
> + if (stab->bpf_tx_msg)
> + bpf_prog_put(stab->bpf_tx_msg);
>
> sock_map_remove_complete(stab);
> }
> @@ -684,7 +1125,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
> {
> struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
> struct smap_psock_map_entry *e = NULL;
> - struct bpf_prog *verdict, *parse;
> + struct bpf_prog *verdict, *parse, *tx_msg;
> struct sock *osock, *sock;
> struct smap_psock *psock;
> u32 i = *(u32 *)key;
> @@ -710,6 +1151,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
> */
> verdict = READ_ONCE(stab->bpf_verdict);
> parse = READ_ONCE(stab->bpf_parse);
> + tx_msg = READ_ONCE(stab->bpf_tx_msg);
>
> if (parse && verdict) {
> /* bpf prog refcnt may be zero if a concurrent attach operation
> @@ -728,6 +1170,17 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
> }
> }
>
> + if (tx_msg) {
> + tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg);
> + if (IS_ERR(tx_msg)) {
> + if (verdict)
> + bpf_prog_put(verdict);
> + if (parse)
> + bpf_prog_put(parse);
> + return PTR_ERR(tx_msg);
> + }
> + }
> +
> write_lock_bh(&sock->sk_callback_lock);
> psock = smap_psock_sk(sock);
>
> @@ -742,7 +1195,14 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
> err = -EBUSY;
> goto out_progs;
> }
> - refcount_inc(&psock->refcnt);
> + if (READ_ONCE(psock->bpf_tx_msg) && tx_msg) {
> + err = -EBUSY;
> + goto out_progs;
> + }
> + if (!refcount_inc_not_zero(&psock->refcnt)) {
> + err = -EAGAIN;
> + goto out_progs;
> + }
> } else {
> psock = smap_init_psock(sock, stab);
> if (IS_ERR(psock)) {
> @@ -763,6 +1223,12 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
> /* 3. At this point we have a reference to a valid psock that is
> * running. Attach any BPF programs needed.
> */
> + if (tx_msg) {
> + err = bpf_tcp_msg_add(psock, sock, tx_msg);
> + if (err)
> + goto out_free;
> + }
> +
> if (parse && verdict && !psock->strp_enabled) {
> err = smap_init_sock(psock, sock);
> if (err)
> @@ -798,6 +1264,8 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
> bpf_prog_put(verdict);
> if (parse)
> bpf_prog_put(parse);
> + if (tx_msg)
> + bpf_prog_put(tx_msg);
> write_unlock_bh(&sock->sk_callback_lock);
> kfree(e);
> return err;
> @@ -812,6 +1280,9 @@ int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
> return -EINVAL;
>
> switch (type) {
> + case BPF_SK_MSG_VERDICT:
> + orig = xchg(&stab->bpf_tx_msg, prog);
> + break;
> case BPF_SK_SKB_STREAM_PARSER:
> orig = xchg(&stab->bpf_parse, prog);
> break;
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index ebf0fb2..d32f093 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1267,7 +1267,8 @@ static int bpf_obj_get(const union bpf_attr *attr)
>
> #define BPF_PROG_ATTACH_LAST_FIELD attach_flags
>
> -static int sockmap_get_from_fd(const union bpf_attr *attr, bool attach)
> +static int sockmap_get_from_fd(const union bpf_attr *attr,
> + int type, bool attach)
> {
> struct bpf_prog *prog = NULL;
> int ufd = attr->target_fd;
> @@ -1281,8 +1282,7 @@ static int sockmap_get_from_fd(const union bpf_attr *attr, bool attach)
> return PTR_ERR(map);
>
> if (attach) {
> - prog = bpf_prog_get_type(attr->attach_bpf_fd,
> - BPF_PROG_TYPE_SK_SKB);
> + prog = bpf_prog_get_type(attr->attach_bpf_fd, type);
> if (IS_ERR(prog)) {
> fdput(f);
> return PTR_ERR(prog);
> @@ -1334,9 +1334,11 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> case BPF_CGROUP_DEVICE:
> ptype = BPF_PROG_TYPE_CGROUP_DEVICE;
> break;
> + case BPF_SK_MSG_VERDICT:
> + return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_MSG, true);
> case BPF_SK_SKB_STREAM_PARSER:
> case BPF_SK_SKB_STREAM_VERDICT:
> - return sockmap_get_from_fd(attr, true);
> + return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, true);
> default:
> return -EINVAL;
> }
> @@ -1389,9 +1391,11 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> case BPF_CGROUP_DEVICE:
> ptype = BPF_PROG_TYPE_CGROUP_DEVICE;
> break;
> + case BPF_SK_MSG_VERDICT:
> + return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_MSG, false);
> case BPF_SK_SKB_STREAM_PARSER:
> case BPF_SK_SKB_STREAM_VERDICT:
> - return sockmap_get_from_fd(attr, false);
> + return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, false);
> default:
> return -EINVAL;
> }
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index a2b2112..15c5c2a 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1240,6 +1240,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
> case BPF_PROG_TYPE_XDP:
> case BPF_PROG_TYPE_LWT_XMIT:
> case BPF_PROG_TYPE_SK_SKB:
> + case BPF_PROG_TYPE_SK_MSG:
> if (meta)
> return meta->pkt_access;
>
> @@ -2041,7 +2042,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> case BPF_MAP_TYPE_SOCKMAP:
> if (func_id != BPF_FUNC_sk_redirect_map &&
> func_id != BPF_FUNC_sock_map_update &&
> - func_id != BPF_FUNC_map_delete_elem)
> + func_id != BPF_FUNC_map_delete_elem &&
> + func_id != BPF_FUNC_msg_redirect_map)
> goto error;
> break;
> default:
> @@ -2079,6 +2081,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> goto error;
> break;
> case BPF_FUNC_sk_redirect_map:
> + case BPF_FUNC_msg_redirect_map:
> if (map->map_type != BPF_MAP_TYPE_SOCKMAP)
> goto error;
> break;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index acdb94c..ca87b8d 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1881,6 +1881,44 @@ struct sock *do_sk_redirect_map(struct sk_buff *skb)
> .arg4_type = ARG_ANYTHING,
> };
>
> +BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg_buff *, msg,
> + struct bpf_map *, map, u32, key, u64, flags)
> +{
> + /* If user passes invalid input drop the packet. */
> + if (unlikely(flags))
> + return SK_DROP;
> +
> + msg->key = key;
> + msg->flags = flags;
> + msg->map = map;
> +
> + return SK_PASS;
> +}
> +
> +struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
> +{
> + struct sock *sk = NULL;
> +
> + if (msg->map) {
> + sk = __sock_map_lookup_elem(msg->map, msg->key);
> +
> + msg->key = 0;
> + msg->map = NULL;
> + }
> +
> + return sk;
> +}
> +
> +static const struct bpf_func_proto bpf_msg_redirect_map_proto = {
> + .func = bpf_msg_redirect_map,
> + .gpl_only = false,
> + .ret_type = RET_INTEGER,
> + .arg1_type = ARG_PTR_TO_CTX,
> + .arg2_type = ARG_CONST_MAP_PTR,
> + .arg3_type = ARG_ANYTHING,
> + .arg4_type = ARG_ANYTHING,
> +};
> +
> BPF_CALL_1(bpf_get_cgroup_classid, const struct sk_buff *, skb)
> {
> return task_get_classid(skb);
> @@ -3513,6 +3551,16 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
> }
> }
>
> +static const struct bpf_func_proto *sk_msg_func_proto(enum bpf_func_id func_id)
> +{
> + switch (func_id) {
> + case BPF_FUNC_msg_redirect_map:
> + return &bpf_msg_redirect_map_proto;
> + default:
> + return bpf_base_func_proto(func_id);
> + }
> +}
> +
> static const struct bpf_func_proto *sk_skb_func_proto(enum bpf_func_id func_id)
> {
> switch (func_id) {
> @@ -3892,6 +3940,32 @@ static bool sk_skb_is_valid_access(int off, int size,
> return bpf_skb_is_valid_access(off, size, type, info);
> }
>
> +static bool sk_msg_is_valid_access(int off, int size,
> + enum bpf_access_type type,
> + struct bpf_insn_access_aux *info)
> +{
> + if (type == BPF_WRITE)
> + return false;
> +
> + switch (off) {
> + case offsetof(struct sk_msg_md, data):
> + info->reg_type = PTR_TO_PACKET;
> + break;
> + case offsetof(struct sk_msg_md, data_end):
> + info->reg_type = PTR_TO_PACKET_END;
> + break;
> + }
> +
> + if (off < 0 || off >= sizeof(struct sk_msg_md))
> + return false;
> + if (off % size != 0)
> + return false;
> + if (size != sizeof(__u32))
> + return false;
> +
> + return true;
> +}
> +
> static u32 bpf_convert_ctx_access(enum bpf_access_type type,
> const struct bpf_insn *si,
> struct bpf_insn *insn_buf,
> @@ -4522,6 +4596,29 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
> return insn - insn_buf;
> }
>
> +static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
> + const struct bpf_insn *si,
> + struct bpf_insn *insn_buf,
> + struct bpf_prog *prog, u32 *target_size)
> +{
> + struct bpf_insn *insn = insn_buf;
> +
> + switch (si->off) {
> + case offsetof(struct sk_msg_md, data):
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data),
> + si->dst_reg, si->src_reg,
> + offsetof(struct sk_msg_buff, data));
> + break;
> + case offsetof(struct sk_msg_md, data_end):
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data_end),
> + si->dst_reg, si->src_reg,
> + offsetof(struct sk_msg_buff, data_end));
> + break;
> + }
> +
> + return insn - insn_buf;
> +}
> +
> const struct bpf_verifier_ops sk_filter_verifier_ops = {
> .get_func_proto = sk_filter_func_proto,
> .is_valid_access = sk_filter_is_valid_access,
> @@ -4611,6 +4708,15 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
> const struct bpf_prog_ops sk_skb_prog_ops = {
> };
>
> +const struct bpf_verifier_ops sk_msg_verifier_ops = {
> + .get_func_proto = sk_msg_func_proto,
> + .is_valid_access = sk_msg_is_valid_access,
> + .convert_ctx_access = sk_msg_convert_ctx_access,
> +};
> +
> +const struct bpf_prog_ops sk_msg_prog_ops = {
> +};
> +
> int sk_detach_filter(struct sock *sk)
> {
> int ret = -ENOENT;
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
2018-01-17 22:04 ` Martin KaFai Lau
@ 2018-01-18 17:27 ` John Fastabend
0 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2018-01-18 17:27 UTC (permalink / raw)
To: Martin KaFai Lau; +Cc: borkmann, ast, netdev
On 01/17/2018 02:04 PM, Martin KaFai Lau wrote:
> On Fri, Jan 12, 2018 at 10:11:11AM -0800, John Fastabend wrote:
>> This implements a BPF ULP layer to allow policy enforcement and
>> monitoring at the socket layer. In order to support this a new
>> program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
>> the sendmsg/sendpage hook. To attach the policy to sockets a
>> sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
[...]
> Some minor comments/nits below:
Thanks for reviewing!
>> +
>> +static unsigned int smap_do_tx_msg(struct sock *sk,
>> + struct smap_psock *psock,
>> + struct sk_msg_buff *md)
>> +{
>> + struct bpf_prog *prog;
>> + unsigned int rc, _rc;
>> +
>> + preempt_disable();
> Why is preempt_disable() needed?
If we run a BPF program with preemption enabled, my reading of the BPF
code path is that we may race with other invocations of the program. For
example, program A gets preempted during an update to an array map
element, program B updates the same map entry, and program A then
updates some piece of the entry. The result is garbage in the array
element. Just one example.
With PREEMPT-RCU the above seems possible.
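As a contrived illustration (hypothetical 'stats' array map and 'key' in an
SK_MSG program), a plain read-modify-write like this is only safe if the
program cannot be preempted mid-update:

        long *cnt = bpf_map_lookup_elem(&stats, &key);
        if (cnt)
                *cnt += 1;      /* non-atomic read-modify-write: if the
                                 * program is preempted here, another
                                 * invocation on the same CPU can interleave
                                 * and the final value is garbage
                                 */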
>
>> + rcu_read_lock();
>> +
>> + /* If the policy was removed mid-send then default to 'accept' */
>> + prog = READ_ONCE(psock->bpf_tx_msg);
>> + if (unlikely(!prog)) {
>> + _rc = SK_PASS;
>> + goto verdict;
>> + }
>> +
>> + bpf_compute_data_pointers_sg(md);
>> + _rc = (*prog->bpf_func)(md, prog->insnsi);
>> +
>> +verdict:
>> + rcu_read_unlock();
>> + preempt_enable();
>> +
>> + /* Moving return codes from UAPI namespace into internal namespace */
>> + rc = ((_rc == SK_PASS) ?
>> + (md->map ? __SK_REDIRECT : __SK_PASS) :
>> + __SK_DROP);
>> +
>> + return rc;
>> +}
>> +
[...]
>> +out:
>> + for (i = sg_curr; i < sg_num; ++i) {
>> + free += sg[i].length;
> free is not init.
>
Nice catch thanks.
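Will init it in v2, i.e.:

        int i, sg_curr = 0, err, free = 0;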
>> + put_page(sg_page(&sg[i]));
>> + }
>> + return free;
>> +}
>> +
>> +
>> + /* sg_size indicates bytes already allocated and sg_num
>> + * is last sg element used. This is used when alloc_sg
> s/alloc_sg/sk_alloc_sg/
>
>> + * partially allocates a scatterlist and then is sent
>> + * to wait for memory. In normal case (no memory pressure)
>> + * both sg_nun and sg_size are zero.
> s/sg_nun/sg_num/
>
Will fix both spelling issues.
>> + */
>> + copy = copy - sg_size;
>> + err = sk_alloc_sg(sk, copy, sg, &sg_num, &sg_size, 0);
>> + if (err) {
>> + if (err != -ENOSPC)
>> + goto wait_for_memory;
>> + copy = sg_size;
>> + }
>> +
[...]
>> +do_redir:
>> + /* To avoid deadlock with multiple socks all doing redirects to
>> + * each other we must first drop the current sock lock and release
>> + * the psock. Then get the redirect socket (assuming it still
>> + * exists), take it's lock, and finally do the send here. If the
>> + * redirect fails there is nothing to do, we don't want to blame
>> + * the sender for remote socket failures. Instead we simply
>> + * continue making forward progress.
>> + */
>> + return_mem_sg(sk, sg, sg_num);
>> + release_sock(sk);
>> + smap_release_sock(psock, sk);
>> + copied -= bpf_tcp_sendmsg_do_redirect(sg, sg_num, &md, flags);
>> + return copied;
> For the __SK_REDIRECT case, before returning, should 'msg_data_left(msg)' be
> checked first? Or will msg_data_left(msg) always be 0 here?
>
Interesting, yes this would probably be best. Otherwise what happens is we
return with only some of the bytes copied. The application should (must?)
handle this case correctly, but I agree it would be nice to send as much as
possible here. I'll see if I can work this into a v2, or perhaps I'll do it
as a follow-on performance improvement.
Nice observation though for sure.
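If it does make v2, the rough shape would be something like this (untested
sketch, the re-acquire path still needs thought):

        copied -= bpf_tcp_sendmsg_do_redirect(sg, sg_num, &md, flags);
        if (msg_data_left(msg)) {
                /* more user data left: re-take the psock reference and
                 * sock lock here and continue the copy loop instead of
                 * returning a short write
                 */
        }
        return copied;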
>> +}
>> +
[...]
>> + switch (rc) {
>> + case __SK_PASS:
>> + lock_sock(sk);
>> + rc = tcp_sendpage_locked(sk, page, offset, size, flags);
>> + release_sock(sk);
>> + break;
>> + case __SK_REDIRECT:
>> + smap_release_sock(psock, sk);
> Is smap_release_sock() only needed in the __SK_REDIRECT case?
Now that I have a test case for this I caught it in testing as well. ;)
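The sendpage path needs the same psock refcnt handling that bpf_tcp_sendmsg
already does, roughly (sketch only, v2 will carry the tested version):

        rcu_read_lock();
        psock = smap_psock_sk(sk);
        if (unlikely(!psock || !refcount_inc_not_zero(&psock->refcnt))) {
                rcu_read_unlock();
                return tcp_sendpage(sk, page, offset, size, flags);
        }
        rcu_read_unlock();
        /* ... run the SK_MSG program and map the verdict ... */
        /* then smap_release_sock(psock, sk) on the __SK_PASS and __SK_DROP
         * return paths as well, not only __SK_REDIRECT
         */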
>
>> + rc = bpf_tcp_sendpage_do_redirect(page, offset, size, flags,
>> + &md);
>> + break;
>> + case __SK_DROP:
>> + default:
>> + rc = -EACCES;
>> + }
>> +
>> + return rc;
>> +}
Thanks a lot! Should have a v2 tomorrow after some additional
testing.
^ permalink raw reply [flat|nested] 18+ messages in thread
Thread overview: 18+ messages
2018-01-12 18:09 [bpf-next PATCH 0/7] Add BPF_PROG_TYPE_SK_MSG and attach pt John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 1/7] net: add a UID to use for ULP socket assignment John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 2/7] sock: make static tls function alloc_sg generic sock helper John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 3/7] sockmap: convert refcnt to an atomic refcnt John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG John Fastabend
2018-01-12 20:10 ` Eric Dumazet
2018-01-12 20:26 ` John Fastabend
2018-01-12 20:46 ` Eric Dumazet
2018-01-12 21:11 ` John Fastabend
2018-01-12 18:11 ` [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data John Fastabend
2018-01-17 2:25 ` Alexei Starovoitov
2018-01-17 5:49 ` John Fastabend
2018-01-17 6:20 ` Alexei Starovoitov
2018-01-17 20:32 ` John Fastabend
2018-01-17 22:04 ` Martin KaFai Lau
2018-01-18 17:27 ` John Fastabend
2018-01-12 18:11 ` [bpf-next PATCH 6/7] bpf: add map tests for BPF_PROG_TYPE_SK_MSG John Fastabend
2018-01-12 18:11 ` [bpf-next PATCH 7/7] bpf: add verifier " John Fastabend