* [PATCH net-next 0/2] act_bpf, cls_bpf: send eBPF bytecode through @ 2016-04-15 10:07 Quentin Monnet 2016-04-15 10:07 ` [PATCH net-next 1/2] act_bpf: send back eBPF bytecode through netlink socket Quentin Monnet ` (2 more replies) 0 siblings, 3 replies; 7+ messages in thread From: Quentin Monnet @ 2016-04-15 10:07 UTC (permalink / raw) To: netdev When a new BPF traffic control filter or action is set up with tc, the bytecode is sent back to userspace through a netlink socket for cBPF, but not for eBPF (the file descriptor pointing to the object file containing the bytecode is sent instead). This patch makes cls_bpf and act_bpf modules send the bytecode for eBPF as well (in addition to the file descriptor). New BPF flags are used in order to differenciate what BPF version is in use, so that userspace tools can process the bytecode properly. Once the series is accepted and merged, it is intended to submit a patch for the iproute2 package, so as to fix tc utility so as to use the new flags and to display the bytecode in eBPF format when needed. This tc patch is already available at: https://github.com/6WIND/iproute2/commits/netlink_eBPF Quentin Monnet (2): act_bpf: send back eBPF bytecode through netlink socket cls_bpf: send back eBPF bytecode through netlink socket include/uapi/linux/pkt_cls.h | 1 + include/uapi/linux/tc_act/tc_bpf.h | 1 + net/sched/act_bpf.c | 23 +++++++++++++++++++++++ net/sched/cls_bpf.c | 25 +++++++++++++++++++++++-- 4 files changed, 48 insertions(+), 2 deletions(-) -- 2.7.4 ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH net-next 1/2] act_bpf: send back eBPF bytecode through netlink socket 2016-04-15 10:07 [PATCH net-next 0/2] act_bpf, cls_bpf: send eBPF bytecode through Quentin Monnet @ 2016-04-15 10:07 ` Quentin Monnet 2016-04-15 10:07 ` [PATCH net-next 2/2] cls_bpf: " Quentin Monnet 2016-04-15 10:41 ` [PATCH net-next 0/2] act_bpf, cls_bpf: send eBPF bytecode through Daniel Borkmann 2 siblings, 0 replies; 7+ messages in thread From: Quentin Monnet @ 2016-04-15 10:07 UTC (permalink / raw) To: netdev When a new BPF traffic control action is set up with tc, the bytecode is sent back to userspace through a netlink socket for cBPF, but not for eBPF (the file descriptor pointing to the object file containing the bytecode is sent instead). This patch makes act_bpf module send the bytecode for eBPF as well (in addition to the file descriptor). It also adds a new BPF netlink attribute (a flag) in order to differenciate what BPF version is in use, so that userspace tools can process it properly. Signed-off-by: Quentin Monnet <quentin.monnet@6wind.com> --- include/uapi/linux/tc_act/tc_bpf.h | 1 + net/sched/act_bpf.c | 23 +++++++++++++++++++++++ 2 files changed, 24 insertions(+) diff --git a/include/uapi/linux/tc_act/tc_bpf.h b/include/uapi/linux/tc_act/tc_bpf.h index 07f17cc70bb3..8c9a44324467 100644 --- a/include/uapi/linux/tc_act/tc_bpf.h +++ b/include/uapi/linux/tc_act/tc_bpf.h @@ -26,6 +26,7 @@ enum { TCA_ACT_BPF_OPS, TCA_ACT_BPF_FD, TCA_ACT_BPF_NAME, + TCA_ACT_BPF_EBPF, __TCA_ACT_BPF_MAX, }; #define TCA_ACT_BPF_MAX (__TCA_ACT_BPF_MAX - 1) diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index 8c9f1f0459ab..fcd30f0b3b75 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -118,6 +118,28 @@ static int tcf_bpf_dump_bpf_info(const struct tcf_bpf *prog, static int tcf_bpf_dump_ebpf_info(const struct tcf_bpf *prog, struct sk_buff *skb) { + struct bpf_prog *filter; + + if (nla_put_flag(skb, TCA_ACT_BPF_EBPF)) + return -EMSGSIZE; + + rcu_read_lock(); + filter = rcu_dereference(prog->filter); + if (filter) { + if (nla_put_u16(skb, TCA_ACT_BPF_OPS_LEN, filter->len)) { + rcu_read_unlock(); + return -EMSGSIZE; + } + + if (nla_put(skb, TCA_ACT_BPF_OPS, + filter->len * sizeof(struct sock_filter), + filter->insnsi)) { + rcu_read_unlock(); + return -EMSGSIZE; + } + } + rcu_read_unlock(); + if (nla_put_u32(skb, TCA_ACT_BPF_FD, prog->bpf_fd)) return -EMSGSIZE; @@ -170,6 +192,7 @@ static const struct nla_policy act_bpf_policy[TCA_ACT_BPF_MAX + 1] = { [TCA_ACT_BPF_PARMS] = { .len = sizeof(struct tc_act_bpf) }, [TCA_ACT_BPF_FD] = { .type = NLA_U32 }, [TCA_ACT_BPF_NAME] = { .type = NLA_NUL_STRING, .len = ACT_BPF_NAME_LEN }, + [TCA_ACT_BPF_EBPF] = { .type = NLA_FLAG }, [TCA_ACT_BPF_OPS_LEN] = { .type = NLA_U16 }, [TCA_ACT_BPF_OPS] = { .type = NLA_BINARY, .len = sizeof(struct sock_filter) * BPF_MAXINSNS }, -- 2.7.4 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH net-next 2/2] cls_bpf: send back eBPF bytecode through netlink socket 2016-04-15 10:07 [PATCH net-next 0/2] act_bpf, cls_bpf: send eBPF bytecode through Quentin Monnet 2016-04-15 10:07 ` [PATCH net-next 1/2] act_bpf: send back eBPF bytecode through netlink socket Quentin Monnet @ 2016-04-15 10:07 ` Quentin Monnet 2016-04-15 10:41 ` [PATCH net-next 0/2] act_bpf, cls_bpf: send eBPF bytecode through Daniel Borkmann 2 siblings, 0 replies; 7+ messages in thread From: Quentin Monnet @ 2016-04-15 10:07 UTC (permalink / raw) To: netdev As for act_bpf in a former patch, this patch makes the scheduler send eBPF bytecode through the netlink socket for BPF filters set up with tc. The existing TCA_BPF_FLAGS netlink attribute is used to embed a new flag signaling eBPF bytecode, so as to identify the BPF version when reading from the socket on userspace side. Signed-off-by: Quentin Monnet <quentin.monnet@6wind.com> --- include/uapi/linux/pkt_cls.h | 1 + net/sched/cls_bpf.c | 25 +++++++++++++++++++++++-- 2 files changed, 24 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h index c43c5f78b9c4..09d726fc2c5a 100644 --- a/include/uapi/linux/pkt_cls.h +++ b/include/uapi/linux/pkt_cls.h @@ -376,6 +376,7 @@ enum { /* BPF classifier */ #define TCA_BPF_FLAG_ACT_DIRECT (1 << 0) +#define TCA_BPF_FLAG_EBPF (1 << 1) enum { TCA_BPF_UNSPEC, diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index 425fe6a0eda3..f1d9057d8d94 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -450,6 +450,25 @@ static int cls_bpf_dump_bpf_info(const struct cls_bpf_prog *prog, static int cls_bpf_dump_ebpf_info(const struct cls_bpf_prog *prog, struct sk_buff *skb) { + struct bpf_prog *filter; + + rcu_read_lock(); + filter = rcu_dereference(prog->filter); + if (filter) { + if (nla_put_u16(skb, TCA_BPF_OPS_LEN, filter->len)) { + rcu_read_unlock(); + return -EMSGSIZE; + } + + if (nla_put(skb, TCA_BPF_OPS, + filter->len * sizeof(struct sock_filter), + filter->insnsi)) { + rcu_read_unlock(); + return -EMSGSIZE; + } + } + rcu_read_unlock(); + if (nla_put_u32(skb, TCA_BPF_FD, prog->bpf_fd)) return -EMSGSIZE; @@ -481,10 +500,12 @@ static int cls_bpf_dump(struct net *net, struct tcf_proto *tp, unsigned long fh, nla_put_u32(skb, TCA_BPF_CLASSID, prog->res.classid)) goto nla_put_failure; - if (cls_bpf_is_ebpf(prog)) + if (cls_bpf_is_ebpf(prog)) { + bpf_flags |= TCA_BPF_FLAG_EBPF; ret = cls_bpf_dump_ebpf_info(prog, skb); - else + } else { ret = cls_bpf_dump_bpf_info(prog, skb); + } if (ret) goto nla_put_failure; -- 2.7.4 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH net-next 0/2] act_bpf, cls_bpf: send eBPF bytecode through 2016-04-15 10:07 [PATCH net-next 0/2] act_bpf, cls_bpf: send eBPF bytecode through Quentin Monnet 2016-04-15 10:07 ` [PATCH net-next 1/2] act_bpf: send back eBPF bytecode through netlink socket Quentin Monnet 2016-04-15 10:07 ` [PATCH net-next 2/2] cls_bpf: " Quentin Monnet @ 2016-04-15 10:41 ` Daniel Borkmann 2016-04-15 18:44 ` Alexei Starovoitov 2 siblings, 1 reply; 7+ messages in thread From: Daniel Borkmann @ 2016-04-15 10:41 UTC (permalink / raw) To: Quentin Monnet; +Cc: netdev, alexei.starovoitov Hi Quentin, On 04/15/2016 12:07 PM, Quentin Monnet wrote: > When a new BPF traffic control filter or action is set up with tc, the > bytecode is sent back to userspace through a netlink socket for cBPF, but > not for eBPF (the file descriptor pointing to the object file containing > the bytecode is sent instead). > > This patch makes cls_bpf and act_bpf modules send the bytecode for eBPF as > well (in addition to the file descriptor). > > New BPF flags are used in order to differenciate what BPF version is in > use, so that userspace tools can process the bytecode properly. > > Once the series is accepted and merged, it is intended to submit a patch > for the iproute2 package, so as to fix tc utility so as to use the new > flags and to display the bytecode in eBPF format when needed. This tc > patch is already available at: > https://github.com/6WIND/iproute2/commits/netlink_eBPF Thanks for working on this, but it's unfortunately not that easy. Let me ask, what would be the intended use-case to dump the insns? I'm asking because if you dump them as-is, then a reinject at a later time of that bytecode back into the kernel will most likely be rejected by the verifier. This is because on load time, verifier does rewrites/expansion on some of the insns (f.e. map pointers, helper functions, ctx access etc, see also appendix in [1]), so the code as seen in the kernel would need to be sanitized first. Also, how would you make sense/transform maps into a meaningful representation (probably possible to find a scheme when they are pinned)? Another possibility is that such programs need to be pinned (can be done easily by tc in the background) and then implement a CRIU facility into the bpf(2) syscall to retrieve them. tc could make use of this w/o too much effort, and at the same time it would help CRIU folks, too. It also seems cleaner to have only one central api (bpf(2)) to dump them, but needs a bit of thought. Thanks & cheers, Daniel [1] http://www.netdevconf.org/1.1/proceedings/slides/borkmann-tc-classifier-cls-bpf.pdf > Quentin Monnet (2): > act_bpf: send back eBPF bytecode through netlink socket > cls_bpf: send back eBPF bytecode through netlink socket > > include/uapi/linux/pkt_cls.h | 1 + > include/uapi/linux/tc_act/tc_bpf.h | 1 + > net/sched/act_bpf.c | 23 +++++++++++++++++++++++ > net/sched/cls_bpf.c | 25 +++++++++++++++++++++++-- > 4 files changed, 48 insertions(+), 2 deletions(-) > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH net-next 0/2] act_bpf, cls_bpf: send eBPF bytecode through 2016-04-15 10:41 ` [PATCH net-next 0/2] act_bpf, cls_bpf: send eBPF bytecode through Daniel Borkmann @ 2016-04-15 18:44 ` Alexei Starovoitov 2016-04-20 7:25 ` Quentin Monnet 0 siblings, 1 reply; 7+ messages in thread From: Alexei Starovoitov @ 2016-04-15 18:44 UTC (permalink / raw) To: Daniel Borkmann; +Cc: Quentin Monnet, netdev On Fri, Apr 15, 2016 at 12:41:05PM +0200, Daniel Borkmann wrote: > Hi Quentin, > > On 04/15/2016 12:07 PM, Quentin Monnet wrote: > >When a new BPF traffic control filter or action is set up with tc, the > >bytecode is sent back to userspace through a netlink socket for cBPF, but > >not for eBPF (the file descriptor pointing to the object file containing > >the bytecode is sent instead). > > > >This patch makes cls_bpf and act_bpf modules send the bytecode for eBPF as > >well (in addition to the file descriptor). > > > >New BPF flags are used in order to differenciate what BPF version is in > >use, so that userspace tools can process the bytecode properly. > > > >Once the series is accepted and merged, it is intended to submit a patch > >for the iproute2 package, so as to fix tc utility so as to use the new > >flags and to display the bytecode in eBPF format when needed. This tc > >patch is already available at: > >https://github.com/6WIND/iproute2/commits/netlink_eBPF > > Thanks for working on this, but it's unfortunately not that easy. Let > me ask, what would be the intended use-case to dump the insns? +1 > I'm asking because if you dump them as-is, then a reinject at a later > time of that bytecode back into the kernel will most likely be rejected > by the verifier. > > This is because on load time, verifier does rewrites/expansion on some > of the insns (f.e. map pointers, helper functions, ctx access etc, see > also appendix in [1]), so the code as seen in the kernel would need to > be sanitized first. +1 we had similar discussion about this in seccomp context and decided that the only sensible way is to keep original instructions, but it's wasteful to do unconditionally and snapshotting of maps is not possible, so there was no use for such dumping facility other than debugging. Is it what the patch after? We need to discuss it in the proper context. > Also, how would you make sense/transform maps into a meaningful > representation (probably possible to find a scheme when they are pinned)? > > Another possibility is that such programs need to be pinned (can be done > easily by tc in the background) and then implement a CRIU facility into > the bpf(2) syscall to retrieve them. tc could make use of this w/o too > much effort, and at the same time it would help CRIU folks, too. It > also seems cleaner to have only one central api (bpf(2)) to dump them, > but needs a bit of thought. +1 any debugging or criu needs to be done in a centralized way via syscall and/or bpffs. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH net-next 0/2] act_bpf, cls_bpf: send eBPF bytecode through 2016-04-15 18:44 ` Alexei Starovoitov @ 2016-04-20 7:25 ` Quentin Monnet 2016-04-20 9:55 ` Daniel Borkmann 0 siblings, 1 reply; 7+ messages in thread From: Quentin Monnet @ 2016-04-20 7:25 UTC (permalink / raw) To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev Hi Daniel, Alexei, and many thanks for your answers, 2016-04-15 (11:44 UTC-0700) ~ Alexei Starovoitov: > On Fri, Apr 15, 2016 at 12:41:05PM +0200, Daniel Borkmann wrote: >> Hi Quentin, >> >> On 04/15/2016 12:07 PM, Quentin Monnet wrote: >>> When a new BPF traffic control filter or action is set up with tc, the >>> bytecode is sent back to userspace through a netlink socket for cBPF, but >>> not for eBPF (the file descriptor pointing to the object file containing >>> the bytecode is sent instead). >>> >>> This patch makes cls_bpf and act_bpf modules send the bytecode for eBPF as >>> well (in addition to the file descriptor). >>> […] >> >> Thanks for working on this, but it's unfortunately not that easy. Let >> me ask, what would be the intended use-case to dump the insns? > > +1 > >> I'm asking because if you dump them as-is, then a reinject at a later >> time of that bytecode back into the kernel will most likely be rejected >> by the verifier. >> >> This is because on load time, verifier does rewrites/expansion on some >> of the insns (f.e. map pointers, helper functions, ctx access etc, see >> also appendix in [1]), so the code as seen in the kernel would need to >> be sanitized first. > > +1 > we had similar discussion about this in seccomp context and decided that > the only sensible way is to keep original instructions, but it's wasteful > to do unconditionally and snapshotting of maps is not possible, > so there was no use for such dumping facility other than debugging. > Is it what the patch after? > We need to discuss it in the proper context. I am experimenting with BPF, and so far I was just trying to dump the bytecode sent from tc to the kernel. I had not realized that the verifier would bring some changes to the instructions. And I agree that a more comprehensive debugging solution could be obtained if I can find some way to get a snapshot of the maps. >> Also, how would you make sense/transform maps into a meaningful >> representation (probably possible to find a scheme when they are pinned)? >> >> Another possibility is that such programs need to be pinned (can be done >> easily by tc in the background) and then implement a CRIU facility into >> the bpf(2) syscall to retrieve them. tc could make use of this w/o too >> much effort, and at the same time it would help CRIU folks, too. It >> also seems cleaner to have only one central api (bpf(2)) to dump them, >> but needs a bit of thought. > > +1 > any debugging or criu needs to be done in a centralized way via syscall > and/or bpffs. Maintaining a central API around bpf() makes sense to me. I have been looking at the BPF filesystem to see what information I can obtain from it, but I did not understand it well. I read the logs of Daniel's commit b2197755b263 (“bpf: add support for persistent maps/progs”), but I am unsure how I could use it in order to gather data about the maps and programs (if this is possible at all). I tried to set up some BPF filters working with maps, but I could not find any file under /sys/fs/bpf/tc. Would you have a pointer to some documentation about this filesystem? Or is there only the kernel code? ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH net-next 0/2] act_bpf, cls_bpf: send eBPF bytecode through 2016-04-20 7:25 ` Quentin Monnet @ 2016-04-20 9:55 ` Daniel Borkmann 0 siblings, 0 replies; 7+ messages in thread From: Daniel Borkmann @ 2016-04-20 9:55 UTC (permalink / raw) To: Quentin Monnet; +Cc: Alexei Starovoitov, netdev Hi Quentin, On 04/20/2016 09:25 AM, Quentin Monnet wrote: > 2016-04-15 (11:44 UTC-0700) ~ Alexei Starovoitov: >> On Fri, Apr 15, 2016 at 12:41:05PM +0200, Daniel Borkmann wrote: >>> On 04/15/2016 12:07 PM, Quentin Monnet wrote: >>>> When a new BPF traffic control filter or action is set up with tc, the >>>> bytecode is sent back to userspace through a netlink socket for cBPF, but >>>> not for eBPF (the file descriptor pointing to the object file containing >>>> the bytecode is sent instead). >>>> >>>> This patch makes cls_bpf and act_bpf modules send the bytecode for eBPF as >>>> well (in addition to the file descriptor). >>>> > […] >>> >>> Thanks for working on this, but it's unfortunately not that easy. Let >>> me ask, what would be the intended use-case to dump the insns? >> >> +1 >> >>> I'm asking because if you dump them as-is, then a reinject at a later >>> time of that bytecode back into the kernel will most likely be rejected >>> by the verifier. >>> >>> This is because on load time, verifier does rewrites/expansion on some >>> of the insns (f.e. map pointers, helper functions, ctx access etc, see >>> also appendix in [1]), so the code as seen in the kernel would need to >>> be sanitized first. >> >> +1 >> we had similar discussion about this in seccomp context and decided that >> the only sensible way is to keep original instructions, but it's wasteful >> to do unconditionally and snapshotting of maps is not possible, >> so there was no use for such dumping facility other than debugging. >> Is it what the patch after? >> We need to discuss it in the proper context. > > I am experimenting with BPF, and so far I was just trying to dump the > bytecode sent from tc to the kernel. I had not realized that the > verifier would bring some changes to the instructions. And I agree that > a more comprehensive debugging solution could be obtained if I can find > some way to get a snapshot of the maps. > >>> Also, how would you make sense/transform maps into a meaningful >>> representation (probably possible to find a scheme when they are pinned)? >>> >>> Another possibility is that such programs need to be pinned (can be done >>> easily by tc in the background) and then implement a CRIU facility into >>> the bpf(2) syscall to retrieve them. tc could make use of this w/o too >>> much effort, and at the same time it would help CRIU folks, too. It >>> also seems cleaner to have only one central api (bpf(2)) to dump them, >>> but needs a bit of thought. >> >> +1 >> any debugging or criu needs to be done in a centralized way via syscall >> and/or bpffs. > > Maintaining a central API around bpf() makes sense to me. I have been > looking at the BPF filesystem to see what information I can obtain from > it, but I did not understand it well. I read the logs of Daniel's commit > b2197755b263 (“bpf: add support for persistent maps/progs”), but I am > unsure how I could use it in order to gather data about the maps and > programs (if this is possible at all). I tried to set up some BPF Currently, there's not yet much information to extract. F.e. if you look at the tc source code, we do bpf_map_selfcheck_pinned() from fdinfo to check if the map fd that we got from the pinned one fits to the one from the object file. But obviously more work is needed for extraction of bytecode as in your case. Haven't thought much about it yet, but one idea could be that tc also pins programs, then sends some kind of annotation down to cls_bpf where on filter dump tc could retrieve the path to the pinned program again, then uses bpf(2) with BPF_OBJ_GET to get the fd, and a new command e.g. BPF_PROG_DUMP to extract bytecode/map info from the running program and dumps it to the user in a way where some sense can be made out of it from admin/user perspective (in other words, not just raw opcodes I mean). BPF_PROG_DUMP could have auxiliary information with map specs, kind of in a similar way like we retrieve them as relo entries from the object file in the loader, and in addition some information where to retrieve the maps in case they were pinned. This still doesn't give you a entire snapshot of the map, but would at least allow you for the pinned ones to iterate over them via bpf(2) with BPF_MAP_GET_NEXT_KEY, plus in general it would allow you to reload the program. There's still the issue with the additional memory overhead to keep original insns around as Alexei mentioned. Two things that come to mind, one being that when JITing was successful, we could actually try to shrink struct bpf_prog again since we work on a different image, but it doesn't address the case where JIT is not used. Other one being to perhaps only keep a 'diff' around in orig_prog where we can patch insns back to original, probably possible, but needs a bit of work though. > filters working with maps, but I could not find any file under > /sys/fs/bpf/tc. There are some getting started examples under examples/bpf/ in the iproute2 repo, f.e. bpf_shared.c is one. > Would you have a pointer to some documentation about this filesystem? Or > is there only the kernel code? Yeah, b2197755b263 and 42984d7c1e56, and in my netdev1.1 paper I tried to put more extensive information, but seems the proceedings haven't been published yet. I can send you a private copy until they are officially released I guess. Thanks, Daniel ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2016-04-20 9:55 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-04-15 10:07 [PATCH net-next 0/2] act_bpf, cls_bpf: send eBPF bytecode through Quentin Monnet 2016-04-15 10:07 ` [PATCH net-next 1/2] act_bpf: send back eBPF bytecode through netlink socket Quentin Monnet 2016-04-15 10:07 ` [PATCH net-next 2/2] cls_bpf: " Quentin Monnet 2016-04-15 10:41 ` [PATCH net-next 0/2] act_bpf, cls_bpf: send eBPF bytecode through Daniel Borkmann 2016-04-15 18:44 ` Alexei Starovoitov 2016-04-20 7:25 ` Quentin Monnet 2016-04-20 9:55 ` Daniel Borkmann
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).