* (unknown), @ 2017-01-17 5:18 Andy Lutomirski [not found] ` <CALCETrW+ZQ1xMEfdEOzx6RK9ZoCvQiqQSnOyhDJ=-FZMwUbucg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Andy Lutomirski @ 2017-01-17 5:18 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Peter Zijlstra, David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development ;; This buffer is for text that is not saved, and for Lisp evaluation. ;; To create a file, visit it with C-x C-f and enter text in its buffer. On Sun, Jan 15, 2017 at 5:19 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > Hello, > > Sorry about the delay. Some fire fighthing followed the holidays. > > On Tue, Jan 03, 2017 at 11:25:59AM +0100, Michal Hocko wrote: >> > So from what I understand the proposed cgroup is not in fact >> > hierarchical at all. >> > >> > @TJ, I thought you were enforcing all new cgroups to be properly >> > hierarchical, that would very much include this one. >> >> I would be interested in that as well. We have made that mistake in >> memcg v1 where hierarchy could be disabled for performance reasons and >> that turned out to be major PITA in the end. Why do we want to repeat >> the same mistake here? > > Across the different threads on this subject, there have been multiple > explanations but I'll try to sum it up more clearly. > > The big issue here is whether this is a cgroup thing or a bpf thing. > I don't think there's anything inherently wrong with one approach or > the other. Forget about the proposed cgroup bpf extentions but thinkg > about how iptables does cgroups. Whether it's the netcls/netprio in > v1 or direct membership matching in v2, it is the network side testing > for cgroup membership one way or the other. The only part where > cgroup is involved in is answering that test. > [...] > > None of the issues that people have been raising here is actually an > issue if one thinks of it as a part of bpf. Its security model is > exactly the same as any other bpf programs. Recursive behavior is > exactly the same as how other external cgroup descendant membership > testing work. There is no issue here whatsoever. After sleeping on this, here are my thoughts: First, there are three ways to think about this, not just two. It could be a BPF feature, a net feature, or a cgroup feature. I think that calling it a BPF feature is a cop-out. BPF is an assembly language and a mechanism for importing little programs into the kernel. BPF maps are a BPF feature. These new hooks are a feature that actively changes the behavior of other parts of the kernel. I don't see how calling this new feature a "BPF" feature excuses it from playing by the expected rules of the subsystems it affects or from generally working well with the rest of the kernel. Perhaps this is a net feature, though, not a cgroup feature. This would IMO make a certain amount of sense. Iptables cgroup matches, for example, logically are an iptables (i.e., net) feature. The problem here is that network configuration (and this explicitly includes iptables) is done on a per-netns level, whereas these new hooks entirely ignore network namespaces. I've pointed out that trying to enable them in a namespaced world is problematic (e.g. switching to ns_capable() will cause serious problems), and Alexei seems to think that this will never happen. So I don't think we can really call this a net feature that works the way that other net features work. (Suppose that this actually tried to be netns-enabled. Then you'd have to map from (netns, cgroup) -> hook, and this would probably be slow and messy.) So I think that leaves it being a cgroup feature. And it definitely does *not* play by the rules of cgroups right now. > I'm sure we'll > eventually get there but from what I hear it isn't something we can > pull off in a restricted timeframe. To me, this sounds like "the API isn't that great, we know how to do better, but we really really want this feature ASAP so we want to ship it with a sub-par API." I think that's a bad idea. > This also holds true for the perf controller. While it is implemented > as a controller, it isn't visible to cgroup users in any way and the > only function it serves is providing the membership test to perf > subsystem. perf is the one which decides whether and how it is to be > used. cgroup providing membership test to other subsystems is > completely acceptable and established. Unless I'm mistaken, "perf_event" is an actual cgroup controller, and it explicitly respects the cgroup hierarchy. It shows up in /proc/cgroups, and I had no problem mounting a cgroupfs instance with perf_event enabled. So I'm not sure what you mean. ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <CALCETrW+ZQ1xMEfdEOzx6RK9ZoCvQiqQSnOyhDJ=-FZMwUbucg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <CALCETrW+ZQ1xMEfdEOzx6RK9ZoCvQiqQSnOyhDJ=-FZMwUbucg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-01-18 22:41 ` Tejun Heo [not found] ` <20170118224120.GG9171-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Tejun Heo @ 2017-01-18 22:41 UTC (permalink / raw) To: Andy Lutomirski Cc: Michal Hocko, Peter Zijlstra, David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development Helloo, Andy. On Mon, Jan 16, 2017 at 09:18:36PM -0800, Andy Lutomirski wrote: > Perhaps this is a net feature, though, not a cgroup feature. This > would IMO make a certain amount of sense. Iptables cgroup matches, > for example, logically are an iptables (i.e., net) feature. The Yeap. > problem here is that network configuration (and this explicitly > includes iptables) is done on a per-netns level, whereas these new > hooks entirely ignore network namespaces. I've pointed out that > trying to enable them in a namespaced world is problematic (e.g. > switching to ns_capable() will cause serious problems), and Alexei > seems to think that this will never happen. So I don't think we can > really call this a net feature that works the way that other net > features work. > > (Suppose that this actually tried to be netns-enabled. Then you'd > have to map from (netns, cgroup) -> hook, and this would probably be > slow and messy.) I'm not sure how important netns support for these bpf programs actually is but I have no objection to making it behave in an equivalent manner to iptables where appropriate. This is kinda trivial to do too. For now, we can simply not run the programs if the associated socket belongs to non-root netns. Later, if necessary, we can add per-ns bpf programs. And I can't think of a reason why (netns, cgroup) -> function mapping would be slow and messy. Why would that be? > So I think that leaves it being a cgroup feature. And it definitely > does *not* play by the rules of cgroups right now. Because this in no way is a cgroup feature. As you wrote above, it is something similar to iptables with lacking netns considerations. Let's address that and make it a proper network thing. > > I'm sure we'll > > eventually get there but from what I hear it isn't something we can > > pull off in a restricted timeframe. > > To me, this sounds like "the API isn't that great, we know how to do > better, but we really really want this feature ASAP so we want to ship > it with a sub-par API." I think that's a bad idea. I see your point but this isn't something which is black and white. There are arguments for both directions. I'm leaning the other direction because * My impression with bpf is that while delegation is something theoretically possible it is not something which is gonna take place in any immediate time frame. If I'm wrong on this, please feel free to correct me. * There are a lot of use cases which can immediately take advantage of cgroup-aware bpf programs operating as proposed. The model is semantically equivalent to iptables (let's address the netns part if that's an issue) which net people are familiar with. * It isn't exclusive with adding cgroup bpf controller down the road if necessary and practical. Sure, this isn't the perfect situation but it seems like an acceptable trade-off to me. What ever is perfect? > > This also holds true for the perf controller. While it is implemented > > as a controller, it isn't visible to cgroup users in any way and the > > only function it serves is providing the membership test to perf > > subsystem. perf is the one which decides whether and how it is to be > > used. cgroup providing membership test to other subsystems is > > completely acceptable and established. > > Unless I'm mistaken, "perf_event" is an actual cgroup controller, and Yeah, it is implemented as a controller but in essence what it does is just tagging the hierarchy to tell perf to use that hierarchy for membership test purposes. > it explicitly respects the cgroup hierarchy. It shows up in > /proc/cgroups, and I had no problem mounting a cgroupfs instance with > perf_event enabled. So I'm not sure what you mean. That all it's doing is providing membership information. Thanks. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <20170118224120.GG9171-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <20170118224120.GG9171-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2017-01-19 0:18 ` Andy Lutomirski [not found] ` <CALCETrXv3k+=C=9D5YGypzPk8_f4yo8ShQ3tM2fOo++gJFBM4A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Andy Lutomirski @ 2017-01-19 0:18 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Peter Zijlstra, David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Wed, Jan 18, 2017 at 2:41 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > Helloo, Andy. > > On Mon, Jan 16, 2017 at 09:18:36PM -0800, Andy Lutomirski wrote: >> Perhaps this is a net feature, though, not a cgroup feature. This >> would IMO make a certain amount of sense. Iptables cgroup matches, >> for example, logically are an iptables (i.e., net) feature. The > > Yeap. > >> problem here is that network configuration (and this explicitly >> includes iptables) is done on a per-netns level, whereas these new >> hooks entirely ignore network namespaces. I've pointed out that >> trying to enable them in a namespaced world is problematic (e.g. >> switching to ns_capable() will cause serious problems), and Alexei >> seems to think that this will never happen. So I don't think we can >> really call this a net feature that works the way that other net >> features work. >> >> (Suppose that this actually tried to be netns-enabled. Then you'd >> have to map from (netns, cgroup) -> hook, and this would probably be >> slow and messy.) > > I'm not sure how important netns support for these bpf programs > actually is but I have no objection to making it behave in an > equivalent manner to iptables where appropriate. This is kinda > trivial to do too. For now, we can simply not run the programs if the > associated socket belongs to non-root netns. Later, if necessary, we > can add per-ns bpf programs. And I can't think of a reason why > (netns, cgroup) -> function mapping would be slow and messy. Why > would that be? To map cgroup -> hook, a simple array in each cgroup structure works. To map (cgroup, netns) -> hook function, the natural approach would be to have some kind of hash, and that would be slower. This code is intended to be *fast*. I suppose you could have each cgroup structure contain an array where each element is (netns, hook function) and just skip the ones that are the wrong netns. Perhaps it would be rare to have many of those. > >> So I think that leaves it being a cgroup feature. And it definitely >> does *not* play by the rules of cgroups right now. > > Because this in no way is a cgroup feature. As you wrote above, it is > something similar to iptables with lacking netns considerations. > Let's address that and make it a proper network thing. I think it's currently a broken cgroup feature. I think it could be made into a properly working cgroup feature (which I favor) or a properly working net feature. Can you articulate why you prefer having it be a net feature instead of a cgroup feature? I tend to favor making it be a working cgroup feature (by giving it a file or files in cgroupfs and maybe even a controller) because the result would have very obvious semantics. But maybe this is just a false dichotomy. Could this feature be a per-netns configuration where you can give instructions like "run this hook if you're in such-and-such netns and in a given cgroup or one of its descendents"? This would prevent it from being a direct analogue to systemd's RestrictAddressFamilies= option, but that may be okay. This would side-step issues in the current code where a hook can't rely on ifindex meaning what the hook thinks it means. Actually, I think I like that approach. What if it we had a "socket" controller and files like "socket.create_socket", "socket.ingress" and "socket.egress"? You could use the bpf() syscall to install a bpf filter into "socket.egress" and that filter would filter egress for the target cgroup and its descendents on the current netns. As a first pass, the netns issue could be sidestepped by making it only work in the root netns (i.e. the bpf() call would fail if you're not in the root netns and the hooks wouldn't run if you're not in the root netns.) It *might* even be possible to retrofit in the socket controller by saying that the default hierarchy is used if the socket controller isn't mounted. What *isn't* possible to cleanly fix after the fact is the current semantics that cgroup hooks override the hooks in their ancestors. IMO that is simply broken. The example you gave (perf_event) is very careful not to have this problem. > >> > I'm sure we'll >> > eventually get there but from what I hear it isn't something we can >> > pull off in a restricted timeframe. >> >> To me, this sounds like "the API isn't that great, we know how to do >> better, but we really really want this feature ASAP so we want to ship >> it with a sub-par API." I think that's a bad idea. > > I see your point but this isn't something which is black and white. > There are arguments for both directions. I'm leaning the other > direction because > > * My impression with bpf is that while delegation is something > theoretically possible it is not something which is gonna take place > in any immediate time frame. If I'm wrong on this, please feel free > to correct me. But the issue isn't *BPF* delegation. It's cgroup delegation or netns creation, both of which exist today. > > * There are a lot of use cases which can immediately take advantage of > cgroup-aware bpf programs operating as proposed. The model is > semantically equivalent to iptables (let's address the netns part if > that's an issue) which net people are familiar with. That sounds like a great argument for those users to patch their kernels to gain experience with this feature. > > * It isn't exclusive with adding cgroup bpf controller down the road > if necessary and practical. Sure, this isn't the perfect situation > but it seems like an acceptable trade-off to me. What ever is > perfect? I think it more or less is exclusive. I can imagine davem accepting a patch to make the sock_cgroup_data think point at a new "socket" cgroup type. I can't imagine him accepting a patch to have sockets have more than one pointer to a cgroup, with good reason. I don't see what's added by having a "bpf" cgroup controller, though -- it's just too broad. If the Landlock stuff goes in, that should presumably be either tied to the default hierarchy or use an "lsm" controller. A "bpf" controller for that makes no sense to me. I can see *huge* value in having some combination of BPF and the perf_event controller deciding whether to log perf events, for example. But this would be the perf_event controller, not a hypothetical "bpf" controller. > >> > This also holds true for the perf controller. While it is implemented >> > as a controller, it isn't visible to cgroup users in any way and the >> > only function it serves is providing the membership test to perf >> > subsystem. perf is the one which decides whether and how it is to be >> > used. cgroup providing membership test to other subsystems is >> > completely acceptable and established. >> >> Unless I'm mistaken, "perf_event" is an actual cgroup controller, and > > Yeah, it is implemented as a controller but in essence what it does is > just tagging the hierarchy to tell perf to use that hierarchy for > membership test purposes. It's also a rather innocuous controller in that it doesn't change the behavior of the controlled tasks. The bpf hooks definitely do change behavior. > >> it explicitly respects the cgroup hierarchy. It shows up in >> /proc/cgroups, and I had no problem mounting a cgroupfs instance with >> perf_event enabled. So I'm not sure what you mean. > > That all it's doing is providing membership information. But it's doing it wrong! Even perf_event tests for membership in a given cgroup *or one of its descendents*. This code does not. I think the moral of the story here is that there are lots of open questions and design work to be done and that this feature really isn't ready to be stable. For Landlock, I believe that it really needs to be done right and I will put my foot down and NAK any effort to have Landlock available in a released kernel without resolving these types of issues first. Does anyone really want Landlock to work differently than the net hooks simply because the net hooks were in a rush? --Andy ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <CALCETrXv3k+=C=9D5YGypzPk8_f4yo8ShQ3tM2fOo++gJFBM4A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <CALCETrXv3k+=C=9D5YGypzPk8_f4yo8ShQ3tM2fOo++gJFBM4A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-01-19 0:59 ` Tejun Heo 2017-01-19 2:29 ` Andy Lutomirski 2017-01-19 1:01 ` Mickaël Salaün 1 sibling, 1 reply; 49+ messages in thread From: Tejun Heo @ 2017-01-19 0:59 UTC (permalink / raw) To: Andy Lutomirski Cc: Michal Hocko, Peter Zijlstra, David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development Hello, Andy. On Wed, Jan 18, 2017 at 04:18:04PM -0800, Andy Lutomirski wrote: > To map cgroup -> hook, a simple array in each cgroup structure works. > To map (cgroup, netns) -> hook function, the natural approach would be > to have some kind of hash, and that would be slower. This code is > intended to be *fast*. Updating these programs isn't a frequent operation. We can easily calculate a perfect (or acceptable) hash per-cgroup and rcu swap the new hashtable. > I suppose you could have each cgroup structure contain an array where > each element is (netns, hook function) and just skip the ones that are > the wrong netns. Perhaps it would be rare to have many of those. Yeah, or maybe even that. It just isn't a fundamentally difficult problem. > I think it's currently a broken cgroup feature. I think it could be > made into a properly working cgroup feature (which I favor) or a > properly working net feature. Can you articulate why you prefer > having it be a net feature instead of a cgroup feature? I tend to I thought that's what I was doing in the previous message. > favor making it be a working cgroup feature (by giving it a file or > files in cgroupfs and maybe even a controller) because the result > would have very obvious semantics. I'm fine with both directions but one seems far out. > But maybe this is just a false dichotomy. Could this feature be a > per-netns configuration where you can give instructions like "run this > hook if you're in such-and-such netns and in a given cgroup or one of > its descendents"? This would prevent it from being a direct analogue > to systemd's RestrictAddressFamilies= option, but that may be okay. > This would side-step issues in the current code where a hook can't > rely on ifindex meaning what the hook thinks it means. How is this different from making the current code netns aware? > Actually, I think I like that approach. What if it we had a "socket" > controller and files like "socket.create_socket", "socket.ingress" and > "socket.egress"? You could use the bpf() syscall to install a bpf > filter into "socket.egress" and that filter would filter egress for > the target cgroup and its descendents on the current netns. As a > first pass, the netns issue could be sidestepped by making it only > work in the root netns (i.e. the bpf() call would fail if you're not > in the root netns and the hooks wouldn't run if you're not in the root > netns.) It *might* even be possible to retrofit in the socket > controller by saying that the default hierarchy is used if the socket > controller isn't mounted. I don't know. You're throwing out too many ideas too fast and it's difficult to keep up with what the differences are between those ideas. But if we're doing cgroup controllers, shouldn't cgroup ns support be sufficient? We can consider the product of cgroup and net namespaces but that doesn't seem necessary given that people usually set up these namespaces in conjunction. > What *isn't* possible to cleanly fix after the fact is the current > semantics that cgroup hooks override the hooks in their ancestors. > IMO that is simply broken. The example you gave (perf_event) is very > careful not to have this problem. That's like saying installing an iptables rule for a more specific target is broken. As a cgroup controller, it is not an acceptable behavior given how delegation works. As something similar to iptables, it is completely fine. > > * My impression with bpf is that while delegation is something > > theoretically possible it is not something which is gonna take place > > in any immediate time frame. If I'm wrong on this, please feel free > > to correct me. > > But the issue isn't *BPF* delegation. It's cgroup delegation or netns > creation, both of which exist today. No, the issue is bpf delegation. If bpf were fully delegatable in practical sense, we could just do straight forward cgroup bpf controller. Well, we'll have to think about how to chain the programs which would differ on program type but that shouldn't be too hard. > > * There are a lot of use cases which can immediately take advantage of > > cgroup-aware bpf programs operating as proposed. The model is > > semantically equivalent to iptables (let's address the netns part if > > that's an issue) which net people are familiar with. > > That sounds like a great argument for those users to patch their > kernels to gain experience with this feature. I don't get why this would particularly point to out-of-tree usage. Why is that? > > * It isn't exclusive with adding cgroup bpf controller down the road > > if necessary and practical. Sure, this isn't the perfect situation > > but it seems like an acceptable trade-off to me. What ever is > > perfect? > > I think it more or less is exclusive. I can imagine davem accepting a > patch to make the sock_cgroup_data think point at a new "socket" > cgroup type. I can't imagine him accepting a patch to have sockets > have more than one pointer to a cgroup, with good reason. Why would it ever need multiple pointers? Socket is associated with one cgroup whether it's this or controller thing. Membership part doesn't change. It can share everything. > I don't see what's added by having a "bpf" cgroup controller, though > -- it's just too broad. If the Landlock stuff goes in, that should > presumably be either tied to the default hierarchy or use an "lsm" > controller. A "bpf" controller for that makes no sense to me. > > I can see *huge* value in having some combination of BPF and the > perf_event controller deciding whether to log perf events, for > example. But this would be the perf_event controller, not a > hypothetical "bpf" controller. Let's not spiral out. This isn't relevant to the discussion. Call it whatever you want. > But it's doing it wrong! Even perf_event tests for membership in a > given cgroup *or one of its descendents*. This code does not. > > I think the moral of the story here is that there are lots of open > questions and design work to be done and that this feature really > isn't ready to be stable. For Landlock, I believe that it really > needs to be done right and I will put my foot down and NAK any effort > to have Landlock available in a released kernel without resolving > these types of issues first. Does anyone really want Landlock to work > differently than the net hooks simply because the net hooks were in a > rush? No idea about landlock but as for the cgroup aware network bpf programs, let's please try to narrow down the arguments. Here's one question. * If we make the current thing netns aware, how is it different from iptables? Note that at least future-proofing for netns is trivial. We can simply skip running the bpf programs for non-root ns for now. If the answer is "they aren't that different", is the reason that you're against the current code because you think that it being a part of cgroup would be more useful? I'm not necessarily against that point, I just think that this is a reasonable trade-off given the circumstances especially given that this doens't block having the cgroup things in the future. I'd really appreciate some inputs from bpf folks. I'm basing my judgement primarily on the impression that it is very difficult to make bpf programs practically delegatable. If that's not the case, we can easily go the cgroup controller route. Thanks. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2017-01-19 0:59 ` Tejun Heo @ 2017-01-19 2:29 ` Andy Lutomirski [not found] ` <CALCETrVrEnL4cvkdDu2LUhxmeOZ+SMEmF=yKKsm9OYoW2y1Kpg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Andy Lutomirski @ 2017-01-19 2:29 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Peter Zijlstra, David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel@vger.kernel.org, Network Development On Wed, Jan 18, 2017 at 4:59 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, Andy. > > On Wed, Jan 18, 2017 at 04:18:04PM -0800, Andy Lutomirski wrote: >> To map cgroup -> hook, a simple array in each cgroup structure works. >> To map (cgroup, netns) -> hook function, the natural approach would be >> to have some kind of hash, and that would be slower. This code is >> intended to be *fast*. > > Updating these programs isn't a frequent operation. We can easily > calculate a perfect (or acceptable) hash per-cgroup and rcu swap the > new hashtable. > Fair enough. > >> I think it's currently a broken cgroup feature. I think it could be >> made into a properly working cgroup feature (which I favor) or a >> properly working net feature. Can you articulate why you prefer >> having it be a net feature instead of a cgroup feature? I tend to > > I thought that's what I was doing in the previous message. > >> favor making it be a working cgroup feature (by giving it a file or >> files in cgroupfs and maybe even a controller) because the result >> would have very obvious semantics. > > I'm fine with both directions but one seems far out. I thought you were saying why you thought it wasn't a cgroup feature, but I'm not sure I understand why you thought it shouldn't be a cgroup feature. When I chatted with Alexei, I had the impression that you and he had wanted it to be a cgroup controller but thought it wouldn't work well. I think it could work by making a single socket cgroup controller that handles all cgroup things that are bound to a socket. Using individual files would play nicer with cgroup delegation within a single netns. > >> But maybe this is just a false dichotomy. Could this feature be a >> per-netns configuration where you can give instructions like "run this >> hook if you're in such-and-such netns and in a given cgroup or one of >> its descendents"? This would prevent it from being a direct analogue >> to systemd's RestrictAddressFamilies= option, but that may be okay. >> This would side-step issues in the current code where a hook can't >> rely on ifindex meaning what the hook thinks it means. > > How is this different from making the current code netns aware? The descendents part is important. > >> Actually, I think I like that approach. What if it we had a "socket" >> controller and files like "socket.create_socket", "socket.ingress" and >> "socket.egress"? You could use the bpf() syscall to install a bpf >> filter into "socket.egress" and that filter would filter egress for >> the target cgroup and its descendents on the current netns. As a >> first pass, the netns issue could be sidestepped by making it only >> work in the root netns (i.e. the bpf() call would fail if you're not >> in the root netns and the hooks wouldn't run if you're not in the root >> netns.) It *might* even be possible to retrofit in the socket >> controller by saying that the default hierarchy is used if the socket >> controller isn't mounted. > > I don't know. You're throwing out too many ideas too fast and it's > difficult to keep up with what the differences are between those > ideas. But if we're doing cgroup controllers, shouldn't cgroup ns > support be sufficient? We can consider the product of cgroup and net > namespaces but that doesn't seem necessary given that people usually > set up these namespaces in conjunction. Fair enough. See way below. > >> What *isn't* possible to cleanly fix after the fact is the current >> semantics that cgroup hooks override the hooks in their ancestors. >> IMO that is simply broken. The example you gave (perf_event) is very >> careful not to have this problem. > > That's like saying installing an iptables rule for a more specific > target is broken. As a cgroup controller, it is not an acceptable > behavior given how delegation works. As something similar to > iptables, it is completely fine. Even the xt_cgroup code checks cgroup_is_descendent(). > >> > * My impression with bpf is that while delegation is something >> > theoretically possible it is not something which is gonna take place >> > in any immediate time frame. If I'm wrong on this, please feel free >> > to correct me. >> >> But the issue isn't *BPF* delegation. It's cgroup delegation or netns >> creation, both of which exist today. > > No, the issue is bpf delegation. If bpf were fully delegatable in > practical sense, we could just do straight forward cgroup bpf > controller. Well, we'll have to think about how to chain the programs > which would differ on program type but that shouldn't be too hard. What do you mean by "if bpf were fully delegatable"? I don't know what it would even mean to delegate bpf. I do know what it means to allow programs to create new network namespaces and for cgroup sub-hierarchies to be delegated. > >> > * There are a lot of use cases which can immediately take advantage of >> > cgroup-aware bpf programs operating as proposed. The model is >> > semantically equivalent to iptables (let's address the netns part if >> > that's an issue) which net people are familiar with. >> >> That sounds like a great argument for those users to patch their >> kernels to gain experience with this feature. > > I don't get why this would particularly point to out-of-tree usage. > Why is that? Because I think that the API currently has some issues that will be difficult to fix without either breaking the API or supporting multiple APIs going forward. > >> > * It isn't exclusive with adding cgroup bpf controller down the road >> > if necessary and practical. Sure, this isn't the perfect situation >> > but it seems like an acceptable trade-off to me. What ever is >> > perfect? >> >> I think it more or less is exclusive. I can imagine davem accepting a >> patch to make the sock_cgroup_data think point at a new "socket" >> cgroup type. I can't imagine him accepting a patch to have sockets >> have more than one pointer to a cgroup, with good reason. > > Why would it ever need multiple pointers? Socket is associated with > one cgroup whether it's this or controller thing. Membership part > doesn't change. It can share everything. Exactly. If it starts out using the default hierarchy and then switches to being a socket controller, there will still be old deployed code that tries to attach bpf programs to the default hierarchy. That will require defining some semantics for it. > No idea about landlock but as for the cgroup aware network bpf > programs, let's please try to narrow down the arguments. Here's one > question. > > * If we make the current thing netns aware, how is it different from > iptables? Note that at least future-proofing for netns is trivial. > We can simply skip running the bpf programs for non-root ns for now. > > If the answer is "they aren't that different", is the reason that > you're against the current code because you think that it being a part > of cgroup would be more useful? I'm not necessarily against that > point, I just think that this is a reasonable trade-off given the > circumstances especially given that this doens't block having the > cgroup things in the future. Having thought about this some more, I think that making it would alleviate a bunch of my concerns, as it would make the semantics if the capable() check were relaxed to ns_capable() be sane. Here's what I currently should happen before bpf+cgroup is enabled in a release: 1. Make it netns-aware. This could be as simple as making it only work in the root netns because then real netns awareness can be added later without breaking anything. The current situation is bad in that network namespaces are just ignored and it's plausible that people will start writing user code that depends on having network namespaces be ignored. 2. Make it inherit properly. Inner cgroups should not override outer hooks. As in (1), this could be simplified by preventing the same hook from being configured in both an ancestor and a descendent cgroup. Then inheritance could be added for real later on. 3. Give cgroup delegation support some serious thought. In particular, if delegation would be straightforward but the current API wouldn't work well with delegation, then at least consider whether the API should change before it becomes stable so that two APIs don't need to be supported going forward. >From the bpf side, I think that the only thing needed to get delegation working is to make inheritance work right -- if the same hook is set up multiple places in the hierarchy, call all of the in an appropriate order. The rest is probably all on the cgroup side -- setting up a way to enable delegation (subtree_control?), making sure that filesystem permissions are checked when configuring hooks, and perhaps dealing with setuid issues. --Andy ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <CALCETrVrEnL4cvkdDu2LUhxmeOZ+SMEmF=yKKsm9OYoW2y1Kpg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <CALCETrVrEnL4cvkdDu2LUhxmeOZ+SMEmF=yKKsm9OYoW2y1Kpg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-01-20 2:39 ` Alexei Starovoitov 2017-01-20 4:04 ` Andy Lutomirski 0 siblings, 1 reply; 49+ messages in thread From: Alexei Starovoitov @ 2017-01-20 2:39 UTC (permalink / raw) To: Andy Lutomirski Cc: Tejun Heo, Michal Hocko, Peter Zijlstra, David Ahern, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote: > I think it could work by making a single socket cgroup controller that > handles all cgroup things that are bound to a socket. Using Such 'socket cgroup controller' would limit usability of the feature to sockets and force all other use cases like landlock to invent their own wheel, which is undesirable. Everyone will be inventing new 'foo cgroup controller', while all of them are really bpf features. They are different bpf program types that attach to different hooks and use cgroup for scoping. > Having thought about this some more, I think that making it would > alleviate a bunch of my concerns, as it would make the semantics if > the capable() check were relaxed to ns_capable() be sane. Here's what here we're on the same page. For any meaningful discussion about 'bpf cgroup controller' to happen bpf itself needs to become delegatable in cgroup sense. In other words BPF_PROG_TYPE_CGROUP* program types need to become available for unprivileged users. The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER. To make it secure we severely limited its functionality. All bpf advances since then (like new map types and verifier extensions) were done for root only. If early on the priv vs unpriv bpf features were 80/20. Now it's close to 95/5. No work has been done to make socket filter type more powerful. It still has to use slow-ish ld_abs skb access while tc/xdp have direct packet access. Things like register value tracking is root only as well and so on and so forth. We cannot just flip the switch and allow type_cgroup* to unpriv and I don't see any volunteers willing to do this work. Until that happens there is no point coming up with designs for 'cgroup bpf controller'... whatever that means. > I currently should happen before bpf+cgroup is enabled in a release: > > 1. Make it netns-aware. This could be as simple as making it only > work in the root netns because then real netns awareness can be added > later without breaking anything. The current situation is bad in that > network namespaces are just ignored and it's plausible that people > will start writing user code that depends on having network namespaces > be ignored. nothing in bpf today is netns-aware and frankly I don't see how cgroup+bpf has anything to do with netns. For regular sockets+bpf we don't check netns. When tcpdump opens raw socket and attaches bpf there are no netns checks, since socket itself gives a scope for the program to run. Same thing applies to cgroup+bpf. cgroup gives a scope for the program. But, say, we indeed add 'if !root ns' check to BPF_CGROUP_INET_* hooks. Then if the hooks are used for security, the process only needs to do setns() to escape security sandbox. Obviously broken semantics. > 2. Make it inherit properly. Inner cgroups should not override outer > hooks. As in (1), this could be simplified by preventing the same > hook from being configured in both an ancestor and a descendent > cgroup. Then inheritance could be added for real later on. In general it sounds fine, but it seems the reasoning to add such restriction now (instead of later), so that program chain can be added without breaking abi, since if we don't restrict it now there will be no way to add it without breaking abi?! That is incorrect assumption. We can add chaining and can add 'do not override' logic without breaking existing semantics. For example, we can add 'priority' field to struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */} which would indicate relative position of multiple chained programs applied to the same cgroup+hook pair. Multiple programs with the same priority will be executed in the order they were added. Programs with different priorities will execute in the priority order. Such scheme will be more generic and flexible than earlier proposals. Similarly we can add another flag that will say 'dissallow override of bpf program in descendent cgroup'. It's all trivial to do, since bpf syscall was designed for extensibility. Also until bpf_type_cgroup* becomes unprivileged there is no reason to add this 'priority/prog chaining' feature, since if it's used for security the root can always override it no matter cgroup hierarchy. > 3. Give cgroup delegation support some serious thought. In > particular, if delegation would be straightforward but the current API > wouldn't work well with delegation, then at least consider whether the > API should change before it becomes stable so that two APIs don't need > to be supported going forward. please see example above. Since we went with bpf syscall (instead of inextensible ioctl) we can add any new cgroup+bpf logic without breaking current abi. No matter how you twist it the cgroup+bpf is bpf specific feature. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2017-01-20 2:39 ` Alexei Starovoitov @ 2017-01-20 4:04 ` Andy Lutomirski 2017-01-23 4:31 ` Alexei Starovoitov 0 siblings, 1 reply; 49+ messages in thread From: Andy Lutomirski @ 2017-01-20 4:04 UTC (permalink / raw) To: Alexei Starovoitov Cc: Tejun Heo, Michal Hocko, Peter Zijlstra, David Ahern, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel@vger.kernel.org, Network Development On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote: >> I think it could work by making a single socket cgroup controller that >> handles all cgroup things that are bound to a socket. Using > > Such 'socket cgroup controller' would limit usability of the feature > to sockets and force all other use cases like landlock to invent > their own wheel, which is undesirable. Everyone will be > inventing new 'foo cgroup controller', while all of them > are really bpf features. They are different bpf program > types that attach to different hooks and use cgroup for scoping. Can you elaborate on why that would be a problem? In a cgroup v1 world, users who want different hierarchies for different types of control could easily want one hierarchy for socket hooks and a different hierarchy for lsm hooks. In a cgroup v2 delegation world, I could easily imagine the decision to delegate socket hooks being different from the decision to delegate lsm hooks. Almost all of the code would be shared between different bpf-using cgroup controllers. > >> Having thought about this some more, I think that making it would >> alleviate a bunch of my concerns, as it would make the semantics if >> the capable() check were relaxed to ns_capable() be sane. Here's what > > here we're on the same page. For any meaningful discussion about > 'bpf cgroup controller' to happen bpf itself needs to become > delegatable in cgroup sense. In other words BPF_PROG_TYPE_CGROUP* > program types need to become available for unprivileged users. > The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER. > To make it secure we severely limited its functionality. > All bpf advances since then (like new map types and verifier extensions) > were done for root only. If early on the priv vs unpriv bpf features > were 80/20. Now it's close to 95/5. No work has been done to > make socket filter type more powerful. It still has to use > slow-ish ld_abs skb access while tc/xdp have direct packet access. > Things like register value tracking is root only as well and so on > and so forth. > We cannot just flip the switch and allow type_cgroup* to unpriv > and I don't see any volunteers willing to do this work. > Until that happens there is no point coming up with designs > for 'cgroup bpf controller'... whatever that means. Sure there is. If delegation can be turned on without changing the API, then the result will be easier to work with and have fewer compatibility issues. > >> I currently should happen before bpf+cgroup is enabled in a release: >> >> 1. Make it netns-aware. This could be as simple as making it only >> work in the root netns because then real netns awareness can be added >> later without breaking anything. The current situation is bad in that >> network namespaces are just ignored and it's plausible that people >> will start writing user code that depends on having network namespaces >> be ignored. > > nothing in bpf today is netns-aware and frankly I don't see > how cgroup+bpf has anything to do with netns. > For regular sockets+bpf we don't check netns. > When tcpdump opens raw socket and attaches bpf there are no netns > checks, since socket itself gives a scope for the program to run. > Same thing applies to cgroup+bpf. cgroup gives a scope for the program. > But, say, we indeed add 'if !root ns' check to BPF_CGROUP_INET_* > hooks. Here I completely disagree with you. tcpdump sees packets in its network namespace. Regular sockets apply bpf filters to the packets seen by that socket, and the socket itself is scoped to a netns. Meanwhile, cgroup+bpf actually appears to be buggy in this regard even regardless of what semantics you think are better. sk_bound_dev_if is exposed as a u32 value, but sk_bound_dev_if only has meaning within a given netns. The "ip vrf" stuff will straight-up malfunction if a process affected by its hook runs in a different netns from the netns that "ip vrf" was run in. IOW, the current code is buggy. > Then if the hooks are used for security, the process > only needs to do setns() to escape security sandbox. Obviously > broken semantics. This could go both ways. If the goal is to filter packets, then it's not really important to have the filter keep working if the sandboxed task unshares netns -- in the new netns, there isn't any access to the network at all. If the goal is to reduce attack surface by restricting the types of sockets that can be created, then you do want the filter to work across namespaces, but seccomp can do that too and the current code doesn't handle netns correctly. > >> 2. Make it inherit properly. Inner cgroups should not override outer >> hooks. As in (1), this could be simplified by preventing the same >> hook from being configured in both an ancestor and a descendent >> cgroup. Then inheritance could be added for real later on. > > In general it sounds fine, but it seems the reasoning to add > such restriction now (instead of later), so that program chain can > be added without breaking abi, since if we don't restrict it now > there will be no way to add it without breaking abi?! Adding it the straightforward way (by simply making all the hooks in scope run) would break ABI. Of course there are more complicated ways to do it, but those are more complicated and messier. Do actually have a use case in which overriding of hooks is a good thing? As far as I can tell, the original version of the patch set didn't have hooks get overridden by descendents, and I don't know why this changed in the first place. > > Also until bpf_type_cgroup* becomes unprivileged there is no reason > to add this 'priority/prog chaining' feature, since if it's > used for security the root can always override it no matter cgroup > hierarchy. > >> 3. Give cgroup delegation support some serious thought. In >> particular, if delegation would be straightforward but the current API >> wouldn't work well with delegation, then at least consider whether the >> API should change before it becomes stable so that two APIs don't need >> to be supported going forward. > > please see example above. Since we went with bpf syscall (instead of > inextensible ioctl) we can add any new cgroup+bpf logic without > breaking current abi. But then you end up with two ABIs. Ideally delegation would just work -- then programs written now could run unmodified in unprivileged containers in future kernels. --Andy ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2017-01-20 4:04 ` Andy Lutomirski @ 2017-01-23 4:31 ` Alexei Starovoitov [not found] ` <20170123043154.GA93519-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Alexei Starovoitov @ 2017-01-23 4:31 UTC (permalink / raw) To: Andy Lutomirski Cc: Tejun Heo, Michal Hocko, Peter Zijlstra, David Ahern, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel@vger.kernel.org, Network Development On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote: > On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote: > >> I think it could work by making a single socket cgroup controller that > >> handles all cgroup things that are bound to a socket. Using > > > > Such 'socket cgroup controller' would limit usability of the feature > > to sockets and force all other use cases like landlock to invent > > their own wheel, which is undesirable. Everyone will be > > inventing new 'foo cgroup controller', while all of them > > are really bpf features. They are different bpf program > > types that attach to different hooks and use cgroup for scoping. > > Can you elaborate on why that would be a problem? In a cgroup v1 > world, users who want different hierarchies for different types of > control could easily want one hierarchy for socket hooks and a > different hierarchy for lsm hooks. In a cgroup v2 delegation world, I > could easily imagine the decision to delegate socket hooks being > different from the decision to delegate lsm hooks. Almost all of the > code would be shared between different bpf-using cgroup controllers. how do you think it can be enforced when directory is chowned? > >> Having thought about this some more, I think that making it would > >> alleviate a bunch of my concerns, as it would make the semantics if > >> the capable() check were relaxed to ns_capable() be sane. Here's what > > > > here we're on the same page. For any meaningful discussion about > > 'bpf cgroup controller' to happen bpf itself needs to become > > delegatable in cgroup sense. In other words BPF_PROG_TYPE_CGROUP* > > program types need to become available for unprivileged users. > > The only unprivileged prog type today is BPF_PROG_TYPE_SOCKET_FILTER. > > To make it secure we severely limited its functionality. > > All bpf advances since then (like new map types and verifier extensions) > > were done for root only. If early on the priv vs unpriv bpf features > > were 80/20. Now it's close to 95/5. No work has been done to > > make socket filter type more powerful. It still has to use > > slow-ish ld_abs skb access while tc/xdp have direct packet access. > > Things like register value tracking is root only as well and so on > > and so forth. > > We cannot just flip the switch and allow type_cgroup* to unpriv > > and I don't see any volunteers willing to do this work. > > Until that happens there is no point coming up with designs > > for 'cgroup bpf controller'... whatever that means. > > Sure there is. If delegation can be turned on without changing the > API, then the result will be easier to work with and have fewer > compatibility issues. ... and open() of the directory done by the current api will preserve cgroup delegation when and only when bpf_prog_type_cgroup_* becomes unprivileged. I'm not proposing creating new api here. > > > >> I currently should happen before bpf+cgroup is enabled in a release: > >> > >> 1. Make it netns-aware. This could be as simple as making it only > >> work in the root netns because then real netns awareness can be added > >> later without breaking anything. The current situation is bad in that > >> network namespaces are just ignored and it's plausible that people > >> will start writing user code that depends on having network namespaces > >> be ignored. > > > > nothing in bpf today is netns-aware and frankly I don't see > > how cgroup+bpf has anything to do with netns. > > For regular sockets+bpf we don't check netns. > > When tcpdump opens raw socket and attaches bpf there are no netns > > checks, since socket itself gives a scope for the program to run. > > Same thing applies to cgroup+bpf. cgroup gives a scope for the program. > > But, say, we indeed add 'if !root ns' check to BPF_CGROUP_INET_* > > hooks. > > > Here I completely disagree with you. tcpdump sees packets in its > network namespace. Regular sockets apply bpf filters to the packets > seen by that socket, and the socket itself is scoped to a netns. > > Meanwhile, cgroup+bpf actually appears to be buggy in this regard even > regardless of what semantics you think are better. sk_bound_dev_if is > exposed as a u32 value, but sk_bound_dev_if only has meaning within a > given netns. The "ip vrf" stuff will straight-up malfunction if a > process affected by its hook runs in a different netns from the netns > that "ip vrf" was run in. how is that any different from normal 'ip netns exec'? that is expected user behavior. > IOW, the current code is buggy. > > > Then if the hooks are used for security, the process > > only needs to do setns() to escape security sandbox. Obviously > > broken semantics. > > This could go both ways. If the goal is to filter packets, then it's > not really important to have the filter keep working if the sandboxed > task unshares netns -- in the new netns, there isn't any access to the > network at all. If the goal is to reduce attack surface by in container networking based on netns all netns-es are connected at l2 or l3, whereas socket is l4+ abstraction. Therefore doing 'if (!root_ns) allow' at socket layer is simply broken from security point of view. > restricting the types of sockets that can be created, then you do want > the filter to work across namespaces, but seccomp can do that too and > the current code doesn't handle netns correctly. are you saying that seccomp supports netns filtering? please show the proof. > >> 2. Make it inherit properly. Inner cgroups should not override outer > >> hooks. As in (1), this could be simplified by preventing the same > >> hook from being configured in both an ancestor and a descendent > >> cgroup. Then inheritance could be added for real later on. > > > > In general it sounds fine, but it seems the reasoning to add > > such restriction now (instead of later), so that program chain can > > be added without breaking abi, since if we don't restrict it now > > there will be no way to add it without breaking abi?! > > Adding it the straightforward way (by simply making all the hooks in > scope run) would break ABI. Of course there are more complicated ways > to do it, but those are more complicated and messier. Do actually adding 'prog override is not allowed' flag will obviously _not_ break abi. If program is attached to a cgroup in this mode, the descendants won't be able to attach. Trivial boolean flag check at attach time. Just like 'priority' mode will not break it. > have a use case in which overriding of hooks is a good thing? As far > as I can tell, the original version of the patch set didn't have hooks > get overridden by descendents, and I don't know why this changed in > the first place. what's been missing since the beginning of the thread that cgroup+bpf is the bpf feature and on bpf side we can handle program chaining with priority in flexible way. bpf prog needs to be able to attach to descendent and make sure that the attached program is the only one that makes the decision. Our main use case is application network analytics and enforcement when apps are not malicious. In this case the container management framework will attach one program to a particular part of the hierarchy and will create whatever necessary combinations of the programs for descendents, since it's one common framework for them all and running single program is faster than walking link lists and doing parsing and IP lookups several times. It's often the case that prep work like skb parsing and map lookups are common across programs, but filtering/monitoring logic is different, so dumb link list and call all programs one by one is inferior. Whereas multiple program combining on the user space side can be done efficiently. To summarize, I think the 'prog override is not allowed' flag would be ok feature to have and I wouldn't mind making it the default when no 'flag' field is passed to bpf syscall, but it's not acceptable to remove current 'prog override is allowed' behavior. So if you insist on changing the default, please implement the flag asap. Though from security point of view and ABI point of view there is absolutely no difference whether this flag is added today or a year later while the default is kept as-is. > > Also until bpf_type_cgroup* becomes unprivileged there is no reason > > to add this 'priority/prog chaining' feature, since if it's > > used for security the root can always override it no matter cgroup > > hierarchy. > > > >> 3. Give cgroup delegation support some serious thought. In > >> particular, if delegation would be straightforward but the current API > >> wouldn't work well with delegation, then at least consider whether the > >> API should change before it becomes stable so that two APIs don't need > >> to be supported going forward. > > > > please see example above. Since we went with bpf syscall (instead of > > inextensible ioctl) we can add any new cgroup+bpf logic without > > breaking current abi. > > But then you end up with two ABIs. Ideally delegation would just work > -- then programs written now could run unmodified in unprivileged > containers in future kernels. At this point I don't see a reason to have two APIs. When bpf_type_cgroup* becomes available to unprivileged users the same api will work as-is. ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <20170123043154.GA93519-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <20170123043154.GA93519-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> @ 2017-01-23 20:20 ` Andy Lutomirski [not found] ` <CALCETrXK65itdDqYTdhB-Td8d-Hzj00dcDScUOUh9psCZN_cLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Andy Lutomirski @ 2017-01-23 20:20 UTC (permalink / raw) To: Alexei Starovoitov Cc: Tejun Heo, Michal Hocko, Peter Zijlstra, David Ahern, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Sun, Jan 22, 2017 at 8:31 PM, Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote: >> On Thu, Jan 19, 2017 at 6:39 PM, Alexei Starovoitov >> <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: >> > On Wed, Jan 18, 2017 at 06:29:22PM -0800, Andy Lutomirski wrote: >> >> I think it could work by making a single socket cgroup controller that >> >> handles all cgroup things that are bound to a socket. Using >> > >> > Such 'socket cgroup controller' would limit usability of the feature >> > to sockets and force all other use cases like landlock to invent >> > their own wheel, which is undesirable. Everyone will be >> > inventing new 'foo cgroup controller', while all of them >> > are really bpf features. They are different bpf program >> > types that attach to different hooks and use cgroup for scoping. >> >> Can you elaborate on why that would be a problem? In a cgroup v1 >> world, users who want different hierarchies for different types of >> control could easily want one hierarchy for socket hooks and a >> different hierarchy for lsm hooks. In a cgroup v2 delegation world, I >> could easily imagine the decision to delegate socket hooks being >> different from the decision to delegate lsm hooks. Almost all of the >> code would be shared between different bpf-using cgroup controllers. > > how do you think it can be enforced when directory is chowned? A combination of delegation policy (a la subtree_control in the parent) and MAC policy. I'm quite confident that apparmor can apply policy to cgroupfs and I'm reasonably confident (although slightly less so) that selinux can as well. The docs suck :( But if there's only one file in there to apply MAC policy to, the ability to differentiate between subsystems is quite limited. In the current API, there are no files to apply policy to, so it won't work at all. > > ... and open() of the directory done by the current api will preserve > cgroup delegation when and only when bpf_prog_type_cgroup_* > becomes unprivileged. > I'm not proposing creating new api here. I don't know what you mean. open() of the directory can't check anything useful because it has to work for programs that just want to read the directory. >> Meanwhile, cgroup+bpf actually appears to be buggy in this regard even >> regardless of what semantics you think are better. sk_bound_dev_if is >> exposed as a u32 value, but sk_bound_dev_if only has meaning within a >> given netns. The "ip vrf" stuff will straight-up malfunction if a >> process affected by its hook runs in a different netns from the netns >> that "ip vrf" was run in. > > how is that any different from normal 'ip netns exec'? > that is expected user behavior. # ./ip link add dev vrf0 type vrf table 10 # ./ip vrf exec vrf0 ./show_bind Default binding is "vrf0" # ./ip vrf exec vrf0 unshare -n ./show_bind show_bind: getsockopt: No such device The expected behavior to me is that ip netns exec (or equivalently unshare -n) gives a clean slate. Actually malfunctioning (which this example using the latest iproute2 and linux just did) is not expected. I'm done with this part of this thread and I'm sending a patch. > >> restricting the types of sockets that can be created, then you do want >> the filter to work across namespaces, but seccomp can do that too and >> the current code doesn't handle netns correctly. > > are you saying that seccomp supports netns filtering? please show the proof. It can trivially restruct the types of sockets that are created by filtering on socket(2) syscall parameters, at least on sane architectures that don't use socketcall(). > To summarize, I think the 'prog override is not allowed' flag would be > ok feature to have and I wouldn't mind making it the default when no 'flag' > field is passed to bpf syscall, but it's not acceptable to remove current > 'prog override is allowed' behavior. > So if you insist on changing the default, please implement the flag asap. > Though from security point of view and ABI point of view there is absolutely > no difference whether this flag is added today or a year later while > the default is kept as-is. It's too late and I have too little time. I'll try to write a patch to change the default to just deny attempts to override. Better behavior can be added later. IMO your suggestions about priorities are overcomplicated. For your instrumentation needs, I can imagine that a simple "make this hook not run if a descendent has a hook" would do it. For everything else, run them all in tree order (depending on filter type) and let the eBPF code do whatever it wants to do. ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <CALCETrXK65itdDqYTdhB-Td8d-Hzj00dcDScUOUh9psCZN_cLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <CALCETrXK65itdDqYTdhB-Td8d-Hzj00dcDScUOUh9psCZN_cLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-02-03 21:07 ` Andy Lutomirski 2017-02-03 23:21 ` Alexei Starovoitov 0 siblings, 1 reply; 49+ messages in thread From: Andy Lutomirski @ 2017-02-03 21:07 UTC (permalink / raw) To: Alexei Starovoitov Cc: Tejun Heo, Michal Hocko, Peter Zijlstra, David Ahern, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Mon, Jan 23, 2017 at 12:20 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote: > On Sun, Jan 22, 2017 at 8:31 PM, Alexei Starovoitov > <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: >> On Thu, Jan 19, 2017 at 08:04:59PM -0800, Andy Lutomirski wrote: >>> restricting the types of sockets that can be created, then you do want >>> the filter to work across namespaces, but seccomp can do that too and >>> the current code doesn't handle netns correctly. >> >> are you saying that seccomp supports netns filtering? please show the proof. > > It can trivially restruct the types of sockets that are created by > filtering on socket(2) syscall parameters, at least on sane > architectures that don't use socketcall(). I think this is actually wrong -- the socket creation filter appears to be called only on inet sockets. Is there a good reason for that? > >> To summarize, I think the 'prog override is not allowed' flag would be >> ok feature to have and I wouldn't mind making it the default when no 'flag' >> field is passed to bpf syscall, but it's not acceptable to remove current >> 'prog override is allowed' behavior. >> So if you insist on changing the default, please implement the flag asap. >> Though from security point of view and ABI point of view there is absolutely >> no difference whether this flag is added today or a year later while >> the default is kept as-is. > > It's too late and I have too little time. I'll try to write a patch > to change the default to just deny attempts to override. Better > behavior can be added later. > > IMO your suggestions about priorities are overcomplicated. For your > instrumentation needs, I can imagine that a simple "make this hook not > run if a descendent has a hook" would do it. For everything else, run > them all in tree order (depending on filter type) and let the eBPF > code do whatever it wants to do. Is there any plan to address this? If not, I'll try to write that patch this weekend. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2017-02-03 21:07 ` Andy Lutomirski @ 2017-02-03 23:21 ` Alexei Starovoitov [not found] ` <20170203232154.GA30114-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Alexei Starovoitov @ 2017-02-03 23:21 UTC (permalink / raw) To: Andy Lutomirski Cc: Tejun Heo, Michal Hocko, Peter Zijlstra, David Ahern, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel@vger.kernel.org, Network Development On Fri, Feb 03, 2017 at 01:07:39PM -0800, Andy Lutomirski wrote: > > Is there any plan to address this? If not, I'll try to write that > patch this weekend. yes. I'm working on 'disallow program override' flag. It got stalled, because netns discussion got stalled. Later today will send a patch for dev_id+inode and will continue on the flag patch. ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <20170203232154.GA30114-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <20170203232154.GA30114-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> @ 2017-02-04 17:10 ` Andy Lutomirski 0 siblings, 0 replies; 49+ messages in thread From: Andy Lutomirski @ 2017-02-04 17:10 UTC (permalink / raw) To: Alexei Starovoitov Cc: Tejun Heo, Michal Hocko, Peter Zijlstra, David Ahern, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Fri, Feb 3, 2017 at 3:21 PM, Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > On Fri, Feb 03, 2017 at 01:07:39PM -0800, Andy Lutomirski wrote: >> >> Is there any plan to address this? If not, I'll try to write that >> patch this weekend. > > yes. I'm working on 'disallow program override' flag. > It got stalled, because netns discussion got stalled. > Later today will send a patch for dev_id+inode and > will continue on the flag patch. > Would it make sense to try to document what your proposal does before writing the code? I don't yet see how to get semantics that are both simple and sensible with a "disallow override" flag. I *do* see how to get simple, sensible semantics with an approach where all the programs in scope for the cgroup in question get called. If needed, I can imagine a special "overridable" program that would not be run if the socket in question is bound to a descendent cgroup that also has an "overridable" program but would still let all the normal hierarchical programs in scope get called. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <CALCETrXv3k+=C=9D5YGypzPk8_f4yo8ShQ3tM2fOo++gJFBM4A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2017-01-19 0:59 ` Tejun Heo @ 2017-01-19 1:01 ` Mickaël Salaün 1 sibling, 0 replies; 49+ messages in thread From: Mickaël Salaün @ 2017-01-19 1:01 UTC (permalink / raw) To: Andy Lutomirski, Tejun Heo Cc: Michal Hocko, Peter Zijlstra, David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development [-- Attachment #1.1: Type: text/plain, Size: 2265 bytes --] On 19/01/2017 01:18, Andy Lutomirski wrote: >>> it explicitly respects the cgroup hierarchy. It shows up in >>> /proc/cgroups, and I had no problem mounting a cgroupfs instance with >>> perf_event enabled. So I'm not sure what you mean. >> >> That all it's doing is providing membership information. > > But it's doing it wrong! Even perf_event tests for membership in a > given cgroup *or one of its descendents*. This code does not. > > I think the moral of the story here is that there are lots of open > questions and design work to be done and that this feature really > isn't ready to be stable. For Landlock, I believe that it really > needs to be done right and I will put my foot down and NAK any effort > to have Landlock available in a released kernel without resolving > these types of issues first. Does anyone really want Landlock to work > differently than the net hooks simply because the net hooks were in a > rush? About Landlock, there is two use cases: The first is to allow unprivileged users to tie eBPF programs (rules) to processes. This is the (final) goal. In this case, a (cgroup) hierarchy is mandatory, otherwise it would be trivial to defeat any access rule. This is the same issue with namespaces. The second use case is to only allow privileged users to tie eBPF programs to processes. As discussed, this will be the next series, preceding the unprivileged series. In this privilege case, only root (global CAP_SYS_ADMIN, no namespaces) may be able to use Landlock. Not having a hierarchy is not a security issue (only a practical one). The first/next Landlock series (in February) will focus on process hierarchies (without cgroup), a la seccomp-bpf. It would be too confusing to not use an inheritable hierarchy like seccomp does, even in a privileged-only first approach. The inherited rules should then behave similarly to the seccomp-bpf filters. However, the following series focusing on cgroup could keep the current cgroup-bpf behavior, without hierarchy. I don't like the non-hierarchy approach very much because it add another (less flexible and more confusing) way to do things (for Landlock at least), but I'm willing to do it if needed. Regards, Mickaël [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Potential issues (security and otherwise) with the current cgroup-bpf API @ 2016-12-17 18:18 Andy Lutomirski 2016-12-17 19:26 ` Mickaël Salaün 2016-12-19 20:56 ` Alexei Starovoitov 0 siblings, 2 replies; 49+ messages in thread From: Andy Lutomirski @ 2016-12-17 18:18 UTC (permalink / raw) To: Daniel Mack, Alexei Starovoitov, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra Cc: Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development [-- Attachment #1: Type: text/plain, Size: 4679 bytes --] Hi all- I apologize for being rather late with this. I didn't realize that cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see it on the linux-api list, so I missed the discussion. I think that the inet ingress, egress etc filters are a neat feature, but I think the API has some issues that will bite us down the road if it becomes stable in its current form. Most of the problems I see are summarized in this transcript: # mkdir cg2 # mount -t cgroup2 none cg2 # mkdir cg2/nosockets # strace cgrp_socket_rule cg2/nosockets/ 0 ... open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 ^^^^ You can modify a cgroup after opening it O_RDONLY? bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, log_buf=0x6020c0, kern_version=0}, 48) = 4 ^^^^ This is fine. The bpf() syscall manipulates bpf objects. bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 ^^^^ This is not so good: ^^^^ ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This ^^^^ is manipulating a cgroup. There's no reason that a socket creation ^^^^ filter couldn't be written in a different language (new iptables ^^^^ table? Simple list of address families?), but if that happened, ^^^^ then using bpf() to install it would be entirely nonsensical. ^^^^ ^^^^ b) This is starting to be an excessively ugly multiplexer. Among ^^^^ other things, it's very unfriendly to seccomp. # echo $$ >cg2/nosockets/cgroup.procs # ping 127.0.0.1 ping: socket: Operation not permitted # ls cg2/nosockets/ cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control # cat cg2/nosockets/cgroup.controllers ^^^^ Something in cgroupfs should give an indication that this cgroup ^^^^ filters socket creation, but there's nothing there. You should also ^^^^ be able to turn the filter off from cgroupfs. # mkdir cg2/nosockets/sockets # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1 ^^^^ This succeeded, which means that, if this feature is enabled in 4.10, ^^^^ then we're stuck with its semantics. If it returned -EINVAL instead, ^^^^ there would be a chance to refine it. # echo $$ >cg2/nosockets/sockets/cgroup.procs # ping 127.0.0.1 PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms ^C --- 127.0.0.1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms ^^^^ Bash was inside a cgroup that disallowed socket creation, but socket ^^^^ creation wasn't disallowed. This means that the obvious use of socket ^^^^ creation filters in nestable constainers fails insecurely. There's also a subtle but nasty potential security problem here. In 4.9 and before, cgroups has only one real effect in the kernel: resource control. A process in a malicious cgroup could be DoSed, but that was about the extent of the damage that a malicious cgroup could do. In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf programs attached that can do things if various events occur. (Right now, this means socket operations, but there are plans in the works to do this for LSM hooks too.) These bpf programs can say yes or no, but they can also read out various data (including socket payloads!) and save them away where an attacker can find them. This sounds a lot like seccomp with a narrower scope but a much stronger ability to exfiltrate private information. Unfortunately, while seccomp is very, very careful to prevent injection of a privileged victim into a malicious sandbox, the CGROUP_BPF mechanism appears to have no real security model. There is nothing to prevent a program that's in a malicious cgroup from running a setuid binary, and there is nothing to prevent a program that has the ability to move itself or another program into a malicious cgroup from doing so and then, if needed for exploitation, exec a setuid binary. This isn't much of a problem yet because you currently need CAP_NET_ADMIN to create a malicious sandbox in the first place. I'm sure that, in the near future, someone will want to make this stuff work in containers with delegated cgroup hierarchies, and then there may be a real problem here. I've included a few security people on this thread. The current API looks abusable, and it would be nice to find all the holes before 4.10 comes out. (The cgrp_socket_rule source is attached. You can build it by sticking it in samples/bpf and doing: $ make headers_install $ cd samples/bpf $ gcc -o cgrp_socket_rule cgrp_socket_rule.c libbpf.c -I../../usr/include ) --Andy [-- Attachment #2: cgrp_socket_rule.c --] [-- Type: text/x-csrc, Size: 1545 bytes --] /* eBPF example program: * * - Loads eBPF program * * The eBPF program sets the sk_bound_dev_if index in new AF_INET{6} * sockets opened by processes in the cgroup. * * - Attaches the new program to a cgroup using BPF_PROG_ATTACH */ #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <stddef.h> #include <string.h> #include <unistd.h> #include <assert.h> #include <errno.h> #include <fcntl.h> #include <net/if.h> #include <linux/bpf.h> #include "libbpf.h" static int prog_load(int value) { struct bpf_insn prog[] = { BPF_MOV64_IMM(BPF_REG_0, value), /* r0 = verdict */ BPF_EXIT_INSN(), }; return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCK, prog, sizeof(prog), "GPL", 0); } static int usage(const char *argv0) { printf("Usage: %s cg-path value\n", argv0); return EXIT_FAILURE; } int main(int argc, char **argv) { int cg_fd, prog_fd, value, ret; if (argc < 2) return usage(argv[0]); cg_fd = open(argv[1], O_DIRECTORY | O_RDONLY); if (cg_fd < 0) { printf("Failed to open cgroup path: '%s'\n", strerror(errno)); return EXIT_FAILURE; } value = atoi(argv[2]); prog_fd = prog_load(value); /* printf("Output from kernel verifier:\n%s\n-------\n", bpf_log_buf); */ if (prog_fd < 0) { printf("Failed to load prog: '%s'\n", strerror(errno)); return EXIT_FAILURE; } ret = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_SOCK_CREATE); if (ret < 0) { printf("Failed to attach prog to cgroup: '%s'\n", strerror(errno)); return EXIT_FAILURE; } return EXIT_SUCCESS; } ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2016-12-17 18:18 Andy Lutomirski @ 2016-12-17 19:26 ` Mickaël Salaün 2016-12-17 20:02 ` Andy Lutomirski 2016-12-19 20:56 ` Alexei Starovoitov 1 sibling, 1 reply; 49+ messages in thread From: Mickaël Salaün @ 2016-12-17 19:26 UTC (permalink / raw) To: Andy Lutomirski, Daniel Mack, Alexei Starovoitov, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra Cc: Linux API, linux-kernel@vger.kernel.org, Network Development, John Stultz [-- Attachment #1.1: Type: text/plain, Size: 7074 bytes --] On 17/12/2016 19:18, Andy Lutomirski wrote: > Hi all- > > I apologize for being rather late with this. I didn't realize that > cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see > it on the linux-api list, so I missed the discussion. > > I think that the inet ingress, egress etc filters are a neat feature, > but I think the API has some issues that will bite us down the road > if it becomes stable in its current form. > > Most of the problems I see are summarized in this transcript: > > # mkdir cg2 > # mount -t cgroup2 none cg2 > # mkdir cg2/nosockets > # strace cgrp_socket_rule cg2/nosockets/ 0 > ... > open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 > > ^^^^ You can modify a cgroup after opening it O_RDONLY? I sent a patch to check the cgroup.procs permission before attaching a BPF program to it [1], but it was not merged because not part of the current security model (which may not be crystal clear). The thing is that the current socket/BPF/cgroup feature is only available to a process with the *global CAP_NET_ADMIN* and such a process can already modify the network for every processes, so it doesn't make much sense to check if it can modify the network for a subset of this processes. [1] https://lkml.org/lkml/2016/9/19/854 However, needing a process to open a cgroup *directory* in write mode may not make sense because the process does not modify the content of the cgroup but only use it as a *reference* in the network stack. Forcing an open with write mode may forbid to use this kind of network-filtering feature in a read-only file-system but not necessarily read-only *network configuration*. Another point of view is that the CAP_NET_ADMIN may be an unneeded privilege if the cgroup migration is using a no_new_privs-like feature as I proposed with Landlock [2] (with an extra ptrace_may_access() check). The new capability proposition for cgroup may be interesting too. [2] https://lkml.org/lkml/2016/9/14/82 > > bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, > insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, > log_buf=0x6020c0, kern_version=0}, 48) = 4 > > ^^^^ This is fine. The bpf() syscall manipulates bpf objects. > > bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 > > ^^^^ This is not so good: > ^^^^ > ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This > ^^^^ is manipulating a cgroup. There's no reason that a socket creation > ^^^^ filter couldn't be written in a different language (new iptables > ^^^^ table? Simple list of address families?), but if that happened, > ^^^^ then using bpf() to install it would be entirely nonsensical. Another point of view is to say that the BPF program (called by the network stack) is using a reference to a set of processes thanks to a cgroup. > ^^^^ > ^^^^ b) This is starting to be an excessively ugly multiplexer. Among > ^^^^ other things, it's very unfriendly to seccomp. FWIW, Landlock will have the capability to filter this kind of action. > > # echo $$ >cg2/nosockets/cgroup.procs > # ping 127.0.0.1 > ping: socket: Operation not permitted > # ls cg2/nosockets/ > cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control > # cat cg2/nosockets/cgroup.controllers > > ^^^^ Something in cgroupfs should give an indication that this cgroup > ^^^^ filters socket creation, but there's nothing there. You should also > ^^^^ be able to turn the filter off from cgroupfs. Right. Everybody was OK at LPC to add such an information but it is not there yet. > > # mkdir cg2/nosockets/sockets > # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1 > > ^^^^ This succeeded, which means that, if this feature is enabled in 4.10, > ^^^^ then we're stuck with its semantics. If it returned -EINVAL instead, > ^^^^ there would be a chance to refine it. This is indeed unfortunate. > > # echo $$ >cg2/nosockets/sockets/cgroup.procs > # ping 127.0.0.1 > PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. > 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms > ^C > --- 127.0.0.1 ping statistics --- > 1 packets transmitted, 1 received, 0% packet loss, time 0ms > rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms > > ^^^^ Bash was inside a cgroup that disallowed socket creation, but socket > ^^^^ creation wasn't disallowed. This means that the obvious use of socket > ^^^^ creation filters in nestable constainers fails insecurely. > > > There's also a subtle but nasty potential security problem here. > In 4.9 and before, cgroups has only one real effect in the kernel: > resource control. A process in a malicious cgroup could be DoSed, > but that was about the extent of the damage that a malicious cgroup > could do. > > In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf > programs attached that can do things if various events occur. (Right > now, this means socket operations, but there are plans in the works > to do this for LSM hooks too.) These bpf programs can say yes or no, > but they can also read out various data (including socket payloads!) > and save them away where an attacker can find them. This sounds a > lot like seccomp with a narrower scope but a much stronger ability to > exfiltrate private information. > > Unfortunately, while seccomp is very, very careful to prevent > injection of a privileged victim into a malicious sandbox, the > CGROUP_BPF mechanism appears to have no real security model. There > is nothing to prevent a program that's in a malicious cgroup from > running a setuid binary, and there is nothing to prevent a program > that has the ability to move itself or another program into a > malicious cgroup from doing so and then, if needed for exploitation, > exec a setuid binary. > > This isn't much of a problem yet because you currently need > CAP_NET_ADMIN to create a malicious sandbox in the first place. I'm > sure that, in the near future, someone will want to make this stuff > work in containers with delegated cgroup hierarchies, and then there > may be a real problem here. > > > I've included a few security people on this thread. The current API > looks abusable, and it would be nice to find all the holes before > 4.10 comes out. > > > (The cgrp_socket_rule source is attached. You can build it by sticking it > in samples/bpf and doing: > > $ make headers_install > $ cd samples/bpf > $ gcc -o cgrp_socket_rule cgrp_socket_rule.c libbpf.c -I../../usr/include > ) > > --Andy > Right. Because this feature doesn't handle namespaces (only global CAP_NET_ADMIN), nested containers should not be allowed to use it at all. If we want this kind of feature to be usable by something other than a process with a global capability, then we need an inheritance mechanism similar to what I designed for Landlock. I think it could be added later. Regards, Mickaël [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2016-12-17 19:26 ` Mickaël Salaün @ 2016-12-17 20:02 ` Andy Lutomirski 0 siblings, 0 replies; 49+ messages in thread From: Andy Lutomirski @ 2016-12-17 20:02 UTC (permalink / raw) To: Mickaël Salaün Cc: Andy Lutomirski, Daniel Mack, Alexei Starovoitov, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel@vger.kernel.org, Network Development, John Stultz, Eric W. Biederman On Sat, Dec 17, 2016 at 11:26 AM, Mickaël Salaün <mic@digikod.net> wrote: > > On 17/12/2016 19:18, Andy Lutomirski wrote: >> Hi all- >> >> I apologize for being rather late with this. I didn't realize that >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see >> it on the linux-api list, so I missed the discussion. >> >> I think that the inet ingress, egress etc filters are a neat feature, >> but I think the API has some issues that will bite us down the road >> if it becomes stable in its current form. >> >> Most of the problems I see are summarized in this transcript: >> >> # mkdir cg2 >> # mount -t cgroup2 none cg2 >> # mkdir cg2/nosockets >> # strace cgrp_socket_rule cg2/nosockets/ 0 >> ... >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 >> >> ^^^^ You can modify a cgroup after opening it O_RDONLY? > > I sent a patch to check the cgroup.procs permission before attaching a > BPF program to it [1], but it was not merged because not part of the > current security model (which may not be crystal clear). The thing is > that the current socket/BPF/cgroup feature is only available to a > process with the *global CAP_NET_ADMIN* and such a process can already > modify the network for every processes, so it doesn't make much sense to > check if it can modify the network for a subset of this processes. > > [1] https://lkml.org/lkml/2016/9/19/854 I don't think that's quite the right approach. First, checking permissions on an fd that's not the fd passed to the syscall is weird and IMO shouldn't be done without a good reason. Second, cgroups.procs seems like the wrong file. Why not create "net.socket_create_filter" and require a writable fd to *that* to be passed to the syscall. > > However, needing a process to open a cgroup *directory* in write mode > may not make sense because the process does not modify the content of > the cgroup but only use it as a *reference* in the network stack. > Forcing an open with write mode may forbid to use this kind of > network-filtering feature in a read-only file-system but not necessarily > read-only *network configuration*. It does modify the cgroup. Logically it changes the cgroup behavior. As an implementation matter, it modifies the struct cgroup. > > Another point of view is that the CAP_NET_ADMIN may be an unneeded > privilege if the cgroup migration is using a no_new_privs-like feature > as I proposed with Landlock [2] (with an extra ptrace_may_access() check). > The new capability proposition for cgroup may be interesting too. > > [2] https://lkml.org/lkml/2016/9/14/82 > >> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, >> log_buf=0x6020c0, kern_version=0}, 48) = 4 >> >> ^^^^ This is fine. The bpf() syscall manipulates bpf objects. >> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 >> >> ^^^^ This is not so good: >> ^^^^ >> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This >> ^^^^ is manipulating a cgroup. There's no reason that a socket creation >> ^^^^ filter couldn't be written in a different language (new iptables >> ^^^^ table? Simple list of address families?), but if that happened, >> ^^^^ then using bpf() to install it would be entirely nonsensical. > > Another point of view is to say that the BPF program (called by the > network stack) is using a reference to a set of processes thanks to a > cgroup. I strongly disagree with this point of view. From a user's perspective, nothing about the bpf program changed -- the cgroup changes. Even in the code, the bpf program doesn't have a reference to anything. The cgroup references the bpf program. >> >> # echo $$ >cg2/nosockets/cgroup.procs >> # ping 127.0.0.1 >> ping: socket: Operation not permitted >> # ls cg2/nosockets/ >> cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control >> # cat cg2/nosockets/cgroup.controllers >> >> ^^^^ Something in cgroupfs should give an indication that this cgroup >> ^^^^ filters socket creation, but there's nothing there. You should also >> ^^^^ be able to turn the filter off from cgroupfs. > > Right. Everybody was OK at LPC to add such an information but it is not > there yet. Then let's do it before CONFIG_CGROUP_BPF=y becomes possible in a released kernel. > Right. Because this feature doesn't handle namespaces (only global > CAP_NET_ADMIN), nested containers should not be allowed to use it at all. > If we want this kind of feature to be usable by something other than a > process with a global capability, then we need an inheritance mechanism > similar to what I designed for Landlock. I think it could be added later. I think it's okay to have a new kernel feature that doesn't support namespacing yet. I don't think it's okay to have a new kernel feature that will become problematic when namespacing is added, and I think cgroup-bpf is in the latter class. Anyway, here's a half-baked proposal to clean this all up: Make these new fiters actually be cgroup controllers. Fix whatever needs to be fixed to make that work. This will mean that they can't be the "subtree-control" type of controllers. (Maybe the same set of fixes will help with the cpu controllers, too.) Make them stack properly. Make them configurable using cgroupfs. Add a "dangerous" flag on cgroups. Cgroups are "dangerous" if they have dangerous controllers enabled. Controllers like "inet ingress" are dangerous. You can only configure dangerous controllers if all tasks in the cgroup have no_new_privs set and you have appropriate permission over the tasks or if the cgroup is empty. If trying to bind an inet ingress filter would make a cgroup dangerous and you don't have the relevent privilege, then the operation fails. You cannot move any task into a dangerous cgroup unless that task has no_new_privs set *and* you have privilege over that task or if the task is in a namespace that you have CAP_SYS_ADMIN on. (This is kind of like your proposal, and maybe they should be merged. I do think that *something* is needed and needs fleshing out. Keep in mind, though, that strictly requiring no_new_privs is excessive for namespaced applications. It might be okay to require that a cgroup only ever contain tasks in a given namespace or perhaps that a cgroup only contain tasks that are either no_new_privs *or* are in a given namespace or its descendents. (In fact, unshare() can *clear* no_new_privs because being in a fresh userns has more or less the same effect.) --Andy ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2016-12-17 18:18 Andy Lutomirski 2016-12-17 19:26 ` Mickaël Salaün @ 2016-12-19 20:56 ` Alexei Starovoitov 2016-12-19 21:23 ` Andy Lutomirski 1 sibling, 1 reply; 49+ messages in thread From: Alexei Starovoitov @ 2016-12-19 20:56 UTC (permalink / raw) To: Andy Lutomirski Cc: Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel@vger.kernel.org, Network Development On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote: > Hi all- > > I apologize for being rather late with this. I didn't realize that > cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see > it on the linux-api list, so I missed the discussion. > > I think that the inet ingress, egress etc filters are a neat feature, > but I think the API has some issues that will bite us down the road > if it becomes stable in its current form. > > Most of the problems I see are summarized in this transcript: > > # mkdir cg2 > # mount -t cgroup2 none cg2 > # mkdir cg2/nosockets > # strace cgrp_socket_rule cg2/nosockets/ 0 > ... > open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 > > ^^^^ You can modify a cgroup after opening it O_RDONLY? > > bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, > insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, > log_buf=0x6020c0, kern_version=0}, 48) = 4 > > ^^^^ This is fine. The bpf() syscall manipulates bpf objects. > > bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 > > ^^^^ This is not so good: > ^^^^ > ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This > ^^^^ is manipulating a cgroup. There's no reason that a socket creation > ^^^^ filter couldn't be written in a different language (new iptables > ^^^^ table? Simple list of address families?), but if that happened, > ^^^^ then using bpf() to install it would be entirely nonsensical. I don't see why it's _modifing_ the cgroup. I'm looking at it as network stack is using cgroup as an application group that should invoke bpf program at the certain point in the stack. imo cgroup management is orthogonal. I certainly don't mind adding 'cat cgroup.bpf' control file that will print the program digest for debugging, just like we do for fdinfo, but cgroup file interface shouldn't be used to control networking. If there was something like networking-wide ioctl, I think it could have been an alternative way to attach a program to a hook to a cgroup. Adding a control file to a cgroup for the purpose of attaching bpf, just doesn't make sense to me, since it's dragging bpf and networking details into cgroup land. Imagine in the future somebody would want to have nft to perform similar BPF_CGROUP_INET_INGRESS functionality. I'd think that the position of the hook within the networking stack would remain the same, but enum id will be different, since other ids in that enum (like BPF_CGROUP_INET_SOCK_CREATE) are not applicable to nft. Similarly the concept of program_fd doesn't exist there either. So it would be using its own netlink based mechanism and its own way of enumerating hook id. And cgroup could be passed as a string into netlink message instead of fd. So if somebody adds a control file to a cgroup api that takes bpf's hookid and bpf's program_fd, such control file will only be applicable to bpf and not usable for anything else. Hence I see no reason to go that route. > ^^^^ > ^^^^ b) This is starting to be an excessively ugly multiplexer. Among > ^^^^ other things, it's very unfriendly to seccomp. > > # echo $$ >cg2/nosockets/cgroup.procs > # ping 127.0.0.1 > ping: socket: Operation not permitted > # ls cg2/nosockets/ > cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control > # cat cg2/nosockets/cgroup.controllers > > ^^^^ Something in cgroupfs should give an indication that this cgroup > ^^^^ filters socket creation, but there's nothing there. You should also > ^^^^ be able to turn the filter off from cgroupfs. Doing 'cat cg2/nosockets/cgroup.bpf' here would be useful for debugging to see which program is attached to which hook in this cgroup. Detaching programs via cgroup control file is a lot less clear, since it will be confusing to a running application that attached the program in the first place, since it won't have any indication that external entity detached the program. One can argue that it could be useful for curious human/admin to detach all bpf progs from all hooks for this cgroup, but then it probably needs dmesg warning and only usable in extreme cases as debugging and not something control plane should ever use. > # mkdir cg2/nosockets/sockets > # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1 > > ^^^^ This succeeded, which means that, if this feature is enabled in 4.10, > ^^^^ then we're stuck with its semantics. If it returned -EINVAL instead, > ^^^^ there would be a chance to refine it. i don't see the problem with this behavior. bpf and cgroup are indepedent. Imange that socket accounting program is attached to cg2/nosockets. The program is readonly and carry no security meaning. Why cgroup should prevent creation of cg2/nosockets/foo directory ? > # echo $$ >cg2/nosockets/sockets/cgroup.procs > # ping 127.0.0.1 > PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. > 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms hmm. in this case the admin created a subgroup with a group that allows ping, so? how's that a problem? In case of vrf I can imagine a set of application auto-binding to one vrf and smaller set of applications binding to a different vrf. Or broader set of applications monitoring all tcp traffic and subset of them monitoring udp instead. Those are valid use cases. I guess you're advocating to run a link list of programs? That won't be useful for the above use cases, where there is no reason to run more than one program that control plane assigned to run for this cgroup. At this stage we decided to allow only one program per cgroup per hook and later can extend it if necessary. As you're pointing out, in case of security, we probably want to preserve original bpf program that should always be run first and only after it returned 'ok' (we'd need to define what 'ok' means in secruity context) run program attached to sub-hierarchy. Another alternative is to disallow attaching programs in sub-hierarchy if parent has something already attached, but it's not useful for general case. All of these are possible future extensions. > In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf > programs attached that can do things if various events occur. (Right > now, this means socket operations, but there are plans in the works > to do this for LSM hooks too.) These bpf programs can say yes or no, > but they can also read out various data (including socket payloads!) > and save them away where an attacker can find them. This sounds a > lot like seccomp with a narrower scope but a much stronger ability to > exfiltrate private information. that sounds like a future problem to solve when bpf+lsm+cgroup are used for security. > This isn't much of a problem yet because you currently need > CAP_NET_ADMIN to create a malicious sandbox in the first place. I'm > sure that, in the near future, someone will want to make this stuff > work in containers with delegated cgroup hierarchies, and then there > may be a real problem here. I agree. Currently cgroup+bpf+hooks are for root only and have to go long way before they can be used for security and sandboxing. > I've included a few security people on this thread. The current API > looks abusable, and it would be nice to find all the holes before > 4.10 comes out. I disagree with the urgency. I see nothing that needs immediate action. bpf+lsm+cgroup is not in the tree yet. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2016-12-19 20:56 ` Alexei Starovoitov @ 2016-12-19 21:23 ` Andy Lutomirski 2016-12-20 0:02 ` Alexei Starovoitov 0 siblings, 1 reply; 49+ messages in thread From: Andy Lutomirski @ 2016-12-19 21:23 UTC (permalink / raw) To: Alexei Starovoitov Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel@vger.kernel.org, Network Development On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote: >> Hi all- >> >> I apologize for being rather late with this. I didn't realize that >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see >> it on the linux-api list, so I missed the discussion. >> >> I think that the inet ingress, egress etc filters are a neat feature, >> but I think the API has some issues that will bite us down the road >> if it becomes stable in its current form. >> >> Most of the problems I see are summarized in this transcript: >> >> # mkdir cg2 >> # mount -t cgroup2 none cg2 >> # mkdir cg2/nosockets >> # strace cgrp_socket_rule cg2/nosockets/ 0 >> ... >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 >> >> ^^^^ You can modify a cgroup after opening it O_RDONLY? >> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, >> log_buf=0x6020c0, kern_version=0}, 48) = 4 >> >> ^^^^ This is fine. The bpf() syscall manipulates bpf objects. >> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 >> >> ^^^^ This is not so good: >> ^^^^ >> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This >> ^^^^ is manipulating a cgroup. There's no reason that a socket creation >> ^^^^ filter couldn't be written in a different language (new iptables >> ^^^^ table? Simple list of address families?), but if that happened, >> ^^^^ then using bpf() to install it would be entirely nonsensical. > > I don't see why it's _modifing_ the cgroup. I'm looking at it as > network stack is using cgroup as an application group that should > invoke bpf program at the certain point in the stack. > imo cgroup management is orthogonal. It is literally modifying the struct cgroup, and, as a practical matter, it's causing membership in the cgroup to have a certain effect. But rather than pointless arguing, let me propose an alternative API that I think solves most of the problems here. In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely. Instead, the cgroup gets three new control files: "net.ingress_filter", "net.egress_filter", and "net.socket_create_filter". Initially, if you read these files, you see "none\n". To attach a bpf filter, you open the file for write and do an ioctl on it. After doing the ioctl, if you read the file, you'll see "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for the bpf program. To detach any type of filter, bpf or otherwise, you open the file for write and write "none\n" (or just "none"). If you write anything else to the file, you get -EINVAL. But, if someone writes a new type of filter (perhaps a simple list of address families), maybe you can enable the filter by writing something appropriate to the file. Now the API matches the effect. A cgroup with something other than "none" in one of its net.*_filter files is a cgroup that filters network activity. And you get CRIU compatibility for free: CRIU can read the control file and use whatever mechanism is uses for BPF in general to restore the cgroup filter state. As an added bonus, you get ACLs for free and the ugly multiplexer goes away. >> # mkdir cg2/nosockets/sockets >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1 >> >> ^^^^ This succeeded, which means that, if this feature is enabled in 4.10, >> ^^^^ then we're stuck with its semantics. If it returned -EINVAL instead, >> ^^^^ there would be a chance to refine it. > > i don't see the problem with this behavior. bpf and cgroup are indepedent. > Imange that socket accounting program is attached to cg2/nosockets. > The program is readonly and carry no security meaning. > Why cgroup should prevent creation of cg2/nosockets/foo directory ? I think you're misunderstanding me. What I'm saying is that, if you allow a cgroup and one of its descendents to both enable the same type of filter, you have just committed to some particular semantics for what happens. And I think that the current semantics are the *wrong* semantics for default long-term use, so you should either fix the semantics or disable the problematic case. > >> # echo $$ >cg2/nosockets/sockets/cgroup.procs >> # ping 127.0.0.1 >> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. >> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms > > hmm. in this case the admin created a subgroup with a group that allows > ping, so? how's that a problem? I think you're forgetting about namespaces. There are two different admins here. There's the admin who created the outer container and said "no sockets here". Then there's the admin inside the container who said "muahaha, I can create a sub-container and allow sockets *there*". > In case of vrf I can imagine > a set of application auto-binding to one vrf and smaller > set of applications binding to a different vrf. > Or broader set of applications monitoring all tcp traffic > and subset of them monitoring udp instead. > Those are valid use cases. > I guess you're advocating to run a link list of programs? > That won't be useful for the above use cases, where there is no > reason to run more than one program that control plane > assigned to run for this cgroup. Yes there is, for both monitoring and policy. If I want to monitor all activity in a cgroup, I probably want to monitor descendents as well. If I want to restrict a cgroup, I want to restrict its descendents. In the case where I actually don't want to hook the descendents, I'd be find with having an option to turn off recursion. Maybe net.egress_filter could also say "bpf(overridable):[hash]" or "bpf(nonrecursive):[hash]". But you should have to opt in to allowing your filter to be overridden. > At this stage we decided to allow only one program per cgroup per hook > and later can extend it if necessary. No you can't. Since you allow the problematic case and you gave it the surprising "one program" semantics, you can't change it later. > As you're pointing out, in case of security, we probably > want to preserve original bpf program that should always be > run first and only after it returned 'ok' (we'd need to define > what 'ok' means in secruity context) run program attached to sub-hierarchy. It's already defined AFAICT. 1 means okay. 0 means not okay. > Another alternative is to disallow attaching programs in sub-hierarchy > if parent has something already attached, but it's not useful > for general case. > All of these are possible future extensions. I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You have to do it now (or disable the feature for 4.10). This is why I'm bringing this whole thing up now. > >> In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf >> programs attached that can do things if various events occur. (Right >> now, this means socket operations, but there are plans in the works >> to do this for LSM hooks too.) These bpf programs can say yes or no, >> but they can also read out various data (including socket payloads!) >> and save them away where an attacker can find them. This sounds a >> lot like seccomp with a narrower scope but a much stronger ability to >> exfiltrate private information. > > that sounds like a future problem to solve when bpf+lsm+cgroup are > used for security. [...] > > I disagree with the urgency. I see nothing that needs immediate action. > bpf+lsm+cgroup is not in the tree yet. > I disagree very strongly here. The API in 4.10 is IMO quite ugly, but the result if bpf+lsm+cgroup works *differently* will be far, far uglier. I think the right solution here is to clean up the API so that it'll work for future extensions that people are already imagining. If this can happen for 4.10, great. If not, then postpone this stuff entirely. I've had code I've written for Linux postponed extensively until I've gotten the API right, and it's not so bad. (Actually, I've even had API changes I've made reverted in -stable, IIRC. This is much worse than postponing before a release.) --Andy ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2016-12-19 21:23 ` Andy Lutomirski @ 2016-12-20 0:02 ` Alexei Starovoitov [not found] ` <20161220000254.GA58895-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Alexei Starovoitov @ 2016-12-20 0:02 UTC (permalink / raw) To: Andy Lutomirski Cc: Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel@vger.kernel.org, Network Development On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote: > >> Hi all- > >> > >> I apologize for being rather late with this. I didn't realize that > >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see > >> it on the linux-api list, so I missed the discussion. > >> > >> I think that the inet ingress, egress etc filters are a neat feature, > >> but I think the API has some issues that will bite us down the road > >> if it becomes stable in its current form. > >> > >> Most of the problems I see are summarized in this transcript: > >> > >> # mkdir cg2 > >> # mount -t cgroup2 none cg2 > >> # mkdir cg2/nosockets > >> # strace cgrp_socket_rule cg2/nosockets/ 0 > >> ... > >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 > >> > >> ^^^^ You can modify a cgroup after opening it O_RDONLY? > >> > >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, > >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, > >> log_buf=0x6020c0, kern_version=0}, 48) = 4 > >> > >> ^^^^ This is fine. The bpf() syscall manipulates bpf objects. > >> > >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 > >> > >> ^^^^ This is not so good: > >> ^^^^ > >> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This > >> ^^^^ is manipulating a cgroup. There's no reason that a socket creation > >> ^^^^ filter couldn't be written in a different language (new iptables > >> ^^^^ table? Simple list of address families?), but if that happened, > >> ^^^^ then using bpf() to install it would be entirely nonsensical. > > > > I don't see why it's _modifing_ the cgroup. I'm looking at it as > > network stack is using cgroup as an application group that should > > invoke bpf program at the certain point in the stack. > > imo cgroup management is orthogonal. > > It is literally modifying the struct cgroup, and, as a practical > matter, it's causing membership in the cgroup to have a certain > effect. But rather than pointless arguing, let me propose an > alternative API that I think solves most of the problems here. > > In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely. > Instead, the cgroup gets three new control files: > "net.ingress_filter", "net.egress_filter", and > "net.socket_create_filter". Initially, if you read these files, you > see "none\n". > > To attach a bpf filter, you open the file for write and do an ioctl on > it. After doing the ioctl, if you read the file, you'll see > "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for > the bpf program. > > To detach any type of filter, bpf or otherwise, you open the file for > write and write "none\n" (or just "none"). > > If you write anything else to the file, you get -EINVAL. But, if > someone writes a new type of filter (perhaps a simple list of address > families), maybe you can enable the filter by writing something > appropriate to the file. I see no difference in what you're proposing vs what is already implemented from feature set point of view, but the file approach is very ugly, since it's a mismatch to FD style access that bpf is using everywhere. In your proposal you'd also need to add bpf prefix everywhere. So the control file names should be bpf_inet_ingress, bpf_inet_egress and bpf_socket_create. If you want to prepare such patch for them, I don't mind, but you cannot kill syscall command, since it's more flexible and your control-file approach _will_ be obsolete pretty quickly. > Now the API matches the effect. A cgroup with something other than > "none" in one of its net.*_filter files is a cgroup that filters > network activity. And you get CRIU compatibility for free: CRIU can > read the control file and use whatever mechanism is uses for BPF in > general to restore the cgroup filter state. As an added bonus, you > get ACLs for free and the ugly multiplexer goes away. extended bpf is not supported by criu. only classic, so having control_file-style attachment doesn't buy us anything. > >> # mkdir cg2/nosockets/sockets > >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1 > >> > >> ^^^^ This succeeded, which means that, if this feature is enabled in 4.10, > >> ^^^^ then we're stuck with its semantics. If it returned -EINVAL instead, > >> ^^^^ there would be a chance to refine it. > > > > i don't see the problem with this behavior. bpf and cgroup are indepedent. > > Imange that socket accounting program is attached to cg2/nosockets. > > The program is readonly and carry no security meaning. > > Why cgroup should prevent creation of cg2/nosockets/foo directory ? > > I think you're misunderstanding me. What I'm saying is that, if you > allow a cgroup and one of its descendents to both enable the same type > of filter, you have just committed to some particular semantics for > what happens. And I think that the current semantics are the *wrong* > semantics for default long-term use, so you should either fix the > semantics or disable the problematic case. Are you saying that use cases I provided are also 'wrong'? If you insist on that we won't be able to make any forward progress. The current semantics is fine for what it's designed for. > >> # echo $$ >cg2/nosockets/sockets/cgroup.procs > >> # ping 127.0.0.1 > >> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. > >> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms > > > > hmm. in this case the admin created a subgroup with a group that allows > > ping, so? how's that a problem? > > I think you're forgetting about namespaces. There are two different > admins here. There's the admin who created the outer container and > said "no sockets here". Then there's the admin inside the container > who said "muahaha, I can create a sub-container and allow sockets > *there*". container management frameworks should be doing sensible things. With any infra in the kernel there are many ways to create non-sensical setups. It's not a job of the kernel to interfere with user space. > > In case of vrf I can imagine > > a set of application auto-binding to one vrf and smaller > > set of applications binding to a different vrf. > > Or broader set of applications monitoring all tcp traffic > > and subset of them monitoring udp instead. > > Those are valid use cases. > > > I guess you're advocating to run a link list of programs? > > That won't be useful for the above use cases, where there is no > > reason to run more than one program that control plane > > assigned to run for this cgroup. > > Yes there is, for both monitoring and policy. If I want to monitor > all activity in a cgroup, I probably want to monitor descendents as > well. If I want to restrict a cgroup, I want to restrict its > descendents. you're ignoring use cases I described earlier. In vrf case there is only one ifindex it needs to bind to. > In the case where I actually don't want to hook the descendents, I'd > be find with having an option to turn off recursion. Maybe > net.egress_filter could also say "bpf(overridable):[hash]" or > "bpf(nonrecursive):[hash]". But you should have to opt in to allowing > your filter to be overridden. 'opt in' only makes sense if you're implementing sandboxing environment. It doesn't make sense as the default. > > At this stage we decided to allow only one program per cgroup per hook > > and later can extend it if necessary. > > No you can't. Since you allow the problematic case and you gave it > the surprising "one program" semantics, you can't change it later. please show me why we cannot? As far as I can see nothing prevents that in the future. We can add any number of new fields to BPF_PROG_ATTACH command just like we did in the past with other commands, whereas control file interface is not extensible. > > As you're pointing out, in case of security, we probably > > want to preserve original bpf program that should always be > > run first and only after it returned 'ok' (we'd need to define > > what 'ok' means in secruity context) run program attached to sub-hierarchy. > > It's already defined AFAICT. 1 means okay. 0 means not okay. sorry that doesn't make any sense. For seccomp we have a set of ranges that mean different things. Here you're proposing to hastily assign 1 and 0 ? How is that extensible? We need to carefully think through what should be the semantics of attaching multiple programs, consider performance implications, return codes and so on. > > Another alternative is to disallow attaching programs in sub-hierarchy > > if parent has something already attached, but it's not useful > > for general case. > > All of these are possible future extensions. > > I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You > have to do it now (or disable the feature for 4.10). This is why I'm > bringing this whole thing up now. We don't have to touch user visible api here, so extensions are fine. > >> In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf > >> programs attached that can do things if various events occur. (Right > >> now, this means socket operations, but there are plans in the works > >> to do this for LSM hooks too.) These bpf programs can say yes or no, > >> but they can also read out various data (including socket payloads!) > >> and save them away where an attacker can find them. This sounds a > >> lot like seccomp with a narrower scope but a much stronger ability to > >> exfiltrate private information. > > > > that sounds like a future problem to solve when bpf+lsm+cgroup are > > used for security. > > [...] > > > > > I disagree with the urgency. I see nothing that needs immediate action. > > bpf+lsm+cgroup is not in the tree yet. > > > > I disagree very strongly here. The API in 4.10 is IMO quite ugly, but > the result if bpf+lsm+cgroup works *differently* will be far, far > uglier. again we're talking about the future here and 'ugly' is the matter of taste. I hear all the time that people say that netlink api is ugly, so? > I think the right solution here is to clean up the API so that it'll > work for future extensions that people are already imagining. If this > can happen for 4.10, great. If not, then postpone this stuff > entirely. I've had code I've written for Linux postponed extensively > until I've gotten the API right, and it's not so bad. (Actually, I've > even had API changes I've made reverted in -stable, IIRC. This is > much worse than postponing before a release.) huh? 'not right api' because it's using bpf syscall instead of cgroup control-file? I think the opposite is the truth. syscall style is more extensible whereas control-file and text based interfaces have proven to be huge pain to extend and maintain. Again, I don't mind if you want to have both available. ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <20161220000254.GA58895-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <20161220000254.GA58895-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> @ 2016-12-20 0:25 ` Andy Lutomirski 2016-12-20 1:43 ` Andy Lutomirski [not found] ` <CALCETrU1_bDVLfokQ7zasHVmeq7S-R+603GEw59V_wuj4eE1hw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2016-12-20 1:34 ` David Miller 1 sibling, 2 replies; 49+ messages in thread From: Andy Lutomirski @ 2016-12-20 0:25 UTC (permalink / raw) To: Alexei Starovoitov Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote: >> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov >> <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: >> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote: >> >> Hi all- >> >> >> >> I apologize for being rather late with this. I didn't realize that >> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see >> >> it on the linux-api list, so I missed the discussion. >> >> >> >> I think that the inet ingress, egress etc filters are a neat feature, >> >> but I think the API has some issues that will bite us down the road >> >> if it becomes stable in its current form. >> >> >> >> Most of the problems I see are summarized in this transcript: >> >> >> >> # mkdir cg2 >> >> # mount -t cgroup2 none cg2 >> >> # mkdir cg2/nosockets >> >> # strace cgrp_socket_rule cg2/nosockets/ 0 >> >> ... >> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 >> >> >> >> ^^^^ You can modify a cgroup after opening it O_RDONLY? >> >> >> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, >> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, >> >> log_buf=0x6020c0, kern_version=0}, 48) = 4 >> >> >> >> ^^^^ This is fine. The bpf() syscall manipulates bpf objects. >> >> >> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 >> >> >> >> ^^^^ This is not so good: >> >> ^^^^ >> >> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This >> >> ^^^^ is manipulating a cgroup. There's no reason that a socket creation >> >> ^^^^ filter couldn't be written in a different language (new iptables >> >> ^^^^ table? Simple list of address families?), but if that happened, >> >> ^^^^ then using bpf() to install it would be entirely nonsensical. >> > >> > I don't see why it's _modifing_ the cgroup. I'm looking at it as >> > network stack is using cgroup as an application group that should >> > invoke bpf program at the certain point in the stack. >> > imo cgroup management is orthogonal. >> >> It is literally modifying the struct cgroup, and, as a practical >> matter, it's causing membership in the cgroup to have a certain >> effect. But rather than pointless arguing, let me propose an >> alternative API that I think solves most of the problems here. >> >> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely. >> Instead, the cgroup gets three new control files: >> "net.ingress_filter", "net.egress_filter", and >> "net.socket_create_filter". Initially, if you read these files, you >> see "none\n". >> >> To attach a bpf filter, you open the file for write and do an ioctl on >> it. After doing the ioctl, if you read the file, you'll see >> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for >> the bpf program. >> >> To detach any type of filter, bpf or otherwise, you open the file for >> write and write "none\n" (or just "none"). >> >> If you write anything else to the file, you get -EINVAL. But, if >> someone writes a new type of filter (perhaps a simple list of address >> families), maybe you can enable the filter by writing something >> appropriate to the file. > > I see no difference in what you're proposing vs what is already implemented > from feature set point of view, but the file approach is very ugly, since > it's a mismatch to FD style access that bpf is using everywhere. > In your proposal you'd also need to add bpf prefix everywhere. > So the control file names should be bpf_inet_ingress, bpf_inet_egress > and bpf_socket_create. I think we're still talking past each other. A big part of the point of changing it is that none of this is specific to bpf. You could (in theory -- I'm not proposing implementing these until there's demand) have: net.socket_create_filter = "none": no filter net.socket_create_filter = "bpf:baadf00d": bpf filter net.socket_create_filter = "disallow": no sockets created period net.socket_create_filter = "iptables:foobar": some iptables thingy net.socket_create_filter = "nft:blahblahblah": some nft thingy net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 See? This API is not bpf-specific. It's an API for filtering. The fact that struct cgroup currently contains a member called "bpf" is purely an artifact of the fact that it currently only supports bpf. Someone will want to rename it to "filters" some day, and BPF_PROG_DETACH makes no sense whatsoever as a generic API to detach a filter. > If you want to prepare such patch for them, I don't mind, > but you cannot kill syscall command, since it's more flexible > and your control-file approach _will_ be obsolete pretty quickly. BPF_PROG_ATTACH and BPF_PROC_DETACH should be removed regardless IMO. If you really really want a syscall, make it a new syscall. > >> Now the API matches the effect. A cgroup with something other than >> "none" in one of its net.*_filter files is a cgroup that filters >> network activity. And you get CRIU compatibility for free: CRIU can >> read the control file and use whatever mechanism is uses for BPF in >> general to restore the cgroup filter state. As an added bonus, you >> get ACLs for free and the ugly multiplexer goes away. > > extended bpf is not supported by criu. only classic, so having > control_file-style attachment doesn't buy us anything. CRIU will support it some day. We might as well put fewer obstacles in their way. > >> >> # mkdir cg2/nosockets/sockets >> >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1 >> >> >> >> ^^^^ This succeeded, which means that, if this feature is enabled in 4.10, >> >> ^^^^ then we're stuck with its semantics. If it returned -EINVAL instead, >> >> ^^^^ there would be a chance to refine it. >> > >> > i don't see the problem with this behavior. bpf and cgroup are indepedent. >> > Imange that socket accounting program is attached to cg2/nosockets. >> > The program is readonly and carry no security meaning. >> > Why cgroup should prevent creation of cg2/nosockets/foo directory ? >> >> I think you're misunderstanding me. What I'm saying is that, if you >> allow a cgroup and one of its descendents to both enable the same type >> of filter, you have just committed to some particular semantics for >> what happens. And I think that the current semantics are the *wrong* >> semantics for default long-term use, so you should either fix the >> semantics or disable the problematic case. > > Are you saying that use cases I provided are also 'wrong'? > If you insist on that we won't be able to make any forward progress. > The current semantics is fine for what it's designed for. Can you explain your use case more clearly? > >> >> # echo $$ >cg2/nosockets/sockets/cgroup.procs >> >> # ping 127.0.0.1 >> >> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. >> >> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms >> > >> > hmm. in this case the admin created a subgroup with a group that allows >> > ping, so? how's that a problem? >> >> I think you're forgetting about namespaces. There are two different >> admins here. There's the admin who created the outer container and >> said "no sockets here". Then there's the admin inside the container >> who said "muahaha, I can create a sub-container and allow sockets >> *there*". > > container management frameworks should be doing sensible > things. With any infra in the kernel there are many ways to > create non-sensical setups. It's not a job of the kernel > to interfere with user space. What exactly isn't sensible about using cgroup_bpf for containers? > >> > In case of vrf I can imagine >> > a set of application auto-binding to one vrf and smaller >> > set of applications binding to a different vrf. >> > Or broader set of applications monitoring all tcp traffic >> > and subset of them monitoring udp instead. >> > Those are valid use cases. >> >> > I guess you're advocating to run a link list of programs? >> > That won't be useful for the above use cases, where there is no >> > reason to run more than one program that control plane >> > assigned to run for this cgroup. >> >> Yes there is, for both monitoring and policy. If I want to monitor >> all activity in a cgroup, I probably want to monitor descendents as >> well. If I want to restrict a cgroup, I want to restrict its >> descendents. > > you're ignoring use cases I described earlier. > In vrf case there is only one ifindex it needs to bind to. I'm totally lost. Can you explain what this has to do with the cgroup hierarchy? >> > At this stage we decided to allow only one program per cgroup per hook >> > and later can extend it if necessary. >> >> No you can't. Since you allow the problematic case and you gave it >> the surprising "one program" semantics, you can't change it later. > > please show me why we cannot? As far as I can see nothing prevents > that in the future. We can add any number of new fields to > BPF_PROG_ATTACH command just like we did in the past with > other commands, whereas control file interface is not extensible. Because people are going to start using the old API, tools won't be aware of the new semantics, you have no usable introspection mechanism, and everyone is going to screw it up. But it's even worse: because global privilege is currently needed to set up these filters, containers really can use it today, but once you switch to ns_capable, then it suddenly becomes insecure. And *that* is something that you can't do. > >> > As you're pointing out, in case of security, we probably >> > want to preserve original bpf program that should always be >> > run first and only after it returned 'ok' (we'd need to define >> > what 'ok' means in secruity context) run program attached to sub-hierarchy. >> >> It's already defined AFAICT. 1 means okay. 0 means not okay. > > sorry that doesn't make any sense. For seccomp we have a set of > ranges that mean different things. Here you're proposing to > hastily assign 1 and 0 ? How is that extensible? > We need to carefully think through what should be the semantics > of attaching multiple programs, consider performance implications, > return codes and so on. You already assigned it. The return value of the bpf program, loaded in Linus' tree today, tells the kernel whether to accept or reject. > >> > Another alternative is to disallow attaching programs in sub-hierarchy >> > if parent has something already attached, but it's not useful >> > for general case. >> > All of these are possible future extensions. >> >> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You >> have to do it now (or disable the feature for 4.10). This is why I'm >> bringing this whole thing up now. > > We don't have to touch user visible api here, so extensions are fine. Huh? My example in the original email attaches a program in a sub-hierarchy. Are you saying that 4.11 could make that example stop working? > >> >> In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf >> >> programs attached that can do things if various events occur. (Right >> >> now, this means socket operations, but there are plans in the works >> >> to do this for LSM hooks too.) These bpf programs can say yes or no, >> >> but they can also read out various data (including socket payloads!) >> >> and save them away where an attacker can find them. This sounds a >> >> lot like seccomp with a narrower scope but a much stronger ability to >> >> exfiltrate private information. >> > >> > that sounds like a future problem to solve when bpf+lsm+cgroup are >> > used for security. >> >> [...] >> >> > >> > I disagree with the urgency. I see nothing that needs immediate action. >> > bpf+lsm+cgroup is not in the tree yet. >> > >> >> I disagree very strongly here. The API in 4.10 is IMO quite ugly, but >> the result if bpf+lsm+cgroup works *differently* will be far, far >> uglier. > > again we're talking about the future here and 'ugly' is the matter of taste. > I hear all the time that people say that netlink api is ugly, so? Yeah, it's ugly, but that ship already sailed. This ship hasn't. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2016-12-20 0:25 ` Andy Lutomirski @ 2016-12-20 1:43 ` Andy Lutomirski [not found] ` <CALCETrU1_bDVLfokQ7zasHVmeq7S-R+603GEw59V_wuj4eE1hw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 0 replies; 49+ messages in thread From: Andy Lutomirski @ 2016-12-20 1:43 UTC (permalink / raw) To: Alexei Starovoitov Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel@vger.kernel.org, Network Development On Mon, Dec 19, 2016 at 4:25 PM, Andy Lutomirski <luto@amacapital.net> wrote: > On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: >> you're ignoring use cases I described earlier. >> In vrf case there is only one ifindex it needs to bind to. > > I'm totally lost. Can you explain what this has to do with the cgroup > hierarchy? > Okay, I figured out what you mean, I think. You have a handful of vrf devices. Let's say they have ifindexes 1 and 2 (and maybe more). The interesting case happens when you set up /cgroup/a with a bpf program that binds new sockets to ifindex 1 and /cgroup/a/b with a bpf program that binds new sockets to ifindex 2. The question is: what should happen if you're in /cgroup/a/b? Presumably, if you do this, you wanted to end up with ifindex 2. I think the way it should actually work is that the kernel evaluates /cgroup/a/b's hook and then /cgroup/a's hook. Then /cgroup/a (which is the more privileged hook) gets to make the choice. If it wants ifindex 2 to win, it can do (pseudocode): if (!sk->sk_bound_if) sk->sk_bound_if = 1; ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <CALCETrU1_bDVLfokQ7zasHVmeq7S-R+603GEw59V_wuj4eE1hw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <CALCETrU1_bDVLfokQ7zasHVmeq7S-R+603GEw59V_wuj4eE1hw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2016-12-20 1:44 ` David Ahern [not found] ` <2dbec775-6304-e44c-19c5-fbf07877e7b1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2016-12-20 3:18 ` Alexei Starovoitov 1 sibling, 1 reply; 49+ messages in thread From: David Ahern @ 2016-12-20 1:44 UTC (permalink / raw) To: Andy Lutomirski, Alexei Starovoitov Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On 12/19/16 5:25 PM, Andy Lutomirski wrote: > net.socket_create_filter = "none": no filter > net.socket_create_filter = "bpf:baadf00d": bpf filter > net.socket_create_filter = "disallow": no sockets created period > net.socket_create_filter = "iptables:foobar": some iptables thingy > net.socket_create_filter = "nft:blahblahblah": some nft thingy > net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters. ... >> you're ignoring use cases I described earlier. >> In vrf case there is only one ifindex it needs to bind to. > > I'm totally lost. Can you explain what this has to do with the cgroup > hierarchy? I think the point is that a group hierarchy makes no sense for the VRF use case. What I put into iproute2 is cgrp2/vrf/NAME where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets to a specific device index. cgrp2/vrf is the "default" vrf and does not have a filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not make sense. ... >>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You >>> have to do it now (or disable the feature for 4.10). This is why I'm >>> bringing this whole thing up now. >> >> We don't have to touch user visible api here, so extensions are fine. > > Huh? My example in the original email attaches a program in a > sub-hierarchy. Are you saying that 4.11 could make that example stop > working? Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup? ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <2dbec775-6304-e44c-19c5-fbf07877e7b1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <2dbec775-6304-e44c-19c5-fbf07877e7b1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2016-12-20 1:56 ` Andy Lutomirski [not found] ` <CALCETrUW2jEYmjSsOrPj+MAjkDGGUCw_rdxQh+5Er0r4ReGLnA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Andy Lutomirski @ 2016-12-20 1:56 UTC (permalink / raw) To: David Ahern Cc: Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Mon, Dec 19, 2016 at 5:44 PM, David Ahern <dsahern-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > On 12/19/16 5:25 PM, Andy Lutomirski wrote: >> net.socket_create_filter = "none": no filter >> net.socket_create_filter = "bpf:baadf00d": bpf filter >> net.socket_create_filter = "disallow": no sockets created period >> net.socket_create_filter = "iptables:foobar": some iptables thingy >> net.socket_create_filter = "nft:blahblahblah": some nft thingy >> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 > > Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters. Can you elaborate on what goes wrong? (Obviously the "address_family_list" example makes no sense in that context.) > > ... > >>> you're ignoring use cases I described earlier. >>> In vrf case there is only one ifindex it needs to bind to. >> >> I'm totally lost. Can you explain what this has to do with the cgroup >> hierarchy? > > I think the point is that a group hierarchy makes no sense for the VRF use case. What I put into iproute2 is > > cgrp2/vrf/NAME > > where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets to a specific device index. cgrp2/vrf is the "default" vrf and does not have a filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not make sense. I tend to agree. I still think that the mechanism as it stands is broken in other respects and should be fixed before it goes live. I have no desire to cause problems for the vrf use case. But keep in mind that the vrf use case is, in Linus' tree, a bit broken right now in its interactions with other users of the same mechanism. Suppose I create a container and want to trace all of its created sockets. I'll set up cgrp2/container and load my tracer as a socket creation hook. Then a container sets up cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding filter. Now the tracing stops working -- oops. > > ... > >>>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You >>>> have to do it now (or disable the feature for 4.10). This is why I'm >>>> bringing this whole thing up now. >>> >>> We don't have to touch user visible api here, so extensions are fine. >> >> Huh? My example in the original email attaches a program in a >> sub-hierarchy. Are you saying that 4.11 could make that example stop >> working? > > Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup? Yes, exactly. I think there are two sensible behaviors: a) sub-cgroups cannot have a filter at all of the parent has a filter. (This is the "punt" approach -- it lets different semantics be assigned later without breaking userspace.) b) sub-cgroups can have a filter if a parent does, too. The semantics are that the sub-cgroup filter runs first and all side-effects occur. If that filter says "reject" then ancestor filters are skipped. If that filter says "accept", then the ancestor filter is run and its side-effects happen as well. (And so on, all the way up to the root.) --Andy ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <CALCETrUW2jEYmjSsOrPj+MAjkDGGUCw_rdxQh+5Er0r4ReGLnA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <CALCETrUW2jEYmjSsOrPj+MAjkDGGUCw_rdxQh+5Er0r4ReGLnA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2016-12-20 2:52 ` David Ahern 2016-12-20 3:12 ` Andy Lutomirski 2016-12-20 9:11 ` Peter Zijlstra 1 sibling, 1 reply; 49+ messages in thread From: David Ahern @ 2016-12-20 2:52 UTC (permalink / raw) To: Andy Lutomirski Cc: Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On 12/19/16 6:56 PM, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 5:44 PM, David Ahern <dsahern-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: >> On 12/19/16 5:25 PM, Andy Lutomirski wrote: >>> net.socket_create_filter = "none": no filter >>> net.socket_create_filter = "bpf:baadf00d": bpf filter >>> net.socket_create_filter = "disallow": no sockets created period >>> net.socket_create_filter = "iptables:foobar": some iptables thingy >>> net.socket_create_filter = "nft:blahblahblah": some nft thingy >>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 >> >> Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters. > > Can you elaborate on what goes wrong? (Obviously the > "address_family_list" example makes no sense in that context.) Being able to dump a filter or see that one exists would be a great add-on, but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for loading generic BPF filters. Simple cases like "disallow" are easy -- just return 0 in the filter, no complicated BPF code needed. The rest are specific cases of the moment which goes against the intent of ebpf and generic programmability. >> >> ... >> >>>> you're ignoring use cases I described earlier. >>>> In vrf case there is only one ifindex it needs to bind to. >>> >>> I'm totally lost. Can you explain what this has to do with the cgroup >>> hierarchy? >> >> I think the point is that a group hierarchy makes no sense for the VRF use case. What I put into iproute2 is >> >> cgrp2/vrf/NAME >> >> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets to a specific device index. cgrp2/vrf is the "default" vrf and does not have a filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not make sense. > > I tend to agree. I still think that the mechanism as it stands is > broken in other respects and should be fixed before it goes live. I > have no desire to cause problems for the vrf use case. > > But keep in mind that the vrf use case is, in Linus' tree, a bit > broken right now in its interactions with other users of the same > mechanism. Suppose I create a container and want to trace all of its > created sockets. I'll set up cgrp2/container and load my tracer as a > socket creation hook. Then a container sets up > cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding > filter. Now the tracing stops working -- oops. There are other ways to achieve socket tracing, but I get your point -- nested cases do not work as users may want. >>>>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You >>>>> have to do it now (or disable the feature for 4.10). This is why I'm >>>>> bringing this whole thing up now. >>>> >>>> We don't have to touch user visible api here, so extensions are fine. >>> >>> Huh? My example in the original email attaches a program in a >>> sub-hierarchy. Are you saying that 4.11 could make that example stop >>> working? >> >> Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup? > > Yes, exactly. I think there are two sensible behaviors: > > a) sub-cgroups cannot have a filter at all of the parent has a filter. > (This is the "punt" approach -- it lets different semantics be > assigned later without breaking userspace.) > > b) sub-cgroups can have a filter if a parent does, too. The semantics > are that the sub-cgroup filter runs first and all side-effects occur. > If that filter says "reject" then ancestor filters are skipped. If > that filter says "accept", then the ancestor filter is run and its > side-effects happen as well. (And so on, all the way up to the root.) That comes with a big performance hit for skb / data path cases. I'm riding my use case on Daniel's work, and as I understand it the nesting case has been discussed. I'll defer to Daniel and Alexei on this part. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2016-12-20 2:52 ` David Ahern @ 2016-12-20 3:12 ` Andy Lutomirski 2016-12-20 4:44 ` Alexei Starovoitov 0 siblings, 1 reply; 49+ messages in thread From: Andy Lutomirski @ 2016-12-20 3:12 UTC (permalink / raw) To: David Ahern Cc: Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel@vger.kernel.org, Network Development On Mon, Dec 19, 2016 at 6:52 PM, David Ahern <dsahern@gmail.com> wrote: > On 12/19/16 6:56 PM, Andy Lutomirski wrote: >> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern <dsahern@gmail.com> wrote: >>> On 12/19/16 5:25 PM, Andy Lutomirski wrote: >>>> net.socket_create_filter = "none": no filter >>>> net.socket_create_filter = "bpf:baadf00d": bpf filter >>>> net.socket_create_filter = "disallow": no sockets created period >>>> net.socket_create_filter = "iptables:foobar": some iptables thingy >>>> net.socket_create_filter = "nft:blahblahblah": some nft thingy >>>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 >>> >>> Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters. >> >> Can you elaborate on what goes wrong? (Obviously the >> "address_family_list" example makes no sense in that context.) > > Being able to dump a filter or see that one exists would be a great add-on, but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for loading generic BPF filters. Simple cases like "disallow" are easy -- just return 0 in the filter, no complicated BPF code needed. The rest are specific cases of the moment which goes against the intent of ebpf and generic programmability. Oh -- I'm not proposing that at all. Let me clarify. For the bpf case, if you *read* the file, you'd see "bpf:baadf00d". But writing "bpf:baadf00d" is nonsense and would give you -EINVAL. Instead you install a bpf filter by opening the file for write (O_RDWR or O_WRONLY) and doing ioctl(cgroup_socket_create_file_fd, CGROUP_IOCTL_SET_BPF_FILTER, bpf_fd); It's kind of like BPF_PROG_ATTACH except that it respects filesystem permissions, it isn't in the bpf() multiplexer, the filter being set is implied (by the fd in use), and everything is nicely seccompable. To remove the filter, you write "none" or "none\n" to the file. As a future extension, if someone wanted more than one filter to be able to coexist in the cgroup socket_create_filter slot, you could plausibly do 'echo disallow >>net.socket_create_filter' or use a new ioctl CGROUP_IOCTL_APPEND_BPF_FILTER, and then you'd read the file and see more than one line. But this would be a future extension and may never be needed. >> >> a) sub-cgroups cannot have a filter at all of the parent has a filter. >> (This is the "punt" approach -- it lets different semantics be >> assigned later without breaking userspace.) >> >> b) sub-cgroups can have a filter if a parent does, too. The semantics >> are that the sub-cgroup filter runs first and all side-effects occur. >> If that filter says "reject" then ancestor filters are skipped. If >> that filter says "accept", then the ancestor filter is run and its >> side-effects happen as well. (And so on, all the way up to the root.) > > That comes with a big performance hit for skb / data path cases. > > I'm riding my use case on Daniel's work, and as I understand it the nesting case has been discussed. I'll defer to Daniel and Alexei on this part. I'm not sure I buy the performance hit. If you do it naively, then performance will indeed suck. But there's already a bunch of code in there to pre-populate a filter list for faster use. Currently, we have: struct cgroup_bpf { /* * Store two sets of bpf_prog pointers, one for programs that are * pinned directly to this cgroup, and one for those that are effective * when this cgroup is accessed. */ struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE]; struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE]; }; in struct cgroup, there's a 'struct cgroup_bpf bpf;'. This would change to something like: struct cgroup_filter_slot { struct bpf_prog *effective; struct cgroup_filter_slot *next; struct bpf_prog *local; } local is NULL unless *this* cgroup has a filter. effective points to the bpf_prog that's active in this cgroup or the nearest ancestor that has a filter. next is NULL if there are no filters higher in the chain or points to the next slot that has a filter. struct cgroup has: struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE]; To evaluate it, you do: struct cgroup_filter_slot *slot = &cgroup->slot[the index]; if (!slot->effective) return; do { evaluate(slot->effective); slot = slot->next; } while (unlikely(slot)); The old code was one branch per evaluation. The new code is n+1 branches where n is the number of filters, so it's one extra branch in the worst case. But the extra branch is cache-hot (the variable is right next to slot->effective, which is needed regardless) and highly predictable (in the case where nesting isn't used, the branch is not taken, and it's a backward branch which most CPUs will predict as not taken). I expect that, on x86, this adds at most a cycle or two and quite possibly adds no measurable overhead at all. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2016-12-20 3:12 ` Andy Lutomirski @ 2016-12-20 4:44 ` Alexei Starovoitov [not found] ` <20161220044440.GB86803-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Alexei Starovoitov @ 2016-12-20 4:44 UTC (permalink / raw) To: Andy Lutomirski Cc: David Ahern, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel@vger.kernel.org, Network Development On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote: > > struct cgroup_bpf { > /* > * Store two sets of bpf_prog pointers, one for programs that are > * pinned directly to this cgroup, and one for those that are effective > * when this cgroup is accessed. > */ > struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE]; > struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE]; > }; > > in struct cgroup, there's a 'struct cgroup_bpf bpf;'. > > This would change to something like: > > struct cgroup_filter_slot { > struct bpf_prog *effective; > struct cgroup_filter_slot *next; > struct bpf_prog *local; > } > > local is NULL unless *this* cgroup has a filter. effective points to > the bpf_prog that's active in this cgroup or the nearest ancestor that > has a filter. next is NULL if there are no filters higher in the > chain or points to the next slot that has a filter. struct cgroup > has: > > struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE]; > > To evaluate it, you do: > > struct cgroup_filter_slot *slot = &cgroup->slot[the index]; > > if (!slot->effective) > return; > > do { > evaluate(slot->effective); > slot = slot->next; > } while (unlikely(slot)); yes. something like this can work as a future extension to support multiple programs for security use case. Please propose a patch. Again, it's not needed today and there is no rush to implement it. ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <20161220044440.GB86803-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <20161220044440.GB86803-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> @ 2016-12-20 5:27 ` Andy Lutomirski [not found] ` <CALCETrVxkdZA3SsRv0KKhBz9YvNMsnmHSjS8HN1GHrgWRYNM1g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Andy Lutomirski @ 2016-12-20 5:27 UTC (permalink / raw) To: Alexei Starovoitov Cc: David Ahern, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Mon, Dec 19, 2016 at 8:44 PM, Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote: >> >> struct cgroup_bpf { >> /* >> * Store two sets of bpf_prog pointers, one for programs that are >> * pinned directly to this cgroup, and one for those that are effective >> * when this cgroup is accessed. >> */ >> struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE]; >> struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE]; >> }; >> >> in struct cgroup, there's a 'struct cgroup_bpf bpf;'. >> >> This would change to something like: >> >> struct cgroup_filter_slot { >> struct bpf_prog *effective; >> struct cgroup_filter_slot *next; >> struct bpf_prog *local; >> } >> >> local is NULL unless *this* cgroup has a filter. effective points to >> the bpf_prog that's active in this cgroup or the nearest ancestor that >> has a filter. next is NULL if there are no filters higher in the >> chain or points to the next slot that has a filter. struct cgroup >> has: >> >> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE]; >> >> To evaluate it, you do: >> >> struct cgroup_filter_slot *slot = &cgroup->slot[the index]; >> >> if (!slot->effective) >> return; >> >> do { >> evaluate(slot->effective); >> slot = slot->next; >> } while (unlikely(slot)); > > yes. something like this can work as a future extension > to support multiple programs for security use case. > Please propose a patch. > Again, it's not needed today and there is no rush to implement it. > If this happens after 4.10 and 4.10 is released as is, then this change would be an ABI break. --Andy ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <CALCETrVxkdZA3SsRv0KKhBz9YvNMsnmHSjS8HN1GHrgWRYNM1g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <CALCETrVxkdZA3SsRv0KKhBz9YvNMsnmHSjS8HN1GHrgWRYNM1g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2016-12-20 5:32 ` Alexei Starovoitov 0 siblings, 0 replies; 49+ messages in thread From: Alexei Starovoitov @ 2016-12-20 5:32 UTC (permalink / raw) To: Andy Lutomirski Cc: David Ahern, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Mon, Dec 19, 2016 at 09:27:18PM -0800, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 8:44 PM, Alexei Starovoitov > <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > > On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote: > >> > >> struct cgroup_bpf { > >> /* > >> * Store two sets of bpf_prog pointers, one for programs that are > >> * pinned directly to this cgroup, and one for those that are effective > >> * when this cgroup is accessed. > >> */ > >> struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE]; > >> struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE]; > >> }; > >> > >> in struct cgroup, there's a 'struct cgroup_bpf bpf;'. > >> > >> This would change to something like: > >> > >> struct cgroup_filter_slot { > >> struct bpf_prog *effective; > >> struct cgroup_filter_slot *next; > >> struct bpf_prog *local; > >> } > >> > >> local is NULL unless *this* cgroup has a filter. effective points to > >> the bpf_prog that's active in this cgroup or the nearest ancestor that > >> has a filter. next is NULL if there are no filters higher in the > >> chain or points to the next slot that has a filter. struct cgroup > >> has: > >> > >> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE]; > >> > >> To evaluate it, you do: > >> > >> struct cgroup_filter_slot *slot = &cgroup->slot[the index]; > >> > >> if (!slot->effective) > >> return; > >> > >> do { > >> evaluate(slot->effective); > >> slot = slot->next; > >> } while (unlikely(slot)); > > > > yes. something like this can work as a future extension > > to support multiple programs for security use case. > > Please propose a patch. > > Again, it's not needed today and there is no rush to implement it. > > > > If this happens after 4.10 and 4.10 is released as is, then this > change would be an ABI break. it won't break existing apps. please study how bpf syscall was extended in the past without breaking anything. Same thing here. The default is one program per hook per cgroup. Everything else is future extensions. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <CALCETrUW2jEYmjSsOrPj+MAjkDGGUCw_rdxQh+5Er0r4ReGLnA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2016-12-20 2:52 ` David Ahern @ 2016-12-20 9:11 ` Peter Zijlstra [not found] ` <20161220091150.GJ3124-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 1 sibling, 1 reply; 49+ messages in thread From: Peter Zijlstra @ 2016-12-20 9:11 UTC (permalink / raw) To: Andy Lutomirski Cc: David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Mon, Dec 19, 2016 at 05:56:24PM -0800, Andy Lutomirski wrote: > >> Huh? My example in the original email attaches a program in a > >> sub-hierarchy. Are you saying that 4.11 could make that example stop > >> working? > > > > Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup? > > Yes, exactly. I think there are two sensible behaviors: > > a) sub-cgroups cannot have a filter at all of the parent has a filter. > (This is the "punt" approach -- it lets different semantics be > assigned later without breaking userspace.) > > b) sub-cgroups can have a filter if a parent does, too. The semantics > are that the sub-cgroup filter runs first and all side-effects occur. > If that filter says "reject" then ancestor filters are skipped. If > that filter says "accept", then the ancestor filter is run and its > side-effects happen as well. (And so on, all the way up to the root.) So from what I understand the proposed cgroup is not in fact hierarchical at all. @TJ, I thought you were enforcing all new cgroups to be properly hierarchical, that would very much include this one. ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <20161220091150.GJ3124-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <20161220091150.GJ3124-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2017-01-03 10:25 ` Michal Hocko [not found] ` <20170103102559.GA30129-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Michal Hocko @ 2017-01-03 10:25 UTC (permalink / raw) To: Peter Zijlstra, Tejun Heo Cc: Andy Lutomirski, David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Tue 20-12-16 10:11:50, Peter Zijlstra wrote: > On Mon, Dec 19, 2016 at 05:56:24PM -0800, Andy Lutomirski wrote: > > >> Huh? My example in the original email attaches a program in a > > >> sub-hierarchy. Are you saying that 4.11 could make that example stop > > >> working? > > > > > > Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup? > > > > Yes, exactly. I think there are two sensible behaviors: > > > > a) sub-cgroups cannot have a filter at all of the parent has a filter. > > (This is the "punt" approach -- it lets different semantics be > > assigned later without breaking userspace.) > > > > b) sub-cgroups can have a filter if a parent does, too. The semantics > > are that the sub-cgroup filter runs first and all side-effects occur. > > If that filter says "reject" then ancestor filters are skipped. If > > that filter says "accept", then the ancestor filter is run and its > > side-effects happen as well. (And so on, all the way up to the root.) > > So from what I understand the proposed cgroup is not in fact > hierarchical at all. > > @TJ, I thought you were enforcing all new cgroups to be properly > hierarchical, that would very much include this one. I would be interested in that as well. We have made that mistake in memcg v1 where hierarchy could be disabled for performance reasons and that turned out to be major PITA in the end. Why do we want to repeat the same mistake here? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <20170103102559.GA30129-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <20170103102559.GA30129-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2017-01-16 1:19 ` Tejun Heo 2017-01-17 13:03 ` Michal Hocko 0 siblings, 1 reply; 49+ messages in thread From: Tejun Heo @ 2017-01-16 1:19 UTC (permalink / raw) To: Michal Hocko Cc: Peter Zijlstra, Andy Lutomirski, David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development Hello, Sorry about the delay. Some fire fighthing followed the holidays. On Tue, Jan 03, 2017 at 11:25:59AM +0100, Michal Hocko wrote: > > So from what I understand the proposed cgroup is not in fact > > hierarchical at all. > > > > @TJ, I thought you were enforcing all new cgroups to be properly > > hierarchical, that would very much include this one. > > I would be interested in that as well. We have made that mistake in > memcg v1 where hierarchy could be disabled for performance reasons and > that turned out to be major PITA in the end. Why do we want to repeat > the same mistake here? Across the different threads on this subject, there have been multiple explanations but I'll try to sum it up more clearly. The big issue here is whether this is a cgroup thing or a bpf thing. I don't think there's anything inherently wrong with one approach or the other. Forget about the proposed cgroup bpf extentions but thinkg about how iptables does cgroups. Whether it's the netcls/netprio in v1 or direct membership matching in v2, it is the network side testing for cgroup membership one way or the other. The only part where cgroup is involved in is answering that test. This also holds true for the perf controller. While it is implemented as a controller, it isn't visible to cgroup users in any way and the only function it serves is providing the membership test to perf subsystem. perf is the one which decides whether and how it is to be used. cgroup providing membership test to other subsystems is completely acceptable and established. Now coming back to bpf, the current implementation is just that. Sure, cgroup hosts the rules in its data structures but that isn't something conceptually relevant. We might as well implement it as a prefixed hash table from bpf side. Having pointers in struct cgroup is just a more efficient and easier way of achieving the same result. In fact, IIUC, this whole thing was born out of discussions around implementing scalable cgroup membership matching from bpf programs. So, what's proposed is a proper part of bpf. In terms of implementation, cgroup helps by hosting the pointers but that doesn't necessarily affect the conceptual structure of it. Given that, I don't think it'd be a good idea to add anything to cgroup interface for this feature. Introspection is great to have but this should be introspectable together with other bpf programs using the same mechanism. That's where it belongs. None of the issues that people have been raising here is actually an issue if one thinks of it as a part of bpf. Its security model is exactly the same as any other bpf programs. Recursive behavior is exactly the same as how other external cgroup descendant membership testing work. There is no issue here whatsoever. Now, I'm not claiming that a bpf mechanism which is a proper part of cgrou isn't attractive. It is, especially with delegation; however, that is also where we don't quite know how to proceed. This doesn't have much to do with cgroup. If something is delegatable to non-priv users and scoped, cgroup's fine with it and if that's not possible it simply isn't something which is delegatable and putting it on cgroup doesn't change that. I'm far from being a bpf expert, so I could be wrong here, but I don't think there's anything fundamental which prevents bpf from being delegatable but at the same time bpf is something which is extremely flexible and nobody really thought about or worked that much on delegating bpf. If there's enough need for it, I'm sure we'll eventually get there but from what I hear it isn't something we can pull off in a restricted timeframe. There's nothing which makes the currently implemented mechanism exclusive with a cgroup controller based one. The hooks are the expensive part but can be shared, the rest is just about which programs to execute in what order and how they should be chained. There are a lot of immediate use cases which can benefit from the proposed cgroup bpf mechanism and they're all fine with it being a part of bpf and behaving like any other network mechanism behaves in terms of configuration and delegation. I don't see a reason why we would hold back on merging this. All the raised issues are coming from confusing this as a part of cgroup. It isn't. It is a part of bpf. If we want a bpf cgroup controller, great, but that is a separate thing. Thanks. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2017-01-16 1:19 ` Tejun Heo @ 2017-01-17 13:03 ` Michal Hocko 2017-01-17 13:32 ` Peter Zijlstra 0 siblings, 1 reply; 49+ messages in thread From: Michal Hocko @ 2017-01-17 13:03 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Andy Lutomirski, David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel@vger.kernel.org, Network Development On Sun 15-01-17 20:19:01, Tejun Heo wrote: [...] > So, what's proposed is a proper part of bpf. In terms of > implementation, cgroup helps by hosting the pointers but that doesn't > necessarily affect the conceptual structure of it. Given that, I > don't think it'd be a good idea to add anything to cgroup interface > for this feature. Introspection is great to have but this should be > introspectable together with other bpf programs using the same > mechanism. That's where it belongs. If BPF only piggy backs on top of cgroup to iterate tasks shouldn't we at least enforce that the cgroup has to be a leaf one and no further children groups can be created once there is BPF program attached? This should break the existing usecases AFAIU and it would allow future changes without major API surprises. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2017-01-17 13:03 ` Michal Hocko @ 2017-01-17 13:32 ` Peter Zijlstra 2017-01-17 13:58 ` Michal Hocko 0 siblings, 1 reply; 49+ messages in thread From: Peter Zijlstra @ 2017-01-17 13:32 UTC (permalink / raw) To: Michal Hocko Cc: Tejun Heo, Andy Lutomirski, David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel@vger.kernel.org, Network Development On Tue, Jan 17, 2017 at 02:03:03PM +0100, Michal Hocko wrote: > On Sun 15-01-17 20:19:01, Tejun Heo wrote: > [...] > > So, what's proposed is a proper part of bpf. In terms of > > implementation, cgroup helps by hosting the pointers but that doesn't > > necessarily affect the conceptual structure of it. Given that, I > > don't think it'd be a good idea to add anything to cgroup interface > > for this feature. Introspection is great to have but this should be > > introspectable together with other bpf programs using the same > > mechanism. That's where it belongs. > > If BPF only piggy backs on top of cgroup to iterate tasks shouldn't we > at least enforce that the cgroup has to be a leaf one and no further > children groups can be created once there is BPF program attached? Why (again) this stupid constraint? If you want to use cgroups for tagging (like perf does), _any_ parent cgroup will also tag you. So creating child cgroups, and placing tasks in it, should not be a problem, the BPF thing should apply to all of them. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2017-01-17 13:32 ` Peter Zijlstra @ 2017-01-17 13:58 ` Michal Hocko [not found] ` <20170117135830.GO19699-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2017-01-18 22:18 ` Tejun Heo 0 siblings, 2 replies; 49+ messages in thread From: Michal Hocko @ 2017-01-17 13:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, Andy Lutomirski, David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel@vger.kernel.org, Network Development On Tue 17-01-17 14:32:04, Peter Zijlstra wrote: > On Tue, Jan 17, 2017 at 02:03:03PM +0100, Michal Hocko wrote: > > On Sun 15-01-17 20:19:01, Tejun Heo wrote: > > [...] > > > So, what's proposed is a proper part of bpf. In terms of > > > implementation, cgroup helps by hosting the pointers but that doesn't > > > necessarily affect the conceptual structure of it. Given that, I > > > don't think it'd be a good idea to add anything to cgroup interface > > > for this feature. Introspection is great to have but this should be > > > introspectable together with other bpf programs using the same > > > mechanism. That's where it belongs. > > > > If BPF only piggy backs on top of cgroup to iterate tasks shouldn't we > > at least enforce that the cgroup has to be a leaf one and no further > > children groups can be created once there is BPF program attached? > > Why (again) this stupid constraint? > > If you want to use cgroups for tagging (like perf does), _any_ parent > cgroup will also tag you. > > So creating child cgroups, and placing tasks in it, should not be a > problem, the BPF thing should apply to all of them. This would require using hierarchical cgroup iterators to iterate over tasks. As per Andy's testing this doesn't seem to be the case. I haven't checked the implementation closely but my understanding was that using only cgroup specific tasks was intentional. I do agree that using hierarchy aware cgroup iterators is the right approach here and we wouldn't see any issue. But I am still not sure I've wrapped my head around this feature completely. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <20170117135830.GO19699-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <20170117135830.GO19699-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2017-01-17 20:23 ` Andy Lutomirski 0 siblings, 0 replies; 49+ messages in thread From: Andy Lutomirski @ 2017-01-17 20:23 UTC (permalink / raw) To: Michal Hocko Cc: Peter Zijlstra, Tejun Heo, David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Tue, Jan 17, 2017 at 5:58 AM, Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > On Tue 17-01-17 14:32:04, Peter Zijlstra wrote: >> On Tue, Jan 17, 2017 at 02:03:03PM +0100, Michal Hocko wrote: >> > On Sun 15-01-17 20:19:01, Tejun Heo wrote: >> > [...] >> > > So, what's proposed is a proper part of bpf. In terms of >> > > implementation, cgroup helps by hosting the pointers but that doesn't >> > > necessarily affect the conceptual structure of it. Given that, I >> > > don't think it'd be a good idea to add anything to cgroup interface >> > > for this feature. Introspection is great to have but this should be >> > > introspectable together with other bpf programs using the same >> > > mechanism. That's where it belongs. >> > >> > If BPF only piggy backs on top of cgroup to iterate tasks shouldn't we >> > at least enforce that the cgroup has to be a leaf one and no further >> > children groups can be created once there is BPF program attached? >> >> Why (again) this stupid constraint? >> >> If you want to use cgroups for tagging (like perf does), _any_ parent >> cgroup will also tag you. >> >> So creating child cgroups, and placing tasks in it, should not be a >> problem, the BPF thing should apply to all of them. > > This would require using hierarchical cgroup iterators to iterate over > tasks. As per Andy's testing this doesn't seem to be the case. I haven't > checked the implementation closely but my understanding was that using > only cgroup specific tasks was intentional. The current semantics are AFAIK that only the innermost cgroup that has a hook installed is in effect. I think this is the wrong design. I think that the right semantics are probably to support both innermost-to-outermost and outermost-to-innermost and to select which is appropriate for each hook. Suppose we have a cgroup /a/b where a and b both have hooks installed. If the hook is a socket creation or egress hook, I think that b's hook should run first. If b's hook rejects, then a's hook is not run. If b's hook accepts, then a's hook is run. This way a gets the last word on any changes to the socket settings and a sees exactly what would happen if it were to accept. Conversely, for ingress hooks, I think that a's hook should run first. This way a sees the packet as it originally came in and can modify or reject it, and then b only sees whatever a chooses to let through. The guiding principle here is that, for actions that originate outside the machine, the outer hooks should IMO run first and, for actions that originate from a task in a cgroup, the innermost hooks should run first. --Andy ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2017-01-17 13:58 ` Michal Hocko [not found] ` <20170117135830.GO19699-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2017-01-18 22:18 ` Tejun Heo 2017-01-19 9:00 ` Michal Hocko 1 sibling, 1 reply; 49+ messages in thread From: Tejun Heo @ 2017-01-18 22:18 UTC (permalink / raw) To: Michal Hocko Cc: Peter Zijlstra, Andy Lutomirski, David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel@vger.kernel.org, Network Development Hello, Michal. On Tue, Jan 17, 2017 at 02:58:30PM +0100, Michal Hocko wrote: > This would require using hierarchical cgroup iterators to iterate over It does behave hierarchically. > tasks. As per Andy's testing this doesn't seem to be the case. I haven't That's not what Andy's testing showed. What that showed was that program in a child can override the one from its ancestor. > checked the implementation closely but my understanding was that using > only cgroup specific tasks was intentional. Maybe I'm misreading you but if you're saying that a bpf program would only execute for tasks on that specific cgroup, that is not true. > I do agree that using hierarchy aware cgroup iterators is the right > approach here and we wouldn't see any issue. But I am still not sure > I've wrapped my head around this feature completely. It's really simple. You can install bpf programs with a cgroup path associated with them. When that particular type of bpf program needs to be executed, the bpf program which is attached to the closest ancestor (including self) is executed. Thanks. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2017-01-18 22:18 ` Tejun Heo @ 2017-01-19 9:00 ` Michal Hocko 0 siblings, 0 replies; 49+ messages in thread From: Michal Hocko @ 2017-01-19 9:00 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Andy Lutomirski, David Ahern, Alexei Starovoitov, Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, David S. Miller, Thomas Graf, Michael Kerrisk, Linux API, linux-kernel@vger.kernel.org, Network Development On Wed 18-01-17 14:18:50, Tejun Heo wrote: > Hello, Michal. > > On Tue, Jan 17, 2017 at 02:58:30PM +0100, Michal Hocko wrote: > > This would require using hierarchical cgroup iterators to iterate over > > It does behave hierarchically. > > > tasks. As per Andy's testing this doesn't seem to be the case. I haven't > > That's not what Andy's testing showed. What that showed was that > program in a child can override the one from its ancestor. My fault, I've misread Andy's test case. I thought that the child group simply disabled the bpf program and the one from the parent hasn't executed. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <CALCETrU1_bDVLfokQ7zasHVmeq7S-R+603GEw59V_wuj4eE1hw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2016-12-20 1:44 ` David Ahern @ 2016-12-20 3:18 ` Alexei Starovoitov [not found] ` <20161220031802.GA77838-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> 1 sibling, 1 reply; 49+ messages in thread From: Alexei Starovoitov @ 2016-12-20 3:18 UTC (permalink / raw) To: Andy Lutomirski Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov > <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > > On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote: > >> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov > >> <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > >> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote: > >> >> Hi all- > >> >> > >> >> I apologize for being rather late with this. I didn't realize that > >> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see > >> >> it on the linux-api list, so I missed the discussion. > >> >> > >> >> I think that the inet ingress, egress etc filters are a neat feature, > >> >> but I think the API has some issues that will bite us down the road > >> >> if it becomes stable in its current form. > >> >> > >> >> Most of the problems I see are summarized in this transcript: > >> >> > >> >> # mkdir cg2 > >> >> # mount -t cgroup2 none cg2 > >> >> # mkdir cg2/nosockets > >> >> # strace cgrp_socket_rule cg2/nosockets/ 0 > >> >> ... > >> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3 > >> >> > >> >> ^^^^ You can modify a cgroup after opening it O_RDONLY? > >> >> > >> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2, > >> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144, > >> >> log_buf=0x6020c0, kern_version=0}, 48) = 4 > >> >> > >> >> ^^^^ This is fine. The bpf() syscall manipulates bpf objects. > >> >> > >> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0 > >> >> > >> >> ^^^^ This is not so good: > >> >> ^^^^ > >> >> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This > >> >> ^^^^ is manipulating a cgroup. There's no reason that a socket creation > >> >> ^^^^ filter couldn't be written in a different language (new iptables > >> >> ^^^^ table? Simple list of address families?), but if that happened, > >> >> ^^^^ then using bpf() to install it would be entirely nonsensical. > >> > > >> > I don't see why it's _modifing_ the cgroup. I'm looking at it as > >> > network stack is using cgroup as an application group that should > >> > invoke bpf program at the certain point in the stack. > >> > imo cgroup management is orthogonal. > >> > >> It is literally modifying the struct cgroup, and, as a practical > >> matter, it's causing membership in the cgroup to have a certain > >> effect. But rather than pointless arguing, let me propose an > >> alternative API that I think solves most of the problems here. > >> > >> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely. > >> Instead, the cgroup gets three new control files: > >> "net.ingress_filter", "net.egress_filter", and > >> "net.socket_create_filter". Initially, if you read these files, you > >> see "none\n". > >> > >> To attach a bpf filter, you open the file for write and do an ioctl on > >> it. After doing the ioctl, if you read the file, you'll see > >> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for > >> the bpf program. > >> > >> To detach any type of filter, bpf or otherwise, you open the file for > >> write and write "none\n" (or just "none"). > >> > >> If you write anything else to the file, you get -EINVAL. But, if > >> someone writes a new type of filter (perhaps a simple list of address > >> families), maybe you can enable the filter by writing something > >> appropriate to the file. > > > > I see no difference in what you're proposing vs what is already implemented > > from feature set point of view, but the file approach is very ugly, since > > it's a mismatch to FD style access that bpf is using everywhere. > > In your proposal you'd also need to add bpf prefix everywhere. > > So the control file names should be bpf_inet_ingress, bpf_inet_egress > > and bpf_socket_create. > > I think we're still talking past each other. A big part of the point > of changing it is that none of this is specific to bpf. You could (in the hooks and context passed into the program is very much bpf specific. That's what I've been trying to convey all along. > theory -- I'm not proposing implementing these until there's demand) > have: > > net.socket_create_filter = "none": no filter > net.socket_create_filter = "bpf:baadf00d": bpf filter i'm assuming 'baadf00d' is bpf program fd expressed a text string? and kernel needs to parse above? will you allow capital and lower case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not? can program fd expressed as decimal or hex or both? how do you return the error? as a text string for user space to parse? > net.socket_create_filter = "iptables:foobar": some iptables thingy > net.socket_create_filter = "nft:blahblahblah": some nft thingy iptables/nft are not applicable to 'bpf_socket_create' which looks into: struct bpf_sock { __u32 bound_dev_if; __u32 family; __u32 type; __u32 protocol; }; so don't fit as an example. > net.socket_create_filter = "disallow": no sockets created period > net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 so you're proposing to add a bunch of hard coded logic to the kernel. First to parse such text into some sort of syntax tree or list/set and then have hard coded logic specifically for these two use cases? While above two can be implemented as trivial bpf programs already?! That goes 180% degree vs bpf philosophy. bpf is about moving the specific code out of the kernel and keeping kernel generic that it can solve as many use cases as possible by being programmable. > See? This API is not bpf-specific. It's an API for filtering. The no. I don't see it. BPF_CGROUP_INET_SOCK_CREATE is very much bpf specific and we just discussed it to the last the detail. > Can you explain your use case more clearly? ... > What exactly isn't sensible about using cgroup_bpf for containers? my use case is close to zero overhead application network monitoring. > >> > As you're pointing out, in case of security, we probably > >> > want to preserve original bpf program that should always be > >> > run first and only after it returned 'ok' (we'd need to define > >> > what 'ok' means in secruity context) run program attached to sub-hierarchy. > >> > >> It's already defined AFAICT. 1 means okay. 0 means not okay. > > > > sorry that doesn't make any sense. For seccomp we have a set of > > ranges that mean different things. Here you're proposing to > > hastily assign 1 and 0 ? How is that extensible? > > We need to carefully think through what should be the semantics > > of attaching multiple programs, consider performance implications, > > return codes and so on. > > You already assigned it. The return value of the bpf program, loaded > in Linus' tree today, tells the kernel whether to accept or reject. yes. that's what the program tells the hook. I'm saying that whenever we have a link list of the programs interaction between them may or may not be expressible with 1/0 and I don't want to rush such decision. > > > >> > Another alternative is to disallow attaching programs in sub-hierarchy > >> > if parent has something already attached, but it's not useful > >> > for general case. > >> > All of these are possible future extensions. > >> > >> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You > >> have to do it now (or disable the feature for 4.10). This is why I'm > >> bringing this whole thing up now. > > > > We don't have to touch user visible api here, so extensions are fine. > > Huh? My example in the original email attaches a program in a > sub-hierarchy. Are you saying that 4.11 could make that example stop > working? No. As was pointed out before only root can load BPF_PROG_TYPE_CGROUP_[SKB|SOCK] type programs and only root can attach them. and this root semantics obviously have to be preserved from now on, but that doesn't mean that non-root combinations have to follow the same. For example, if for some bizarre reason you want to do net.socket_create_filter = "disallow": no sockets created period in the hard coded way without using bpf at all (I would certainly oppose that as a waste of kernel .text, but I'm not going to nack it), so you can do whatever semantics you like. Similarly for bpf+lsm+cgroup. In the next round of Mickael's patches we'll keep debating the right security model for it. ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <20161220031802.GA77838-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <20161220031802.GA77838-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> @ 2016-12-20 3:50 ` Andy Lutomirski 2016-12-20 4:41 ` Alexei Starovoitov [not found] ` <CALCETrXymvAo-9zhQe=amToz_fs9XGniK2KLZv5Fxc66qcUx6A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 2 replies; 49+ messages in thread From: Andy Lutomirski @ 2016-12-20 3:50 UTC (permalink / raw) To: Alexei Starovoitov Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Mon, Dec 19, 2016 at 7:18 PM, Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote: >> I think we're still talking past each other. A big part of the point >> of changing it is that none of this is specific to bpf. You could (in > > the hooks and context passed into the program is very much bpf specific. > That's what I've been trying to convey all along. You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)? There is nothing bpf specfic about the hook except that the name of this macro has "BPF" in it. There is nothing whatsoever that's bpf-specific about the context -- sk is not bpf-specific at all. The only thing bpf-specific about it is that it currently only invokes bpf programs. That could easily change. > >> theory -- I'm not proposing implementing these until there's demand) >> have: >> >> net.socket_create_filter = "none": no filter >> net.socket_create_filter = "bpf:baadf00d": bpf filter > > i'm assuming 'baadf00d' is bpf program fd expressed a text string? > and kernel needs to parse above? will you allow capital and lower > case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not? > can program fd expressed as decimal or hex or both? > how do you return the error? as a text string for user space > to parse? No. The kernel does not parse it because you cannot write this to the file. You set a bpf filter with ioctl and pass an fd. If you *read* the file, you get the same bpf program hash that fdinfo on the bpf object would show -- this is for debugging and (eventually) CRIU. > >> net.socket_create_filter = "iptables:foobar": some iptables thingy >> net.socket_create_filter = "nft:blahblahblah": some nft thingy > > iptables/nft are not applicable to 'bpf_socket_create' which > looks into: > struct bpf_sock { > __u32 bound_dev_if; > __u32 family; > __u32 type; > __u32 protocol; > }; > so don't fit as an example. The code that takes a 'struct sock' and sets up bpf_sock is bpf-specific and would obviously not be used for non-bpf filter. But if you had a filter that was just a like of address families, that filter would look at struct sock and do its own thing. iptables wouldn't make sense for a socket creation filter, but it would make perfect sense for an ingress filter. Obviously the bpf-specific state object wouldn't be used, but it would still be a hook, invoked from the same network code, guarded by the same static key, looking at the same skb. > >> net.socket_create_filter = "disallow": no sockets created period >> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 > > so you're proposing to add a bunch of hard coded logic to the kernel. > First to parse such text into some sort of syntax tree or list/set > and then have hard coded logic specifically for these two use cases? > While above two can be implemented as trivial bpf programs already?! > That goes 180% degree vs bpf philosophy. bpf is about moving > the specific code out of the kernel and keeping kernel generic that > it can solve as many use cases as possible by being programmable. I'm not seriously proposing implementing these. My point is that *bpf*, while wonderful, is not the be-all-and-end-all of kernel configurability, and other types of hooks might want to be hooked in here. > ... >> What exactly isn't sensible about using cgroup_bpf for containers? > > my use case is close to zero overhead application network monitoring. So if I set up a cgroup that's monitored and call it /cgroup/a and enable delegation and if the program running there wants to do its own monitoring in /cgroup/a/b (via delegation), then you really want the outer monitor to silently drop events coming from /cgroup/a/b? > >> >> > As you're pointing out, in case of security, we probably >> >> > want to preserve original bpf program that should always be >> >> > run first and only after it returned 'ok' (we'd need to define >> >> > what 'ok' means in secruity context) run program attached to sub-hierarchy. >> >> >> >> It's already defined AFAICT. 1 means okay. 0 means not okay. >> > >> > sorry that doesn't make any sense. For seccomp we have a set of >> > ranges that mean different things. Here you're proposing to >> > hastily assign 1 and 0 ? How is that extensible? >> > We need to carefully think through what should be the semantics >> > of attaching multiple programs, consider performance implications, >> > return codes and so on. >> >> You already assigned it. The return value of the bpf program, loaded >> in Linus' tree today, tells the kernel whether to accept or reject. > > yes. that's what the program tells the hook. > I'm saying that whenever we have a link list of the programs > interaction between them may or may not be expressible with 1/0 > and I don't want to rush such decision. Then disallow nesting. You're welcome to not rush the decision, but that's not what you've done. If 4.10 is released as is, you've made the decision and you're going to have a hard time changing it. > >> > >> >> > Another alternative is to disallow attaching programs in sub-hierarchy >> >> > if parent has something already attached, but it's not useful >> >> > for general case. >> >> > All of these are possible future extensions. >> >> >> >> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You >> >> have to do it now (or disable the feature for 4.10). This is why I'm >> >> bringing this whole thing up now. >> > >> > We don't have to touch user visible api here, so extensions are fine. >> >> Huh? My example in the original email attaches a program in a >> sub-hierarchy. Are you saying that 4.11 could make that example stop >> working? > > No. As was pointed out before only root can load BPF_PROG_TYPE_CGROUP_[SKB|SOCK] > type programs and only root can attach them. Why? It really seems to me that you expect that future namespaceable bpf hooks will use a totally different API. At KS, I sat in a room full of people arguing about cgroup v2 and a lot of them pointed out that there are valid, paying use cases that want to stick cgroup v1 in a container, because the code that uses cgroup v1 in the container is called systemd and the contained OS is called RHEL (or SuSE or Ubuntu or Gentoo or whatever), and that code is *already written* and needs to be contained. The current approach to bpf hooks will bite you down the road. David Ahern is already proposing using it for something that is not tracing at all, and someone will want that in a container, and there will be a problem. > Similarly for bpf+lsm+cgroup. In the next round of Mickael's patches > we'll keep debating the right security model for it. > How about slowing down a wee bit and trying to come up with cgroup hook semantics that work for all of these use cases? I think my proposal is quite close to workable. --Andy ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2016-12-20 3:50 ` Andy Lutomirski @ 2016-12-20 4:41 ` Alexei Starovoitov [not found] ` <CALCETrXymvAo-9zhQe=amToz_fs9XGniK2KLZv5Fxc66qcUx6A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 0 replies; 49+ messages in thread From: Alexei Starovoitov @ 2016-12-20 4:41 UTC (permalink / raw) To: Andy Lutomirski Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel@vger.kernel.org, Network Development On Mon, Dec 19, 2016 at 07:50:01PM -0800, Andy Lutomirski wrote: > >> > >> net.socket_create_filter = "none": no filter > >> net.socket_create_filter = "bpf:baadf00d": bpf filter > > > > i'm assuming 'baadf00d' is bpf program fd expressed a text string? > > and kernel needs to parse above? will you allow capital and lower > > case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not? > > can program fd expressed as decimal or hex or both? > > how do you return the error? as a text string for user space > > to parse? > > No. The kernel does not parse it because you cannot write this to the > file. You set a bpf filter with ioctl and pass an fd. If you *read* > the file, you get the same bpf program hash that fdinfo on the bpf > object would show -- this is for debugging and (eventually) CRIU. my understanding that cgroup is based on kernfs and both don't support ioctl, so you'd need quite some hacking to introduce such concepts and buy-in from a bunch of people first. > >> net.socket_create_filter = "iptables:foobar": some iptables thingy > >> net.socket_create_filter = "nft:blahblahblah": some nft thingy > > > > iptables/nft are not applicable to 'bpf_socket_create' which > > looks into: > > struct bpf_sock { > > __u32 bound_dev_if; > > __u32 family; > > __u32 type; > > __u32 protocol; > > }; > > so don't fit as an example. > > The code that takes a 'struct sock' and sets up bpf_sock is > bpf-specific and would obviously not be used for non-bpf filter. But > if you had a filter that was just a like of address families, that > filter would look at struct sock and do its own thing. iptables > wouldn't make sense for a socket creation filter, but it would make > perfect sense for an ingress filter. Obviously the bpf-specific state > object wouldn't be used, but it would still be a hook, invoked from > the same network code, guarded by the same static key, looking at the > same skb. I strongly suggest to go back and read my first reply where I think I explained well enough that something like iptables will not able to reuse the ioctl mechanism you're proposing here, hook ids will be different, attachment mechanism will be different too. So your proposed cgroup ioctl is already dead as a reusable interface. > >> net.socket_create_filter = "disallow": no sockets created period > >> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3 > > > > so you're proposing to add a bunch of hard coded logic to the kernel. > > First to parse such text into some sort of syntax tree or list/set > > and then have hard coded logic specifically for these two use cases? > > While above two can be implemented as trivial bpf programs already?! > > That goes 180% degree vs bpf philosophy. bpf is about moving > > the specific code out of the kernel and keeping kernel generic that > > it can solve as many use cases as possible by being programmable. > > I'm not seriously proposing implementing these. My point is that > *bpf*, while wonderful, is not the be-all-and-end-all of kernel > configurability, and other types of hooks might want to be hooked in > here. Then please let's talk about real use cases. This daydreaming of some future llvm in the kernel that you were bringing up during LPC doesn't help the discussion. Just like these artificial examples. > > ... > >> What exactly isn't sensible about using cgroup_bpf for containers? > > > > my use case is close to zero overhead application network monitoring. > > So if I set up a cgroup that's monitored and call it /cgroup/a and > enable delegation and if the program running there wants to do its own > monitoring in /cgroup/a/b (via delegation), then you really want the > outer monitor to silently drop events coming from /cgroup/a/b? yes. both are root and must talk to each other if they want to co-exist. When root process is asking kernel to do X, this X has to happen. > Then disallow nesting. You're welcome to not rush the decision, but > that's not what you've done. If 4.10 is released as is, you've made > the decision and you're going to have a hard time changing it. Nothing needs to be changed. > > No. As was pointed out before only root can load BPF_PROG_TYPE_CGROUP_[SKB|SOCK] > > type programs and only root can attach them. > > Why? It really seems to me that you expect that future namespaceable > bpf hooks will use a totally different API. At KS, I sat in a room > full of people arguing about cgroup v2 and a lot of them pointed out > that there are valid, paying use cases that want to stick cgroup v1 in > a container, because the code that uses cgroup v1 in the container is > called systemd and the contained OS is called RHEL (or SuSE or Ubuntu > or Gentoo or whatever), and that code is *already written* and needs > to be contained. bpf in general is not namespace aware. It's global and this cgroup scoping of bpf programs is the first of this kind. Namespacing of bpf is completely different topic. > The current approach to bpf hooks will bite you down the road. David > Ahern is already proposing using it for something that is not tracing > at all, and someone will want that in a container, and there will be a > problem. vrf use case already supported by existing code. > How about slowing down a wee bit and trying to come up with cgroup > hook semantics that work for all of these use cases? I think my > proposal is quite close to workable. you've started the topic by claiming that things are broken and non-extensible. At the course of this thread it was explained that the interface is extensible and not broken for the use case it was designed for. The 'security' use case like lsm+bpf+cgroup is not supported by the current model yet and that's what we need to discuss in the future. So, yes, please slow down. ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <CALCETrXymvAo-9zhQe=amToz_fs9XGniK2KLZv5Fxc66qcUx6A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <CALCETrXymvAo-9zhQe=amToz_fs9XGniK2KLZv5Fxc66qcUx6A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2016-12-20 10:21 ` Daniel Mack 2016-12-20 17:23 ` Andy Lutomirski 0 siblings, 1 reply; 49+ messages in thread From: Daniel Mack @ 2016-12-20 10:21 UTC (permalink / raw) To: Andy Lutomirski, Alexei Starovoitov Cc: Andy Lutomirski, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development Hi, On 12/20/2016 04:50 AM, Andy Lutomirski wrote: > On Mon, Dec 19, 2016 at 7:18 PM, Alexei Starovoitov > <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: >> On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote: >>> I think we're still talking past each other. A big part of the point >>> of changing it is that none of this is specific to bpf. You could (in >> >> the hooks and context passed into the program is very much bpf specific. >> That's what I've been trying to convey all along. > > You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)? There is nothing bpf > specfic about the hook except that the name of this macro has "BPF" in > it. There is nothing whatsoever that's bpf-specific about the context > -- sk is not bpf-specific at all. > > The only thing bpf-specific about it is that it currently only invokes > bpf programs. That could easily change. I'm not sure if I follow. The code as it currently stands only supports attaching bpf programs to cgroups which have been created using BPF_PROG_LOAD. If cgroups would support other program types in the future, then they would need to be stored in different data types anyway, and the bpf syscall multiplexer would be the wrong entry point to access them anyway. Whether we add bpf-specific code to the cgroup file parsers or cgroup-specific code to the bpf layer does not make much of a semantic difference, does it? As a matter of fact, my very first implementation of this patch set implemented a cgroup controller that would allow writing strings like "ingress 5" to its control file, where 5 is the fd number that came out of BPF_PROG_LOAD. The main reason we decided to ditch that was that echoing fd numbers into a text file seemed way worse than going through a proper syscall layer with it, and ioctls are unavailable on pseudo-fs. The idea was rather to allow attaching bpf programs to other things than just cgroups as well, which is why we called the member of 'union bpf_attr' 'target_fd', and a cgroup is just one type a target here. >> i'm assuming 'baadf00d' is bpf program fd expressed a text string? >> and kernel needs to parse above? will you allow capital and lower >> case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not? >> can program fd expressed as decimal or hex or both? >> how do you return the error? as a text string for user space >> to parse? > > No. The kernel does not parse it because you cannot write this to the > file. You set a bpf filter with ioctl and pass an fd. An ioctl on what file, exactly? > If you *read* > the file, you get the same bpf program hash that fdinfo on the bpf > object would show -- this is for debugging and (eventually) CRIU. We need a debugging facility at some point, I agree to that. As the code currently stands, that would rather need to go into the bpf(2) syscall though, as setting a program through bpf(2) and reading it through cgroupfs is really nasty. >> so you're proposing to add a bunch of hard coded logic to the kernel. >> First to parse such text into some sort of syntax tree or list/set >> and then have hard coded logic specifically for these two use cases? >> While above two can be implemented as trivial bpf programs already?! >> That goes 180% degree vs bpf philosophy. bpf is about moving >> the specific code out of the kernel and keeping kernel generic that >> it can solve as many use cases as possible by being programmable. > > I'm not seriously proposing implementing these. My point is that > *bpf*, while wonderful, is not the be-all-and-end-all of kernel > configurability, and other types of hooks might want to be hooked in > here. Sure, but nobody claimed it to be that be-all-and-end-all thing. It's just one thing that a cgroup is now able to accommodate, and because that new feature is specific to bpf, we decided to hook up the uapi to the bpf syscall. > So if I set up a cgroup that's monitored and call it /cgroup/a and > enable delegation and if the program running there wants to do its own > monitoring in /cgroup/a/b (via delegation), then you really want the > outer monitor to silently drop events coming from /cgroup/a/b? That's a fair point, and we've discussed it as well. The issue is, as Alexei already pointed out, that we do not want to traverse the tree up to the root for nested cgroups due to the runtime costs in the networking fast-path. After all, we're running the bpf program for each packet in flight. Hence, we opted for the approach to only look at the leaf node for now, with the ability to open it up further in the future using flags during attach etc. > The current approach to bpf hooks will bite you down the road. David > Ahern is already proposing using it for something that is not tracing > at all, and someone will want that in a container, and there will be a > problem. Hmm, I thought we've sorted out the concerns about that by making sure that we a) lock-down the API sufficiently so it doesn't cause any security issues in its current form, and b) make it possible to extend the functionality in the future by adding flags to the command struct etc. And I hoped we achieved that after discussing it for so long. > How about slowing down a wee bit and trying to come up with cgroup > hook semantics that work for all of these use cases? I'm all for discussing things, but I don't this was done in a rush. I do agree though that adding functionality to cgroups that is not limited to resource control is a delicate thing to do, which is why I cc'ed cgroups@ in my patches. I should have also added linux-api@ I guess, sorry I missed that. > I think my proposal is quite close to workable. So let's talk about how to proceed. I've seen different bits of your proposal in different mails, and I think a summary of it would help the discussion. Thanks, Daniel ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2016-12-20 10:21 ` Daniel Mack @ 2016-12-20 17:23 ` Andy Lutomirski [not found] ` <CALCETrXyp2ddf4HRsEoN=qEwTBaezOUX2XWj6nxPcbc4t13svw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Andy Lutomirski @ 2016-12-20 17:23 UTC (permalink / raw) To: Daniel Mack Cc: Alexei Starovoitov, Andy Lutomirski, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel@vger.kernel.org, Network Development On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack <daniel@zonque.org> wrote: > Hi, > > On 12/20/2016 04:50 AM, Andy Lutomirski wrote: >> You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)? There is nothing bpf >> specfic about the hook except that the name of this macro has "BPF" in >> it. There is nothing whatsoever that's bpf-specific about the context >> -- sk is not bpf-specific at all. >> >> The only thing bpf-specific about it is that it currently only invokes >> bpf programs. That could easily change. > > I'm not sure if I follow. The code as it currently stands only supports > attaching bpf programs to cgroups which have been created using > BPF_PROG_LOAD. If cgroups would support other program types in the > future, then they would need to be stored in different data types > anyway, and the bpf syscall multiplexer would be the wrong entry point > to access them anyway. To clarify, since this thread has gotten excessively long and twisted, I think it's important that, for hooks attached to a cgroup, you be able to tell in a generic way whether something is plugged into the hook. The natural way to see a cgroup's configuration is to read from cgroupfs, so I think that reading from cgroupfs should show you that a BPF program is attached and also give enough information that, once bpf programs become dumpable, you can dump the program (using the bpf() syscall or whatever). Obviously the interface to *attach* a BPF program to a hook will need to be at least a little bit BPF-specific. But there's nothing particularly BPF-specific about detaching, and if a control file were to exist, writing "detach" or "none" to it seems natural. > > Whether we add bpf-specific code to the cgroup file parsers or > cgroup-specific code to the bpf layer does not make much of a semantic > difference, does it? As a matter of fact, my very first implementation > of this patch set implemented a cgroup controller that would allow > writing strings like "ing > > b) make it possible to extend the functionality in the future by adding > flags to the command struct etc. > > And I hoped we achieved that after discussing it for so long. > >> How about slowing down a wee bit and trying to come up with cgroup >> hook semantics that work for all of these use cases? > > I'm all for discussing things, but I don't this was done in a rush. > > I do agree though that adding functionality to cgroups that is not > limited to resource control is a delicate thing to do, which is why I > cc'ed cgroups@ in my patches. I should have also added linux-api@ I > guess, sorry I missed that. >ress 5" to its control file, where 5 is the fd > number that came out of BPF_PROG_LOAD. The main reason we decided to > ditch that was that echoing fd numbers into a text file seemed way worse > than going through a proper syscall layer with it, and ioctls are > unavailable on pseudo-fs. There isn't a big semantic difference between 'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(..., CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file", O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'. There is, however, a semantic difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY) because the permission check is much weaker. The reason I suggest ioctl() and not write() is that write() MUST NOT make its behavior depend on the caller's credentials, file table, etc. Imagine what would happen if you did 'sudo -u eviltext >/cgroup/NAME/control.file'. (This particular mistake has been repeated many times in the kernel, in drivers, networking, namespaces, core code, etc, and it's resulted in a big pile of privilege escalation bugs.) So write("bpf:<actual BPF instructions>") is safe (but unusably awkward, I think), whereas write("bpf:fd 5") is unsafe. > > The idea was rather to allow attaching bpf programs to other things than > just cgroups as well, which is why we called the member of 'union > bpf_attr' 'target_fd', and a cgroup is just one type a target here. I would make that a separate operation. If someone adds the ability to attach an ebpf program to, say, seccomp (I'm quite sure this will happen eventually), it should be attached using seccomp(), not bpf(), for example). The people writing seccomp filters will thank you for making the syscall in question reflect what object (the cgroup, for example) is being modified. > >>> i'm assuming 'baadf00d' is bpf program fd expressed a text string? >>> and kernel needs to parse above? will you allow capital and lower >>> case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not? >>> can program fd expressed as decimal or hex or both? >>> how do you return the error? as a text string for user space >>> to parse? >> >> No. The kernel does not parse it because you cannot write this to the >> file. You set a bpf filter with ioctl and pass an fd. > > An ioctl on what file, exactly? There are at least two plausible models. My preference would be to do an ioctl on a new /cgroup/NAME/network_hooks.inet_ingress file. Reading that file tells you whether something is attached and hopefully also gives enough information (a hash of the BPF program, perhaps) to dump the actual program using future bpf() interfaces. write() and ioctl() can be used to configure it as appropriate. Another option that I like less would be to have a /cgroup/NAME/cgroup.bpf that lists all the active hooks along with their contents. You would do an ioctl() on that to program a hook and you could read it to see what's there. FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too. It doesn't make a semantic difference, except that I dislike BPF_PROG_DETACH because that particular command isn't BPF-specific at all. > >> If you *read* >> the file, you get the same bpf program hash that fdinfo on the bpf >> object would show -- this is for debugging and (eventually) CRIU. > > We need a debugging facility at some point, I agree to that. As the code > currently stands, that would rather need to go into the bpf(2) syscall > though, as setting a program through bpf(2) and reading it through > cgroupfs is really nasty. But knowing that you have to call bpf() to tell whether bpf hooks are installed in a cgroup is nasty. Everything else uses ls and cat -- why should BPF be special here? >> So if I set up a cgroup that's monitored and call it /cgroup/a and >> enable delegation and if the program running there wants to do its own >> monitoring in /cgroup/a/b (via delegation), then you really want the >> outer monitor to silently drop events coming from /cgroup/a/b? > > That's a fair point, and we've discussed it as well. The issue is, as > Alexei already pointed out, that we do not want to traverse the tree up > to the root for nested cgroups due to the runtime costs in the > networking fast-path. After all, we're running the bpf program for each > packet in flight. Hence, we opted for the approach to only look at the > leaf node for now, with the ability to open it up further in the future > using flags during attach etc. Careful here! You don't look only at the leaf node for now. You do a fancy traversal and choose the nearest node that has a hook set up. This gives you almost all the complexity of evaluating all of the installed hooks with none of the benefit. It also is, IMO, much more dangerous than only looking at the leaf node. Consider: mkdir /cgroup/foo BPF_PROG_ATTACH(some program to foo) mkdir /cgroup/foo/bar chown -R some_user /cgroup/foo/bar If the kernel only looked at the leaf, then the program that did the above would not expect that the program would constrain /cgroup/foo/bar's activity. But, as it stands, the program *would* expect /cgroup/foo/bar to be constrained, except that, whenever the capable() check changes to ns_capable() (which will happen eventually one way or another), then the bad guy can create /cgroup/foo/bar/baz, install a new no-op hook there, and break the security assumption. IOW, I think that totally non-recursive hooks are okay from a security perspective, albeit rather strange, but the current design is not okay from a security perspective. > >> The current approach to bpf hooks will bite you down the road. David >> Ahern is already proposing using it for something that is not tracing >> at all, and someone will want that in a container, and there will be a >> problem. > > Hmm, I thought we've sorted out the concerns about that by making sure > that we > > a) lock-down the API sufficiently so it doesn't cause any security > issues in its current form, and This argument is why iptables + userns has become a security mess, for example. Designing an API assuming that the bad guys will never be permitted to use it causes quite a bit of pain when, a few years later, bad guys are permitted to use it. >> I think my proposal is quite close to workable. > > So let's talk about how to proceed. I've seen different bits of your > proposal in different mails, and I think a summary of it would help the > discussion. So here's a fleshed-out possible version that's a bit of a compromise after sleeping on this. There's plenty of room to tweak this. Each cgroup gets a new file cgroup.hooks. Reading it shows a list of active hooks. (A hook can be a string like "network.inet_ingress".) You can write a command like "-network.inet_ingress off" to it to disable network.inet_ingress. You can write a command like "+network.inet_ingress" to it to enable the network.inet_ingress hook. When a hook (e.g. network.inet_ingress) is enabled, a new file appears in the cgroup called "hooks.network.inet_ingress"). You can read it to get an indication of what is currently installed in that slot. You can write "none" to it to cause nothing to be installed in that slot. (This replaces BPF_PROG_DETACH.). You can open it for write and use bpf() or perhaps ioctl() to attach a bpf program. Maybe you can also use bpf() to dump the bpf program, but, regardless, if a bpf program is there, read() will return some string that contains "bpf" and maybe some other useful information. If a SELinux policy wants to lock down the netowrk.inet_ingress hook, it uses existing mechanisms to label the hooks.network.inet_ingress file when it appears and to restrict opening it for write. I think this is all quite straightforward to implement and will result in clean code. I could probably make some decent progress toward it over the next couple days. --Andy ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <CALCETrXyp2ddf4HRsEoN=qEwTBaezOUX2XWj6nxPcbc4t13svw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <CALCETrXyp2ddf4HRsEoN=qEwTBaezOUX2XWj6nxPcbc4t13svw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2016-12-20 18:36 ` Daniel Mack 2016-12-20 18:49 ` Andy Lutomirski 0 siblings, 1 reply; 49+ messages in thread From: Daniel Mack @ 2016-12-20 18:36 UTC (permalink / raw) To: Andy Lutomirski Cc: Alexei Starovoitov, Andy Lutomirski, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development Hi, On 12/20/2016 06:23 PM, Andy Lutomirski wrote: > On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote: > To clarify, since this thread has gotten excessively long and twisted, > I think it's important that, for hooks attached to a cgroup, you be > able to tell in a generic way whether something is plugged into the > hook. The natural way to see a cgroup's configuration is to read from > cgroupfs, so I think that reading from cgroupfs should show you that a > BPF program is attached and also give enough information that, once > bpf programs become dumpable, you can dump the program (using the > bpf() syscall or whatever). [...] > There isn't a big semantic difference between > 'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(..., > CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file", > O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'. There is, however, a semantic > difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY) > because the permission check is much weaker. Okay, if you have such a control file, you can of course do something like that. When we discussed things back then with Tejun however, we concluded that a controller that is not completely controllable through control knobs that can be written and read via cat is meaningless. That's why this has become a 'hidden' cgroup feature. With your proposed API, you'd first go to the bpf(2) syscall in order to get a prog fd, and then come back to some sort of cgroup API to put the fd in there. That's quite a mix and match, which is why we considered the API cleaner in its current form, as everything that is related to bpf is encapsulated behind a single syscall. > My preference would be to do an ioctl on a new > /cgroup/NAME/network_hooks.inet_ingress file. Reading that file tells > you whether something is attached and hopefully also gives enough > information (a hash of the BPF program, perhaps) to dump the actual > program using future bpf() interfaces. write() and ioctl() can be > used to configure it as appropriate. So am I reading this right? You're proposing to add ioctl() hooks to kernfs/cgroupfs? That would open more possibilities of course, but I'm not sure where that rabbit hole leads us eventually. > Another option that I like less would be to have a > /cgroup/NAME/cgroup.bpf that lists all the active hooks along with > their contents. You would do an ioctl() on that to program a hook and > you could read it to see what's there. Yes, read() could, in theory, give you similar information than ioctl(), but in human-readable form. > FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too. > It doesn't make a semantic difference, except that I dislike > BPF_PROG_DETACH because that particular command isn't BPF-specific at > all. Well, I think it is; it pops the bpf program from a target and drops the reference on it. It's not much code, but it's certainly bpf-specific. >>> So if I set up a cgroup that's monitored and call it /cgroup/a and >>> enable delegation and if the program running there wants to do its own >>> monitoring in /cgroup/a/b (via delegation), then you really want the >>> outer monitor to silently drop events coming from /cgroup/a/b? >> >> That's a fair point, and we've discussed it as well. The issue is, as >> Alexei already pointed out, that we do not want to traverse the tree up >> to the root for nested cgroups due to the runtime costs in the >> networking fast-path. After all, we're running the bpf program for each >> packet in flight. Hence, we opted for the approach to only look at the >> leaf node for now, with the ability to open it up further in the future >> using flags during attach etc. > > Careful here! You don't look only at the leaf node for now. You do a > fancy traversal and choose the nearest node that has a hook set up. But we do the 'complex' operation at attach time or when a cgroup is created, both of which are slow-path operations. In the fast-path, we only look at the leaf, which may or may not have an effective program installed. And that's of course much cheaper then doing the traversing for each packet. > mkdir /cgroup/foo > BPF_PROG_ATTACH(some program to foo) > mkdir /cgroup/foo/bar > chown -R some_user /cgroup/foo/bar > > If the kernel only looked at the leaf, then the program that did the > above would not expect that the program would constrain > /cgroup/foo/bar's activity. But, as it stands, the program *would* > expect /cgroup/foo/bar to be constrained, except that, whenever the > capable() check changes to ns_capable() (which will happen eventually > one way or another), then the bad guy can create /cgroup/foo/bar/baz, > install a new no-op hook there, and break the security assumption. > > IOW, I think that totally non-recursive hooks are okay from a security > perspective, albeit rather strange, but the current design is not okay > from a security perspective. We locked down the ability to override any of these programs with CAP_NET_ADMIN, which is also what it takes to flush iptables, right? What's the difference? > So here's a fleshed-out possible version that's a bit of a compromise > after sleeping on this. There's plenty of room to tweak this. > > Each cgroup gets a new file cgroup.hooks. Reading it shows a list of > active hooks. (A hook can be a string like "network.inet_ingress".) > > You can write a command like "-network.inet_ingress off" to it to > disable network.inet_ingress. You can write a command like > "+network.inet_ingress" to it to enable the network.inet_ingress hook. > > When a hook (e.g. network.inet_ingress) is enabled, a new file appears > in the cgroup called "hooks.network.inet_ingress"). You can read it > to get an indication of what is currently installed in that slot. You > can write "none" to it to cause nothing to be installed in that slot. > (This replaces BPF_PROG_DETACH.). You can open it for write and use > bpf() or perhaps ioctl() to attach a bpf program. Maybe you can also > use bpf() to dump the bpf program, but, regardless, if a bpf program > is there, read() will return some string that contains "bpf" and maybe > some other useful information. I can see where you're going, but I don't know yet if if I like this approach better, given that you would still need a binary interface at least at attach time, and that such an interface would use a resource returned from bpf(2). The ability to read from control files in order to see what's going on is nice though. I'd like to have Tejun's and Alexei's opinion on this - as I said, I had something like that (albeit much simpler) in one of my very early drafts, but we consented to do the hookup the other way around, for stated reasons. Thanks, Daniel ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2016-12-20 18:36 ` Daniel Mack @ 2016-12-20 18:49 ` Andy Lutomirski 2016-12-21 4:01 ` Alexei Starovoitov 0 siblings, 1 reply; 49+ messages in thread From: Andy Lutomirski @ 2016-12-20 18:49 UTC (permalink / raw) To: Daniel Mack Cc: Alexei Starovoitov, Andy Lutomirski, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel@vger.kernel.org, Network Development On Tue, Dec 20, 2016 at 10:36 AM, Daniel Mack <daniel@zonque.org> wrote: > Hi, > > On 12/20/2016 06:23 PM, Andy Lutomirski wrote: >> On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack <daniel@zonque.org> wrote: > >> To clarify, since this thread has gotten excessively long and twisted, >> I think it's important that, for hooks attached to a cgroup, you be >> able to tell in a generic way whether something is plugged into the >> hook. The natural way to see a cgroup's configuration is to read from >> cgroupfs, so I think that reading from cgroupfs should show you that a >> BPF program is attached and also give enough information that, once >> bpf programs become dumpable, you can dump the program (using the >> bpf() syscall or whatever). > > [...] > >> There isn't a big semantic difference between >> 'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(..., >> CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file", >> O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'. There is, however, a semantic >> difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY) >> because the permission check is much weaker. > > Okay, if you have such a control file, you can of course do something > like that. When we discussed things back then with Tejun however, we > concluded that a controller that is not completely controllable through > control knobs that can be written and read via cat is meaningless. > That's why this has become a 'hidden' cgroup feature. > > With your proposed API, you'd first go to the bpf(2) syscall in order to > get a prog fd, and then come back to some sort of cgroup API to put the > fd in there. That's quite a mix and match, which is why we considered > the API cleaner in its current form, as everything that is related to > bpf is encapsulated behind a single syscall. You already have to do bpf() to get a prog fd, then open() to get a cgroup fd, then bpf() or ioctl() to attach, so this isn't much different, and its exactly the same number of syscalls. > >> My preference would be to do an ioctl on a new >> /cgroup/NAME/network_hooks.inet_ingress file. Reading that file tells >> you whether something is attached and hopefully also gives enough >> information (a hash of the BPF program, perhaps) to dump the actual >> program using future bpf() interfaces. write() and ioctl() can be >> used to configure it as appropriate. > > So am I reading this right? You're proposing to add ioctl() hooks to > kernfs/cgroupfs? That would open more possibilities of course, but I'm > not sure where that rabbit hole leads us eventually. Indeed. I already have a test patch to add ioctl() to kernfs. Adding it to cgroupfs shouldn't be much more complicated. > >> Another option that I like less would be to have a >> /cgroup/NAME/cgroup.bpf that lists all the active hooks along with >> their contents. You would do an ioctl() on that to program a hook and >> you could read it to see what's there. > > Yes, read() could, in theory, give you similar information than ioctl(), > but in human-readable form. > >> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too. >> It doesn't make a semantic difference, except that I dislike >> BPF_PROG_DETACH because that particular command isn't BPF-specific at >> all. > > Well, I think it is; it pops the bpf program from a target and drops the > reference on it. It's not much code, but it's certainly bpf-specific. I mean the interface isn't bpf-specific. If there was something that wasn't bpf attached to the target, you'd still want an API to detach it. > >>>> So if I set up a cgroup that's monitored and call it /cgroup/a and >>>> enable delegation and if the program running there wants to do its own >>>> monitoring in /cgroup/a/b (via delegation), then you really want the >>>> outer monitor to silently drop events coming from /cgroup/a/b? >>> >>> That's a fair point, and we've discussed it as well. The issue is, as >>> Alexei already pointed out, that we do not want to traverse the tree up >>> to the root for nested cgroups due to the runtime costs in the >>> networking fast-path. After all, we're running the bpf program for each >>> packet in flight. Hence, we opted for the approach to only look at the >>> leaf node for now, with the ability to open it up further in the future >>> using flags during attach etc. >> >> Careful here! You don't look only at the leaf node for now. You do a >> fancy traversal and choose the nearest node that has a hook set up. > > But we do the 'complex' operation at attach time or when a cgroup is > created, both of which are slow-path operations. In the fast-path, we > only look at the leaf, which may or may not have an effective program > installed. And that's of course much cheaper then doing the traversing > for each packet. You would never traverse the full hierarchy for each packet. You'd have a linked list of programs that are attached, kind of like how there's an "effective" array right now. I sent out pseudocode earlier in the thread. > >> mkdir /cgroup/foo >> BPF_PROG_ATTACH(some program to foo) >> mkdir /cgroup/foo/bar >> chown -R some_user /cgroup/foo/bar >> >> If the kernel only looked at the leaf, then the program that did the >> above would not expect that the program would constrain >> /cgroup/foo/bar's activity. But, as it stands, the program *would* >> expect /cgroup/foo/bar to be constrained, except that, whenever the >> capable() check changes to ns_capable() (which will happen eventually >> one way or another), then the bad guy can create /cgroup/foo/bar/baz, >> install a new no-op hook there, and break the security assumption. >> >> IOW, I think that totally non-recursive hooks are okay from a security >> perspective, albeit rather strange, but the current design is not okay >> from a security perspective. > > We locked down the ability to override any of these programs with > CAP_NET_ADMIN, which is also what it takes to flush iptables, right? > What's the difference? For iptables, it's ns_capable() now, and there have been a number of holes in it. For cgroup, it's going to turn in to ns_capable() sooner or later, and it would be nice to be ready for it. --Andy ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2016-12-20 18:49 ` Andy Lutomirski @ 2016-12-21 4:01 ` Alexei Starovoitov 0 siblings, 0 replies; 49+ messages in thread From: Alexei Starovoitov @ 2016-12-21 4:01 UTC (permalink / raw) To: Andy Lutomirski Cc: Daniel Mack, Andy Lutomirski, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel@vger.kernel.org, Network Development On Tue, Dec 20, 2016 at 10:49:25AM -0800, Andy Lutomirski wrote: > >> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too. > >> It doesn't make a semantic difference, except that I dislike > >> BPF_PROG_DETACH because that particular command isn't BPF-specific at > >> all. > > > > Well, I think it is; it pops the bpf program from a target and drops the > > reference on it. It's not much code, but it's certainly bpf-specific. > > I mean the interface isn't bpf-specific. If there was something that > wasn't bpf attached to the target, you'd still want an API to detach > it. This discussion won't go anywhere while you keep thinking that this api has to be generalized. As I explained several times earlier BPF_CGROUP_INET_SOCK_CREATE hook is bpf specific. There is nothing in the kernel that can take advantage of it today, so by definition the hook is bpf specific. Period. Saying that something in the future may come along that would want to use that is like saying I want to design the generic steering wheel for any car that will ever need it. Hence if you want to change 'target_fd' in BPF_PROG_ATTACH/DETACH cmds from being fd of open("cgroupdir") to fd of open("cgroupdir/cgroup.bpf") file inside it then I'm ok with that. All other proposals with non-extensible ioctls() and crazy text based per-hook permissions is nack. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <20161220000254.GA58895-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> 2016-12-20 0:25 ` Andy Lutomirski @ 2016-12-20 1:34 ` David Miller [not found] ` <20161219.203422.500916400463091702.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> 1 sibling, 1 reply; 49+ messages in thread From: David Miller @ 2016-12-20 1:34 UTC (permalink / raw) To: alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w Cc: luto-DgEjT+Ai2ygdnm+yROfE0A, daniel-cYrQPVfZoowdnm+yROfE0A, mic-WFhQfpSGs3bR7s880joybQ, keescook-F7+t8E8rja9g9hUCZPvPmw, jann-XZ1E9jl8jIdeoWH0uzbU5w, tj-DgEjT+Ai2ygdnm+yROfE0A, dsahern-Re5JQEeQqe8AvxtiuMwx3w, tgraf-G/eBtMaohhA, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, peterz-wEGCiKHe2LqWVfeAwA7xHQ, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA From: Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Date: Mon, 19 Dec 2016 16:02:56 -0800 > huh? 'not right api' because it's using bpf syscall instead > of cgroup control-file? I think the opposite is the truth. I completely agree with Alexei on this. ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <20161219.203422.500916400463091702.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>]
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API [not found] ` <20161219.203422.500916400463091702.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> @ 2016-12-20 1:40 ` Andy Lutomirski 2016-12-20 4:51 ` Alexei Starovoitov 0 siblings, 1 reply; 49+ messages in thread From: Andy Lutomirski @ 2016-12-20 1:40 UTC (permalink / raw) To: David Miller Cc: Alexei Starovoitov, Andrew Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Network Development On Mon, Dec 19, 2016 at 5:34 PM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote: > From: Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> > Date: Mon, 19 Dec 2016 16:02:56 -0800 > >> huh? 'not right api' because it's using bpf syscall instead >> of cgroup control-file? I think the opposite is the truth. > > I completely agree with Alexei on this. So what happens when someone adds another type of filter? Let's say there's a simple, no-privilege-required list of allowed address families that can hook up to the socket creation hook for a cgroup. Does BPF_PROG_DETACH still detach it? Or would both the bpf *and* the list of allowed address families be in force? If the latter, why wouldn't two BPF programs on the same hook be allowed? Concretely: # mkdir /cgroup/a # set_up_bpf_socket_rule /cgroup/a # set_up_list_of_address_families /cgroup/a # cat /cgroup/a/some_new_file [what gets displayed?] # BPF_PROG_DETACH: what happens By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't even take a reference to a BPF program as an argument. What is it supposed to do if this mechanism ever gets extended? --Andy ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2016-12-20 1:40 ` Andy Lutomirski @ 2016-12-20 4:51 ` Alexei Starovoitov 2016-12-20 5:26 ` Andy Lutomirski 0 siblings, 1 reply; 49+ messages in thread From: Alexei Starovoitov @ 2016-12-20 4:51 UTC (permalink / raw) To: Andy Lutomirski Cc: David Miller, Andrew Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel@vger.kernel.org, Network Development On Mon, Dec 19, 2016 at 05:40:53PM -0800, Andy Lutomirski wrote: > > By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't > even take a reference to a BPF program as an argument. What is it > supposed to do if this mechanism ever gets extended? we just add another field to that anonymous union just like we did for other commands and everything is backwards compatible. It's the basics of bpf syscall that we've been relying on for some time now and it worked just fine. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Potential issues (security and otherwise) with the current cgroup-bpf API 2016-12-20 4:51 ` Alexei Starovoitov @ 2016-12-20 5:26 ` Andy Lutomirski 0 siblings, 0 replies; 49+ messages in thread From: Andy Lutomirski @ 2016-12-20 5:26 UTC (permalink / raw) To: Alexei Starovoitov Cc: David Miller, Andrew Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo, David Ahern, Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API, linux-kernel@vger.kernel.org, Network Development On Mon, Dec 19, 2016 at 8:51 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Mon, Dec 19, 2016 at 05:40:53PM -0800, Andy Lutomirski wrote: >> >> By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't >> even take a reference to a BPF program as an argument. What is it >> supposed to do if this mechanism ever gets extended? > > we just add another field to that anonymous union just like > we did for other commands and everything is backwards compatible. > It's the basics of bpf syscall that we've been relying on for some > time now and it worked just fine. And what happens if you don't specify that member and two programs are attached? --Andy ^ permalink raw reply [flat|nested] 49+ messages in thread
end of thread, other threads:[~2017-02-04 17:10 UTC | newest] Thread overview: 49+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-01-17 5:18 (unknown), Andy Lutomirski [not found] ` <CALCETrW+ZQ1xMEfdEOzx6RK9ZoCvQiqQSnOyhDJ=-FZMwUbucg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2017-01-18 22:41 ` Potential issues (security and otherwise) with the current cgroup-bpf API Tejun Heo [not found] ` <20170118224120.GG9171-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 2017-01-19 0:18 ` Andy Lutomirski [not found] ` <CALCETrXv3k+=C=9D5YGypzPk8_f4yo8ShQ3tM2fOo++gJFBM4A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2017-01-19 0:59 ` Tejun Heo 2017-01-19 2:29 ` Andy Lutomirski [not found] ` <CALCETrVrEnL4cvkdDu2LUhxmeOZ+SMEmF=yKKsm9OYoW2y1Kpg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2017-01-20 2:39 ` Alexei Starovoitov 2017-01-20 4:04 ` Andy Lutomirski 2017-01-23 4:31 ` Alexei Starovoitov [not found] ` <20170123043154.GA93519-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> 2017-01-23 20:20 ` Andy Lutomirski [not found] ` <CALCETrXK65itdDqYTdhB-Td8d-Hzj00dcDScUOUh9psCZN_cLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2017-02-03 21:07 ` Andy Lutomirski 2017-02-03 23:21 ` Alexei Starovoitov [not found] ` <20170203232154.GA30114-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> 2017-02-04 17:10 ` Andy Lutomirski 2017-01-19 1:01 ` Mickaël Salaün -- strict thread matches above, loose matches on Subject: below -- 2016-12-17 18:18 Andy Lutomirski 2016-12-17 19:26 ` Mickaël Salaün 2016-12-17 20:02 ` Andy Lutomirski 2016-12-19 20:56 ` Alexei Starovoitov 2016-12-19 21:23 ` Andy Lutomirski 2016-12-20 0:02 ` Alexei Starovoitov [not found] ` <20161220000254.GA58895-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> 2016-12-20 0:25 ` Andy Lutomirski 2016-12-20 1:43 ` Andy Lutomirski [not found] ` <CALCETrU1_bDVLfokQ7zasHVmeq7S-R+603GEw59V_wuj4eE1hw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2016-12-20 1:44 ` David Ahern [not found] ` <2dbec775-6304-e44c-19c5-fbf07877e7b1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2016-12-20 1:56 ` Andy Lutomirski [not found] ` <CALCETrUW2jEYmjSsOrPj+MAjkDGGUCw_rdxQh+5Er0r4ReGLnA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2016-12-20 2:52 ` David Ahern 2016-12-20 3:12 ` Andy Lutomirski 2016-12-20 4:44 ` Alexei Starovoitov [not found] ` <20161220044440.GB86803-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> 2016-12-20 5:27 ` Andy Lutomirski [not found] ` <CALCETrVxkdZA3SsRv0KKhBz9YvNMsnmHSjS8HN1GHrgWRYNM1g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2016-12-20 5:32 ` Alexei Starovoitov 2016-12-20 9:11 ` Peter Zijlstra [not found] ` <20161220091150.GJ3124-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2017-01-03 10:25 ` Michal Hocko [not found] ` <20170103102559.GA30129-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2017-01-16 1:19 ` Tejun Heo 2017-01-17 13:03 ` Michal Hocko 2017-01-17 13:32 ` Peter Zijlstra 2017-01-17 13:58 ` Michal Hocko [not found] ` <20170117135830.GO19699-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2017-01-17 20:23 ` Andy Lutomirski 2017-01-18 22:18 ` Tejun Heo 2017-01-19 9:00 ` Michal Hocko 2016-12-20 3:18 ` Alexei Starovoitov [not found] ` <20161220031802.GA77838-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org> 2016-12-20 3:50 ` Andy Lutomirski 2016-12-20 4:41 ` Alexei Starovoitov [not found] ` <CALCETrXymvAo-9zhQe=amToz_fs9XGniK2KLZv5Fxc66qcUx6A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2016-12-20 10:21 ` Daniel Mack 2016-12-20 17:23 ` Andy Lutomirski [not found] ` <CALCETrXyp2ddf4HRsEoN=qEwTBaezOUX2XWj6nxPcbc4t13svw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2016-12-20 18:36 ` Daniel Mack 2016-12-20 18:49 ` Andy Lutomirski 2016-12-21 4:01 ` Alexei Starovoitov 2016-12-20 1:34 ` David Miller [not found] ` <20161219.203422.500916400463091702.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> 2016-12-20 1:40 ` Andy Lutomirski 2016-12-20 4:51 ` Alexei Starovoitov 2016-12-20 5:26 ` Andy Lutomirski
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).