* [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt @ 2011-10-21 22:22 Maciej Żenczykowski 2011-10-22 4:04 ` David Miller 0 siblings, 1 reply; 7+ messages in thread From: Maciej Żenczykowski @ 2011-10-21 22:22 UTC (permalink / raw) To: Maciej Żenczykowski; +Cc: netdev, Maciej Żenczykowski From: Maciej Żenczykowski <maze@google.com> This change adds a sysctl (/proc/sys/net/core/allow_so_priority) with a default of true (1), as such it does not change the default behaviour of the Linux kernel. This sysctl can be set to false (0), this will result in non CAP_NET_ADMIN processes being unable to set SO_PRIORITY socket option. This is desireable if we want to rely on socket/skb priorities being inferred from TOS/TCLASS bits. Signed-off-by: Maciej Żenczykowski <maze@google.com> --- include/net/sock.h | 2 ++ net/core/sock.c | 5 ++++- net/core/sysctl_net_core.c | 7 +++++++ 3 files changed, 13 insertions(+), 1 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 5ac682f..bf18a6a 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1853,6 +1853,8 @@ extern __u32 sysctl_rmem_max; extern void sk_init(void); +extern int sysctl_allow_so_priority; + extern int sysctl_optmem_max; extern __u32 sysctl_wmem_default; diff --git a/net/core/sock.c b/net/core/sock.c index 5a08762..383fd89 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -217,6 +217,8 @@ __u32 sysctl_rmem_max __read_mostly = SK_RMEM_MAX; __u32 sysctl_wmem_default __read_mostly = SK_WMEM_MAX; __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX; +int sysctl_allow_so_priority __read_mostly = 1; + /* Maximal space eaten by iovec or ancillary data plus some space */ int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512); EXPORT_SYMBOL(sysctl_optmem_max); @@ -612,7 +614,8 @@ set_rcvbuf: break; case SO_PRIORITY: - if ((val >= 0 && val <= 6) || capable(CAP_NET_ADMIN)) + if ((val >= 0 && val <= 6 && sysctl_allow_so_priority) + || capable(CAP_NET_ADMIN)) sk->sk_priority = val; else ret = -EPERM; diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c index 77a65f0..91fdaac 100644 --- a/net/core/sysctl_net_core.c +++ b/net/core/sysctl_net_core.c @@ -183,6 +183,13 @@ static struct ctl_table net_core_table[] = { .mode = 0644, .proc_handler = proc_dointvec }, + { + .procname = "allow_so_priority", + .data = &sysctl_allow_so_priority, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, { } }; -- 1.7.3.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt 2011-10-21 22:22 [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt Maciej Żenczykowski @ 2011-10-22 4:04 ` David Miller 2011-10-22 6:49 ` Maciej Żenczykowski 0 siblings, 1 reply; 7+ messages in thread From: David Miller @ 2011-10-22 4:04 UTC (permalink / raw) To: zenczykowski; +Cc: maze, netdev From: Maciej Żenczykowski <zenczykowski@gmail.com> Date: Fri, 21 Oct 2011 15:22:05 -0700 > From: Maciej Żenczykowski <maze@google.com> > > This change adds a sysctl (/proc/sys/net/core/allow_so_priority) > with a default of true (1), as such it does not change the default > behaviour of the Linux kernel. > > This sysctl can be set to false (0), this will result in non > CAP_NET_ADMIN processes being unable to set SO_PRIORITY socket > option. > > This is desireable if we want to rely on socket/skb priorities > being inferred from TOS/TCLASS bits. > > Signed-off-by: Maciej Żenczykowski <maze@google.com> The socket layer is not the place to enforce this. The ingress into your MPLS/RSVP cloud that actually provides the quality of service is where you control and mangle the TOS as needed. Sorry, I'm not applying anything like this. Any machine on your network can spit out any TOS it wants, and if you have control over the apps change it's behavior there. If you don't have control over the apps then filter and mangle. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt 2011-10-22 4:04 ` David Miller @ 2011-10-22 6:49 ` Maciej Żenczykowski 2011-10-22 6:58 ` David Miller 0 siblings, 1 reply; 7+ messages in thread From: Maciej Żenczykowski @ 2011-10-22 6:49 UTC (permalink / raw) To: David Miller; +Cc: netdev 2011/10/21 David Miller <davem@davemloft.net>: > From: Maciej Żenczykowski <zenczykowski@gmail.com> > Date: Fri, 21 Oct 2011 15:22:05 -0700 > >> From: Maciej Żenczykowski <maze@google.com> >> >> This change adds a sysctl (/proc/sys/net/core/allow_so_priority) >> with a default of true (1), as such it does not change the default >> behaviour of the Linux kernel. >> >> This sysctl can be set to false (0), this will result in non >> CAP_NET_ADMIN processes being unable to set SO_PRIORITY socket >> option. >> >> This is desireable if we want to rely on socket/skb priorities >> being inferred from TOS/TCLASS bits. >> >> Signed-off-by: Maciej Żenczykowski <maze@google.com> > > The socket layer is not the place to enforce this. > > The ingress into your MPLS/RSVP cloud that actually provides the > quality of service is where you control and mangle the TOS as needed. > > Sorry, I'm not applying anything like this. Any machine on your > network can spit out any TOS it wants, and if you have control over > the apps change it's behavior there. If you don't have control over > the apps then filter and mangle. Hmm, so I already have container (cgroup) limits on what TOS settings, a process is allowed to set (query: would you be interested in accepting a patch for that at some point in the future?). Normally setting IP_TOS also automatically sets sock->sk_priority (based on a mapping), which gets inherited into skb->priority, which can then be used for stuff like hardware priority queue dispatch (basically xps + skb->priority queue selection). Either via an XPS like mechanism, or a QDISC like mechanism (preferred), or an in-driver mechanism (currently have a hack which does this). However, processes can also manually override the sk_priority by calling SO_PRIORITY directly, at which point their IP_TOS and SO_PRIORITY no longer match. This patch allows you to disable this ability. It's not affecting the on-the-wire bits in any way, it's really only affecting packet classification at the qdisc and in the driver. As you can see this patch isn't about TOS, it's about the kernel internal skb priority setting. On a related note, while setting IP_TOS sets sk_priority, setting IPV6_TCLASS does not set sk_priority. I'd like to see this behaviour be consistent. As such was planning on sending you a patch to add sk_priority = rt_tos2priority(val) to the IPV6_TCLASS setsockopt code path. Is that ok? - Maciej ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt 2011-10-22 6:49 ` Maciej Żenczykowski @ 2011-10-22 6:58 ` David Miller 2011-10-22 8:27 ` Maciej Żenczykowski 0 siblings, 1 reply; 7+ messages in thread From: David Miller @ 2011-10-22 6:58 UTC (permalink / raw) To: zenczykowski; +Cc: netdev From: Maciej Żenczykowski <zenczykowski@gmail.com> Date: Fri, 21 Oct 2011 23:49:00 -0700 > However, processes can also manually override the sk_priority by calling > SO_PRIORITY directly, at which point their IP_TOS and SO_PRIORITY no > longer match. > > This patch allows you to disable this ability. I also don't see why we'd want to allow disabling this either. I really hate these patches that offer ways to disable things that normally work, and thus break apps when the non-default is selected. I kind of have a feeling the kind of situation you're trying to account for, you have some cloud where people run random stuff that you don't control. But you didn't specify this, and we just have to guess. Why don't you describe the specific situation where you want to modify this setting? Please do this instead of just talking about what the side effects are inside of the kernel. That's much less interesting when it comes to patches like this. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt 2011-10-22 6:58 ` David Miller @ 2011-10-22 8:27 ` Maciej Żenczykowski 2011-10-22 8:40 ` David Miller 2011-10-22 9:01 ` David Täht 0 siblings, 2 replies; 7+ messages in thread From: Maciej Żenczykowski @ 2011-10-22 8:27 UTC (permalink / raw) To: David Miller; +Cc: netdev > I also don't see why we'd want to allow disabling this either. > I really hate these patches that offer ways to disable things > that normally work, and thus break apps when the non-default > is selected. Well... the purpose of settings like this is precisely to break functionality when the default is not set ;-) > I kind of have a feeling the kind of situation you're trying to > account for, you have some cloud where people run random stuff > that you don't control. Yes, I have control of the kernel, I have control of root, I have control of some daemons that are running on the machine, but I don't really have control of the entirety of userspace, some of it I have source code for and could audit to guarantee correctness (but I can't really enforce that on the users, ultimately they can run any binary), and for some of it I don't even have that. Either way, it's much easier to delegate setting policy to userspace management daemon(s), and leave enforcing it to the kernel. This is just one more such knob. > But you didn't specify this, and we just have to guess. Why don't you > describe the specific situation where you want to modify this setting? > Please do this instead of just talking about what the side effects are > inside of the kernel. That's much less interesting when it comes to > patches like this. Very well, that's a good point. Here's an attempt to provide some insight. I am attempting to allow not-fully-code-audited nor fully trusted apps to run in a cgroup containerized environment, with many apps in many containers (not 1:1, has hierarchies) on a single kernel. The apps are in the believed to not be actively malicious class, but very likely to be buggy, or written by ill-advised programmers based on wrong/outdated or otherwise incorrect documentation. I cannot rely on unprivileged userspace getting things right. I have to have some mechanism to grant these apps permissions to utilize specific levels of network fabric priority. For this I have the aforementioned per-cgroup allowed TOS settings. VLANs are not appropriate because a client with high priority net privs is allowed to send a request to a server with no special priority permissions. (there are further patches to support tcp tos reflection so the server can automatically respond with the client's priority) Multiqueue networking combined with hardware priority queues and xps desires to use skb->priority + active cpu for tx queue selection. In this particular case TX queue selection should happen based on the TOS priority. Setting TOS automatically sets sk_priority (and hence skb->priority). So all's good, so long as userspace doesn't go and change the sk_priority field via SO_PRIORITY and break the mapping. As a further note: Some of these apps may be a little more special, a little more audited, and a little more trusted. Enough so that they might be granted CAP_NET_RAW, but not enough so that they can get CAP_NET_ADMIN. Hence the general desire for CAP_NET_ADMIN to control general machine-global networking state, but not have it control per-socket or per-packet settings. ie. bringing up or down an interface affects everyone (hence must be CAP_NET_ADMIN, and much more tightly controlled), while spoofing a packet doesn't really negatively affect anyone (you can't assume the network is trusted, so there can be external sources of spoofing or eavesdropping anyway). --- I could attempt to publish the vast majority of our internal networking code base (there isn't really anything secret in there), but it's based on 2.6.34 and even after two years of attempting to clean it up and refactor it (along with a rebase from 2.6.26, and all while actively continuing development) I'm still not at the point were I would consider this to be a particular useful course of action (there's a lot of bugfixes of bugfixes of crappy patches in there, plus hacks, plus tons of backports from upstream, and tons of code which is upstream but slightly differently then we have it internally, because we had it first, and pushed v2 upstream, etc...). Instead I'm trying to get the easy hanging fruit out of the way, rebase our patches onto probably 3.2 or 3.3, likely sending some more your way during the process, and see where that leaves us. Basically trying to reduce the delta. We will always have internal only patches, but the fewer, the less burden for us, hence I'm trying to get the ones I believe to be potentially useful externally upstreamed. Obviously whatever patches you don't accept, we'll still keep around locally. Maciej ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt 2011-10-22 8:27 ` Maciej Żenczykowski @ 2011-10-22 8:40 ` David Miller 2011-10-22 9:01 ` David Täht 1 sibling, 0 replies; 7+ messages in thread From: David Miller @ 2011-10-22 8:40 UTC (permalink / raw) To: zenczykowski; +Cc: netdev From: Maciej Żenczykowski <zenczykowski@gmail.com> Date: Sat, 22 Oct 2011 01:27:03 -0700 > I am attempting to allow not-fully-code-audited nor fully trusted apps to run > in a cgroup containerized environment, with many apps in many > containers (not 1:1, has hierarchies) on a single kernel. Extend, if necessary, the cgroup classifier so you can use it to clip off the socket inherited priority in the SKB for this cgroup. Really, this control has no business in the socket API layer. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt 2011-10-22 8:27 ` Maciej Żenczykowski 2011-10-22 8:40 ` David Miller @ 2011-10-22 9:01 ` David Täht 1 sibling, 0 replies; 7+ messages in thread From: David Täht @ 2011-10-22 9:01 UTC (permalink / raw) To: Maciej Żenczykowski; +Cc: David Miller, netdev [-- Attachment #1: Type: text/plain, Size: 6481 bytes --] On 10/22/2011 10:27 AM, Maciej Żenczykowski wrote: >> I also don't see why we'd want to allow disabling this either. I have been watching this and the other capability patches go by with interest. My use case is that I would like to be running "named" as a non-root user, but would like it to vary the dscp (tos) field on a per connection basis. tcp zone transfers = bulk tcp/udp queries = something like interactive | CS5 (this moves dns queries into the VI queue on wireless - which can also be done with SO_PRIORITY) Having TOS modification as a grant-able capability and otherwise restricting it makes some sense in a world of otherwise unrestricted user programs in the clouds, however I note that setting CS1, reducing something from best effort to background, should also be allowed universally. I note that another way to hammer down someone elses (guest machine, external router, etc) TOS settings would be to do it in iptables, but to do it on a fine grained basis at present would take up to 63 iptables rules... lastly... The skb->priority field needs some re-thought. In the case of wireless, it selects a different tx queue based on magic (see net/wireless/utils.c) /* skb->priority values from 256->263 are magic values to * directly indicate a specific 802.1d priority. This is used * to allow 802.1d priority to be passed directly in from VLAN * tags, etc. */ if (skb->priority >= 256 && skb->priority <= 263) return skb->priority - 256; classification is an aristotelian rathole! >> I really hate these patches that offer ways to disable things >> that normally work, and thus break apps when the non-default >> is selected. > Well... the purpose of settings like this is precisely to break functionality > when the default is not set ;-) > >> I kind of have a feeling the kind of situation you're trying to >> account for, you have some cloud where people run random stuff >> that you don't control. > Yes, I have control of the kernel, I have control of root, I have control of > some daemons that are running on the machine, but I don't really have > control of the entirety of userspace, some of it I have source code for > and could audit to guarantee correctness (but I can't really enforce > that on the users, ultimately they can run any binary), > and for some of it I don't even have that. Either way, it's much > easier to delegate setting policy to > userspace management daemon(s), and leave enforcing it to the kernel. > This is just one more such knob. > >> But you didn't specify this, and we just have to guess. Why don't you >> describe the specific situation where you want to modify this setting? >> Please do this instead of just talking about what the side effects are >> inside of the kernel. That's much less interesting when it comes to >> patches like this. > Very well, that's a good point. > > Here's an attempt to provide some insight. > > I am attempting to allow not-fully-code-audited nor fully trusted apps to run > in a cgroup containerized environment, with many apps in many > containers (not 1:1, has hierarchies) on a single kernel. > The apps are in the believed to not be actively malicious class, but > very likely to be buggy, or written by ill-advised programmers based > on wrong/outdated or otherwise incorrect documentation. I cannot rely > on unprivileged userspace getting things right. > I have to have some mechanism to grant these apps permissions to > utilize specific levels of network fabric priority. For this I have > the aforementioned per-cgroup allowed TOS settings. VLANs are not appropriate > because a client with high priority net privs is allowed to send a > request to a server with no special priority permissions. > (there are further patches to support tcp tos reflection so the server > can automatically respond with the client's priority) > > Multiqueue networking combined with hardware priority queues and xps > desires to use skb->priority + active cpu for tx queue selection. > In this particular case TX queue selection should happen based on the > TOS priority. > Setting TOS automatically sets sk_priority (and hence skb->priority). > So all's good, so long as userspace doesn't go and change the > sk_priority field via SO_PRIORITY and break the mapping. > > As a further note: > > Some of these apps may be a little more special, a little more > audited, and a little more trusted. > Enough so that they might be granted CAP_NET_RAW, but not enough so > that they can get CAP_NET_ADMIN. > Hence the general desire for CAP_NET_ADMIN to control general > machine-global networking state, but not have it control > per-socket or per-packet settings. ie. bringing up or down an > interface affects everyone (hence must be CAP_NET_ADMIN, and much more > tightly controlled), while spoofing a packet doesn't really negatively > affect anyone (you can't assume the network is trusted, so there can > be > external sources of spoofing or eavesdropping anyway). > > --- > > I could attempt to publish the vast majority of our internal > networking code base (there isn't really anything secret in there), > but it's based on 2.6.34 and even after two years of attempting to > clean it up and refactor it (along with a rebase from 2.6.26, and all > while actively continuing development) I'm still not at the point were > I would consider this to be a particular useful course of action > (there's a lot of bugfixes of bugfixes of crappy patches in there, > plus hacks, plus tons of backports from upstream, and tons of code > which is upstream but slightly differently then we have it internally, > because we had it first, and pushed v2 upstream, etc...). Instead I'm > trying to get the easy hanging fruit out of the way, rebase our > patches onto probably 3.2 or 3.3, likely sending some more your way > during the process, and see where that leaves us. Basically trying to > reduce the delta. We will always have internal only patches, but the > fewer, the less burden for us, hence I'm trying to get the ones I > believe to be potentially useful externally upstreamed. Obviously > whatever patches you don't accept, we'll still keep around locally. > > Maciej > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Dave Täht [-- Attachment #2: dave_taht.vcf --] [-- Type: text/x-vcard, Size: 214 bytes --] begin:vcard fn;quoted-printable:Dave T=C3=A4ht n;quoted-printable:T=C3=A4ht;Dave email;internet:dave.taht@gmail.com tel;home:1-239-829-5608 tel;cell:0638645374 x-mozilla-html:FALSE version:2.1 end:vcard ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2011-10-22 9:01 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-10-21 22:22 [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt Maciej Żenczykowski 2011-10-22 4:04 ` David Miller 2011-10-22 6:49 ` Maciej Żenczykowski 2011-10-22 6:58 ` David Miller 2011-10-22 8:27 ` Maciej Żenczykowski 2011-10-22 8:40 ` David Miller 2011-10-22 9:01 ` David Täht
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).