[PATCH] net: add sysctl allow_so_priority for SO

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt
@ 2011-10-21 22:22 Maciej Żenczykowski
  2011-10-22  4:04 ` David Miller
  0 siblings, 1 reply; 7+ messages in thread
From: Maciej Żenczykowski @ 2011-10-21 22:22 UTC (permalink / raw)
  To: Maciej Żenczykowski; +Cc: netdev, Maciej Żenczykowski

From: Maciej Żenczykowski <maze@google.com>

This change adds a sysctl (/proc/sys/net/core/allow_so_priority)
with a default of true (1), as such it does not change the default
behaviour of the Linux kernel.

This sysctl can be set to false (0), this will result in non
CAP_NET_ADMIN processes being unable to set SO_PRIORITY socket
option.

This is desireable if we want to rely on socket/skb priorities
being inferred from TOS/TCLASS bits.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
---
 include/net/sock.h         |    2 ++
 net/core/sock.c            |    5 ++++-
 net/core/sysctl_net_core.c |    7 +++++++
 3 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 5ac682f..bf18a6a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1853,6 +1853,8 @@ extern __u32 sysctl_rmem_max;
 
 extern void sk_init(void);
 
+extern int sysctl_allow_so_priority;
+
 extern int sysctl_optmem_max;
 
 extern __u32 sysctl_wmem_default;
diff --git a/net/core/sock.c b/net/core/sock.c
index 5a08762..383fd89 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -217,6 +217,8 @@ __u32 sysctl_rmem_max __read_mostly = SK_RMEM_MAX;
 __u32 sysctl_wmem_default __read_mostly = SK_WMEM_MAX;
 __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 
+int sysctl_allow_so_priority __read_mostly = 1;
+
 /* Maximal space eaten by iovec or ancillary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
@@ -612,7 +614,8 @@ set_rcvbuf:
 		break;
 
 	case SO_PRIORITY:
-		if ((val >= 0 && val <= 6) || capable(CAP_NET_ADMIN))
+		if ((val >= 0 && val <= 6 && sysctl_allow_so_priority)
+		    || capable(CAP_NET_ADMIN))
 			sk->sk_priority = val;
 		else
 			ret = -EPERM;
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 77a65f0..91fdaac 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -183,6 +183,13 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "allow_so_priority",
+		.data		= &sysctl_allow_so_priority,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 	{ }
 };
 
-- 
1.7.3.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt
  2011-10-21 22:22 [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt Maciej Żenczykowski
@ 2011-10-22  4:04 ` David Miller
  2011-10-22  6:49   ` Maciej Żenczykowski
  0 siblings, 1 reply; 7+ messages in thread
From: David Miller @ 2011-10-22  4:04 UTC (permalink / raw)
  To: zenczykowski; +Cc: maze, netdev

From: Maciej Żenczykowski <zenczykowski@gmail.com>
Date: Fri, 21 Oct 2011 15:22:05 -0700

> From: Maciej Żenczykowski <maze@google.com>
> 
> This change adds a sysctl (/proc/sys/net/core/allow_so_priority)
> with a default of true (1), as such it does not change the default
> behaviour of the Linux kernel.
> 
> This sysctl can be set to false (0), this will result in non
> CAP_NET_ADMIN processes being unable to set SO_PRIORITY socket
> option.
> 
> This is desireable if we want to rely on socket/skb priorities
> being inferred from TOS/TCLASS bits.
> 
> Signed-off-by: Maciej Żenczykowski <maze@google.com>

The socket layer is not the place to enforce this.

The ingress into your MPLS/RSVP cloud that actually provides the
quality of service is where you control and mangle the TOS as needed.

Sorry, I'm not applying anything like this.  Any machine on your
network can spit out any TOS it wants, and if you have control over
the apps change it's behavior there.  If you don't have control over
the apps then filter and mangle.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt
  2011-10-22  4:04 ` David Miller
@ 2011-10-22  6:49   ` Maciej Żenczykowski
  2011-10-22  6:58     ` David Miller
  0 siblings, 1 reply; 7+ messages in thread
From: Maciej Żenczykowski @ 2011-10-22  6:49 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

2011/10/21 David Miller <davem@davemloft.net>:
> From: Maciej Żenczykowski <zenczykowski@gmail.com>
> Date: Fri, 21 Oct 2011 15:22:05 -0700
>
>> From: Maciej Żenczykowski <maze@google.com>
>>
>> This change adds a sysctl (/proc/sys/net/core/allow_so_priority)
>> with a default of true (1), as such it does not change the default
>> behaviour of the Linux kernel.
>>
>> This sysctl can be set to false (0), this will result in non
>> CAP_NET_ADMIN processes being unable to set SO_PRIORITY socket
>> option.
>>
>> This is desireable if we want to rely on socket/skb priorities
>> being inferred from TOS/TCLASS bits.
>>
>> Signed-off-by: Maciej Żenczykowski <maze@google.com>
>
> The socket layer is not the place to enforce this.
>
> The ingress into your MPLS/RSVP cloud that actually provides the
> quality of service is where you control and mangle the TOS as needed.
>
> Sorry, I'm not applying anything like this.  Any machine on your
> network can spit out any TOS it wants, and if you have control over
> the apps change it's behavior there.  If you don't have control over
> the apps then filter and mangle.

Hmm, so I already have container (cgroup) limits on what TOS settings,
a process is allowed to set (query: would you be interested in accepting a patch
for that at some point in the future?).

Normally setting IP_TOS also automatically sets sock->sk_priority
(based on a mapping), which
gets inherited into skb->priority, which can then be used for stuff
like hardware
priority queue dispatch (basically xps + skb->priority queue selection).
Either via an XPS like mechanism, or a QDISC like mechanism (preferred), or
an in-driver mechanism (currently have a hack which does this).

However, processes can also manually override the sk_priority by calling
SO_PRIORITY directly, at which point their IP_TOS and SO_PRIORITY no
longer match.

This patch allows you to disable this ability.  It's not affecting the
on-the-wire bits
in any way, it's really only affecting packet classification at the
qdisc and in the driver.

As you can see this patch isn't about TOS, it's about the kernel
internal skb priority setting.

On a related note, while setting IP_TOS sets sk_priority, setting
IPV6_TCLASS does not
set sk_priority.  I'd like to see this behaviour be consistent. As
such was planning on
sending you a patch to add sk_priority = rt_tos2priority(val) to the
IPV6_TCLASS setsockopt
code path.  Is that ok?

- Maciej

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt
  2011-10-22  6:49   ` Maciej Żenczykowski
@ 2011-10-22  6:58     ` David Miller
  2011-10-22  8:27       ` Maciej Żenczykowski
  0 siblings, 1 reply; 7+ messages in thread
From: David Miller @ 2011-10-22  6:58 UTC (permalink / raw)
  To: zenczykowski; +Cc: netdev

From: Maciej Żenczykowski <zenczykowski@gmail.com>
Date: Fri, 21 Oct 2011 23:49:00 -0700

> However, processes can also manually override the sk_priority by calling
> SO_PRIORITY directly, at which point their IP_TOS and SO_PRIORITY no
> longer match.
> 
> This patch allows you to disable this ability.

I also don't see why we'd want to allow disabling this either.

I really hate these patches that offer ways to disable things
that normally work, and thus break apps when the non-default
is selected.

I kind of have a feeling the kind of situation you're trying to
account for, you have some cloud where people run random stuff
that you don't control.

But you didn't specify this, and we just have to guess.  Why don't you
describe the specific situation where you want to modify this setting?

Please do this instead of just talking about what the side effects are
inside of the kernel.  That's much less interesting when it comes to
patches like this.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt
  2011-10-22  6:58     ` David Miller
@ 2011-10-22  8:27       ` Maciej Żenczykowski
  2011-10-22  8:40         ` David Miller
  2011-10-22  9:01         ` David Täht
  0 siblings, 2 replies; 7+ messages in thread
From: Maciej Żenczykowski @ 2011-10-22  8:27 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

> I also don't see why we'd want to allow disabling this either.

> I really hate these patches that offer ways to disable things
> that normally work, and thus break apps when the non-default
> is selected.

Well... the purpose of settings like this is precisely to break functionality
when the default is not set ;-)

> I kind of have a feeling the kind of situation you're trying to
> account for, you have some cloud where people run random stuff
> that you don't control.

Yes, I have control of the kernel, I have control of root, I have control of
some daemons that are running on the machine, but I don't really have
control of the entirety of userspace, some of it I have source code for
and could audit to guarantee correctness (but I can't really enforce
that on the users, ultimately they can run any binary),
and for some of it I don't even have that.  Either way, it's much
easier to delegate setting policy to
userspace management daemon(s), and leave enforcing it to the kernel.
This is just one more such knob.

> But you didn't specify this, and we just have to guess.  Why don't you
> describe the specific situation where you want to modify this setting?

> Please do this instead of just talking about what the side effects are
> inside of the kernel.  That's much less interesting when it comes to
> patches like this.

Very well, that's a good point.

Here's an attempt to provide some insight.

I am attempting to allow not-fully-code-audited nor fully trusted apps to run
in a cgroup containerized environment, with many apps in many
containers (not 1:1, has hierarchies) on a single kernel.
The apps are in the believed to not be actively malicious class, but
very likely to be buggy, or written by ill-advised programmers based
on wrong/outdated or otherwise incorrect documentation.  I cannot rely
on unprivileged userspace getting things right.
I have to have some mechanism to grant these apps permissions to
utilize specific levels of network fabric priority.  For this I have
the aforementioned per-cgroup allowed TOS settings.  VLANs are not appropriate
because a client with high priority net privs is allowed to send a
request to a server with no special priority permissions.
(there are further patches to support tcp tos reflection so the server
can automatically respond with the client's priority)

Multiqueue networking combined with hardware priority queues and xps
desires to use skb->priority + active cpu for tx queue selection.
In this particular case TX queue selection should happen based on the
TOS priority.
Setting TOS automatically sets sk_priority (and hence skb->priority).
So all's good, so long as userspace doesn't go and change the
sk_priority field via SO_PRIORITY and break the mapping.

As a further note:

Some of these apps may be a little more special, a little more
audited, and a little more trusted.
Enough so that they might be granted CAP_NET_RAW, but not enough so
that they can get CAP_NET_ADMIN.
Hence the general desire for CAP_NET_ADMIN to control general
machine-global networking state, but not have it control
per-socket or per-packet settings.  ie. bringing up or down an
interface affects everyone (hence must be CAP_NET_ADMIN, and much more
tightly controlled), while spoofing a packet doesn't really negatively
affect anyone (you can't assume the network is trusted, so there can
be
external sources of spoofing or eavesdropping anyway).

---

I could attempt to publish the vast majority of our internal
networking code base (there isn't really anything secret in there),
but it's based on 2.6.34 and even after two years of attempting to
clean it up and refactor it (along with a rebase from 2.6.26, and all
while actively continuing development) I'm still not at the point were
I would consider this to be a particular useful course of action
(there's a lot of bugfixes of bugfixes of crappy patches in there,
plus hacks, plus tons of backports from upstream, and tons of code
which is upstream but slightly differently then we have it internally,
because we had it first, and pushed v2 upstream, etc...).  Instead I'm
trying to get the easy hanging fruit out of the way, rebase our
patches onto probably 3.2 or 3.3, likely sending some more your way
during the process, and see where that leaves us.  Basically trying to
reduce the delta.  We will always have internal only patches, but the
fewer, the less burden for us, hence I'm trying to get the ones I
believe to be potentially useful externally upstreamed.  Obviously
whatever patches you don't accept, we'll still keep around locally.

Maciej

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt
  2011-10-22  8:27       ` Maciej Żenczykowski
@ 2011-10-22  8:40         ` David Miller
  2011-10-22  9:01         ` David Täht
  1 sibling, 0 replies; 7+ messages in thread
From: David Miller @ 2011-10-22  8:40 UTC (permalink / raw)
  To: zenczykowski; +Cc: netdev

From: Maciej Żenczykowski <zenczykowski@gmail.com>
Date: Sat, 22 Oct 2011 01:27:03 -0700

> I am attempting to allow not-fully-code-audited nor fully trusted apps to run
> in a cgroup containerized environment, with many apps in many
> containers (not 1:1, has hierarchies) on a single kernel.

Extend, if necessary, the cgroup classifier so you can use it to clip
off the socket inherited priority in the SKB for this cgroup.

Really, this control has no business in the socket API layer.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt
  2011-10-22  8:27       ` Maciej Żenczykowski
  2011-10-22  8:40         ` David Miller
@ 2011-10-22  9:01         ` David Täht
  1 sibling, 0 replies; 7+ messages in thread
From: David Täht @ 2011-10-22  9:01 UTC (permalink / raw)
  To: Maciej Żenczykowski; +Cc: David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 6481 bytes --]

On 10/22/2011 10:27 AM, Maciej Żenczykowski wrote:
>> I also don't see why we'd want to allow disabling this either.

I have been watching this and the other capability patches go by with 
interest. My use case is that I would like to be running "named" as a 
non-root user, but would like it to vary the dscp (tos) field on a per 
connection basis.

tcp zone transfers = bulk
tcp/udp queries = something like interactive | CS5 (this moves dns 
queries into the VI queue on wireless - which can also be done with 
SO_PRIORITY)

Having TOS modification as a grant-able capability and otherwise 
restricting it makes some sense in a world of otherwise unrestricted 
user programs in the clouds, however I note that setting CS1, reducing 
something from best effort to background, should also be allowed 
universally.

I note that another way to hammer down someone elses (guest machine, 
external router, etc) TOS settings would be to do it in iptables, but to 
do it on a fine grained basis at present would take up to 63 iptables 
rules...

lastly...

The skb->priority field needs some re-thought. In the case of wireless, 
it selects a different tx queue based on magic (see net/wireless/utils.c)

         /* skb->priority values from 256->263 are magic values to
          * directly indicate a specific 802.1d priority.  This is used
          * to allow 802.1d priority to be passed directly in from VLAN
          * tags, etc.
          */
         if (skb->priority >= 256 && skb->priority <= 263)
                 return skb->priority - 256;


classification is an aristotelian rathole!

>> I really hate these patches that offer ways to disable things
>> that normally work, and thus break apps when the non-default
>> is selected.
> Well... the purpose of settings like this is precisely to break functionality
> when the default is not set ;-)
>
>> I kind of have a feeling the kind of situation you're trying to
>> account for, you have some cloud where people run random stuff
>> that you don't control.
> Yes, I have control of the kernel, I have control of root, I have control of
> some daemons that are running on the machine, but I don't really have
> control of the entirety of userspace, some of it I have source code for
> and could audit to guarantee correctness (but I can't really enforce
> that on the users, ultimately they can run any binary),
> and for some of it I don't even have that.  Either way, it's much
> easier to delegate setting policy to
> userspace management daemon(s), and leave enforcing it to the kernel.
> This is just one more such knob.
>
>> But you didn't specify this, and we just have to guess.  Why don't you
>> describe the specific situation where you want to modify this setting?
>> Please do this instead of just talking about what the side effects are
>> inside of the kernel.  That's much less interesting when it comes to
>> patches like this.
> Very well, that's a good point.
>
> Here's an attempt to provide some insight.
>
> I am attempting to allow not-fully-code-audited nor fully trusted apps to run
> in a cgroup containerized environment, with many apps in many
> containers (not 1:1, has hierarchies) on a single kernel.
> The apps are in the believed to not be actively malicious class, but
> very likely to be buggy, or written by ill-advised programmers based
> on wrong/outdated or otherwise incorrect documentation.  I cannot rely
> on unprivileged userspace getting things right.
> I have to have some mechanism to grant these apps permissions to
> utilize specific levels of network fabric priority.  For this I have
> the aforementioned per-cgroup allowed TOS settings.  VLANs are not appropriate
> because a client with high priority net privs is allowed to send a
> request to a server with no special priority permissions.
> (there are further patches to support tcp tos reflection so the server
> can automatically respond with the client's priority)
>
> Multiqueue networking combined with hardware priority queues and xps
> desires to use skb->priority + active cpu for tx queue selection.
> In this particular case TX queue selection should happen based on the
> TOS priority.
> Setting TOS automatically sets sk_priority (and hence skb->priority).
> So all's good, so long as userspace doesn't go and change the
> sk_priority field via SO_PRIORITY and break the mapping.
>
> As a further note:
>
> Some of these apps may be a little more special, a little more
> audited, and a little more trusted.
> Enough so that they might be granted CAP_NET_RAW, but not enough so
> that they can get CAP_NET_ADMIN.
> Hence the general desire for CAP_NET_ADMIN to control general
> machine-global networking state, but not have it control
> per-socket or per-packet settings.  ie. bringing up or down an
> interface affects everyone (hence must be CAP_NET_ADMIN, and much more
> tightly controlled), while spoofing a packet doesn't really negatively
> affect anyone (you can't assume the network is trusted, so there can
> be
> external sources of spoofing or eavesdropping anyway).
>
> ---
>
> I could attempt to publish the vast majority of our internal
> networking code base (there isn't really anything secret in there),
> but it's based on 2.6.34 and even after two years of attempting to
> clean it up and refactor it (along with a rebase from 2.6.26, and all
> while actively continuing development) I'm still not at the point were
> I would consider this to be a particular useful course of action
> (there's a lot of bugfixes of bugfixes of crappy patches in there,
> plus hacks, plus tons of backports from upstream, and tons of code
> which is upstream but slightly differently then we have it internally,
> because we had it first, and pushed v2 upstream, etc...).  Instead I'm
> trying to get the easy hanging fruit out of the way, rebase our
> patches onto probably 3.2 or 3.3, likely sending some more your way
> during the process, and see where that leaves us.  Basically trying to
> reduce the delta.  We will always have internal only patches, but the
> fewer, the less burden for us, hence I'm trying to get the ones I
> believe to be potentially useful externally upstreamed.  Obviously
> whatever patches you don't accept, we'll still keep around locally.
>
> Maciej
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Dave Täht


[-- Attachment #2: dave_taht.vcf --]
[-- Type: text/x-vcard, Size: 214 bytes --]

begin:vcard
fn;quoted-printable:Dave T=C3=A4ht
n;quoted-printable:T=C3=A4ht;Dave
email;internet:dave.taht@gmail.com
tel;home:1-239-829-5608
tel;cell:0638645374
x-mozilla-html:FALSE
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2011-10-22  9:01 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-10-21 22:22 [PATCH] net: add sysctl allow_so_priority for SO_PRIORITY setsockopt Maciej Żenczykowski
2011-10-22  4:04 ` David Miller
2011-10-22  6:49   ` Maciej Żenczykowski
2011-10-22  6:58     ` David Miller
2011-10-22  8:27       ` Maciej Żenczykowski
2011-10-22  8:40         ` David Miller
2011-10-22  9:01         ` David Täht

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).