Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Eric Dumazet @ 2014-11-14 23:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tom Herbert, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CALCETrUAd50g2mYEOKT-5pEqMvwSstfQcgU8=7GRsO1XcKBSFA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, 2014-11-14 at 15:03 -0800, Andy Lutomirski wrote:

> If the kernel had an API for this, I'd be all for using it.

It would be user land code, not kernel.

Why doing a system call when you can avoid it ? ;)

Doing it in the kernel would be quite complex actually.

In userland, you can even implement this by machine learning.

For every connection you made, you get the 4-tuple and INCOMING_CPU,
then you store it in a cache.

Next time you need to connect to same remote peer, you can lookup in the
cache to find a good candidate.

It would actually be good if we could do in a single socket verb a
bind_and_connect(fd, &source, &destination)

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Andy Lutomirski @ 2014-11-14 23:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <1416007946.17262.84.camel-XN9IlZ5yJG9HTL0Zs8A6p/gx64E7kk8eUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>

On Fri, Nov 14, 2014 at 3:32 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Fri, 2014-11-14 at 15:03 -0800, Andy Lutomirski wrote:
>
>> If the kernel had an API for this, I'd be all for using it.
>
> It would be user land code, not kernel.
>
> Why doing a system call when you can avoid it ? ;)
>
> Doing it in the kernel would be quite complex actually.

A semi-official library would work, too.  As long as it kept working.

>
> In userland, you can even implement this by machine learning.
>
> For every connection you made, you get the 4-tuple and INCOMING_CPU,
> then you store it in a cache.

Eww :(

>
> Next time you need to connect to same remote peer, you can lookup in the
> cache to find a good candidate.
>
> It would actually be good if we could do in a single socket verb a
> bind_and_connect(fd, &source, &destination)

Fair enough.  Although for my use case, the time it takes to connect
is completely irrelevant.

--Andy

>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Tom Herbert @ 2014-11-15  0:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric Dumazet, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CALCETrXf8j52nmpw-A2+bQt0yoz-fiD-4GP4Qf-afwH6UjCVnw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Nov 14, 2014 at 2:10 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Fri, Nov 14, 2014 at 1:36 PM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Fri, Nov 14, 2014 at 12:34 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Fri, Nov 14, 2014 at 12:25 PM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On Fri, Nov 14, 2014 at 12:16 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>> On Fri, Nov 14, 2014 at 11:52 AM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>> On Fri, Nov 14, 2014 at 11:33 AM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>>> On Fri, 2014-11-14 at 09:17 -0800, Andy Lutomirski wrote:
>>>>>>>
>>>>>>>> As a heavy user of RFS (and finder of bugs in it, too), here's my
>>>>>>>> question about this API:
>>>>>>>>
>>>>>>>> How does an application tell whether the socket represents a
>>>>>>>> non-actively-steered flow?  If the flow is subject to RFS, then moving
>>>>>>>> the application handling to the socket's CPU seems problematic, as the
>>>>>>>> socket's CPU might move as well.  The current implementation in this
>>>>>>>> patch seems to tell me which CPU the most recent packet came in on,
>>>>>>>> which is not necessarily very useful.
>>>>>>>
>>>>>>> Its the cpu that hit the TCP stack, bringing dozens of cache lines in
>>>>>>> its cache. This is all that matters,
>>>>>>>
>>>>>>>>
>>>>>>>> Some possibilities:
>>>>>>>>
>>>>>>>> 1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.
>>>>>>>
>>>>>>> Well, idea is to not use RFS at all. Otherwise, it is useless.
>>>>>
>>>>> Sure, but how do I know that it'll be the same CPU next time?
>>>>>
>>>>>>>
>>>>>> Bear in mind this is only an interface to report RX CPU and in itself
>>>>>> doesn't provide any functionality for changing scheduling, there is
>>>>>> obviously logic needed in user space that would need to do something.
>>>>>>
>>>>>> If we track the interrupting CPU in skb, the interface could be easily
>>>>>> extended to provide the interrupting CPU, the RPS CPU (calculated at
>>>>>> reported time), and the CPU processing transport (post steering which
>>>>>> is what is currently returned). That would provide the complete
>>>>>> picture to control scheduling a flow from userspace, and an interface
>>>>>> to selectively turn off RFS for a socket would make sense then.
>>>>>
>>>>> I think that a turn-off-RFS interface would also want a way to figure
>>>>> out where the flow would go without RFS.  Can the network stack do
>>>>> that (e.g. evaluate the rx indirection hash or whatever happens these
>>>>> days)?
>>>>>
>>>> Yes,. We need the rxhash and the CPU that packets are received on from
>>>> the device for the socket. The former we already have, the latter
>>>> might be done by adding a field to skbuff to set received CPU. Given
>>>> the L4 hash and interrupting CPU we can calculated the RPS CPU which
>>>> is where packet would have landed with RFS off.
>>>
>>> Hmm.  I think this would be useful for me.  It would *definitely* be
>>> useful for me if I could pin an RFS flow to a cpu of my choice.
>>>
>> Andy, can you elaborate a little more on your use case. I've thought
>> several times about an interface to program the flow table from
>> userspace, but never quite came up with a compelling use case and
>> there is the security concern that a user could "steal" cycles from
>> arbitrary CPUs.
>
> I have a bunch of threads that are pinned to various CPUs or groups of
> CPUs.  Each thread is responsible for a fixed set of flows.  I'd like
> those flows to go to those CPUs.
>
> RFS will eventually do it, but it would be nice if I could
> deterministically ask for a flow to be routed to the right CPU.  Also,
> if my thread bounces temporarily to another CPU, I don't really need
> the flow to follow it -- I'd like it to stay put.
>
Okay, how about it we have a SO_RFS_LOCK_FLOW sockopt. When this is
called on a socket we can lock the socket to CPU binding to the
current CPU it is called from. It could be unlocked at a later point.
Would this satisfy your requirements?

> This has a significant benefit over using automatic steering: with
> automatic steering, I have to make all of the hash tables have a size
> around the square of the total number of the flows in order to make it
> reliable.
>
> Something like SO_STEER_TO_THIS_CPU would be fine, as long as it
> reported whether it worked (for my diagnostics).
>
>>
>>> With SO_INCOMING_CPU as described, I'm worried that people will write
>>> programs that perform very well if RFS is off, but that once that code
>>> runs with RFS on, weird things could happen.
>>>
>>> (On a side note: the RFS flow hash stuff seems to be rather buggy.
>>> Some Solarflare engineers know about this, but a fix seems to be
>>> rather slow in the works.  I think that some of the bugs are in core
>>> code, though.)
>>
>> This is problems with accelerated RFS or just getting the flow hash for packets?
>
> Accelerated RFS.
>
> Digging through my email, I thought that
> net.core.rps_sock_flow_entries != the per-queue rps_flow_cnt would
> make no sense, although I haven't configured it that way.
>
> More importantly, though, I think that some of the has table stuff is
> problematic.  My understanding is:
>
> get_rps_cpu may call set_rps_cpu, passing rflow =
> flow_table->flows[hash & flow_table->mask];
>
> set_rps_cpu will compute flow_id = hash & flow_table->mask, which
> looks to me like it has the property that rflow == flow_table[hash]
> (unless we race with a hash table resize).
>
> Now set_rps_cpu tries to steer the new flow to the right CPU (all good
> so far), but then it gets weird.  We have a very high probability of
> old_flow == rflow.  rflow->filter gets overwritten with the filter id,
> the if condition doesn't execute, and nothing gets set to
> RPS_NO_FILTER.
>
> This is technically all correct, but if there are two active flows
> with the same hash, they'll each keep getting steered to the same
> place.  This wastes cycles and seems to anger the sfc driver (the
> latter is presumably an sfc bug).  It also means that some of the
> filters are likely to get expired for no good reason.
>
Yes, I could see where persistent hash collisions could cause an issue
with aRFS. IIRC, I saw programming the filters to be quite an
expensive operation. Minimally if seems like there some rate limiting
change to avoid hysteresis.

> Also, shouldn't ndo_rx_flow_steer be passed the full hash, not just
> the index into the hash table?  What's the point of cutting off the
> high bits?
>
In order to make aRFS transparent to the user, it was implemented to
be driven from the RFS table which hashes based on the mask so that is
why it is limited. An alternative would be to program the filters
directly from a socket using the full hash, that interface might
already exist in ntuple filtering I suppose.

> --Andy
>
> --Andy

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Tom Herbert @ 2014-11-15  0:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric Dumazet, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CALCETrVQqsJKqVU4ENEfi86wkQLQ=mqBif=HcQiKT2h_fj22xQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Nov 14, 2014 at 2:18 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Fri, Nov 14, 2014 at 2:16 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On Fri, 2014-11-14 at 12:16 -0800, Andy Lutomirski wrote:
>>
>>> Sure, but how do I know that it'll be the same CPU next time?
>>
>> Because the NIC always use same RX queue for a given flow.
>>
>> So if you setup your IRQ affinities properly, the same CPU will drain
>> packets from this RX queue. And since RFS is off, you have the guarantee
>> the same CPU will be used to process packets in TCP stack.
>
> Right.  My concern is that if RFS is off, everything works fine, but
> if RFS is on (and the app has no good way to know), then silly results
> will happen.  I think I'd rather have the getsockopt fail if RFS is on
> or at least give some indication so that the app can react
> accordingly.
>
As I mentioned, there is no material functionality in this patch and
it should be independent of RFS. It simply returns the CPU where the
stack processed the packet. Whether or not this is meaningful
information to the algorithm being implemented in userspace is
completely up to the caller to decide.

Tom

> --Andy
>
>>
>> This SO_INCOMING_CPU info is a hint, there is no guarantee eg if you use
>> bonding and some load balancer or switch decides to send packets on
>> different links.
>>
>> Most NIC use Toeplitz hash, so given the 4-tuple, and rss key (40
>> bytes), you can actually compute the hash in software and know on which
>> RX queue traffic should land.
>>
>>
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Andy Lutomirski @ 2014-11-15  0:24 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CA+mtBx9Z33O0x4-MG=66Fb4qvBJfxV54xOyiYvWydr=4Bka2xw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Nov 14, 2014 at 4:06 PM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Nov 14, 2014 at 2:10 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Fri, Nov 14, 2014 at 1:36 PM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Fri, Nov 14, 2014 at 12:34 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>> On Fri, Nov 14, 2014 at 12:25 PM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> On Fri, Nov 14, 2014 at 12:16 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>>> On Fri, Nov 14, 2014 at 11:52 AM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>> On Fri, Nov 14, 2014 at 11:33 AM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>>>> On Fri, 2014-11-14 at 09:17 -0800, Andy Lutomirski wrote:
>>>>>>>>
>>>>>>>>> As a heavy user of RFS (and finder of bugs in it, too), here's my
>>>>>>>>> question about this API:
>>>>>>>>>
>>>>>>>>> How does an application tell whether the socket represents a
>>>>>>>>> non-actively-steered flow?  If the flow is subject to RFS, then moving
>>>>>>>>> the application handling to the socket's CPU seems problematic, as the
>>>>>>>>> socket's CPU might move as well.  The current implementation in this
>>>>>>>>> patch seems to tell me which CPU the most recent packet came in on,
>>>>>>>>> which is not necessarily very useful.
>>>>>>>>
>>>>>>>> Its the cpu that hit the TCP stack, bringing dozens of cache lines in
>>>>>>>> its cache. This is all that matters,
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Some possibilities:
>>>>>>>>>
>>>>>>>>> 1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.
>>>>>>>>
>>>>>>>> Well, idea is to not use RFS at all. Otherwise, it is useless.
>>>>>>
>>>>>> Sure, but how do I know that it'll be the same CPU next time?
>>>>>>
>>>>>>>>
>>>>>>> Bear in mind this is only an interface to report RX CPU and in itself
>>>>>>> doesn't provide any functionality for changing scheduling, there is
>>>>>>> obviously logic needed in user space that would need to do something.
>>>>>>>
>>>>>>> If we track the interrupting CPU in skb, the interface could be easily
>>>>>>> extended to provide the interrupting CPU, the RPS CPU (calculated at
>>>>>>> reported time), and the CPU processing transport (post steering which
>>>>>>> is what is currently returned). That would provide the complete
>>>>>>> picture to control scheduling a flow from userspace, and an interface
>>>>>>> to selectively turn off RFS for a socket would make sense then.
>>>>>>
>>>>>> I think that a turn-off-RFS interface would also want a way to figure
>>>>>> out where the flow would go without RFS.  Can the network stack do
>>>>>> that (e.g. evaluate the rx indirection hash or whatever happens these
>>>>>> days)?
>>>>>>
>>>>> Yes,. We need the rxhash and the CPU that packets are received on from
>>>>> the device for the socket. The former we already have, the latter
>>>>> might be done by adding a field to skbuff to set received CPU. Given
>>>>> the L4 hash and interrupting CPU we can calculated the RPS CPU which
>>>>> is where packet would have landed with RFS off.
>>>>
>>>> Hmm.  I think this would be useful for me.  It would *definitely* be
>>>> useful for me if I could pin an RFS flow to a cpu of my choice.
>>>>
>>> Andy, can you elaborate a little more on your use case. I've thought
>>> several times about an interface to program the flow table from
>>> userspace, but never quite came up with a compelling use case and
>>> there is the security concern that a user could "steal" cycles from
>>> arbitrary CPUs.
>>
>> I have a bunch of threads that are pinned to various CPUs or groups of
>> CPUs.  Each thread is responsible for a fixed set of flows.  I'd like
>> those flows to go to those CPUs.
>>
>> RFS will eventually do it, but it would be nice if I could
>> deterministically ask for a flow to be routed to the right CPU.  Also,
>> if my thread bounces temporarily to another CPU, I don't really need
>> the flow to follow it -- I'd like it to stay put.
>>
> Okay, how about it we have a SO_RFS_LOCK_FLOW sockopt. When this is
> called on a socket we can lock the socket to CPU binding to the
> current CPU it is called from. It could be unlocked at a later point.
> Would this satisfy your requirements?

Yes, I think.  Especially if it bypassed the hash table.

>
>> This has a significant benefit over using automatic steering: with
>> automatic steering, I have to make all of the hash tables have a size
>> around the square of the total number of the flows in order to make it
>> reliable.
>>
>> Something like SO_STEER_TO_THIS_CPU would be fine, as long as it
>> reported whether it worked (for my diagnostics).
>>
>>>
>>>> With SO_INCOMING_CPU as described, I'm worried that people will write
>>>> programs that perform very well if RFS is off, but that once that code
>>>> runs with RFS on, weird things could happen.
>>>>
>>>> (On a side note: the RFS flow hash stuff seems to be rather buggy.
>>>> Some Solarflare engineers know about this, but a fix seems to be
>>>> rather slow in the works.  I think that some of the bugs are in core
>>>> code, though.)
>>>
>>> This is problems with accelerated RFS or just getting the flow hash for packets?
>>
>> Accelerated RFS.
>>
>> Digging through my email, I thought that
>> net.core.rps_sock_flow_entries != the per-queue rps_flow_cnt would
>> make no sense, although I haven't configured it that way.
>>
>> More importantly, though, I think that some of the has table stuff is
>> problematic.  My understanding is:
>>
>> get_rps_cpu may call set_rps_cpu, passing rflow =
>> flow_table->flows[hash & flow_table->mask];
>>
>> set_rps_cpu will compute flow_id = hash & flow_table->mask, which
>> looks to me like it has the property that rflow == flow_table[hash]
>> (unless we race with a hash table resize).
>>
>> Now set_rps_cpu tries to steer the new flow to the right CPU (all good
>> so far), but then it gets weird.  We have a very high probability of
>> old_flow == rflow.  rflow->filter gets overwritten with the filter id,
>> the if condition doesn't execute, and nothing gets set to
>> RPS_NO_FILTER.
>>
>> This is technically all correct, but if there are two active flows
>> with the same hash, they'll each keep getting steered to the same
>> place.  This wastes cycles and seems to anger the sfc driver (the
>> latter is presumably an sfc bug).  It also means that some of the
>> filters are likely to get expired for no good reason.
>>
> Yes, I could see where persistent hash collisions could cause an issue
> with aRFS. IIRC, I saw programming the filters to be quite an
> expensive operation. Minimally if seems like there some rate limiting
> change to avoid hysteresis.

They're especially expensive, given that they often generate printks
for me due to the sfc bug.


> As I mentioned, there is no material functionality in this patch and
> it should be independent of RFS. It simply returns the CPU where the
> stack processed the packet. Whether or not this is meaningful
> information to the algorithm being implemented in userspace is
> completely up to the caller to decide.

Agreed.

My only concern is that writing that userspace algorithm might result
in surprises if RFS is on.  Having the user program notice the problem
early and alert the admin might help keep Murphy's Law at bay here.

--Andy


-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Tom Herbert @ 2014-11-15  0:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric Dumazet, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CALCETrXnM5zCt651QWF0Z3c197gqzbLA29bQzwi64DnCvS48NQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Nov 14, 2014 at 4:24 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Fri, Nov 14, 2014 at 4:06 PM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Fri, Nov 14, 2014 at 2:10 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Fri, Nov 14, 2014 at 1:36 PM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On Fri, Nov 14, 2014 at 12:34 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>> On Fri, Nov 14, 2014 at 12:25 PM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>> On Fri, Nov 14, 2014 at 12:16 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>>>> On Fri, Nov 14, 2014 at 11:52 AM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>>> On Fri, Nov 14, 2014 at 11:33 AM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>>>>> On Fri, 2014-11-14 at 09:17 -0800, Andy Lutomirski wrote:
>>>>>>>>>
>>>>>>>>>> As a heavy user of RFS (and finder of bugs in it, too), here's my
>>>>>>>>>> question about this API:
>>>>>>>>>>
>>>>>>>>>> How does an application tell whether the socket represents a
>>>>>>>>>> non-actively-steered flow?  If the flow is subject to RFS, then moving
>>>>>>>>>> the application handling to the socket's CPU seems problematic, as the
>>>>>>>>>> socket's CPU might move as well.  The current implementation in this
>>>>>>>>>> patch seems to tell me which CPU the most recent packet came in on,
>>>>>>>>>> which is not necessarily very useful.
>>>>>>>>>
>>>>>>>>> Its the cpu that hit the TCP stack, bringing dozens of cache lines in
>>>>>>>>> its cache. This is all that matters,
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Some possibilities:
>>>>>>>>>>
>>>>>>>>>> 1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.
>>>>>>>>>
>>>>>>>>> Well, idea is to not use RFS at all. Otherwise, it is useless.
>>>>>>>
>>>>>>> Sure, but how do I know that it'll be the same CPU next time?
>>>>>>>
>>>>>>>>>
>>>>>>>> Bear in mind this is only an interface to report RX CPU and in itself
>>>>>>>> doesn't provide any functionality for changing scheduling, there is
>>>>>>>> obviously logic needed in user space that would need to do something.
>>>>>>>>
>>>>>>>> If we track the interrupting CPU in skb, the interface could be easily
>>>>>>>> extended to provide the interrupting CPU, the RPS CPU (calculated at
>>>>>>>> reported time), and the CPU processing transport (post steering which
>>>>>>>> is what is currently returned). That would provide the complete
>>>>>>>> picture to control scheduling a flow from userspace, and an interface
>>>>>>>> to selectively turn off RFS for a socket would make sense then.
>>>>>>>
>>>>>>> I think that a turn-off-RFS interface would also want a way to figure
>>>>>>> out where the flow would go without RFS.  Can the network stack do
>>>>>>> that (e.g. evaluate the rx indirection hash or whatever happens these
>>>>>>> days)?
>>>>>>>
>>>>>> Yes,. We need the rxhash and the CPU that packets are received on from
>>>>>> the device for the socket. The former we already have, the latter
>>>>>> might be done by adding a field to skbuff to set received CPU. Given
>>>>>> the L4 hash and interrupting CPU we can calculated the RPS CPU which
>>>>>> is where packet would have landed with RFS off.
>>>>>
>>>>> Hmm.  I think this would be useful for me.  It would *definitely* be
>>>>> useful for me if I could pin an RFS flow to a cpu of my choice.
>>>>>
>>>> Andy, can you elaborate a little more on your use case. I've thought
>>>> several times about an interface to program the flow table from
>>>> userspace, but never quite came up with a compelling use case and
>>>> there is the security concern that a user could "steal" cycles from
>>>> arbitrary CPUs.
>>>
>>> I have a bunch of threads that are pinned to various CPUs or groups of
>>> CPUs.  Each thread is responsible for a fixed set of flows.  I'd like
>>> those flows to go to those CPUs.
>>>
>>> RFS will eventually do it, but it would be nice if I could
>>> deterministically ask for a flow to be routed to the right CPU.  Also,
>>> if my thread bounces temporarily to another CPU, I don't really need
>>> the flow to follow it -- I'd like it to stay put.
>>>
>> Okay, how about it we have a SO_RFS_LOCK_FLOW sockopt. When this is
>> called on a socket we can lock the socket to CPU binding to the
>> current CPU it is called from. It could be unlocked at a later point.
>> Would this satisfy your requirements?
>
> Yes, I think.  Especially if it bypassed the hash table.

Unfortunately we can't easily bypass the hash table. The only way I
know of to to do that is to perform the socket lookup to do steering
(I tried that early on, but it was pretty costly).

>
>>
>>> This has a significant benefit over using automatic steering: with
>>> automatic steering, I have to make all of the hash tables have a size
>>> around the square of the total number of the flows in order to make it
>>> reliable.
>>>
>>> Something like SO_STEER_TO_THIS_CPU would be fine, as long as it
>>> reported whether it worked (for my diagnostics).
>>>
>>>>
>>>>> With SO_INCOMING_CPU as described, I'm worried that people will write
>>>>> programs that perform very well if RFS is off, but that once that code
>>>>> runs with RFS on, weird things could happen.
>>>>>
>>>>> (On a side note: the RFS flow hash stuff seems to be rather buggy.
>>>>> Some Solarflare engineers know about this, but a fix seems to be
>>>>> rather slow in the works.  I think that some of the bugs are in core
>>>>> code, though.)
>>>>
>>>> This is problems with accelerated RFS or just getting the flow hash for packets?
>>>
>>> Accelerated RFS.
>>>
>>> Digging through my email, I thought that
>>> net.core.rps_sock_flow_entries != the per-queue rps_flow_cnt would
>>> make no sense, although I haven't configured it that way.
>>>
>>> More importantly, though, I think that some of the has table stuff is
>>> problematic.  My understanding is:
>>>
>>> get_rps_cpu may call set_rps_cpu, passing rflow =
>>> flow_table->flows[hash & flow_table->mask];
>>>
>>> set_rps_cpu will compute flow_id = hash & flow_table->mask, which
>>> looks to me like it has the property that rflow == flow_table[hash]
>>> (unless we race with a hash table resize).
>>>
>>> Now set_rps_cpu tries to steer the new flow to the right CPU (all good
>>> so far), but then it gets weird.  We have a very high probability of
>>> old_flow == rflow.  rflow->filter gets overwritten with the filter id,
>>> the if condition doesn't execute, and nothing gets set to
>>> RPS_NO_FILTER.
>>>
>>> This is technically all correct, but if there are two active flows
>>> with the same hash, they'll each keep getting steered to the same
>>> place.  This wastes cycles and seems to anger the sfc driver (the
>>> latter is presumably an sfc bug).  It also means that some of the
>>> filters are likely to get expired for no good reason.
>>>
>> Yes, I could see where persistent hash collisions could cause an issue
>> with aRFS. IIRC, I saw programming the filters to be quite an
>> expensive operation. Minimally if seems like there some rate limiting
>> change to avoid hysteresis.
>
> They're especially expensive, given that they often generate printks
> for me due to the sfc bug.
>
:-)

>
>> As I mentioned, there is no material functionality in this patch and
>> it should be independent of RFS. It simply returns the CPU where the
>> stack processed the packet. Whether or not this is meaningful
>> information to the algorithm being implemented in userspace is
>> completely up to the caller to decide.
>
> Agreed.
>
> My only concern is that writing that userspace algorithm might result
> in surprises if RFS is on.  Having the user program notice the problem
> early and alert the admin might help keep Murphy's Law at bay here.
>
By Murphy's law we'd also have to consider that the flow hash could
change after reading the results so that the scheduling done in
userspace is completely wrong until the CPU is read again.
Synchronizing kernel and device state with userspace state is not
always so easy. One way to mitigate is to use ancillary data which
would provide real time information and obviate the need for another
system call.


> --Andy
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Andy Lutomirski @ 2014-11-15  0:50 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CA+mtBx8uneXRoc5HcXED58zi44UyMZGJS6_sU3av8ST1ft7Diw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Nov 14, 2014 at 4:40 PM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Nov 14, 2014 at 4:24 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Fri, Nov 14, 2014 at 4:06 PM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Fri, Nov 14, 2014 at 2:10 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>> On Fri, Nov 14, 2014 at 1:36 PM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> On Fri, Nov 14, 2014 at 12:34 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>>> On Fri, Nov 14, 2014 at 12:25 PM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>> On Fri, Nov 14, 2014 at 12:16 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>>>>> On Fri, Nov 14, 2014 at 11:52 AM, Tom Herbert <therbert-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>>>> On Fri, Nov 14, 2014 at 11:33 AM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>>>>>> On Fri, 2014-11-14 at 09:17 -0800, Andy Lutomirski wrote:
>>>>>>>>>>
>>>>>>>>>>> As a heavy user of RFS (and finder of bugs in it, too), here's my
>>>>>>>>>>> question about this API:
>>>>>>>>>>>
>>>>>>>>>>> How does an application tell whether the socket represents a
>>>>>>>>>>> non-actively-steered flow?  If the flow is subject to RFS, then moving
>>>>>>>>>>> the application handling to the socket's CPU seems problematic, as the
>>>>>>>>>>> socket's CPU might move as well.  The current implementation in this
>>>>>>>>>>> patch seems to tell me which CPU the most recent packet came in on,
>>>>>>>>>>> which is not necessarily very useful.
>>>>>>>>>>
>>>>>>>>>> Its the cpu that hit the TCP stack, bringing dozens of cache lines in
>>>>>>>>>> its cache. This is all that matters,
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Some possibilities:
>>>>>>>>>>>
>>>>>>>>>>> 1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.
>>>>>>>>>>
>>>>>>>>>> Well, idea is to not use RFS at all. Otherwise, it is useless.
>>>>>>>>
>>>>>>>> Sure, but how do I know that it'll be the same CPU next time?
>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Bear in mind this is only an interface to report RX CPU and in itself
>>>>>>>>> doesn't provide any functionality for changing scheduling, there is
>>>>>>>>> obviously logic needed in user space that would need to do something.
>>>>>>>>>
>>>>>>>>> If we track the interrupting CPU in skb, the interface could be easily
>>>>>>>>> extended to provide the interrupting CPU, the RPS CPU (calculated at
>>>>>>>>> reported time), and the CPU processing transport (post steering which
>>>>>>>>> is what is currently returned). That would provide the complete
>>>>>>>>> picture to control scheduling a flow from userspace, and an interface
>>>>>>>>> to selectively turn off RFS for a socket would make sense then.
>>>>>>>>
>>>>>>>> I think that a turn-off-RFS interface would also want a way to figure
>>>>>>>> out where the flow would go without RFS.  Can the network stack do
>>>>>>>> that (e.g. evaluate the rx indirection hash or whatever happens these
>>>>>>>> days)?
>>>>>>>>
>>>>>>> Yes,. We need the rxhash and the CPU that packets are received on from
>>>>>>> the device for the socket. The former we already have, the latter
>>>>>>> might be done by adding a field to skbuff to set received CPU. Given
>>>>>>> the L4 hash and interrupting CPU we can calculated the RPS CPU which
>>>>>>> is where packet would have landed with RFS off.
>>>>>>
>>>>>> Hmm.  I think this would be useful for me.  It would *definitely* be
>>>>>> useful for me if I could pin an RFS flow to a cpu of my choice.
>>>>>>
>>>>> Andy, can you elaborate a little more on your use case. I've thought
>>>>> several times about an interface to program the flow table from
>>>>> userspace, but never quite came up with a compelling use case and
>>>>> there is the security concern that a user could "steal" cycles from
>>>>> arbitrary CPUs.
>>>>
>>>> I have a bunch of threads that are pinned to various CPUs or groups of
>>>> CPUs.  Each thread is responsible for a fixed set of flows.  I'd like
>>>> those flows to go to those CPUs.
>>>>
>>>> RFS will eventually do it, but it would be nice if I could
>>>> deterministically ask for a flow to be routed to the right CPU.  Also,
>>>> if my thread bounces temporarily to another CPU, I don't really need
>>>> the flow to follow it -- I'd like it to stay put.
>>>>
>>> Okay, how about it we have a SO_RFS_LOCK_FLOW sockopt. When this is
>>> called on a socket we can lock the socket to CPU binding to the
>>> current CPU it is called from. It could be unlocked at a later point.
>>> Would this satisfy your requirements?
>>
>> Yes, I think.  Especially if it bypassed the hash table.
>
> Unfortunately we can't easily bypass the hash table. The only way I
> know of to to do that is to perform the socket lookup to do steering
> (I tried that early on, but it was pretty costly).

What happens if you just call ndo_rx_flow_steer and do something to
keep the result from expiring?

>>
>>> As I mentioned, there is no material functionality in this patch and
>>> it should be independent of RFS. It simply returns the CPU where the
>>> stack processed the packet. Whether or not this is meaningful
>>> information to the algorithm being implemented in userspace is
>>> completely up to the caller to decide.
>>
>> Agreed.
>>
>> My only concern is that writing that userspace algorithm might result
>> in surprises if RFS is on.  Having the user program notice the problem
>> early and alert the admin might help keep Murphy's Law at bay here.
>>
> By Murphy's law we'd also have to consider that the flow hash could
> change after reading the results so that the scheduling done in
> userspace is completely wrong until the CPU is read again.
> Synchronizing kernel and device state with userspace state is not
> always so easy. One way to mitigate is to use ancillary data which
> would provide real time information and obviate the need for another
> system call.

Hmm.  That would work, too.  I don't know how annoyed user code would
be at having to read ancillary data, though.

The flow hash really shouldn't change much, though, right?

--Andy

^ permalink raw reply

* [PATCH 1/2] groups: Factor out a function to set a pre-sorted group list
From: Josh Triplett @ 2014-11-15  9:00 UTC (permalink / raw)
  To: Andrew Morton, Eric W. Biederman, Kees Cook,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

This way, functions that already need to sort the group list need not do
so twice.

The new set_groups_sorted is intentionally not exported.

Signed-off-by: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
---
 kernel/groups.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/groups.c b/kernel/groups.c
index 451698f..f0667e7 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -154,16 +154,26 @@ int groups_search(const struct group_info *group_info, kgid_t grp)
 }
 
 /**
+ * set_groups_sorted - Change a group subscription in a set of credentials
+ * @new: The newly prepared set of credentials to alter
+ * @group_info: The group list to install; must be sorted
+ */
+static void set_groups_sorted(struct cred *new, struct group_info *group_info)
+{
+	put_group_info(new->group_info);
+	get_group_info(group_info);
+	new->group_info = group_info;
+}
+
+/**
  * set_groups - Change a group subscription in a set of credentials
  * @new: The newly prepared set of credentials to alter
  * @group_info: The group list to install
  */
 void set_groups(struct cred *new, struct group_info *group_info)
 {
-	put_group_info(new->group_info);
 	groups_sort(group_info);
-	get_group_info(group_info);
-	new->group_info = group_info;
+	set_groups_sorted(new, group_info);
 }
 
 EXPORT_SYMBOL(set_groups);
-- 
2.1.3
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Josh Triplett @ 2014-11-15  9:01 UTC (permalink / raw)
  To: Andrew Morton, Eric W. Biederman, Kees Cook, mtk.manpages,
	linux-api, linux-man, linux-kernel
In-Reply-To: <3ccec8a13019b5e8ce7b1d7889677b778b070dc8.1416041823.git.josh@joshtriplett.org>

Currently, unprivileged processes (without CAP_SETGID) cannot call
setgroups at all.  In particular, processes with a set of supplementary
groups cannot further drop permissions without obtaining elevated
permissions first.

Allow unprivileged processes to call setgroups with a subset of their
current groups; only require CAP_SETGID to add a group the process does
not currently have.

The kernel already maintains the list of supplementary group IDs in
sorted order, and setgroups already needs to sort the new list, so this
just requires a linear comparison of the two sorted lists.

This moves the CAP_SETGID test from setgroups into set_current_groups.

Tested via the following test program:

#include <err.h>
#include <grp.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

void run_id(void)
{
    pid_t p = fork();
    switch (p) {
        case -1:
            err(1, "fork");
        case 0:
            execl("/usr/bin/id", "id", NULL);
            err(1, "exec");
        default:
            if (waitpid(p, NULL, 0) < 0)
                err(1, "waitpid");
    }
}

int main(void)
{
    gid_t list1[] = { 1, 2, 3, 4, 5 };
    gid_t list2[] = { 2, 3, 4 };
    run_id();
    if (setgroups(5, list1) < 0)
        err(1, "setgroups 1");
    run_id();
    if (setresgid(1, 1, 1) < 0)
        err(1, "setresgid");
    if (setresuid(1, 1, 1) < 0)
        err(1, "setresuid");
    run_id();
    if (setgroups(3, list2) < 0)
        err(1, "setgroups 2");
    run_id();
    if (setgroups(5, list1) < 0)
        err(1, "setgroups 3");
    run_id();

    return 0;
}

Without this patch, the test program gets EPERM from the second
setgroups call, after dropping root privileges.  With this patch, the
test program successfully drops groups 1 and 5, but then gets EPERM from
the third setgroups call, since that call attempts to add groups the
process does not currently have.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/groups.c | 33 ++++++++++++++++++++++++++++++---
 kernel/uid16.c  |  2 --
 2 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/kernel/groups.c b/kernel/groups.c
index f0667e7..fe7367d 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -153,6 +153,29 @@ int groups_search(const struct group_info *group_info, kgid_t grp)
 	return 0;
 }
 
+/* Compare two sorted group lists; return true if the first is a subset of the
+ * second.
+ */
+static bool is_subset(const struct group_info *g1, const struct group_info *g2)
+{
+	unsigned int i, j;
+
+	for (i = 0, j = 0; i < g1->ngroups; i++) {
+		kgid_t gid1 = GROUP_AT(g1, i);
+		kgid_t gid2;
+		for (; j < g2->ngroups; j++) {
+			gid2 = GROUP_AT(g2, j);
+			if (gid_lte(gid1, gid2))
+				break;
+		}
+		if (j >= g2->ngroups || !gid_eq(gid1, gid2))
+			return false;
+		j++;
+	}
+
+	return true;
+}
+
 /**
  * set_groups_sorted - Change a group subscription in a set of credentials
  * @new: The newly prepared set of credentials to alter
@@ -189,11 +212,17 @@ int set_current_groups(struct group_info *group_info)
 {
 	struct cred *new;
 
+	groups_sort(group_info);
 	new = prepare_creds();
 	if (!new)
 		return -ENOMEM;
+	if (!ns_capable(current_user_ns(), CAP_SETGID)
+	    && !is_subset(group_info, new->group_info)) {
+		abort_creds(new);
+		return -EPERM;
+	}
 
-	set_groups(new, group_info);
+	set_groups_sorted(new, group_info);
 	return commit_creds(new);
 }
 
@@ -233,8 +262,6 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
 	struct group_info *group_info;
 	int retval;
 
-	if (!ns_capable(current_user_ns(), CAP_SETGID))
-		return -EPERM;
 	if ((unsigned)gidsetsize > NGROUPS_MAX)
 		return -EINVAL;
 
diff --git a/kernel/uid16.c b/kernel/uid16.c
index 602e5bb..b27e167 100644
--- a/kernel/uid16.c
+++ b/kernel/uid16.c
@@ -176,8 +176,6 @@ SYSCALL_DEFINE2(setgroups16, int, gidsetsize, old_gid_t __user *, grouplist)
 	struct group_info *group_info;
 	int retval;
 
-	if (!ns_capable(current_user_ns(), CAP_SETGID))
-		return -EPERM;
 	if ((unsigned)gidsetsize > NGROUPS_MAX)
 		return -EINVAL;
 
-- 
2.1.3

^ permalink raw reply related

* [PATCH manpages] getgroups.2: Document unprivileged setgroups calls
From: Josh Triplett @ 2014-11-15  9:01 UTC (permalink / raw)
  To: Andrew Morton, Eric W. Biederman, Kees Cook,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <3ccec8a13019b5e8ce7b1d7889677b778b070dc8.1416041823.git.josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>

Signed-off-by: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
---
This should probably also include appropriate documentation for what
kernel introduces this behavior.

 man2/getgroups.2 | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/man2/getgroups.2 b/man2/getgroups.2
index 373c204..edca37c 100644
--- a/man2/getgroups.2
+++ b/man2/getgroups.2
@@ -81,9 +81,10 @@ to be used in a further call to
 .PP
 .BR setgroups ()
 sets the supplementary group IDs for the calling process.
-Appropriate privileges (Linux: the
+Any process may drop groups from its list, but adding groups requires
+appropriate privileges (Linux: the
 .B CAP_SETGID
-capability) are required.
+capability).
 The
 .I size
 argument specifies the number of supplementary group IDs
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH] [media] Add RGB444_1X12 and RGB565_1X16 media bus formats
From: Sakari Ailus @ 2014-11-15 14:49 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Mauro Carvalho Chehab, Hans Verkuil, Laurent Pinchart,
	linux-media, linux-api, linux-kernel, linux-doc
In-Reply-To: <20141114160446.70c1b8b9@bbrezillon>

Hi Boris,

Boris Brezillon wrote:
> Hi Sakari,
> 
> On Fri, 14 Nov 2014 15:58:31 +0200
> Sakari Ailus <sakari.ailus@iki.fi> wrote:
> 
>> Hi Boris,
>>
>> On Fri, Nov 14, 2014 at 11:36:00AM +0100, Boris Brezillon wrote:
...
>>> diff --git a/include/uapi/linux/media-bus-format.h b/include/uapi/linux/media-bus-format.h
>>> index 23b4090..cc7b79e 100644
>>> --- a/include/uapi/linux/media-bus-format.h
>>> +++ b/include/uapi/linux/media-bus-format.h
>>> @@ -33,7 +33,7 @@
>>>  
>>>  #define MEDIA_BUS_FMT_FIXED			0x0001
>>>  
>>> -/* RGB - next is	0x100e */
>>> +/* RGB - next is	0x1010 */
>>>  #define MEDIA_BUS_FMT_RGB444_2X8_PADHI_BE	0x1001
>>>  #define MEDIA_BUS_FMT_RGB444_2X8_PADHI_LE	0x1002
>>>  #define MEDIA_BUS_FMT_RGB555_2X8_PADHI_BE	0x1003
>>> @@ -47,6 +47,8 @@
>>>  #define MEDIA_BUS_FMT_RGB888_2X12_BE		0x100b
>>>  #define MEDIA_BUS_FMT_RGB888_2X12_LE		0x100c
>>>  #define MEDIA_BUS_FMT_ARGB8888_1X32		0x100d
>>> +#define MEDIA_BUS_FMT_RGB444_1X12		0x100e
>>> +#define MEDIA_BUS_FMT_RGB565_1X16		0x100f
>>
>> I'd arrange these according to BPP and bits per sample, both in the header
>> and documentation.
> 
> I cannot keep both macro values and BPP/bits per sample in incrementing
> order. Are you sure you prefer to order macros in BPP/bits per sample
> order ?

If you take a look elsewhere in the header, you'll notice that the
ordering has preferred the BPP value (and other values with semantic
significance) over the numeric value of the definition. I'd just prefer
to keep it that way. This is also why the "next is" comments are there.

-- 
Kind regards,

Sakari Ailus
sakari.ailus@iki.fi

^ permalink raw reply

* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Eric W. Biederman @ 2014-11-15 15:37 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Andrew Morton, Kees Cook, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <0895c1f268bc0b01cc6c8ed4607d7c3953f49728.1416041823.git.josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>

Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> writes:

> Currently, unprivileged processes (without CAP_SETGID) cannot call
> setgroups at all.  In particular, processes with a set of supplementary
> groups cannot further drop permissions without obtaining elevated
> permissions first.
>
> Allow unprivileged processes to call setgroups with a subset of their
> current groups; only require CAP_SETGID to add a group the process does
> not currently have.

A couple of questions.
- Is there precedence in other unix flavors for this?
- What motiviates this change?
- Have you looked to see if anything might for bug compatibilty
  require applications not to be able to drop supplementary groups?

> The kernel already maintains the list of supplementary group IDs in
> sorted order, and setgroups already needs to sort the new list, so this
> just requires a linear comparison of the two sorted lists.
>
> This moves the CAP_SETGID test from setgroups into set_current_groups.
>
> Tested via the following test program:
>
> #include <err.h>
> #include <grp.h>
> #include <stdio.h>
> #include <sys/types.h>
> #include <unistd.h>
>
> void run_id(void)
> {
>     pid_t p = fork();
>     switch (p) {
>         case -1:
>             err(1, "fork");
>         case 0:
>             execl("/usr/bin/id", "id", NULL);
>             err(1, "exec");
>         default:
>             if (waitpid(p, NULL, 0) < 0)
>                 err(1, "waitpid");
>     }
> }
>
> int main(void)
> {
>     gid_t list1[] = { 1, 2, 3, 4, 5 };
>     gid_t list2[] = { 2, 3, 4 };
>     run_id();
>     if (setgroups(5, list1) < 0)
>         err(1, "setgroups 1");
>     run_id();
>     if (setresgid(1, 1, 1) < 0)
>         err(1, "setresgid");
>     if (setresuid(1, 1, 1) < 0)
>         err(1, "setresuid");
>     run_id();
>     if (setgroups(3, list2) < 0)
>         err(1, "setgroups 2");
>     run_id();
>     if (setgroups(5, list1) < 0)
>         err(1, "setgroups 3");
>     run_id();
>
>     return 0;
> }
>
> Without this patch, the test program gets EPERM from the second
> setgroups call, after dropping root privileges.  With this patch, the
> test program successfully drops groups 1 and 5, but then gets EPERM from
> the third setgroups call, since that call attempts to add groups the
> process does not currently have.
>
> Signed-off-by: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
> ---
>  kernel/groups.c | 33 ++++++++++++++++++++++++++++++---
>  kernel/uid16.c  |  2 --
>  2 files changed, 30 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/groups.c b/kernel/groups.c
> index f0667e7..fe7367d 100644
> --- a/kernel/groups.c
> +++ b/kernel/groups.c
> @@ -153,6 +153,29 @@ int groups_search(const struct group_info *group_info, kgid_t grp)
>  	return 0;
>  }
>  
> +/* Compare two sorted group lists; return true if the first is a subset of the
> + * second.
> + */
> +static bool is_subset(const struct group_info *g1, const struct group_info *g2)
> +{
> +	unsigned int i, j;
> +
> +	for (i = 0, j = 0; i < g1->ngroups; i++) {
> +		kgid_t gid1 = GROUP_AT(g1, i);
> +		kgid_t gid2;
> +		for (; j < g2->ngroups; j++) {
> +			gid2 = GROUP_AT(g2, j);
> +			if (gid_lte(gid1, gid2))
> +				break;
> +		}
> +		if (j >= g2->ngroups || !gid_eq(gid1, gid2))
> +			return false;
> +		j++;
> +	}
> +
> +	return true;
> +}
> +
>  /**
>   * set_groups_sorted - Change a group subscription in a set of credentials
>   * @new: The newly prepared set of credentials to alter
> @@ -189,11 +212,17 @@ int set_current_groups(struct group_info *group_info)
>  {
>  	struct cred *new;
>  
> +	groups_sort(group_info);
>  	new = prepare_creds();
>  	if (!new)
>  		return -ENOMEM;
> +	if (!ns_capable(current_user_ns(), CAP_SETGID)
> +	    && !is_subset(group_info, new->group_info)) {
> +		abort_creds(new);
> +		return -EPERM;
> +	}
>  
> -	set_groups(new, group_info);
> +	set_groups_sorted(new, group_info);
>  	return commit_creds(new);
>  }
>  
> @@ -233,8 +262,6 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
>  	struct group_info *group_info;
>  	int retval;
>  
> -	if (!ns_capable(current_user_ns(), CAP_SETGID))
> -		return -EPERM;
>  	if ((unsigned)gidsetsize > NGROUPS_MAX)
>  		return -EINVAL;
>  
> diff --git a/kernel/uid16.c b/kernel/uid16.c
> index 602e5bb..b27e167 100644
> --- a/kernel/uid16.c
> +++ b/kernel/uid16.c
> @@ -176,8 +176,6 @@ SYSCALL_DEFINE2(setgroups16, int, gidsetsize, old_gid_t __user *, grouplist)
>  	struct group_info *group_info;
>  	int retval;
>  
> -	if (!ns_capable(current_user_ns(), CAP_SETGID))
> -		return -EPERM;
>  	if ((unsigned)gidsetsize > NGROUPS_MAX)
>  		return -EINVAL;

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Tom Herbert @ 2014-11-15 18:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric Dumazet, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CALCETrU5Og8tfkFuA-rPY5evV5CkPsx=22f6mePHJPvfcRrBdA@mail.gmail.com>

On Fri, Nov 14, 2014 at 4:50 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Nov 14, 2014 at 4:40 PM, Tom Herbert <therbert@google.com> wrote:
>> On Fri, Nov 14, 2014 at 4:24 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Fri, Nov 14, 2014 at 4:06 PM, Tom Herbert <therbert@google.com> wrote:
>>>> On Fri, Nov 14, 2014 at 2:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Fri, Nov 14, 2014 at 1:36 PM, Tom Herbert <therbert@google.com> wrote:
>>>>>> On Fri, Nov 14, 2014 at 12:34 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>> On Fri, Nov 14, 2014 at 12:25 PM, Tom Herbert <therbert@google.com> wrote:
>>>>>>>> On Fri, Nov 14, 2014 at 12:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>> On Fri, Nov 14, 2014 at 11:52 AM, Tom Herbert <therbert@google.com> wrote:
>>>>>>>>>> On Fri, Nov 14, 2014 at 11:33 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>>>>>>>> On Fri, 2014-11-14 at 09:17 -0800, Andy Lutomirski wrote:
>>>>>>>>>>>
>>>>>>>>>>>> As a heavy user of RFS (and finder of bugs in it, too), here's my
>>>>>>>>>>>> question about this API:
>>>>>>>>>>>>
>>>>>>>>>>>> How does an application tell whether the socket represents a
>>>>>>>>>>>> non-actively-steered flow?  If the flow is subject to RFS, then moving
>>>>>>>>>>>> the application handling to the socket's CPU seems problematic, as the
>>>>>>>>>>>> socket's CPU might move as well.  The current implementation in this
>>>>>>>>>>>> patch seems to tell me which CPU the most recent packet came in on,
>>>>>>>>>>>> which is not necessarily very useful.
>>>>>>>>>>>
>>>>>>>>>>> Its the cpu that hit the TCP stack, bringing dozens of cache lines in
>>>>>>>>>>> its cache. This is all that matters,
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Some possibilities:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.
>>>>>>>>>>>
>>>>>>>>>>> Well, idea is to not use RFS at all. Otherwise, it is useless.
>>>>>>>>>
>>>>>>>>> Sure, but how do I know that it'll be the same CPU next time?
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> Bear in mind this is only an interface to report RX CPU and in itself
>>>>>>>>>> doesn't provide any functionality for changing scheduling, there is
>>>>>>>>>> obviously logic needed in user space that would need to do something.
>>>>>>>>>>
>>>>>>>>>> If we track the interrupting CPU in skb, the interface could be easily
>>>>>>>>>> extended to provide the interrupting CPU, the RPS CPU (calculated at
>>>>>>>>>> reported time), and the CPU processing transport (post steering which
>>>>>>>>>> is what is currently returned). That would provide the complete
>>>>>>>>>> picture to control scheduling a flow from userspace, and an interface
>>>>>>>>>> to selectively turn off RFS for a socket would make sense then.
>>>>>>>>>
>>>>>>>>> I think that a turn-off-RFS interface would also want a way to figure
>>>>>>>>> out where the flow would go without RFS.  Can the network stack do
>>>>>>>>> that (e.g. evaluate the rx indirection hash or whatever happens these
>>>>>>>>> days)?
>>>>>>>>>
>>>>>>>> Yes,. We need the rxhash and the CPU that packets are received on from
>>>>>>>> the device for the socket. The former we already have, the latter
>>>>>>>> might be done by adding a field to skbuff to set received CPU. Given
>>>>>>>> the L4 hash and interrupting CPU we can calculated the RPS CPU which
>>>>>>>> is where packet would have landed with RFS off.
>>>>>>>
>>>>>>> Hmm.  I think this would be useful for me.  It would *definitely* be
>>>>>>> useful for me if I could pin an RFS flow to a cpu of my choice.
>>>>>>>
>>>>>> Andy, can you elaborate a little more on your use case. I've thought
>>>>>> several times about an interface to program the flow table from
>>>>>> userspace, but never quite came up with a compelling use case and
>>>>>> there is the security concern that a user could "steal" cycles from
>>>>>> arbitrary CPUs.
>>>>>
>>>>> I have a bunch of threads that are pinned to various CPUs or groups of
>>>>> CPUs.  Each thread is responsible for a fixed set of flows.  I'd like
>>>>> those flows to go to those CPUs.
>>>>>
>>>>> RFS will eventually do it, but it would be nice if I could
>>>>> deterministically ask for a flow to be routed to the right CPU.  Also,
>>>>> if my thread bounces temporarily to another CPU, I don't really need
>>>>> the flow to follow it -- I'd like it to stay put.
>>>>>
>>>> Okay, how about it we have a SO_RFS_LOCK_FLOW sockopt. When this is
>>>> called on a socket we can lock the socket to CPU binding to the
>>>> current CPU it is called from. It could be unlocked at a later point.
>>>> Would this satisfy your requirements?
>>>
>>> Yes, I think.  Especially if it bypassed the hash table.
>>
>> Unfortunately we can't easily bypass the hash table. The only way I
>> know of to to do that is to perform the socket lookup to do steering
>> (I tried that early on, but it was pretty costly).
>
> What happens if you just call ndo_rx_flow_steer and do something to
> keep the result from expiring?
>
Okay, I will look at that. Do you know how many flows we are talking
about, both in the number you need and the number that can be put in
the HW without collision?

Thanks,
Tom

>>>
>>>> As I mentioned, there is no material functionality in this patch and
>>>> it should be independent of RFS. It simply returns the CPU where the
>>>> stack processed the packet. Whether or not this is meaningful
>>>> information to the algorithm being implemented in userspace is
>>>> completely up to the caller to decide.
>>>
>>> Agreed.
>>>
>>> My only concern is that writing that userspace algorithm might result
>>> in surprises if RFS is on.  Having the user program notice the problem
>>> early and alert the admin might help keep Murphy's Law at bay here.
>>>
>> By Murphy's law we'd also have to consider that the flow hash could
>> change after reading the results so that the scheduling done in
>> userspace is completely wrong until the CPU is read again.
>> Synchronizing kernel and device state with userspace state is not
>> always so easy. One way to mitigate is to use ancillary data which
>> would provide real time information and obviate the need for another
>> system call.
>
> Hmm.  That would work, too.  I don't know how annoyed user code would
> be at having to read ancillary data, though.
>
> The flow hash really shouldn't change much, though, right?
>
> --Andy

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Andy Lutomirski @ 2014-11-15 18:47 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Michael Kerrisk, David Miller, netdev, Ying Cai,
	Willem de Bruijn, Neal Cardwell, Linux API
In-Reply-To: <CA+mtBx-=fbA+pW35K8cjXTJpw30jhawy1mhs-MDEYv9SCyD2MQ@mail.gmail.com>

On Sat, Nov 15, 2014 at 10:41 AM, Tom Herbert <therbert@google.com> wrote:
> On Fri, Nov 14, 2014 at 4:50 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Fri, Nov 14, 2014 at 4:40 PM, Tom Herbert <therbert@google.com> wrote:
>>> On Fri, Nov 14, 2014 at 4:24 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Fri, Nov 14, 2014 at 4:06 PM, Tom Herbert <therbert@google.com> wrote:
>>>>> On Fri, Nov 14, 2014 at 2:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> On Fri, Nov 14, 2014 at 1:36 PM, Tom Herbert <therbert@google.com> wrote:
>>>>>>> On Fri, Nov 14, 2014 at 12:34 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>> On Fri, Nov 14, 2014 at 12:25 PM, Tom Herbert <therbert@google.com> wrote:
>>>>>>>>> On Fri, Nov 14, 2014 at 12:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>>>>> On Fri, Nov 14, 2014 at 11:52 AM, Tom Herbert <therbert@google.com> wrote:
>>>>>>>>>>> On Fri, Nov 14, 2014 at 11:33 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>>>>>>>>> On Fri, 2014-11-14 at 09:17 -0800, Andy Lutomirski wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> As a heavy user of RFS (and finder of bugs in it, too), here's my
>>>>>>>>>>>>> question about this API:
>>>>>>>>>>>>>
>>>>>>>>>>>>> How does an application tell whether the socket represents a
>>>>>>>>>>>>> non-actively-steered flow?  If the flow is subject to RFS, then moving
>>>>>>>>>>>>> the application handling to the socket's CPU seems problematic, as the
>>>>>>>>>>>>> socket's CPU might move as well.  The current implementation in this
>>>>>>>>>>>>> patch seems to tell me which CPU the most recent packet came in on,
>>>>>>>>>>>>> which is not necessarily very useful.
>>>>>>>>>>>>
>>>>>>>>>>>> Its the cpu that hit the TCP stack, bringing dozens of cache lines in
>>>>>>>>>>>> its cache. This is all that matters,
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Some possibilities:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.
>>>>>>>>>>>>
>>>>>>>>>>>> Well, idea is to not use RFS at all. Otherwise, it is useless.
>>>>>>>>>>
>>>>>>>>>> Sure, but how do I know that it'll be the same CPU next time?
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> Bear in mind this is only an interface to report RX CPU and in itself
>>>>>>>>>>> doesn't provide any functionality for changing scheduling, there is
>>>>>>>>>>> obviously logic needed in user space that would need to do something.
>>>>>>>>>>>
>>>>>>>>>>> If we track the interrupting CPU in skb, the interface could be easily
>>>>>>>>>>> extended to provide the interrupting CPU, the RPS CPU (calculated at
>>>>>>>>>>> reported time), and the CPU processing transport (post steering which
>>>>>>>>>>> is what is currently returned). That would provide the complete
>>>>>>>>>>> picture to control scheduling a flow from userspace, and an interface
>>>>>>>>>>> to selectively turn off RFS for a socket would make sense then.
>>>>>>>>>>
>>>>>>>>>> I think that a turn-off-RFS interface would also want a way to figure
>>>>>>>>>> out where the flow would go without RFS.  Can the network stack do
>>>>>>>>>> that (e.g. evaluate the rx indirection hash or whatever happens these
>>>>>>>>>> days)?
>>>>>>>>>>
>>>>>>>>> Yes,. We need the rxhash and the CPU that packets are received on from
>>>>>>>>> the device for the socket. The former we already have, the latter
>>>>>>>>> might be done by adding a field to skbuff to set received CPU. Given
>>>>>>>>> the L4 hash and interrupting CPU we can calculated the RPS CPU which
>>>>>>>>> is where packet would have landed with RFS off.
>>>>>>>>
>>>>>>>> Hmm.  I think this would be useful for me.  It would *definitely* be
>>>>>>>> useful for me if I could pin an RFS flow to a cpu of my choice.
>>>>>>>>
>>>>>>> Andy, can you elaborate a little more on your use case. I've thought
>>>>>>> several times about an interface to program the flow table from
>>>>>>> userspace, but never quite came up with a compelling use case and
>>>>>>> there is the security concern that a user could "steal" cycles from
>>>>>>> arbitrary CPUs.
>>>>>>
>>>>>> I have a bunch of threads that are pinned to various CPUs or groups of
>>>>>> CPUs.  Each thread is responsible for a fixed set of flows.  I'd like
>>>>>> those flows to go to those CPUs.
>>>>>>
>>>>>> RFS will eventually do it, but it would be nice if I could
>>>>>> deterministically ask for a flow to be routed to the right CPU.  Also,
>>>>>> if my thread bounces temporarily to another CPU, I don't really need
>>>>>> the flow to follow it -- I'd like it to stay put.
>>>>>>
>>>>> Okay, how about it we have a SO_RFS_LOCK_FLOW sockopt. When this is
>>>>> called on a socket we can lock the socket to CPU binding to the
>>>>> current CPU it is called from. It could be unlocked at a later point.
>>>>> Would this satisfy your requirements?
>>>>
>>>> Yes, I think.  Especially if it bypassed the hash table.
>>>
>>> Unfortunately we can't easily bypass the hash table. The only way I
>>> know of to to do that is to perform the socket lookup to do steering
>>> (I tried that early on, but it was pretty costly).
>>
>> What happens if you just call ndo_rx_flow_steer and do something to
>> keep the result from expiring?
>>
> Okay, I will look at that. Do you know how many flows we are talking
> about, both in the number you need and the number that can be put in
> the HW without collision?

I think I have on the order of 100 flows.  Maybe 400.  It's been a few
months since I checked, and this particular metric is much easier to
measure on a weekday.

I think the HW has several thousand slots.

--Andy

^ permalink raw reply

* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Josh Triplett @ 2014-11-15 19:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Kees Cook, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <87d28osceg.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Sat, Nov 15, 2014 at 09:37:27AM -0600, Eric W. Biederman wrote:
> Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> writes:
> 
> > Currently, unprivileged processes (without CAP_SETGID) cannot call
> > setgroups at all.  In particular, processes with a set of supplementary
> > groups cannot further drop permissions without obtaining elevated
> > permissions first.
> >
> > Allow unprivileged processes to call setgroups with a subset of their
> > current groups; only require CAP_SETGID to add a group the process does
> > not currently have.
> 
> A couple of questions.
> - Is there precedence in other unix flavors for this?

I found a few references to now-nonexistent pages at MIT about a system
with this property, but other than that no.

I've also found more than a few references to people wanting this
functionality.

> - What motiviates this change?

I have a series of patches planned to add more ways to drop elevated
privileges without requiring a transition through root to do so.  That
would improve the ability for unprivileged users to run programs
sandboxed with even *less* privileges.  (Among other things, that would
also allow programs running with no_new_privs to further *reduce* their
privileges, which they can't currently do in this case.)

> - Have you looked to see if anything might for bug compatibilty
>   require applications not to be able to drop supplementary groups?

I haven't found any such case; that doesn't mean such a case does not
exist.  Feedback welcome.

The only case I can think of (and I don't know of any examples of such a
system): some kind of quota system that limits the members of a group to
a certain amount of storage, but places no such limit on non-members.

However, the idea of *holding* a credential (a supplementary group ID)
giving *less* privileges, and *dropping* a credential giving *more*
privileges, would completely invert normal security models.  (The sane
way to design such a system would be to have a privileged group for
"users who can exceed the quota".)

If it turns out that a real case exists that people care about, I could
easily make this configurable, either at compile time or via a sysctl.

- Josh Triplett

^ permalink raw reply

* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Andy Lutomirski @ 2014-11-15 20:06 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Eric W. Biederman, Andrew Morton, Kees Cook,
	Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20141115192924.GB19060@thin>

On Sat, Nov 15, 2014 at 11:29 AM, Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> wrote:
> On Sat, Nov 15, 2014 at 09:37:27AM -0600, Eric W. Biederman wrote:
>> Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> writes:
>>
>> > Currently, unprivileged processes (without CAP_SETGID) cannot call
>> > setgroups at all.  In particular, processes with a set of supplementary
>> > groups cannot further drop permissions without obtaining elevated
>> > permissions first.
>> >
>> > Allow unprivileged processes to call setgroups with a subset of their
>> > current groups; only require CAP_SETGID to add a group the process does
>> > not currently have.
>>
>> A couple of questions.
>> - Is there precedence in other unix flavors for this?
>
> I found a few references to now-nonexistent pages at MIT about a system
> with this property, but other than that no.
>
> I've also found more than a few references to people wanting this
> functionality.
>
>> - What motiviates this change?
>
> I have a series of patches planned to add more ways to drop elevated
> privileges without requiring a transition through root to do so.  That
> would improve the ability for unprivileged users to run programs
> sandboxed with even *less* privileges.  (Among other things, that would
> also allow programs running with no_new_privs to further *reduce* their
> privileges, which they can't currently do in this case.)
>
>> - Have you looked to see if anything might for bug compatibilty
>>   require applications not to be able to drop supplementary groups?
>
> I haven't found any such case; that doesn't mean such a case does not
> exist.  Feedback welcome.
>
> The only case I can think of (and I don't know of any examples of such a
> system): some kind of quota system that limits the members of a group to
> a certain amount of storage, but places no such limit on non-members.
>
> However, the idea of *holding* a credential (a supplementary group ID)
> giving *less* privileges, and *dropping* a credential giving *more*
> privileges, would completely invert normal security models.  (The sane
> way to design such a system would be to have a privileged group for
> "users who can exceed the quota".)

Agreed.  And, if you want to bypass quotas by dropping a supplementary
group, you already can by unsharing your user namespace.

However, sudoers seems to allow negative group matches.  So maybe
allowing this only with no_new_privs already set would make sense.

--Andy

^ permalink raw reply

* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Josh Triplett @ 2014-11-15 20:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Andrew Morton, Kees Cook,
	Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <CALCETrUM=GqsOumTmDMF4B5GS1w=x56t41eE-2xW1bBOfUz02w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Sat, Nov 15, 2014 at 12:06:20PM -0800, Andy Lutomirski wrote:
> On Sat, Nov 15, 2014 at 11:29 AM, Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> wrote:
> > On Sat, Nov 15, 2014 at 09:37:27AM -0600, Eric W. Biederman wrote:
> >> Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> writes:
> >>
> >> > Currently, unprivileged processes (without CAP_SETGID) cannot call
> >> > setgroups at all.  In particular, processes with a set of supplementary
> >> > groups cannot further drop permissions without obtaining elevated
> >> > permissions first.
> >> >
> >> > Allow unprivileged processes to call setgroups with a subset of their
> >> > current groups; only require CAP_SETGID to add a group the process does
> >> > not currently have.
> >>
> >> A couple of questions.
> >> - Is there precedence in other unix flavors for this?
> >
> > I found a few references to now-nonexistent pages at MIT about a system
> > with this property, but other than that no.
> >
> > I've also found more than a few references to people wanting this
> > functionality.
> >
> >> - What motiviates this change?
> >
> > I have a series of patches planned to add more ways to drop elevated
> > privileges without requiring a transition through root to do so.  That
> > would improve the ability for unprivileged users to run programs
> > sandboxed with even *less* privileges.  (Among other things, that would
> > also allow programs running with no_new_privs to further *reduce* their
> > privileges, which they can't currently do in this case.)
> >
> >> - Have you looked to see if anything might for bug compatibilty
> >>   require applications not to be able to drop supplementary groups?
> >
> > I haven't found any such case; that doesn't mean such a case does not
> > exist.  Feedback welcome.
> >
> > The only case I can think of (and I don't know of any examples of such a
> > system): some kind of quota system that limits the members of a group to
> > a certain amount of storage, but places no such limit on non-members.
> >
> > However, the idea of *holding* a credential (a supplementary group ID)
> > giving *less* privileges, and *dropping* a credential giving *more*
> > privileges, would completely invert normal security models.  (The sane
> > way to design such a system would be to have a privileged group for
> > "users who can exceed the quota".)
> 
> Agreed.  And, if you want to bypass quotas by dropping a supplementary
> group, you already can by unsharing your user namespace.

Good point!  Given that a process can run with a new user namespace and
no other namespaces, and then drop all its other privileges that way,
the ability to drop privileges without using a user namespace seems
completely harmless, with one exception that you noted:

> However, sudoers seems to allow negative group matches.  So maybe
> allowing this only with no_new_privs already set would make sense.

Sigh, bad sudo.  Sure, restricting this to no_new_privs only seems fine.
I'll do that in v2, and document that in the manpage.

- Josh Triplett

^ permalink raw reply

* [PATCHv2 1/2] groups: Factor out a function to set a pre-sorted group list
From: Josh Triplett @ 2014-11-15 21:02 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski, Eric W. Biederman, Kees Cook,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

This way, functions that already need to sort the group list need not do
so twice.

The new set_groups_sorted is intentionally not exported.

Signed-off-by: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
---
 kernel/groups.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/groups.c b/kernel/groups.c
index 451698f..f0667e7 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -154,16 +154,26 @@ int groups_search(const struct group_info *group_info, kgid_t grp)
 }
 
 /**
+ * set_groups_sorted - Change a group subscription in a set of credentials
+ * @new: The newly prepared set of credentials to alter
+ * @group_info: The group list to install; must be sorted
+ */
+static void set_groups_sorted(struct cred *new, struct group_info *group_info)
+{
+	put_group_info(new->group_info);
+	get_group_info(group_info);
+	new->group_info = group_info;
+}
+
+/**
  * set_groups - Change a group subscription in a set of credentials
  * @new: The newly prepared set of credentials to alter
  * @group_info: The group list to install
  */
 void set_groups(struct cred *new, struct group_info *group_info)
 {
-	put_group_info(new->group_info);
 	groups_sort(group_info);
-	get_group_info(group_info);
-	new->group_info = group_info;
+	set_groups_sorted(new, group_info);
 }
 
 EXPORT_SYMBOL(set_groups);
-- 
2.1.3

^ permalink raw reply related

* [PATCHv2 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Josh Triplett @ 2014-11-15 21:02 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski, Eric W. Biederman, Kees Cook,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <3ccec8a13019b5e8ce7b1d7889677b778b070dc8.1416085112.git.josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>

Currently, unprivileged processes (without CAP_SETGID) cannot call
setgroups at all.  In particular, processes with a set of supplementary
groups cannot further drop permissions without obtaining elevated
permissions first.

Allow unprivileged processes to call setgroups with a subset of their
current groups; only require CAP_SETGID to add a group the process does
not currently have.

Since some privileged programs (such as sudo) allow tests for
non-membership in a group, require no_new_privs to drop group
privileges.

The kernel already maintains the list of supplementary group IDs in
sorted order, and setgroups already needs to sort the new list, so this
just requires a linear comparison of the two sorted lists.

This moves the CAP_SETGID test from setgroups into set_current_groups.

Tested via the following test program:

void run_id(void)
{
    pid_t p = fork();
    switch (p) {
        case -1:
            err(1, "fork");
        case 0:
            execl("/usr/bin/id", "id", NULL);
            err(1, "exec");
        default:
            if (waitpid(p, NULL, 0) < 0)
                err(1, "waitpid");
    }
}

int main(void)
{
    gid_t list1[] = { 1, 2, 3, 4, 5 };
    gid_t list2[] = { 2, 3, 4 };
    run_id();
    if (setgroups(5, list1) < 0)
        err(1, "setgroups 1");
    run_id();
    if (setresgid(1, 1, 1) < 0)
        err(1, "setresgid");
    if (setresuid(1, 1, 1) < 0)
        err(1, "setresuid");
    run_id();
    if (setgroups(3, list2) < 0)
        err(1, "setgroups 2");
    run_id();
    if (setgroups(5, list1) < 0)
        err(1, "setgroups 3");
    run_id();

    return 0;
}

Without this patch, the test program gets EPERM from the second
setgroups call, after dropping root privileges.  With this patch, the
test program successfully drops groups 1 and 5, but then gets EPERM from
the third setgroups call, since that call attempts to add groups the
process does not currently have.

Signed-off-by: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
---
v2: Require no_new_privs.

 kernel/groups.c | 34 +++++++++++++++++++++++++++++++---
 kernel/uid16.c  |  2 --
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/kernel/groups.c b/kernel/groups.c
index f0667e7..f7fb8dd 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -153,6 +153,29 @@ int groups_search(const struct group_info *group_info, kgid_t grp)
 	return 0;
 }
 
+/* Compare two sorted group lists; return true if the first is a subset of the
+ * second.
+ */
+static bool is_subset(const struct group_info *g1, const struct group_info *g2)
+{
+	unsigned int i, j;
+
+	for (i = 0, j = 0; i < g1->ngroups; i++) {
+		kgid_t gid1 = GROUP_AT(g1, i);
+		kgid_t gid2;
+		for (; j < g2->ngroups; j++) {
+			gid2 = GROUP_AT(g2, j);
+			if (gid_lte(gid1, gid2))
+				break;
+		}
+		if (j >= g2->ngroups || !gid_eq(gid1, gid2))
+			return false;
+		j++;
+	}
+
+	return true;
+}
+
 /**
  * set_groups_sorted - Change a group subscription in a set of credentials
  * @new: The newly prepared set of credentials to alter
@@ -189,11 +212,18 @@ int set_current_groups(struct group_info *group_info)
 {
 	struct cred *new;
 
+	groups_sort(group_info);
 	new = prepare_creds();
 	if (!new)
 		return -ENOMEM;
+	if (!(ns_capable(current_user_ns(), CAP_SETGID)
+	      || (task_no_new_privs(current)
+		  && is_subset(group_info, new->group_info)))) {
+		abort_creds(new);
+		return -EPERM;
+	}
 
-	set_groups(new, group_info);
+	set_groups_sorted(new, group_info);
 	return commit_creds(new);
 }
 
@@ -233,8 +263,6 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
 	struct group_info *group_info;
 	int retval;
 
-	if (!ns_capable(current_user_ns(), CAP_SETGID))
-		return -EPERM;
 	if ((unsigned)gidsetsize > NGROUPS_MAX)
 		return -EINVAL;
 
diff --git a/kernel/uid16.c b/kernel/uid16.c
index 602e5bb..b27e167 100644
--- a/kernel/uid16.c
+++ b/kernel/uid16.c
@@ -176,8 +176,6 @@ SYSCALL_DEFINE2(setgroups16, int, gidsetsize, old_gid_t __user *, grouplist)
 	struct group_info *group_info;
 	int retval;
 
-	if (!ns_capable(current_user_ns(), CAP_SETGID))
-		return -EPERM;
 	if ((unsigned)gidsetsize > NGROUPS_MAX)
 		return -EINVAL;
 
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCHv2 manpages] getgroups.2: Document unprivileged setgroups calls
From: Josh Triplett @ 2014-11-15 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski, Eric W. Biederman, Kees Cook,
	mtk.manpages, linux-api, linux-man, linux-kernel
In-Reply-To: <3ccec8a13019b5e8ce7b1d7889677b778b070dc8.1416085112.git.josh@joshtriplett.org>

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
---
v2: Document requirement for no_new_privs.

(If this doesn't end up going into 3.18, the version number in the patch will
need updating.)

 man2/getgroups.2 | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/man2/getgroups.2 b/man2/getgroups.2
index 373c204..3f3d330 100644
--- a/man2/getgroups.2
+++ b/man2/getgroups.2
@@ -81,9 +81,11 @@ to be used in a further call to
 .PP
 .BR setgroups ()
 sets the supplementary group IDs for the calling process.
-Appropriate privileges (Linux: the
+As of Linux 3.18, any process that has enabled PR_SET_NO_NEW_PRIVS may drop
+supplementary groups, but may not add new groups. Adding groups, or making any
+change at all without no_new_privs enabled, requires the
 .B CAP_SETGID
-capability) are required.
+capability.
 The
 .I size
 argument specifies the number of supplementary group IDs
-- 
2.1.3

^ permalink raw reply related

* [PATCHv3 1/2] groups: Factor out a function to set a pre-sorted group list
From: Josh Triplett @ 2014-11-15 23:49 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski, Eric W. Biederman, Kees Cook,
	mtk.manpages, linux-api, linux-man, linux-kernel

This way, functions that already need to sort the group list need not do
so twice.

The new set_groups_sorted is intentionally not exported.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
---
v2, v3: No changes to patch 1/2.
 kernel/groups.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/groups.c b/kernel/groups.c
index 451698f..f0667e7 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -154,16 +154,26 @@ int groups_search(const struct group_info *group_info, kgid_t grp)
 }
 
 /**
+ * set_groups_sorted - Change a group subscription in a set of credentials
+ * @new: The newly prepared set of credentials to alter
+ * @group_info: The group list to install; must be sorted
+ */
+static void set_groups_sorted(struct cred *new, struct group_info *group_info)
+{
+	put_group_info(new->group_info);
+	get_group_info(group_info);
+	new->group_info = group_info;
+}
+
+/**
  * set_groups - Change a group subscription in a set of credentials
  * @new: The newly prepared set of credentials to alter
  * @group_info: The group list to install
  */
 void set_groups(struct cred *new, struct group_info *group_info)
 {
-	put_group_info(new->group_info);
 	groups_sort(group_info);
-	get_group_info(group_info);
-	new->group_info = group_info;
+	set_groups_sorted(new, group_info);
 }
 
 EXPORT_SYMBOL(set_groups);
-- 
2.1.3

^ permalink raw reply related

* [PATCHv3 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Josh Triplett @ 2014-11-15 23:50 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski, Eric W. Biederman, Kees Cook,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <3ccec8a13019b5e8ce7b1d7889677b778b070dc8.1416095211.git.josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>

Currently, unprivileged processes (without CAP_SETGID) cannot call
setgroups at all.  In particular, processes with a set of supplementary
groups cannot further drop permissions without obtaining elevated
permissions first.

Allow unprivileged processes to call setgroups with a subset of their
current groups; only require CAP_SETGID to add a group the process does
not currently have (either as a supplementary group, or as its gid,
egid, or sgid).

Since some privileged programs (such as sudo) allow tests for
non-membership in a group, require no_new_privs to drop group
privileges.

The kernel already maintains the list of supplementary group IDs in
sorted order, and setgroups already needs to sort the new list, so this
just requires a linear comparison of the two sorted lists.

This moves the CAP_SETGID test from setgroups into set_current_groups.

Tested via the following test program:

void run_id(void)
{
    pid_t p = fork();
    switch (p) {
        case -1:
            err(1, "fork");
        case 0:
            execl("/usr/bin/id", "id", NULL);
            err(1, "exec");
        default:
            if (waitpid(p, NULL, 0) < 0)
                err(1, "waitpid");
    }
}

int main(void)
{
    gid_t list1[] = { 1, 2, 3, 4, 5 };
    gid_t list2[] = { 2, 3, 4 };
    run_id();
    if (setgroups(5, list1) < 0)
        err(1, "setgroups 1");
    run_id();
    if (setresgid(1, 1, 1) < 0)
        err(1, "setresgid");
    if (setresuid(1, 1, 1) < 0)
        err(1, "setresuid");
    run_id();
    if (setgroups(3, list2) < 0)
        err(1, "setgroups 2");
    run_id();
    if (setgroups(5, list1) < 0)
        err(1, "setgroups 3");
    run_id();

    return 0;
}

Without this patch, the test program gets EPERM from the second
setgroups call, after dropping root privileges.  With this patch, the
test program successfully drops groups 1 and 5, but then gets EPERM from
the third setgroups call, since that call attempts to add groups the
process does not currently have.

Signed-off-by: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
---
v3: Allow gid, egid, or sgid.
v2: Require no_new_privs.

 kernel/groups.c | 42 +++++++++++++++++++++++++++++++++++++++---
 kernel/uid16.c  |  2 --
 2 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/kernel/groups.c b/kernel/groups.c
index f0667e7..5114155 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -153,6 +153,37 @@ int groups_search(const struct group_info *group_info, kgid_t grp)
 	return 0;
 }
 
+/* Return true if the group_info is a subset of the group_info of the specified
+ * credentials.  Also allow the first group_info to contain the gid, egid, or
+ * sgid of the credentials.
+ */
+static bool group_subset(const struct group_info *g1,
+			 const struct cred *cred2)
+{
+	const struct group_info *g2 = cred2->group_info;
+	unsigned int i, j;
+
+	for (i = 0, j = 0; i < g1->ngroups; i++) {
+		kgid_t gid1 = GROUP_AT(g1, i);
+		kgid_t gid2;
+		for (; j < g2->ngroups; j++) {
+			gid2 = GROUP_AT(g2, j);
+			if (gid_lte(gid1, gid2))
+				break;
+		}
+		if (j >= g2->ngroups || !gid_eq(gid1, gid2)) {
+			if (!gid_eq(gid1, cred2->gid)
+			    && !gid_eq(gid1, cred2->egid)
+			    && !gid_eq(gid1, cred2->sgid))
+				return false;
+		} else {
+			j++;
+		}
+	}
+
+	return true;
+}
+
 /**
  * set_groups_sorted - Change a group subscription in a set of credentials
  * @new: The newly prepared set of credentials to alter
@@ -189,11 +220,18 @@ int set_current_groups(struct group_info *group_info)
 {
 	struct cred *new;
 
+	groups_sort(group_info);
 	new = prepare_creds();
 	if (!new)
 		return -ENOMEM;
+	if (!(ns_capable(current_user_ns(), CAP_SETGID)
+	      || (task_no_new_privs(current)
+		  && group_subset(group_info, new)))) {
+		abort_creds(new);
+		return -EPERM;
+	}
 
-	set_groups(new, group_info);
+	set_groups_sorted(new, group_info);
 	return commit_creds(new);
 }
 
@@ -233,8 +271,6 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
 	struct group_info *group_info;
 	int retval;
 
-	if (!ns_capable(current_user_ns(), CAP_SETGID))
-		return -EPERM;
 	if ((unsigned)gidsetsize > NGROUPS_MAX)
 		return -EINVAL;
 
diff --git a/kernel/uid16.c b/kernel/uid16.c
index 602e5bb..b27e167 100644
--- a/kernel/uid16.c
+++ b/kernel/uid16.c
@@ -176,8 +176,6 @@ SYSCALL_DEFINE2(setgroups16, int, gidsetsize, old_gid_t __user *, grouplist)
 	struct group_info *group_info;
 	int retval;
 
-	if (!ns_capable(current_user_ns(), CAP_SETGID))
-		return -EPERM;
 	if ((unsigned)gidsetsize > NGROUPS_MAX)
 		return -EINVAL;
 
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH] getgroups.2: Document unprivileged setgroups calls
From: Josh Triplett @ 2014-11-15 23:50 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski, Eric W. Biederman, Kees Cook,
	mtk.manpages, linux-api, linux-man, linux-kernel
In-Reply-To: <3ccec8a13019b5e8ce7b1d7889677b778b070dc8.1416095211.git.josh@joshtriplett.org>

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
---
v3: Document use of gid/egid/sgid.
v2: Document requirement for no_new_privs.

(If this doesn't end up going into 3.18, the version number in the patch will
need updating.)

 man2/getgroups.2 | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/man2/getgroups.2 b/man2/getgroups.2
index 373c204..e2b834e 100644
--- a/man2/getgroups.2
+++ b/man2/getgroups.2
@@ -81,9 +81,16 @@ to be used in a further call to
 .PP
 .BR setgroups ()
 sets the supplementary group IDs for the calling process.
-Appropriate privileges (Linux: the
+A process with the
 .B CAP_SETGID
-capability) are required.
+capability may change its supplementary group IDs arbitrarily.
+As of Linux 3.18, any process that has enabled PR_SET_NO_NEW_PRIVS (see
+.BR prctl (2))
+may drop supplementary groups, or add any of the current real UID, the current
+effective UID, or the current saved set-user-ID; adding any other group ID
+requires the
+.B CAP_SETGID
+capability.
 The
 .I size
 argument specifies the number of supplementary group IDs
-- 
2.1.3

^ permalink raw reply related

* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Theodore Ts'o @ 2014-11-16  2:05 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Andy Lutomirski, Eric W. Biederman, Andrew Morton, Kees Cook,
	Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20141115202042.GA20900@thin>

On Sat, Nov 15, 2014 at 12:20:42PM -0800, Josh Triplett wrote:
> > However, sudoers seems to allow negative group matches.  So maybe
> > allowing this only with no_new_privs already set would make sense.
> 
> Sigh, bad sudo.  Sure, restricting this to no_new_privs only seems fine.
> I'll do that in v2, and document that in the manpage.

I've also seen use cases (generally back in the bad old days of big
timesharing VAX 750's :-) where the system admin might assign someone
to the "games-abusers" group, and then set /usr/games to mode 705
root:games-abusers --- presumably because it's easier to add a few
people to the deny list rather than having to add all of the EECS
department to the games group minus the abusers.

So arbitrarily anyone to drop groups from their supplemental group
list will result in a change from both existing practice and legacy
Unix systems, and it could potentially lead to a security exposure.

Cheers,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v2 00/10] crypto: AF_ALG: add AEAD and RNG support
From: Stephan Mueller @ 2014-11-16  2:23 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Daniel Borkmann, quentin.gouchet, LKML, linux-crypto, ABI/API

Hi,

This patch set adds AEAD and RNG support to the AF_ALG interface
exported by the kernel crypto API. By extending AF_ALG with AEAD and RNG
support, all cipher types the kernel crypto API allows access to are
now accessible from userspace.

The RNG support is stand-alone.

The AEAD implementation is added to algif_skcipher.c to prevent
re-implementation of the memory moving logic.

The extension for the AEAD support can be summarized with the following
types of changes:

        * select the correct crypto API functions (either the ablkcipher
          or the aead functions)

        * apply the additional data needed for AEAD at the right time
          (associated data, authentication tag) -- this includes the addition
          of user space interfaces to allow setting this data.

        * add the calculation for the memory size needed for encryption and
          decryption.

In addition, the patch set adds a getsockopt implementation to skcipher to
allow user space to inquire about properties of the ciphers (IV size,
block size, authentication data size). This extension would be needed for a
generic user space usage of these ciphers.

The new AEAD and RNG interfaces are fully tested with the test application
provided at [1]. That test application exercises all newly added user space
interfaces.

The patch set was tested on x86_64 and i386.

[1] http://www.chronox.de/libkcapi.html

Changes v2:
* rebase to current cryptodev-2.6 tree
* use memzero_explicit to zeroize AEAD associated data
* use sizeof for determining length of AEAD associated data
* update algif_rng.c covering all suggestions from Daniel Borkmann
  <dborkman@redhat.com>
* addition of patch 9: add digestsize interface for hashes
* addition of patch to update documentation covering the userspace interface
* change numbers of getsockopt options: separate them from sendmsg interface
  definitions

Stephan Mueller (10):
  crypto: AF_ALG: add user space interface for AEAD
  crypto: AF_ALG: user space interface for cipher info
  crypto: AF_ALG: extend data structuers for AEAD
  crypto: AF_ALG: crypto API calls to inline functions
  crypto: AF_ALG: add AEAD support
  crypto: AF_ALG: make setkey optional
  crypto: AF_ALG: add random number generator support
  crypto: AF_ALG: enable RNG interface compilation
  crypto: AF_ALG: user space interface for hash info
  crypto: AF_ALG: document the user space interface

 Documentation/crypto/crypto-API-userspace.txt |  95 ++++++-
 crypto/Kconfig                                |   9 +
 crypto/Makefile                               |   1 +
 crypto/af_alg.c                               |  20 ++
 crypto/algif_hash.c                           |  35 ++-
 crypto/algif_rng.c                            | 186 ++++++++++++++
 crypto/algif_skcipher.c                       | 352 +++++++++++++++++++++++---
 include/crypto/if_alg.h                       |   2 +
 include/uapi/linux/if_alg.h                   |  15 ++
 9 files changed, 683 insertions(+), 32 deletions(-)
 create mode 100644 crypto/algif_rng.c

-- 
2.1.0

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox