RFC: fsyscall

* RFC: fsyscall
@ 2015-09-08 22:35 Eric W. Biederman
  2015-09-08 22:55 ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2015-09-08 22:35 UTC
  To: Andy Lutomirski; +Cc: linux-kernel, Serge E. Hallyn

I was thinking a bit about the problem of allowing another process to
perform a subset of what your process can perform, and it occurred to me
there might be something conceptually simple we can do.

Have a system call fsyscall that takes a file descriptor, the system call
number, and the parameters to that system call as arguments.  AKA
long fsyscall(int fd, long number, ...); AKA syscall with a file
descriptor argument.

The fd would hold a struct cred, and a filter that limits what system
calls and which parameters may be passed.

The implementation of fsyscall would be something like:
	old = override_creds(f->f_cred);
	/* Perform filtered syscall */
	revert_creds(old);

Then we have another system call, call it fsyscall_create(...), that takes
a bpf filter and returns a file descriptor that can be used with
fsyscall.

I'm not certain that bpf is the best way to create such a filter but it
seems plausible, and we already have the infrastructure in place, so if
nothing else there would be synergy in syscall filtering.

My two concerns with bpf are (a) it seems a little complex for the
simplest use cases, and (b) I think there are cases, like inspecting the
data passed into write or send, or the structure passed into ioctl, that
it doesn't handle well yet.

Andy, does an fsyscall system call sound like something that would not
be too bad to implement?  (You have just been through all of the x86
system call paths recently.)

Eric

* Re: RFC: fsyscall
  2015-09-08 22:35 RFC: fsyscall Eric W. Biederman
@ 2015-09-08 22:55 ` Andy Lutomirski
  2015-09-08 23:07   ` Eric W. Biederman
  0 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2015-09-08 22:55 UTC
  To: Eric W. Biederman, David Drysdale
  Cc: linux-kernel@vger.kernel.org, Serge E. Hallyn

On Tue, Sep 8, 2015 at 3:35 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I was thinking a bit about the problem of allowing another process to
> perform a subset of what your process can perform, and it occurred to me
> there might be something conceptually simple we can do.
>
> Have a system call fsyscall that takes a file descriptor the system call
> number and the parameters to that system call as arguments.  AKA
> long fsyscall(int fd, long number, ...); AKA syscall with a file
> descriptor argument.
>
> The fd would hold a struct cred, and a filter that limits what system
> calls and which parameters may be passed.
>
> The implementation of fsyscall would be something like:
>         old = override_creds(f->f_cred);
>         /* Perform filtered syscall */
>         revert_creds(old);
>
> Then we have another system call call it fsyscall_create(...) that takes
> a bpf filter and returns a file descriptor, that can be used with
> fsyscall.
>
> I'm not certain that bpf is the best way to create such a filter but it
> seems plausible, and we already have the infrastructure in place, so if
> nothing else there would be synergy in syscall filtering.
>
> My two concerns with bpf are (a) it seems a little complex for the
> simplest use cases.  (b) I think there cases like inspecting the data
> passed into write, or send, or the structure passed into ioctl that it
> doesn't handle well yet.
>
> Andy does a fsyscall system call sound like something that would be not
> be too bad to implement?  (You have just been through all of the x86
> system call paths recently).

It's not possible yet due to nasty calling convention issues.
(Entries in the x86 syscall table aren't actually functions callable
using the C ABI right now.)  My pending monster patchset will make it
possible to implement for 32-bit syscalls (native and compat).  I'm
planning on addressing 64-bit, and I want to do almost the reverse of
what you're proposing: have a way that one task can trap into a
special mode in which another process can do syscalls on its behalf.

There are some syscalls for which this simply makes no sense.
Setresuid, capset, and similar come to mind.  Clone and friends may
screw up impressively if you try this.  fsyscall should not be allowed
to call itself.  If you call write(2) like this and it has any
meaningful effect, something's wrong.  keyctl(2) does really awful
things wrt struct cred, and I don't really want to think about what
happens if you try calling it like this.

override_creds is IMO awful.  Serge and I had an old discussion on how
to maybe fix it.

Honestly, I think the way to go might be to get Capsicum, or at least
Capsicum's fd model, merged and to add a mode in which the *at
operations on a specially marked fd use the passed fd's f_cred instead
of the caller's.  (Cc: David Drysdale -- that feature might be really
nice.)

--Andy

* Re: RFC: fsyscall
  2015-09-08 22:55 ` Andy Lutomirski
@ 2015-09-08 23:07   ` Eric W. Biederman
  2015-09-08 23:18     ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2015-09-08 23:07 UTC
  To: Andy Lutomirski
  Cc: David Drysdale, linux-kernel@vger.kernel.org, Serge E. Hallyn

Andy Lutomirski <luto@amacapital.net> writes:

> On Tue, Sep 8, 2015 at 3:35 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> I was thinking a bit about the problem of allowing another process to
>> perform a subset of what your process can perform, and it occurred to me
>> there might be something conceptually simple we can do.
>>
>> Have a system call fsyscall that takes a file descriptor the system call
>> number and the parameters to that system call as arguments.  AKA
>> long fsyscall(int fd, long number, ...); AKA syscall with a file
>> descriptor argument.
>>
>> The fd would hold a struct cred, and a filter that limits what system
>> calls and which parameters may be passed.
>>
>> The implementation of fsyscall would be something like:
>>         old = override_creds(f->f_cred);
>>         /* Perform filtered syscall */
>>         revert_creds(old);
>>
>> Then we have another system call call it fsyscall_create(...) that takes
>> a bpf filter and returns a file descriptor, that can be used with
>> fsyscall.
>>
>> I'm not certain that bpf is the best way to create such a filter but it
>> seems plausible, and we already have the infrastructure in place, so if
>> nothing else there would be synergy in syscall filtering.
>>
>> My two concerns with bpf are (a) it seems a little complex for the
>> simplest use cases.  (b) I think there cases like inspecting the data
>> passed into write, or send, or the structure passed into ioctl that it
>> doesn't handle well yet.
>>
>> Andy does a fsyscall system call sound like something that would be not
>> be too bad to implement?  (You have just been through all of the x86
>> system call paths recently).
>
> It's not possible yet due to nasty calling convention issues.
> (Entries in the x86 syscall table aren't actually functions callable
> using the C ABI right now.)  My pending monster patchset will make it
> possible to implement for 32-bit syscalls (native and compat).  I'm
> planning on addressing 64-bit, and I want to do almost the reverse of
> what you're proposing: have a way that one task can trap into a
> special mode in which another process can do syscalls on its behalf.

Hmm.  That seems comparatively dangerous to me.

> There are some syscalls for which this simply makes no sense.
> Setresuid, capset, and similar come to mind.  Clone and friends may
> screw up impressively if you try this.  fsyscall should not be allowed
> to call itself.  If you call write(2) like this and it has any
> meaningful effect, something's wrong.

If you peek into the data that is being written it can be meaningful on
write(2).

Hmm.  But yes for file descriptor based system calls this is much less
interesting.  Having some kind of wrapper that embeds one file
descriptor in another and does the filtering that way seems more
interesting, for the file descriptor based methods.

>   keyctl(2) does really awful
> things wrt struct cred, and I don't really want to think about what
> happens if you try calling it like this.
>
> override_creds is IMO awful.  Serge and I had an old discussion on how
> to maybe fix it.
>
> Honestly, I think the way to go might be to get Capsicum, or at least
> Capsicum's fd model, merged and to add a mode in which the *at
> operations on a specially marked fd use the passed fd's f_cred instead
> of the caller's.  (Cc: David Drysdale -- that feature might be really
> nice.)

Perhaps I had missed it but I don't recall capsicum being able to wrap
things like reboot(2).

Which really describes what I am trying to tackle: how do we create an
object that we can pass between processes that limits what we can do in
the case of the oddball syscalls that require special privileges?

At the same time I still want the caller to be able to pass in data to
the system calls being called such as REBOOT_CMD_POWER_OFF versus
REBOOT_CMD_HALT, while being able to filter it and say you may not pass
REBOOT_CMD_CAD_OFF.
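
To make the shape of this concrete, here is a rough userspace model, not
kernel code, of an fsyscall-style object: it bundles a credential with a
filter, checks the by-value arguments, and only then runs the call under
the stored credential.  Everything named model_* or MODEL_* is made up for
illustration, reboot(2) is reduced to a single cmd argument, and the CMD_*
values are copied from <linux/reboot.h> so the sketch builds anywhere.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Values mirror LINUX_REBOOT_CMD_* in <linux/reboot.h>. */
#define CMD_HALT      0xCDEF0123UL
#define CMD_POWER_OFF 0x4321FEDCUL
#define CMD_CAD_OFF   0x00000000UL

#define MODEL_NR_REBOOT 1L	/* hypothetical syscall number, model only */

struct model_cred {		/* stand-in for struct cred */
	int may_reboot;
};

typedef int (*model_filter_fn)(long nr, const unsigned long args[6]);

struct model_fsyscall_fd {	/* what the file descriptor would hold */
	const struct model_cred *cred;
	model_filter_fn filter;
};

static const struct model_cred model_unprivileged = { 0 };
static const struct model_cred *model_current_cred = &model_unprivileged;

/* Stand-in for reboot(2): succeeds only under a credential that may reboot. */
static long model_reboot(unsigned long cmd)
{
	(void)cmd;
	return model_current_cred->may_reboot ? 0 : -EPERM;
}

/* Filter: only reboot, and only HALT or POWER_OFF, judged on by-value args. */
static int model_allow_halt_or_poweroff(long nr, const unsigned long args[6])
{
	if (nr != MODEL_NR_REBOOT)
		return 0;
	return args[0] == CMD_HALT || args[0] == CMD_POWER_OFF;
}

/* The model dispatches only to model_reboot(); the filter rejects the rest. */
static long model_fsyscall(const struct model_fsyscall_fd *f, long nr,
			   const unsigned long args[6])
{
	const struct model_cred *old;
	long ret;

	assert(f != NULL && f->filter != NULL);
	if (!f->filter(nr, args))
		return -EPERM;		/* rejected before any credential switch */

	old = model_current_cred;	/* models override_creds(f->f_cred) */
	model_current_cred = f->cred;
	ret = model_reboot(args[0]);
	model_current_cred = old;	/* models revert_creds(old) */
	return ret;
}
```

With a privileged credential in the object, a caller holding it could halt
or power off, but a CAD_OFF request would be turned away by the filter
before the credential is ever switched.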

Eric

* Re: RFC: fsyscall
  2015-09-08 23:07   ` Eric W. Biederman
@ 2015-09-08 23:18     ` Andy Lutomirski
  2015-09-09  0:25       ` Eric W. Biederman
  0 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2015-09-08 23:18 UTC
  To: Eric W. Biederman
  Cc: David Drysdale, linux-kernel@vger.kernel.org, Serge E. Hallyn

On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
>
>> On Tue, Sep 8, 2015 at 3:35 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>
>>> I was thinking a bit about the problem of allowing another process to
>>> perform a subset of what your process can perform, and it occurred to me
>>> there might be something conceptually simple we can do.
>>>
>>> Have a system call fsyscall that takes a file descriptor the system call
>>> number and the parameters to that system call as arguments.  AKA
>>> long fsyscall(int fd, long number, ...); AKA syscall with a file
>>> descriptor argument.
>>>
>>> The fd would hold a struct cred, and a filter that limits what system
>>> calls and which parameters may be passed.
>>>
>>> The implementation of fsyscall would be something like:
>>>         old = override_creds(f->f_cred);
>>>         /* Perform filtered syscall */
>>>         revert_creds(old);
>>>
>>> Then we have another system call call it fsyscall_create(...) that takes
>>> a bpf filter and returns a file descriptor, that can be used with
>>> fsyscall.
>>>
>>> I'm not certain that bpf is the best way to create such a filter but it
>>> seems plausible, and we already have the infrastructure in place, so if
>>> nothing else there would be synergy in syscall filtering.
>>>
>>> My two concerns with bpf are (a) it seems a little complex for the
>>> simplest use cases.  (b) I think there cases like inspecting the data
>>> passed into write, or send, or the structure passed into ioctl that it
>>> doesn't handle well yet.
>>>
>>> Andy does a fsyscall system call sound like something that would be not
>>> be too bad to implement?  (You have just been through all of the x86
>>> system call paths recently).
>>
>> It's not possible yet due to nasty calling convention issues.
>> (Entries in the x86 syscall table aren't actually functions callable
>> using the C ABI right now.)  My pending monster patchset will make it
>> possible to implement for 32-bit syscalls (native and compat).  I'm
>> planning on addressing 64-bit, and I want to do almost the reverse of
>> what you're proposing: have a way that one task can trap into a
>> special mode in which another process can do syscalls on its behalf.
>
> Hmm.  That seems comparatively dangerous to me.
>
>> There are some syscalls for which this simply makes no sense.
>> Setresuid, capset, and similar come to mind.  Clone and friends may
>> screw up impressively if you try this.  fsyscall should not be allowed
>> to call itself.  If you call write(2) like this and it has any
>> meaningful effect, something's wrong.
>
> If you peek into the data that is being written it can be meaningful on
> write(2).
>
> Hmm.  But yes for file descriptor based system calls this is much less
> interesting.  Having some kind of wrapper that embeds one file
> descriptor in another and does the filtering that way seems more
> interesting, for the file descriptor based methods.
>
>>   keyctl(2) does really awful
>> things wrt struct cred, and I don't really want to think about what
>> happens if you try calling it like this.
>>
>> override_creds is IMO awful.  Serge and I had an old discussion on how
>> to maybe fix it.
>>
>> Honestly, I think the way to go might be to get Capsicum, or at least
>> Capsicum's fd model, merged and to add a mode in which the *at
>> operations on a specially marked fd use the passed fd's f_cred instead
>> of the caller's.  (Cc: David Drysdale -- that feature might be really
>> nice.)
>
> Perhaps I had missed it but I don't recall capsicum being able to wrap
> things like reboot(2).
>

Ah, so you want to be able to grant BPF-defined capabilities :)

Off the top of my head, I think that doing this using a nice IPC
mechanism (which barely exists in Linux, but which seL4 and binder (!)
can do very cleanly) would be simpler and more general, if less
self-contained.

(Aside: how on earth does anyone think that replacing binder with
kdbus makes any sense?  Binder can pass capabilities, and kdbus can't.
OTOH, maybe Android doesn't use the capability-passing ability.)

> Which really describes what I am trying to tackle.  How do we create an
> object that we can pass between processes that limits what we can do in
> the case of the oddball syscalls that require special privileges.
>
> At the same time I still want the caller to be able to pass in data to
> the system calls being called such as REBOOT_CMD_POWER_OFF versus
> REBOOT_CMD_HALT, while being able to filter it and say you may not pass
> REBOOT_CMD_CAD_OFF.
>

We could have a conservative whitelist of syscalls for which we allow
this usage.  I'm a bit worried that there will be very limited use
cases, given that a lot of use cases will want to follow pointers,
which has TOCTOU problems.

--Andy

* Re: RFC: fsyscall
  2015-09-08 23:18     ` Andy Lutomirski
@ 2015-09-09  0:25       ` Eric W. Biederman
  2015-09-09 17:27         ` David Drysdale
  2015-09-10 13:43         ` Serge E. Hallyn
  0 siblings, 2 replies; 14+ messages in thread
From: Eric W. Biederman @ 2015-09-09  0:25 UTC
  To: Andy Lutomirski
  Cc: David Drysdale, linux-kernel@vger.kernel.org, Serge E. Hallyn

Andy Lutomirski <luto@amacapital.net> writes:

> On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:

>> Perhaps I had missed it but I don't recall capsicum being able to wrap
>> things like reboot(2).
>>
>
> Ah, so you want to be able to grant BPF-defined capabilities :)

Pretty much.

Where I am focusing is turning POSIX capabilities into real
capabilities.  I would not mind if the functionality was a bit more
general, say to be able to handle things like security labels, or
anywhere else you might reasonably be asked "can you do X?"

But I would be happy if we just managed to wrap the POSIX capabilities
and turned them into real capabilities.

> Off the top of my head, I think that doing this using a nice IPC
> mechanism (which barely exists in Linux, but which seL4 and binder (!)
> can do very cleanly) would be simpler and more general, if less
> self-contained.

Less self-contained becomes a problem when you want to pass them between
processes written at different times by different people.  If there
is something conceptually simple we can implement in the kernel it
becomes worth it because that becomes the standard which everyone knows
to code to.

> (Aside: how on earth does anyone think that replacing binder with
> kdbus makes any sense?  Binder can pass capabilities, and kdbus can't.
> OTOH, maybe Android doesn't use the capability-passing ability.)

kdbus has file descriptor passing.  Beyond that no comment.

>> Which really describes what I am trying to tackle.  How do we create an
>> object that we can pass between processes that limits what we can do in
>> the case of the oddball syscalls that require special privileges.
>>
>> At the same time I still want the caller to be able to pass in data to
>> the system calls being called such as REBOOT_CMD_POWER_OFF versus
>> REBOOT_CMD_HALT, while being able to filter it and say you may not pass
>> REBOOT_CMD_CAD_OFF.
>>
>
> We could have a conservative whitelist of syscalls for which we allow
> this usage.  I'm a bit worried that there will be very limited use
> cases, given that a lot of use cases will want to follow pointers,
> which has TOCTOU problems.

Time of check to time of use problems.  Interesting point.

TOCTOU seems to make filtering of system calls in general much less
viable than I had hoped or imagined, and seems to be one of the better
arguments I have heard against ioctls.

I think the cases I care about are much less likely to have TOCTOU
problems than system calls in general, so I still may be ok.

However it does seem like past a certain point for good filtering the
entire syscall ABI needs to be turned into well defined IPC.  Ick!

Sigh.  I guess it is about time I dig up the places we call capable.
Ugh, 1696 places in the kernel.  Even filtering out CAP_SYS_ADMIN and
CAP_NET_ADMIN the list is longer than I can easily look at.

Still reboot isn't a problem ;)

Thinking about the TOCTOU problems with system call filtering, the only
general solution I can see is to handle it like the compat syscalls,
but instead of copying things into a temporary buffer in userspace
we copy the data into a temporary in-kernel buffer (and filter the
system call there):
	fs = get_fs();
	set_fs(get_ds());
	/* Call the system call */
	set_fs(fs);

I don't like the whole set_fs() thing (especially if there is any data
we did not manage to copy).  But it seems like a good conceptual start.
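
A minimal userspace sketch of that copy-first idea, assuming a hypothetical
request structure: the caller's buffer is snapshotted once, and both the
policy check and the operation see only the snapshot, so a later change to
the caller's memory cannot slip past the filter.  memcpy() stands in for
copy_from_user(), and all model_* and MODEL_* names are invented for
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_NAME_MAX 64	/* hypothetical bound for this sketch */

struct model_request {
	char name[MODEL_NAME_MAX];
};

/* Hypothetical policy: refuse any name beginning with '/'. */
static int model_policy_ok(const struct model_request *req)
{
	return req->name[0] != '/';
}

/* Stand-in for the real operation: reports the first byte it acted on. */
static int model_operation(const struct model_request *req)
{
	return (unsigned char)req->name[0];
}

/* Snapshot the caller's buffer once, then check and use only the snapshot. */
static int model_filtered_call(const struct model_request *caller_mem)
{
	struct model_request snap;

	assert(caller_mem != NULL);
	memcpy(&snap, caller_mem, sizeof(snap));	/* models copy_from_user() */
	snap.name[MODEL_NAME_MAX - 1] = '\0';		/* force termination */

	if (!model_policy_ok(&snap))
		return -EPERM;
	return model_operation(&snap);	/* never rereads caller_mem */
}
```

The cost is the one raised elsewhere in the thread: each filtered call needs
its own knowledge of the argument layout in order to take the snapshot.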

Eric

* Re: RFC: fsyscall
  2015-09-09  0:25       ` Eric W. Biederman
@ 2015-09-09 17:27         ` David Drysdale
  2015-09-09 19:33           ` Eric W. Biederman
  2015-09-10 13:28           ` Serge E. Hallyn
  2015-09-10 13:43         ` Serge E. Hallyn
  1 sibling, 2 replies; 14+ messages in thread
From: David Drysdale @ 2015-09-09 17:27 UTC
  To: Eric W. Biederman
  Cc: Andy Lutomirski, linux-kernel@vger.kernel.org, Serge E. Hallyn

On Wed, Sep 9, 2015 at 1:25 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
> >> things like reboot(2).
> >>
> >
> > Ah, so you want to be able to grant BPF-defined capabilities :)
>
> Pretty much.
>
> Where I am focusing is turning Posix capabilities into real
> capabilities.  I would not mind if the functionality was a bit more
> general.  Say to be able to handle things like security labels, or
> anywhere else you might reasonably be asked can you do X?
>
> But I would be happy if we just managed to wrap the Posix capabilities
> and turned them into real capabilities.

Interesting idea!  So kind of like the "object" in question is the root
role, and the different rights for the corresponding object-capability
(the file descriptor) are the POSIX capabilities (in the simple case
at least).

And yes, Capsicum doesn't generally interact with things like reboot(2);
its checks are on top of any DAC policies rather than instead of them,
as it's a hybrid rather than a pure object-capability system.

> > Off the top of my head, I think that doing this using a nice IPC
> > mechanism (which barely exists in Linux, but which seL4 and binder (!)
> > can do very cleanly) would be simpler and more general, if less
> > self-contained.
>
> Less self-contained becomes a problem when you want to pass them between
> processes written at different times between different people.  If there
> is something conceptually simple we can implement in the kernel it
> becomes worth it because that becomes the standard which everyone knows
> to code to.
>
> > (Aside: how on earth does anyone think that replacing binder with
> > kdbus makes any sense?  Binder can pass capabilities, and kdbus can't.
> > OTOH, maybe Android doesn't use the capability-passing ability.)
>
> kdbus has file descriptor passing.  Beyond that no comment.
>
> >> Which really describes what I am trying to tackle.  How do we create an
> >> object that we can pass between processes that limits what we can do in
> >> the case of the oddball syscalls that require special privileges.
> >>
> >> At the same time I still want the caller to be able to pass in data to
> >> the system calls being called such as REBOOT_CMD_POWER_OFF versus
> >> REBOOT_CMD_HALT, while being able to filter it and say you may not pass
> >> REBOOT_CMD_CAD_OFF.
> >>
> >
> > We could have a conservative whitelist of syscalls for which we allow
> > this usage.  I'm a bit worried that there will be very limited use
> > cases, given that a lot of use cases will want to follow pointers,
> > which has TOCTOU problems.
>
> Time of check to time of use problems.  Interesting point.
>
> TOCTOU seems to make filtering of system calls in general much less
> viable than I had hoped or imagined, and seems to be one of the better
> arguments I have heard against ioctls.

By the way, Robert Watson (one of the progenitors of Capsicum, as it
happens) has a nice paper about TOCTOU attacks on syscall interposition
layers that's a good read:
  http://www.watson.org/~robert/2007woot/

(From this perspective, the limitation that seccomp-bpf programs only
have access to syscall arguments by-value is actually a help -- the filter
can't look into user memory, so can't be fooled by having memory
contents changed underneath it.  Of course, if the eBPF stuff ever
changes that we should watch out...)
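
For comparison, a seccomp-bpf filter that judges reboot(2) purely on its
by-value cmd argument might look like the sketch below.  The BPF macros,
struct seccomp_data and the prctl() calls are the existing interfaces; the
names reboot_filter and install_reboot_filter, and the policy itself (allow
HALT and POWER_OFF, fail every other reboot command with EPERM), are only
assumptions for illustration.  The usual architecture check is left out for
brevity, and the 32-bit load of args[2] assumes a little-endian machine.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/reboot.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

static struct sock_filter reboot_filter[] = {
	/* [0] Load the syscall number. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* [1] Anything other than reboot(2) jumps straight to ALLOW at [6]. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_reboot, 0, 4),
	/* [2] Load the low 32 bits of args[2], the cmd argument. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, args[2])),
	/* [3], [4] The two permitted commands jump to ALLOW at [6]. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, LINUX_REBOOT_CMD_POWER_OFF, 2, 0),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, LINUX_REBOOT_CMD_HALT, 1, 0),
	/* [5] Any other reboot command fails with EPERM. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
	/* [6] Allow. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

#define REBOOT_FILTER_LEN (sizeof(reboot_filter) / sizeof(reboot_filter[0]))

/* Returns 0 on success, -1 with errno set on failure. */
static int install_reboot_filter(void)
{
	struct sock_fprog prog = {
		.len = (unsigned short)REBOOT_FILTER_LEN,
		.filter = reboot_filter,
	};

	assert(REBOOT_FILTER_LEN <= BPF_MAXINSNS);
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Because the filter only ever sees the register values, there is no window in
which another thread could change what was checked.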

> I think the cases I care about are much less likely to have TOCTOU
> problems than system calls in general, so I still may be ok.
>
> However it does seem like past a certain point for good filtering the
> entire syscall ABI needs to be turned into well defined IPC.  Ick!

That's roughly one of Robert's suggestions (section 8.2).

> Sigh.  I guess it is about time I dig up the places we call capable.
> Ugh 1696 places in the kernel..  Even filtering out CAP_SYS_ADMIN and
> CAP_NET_ADMIN the list is longer than I can easily look at.
>
> Still reboot isn't a problem ;)
>
> Thinking about the TOCTOU problems with system call filtering the only
> general solution I can see is to handle it like the compat syscalls
> but instead of copying things into a temporary buffer in userspace
> we copy the data into a temporary in-kernel buffer (filter the system call)
>         fs = get_fs();
>         set_fs(get_ds());
>         /* Call the system call */
>         set_fs(fs);
>
> I don't like the whole set_fs() thing (especially if there is any data
> we did not manage to copy).  But it seems like a good conceptual start.

Doing the copies sounds like it would involve understanding & reproducing
the memory layouts for every syscall pointer argument, which would be a
lot of code.  Or am I misunderstanding something?

* Re: RFC: fsyscall
  2015-09-09 17:27         ` David Drysdale
@ 2015-09-09 19:33           ` Eric W. Biederman
  2015-09-10 13:35             ` Serge E. Hallyn
  2015-09-10 13:28           ` Serge E. Hallyn
  1 sibling, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2015-09-09 19:33 UTC
  To: David Drysdale
  Cc: Andy Lutomirski, linux-kernel@vger.kernel.org, Serge E. Hallyn

David Drysdale <drysdale@google.com> writes:

> On Wed, Sep 9, 2015 at 1:25 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
>> >> things like reboot(2).
>> >>
>> >
>> > Ah, so you want to be able to grant BPF-defined capabilities :)
>>
>> Pretty much.
>>
>> Where I am focusing is turning Posix capabilities into real
>> capabilities.  I would not mind if the functionality was a bit more
>> general.  Say to be able to handle things like security labels, or
>> anywhere else you might reasonably be asked can you do X?
>>
>> But I would be happy if we just managed to wrap the Posix capabilities
>> and turned them into real capabilities.
>
> Interesting idea!  So kind of like the "object" in question is the root
> role, and the different rights for the corresponding object-capability
> (the file descriptor) are the POSIX capabilities (in the simple case
> at least).
>
> And yes, Capsicum doesn't generally interact with things like reboot(2);
> its checks are on top of any DAC policies rather than instead of them,
> as it's a hybrid rather than a pure object-capability system.
>
>> > Off the top of my head, I think that doing this using a nice IPC
>> > mechanism (which barely exists in Linux, but which seL4 and binder (!)
>> > can do very cleanly) would be simpler and more general, if less
>> > self-contained.
>>
>> Less self-contained becomes a problem when you want to pass them between
>> processes written at different times between different people.  If there
>> is something conceptually simple we can implement in the kernel it
>> becomes worth it because that becomes the standard which everyone knows
>> to code to.
>>
>> > (Aside: how on earth does anyone think that replacing binder with
>> > kdbus makes any sense?  Binder can pass capabilities, and kdbus can't.
>> > OTOH, maybe Android doesn't use the capability-passing ability.)
>>
>> kdbus has file descriptor passing.  Beyond that no comment.
>>
>> >> Which really describes what I am trying to tackle.  How do we create an
>> >> object that we can pass between processes that limits what we can do in
>> >> the case of the oddball syscalls that require special privileges.
>> >>
>> >> At the same time I still want the caller to be able to pass in data to
>> >> the system calls being called such as REBOOT_CMD_POWER_OFF versus
>> >> REBOOT_CMD_HALT, while being able to filter it and say you may not pass
>> >> REBOOT_CMD_CAD_OFF.
>> >>
>> >
>> > We could have a conservative whitelist of syscalls for which we allow
>> > this usage.  I'm a bit worried that there will be very limited use
>> > cases, given that a lot of use cases will want to follow pointers,
>> > which has TOCTOU problems.
>>
>> Time of check to time of use problems.  Interesting point.
>>
>> TOCTOU seems to make filtering of system calls in general much less
>> viable than I had hoped or imagined, and seems to be one of the better
>> arguments I have heard against ioctls.
>
> By the way, Robert Watson (one of the progenitors of Capsicum, as it
> happens) has a nice paper about TOCTOU attacks on syscall interposition
> layers that's a good read:
>   http://www.watson.org/~robert/2007woot/
>
> (From this perspective, the limitation that seccomp-bpf programs only
> have access to syscall arguments by-value is actually a help -- the filter
> can't look into user memory, so can't be fooled by having memory
> contents changed underneath it.  Of course, if the eBPF stuff ever
> changes that we should watch out...)
>
>> I think the cases I care about are much less likely to have TOCTOU
>> problems than system calls in general, so I still may be ok.
>>
>> However it does seem like past a certain point for good filtering the
>> entire syscall ABI needs to be turned into well defined IPC.  Ick!
>
> That's roughly one of Robert's suggestions (section 8.2).
>
>> Sigh.  I guess it is about time I dig up the places we call capable.
>> Ugh 1696 places in the kernel..  Even filtering out CAP_SYS_ADMIN and
>> CAP_NET_ADMIN the list is longer than I can easily look at.
>>
>> Still reboot isn't a problem ;)
>>
>> Thinking about the TOCTOU problems with system call filtering the only
>> general solution I can see is to handle it like the compat syscalls
>> but instead of copying things into a temporary buffer in userspace
>> we copy the data into a temporary in-kernel buffer (filter the system call)
>>         fs = get_fs();
>>         set_fs(get_ds());
>>         /* Call the system call */
>>         set_fs(fs);
>>
>> I don't like the whole set_fs() thing (especially if there is any data
>> we did not manage to copy).  But it seems like a good conceptual start.
>
> Doing the copies sounds like it would involve understanding & reproducing
> the memory layouts for every syscall pointer argument, which would be a
> lot of code.  Or am I misunderstanding something?

Which is what we have for ioctls and some of the system calls in the
compat case.  So it is something that has been done before.  However I
am going to leave the TOCTOU mess to another time.

If I assume that anything file descriptor based will need a different
mechanism to filter what is allowed on a file descriptor (Capsicum
perhaps?), that handily reduces the problem space, and removes almost all
cases where reading data from userspace is interesting, as I am talking
about pure system calls.

The system calls which are not file descriptor based are listed
below.  Most of those don't take weird parameter structures that would
be interesting to filter.  So I think my fsyscall idea is conceptually
reasonable.  It is not a complete solution for passing someone a well
defined subset of what you are allowed to do, but it is interesting.

Eric

open
stat
lstat
mprotect
munmap
brk
rt_sigaction
rt_sigprocmask
rt_sigreturn
access
pipe
sched_yield
mremap
msync
mincore
madvise
shmget
shmat
shmctl
pause
nanosleep
getitimer
alarm
setitimer
getpid
socket
socketpair
clone
fork
vfork
execve
exit
wait4
kill
uname
semget
semop
semctl
shmdt
msgget
msgsnd
msgrcv
msgctl
truncate
getcwd
chdir
rename
mkdir
rmdir
creat
link
unlink
symlink
readlink
chmod
chown
lchown
umask
gettimeofday
getrlimit
getrusage
sysinfo
times
ptrace
getuid
syslog
getgid
setuid
setgid
geteuid
getegid
setpgid
getppid
getpgrp
setsid
setreuid
setregid
getgroups
setgroups
setresuid
getresuid
setresgid
getresgid
getpgid
setfsuid
setfsgid
getsid
capget
capset
rt_sigpending
rt_sigtimedwait
rt_sigqueueinfo
rt_sigsuspend
sigaltstack
utime
mknod
uselib
personality
ustat
statfs
sysfs
getpriority
setpriority
sched_setparam
sched_getparam
sched_setscheduler
sched_getscheduler
sched_get_priority_max
sched_get_priority_min
sched_rr_get_interval
mlock
munlock
mlockall
munlockall
vhangup
modify_ldt
pivot_root
_sysctl
prctl
arch_prctl
adjtimex
setrlimit
chroot
sync
acct
settimeofday
mount
umount2
swapon
swapoff
reboot
sethostname
setdomainname
iopl
ioperm
create_module
init_module
delete_module
get_kernel_syms
query_module
quotactl
nfsservctl
gettid
setxattr
lsetxattr
getxattr
lgetxattr
listxattr
llistxattr
removexattr
lremovexattr
tkill
time
futex
sched_setaffinity
sched_getaffinity
set_thread_area
io_setup
io_destroy
io_getevents
io_submit
io_cancel
get_thread_area
lookup_dcookie
epoll_create
epoll_ctl_old
epoll_wait_old
remap_file_pages
set_tid_address
restart_syscall
semtimedop
timer_create
timer_settime
timer_gettime
timer_getoverrun
timer_delete
clock_settime
clock_gettime
clock_getres
clock_nanosleep
exit_group
epoll_wait
epoll_ctl
tgkill
utimes
vserver
mbind
set_mempolicy
get_mempolicy
mq_open
mq_unlink
mq_timedsend
mq_timedreceive
mq_notify
mq_getsetattr
kexec_load
waitid
add_key
request_key
keyctl
ioprio_set
ioprio_get
inotify_init
inotify_add_watch
inotify_rm_watch
migrate_pages
unshare
set_robust_list
get_robust_list
splice
tee
sync_file_range
vmsplice
move_pages
utimensat
epoll_pwait
signalfd
timerfd_create
eventfd
fallocate
signalfd4
eventfd2
epoll_create1
pipe2
inotify_init1
rt_tgsigqueueinfo
perf_event_open
fanotify_init
prlimit64
clock_adjtime
getcpu
process_vm_readv
process_vm_writev
kcmp
sched_setattr
sched_getattr
seccomp
getrandom
memfd_create
bpf


* Re: RFC: fsyscall
  2015-09-09 19:33           ` Eric W. Biederman
@ 2015-09-10 13:35             ` Serge E. Hallyn
  0 siblings, 0 replies; 14+ messages in thread
From: Serge E. Hallyn @ 2015-09-10 13:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Drysdale, Andy Lutomirski, linux-kernel@vger.kernel.org,
	Serge E. Hallyn

On Wed, Sep 09, 2015 at 02:33:14PM -0500, Eric W. Biederman wrote:

...

> If I assume that anything file descriptor based will need another
> mechanism to filter what is allowed on a file descriptor, and as such
> will need a different mechanism (capsicum perhaps?).  That handily
> reduces the problem space, and removes almost all cases where reading
> data from userspace is interesting as I am talking about pure system calls.
> 
> The list of system calls which are not file descriptor based are listed
> below.  Most of those don't take weird parameter structures that would
> be interesting to filter.  So I think my fsyscall idea is conceptually
> reasonable.   It is not a complete solution for passing someone a well
> defined subset you are allowed to do but it is interesting.

...

> creat

Taking this as a specific example, I'm somewhat fond of the idea of
saying that we can support openat() as fd-based (let's say
capsicum-based as we know that can work), and therefore we don't need
open() or creat().  If you're designing an app so that you can fork a
task with a subset of your capabilities, then you're writing it now
anyway, so there is no reason for supporting open and creat.  Since
these are specifically very subject to TOCTTOU, saying "you must use
openat()" seems ok.
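
A minimal sketch of what "you must use openat()" could look like from
userspace, assuming the grantee is handed a directory fd and is only
ever allowed to name entries directly inside it.  The helper name
open_in_dir() and the single-component policy are hypothetical and
only for illustration; nothing here is the capsicum interface.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Open `name` relative to an already-open directory fd.
 *
 * Hypothetical policy for this sketch: only a single path component is
 * accepted ("", ".", ".." and anything containing '/' are refused), and
 * a symlink as the final component is refused via O_NOFOLLOW.  The
 * lookup starts at the directory the fd refers to, so renaming the
 * directory's path afterwards does not change what gets opened.
 */
static int open_in_dir(int dirfd, const char *name, int flags, mode_t mode)
{
	assert(name != NULL);

	if (name[0] == '\0' || strchr(name, '/') != NULL ||
	    strcmp(name, ".") == 0 || strcmp(name, "..") == 0) {
		errno = EINVAL;
		return -1;
	}
	return openat(dirfd, name, flags | O_NOFOLLOW | O_CLOEXEC, mode);
}
```

A caller would open the directory once, e.g.
open(dir, O_RDONLY | O_DIRECTORY), and afterwards go through
open_in_dir(dirfd, "data", O_RDONLY, 0) instead of open() or creat()
with a full path.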

-serge


* Re: RFC: fsyscall
  2015-09-09 17:27         ` David Drysdale
  2015-09-09 19:33           ` Eric W. Biederman
@ 2015-09-10 13:28           ` Serge E. Hallyn
  1 sibling, 0 replies; 14+ messages in thread
From: Serge E. Hallyn @ 2015-09-10 13:28 UTC (permalink / raw)
  To: David Drysdale
  Cc: Eric W. Biederman, Andy Lutomirski, linux-kernel@vger.kernel.org,
	Serge E. Hallyn

On Wed, Sep 09, 2015 at 06:27:06PM +0100, David Drysdale wrote:
> On Wed, Sep 9, 2015 at 1:25 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > Andy Lutomirski <luto@amacapital.net> writes:
> > > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> (From this perspective, the limitation that seccomp-bpf programs only
> have access to syscall arguments by-value is actually a help -- the filter
> can't look into user memory, so can't be fooled by having memory
> contents changed underneath it.  Of course, if the eBPF stuff ever
> changes that we should watch out...)

Yup and I'm quite sure I've seen that raised as a reason to refuse
supporting exactly that.
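
To illustrate the by-value point, here is a small sketch of a classic
seccomp-bpf filter that looks only at the syscall number and at one
argument as the kernel received it.  The choice of kill() and SIGKILL,
the EACCES result, and the helper names deny_sigkill() and
check_in_child() are assumptions made for the example.  It covers only
x86_64 and aarch64 (little-endian, so the low 32 bits of args[1] sit at
its start) and does not treat the x32 ABI specially.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Install a filter that makes kill(pid, SIGKILL) fail with EACCES. */
static int deny_sigkill(void)
{
	struct sock_filter insns[] = {
		/* Refuse to interpret syscall numbers of a foreign ABI. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, arch)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_AUDIT_ARCH, 1, 0),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
		/* Anything that is not kill() is allowed. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_kill, 0, 3),
		/* The signal number is visible by value; no user memory is read. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, args[1])),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SIGKILL, 0, 1),
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};

	assert(prog.len <= BPF_MAXINSNS);
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/*
 * Exercise the filter in a child, since a filter cannot be removed
 * once installed.  Returns 0 if signal 0 is still permitted and
 * SIGKILL is refused with EACCES, a small positive code otherwise,
 * and -1 if the child could not be run.
 */
static int check_in_child(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		if (deny_sigkill() < 0)
			_exit(2);
		if (kill(getpid(), 0) != 0)
			_exit(3);
		if (kill(getpid(), SIGKILL) != -1 || errno != EACCES)
			_exit(4);
		_exit(0);
	}
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

A pointer argument would show up in args[] just as a number, which is
why such a filter cannot be fooled by memory changing underneath it,
and also why it cannot say anything about what the pointer refers to.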


* Re: RFC: fsyscall
  2015-09-09  0:25       ` Eric W. Biederman
  2015-09-09 17:27         ` David Drysdale
@ 2015-09-10 13:43         ` Serge E. Hallyn
  2015-09-10 13:51           ` David Drysdale
  1 sibling, 1 reply; 14+ messages in thread
From: Serge E. Hallyn @ 2015-09-10 13:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, David Drysdale, linux-kernel@vger.kernel.org,
	Serge E. Hallyn

On Tue, Sep 08, 2015 at 07:25:17PM -0500, Eric W. Biederman wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
> 
> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> 
> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
> >> things like reboot(2).
> >>
> >
> > Ah, so you want to be able to grant BPF-defined capabilities :)
> 
> Pretty much.
> 
> Where I am focusing is turning Posix capabilities into real
> capabilities.  I would not mind if the functionality was a bit more
> general.  Say to be able to handle things like security labels, or
> anywhere else you might reasonably be asked can you do X?
> 
> But I would be happy if we just managed to wrap the Posix capabilities
> and turned them into real capabilities.

If there were a clever way to exec an open fd, then you could do this
by passing an fd to a copy of /bin/reboot which has fP=CAP_SYS_BOOT,
or preferably fI=CAP_SYS_BOOT,fE=1 and leave pI=CAP_SYS_BOOT in the
task.

A cleaner way to do this is to have a service which can reboot, which
looks at unix socket peercreds to determine whether the granter may
reboot, then passes it an fd which the granter may pass to a grantee.
Then the grantee passes the fd to the service, which recognizes it and
reboots.
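
A rough sketch of the plumbing such a service would need, assuming a
unix stream socket, SO_PEERCRED for the credential check and SCM_RIGHTS
for handing the fd around.  The token here is simply the read end of a
pipe that the service recognizes by device and inode number.  The
"same uid" policy and all function names are made up for illustration,
and the walk-through runs in a single process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Pass one descriptor over a unix socket as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
			      .msg_control = u.buf,
			      .msg_controllen = sizeof(u.buf) };
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new fd, or -1 on failure. */
static int recv_fd(int sock)
{
	char byte;
	int fd;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
			      .msg_control = u.buf,
			      .msg_controllen = sizeof(u.buf) };
	struct cmsghdr *cmsg;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Hypothetical policy: grant only to a peer with the service's own uid. */
static int peer_may_reboot(int sock)
{
	struct ucred cred;
	socklen_t len = sizeof(cred);

	if (getsockopt(sock, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
		return 0;
	return cred.uid == geteuid();
}

/* The service recognizes its token by device and inode number. */
static int is_token(int token, int fd)
{
	struct stat a, b;

	if (fstat(token, &a) < 0 || fstat(fd, &b) < 0)
		return 0;
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/* Grant the token, present it back; 1 means "service would reboot". */
static int token_round_trip(void)
{
	int sv[2], tok[2], held = -1, presented = -1, ok = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	if (pipe(tok) < 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}
	/* sv[0] is the service end, sv[1] the granter/grantee end. */
	if (peer_may_reboot(sv[0]) && send_fd(sv[0], tok[0]) == 0)
		held = recv_fd(sv[1]);
	if (held >= 0 && send_fd(sv[1], held) == 0)
		presented = recv_fd(sv[0]);
	if (presented >= 0)
		ok = is_token(tok[0], presented);
	if (held >= 0)
		close(held);
	if (presented >= 0)
		close(presented);
	close(tok[0]); close(tok[1]); close(sv[0]); close(sv[1]);
	return ok;
}
```

In a real setup the granter would forward the received fd to the
grantee over another unix socket with the same send_fd()/recv_fd()
pair, and the service would call reboot(2) where this sketch only
returns 1.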

-serge


* Re: RFC: fsyscall
  2015-09-10 13:43         ` Serge E. Hallyn
@ 2015-09-10 13:51           ` David Drysdale
  2015-09-10 14:01             ` Serge E. Hallyn
  0 siblings, 1 reply; 14+ messages in thread
From: David Drysdale @ 2015-09-10 13:51 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Eric W. Biederman, Andy Lutomirski, linux-kernel@vger.kernel.org

On Thu, Sep 10, 2015 at 2:43 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> On Tue, Sep 08, 2015 at 07:25:17PM -0500, Eric W. Biederman wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>>
>> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
>> >> things like reboot(2).
>> >>
>> >
>> > Ah, so you want to be able to grant BPF-defined capabilities :)
>>
>> Pretty much.
>>
>> Where I am focusing is turning Posix capabilities into real
>> capabilities.  I would not mind if the functionality was a bit more
>> general.  Say to be able to handle things like security labels, or
>> anywhere else you might reasonably be asked can you do X?
>>
>> But I would be happy if we just managed to wrap the Posix capabilities
>> and turned them into real capabilities.
>
> If there were a clever way to exec an open fd, then you could do this

execveat(fd, "", argv, envp, AT_EMPTY_PATH) ?

> by passing an fd to a copy of /bin/reboot which has fP=CAP_SYS_BOOT,
> or preferably fI=CAP_SYS_BOOT,fE=1 and leave pI=CAP_SYS_BOOT in the
> task.
>
> A cleaner way to do this is to have a service which can reboot, which
> looks at unix socket peercreds to determine whether the granter may
> reboot, then passes it an fd which the granter may pass to a grantee.
> Then the grantee passes the fd to the service, which recognizes it and
> reboots.
>
> -serge


* Re: RFC: fsyscall
  2015-09-10 13:51           ` David Drysdale
@ 2015-09-10 14:01             ` Serge E. Hallyn
  2015-09-10 14:03               ` Serge E. Hallyn
  0 siblings, 1 reply; 14+ messages in thread
From: Serge E. Hallyn @ 2015-09-10 14:01 UTC (permalink / raw)
  To: David Drysdale
  Cc: Serge E. Hallyn, Eric W. Biederman, Andy Lutomirski,
	linux-kernel@vger.kernel.org

On Thu, Sep 10, 2015 at 02:51:28PM +0100, David Drysdale wrote:
> On Thu, Sep 10, 2015 at 2:43 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > On Tue, Sep 08, 2015 at 07:25:17PM -0500, Eric W. Biederman wrote:
> >> Andy Lutomirski <luto@amacapital.net> writes:
> >>
> >> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >>
> >> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
> >> >> things like reboot(2).
> >> >>
> >> >
> >> > Ah, so you want to be able to grant BPF-defined capabilities :)
> >>
> >> Pretty much.
> >>
> >> Where I am focusing is turning Posix capabilities into real
> >> capabilities.  I would not mind if the functionality was a bit more
> >> general.  Say to be able to handle things like security labels, or
> >> anywhere else you might reasonably be asked can you do X?
> >>
> >> But I would be happy if we just managed to wrap the Posix capabilities
> >> and turned them into real capabilities.
> >
> > If there were a clever way to exec an open fd, then you could do this
> 
> execveat(fd, "", argv, envp, AT_EMPTY_PATH) ?

???  I looked for it but I don't have a manpage for it.  I see it at
man7.org though.  Thanks :)

> > by passing an fd to a copy of /bin/reboot which has fP=CAP_SYS_BOOT,
> > or preferably fI=CAP_SYS_BOOT,fE=1 and leave pI=CAP_SYS_BOOT in the
> > task.
> >
> > A cleaner way to do this is to have a service which can reboot, which
> > looks at unix socket peercreds to determine whether the granter may
> > reboot, then passes it an fd which the granter may pass to a grantee.
> > Then the grantee passes the fd to the service, which recognizes it and
> > reboots.
> >
> > -serge


* Re: RFC: fsyscall
  2015-09-10 14:01             ` Serge E. Hallyn
@ 2015-09-10 14:03               ` Serge E. Hallyn
  2015-09-10 14:04                 ` Serge E. Hallyn
  0 siblings, 1 reply; 14+ messages in thread
From: Serge E. Hallyn @ 2015-09-10 14:03 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: David Drysdale, Eric W. Biederman, Andy Lutomirski,
	linux-kernel@vger.kernel.org

On Thu, Sep 10, 2015 at 09:01:20AM -0500, Serge E. Hallyn wrote:
> On Thu, Sep 10, 2015 at 02:51:28PM +0100, David Drysdale wrote:
> > On Thu, Sep 10, 2015 at 2:43 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > > On Tue, Sep 08, 2015 at 07:25:17PM -0500, Eric W. Biederman wrote:
> > >> Andy Lutomirski <luto@amacapital.net> writes:
> > >>
> > >> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > >>
> > >> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
> > >> >> things like reboot(2).
> > >> >>
> > >> >
> > >> > Ah, so you want to be able to grant BPF-defined capabilities :)
> > >>
> > >> Pretty much.
> > >>
> > >> Where I am focusing is turning Posix capabilities into real
> > >> capabilities.  I would not mind if the functionality was a bit more
> > >> general.  Say to be able to handle things like security labels, or
> > >> anywhere else you might reasonably be asked can you do X?
> > >>
> > >> But I would be happy if we just managed to wrap the Posix capabilities
> > >> and turned them into real capabilities.
> > >
> > > If there were a clever way to exec an open fd, then you could do this
> > 
> > execveat(fd, "", argv, envp, AT_EMPTY_PATH) ?
> 
> ???  I looked for it but I don't have a manpage for it.  I see it at
> man7.org though.  Thanks :)

But this isn't quite what I was suggesting.  I was suggesting a call which
would exec the fd, not exec a file inside the dirfd.


* Re: RFC: fsyscall
  2015-09-10 14:03               ` Serge E. Hallyn
@ 2015-09-10 14:04                 ` Serge E. Hallyn
  0 siblings, 0 replies; 14+ messages in thread
From: Serge E. Hallyn @ 2015-09-10 14:04 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: David Drysdale, Eric W. Biederman, Andy Lutomirski,
	linux-kernel@vger.kernel.org

On Thu, Sep 10, 2015 at 09:03:08AM -0500, Serge E. Hallyn wrote:
> On Thu, Sep 10, 2015 at 09:01:20AM -0500, Serge E. Hallyn wrote:
> > On Thu, Sep 10, 2015 at 02:51:28PM +0100, David Drysdale wrote:
> > > On Thu, Sep 10, 2015 at 2:43 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > > > On Tue, Sep 08, 2015 at 07:25:17PM -0500, Eric W. Biederman wrote:
> > > >> Andy Lutomirski <luto@amacapital.net> writes:
> > > >>
> > > >> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > >>
> > > >> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
> > > >> >> things like reboot(2).
> > > >> >>
> > > >> >
> > > >> > Ah, so you want to be able to grant BPF-defined capabilities :)
> > > >>
> > > >> Pretty much.
> > > >>
> > > >> Where I am focusing is turning Posix capabilities into real
> > > >> capabilities.  I would not mind if the functionality was a bit more
> > > >> general.  Say to be able to handle things like security labels, or
> > > >> anywhere else you might reasonably be asked can you do X?
> > > >>
> > > >> But I would be happy if we just managed to wrap the Posix capabilities
> > > >> and turned them into real capabilities.
> > > >
> > > > If there were a clever way to exec an open fd, then you could do this
> > > 
> > > execveat(fd, "", argv, envp, AT_EMPTY_PATH) ?
> > 
> > ???  I looked for it but I don't have a manpage for it.  I see it at
> > man7.org though.  Thanks :)
> 
> But this isn't quite what I was suggesting.  I was suggesting a call which
> would exec the fd, not exec a file inside the dirfd.

Oh, which it does:

    If pathname is an empty string and the AT_EMPTY_PATH flag is
    specified, then the file descriptor dirfd specifies the file to be
    executed (i.e., dirfd refers to an executable file, rather than a
    directory).

well that's spiffy
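
For reference, a small sketch of exec'ing an open fd that way.  It
goes through syscall(2) with SYS_execveat rather than assuming a libc
wrapper is available; exec_fd() and run_via_fd() are names invented
for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Execute the file that `fd` refers to.  Returns only on failure. */
static int exec_fd(int fd, char *const argv[], char *const envp[])
{
	assert(argv != NULL && argv[0] != NULL);
	/* Empty pathname + AT_EMPTY_PATH: fd itself names the executable. */
	return (int)syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
}

/*
 * Open `path`, fork, and exec the open fd in the child.
 * Returns the child's exit status, or -1 if it could not be run.
 */
static int run_via_fd(const char *path, char *const argv[])
{
	char *const envp[] = { NULL };	/* empty environment for the sketch */
	int status;
	int fd = open(path, O_RDONLY | O_CLOEXEC);
	pid_t pid;

	if (fd < 0)
		return -1;
	pid = fork();
	if (pid == 0) {
		exec_fd(fd, argv, envp);
		_exit(127);		/* only reached if the exec failed */
	}
	close(fd);
	if (pid < 0 || waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

For the reboot case the fd would be received from the granter rather
than opened by path as done here.  One caveat from the man page as I
read it: if the fd refers to an interpreter script, a close-on-exec fd
will not work, because the interpreter has to reopen the file through
the fd.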



Thread overview: 14+ messages
2015-09-08 22:35 RFC: fsyscall Eric W. Biederman
2015-09-08 22:55 ` Andy Lutomirski
2015-09-08 23:07   ` Eric W. Biederman
2015-09-08 23:18     ` Andy Lutomirski
2015-09-09  0:25       ` Eric W. Biederman
2015-09-09 17:27         ` David Drysdale
2015-09-09 19:33           ` Eric W. Biederman
2015-09-10 13:35             ` Serge E. Hallyn
2015-09-10 13:28           ` Serge E. Hallyn
2015-09-10 13:43         ` Serge E. Hallyn
2015-09-10 13:51           ` David Drysdale
2015-09-10 14:01             ` Serge E. Hallyn
2015-09-10 14:03               ` Serge E. Hallyn
2015-09-10 14:04                 ` Serge E. Hallyn
