* RFC: fsyscall
@ 2015-09-08 22:35 Eric W. Biederman
2015-09-08 22:55 ` Andy Lutomirski
0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2015-09-08 22:35 UTC (permalink / raw)
To: Andy Lutomirski; +Cc: linux-kernel, Serge E. Hallyn
I was thinking a bit about the problem of allowing another process to
perform a subset of what your process can perform, and it occurred to me
there might be something conceptually simple we can do.
Have a system call fsyscall that takes a file descriptor, the system call
number, and the parameters to that system call as arguments. AKA
long fsyscall(int fd, long number, ...); AKA syscall with a file
descriptor argument.
The fd would hold a struct cred, and a filter that limits what system
calls and which parameters may be passed.
The implementation of fsyscall would be something like:
    old = override_creds(f->f_cred);
    /* Perform filtered syscall */
    revert_creds(old);
Then we have another system call, call it fsyscall_create(...), that takes
a bpf filter and returns a file descriptor that can be used with
fsyscall.
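The shape of the proposal (an fd that carries credentials plus a filter, and a dispatch call that checks the filter before running the syscall under those credentials) can be modelled in plain userspace C. This is only a sketch under stated assumptions: every `model_*` name is hypothetical, the filter is an ordinary C callback rather than a BPF program, and the credential switch just swaps a pointer where the kernel would use override_creds()/revert_creds().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NARGS 6

/* Stand-in for struct cred; every name in this sketch is hypothetical. */
struct model_cred {
    unsigned int uid;
};

typedef bool (*model_filter_fn)(long nr, const long args[MODEL_NARGS]);
typedef long (*model_syscall_fn)(const long args[MODEL_NARGS]);

/* What the fd would carry: captured credentials plus a filter. */
struct model_fsyscall_fd {
    struct model_cred cred;
    model_filter_fn filter;
};

/* Credentials of the "current task" in this model. */
const struct model_cred *model_current_cred;

const struct model_cred *model_override_creds(const struct model_cred *new_cred)
{
    const struct model_cred *old = model_current_cred;

    model_current_cred = new_cred;
    return old;
}

void model_revert_creds(const struct model_cred *old)
{
    model_current_cred = old;
}

/* Check the filter first, then run the call under the fd's credentials. */
long model_fsyscall(const struct model_fsyscall_fd *f, long nr,
                    const long args[MODEL_NARGS],
                    const model_syscall_fn *table, size_t table_len)
{
    const struct model_cred *old;
    long ret;

    if (nr < 0 || (size_t)nr >= table_len || !table[nr])
        return -ENOSYS;
    if (!f->filter(nr, args))
        return -EPERM;

    old = model_override_creds(&f->cred);
    ret = table[nr](args);
    model_revert_creds(old);
    return ret;
}

/* Demo syscall: reports the uid it is running as. */
long model_sys_getuid(const long args[MODEL_NARGS])
{
    (void)args;
    return (long)model_current_cred->uid;
}

/* Demo filter: only syscall number 0 is allowed. */
bool model_allow_only_nr0(long nr, const long args[MODEL_NARGS])
{
    (void)args;
    return nr == 0;
}
```

A caller with uid 1000 holding a handle whose captured uid is 42 would see model_sys_getuid return 42 through model_fsyscall, and -EPERM for any number the filter rejects. Running the filter before the credential switch is a choice made for this sketch, not something the proposal specifies.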
I'm not certain that bpf is the best way to create such a filter but it
seems plausible, and we already have the infrastructure in place, so if
nothing else there would be synergy in syscall filtering.
My two concerns with bpf are (a) it seems a little complex for the
simplest use cases, and (b) I think there are cases like inspecting the
data passed into write, or send, or the structure passed into ioctl that
it doesn't handle well yet.
Andy, does a fsyscall system call sound like something that would not
be too bad to implement? (You have just been through all of the x86
system call paths recently.)
Eric
* Re: RFC: fsyscall
2015-09-08 22:35 RFC: fsyscall Eric W. Biederman
@ 2015-09-08 22:55 ` Andy Lutomirski
2015-09-08 23:07 ` Eric W. Biederman
0 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2015-09-08 22:55 UTC (permalink / raw)
To: Eric W. Biederman, David Drysdale
Cc: linux-kernel@vger.kernel.org, Serge E. Hallyn
On Tue, Sep 8, 2015 at 3:35 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Andy, does a fsyscall system call sound like something that would not
> be too bad to implement? (You have just been through all of the x86
> system call paths recently.)
It's not possible yet due to nasty calling convention issues.
(Entries in the x86 syscall table aren't actually functions callable
using the C ABI right now.) My pending monster patchset will make it
possible to implement for 32-bit syscalls (native and compat). I'm
planning on addressing 64-bit, and I want to do almost the reverse of
what you're proposing: have a way that one task can trap into a
special mode in which another process can do syscalls on its behalf.

There are some syscalls for which this simply makes no sense.
Setresuid, capset, and similar come to mind. Clone and friends may
screw up impressively if you try this. fsyscall should not be allowed
to call itself. If you call write(2) like this and it has any
meaningful effect, something's wrong. keyctl(2) does really awful
things wrt struct cred, and I don't really want to think about what
happens if you try calling it like this.

override_creds is IMO awful. Serge and I had an old discussion on how
to maybe fix it.

Honestly, I think the way to go might be to get Capsicum, or at least
Capsicum's fd model, merged and to add a mode in which the *at
operations on a specially marked fd use the passed fd's f_cred instead
of the caller's. (Cc: David Drysdale -- that feature might be really
nice.)

--Andy
* Re: RFC: fsyscall
2015-09-08 22:55 ` Andy Lutomirski
@ 2015-09-08 23:07 ` Eric W. Biederman
2015-09-08 23:18 ` Andy Lutomirski
0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2015-09-08 23:07 UTC (permalink / raw)
To: Andy Lutomirski
Cc: David Drysdale, linux-kernel@vger.kernel.org, Serge E. Hallyn
Andy Lutomirski <luto@amacapital.net> writes:
> It's not possible yet due to nasty calling convention issues.
> (Entries in the x86 syscall table aren't actually functions callable
> using the C ABI right now.) My pending monster patchset will make it
> possible to implement for 32-bit syscalls (native and compat). I'm
> planning on addressing 64-bit, and I want to do almost the reverse of
> what you're proposing: have a way that one task can trap into a
> special mode in which another process can do syscalls on its behalf.

Hmm. That seems comparatively dangerous to me.

> There are some syscalls for which this simply makes no sense.
> Setresuid, capset, and similar come to mind. Clone and friends may
> screw up impressively if you try this. fsyscall should not be allowed
> to call itself. If you call write(2) like this and it has any
> meaningful effect, something's wrong.

If you peek into the data that is being written it can be meaningful on
write(2).

Hmm. But yes for file descriptor based system calls this is much less
interesting. Having some kind of wrapper that embeds one file
descriptor in another and does the filtering that way seems more
interesting, for the file descriptor based methods.

> keyctl(2) does really awful
> things wrt struct cred, and I don't really want to think about what
> happens if you try calling it like this.
>
> override_creds is IMO awful. Serge and I had an old discussion on how
> to maybe fix it.
>
> Honestly, I think the way to go might be to get Capsicum, or at least
> Capsicum's fd model, merged and to add a mode in which the *at
> operations on a specially marked fd use the passed fd's f_cred instead
> of the caller's. (Cc: David Drysdale -- that feature might be really
> nice.)

Perhaps I had missed it but I don't recall capsicum being able to wrap
things like reboot(2).

Which really describes what I am trying to tackle.
How do we create an object that we can pass between processes that limits
what we can do in the case of the oddball syscalls that require special
privileges?

At the same time I still want the caller to be able to pass in data to
the system calls being called such as REBOOT_CMD_POWER_OFF versus
REBOOT_CMD_HALT, while being able to filter it and say you may not pass
REBOOT_CMD_CAD_OFF.

Eric
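For the simplest use cases, the reboot example above does not need a BPF program at all: a per-object allow-list of command values is enough to express "POWER_OFF and HALT yes, CAD_OFF no". The sketch below is a hypothetical illustration, not part of any posted patch. `struct reboot_grant` and `reboot_grant_permits` are invented names, and the constants use the spelling from the thread with the values that <linux/reboot.h> assigns to its LINUX_REBOOT_CMD_* equivalents.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Spelled as in the thread; <linux/reboot.h> uses a LINUX_ prefix
 * (LINUX_REBOOT_CMD_CAD_OFF and so on) for the same values.
 */
#define REBOOT_CMD_CAD_OFF   0x00000000u
#define REBOOT_CMD_HALT      0xCDEF0123u
#define REBOOT_CMD_POWER_OFF 0x4321FEDCu

/* Hypothetical grant object: the set of cmd values its holder may pass. */
struct reboot_grant {
    const unsigned int *allowed_cmds;
    size_t n_allowed;
};

/* By-value check of the cmd argument against the grant's allow-list. */
bool reboot_grant_permits(const struct reboot_grant *g, unsigned int cmd)
{
    for (size_t i = 0; i < g->n_allowed; i++) {
        if (g->allowed_cmds[i] == cmd)
            return true;
    }
    return false;
}
```

Because the check only looks at a scalar argument, it does not need to read the caller's memory, which is what keeps this particular case simple.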
* Re: RFC: fsyscall
2015-09-08 23:07 ` Eric W. Biederman
@ 2015-09-08 23:18 ` Andy Lutomirski
2015-09-09 0:25 ` Eric W. Biederman
0 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2015-09-08 23:18 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Drysdale, linux-kernel@vger.kernel.org, Serge E. Hallyn
On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Perhaps I had missed it but I don't recall capsicum being able to wrap
> things like reboot(2).

Ah, so you want to be able to grant BPF-defined capabilities :)

Off the top of my head, I think that doing this using a nice IPC
mechanism (which barely exists in Linux, but which seL4 and binder (!)
can do very cleanly) would be simpler and more general, if less
self-contained.

(Aside: how on earth does anyone think that replacing binder with
kdbus makes any sense? Binder can pass capabilities, and kdbus can't.
OTOH, maybe Android doesn't use the capability-passing ability.)

> Which really describes what I am trying to tackle. How do we create an
> object that we can pass between processes that limits what we can do in
> the case of the oddball syscalls that require special privileges?
>
> At the same time I still want the caller to be able to pass in data to
> the system calls being called such as REBOOT_CMD_POWER_OFF versus
> REBOOT_CMD_HALT, while being able to filter it and say you may not pass
> REBOOT_CMD_CAD_OFF.

We could have a conservative whitelist of syscalls for which we allow
this usage. I'm a bit worried that there will be very limited use
cases, given that a lot of use cases will want to follow pointers,
which has TOCTOU problems.

--Andy
* Re: RFC: fsyscall
2015-09-08 23:18 ` Andy Lutomirski
@ 2015-09-09 0:25 ` Eric W. Biederman
2015-09-09 17:27 ` David Drysdale
2015-09-10 13:43 ` Serge E. Hallyn
0 siblings, 2 replies; 14+ messages in thread
From: Eric W. Biederman @ 2015-09-09 0:25 UTC (permalink / raw)
To: Andy Lutomirski
Cc: David Drysdale, linux-kernel@vger.kernel.org, Serge E. Hallyn
Andy Lutomirski <luto@amacapital.net> writes:
> On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Perhaps I had missed it but I don't recall capsicum being able to wrap
>> things like reboot(2).
>
> Ah, so you want to be able to grant BPF-defined capabilities :)

Pretty much.

Where I am focusing is turning Posix capabilities into real
capabilities. I would not mind if the functionality was a bit more
general. Say to be able to handle things like security labels, or
anywhere else you might reasonably be asked can you do X?

But I would be happy if we just managed to wrap the Posix capabilities
and turned them into real capabilities.

> Off the top of my head, I think that doing this using a nice IPC
> mechanism (which barely exists in Linux, but which seL4 and binder (!)
> can do very cleanly) would be simpler and more general, if less
> self-contained.

Less self-contained becomes a problem when you want to pass them between
processes written at different times by different people. If there
is something conceptually simple we can implement in the kernel it
becomes worth it because that becomes the standard which everyone knows
to code to.

> (Aside: how on earth does anyone think that replacing binder with
> kdbus makes any sense? Binder can pass capabilities, and kdbus can't.
> OTOH, maybe Android doesn't use the capability-passing ability.)

kdbus has file descriptor passing. Beyond that no comment.

>> Which really describes what I am trying to tackle.
>> How do we create an object that we can pass between processes that limits
>> what we can do in the case of the oddball syscalls that require special
>> privileges?
>>
>> At the same time I still want the caller to be able to pass in data to
>> the system calls being called such as REBOOT_CMD_POWER_OFF versus
>> REBOOT_CMD_HALT, while being able to filter it and say you may not pass
>> REBOOT_CMD_CAD_OFF.
>
> We could have a conservative whitelist of syscalls for which we allow
> this usage. I'm a bit worried that there will be very limited use
> cases, given that a lot of use cases will want to follow pointers,
> which has TOCTOU problems.

Time of check to time of use problems. Interesting point.

TOCTOU seems to make filtering of system calls in general much less
viable than I had hoped or imagined, and seems to be one of the better
arguments I have heard against ioctls.

I think the cases I care about are much less likely to have TOCTOU
problems than system calls in general, so I still may be ok.

However it does seem like past a certain point for good filtering the
entire syscall ABI needs to be turned into well defined IPC. Ick!

Sigh. I guess it is about time I dig up the places we call capable.
Ugh 1696 places in the kernel.. Even filtering out CAP_SYS_ADMIN and
CAP_NET_ADMIN the list is longer than I can easily look at.

Still reboot isn't a problem ;)

Thinking about the TOCTOU problems with system call filtering the only
general solution I can see is to handle it like the compat syscalls
but instead of copying things into a temporary buffer in userspace
we copy the data into a temporary in-kernel buffer (filter the system call)
    fs = get_fs();
    set_fs(get_ds());
    /* Call the system call */
    set_fs(fs);

I don't like the whole set_fs() thing (especially if there is any data
we did not manage to copy). But it seems like a good conceptual start.

Eric
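The copy-then-filter idea above can be illustrated outside the kernel. The sketch below is a userspace model only: set_fs() is not modelled, all names are hypothetical, and the "attacker" is simulated by a check callback that rewrites the caller's buffer after inspecting it. It contrasts a racy path, where the check and the action both read caller memory, with a path that snapshots the argument once and uses only the private copy.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SNAPSHOT_MAX 128

typedef bool (*check_fn)(const char *arg);
typedef long (*action_fn)(const char *arg);

/* Racy pattern: the check and the action both read the caller's buffer. */
long call_racy(const char *user_arg, check_fn check, action_fn action)
{
    if (!check(user_arg))
        return -EPERM;
    /* Window: the caller's buffer may be rewritten before it is used. */
    return action(user_arg);
}

/* Copy once, then check and act on the private copy only. */
long call_with_snapshot(const char *user_arg, check_fn check, action_fn action)
{
    char snapshot[SNAPSHOT_MAX];
    size_t len = strnlen(user_arg, sizeof(snapshot));

    if (len == sizeof(snapshot))
        return -ENAMETOOLONG;
    memcpy(snapshot, user_arg, len);
    snapshot[len] = '\0'; /* terminate locally, never trust the source */

    if (!check(snapshot))
        return -EPERM;
    return action(snapshot);
}

/* Demo state: a simulated attacker target and a record of the action. */
char *demo_attacker_target;
char demo_last_acted_on[SNAPSHOT_MAX];

/* Approves paths under /tmp/, then rewrites the caller's buffer. */
bool demo_check_then_swap(const char *arg)
{
    bool ok = strncmp(arg, "/tmp/", 5) == 0;

    if (demo_attacker_target)
        strcpy(demo_attacker_target, "/etc/passwd");
    return ok;
}

/* Records the argument the action was actually given. */
long demo_record_action(const char *arg)
{
    snprintf(demo_last_acted_on, sizeof(demo_last_acted_on), "%s", arg);
    return 0;
}
```

With the buffer initially holding "/tmp/ok", call_racy ends up acting on "/etc/passwd" while call_with_snapshot acts on "/tmp/ok". The cost is the one raised later in the thread: the copier has to know the layout of every pointer argument.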
* Re: RFC: fsyscall
2015-09-09 0:25 ` Eric W. Biederman
@ 2015-09-09 17:27 ` David Drysdale
2015-09-09 19:33 ` Eric W. Biederman
2015-09-10 13:28 ` Serge E. Hallyn
2015-09-10 13:43 ` Serge E. Hallyn
1 sibling, 2 replies; 14+ messages in thread
From: David Drysdale @ 2015-09-09 17:27 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andy Lutomirski, linux-kernel@vger.kernel.org, Serge E. Hallyn
On Wed, Sep 9, 2015 at 1:25 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
> >> things like reboot(2).
> >
> > Ah, so you want to be able to grant BPF-defined capabilities :)
>
> Pretty much.
>
> Where I am focusing is turning Posix capabilities into real
> capabilities. I would not mind if the functionality was a bit more
> general. Say to be able to handle things like security labels, or
> anywhere else you might reasonably be asked can you do X?
>
> But I would be happy if we just managed to wrap the Posix capabilities
> and turned them into real capabilities.

Interesting idea! So kind of like the "object" in question is the root
role, and the different rights for the corresponding object-capability
(the file descriptor) are the POSIX capabilities (in the simple case
at least).

And yes, Capsicum doesn't generally interact with things like reboot(2);
its checks are on top of any DAC policies rather than instead of them,
as it's a hybrid rather than a pure object-capability system.

> > Off the top of my head, I think that doing this using a nice IPC
> > mechanism (which barely exists in Linux, but which seL4 and binder (!)
> > can do very cleanly) would be simpler and more general, if less
> > self-contained.
>
> Less self-contained becomes a problem when you want to pass them between
> processes written at different times by different people.
> If there is something conceptually simple we can implement in the kernel it
> becomes worth it because that becomes the standard which everyone knows
> to code to.
>
> > (Aside: how on earth does anyone think that replacing binder with
> > kdbus makes any sense? Binder can pass capabilities, and kdbus can't.
> > OTOH, maybe Android doesn't use the capability-passing ability.)
>
> kdbus has file descriptor passing. Beyond that no comment.
>
> >> Which really describes what I am trying to tackle. How do we create an
> >> object that we can pass between processes that limits what we can do in
> >> the case of the oddball syscalls that require special privileges?
> >>
> >> At the same time I still want the caller to be able to pass in data to
> >> the system calls being called such as REBOOT_CMD_POWER_OFF versus
> >> REBOOT_CMD_HALT, while being able to filter it and say you may not pass
> >> REBOOT_CMD_CAD_OFF.
> >
> > We could have a conservative whitelist of syscalls for which we allow
> > this usage. I'm a bit worried that there will be very limited use
> > cases, given that a lot of use cases will want to follow pointers,
> > which has TOCTOU problems.
>
> Time of check to time of use problems. Interesting point.
>
> TOCTOU seems to make filtering of system calls in general much less
> viable than I had hoped or imagined, and seems to be one of the better
> arguments I have heard against ioctls.

By the way, Robert Watson (one of the progenitors of Capsicum, as it
happens) has a nice paper about TOCTOU attacks on syscall interposition
layers that's a good read:
  http://www.watson.org/~robert/2007woot/

(From this perspective, the limitation that seccomp-bpf programs only
have access to syscall arguments by-value is actually a help -- the filter
can't look into user memory, so can't be fooled by having memory
contents changed underneath it. Of course, if the eBPF stuff ever
changes that we should watch out...)
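To make the by-value point concrete, here is what an existing seccomp classic-BPF filter for the reboot example could look like. It only ever loads fields of struct seccomp_data (the syscall number and the raw argument words), never user memory. This is a sketch under stated assumptions: it is a seccomp filter on the calling task, not the fsyscall proposal; it omits the architecture check that a real filter must do first; and it reads the low 32 bits of args[2] at the field's base offset, which is only right on little-endian machines. The `run_filter` helper is a toy evaluator written for this example that understands just the three opcodes used, so that the jump offsets can be exercised without installing the filter.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/reboot.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/*
 * Allow everything except reboot(2) with a cmd other than POWER_OFF or
 * HALT.  Simplified: no seccomp_data.arch check, little-endian only.
 */
static const struct sock_filter reboot_cmd_filter[] = {
    /* [0] A = syscall number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [1] not reboot: jump to [6] (allow) */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_reboot, 0, 4),
    /* [2] A = low 32 bits of the third argument (cmd) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[2])),
    /* [3] POWER_OFF: jump to [6] (allow) */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, LINUX_REBOOT_CMD_POWER_OFF, 2, 0),
    /* [4] HALT: jump to [6] (allow) */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, LINUX_REBOOT_CMD_HALT, 1, 0),
    /* [5] any other cmd, e.g. CAD_OFF: fail the call with EPERM */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
    /* [6] allow */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Toy evaluator for this example; handles only the opcodes used above. */
uint32_t run_filter(const struct sock_filter *f, size_t len,
                    const struct seccomp_data *sd)
{
    uint32_t acc = 0;

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &f[pc];

        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            memcpy(&acc, (const char *)sd + insn->k, sizeof(acc));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* Offsets are relative to the next instruction. */
            pc += (acc == insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_RET | BPF_K:
            return insn->k;
        default:
            return SECCOMP_RET_KILL;
        }
    }
    return SECCOMP_RET_KILL;
}

/* Number of instructions in the filter above. */
size_t reboot_cmd_filter_len(void)
{
    return sizeof(reboot_cmd_filter) / sizeof(reboot_cmd_filter[0]);
}
```

Installing such a program for real would go through prctl(PR_SET_NO_NEW_PRIVS, ...) and the seccomp filter mode, which this sketch deliberately leaves out.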
> I think the cases I care about are much less likely to have TOCTOU
> problems than system calls in general, so I still may be ok.
>
> However it does seem like past a certain point for good filtering the
> entire syscall ABI needs to be turned into well defined IPC. Ick!

That's roughly one of Robert's suggestions (section 8.2).

> Sigh. I guess it is about time I dig up the places we call capable.
> Ugh 1696 places in the kernel.. Even filtering out CAP_SYS_ADMIN and
> CAP_NET_ADMIN the list is longer than I can easily look at.
>
> Still reboot isn't a problem ;)
>
> Thinking about the TOCTOU problems with system call filtering the only
> general solution I can see is to handle it like the compat syscalls
> but instead of copying things into a temporary buffer in userspace
> we copy the data into a temporary in-kernel buffer (filter the system call)
>     fs = get_fs();
>     set_fs(get_ds());
>     /* Call the system call */
>     set_fs(fs);
>
> I don't like the whole set_fs() thing (especially if there is any data
> we did not manage to copy). But it seems like a good conceptual start.

Doing the copies sounds like it would involve understanding & reproducing
the memory layouts for every syscall pointer argument, which would be a
lot of code. Or am I misunderstanding something?
* Re: RFC: fsyscall
2015-09-09 17:27 ` David Drysdale
@ 2015-09-09 19:33 ` Eric W. Biederman
2015-09-10 13:35 ` Serge E. Hallyn
2015-09-10 13:28 ` Serge E. Hallyn
1 sibling, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2015-09-09 19:33 UTC (permalink / raw)
To: David Drysdale
Cc: Andy Lutomirski, linux-kernel@vger.kernel.org, Serge E. Hallyn
David Drysdale <drysdale@google.com> writes:
>> Thinking about the TOCTOU problems with system call filtering the only
>> general solution I can see is to handle it like the compat syscalls
>> but instead of copying things into a temporary buffer in userspace
>> we copy the data into a temporary in-kernel buffer (filter the system call)
>>     fs = get_fs();
>>     set_fs(get_ds());
>>     /* Call the system call */
>>     set_fs(fs);
>>
>> I don't like the whole set_fs() thing (especially if there is any data
>> we did not manage to copy). But it seems like a good conceptual start.
>
> Doing the copies sounds like it would involve understanding & reproducing
> the memory layouts for every syscall pointer argument, which would be a
> lot of code. Or am I misunderstanding something?

Which is what we have for ioctls and some of the system calls in the
compat case. So it is something that has been done before.
However I am going to leave the TOCTOU mess to another time.

If I assume that anything file descriptor based will need a different
mechanism (capsicum perhaps?) to filter what is allowed on a file
descriptor, that handily reduces the problem space and removes almost
all cases where reading data from userspace is interesting, as I am
talking about pure system calls.

The system calls which are not file descriptor based are listed
below. Most of those don't take weird parameter structures that would
be interesting to filter. So I think my fsyscall idea is conceptually
reasonable. It is not a complete solution for passing someone a well
defined subset you are allowed to do but it is interesting.

Eric

open stat lstat mprotect munmap brk rt_sigaction rt_sigprocmask
rt_sigreturn access pipe sched_yield mremap msync mincore madvise shmget
shmat shmctl pause nanosleep getitimer alarm setitimer getpid socket
socketpair clone fork vfork execve exit wait4 kill uname semget semop
semctl shmdt msgget msgsnd msgrcv msgctl truncate getcwd chdir rename
mkdir rmdir creat link unlink symlink readlink chmod chown lchown umask
gettimeofday getrlimit getrusage sysinfo times ptrace getuid syslog
getgid setuid setgid geteuid getegid setpgid getppid getpgrp setsid
setreuid setregid getgroups setgroups setresuid getresuid setresgid
getresgid getpgid setfsuid setfsgid getsid capget capset rt_sigpending
rt_sigtimedwait rt_sigqueueinfo rt_sigsuspend sigaltstack utime mknod
uselib personality ustat statfs sysfs getpriority setpriority
sched_setparam sched_getparam sched_setscheduler sched_getscheduler
sched_get_priority_max sched_get_priority_min sched_rr_get_interval
mlock munlock mlockall munlockall vhangup modify_ldt pivot_root _sysctl
prctl arch_prctl adjtimex setrlimit chroot sync acct settimeofday mount
umount2 swapon swapoff reboot sethostname setdomainname iopl ioperm
create_module init_module delete_module get_kernel_syms query_module
quotactl nfsservctl gettid setxattr lsetxattr getxattr lgetxattr
listxattr llistxattr removexattr lremovexattr tkill time futex
sched_setaffinity sched_getaffinity set_thread_area io_setup io_destroy
io_getevents io_submit io_cancel get_thread_area lookup_dcookie
epoll_create epoll_ctl_old epoll_wait_old remap_file_pages
set_tid_address restart_syscall semtimedop timer_create timer_settime
timer_gettime timer_getoverrun timer_delete clock_settime clock_gettime
clock_getres clock_nanosleep exit_group epoll_wait epoll_ctl tgkill
utimes vserver mbind set_mempolicy get_mempolicy mq_open mq_unlink
mq_timedsend mq_timedreceive mq_notify mq_getsetattr kexec_load waitid
add_key request_key keyctl ioprio_set ioprio_get inotify_init
inotify_add_watch inotify_rm_watch migrate_pages unshare set_robust_list
get_robust_list splice tee sync_file_range vmsplice move_pages utimensat
epoll_pwait signalfd timerfd_create eventfd fallocate signalfd4 eventfd2
epoll_create1 pipe2 inotify_init1 rt_tgsigqueueinfo perf_event_open
fanotify_init prlimit64 clock_adjtime getcpu process_vm_readv
process_vm_writev kcmp sched_setattr sched_getattr seccomp getrandom
memfd_create bpf
* Re: RFC: fsyscall
2015-09-09 19:33 ` Eric W. Biederman
@ 2015-09-10 13:35 ` Serge E. Hallyn
0 siblings, 0 replies; 14+ messages in thread
From: Serge E. Hallyn @ 2015-09-10 13:35 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Drysdale, Andy Lutomirski, linux-kernel@vger.kernel.org, Serge E. Hallyn

On Wed, Sep 09, 2015 at 02:33:14PM -0500, Eric W. Biederman wrote:
...
> If I assume that anything file descriptor based will need a different
> mechanism to filter what is allowed on a file descriptor (capsicum
> perhaps?), that handily reduces the problem space and removes almost
> all cases where reading data from userspace is interesting, as I am
> talking about pure system calls.
>
> The system calls which are not file descriptor based are listed below.
> Most of those don't take weird parameter structures that would be
> interesting to filter. So I think my fsyscall idea is conceptually
> reasonable. It is not a complete solution for passing someone a well
> defined subset of what you are allowed to do, but it is interesting.
...
> creat

Taking this as a specific example, I'm somewhat fond of the idea of
saying that we can support openat() as fd-based (let's say
capsicum-based as we know that can work), and therefore we don't need
open() or creat(). If you're designing an app so that you can fork a
task with a subset of your capabilities, then you're writing it now
anyway, so there is no reason for supporting open and creat. Since
these are specifically very subject to TOCTTOU, saying "you must use
openat()" seems ok.

-serge
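To make the "you must use openat()" suggestion concrete, here is a
minimal sketch of a creat() replacement that works relative to an
already-open directory fd. The names create_under() and
demo_create_under() are made up for illustration and are not from the
thread. Note that plain openat() does not by itself stop a name
containing ".." from escaping the directory; the sketch assumes a
capsicum-style restriction on the directory fd would take care of that.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: the fd-relative equivalent of creat(name, 0600).
 * O_NOFOLLOW makes the call fail if the final component is a symlink.
 * Returns the new fd, or -1 with errno set.
 */
static int create_under(int dirfd, const char *name)
{
    assert(name != NULL);
    return openat(dirfd, name,
                  O_CREAT | O_WRONLY | O_TRUNC | O_NOFOLLOW | O_CLOEXEC,
                  0600);
}

/*
 * Small self-check: make a scratch directory, create a file inside it
 * through the directory fd only, then clean up.  Returns 0 on success.
 */
static int demo_create_under(void)
{
    char dir[] = "/tmp/openat_demo_XXXXXX";
    int dfd, fd, ok;

    if (mkdtemp(dir) == NULL)
        return -1;
    dfd = open(dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dfd < 0) {
        rmdir(dir);
        return -1;
    }

    fd = create_under(dfd, "scratch");
    ok = (fd >= 0 && write(fd, "x", 1) == 1);
    if (fd >= 0)
        close(fd);

    unlinkat(dfd, "scratch", 0);
    close(dfd);
    rmdir(dir);
    return ok ? 0 : -1;
}
```

The holder of dfd never has to name an absolute path, which is what
lets a directory fd act as the unit of delegation.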
* Re: RFC: fsyscall
2015-09-09 17:27 ` David Drysdale
2015-09-09 19:33 ` Eric W. Biederman
@ 2015-09-10 13:28 ` Serge E. Hallyn
1 sibling, 0 replies; 14+ messages in thread
From: Serge E. Hallyn @ 2015-09-10 13:28 UTC (permalink / raw)
To: David Drysdale
Cc: Eric W. Biederman, Andy Lutomirski, linux-kernel@vger.kernel.org, Serge E. Hallyn

On Wed, Sep 09, 2015 at 06:27:06PM +0100, David Drysdale wrote:
> On Wed, Sep 9, 2015 at 1:25 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > Andy Lutomirski <luto@amacapital.net> writes:
> > > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> (From this perspective, the limitation that seccomp-bpf programs only
> have access to syscall arguments by-value is actually a help -- the filter
> can't look into user memory, so can't be fooled by having memory
> contents changed underneath it. Of course, if the eBPF stuff ever
> changes that we should watch out...)

Yup and I'm quite sure I've seen that raised as a reason to refuse
supporting exactly that.
* Re: RFC: fsyscall
2015-09-09 0:25 ` Eric W. Biederman
2015-09-09 17:27 ` David Drysdale
@ 2015-09-10 13:43 ` Serge E. Hallyn
2015-09-10 13:51 ` David Drysdale
1 sibling, 1 reply; 14+ messages in thread
From: Serge E. Hallyn @ 2015-09-10 13:43 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andy Lutomirski, David Drysdale, linux-kernel@vger.kernel.org, Serge E. Hallyn

On Tue, Sep 08, 2015 at 07:25:17PM -0500, Eric W. Biederman wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
>
> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
> >> things like reboot(2).
> >>
> >
> > Ah, so you want to be able to grant BPF-defined capabilities :)
>
> Pretty much.
>
> Where I am focusing is turning Posix capabilities into real
> capabilities. I would not mind if the functionality was a bit more
> general. Say to be able to handle things like security labels, or
> anywhere else you might reasonably be asked can you do X?
>
> But I would be happy if we just managed to wrap the Posix capabilities
> and turned them into real capabilities.

If there were a clever way to exec an open fd, then you could do this
by passing an fd to a copy of /bin/reboot which has fP=CAP_SYS_BOOT,
or preferably fI=CAP_SYS_BOOT,fE=1 and leave pI=CAP_SYS_BOOT in the
task.

A cleaner way to do this is to have a service which can reboot, which
looks at unix socket peercreds to determine whether the granter may
reboot, then passes it an fd which the granter may pass to a grantee.
Then the grantee passes the fd to the service, which recognizes it and
reboots.

-serge
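The two building blocks of the service-based scheme above are checking
the peer's credentials on a unix socket and handing an fd across that
socket. A minimal sketch of both follows, using SO_PEERCRED and
SCM_RIGHTS. The function names are invented for illustration, and how
the service would later recognize the fd coming back from the grantee
is left out, since the thread does not specify it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Ancillary-data buffer big enough for one fd, suitably aligned. */
union fd_ctl {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* 1 if the peer's effective uid at connect/socketpair time matches. */
static int peer_has_uid(int sock, uid_t allowed)
{
    struct ucred cred;
    socklen_t len = sizeof(cred);

    if (getsockopt(sock, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
        return -1;
    return cred.uid == allowed;
}

/* Fill in a msghdr carrying one data byte plus room for one fd. */
static void init_msg(struct msghdr *msg, struct iovec *iov, char *byte,
                     union fd_ctl *ctl)
{
    memset(msg, 0, sizeof(*msg));
    memset(ctl, 0, sizeof(*ctl));
    iov->iov_base = byte;
    iov->iov_len = 1;
    msg->msg_iov = iov;
    msg->msg_iovlen = 1;
    msg->msg_control = ctl->buf;
    msg->msg_controllen = sizeof(ctl->buf);
}

/* Send fd over a connected AF_UNIX socket.  0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
    struct msghdr msg;
    struct iovec iov;
    union fd_ctl ctl;
    struct cmsghdr *cmsg;
    char byte = 0;

    assert(fd >= 0);
    init_msg(&msg, &iov, &byte, &ctl);
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd.  Returns the new fd, or -1 on error. */
static int recv_fd(int sock)
{
    struct msghdr msg;
    struct iovec iov;
    union fd_ctl ctl;
    struct cmsghdr *cmsg;
    char byte;
    int fd;

    init_msg(&msg, &iov, &byte, &ctl);
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Self-check within one process: verify peer, pass a pipe end across. */
static int demo_grant(void)
{
    int sv[2], token[2], got;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0 || pipe(token) < 0)
        return -1;
    if (peer_has_uid(sv[0], geteuid()) != 1)
        return -1;
    if (send_fd(sv[0], token[0]) < 0 || (got = recv_fd(sv[1])) < 0)
        return -1;
    if (write(token[1], "k", 1) != 1 || read(got, &c, 1) != 1)
        return -1;
    return c == 'k' ? 0 : -1;
}
```

In a real service the two socket ends would live in different
processes; the single-process demo only shows that the received fd
refers to the same open file as the one sent.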
* Re: RFC: fsyscall
2015-09-10 13:43 ` Serge E. Hallyn
@ 2015-09-10 13:51 ` David Drysdale
2015-09-10 14:01 ` Serge E. Hallyn
0 siblings, 1 reply; 14+ messages in thread
From: David Drysdale @ 2015-09-10 13:51 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Eric W. Biederman, Andy Lutomirski, linux-kernel@vger.kernel.org

On Thu, Sep 10, 2015 at 2:43 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> On Tue, Sep 08, 2015 at 07:25:17PM -0500, Eric W. Biederman wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>>
>> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
>> >> things like reboot(2).
>> >>
>> >
>> > Ah, so you want to be able to grant BPF-defined capabilities :)
>>
>> Pretty much.
>>
>> Where I am focusing is turning Posix capabilities into real
>> capabilities. I would not mind if the functionality was a bit more
>> general. Say to be able to handle things like security labels, or
>> anywhere else you might reasonably be asked can you do X?
>>
>> But I would be happy if we just managed to wrap the Posix capabilities
>> and turned them into real capabilities.
>
> If there were a clever way to exec an open fd, then you could do this

execveat(fd, "", argv, envp, AT_EMPTY_PATH) ?

> by passing an fd to a copy of /bin/reboot which has fP=CAP_SYS_BOOT,
> or preferably fI=CAP_SYS_BOOT,fE=1 and leave pI=CAP_SYS_BOOT in the
> task.
>
> A cleaner way to do this is to have a service which can reboot, which
> looks at unix socket peercreds to determine whether the granter may
> reboot, then passes it an fd which the granter may pass to a grantee.
> Then the grantee passes the fd to the service, which recognizes it and
> reboots.
>
> -serge
* Re: RFC: fsyscall
2015-09-10 13:51 ` David Drysdale
@ 2015-09-10 14:01 ` Serge E. Hallyn
2015-09-10 14:03 ` Serge E. Hallyn
0 siblings, 1 reply; 14+ messages in thread
From: Serge E. Hallyn @ 2015-09-10 14:01 UTC (permalink / raw)
To: David Drysdale
Cc: Serge E. Hallyn, Eric W. Biederman, Andy Lutomirski, linux-kernel@vger.kernel.org

On Thu, Sep 10, 2015 at 02:51:28PM +0100, David Drysdale wrote:
> On Thu, Sep 10, 2015 at 2:43 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > On Tue, Sep 08, 2015 at 07:25:17PM -0500, Eric W. Biederman wrote:
> >> Andy Lutomirski <luto@amacapital.net> writes:
> >>
> >> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >>
> >> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
> >> >> things like reboot(2).
> >> >>
> >> >
> >> > Ah, so you want to be able to grant BPF-defined capabilities :)
> >>
> >> Pretty much.
> >>
> >> Where I am focusing is turning Posix capabilities into real
> >> capabilities. I would not mind if the functionality was a bit more
> >> general. Say to be able to handle things like security labels, or
> >> anywhere else you might reasonably be asked can you do X?
> >>
> >> But I would be happy if we just managed to wrap the Posix capabilities
> >> and turned them into real capabilities.
> >
> > If there were a clever way to exec an open fd, then you could do this
>
> execveat(fd, "", argv, envp, AT_EMPTY_PATH) ?

??? I looked for it but I don't have a manpage for it. I see it at
man7.org though. Thanks :)

> > by passing an fd to a copy of /bin/reboot which has fP=CAP_SYS_BOOT,
> > or preferably fI=CAP_SYS_BOOT,fE=1 and leave pI=CAP_SYS_BOOT in the
> > task.
> >
> > A cleaner way to do this is to have a service which can reboot, which
> > looks at unix socket peercreds to determine whether the granter may
> > reboot, then passes it an fd which the granter may pass to a grantee.
> > Then the grantee passes the fd to the service, which recognizes it and
> > reboots.
> >
> > -serge
* Re: RFC: fsyscall
2015-09-10 14:01 ` Serge E. Hallyn
@ 2015-09-10 14:03 ` Serge E. Hallyn
2015-09-10 14:04 ` Serge E. Hallyn
0 siblings, 1 reply; 14+ messages in thread
From: Serge E. Hallyn @ 2015-09-10 14:03 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: David Drysdale, Eric W. Biederman, Andy Lutomirski, linux-kernel@vger.kernel.org

On Thu, Sep 10, 2015 at 09:01:20AM -0500, Serge E. Hallyn wrote:
> On Thu, Sep 10, 2015 at 02:51:28PM +0100, David Drysdale wrote:
> > On Thu, Sep 10, 2015 at 2:43 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > > On Tue, Sep 08, 2015 at 07:25:17PM -0500, Eric W. Biederman wrote:
> > >> Andy Lutomirski <luto@amacapital.net> writes:
> > >>
> > >> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > >>
> > >> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
> > >> >> things like reboot(2).
> > >> >>
> > >> >
> > >> > Ah, so you want to be able to grant BPF-defined capabilities :)
> > >>
> > >> Pretty much.
> > >>
> > >> Where I am focusing is turning Posix capabilities into real
> > >> capabilities. I would not mind if the functionality was a bit more
> > >> general. Say to be able to handle things like security labels, or
> > >> anywhere else you might reasonably be asked can you do X?
> > >>
> > >> But I would be happy if we just managed to wrap the Posix capabilities
> > >> and turned them into real capabilities.
> > >
> > > If there were a clever way to exec an open fd, then you could do this
> >
> > execveat(fd, "", argv, envp, AT_EMPTY_PATH) ?
>
> ??? I looked for it but I don't have a manpage for it. I see it at
> man7.org though. Thanks :)

But this isn't quite what I was suggesting. I was suggesting a call which
would exec the fd, not exec a file inside the dirfd.
* Re: RFC: fsyscall
2015-09-10 14:03 ` Serge E. Hallyn
@ 2015-09-10 14:04 ` Serge E. Hallyn
0 siblings, 0 replies; 14+ messages in thread
From: Serge E. Hallyn @ 2015-09-10 14:04 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: David Drysdale, Eric W. Biederman, Andy Lutomirski, linux-kernel@vger.kernel.org

On Thu, Sep 10, 2015 at 09:03:08AM -0500, Serge E. Hallyn wrote:
> On Thu, Sep 10, 2015 at 09:01:20AM -0500, Serge E. Hallyn wrote:
> > On Thu, Sep 10, 2015 at 02:51:28PM +0100, David Drysdale wrote:
> > > On Thu, Sep 10, 2015 at 2:43 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > > > On Tue, Sep 08, 2015 at 07:25:17PM -0500, Eric W. Biederman wrote:
> > > >> Andy Lutomirski <luto@amacapital.net> writes:
> > > >>
> > > >> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > >>
> > > >> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
> > > >> >> things like reboot(2).
> > > >> >>
> > > >> >
> > > >> > Ah, so you want to be able to grant BPF-defined capabilities :)
> > > >>
> > > >> Pretty much.
> > > >>
> > > >> Where I am focusing is turning Posix capabilities into real
> > > >> capabilities. I would not mind if the functionality was a bit more
> > > >> general. Say to be able to handle things like security labels, or
> > > >> anywhere else you might reasonably be asked can you do X?
> > > >>
> > > >> But I would be happy if we just managed to wrap the Posix capabilities
> > > >> and turned them into real capabilities.
> > > >
> > > > If there were a clever way to exec an open fd, then you could do this
> > >
> > > execveat(fd, "", argv, envp, AT_EMPTY_PATH) ?
> >
> > ??? I looked for it but I don't have a manpage for it. I see it at
> > man7.org though. Thanks :)
>
> But this isn't quite what I was suggesting. I was suggesting a call which
> would exec the fd, not exec a file inside the dirfd.

Oh, which it does:

    If pathname is an empty string and the AT_EMPTY_PATH flag is
    specified, then the file descriptor dirfd specifies the file to be
    executed (i.e., dirfd refers to an executable file, rather than a
    directory).

well that's spiffy
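For reference, the execveat() call discussed above can be sketched as
follows. The raw syscall is used because a libc wrapper may not be
available on every system; exec_fd() and run_via_fd() are names made up
for this illustration. Whether file capabilities such as
fP=CAP_SYS_BOOT take effect on the executed file is a separate
question that this sketch does not exercise.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>        /* AT_EMPTY_PATH */
#include <sys/syscall.h>  /* SYS_execveat */
#include <sys/wait.h>
#include <unistd.h>

/*
 * Execute the file that fd refers to.  With an empty pathname and
 * AT_EMPTY_PATH, fd names the executable itself rather than a
 * directory.  Only returns (with -1) if the exec fails.
 */
static int exec_fd(int fd, char *const argv[], char *const envp[])
{
    assert(fd >= 0);
    return (int)syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
}

/*
 * Open path, then run it in a child purely through the fd.
 * Returns the child's exit status, or -1 on error.
 */
static int run_via_fd(const char *path)
{
    char *const argv[] = { (char *)path, NULL };
    char *const envp[] = { NULL };
    int status;
    pid_t pid;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        exec_fd(fd, argv, envp);
        _exit(127);  /* only reached if the exec failed */
    }

    close(fd);
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, run_via_fd("/bin/true") should return 0 on a system where
that binary exists. In the delegation scenario the open() would be done
by the granter and only the fd handed to the grantee.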
end of thread, other threads:[~2015-09-10 14:04 UTC | newest]

Thread overview: 14+ messages
2015-09-08 22:35 RFC: fsyscall Eric W. Biederman
2015-09-08 22:55 ` Andy Lutomirski
2015-09-08 23:07   ` Eric W. Biederman
2015-09-08 23:18     ` Andy Lutomirski
2015-09-09  0:25       ` Eric W. Biederman
2015-09-09 17:27         ` David Drysdale
2015-09-09 19:33           ` Eric W. Biederman
2015-09-10 13:35             ` Serge E. Hallyn
2015-09-10 13:28           ` Serge E. Hallyn
2015-09-10 13:43         ` Serge E. Hallyn
2015-09-10 13:51           ` David Drysdale
2015-09-10 14:01             ` Serge E. Hallyn
2015-09-10 14:03               ` Serge E. Hallyn
2015-09-10 14:04                 ` Serge E. Hallyn