Re: [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)


* Re: [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
       [not found] <1404124096-21445-1-git-send-email-drysdale@google.com>
@ 2014-07-03  9:12 ` Paolo Bonzini
  2014-07-03 10:01   ` Loganaden Velvindron
  2014-07-03 18:39   ` David Drysdale
  0 siblings, 2 replies; 10+ messages in thread
From: Paolo Bonzini @ 2014-07-03  9:12 UTC (permalink / raw)
  To: David Drysdale, linux-security-module, linux-kernel,
	Greg Kroah-Hartman
  Cc: Kees Cook, linux-api, Meredydd Luff, qemu-devel, Alexander Viro,
	James Morris

On 30/06/2014 12:28, David Drysdale wrote:
> Hi all,
>
> The last couple of versions of FreeBSD (9.x/10.x) have included the
> Capsicum security framework [1], which allows security-aware
> applications to sandbox themselves in a very fine-grained way.  For
> example, OpenSSH now (>= 6.5) uses Capsicum in its FreeBSD version to
> restrict sshd's credentials checking process, to reduce the chances of
> credential leakage.

Hi David,

we've had similar goals in QEMU.  QEMU can be used as a virtual machine 
monitor from the command line, but it also has an API that lets a 
management tool drive QEMU via AF_UNIX sockets.  Long term, we would 
like to have a restricted mode for QEMU where all file descriptors are 
obtained via SCM_RIGHTS or /dev/fd, and syscalls can be locked down.
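
For illustration, descriptor passing over an AF_UNIX socket with
SCM_RIGHTS might look like the minimal sketch below.  The helper names
send_fd() and recv_fd() are hypothetical (they are not QEMU functions),
and error handling is reduced to a -1 return.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper: send one fd over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on failure. */
int send_fd(int sock, int fd)
{
    char byte = 0;  /* one byte of ordinary data travels with the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;  /* forces suitable alignment */
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd sent by send_fd().
 * Returns the new descriptor, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open
file as the sender's descriptor.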

Currently we do use seccomp v2 BPF filters, but unfortunately this 
didn't help very much.  QEMU supports hotplugging hence the filter must 
whitelist anything that _might_ be used in the future, which is 
generally... too much.

Something like Capsicum would be really nice because it attaches 
capabilities to file descriptors.  However, I wonder how 
extensible Capsicum could be, and I am worried about the proliferation 
of capabilities that its design naturally leads to.

Given Linux's previous experience with BPF filters, what do you think 
about attaching specific BPF programs to file descriptors?  Then 
whenever a syscall is run that affects a file descriptor, the BPF 
program for the file descriptor (attached to a struct file* as in 
Capsicum) would run in addition to the process-wide filter.

An equivalent of PR_SET_NO_NEW_PRIVS can also be added to file 
descriptors, so that a program that doesn't lock down syscalls can still 
lock down the operations (including fcntls and ioctls) on specific file 
descriptors.

Converting FreeBSD capabilities to BPF programs can be easily 
implemented in userspace.

>   [Capsicum also includes 'capability mode', which locks down the
>   available syscalls so the rights restrictions can't just be bypassed
>   by opening new file descriptors; I'll describe that separately later.]

This can also be implemented in userspace via seccomp and 
PR_SET_NO_NEW_PRIVS.
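
As a rough sketch of that combination: set PR_SET_NO_NEW_PRIVS, then
install a seccomp-bpf allowlist.  The three allowed syscalls below are
placeholders chosen for brevity, not Capsicum's real capability-mode
list; the x86_64-only arch check and the name enter_sandbox() are
likewise assumptions of this example.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* If the loaded syscall number equals nr, return ALLOW; else fall through. */
#define ALLOW_SYSCALL(nr) \
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (nr), 0, 1), \
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

struct sock_filter allowlist[] = {
    /* Syscall numbers are arch-specific: refuse anything but x86_64. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),

    /* Load the syscall number and compare against the allowlist. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    ALLOW_SYSCALL(__NR_read),
    ALLOW_SYSCALL(__NR_write),
    ALLOW_SYSCALL(__NR_exit_group),

    /* Everything else fails with EPERM. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
};

/* Hypothetical entry point: returns 0 on success, -1 on failure. */
int enter_sandbox(void)
{
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(allowlist) / sizeof(allowlist[0])),
        .filter = allowlist,
    };

    /* Needed so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Once enter_sandbox() succeeds the filter cannot be removed, and it is
inherited across fork() and execve().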

>   [Policing the rights checks anywhere else, for example at the system
>   call boundary, isn't a good idea because it opens up the possibility
>   of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
>   changed (as openat/close/dup2 are allowed in capability mode) between
>   the 'check' at syscall entry and the 'use' at fget() invocation.]

In the case of BPF filters, I wonder if you could stash the BPF 
"environment" somewhere and then use it at fget() invocation. 
Alternatively, it can be reconstructed at fget() time, similar to your 
introduction of fgetr().

Thanks,

Paolo


* Re: [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
  2014-07-03  9:12 ` [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) Paolo Bonzini
@ 2014-07-03 10:01   ` Loganaden Velvindron
  2014-07-03 18:39   ` David Drysdale
  1 sibling, 0 replies; 10+ messages in thread
From: Loganaden Velvindron @ 2014-07-03 10:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kees Cook, Greg Kroah-Hartman, Meredydd Luff, linux-kernel,
	qemu-devel, linux-security-module, Alexander Viro, James Morris,
	linux-api, David Drysdale

On Thu, Jul 3, 2014 at 1:12 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 30/06/2014 12:28, David Drysdale wrote:
>>
>> Hi all,
>>
>> The last couple of versions of FreeBSD (9.x/10.x) have included the
>> Capsicum security framework [1], which allows security-aware
>> applications to sandbox themselves in a very fine-grained way.  For
>> example, OpenSSH now (>= 6.5) uses Capsicum in its FreeBSD version to
>> restrict sshd's credentials checking process, to reduce the chances of
>> credential leakage.

Aside from OpenSSH, I've also been working on adding Capsicum support
to other userspace software.



>
>
> Hi David,
>
> we've had similar goals in QEMU.  QEMU can be used as a virtual machine
> monitor from the command line, but it also has an API that lets a management
> tool drive QEMU via AF_UNIX sockets.  Long term, we would like to have a
> restricted mode for QEMU where all file descriptors are obtained via
> SCM_RIGHTS or /dev/fd, and syscalls can be locked down.
>
> Currently we do use seccomp v2 BPF filters, but unfortunately this didn't
> help very much.  QEMU supports hotplugging hence the filter must whitelist
> anything that _might_ be used in the future, which is generally... too much.
>
> Something like Capsicum would be really nice because it attaches
> capabilities to file descriptors.  However, I wonder how extensible
> Capsicum could be, and I am worried about the proliferation of capabilities
> that its design naturally leads to.
>
> Given Linux's previous experience with BPF filters, what do you think about
> attaching specific BPF programs to file descriptors?  Then whenever a
> syscall is run that affects a file descriptor, the BPF program for the file
> descriptor (attached to a struct file* as in Capsicum) would run in addition
> to the process-wide filter.
>
> An equivalent of PR_SET_NO_NEW_PRIVS can also be added to file descriptors,
> so that a program that doesn't lock down syscalls can still lock down the
> operations (including fcntls and ioctls) on specific file descriptors.
>
> Converting FreeBSD capabilities to BPF programs can be easily implemented in
> userspace.
>
>>   [Capsicum also includes 'capability mode', which locks down the
>>   available syscalls so the rights restrictions can't just be bypassed
>>   by opening new file descriptors; I'll describe that separately later.]
>
>
> This can also be implemented in userspace via seccomp and
> PR_SET_NO_NEW_PRIVS.
>
>>   [Policing the rights checks anywhere else, for example at the system
>>   call boundary, isn't a good idea because it opens up the possibility
>>   of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
>>   changed (as openat/close/dup2 are allowed in capability mode) between
>>   the 'check' at syscall entry and the 'use' at fget() invocation.]
>
>
> In the case of BPF filters, I wonder if you could stash the BPF
> "environment" somewhere and then use it at fget() invocation. Alternatively,
> it can be reconstructed at fget() time, similar to your introduction of
> fgetr().
>
> Thanks,
>
> Paolo



-- 
This message is strictly personal and the opinions expressed do not
represent those of my employers, either past or present.


* Re: [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
  2014-07-03  9:12 ` [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) Paolo Bonzini
  2014-07-03 10:01   ` Loganaden Velvindron
@ 2014-07-03 18:39   ` David Drysdale
  2014-07-04  7:03     ` Paolo Bonzini
  1 sibling, 1 reply; 10+ messages in thread
From: David Drysdale @ 2014-07-03 18:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kees Cook, Greg Kroah-Hartman, Meredydd Luff,
	linux-kernel@vger.kernel.org, qemu-devel, LSM List,
	Alexander Viro, James Morris, Linux API

On Thu, Jul 03, 2014 at 11:12:33AM +0200, Paolo Bonzini wrote:
> On 30/06/2014 12:28, David Drysdale wrote:
> >Hi all,
> >
> >The last couple of versions of FreeBSD (9.x/10.x) have included the
> >Capsicum security framework [1], which allows security-aware
> >applications to sandbox themselves in a very fine-grained way.  For
> >example, OpenSSH now (>= 6.5) uses Capsicum in its FreeBSD version to
> >restrict sshd's credentials checking process, to reduce the chances of
> >credential leakage.
>
> Hi David,
>
> we've had similar goals in QEMU.  QEMU can be used as a virtual
> machine monitor from the command line, but it also has an API that
> lets a management tool drive QEMU via AF_UNIX sockets.  Long term,
> we would like to have a restricted mode for QEMU where all file
> descriptors are obtained via SCM_RIGHTS or /dev/fd, and syscalls can
> be locked down.
>
> Currently we do use seccomp v2 BPF filters, but unfortunately this
> didn't help very much.  QEMU supports hotplugging hence the filter
> must whitelist anything that _might_ be used in the future, which is
> generally... too much.
>
> Something like Capsicum would be really nice because it attaches
> capabilities to file descriptors.  However, I wonder how
> extensible Capsicum could be, and I am worried about the
> proliferation of capabilities that its design naturally leads to.

True, capability rights are likely to expand over time (although
FreeBSD only expanded from 55 to 60 between 9.x and 10.x).

> Given Linux's previous experience with BPF filters, what do you
> think about attaching specific BPF programs to file descriptors?
> Then whenever a syscall is run that affects a file descriptor, the
> BPF program for the file descriptor (attached to a struct file* as
> in Capsicum) would run in addition to the process-wide filter.

That sounds kind of clever, but also kind of complicated.

Off the top of my head, one particular problem is that not all
fd->struct file conversions in the kernel are completely specified
by an enclosing syscall and the explicit values of its parameters.

For example, the actual contents of the arguments to io_submit(2)
aren't visible to a seccomp-bpf program (as it can't read the __user
memory for the iocb structures), and so it can't distinguish a
read from a write.
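
To make that concrete, here is a small sketch of the data a seccomp-bpf
program is handed for io_submit(ctx_id, nr, iocbpp).  The function name
io_submit_view() is hypothetical, and the arch and instruction_pointer
fields are left at zero for brevity.

```c
#include <stdint.h>
#include <string.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Hypothetical helper: build the struct seccomp_data a filter would be
 * given for an io_submit() call (arch/instruction_pointer left zero). */
struct seccomp_data io_submit_view(unsigned long ctx_id, long nr_iocbs,
                                   const void *iocbpp)
{
    struct seccomp_data sd;

    memset(&sd, 0, sizeof(sd));
    sd.nr = __NR_io_submit;
    sd.args[0] = ctx_id;
    sd.args[1] = (uint64_t)nr_iocbs;
    /* Only the address is visible; the iocb contents behind it
     * (including the read/write opcode) are not. */
    sd.args[2] = (uint64_t)(uintptr_t)iocbpp;
    return sd;
}
```

Two submissions that differ only in what the iocb array contains yield
byte-identical views, so a filter has nothing to branch on.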

Also, there could potentially be some odd interactions with file
descriptors passed between processes, if the BPF program relies
on assumptions about the environment of the original process.  For
example, what happens if an x86_64 process passes a filter-attached
FD to an ia32 process?  Given that the syscall numbers are
arch-specific, I guess that means the filter program would have
to include arch-specific branches for any possible variant.

More generally, I suspect that keeping things simpler will end
up being more secure.  Capsicum was based on well-studied ideas
from the world of object capability-based security, and I'd be
nervous about adding complications that take us further away from
that. 

> An equivalent of PR_SET_NO_NEW_PRIVS can also be added to file
> descriptors, so that a program that doesn't lock down syscalls can
> still lock down the operations (including fcntls and ioctls) on
> specific file descriptors.
>
> Converting FreeBSD capabilities to BPF programs can be easily
> implemented in userspace.

I get the idea, but I'm not sure it would be that easy!  The
BPF-generation library would need to hold all of the mappings
from system calls (and their arguments) to the equivalent
required rights -- and vice versa.

That mapping would also need to be kept closely in sync with the kernel
and other system libraries -- if a new syscall is added and libc (or
some other library) started using it, the equivalent BPF chunks would
need to be updated to cope.

> >  [Capsicum also includes 'capability mode', which locks down the
> >  available syscalls so the rights restrictions can't just be bypassed
> >  by opening new file descriptors; I'll describe that separately later.]
>
> This can also be implemented in userspace via seccomp and
> PR_SET_NO_NEW_PRIVS.

Well, mostly (and in fact I've got an attempt to do exactly that at
https://github.com/google/capsicum-test/blob/dev/linux-bpf-capmode.c).

But there are a few wrinkles with that approach.

First, we need Kees Cook's patches to allow seccomp filters
to be synchronized across existing threads, but hopefully they
will make it in soon.

Next, there's one awkward syscall case.  In capability mode we'd like
to prevent processes from sending signals with kill(2)/tgkill(2)
to other processes, but they should still be able to send themselves
signals.  For example, abort(3) generates:
  tgkill(gettid(), gettid(), SIGABRT)

Only allowing kill(self) is hard to encode in a seccomp-bpf program, at
least in a way that survives forking.
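
A sketch of why this is awkward, under the assumption that the filter
simply compares the tgid argument with a constant: the process ID has
to be baked into the program when it is built.  The function name
build_self_tgkill_filter() is hypothetical, the arch check is omitted,
and the 32-bit load of args[0] assumes a little-endian machine.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <sys/types.h>

#define SELF_KILL_FILTER_LEN 6

/* Hypothetical helper: fill 'out' with a filter that allows tgkill()
 * only when its first argument (the tgid) equals 'self'. */
void build_self_tgkill_filter(pid_t self,
                              struct sock_filter out[SELF_KILL_FILTER_LEN])
{
    const struct sock_filter prog[SELF_KILL_FILTER_LEN] = {
        /* 0: load the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* 1: not tgkill -> jump to 5 (allow) */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_tgkill, 0, 3),
        /* 2: load the low 32 bits of args[0], the target tgid */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, args[0])),
        /* 3: tgid == self -> jump to 5 (allow), else fall to 4 */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)self, 1, 0),
        /* 4: signalling anyone else fails */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        /* 5: allow */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    memcpy(out, prog, sizeof(prog));
}
```

A forked child inherits this program with the parent's ID still in
instruction 3, so the child's own abort() would be refused, and because
installed filters can only be stacked, not replaced, it cannot repair
that itself.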

Finally, capability mode also turns on strict-relative lookups
process-wide; in other words, every openat(dfd, ...) operation
acts as though it has the O_BENEATH_ONLY flag set, regardless of
whether the dfd is a Capsicum capability.  I can't see a way to
do that with a BPF program (although it would be possible to add
a filter that polices the requirement to include O_BENEATH_ONLY
rather than implicitly adding it).

So although a capability-mode implementation in terms of seccomp-bpf
is tantalizingly close, at the moment I've got it implemented as a new
seccomp mode.

> >  [Policing the rights checks anywhere else, for example at the system
> >  call boundary, isn't a good idea because it opens up the possibility
> >  of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
> >  changed (as openat/close/dup2 are allowed in capability mode) between
> >  the 'check' at syscall entry and the 'use' at fget() invocation.]
>
> In the case of BPF filters, I wonder if you could stash the BPF
> "environment" somewhere and then use it at fget() invocation.
> Alternatively, it can be reconstructed at fget() time, similar to
> your introduction of fgetr().

Stashing something at syscall entry to be referred to later always
makes me worry about TOCTOU vulnerabilities, but the details might
be OK in this case (given that no check occurs at syscall entry)...

> Thanks,
>
> Paolo

Many thanks for taking the time to comment and think of innovative
ideas!

David


* Re: [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
  2014-07-03 18:39   ` David Drysdale
@ 2014-07-04  7:03     ` Paolo Bonzini
  2014-07-07 10:29       ` David Drysdale
  0 siblings, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2014-07-04  7:03 UTC (permalink / raw)
  To: David Drysdale
  Cc: Kees Cook, Greg Kroah-Hartman, Meredydd Luff,
	linux-kernel@vger.kernel.org, qemu-devel, LSM List,
	Alexander Viro, James Morris, Linux API


On 03/07/2014 20:39, David Drysdale wrote:
> On Thu, Jul 03, 2014 at 11:12:33AM +0200, Paolo Bonzini wrote:
>> Given Linux's previous experience with BPF filters, what do you
>> think about attaching specific BPF programs to file descriptors?
>> Then whenever a syscall is run that affects a file descriptor, the
>> BPF program for the file descriptor (attached to a struct file* as
>> in Capsicum) would run in addition to the process-wide filter.
>
> That sounds kind of clever, but also kind of complicated.
>
> Off the top of my head, one particular problem is that not all
> fd->struct file conversions in the kernel are completely specified
> by an enclosing syscall and the explicit values of its parameters.
>
> For example, the actual contents of the arguments to io_submit(2)
> aren't visible to a seccomp-bpf program (as it can't read the __user
> memory for the iocb structures), and so it can't distinguish a
> read from a write.

I think that's more easily done by opening the file as O_RDONLY,
O_WRONLY or O_RDWR.  Otherwise, you could do it by running the file
descriptor's seccomp-bpf program once per iocb, with synthesized
syscall numbers and argument vectors.

BTW, there's one thing I'm not sure I understand (because my knowledge 
of VFS is really only cursory).  Are the capabilities associated with the 
file _descriptor_ (a la F_GETFD/SETFD) or _description_ 
(F_GETFL/SETFL)?

If it is the former, there is some value in read/write capabilities 
because you could for example block a child process from reading an 
eventfd and simulate the two file descriptors returned by pipe(2).  But 
if it is the latter, it looks like an important usability problem in 
the Capsicum model.  (Granted, it's just about usability---in the end 
it does exactly what it's meant and documented to do).

> Also, there could potentially be some odd interactions with file
> descriptors passed between processes, if the BPF program relies
> on assumptions about the environment of the original process.  For
> example, what happens if an x86_64 process passes a filter-attached
> FD to an ia32 process?  Given that the syscall numbers are
> arch-specific, I guess that means the filter program would have
> to include arch-specific branches for any possible variant.

This is the same for using seccompv2 to limit child processes, no?  So 
there may be a problem but it has to be solved anyway by libseccomp.

> More generally, I suspect that keeping things simpler will end
> up being more secure.  Capsicum was based on well-studied ideas
> from the world of object capability-based security, and I'd be
> nervous about adding complications that take us further away from
> that.

True.

> That mapping would also need to be kept closely in sync with the kernel
> and other system libraries -- if a new syscall is added and libc (or
> some other library) started using it, the equivalent BPF chunks would
> need to be updated to cope.

Again, this is the same problem that has to be solved for process-wide 
seccompv2.

>>>  [Capsicum also includes 'capability mode', which locks down the
>>>  available syscalls so the rights restrictions can't just be bypassed
>>>  by opening new file descriptors; I'll describe that separately later.]
>>
>> This can also be implemented in userspace via seccomp and
>> PR_SET_NO_NEW_PRIVS.
>
> Well, mostly (and in fact I've got an attempt to do exactly that at
> https://github.com/google/capsicum-test/blob/dev/linux-bpf-capmode.c).
>
> [..] there's one awkward syscall case.  In capability mode we'd like
> to prevent processes from sending signals with kill(2)/tgkill(2)
> to other processes, but they should still be able to send themselves
> signals.  For example, abort(3) generates:
>   tgkill(gettid(), gettid(), SIGABRT)
>
> Only allowing kill(self) is hard to encode in a seccomp-bpf program, at
> least in a way that survives forking.

I guess the thread id could be added as a special seccomp-bpf argument 
(ancillary datum?).

> Finally, capability mode also turns on strict-relative lookups
> process-wide; in other words, every openat(dfd, ...) operation
> acts as though it has the O_BENEATH_ONLY flag set, regardless of
> whether the dfd is a Capsicum capability.  I can't see a way to
> do that with a BPF program (although it would be possible to add
> a filter that polices the requirement to include O_BENEATH_ONLY
> rather than implicitly adding it).

That can be a new prctl (one that PR_SET_NO_NEW_PRIVS would lock up). 
It seems useful independent of Capsicum, and the Linux APIs tend to be 
fine-grained more often than coarse-grained.

>>>  [Policing the rights checks anywhere else, for example at the system
>>>  call boundary, isn't a good idea because it opens up the possibility
>>>  of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
>>>  changed (as openat/close/dup2 are allowed in capability mode) between
>>>  the 'check' at syscall entry and the 'use' at fget() invocation.]
>>
>> In the case of BPF filters, I wonder if you could stash the BPF
>> "environment" somewhere and then use it at fget() invocation.
>> Alternatively, it can be reconstructed at fget() time, similar to
>> your introduction of fgetr().
>
> Stashing something at syscall entry to be referred to later always
> makes me worry about TOCTOU vulnerabilities, but the details might
> be OK in this case (given that no check occurs at syscall entry)...

Yeah, that was pretty much the idea.  But I was cautious enough to 
label it with "I wonder"...

Paolo


* Re: [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
  2014-07-04  7:03     ` Paolo Bonzini
@ 2014-07-07 10:29       ` David Drysdale
  2014-07-07 12:20         ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread
From: David Drysdale @ 2014-07-07 10:29 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kees Cook, Greg Kroah-Hartman, Meredydd Luff,
	linux-kernel@vger.kernel.org, qemu-devel, LSM List,
	Alexander Viro, James Morris, Linux API

On Fri, Jul 4, 2014 at 8:03 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 03/07/2014 20:39, David Drysdale wrote:
>> On Thu, Jul 03, 2014 at 11:12:33AM +0200, Paolo Bonzini wrote:
>>> Given Linux's previous experience with BPF filters, what do you
>>> think about attaching specific BPF programs to file descriptors?
>>> Then whenever a syscall is run that affects a file descriptor, the
>>> BPF program for the file descriptor (attached to a struct file* as
>>> in Capsicum) would run in addition to the process-wide filter.
>>
>> That sounds kind of clever, but also kind of complicated.
>>
>> Off the top of my head, one particular problem is that not all
>> fd->struct file conversions in the kernel are completely specified
>> by an enclosing syscall and the explicit values of its parameters.
>>
>> For example, the actual contents of the arguments to io_submit(2)
>> aren't visible to a seccomp-bpf program (as it can't read the __user
>> memory for the iocb structures), and so it can't distinguish a
>> read from a write.
>
> I think that's more easily done by opening the file as O_RDONLY/O_WRONLY
> /O_RDWR.   You could do it by running the file descriptor's seccomp-bpf
> program once per iocb with synthesized syscall numbers and argument
> vectors.

Right, but generating the equivalent seccomp input environment for an
equivalent single-fd syscall is going to be subtle and complex (which
are worrying words to mention in a security context).  And how many
other syscalls are going to need similar special-case processing?
(poll? select? send[m]msg? ...)

> BTW, there's one thing I'm not sure I understand (because my knowledge
> of VFS is really only cursory).  Are the capabilities associated with the
> file _descriptor_ (a la F_GETFD/SETFD) or _description_
> (F_GETFL/SETFL)?

Capsicum capabilities are associated with the file descriptor (a la
F_GETFD), not the open file itself -- different FDs with different
associated rights can map to the same underlying open file.
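
The descriptor/description split itself can be seen with plain fcntl(2)
and no Capsicum calls at all.  In the sketch below (the function name
is made up for this example), two descriptors from dup() share status
flags such as O_NONBLOCK but keep separate descriptor flags such as
FD_CLOEXEC; Capsicum rights sit at the same per-descriptor level as the
latter.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if a dup()ed descriptor shares status flags (F_GETFL) but
 * not descriptor flags (F_GETFD) with the original, 0 if not, and -1
 * on error. */
int dup_shares_description_only(void)
{
    int p[2], copy, copy_cloexec, copy_nonblock, flags;

    if (pipe(p) != 0)
        return -1;
    copy = dup(p[0]);
    if (copy < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    /* Per-descriptor flag, set on the original only. */
    fcntl(p[0], F_SETFD, FD_CLOEXEC);

    /* Per-description flag, also set through the original only. */
    flags = fcntl(p[0], F_GETFL);
    fcntl(p[0], F_SETFL, flags | O_NONBLOCK);

    copy_cloexec = fcntl(copy, F_GETFD) & FD_CLOEXEC;   /* stays clear */
    copy_nonblock = fcntl(copy, F_GETFL) & O_NONBLOCK;  /* becomes set */

    close(copy);
    close(p[0]);
    close(p[1]);
    return (copy_cloexec == 0 && copy_nonblock != 0) ? 1 : 0;
}
```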

> If it is the former, there is some value in read/write capabilities
> because you could for example block a child process from reading an
> eventfd and simulate the two file descriptors returned by pipe(2).  But
> if it is the latter, it looks like an important usability problem in
> the Capsicum model.  (Granted, it's just about usability---in the end
> it does exactly what it's meant and documented to do).

Attaching the rights to the FD also comes back to the association with
object-capability security.  The FD is an unforgeable reference to the
object (file) in question, but these references (with their rights) can
be transferred to other programs -- either by inheritance after fork, or
by explicitly passing the FD across a Unix domain socket.

>> Also, there could potentially be some odd interactions with file
>> descriptors passed between processes, if the BPF program relies
>> on assumptions about the environment of the original process.  For
>> example, what happens if an x86_64 process passes a filter-attached
>> FD to an ia32 process?  Given that the syscall numbers are
>> arch-specific, I guess that means the filter program would have
>> to include arch-specific branches for any possible variant.
>
> This is the same for using seccompv2 to limit child processes, no?  So
> there may be a problem but it has to be solved anyway by libseccomp.

I don't know whether libseccomp would worry about this, but being able
to send FDs between processes via Unix domain sockets makes this more
visible in the Capsicum case.

>> More generally, I suspect that keeping things simpler will end
>> up being more secure.  Capsicum was based on well-studied ideas
>> from the world of object capability-based security, and I'd be
>> nervous about adding complications that take us further away from
>> that.
>
> True.
>
>> That mapping would also need to be kept closely in sync with the kernel
>> and other system libraries -- if a new syscall is added and libc (or
>> some other library) started using it, the equivalent BPF chunks would
>> need to be updated to cope.
>
> Again, this is the same problem that has to be solved for process-wide
> seccompv2.

True.  I guess new syscalls are sufficiently rare in practice that this
isn't a serious concern.

>>>>  [Capsicum also includes 'capability mode', which locks down the
>>>>  available syscalls so the rights restrictions can't just be bypassed
>>>>  by opening new file descriptors; I'll describe that separately later.]
>>>
>>> This can also be implemented in userspace via seccomp and
>>> PR_SET_NO_NEW_PRIVS.
>>
>> Well, mostly (and in fact I've got an attempt to do exactly that at
>> https://github.com/google/capsicum-test/blob/dev/linux-bpf-capmode.c).
>>
>> [..] there's one awkward syscall case.  In capability mode we'd like
>> to prevent processes from sending signals with kill(2)/tgkill(2)
>> to other processes, but they should still be able to send themselves
>> signals.  For example, abort(3) generates:
>>   tgkill(gettid(), gettid(), SIGABRT)
>>
>> Only allowing kill(self) is hard to encode in a seccomp-bpf program, at
>> least in a way that survives forking.
>
> I guess the thread id could be added as a special seccomp-bpf argument
> (ancillary datum?).

Yeah, I tried exactly that a while ago
(https://github.com/google/capsicum-linux/commit/e163c6348328)
but didn't run with it because of the process-wide beneath-only issue below.
But a combination of that and your new prctl() suggestion below might do
the trick.

>> Finally, capability mode also turns on strict-relative lookups
>> process-wide; in other words, every openat(dfd, ...) operation
>> acts as though it has the O_BENEATH_ONLY flag set, regardless of
>> whether the dfd is a Capsicum capability.  I can't see a way to
>> do that with a BPF program (although it would be possible to add
>> a filter that polices the requirement to include O_BENEATH_ONLY
>> rather than implicitly adding it).
>
> That can be a new prctl (one that PR_SET_NO_NEW_PRIVS would lock up).
> It seems useful independent of Capsicum, and the Linux APIs tend to be
> fine-grained more often than coarse-grained.

That sounds like a good idea, particularly in combination with the idea
above -- thanks!  I'll have a think/investigate...

>>>>  [Policing the rights checks anywhere else, for example at the system
>>>>  call boundary, isn't a good idea because it opens up the possibility
>>>>  of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
>>>>  changed (as openat/close/dup2 are allowed in capability mode) between
>>>>  the 'check' at syscall entry and the 'use' at fget() invocation.]
>>>
>>> In the case of BPF filters, I wonder if you could stash the BPF
>>> "environment" somewhere and then use it at fget() invocation.
>>> Alternatively, it can be reconstructed at fget() time, similar to
>>> your introduction of fgetr().
>>
>> Stashing something at syscall entry to be referred to later always
>> makes me worry about TOCTOU vulnerabilities, but the details might
>> be OK in this case (given that no check occurs at syscall entry)...
>
> Yeah, that was pretty much the idea.  But I was cautious enough to
> label it with "I wonder"...
>
> Paolo


* Re: [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
  2014-07-07 10:29       ` David Drysdale
@ 2014-07-07 12:20         ` Paolo Bonzini
  2014-07-07 14:11           ` David Drysdale
  2014-07-07 22:33           ` Alexei Starovoitov
  0 siblings, 2 replies; 10+ messages in thread
From: Paolo Bonzini @ 2014-07-07 12:20 UTC (permalink / raw)
  To: David Drysdale
  Cc: Kees Cook, Greg Kroah-Hartman, Meredydd Luff,
	linux-kernel@vger.kernel.org, qemu-devel, LSM List,
	Alexander Viro, James Morris, Linux API

On 07/07/2014 12:29, David Drysdale wrote:
>> I think that's more easily done by opening the file as O_RDONLY/O_WRONLY
>> /O_RDWR.   You could do it by running the file descriptor's seccomp-bpf
>> program once per iocb with synthesized syscall numbers and argument
>> vectors.
>
> Right, but generating the equivalent seccomp input environment for an
> equivalent single-fd syscall is going to be subtle and complex (which
> are worrying words to mention in a security context).  And how many
> other syscalls are going to need similar special-case processing?
> (poll? select? send[m]msg? ...)

Yeah, the difficult part is getting the right balance between:

1) limitations due to seccomp's inability to chase pointers (which 
is not something that can be lifted, as it's required for correctness)

2) subtlety and complexity of the resulting code.

The problem arises when you have a single syscall operating on 
multiple file descriptors.  So for example, among the cases you mention, 
poll and select are problematic; sendm?msg are not.  They would be if 
Capsicum had a capability for SCM_RIGHTS file descriptor passing, but I 
cannot find one.

But then you also have to strike the right balance between a complete 
solution and an overengineered one.

For example, even though poll and select are problematic, I wonder what 
the point of blocking them would really be; poll/select are 
level-triggered, and calling them should be idempotent as far as the 
file descriptor is concerned.  You might want to prevent a process/thread 
from issuing blocking system calls, but you'd do that with a per-process 
filter, not with per-file-descriptor filters or capabilities.
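
As a small illustration of the level-triggered point (a sketch only;
the helper name count_pollin_reports() is invented for it): polling a
readable descriptor any number of times reports POLLIN every time and
does not change the descriptor's state.

```c
#include <poll.h>

/* Poll 'fd' 'rounds' times with a zero timeout and count how often it
 * is reported readable.  Nothing is read, so the fd state is unchanged. */
int count_pollin_reports(int fd, int rounds)
{
    int hits = 0;

    for (int i = 0; i < rounds; i++) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        if (poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN))
            hits++;
    }
    return hits;
}
```

With one byte pending in a pipe, three rounds give three reports; only
an actual read() makes the count drop back to zero.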

> Capsicum capabilities are associated with the file descriptor (a la
> F_GETFD), not the open file itself -- different FDs with different
> associated rights can map to the same underlying open file.

Good to know, thanks.  I suppose you have testcases that cover this.

Paolo


* Re: [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
  2014-07-07 12:20         ` Paolo Bonzini
@ 2014-07-07 14:11           ` David Drysdale
  2014-07-07 22:33           ` Alexei Starovoitov
  1 sibling, 0 replies; 10+ messages in thread
From: David Drysdale @ 2014-07-07 14:11 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kees Cook, Greg Kroah-Hartman, Meredydd Luff,
	linux-kernel@vger.kernel.org, qemu-devel, LSM List,
	Alexander Viro, James Morris, Linux API

On Mon, Jul 7, 2014 at 1:20 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 07/07/2014 12:29, David Drysdale wrote:
>> Capsicum capabilities are associated with the file descriptor (a la
>> F_GETFD), not the open file itself -- different FDs with different
>> associated rights can map to the same underlying open file.
>
>
> Good to know, thanks.  I suppose you have testcases that cover this.
>
> Paolo

Yeah, there are lots of tests at:
  https://github.com/google/capsicum-test
(which is in a separate repo so it's easy to run against
FreeBSD as well as the Linux code); in particular
  https://github.com/google/capsicum-test/blob/dev/capability-fd.cc
has various interactions of capability FDs.
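
Independently of that test suite, the descriptor versus open-file split
can be sketched with plain POSIX calls.  This is only an analogy:
Capsicum rights are not used here, and the function names are invented
for the illustration.  FD_CLOEXEC (F_GETFD/F_SETFD) lives on the
descriptor, where the quoted text says Capsicum rights also live, while
O_NONBLOCK (F_GETFL/F_SETFL) lives on the shared open file.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if not, -1 on error. */
int has_cloexec(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}

/* Returns 1 if O_NONBLOCK is set on the open file behind fd, 0 if not, -1 on error. */
int has_nonblock(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return (flags & O_NONBLOCK) != 0;
}

/* Two descriptors for one open file: descriptor flags differ, status
 * flags are shared. Returns 1 if that is what we observe. */
int demo_fd_vs_open_file(void)
{
    int p[2], a, b, fl, ok;

    if (pipe(p) < 0)
        return 0;
    a = p[0];
    b = dup(a);              /* second descriptor, same open file */
    if (b < 0) {
        close(p[0]);
        close(p[1]);
        return 0;
    }

    /* Per-descriptor state: set on a only, b is unaffected. */
    ok = fcntl(a, F_SETFD, FD_CLOEXEC) == 0;
    ok = ok && has_cloexec(a) == 1 && has_cloexec(b) == 0;

    /* Per-open-file state: set through a, visible through b. */
    fl = fcntl(a, F_GETFL);
    ok = ok && fl >= 0 && fcntl(a, F_SETFL, fl | O_NONBLOCK) == 0;
    ok = ok && has_nonblock(a) == 1 && has_nonblock(b) == 1;

    close(b);
    close(p[0]);
    close(p[1]);
    return ok;
}
```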

* Re: [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
  2014-07-07 12:20         ` Paolo Bonzini
  2014-07-07 14:11           ` David Drysdale
@ 2014-07-07 22:33           ` Alexei Starovoitov
  2014-07-08 14:58             ` Kees Cook
  2014-08-16 15:41             ` Pavel Machek
  1 sibling, 2 replies; 10+ messages in thread
From: Alexei Starovoitov @ 2014-07-07 22:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kees Cook, Greg Kroah-Hartman, Meredydd Luff,
	linux-kernel@vger.kernel.org, qemu-devel, LSM List,
	Alexander Viro, James Morris, Linux API, David Drysdale

On Mon, Jul 7, 2014 at 5:20 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 07/07/2014 12:29, David Drysdale ha scritto:
>
>>> I think that's more easily done by opening the file as O_RDONLY/O_WRONLY
>>> /O_RDWR.   You could do it by running the file descriptor's seccomp-bpf
>>> program once per iocb with synthesized syscall numbers and argument
>>> vectors.
>>
>>
>> Right, but generating the equivalent seccomp input environment for an
>> equivalent single-fd syscall is going to be subtle and complex (which
>> are worrying words to mention in a security context).  And how many
>> other syscalls are going to need similar special-case processing?
>> (poll? select? send[m]msg? ...)
>
>
> Yeah, the difficult part is getting the right balance between:
>
> 1) limitations due to seccomp's impossibility to chase pointers (which is
> not something that can be lifted, as it's required for correctness)

btw, once seccomp moves to eBPF it will be able to 'chase pointers',
since pointer walking will be possible via a bpf_load_pointer() function call,
which is a wrapper around:
  probe_kernel_read(&ptr, unsafe_ptr, sizeof(void *));
  return ptr;
Not sure whether it helps this case or not. Just fyi.

> 2) subtlety and complexity of the resulting code.
>
> The problem arises when you have a single syscall operating on
> multiple file descriptors.  So for example, among the cases you mention, poll
> and select are problematic; sendm?msg are not.  They would be if Capsicum
> had a capability for SCM_RIGHTS file descriptor passing, but I cannot find
> it.
>
> But then you also have to strike the right balance between a complete
> solution and an overengineered one.
>
> For example, even though poll and select are problematic, I wonder what
> the point of blocking them would really be; poll/select are level-triggered,
> and calling them should be idempotent as far as the file descriptor is
> concerned.  You might want to prevent a process/thread from issuing blocking
> system calls, but you'd do that with a per-process filter, not with
> per-file-descriptor filters or capabilities.
>
>
>> Capsicum capabilities are associated with the file descriptor (a la
>> F_GETFD), not the open file itself -- different FDs with different
>> associated rights can map to the same underlying open file.
>
>
> Good to know, thanks.  I suppose you have testcases that cover this.
>
> Paolo

* Re: [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
  2014-07-07 22:33           ` Alexei Starovoitov
@ 2014-07-08 14:58             ` Kees Cook
  2014-08-16 15:41             ` Pavel Machek
  1 sibling, 0 replies; 10+ messages in thread
From: Kees Cook @ 2014-07-08 14:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Greg Kroah-Hartman, Meredydd Luff, linux-kernel@vger.kernel.org,
	qemu-devel, LSM List, Alexander Viro, James Morris, Linux API,
	Paolo Bonzini, David Drysdale

On Mon, Jul 7, 2014 at 3:33 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Jul 7, 2014 at 5:20 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Il 07/07/2014 12:29, David Drysdale ha scritto:
>>
>>>> I think that's more easily done by opening the file as O_RDONLY/O_WRONLY
>>>> /O_RDWR.   You could do it by running the file descriptor's seccomp-bpf
>>>> program once per iocb with synthesized syscall numbers and argument
>>>> vectors.
>>>
>>>
>>> Right, but generating the equivalent seccomp input environment for an
>>> equivalent single-fd syscall is going to be subtle and complex (which
>>> are worrying words to mention in a security context).  And how many
>>> other syscalls are going to need similar special-case processing?
>>> (poll? select? send[m]msg? ...)
>>
>>
>> Yeah, the difficult part is getting the right balance between:
>>
>> 1) limitations due to seccomp's impossibility to chase pointers (which is
>> not something that can be lifted, as it's required for correctness)
>
> btw once seccomp moves to eBPF it will be able to 'chase pointers',
> since pointer walking will be possible via bpf_load_pointer() function call,
> which is a wrapper of:
>   probe_kernel_read(&ptr, unsafe_ptr, sizeof(void *));
>   return ptr;
> Not sure whether it helps this case or not. Just fyi.

It won't immediately help, since threads can race pointer target
contents (i.e. seccomp sees one thing, and then the syscall sees
another thing). Having an immutable memory area could help with this
(i.e. some kind of "locked" memory range that holds all the "approved"
argument strings, at which point seccomp could then trust the chased
pointers that land in this range.) Obviously eBPF is a prerequisite to
this, but it isn't the full solution, as far as I understand it.
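
The race can be modelled in plain userspace C.  This is a deterministic,
single-threaded simulation rather than seccomp or kernel code: the
callback stands in for a second thread that rewrites the buffer between
the check and the use, and a private copy stands in for the "locked"
range, since both give the check and the use the same bytes.  All names
(policy_allows, check_then_use_shared, and so on) are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define DEMO_PATH_MAX 64

/* Stand-in for "another thread" that may rewrite the caller's buffer. */
typedef void (*interfere_fn)(char *buf);

/* Hypothetical policy: only paths under /tmp/ are approved. */
int policy_allows(const char *path)
{
    return strncmp(path, "/tmp/", 5) == 0;
}

/* Racy pattern: the check and the use both dereference the caller's
 * pointer. Returns 1 only if the path finally used is still approved. */
int check_then_use_shared(char *user_buf, interfere_fn interfere)
{
    assert(user_buf != NULL);
    if (!policy_allows(user_buf))       /* what the filter sees */
        return 0;
    interfere(user_buf);                /* window between check and use */
    return policy_allows(user_buf);     /* what the syscall would see */
}

/* Safe pattern: take one private copy, then check and use that copy. */
int check_then_use_snapshot(char *user_buf, interfere_fn interfere)
{
    char snap[DEMO_PATH_MAX];

    assert(user_buf != NULL);
    strncpy(snap, user_buf, sizeof(snap) - 1);
    snap[sizeof(snap) - 1] = '\0';
    if (!policy_allows(snap))
        return 0;
    interfere(user_buf);                /* can no longer affect snap */
    return policy_allows(snap);
}

/* Example interference; buf must hold at least 12 bytes. */
void swap_to_etc_shadow(char *buf)
{
    strcpy(buf, "/etc/shadow");
}
```

With a buffer that starts as "/tmp/ok", the shared variant ends up
acting on "/etc/shadow" and returns 0, while the snapshot variant
returns 1.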

-Kees

-- 
Kees Cook
Chrome OS Security

* Re: [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
  2014-07-07 22:33           ` Alexei Starovoitov
  2014-07-08 14:58             ` Kees Cook
@ 2014-08-16 15:41             ` Pavel Machek
  1 sibling, 0 replies; 10+ messages in thread
From: Pavel Machek @ 2014-08-16 15:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kees Cook, Greg Kroah-Hartman, Meredydd Luff,
	linux-kernel@vger.kernel.org, qemu-devel, LSM List,
	Alexander Viro, James Morris, Linux API, Paolo Bonzini,
	David Drysdale

Hi!

> >>> I think that's more easily done by opening the file as O_RDONLY/O_WRONLY
> >>> /O_RDWR.   You could do it by running the file descriptor's seccomp-bpf
> >>> program once per iocb with synthesized syscall numbers and argument
> >>> vectors.
> >>
> >>
> >> Right, but generating the equivalent seccomp input environment for an
> >> equivalent single-fd syscall is going to be subtle and complex (which
> >> are worrying words to mention in a security context).  And how many
> >> other syscalls are going to need similar special-case processing?
> >> (poll? select? send[m]msg? ...)
> >
> >
> > Yeah, the difficult part is getting the right balance between:
> >
> > 1) limitations due to seccomp's impossibility to chase pointers (which is
> > not something that can be lifted, as it's required for correctness)
> 
> btw once seccomp moves to eBPF it will be able to 'chase pointers',
> since pointer walking will be possible via bpf_load_pointer() function call,
> which is a wrapper of:

Even if you could make Capsicum work with eBPF... please don't.

Capsicum is a kind of obvious, elegant solution. BPF is quite
complex. And security semantics should not be pushed to userspace...

						Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Thread overview: 10+ messages
     [not found] <1404124096-21445-1-git-send-email-drysdale@google.com>
2014-07-03  9:12 ` [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) Paolo Bonzini
2014-07-03 10:01   ` Loganaden Velvindron
2014-07-03 18:39   ` David Drysdale
2014-07-04  7:03     ` Paolo Bonzini
2014-07-07 10:29       ` David Drysdale
2014-07-07 12:20         ` Paolo Bonzini
2014-07-07 14:11           ` David Drysdale
2014-07-07 22:33           ` Alexei Starovoitov
2014-07-08 14:58             ` Kees Cook
2014-08-16 15:41             ` Pavel Machek
