Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

public inbox for linux-kernel@vger.kernel.org

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
       [not found]                   ` <20110524200815.GD27634@elte.hu>
@ 2011-05-24 20:25                     ` Kees Cook
  2011-05-25 19:09                       ` Ingo Molnar
  2011-05-25 16:40                     ` Will Drewry
  1 sibling, 1 reply; 91+ messages in thread
From: Kees Cook @ 2011-05-24 20:25 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Will Drewry, linux-kernel

[CC trimmed, as recommended]

Hi,

On Tue, May 24, 2011 at 10:08:15PM +0200, Ingo Molnar wrote:
> * Will Drewry <wad@chromium.org> wrote:
> 
> > The change avoids defining a new trace call type or a huge number of internal 
> > changes and hides seccomp.mode=2 from ABI-exposure in prctl, but the attack 
> > surface is non-trivial to verify, and I'm not sure if this ABI change makes 
> > sense. It amounts to:
> > 
> >  include/linux/ftrace_event.h  |    4 +-
> >  include/linux/perf_event.h    |   10 +++++---
> >  kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
> >  kernel/seccomp.c              |    8 ++++++
> >  kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
> >  5 files changed, 82 insertions(+), 16 deletions(-)
> > 
> > And can be found here: http://static.dataspill.org/perf_secure/v1/
> 
> Wow, i'm very impressed how few changes you needed to do to support this!
> [...]
> attr.require_secure: this is basically used to *force* the creation of 
> security-controlling filters, right? It seems to me that this could be done via 
> a seccomp ABI extension as well, without adding this to the perf ABI. That 
> seccomp call could check whether the right events are created and move the task 
> to mode 2 only if that prereq is met - or something like that.

I understood the prctl() API that was outlined earlier, but it seems
this is not going to happen now. What would the programming API actually
look like for an application developer using this perf-style method?

-Kees

-- 
Kees Cook
Ubuntu Security Team

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-24 20:25                     ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Kees Cook
@ 2011-05-25 19:09                       ` Ingo Molnar
  0 siblings, 0 replies; 91+ messages in thread
From: Ingo Molnar @ 2011-05-25 19:09 UTC (permalink / raw)
  To: Kees Cook; +Cc: Will Drewry, linux-kernel


* Kees Cook <kees.cook@canonical.com> wrote:

> [CC trimmed, as recommended]
> 
> Hi,
> 
> On Tue, May 24, 2011 at 10:08:15PM +0200, Ingo Molnar wrote:
> > * Will Drewry <wad@chromium.org> wrote:
> > 
> > > The change avoids defining a new trace call type or a huge number of internal 
> > > changes and hides seccomp.mode=2 from ABI-exposure in prctl, but the attack 
> > > surface is non-trivial to verify, and I'm not sure if this ABI change makes 
> > > sense. It amounts to:
> > > 
> > >  include/linux/ftrace_event.h  |    4 +-
> > >  include/linux/perf_event.h    |   10 +++++---
> > >  kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
> > >  kernel/seccomp.c              |    8 ++++++
> > >  kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
> > >  5 files changed, 82 insertions(+), 16 deletions(-)
> > > 
> > > And can be found here: http://static.dataspill.org/perf_secure/v1/
> > 
> > Wow, i'm very impressed how few changes you needed to do to support this!
> > [...]
> > attr.require_secure: this is basically used to *force* the creation of 
> > security-controlling filters, right? It seems to me that this could be done via 
> > a seccomp ABI extension as well, without adding this to the perf ABI. That 
> > seccomp call could check whether the right events are created and move the task 
> > to mode 2 only if that prereq is met - or something like that.
> 
> I understood the prctl() API that was outlined earlier, but it 
> seems this is not going to happen now. What would the programming 
> API actually look like for an application developer using this 
> perf-style method?

Well, this API is probably not going to happen either ;-)

The way to do it is to create a perf event and install a filter by 
passing the filter's ASCII string as a pointer to the kernel, using 
PERF_EVENT_IOC_SET_FILTER on the event fd. If this ever gets used 
seriously then it should probably move into its own system call - but 
that is a detail.

Filters can be installed safely by unprivileged user-space (the 
kernel checks them), and they are inherited across fork(), are 
properly per task, etc.
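
A minimal userspace sketch of the mechanism described above, for
orientation only. It assumes a Linux system with <linux/perf_event.h>, and
that the caller has already looked up the tracepoint's numeric id (for
example from the event's "id" file under tracefs). The helper names
build_filter() and open_filtered_event() are hypothetical. Note that in
mainline kernels a filter attached this way only selects which events are
recorded; the syscall-denying behaviour discussed in this thread is the
proposal, not something this code provides. Whether an unprivileged task
may open the event also depends on system configuration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Compose an ftrace-style filter expression such as "fd == 1".
 * Returns 0 on success, -1 if the result does not fit in buf. */
int build_filter(char *buf, size_t len, const char *field,
                 const char *op, long value)
{
    assert(buf != NULL && field != NULL && op != NULL);
    int n = snprintf(buf, len, "%s %s %ld", field, op, value);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Open a tracepoint event on the calling task and attach `filter` to it.
 * tracepoint_id is assumed to have been looked up by the caller.
 * Returns the event fd, or -1 with errno set. */
int open_filtered_event(unsigned long long tracepoint_id, const char *filter)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_TRACEPOINT;
    attr.size = sizeof(attr);
    attr.config = tracepoint_id;
    attr.inherit = 1;   /* children created by fork() inherit the event */

    /* pid = 0, cpu = -1: measure the calling task on any CPU. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;

    /* The filter is passed to the kernel as a pointer to an ASCII string. */
    if (ioctl(fd, PERF_EVENT_IOC_SET_FILTER, filter) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

For example, build_filter(buf, sizeof(buf), "fd", "==", 1) produces the
string "fd == 1", which could then be handed to open_filtered_event()
together with the id of a syscall-entry tracepoint.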

Thanks,

	Ingo

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
       [not found]                   ` <20110524200815.GD27634@elte.hu>
  2011-05-24 20:25                     ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Kees Cook
@ 2011-05-25 16:40                     ` Will Drewry
  1 sibling, 0 replies; 91+ messages in thread
From: Will Drewry @ 2011-05-25 16:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, Frederic Weisbecker, James Morris,
	linux-kernel, Eric Paris, kees.cook, Serge E. Hallyn, Ingo Molnar,
	Thomas Gleixner

[trimmed the cc list to those who've expressed interest or strong opinions]

On Tue, May 24, 2011 at 3:08 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Will Drewry <wad@chromium.org> wrote:
>
>> The change avoids defining a new trace call type or a huge number of internal
>> changes and hides seccomp.mode=2 from ABI-exposure in prctl, but the attack
>> surface is non-trivial to verify, and I'm not sure if this ABI change makes
>> sense. It amounts to:
>>
>>  include/linux/ftrace_event.h  |    4 +-
>>  include/linux/perf_event.h    |   10 +++++---
>>  kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
>>  kernel/seccomp.c              |    8 ++++++
>>  kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
>>  5 files changed, 82 insertions(+), 16 deletions(-)
>>
>> And can be found here: http://static.dataspill.org/perf_secure/v1/
>
> Wow, i'm very impressed how few changes you needed to do to support this!

There's definitely a large overlap in the needed infrastructure (event
inheritance, event enabling, event registration) which makes changes
using it pretty small!

> So, firstly, i don't think we should change perf_tp_event() at all - the
> 'observer' codepaths should be unaffected.

Easy enough to do.

> But there could be a perf_tp_event_ret() or perf_tp_event_check() entry that
> code like seccomp which wants to use event results can use.
>
> Also, i'm not sure about the seccomp details and assumptions that were moved
> into the perf core. How about passing in a helper function to
> perf_tp_event_check(), where seccomp would define its seccomp specific helper
> function?

I'm curious how we'd register the seccomp callback and/or know when to
call it.  Would it just be a task-wide callback for task_context
events?  If so, will that mean only one callback_event user will be
able to function at a time (per task)?  [That's fine for seccomp
purposes, of course, but may incur further internal api pain later
unless it uses the list iterator function model that ftrace uses.]

> That looks sufficiently flexible. That helper function could be an 'extra
> filter' kind of thing, right?
>
> Also, regarding the ABI and the attr.err_on_discard and attr.require_secure
> bits, they look a bit too specific as well.
>
> attr.err_on_discard: with the filter helper function passed in this is probably
> not needed anymore, right?
>
> attr.require_secure: this is basically used to *force* the creation of
> security-controlling filters, right? It seems to me that this could be done via
> a seccomp ABI extension as well, without adding this to the perf ABI. That
> seccomp call could check whether the right events are created and move the task
> to mode 2 only if that prereq is met - or something like that.

Yeah - this can all be achieved with 0 changes to the perf path
entirely.  The part that causes me concern is the idea of reusing perf
events for a task that may or may not have been installed by the
caller of prctl(PR_SET_SECCOMP, 2) without any clear demarcation to
the API/ABI consumer.  It seems that to use this interface an
application will need to:
1. via libperf, drop all task-centric traces
2. via libperf, install some perf_events for the "allowed" events
3. call prctl(PR_SET_SECCOMP, 2).

Even then, it isn't necessarily clear to the user which events will
have a security impact and which won't.  It creates an implicit ABI
(if that makes sense) for which it will be hard to ensure strong
security guarantees as perf changes.  It'd be possible to still do a
(PR_SET_SECCOMP_FILTER, eventid) approach to indicate which events are
expected, but at that point I'd be tempted to leave libperf out of it
again :)  Regardless, I'll pull together the patch as I understand the
proposal since code can be a bit easier to digest than the discussion.
It seems to me that it's going to be hard to stay away from explicit
ABI changes, but perhaps I'm misunderstanding the ultimate direction
still.

If I'm way off base, please let me know :)  It'll save some extra patch-churn.

Thanks!
will

[parent not found: <1306254027.18455.47.camel@twins>]

[parent not found: <20110524195435.GC27634@elte.hu>]

[parent not found: <alpine.LFD.2.02.1105242239230.3078@ionos>]

[parent not found: <20110525150153.GE29179@elte.hu>]

[parent not found: <alpine.LFD.2.02.1105251836030.3078@ionos>]

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
       [not found]                           ` <alpine.LFD.2.02.1105251836030.3078@ionos>
@ 2011-05-25 18:01                             ` Kees Cook
  2011-05-25 18:42                               ` Linus Torvalds
  0 siblings, 1 reply; 91+ messages in thread
From: Kees Cook @ 2011-05-25 18:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Peter Zijlstra, Will Drewry, Steven Rostedt,
	linux-kernel

Hi,

On Wed, May 25, 2011 at 07:48:51PM +0200, Thomas Gleixner wrote:
> On Wed, 25 May 2011, Ingo Molnar wrote:
> > * Thomas Gleixner <tglx@linutronix.de> wrote:
> > > On Tue, 24 May 2011, Ingo Molnar wrote:
> > > > * Peter Zijlstra <peterz@infradead.org> wrote:
> > > > 
> > > > > On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
> > > > > >  include/linux/ftrace_event.h  |    4 +-
> > > > > >  include/linux/perf_event.h    |   10 +++++---
> > > > > >  kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
> > > > > >  kernel/seccomp.c              |    8 ++++++
> > > > > >  kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
> > > > > >  5 files changed, 82 insertions(+), 16 deletions(-) 
> > > > > 
> > > > > I strongly oppose to the perf core being mixed with any sekurity voodoo
> > > > > (or any other active role for that matter).
> > > > 
> > > > I'd object to invisible side-effects as well, and vehemently so. But note how 
> > > > intelligently it's used here: it's explicit in the code, it's used explicitly 
> > > > in kernel/seccomp.c and the event generation place in 
> > > > kernel/trace/trace_syscalls.c.
> > > > 
> > > > So this is a really flexible solution IMO and does not extend events with some 
> > > > invisible 'active' role. It extends the *call site* with an open-coded active 
> > > > role - which active role btw. already pre-existed.
> > > 
> > > We do _NOT_ make any decision based on the trace point so what's the
> > > "pre-existing" active role in the syscall entry code?
> > 
> > The seccomp code we are discussing in this thread.
> 
> That's proposed code and has absolutely nothing to do with the
> existing trace point semantics.
>  
> > > I'm all for code reuse and reuse of interfaces, but this is completely
> > > wrong. Instrumentation and security decisions are two fundamentally
> > > different things and we want them kept separate. Instrumentation is
> > > not meant to make decisions. Just because we can does not mean that it
> > > is a good idea.
> > 
> > Instrumentation does not 'make decisions': the calling site, which is 
> > already emitting both the event and wants to do decisions based on 
> > the data that also generates the event wants to do decisions.
> 
> You can repeat that as often as you want, it does not make it more
> true. Fact is that the decision is made in the middle of the perf code.

Can we just go back to the original spec? A lot of people were excited
about the prctl() API as done in Will's earlier patchset, we don't lose the
extremely useful "enable_on_exec" feature, and we can get away from all
this disagreement.

-Kees

-- 
Kees Cook
Ubuntu Security Team

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 18:01                             ` Kees Cook
@ 2011-05-25 18:42                               ` Linus Torvalds
  2011-05-25 19:06                                 ` Ingo Molnar
                                                   ` (3 more replies)
  0 siblings, 4 replies; 91+ messages in thread
From: Linus Torvalds @ 2011-05-25 18:42 UTC (permalink / raw)
  To: Kees Cook
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Will Drewry,
	Steven Rostedt, linux-kernel

On Wed, May 25, 2011 at 11:01 AM, Kees Cook <kees.cook@canonical.com> wrote:
>
> Can we just go back to the original spec? A lot of people were excited
> about the prctl() API as done in Will's earlier patchset, we don't lose the
> extremely useful "enable_on_exec" feature, and we can get away from all
> this disagreement.

.. and quite frankly, I'm not even convinced about the original simpler spec.

Security is a morass. People come up with cool ideas every day, and
nobody actually uses them - or if they use them, they are just a
maintenance nightmare.

Quite frankly, limiting pathname access by some prefix is "cool", but
it's basically useless.

That's not where security problems are.

Security problems are in the odd corners - ioctl's, /proc files,
random small interfaces that aren't just about file access.

And who would *use* this thing in real life? Nobody. In order to sell
me on a new security interface, give me a real actual use case that is
security-conscious and relevant to real users.

For things like web servers that actually want to limit filename
lookup, we'd be *much* better off with a few new flags to
pathname lookup that say "don't follow symlinks" and "don't follow
'..'". Things like that can actually be beneficial to
security-conscious programming, with very little overhead. Some of
those things currently look up pathnames one component at a time,
because they can't afford to not do so. That's a *much* better model
for the whole "only limit to this subtree" case that was quoted
sometime early in this thread.
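
The component-at-a-time lookup mentioned above can be sketched roughly as
follows. This is an illustration under assumptions, not code from any of
the servers being discussed: open_confined() is a hypothetical helper that
walks a relative path with openat(), refusing ".." and refusing to follow
symlinks at every step. It shows why such code has to pay one system call
per component in the absence of lookup flags that do the same job.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Open `path` (relative, no O_CREAT in `flags`) below the directory
 * referred to by rootfd, one component at a time.  ".." is rejected and
 * no symlink is followed.  Returns an fd, or -1 with errno set. */
int open_confined(int rootfd, const char *path, int flags)
{
    assert(path != NULL);
    if (path[0] == '\0' || path[0] == '/') {
        errno = EINVAL;             /* only non-empty relative paths */
        return -1;
    }

    char *copy = strdup(path);      /* strtok_r() modifies its input */
    if (copy == NULL)
        return -1;

    int dirfd = dup(rootfd);        /* private fd we may close freely */
    if (dirfd < 0) {
        int saved = errno;
        free(copy);
        errno = saved;
        return -1;
    }

    char *save = NULL;
    char *comp = strtok_r(copy, "/", &save);
    while (comp != NULL) {
        char *next = strtok_r(NULL, "/", &save);
        int fd;

        if (strcmp(comp, "..") == 0) {
            fd = -1;
            errno = EACCES;         /* never walk upwards */
        } else {
            /* Intermediate components must be real directories; the
             * caller's flags apply only to the final component. */
            int f = next ? (O_RDONLY | O_DIRECTORY) : flags;
            fd = openat(dirfd, comp, f | O_NOFOLLOW | O_CLOEXEC);
        }

        int saved = errno;
        close(dirfd);
        if (fd < 0) {
            free(copy);
            errno = saved;
            return -1;
        }
        dirfd = fd;
        comp = next;
    }

    free(copy);
    return dirfd;
}
```

A caller would keep one fd open on the served subtree and route every
request path through open_confined(rootfd, request_path, O_RDONLY).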

And per-system-call permissions are very dubious. What system calls
don't you want to succeed? That ioctl? You just made it impossible to
do a modern graphical application. Yet the kind of thing where we
would _want_ to help users is in making it easier to sandbox something
like the adobe flash player. But without accelerated direct rendering,
that's not going to fly, is it?

So I'm sorry for throwing cold water on you guys, but the whole "let's
come up with a new security gadget" thing just makes me go "oh no, not
again".

                    Linus

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 18:42                               ` Linus Torvalds
@ 2011-05-25 19:06                                 ` Ingo Molnar
  2011-05-25 19:54                                   ` Will Drewry
  2011-05-25 19:11                                 ` Kees Cook
                                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 91+ messages in thread
From: Ingo Molnar @ 2011-05-25 19:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Thomas Gleixner, Peter Zijlstra, Will Drewry,
	Steven Rostedt, linux-kernel


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And per-system-call permissions are very dubious. What system calls 
> don't you want to succeed? That ioctl? You just made it impossible 
> to do a modern graphical application. Yet the kind of thing where 
> we would _want_ to help users is in making it easier to sandbox 
> something like the adobe flash player. But without accelerated 
> direct rendering, that's not going to fly, is it?

I was under the impression that Will had a very specific application 
in mind which actually works today and uses the inferior version of 
seccomp.

Will, mind filling us in on that?

I'd agree that adding any of this without a real serious app making 
real use of it would be pointless. I discussed this under the 
impression that the app existed :-)

I also got the very distinct impression from the various iterations 
that a real usecase existed behind it - all the fixes and 
considerations looked very realistic, not designed up for security's 
sake.

> So I'm sorry for throwing cold water on you guys, but the whole 
> "let's come up with a new security gadget" thing just makes me go 
> "oh no, not again".

Fair enough :-)

Thanks,

	Ingo

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 19:06                                 ` Ingo Molnar
@ 2011-05-25 19:54                                   ` Will Drewry
  0 siblings, 0 replies; 91+ messages in thread
From: Will Drewry @ 2011-05-25 19:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Kees Cook, Thomas Gleixner, Peter Zijlstra,
	Steven Rostedt, linux-kernel

On Wed, May 25, 2011 at 2:06 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> And per-system-call permissions are very dubious. What system calls
>> don't you want to succeed? That ioctl? You just made it impossible
>> to do a modern graphical application. Yet the kind of thing where
>> we would _want_ to help users is in making it easier to sandbox
>> something like the adobe flash player. But without accelerated
>> direct rendering, that's not going to fly, is it?
>
> I was under the impression that Will had a very specific application
> in mind which actually works today and uses the inferior version of
> seccomp.
>
> Will, mind filling us in on that?

With pleasure!  I'll be a bit overly verbose to ensure I'm covering my
bases; I hope it's not too tedious.

Support for using system call filtering will be added to the Chromium
browser if it is accepted here.  At present, Chromium separates the
processing of untrusted input (html, javascript, images) into
standalone renderer processes.  In an effort to reduce the risks
associated with processing the data we put those renderers in a chroot
with a private VFS and PID namespace. This limits the ability for a
compromised renderer to signal() another process outside of the
"sandbox" or access files it shouldn't.

Ideally, the only exposed surface to the renderer would be the IPC
mechanism, memory allocation, etc.  That isn't possible today though
[*].  The renderer gets the whole syscall ABI.  In many cases, adding
support for (all of the) LSMs to the sandboxing methodology would help
mitigate the exposure.  There would be the code paths that handle the
user input prior to calling the LSM hooks, but after that point, the
renderer could be denied, shutdown, etc.  Unfortunately, there's no
one-to-one mapping from system calls to LSM hooks (nor do all stock
kernels from distros come with a pre-chosen and configured LSM).

To supply some concreteness, the perf_counter_open() system call comes
to mind.  It suffered from a stack-based buffer overflow when
processing the user-supplied arguments, and there was no effective
mechanism, LSM or otherwise, to prevent its access.  In my usecase, if
only a whitelist of required system calls was made available to the
Chromium renderer processes, then the addition of a bug like
perf_counter_open()'s to the kernel would not have provided a direct
means to escape the user-level sandboxing and execute arbitrary code
in the kernel.

As I mentioned, if it is possible to expand seccomp to provide a
system call access mechanism (bitmask, whatever),  I will expand the
Chromium sandbox to make use of it on every linux distro that ships
with it enabled.  In addition, my immediate work focus is on Chromium
OS.  I would like to apply system call filtering to every daemon in
the distribution alongside additional security defenses.  Also, I am
aware of many server-side uses but can't promise immediate deployment
in the same fashion.
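
To make the "bitmask, whatever" idea concrete, here is a small userspace
model of a per-task system call allowlist. It is only a sketch of the data
structure: struct syscall_mask, MAX_SYSCALLS and the mask_*() helpers are
hypothetical names, and nothing here is taken from the seccomp patches
themselves. A kernel-side check would consult such a mask at syscall entry.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>

#define MAX_SYSCALLS  512   /* assumed upper bound on syscall numbers */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* One bit per system call number; a set bit means "allowed". */
struct syscall_mask {
    unsigned long bits[(MAX_SYSCALLS + BITS_PER_WORD - 1) / BITS_PER_WORD];
};

/* Start from "deny everything". */
void mask_init(struct syscall_mask *m)
{
    assert(m != NULL);
    memset(m, 0, sizeof(*m));
}

/* Allow syscall `nr`.  Returns 0, or -1 if nr is out of range. */
int mask_allow(struct syscall_mask *m, long nr)
{
    assert(m != NULL);
    if (nr < 0 || nr >= MAX_SYSCALLS)
        return -1;
    m->bits[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
    return 0;
}

/* Returns 1 if syscall `nr` is allowed, 0 otherwise. */
int mask_permits(const struct syscall_mask *m, long nr)
{
    assert(m != NULL);
    if (nr < 0 || nr >= MAX_SYSCALLS)
        return 0;               /* unknown numbers are denied */
    return (int)((m->bits[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL);
}
```

With only SYS_read and SYS_write allowed, mask_permits() rejects
SYS_perf_event_open, so a bug in that call's argument handling would not be
reachable from the confined task.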

[It's also worth noting that as more browser plugins, like Adobe
Flash, migrate to the Pepper API (chrome,mozilla), they will no longer
need direct hardware access (ioctl()s, fs, etc).  All system access
will be brokered via the browser which lets them be sandboxed entirely
-- including system call filtering, if it is supported by the host platform.]

[*] it is possible to do crazy, on-the-fly syscall rewriting with
seccomp(1) and a trusted thread, but the performance cost is huge, the
portability is nil (pure asm), and the risk of a security bug is high.

> I'd agree that adding any of this without a real serious app making
> real use of it would be pointless. I discussed this under the
> impression that the app existed :-)
>
> I also got the very distinct impression from the various iterations
> that a real usecase existed behind it - all the fixes and
> considerations looked very realistic, not designed up for security's
> sake.
>
>> So I'm sorry for throwing cold water on you guys, but the whole
>> "let's come up with a new security gadget" thing just makes me go
>> "oh no, not again".
>
> Fair enough :-)

I don't want to boil the ocean and certainly am not interested in
reliving the LSM-wars. I want the missing piece of the puzzle when it
comes to reducing exposed kernel code.  seccomp.mode=1 is so close,
but its overly restrictive nature has made it implausible for nearly
all real-world uses.  A slight expansion to allow a system call
bitmask or simple filters would be sufficient for Chromium OS,
Chromium, qemu, and lxc use, among others.
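
For reference, the restrictiveness of seccomp.mode=1 can be observed from
userspace with the existing prctl() interface. The sketch below assumes a
kernel built with seccomp support; run_strict_child() is a hypothetical
helper. It forks a child, switches it to strict mode, and reports whether
the child was killed for making a system call outside the
read/write/exit/sigreturn set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum strict_result {
    STRICT_UNAVAILABLE = -1,    /* strict mode could not be enabled */
    STRICT_EXITED_OK   = 0,     /* child exited normally with status 0 */
    STRICT_KILLED      = 1      /* child was killed with SIGKILL */
};

/* Fork a child, put it into seccomp strict mode, optionally make one
 * system call that strict mode does not allow, then exit. */
int run_strict_child(int make_forbidden_call)
{
    assert(make_forbidden_call == 0 || make_forbidden_call == 1);

    pid_t pid = fork();
    if (pid < 0)
        return STRICT_UNAVAILABLE;

    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);                   /* still unrestricted here */

        if (make_forbidden_call)
            syscall(SYS_getpid);        /* not in the allowed set */

        /* glibc's _exit() uses exit_group(), which strict mode also
         * forbids, so invoke the plain exit system call directly. */
        syscall(SYS_exit, 0);
        _exit(3);                       /* not reached */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return STRICT_UNAVAILABLE;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
        return STRICT_KILLED;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return STRICT_EXITED_OK;
    return STRICT_UNAVAILABLE;
}
```

Where strict mode is available, run_strict_child(1) is expected to report
STRICT_KILLED even though the child only asked for its own pid, which is
the "overly restrictive" behaviour referred to above.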

Thanks for reading and replying!
will

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 18:42                               ` Linus Torvalds
  2011-05-25 19:06                                 ` Ingo Molnar
@ 2011-05-25 19:11                                 ` Kees Cook
  2011-05-25 20:01                                   ` Linus Torvalds
  2011-05-26  1:19                                 ` James Morris
  2011-05-29 16:51                                 ` Aneesh Kumar K.V
  3 siblings, 1 reply; 91+ messages in thread
From: Kees Cook @ 2011-05-25 19:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Will Drewry,
	Steven Rostedt, linux-kernel

Hi Linus,

On Wed, May 25, 2011 at 11:42:44AM -0700, Linus Torvalds wrote:
> And who would *use* this thing in real life? Nobody. In order to sell
> me on a new security interface, give me a real actual use case that is
> security-conscious and relevant to real users.
> [...]
> And per-system-call permissions are very dubious. What system calls
> don't you want to succeed? That ioctl? You just made it impossible to
> do a modern graphical application. Yet the kind of thing where we
> would _want_ to help users is in making it easier to sandbox something
> like the adobe flash player. But without accelerated direct rendering,
> that's not going to fly, is it?

Uhm, what? Chrome would use it. And LXC would. Those were stated very
early on as projects extremely interested in syscall filtering. And that's
just the start, I can easily imagine Apache modules enforcing a very narrow
band of syscalls, or just about anything else that could be in a position
of running potentially malicious code. This could be very far-reaching, IMO.

-Kees

-- 
Kees Cook
Ubuntu Security Team

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 19:11                                 ` Kees Cook
@ 2011-05-25 20:01                                   ` Linus Torvalds
  2011-05-25 20:19                                     ` Ingo Molnar
  2011-05-26 14:37                                     ` Colin Walters
  0 siblings, 2 replies; 91+ messages in thread
From: Linus Torvalds @ 2011-05-25 20:01 UTC (permalink / raw)
  To: Kees Cook
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Will Drewry,
	Steven Rostedt, linux-kernel

On Wed, May 25, 2011 at 12:11 PM, Kees Cook <kees.cook@canonical.com> wrote:
>
> Uhm, what? Chrome would use it. And LXC would. Those were stated very
> early on as projects extremely interested in syscall filtering.

.. and I seriously doubt it is workable.

Or at least it needs some actual working proof-of-concept thing.
Exactly because of issues like direct rendering etc, that require some
of the nastier system calls to work at all.

As to your example of apache modules - last I saw, most of those were
written in high-level scripting languages that almost invariably end
up using quite a bit of the system call interfaces. And more
importantly, almost nobody does unportable code.

So hey, I'm willing to be convinced. But I'll need more than people
_saying_ that they'd be interested. Because judging by past
performance, nobody ever uses esoteric cool new features.

                           Linus

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 20:01                                   ` Linus Torvalds
@ 2011-05-25 20:19                                     ` Ingo Molnar
  2011-06-09  9:00                                       ` Sven Anders
  2011-05-26 14:37                                     ` Colin Walters
  1 sibling, 1 reply; 91+ messages in thread
From: Ingo Molnar @ 2011-05-25 20:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Thomas Gleixner, Peter Zijlstra, Will Drewry,
	Steven Rostedt, linux-kernel

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, May 25, 2011 at 12:11 PM, Kees Cook <kees.cook@canonical.com> wrote:
> >
> > Uhm, what? Chrome would use it. And LXC would. Those were stated very
> > early on as projects extremely interested in syscall filtering.
> 
> .. and I seriously doubt it is workable.
> 
> Or at least it needs some actual working proof-of-concept thing.
> Exactly because of issues like direct rendering etc, that require some
> of the nastier system calls to work at all.

Btw., Will's patch in this thread (which i think he tested with real 
code) implements an approach which detaches the concept from a rigid 
notion of 'syscall filtering' and opens it up for things like 
reliable pathname checks, memory object checks, etc. - without having 
to change the ABI.

If we go for syscall filtering as per bitmask, then we've pretty much 
condemned this to be limited to the syscall boundary alone.

So this sandboxing concept looked flexible enough to me to work 
itself up the security concept food chain *embedded in apps*.

<flame>

IMHO the key design mistake of LSM is that it detaches security 
policy from applications: you need to be admin to load policies, you 
need to be root to use/configure an LSM. Dammit, you need to be root 
to add labels to files!

This not only makes the LSM policies distro specific (and needlessly 
forked and detached from real security), but also gives the message 
that:

 'to ensure your security you need to be privileged'

which is the anti-concept of good security IMO.

So why not give unprivileged security policy facilities and let 
*Apps* shape their own security models. Yes, they will mess up 
initially and will reinvent the wheel. But socially IMO it will work 
a *lot* better in the long run: it's not imposed on them 
*externally*, it's something they can use and grow gradually. They 
will experience the security holes first hand and they will be *able 
to do something strategic about them* if we give them the right 
facilities.

At least the Chrome browser project appears to be intent on following 
such an approach. I consider a more bazaar-like approach more 
healthy, and it very much needs kernel help as LSMs are isolated from 
apps right now.

The thing is, we cannot possibly make the LSM situation much worse 
than it is today: i see *ALL* of the LSMs focused on all the wrong 
things!

But yes, i can understand that you are deeply sceptical.

</flame>

Thanks,

	Ingo

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 20:19                                     ` Ingo Molnar
@ 2011-06-09  9:00                                       ` Sven Anders
  0 siblings, 0 replies; 91+ messages in thread
From: Sven Anders @ 2011-06-09  9:00 UTC (permalink / raw)
  To: linux-kernel

Ingo Molnar <mingo <at> elte.hu> writes:

> <flame>
> 
> IMHO the key design mistake of LSM is that it detaches security 
> policy from applications: you need to be admin to load policies, you 
> need to be root to use/configure an LSM. Dammit, you need to be root 
> to add labels to files!
> 
> This not only makes the LSM policies distro specific (and needlessly 
> forked and detached from real security), but also gives the message 
> that:
> 
>  'to ensure your security you need to be privileged'
> 
> which is the anti-concept of good security IMO.
> 
> [....]
> </flame>
> 
> Thanks,
> 	Ingo

Hello!
An incomplete idea I had some time ago:

Couldn't the security information (like the selinux profiles) be
part of the binaries?

Each source package should deliver its own security information
and this should be better than adding it later, because the 
developer of the program knows what his program should be allowed
to do. Moreover, if the developer changes something, he can/must
update the security information at the same time.

Of course this information has to be signed in some way to
avoid tampering.

Just an idea...

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 20:01                                   ` Linus Torvalds
  2011-05-25 20:19                                     ` Ingo Molnar
@ 2011-05-26 14:37                                     ` Colin Walters
  2011-05-26 15:03                                       ` Linus Torvalds
  1 sibling, 1 reply; 91+ messages in thread
From: Colin Walters @ 2011-05-26 14:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Will Drewry, Steven Rostedt, linux-kernel

On Wed, May 25, 2011 at 4:01 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> As to your example of apache modules - last I saw, most of those were
> written in high-level scripting languages that almost invariably end
> up using quite a bit of the system call interfaces. And more
> importantly, almost nobody does unportable code.

Well, there's a difference between frameworks and applications.  At
least in GNOME e.g. we've been generally good at picking up and
transparently taking advantage of Linux-specific stuff where possible
like splice() in g_file_copy(), and in this cycle I'll probably end up
using signalfd() to fix a long standing race condition.

> So hey, I'm willing to be convinced. But I'll need more than people
> _saying_ that they'd be interested. Because judging by past
> performance, nobody ever uses esoteric cool new features.

I'm curious which features you feel are esoteric and cool but unused?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 14:37                                     ` Colin Walters
@ 2011-05-26 15:03                                       ` Linus Torvalds
  2011-05-26 15:28                                         ` Colin Walters
  2011-05-26 16:33                                         ` Will Drewry
  0 siblings, 2 replies; 91+ messages in thread
From: Linus Torvalds @ 2011-05-26 15:03 UTC (permalink / raw)
  To: Colin Walters
  Cc: Kees Cook, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Will Drewry, Steven Rostedt, linux-kernel

On Thu, May 26, 2011 at 7:37 AM, Colin Walters <walters@verbum.org> wrote:
>
> I'm curious which features you feel are esoteric and cool but unused?

Just about anything linux-specific. Ranging from the totally new
concepts (epoll/clone/splice/signalfd) to just simple cleanups and
extensions of reasonably standard stuff (sync_file_range/sendpage).

Sure, there's almost always *somebody* who uses them, but they are
seldom actually worth it.

The one thing that works well is when you expose it as a standard
interface. So futexes are linux-specific, but they are exposed as the
standard pthreads condition variables etc to apps - very few actually
use them as futexes. But because glibc uses them for the pthreads
synchronization, I think they ended up being used inside glibc for
low-level stuff too, so I think futexes ended up being an unqualified
success - much better than the standard interface.

The "it can be used in standard libraries" ends up being a very
powerful thing. It doesn't have to be libc - if something like a glib
or a big graphical interface uses them, they can get very popular. But
if you have to have actual config options (autoconf or similar) to
enable the feature on Linux, along with a compatibility case (because
older kernels don't even support it, so it's not even "linux", it's
"linux newer than xyz"), then very very few applications end up using
it.

And security issues in particular are often *very* subtle. For
example, something like a system call filter sounds like an obviously
safe thing: it can only limit what you do, right?

Except no, not right at all. Imagine that you're limiting a suid
application, and the one operation you limit is "setuid()". Imagine
that the suid application explicitly drops privileges in order to run
safely as the user. Imagine, further, that it doesn't even check the
return value, because it *knows* that if it is root, it will succeed,
and if it isn't root, then it wasn't suid to begin with and doesn't
need to do anything about it.

Unlikely? Hell no. That's standard practice. And if you allow filter
setup that survives fork+exec, you just opened a HUGE security hole.

Fixable? Yes, easily. And I haven't looked at the current patches, but
I would not be AT ALL surprised if they had exactly the above huge
security hole.

My point being that (a) I'm very dubious about new non-standard
features, because historically they seldom get used very widely and
(b) I'm doubly dubious about security things because it turns out it's
damn easy to get it wrong in all kinds of small subtle details.

                            Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 15:03                                       ` Linus Torvalds
@ 2011-05-26 15:28                                         ` Colin Walters
  2011-05-26 16:33                                         ` Will Drewry
  1 sibling, 0 replies; 91+ messages in thread
From: Colin Walters @ 2011-05-26 15:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Will Drewry, Steven Rostedt, linux-kernel

On Thu, May 26, 2011 at 11:03 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, May 26, 2011 at 7:37 AM, Colin Walters <walters@verbum.org> wrote:
>>
>> I'm curious which features you feel are esoteric and cool but unused?
>
> Just about anything linux-specific. Ranging from the totally new
> concepts (epoll/clone/splice/signalfd) to just simple cleanups and
> extensions of reasonably standard stuff (sync_file_range/sendpage).

epoll's very widely used via frameworks I'd say; at least the Apache
runtime uses it, libevent does, and apparently the Sun JDK does:
http://www.google.com/codesearch/p?hl=en#ih5hvYJNSIA/src/solaris/classes/sun/nio/ch/EPollPort.java&q=epoll&sa=N&cd=32&ct=rc
And here's an entry on that: http://blogs.oracle.com/alanb/entry/epoll

(Why doesn't glib?  It's hard since the priority design was kind of a
mistake: https://bugzilla.gnome.org/show_bug.cgi?id=156048 )

> The "it can be used in standard libraries" ends up being a very
> powerful thing. It doesn't have to be libc - if something like a glib
> or a big graphical interface uses them, they can get very popular.

Right, that's the distinction I was trying to make.

> But
> if you have to have actual config options (autoconf or similar) to
> enable the feature on Linux, along with a compatibility case (because
> older kernels don't even support it, so it's not even "linux", it's
> "linux newer than xyz"), then very very few applications end up using
> it.

From my experience as a framework developer, it hasn't been hard at
all to keep track of new Linux features, we talk about them a lot =)
The fallback code is often obvious, like for splice(), though for
signalfd it's going to be much messier to keep around the legacy
helper thread case.

> Unlikely? Hell no. That's standard practice. And if you allow filter
> setup that survives fork+exec, you just opened a HUGE security hole.

Oh definitely, setuid and process inheritance has been a source of
many problems over the years, and I agree it'd be very dangerous for
the syscall filters to stay open across execve.  Of course in practice
glibc secure mode exists to mitigate these things too; it could abort
if one was in place in that case.

But I was more curious about your views on Linux-specific interfaces,
and you answered that; thanks!

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 15:03                                       ` Linus Torvalds
  2011-05-26 15:28                                         ` Colin Walters
@ 2011-05-26 16:33                                         ` Will Drewry
  2011-05-26 16:46                                           ` Linus Torvalds
  1 sibling, 1 reply; 91+ messages in thread
From: Will Drewry @ 2011-05-26 16:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Colin Walters, Kees Cook, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, linux-kernel, James Morris

On Thu, May 26, 2011 at 10:03 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, May 26, 2011 at 7:37 AM, Colin Walters <walters@verbum.org> wrote:
>>
>> I'm curious which features you feel are esoteric and cool but unused?
>
> Just about anything linux-specific. Ranging from the totally new
> concepts (epoll/clone/splice/signalfd) to just simple cleanups and
> extensions of reasonably standard stuff (sync_file_range/sendpage).
>
> Sure, there's almost always *somebody* who uses them, but they are
> seldom actually worth it.
>
> The one thing that works well is when you expose it as a standard
> interface. So futexes are linux-specific, but they are exposed as the
> standard pthreads condition variables etc to apps - very few actually
> use them as futexes. But because glibc uses them for the pthreads
> synchronization, I think they ended up being used inside glibc for
> low-level stuff too, so I think futexes ended up being an unqualified
> success - much better than the standard interface.
>
> The "it can be used in standard libraries" ends up being a very
> powerful thing. It doesn't have to be libc - if something like a glib
> or a big graphical interface uses them, they can get very popular. But
> if you have to have actual config options (autoconf or similar) to
> enable the feature on Linux, along with a compatibility case (because
> older kernels don't even support it, so it's not even "linux", it's
> "linux newer than xyz"), then very very few applications end up using
> it.
>
> And security issues in particular are often *very* subtle. For
> example, something like a system call filter sounds like an obviously
> safe thing: it can only limit what you do, right?
>
> Except no, not right at all. Imagine that you're limiting a suid
> application, and the one operation you limit is "setuid()". Imagine
> that the suid application explicitly drops privileges in order to run
> safely as the user. Imagine, further, that it doesn't even check the
> return value, because it *knows* that if it is root, it will succeed,
> and if it isn't root, then it wasn't suid to begin with and doesn't
> need to do anything about it.
>
> Unlikely? Hell no. That's standard practice. And if you allow filter
> setup that survives fork+exec, you just opened a HUGE security hole.
>
> Fixable? Yes, easily. And I haven't looked at the current patches, but
> I would not be AT ALL surprised if they had exactly the above huge
> security hole.

FWIW, none of the patches deal with privilege escalation via setuid
files or file capabilities.

> My point being that (a) I'm very dubious about new non-standard
> features, because historically they seldom get used very widely and
> (b) I'm doubly dubious about security things because it turns out it's
> damn easy to get it wrong in all kinds of small subtle details.

I agree with both points, so I'm being a bit hypocritical, I suspect.

At present, I'm not aware of any platforms that support system call
restriction in a non-platform-specific fashion: mac has seatbelt,
freebsd has things like capsicum, linux has seccomp :)  This led me to
the proposal around expanding seccomp since it was already a Linux-ism
for this functionality and, ideally, could be minimal to help limit
the subtle-bug-exposure.   However, any form of kernel attack surface
reduction would be great, but I'm unaware of any that integrate with
glibc smoothly (even if some do have nicer programming interfaces).

I've used system call filtering in the past with good effect in server
environments, and I believe the Chromium renderer example is a robust
one for Linux desktops, even without glibc integration.  If the
Linux-specific, non-automatic (glibc) interface is a no-go, then I'll
go back to the drawing board.  I'm not sure how to avoid something
Linux-specific in general, even if it's just adding syscall hooks to
LSMs, though it could be possible to share interfaces with some other
platform's implementation of a broader security system that includes
kernel exposure minimization (like capsicum), which could be built on
whatever existing substrate is available or on a new one, as Ingo proposes.

thanks!
will

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 16:33                                         ` Will Drewry
@ 2011-05-26 16:46                                           ` Linus Torvalds
  2011-05-26 17:02                                             ` Will Drewry
                                                               ` (3 more replies)
  0 siblings, 4 replies; 91+ messages in thread
From: Linus Torvalds @ 2011-05-26 16:46 UTC (permalink / raw)
  To: Will Drewry
  Cc: Colin Walters, Kees Cook, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, linux-kernel, James Morris

On Thu, May 26, 2011 at 9:33 AM, Will Drewry <wad@chromium.org> wrote:
>
> FWIW, none of the patches deal with privilege escalation via setuid
> files or file capabilities.

That is NOT AT ALL what I'm talking about.

I'm talking about the "setuid()" system call (and all its cousins:
setgid/setreuid etc). And the whole thread has been about filtering
system calls, no?

Do a google code search for setuid.

In good code, it will look something like

  uid = getuid();

  if (setuid(uid)) {
    fprintf(stderr, "Unable to drop privileges\n");
    exit(1);
  }

but I guarantee you that there are cases where people just blindly
drop privileges. google code search found me at least the "heirloom"
source code doing exactly that.

And if you filter system calls, it's entirely possible that you can
attack suid executables through such a vector. Your "limit system
calls for security" security suddenly turned into "avoid the system
call that made things secure"!

See what I'm saying?
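
A slightly fuller sketch of the checked drop, assuming the program
wants to fall back to its real uid/gid and never continue if that
fails. The helper name drop_privileges_checked is hypothetical and is
not taken from any of the code discussed in this thread:

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: set the effective uid/gid back to the real
 * ones and verify the result. For a setuid-root binary, setuid() also
 * resets the real and saved uid, so the drop cannot be undone.
 * Returns 0 on success, -1 if any step failed or could not be verified.
 */
int drop_privileges_checked(void)
{
	uid_t ruid = getuid();
	gid_t rgid = getgid();

	/* Drop the group first: after the uid is gone, setgid() may fail. */
	if (setgid(rgid) != 0) {
		perror("setgid");
		return -1;
	}
	if (setuid(ruid) != 0) {
		perror("setuid");
		return -1;
	}

	/* Do not trust the return values alone: re-read the credentials. */
	if (geteuid() != ruid || getegid() != rgid) {
		fprintf(stderr, "privileges were not dropped\n");
		return -1;
	}
	return 0;
}
```

A caller would exit immediately on a -1 return instead of carrying on
with whatever credentials it happens to have.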

                       Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 16:46                                           ` Linus Torvalds
@ 2011-05-26 17:02                                             ` Will Drewry
  2011-05-26 17:04                                               ` Will Drewry
                                                                 ` (2 more replies)
  2011-05-26 17:07                                             ` Steven Rostedt
                                                               ` (2 subsequent siblings)
  3 siblings, 3 replies; 91+ messages in thread
From: Will Drewry @ 2011-05-26 17:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Colin Walters, Kees Cook, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, linux-kernel, James Morris

On Thu, May 26, 2011 at 11:46 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, May 26, 2011 at 9:33 AM, Will Drewry <wad@chromium.org> wrote:
>>
>> FWIW, none of the patches deal with privilege escalation via setuid
>> files or file capabilities.
>
> That is NOT AT ALL what I'm talking about.
>
> I'm talking about the "setuid()" system call (and all its cousins:
> setgid/setreuid etc). And the whole thread has been about filtering
> system calls, no?
>
> Do a google code search for setuid.
>
> In good code, it will look something like
>
>  uid = getuid();
>
>  if (setuid(uid)) {
>    fprintf(stderr, "Unable to drop privileges\n");
>    exit(1);
>  }
>
> but I guarantee you that there are cases where people just blindly
> drop privileges. google code search found me at least the "heirloom"
> source code doing exactly that.
>
> And if you filter system calls, it's entirely possible that you can
> attack suid executables through such a vector. Your "limit system
> calls for security" security suddenly turned into "avoid the system
> call that made things secure"!
>
> See what I'm saying?

Absolutely - that was what I meant :/  The patches do not currently
check creds at creation or again at use, which would lead to
unprivileged filters being used in a privileged context.  Right now,
though, if setuid() is not allowed by the seccomp-filter, the process
will be immediately killed with do_exit(SIGKILL) on call -- thus
avoiding a silent failure. I mentioned file capabilities because they
can have setuid-like side effects, too.  As long as system call
rejection results in a process death, I *think* it helps with some of
this complexity, but I haven't fully vetted the patches for these
scenarios to be 100% confident.

Sorry I wasn't clear!
will

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 17:02                                             ` Will Drewry
@ 2011-05-26 17:04                                               ` Will Drewry
  2011-05-26 17:17                                               ` Linus Torvalds
  2011-05-26 17:38                                               ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Valdis.Kletnieks
  2 siblings, 0 replies; 91+ messages in thread
From: Will Drewry @ 2011-05-26 17:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Colin Walters, Kees Cook, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, linux-kernel, James Morris

On Thu, May 26, 2011 at 12:02 PM, Will Drewry <wad@chromium.org> wrote:
> On Thu, May 26, 2011 at 11:46 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, May 26, 2011 at 9:33 AM, Will Drewry <wad@chromium.org> wrote:
>>>
>>> FWIW, none of the patches deal with privilege escalation via setuid
>>> files or file capabilities.
>>
>> That is NOT AT ALL what I'm talking about.
>>
>> I'm talking about the "setuid()" system call (and all its cousins:
>> setgid/setreuid etc). And the whole thread has been about filtering
>> system calls, no?
>>
>> Do a google code search for setuid.
>>
>> In good code, it will look something like
>>
>>  uid = getuid();
>>
>>  if (setuid(uid)) {
>>    fprintf(stderr, "Unable to drop privileges\n");
>>    exit(1);
>>  }
>>
>> but I guarantee you that there are cases where people just blindly
>> drop privileges. google code search found me at least the "heirloom"
>> source code doing exactly that.
>>
>> And if you filter system calls, it's entirely possible that you can
>> attack suid executables through such a vector. Your "limit system
>> calls for security" security suddenly turned into "avoid the system
>> call that made things secure"!
>>
>> See what I'm saying?
>
> Absolutely - that was what I meant :/  The patches do not currently
> check creds at creation or again at use, which would lead to
> unprivileged filters being used in a privileged context.  Right now,
> though, if setuid() is not allowed by the seccomp-filter, the process
> will be immediately killed with do_exit(SIGKILL) on call -- thus
> avoiding a silent failure. I mentioned file capabilities because they
> can have setuid-like side effects, too.  As long as system call
> rejection results in a process death, I *think* it helps with some of
> this complexity, but I haven't fully vetted the patches for these
> scenarios to be 100% confident.

Bah - by "setuid-like side effects", I meant suid executable-like side
effects.  And even outside of those scenarios, I think immediate
process death helps catch coding mistakes that lead to filtering
setuid() calls prior to use.

cheers,
will

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 17:02                                             ` Will Drewry
  2011-05-26 17:04                                               ` Will Drewry
@ 2011-05-26 17:17                                               ` Linus Torvalds
  2011-05-26 17:38                                                 ` Will Drewry
  2011-05-26 17:38                                               ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Valdis.Kletnieks
  2 siblings, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2011-05-26 17:17 UTC (permalink / raw)
  To: Will Drewry
  Cc: Colin Walters, Kees Cook, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, linux-kernel, James Morris

On Thu, May 26, 2011 at 10:02 AM, Will Drewry <wad@chromium.org> wrote:
>
> Absolutely - that was what I meant :/  The patches do not currently
> check creds at creation or again at use, which would lead to
> unprivileged filters being used in a privileged context.  Right now,
> though, if setuid() is not allowed by the seccomp-filter, the process
> will be immediately killed with do_exit(SIGKILL) on call -- thus
> avoiding a silent failure.

Umm.

You do realize that there is a reason we don't allow random kill()
system calls to succeed without privileges either?

So no, "we kill it with sigkill" is not safe *either*. It now is
potentially a way to kill privileged processes that you didn't have
permission to kill.

My point is that it all sounds designed for well-behaved processes.
"kill it if it does something bad" sounds like a *wonderful* idea if
you're doing a sandbox.

But it is suddenly potentially deadly if that capability is used by a
malicious user for a process that isn't ready for it.

One option is to just not ever allow execve() from inside a restricted
environment.
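
For comparison, the existing seccomp mode 1 ("strict") already has
this property: once a task enters it, only read, write, _exit and
sigreturn are permitted, so an execve() attempt gets the task killed.
A rough sketch that demonstrates it in a forked child follows. The
function name is made up for illustration, and this uses the old
strict mode, not the filter patches under discussion:

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that enters strict seccomp and then tries to execve().
 * Returns 1 if the kernel killed the child with SIGKILL,
 *         0 if the child survived the attempt,
 *        -1 if strict mode is unavailable or fork/wait failed.
 */
int execve_is_fatal_under_strict_seccomp(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		char *const argv[] = { "/bin/true", NULL };
		char *const envp[] = { NULL };

		if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0)
			_exit(2);	/* strict seccomp not available here */

		/* Only read, write, _exit and sigreturn are allowed now. */
		execve(argv[0], argv, envp);
		_exit(3);	/* not reached when the mode is enforced */
	}

	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
		return 1;
	if (WIFEXITED(status) && WEXITSTATUS(status) == 2)
		return -1;
	return 0;
}
```

Because the task restricts itself and can never exec, no suid binary
can end up running under restrictions it did not ask for.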

                       Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 17:17                                               ` Linus Torvalds
@ 2011-05-26 17:38                                                 ` Will Drewry
  2011-05-26 18:33                                                   ` Linus Torvalds
  0 siblings, 1 reply; 91+ messages in thread
From: Will Drewry @ 2011-05-26 17:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Colin Walters, Kees Cook, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, linux-kernel, James Morris

On Thu, May 26, 2011 at 12:17 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, May 26, 2011 at 10:02 AM, Will Drewry <wad@chromium.org> wrote:
>>
>> Absolutely - that was what I meant :/  The patches do not currently
>> check creds at creation or again at use, which would lead to
>> unprivileged filters being used in a privileged context.  Right now,
>> though, if setuid() is not allowed by the seccomp-filter, the process
>> will be immediately killed with do_exit(SIGKILL) on call -- thus
>> avoiding a silent failure.
>
> Umm.
>
> You do realize that there is a reason we don't allow random kill()
> system calls to succeed without privileges either?
>
> So no, "we kill it with sigkill" is not safe *either*. It now is
> potentially a way to kill privileged processes that you didn't have
> permission to kill.
>
> My point is that it all sounds designed for well-behaved processes.
> "kill it if it does something bad" sounds like a *wonderful* idea if
> you're doing a sandbox.

Yeah - we end up in a weird place, because for many suid executables,
the failure would be immediate (at priv drop), but it introduces bugs
that will be less obvious in more complex scenarios.

> But it is suddenly potentially deadly if that capability is used by a
> malicious user for a process that isn't ready for it.
>
> One option is to just not ever allow execve() from inside a restricted
> environment.

That'd certainly be fine with me.  Another option could be adding
cred checking (from set to use) or execve-time checking or ..., but
simple works for me.  I'm not hung up on the implementation details,
as long as the end result is that the syscalls can be _safely_
whitelisted.

Thanks!

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 17:38                                                 ` Will Drewry
@ 2011-05-26 18:33                                                   ` Linus Torvalds
  2011-05-26 18:47                                                     ` Ingo Molnar
  2011-05-26 18:49                                                     ` Will Drewry
  0 siblings, 2 replies; 91+ messages in thread
From: Linus Torvalds @ 2011-05-26 18:33 UTC (permalink / raw)
  To: Will Drewry
  Cc: Colin Walters, Kees Cook, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, linux-kernel, James Morris

On Thu, May 26, 2011 at 10:38 AM, Will Drewry <wad@chromium.org> wrote:
>>
>> One option is to just not ever allow execve() from inside a restricted
>> environment.
>
> That'd certainly be fine with me.

So if it ends up being purely an "internal to the process" thing, then
I'm much happier about it - it not only limits the scope of things
sufficiently that I don't worry too much about security issues, but it
makes it very clear that it's about a process going into "lock-down"
mode on its own.

It also gets rid of all configuration - one of the things that makes
most security frameworks (look at selinux, but also just ACL's etc)
such a crazy rats nest is the whole "set up for other processes". If
it's designed very much to be about just the "self" process (after
initialization etc), then I think that avoids pretty much all the
serious issues.

A lot of server processes could probably use it as a way to say "Hey,
I guarantee that I will only open new files read-only, and will only
write to the socket that was already opened for me by the accept", and
explicitly limit their worker threads that way.

If that is really sufficient for some chrome sandboxing, then hey,
that's all fine.

Sometimes limiting yourself (rather than looking for some bigger
"generic" solution) is the right answer.

                         Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 18:33                                                   ` Linus Torvalds
@ 2011-05-26 18:47                                                     ` Ingo Molnar
  2011-05-26 19:05                                                       ` david
  2011-05-26 18:49                                                     ` Will Drewry
  1 sibling, 1 reply; 91+ messages in thread
From: Ingo Molnar @ 2011-05-26 18:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Drewry, Colin Walters, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, Steven Rostedt, linux-kernel, James Morris


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> It also gets rid of all configuration - one of the things that 
> makes most security frameworks (look at selinux, but also just 
> ACL's etc) such a crazy rats nest is the whole "set up for other 
> processes". If it's designed very much to be about just the "self" 
> process (after initialization etc), then I think that avoids pretty 
> much all the serious issues.

That's how the event filters work currently: even when inherited they 
get removed when exec-ing a setuid task, so they cannot leak into 
privileged context and cannot modify execution there.

Inheritance works when requested, covering only same-credential child 
tasks, not privileged successors.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 18:47                                                     ` Ingo Molnar
@ 2011-05-26 19:05                                                       ` david
  2011-05-26 19:09                                                         ` Eric Paris
  2011-05-26 19:46                                                         ` Ingo Molnar
  0 siblings, 2 replies; 91+ messages in thread
From: david @ 2011-05-26 19:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Will Drewry, Colin Walters, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, Steven Rostedt, linux-kernel,
	James Morris

On Thu, 26 May 2011, Ingo Molnar wrote:

> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> It also gets rid of all configuration - one of the things that
>> makes most security frameworks (look at selinux, but also just
>> ACL's etc) such a crazy rats nest is the whole "set up for other
>> processes". If it's designed very much to be about just the "self"
>> process (after initialization etc), then I think that avoids pretty
>> much all the serious issues.
>
> That's how the event filters work currently: even when inherited they
> get removed when exec-ing a setuid task, so they cannot leak into
> privileged context and cannot modify execution there.
>
> Inheritance works when requested, covering only same-credential child
> tasks, not privileged successors.

this is a very reasonable default, but there should be some way of saying 
that you want the restrictions to carry over to the suid task (I really 
know what I'm doing switch)

David Lang

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 19:05                                                       ` david
@ 2011-05-26 19:09                                                         ` Eric Paris
  2011-05-26 19:46                                                         ` Ingo Molnar
  1 sibling, 0 replies; 91+ messages in thread
From: Eric Paris @ 2011-05-26 19:09 UTC (permalink / raw)
  To: david
  Cc: Ingo Molnar, Linus Torvalds, Will Drewry, Colin Walters,
	Kees Cook, Thomas Gleixner, Peter Zijlstra, Steven Rostedt,
	linux-kernel, James Morris

On Thu, May 26, 2011 at 3:05 PM,  <david@lang.hm> wrote:
> On Thu, 26 May 2011, Ingo Molnar wrote:
>
>> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>>> It also gets rid of all configuration - one of the things that
>>> makes most security frameworks (look at selinux, but also just
>>> ACL's etc) such a crazy rats nest is the whole "set up for other
>>> processes". If it's designed very much to be about just the "self"
>>> process (after initialization etc), then I think that avoids pretty
>>> much all the serious issues.
>>
>> That's how the event filters work currently: even when inherited they
>> get removed when exec-ing a setuid task, so they cannot leak into
>> privileged context and cannot modify execution there.
>>
>> Inheritance works when requested, covering only same-credential child
>> tasks, not privileged successors.
>
> this is a very reasonable default, but there should be some way of saying
> that you want the restrictions to carry over to the suid task (I really know
> what I'm doing switch)

You mean the "i'm a hacker and want to be able to learn about tasks I
shouldn't be able to learn about" switch?  No.  You either get out of
the way on SUID or refuse to launch SUID apps.  Those are the only
reasonable choices.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 19:05                                                       ` david
  2011-05-26 19:09                                                         ` Eric Paris
@ 2011-05-26 19:46                                                         ` Ingo Molnar
  2011-05-26 19:49                                                           ` david
  1 sibling, 1 reply; 91+ messages in thread
From: Ingo Molnar @ 2011-05-26 19:46 UTC (permalink / raw)
  To: david
  Cc: Linus Torvalds, Will Drewry, Colin Walters, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, Steven Rostedt, linux-kernel,
	James Morris


* david@lang.hm <david@lang.hm> wrote:

> On Thu, 26 May 2011, Ingo Molnar wrote:
> 
> >* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> >>It also gets rid of all configuration - one of the things that
> >>makes most security frameworks (look at selinux, but also just
> >>ACL's etc) such a crazy rats nest is the whole "set up for other
> >>processes". If it's designed very much to be about just the "self"
> >>process (after initialization etc), then I think that avoids pretty
> >>much all the serious issues.
> >
> >That's how the event filters work currently: even when inherited they
> >get removed when exec-ing a setuid task, so they cannot leak into
> >privileged context and cannot modify execution there.
> >
> >Inheritance works when requested, covering only same-credential child
> >tasks, not privileged successors.
> 
> this is a very reasonable default, but there should be some way of 
> saying that you want the restrictions to carry over to the suid 
> task (I really know what I'm doing switch)

Unless you mean that root should be able to do it, it's a security
hole both for events and for filters:

 - for example we don't want really fine-grained events to be used to
   BTS hw-trace sshd and thus let an unprivileged tracer discover
   cryptographic properties of the private key sshd is using.

 - we do not want to *modify* the execution flow of a setuid program,
   that can lead to exploits: by pushing the privileged codepath into 
   a condition that can never occur on a normal system - and thus can 
   push it into doing something it was not intended to do.

   data damage could be done as well: for example if the privileged 
   code is logging into a system file then modifying execution can 
   damage the log file.

So it's not a good idea in general to allow unprivileged code to 
modify the execution of privileged code. In fact it's not a good idea 
to allow it to simply *observe* privileged code. It must remain a
black box with very little information leaking outwards.
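
The two acceptable behaviours discussed in this subthread (drop the
filters when exec-ing a setuid binary, or refuse the exec outright) can
be modelled in a few lines. This is only a user-space sketch: struct
task_model, exec_apply_policy() and the enum values are hypothetical
names, and a credential change is reduced to a plain euid comparison,
which is an assumption and not how the kernel tracks credentials.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a task; these are not kernel structures. */
struct task_model {
	unsigned int euid;   /* effective uid the task currently runs with */
	size_t nr_filters;   /* number of inherited event filters */
};

enum exec_outcome { EXEC_KEEP_FILTERS, EXEC_DROP_FILTERS, EXEC_REFUSED };

/*
 * Decide what happens to inherited filters when the task execs a binary
 * that would run with new_euid.  refuse_privileged selects the "refuse
 * to launch SUID apps" policy instead of "get out of the way".
 */
static enum exec_outcome exec_apply_policy(struct task_model *t,
					   unsigned int new_euid,
					   bool refuse_privileged)
{
	if (new_euid == t->euid)
		return EXEC_KEEP_FILTERS;   /* same credentials: inherit */

	if (refuse_privileged)
		return EXEC_REFUSED;        /* task is left untouched */

	/* Privileged successor: filters must not leak into it. */
	t->nr_filters = 0;
	t->euid = new_euid;
	return EXEC_DROP_FILTERS;
}
```

In neither branch does a filter installed by the unprivileged task
survive into the privileged context.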

Thanks,

	Ingo


* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 19:46                                                         ` Ingo Molnar
@ 2011-05-26 19:49                                                           ` david
  0 siblings, 0 replies; 91+ messages in thread
From: david @ 2011-05-26 19:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Will Drewry, Colin Walters, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, Steven Rostedt, linux-kernel,
	James Morris

On Thu, 26 May 2011, Ingo Molnar wrote:

> * david@lang.hm <david@lang.hm> wrote:
>
>> On Thu, 26 May 2011, Ingo Molnar wrote:
>>
>>> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>>
>>>> It also gets rid of all configuration - one of the things that
>>>> makes most security frameworks (look at selinux, but also just
>>>> ACL's etc) such a crazy rats nest is the whole "set up for other
>>>> processes". If it's designed very much to be about just the "self"
>>>> process (after initialization etc), then I think that avoids pretty
>>>> much all the serious issues.
>>>
>>> That's how the event filters work currently: even when inherited they
>>> get removed when exec-ing a setuid task, so they cannot leak into
>>> privileged context and cannot modify execution there.
>>>
>>> Inheritance works when requested, covering only same-credential child
>>> tasks, not privileged successors.
>>
>> this is a very reasonable default, but there should be some way of
>> saying that you want the restrictions to carry over to the suid
>> task (I really know what I'm doing switch)
>
> Unless you mean that root should be able to do it, it's a security
> hole both for events and for filters:
>
> - for example we don't want really fine-grained events to be used to
>   BTS hw-trace sshd and thus let an unprivileged tracer discover
>   cryptographic properties of the private key sshd is using.
>
> - we do not want to *modify* the execution flow of a setuid program,
>   that can lead to exploits: by pushing the privileged codepath into
>   a condition that can never occur on a normal system - and thus can
>   push it into doing something it was not intended to do.
>
>   data damage could be done as well: for example if the privileged
>   code is logging into a system file then modifying execution can
>   damage the log file.
>
> So it's not a good idea in general to allow unprivileged code to
> modify the execution of privileged code. In fact it's not a good idea
> to allow it to simply *observe* privileged code. It must remain a
> black box with very little information leaking outwards.

I was thinking of the use case of the real sysadmin (i.e. root) wanting to 
be able to constrain things. I can see why you would not want to allow 
normal users to do this.

David Lang


* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 18:33                                                   ` Linus Torvalds
  2011-05-26 18:47                                                     ` Ingo Molnar
@ 2011-05-26 18:49                                                     ` Will Drewry
  2011-06-01  3:10                                                       ` [PATCH v3 01/13] tracing: split out filter initialization and clean up Will Drewry
                                                                         ` (12 more replies)
  1 sibling, 13 replies; 91+ messages in thread
From: Will Drewry @ 2011-05-26 18:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Colin Walters, Kees Cook, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, linux-kernel, James Morris

On Thu, May 26, 2011 at 1:33 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, May 26, 2011 at 10:38 AM, Will Drewry <wad@chromium.org> wrote:
>>>
>>> One option is to just not ever allow execve() from inside a restricted
>>> environment.
>>
>> That'd certainly be fine with me.
>
> So if it ends up being purely a "internal to the process" thing, then
> I'm much happier about it - it not only limits the scope of things
> sufficiently that I don't worry too much about security issues, but it
> makes it very clear that it's about a process going into "lock-down"
> mode on its own.
>
> It also gets rid of all configuration - one of the things that makes
> most security frameworks (look at selinux, but also just ACL's etc)
> such a crazy rats nest is the whole "set up for other processes". If
> it's designed very much to be about just the "self" process (after
> initialization etc), then I think that avoids pretty much all the
> serious issues.
>
> A lot of server processes could probably use it as a way to say "Hey,
> I guarantee that I will only open new files read-only, and will only
> write to the socket that was already opened for me by the accept", and
> explicitly limit their worker threads that way.
>
> If that is really sufficient for some chrome sandboxing, then hey,
> that's all fine.

It adds some hoops, but less than exist today.

> Sometimes limiting yourself (rather than looking for some bigger
> "generic" solution) is the right answer.

I will very happily validate usage and repost with a self-limited
patch series.  Doing so makes the change much more explicitly an
expansion of seccomp, which keeps things sane.
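
For illustration, the self-imposed policy quoted above ("only open new
files read-only, only write to the socket that was already opened")
boils down to two predicates over syscall arguments. The sketch below is
plain user-space C, not the proposed filter syntax; struct worker_policy
and both function names are hypothetical, and rejecting O_CREAT and
O_TRUNC is an extra assumption added so that a "read-only" open cannot
create or truncate a file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical per-worker policy; the names are illustrative only. */
struct worker_policy {
	int client_fd;   /* socket handed to the worker by accept() */
};

/* open() is permitted only for read-only access without side effects. */
static bool policy_allows_open(int flags)
{
	if ((flags & O_ACCMODE) != O_RDONLY)
		return false;
	/* O_CREAT and O_TRUNC can modify the filesystem even with O_RDONLY. */
	return (flags & (O_CREAT | O_TRUNC)) == 0;
}

/* write() is permitted only on the already-accepted socket. */
static bool policy_allows_write(const struct worker_policy *p, int fd)
{
	return fd == p->client_fd;
}
```

Because both checks depend only on the calling task's own arguments and
state, nothing has to be configured on behalf of other processes.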

Thanks!
will


* [PATCH v3 01/13] tracing: split out filter initialization and clean up.
  2011-05-26 18:49                                                     ` Will Drewry
@ 2011-06-01  3:10                                                       ` Will Drewry
  2011-06-01  3:10                                                       ` [PATCH v3 02/13] tracing: split out syscall_trace_enter construction Will Drewry
                                                                         ` (11 subsequent siblings)
  12 siblings, 0 replies; 91+ messages in thread
From: Will Drewry @ 2011-06-01  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: kees.cook, torvalds, tglx, mingo, rostedt, jmorris, Will Drewry,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Frederic Weisbecker

Moves the perf-specific profile event allocation and freeing code into
kernel/perf_event.c where it is called from and two symbols are exported
via ftrace_event.h for instantiating struct event_filters without
requiring a change to the core tracing code.

The change allows globally registered ftrace events to be used in
event_filter structs.  perf is the current consumer, but a possible
future consumer is a system call filtering using the secure computing
hooks (and the existing syscalls subsystem events).

Signed-off-by: Will Drewry <wad@chromium.org>
---
 include/linux/ftrace_event.h       |    9 +++--
 kernel/perf_event.c                |    7 +++-
 kernel/trace/trace_events_filter.c |   60 ++++++++++++++++++++++--------------
 3 files changed, 48 insertions(+), 28 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 22b32af..fea9d98 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -216,6 +216,12 @@ extern int filter_current_check_discard(struct ring_buffer *buffer,
 					void *rec,
 					struct ring_buffer_event *event);
 
+extern void ftrace_free_filter(struct event_filter *filter);
+extern int ftrace_parse_filter(struct event_filter **filter,
+			       int event_id,
+			       const char *filter_str);
+extern const char *ftrace_get_filter_string(const struct event_filter *filter);
+
 enum {
 	FILTER_OTHER = 0,
 	FILTER_STATIC_STRING,
@@ -266,9 +272,6 @@ extern int  perf_trace_init(struct perf_event *event);
 extern void perf_trace_destroy(struct perf_event *event);
 extern int  perf_trace_add(struct perf_event *event, int flags);
 extern void perf_trace_del(struct perf_event *event, int flags);
-extern int  ftrace_profile_set_filter(struct perf_event *event, int event_id,
-				     char *filter_str);
-extern void ftrace_profile_free_filter(struct perf_event *event);
 extern void *perf_trace_buf_prepare(int size, unsigned short type,
 				    struct pt_regs *regs, int *rctxp);
 
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 8e81a98..1da45e7 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5588,7 +5588,8 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
 	if (IS_ERR(filter_str))
 		return PTR_ERR(filter_str);
 
-	ret = ftrace_profile_set_filter(event, event->attr.config, filter_str);
+	ret = ftrace_parse_filter(&event->filter, event->attr.config,
+				  filter_str);
 
 	kfree(filter_str);
 	return ret;
@@ -5596,7 +5597,9 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
 
 static void perf_event_free_filter(struct perf_event *event)
 {
-	ftrace_profile_free_filter(event);
+	struct event_filter *filter = event->filter;
+	event->filter = NULL;
+	ftrace_free_filter(filter);
 }
 
 #else
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 8008ddc..787b174 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -522,7 +522,7 @@ static void remove_filter_string(struct event_filter *filter)
 }
 
 static int replace_filter_string(struct event_filter *filter,
-				 char *filter_string)
+				 const char *filter_string)
 {
 	kfree(filter->filter_string);
 	filter->filter_string = kstrdup(filter_string, GFP_KERNEL);
@@ -1936,21 +1936,27 @@ out_unlock:
 	return err;
 }
 
-#ifdef CONFIG_PERF_EVENTS
-
-void ftrace_profile_free_filter(struct perf_event *event)
+/* ftrace_free_filter - frees a parsed filter and its internal structures.
+ *
+ * @filter: pointer to the event_filter to free.
+ */
+void ftrace_free_filter(struct event_filter *filter)
 {
-	struct event_filter *filter = event->filter;
-
-	event->filter = NULL;
-	__free_filter(filter);
+	if (filter)
+		__free_filter(filter);
 }
+EXPORT_SYMBOL_GPL(ftrace_free_filter);
 
-int ftrace_profile_set_filter(struct perf_event *event, int event_id,
-			      char *filter_str)
+/* ftrace_parse_filter - allocates and populates a new event_filter
+ *
+ * @event_id: may be something like syscalls::sys_event_tkill's id.
+ * @filter_str: pointer to the filter string. Ownership is NOT taken; it is copied.
+ */
+int ftrace_parse_filter(struct event_filter **filter,
+			int event_id,
+			const char *filter_str)
 {
 	int err;
-	struct event_filter *filter;
 	struct filter_parse_state *ps;
 	struct ftrace_event_call *call = NULL;
 
@@ -1966,12 +1972,12 @@ int ftrace_profile_set_filter(struct perf_event *event, int event_id,
 		goto out_unlock;
 
 	err = -EEXIST;
-	if (event->filter)
+	if (*filter)
 		goto out_unlock;
 
-	filter = __alloc_filter();
-	if (!filter) {
-		err = PTR_ERR(filter);
+	*filter = __alloc_filter();
+	if (IS_ERR_OR_NULL(*filter)) {
+		err = PTR_ERR(*filter);
 		goto out_unlock;
 	}
 
@@ -1980,14 +1986,14 @@ int ftrace_profile_set_filter(struct perf_event *event, int event_id,
 	if (!ps)
 		goto free_filter;
 
-	parse_init(ps, filter_ops, filter_str);
+	replace_filter_string(*filter, filter_str);
+
+	parse_init(ps, filter_ops, (*filter)->filter_string);
 	err = filter_parse(ps);
 	if (err)
 		goto free_ps;
 
-	err = replace_preds(call, filter, ps, filter_str, false);
-	if (!err)
-		event->filter = filter;
+	err = replace_preds(call, *filter, ps, (*filter)->filter_string, false);
 
 free_ps:
 	filter_opstack_clear(ps);
@@ -1995,14 +2001,22 @@ free_ps:
 	kfree(ps);
 
 free_filter:
-	if (err)
-		__free_filter(filter);
+	if (err) {
+		__free_filter(*filter);
+		*filter = NULL;
+	}
 
 out_unlock:
 	mutex_unlock(&event_mutex);
 
 	return err;
 }
+EXPORT_SYMBOL_GPL(ftrace_parse_filter);
 
-#endif /* CONFIG_PERF_EVENTS */
-
+const char *ftrace_get_filter_string(const struct event_filter *filter)
+{
+	if (!filter)
+		return NULL;
+	return filter->filter_string;
+}
+EXPORT_SYMBOL_GPL(ftrace_get_filter_string);
-- 
1.7.0.4
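
The calling convention this patch gives ftrace_parse_filter() and
ftrace_free_filter() can be tried out in user space with a small
stand-in. Everything below is a mock: struct mock_filter,
mock_parse_filter() and mock_free_filter() are hypothetical names, the
event_id argument is left out, and "parsing" is reduced to rejecting an
empty string. Only the contract mirrors the patch: *filter must be NULL
on entry, it is left NULL on failure, the filter string is copied, and
freeing NULL is harmless.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* User-space stand-in for struct event_filter; only the string is kept. */
struct mock_filter {
	char *filter_string;
};

static void mock_free_filter(struct mock_filter *filter)
{
	if (!filter)
		return;   /* NULL is accepted, as in ftrace_free_filter() */
	free(filter->filter_string);
	free(filter);
}

static int mock_parse_filter(struct mock_filter **filter,
			     const char *filter_str)
{
	struct mock_filter *f;
	size_t len;

	if (*filter)
		return -EEXIST;   /* caller must start with *filter == NULL */
	if (filter_str[0] == '\0')
		return -EINVAL;   /* stand-in for a real parse error */

	f = calloc(1, sizeof(*f));
	if (!f)
		return -ENOMEM;

	/* Keep a private copy so the caller may free its own string. */
	len = strlen(filter_str) + 1;
	f->filter_string = malloc(len);
	if (!f->filter_string) {
		free(f);
		return -ENOMEM;
	}
	memcpy(f->filter_string, filter_str, len);

	*filter = f;   /* published only on success */
	return 0;
}
```

A caller such as perf_event_set_filter() can therefore free its
filter_str right after the call, which is what the hunk above does.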



* [PATCH v3 02/13] tracing: split out syscall_trace_enter construction
  2011-05-26 18:49                                                     ` Will Drewry
  2011-06-01  3:10                                                       ` [PATCH v3 01/13] tracing: split out filter initialization and clean up Will Drewry
@ 2011-06-01  3:10                                                       ` Will Drewry
  2011-06-01  7:00                                                         ` Ingo Molnar
  2011-06-01  3:10                                                       ` [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters Will Drewry
                                                                         ` (10 subsequent siblings)
  12 siblings, 1 reply; 91+ messages in thread
From: Will Drewry @ 2011-06-01  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: kees.cook, torvalds, tglx, mingo, rostedt, jmorris, Will Drewry,
	Frederic Weisbecker, Ingo Molnar

perf appears to be the primary consumer of the CONFIG_FTRACE_SYSCALLS
infrastructure.  As such, many of the helpers targeted at perf can be
split into a perf-focused helper and a generic CONFIG_FTRACE_SYSCALLS
consumer interface.

This change splits out syscall_trace_enter construction from
perf_syscall_enter for current into two helpers:
- ftrace_syscall_enter_state
- ftrace_syscall_enter_state_size

And adds another helper for completeness:
- ftrace_syscall_exit_state_size

These helpers allow for shared code between perf ftrace events and
any other consumers of CONFIG_FTRACE_SYSCALLS events.  The proposed
seccomp_filter patches use this code.

Signed-off-by: Will Drewry <wad@chromium.org>
---
 include/trace/syscall.h       |    4 ++
 kernel/trace/trace_syscalls.c |   96 +++++++++++++++++++++++++++++++++++------
 2 files changed, 86 insertions(+), 14 deletions(-)

diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 31966a4..242ae04 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -41,6 +41,10 @@ extern int reg_event_syscall_exit(struct ftrace_event_call *call);
 extern void unreg_event_syscall_exit(struct ftrace_event_call *call);
 extern int
 ftrace_format_syscall(struct ftrace_event_call *call, struct trace_seq *s);
+extern int ftrace_syscall_enter_state(u8 *buf, size_t available,
+				      struct trace_entry **entry);
+extern size_t ftrace_syscall_enter_state_size(int nb_args);
+extern size_t ftrace_syscall_exit_state_size(void);
 enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
 				      struct trace_event *event);
 enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index ee7b5a0..f37f120 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -95,7 +95,7 @@ find_syscall_meta(unsigned long syscall)
 	return NULL;
 }
 
-static struct syscall_metadata *syscall_nr_to_meta(int nr)
+struct syscall_metadata *syscall_nr_to_meta(int nr)
 {
 	if (!syscalls_metadata || nr >= NR_syscalls || nr < 0)
 		return NULL;
@@ -498,7 +498,7 @@ static int sys_perf_refcount_exit;
 static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 {
 	struct syscall_metadata *sys_data;
-	struct syscall_trace_enter *rec;
+	void *buf;
 	struct hlist_head *head;
 	int syscall_nr;
 	int rctx;
@@ -513,25 +513,22 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 		return;
 
 	/* get the size after alignment with the u32 buffer size field */
-	size = sizeof(unsigned long) * sys_data->nb_args + sizeof(*rec);
-	size = ALIGN(size + sizeof(u32), sizeof(u64));
-	size -= sizeof(u32);
+	size = ftrace_syscall_enter_state_size(sys_data->nb_args);
 
 	if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE,
 		      "perf buffer not large enough"))
 		return;
 
-	rec = (struct syscall_trace_enter *)perf_trace_buf_prepare(size,
-				sys_data->enter_event->event.type, regs, &rctx);
-	if (!rec)
+	buf = perf_trace_buf_prepare(size, sys_data->enter_event->event.type,
+				     regs, &rctx);
+	if (!buf)
 		return;
 
-	rec->nr = syscall_nr;
-	syscall_get_arguments(current, regs, 0, sys_data->nb_args,
-			       (unsigned long *)&rec->args);
+	/* The only error conditions in this helper are handled above. */
+	ftrace_syscall_enter_state(buf, size, NULL);
 
 	head = this_cpu_ptr(sys_data->enter_event->perf_events);
-	perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, head);
+	perf_trace_buf_submit(buf, size, rctx, 0, 1, regs, head);
 }
 
 int perf_sysenter_enable(struct ftrace_event_call *call)
@@ -587,8 +584,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 		return;
 
 	/* We can probably do that at build time */
-	size = ALIGN(sizeof(*rec) + sizeof(u32), sizeof(u64));
-	size -= sizeof(u32);
+	size = ftrace_syscall_exit_state_size();
 
 	/*
 	 * Impossible, but be paranoid with the future
@@ -688,3 +684,75 @@ static int syscall_exit_register(struct ftrace_event_call *event,
 	}
 	return 0;
 }
+
+/* ftrace_syscall_enter_state_size - returns the state size required.
+ *
+ * @nb_args: number of system call args expected.
+ *           a negative value implies the maximum allowed.
+ */
+size_t ftrace_syscall_enter_state_size(int nb_args)
+{
+	/* syscall_get_arguments only supports up to 6 arguments. */
+	int arg_count = (nb_args >= 0 ? nb_args : 6);
+	size_t size = (sizeof(unsigned long) * arg_count) +
+		      sizeof(struct syscall_trace_enter);
+	size = ALIGN(size + sizeof(u32), sizeof(u64));
+	size -= sizeof(u32);
+	return size;
+}
+EXPORT_SYMBOL_GPL(ftrace_syscall_enter_state_size);
+
+size_t ftrace_syscall_exit_state_size(void)
+{
+	return ALIGN(sizeof(struct syscall_trace_exit) + sizeof(u32),
+		     sizeof(u64)) - sizeof(u32);
+}
+EXPORT_SYMBOL_GPL(ftrace_syscall_exit_state_size);
+
+/* ftrace_syscall_enter_state - build state for filter matching
+ *
+ * @buf: buffer to populate with current task state for matching
+ * @available: size available for use in the buffer.
+ * @entry: optional pointer to the trace_entry member of the state.
+ *
+ * Returns 0 on success and non-zero otherwise.
+ * If @entry is NULL, it will be ignored.
+ */
+int ftrace_syscall_enter_state(u8 *buf, size_t available,
+			       struct trace_entry **entry)
+{
+	struct syscall_trace_enter *sys_enter;
+	struct syscall_metadata *sys_data;
+	int size;
+	int syscall_nr;
+	struct pt_regs *regs = task_pt_regs(current);
+
+	syscall_nr = syscall_get_nr(current, regs);
+	if (syscall_nr < 0)
+		return -EINVAL;
+
+	sys_data = syscall_nr_to_meta(syscall_nr);
+	if (!sys_data)
+		return -EINVAL;
+
+	/* Determine the actual size needed. */
+	size = sizeof(unsigned long) * sys_data->nb_args +
+	       sizeof(struct syscall_trace_enter);
+	size = ALIGN(size + sizeof(u32), sizeof(u64));
+	size -= sizeof(u32);
+
+	BUG_ON(size > available);
+	sys_enter = (struct syscall_trace_enter *)buf;
+
+	/* Populating the struct trace_entry is left to the caller, but
+	 * a pointer is returned to encourage opacity.
+	 */
+	if (entry)
+		*entry = &sys_enter->ent;
+
+	sys_enter->nr = syscall_nr;
+	syscall_get_arguments(current, regs, 0, sys_data->nb_args,
+			      sys_enter->args);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ftrace_syscall_enter_state);
-- 
1.7.0.4
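
The size arithmetic in ftrace_syscall_enter_state_size() can be checked
outside the kernel. The sketch below repeats the same three steps with a
hypothetical 12-byte stand-in for the fixed part of struct
syscall_trace_enter (the real layout depends on the kernel version), and
ALIGN_UP is a local re-implementation of the kernel's ALIGN() for
power-of-two alignments. The point of adding and then subtracting
sizeof(u32) is that the record plus the u32 size field that precedes it
in the perf buffer ends on a u64 boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

/* Stand-in for the fixed part of struct syscall_trace_enter (assumed). */
struct mock_trace_enter {
	uint8_t ent[8];   /* placeholder for struct trace_entry */
	int32_t nr;       /* syscall number */
};

static size_t enter_state_size(int nb_args)
{
	/* A negative count means "the maximum", i.e. six arguments. */
	int arg_count = nb_args >= 0 ? nb_args : 6;
	size_t size = sizeof(unsigned long) * (size_t)arg_count +
		      sizeof(struct mock_trace_enter);

	/* Align record + u32 size field to u64, then drop the u32 again. */
	size = ALIGN_UP(size + sizeof(uint32_t), sizeof(uint64_t));
	return size - sizeof(uint32_t);
}
```

Passing -1 gives a buffer large enough for any syscall, which suits a
consumer that sizes its buffer before it knows the syscall number.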



* Re: [PATCH v3 02/13] tracing: split out syscall_trace_enter construction
  2011-06-01  3:10                                                       ` [PATCH v3 02/13] tracing: split out syscall_trace_enter construction Will Drewry
@ 2011-06-01  7:00                                                         ` Ingo Molnar
  2011-06-01 17:15                                                           ` Will Drewry
  0 siblings, 1 reply; 91+ messages in thread
From: Ingo Molnar @ 2011-06-01  7:00 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, kees.cook, torvalds, tglx, rostedt, jmorris,
	Frederic Weisbecker, Ingo Molnar

* Will Drewry <wad@chromium.org> wrote:

> perf appears to be the primary consumer of the CONFIG_FTRACE_SYSCALLS
> infrastructure.  As such, many of the helpers targeted at perf can be
> split into a perf-focused helper and a generic CONFIG_FTRACE_SYSCALLS
> consumer interface.
> 
> This change splits out syscall_trace_enter construction from
> perf_syscall_enter for current into two helpers:
> - ftrace_syscall_enter_state
> - ftrace_syscall_enter_state_size
> 
> And adds another helper for completeness:
> - ftrace_syscall_exit_state_size
> 
> These helpers allow for shared code between perf ftrace events and
> any other consumers of CONFIG_FTRACE_SYSCALLS events.  The proposed
> seccomp_filter patches use this code.
> 
> Signed-off-by: Will Drewry <wad@chromium.org>
> ---
>  include/trace/syscall.h       |    4 ++
>  kernel/trace/trace_syscalls.c |   96 +++++++++++++++++++++++++++++++++++------
>  2 files changed, 86 insertions(+), 14 deletions(-)

So, looking at the diffstat comparison again:

       bitmask (2009):  6 files changed,  194 insertions(+), 22 deletions(-)
 filter engine (2010): 18 files changed, 1100 insertions(+), 21 deletions(-)
 event filters (2011):  5 files changed,   82 insertions(+), 16 deletions(-)

you went back to the middle solution again which is the worst of them 
- why?

If you want this to be a stupid, limited hack then go for the v1 
bitmask.

If you agree with my observation that filters allow the clean 
user-space implementation of LSM equivalent security solutions (of 
which sandboxes are just a *narrow special case*) then please use the 
main highlevel abstraction we have defined around them: event 
filters.

Now, my observation was not uncontested so let me try to sum up the
rather large discussion that erupted around it, as i see it.

I saw four main counter arguments:

 - "Sandboxing is special and should stay separate from LSMs."

   I think this is a technically bogus argument, see:

         https://lkml.org/lkml/2011/5/26/85

   That answer of mine went unchallenged.

 - "Events should only be observers."

   Even ignoring the question of why on earth it should be a problem 
   for a willing call-site to use event filtering results sensibly, 
   this argument misses the plain fact that events are *already* 
   active participants, see:

         http://www.spinics.net/lists/mips/msg41075.html

   That answer of mine went unchallenged too.

 - "This feature is too simplistic."

   That's wrong i think, the feature is highly flexible:

         http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg51387.html

   This reply of mine went unchallenged as well.

 - "Is this feature actually useful enough for applications, does it 
    justify the complexity?"

  This is the *only* valid technical counter-argument i saw, and it's 
  a crucial one that is not fully answered yet. Since i think the feature
  is an LSM equivalent i think it's at least as useful as any LSM is.

 - [ if i missed any important argument then someone please insert it 
     here. ]

But what you do here is to use the filter engine directly which is 
both a limited hack *and* complex (beyond the linecount it doubles 
our ABI exposure, amongst other things), so i find that approach 
rather counter-productive, now that i've seen the real thing.

Will this feature be just another example of the LSM status quo 
dragging down a newcomer into the mud, until it's just as sucky and 
limited as any existing LSMs? That would be a sad outcome!

Thanks,

	Ingo

ps. Please start a new discussion thread for the next iteration!
    This one is *way* too deep already.


* Re: [PATCH v3 02/13] tracing: split out syscall_trace_enter construction
  2011-06-01  7:00                                                         ` Ingo Molnar
@ 2011-06-01 17:15                                                           ` Will Drewry
  2011-06-02 14:29                                                             ` Ingo Molnar
  0 siblings, 1 reply; 91+ messages in thread
From: Will Drewry @ 2011-06-01 17:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, kees.cook, torvalds, tglx, rostedt, jmorris,
	Frederic Weisbecker, Ingo Molnar

On Wed, Jun 1, 2011 at 2:00 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Will Drewry <wad@chromium.org> wrote:
>
>> perf appears to be the primary consumer of the CONFIG_FTRACE_SYSCALLS
>> infrastructure.  As such, many of the helpers targeted at perf can be
>> split into a perf-focused helper and a generic CONFIG_FTRACE_SYSCALLS
>> consumer interface.
>>
>> This change splits out syscall_trace_enter construction from
>> perf_syscall_enter for current into two helpers:
>> - ftrace_syscall_enter_state
>> - ftrace_syscall_enter_state_size
>>
>> And adds another helper for completeness:
>> - ftrace_syscall_exit_state_size
>>
>> These helpers allow for shared code between perf ftrace events and
>> any other consumers of CONFIG_FTRACE_SYSCALLS events.  The proposed
>> seccomp_filter patches use this code.
>>
>> Signed-off-by: Will Drewry <wad@chromium.org>
>> ---
>>  include/trace/syscall.h       |    4 ++
>>  kernel/trace/trace_syscalls.c |   96 +++++++++++++++++++++++++++++++++++------
>>  2 files changed, 86 insertions(+), 14 deletions(-)
>
> So, looking at the diffstat comparison again:
>
>       bitmask (2009):  6 files changed,  194 insertions(+), 22 deletions(-)
>  filter engine (2010): 18 files changed, 1100 insertions(+), 21 deletions(-)
>  event filters (2011):  5 files changed,   82 insertions(+), 16 deletions(-)
>
> you went back to the middle solution again which is the worst of them
> - why?

In short, design for the future and implement now.  I'll elaborate a
bit more below.

> If you want this to be a stupid, limited hack then go for the v1
> bitmask.

I only aim for the finest!

(bitmasks were bad for the other consumers of this patch series:
socketcall multiplexing issues and ioctl # filtering).

> If you agree with my observation that filters allow the clean
> user-space implementation of LSM equivalent security solutions (of
> which sandboxes are just a *narrow special case*) then please use the
> main highlevel abstraction we have defined around them: event
> filters.

I agree that LSM-equivalent security solutions can be moved over to an
ftrace based infrastructure.  However, LSMs and seccomp have different
semantics.  Reducing the kernel attack surface in a
"sandboxing"-sort-of-way requires a default-deny interface that is
resilient to kernel changes (like new system calls) without
immediately degrading robustness.  LSMs provide a fail-open mechanism
for taking an active role in kernel-defined pinch points.  It is
possible to implement a default-deny LSM, but it requires a "hook" for
every security event and the addition of a security event results in a
hole in the not-so-default-deny infrastructure.  ftrace + event
filters have the same limitation.

Based on my observations while exploring the code, it appears that the
LSM security_* calls could easily become active trace events and the
LSM infrastructure moved over to use those as tracepoints or via
event_filters.  There will be a need for new predicates for the
various new types (inode *, etc), and so on.  However, the
trace_sys_enter/__secure_computing model will still be a special case.
Even if they fed into a security event subsystem or something like
that, the absence of filters on a traced process would need to
default-deny as well as when there are no active matches.  So while a
brand-new shared ABI may be possible (security_event_open,
active_event_open, ?), there will still be trickiness in making the
behaviors not have implicit side effects and in ensuring that newly added
system calls, for instance, that lack the macro wrapper don't poke a
hole in the "sandbox" model.  There are a lot of options for designing
it though.  Like making TIF_SECCOMP mean that any security_* filter
failure or match count of 0 == process death.  It's just that
designing this new approach will be incredibly hairy, and we really
lack many of the concrete requirements that would be needed, in my
opinion.
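
The difference between the two semantics fits in a few lines of
user-space C. This is a sketch under stated assumptions, not the
seccomp_filter code: struct filter_set, the tiny eight-entry table and
all function names are hypothetical. The only thing that differs between
the two models is what happens when no filter is installed for a system
call, including one the table has never heard of.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MODEL_NR_SYSCALLS = 8 };   /* tiny hypothetical syscall table */

typedef bool (*syscall_pred)(const unsigned long *args);

struct filter_set {
	syscall_pred pred[MODEL_NR_SYSCALLS];   /* NULL: no filter installed */
};

/* Example predicate: permit the call only if its first argument is 1. */
static bool first_arg_is_one(const unsigned long *args)
{
	return args[0] == 1;
}

/* Default-deny: a missing filter or an unknown number is refused. */
static bool default_deny_permits(const struct filter_set *fs, int nr,
				 const unsigned long *args)
{
	if (nr < 0 || nr >= MODEL_NR_SYSCALLS || !fs->pred[nr])
		return false;
	return fs->pred[nr](args);
}

/* Fail-open: only calls that have a hook installed are ever checked. */
static bool fail_open_permits(const struct filter_set *fs, int nr,
			      const unsigned long *args)
{
	if (nr < 0 || nr >= MODEL_NR_SYSCALLS || !fs->pred[nr])
		return true;
	return fs->pred[nr](args);
}
```

With the fail-open variant, a system call added after the filters were
written goes through unchecked; with default-deny it is refused until
someone writes a filter for it.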

> Now, my observation was not uncontested so let me try to sum up the
> rather large discussion that erupted around it, as i see it.
>
> I saw four main counter arguments:
>
>  - "Sandboxing is special and should stay separate from LSMs."
>
>   I think this is a technically bogus argument, see:
>
>         https://lkml.org/lkml/2011/5/26/85
>
>   That answer of mine went unchallenged.

I may have spoken to this above.  I dunno.

>  - "Events should only be observers."
>
>   Even ignoring the question of why on earth it should be a problem
>   for a willing call-site to use event filtering results sensibly,
>   this argument misses the plain fact that events are *already*
>   active participants, see:
>
>         http://www.spinics.net/lists/mips/msg41075.html
>
>   That answer of mine went unchallenged too.
>
>  - "This feature is too simplistic."
>
>   That's wrong i think, the feature is highly flexible:
>
>         http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg51387.html
>
>   This reply of mine went unchallenged as well.

Well I did only implement a PoC.  It couldn't handle attack surface
reduction after-the-fact, nor did I add a GET_FILTER call, etc.  The
code was minimal in many ways because the functionality was too.

>  - "Is this feature actually useful enough for applications, does it
>    justify the complexity?"
>
>  This is the *only* valid technical counter-argument i saw, and it's
>  a crucial one that is not fully answered yet. Since i think the feature
>  is an LSM equivalent i think it's at least as useful as any LSM is.
>
>  - [ if i missed any important argument then someone please insert it
>     here. ]
>
> But what you do here is to use the filter engine directly which is
> both a limited hack *and* complex (beyond the linecount it doubles
> our ABI exposure, amongst other things), so i find that approach
> rather counter-productive, now that i've seen the real thing.
>
> Will this feature be just another example of the LSM status quo
> dragging down a newcomer into the mud, until it's just as sucky and
> limited as any existing LSMs? That would be a sad outcome!

I hope not.  I believe it will be easy to move the backend of
seccomp_filter over to a per-task ftrace event filter infrastructure
when that comes in the future.  But for now, I'm trying to meet the
needs of possible consumers (chromium, qemu, lxc) and lay the
groundwork for an ftrace future.

If this is a total fail, then perhaps we should have a separate
discussion over how we can tackle a lot of these needs.  I was hoping
that we could push some of that off to the LinuxSecuritySummit -- I've
proposed/requested a QA panel on this topic :)  But I'd love to not
wait until then for everything.

> ps. Please start a new discussion thread for the next iteration!
>    This one is *way* too deep already.

Sorry - will do!

thanks!
will

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 02/13] tracing: split out syscall_trace_enter construction
  2011-06-01 17:15                                                           ` Will Drewry
@ 2011-06-02 14:29                                                             ` Ingo Molnar
  2011-06-02 15:18                                                               ` Will Drewry
  0 siblings, 1 reply; 91+ messages in thread
From: Ingo Molnar @ 2011-06-02 14:29 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, kees.cook, torvalds, tglx, rostedt, jmorris,
	Frederic Weisbecker, Ingo Molnar

* Will Drewry <wad@chromium.org> wrote:

> > If you agree with my observation that filters allow the clean 
> > user-space implementation of LSM equivalent security solutions 
> > (of which sandboxes are just a *narrow special case*) then please 
> > use the main highlevel abstraction we have defined around them: 
> > event filters.
> 
> I agree that LSM-equivalent security solutions can be moved over to 
> an ftrace based infrastructure.  However, LSMs and seccomp have 
> different semantics.  Reducing the kernel attack surface in a 
> "sandboxing"-sort-of-way requires a default-deny interface that is 
> resilient to kernel changes (like new system calls) without 
> immediately degrading robustness. [...]

Correct. Because seccomp is the user of those syscall-surface events 
it can use them in such a way - i see no problem there: unknown or 
not permitted syscalls get denied for seccomp-mode-2 tasks.

> [...] LSMs provide a fail-open mechanism for taking an active role 
> in kernel-defined pinch points.  It is possible to implement a 
> default-deny LSM, but it requires a "hook" for every security event 
> and the addition of a security event results in a hole in the 
> not-so-default-deny infrastructure.  ftrace + event filters are the 
> same.

Well, i only suggested that it's LSM-equivalent security 
functionality, i did not suggest that you should implement an LSM in 
security/. I do not think the LSM modularization is particularly well 
fit for seccomp.

> Based on my observations while exploring the code, it appears that 
> the LSM security_* calls could easily become active trace events 
> and the LSM infrastructure moved over to use those as tracepoints 
> or via event_filters.  There will be a need for new predicates for 
> the various new types (inode *, etc), and so on.  However, the 
> trace_sys_enter/__secure_computing model will still be a special 
> case.

Yes, and that special event will not go away!

I did not suggest to *replace* those events with the security events. 
I suggested to *combine* them - or at least have a model that 
smoothly extends to those events as well and does not limit itself to 
the syscall surface alone.

We'll want to have both.

But by hardcoding to only those events, and creating a 
syscall-numbering special ABI, a wall will be raised between this 
implementation and any future enhancement to cover other events. My 
suggestion would be to use the event filter approach - that way 
there's not a wall but an open door towards future extensions ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 02/13] tracing: split out syscall_trace_enter construction
  2011-06-02 14:29                                                             ` Ingo Molnar
@ 2011-06-02 15:18                                                               ` Will Drewry
  0 siblings, 0 replies; 91+ messages in thread
From: Will Drewry @ 2011-06-02 15:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, kees.cook, torvalds, tglx, rostedt, jmorris,
	Frederic Weisbecker, Ingo Molnar

On Thu, Jun 2, 2011 at 9:29 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Will Drewry <wad@chromium.org> wrote:
>
[...]
>
>> Based on my observations while exploring the code, it appears that
>> the LSM security_* calls could easily become active trace events
>> and the LSM infrastructure moved over to use those as tracepoints
>> or via event_filters.  There will be a need for new predicates for
>> the various new types (inode *, etc), and so on.  However, the
>> trace_sys_enter/__secure_computing model will still be a special
>> case.
>
> Yes, and that special event will not go away!
>
> I did not suggest to *replace* those events with the security events.
> I suggested to *combine* them - or at least have a model that
> smoothly extends to those events as well and does not limit itself to
> the syscall surface alone.
>
> We'll want to have both.
>
> But by hardcoding to only those events, and creating a
> syscall-numbering special ABI, a wall will be raised between this
> implementation and any future enhancement to cover other events. My
> suggestion would be to use the event filter approach - that way
> there's not a wall but an open door towards future extensions ;-)

Yeah, I can definitely see that.  We could have the prctl interface
take in the event id, but that introduces an additional dependency on
CONFIG_PERF_EVENTS (to get the id exported) and means we'll have much
more limited coverage of syscalls until the syscall wrapping matures.

Could this be resolved in the proposed change by supporting both
mechanisms? Or is that just asking for trouble?

E.g., it could be an extra field:
  prctl(PR_SET_SECCOMP_FILTER, PR_SECCOMP_FILTER_TYPE_EVENT,
        event_id, filter_string);
  prctl(PR_SET_SECCOMP_FILTER, PR_SECCOMP_FILTER_TYPE_SYSCALL,
        __NR_somesyscall, filter_string);
  [and the same for CLEAR_FILTER and GET_FILTER]

or even reserve negative values for event ids and positive for
syscalls (which feels more hackish).  Adding event_id support wouldn't
be much more additional code (since it's just a layer of
dereferencing).  Since there will likely be syscall-indexed entry
behavior no matter what (like there is for ftrace/perf_sysenter), it
won't necessarily be a large diversion in the future either.
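
As a rough illustration of the two variants (explicit type field vs.
signed id), here is a minimal userspace sketch.  All demo_ names are
hypothetical and nothing here is in the patch.  The signed variant
assumes that event id e is stored as -(e + 1), so that event id 0 stays
representable:

```c
#include <assert.h>

/* Hypothetical names; the thread only proposes the idea. */
enum demo_filter_type {
	DEMO_FILTER_TYPE_EVENT,
	DEMO_FILTER_TYPE_SYSCALL,
};

struct demo_filter_key {
	enum demo_filter_type type;
	int id;
};

/* Variant 1: an explicit type field travels next to the id. */
struct demo_filter_key demo_key(enum demo_filter_type type, int id)
{
	struct demo_filter_key key = { type, id };

	assert(id >= 0);
	return key;
}

/*
 * Variant 2: a single signed value.  Zero or positive is a syscall
 * number; a negative value v encodes event id -(v + 1).
 */
struct demo_filter_key demo_key_from_signed(int value)
{
	if (value < 0)
		return demo_key(DEMO_FILTER_TYPE_EVENT, -(value + 1));
	return demo_key(DEMO_FILTER_TYPE_SYSCALL, value);
}
```

Either way the lookup code would only ever see a (type, id) pair, which
is why supporting both should be little more than a layer of
dereferencing.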

If not, seccomp_filter could depend on both FTRACE_SYSCALLS and
exported PERF_EVENTS (or make "id"s not perf_event specific), then it
could just use the sys_enter event ids.  Doing so does have some other
properties that I'm not as fond of, like requiring debugfs to be
compiled in, mounted, and readable by the caller in order to construct
a filterset, so I can still see some benefit for the syscall number
use in some cases (much easier to deploy on a server without debugfs
access, etc).  Right now, having both interfaces doesn't really give
us anything, but having the field set aside for future exploration
isn't necessarily a bad thing!

What do you think? Would a change to support both be too crazy/dumb or
just crazy/dumb enough?  Or do you see another path that could avoid
isolating any current work from a more fruitful future?

thanks!
will

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters
  2011-05-26 18:49                                                     ` Will Drewry
  2011-06-01  3:10                                                       ` [PATCH v3 01/13] tracing: split out filter initialization and clean up Will Drewry
  2011-06-01  3:10                                                       ` [PATCH v3 02/13] tracing: split out syscall_trace_enter construction Will Drewry
@ 2011-06-01  3:10                                                       ` Will Drewry
  2011-06-02 17:36                                                         ` Paul E. McKenney
  2011-06-01  3:10                                                       ` [PATCH v3 04/13] seccomp_filter: add process state reporting Will Drewry
                                                                         ` (9 subsequent siblings)
  12 siblings, 1 reply; 91+ messages in thread
From: Will Drewry @ 2011-06-01  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: kees.cook, torvalds, tglx, mingo, rostedt, jmorris, Will Drewry,
	Peter Zijlstra, Frederic Weisbecker, linux-security-module

This change adds a new seccomp mode which specifies the allowed system
calls dynamically.  When in the new mode (2), all system calls are
checked against process-defined filters - first by system call number,
then by a filter string.  If an entry exists for a given system call and
all filter predicates evaluate to true, then the task may proceed.
Otherwise, the task is killed.
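
To make the lookup order concrete, here is a minimal userspace sketch
(not the kernel code).  The demo_ names, the table size, and the
function-pointer predicate standing in for a parsed filter string are
assumptions for illustration; the two sentinel values match the ones
used in the patch below:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NR_SYSCALLS  8       /* hypothetical; the kernel uses NR_syscalls */
#define DEMO_ACTION_DENY  0xffff  /* same sentinel values as the patch */
#define DEMO_ACTION_ALLOW 0xfffe

/* Stand-in for a parsed filter string: a predicate on one argument. */
typedef int (*demo_pred)(long arg0);

struct demo_filters {
	uint16_t syscalls[DEMO_NR_SYSCALLS]; /* sentinel or index into preds */
	demo_pred preds[DEMO_NR_SYSCALLS];
};

void demo_init(struct demo_filters *f)
{
	size_t i;

	assert(f != NULL);
	/* 0xff bytes give 0xffff per entry, so everything starts denied. */
	memset(f->syscalls, 0xff, sizeof(f->syscalls));
	for (i = 0; i < DEMO_NR_SYSCALLS; i++)
		f->preds[i] = NULL;
}

/* Returns 1 if the call may proceed, 0 if the task would be killed. */
int demo_check(const struct demo_filters *f, int nr, long arg0)
{
	uint16_t id;

	if (nr < 0 || nr >= DEMO_NR_SYSCALLS)
		return 0;
	id = f->syscalls[nr];
	if (id == DEMO_ACTION_ALLOW)
		return 1;
	if (id == DEMO_ACTION_DENY || id >= DEMO_NR_SYSCALLS)
		return 0;
	return f->preds[id] != NULL && f->preds[id](arg0);
}

/* Example predicate, equivalent to "fd == 1 || fd == 2". */
int demo_fd_is_stdout_or_stderr(long fd)
{
	return fd == 1 || fd == 2;
}
```

An entry with no filter stays at the deny sentinel, so a system call
that was never mentioned is rejected without evaluating anything.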

Filter string parsing and evaluation is handled by the ftrace filter
engine.  Related patches split out the perf filter initialization and
clean-up so that the calls can be shared. Filters inherit their
understanding of types and arguments for each system call from the
CONFIG_FTRACE_SYSCALLS subsystem, which already populates this
information in the enter_event (and exit_event) structures associated
with syscall_metadata. If
CONFIG_FTRACE_SYSCALLS is not compiled in, only filter strings of "1"
will be allowed.

The net result is a process may have its system calls filtered using the
ftrace filter engine's inherent understanding of system calls.  The set
of filters is specified through the PR_SET_SECCOMP_FILTER argument in
prctl(). For example, a filterset for a process, like pdftotext, that
should only process read-only input could (roughly) look like:
  sprintf(rdonly, "flags == %u", O_RDONLY|O_LARGEFILE);
  prctl(PR_SET_SECCOMP_FILTER, __NR_open, rdonly);
  prctl(PR_SET_SECCOMP_FILTER, __NR__llseek, "1");
  prctl(PR_SET_SECCOMP_FILTER, __NR_brk, "1");
  prctl(PR_SET_SECCOMP_FILTER, __NR_close, "1");
  prctl(PR_SET_SECCOMP_FILTER, __NR_exit_group, "1");
  prctl(PR_SET_SECCOMP_FILTER, __NR_fstat64, "1");
  prctl(PR_SET_SECCOMP_FILTER, __NR_mmap2, "1");
  prctl(PR_SET_SECCOMP_FILTER, __NR_munmap, "1");
  prctl(PR_SET_SECCOMP_FILTER, __NR_read, "1");
  prctl(PR_SET_SECCOMP_FILTER, __NR_write, "(fd == 1 || fd == 2)");
  prctl(PR_SET_SECCOMP, 2);

Subsequent calls to PR_SET_SECCOMP_FILTER for the same system call will
be &&'d together to ensure that attack surface may only be reduced:
  prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");

With the earlier example, the active filter becomes:
  "(fd == 1 || fd == 2) && fd != 2"

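The merge step can be sketched in userspace as follows.  This is only
an illustration of the "(old) && new" composition and its length check;
the demo_ names and the buffer limit are hypothetical (the patch uses
MAX_FILTER_STR_VAL):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MAX_FILTER_LENGTH 256 /* hypothetical limit for this sketch */

/* "1" means "always allow", which can never narrow an existing filter. */
int demo_filter_is_allow(const char *filter)
{
	return strcmp(filter, "1") == 0;
}

/*
 * Writes "(old) && extra" into out.  Returns 0 on success and -1 if
 * extra is "1" or the merged string would not fit in outlen bytes.
 */
int demo_merge_filters(char *out, size_t outlen,
		       const char *old, const char *extra)
{
	int n;

	assert(out != NULL && old != NULL && extra != NULL);
	if (demo_filter_is_allow(extra))
		return -1;
	n = snprintf(out, outlen, "(%s) && %s", old, extra);
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return 0;
}
```

Note that when the old filter is already parenthesized, this sketch
produces "((fd == 1 || fd == 2)) && fd != 2", which is logically the
same as the string shown above.
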
The patch also adds PR_CLEAR_SECCOMP_FILTER and PR_GET_SECCOMP_FILTER.
The latter returns the current filter for a system call to userspace:

  prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf, bufsize);

while the former clears any filters for a given system call, changing it
back to a default deny:

  prctl(PR_CLEAR_SECCOMP_FILTER, __NR_write);

v3: - always block execve calls (as per linus torvalds)
    - add __NR_seccomp_execve(_32) to seccomp-supporting arches
    - ensure compat tasks can't reach ftrace:syscalls
    - dropped new defines for seccomp modes.
    - two level array instead of hlists (sugg. by olof johansson)
    - added generic Kconfig entry that is not connected.
    - dropped internal seccomp.h
    - move prctl helpers to seccomp_filter
    - killed seccomp_t typedef (as per checkpatch)
v2: - changed to use the existing syscall number ABI.
    - prctl changes to minimize parsing in the kernel:
      prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
      prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
      prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
      prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
    - defined PR_SECCOMP_MODE_STRICT and ..._FILTER
    - added flags
    - provide a default fail syscall_nr_to_meta in ftrace
    - provides fallback for unhooked system calls
    - use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
    - added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
    - moved to a hlist and 4 bit hash of linked lists
    - added support to operate without CONFIG_FTRACE_SYSCALLS
    - moved Kconfig support next to SECCOMP
    - made Kconfig entries dependent on EXPERIMENTAL
    - added macros to avoid ifdefs from kernel/fork.c
    - added compat task/filter matching
    - drop seccomp.h inclusion in sched.h and drop seccomp_t
    - added Filtering to "show" output
    - added on_exec state dup'ing when enabling after a fast-path accept.

Signed-off-by: Will Drewry <wad@chromium.org>
---
 include/linux/prctl.h   |    5 +
 include/linux/sched.h   |    2 +-
 include/linux/seccomp.h |   98 ++++++-
 include/trace/syscall.h |    7 +
 kernel/Makefile         |    3 +
 kernel/fork.c           |    3 +
 kernel/seccomp.c        |   38 ++-
 kernel/seccomp_filter.c |  784 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c            |   13 +-
 security/Kconfig        |   17 +
 10 files changed, 954 insertions(+), 16 deletions(-)
 create mode 100644 kernel/seccomp_filter.c

diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..44723ce 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -64,6 +64,11 @@
 #define PR_GET_SECCOMP	21
 #define PR_SET_SECCOMP	22
 
+/* Get/set process seccomp filters */
+#define PR_GET_SECCOMP_FILTER	35
+#define PR_SET_SECCOMP_FILTER	36
+#define PR_CLEAR_SECCOMP_FILTER	37
+
 /* Get/set the capability bounding set (as per security/commoncap.c) */
 #define PR_CAPBSET_READ 23
 #define PR_CAPBSET_DROP 24
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18d63ce..3f0bc8d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1374,7 +1374,7 @@ struct task_struct {
 	uid_t loginuid;
 	unsigned int sessionid;
 #endif
-	seccomp_t seccomp;
+	struct seccomp_struct seccomp;
 
 /* Thread group tracking */
    	u32 parent_exec_id;
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 167c333..f4434ca 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -1,13 +1,33 @@
 #ifndef _LINUX_SECCOMP_H
 #define _LINUX_SECCOMP_H
 
+struct seq_file;
 
 #ifdef CONFIG_SECCOMP
 
+#include <linux/errno.h>
 #include <linux/thread_info.h>
+#include <linux/types.h>
 #include <asm/seccomp.h>
 
-typedef struct { int mode; } seccomp_t;
+struct seccomp_filters;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *
+ * @mode:
+ *     if this is 1, the process is under standard seccomp rules
+ *             is 2, the process is only allowed to make system calls where
+ *                   associated filters evaluate successfully.
+ * @filters: Metadata for filters if using CONFIG_SECCOMP_FILTER.
+ *           filters assignment/use should be RCU-protected and its contents
+ *           should never be modified when attached to a seccomp_struct.
+ */
+struct seccomp_struct {
+	uint16_t mode;
+#ifdef CONFIG_SECCOMP_FILTER
+	struct seccomp_filters *filters;
+#endif
+};
 
 extern void __secure_computing(int);
 static inline void secure_computing(int this_syscall)
@@ -16,15 +36,14 @@ static inline void secure_computing(int this_syscall)
 		__secure_computing(this_syscall);
 }
 
-extern long prctl_get_seccomp(void);
 extern long prctl_set_seccomp(unsigned long);
+extern long prctl_get_seccomp(void);
 
 #else /* CONFIG_SECCOMP */
 
 #include <linux/errno.h>
 
-typedef struct { } seccomp_t;
-
+struct seccomp_struct { };
 #define secure_computing(x) do { } while (0)
 
 static inline long prctl_get_seccomp(void)
@@ -32,11 +51,80 @@ static inline long prctl_get_seccomp(void)
 	return -EINVAL;
 }
 
-static inline long prctl_set_seccomp(unsigned long arg2)
+static inline long prctl_set_seccomp(unsigned long a2)
 {
 	return -EINVAL;
 }
 
 #endif /* CONFIG_SECCOMP */
 
+#ifdef CONFIG_SECCOMP_FILTER
+
+#define inherit_tsk_seccomp(_child, _orig) do { \
+	_child->seccomp.mode = _orig->seccomp.mode; \
+	_child->seccomp.filters = get_seccomp_filters(_orig->seccomp.filters); \
+	} while (0)
+#define put_tsk_seccomp(_tsk) put_seccomp_filters(_tsk->seccomp.filters)
+
+extern int seccomp_show_filters(struct seccomp_filters *filters,
+				struct seq_file *);
+extern long seccomp_set_filter(int, char *);
+extern long seccomp_clear_filter(int);
+extern long seccomp_get_filter(int, char *, unsigned long);
+
+extern long prctl_set_seccomp_filter(unsigned long, char __user *);
+extern long prctl_get_seccomp_filter(unsigned long, char __user *,
+				     unsigned long);
+extern long prctl_clear_seccomp_filter(unsigned long);
+
+extern struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *);
+extern void put_seccomp_filters(struct seccomp_filters *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+
+#else  /* CONFIG_SECCOMP_FILTER */
+
+struct seccomp_filters { };
+#define inherit_tsk_seccomp(_child, _orig) do { } while (0)
+#define put_tsk_seccomp(_tsk) do { } while (0)
+
+static inline int seccomp_show_filters(struct seccomp_filters *filters,
+				       struct seq_file *m)
+{
+	return -ENOSYS;
+}
+
+static inline long seccomp_set_filter(int syscall_nr, char *filter)
+{
+	return -ENOSYS;
+}
+
+static inline long seccomp_clear_filter(int syscall_nr)
+{
+	return -ENOSYS;
+}
+
+static inline long seccomp_get_filter(int syscall_nr,
+				      char *buf, unsigned long available)
+{
+	return -ENOSYS;
+}
+
+static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3)
+{
+	return -ENOSYS;
+}
+
+static inline long prctl_clear_seccomp_filter(unsigned long a2)
+{
+	return -ENOSYS;
+}
+
+static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3,
+					    unsigned long a4)
+{
+	return -ENOSYS;
+}
+#endif  /* CONFIG_SECCOMP_FILTER */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 242ae04..e061ad0 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -35,6 +35,8 @@ struct syscall_metadata {
 extern unsigned long arch_syscall_addr(int nr);
 extern int init_syscall_trace(struct ftrace_event_call *call);
 
+extern struct syscall_metadata *syscall_nr_to_meta(int);
+
 extern int reg_event_syscall_enter(struct ftrace_event_call *call);
 extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
 extern int reg_event_syscall_exit(struct ftrace_event_call *call);
@@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
 				      struct trace_event *event);
 enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
 				     struct trace_event *event);
+#else
+static inline struct syscall_metadata *syscall_nr_to_meta(int nr)
+{
+	return NULL;
+}
 #endif
 
 #ifdef CONFIG_PERF_EVENTS
diff --git a/kernel/Makefile b/kernel/Makefile
index 85cbfb3..84e7dfb 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
+ifeq ($(CONFIG_SECCOMP_FILTER),y)
+obj-$(CONFIG_SECCOMP) += seccomp_filter.o
+endif
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TREE_RCU) += rcutree.o
 obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index e7548de..6f835e0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
 #include <linux/cgroup.h>
 #include <linux/security.h>
 #include <linux/hugetlb.h>
+#include <linux/seccomp.h>
 #include <linux/swap.h>
 #include <linux/syscalls.h>
 #include <linux/jiffies.h>
@@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
+	put_tsk_seccomp(tsk);
 	free_task_struct(tsk);
 }
 EXPORT_SYMBOL(free_task);
@@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	if (err)
 		goto out;
 
+	inherit_tsk_seccomp(tsk, orig);
 	setup_thread_stack(tsk, orig);
 	clear_user_return_notifier(tsk);
 	clear_tsk_need_resched(tsk);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..0a942be 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -2,16 +2,20 @@
  * linux/kernel/seccomp.c
  *
  * Copyright 2004-2005  Andrea Arcangeli <andrea@cpushare.com>
+ * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
  *
  * This defines a simple but solid secure-computing mode.
  */
 
 #include <linux/seccomp.h>
 #include <linux/sched.h>
+#include <linux/slab.h>
 #include <linux/compat.h>
+#include <linux/unistd.h>
+#include <linux/ftrace_event.h>
 
+#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
 /* #define SECCOMP_DEBUG 1 */
-#define NR_SECCOMP_MODES 1
 
 /*
  * Secure computing mode 1 allows only read/write/exit/sigreturn.
@@ -32,10 +36,9 @@ static int mode1_syscalls_32[] = {
 
 void __secure_computing(int this_syscall)
 {
-	int mode = current->seccomp.mode;
 	int * syscall;
 
-	switch (mode) {
+	switch (current->seccomp.mode) {
 	case 1:
 		syscall = mode1_syscalls;
 #ifdef CONFIG_COMPAT
@@ -47,6 +50,17 @@ void __secure_computing(int this_syscall)
 				return;
 		} while (*++syscall);
 		break;
+#ifdef CONFIG_SECCOMP_FILTER
+	case 2:
+		if (this_syscall >= NR_syscalls || this_syscall < 0)
+			break;
+
+		if (!seccomp_test_filters(this_syscall))
+			return;
+
+		seccomp_filter_log_failure(this_syscall);
+		break;
+#endif
 	default:
 		BUG();
 	}
@@ -71,16 +85,22 @@ long prctl_set_seccomp(unsigned long seccomp_mode)
 	if (unlikely(current->seccomp.mode))
 		goto out;
 
-	ret = -EINVAL;
-	if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
-		current->seccomp.mode = seccomp_mode;
-		set_thread_flag(TIF_SECCOMP);
+	ret = 0;
+	switch (seccomp_mode) {
+	case 1:
 #ifdef TIF_NOTSC
 		disable_TSC();
 #endif
-		ret = 0;
+#ifdef CONFIG_SECCOMP_FILTER
+	case 2:
+#endif
+		current->seccomp.mode = seccomp_mode;
+		set_thread_flag(TIF_SECCOMP);
+		break;
+	default:
+		ret = -EINVAL;
 	}
 
- out:
+out:
 	return ret;
 }
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..9782f25
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,784 @@
+/* filter engine-based seccomp system call filtering
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ */
+
+#include <linux/compat.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/ftrace_event.h>
+#include <linux/seccomp.h>
+#include <linux/seq_file.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include <asm/syscall.h>
+#include <trace/syscall.h>
+
+
+#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
+
+#define SECCOMP_FILTER_ALLOW "1"
+#define SECCOMP_ACTION_DENY 0xffff
+#define SECCOMP_ACTION_ALLOW 0xfffe
+
+/**
+ * struct seccomp_filters - container for seccomp filterset
+ *
+ * @syscalls: array of 16-bit indices into @event_filters by syscall_nr
+ *            May also be SECCOMP_ACTION_DENY or SECCOMP_ACTION_ALLOW
+ * @event_filters: array of pointers to ftrace event objects
+ * @count: size of @event_filters
+ * @flags: anonymous struct to wrap filters-specific flags
+ * @usage: reference count to simplify use.
+ */
+struct seccomp_filters {
+	uint16_t syscalls[NR_syscalls];
+	struct event_filter **event_filters;
+	uint16_t count;
+	struct {
+		uint32_t compat:1,
+			 __reserved:31;
+	} flags;
+	atomic_t usage;
+};
+
+/* Handle ftrace symbol non-existence */
+#ifdef CONFIG_FTRACE_SYSCALLS
+#define create_event_filter(_ef_pptr, _event_type, _str) \
+	ftrace_parse_filter(_ef_pptr, _event_type, _str)
+#define get_filter_string(_ef) ftrace_get_filter_string(_ef)
+#define free_event_filter(_f) ftrace_free_filter(_f)
+
+#else
+
+#define create_event_filter(_ef_pptr, _event_type, _str) (-ENOSYS)
+#define get_filter_string(_ef) (NULL)
+#define free_event_filter(_f) do { } while (0)
+#endif
+
+/**
+ * seccomp_filters_new - allocates a new filters object
+ * @count: count to allocate for the event_filters array
+ *
+ * Returns ERR_PTR on error or an allocated object.
+ */
+static struct seccomp_filters *seccomp_filters_new(uint16_t count)
+{
+	struct seccomp_filters *f;
+
+	if (count >= SECCOMP_ACTION_ALLOW)
+		return ERR_PTR(-EINVAL);
+
+	f = kzalloc(sizeof(struct seccomp_filters), GFP_KERNEL);
+	if (!f)
+		return ERR_PTR(-ENOMEM);
+
+	/* Lazy SECCOMP_ACTION_DENY assignment. */
+	memset(f->syscalls, 0xff, sizeof(f->syscalls));
+	atomic_set(&f->usage, 1);
+
+	f->event_filters = NULL;
+	f->count = count;
+	if (!count)
+		return f;
+
+	f->event_filters = kzalloc(count * sizeof(struct event_filter *),
+				   GFP_KERNEL);
+	if (!f->event_filters) {
+		kfree(f);
+		f = ERR_PTR(-ENOMEM);
+	}
+	return f;
+}
+
+/**
+ * seccomp_filters_free - cleans up the filter list and frees the table
+ * @filters: NULL or live object to be completely destructed.
+ */
+static void seccomp_filters_free(struct seccomp_filters *filters)
+{
+	uint16_t count = 0;
+	if (!filters)
+		return;
+	while (count < filters->count) {
+		struct event_filter *f = filters->event_filters[count];
+		free_event_filter(f);
+		count++;
+	}
+	kfree(filters->event_filters);
+	kfree(filters);
+}
+
+static void __put_seccomp_filters(struct seccomp_filters *orig)
+{
+	WARN_ON(atomic_read(&orig->usage));
+	seccomp_filters_free(orig);
+}
+
+#define seccomp_filter_allow(_id) ((_id) == SECCOMP_ACTION_ALLOW)
+#define seccomp_filter_deny(_id) ((_id) == SECCOMP_ACTION_DENY)
+#define seccomp_filter_dynamic(_id) \
+	(!seccomp_filter_allow(_id) && !seccomp_filter_deny(_id))
+static inline uint16_t seccomp_filter_id(const struct seccomp_filters *f,
+					 int syscall_nr)
+{
+	if (!f)
+		return SECCOMP_ACTION_DENY;
+	return f->syscalls[syscall_nr];
+}
+
+static inline struct event_filter *seccomp_dynamic_filter(
+		const struct seccomp_filters *filters, uint16_t id)
+{
+	if (!seccomp_filter_dynamic(id))
+		return NULL;
+	return filters->event_filters[id];
+}
+
+static inline void set_seccomp_filter_id(struct seccomp_filters *filters,
+					 int syscall_nr, uint16_t id)
+{
+	filters->syscalls[syscall_nr] = id;
+}
+
+static inline void set_seccomp_filter(struct seccomp_filters *filters,
+				      int syscall_nr, uint16_t id,
+				      struct event_filter *dynamic_filter)
+{
+	filters->syscalls[syscall_nr] = id;
+	if (seccomp_filter_dynamic(id))
+		filters->event_filters[id] = dynamic_filter;
+}
+
+static struct event_filter *alloc_event_filter(int syscall_nr,
+					       const char *filter_string)
+{
+	struct syscall_metadata *data;
+	struct event_filter *filter = NULL;
+	int err;
+
+	data = syscall_nr_to_meta(syscall_nr);
+	/* Argument-based filtering only works on ftrace-hooked syscalls. */
+	err = -ENOSYS;
+	if (!data)
+		goto fail;
+	err = create_event_filter(&filter,
+				  data->enter_event->event.type,
+				  filter_string);
+	if (err)
+		goto fail;
+
+	return filter;
+fail:
+	kfree(filter);
+	return ERR_PTR(err);
+}
+
+/**
+ * seccomp_filters_copy - copies filters from src to dst.
+ *
+ * @dst: seccomp_filters to populate.
+ * @src: table to read from.
+ * @skip: specifies an entry, by system call, to skip.
+ *
+ * Returns non-zero on failure.
+ * Both the source and the destination should have no simultaneous
+ * writers, and dst should be exclusive to the caller.
+ * If @skip is < 0, it is ignored.
+ */
+static int seccomp_filters_copy(struct seccomp_filters *dst,
+				const struct seccomp_filters *src,
+				int skip)
+{
+	int id = 0, ret = 0, nr;
+	memcpy(&dst->flags, &src->flags, sizeof(src->flags));
+	memcpy(dst->syscalls, src->syscalls, sizeof(dst->syscalls));
+	if (!src->count)
+		goto done;
+	for (nr = 0; nr < NR_syscalls; ++nr) {
+		struct event_filter *filter;
+		const char *str;
+		uint16_t src_id = seccomp_filter_id(src, nr);
+		if (nr == skip) {
+			set_seccomp_filter(dst, nr, SECCOMP_ACTION_DENY,
+					   NULL);
+			continue;
+		}
+		if (!seccomp_filter_dynamic(src_id))
+			continue;
+		if (id >= dst->count) {
+			ret = -EINVAL;
+			goto done;
+		}
+		str = get_filter_string(seccomp_dynamic_filter(src, src_id));
+		filter = alloc_event_filter(nr, str);
+		if (IS_ERR(filter)) {
+			ret = PTR_ERR(filter);
+			goto done;
+		}
+		set_seccomp_filter(dst, nr, id, filter);
+		id++;
+	}
+
+done:
+	return ret;
+}
+
+/**
+ * seccomp_extend_filter - appends more text to a syscall_nr's filter
+ * @filters: unattached filter object to operate on
+ * @syscall_nr: syscall number to update filters for
+ * @filter_string: string to append to the existing filter
+ *
+ * The new string will be &&'d to the original filter string to ensure that it
+ * always matches the existing predicates or less:
+ *   (old_filter) && @filter_string
+ * Returns 0 on success and a negative errno on failure; @filters is
+ * updated in place.
+ */
+static int seccomp_extend_filter(struct seccomp_filters *filters,
+				 int syscall_nr, char *filter_string)
+{
+	struct event_filter *filter;
+	uint16_t id = seccomp_filter_id(filters, syscall_nr);
+	char *merged = NULL;
+	int ret = -EINVAL, expected;
+
+	/* No extending with a "1". */
+	if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string))
+		goto out;
+
+	filter = seccomp_dynamic_filter(filters, id);
+	ret = -ENOENT;
+	if (!filter)
+		goto out;
+
+	merged = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
+	ret = -ENOMEM;
+	if (!merged)
+		goto out;
+
+	expected = snprintf(merged, SECCOMP_MAX_FILTER_LENGTH, "(%s) && %s",
+			    get_filter_string(filter), filter_string);
+	ret = -E2BIG;
+	if (expected >= SECCOMP_MAX_FILTER_LENGTH || expected < 0)
+		goto out;
+
+	/* Free the old filter */
+	free_event_filter(filter);
+	set_seccomp_filter(filters, syscall_nr, id, NULL);
+
+	/* Replace it */
+	filter = alloc_event_filter(syscall_nr, merged);
+	if (IS_ERR(filter)) {
+		ret = PTR_ERR(filter);
+		goto out;
+	}
+	set_seccomp_filter(filters, syscall_nr, id, filter);
+	ret = 0;
+
+out:
+	kfree(merged);
+	return ret;
+}
+
+/**
+ * seccomp_add_filter - adds a filter for an unfiltered syscall
+ * @filters: filters object to add a filter/action to
+ * @syscall_nr: system call number to add a filter for
+ * @filter_string: the filter string to apply
+ *
+ * Returns 0 on success and non-zero otherwise.
+ */
+static int seccomp_add_filter(struct seccomp_filters *filters, int syscall_nr,
+			      char *filter_string)
+{
+	struct event_filter *filter;
+	int ret = 0;
+
+	if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string)) {
+		set_seccomp_filter(filters, syscall_nr,
+				   SECCOMP_ACTION_ALLOW, NULL);
+		goto out;
+	}
+
+	filter = alloc_event_filter(syscall_nr, filter_string);
+	if (IS_ERR(filter)) {
+		ret = PTR_ERR(filter);
+		goto out;
+	}
+	/* Always add to the last slot available since additions are
+	 * only done one at a time.
+	 */
+	set_seccomp_filter(filters, syscall_nr, filters->count - 1, filter);
+out:
+	return ret;
+}
+
+/* Wrap optional ftrace syscall support. Returns 1 on match or 0 otherwise. */
+static int filter_match_current(struct event_filter *event_filter)
+{
+	int err = 0;
+#ifdef CONFIG_FTRACE_SYSCALLS
+	uint8_t syscall_state[64];
+
+	memset(syscall_state, 0, sizeof(syscall_state));
+
+	/* The generic tracing entry can remain zeroed. */
+	err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
+					 NULL);
+	if (err)
+		return 0;
+
+	err = filter_match_preds(event_filter, syscall_state);
+#endif
+	return err;
+}
+
+static const char *syscall_nr_to_name(int syscall)
+{
+	const char *syscall_name = "unknown";
+	struct syscall_metadata *data = syscall_nr_to_meta(syscall);
+	if (data)
+		syscall_name = data->name;
+	return syscall_name;
+}
+
+static void filters_set_compat(struct seccomp_filters *filters)
+{
+#ifdef CONFIG_COMPAT
+	if (is_compat_task())
+		filters->flags.compat = 1;
+#endif
+}
+
+static inline int filters_compat_mismatch(struct seccomp_filters *filters)
+{
+	int ret = 0;
+	if (!filters)
+		return 0;
+#ifdef CONFIG_COMPAT
+	if (!!(is_compat_task()) != filters->flags.compat)
+		ret = 1;
+#endif
+	return ret;
+}
+
+static inline int syscall_is_execve(int syscall)
+{
+	int nr = __NR_execve;
+#ifdef CONFIG_COMPAT
+	if (is_compat_task())
+		nr = __NR_seccomp_execve_32;
+#endif
+	return syscall == nr;
+}
+
+#ifndef KSTK_EIP
+#define KSTK_EIP(x) 0L
+#endif
+
+void seccomp_filter_log_failure(int syscall)
+{
+	pr_info("%s[%d]: system call %d (%s) blocked at 0x%lx\n",
+		current->comm, task_pid_nr(current), syscall,
+		syscall_nr_to_name(syscall), KSTK_EIP(current));
+}
+
+/* put_seccomp_filters - decrements the reference count of @orig and may free. */
+void put_seccomp_filters(struct seccomp_filters *orig)
+{
+	if (IS_ERR_OR_NULL(orig))
+		return;
+
+	if (atomic_dec_and_test(&orig->usage))
+		__put_seccomp_filters(orig);
+}
+
+/* get_seccomp_filters - increments the reference count of @orig */
+struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *orig)
+{
+	if (!orig)
+		return NULL;
+	atomic_inc(&orig->usage);
+	return orig;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @syscall: number of the system call to test
+ *
+ * Tests against the filters attached to current->seccomp.
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+	uint16_t id;
+	struct event_filter *filter;
+	struct seccomp_filters *filters;
+	int ret = -EACCES;
+
+	rcu_read_lock();
+	filters = get_seccomp_filters(current->seccomp.filters);
+	rcu_read_unlock();
+
+	if (!filters)
+		goto out;
+
+	if (filters_compat_mismatch(filters)) {
+		pr_info("%s[%d]: seccomp_filter compat() mismatch.\n",
+			current->comm, task_pid_nr(current));
+		goto out;
+	}
+
+	/* execve is never allowed. */
+	if (syscall_is_execve(syscall))
+		goto out;
+
+	ret = 0;
+	id = seccomp_filter_id(filters, syscall);
+	if (seccomp_filter_allow(id))
+		goto out;
+
+	ret = -EACCES;
+	if (!seccomp_filter_dynamic(id))
+		goto out;
+
+	filter = seccomp_dynamic_filter(filters, id);
+	if (filter && filter_match_current(filter))
+		ret = 0;
+out:
+	put_seccomp_filters(filters);
+	return ret;
+}
+
+/**
+ * seccomp_show_filters - prints the current filter state to a seq_file
+ * @filters: properly get()'d filters object
+ * @m: the prepared seq_file to receive the data
+ *
+ * Returns 0 on a successful write.
+ */
+int seccomp_show_filters(struct seccomp_filters *filters, struct seq_file *m)
+{
+	int syscall;
+	seq_printf(m, "Mode: %d\n", current->seccomp.mode);
+	if (!filters)
+		goto out;
+
+	for (syscall = 0; syscall < NR_syscalls; ++syscall) {
+		uint16_t id = seccomp_filter_id(filters, syscall);
+		const char *filter_string = SECCOMP_FILTER_ALLOW;
+		if (seccomp_filter_deny(id))
+			continue;
+		seq_printf(m, "%d (%s): ",
+			      syscall,
+			      syscall_nr_to_name(syscall));
+		if (seccomp_filter_dynamic(id))
+			filter_string = get_filter_string(
+					  seccomp_dynamic_filter(filters, id));
+		seq_printf(m, "%s\n", filter_string);
+	}
+out:
+	return 0;
+}
+EXPORT_SYMBOL_GPL(seccomp_show_filters);
+
+/**
+ * seccomp_get_filter - copies the filter_string into "buf"
+ * @syscall_nr: system call number to look up
+ * @buf: destination buffer
+ * @bufsize: available space in the buffer.
+ *
+ * Context: User context only. This function may sleep on allocation and
+ *          operates on current. current must be attempting a system call
+ *          when this is called.
+ *
+ * Looks up the filter for the given system call number on current.  If found,
+ * the length of the NUL-terminated filter string is returned; a negative errno
+ * is returned on error. The NUL byte is not included in the length.
+ */
+long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize)
+{
+	struct seccomp_filters *filters;
+	struct event_filter *filter;
+	long ret = -EINVAL;
+	uint16_t id;
+
+	if (bufsize > SECCOMP_MAX_FILTER_LENGTH)
+		bufsize = SECCOMP_MAX_FILTER_LENGTH;
+
+	rcu_read_lock();
+	filters = get_seccomp_filters(current->seccomp.filters);
+	rcu_read_unlock();
+
+	if (!filters)
+		goto out;
+
+	ret = -ENOENT;
+	id = seccomp_filter_id(filters, syscall_nr);
+	if (seccomp_filter_deny(id))
+		goto out;
+
+	if (seccomp_filter_allow(id)) {
+		ret = strlcpy(buf, SECCOMP_FILTER_ALLOW, bufsize);
+		goto copied;
+	}
+
+	filter = seccomp_dynamic_filter(filters, id);
+	if (!filter)
+		goto out;
+	ret = strlcpy(buf, get_filter_string(filter), bufsize);
+
+copied:
+	if (ret >= bufsize) {
+		ret = -ENOSPC;
+		goto out;
+	}
+	/* Zero out any remaining buffer, just in case. */
+	memset(buf + ret, 0, bufsize - ret);
+out:
+	put_seccomp_filters(filters);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_get_filter);
+
+/**
+ * seccomp_clear_filter - clears the seccomp filter for a syscall.
+ * @syscall_nr: the system call number to clear filters for.
+ *
+ * Context: User context only. This function may sleep on allocation and
+ *          operates on current. current must be attempting a system call
+ *          when this is called.
+ *
+ * Returns 0 on success.
+ */
+long seccomp_clear_filter(int syscall_nr)
+{
+	struct seccomp_filters *filters = NULL, *orig_filters;
+	uint16_t id;
+	int ret = -EINVAL;
+
+	rcu_read_lock();
+	orig_filters = get_seccomp_filters(current->seccomp.filters);
+	rcu_read_unlock();
+
+	if (!orig_filters)
+		goto out;
+
+	if (filters_compat_mismatch(orig_filters))
+		goto out;
+
+	id = seccomp_filter_id(orig_filters, syscall_nr);
+	if (seccomp_filter_deny(id))
+		goto out;
+
+	/* Create a new filters object for the task */
+	if (seccomp_filter_dynamic(id))
+		filters = seccomp_filters_new(orig_filters->count - 1);
+	else
+		filters = seccomp_filters_new(orig_filters->count);
+
+	if (IS_ERR(filters)) {
+		ret = PTR_ERR(filters);
+		goto out;
+	}
+
+	/* Copy, but drop the requested entry. */
+	ret = seccomp_filters_copy(filters, orig_filters, syscall_nr);
+	if (ret)
+		goto out;
+	get_seccomp_filters(filters);  /* simplify the out: path */
+
+	rcu_assign_pointer(current->seccomp.filters, filters);
+	synchronize_rcu();
+	put_seccomp_filters(orig_filters);  /* for the task */
+out:
+	put_seccomp_filters(orig_filters);  /* for the get */
+	put_seccomp_filters(filters);  /* for the extra get */
+	return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_clear_filter);
+
+/**
+ * seccomp_set_filter - adds/extends a seccomp filter for a syscall.
+ * @syscall_nr: system call number to apply the filter to.
+ * @filter: ftrace filter string to apply.
+ *
+ * Context: User context only. This function may sleep on allocation and
+ *          operates on current. current must be attempting a system call
+ *          when this is called.
+ *
+ * New filters may be added for system calls when the current task is
+ * not in a secure computing mode (seccomp).  Otherwise, existing filters may
+ * be extended.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_set_filter(int syscall_nr, char *filter)
+{
+	struct seccomp_filters *filters = NULL, *orig_filters = NULL;
+	uint16_t id;
+	long ret = -EINVAL;
+	uint16_t filters_needed;
+
+	if (!filter)
+		goto out;
+
+	filter = strstrip(filter);
+	/* Disallow empty strings. */
+	if (filter[0] == 0)
+		goto out;
+
+	rcu_read_lock();
+	orig_filters = get_seccomp_filters(current->seccomp.filters);
+	rcu_read_unlock();
+
+	/* After the first call, compatibility mode is selected permanently. */
+	ret = -EACCES;
+	if (filters_compat_mismatch(orig_filters))
+		goto out;
+
+	filters_needed = orig_filters ? orig_filters->count : 0;
+	id = seccomp_filter_id(orig_filters, syscall_nr);
+	if (seccomp_filter_deny(id)) {
+		/* Don't allow DENYs to be changed when in a seccomp mode */
+		ret = -EACCES;
+		if (current->seccomp.mode)
+			goto out;
+		filters_needed++;
+	}
+
+	filters = seccomp_filters_new(filters_needed);
+	if (IS_ERR(filters)) {
+		ret = PTR_ERR(filters);
+		goto out;
+	}
+
+	filters_set_compat(filters);
+	if (orig_filters) {
+		ret = seccomp_filters_copy(filters, orig_filters, -1);
+		if (ret)
+			goto out;
+	}
+
+	if (seccomp_filter_deny(id))
+		ret = seccomp_add_filter(filters, syscall_nr, filter);
+	else
+		ret = seccomp_extend_filter(filters, syscall_nr, filter);
+	if (ret)
+		goto out;
+	get_seccomp_filters(filters);  /* simplify the error paths */
+
+	rcu_assign_pointer(current->seccomp.filters, filters);
+	synchronize_rcu();
+	put_seccomp_filters(orig_filters);  /* for the task */
+out:
+	put_seccomp_filters(orig_filters);  /* for the get */
+	put_seccomp_filters(filters);  /* for get or task, on err */
+	return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_set_filter);
+
+long prctl_set_seccomp_filter(unsigned long syscall_nr,
+			      char __user *user_filter)
+{
+	int nr;
+	long ret;
+	char *filter = NULL;
+
+	ret = -EINVAL;
+	if (syscall_nr >= NR_syscalls)
+		goto out;
+
+	ret = -EFAULT;
+	if (!user_filter)
+		goto out;
+
+	filter = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
+	ret = -ENOMEM;
+	if (!filter)
+		goto out;
+
+	ret = -EFAULT;
+	if (strncpy_from_user(filter, user_filter,
+			      SECCOMP_MAX_FILTER_LENGTH - 1) < 0)
+		goto out;
+
+	nr = (int) syscall_nr;
+	ret = seccomp_set_filter(nr, filter);
+
+out:
+	kfree(filter);
+	return ret;
+}
+
+long prctl_clear_seccomp_filter(unsigned long syscall_nr)
+{
+	int nr = -1;
+	long ret;
+
+	ret = -EINVAL;
+	if (syscall_nr >= NR_syscalls)
+		goto out;
+
+	nr = (int) syscall_nr;
+	ret = seccomp_clear_filter(nr);
+
+out:
+	return ret;
+}
+
+long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
+			      unsigned long available)
+{
+	int ret, nr;
+	unsigned long copied;
+	char *buf = NULL;
+	ret = -EINVAL;
+	if (!available)
+		goto out;
+	/* Ignore extra buffer space. */
+	if (available > SECCOMP_MAX_FILTER_LENGTH)
+		available = SECCOMP_MAX_FILTER_LENGTH;
+
+	ret = -EINVAL;
+	if (syscall_nr >= NR_syscalls)
+		goto out;
+	nr = (int) syscall_nr;
+
+	ret = -ENOMEM;
+	buf = kmalloc(available, GFP_KERNEL);
+	if (!buf)
+		goto out;
+
+	ret = seccomp_get_filter(nr, buf, available);
+	if (ret < 0)
+		goto out;
+
+	/* Include the NUL byte in the copy. */
+	copied = copy_to_user(dst, buf, ret + 1);
+	ret = -EFAULT;
+	if (copied)
+		goto out;
+	ret = 0;
+out:
+	kfree(buf);
+	return ret;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index af468ed..ed60d06 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1698,13 +1698,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		case PR_SET_ENDIAN:
 			error = SET_ENDIAN(me, arg2);
 			break;
-
 		case PR_GET_SECCOMP:
 			error = prctl_get_seccomp();
 			break;
 		case PR_SET_SECCOMP:
 			error = prctl_set_seccomp(arg2);
 			break;
+		case PR_SET_SECCOMP_FILTER:
+			error = prctl_set_seccomp_filter(arg2,
+							 (char __user *) arg3);
+			break;
+		case PR_CLEAR_SECCOMP_FILTER:
+			error = prctl_clear_seccomp_filter(arg2);
+			break;
+		case PR_GET_SECCOMP_FILTER:
+			error = prctl_get_seccomp_filter(arg2,
+							 (char __user *) arg3,
+							 arg4);
+			break;
 		case PR_GET_TSC:
 			error = GET_TSC_CTL(arg2);
 			break;
diff --git a/security/Kconfig b/security/Kconfig
index 95accd4..c76adf2 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -2,6 +2,10 @@
 # Security configuration
 #
 
+# Make seccomp filter Kconfig switch below available
+config HAVE_SECCOMP_FILTER
+       bool
+
 menu "Security options"
 
 config KEYS
@@ -82,6 +86,19 @@ config SECURITY_DMESG_RESTRICT
 
 	  If you are unsure how to answer this question, answer N.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	select SECCOMP
+	depends on HAVE_SECCOMP_FILTER && EXPERIMENTAL
+	help
+	  This kernel feature expands CONFIG_SECCOMP to allow computing
+	  in environments with reduced kernel access dictated by the
+	  application itself through prctl calls.  If
+	  CONFIG_FTRACE_SYSCALLS is available, then system call
+	  argument-based filtering predicates may be used.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config SECURITY
 	bool "Enable different security models"
 	depends on SYSFS
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters
  2011-06-01  3:10                                                       ` [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters Will Drewry
@ 2011-06-02 17:36                                                         ` Paul E. McKenney
  2011-06-02 18:14                                                           ` Will Drewry
  0 siblings, 1 reply; 91+ messages in thread
From: Paul E. McKenney @ 2011-06-02 17:36 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, kees.cook, torvalds, tglx, mingo, rostedt, jmorris,
	Peter Zijlstra, Frederic Weisbecker, linux-security-module

On Tue, May 31, 2011 at 10:10:35PM -0500, Will Drewry wrote:
> This change adds a new seccomp mode which specifies the allowed system
> calls dynamically.  When in the new mode (2), all system calls are
> checked against process-defined filters - first by system call number,
> then by a filter string.  If an entry exists for a given system call and
> all filter predicates evaluate to true, then the task may proceed.
> Otherwise, the task is killed.
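
For readers following along, here is a minimal userspace sketch of the
two-level check described above (system call number first, then the
filter).  It is a model under stated assumptions, not the patch's code:
every model_* name is hypothetical, the table is shrunk to 8 entries,
and a plain function pointer stands in for an ftrace event filter.  Only
the 0xffff/0xfffe sentinel values are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_NR_SYSCALLS 8	/* hypothetical; the patch uses NR_syscalls */
#define MODEL_DENY  0xffff	/* same value as SECCOMP_ACTION_DENY */
#define MODEL_ALLOW 0xfffe	/* same value as SECCOMP_ACTION_ALLOW */

/* Hypothetical stand-in for an event filter: a predicate on one argument. */
typedef int (*model_pred)(long arg0);

struct model_filters {
	uint16_t syscalls[MODEL_NR_SYSCALLS];	/* sentinel or index into preds */
	model_pred *preds;
	uint16_t count;				/* number of entries in preds */
};

/* Start with every system call denied, as the patch does via memset(0xff). */
static void model_init(struct model_filters *f, model_pred *preds,
		       uint16_t count)
{
	assert(f && count < MODEL_ALLOW);
	memset(f->syscalls, 0xff, sizeof(f->syscalls));
	f->preds = preds;
	f->count = count;
}

/* Models the filter string "fd == 1 || fd == 2". */
static int model_fd_is_out_or_err(long fd)
{
	return fd == 1 || fd == 2;
}

/* Returns 1 if the call may proceed, 0 if it must be blocked. */
static int model_test(const struct model_filters *f, int nr, long arg0)
{
	uint16_t id;

	if (!f || nr < 0 || nr >= MODEL_NR_SYSCALLS)
		return 0;
	id = f->syscalls[nr];
	if (id == MODEL_ALLOW)
		return 1;	/* a filter of "1": no predicate to evaluate */
	if (id == MODEL_DENY || id >= f->count || !f->preds[id])
		return 0;	/* no entry for this system call */
	return f->preds[id](arg0) != 0;
}
```

With slot 3 set to MODEL_ALLOW and slot 4 pointing at predicate 0,
model_test() accepts any call to 3, accepts 4 only for fd 1 or 2, and
rejects everything else, which is the default-deny behaviour the
description implies.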

A few questions below -- I can't say that I understand the RCU usage.

							Thanx, Paul

> Filter string parsing and evaluation is handled by the ftrace filter
> engine.  Related patches tweak the perf filter trace and free calls so
> that they can be shared. Filters inherit their understanding of
> types and arguments for each system call from the CONFIG_FTRACE_SYSCALLS
> subsystem which already populates this information in syscall_metadata
> associated enter_event (and exit_event) structures. If
> CONFIG_FTRACE_SYSCALLS is not compiled in, only filter strings of "1"
> will be allowed.
> 
> The net result is a process may have its system calls filtered using the
> ftrace filter engine's inherent understanding of system calls.  The set
> of filters is specified through the PR_SET_SECCOMP_FILTER argument in
> prctl(). For example, a filterset for a process, like pdftotext, that
> should only process read-only input could (roughly) look like:
>   sprintf(rdonly, "flags == %u", O_RDONLY|O_LARGEFILE);
>   prctl(PR_SET_SECCOMP_FILTER, __NR_open, rdonly);
>   prctl(PR_SET_SECCOMP_FILTER, __NR__llseek, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_brk, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_close, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_exit_group, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_fstat64, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_mmap2, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_munmap, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_read, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "(fd == 1 || fd == 2)");
>   prctl(PR_SET_SECCOMP, 2);
> 
> Subsequent calls to PR_SET_SECCOMP_FILTER for the same system call will
> be &&'d together to ensure that attack surface may only be reduced:
>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
> 
> With the earlier example, the active filter becomes:
>   "(fd == 1 || fd == 2) && fd != 2"
> 
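
The AND-composition above can be sketched in userspace as follows.  This
is an illustration under assumptions, not the format string the patch
uses: model_extend_filter() and MODEL_MAX_FILTER_LENGTH are hypothetical
names, and both operands are parenthesized here so that an "||" in
either string cannot widen the merged filter.  The length check mirrors
the -E2BIG test on the merged string in the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bound; the patch uses MAX_FILTER_STR_VAL. */
#define MODEL_MAX_FILTER_LENGTH 256

/*
 * Writes "(old) && (add)" into dst.
 * Returns 0 on success, -1 if the merged string would not fit in dstsize
 * bytes (including the terminating NUL).
 */
static int model_extend_filter(char *dst, size_t dstsize,
			       const char *old, const char *add)
{
	int expected;

	assert(dst && old && add);
	expected = snprintf(dst, dstsize, "(%s) && (%s)", old, add);
	if (expected < 0 || (size_t)expected >= dstsize)
		return -1;	/* truncated or failed: reject the merge */
	return 0;
}
```

Merging "fd == 1 || fd == 2" with "fd != 2" this way yields
"(fd == 1 || fd == 2) && (fd != 2)", so the result can only match a
subset of what the earlier filter matched.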
> The patch also adds PR_CLEAR_SECCOMP_FILTER and PR_GET_SECCOMP_FILTER.
> The latter returns the current filter for a system call to userspace:
> 
>   prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf, bufsize);
> 
> while the former clears any filters for a given system call, changing it
> back to a default deny:
> 
>   prctl(PR_CLEAR_SECCOMP_FILTER, __NR_write);
> 
> v3: - always block execve calls (as per linus torvalds)
>     - add __NR_seccomp_execve(_32) to seccomp-supporting arches
>     - ensure compat tasks can't reach ftrace:syscalls
>     - dropped new defines for seccomp modes.
>     - two level array instead of hlists (sugg. by olof johansson)
>     - added generic Kconfig entry that is not connected.
>     - dropped internal seccomp.h
>     - move prctl helpers to seccomp_filter
>     - killed seccomp_t typedef (as per checkpatch)
> v2: - changed to use the existing syscall number ABI.
>     - prctl changes to minimize parsing in the kernel:
>       prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
>       prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
>       prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
>       prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
>     - defined PR_SECCOMP_MODE_STRICT and ..._FILTER
>     - added flags
>     - provide a default fail syscall_nr_to_meta in ftrace
>     - provides fallback for unhooked system calls
>     - use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
>     - added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
>     - moved to a hlist and 4 bit hash of linked lists
>     - added support to operate without CONFIG_FTRACE_SYSCALLS
>     - moved Kconfig support next to SECCOMP
>     - made Kconfig entries dependent on EXPERIMENTAL
>     - added macros to avoid ifdefs from kernel/fork.c
>     - added compat task/filter matching
>     - drop seccomp.h inclusion in sched.h and drop seccomp_t
>     - added Filtering to "show" output
>     - added on_exec state dup'ing when enabling after a fast-path accept.
> 
> Signed-off-by: Will Drewry <wad@chromium.org>
> ---
>  include/linux/prctl.h   |    5 +
>  include/linux/sched.h   |    2 +-
>  include/linux/seccomp.h |   98 ++++++-
>  include/trace/syscall.h |    7 +
>  kernel/Makefile         |    3 +
>  kernel/fork.c           |    3 +
>  kernel/seccomp.c        |   38 ++-
>  kernel/seccomp_filter.c |  784 +++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sys.c            |   13 +-
>  security/Kconfig        |   17 +
>  10 files changed, 954 insertions(+), 16 deletions(-)
>  create mode 100644 kernel/seccomp_filter.c
> 
> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
> index a3baeb2..44723ce 100644
> --- a/include/linux/prctl.h
> +++ b/include/linux/prctl.h
> @@ -64,6 +64,11 @@
>  #define PR_GET_SECCOMP	21
>  #define PR_SET_SECCOMP	22
> 
> +/* Get/set process seccomp filters */
> +#define PR_GET_SECCOMP_FILTER	35
> +#define PR_SET_SECCOMP_FILTER	36
> +#define PR_CLEAR_SECCOMP_FILTER	37
> +
>  /* Get/set the capability bounding set (as per security/commoncap.c) */
>  #define PR_CAPBSET_READ 23
>  #define PR_CAPBSET_DROP 24
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 18d63ce..3f0bc8d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1374,7 +1374,7 @@ struct task_struct {
>  	uid_t loginuid;
>  	unsigned int sessionid;
>  #endif
> -	seccomp_t seccomp;
> +	struct seccomp_struct seccomp;
> 
>  /* Thread group tracking */
>     	u32 parent_exec_id;
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 167c333..f4434ca 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -1,13 +1,33 @@
>  #ifndef _LINUX_SECCOMP_H
>  #define _LINUX_SECCOMP_H
> 
> +struct seq_file;
> 
>  #ifdef CONFIG_SECCOMP
> 
> +#include <linux/errno.h>
>  #include <linux/thread_info.h>
> +#include <linux/types.h>
>  #include <asm/seccomp.h>
> 
> -typedef struct { int mode; } seccomp_t;
> +struct seccomp_filters;
> +/**
> + * struct seccomp_struct - the state of a seccomp'ed process
> + *
> + * @mode:
> + *     if this is 1, the process is under standard seccomp rules
> + *             is 2, the process is only allowed to make system calls where
> + *                   associated filters evaluate successfully.
> + * @filters: Metadata for filters if using CONFIG_SECCOMP_FILTER.
> + *           filters assignment/use should be RCU-protected and its contents
> + *           should never be modified when attached to a seccomp_struct.
> + */
> +struct seccomp_struct {
> +	uint16_t mode;
> +#ifdef CONFIG_SECCOMP_FILTER
> +	struct seccomp_filters *filters;
> +#endif
> +};
> 
>  extern void __secure_computing(int);
>  static inline void secure_computing(int this_syscall)
> @@ -16,15 +36,14 @@ static inline void secure_computing(int this_syscall)
>  		__secure_computing(this_syscall);
>  }
> 
> -extern long prctl_get_seccomp(void);
>  extern long prctl_set_seccomp(unsigned long);
> +extern long prctl_get_seccomp(void);
> 
>  #else /* CONFIG_SECCOMP */
> 
>  #include <linux/errno.h>
> 
> -typedef struct { } seccomp_t;
> -
> +struct seccomp_struct { };
>  #define secure_computing(x) do { } while (0)
> 
>  static inline long prctl_get_seccomp(void)
> @@ -32,11 +51,80 @@ static inline long prctl_get_seccomp(void)
>  	return -EINVAL;
>  }
> 
> -static inline long prctl_set_seccomp(unsigned long arg2)
> +static inline long prctl_set_seccomp(unsigned long a2)
>  {
>  	return -EINVAL;
>  }
> 
>  #endif /* CONFIG_SECCOMP */
> 
> +#ifdef CONFIG_SECCOMP_FILTER
> +
> +#define inherit_tsk_seccomp(_child, _orig) do { \
> +	_child->seccomp.mode = _orig->seccomp.mode; \
> +	_child->seccomp.filters = get_seccomp_filters(_orig->seccomp.filters); \
> +	} while (0)
> +#define put_tsk_seccomp(_tsk) put_seccomp_filters(_tsk->seccomp.filters)
> +
> +extern int seccomp_show_filters(struct seccomp_filters *filters,
> +				struct seq_file *);
> +extern long seccomp_set_filter(int, char *);
> +extern long seccomp_clear_filter(int);
> +extern long seccomp_get_filter(int, char *, unsigned long);
> +
> +extern long prctl_set_seccomp_filter(unsigned long, char __user *);
> +extern long prctl_get_seccomp_filter(unsigned long, char __user *,
> +				     unsigned long);
> +extern long prctl_clear_seccomp_filter(unsigned long);
> +
> +extern struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *);
> +extern void put_seccomp_filters(struct seccomp_filters *);
> +
> +extern int seccomp_test_filters(int);
> +extern void seccomp_filter_log_failure(int);
> +
> +#else  /* CONFIG_SECCOMP_FILTER */
> +
> +struct seccomp_filters { };
> +#define inherit_tsk_seccomp(_child, _orig) do { } while (0)
> +#define put_tsk_seccomp(_tsk) do { } while (0)
> +
> +static inline int seccomp_show_filters(struct seccomp_filters *filters,
> +				       struct seq_file *m)
> +{
> +	return -ENOSYS;
> +}
> +
> +static inline long seccomp_set_filter(int syscall_nr, char *filter)
> +{
> +	return -ENOSYS;
> +}
> +
> +static inline long seccomp_clear_filter(int syscall_nr)
> +{
> +	return -ENOSYS;
> +}
> +
> +static inline long seccomp_get_filter(int syscall_nr,
> +				      char *buf, unsigned long available)
> +{
> +	return -ENOSYS;
> +}
> +
> +static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3)
> +{
> +	return -ENOSYS;
> +}
> +
> +static inline long prctl_clear_seccomp_filter(unsigned long a2)
> +{
> +	return -ENOSYS;
> +}
> +
> +static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3,
> +					    unsigned long a4)
> +{
> +	return -ENOSYS;
> +}
> +#endif  /* CONFIG_SECCOMP_FILTER */
>  #endif /* _LINUX_SECCOMP_H */
> diff --git a/include/trace/syscall.h b/include/trace/syscall.h
> index 242ae04..e061ad0 100644
> --- a/include/trace/syscall.h
> +++ b/include/trace/syscall.h
> @@ -35,6 +35,8 @@ struct syscall_metadata {
>  extern unsigned long arch_syscall_addr(int nr);
>  extern int init_syscall_trace(struct ftrace_event_call *call);
> 
> +extern struct syscall_metadata *syscall_nr_to_meta(int);
> +
>  extern int reg_event_syscall_enter(struct ftrace_event_call *call);
>  extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
>  extern int reg_event_syscall_exit(struct ftrace_event_call *call);
> @@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
>  				      struct trace_event *event);
>  enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
>  				     struct trace_event *event);
> +#else
> +static inline struct syscall_metadata *syscall_nr_to_meta(int nr)
> +{
> +	return NULL;
> +}
>  #endif
> 
>  #ifdef CONFIG_PERF_EVENTS
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 85cbfb3..84e7dfb 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
>  obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
>  obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
>  obj-$(CONFIG_SECCOMP) += seccomp.o
> +ifeq ($(CONFIG_SECCOMP_FILTER),y)
> +obj-$(CONFIG_SECCOMP) += seccomp_filter.o
> +endif
>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>  obj-$(CONFIG_TREE_RCU) += rcutree.o
>  obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
> diff --git a/kernel/fork.c b/kernel/fork.c
> index e7548de..6f835e0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -34,6 +34,7 @@
>  #include <linux/cgroup.h>
>  #include <linux/security.h>
>  #include <linux/hugetlb.h>
> +#include <linux/seccomp.h>
>  #include <linux/swap.h>
>  #include <linux/syscalls.h>
>  #include <linux/jiffies.h>
> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
>  	free_thread_info(tsk->stack);
>  	rt_mutex_debug_task_free(tsk);
>  	ftrace_graph_exit_task(tsk);
> +	put_tsk_seccomp(tsk);
>  	free_task_struct(tsk);
>  }
>  EXPORT_SYMBOL(free_task);
> @@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
>  	if (err)
>  		goto out;
> 
> +	inherit_tsk_seccomp(tsk, orig);
>  	setup_thread_stack(tsk, orig);
>  	clear_user_return_notifier(tsk);
>  	clear_tsk_need_resched(tsk);
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 57d4b13..0a942be 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -2,16 +2,20 @@
>   * linux/kernel/seccomp.c
>   *
>   * Copyright 2004-2005  Andrea Arcangeli <andrea@cpushare.com>
> + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
>   *
>   * This defines a simple but solid secure-computing mode.
>   */
> 
>  #include <linux/seccomp.h>
>  #include <linux/sched.h>
> +#include <linux/slab.h>
>  #include <linux/compat.h>
> +#include <linux/unistd.h>
> +#include <linux/ftrace_event.h>
> 
> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
>  /* #define SECCOMP_DEBUG 1 */
> -#define NR_SECCOMP_MODES 1
> 
>  /*
>   * Secure computing mode 1 allows only read/write/exit/sigreturn.
> @@ -32,10 +36,9 @@ static int mode1_syscalls_32[] = {
> 
>  void __secure_computing(int this_syscall)
>  {
> -	int mode = current->seccomp.mode;
>  	int * syscall;
> 
> -	switch (mode) {
> +	switch (current->seccomp.mode) {
>  	case 1:
>  		syscall = mode1_syscalls;
>  #ifdef CONFIG_COMPAT
> @@ -47,6 +50,17 @@ void __secure_computing(int this_syscall)
>  				return;
>  		} while (*++syscall);
>  		break;
> +#ifdef CONFIG_SECCOMP_FILTER
> +	case 2:
> +		if (this_syscall >= NR_syscalls || this_syscall < 0)
> +			break;
> +
> +		if (!seccomp_test_filters(this_syscall))
> +			return;
> +
> +		seccomp_filter_log_failure(this_syscall);
> +		break;
> +#endif
>  	default:
>  		BUG();
>  	}
> @@ -71,16 +85,22 @@ long prctl_set_seccomp(unsigned long seccomp_mode)
>  	if (unlikely(current->seccomp.mode))
>  		goto out;
> 
> -	ret = -EINVAL;
> -	if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
> -		current->seccomp.mode = seccomp_mode;
> -		set_thread_flag(TIF_SECCOMP);
> +	ret = 0;
> +	switch (seccomp_mode) {
> +	case 1:
>  #ifdef TIF_NOTSC
>  		disable_TSC();
>  #endif
> -		ret = 0;
> +#ifdef CONFIG_SECCOMP_FILTER
> +	case 2:
> +#endif
> +		current->seccomp.mode = seccomp_mode;
> +		set_thread_flag(TIF_SECCOMP);
> +		break;
> +	default:
> +		ret = -EINVAL;
>  	}
> 
> - out:
> +out:
>  	return ret;
>  }
> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
> new file mode 100644
> index 0000000..9782f25
> --- /dev/null
> +++ b/kernel/seccomp_filter.c
> @@ -0,0 +1,784 @@
> +/* filter engine-based seccomp system call filtering
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/err.h>
> +#include <linux/errno.h>
> +#include <linux/ftrace_event.h>
> +#include <linux/seccomp.h>
> +#include <linux/seq_file.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +
> +#include <asm/syscall.h>
> +#include <trace/syscall.h>
> +
> +
> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
> +
> +#define SECCOMP_FILTER_ALLOW "1"
> +#define SECCOMP_ACTION_DENY 0xffff
> +#define SECCOMP_ACTION_ALLOW 0xfffe
> +
> +/**
> + * struct seccomp_filters - container for seccomp filterset
> + *
> + * @syscalls: array of 16-bit indices into @event_filters by syscall_nr
> + *            May also be SECCOMP_ACTION_DENY or SECCOMP_ACTION_ALLOW
> + * @event_filters: array of pointers to ftrace event objects
> + * @count: size of @event_filters
> + * @flags: anonymous struct to wrap filters-specific flags
> + * @usage: reference count to simplify use.
> + */
> +struct seccomp_filters {
> +	uint16_t syscalls[NR_syscalls];
> +	struct event_filter **event_filters;
> +	uint16_t count;
> +	struct {
> +		uint32_t compat:1,
> +			 __reserved:31;
> +	} flags;
> +	atomic_t usage;
> +};
> +
> +/* Handle ftrace symbol non-existence */
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +#define create_event_filter(_ef_pptr, _event_type, _str) \
> +	ftrace_parse_filter(_ef_pptr, _event_type, _str)
> +#define get_filter_string(_ef) ftrace_get_filter_string(_ef)
> +#define free_event_filter(_f) ftrace_free_filter(_f)
> +
> +#else
> +
> +#define create_event_filter(_ef_pptr, _event_type, _str) (-ENOSYS)
> +#define get_filter_string(_ef) (NULL)
> +#define free_event_filter(_f) do { } while (0)
> +#endif
> +
> +/**
> + * seccomp_filters_new - allocates a new filters object
> + * @count: count to allocate for the event_filters array
> + *
> + * Returns ERR_PTR on error or an allocated object.
> + */
> +static struct seccomp_filters *seccomp_filters_new(uint16_t count)
> +{
> +	struct seccomp_filters *f;
> +
> +	if (count >= SECCOMP_ACTION_ALLOW)
> +		return ERR_PTR(-EINVAL);
> +
> +	f = kzalloc(sizeof(struct seccomp_filters), GFP_KERNEL);
> +	if (!f)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/* Lazy SECCOMP_ACTION_DENY assignment. */
> +	memset(f->syscalls, 0xff, sizeof(f->syscalls));
> +	atomic_set(&f->usage, 1);
> +
> +	f->event_filters = NULL;
> +	f->count = count;
> +	if (!count)
> +		return f;
> +
> +	f->event_filters = kzalloc(count * sizeof(struct event_filter *),
> +				   GFP_KERNEL);
> +	if (!f->event_filters) {
> +		kfree(f);
> +		f = ERR_PTR(-ENOMEM);
> +	}
> +	return f;
> +}
> +
> +/**
> + * seccomp_filters_free - cleans up the filter list and frees the table
> + * @filters: NULL or live object to be completely destructed.
> + */
> +static void seccomp_filters_free(struct seccomp_filters *filters)
> +{
> +	uint16_t count = 0;
> +	if (!filters)
> +		return;
> +	while (count < filters->count) {
> +		struct event_filter *f = filters->event_filters[count];
> +		free_event_filter(f);
> +		count++;
> +	}
> +	kfree(filters->event_filters);
> +	kfree(filters);
> +}
> +
> +static void __put_seccomp_filters(struct seccomp_filters *orig)
> +{
> +	WARN_ON(atomic_read(&orig->usage));
> +	seccomp_filters_free(orig);
> +}
> +
> +#define seccomp_filter_allow(_id) ((_id) == SECCOMP_ACTION_ALLOW)
> +#define seccomp_filter_deny(_id) ((_id) == SECCOMP_ACTION_DENY)
> +#define seccomp_filter_dynamic(_id) \
> +	(!seccomp_filter_allow(_id) && !seccomp_filter_deny(_id))
> +static inline uint16_t seccomp_filter_id(const struct seccomp_filters *f,
> +					 int syscall_nr)
> +{
> +	if (!f)
> +		return SECCOMP_ACTION_DENY;
> +	return f->syscalls[syscall_nr];
> +}
> +
> +static inline struct event_filter *seccomp_dynamic_filter(
> +		const struct seccomp_filters *filters, uint16_t id)
> +{
> +	if (!seccomp_filter_dynamic(id))
> +		return NULL;
> +	return filters->event_filters[id];
> +}
> +
> +static inline void set_seccomp_filter_id(struct seccomp_filters *filters,
> +					 int syscall_nr, uint16_t id)
> +{
> +	filters->syscalls[syscall_nr] = id;
> +}
> +
> +static inline void set_seccomp_filter(struct seccomp_filters *filters,
> +				      int syscall_nr, uint16_t id,
> +				      struct event_filter *dynamic_filter)
> +{
> +	filters->syscalls[syscall_nr] = id;
> +	if (seccomp_filter_dynamic(id))
> +		filters->event_filters[id] = dynamic_filter;
> +}
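For readers following the two-level layout, a minimal userspace sketch of the lookup may help: a per-syscall array of 16-bit ids, where two reserved ids mean allow and deny, and any other id indexes a second array of filters. The 0xfffe value for allow, the tiny table size, the demo_ names, and the use of plain strings in place of struct event_filter are all assumptions made for illustration; only the 0xffff deny value follows from the memset() in the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NR_SYSCALLS  8        /* hypothetical, tiny table for the demo */
#define DEMO_ACTION_DENY  0xffffu  /* what memset(..., 0xff, ...) produces */
#define DEMO_ACTION_ALLOW 0xfffeu  /* assumed value, not taken from the patch */

struct demo_filters {
	uint16_t syscalls[DEMO_NR_SYSCALLS]; /* level 1: one id per syscall */
	const char **event_filters;          /* level 2: indexed by dynamic id */
	uint16_t count;                      /* number of level-2 slots */
};

/* Start with every syscall denied, as the lazy memset() does. */
static void demo_init(struct demo_filters *f, const char **slots,
		      uint16_t count)
{
	memset(f->syscalls, 0xff, sizeof(f->syscalls));
	f->event_filters = slots;
	f->count = count;
}

static int demo_is_dynamic(uint16_t id)
{
	return id != DEMO_ACTION_ALLOW && id != DEMO_ACTION_DENY;
}

/* Record the id and, for dynamic ids only, the filter in its slot. */
static void demo_set(struct demo_filters *f, int nr, uint16_t id,
		     const char *filter)
{
	f->syscalls[nr] = id;
	if (demo_is_dynamic(id))
		f->event_filters[id] = filter;
}

/* Returns the filter string, "1" for allow, or NULL for deny. */
static const char *demo_lookup(const struct demo_filters *f, int nr)
{
	uint16_t id = f->syscalls[nr];

	if (id == DEMO_ACTION_DENY)
		return NULL;
	if (id == DEMO_ACTION_ALLOW)
		return "1";
	return f->event_filters[id];
}
```

The point of the split is that allow and deny never touch the second array, so a task with only "1" filters needs no event_filters allocation at all.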
> +
> +static struct event_filter *alloc_event_filter(int syscall_nr,
> +					       const char *filter_string)
> +{
> +	struct syscall_metadata *data;
> +	struct event_filter *filter = NULL;
> +	int err;
> +
> +	data = syscall_nr_to_meta(syscall_nr);
> +	/* Argument-based filtering only works on ftrace-hooked syscalls. */
> +	err = -ENOSYS;
> +	if (!data)
> +		goto fail;
> +	err = create_event_filter(&filter,
> +				  data->enter_event->event.type,
> +				  filter_string);
> +	if (err)
> +		goto fail;
> +
> +	return filter;
> +fail:
> +	kfree(filter);
> +	return ERR_PTR(err);
> +}
> +
> +/**
> + * seccomp_filters_copy - copies filters from src to dst.
> + *
> + * @dst: seccomp_filters to populate.
> + * @src: table to read from.
> + * @skip: specifies an entry, by system call, to skip.
> + *
> + * Returns non-zero on failure.
> + * Both the source and the destination should have no simultaneous
> + * writers, and dst should be exclusive to the caller.
> + * If @skip is < 0, it is ignored.
> + */
> +static int seccomp_filters_copy(struct seccomp_filters *dst,
> +				const struct seccomp_filters *src,
> +				int skip)
> +{
> +	int id = 0, ret = 0, nr;
> +	memcpy(&dst->flags, &src->flags, sizeof(src->flags));
> +	memcpy(dst->syscalls, src->syscalls, sizeof(dst->syscalls));
> +	if (!src->count)
> +		goto done;
> +	for (nr = 0; nr < NR_syscalls; ++nr) {
> +		struct event_filter *filter;
> +		const char *str;
> +		uint16_t src_id = seccomp_filter_id(src, nr);
> +		if (nr == skip) {
> +			set_seccomp_filter(dst, nr, SECCOMP_ACTION_DENY,
> +					   NULL);
> +			continue;
> +		}
> +		if (!seccomp_filter_dynamic(src_id))
> +			continue;
> +		if (id >= dst->count) {
> +			ret = -EINVAL;
> +			goto done;
> +		}
> +		str = get_filter_string(seccomp_dynamic_filter(src, src_id));
> +		filter = alloc_event_filter(nr, str);
> +		if (IS_ERR(filter)) {
> +			ret = PTR_ERR(filter);
> +			goto done;
> +		}
> +		set_seccomp_filter(dst, nr, id, filter);
> +		id++;
> +	}
> +
> +done:
> +	return ret;
> +}
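The id handling in the copy loop can be modelled in userspace as follows. This is only a sketch of the renumbering: dynamic ids are reassigned densely from 0 in syscall order, the skipped syscall is forced to deny, and the copy fails if the destination has too few slots. The table size, the 0xfffe allow value, and the demo_ names are assumptions; the filter strings themselves are left out.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_TABLE_SIZE 6        /* hypothetical stand-in for NR_syscalls */
#define DEMO_ID_DENY    0xffffu
#define DEMO_ID_ALLOW   0xfffeu  /* assumed value, not taken from the patch */

/*
 * Copies ids from src to dst. Static allow/deny ids are copied as-is,
 * dynamic ids are renumbered 0, 1, 2, ... and the entry at index skip
 * (if >= 0) becomes deny. Returns the number of dynamic ids assigned,
 * or -EINVAL if more than dst_slots would be needed.
 */
static int demo_copy_ids(uint16_t *dst, const uint16_t *src,
			 int dst_slots, int skip)
{
	int id = 0, nr;

	for (nr = 0; nr < DEMO_TABLE_SIZE; nr++) {
		uint16_t s = src[nr];

		if (nr == skip) {
			dst[nr] = DEMO_ID_DENY;
			continue;
		}
		if (s == DEMO_ID_ALLOW || s == DEMO_ID_DENY) {
			dst[nr] = s;
			continue;
		}
		if (id >= dst_slots)
			return -EINVAL;
		dst[nr] = (uint16_t)id++;
	}
	return id;
}
```

Renumbering on every copy is what lets seccomp_clear_filter() allocate exactly count - 1 slots for the new table.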
> +
> +/**
> + * seccomp_extend_filter - appends more text to a syscall_nr's filter
> + * @filters: unattached filter object to operate on
> + * @syscall_nr: syscall number to update filters for
> + * @filter_string: string to append to the existing filter
> + *
> + * The new string will be &&'d to the original filter string to ensure that it
> + * always matches the existing predicates or less:
> + *   (old_filter) && @filter_string
> + * Returns 0 on success and a negative errno on failure.
> + */
> +static int seccomp_extend_filter(struct seccomp_filters *filters,
> +				 int syscall_nr, char *filter_string)
> +{
> +	struct event_filter *filter;
> +	uint16_t id = seccomp_filter_id(filters, syscall_nr);
> +	char *merged = NULL;
> +	int ret = -EINVAL, expected;
> +
> +	/* No extending with a "1". */
> +	if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string))
> +		goto out;
> +
> +	filter = seccomp_dynamic_filter(filters, id);
> +	ret = -ENOENT;
> +	if (!filter)
> +		goto out;
> +
> +	merged = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
> +	ret = -ENOMEM;
> +	if (!merged)
> +		goto out;
> +
> +	expected = snprintf(merged, SECCOMP_MAX_FILTER_LENGTH, "(%s) && %s",
> +			    get_filter_string(filter), filter_string);
> +	ret = -E2BIG;
> +	if (expected >= SECCOMP_MAX_FILTER_LENGTH || expected < 0)
> +		goto out;
> +
> +	/* Free the old filter */
> +	free_event_filter(filter);
> +	set_seccomp_filter(filters, syscall_nr, id, NULL);
> +
> +	/* Replace it */
> +	filter = alloc_event_filter(syscall_nr, merged);
> +	if (IS_ERR(filter)) {
> +		ret = PTR_ERR(filter);
> +		goto out;
> +	}
> +	set_seccomp_filter(filters, syscall_nr, id, filter);
> +	ret = 0;
> +
> +out:
> +	kfree(merged);
> +	return ret;
> +}
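The string handling in seccomp_extend_filter() boils down to one bounded snprintf(). A minimal userspace sketch, assuming a caller-supplied buffer in place of the kzalloc()'d one and a hypothetical demo_merge_filter() name:

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Writes "(old) && extra" into out. Returns 0 on success, or -E2BIG if
 * the merged string plus its NUL would not fit in outsize bytes.
 */
static int demo_merge_filter(char *out, size_t outsize,
			     const char *old, const char *extra)
{
	int expected = snprintf(out, outsize, "(%s) && %s", old, extra);

	/* snprintf() reports the untruncated length, so >= means truncation. */
	if (expected < 0 || (size_t)expected >= outsize)
		return -E2BIG;
	return 0;
}
```

Because the old predicate is parenthesized before the && is appended, the new term can only narrow what the filter accepts, regardless of operator precedence inside the old string.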
> +
> +/**
> + * seccomp_add_filter - adds a filter for an unfiltered syscall
> + * @filters: filters object to add a filter/action to
> + * @syscall_nr: system call number to add a filter for
> + * @filter_string: the filter string to apply
> + *
> + * Returns 0 on success and non-zero otherwise.
> + */
> +static int seccomp_add_filter(struct seccomp_filters *filters, int syscall_nr,
> +			      char *filter_string)
> +{
> +	struct event_filter *filter;
> +	int ret = 0;
> +
> +	if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string)) {
> +		set_seccomp_filter(filters, syscall_nr,
> +				   SECCOMP_ACTION_ALLOW, NULL);
> +		goto out;
> +	}
> +
> +	filter = alloc_event_filter(syscall_nr, filter_string);
> +	if (IS_ERR(filter)) {
> +		ret = PTR_ERR(filter);
> +		goto out;
> +	}
> +	/* Always add to the last slot available since additions are
> +	 * only done one at a time.
> +	 */
> +	set_seccomp_filter(filters, syscall_nr, filters->count - 1, filter);
> +out:
> +	return ret;
> +}
> +
> +/* Wrap optional ftrace syscall support. Returns 1 on match or 0 otherwise. */
> +static int filter_match_current(struct event_filter *event_filter)
> +{
> +	int err = 0;
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +	uint8_t syscall_state[64];
> +
> +	memset(syscall_state, 0, sizeof(syscall_state));
> +
> +	/* The generic tracing entry can remain zeroed. */
> +	err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
> +					 NULL);
> +	if (err)
> +		return 0;
> +
> +	err = filter_match_preds(event_filter, syscall_state);
> +#endif
> +	return err;
> +}
> +
> +static const char *syscall_nr_to_name(int syscall)
> +{
> +	const char *syscall_name = "unknown";
> +	struct syscall_metadata *data = syscall_nr_to_meta(syscall);
> +	if (data)
> +		syscall_name = data->name;
> +	return syscall_name;
> +}
> +
> +static void filters_set_compat(struct seccomp_filters *filters)
> +{
> +#ifdef CONFIG_COMPAT
> +	if (is_compat_task())
> +		filters->flags.compat = 1;
> +#endif
> +}
> +
> +static inline int filters_compat_mismatch(struct seccomp_filters *filters)
> +{
> +	int ret = 0;
> +	if (!filters)
> +		return 0;
> +#ifdef CONFIG_COMPAT
> +	if (!!(is_compat_task()) != filters->flags.compat)
> +		ret = 1;
> +#endif
> +	return ret;
> +}
> +
> +static inline int syscall_is_execve(int syscall)
> +{
> +	int nr = __NR_execve;
> +#ifdef CONFIG_COMPAT
> +	if (is_compat_task())
> +		nr = __NR_seccomp_execve_32;
> +#endif
> +	return syscall == nr;
> +}
> +
> +#ifndef KSTK_EIP
> +#define KSTK_EIP(x) 0L
> +#endif
> +
> +void seccomp_filter_log_failure(int syscall)
> +{
> +	pr_info("%s[%d]: system call %d (%s) blocked at 0x%lx\n",
> +		current->comm, task_pid_nr(current), syscall,
> +		syscall_nr_to_name(syscall), KSTK_EIP(current));
> +}
> +
> +/* put_seccomp_filters - decrements the reference count of @orig and may free. */
> +void put_seccomp_filters(struct seccomp_filters *orig)
> +{
> +	if (!orig)
> +		return;
> +
> +	if (atomic_dec_and_test(&orig->usage))
> +		__put_seccomp_filters(orig);
> +}
> +
> +/* get_seccomp_state - increments the reference count of @orig */
> +struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *orig)

Nit: the name does not match the comment.

> +{
> +	if (!orig)
> +		return NULL;
> +	atomic_inc(&orig->usage);
> +	return orig;

This is called in an RCU read-side critical section.  What exactly is
RCU protecting?  I would expect an rcu_dereference() or one of the
RCU list-traversal primitives somewhere, either here or at the caller.

> +}
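A userspace model of the get/put pairing, using C11 atomics in place of the kernel's atomic_t, is sketched below. It only illustrates the counting (the last put frees the object); it does not model RCU, so it says nothing about what protects the load of the pointer itself. All demo_ names and the freed_flag field are hypothetical.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct demo_obj {
	atomic_int usage;
	int *freed_flag;  /* set to 1 when the object is released */
};

static struct demo_obj *demo_new(int *freed_flag)
{
	struct demo_obj *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	atomic_init(&o->usage, 1);  /* the creator holds the first reference */
	o->freed_flag = freed_flag;
	return o;
}

static struct demo_obj *demo_get(struct demo_obj *o)
{
	if (!o)
		return NULL;
	atomic_fetch_add(&o->usage, 1);
	return o;
}

static void demo_put(struct demo_obj *o)
{
	if (!o)
		return;
	/* fetch_sub returns the old value; 1 means this was the last ref. */
	if (atomic_fetch_sub(&o->usage, 1) == 1) {
		*o->freed_flag = 1;
		free(o);
	}
}
```

Note that demo_get() is only safe if the caller already knows the object cannot reach zero concurrently, which is the same guarantee the review comment above asks about.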
> +
> +/**
> + * seccomp_test_filters - tests 'current' against the given syscall
> + * @syscall: number of the system call to test
> + *
> + * Returns 0 on ok and non-zero on error/failure.
> + */
> +int seccomp_test_filters(int syscall)
> +{
> +	uint16_t id;
> +	struct event_filter *filter;
> +	struct seccomp_filters *filters;
> +	int ret = -EACCES;
> +
> +	rcu_read_lock();
> +	filters = get_seccomp_filters(current->seccomp.filters);
> +	rcu_read_unlock();
> +
> +	if (!filters)
> +		goto out;
> +
> +	if (filters_compat_mismatch(filters)) {
> +		pr_info("%s[%d]: seccomp_filter compat() mismatch.\n",
> +			current->comm, task_pid_nr(current));
> +		goto out;
> +	}
> +
> +	/* execve is never allowed. */
> +	if (syscall_is_execve(syscall))
> +		goto out;
> +
> +	ret = 0;
> +	id = seccomp_filter_id(filters, syscall);
> +	if (seccomp_filter_allow(id))
> +		goto out;
> +
> +	ret = -EACCES;
> +	if (!seccomp_filter_dynamic(id))
> +		goto out;
> +
> +	filter = seccomp_dynamic_filter(filters, id);
> +	if (filter && filter_match_current(filter))
> +		ret = 0;
> +out:
> +	put_seccomp_filters(filters);
> +	return ret;
> +}
> +
> +/**
> + * seccomp_show_filters - prints the current filter state to a seq_file
> + * @filters: properly get()'d filters object
> + * @m: the prepared seq_file to receive the data
> + *
> + * Returns 0 on a successful write.
> + */
> +int seccomp_show_filters(struct seccomp_filters *filters, struct seq_file *m)
> +{
> +	int syscall;
> +	seq_printf(m, "Mode: %d\n", current->seccomp.mode);
> +	if (!filters)
> +		goto out;
> +
> +	for (syscall = 0; syscall < NR_syscalls; ++syscall) {
> +		uint16_t id = seccomp_filter_id(filters, syscall);
> +		const char *filter_string = SECCOMP_FILTER_ALLOW;
> +		if (seccomp_filter_deny(id))
> +			continue;
> +		seq_printf(m, "%d (%s): ",
> +			      syscall,
> +			      syscall_nr_to_name(syscall));
> +		if (seccomp_filter_dynamic(id))
> +			filter_string = get_filter_string(
> +					  seccomp_dynamic_filter(filters, id));
> +		seq_printf(m, "%s\n", filter_string);
> +	}
> +out:
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(seccomp_show_filters);
> +
> +/**
> + * seccomp_get_filter - copies the filter_string into "buf"
> + * @syscall_nr: system call number to look up
> + * @buf: destination buffer
> + * @bufsize: available space in the buffer.
> + *
> + * Context: User context only. This function may sleep on allocation and
> + *          operates on current. current must be attempting a system call
> + *          when this is called.
> + *
> + * Looks up the filter for the given system call number on current.  If found,
> + * the string length of the NUL-terminated buffer is returned and < 0 is
> + * returned on error. The NUL byte is not included in the length.
> + */
> +long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize)
> +{
> +	struct seccomp_filters *filters;
> +	struct event_filter *filter;
> +	long ret = -EINVAL;
> +	uint16_t id;
> +
> +	if (bufsize > SECCOMP_MAX_FILTER_LENGTH)
> +		bufsize = SECCOMP_MAX_FILTER_LENGTH;
> +
> +	rcu_read_lock();
> +	filters = get_seccomp_filters(current->seccomp.filters);
> +	rcu_read_unlock();
> +
> +	if (!filters)
> +		goto out;
> +
> +	ret = -ENOENT;
> +	id = seccomp_filter_id(filters, syscall_nr);
> +	if (seccomp_filter_deny(id))
> +		goto out;
> +
> +	if (seccomp_filter_allow(id)) {
> +		ret = strlcpy(buf, SECCOMP_FILTER_ALLOW, bufsize);
> +		goto copied;
> +	}
> +
> +	filter = seccomp_dynamic_filter(filters, id);
> +	if (!filter)
> +		goto out;
> +	ret = strlcpy(buf, get_filter_string(filter), bufsize);
> +
> +copied:
> +	if (ret >= bufsize) {
> +		ret = -ENOSPC;
> +		goto out;
> +	}
> +	/* Zero out any remaining buffer, just in case. */
> +	memset(buf + ret, 0, bufsize - ret);
> +out:
> +	put_seccomp_filters(filters);
> +	return ret;
> +}
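The return convention of seccomp_get_filter() (string length on success, -ENOSPC when the buffer cannot hold the string plus its NUL, remaining bytes zeroed) can be sketched in userspace as below. This is an approximation under stated assumptions: strlcpy() is replaced by strlen() plus memcpy(), so unlike the kernel code nothing is written to the buffer on failure, and demo_copy_filter() is a hypothetical name.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copies src into buf and zero-fills the rest of buf. Returns the string
 * length (NUL not counted), or -ENOSPC if src plus its NUL does not fit.
 */
static long demo_copy_filter(char *buf, size_t bufsize, const char *src)
{
	size_t len = strlen(src);

	if (bufsize == 0 || len >= bufsize)
		return -ENOSPC;
	memcpy(buf, src, len);
	/* Zeroing the tail also writes the terminating NUL. */
	memset(buf + len, 0, bufsize - len);
	return (long)len;
}
```

Zeroing the tail matters for the prctl wrapper, which copies ret + 1 bytes out of a kmalloc()'d (not zeroed) buffer.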
> +EXPORT_SYMBOL_GPL(seccomp_get_filter);
> +
> +/**
> + * seccomp_clear_filter - clears the seccomp filter for a syscall.
> + * @syscall_nr: the system call number to clear filters for.
> + *
> + * Context: User context only. This function may sleep on allocation and
> + *          operates on current. current must be attempting a system call
> + *          when this is called.
> + *
> + * Returns 0 on success.
> + */
> +long seccomp_clear_filter(int syscall_nr)
> +{
> +	struct seccomp_filters *filters = NULL, *orig_filters;
> +	uint16_t id;
> +	int ret = -EINVAL;
> +
> +	rcu_read_lock();
> +	orig_filters = get_seccomp_filters(current->seccomp.filters);
> +	rcu_read_unlock();
> +
> +	if (!orig_filters)
> +		goto out;
> +
> +	if (filters_compat_mismatch(orig_filters))
> +		goto out;
> +
> +	id = seccomp_filter_id(orig_filters, syscall_nr);
> +	if (seccomp_filter_deny(id))
> +		goto out;
> +
> +	/* Create a new filters object for the task */
> +	if (seccomp_filter_dynamic(id))
> +		filters = seccomp_filters_new(orig_filters->count - 1);
> +	else
> +		filters = seccomp_filters_new(orig_filters->count);
> +
> +	if (IS_ERR(filters)) {
> +		ret = PTR_ERR(filters);
> +		goto out;
> +	}
> +
> +	/* Copy, but drop the requested entry. */
> +	ret = seccomp_filters_copy(filters, orig_filters, syscall_nr);
> +	if (ret)
> +		goto out;
> +	get_seccomp_filters(filters);  /* simplify the out: path */
> +
> +	rcu_assign_pointer(current->seccomp.filters, filters);

What prevents two copies of seccomp_clear_filter() from running
concurrently?

> +	synchronize_rcu();
> +	put_seccomp_filters(orig_filters);  /* for the task */
> +out:
> +	put_seccomp_filters(orig_filters);  /* for the get */
> +	put_seccomp_filters(filters);  /* for the extra get */
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(seccomp_clear_filter);
> +
> +/**
> + * seccomp_set_filter - adds/extends a seccomp filter for a syscall.
> + * @syscall_nr: system call number to apply the filter to.
> + * @filter: ftrace filter string to apply.
> + *
> + * Context: User context only. This function may sleep on allocation and
> + *          operates on current. current must be attempting a system call
> + *          when this is called.
> + *
> + * New filters may be added for system calls when the current task is
> + * not in a secure computing mode (seccomp).  Otherwise, existing filters may
> + * be extended.
> + *
> + * Returns 0 on success or an errno on failure.
> + */
> +long seccomp_set_filter(int syscall_nr, char *filter)
> +{
> +	struct seccomp_filters *filters = NULL, *orig_filters = NULL;
> +	uint16_t id;
> +	long ret = -EINVAL;
> +	uint16_t filters_needed;
> +
> +	if (!filter)
> +		goto out;
> +
> +	filter = strstrip(filter);
> +	/* Disallow empty strings. */
> +	if (filter[0] == 0)
> +		goto out;
> +
> +	rcu_read_lock();
> +	orig_filters = get_seccomp_filters(current->seccomp.filters);
> +	rcu_read_unlock();
> +
> +	/* After the first call, compatibility mode is selected permanently. */
> +	ret = -EACCES;
> +	if (filters_compat_mismatch(orig_filters))
> +		goto out;
> +
> +	filters_needed = orig_filters ? orig_filters->count : 0;
> +	id = seccomp_filter_id(orig_filters, syscall_nr);
> +	if (seccomp_filter_deny(id)) {
> +		/* Don't allow DENYs to be changed when in a seccomp mode */
> +		ret = -EACCES;
> +		if (current->seccomp.mode)
> +			goto out;
> +		filters_needed++;
> +	}
> +
> +	filters = seccomp_filters_new(filters_needed);
> +	if (IS_ERR(filters)) {
> +		ret = PTR_ERR(filters);
> +		goto out;
> +	}
> +
> +	filters_set_compat(filters);
> +	if (orig_filters) {
> +		ret = seccomp_filters_copy(filters, orig_filters, -1);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	if (seccomp_filter_deny(id))
> +		ret = seccomp_add_filter(filters, syscall_nr, filter);
> +	else
> +		ret = seccomp_extend_filter(filters, syscall_nr, filter);
> +	if (ret)
> +		goto out;
> +	get_seccomp_filters(filters);  /* simplify the error paths */
> +
> +	rcu_assign_pointer(current->seccomp.filters, filters);

Again, what prevents two copies of seccomp_set_filter() from running
concurrently?

> +	synchronize_rcu();
> +	put_seccomp_filters(orig_filters);  /* for the task */
> +out:
> +	put_seccomp_filters(orig_filters);  /* for the get */
> +	put_seccomp_filters(filters);  /* for get or task, on err */
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(seccomp_set_filter);
> +
> +long prctl_set_seccomp_filter(unsigned long syscall_nr,
> +			      char __user *user_filter)
> +{
> +	int nr;
> +	long ret;
> +	char *filter = NULL;
> +
> +	ret = -EINVAL;
> +	if (syscall_nr >= NR_syscalls)
> +		goto out;
> +
> +	ret = -EFAULT;
> +	if (!user_filter)
> +		goto out;
> +
> +	filter = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
> +	ret = -ENOMEM;
> +	if (!filter)
> +		goto out;
> +
> +	ret = -EFAULT;
> +	if (strncpy_from_user(filter, user_filter,
> +			      SECCOMP_MAX_FILTER_LENGTH - 1) < 0)
> +		goto out;
> +
> +	nr = (int) syscall_nr;
> +	ret = seccomp_set_filter(nr, filter);
> +
> +out:
> +	kfree(filter);
> +	return ret;
> +}
> +
> +long prctl_clear_seccomp_filter(unsigned long syscall_nr)
> +{
> +	int nr = -1;
> +	long ret;
> +
> +	ret = -EINVAL;
> +	if (syscall_nr >= NR_syscalls)
> +		goto out;
> +
> +	nr = (int) syscall_nr;
> +	ret = seccomp_clear_filter(nr);
> +
> +out:
> +	return ret;
> +}
> +
> +long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
> +			      unsigned long available)
> +{
> +	int ret, nr;
> +	unsigned long copied;
> +	char *buf = NULL;
> +	ret = -EINVAL;
> +	if (!available)
> +		goto out;
> +	/* Ignore extra buffer space. */
> +	if (available > SECCOMP_MAX_FILTER_LENGTH)
> +		available = SECCOMP_MAX_FILTER_LENGTH;
> +
> +	ret = -EINVAL;
> +	if (syscall_nr >= NR_syscalls)
> +		goto out;
> +	nr = (int) syscall_nr;
> +
> +	ret = -ENOMEM;
> +	buf = kmalloc(available, GFP_KERNEL);
> +	if (!buf)
> +		goto out;
> +
> +	ret = seccomp_get_filter(nr, buf, available);
> +	if (ret < 0)
> +		goto out;
> +
> +	/* Include the NUL byte in the copy. */
> +	copied = copy_to_user(dst, buf, ret + 1);
> +	ret = -EFAULT;
> +	if (copied)
> +		goto out;
> +	ret = 0;
> +out:
> +	kfree(buf);
> +	return ret;
> +}
> diff --git a/kernel/sys.c b/kernel/sys.c
> index af468ed..ed60d06 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1698,13 +1698,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		case PR_SET_ENDIAN:
>  			error = SET_ENDIAN(me, arg2);
>  			break;
> -
>  		case PR_GET_SECCOMP:
>  			error = prctl_get_seccomp();
>  			break;
>  		case PR_SET_SECCOMP:
>  			error = prctl_set_seccomp(arg2);
>  			break;
> +		case PR_SET_SECCOMP_FILTER:
> +			error = prctl_set_seccomp_filter(arg2,
> +							 (char __user *) arg3);
> +			break;
> +		case PR_CLEAR_SECCOMP_FILTER:
> +			error = prctl_clear_seccomp_filter(arg2);
> +			break;
> +		case PR_GET_SECCOMP_FILTER:
> +			error = prctl_get_seccomp_filter(arg2,
> +							 (char __user *) arg3,
> +							 arg4);
> +			break;
>  		case PR_GET_TSC:
>  			error = GET_TSC_CTL(arg2);
>  			break;
> diff --git a/security/Kconfig b/security/Kconfig
> index 95accd4..c76adf2 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -2,6 +2,10 @@
>  # Security configuration
>  #
> 
> +# Make seccomp filter Kconfig switch below available
> +config HAVE_SECCOMP_FILTER
> +       bool
> +
>  menu "Security options"
> 
>  config KEYS
> @@ -82,6 +86,19 @@ config SECURITY_DMESG_RESTRICT
> 
>  	  If you are unsure how to answer this question, answer N.
> 
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	select SECCOMP
> +	depends on HAVE_SECCOMP_FILTER && EXPERIMENTAL
> +	help
> +	  This kernel feature expands CONFIG_SECCOMP to allow computing
> +	  in environments with reduced kernel access dictated by the
> +	  application itself through prctl calls.  If
> +	  CONFIG_FTRACE_SYSCALLS is available, then system call
> +	  argument-based filtering predicates may be used.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config SECURITY
>  	bool "Enable different security models"
>  	depends on SYSFS
> -- 
> 1.7.0.4
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters
  2011-06-02 17:36                                                         ` Paul E. McKenney
@ 2011-06-02 18:14                                                           ` Will Drewry
  2011-06-02 19:42                                                             ` Paul E. McKenney
  0 siblings, 1 reply; 91+ messages in thread
From: Will Drewry @ 2011-06-02 18:14 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, kees.cook, torvalds, tglx, mingo, rostedt, jmorris,
	Peter Zijlstra, Frederic Weisbecker, linux-security-module

On Thu, Jun 2, 2011 at 12:36 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Tue, May 31, 2011 at 10:10:35PM -0500, Will Drewry wrote:
>> This change adds a new seccomp mode which specifies the allowed system
>> calls dynamically.  When in the new mode (2), all system calls are
>> checked against process-defined filters - first by system call number,
>> then by a filter string.  If an entry exists for a given system call and
>> all filter predicates evaluate to true, then the task may proceed.
>> Otherwise, the task is killed.
>
> A few questions below -- I can't say that I understand the RCU usage.
>
>                                                        Thanx, Paul
>
>> Filter string parsing and evaluation is handled by the ftrace filter
>> engine.  Related patches tweak the perf filter trace and free paths,
>> allowing the calls to be shared. Filters inherit their understanding of
>> types and arguments for each system call from the CONFIG_FTRACE_SYSCALLS
>> subsystem which already populates this information in syscall_metadata
>> associated enter_event (and exit_event) structures. If
>> CONFIG_FTRACE_SYSCALLS is not compiled in, only filter strings of "1"
>> will be allowed.
>>
>> The net result is a process may have its system calls filtered using the
>> ftrace filter engine's inherent understanding of system calls.  The set
>> of filters is specified through the PR_SET_SECCOMP_FILTER argument in
>> prctl(). For example, a filterset for a process, like pdftotext, that
>> should only process read-only input could (roughly) look like:
>>   sprintf(rdonly, "flags == %u", O_RDONLY|O_LARGEFILE);
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_open, rdonly);
>>   prctl(PR_SET_SECCOMP_FILTER, __NR__llseek, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_brk, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_close, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_exit_group, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_fstat64, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_mmap2, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_munmap, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_read, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "(fd == 1 || fd == 2)");
>>   prctl(PR_SET_SECCOMP, 2);
>>
>> Subsequent calls to PR_SET_SECCOMP_FILTER for the same system call will
>> be &&'d together to ensure that attack surface may only be reduced:
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
>>
>> With the earlier example, the active filter becomes:
>>   "(fd == 1 || fd == 2) && fd != 2"
>>
>> The patch also adds PR_CLEAR_SECCOMP_FILTER and PR_GET_SECCOMP_FILTER.
>> The latter returns the current filter for a system call to userspace:
>>
>>   prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf, bufsize);
>>
>> while the former clears any filters for a given system call changing it
>> back to a default deny:
>>
>>   prctl(PR_CLEAR_SECCOMP_FILTER, __NR_write);
>>
>> v3: - always block execve calls (as per linus torvalds)
>>     - add __NR_seccomp_execve(_32) to seccomp-supporting arches
>>     - ensure compat tasks can't reach ftrace:syscalls
>>     - dropped new defines for seccomp modes.
>>     - two level array instead of hlists (sugg. by olof johansson)
>>     - added generic Kconfig entry that is not connected.
>>     - dropped internal seccomp.h
>>     - move prctl helpers to seccomp_filter
>>     - killed seccomp_t typedef (as per checkpatch)
>> v2: - changed to use the existing syscall number ABI.
>>     - prctl changes to minimize parsing in the kernel:
>>       prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
>>       prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
>>       prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
>>       prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
>>     - defined PR_SECCOMP_MODE_STRICT and ..._FILTER
>>     - added flags
>>     - provide a default fail syscall_nr_to_meta in ftrace
>>     - provides fallback for unhooked system calls
>>     - use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
>>     - added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
>>     - moved to a hlist and 4 bit hash of linked lists
>>     - added support to operate without CONFIG_FTRACE_SYSCALLS
>>     - moved Kconfig support next to SECCOMP
>>     - made Kconfig entries dependent on EXPERIMENTAL
>>     - added macros to avoid ifdefs from kernel/fork.c
>>     - added compat task/filter matching
>>     - drop seccomp.h inclusion in sched.h and drop seccomp_t
>>     - added Filtering to "show" output
>>     - added on_exec state dup'ing when enabling after a fast-path accept.
>>
>> Signed-off-by: Will Drewry <wad@chromium.org>
>> ---
>>  include/linux/prctl.h   |    5 +
>>  include/linux/sched.h   |    2 +-
>>  include/linux/seccomp.h |   98 ++++++-
>>  include/trace/syscall.h |    7 +
>>  kernel/Makefile         |    3 +
>>  kernel/fork.c           |    3 +
>>  kernel/seccomp.c        |   38 ++-
>>  kernel/seccomp_filter.c |  784 +++++++++++++++++++++++++++++++++++++++++++++++
>>  kernel/sys.c            |   13 +-
>>  security/Kconfig        |   17 +
>>  10 files changed, 954 insertions(+), 16 deletions(-)
>>  create mode 100644 kernel/seccomp_filter.c
>>
>> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
>> index a3baeb2..44723ce 100644
>> --- a/include/linux/prctl.h
>> +++ b/include/linux/prctl.h
>> @@ -64,6 +64,11 @@
>>  #define PR_GET_SECCOMP       21
>>  #define PR_SET_SECCOMP       22
>>
>> +/* Get/set process seccomp filters */
>> +#define PR_GET_SECCOMP_FILTER        35
>> +#define PR_SET_SECCOMP_FILTER        36
>> +#define PR_CLEAR_SECCOMP_FILTER      37
>> +
>>  /* Get/set the capability bounding set (as per security/commoncap.c) */
>>  #define PR_CAPBSET_READ 23
>>  #define PR_CAPBSET_DROP 24
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 18d63ce..3f0bc8d 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1374,7 +1374,7 @@ struct task_struct {
>>       uid_t loginuid;
>>       unsigned int sessionid;
>>  #endif
>> -     seccomp_t seccomp;
>> +     struct seccomp_struct seccomp;
>>
>>  /* Thread group tracking */
>>       u32 parent_exec_id;
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index 167c333..f4434ca 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -1,13 +1,33 @@
>>  #ifndef _LINUX_SECCOMP_H
>>  #define _LINUX_SECCOMP_H
>>
>> +struct seq_file;
>>
>>  #ifdef CONFIG_SECCOMP
>>
>> +#include <linux/errno.h>
>>  #include <linux/thread_info.h>
>> +#include <linux/types.h>
>>  #include <asm/seccomp.h>
>>
>> -typedef struct { int mode; } seccomp_t;
>> +struct seccomp_filters;
>> +/**
>> + * struct seccomp_struct - the state of a seccomp'ed process
>> + *
>> + * @mode:
>> + *     if this is 1, the process is under standard seccomp rules
>> + *             is 2, the process is only allowed to make system calls where
>> + *                   associated filters evaluate successfully.
>> + * @filters: Metadata for filters if using CONFIG_SECCOMP_FILTER.
>> + *           filters assignment/use should be RCU-protected and its contents
>> + *           should never be modified when attached to a seccomp_struct.
>> + */
>> +struct seccomp_struct {
>> +     uint16_t mode;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     struct seccomp_filters *filters;
>> +#endif
>> +};
>>
>>  extern void __secure_computing(int);
>>  static inline void secure_computing(int this_syscall)
>> @@ -16,15 +36,14 @@ static inline void secure_computing(int this_syscall)
>>               __secure_computing(this_syscall);
>>  }
>>
>> -extern long prctl_get_seccomp(void);
>>  extern long prctl_set_seccomp(unsigned long);
>> +extern long prctl_get_seccomp(void);
>>
>>  #else /* CONFIG_SECCOMP */
>>
>>  #include <linux/errno.h>
>>
>> -typedef struct { } seccomp_t;
>> -
>> +struct seccomp_struct { };
>>  #define secure_computing(x) do { } while (0)
>>
>>  static inline long prctl_get_seccomp(void)
>> @@ -32,11 +51,80 @@ static inline long prctl_get_seccomp(void)
>>       return -EINVAL;
>>  }
>>
>> -static inline long prctl_set_seccomp(unsigned long arg2)
>> +static inline long prctl_set_seccomp(unsigned long a2)
>>  {
>>       return -EINVAL;
>>  }
>>
>>  #endif /* CONFIG_SECCOMP */
>>
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +
>> +#define inherit_tsk_seccomp(_child, _orig) do { \
>> +     _child->seccomp.mode = _orig->seccomp.mode; \
>> +     _child->seccomp.filters = get_seccomp_filters(_orig->seccomp.filters); \
>> +     } while (0)
>> +#define put_tsk_seccomp(_tsk) put_seccomp_filters(_tsk->seccomp.filters)
>> +
>> +extern int seccomp_show_filters(struct seccomp_filters *filters,
>> +                             struct seq_file *);
>> +extern long seccomp_set_filter(int, char *);
>> +extern long seccomp_clear_filter(int);
>> +extern long seccomp_get_filter(int, char *, unsigned long);
>> +
>> +extern long prctl_set_seccomp_filter(unsigned long, char __user *);
>> +extern long prctl_get_seccomp_filter(unsigned long, char __user *,
>> +                                  unsigned long);
>> +extern long prctl_clear_seccomp_filter(unsigned long);
>> +
>> +extern struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *);
>> +extern void put_seccomp_filters(struct seccomp_filters *);
>> +
>> +extern int seccomp_test_filters(int);
>> +extern void seccomp_filter_log_failure(int);
>> +
>> +#else  /* CONFIG_SECCOMP_FILTER */
>> +
>> +struct seccomp_filters { };
>> +#define inherit_tsk_seccomp(_child, _orig) do { } while (0)
>> +#define put_tsk_seccomp(_tsk) do { } while (0)
>> +
>> +static inline int seccomp_show_filters(struct seccomp_filters *filters,
>> +                                    struct seq_file *m)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +static inline long seccomp_set_filter(int syscall_nr, char *filter)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +static inline long seccomp_clear_filter(int syscall_nr)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +static inline long seccomp_get_filter(int syscall_nr,
>> +                                   char *buf, unsigned long available)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +static inline long prctl_clear_seccomp_filter(unsigned long a2)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3,
>> +                                         unsigned long a4)
>> +{
>> +     return -ENOSYS;
>> +}
>> +#endif  /* CONFIG_SECCOMP_FILTER */
>>  #endif /* _LINUX_SECCOMP_H */
>> diff --git a/include/trace/syscall.h b/include/trace/syscall.h
>> index 242ae04..e061ad0 100644
>> --- a/include/trace/syscall.h
>> +++ b/include/trace/syscall.h
>> @@ -35,6 +35,8 @@ struct syscall_metadata {
>>  extern unsigned long arch_syscall_addr(int nr);
>>  extern int init_syscall_trace(struct ftrace_event_call *call);
>>
>> +extern struct syscall_metadata *syscall_nr_to_meta(int);
>> +
>>  extern int reg_event_syscall_enter(struct ftrace_event_call *call);
>>  extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
>>  extern int reg_event_syscall_exit(struct ftrace_event_call *call);
>> @@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
>>                                     struct trace_event *event);
>>  enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
>>                                    struct trace_event *event);
>> +#else
>> +static inline struct syscall_metadata *syscall_nr_to_meta(int nr)
>> +{
>> +     return NULL;
>> +}
>>  #endif
>>
>>  #ifdef CONFIG_PERF_EVENTS
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index 85cbfb3..84e7dfb 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
>>  obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
>>  obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
>>  obj-$(CONFIG_SECCOMP) += seccomp.o
>> +ifeq ($(CONFIG_SECCOMP_FILTER),y)
>> +obj-$(CONFIG_SECCOMP) += seccomp_filter.o
>> +endif
>>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>>  obj-$(CONFIG_TREE_RCU) += rcutree.o
>>  obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index e7548de..6f835e0 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -34,6 +34,7 @@
>>  #include <linux/cgroup.h>
>>  #include <linux/security.h>
>>  #include <linux/hugetlb.h>
>> +#include <linux/seccomp.h>
>>  #include <linux/swap.h>
>>  #include <linux/syscalls.h>
>>  #include <linux/jiffies.h>
>> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
>>       free_thread_info(tsk->stack);
>>       rt_mutex_debug_task_free(tsk);
>>       ftrace_graph_exit_task(tsk);
>> +     put_tsk_seccomp(tsk);
>>       free_task_struct(tsk);
>>  }
>>  EXPORT_SYMBOL(free_task);
>> @@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
>>       if (err)
>>               goto out;
>>
>> +     inherit_tsk_seccomp(tsk, orig);
>>       setup_thread_stack(tsk, orig);
>>       clear_user_return_notifier(tsk);
>>       clear_tsk_need_resched(tsk);
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 57d4b13..0a942be 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -2,16 +2,20 @@
>>   * linux/kernel/seccomp.c
>>   *
>>   * Copyright 2004-2005  Andrea Arcangeli <andrea@cpushare.com>
>> + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
>>   *
>>   * This defines a simple but solid secure-computing mode.
>>   */
>>
>>  #include <linux/seccomp.h>
>>  #include <linux/sched.h>
>> +#include <linux/slab.h>
>>  #include <linux/compat.h>
>> +#include <linux/unistd.h>
>> +#include <linux/ftrace_event.h>
>>
>> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
>>  /* #define SECCOMP_DEBUG 1 */
>> -#define NR_SECCOMP_MODES 1
>>
>>  /*
>>   * Secure computing mode 1 allows only read/write/exit/sigreturn.
>> @@ -32,10 +36,9 @@ static int mode1_syscalls_32[] = {
>>
>>  void __secure_computing(int this_syscall)
>>  {
>> -     int mode = current->seccomp.mode;
>>       int * syscall;
>>
>> -     switch (mode) {
>> +     switch (current->seccomp.mode) {
>>       case 1:
>>               syscall = mode1_syscalls;
>>  #ifdef CONFIG_COMPAT
>> @@ -47,6 +50,17 @@ void __secure_computing(int this_syscall)
>>                               return;
>>               } while (*++syscall);
>>               break;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     case 2:
>> +             if (this_syscall >= NR_syscalls || this_syscall < 0)
>> +                     break;
>> +
>> +             if (!seccomp_test_filters(this_syscall))
>> +                     return;
>> +
>> +             seccomp_filter_log_failure(this_syscall);
>> +             break;
>> +#endif
>>       default:
>>               BUG();
>>       }
>> @@ -71,16 +85,22 @@ long prctl_set_seccomp(unsigned long seccomp_mode)
>>       if (unlikely(current->seccomp.mode))
>>               goto out;
>>
>> -     ret = -EINVAL;
>> -     if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
>> -             current->seccomp.mode = seccomp_mode;
>> -             set_thread_flag(TIF_SECCOMP);
>> +     ret = 0;
>> +     switch (seccomp_mode) {
>> +     case 1:
>>  #ifdef TIF_NOTSC
>>               disable_TSC();
>>  #endif
>> -             ret = 0;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     case 2:
>> +#endif
>> +             current->seccomp.mode = seccomp_mode;
>> +             set_thread_flag(TIF_SECCOMP);
>> +             break;
>> +     default:
>> +             ret = -EINVAL;
>>       }
>>
>> - out:
>> +out:
>>       return ret;
>>  }
>> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
>> new file mode 100644
>> index 0000000..9782f25
>> --- /dev/null
>> +++ b/kernel/seccomp_filter.c
>> @@ -0,0 +1,784 @@
>> +/* filter engine-based seccomp system call filtering
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, write to the Free Software
>> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>> + *
>> + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
>> + */
>> +
>> +#include <linux/compat.h>
>> +#include <linux/err.h>
>> +#include <linux/errno.h>
>> +#include <linux/ftrace_event.h>
>> +#include <linux/seccomp.h>
>> +#include <linux/seq_file.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/uaccess.h>
>> +
>> +#include <asm/syscall.h>
>> +#include <trace/syscall.h>
>> +
>> +
>> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
>> +
>> +#define SECCOMP_FILTER_ALLOW "1"
>> +#define SECCOMP_ACTION_DENY 0xffff
>> +#define SECCOMP_ACTION_ALLOW 0xfffe
>> +
>> +/**
>> + * struct seccomp_filters - container for seccomp filterset
>> + *
>> + * @syscalls: array of 16-bit indices into @event_filters by syscall_nr
>> + *            May also be SECCOMP_ACTION_DENY or SECCOMP_ACTION_ALLOW
>> + * @event_filters: array of pointers to ftrace event objects
>> + * @count: size of @event_filters
>> + * @flags: anonymous struct to wrap filters-specific flags
>> + * @usage: reference count to simplify use.
>> + */
>> +struct seccomp_filters {
>> +     uint16_t syscalls[NR_syscalls];
>> +     struct event_filter **event_filters;
>> +     uint16_t count;
>> +     struct {
>> +             uint32_t compat:1,
>> +                      __reserved:31;
>> +     } flags;
>> +     atomic_t usage;
>> +};
>> +
>> +/* Handle ftrace symbol non-existence */
>> +#ifdef CONFIG_FTRACE_SYSCALLS
>> +#define create_event_filter(_ef_pptr, _event_type, _str) \
>> +     ftrace_parse_filter(_ef_pptr, _event_type, _str)
>> +#define get_filter_string(_ef) ftrace_get_filter_string(_ef)
>> +#define free_event_filter(_f) ftrace_free_filter(_f)
>> +
>> +#else
>> +
>> +#define create_event_filter(_ef_pptr, _event_type, _str) (-ENOSYS)
>> +#define get_filter_string(_ef) (NULL)
>> +#define free_event_filter(_f) do { } while (0)
>> +#endif
>> +
>> +/**
>> + * seccomp_filters_new - allocates a new filters object
>> + * @count: count to allocate for the event_filters array
>> + *
>> + * Returns ERR_PTR on error or an allocated object.
>> + */
>> +static struct seccomp_filters *seccomp_filters_new(uint16_t count)
>> +{
>> +     struct seccomp_filters *f;
>> +
>> +     if (count >= SECCOMP_ACTION_ALLOW)
>> +             return ERR_PTR(-EINVAL);
>> +
>> +     f = kzalloc(sizeof(struct seccomp_filters), GFP_KERNEL);
>> +     if (!f)
>> +             return ERR_PTR(-ENOMEM);
>> +
>> +     /* Lazy SECCOMP_ACTION_DENY assignment. */
>> +     memset(f->syscalls, 0xff, sizeof(f->syscalls));
>> +     atomic_set(&f->usage, 1);
>> +
>> +     f->event_filters = NULL;
>> +     f->count = count;
>> +     if (!count)
>> +             return f;
>> +
>> +     f->event_filters = kzalloc(count * sizeof(struct event_filter *),
>> +                                GFP_KERNEL);
>> +     if (!f->event_filters) {
>> +             kfree(f);
>> +             f = ERR_PTR(-ENOMEM);
>> +     }
>> +     return f;
>> +}
>> +
>> +/**
>> + * seccomp_filters_free - cleans up the filter list and frees the table
>> + * @filters: NULL or live object to be completely destructed.
>> + */
>> +static void seccomp_filters_free(struct seccomp_filters *filters)
>> +{
>> +     uint16_t count = 0;
>> +     if (!filters)
>> +             return;
>> +     while (count < filters->count) {
>> +             struct event_filter *f = filters->event_filters[count];
>> +             free_event_filter(f);
>> +             count++;
>> +     }
>> +     kfree(filters->event_filters);
>> +     kfree(filters);
>> +}
>> +
>> +static void __put_seccomp_filters(struct seccomp_filters *orig)
>> +{
>> +     WARN_ON(atomic_read(&orig->usage));
>> +     seccomp_filters_free(orig);
>> +}
>> +
>> +#define seccomp_filter_allow(_id) ((_id) == SECCOMP_ACTION_ALLOW)
>> +#define seccomp_filter_deny(_id) ((_id) == SECCOMP_ACTION_DENY)
>> +#define seccomp_filter_dynamic(_id) \
>> +     (!seccomp_filter_allow(_id) && !seccomp_filter_deny(_id))
>> +static inline uint16_t seccomp_filter_id(const struct seccomp_filters *f,
>> +                                      int syscall_nr)
>> +{
>> +     if (!f)
>> +             return SECCOMP_ACTION_DENY;
>> +     return f->syscalls[syscall_nr];
>> +}
>> +
>> +static inline struct event_filter *seccomp_dynamic_filter(
>> +             const struct seccomp_filters *filters, uint16_t id)
>> +{
>> +     if (!seccomp_filter_dynamic(id))
>> +             return NULL;
>> +     return filters->event_filters[id];
>> +}
>> +
>> +static inline void set_seccomp_filter_id(struct seccomp_filters *filters,
>> +                                      int syscall_nr, uint16_t id)
>> +{
>> +     filters->syscalls[syscall_nr] = id;
>> +}
>> +
>> +static inline void set_seccomp_filter(struct seccomp_filters *filters,
>> +                                   int syscall_nr, uint16_t id,
>> +                                   struct event_filter *dynamic_filter)
>> +{
>> +     filters->syscalls[syscall_nr] = id;
>> +     if (seccomp_filter_dynamic(id))
>> +             filters->event_filters[id] = dynamic_filter;
>> +}
>> +
>> +static struct event_filter *alloc_event_filter(int syscall_nr,
>> +                                            const char *filter_string)
>> +{
>> +     struct syscall_metadata *data;
>> +     struct event_filter *filter = NULL;
>> +     int err;
>> +
>> +     data = syscall_nr_to_meta(syscall_nr);
>> +     /* Argument-based filtering only works on ftrace-hooked syscalls. */
>> +     err = -ENOSYS;
>> +     if (!data)
>> +             goto fail;
>> +     err = create_event_filter(&filter,
>> +                               data->enter_event->event.type,
>> +                               filter_string);
>> +     if (err)
>> +             goto fail;
>> +
>> +     return filter;
>> +fail:
>> +     kfree(filter);
>> +     return ERR_PTR(err);
>> +}
>> +
>> +/**
>> + * seccomp_filters_copy - copies filters from src to dst.
>> + *
>> + * @dst: seccomp_filters to populate.
>> + * @src: table to read from.
>> + * @skip: specifies an entry, by system call, to skip.
>> + *
>> + * Returns non-zero on failure.
>> + * Both the source and the destination should have no simultaneous
>> + * writers, and dst should be exclusive to the caller.
>> + * If @skip is < 0, it is ignored.
>> + */
>> +static int seccomp_filters_copy(struct seccomp_filters *dst,
>> +                             const struct seccomp_filters *src,
>> +                             int skip)
>> +{
>> +     int id = 0, ret = 0, nr;
>> +     memcpy(&dst->flags, &src->flags, sizeof(src->flags));
>> +     memcpy(dst->syscalls, src->syscalls, sizeof(dst->syscalls));
>> +     if (!src->count)
>> +             goto done;
>> +     for (nr = 0; nr < NR_syscalls; ++nr) {
>> +             struct event_filter *filter;
>> +             const char *str;
>> +             uint16_t src_id = seccomp_filter_id(src, nr);
>> +             if (nr == skip) {
>> +                     set_seccomp_filter(dst, nr, SECCOMP_ACTION_DENY,
>> +                                        NULL);
>> +                     continue;
>> +             }
>> +             if (!seccomp_filter_dynamic(src_id))
>> +                     continue;
>> +             if (id >= dst->count) {
>> +                     ret = -EINVAL;
>> +                     goto done;
>> +             }
>> +             str = get_filter_string(seccomp_dynamic_filter(src, src_id));
>> +             filter = alloc_event_filter(nr, str);
>> +             if (IS_ERR(filter)) {
>> +                     ret = PTR_ERR(filter);
>> +                     goto done;
>> +             }
>> +             set_seccomp_filter(dst, nr, id, filter);
>> +             id++;
>> +     }
>> +
>> +done:
>> +     return ret;
>> +}
>> +
>> +/**
>> + * seccomp_extend_filter - appends more text to a syscall_nr's filter
>> + * @filters: unattached filter object to operate on
>> + * @syscall_nr: syscall number to update filters for
>> + * @filter_string: string to append to the existing filter
>> + *
>> + * The new string will be &&'d to the original filter string to ensure that it
>> + * always matches the existing predicates or less:
>> + *   (old_filter) && @filter_string
>> + * A new seccomp_filters instance is returned on success and a ERR_PTR on
>> + * failure.
>> + */
>> +static int seccomp_extend_filter(struct seccomp_filters *filters,
>> +                              int syscall_nr, char *filter_string)
>> +{
>> +     struct event_filter *filter;
>> +     uint16_t id = seccomp_filter_id(filters, syscall_nr);
>> +     char *merged = NULL;
>> +     int ret = -EINVAL, expected;
>> +
>> +     /* No extending with a "1". */
>> +     if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string))
>> +             goto out;
>> +
>> +     filter = seccomp_dynamic_filter(filters, id);
>> +     ret = -ENOENT;
>> +     if (!filter)
>> +             goto out;
>> +
>> +     merged = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
>> +     ret = -ENOMEM;
>> +     if (!merged)
>> +             goto out;
>> +
>> +     expected = snprintf(merged, SECCOMP_MAX_FILTER_LENGTH, "(%s) && %s",
>> +                         get_filter_string(filter), filter_string);
>> +     ret = -E2BIG;
>> +     if (expected >= SECCOMP_MAX_FILTER_LENGTH || expected < 0)
>> +             goto out;
>> +
>> +     /* Free the old filter */
>> +     free_event_filter(filter);
>> +     set_seccomp_filter(filters, syscall_nr, id, NULL);
>> +
>> +     /* Replace it */
>> +     filter = alloc_event_filter(syscall_nr, merged);
>> +     if (IS_ERR(filter)) {
>> +             ret = PTR_ERR(filter);
>> +             goto out;
>> +     }
>> +     set_seccomp_filter(filters, syscall_nr, id, filter);
>> +     ret = 0;
>> +
>> +out:
>> +     kfree(merged);
>> +     return ret;
>> +}
>> +
>> +/**
>> + * seccomp_add_filter - adds a filter for an unfiltered syscall
>> + * @filters: filters object to add a filter/action to
>> + * @syscall_nr: system call number to add a filter for
>> + * @filter_string: the filter string to apply
>> + *
>> + * Returns 0 on success and non-zero otherwise.
>> + */
>> +static int seccomp_add_filter(struct seccomp_filters *filters, int syscall_nr,
>> +                           char *filter_string)
>> +{
>> +     struct event_filter *filter;
>> +     int ret = 0;
>> +
>> +     if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string)) {
>> +             set_seccomp_filter(filters, syscall_nr,
>> +                                SECCOMP_ACTION_ALLOW, NULL);
>> +             goto out;
>> +     }
>> +
>> +     filter = alloc_event_filter(syscall_nr, filter_string);
>> +     if (IS_ERR(filter)) {
>> +             ret = PTR_ERR(filter);
>> +             goto out;
>> +     }
>> +     /* Always add to the last slot available since additions are
>> +      * only done one at a time.
>> +      */
>> +     set_seccomp_filter(filters, syscall_nr, filters->count - 1, filter);
>> +out:
>> +     return ret;
>> +}
>> +
>> +/* Wrap optional ftrace syscall support. Returns 1 on match or 0 otherwise. */
>> +static int filter_match_current(struct event_filter *event_filter)
>> +{
>> +     int err = 0;
>> +#ifdef CONFIG_FTRACE_SYSCALLS
>> +     uint8_t syscall_state[64];
>> +
>> +     memset(syscall_state, 0, sizeof(syscall_state));
>> +
>> +     /* The generic tracing entry can remain zeroed. */
>> +     err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
>> +                                      NULL);
>> +     if (err)
>> +             return 0;
>> +
>> +     err = filter_match_preds(event_filter, syscall_state);
>> +#endif
>> +     return err;
>> +}
>> +
>> +static const char *syscall_nr_to_name(int syscall)
>> +{
>> +     const char *syscall_name = "unknown";
>> +     struct syscall_metadata *data = syscall_nr_to_meta(syscall);
>> +     if (data)
>> +             syscall_name = data->name;
>> +     return syscall_name;
>> +}
>> +
>> +static void filters_set_compat(struct seccomp_filters *filters)
>> +{
>> +#ifdef CONFIG_COMPAT
>> +     if (is_compat_task())
>> +             filters->flags.compat = 1;
>> +#endif
>> +}
>> +
>> +static inline int filters_compat_mismatch(struct seccomp_filters *filters)
>> +{
>> +     int ret = 0;
>> +     if (!filters)
>> +             return 0;
>> +#ifdef CONFIG_COMPAT
>> +     if (!!(is_compat_task()) == filters->flags.compat)
>> +             ret = 1;
>> +#endif
>> +     return ret;
>> +}
>> +
>> +static inline int syscall_is_execve(int syscall)
>> +{
>> +     int nr = __NR_execve;
>> +#ifdef CONFIG_COMPAT
>> +     if (is_compat_task())
>> +             nr = __NR_seccomp_execve_32;
>> +#endif
>> +     return syscall == nr;
>> +}
>> +
>> +#ifndef KSTK_EIP
>> +#define KSTK_EIP(x) 0L
>> +#endif
>> +
>> +void seccomp_filter_log_failure(int syscall)
>> +{
>> +     pr_info("%s[%d]: system call %d (%s) blocked at 0x%lx\n",
>> +             current->comm, task_pid_nr(current), syscall,
>> +             syscall_nr_to_name(syscall), KSTK_EIP(current));
>> +}
>> +
>> +/* put_seccomp_state - decrements the reference count of @orig and may free. */
>> +void put_seccomp_filters(struct seccomp_filters *orig)
>> +{
>> +     if (!orig)
>> +             return;
>> +
>> +     if (atomic_dec_and_test(&orig->usage))
>> +             __put_seccomp_filters(orig);
>> +}
>> +
>> +/* get_seccomp_state - increments the reference count of @orig */
>> +struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *orig)
>
> Nit: the name does not match the comment.

Will fix it here and above. Thanks!

>> +{
>> +     if (!orig)
>> +             return NULL;
>> +     atomic_inc(&orig->usage);
>> +     return orig;
>
> This is called in an RCU read-side critical section.  What exactly is
> RCU protecting?  I would expect an rcu_dereference() or one of the
> RCU list-traversal primitives somewhere, either here or at the caller.

Ah, I spaced on rcu_dereference().  The goal was to make assignment and
replacement of the seccomp_filters pointer (in seccomp_state)
RCU-protected, so that a reader can never see a partially replaced
pointer on platforms where pointer assignment is non-atomic - for
example when it is read via /proc/<pid>/seccomp_filters or through the
exported symbols.  Object lifetime is managed by reference counting, so
I don't have to extend the RCU read-side critical section by much or
deal with pre-allocations.

I'll add rcu_dereference() to the get_seccomp_filters() call sites where
it makes sense, so that the pointer is always read safely.  Just to make
sure: does it make sense to keep RCU-protecting this specific pointer?
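
For illustration only, here is a rough userspace sketch of the
"published pointer plus reference count" pattern described above.  It is
not kernel code and makes several assumptions: struct filters,
task_filters, filters_get(), filters_put() and filters_publish() are
hypothetical names, a C11 acquire load stands in for rcu_dereference(),
and an atomic exchange stands in for rcu_assign_pointer().  There is no
grace period here, so nothing protects the window between loading the
pointer and taking the reference; in the kernel that window is what
rcu_read_lock() and synchronize_rcu() are meant to cover.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct seccomp_filters: only the refcount. */
struct filters {
	atomic_int usage;
};

/* Stand-in for current->seccomp.filters; starts out NULL. */
static _Atomic(struct filters *) task_filters;

static struct filters *filters_new(void)
{
	struct filters *f = malloc(sizeof(*f));

	if (f)
		atomic_init(&f->usage, 1);	/* reference owned by the "task" */
	return f;
}

/* Reader: load the published pointer exactly once, then pin the object. */
static struct filters *filters_get(void)
{
	struct filters *f = atomic_load_explicit(&task_filters,
						 memory_order_acquire);

	if (f)
		atomic_fetch_add(&f->usage, 1);
	return f;
}

/* Drop one reference.  Returns 1 if this call freed the object. */
static int filters_put(struct filters *f)
{
	if (!f)
		return 0;
	if (atomic_fetch_sub(&f->usage, 1) == 1) {
		free(f);
		return 1;
	}
	return 0;
}

/* Writer: publish a fully initialised object and hand back the old one. */
static struct filters *filters_publish(struct filters *next)
{
	return atomic_exchange_explicit(&task_filters, next,
					memory_order_acq_rel);
}
```

A reader that took its reference before a replacement keeps a valid
object until its own filters_put(), which is the property the reference
count is there to provide.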

>> +}
>> +
>> +/**
>> + * seccomp_test_filters - tests 'current' against the given syscall
>> + * @state: seccomp_state of current to use.
>> + * @syscall: number of the system call to test
>> + *
>> + * Returns 0 on ok and non-zero on error/failure.
>> + */
>> +int seccomp_test_filters(int syscall)
>> +{
>> +     uint16_t id;
>> +     struct event_filter *filter;
>> +     struct seccomp_filters *filters;
>> +     int ret = -EACCES;
>> +
>> +     rcu_read_lock();
>> +     filters = get_seccomp_filters(current->seccomp.filters);
>> +     rcu_read_unlock();
>> +
>> +     if (!filters)
>> +             goto out;
>> +
>> +     if (filters_compat_mismatch(filters)) {
>> +             pr_info("%s[%d]: seccomp_filter compat() mismatch.\n",
>> +                     current->comm, task_pid_nr(current));
>> +             goto out;
>> +     }
>> +
>> +     /* execve is never allowed. */
>> +     if (syscall_is_execve(syscall))
>> +             goto out;
>> +
>> +     ret = 0;
>> +     id = seccomp_filter_id(filters, syscall);
>> +     if (seccomp_filter_allow(id))
>> +             goto out;
>> +
>> +     ret = -EACCES;
>> +     if (!seccomp_filter_dynamic(id))
>> +             goto out;
>> +
>> +     filter = seccomp_dynamic_filter(filters, id);
>> +     if (filter && filter_match_current(filter))
>> +             ret = 0;
>> +out:
>> +     put_seccomp_filters(filters);
>> +     return ret;
>> +}
>> +
>> +/**
>> + * seccomp_show_filters - prints the current filter state to a seq_file
>> + * @filters: properly get()'d filters object
>> + * @m: the prepared seq_file to receive the data
>> + *
>> + * Returns 0 on a successful write.
>> + */
>> +int seccomp_show_filters(struct seccomp_filters *filters, struct seq_file *m)
>> +{
>> +     int syscall;
>> +     seq_printf(m, "Mode: %d\n", current->seccomp.mode);
>> +     if (!filters)
>> +             goto out;
>> +
>> +     for (syscall = 0; syscall < NR_syscalls; ++syscall) {
>> +             uint16_t id = seccomp_filter_id(filters, syscall);
>> +             const char *filter_string = SECCOMP_FILTER_ALLOW;
>> +             if (seccomp_filter_deny(id))
>> +                     continue;
>> +             seq_printf(m, "%d (%s): ",
>> +                           syscall,
>> +                           syscall_nr_to_name(syscall));
>> +             if (seccomp_filter_dynamic(id))
>> +                     filter_string = get_filter_string(
>> +                                       seccomp_dynamic_filter(filters, id));
>> +             seq_printf(m, "%s\n", filter_string);
>> +     }
>> +out:
>> +     return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(seccomp_show_filters);
>> +
>> +/**
>> + * seccomp_get_filter - copies the filter_string into "buf"
>> + * @syscall_nr: system call number to look up
>> + * @buf: destination buffer
>> + * @bufsize: available space in the buffer.
>> + *
>> + * Context: User context only. This function may sleep on allocation and
>> + *          operates on current. current must be attempting a system call
>> + *          when this is called.
>> + *
>> + * Looks up the filter for the given system call number on current.  If found,
>> + * the string length of the NUL-terminated buffer is returned and < 0 is
>> + * returned on error. The NUL byte is not included in the length.
>> + */
>> +long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize)
>> +{
>> +     struct seccomp_filters *filters;
>> +     struct event_filter *filter;
>> +     long ret = -EINVAL;
>> +     uint16_t id;
>> +
>> +     if (bufsize > SECCOMP_MAX_FILTER_LENGTH)
>> +             bufsize = SECCOMP_MAX_FILTER_LENGTH;
>> +
>> +     rcu_read_lock();
>> +     filters = get_seccomp_filters(current->seccomp.filters);
>> +     rcu_read_unlock();
>> +
>> +     if (!filters)
>> +             goto out;
>> +
>> +     ret = -ENOENT;
>> +     id = seccomp_filter_id(filters, syscall_nr);
>> +     if (seccomp_filter_deny(id))
>> +             goto out;
>> +
>> +     if (seccomp_filter_allow(id)) {
>> +             ret = strlcpy(buf, SECCOMP_FILTER_ALLOW, bufsize);
>> +             goto copied;
>> +     }
>> +
>> +     filter = seccomp_dynamic_filter(filters, id);
>> +     if (!filter)
>> +             goto out;
>> +     ret = strlcpy(buf, get_filter_string(filter), bufsize);
>> +
>> +copied:
>> +     if (ret >= bufsize) {
>> +             ret = -ENOSPC;
>> +             goto out;
>> +     }
>> +     /* Zero out any remaining buffer, just in case. */
>> +     memset(buf + ret, 0, bufsize - ret);
>> +out:
>> +     put_seccomp_filters(filters);
>> +     return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(seccomp_get_filter);
>> +
>> +/**
>> + * seccomp_clear_filter: clears the seccomp filter for a syscall.
>> + * @syscall_nr: the system call number to clear filters for.
>> + *
>> + * Context: User context only. This function may sleep on allocation and
>> + *          operates on current. current must be attempting a system call
>> + *          when this is called.
>> + *
>> + * Returns 0 on success.
>> + */
>> +long seccomp_clear_filter(int syscall_nr)
>> +{
>> +     struct seccomp_filters *filters = NULL, *orig_filters;
>> +     uint16_t id;
>> +     int ret = -EINVAL;
>> +
>> +     rcu_read_lock();
>> +     orig_filters = get_seccomp_filters(current->seccomp.filters);
>> +     rcu_read_unlock();
>> +
>> +     if (!orig_filters)
>> +             goto out;
>> +
>> +     if (filters_compat_mismatch(orig_filters))
>> +             goto out;
>> +
>> +     id = seccomp_filter_id(orig_filters, syscall_nr);
>> +     if (seccomp_filter_deny(id))
>> +             goto out;
>> +
>> +     /* Create a new filters object for the task */
>> +     if (seccomp_filter_dynamic(id))
>> +             filters = seccomp_filters_new(orig_filters->count - 1);
>> +     else
>> +             filters = seccomp_filters_new(orig_filters->count);
>> +
>> +     if (IS_ERR(filters)) {
>> +             ret = PTR_ERR(filters);
>> +             goto out;
>> +     }
>> +
>> +     /* Copy, but drop the requested entry. */
>> +     ret = seccomp_filters_copy(filters, orig_filters, syscall_nr);
>> +     if (ret)
>> +             goto out;
>> +     get_seccomp_filters(filters);  /* simplify the out: path */
>> +
>> +     rcu_assign_pointer(current->seccomp.filters, filters);
>
> What prevents two copies of seccomp_clear_filter() from running
> concurrently?

Nothing - the last assignment wins, but each filters object should
still be internally consistent as seen by the parallel calls.  If
that's a concern, a per-task writer mutex could be used to ensure that
simultaneous calls to clear and set are performed serially.  Would
that make more sense?
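
A rough userspace sketch of the writer-mutex idea, for illustration
only.  It is not kernel code: struct filters is reduced to a
hypothetical bitmask, task_filters and filters_lock stand in for the
per-task pointer and the proposed per-task mutex, a pthread mutex
replaces the kernel's, and a plain free() replaces the grace period and
reference drop.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct seccomp_filters: one bit per syscall. */
struct filters {
	unsigned long allowed;
};

/* Stand-ins for current->seccomp.filters and the proposed writer mutex. */
static struct filters *task_filters;
static pthread_mutex_t filters_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Copy @orig (may be NULL) and additionally allow syscall @nr.
 * Assumes 0 <= nr < number of bits in an unsigned long.
 */
static struct filters *copy_and_allow(const struct filters *orig, int nr)
{
	struct filters *f = malloc(sizeof(*f));

	if (!f)
		return NULL;
	f->allowed = (orig ? orig->allowed : 0) | (1UL << nr);
	return f;
}

/* Serialised update: snapshot, copy and publish all happen under the lock. */
static int set_filter_locked(int nr)
{
	struct filters *old, *next;
	int ret = -1;

	pthread_mutex_lock(&filters_lock);
	old = task_filters;
	next = copy_and_allow(old, nr);
	if (next) {
		task_filters = next;
		free(old);	/* kernel: wait for readers, then drop a reference */
		ret = 0;
	}
	pthread_mutex_unlock(&filters_lock);
	return ret;
}
```

Without the lock, two writers can both copy the same original and the
second assignment silently discards the first writer's change; with the
snapshot taken under the lock, each update builds on the previous one.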


>> +     synchronize_rcu();
>> +     put_seccomp_filters(orig_filters);  /* for the task */
>> +out:
>> +     put_seccomp_filters(orig_filters);  /* for the get */
>> +     put_seccomp_filters(filters);  /* for the extra get */
>> +     return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(seccomp_clear_filter);
>> +
>> +/**
>> + * seccomp_set_filter: - Adds/extends a seccomp filter for a syscall.
>> + * @syscall_nr: system call number to apply the filter to.
>> + * @filter: ftrace filter string to apply.
>> + *
>> + * Context: User context only. This function may sleep on allocation and
>> + *          operates on current. current must be attempting a system call
>> + *          when this is called.
>> + *
>> + * New filters may be added for system calls when the current task is
>> + * not in a secure computing mode (seccomp).  Otherwise, existing filters may
>> + * be extended.
>> + *
>> + * Returns 0 on success or an errno on failure.
>> + */
>> +long seccomp_set_filter(int syscall_nr, char *filter)
>> +{
>> +     struct seccomp_filters *filters = NULL, *orig_filters = NULL;
>> +     uint16_t id;
>> +     long ret = -EINVAL;
>> +     uint16_t filters_needed;
>> +
>> +     if (!filter)
>> +             goto out;
>> +
>> +     filter = strstrip(filter);
>> +     /* Disallow empty strings. */
>> +     if (filter[0] == 0)
>> +             goto out;
>> +
>> +     rcu_read_lock();
>> +     orig_filters = get_seccomp_filters(current->seccomp.filters);
>> +     rcu_read_unlock();
>> +
>> +     /* After the first call, compatibility mode is selected permanently. */
>> +     ret = -EACCES;
>> +     if (filters_compat_mismatch(orig_filters))
>> +             goto out;
>> +
>> +     filters_needed = orig_filters ? orig_filters->count : 0;
>> +     id = seccomp_filter_id(orig_filters, syscall_nr);
>> +     if (seccomp_filter_deny(id)) {
>> +             /* Don't allow DENYs to be changed when in a seccomp mode */
>> +             ret = -EACCES;
>> +             if (current->seccomp.mode)
>> +                     goto out;
>> +             filters_needed++;
>> +     }
>> +
>> +     filters = seccomp_filters_new(filters_needed);
>> +     if (IS_ERR(filters)) {
>> +             ret = PTR_ERR(filters);
>> +             goto out;
>> +     }
>> +
>> +     filters_set_compat(filters);
>> +     if (orig_filters) {
>> +             ret = seccomp_filters_copy(filters, orig_filters, -1);
>> +             if (ret)
>> +                     goto out;
>> +     }
>> +
>> +     if (seccomp_filter_deny(id))
>> +             ret = seccomp_add_filter(filters, syscall_nr, filter);
>> +     else
>> +             ret = seccomp_extend_filter(filters, syscall_nr, filter);
>> +     if (ret)
>> +             goto out;
>> +     get_seccomp_filters(filters);  /* simplify the error paths */
>> +
>> +     rcu_assign_pointer(current->seccomp.filters, filters);
>
> Again, what prevents two copies of seccomp_set_filter() from running
> concurrently?

Same deal - nothing, but I'd be happy to add a guard if it makes sense.

Thanks!

>> +     synchronize_rcu();
>> +     put_seccomp_filters(orig_filters);  /* for the task */
>> +out:
>> +     put_seccomp_filters(orig_filters);  /* for the get */
>> +     put_seccomp_filters(filters);  /* for get or task, on err */
>> +     return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(seccomp_set_filter);
>> +
>> +long prctl_set_seccomp_filter(unsigned long syscall_nr,
>> +                           char __user *user_filter)
>> +{
>> +     int nr;
>> +     long ret;
>> +     char *filter = NULL;
>> +
>> +     ret = -EINVAL;
>> +     if (syscall_nr >= NR_syscalls)
>> +             goto out;
>> +
>> +     ret = -EFAULT;
>> +     if (!user_filter)
>> +             goto out;
>> +
>> +     filter = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
>> +     ret = -ENOMEM;
>> +     if (!filter)
>> +             goto out;
>> +
>> +     ret = -EFAULT;
>> +     if (strncpy_from_user(filter, user_filter,
>> +                           SECCOMP_MAX_FILTER_LENGTH - 1) < 0)
>> +             goto out;
>> +
>> +     nr = (int) syscall_nr;
>> +     ret = seccomp_set_filter(nr, filter);
>> +
>> +out:
>> +     kfree(filter);
>> +     return ret;
>> +}
>> +
>> +long prctl_clear_seccomp_filter(unsigned long syscall_nr)
>> +{
>> +     int nr = -1;
>> +     long ret;
>> +
>> +     ret = -EINVAL;
>> +     if (syscall_nr >= NR_syscalls)
>> +             goto out;
>> +
>> +     nr = (int) syscall_nr;
>> +     ret = seccomp_clear_filter(nr);
>> +
>> +out:
>> +     return ret;
>> +}
>> +
>> +long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
>> +                           unsigned long available)
>> +{
>> +     int ret, nr;
>> +     unsigned long copied;
>> +     char *buf = NULL;
>> +     ret = -EINVAL;
>> +     if (!available)
>> +             goto out;
>> +     /* Ignore extra buffer space. */
>> +     if (available > SECCOMP_MAX_FILTER_LENGTH)
>> +             available = SECCOMP_MAX_FILTER_LENGTH;
>> +
>> +     ret = -EINVAL;
>> +     if (syscall_nr >= NR_syscalls)
>> +             goto out;
>> +     nr = (int) syscall_nr;
>> +
>> +     ret = -ENOMEM;
>> +     buf = kmalloc(available, GFP_KERNEL);
>> +     if (!buf)
>> +             goto out;
>> +
>> +     ret = seccomp_get_filter(nr, buf, available);
>> +     if (ret < 0)
>> +             goto out;
>> +
>> +     /* Include the NUL byte in the copy. */
>> +     copied = copy_to_user(dst, buf, ret + 1);
>> +     ret = -ENOSPC;
>> +     if (copied)
>> +             goto out;
>> +     ret = 0;
>> +out:
>> +     kfree(buf);
>> +     return ret;
>> +}
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index af468ed..ed60d06 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -1698,13 +1698,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>               case PR_SET_ENDIAN:
>>                       error = SET_ENDIAN(me, arg2);
>>                       break;
>> -
>>               case PR_GET_SECCOMP:
>>                       error = prctl_get_seccomp();
>>                       break;
>>               case PR_SET_SECCOMP:
>>                       error = prctl_set_seccomp(arg2);
>>                       break;
>> +             case PR_SET_SECCOMP_FILTER:
>> +                     error = prctl_set_seccomp_filter(arg2,
>> +                                                      (char __user *) arg3);
>> +                     break;
>> +             case PR_CLEAR_SECCOMP_FILTER:
>> +                     error = prctl_clear_seccomp_filter(arg2);
>> +                     break;
>> +             case PR_GET_SECCOMP_FILTER:
>> +                     error = prctl_get_seccomp_filter(arg2,
>> +                                                      (char __user *) arg3,
>> +                                                      arg4);
>> +                     break;
>>               case PR_GET_TSC:
>>                       error = GET_TSC_CTL(arg2);
>>                       break;
>> diff --git a/security/Kconfig b/security/Kconfig
>> index 95accd4..c76adf2 100644
>> --- a/security/Kconfig
>> +++ b/security/Kconfig
>> @@ -2,6 +2,10 @@
>>  # Security configuration
>>  #
>>
>> +# Make seccomp filter Kconfig switch below available
>> +config HAVE_SECCOMP_FILTER
>> +       bool
>> +
>>  menu "Security options"
>>
>>  config KEYS
>> @@ -82,6 +86,19 @@ config SECURITY_DMESG_RESTRICT
>>
>>         If you are unsure how to answer this question, answer N.
>>
>> +config SECCOMP_FILTER
>> +     bool "Enable seccomp-based system call filtering"
>> +     select SECCOMP
>> +     depends on HAVE_SECCOMP_FILTER && EXPERIMENTAL
>> +     help
>> +       This kernel feature expands CONFIG_SECCOMP to allow computing
>> +       in environments with reduced kernel access dictated by the
>> +       application itself through prctl calls.  If
>> +       CONFIG_FTRACE_SYSCALLS is available, then system call
>> +       argument-based filtering predicates may be used.
>> +
>> +       See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>>  config SECURITY
>>       bool "Enable different security models"
>>       depends on SYSFS
>> --
>> 1.7.0.4
>>
>


* Re: [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters
  2011-06-02 18:14                                                           ` Will Drewry
@ 2011-06-02 19:42                                                             ` Paul E. McKenney
  2011-06-02 20:28                                                               ` Will Drewry
  0 siblings, 1 reply; 91+ messages in thread
From: Paul E. McKenney @ 2011-06-02 19:42 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, kees.cook, torvalds, tglx, mingo, rostedt, jmorris,
	Peter Zijlstra, Frederic Weisbecker, linux-security-module

On Thu, Jun 02, 2011 at 01:14:54PM -0500, Will Drewry wrote:
> On Thu, Jun 2, 2011 at 12:36 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Tue, May 31, 2011 at 10:10:35PM -0500, Will Drewry wrote:
> >> This change adds a new seccomp mode which specifies the allowed system
> >> calls dynamically.  When in the new mode (2), all system calls are
> >> checked against process-defined filters - first by system call number,
> >> then by a filter string.  If an entry exists for a given system call and
> >> all filter predicates evaluate to true, then the task may proceed.
> >> Otherwise, the task is killed.
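
The two-step check described above (system call number first, then the filter) can be modelled in userspace roughly as follows. This is a sketch under stated assumptions: the table size, `predicate_fn`, `table_init()` and `table_check()` are hypothetical, and a plain C function stands in for an ftrace event filter. Only the 0xffff/0xfffe sentinel values are taken from the patch.

```c
#include <stdint.h>
#include <string.h>

#define NR_SYSCALLS_MODEL 8	/* hypothetical small table for illustration */
#define ACTION_DENY  0xffff	/* same values as SECCOMP_ACTION_DENY/ALLOW */
#define ACTION_ALLOW 0xfffe

/* Hypothetical predicate type standing in for an ftrace event filter. */
typedef int (*predicate_fn)(long arg0);

struct filter_table {
	uint16_t syscalls[NR_SYSCALLS_MODEL];	/* action or predicate index */
	predicate_fn predicates[NR_SYSCALLS_MODEL];
};

static int fd_is_stdout_or_stderr(long fd)
{
	return fd == 1 || fd == 2;
}

/* Every byte 0xff makes every 16-bit entry 0xffff, i.e. default deny. */
static void table_init(struct filter_table *t)
{
	memset(t->syscalls, 0xff, sizeof(t->syscalls));
	memset(t->predicates, 0, sizeof(t->predicates));
}

/* Returns 1 if the call may proceed, 0 if the task would be killed. */
static int table_check(const struct filter_table *t, int nr, long arg0)
{
	uint16_t id;

	if (nr < 0 || nr >= NR_SYSCALLS_MODEL)
		return 0;
	id = t->syscalls[nr];
	if (id == ACTION_ALLOW)
		return 1;
	if (id == ACTION_DENY || id >= NR_SYSCALLS_MODEL)
		return 0;
	return t->predicates[id] && t->predicates[id](arg0);
}
```

An entry holding a small index means "evaluate the predicate stored at that index", which is how one 16-bit array can encode allow, deny and per-argument filtering at once.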
> >
> > A few questions below -- I can't say that I understand the RCU usage.
> >
> >                                                        Thanx, Paul
> >
> >> Filter string parsing and evaluation is handled by the ftrace filter
> >> engine.  Related patches tweak the perf filter trace and free routines,
> >> allowing the calls to be shared. Filters inherit their understanding of
> >> types and arguments for each system call from the CONFIG_FTRACE_SYSCALLS
> >> subsystem, which already populates this information in the
> >> syscall_metadata-associated enter_event (and exit_event) structures. If
> >> CONFIG_FTRACE_SYSCALLS is not compiled in, only filter strings of "1"
> >> will be allowed.
> >>
> >> The net result is that a process may have its system calls filtered using
> >> the ftrace filter engine's inherent understanding of system calls.  The set
> >> of filters is specified through the PR_SET_SECCOMP_FILTER argument in
> >> prctl(). For example, a filterset for a process, like pdftotext, that
> >> should only process read-only input could (roughly) look like:
> >>   sprintf(rdonly, "flags == %u", O_RDONLY|O_LARGEFILE);
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_open, rdonly);
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR__llseek, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_brk, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_close, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_exit_group, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_fstat64, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_mmap2, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_munmap, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_read, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "(fd == 1 || fd == 2)");
> >>   prctl(PR_SET_SECCOMP, 2);
> >>
> >> Subsequent calls to PR_SET_SECCOMP_FILTER for the same system call will
> >> be &&'d together to ensure that attack surface may only be reduced:
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
> >>
> >> With the earlier example, the active filter becomes:
> >>   "(fd == 1 || fd == 2) && fd != 2"
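
The &&-merging step can be sketched in plain C as below. `merge_filter()` is a hypothetical helper written for illustration; it assumes the same "(%s) && %s" format and -E2BIG overflow check that seccomp_extend_filter() in this patch uses, and it takes the old filter text without surrounding parentheses.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: builds "(old) && extra" into dst.
 * Returns 0 on success or -E2BIG if the result does not fit in len bytes.
 */
static int merge_filter(char *dst, size_t len,
			const char *old_filter, const char *extra)
{
	int expected = snprintf(dst, len, "(%s) && %s", old_filter, extra);

	/* snprintf() reports the length it wanted, so truncation is detectable. */
	if (expected < 0 || (size_t)expected >= len)
		return -E2BIG;
	return 0;
}
```

Because a merge can only add a conjunct, every later call narrows what the earlier filter allowed and never widens it.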
> >>
> >> The patch also adds PR_CLEAR_SECCOMP_FILTER and PR_GET_SECCOMP_FILTER.
> >> The latter returns the current filter for a system call to userspace:
> >>
> >>   prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf, bufsize);
> >>
> >> while the former clears any filters for a given system call, changing it
> >> back to a default deny:
> >>
> >>   prctl(PR_CLEAR_SECCOMP_FILTER, __NR_write);
> >>
> >> v3: - always block execve calls (as per linus torvalds)
> >>     - add __NR_seccomp_execve(_32) to seccomp-supporting arches
> >>     - ensure compat tasks can't reach ftrace:syscalls
> >>     - dropped new defines for seccomp modes.
> >>     - two level array instead of hlists (sugg. by olof johansson)
> >>     - added generic Kconfig entry that is not connected.
> >>     - dropped internal seccomp.h
> >>     - move prctl helpers to seccomp_filter
> >>     - killed seccomp_t typedef (as per checkpatch)
> >> v2: - changed to use the existing syscall number ABI.
> >>     - prctl changes to minimize parsing in the kernel:
> >>       prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
> >>       prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
> >>       prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
> >>       prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
> >>     - defined PR_SECCOMP_MODE_STRICT and ..._FILTER
> >>     - added flags
> >>     - provide a default fail syscall_nr_to_meta in ftrace
> >>     - provides fallback for unhooked system calls
> >>     - use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
> >>     - added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
> >>     - moved to a hlist and 4 bit hash of linked lists
> >>     - added support to operate without CONFIG_FTRACE_SYSCALLS
> >>     - moved Kconfig support next to SECCOMP
> >>     - made Kconfig entries dependent on EXPERIMENTAL
> >>     - added macros to avoid ifdefs from kernel/fork.c
> >>     - added compat task/filter matching
> >>     - drop seccomp.h inclusion in sched.h and drop seccomp_t
> >>     - added Filtering to "show" output
> >>     - added on_exec state dup'ing when enabling after a fast-path accept.
> >>
> >> Signed-off-by: Will Drewry <wad@chromium.org>
> >> ---
> >>  include/linux/prctl.h   |    5 +
> >>  include/linux/sched.h   |    2 +-
> >>  include/linux/seccomp.h |   98 ++++++-
> >>  include/trace/syscall.h |    7 +
> >>  kernel/Makefile         |    3 +
> >>  kernel/fork.c           |    3 +
> >>  kernel/seccomp.c        |   38 ++-
> >>  kernel/seccomp_filter.c |  784 +++++++++++++++++++++++++++++++++++++++++++++++
> >>  kernel/sys.c            |   13 +-
> >>  security/Kconfig        |   17 +
> >>  10 files changed, 954 insertions(+), 16 deletions(-)
> >>  create mode 100644 kernel/seccomp_filter.c
> >>
> >> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
> >> index a3baeb2..44723ce 100644
> >> --- a/include/linux/prctl.h
> >> +++ b/include/linux/prctl.h
> >> @@ -64,6 +64,11 @@
> >>  #define PR_GET_SECCOMP       21
> >>  #define PR_SET_SECCOMP       22
> >>
> >> +/* Get/set process seccomp filters */
> >> +#define PR_GET_SECCOMP_FILTER        35
> >> +#define PR_SET_SECCOMP_FILTER        36
> >> +#define PR_CLEAR_SECCOMP_FILTER      37
> >> +
> >>  /* Get/set the capability bounding set (as per security/commoncap.c) */
> >>  #define PR_CAPBSET_READ 23
> >>  #define PR_CAPBSET_DROP 24
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index 18d63ce..3f0bc8d 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -1374,7 +1374,7 @@ struct task_struct {
> >>       uid_t loginuid;
> >>       unsigned int sessionid;
> >>  #endif
> >> -     seccomp_t seccomp;
> >> +     struct seccomp_struct seccomp;
> >>
> >>  /* Thread group tracking */
> >>       u32 parent_exec_id;
> >> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> >> index 167c333..f4434ca 100644
> >> --- a/include/linux/seccomp.h
> >> +++ b/include/linux/seccomp.h
> >> @@ -1,13 +1,33 @@
> >>  #ifndef _LINUX_SECCOMP_H
> >>  #define _LINUX_SECCOMP_H
> >>
> >> +struct seq_file;
> >>
> >>  #ifdef CONFIG_SECCOMP
> >>
> >> +#include <linux/errno.h>
> >>  #include <linux/thread_info.h>
> >> +#include <linux/types.h>
> >>  #include <asm/seccomp.h>
> >>
> >> -typedef struct { int mode; } seccomp_t;
> >> +struct seccomp_filters;
> >> +/**
> >> + * struct seccomp_struct - the state of a seccomp'ed process
> >> + *
> >> + * @mode:
> >> + *     if this is 1, the process is under standard seccomp rules
> >> + *             is 2, the process is only allowed to make system calls where
> >> + *                   associated filters evaluate successfully.
> >> + * @filters: Metadata for filters if using CONFIG_SECCOMP_FILTER.
> >> + *           filters assignment/use should be RCU-protected and its contents
> >> + *           should never be modified when attached to a seccomp_struct.
> >> + */
> >> +struct seccomp_struct {
> >> +     uint16_t mode;
> >> +#ifdef CONFIG_SECCOMP_FILTER
> >> +     struct seccomp_filters *filters;
> >> +#endif
> >> +};
> >>
> >>  extern void __secure_computing(int);
> >>  static inline void secure_computing(int this_syscall)
> >> @@ -16,15 +36,14 @@ static inline void secure_computing(int this_syscall)
> >>               __secure_computing(this_syscall);
> >>  }
> >>
> >> -extern long prctl_get_seccomp(void);
> >>  extern long prctl_set_seccomp(unsigned long);
> >> +extern long prctl_get_seccomp(void);
> >>
> >>  #else /* CONFIG_SECCOMP */
> >>
> >>  #include <linux/errno.h>
> >>
> >> -typedef struct { } seccomp_t;
> >> -
> >> +struct seccomp_struct { };
> >>  #define secure_computing(x) do { } while (0)
> >>
> >>  static inline long prctl_get_seccomp(void)
> >> @@ -32,11 +51,80 @@ static inline long prctl_get_seccomp(void)
> >>       return -EINVAL;
> >>  }
> >>
> >> -static inline long prctl_set_seccomp(unsigned long arg2)
> >> +static inline long prctl_set_seccomp(unsigned long a2)
> >>  {
> >>       return -EINVAL;
> >>  }
> >>
> >>  #endif /* CONFIG_SECCOMP */
> >>
> >> +#ifdef CONFIG_SECCOMP_FILTER
> >> +
> >> +#define inherit_tsk_seccomp(_child, _orig) do { \
> >> +     _child->seccomp.mode = _orig->seccomp.mode; \
> >> +     _child->seccomp.filters = get_seccomp_filters(_orig->seccomp.filters); \
> >> +     } while (0)
> >> +#define put_tsk_seccomp(_tsk) put_seccomp_filters(_tsk->seccomp.filters)
> >> +
> >> +extern int seccomp_show_filters(struct seccomp_filters *filters,
> >> +                             struct seq_file *);
> >> +extern long seccomp_set_filter(int, char *);
> >> +extern long seccomp_clear_filter(int);
> >> +extern long seccomp_get_filter(int, char *, unsigned long);
> >> +
> >> +extern long prctl_set_seccomp_filter(unsigned long, char __user *);
> >> +extern long prctl_get_seccomp_filter(unsigned long, char __user *,
> >> +                                  unsigned long);
> >> +extern long prctl_clear_seccomp_filter(unsigned long);
> >> +
> >> +extern struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *);
> >> +extern void put_seccomp_filters(struct seccomp_filters *);
> >> +
> >> +extern int seccomp_test_filters(int);
> >> +extern void seccomp_filter_log_failure(int);
> >> +
> >> +#else  /* CONFIG_SECCOMP_FILTER */
> >> +
> >> +struct seccomp_filters { };
> >> +#define inherit_tsk_seccomp(_child, _orig) do { } while (0)
> >> +#define put_tsk_seccomp(_tsk) do { } while (0)
> >> +
> >> +static inline int seccomp_show_filters(struct seccomp_filters *filters,
> >> +                                    struct seq_file *m)
> >> +{
> >> +     return -ENOSYS;
> >> +}
> >> +
> >> +static inline long seccomp_set_filter(int syscall_nr, char *filter)
> >> +{
> >> +     return -ENOSYS;
> >> +}
> >> +
> >> +static inline long seccomp_clear_filter(int syscall_nr)
> >> +{
> >> +     return -ENOSYS;
> >> +}
> >> +
> >> +static inline long seccomp_get_filter(int syscall_nr,
> >> +                                   char *buf, unsigned long available)
> >> +{
> >> +     return -ENOSYS;
> >> +}
> >> +
> >> +static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3)
> >> +{
> >> +     return -ENOSYS;
> >> +}
> >> +
> >> +static inline long prctl_clear_seccomp_filter(unsigned long a2)
> >> +{
> >> +     return -ENOSYS;
> >> +}
> >> +
> >> +static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3,
> >> +                                         unsigned long a4)
> >> +{
> >> +     return -ENOSYS;
> >> +}
> >> +#endif  /* CONFIG_SECCOMP_FILTER */
> >>  #endif /* _LINUX_SECCOMP_H */
> >> diff --git a/include/trace/syscall.h b/include/trace/syscall.h
> >> index 242ae04..e061ad0 100644
> >> --- a/include/trace/syscall.h
> >> +++ b/include/trace/syscall.h
> >> @@ -35,6 +35,8 @@ struct syscall_metadata {
> >>  extern unsigned long arch_syscall_addr(int nr);
> >>  extern int init_syscall_trace(struct ftrace_event_call *call);
> >>
> >> +extern struct syscall_metadata *syscall_nr_to_meta(int);
> >> +
> >>  extern int reg_event_syscall_enter(struct ftrace_event_call *call);
> >>  extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
> >>  extern int reg_event_syscall_exit(struct ftrace_event_call *call);
> >> @@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
> >>                                     struct trace_event *event);
> >>  enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
> >>                                    struct trace_event *event);
> >> +#else
> >> +static inline struct syscall_metadata *syscall_nr_to_meta(int nr)
> >> +{
> >> +     return NULL;
> >> +}
> >>  #endif
> >>
> >>  #ifdef CONFIG_PERF_EVENTS
> >> diff --git a/kernel/Makefile b/kernel/Makefile
> >> index 85cbfb3..84e7dfb 100644
> >> --- a/kernel/Makefile
> >> +++ b/kernel/Makefile
> >> @@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
> >>  obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
> >>  obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
> >>  obj-$(CONFIG_SECCOMP) += seccomp.o
> >> +ifeq ($(CONFIG_SECCOMP_FILTER),y)
> >> +obj-$(CONFIG_SECCOMP) += seccomp_filter.o
> >> +endif
> >>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
> >>  obj-$(CONFIG_TREE_RCU) += rcutree.o
> >>  obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index e7548de..6f835e0 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -34,6 +34,7 @@
> >>  #include <linux/cgroup.h>
> >>  #include <linux/security.h>
> >>  #include <linux/hugetlb.h>
> >> +#include <linux/seccomp.h>
> >>  #include <linux/swap.h>
> >>  #include <linux/syscalls.h>
> >>  #include <linux/jiffies.h>
> >> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
> >>       free_thread_info(tsk->stack);
> >>       rt_mutex_debug_task_free(tsk);
> >>       ftrace_graph_exit_task(tsk);
> >> +     put_tsk_seccomp(tsk);
> >>       free_task_struct(tsk);
> >>  }
> >>  EXPORT_SYMBOL(free_task);
> >> @@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
> >>       if (err)
> >>               goto out;
> >>
> >> +     inherit_tsk_seccomp(tsk, orig);
> >>       setup_thread_stack(tsk, orig);
> >>       clear_user_return_notifier(tsk);
> >>       clear_tsk_need_resched(tsk);
> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >> index 57d4b13..0a942be 100644
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -2,16 +2,20 @@
> >>   * linux/kernel/seccomp.c
> >>   *
> >>   * Copyright 2004-2005  Andrea Arcangeli <andrea@cpushare.com>
> >> + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
> >>   *
> >>   * This defines a simple but solid secure-computing mode.
> >>   */
> >>
> >>  #include <linux/seccomp.h>
> >>  #include <linux/sched.h>
> >> +#include <linux/slab.h>
> >>  #include <linux/compat.h>
> >> +#include <linux/unistd.h>
> >> +#include <linux/ftrace_event.h>
> >>
> >> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
> >>  /* #define SECCOMP_DEBUG 1 */
> >> -#define NR_SECCOMP_MODES 1
> >>
> >>  /*
> >>   * Secure computing mode 1 allows only read/write/exit/sigreturn.
> >> @@ -32,10 +36,9 @@ static int mode1_syscalls_32[] = {
> >>
> >>  void __secure_computing(int this_syscall)
> >>  {
> >> -     int mode = current->seccomp.mode;
> >>       int * syscall;
> >>
> >> -     switch (mode) {
> >> +     switch (current->seccomp.mode) {
> >>       case 1:
> >>               syscall = mode1_syscalls;
> >>  #ifdef CONFIG_COMPAT
> >> @@ -47,6 +50,17 @@ void __secure_computing(int this_syscall)
> >>                               return;
> >>               } while (*++syscall);
> >>               break;
> >> +#ifdef CONFIG_SECCOMP_FILTER
> >> +     case 2:
> >> +             if (this_syscall >= NR_syscalls || this_syscall < 0)
> >> +                     break;
> >> +
> >> +             if (!seccomp_test_filters(this_syscall))
> >> +                     return;
> >> +
> >> +             seccomp_filter_log_failure(this_syscall);
> >> +             break;
> >> +#endif
> >>       default:
> >>               BUG();
> >>       }
> >> @@ -71,16 +85,22 @@ long prctl_set_seccomp(unsigned long seccomp_mode)
> >>       if (unlikely(current->seccomp.mode))
> >>               goto out;
> >>
> >> -     ret = -EINVAL;
> >> -     if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
> >> -             current->seccomp.mode = seccomp_mode;
> >> -             set_thread_flag(TIF_SECCOMP);
> >> +     ret = 0;
> >> +     switch (seccomp_mode) {
> >> +     case 1:
> >>  #ifdef TIF_NOTSC
> >>               disable_TSC();
> >>  #endif
> >> -             ret = 0;
> >> +#ifdef CONFIG_SECCOMP_FILTER
> >> +     case 2:
> >> +#endif
> >> +             current->seccomp.mode = seccomp_mode;
> >> +             set_thread_flag(TIF_SECCOMP);
> >> +             break;
> >> +     default:
> >> +             ret = -EINVAL;
> >>       }
> >>
> >> - out:
> >> +out:
> >>       return ret;
> >>  }
> >> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
> >> new file mode 100644
> >> index 0000000..9782f25
> >> --- /dev/null
> >> +++ b/kernel/seccomp_filter.c
> >> @@ -0,0 +1,784 @@
> >> +/* filter engine-based seccomp system call filtering
> >> + *
> >> + * This program is free software; you can redistribute it and/or modify
> >> + * it under the terms of the GNU General Public License as published by
> >> + * the Free Software Foundation; either version 2 of the License, or
> >> + * (at your option) any later version.
> >> + *
> >> + * This program is distributed in the hope that it will be useful,
> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >> + * GNU General Public License for more details.
> >> + *
> >> + * You should have received a copy of the GNU General Public License
> >> + * along with this program; if not, write to the Free Software
> >> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> >> + *
> >> + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
> >> + */
> >> +
> >> +#include <linux/compat.h>
> >> +#include <linux/err.h>
> >> +#include <linux/errno.h>
> >> +#include <linux/ftrace_event.h>
> >> +#include <linux/seccomp.h>
> >> +#include <linux/seq_file.h>
> >> +#include <linux/sched.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/uaccess.h>
> >> +
> >> +#include <asm/syscall.h>
> >> +#include <trace/syscall.h>
> >> +
> >> +
> >> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
> >> +
> >> +#define SECCOMP_FILTER_ALLOW "1"
> >> +#define SECCOMP_ACTION_DENY 0xffff
> >> +#define SECCOMP_ACTION_ALLOW 0xfffe
> >> +
> >> +/**
> >> + * struct seccomp_filters - container for seccomp filterset
> >> + *
> >> + * @syscalls: array of 16-bit indices into @event_filters by syscall_nr
> >> + *            May also be SECCOMP_ACTION_DENY or SECCOMP_ACTION_ALLOW
> >> + * @event_filters: array of pointers to ftrace event objects
> >> + * @count: size of @event_filters
> >> + * @flags: anonymous struct to wrap filters-specific flags
> >> + * @usage: reference count to simplify use.
> >> + */
> >> +struct seccomp_filters {
> >> +     uint16_t syscalls[NR_syscalls];
> >> +     struct event_filter **event_filters;
> >> +     uint16_t count;
> >> +     struct {
> >> +             uint32_t compat:1,
> >> +                      __reserved:31;
> >> +     } flags;
> >> +     atomic_t usage;
> >> +};
> >> +
> >> +/* Handle ftrace symbol non-existence */
> >> +#ifdef CONFIG_FTRACE_SYSCALLS
> >> +#define create_event_filter(_ef_pptr, _event_type, _str) \
> >> +     ftrace_parse_filter(_ef_pptr, _event_type, _str)
> >> +#define get_filter_string(_ef) ftrace_get_filter_string(_ef)
> >> +#define free_event_filter(_f) ftrace_free_filter(_f)
> >> +
> >> +#else
> >> +
> >> +#define create_event_filter(_ef_pptr, _event_type, _str) (-ENOSYS)
> >> +#define get_filter_string(_ef) (NULL)
> >> +#define free_event_filter(_f) do { } while (0)
> >> +#endif
> >> +
> >> +/**
> >> + * seccomp_filters_new - allocates a new filters object
> >> + * @count: count to allocate for the event_filters array
> >> + *
> >> + * Returns ERR_PTR on error or an allocated object.
> >> + */
> >> +static struct seccomp_filters *seccomp_filters_new(uint16_t count)
> >> +{
> >> +     struct seccomp_filters *f;
> >> +
> >> +     if (count >= SECCOMP_ACTION_ALLOW)
> >> +             return ERR_PTR(-EINVAL);
> >> +
> >> +     f = kzalloc(sizeof(struct seccomp_filters), GFP_KERNEL);
> >> +     if (!f)
> >> +             return ERR_PTR(-ENOMEM);
> >> +
> >> +     /* Lazy SECCOMP_ACTION_DENY assignment. */
> >> +     memset(f->syscalls, 0xff, sizeof(f->syscalls));
> >> +     atomic_set(&f->usage, 1);
> >> +
> >> +     f->event_filters = NULL;
> >> +     f->count = count;
> >> +     if (!count)
> >> +             return f;
> >> +
> >> +     f->event_filters = kzalloc(count * sizeof(struct event_filter *),
> >> +                                GFP_KERNEL);
> >> +     if (!f->event_filters) {
> >> +             kfree(f);
> >> +             f = ERR_PTR(-ENOMEM);
> >> +     }
> >> +     return f;
> >> +}
> >> +
> >> +/**
> >> + * seccomp_filters_free - cleans up the filter list and frees the table
> >> + * @filters: NULL or live object to be completely destructed.
> >> + */
> >> +static void seccomp_filters_free(struct seccomp_filters *filters)
> >> +{
> >> +     uint16_t count = 0;
> >> +     if (!filters)
> >> +             return;
> >> +     while (count < filters->count) {
> >> +             struct event_filter *f = filters->event_filters[count];
> >> +             free_event_filter(f);
> >> +             count++;
> >> +     }
> >> +     kfree(filters->event_filters);
> >> +     kfree(filters);
> >> +}
> >> +
> >> +static void __put_seccomp_filters(struct seccomp_filters *orig)
> >> +{
> >> +     WARN_ON(atomic_read(&orig->usage));
> >> +     seccomp_filters_free(orig);
> >> +}
> >> +
> >> +#define seccomp_filter_allow(_id) ((_id) == SECCOMP_ACTION_ALLOW)
> >> +#define seccomp_filter_deny(_id) ((_id) == SECCOMP_ACTION_DENY)
> >> +#define seccomp_filter_dynamic(_id) \
> >> +     (!seccomp_filter_allow(_id) && !seccomp_filter_deny(_id))
> >> +static inline uint16_t seccomp_filter_id(const struct seccomp_filters *f,
> >> +                                      int syscall_nr)
> >> +{
> >> +     if (!f)
> >> +             return SECCOMP_ACTION_DENY;
> >> +     return f->syscalls[syscall_nr];
> >> +}
> >> +
> >> +static inline struct event_filter *seccomp_dynamic_filter(
> >> +             const struct seccomp_filters *filters, uint16_t id)
> >> +{
> >> +     if (!seccomp_filter_dynamic(id))
> >> +             return NULL;
> >> +     return filters->event_filters[id];
> >> +}
> >> +
> >> +static inline void set_seccomp_filter_id(struct seccomp_filters *filters,
> >> +                                      int syscall_nr, uint16_t id)
> >> +{
> >> +     filters->syscalls[syscall_nr] = id;
> >> +}
> >> +
> >> +static inline void set_seccomp_filter(struct seccomp_filters *filters,
> >> +                                   int syscall_nr, uint16_t id,
> >> +                                   struct event_filter *dynamic_filter)
> >> +{
> >> +     filters->syscalls[syscall_nr] = id;
> >> +     if (seccomp_filter_dynamic(id))
> >> +             filters->event_filters[id] = dynamic_filter;
> >> +}
> >> +
> >> +static struct event_filter *alloc_event_filter(int syscall_nr,
> >> +                                            const char *filter_string)
> >> +{
> >> +     struct syscall_metadata *data;
> >> +     struct event_filter *filter = NULL;
> >> +     int err;
> >> +
> >> +     data = syscall_nr_to_meta(syscall_nr);
> >> +     /* Argument-based filtering only works on ftrace-hooked syscalls. */
> >> +     err = -ENOSYS;
> >> +     if (!data)
> >> +             goto fail;
> >> +     err = create_event_filter(&filter,
> >> +                               data->enter_event->event.type,
> >> +                               filter_string);
> >> +     if (err)
> >> +             goto fail;
> >> +
> >> +     return filter;
> >> +fail:
> >> +     kfree(filter);
> >> +     return ERR_PTR(err);
> >> +}
> >> +
> >> +/**
> >> + * seccomp_filters_copy - copies filters from src to dst.
> >> + *
> >> + * @dst: seccomp_filters to populate.
> >> + * @src: table to read from.
> >> + * @skip: specifies an entry, by system call, to skip.
> >> + *
> >> + * Returns non-zero on failure.
> >> + * Both the source and the destination should have no simultaneous
> >> + * writers, and dst should be exclusive to the caller.
> >> + * If @skip is < 0, it is ignored.
> >> + */
> >> +static int seccomp_filters_copy(struct seccomp_filters *dst,
> >> +                             const struct seccomp_filters *src,
> >> +                             int skip)
> >> +{
> >> +     int id = 0, ret = 0, nr;
> >> +     memcpy(&dst->flags, &src->flags, sizeof(src->flags));
> >> +     memcpy(dst->syscalls, src->syscalls, sizeof(dst->syscalls));
> >> +     if (!src->count)
> >> +             goto done;
> >> +     for (nr = 0; nr < NR_syscalls; ++nr) {
> >> +             struct event_filter *filter;
> >> +             const char *str;
> >> +             uint16_t src_id = seccomp_filter_id(src, nr);
> >> +             if (nr == skip) {
> >> +                     set_seccomp_filter(dst, nr, SECCOMP_ACTION_DENY,
> >> +                                        NULL);
> >> +                     continue;
> >> +             }
> >> +             if (!seccomp_filter_dynamic(src_id))
> >> +                     continue;
> >> +             if (id >= dst->count) {
> >> +                     ret = -EINVAL;
> >> +                     goto done;
> >> +             }
> >> +             str = get_filter_string(seccomp_dynamic_filter(src, src_id));
> >> +             filter = alloc_event_filter(nr, str);
> >> +             if (IS_ERR(filter)) {
> >> +                     ret = PTR_ERR(filter);
> >> +                     goto done;
> >> +             }
> >> +             set_seccomp_filter(dst, nr, id, filter);
> >> +             id++;
> >> +     }
> >> +
> >> +done:
> >> +     return ret;
> >> +}
> >> +
> >> +/**
> >> + * seccomp_extend_filter - appends more text to a syscall_nr's filter
> >> + * @filters: unattached filter object to operate on
> >> + * @syscall_nr: syscall number to update filters for
> >> + * @filter_string: string to append to the existing filter
> >> + *
> >> + * The new string will be &&'d to the original filter string to ensure that it
> >> + * always matches the existing predicates or less:
> >> + *   (old_filter) && @filter_string
> >> + * Returns 0 on success and a negative errno value on failure.
> >> + */
> >> +static int seccomp_extend_filter(struct seccomp_filters *filters,
> >> +                              int syscall_nr, char *filter_string)
> >> +{
> >> +     struct event_filter *filter;
> >> +     uint16_t id = seccomp_filter_id(filters, syscall_nr);
> >> +     char *merged = NULL;
> >> +     int ret = -EINVAL, expected;
> >> +
> >> +     /* No extending with a "1". */
> >> +     if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string))
> >> +             goto out;
> >> +
> >> +     filter = seccomp_dynamic_filter(filters, id);
> >> +     ret = -ENOENT;
> >> +     if (!filter)
> >> +             goto out;
> >> +
> >> +     merged = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
> >> +     ret = -ENOMEM;
> >> +     if (!merged)
> >> +             goto out;
> >> +
> >> +     expected = snprintf(merged, SECCOMP_MAX_FILTER_LENGTH, "(%s) && %s",
> >> +                         get_filter_string(filter), filter_string);
> >> +     ret = -E2BIG;
> >> +     if (expected >= SECCOMP_MAX_FILTER_LENGTH || expected < 0)
> >> +             goto out;
> >> +
> >> +     /* Free the old filter */
> >> +     free_event_filter(filter);
> >> +     set_seccomp_filter(filters, syscall_nr, id, NULL);
> >> +
> >> +     /* Replace it */
> >> +     filter = alloc_event_filter(syscall_nr, merged);
> >> +     if (IS_ERR(filter)) {
> >> +             ret = PTR_ERR(filter);
> >> +             goto out;
> >> +     }
> >> +     set_seccomp_filter(filters, syscall_nr, id, filter);
> >> +     ret = 0;
> >> +
> >> +out:
> >> +     kfree(merged);
> >> +     return ret;
> >> +}
> >> +
> >> +/**
> >> + * seccomp_add_filter - adds a filter for an unfiltered syscall
> >> + * @filters: filters object to add a filter/action to
> >> + * @syscall_nr: system call number to add a filter for
> >> + * @filter_string: the filter string to apply
> >> + *
> >> + * Returns 0 on success and non-zero otherwise.
> >> + */
> >> +static int seccomp_add_filter(struct seccomp_filters *filters, int syscall_nr,
> >> +                           char *filter_string)
> >> +{
> >> +     struct event_filter *filter;
> >> +     int ret = 0;
> >> +
> >> +     if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string)) {
> >> +             set_seccomp_filter(filters, syscall_nr,
> >> +                                SECCOMP_ACTION_ALLOW, NULL);
> >> +             goto out;
> >> +     }
> >> +
> >> +     filter = alloc_event_filter(syscall_nr, filter_string);
> >> +     if (IS_ERR(filter)) {
> >> +             ret = PTR_ERR(filter);
> >> +             goto out;
> >> +     }
> >> +     /* Always add to the last slot available since additions are
> >> +      * only done one at a time.
> >> +      */
> >> +     set_seccomp_filter(filters, syscall_nr, filters->count - 1, filter);
> >> +out:
> >> +     return ret;
> >> +}
> >> +
> >> +/* Wrap optional ftrace syscall support. Returns 1 on match or 0 otherwise. */
> >> +static int filter_match_current(struct event_filter *event_filter)
> >> +{
> >> +     int err = 0;
> >> +#ifdef CONFIG_FTRACE_SYSCALLS
> >> +     uint8_t syscall_state[64];
> >> +
> >> +     memset(syscall_state, 0, sizeof(syscall_state));
> >> +
> >> +     /* The generic tracing entry can remain zeroed. */
> >> +     err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
> >> +                                      NULL);
> >> +     if (err)
> >> +             return 0;
> >> +
> >> +     err = filter_match_preds(event_filter, syscall_state);
> >> +#endif
> >> +     return err;
> >> +}
> >> +
> >> +static const char *syscall_nr_to_name(int syscall)
> >> +{
> >> +     const char *syscall_name = "unknown";
> >> +     struct syscall_metadata *data = syscall_nr_to_meta(syscall);
> >> +     if (data)
> >> +             syscall_name = data->name;
> >> +     return syscall_name;
> >> +}
> >> +
> >> +static void filters_set_compat(struct seccomp_filters *filters)
> >> +{
> >> +#ifdef CONFIG_COMPAT
> >> +     if (is_compat_task())
> >> +             filters->flags.compat = 1;
> >> +#endif
> >> +}
> >> +
> >> +static inline int filters_compat_mismatch(struct seccomp_filters *filters)
> >> +{
> >> +     int ret = 0;
> >> +     if (!filters)
> >> +             return 0;
> >> +#ifdef CONFIG_COMPAT
> >> +     if (!!(is_compat_task()) != filters->flags.compat)
> >> +             ret = 1;
> >> +#endif
> >> +     return ret;
> >> +}
> >> +
> >> +static inline int syscall_is_execve(int syscall)
> >> +{
> >> +     int nr = __NR_execve;
> >> +#ifdef CONFIG_COMPAT
> >> +     if (is_compat_task())
> >> +             nr = __NR_seccomp_execve_32;
> >> +#endif
> >> +     return syscall == nr;
> >> +}
> >> +
> >> +#ifndef KSTK_EIP
> >> +#define KSTK_EIP(x) 0L
> >> +#endif
> >> +
> >> +void seccomp_filter_log_failure(int syscall)
> >> +{
> >> +     pr_info("%s[%d]: system call %d (%s) blocked at 0x%lx\n",
> >> +             current->comm, task_pid_nr(current), syscall,
> >> +             syscall_nr_to_name(syscall), KSTK_EIP(current));
> >> +}
> >> +
> >> +/* put_seccomp_state - decrements the reference count of @orig and may free. */
> >> +void put_seccomp_filters(struct seccomp_filters *orig)
> >> +{
> >> +     if (!orig)
> >> +             return;
> >> +
> >> +     if (atomic_dec_and_test(&orig->usage))
> >> +             __put_seccomp_filters(orig);
> >> +}
> >> +
> >> +/* get_seccomp_state - increments the reference count of @orig */
> >> +struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *orig)
> >
> > Nit: the name does not match the comment.
> 
> Will fix it here and above. Thanks!
> 
> >> +{
> >> +     if (!orig)
> >> +             return NULL;
> >> +     atomic_inc(&orig->usage);
> >> +     return orig;
> >
> > This is called in an RCU read-side critical section.  What exactly is
> > RCU protecting?  I would expect an rcu_dereference() or one of the
> > RCU list-traversal primitives somewhere, either here or at the caller.
> 
> Ah, I spaced on rcu_dereference().  The goal was to make the
> assignment and replacement of the seccomp_filters pointer
> RCU-protected (in seccomp_state) so there's no concern over it being
> partially replaced on platforms where pointer assignments are non-atomic
> - such as via /proc/<pid>/seccomp_filters access or a call via the
> exported symbols.  Object lifetime is managed by reference counting so
> that I don't have to worry about extending the RCU read-side critical
> section by much or deal with pre-allocations.
> 
> I'll add rcu_dereference() to all the get_seccomp_filters() uses where
> it makes sense, so that it is called safely.  Just to make sure, does
> it make sense to continue to rcu protect the specific pointer?

It might.  The usual other option is to use a lock outside of the element
containing the reference count to protect reference-count manipulation.
If there is some convenient lock, especially if it is already held where
needed, then locking is more straightforward.  Otherwise, RCU is usually
a reasonable option.
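
To make the locking alternative concrete, here is a minimal userspace
sketch of taking and dropping a reference under a lock that lives outside
the refcounted object.  It is only an illustration: struct owner,
filters_get() and filters_put() are made-up names, a pthread mutex stands
in for whatever kernel lock happens to be convenient, and none of this is
taken from the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct seccomp_filters; only the count matters. */
struct filters {
	int usage;	/* protected by the owner's lock, not by atomics */
};

/* Hypothetical stand-in for the per-task slot that holds the pointer. */
struct owner {
	pthread_mutex_t lock;	/* lives outside the refcounted element */
	struct filters *filters;
};

/* Take a reference: the lock keeps the pointer stable while we bump the count. */
static struct filters *filters_get(struct owner *o)
{
	struct filters *f;

	pthread_mutex_lock(&o->lock);
	f = o->filters;
	if (f)
		f->usage++;
	pthread_mutex_unlock(&o->lock);
	return f;
}

/* Drop a reference; returns 1 when the caller dropped the last one. */
static int filters_put(struct owner *o, struct filters *f)
{
	int last;

	if (!f)
		return 0;
	pthread_mutex_lock(&o->lock);
	last = (--f->usage == 0);
	pthread_mutex_unlock(&o->lock);
	return last;
}
```

Because every reader and writer of the pointer goes through the same
lock, no rcu_dereference() is needed in this variant; the cost is that
the syscall fast path would take the lock on every check.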

> >> +}
> >> +
> >> +/**
> >> + * seccomp_test_filters - tests 'current' against the given syscall
> >> + * @state: seccomp_state of current to use.
> >> + * @syscall: number of the system call to test
> >> + *
> >> + * Returns 0 on ok and non-zero on error/failure.
> >> + */
> >> +int seccomp_test_filters(int syscall)
> >> +{
> >> +     uint16_t id;
> >> +     struct event_filter *filter;
> >> +     struct seccomp_filters *filters;
> >> +     int ret = -EACCES;
> >> +
> >> +     rcu_read_lock();
> >> +     filters = get_seccomp_filters(current->seccomp.filters);
> >> +     rcu_read_unlock();
> >> +
> >> +     if (!filters)
> >> +             goto out;
> >> +
> >> +     if (filters_compat_mismatch(filters)) {
> >> +             pr_info("%s[%d]: seccomp_filter compat() mismatch.\n",
> >> +                     current->comm, task_pid_nr(current));
> >> +             goto out;
> >> +     }
> >> +
> >> +     /* execve is never allowed. */
> >> +     if (syscall_is_execve(syscall))
> >> +             goto out;
> >> +
> >> +     ret = 0;
> >> +     id = seccomp_filter_id(filters, syscall);
> >> +     if (seccomp_filter_allow(id))
> >> +             goto out;
> >> +
> >> +     ret = -EACCES;
> >> +     if (!seccomp_filter_dynamic(id))
> >> +             goto out;
> >> +
> >> +     filter = seccomp_dynamic_filter(filters, id);
> >> +     if (filter && filter_match_current(filter))
> >> +             ret = 0;
> >> +out:
> >> +     put_seccomp_filters(filters);
> >> +     return ret;
> >> +}
> >> +
> >> +/**
> >> + * seccomp_show_filters - prints the current filter state to a seq_file
> >> + * @filters: properly get()'d filters object
> >> + * @m: the prepared seq_file to receive the data
> >> + *
> >> + * Returns 0 on a successful write.
> >> + */
> >> +int seccomp_show_filters(struct seccomp_filters *filters, struct seq_file *m)
> >> +{
> >> +     int syscall;
> >> +     seq_printf(m, "Mode: %d\n", current->seccomp.mode);
> >> +     if (!filters)
> >> +             goto out;
> >> +
> >> +     for (syscall = 0; syscall < NR_syscalls; ++syscall) {
> >> +             uint16_t id = seccomp_filter_id(filters, syscall);
> >> +             const char *filter_string = SECCOMP_FILTER_ALLOW;
> >> +             if (seccomp_filter_deny(id))
> >> +                     continue;
> >> +             seq_printf(m, "%d (%s): ",
> >> +                           syscall,
> >> +                           syscall_nr_to_name(syscall));
> >> +             if (seccomp_filter_dynamic(id))
> >> +                     filter_string = get_filter_string(
> >> +                                       seccomp_dynamic_filter(filters, id));
> >> +             seq_printf(m, "%s\n", filter_string);
> >> +     }
> >> +out:
> >> +     return 0;
> >> +}
> >> +EXPORT_SYMBOL_GPL(seccomp_show_filters);
> >> +
> >> +/**
> >> + * seccomp_get_filter - copies the filter_string into "buf"
> >> + * @syscall_nr: system call number to look up
> >> + * @buf: destination buffer
> >> + * @bufsize: available space in the buffer.
> >> + *
> >> + * Context: User context only. This function may sleep on allocation and
> >> + *          operates on current. current must be attempting a system call
> >> + *          when this is called.
> >> + *
> >> + * Looks up the filter for the given system call number on current.  If found,
> >> + * the string length of the NUL-terminated buffer is returned and < 0 is
> >> + * returned on error. The NUL byte is not included in the length.
> >> + */
> >> +long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize)
> >> +{
> >> +     struct seccomp_filters *filters;
> >> +     struct event_filter *filter;
> >> +     long ret = -EINVAL;
> >> +     uint16_t id;
> >> +
> >> +     if (bufsize > SECCOMP_MAX_FILTER_LENGTH)
> >> +             bufsize = SECCOMP_MAX_FILTER_LENGTH;
> >> +
> >> +     rcu_read_lock();
> >> +     filters = get_seccomp_filters(current->seccomp.filters);
> >> +     rcu_read_unlock();
> >> +
> >> +     if (!filters)
> >> +             goto out;
> >> +
> >> +     ret = -ENOENT;
> >> +     id = seccomp_filter_id(filters, syscall_nr);
> >> +     if (seccomp_filter_deny(id))
> >> +             goto out;
> >> +
> >> +     if (seccomp_filter_allow(id)) {
> >> +             ret = strlcpy(buf, SECCOMP_FILTER_ALLOW, bufsize);
> >> +             goto copied;
> >> +     }
> >> +
> >> +     filter = seccomp_dynamic_filter(filters, id);
> >> +     if (!filter)
> >> +             goto out;
> >> +     ret = strlcpy(buf, get_filter_string(filter), bufsize);
> >> +
> >> +copied:
> >> +     if (ret >= bufsize) {
> >> +             ret = -ENOSPC;
> >> +             goto out;
> >> +     }
> >> +     /* Zero out any remaining buffer, just in case. */
> >> +     memset(buf + ret, 0, bufsize - ret);
> >> +out:
> >> +     put_seccomp_filters(filters);
> >> +     return ret;
> >> +}
> >> +EXPORT_SYMBOL_GPL(seccomp_get_filter);
> >> +
> >> +/**
> >> + * seccomp_clear_filter - clears the seccomp filter for a syscall.
> >> + * @syscall_nr: the system call number to clear filters for.
> >> + *
> >> + * Context: User context only. This function may sleep on allocation and
> >> + *          operates on current. current must be attempting a system call
> >> + *          when this is called.
> >> + *
> >> + * Returns 0 on success.
> >> + */
> >> +long seccomp_clear_filter(int syscall_nr)
> >> +{
> >> +     struct seccomp_filters *filters = NULL, *orig_filters;
> >> +     uint16_t id;
> >> +     int ret = -EINVAL;
> >> +
> >> +     rcu_read_lock();
> >> +     orig_filters = get_seccomp_filters(current->seccomp.filters);
> >> +     rcu_read_unlock();
> >> +
> >> +     if (!orig_filters)
> >> +             goto out;
> >> +
> >> +     if (filters_compat_mismatch(orig_filters))
> >> +             goto out;
> >> +
> >> +     id = seccomp_filter_id(orig_filters, syscall_nr);
> >> +     if (seccomp_filter_deny(id))
> >> +             goto out;
> >> +
> >> +     /* Create a new filters object for the task */
> >> +     if (seccomp_filter_dynamic(id))
> >> +             filters = seccomp_filters_new(orig_filters->count - 1);
> >> +     else
> >> +             filters = seccomp_filters_new(orig_filters->count);
> >> +
> >> +     if (IS_ERR(filters)) {
> >> +             ret = PTR_ERR(filters);
> >> +             goto out;
> >> +     }
> >> +
> >> +     /* Copy, but drop the requested entry. */
> >> +     ret = seccomp_filters_copy(filters, orig_filters, syscall_nr);
> >> +     if (ret)
> >> +             goto out;
> >> +     get_seccomp_filters(filters);  /* simplify the out: path */
> >> +
> >> +     rcu_assign_pointer(current->seccomp.filters, filters);
> >
> > What prevents two copies of seccomp_clear_filter() from running
> > concurrently?
> 
> Nothing - the last one wins assignment, but the objects themselves
> should be internally consistent to the parallel calls.  If that's a
> concern, a per-task writer mutex could be used just to ensure
> simultaneous calls to clear and set are performed serially.  Would
> that make more sense?

Here is the sequence of events that I am concerned about:

o	CPU 0 sets orig_filters to point to the current filters.

o	CPU 1 sets its local orig_filters to point to the current
	set of filters.

o	Both CPUs allocate new filters and use rcu_assign_pointer()
	to do the update.  As you say, the last one wins, but it appears
	to me that the first one leaks memory.

o	Both CPUs free the object referenced by their orig_filters,
	which might or might not result in a double free, depending
	on exactly what happens below.  (You might actually be OK,
	I didn't check -- leaking memory was enough for me to call
	attention to this.)

So yes, please use some kind of mutual exclusion.  Not sure what you
mean by "per-task mutex", but whatever it is must prevent two different
tasks from acting on the same set of filters at the same time.  The
thing that I call "per-task mutex" would -not- do that.
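
A minimal userspace sketch of the kind of serialization I have in mind
follows.  The names (struct slot, update_lock, slot_add_entry()) are
hypothetical and the filter set is reduced to a bare counter; the point
is only that reading the old pointer and publishing the replacement
happen under one lock that every updater of that pointer must take.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical, simplified filter set: just a count of entries. */
struct filters {
	int count;
};

/* Hypothetical shared slot; every updater of cur must hold update_lock. */
struct slot {
	pthread_mutex_t update_lock;
	struct filters *cur;
};

/*
 * Replace slot->cur with a new object holding one more entry.
 * The old pointer is read and the new one published under the same lock,
 * so two updaters can never both start from the same original and each
 * old object is released by exactly one caller.
 * Returns 0 on success, -1 on allocation failure.
 */
static int slot_add_entry(struct slot *s)
{
	struct filters *old, *new;

	new = malloc(sizeof(*new));
	if (!new)
		return -1;

	pthread_mutex_lock(&s->update_lock);
	old = s->cur;
	new->count = old ? old->count + 1 : 1;
	s->cur = new;		/* rcu_assign_pointer() in the kernel version */
	pthread_mutex_unlock(&s->update_lock);

	free(old);		/* kernel code would wait for readers first */
	return 0;
}
```

With this shape the second updater always builds on the first updater's
result, so nothing is orphaned and nothing is freed twice.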

> >> +     synchronize_rcu();
> >> +     put_seccomp_filters(orig_filters);  /* for the task */
> >> +out:
> >> +     put_seccomp_filters(orig_filters);  /* for the get */
> >> +     put_seccomp_filters(filters);  /* for the extra get */
> >> +     return ret;
> >> +}
> >> +EXPORT_SYMBOL_GPL(seccomp_clear_filter);
> >> +
> >> +/**
> >> + * seccomp_set_filter - adds/extends a seccomp filter for a syscall.
> >> + * @syscall_nr: system call number to apply the filter to.
> >> + * @filter: ftrace filter string to apply.
> >> + *
> >> + * Context: User context only. This function may sleep on allocation and
> >> + *          operates on current. current must be attempting a system call
> >> + *          when this is called.
> >> + *
> >> + * New filters may be added for system calls when the current task is
> >> + * not in a secure computing mode (seccomp).  Otherwise, existing filters may
> >> + * be extended.
> >> + *
> >> + * Returns 0 on success or an errno on failure.
> >> + */
> >> +long seccomp_set_filter(int syscall_nr, char *filter)
> >> +{
> >> +     struct seccomp_filters *filters = NULL, *orig_filters = NULL;
> >> +     uint16_t id;
> >> +     long ret = -EINVAL;
> >> +     uint16_t filters_needed;
> >> +
> >> +     if (!filter)
> >> +             goto out;
> >> +
> >> +     filter = strstrip(filter);
> >> +     /* Disallow empty strings. */
> >> +     if (filter[0] == 0)
> >> +             goto out;
> >> +
> >> +     rcu_read_lock();
> >> +     orig_filters = get_seccomp_filters(current->seccomp.filters);
> >> +     rcu_read_unlock();
> >> +
> >> +     /* After the first call, compatibility mode is selected permanently. */
> >> +     ret = -EACCES;
> >> +     if (filters_compat_mismatch(orig_filters))
> >> +             goto out;
> >> +
> >> +     filters_needed = orig_filters ? orig_filters->count : 0;
> >> +     id = seccomp_filter_id(orig_filters, syscall_nr);
> >> +     if (seccomp_filter_deny(id)) {
> >> +             /* Don't allow DENYs to be changed when in a seccomp mode */
> >> +             ret = -EACCES;
> >> +             if (current->seccomp.mode)
> >> +                     goto out;
> >> +             filters_needed++;
> >> +     }
> >> +
> >> +     filters = seccomp_filters_new(filters_needed);
> >> +     if (IS_ERR(filters)) {
> >> +             ret = PTR_ERR(filters);
> >> +             goto out;
> >> +     }
> >> +
> >> +     filters_set_compat(filters);
> >> +     if (orig_filters) {
> >> +             ret = seccomp_filters_copy(filters, orig_filters, -1);
> >> +             if (ret)
> >> +                     goto out;
> >> +     }
> >> +
> >> +     if (seccomp_filter_deny(id))
> >> +             ret = seccomp_add_filter(filters, syscall_nr, filter);
> >> +     else
> >> +             ret = seccomp_extend_filter(filters, syscall_nr, filter);
> >> +     if (ret)
> >> +             goto out;
> >> +     get_seccomp_filters(filters);  /* simplify the error paths */
> >> +
> >> +     rcu_assign_pointer(current->seccomp.filters, filters);
> >
> > Again, what prevents two copies of seccomp_set_filter() from running
> > concurrently?
> 
> Same deal - nothing, but I'd be happy to add a guard if it makes sense.
> 
> Thanks!
> 
> >> +     synchronize_rcu();
> >> +     put_seccomp_filters(orig_filters);  /* for the task */
> >> +out:
> >> +     put_seccomp_filters(orig_filters);  /* for the get */
> >> +     put_seccomp_filters(filters);  /* for get or task, on err */
> >> +     return ret;
> >> +}
> >> +EXPORT_SYMBOL_GPL(seccomp_set_filter);
> >> +
> >> +long prctl_set_seccomp_filter(unsigned long syscall_nr,
> >> +                           char __user *user_filter)
> >> +{
> >> +     int nr;
> >> +     long ret;
> >> +     char *filter = NULL;
> >> +
> >> +     ret = -EINVAL;
> >> +     if (syscall_nr >= NR_syscalls)
> >> +             goto out;
> >> +
> >> +     ret = -EFAULT;
> >> +     if (!user_filter)
> >> +             goto out;
> >> +
> >> +     filter = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
> >> +     ret = -ENOMEM;
> >> +     if (!filter)
> >> +             goto out;
> >> +
> >> +     ret = -EFAULT;
> >> +     if (strncpy_from_user(filter, user_filter,
> >> +                           SECCOMP_MAX_FILTER_LENGTH - 1) < 0)
> >> +             goto out;
> >> +
> >> +     nr = (int) syscall_nr;
> >> +     ret = seccomp_set_filter(nr, filter);
> >> +
> >> +out:
> >> +     kfree(filter);
> >> +     return ret;
> >> +}
> >> +
> >> +long prctl_clear_seccomp_filter(unsigned long syscall_nr)
> >> +{
> >> +     int nr = -1;
> >> +     long ret;
> >> +
> >> +     ret = -EINVAL;
> >> +     if (syscall_nr >= NR_syscalls)
> >> +             goto out;
> >> +
> >> +     nr = (int) syscall_nr;
> >> +     ret = seccomp_clear_filter(nr);
> >> +
> >> +out:
> >> +     return ret;
> >> +}
> >> +
> >> +long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
> >> +                           unsigned long available)
> >> +{
> >> +     int ret, nr;
> >> +     unsigned long copied;
> >> +     char *buf = NULL;
> >> +     ret = -EINVAL;
> >> +     if (!available)
> >> +             goto out;
> >> +     /* Ignore extra buffer space. */
> >> +     if (available > SECCOMP_MAX_FILTER_LENGTH)
> >> +             available = SECCOMP_MAX_FILTER_LENGTH;
> >> +
> >> +     ret = -EINVAL;
> >> +     if (syscall_nr >= NR_syscalls)
> >> +             goto out;
> >> +     nr = (int) syscall_nr;
> >> +
> >> +     ret = -ENOMEM;
> >> +     buf = kmalloc(available, GFP_KERNEL);
> >> +     if (!buf)
> >> +             goto out;
> >> +
> >> +     ret = seccomp_get_filter(nr, buf, available);
> >> +     if (ret < 0)
> >> +             goto out;
> >> +
> >> +     /* Include the NUL byte in the copy. */
> >> +     copied = copy_to_user(dst, buf, ret + 1);
> >> +     ret = -ENOSPC;
> >> +     if (copied)
> >> +             goto out;
> >> +     ret = 0;
> >> +out:
> >> +     kfree(buf);
> >> +     return ret;
> >> +}
> >> diff --git a/kernel/sys.c b/kernel/sys.c
> >> index af468ed..ed60d06 100644
> >> --- a/kernel/sys.c
> >> +++ b/kernel/sys.c
> >> @@ -1698,13 +1698,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >>               case PR_SET_ENDIAN:
> >>                       error = SET_ENDIAN(me, arg2);
> >>                       break;
> >> -
> >>               case PR_GET_SECCOMP:
> >>                       error = prctl_get_seccomp();
> >>                       break;
> >>               case PR_SET_SECCOMP:
> >>                       error = prctl_set_seccomp(arg2);
> >>                       break;
> >> +             case PR_SET_SECCOMP_FILTER:
> >> +                     error = prctl_set_seccomp_filter(arg2,
> >> +                                                      (char __user *) arg3);
> >> +                     break;
> >> +             case PR_CLEAR_SECCOMP_FILTER:
> >> +                     error = prctl_clear_seccomp_filter(arg2);
> >> +                     break;
> >> +             case PR_GET_SECCOMP_FILTER:
> >> +                     error = prctl_get_seccomp_filter(arg2,
> >> +                                                      (char __user *) arg3,
> >> +                                                      arg4);
> >> +                     break;
> >>               case PR_GET_TSC:
> >>                       error = GET_TSC_CTL(arg2);
> >>                       break;
> >> diff --git a/security/Kconfig b/security/Kconfig
> >> index 95accd4..c76adf2 100644
> >> --- a/security/Kconfig
> >> +++ b/security/Kconfig
> >> @@ -2,6 +2,10 @@
> >>  # Security configuration
> >>  #
> >>
> >> +# Make seccomp filter Kconfig switch below available
> >> +config HAVE_SECCOMP_FILTER
> >> +       bool
> >> +
> >>  menu "Security options"
> >>
> >>  config KEYS
> >> @@ -82,6 +86,19 @@ config SECURITY_DMESG_RESTRICT
> >>
> >>         If you are unsure how to answer this question, answer N.
> >>
> >> +config SECCOMP_FILTER
> >> +     bool "Enable seccomp-based system call filtering"
> >> +     select SECCOMP
> >> +     depends on HAVE_SECCOMP_FILTER && EXPERIMENTAL
> >> +     help
> >> +       This kernel feature expands CONFIG_SECCOMP to allow computing
> >> +       in environments with reduced kernel access dictated by the
> >> +       application itself through prctl calls.  If
> >> +       CONFIG_FTRACE_SYSCALLS is available, then system call
> >> +       argument-based filtering predicates may be used.
> >> +
> >> +       See Documentation/prctl/seccomp_filter.txt for more detail.
> >> +
> >>  config SECURITY
> >>       bool "Enable different security models"
> >>       depends on SYSFS
> >> --
> >> 1.7.0.4
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> Please read the FAQ at  http://www.tux.org/lkml/
> >


* Re: [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters
  2011-06-02 19:42                                                             ` Paul E. McKenney
@ 2011-06-02 20:28                                                               ` Will Drewry
  2011-06-02 20:46                                                                 ` Steven Rostedt
  0 siblings, 1 reply; 91+ messages in thread
From: Will Drewry @ 2011-06-02 20:28 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, kees.cook, torvalds, tglx, mingo, rostedt, jmorris,
	Peter Zijlstra, Frederic Weisbecker, linux-security-module

On Thu, Jun 2, 2011 at 2:42 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Thu, Jun 02, 2011 at 01:14:54PM -0500, Will Drewry wrote:
>> On Thu, Jun 2, 2011 at 12:36 PM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> > On Tue, May 31, 2011 at 10:10:35PM -0500, Will Drewry wrote:
>> >> This change adds a new seccomp mode which specifies the allowed system
>> >> calls dynamically.  When in the new mode (2), all system calls are
>> >> checked against process-defined filters - first by system call number,
>> >> then by a filter string.  If an entry exists for a given system call and
>> >> all filter predicates evaluate to true, then the task may proceed.
>> >> Otherwise, the task is killed.
>> >
>> > A few questions below -- I can't say that I understand the RCU usage.
>> >
>> >                                                        Thanx, Paul
>> >
>> >> Filter string parsing and evaluation is handled by the ftrace filter
>> >> engine.  Related patches tweak the perf filter trace and free calls,
>> >> allowing them to be shared. Filters inherit their understanding of
>> >> types and arguments for each system call from the CONFIG_FTRACE_SYSCALLS
>> >> subsystem which already populates this information in syscall_metadata
>> >> associated enter_event (and exit_event) structures. If
>> >> CONFIG_FTRACE_SYSCALLS is not compiled in, only filter strings of "1"
>> >> will be allowed.
>> >>
>> >> The net result is a process may have its system calls filtered using the
>> >> ftrace filter engine's inherent understanding of system calls.  The set
>> >> of filters is specified through the PR_SET_SECCOMP_FILTER argument in
>> >> prctl(). For example, a filterset for a process, like pdftotext, that
>> >> should only process read-only input could (roughly) look like:
>> >>   sprintf(rdonly, "flags == %u", O_RDONLY|O_LARGEFILE);
>> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_open, rdonly);
>> >>   prctl(PR_SET_SECCOMP_FILTER, __NR__llseek, "1");
>> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_brk, "1");
>> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_close, "1");
>> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_exit_group, "1");
>> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_fstat64, "1");
>> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_mmap2, "1");
>> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_munmap, "1");
>> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_read, "1");
>> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "(fd == 1 || fd == 2)");
>> >>   prctl(PR_SET_SECCOMP, 2);
>> >>
>> >> Subsequent calls to PR_SET_SECCOMP_FILTER for the same system call will
>> >> be &&'d together to ensure that attack surface may only be reduced:
>> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
>> >>
>> >> With the earlier example, the active filter becomes:
>> >>   "(fd == 1 || fd == 2) && fd != 2"
>> >>
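
As a rough userspace illustration of how the combined string is formed,
the sketch below uses the same "(%s) && %s" format that
seccomp_extend_filter() passes to snprintf() in the patch.
merge_filter() and MAX_FILTER_LENGTH are hypothetical names chosen for
the example, and the length limit is arbitrary.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_FILTER_LENGTH 64	/* hypothetical limit for this sketch */

/*
 * Write "(old) && extra" into out.  Returns 0 on success, or -1 if the
 * result would not fit in outsize bytes including the trailing NUL.
 */
static int merge_filter(char *out, size_t outsize,
			const char *old, const char *extra)
{
	int n = snprintf(out, outsize, "(%s) && %s", old, extra);

	if (n < 0 || (size_t)n >= outsize)
		return -1;
	return 0;
}
```

Since the old predicate is always parenthesized and AND-ed with the new
one, a later call can only narrow what the earlier filter accepted.
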
>> >> The patch also adds PR_CLEAR_SECCOMP_FILTER and PR_GET_SECCOMP_FILTER.
>> >> The latter returns the current filter for a system call to userspace:
>> >>
>> >>   prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf, bufsize);
>> >>
>> >> while the former clears any filters for a given system call changing it
>> >> back to a default deny:
>> >>
>> >>   prctl(PR_CLEAR_SECCOMP_FILTER, __NR_write);
>> >>
>> >> v3: - always block execve calls (as per linus torvalds)
>> >>     - add __NR_seccomp_execve(_32) to seccomp-supporting arches
>> >>     - ensure compat tasks can't reach ftrace:syscalls
>> >>     - dropped new defines for seccomp modes.
>> >>     - two level array instead of hlists (sugg. by olof johansson)
>> >>     - added generic Kconfig entry that is not connected.
>> >>     - dropped internal seccomp.h
>> >>     - move prctl helpers to seccomp_filter
>> >>     - killed seccomp_t typedef (as per checkpatch)
>> >> v2: - changed to use the existing syscall number ABI.
>> >>     - prctl changes to minimize parsing in the kernel:
>> >>       prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
>> >>       prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
>> >>       prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
>> >>       prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
>> >>     - defined PR_SECCOMP_MODE_STRICT and ..._FILTER
>> >>     - added flags
>> >>     - provide a default fail syscall_nr_to_meta in ftrace
>> >>     - provides fallback for unhooked system calls
>> >>     - use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
>> >>     - added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
>> >>     - moved to a hlist and 4 bit hash of linked lists
>> >>     - added support to operate without CONFIG_FTRACE_SYSCALLS
>> >>     - moved Kconfig support next to SECCOMP
>> >>     - made Kconfig entries dependent on EXPERIMENTAL
>> >>     - added macros to avoid ifdefs from kernel/fork.c
>> >>     - added compat task/filter matching
>> >>     - drop seccomp.h inclusion in sched.h and drop seccomp_t
>> >>     - added Filtering to "show" output
>> >>     - added on_exec state dup'ing when enabling after a fast-path accept.
>> >>
>> >> Signed-off-by: Will Drewry <wad@chromium.org>
>> >> ---
>> >>  include/linux/prctl.h   |    5 +
>> >>  include/linux/sched.h   |    2 +-
>> >>  include/linux/seccomp.h |   98 ++++++-
>> >>  include/trace/syscall.h |    7 +
>> >>  kernel/Makefile         |    3 +
>> >>  kernel/fork.c           |    3 +
>> >>  kernel/seccomp.c        |   38 ++-
>> >>  kernel/seccomp_filter.c |  784 +++++++++++++++++++++++++++++++++++++++++++++++
>> >>  kernel/sys.c            |   13 +-
>> >>  security/Kconfig        |   17 +
>> >>  10 files changed, 954 insertions(+), 16 deletions(-)
>> >>  create mode 100644 kernel/seccomp_filter.c
>> >>
>> >> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
>> >> index a3baeb2..44723ce 100644
>> >> --- a/include/linux/prctl.h
>> >> +++ b/include/linux/prctl.h
>> >> @@ -64,6 +64,11 @@
>> >>  #define PR_GET_SECCOMP       21
>> >>  #define PR_SET_SECCOMP       22
>> >>
>> >> +/* Get/set process seccomp filters */
>> >> +#define PR_GET_SECCOMP_FILTER        35
>> >> +#define PR_SET_SECCOMP_FILTER        36
>> >> +#define PR_CLEAR_SECCOMP_FILTER      37
>> >> +
>> >>  /* Get/set the capability bounding set (as per security/commoncap.c) */
>> >>  #define PR_CAPBSET_READ 23
>> >>  #define PR_CAPBSET_DROP 24
>> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> >> index 18d63ce..3f0bc8d 100644
>> >> --- a/include/linux/sched.h
>> >> +++ b/include/linux/sched.h
>> >> @@ -1374,7 +1374,7 @@ struct task_struct {
>> >>       uid_t loginuid;
>> >>       unsigned int sessionid;
>> >>  #endif
>> >> -     seccomp_t seccomp;
>> >> +     struct seccomp_struct seccomp;
>> >>
>> >>  /* Thread group tracking */
>> >>       u32 parent_exec_id;
>> >> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> >> index 167c333..f4434ca 100644
>> >> --- a/include/linux/seccomp.h
>> >> +++ b/include/linux/seccomp.h
>> >> @@ -1,13 +1,33 @@
>> >>  #ifndef _LINUX_SECCOMP_H
>> >>  #define _LINUX_SECCOMP_H
>> >>
>> >> +struct seq_file;
>> >>
>> >>  #ifdef CONFIG_SECCOMP
>> >>
>> >> +#include <linux/errno.h>
>> >>  #include <linux/thread_info.h>
>> >> +#include <linux/types.h>
>> >>  #include <asm/seccomp.h>
>> >>
>> >> -typedef struct { int mode; } seccomp_t;
>> >> +struct seccomp_filters;
>> >> +/**
>> >> + * struct seccomp_struct - the state of a seccomp'ed process
>> >> + *
>> >> + * @mode:
>> >> + *     if this is 1, the process is under standard seccomp rules
>> >> + *             is 2, the process is only allowed to make system calls where
>> >> + *                   associated filters evaluate successfully.
>> >> + * @filters: Metadata for filters if using CONFIG_SECCOMP_FILTER.
>> >> + *           filters assignment/use should be RCU-protected and its contents
>> >> + *           should never be modified when attached to a seccomp_struct.
>> >> + */
>> >> +struct seccomp_struct {
>> >> +     uint16_t mode;
>> >> +#ifdef CONFIG_SECCOMP_FILTER
>> >> +     struct seccomp_filters *filters;
>> >> +#endif
>> >> +};
>> >>
>> >>  extern void __secure_computing(int);
>> >>  static inline void secure_computing(int this_syscall)
>> >> @@ -16,15 +36,14 @@ static inline void secure_computing(int this_syscall)
>> >>               __secure_computing(this_syscall);
>> >>  }
>> >>
>> >> -extern long prctl_get_seccomp(void);
>> >>  extern long prctl_set_seccomp(unsigned long);
>> >> +extern long prctl_get_seccomp(void);
>> >>
>> >>  #else /* CONFIG_SECCOMP */
>> >>
>> >>  #include <linux/errno.h>
>> >>
>> >> -typedef struct { } seccomp_t;
>> >> -
>> >> +struct seccomp_struct { };
>> >>  #define secure_computing(x) do { } while (0)
>> >>
>> >>  static inline long prctl_get_seccomp(void)
>> >> @@ -32,11 +51,80 @@ static inline long prctl_get_seccomp(void)
>> >>       return -EINVAL;
>> >>  }
>> >>
>> >> -static inline long prctl_set_seccomp(unsigned long arg2)
>> >> +static inline long prctl_set_seccomp(unsigned long a2)
>> >>  {
>> >>       return -EINVAL;
>> >>  }
>> >>
>> >>  #endif /* CONFIG_SECCOMP */
>> >>
>> >> +#ifdef CONFIG_SECCOMP_FILTER
>> >> +
>> >> +#define inherit_tsk_seccomp(_child, _orig) do { \
>> >> +     _child->seccomp.mode = _orig->seccomp.mode; \
>> >> +     _child->seccomp.filters = get_seccomp_filters(_orig->seccomp.filters); \
>> >> +     } while (0)
>> >> +#define put_tsk_seccomp(_tsk) put_seccomp_filters(_tsk->seccomp.filters)
>> >> +
>> >> +extern int seccomp_show_filters(struct seccomp_filters *filters,
>> >> +                             struct seq_file *);
>> >> +extern long seccomp_set_filter(int, char *);
>> >> +extern long seccomp_clear_filter(int);
>> >> +extern long seccomp_get_filter(int, char *, unsigned long);
>> >> +
>> >> +extern long prctl_set_seccomp_filter(unsigned long, char __user *);
>> >> +extern long prctl_get_seccomp_filter(unsigned long, char __user *,
>> >> +                                  unsigned long);
>> >> +extern long prctl_clear_seccomp_filter(unsigned long);
>> >> +
>> >> +extern struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *);
>> >> +extern void put_seccomp_filters(struct seccomp_filters *);
>> >> +
>> >> +extern int seccomp_test_filters(int);
>> >> +extern void seccomp_filter_log_failure(int);
>> >> +
>> >> +#else  /* CONFIG_SECCOMP_FILTER */
>> >> +
>> >> +struct seccomp_filters { };
>> >> +#define inherit_tsk_seccomp(_child, _orig) do { } while (0)
>> >> +#define put_tsk_seccomp(_tsk) do { } while (0)
>> >> +
>> >> +static inline int seccomp_show_filters(struct seccomp_filters *filters,
>> >> +                                    struct seq_file *m)
>> >> +{
>> >> +     return -ENOSYS;
>> >> +}
>> >> +
>> >> +static inline long seccomp_set_filter(int syscall_nr, char *filter)
>> >> +{
>> >> +     return -ENOSYS;
>> >> +}
>> >> +
>> >> +static inline long seccomp_clear_filter(int syscall_nr)
>> >> +{
>> >> +     return -ENOSYS;
>> >> +}
>> >> +
>> >> +static inline long seccomp_get_filter(int syscall_nr,
>> >> +                                   char *buf, unsigned long available)
>> >> +{
>> >> +     return -ENOSYS;
>> >> +}
>> >> +
>> >> +static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3)
>> >> +{
>> >> +     return -ENOSYS;
>> >> +}
>> >> +
>> >> +static inline long prctl_clear_seccomp_filter(unsigned long a2)
>> >> +{
>> >> +     return -ENOSYS;
>> >> +}
>> >> +
>> >> +static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3,
>> >> +                                         unsigned long a4)
>> >> +{
>> >> +     return -ENOSYS;
>> >> +}
>> >> +#endif  /* CONFIG_SECCOMP_FILTER */
>> >>  #endif /* _LINUX_SECCOMP_H */
>> >> diff --git a/include/trace/syscall.h b/include/trace/syscall.h
>> >> index 242ae04..e061ad0 100644
>> >> --- a/include/trace/syscall.h
>> >> +++ b/include/trace/syscall.h
>> >> @@ -35,6 +35,8 @@ struct syscall_metadata {
>> >>  extern unsigned long arch_syscall_addr(int nr);
>> >>  extern int init_syscall_trace(struct ftrace_event_call *call);
>> >>
>> >> +extern struct syscall_metadata *syscall_nr_to_meta(int);
>> >> +
>> >>  extern int reg_event_syscall_enter(struct ftrace_event_call *call);
>> >>  extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
>> >>  extern int reg_event_syscall_exit(struct ftrace_event_call *call);
>> >> @@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
>> >>                                     struct trace_event *event);
>> >>  enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
>> >>                                    struct trace_event *event);
>> >> +#else
>> >> +static inline struct syscall_metadata *syscall_nr_to_meta(int nr)
>> >> +{
>> >> +     return NULL;
>> >> +}
>> >>  #endif
>> >>
>> >>  #ifdef CONFIG_PERF_EVENTS
>> >> diff --git a/kernel/Makefile b/kernel/Makefile
>> >> index 85cbfb3..84e7dfb 100644
>> >> --- a/kernel/Makefile
>> >> +++ b/kernel/Makefile
>> >> @@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
>> >>  obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
>> >>  obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
>> >>  obj-$(CONFIG_SECCOMP) += seccomp.o
>> >> +ifeq ($(CONFIG_SECCOMP_FILTER),y)
>> >> +obj-$(CONFIG_SECCOMP) += seccomp_filter.o
>> >> +endif
>> >>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>> >>  obj-$(CONFIG_TREE_RCU) += rcutree.o
>> >>  obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
>> >> diff --git a/kernel/fork.c b/kernel/fork.c
>> >> index e7548de..6f835e0 100644
>> >> --- a/kernel/fork.c
>> >> +++ b/kernel/fork.c
>> >> @@ -34,6 +34,7 @@
>> >>  #include <linux/cgroup.h>
>> >>  #include <linux/security.h>
>> >>  #include <linux/hugetlb.h>
>> >> +#include <linux/seccomp.h>
>> >>  #include <linux/swap.h>
>> >>  #include <linux/syscalls.h>
>> >>  #include <linux/jiffies.h>
>> >> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
>> >>       free_thread_info(tsk->stack);
>> >>       rt_mutex_debug_task_free(tsk);
>> >>       ftrace_graph_exit_task(tsk);
>> >> +     put_tsk_seccomp(tsk);
>> >>       free_task_struct(tsk);
>> >>  }
>> >>  EXPORT_SYMBOL(free_task);
>> >> @@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
>> >>       if (err)
>> >>               goto out;
>> >>
>> >> +     inherit_tsk_seccomp(tsk, orig);
>> >>       setup_thread_stack(tsk, orig);
>> >>       clear_user_return_notifier(tsk);
>> >>       clear_tsk_need_resched(tsk);
>> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> >> index 57d4b13..0a942be 100644
>> >> --- a/kernel/seccomp.c
>> >> +++ b/kernel/seccomp.c
>> >> @@ -2,16 +2,20 @@
>> >>   * linux/kernel/seccomp.c
>> >>   *
>> >>   * Copyright 2004-2005  Andrea Arcangeli <andrea@cpushare.com>
>> >> + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
>> >>   *
>> >>   * This defines a simple but solid secure-computing mode.
>> >>   */
>> >>
>> >>  #include <linux/seccomp.h>
>> >>  #include <linux/sched.h>
>> >> +#include <linux/slab.h>
>> >>  #include <linux/compat.h>
>> >> +#include <linux/unistd.h>
>> >> +#include <linux/ftrace_event.h>
>> >>
>> >> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
>> >>  /* #define SECCOMP_DEBUG 1 */
>> >> -#define NR_SECCOMP_MODES 1
>> >>
>> >>  /*
>> >>   * Secure computing mode 1 allows only read/write/exit/sigreturn.
>> >> @@ -32,10 +36,9 @@ static int mode1_syscalls_32[] = {
>> >>
>> >>  void __secure_computing(int this_syscall)
>> >>  {
>> >> -     int mode = current->seccomp.mode;
>> >>       int * syscall;
>> >>
>> >> -     switch (mode) {
>> >> +     switch (current->seccomp.mode) {
>> >>       case 1:
>> >>               syscall = mode1_syscalls;
>> >>  #ifdef CONFIG_COMPAT
>> >> @@ -47,6 +50,17 @@ void __secure_computing(int this_syscall)
>> >>                               return;
>> >>               } while (*++syscall);
>> >>               break;
>> >> +#ifdef CONFIG_SECCOMP_FILTER
>> >> +     case 2:
>> >> +             if (this_syscall >= NR_syscalls || this_syscall < 0)
>> >> +                     break;
>> >> +
>> >> +             if (!seccomp_test_filters(this_syscall))
>> >> +                     return;
>> >> +
>> >> +             seccomp_filter_log_failure(this_syscall);
>> >> +             break;
>> >> +#endif
>> >>       default:
>> >>               BUG();
>> >>       }
>> >> @@ -71,16 +85,22 @@ long prctl_set_seccomp(unsigned long seccomp_mode)
>> >>       if (unlikely(current->seccomp.mode))
>> >>               goto out;
>> >>
>> >> -     ret = -EINVAL;
>> >> -     if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
>> >> -             current->seccomp.mode = seccomp_mode;
>> >> -             set_thread_flag(TIF_SECCOMP);
>> >> +     ret = 0;
>> >> +     switch (seccomp_mode) {
>> >> +     case 1:
>> >>  #ifdef TIF_NOTSC
>> >>               disable_TSC();
>> >>  #endif
>> >> -             ret = 0;
>> >> +#ifdef CONFIG_SECCOMP_FILTER
>> >> +     case 2:
>> >> +#endif
>> >> +             current->seccomp.mode = seccomp_mode;
>> >> +             set_thread_flag(TIF_SECCOMP);
>> >> +             break;
>> >> +     default:
>> >> +             ret = -EINVAL;
>> >>       }
>> >>
>> >> - out:
>> >> +out:
>> >>       return ret;
>> >>  }
>> >> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
>> >> new file mode 100644
>> >> index 0000000..9782f25
>> >> --- /dev/null
>> >> +++ b/kernel/seccomp_filter.c
>> >> @@ -0,0 +1,784 @@
>> >> +/* filter engine-based seccomp system call filtering
>> >> + *
>> >> + * This program is free software; you can redistribute it and/or modify
>> >> + * it under the terms of the GNU General Public License as published by
>> >> + * the Free Software Foundation; either version 2 of the License, or
>> >> + * (at your option) any later version.
>> >> + *
>> >> + * This program is distributed in the hope that it will be useful,
>> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> >> + * GNU General Public License for more details.
>> >> + *
>> >> + * You should have received a copy of the GNU General Public License
>> >> + * along with this program; if not, write to the Free Software
>> >> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>> >> + *
>> >> + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
>> >> + */
>> >> +
>> >> +#include <linux/compat.h>
>> >> +#include <linux/err.h>
>> >> +#include <linux/errno.h>
>> >> +#include <linux/ftrace_event.h>
>> >> +#include <linux/seccomp.h>
>> >> +#include <linux/seq_file.h>
>> >> +#include <linux/sched.h>
>> >> +#include <linux/slab.h>
>> >> +#include <linux/uaccess.h>
>> >> +
>> >> +#include <asm/syscall.h>
>> >> +#include <trace/syscall.h>
>> >> +
>> >> +
>> >> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
>> >> +
>> >> +#define SECCOMP_FILTER_ALLOW "1"
>> >> +#define SECCOMP_ACTION_DENY 0xffff
>> >> +#define SECCOMP_ACTION_ALLOW 0xfffe
>> >> +
>> >> +/**
>> >> + * struct seccomp_filters - container for seccomp filterset
>> >> + *
>> >> + * @syscalls: array of 16-bit indices into @event_filters by syscall_nr
>> >> + *            May also be SECCOMP_ACTION_DENY or SECCOMP_ACTION_ALLOW
>> >> + * @event_filters: array of pointers to ftrace event objects
>> >> + * @count: size of @event_filters
>> >> + * @flags: anonymous struct to wrap filters-specific flags
>> >> + * @usage: reference count to simplify use.
>> >> + */
>> >> +struct seccomp_filters {
>> >> +     uint16_t syscalls[NR_syscalls];
>> >> +     struct event_filter **event_filters;
>> >> +     uint16_t count;
>> >> +     struct {
>> >> +             uint32_t compat:1,
>> >> +                      __reserved:31;
>> >> +     } flags;
>> >> +     atomic_t usage;
>> >> +};
>> >> +
>> >> +/* Handle ftrace symbol non-existence */
>> >> +#ifdef CONFIG_FTRACE_SYSCALLS
>> >> +#define create_event_filter(_ef_pptr, _event_type, _str) \
>> >> +     ftrace_parse_filter(_ef_pptr, _event_type, _str)
>> >> +#define get_filter_string(_ef) ftrace_get_filter_string(_ef)
>> >> +#define free_event_filter(_f) ftrace_free_filter(_f)
>> >> +
>> >> +#else
>> >> +
>> >> +#define create_event_filter(_ef_pptr, _event_type, _str) (-ENOSYS)
>> >> +#define get_filter_string(_ef) (NULL)
>> >> +#define free_event_filter(_f) do { } while (0)
>> >> +#endif
>> >> +
>> >> +/**
>> >> + * seccomp_filters_new - allocates a new filters object
>> >> + * @count: count to allocate for the event_filters array
>> >> + *
>> >> + * Returns ERR_PTR on error or an allocated object.
>> >> + */
>> >> +static struct seccomp_filters *seccomp_filters_new(uint16_t count)
>> >> +{
>> >> +     struct seccomp_filters *f;
>> >> +
>> >> +     if (count >= SECCOMP_ACTION_ALLOW)
>> >> +             return ERR_PTR(-EINVAL);
>> >> +
>> >> +     f = kzalloc(sizeof(struct seccomp_filters), GFP_KERNEL);
>> >> +     if (!f)
>> >> +             return ERR_PTR(-ENOMEM);
>> >> +
>> >> +     /* Lazy SECCOMP_ACTION_DENY assignment. */
>> >> +     memset(f->syscalls, 0xff, sizeof(f->syscalls));
>> >> +     atomic_set(&f->usage, 1);
>> >> +
>> >> +     f->event_filters = NULL;
>> >> +     f->count = count;
>> >> +     if (!count)
>> >> +             return f;
>> >> +
>> >> +     f->event_filters = kzalloc(count * sizeof(struct event_filter *),
>> >> +                                GFP_KERNEL);
>> >> +     if (!f->event_filters) {
>> >> +             kfree(f);
>> >> +             f = ERR_PTR(-ENOMEM);
>> >> +     }
>> >> +     return f;
>> >> +}
>> >> +
>> >> +/**
>> >> + * seccomp_filters_free - cleans up the filter list and frees the table
>> >> + * @filters: NULL or live object to be completely destructed.
>> >> + */
>> >> +static void seccomp_filters_free(struct seccomp_filters *filters)
>> >> +{
>> >> +     uint16_t count = 0;
>> >> +     if (!filters)
>> >> +             return;
>> >> +     while (count < filters->count) {
>> >> +             struct event_filter *f = filters->event_filters[count];
>> >> +             free_event_filter(f);
>> >> +             count++;
>> >> +     }
>> >> +     kfree(filters->event_filters);
>> >> +     kfree(filters);
>> >> +}
>> >> +
>> >> +static void __put_seccomp_filters(struct seccomp_filters *orig)
>> >> +{
>> >> +     WARN_ON(atomic_read(&orig->usage));
>> >> +     seccomp_filters_free(orig);
>> >> +}
>> >> +
>> >> +#define seccomp_filter_allow(_id) ((_id) == SECCOMP_ACTION_ALLOW)
>> >> +#define seccomp_filter_deny(_id) ((_id) == SECCOMP_ACTION_DENY)
>> >> +#define seccomp_filter_dynamic(_id) \
>> >> +     (!seccomp_filter_allow(_id) && !seccomp_filter_deny(_id))
>> >> +static inline uint16_t seccomp_filter_id(const struct seccomp_filters *f,
>> >> +                                      int syscall_nr)
>> >> +{
>> >> +     if (!f)
>> >> +             return SECCOMP_ACTION_DENY;
>> >> +     return f->syscalls[syscall_nr];
>> >> +}
>> >> +
>> >> +static inline struct event_filter *seccomp_dynamic_filter(
>> >> +             const struct seccomp_filters *filters, uint16_t id)
>> >> +{
>> >> +     if (!seccomp_filter_dynamic(id))
>> >> +             return NULL;
>> >> +     return filters->event_filters[id];
>> >> +}
>> >> +
>> >> +static inline void set_seccomp_filter_id(struct seccomp_filters *filters,
>> >> +                                      int syscall_nr, uint16_t id)
>> >> +{
>> >> +     filters->syscalls[syscall_nr] = id;
>> >> +}
>> >> +
>> >> +static inline void set_seccomp_filter(struct seccomp_filters *filters,
>> >> +                                   int syscall_nr, uint16_t id,
>> >> +                                   struct event_filter *dynamic_filter)
>> >> +{
>> >> +     filters->syscalls[syscall_nr] = id;
>> >> +     if (seccomp_filter_dynamic(id))
>> >> +             filters->event_filters[id] = dynamic_filter;
>> >> +}
>> >> +
>> >> +static struct event_filter *alloc_event_filter(int syscall_nr,
>> >> +                                            const char *filter_string)
>> >> +{
>> >> +     struct syscall_metadata *data;
>> >> +     struct event_filter *filter = NULL;
>> >> +     int err;
>> >> +
>> >> +     data = syscall_nr_to_meta(syscall_nr);
>> >> +     /* Argument-based filtering only works on ftrace-hooked syscalls. */
>> >> +     err = -ENOSYS;
>> >> +     if (!data)
>> >> +             goto fail;
>> >> +     err = create_event_filter(&filter,
>> >> +                               data->enter_event->event.type,
>> >> +                               filter_string);
>> >> +     if (err)
>> >> +             goto fail;
>> >> +
>> >> +     return filter;
>> >> +fail:
>> >> +     kfree(filter);
>> >> +     return ERR_PTR(err);
>> >> +}
>> >> +
>> >> +/**
>> >> + * seccomp_filters_copy - copies filters from src to dst.
>> >> + *
>> >> + * @dst: seccomp_filters to populate.
>> >> + * @src: table to read from.
>> >> + * @skip: specifies an entry, by system call, to skip.
>> >> + *
>> >> + * Returns non-zero on failure.
>> >> + * Both the source and the destination should have no simultaneous
>> >> + * writers, and dst should be exclusive to the caller.
>> >> + * If @skip is < 0, it is ignored.
>> >> + */
>> >> +static int seccomp_filters_copy(struct seccomp_filters *dst,
>> >> +                             const struct seccomp_filters *src,
>> >> +                             int skip)
>> >> +{
>> >> +     int id = 0, ret = 0, nr;
>> >> +     memcpy(&dst->flags, &src->flags, sizeof(src->flags));
>> >> +     memcpy(dst->syscalls, src->syscalls, sizeof(dst->syscalls));
>> >> +     if (!src->count)
>> >> +             goto done;
>> >> +     for (nr = 0; nr < NR_syscalls; ++nr) {
>> >> +             struct event_filter *filter;
>> >> +             const char *str;
>> >> +             uint16_t src_id = seccomp_filter_id(src, nr);
>> >> +             if (nr == skip) {
>> >> +                     set_seccomp_filter(dst, nr, SECCOMP_ACTION_DENY,
>> >> +                                        NULL);
>> >> +                     continue;
>> >> +             }
>> >> +             if (!seccomp_filter_dynamic(src_id))
>> >> +                     continue;
>> >> +             if (id >= dst->count) {
>> >> +                     ret = -EINVAL;
>> >> +                     goto done;
>> >> +             }
>> >> +             str = get_filter_string(seccomp_dynamic_filter(src, src_id));
>> >> +             filter = alloc_event_filter(nr, str);
>> >> +             if (IS_ERR(filter)) {
>> >> +                     ret = PTR_ERR(filter);
>> >> +                     goto done;
>> >> +             }
>> >> +             set_seccomp_filter(dst, nr, id, filter);
>> >> +             id++;
>> >> +     }
>> >> +
>> >> +done:
>> >> +     return ret;
>> >> +}
>> >> +
>> >> +/**
>> >> + * seccomp_extend_filter - appends more text to a syscall_nr's filter
>> >> + * @filters: unattached filter object to operate on
>> >> + * @syscall_nr: syscall number to update filters for
>> >> + * @filter_string: string to append to the existing filter
>> >> + *
>> >> + * The new string will be &&'d to the original filter string to ensure that it
>> >> + * always matches the existing predicates or less:
>> >> + *   (old_filter) && @filter_string
>> >> + * The filter is updated in place; 0 is returned on success and a negative
>> >> + * errno on failure.
>> >> + */
>> >> +static int seccomp_extend_filter(struct seccomp_filters *filters,
>> >> +                              int syscall_nr, char *filter_string)
>> >> +{
>> >> +     struct event_filter *filter;
>> >> +     uint16_t id = seccomp_filter_id(filters, syscall_nr);
>> >> +     char *merged = NULL;
>> >> +     int ret = -EINVAL, expected;
>> >> +
>> >> +     /* No extending with a "1". */
>> >> +     if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string))
>> >> +             goto out;
>> >> +
>> >> +     filter = seccomp_dynamic_filter(filters, id);
>> >> +     ret = -ENOENT;
>> >> +     if (!filter)
>> >> +             goto out;
>> >> +
>> >> +     merged = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
>> >> +     ret = -ENOMEM;
>> >> +     if (!merged)
>> >> +             goto out;
>> >> +
>> >> +     expected = snprintf(merged, SECCOMP_MAX_FILTER_LENGTH, "(%s) && %s",
>> >> +                         get_filter_string(filter), filter_string);
>> >> +     ret = -E2BIG;
>> >> +     if (expected >= SECCOMP_MAX_FILTER_LENGTH || expected < 0)
>> >> +             goto out;
>> >> +
>> >> +     /* Free the old filter */
>> >> +     free_event_filter(filter);
>> >> +     set_seccomp_filter(filters, syscall_nr, id, NULL);
>> >> +
>> >> +     /* Replace it */
>> >> +     filter = alloc_event_filter(syscall_nr, merged);
>> >> +     if (IS_ERR(filter)) {
>> >> +             ret = PTR_ERR(filter);
>> >> +             goto out;
>> >> +     }
>> >> +     set_seccomp_filter(filters, syscall_nr, id, filter);
>> >> +     ret = 0;
>> >> +
>> >> +out:
>> >> +     kfree(merged);
>> >> +     return ret;
>> >> +}
>> >> +
>> >> +/**
>> >> + * seccomp_add_filter - adds a filter for an unfiltered syscall
>> >> + * @filters: filters object to add a filter/action to
>> >> + * @syscall_nr: system call number to add a filter for
>> >> + * @filter_string: the filter string to apply
>> >> + *
>> >> + * Returns 0 on success and non-zero otherwise.
>> >> + */
>> >> +static int seccomp_add_filter(struct seccomp_filters *filters, int syscall_nr,
>> >> +                           char *filter_string)
>> >> +{
>> >> +     struct event_filter *filter;
>> >> +     int ret = 0;
>> >> +
>> >> +     if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string)) {
>> >> +             set_seccomp_filter(filters, syscall_nr,
>> >> +                                SECCOMP_ACTION_ALLOW, NULL);
>> >> +             goto out;
>> >> +     }
>> >> +
>> >> +     filter = alloc_event_filter(syscall_nr, filter_string);
>> >> +     if (IS_ERR(filter)) {
>> >> +             ret = PTR_ERR(filter);
>> >> +             goto out;
>> >> +     }
>> >> +     /* Always add to the last slot available since additions are
>> >> +      * only done one at a time.
>> >> +      */
>> >> +     set_seccomp_filter(filters, syscall_nr, filters->count - 1, filter);
>> >> +out:
>> >> +     return ret;
>> >> +}
>> >> +
>> >> +/* Wrap optional ftrace syscall support. Returns 1 on match or 0 otherwise. */
>> >> +static int filter_match_current(struct event_filter *event_filter)
>> >> +{
>> >> +     int err = 0;
>> >> +#ifdef CONFIG_FTRACE_SYSCALLS
>> >> +     uint8_t syscall_state[64];
>> >> +
>> >> +     memset(syscall_state, 0, sizeof(syscall_state));
>> >> +
>> >> +     /* The generic tracing entry can remain zeroed. */
>> >> +     err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
>> >> +                                      NULL);
>> >> +     if (err)
>> >> +             return 0;
>> >> +
>> >> +     err = filter_match_preds(event_filter, syscall_state);
>> >> +#endif
>> >> +     return err;
>> >> +}
>> >> +
>> >> +static const char *syscall_nr_to_name(int syscall)
>> >> +{
>> >> +     const char *syscall_name = "unknown";
>> >> +     struct syscall_metadata *data = syscall_nr_to_meta(syscall);
>> >> +     if (data)
>> >> +             syscall_name = data->name;
>> >> +     return syscall_name;
>> >> +}
>> >> +
>> >> +static void filters_set_compat(struct seccomp_filters *filters)
>> >> +{
>> >> +#ifdef CONFIG_COMPAT
>> >> +     if (is_compat_task())
>> >> +             filters->flags.compat = 1;
>> >> +#endif
>> >> +}
>> >> +
>> >> +static inline int filters_compat_mismatch(struct seccomp_filters *filters)
>> >> +{
>> >> +     int ret = 0;
>> >> +     if (!filters)
>> >> +             return 0;
>> >> +#ifdef CONFIG_COMPAT
>> >> +     if (!!(is_compat_task()) != filters->flags.compat)
>> >> +             ret = 1;
>> >> +#endif
>> >> +     return ret;
>> >> +}
>> >> +
>> >> +static inline int syscall_is_execve(int syscall)
>> >> +{
>> >> +     int nr = __NR_execve;
>> >> +#ifdef CONFIG_COMPAT
>> >> +     if (is_compat_task())
>> >> +             nr = __NR_seccomp_execve_32;
>> >> +#endif
>> >> +     return syscall == nr;
>> >> +}
>> >> +
>> >> +#ifndef KSTK_EIP
>> >> +#define KSTK_EIP(x) 0L
>> >> +#endif
>> >> +
>> >> +void seccomp_filter_log_failure(int syscall)
>> >> +{
>> >> +     pr_info("%s[%d]: system call %d (%s) blocked at 0x%lx\n",
>> >> +             current->comm, task_pid_nr(current), syscall,
>> >> +             syscall_nr_to_name(syscall), KSTK_EIP(current));
>> >> +}
>> >> +
>> >> +/* put_seccomp_state - decrements the reference count of @orig and may free. */
>> >> +void put_seccomp_filters(struct seccomp_filters *orig)
>> >> +{
>> >> +     if (!orig)
>> >> +             return;
>> >> +
>> >> +     if (atomic_dec_and_test(&orig->usage))
>> >> +             __put_seccomp_filters(orig);
>> >> +}
>> >> +
>> >> +/* get_seccomp_state - increments the reference count of @orig */
>> >> +struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *orig)
>> >
>> > Nit: the name does not match the comment.
>>
>> Will fix it here and above. Thanks!
>>
>> >> +{
>> >> +     if (!orig)
>> >> +             return NULL;
>> >> +     atomic_inc(&orig->usage);
>> >> +     return orig;
>> >
>> > This is called in an RCU read-side critical section.  What exactly is
>> > RCU protecting?  I would expect an rcu_dereference() or one of the
>> > RCU list-traversal primitives somewhere, either here or at the caller.
>>
>> Ah, I spaced on rcu_dereference().  The goal was to make the
>> assignment and replacement of the seccomp_filters pointer
>> RCU-protected (in seccomp_state) so there's no concern over it being
>> partially replaced on platforms where pointer assignments are non-atomic
>> - such as via /proc/<pid>/seccomp_filters access or a call via the
>> exported symbols.  Object lifetime is managed by reference counting so
>> that I don't have to worry about extending the RCU read-side critical
>> section by much or deal with pre-allocations.
>>
>> I'll add rcu_dereference() to all the get_seccomp_filters() uses where
>> it makes sense, so that it is called safely.  Just to make sure, does
>> it make sense to continue to rcu protect the specific pointer?
>
> It might.  The usual other option is to use a lock outside of the element
> containing the reference count to protect reference-count manipulation.
> If there is some convenient lock, especially if it is already held where
> needed, then locking is more straightforward.  Otherwise, RCU is usually
> a reasonable option.
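
To make the read side under discussion concrete, here is a rough userspace
analogy in C11, not kernel code.  The names (filters_sketch,
task_filters_sketch, get_unless_zero and friends) are made up for
illustration.  The acquire load stands in for rcu_dereference(), the release
store for rcu_assign_pointer(), and the "increment unless zero" loop shows one
way to avoid reviving an object whose last reference is already gone.  In the
kernel, the RCU read-side critical section (together with deferred freeing) is
what would keep the memory valid between the load and the increment; this
sketch has no such protection and only shows the ordering and the refcount
rule.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct seccomp_filters; only the refcount matters. */
struct filters_sketch {
	atomic_int usage;
};

/* Stand-in for current->seccomp.filters: a pointer that writers may replace. */
static _Atomic(struct filters_sketch *) task_filters_sketch;

/* Take a reference only if the object is still live (usage > 0). */
static bool get_unless_zero(struct filters_sketch *f)
{
	int old = atomic_load(&f->usage);

	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&f->usage, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Reader side: one ordered load of the shared pointer (the role that
 * rcu_dereference() plays in the kernel), then pin it with a reference.
 */
static struct filters_sketch *get_task_filters_sketch(void)
{
	struct filters_sketch *f =
		atomic_load_explicit(&task_filters_sketch, memory_order_acquire);

	if (f && !get_unless_zero(f))
		return NULL;
	return f;
}

/* Writer side: publish a fully initialised object (rcu_assign_pointer() analogue). */
static void publish_task_filters_sketch(struct filters_sketch *f)
{
	assert(f == NULL || atomic_load(&f->usage) > 0);
	atomic_store_explicit(&task_filters_sketch, f, memory_order_release);
}
```

A caller would pair every successful get_task_filters_sketch() with a matching
put once it is done with the object, the same way seccomp_test_filters() pairs
get_seccomp_filters() with put_seccomp_filters().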

I was concerned about the overhead a lock would have at each system
call entry, but I didn't benchmark it to see.  I'll add the
rcu_dereference() right away, then look into whether there's a
cleaner approach.  I was being overly protective about mutating
any data internal to the filters, hence the complete replacement on
any change.  I'll take a step back and see if

>> >> +}
>> >> +
>> >> +/**
>> >> + * seccomp_test_filters - tests 'current' against the given syscall
>> >> + * @state: seccomp_state of current to use.
>> >> + * @syscall: number of the system call to test
>> >> + *
>> >> + * Returns 0 on ok and non-zero on error/failure.
>> >> + */
>> >> +int seccomp_test_filters(int syscall)
>> >> +{
>> >> +     uint16_t id;
>> >> +     struct event_filter *filter;
>> >> +     struct seccomp_filters *filters;
>> >> +     int ret = -EACCES;
>> >> +
>> >> +     rcu_read_lock();
>> >> +     filters = get_seccomp_filters(current->seccomp.filters);
>> >> +     rcu_read_unlock();
>> >> +
>> >> +     if (!filters)
>> >> +             goto out;
>> >> +
>> >> +     if (filters_compat_mismatch(filters)) {
>> >> +             pr_info("%s[%d]: seccomp_filter compat() mismatch.\n",
>> >> +                     current->comm, task_pid_nr(current));
>> >> +             goto out;
>> >> +     }
>> >> +
>> >> +     /* execve is never allowed. */
>> >> +     if (syscall_is_execve(syscall))
>> >> +             goto out;
>> >> +
>> >> +     ret = 0;
>> >> +     id = seccomp_filter_id(filters, syscall);
>> >> +     if (seccomp_filter_allow(id))
>> >> +             goto out;
>> >> +
>> >> +     ret = -EACCES;
>> >> +     if (!seccomp_filter_dynamic(id))
>> >> +             goto out;
>> >> +
>> >> +     filter = seccomp_dynamic_filter(filters, id);
>> >> +     if (filter && filter_match_current(filter))
>> >> +             ret = 0;
>> >> +out:
>> >> +     put_seccomp_filters(filters);
>> >> +     return ret;
>> >> +}
>> >> +
>> >> +/**
>> >> + * seccomp_show_filters - prints the current filter state to a seq_file
>> >> + * @filters: properly get()'d filters object
>> >> + * @m: the prepared seq_file to receive the data
>> >> + *
>> >> + * Returns 0 on a successful write.
>> >> + */
>> >> +int seccomp_show_filters(struct seccomp_filters *filters, struct seq_file *m)
>> >> +{
>> >> +     int syscall;
>> >> +     seq_printf(m, "Mode: %d\n", current->seccomp.mode);
>> >> +     if (!filters)
>> >> +             goto out;
>> >> +
>> >> +     for (syscall = 0; syscall < NR_syscalls; ++syscall) {
>> >> +             uint16_t id = seccomp_filter_id(filters, syscall);
>> >> +             const char *filter_string = SECCOMP_FILTER_ALLOW;
>> >> +             if (seccomp_filter_deny(id))
>> >> +                     continue;
>> >> +             seq_printf(m, "%d (%s): ",
>> >> +                           syscall,
>> >> +                           syscall_nr_to_name(syscall));
>> >> +             if (seccomp_filter_dynamic(id))
>> >> +                     filter_string = get_filter_string(
>> >> +                                       seccomp_dynamic_filter(filters, id));
>> >> +             seq_printf(m, "%s\n", filter_string);
>> >> +     }
>> >> +out:
>> >> +     return 0;
>> >> +}
>> >> +EXPORT_SYMBOL_GPL(seccomp_show_filters);
>> >> +
>> >> +/**
>> >> + * seccomp_get_filter - copies the filter_string into "buf"
>> >> + * @syscall_nr: system call number to look up
>> >> + * @buf: destination buffer
>> >> + * @bufsize: available space in the buffer.
>> >> + *
>> >> + * Context: User context only. This function may sleep on allocation and
>> >> + *          operates on current. current must be attempting a system call
>> >> + *          when this is called.
>> >> + *
>> >> + * Looks up the filter for the given system call number on current.  If found,
>> >> + * the string length of the NUL-terminated buffer is returned and < 0 is
>> >> + * returned on error. The NUL byte is not included in the length.
>> >> + */
>> >> +long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize)
>> >> +{
>> >> +     struct seccomp_filters *filters;
>> >> +     struct event_filter *filter;
>> >> +     long ret = -EINVAL;
>> >> +     uint16_t id;
>> >> +
>> >> +     if (bufsize > SECCOMP_MAX_FILTER_LENGTH)
>> >> +             bufsize = SECCOMP_MAX_FILTER_LENGTH;
>> >> +
>> >> +     rcu_read_lock();
>> >> +     filters = get_seccomp_filters(current->seccomp.filters);
>> >> +     rcu_read_unlock();
>> >> +
>> >> +     if (!filters)
>> >> +             goto out;
>> >> +
>> >> +     ret = -ENOENT;
>> >> +     id = seccomp_filter_id(filters, syscall_nr);
>> >> +     if (seccomp_filter_deny(id))
>> >> +             goto out;
>> >> +
>> >> +     if (seccomp_filter_allow(id)) {
>> >> +             ret = strlcpy(buf, SECCOMP_FILTER_ALLOW, bufsize);
>> >> +             goto copied;
>> >> +     }
>> >> +
>> >> +     filter = seccomp_dynamic_filter(filters, id);
>> >> +     if (!filter)
>> >> +             goto out;
>> >> +     ret = strlcpy(buf, get_filter_string(filter), bufsize);
>> >> +
>> >> +copied:
>> >> +     if (ret >= bufsize) {
>> >> +             ret = -ENOSPC;
>> >> +             goto out;
>> >> +     }
>> >> +     /* Zero out any remaining buffer, just in case. */
>> >> +     memset(buf + ret, 0, bufsize - ret);
>> >> +out:
>> >> +     put_seccomp_filters(filters);
>> >> +     return ret;
>> >> +}
>> >> +EXPORT_SYMBOL_GPL(seccomp_get_filter);
>> >> +
>> >> +/**
>> >> + * seccomp_clear_filter: clears the seccomp filter for a syscall.
>> >> + * @syscall_nr: the system call number to clear filters for.
>> >> + *
>> >> + * Context: User context only. This function may sleep on allocation and
>> >> + *          operates on current. current must be attempting a system call
>> >> + *          when this is called.
>> >> + *
>> >> + * Returns 0 on success.
>> >> + */
>> >> +long seccomp_clear_filter(int syscall_nr)
>> >> +{
>> >> +     struct seccomp_filters *filters = NULL, *orig_filters;
>> >> +     uint16_t id;
>> >> +     int ret = -EINVAL;
>> >> +
>> >> +     rcu_read_lock();
>> >> +     orig_filters = get_seccomp_filters(current->seccomp.filters);
>> >> +     rcu_read_unlock();
>> >> +
>> >> +     if (!orig_filters)
>> >> +             goto out;
>> >> +
>> >> +     if (filters_compat_mismatch(orig_filters))
>> >> +             goto out;
>> >> +
>> >> +     id = seccomp_filter_id(orig_filters, syscall_nr);
>> >> +     if (seccomp_filter_deny(id))
>> >> +             goto out;
>> >> +
>> >> +     /* Create a new filters object for the task */
>> >> +     if (seccomp_filter_dynamic(id))
>> >> +             filters = seccomp_filters_new(orig_filters->count - 1);
>> >> +     else
>> >> +             filters = seccomp_filters_new(orig_filters->count);
>> >> +
>> >> +     if (IS_ERR(filters)) {
>> >> +             ret = PTR_ERR(filters);
>> >> +             goto out;
>> >> +     }
>> >> +
>> >> +     /* Copy, but drop the requested entry. */
>> >> +     ret = seccomp_filters_copy(filters, orig_filters, syscall_nr);
>> >> +     if (ret)
>> >> +             goto out;
>> >> +     get_seccomp_filters(filters);  /* simplify the out: path */
>> >> +
>> >> +     rcu_assign_pointer(current->seccomp.filters, filters);
>> >
>> > What prevents two copies of seccomp_clear_filter() from running
>> > concurrently?
>>
>> Nothing - the last one wins assignment, but the objects themselves
>> should be internally consistent to the parallel calls.  If that's a
>> concern, a per-task writer mutex could be used just to ensure
>> simultaneous calls to clear and set are performed serially.  Would
>> that make more sense?
>
> Here is the sequence of events that I am concerned about:
>
> o       CPU 0 sets orig_filters to point to the current filters.
>
> o       CPU 1 sets its local orig_filters to point to the current
>        set of filters.
>
> o       Both CPUs allocate new filters and use rcu_assign_pointer()
>        to do the update.  As you say, the last one wins, but it appears
>        to me that the first one leaks memory.
>
> o       Both CPUs free the object referenced by their orig_filters,
>        which might or might not result in a double free, depending
>        on exactly what happens below.  (You might actually be OK,
>        I didn't check -- leaking memory was enough for me to call
>        attention to this.)
>
> So yes, please use some kind of mutual exclusion.  Not sure what you
> mean by "per-task mutex", but whatever it is must prevent two different
> tasks from acting on the same set of filters at the same time.  The
> thing that I call "per-task mutex" would -not- do that.

Ah nice.  Yeah that would most definitely leak as there would be a
remaining increment for the task that isn't dropped.

Since those interfaces acquire the filter itself from
current->seccomp.filters, I was thinking a mutex in current, e.g.,
current->seccomp.write_mutex, would fit the bill to ensure that
pointer isn't accessed for replacement.  I'll look at this and the rcu
usage to see if I'm taking the most logical approach.  I pretty much
always get locking wrong in some way and perhaps I can simplify
further and get the correct guarantees.  I really appreciate the keen
observation and explanation!
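
For illustration only, the serialized copy-and-replace update discussed above might look like the following userspace sketch.  Everything in it is an assumption made for the example: struct task_state, write_mutex, filters_new(), filters_put() and live_objects are hypothetical stand-ins, a pthread mutex replaces whatever kernel lock would be chosen, and the RCU read side is left out entirely.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct seccomp_filters. */
struct filters {
	int refs;	/* reference count; only touched under write_mutex here */
	int count;	/* number of filter entries */
};

/* Hypothetical stand-in for the per-task seccomp state. */
struct task_state {
	pthread_mutex_t write_mutex;	/* serializes every replacement of ->filters */
	struct filters *filters;	/* holds one reference on behalf of the task */
};

static int live_objects;	/* leak detector for this sketch only */

static struct filters *filters_new(int count)
{
	struct filters *f = malloc(sizeof(*f));

	if (!f)
		return NULL;
	f->refs = 1;		/* the caller owns the initial reference */
	f->count = count;
	live_objects++;
	return f;
}

static void filters_put(struct filters *f)
{
	if (f && --f->refs == 0) {
		free(f);
		live_objects--;
	}
}

/* Replace the task's filters with a new object that has one more entry. */
static int task_add_filter(struct task_state *t)
{
	struct filters *orig, *repl;
	int ret = 0;

	pthread_mutex_lock(&t->write_mutex);
	orig = t->filters;	/* read and replace under the same lock */
	repl = filters_new(orig ? orig->count + 1 : 1);
	if (!repl) {
		ret = -1;
		goto unlock;
	}
	t->filters = repl;	/* the initial reference now belongs to the task */
	filters_put(orig);	/* drop the task's reference on the old object */
unlock:
	pthread_mutex_unlock(&t->write_mutex);
	return ret;
}
```

Because the pointer is read and replaced inside one critical section, two callers can no longer both start from the same orig pointer, which is the sequence that leaked a reference in the scenario above.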

>> >> +     synchronize_rcu();
>> >> +     put_seccomp_filters(orig_filters);  /* for the task */
>> >> +out:
>> >> +     put_seccomp_filters(orig_filters);  /* for the get */
>> >> +     put_seccomp_filters(filters);  /* for the extra get */
>> >> +     return ret;
>> >> +}
>> >> +EXPORT_SYMBOL_GPL(seccomp_clear_filter);
>> >> +
>> >> +/**
>> >> + * seccomp_set_filter: - Adds/extends a seccomp filter for a syscall.
>> >> + * @syscall_nr: system call number to apply the filter to.
>> >> + * @filter: ftrace filter string to apply.
>> >> + *
>> >> + * Context: User context only. This function may sleep on allocation and
>> >> + *          operates on current. current must be attempting a system call
>> >> + *          when this is called.
>> >> + *
>> >> + * New filters may be added for system calls when the current task is
>> >> + * not in a secure computing mode (seccomp).  Otherwise, existing filters may
>> >> + * be extended.
>> >> + *
>> >> + * Returns 0 on success or an errno on failure.
>> >> + */
>> >> +long seccomp_set_filter(int syscall_nr, char *filter)
>> >> +{
>> >> +     struct seccomp_filters *filters = NULL, *orig_filters = NULL;
>> >> +     uint16_t id;
>> >> +     long ret = -EINVAL;
>> >> +     uint16_t filters_needed;
>> >> +
>> >> +     if (!filter)
>> >> +             goto out;
>> >> +
>> >> +     filter = strstrip(filter);
>> >> +     /* Disallow empty strings. */
>> >> +     if (filter[0] == 0)
>> >> +             goto out;
>> >> +
>> >> +     rcu_read_lock();
>> >> +     orig_filters = get_seccomp_filters(current->seccomp.filters);
>> >> +     rcu_read_unlock();
>> >> +
>> >> +     /* After the first call, compatibility mode is selected permanently. */
>> >> +     ret = -EACCES;
>> >> +     if (filters_compat_mismatch(orig_filters))
>> >> +             goto out;
>> >> +
>> >> +     filters_needed = orig_filters ? orig_filters->count : 0;
>> >> +     id = seccomp_filter_id(orig_filters, syscall_nr);
>> >> +     if (seccomp_filter_deny(id)) {
>> >> +             /* Don't allow DENYs to be changed when in a seccomp mode */
>> >> +             ret = -EACCES;
>> >> +             if (current->seccomp.mode)
>> >> +                     goto out;
>> >> +             filters_needed++;
>> >> +     }
>> >> +
>> >> +     filters = seccomp_filters_new(filters_needed);
>> >> +     if (IS_ERR(filters)) {
>> >> +             ret = PTR_ERR(filters);
>> >> +             goto out;
>> >> +     }
>> >> +
>> >> +     filters_set_compat(filters);
>> >> +     if (orig_filters) {
>> >> +             ret = seccomp_filters_copy(filters, orig_filters, -1);
>> >> +             if (ret)
>> >> +                     goto out;
>> >> +     }
>> >> +
>> >> +     if (seccomp_filter_deny(id))
>> >> +             ret = seccomp_add_filter(filters, syscall_nr, filter);
>> >> +     else
>> >> +             ret = seccomp_extend_filter(filters, syscall_nr, filter);
>> >> +     if (ret)
>> >> +             goto out;
>> >> +     get_seccomp_filters(filters);  /* simplify the error paths */
>> >> +
>> >> +     rcu_assign_pointer(current->seccomp.filters, filters);
>> >
>> > Again, what prevents two copies of seccomp_set_filter() from running
>> > concurrently?
>>
>> Same deal - nothing, but I'd be happy to add a guard if it makes sense.
>>
>> Thanks!
>>
>> >> +     synchronize_rcu();
>> >> +     put_seccomp_filters(orig_filters);  /* for the task */
>> >> +out:
>> >> +     put_seccomp_filters(orig_filters);  /* for the get */
>> >> +     put_seccomp_filters(filters);  /* for get or task, on err */
>> >> +     return ret;
>> >> +}
>> >> +EXPORT_SYMBOL_GPL(seccomp_set_filter);
>> >> +
>> >> +long prctl_set_seccomp_filter(unsigned long syscall_nr,
>> >> +                           char __user *user_filter)
>> >> +{
>> >> +     int nr;
>> >> +     long ret;
>> >> +     char *filter = NULL;
>> >> +
>> >> +     ret = -EINVAL;
>> >> +     if (syscall_nr >= NR_syscalls)
>> >> +             goto out;
>> >> +
>> >> +     ret = -EFAULT;
>> >> +     if (!user_filter)
>> >> +             goto out;
>> >> +
>> >> +     filter = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
>> >> +     ret = -ENOMEM;
>> >> +     if (!filter)
>> >> +             goto out;
>> >> +
>> >> +     ret = -EFAULT;
>> >> +     if (strncpy_from_user(filter, user_filter,
>> >> +                           SECCOMP_MAX_FILTER_LENGTH - 1) < 0)
>> >> +             goto out;
>> >> +
>> >> +     nr = (int) syscall_nr;
>> >> +     ret = seccomp_set_filter(nr, filter);
>> >> +
>> >> +out:
>> >> +     kfree(filter);
>> >> +     return ret;
>> >> +}
>> >> +
>> >> +long prctl_clear_seccomp_filter(unsigned long syscall_nr)
>> >> +{
>> >> +     int nr = -1;
>> >> +     long ret;
>> >> +
>> >> +     ret = -EINVAL;
>> >> +     if (syscall_nr >= NR_syscalls)
>> >> +             goto out;
>> >> +
>> >> +     nr = (int) syscall_nr;
>> >> +     ret = seccomp_clear_filter(nr);
>> >> +
>> >> +out:
>> >> +     return ret;
>> >> +}
>> >> +
>> >> +long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
>> >> +                           unsigned long available)
>> >> +{
>> >> +     int ret, nr;
>> >> +     unsigned long copied;
>> >> +     char *buf = NULL;
>> >> +     ret = -EINVAL;
>> >> +     if (!available)
>> >> +             goto out;
>> >> +     /* Ignore extra buffer space. */
>> >> +     if (available > SECCOMP_MAX_FILTER_LENGTH)
>> >> +             available = SECCOMP_MAX_FILTER_LENGTH;
>> >> +
>> >> +     ret = -EINVAL;
>> >> +     if (syscall_nr >= NR_syscalls)
>> >> +             goto out;
>> >> +     nr = (int) syscall_nr;
>> >> +
>> >> +     ret = -ENOMEM;
>> >> +     buf = kmalloc(available, GFP_KERNEL);
>> >> +     if (!buf)
>> >> +             goto out;
>> >> +
>> >> +     ret = seccomp_get_filter(nr, buf, available);
>> >> +     if (ret < 0)
>> >> +             goto out;
>> >> +
>> >> +     /* Include the NUL byte in the copy. */
>> >> +     copied = copy_to_user(dst, buf, ret + 1);
>> >> +     ret = -ENOSPC;
>> >> +     if (copied)
>> >> +             goto out;
>> >> +     ret = 0;
>> >> +out:
>> >> +     kfree(buf);
>> >> +     return ret;
>> >> +}
>> >> diff --git a/kernel/sys.c b/kernel/sys.c
>> >> index af468ed..ed60d06 100644
>> >> --- a/kernel/sys.c
>> >> +++ b/kernel/sys.c
>> >> @@ -1698,13 +1698,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>> >>               case PR_SET_ENDIAN:
>> >>                       error = SET_ENDIAN(me, arg2);
>> >>                       break;
>> >> -
>> >>               case PR_GET_SECCOMP:
>> >>                       error = prctl_get_seccomp();
>> >>                       break;
>> >>               case PR_SET_SECCOMP:
>> >>                       error = prctl_set_seccomp(arg2);
>> >>                       break;
>> >> +             case PR_SET_SECCOMP_FILTER:
>> >> +                     error = prctl_set_seccomp_filter(arg2,
>> >> +                                                      (char __user *) arg3);
>> >> +                     break;
>> >> +             case PR_CLEAR_SECCOMP_FILTER:
>> >> +                     error = prctl_clear_seccomp_filter(arg2);
>> >> +                     break;
>> >> +             case PR_GET_SECCOMP_FILTER:
>> >> +                     error = prctl_get_seccomp_filter(arg2,
>> >> +                                                      (char __user *) arg3,
>> >> +                                                      arg4);
>> >> +                     break;
>> >>               case PR_GET_TSC:
>> >>                       error = GET_TSC_CTL(arg2);
>> >>                       break;
>> >> diff --git a/security/Kconfig b/security/Kconfig
>> >> index 95accd4..c76adf2 100644
>> >> --- a/security/Kconfig
>> >> +++ b/security/Kconfig
>> >> @@ -2,6 +2,10 @@
>> >>  # Security configuration
>> >>  #
>> >>
>> >> +# Make seccomp filter Kconfig switch below available
>> >> +config HAVE_SECCOMP_FILTER
>> >> +       bool
>> >> +
>> >>  menu "Security options"
>> >>
>> >>  config KEYS
>> >> @@ -82,6 +86,19 @@ config SECURITY_DMESG_RESTRICT
>> >>
>> >>         If you are unsure how to answer this question, answer N.
>> >>
>> >> +config SECCOMP_FILTER
>> >> +     bool "Enable seccomp-based system call filtering"
>> >> +     select SECCOMP
>> >> +     depends on HAVE_SECCOMP_FILTER && EXPERIMENTAL
>> >> +     help
>> >> +       This kernel feature expands CONFIG_SECCOMP to allow computing
>> >> +       in environments with reduced kernel access dictated by the
>> >> +       application itself through prctl calls.  If
>> >> +       CONFIG_FTRACE_SYSCALLS is available, then system call
>> >> +       argument-based filtering predicates may be used.
>> >> +
>> >> +       See Documentation/prctl/seccomp_filter.txt for more detail.
>> >> +
>> >>  config SECURITY
>> >>       bool "Enable different security models"
>> >>       depends on SYSFS
>> >> --
>> >> 1.7.0.4
>> >>
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters
  2011-06-02 20:28                                                               ` Will Drewry
@ 2011-06-02 20:46                                                                 ` Steven Rostedt
  2011-06-02 21:12                                                                   ` Paul E. McKenney
  0 siblings, 1 reply; 91+ messages in thread
From: Steven Rostedt @ 2011-06-02 20:46 UTC (permalink / raw)
  To: Will Drewry
  Cc: paulmck, linux-kernel, kees.cook, torvalds, tglx, mingo, jmorris,
	Peter Zijlstra, Frederic Weisbecker, linux-security-module

On Thu, 2011-06-02 at 15:28 -0500, Will Drewry wrote:

[ Snipped 860 lines of non-relevant text ]

Seriously guys, please trim your replies. These last few messages were
ridiculous. I spent more than 30 seconds searching for what the email was
about. That's too much wasted time.

-- Steve


> >> Ah, I spaced on rcu_dereference().  The goal was to make the
> >> assignment and replacement of the seccomp_filters pointer
> >> RCU-protected (in seccomp_state) so there's no concern over it being
> >> partially replaced on platforms where pointer assignments are non-atomic
> >> - such as via /proc/<pid>/seccomp_filters access or a call via the
> >> exported symbols.  Object lifetime is managed by reference counting so
> >> that I don't have to worry about extending the RCU read-side critical
> >> section by much or deal with pre-allocations.
> >>
> >> I'll add rcu_dereference() to all the get_seccomp_filters() uses where
> >> it makes sense, so that it is called safely.  Just to make sure, does
> >> it make sense to continue to rcu protect the specific pointer?
> >
> > It might.  The usual other option is to use a lock outside of the element
> > containing the reference count to protect reference-count manipulation.
> > If there is some convenient lock, especially if it is already held where
> > needed, then locking is more straightforward.  Otherwise, RCU is usually
> > a reasonable option.
> 
> I was concerned about the overhead a lock would have at each system
> call entry, but I didn't benchmark it to see.  I'll add the
> rcu_dereference right away, then look into seeing whether there's a
> cleaner approach.  I was trying to be overly protective of mutating
> any data internal to the filters through complete replacement on any
> change.  I'll take a step back and see if
> 



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters
  2011-06-02 20:46                                                                 ` Steven Rostedt
@ 2011-06-02 21:12                                                                   ` Paul E. McKenney
  0 siblings, 0 replies; 91+ messages in thread
From: Paul E. McKenney @ 2011-06-02 21:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Drewry, linux-kernel, kees.cook, torvalds, tglx, mingo,
	jmorris, Peter Zijlstra, Frederic Weisbecker,
	linux-security-module

On Thu, Jun 02, 2011 at 04:46:07PM -0400, Steven Rostedt wrote:
> On Thu, 2011-06-02 at 15:28 -0500, Will Drewry wrote:
> 
> [ Snipped 860 lines of non-relevant text ]
> 
> Seriously guys, please trim your replies. These last few messages were
> ridiculous. I spent more than 30 seconds searching for what the email was
> about. That's too much wasted time.

Because every time I do trim the messages, I get a response from the
reviewee of the form "Oh, I take care of that in function foo()."
And of course function foo() will be in the part I trimmed.  So I then
have to find the earlier message, copy the function back in, and by
that time something else has distracted me.

							Thanx, Paul

> -- Steve
> 
> 
> > >> Ah, I spaced on rcu_dereference().  The goal was to make the
> > >> assignment and replacement of the seccomp_filters pointer
> > >> RCU-protected (in seccomp_state) so there's no concern over it being
> > >> partially replaced on platforms where pointer assignments are non-atomic
> > >> - such as via /proc/<pid>/seccomp_filters access or a call via the
> > >> exported symbols.  Object lifetime is managed by reference counting so
> > >> that I don't have to worry about extending the RCU read-side critical
> > >> section by much or deal with pre-allocations.
> > >>
> > >> I'll add rcu_dereference() to all the get_seccomp_filters() uses where
> > >> it makes sense, so that it is called safely.  Just to make sure, does
> > >> it make sense to continue to rcu protect the specific pointer?
> > >
> > > It might.  The usual other option is to use a lock outside of the element
> > > containing the reference count to protect reference-count manipulation.
> > > If there is some convenient lock, especially if it is already held where
> > > needed, then locking is more straightforward.  Otherwise, RCU is usually
> > > a reasonable option.
> > 
> > I was concerned about the overhead a lock would have at each system
> > call entry, but I didn't benchmark it to see.  I'll add the
> > rcu_dereference right away, then look into seeing whether there's a
> > cleaner approach.  I was trying to be overly protective of mutating
> > any data internal to the filters through complete replacement on any
> > change.  I'll take a step back and see if
> > 
> 
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH v3 04/13] seccomp_filter: add process state reporting
  2011-05-26 18:49                                                     ` Will Drewry
                                                                         ` (2 preceding siblings ...)
  2011-06-01  3:10                                                       ` [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters Will Drewry
@ 2011-06-01  3:10                                                       ` Will Drewry
  2011-06-01  3:10                                                       ` [PATCH v3 05/13] seccomp_filter: Document what seccomp_filter is and how it works Will Drewry
                                                                         ` (8 subsequent siblings)
  12 siblings, 0 replies; 91+ messages in thread
From: Will Drewry @ 2011-06-01  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: kees.cook, torvalds, tglx, mingo, rostedt, jmorris, Will Drewry

Adds seccomp and seccomp_filter status reporting to proc.
/proc/<pid>/seccomp_filter provides the current seccomp mode
and the list of allowed or dynamically filtered system calls.

v3: changed to using filters directly.
v2: removed status entry, added seccomp file.
    (requested by kosaki.motohiro@jp.fujitsu.com)
    allowed S_IRUGO reading of entries
    (requested by viro@zeniv.linux.org.uk)
    added flags
    got rid of the seccomp_t type
    dropped seccomp file

Signed-off-by: Will Drewry <wad@chromium.org>
---
 fs/proc/base.c |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index dfa5327..01473fe 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -73,6 +73,7 @@
 #include <linux/security.h>
 #include <linux/ptrace.h>
 #include <linux/tracehook.h>
+#include <linux/seccomp.h>
 #include <linux/cgroup.h>
 #include <linux/cpuset.h>
 #include <linux/audit.h>
@@ -579,6 +580,24 @@ static int proc_pid_syscall(struct task_struct *task, char *buffer)
 }
 #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
 
+/*
+ * Print out the current seccomp filter set for the task.
+ */
+#ifdef CONFIG_SECCOMP_FILTER
+int proc_pid_seccomp_filter_show(struct seq_file *m, struct pid_namespace *ns,
+				 struct pid *pid, struct task_struct *task)
+{
+	struct seccomp_filters *filters;
+
+	rcu_read_lock();
+	filters = get_seccomp_filters(task->seccomp.filters);
+	rcu_read_unlock();
+	seccomp_show_filters(filters, m);
+	put_seccomp_filters(filters);
+	return 0;
+}
+#endif /* CONFIG_SECCOMP_FILTER */
+
 /************************************************************************/
 /*                       Here the fs part begins                        */
 /************************************************************************/
@@ -2838,6 +2857,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
 	INF("syscall",    S_IRUGO, proc_pid_syscall),
 #endif
+#ifdef CONFIG_SECCOMP_FILTER
+	ONE("seccomp_filter",     S_IRUGO, proc_pid_seccomp_filter_show),
+#endif
 	INF("cmdline",    S_IRUGO, proc_pid_cmdline),
 	ONE("stat",       S_IRUGO, proc_tgid_stat),
 	ONE("statm",      S_IRUGO, proc_pid_statm),
@@ -3180,6 +3202,9 @@ static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
 	INF("syscall",   S_IRUGO, proc_pid_syscall),
 #endif
+#ifdef CONFIG_SECCOMP_FILTER
+	ONE("seccomp_filter",     S_IRUGO, proc_pid_seccomp_filter_show),
+#endif
 	INF("cmdline",   S_IRUGO, proc_pid_cmdline),
 	ONE("stat",      S_IRUGO, proc_tid_stat),
 	ONE("statm",     S_IRUGO, proc_pid_statm),
-- 
1.7.0.4
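
A minimal sketch of how userspace might read the new entry, assuming a kernel with this series applied.  The helper names seccomp_filter_path() and dump_seccomp_filter() are hypothetical, and the format of the file contents is not assumed; on a kernel without the entry, the open simply fails.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Build "/proc/<pid>/seccomp_filter" in buf.
 * Returns 0 on success, -1 if buf is too small.
 */
static int seccomp_filter_path(pid_t pid, char *buf, size_t size)
{
	int n = snprintf(buf, size, "/proc/%ld/seccomp_filter", (long)pid);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Copy the entry for pid to out, line by line.
 * Returns 0 on success, -1 if the entry cannot be opened
 * (for example, on a kernel without CONFIG_SECCOMP_FILTER).
 */
static int dump_seccomp_filter(pid_t pid, FILE *out)
{
	char path[64], line[512];
	FILE *f;

	if (seccomp_filter_path(pid, path, sizeof(path)) < 0)
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		fputs(line, out);
	fclose(f);
	return 0;
}
```

Since the entry is registered with S_IRUGO in both tgid_base_stuff and tid_base_stuff, a call such as dump_seccomp_filter(getpid(), stdout) would not be expected to need extra privileges.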


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v3 05/13] seccomp_filter: Document what seccomp_filter is and how it works.
  2011-05-26 18:49                                                     ` Will Drewry
                                                                         ` (3 preceding siblings ...)
  2011-06-01  3:10                                                       ` [PATCH v3 04/13] seccomp_filter: add process state reporting Will Drewry
@ 2011-06-01  3:10                                                       ` Will Drewry
  2011-06-01 21:23                                                         ` Kees Cook
  2011-06-01  3:10                                                       ` [PATCH v3 06/13] x86: add HAVE_SECCOMP_FILTER and seccomp_execve Will Drewry
                                                                         ` (7 subsequent siblings)
  12 siblings, 1 reply; 91+ messages in thread
From: Will Drewry @ 2011-06-01  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: kees.cook, torvalds, tglx, mingo, rostedt, jmorris, Will Drewry,
	Randy Dunlap, linux-doc

 Adds a text file covering what CONFIG_SECCOMP_FILTER is, how it is
 implemented presently, and what it may be used for.  In addition,
 the limitations and caveats of the proposed implementation are
 included.

 v3: a little more cleanup
 v2: moved to prctl/
     updated for the v2 syntax.
     adds a note about compat behavior

 Signed-off-by: Will Drewry <wad@chromium.org>
---
 Documentation/prctl/seccomp_filter.txt |  145 ++++++++++++++++++++++++++++++++
 1 files changed, 145 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/prctl/seccomp_filter.txt

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..27ac5af
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,145 @@
+		Seccomp filtering
+		=================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated.  A
+certain subset of userland applications benefit by having a reduce set
+of available system calls.  The reduced set reduces the total kernel
+surface exposed to the application.  System call filtering is meant for
+use with those applications.
+
+The implementation currently leverages both the existing seccomp
+infrastructure and the kernel tracing infrastructure.  By centralizing
+hooks for attack surface reduction in seccomp, it is possible to assure
+attention to security that is less relevant in normal ftrace scenarios,
+such as time-of-check, time-of-use attacks.  However, ftrace provides a
+rich, human-friendly environment for interfacing with system call
+specific arguments.  (As such, this requires FTRACE_SYSCALLS for any
+introspective filtering support.)
+
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox.  It provides a clearly defined
+mechanism for minimizing the exposed kernel surface.  Beyond that,
+policy for logical behavior and information flow should be managed with
+a combination of other system hardening techniques and, potentially, an
+LSM of your choosing.  Expressive, dynamic filters based on the ftrace
+filter engine provide further options down this path (avoiding
+pathological sizes or selecting which of the multiplexed system calls in
+socketcall() is allowed, for instance) which could be construed,
+incorrectly, as a more complete sandboxing solution.
+
+
+Usage
+-----
+
+An additional seccomp mode is exposed through mode '2'.
+This mode depends on CONFIG_SECCOMP_FILTER.  By default, it provides
+only the most trivial filter support: "1" or cleared.  However, if
+CONFIG_FTRACE_SYSCALLS is enabled, the ftrace filter engine may be used
+for more expressive filters.
+
+A collection of filters may be supplied via prctl, and the current set
+of filters is exposed in /proc/<pid>/seccomp_filter.
+
+Interacting with seccomp filters can be done through three new prctl calls
+and one existing one.
+
+PR_SET_SECCOMP:
+	A pre-existing option for enabling strict seccomp mode (1) or
+	filtering seccomp (2).
+
+	Usage:
+		prctl(PR_SET_SECCOMP, 1);  /* strict */
+		prctl(PR_SET_SECCOMP, 2);  /* filters */
+
+PR_SET_SECCOMP_FILTER:
+	Allows the specification of a new filter for a given system
+	call, by number, and filter string. If CONFIG_FTRACE_SYSCALLS is
+	supported, the filter string may be any valid value for the
+	given system call.  If it is not supported, the filter string
+	may only be "1".
+
+	All calls to PR_SET_SECCOMP_FILTER for a given system
+	call will append the supplied string to any existing filters.
+	Filter construction looks as follows:
+		(Nothing) + "fd == 1 || fd == 2" => fd == 1 || fd == 2
+		... + "fd != 2" => (fd == 1 || fd == 2) && fd != 2
+		... + "size < 100" =>
+			((fd == 1 || fd == 2) && fd != 2) && size < 100
+	If there is no filter and the seccomp mode has already
+	transitioned to filtering, additions cannot be made.  Filters
+	may only be added that reduce the available kernel surface.
+
+	Usage (per the construction example above):
+		prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd == 1 || fd == 2");
+		prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
+		prctl(PR_SET_SECCOMP_FILTER, __NR_write, "size < 100");
+
+PR_CLEAR_SECCOMP_FILTER:
+	Removes all filter entries for a given system call number.  When
+	called prior to entering seccomp filtering mode, it allows for
+	new filters to be applied to the same system call.  After
+	transition, however, it completely drops access to the call.
+
+	Usage:
+		prctl(PR_CLEAR_SECCOMP_FILTER, __NR_open);
+
+PR_GET_SECCOMP_FILTER: Returns the aggregated filter string for a system
+	call into a user-supplied buffer of a given length.
+
+	Usage:
+		prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf,
+		      sizeof(buf));
+
+All of the above calls return 0 on success and non-zero on error.
+
+
+Example
+-------
+
+Assume a process would like to cleanly read and write to stdin/out/err
+as well as access its filters after seccomp enforcement begins.  This
+may be done as follows:
+
+  prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 0");
+  prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd == 1 || fd == 2");
+  prctl(PR_SET_SECCOMP_FILTER, __NR_exit, "1");
+  prctl(PR_SET_SECCOMP_FILTER, __NR_prctl, "1");
+
+  prctl(PR_SET_SECCOMP, PR_SECCOMP_MODE_FILTER, 0);
+
+  /* Do stuff with fdset . . .*/
+
+  /* Drop read access and keep only write access to fd 1. */
+  prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
+  prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
+
+  /* Perform any final processing . . . */
+  syscall(__NR_exit, 0);
+
+
+Caveats
+-------
+
+- The filter event subsystem comes from CONFIG_TRACE_EVENTS, and the
+system call events come from CONFIG_FTRACE_SYSCALLS.  However, if
+neither is available, a filter string of "1" will be honored, and it may
+be removed using PR_CLEAR_SECCOMP_FILTER.  With ftrace filtering,
+calling PR_SET_SECCOMP_FILTER with a filter of "0" would have similar
+effect but would not be consistent on a kernel without the support.
+
+- Some platforms support a 32-bit userspace with 64-bit kernels.  In
+these cases (CONFIG_COMPAT), system call numbers may not match across
+64-bit and 32-bit system calls. When the first PR_SET_SECCOMP_FILTER
+is called, the in-memory filters state is annotated with whether the
+call has been made via the compat interface.  All subsequent calls will
+be checked for compat call mismatch.  In the long run, it may make sense
+to store compat and non-compat filters separately, but that is not
+supported at present.
-- 
1.7.0.4
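
The filter-construction rule documented under PR_SET_SECCOMP_FILTER above (each new string is ANDed onto the parenthesized existing filter) can be sketched in plain userspace C.  This only illustrates the string composition and is not the kernel implementation; extend_filter() and MAX_FILTER_LEN are hypothetical names chosen for the example.

```c
#include <stdio.h>
#include <string.h>

/* Arbitrary bound for this sketch; not the kernel's SECCOMP_MAX_FILTER_LENGTH. */
#define MAX_FILTER_LEN 256

/*
 * Extend the filter string in dst with extra:
 *   empty dst      ->  extra
 *   non-empty dst  ->  (dst) && extra
 * Returns 0 on success, or -1 if the result does not fit, in which
 * case dst is left unchanged.
 */
static int extend_filter(char *dst, size_t dstsize, const char *extra)
{
	char tmp[MAX_FILTER_LEN];
	int n;

	if (dst[0] == '\0')
		n = snprintf(tmp, sizeof(tmp), "%s", extra);
	else
		n = snprintf(tmp, sizeof(tmp), "(%s) && %s", dst, extra);

	if (n < 0 || (size_t)n >= sizeof(tmp) || (size_t)n >= dstsize)
		return -1;

	memcpy(dst, tmp, (size_t)n + 1);	/* include the NUL byte */
	return 0;
}
```

Starting from an empty string and applying the three strings from the documentation in order reproduces the aggregated filter shown there, which is also the kind of string PR_GET_SECCOMP_FILTER is described as returning.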

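
The compat caveat in the documentation above amounts to "record the calling ABI on the first filter call, then reject mismatches".  A rough userspace sketch of that check follows; struct filter_state and check_and_set_compat() are hypothetical names, and -EACCES is assumed only because that is the value seccomp_set_filter() in patch 03 returns on a compat mismatch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the compat annotation kept with the filters. */
struct filter_state {
	bool initialized;	/* has any filter been installed yet? */
	bool compat;		/* was the first call made via the compat ABI? */
};

/*
 * The first call records the caller's ABI; later calls must match it.
 * Returns 0 if the call may proceed, -EACCES on a mismatch.
 */
static int check_and_set_compat(struct filter_state *s, bool caller_is_compat)
{
	if (!s->initialized) {
		s->initialized = true;
		s->compat = caller_is_compat;
		return 0;
	}
	return s->compat == caller_is_compat ? 0 : -EACCES;
}
```

Storing compat and non-compat filters separately, as the caveat suggests for the long run, would replace the single flag with two independent filter sets.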

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/13] seccomp_filter: Document what seccomp_filter is and how it works.
  2011-06-01  3:10                                                       ` [PATCH v3 05/13] seccomp_filter: Document what seccomp_filter is and how it works Will Drewry
@ 2011-06-01 21:23                                                         ` Kees Cook
  2011-06-01 23:03                                                           ` Will Drewry
  0 siblings, 1 reply; 91+ messages in thread
From: Kees Cook @ 2011-06-01 21:23 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, torvalds, tglx, mingo, rostedt, jmorris,
	Randy Dunlap, linux-doc

Hi Will,

Minor typo corrections below...

On Tue, May 31, 2011 at 10:10:37PM -0500, Will Drewry wrote:
>  Adds a text file covering what CONFIG_SECCOMP_FILTER is, how it is
>  implemented presently, and what it may be used for.  In addition,
>  the limitations and caveats of the proposed implementation are
>  included.
> 
> --- /dev/null
> +++ b/Documentation/prctl/seccomp_filter.txt
> @@ -0,0 +1,145 @@
> ...
> +certain subset of userland applications benefit by having a reduce set
                                                               reduced

> +of available system calls.  The reduced set reduces the total kernel

Maybe "The resulting set reduces ... " ?


-Kees

-- 
Kees Cook
Ubuntu Security Team

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/13] seccomp_filter: Document what seccomp_filter is and how it works.
  2011-06-01 21:23                                                         ` Kees Cook
@ 2011-06-01 23:03                                                           ` Will Drewry
  0 siblings, 0 replies; 91+ messages in thread
From: Will Drewry @ 2011-06-01 23:03 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, torvalds, tglx, mingo, rostedt, jmorris,
	Randy Dunlap, linux-doc

On Wed, Jun 1, 2011 at 4:23 PM, Kees Cook <kees.cook@canonical.com> wrote:
> Hi Will,
>
> Minor typo corrections below...
>
> On Tue, May 31, 2011 at 10:10:37PM -0500, Will Drewry wrote:
>>  Adds a text file covering what CONFIG_SECCOMP_FILTER is, how it is
>>  implemented presently, and what it may be used for.  In addition,
>>  the limitations and caveats of the proposed implementation are
>>  included.
>>
>> --- /dev/null
>> +++ b/Documentation/prctl/seccomp_filter.txt
>> @@ -0,0 +1,145 @@
>> ...
>> +certain subset of userland applications benefit by having a reduce set
>                                                               reduced
>
>> +of available system calls.  The reduced set reduces the total kernel
>
> Maybe "The resulting set reduces ... " ?

Cool - I'll clean it up in the next cut.

Thanks!
will

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH v3 06/13] x86: add HAVE_SECCOMP_FILTER and seccomp_execve
  2011-05-26 18:49                                                     ` Will Drewry
                                                                         ` (4 preceding siblings ...)
  2011-06-01  3:10                                                       ` [PATCH v3 05/13] seccomp_filter: Document what seccomp_filter is and how it works Will Drewry
@ 2011-06-01  3:10                                                       ` Will Drewry
  2011-06-01  3:10                                                       ` [PATCH v3 07/13] arm: select HAVE_SECCOMP_FILTER Will Drewry
                                                                         ` (6 subsequent siblings)
  12 siblings, 0 replies; 91+ messages in thread
From: Will Drewry @ 2011-06-01  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: kees.cook, torvalds, tglx, mingo, rostedt, jmorris, Will Drewry,
	Ingo Molnar, H. Peter Anvin, x86

Adds support to the x86 architecture by providing a compatibility
mode wrapper for sys_execve's number and selecting HAVE_SECCOMP_FILTER.

Signed-off-by: Will Drewry <wad@chromium.org>
---
 arch/x86/Kconfig                   |    1 +
 arch/x86/include/asm/ia32_unistd.h |    1 +
 arch/x86/include/asm/seccomp_64.h  |    2 ++
 3 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc6c53a..1843d17 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -64,6 +64,7 @@ config X86
 	select HAVE_TEXT_POKE_SMP
 	select HAVE_GENERIC_HARDIRQS
 	select HAVE_SPARSE_IRQ
+	select HAVE_SECCOMP_FILTER
 	select GENERIC_FIND_FIRST_BIT
 	select GENERIC_FIND_NEXT_BIT
 	select GENERIC_IRQ_PROBE
diff --git a/arch/x86/include/asm/ia32_unistd.h b/arch/x86/include/asm/ia32_unistd.h
index 976f6ec..8ed2922 100644
--- a/arch/x86/include/asm/ia32_unistd.h
+++ b/arch/x86/include/asm/ia32_unistd.h
@@ -12,6 +12,7 @@
 #define __NR_ia32_exit		  1
 #define __NR_ia32_read		  3
 #define __NR_ia32_write		  4
+#define __NR_ia32_execve	 11
 #define __NR_ia32_sigreturn	119
 #define __NR_ia32_rt_sigreturn	173
 
diff --git a/arch/x86/include/asm/seccomp_64.h b/arch/x86/include/asm/seccomp_64.h
index 84ec1bd..85c4219 100644
--- a/arch/x86/include/asm/seccomp_64.h
+++ b/arch/x86/include/asm/seccomp_64.h
@@ -8,10 +8,12 @@
 #define __NR_seccomp_write __NR_write
 #define __NR_seccomp_exit __NR_exit
 #define __NR_seccomp_sigreturn __NR_rt_sigreturn
+#define __NR_seccomp_execve __NR_execve
 
 #define __NR_seccomp_read_32 __NR_ia32_read
 #define __NR_seccomp_write_32 __NR_ia32_write
 #define __NR_seccomp_exit_32 __NR_ia32_exit
 #define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn
+#define __NR_seccomp_execve_32 __NR_ia32_execve
 
 #endif /* _ASM_X86_SECCOMP_64_H */
-- 
1.7.0.4

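As a rough illustration of what the two new aliases provide, here is a
user-space sketch, not the kernel code.  It assumes 59 for the 64-bit
execve number and 11 for the ia32 one (the latter matches
__NR_ia32_execve in the hunk above).  The EX_ names and ex_ helpers are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed x86 values: 59 is the 64-bit execve number, 11 the ia32 one. */
#define EX_NR_execve		59
#define EX_NR_ia32_execve	11

/* Mirror the two aliases added to seccomp_64.h. */
#define EX_NR_seccomp_execve	EX_NR_execve
#define EX_NR_seccomp_execve_32	EX_NR_ia32_execve

/* Hypothetical helper: the execve number for the calling ABI. */
int ex_seccomp_execve_nr(bool is_compat)
{
	return is_compat ? EX_NR_seccomp_execve_32 : EX_NR_seccomp_execve;
}

/* Hypothetical helper: does nr mean execve for this ABI? */
bool ex_is_execve(int nr, bool is_compat)
{
	return nr == ex_seccomp_execve_nr(is_compat);
}

/* Small self-check of the mapping. */
void ex_selftest(void)
{
	assert(ex_is_execve(59, false));
	assert(ex_is_execve(11, true));
	/* The same number means something else in the other ABI. */
	assert(!ex_is_execve(11, false));
}
```

The point of keeping both aliases is that a number alone is ambiguous on
a kernel that runs 32-bit and 64-bit tasks side by side; the caller has
to know which table applies before comparing.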

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v3 07/13] arm: select HAVE_SECCOMP_FILTER
  2011-05-26 18:49                                                     ` Will Drewry
                                                                         ` (5 preceding siblings ...)
  2011-06-01  3:10                                                       ` [PATCH v3 06/13] x86: add HAVE_SECCOMP_FILTER and seccomp_execve Will Drewry
@ 2011-06-01  3:10                                                       ` Will Drewry
  2011-06-01  3:10                                                       ` [PATCH v3 08/13] microblaze: select HAVE_SECCOMP_FILTER and provide seccomp_execve Will Drewry
                                                                         ` (5 subsequent siblings)
  12 siblings, 0 replies; 91+ messages in thread
From: Will Drewry @ 2011-06-01  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: kees.cook, torvalds, tglx, mingo, rostedt, jmorris, Will Drewry,
	Russell King, linux-arm-kernel

Enable support for CONFIG_SECCOMP_FILTER by selecting HAVE_SECCOMP_FILTER by
default.

Signed-off-by: Will Drewry <wad@chromium.org>
---
 arch/arm/Kconfig |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 377a7a5..4725fbc 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -16,6 +16,7 @@ config ARM
 	select HAVE_FTRACE_MCOUNT_RECORD if (!XIP_KERNEL)
 	select HAVE_DYNAMIC_FTRACE if (!XIP_KERNEL)
 	select HAVE_FUNCTION_GRAPH_TRACER if (!THUMB2_KERNEL)
+	select HAVE_SECCOMP_FILTER
 	select HAVE_GENERIC_DMA_COHERENT
 	select HAVE_KERNEL_GZIP
 	select HAVE_KERNEL_LZO
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v3 08/13] microblaze: select HAVE_SECCOMP_FILTER and provide seccomp_execve
  2011-05-26 18:49                                                     ` Will Drewry
                                                                         ` (6 preceding siblings ...)
  2011-06-01  3:10                                                       ` [PATCH v3 07/13] arm: select HAVE_SECCOMP_FILTER Will Drewry
@ 2011-06-01  3:10                                                       ` Will Drewry
  2011-06-01  5:37                                                         ` Michal Simek
  2011-06-01  3:10                                                       ` [PATCH v3 09/13] mips: " Will Drewry
                                                                         ` (4 subsequent siblings)
  12 siblings, 1 reply; 91+ messages in thread
From: Will Drewry @ 2011-06-01  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: kees.cook, torvalds, tglx, mingo, rostedt, jmorris, Will Drewry,
	Michal Simek, microblaze-uclinux

Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
system call numbering for execve and selecting HAVE_SECCOMP_FILTER.

Signed-off-by: Will Drewry <wad@chromium.org>
---
 arch/microblaze/Kconfig               |    1 +
 arch/microblaze/include/asm/seccomp.h |    2 ++
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index eccdefe..30ef677 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -1,6 +1,7 @@
 config MICROBLAZE
 	def_bool y
 	select HAVE_MEMBLOCK
+	select HAVE_SECCOMP_FILTER
 	select HAVE_FUNCTION_TRACER
 	select HAVE_FUNCTION_TRACE_MCOUNT_TEST
 	select HAVE_FUNCTION_GRAPH_TRACER
diff --git a/arch/microblaze/include/asm/seccomp.h b/arch/microblaze/include/asm/seccomp.h
index 0d91275..0e38eed 100644
--- a/arch/microblaze/include/asm/seccomp.h
+++ b/arch/microblaze/include/asm/seccomp.h
@@ -7,10 +7,12 @@
 #define __NR_seccomp_write		__NR_write
 #define __NR_seccomp_exit		__NR_exit
 #define __NR_seccomp_sigreturn		__NR_sigreturn
+#define __NR_seccomp_execve		__NR_execve
 
 #define __NR_seccomp_read_32		__NR_read
 #define __NR_seccomp_write_32		__NR_write
 #define __NR_seccomp_exit_32		__NR_exit
 #define __NR_seccomp_sigreturn_32	__NR_sigreturn
+#define __NR_seccomp_execve_32		__NR_execve
 
 #endif	/* _ASM_MICROBLAZE_SECCOMP_H */
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 08/13] microblaze: select HAVE_SECCOMP_FILTER and provide seccomp_execve
  2011-06-01  3:10                                                       ` [PATCH v3 08/13] microblaze: select HAVE_SECCOMP_FILTER and provide seccomp_execve Will Drewry
@ 2011-06-01  5:37                                                         ` Michal Simek
  0 siblings, 0 replies; 91+ messages in thread
From: Michal Simek @ 2011-06-01  5:37 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, kees.cook, torvalds, tglx, mingo, rostedt, jmorris,
	microblaze-uclinux

Will Drewry wrote:
> Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
> system call numbering for execve and selecting HAVE_SECCOMP_FILTER.
> 
> Signed-off-by: Will Drewry <wad@chromium.org>
> ---
>  arch/microblaze/Kconfig               |    1 +
>  arch/microblaze/include/asm/seccomp.h |    2 ++
>  2 files changed, 3 insertions(+), 0 deletions(-)

Acked-by: Michal Simek <monstr@monstr.eu>

-- 
Michal Simek, Ing. (M.Eng)
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH v3 09/13] mips: select HAVE_SECCOMP_FILTER and provide seccomp_execve
  2011-05-26 18:49                                                     ` Will Drewry
                                                                         ` (7 preceding siblings ...)
  2011-06-01  3:10                                                       ` [PATCH v3 08/13] microblaze: select HAVE_SECCOMP_FILTER and provide seccomp_execve Will Drewry
@ 2011-06-01  3:10                                                       ` Will Drewry
  2011-06-01  3:10                                                       ` [PATCH v3 10/13] s390: " Will Drewry
                                                                         ` (3 subsequent siblings)
  12 siblings, 0 replies; 91+ messages in thread
From: Will Drewry @ 2011-06-01  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: kees.cook, torvalds, tglx, mingo, rostedt, jmorris, Will Drewry,
	Ralf Baechle, linux-mips

Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
system call numbering for execve and selecting HAVE_SECCOMP_FILTER.

Signed-off-by: Will Drewry <wad@chromium.org>
---
 arch/mips/Kconfig               |    1 +
 arch/mips/include/asm/seccomp.h |    3 +++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 8e256cc..d376f68 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -10,6 +10,7 @@ config MIPS
 	select HAVE_ARCH_KGDB
 	select HAVE_FUNCTION_TRACER
 	select HAVE_FUNCTION_TRACE_MCOUNT_TEST
+	select HAVE_SECCOMP_FILTER
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_FTRACE_MCOUNT_RECORD
 	select HAVE_C_RECORDMCOUNT
diff --git a/arch/mips/include/asm/seccomp.h b/arch/mips/include/asm/seccomp.h
index ae6306e..4014a3a 100644
--- a/arch/mips/include/asm/seccomp.h
+++ b/arch/mips/include/asm/seccomp.h
@@ -6,6 +6,7 @@
 #define __NR_seccomp_write __NR_write
 #define __NR_seccomp_exit __NR_exit
 #define __NR_seccomp_sigreturn __NR_rt_sigreturn
+#define __NR_seccomp_execve __NR_execve
 
 /*
  * Kludge alert:
@@ -19,6 +20,7 @@
 #define __NR_seccomp_write_32		4004
 #define __NR_seccomp_exit_32		4001
 #define __NR_seccomp_sigreturn_32	4193	/* rt_sigreturn */
+#define __NR_seccomp_execve_32		4011
 
 #elif defined(CONFIG_MIPS32_N32)
 
@@ -26,6 +28,7 @@
 #define __NR_seccomp_write_32		6001
 #define __NR_seccomp_exit_32		6058
 #define __NR_seccomp_sigreturn_32	6211	/* rt_sigreturn */
+#define __NR_seccomp_execve_32		6057
 
 #endif /* CONFIG_MIPS32_O32 */
 
-- 
1.7.0.4

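The literal numbers in the _32 blocks can be read as a table base plus
a per-ABI index.  The sketch below assumes the usual MIPS bases (4000
for o32, 6000 for n32) and reproduces the execve and exit values from
the hunks above.  The EX_ names and ex_mips_nr() are hypothetical.

```c
#include <assert.h>

/* Assumed bases of the MIPS o32 and n32 syscall tables. */
#define EX_MIPS_O32_BASE	4000
#define EX_MIPS_N32_BASE	6000

/* Index of each call inside its own table; the ABIs differ. */
#define EX_O32_EXIT_IDX		1
#define EX_O32_EXECVE_IDX	11
#define EX_N32_EXECVE_IDX	57
#define EX_N32_EXIT_IDX		58

/* Hypothetical helper: absolute syscall number from base and index. */
int ex_mips_nr(int base, int idx)
{
	return base + idx;
}

/* Check the arithmetic against the constants in the patch. */
void ex_selftest(void)
{
	assert(ex_mips_nr(EX_MIPS_O32_BASE, EX_O32_EXECVE_IDX) == 4011);
	assert(ex_mips_nr(EX_MIPS_N32_BASE, EX_N32_EXECVE_IDX) == 6057);
	assert(ex_mips_nr(EX_MIPS_O32_BASE, EX_O32_EXIT_IDX) == 4001);
	assert(ex_mips_nr(EX_MIPS_N32_BASE, EX_N32_EXIT_IDX) == 6058);
}
```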

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v3 10/13] s390: select HAVE_SECCOMP_FILTER and provide seccomp_execve
  2011-05-26 18:49                                                     ` Will Drewry
                                                                         ` (8 preceding siblings ...)
  2011-06-01  3:10                                                       ` [PATCH v3 09/13] mips: " Will Drewry
@ 2011-06-01  3:10                                                       ` Will Drewry
  2011-06-01  3:10                                                       ` [PATCH v3 11/13] powerpc: " Will Drewry
                                                                         ` (2 subsequent siblings)
  12 siblings, 0 replies; 91+ messages in thread
From: Will Drewry @ 2011-06-01  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: kees.cook, torvalds, tglx, mingo, rostedt, jmorris, Will Drewry,
	Martin Schwidefsky, Heiko Carstens, linux390, linux-s390

Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
system call numbering for execve and selecting HAVE_SECCOMP_FILTER.

Signed-off-by: Will Drewry <wad@chromium.org>
---
 arch/s390/Kconfig               |    1 +
 arch/s390/include/asm/seccomp.h |    3 ++-
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2508a6f..9382198 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -64,6 +64,7 @@ config ARCH_SUPPORTS_DEBUG_PAGEALLOC
 config S390
 	def_bool y
 	select USE_GENERIC_SMP_HELPERS if SMP
+	select HAVE_SECCOMP_FILTER
 	select HAVE_SYSCALL_WRAPPERS
 	select HAVE_FUNCTION_TRACER
 	select HAVE_FUNCTION_TRACE_MCOUNT_TEST
diff --git a/arch/s390/include/asm/seccomp.h b/arch/s390/include/asm/seccomp.h
index 781a9cf..e5792f5 100644
--- a/arch/s390/include/asm/seccomp.h
+++ b/arch/s390/include/asm/seccomp.h
@@ -7,10 +7,11 @@
 #define __NR_seccomp_write	__NR_write
 #define __NR_seccomp_exit	__NR_exit
 #define __NR_seccomp_sigreturn	__NR_sigreturn
+#define __NR_seccomp_execve	__NR_execve
 
 #define __NR_seccomp_read_32	__NR_read
 #define __NR_seccomp_write_32	__NR_write
 #define __NR_seccomp_exit_32	__NR_exit
-#define __NR_seccomp_sigreturn_32 __NR_sigreturn
+#define __NR_seccomp_execve_32	__NR_execve
 
 #endif	/* _ASM_S390_SECCOMP_H */
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v3 11/13] powerpc: select HAVE_SECCOMP_FILTER and provide seccomp_execve
  2011-05-26 18:49                                                     ` Will Drewry
                                                                         ` (9 preceding siblings ...)
  2011-06-01  3:10                                                       ` [PATCH v3 10/13] s390: " Will Drewry
@ 2011-06-01  3:10                                                       ` Will Drewry
  2011-06-01  3:10                                                       ` [PATCH v3 12/13] sparc: " Will Drewry
  2011-06-01  3:10                                                       ` [PATCH v3 13/13] sh: select HAVE_SECCOMP_FILTER Will Drewry
  12 siblings, 0 replies; 91+ messages in thread
From: Will Drewry @ 2011-06-01  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: kees.cook, torvalds, tglx, mingo, rostedt, jmorris, Will Drewry,
	Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev

Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
system call numbering for execve and selecting HAVE_SECCOMP_FILTER.

Signed-off-by: Will Drewry <wad@chromium.org>
---
 arch/powerpc/Kconfig               |    1 +
 arch/powerpc/include/asm/seccomp.h |    2 ++
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8f4d50b..0bd4574 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -137,6 +137,7 @@ config PPC
 	select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
 	select HAVE_GENERIC_HARDIRQS
 	select HAVE_SPARSE_IRQ
+	select HAVE_SECCOMP_FILTER
 	select IRQ_PER_CPU
 	select GENERIC_IRQ_SHOW
 	select GENERIC_IRQ_SHOW_LEVEL
diff --git a/arch/powerpc/include/asm/seccomp.h b/arch/powerpc/include/asm/seccomp.h
index 00c1d91..3cb9cc1 100644
--- a/arch/powerpc/include/asm/seccomp.h
+++ b/arch/powerpc/include/asm/seccomp.h
@@ -7,10 +7,12 @@
 #define __NR_seccomp_write __NR_write
 #define __NR_seccomp_exit __NR_exit
 #define __NR_seccomp_sigreturn __NR_rt_sigreturn
+#define __NR_seccomp_execve __NR_execve
 
 #define __NR_seccomp_read_32 __NR_read
 #define __NR_seccomp_write_32 __NR_write
 #define __NR_seccomp_exit_32 __NR_exit
 #define __NR_seccomp_sigreturn_32 __NR_sigreturn
+#define __NR_seccomp_execve_32 __NR_execve
 
 #endif	/* _ASM_POWERPC_SECCOMP_H */
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v3 12/13] sparc: select HAVE_SECCOMP_FILTER and provide seccomp_execve
  2011-05-26 18:49                                                     ` Will Drewry
                                                                         ` (10 preceding siblings ...)
  2011-06-01  3:10                                                       ` [PATCH v3 11/13] powerpc: " Will Drewry
@ 2011-06-01  3:10                                                       ` Will Drewry
  2011-06-01  3:35                                                         ` David Miller
  2011-06-01  3:10                                                       ` [PATCH v3 13/13] sh: select HAVE_SECCOMP_FILTER Will Drewry
  12 siblings, 1 reply; 91+ messages in thread
From: Will Drewry @ 2011-06-01  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: kees.cook, torvalds, tglx, mingo, rostedt, jmorris, Will Drewry,
	David S. Miller, sparclinux

Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
system call numbering for execve and selecting HAVE_SECCOMP_FILTER.

Signed-off-by: Will Drewry <wad@chromium.org>
---
 arch/sparc/Kconfig               |    2 ++
 arch/sparc/include/asm/seccomp.h |    2 ++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e560d10..5249760 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -25,6 +25,7 @@ config SPARC
 	select HAVE_DMA_ATTRS
 	select HAVE_DMA_API_DEBUG
 	select HAVE_ARCH_JUMP_LABEL
+	select HAVE_SECCOMP_FILTER
 
 config SPARC32
 	def_bool !64BIT
@@ -39,6 +40,7 @@ config SPARC64
 	select HAVE_KRETPROBES
 	select HAVE_KPROBES
 	select HAVE_MEMBLOCK
+	select HAVE_SECCOMP_FILTER
 	select HAVE_SYSCALL_WRAPPERS
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_FTRACE_MCOUNT_RECORD
diff --git a/arch/sparc/include/asm/seccomp.h b/arch/sparc/include/asm/seccomp.h
index adca1bc..a1dac08 100644
--- a/arch/sparc/include/asm/seccomp.h
+++ b/arch/sparc/include/asm/seccomp.h
@@ -6,10 +6,12 @@
 #define __NR_seccomp_write __NR_write
 #define __NR_seccomp_exit __NR_exit
 #define __NR_seccomp_sigreturn __NR_rt_sigreturn
+#define __NR_seccomp_execve __NR_execve
 
 #define __NR_seccomp_read_32 __NR_read
 #define __NR_seccomp_write_32 __NR_write
 #define __NR_seccomp_exit_32 __NR_exit
 #define __NR_seccomp_sigreturn_32 __NR_sigreturn
+#define __NR_seccomp_execve_32 __NR_execve
 
 #endif /* _ASM_SECCOMP_H */
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 12/13] sparc: select HAVE_SECCOMP_FILTER and provide seccomp_execve
  2011-06-01  3:10                                                       ` [PATCH v3 12/13] sparc: " Will Drewry
@ 2011-06-01  3:35                                                         ` David Miller
  0 siblings, 0 replies; 91+ messages in thread
From: David Miller @ 2011-06-01  3:35 UTC (permalink / raw)
  To: wad
  Cc: linux-kernel, kees.cook, torvalds, tglx, mingo, rostedt, jmorris,
	sparclinux

From: Will Drewry <wad@chromium.org>
Date: Tue, 31 May 2011 22:10:44 -0500

> Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
> system call numbering for execve and selecting HAVE_SECCOMP_FILTER.
> 
> Signed-off-by: Will Drewry <wad@chromium.org>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH v3 13/13] sh: select HAVE_SECCOMP_FILTER
  2011-05-26 18:49                                                     ` Will Drewry
                                                                         ` (11 preceding siblings ...)
  2011-06-01  3:10                                                       ` [PATCH v3 12/13] sparc: " Will Drewry
@ 2011-06-01  3:10                                                       ` Will Drewry
  2011-06-02  5:27                                                         ` Paul Mundt
  12 siblings, 1 reply; 91+ messages in thread
From: Will Drewry @ 2011-06-01  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: kees.cook, torvalds, tglx, mingo, rostedt, jmorris, Will Drewry,
	Paul Mundt, linux-sh

Add support for CONFIG_SECCOMP_FILTER by selecting HAVE_SECCOMP_FILTER.
---
 arch/sh/Kconfig |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 4b89da2..41ea3a7 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -10,6 +10,7 @@ config SUPERH
 	select HAVE_DMA_API_DEBUG
 	select HAVE_DMA_ATTRS
 	select HAVE_IRQ_WORK
+	select HAVE_SECCOMP_FILTER
 	select HAVE_PERF_EVENTS
 	select PERF_USE_VMALLOC
 	select HAVE_KERNEL_GZIP
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 13/13] sh: select HAVE_SECCOMP_FILTER
  2011-06-01  3:10                                                       ` [PATCH v3 13/13] sh: select HAVE_SECCOMP_FILTER Will Drewry
@ 2011-06-02  5:27                                                         ` Paul Mundt
  0 siblings, 0 replies; 91+ messages in thread
From: Paul Mundt @ 2011-06-02  5:27 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, kees.cook, torvalds, tglx, mingo, rostedt, jmorris,
	linux-sh

On Tue, May 31, 2011 at 10:10:45PM -0500, Will Drewry wrote:
> Add support for CONFIG_SECCOMP_FILTER by selecting HAVE_SECCOMP_FILTER.
> ---
>  arch/sh/Kconfig |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
Acked-by: Paul Mundt <lethal@linux-sh.org>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 17:02                                             ` Will Drewry
  2011-05-26 17:04                                               ` Will Drewry
  2011-05-26 17:17                                               ` Linus Torvalds
@ 2011-05-26 17:38                                               ` Valdis.Kletnieks
  2011-05-26 18:08                                                 ` Will Drewry
  2 siblings, 1 reply; 91+ messages in thread
From: Valdis.Kletnieks @ 2011-05-26 17:38 UTC (permalink / raw)
  To: Will Drewry
  Cc: Linus Torvalds, Colin Walters, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, linux-kernel,
	James Morris

[-- Attachment #1: Type: text/plain, Size: 1395 bytes --]

On Thu, 26 May 2011 12:02:45 CDT, Will Drewry said:

> Absolutely - that was what I meant :/  The patches do not currently
> check creds at creation or again at use, which would lead to
> unprivileged filters being used in a privileged context.  Right now,
> though, if setuid() is not allowed by the seccomp-filter, the process
> will be immediately killed with do_exit(SIGKILL) on call -- thus
> avoiding a silent failure.

How do you know you have the bounding set correct?

This has been a long-standing issue for SELinux policy writing - it's usually
easy to get 95% of the bounding box right (you need these rules for shared
libraries, you need these rules to access the user's home directory, you need
these other rules to talk TCP to the net, etc).  There's a nice tool that
converts any remaining rejection messages into rules you can add to the policy.

The problem is twofold: (a) that way you can never be sure you got *all* the
rules right and (b) the missing rules are almost always in squirrelly little
error-handling code that gets invoked once in a blue moon.  So in this case,
you end up with trying to debug the SIGKILL that happened when the process was
already in trouble for some other reason...

"Wow. Who would have guessed that program only called gettimeofday() in
the error handler for when it was formatting its crash message?"

Exactly.

[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 17:38                                               ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Valdis.Kletnieks
@ 2011-05-26 18:08                                                 ` Will Drewry
  2011-05-26 18:22                                                   ` Valdis.Kletnieks
  0 siblings, 1 reply; 91+ messages in thread
From: Will Drewry @ 2011-05-26 18:08 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Linus Torvalds, Colin Walters, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, linux-kernel,
	James Morris

On Thu, May 26, 2011 at 12:38 PM,  <Valdis.Kletnieks@vt.edu> wrote:
> On Thu, 26 May 2011 12:02:45 CDT, Will Drewry said:
>
>> Absolutely - that was what I meant :/  The patches do not currently
>> check creds at creation or again at use, which would lead to
>> unprivileged filters being used in a privileged context.  Right now,
>> though, if setuid() is not allowed by the seccomp-filter, the process
>> will be immediately killed with do_exit(SIGKILL) on call -- thus
>> avoiding a silent failure.
>
> How do you know you have the bounding set correct?
>
> This has been a long-standing issue for SELinux policy writing - it's usually
> easy to get 95% of the bounding box right (you need these rules for shared
> libraries, you need these rules to access the user's home directory, you need
> these other rules to talk TCP to the net, etc).  There's a nice tool that
> converts any remaining rejection messages into rules you can add to the policy.
>
> The problem is twofold: (a) that way you can never be sure you got *all* the
> rules right and (b) the missing rules are almost always in squirrelly little
> error-handling code that gets invoked once in a blue moon.  So in this case,
> you end up with trying to debug the SIGKILL that happened when the process was
> already in trouble for some other reason...
>
> "Wow. Who would have guessed that program only called gettimeofday() in
> the error handler for when it was formatting its crash message?"
>
> Exactly.

Depending on the need, there is work involved, and there are many ways
to determine your bounding box.  It can be very tight -- where you
analyze normal workloads (perf,strace,objdump) and accept the fact
that pathological workloads may result in process death -- or it can
be quite loose and enable most system calls, just not newer ones,
let's say.  In practice, you might get bit a few times if you're
overly zealous (I know I have), but it's the difference between
failing open and failing closed.  There are some scenarios where you
never, ever want to fail-open even at the cost of process death and
lack of solid insight into a valid failure path.

Hope that makes sense and isn't too general,
will
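
To make the fail-closed case concrete, here is a minimal user-space
model of a tight bounding box.  It is only a sketch: the allowlist uses
assumed x86-64 numbers (read 0, write 1, rt_sigreturn 15, exit 60), and
ex_check() and the EX_ verdicts are hypothetical names, not anything
from the patches.

```c
#include <assert.h>
#include <stddef.h>

enum ex_verdict { EX_ALLOW, EX_KILL };

/* Assumed x86-64 numbers: read, write, rt_sigreturn, exit. */
static const int ex_allowed[] = { 0, 1, 15, 60 };

/* Anything not explicitly listed is refused: the box fails closed. */
enum ex_verdict ex_check(int nr)
{
	size_t i;

	for (i = 0; i < sizeof(ex_allowed) / sizeof(ex_allowed[0]); i++)
		if (ex_allowed[i] == nr)
			return EX_ALLOW;
	return EX_KILL;
}

/* write (1) is listed; gettimeofday (96 on x86-64) is not. */
void ex_selftest(void)
{
	assert(ex_check(1) == EX_ALLOW);
	assert(ex_check(96) == EX_KILL);
}
```

A looser box would invert the default: allow unless listed.  That keeps
the rare error path alive but also lets through every call nobody
thought to list.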

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 18:08                                                 ` Will Drewry
@ 2011-05-26 18:22                                                   ` Valdis.Kletnieks
  0 siblings, 0 replies; 91+ messages in thread
From: Valdis.Kletnieks @ 2011-05-26 18:22 UTC (permalink / raw)
  To: Will Drewry
  Cc: Linus Torvalds, Colin Walters, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, linux-kernel,
	James Morris

[-- Attachment #1: Type: text/plain, Size: 1212 bytes --]

On Thu, 26 May 2011 13:08:10 CDT, Will Drewry said:

> Depending on the need, there is work involved, and there are many ways
> to determine your bounding box.  It can be very tight -- where you
> analyze normal workloads (perf,strace,objdump) and accept the fact
> that pathological workloads may result in process death -- or it can
> be quite loose and enable most system calls, just not newer ones,
> let's say.  In practice, you might get bit a few times if you're
> overly zealous (I know I have), but it's the difference between
> failing open and failing closed.  There are some scenarios where you
> never, ever want to fail-open even at the cost of process death and
> lack of solid insight into a valid failure path.

> Hope that makes sense and isn't too general,

Oh, I already understood all that. :)   I'd have to double-check the actual
patch: does it give a (hopefully rate-limited) printk or other hint about which
syscall caused the issue, to help in making up the list of needed syscalls?

And we probably want a cleaned-up copy of the quoted paragraph in the
documentation for this feature when it hits the streets.  People tuning in late
will need guidance on how to use this in their projects.


[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 16:46                                           ` Linus Torvalds
  2011-05-26 17:02                                             ` Will Drewry
@ 2011-05-26 17:07                                             ` Steven Rostedt
  2011-05-26 18:43                                               ` Casey Schaufler
  2011-05-26 18:34                                             ` david
  2011-05-26 18:54                                             ` Ingo Molnar
  3 siblings, 1 reply; 91+ messages in thread
From: Steven Rostedt @ 2011-05-26 17:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Drewry, Colin Walters, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, linux-kernel, James Morris

On Thu, 2011-05-26 at 09:46 -0700, Linus Torvalds wrote:

> And if you filter system calls, it's entirely possible that you can
> attack suid executables through such a vector. Your "limit system
> calls for security" security suddenly turned into "avoid the system
> call that made things secure"!
> 
> See what I'm saying?

So you are not complaining about this implementation, but the use of
syscall filtering?

There may be some user that says, "oh I don't want my other apps to be
able to call setuid" thinking it will secure their application even
more. But because that application did the brain-dead thing of not checking
the return code of setuid, and it just happened to be running
privileged, it then execs off another application that can root the box.

Because, originally that setuid would have succeeded if the user did
nothing special, but now with this filtering, and the user thinking that
they could limit their app from doing harm, they just opened up a hole
that caused their app to do the exact opposite and give the exec'd app
full root privileges.

Did I get this right?

-- Steve

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 17:07                                             ` Steven Rostedt
@ 2011-05-26 18:43                                               ` Casey Schaufler
  2011-05-26 18:54                                                 ` Steven Rostedt
  0 siblings, 1 reply; 91+ messages in thread
From: Casey Schaufler @ 2011-05-26 18:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Will Drewry, Colin Walters, Kees Cook,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, linux-kernel,
	James Morris

On 5/26/2011 10:07 AM, Steven Rostedt wrote:
> On Thu, 2011-05-26 at 09:46 -0700, Linus Torvalds wrote:
>
>> And if you filter system calls, it's entirely possible that you can
>> attack suid executables through such a vector. Your "limit system
>> calls for security" security suddenly turned into "avoid the system
>> call that made things secure"!
>>
>> See what I'm saying?
> So you are not complaining about this implementation, but the use of
> syscall filtering?
>
> There may be some user that says, "oh I don't want my other apps to be
> able to call setuid" thinking it will secure their application even
> more. But because that application did the brain-dead thing of not checking
> the return code of setuid, and it just happened to be running
> privileged, it then execs off another application that can root the box.
>
> Because, originally that setuid would have succeeded if the user did
> nothing special, but now with this filtering, and the user thinking that
> they could limit their app from doing harm, they just opened up a hole
> that caused their app to do the exact opposite and give the exec'd app
> full root privileges.
>
> Did I get this right?

Yes. Some system calls are there so that you can turn off
privilege. There was a major exploit with sendmail when capabilities
were first introduced that brought the potential for this sort of
problem into the public eye. Kernel mechanisms intended to provide
additional security have to be massively careful about the impact
they may have on applications that are currently security aware and
that make use of the existing mechanisms. The ACL mechanism is much
more complicated than it probably ought to be to accommodate chmod()
and capabilities go way over the top to deal with traditional root
behavior.
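
To make the failure mode concrete, here is a minimal sketch of a
privilege drop that treats a failed setuid() as fatal and then verifies
the result. It is an illustration only, not code from any patch in this
thread; the helper name drop_privileges_checked() is hypothetical.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): drop to the real uid and
 * report failure instead of ignoring it.
 * Returns 0 on success, -1 if the drop failed or did not take effect.
 */
int drop_privileges_checked(void)
{
	uid_t uid = getuid();

	/* If a filter or policy makes setuid() fail, do not continue. */
	if (setuid(uid) != 0) {
		perror("setuid");
		return -1;
	}

	/* Verify that the effective uid now matches the real uid. */
	if (geteuid() != uid) {
		fprintf(stderr, "privileges were not dropped\n");
		return -1;
	}

	return 0;
}
```

A caller would normally exit on a -1 return, before any exec, so that a
filtered or denied setuid() cannot leave the program running with the
privileges it meant to give up.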

> -- Steve

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 18:43                                               ` Casey Schaufler
@ 2011-05-26 18:54                                                 ` Steven Rostedt
  0 siblings, 0 replies; 91+ messages in thread
From: Steven Rostedt @ 2011-05-26 18:54 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Linus Torvalds, Will Drewry, Colin Walters, Kees Cook,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, linux-kernel,
	James Morris

On Thu, 2011-05-26 at 11:43 -0700, Casey Schaufler wrote:

> > Did I get this right?
> 
> Yes.

Thanks for the validation ;)

-- Steve



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 16:46                                           ` Linus Torvalds
  2011-05-26 17:02                                             ` Will Drewry
  2011-05-26 17:07                                             ` Steven Rostedt
@ 2011-05-26 18:34                                             ` david
  2011-05-26 18:54                                             ` Ingo Molnar
  3 siblings, 0 replies; 91+ messages in thread
From: david @ 2011-05-26 18:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Drewry, Colin Walters, Kees Cook, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, linux-kernel,
	James Morris

On Thu, 26 May 2011, Linus Torvalds wrote:
> 
> On Thu, May 26, 2011 at 9:33 AM, Will Drewry <wad@chromium.org> wrote:
>>
>> FWIW, none of the patches deal with privilege escalation via setuid
>> files or file capabilities.
>
> That is NOT AT ALL what I'm talking about.
>
> I'm talking about the "setuid()" system call (and all its cousins:
> setgid/setreuid etc). And the whole thread has been about filtering
> system calls, no?
>
> Do a google code search for setuid.
>
> In good code, it will look something like
>
>  uid = getuid();
>
>  if (setuid(uid)) {
>    fprintf(stderr, "Unable to drop privileges\n");
>    exit(1);
>  }
>
> but I guarantee you that there are cases where people just blindly
> drop privileges. google code search found me at least the "heirloom"
> source code doing exactly that.

I believe that sendmail had this exact vulnerability when capabilities were 
used to control setuid a couple of years ago.

David Lang

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 16:46                                           ` Linus Torvalds
                                                               ` (2 preceding siblings ...)
  2011-05-26 18:34                                             ` david
@ 2011-05-26 18:54                                             ` Ingo Molnar
  3 siblings, 0 replies; 91+ messages in thread
From: Ingo Molnar @ 2011-05-26 18:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Drewry, Colin Walters, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, Steven Rostedt, linux-kernel, James Morris


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And if you filter system calls, it's entirely possible that you can 
> attack suid executables through such a vector. Your "limit system 
> calls for security" security suddenly turned into "avoid the system 
> call that made things secure"!

That should not be possible with Will's event filter based solution 
(his last submitted patch), due to this code in fs/exec.c (which is 
in your upstream tree as well):

        /*
         * Flush performance counters when crossing a
         * security domain:
         */
        if (!get_dumpable(current->mm))
                perf_event_exit_task(current);

This will drop all filters if a setuid-root (or whatever setuid) 
binary is executed from a filtered environment.

Does this cover the case you were thinking of?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 18:42                               ` Linus Torvalds
  2011-05-25 19:06                                 ` Ingo Molnar
  2011-05-25 19:11                                 ` Kees Cook
@ 2011-05-26  1:19                                 ` James Morris
  2011-05-26  6:08                                   ` Avi Kivity
  2011-05-26  8:24                                   ` Ingo Molnar
  2011-05-29 16:51                                 ` Aneesh Kumar K.V
  3 siblings, 2 replies; 91+ messages in thread
From: James Morris @ 2011-05-26  1:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Will Drewry, Steven Rostedt, linux-kernel, Avi Kivity, gnatapov,
	Chris Wright

On Wed, 25 May 2011, Linus Torvalds wrote:

> And per-system-call permissions are very dubious. What system calls
> don't you want to succeed? That ioctl? You just made it impossible to
> do a modern graphical application. Yet the kind of thing where we
> would _want_ to help users is in making it easier to sandbox something
> like the adobe flash player. But without accelerated direct rendering,
> that's not going to fly, is it?

Going back to the initial idea proposed by Will, where seccomp is simply 
extended to filter all syscalls, there is potential benefit in being able 
to limit the attack surface of the syscall API.

This is not security mediation in terms of interaction between things 
(e.g. "allow A to read B").  It's a _hardening_ feature which prevents a 
process from being able to invoke potentially hundreds of syscalls it has 
no need for.  It would allow us to usefully restrict some well-established 
attack modes, e.g. triggering bugs in kernel code via unneeded syscalls.
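
As a rough illustration of the hardening idea, the sketch below models a
per-task syscall allow bitmask in user space. It is not the kernel
implementation and not Will's patch; the type syscall_mask, the helper
names and the MAX_SYSCALLS limit are made up for this example.

```c
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SYSCALLS 512	/* assumed upper bound, illustration only */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS ((MAX_SYSCALLS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical per-task mask: bit N set means syscall N is allowed. */
struct syscall_mask {
	unsigned long bits[MASK_WORDS];
};

/* Start from "deny everything". */
void mask_init(struct syscall_mask *m)
{
	memset(m, 0, sizeof(*m));
}

/* Allow one syscall number; out-of-range numbers are ignored. */
void mask_allow(struct syscall_mask *m, unsigned int nr)
{
	if (nr < MAX_SYSCALLS)
		m->bits[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

/* Check made at "syscall entry": out-of-range numbers are denied. */
bool mask_permits(const struct syscall_mask *m, unsigned int nr)
{
	if (nr >= MAX_SYSCALLS)
		return false;
	return (m->bits[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}
```

The default-deny start is the point: a process lists the handful of
calls it needs, and everything else stays unreachable.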

This is orthogonal to access control schemes (such as SELinux), which are 
about mediating security-relevant interactions between objects.

One area of possible use is KVM/Qemu, where processes now contain entire 
operating systems, and the attack surface between them is now much broader 
e.g. a local unprivileged vulnerability is now effectively a 'remote' full 
system compromise.

There has been some discussion of this within the KVM project.  Using the 
existing seccomp facility is problematic in that it requires significant 
reworking of Qemu to a privsep model, which would also then incur a likely 
unacceptable context switching overhead.  The generalized seccomp filter 
as proposed by Will would provide a significant reduction in exposed 
syscalls and thus guest->host attack surface.

I've cc'd some KVM folk for more input on how this may or may not meet 
their requirements -- Avi/Gleb, there's a background writeup here: 
http://lwn.net/Articles/442569/ .  We may need a proof of concept and/or 
commitment to use this feature for it to be accepted upstream.

- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26  1:19                                 ` James Morris
@ 2011-05-26  6:08                                   ` Avi Kivity
  2011-05-26  8:24                                   ` Ingo Molnar
  1 sibling, 0 replies; 91+ messages in thread
From: Avi Kivity @ 2011-05-26  6:08 UTC (permalink / raw)
  To: James Morris
  Cc: Linus Torvalds, Kees Cook, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Eric Paris

On 05/26/2011 04:19 AM, James Morris wrote:
> On Wed, 25 May 2011, Linus Torvalds wrote:
>
> > And per-system-call permissions are very dubious. What system calls
> > don't you want to succeed? That ioctl? You just made it impossible to
> > do a modern graphical application. Yet the kind of thing where we
> > would _want_ to help users is in making it easier to sandbox something
> > like the adobe flash player. But without accelerated direct rendering,
> > that's not going to fly, is it?
>
> Going back to the initial idea proposed by Will, where seccomp is simply
> extended to filter all syscalls, there is potential benefit in being able
> to limit the attack surface of the syscall API.
>
> This is not security mediation in terms of interaction between things
> (e.g. "allow A to read B").  It's a _hardening_ feature which prevents a
> process from being able to invoke potentially hundreds of syscalls it has
> no need for.  It would allow us to usefully restrict some well-established
> attack modes, e.g. triggering bugs in kernel code via unneeded syscalls.
>
> This is orthogonal to access control schemes (such as SELinux), which are
> about mediating security-relevant interactions between objects.
>
> One area of possible use is KVM/Qemu, where processes now contain entire
> operating systems, and the attack surface between them is now much broader
> e.g. a local unprivileged vulnerability is now effectively a 'remote' full
> system compromise.
>
> There has been some discussion of this within the KVM project.  Using the
> existing seccomp facility is problematic in that it requires significant
> reworking of Qemu to a privsep model, which would also then incur a likely
> unacceptable context switching overhead.  The generalized seccomp filter
> as proposed by Will would provide a significant reduction in exposed
> syscalls and thus guest->host attack surface.
>
> I've cc'd some KVM folk for more input on how this may or may not meet
> their requirements -- Avi/Gleb, there's a background writeup here:
> http://lwn.net/Articles/442569/ .  We may need a proof of concept and/or
> commitment to use this feature for it to be accepted upstream.

Indeed we were looking at sandboxing as a means to mitigate the "guest 
exploits qemu, proceeds to exploit host syscall interface" scenario, and 
evolved seccomp looks like the best tradeoff in terms of security gains 
vs effort needed.

Eric Paris (copied) prototyped this with his own version of enhanced 
seccomp and achieved pretty good results, so a proof of concept will be 
quite easy to provide.

Regarding dynamic filtering, the biggest question here is how this will 
interact with hotplug, which requires new files to be opened in the 
sandboxed process (or SCM_RIGHTed in).  Any fd-based filtering will 
defeat that, so we'll need some way for a privileged monitor to adjust 
filters.
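
For reference, handing an already-open descriptor to a sandboxed process
with SCM_RIGHTS looks roughly like the sketch below. The names send_fd(),
recv_fd() and union fd_cmsg are hypothetical; the calls themselves
(sendmsg, recvmsg and the CMSG_* macros) are the standard Unix-domain
socket interface.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized for exactly one descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;	/* forces suitable alignment for buf */
};

/* Send one open descriptor over a Unix-domain socket. Returns 0 or -1. */
int send_fd(int sock, int fd)
{
	char dummy = 'x';	/* one byte of ordinary data goes along */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor. Returns the new fd, or -1 on error. */
int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In the hotplug case the privileged monitor would be the send_fd() side
and the sandboxed process the recv_fd() side, which is why a filter keyed
to descriptor numbers fixed at setup time gets in the way.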

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26  1:19                                 ` James Morris
  2011-05-26  6:08                                   ` Avi Kivity
@ 2011-05-26  8:24                                   ` Ingo Molnar
  2011-05-26  8:35                                     ` Pekka Enberg
  2011-05-26  8:49                                     ` Avi Kivity
  1 sibling, 2 replies; 91+ messages in thread
From: Ingo Molnar @ 2011-05-26  8:24 UTC (permalink / raw)
  To: James Morris
  Cc: Linus Torvalds, Kees Cook, Thomas Gleixner, Peter Zijlstra,
	Will Drewry, Steven Rostedt, linux-kernel, Avi Kivity, gnatapov,
	Chris Wright, Pekka Enberg

* James Morris <jmorris@namei.org> wrote:

> On Wed, 25 May 2011, Linus Torvalds wrote:
> 
> > And per-system-call permissions are very dubious. What system 
> > calls don't you want to succeed? That ioctl? You just made it 
> > impossible to do a modern graphical application. Yet the kind of 
> > thing where we would _want_ to help users is in making it easier 
> > to sandbox something like the adobe flash player. But without 
> > accelerated direct rendering, that's not going to fly, is it?
> 
> Going back to the initial idea proposed by Will, where seccomp is 
> simply extended to filter all syscalls, there is potential benefit 
> in being able to limit the attack surface of the syscall API.

If controlling the system call boundary is found to be useful then 
the logical next step is to realize that limiting it to 
*only* the syscall boundary is shortsighted.

Also, here's a short reminder of the complexity evolution of this 
patch-set, which i've followed since it was first posted in 2009:

       bitmask (2009):  6 files changed,  194 insertions(+), 22 deletions(-)
 filter engine (2010): 18 files changed, 1100 insertions(+), 21 deletions(-)
 event filters (2011):  5 files changed,   82 insertions(+), 16 deletions(-)

Interestingly, the events version is *by far* the most flexible one 
in both the short and the long run, and it is also the smallest patch 
...

It's a perfect fit and that's not really surprising: system call 
boundary hardening is about filtering various key parameters - while 
event tracing is about filtering various key parameters as well.

But it goes further than that: SELinux security policies are in 
essence primitive event filters as well, on an abstract level - see 
below for more details.

And yes, the primitive, coarse, per syscall allow/disallow bitmask v1 
version would not be too painful to the core kernel in terms of code 
impact and interaction with other code (it does not interact at all) 
- but it would still be sadly shortsighted to not explore the event 
filters angle, now that we have actual working code.

It would not improve the LSM situation one tiny bit either - the 
bitmask design would guarantee that the seccomp approach can never 
seriously replace the sucky LSM concepts we have in the kernel today.

> This is not security mediation in terms of interaction between 
> things (e.g. "allow A to read B").  It's a _hardening_ feature 
> which prevents a process from being able to invoke potentially 
> hundreds of syscalls it has no need for.  It would allow us to 
> usefully restrict some well-established attack modes, e.g. 
> triggering bugs in kernel code via unneeded syscalls.

If you think about it then you'll see that this artificial 
distinction between 'mediation' and 'hardening' is nonsense!

If we add the appropriate file label field to VFS tracing events 
(which would be useful for many instrumentation reasons as well) then 
the event filtering variant of Will's patch:

   _will be able to do object level security mediation too_

What is at the core of every access control concept, be that DAC, 
MAC, RBAC or ACL? Flexible task specific set of access vectors to 
file and other labeled objects, which cannot be circumvented by that 
task.

How can we implement a user-space file object manager via Will's 
event filters approach? It's actually pretty easy:

 - a simple object manager wants to know 'who' does something, 'what'
   it is trying to access, and then wants to generate an allow/deny
   action as a function of the (who,what) parameters:

     - The 'who' is a given as the event filters are per
       task, so different tasks with different roles can have
       different event filters. This is the equivalent of the current
       task's security context. [ Event filters installed by the
       parent cannot be removed by child tasks (they cannot even read
       them - it's transparent). ]

     - The most finegrained representation of 'what' are inode
       numbers. Because we do not want to generate rules for every
       single object we want to group objects and want to define
       access rules on different groups. This can be done by defining
       an event that emits file labels.

So a simple object manager would simply use file label event 
attributes and would define simple rules like:

	"(label & tmp_t) || (label & user_home_t)"

to allow access to /tmp and /home files. Filters allow us to define 
arbitrary access vectors to objects in essence. The above filters get 
passed to the kernel as an ASCII string by the object manager task, 
where the filter engine parses it safely and creates atomic 
predicates out of it, which can then be executed at the source of any 
event.
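
The evaluation side of such a rule can be sketched in a few lines. This
is only a user-space model under stated assumptions: labels are encoded
as one bit each (the thread does not specify an encoding), the ASCII
parsing step is skipped, and the names LABEL_*, label_pred and
filter_allows() are invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed encoding: each label is one bit. Not specified in the thread. */
enum {
	LABEL_TMP_T       = 1u << 0,
	LABEL_USER_HOME_T = 1u << 1,
	LABEL_ETC_T       = 1u << 2,
};

/* One atomic predicate of the form "(label & mask)". */
struct label_pred {
	unsigned int mask;
};

/* A filter is an OR over its predicates; an empty filter denies. */
bool filter_allows(const struct label_pred *preds, size_t n,
		   unsigned int label)
{
	for (size_t i = 0; i < n; i++) {
		if (label & preds[i].mask)
			return true;
	}
	return false;
}
```

With the predicate array { LABEL_TMP_T, LABEL_USER_HOME_T } this
corresponds to the rule quoted above: an object labeled tmp_t or
user_home_t is allowed, anything else is denied.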

[ We could even implement a transparent AVC-cache equivalent for 
  filters, should the complexity and popularity of them increase: 
  ASCII filters lend themselves very well to hash based caches. ]

Similarly, support for other types of object tagging, network labels, 
etc. can be added as well with little pain: they can be added without 
any change to the basic ABI! Using events filters here makes it a 
very extensible security concept.

It is capable of implementing the functional equivalent of most MAC, 
DAC, RBAC and other access control concepts, purely in user-space - 
in addition to 'hardening' (which btw. is really access control too, 
in disguise).

Obviously it is all layered: it is only allowed to control access to 
objects all the other security concepts allow for it to access - i.e. 
this is an unprivileged LSM, a per application security layer if you 
will, that can further refine security policies.

In terms of security models this event filters approach is 
unconditional goodness in my opinion.

> This is orthogonal to access control schemes (such as SELinux), 
> which are about mediating security-relevant interactions between 
> objects.

It's only 'orthogonal' because IMO you make two fundamental mistakes:

 1) You arbitrarily limit SELinux to object based security measures 
    alone.

    Which is not even true btw: SELinux certainly has some hooks it
    uses for pragmatic non-object hardening: for example all the
    places where we fall back to capabilities are places where
    there's a method based restriction not object based restriction.

    The KDSKBENT ioctl check for example in 
    security/selinux/hooks.c::selinux_file_ioctl(), or 
    selinux_vm_enough_memory(), or the CAP_MAC_ADMIN exception in
    selinux_inode_getsecurity() all violate 'pure' DAC concepts but
    obviously for pragmatic reasons SELinux is doing these ...

    mmap_min_addr is a borderline method restriction feature as well:
    it does not really control access to the underlying object (RAM),
    but controls one (of many) access methods to it by controlling
    virtual memory ...

    So SELinux, in a rather hypocritical fashion is already involved 
    in hardening and in filtering, because obviously any practical 
    and pragmatic security system *has to*.

 2) You arbitrarily limit Will's patch to *not* be able to
    implement object based security mechanisms. Why?

Syscall hardening and object based access rules are *deeply* 
connected, conceptually they are subsets of one and the same thing: a 
good, organic security model controlling different hierarchies of 
physical and derived (virtual) resources, which allows flexible 
control of both objects *and* methods.

The 'methods' (the syscalls and other functionality) are *also* a 
derived resource so it's entirely legitimate to control access to 
them. Yes, because they are methods you can also try to use them to 
restrict access to underlying objects - this is what AppArmor is 
about mostly, and yes i agree that in the general case it's not a 
particularly robust method.

And yes, i fully submit that object access control has theoretical 
advantages and it should often be the primary measure that gives a 
robust, often provable backbone to a secure system.

But you'd be out of your mind to not recognize:

 - The utility of controlling access methods (as resources) as well, 
   both to reduce the attack surface in the implementation of those 
   methods, and to allow the easy summary control of objects where 
   there's only a low number of (and often only a single!) access 
   method.

 - The utility of unprivileged security frameworks.

 - The utility of stackable security features. (defense in depth, 
   anyone?)

Will's astonishingly small patch:

 event filters (2011):  5 files changed,   82 insertions(+), 16 deletions(-)

Gives us *all three* of those, while also allowing user-space 
implemented MAC, DAC, RBAC as well.

> One area of possible use is KVM/Qemu, where processes now contain 
> entire operating systems, and the attack surface between them is 
> now much broader e.g. a local unprivileged vulnerability is now 
> effectively a 'remote' full system compromise.

Note that the main reason why Qemu needs access method hardening is 
because it has a dominantly state machine based design which does not 
lend itself very well to an object manager security design.

Note that tools/kvm/ would probably like to implement its own object 
manager model as well in addition to access method restrictions: by 
being virtual hardware it deals with many resources and object 
hierarchies that are simply not known to the host OS's LSM.

Unlike Qemu tools/kvm/ has a design that is very fit for MAC 
concepts: it uses separate helper threads for separate resources 
(this could in many cases even be changed to be separate processes 
which only share access to the guest RAM image) - while Qemu is in 
most parts a state machine, so in tools/kvm/ we can realistically 
have a good object manager and keep an exploit in a networking 
interface driver from being able to access disk driver state.

(I've Cc:-ed Pekka for tools/kvm/.)

> There has been some discussion of this within the KVM project.  
> Using the existing seccomp facility is problematic in that it 
> requires significant reworking of Qemu to a privsep model, which 
> would also then incur a likely unacceptable context switching 
> overhead.  The generalized seccomp filter as proposed by Will would 
> provide a significant reduction in exposed syscalls and thus 
> guest->host attack surface.

... and the event filter based method would *also* allow MAC to be 
defined over physical resources, such as virtual network interfaces, 
virtual disk devices, etc.

You are seriously limiting the capabilities of this feature for no 
good reason i can recognize.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26  8:24                                   ` Ingo Molnar
@ 2011-05-26  8:35                                     ` Pekka Enberg
  2011-05-26  8:49                                     ` Avi Kivity
  1 sibling, 0 replies; 91+ messages in thread
From: Pekka Enberg @ 2011-05-26  8:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: James Morris, Linus Torvalds, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, Will Drewry, Steven Rostedt, linux-kernel,
	Avi Kivity, gnatapov, Chris Wright, Pekka Enberg, Sasha Levin

Hi Ingo,

On Thu, May 26, 2011 at 11:24 AM, Ingo Molnar <mingo@elte.hu> wrote:
> Unlike Qemu tools/kvm/ has a design that is very fit for MAC
> concepts: it uses separate helper threads for separate resources
> (this could in many cases even be changed to be separate processes
> which only share access to the guest RAM image) - while Qemu is in
> most parts a state machine, so in tools/kvm/ we can realistically
> have a good object manager and keep an exploit in a networking
> interface driver from being able to access disk driver state.

I haven't really followed this particular discussion nor do I know if
Qemu is a good or bad fit but sure, for tools/kvm Chrome-style
sandboxing makes tons of sense and would be a pretty good fit for how
our device model works now.

                        Pekka

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26  8:24                                   ` Ingo Molnar
  2011-05-26  8:35                                     ` Pekka Enberg
@ 2011-05-26  8:49                                     ` Avi Kivity
  2011-05-26  8:57                                       ` Pekka Enberg
  2011-05-26  9:30                                       ` Ingo Molnar
  1 sibling, 2 replies; 91+ messages in thread
From: Avi Kivity @ 2011-05-26  8:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: James Morris, Linus Torvalds, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Pekka Enberg

On 05/26/2011 11:24 AM, Ingo Molnar wrote:
> So a simple object manager would simply use file label event
> attributes and would define simple rules like:
>
> 	"(label & tmp_t) || (label & user_home_t)"

Filtering by label vs. filtering by descriptor would solve qemu's 
hotplug issue neatly.

> Note that tools/kvm/ would probably like to implement its own object
> manager model as well in addition to access method restrictions: by
> being virtual hardware it deals with many resources and object
> hierarchies that are simply not known to the host OS's LSM.
>
> Unlike Qemu tools/kvm/ has a design that is very fit for MAC
> concepts: it uses separate helper threads for separate resources
> (this could in many cases even be changed to be separate processes
> which only share access to the guest RAM image) - while Qemu is in
> most parts a state machine, so in tools/kvm/ we can realistically
> have a good object manager and keep an exploit in a networking
> interface driver from being able to access disk driver state.

You mean each thread will have a different security context?  I don't 
see the point.  All threads share all of memory so it would be trivial 
for one thread to exploit another and gain all of its privileges.

A multi process model works better but it has significant memory and 
performance overhead.

(well the memory overhead is much smaller when using transparent huge 
pages, but these only work for anonymous memory).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26  8:49                                     ` Avi Kivity
@ 2011-05-26  8:57                                       ` Pekka Enberg
       [not found]                                         ` <20110526085939.GG29458@redhat.com>
  2011-05-26  9:30                                       ` Ingo Molnar
  1 sibling, 1 reply; 91+ messages in thread
From: Pekka Enberg @ 2011-05-26  8:57 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, James Morris, Linus Torvalds, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, Will Drewry, Steven Rostedt,
	linux-kernel, gnatapov, Chris Wright, Pekka Enberg

Hi Avi,

On Thu, May 26, 2011 at 11:49 AM, Avi Kivity <avi@redhat.com> wrote:
> You mean each thread will have a different security context?  I don't see
> the point.  All threads share all of memory so it would be trivial for one
> thread to exploit another and gain all of its privileges.

So how would that happen? I'm assuming that once the security context
has been set up for a thread, you're not able to change it after that.
You'd be able to exploit other threads through shared memory but how
would you gain privileges?

                        Pekka

^ permalink raw reply	[flat|nested] 91+ messages in thread

[parent not found: <20110526085939.GG29458@redhat.com>]

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
       [not found]                                         ` <20110526085939.GG29458@redhat.com>
@ 2011-05-26 10:38                                           ` Ingo Molnar
  2011-05-26 10:46                                             ` Avi Kivity
  2011-05-26 10:46                                             ` Gleb Natapov
  0 siblings, 2 replies; 91+ messages in thread
From: Ingo Molnar @ 2011-05-26 10:38 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Pekka Enberg, Avi Kivity, James Morris, Linus Torvalds, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, Will Drewry, Steven Rostedt,
	linux-kernel, Chris Wright, Pekka Enberg


* Gleb Natapov <gleb@redhat.com> wrote:

> On Thu, May 26, 2011 at 11:57:51AM +0300, Pekka Enberg wrote:
> > Hi Avi,
> > 
> > On Thu, May 26, 2011 at 11:49 AM, Avi Kivity <avi@redhat.com> wrote:
> >
> > > You mean each thread will have a different security context?  I 
> > > don't see the point.  All threads share all of memory so it 
> > > would be trivial for one thread to exploit another and gain all 
> > > of its privileges.
> > 
> > So how would that happen? I'm assuming that once the security 
> > context has been set up for a thread, you're not able to change 
> > it after that. You'd be able to exploit other threads through 
> > shared memory but how would you gain privileges?
> 
> By tricking other threads to execute code for you. Just replace 
> return address on the other's thread stack.

That kind of exploit is not possible if the worker pool consists of 
processes - which would be rather easy to achieve with tools/kvm/.

In that model each process has its own stack, not accessible to other 
worker processes. They'd only share the guest RAM image and some 
(minimal) global state.

This way the individual devices are (optionally) isolated from each 
other. In a way this is a microkernel done right ;-)
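
The difference between the two models can be shown with a small
user-space sketch: a forked worker shares one explicitly mapped region
(standing in for the guest RAM image) but not its stack or other private
state. The function name worker_isolation_demo() is hypothetical and the
code has nothing to do with the actual tools/kvm/ sources.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict modes */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a "device worker" that writes to a shared mapping and to a
 * private variable, then report what the parent sees afterwards.
 * Returns 0 on success, -1 on error.
 */
int worker_isolation_demo(int *shared_seen, int *private_seen)
{
	int private_state = 1;
	int status;
	pid_t pid;
	int *guest_ram = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
			      MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (guest_ram == MAP_FAILED)
		return -1;
	*guest_ram = 1;

	pid = fork();
	if (pid < 0) {
		munmap(guest_ram, sizeof(int));
		return -1;
	}
	if (pid == 0) {
		*guest_ram = 2;		/* visible to the parent */
		private_state = 2;	/* changes only the child's copy */
		_exit(0);
	}

	if (waitpid(pid, &status, 0) != pid) {
		munmap(guest_ram, sizeof(int));
		return -1;
	}

	*shared_seen = *guest_ram;
	*private_seen = private_state;
	munmap(guest_ram, sizeof(int));
	return 0;
}
```

After the call the parent sees the worker's write to the shared region
but its own private_state is untouched, which is the property that keeps
one worker from rewriting another worker's return addresses.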

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 10:38                                           ` Ingo Molnar
@ 2011-05-26 10:46                                             ` Avi Kivity
  2011-05-26 10:46                                             ` Gleb Natapov
  1 sibling, 0 replies; 91+ messages in thread
From: Avi Kivity @ 2011-05-26 10:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gleb Natapov, Pekka Enberg, James Morris, Linus Torvalds,
	Kees Cook, Thomas Gleixner, Peter Zijlstra, Will Drewry,
	Steven Rostedt, linux-kernel, Chris Wright, Pekka Enberg

On 05/26/2011 01:38 PM, Ingo Molnar wrote:
> * Gleb Natapov <gleb@redhat.com> wrote:
>
> > On Thu, May 26, 2011 at 11:57:51AM +0300, Pekka Enberg wrote:
> > > Hi Avi,
> > >
> > > On Thu, May 26, 2011 at 11:49 AM, Avi Kivity <avi@redhat.com> wrote:
> > >
> > > > You mean each thread will have a different security context?  I
> > > > don't see the point.  All threads share all of memory so it
> > > > would be trivial for one thread to exploit another and gain all
> > > > of its privileges.
> > >
> > > So how would that happen? I'm assuming that once the security
> > > context has been set up for a thread, you're not able to change
> > > it after that. You'd be able to exploit other threads through
> > > shared memory but how would you gain privileges?
> >
> > By tricking other threads to execute code for you. Just replace
> > return address on the other's thread stack.
>
> That kind of exploit is not possible if the worker pool consists of
> processes - which would be rather easy to achieve with tools/kvm/.
>
> In that model each process has its own stack, not accessible to other
> worker processes. They'd only share the guest RAM image and some
> (minimal) global state.
>
> This way the individual devices are (optionally) isolated from each
> other. In a way this is a microkernel done right ;-)

It's really hard to achieve, since devices have global interactions.  
For example a PCI device can change the memory layout when a BAR is 
programmed.  So you would have a lot of message passing going on (not at 
runtime, so no huge impact on performance).  The programming model is 
very different.

Note that message passing is in fact quite a good way to model hardware, 
since what different devices actually do is pass messages to each 
other.  I expect if done this way, the device model would be better than 
what we have today.  But it's not an easy step away from threads.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 10:38                                           ` Ingo Molnar
  2011-05-26 10:46                                             ` Avi Kivity
@ 2011-05-26 10:46                                             ` Gleb Natapov
  2011-05-26 11:11                                               ` Ingo Molnar
  1 sibling, 1 reply; 91+ messages in thread
From: Gleb Natapov @ 2011-05-26 10:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Avi Kivity, James Morris, Linus Torvalds, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, Will Drewry, Steven Rostedt,
	linux-kernel, Chris Wright, Pekka Enberg

On Thu, May 26, 2011 at 12:38:36PM +0200, Ingo Molnar wrote:
> 
> * Gleb Natapov <gleb@redhat.com> wrote:
> 
> > On Thu, May 26, 2011 at 11:57:51AM +0300, Pekka Enberg wrote:
> > > Hi Avi,
> > > 
> > > On Thu, May 26, 2011 at 11:49 AM, Avi Kivity <avi@redhat.com> wrote:
> > >
> > > > You mean each thread will have a different security context?  I 
> > > > don't see the point.  All threads share all of memory so it 
> > > > would be trivial for one thread to exploit another and gain all 
> > > > of its privileges.
> > > 
> > > So how would that happen? I'm assuming that once the security 
> > > context has been set up for a thread, you're not able to change 
> > > it after that. You'd be able to exploit other threads through 
> > > shared memory but how would you gain privileges?
> > 
> > By tricking other threads to execute code for you. Just replace 
> > return address on the other's thread stack.
> 
> That kind of exploit is not possible if the worker pool consists of 
> processes - which would be rather easy to achieve with tools/kvm/.
> 
Well, of course. The original question was about threads.

> In that model each process has its own stack, not accessible to other 
> worker processes. They'd only share the guest RAM image and some 
> (minimal) global state.
> 
> This way the individual devices are (optionally) isolated from each 
> other. In a way this is a microkernel done right ;-)
> 
But doesn't this design suffer from the same problem as a microkernel?
Namely, a lot of slow IPCs?

--
			Gleb.

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 10:46                                             ` Gleb Natapov
@ 2011-05-26 11:11                                               ` Ingo Molnar
  0 siblings, 0 replies; 91+ messages in thread
From: Ingo Molnar @ 2011-05-26 11:11 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Pekka Enberg, Avi Kivity, James Morris, Linus Torvalds, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, Will Drewry, Steven Rostedt,
	linux-kernel, Chris Wright, Pekka Enberg


* Gleb Natapov <gleb@redhat.com> wrote:

> > In that model each process has its own stack, not accessible to 
> > other worker processes. They'd only share the guest RAM image and 
> > some (minimal) global state.
> > 
> > This way the individual devices are (optionally) isolated from 
> > each other. In a way this is a microkernel done right ;-)
> 
> But doesn't this design suffer the same problem as microkernel? 
> Namely a lot of slow IPCs?

Most of the IPCs we do already, to keep the devices separated from 
each other. So the most common type of IPC comes 'for free' in that 
model - and this is specific to virtualization so i'd not extend the 
claim to the host kernel.

virtio is an IPC mechanism to begin with.

It's certainly not entirely free though so if this is implemented in 
tools/kvm/ it should be configurable.

Thanks,

	Ingo

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26  8:49                                     ` Avi Kivity
  2011-05-26  8:57                                       ` Pekka Enberg
@ 2011-05-26  9:30                                       ` Ingo Molnar
  2011-05-26  9:48                                         ` Ingo Molnar
  2011-05-26 10:56                                         ` Avi Kivity
  1 sibling, 2 replies; 91+ messages in thread
From: Ingo Molnar @ 2011-05-26  9:30 UTC (permalink / raw)
  To: Avi Kivity
  Cc: James Morris, Linus Torvalds, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Pekka Enberg

* Avi Kivity <avi@redhat.com> wrote:

> > Note that tools/kvm/ would probably like to implement its own 
> > object manager model as well in addition to access method 
> > restrictions: by being virtual hardware it deals with many 
> > resources and object hierarchies that are simply not known to the 
> > host OS's LSM.
> >
> > Unlike Qemu tools/kvm/ has a design that is very fit for MAC 
> > concepts: it uses separate helper threads for separate resources 
> > (this could in many cases even be changed to be separate 
> > processes which only share access to the guest RAM image) - while 
> > Qemu is in most parts a state machine, so in tools/kvm/ we can 
> > realistically have a good object manager and keep an exploit in a 
> > networking interface driver from being able to access disk driver 
> > state.
> 
> You mean each thread will have a different security context?  I 
> don't see the point.  All threads share all of memory so it would 
> be trivial for one thread to exploit another and gain all of its 
> privileges.

You are missing the geniality of the tools/kvm/ thread pool! :-)

It could be switched to a worker *process* model rather easily. Guest 
RAM and (a limited amount of) global resources would be shared via 
mmap(SHARED), but otherwise each worker process would have its own 
stack, its own subsystem-specific state, etc.
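
A minimal sketch of that worker-process model, for illustration only
(this is not tools/kvm/ code; GUEST_RAM_SIZE and worker_process_demo()
are made-up names). It assumes the guest RAM image is a MAP_SHARED
mapping created before fork(), so the worker's store to guest RAM is
visible to the parent while its store to ordinary private state is not:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define GUEST_RAM_SIZE	(1UL << 20)	/* hypothetical 1 MiB guest RAM image */

/*
 * Returns 0 if the parent sees the worker's write to the shared guest
 * RAM but not its write to private state, 1 if not, -1 on setup errors.
 */
int worker_process_demo(void)
{
	volatile int private_state = 1;	/* private: copy-on-write after fork() */
	uint8_t *guest_ram;
	int status, ok;
	pid_t pid;

	guest_ram = mmap(NULL, GUEST_RAM_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (guest_ram == MAP_FAILED)
		return -1;

	pid = fork();
	if (pid < 0) {
		munmap(guest_ram, GUEST_RAM_SIZE);
		return -1;
	}
	if (pid == 0) {
		/* Worker process: own stack, own copy of private_state. */
		guest_ram[0] = 0xab;	/* lands in the shared image */
		private_state = 2;	/* only changes the worker's copy */
		_exit(0);
	}

	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status) ||
	    WEXITSTATUS(status) != 0) {
		munmap(guest_ram, GUEST_RAM_SIZE);
		return -1;
	}

	ok = guest_ram[0] == 0xab && private_state == 1;
	munmap(guest_ram, GUEST_RAM_SIZE);
	return ok ? 0 : 1;
}
```

Calling worker_process_demo() should return 0 on Linux. A real worker
would stay alive and service one device rather than exit immediately.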

Exploiting other device domains via the shared guest RAM image is not 
possible, we treat guest RAM as untrusted data already.

Devices, like real hardware devices, are functionally pretty 
independent from each other, so this security model is rather natural 
and makes a lot of sense.

> A multi process model works better but it has significant memory 
> and performance overhead.

Not in Linux :-) We context-switch between processes almost as 
quickly as we do between threads. With modern tagged TLB hardware 
it's even faster.

> (well the memory overhead is much smaller when using transparent 
> huge pages, but these only work for anonymous memory).

The biggest amount of RAM is the guest RAM image - but if that is 
mmap(SHARED) and mapped using hugepages then the pte overhead from a 
process model is largely mitigated.

Once we have a process model then isolation and MAC between devices 
becomes a very real possibility: exploit via one network interface 
cannot break into a disk interface.

Maybe even the isolation and per device access control of 
*same-class* devices from each other is possible: with careful 
implementation of the subsystem shared data structures. (which isn't 
much really)

Thanks,

	Ingo

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26  9:30                                       ` Ingo Molnar
@ 2011-05-26  9:48                                         ` Ingo Molnar
  2011-05-26 11:02                                           ` Avi Kivity
  2011-05-26 10:56                                         ` Avi Kivity
  1 sibling, 1 reply; 91+ messages in thread
From: Ingo Molnar @ 2011-05-26  9:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: James Morris, Linus Torvalds, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Pekka Enberg

* Ingo Molnar <mingo@elte.hu> wrote:

> You are missing the geniality of the tools/kvm/ thread pool! :-)
> 
> It could be switched to a worker *process* model rather easily. 
> Guest RAM and (a limited amount of) global resources would be 
> shared via mmap(SHARED), but otherwise each worker process would 
> have its own stack, its own subsystem-specific state, etc.

We get VM exit events in the vcpu threads which after minimal 
processing pass much of the work to the thread pool. Most of the 
virtio work (which could be a source of vulnerability - ringbuffers 
are hard) is done in the worker task context.

It would be possible to further increase isolation there by also 
passing the IO/MMIO decoding to the worker thread - but i'm not sure 
that's truly needed. Most of the risk is where most of the code is - 
and the code is in the worker task which interprets on-disk data, 
protocols, etc.

So we could not only isolate devices from each other, but we could 
also protect the highly capable vcpu fd from exploits in devices - 
worker threads generally do not need access to the vcpu fd IIRC.

Thanks,

	Ingo

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26  9:48                                         ` Ingo Molnar
@ 2011-05-26 11:02                                           ` Avi Kivity
  2011-05-26 11:16                                             ` Ingo Molnar
  0 siblings, 1 reply; 91+ messages in thread
From: Avi Kivity @ 2011-05-26 11:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: James Morris, Linus Torvalds, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Pekka Enberg

On 05/26/2011 12:48 PM, Ingo Molnar wrote:
> * Ingo Molnar<mingo@elte.hu>  wrote:
>
> >  You are missing the geniality of the tools/kvm/ thread pool! :-)
> >
> >  It could be switched to a worker *process* model rather easily.
> >  Guest RAM and (a limited amount of) global resources would be
> >  shared via mmap(SHARED), but otherwise each worker process would
> >  have its own stack, its own subsystem-specific state, etc.
>
> We get VM exit events in the vcpu threads which after minimal
> processing pass much of the work to the thread pool. Most of the
> virtio work (which could be a source of vulnerability - ringbuffers
> are hard) is done in the worker task context.
>
> It would be possible to further increase isolation there by also
> passing the IO/MMIO decoding to the worker thread - but i'm not sure
> that's truly needed. Most of the risk is where most of the code is -
> and the code is in the worker task which interprets on-disk data,
> protocols, etc.

I've suggested in the past to add an "mmiofd" facility to kvm, similar 
to ioeventfd.  This is how it would work:

- userspace configures kvm with an mmio range and a pipe
- guest writes to that range write a packet to the pipe describing the write
- guest reads from that range write a packet to the pipe describing the 
read, then wait for a reply packet with the result

The advantages would be
- avoid heavyweight exit; kvm can simply wake up a thread on another 
core and resume processing
- writes can be pipelined, similar to how PCI writes are posted
- supports process separation

So far no one has posted an implementation but it should be pretty simple.
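
As an illustration of the packet flow listed above, here is a small
userspace sketch. Nothing in it is a real KVM interface: struct
mmio_packet, the op codes, and the use of a second pipe for replies are
assumptions made for the example, and the "kvm side" and "device side"
run in one process for brevity. A write is posted with no reply; a read
sends a request and then waits for a reply packet:

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wire format, not an existing KVM ABI. */
enum mmio_op { MMIO_WRITE = 1, MMIO_READ = 2, MMIO_READ_REPLY = 3 };

struct mmio_packet {
	uint32_t op;	/* one of enum mmio_op */
	uint32_t len;	/* access width in bytes */
	uint64_t addr;	/* guest physical address */
	uint64_t data;	/* value written, or value read in a reply */
};

static int send_packet(int fd, const struct mmio_packet *p)
{
	return write(fd, p, sizeof(*p)) == (ssize_t)sizeof(*p) ? 0 : -1;
}

static int recv_packet(int fd, struct mmio_packet *p)
{
	return read(fd, p, sizeof(*p)) == (ssize_t)sizeof(*p) ? 0 : -1;
}

/* Device side: handle one request against a single 64-bit register. */
static int device_handle_one(int req_fd, int reply_fd, uint64_t *reg)
{
	struct mmio_packet p;

	if (recv_packet(req_fd, &p))
		return -1;
	if (p.op == MMIO_WRITE) {
		*reg = p.data;		/* posted write: no reply */
		return 0;
	}
	if (p.op == MMIO_READ) {
		p.op = MMIO_READ_REPLY;
		p.data = *reg;
		return send_packet(reply_fd, &p);
	}
	return -1;
}

/* "kvm side": write a value to the range, then read it back. */
int mmiofd_demo(uint64_t value, uint64_t *readback)
{
	struct mmio_packet p = {
		.op = MMIO_WRITE, .len = 8, .addr = 0x1000, .data = value,
	};
	int req[2], reply[2], ret = -1;
	uint64_t reg = 0;

	if (pipe(req))
		return -1;
	if (pipe(reply)) {
		close(req[0]);
		close(req[1]);
		return -1;
	}

	if (send_packet(req[1], &p) ||
	    device_handle_one(req[0], reply[1], &reg))
		goto out;

	p.op = MMIO_READ;
	p.data = 0;
	if (send_packet(req[1], &p) ||
	    device_handle_one(req[0], reply[1], &reg) ||
	    recv_packet(reply[0], &p) || p.op != MMIO_READ_REPLY)
		goto out;

	*readback = p.data;
	ret = 0;
out:
	close(req[0]);
	close(req[1]);
	close(reply[0]);
	close(reply[1]);
	return ret;
}
```

With process separation the device side would sit in its own process
holding only its ends of the pipes, with no access to the vcpu fd.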

> So we could not only isolate devices from each other, but we could
> also protect the highly capable vcpu fd from exploits in devices -
> worker threads generally do not need access to the vcpu fd IIRC.

Yes.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 11:02                                           ` Avi Kivity
@ 2011-05-26 11:16                                             ` Ingo Molnar
  0 siblings, 0 replies; 91+ messages in thread
From: Ingo Molnar @ 2011-05-26 11:16 UTC (permalink / raw)
  To: Avi Kivity
  Cc: James Morris, Linus Torvalds, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Pekka Enberg


* Avi Kivity <avi@redhat.com> wrote:

> > It would be possible to further increase isolation there by also 
> > passing the IO/MMIO decoding to the worker thread - but i'm not 
> > sure that's truly needed. Most of the risk is where most of the 
> > code is - and the code is in the worker task which interprets 
> > on-disk data, protocols, etc.
> 
> I've suggested in the past to add an "mmiofd" facility to kvm, 
> similar to ioeventfd.  This is how it would work:
> 
> - userspace configures kvm with an mmio range and a pipe
> - guest writes to that range write a packet to the pipe describing the write
> - guest reads from that range write a packet to the pipe describing
> the read, then wait for a reply packet with the result
> 
> The advantages would be
> - avoid heavyweight exit; kvm can simply wake up a thread on another
> core and resume processing
> - writes can be pipelined, similar to how PCI writes are posted
> - supports process separation

Yes, that was my exact thought, a per transport channel fd.

> So far no one has posted an implementation but it should be pretty 
> simple.

tools/kvm/ could make quick use of it - and it's a performance 
optimization mainly IMO, not primarily a security feature.

If you whip up a quick untested prototype for the KVM side we could 
look into adding tooling support for it and could test it.

As long as it's provided as an opt-in ioctl(), so that if it fails (on 
older kernels) we fall back to the vcpu-fd, it should be relatively 
straightforward to support on the tooling side as well AFAICS.

Thanks,

	Ingo

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26  9:30                                       ` Ingo Molnar
  2011-05-26  9:48                                         ` Ingo Molnar
@ 2011-05-26 10:56                                         ` Avi Kivity
  2011-05-26 11:38                                           ` Ingo Molnar
  1 sibling, 1 reply; 91+ messages in thread
From: Avi Kivity @ 2011-05-26 10:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: James Morris, Linus Torvalds, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Pekka Enberg

On 05/26/2011 12:30 PM, Ingo Molnar wrote:
> * Avi Kivity<avi@redhat.com>  wrote:
>
> >  >  Note that tools/kvm/ would probably like to implement its own
> >  >  object manager model as well in addition to access method
> >  >  restrictions: by being virtual hardware it deals with many
> >  >  resources and object hierarchies that are simply not known to the
> >  >  host OS's LSM.
> >  >
> >  >  Unlike Qemu tools/kvm/ has a design that is very fit for MAC
> >  >  concepts: it uses separate helper threads for separate resources
> >  >  (this could in many cases even be changed to be separate
> >  >  processes which only share access to the guest RAM image) - while
> >  >  Qemu is in most parts a state machine, so in tools/kvm/ we can
> >  >  realistically have a good object manager and keep an exploit in a
> >  >  networking interface driver from being able to access disk driver
> >  >  state.
> >
> >  You mean each thread will have a different security context?  I
> >  don't see the point.  All threads share all of memory so it would
> >  be trivial for one thread to exploit another and gain all of its
> >  privileges.
>
> You are missing the geniality of the tools/kvm/ thread pool! :-)

I'm sure the thread pool is very general, but the hardware we're 
modelling is not.

> It could be switched to a worker *process* model rather easily. Guest
> RAM and (a limited amount of) global resources would be shared via
> mmap(SHARED), but otherwise each worker process would have its own
> stack, its own subsystem-specific state, etc.

Suppose a guest reconfigures a device's MSI page, and suppose that's 
handled by the device's process.  Now it's not sufficient to update some 
global state, you have to go and tell the host kernel about it.  With 
good privilege separation the device process would not be permitted to 
do that; now it has to pass a message to a process that is.

Same thing applies for BARs, reset signals, live migration, etc.

> Exploiting other device domains via the shared guest RAM image is not
> possible, we treat guest RAM as untrusted data already.

Right.

> Devices, like real hardware devices, are functionally pretty
> independent from each other, so this security model is rather natural
> and makes a lot of sense.

When just pushing packets, you are right.  However setup/configuration 
is hardly clean.

Consider a CD-ROM eject, for example.  Now it can't be done by a simple 
callback.

> >  A multi process model works better but it has significant memory
> >  and performance overhead.
>
> Not in Linux :-) We context-switch between processes almost as
> quickly as we do between threads. With modern tagged TLB hardware
> it's even faster.

Once we get PCID in, yes.  There's still the message passing overhead, 
and unnecessary context switches.  In a threaded model you can choose 
whether to switch threads or not, in a process model you cannot.

> >  (well the memory overhead is much smaller when using transparent
> >  huge pages, but these only work for anonymous memory).
>
> The biggest amount of RAM is the guest RAM image - but if that is
> mmap(SHARED) and mapped using hugepages then the pte overhead from a
> process model is largely mitigated.

That doesn't work with memory hotplug.

> Once we have a process model then isolation and MAC between devices
> becomes a very real possibility: exploit via one network interface
> cannot break into a disk interface.

Yes, certainly.

> Maybe even the isolation and per device access control of
> *same-class* devices from each other is possible: with careful
> implementation of the subsystem shared data structures. (which isn't
> much really)

Right, hardly at all in fact.  The problem comes from the side-band 
issues like reset, interrupts, hotplug, and whatnot.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 10:56                                         ` Avi Kivity
@ 2011-05-26 11:38                                           ` Ingo Molnar
  2011-05-26 18:06                                             ` Avi Kivity
  0 siblings, 1 reply; 91+ messages in thread
From: Ingo Molnar @ 2011-05-26 11:38 UTC (permalink / raw)
  To: Avi Kivity
  Cc: James Morris, Linus Torvalds, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Pekka Enberg

* Avi Kivity <avi@redhat.com> wrote:

> > The biggest amount of RAM is the guest RAM image - but if that is 
> > mmap(SHARED) and mapped using hugepages then the pte overhead 
> > from a process model is largely mitigated.
> 
> That doesn't work with memory hotplug.

Why not, if we do the sensible thing and restrict the size 
granularity and alignment of plugged/unplugged memory regions to 2MB?
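
To make the 2MB restriction concrete, a sketch of the check a tool could
apply to a hotplug request. The helper names are invented for this
example and the 2MB figure is the x86 hugepage size assumed above:

```c
#include <stdbool.h>
#include <stdint.h>

#define HUGEPAGE_SIZE	(2ULL << 20)	/* 2MB, as suggested above */

/* True if the region can be backed entirely by 2MB hugepages. */
bool hotplug_region_ok(uint64_t start, uint64_t size)
{
	return size != 0 &&
	       (start & (HUGEPAGE_SIZE - 1)) == 0 &&
	       (size & (HUGEPAGE_SIZE - 1)) == 0;
}

/* Round a requested size up to the next 2MB multiple. */
uint64_t hotplug_round_up(uint64_t size)
{
	return (size + HUGEPAGE_SIZE - 1) & ~(HUGEPAGE_SIZE - 1);
}
```

A request that fails hotplug_region_ok() would be rejected or rounded
before the region is mapped.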

We can fix guest Linux as well to not be stupid about the sizing of 
memory hotplug requests. It does hotplug based on the memory map we 
pass to it anyway.

Am i missing something obvious here?

> > Maybe even the isolation and per device access control of 
> > *same-class* devices from each other is possible: with careful 
> > implementation of the subsystem shared data structures. (which 
> > isn't much really)
> 
> Right, hardly at all in fact.  The problem comes from the side-band 
> issues like reset, interrupts, hotplug, and whatnot.

Yeah. There are two good aspects here i think:

 - The sideband IPC overhead does not matter much, it's a side band.

 - Spending the effort to isolate configuration details is worth it: 
   sideband code is a primary breeding ground for bugs and security 
   holes.

The main worry to me would be the maintainability difference: does it 
result in much more complex code? As always i'm cautiously optimistic 
about that: i think once we try it we can find a suitable model ... 
It might even turn out to be more readable and more flexible in the 
end.

Thanks,

	Ingo

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 11:38                                           ` Ingo Molnar
@ 2011-05-26 18:06                                             ` Avi Kivity
  2011-05-26 18:15                                               ` Ingo Molnar
  0 siblings, 1 reply; 91+ messages in thread
From: Avi Kivity @ 2011-05-26 18:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: James Morris, Linus Torvalds, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Pekka Enberg

On 05/26/2011 02:38 PM, Ingo Molnar wrote:
> * Avi Kivity<avi@redhat.com>  wrote:
>
> >  >  The biggest amount of RAM is the guest RAM image - but if that is
> >  >  mmap(SHARED) and mapped using hugepages then the pte overhead
> >  >  from a process model is largely mitigated.
> >
> >  That doesn't work with memory hotplug.
>
> Why not, if we do the sensible thing and restrict the size
> granularity and alignment of plugged/unplugged memory regions to 2MB?

Once forked, you cannot have new shared anonymous memory, can you?

> We can fix guest Linux as well to not be stupid about the sizing of
> memory hotplug requests. It does hotplug based on the memory map we
> pass to it anyway.
>
> Am i missing something obvious here?

Yes, the new mmap() will be only visible in the calling process.

> >  >  Maybe even the isolation and per device access control of
> >  >  *same-class* devices from each other is possible: with careful
> >  >  implementation of the subsystem shared data structures. (which
> >  >  isn't much really)
> >
> >  Right, hardly at all in fact.  The problem comes from the side-band
> >  issues like reset, interrupts, hotplug, and whatnot.
>
> Yeah. There are two good aspects here i think:
>
>   - The sideband IPC overhead does not matter much, it's a side band.
>
>   - Spending the effort to isolate configuration details is worth it:
>     sideband code is a primary breeding ground for bugs and security
>     holes.
>
> The main worry to me would be the maintainability difference: does it
> result in much more complex code? As always i'm cautiously optimistic
> about that: i think once we try it we can find a suitable model ...
> It might even turn out to be more readable and more flexible in the
> end.

I also believe it will be more maintainable, especially if written in a 
language that has explicit support for message passing (e.g. Erlang).  
This is because it is more similar to how hardware actually works.  
However it needs to be designed in, it's not just a matter of switching 
a thread to a process.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 18:06                                             ` Avi Kivity
@ 2011-05-26 18:15                                               ` Ingo Molnar
  2011-05-26 18:20                                                 ` Avi Kivity
  2011-05-26 18:22                                                 ` Peter Zijlstra
  0 siblings, 2 replies; 91+ messages in thread
From: Ingo Molnar @ 2011-05-26 18:15 UTC (permalink / raw)
  To: Avi Kivity
  Cc: James Morris, Linus Torvalds, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Pekka Enberg

* Avi Kivity <avi@redhat.com> wrote:

> On 05/26/2011 02:38 PM, Ingo Molnar wrote:
> >* Avi Kivity<avi@redhat.com>  wrote:
> >
> >>  >  The biggest amount of RAM is the guest RAM image - but if that is
> >>  >  mmap(SHARED) and mapped using hugepages then the pte overhead
> >>  >  from a process model is largely mitigated.
> >>
> >>  That doesn't work with memory hotplug.
> >
> > Why not, if we do the sensible thing and restrict the size 
> > granularity and alignment of plugged/unplugged memory regions to 
> > 2MB?
> 
> Once forked, you cannot have new shared anonymous memory, can you?

We can have named shared memory.

Incidentally i suggested this to Pekka just yesterday: i think we 
should consider guest RAM images to be named files on the local 
filesystem (prefixed with the disk image's name or so, for easy 
identification), this will help with debugging and with swapping as 
well. (This way guest RAM wont eat up regular anonymous swap space - 
it will be swapped to the filesystem.)
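
A rough sketch of how a named backing file gets around the fork()
limitation raised above. All names here (the /tmp/guest-ram-XXXXXX
template, named_region_demo(), the one-message pipe protocol) are
assumptions for the example. The region is created after the fork, and
the already-running worker maps it by name once the parent tells it
the path:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define REGION_SIZE	4096	/* a real region would be 2MB aligned */

/* Map an existing backing file by name; returns NULL on failure. */
static uint8_t *map_named_region(const char *path)
{
	uint8_t *p;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return NULL;
	p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after close() */
	return p == MAP_FAILED ? NULL : p;
}

/* Returns 0 if a region created after fork() is shared with the worker. */
int named_region_demo(void)
{
	char path[] = "/tmp/guest-ram-XXXXXX";	/* hypothetical naming scheme */
	int chan[2], status, fd = -1, sent = 0, ret = -1;
	uint8_t *ram = NULL;
	pid_t pid;

	if (pipe(chan))
		return -1;
	pid = fork();
	if (pid < 0) {
		close(chan[0]);
		close(chan[1]);
		return -1;
	}
	if (pid == 0) {
		/* Worker: wait for the path, then map the new region. */
		char name[sizeof(path)];
		uint8_t *r;

		close(chan[1]);
		if (read(chan[0], name, sizeof(name)) != (ssize_t)sizeof(name))
			_exit(1);
		r = map_named_region(name);
		if (!r || r[0] != 0x5a)
			_exit(2);
		r[1] = 0xa5;	/* visible to the parent via the file */
		_exit(0);
	}
	close(chan[0]);

	/* "Hotplug": the backing file is created only after the fork. */
	fd = mkstemp(path);
	if (fd < 0 || ftruncate(fd, REGION_SIZE))
		goto out;
	ram = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (ram == MAP_FAILED) {
		ram = NULL;
		goto out;
	}
	ram[0] = 0x5a;
	if (write(chan[1], path, sizeof(path)) == (ssize_t)sizeof(path))
		sent = 1;
out:
	close(chan[1]);	/* EOF unblocks the worker if nothing was sent */
	if (waitpid(pid, &status, 0) == pid && sent && WIFEXITED(status) &&
	    WEXITSTATUS(status) == 0 && ram[1] == 0xa5)
		ret = 0;
	if (ram)
		munmap(ram, REGION_SIZE);
	if (fd >= 0) {
		close(fd);
		unlink(path);
	}
	return ret;
}
```

The sketch maps a plain file with ordinary pages, so it says nothing
about whether such a mapping can be backed by hugepages.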

As a sidenote, live migration might also become possible this way: in 
theory we could freeze a guest to its RAM image - which can then be 
copied (together with the disk image) to another box as files and 
restarted there, with some hw configuration state dumped to a 
header portion of that RAM image as well. (outside of the RAM area)

Thanks,

	Ingo

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 18:15                                               ` Ingo Molnar
@ 2011-05-26 18:20                                                 ` Avi Kivity
  2011-05-26 18:36                                                   ` Ingo Molnar
  2011-05-26 18:22                                                 ` Peter Zijlstra
  1 sibling, 1 reply; 91+ messages in thread
From: Avi Kivity @ 2011-05-26 18:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: James Morris, Linus Torvalds, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Pekka Enberg

On 05/26/2011 09:15 PM, Ingo Molnar wrote:
> * Avi Kivity<avi@redhat.com>  wrote:
>
> >  On 05/26/2011 02:38 PM, Ingo Molnar wrote:
> >  >* Avi Kivity<avi@redhat.com>   wrote:
> >  >
> >  >>   >   The biggest amount of RAM is the guest RAM image - but if that is
> >  >>   >   mmap(SHARED) and mapped using hugepages then the pte overhead
> >  >>   >   from a process model is largely mitigated.
> >  >>
> >  >>   That doesn't work with memory hotplug.
> >  >
> >  >  Why not, if we do the sensible thing and restrict the size
> >  >  granularity and alignment of plugged/unplugged memory regions to
> >  >  2MB?
> >
> >  Once forked, you cannot have new shared anonymous memory, can you?
>
> We can have named shared memory.

But then the benefit of transparent huge pages goes away.

Of course, if someone is working on extending transparent hugepages, the 
problem is solved.  I know there is interest in this.

> Incidentally i suggested this to Pekka just yesterday: i think we
> should consider guest RAM images to be named files on the local
> filesystem (prefixed with the disk image's name or so, for easy
> identification), this will help with debugging and with swapping as
> well. (This way guest RAM wont eat up regular anonymous swap space -
> it will be swapped to the filesystem.)

Qemu supports this via -mem-path.  The motivation was supporting 
hugetlbfs, before THP.  I can't say it was useful for debugging (but 
then qemu has a built in memory inspector and debugger, and supports 
attaching gdb to the guest).

> As a sidenote, live migration might also become possible this way: in
> theory we could freeze a guest to its RAM image - which can then be
> copied (together with the disk image) to another box as files and
> restarted there, with some hw configuration state dumped to a
> header portion of that RAM image as well. (outside of the RAM area)

Live migration involves the guest running in parallel with its memory 
being copied over.  Even a 1GB guest will take 10s over 1GbE; any 
reasonably sized guest will take forever.
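
The arithmetic behind that estimate, as a back-of-the-envelope sketch
(the function name is made up, and protocol overhead is ignored, so
this is a lower bound rather than a measurement):

```c
#include <stdint.h>

/* Ideal time to push a RAM image over a link, ignoring all overhead. */
double stop_and_copy_seconds(uint64_t ram_bytes, double link_bits_per_sec)
{
	return (double)ram_bytes * 8.0 / link_bits_per_sec;
}
```

For 1 GiB at 1 Gbit/s this gives roughly 8.6 seconds, in line with the
10s figure once overhead is added, and it scales linearly with guest
size.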

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 18:20                                                 ` Avi Kivity
@ 2011-05-26 18:36                                                   ` Ingo Molnar
  2011-05-26 18:43                                                     ` Valdis.Kletnieks
  0 siblings, 1 reply; 91+ messages in thread
From: Ingo Molnar @ 2011-05-26 18:36 UTC (permalink / raw)
  To: Avi Kivity
  Cc: James Morris, Linus Torvalds, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Pekka Enberg


* Avi Kivity <avi@redhat.com> wrote:

> Live migration involves the guest running in parallel with its 
> memory being copied over.  Even a 1GB guest will take 10s over 
> 1GbE; any reasonably sized guest will take forever.

I suspect we are really offtopic here, but an initial rsync, then 
stopping the guest, final rsync + restart the guest at the target 
would work with minimal interruption.

Thanks,

	Ingo

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 18:36                                                   ` Ingo Molnar
@ 2011-05-26 18:43                                                     ` Valdis.Kletnieks
  2011-05-26 18:50                                                       ` Ingo Molnar
  0 siblings, 1 reply; 91+ messages in thread
From: Valdis.Kletnieks @ 2011-05-26 18:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, James Morris, Linus Torvalds, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, Will Drewry, Steven Rostedt,
	linux-kernel, gnatapov, Chris Wright, Pekka Enberg

On Thu, 26 May 2011 20:36:35 +0200, Ingo Molnar said:

> I suspect we are really offtopic here, but an initial rsync, then 
> stopping the guest, final rsync + restart the guest at the target 
> would work with minimal interruption.

Actually, after you kick off the migrate, you really want to be tracking in
real time what pages get dirtied while you're doing the initial copy, so that
the second rsync doesn't have to walk through the file finding the differences.
But that requires some extra instrumentation.
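
A toy model of that dirty tracking, to show the idea only. Nothing here
is a KVM or rsync interface: struct guest, the tiny page size, and the
function names are assumptions, and a real implementation would get the
dirty bits from the hypervisor rather than from guest_write(). The
first pass copies everything while the guest keeps running; the final
pass, after the guest is stopped, copies only the pages dirtied in the
meantime:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NPAGES	64
#define PAGE_SZ	64	/* tiny pages, to keep the sketch small */

struct guest {
	uint8_t ram[NPAGES][PAGE_SZ];
	bool dirty[NPAGES];	/* set on every guest write */
};

/* Stand-in for the instrumentation: a write marks its page dirty. */
void guest_write(struct guest *g, unsigned int page, uint8_t val)
{
	memset(g->ram[page], val, PAGE_SZ);
	g->dirty[page] = true;
}

/* Copy only the dirty pages, clear their flags, return how many. */
unsigned int copy_dirty(struct guest *g, uint8_t dst[NPAGES][PAGE_SZ])
{
	unsigned int i, copied = 0;

	for (i = 0; i < NPAGES; i++) {
		if (!g->dirty[i])
			continue;
		g->dirty[i] = false;
		memcpy(dst[i], g->ram[i], PAGE_SZ);
		copied++;
	}
	return copied;
}

/* Returns 0 if the target matches the source after the final pass. */
int precopy_demo(unsigned int *final_pages)
{
	static struct guest g;
	static uint8_t dst[NPAGES][PAGE_SZ];
	unsigned int i;

	memset(&g, 0, sizeof(g));
	memset(dst, 0xff, sizeof(dst));
	for (i = 0; i < NPAGES; i++)
		g.dirty[i] = true;	/* everything needs sending once */

	copy_dirty(&g, dst);		/* initial copy, guest still running */
	guest_write(&g, 3, 0x11);	/* guest dirties two pages meanwhile */
	guest_write(&g, 40, 0x22);

	/* Guest stopped here: the final pass sends only what changed. */
	*final_pages = copy_dirty(&g, dst);
	return memcmp(g.ram, dst, sizeof(dst)) == 0 ? 0 : 1;
}
```

In this run the final pass copies 2 pages out of 64 instead of
rescanning the whole image.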


* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 18:43                                                     ` Valdis.Kletnieks
@ 2011-05-26 18:50                                                       ` Ingo Molnar
  0 siblings, 0 replies; 91+ messages in thread
From: Ingo Molnar @ 2011-05-26 18:50 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Avi Kivity, James Morris, Linus Torvalds, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, Will Drewry, Steven Rostedt,
	linux-kernel, gnatapov, Chris Wright, Pekka Enberg


* Valdis.Kletnieks@vt.edu <Valdis.Kletnieks@vt.edu> wrote:

> On Thu, 26 May 2011 20:36:35 +0200, Ingo Molnar said:
> 
> > I suspect we are really offtopic here, but an initial rsync, then 
> > stopping the guest, final rsync + restart the guest at the target 
> > would work with minimal interruption.
> 
> Actually, after you kick off the migrate, you really want to be 
> tracking in real time what pages get dirtied while you're doing the 
> initial copy, so that the second rsync doesn't have to walk through 
> the file finding the differences.  But that requires some extra 
> instrumentation.

Yeah, and that's how socket based live migration works - it's 
completely seamless.

But note that the rsync re-scan should not be an issue: both the 
source and the target system will obviously have a *lot* more RAM 
than the guest RAM image size.

Thanks,

	Ingo


* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 18:15                                               ` Ingo Molnar
  2011-05-26 18:20                                                 ` Avi Kivity
@ 2011-05-26 18:22                                                 ` Peter Zijlstra
  2011-05-26 18:38                                                   ` Ingo Molnar
  1 sibling, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2011-05-26 18:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, James Morris, Linus Torvalds, Kees Cook,
	Thomas Gleixner, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Pekka Enberg

On Thu, 2011-05-26 at 20:15 +0200, Ingo Molnar wrote:
> Incidentally i suggested this to Pekka just yesterday: i think we 
> should consider guest RAM images to be named files on the local 
> filesystem (prefixed with the disk image's name or so, for easy 
> identification), 

That'll break THP and KSM; both rely on, and only work with, anonymous memory.



* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 18:22                                                 ` Peter Zijlstra
@ 2011-05-26 18:38                                                   ` Ingo Molnar
  2011-05-27  0:12                                                     ` James Morris
  0 siblings, 1 reply; 91+ messages in thread
From: Ingo Molnar @ 2011-05-26 18:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, James Morris, Linus Torvalds, Kees Cook,
	Thomas Gleixner, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Pekka Enberg


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, 2011-05-26 at 20:15 +0200, Ingo Molnar wrote:
>
> > Incidentally i suggested this to Pekka just yesterday: i think we 
> > should consider guest RAM images to be named files on the local 
> > filesystem (prefixed with the disk image's name or so, for easy 
> > identification),
> 
> That'll break THP and KSM; both rely on, and only work with, anonymous memory.

No reason they should be limited to anon only though.

Also, don't we have some sort of anonfs, from which we could get an 
fd, which, if mmap()-ed produces regular anonymous shared memory?

That fd could be passed over to other processes, who could then 
mmap() the new piece of shared-anon memory as well.
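
A minimal userspace sketch of the fd-based sharing described above,
assuming a tmpfs mount at /dev/shm. The helper names anon_shm_fd() and
share_demo() and the file-name template are hypothetical. The fd is
inherited over fork() here for brevity; unrelated processes would have
to receive it over a Unix domain socket (SCM_RIGHTS) instead.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Create an unnamed, tmpfs-backed file of `size` bytes and return its fd.
 * Assumption: /dev/shm is a tmpfs mount. The name is unlinked right away,
 * so only the fd (and any mappings of it) keep the memory alive.
 * Returns -1 on failure.
 */
int anon_shm_fd(size_t size)
{
    char path[] = "/dev/shm/guest-ram-XXXXXX";  /* hypothetical template */
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Parent and child each mmap() the same fd on their own. The child stores
 * a value, the parent reads it back after the child has exited.
 * Returns the value seen by the parent, or -1 on any failure.
 */
int share_demo(void)
{
    const size_t size = 4096;
    int status, result = -1;
    int fd = anon_shm_fd(size);
    pid_t pid;

    if (fd < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: map the inherited fd and write through the mapping. */
        int *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            _exit(1);
        p[0] = 42;
        _exit(0);
    }

    /* Parent: wait for the child, then map the fd and read the value. */
    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        int *p = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            result = p[0];
            munmap(p, size);
        }
    }
    close(fd);
    return result;
}
```

Whether THP and KSM could be taught to handle such mappings is a
separate question that this sketch does not address.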

Thanks,

	Ingo


* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26 18:38                                                   ` Ingo Molnar
@ 2011-05-27  0:12                                                     ` James Morris
  0 siblings, 0 replies; 91+ messages in thread
From: James Morris @ 2011-05-27  0:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Avi Kivity, Linus Torvalds, Kees Cook,
	Thomas Gleixner, Will Drewry, Steven Rostedt, linux-kernel,
	gnatapov, Chris Wright, Pekka Enberg

Btw, if anyone's going to be at Plumbers this year, we have a day set 
aside for the security summit:

https://security.wiki.kernel.org/index.php/LinuxSecuritySummit2011

This may make a good discussion topic.


- James
-- 
James Morris
<jmorris@namei.org>


* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 18:42                               ` Linus Torvalds
                                                   ` (2 preceding siblings ...)
  2011-05-26  1:19                                 ` James Morris
@ 2011-05-29 16:51                                 ` Aneesh Kumar K.V
  2011-05-29 17:02                                   ` Linus Torvalds
  3 siblings, 1 reply; 91+ messages in thread
From: Aneesh Kumar K.V @ 2011-05-29 16:51 UTC (permalink / raw)
  To: Linus Torvalds, Kees Cook, Al Viro
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Will Drewry,
	Steven Rostedt, linux-kernel

On Wed, 25 May 2011 11:42:44 -0700, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, May 25, 2011 at 11:01 AM, Kees Cook <kees.cook@canonical.com> wrote:
> >
> > Can we just go back to the original spec? A lot of people were excited
> > about the prctl() API as done in Will's earlier patchset, we don't lose the
> > extremely useful "enable_on_exec" feature, and we can get away from all
> > this disagreement.
> 
> .. and quite frankly, I'm not even convinced about the original simpler spec.
> 
> Security is a morass. People come up with cool ideas every day, and
> nobody actually uses them - or if they use them, they are just a
> maintenance nightmare.
> 
> Quite frankly, limiting pathname access by some prefix is "cool", but
> it's basically useless.
> 
> That's not where security problems are.
> 
> Security problems are in the odd corners - ioctl's, /proc files,
> random small interfaces that aren't just about file access.
> 
> And who would *use* this thing in real life? Nobody. In order to sell
> me on a new security interface, give me a real actual use case that is
> security-conscious and relevant to real users.
> 
> For things like web servers that actually want to limit filename
> lookup, we'd be *much* better off with a few new flags to
> pathname lookup that say "don't follow symlinks" and "don't follow
> '..'". Things like that can actually be beneficial to
> security-conscious programming, with very little overhead. Some of
> those things currently look up pathnames one component at a time,
> because they can't afford to not do so. That's a *much* better model
> for the whole "only limit to this subtree" case that was quoted
> sometime early in this thread.


The "make sure we don't follow symlinks at all" is a real problem in
VirtFS (http://wiki.qemu.org/Documentation/9psetup) that we are fixing
by adding a forked chrooted process to Qemu. If we are open to a new
open flag O_NOFOLLOW_PATH, which would fail with ELOOP if any component
of the path is a symbolic link, that would greatly simplify VirtFS.
Would such a new flag to open() be acceptable?


-aneesh



* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-29 16:51                                 ` Aneesh Kumar K.V
@ 2011-05-29 17:02                                   ` Linus Torvalds
  2011-05-29 18:23                                     ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2011-05-29 17:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kees Cook, Al Viro, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Will Drewry, Steven Rostedt, linux-kernel

On Sun, May 29, 2011 at 9:51 AM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
>
> The "make sure we don't follow symlinks at all" is a real problem in
> VirtFS (http://wiki.qemu.org/Documentation/9psetup) that we are fixing
> by adding a forked chrooted process to Qemu. If we are open to a new
> open flag O_NOFOLLOW_PATH, which would fail with ELOOP if any component
> of the path is a symbolic link, that would greatly simplify VirtFS.
> Would such a new flag to open() be acceptable?

Such a flag should be something like 3 lines of actual code (and then
the header file changes to actually add the mask itself, which is apt
to be the bulk of the patch just because we have to have different
values for different architectures).

And yes, it is absolutely acceptable. The only questions in my mind are

 - why haven't we done this long ago?

 - do we have the flag space?

 - should we do an O_NOMNT_PATH flag to do the same for mount-points?

  Some people worry about being confused by bind mounts etc.

 - do we think ".." is worthy of a flag too?

   or is that a "user space can damn well check that itself, even if
it would be absolutely trivial to check in the kernel too"?

Whatever. I think the NOFOLLOW_PATH one is pretty much a no-brainer.
It's not like symlink worries are unusual.

                    Linus


* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-29 17:02                                   ` Linus Torvalds
@ 2011-05-29 18:23                                     ` Al Viro
  0 siblings, 0 replies; 91+ messages in thread
From: Al Viro @ 2011-05-29 18:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Aneesh Kumar K.V, Kees Cook, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Will Drewry, Steven Rostedt, linux-kernel

On Sun, May 29, 2011 at 10:02:06AM -0700, Linus Torvalds wrote:

> And yes, it is absolutely acceptable. The only questions in my mind are
> 
>  - why haven't we done this long ago?
> 
>  - do we have the flag space?
> 
>  - should we do a O_NOMNT_PATH flag to do the same for mount-points?
> 
>   Some people worry about being confused by bind mounts etc.
> 
>  - do we think ".." is worthy of a flag too?
> 
>    or is that a "user space can damn well check that itself, even if
> it would be absolutely trivial to check in the kernel too"?
> 
> Whatever. I think the NOFOLLOW_PATH one is pretty much a no-brainer.
> It's not like symlink worries are unusual.

It's not *quite* a no-brainer.  Guys, please hold that one off for a while;
we have more massage to do in the area and I *really* want to get atomic
open work finished (== intents gone, revalidation vs mountpoints sanitized,
etc.) before anything else is done to fs/namei.c.  OK?

And as for .. - userland can bloody well check that on its own if it cares.
Let's keep it simple, please - we already have things far too complicated
in there for my taste.
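
A sketch of what such a userland ".." check could look like. The
function name path_has_dotdot() is hypothetical; it only inspects the
string and knows nothing about symlinks or mount points.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if any '/'-separated component of `path` is exactly "..".
 * Names that merely contain dots ("a..b", "...") are not flagged.
 */
bool path_has_dotdot(const char *path)
{
    const char *p = path;

    while (*p != '\0') {
        size_t len;

        while (*p == '/')        /* skip separators, including repeats */
            p++;
        len = strcspn(p, "/");   /* length of the current component */
        if (len == 2 && p[0] == '.' && p[1] == '.')
            return true;
        p += len;
    }
    return false;
}
```

A caller would reject the request before handing the path to open().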


end of thread, other threads:[~2011-06-09  9:05 UTC | newest]

Thread overview: 91+ messages
     [not found] <1305563026.5456.19.camel@gandalf.stny.rr.com>
     [not found] ` <20110516165249.GB10929@elte.hu>
     [not found]   ` <1305565422.5456.21.camel@gandalf.stny.rr.com>
     [not found]     ` <20110517124212.GB21441@elte.hu>
     [not found]       ` <1305637528.5456.723.camel@gandalf.stny.rr.com>
     [not found]         ` <20110517131902.GF21441@elte.hu>
     [not found]           ` <BANLkTikBK3-KZ10eErQ6Eex_L6Qe2aZang@mail.gmail.com>
     [not found]             ` <1305807728.11267.25.camel@gandalf.stny.rr.com>
     [not found]               ` <BANLkTiki8aQJbFkKOFC+s6xAEiuVyMM5MQ@mail.gmail.com>
     [not found]                 ` <BANLkTim9UyYAGhg06vCFLxkYPX18cPymEQ@mail.gmail.com>
     [not found]                   ` <20110524200815.GD27634@elte.hu>
2011-05-24 20:25                     ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Kees Cook
2011-05-25 19:09                       ` Ingo Molnar
2011-05-25 16:40                     ` Will Drewry
     [not found]                   ` <1306254027.18455.47.camel@twins>
     [not found]                     ` <20110524195435.GC27634@elte.hu>
     [not found]                       ` <alpine.LFD.2.02.1105242239230.3078@ionos>
     [not found]                         ` <20110525150153.GE29179@elte.hu>
     [not found]                           ` <alpine.LFD.2.02.1105251836030.3078@ionos>
2011-05-25 18:01                             ` Kees Cook
2011-05-25 18:42                               ` Linus Torvalds
2011-05-25 19:06                                 ` Ingo Molnar
2011-05-25 19:54                                   ` Will Drewry
2011-05-25 19:11                                 ` Kees Cook
2011-05-25 20:01                                   ` Linus Torvalds
2011-05-25 20:19                                     ` Ingo Molnar
2011-06-09  9:00                                       ` Sven Anders
2011-05-26 14:37                                     ` Colin Walters
2011-05-26 15:03                                       ` Linus Torvalds
2011-05-26 15:28                                         ` Colin Walters
2011-05-26 16:33                                         ` Will Drewry
2011-05-26 16:46                                           ` Linus Torvalds
2011-05-26 17:02                                             ` Will Drewry
2011-05-26 17:04                                               ` Will Drewry
2011-05-26 17:17                                               ` Linus Torvalds
2011-05-26 17:38                                                 ` Will Drewry
2011-05-26 18:33                                                   ` Linus Torvalds
2011-05-26 18:47                                                     ` Ingo Molnar
2011-05-26 19:05                                                       ` david
2011-05-26 19:09                                                         ` Eric Paris
2011-05-26 19:46                                                         ` Ingo Molnar
2011-05-26 19:49                                                           ` david
2011-05-26 18:49                                                     ` Will Drewry
2011-06-01  3:10                                                       ` [PATCH v3 01/13] tracing: split out filter initialization and clean up Will Drewry
2011-06-01  3:10                                                       ` [PATCH v3 02/13] tracing: split out syscall_trace_enter construction Will Drewry
2011-06-01  7:00                                                         ` Ingo Molnar
2011-06-01 17:15                                                           ` Will Drewry
2011-06-02 14:29                                                             ` Ingo Molnar
2011-06-02 15:18                                                               ` Will Drewry
2011-06-01  3:10                                                       ` [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters Will Drewry
2011-06-02 17:36                                                         ` Paul E. McKenney
2011-06-02 18:14                                                           ` Will Drewry
2011-06-02 19:42                                                             ` Paul E. McKenney
2011-06-02 20:28                                                               ` Will Drewry
2011-06-02 20:46                                                                 ` Steven Rostedt
2011-06-02 21:12                                                                   ` Paul E. McKenney
2011-06-01  3:10                                                       ` [PATCH v3 04/13] seccomp_filter: add process state reporting Will Drewry
2011-06-01  3:10                                                       ` [PATCH v3 05/13] seccomp_filter: Document what seccomp_filter is and how it works Will Drewry
2011-06-01 21:23                                                         ` Kees Cook
2011-06-01 23:03                                                           ` Will Drewry
2011-06-01  3:10                                                       ` [PATCH v3 06/13] x86: add HAVE_SECCOMP_FILTER and seccomp_execve Will Drewry
2011-06-01  3:10                                                       ` [PATCH v3 07/13] arm: select HAVE_SECCOMP_FILTER Will Drewry
2011-06-01  3:10                                                       ` [PATCH v3 08/13] microblaze: select HAVE_SECCOMP_FILTER and provide seccomp_execve Will Drewry
2011-06-01  5:37                                                         ` Michal Simek
2011-06-01  3:10                                                       ` [PATCH v3 09/13] mips: " Will Drewry
2011-06-01  3:10                                                       ` [PATCH v3 10/13] s390: " Will Drewry
2011-06-01  3:10                                                       ` [PATCH v3 11/13] powerpc: " Will Drewry
2011-06-01  3:10                                                       ` [PATCH v3 12/13] sparc: " Will Drewry
2011-06-01  3:35                                                         ` David Miller
2011-06-01  3:10                                                       ` [PATCH v3 13/13] sh: select HAVE_SECCOMP_FILTER Will Drewry
2011-06-02  5:27                                                         ` Paul Mundt
2011-05-26 17:38                                               ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Valdis.Kletnieks
2011-05-26 18:08                                                 ` Will Drewry
2011-05-26 18:22                                                   ` Valdis.Kletnieks
2011-05-26 17:07                                             ` Steven Rostedt
2011-05-26 18:43                                               ` Casey Schaufler
2011-05-26 18:54                                                 ` Steven Rostedt
2011-05-26 18:34                                             ` david
2011-05-26 18:54                                             ` Ingo Molnar
2011-05-26  1:19                                 ` James Morris
2011-05-26  6:08                                   ` Avi Kivity
2011-05-26  8:24                                   ` Ingo Molnar
2011-05-26  8:35                                     ` Pekka Enberg
2011-05-26  8:49                                     ` Avi Kivity
2011-05-26  8:57                                       ` Pekka Enberg
     [not found]                                         ` <20110526085939.GG29458@redhat.com>
2011-05-26 10:38                                           ` Ingo Molnar
2011-05-26 10:46                                             ` Avi Kivity
2011-05-26 10:46                                             ` Gleb Natapov
2011-05-26 11:11                                               ` Ingo Molnar
2011-05-26  9:30                                       ` Ingo Molnar
2011-05-26  9:48                                         ` Ingo Molnar
2011-05-26 11:02                                           ` Avi Kivity
2011-05-26 11:16                                             ` Ingo Molnar
2011-05-26 10:56                                         ` Avi Kivity
2011-05-26 11:38                                           ` Ingo Molnar
2011-05-26 18:06                                             ` Avi Kivity
2011-05-26 18:15                                               ` Ingo Molnar
2011-05-26 18:20                                                 ` Avi Kivity
2011-05-26 18:36                                                   ` Ingo Molnar
2011-05-26 18:43                                                     ` Valdis.Kletnieks
2011-05-26 18:50                                                       ` Ingo Molnar
2011-05-26 18:22                                                 ` Peter Zijlstra
2011-05-26 18:38                                                   ` Ingo Molnar
2011-05-27  0:12                                                     ` James Morris
2011-05-29 16:51                                 ` Aneesh Kumar K.V
2011-05-29 17:02                                   ` Linus Torvalds
2011-05-29 18:23                                     ` Al Viro
