Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.

From: Ingo Molnar <mingo@elte.hu>
To: "Adam Langley" <agl@google.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Frédéric Weisbecker" <fweisbec@gmail.com>,
	"Tom Zanussi" <tzanussi@gmail.com>,
	"Li Zefan" <lizf@cn.fujitsu.com>,
	"Steven Rostedt" <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org, markus@google.com
Subject: Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
Date: Fri, 8 May 2009 00:14:47 +0200
Message-ID: <20090507221447.GE28770@elte.hu>
In-Reply-To: <396556a20805301217k293e5718h6bbf02b234897235@europa>

(i've restored the Cc: line of the previous thread)

* Adam Langley <agl@google.com> wrote:

> (This is a discussion email rather than a patch which I'm 
> seriously proposing be landed.)
> 
> In a recent thread[1] my colleague, Markus, mentioned that we 
> (Chrome Linux) are investigating using seccomp to implement our 
> rendering sandbox[2] on Linux.
> 
> In the same thread, Ingo mentioned[3] that he thought a bitmap of 
> allowed system calls would be reasonable. If we had such a thing, 
> many of the acrobatics that we currently need could be avoided. 
> Since we need to support the currently existing kernels, we'll 
> need to have the code for both, but allowing signal handling, 
> gettimeofday, epoll etc would save a lot of overhead for common 
> operations.
> 
> The patch below implements such a scheme. It's written on top of 
> the current seccomp for the moment, although it looks like seccomp 
> might be written in terms of ftrace soon[4].
> 
> Briefly, it adds a second seccomp mode (2) where one uploads a 
> bitmask. Syscall n is allowed if, and only if, bit n is true in 
> the bitmask. If n is beyond the range of the bitmask, the syscall 
> is denied.
> 
> If prctl is allowed by the bitmask, then a process may switch to 
> mode 1, or may set a new bitmask iff the new bitmask is a subset 
> of the current one. (Possibly moving to mode 1 should only be 
> allowed if read, write, sigreturn, exit are in the currently 
> allowed set.)
> 
> If a process forks/clones, the child inherits the seccomp state of 
> the parent. (And hopefully I'm managing the memory correctly 
> here.)
> 
> Ingo subsequently floated the idea of a more expressive interface 
> based on ftrace which could introspect the arguments, although I 
> think the discussion had fallen off list at that point.
> 
> He suggested using an ftrace parser which I'm not familiar with, but can
> be summed up with:
>   seccomp_prctl("sys_write", "fd == 3")  // allow writes only to fd 3

It's the ftrace filter parser and execution engine.

I.e. we first parse the filter expression when setting up a seccomp 
context. Each syscall has one of the following attributes:

 on                # enabled unconditionally
 off               # disabled unconditionally
 filtered          # enabled only when the filter expression matches

In the filtered case, the filter can be simple:

	"fd == 0"

To restrict sys_write() to a single fd (but still allow sys_read() 
from other fds).

Or as complex as:

	(fd == 4 || fd == 5) && (buf == 0x12340000) && (size <= 4096)

To restrict IO to two specific fds and to restrict output to a 
specific memory address and to restrict size to 4K or smaller.

This is how the filter engine works: we parse the string and save it 
into a binary expression structure (cache) that can later on be run 
by the engine in a pretty fast way (without any string parsing or 
formatting overhead in the validation fastpath).

The filter is thus evaluated in the sandbox task's context, without 
the need for any context-switching. It's very, very fast. It is i 
think faster than LSM rules, and it is also atomic and lockless (RCU 
based).
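
A rough user-space model of that parse-once / evaluate-many split, in
Python rather than kernel C. This is only a sketch: the toy grammar and
the names compile_filter(), evaluate() and syscall_allowed() are made up
for illustration and are not the ftrace filter API. Denying syscalls that
have no rule at all is likewise an assumption of the sketch.

```python
import operator
import re

# Comparison operators accepted by this toy grammar (an assumption; the
# real ftrace filter grammar is not reproduced here).
OPS = {"==": operator.eq, "!=": operator.ne, "<=": operator.le,
       ">=": operator.ge, "<": operator.lt, ">": operator.gt}

TOKEN = re.compile(
    r"\s*(\|\||&&|==|!=|<=|>=|<|>|\(|\)|0x[0-9a-fA-F]+|\d+|[A-Za-z_]\w*)")


def tokenize(text):
    """Split a filter string into operator, paren, number and name tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"bad filter near {text[pos:]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def compile_filter(text):
    """Setup path: parse the string once into a tree of nested tuples."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of filter")
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        node = parse_and()
        while peek() == "||":
            take()
            node = ("or", node, parse_and())
        return node

    def parse_and():
        node = parse_atom()
        while peek() == "&&":
            take()
            node = ("and", node, parse_atom())
        return node

    def parse_atom():
        tok = take()
        if tok == "(":
            node = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        if not tok.isidentifier():
            raise ValueError(f"expected argument name, got {tok!r}")
        op = take()
        if op not in OPS:
            raise ValueError(f"expected comparison, got {op!r}")
        return ("cmp", OPS[op], tok, int(take(), 0))

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens in filter")
    return tree


def evaluate(tree, args):
    """Fast path: walk the prebuilt tree; no string parsing happens here."""
    if tree[0] == "cmp":
        _, op, field, value = tree
        return op(args[field], value)
    kind, left, right = tree
    if kind == "and":
        return evaluate(left, args) and evaluate(right, args)
    return evaluate(left, args) or evaluate(right, args)


def syscall_allowed(policy, name, args):
    """Apply the on / off / filtered attribute of one syscall."""
    rule = policy.get(name, "off")  # assumption: unlisted syscalls are off
    if rule == "on":
        return True
    if rule == "off":
        return False
    return evaluate(rule, args)
```

With a policy such as {"sys_read": "on", "sys_write": compile_filter("fd == 3")},
syscall_allowed(policy, "sys_write", {"fd": 3}) is true while the same call
with fd 4 is refused, and the filter string is never parsed again after
setup.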

> In general, I believe that ftrace based solutions cannot safely 
> validate arguments which are in user-space memory when multiple 
> threads could be racing to change the memory between ftrace and 
> the eventual copy_from_user. Because of this, many useful 
> arguments (such as the sockaddr to connect, the filename to open 
> etc) are out of reach. LSM hooks appear to be the best way to 
> impose limits in such cases. (Which we are also experimenting 
> with).

That assessment is incorrect: there's really no difference in safety 
between the two here.

LSM cannot magically inspect user-space memory either when multiple 
threads may access it. The point would be to define filters for 
system call _arguments_, which are inherently thread-local and safe.

> However, such a parser could be very useful in one particular 
> case: socketcall on IA32. Allowing recvmsg and sendmsg, but not 
> socket, connect etc is certainly something that we would be 
> interested in.

There are two problems with the bitmap scheme, which i also 
suggested in a previous thread but then found to be lacking:

1) enumeration: you define a bitmap. That will be problematic 
   between compat and native 64-bit (both have different syscall 
   vectors).

2) flexibility. It's an on/off selection per syscall. With the 
   filter we have on, off, or filtered. That's a _whole_ lot more 
   flexible.

The filter expression based solution does not suffer from this: it 
is string enumerated. "sys_read" means that syscall, and we could 
specify whether it's the compat or the native one.
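
To make the enumeration problem concrete, here is a small user-space
model in Python (a sketch, not kernel code). The syscall numbers are the
x86-64 and i386 ones for read, write and exit; the helper names
bitmap_from_names(), bitmap_allows() and is_subset() are invented for
the example. bitmap_allows() follows the mode-2 rule quoted above (bit n
set means allowed, out of range means denied) and is_subset() is the
condition for installing a new bitmask.

```python
# A few entries from the x86 syscall tables; the dict layout is an
# assumption made for this sketch.
NR = {
    "native": {"sys_read": 0, "sys_write": 1, "sys_exit": 60},  # x86-64
    "compat": {"sys_read": 3, "sys_write": 4, "sys_exit": 1},   # i386
}


def bitmap_from_names(names, abi):
    """Build a bitmask by looking each syscall name up in one ABI's table."""
    bits = 0
    for name in names:
        bits |= 1 << NR[abi][name]
    return bits


def bitmap_allows(bits, nbits, nr):
    """Bit nr set means allowed; numbers beyond the mask are denied."""
    return 0 <= nr < nbits and bool((bits >> nr) & 1)


def is_subset(new, cur):
    """A new mask may only be installed if it adds no bits to the current one."""
    return new & ~cur == 0


native = bitmap_from_names(["sys_read", "sys_write"], "native")
compat = bitmap_from_names(["sys_read", "sys_write"], "compat")
```

The same two names give different masks (bits 0 and 1 natively, bits 3
and 4 for compat). Applying the native mask to a compat task would allow
syscall 1, which is exit on i386, and deny read. Naming syscalls by
string and resolving the number per ABI sidesteps that.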

	Ingo

Thread overview: 12+ messages
2009-05-07 21:48 [RFC 1/1] seccomp: Add bitmask of allowed system calls Adam Langley
2009-05-07 22:14 ` Ingo Molnar [this message]
2009-05-07 22:34   ` Adam Langley
2009-05-07 23:00     ` Frederic Weisbecker
2009-05-08  5:32       ` Tom Zanussi
2009-05-08  9:19         ` Ingo Molnar
2009-05-08 11:12         ` Frederic Weisbecker
2009-05-08  9:20       ` Ingo Molnar
2009-05-08  2:37   ` James Morris
2009-05-08  9:44     ` Ingo Molnar
2009-05-15 19:56 ` Pavel Machek
2009-05-15 20:29   ` Adam Langley
