[RFC 1/1] seccomp: Add bitmask of allowed system calls.

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC 1/1] seccomp: Add bitmask of allowed system calls.
@ 2009-05-07 21:48 Adam Langley
  2009-05-07 22:14 ` Ingo Molnar
  2009-05-15 19:56 ` Pavel Machek
  0 siblings, 2 replies; 12+ messages in thread
From: Adam Langley @ 2009-05-07 21:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: markus

(This is a discussion email rather than a patch which I'm seriously
proposing be landed.)

In a recent thread[1] my colleague, Markus, mentioned that we (Chrome
Linux) are investigating using seccomp to implement our rendering
sandbox[2] on Linux.

In the same thread, Ingo mentioned[3] that he thought a bitmap of
allowed system calls would be reasonable. If we had such a thing, many
of the acrobatics that we currently need could be avoided. Since we need
to support the currently existing kernels, we'll need to have the code
for both, but allowing signal handling, gettimeofday, epoll etc would
save a lot of overhead for common operations.

The patch below implements such a scheme. It's written on top of the
current seccomp for the moment, although it looks like seccomp might be
written in terms of ftrace soon[4].

Briefly, it adds a second seccomp mode (2) where one uploads a bitmask.
Syscall n is allowed if, and only if, bit n is true in the bitmask. If n
is beyond the range of the bitmask, the syscall is denied.

If prctl is allowed by the bitmask, then a process may switch to mode 1,
or may set a new bitmask iff the new bitmask is a subset of the current
one. (Possibly moving to mode 1 should only be allowed if read, write,
sigreturn, exit are in the currently allowed set.)

If a process forks/clones, the child inherits the seccomp state of the
parent. (And hopefully I'm managing the memory correctly here.)


Ingo subsequently floated the idea of a more expressive interface based
on ftrace which could introspect the arguments, although I think the
discussion had fallen off list at that point.

He suggested using an ftrace parser which I'm not familiar with, but can
be summed up with:
  seccomp_prctl("sys_write", "fd == 3")  // allow writes only to fd 3

In general, I believe that ftrace based solutions cannot safely validate
arguments which are in user-space memory when multiple threads could be
racing to change the memory between ftrace and the eventual
copy_from_user. Because of this, many useful arguments (such as the
sockaddr to connect, the filename to open etc) are out of reach. LSM
hooks appear to be the best way to impose limits in such cases. (Which
we are also experimenting with).

However, such a parser could be very useful in one particular case:
socketcall on IA32. Allowing recvmsg and sendmsg, but not socket,
connect etc is certainly something that we would be interested in.

In summary: we would like to see a more flexible seccomp and, if people
agree, we're willing to do work to make it so.

Thanks,

AGL


[1] http://lkml.org/lkml/2009/5/6/443
[2] http://dev.chromium.org/developers/design-documents/sandbox
[3] http://lkml.org/lkml/2009/5/7/137
[4] http://lkml.org/lkml/2009/3/31/412

Signed-off-by: Adam Langley <agl@google.com>

---

 fs/proc/array.c         |   21 +++++++
 include/linux/prctl.h   |    3 +
 include/linux/seccomp.h |   24 +++++++-
 kernel/fork.c           |   15 +++++
 kernel/seccomp.c        |  151 +++++++++++++++++++++++++++++++++++++++++------
 kernel/sys.c            |    2 +-
 6 files changed, 194 insertions(+), 22 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 167c333..ce5620a 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -7,7 +7,23 @@
 #include <linux/thread_info.h>
 #include <asm/seccomp.h>
 
-typedef struct { int mode; } seccomp_t;
+/**
+ * struct seccomp_state - the state of a seccomp'ed process
+ *
+ * @mode:
+ *     if this is 1, the process is under standard seccomp rules
+ *             is 2, the process is only allowed to make system calls where
+ *                   the corresponding bit is set in bitmask.
+ * @bit_length: the length of bitmask, in bits.
+ * @bitmask: a mask of allowed system calls.
+ */
+struct seccomp_state {
+	uint16_t mode;
+	uint16_t bit_length;
+	uint8_t bitmask[0];
+};
+
+typedef struct { struct seccomp_state *state; } seccomp_t;
 
 extern void __secure_computing(int);
 static inline void secure_computing(int this_syscall)
@@ -16,8 +32,9 @@ static inline void secure_computing(int this_syscall)
 		__secure_computing(this_syscall);
 }
 
+extern struct seccomp_state* seccomp_state_dup(const struct seccomp_state *old);
 extern long prctl_get_seccomp(void);
-extern long prctl_set_seccomp(unsigned long);
+extern long prctl_set_seccomp(unsigned long, unsigned long, unsigned long);
 
 #else /* CONFIG_SECCOMP */
 
@@ -32,7 +49,8 @@ static inline long prctl_get_seccomp(void)
 	return -EINVAL;
 }
 
-static inline long prctl_set_seccomp(unsigned long arg2)
+static inline long prctl_set_seccomp(unsigned long arg2, unsigned long arg3,
+				     unsigned long arg4)
 {
 	return -EINVAL;
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index b9e2edd..1f599ec 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -35,6 +35,7 @@
 #include <linux/cgroup.h>
 #include <linux/security.h>
 #include <linux/hugetlb.h>
+#include <linux/seccomp.h>
 #include <linux/swap.h>
 #include <linux/syscalls.h>
 #include <linux/jiffies.h>
@@ -143,6 +144,10 @@ void free_task(struct task_struct *tsk)
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
+#ifdef CONFIG_SECCOMP
+	if (tsk->seccomp.state)
+		kfree(tsk->seccomp.state);
+#endif
 	free_task_struct(tsk);
 }
 EXPORT_SYMBOL(free_task);
@@ -240,6 +245,16 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	if (err)
 		goto out;
 
+#ifdef CONFIG_SECCOMP
+	if (orig->seccomp.state) {
+		tsk->seccomp.state = seccomp_state_dup(orig->seccomp.state);
+		if (!tsk->seccomp.state) {
+			err = ENOMEM;
+			goto out;
+		}
+	}
+#endif
+
 	setup_thread_stack(tsk, orig);
 	stackend = end_of_stack(tsk);
 	*stackend = STACK_END_MAGIC;	/* for overflow detection */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..3fcf658 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -9,9 +9,10 @@
 #include <linux/seccomp.h>
 #include <linux/sched.h>
 #include <linux/compat.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
 
 /* #define SECCOMP_DEBUG 1 */
-#define NR_SECCOMP_MODES 1
 
 /*
  * Secure computing mode 1 allows only read/write/exit/sigreturn.
@@ -32,8 +33,8 @@ static int mode1_syscalls_32[] = {
 
 void __secure_computing(int this_syscall)
 {
-	int mode = current->seccomp.mode;
-	int * syscall;
+	int mode = current->seccomp.state->mode;
+	int *syscall;
 
 	switch (mode) {
 	case 1:
@@ -47,6 +48,13 @@ void __secure_computing(int this_syscall)
 				return;
 		} while (*++syscall);
 		break;
+	case 2:
+		if (this_syscall >= current->seccomp.state->bit_length)
+			break;
+		if (current->seccomp.state->bitmask[this_syscall / 8] &
+		    (1 << (7 - (this_syscall % 8))))
+			return;
+		break;
 	default:
 		BUG();
 	}
@@ -57,30 +65,135 @@ void __secure_computing(int this_syscall)
 	do_exit(SIGKILL);
 }
 
-long prctl_get_seccomp(void)
+/**
+ * seccomp_mask_install() - install a system call mask
+ * @seccomp_bitlength: the number of bits in the mask
+ * @seccomp_mask: a pointer in userspace to the bits
+ *
+ * This is used to install a system call mask when none currently exists.
+ */
+static int seccomp_mask_install(unsigned long seccomp_mask,
+				unsigned long seccomp_bitlength)
 {
-	return current->seccomp.mode;
+	if (seccomp_bitlength >= 1024)
+		return -EINVAL;
+	current->seccomp.state = kmalloc(sizeof(struct seccomp_state) +
+					 (seccomp_bitlength + 7) / 8,
+					 GFP_KERNEL);
+	if (!current->seccomp.state)
+		return -ENOMEM;
+	current->seccomp.state->mode = 2;
+	current->seccomp.state->bit_length = seccomp_bitlength;
+	if (copy_from_user(current->seccomp.state->bitmask,
+			   (void __user *) seccomp_mask,
+			   (seccomp_bitlength + 7) / 8)) {
+		kfree(current->seccomp.state);
+		current->seccomp.state = NULL;
+		return -EFAULT;
+	}
+
+	set_thread_flag(TIF_SECCOMP);
+	return 0;
 }
 
-long prctl_set_seccomp(unsigned long seccomp_mode)
+/**
+ * seccomp_mask_subset() - subset a system call mask
+ * @seccomp_bitlength: the number of bits in the mask
+ * @seccomp_mask: a pointer in userspace to the bits
+ *
+ * This is used when a process which already has a system call mask tries to
+ * install another. We require that the new mask be <= to the length of the old
+ * one and that it's a subset of the old mask.
+ */
+static int seccomp_mask_subset(unsigned long seccomp_mask,
+			       unsigned long seccomp_bitlength)
 {
-	long ret;
+	unsigned i;
+	const unsigned byte_length = (seccomp_bitlength + 7) / 8;
+	const uint8_t *current_mask = current->seccomp.state->bitmask;
+	struct seccomp_state *new;
 
-	/* can set it only once to be even more secure */
-	ret = -EPERM;
-	if (unlikely(current->seccomp.mode))
-		goto out;
+	if (seccomp_bitlength > current->seccomp.state->bit_length)
+		return -EPERM;
+	new = kmalloc(sizeof(struct seccomp_state) + byte_length, GFP_KERNEL);
+	if (!new)
+		return -ENOMEM;
+	new->mode = 2;
+	new->bit_length = seccomp_bitlength;
+	if (copy_from_user(new->bitmask,
+			   (void __user *) seccomp_mask,
+			   byte_length)) {
+		kfree(new);
+		return -EFAULT;
+	}
+
+	for (i = 0; i < byte_length; ++i) {
+		if ((new->bitmask[i] & current_mask[i]) != new->bitmask[i]) {
+			kfree(new);
+			return -EPERM;
+		}
+	}
+
+	kfree(current->seccomp.state);
+	current->seccomp.state = new;
+
+	return 0;
+}
+
+struct seccomp_state* seccomp_state_dup(const struct seccomp_state *orig)
+{
+	const unsigned state_len = sizeof(struct seccomp_state) +
+				   (orig->bit_length + 7) / 8;
+	struct seccomp_state *new = kmalloc(state_len, GFP_KERNEL);
+	if (!new)
+		return NULL;
 
-	ret = -EINVAL;
-	if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
-		current->seccomp.mode = seccomp_mode;
+	memcpy(new, orig, state_len);
+	return new;
+}
+
+
+long prctl_get_seccomp(void)
+{
+	if (!current->seccomp.state)
+		return 0;
+	return current->seccomp.state->mode;
+}
+
+long prctl_set_seccomp(unsigned long seccomp_mode,
+		       unsigned long seccomp_mask,
+		       unsigned long seccomp_bitlength)
+{
+	switch (seccomp_mode) {
+	case 0:
+		/* One cannot switch to mode 0 from any other mode */
+		if (current->seccomp.state)
+			return -EPERM;
+		return 0;
+
+	case 1:
+		if (current->seccomp.state)
+			kfree(current->seccomp.state);
+		current->seccomp.state = kmalloc(sizeof(struct seccomp_state),
+						 GFP_KERNEL);
+		if (!current->seccomp.state)
+			return -ENOMEM;
+		current->seccomp.state->mode = 1;
+		current->seccomp.state->bit_length = 0;
 		set_thread_flag(TIF_SECCOMP);
 #ifdef TIF_NOTSC
 		disable_TSC();
 #endif
-		ret = 0;
-	}
+		return 0;
+
+	case 2:
+		/* System call bitmask */
 
- out:
-	return ret;
+		if (current->seccomp.state)
+			return seccomp_mask_subset(seccomp_mask, seccomp_bitlength);
+		return seccomp_mask_install(seccomp_mask, seccomp_bitlength);
+
+	default:
+		return -EINVAL;
+	}
 }
diff --git a/kernel/sys.c b/kernel/sys.c
index e7998cf..bf50553 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1785,7 +1785,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			error = prctl_get_seccomp();
 			break;
 		case PR_SET_SECCOMP:
-			error = prctl_set_seccomp(arg2);
+			error = prctl_set_seccomp(arg2, arg3, arg4);
 			break;
 		case PR_GET_TSC:
 			error = GET_TSC_CTL(arg2);
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 725a650..840442b 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -321,6 +321,26 @@ static inline void task_context_switch_counts(struct seq_file *m,
 			p->nivcsw);
 }
 
+extern const char hex_asc[];
+
+static void task_show_seccomp(struct seq_file *m, struct task_struct *p) {
+#if defined(CONFIG_SECCOMP)
+	const int mode = p->seccomp.state ? p->seccomp.state->mode : 0;
+	unsigned i;
+
+	seq_printf(m, "Seccomp:\t%d\n", mode);
+	if (mode != 2)
+		return;
+	seq_printf(m, "Seccomp_mask:\t%d,", p->seccomp.state->bit_length);
+	for (i = 0; i < (p->seccomp.state->bit_length + 7) / 8; ++i) {
+		const uint8_t *mask = p->seccomp.state->bitmask;
+		seq_printf(m, "%c%c", hex_asc[mask[i] >> 4],
+		           hex_asc[mask[i] & 15]);
+	}
+	seq_printf(m, "\n");
+#endif
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task)
 {
@@ -340,6 +360,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 	task_show_regs(m, task);
 #endif
 	task_context_switch_counts(m, task);
+	task_show_seccomp(m, task);
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
  2009-05-07 21:48 [RFC 1/1] seccomp: Add bitmask of allowed system calls Adam Langley
@ 2009-05-07 22:14 ` Ingo Molnar
  2009-05-07 22:34   ` Adam Langley
  2009-05-08  2:37   ` James Morris
  2009-05-15 19:56 ` Pavel Machek
  1 sibling, 2 replies; 12+ messages in thread
From: Ingo Molnar @ 2009-05-07 22:14 UTC (permalink / raw)
  To: Adam Langley, Andrew Morton, Frédéric Weisbecker,
	Tom Zanussi, Li Zefan, Steven Rostedt
  Cc: linux-kernel, markus

(i've restored the Cc: line of the previous thread)

* Adam Langley <agl@google.com> wrote:

> (This is a discussion email rather than a patch which I'm 
> seriously proposing be landed.)
> 
> In a recent thread[1] my colleague, Markus, mentioned that we 
> (Chrome Linux) are investigating using seccomp to implement our 
> rendering sandbox[2] on Linux.
> 
> In the same thread, Ingo mentioned[3] that he thought a bitmap of 
> allowed system calls would be reasonable. If we had such a thing, 
> many of the acrobatics that we currently need could be avoided. 
> Since we need to support the currently existing kernels, we'll 
> need to have the code for both, but allowing signal handling, 
> gettimeofday, epoll etc would save a lot of overhead for common 
> operations.
> 
> The patch below implements such a scheme. It's written on top of 
> the current seccomp for the moment, although it looks like seccomp 
> might be written in terms of ftrace soon[4].
> 
> Briefly, it adds a second seccomp mode (2) where one uploads a 
> bitmask. Syscall n is allowed if, and only if, bit n is true in 
> the bitmask. If n is beyond the range of the bitmask, the syscall 
> is denied.
> 
> If prctl is allowed by the bitmask, then a process may switch to 
> mode 1, or may set a new bitmask iff the new bitmask is a subset 
> of the current one. (Possibly moving to mode 1 should only be 
> allowed if read, write, sigreturn, exit are in the currently 
> allowed set.)
> 
> If a process forks/clones, the child inherits the seccomp state of 
> the parent. (And hopefully I'm managing the memory correctly 
> here.)
> 
> Ingo subsequently floated the idea of a more expressive interface 
> based on ftrace which could introspect the arguments, although I 
> think the discussion had fallen off list at that point.
> 
> He suggested using an ftrace parser which I'm not familiar with, but can
> be summed up with:
>   seccomp_prctl("sys_write", "fd == 3")  // allow writes only to fd 3

It's the ftrace filter parser and execution engine.

I.e. we first parse the filter expression when setting up a seccomp 
context. Each syscall has the following attributes:

 on                # enabled unconditionally
 off               # disabled unconditionally
 filtered

In the filtered case, the filter can be simple:

	"fd == 0"

To restrict sys_write() to a single fd (but still allow sys_read() 
from other fds).

Or as complex as:

	(fd == 4 || fd == 5) && (buf == 0x12340000) && (size <= 4096)

To restrict IO to two specific fds and to restrict output to a 
specific memory address and to restrict size to 4K or smaller.

This is how the filter engine works: we parse the string and save it 
into a binay expression structure (cache) that can later on be run 
by the engine in a pretty fast way. (without any string parsing or 
formatting overhead in the validation fastpath)

The filter is thus evaluated in the sandbox task's context, without 
the need for any context-switching. It's very, very fast. It is i 
think faster than LSM rules, and it is also atomic and lockless (RCU 
based).

> In general, I believe that ftrace based solutions cannot safely 
> validate arguments which are in user-space memory when multiple 
> threads could be racing to change the memory between ftrace and 
> the eventual copy_from_user. Because of this, many useful 
> arguments (such as the sockaddr to connect, the filename to open 
> etc) are out of reach. LSM hooks appear to be the best way to 
> impose limits in such cases. (Which we are also experimenting 
> with).

That assessment is incorrect, there's no difference between safety 
here really.

LSM cannot magically inspect user-space memory either when multiple 
threads may access it. The point would be to define filters for 
system call _arguments_, which are inherently thread-local and safe.

> However, such a parser could be very useful in one particular 
> case: socketcall on IA32. Allowing recvmsg and sendmsg, but not 
> socket, connect etc is certainly something that we would be 
> interested in.

There are two problems with the bitmap scheme, which i also 
suggested in a previous thread but then found it to be lacking:

1) enumeration: you define a bitmap. That will be problematic 
   between compat and native 64-bit (both have different syscall 
   vectors).

2) flexibility. It's an on/off selection per syscall. With the 
   filter we have on, off, or filtered. That's a _whole_ lot more 
   flexible.

The filter expression based solution does not suffer from this: it 
is string enumerated. "sys_read" means that syscall, and we could 
specify whether it's the compat or the native one.

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
  2009-05-07 22:14 ` Ingo Molnar
@ 2009-05-07 22:34   ` Adam Langley
  2009-05-07 23:00     ` Frederic Weisbecker
  2009-05-08  2:37   ` James Morris
  1 sibling, 1 reply; 12+ messages in thread
From: Adam Langley @ 2009-05-07 22:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Frédéric Weisbecker, Tom Zanussi,
	Li Zefan, Steven Rostedt, linux-kernel, markus

> That assessment is incorrect, there's no difference between safety
> here really.
>
> LSM cannot magically inspect user-space memory either when multiple
> threads may access it. The point would be to define filters for
> system call _arguments_, which are inherently thread-local and safe.

If I hook security_operations.socket_connect, I can validate the struct
sockaddr after the final copy_from_user. However, since the sockaddr lives in
userspace memory, if I try and validate it from ftrace SYSCALL_ENTER, I can't
know that it won't change before sys_connect reads it again.

Because of that, there are system calls which an LSM hook can safely accept
that an ftrace hook cannot. However, as you point out, any arguments passed in
registers are inheriently safe and these may be sufficiently powerful.

> There are two problems with the bitmap scheme, which i also
> suggested in a previous thread but then found it to be lacking:
>
> 1) enumeration: you define a bitmap. That will be problematic
>   between compat and native 64-bit (both have different syscall
>   vectors).

I /think/ it works out, but I've been bitten before with subtle 32/64 bit
compat issues and accept that it's a bit ugly.

> 2) flexibility. It's an on/off selection per syscall. With the
>   filter we have on, off, or filtered. That's a _whole_ lot more
>   flexible.

Absolutely.

Is there a git tree that I can pull this parsing code from?
(git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
maybe?). I can patch in the seccomp-on-ftrace work and try building the
filtering on top of that. I'll see how it turns out anyway.

Thanks

AGL

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
  2009-05-07 22:34   ` Adam Langley
@ 2009-05-07 23:00     ` Frederic Weisbecker
  2009-05-08  5:32       ` Tom Zanussi
  2009-05-08  9:20       ` Ingo Molnar
  0 siblings, 2 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2009-05-07 23:00 UTC (permalink / raw)
  To: Adam Langley
  Cc: Ingo Molnar, Andrew Morton, Tom Zanussi, Li Zefan, Steven Rostedt,
	linux-kernel, markus

On Thu, May 07, 2009 at 03:34:58PM -0700, Adam Langley wrote:
> > That assessment is incorrect, there's no difference between safety
> > here really.
> >
> > LSM cannot magically inspect user-space memory either when multiple
> > threads may access it. The point would be to define filters for
> > system call _arguments_, which are inherently thread-local and safe.
> 
> If I hook security_operations.socket_connect, I can validate the struct
> sockaddr after the final copy_from_user. However, since the sockaddr lives in
> userspace memory, if I try and validate it from ftrace SYSCALL_ENTER, I can't
> know that it won't change before sys_connect reads it again.
> 
> Because of that, there are system calls which an LSM hook can safely accept
> that an ftrace hook cannot. However, as you point out, any arguments passed in
> registers are inheriently safe and these may be sufficiently powerful.
> 
> > There are two problems with the bitmap scheme, which i also
> > suggested in a previous thread but then found it to be lacking:
> >
> > 1) enumeration: you define a bitmap. That will be problematic
> >   between compat and native 64-bit (both have different syscall
> >   vectors).
> 
> I /think/ it works out, but I've been bitten before with subtle 32/64 bit
> compat issues and accept that it's a bit ugly.
> 
> > 2) flexibility. It's an on/off selection per syscall. With the
> >   filter we have on, off, or filtered. That's a _whole_ lot more
> >   flexible.
> 
> Absolutely.
> 
> Is there a git tree that I can pull this parsing code from?
> (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
> maybe?). I can patch in the seccomp-on-ftrace work and try building the
> filtering on top of that. I'll see how it turns out anyway.


Hi,

The most uptodate one is:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git
on the tracing/filters topic.

See kernel/trace/trace_events_filter.c

You might want to reuse some current syscall tracing facility.
An automated resolution table is built on bootime, you can look
at the end of arch/x86/kernel/ftrace.c

This table links each syscalls nr to the matching attribute entry of syscall
in which you can find:

- parameters names
- parameters types
- syscall name

You can look at the hooks in include/linux/syscall.h:
The sections inside
#ifdef CONFIG_FTRACE_SYSCALLS

Well, it's a bit insane to read, so you can also look
at include/trace/syscall.h and kernel/trace/trace_syscalls.c and
see how it is used and how it could be reused.


Thanks,
Frederic.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
  2009-05-07 22:14 ` Ingo Molnar
  2009-05-07 22:34   ` Adam Langley
@ 2009-05-08  2:37   ` James Morris
  2009-05-08  9:44     ` Ingo Molnar
  1 sibling, 1 reply; 12+ messages in thread
From: James Morris @ 2009-05-08  2:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Adam Langley, Andrew Morton, Frédéric Weisbecker,
	Tom Zanussi, Li Zefan, Steven Rostedt, linux-kernel, markus,
	linux-security-module

On Fri, 8 May 2009, Ingo Molnar wrote:

> > In general, I believe that ftrace based solutions cannot safely 
> > validate arguments which are in user-space memory when multiple 
> > threads could be racing to change the memory between ftrace and 
> > the eventual copy_from_user. Because of this, many useful 
> > arguments (such as the sockaddr to connect, the filename to open 
> > etc) are out of reach. LSM hooks appear to be the best way to 
> > impose limits in such cases. (Which we are also experimenting 
> > with).
> 
> That assessment is incorrect, there's no difference between safety 
> here really.
> 
> LSM cannot magically inspect user-space memory either when multiple 
> threads may access it. The point would be to define filters for 
> system call _arguments_, which are inherently thread-local and safe.

LSM hooks are placed so that they can access objects safely, e.g. after 
copy_from_user() and with all apropriate kernel locks for that object 
held, and also with all security-relevant information available for the 
particular operation.

You cannot do this with system call interception: it's an inherently racy 
and limited mechanism (and very well known for being so).

I'm concerned that we're seeing yet another security scheme being designed 
on the fly, without a well-formed threat model, and without taking into 
account lessons learned from the seemingly endless parade of similar, 
failed schemes.

Please refer to (for example):

- "Traps and Pitfalls: Practical Problems in System Call Interposition 
  Based Security Tools" by Tal Garfinkel.

    http://www.stanford.edu/~talg/papers/traps/abstract.html

- "Exploiting Concurrency Vulnerabilities in System Call Wrappers" by 
  Robert Watson.

    http://www.watson.org/~robert/2007woot/




-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
  2009-05-07 23:00     ` Frederic Weisbecker
@ 2009-05-08  5:32       ` Tom Zanussi
  2009-05-08  9:19         ` Ingo Molnar
  2009-05-08 11:12         ` Frederic Weisbecker
  2009-05-08  9:20       ` Ingo Molnar
  1 sibling, 2 replies; 12+ messages in thread
From: Tom Zanussi @ 2009-05-08  5:32 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Adam Langley, Ingo Molnar, Andrew Morton, Li Zefan,
	Steven Rostedt, linux-kernel, markus

On Fri, 2009-05-08 at 01:00 +0200, Frederic Weisbecker wrote:
> On Thu, May 07, 2009 at 03:34:58PM -0700, Adam Langley wrote:
> > > That assessment is incorrect, there's no difference between safety
> > > here really.
> > >
> > > LSM cannot magically inspect user-space memory either when multiple
> > > threads may access it. The point would be to define filters for
> > > system call _arguments_, which are inherently thread-local and safe.
> > 
> > If I hook security_operations.socket_connect, I can validate the struct
> > sockaddr after the final copy_from_user. However, since the sockaddr lives in
> > userspace memory, if I try and validate it from ftrace SYSCALL_ENTER, I can't
> > know that it won't change before sys_connect reads it again.
> > 
> > Because of that, there are system calls which an LSM hook can safely accept
> > that an ftrace hook cannot. However, as you point out, any arguments passed in
> > registers are inheriently safe and these may be sufficiently powerful.
> > 
> > > There are two problems with the bitmap scheme, which i also
> > > suggested in a previous thread but then found it to be lacking:
> > >
> > > 1) enumeration: you define a bitmap. That will be problematic
> > >   between compat and native 64-bit (both have different syscall
> > >   vectors).
> > 
> > I /think/ it works out, but I've been bitten before with subtle 32/64 bit
> > compat issues and accept that it's a bit ugly.
> > 
> > > 2) flexibility. It's an on/off selection per syscall. With the
> > >   filter we have on, off, or filtered. That's a _whole_ lot more
> > >   flexible.
> > 
> > Absolutely.
> > 
> > Is there a git tree that I can pull this parsing code from?
> > (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
> > maybe?). I can patch in the seccomp-on-ftrace work and try building the
> > filtering on top of that. I'll see how it turns out anyway.
> 
> 
> Hi,
> 
> The most uptodate one is:
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git
> on the tracing/filters topic.
> 
> See kernel/trace/trace_events_filter.c
> 
> You might want to reuse some current syscall tracing facility.
> An automated resolution table is built on bootime, you can look
> at the end of arch/x86/kernel/ftrace.c
> 
> This table links each syscalls nr to the matching attribute entry of syscall
> in which you can find:
> 
> - parameters names
> - parameters types
> - syscall name
> 
 
I guess it would be easier for seccomp to make use of the filtering and
reuse the syscall tracing facility if we added filtering to the syscall
tracer first.  It would be useful to do for the syscall tracer anyway,
regardless of whether it (or something similar) ended up being used by
seccomp.

Frederic, did you have any plans for that?  If not, I can take a look
when I get the chance...

Tom

> You can look at the hooks in include/linux/syscall.h:
> The sections inside
> #ifdef CONFIG_FTRACE_SYSCALLS
> 
> Well, it's a bit insane to read, so you can also look
> at include/trace/syscall.h and kernel/trace/trace_syscalls.c and
> see how it is used and how it could be reused.
> 
> 
> Thanks,
> Frederic.
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
  2009-05-08  5:32       ` Tom Zanussi
@ 2009-05-08  9:19         ` Ingo Molnar
  2009-05-08 11:12         ` Frederic Weisbecker
  1 sibling, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2009-05-08  9:19 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Frederic Weisbecker, Adam Langley, Andrew Morton, Li Zefan,
	Steven Rostedt, linux-kernel, markus


* Tom Zanussi <tzanussi@gmail.com> wrote:

> I guess it would be easier for seccomp to make use of the 
> filtering and reuse the syscall tracing facility if we added 
> filtering to the syscall tracer first.  It would be useful to do 
> for the syscall tracer anyway, regardless of whether it (or 
> something similar) ended up being used by seccomp.
> 
> Frederic, did you have any plans for that?  If not, I can take a 
> look when I get the chance...

Doing that would be very nice and useful indeed.

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
  2009-05-07 23:00     ` Frederic Weisbecker
  2009-05-08  5:32       ` Tom Zanussi
@ 2009-05-08  9:20       ` Ingo Molnar
  1 sibling, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2009-05-08  9:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Adam Langley, Andrew Morton, Tom Zanussi, Li Zefan,
	Steven Rostedt, linux-kernel, markus


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> On Thu, May 07, 2009 at 03:34:58PM -0700, Adam Langley wrote:
> > > That assessment is incorrect, there's no difference between safety
> > > here really.
> > >
> > > LSM cannot magically inspect user-space memory either when multiple
> > > threads may access it. The point would be to define filters for
> > > system call _arguments_, which are inherently thread-local and safe.
> > 
> > If I hook security_operations.socket_connect, I can validate the struct
> > sockaddr after the final copy_from_user. However, since the sockaddr lives in
> > userspace memory, if I try and validate it from ftrace SYSCALL_ENTER, I can't
> > know that it won't change before sys_connect reads it again.
> > 
> > Because of that, there are system calls which an LSM hook can safely accept
> > that an ftrace hook cannot. However, as you point out, any arguments passed in
> > registers are inheriently safe and these may be sufficiently powerful.
> > 
> > > There are two problems with the bitmap scheme, which i also
> > > suggested in a previous thread but then found it to be lacking:
> > >
> > > 1) enumeration: you define a bitmap. That will be problematic
> > >   between compat and native 64-bit (both have different syscall
> > >   vectors).
> > 
> > I /think/ it works out, but I've been bitten before with subtle 32/64 bit
> > compat issues and accept that it's a bit ugly.
> > 
> > > 2) flexibility. It's an on/off selection per syscall. With the
> > >   filter we have on, off, or filtered. That's a _whole_ lot more
> > >   flexible.
> > 
> > Absolutely.
> > 
> > Is there a git tree that I can pull this parsing code from?
> > (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
> > maybe?). I can patch in the seccomp-on-ftrace work and try building the
> > filtering on top of that. I'll see how it turns out anyway.
> 
> 
> Hi,
> 
> The most uptodate one is:
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git
> on the tracing/filters topic.
> 
> See kernel/trace/trace_events_filter.c

tracing/filters may lag behind - i'd suggest to use the master 
branch, that's the most stable stuff generally. Or tracing/core for 
any Git based tree.

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
  2009-05-08  2:37   ` James Morris
@ 2009-05-08  9:44     ` Ingo Molnar
  0 siblings, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2009-05-08  9:44 UTC (permalink / raw)
  To: James Morris
  Cc: Adam Langley, Andrew Morton, Frédéric Weisbecker,
	Tom Zanussi, Li Zefan, Steven Rostedt, linux-kernel, markus,
	linux-security-module

* James Morris <jmorris@namei.org> wrote:

> On Fri, 8 May 2009, Ingo Molnar wrote:
> 
> > > In general, I believe that ftrace based solutions cannot safely 
> > > validate arguments which are in user-space memory when multiple 
> > > threads could be racing to change the memory between ftrace and 
> > > the eventual copy_from_user. Because of this, many useful 
> > > arguments (such as the sockaddr to connect, the filename to open 
> > > etc) are out of reach. LSM hooks appear to be the best way to 
> > > impose limits in such cases. (Which we are also experimenting 
> > > with).
> > 
> > That assessment is incorrect, there's no difference between safety 
> > here really.
> > 
> > LSM cannot magically inspect user-space memory either when multiple 
> > threads may access it. The point would be to define filters for 
> > system call _arguments_, which are inherently thread-local and safe.
> 
> LSM hooks are placed so that they can access objects safely, e.g. 
> after copy_from_user() and with all apropriate kernel locks for 
> that object held, and also with all security-relevant information 
> available for the particular operation.
> 
> You cannot do this with system call interception: it's an 
> inherently racy and limited mechanism (and very well known for 
> being so).

Two things.

Firstly, the seccomp + filter engine based filtering method does not 
have to be limited to system call interception at all: by placing a 
tracepoint at that place seccomp can register itself to the same 
point as the LSM hook, and enumerate and expose the fields. It can 
be expressed in the string filter namespace just fine.

[ do we have nestable LSM hooks? If yes then seccomp could layer
  itself below any existing security context, in a hierarchical way, 
  to provide add-on restrictions. It is all about further 
  restrictions, not to creation or overruling of existing security 
  policies/modules. ]

Secondly, pure system call argument based filtering is already very 
powerful for _sandboxing_. Seccomp v1 is the proof for that, it is 
equivalent to the:

{ { "sys_read",			"1" },
  { "sys_write",		"1" },
  { "sys_ret_from_signal",	"1" } }

filter rules. Your argument really pertains to full-system security 
solutions - while maximising compatibility and capability and 
minimizing user invenience. _That_ is an extremely hard problem with 
many pitfalls and snake-oil merchants flooding the roads. But that 
is not our goal here: the goal is to restrict execution in very 
brutal but still performant ways.

That means we'd like to give finegrained but still very brutally 
constructed permissions to untrusted contexts. Instead of the 
seccomp v1 rules above, an app might want to inject these rules into 
a sandbox context:

{ { "sys_read",			"fd == 0" },
  { "sys_write",		"fd == 1" },
  { "sys_sigreturn",		"1" },
  { "sys_gettimeofday",		"tz == NULL" },

Note how such a (simple!) set of rules expands over seccomp v1 in a 
very meaningful way:

 - The sys_read rule restricts the read() syscall to stdin only. 
   Even if other fds exist.

 - The sys_write() rule restricts the write() syscall to stdout 
   only.

 - sys_gettimeofday() is allowed, but only tv is allowed - tz not. 

Note how we were able to _further restrict_ the seccomp v1 
sandboxing concept: under seccomp v1 the task would be able to write 
to stdin or read from stdout.

Furthermore, only fds 0 and 1 are allowed - under seccomp v1 if any 
other fd gets into the sandboxed context accidentally, it could make 
use of them. With the above filtering scheme that is denied.

Also, note the gettimeofday rule: we were able to 'halve' the 
security cross-section of the sys_gettimeofday() permission: we only 
allow &tv to be recovered, not the time zone.

So the filtering engine allows the very finegrained tailoring of the 
system call environment, right in the context of the sandboxed task, 
without context-switches.

The filtering engine is also 'safe' in that unprivileged tasks can 
use PRCTL_SECCOMP_SET with arbitrary strings, and the resulting 
filter expression is still to be parsed and later on executed by the 
kernel.

> I'm concerned that we're seeing yet another security scheme being 
> designed on the fly, without a well-formed threat model, and 
> without taking into account lessons learned from the seemingly 
> endless parade of similar, failed schemes.

I do agree that that danger is there (as with any security scheme), 
so this all has to be designed carefully.

[ I think as long as we shape it as "only additional restrictions on
  top of what is already there", in a strictly nested way, there's
  little danger of impacting existing security measures. ]

There's also the very real possibility of having a really flexible 
sandboxing model :) So i think Adam's work is fundamentally useful.

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
  2009-05-08  5:32       ` Tom Zanussi
  2009-05-08  9:19         ` Ingo Molnar
@ 2009-05-08 11:12         ` Frederic Weisbecker
  1 sibling, 0 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2009-05-08 11:12 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Adam Langley, Ingo Molnar, Andrew Morton, Li Zefan,
	Steven Rostedt, linux-kernel, markus

On Fri, May 08, 2009 at 12:32:25AM -0500, Tom Zanussi wrote:
> On Fri, 2009-05-08 at 01:00 +0200, Frederic Weisbecker wrote:
> > On Thu, May 07, 2009 at 03:34:58PM -0700, Adam Langley wrote:
> > > > That assessment is incorrect, there's no difference between safety
> > > > here really.
> > > >
> > > > LSM cannot magically inspect user-space memory either when multiple
> > > > threads may access it. The point would be to define filters for
> > > > system call _arguments_, which are inherently thread-local and safe.
> > > 
> > > If I hook security_operations.socket_connect, I can validate the struct
> > > sockaddr after the final copy_from_user. However, since the sockaddr lives in
> > > userspace memory, if I try and validate it from ftrace SYSCALL_ENTER, I can't
> > > know that it won't change before sys_connect reads it again.
> > > 
> > > Because of that, there are system calls which an LSM hook can safely accept
> > > that an ftrace hook cannot. However, as you point out, any arguments passed in
> > > registers are inheriently safe and these may be sufficiently powerful.
> > > 
> > > > There are two problems with the bitmap scheme, which i also
> > > > suggested in a previous thread but then found it to be lacking:
> > > >
> > > > 1) enumeration: you define a bitmap. That will be problematic
> > > >   between compat and native 64-bit (both have different syscall
> > > >   vectors).
> > > 
> > > I /think/ it works out, but I've been bitten before with subtle 32/64 bit
> > > compat issues and accept that it's a bit ugly.
> > > 
> > > > 2) flexibility. It's an on/off selection per syscall. With the
> > > >   filter we have on, off, or filtered. That's a _whole_ lot more
> > > >   flexible.
> > > 
> > > Absolutely.
> > > 
> > > Is there a git tree that I can pull this parsing code from?
> > > (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
> > > maybe?). I can patch in the seccomp-on-ftrace work and try building the
> > > filtering on top of that. I'll see how it turns out anyway.
> > 
> > 
> > Hi,
> > 
> > The most uptodate one is:
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git
> > on the tracing/filters topic.
> > 
> > See kernel/trace/trace_events_filter.c
> > 
> > You might want to reuse some current syscall tracing facility.
> > An automated resolution table is built on bootime, you can look
> > at the end of arch/x86/kernel/ftrace.c
> > 
> > This table links each syscalls nr to the matching attribute entry of syscall
> > in which you can find:
> > 
> > - parameters names
> > - parameters types
> > - syscall name
> > 
>  
> I guess it would be easier for seccomp to make use of the filtering and
> reuse the syscall tracing facility if we added filtering to the syscall
> tracer first.  It would be useful to do for the syscall tracer anyway,
> regardless of whether it (or something similar) ended up being used by
> seccomp.
> 
> Frederic, did you have any plans for that?  If not, I can take a look
> when I get the chance...
> 
> Tom



You're totally right!
Well, I will not have enough time for that until next week-end. So
if you have free slots for that, don't hesitate :)

Thanks!

Frederic.


> 
> > You can look at the hooks in include/linux/syscall.h:
> > The sections inside
> > #ifdef CONFIG_FTRACE_SYSCALLS
> > 
> > Well, it's a bit insane to read, so you can also look
> > at include/trace/syscall.h and kernel/trace/trace_syscalls.c and
> > see how it is used and how it could be reused.
> > 
> > 
> > Thanks,
> > Frederic.
> > 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
  2009-05-07 21:48 [RFC 1/1] seccomp: Add bitmask of allowed system calls Adam Langley
  2009-05-07 22:14 ` Ingo Molnar
@ 2009-05-15 19:56 ` Pavel Machek
  2009-05-15 20:29   ` Adam Langley
  1 sibling, 1 reply; 12+ messages in thread
From: Pavel Machek @ 2009-05-15 19:56 UTC (permalink / raw)
  To: Adam Langley; +Cc: linux-kernel, markus

Hi!

> Briefly, it adds a second seccomp mode (2) where one uploads a bitmask.
> Syscall n is allowed if, and only if, bit n is true in the bitmask. If n
> is beyond the range of the bitmask, the syscall is denied.
> 
> If prctl is allowed by the bitmask, then a process may switch to mode 1,
> or may set a new bitmask iff the new bitmask is a subset of the current
> one. (Possibly moving to mode 1 should only be allowed if read, write,
> sigreturn, exit are in the currently allowed set.)
> 
> If a process forks/clones, the child inherits the seccomp state of the
> parent. (And hopefully I'm managing the memory correctly here.)

If you allow setuid exec here, you have added a security hole. Deny
setuid() to sendmail and have fun...

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
  2009-05-15 19:56 ` Pavel Machek
@ 2009-05-15 20:29   ` Adam Langley
  0 siblings, 0 replies; 12+ messages in thread
From: Adam Langley @ 2009-05-15 20:29 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-kernel, markus

On Fri, May 15, 2009 at 12:56 PM, Pavel Machek <pavel@ucw.cz> wrote:
> If you allow setuid exec here, you have added a security hole. Deny
> setuid() to sendmail and have fun...

Good point there. If you have any seccomp enabled you shouldn't be
able to run suid binaries. Being able to deny random syscalls to suid
binaries is too scary.

If this patch ever goes anywhere I'll add that.


Thanks

AGL

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2009-05-15 20:29 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-05-07 21:48 [RFC 1/1] seccomp: Add bitmask of allowed system calls Adam Langley
2009-05-07 22:14 ` Ingo Molnar
2009-05-07 22:34   ` Adam Langley
2009-05-07 23:00     ` Frederic Weisbecker
2009-05-08  5:32       ` Tom Zanussi
2009-05-08  9:19         ` Ingo Molnar
2009-05-08 11:12         ` Frederic Weisbecker
2009-05-08  9:20       ` Ingo Molnar
2009-05-08  2:37   ` James Morris
2009-05-08  9:44     ` Ingo Molnar
2009-05-15 19:56 ` Pavel Machek
2009-05-15 20:29   ` Adam Langley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox