* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
       [not found] <1304017638.18763.205.camel@gandalf.stny.rr.com>
@ 2011-05-12  3:02 ` Will Drewry
  2011-05-12  7:48   ` Ingo Molnar
                     ` (3 more replies)
  0 siblings, 4 replies; 77+ messages in thread
From: Will Drewry @ 2011-05-12  3:02 UTC (permalink / raw)
  To: linux-arm-kernel
This change adds a new seccomp mode based on the work by
agl at chromium.org in [1]. This new mode, "filter mode", provides a hash
table of seccomp_filter objects.  When in the new mode (2), all system
calls are checked against the filters - first by system call number,
then by a filter string.  If an entry exists for a given system call and
all filter predicates evaluate to true, then the task may proceed.
Otherwise, the task is killed (as per seccomp_mode == 1).
Filter string parsing and evaluation  is handled by the ftrace filter
engine.  Related patches tweak to the perf filter trace and free allow
the call to be shared. Filters inherit their understanding of types and
arguments for each system call from the CONFIG_FTRACE_SYSCALLS subsystem
which already predefines this information in syscall_metadata associated
enter_event (and exit_event) structures. If CONFIG_FTRACE and
CONFIG_FTRACE_SYSCALLS are not compiled in, only "1" filter strings will
be allowed.
The net result is a process may have its system calls filtered using the
ftrace filter engine's inherent understanding of systems calls. A
logical ruleset for a process that only needs stdin/stdout may be:
  sys_read: fd == 0
  sys_write: fd == 1 || fd == 2
  sys_exit: 1
The set of filters is specified through the PR_SET_SECCOMP path in prctl().
For example:
  prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 0");
  prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd == 1 || fd == 2");
  prctl(PR_SET_SECCOMP_FILTER, __NR_exit, "1");
  prctl(PR_SET_SECCOMP, 2, 0);
v2: - changed to use the existing syscall number ABI.
    - prctl changes to minimize parsing in the kernel:
      prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
      prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
      prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
      prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
    - defined PR_SECCOMP_MODE_STRICT and ..._FILTER
    - added flags
    - provide a default fail syscall_nr_to_meta in ftrace
    - provides fallback for unhooked system calls
    - use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
    - added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
    - moved to a hlist and 4 bit hash of linked lists
    - added support to operate without CONFIG_FTRACE_SYSCALLS
    - moved Kconfig support next to SECCOMP
      (should this be done in per-platform patches?)
    - made Kconfig entries dependent on EXPERIMENTAL
    - added macros to avoid ifdefs from kernel/fork.c
    - added compat task/filter matching
    - drop seccomp.h inclusion in sched.h and drop seccomp_t
    - added Filtering to "show" output
    - added on_exec state dup'ing when enabling after a fast-path accept.
Signed-off-by: Will Drewry <wad@chromium.org>
---
 arch/arm/Kconfig        |   10 +
 arch/microblaze/Kconfig |   10 +
 arch/mips/Kconfig       |   10 +
 arch/powerpc/Kconfig    |   10 +
 arch/s390/Kconfig       |   10 +
 arch/sh/Kconfig         |   10 +
 arch/sparc/Kconfig      |   10 +
 arch/x86/Kconfig        |   10 +
 include/linux/prctl.h   |    9 +
 include/linux/sched.h   |    5 +-
 include/linux/seccomp.h |  116 +++++++++-
 include/trace/syscall.h |    7 +
 kernel/Makefile         |    3 +
 kernel/fork.c           |    3 +
 kernel/seccomp.c        |  228 +++++++++++++++++-
 kernel/seccomp.h        |   74 ++++++
 kernel/seccomp_filter.c |  581 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c            |   15 +-
 18 files changed, 1100 insertions(+), 21 deletions(-)
 create mode 100644 kernel/seccomp.h
 create mode 100644 kernel/seccomp_filter.c
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 377a7a5..22e1668 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1664,6 +1664,16 @@ config SECCOMP
 	  and the task is only allowed to execute a few safe syscalls
 	  defined by each seccomp mode.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config CC_STACKPROTECTOR
 	bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
 	depends on EXPERIMENTAL
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index eccdefe..7641ee9 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -129,6 +129,16 @@ config SECCOMP
 
 	  If unsure, say Y. Only embedded should say N here.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 endmenu
 
 menu "Advanced setup"
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 8e256cc..fe4cbda 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2245,6 +2245,16 @@ config SECCOMP
 
 	  If unsure, say Y. Only embedded should say N here.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config USE_OF
 	bool "Flattened Device Tree support"
 	select OF
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8f4d50b..83499e4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -605,6 +605,16 @@ config SECCOMP
 
 	  If unsure, say Y. Only embedded should say N here.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2508a6f..2777515 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -614,6 +614,16 @@ config SECCOMP
 
 	  If unsure, say Y.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 endmenu
 
 menu "Power Management"
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 4b89da2..00c1521 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -676,6 +676,16 @@ config SECCOMP
 
 	  If unsure, say N.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config SMP
 	bool "Symmetric multi-processing support"
 	depends on SYS_SUPPORTS_SMP
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e560d10..5b42255 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -270,6 +270,16 @@ config SECCOMP
 
 	  If unsure, say Y. Only embedded should say N here.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config HOTPLUG_CPU
 	bool "Support for hot-pluggable CPUs"
 	depends on SPARC64 && SMP
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc6c53a..d6d44d9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1485,6 +1485,16 @@ config SECCOMP
 
 	  If unsure, say Y. Only embedded should say N here.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config CC_STACKPROTECTOR
 	bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
 	---help---
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..379b391 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -63,6 +63,15 @@
 /* Get/set process seccomp mode */
 #define PR_GET_SECCOMP	21
 #define PR_SET_SECCOMP	22
+# define PR_SECCOMP_MODE_NONE	0
+# define PR_SECCOMP_MODE_STRICT	1
+# define PR_SECCOMP_MODE_FILTER	2
+# define PR_SECCOMP_FLAG_FILTER_ON_EXEC	(1 << 1)
+
+/* Get/set process seccomp filters */
+#define PR_GET_SECCOMP_FILTER	35
+#define PR_SET_SECCOMP_FILTER	36
+#define PR_CLEAR_SECCOMP_FILTER	37
 
 /* Get/set the capability bounding set (as per security/commoncap.c) */
 #define PR_CAPBSET_READ 23
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18d63ce..27eacf9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -77,7 +77,6 @@ struct sched_param {
 #include <linux/percpu.h>
 #include <linux/topology.h>
 #include <linux/proportions.h>
-#include <linux/seccomp.h>
 #include <linux/rcupdate.h>
 #include <linux/rculist.h>
 #include <linux/rtmutex.h>
@@ -1190,6 +1189,8 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+struct seccomp_state;
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1374,7 +1375,7 @@ struct task_struct {
 	uid_t loginuid;
 	unsigned int sessionid;
 #endif
-	seccomp_t seccomp;
+	struct seccomp_state *seccomp;
 
 /* Thread group tracking */
    	u32 parent_exec_id;
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 167c333..289c836 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -2,12 +2,34 @@
 #define _LINUX_SECCOMP_H
 
 
+/* Forward declare for proc interface */
+struct seq_file;
+
 #ifdef CONFIG_SECCOMP
 
+#include <linux/errno.h>
+#include <linux/list.h>
 #include <linux/thread_info.h>
+#include <linux/types.h>
 #include <asm/seccomp.h>
 
-typedef struct { int mode; } seccomp_t;
+/**
+ * struct seccomp_state - the state of a seccomp'ed process
+ *
+ * @mode:
+ *     if this is 1, the process is under standard seccomp rules
+ *             is 2, the process is only allowed to make system calls where
+ *                   associated filters evaluate successfully.
+ * @usage: number of references to the current instance.
+ * @flags: a bitmask of behavior altering flags.
+ * @filters: Hash table of filters if using CONFIG_SECCOMP_FILTER.
+ */
+struct seccomp_state {
+	uint16_t mode;
+	atomic_t usage;
+	long flags;
+	struct seccomp_filter_table *filters;
+};
 
 extern void __secure_computing(int);
 static inline void secure_computing(int this_syscall)
@@ -16,27 +38,113 @@ static inline void secure_computing(int this_syscall)
 		__secure_computing(this_syscall);
 }
 
+extern struct seccomp_state *seccomp_state_new(void);
+extern struct seccomp_state *seccomp_state_dup(const struct seccomp_state *);
+extern struct seccomp_state *get_seccomp_state(struct seccomp_state *);
+extern void put_seccomp_state(struct seccomp_state *);
+
+extern long prctl_set_seccomp(unsigned long, unsigned long);
 extern long prctl_get_seccomp(void);
-extern long prctl_set_seccomp(unsigned long);
+
+extern long prctl_set_seccomp_filter(unsigned long, char __user *);
+extern long prctl_get_seccomp_filter(unsigned long, char __user *,
+				     unsigned long);
+extern long prctl_clear_seccomp_filter(unsigned long);
+
+#define inherit_tsk_seccomp_state(_child, _orig) \
+	_child->seccomp = get_seccomp_state(_orig->seccomp);
+#define put_tsk_seccomp_state(_tsk) put_seccomp_state(_tsk->seccomp)
 
 #else /* CONFIG_SECCOMP */
 
 #include <linux/errno.h>
 
-typedef struct { } seccomp_t;
+struct seccomp_state { };
 
 #define secure_computing(x) do { } while (0)
+#define inherit_tsk_seccomp_state(_child, _orig) do { } while (0)
+#define put_tsk_seccomp_state(_tsk) do { } while (0)
 
 static inline long prctl_get_seccomp(void)
 {
 	return -EINVAL;
 }
 
-static inline long prctl_set_seccomp(unsigned long arg2)
+static inline long prctl_set_seccomp(unsigned long a2, unsigned long a3)
 {
 	return -EINVAL;
 }
 
+static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3)
+{
+	return -ENOSYS;
+}
+
+static inline long prctl_clear_seccomp_filter(unsigned long a2)
+{
+	return -ENOSYS;
+}
+
+static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3,
+					    unsigned long a4)
+{
+	return -ENOSYS;
+}
+
+static inline struct seccomp_state *seccomp_state_new(void)
+{
+	return NULL;
+}
+
+static inline struct seccomp_state *seccomp_state_dup(
+					const struct seccomp_state *state)
+{
+	return NULL;
+}
+
+static inline struct seccomp_state *get_seccomp_state(
+					struct seccomp_state *state)
+{
+	return NULL;
+}
+
+static inline void put_seccomp_state(struct seccomp_state *state)
+{
+}
+
 #endif /* CONFIG_SECCOMP */
 
+#ifdef CONFIG_SECCOMP_FILTER
+
+extern int seccomp_show_filters(struct seccomp_state *, struct seq_file *);
+extern long seccomp_set_filter(int, char *);
+extern long seccomp_clear_filter(int);
+extern long seccomp_get_filter(int, char *, unsigned long);
+
+#else  /* CONFIG_SECCOMP_FILTER */
+
+static inline int seccomp_show_filters(struct seccomp_state *state,
+				       struct seq_file *m)
+{
+	return -ENOSYS;
+}
+
+static inline long seccomp_set_filter(int syscall_nr, char *filter)
+{
+	return -ENOSYS;
+}
+
+static inline long seccomp_clear_filter(int syscall_nr)
+{
+	return -ENOSYS;
+}
+
+static inline long seccomp_get_filter(int syscall_nr,
+				      char *buf, unsigned long available)
+{
+	return -ENOSYS;
+}
+
+#endif  /* CONFIG_SECCOMP_FILTER */
+
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 242ae04..e061ad0 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -35,6 +35,8 @@ struct syscall_metadata {
 extern unsigned long arch_syscall_addr(int nr);
 extern int init_syscall_trace(struct ftrace_event_call *call);
 
+extern struct syscall_metadata *syscall_nr_to_meta(int);
+
 extern int reg_event_syscall_enter(struct ftrace_event_call *call);
 extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
 extern int reg_event_syscall_exit(struct ftrace_event_call *call);
@@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
 				      struct trace_event *event);
 enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
 				     struct trace_event *event);
+#else
+static inline struct syscall_metadata *syscall_nr_to_meta(int nr)
+{
+	return NULL;
+}
 #endif
 
 #ifdef CONFIG_PERF_EVENTS
diff --git a/kernel/Makefile b/kernel/Makefile
index 85cbfb3..84e7dfb 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
+ifeq ($(CONFIG_SECCOMP_FILTER),y)
+obj-$(CONFIG_SECCOMP) += seccomp_filter.o
+endif
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TREE_RCU) += rcutree.o
 obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index e7548de..46987d4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
 #include <linux/cgroup.h>
 #include <linux/security.h>
 #include <linux/hugetlb.h>
+#include <linux/seccomp.h>
 #include <linux/swap.h>
 #include <linux/syscalls.h>
 #include <linux/jiffies.h>
@@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
+	put_tsk_seccomp_state(tsk);
 	free_task_struct(tsk);
 }
 EXPORT_SYMBOL(free_task);
@@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	if (err)
 		goto out;
 
+	inherit_tsk_seccomp_state(tsk, orig);
 	setup_thread_stack(tsk, orig);
 	clear_user_return_notifier(tsk);
 	clear_tsk_need_resched(tsk);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..502ba04 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -6,12 +6,15 @@
  * This defines a simple but solid secure-computing mode.
  */
 
+#include <linux/err.h>
+#include <linux/prctl.h>
 #include <linux/seccomp.h>
 #include <linux/sched.h>
+#include <linux/slab.h>
 #include <linux/compat.h>
+#include <linux/unistd.h>
 
-/* #define SECCOMP_DEBUG 1 */
-#define NR_SECCOMP_MODES 1
+#include "seccomp.h"
 
 /*
  * Secure computing mode 1 allows only read/write/exit/sigreturn.
@@ -32,11 +35,13 @@ static int mode1_syscalls_32[] = {
 
 void __secure_computing(int this_syscall)
 {
-	int mode = current->seccomp.mode;
+	int mode = -1;
+	long ret = 0;
 	int * syscall;
-
+	if (current->seccomp)
+		mode = current->seccomp->mode;
 	switch (mode) {
-	case 1:
+	case PR_SECCOMP_MODE_STRICT:
 		syscall = mode1_syscalls;
 #ifdef CONFIG_COMPAT
 		if (is_compat_task())
@@ -47,6 +52,20 @@ void __secure_computing(int this_syscall)
 				return;
 		} while (*++syscall);
 		break;
+	case PR_SECCOMP_MODE_FILTER:
+		if (this_syscall >= NR_syscalls || this_syscall < 0)
+			break;
+		ret = seccomp_test_filters(current->seccomp, this_syscall);
+		if (!ret)
+			return;
+		/* Only check for an override if an access failure occurred. */
+		if (ret != -EACCES)
+			break;
+		ret = seccomp_maybe_apply_filters(current, this_syscall);
+		if (!ret)
+			return;
+		seccomp_filter_log_failure(this_syscall);
+		break;
 	default:
 		BUG();
 	}
@@ -57,30 +76,213 @@ void __secure_computing(int this_syscall)
 	do_exit(SIGKILL);
 }
 
+/* seccomp_state_new - allocate a new state object. */
+struct seccomp_state *seccomp_state_new()
+{
+	struct seccomp_state *new = kzalloc(sizeof(struct seccomp_state),
+					    GFP_KERNEL);
+	if (!new)
+		return NULL;
+
+	new->flags = 0;
+#ifdef CONFIG_COMPAT
+	/* Annotate if this filterset is being created by a compat task. */
+	if (is_compat_task())
+		new->flags |= SECCOMP_FLAG_COMPAT;
+#endif
+
+	atomic_set(&new->usage, 1);
+	new->filters = seccomp_filter_table_new();
+	/* Not supported errors are fine, others are a problem. */
+	if (IS_ERR(new->filters) && PTR_ERR(new->filters) != -ENOSYS) {
+		kfree(new);
+		new = NULL;
+	}
+	return new;
+}
+
+/* seccomp_state_dup - copies an existing state object. */
+struct seccomp_state *seccomp_state_dup(const struct seccomp_state *orig)
+{
+	int err;
+	struct seccomp_state *new_state = seccomp_state_new();
+
+	err = -ENOMEM;
+	if (!new_state)
+		goto fail;
+	new_state->mode = orig->mode;
+	/* Flag copying will hide if the new process is a compat task.  However,
+	 * if the rule was compat/non-compat and the process is the opposite,
+	 * enforcement will terminate it.
+	 */
+	new_state->flags = orig->flags;
+	err = seccomp_copy_all_filters(new_state->filters,
+				       orig->filters);
+	if (err)
+		goto fail;
+
+	return new_state;
+fail:
+	put_seccomp_state(new_state);
+	return NULL;
+}
+
+/* get_seccomp_state - increments the reference count of @orig */
+struct seccomp_state *get_seccomp_state(struct seccomp_state *orig)
+{
+	if (!orig)
+		return NULL;
+	atomic_inc(&orig->usage);
+	return orig;
+}
+
+static void __put_seccomp_state(struct seccomp_state *orig)
+{
+	WARN_ON(atomic_read(&orig->usage));
+	seccomp_drop_all_filters(orig);
+	kfree(orig);
+}
+
+/* put_seccomp_state - decrements the reference count of @orig and may free. */
+void put_seccomp_state(struct seccomp_state *orig)
+{
+	if (!orig)
+		return;
+
+	if (atomic_dec_and_test(&orig->usage))
+		__put_seccomp_state(orig);
+}
+
 long prctl_get_seccomp(void)
 {
-	return current->seccomp.mode;
+	if (!current->seccomp)
+		return 0;
+	return current->seccomp->mode;
 }
 
-long prctl_set_seccomp(unsigned long seccomp_mode)
+long prctl_set_seccomp(unsigned long seccomp_mode, unsigned long flags)
 {
 	long ret;
+	struct seccomp_state *state, *cur_state;
 
+	cur_state = get_seccomp_state(current->seccomp);
 	/* can set it only once to be even more secure */
 	ret = -EPERM;
-	if (unlikely(current->seccomp.mode))
+	if (cur_state && unlikely(cur_state->mode))
 		goto out;
 
 	ret = -EINVAL;
-	if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
-		current->seccomp.mode = seccomp_mode;
-		set_thread_flag(TIF_SECCOMP);
+	if (seccomp_mode <= 0 || seccomp_mode > NR_SECCOMP_MODES)
+		goto out;
+
+	ret = -ENOMEM;
+	state = (cur_state ? seccomp_state_dup(cur_state) :
+			     seccomp_state_new());
+	if (!state)
+		goto out;
+
+	if (seccomp_mode == PR_SECCOMP_MODE_STRICT) {
 #ifdef TIF_NOTSC
 		disable_TSC();
 #endif
-		ret = 0;
 	}
 
- out:
+	rcu_assign_pointer(current->seccomp, state);
+	synchronize_rcu();
+	put_seccomp_state(cur_state);	/* For the task */
+
+	/* Convert the ABI flag to the internal flag value. */
+	if (seccomp_mode == PR_SECCOMP_MODE_FILTER &&
+	    (flags & PR_SECCOMP_FLAG_FILTER_ON_EXEC))
+		state->flags |= SECCOMP_FLAG_ON_EXEC;
+	/* Encourage flag values to stay synchronized explicitly. */
+	BUILD_BUG_ON(PR_SECCOMP_FLAG_FILTER_ON_EXEC != SECCOMP_FLAG_ON_EXEC);
+
+	/* Only set the thread flag once after the new state is in place. */
+	state->mode = seccomp_mode;
+	set_thread_flag(TIF_SECCOMP);
+	ret = 0;
+
+out:
+	put_seccomp_state(cur_state);	/* for the get */
+	return ret;
+}
+
+long prctl_set_seccomp_filter(unsigned long syscall_nr,
+			      char __user *user_filter)
+{
+	int nr;
+	long ret;
+	char filter[SECCOMP_MAX_FILTER_LENGTH];
+
+	ret = -EINVAL;
+	if (syscall_nr >= NR_syscalls || syscall_nr < 0)
+		goto out;
+
+	ret = -EFAULT;
+	if (!user_filter ||
+	    strncpy_from_user(filter, user_filter,
+			      sizeof(filter) - 1) < 0)
+		goto out;
+
+	nr = (int) syscall_nr;
+	ret = seccomp_set_filter(nr, filter);
+
+out:
+	return ret;
+}
+
+long prctl_clear_seccomp_filter(unsigned long syscall_nr)
+{
+	int nr = -1;
+	long ret;
+
+	ret = -EINVAL;
+	if (syscall_nr >= NR_syscalls || syscall_nr < 0)
+		goto out;
+
+	nr = (int) syscall_nr;
+	ret = seccomp_clear_filter(nr);
+
+out:
+	return ret;
+}
+
+long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
+			      unsigned long available)
+{
+	int ret, nr;
+	unsigned long copied;
+	char *buf = NULL;
+	ret = -EINVAL;
+	if (!available)
+		goto out;
+	/* Ignore extra buffer space. */
+	if (available > SECCOMP_MAX_FILTER_LENGTH)
+		available = SECCOMP_MAX_FILTER_LENGTH;
+
+	ret = -EINVAL;
+	if (syscall_nr >= NR_syscalls || syscall_nr < 0)
+		goto out;
+	nr = (int) syscall_nr;
+
+	ret = -ENOMEM;
+	buf = kmalloc(available, GFP_KERNEL);
+	if (!buf)
+		goto out;
+
+	ret = seccomp_get_filter(nr, buf, available);
+	if (ret < 0)
+		goto out;
+
+	/* Include the NUL byte in the copy. */
+	copied = copy_to_user(dst, buf, ret + 1);
+	ret = -ENOSPC;
+	if (copied)
+		goto out;
+
+	ret = 0;
+out:
+	kfree(buf);
 	return ret;
 }
diff --git a/kernel/seccomp.h b/kernel/seccomp.h
new file mode 100644
index 0000000..5abd219
--- /dev/null
+++ b/kernel/seccomp.h
@@ -0,0 +1,74 @@
+/*
+ * seccomp/seccomp_filter shared internal prototypes and state.
+ *
+ * Copyright (C) 2011 Chromium OS Authors.
+ */
+
+#ifndef __KERNEL_SECCOMP_H
+#define __KERNEL_SECCOMP_H
+
+#include <linux/ftrace_event.h>
+#include <linux/seccomp.h>
+
+/* #define SECCOMP_DEBUG 1 */
+#define NR_SECCOMP_MODES 2
+
+/* Inherit the max filter length from the filtering engine. */
+#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
+
+/* Presently, flags only affect SECCOMP_FILTER. */
+#define _SECCOMP_FLAG_COMPAT	0
+#define _SECCOMP_FLAG_ON_EXEC	1
+
+#define SECCOMP_FLAG_COMPAT	(1 << (_SECCOMP_FLAG_COMPAT))
+#define SECCOMP_FLAG_ON_EXEC	(1 << (_SECCOMP_FLAG_ON_EXEC))
+
+
+#ifdef CONFIG_SECCOMP_FILTER
+
+#define SECCOMP_FILTER_HASH_BITS 4
+#define SECCOMP_FILTER_HASH_SIZE (1 << SECCOMP_FILTER_HASH_BITS)
+
+struct seccomp_filter_table;
+extern struct seccomp_filter_table *seccomp_filter_table_new(void);
+extern int seccomp_copy_all_filters(struct seccomp_filter_table *,
+				    const struct seccomp_filter_table *);
+extern void seccomp_drop_all_filters(struct seccomp_state *);
+
+extern int seccomp_test_filters(struct seccomp_state *, int);
+extern int seccomp_maybe_apply_filters(struct task_struct *, int);
+extern void seccomp_filter_log_failure(int);
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+static inline void seccomp_filter_log_failure(int syscall)
+{
+}
+
+static inline int seccomp_maybe_apply_filters(struct task_struct *tsk,
+					      int syscall_nr)
+{
+	return -ENOSYS;
+}
+
+static inline struct seccomp_filter_table *seccomp_filter_table_new(void)
+{
+	return ERR_PTR(-ENOSYS);
+}
+
+static inline int seccomp_test_filters(struct seccomp_state *state, int nr)
+{
+	return -ENOSYS;
+}
+
+extern inline int seccomp_copy_all_filters(struct seccomp_filter_table *dst,
+					const struct seccomp_filter_table *src)
+{
+	return 0;
+}
+
+static inline void seccomp_drop_all_filters(struct seccomp_state *state) { }
+
+#endif /* CONFIG_SECCOMP_FILTER */
+
+#endif /* __KERNEL_SECCOMP_H */
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..ff4e055
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,581 @@
+/* filter engine-based seccomp system call filtering.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ */
+
+#include <linux/compat.h>
+#include <linux/errno.h>
+#include <linux/hash.h>
+#include <linux/prctl.h>
+#include <linux/seccomp.h>
+#include <linux/seq_file.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include <asm/syscall.h>
+#include <trace/syscall.h>
+
+#include "seccomp.h"
+
+#define SECCOMP_MAX_FILTER_COUNT 512
+#define SECCOMP_WILDCARD_FILTER "1"
+
+struct seccomp_filter {
+	struct hlist_node node;
+	int syscall_nr;
+	struct syscall_metadata *data;
+	struct event_filter *event_filter;
+};
+
+struct seccomp_filter_table {
+	struct hlist_head heads[SECCOMP_FILTER_HASH_SIZE];
+	int count;
+};
+
+struct seccomp_filter_table *seccomp_filter_table_new(void)
+{
+	struct seccomp_filter_table *t =
+		kzalloc(sizeof(struct seccomp_filter_table), GFP_KERNEL);
+	if (!t)
+		return ERR_PTR(-ENOMEM);
+	return t;
+}
+
+static inline u32 syscall_hash(int syscall_nr)
+{
+	return hash_32(syscall_nr, SECCOMP_FILTER_HASH_BITS);
+}
+
+static const char *get_filter_string(struct seccomp_filter *f)
+{
+	const char *str = SECCOMP_WILDCARD_FILTER;
+	if (!f)
+		return NULL;
+
+	/* Missing event filters qualify as wildcard matches. */
+	if (!f->event_filter)
+		return str;
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+	str = ftrace_get_filter_string(f->event_filter);
+#endif
+	return str;
+}
+
+static struct seccomp_filter *alloc_seccomp_filter(int syscall_nr,
+						   const char *filter_string)
+{
+	int err = -ENOMEM;
+	struct seccomp_filter *filter = kzalloc(sizeof(struct seccomp_filter),
+						GFP_KERNEL);
+	if (!filter)
+		goto fail;
+
+	INIT_HLIST_NODE(&filter->node);
+	filter->syscall_nr = syscall_nr;
+	filter->data = syscall_nr_to_meta(syscall_nr);
+
+	/* Treat a filter of SECCOMP_WILDCARD_FILTER as a wildcard and skip
+	 * using a predicate at all.
+	 */
+	if (!strcmp(SECCOMP_WILDCARD_FILTER, filter_string))
+		goto out;
+
+	/* Argument-based filtering only works on ftrace-hooked syscalls. */
+	if (!filter->data) {
+		err = -ENOSYS;
+		goto fail;
+	}
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+	err = ftrace_parse_filter(&filter->event_filter,
+				  filter->data->enter_event->event.type,
+				  filter_string);
+	if (err)
+		goto fail;
+#endif
+
+out:
+	return filter;
+
+fail:
+	kfree(filter);
+	return ERR_PTR(err);
+}
+
+static void free_seccomp_filter(struct seccomp_filter *filter)
+{
+#ifdef CONFIG_FTRACE_SYSCALLS
+	ftrace_free_filter(filter->event_filter);
+#endif
+	kfree(filter);
+}
+
+static struct seccomp_filter *copy_seccomp_filter(struct seccomp_filter *orig)
+{
+	return alloc_seccomp_filter(orig->syscall_nr, get_filter_string(orig));
+}
+
+/* Returns the matching filter or NULL */
+static struct seccomp_filter *find_filter(struct seccomp_state *state,
+					  int syscall)
+{
+	struct hlist_node *this, *pos;
+	struct seccomp_filter *filter = NULL;
+
+	u32 head = syscall_hash(syscall);
+	if (head >= SECCOMP_FILTER_HASH_SIZE)
+		goto out;
+
+	hlist_for_each_safe(this, pos, &state->filters->heads[head]) {
+		filter = hlist_entry(this, struct seccomp_filter, node);
+		if (filter->syscall_nr == syscall)
+			goto out;
+	}
+
+	filter = NULL;
+
+out:
+	return filter;
+}
+
+/* Safely drops all filters for a given syscall. This should only be called
+ * on unattached seccomp_state objects.
+ */
+static void drop_filter(struct seccomp_state *state, int syscall_nr)
+{
+	struct seccomp_filter *filter = find_filter(state, syscall_nr);
+	if (!filter)
+		return;
+
+	WARN_ON(state->filters->count == 0);
+	state->filters->count--;
+	hlist_del(&filter->node);
+	free_seccomp_filter(filter);
+}
+
+/* This should only be called on unattached seccomp_state objects. */
+static int add_filter(struct seccomp_state *state, int syscall_nr,
+		      char *filter_string)
+{
+	struct seccomp_filter *filter;
+	struct hlist_head *head;
+	char merged[SECCOMP_MAX_FILTER_LENGTH];
+	int ret;
+	u32 hash = syscall_hash(syscall_nr);
+
+	ret = -EINVAL;
+	if (state->filters->count == SECCOMP_MAX_FILTER_COUNT - 1)
+		goto out;
+
+	filter_string = strstrip(filter_string);
+
+	/* Disallow empty strings. */
+	if (filter_string[0] == 0)
+		goto out;
+
+	/* Get the right list head. */
+	head = &state->filters->heads[hash];
+
+	/* Find out if there is an existing entry to append to and
+	 * build the resultant filter string.  The original filter can be
+	 * destroyed here since the caller should be operating on a copy.
+	 */
+	filter = find_filter(state, syscall_nr);
+	if (filter) {
+		int expected = snprintf(merged, sizeof(merged), "(%s) && %s",
+					get_filter_string(filter),
+					filter_string);
+		ret = -E2BIG;
+		if (expected >= sizeof(merged) || expected < 0)
+			goto out;
+		filter_string = merged;
+		hlist_del(&filter->node);
+		free_seccomp_filter(filter);
+	}
+
+	/* When in seccomp filtering mode, only allow additions. */
+	ret = -EACCES;
+	if (filter == NULL && state->mode == PR_SECCOMP_MODE_FILTER)
+		goto out;
+
+	ret = 0;
+	filter = alloc_seccomp_filter(syscall_nr, filter_string);
+	if (IS_ERR(filter)) {
+		ret = PTR_ERR(filter);
+		goto out;
+	}
+
+	state->filters->count++;
+	hlist_add_head(&filter->node, head);
+out:
+	return ret;
+}
+
+/* Wrap optional ftrace syscall support. Returns 1 on match or if ftrace is not
+ * supported.
+ */
+static int do_ftrace_syscall_match(struct event_filter *event_filter)
+{
+	int err = 1;
+#ifdef CONFIG_FTRACE_SYSCALLS
+	uint8_t syscall_state[64];
+
+	memset(syscall_state, 0, sizeof(syscall_state));
+
+	/* The generic tracing entry can remain zeroed. */
+	err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
+					 NULL);
+	if (err)
+		return 0;
+
+	err = filter_match_preds(event_filter, syscall_state);
+#endif
+	return err;
+}
+
+/* 1 on match, 0 otherwise. */
+static int filter_match_current(struct seccomp_filter *filter)
+{
+	/* If no event filter exists, we assume a wildcard match. */
+	if (!filter->event_filter)
+		return 1;
+
+	return do_ftrace_syscall_match(filter->event_filter);
+}
+
+#ifndef KSTK_EIP
+#define KSTK_EIP(x) 0L
+#endif
+
+static const char *syscall_nr_to_name(int syscall)
+{
+	const char *syscall_name = "unknown";
+	struct syscall_metadata *data = syscall_nr_to_meta(syscall);
+	if (data)
+		syscall_name = data->name;
+	return syscall_name;
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+	printk(KERN_INFO
+		"%s[%d]: system call %d (%s) blocked at ip:%lx\n",
+		current->comm, task_pid_nr(current), syscall,
+		syscall_nr_to_name(syscall), KSTK_EIP(current));
+}
+
+/**
+ * seccomp_drop_all_filters - cleans up the filter list and frees the table
+ * @state: the seccomp_state to destroy the filters in.
+ */
+void seccomp_drop_all_filters(struct seccomp_state *state)
+{
+	struct hlist_node *this, *pos;
+	int head;
+	if (!state->filters)
+		return;
+	for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) {
+		hlist_for_each_safe(this, pos, &state->filters->heads[head]) {
+			struct seccomp_filter *f = hlist_entry(this,
+						struct seccomp_filter, node);
+			WARN_ON(state->filters->count == 0);
+			hlist_del(this);
+			free_seccomp_filter(f);
+			state->filters->count--;
+		}
+	}
+	kfree(state->filters);
+}
+
+/**
+ * seccomp_copy_all_filters - copies all filters from src to dst.
+ *
+ * @dst: seccomp_filter_table to populate.
+ * @src: table to read from.
+ * Returns non-zero on failure.
+ * Both the source and the destination should have no simultaneous
+ * writers, and dst should be exclusive to the caller.
+ */
+int seccomp_copy_all_filters(struct seccomp_filter_table *dst,
+			     const struct seccomp_filter_table *src)
+{
+	struct seccomp_filter *filter;
+	int head, ret = 0;
+	BUG_ON(!dst || !src);
+	for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) {
+		struct hlist_node *pos;
+		hlist_for_each_entry(filter, pos, &src->heads[head], node) {
+			struct seccomp_filter *new_filter =
+				copy_seccomp_filter(filter);
+			if (IS_ERR(new_filter)) {
+				ret = PTR_ERR(new_filter);
+				goto done;
+			}
+			hlist_add_head(&new_filter->node,
+				       &dst->heads[head]);
+			dst->count++;
+		}
+	}
+
+done:
+	return ret;
+}
+
+/**
+ * seccomp_show_filters - prints the filter state to a seq_file
+ * @state: the seccomp_state to enumerate the filter and bitmask of
+ * @m: the prepared seq_file to receive the data
+ *
+ * Returns 0 on a successful write.
+ */
+int seccomp_show_filters(struct seccomp_state *state, struct seq_file *m)
+{
+	int head;
+	struct hlist_node *pos;
+	struct seccomp_filter *filter;
+	int filtering = 0;
+	if (!state)
+		return 0;
+	if (!state->filters)
+		return 0;
+
+	filtering = (state->mode == 2);
+	filtering &= !(state->flags & SECCOMP_FLAG_ON_EXEC);
+	seq_printf(m, "Filtering: %d\n", filtering);
+	seq_printf(m, "FilterCount: %d\n", state->filters->count);
+	for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) {
+		hlist_for_each_entry(filter, pos, &state->filters->heads[head],
+				     node) {
+			seq_printf(m, "SystemCall: %d (%s)\n",
+				      filter->syscall_nr,
+				      syscall_nr_to_name(filter->syscall_nr));
+			seq_printf(m, "Filter: %s\n",
+				      get_filter_string(filter));
+		}
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(seccomp_show_filters);
+
+/**
+ * seccomp_maybe_apply_filters - conditionally applies seccomp filters
+ * @tsk: task to update
+ * @syscall_nr: current system call in progress
+ * tsk must already be in seccomp filter mode.
+ *
+ * Returns 0 if the call should be allowed or state has been updated.
+ * This call is only reach if no filters matched the current system call.
+ * In some cases, such as when the ON_EXEC flag is set, failure should
+ * not be terminal.
+ */
+int seccomp_maybe_apply_filters(struct task_struct *tsk, int syscall_nr)
+{
+	struct seccomp_state *state, *new_state = NULL;
+	int ret = -EACCES;
+
+	/* There's no question of application if ON_EXEC is not set. */
+	state = get_seccomp_state(tsk->seccomp);
+	if ((state->flags & SECCOMP_FLAG_ON_EXEC) == 0)
+		goto out;
+
+	ret = 0;
+	if (syscall_nr != __NR_execve)
+		goto out;
+
+	new_state = seccomp_state_dup(state);
+	ret = -ENOMEM;
+	if (!new_state)
+		goto out;
+
+	ret = 0;
+	new_state->flags &= ~(SECCOMP_FLAG_ON_EXEC);
+	rcu_assign_pointer(tsk->seccomp, new_state);
+	synchronize_rcu();
+	put_seccomp_state(state); /* for the task */
+
+out:
+	put_seccomp_state(state); /* for the get */
+	return ret;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @state: seccomp_state of current to use.
+ * @syscall: number of the system call to test
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(struct seccomp_state *state, int syscall)
+{
+	struct seccomp_filter *filter = NULL;
+	int ret;
+
+#ifdef CONFIG_COMPAT
+	ret = -EPERM;
+	if (is_compat_task() == !!(state->flags & SECCOMP_FLAG_COMPAT)) {
+		printk(KERN_INFO "%s[%d]: seccomp filter compat() mismatch.\n",
+		       current->comm, task_pid_nr(current));
+		goto out;
+	}
+#endif
+
+	ret = 0;
+	filter = find_filter(state, syscall);
+	if (filter && filter_match_current(filter))
+		goto out;
+
+	ret = -EACCES;
+out:
+	return ret;
+}
+
+/**
+ * seccomp_get_filter - copies the filter_string into "buf"
+ * @syscall_nr: system call number to look up
+ * @buf: destination buffer
+ * @bufsize: available space in the buffer.
+ *
+ * Looks up the filter for the given system call number on current.  If found,
+ * the string length of the NUL-terminated buffer is returned and < 0 is
+ * returned on error. The NUL byte is not included in the length.
+ */
+long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize)
+{
+	struct seccomp_state *state;
+	struct seccomp_filter *filter;
+	long ret = -ENOENT;
+
+	if (bufsize > SECCOMP_MAX_FILTER_LENGTH)
+		bufsize = SECCOMP_MAX_FILTER_LENGTH;
+
+	state = get_seccomp_state(current->seccomp);
+	if (!state)
+		goto out;
+
+	filter = find_filter(state, syscall_nr);
+	if (!filter)
+		goto out;
+
+	ret = strlcpy(buf, get_filter_string(filter), bufsize);
+	if (ret >= bufsize) {
+		ret = -ENOSPC;
+		goto out;
+	}
+	/* Zero out any remaining buffer, just in case. */
+	memset(buf + ret, 0, bufsize - ret);
+out:
+	put_seccomp_state(state);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_get_filter);
+
+/**
+ * seccomp_clear_filter: clears the seccomp filter for a syscall.
+ * @syscall_nr: the system call number to clear filters for.
+ *
+ * (acts as a frontend for seccomp_set_filter. All restrictions
+ *  apply)
+ *
+ * Returns 0 on success.
+ */
+long seccomp_clear_filter(int syscall_nr)
+{
+	return seccomp_set_filter(syscall_nr, NULL);
+}
+EXPORT_SYMBOL_GPL(seccomp_clear_filter);
+
+/**
+ * seccomp_set_filter: - Adds/extends a seccomp filter for a syscall.
+ * @syscall_nr: system call number to apply the filter to.
+ * @filter: ftrace filter string to apply.
+ *
+ * Context: User context only. This function may sleep on allocation and
+ *          operates on current. current must be attempting a system call
+ *          when this is called.
+ *
+ * New filters may be added for system calls when the current task is
+ * not in a secure computing mode (seccomp).  Otherwise, filters may only
+ * be added to already filtered system call entries.  Any additions will
+ * be &&'d with the existing filter string to ensure no expansion of privileges
+ * will be possible.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_set_filter(int syscall_nr, char *filter)
+{
+	struct seccomp_state *state, *orig_state;
+	long ret = -EINVAL;
+
+	orig_state = get_seccomp_state(current->seccomp);
+
+	/* Prior to mutating the state, create a duplicate to avoid modifying
+	 * the behavior of other instances sharing the state and ensure
+	 * consistency.
+	 */
+	state = (orig_state ? seccomp_state_dup(orig_state) :
+			      seccomp_state_new());
+	if (!state) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* A NULL filter doubles as a drop value, but the exposed prctl
+	 * interface requires a trip through seccomp_clear_filter().
+	 * Filter dropping is allowed across the is_compat_task() barrier.
+	 */
+	ret = 0;
+	if (filter == NULL) {
+		drop_filter(state, syscall_nr);
+		goto assign;
+	}
+
+	/* Avoid amiguous filters which may have been inherited from a parent
+	 * with different syscall numbers for the logically same calls.
+	 */
+#ifdef CONFIG_COMPAT
+	ret = -EACCES;
+	if (is_compat_task() != !!(state->flags & SECCOMP_FLAG_COMPAT)) {
+		if (state->filters->count)
+			goto free_state;
+		/* It's safe to add if there are no existing ambiguous rules.*/
+		if (is_compat_task())
+			state->flags |= SECCOMP_FLAG_COMPAT;
+		else
+			state->flags &= ~(SECCOMP_FLAG_COMPAT);
+	}
+#endif
+
+	ret = add_filter(state, syscall_nr, filter);
+	if (ret)
+		goto free_state;
+
+assign:
+	rcu_assign_pointer(current->seccomp, state);
+	synchronize_rcu();
+	put_seccomp_state(orig_state);  /* for the task */
+out:
+	put_seccomp_state(orig_state);  /* for the get */
+	return ret;
+
+free_state:
+	put_seccomp_state(orig_state);  /* for the get */
+	put_seccomp_state(state);  /* drop the dup/new */
+	return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_set_filter);
diff --git a/kernel/sys.c b/kernel/sys.c
index af468ed..d29003a 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1698,12 +1698,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		case PR_SET_ENDIAN:
 			error = SET_ENDIAN(me, arg2);
 			break;
-
 		case PR_GET_SECCOMP:
 			error = prctl_get_seccomp();
 			break;
 		case PR_SET_SECCOMP:
-			error = prctl_set_seccomp(arg2);
+			error = prctl_set_seccomp(arg2, arg3);
+			break;
+		case PR_SET_SECCOMP_FILTER:
+			error = prctl_set_seccomp_filter(arg2,
+							 (char __user *) arg3);
+			break;
+		case PR_CLEAR_SECCOMP_FILTER:
+			error = prctl_clear_seccomp_filter(arg2);
+			break;
+		case PR_GET_SECCOMP_FILTER:
+			error = prctl_get_seccomp_filter(arg2,
+							 (char __user *) arg3,
+							 arg4);
 			break;
 		case PR_GET_TSC:
 			error = GET_TSC_CTL(arg2);
-- 
1.7.0.4
^ permalink raw reply related	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-12  3:02 ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Will Drewry
@ 2011-05-12  7:48   ` Ingo Molnar
  2011-05-12  9:24     ` Kees Cook
                       ` (2 more replies)
  2011-05-12 11:33   ` James Morris
                     ` (2 subsequent siblings)
  3 siblings, 3 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-12  7:48 UTC (permalink / raw)
  To: linux-arm-kernel
Ok, i like the direction here, but i think the ABI should be done differently.
In this patch the ftrace event filter mechanism is used:
* Will Drewry <wad@chromium.org> wrote:
> +static struct seccomp_filter *alloc_seccomp_filter(int syscall_nr,
> +						   const char *filter_string)
> +{
> +	int err = -ENOMEM;
> +	struct seccomp_filter *filter = kzalloc(sizeof(struct seccomp_filter),
> +						GFP_KERNEL);
> +	if (!filter)
> +		goto fail;
> +
> +	INIT_HLIST_NODE(&filter->node);
> +	filter->syscall_nr = syscall_nr;
> +	filter->data = syscall_nr_to_meta(syscall_nr);
> +
> +	/* Treat a filter of SECCOMP_WILDCARD_FILTER as a wildcard and skip
> +	 * using a predicate at all.
> +	 */
> +	if (!strcmp(SECCOMP_WILDCARD_FILTER, filter_string))
> +		goto out;
> +
> +	/* Argument-based filtering only works on ftrace-hooked syscalls. */
> +	if (!filter->data) {
> +		err = -ENOSYS;
> +		goto fail;
> +	}
> +
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +	err = ftrace_parse_filter(&filter->event_filter,
> +				  filter->data->enter_event->event.type,
> +				  filter_string);
> +	if (err)
> +		goto fail;
> +#endif
> +
> +out:
> +	return filter;
> +
> +fail:
> +	kfree(filter);
> +	return ERR_PTR(err);
> +}
Via a prctl() ABI:
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1698,12 +1698,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		case PR_SET_ENDIAN:
>  			error = SET_ENDIAN(me, arg2);
>  			break;
> -
>  		case PR_GET_SECCOMP:
>  			error = prctl_get_seccomp();
>  			break;
>  		case PR_SET_SECCOMP:
> -			error = prctl_set_seccomp(arg2);
> +			error = prctl_set_seccomp(arg2, arg3);
> +			break;
> +		case PR_SET_SECCOMP_FILTER:
> +			error = prctl_set_seccomp_filter(arg2,
> +							 (char __user *) arg3);
> +			break;
> +		case PR_CLEAR_SECCOMP_FILTER:
> +			error = prctl_clear_seccomp_filter(arg2);
> +			break;
> +		case PR_GET_SECCOMP_FILTER:
> +			error = prctl_get_seccomp_filter(arg2,
> +							 (char __user *) arg3,
> +							 arg4);
To restrict execution to system calls.
Two observations:
1) We already have a specific ABI for this: you can set filters for events via 
   an event fd.
   Why not extend that mechanism instead and improve *both* your sandboxing
   bits and the events code? This new seccomp code has a lot more
   to do with trace event filters than the minimal old seccomp code ...
   kernel/trace/trace_event_filter.c is 2000 lines of tricky code that
   interprets the ASCII filter expressions. kernel/seccomp.c is 86 lines of
   mostly trivial code.
2) Why should this concept not be made available wider, to allow the 
   restriction of not just system calls but other security relevant components 
   of the kernel as well?
   This too, if you approach the problem via the events code, will be a natural 
   end result, while if you approach it from the seccomp prctl angle it will be
   a limited hack only.
Note, the end result will be the same - just using a different ABI.
So i really think the ABI itself should be closer related to the event code. 
What this "seccomp" code does is that it uses specific syscall events to 
restrict execution of certain event generating codepaths, such as system calls.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-12  7:48   ` Ingo Molnar
@ 2011-05-12  9:24     ` Kees Cook
  2011-05-12 10:49       ` Ingo Molnar
  2011-05-12 11:44     ` James Morris
  2011-05-12 12:15     ` Frederic Weisbecker
  2 siblings, 1 reply; 77+ messages in thread
From: Kees Cook @ 2011-05-12  9:24 UTC (permalink / raw)
  To: linux-arm-kernel
Hi,
On Thu, May 12, 2011 at 09:48:50AM +0200, Ingo Molnar wrote:
> 1) We already have a specific ABI for this: you can set filters for events via 
>    an event fd.
> 
>    Why not extend that mechanism instead and improve *both* your sandboxing
>    bits and the events code? This new seccomp code has a lot more
>    to do with trace event filters than the minimal old seccomp code ...
Would this require privileges to get the event fd to start with? If so,
I would prefer to avoid that, since using prctl() as shown in the patch
set won't require any privs.
-Kees
-- 
Kees Cook
Ubuntu Security Team
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-12  9:24     ` Kees Cook
@ 2011-05-12 10:49       ` Ingo Molnar
  0 siblings, 0 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-12 10:49 UTC (permalink / raw)
  To: linux-arm-kernel
* Kees Cook <kees.cook@canonical.com> wrote:
> Hi,
> 
> On Thu, May 12, 2011 at 09:48:50AM +0200, Ingo Molnar wrote:
> > 1) We already have a specific ABI for this: you can set filters for events via 
> >    an event fd.
> > 
> >    Why not extend that mechanism instead and improve *both* your sandboxing
> >    bits and the events code? This new seccomp code has a lot more
> >    to do with trace event filters than the minimal old seccomp code ...
> 
> Would this require privileges to get the event fd to start with? [...]
No special privileges with the default perf_events_paranoid value.
> [...] If so, I would prefer to avoid that, since using prctl() as shown in 
> the patch set won't require any privs.
and we could also explicitly allow syscall events without any privileges, 
regardless of the setting of 'perf_events_paranoid' config value.
Obviously a sandboxing host process wants to run with as low privileges as it 
can.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-12  3:02 ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Will Drewry
  2011-05-12  7:48   ` Ingo Molnar
@ 2011-05-12 11:33   ` James Morris
  2011-05-13 19:35   ` Arnd Bergmann
  2011-05-16 15:26   ` Steven Rostedt
  3 siblings, 0 replies; 77+ messages in thread
From: James Morris @ 2011-05-12 11:33 UTC (permalink / raw)
  To: linux-arm-kernel
On Wed, 11 May 2011, Will Drewry wrote:
> +void seccomp_filter_log_failure(int syscall)
> +{
> +	printk(KERN_INFO
> +		"%s[%d]: system call %d (%s) blocked at ip:%lx\n",
> +		current->comm, task_pid_nr(current), syscall,
> +		syscall_nr_to_name(syscall), KSTK_EIP(current));
> +}
I think it'd be a good idea to utilize the audit facility here.
- James
-- 
James Morris
<jmorris@namei.org>
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-12  7:48   ` Ingo Molnar
  2011-05-12  9:24     ` Kees Cook
@ 2011-05-12 11:44     ` James Morris
  2011-05-12 13:01       ` Ingo Molnar
  2011-05-12 12:15     ` Frederic Weisbecker
  2 siblings, 1 reply; 77+ messages in thread
From: James Morris @ 2011-05-12 11:44 UTC (permalink / raw)
  To: linux-arm-kernel
On Thu, 12 May 2011, Ingo Molnar wrote:
> 
> 2) Why should this concept not be made available wider, to allow the 
>    restriction of not just system calls but other security relevant components 
>    of the kernel as well?
Because the aim of this is to reduce the attack surface of the syscall 
interface.
LSM is the correct level of abstraction for general security mediation, 
because it allows you to take into account all relevant security 
information in a race-free context.
>    This too, if you approach the problem via the events code, will be a natural 
>    end result, while if you approach it from the seccomp prctl angle it will be
>    a limited hack only.
I'd say it's a well-defined and readily understandable feature.
- James
-- 
James Morris
<jmorris@namei.org>
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-12  7:48   ` Ingo Molnar
  2011-05-12  9:24     ` Kees Cook
  2011-05-12 11:44     ` James Morris
@ 2011-05-12 12:15     ` Frederic Weisbecker
  2 siblings, 0 replies; 77+ messages in thread
From: Frederic Weisbecker @ 2011-05-12 12:15 UTC (permalink / raw)
  To: linux-arm-kernel
On Thu, May 12, 2011 at 09:48:50AM +0200, Ingo Molnar wrote:
> To restrict execution to system calls.
> 
> Two observations:
> 
> 1) We already have a specific ABI for this: you can set filters for events via 
>    an event fd.
> 
>    Why not extend that mechanism instead and improve *both* your sandboxing
>    bits and the events code? This new seccomp code has a lot more
>    to do with trace event filters than the minimal old seccomp code ...
> 
>    kernel/trace/trace_event_filter.c is 2000 lines of tricky code that
>    interprets the ASCII filter expressions. kernel/seccomp.c is 86 lines of
>    mostly trivial code.
> 
> 2) Why should this concept not be made available wider, to allow the 
>    restriction of not just system calls but other security relevant components 
>    of the kernel as well?
> 
>    This too, if you approach the problem via the events code, will be a natural 
>    end result, while if you approach it from the seccomp prctl angle it will be
>    a limited hack only.
> 
> Note, the end result will be the same - just using a different ABI.
> 
> So i really think the ABI itself should be closer related to the event code. 
> What this "seccomp" code does is that it uses specific syscall events to 
> restrict execution of certain event generating codepaths, such as system calls.
> 
> Thanks,
> 
> 	Ingo
What's positive with that approach is that the code is all there already.
Create a perf event for a given trace event, attach a filter to it.
What needs to be added is an override of the effect of the filter. By default
it's dropping the event, but there may be different flavours, including sending
a signal. All in one, extending the current code to allow that looks trivial.
The negative points are that
* trace events are supposed to stay passive and not act on the system, except
doing some endpoint things like writing to a buffer. We can't call do_exit()
from a tracepoint for example, preemption is disabled there.
* Also, is it actually relevant to extend that seccomp filtering to other events
than syscalls? Exposing kernel events to filtering sounds actually to me bringing
a new potential security issue. But with fine restrictions this can probably
be dealt with. Especially if by default only syscalls can be filtered
* I think Peter did not want to give such "active" role to perf in the system.
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-12 11:44     ` James Morris
@ 2011-05-12 13:01       ` Ingo Molnar
  2011-05-12 16:26         ` Will Drewry
  2011-05-13  0:18         ` James Morris
  0 siblings, 2 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-12 13:01 UTC (permalink / raw)
  To: linux-arm-kernel
* James Morris <jmorris@namei.org> wrote:
> On Thu, 12 May 2011, Ingo Molnar wrote:
> 
> > 2) Why should this concept not be made available wider, to allow the 
> >    restriction of not just system calls but other security relevant components 
> >    of the kernel as well?
> 
> Because the aim of this is to reduce the attack surface of the syscall 
> interface.
What i suggest achieves the same, my argument is that we could aim it to be 
even more flexible and even more useful.
> LSM is the correct level of abstraction for general security mediation, 
> because it allows you to take into account all relevant security information 
> in a race-free context.
I don't care about LSM though, i find it poorly designed.
The approach implemented here, the ability for *unprivileged code* to define 
(the seeds of ...) flexible security policies, in a proper Linuxish way, which 
is inherited along the task parent/child hieararchy and which allows nesting 
etc. is a *lot* more flexible.
What Will implemented here is pretty huge in my opinion: it turns security from 
a root-only kind of weird hack into an essential component of its APIs, 
available to *any* app not just the select security policy/mechanism chosen by 
the distributor ...
If implemented properly this could replace LSM in the long run.
As a prctl() hack bound to seccomp (which, by all means, is a natural extension 
to the current seccomp ABI, so perfectly fine if we only want that scope), that 
is much less likely to happen.
And if we merge the seccomp interface prematurely then interest towards a more 
flexible approach will disappear, so either we do it properly now or it will 
take some time for someone to come around and do it ...
Also note that i do not consider the perf events ABI itself cast into stone - 
and we could very well add a new system call for this, independent of perf 
events. I just think that the seccomp scope itself is exciting but looks 
limited to what the real potential of this could be.
> >    This too, if you approach the problem via the events code, will be a natural 
> >    end result, while if you approach it from the seccomp prctl angle it will be
> >    a limited hack only.
> 
> I'd say it's a well-defined and readily understandable feature.
Note, it was me who suggested this very event-filter-engine design a year ago, 
when the first submission still used a crude bitmap of allowed seccomp 
syscalls:
  http://lwn.net/Articles/332974/
Funnily enough, back then you wrote this:
  " I'm concerned that we're seeing yet another security scheme being designed on 
    the fly, without a well-formed threat model, and without taking into account 
    lessons learned from the seemingly endless parade of similar, failed schemes. "
so when and how did your opinion of this scheme turn from it being an "endless 
parade of failed schemes" to it being a "well-defined and readily 
understandable feature"? :-)
The idea itself has not changed since last year, what happened is that the 
filter engine got a couple of new features and Will has separated it out and 
has implemented a working prototype for sandboxing.
What i do here is to suggest *further* steps down the same road, now that we 
see that this scheme can indeed be used to implement sandboxing ... I think 
it's a valid line of inquiry.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-12 13:01       ` Ingo Molnar
@ 2011-05-12 16:26         ` Will Drewry
  2011-05-16 12:55           ` Ingo Molnar
  2011-05-13  0:18         ` James Morris
  1 sibling, 1 reply; 77+ messages in thread
From: Will Drewry @ 2011-05-12 16:26 UTC (permalink / raw)
  To: linux-arm-kernel
[Thanks to everyone for the continued feedback and insights - I appreciate it!]
On Thu, May 12, 2011 at 8:01 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * James Morris <jmorris@namei.org> wrote:
>
>> On Thu, 12 May 2011, Ingo Molnar wrote:
>>
>> > 2) Why should this concept not be made available wider, to allow the
>> > ? ?restriction of not just system calls but other security relevant components
>> > ? ?of the kernel as well?
>>
>> Because the aim of this is to reduce the attack surface of the syscall
>> interface.
>
> What i suggest achieves the same, my argument is that we could aim it to be
> even more flexible and even more useful.
>
>> LSM is the correct level of abstraction for general security mediation,
>> because it allows you to take into account all relevant security information
>> in a race-free context.
>
> I don't care about LSM though, i find it poorly designed.
>
> The approach implemented here, the ability for *unprivileged code* to define
> (the seeds of ...) flexible security policies, in a proper Linuxish way, which
> is inherited along the task parent/child hieararchy and which allows nesting
> etc. is a *lot* more flexible.
>
> What Will implemented here is pretty huge in my opinion: it turns security from
> a root-only kind of weird hack into an essential component of its APIs,
> available to *any* app not just the select security policy/mechanism chosen by
> the distributor ...
>
> If implemented properly this could replace LSM in the long run.
>
> As a prctl() hack bound to seccomp (which, by all means, is a natural extension
> to the current seccomp ABI, so perfectly fine if we only want that scope), that
> is much less likely to happen.
>
> And if we merge the seccomp interface prematurely then interest towards a more
> flexible approach will disappear, so either we do it properly now or it will
> take some time for someone to come around and do it ...
>
> Also note that i do not consider the perf events ABI itself cast into stone -
> and we could very well add a new system call for this, independent of perf
> events. I just think that the seccomp scope itself is exciting but looks
> limited to what the real potential of this could be.
I agree with you on many of these points!  However, I don't think that
the views around LSMs, perf/ftrace infrastructure, or the current
seccomp filtering implementation are necessarily in conflict.  Here is
my understanding of how the different worlds fit together and where I
see this patchset living, along with where I could see future work
going.  Perhaps I'm being a trifle naive, but here goes anyway:
1. LSMs provide a global mechanism for hooking "security relevant"
events at a point where all the incoming user-sourced data has been
preprocessed and moved into userspace.  The hooks are called every
time one of those boundaries are crossed.
2. Perf and the ftrace infrastructure provide global function tracing
and system call hooks with direct access to the caller's registers
(and memory).
3. seccomp (as it exists today) provides a global system call entry
hook point with a binary per-process decision about whether to provide
"secure computing" behavior.
When I boil that down to abstractions, I see:
A. Globally scoped: LSMs, ftrace/perf
B. Locally/process scoped: seccomp
The result of that logical equivalence is that I see room for:
I. A per-process, locally scoped security event hooking interface (the
proposed changes in this patchset)
II. A globally scoped security event hooking interface _prior_ to
argument processing
III. A globally scoped security event hooking interface _post_
argument processing
II and III could be reduced further if I assume that ftrace/perf
provides (II) and a simple intermediary layer (hook entry/exit)
provides the argument processing steps that then call out a global
security policy system.
The driving motivation for this patchset is kernel attack surface
reduction, but that need arises because we lack a process-scoped
mechanism for making security decisions -- everything is global:
creds/DAC, containers, LSM, etc.   Adding ftrace filtering to agl's
original bitmask-seccomp proposal opens up the process-local security
world.  At present, it can limit the attack surface with simple binary
filters or apply limited security policy through the use of filter
strings.
Based on your mails, I see two main deficiencies in my proposed patchset:
a. Deep argument analysis: Any arguments that live in user memory
needs to be copied into the kernel, then checked, and substituted for
the actual system call, then have the original pointers restored (when
applicable) on system call exit.  There is a large overhead here and
the LSM hooks provide much of this support on a global level.
b. Lack of support for non-system call events.
For (a), if the long term view of ftrace/perf & LSMs is that LSM-like
functionality will live on top of the ftrace/perf infrastructure, then
adding support for the intermediary layer to analyze arguments will
come with time.  It's also likely that for process-local stuff (e.g.,)
a new predicate could be added to callback to a userspace supervisor,
or even a more generic ability for modules to register new
predicates/functions in the filtering engine itself -- like "fd == 1
&& check_path(path) == '/etc/safe.conf'" or "check_xattr(path,
expected)".  Of course, I'm just making stuff up right now :)
For (b), we could just add a field we don't use right now in the prctl
interface:
  prctl(PR_SET_SECCOMP_FILTER, int event_type, int
event_or_syscall_nr, char *filter)
[or something similar]
Then we can add process-local/scoped supported event types somewhere
down the road without an ABI change.
Tying it all together, it'd look like:
* Now -- add process-scoped security support: secocmp filter with
support for "future" event types
* Soon -- expand ftrace syscall hooks to hook more system calls
* Later -- expand ftrace filter language to support either deep
argument analysis and/or custom registered predicates
* Later, later -- implement a LSM-like hooking layer for "interesting"
event types on top of the ftrace hooks
That would yield process-scoped security controls and global security
controls and the ability to continue to create new and interesting
security modules.
All that said, I'm in over my head.  I've focused primarily on the
process-scoped security.  I think James, some of the LSM authors, and
out-of-tree security system maintainers would be good to help guide
direction toward the security view you have in mind to ensure the
flexibility desired exists.  And that's even assuming this sketch is
even vaguely interesting...
[snip]
> What i do here is to suggest *further* steps down the same road, now that we
> see that this scheme can indeed be used to implement sandboxing ... I think
> it's a valid line of inquiry.
I certainly agree that it's a valid line of inquiry, but I worry about
the massive scope expansion.  I know it hurts my head, but I'm hoping
the brain-dump above frames up how I think about this patch and your
line of inquiry.  ftrace hooking and the perf code certainly look a
lot like LSMs if I squint hard :)  But there is a substantial amount
of work to merge the worlds, and (thankfully) I don't think that
future directly impacts process-scoped security mechanisms even if
they can interact nicely.
thanks!
will
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-12 13:01       ` Ingo Molnar
  2011-05-12 16:26         ` Will Drewry
@ 2011-05-13  0:18         ` James Morris
  2011-05-13 12:10           ` Ingo Molnar
  1 sibling, 1 reply; 77+ messages in thread
From: James Morris @ 2011-05-13  0:18 UTC (permalink / raw)
  To: linux-arm-kernel
On Thu, 12 May 2011, Ingo Molnar wrote:
> Funnily enough, back then you wrote this:
> 
>   " I'm concerned that we're seeing yet another security scheme being designed on 
>     the fly, without a well-formed threat model, and without taking into account 
>     lessons learned from the seemingly endless parade of similar, failed schemes. "
> 
> so when and how did your opinion of this scheme turn from it being an "endless 
> parade of failed schemes" to it being a "well-defined and readily 
> understandable feature"? :-)
When it was defined in a way which limited its purpose to reducing the 
attack surface of the sycall interface.
- James
-- 
James Morris
<jmorris@namei.org>
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13  0:18         ` James Morris
@ 2011-05-13 12:10           ` Ingo Molnar
  2011-05-13 12:19             ` Peter Zijlstra
                               ` (2 more replies)
  0 siblings, 3 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-13 12:10 UTC (permalink / raw)
  To: linux-arm-kernel
* James Morris <jmorris@namei.org> wrote:
> On Thu, 12 May 2011, Ingo Molnar wrote:
> > Funnily enough, back then you wrote this:
> > 
> >   " I'm concerned that we're seeing yet another security scheme being designed on 
> >     the fly, without a well-formed threat model, and without taking into account 
> >     lessons learned from the seemingly endless parade of similar, failed schemes. "
> > 
> > so when and how did your opinion of this scheme turn from it being an 
> > "endless parade of failed schemes" to it being a "well-defined and readily 
> > understandable feature"? :-)
> 
> When it was defined in a way which limited its purpose to reducing the attack 
> surface of the sycall interface.
Let me outline a simple example of a new filter expression based security 
feature that could be implemented outside the narrow system call boundary you 
find acceptable, and please tell what is bad about it.
Say i'm a user-space sandbox developer who wants to enforce that sandboxed code 
should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/.
It is a simple and sensible security feature, agreed? It allows most code to 
run well and link to countless libraries - but no access to other files is 
allowed.
I would also like my sandbox app to be able to install this policy without 
having to be root. I do not want the sandbox app to have permission to create 
labels on /lib and /usr/lib and what not.
Firstly, using the filter code i deny the various link creation syscalls so 
that sandboxed code cannot escape for example by creating a symlink to outside 
the permitted VFS namespace. (Note: we opt-in to syscalls, that way new 
syscalls added by new kernels are denied by defalt. The current symlink 
creation syscalls are not opted in to.)
But the next step, actually checking filenames, poses a big hurdle: i cannot 
implement the filename checking at the sys_open() syscall level in a secure 
way: because the pathname is passed to sys_open() by pointer, and if i check it 
at the generic sys_open() syscall level, another thread in the sandbox might 
modify the underlying filename *after* i've checked it.
But if i had a VFS event at the fs/namei.c::getname() level, i would have 
access to a central point where the VFS string becomes stable to the kernel and 
can be checked (and denied if necessary).
A sidenote, and not surprisingly, the audit subsystem already has an event 
callback there:
        audit_getname(result);
Unfortunately this audit callback cannot be used for my purposes, because the 
event is single-purpose for auditd and because it allows no feedback (no 
deny/accept discretion for the security policy).
But if had this simple event there:
	err = event_vfs_getname(result);
I could implement this new filename based sandboxing policy, using a filter 
like this installed on the vfs::getname event and inherited by all sandboxed 
tasks (which cannot uninstall the filter, obviously):
  "
	if (strstr(name, ".."))
		return -EACCESS;
	if (!strncmp(name, "/home/sandbox/", 14) &&
	    !strncmp(name, "/lib/", 5) &&
	    !strncmp(name, "/usr/lib/", 9))
		return -EACCESS;
  "
  #
  # Note1: Obviously the filter engine would be extended to allow such simple string
  #        match functions. )
  #
  # Note2: ".." is disallowed so that sandboxed code cannot escape the restrictions
  #         using "/..".
  #
This kind of flexible and dynamic sandboxing would allow a wide range of file 
ops within the sandbox, while still isolating it from files not included in the 
specified VFS namespace.
( Note that there are tons of other examples as well, for useful security features
  that are best done using events outside the syscall boundary. )
The security event filters code tied to seccomp and syscalls at the moment is 
useful, but limited in its future potential.
So i argue that it should go slightly further and should become:
 - unprivileged:  application-definable, allowing the embedding of security 
                  policy in *apps* as well, not just the system
 - flexible:      can be added/removed runtime unprivileged, and cheaply so
 - transparent:   does not impact executing code that meets the policy
 - nestable:      it is inherited by child tasks and is fundamentally stackable,
                  multiple policies will have the combined effect and they
                  are transparent to each other. So if a child task within a
                  sandbox adds *more* checks then those add to the already
                  existing set of checks. We only narrow permissions, never
                  extend them.
 - generic:       allowing observation and (safe) control of security relevant
                  parameters not just at the system call boundary but at other
                  relevant places of kernel execution as well: which 
                  points/callbacks could also be used for other types of event 
                  extraction such as perf. It could even be shared with audit ...
I argue that this is the LSM and audit subsystems designed right: in the long 
run it could allow everything that LSM does at the moment - and so much more 
...
And you argue that allowing this would be bad, if it was extended like that 
then you'd consider it a failed scheme? Why?
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 12:10           ` Ingo Molnar
@ 2011-05-13 12:19             ` Peter Zijlstra
  2011-05-13 12:26               ` Ingo Molnar
  2011-05-13 15:10             ` Eric Paris
  2011-05-16  0:36             ` James Morris
  2 siblings, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2011-05-13 12:19 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
>         err = event_vfs_getname(result);
I really think we should not do this. Events like we have them should be
inactive, totally passive entities, only observe but not affect
execution (other than the bare minimal time delay introduced by
observance).
If you want another entity that is more active, please invent a new name
for it and create a new subsystem for them, now you could have these
active entities also have an (automatic) passive event side, but that's
some detail.
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 12:19             ` Peter Zijlstra
@ 2011-05-13 12:26               ` Ingo Molnar
  2011-05-13 12:39                 ` Peter Zijlstra
  0 siblings, 1 reply; 77+ messages in thread
From: Ingo Molnar @ 2011-05-13 12:26 UTC (permalink / raw)
  To: linux-arm-kernel
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
> >         err = event_vfs_getname(result);
> 
> I really think we should not do this. Events like we have them should be 
> inactive, totally passive entities, only observe but not affect execution 
> (other than the bare minimal time delay introduced by observance).
Well, this patchset already demonstrates that we can use a single event 
callback for a rather useful purpose.
Either it makes sense to do, in which case we should share facilities as much 
as possible, or it makes no sense, in which case we should not merge it at all.
> If you want another entity that is more active, please invent a new name for 
> it and create a new subsystem for them, now you could have these active 
> entities also have an (automatic) passive event side, but that's some detail.
Why should we have two callbacks next to each other:
	event_vfs_getname(result);
	result = check_event_vfs_getname(result);
if one could do it all?
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 12:26               ` Ingo Molnar
@ 2011-05-13 12:39                 ` Peter Zijlstra
  2011-05-13 12:43                   ` Peter Zijlstra
  2011-05-13 12:49                   ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Ingo Molnar
  0 siblings, 2 replies; 77+ messages in thread
From: Peter Zijlstra @ 2011-05-13 12:39 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, 2011-05-13 at 14:26 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
> > >         err = event_vfs_getname(result);
> > 
> > I really think we should not do this. Events like we have them should be 
> > inactive, totally passive entities, only observe but not affect execution 
> > (other than the bare minimal time delay introduced by observance).
> 
> Well, this patchset already demonstrates that we can use a single event 
> callback for a rather useful purpose.
Can and should are two distinct things.
> Either it makes sense to do, in which case we should share facilities as much 
> as possible, or it makes no sense, in which case we should not merge it at all.
And I'm arguing we should _not_. Observing is radically different from
Affecting, at the very least the two things should have different
permission schemes. We should not confuse these two matters.
> > If you want another entity that is more active, please invent a new name for 
> > it and create a new subsystem for them, now you could have these active 
> > entities also have an (automatic) passive event side, but that's some detail.
> 
> Why should we have two callbacks next to each other:
> 
> 	event_vfs_getname(result);
> 	result = check_event_vfs_getname(result);
> 
> if one could do it all?
Did you actually read the bit where I said that check_event_* (although
I still think that name sucks) could imply a matching event_*?
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 12:39                 ` Peter Zijlstra
@ 2011-05-13 12:43                   ` Peter Zijlstra
  2011-05-13 12:54                     ` Ingo Molnar
  2011-05-13 12:49                   ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Ingo Molnar
  1 sibling, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2011-05-13 12:43 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, 2011-05-13 at 14:39 +0200, Peter Zijlstra wrote:
> 
> >       event_vfs_getname(result);
> >       result = check_event_vfs_getname(result); 
Another fundamental difference is how to treat the callback chains for
these two.
Observers won't have a return value and are assumed to never fail,
therefore we can always call every entry on the callback list.
Active things otoh do have a return value, and thus we need to have
semantics that define what to do with that during callback iteration,
when to continue and when to break. Thus for active elements its
impossible to guarantee all entries will indeed be called.
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 12:39                 ` Peter Zijlstra
  2011-05-13 12:43                   ` Peter Zijlstra
@ 2011-05-13 12:49                   ` Ingo Molnar
  2011-05-13 13:55                     ` Peter Zijlstra
  1 sibling, 1 reply; 77+ messages in thread
From: Ingo Molnar @ 2011-05-13 12:49 UTC (permalink / raw)
  To: linux-arm-kernel
* Peter Zijlstra <peterz@infradead.org> wrote:
> > Why should we have two callbacks next to each other:
> > 
> > 	event_vfs_getname(result);
> > 	result = check_event_vfs_getname(result);
> > 
> > if one could do it all?
> 
> Did you actually read the bit where I said that check_event_* (although
> I still think that name sucks) could imply a matching event_*?
No, did not notice that - and yes that solves this particular problem.
So given that by your own admission it makes sense to share the facilities at 
the low level, i also argue that it makes sense to share as high up as 
possible.
Are you perhaps arguing for a ->observe flag that would make 100% sure that the 
default behavior for events is observe-only? That would make sense indeed.
Otherwise both cases really want to use all the same facilities for event 
discovery, setup, control and potential extraction of events.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 12:43                   ` Peter Zijlstra
@ 2011-05-13 12:54                     ` Ingo Molnar
  2011-05-13 13:08                       ` Peter Zijlstra
  0 siblings, 1 reply; 77+ messages in thread
From: Ingo Molnar @ 2011-05-13 12:54 UTC (permalink / raw)
  To: linux-arm-kernel
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-05-13 at 14:39 +0200, Peter Zijlstra wrote:
> > 
> > >       event_vfs_getname(result);
> > >       result = check_event_vfs_getname(result); 
> 
> Another fundamental difference is how to treat the callback chains for
> these two.
> 
> Observers won't have a return value and are assumed to never fail,
> therefore we can always call every entry on the callback list.
> 
> Active things otoh do have a return value, and thus we need to have
> semantics that define what to do with that during callback iteration,
> when to continue and when to break. Thus for active elements its
> impossible to guarantee all entries will indeed be called.
I think the sanest semantics is to run all active callbacks as well.
For example if this is used for three stacked security policies - as if 3 LSM 
modules were stacked at once. We'd call all three, and we'd determine that at 
least one failed - and we'd return a failure.
Even if the first one failed already we'd still want to trigger *all* the 
failures, because security policies like to know when they have triggered a 
failure (regardless of other active policies) and want to see that failure 
event (if they are logging such events).
So to me this looks pretty similar to observer callbacks as well, it's the 
natural extension to an observer callback chain.
Observer callbacks are simply constant functions (to the caller), those which 
never return failure and which never modify any of the parameters.
It's as if you argued that there should be separate syscalls/facilities for 
handling readonly files versus handling read/write files.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 12:54                     ` Ingo Molnar
@ 2011-05-13 13:08                       ` Peter Zijlstra
  2011-05-13 13:18                         ` Ingo Molnar
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2011-05-13 13:08 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
> I think the sanest semantics is to run all active callbacks as well.
> 
> For example if this is used for three stacked security policies - as if 3 LSM 
> modules were stacked at once. We'd call all three, and we'd determine that at 
> least one failed - and we'd return a failure. 
But that only works for boolean functions where you can return the
multi-bit-or of the result. What if you need to return the specific
error code.
Also, there's bound to be other cases where people will want to employ
this, look at all the various notifier chain muck we've got, it already
deals with much of this -- simply because users need it.
Then there's the whole indirection argument, if you don't need
indirection, its often better to not use it, I myself much prefer code
to look like:
   foo1(bar);
   foo2(bar);
   foo3(bar);
Than:
   foo_notifier(bar);
Simply because its much clearer who all are involved without me having
to grep around to see who registers for foo_notifier and wth they do
with it. It also makes it much harder to sneak in another user, whereas
its nearly impossible to find new notifier users.
Its also much faster, no extra memory accesses, no indirect function
calls, no other muck.
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 13:08                       ` Peter Zijlstra
@ 2011-05-13 13:18                         ` Ingo Molnar
  2011-05-13 13:55                           ` Peter Zijlstra
  2011-05-13 15:17                           ` Eric Paris
  0 siblings, 2 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-13 13:18 UTC (permalink / raw)
  To: linux-arm-kernel
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
> > I think the sanest semantics is to run all active callbacks as well.
> > 
> > For example if this is used for three stacked security policies - as if 3 LSM 
> > modules were stacked at once. We'd call all three, and we'd determine that at 
> > least one failed - and we'd return a failure. 
> 
> But that only works for boolean functions where you can return the
> multi-bit-or of the result. What if you need to return the specific
> error code.
Do you mean that one filter returns -EINVAL while the other -EACCES?
Seems like a non-problem to me, we'd return the first nonzero value.
> Also, there's bound to be other cases where people will want to employ
> this, look at all the various notifier chain muck we've got, it already
> deals with much of this -- simply because users need it.
Do you mean it would be easy to abuse it? What kind of abuse are you most 
worried about?
> Then there's the whole indirection argument, if you don't need
> indirection, its often better to not use it, I myself much prefer code
> to look like:
> 
>    foo1(bar);
>    foo2(bar);
>    foo3(bar);
> 
> Than:
> 
>    foo_notifier(bar);
> 
> Simply because its much clearer who all are involved without me having
> to grep around to see who registers for foo_notifier and wth they do
> with it. It also makes it much harder to sneak in another user, whereas
> its nearly impossible to find new notifier users.
> 
> Its also much faster, no extra memory accesses, no indirect function
> calls, no other muck.
But i suspect this question has been settled, given the fact that even pure 
observer events need and already process a chain of events? Am i missing 
something about your argument?
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 13:18                         ` Ingo Molnar
@ 2011-05-13 13:55                           ` Peter Zijlstra
  2011-05-13 14:57                             ` Ingo Molnar
  2011-05-13 15:17                           ` Eric Paris
  1 sibling, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2011-05-13 13:55 UTC (permalink / raw)
  To: linux-arm-kernel
Cut the microblaze list since its bouncy.
On Fri, 2011-05-13 at 15:18 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
> > > I think the sanest semantics is to run all active callbacks as well.
> > > 
> > > For example if this is used for three stacked security policies - as if 3 LSM 
> > > modules were stacked at once. We'd call all three, and we'd determine that at 
> > > least one failed - and we'd return a failure. 
> > 
> > But that only works for boolean functions where you can return the
> > multi-bit-or of the result. What if you need to return the specific
> > error code.
> 
> Do you mean that one filter returns -EINVAL while the other -EACCES?
> 
> Seems like a non-problem to me, we'd return the first nonzero value.
Assuming the first is -EINVAL, what then is the value in computing the
-EACCESS? Sounds like a massive waste of time to me.
> > Also, there's bound to be other cases where people will want to employ
> > this, look at all the various notifier chain muck we've got, it already
> > deals with much of this -- simply because users need it.
> 
> Do you mean it would be easy to abuse it? What kind of abuse are you most 
> worried about?
I'm not worried about abuse, I'm saying that going by the existing
notifier pattern always visiting all entries on the callback list is
undesired.
> > Then there's the whole indirection argument, if you don't need
> > indirection, its often better to not use it, I myself much prefer code
> > to look like:
> > 
> >    foo1(bar);
> >    foo2(bar);
> >    foo3(bar);
> > 
> > Than:
> > 
> >    foo_notifier(bar);
> > 
> > Simply because its much clearer who all are involved without me having
> > to grep around to see who registers for foo_notifier and wth they do
> > with it. It also makes it much harder to sneak in another user, whereas
> > its nearly impossible to find new notifier users.
> > 
> > Its also much faster, no extra memory accesses, no indirect function
> > calls, no other muck.
> 
> But i suspect this question has been settled, given the fact that even pure 
> observer events need and already process a chain of events? Am i missing 
> something about your argument?
I'm saying that there's reasons to not use notifiers passive or active.
Mostly the whole notifier/indirection muck comes up once you want
modules to make use of the thing, because then you need dynamic
management of the callback list.
(Then again, I'm fairly glad we don't have explicit callbacks in
kernel/cpu.c for all the cpu-hotplug callbacks :-)
Anyway, I oppose for the existing events to gain an active role.
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 12:49                   ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Ingo Molnar
@ 2011-05-13 13:55                     ` Peter Zijlstra
  2011-05-13 15:02                       ` Ingo Molnar
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2011-05-13 13:55 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, 2011-05-13 at 14:49 +0200, Ingo Molnar wrote:
> 
> So given that by your own admission it makes sense to share the facilities at 
> the low level, i also argue that it makes sense to share as high up as 
> possible. 
I'm not saying any such thing, I'm saying that it might make sense to
observe active objects and auto-create these observation points. That
doesn't make them similar or make them share anything.
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 13:55                           ` Peter Zijlstra
@ 2011-05-13 14:57                             ` Ingo Molnar
  2011-05-13 15:27                               ` Peter Zijlstra
  2011-05-16 16:23                               ` Steven Rostedt
  0 siblings, 2 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-13 14:57 UTC (permalink / raw)
  To: linux-arm-kernel
* Peter Zijlstra <peterz@infradead.org> wrote:
> Cut the microblaze list since its bouncy.
> 
> On Fri, 2011-05-13 at 15:18 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
> > > > I think the sanest semantics is to run all active callbacks as well.
> > > > 
> > > > For example if this is used for three stacked security policies - as if 3 LSM 
> > > > modules were stacked at once. We'd call all three, and we'd determine that at 
> > > > least one failed - and we'd return a failure. 
> > > 
> > > But that only works for boolean functions where you can return the
> > > multi-bit-or of the result. What if you need to return the specific
> > > error code.
> > 
> > Do you mean that one filter returns -EINVAL while the other -EACCES?
> > 
> > Seems like a non-problem to me, we'd return the first nonzero value.
> 
> Assuming the first is -EINVAL, what then is the value in computing the
> -EACCESS? Sounds like a massive waste of time to me.
No, because the common case is no rejection - this is a security mechanism. So 
in the normal case we would execute all 3 anyway, just to determine that all 
return 0.
Are you really worried about the abnormal case of one of them returning an 
error and us calculating all 3 return values?
> > > Also, there's bound to be other cases where people will want to employ
> > > this, look at all the various notifier chain muck we've got, it already
> > > deals with much of this -- simply because users need it.
> > 
> > Do you mean it would be easy to abuse it? What kind of abuse are you most 
> > worried about?
> 
> I'm not worried about abuse, I'm saying that going by the existing
> notifier pattern always visiting all entries on the callback list is
> undesired.
That is because many notifier chains are used in an 'event consuming' manner - 
they are responding to things like hardware events and are called in an 
interrupt-handler alike fashion most of the time.
> > > Then there's the whole indirection argument, if you don't need
> > > indirection, its often better to not use it, I myself much prefer code
> > > to look like:
> > > 
> > >    foo1(bar);
> > >    foo2(bar);
> > >    foo3(bar);
> > > 
> > > Than:
> > > 
> > >    foo_notifier(bar);
> > > 
> > > Simply because its much clearer who all are involved without me having
> > > to grep around to see who registers for foo_notifier and wth they do
> > > with it. It also makes it much harder to sneak in another user, whereas
> > > its nearly impossible to find new notifier users.
> > > 
> > > Its also much faster, no extra memory accesses, no indirect function
> > > calls, no other muck.
> > 
> > But i suspect this question has been settled, given the fact that even pure 
> > observer events need and already process a chain of events? Am i missing 
> > something about your argument?
> 
> I'm saying that there's reasons to not use notifiers passive or active.
> 
> Mostly the whole notifier/indirection muck comes up once you want
> modules to make use of the thing, because then you need dynamic
> management of the callback list.
But your argument assumes that we'd have a chain of functions to call, like 
regular notifiers.
While the natural model here would be to have a list of registered event 
structs for that point, with different filters but basically the same callback 
mechanism (a call into the filter engine in essence).
Also note that the common case would be no event registered - and we'd 
automatically optimize that case via the existing jump labels optimization.
> (Then again, I'm fairly glad we don't have explicit callbacks in kernel/cpu.c 
> for all the cpu-hotplug callbacks :-)
> 
> Anyway, I oppose for the existing events to gain an active role.
Why if 'being active' is optional and useful?
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 13:55                     ` Peter Zijlstra
@ 2011-05-13 15:02                       ` Ingo Molnar
  0 siblings, 0 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-13 15:02 UTC (permalink / raw)
  To: linux-arm-kernel
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-05-13 at 14:49 +0200, Ingo Molnar wrote:
> > 
> > So given that by your own admission it makes sense to share the facilities at 
> > the low level, i also argue that it makes sense to share as high up as 
> > possible. 
> 
> I'm not saying any such thing, I'm saying that it might make sense to
> observe active objects and auto-create these observation points. That
> doesn't make them similar or make them share anything.
Well, they would share the lowest level call site:
	result = check_event_vfs_getname(result);
You call it 'auto-generated call site', i call it a shared (single line) call 
site. The same thing as far as the lowest level goes.
Now (the way i understood it) you'd want to stop the sharing right after that. 
I argue that it should go all the way up.
Note: i fully agree that there should be events where filters can have no 
effect whatsoever. For example if this was written as:
	check_event_vfs_getname(result);
Then it would have no effect. This is decided by the subsystem developers, 
obviously. So whether an event is 'active' or 'passive' can be enforced at the 
subsystem level as well.
As far as the event facilities go, 'no effect observation' is a special-case of 
'active observation' - just like read-only files are a special case of 
read-write files.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 12:10           ` Ingo Molnar
  2011-05-13 12:19             ` Peter Zijlstra
@ 2011-05-13 15:10             ` Eric Paris
  2011-05-13 15:23               ` Peter Zijlstra
  2011-05-14  7:30               ` Ingo Molnar
  2011-05-16  0:36             ` James Morris
  2 siblings, 2 replies; 77+ messages in thread
From: Eric Paris @ 2011-05-13 15:10 UTC (permalink / raw)
  To: linux-arm-kernel
[dropping microblaze and roland]
lOn Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
> * James Morris <jmorris@namei.org> wrote:
> It is a simple and sensible security feature, agreed? It allows most code to 
> run well and link to countless libraries - but no access to other files is 
> allowed.
It's simple enough and sounds reasonable, but you can read all the
discussion about AppArmour why many people don't really think it's the
best.  Still, I'll agree it's a lot better than nothing.
> But if i had a VFS event at the fs/namei.c::getname() level, i would have 
> access to a central point where the VFS string becomes stable to the kernel and 
> can be checked (and denied if necessary).
> 
> A sidenote, and not surprisingly, the audit subsystem already has an event 
> callback there:
> 
>         audit_getname(result);
> 
> Unfortunately this audit callback cannot be used for my purposes, because the 
> event is single-purpose for auditd and because it allows no feedback (no 
> deny/accept discretion for the security policy).
> 
> But if had this simple event there:
> 
> 	err = event_vfs_getname(result);
Wow it sounds so easy.  Now lets keep extending your train of thought
until we can actually provide the security provided by SELinux.  What do
we end up with?  We end up with an event hook right next to every LSM
hook.  You know, the LSM hooks were placed where they are for a reason.
Because those were the locations inside the kernel where you actually
have information about the task doing an operation and the objects
(files, sockets, directories, other tasks, etc) they are doing an
operation on.
Honestly all you are talking about it remaking the LSM with 2 sets of
hooks instead if 1.  Why?  It seems much easier that if you want the
language of the filter engine you would just make a new LSM that uses
the filter engine for it's policy language rather than the language
created by SELinux or SMACK or name your LSM implementation.
>  - unprivileged:  application-definable, allowing the embedding of security 
>                   policy in *apps* as well, not just the system
> 
>  - flexible:      can be added/removed runtime unprivileged, and cheaply so
> 
>  - transparent:   does not impact executing code that meets the policy
> 
>  - nestable:      it is inherited by child tasks and is fundamentally stackable,
>                   multiple policies will have the combined effect and they
>                   are transparent to each other. So if a child task within a
>                   sandbox adds *more* checks then those add to the already
>                   existing set of checks. We only narrow permissions, never
>                   extend them.
> 
>  - generic:       allowing observation and (safe) control of security relevant
>                   parameters not just at the system call boundary but at other
>                   relevant places of kernel execution as well: which 
>                   points/callbacks could also be used for other types of event 
>                   extraction such as perf. It could even be shared with audit ...
I'm not arguing that any of these things are bad things.  What you
describe is a new LSM that uses a discretionary access control model but
with the granularity and flexibility that has traditionally only existed
in the mandatory access control security modules previously implemented
in the kernel.
I won't argue that's a bad idea, there's no reason in my mind that a
process shouldn't be allowed to control it's own access decisions in a
more flexible way than rwx bits.  Then again, I certainly don't see a
reason that this syscall hardening patch should be held up while a whole
new concept in computer security is contemplated...
-Eric
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 13:18                         ` Ingo Molnar
  2011-05-13 13:55                           ` Peter Zijlstra
@ 2011-05-13 15:17                           ` Eric Paris
  2011-05-13 15:29                             ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system callfiltering David Laight
  1 sibling, 1 reply; 77+ messages in thread
From: Eric Paris @ 2011-05-13 15:17 UTC (permalink / raw)
  To: linux-arm-kernel
[dropping microblaze and roland]
On Fri, 2011-05-13 at 15:18 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
> > > I think the sanest semantics is to run all active callbacks as well.
> > > 
> > > For example if this is used for three stacked security policies - as if 3 LSM 
> > > modules were stacked at once. We'd call all three, and we'd determine that at 
> > > least one failed - and we'd return a failure. 
> > 
> > But that only works for boolean functions where you can return the
> > multi-bit-or of the result. What if you need to return the specific
> > error code.
> 
> Do you mean that one filter returns -EINVAL while the other -EACCES?
> 
> Seems like a non-problem to me, we'd return the first nonzero value.
Sounds so easy!  Why haven't LSMs stacked already?  Because what happens
if one of these hooks did something stateful?  Lets say on open, hook #1
returns EPERM.  hook #2 allocates memory.  The open is going to fail and
hooks #2 is never going to get the close() which should have freed the
allocation.  If you can be completely stateless its easier, but there's
a reason that stacking security modules is hard.  Serge has tried in the
past and both dhowells and casey schaufler are working on it right now.
Stacking is never as easy as it sounds   :)
-Eric
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 15:10             ` Eric Paris
@ 2011-05-13 15:23               ` Peter Zijlstra
  2011-05-13 15:55                 ` Eric Paris
  2011-05-14  7:30               ` Ingo Molnar
  1 sibling, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2011-05-13 15:23 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, 2011-05-13 at 11:10 -0400, Eric Paris wrote:
> Then again, I certainly don't see a
> reason that this syscall hardening patch should be held up while a whole
> new concept in computer security is contemplated... 
Which makes me wonder why this syscall hardening stuff is done outside
of LSM? Why isn't is part of the LSM so that say SELinux can have a
syscall bitmask per security context?
Making it part of the LSM also avoids having to add this prctl().
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 14:57                             ` Ingo Molnar
@ 2011-05-13 15:27                               ` Peter Zijlstra
  2011-05-14  7:05                                 ` Ingo Molnar
  2011-05-16 16:23                               ` Steven Rostedt
  1 sibling, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2011-05-13 15:27 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, 2011-05-13 at 16:57 +0200, Ingo Molnar wrote:
> this is a security mechanism
Who says? and why would you want to unify two separate concepts only to
them limit it to security that just doesn't make sense.
Either you provide a full on replacement for notifier chain like things
or you don't, only extending trace events in this fashion for security
is like way weird.
Plus see the arguments Eric made about stacking stuff, not only security
schemes will have those problems.
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system callfiltering
  2011-05-13 15:17                           ` Eric Paris
@ 2011-05-13 15:29                             ` David Laight
  2011-05-16 12:03                               ` Ingo Molnar
  0 siblings, 1 reply; 77+ messages in thread
From: David Laight @ 2011-05-13 15:29 UTC (permalink / raw)
  To: linux-arm-kernel
> ... If you can be completely stateless its easier, but there's
> a reason that stacking security modules is hard.  Serge has tried in
the
> past and both dhowells and casey schaufler are working on it right
now.
> Stacking is never as easy as it sounds   :)
For a bad example of trying to allow alternate security models
look at NetBSD's kauth code :-)
NetBSD also had issues where some 'system call trace' code
was being used to (try to) apply security - unfortunately
it worked by looking at the user-space buffers on system
call entry - and a multithreaded program can easily arrange
to update them after the initial check!
For trace/event type activities this wouldn't really matter,
for security policy it does.
(I've not looked directly at these event points in linux)
	David
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 15:23               ` Peter Zijlstra
@ 2011-05-13 15:55                 ` Eric Paris
  2011-05-13 16:29                   ` Will Drewry
  0 siblings, 1 reply; 77+ messages in thread
From: Eric Paris @ 2011-05-13 15:55 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, 2011-05-13 at 17:23 +0200, Peter Zijlstra wrote:
> On Fri, 2011-05-13 at 11:10 -0400, Eric Paris wrote:
> > Then again, I certainly don't see a
> > reason that this syscall hardening patch should be held up while a whole
> > new concept in computer security is contemplated... 
> 
> Which makes me wonder why this syscall hardening stuff is done outside
> of LSM? Why isn't is part of the LSM so that say SELinux can have a
> syscall bitmask per security context?
I could do that, but I like Will's approach better.  From the PoV of
meeting security goals of information flow, data confidentiality,
integrity, least priv, etc limiting on the syscall boundary doesn't make
a lot of sense.  You just don't know enough there to enforce these
things.  These are the types of goals that SELinux and other LSMs have
previously tried to enforce.  From the PoV of making the kernel more
resistant to attacks and making a process more resistant to misbehavior
I think that the syscall boundary is appropriate.  Although I could do
it in SELinux it don't really want to do it there.
In case people are interested or confused let me give my definition of
two words I've used a bit in these conversations: discretionary and
mandatory.  Any time I talk about a 'discretionary' security decision it
is a security decisions that a process imposed upon itself.  Aka the
choice to use seccomp is discretionary.  The choice to mark our own file
u-wx is discretionary.  This isn't the best definition but it's one that
works well in this discussion.  Mandatory security is one enforce by a
global policy.  It's what selinux is all about.  SELinux doesn't give
hoot what a process wants to do, it enforces a global policy from the
top down.  You take over a process, well, too bad, you still have no
choice but to follow the mandatory policy.
The LSM does NOT enforce a mandatory access control model, it's just how
it's been used in the past.  Ingo appears to me (please correct me if
I'm wrong) to really be a fan of exposing the flexibility of the LSM to
a discretionary access control model.  That doesn't seem like a bad
idea.  And maybe using the filter engine to define the language to do
this isn't a bad idea either.  But I think that's a 'down the road'
project, not something to hold up a better seccomp.
> Making it part of the LSM also avoids having to add this prctl().
Well, it would mean exposing some new language construct to every LSM
(instead of a single prctl construct) and it would mean anyone wanting
to use the interface would have to rely on the LSM implementing those
hooks the way they need it.  Honestly chrome can already get all of the
benefits of this patch (given a perfectly coded kernel) and a whole lot
more using SELinux, but (surprise surprise) not everyone uses SELinux.
I think it's a good idea to expose a simple interface which will be
widely enough adopted that many userspace applications can rely on it
for hardening.
The existence of the LSM and the fact that there exists multiple
security modules that may or may not be enabled really leads application
developers to be unable to rely on LSM for security.  If linux had a
single security model which everyone could rely on we wouldn't really
have as big of an issue but that's not possible.  So I'm advocating for
this series which will provide a single useful change which applications
can rely upon across distros and platforms to enhance the properties and
abilities of the linux kernel.
-Eric
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 15:55                 ` Eric Paris
@ 2011-05-13 16:29                   ` Will Drewry
  0 siblings, 0 replies; 77+ messages in thread
From: Will Drewry @ 2011-05-13 16:29 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, May 13, 2011 at 10:55 AM, Eric Paris <eparis@redhat.com> wrote:
> On Fri, 2011-05-13 at 17:23 +0200, Peter Zijlstra wrote:
>> On Fri, 2011-05-13 at 11:10 -0400, Eric Paris wrote:
>> > Then again, I certainly don't see a
>> > reason that this syscall hardening patch should be held up while a whole
>> > new concept in computer security is contemplated...
>>
>> Which makes me wonder why this syscall hardening stuff is done outside
>> of LSM? Why isn't is part of the LSM so that say SELinux can have a
>> syscall bitmask per security context?
>
> I could do that, but I like Will's approach better. ?From the PoV of
> meeting security goals of information flow, data confidentiality,
> integrity, least priv, etc limiting on the syscall boundary doesn't make
> a lot of sense. ?You just don't know enough there to enforce these
> things. ?These are the types of goals that SELinux and other LSMs have
> previously tried to enforce. ?From the PoV of making the kernel more
> resistant to attacks and making a process more resistant to misbehavior
> I think that the syscall boundary is appropriate. ?Although I could do
> it in SELinux it don't really want to do it there.
There's also the problem that there are no hooks per-system call for
LSMs, only logical hooks that sometimes mirror system call names and
are called after user data has been parsed.  If system call enter
hooks, like seccomp's, were added for LSMs, it would allow the lsm
bitmask approach, but it still wouldn't satisfy the issues you raise
below (and I wholeheartedly agree with).
> In case people are interested or confused let me give my definition of
> two words I've used a bit in these conversations: discretionary and
> mandatory. ?Any time I talk about a 'discretionary' security decision it
> is a security decisions that a process imposed upon itself. ?Aka the
> choice to use seccomp is discretionary. ?The choice to mark our own file
> u-wx is discretionary. ?This isn't the best definition but it's one that
> works well in this discussion. ?Mandatory security is one enforce by a
> global policy. ?It's what selinux is all about. ?SELinux doesn't give
> hoot what a process wants to do, it enforces a global policy from the
> top down. ?You take over a process, well, too bad, you still have no
> choice but to follow the mandatory policy.
>
> The LSM does NOT enforce a mandatory access control model, it's just how
> it's been used in the past. ?Ingo appears to me (please correct me if
> I'm wrong) to really be a fan of exposing the flexibility of the LSM to
> a discretionary access control model. ?That doesn't seem like a bad
> idea. ?And maybe using the filter engine to define the language to do
> this isn't a bad idea either. ?But I think that's a 'down the road'
> project, not something to hold up a better seccomp.
>
>> Making it part of the LSM also avoids having to add this prctl().
>
> Well, it would mean exposing some new language construct to every LSM
> (instead of a single prctl construct) and it would mean anyone wanting
> to use the interface would have to rely on the LSM implementing those
> hooks the way they need it. ?Honestly chrome can already get all of the
> benefits of this patch (given a perfectly coded kernel) and a whole lot
> more using SELinux, but (surprise surprise) not everyone uses SELinux.
> I think it's a good idea to expose a simple interface which will be
> widely enough adopted that many userspace applications can rely on it
> for hardening.
>
> The existence of the LSM and the fact that there exists multiple
> security modules that may or may not be enabled really leads application
> developers to be unable to rely on LSM for security. ?If linux had a
> single security model which everyone could rely on we wouldn't really
> have as big of an issue but that's not possible. ?So I'm advocating for
> this series which will provide a single useful change which applications
> can rely upon across distros and platforms to enhance the properties and
> abilities of the linux kernel.
>
> -Eric
>
>
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-12  3:02 ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Will Drewry
  2011-05-12  7:48   ` Ingo Molnar
  2011-05-12 11:33   ` James Morris
@ 2011-05-13 19:35   ` Arnd Bergmann
  2011-05-14 20:58     ` Will Drewry
  2011-05-16 15:26   ` Steven Rostedt
  3 siblings, 1 reply; 77+ messages in thread
From: Arnd Bergmann @ 2011-05-13 19:35 UTC (permalink / raw)
  To: linux-arm-kernel
On Thursday 12 May 2011, Will Drewry wrote:
> This change adds a new seccomp mode based on the work by
> agl at chromium.org in [1]. This new mode, "filter mode", provides a hash
> table of seccomp_filter objects.  When in the new mode (2), all system
> calls are checked against the filters - first by system call number,
> then by a filter string.  If an entry exists for a given system call and
> all filter predicates evaluate to true, then the task may proceed.
> Otherwise, the task is killed (as per seccomp_mode == 1).
I've got a question about this: Do you expect the typical usage to disallow
ioctl()? Given that ioctl alone is responsible for a huge number of exploits
in various drivers, while certain ioctls are immensely useful (FIONREAD,
FIOASYNC, ...), do you expect to extend the mechanism to filter specific
ioctl commands in the future?
	Arnd
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 15:27                               ` Peter Zijlstra
@ 2011-05-14  7:05                                 ` Ingo Molnar
  0 siblings, 0 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-14  7:05 UTC (permalink / raw)
  To: linux-arm-kernel
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-05-13 at 16:57 +0200, Ingo Molnar wrote:
> > this is a security mechanism
> 
> Who says? [...]
Kernel developers/maintainers of the affected code.
We have security hooks all around the kernel, which can deny/accept execution 
at various key points, but we do not have 'execute arbitrary user-space defined 
(safe) scripts' callbacks in general.
But yes, if a particular callback point is defined widely enough to allow much 
bigger intervention into the flow of execution, then more is possible as well.
> [...] and why would you want to unify two separate concepts only to them 
> limit it to security that just doesn't make sense.
I don't limit them to security - the callbacks themselves are either for 
passive observation or, at most, for security accept/deny callbacks.
It's decided by the subsystem maintainers what kind of user-space control power 
(or observation power) they want to allow, not me.
I would just like to not stop the facility itself at the 'observe only' level, 
like you suggest.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 15:10             ` Eric Paris
  2011-05-13 15:23               ` Peter Zijlstra
@ 2011-05-14  7:30               ` Ingo Molnar
  2011-05-14 20:57                 ` Will Drewry
  1 sibling, 1 reply; 77+ messages in thread
From: Ingo Molnar @ 2011-05-14  7:30 UTC (permalink / raw)
  To: linux-arm-kernel
* Eric Paris <eparis@redhat.com> wrote:
> [dropping microblaze and roland]
> 
> lOn Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
> > * James Morris <jmorris@namei.org> wrote:
> 
> > It is a simple and sensible security feature, agreed? It allows most code to 
> > run well and link to countless libraries - but no access to other files is 
> > allowed.
> 
> It's simple enough and sounds reasonable, but you can read all the discussion 
> about AppArmour why many people don't really think it's the best. [...]
I have to say most of the participants of the AppArmour flamefests were dead 
wrong, and it wasnt the AppArmour folks who were wrong ...
The straight ASCII VFS namespace *makes sense*, and yes, the raw physical 
objects space that SELinux uses makes sense as well.
And no, i do not subscribe to the dogma that it is not possible to secure the 
ASCII VFS namespace: it evidently is possible, if you know and handle the 
ambiguitites. It is also obviously true that the ASCII VFS namespaces we use 
every day are a *lot* more intuitive than the labeled physical objects space 
...
What all the security flamewars missed is the simple fact that being intuitive 
matters a *lot* not just to not annoy users, but also to broaden the effective 
number of security-conscious developers ...
> > Unfortunately this audit callback cannot be used for my purposes, because 
> > the event is single-purpose for auditd and because it allows no feedback 
> > (no deny/accept discretion for the security policy).
> > 
> > But if had this simple event there:
> > 
> > 	err = event_vfs_getname(result);
> 
> Wow it sounds so easy.  Now lets keep extending your train of thought
> until we can actually provide the security provided by SELinux.  What do
> we end up with?  We end up with an event hook right next to every LSM
> hook.  You know, the LSM hooks were placed where they are for a reason.
> Because those were the locations inside the kernel where you actually
> have information about the task doing an operation and the objects
> (files, sockets, directories, other tasks, etc) they are doing an
> operation on.
> 
> Honestly all you are talking about it remaking the LSM with 2 sets of
> hooks instead if 1.  Why? [...]
Not at all. I am taking about using *one* set of events, to keep the intrusion 
at the lowest possible level.
LSM could make use of them as well.
Obviously for pragmatic reasons that might not be feasible initially.
> [...]  It seems much easier that if you want the language of the filter 
> engine you would just make a new LSM that uses the filter engine for it's 
> policy language rather than the language created by SELinux or SMACK or name 
> your LSM implementation.
Correct, that is what i suggested.
Note that performance is a primary concern, so if certain filters are very 
popular then in practice this would come with support for a couple of 'built 
in' (pre-optimized) filters that the kernel can accelerate directly, so that we 
do not incure the cost of executing the filter preds for really common-sense 
security policies that almost everyone is using.
I.e. in the end we'd *roughly* end up with the same performance and security as 
we are today (i mean, SELinux and the other LSMs did a nice job of collecting 
the things that apps should be careful about), but the crutial difference isnt 
just the advantages i menioned, but the fact that the *development model* of 
security modules would be a *lot* more extensible.
So security efforts could move to a whole different level: they could move into 
key apps and they could integrate with the general mind-set of developers.
At least Will as an application framework developer cares, so that hope is 
justified i think.
> >  - unprivileged:  application-definable, allowing the embedding of security 
> >                   policy in *apps* as well, not just the system
> > 
> >  - flexible:      can be added/removed runtime unprivileged, and cheaply so
> > 
> >  - transparent:   does not impact executing code that meets the policy
> > 
> >  - nestable:      it is inherited by child tasks and is fundamentally stackable,
> >                   multiple policies will have the combined effect and they
> >                   are transparent to each other. So if a child task within a
> >                   sandbox adds *more* checks then those add to the already
> >                   existing set of checks. We only narrow permissions, never
> >                   extend them.
> > 
> >  - generic:       allowing observation and (safe) control of security relevant
> >                   parameters not just at the system call boundary but at other
> >                   relevant places of kernel execution as well: which 
> >                   points/callbacks could also be used for other types of event 
> >                   extraction such as perf. It could even be shared with audit ...
> 
> I'm not arguing that any of these things are bad things.  What you describe 
> is a new LSM that uses a discretionary access control model but with the 
> granularity and flexibility that has traditionally only existed in the 
> mandatory access control security modules previously implemented in the 
> kernel.
> 
> I won't argue that's a bad idea, there's no reason in my mind that a process 
> shouldn't be allowed to control it's own access decisions in a more flexible 
> way than rwx bits.  Then again, I certainly don't see a reason that this 
> syscall hardening patch should be held up while a whole new concept in 
> computer security is contemplated...
Note, i'm not actually asking for the moon, a pony and more.
I fully submit that we are yet far away from being able to do a full LSM via 
this mechanism.
What i'm asking for is that because the syscall point steps taken by Will look 
very promising so it would be nice to do *that* in a slightly more flexible 
scheme that does not condemn it to be limited to the syscall boundary and such 
...
Also, to answer you, do you say that by my argument someone should have stood 
up and said 'no' to the LSM mess that was introduced a couple of years ago and 
which caused so many problems:
 - kernel inefficiencies and user-space overhead
 - stalled security efforts
 - infighting
 - friction, fragmentation, overmodularization
 - non-stackability
 - security annoyances on the Linux desktop
 - probably *less* Linux security
and should have asked them to do something better designed instead?
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-14  7:30               ` Ingo Molnar
@ 2011-05-14 20:57                 ` Will Drewry
  2011-05-16 12:43                   ` Ingo Molnar
  0 siblings, 1 reply; 77+ messages in thread
From: Will Drewry @ 2011-05-14 20:57 UTC (permalink / raw)
  To: linux-arm-kernel
On Sat, May 14, 2011 at 2:30 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Eric Paris <eparis@redhat.com> wrote:
>
>> [dropping microblaze and roland]
>>
>> lOn Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
>> > * James Morris <jmorris@namei.org> wrote:
>>
>> > It is a simple and sensible security feature, agreed? It allows most code to
>> > run well and link to countless libraries - but no access to other files is
>> > allowed.
>>
>> It's simple enough and sounds reasonable, but you can read all the discussion
>> about AppArmour why many people don't really think it's the best. [...]
>
> I have to say most of the participants of the AppArmour flamefests were dead
> wrong, and it wasnt the AppArmour folks who were wrong ...
>
> The straight ASCII VFS namespace *makes sense*, and yes, the raw physical
> objects space that SELinux uses makes sense as well.
>
> And no, i do not subscribe to the dogma that it is not possible to secure the
> ASCII VFS namespace: it evidently is possible, if you know and handle the
> ambiguitites. It is also obviously true that the ASCII VFS namespaces we use
> every day are a *lot* more intuitive than the labeled physical objects space
> ...
>
> What all the security flamewars missed is the simple fact that being intuitive
> matters a *lot* not just to not annoy users, but also to broaden the effective
> number of security-conscious developers ...
>
>> > Unfortunately this audit callback cannot be used for my purposes, because
>> > the event is single-purpose for auditd and because it allows no feedback
>> > (no deny/accept discretion for the security policy).
>> >
>> > But if had this simple event there:
>> >
>> > ? ? err = event_vfs_getname(result);
>>
>> Wow it sounds so easy. ?Now lets keep extending your train of thought
>> until we can actually provide the security provided by SELinux. ?What do
>> we end up with? ?We end up with an event hook right next to every LSM
>> hook. ?You know, the LSM hooks were placed where they are for a reason.
>> Because those were the locations inside the kernel where you actually
>> have information about the task doing an operation and the objects
>> (files, sockets, directories, other tasks, etc) they are doing an
>> operation on.
>>
>> Honestly all you are talking about it remaking the LSM with 2 sets of
>> hooks instead if 1. ?Why? [...]
>
> Not at all. I am taking about using *one* set of events, to keep the intrusion
> at the lowest possible level.
>
> LSM could make use of them as well.
>
> Obviously for pragmatic reasons that might not be feasible initially.
>
>> [...] ?It seems much easier that if you want the language of the filter
>> engine you would just make a new LSM that uses the filter engine for it's
>> policy language rather than the language created by SELinux or SMACK or name
>> your LSM implementation.
>
> Correct, that is what i suggested.
>
> Note that performance is a primary concern, so if certain filters are very
> popular then in practice this would come with support for a couple of 'built
> in' (pre-optimized) filters that the kernel can accelerate directly, so that we
> do not incure the cost of executing the filter preds for really common-sense
> security policies that almost everyone is using.
>
> I.e. in the end we'd *roughly* end up with the same performance and security as
> we are today (i mean, SELinux and the other LSMs did a nice job of collecting
> the things that apps should be careful about), but the crutial difference isnt
> just the advantages i menioned, but the fact that the *development model* of
> security modules would be a *lot* more extensible.
>
> So security efforts could move to a whole different level: they could move into
> key apps and they could integrate with the general mind-set of developers.
>
> At least Will as an application framework developer cares, so that hope is
> justified i think.
>
>> > ?- unprivileged: ?application-definable, allowing the embedding of security
>> > ? ? ? ? ? ? ? ? ? policy in *apps* as well, not just the system
>> >
>> > ?- flexible: ? ? ?can be added/removed runtime unprivileged, and cheaply so
>> >
>> > ?- transparent: ? does not impact executing code that meets the policy
>> >
>> > ?- nestable: ? ? ?it is inherited by child tasks and is fundamentally stackable,
>> > ? ? ? ? ? ? ? ? ? multiple policies will have the combined effect and they
>> > ? ? ? ? ? ? ? ? ? are transparent to each other. So if a child task within a
>> > ? ? ? ? ? ? ? ? ? sandbox adds *more* checks then those add to the already
>> > ? ? ? ? ? ? ? ? ? existing set of checks. We only narrow permissions, never
>> > ? ? ? ? ? ? ? ? ? extend them.
>> >
>> > ?- generic: ? ? ? allowing observation and (safe) control of security relevant
>> > ? ? ? ? ? ? ? ? ? parameters not just at the system call boundary but at other
>> > ? ? ? ? ? ? ? ? ? relevant places of kernel execution as well: which
>> > ? ? ? ? ? ? ? ? ? points/callbacks could also be used for other types of event
>> > ? ? ? ? ? ? ? ? ? extraction such as perf. It could even be shared with audit ...
>>
>> I'm not arguing that any of these things are bad things. ?What you describe
>> is a new LSM that uses a discretionary access control model but with the
>> granularity and flexibility that has traditionally only existed in the
>> mandatory access control security modules previously implemented in the
>> kernel.
>>
>> I won't argue that's a bad idea, there's no reason in my mind that a process
>> shouldn't be allowed to control it's own access decisions in a more flexible
>> way than rwx bits. ?Then again, I certainly don't see a reason that this
>> syscall hardening patch should be held up while a whole new concept in
>> computer security is contemplated...
>
> Note, i'm not actually asking for the moon, a pony and more.
>
> I fully submit that we are yet far away from being able to do a full LSM via
> this mechanism.
>
> What i'm asking for is that because the syscall point steps taken by Will look
> very promising so it would be nice to do *that* in a slightly more flexible
> scheme that does not condemn it to be limited to the syscall boundary and such
> ...
What do you suggest here?
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 19:35   ` Arnd Bergmann
@ 2011-05-14 20:58     ` Will Drewry
  2011-05-15  6:42       ` Arnd Bergmann
  0 siblings, 1 reply; 77+ messages in thread
From: Will Drewry @ 2011-05-14 20:58 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, May 13, 2011 at 2:35 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Thursday 12 May 2011, Will Drewry wrote:
>> This change adds a new seccomp mode based on the work by
>> agl at chromium.org in [1]. This new mode, "filter mode", provides a hash
>> table of seccomp_filter objects. ?When in the new mode (2), all system
>> calls are checked against the filters - first by system call number,
>> then by a filter string. ?If an entry exists for a given system call and
>> all filter predicates evaluate to true, then the task may proceed.
>> Otherwise, the task is killed (as per seccomp_mode == 1).
>
> I've got a question about this: Do you expect the typical usage to disallow
> ioctl()? Given that ioctl alone is responsible for a huge number of exploits
> in various drivers, while certain ioctls are immensely useful (FIONREAD,
> FIOASYNC, ...), do you expect to extend the mechanism to filter specific
> ioctl commands in the future?
In many cases, I do expect ioctl's to be dropped, but it's totally up
to whoever is setting the filters.
As is, it can already help out: [even though an LSM, if available,
would be appropriate to define a fine-grained policy]
ioctl() is hooked by the ftrace syscalls infrastructure (via SYSCALL_DEFINE3):
  SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned
long, arg)
This means you can do:
  sprintf(filter, "cmd == %u || cmd == %u", FIOASYNC, FIONREAD);
  prctl(PR_SET_SECCOMP_FILTER, __NR_ioctl, filter);
  ...
  prctl(PR_SET_SECCOMP, 2, 0);
and then you'll be able to call ioctl on any fd with any argument but
limited to only the FIOASYNC and FIONREAD commands.
Depending on integration, it could even be limited to ioctl commands
that are appropriate to a known fd if the fd is opened prior to
entering seccomp mode 2. Alternatively, __NR__ioctl could be allowed
with a filter of "1" then narrowed through a later addition of
something like "(fd == %u && (cmd == %u || cmd == %u))" or something
along those lines.
Does that make sense?
In general, this interface won't need specific extensions for most
system call oriented filtering events.  ftrace events may be expanded
(to include more system calls), but that's behind the scenes.  Only
arguments subject to time-of-check-time-of-use attacks (data living in
userspace passed in by pointer) are not safe to use via this
interface.  In theory, that limitation could also be lifted in the
implementation without changing the ABI.
Thanks!
will
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-14 20:58     ` Will Drewry
@ 2011-05-15  6:42       ` Arnd Bergmann
  2011-05-16 12:00         ` Ingo Molnar
  0 siblings, 1 reply; 77+ messages in thread
From: Arnd Bergmann @ 2011-05-15  6:42 UTC (permalink / raw)
  To: linux-arm-kernel
On Saturday 14 May 2011, Will Drewry wrote:
> Depending on integration, it could even be limited to ioctl commands
> that are appropriate to a known fd if the fd is opened prior to
> entering seccomp mode 2. Alternatively, __NR__ioctl could be allowed
> with a filter of "1" then narrowed through a later addition of
> something like "(fd == %u && (cmd == %u || cmd == %u))" or something
> along those lines.
> 
> Does that make sense?
Thanks for the explanation. This sounds like it's already doing all
we need.
	Arnd
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 12:10           ` Ingo Molnar
  2011-05-13 12:19             ` Peter Zijlstra
  2011-05-13 15:10             ` Eric Paris
@ 2011-05-16  0:36             ` James Morris
  2011-05-16 15:08               ` Ingo Molnar
  2011-05-26  6:27               ` Pavel Machek
  2 siblings, 2 replies; 77+ messages in thread
From: James Morris @ 2011-05-16  0:36 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, 13 May 2011, Ingo Molnar wrote:
> Say i'm a user-space sandbox developer who wants to enforce that sandboxed code 
> should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/.
> 
> It is a simple and sensible security feature, agreed? It allows most code to 
> run well and link to countless libraries - but no access to other files is 
> allowed.
Not really.
Firstly, what is the security goal of these restrictions?  Then, are the 
restrictions complete and unbypassable?
How do you reason about the behavior of the system as a whole?
> I argue that this is the LSM and audit subsystems designed right: in the long 
> run it could allow everything that LSM does at the moment - and so much more 
> ...
Now you're proposing a redesign of the security subsystem.  That's a 
significant undertaking.
In the meantime, we have a simple, well-defined enhancement to seccomp 
which will be very useful to current users in reducing their kernel attack 
surface.
We should merge that, and the security subsystem discussion can carry on 
separately.
- James
-- 
James Morris
<jmorris@namei.org>
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-15  6:42       ` Arnd Bergmann
@ 2011-05-16 12:00         ` Ingo Molnar
  0 siblings, 0 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-16 12:00 UTC (permalink / raw)
  To: linux-arm-kernel
* Arnd Bergmann <arnd@arndb.de> wrote:
> On Saturday 14 May 2011, Will Drewry wrote:
> > Depending on integration, it could even be limited to ioctl commands
> > that are appropriate to a known fd if the fd is opened prior to
> > entering seccomp mode 2. Alternatively, __NR__ioctl could be allowed
> > with a filter of "1" then narrowed through a later addition of
> > something like "(fd == %u && (cmd == %u || cmd == %u))" or something
> > along those lines.
> > 
> > Does that make sense?
> 
> Thanks for the explanation. This sounds like it's already doing all
> we need.
One thing we could do more clearly here is to help keep the filter expressions 
symbolic - i.e. help resolve the various ioctl variants as well, not just the 
raw syscall parameter numbers.
But yes, access to the raw syscall parameters and the ability to filter them 
already gives us the ability to exclude/include specific ioctls in a rather 
flexible way.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system callfiltering
  2011-05-13 15:29                             ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system callfiltering David Laight
@ 2011-05-16 12:03                               ` Ingo Molnar
  0 siblings, 0 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-16 12:03 UTC (permalink / raw)
  To: linux-arm-kernel
* David Laight <David.Laight@ACULAB.COM> wrote:
> [...] unfortunately it worked by looking at the user-space buffers on system 
> call entry - and a multithreaded program can easily arrange to update them 
> after the initial check! [...]
Such problems of reliability/persistency of security checks is exactly one of 
my arguments why this should not be limited to the syscall boundary, if you 
read the example i have provided in this discussion.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-14 20:57                 ` Will Drewry
@ 2011-05-16 12:43                   ` Ingo Molnar
  2011-05-16 15:29                     ` Will Drewry
  0 siblings, 1 reply; 77+ messages in thread
From: Ingo Molnar @ 2011-05-16 12:43 UTC (permalink / raw)
  To: linux-arm-kernel
* Will Drewry <wad@chromium.org> wrote:
> > Note, i'm not actually asking for the moon, a pony and more.
> >
> > I fully submit that we are yet far away from being able to do a full LSM 
> > via this mechanism.
> >
> > What i'm asking for is that because the syscall point steps taken by Will 
> > look very promising so it would be nice to do *that* in a slightly more 
> > flexible scheme that does not condemn it to be limited to the syscall 
> > boundary and such ...
> 
> What do you suggest here?
> 
> From my brief exploration of the ftrace/perf (and seccomp) code, I don't see 
> a clean way of integrating over the existing interfaces to the ftrace 
> framework (e.g., the global perf event pump seems to be a mismatch), but I 
> may be missing something obvious. [...]
Well, there's no global perf event pump. Here we'd obviously want to use 
buffer-less events that do no output (pumping) whatsoever - i.e. just counting 
events with filters attached.
What i suggest is to:
 - share syscall events        # you are fine with that as your patch makes use of them
 - share the scripting engine  # you are fine with that as your patch makes use of it
 - share *any* other event to do_exit() a process at syscall exit time
 - share any other active event that kernel developers specifically enable 
   for active use to impact security-relevant execution even sooner than 
   syscall exit time - not just system calls
 - share the actual facility that manages (sets/gets) filters
So right now you have this structure for your new feature:
 Documentation/trace/seccomp_filter.txt |   75 +++++
 arch/x86/kernel/ldt.c                  |    5 
 arch/x86/kernel/tls.c                  |    7 
 fs/proc/array.c                        |   21 +
 fs/proc/base.c                         |   25 +
 include/linux/ftrace_event.h           |    9 
 include/linux/seccomp.h                |   98 +++++++
 include/linux/syscalls.h               |   54 ++--
 include/trace/syscall.h                |    6 
 kernel/Makefile                        |    1 
 kernel/fork.c                          |    8 
 kernel/perf_event.c                    |    7 
 kernel/seccomp.c                       |  144 ++++++++++-
 kernel/seccomp_filter.c                |  428 +++++++++++++++++++++++++++++++++
 kernel/sys.c                           |   11 
 kernel/trace/Kconfig                   |   10 
 kernel/trace/trace_events_filter.c     |   60 ++--
 kernel/trace/trace_syscalls.c          |   96 ++++++-
 18 files changed, 986 insertions(+), 79 deletions(-)
Firstly, one specific problem i can see is that kernel/seccomp_filter.c 
hardcodes to the system call boundary. Which is fine to a prototype 
implementation (and obviously fine for something that builds upon seccomp) but 
not fine in terms of a flexible Linux security feature :-)
You have hardcoded these syscall assumptions via:
 struct seccomp_filter {
        struct list_head list;
        struct rcu_head rcu;
        int syscall_nr;
        struct syscall_metadata *data;
        struct event_filter *event_filter;
 };
Which comes just because you chose to enumerate only syscall events - instead 
of enumerating all events.
Instead of that please bind to all events instead - syscalls are just one of 
the many events we have. Type 'perf list' and see how many event types we have, 
and quite a few could be enabled for 'active feedback' sandboxing as well.
Secondly, and this is really a variant of the first problem you have, the way 
you process event filter 'failures' is pretty limited. You utilize the regular 
seccomp method which works by calling into __secure_computing() and silently 
accepting syscalls or generating a hard do_exit() on even the slightest of 
filter failures.
Instead of that what we'd want to have is to have regular syscall events 
registered, *and* enabled for such active filtering purposes. The moment the 
filter hits such an 'active' event would set the TIF_NOTIFY_RESUME flag and 
some other attribute in the task and the kernel would do a do_exit() at the 
earliest of opportunities before calling the syscall or at the 
return-from-syscall point latest.
Note that no seccomp specific code would have to execute here, we can already 
generate events both at syscall entry and at syscall exit points, the only new 
bit we'd need is for the 'kill the task' [or abort the syscall] policy action.
This is *far* more generic still yields the same short-term end result as far 
as your sandboxing is concerned.
What we'd need for this is a way to mark existing TRACE_EVENT()s as 'active'. 
We'd mark all syscall events as 'active' straight away.
[ Detail: these trace events return a return code, which the calling code can 
  use, that way event return values could be used sooner than syscall exit 
  points. IRQ code could make use of it as well, so for example this way we 
  could filter based early packet inspection, still in the IRQ code. ]
Then what your feature would do is to simply open up the events you are 
interested in (just as counting events, without any buffers), and set 'active' 
filters in them them to 'deny/allow' specific combinations of parameters only.
Thirdly, you have added a separate facility to set/query filters via /proc, 
this lives mostly in /proc/<pid>/seccomp_filter. This seccomp specificity would 
go away altogether: it should be possible to query existing events and filters 
even outside of the seccomp facility, so if you want something explicit here 
beyond the current event ioctl() to set filters please improve the generic 
facility - which right now is poorer than what you have implemented so it's in 
good need to be improved! :-)
All in one such a solution would enable us to reuse code in a lot more deeper 
fashion while still providing all the functionality you are interested in. 
These are all short-term concerns, not pie-in-the-sky future ideal LSM 
concerns.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-12 16:26         ` Will Drewry
@ 2011-05-16 12:55           ` Ingo Molnar
  2011-05-16 14:42             ` Will Drewry
  0 siblings, 1 reply; 77+ messages in thread
From: Ingo Molnar @ 2011-05-16 12:55 UTC (permalink / raw)
  To: linux-arm-kernel
* Will Drewry <wad@chromium.org> wrote:
> I agree with you on many of these points!  However, I don't think that the 
> views around LSMs, perf/ftrace infrastructure, or the current seccomp 
> filtering implementation are necessarily in conflict.  Here is my 
> understanding of how the different worlds fit together and where I see this 
> patchset living, along with where I could see future work going.  Perhaps I'm 
> being a trifle naive, but here goes anyway:
> 
> 1. LSMs provide a global mechanism for hooking "security relevant"
> events at a point where all the incoming user-sourced data has been
> preprocessed and moved into userspace.  The hooks are called every
> time one of those boundaries are crossed.
> 2. Perf and the ftrace infrastructure provide global function tracing
> and system call hooks with direct access to the caller's registers
> (and memory).
No, perf events are not just global but per task as well. Nor are events 
limited to 'tracing' (generating a flow of events into a trace buffer) - they 
can just be themselves as well and count and generate callbacks.
The generic NMI watchdog uses that kind of event model for example, see 
kernel/watchdog.c and how it makes use of struct perf_event abstractions to do 
per CPU events (with no buffrs), or how kernel/hw_breakpoint.c uses it for per 
task events and integrates it with the ptrace hw-breakpoints code.
Ideally Peter's one particular suggestion is right IMO and we'd want to be able 
for a perf_event to just be a list of callbacks, attached to a task and barely 
more than a discoverable, named notifier chain in its slimmest form.
In practice it's fatter than that right now, but we should definitely factor 
out that aspect of it more clearly, both code-wise and API-wise. 
kernel/watchdog.c and kernel/hw_breakpoint.c shows that such factoring out is 
possible and desirable.
> 3. seccomp (as it exists today) provides a global system call entry
> hook point with a binary per-process decision about whether to provide
> "secure computing" behavior.
> 
> When I boil that down to abstractions, I see:
> A. Globally scoped: LSMs, ftrace/perf
> B. Locally/process scoped: seccomp
Ok, i see where you got the idea that you needed to cut your surface of 
abstraction at the filter engine / syscall enumeration level - i think you were 
thinking of it in the ftrace model of tracepoints, not in the perf model of 
events.
No, events are generic and as such per task as well, not just global.
I've replied to your other mail with more specific suggestions of how we could 
provide your feature using abstractions that share code more widely. Talking 
specifics will i hope help move the discussion forward! :-)
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-16 12:55           ` Ingo Molnar
@ 2011-05-16 14:42             ` Will Drewry
  0 siblings, 0 replies; 77+ messages in thread
From: Will Drewry @ 2011-05-16 14:42 UTC (permalink / raw)
  To: linux-arm-kernel
On Mon, May 16, 2011 at 7:55 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Will Drewry <wad@chromium.org> wrote:
>
>> I agree with you on many of these points! ?However, I don't think that the
>> views around LSMs, perf/ftrace infrastructure, or the current seccomp
>> filtering implementation are necessarily in conflict. ?Here is my
>> understanding of how the different worlds fit together and where I see this
>> patchset living, along with where I could see future work going. ?Perhaps I'm
>> being a trifle naive, but here goes anyway:
>>
>> 1. LSMs provide a global mechanism for hooking "security relevant"
>> events at a point where all the incoming user-sourced data has been
>> preprocessed and moved into userspace. ?The hooks are called every
>> time one of those boundaries are crossed.
>
>> 2. Perf and the ftrace infrastructure provide global function tracing
>> and system call hooks with direct access to the caller's registers
>> (and memory).
>
> No, perf events are not just global but per task as well. Nor are events
> limited to 'tracing' (generating a flow of events into a trace buffer) - they
> can just be themselves as well and count and generate callbacks.
I was looking at the perf_sysenter_enable() call, but clearly there is
more going on :)
> The generic NMI watchdog uses that kind of event model for example, see
> kernel/watchdog.c and how it makes use of struct perf_event abstractions to do
> per CPU events (with no buffrs), or how kernel/hw_breakpoint.c uses it for per
> task events and integrates it with the ptrace hw-breakpoints code.
>
> Ideally Peter's one particular suggestion is right IMO and we'd want to be able
> for a perf_event to just be a list of callbacks, attached to a task and barely
> more than a discoverable, named notifier chain in its slimmest form.
>
> In practice it's fatter than that right now, but we should definitely factor
> out that aspect of it more clearly, both code-wise and API-wise.
> kernel/watchdog.c and kernel/hw_breakpoint.c shows that such factoring out is
> possible and desirable.
>
>> 3. seccomp (as it exists today) provides a global system call entry
>> hook point with a binary per-process decision about whether to provide
>> "secure computing" behavior.
>>
>> When I boil that down to abstractions, I see:
>> A. Globally scoped: LSMs, ftrace/perf
>> B. Locally/process scoped: seccomp
>
> Ok, i see where you got the idea that you needed to cut your surface of
> abstraction at the filter engine / syscall enumeration level - i think you were
> thinking of it in the ftrace model of tracepoints, not in the perf model of
> events.
>
> No, events are generic and as such per task as well, not just global.
>
> I've replied to your other mail with more specific suggestions of how we could
> provide your feature using abstractions that share code more widely. Talking
> specifics will i hope help move the discussion forward! :-)
Agreed.  I'll digest both the watchdog code as well as your other
comments and follow up when I have a fuller picture in my head.
(I have a few initial comments I'll post in response to your other mail.)
Thanks!
will
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-16  0:36             ` James Morris
@ 2011-05-16 15:08               ` Ingo Molnar
  2011-05-17  2:24                 ` James Morris
  2011-05-26  6:27               ` Pavel Machek
  1 sibling, 1 reply; 77+ messages in thread
From: Ingo Molnar @ 2011-05-16 15:08 UTC (permalink / raw)
  To: linux-arm-kernel
* James Morris <jmorris@namei.org> wrote:
> On Fri, 13 May 2011, Ingo Molnar wrote:
> 
> > Say i'm a user-space sandbox developer who wants to enforce that sandboxed 
> > code should only be allowed to open files in /home/sandbox/, /lib/ and 
> > /usr/lib/.
> > 
> > It is a simple and sensible security feature, agreed? It allows most code 
> > to run well and link to countless libraries - but no access to other files 
> > is allowed.
> 
> Not really.
> 
> Firstly, what is the security goal of these restrictions? [...]
To do what i described above? Namely:
 " Sandboxed code should only be allowed to open files in /home/sandbox/, /lib/
   and /usr/lib/ "
> [...]  Then, are the restrictions complete and unbypassable?
If only the system calls i mentioned are allowed, and if the sandboxed VFS 
namespace itself is isolated from the rest of the system (no bind mounts, no 
hard links outside the sandbox, etc.) then its goal is to not be bypassable - 
what use is a sandbox if the sandbox can be bypassed by the sandboxed code?
There's a few ways how to alter (and thus bypass) VFS namespace lookups: 
symlinks, chdir, chroot, rename, etc., which (as i mentioned) have to be 
excluded by default or filtered as well.
> How do you reason about the behavior of the system as a whole?
For some usecases i mainly want to reason about what the sandboxed code can do 
and can not do, within a fairly static and limited VFS namespace environment.
I might not want to have a full-blown 'physical barrier' for all objects 
labeled as inaccessible to sandboxed code (or labeled as accessible to 
sandboxed code).
Especially as manipulating file labels is not also slow (affects all files) but 
is also often an exclusively privileged operation even for owned files, for no 
good reason. For things like /lib/ and /usr/lib/ it also *has* to be a 
privileged operation.
> > I argue that this is the LSM and audit subsystems designed right: in the 
> > long run it could allow everything that LSM does at the moment - and so 
> > much more ...
> 
> Now you're proposing a redesign of the security subsystem.  That's a 
> significant undertaking.
It certainly is.
> In the meantime, we have a simple, well-defined enhancement to seccomp which 
> will be very useful to current users in reducing their kernel attack surface.
> 
> We should merge that, and the security subsystem discussion can carry on 
> separately.
Is that the development and merge process along which the LSM subsystem got 
into its current state?
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-12  3:02 ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Will Drewry
                     ` (2 preceding siblings ...)
  2011-05-13 19:35   ` Arnd Bergmann
@ 2011-05-16 15:26   ` Steven Rostedt
  2011-05-16 15:28     ` Will Drewry
  3 siblings, 1 reply; 77+ messages in thread
From: Steven Rostedt @ 2011-05-16 15:26 UTC (permalink / raw)
  To: linux-arm-kernel
Sorry to be absent from this thread so far, I just got back from my
travels and I'm now catching up on email.
On Wed, 2011-05-11 at 22:02 -0500, Will Drewry wrote:
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 377a7a5..22e1668 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -1664,6 +1664,16 @@ config SECCOMP
>  	  and the task is only allowed to execute a few safe syscalls
>  	  defined by each seccomp mode.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config CC_STACKPROTECTOR
>  	bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
>  	depends on EXPERIMENTAL
> diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
> index eccdefe..7641ee9 100644
> --- a/arch/microblaze/Kconfig
> +++ b/arch/microblaze/Kconfig
> @@ -129,6 +129,16 @@ config SECCOMP
>  
>  	  If unsure, say Y. Only embedded should say N here.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  endmenu
>  
>  menu "Advanced setup"
> diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
> index 8e256cc..fe4cbda 100644
> --- a/arch/mips/Kconfig
> +++ b/arch/mips/Kconfig
> @@ -2245,6 +2245,16 @@ config SECCOMP
>  
>  	  If unsure, say Y. Only embedded should say N here.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config USE_OF
>  	bool "Flattened Device Tree support"
>  	select OF
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 8f4d50b..83499e4 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -605,6 +605,16 @@ config SECCOMP
>  
>  	  If unsure, say Y. Only embedded should say N here.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  endmenu
>  
>  config ISA_DMA_API
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index 2508a6f..2777515 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -614,6 +614,16 @@ config SECCOMP
>  
>  	  If unsure, say Y.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  endmenu
>  
>  menu "Power Management"
> diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
> index 4b89da2..00c1521 100644
> --- a/arch/sh/Kconfig
> +++ b/arch/sh/Kconfig
> @@ -676,6 +676,16 @@ config SECCOMP
>  
>  	  If unsure, say N.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config SMP
>  	bool "Symmetric multi-processing support"
>  	depends on SYS_SUPPORTS_SMP
> diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
> index e560d10..5b42255 100644
> --- a/arch/sparc/Kconfig
> +++ b/arch/sparc/Kconfig
> @@ -270,6 +270,16 @@ config SECCOMP
>  
>  	  If unsure, say Y. Only embedded should say N here.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config HOTPLUG_CPU
>  	bool "Support for hot-pluggable CPUs"
>  	depends on SPARC64 && SMP
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index cc6c53a..d6d44d9 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1485,6 +1485,16 @@ config SECCOMP
>  
>  	  If unsure, say Y. Only embedded should say N here.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config CC_STACKPROTECTOR
>  	bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
>  	---help---
You just cut-and-pasted 8 copies of a config selection. The proper way
to do that is to add the Kconfig selection in a core kernel Kconfig,
have it depend on "HAVE_SECCOMP_FILTER" and then in each of these
configs, simply add in the arch Kconfig:
config <ARCH>
	[...]
	select HAVE_SECCOMP_FILTER
That way you don't need to duplicate the config option all over the
place.
-- Steve
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-16 15:26   ` Steven Rostedt
@ 2011-05-16 15:28     ` Will Drewry
  0 siblings, 0 replies; 77+ messages in thread
From: Will Drewry @ 2011-05-16 15:28 UTC (permalink / raw)
  To: linux-arm-kernel
On Mon, May 16, 2011 at 10:26 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> Sorry to be absent from this thread so far, I just got back from my
> travels and I'm now catching up on email.
>
>
> On Wed, 2011-05-11 at 22:02 -0500, Will Drewry wrote:
>
>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
>> index 377a7a5..22e1668 100644
>> --- a/arch/arm/Kconfig
>> +++ b/arch/arm/Kconfig
>> @@ -1664,6 +1664,16 @@ config SECCOMP
>> ? ? ? ? and the task is only allowed to execute a few safe syscalls
>> ? ? ? ? defined by each seccomp mode.
>>
>> +config SECCOMP_FILTER
>> + ? ? bool "Enable seccomp-based system call filtering"
>> + ? ? depends on SECCOMP && EXPERIMENTAL
>> + ? ? help
>> + ? ? ? Per-process, inherited system call filtering using shared code
>> + ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
>> + ? ? ? is not available, enhanced filters will not be available.
>> +
>> + ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>> ?config CC_STACKPROTECTOR
>> ? ? ? bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
>> ? ? ? depends on EXPERIMENTAL
>> diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
>> index eccdefe..7641ee9 100644
>> --- a/arch/microblaze/Kconfig
>> +++ b/arch/microblaze/Kconfig
>> @@ -129,6 +129,16 @@ config SECCOMP
>>
>> ? ? ? ? If unsure, say Y. Only embedded should say N here.
>>
>> +config SECCOMP_FILTER
>> + ? ? bool "Enable seccomp-based system call filtering"
>> + ? ? depends on SECCOMP && EXPERIMENTAL
>> + ? ? help
>> + ? ? ? Per-process, inherited system call filtering using shared code
>> + ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
>> + ? ? ? is not available, enhanced filters will not be available.
>> +
>> + ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>> ?endmenu
>>
>> ?menu "Advanced setup"
>> diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
>> index 8e256cc..fe4cbda 100644
>> --- a/arch/mips/Kconfig
>> +++ b/arch/mips/Kconfig
>> @@ -2245,6 +2245,16 @@ config SECCOMP
>>
>> ? ? ? ? If unsure, say Y. Only embedded should say N here.
>>
>> +config SECCOMP_FILTER
>> + ? ? bool "Enable seccomp-based system call filtering"
>> + ? ? depends on SECCOMP && EXPERIMENTAL
>> + ? ? help
>> + ? ? ? Per-process, inherited system call filtering using shared code
>> + ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
>> + ? ? ? is not available, enhanced filters will not be available.
>> +
>> + ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>> ?config USE_OF
>> ? ? ? bool "Flattened Device Tree support"
>> ? ? ? select OF
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index 8f4d50b..83499e4 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -605,6 +605,16 @@ config SECCOMP
>>
>> ? ? ? ? If unsure, say Y. Only embedded should say N here.
>>
>> +config SECCOMP_FILTER
>> + ? ? bool "Enable seccomp-based system call filtering"
>> + ? ? depends on SECCOMP && EXPERIMENTAL
>> + ? ? help
>> + ? ? ? Per-process, inherited system call filtering using shared code
>> + ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
>> + ? ? ? is not available, enhanced filters will not be available.
>> +
>> + ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>> ?endmenu
>>
>> ?config ISA_DMA_API
>> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
>> index 2508a6f..2777515 100644
>> --- a/arch/s390/Kconfig
>> +++ b/arch/s390/Kconfig
>> @@ -614,6 +614,16 @@ config SECCOMP
>>
>> ? ? ? ? If unsure, say Y.
>>
>> +config SECCOMP_FILTER
>> + ? ? bool "Enable seccomp-based system call filtering"
>> + ? ? depends on SECCOMP && EXPERIMENTAL
>> + ? ? help
>> + ? ? ? Per-process, inherited system call filtering using shared code
>> + ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
>> + ? ? ? is not available, enhanced filters will not be available.
>> +
>> + ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>> ?endmenu
>>
>> ?menu "Power Management"
>> diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
>> index 4b89da2..00c1521 100644
>> --- a/arch/sh/Kconfig
>> +++ b/arch/sh/Kconfig
>> @@ -676,6 +676,16 @@ config SECCOMP
>>
>> ? ? ? ? If unsure, say N.
>>
>> +config SECCOMP_FILTER
>> + ? ? bool "Enable seccomp-based system call filtering"
>> + ? ? depends on SECCOMP && EXPERIMENTAL
>> + ? ? help
>> + ? ? ? Per-process, inherited system call filtering using shared code
>> + ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
>> + ? ? ? is not available, enhanced filters will not be available.
>> +
>> + ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>> ?config SMP
>> ? ? ? bool "Symmetric multi-processing support"
>> ? ? ? depends on SYS_SUPPORTS_SMP
>> diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
>> index e560d10..5b42255 100644
>> --- a/arch/sparc/Kconfig
>> +++ b/arch/sparc/Kconfig
>> @@ -270,6 +270,16 @@ config SECCOMP
>>
>> ? ? ? ? If unsure, say Y. Only embedded should say N here.
>>
>> +config SECCOMP_FILTER
>> + ? ? bool "Enable seccomp-based system call filtering"
>> + ? ? depends on SECCOMP && EXPERIMENTAL
>> + ? ? help
>> + ? ? ? Per-process, inherited system call filtering using shared code
>> + ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
>> + ? ? ? is not available, enhanced filters will not be available.
>> +
>> + ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>> ?config HOTPLUG_CPU
>> ? ? ? bool "Support for hot-pluggable CPUs"
>> ? ? ? depends on SPARC64 && SMP
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index cc6c53a..d6d44d9 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1485,6 +1485,16 @@ config SECCOMP
>>
>> ? ? ? ? If unsure, say Y. Only embedded should say N here.
>>
>> +config SECCOMP_FILTER
>> + ? ? bool "Enable seccomp-based system call filtering"
>> + ? ? depends on SECCOMP && EXPERIMENTAL
>> + ? ? help
>> + ? ? ? Per-process, inherited system call filtering using shared code
>> + ? ? ? across seccomp and ftrace_syscalls. ?If CONFIG_FTRACE_SYSCALLS
>> + ? ? ? is not available, enhanced filters will not be available.
>> +
>> + ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>> ?config CC_STACKPROTECTOR
>> ? ? ? bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
>> ? ? ? ---help---
>
> You just cut-and-pasted 8 copies of a config selection. The proper way
> to do that is to add the Kconfig selection in a core kernel Kconfig,
> have it depend on "HAVE_SECCOMP_FILTER" and then in each of these
> configs, simply add in the arch Kconfig:
>
> config <ARCH>
> ? ? ? ?[...]
> ? ? ? ?select HAVE_SECCOMP_FILTER
>
>
> That way you don't need to duplicate the config option all over the
> place.
Thanks!  I had seen the HAVE_* format but failed to acknowledge why!
I'll fix it in the next rev (assuming it still fits there)!
will
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-16 12:43                   ` Ingo Molnar
@ 2011-05-16 15:29                     ` Will Drewry
  2011-05-17 12:57                       ` Ingo Molnar
  0 siblings, 1 reply; 77+ messages in thread
From: Will Drewry @ 2011-05-16 15:29 UTC (permalink / raw)
  To: linux-arm-kernel
On Mon, May 16, 2011 at 7:43 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Will Drewry <wad@chromium.org> wrote:
>
>> > Note, i'm not actually asking for the moon, a pony and more.
>> >
>> > I fully submit that we are yet far away from being able to do a full LSM
>> > via this mechanism.
>> >
>> > What i'm asking for is that because the syscall point steps taken by Will
>> > look very promising so it would be nice to do *that* in a slightly more
>> > flexible scheme that does not condemn it to be limited to the syscall
>> > boundary and such ...
>>
>> What do you suggest here?
>>
>> From my brief exploration of the ftrace/perf (and seccomp) code, I don't see
>> a clean way of integrating over the existing interfaces to the ftrace
>> framework (e.g., the global perf event pump seems to be a mismatch), but I
>> may be missing something obvious. [...]
>
> Well, there's no global perf event pump. Here we'd obviously want to use
> buffer-less events that do no output (pumping) whatsoever - i.e. just counting
> events with filters attached.
Cool - I missed these entirely.  I'll get reading :)
> What i suggest is to:
>
> ?- share syscall events ? ? ? ?# you are fine with that as your patch makes use of them
> ?- share the scripting engine ?# you are fine with that as your patch makes use of it
>
> ?- share *any* other event to do_exit() a process at syscall exit time
>
> ?- share any other active event that kernel developers specifically enable
> ? for active use to impact security-relevant execution even sooner than
> ? syscall exit time - not just system calls
>
> ?- share the actual facility that manages (sets/gets) filters
These make sense to me at a high level.  I'll throw in a few initial
comments, but I'll be back for a round-two once I catch up on the rest
of the perf code.
> So right now you have this structure for your new feature:
>
> ?Documentation/trace/seccomp_filter.txt | ? 75 +++++
> ?arch/x86/kernel/ldt.c ? ? ? ? ? ? ? ? ?| ? ?5
> ?arch/x86/kernel/tls.c ? ? ? ? ? ? ? ? ?| ? ?7
> ?fs/proc/array.c ? ? ? ? ? ? ? ? ? ? ? ?| ? 21 +
> ?fs/proc/base.c ? ? ? ? ? ? ? ? ? ? ? ? | ? 25 +
> ?include/linux/ftrace_event.h ? ? ? ? ? | ? ?9
> ?include/linux/seccomp.h ? ? ? ? ? ? ? ?| ? 98 +++++++
> ?include/linux/syscalls.h ? ? ? ? ? ? ? | ? 54 ++--
> ?include/trace/syscall.h ? ? ? ? ? ? ? ?| ? ?6
> ?kernel/Makefile ? ? ? ? ? ? ? ? ? ? ? ?| ? ?1
> ?kernel/fork.c ? ? ? ? ? ? ? ? ? ? ? ? ?| ? ?8
> ?kernel/perf_event.c ? ? ? ? ? ? ? ? ? ?| ? ?7
> ?kernel/seccomp.c ? ? ? ? ? ? ? ? ? ? ? | ?144 ++++++++++-
> ?kernel/seccomp_filter.c ? ? ? ? ? ? ? ?| ?428 +++++++++++++++++++++++++++++++++
> ?kernel/sys.c ? ? ? ? ? ? ? ? ? ? ? ? ? | ? 11
> ?kernel/trace/Kconfig ? ? ? ? ? ? ? ? ? | ? 10
> ?kernel/trace/trace_events_filter.c ? ? | ? 60 ++--
> ?kernel/trace/trace_syscalls.c ? ? ? ? ?| ? 96 ++++++-
> ?18 files changed, 986 insertions(+), 79 deletions(-)
>
> Firstly, one specific problem i can see is that kernel/seccomp_filter.c
> hardcodes to the system call boundary. Which is fine to a prototype
> implementation (and obviously fine for something that builds upon seccomp) but
> not fine in terms of a flexible Linux security feature :-)
>
> You have hardcoded these syscall assumptions via:
>
> ?struct seccomp_filter {
> ? ? ? ?struct list_head list;
> ? ? ? ?struct rcu_head rcu;
> ? ? ? ?int syscall_nr;
> ? ? ? ?struct syscall_metadata *data;
> ? ? ? ?struct event_filter *event_filter;
> ?};
(This structure is a bit different in the new rev of the patch, but
nothing relevant to this specific part of the discussion :)
> Which comes just because you chose to enumerate only syscall events - instead
> of enumerating all events.
While I'm willing to live with the tradeoff, using ftrace event
numbers from FTRACE_SYSCALLS means that the filter will be unable to
hook a fair number of syscalls: execve, clone, etc  (all the ptregs
fixup syscalls on x86) and anything that returns int instead of long
(on x86).  Though the last two patches in the initial patch series
provided a proposed clean up for the latter :/
The current revision of the seccomp filter patch can function in a
bitmask-like state when CONFIG_FTRACE is unset or
CONFIG_FTRACE_SYSCALLS is unset.  This also means any platform with
CONFIG_SECCOMP support can start using this right away, but I realize
that is more of a short-term gain rather than a long-term one.
> Instead of that please bind to all events instead - syscalls are just one of
> the many events we have. Type 'perf list' and see how many event types we have,
> and quite a few could be enabled for 'active feedback' sandboxing as well.
>
> Secondly, and this is really a variant of the first problem you have, the way
> you process event filter 'failures' is pretty limited. You utilize the regular
> seccomp method which works by calling into __secure_computing() and silently
> accepting syscalls or generating a hard do_exit() on even the slightest of
> filter failures.
>
> Instead of that what we'd want to have is to have regular syscall events
> registered, *and* enabled for such active filtering purposes. The moment the
> filter hits such an 'active' event would set the TIF_NOTIFY_RESUME flag and
> some other attribute in the task and the kernel would do a do_exit() at the
> earliest of opportunities before calling the syscall or at the
> return-from-syscall point latest.
>
> Note that no seccomp specific code would have to execute here, we can already
> generate events both at syscall entry and at syscall exit points, the only new
> bit we'd need is for the 'kill the task' [or abort the syscall] policy action.
>
> This is *far* more generic still yields the same short-term end result as far
> as your sandboxing is concerned.
Almost :/  I still need to review the code you've pointed out, but, at
present, the ftrace hooks occur after the seccomp and syscall auditing
hooks.  This means that that code is exposed no matter what in this
model.  To trim the exposed surface to userspace, we really need those
early hooks.  While I can see both hacky and less hacky approaches
around this, it stills strikes me that the seccomp thread flag and
early interception are good to reuse.  One option might be to allow
seccomp to be a secure-syscall event source, but I suspect that lands
more on the hack-y side of the fence :)
Anyway, I'll read up on the other filters and try to understand
exactly how per-task, counting perf events are being handled.
> What we'd need for this is a way to mark existing TRACE_EVENT()s as 'active'.
> We'd mark all syscall events as 'active' straight away.
It seems like we'll still end up special casing system calls no matter
what as they are the userland vector into the kernel.  Marking
syscalls active right away sounds roughly equivalent to calling
prctl(PR_SET_SECCOMP, 2).  This could easily be done following the
model of syscall_nr_to_meta() since that enumerates every system call
meta data entry which yields both entry and exit event numbers. Of
course, I'd need to understand how that ties in with the per-task,
counting code.
> [ Detail: these trace events return a return code, which the calling code can
> ?use, that way event return values could be used sooner than syscall exit
> ?points. IRQ code could make use of it as well, so for example this way we
> ?could filter based early packet inspection, still in the IRQ code. ]
>
> Then what your feature would do is to simply open up the events you are
> interested in (just as counting events, without any buffers), and set 'active'
> filters in them them to 'deny/allow' specific combinations of parameters only.
>
> Thirdly, you have added a separate facility to set/query filters via /proc,
> this lives mostly in /proc/<pid>/seccomp_filter. This seccomp specificity would
> go away altogether: it should be possible to query existing events and filters
> even outside of the seccomp facility, so if you want something explicit here
> beyond the current event ioctl() to set filters please improve the generic
> facility - which right now is poorer than what you have implemented so it's in
> good need to be improved! :-)
I haven't looked at the other interface, so I will.  I know about the
existence of the perf_event_open syscall, but that's about it.  Would
it make sense to use the same syscall in a unified-interface world?
Even if the right way, semantically it seems awkward.
> All in one such a solution would enable us to reuse code in a lot more deeper
> fashion while still providing all the functionality you are interested in.
> These are all short-term concerns, not pie-in-the-sky future ideal LSM
> concerns.
Thanks for the concrete feedback! I'll follow up again once I better
understand the code you're citing as well as the existing userland
interface to it.
cheers!
will
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-13 14:57                             ` Ingo Molnar
  2011-05-13 15:27                               ` Peter Zijlstra
@ 2011-05-16 16:23                               ` Steven Rostedt
  2011-05-16 16:52                                 ` Ingo Molnar
  1 sibling, 1 reply; 77+ messages in thread
From: Steven Rostedt @ 2011-05-16 16:23 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, 2011-05-13 at 16:57 +0200, Ingo Molnar wrote:
> > > > Then there's the whole indirection argument, if you don't need
> > > > indirection, its often better to not use it, I myself much prefer code
> > > > to look like:
> > > > 
> > > >    foo1(bar);
> > > >    foo2(bar);
> > > >    foo3(bar);
> > > > 
> > > > Than:
> > > > 
> > > >    foo_notifier(bar);
> > > > 
> > > > Simply because its much clearer who all are involved without me having
> > > > to grep around to see who registers for foo_notifier and wth they do
> > > > with it. It also makes it much harder to sneak in another user, whereas
> > > > its nearly impossible to find new notifier users.
> > > > 
> > > > Its also much faster, no extra memory accesses, no indirect function
> > > > calls, no other muck.
> > > 
> > > But i suspect this question has been settled, given the fact that even pure 
> > > observer events need and already process a chain of events? Am i missing 
> > > something about your argument?
> > 
> > I'm saying that there's reasons to not use notifiers passive or active.
> > 
> > Mostly the whole notifier/indirection muck comes up once you want
> > modules to make use of the thing, because then you need dynamic
> > management of the callback list.
> 
> But your argument assumes that we'd have a chain of functions to call, like 
> regular notifiers.
> 
> While the natural model here would be to have a list of registered event 
> structs for that point, with different filters but basically the same callback 
> mechanism (a call into the filter engine in essence).
> 
> Also note that the common case would be no event registered - and we'd 
> automatically optimize that case via the existing jump labels optimization.
I agree that I prefer the "notifier" type over having direct function
calls. Yes, it's easier to read and figure out what functions are
called, but notifiers can be optimized for the default case where
nothing is called (jump-label nop).
> 
> > (Then again, I'm fairly glad we don't have explicit callbacks in kernel/cpu.c 
> > for all the cpu-hotplug callbacks :-)
> > 
> > Anyway, I oppose for the existing events to gain an active role.
> 
> Why if 'being active' is optional and useful?
I'm a bit nervous about the 'active' role of (trace_)events, because of
the way multiple callbacks can be registered. How would:
	err = event_x();
	if (err == -EACCESS) {
be handled? Would we need a way to prioritize which call back gets the
return value? One way I guess would be to add a check_event option,
where you pass in an ENUM of the event you want:
	event_x();
	err = check_event_x(MYEVENT);
If something registered itself as "MYEVENT" to event_x, then you get the
return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or
something could be returned. I'm sure we could even optimize it such a
way if no active events have been registered to event_x, that
check_event_x() will return -ENODEV without any branches.
-- Steve
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-16 16:23                               ` Steven Rostedt
@ 2011-05-16 16:52                                 ` Ingo Molnar
  2011-05-16 17:03                                   ` Steven Rostedt
  0 siblings, 1 reply; 77+ messages in thread
From: Ingo Molnar @ 2011-05-16 16:52 UTC (permalink / raw)
  To: linux-arm-kernel
* Steven Rostedt <rostedt@goodmis.org> wrote:
> I'm a bit nervous about the 'active' role of (trace_)events, because of the 
> way multiple callbacks can be registered. How would:
> 
> 	err = event_x();
> 	if (err == -EACCESS) {
> 
> be handled? [...]
The default behavior would be something obvious: to trigger all callbacks and 
use the first non-zero return value.
> [...] Would we need a way to prioritize which call back gets the return 
> value? One way I guess would be to add a check_event option, where you pass 
> in an ENUM of the event you want:
> 
> 	event_x();
> 	err = check_event_x(MYEVENT);
> 
> If something registered itself as "MYEVENT" to event_x, then you get the 
> return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or 
> something could be returned. I'm sure we could even optimize it such a way if 
> no active events have been registered to event_x, that check_event_x() will 
> return -ENODEV without any branches.
I would keep it simple and extensible - that way we can complicate it when the 
need arises! :)
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-16 16:52                                 ` Ingo Molnar
@ 2011-05-16 17:03                                   ` Steven Rostedt
  2011-05-17 12:42                                     ` Ingo Molnar
  0 siblings, 1 reply; 77+ messages in thread
From: Steven Rostedt @ 2011-05-16 17:03 UTC (permalink / raw)
  To: linux-arm-kernel
On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > I'm a bit nervous about the 'active' role of (trace_)events, because of the 
> > way multiple callbacks can be registered. How would:
> > 
> > 	err = event_x();
> > 	if (err == -EACCESS) {
> > 
> > be handled? [...]
> 
> The default behavior would be something obvious: to trigger all callbacks and 
> use the first non-zero return value.
But how do we know which callback that was from? There's no ordering of
what callbacks are called first.
> 
> > [...] Would we need a way to prioritize which call back gets the return 
> > value? One way I guess would be to add a check_event option, where you pass 
> > in an ENUM of the event you want:
> > 
> > 	event_x();
> > 	err = check_event_x(MYEVENT);
> > 
> > If something registered itself as "MYEVENT" to event_x, then you get the 
> > return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or 
> > something could be returned. I'm sure we could even optimize it such a way if 
> > no active events have been registered to event_x, that check_event_x() will 
> > return -ENODEV without any branches.
> 
> I would keep it simple and extensible - that way we can complicate it when the 
> need arises! :)
The above is rather trivial to implement. I don't think it complicates
anything.
-- Steve
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-16 15:08               ` Ingo Molnar
@ 2011-05-17  2:24                 ` James Morris
  2011-05-17 13:10                   ` Ingo Molnar
  0 siblings, 1 reply; 77+ messages in thread
From: James Morris @ 2011-05-17  2:24 UTC (permalink / raw)
  To: linux-arm-kernel
On Mon, 16 May 2011, Ingo Molnar wrote:
> > Not really.
> > 
> > Firstly, what is the security goal of these restrictions? [...]
> 
> To do what i described above? Namely:
> 
>  " Sandboxed code should only be allowed to open files in /home/sandbox/, /lib/
>    and /usr/lib/ "
These are access rules, they don't really describe a high-level security 
goal.  How do you know it's ok to open everything in these directories?
- James
-- 
James Morris
<jmorris@namei.org>
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-16 17:03                                   ` Steven Rostedt
@ 2011-05-17 12:42                                     ` Ingo Molnar
  2011-05-17 13:05                                       ` Steven Rostedt
  0 siblings, 1 reply; 77+ messages in thread
From: Ingo Molnar @ 2011-05-17 12:42 UTC (permalink / raw)
  To: linux-arm-kernel
* Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:
> > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > I'm a bit nervous about the 'active' role of (trace_)events, because of the 
> > > way multiple callbacks can be registered. How would:
> > > 
> > > 	err = event_x();
> > > 	if (err == -EACCESS) {
> > > 
> > > be handled? [...]
> > 
> > The default behavior would be something obvious: to trigger all callbacks and 
> > use the first non-zero return value.
> 
> But how do we know which callback that was from? There's no ordering of what 
> callbacks are called first.
We do not have to know that - nor do the calling sites care in general. Do you 
have some specific usecase in mind where the identity of the callback that 
generates a match matters?
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-16 15:29                     ` Will Drewry
@ 2011-05-17 12:57                       ` Ingo Molnar
  0 siblings, 0 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-17 12:57 UTC (permalink / raw)
  To: linux-arm-kernel
* Will Drewry <wad@chromium.org> wrote:
> > This is *far* more generic still yields the same short-term end result as 
> > far as your sandboxing is concerned.
> 
> Almost :/ [...]
Hey that's a pretty good result from a subsystem that was not written with your 
usecase in mind *at all* ;-)
> [...]  I still need to review the code you've pointed out, but, at present, 
> the ftrace hooks occur after the seccomp and syscall auditing hooks.  This 
> means that that code is exposed no matter what in this model.  To trim the 
> exposed surface to userspace, we really need those early hooks.  While I can 
> see both hacky and less hacky approaches around this, it stills strikes me 
> that the seccomp thread flag and early interception are good to reuse.  One 
> option might be to allow seccomp to be a secure-syscall event source, but I 
> suspect that lands more on the hack-y side of the fence :)
Agreed, there should be no security compromise imposed on your usecase, at all.
You could move the event callback sooner into the syscall-entry sequence to 
make sure it's the highest priority thing to process?
There's no semantic dependency on its current location so this can be changed 
AFAICS.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-17 12:42                                     ` Ingo Molnar
@ 2011-05-17 13:05                                       ` Steven Rostedt
  2011-05-17 13:19                                         ` Ingo Molnar
  0 siblings, 1 reply; 77+ messages in thread
From: Steven Rostedt @ 2011-05-17 13:05 UTC (permalink / raw)
  To: linux-arm-kernel
On Tue, 2011-05-17 at 14:42 +0200, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:
> > > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > > 
> > > > I'm a bit nervous about the 'active' role of (trace_)events, because of the 
> > > > way multiple callbacks can be registered. How would:
> > > > 
> > > > 	err = event_x();
> > > > 	if (err == -EACCESS) {
> > > > 
> > > > be handled? [...]
> > > 
> > > The default behavior would be something obvious: to trigger all callbacks and 
> > > use the first non-zero return value.
> > 
> > But how do we know which callback that was from? There's no ordering of what 
> > callbacks are called first.
> 
> We do not have to know that - nor do the calling sites care in general. Do you 
> have some specific usecase in mind where the identity of the callback that 
> generates a match matters?
Maybe I'm confused. I was thinking that these event_*() are what we
currently call trace_*(), but the event_*(), I assume, can return a
value if a call back returns one.
Thus, we now have the ability to dynamically attach function calls to
arbitrary points in the kernel that can have an affect on the code that
called it. Right now, we only have the ability to attach function calls
to these locations that have passive affects (tracing/profiling).
But you say, "nor do the calling sites care in general". Then what do
these calling sites do with the return code? Are we limiting these
actions to security only? Or can we have some other feature. I can
envision that we can make the Linux kernel quite dynamic here with "self
modifying code". That is, anywhere we have "hooks", perhaps we could
replace them with dynamic switches (jump labels). Maybe events would not
be the best use, but they could be a generic one.
Knowing what callback returned the result would be beneficial. Right
now, you are saying if the call back return anything, just abort the
call, not knowing what callback was called.
-- Steve
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-17  2:24                 ` James Morris
@ 2011-05-17 13:10                   ` Ingo Molnar
  2011-05-17 13:29                     ` James Morris
  0 siblings, 1 reply; 77+ messages in thread
From: Ingo Molnar @ 2011-05-17 13:10 UTC (permalink / raw)
  To: linux-arm-kernel
* James Morris <jmorris@namei.org> wrote:
> On Mon, 16 May 2011, Ingo Molnar wrote:
> 
> > > Not really.
> > > 
> > > Firstly, what is the security goal of these restrictions? [...]
> > 
> > To do what i described above? Namely:
> > 
> >  " Sandboxed code should only be allowed to open files in /home/sandbox/, /lib/
> >    and /usr/lib/ "
> 
> These are access rules, they don't really describe a high-level security 
> goal. [...]
Restrictng sandboxed code to only open files within a given VFS namespace 
boundary sure sounds like a high-level security goal to me.
If implemented and set up correctly then it restricts sandboxed code to only be 
able to open files reachable via that VFS sub-namespace.
That is a rather meaningful high-level concept. What higher level concept do 
you want to argue?
> [...]  How do you know it's ok to open everything in these directories?
How do you know it's ok to open /etc/hosts? The sysadmin has configured the 
system that way.
How do you know that it's ok for sandboxed code to open files in 
/home/sandbox/? The sandbox developer has configured the system that way.
I'm not sure i get your point.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-17 13:05                                       ` Steven Rostedt
@ 2011-05-17 13:19                                         ` Ingo Molnar
  2011-05-19  4:07                                           ` Will Drewry
  0 siblings, 1 reply; 77+ messages in thread
From: Ingo Molnar @ 2011-05-17 13:19 UTC (permalink / raw)
  To: linux-arm-kernel
* Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 2011-05-17 at 14:42 +0200, Ingo Molnar wrote:
> > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:
> > > > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > > > 
> > > > > I'm a bit nervous about the 'active' role of (trace_)events, because of the 
> > > > > way multiple callbacks can be registered. How would:
> > > > > 
> > > > > 	err = event_x();
> > > > > 	if (err == -EACCESS) {
> > > > > 
> > > > > be handled? [...]
> > > > 
> > > > The default behavior would be something obvious: to trigger all callbacks and 
> > > > use the first non-zero return value.
> > > 
> > > But how do we know which callback that was from? There's no ordering of what 
> > > callbacks are called first.
> > 
> > We do not have to know that - nor do the calling sites care in general. Do you 
> > have some specific usecase in mind where the identity of the callback that 
> > generates a match matters?
> 
> Maybe I'm confused. I was thinking that these event_*() are what we
> currently call trace_*(), but the event_*(), I assume, can return a
> value if a call back returns one.
Yeah - and the call site can treat it as:
 - Ugh, if i get an error i need to abort whatever i was about to do
or (more advanced future use):
 - If i get a positive value i need to re-evaluate the parameters that were 
   passed in, they were changed
> Thus, we now have the ability to dynamically attach function calls to 
> arbitrary points in the kernel that can have an affect on the code that 
> called it. Right now, we only have the ability to attach function calls to 
> these locations that have passive affects (tracing/profiling).
Well, they can only have the effect that the calling site accepts and handles. 
So the 'effect' is not arbitrary and not defined by the callbacks, it is 
controlled and handled by the calling code.
We do not want invisible side-effects, opaque hooks, etc.
Instead of that we want (this is the getname() example i cited in the thread) 
explicit effects, like:
 if (event_vfs_getname(result))
	return ERR_PTR(-EPERM);
> But you say, "nor do the calling sites care in general". Then what do
> these calling sites do with the return code? Are we limiting these
> actions to security only? Or can we have some other feature. [...]
Yeah, not just security. One other example that came up recently is whether to 
panic the box on certain (bad) events such as NMI errors. This too could be 
made flexible via the event filter code: we already capture many events, so 
places that might conceivably do some policy could do so based on a filter 
condition.
> [...] I can envision that we can make the Linux kernel quite dynamic here 
> with "self modifying code". That is, anywhere we have "hooks", perhaps we 
> could replace them with dynamic switches (jump labels). Maybe events would 
> not be the best use, but they could be a generic one.
events and explicit function calls and explicit side-effects are pretty much 
the only thing that are acceptable. We do not want opaque hooks and arbitrary 
side-effects.
> Knowing what callback returned the result would be beneficial. Right now, you 
> are saying if the call back return anything, just abort the call, not knowing 
> what callback was called.
Yeah, and that's a feature: that way a number of conditions can be attached. 
Multiple security frameworks may have effect on a task or multiple tools might 
set policy action on a given event.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-17 13:10                   ` Ingo Molnar
@ 2011-05-17 13:29                     ` James Morris
  2011-05-17 18:34                       ` Ingo Molnar
  0 siblings, 1 reply; 77+ messages in thread
From: James Morris @ 2011-05-17 13:29 UTC (permalink / raw)
  To: linux-arm-kernel
On Tue, 17 May 2011, Ingo Molnar wrote:
> I'm not sure i get your point.
Your example was not complete as described.  After an apparently simple 
specification, you've since added several qualifiers and assumptions, and 
I still doubt that it's complete.
A higher level goal would look like
"Allow a sandbox app access only to approved resources, to contain the 
effects of flaws in the app", or similar.
Note that this includes a threat model (remote attacker taking control of 
the app) and a general and fully stated strategy for dealing with it.
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-17 13:29                     ` James Morris
@ 2011-05-17 18:34                       ` Ingo Molnar
  0 siblings, 0 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-17 18:34 UTC (permalink / raw)
  To: linux-arm-kernel
* James Morris <jmorris@namei.org> wrote:
> On Tue, 17 May 2011, Ingo Molnar wrote:
> 
> > I'm not sure i get your point.
> 
> Your example was not complete as described.  After an apparently simple 
> specification, you've since added several qualifiers and assumptions, [...]
I havent added any qualifiers really (i added examples/description), the opt-in 
method i mentioned in my first mail should be pretty robust:
 | Firstly, using the filter code i deny the various link creation syscalls so 
 | that sandboxed code cannot escape for example by creating a symlink to 
 | outside the permitted VFS namespace. (Note: we opt-in to syscalls, that way 
 | new syscalls added by new kernels are denied by defalt. The current symlink 
 | creation syscalls are not opted in to.)
> [...] and I still doubt that it's complete.
I could too claim that i doubt that the SELinux kernel implementation is 
secure!
So how about we both come up with specific examples about how it's not secure, 
instead of going down the fear-uncertainty-and-doubt road? ;-)
> A higher level goal would look like
> 
> "Allow a sandbox app access only to approved resources, to contain the 
> effects of flaws in the app", or similar.
I see what you mean.
I really think that "restricting sandboxed code to only open files within a 
given VFS namespace boundary" is the most useful highlevel description here - 
which is really a subset of a "allow a sandbox app access only to an easily 
approved set of files" highlevel concept.
There's no "to contain ..." bit here: *all* of the sandboxed app code is 
untrusted, so there's no 'remote attacker' and we do not limit our threat to 
flaws in the app. We want to contain apps to within a small subset of Linux 
functionality, and we want to do that within regular apps (without having to be 
superuser), full stop.
> Note that this includes a threat model (remote attacker taking control of the 
> app) and a general and fully stated strategy for dealing with it.
Attacker does not have to be remote - most sandboxing concepts protect against 
locally installed plugins/apps/applets. In sandboxing the whole app is 
considered untrusted - not just some flaw in it, abused remotely.
> From there, you can start to analyze how to implement the goal, at which 
> point you'd start thinking about configuration, assumptions, filesystem 
> access, namespaces, indirect access (e.g. via sockets, rpc, ipc, shared 
> memory, invocation).
Sandboxed code generally does not have access to anything fancy like that - if 
it is added then all possible side effects have to be examined.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-17 13:19                                         ` Ingo Molnar
@ 2011-05-19  4:07                                           ` Will Drewry
  2011-05-19 12:22                                             ` Steven Rostedt
  0 siblings, 1 reply; 77+ messages in thread
From: Will Drewry @ 2011-05-19  4:07 UTC (permalink / raw)
  To: linux-arm-kernel
On Tue, May 17, 2011 at 6:19 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> On Tue, 2011-05-17 at 14:42 +0200, Ingo Molnar wrote:
>> > * Steven Rostedt <rostedt@goodmis.org> wrote:
>> >
>> > > On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:
>> > > > * Steven Rostedt <rostedt@goodmis.org> wrote:
>> > > >
>> > > > > I'm a bit nervous about the 'active' role of (trace_)events, because of the
>> > > > > way multiple callbacks can be registered. How would:
>> > > > >
>> > > > > ? ? ? err = event_x();
>> > > > > ? ? ? if (err == -EACCESS) {
>> > > > >
>> > > > > be handled? [...]
>> > > >
>> > > > The default behavior would be something obvious: to trigger all callbacks and
>> > > > use the first non-zero return value.
>> > >
>> > > But how do we know which callback that was from? There's no ordering of what
>> > > callbacks are called first.
>> >
>> > We do not have to know that - nor do the calling sites care in general. Do you
>> > have some specific usecase in mind where the identity of the callback that
>> > generates a match matters?
>>
>> Maybe I'm confused. I was thinking that these event_*() are what we
>> currently call trace_*(), but the event_*(), I assume, can return a
>> value if a call back returns one.
>
> Yeah - and the call site can treat it as:
>
> ?- Ugh, if i get an error i need to abort whatever i was about to do
>
> or (more advanced future use):
>
> ?- If i get a positive value i need to re-evaluate the parameters that were
> ? passed in, they were changed
Do event_* that return non-void exist in the tree at all now?  I've
looked at the various tracepoint macros as well as some of the other
handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
places where a return value is honored nor could be.  At best, the
perf_tp_event can be short-circuited it in the hlist_for_each, but
it'd still need a way to bubble up a failure and result in not calling
the trace/event that the hook precedes.
Am I missing something really obvious?  I don't feel I've gotten a
good handle on exactly how all the tracing code gets triggered, so
perhaps I'm still a level (or three) too shallow. (I can see the asm
hooks for trace functions and I can see where that translates to
registered calls - like trace_function - but I don't see how the
hooked calls can be trivially aborted).
As is, I'm not sure how the perf and ftrace infrastructure could be
reused cleanly without a fair number of hacks to the interface and a
good bit of reworking.  I can already see a number of challenges
around reusing the sys_perf_event_open interface and the fact that
reimplementing something even as simple as seccomp mode=1 seems to
require a fair amount of tweaking to avoid from being leaky.  (E.g.,
enabling all TRACE_EVENT()s for syscalls will miss unhooked syscalls
so either acceptance matching needs to be propagated up the stack
along with some seccomp-like task modality or seccomp-on-perf would
have to depend on sys_enter events with syscall number predicate
matching and fail when a filter discard applies to all active events.)
At present, I'm leaning back towards the v2 series (plus the requested
minor changes) for the benefit of code clarity and its fail-secure
behavior.  Even just considering the reduced case of seccomp mode 1
being implemented on the shared infrastructure, I feel like I missing
something that makes it viable.  Any clues?
If not, I don't think a seccomp mode 2 interface via prctl would be
intractable if the long term movement is to a ftrace/perf backend - it
just means that the in-kernel code would change to wrap whatever the
final design ended up being.
Thanks and sorry if I'm being dense!
>> Thus, we now have the ability to dynamically attach function calls to
>> arbitrary points in the kernel that can have an affect on the code that
>> called it. Right now, we only have the ability to attach function calls to
>> these locations that have passive affects (tracing/profiling).
>
> Well, they can only have the effect that the calling site accepts and handles.
> So the 'effect' is not arbitrary and not defined by the callbacks, it is
> controlled and handled by the calling code.
>
> We do not want invisible side-effects, opaque hooks, etc.
>
> Instead of that we want (this is the getname() example i cited in the thread)
> explicit effects, like:
>
> ?if (event_vfs_getname(result))
> ? ? ? ?return ERR_PTR(-EPERM);
>
>> But you say, "nor do the calling sites care in general". Then what do
>> these calling sites do with the return code? Are we limiting these
>> actions to security only? Or can we have some other feature. [...]
>
> Yeah, not just security. One other example that came up recently is whether to
> panic the box on certain (bad) events such as NMI errors. This too could be
> made flexible via the event filter code: we already capture many events, so
> places that might conceivably do some policy could do so based on a filter
> condition.
This sounds great - I just wish I could figure out how it'd work :)
>> [...] I can envision that we can make the Linux kernel quite dynamic here
>> with "self modifying code". That is, anywhere we have "hooks", perhaps we
>> could replace them with dynamic switches (jump labels). Maybe events would
>> not be the best use, but they could be a generic one.
>
> events and explicit function calls and explicit side-effects are pretty much
> the only thing that are acceptable. We do not want opaque hooks and arbitrary
> side-effects.
>
>> Knowing what callback returned the result would be beneficial. Right now, you
>> are saying if the call back return anything, just abort the call, not knowing
>> what callback was called.
>
> Yeah, and that's a feature: that way a number of conditions can be attached.
> Multiple security frameworks may have effect on a task or multiple tools might
> set policy action on a given event.
>
> Thanks,
>
> ? ? ? ?Ingo
>
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-19  4:07                                           ` Will Drewry
@ 2011-05-19 12:22                                             ` Steven Rostedt
  2011-05-19 21:05                                               ` Will Drewry
  0 siblings, 1 reply; 77+ messages in thread
From: Steven Rostedt @ 2011-05-19 12:22 UTC (permalink / raw)
  To: linux-arm-kernel
On Wed, 2011-05-18 at 21:07 -0700, Will Drewry wrote:
> Do event_* that return non-void exist in the tree at all now?  I've
> looked at the various tracepoint macros as well as some of the other
> handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
> places where a return value is honored nor could be.  At best, the
> perf_tp_event can be short-circuited it in the hlist_for_each, but
> it'd still need a way to bubble up a failure and result in not calling
> the trace/event that the hook precedes.
No, none of the current trace hooks have return values. That was what I
was talking about how to implement in my previous emails.
-- Steve
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-19 12:22                                             ` Steven Rostedt
@ 2011-05-19 21:05                                               ` Will Drewry
  2011-05-24 15:59                                                 ` Will Drewry
  0 siblings, 1 reply; 77+ messages in thread
From: Will Drewry @ 2011-05-19 21:05 UTC (permalink / raw)
  To: linux-arm-kernel
On Thu, May 19, 2011 at 7:22 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 2011-05-18 at 21:07 -0700, Will Drewry wrote:
>
>> Do event_* that return non-void exist in the tree at all now? ?I've
>> looked at the various tracepoint macros as well as some of the other
>> handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
>> places where a return value is honored nor could be. ?At best, the
>> perf_tp_event can be short-circuited it in the hlist_for_each, but
>> it'd still need a way to bubble up a failure and result in not calling
>> the trace/event that the hook precedes.
>
> No, none of the current trace hooks have return values. That was what I
> was talking about how to implement in my previous emails.
Led on by complete ignorance, I think I'm finally beginning to untwist
the different pieces of the tracing infrastructure.  Unfortunately,
that means I took a few wrong turns along the way...
I think function tracing looks something like this:
ftrace_call has been injected into at a specific callsite.  Upon hit:
1. ftrace_call triggers
2. does some checks then calls ftrace_trace_function (via mcount instrs)
3. ftrace_trace_function may be a single func or a list. For a list it
will be: ftrace_list_func
4. ftrace_list_func calls each registered hook for that function in a
while loop ignoring return values
5. registered hook funcs may then track the call, farm it out to
specific sub-handlers, etc.
This seems to be a red herring for my use case :/ though this helped
me understand your back and forth (Ingo & Steve) regarding dynamic
versus explicit events.
System call tracing is done via kernel/tracepoint.c events fed in via
arch/[arch]/kernel/ptrace.c where it calls trace_sys_enter.  This
yields direct sys_enter and sys_exit event sources (and an event class
to hook up per-system call events).  This means that
ftrace_syscall_enter() does the event prep prior to doing a filter
check comparing the ftrace_event object for the given syscall_nr to
the event data.  perf_sysenter_enter() is similar but it pushes the
info over to perf_tp_event to be matched not against the global
syscall event entry, but against any sub-event in the linked list on
that syscall's event.
Is that roughly an accurate representation of the two?  I wish I
hadn't gotten distracted along the function path, but at least I
learned something (and it is relevant to the larger scope of this
thread's discussion).
After doing that digging, it looks like providing hook call
pre-emption and return value propagation will be a unique and fun task
for each tracer and event subsystem.  If I look solely at tracepoints,
a generic change would be to make the trace_##subsys function return
an int (which I think was the event_vfs_getname proposal?).  The other
option is to change the trace_sys_enter proto to include a 'int
*retcode'.
That change would allow the propagation of some sort of policy
information.  To put it to use, seccomp mode 1 could be implemented on
top of trace_syscalls.  The following changes would need to happen:
1. dummy metadata should be inserted for all unhooked system calls
2. perf_trace_buf_submit would need to return an int or a new
TRACE_REG_SECCOMP_REGISTER handler would need to be setup in
syscall_enter_register.
3. If perf is abused, a "kill/fail_on_discard" bit would be added to
event->attrs.
4. perf_trace_buf_submit/perf_tp_event will return 0 for no matches,
'n' for the number of event matches, and -EACCES/? if a
'fail_on_discard' event is seen.
5. perf_syscall_enter would set *retcode = perf_trace_buf_submit()'s retcode
6. trace_sys_enter() would need to be moved to be the first entry
arch/../kernel/ptrace.c for incoming syscalls
7. if trace_sys_enter() yields a negative return code, then
do_exit(SIGKILL) the process and return.
Entering into seccomp mode 1 would require adding a  "0" filter for
every system call number (which is why we need a dummy event call for
them since failing to check the bitmask can't be flagged
fail_on_discard) with the fail_on_discard bit.  For the three calls
that are allowed, a '1' filter would be set.
That would roughly implement seccomp mode 1.  It's pretty ugly and the
fact that every system call that's disallowed has to be blacklisted is
not ideal.  An alternate model would be to just use the seccomp mode
as we do today and let secure_computing() handle the return code of "#
of matches".  If it the # of matches is 0, it terminates. A
'fail_on_discard' bit then would only be good to stop further
tracepoint callback evaluation.  This approach would also *almost* nix
the need to provide dummy syscall hooks.  (Since sigreturn isn't
hooked on x86 because it uses ptregs fixup, a dummy would still be
needed to apply a "1" filter to.)
Even with that tweak to move to a whitelist model, the perf event
evaluation and tracepoint callback ordering is still not guaranteed.
Without changing tracepoint itself, all other TPs will still execute.
And for perf events, it'll be first-come-first-serve until a
fail_on_discard is hit.
After using seccomp mode=1 as the sample case to reduce scope, it's
possible to ignore it for now :) and look at the seccomp-filter/mode=2
case.  The same mechanism could be used to inject more expressive
filters.  This would be done over the sys_perf_event_open() interface
assuming the new attr is added to stop perf event list enumeration.
Assuming a whitelist model, a call to prctl(PR_SET_SECCOMP, 2) would
be needed (maybe with the ON_EXEC flag option too to mirror the
event->attr on-exec bit). That would yield the ability to register
perf events for system calls then use ioctl() to SET_FILTER on them.
Reducing the privileges of the filters after installation could be
done with another attribute bit like 'set_filter_ands'.  If that is
also set on the event, and a filter is installed to ensure that
sys_perf_event_open is blocked, then it should be reasonably sane.
I'd need to add a PERF_EVENT_IOC_GET_FILTER handler to allow
extracting the settings.
Clearly, I haven't written the code for that yet, though making the
change for a single platform shouldn't be too much code.
So that leaves me with some questions:
- Is this the type of reuse that was being encouraged?
- Does it really make sense to cram this through the perf interface
and events?  While the changed attributes are innocuous and
potentially reusable, it seems that a large amount of the perf
facilities are exposed that could have weird side effects, but I'm
sure I still don't fully grok the perf infrastructure.
- Does extending one tracepoint to provide return values via a pointer
make sense? I'd hesitate to make all tracepoint hooks return an int by
default.
- If all tracepoints returned an int, what would the standard value
look like - or would it need to be per tracepoint impl?
- How is ambiguity resolved if multiple perf_events are registered for
a syscall with different filters?  Maybe a 'stop_on_match'? though
ordering is still a problem then.
- Is there a way to affect a task-wide change without a seccomp flag
(or new task_struct entry) via the existing sys_perf_event_open
syscall?  I considered suggesting a attr bit call 'secure_computing'
that when an event with the bit is enabled, it automagically forces
the task into seccomp enforcement mode, but that, again, seemed
hackish.
While I like the idea of sharing the tracepoints infrastructure and
the trace_syscalls hooks as well as using a pre-existing interface
with very minor changes, I'm worried that the complexity of the
interface and of the infrastructure might undermine the ability to
continue meeting the desired security goals.  I had originally stayed
very close to the seccomp world because of how trivial it is to review
the code and verify its accuracy/effectiveness.  This approach leaves
a lot of gaps for kernel code to seep through and a fair amount of
ambiguity in what locked down syscall filters might look like.
To me, the best part of the above is that it shows that even if we go
with a prctl SET_SECCOMP_FILTER-style interface, it is completely
certain that if a perf/ftrace-based security infrastructure is on our
future, it will be entirely compatible -- even if the prctl()
interface is just the "simpler" interface at that point somewhere down
the road.
Regardless, I'll hack up a proof of concept based on the outline
above. Perhaps something more elegant will emerge once I start
tweaking the source, but I'm still seeing too many gaps to be
comfortable so far.
[[There is a whole other approach to this too. We could continue with
the prctl() interface and mirror the trace_sys_enter model for
secure_computing().   Instead of keeping a seccomp_t-based hlist of
events, they could be stored in a new hlist for seccomp_events in
struct ftrace_event_call.  The ftrace filters would be installed there
and the seccomp_syscall_enter() function could do the checks and pass
up some state data on the task's seccomp_t that indicates it needs to
do_exit().  That would likely reduce the amount of code in
seccomp_filter.c pretty seriously (though not entirely
beneficially).]]
Thanks again for all the feedback and insights! I really hope we can
come to an agreeable approach for implementing kernel attack surface
reduction.
will
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-19 21:05                                               ` Will Drewry
@ 2011-05-24 15:59                                                 ` Will Drewry
  2011-05-24 16:20                                                   ` Peter Zijlstra
  2011-05-24 20:08                                                   ` Ingo Molnar
  0 siblings, 2 replies; 77+ messages in thread
From: Will Drewry @ 2011-05-24 15:59 UTC (permalink / raw)
  To: linux-arm-kernel
On Thu, May 19, 2011 at 4:05 PM, Will Drewry <wad@chromium.org> wrote:
> On Thu, May 19, 2011 at 7:22 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> On Wed, 2011-05-18 at 21:07 -0700, Will Drewry wrote:
>>
>>> Do event_* that return non-void exist in the tree at all now? ?I've
>>> looked at the various tracepoint macros as well as some of the other
>>> handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
>>> places where a return value is honored nor could be. ?At best, the
>>> perf_tp_event can be short-circuited it in the hlist_for_each, but
>>> it'd still need a way to bubble up a failure and result in not calling
>>> the trace/event that the hook precedes.
>>
>> No, none of the current trace hooks have return values. That was what I
>> was talking about how to implement in my previous emails.
>
> Led on by complete ignorance, I think I'm finally beginning to untwist
> the different pieces of the tracing infrastructure. ?Unfortunately,
> that means I took a few wrong turns along the way...
>
> I think function tracing looks something like this:
>
> ftrace_call has been injected into at a specific callsite. ?Upon hit:
> 1. ftrace_call triggers
> 2. does some checks then calls ftrace_trace_function (via mcount instrs)
> 3. ftrace_trace_function may be a single func or a list. For a list it
> will be: ftrace_list_func
> 4. ftrace_list_func calls each registered hook for that function in a
> while loop ignoring return values
> 5. registered hook funcs may then track the call, farm it out to
> specific sub-handlers, etc.
>
> This seems to be a red herring for my use case :/ though this helped
> me understand your back and forth (Ingo & Steve) regarding dynamic
> versus explicit events.
>
>
> System call tracing is done via kernel/tracepoint.c events fed in via
> arch/[arch]/kernel/ptrace.c where it calls trace_sys_enter. ?This
> yields direct sys_enter and sys_exit event sources (and an event class
> to hook up per-system call events). ?This means that
> ftrace_syscall_enter() does the event prep prior to doing a filter
> check comparing the ftrace_event object for the given syscall_nr to
> the event data. ?perf_sysenter_enter() is similar but it pushes the
> info over to perf_tp_event to be matched not against the global
> syscall event entry, but against any sub-event in the linked list on
> that syscall's event.
>
> Is that roughly an accurate representation of the two? ?I wish I
> hadn't gotten distracted along the function path, but at least I
> learned something (and it is relevant to the larger scope of this
> thread's discussion).
>
>
> After doing that digging, it looks like providing hook call
> pre-emption and return value propagation will be a unique and fun task
> for each tracer and event subsystem. ?If I look solely at tracepoints,
> a generic change would be to make the trace_##subsys function return
> an int (which I think was the event_vfs_getname proposal?). ?The other
> option is to change the trace_sys_enter proto to include a 'int
> *retcode'.
>
> That change would allow the propagation of some sort of policy
> information. ?To put it to use, seccomp mode 1 could be implemented on
> top of trace_syscalls. ?The following changes would need to happen:
> 1. dummy metadata should be inserted for all unhooked system calls
> 2. perf_trace_buf_submit would need to return an int or a new
> TRACE_REG_SECCOMP_REGISTER handler would need to be setup in
> syscall_enter_register.
> 3. If perf is abused, a "kill/fail_on_discard" bit would be added to
> event->attrs.
> 4. perf_trace_buf_submit/perf_tp_event will return 0 for no matches,
> 'n' for the number of event matches, and -EACCES/? if a
> 'fail_on_discard' event is seen.
> 5. perf_syscall_enter would set *retcode = perf_trace_buf_submit()'s retcode
> 6. trace_sys_enter() would need to be moved to be the first entry
> arch/../kernel/ptrace.c for incoming syscalls
> 7. if trace_sys_enter() yields a negative return code, then
> do_exit(SIGKILL) the process and return.
>
> Entering into seccomp mode 1 would require adding a ?"0" filter for
> every system call number (which is why we need a dummy event call for
> them since failing to check the bitmask can't be flagged
> fail_on_discard) with the fail_on_discard bit. ?For the three calls
> that are allowed, a '1' filter would be set.
>
> That would roughly implement seccomp mode 1. ?It's pretty ugly and the
> fact that every system call that's disallowed has to be blacklisted is
> not ideal. ?An alternate model would be to just use the seccomp mode
> as we do today and let secure_computing() handle the return code of "#
> of matches". ?If it the # of matches is 0, it terminates. A
> 'fail_on_discard' bit then would only be good to stop further
> tracepoint callback evaluation. ?This approach would also *almost* nix
> the need to provide dummy syscall hooks. ?(Since sigreturn isn't
> hooked on x86 because it uses ptregs fixup, a dummy would still be
> needed to apply a "1" filter to.)
>
> Even with that tweak to move to a whitelist model, the perf event
> evaluation and tracepoint callback ordering is still not guaranteed.
> Without changing tracepoint itself, all other TPs will still execute.
> And for perf events, it'll be first-come-first-serve until a
> fail_on_discard is hit.
>
> After using seccomp mode=1 as the sample case to reduce scope, it's
> possible to ignore it for now :) and look at the seccomp-filter/mode=2
> case. ?The same mechanism could be used to inject more expressive
> filters. ?This would be done over the sys_perf_event_open() interface
> assuming the new attr is added to stop perf event list enumeration.
> Assuming a whitelist model, a call to prctl(PR_SET_SECCOMP, 2) would
> be needed (maybe with the ON_EXEC flag option too to mirror the
> event->attr on-exec bit). That would yield the ability to register
> perf events for system calls then use ioctl() to SET_FILTER on them.
>
> Reducing the privileges of the filters after installation could be
> done with another attribute bit like 'set_filter_ands'. ?If that is
> also set on the event, and a filter is installed to ensure that
> sys_perf_event_open is blocked, then it should be reasonably sane.
>
> I'd need to add a PERF_EVENT_IOC_GET_FILTER handler to allow
> extracting the settings.
>
> Clearly, I haven't written the code for that yet, though making the
> change for a single platform shouldn't be too much code.
>
> So that leaves me with some questions:
> - Is this the type of reuse that was being encouraged?
> - Does it really make sense to cram this through the perf interface
> and events? ?While the changed attributes are innocuous and
> potentially reusable, it seems that a large amount of the perf
> facilities are exposed that could have weird side effects, but I'm
> sure I still don't fully grok the perf infrastructure.
> - Does extending one tracepoint to provide return values via a pointer
> make sense? I'd hesitate to make all tracepoint hooks return an int by
> default.
> - If all tracepoints returned an int, what would the standard value
> look like - or would it need to be per tracepoint impl?
> - How is ambiguity resolved if multiple perf_events are registered for
> a syscall with different filters? ?Maybe a 'stop_on_match'? though
> ordering is still a problem then.
> - Is there a way to affect a task-wide change without a seccomp flag
> (or new task_struct entry) via the existing sys_perf_event_open
> syscall? ?I considered suggesting a attr bit call 'secure_computing'
> that when an event with the bit is enabled, it automagically forces
> the task into seccomp enforcement mode, but that, again, seemed
> hackish.
>
> While I like the idea of sharing the tracepoints infrastructure and
> the trace_syscalls hooks as well as using a pre-existing interface
> with very minor changes, I'm worried that the complexity of the
> interface and of the infrastructure might undermine the ability to
> continue meeting the desired security goals. ?I had originally stayed
> very close to the seccomp world because of how trivial it is to review
> the code and verify its accuracy/effectiveness. ?This approach leaves
> a lot of gaps for kernel code to seep through and a fair amount of
> ambiguity in what locked down syscall filters might look like.
>
> To me, the best part of the above is that it shows that even if we go
> with a prctl SET_SECCOMP_FILTER-style interface, it is completely
> certain that if a perf/ftrace-based security infrastructure is on our
> future, it will be entirely compatible -- even if the prctl()
> interface is just the "simpler" interface at that point somewhere down
> the road.
>
>
> Regardless, I'll hack up a proof of concept based on the outline
> above. Perhaps something more elegant will emerge once I start
> tweaking the source, but I'm still seeing too many gaps to be
> comfortable so far.
>
>
> [[There is a whole other approach to this too. We could continue with
> the prctl() interface and mirror the trace_sys_enter model for
> secure_computing(). ? Instead of keeping a seccomp_t-based hlist of
> events, they could be stored in a new hlist for seccomp_events in
> struct ftrace_event_call. ?The ftrace filters would be installed there
> and the seccomp_syscall_enter() function could do the checks and pass
> up some state data on the task's seccomp_t that indicates it needs to
> do_exit(). ?That would likely reduce the amount of code in
> seccomp_filter.c pretty seriously (though not entirely
> beneficially).]]
>
>
> Thanks again for all the feedback and insights! I really hope we can
> come to an agreeable approach for implementing kernel attack surface
> reduction.
Just a quick follow up.  I have a PoC of the perf-based system call
filtering code in place which uses seccomp.mode as a task_context
flag, adds require_secure and err_on_discard perf_event_attrs, and
enforces these new events terminate the process in perf_syscall_enter
via a return value on perf_buf_submit.  This lets a call like:
LD_LIBRARY_PATH=/host/home/wad/kernel.uml/tests/
/host/home/wad/kernel.uml/tests/perf record \
-S \
-e 'syscalls:sys_enter_access' \
-e 'syscalls:sys_enter_brk' \
-e 'syscalls:sys_enter_close' \
-e 'syscalls:sys_enter_exit_group' \
-e 'syscalls:sys_enter_fcntl64' \
-e 'syscalls:sys_enter_fstat64' \
-e 'syscalls:sys_enter_getdents64' \
-e 'syscalls:sys_enter_getpid' \
-e 'syscalls:sys_enter_getuid' \
-e 'syscalls:sys_enter_ioctl' \
-e 'syscalls:sys_enter_lstat64' \
-e 'syscalls:sys_enter_mmap_pgoff' \
-e 'syscalls:sys_enter_mprotect' \
-e 'syscalls:sys_enter_munmap' \
-e 'syscalls:sys_enter_open' \
-e 'syscalls:sys_enter_read' \
-e 'syscalls:sys_enter_stat64' \
-e 'syscalls:sys_enter_time' \
-e 'syscalls:sys_enter_newuname' \
-e 'syscalls:sys_enter_write' --filter "fd == 1" \
/bin/ls
run successfully while omitting a syscall or passing in --filter "fd
== 0" properly terminates it. (ignoring the need for execve and
set_thread_area for PoC purposes).
The change avoids defining a new trace call type or a huge number of
internal changes and hides seccomp.mode=2 from ABI-exposure in prctl,
but the attack surface is non-trivial to verify, and I'm not sure if
this ABI change makes sense. It amounts to:
 include/linux/ftrace_event.h  |    4 +-
 include/linux/perf_event.h    |   10 +++++---
 kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
 kernel/seccomp.c              |    8 ++++++
 kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
 5 files changed, 82 insertions(+), 16 deletions(-)
And can be found here: http://static.dataspill.org/perf_secure/v1/
If there is any interest at all, I can post it properly to this giant
CC list.  However,  it's unclear to me where this thread stands.  I
also see a completely independent path for implementing system call
filtering now, but I'll hold off posting that patch until I see where
this goes.
Any guidance will be appreciated - thanks again!
will
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-24 15:59                                                 ` Will Drewry
@ 2011-05-24 16:20                                                   ` Peter Zijlstra
  2011-05-24 16:25                                                     ` Thomas Gleixner
  2011-05-24 19:54                                                     ` Ingo Molnar
  2011-05-24 20:08                                                   ` Ingo Molnar
  1 sibling, 2 replies; 77+ messages in thread
From: Peter Zijlstra @ 2011-05-24 16:20 UTC (permalink / raw)
  To: linux-arm-kernel
On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
>  include/linux/ftrace_event.h  |    4 +-
>  include/linux/perf_event.h    |   10 +++++---
>  kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
>  kernel/seccomp.c              |    8 ++++++
>  kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
>  5 files changed, 82 insertions(+), 16 deletions(-) 
I strongly oppose to the perf core being mixed with any sekurity voodoo
(or any other active role for that matter).
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-24 16:20                                                   ` Peter Zijlstra
@ 2011-05-24 16:25                                                     ` Thomas Gleixner
  2011-05-24 19:00                                                       ` Will Drewry
  2011-05-24 19:54                                                     ` Ingo Molnar
  1 sibling, 1 reply; 77+ messages in thread
From: Thomas Gleixner @ 2011-05-24 16:25 UTC (permalink / raw)
  To: linux-arm-kernel
On Tue, 24 May 2011, Peter Zijlstra wrote:
> On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
> >  include/linux/ftrace_event.h  |    4 +-
> >  include/linux/perf_event.h    |   10 +++++---
> >  kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
> >  kernel/seccomp.c              |    8 ++++++
> >  kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
> >  5 files changed, 82 insertions(+), 16 deletions(-) 
> 
> I strongly oppose to the perf core being mixed with any sekurity voodoo
> (or any other active role for that matter).
Amen. We have enough crap to cleanup in perf/ftrace already, so we
really do not need security magic added to it.
Thanks,
	tglx
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-24 16:25                                                     ` Thomas Gleixner
@ 2011-05-24 19:00                                                       ` Will Drewry
  0 siblings, 0 replies; 77+ messages in thread
From: Will Drewry @ 2011-05-24 19:00 UTC (permalink / raw)
  To: linux-arm-kernel
On Tue, May 24, 2011 at 11:25 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 24 May 2011, Peter Zijlstra wrote:
>
>> On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
>> > ?include/linux/ftrace_event.h ?| ? ?4 +-
>> > ?include/linux/perf_event.h ? ?| ? 10 +++++---
>> > ?kernel/perf_event.c ? ? ? ? ? | ? 49 +++++++++++++++++++++++++++++++++++++---
>> > ?kernel/seccomp.c ? ? ? ? ? ? ?| ? ?8 ++++++
>> > ?kernel/trace/trace_syscalls.c | ? 27 +++++++++++++++++-----
>> > ?5 files changed, 82 insertions(+), 16 deletions(-)
>>
>> I strongly oppose to the perf core being mixed with any sekurity voodoo
>> (or any other active role for that matter).
>
> Amen. We have enough crap to cleanup in perf/ftrace already, so we
> really do not need security magic added to it.
Thanks for the quick responses!
I agree, but I'm left a little bit lost now w.r.t. the comments around
reusing the ABI.  If perf doesn't make sense (which certainly seems
wrong from a security interface perspective), then the preexisting
ABIs I know of for this case are as follows:
- /sys/kernel/debug/tracing/*
- prctl(PR_SET_SECCOMP* (or /proc/...)
Both would require expansion.  The latter was reused by the original
patch series.  The former doesn't expose much in the way of per-task
event filtering -- ftrace_pids doesn't translate well to
ftrace_syscall_enter-based enforcement.  I'd expect we'd need
ftrace_event_call->task_events (like ->perf_events), and either
explore them in ftrace_syscall_enter or add a new tracepoint handler,
ftrace_task_syscall_enter, via something like TRACE_REG_TASK_REGISTER.
 It could then do whatever it wanted with the successful or
unsuccessful matching against predicates, stacking or not, which could
be used for a seccomp-like mechanism.  However, bubbling that change
up to the non-existent interfaces in debug/tracing could be a
challenge too (Registration would require an alternate flow like perf
to call TRACE_REG_*? Do they become
tracing/events/subsystem/event/task/<tid>/<filter_string_N>? ...?).
This is all just a matter of programming... but at this point, I'm not
seeing the clear shared path forward.  Even with per-task ftrace
access in debug/tracing, that would introduce a reasonably large
change to the system and add a new ABI, albeit in debug/tracing.  If
the above (or whatever the right approach is) comes into existence,
then any prctl(PR_SET_SECCOMP) ABI could have the backend
implementation to modify the same data.  I'm not putting it like this
to say that I'm designing to be obsolete, but to show that the defined
interface wouldn't conflict if ftrace does overlap more in the future.
 Given the importance of a clearly defined interface for security
functionality, I'd be surprised to see all the pieces come together in
the near future in such a way that a transition would be immediately
possible -- I'm not even sure what the ftrace roadmap really is!
Would it be more desirable to put a system call filtering interface on
a miscdev (like /dev/syscall_filter) instead of in /proc or prctl (and
not reuse seccomp at all)?  I'm not clear what the onus is to justify
a change in the different ABI areas, but I see system call filtering
as an important piece of system security and would like to determine
if there is a viable path forward, or if this will need to be
revisited in another 2 years.
thanks again!
will
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-24 16:20                                                   ` Peter Zijlstra
  2011-05-24 16:25                                                     ` Thomas Gleixner
@ 2011-05-24 19:54                                                     ` Ingo Molnar
  2011-05-24 20:10                                                       ` Ingo Molnar
  2011-05-25 10:35                                                       ` Thomas Gleixner
  1 sibling, 2 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-24 19:54 UTC (permalink / raw)
  To: linux-arm-kernel
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
> >  include/linux/ftrace_event.h  |    4 +-
> >  include/linux/perf_event.h    |   10 +++++---
> >  kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
> >  kernel/seccomp.c              |    8 ++++++
> >  kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
> >  5 files changed, 82 insertions(+), 16 deletions(-) 
> 
> I strongly oppose to the perf core being mixed with any sekurity voodoo
> (or any other active role for that matter).
I'd object to invisible side-effects as well, and vehemently so. But note how 
intelligently it's used here: it's explicit in the code, it's used explicitly 
in kernel/seccomp.c and the event generation place in 
kernel/trace/trace_syscalls.c.
So this is a really flexible solution IMO and does not extend events with some 
invisible 'active' role. It extends the *call site* with an open-coded active 
role - which active role btw. already pre-existed.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-24 15:59                                                 ` Will Drewry
  2011-05-24 16:20                                                   ` Peter Zijlstra
@ 2011-05-24 20:08                                                   ` Ingo Molnar
  2011-05-24 20:14                                                     ` Steven Rostedt
  1 sibling, 1 reply; 77+ messages in thread
From: Ingo Molnar @ 2011-05-24 20:08 UTC (permalink / raw)
  To: linux-arm-kernel
* Will Drewry <wad@chromium.org> wrote:
> The change avoids defining a new trace call type or a huge number of internal 
> changes and hides seccomp.mode=2 from ABI-exposure in prctl, but the attack 
> surface is non-trivial to verify, and I'm not sure if this ABI change makes 
> sense. It amounts to:
> 
>  include/linux/ftrace_event.h  |    4 +-
>  include/linux/perf_event.h    |   10 +++++---
>  kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
>  kernel/seccomp.c              |    8 ++++++
>  kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
>  5 files changed, 82 insertions(+), 16 deletions(-)
> 
> And can be found here: http://static.dataspill.org/perf_secure/v1/
Wow, i'm very impressed how few changes you needed to do to support this!
So, firstly, i don't think we should change perf_tp_event() at all - the 
'observer' codepaths should be unaffected.
But there could be a perf_tp_event_ret() or perf_tp_event_check() entry that 
code like seccomp which wants to use event results can use.
Also, i'm not sure about the seccomp details and assumptions that were moved 
into the perf core. How about passing in a helper function to 
perf_tp_event_check(), where seccomp would define its seccomp specific helper 
function?
That looks sufficiently flexible. That helper function could be an 'extra 
filter' kind of thing, right?
Also, regarding the ABI and the attr.err_on_discard and attr.require_secure 
bits, they look a bit too specific as well.
attr.err_on_discard: with the filter helper function passed in this is probably 
not needed anymore, right?
attr.require_secure: this is basically used to *force* the creation of 
security-controlling filters, right? It seems to me that this could be done via 
a seccomp ABI extension as well, without adding this to the perf ABI. That 
seccomp call could check whether the right events are created and move the task 
to mode 2 only if that prereq is met - or something like that.
> If there is any interest at all, I can post it properly to this giant
> CC list. [...]
I'd suggest to trim the Cc: list aggressively - anyone interested in the 
discussion can pick it up on lkml - and i strongly suspect that most of the Cc: 
participants would want to be off the Cc: :-)
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-24 19:54                                                     ` Ingo Molnar
@ 2011-05-24 20:10                                                       ` Ingo Molnar
  2011-05-25 10:35                                                       ` Thomas Gleixner
  1 sibling, 0 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-24 20:10 UTC (permalink / raw)
  To: linux-arm-kernel
* Ingo Molnar <mingo@elte.hu> wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
> > >  include/linux/ftrace_event.h  |    4 +-
> > >  include/linux/perf_event.h    |   10 +++++---
> > >  kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
> > >  kernel/seccomp.c              |    8 ++++++
> > >  kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
> > >  5 files changed, 82 insertions(+), 16 deletions(-) 
> > 
> > I strongly oppose to the perf core being mixed with any sekurity voodoo
> > (or any other active role for that matter).
> 
> I'd object to invisible side-effects as well, and vehemently so. But note how 
> intelligently it's used here: it's explicit in the code, it's used explicitly 
> in kernel/seccomp.c and the event generation place in 
> kernel/trace/trace_syscalls.c.
> 
> So this is a really flexible solution IMO and does not extend events with 
> some invisible 'active' role. It extends the *call site* with an open-coded 
> active role - which active role btw. already pre-existed.
Also see my other mail - i think this seccomp code is too tied in to the perf 
core and ABI - but this is fixable IMO.
The fundamental notion that a generator subsystem of events can use filter 
results as well (such as kernel/trace/trace_syscalls.c.) for its own purposes 
is pretty robust though.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-24 20:08                                                   ` Ingo Molnar
@ 2011-05-24 20:14                                                     ` Steven Rostedt
  0 siblings, 0 replies; 77+ messages in thread
From: Steven Rostedt @ 2011-05-24 20:14 UTC (permalink / raw)
  To: linux-arm-kernel
On Tue, 2011-05-24 at 22:08 +0200, Ingo Molnar wrote:
> * Will Drewry <wad@chromium.org> wrote:
> But there could be a perf_tp_event_ret() or perf_tp_event_check() entry that 
> code like seccomp which wants to use event results can use.
We should name it something else. The "perf_tp.." is a misnomer as it
has nothing to do with performance monitoring. "dynamic_event_.." maybe,
as it is dynamic to the affect that we can use jump labels to enable or
disable it.
-- Steve
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-24 19:54                                                     ` Ingo Molnar
  2011-05-24 20:10                                                       ` Ingo Molnar
@ 2011-05-25 10:35                                                       ` Thomas Gleixner
  2011-05-25 15:01                                                         ` Ingo Molnar
  1 sibling, 1 reply; 77+ messages in thread
From: Thomas Gleixner @ 2011-05-25 10:35 UTC (permalink / raw)
  To: linux-arm-kernel
On Tue, 24 May 2011, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
> > >  include/linux/ftrace_event.h  |    4 +-
> > >  include/linux/perf_event.h    |   10 +++++---
> > >  kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
> > >  kernel/seccomp.c              |    8 ++++++
> > >  kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
> > >  5 files changed, 82 insertions(+), 16 deletions(-) 
> > 
> > I strongly oppose to the perf core being mixed with any sekurity voodoo
> > (or any other active role for that matter).
> 
> I'd object to invisible side-effects as well, and vehemently so. But note how 
> intelligently it's used here: it's explicit in the code, it's used explicitly 
> in kernel/seccomp.c and the event generation place in 
> kernel/trace/trace_syscalls.c.
> 
> So this is a really flexible solution IMO and does not extend events with some 
> invisible 'active' role. It extends the *call site* with an open-coded active 
> role - which active role btw. already pre-existed.
We do _NOT_ make any decision based on the trace point so what's the
"pre-existing" active role in the syscall entry code?
I'm all for code reuse and reuse of interfaces, but this is completely
wrong. Instrumentation and security decisions are two fundamentally
different things and we want them kept separate. Instrumentation is
not meant to make decisions. Just because we can does not mean that it
is a good idea.
So what the current approach does is:
 - abuse the existing ftrace syscall hook by adding a return value to
   the tracepoint.
   So we need to propagate that for every tracepoint just because we
   have a single user.
 - abuse the perf per task mechanism
   Just because we have per task context in perf does not mean that we
   pull everything and the world which requires per task context into
   perf. The security folks have per task context already so security
   related stuff wants to go there.
 - abuse the perf/ftrace interfaces
   One of the arguments was that perf and ftrace have permission which
   are not available from the existing security interfaces. That's not
   at all a good reason to abuse these interfaces. Let the security
   folks sort out the problem on their end and do not impose any
   expectations on perf/ftrace which we have to carry around forever.
Yes, it can be made working with a relatively small patch, but it has
a very nasty side effect: 
  You add another user space visible ABI to the existing perf/ftrace
  mess which needs to be supported forever.
Brilliant, we have already two ABIs (perf/ftrace) to support and at
the same time we urgently need to solve the problem of better
integration of those two. So adding a third completely unrelated
component with a guaranteed ABI is just making this even more complex.
We can factor out the filtering code and let the security dudes reuse
it for their own purposes. That makes them to have their own
interfaces and does not impose any restrictions upon the tracing/perf
ones. And really security stuff wants to be integrated into the
existing security frameworks and not duct taped into perf/trace just
because it's a conveniant hack around limitiations of the existing
security stuff.
You really should stop to see everything as a nail just because the
only tool you have handy is the perf hammer. perf is about
instrumentation and we don't want to violate the oldest principle of
unix to have simple tools which do one thing and do it good.
Even swiss army knifes have the restriction that you can use only one
tool at a time unless you want to stick the corkscrew through your
palm when you try to cut bread.
Thanks,
	tglx
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 10:35                                                       ` Thomas Gleixner
@ 2011-05-25 15:01                                                         ` Ingo Molnar
  2011-05-25 17:43                                                           ` Peter Zijlstra
  2011-05-25 17:48                                                           ` Thomas Gleixner
  0 siblings, 2 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-25 15:01 UTC (permalink / raw)
  To: linux-arm-kernel
* Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 24 May 2011, Ingo Molnar wrote:
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
> > > >  include/linux/ftrace_event.h  |    4 +-
> > > >  include/linux/perf_event.h    |   10 +++++---
> > > >  kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
> > > >  kernel/seccomp.c              |    8 ++++++
> > > >  kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
> > > >  5 files changed, 82 insertions(+), 16 deletions(-) 
> > > 
> > > I strongly oppose to the perf core being mixed with any sekurity voodoo
> > > (or any other active role for that matter).
> > 
> > I'd object to invisible side-effects as well, and vehemently so. But note how 
> > intelligently it's used here: it's explicit in the code, it's used explicitly 
> > in kernel/seccomp.c and the event generation place in 
> > kernel/trace/trace_syscalls.c.
> > 
> > So this is a really flexible solution IMO and does not extend events with some 
> > invisible 'active' role. It extends the *call site* with an open-coded active 
> > role - which active role btw. already pre-existed.
> 
> We do _NOT_ make any decision based on the trace point so what's the
> "pre-existing" active role in the syscall entry code?
The seccomp code we are discussing in this thread.
> I'm all for code reuse and reuse of interfaces, but this is completely
> wrong. Instrumentation and security decisions are two fundamentally
> different things and we want them kept separate. Instrumentation is
> not meant to make decisions. Just because we can does not mean that it
> is a good idea.
Instrumentation does not 'make decisions': the calling site, which is 
already emitting both the event and wants to do decisions based on 
the data that also generates the event wants to do decisions.
Those decisions *will be made* and you cannot prevent that, the only 
open question is can it reuse code intelligently, which code it is 
btw. already calling for observation reasons?
( Note that pure observers wont be affected and note that pure 
  observation call sites are not affected either. )
> So what the current approach does is:
> 
>  - abuse the existing ftrace syscall hook by adding a return value to
>    the tracepoint.
> 
>    So we need to propagate that for every tracepoint just because we
>    have a single user.
This is a technical criticism i share with you and i think it can be 
fixed - i outlined it to Will yesterday.
And no, if done cleanly it's not 'abuse' to reuse code. Could we wait 
for the first clean iteration of this patch instead of rushing 
judgement prematurely?
>  - abuse the perf per task mechanism
> 
>    Just because we have per task context in perf does not mean that we
>    pull everything and the world which requires per task context into
>    perf. The security folks have per task context already so security
>    related stuff wants to go there.
We do not pull 'everything and the world' in, but code that wants to 
process events in places that already emit events surely sounds 
related to me :-)
>  - abuse the perf/ftrace interfaces
> 
>    One of the arguments was that perf and ftrace have permission which
>    are not available from the existing security interfaces. That's not
>    at all a good reason to abuse these interfaces. Let the security
>    folks sort out the problem on their end and do not impose any
>    expectations on perf/ftrace which we have to carry around forever.
> 
> Yes, it can be made working with a relatively small patch, but it has
> a very nasty side effect: 
> 
>   You add another user space visible ABI to the existing perf/ftrace
>   mess which needs to be supported forever.
What mess? I'm not aware of a mess - other than the ftrace API which 
is not used by this patch.
> Brilliant, we have already two ABIs (perf/ftrace) to support and at
> the same time we urgently need to solve the problem of better
> integration of those two. So adding a third completely unrelated
> component with a guaranteed ABI is just making this even more complex.
So your solution is to add yet another ABI for seccomp and to keep 
seccomp a limited hack forever, just because you are not interested 
in security?
I think we want fewer ABIs and more flexible/reusable facilities.
> We can factor out the filtering code and let the security dudes 
> reuse it for their own purposes. That makes them to have their own 
> interfaces and does not impose any restrictions upon the 
> tracing/perf ones. And really security stuff wants to be integrated 
> into the existing security frameworks and not duct taped into 
> perf/trace just because it's a conveniant hack around limitiations 
> of the existing security stuff.
You are missing what i tried to point out in earlier discussions: 
from a security design POV this isnt just about the system call 
boundary. If this seccomp variant is based on events then it could 
grow proper security checks in other places as well, in places where 
we have some sort of object observation event anyway.
So this is opens up possibilities to reuse and unify code on a very 
broad basis.
> You really should stop to see everything as a nail just because the 
> only tool you have handy is the perf hammer. perf is about 
> instrumentation and we don't want to violate the oldest principle 
> of unix to have simple tools which do one thing and do it good.
That is one of the most misunderstood principles of Unix.
The original Unix tool landscape was highly *integrated* and 
*unified*, into a very tight codebase that was maintained together. 
Yes, there were different, atomic, simple commands but it was all 
done together and it was a coherent whole and pleasant to use!
People misunderstood this as a license to fragment the heck out 
functionality and build 'simple and stupid' tools that fit nowhere 
really and use different, incompatible principles not synced with 
each other. That is wrong and harmful.
So yes, over-integration is obviously wrong - but so is needless 
fragmentation.
Anyway, i still reserve judgement on this seccomp bit but the patches 
so far looked very promising with a very surprisingly small diffstat. 
If anything then that should tell you something that events and 
seccomp are not just casually related ...
> Even swiss army knifes have the restriction that you can use only 
> one tool at a time unless you want to stick the corkscrew through 
> your palm when you try to cut bread.
I'm not sure what you are arguing here - do you pine for the DOS days 
where you could only use one command at a time?
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 15:01                                                         ` Ingo Molnar
@ 2011-05-25 17:43                                                           ` Peter Zijlstra
  2011-05-29 20:17                                                             ` Ingo Molnar
  2011-05-25 17:48                                                           ` Thomas Gleixner
  1 sibling, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2011-05-25 17:43 UTC (permalink / raw)
  To: linux-arm-kernel
On Wed, 2011-05-25 at 17:01 +0200, Ingo Molnar wrote:
> > We do _NOT_ make any decision based on the trace point so what's the
> > "pre-existing" active role in the syscall entry code?
> 
> The seccomp code we are discussing in this thread. 
That isn't pre-existing, that's proposed.
But face it, you can argue until you're blue in the face, but both tglx
and I will NAK any and all patches that extend perf/ftrace beyond the
passive observing role.
Your arguments appear to be as non-persuasive to us as ours are to you,
so please drop this endeavor and let the security folks sort it on their
own and let's get back to doing useful work. 
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 15:01                                                         ` Ingo Molnar
  2011-05-25 17:43                                                           ` Peter Zijlstra
@ 2011-05-25 17:48                                                           ` Thomas Gleixner
  2011-05-26  8:43                                                             ` Ingo Molnar
  2011-05-26  9:15                                                             ` Ingo Molnar
  1 sibling, 2 replies; 77+ messages in thread
From: Thomas Gleixner @ 2011-05-25 17:48 UTC (permalink / raw)
  To: linux-arm-kernel
On Wed, 25 May 2011, Ingo Molnar wrote:
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Tue, 24 May 2011, Ingo Molnar wrote:
> > > * Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
> > > > >  include/linux/ftrace_event.h  |    4 +-
> > > > >  include/linux/perf_event.h    |   10 +++++---
> > > > >  kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
> > > > >  kernel/seccomp.c              |    8 ++++++
> > > > >  kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
> > > > >  5 files changed, 82 insertions(+), 16 deletions(-) 
> > > > 
> > > > I strongly oppose to the perf core being mixed with any sekurity voodoo
> > > > (or any other active role for that matter).
> > > 
> > > I'd object to invisible side-effects as well, and vehemently so. But note how 
> > > intelligently it's used here: it's explicit in the code, it's used explicitly 
> > > in kernel/seccomp.c and the event generation place in 
> > > kernel/trace/trace_syscalls.c.
> > > 
> > > So this is a really flexible solution IMO and does not extend events with some 
> > > invisible 'active' role. It extends the *call site* with an open-coded active 
> > > role - which active role btw. already pre-existed.
> > 
> > We do _NOT_ make any decision based on the trace point so what's the
> > "pre-existing" active role in the syscall entry code?
> 
> The seccomp code we are discussing in this thread.
That's proposed code and has absolutely nothing to do with the
existing trace point semantics.
 
> > I'm all for code reuse and reuse of interfaces, but this is completely
> > wrong. Instrumentation and security decisions are two fundamentally
> > different things and we want them kept separate. Instrumentation is
> > not meant to make decisions. Just because we can does not mean that it
> > is a good idea.
> 
> Instrumentation does not 'make decisions': the calling site, which is 
> already emitting both the event and wants to do decisions based on 
> the data that also generates the event wants to do decisions.
You can repeat that as often as you want, it does not make it more
true. Fact is that the decision is made in the middle of the perf code.
> +     /* Transition the task if required. */
> +     if (ctx->type == task_context && event->attr.require_secure) {
> +#ifdef CONFIG_SECCOMP
> +		/* Don't allow perf events to escape mode = 1. */
> +		   if (!current->seccomp.mode)
> +			current->seccomp.mode = 2;
> +#endif
> +	}
 
and further down
> +   	    if (event->attr.err_on_discard)
> +		ok = -EACCES;
 
> Those decisions *will be made* and you cannot prevent that, the only 
> open question is can it reuse code intelligently, which code it is 
> btw. already calling for observation reasons?
The tracepoint is called for observation reasons and now you make it a
decision function. That's what I call abuse.
> ( Note that pure observers wont be affected and note that pure 
>   observation call sites are not affected either. )
Hahaha, they still have to run through the additional code when
seccomp is enabled and we still have to propagate the return value
down to the point where the tracepoint itself is. You call that not
affected?
 
> > So what the current approach does is:
> > 
> >  - abuse the existing ftrace syscall hook by adding a return value to
> >    the tracepoint.
> > 
> >    So we need to propagate that for every tracepoint just because we
> >    have a single user.
> 
> This is a technical criticism i share with you and i think it can be 
> fixed - i outlined it to Will yesterday.
> 
> And no, if done cleanly it's not 'abuse' to reuse code. Could we wait 
> for the first clean iteration of this patch instead of rushing 
> judgement prematurely?
There is no way to do it cleanly. It always comes for the price that
you add additional code into the tracing code path. And there are
other people who try hard to remove stuff to recude the overhead which
is caused by instrumentation.
> >  - abuse the perf per task mechanism
> > 
> >    Just because we have per task context in perf does not mean that we
> >    pull everything and the world which requires per task context into
> >    perf. The security folks have per task context already so security
> >    related stuff wants to go there.
> 
> We do not pull 'everything and the world' in, but code that wants to 
> process events in places that already emit events surely sounds 
> related to me :-)
We have enough places where different independent parts of the kernel
want to hook into for obvious reasons.
We have notifiers for those where performance does not matter much and
we have separate calls into the independent functions where it matters
or where we need to evaluate the results in specific ways.
So now you turn instrumentation into a security mechanism, which
"works" nicely for a particular purpose, i.e. decision on a particular
syscall number. Now, how do you make that work when a decision has to
be made on more than a simple match, e.g. syscall number + arguments ?
Not at all, unless you add more complexity and arbitrary callbacks
into the instrumentation code.
> > Brilliant, we have already two ABIs (perf/ftrace) to support and at
> > the same time we urgently need to solve the problem of better
> > integration of those two. So adding a third completely unrelated
> > component with a guaranteed ABI is just making this even more complex.
> 
> So your solution is to add yet another ABI for seccomp and to keep 
> seccomp a limited hack forever, just because you are not interested 
> in security?
Well, I'm interested in security, but I'm neither interested in
security decisions inside instrumentation code nor in security models
which are limited hacks by definition unless you want to add callback
complexities to the instrumentation code.
It all looks nice and charming with this minimalistic use case, but
add real features to it and it gets messy in a split second and you
can't hold that off once you started to allow A. And you clearly
stated that you want to have more trace point based security features
than the simple syscall number filtering.
> I think we want fewer ABIs and more flexible/reusable facilities.
I'm all for that, but security and instrumentation are different
beasts.
 
> > We can factor out the filtering code and let the security dudes 
> > reuse it for their own purposes. That makes them to have their own 
> > interfaces and does not impose any restrictions upon the 
> > tracing/perf ones. And really security stuff wants to be integrated 
> > into the existing security frameworks and not duct taped into 
> > perf/trace just because it's a conveniant hack around limitiations 
> > of the existing security stuff.
> 
> You are missing what i tried to point out in earlier discussions: 
> from a security design POV this isnt just about the system call 
> boundary. If this seccomp variant is based on events then it could 
> grow proper security checks in other places as well, in places where 
> we have some sort of object observation event anyway.
Right, that requires callbacks and more stuff to do object based
observation and ties a trace point into a place where it might not
make sense after a while, but the security decision wants to stay at
that place. The syscall example is so tempting because it's simplistic
and easy to solve, but every extension to that model is going to
create a nightmare.
You are duct taping stuff together which has totally different
semantics and requirements. And your only argument is reuse of
existing code. That's a good argument in principle, but there is a
fundamental difference between intelligent reuse and enforcing it just
for reuse sake.
> So this is opens up possibilities to reuse and unify code on a very 
> broad basis.
By making a total mess out of it? 
> So yes, over-integration is obviously wrong - but so is needless 
> fragmentation.
Right. And this falls into the category of over-integration. We have
people working on security and they are working on stacked security
models, so where is the justification to start another security
framework inside the instrumentation code which is completely non
interoperable with the existing ones?
> If anything then that should tell you something that events and 
> seccomp are not just casually related ...
They happen to have the hook at the same point in the source and for
pure coincidence it works because the problem to solve is extremly
simplistic. And that's why the diffstat is minimalistic, but that does
not prove anything.
Thanks,
	tglx
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-16  0:36             ` James Morris
  2011-05-16 15:08               ` Ingo Molnar
@ 2011-05-26  6:27               ` Pavel Machek
  2011-05-26  8:35                 ` Ingo Molnar
  1 sibling, 1 reply; 77+ messages in thread
From: Pavel Machek @ 2011-05-26  6:27 UTC (permalink / raw)
  To: linux-arm-kernel
  On Mon 2011-05-16 10:36:05, James Morris wrote:
> On Fri, 13 May 2011, Ingo Molnar wrote:
> How do you reason about the behavior of the system as a whole?
> 
> 
> > I argue that this is the LSM and audit subsystems designed right: in the long 
> > run it could allow everything that LSM does at the moment - and so much more 
> > ...
> 
> Now you're proposing a redesign of the security subsystem.  That's a 
> significant undertaking.
> 
> In the meantime, we have a simple, well-defined enhancement to seccomp 
> which will be very useful to current users in reducing their kernel attack 
> surface.
Well, you can do the same with subterfugue, even without kernel
changes. But that's ptrace -- slow. (And it already shows that syscall
based filters are extremely tricky to configure).
If yu want speed, seccomp+server for non-permitted operations seems like reasonable way.
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-26  6:27               ` Pavel Machek
@ 2011-05-26  8:35                 ` Ingo Molnar
  0 siblings, 0 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-26  8:35 UTC (permalink / raw)
  To: linux-arm-kernel
* Pavel Machek <pavel@ucw.cz> wrote:
>   On Mon 2011-05-16 10:36:05, James Morris wrote:
> > On Fri, 13 May 2011, Ingo Molnar wrote:
> > How do you reason about the behavior of the system as a whole?
> > 
> > 
> > > I argue that this is the LSM and audit subsystems designed right: in the long 
> > > run it could allow everything that LSM does at the moment - and so much more 
> > > ...
> > 
> > Now you're proposing a redesign of the security subsystem.  That's a 
> > significant undertaking.
> > 
> > In the meantime, we have a simple, well-defined enhancement to seccomp 
> > which will be very useful to current users in reducing their kernel attack 
> > surface.
> 
> Well, you can do the same with subterfugue, even without kernel 
> changes. But that's ptrace -- slow. (And it already shows that 
> syscall based filters are extremely tricky to configure).
Yes, if you use syscall based filters to implement access to 
underlying objects where the access methods do not capture essential 
lifetime events properly (such as files) they you'll quickly run into 
trouble achieving a secure solution.
But you can robustly use syscall filters to control the underlying 
primary *resource*: various pieces of kernel code with *negative* 
utility to the current app - which have no use to the app but pose 
risks in terms of potential exploits in them.
But you can use event filters to implement arbitrary security 
policies robustly.
For example file objects: if you generate the right events for a 
class of objects then you can control access to them very robustly.
It's not a surprise that this is what SELinux does primarily: it has 
lifetime event hooks at the inode object (and socket, packet, etc.) 
level and captures those access attempts and validates them against 
the permissions of that object, in light of the accessing task's 
credentials.
Exactly that can be done with Will's patch as well, if its potential 
scope of event-checking points is not stupidly limited to the syscall 
boundary alone ...
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 17:48                                                           ` Thomas Gleixner
@ 2011-05-26  8:43                                                             ` Ingo Molnar
  2011-05-26  9:15                                                             ` Ingo Molnar
  1 sibling, 0 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-26  8:43 UTC (permalink / raw)
  To: linux-arm-kernel
* Thomas Gleixner <tglx@linutronix.de> wrote:
> > > We do _NOT_ make any decision based on the trace point so 
> > > what's the "pre-existing" active role in the syscall entry 
> > > code?
> > 
> > The seccomp code we are discussing in this thread.
> 
> That's proposed code and has absolutely nothing to do with the 
> existing trace point semantics.
So because it's proposed code it does not exist?
If the feature is accepted (and given Linus's opinion it's not clear 
at all it's accepted in any form) then it's obviously a very 
legitimate technical concern whether we do:
	ret = seccomp_check_syscall_event(p1, p2, p3, p4, p5);
	if (ret)
		return -EACCES;
	... random code ...
	trace_syscall_event(p1, p2, p3, p4, p5);
Where seccomp_check_syscall_event() duplicates much of the machinery 
that is behind trace_syscall_event().
Or we do the more intelligent:
	ret = check_syscall_event(p1, p2, p3, p4, p5);
	if (ret)
		return -EACCES;
Where we have the happy side effects of:
  - less code at the call site
  - (a lot of!) shared infrastructure between the proposed seccomp 
    code and event filters.
  - we'd also be able to trace at security check boundaries - which
    has obvious bug analysis advantages.
In fact i do not see *any* advantages in keeping this needlessly 
bloaty and needlessly inconsistently sampled form of instrumentation:
	ret = seccomp_check_syscall_event(p1, p2, p3, p4, p5);
	if (ret)
		return -EACCES;
	... random code ...
	trace_syscall_event(p1, p2, p3, p4, p5);
Do you?
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 17:48                                                           ` Thomas Gleixner
  2011-05-26  8:43                                                             ` Ingo Molnar
@ 2011-05-26  9:15                                                             ` Ingo Molnar
  1 sibling, 0 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-26  9:15 UTC (permalink / raw)
  To: linux-arm-kernel
* Thomas Gleixner <tglx@linutronix.de> wrote:
> > If anything then that should tell you something that events and 
> > seccomp are not just casually related ...
> 
> They happen to have the hook at the same point in the source and 
> for pure coincidence it works because the problem to solve is 
> extremly simplistic. And that's why the diffstat is minimalistic, 
> but that does not prove anything.
Here are the diffstats of the various versions of this proposed 
security feature:
       bitmask (2009):  6 files changed,  194 insertions(+), 22 deletions(-)
 filter engine (2010): 18 files changed, 1100 insertions(+), 21 deletions(-)
 event filters (2011):  5 files changed,   82 insertions(+), 16 deletions(-)
The third variant, 'event filters', is actually the most 
sophisticated one of all and it is not simplistic at all.
The main reason why the diffstat is small is because it reuses over 
ten thousand lines of pre-existing kernel code intelligently. Are you 
interpreting that as some sort of failure of the patch? I think it's 
a very good thing.
To demonstrate the non-simplicity of the feature:
 - These security rules/filters can be sophisticated like:
   sys_close() rule protecting against the closing of 
   stdin/stdout/stderr:
                  "fd == 0 || fd == 1 || fd == 2"
   sys_ioperm() rule allowing port 0x80 access but nothing else:
                  "from != 128 || num != 1"
   sys_listen() rule limiting the max accept() backlog to 16 entries:
                  "backlog > 16"
   sys_mprotect(), sys_mmap[2](), sys_unmap() and sys_mremap() rule
   protecting the first 1 MB NULL pointer guard range:
                  "addr < 0x00100000"
   sys_setscheduler() rule protecting against the switch to 
   non-SCHED_OTHER scheduler policies:
                  "policy != 0"
   Most of these examples are finegrained access restrictions that 
   AFAIK are not possible with any of the LSM based security measures 
   that Linux offers today.
 - These security rules/filters can be safely used and installed by 
   unprivileged userspace, allowing arbitrary end user apps to define 
   their own, flexible security policies.
 - These security rules/filters get automatically inherited into child 
   tasks and child tasks cannot mess with them - they cannot even 
   query/observe that these filters *exist*.
 - These security rules/filters nest on each other in basically 
   arbitrary depth, giving us a working, implemented, stackable LSM
   concept.
 - These security rules/filters can be extended to arbitrary more 
   object lifetime events in the future, without changing the ABI.
 - These security rules/filters, unlike most LSM rules, can execute
   not just within hardirqs but also within deeply atomic contexts
   such as NMI contexts, putting far less restrictions on what can
   be security/access checked.
 - Access permission violations can be set up to generate events of
   the violations into a scalable ring-buffer, providing unprivileged
   security-auditing functionality to the managing task(s).
I'd call that anything but 'simplistic'.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
  2011-05-25 17:43                                                           ` Peter Zijlstra
@ 2011-05-29 20:17                                                             ` Ingo Molnar
  0 siblings, 0 replies; 77+ messages in thread
From: Ingo Molnar @ 2011-05-29 20:17 UTC (permalink / raw)
  To: linux-arm-kernel
* Peter Zijlstra <peterz@infradead.org> wrote:
> But face it, you can argue until you're blue in the face,
That is not a technical argument though - and i considered and 
answered every valid technical argument made by you and Thomas.
You were either not able to or not willing to counter them.
> [...] but both tglx and I will NAK any and all patches that extend 
> perf/ftrace beyond the passive observing role.
The thing is, perf is *already* well beyond the 'passive observer' 
role: we already generate lots of 'action' in response to events. We 
generate notification signals, we write events - all of which can 
(and does) modify program behavior.
So what's your point? There's no "passive observer" role really - 
it's apparently just that you dislike this use of instrumentation 
while you approve of other uses.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 77+ messages in thread
end of thread, other threads:[~2011-05-29 20:17 UTC | newest]
Thread overview: 77+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1304017638.18763.205.camel@gandalf.stny.rr.com>
2011-05-12  3:02 ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Will Drewry
2011-05-12  7:48   ` Ingo Molnar
2011-05-12  9:24     ` Kees Cook
2011-05-12 10:49       ` Ingo Molnar
2011-05-12 11:44     ` James Morris
2011-05-12 13:01       ` Ingo Molnar
2011-05-12 16:26         ` Will Drewry
2011-05-16 12:55           ` Ingo Molnar
2011-05-16 14:42             ` Will Drewry
2011-05-13  0:18         ` James Morris
2011-05-13 12:10           ` Ingo Molnar
2011-05-13 12:19             ` Peter Zijlstra
2011-05-13 12:26               ` Ingo Molnar
2011-05-13 12:39                 ` Peter Zijlstra
2011-05-13 12:43                   ` Peter Zijlstra
2011-05-13 12:54                     ` Ingo Molnar
2011-05-13 13:08                       ` Peter Zijlstra
2011-05-13 13:18                         ` Ingo Molnar
2011-05-13 13:55                           ` Peter Zijlstra
2011-05-13 14:57                             ` Ingo Molnar
2011-05-13 15:27                               ` Peter Zijlstra
2011-05-14  7:05                                 ` Ingo Molnar
2011-05-16 16:23                               ` Steven Rostedt
2011-05-16 16:52                                 ` Ingo Molnar
2011-05-16 17:03                                   ` Steven Rostedt
2011-05-17 12:42                                     ` Ingo Molnar
2011-05-17 13:05                                       ` Steven Rostedt
2011-05-17 13:19                                         ` Ingo Molnar
2011-05-19  4:07                                           ` Will Drewry
2011-05-19 12:22                                             ` Steven Rostedt
2011-05-19 21:05                                               ` Will Drewry
2011-05-24 15:59                                                 ` Will Drewry
2011-05-24 16:20                                                   ` Peter Zijlstra
2011-05-24 16:25                                                     ` Thomas Gleixner
2011-05-24 19:00                                                       ` Will Drewry
2011-05-24 19:54                                                     ` Ingo Molnar
2011-05-24 20:10                                                       ` Ingo Molnar
2011-05-25 10:35                                                       ` Thomas Gleixner
2011-05-25 15:01                                                         ` Ingo Molnar
2011-05-25 17:43                                                           ` Peter Zijlstra
2011-05-29 20:17                                                             ` Ingo Molnar
2011-05-25 17:48                                                           ` Thomas Gleixner
2011-05-26  8:43                                                             ` Ingo Molnar
2011-05-26  9:15                                                             ` Ingo Molnar
2011-05-24 20:08                                                   ` Ingo Molnar
2011-05-24 20:14                                                     ` Steven Rostedt
2011-05-13 15:17                           ` Eric Paris
2011-05-13 15:29                             ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system callfiltering David Laight
2011-05-16 12:03                               ` Ingo Molnar
2011-05-13 12:49                   ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Ingo Molnar
2011-05-13 13:55                     ` Peter Zijlstra
2011-05-13 15:02                       ` Ingo Molnar
2011-05-13 15:10             ` Eric Paris
2011-05-13 15:23               ` Peter Zijlstra
2011-05-13 15:55                 ` Eric Paris
2011-05-13 16:29                   ` Will Drewry
2011-05-14  7:30               ` Ingo Molnar
2011-05-14 20:57                 ` Will Drewry
2011-05-16 12:43                   ` Ingo Molnar
2011-05-16 15:29                     ` Will Drewry
2011-05-17 12:57                       ` Ingo Molnar
2011-05-16  0:36             ` James Morris
2011-05-16 15:08               ` Ingo Molnar
2011-05-17  2:24                 ` James Morris
2011-05-17 13:10                   ` Ingo Molnar
2011-05-17 13:29                     ` James Morris
2011-05-17 18:34                       ` Ingo Molnar
2011-05-26  6:27               ` Pavel Machek
2011-05-26  8:35                 ` Ingo Molnar
2011-05-12 12:15     ` Frederic Weisbecker
2011-05-12 11:33   ` James Morris
2011-05-13 19:35   ` Arnd Bergmann
2011-05-14 20:58     ` Will Drewry
2011-05-15  6:42       ` Arnd Bergmann
2011-05-16 12:00         ` Ingo Molnar
2011-05-16 15:26   ` Steven Rostedt
2011-05-16 15:28     ` Will Drewry
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).