linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/3] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
@ 2014-04-17 18:05 Kees Cook
  2014-04-17 18:05 ` [PATCH v2 1/3] seccomp: introduce writer locking Kees Cook
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Kees Cook @ 2014-04-17 18:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: John Johansen, Kees Cook, Oleg Nesterov, Andy Lutomirski,
	Will Drewry, Julien Tinnes, linux-doc, linux-security-module

This adds the ability for threads to request seccomp filter
synchronization across their thread group. To support this,
seccomp locking on writes is introduced, along with refactoring
of no_new_privs. Races with thread creation are handled via the
tasklist_lock.

I think all the concerns raised during the discussion[1] of the first
version of this patch have been addressed. However, the races involved
have tricked me before. :)

Thanks!

-Kees

[1] https://lkml.org/lkml/2014/1/13/795


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/3] seccomp: introduce writer locking
  2014-04-17 18:05 [PATCH v2 0/3] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
@ 2014-04-17 18:05 ` Kees Cook
  2014-04-17 18:05 ` [PATCH v2 2/3] seccomp: move no_new_privs into seccomp Kees Cook
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Kees Cook @ 2014-04-17 18:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: John Johansen, Kees Cook, Oleg Nesterov, Andy Lutomirski,
	Will Drewry, Julien Tinnes, linux-doc, linux-security-module

Normally, task_struct.seccomp.filter is only ever read or modified by
the task that owns it (current). This property aids in fast access
during system call filtering as read access is lockless.

Updating the pointer from another task, however, opens up race
conditions. To allow cross-task filter pointer updates, writes to the
seccomp fields are now protected via a spinlock.  Read access remains
lockless because pointer updates themselves are atomic.  However, writes
(or cloning) often entail additional checking (like maximum instruction
counts) which require locking to perform safely.

In the case of cloning threads, the child is invisible to the system
until it enters the task list. To make sure a child can't be cloned and
left with stale seccomp state, seccomp duplication is moved under the
tasklist_lock. Parent and child are then certain to have the same
seccomp state when they exit the lock.

Based on patches by Will Drewry.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 include/linux/init_task.h |    8 ++++++++
 include/linux/sched.h     |   18 ++++++++++++++++++
 include/linux/seccomp.h   |    6 ++++--
 kernel/fork.c             |   33 ++++++++++++++++++++++++++++++++-
 kernel/seccomp.c          |   13 ++++++++-----
 5 files changed, 70 insertions(+), 8 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6df7f9fe0d01..e2370ec3102b 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -154,6 +154,13 @@ extern struct task_group root_task_group;
 # define INIT_VTIME(tsk)
 #endif
 
+#ifdef CONFIG_SECCOMP
+# define INIT_SECCOMP(tsk)						\
+	.seccomp.lock	= __SPIN_LOCK_UNLOCKED(tsk.seccomp.lock),
+#else
+# define INIT_SECCOMP(tsk)
+#endif
+
 #define INIT_TASK_COMM "swapper"
 
 #ifdef CONFIG_RT_MUTEXES
@@ -234,6 +241,7 @@ extern struct task_group root_task_group;
 	INIT_CPUSET_SEQ(tsk)						\
 	INIT_RT_MUTEXES(tsk)						\
 	INIT_VTIME(tsk)							\
+	INIT_SECCOMP(tsk)						\
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 25f54c79f757..50a21e527eb2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2480,6 +2480,24 @@ static inline void task_unlock(struct task_struct *p)
 	spin_unlock(&p->alloc_lock);
 }
 
+#ifdef CONFIG_SECCOMP
+/*
+ * Protects changes to ->seccomp
+ */
+static inline void seccomp_lock(struct task_struct *p)
+{
+	spin_lock(&p->seccomp.lock);
+}
+
+static inline void seccomp_unlock(struct task_struct *p)
+{
+	spin_unlock(&p->seccomp.lock);
+}
+#else
+static inline void seccomp_lock(struct task_struct *p) { }
+static inline void seccomp_unlock(struct task_struct *p) { }
+#endif
+
 extern struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
 							unsigned long *flags);
 
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 4054b0994071..c47be00e8ffb 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -14,14 +14,16 @@ struct seccomp_filter;
  *
  * @mode:  indicates one of the valid values above for controlled
  *         system calls available to a process.
- * @filter: The metadata and ruleset for determining what system calls
- *          are allowed for a task.
+ * @lock:  held when making changes to avoid thread races.
+ * @filter: must always point to a valid seccomp-filter or NULL as it is
+ *          accessed without locking during system call entry.
  *
  *          @filter must only be accessed from the context of current as there
  *          is no locking.
  */
 struct seccomp {
 	int mode;
+	spinlock_t lock;
 	struct seccomp_filter *filter;
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 54a8d26f612f..b33f886fe3a1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -315,6 +315,15 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 		goto free_ti;
 
 	tsk->stack = ti;
+#ifdef CONFIG_SECCOMP
+	/*
+	 * We must handle setting up seccomp filters once we're under
+	 * the tasklist_lock in case orig has changed between now and
+	 * then. Until then, filter must be NULL to avoid messing up
+	 * the usage counts on the error path calling free_task.
+	 */
+	tsk->seccomp.filter = NULL;
+#endif
 
 	setup_thread_stack(tsk, orig);
 	clear_user_return_notifier(tsk);
@@ -1081,6 +1090,23 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	return 0;
 }
 
+static void copy_seccomp(struct task_struct *p)
+{
+#ifdef CONFIG_SECCOMP
+	/* Child lock not needed since it is not yet in tasklist. */
+	seccomp_lock(current);
+
+	get_seccomp_filter(current);
+	p->seccomp = current->seccomp;
+	spin_lock_init(&p->seccomp.lock);
+
+	if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
+		set_tsk_thread_flag(p, TIF_SECCOMP);
+
+	seccomp_unlock(current);
+#endif
+}
+
 SYSCALL_DEFINE1(set_tid_address, int __user *, tidptr)
 {
 	current->clear_child_tid = tidptr;
@@ -1196,7 +1222,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto fork_out;
 
 	ftrace_graph_init_task(p);
-	get_seccomp_filter(p);
 
 	rt_mutex_init_task(p);
 
@@ -1425,6 +1450,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 */
 	write_lock_irq(&tasklist_lock);
 
+	/*
+	 * Copy seccomp details explicitly here, in case they were changed
+	 * before holding tasklist_lock.
+	 */
+	copy_seccomp(p);
+
 	/* CLONE_PARENT re-uses the old parent */
 	if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
 		p->real_parent = current->real_parent;
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 590c37925084..6d61a0b5080c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -174,12 +174,12 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
  */
 static u32 seccomp_run_filters(int syscall)
 {
-	struct seccomp_filter *f;
+	struct seccomp_filter *f = ACCESS_ONCE(current->seccomp.filter);
 	struct seccomp_data sd;
 	u32 ret = SECCOMP_RET_ALLOW;
 
 	/* Ensure unexpected behavior doesn't result in failing open. */
-	if (WARN_ON(current->seccomp.filter == NULL))
+	if (WARN_ON(f == NULL))
 		return SECCOMP_RET_KILL;
 
 	populate_seccomp_data(&sd);
@@ -188,7 +188,7 @@ static u32 seccomp_run_filters(int syscall)
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
 	 */
-	for (f = current->seccomp.filter; f; f = f->prev) {
+	for (; f; f = ACCESS_ONCE(f->prev)) {
 		u32 cur_ret = sk_run_filter_int_seccomp(&sd, f->insnsi);
 		if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
 			ret = cur_ret;
@@ -313,7 +313,7 @@ out:
 /* get_seccomp_filter - increments the reference count of the filter on @tsk */
 void get_seccomp_filter(struct task_struct *tsk)
 {
-	struct seccomp_filter *orig = tsk->seccomp.filter;
+	struct seccomp_filter *orig = ACCESS_ONCE(tsk->seccomp.filter);
 	if (!orig)
 		return;
 	/* Reference count is bounded by the number of total processes. */
@@ -327,7 +327,7 @@ void put_seccomp_filter(struct task_struct *tsk)
 	/* Clean up single-reference branches iteratively. */
 	while (orig && atomic_dec_and_test(&orig->usage)) {
 		struct seccomp_filter *freeme = orig;
-		orig = orig->prev;
+		orig = ACCESS_ONCE(orig->prev);
 		kfree(freeme);
 	}
 }
@@ -480,6 +480,8 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
 {
 	long ret = -EINVAL;
 
+	seccomp_lock(current);
+
 	if (current->seccomp.mode &&
 	    current->seccomp.mode != seccomp_mode)
 		goto out;
@@ -505,5 +507,6 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
 	current->seccomp.mode = seccomp_mode;
 	set_thread_flag(TIF_SECCOMP);
 out:
+	seccomp_unlock(current);
 	return ret;
 }
-- 
1.7.9.5



* [PATCH v2 2/3] seccomp: move no_new_privs into seccomp
  2014-04-17 18:05 [PATCH v2 0/3] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
  2014-04-17 18:05 ` [PATCH v2 1/3] seccomp: introduce writer locking Kees Cook
@ 2014-04-17 18:05 ` Kees Cook
  2014-04-17 18:05 ` [PATCH v2 3/3] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
  2014-04-17 18:13 ` [PATCH v2 0/3] " Andy Lutomirski
  3 siblings, 0 replies; 8+ messages in thread
From: Kees Cook @ 2014-04-17 18:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: John Johansen, Kees Cook, Oleg Nesterov, Andy Lutomirski,
	Will Drewry, Julien Tinnes, linux-doc, linux-security-module

Since seccomp transitions between threads require updates to the
no_new_privs flag to be atomic, this creates accessors for it. When
seccomp is built into the kernel, the flag is moved into the seccomp
struct, where it can be updated safely.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 fs/exec.c                  |    4 ++--
 include/linux/sched.h      |   22 ++++++++++++++++++++++
 include/linux/seccomp.h    |    4 ++++
 kernel/seccomp.c           |    2 +-
 kernel/sys.c               |    4 ++--
 security/apparmor/domain.c |    4 ++--
 6 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 476f3ebf437e..c72f9f6f66f3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1233,7 +1233,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
 	 * This isn't strictly necessary, but it makes it harder for LSMs to
 	 * mess up.
 	 */
-	if (current->no_new_privs)
+	if (task_no_new_privs(current))
 		bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS;
 
 	t = p;
@@ -1271,7 +1271,7 @@ int prepare_binprm(struct linux_binprm *bprm)
 	bprm->cred->egid = current_egid();
 
 	if (!(bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID) &&
-	    !current->no_new_privs &&
+	    !task_no_new_privs(current) &&
 	    kuid_has_mapping(bprm->cred->user_ns, inode->i_uid) &&
 	    kgid_has_mapping(bprm->cred->user_ns, inode->i_gid)) {
 		/* Set-uid? */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 50a21e527eb2..cd8e59bb62a5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1259,8 +1259,10 @@ struct task_struct {
 				 * execve */
 	unsigned in_iowait:1;
 
+#ifndef CONFIG_SECCOMP
 	/* task may not gain privileges */
 	unsigned no_new_privs:1;
+#endif
 
 	/* Revert to default priority/policy when forking */
 	unsigned sched_reset_on_fork:1;
@@ -2493,9 +2495,29 @@ static inline void seccomp_unlock(struct task_struct *p)
 {
 	spin_unlock(&p->seccomp.lock);
 }
+
+static inline bool task_no_new_privs(struct task_struct *p)
+{
+	return test_bit(SECCOMP_FLAG_NO_NEW_PRIVS, &p->seccomp.flags);
+}
+
+static inline int task_set_no_new_privs(struct task_struct *p)
+{
+	set_bit(SECCOMP_FLAG_NO_NEW_PRIVS, &p->seccomp.flags);
+	return 0;
+}
 #else
 static inline void seccomp_lock(struct task_struct *p) { }
 static inline void seccomp_unlock(struct task_struct *p) { }
+static inline bool task_no_new_privs(struct task_struct *p)
+{
+	return p->no_new_privs;
+}
+static inline int task_set_no_new_privs(struct task_struct *p)
+{
+	p->no_new_privs = 1;
+	return 0;
+}
 #endif
 
 extern struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index c47be00e8ffb..d05f1f1b8b10 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -17,6 +17,7 @@ struct seccomp_filter;
  * @lock:  held when making changes to avoid thread races.
  * @filter: must always point to a valid seccomp-filter or NULL as it is
  *          accessed without locking during system call entry.
+ * @flags: flags under write lock
  *
  *          @filter must only be accessed from the context of current as there
  *          is no locking.
@@ -25,8 +26,11 @@ struct seccomp {
 	int mode;
 	spinlock_t lock;
 	struct seccomp_filter *filter;
+	unsigned long flags;
 };
 
+#define SECCOMP_FLAG_NO_NEW_PRIVS	0
+
 extern int __secure_computing(int);
 static inline int secure_computing(int this_syscall)
 {
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 6d61a0b5080c..8761ce47a8bd 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -225,7 +225,7 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
 	 * This avoids scenarios where unprivileged tasks can affect the
 	 * behavior of privileged children.
 	 */
-	if (!current->no_new_privs &&
+	if (!task_no_new_privs(current) &&
 	    security_capable_noaudit(current_cred(), current_user_ns(),
 				     CAP_SYS_ADMIN) != 0)
 		return -EACCES;
diff --git a/kernel/sys.c b/kernel/sys.c
index fba0f29401ea..262919a8a7ac 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1990,12 +1990,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		if (arg2 != 1 || arg3 || arg4 || arg5)
 			return -EINVAL;
 
-		current->no_new_privs = 1;
+		error = task_set_no_new_privs(current);
 		break;
 	case PR_GET_NO_NEW_PRIVS:
 		if (arg2 || arg3 || arg4 || arg5)
 			return -EINVAL;
-		return current->no_new_privs ? 1 : 0;
+		return task_no_new_privs(current) ? 1 : 0;
 	case PR_GET_THP_DISABLE:
 		if (arg2 || arg3 || arg4 || arg5)
 			return -EINVAL;
diff --git a/security/apparmor/domain.c b/security/apparmor/domain.c
index 452567d3a08e..d97cba3e3849 100644
--- a/security/apparmor/domain.c
+++ b/security/apparmor/domain.c
@@ -621,7 +621,7 @@ int aa_change_hat(const char *hats[], int count, u64 token, bool permtest)
 	 * There is no exception for unconfined as change_hat is not
 	 * available.
 	 */
-	if (current->no_new_privs)
+	if (task_no_new_privs(current))
 		return -EPERM;
 
 	/* released below */
@@ -776,7 +776,7 @@ int aa_change_profile(const char *ns_name, const char *hname, bool onexec,
 	 * no_new_privs is set because this aways results in a reduction
 	 * of permissions.
 	 */
-	if (current->no_new_privs && !unconfined(profile)) {
+	if (task_no_new_privs(current) && !unconfined(profile)) {
 		put_cred(cred);
 		return -EPERM;
 	}
-- 
1.7.9.5



* [PATCH v2 3/3] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
  2014-04-17 18:05 [PATCH v2 0/3] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
  2014-04-17 18:05 ` [PATCH v2 1/3] seccomp: introduce writer locking Kees Cook
  2014-04-17 18:05 ` [PATCH v2 2/3] seccomp: move no_new_privs into seccomp Kees Cook
@ 2014-04-17 18:05 ` Kees Cook
  2014-04-17 18:13 ` [PATCH v2 0/3] " Andy Lutomirski
  3 siblings, 0 replies; 8+ messages in thread
From: Kees Cook @ 2014-04-17 18:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: John Johansen, Kees Cook, Oleg Nesterov, Andy Lutomirski,
	Will Drewry, Julien Tinnes, linux-doc, linux-security-module

Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process startup, it is
difficult to further restrict the kernel ABI to those threads after that
point.

This change adds a new seccomp "extension" for synchronizing thread
group seccomp filters and a prctl() for accessing that functionality.
The need for the added prctl() is due to the lack of reserved arguments
in PR_SET_SECCOMP (much existing code already calls prctl without
initializing trailing arguments).

When prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT_TSYNC, 0, 0) is called, it will
attempt to synchronize all threads in current's threadgroup to its seccomp
filter program. This is possible iff all threads are using a filter that
is an ancestor to the filter current is attempting to synchronize to. NULL
filters (where the task is running as SECCOMP_MODE_NONE) are also treated
as ancestors allowing threads to be transitioned into SECCOMP_MODE_FILTER.
On success, 0 is returned. On failure, the pid of one of the failing
threads will be returned.

The possible race conditions are against another thread calling TSYNC,
another thread performing a clone, and another thread changing its
filter. The seccomp write lock is sufficient for these cases, though the
clone case is assisted by the tasklist_lock so that a new thread must
have a duplicate of its parent's seccomp state when it appears on the
tasklist.

Based on patches by Will Drewry.

Suggested-by: Julien Tinnes <jln@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 Documentation/prctl/seccomp_filter.txt |   20 +++++
 include/linux/seccomp.h                |    7 ++
 include/uapi/linux/prctl.h             |    6 ++
 include/uapi/linux/seccomp.h           |    6 ++
 kernel/seccomp.c                       |  128 ++++++++++++++++++++++++++++++++
 kernel/sys.c                           |    3 +
 6 files changed, 170 insertions(+)

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
index 1e469ef75778..632f7d9fcfb2 100644
--- a/Documentation/prctl/seccomp_filter.txt
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -223,3 +223,23 @@ Note that modern systems are unlikely to use vsyscalls at all -- they
 are a legacy feature and they are considerably slower than standard
 syscalls.  New code will use the vDSO, and vDSO-issued system calls
 are indistinguishable from normal system calls.
+
+
+
+Extensions
+----------
+Additional seccomp extensions are available through prctl using
+PR_SECCOMP_EXT, with the extension as the following argument.
+
+prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT_TSYNC, 0, 0):
+	Thread synchronization.
+
+	The current thread requests to synchronize all threads in current's
+	threadgroup to its seccomp filter program. This is possible iff all
+	threads are using a filter that is an ancestor to the filter current
+	is attempting to synchronize to, or the thread has not yet entered
+	seccomp.
+
+	On success, 0 is returned. On failure, all synchronizable threads
+	will have been synchronized, and the pid of any of the failing
+	threads will be returned.
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index d05f1f1b8b10..a34a6bc76d3d 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -82,6 +82,8 @@ static inline int seccomp_mode(struct seccomp *s)
 #ifdef CONFIG_SECCOMP_FILTER
 extern void put_seccomp_filter(struct task_struct *tsk);
 extern void get_seccomp_filter(struct task_struct *tsk);
+extern long prctl_seccomp_ext(unsigned long, unsigned long,
+			      unsigned long, unsigned long);
 #else  /* CONFIG_SECCOMP_FILTER */
 static inline void put_seccomp_filter(struct task_struct *tsk)
 {
@@ -91,5 +93,10 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
 {
 	return;
 }
+static inline long prctl_seccomp_ext(unsigned long arg2, unsigned long arg3,
+				     unsigned long arg4, unsigned long arg5)
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_SECCOMP_FILTER */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 58afc04c107e..ac758ed72495 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -152,4 +152,10 @@
 #define PR_SET_THP_DISABLE	41
 #define PR_GET_THP_DISABLE	42
 
+/*
+ * Access seccomp extensions
+ * See Documentation/prctl/seccomp_filter.txt for more details.
+ */
+#define PR_SECCOMP_EXT		43
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index ac2dc9f72973..49b527935957 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -10,6 +10,12 @@
 #define SECCOMP_MODE_STRICT	1 /* uses hard-coded filter. */
 #define SECCOMP_MODE_FILTER	2 /* uses user-supplied filter. */
 
+/* Valid extension types as arg2 for prctl(PR_SECCOMP_EXT) */
+#define SECCOMP_EXT_ACT		1
+
+/* Valid extension actions as arg3 to prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT) */
+#define SECCOMP_EXT_ACT_TSYNC	1 /* attempt to synchronize thread filters */
+
 /*
  * All BPF programs must return a 32-bit value.
  * The bottom 16-bits are for optional return data.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 8761ce47a8bd..3f32904533fa 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -24,6 +24,7 @@
 #ifdef CONFIG_SECCOMP_FILTER
 #include <asm/syscall.h>
 #include <linux/filter.h>
+#include <linux/pid.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
 #include <linux/slab.h>
@@ -196,6 +197,108 @@ static u32 seccomp_run_filters(int syscall)
 	return ret;
 }
 
+/* Returns 1 if the candidate is an ancestor. */
+static int is_ancestor(struct seccomp_filter *candidate,
+		       struct seccomp_filter *child)
+{
+	/* NULL is the root ancestor. */
+	if (candidate == NULL)
+		return 1;
+	for (; child; child = child->prev)
+		if (child == candidate)
+			return 1;
+	return 0;
+}
+
+static void seccomp_sync_thread(struct task_struct *caller,
+				struct task_struct *thread)
+{
+	/* Get a task reference for the new leaf node. */
+	get_seccomp_filter(caller);
+	/*
+	 * Drop the task reference to the shared ancestor since
+	 * current's path will hold a reference.  (This also
+	 * allows a put before the assignment.)
+	 */
+	put_seccomp_filter(thread);
+	thread->seccomp.filter = caller->seccomp.filter;
+	/* Opt the other thread into seccomp if needed.
+	 * As threads are considered to be trust-realm
+	 * equivalent (see ptrace_may_access), it is safe to
+	 * allow one thread to transition the other.
+	 */
+	if (thread->seccomp.mode == SECCOMP_MODE_DISABLED) {
+		thread->seccomp.mode = SECCOMP_MODE_FILTER;
+		/*
+		 * Don't let an unprivileged task work around
+		 * the no_new_privs restriction by creating
+		 * a thread that sets it up, enters seccomp,
+		 * then dies.
+		 */
+		if (task_no_new_privs(caller))
+			task_set_no_new_privs(thread);
+		set_tsk_thread_flag(thread, TIF_SECCOMP);
+	}
+}
+
+/**
+ * seccomp_sync_threads: sets all threads to use current's filter
+ *
+ * Returns 0 on success or the pid of a thread which was either not
+ * in the correct seccomp mode or it did not have an ancestral
+ * seccomp filter.
+ */
+static pid_t seccomp_sync_threads(void)
+{
+	struct task_struct *thread, *caller;
+	pid_t failed = 0;
+
+	if (current->seccomp.mode != SECCOMP_MODE_FILTER)
+		return -EACCES;
+
+	write_lock(&tasklist_lock);
+	thread = caller = current;
+	while_each_thread(caller, thread) {
+		seccomp_lock(thread);
+		/*
+		 * Validate thread being eligible for synchronization.
+		 */
+		if (thread->seccomp.mode == SECCOMP_MODE_DISABLED ||
+		    (thread->seccomp.mode == SECCOMP_MODE_FILTER &&
+		     is_ancestor(thread->seccomp.filter,
+				 caller->seccomp.filter))) {
+			seccomp_sync_thread(caller, thread);
+		} else {
+			/* Keep the last sibling that failed to return. */
+			failed = task_pid_vnr(thread);
+			/* If the pid cannot be resolved, then return -ESRCH */
+			if (failed == 0)
+				failed = -ESRCH;
+		}
+		seccomp_unlock(thread);
+	}
+	write_unlock(&tasklist_lock);
+	return failed;
+}
+
+/**
+ * seccomp_extended_action: performs the specific action
+ * @action: the enum of the action to perform.
+ *
+ * Returns 0 on success. On failure, it returns -ve, or EINVAL on an
+ * invalid action.
+ */
+static long seccomp_extended_action(int action)
+{
+	switch (action) {
+	case SECCOMP_EXT_ACT_TSYNC:
+		return seccomp_sync_threads();
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
 /**
  * seccomp_attach_filter: Attaches a seccomp filter to current.
  * @fprog: BPF program to install
@@ -351,6 +454,31 @@ static void seccomp_send_sigsys(int syscall, int reason)
 	info.si_syscall = syscall;
 	force_sig_info(SIGSYS, &info, current);
 }
+
+/**
+ * prctl_seccomp_ext: exposed extension behaviors for seccomp
+ * @cmd: the type of extension being called
+ * @arg[123]: the arguments for the extension
+ *            (at present, arg2 and arg3 must be 0)
+ *
+ * Returns >= 0 on success and < 0 on failure.
+ * Invalid arguments return -EINVAL.
+ * Improper seccomp mode will result in -EACCES.
+ *
+ * SECCOMP_EXT_TYPE_ACT, SECCOMP_EXT_ACT_TSYNC will return 0 on success
+ * or the last thread pid that it cannot synchronize.
+ */
+long prctl_seccomp_ext(unsigned long type, unsigned long arg1,
+		       unsigned long arg2, unsigned long arg3)
+{
+	if (type != SECCOMP_EXT_ACT)
+		return -EINVAL;
+	/* arg2 and arg3 are currently unused. */
+	if (arg2 || arg3)
+		return -EINVAL;
+	/* For action extensions, arg1 is the identifier. */
+	return seccomp_extended_action(arg1);
+}
 #endif	/* CONFIG_SECCOMP_FILTER */
 
 /*
diff --git a/kernel/sys.c b/kernel/sys.c
index 262919a8a7ac..cb73d82e1dd5 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1917,6 +1917,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_SET_SECCOMP:
 		error = prctl_set_seccomp(arg2, (char __user *)arg3);
 		break;
+	case PR_SECCOMP_EXT:
+		error = prctl_seccomp_ext(arg2, arg3, arg4, arg5);
+		break;
 	case PR_GET_TSC:
 		error = GET_TSC_CTL(arg2);
 		break;
-- 
1.7.9.5



* Re: [PATCH v2 0/3] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
  2014-04-17 18:05 [PATCH v2 0/3] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
                   ` (2 preceding siblings ...)
  2014-04-17 18:05 ` [PATCH v2 3/3] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC Kees Cook
@ 2014-04-17 18:13 ` Andy Lutomirski
  2014-04-17 18:18   ` Kees Cook
  3 siblings, 1 reply; 8+ messages in thread
From: Andy Lutomirski @ 2014-04-17 18:13 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel@vger.kernel.org, John Johansen, Oleg Nesterov,
	Will Drewry, Julien Tinnes, linux-doc, LSM List

On Thu, Apr 17, 2014 at 11:05 AM, Kees Cook <keescook@chromium.org> wrote:
> This adds the ability for threads to request seccomp filter
> synchronization across their thread group. To support this,
> seccomp locking on writes is introduced, along with refactoring
> of no_new_privs. Races with thread creation are handled via the
> tasklist_lock.
>
> I think all the concerns raised during the discussion[1] of the first
> version of this patch have been addressed. However, the races involved
> have tricked me before. :)
>

Would this be easier to use if there were a single syscall to set a
seccomp filter and sync threads?  That way you wouldn't have to write
your filter such that it gives permission to sync threads.

--Andy


* Re: [PATCH v2 0/3] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
  2014-04-17 18:13 ` [PATCH v2 0/3] " Andy Lutomirski
@ 2014-04-17 18:18   ` Kees Cook
  2014-04-17 18:31     ` Andy Lutomirski
  0 siblings, 1 reply; 8+ messages in thread
From: Kees Cook @ 2014-04-17 18:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel@vger.kernel.org, John Johansen, Oleg Nesterov,
	Will Drewry, Julien Tinnes, linux-doc@vger.kernel.org, LSM List

On Thu, Apr 17, 2014 at 11:13 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Apr 17, 2014 at 11:05 AM, Kees Cook <keescook@chromium.org> wrote:
>> This adds the ability for threads to request seccomp filter
>> synchronization across their thread group. To support this,
>> seccomp locking on writes is introduced, along with refactoring
>> of no_new_privs. Races with thread creation are handled via the
>> tasklist_lock.
>>
>> I think all the concerns raised during the discussion[1] of the first
>> version of this patch have been addressed. However, the races involved
>> have tricked me before. :)
>>
>
> Would this be easier to use if there were a single syscall to set a
> seccomp filter and sync threads?  That way you wouldn't have to write
> your filter such that it gives permission to sync threads.

That would be even cleaner, yes. I was hoping to see the new bpf jump
tables before expanding into new filter calls, with the hope of doing
it all at the same time. However, I guess we could just include a
version number in the new call to indicate which filter type it was,
and include flags (like "threadgroup sync") in there? I'm trying to
imagine what would be the least painful for future-proofing.

-Kees

-- 
Kees Cook
Chrome OS Security


* Re: [PATCH v2 0/3] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
  2014-04-17 18:18   ` Kees Cook
@ 2014-04-17 18:31     ` Andy Lutomirski
  2014-04-17 18:40       ` Kees Cook
  0 siblings, 1 reply; 8+ messages in thread
From: Andy Lutomirski @ 2014-04-17 18:31 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel@vger.kernel.org, John Johansen, Oleg Nesterov,
	Will Drewry, Julien Tinnes, linux-doc@vger.kernel.org, LSM List

On Thu, Apr 17, 2014 at 11:18 AM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Apr 17, 2014 at 11:13 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Thu, Apr 17, 2014 at 11:05 AM, Kees Cook <keescook@chromium.org> wrote:
>>> This adds the ability for threads to request seccomp filter
>>> synchronization across their thread group. To support this,
>>> seccomp locking on writes is introduced, along with refactoring
>>> of no_new_privs. Races with thread creation are handled via the
>>> tasklist_lock.
>>>
>>> I think all the concerns raised during the discussion[1] of the first
>>> version of this patch have been addressed. However, the races involved
>>> have tricked me before. :)
>>>
>>
>> Would this be easier to use if there were a single syscall to set a
>> seccomp filter and sync threads?  That way you wouldn't have to write
>> your filter such that it gives permission to sync threads.
>
> That would be even cleaner, yes. I was hoping to see the new bpf jump
> tables before expanding into new filter calls, with the hope of doing
> it all at the same time. However, I guess we could just include a
> version number in the new call to indicate which filter type it was,
> and include flags (like "threadgroup sync") in there? I'm trying to
> imagine what would be the least painful for future-proofing.

What's the time frame on the new bpf stuff?  If it'll be ready for
3.16, it might not matter.

--Andy


* Re: [PATCH v2 0/3] seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
  2014-04-17 18:31     ` Andy Lutomirski
@ 2014-04-17 18:40       ` Kees Cook
  0 siblings, 0 replies; 8+ messages in thread
From: Kees Cook @ 2014-04-17 18:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel@vger.kernel.org, John Johansen, Oleg Nesterov,
	Will Drewry, Julien Tinnes, linux-doc@vger.kernel.org, LSM List

On Thu, Apr 17, 2014 at 11:31 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Apr 17, 2014 at 11:18 AM, Kees Cook <keescook@chromium.org> wrote:
>> On Thu, Apr 17, 2014 at 11:13 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Thu, Apr 17, 2014 at 11:05 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> This adds the ability for threads to request seccomp filter
>>>> synchronization across their thread group. To support this,
>>>> seccomp locking on writes is introduced, along with refactoring
>>>> of no_new_privs. Races with thread creation are handled via the
>>>> tasklist_lock.
>>>>
>>>> I think all the concerns raised during the discussion[1] of the first
>>>> version of this patch have been addressed. However, the races involved
>>>> have tricked me before. :)
>>>>
>>>
>>> Would this be easier to use if there were a single syscall to set a
>>> seccomp filter and sync threads?  That way you wouldn't have to write
>>> your filter such that it gives permission to sync threads.
>>
>> That would be even cleaner, yes. I was hoping to see the new bpf jump
>> tables before expanding into new filter calls, with the hope of doing
>> it all at the same time. However, I guess we could just include a
>> version number in the new call to indicate which filter type it was,
>> and include flags (like "threadgroup sync") in there? I'm trying to
>> imagine what would be the least painful for future-proofing.
>
> What's the time frame on the new bpf stuff?  If it'll be ready for
> 3.16, it might not matter.

I don't think seccomp will be extended to deal with that for a while,
since it'd require a lot of coordination first. So, for now, I think
adding a "flags" field should be sufficient. I'll work that up.

-Kees

-- 
Kees Cook
Chrome OS Security

