public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC] kernel: Implement selective syscall redirection through personality
@ 2020-07-08  5:56 Gabriel Krisman Bertazi
  2020-07-08  8:01 ` Kees Cook
  0 siblings, 1 reply; 2+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-07-08  5:56 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, Gabriel Krisman Bertazi, kernel, Matthew Wilcox,
	Andy Lutomirski, Paul Gofman, Kees Cook

The problem I'm trying to solve is that modern Windows applications are
executing system call instructions directly from the application's code
without going through the WinAPI.  This breaks Wine emulation, because
it doesn't have a chance to intercept and emulate these syscalls before
they are submitted to Linux.  There are additional constraints like
performance and inability to dinamically modify the original source
code, which stop us from using regular seccomp filters or patching
userspace.  Those are discussed in the previous discussion in the
mailing list.

The first approach I proposed was a new Seccomp mode to do the trapping
depending on which part of the user memory map issued the syscall. The
main problem with that approach was the use of seccomp for a non-secure
mechanism.  A few other suggestions were made by the community around at
that time:

    - A personality mechanism

    - A quick redirection to a thunk code in userspace to do the
      syscall filtering in userspace and issue the syscalls from
      a well-known dispatcher

The previous discussion is archived at:
  <https://lwn.net/ml/linux-kernel/20200530055953.817666-1-krisman@collabora.com/>

I experimented with both approaches and the biggest problem with the
thunk is that native Linux syscalls would also be redirected to
userspace, which added a prohibitive cost for us.  In fact, for quick
syscalls like sysinfo, it almost doubled the execution time of our
benchmarks.  My current understanding is that we cannot do without some
kind of kernel filtering, since it avoids the high overhead to native
Linux syscalls.

A version of the personality mechanism is what I implement in the patch
below.  In order to avoid issuing personality syscalls in each boundary
transition, which we observed to be too prohibitive, it uses a
userspace-defined key to select the personality.  I confess this is a
bit of an awkward interface to me, but it yields very good results.
While it adds a user memory access to every syscall when the mechanism
is enabled, the impact was tiny, even for a small syscall.  More details
in the patch below.

>8

Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS.  This is useful
for processes that have different personalities depending on which part
of the application is executing, but who need to perform the personality
switch really fast, without paying the cost of a system call on each
boundary transition.  This is particularly important for Windows games
running over Wine.

The proposed interface looks like this:

  prctl(PR_SET_SYSCALL_DISABLE, <mode>, <dispatcher>, [fast_selector])

Dispatcher is the address of a syscall instruction that is allowed to
by-pass the blockage, such that in fast paths you don't need to disable
the trap nor check the fast selector.  This is essential to return from
SIGSYS to a blocked area without triggering another SIGSYS from the
rt_sigreturn.

fast-selector is an optional pointer to an aligned int-sized userspace
memory region that has a key switch for the mechanism. The only policy
the kernel enforces on that area is that zero means SYSCALL_ALLOW while
!0 means SYSCALL_BLOCK.

The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.

Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support.  I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism.  And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.

For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only dispatcher to issue syscalls is pretty much
irrelevant.  The overhead of fast-selector goes around 40ns for a
native (untrapped) syscall in my system, and it is (as expected)
dominated by the supervisor-mode user-address access.  In fact, with
SMAP off, the overhead is consistently less than 5ns on my test box.

Right now, it is only supported by x86_64 and x86, but it should be
easily enabled for other architectures.

An example code using this interface can be found at:
  https://gitlab.collabora.com/krisman/syscall-disable-personality

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Paul Gofman <gofmanp@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
 arch/Kconfig                       | 19 +++++++
 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/common.c            |  7 +++
 arch/x86/include/asm/thread_info.h |  4 +-
 fs/exec.c                          |  4 ++
 include/linux/sched.h              |  5 ++
 include/linux/syscall_disable.h    | 29 ++++++++++
 include/uapi/asm-generic/siginfo.h |  1 +
 include/uapi/linux/prctl.h         |  4 ++
 kernel/Makefile                    |  1 +
 kernel/fork.c                      |  4 ++
 kernel/sys.c                       |  7 +++
 kernel/syscall_disable.c           | 85 ++++++++++++++++++++++++++++++
 13 files changed, 170 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/syscall_disable.h
 create mode 100644 kernel/syscall_disable.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc35dc556c7..bbbfbd4b9f86 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -465,6 +465,25 @@ config SECCOMP_FILTER
 
 	  See Documentation/userspace-api/seccomp_filter.rst for details.
 
+config HAVE_ARCH_FAST_SYSCALL_DISABLE
+	bool
+	help
+	  An arch should select this symbol if it provides all of these things:
+	  - TIF_FAST_SYSCALL_DISABLE
+	  - syscall_get_arch
+	  - syscall_rollback
+	  - SIGSYS siginfo_t support
+
+config FAST_SYSCALL_DISABLE
+	bool "Support rejecting syscalls not coming from a dispatcher"
+	depends on HAVE_ARCH_FAST_SYSCALL_DISABLE
+	help
+	  Enable tasks to ask the kernel to reject syscalls not
+	  issued from a predefined dispatcher depending on a userspace
+	  key.
+
+	  This option is useful to optimize games running over Wine.
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 883da0abf779..36f711570d19 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -149,6 +149,7 @@ config X86
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_FAST_SYSCALL_DISABLE
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index bd3f14175193..26b3afbeea37 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -138,6 +138,13 @@ static long syscall_trace_enter(struct pt_regs *regs)
 			return -1L;
 	}
 
+#ifdef CONFIG_FAST_SYSCALL_DISABLE
+	if (work & _TIF_FAST_SYSCALL_DISABLE) {
+		if (do_fast_syscall_disable(regs->orig_ax, regs))
+			return -1L;
+	}
+#endif
+
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Do seccomp after ptrace, to catch any tracer changes.
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8de8ceccb8bc..be72d14fc570 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -93,6 +93,7 @@ struct thread_info {
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_SLD			18	/* Restore split lock detection on context switch */
+#define TIF_FAST_SYSCALL_DISABLE 19	/* fast syscall disable active */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
@@ -123,6 +124,7 @@ struct thread_info {
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_SLD		(1 << TIF_SLD)
+#define _TIF_FAST_SYSCALL_DISABLE (1 << TIF_FAST_SYSCALL_DISABLE)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
@@ -136,7 +138,7 @@ struct thread_info {
 /* Work to do before invoking the actual syscall. */
 #define _TIF_WORK_SYSCALL_ENTRY	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
-	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)
+	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | _TIF_FAST_SYSCALL_DISABLE)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE					\
diff --git a/fs/exec.c b/fs/exec.c
index e6e8a9a70327..3c40b00aa19f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1502,6 +1502,10 @@ void setup_new_exec(struct linux_binprm * bprm)
 
 	arch_setup_new_exec();
 
+#ifdef CONFIG_FAST_SYSCALL_DISABLE
+	clear_tsk_thread_flag(me, TIF_FAST_SYSCALL_DISABLE);
+#endif
+
 	/* Set the new mm task size. We have to do that late because it may
 	 * depend on TIF_32BIT which is only updated in flush_thread() on
 	 * some architectures like powerpc
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 692e327d7455..f502b7fc2bfd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -32,6 +32,7 @@
 #include <linux/posix-timers.h>
 #include <linux/rseq.h>
 #include <linux/kcsan.h>
+#include <linux/syscall_disable.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -953,6 +954,10 @@ struct task_struct {
 #endif
 	struct seccomp			seccomp;
 
+#ifdef CONFIG_FAST_SYSCALL_DISABLE
+	struct fast_syscall_disable	syscall_disable;
+#endif
+
 	/* Thread group tracking: */
 	u64				parent_exec_id;
 	u64				self_exec_id;
diff --git a/include/linux/syscall_disable.h b/include/linux/syscall_disable.h
new file mode 100644
index 000000000000..96e52f50eee5
--- /dev/null
+++ b/include/linux/syscall_disable.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SYSCALL_DISABLE_H
+#define _SYSCALL_DISABLE_H
+
+#ifdef CONFIG_FAST_SYSCALL_DISABLE
+struct fast_syscall_disable {
+	int __user *fastlock;
+	unsigned long __user dispatcher;
+};
+
+int do_fast_syscall_disable(int nr_syscall, struct pt_regs *regs);
+int set_syscall_disable(int mode, unsigned long __user dispatcher,
+			int __user *fastlock);
+
+#else
+static inline int set_syscall_disable(int mode, unsigned long __user dispatcher,
+				      int __user *fastlock)
+{
+	return -EINVAL;
+}
+
+static inline int do_fast_syscall_disable(int nr_syscall, struct pt_regs *regs)
+{
+	WARN_ON(1);
+	return 0;
+}
+#endif /* CONFIG_FAST_SYSCALL_DISABLE */
+
+#endif
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index cb3d6c267181..09d4ef45a341 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -285,6 +285,7 @@ typedef struct siginfo {
  */
 #define SYS_SECCOMP	1	/* seccomp triggered */
 #define NSIGSYS		1
+#define SYS_DISABLED	2
 
 /*
  * SIGEMT si_codes
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..383773269ade 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,8 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+#define PR_SET_SYSCALL_DISABLE		59
+# define PR_SYSCALL_ALLOW		0
+# define PR_SYSCALL_BLOCK		1
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index f3218bc5ec69..e0449ed68fd5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
 obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_FAST_SYSCALL_DISABLE) += syscall_disable.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
diff --git a/kernel/fork.c b/kernel/fork.c
index 142b23645d82..529f0b5e02d3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -922,6 +922,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	clear_tsk_need_resched(tsk);
 	set_task_stack_end_magic(tsk);
 
+#ifdef CONFIG_FAST_SYSCALL_DISABLE
+	clear_tsk_thread_flag(tsk, TIF_FAST_SYSCALL_DISABLE);
+#endif
+
 #ifdef CONFIG_STACKPROTECTOR
 	tsk->stack_canary = get_random_canary();
 #endif
diff --git a/kernel/sys.c b/kernel/sys.c
index 00a96746e28a..2205df99ba67 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/syscall_disable.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2527,6 +2528,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_SET_SYSCALL_DISABLE:
+		if (arg5)
+			return -EINVAL;
+		error = set_syscall_disable((int) arg2, (unsigned long __user) arg3,
+					    (int __user *) arg4);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/syscall_disable.c b/kernel/syscall_disable.c
new file mode 100644
index 000000000000..4b0a1f916df3
--- /dev/null
+++ b/kernel/syscall_disable.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/sched.h>
+#include <linux/prctl.h>
+#include <linux/syscall_disable.h>
+#include <linux/uaccess.h>
+#include <linux/signal.h>
+#include <linux/processor.h>
+
+#include <asm/syscall.h>
+
+#include <linux/sched/signal.h>
+#include <linux/sched/task_stack.h>
+
+static void trigger_sigsys(int nr_syscall)
+{
+	struct kernel_siginfo info;
+
+	clear_siginfo(&info);
+	info.si_signo = SIGSYS;
+	info.si_code = SYS_DISABLED;
+	info.si_call_addr = (void __user *)KSTK_EIP(current);
+	info.si_errno = 0;
+	info.si_arch = syscall_get_arch(current);
+	info.si_syscall = nr_syscall;
+
+	force_sig_info(&info);
+}
+
+int do_fast_syscall_disable(int nr_syscall, struct pt_regs *regs)
+{
+	int state;
+
+	if (current->syscall_disable.dispatcher == instruction_pointer(regs))
+		return 0;
+
+	if (current->syscall_disable.fastlock) {
+		if (__get_user(state, current->syscall_disable.fastlock))
+			do_exit(SIGSEGV);
+
+		if (!state)
+			return 0;
+	}
+
+	syscall_rollback(current, regs);
+	trigger_sigsys(nr_syscall);
+
+	return 1;
+}
+
+int set_syscall_disable(int mode, unsigned long __user dispatcher,
+			int __user *fastlock)
+{
+	switch (mode) {
+	case PR_SYSCALL_ALLOW:
+		if (dispatcher || fastlock)
+			return -EINVAL;
+		break;
+	case PR_SYSCALL_BLOCK:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (fastlock) {
+		if (!access_ok(fastlock, sizeof(int)))
+			return -EFAULT;
+	}
+
+	spin_lock_irq(&current->sighand->siglock);
+
+	current->syscall_disable.fastlock = fastlock;
+	current->syscall_disable.dispatcher = dispatcher;
+
+	/* make sure fsb_address is committed before setting the flag. */
+	smp_mb__before_atomic();
+
+	if (mode == PR_SYSCALL_BLOCK)
+		set_tsk_thread_flag(current, TIF_FAST_SYSCALL_DISABLE);
+	else
+		clear_tsk_thread_flag(current, TIF_FAST_SYSCALL_DISABLE);
+
+	spin_unlock_irq(&current->sighand->siglock);
+
+	return 0;
+}
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 2+ messages in thread

* Re: [PATCH RFC] kernel: Implement selective syscall redirection through personality
  2020-07-08  5:56 [PATCH RFC] kernel: Implement selective syscall redirection through personality Gabriel Krisman Bertazi
@ 2020-07-08  8:01 ` Kees Cook
  0 siblings, 0 replies; 2+ messages in thread
From: Kees Cook @ 2020-07-08  8:01 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: tglx, linux-kernel, kernel, Matthew Wilcox, Andy Lutomirski,
	Paul Gofman

On Wed, Jul 08, 2020 at 01:56:00AM -0400, Gabriel Krisman Bertazi wrote:
> Introduce a mechanism to quickly disable/enable syscall handling for a
> specific process and redirect to userspace via SIGSYS.  This is useful
> for processes that have different personalities depending on which part
> of the application is executing, but who need to perform the personality
> switch really fast, without paying the cost of a system call on each
> boundary transition.  This is particularly important for Windows games
> running over Wine.

nit: probably best to avoid saying "personality" here, since this isn't
using the personality bits.

> The proposed interface looks like this:
> 
>   prctl(PR_SET_SYSCALL_DISABLE, <mode>, <dispatcher>, [fast_selector])
> 
> Dispatcher is the address of a syscall instruction that is allowed to
> by-pass the blockage, such that in fast paths you don't need to disable
> the trap nor check the fast selector.  This is essential to return from
> SIGSYS to a blocked area without triggering another SIGSYS from the
> rt_sigreturn.
> 
> fast-selector is an optional pointer to an aligned int-sized userspace
> memory region that has a key switch for the mechanism. The only policy
> the kernel enforces on that area is that zero means SYSCALL_ALLOW while
> !0 means SYSCALL_BLOCK.

I like this; the boundary control is nice and clean! (The only potentially
weird bit is threads being able to accidentally muck with each other,
but that'd just lead to debugging bizarre behaviors if one thread pokes
another thread's fast_selector.)

> Internally, this doesn't add overhead to the syscall hot path, and it
> requires very little per-architecture support.  I avoided using seccomp,

Yeah -- it's basically a TIF flag and the plumbing to get the syscall
number (actually.. can you use syscall_get_nr()... notes below).

> even though it duplicates some functionality, due to previous feedback
> that maybe it shouldn't mix with seccomp since it is not a security
> mechanism.  And obviously, this should never be considered a security
> mechanism, since any part of the program can by-pass it by using the
> syscall dispatcher.

Yeah, I'd really *like* this to be part of seccomp because it feels so
very close in several aspects, but ultimately I think it's right to be
separate.

> +config HAVE_ARCH_FAST_SYSCALL_DISABLE
> +	bool
> +	help
> +	  An arch should select this symbol if it provides all of these things:
> +	  - TIF_FAST_SYSCALL_DISABLE
> +	  - syscall_get_arch
> +	  - syscall_rollback

I'm going to propose including syscall_get_nr() in this list for
a note below...

> +	  - SIGSYS siginfo_t support
> +
> +config FAST_SYSCALL_DISABLE
> +	bool "Support rejecting syscalls not coming from a dispatcher"
> +	depends on HAVE_ARCH_FAST_SYSCALL_DISABLE
> +	help
> +	  Enable tasks to ask the kernel to reject syscalls not
> +	  issued from a predefined dispatcher depending on a userspace
> +	  key.
> +
> +	  This option is useful to optimize games running over Wine.
> +
>  config HAVE_ARCH_STACKLEAK
>  	bool
>  	help
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 883da0abf779..36f711570d19 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -149,6 +149,7 @@ config X86
>  	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
>  	select HAVE_ARCH_PREL32_RELOCATIONS
>  	select HAVE_ARCH_SECCOMP_FILTER
> +	select HAVE_ARCH_FAST_SYSCALL_DISABLE
>  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
>  	select HAVE_ARCH_STACKLEAK
>  	select HAVE_ARCH_TRACEHOOK
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index bd3f14175193..26b3afbeea37 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -138,6 +138,13 @@ static long syscall_trace_enter(struct pt_regs *regs)
>  			return -1L;
>  	}
>  
> +#ifdef CONFIG_FAST_SYSCALL_DISABLE

This implementation is so small I have to wonder if this even needs a
CONFIG for it generally. Here, though, you can leave it off too and just
use no-op static inline functions for do_fast_syscall_disable().
(Since you're adding the TIF unconditionally.)

> +	if (work & _TIF_FAST_SYSCALL_DISABLE) {
> +		if (do_fast_syscall_disable(regs->orig_ax, regs))

Instead of even needing to learn about regs->orig_ax, you can just use
syscall_get_nr() here so when other architectures look at this
implementation, they can just literally copy/paste the same lines. :)

While I was looking at this, I'm actually wonder if this can be made
entirely architecture agnostic by stuffing it into
tracehook_report_syscall_entry() instead. That'd require some
refactoring, but seeing how _TIF_FAST_SYSCALL_DISABLE is the last
available bit on x86, I'm wondering about raising all of the common
trace-related stuff up into arch-agnostic areas. seccomp could move too,
I bet. But anyway, I don't think you need to worry about that for this
patch. Stick to the arch-specific bits for now. I just mostly wanted to
dump my thoughts on this so I can find it later. :)

> +			return -1L;
> +	}
> +#endif
> +
>  #ifdef CONFIG_SECCOMP
>  	/*
>  	 * Do seccomp after ptrace, to catch any tracer changes.
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index 8de8ceccb8bc..be72d14fc570 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -93,6 +93,7 @@ struct thread_info {
>  #define TIF_NOTSC		16	/* TSC is not accessible in userland */
>  #define TIF_IA32		17	/* IA32 compatibility process */
>  #define TIF_SLD			18	/* Restore split lock detection on context switch */
> +#define TIF_FAST_SYSCALL_DISABLE 19	/* fast syscall disable active */
>  #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
>  #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
>  #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
> @@ -123,6 +124,7 @@ struct thread_info {
>  #define _TIF_NOTSC		(1 << TIF_NOTSC)
>  #define _TIF_IA32		(1 << TIF_IA32)
>  #define _TIF_SLD		(1 << TIF_SLD)
> +#define _TIF_FAST_SYSCALL_DISABLE (1 << TIF_FAST_SYSCALL_DISABLE)
>  #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
>  #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
>  #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
> @@ -136,7 +138,7 @@ struct thread_info {
>  /* Work to do before invoking the actual syscall. */
>  #define _TIF_WORK_SYSCALL_ENTRY	\
>  	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
> -	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)
> +	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | _TIF_FAST_SYSCALL_DISABLE)
>  
>  /* flags to check in __switch_to() */
>  #define _TIF_WORK_CTXSW_BASE					\
> diff --git a/fs/exec.c b/fs/exec.c
> index e6e8a9a70327..3c40b00aa19f 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1502,6 +1502,10 @@ void setup_new_exec(struct linux_binprm * bprm)
>  
>  	arch_setup_new_exec();
>  
> +#ifdef CONFIG_FAST_SYSCALL_DISABLE
> +	clear_tsk_thread_flag(me, TIF_FAST_SYSCALL_DISABLE);
> +#endif

I would just leave the #ifdef off -- the TIF exists unconditionally
anyway.

> +
>  	/* Set the new mm task size. We have to do that late because it may
>  	 * depend on TIF_32BIT which is only updated in flush_thread() on
>  	 * some architectures like powerpc
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 692e327d7455..f502b7fc2bfd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -32,6 +32,7 @@
>  #include <linux/posix-timers.h>
>  #include <linux/rseq.h>
>  #include <linux/kcsan.h>
> +#include <linux/syscall_disable.h>
>  
>  /* task_struct member predeclarations (sorted alphabetically): */
>  struct audit_context;
> @@ -953,6 +954,10 @@ struct task_struct {
>  #endif
>  	struct seccomp			seccomp;
>  
> +#ifdef CONFIG_FAST_SYSCALL_DISABLE
> +	struct fast_syscall_disable	syscall_disable;
> +#endif

You can actually drop this too and instead hide an empty struct via the
header (see how seccomp does this for example).

> +
>  	/* Thread group tracking: */
>  	u64				parent_exec_id;
>  	u64				self_exec_id;
> diff --git a/include/linux/syscall_disable.h b/include/linux/syscall_disable.h
> new file mode 100644
> index 000000000000..96e52f50eee5
> --- /dev/null
> +++ b/include/linux/syscall_disable.h
> @@ -0,0 +1,29 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _SYSCALL_DISABLE_H
> +#define _SYSCALL_DISABLE_H
> +
> +#ifdef CONFIG_FAST_SYSCALL_DISABLE
> +struct fast_syscall_disable {
> +	int __user *fastlock;
> +	unsigned long __user dispatcher;
> +};
> +
> +int do_fast_syscall_disable(int nr_syscall, struct pt_regs *regs);
> +int set_syscall_disable(int mode, unsigned long __user dispatcher,
> +			int __user *fastlock);
> +
> +#else

i.e. add:
struct fast_syscall_disable { };

> +static inline int set_syscall_disable(int mode, unsigned long __user dispatcher,
> +				      int __user *fastlock)
> +{
> +	return -EINVAL;
> +}
> +
> +static inline int do_fast_syscall_disable(int nr_syscall, struct pt_regs *regs)
> +{
> +	WARN_ON(1);

I would make this a WARN_ON_ONCE().

> +	return 0;
> +}
> +#endif /* CONFIG_FAST_SYSCALL_DISABLE */
> +
> +#endif
> diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
> index cb3d6c267181..09d4ef45a341 100644
> --- a/include/uapi/asm-generic/siginfo.h
> +++ b/include/uapi/asm-generic/siginfo.h
> @@ -285,6 +285,7 @@ typedef struct siginfo {
>   */
>  #define SYS_SECCOMP	1	/* seccomp triggered */
>  #define NSIGSYS		1
> +#define SYS_DISABLED	2
>  
>  /*
>   * SIGEMT si_codes
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 07b4f8131e36..383773269ade 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -238,4 +238,8 @@ struct prctl_mm_map {
>  #define PR_SET_IO_FLUSHER		57
>  #define PR_GET_IO_FLUSHER		58
>  
> +#define PR_SET_SYSCALL_DISABLE		59
> +# define PR_SYSCALL_ALLOW		0
> +# define PR_SYSCALL_BLOCK		1
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index f3218bc5ec69..e0449ed68fd5 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -92,6 +92,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
>  obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
>  obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
>  obj-$(CONFIG_SECCOMP) += seccomp.o
> +obj-$(CONFIG_FAST_SYSCALL_DISABLE) += syscall_disable.o
>  obj-$(CONFIG_RELAY) += relay.o
>  obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
>  obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 142b23645d82..529f0b5e02d3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -922,6 +922,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>  	clear_tsk_need_resched(tsk);
>  	set_task_stack_end_magic(tsk);
>  
> +#ifdef CONFIG_FAST_SYSCALL_DISABLE
> +	clear_tsk_thread_flag(tsk, TIF_FAST_SYSCALL_DISABLE);
> +#endif
> +
>  #ifdef CONFIG_STACKPROTECTOR
>  	tsk->stack_canary = get_random_canary();
>  #endif
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 00a96746e28a..2205df99ba67 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
>  #include <linux/syscore_ops.h>
>  #include <linux/version.h>
>  #include <linux/ctype.h>
> +#include <linux/syscall_disable.h>
>  
>  #include <linux/compat.h>
>  #include <linux/syscalls.h>
> @@ -2527,6 +2528,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  
>  		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
>  		break;
> +	case PR_SET_SYSCALL_DISABLE:
> +		if (arg5)
> +			return -EINVAL;
> +		error = set_syscall_disable((int) arg2, (unsigned long __user) arg3,
> +					    (int __user *) arg4);
> +		break;
>  	default:
>  		error = -EINVAL;
>  		break;
> diff --git a/kernel/syscall_disable.c b/kernel/syscall_disable.c
> new file mode 100644
> index 000000000000..4b0a1f916df3
> --- /dev/null
> +++ b/kernel/syscall_disable.c
> @@ -0,0 +1,85 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/sched.h>
> +#include <linux/prctl.h>
> +#include <linux/syscall_disable.h>
> +#include <linux/uaccess.h>
> +#include <linux/signal.h>
> +#include <linux/processor.h>
> +
> +#include <asm/syscall.h>
> +
> +#include <linux/sched/signal.h>
> +#include <linux/sched/task_stack.h>
> +
> +static void trigger_sigsys(int nr_syscall)
> +{
> +	struct kernel_siginfo info;
> +
> +	clear_siginfo(&info);
> +	info.si_signo = SIGSYS;
> +	info.si_code = SYS_DISABLED;
> +	info.si_call_addr = (void __user *)KSTK_EIP(current);
> +	info.si_errno = 0;
> +	info.si_arch = syscall_get_arch(current);
> +	info.si_syscall = nr_syscall;
> +
> +	force_sig_info(&info);
> +}
> +
> +int do_fast_syscall_disable(int nr_syscall, struct pt_regs *regs)
> +{
> +	int state;
> +
> +	if (current->syscall_disable.dispatcher == instruction_pointer(regs))
> +		return 0;
> +
> +	if (current->syscall_disable.fastlock) {
> +		if (__get_user(state, current->syscall_disable.fastlock))
> +			do_exit(SIGSEGV);
> +
> +		if (!state)

Just like the prctl arg5 == 0 checking, I actually think this should use
explicit values. 0 and 1, everything else -> SIGSEGV.

> +			return 0;
> +	}
> +
> +	syscall_rollback(current, regs);
> +	trigger_sigsys(nr_syscall);
> +
> +	return 1;
> +}
> +
> +int set_syscall_disable(int mode, unsigned long __user dispatcher,
> +			int __user *fastlock)
> +{
> +	switch (mode) {
> +	case PR_SYSCALL_ALLOW:
> +		if (dispatcher || fastlock)
> +			return -EINVAL;
> +		break;
> +	case PR_SYSCALL_BLOCK:
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	if (fastlock) {
> +		if (!access_ok(fastlock, sizeof(int)))
> +			return -EFAULT;
> +	}
> +
> +	spin_lock_irq(&current->sighand->siglock);
> +
> +	current->syscall_disable.fastlock = fastlock;
> +	current->syscall_disable.dispatcher = dispatcher;
> +
> +	/* make sure fsb_address is committed before setting the flag. */

"fsb_address"? renamed from earlier version?

> +	smp_mb__before_atomic();
> +
> +	if (mode == PR_SYSCALL_BLOCK)
> +		set_tsk_thread_flag(current, TIF_FAST_SYSCALL_DISABLE);
> +	else
> +		clear_tsk_thread_flag(current, TIF_FAST_SYSCALL_DISABLE);
> +
> +	spin_unlock_irq(&current->sighand->siglock);
> +
> +	return 0;
> +}
> -- 
> 2.27.0
> 

This looks really clean to me. I like it. Time for bikeshed painting!
"fast syscall disable" seems an odd name. I would just call it "syscall
dispatch" or "syscall userspace dispatch". Yes, it's fast, but that
doesn't really say _what_ it is. And "disable" is true, but that's not
the point -- using it for just disabling syscalls isn't the point,
providing a userspace dispatcher, though, is.

So how about "fast syscall disable" -> "syscall user dispatch"
through-out? And it's only 1 extra letter. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2020-07-08  8:01 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-07-08  5:56 [PATCH RFC] kernel: Implement selective syscall redirection through personality Gabriel Krisman Bertazi
2020-07-08  8:01 ` Kees Cook

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox