* [PATCH 1/3] syscall_user_dispatch: Allow allowed range wrap-around
[not found] <cover.1739894594.git.dvyukov@google.com>
@ 2025-02-18 16:04 ` Dmitry Vyukov
2025-02-18 16:58 ` Gregory Price
2025-02-18 16:04 ` [PATCH 2/3] selftests: Fix errno checking in syscall_user_dispatch test Dmitry Vyukov
2025-02-18 16:04 ` [PATCH 3/3] selftests: Extend syscall_user_dispatch test to check allowed range Dmitry Vyukov
2 siblings, 1 reply; 12+ messages in thread
From: Dmitry Vyukov @ 2025-02-18 16:04 UTC (permalink / raw)
To: krisman, tglx, luto, peterz, keescook, gregory.price
Cc: Dmitry Vyukov, Marco Elver, linux-kernel
There are two possible scenarios for syscall filtering:
- having a trusted/allowed range of PCs, and intercepting everything else
- or the opposite: a single untrusted/intercepted range and allowing
everything else
The current implementation only allows the former use case due to
allowed range wrap-around check. Allow the latter use case as well
by removing the wrap-around check.
The latter use case is relevant for any kind of sandboxing scenario,
or monitoring behavior of a single library. If a program wants to
intercept syscalls for PC range [START, END) then it needs to call:
prctl(..., END, -(END-START), ...);
which sets a wrap-around range that excludes everything
besides [START, END).
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Gregory Price <gregory.price@memverge.com>
Cc: Marco Elver <elver@google.com>
Cc: linux-kernel@vger.kernel.org
---
kernel/entry/syscall_user_dispatch.c | 9 +++------
kernel/sys.c | 6 ++++++
2 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/kernel/entry/syscall_user_dispatch.c b/kernel/entry/syscall_user_dispatch.c
index 5340c5aa89e7d..a0659f0515404 100644
--- a/kernel/entry/syscall_user_dispatch.c
+++ b/kernel/entry/syscall_user_dispatch.c
@@ -37,6 +37,7 @@ bool syscall_user_dispatch(struct pt_regs *regs)
struct syscall_user_dispatch *sd = ¤t->syscall_dispatch;
char state;
+ /* Note: this check form allows for range wrap-around. */
if (likely(instruction_pointer(regs) - sd->offset < sd->len))
return false;
@@ -80,13 +81,9 @@ static int task_set_syscall_user_dispatch(struct task_struct *task, unsigned lon
break;
case PR_SYS_DISPATCH_ON:
/*
- * Validate the direct dispatcher region just for basic
- * sanity against overflow and a 0-sized dispatcher
- * region. If the user is able to submit a syscall from
- * an address, that address is obviously valid.
+ * Note: we don't check and allow arbitrary values for
+ * offset/len in particular to allow range wrap-around.
*/
- if (offset && offset + len <= offset)
- return -EINVAL;
/*
* access_ok() will clear memory tags for tagged addresses
diff --git a/kernel/sys.c b/kernel/sys.c
index cb366ff8703af..666322026ad72 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2735,6 +2735,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
break;
case PR_SET_SYSCALL_USER_DISPATCH:
+ /*
+ * Sign-extend len for 32-bit processes to allow region
+ * wrap-around.
+ */
+ if (in_compat_syscall())
+ arg4 = (long)(s32)arg4;
error = set_syscall_user_dispatch(arg2, arg3, arg4,
(char __user *) arg5);
break;
--
2.48.1.601.g30ceb7b040-goog
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 2/3] selftests: Fix errno checking in syscall_user_dispatch test
[not found] <cover.1739894594.git.dvyukov@google.com>
2025-02-18 16:04 ` [PATCH 1/3] syscall_user_dispatch: Allow allowed range wrap-around Dmitry Vyukov
@ 2025-02-18 16:04 ` Dmitry Vyukov
2025-02-19 17:19 ` Kees Cook
2025-02-18 16:04 ` [PATCH 3/3] selftests: Extend syscall_user_dispatch test to check allowed range Dmitry Vyukov
2 siblings, 1 reply; 12+ messages in thread
From: Dmitry Vyukov @ 2025-02-18 16:04 UTC (permalink / raw)
To: krisman, tglx, luto, peterz, keescook, gregory.price
Cc: Dmitry Vyukov, Marco Elver, linux-kernel
Successful syscalls don't change errno, so checking errno is wrong
to ensure that a syscall has failed. For example for the following
sequence:
prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0xff, 0);
EXPECT_EQ(EINVAL, errno);
prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x0, &sel);
EXPECT_EQ(EINVAL, errno);
only the first syscall may fail and set errno, but the second may succeed
and keep errno intact, and the check will falsely pass.
Or if errno happened to be EINVAL before, even the first check may falsely
pass.
Also use EXPECT/ASSERT consistently. Currently there is an inconsistent mix
without obvious reasons for usage of one or another.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Gregory Price <gregory.price@memverge.com>
Cc: Marco Elver <elver@google.com>
Cc: linux-kernel@vger.kernel.org
---
.../syscall_user_dispatch/sud_test.c | 30 +++++++++----------
1 file changed, 14 insertions(+), 16 deletions(-)
diff --git a/tools/testing/selftests/syscall_user_dispatch/sud_test.c b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
index d975a67673299..b0969925ec64c 100644
--- a/tools/testing/selftests/syscall_user_dispatch/sud_test.c
+++ b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
@@ -86,55 +86,53 @@ TEST(bad_prctl_param)
/* Invalid op */
op = -1;
- prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0, 0, &sel);
- ASSERT_EQ(EINVAL, errno);
+ EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0, 0, &sel));
+ EXPECT_EQ(EINVAL, errno);
/* PR_SYS_DISPATCH_OFF */
op = PR_SYS_DISPATCH_OFF;
/* offset != 0 */
- prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, 0);
+ EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, 0));
EXPECT_EQ(EINVAL, errno);
/* len != 0 */
- prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0xff, 0);
+ EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0xff, 0));
EXPECT_EQ(EINVAL, errno);
/* sel != NULL */
- prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x0, &sel);
+ EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x0, &sel));
EXPECT_EQ(EINVAL, errno);
/* Valid parameter */
- errno = 0;
- prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x0, 0x0);
- EXPECT_EQ(0, errno);
+ EXPECT_EQ(0, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x0, 0x0));
/* PR_SYS_DISPATCH_ON */
op = PR_SYS_DISPATCH_ON;
/* Dispatcher region is bad (offset > 0 && len == 0) */
- prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, &sel);
+ EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, &sel));
EXPECT_EQ(EINVAL, errno);
- prctl(PR_SET_SYSCALL_USER_DISPATCH, op, -1L, 0x0, &sel);
+ EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, -1L, 0x0, &sel));
EXPECT_EQ(EINVAL, errno);
/* Invalid selector */
- prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x1, (void *) -1);
- ASSERT_EQ(EFAULT, errno);
+ EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x1, (void *) -1));
+ EXPECT_EQ(EFAULT, errno);
/*
* Dispatcher range overflows unsigned long
*/
- prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 1, -1L, &sel);
- ASSERT_EQ(EINVAL, errno) {
+ EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 1, -1L, &sel));
+ EXPECT_EQ(EINVAL, errno) {
TH_LOG("Should reject bad syscall range");
}
/*
* Allowed range overflows usigned long
*/
- prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, -1L, 0x1, &sel);
- ASSERT_EQ(EINVAL, errno) {
+ EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, -1L, 0x1, &sel));
+ EXPECT_EQ(EINVAL, errno) {
TH_LOG("Should reject bad syscall range");
}
}
--
2.48.1.601.g30ceb7b040-goog
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 3/3] selftests: Extend syscall_user_dispatch test to check allowed range
[not found] <cover.1739894594.git.dvyukov@google.com>
2025-02-18 16:04 ` [PATCH 1/3] syscall_user_dispatch: Allow allowed range wrap-around Dmitry Vyukov
2025-02-18 16:04 ` [PATCH 2/3] selftests: Fix errno checking in syscall_user_dispatch test Dmitry Vyukov
@ 2025-02-18 16:04 ` Dmitry Vyukov
2025-02-20 15:17 ` Gregory Price
2 siblings, 1 reply; 12+ messages in thread
From: Dmitry Vyukov @ 2025-02-18 16:04 UTC (permalink / raw)
To: krisman, tglx, luto, peterz, keescook, gregory.price
Cc: Dmitry Vyukov, Marco Elver, linux-kernel
Add a test that ensures that PR_SET_SYSCALL_USER_DISPATCH respects
the specified allowed PC range. The test includes both a continuous
range and a wrap-around range.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Gregory Price <gregory.price@memverge.com>
Cc: Marco Elver <elver@google.com>
Cc: linux-kernel@vger.kernel.org
---
.../syscall_user_dispatch/sud_test.c | 79 +++++++++++--------
1 file changed, 48 insertions(+), 31 deletions(-)
diff --git a/tools/testing/selftests/syscall_user_dispatch/sud_test.c b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
index b0969925ec64c..fa40e46e6d3e9 100644
--- a/tools/testing/selftests/syscall_user_dispatch/sud_test.c
+++ b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
@@ -10,6 +10,7 @@
#include <sys/sysinfo.h>
#include <sys/syscall.h>
#include <signal.h>
+#include <stdbool.h>
#include <asm/unistd.h>
#include "../kselftest_harness.h"
@@ -110,31 +111,15 @@ TEST(bad_prctl_param)
/* PR_SYS_DISPATCH_ON */
op = PR_SYS_DISPATCH_ON;
- /* Dispatcher region is bad (offset > 0 && len == 0) */
- EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, &sel));
- EXPECT_EQ(EINVAL, errno);
- EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, -1L, 0x0, &sel));
- EXPECT_EQ(EINVAL, errno);
+ /* All ranges are allowed */
+ EXPECT_EQ(0, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, &sel));
+ EXPECT_EQ(0, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, -1L, 0x0, &sel));
+ EXPECT_EQ(0, prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 1, -1L, &sel));
+ EXPECT_EQ(0, prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, -1L, 0x1, &sel));
/* Invalid selector */
EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x1, (void *) -1));
EXPECT_EQ(EFAULT, errno);
-
- /*
- * Dispatcher range overflows unsigned long
- */
- EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 1, -1L, &sel));
- EXPECT_EQ(EINVAL, errno) {
- TH_LOG("Should reject bad syscall range");
- }
-
- /*
- * Allowed range overflows usigned long
- */
- EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, -1L, 0x1, &sel));
- EXPECT_EQ(EINVAL, errno) {
- TH_LOG("Should reject bad syscall range");
- }
}
/*
@@ -145,11 +130,13 @@ char glob_sel;
int nr_syscalls_emulated;
int si_code;
int si_errno;
+unsigned long syscall_addr;
static void handle_sigsys(int sig, siginfo_t *info, void *ucontext)
{
si_code = info->si_code;
si_errno = info->si_errno;
+ syscall_addr = (unsigned long)info->si_call_addr;
if (info->si_syscall == MAGIC_SYSCALL_1)
nr_syscalls_emulated++;
@@ -172,26 +159,29 @@ static void handle_sigsys(int sig, siginfo_t *info, void *ucontext)
#endif
}
-TEST(dispatch_and_return)
+int setup_sigsys_handler(void)
{
- long ret;
struct sigaction act;
sigset_t mask;
- glob_sel = 0;
- nr_syscalls_emulated = 0;
- si_code = 0;
- si_errno = 0;
-
memset(&act, 0, sizeof(act));
sigemptyset(&mask);
-
act.sa_sigaction = handle_sigsys;
act.sa_flags = SA_SIGINFO;
act.sa_mask = mask;
+ return sigaction(SIGSYS, &act, NULL);
+}
- ret = sigaction(SIGSYS, &act, NULL);
- ASSERT_EQ(0, ret);
+TEST(dispatch_and_return)
+{
+ long ret;
+
+ glob_sel = 0;
+ nr_syscalls_emulated = 0;
+ si_code = 0;
+ si_errno = 0;
+
+ ASSERT_EQ(0, setup_sigsys_handler());
/* Make sure selector is good prior to prctl. */
SYSCALL_DISPATCH_OFF(glob_sel);
@@ -321,4 +311,31 @@ TEST(direct_dispatch_range)
}
}
+bool test_range(unsigned long offset, unsigned long length)
+{
+ nr_syscalls_emulated = 0;
+ if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, offset, length, &glob_sel))
+ return false;
+ SYSCALL_DISPATCH_ON(glob_sel);
+ return syscall(MAGIC_SYSCALL_1) == MAGIC_SYSCALL_1 && nr_syscalls_emulated == 1;
+}
+
+TEST(dispatch_range)
+{
+ ASSERT_EQ(0, setup_sigsys_handler());
+ ASSERT_EQ(0, prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &glob_sel));
+ SYSCALL_DISPATCH_ON(glob_sel);
+ ASSERT_EQ(MAGIC_SYSCALL_1, syscall(MAGIC_SYSCALL_1));
+ TH_LOG("syscall_addr=0x%lx", syscall_addr);
+ EXPECT_FALSE(test_range(syscall_addr, 1));
+ EXPECT_FALSE(test_range(syscall_addr-100, 200));
+ EXPECT_TRUE(test_range(syscall_addr+1, 100));
+ EXPECT_TRUE(test_range(syscall_addr-100, 100));
+ /* Wrap-around tests for everything except for a single PC. */
+ EXPECT_TRUE(test_range(syscall_addr+1, -1));
+ EXPECT_FALSE(test_range(syscall_addr, -1));
+ EXPECT_FALSE(test_range(syscall_addr+2, -1));
+ SYSCALL_DISPATCH_OFF(glob_sel);
+}
+
TEST_HARNESS_MAIN
--
2.48.1.601.g30ceb7b040-goog
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] syscall_user_dispatch: Allow allowed range wrap-around
2025-02-18 16:04 ` [PATCH 1/3] syscall_user_dispatch: Allow allowed range wrap-around Dmitry Vyukov
@ 2025-02-18 16:58 ` Gregory Price
2025-02-18 17:34 ` Dmitry Vyukov
0 siblings, 1 reply; 12+ messages in thread
From: Gregory Price @ 2025-02-18 16:58 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: krisman, tglx, luto, peterz, keescook, gregory.price, Marco Elver,
linux-kernel
On Tue, Feb 18, 2025 at 05:04:34PM +0100, Dmitry Vyukov wrote:
> There are two possible scenarios for syscall filtering:
> - having a trusted/allowed range of PCs, and intercepting everything else
> - or the opposite: a single untrusted/intercepted range and allowing
> everything else
> The current implementation only allows the former use case due to
> allowed range wrap-around check. Allow the latter use case as well
> by removing the wrap-around check.
> The latter use case is relevant for any kind of sandboxing scenario,
> or monitoring behavior of a single library. If a program wants to
> intercept syscalls for PC range [START, END) then it needs to call:
> prctl(..., END, -(END-START), ...);
I don't necessarily disagree with the idea, but this sounds like using
the wrong tool for the job. The purpose of SUD was for emulating
foreign OS system calls of entire programs - not a single library.
The point being that it's very difficult to sandbox an individual
library when you can't ensure it won't allocate resources outside the
monitored bounds (this would be very difficult to guarantee, at least).
If the intent is to load and re-use a single foreign-OS library, this
change seems to be the question of "why not allow multiple ranges?",
and you'd be on your way to reimplementing seccomp or BPF.
~Gregory
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] syscall_user_dispatch: Allow allowed range wrap-around
2025-02-18 16:58 ` Gregory Price
@ 2025-02-18 17:34 ` Dmitry Vyukov
2025-02-18 18:00 ` Gregory Price
0 siblings, 1 reply; 12+ messages in thread
From: Dmitry Vyukov @ 2025-02-18 17:34 UTC (permalink / raw)
To: gourry
Cc: krisman, tglx, luto, peterz, keescook, gregory.price, Marco Elver,
linux-kernel
On Tue, 18 Feb 2025 at 17:58, Gregory Price <gourry@gourry.net> wrote:
>
> On Tue, Feb 18, 2025 at 05:04:34PM +0100, Dmitry Vyukov wrote:
> > There are two possible scenarios for syscall filtering:
> > - having a trusted/allowed range of PCs, and intercepting everything else
> > - or the opposite: a single untrusted/intercepted range and allowing
> > everything else
> > The current implementation only allows the former use case due to
> > allowed range wrap-around check. Allow the latter use case as well
> > by removing the wrap-around check.
> > The latter use case is relevant for any kind of sandboxing scenario,
> > or monitoring behavior of a single library. If a program wants to
> > intercept syscalls for PC range [START, END) then it needs to call:
> > prctl(..., END, -(END-START), ...);
>
> I don't necessarily disagree with the idea, but this sounds like using
> the wrong tool for the job. The purpose of SUD was for emulating
> foreign OS system calls of entire programs - not a single library.
>
> The point being that it's very difficult to sandbox an individual
> library when you can't ensure it won't allocate resources outside the
> monitored bounds (this would be very difficult to guarantee, at least).
>
> If the intent is to load and re-use a single foreign-OS library, this
> change seems to be the question of "why not allow multiple ranges?",
> and you'd be on your way to reimplementing seccomp or BPF.
The problem with seccomp BPF is that the filter is inherited across
fork/exec which can't be used with SIGSYS and fine-grained custom
user-space policy. USER_DISPATCH is much more flexible in this regard.
Re allocating resources outside of monitored bounds: this is exactly
what syscall filtering is for, right :)
If we install a filter on a library/sandbox, we can control and
prevent it from allocating any more executable pages outside of the
range.
The motivation is sandboxing of libraries loaded within a known fixed
address range, while non-sandboxed code can live on both sides of the
sandboxed range (say, non-pie binary at lower addresses, and libc at
higher addresses).
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] syscall_user_dispatch: Allow allowed range wrap-around
2025-02-18 17:34 ` Dmitry Vyukov
@ 2025-02-18 18:00 ` Gregory Price
2025-02-19 8:54 ` Dmitry Vyukov
0 siblings, 1 reply; 12+ messages in thread
From: Gregory Price @ 2025-02-18 18:00 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: krisman, tglx, luto, peterz, keescook, gregory.price, Marco Elver,
linux-kernel
On Tue, Feb 18, 2025 at 06:34:34PM +0100, Dmitry Vyukov wrote:
> On Tue, 18 Feb 2025 at 17:58, Gregory Price <gourry@gourry.net> wrote:
> > If the intent is to load and re-use a single foreign-OS library, this
> > change seems to be the question of "why not allow multiple ranges?",
> > and you'd be on your way to reimplementing seccomp or BPF.
>
> The problem with seccomp BPF is that the filter is inherited across
> fork/exec which can't be used with SIGSYS and fine-grained custom
> user-space policy. USER_DISPATCH is much more flexible in this regard.
>
It's also fundamentally not a security-sufficient interposition system.
> Re allocating resources outside of monitored bounds: this is exactly
> what syscall filtering is for, right :)
No. SUD's purpose is to catch foreign-OS syscall execution.
You *can* do hacky stuff like interposing on libc, but it you can do
hacky things with bpf too.
> If we install a filter on a library/sandbox, we can control and
> prevent it from allocating any more executable pages outside of the
> range.
>
> The motivation is sandboxing of libraries loaded within a known fixed
> address range, while non-sandboxed code can live on both sides of the
> sandboxed range (say, non-pie binary at lower addresses, and libc at
> higher addresses).
My question is why you aren't doing the opposite. Exempt the known good
ranges and hook everything else. This actually makes it easier to
ensure the software you're hooking doesn't escape interposition.
You can use the SIGSYS register data (instruction pointer) to determine
whether to act on the syscall or pass it through.
Like I said, I don't necessarily disagree with the change, just a bit
concerned about the direction this takes SUD. It's not a sufficient
interface to isolate the behavior of a single library, and this change
naturally begs the question "If we do this, why not implement an entire
multi-range filtering system? Why stop at one range?"
~Gregory
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] syscall_user_dispatch: Allow allowed range wrap-around
2025-02-18 18:00 ` Gregory Price
@ 2025-02-19 8:54 ` Dmitry Vyukov
2025-02-19 13:29 ` Gregory Price
0 siblings, 1 reply; 12+ messages in thread
From: Dmitry Vyukov @ 2025-02-19 8:54 UTC (permalink / raw)
To: Gregory Price
Cc: krisman, tglx, luto, peterz, keescook, gregory.price, Marco Elver,
linux-kernel
On Tue, 18 Feb 2025 at 19:00, Gregory Price <gourry@gourry.net> wrote:
>
> On Tue, Feb 18, 2025 at 06:34:34PM +0100, Dmitry Vyukov wrote:
> > On Tue, 18 Feb 2025 at 17:58, Gregory Price <gourry@gourry.net> wrote:
> > > If the intent is to load and re-use a single foreign-OS library, this
> > > change seems to be the question of "why not allow multiple ranges?",
> > > and you'd be on your way to reimplementing seccomp or BPF.
> >
> > The problem with seccomp BPF is that the filter is inherited across
> > fork/exec which can't be used with SIGSYS and fine-grained custom
> > user-space policy. USER_DISPATCH is much more flexible in this regard.
>
> It's also fundamentally not a security-sufficient interposition system.
>
> > Re allocating resources outside of monitored bounds: this is exactly
> > what syscall filtering is for, right :)
>
> No. SUD's purpose is to catch foreign-OS syscall execution.
My understanding is that aiming at concrete end problems is not the
kernel approach and design philosophy. Instead it aims at providing
flexible _primitives_ that can be used to solve various end problems.
It's like you are not selling pencils to draw trees, instead you just
sell good pencils.
E.g. if there are 2 end problems A and B that require 98% of the same
primitives, the kernel wouldn't implement 2 completely independent
subsystems to solve A and B that duplicate 98% of the code. Instead it
would provide flexible primitives that can be used to solve A and B
(and yet unknown C and D in future).
> You *can* do hacky stuff like interposing on libc, but it you can do
> hacky things with bpf too.
>
> > If we install a filter on a library/sandbox, we can control and
> > prevent it from allocating any more executable pages outside of the
> > range.
> >
> > The motivation is sandboxing of libraries loaded within a known fixed
> > address range, while non-sandboxed code can live on both sides of the
> > sandboxed range (say, non-pie binary at lower addresses, and libc at
> > higher addresses).
>
> My question is why you aren't doing the opposite. Exempt the known good
> ranges and hook everything else. This actually makes it easier to
> ensure the software you're hooking doesn't escape interposition.
The restricted code is a single continuous region. Allowed code lives
on both sides (non-pie binary at the lowest addresses, dynamic libs at
the highest addresses, and there is not enough space before and after
to map a large enough contiguous region for restricted code).
> You can use the SIGSYS register data (instruction pointer) to determine
> whether to act on the syscall or pass it through.
Too expensive (few additional instructions vs microseconds for kernel
transition, sigcontext setup, and sigreturn).
> Like I said, I don't necessarily disagree with the change, just a bit
> concerned about the direction this takes SUD. It's not a sufficient
> interface to isolate the behavior of a single library, and this change
> naturally begs the question "If we do this, why not implement an entire
> multi-range filtering system? Why stop at one range?"
That's a harder question. I think we don't need to answer it right
now. We can just consider this proposal in isolation. This is where we
stop now.
It preserves all of the existing uses intact + allows more cases with
a trivial code change (actually deleting code).
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] syscall_user_dispatch: Allow allowed range wrap-around
2025-02-19 8:54 ` Dmitry Vyukov
@ 2025-02-19 13:29 ` Gregory Price
0 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2025-02-19 13:29 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: krisman, tglx, luto, peterz, keescook, gregory.price, Marco Elver,
linux-kernel
On Wed, Feb 19, 2025 at 09:54:28AM +0100, Dmitry Vyukov wrote:
> On Tue, 18 Feb 2025 at 19:00, Gregory Price <gourry@gourry.net> wrote:
> >
> > No. SUD's purpose is to catch foreign-OS syscall execution.
>
> My understanding is that aiming at concrete end problems is not the
> kernel approach and design philosophy. Instead it aims at providing
> flexible _primitives_ that can be used to solve various end problems.
> It's like you are not selling pencils to draw trees, instead you just
> sell good pencils.
Fair point, apologies for being blunt here.
Mostly i'm concerned at the use of SUD for "sandboxing" when it is not
itself a sufficient isolation mechanism for this - it's far too easy to
bypass and has basically no guardrails to enforce it.
But it's not my place to say how to use features in userland (also I've
written a lib-c interposer for fun with SUD anyway so *shrug*).
> > Like I said, I don't necessarily disagree with the change, just a bit
> > concerned about the direction this takes SUD. It's not a sufficient
> > interface to isolate the behavior of a single library, and this change
> > naturally begs the question "If we do this, why not implement an entire
> > multi-range filtering system? Why stop at one range?"
>
> That's a harder question. I think we don't need to answer it right
> now. We can just consider this proposal in isolation. This is where we
> stop now.
> It preserves all of the existing uses intact + allows more cases with
> a trivial code change (actually deleting code).
You may also want to update
Documentation/admin-guide/syscall-user-dispatch.rst
to have some examples w/ wrap-around, since most people would not even
think about this as a possibility.
(also, i'm not sure if you've sent a cover letter or not, but lore seems
to suggest it is missing - and it did not show up in my box)
~Gregory
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/3] selftests: Fix errno checking in syscall_user_dispatch test
2025-02-18 16:04 ` [PATCH 2/3] selftests: Fix errno checking in syscall_user_dispatch test Dmitry Vyukov
@ 2025-02-19 17:19 ` Kees Cook
0 siblings, 0 replies; 12+ messages in thread
From: Kees Cook @ 2025-02-19 17:19 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: krisman, tglx, luto, peterz, gregory.price, Marco Elver,
linux-kernel
On Tue, Feb 18, 2025 at 05:04:35PM +0100, Dmitry Vyukov wrote:
> Successful syscalls don't change errno, so checking errno is wrong
> to ensure that a syscall has failed. For example for the following
> sequence:
>
> prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0xff, 0);
> EXPECT_EQ(EINVAL, errno);
> prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x0, &sel);
> EXPECT_EQ(EINVAL, errno);
>
> only the first syscall may fail and set errno, but the second may succeed
> and keep errno intact, and the check will falsely pass.
> Or if errno happened to be EINVAL before, even the first check may falsely
> pass.
>
> Also use EXPECT/ASSERT consistently. Currently there is an inconsistent mix
> without obvious reasons for usage of one or another.
>
> Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Yeah, these all look good to me.
--
Kees Cook
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 3/3] selftests: Extend syscall_user_dispatch test to check allowed range
2025-02-18 16:04 ` [PATCH 3/3] selftests: Extend syscall_user_dispatch test to check allowed range Dmitry Vyukov
@ 2025-02-20 15:17 ` Gregory Price
2025-02-24 8:48 ` Dmitry Vyukov
0 siblings, 1 reply; 12+ messages in thread
From: Gregory Price @ 2025-02-20 15:17 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: krisman, tglx, luto, peterz, keescook, gregory.price, Marco Elver,
linux-kernel
On Tue, Feb 18, 2025 at 05:04:36PM +0100, Dmitry Vyukov wrote:
> diff --git a/tools/testing/selftests/syscall_user_dispatch/sud_test.c b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
> index b0969925ec64c..fa40e46e6d3e9 100644
> --- a/tools/testing/selftests/syscall_user_dispatch/sud_test.c
> +++ b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
... snip ...
> @@ -110,31 +111,15 @@ TEST(bad_prctl_param)
> /* PR_SYS_DISPATCH_ON */
> op = PR_SYS_DISPATCH_ON;
>
> - /* Dispatcher region is bad (offset > 0 && len == 0) */
> - EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, &sel));
> - EXPECT_EQ(EINVAL, errno);
> - EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, -1L, 0x0, &sel));
> - EXPECT_EQ(EINVAL, errno);
> + /* All ranges are allowed */
> + EXPECT_EQ(0, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, &sel));
> + EXPECT_EQ(0, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, -1L, 0x0, &sel));
A 0 length is ambiguous and nonsensical in every other context, not sure
why you'd allow it here.
... snip ...
> +bool test_range(unsigned long offset, unsigned long length)
> +{
> + nr_syscalls_emulated = 0;
> + if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, offset, length, &glob_sel))
> + return false;
This creates an ambiguous failure state for your test. Is it failing
because the range is bad or because you didn't intercept a syscall?
Better to be more explicit here. It makes it difficult to understand
what each individual test is doing at a glance.
> + SYSCALL_DISPATCH_ON(glob_sel);
> + return syscall(MAGIC_SYSCALL_1) == MAGIC_SYSCALL_1 && nr_syscalls_emulated == 1;
> +}
> +
> +TEST(dispatch_range)
> +{
> + ASSERT_EQ(0, setup_sigsys_handler());
> + ASSERT_EQ(0, prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &glob_sel));
> + SYSCALL_DISPATCH_ON(glob_sel);
> + ASSERT_EQ(MAGIC_SYSCALL_1, syscall(MAGIC_SYSCALL_1));
> + TH_LOG("syscall_addr=0x%lx", syscall_addr);
> + EXPECT_FALSE(test_range(syscall_addr, 1));
> + EXPECT_FALSE(test_range(syscall_addr-100, 200));
> + EXPECT_TRUE(test_range(syscall_addr+1, 100));
> + EXPECT_TRUE(test_range(syscall_addr-100, 100));
> + /* Wrap-around tests for everything except for a single PC. */
> + EXPECT_TRUE(test_range(syscall_addr+1, -1));
> + EXPECT_FALSE(test_range(syscall_addr, -1));
> + EXPECT_FALSE(test_range(syscall_addr+2, -1));
If you are planning to include 0 as an allowed length, you need to
demonstrate what it does.
> + SYSCALL_DISPATCH_OFF(glob_sel);
> +}
> +
> TEST_HARNESS_MAIN
> --
> 2.48.1.601.g30ceb7b040-goog
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 3/3] selftests: Extend syscall_user_dispatch test to check allowed range
2025-02-20 15:17 ` Gregory Price
@ 2025-02-24 8:48 ` Dmitry Vyukov
2025-03-03 15:57 ` Gregory Price
0 siblings, 1 reply; 12+ messages in thread
From: Dmitry Vyukov @ 2025-02-24 8:48 UTC (permalink / raw)
To: Gregory Price
Cc: krisman, tglx, luto, peterz, keescook, gregory.price, Marco Elver,
linux-kernel
On Thu, 20 Feb 2025 at 16:17, Gregory Price <gourry@gourry.net> wrote:
>
> On Tue, Feb 18, 2025 at 05:04:36PM +0100, Dmitry Vyukov wrote:
> > diff --git a/tools/testing/selftests/syscall_user_dispatch/sud_test.c b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
> > index b0969925ec64c..fa40e46e6d3e9 100644
> > --- a/tools/testing/selftests/syscall_user_dispatch/sud_test.c
> > +++ b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
> ... snip ...
> > @@ -110,31 +111,15 @@ TEST(bad_prctl_param)
> > /* PR_SYS_DISPATCH_ON */
> > op = PR_SYS_DISPATCH_ON;
> >
> > - /* Dispatcher region is bad (offset > 0 && len == 0) */
> > - EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, &sel));
> > - EXPECT_EQ(EINVAL, errno);
> > - EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, -1L, 0x0, &sel));
> > - EXPECT_EQ(EINVAL, errno);
> > + /* All ranges are allowed */
> > + EXPECT_EQ(0, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, &sel));
> > + EXPECT_EQ(0, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, -1L, 0x0, &sel));
>
> A 0 length is ambiguous and nonsensical in every other context, not sure
> why you'd allow it here.
Yes, but it's also not special in any way. One asks for a range of N
bytes, one gets a range of N bytes. Works for 0 as well. We can move 0
to an own special category, and add production code to support that.
But I don't see strong reasons to do that. It's like it's possible to
do read/write with 0 bytes to get, well, exactly that.
How strong do you feel about special casing 0?
> > +bool test_range(unsigned long offset, unsigned long length)
> > +{
> > + nr_syscalls_emulated = 0;
> > + if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, offset, length, &glob_sel))
> > + return false;
>
> This creates an ambiguous failure state for your test. Is it failing
> because the range is bad or because you didn't intercept a syscall?
>
> Better to be more explicit here. It makes it difficult to understand
> what each individual test is doing at a glance.
Good point. Done in v2.
> > + SYSCALL_DISPATCH_ON(glob_sel);
> > + return syscall(MAGIC_SYSCALL_1) == MAGIC_SYSCALL_1 && nr_syscalls_emulated == 1;
> > +}
> > +
> > +TEST(dispatch_range)
> > +{
> > + ASSERT_EQ(0, setup_sigsys_handler());
> > + ASSERT_EQ(0, prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, 0, 0, &glob_sel));
> > + SYSCALL_DISPATCH_ON(glob_sel);
> > + ASSERT_EQ(MAGIC_SYSCALL_1, syscall(MAGIC_SYSCALL_1));
> > + TH_LOG("syscall_addr=0x%lx", syscall_addr);
> > + EXPECT_FALSE(test_range(syscall_addr, 1));
> > + EXPECT_FALSE(test_range(syscall_addr-100, 200));
> > + EXPECT_TRUE(test_range(syscall_addr+1, 100));
> > + EXPECT_TRUE(test_range(syscall_addr-100, 100));
> > + /* Wrap-around tests for everything except for a single PC. */
> > + EXPECT_TRUE(test_range(syscall_addr+1, -1));
> > + EXPECT_FALSE(test_range(syscall_addr, -1));
> > + EXPECT_FALSE(test_range(syscall_addr+2, -1));
>
> If you are planning to include 0 as an allowed length, you need to
> demonstrate what it does.
Done in v2.
> > + SYSCALL_DISPATCH_OFF(glob_sel);
> > +}
> > +
> > TEST_HARNESS_MAIN
> > --
> > 2.48.1.601.g30ceb7b040-goog
> >
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 3/3] selftests: Extend syscall_user_dispatch test to check allowed range
2025-02-24 8:48 ` Dmitry Vyukov
@ 2025-03-03 15:57 ` Gregory Price
0 siblings, 0 replies; 12+ messages in thread
From: Gregory Price @ 2025-03-03 15:57 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: krisman, tglx, luto, peterz, keescook, gregory.price, Marco Elver,
linux-kernel
On Mon, Feb 24, 2025 at 09:48:19AM +0100, Dmitry Vyukov wrote:
> On Thu, 20 Feb 2025 at 16:17, Gregory Price <gourry@gourry.net> wrote:
> >
> > On Tue, Feb 18, 2025 at 05:04:36PM +0100, Dmitry Vyukov wrote:
> > > diff --git a/tools/testing/selftests/syscall_user_dispatch/sud_test.c b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
> > > index b0969925ec64c..fa40e46e6d3e9 100644
> > > --- a/tools/testing/selftests/syscall_user_dispatch/sud_test.c
> > > +++ b/tools/testing/selftests/syscall_user_dispatch/sud_test.c
> > ... snip ...
> > > @@ -110,31 +111,15 @@ TEST(bad_prctl_param)
> > > /* PR_SYS_DISPATCH_ON */
> > > op = PR_SYS_DISPATCH_ON;
> > >
> > > - /* Dispatcher region is bad (offset > 0 && len == 0) */
> > > - EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, &sel));
> > > - EXPECT_EQ(EINVAL, errno);
> > > - EXPECT_EQ(-1, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, -1L, 0x0, &sel));
> > > - EXPECT_EQ(EINVAL, errno);
> > > + /* All ranges are allowed */
> > > + EXPECT_EQ(0, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x1, 0x0, &sel));
> > > + EXPECT_EQ(0, prctl(PR_SET_SYSCALL_USER_DISPATCH, op, -1L, 0x0, &sel));
> >
> > A 0 length is ambiguous and nonsensical in every other context, not sure
> > why you'd allow it here.
>
> Yes, but it's also not special in any way. One asks for a range of N
> bytes, one gets a range of N bytes.
It's specialy in the sense that it's nonsensical :]
I don't know what the prevaling opinion here is, but it seems pretty
obvious that a 0-length is almost certainly a mistake and reasonbly
should result in an EINVAL.
I don't feel strongly about this though.
~Gregory
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2025-03-03 15:57 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <cover.1739894594.git.dvyukov@google.com>
2025-02-18 16:04 ` [PATCH 1/3] syscall_user_dispatch: Allow allowed range wrap-around Dmitry Vyukov
2025-02-18 16:58 ` Gregory Price
2025-02-18 17:34 ` Dmitry Vyukov
2025-02-18 18:00 ` Gregory Price
2025-02-19 8:54 ` Dmitry Vyukov
2025-02-19 13:29 ` Gregory Price
2025-02-18 16:04 ` [PATCH 2/3] selftests: Fix errno checking in syscall_user_dispatch test Dmitry Vyukov
2025-02-19 17:19 ` Kees Cook
2025-02-18 16:04 ` [PATCH 3/3] selftests: Extend syscall_user_dispatch test to check allowed range Dmitry Vyukov
2025-02-20 15:17 ` Gregory Price
2025-02-24 8:48 ` Dmitry Vyukov
2025-03-03 15:57 ` Gregory Price
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox