* [PATCH 0/7] um: skas: harden the seccomp userspace stub
@ 2026-06-20 3:22 Cong Wang
2026-06-20 3:22 ` [PATCH 1/7] um: skas: create a seccomp USER_NOTIF listener and hand it to the monitor Cong Wang
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: Cong Wang @ 2026-06-20 3:22 UTC (permalink / raw)
To: Richard Weinberger, Anton Ivanov, Johannes Berg, linux-um; +Cc: Benjamin Berg
From: Cong Wang <cwang@multikernel.io>
In the seccomp ("SECCOMP") userspace mode, each guest userspace process
runs in a stub under a seccomp filter and traps to the monitor (the UML
kernel) on every syscall. Two items on the stub.c "Known security issues"
list could not be addressed by the filter alone:
 - a hijacked stub could mmap() arbitrary physmem offsets, which is an
   intra-guest disclosure and, on this base (single physmem fd, no
   kernel/user split), a host escape; and
 - a hijacked stub could block SIGALRM via a crafted rt_sigreturn to
   evade preemption and wedge the monitor indefinitely.
This series closes both:
  1-2: route the stub's mmap() through a SECCOMP_RET_USER_NOTIF listener
       owned by the monitor (no behavioural change yet).
  3-4: validate each mmap() against the mm's page table -- allowed iff the
       PTE already maps the requested frame with no more access than it
       grants -- including out-of-batch mmaps a hijacked stub issues on
       its own.
  5:   route and validate munmap() the same way (range-confined below
       STUB_START).
  6:   add a watchdog thread that detects a stub which stops reporting
       back (e.g. blocked SIGALRM) and SIGKILLs it, letting the monitor
       recover via the existing teardown.
  7:   drop the now-resolved "Known security issues" note and refresh the
       seccomp= help text.
After the series a hijacked stub is confined to the frames its own page
tables reference and can no longer reach arbitrary guest/host memory; one
that evades preemption is detected out of band and killed rather than
wedging the monitor.
Verified on UML (UP and 2-CPU SMP): boots and survives fork/exec storms
and heavy mmap/munmap churn with zero false denials or false kills; an
artificially SIGALRM-blocked busy loop is killed in ~5s and the monitor
recovers, while syscall-making processes are untouched. Each patch builds
and the series is bisectable.
---
Cong Wang (7):
um: skas: create a seccomp USER_NOTIF listener and hand it to the
monitor
um: skas: gate stub mmap() through the USER_NOTIF monitor
um: skas: validate stub mmap() against the guest page table
um: skas: handle out-of-batch stub mmap notifications
um: skas: validate stub munmap() against the guest address range
um: skas: kill stubs that block SIGALRM via a watchdog thread
um: skas: refresh stub security notes after closing the known issues
arch/um/include/shared/skas/mm_id.h | 1 +
arch/um/include/shared/skas/skas.h | 5 +
arch/um/kernel/skas/stub.c | 22 --
arch/um/kernel/skas/stub_exe.c | 19 +-
arch/um/kernel/skas/uaccess.c | 48 +++++
arch/um/os-Linux/skas/process.c | 315 ++++++++++++++++++++++++----
arch/um/os-Linux/start_up.c | 6 -
7 files changed, 344 insertions(+), 72 deletions(-)
base-commit: 1a3746ccbb0a97bed3c06ccde6b880013b1dddc1
--
2.43.0
* [PATCH 1/7] um: skas: create a seccomp USER_NOTIF listener and hand it to the monitor
From: Cong Wang @ 2026-06-20 3:22 UTC (permalink / raw)
To: Richard Weinberger, Anton Ivanov, Johannes Berg, linux-um; +Cc: Benjamin Berg
From: Cong Wang <cwang@multikernel.io>
First step toward validating stub mmap() calls (the intra-guest /
host-escape disclosure issue in stub.c "Known security issues"): give the
monitor a SECCOMP_RET_USER_NOTIF listener for each stub's filter, so a
later change can route mmap to the monitor for per-call validation.
The stub installs its seccomp filter with NEW_LISTENER (plus TSYNC_ESRCH,
which the kernel requires to combine NEW_LISTENER with TSYNC); the
seccomp() return value is the listener fd. mmap still returns RET_ALLOW,
so there is no behavioural change yet.
The stub cannot hand the fd over itself: once the filter is installed,
every syscall it makes from outside the stub page traps with SIGSYS
instead of executing, so it can neither sendmsg() nor close() the fd.
Instead the monitor pulls it with pidfd_getfd(): the listener is, by
construction, fd 1 in the stub (close_range() left only fd 0 open before
seccomp() allocated it). Leaving the stub's copy open is harmless:
ioctl (NOTIF_RECV/SEND) is not on the syscall allowlist, so a hijacked
stub cannot self-approve. The monitor stores the fd in mm_id.
Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
arch/um/include/shared/skas/mm_id.h | 1 +
arch/um/kernel/skas/stub_exe.c | 12 ++++++---
arch/um/os-Linux/skas/process.c | 38 +++++++++++++++++++++++++++--
3 files changed, 45 insertions(+), 6 deletions(-)
diff --git a/arch/um/include/shared/skas/mm_id.h b/arch/um/include/shared/skas/mm_id.h
index 18c0621430d2..46164d71554b 100644
--- a/arch/um/include/shared/skas/mm_id.h
+++ b/arch/um/include/shared/skas/mm_id.h
@@ -17,6 +17,7 @@ struct mm_id {
/* Only used with SECCOMP mode */
int sock;
+ int seccomp_notify_fd;
int syscall_fd_num;
int syscall_fd_map[STUB_MAX_FDS];
};
diff --git a/arch/um/kernel/skas/stub_exe.c b/arch/um/kernel/skas/stub_exe.c
index cbafaa684e66..b5432f6ccbc7 100644
--- a/arch/um/kernel/skas/stub_exe.c
+++ b/arch/um/kernel/skas/stub_exe.c
@@ -196,10 +196,14 @@ noinline static void real_init(void)
.len = sizeof(filter) / sizeof(filter[0]),
.filter = filter,
};
-
- if (stub_syscall3(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
- SECCOMP_FILTER_FLAG_TSYNC,
- (unsigned long)&prog) != 0)
+ long listener;
+
+ listener = stub_syscall3(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
+ SECCOMP_FILTER_FLAG_TSYNC |
+ SECCOMP_FILTER_FLAG_NEW_LISTENER |
+ SECCOMP_FILTER_FLAG_TSYNC_ESRCH,
+ (unsigned long)&prog);
+ if (listener < 0)
stub_syscall1(__NR_exit, 21);
/* Fall through, the exit syscall will cause SIGSYS */
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index d6c22f8aa06d..0ab1d109a68d 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -428,6 +428,34 @@ __initcall(init_stub_exe_fd);
int using_seccomp;
+/*
+ * Obtain the SECCOMP_RET_USER_NOTIF listener fd the stub created. The stub
+ * cannot hand it over itself: after installing the filter, every syscall it
+ * makes traps with SIGSYS rather than executing (so it can neither sendmsg()
+ * it nor close() it). Instead the listener is, by construction, the first free
+ * fd in the stub -- fd 1, since close_range() left only fd 0 open before
+ * seccomp() allocated it -- so the monitor duplicates it directly with
+ * pidfd_getfd(). ptrace_may_access() holds because the monitor is the stub's
+ * parent.
+ */
+#define STUB_LISTENER_FD 1
+static int get_stub_listener(struct mm_id *mm_id)
+{
+ int pidfd, lfd;
+
+ pidfd = syscall(__NR_pidfd_open, mm_id->pid, 0);
+ if (pidfd < 0)
+ return -errno;
+
+ lfd = syscall(__NR_pidfd_getfd, pidfd, STUB_LISTENER_FD, 0);
+ close(pidfd);
+ if (lfd < 0)
+ return -errno;
+
+ mm_id->seccomp_notify_fd = lfd;
+ return 0;
+}
+
/**
* start_userspace() - prepare a new userspace process
* @mm_id: The corresponding struct mm_id
@@ -449,6 +477,8 @@ int start_userspace(struct mm_id *mm_id)
unsigned long sp;
int status, n, err;
+ mm_id->seccomp_notify_fd = -1;
+
/* setup a temporary stack page */
stack = mmap(NULL, UM_KERN_PAGE_SIZE,
PROT_READ | PROT_WRITE | PROT_EXEC,
@@ -522,10 +552,14 @@ int start_userspace(struct mm_id *mm_id)
}
close(tramp_data.sockpair[0]);
- if (using_seccomp)
+ if (using_seccomp) {
mm_id->sock = tramp_data.sockpair[1];
- else
+ err = get_stub_listener(mm_id);
+ if (err)
+ goto out_kill;
+ } else {
close(tramp_data.sockpair[1]);
+ }
return 0;
--
2.43.0
* [PATCH 2/7] um: skas: gate stub mmap() through the USER_NOTIF monitor
From: Cong Wang @ 2026-06-20 3:22 UTC (permalink / raw)
To: Richard Weinberger, Anton Ivanov, Johannes Berg, linux-um; +Cc: Benjamin Berg
From: Cong Wang <cwang@multikernel.io>
Route the stub's mmap to SECCOMP_RET_USER_NOTIF instead of RET_ALLOW, and
have the monitor service the resulting notifications inline. This is the
mechanism that a following change uses to validate each mmap's arguments;
for now every stub mmap is allowed (responded with
SECCOMP_USER_NOTIF_FLAG_CONTINUE), so behaviour is unchanged.
CONTINUE is safe for mmap: its arguments are all scalars captured in
seccomp_data, so there is no TOCTOU re-read of user memory.
The stub runs queued mmap batches in two places: syscall_stub_flush()
and the userspace() resume path, both of which wake the stub via
wait_stub_done_seccomp(). Servicing therefore lives there: after waking,
the monitor issues one NOTIF_RECV/CONTINUE per STUB_SYSCALL_MMAP in the
batch the stub is about to run. Signals are masked while the stub
executes the batch inside its SIGSYS handler, so notifications arrive in
queued order with nothing interleaved, and a simple counted loop is
sufficient. The pre-existing wake logic is factored into
wake_seccomp_stub() so both the wake and the wait-only paths share it.
Verified on UML: guest boots and survives a fork/exec storm plus heavy
demand paging with every stub mmap round-tripping through the monitor.
Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
arch/um/kernel/skas/stub_exe.c | 5 +-
arch/um/os-Linux/skas/process.c | 121 ++++++++++++++++++++++----------
2 files changed, 89 insertions(+), 37 deletions(-)
diff --git a/arch/um/kernel/skas/stub_exe.c b/arch/um/kernel/skas/stub_exe.c
index b5432f6ccbc7..65ea2af5ca73 100644
--- a/arch/um/kernel/skas/stub_exe.c
+++ b/arch/um/kernel/skas/stub_exe.c
@@ -173,7 +173,7 @@ noinline static void real_init(void)
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,__NR_close,
5, 0),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, STUB_MMAP_NR,
- 4, 0),
+ 5, 0),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_munmap,
3, 0),
#ifdef __i386__
@@ -191,6 +191,9 @@ noinline static void real_init(void)
/* [18] Permitted call for the stub */
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
+
+ /* [19] mmap: route to the monitor for validation */
+ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
};
struct sock_fprog prog = {
.len = sizeof(filter) / sizeof(filter[0]),
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index 0ab1d109a68d..63b426b2c523 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -29,7 +29,9 @@
#include <sysdep/stub.h>
#include <sysdep/mcontext.h>
#include <linux/futex.h>
+#include <linux/seccomp.h>
#include <linux/threads.h>
+#include <sys/ioctl.h>
#include <timetravel.h>
#include <asm-generic/rwonce.h>
#include "../internal.h"
@@ -147,49 +149,89 @@ void wait_stub_done(int pid)
fatal_sigsegv();
}
+static void wake_seccomp_stub(struct mm_id *mm_idp)
+{
+ struct stub_data *data = (void *)mm_idp->stack;
+ const char byte = 0;
+ struct iovec iov = {
+ .iov_base = (void *)&byte,
+ .iov_len = sizeof(byte),
+ };
+ union {
+ char data[CMSG_SPACE(sizeof(mm_idp->syscall_fd_map))];
+ struct cmsghdr align;
+ } ctrl;
+ struct msghdr msgh = {
+ .msg_iov = &iov,
+ .msg_iovlen = 1,
+ };
+
+ if (mm_idp->syscall_fd_num) {
+ unsigned int fds_size = sizeof(int) * mm_idp->syscall_fd_num;
+ struct cmsghdr *cmsg;
+
+ msgh.msg_control = ctrl.data;
+ msgh.msg_controllen = CMSG_SPACE(fds_size);
+ cmsg = CMSG_FIRSTHDR(&msgh);
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SCM_RIGHTS;
+ cmsg->cmsg_len = CMSG_LEN(fds_size);
+ memcpy(CMSG_DATA(cmsg), mm_idp->syscall_fd_map, fds_size);
+
+ CATCH_EINTR(syscall(__NR_sendmsg, mm_idp->sock, &msgh, 0));
+ }
+
+ data->signal = 0;
+ data->futex = FUTEX_IN_CHILD;
+ CATCH_EINTR(syscall(__NR_futex, &data->futex,
+ FUTEX_WAKE, 1, NULL, NULL, 0));
+}
+
+static int seccomp_notify_serve(int notify_fd)
+{
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+ int ret;
+
+ CATCH_EINTR(ret = ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req));
+ if (ret < 0)
+ return -errno;
+
+ resp.id = req.id;
+ resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+
+ CATCH_EINTR(ret = ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp));
+ if (ret < 0)
+ return -errno;
+
+ return 0;
+}
+
+static void seccomp_serve_mmaps(struct mm_id *mm_idp)
+{
+ struct stub_data *data = (void *)mm_idp->stack;
+ int i, n_mmaps = 0;
+
+ if (mm_idp->seccomp_notify_fd < 0)
+ return;
+
+ for (i = 0; i < data->syscall_data_len; i++)
+ if (data->syscall_data[i].syscall == STUB_SYSCALL_MMAP)
+ n_mmaps++;
+
+ for (i = 0; i < n_mmaps; i++)
+ seccomp_notify_serve(mm_idp->seccomp_notify_fd);
+}
+
void wait_stub_done_seccomp(struct mm_id *mm_idp, int running, int wait_sigsys)
{
struct stub_data *data = (void *)mm_idp->stack;
int ret;
do {
- const char byte = 0;
- struct iovec iov = {
- .iov_base = (void *)&byte,
- .iov_len = sizeof(byte),
- };
- union {
- char data[CMSG_SPACE(sizeof(mm_idp->syscall_fd_map))];
- struct cmsghdr align;
- } ctrl;
- struct msghdr msgh = {
- .msg_iov = &iov,
- .msg_iovlen = 1,
- };
-
if (!running) {
- if (mm_idp->syscall_fd_num) {
- unsigned int fds_size =
- sizeof(int) * mm_idp->syscall_fd_num;
- struct cmsghdr *cmsg;
-
- msgh.msg_control = ctrl.data;
- msgh.msg_controllen = CMSG_SPACE(fds_size);
- cmsg = CMSG_FIRSTHDR(&msgh);
- cmsg->cmsg_level = SOL_SOCKET;
- cmsg->cmsg_type = SCM_RIGHTS;
- cmsg->cmsg_len = CMSG_LEN(fds_size);
- memcpy(CMSG_DATA(cmsg), mm_idp->syscall_fd_map,
- fds_size);
-
- CATCH_EINTR(syscall(__NR_sendmsg, mm_idp->sock,
- &msgh, 0));
- }
-
- data->signal = 0;
- data->futex = FUTEX_IN_CHILD;
- CATCH_EINTR(syscall(__NR_futex, &data->futex,
- FUTEX_WAKE, 1, NULL, NULL, 0));
+ wake_seccomp_stub(mm_idp);
+ seccomp_serve_mmaps(mm_idp);
}
do {
@@ -246,6 +288,13 @@ void wait_stub_done_seccomp(struct mm_id *mm_idp, int running, int wait_sigsys)
fatal_sigsegv();
}
+/*
+ * Service one SECCOMP_RET_USER_NOTIF notification from a stub mmap: read the
+ * suspended call, then respond CONTINUE so the stub's real mmap runs. CONTINUE
+ * is safe here because mmap takes only scalar arguments (no TOCTOU on user
+ * memory). Validation of (addr, len, prot, fd, offset) is added later; for now
+ * every stub mmap is allowed.
+ */
extern unsigned long current_stub_stack(void);
static void get_skas_faultinfo(int pid, struct faultinfo *fi)
--
2.43.0
* [PATCH 3/7] um: skas: validate stub mmap() against the guest page table
From: Cong Wang @ 2026-06-20 3:22 UTC (permalink / raw)
To: Richard Weinberger, Anton Ivanov, Johannes Berg, linux-um; +Cc: Benjamin Berg
From: Cong Wang <cwang@multikernel.io>
Replace the allow-all USER_NOTIF response with the actual security check.
When a stub mmap traps to the monitor, validate its arguments against the
mm's page table: the call is allowed iff this mm's PTE for the target
address maps exactly the physical frame named by the mmap offset, and the
requested protection does not exceed what the PTE grants. Otherwise the
mmap is rejected with -EPERM.
This is what closes the disclosure issue documented in stub.c. A hijacked
stub (jumping to the in-stub mmap with crafted registers) can no longer
choose an arbitrary physmem offset: it is confined to the frames its own
mm's PTEs already reference. On this base (single physmem fd, no
kernel/user split) the same check also blocks host escape: a mapping of
a UML-kernel frame has no authorizing user PTE, so it is denied.
The check is a pure function of state the monitor already owns (it *is*
the UML kernel and holds the guest pgd, freshly synced before the stub
mmap is issued), so it needs no per-batch bookkeeping or fd-identity
tracking. stub_mmap_allowed() lives in skas/uaccess.c next to
virt_to_pte(); the os-Linux notify handler calls it and responds CONTINUE
or -EPERM.
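The rule described above can be modelled in isolation. The following is a minimal sketch only, assuming a simplified, hypothetical PTE layout (frame address in the high bits, toy flag bits in the low bits) and the 64-bit case where the mmap offset is a byte offset; none of the toy_* names exist in the tree, and the real check uses virt_to_pte(), pte_present(), pte_write() and pte_exec().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified PTE model; this is NOT the UML pte_t layout. */
#define TOY_PAGE_SHIFT  12
#define TOY_PAGE_MASK   (~((UINT64_C(1) << TOY_PAGE_SHIFT) - 1))
#define TOY_PTE_PRESENT UINT64_C(0x1)
#define TOY_PTE_WRITE   UINT64_C(0x2)
#define TOY_PTE_EXEC    UINT64_C(0x4)

/* Same numeric values as PROT_READ/PROT_WRITE/PROT_EXEC on Linux. */
#define TOY_PROT_READ   0x1UL
#define TOY_PROT_WRITE  0x2UL
#define TOY_PROT_EXEC   0x4UL

struct toy_pte {
	uint64_t val;	/* frame address in the high bits, flags in the low bits */
};

/*
 * Allow the mapping only if the PTE is present, names exactly the frame
 * given by the mmap offset, and grants at least the requested access.
 */
bool toy_mmap_allowed(const struct toy_pte *pte, unsigned long prot,
		      uint64_t offset)
{
	if (pte == NULL || !(pte->val & TOY_PTE_PRESENT))
		return false;

	/* Must map exactly the frame this PTE references. */
	if ((pte->val & TOY_PAGE_MASK) != offset)
		return false;

	/* Must not grant more access than the PTE allows. */
	if ((prot & TOY_PROT_WRITE) && !(pte->val & TOY_PTE_WRITE))
		return false;
	if ((prot & TOY_PROT_EXEC) && !(pte->val & TOY_PTE_EXEC))
		return false;

	return true;
}

/* A few representative cases for a present, read-only PTE at frame 0x5000. */
void toy_mmap_examples(void)
{
	struct toy_pte ro = { .val = UINT64_C(0x5000) | TOY_PTE_PRESENT };

	assert(toy_mmap_allowed(&ro, TOY_PROT_READ, 0x5000));
	assert(!toy_mmap_allowed(&ro, TOY_PROT_READ | TOY_PROT_WRITE, 0x5000));
	assert(!toy_mmap_allowed(&ro, TOY_PROT_READ, 0x6000));
}
```

In this model a hijacked caller can vary prot and offset freely, yet the only accepted combinations are those the PTE already authorizes, which is the confinement property the commit message claims.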
Verified on UML: guest boots and survives a fork/exec storm plus heavy
demand paging with zero false denials.
Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
arch/um/include/shared/skas/skas.h | 3 +++
arch/um/kernel/skas/uaccess.c | 36 ++++++++++++++++++++++++++++++
arch/um/os-Linux/skas/process.c | 21 ++++++++---------
3 files changed, 50 insertions(+), 10 deletions(-)
diff --git a/arch/um/include/shared/skas/skas.h b/arch/um/include/shared/skas/skas.h
index 2237ffedec75..ce1b67b06b4b 100644
--- a/arch/um/include/shared/skas/skas.h
+++ b/arch/um/include/shared/skas/skas.h
@@ -18,4 +18,7 @@ extern void current_mm_sync(void);
void initial_jmpbuf_lock(void);
void initial_jmpbuf_unlock(void);
+int stub_mmap_allowed(struct mm_id *id, unsigned long addr,
+ unsigned long prot, unsigned long offset);
+
#endif
diff --git a/arch/um/kernel/skas/uaccess.c b/arch/um/kernel/skas/uaccess.c
index caef1deef795..9359ede8a04b 100644
--- a/arch/um/kernel/skas/uaccess.c
+++ b/arch/um/kernel/skas/uaccess.c
@@ -13,6 +13,18 @@
#include <kern_util.h>
#include <asm/futex.h>
#include <os.h>
+#include <skas.h>
+
+/*
+ * Same mapping as MMAP_OFFSET() in <sysdep/stub.h>, but usable from kernel
+ * code (that header pulls in the host <sys/mman.h>). 64-bit stubs use mmap()
+ * with a byte offset; 32-bit stubs use mmap2() with a page offset.
+ */
+#ifdef CONFIG_64BIT
+#define stub_mmap_offset(phys) (phys)
+#else
+#define stub_mmap_offset(phys) ((phys) >> PAGE_SHIFT)
+#endif
pte_t *virt_to_pte(struct mm_struct *mm, unsigned long addr)
{
@@ -43,6 +55,30 @@ pte_t *virt_to_pte(struct mm_struct *mm, unsigned long addr)
return pte_offset_kernel(pmd, addr);
}
+int stub_mmap_allowed(struct mm_id *id, unsigned long addr,
+ unsigned long prot, unsigned long offset)
+{
+ struct mm_context *ctx = container_of(id, struct mm_context, id);
+ struct mm_struct *mm = container_of(ctx, struct mm_struct, context);
+ pte_t *pte;
+
+ pte = virt_to_pte(mm, addr);
+ if (pte == NULL || !pte_present(*pte))
+ return 0;
+
+ /* Must map exactly the frame this PTE references. */
+ if (stub_mmap_offset(pte_val(*pte) & PAGE_MASK) != offset)
+ return 0;
+
+ /* Must not grant more access than the PTE allows. */
+ if ((prot & UM_PROT_WRITE) && !pte_write(*pte))
+ return 0;
+ if ((prot & UM_PROT_EXEC) && !pte_exec(*pte))
+ return 0;
+
+ return 1;
+}
+
static pte_t *maybe_map(unsigned long virt, int is_write)
{
pte_t *pte = virt_to_pte(current->mm, virt);
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index 63b426b2c523..3a31e52cdcf8 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -187,10 +187,11 @@ static void wake_seccomp_stub(struct mm_id *mm_idp)
FUTEX_WAKE, 1, NULL, NULL, 0));
}
-static int seccomp_notify_serve(int notify_fd)
+static int seccomp_notify_serve(struct mm_id *mm_idp)
{
struct seccomp_notif req = {};
struct seccomp_notif_resp resp = {};
+ int notify_fd = mm_idp->seccomp_notify_fd;
int ret;
CATCH_EINTR(ret = ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req));
@@ -198,7 +199,14 @@ static int seccomp_notify_serve(int notify_fd)
return -errno;
resp.id = req.id;
- resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+
+ if (req.data.nr == STUB_MMAP_NR &&
+ stub_mmap_allowed(mm_idp, req.data.args[0], req.data.args[2],
+ req.data.args[5])) {
+ resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+ } else {
+ resp.error = -EPERM;
+ }
CATCH_EINTR(ret = ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp));
if (ret < 0)
@@ -220,7 +228,7 @@ static void seccomp_serve_mmaps(struct mm_id *mm_idp)
n_mmaps++;
for (i = 0; i < n_mmaps; i++)
- seccomp_notify_serve(mm_idp->seccomp_notify_fd);
+ seccomp_notify_serve(mm_idp);
}
void wait_stub_done_seccomp(struct mm_id *mm_idp, int running, int wait_sigsys)
@@ -288,13 +296,6 @@ void wait_stub_done_seccomp(struct mm_id *mm_idp, int running, int wait_sigsys)
fatal_sigsegv();
}
-/*
- * Service one SECCOMP_RET_USER_NOTIF notification from a stub mmap: read the
- * suspended call, then respond CONTINUE so the stub's real mmap runs. CONTINUE
- * is safe here because mmap takes only scalar arguments (no TOCTOU on user
- * memory). Validation of (addr, len, prot, fd, offset) is added later; for now
- * every stub mmap is allowed.
- */
extern unsigned long current_stub_stack(void);
static void get_skas_faultinfo(int pid, struct faultinfo *fi)
--
2.43.0
* [PATCH 4/7] um: skas: handle out-of-batch stub mmap notifications
From: Cong Wang @ 2026-06-20 3:22 UTC (permalink / raw)
To: Richard Weinberger, Anton Ivanov, Johannes Berg, linux-um; +Cc: Benjamin Berg
From: Cong Wang <cwang@multikernel.io>
The mmap validation so far only runs for mmaps the monitor itself queued.
A hijacked stub can instead jump to the in-stub mmap on its own, outside
any batch. Such an mmap traps to the USER_NOTIF listener and suspends
in-kernel (so it discloses nothing) but it never signals the monitor
via the futex, so the monitor, waiting only on the futex, would block
forever (a guest-triggered DoS of the whole UML).
Drain the listener whenever the wait wakes: after each FUTEX_WAIT in
wait_stub_done_seccomp() (which already wakes periodically on SIGALRM and
tolerates EINTR), poll the listener non-blocking and service anything
pending. An out-of-batch mmap therefore stays suspended only until the
next wakeup and is then validated like any other: denied with -EPERM
unless it maps exactly what its PTE allows. No disclosure occurs while it
is pending, and the monitor no longer hangs.
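The non-blocking drain pattern described above can be tried outside UML. The sketch below is an illustration only: it uses a pipe as a stand-in for the seccomp listener fd, and drain_pending(), serve_fn and consume_one_byte() are hypothetical names, not functions from this series.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical handler type: services one pending event on fd and
 * returns 0 on success or a negative value on failure.
 */
typedef int (*serve_fn)(int fd);

/* Example handler for the pipe stand-in: consume exactly one byte. */
int consume_one_byte(int fd)
{
	char c;

	return read(fd, &c, 1) == 1 ? 0 : -1;
}

/*
 * Service everything already pending on fd without ever blocking.
 * Returns the number of events served.
 */
int drain_pending(int fd, serve_fn serve)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int served = 0;

	assert(serve != NULL);
	if (fd < 0)
		return 0;

	/* A timeout of 0 makes poll() return at once when nothing is queued. */
	while (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN)) {
		if (serve(fd) < 0)
			break;
		served++;
	}
	return served;
}
```

Because the timeout is zero, calling the drain after every wakeup adds one poll() in the common case where nothing is pending, and it never turns the wait loop into a blocking read of the listener.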
Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
arch/um/os-Linux/skas/process.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index 3a31e52cdcf8..0987eb79ce76 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -32,6 +32,7 @@
#include <linux/seccomp.h>
#include <linux/threads.h>
#include <sys/ioctl.h>
+#include <poll.h>
#include <timetravel.h>
#include <asm-generic/rwonce.h>
#include "../internal.h"
@@ -215,6 +216,21 @@ static int seccomp_notify_serve(struct mm_id *mm_idp)
return 0;
}
+static void seccomp_notify_drain(struct mm_id *mm_idp)
+{
+ struct pollfd pfd = {
+ .fd = mm_idp->seccomp_notify_fd,
+ .events = POLLIN,
+ };
+
+ if (mm_idp->seccomp_notify_fd < 0)
+ return;
+
+ while (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN))
+ if (seccomp_notify_serve(mm_idp) < 0)
+ break;
+}
+
static void seccomp_serve_mmaps(struct mm_id *mm_idp)
{
struct stub_data *data = (void *)mm_idp->stack;
@@ -265,6 +281,9 @@ void wait_stub_done_seccomp(struct mm_id *mm_idp, int running, int wait_sigsys)
__func__, errno);
goto out_kill;
}
+
+ /* Handle any stub mmap that trapped out of band. */
+ seccomp_notify_drain(mm_idp);
} while (data->futex == FUTEX_IN_CHILD);
if (__READ_ONCE(mm_idp->pid) < 0)
--
2.43.0
* [PATCH 5/7] um: skas: validate stub munmap() against the guest address range
From: Cong Wang @ 2026-06-20 3:22 UTC (permalink / raw)
To: Richard Weinberger, Anton Ivanov, Johannes Berg, linux-um; +Cc: Benjamin Berg
From: Cong Wang <cwang@multikernel.io>
Route stub munmap() through the USER_NOTIF monitor too, and validate it
before letting it run. munmap() was previously SECCOMP_RET_ALLOW, so a
hijacked stub (jumping to the in-stub munmap with crafted registers) could
unmap arbitrary ranges, including the stub's own code/data pages, which
would sever the monitor's control over it, or guest mappings outside what
it is allowed to manage. After mmap, munmap was the remaining memory
primitive a hijacked stub could invoke with arbitrary arguments.
Unlike mmap(), there is no PTE left to check: by the time the stub unmaps
a guest page, the UML kernel has already cleared the corresponding entry. So
stub_munmap_allowed() is range-based instead: the request must be
non-empty, must not wrap, and must lie entirely below STUB_START. That
confines the stub to the guest address space and keeps its own reserved
region off-limits. Both arguments are scalars captured in seccomp_data, so
CONTINUE carries no TOCTOU risk, same as mmap().
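The three range conditions (non-empty, no wrap, entirely below the boundary) can be exercised in a standalone sketch. TOY_STUB_START is a made-up value chosen for illustration; the real STUB_START comes from <as-layout.h>, and toy_munmap_allowed() is a hypothetical stand-in for stub_munmap_allowed().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Made-up boundary for illustration; not the real STUB_START value. */
#define TOY_STUB_START UINT64_C(0x100000000)

/* Allow only non-empty, non-wrapping ranges that end at or below the boundary. */
bool toy_munmap_allowed(uint64_t addr, uint64_t len)
{
	/* An empty range is rejected outright. */
	if (len == 0)
		return false;

	/* Unsigned overflow: the end would wrap around past zero. */
	if (addr + len < addr)
		return false;

	/* The end is exclusive, so end == TOY_STUB_START is still allowed. */
	if (addr + len > TOY_STUB_START)
		return false;

	return true;
}

/* Boundary cases matching the ones named in the commit message. */
void toy_munmap_examples(void)
{
	/* Clearing the whole user range [0, TOY_STUB_START) is allowed. */
	assert(toy_munmap_allowed(0, TOY_STUB_START));
	/* One byte into the stub region is denied. */
	assert(!toy_munmap_allowed(0, TOY_STUB_START + 1));
	/* A range whose end wraps is denied. */
	assert(!toy_munmap_allowed(UINT64_MAX - 0xfff, 0x2000));
}
```

The wrap test has to come before the boundary comparison: a wrapped end is a small number and would otherwise pass the "below the boundary" check.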
stub_munmap_allowed() lives in skas/uaccess.c next to stub_mmap_allowed();
the os-Linux notify handler dispatches on the syscall number and responds
CONTINUE or -EPERM, and the batch server counts STUB_SYSCALL_MUNMAP as
well as STUB_SYSCALL_MMAP.
Verified on UML: guest boots and survives heavy mmap/munmap churn with
zero false denials; the legitimate boot-time clear of the whole user
address space [0, STUB_START) is allowed (end == STUB_START), while a
range overlapping the stub region is denied.
Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
arch/um/include/shared/skas/skas.h | 2 ++
arch/um/kernel/skas/stub_exe.c | 4 ++--
arch/um/kernel/skas/uaccess.c | 12 ++++++++++++
arch/um/os-Linux/skas/process.c | 31 ++++++++++++++++++------------
4 files changed, 35 insertions(+), 14 deletions(-)
diff --git a/arch/um/include/shared/skas/skas.h b/arch/um/include/shared/skas/skas.h
index ce1b67b06b4b..ca2a62cef0c1 100644
--- a/arch/um/include/shared/skas/skas.h
+++ b/arch/um/include/shared/skas/skas.h
@@ -20,5 +20,7 @@ void initial_jmpbuf_unlock(void);
int stub_mmap_allowed(struct mm_id *id, unsigned long addr,
unsigned long prot, unsigned long offset);
+int stub_munmap_allowed(struct mm_id *id, unsigned long addr,
+ unsigned long len);
#endif
diff --git a/arch/um/kernel/skas/stub_exe.c b/arch/um/kernel/skas/stub_exe.c
index 65ea2af5ca73..00eea0cb9463 100644
--- a/arch/um/kernel/skas/stub_exe.c
+++ b/arch/um/kernel/skas/stub_exe.c
@@ -175,7 +175,7 @@ noinline static void real_init(void)
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, STUB_MMAP_NR,
5, 0),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_munmap,
- 3, 0),
+ 4, 0),
#ifdef __i386__
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_set_thread_area,
2, 0),
@@ -192,7 +192,7 @@ noinline static void real_init(void)
/* [18] Permitted call for the stub */
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
- /* [19] mmap: route to the monitor for validation */
+ /* [19] mmap and munmap: route to the monitor for validation */
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
};
struct sock_fprog prog = {
diff --git a/arch/um/kernel/skas/uaccess.c b/arch/um/kernel/skas/uaccess.c
index 9359ede8a04b..feb267637735 100644
--- a/arch/um/kernel/skas/uaccess.c
+++ b/arch/um/kernel/skas/uaccess.c
@@ -14,6 +14,7 @@
#include <asm/futex.h>
#include <os.h>
#include <skas.h>
+#include <as-layout.h>
/*
* Same mapping as MMAP_OFFSET() in <sysdep/stub.h>, but usable from kernel
@@ -79,6 +80,17 @@ int stub_mmap_allowed(struct mm_id *id, unsigned long addr,
return 1;
}
+int stub_munmap_allowed(struct mm_id *id, unsigned long addr, unsigned long len)
+{
+ if (len == 0 || addr + len < addr)
+ return 0;
+
+ if (addr + len > STUB_START)
+ return 0;
+
+ return 1;
+}
+
static pte_t *maybe_map(unsigned long virt, int is_write)
{
pte_t *pte = virt_to_pte(current->mm, virt);
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index 0987eb79ce76..2010b4529c41 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -193,7 +193,7 @@ static int seccomp_notify_serve(struct mm_id *mm_idp)
struct seccomp_notif req = {};
struct seccomp_notif_resp resp = {};
int notify_fd = mm_idp->seccomp_notify_fd;
- int ret;
+ int allowed, ret;
CATCH_EINTR(ret = ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req));
if (ret < 0)
@@ -201,13 +201,19 @@ static int seccomp_notify_serve(struct mm_id *mm_idp)
resp.id = req.id;
- if (req.data.nr == STUB_MMAP_NR &&
- stub_mmap_allowed(mm_idp, req.data.args[0], req.data.args[2],
- req.data.args[5])) {
+ if (req.data.nr == STUB_MMAP_NR)
+ allowed = stub_mmap_allowed(mm_idp, req.data.args[0],
+ req.data.args[2], req.data.args[5]);
+ else if (req.data.nr == __NR_munmap)
+ allowed = stub_munmap_allowed(mm_idp, req.data.args[0],
+ req.data.args[1]);
+ else
+ allowed = 0;
+
+ if (allowed)
resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
- } else {
+ else
resp.error = -EPERM;
- }
CATCH_EINTR(ret = ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp));
if (ret < 0)
@@ -231,19 +237,20 @@ static void seccomp_notify_drain(struct mm_id *mm_idp)
break;
}
-static void seccomp_serve_mmaps(struct mm_id *mm_idp)
+static void seccomp_serve_stub_syscalls(struct mm_id *mm_idp)
{
struct stub_data *data = (void *)mm_idp->stack;
- int i, n_mmaps = 0;
+ int i, n_notif = 0;
if (mm_idp->seccomp_notify_fd < 0)
return;
for (i = 0; i < data->syscall_data_len; i++)
- if (data->syscall_data[i].syscall == STUB_SYSCALL_MMAP)
- n_mmaps++;
+ if (data->syscall_data[i].syscall == STUB_SYSCALL_MMAP ||
+ data->syscall_data[i].syscall == STUB_SYSCALL_MUNMAP)
+ n_notif++;
- for (i = 0; i < n_mmaps; i++)
+ for (i = 0; i < n_notif; i++)
seccomp_notify_serve(mm_idp);
}
@@ -255,7 +262,7 @@ void wait_stub_done_seccomp(struct mm_id *mm_idp, int running, int wait_sigsys)
do {
if (!running) {
wake_seccomp_stub(mm_idp);
- seccomp_serve_mmaps(mm_idp);
+ seccomp_serve_stub_syscalls(mm_idp);
}
do {
--
2.43.0
* [PATCH 6/7] um: skas: kill stubs that block SIGALRM via a watchdog thread
From: Cong Wang @ 2026-06-20 3:22 UTC (permalink / raw)
To: Richard Weinberger, Anton Ivanov, Johannes Berg, linux-um; +Cc: Benjamin Berg
From: Cong Wang <cwang@multikernel.io>
A hijacked stub can block SIGALRM with a crafted rt_sigreturn (the mask is
restored from the stack it controls), so it is never preempted and never
reports back. SIGALRM goes to the stub, not the monitor, so the monitor
then blocks indefinitely in wait_stub_done_seccomp(). This is the one item
("blocking e.g. SIGALRM") of the stub.c "Known security issues" list that a
seccomp filter cannot address, since rt_sigreturn is required and its
effect lives on the stack rather than in register arguments.
Detect it out of band. A helper thread blocks on its own timerfd and
watches a per-vCPU (pid, seq) pair the monitor updates around each wait:
pid is the stub being waited on, seq advances every time the stub reports.
A stub that stops reporting leaves pid pinned and seq frozen; after
SECCOMP_WD_STALL_TICKS ticks of no progress the watchdog SIGKILLs it, and
the resulting SIGCHLD unblocks the monitor through the existing "pid < 0"
teardown.
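
The per-tick decision described above can be sketched in isolation. This is a minimal illustration, not the patch's code: `wd_snapshot`, `wd_tick()` and `wd_first_kill()` are hypothetical names, the observed (pid, seq) pair is passed in as plain arguments instead of being read from the shared slots, and the actual kill is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STALL_TICKS 5 /* mirrors SECCOMP_WD_STALL_TICKS in the patch */

/* What the watchdog remembers about one slot between ticks (hypothetical). */
struct wd_snapshot {
	int prev_pid;
	unsigned long prev_seq;
	int stall;
};

/*
 * One watchdog tick for one slot. Returns true when the caller should
 * kill `pid`: the same pid was observed with an unchanged seq for
 * STALL_TICKS consecutive comparisons.
 */
static bool wd_tick(struct wd_snapshot *st, int pid, unsigned long seq)
{
	bool kill_it = false;

	assert(st != NULL);

	if (pid >= 0 && pid == st->prev_pid && seq == st->prev_seq) {
		if (++st->stall >= STALL_TICKS) {
			kill_it = true;
			st->stall = 0; /* start counting again after a kill */
		}
	} else {
		st->stall = 0; /* new stub, idle slot, or progress was made */
	}

	st->prev_pid = pid;
	st->prev_seq = seq;
	return kill_it;
}

/*
 * Feed up to `ticks` identical observations. Returns the 1-based tick
 * on which the first kill fires, or 0 if none fires.
 */
static int wd_first_kill(struct wd_snapshot *st, int pid,
			 unsigned long seq, int ticks)
{
	for (int i = 1; i <= ticks; i++)
		if (wd_tick(st, pid, seq))
			return i;
	return 0;
}
```

Starting from a snapshot with prev_pid = -1, the first tick only records a baseline, so in this sketch a frozen (pid, seq) pair triggers the kill on the sixth identical observation; any change in seq or pid resets the stall count.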
Each monitor writes only its own slot and the watchdog only reads, so the
word-sized state needs just __READ_ONCE()/__WRITE_ONCE(), no lock; the
watchdog scans every slot, covering all CPUs under SMP. A kill requires the
same pid and an unchanged seq across SECCOMP_WD_STALL_TICKS consecutive
ticks, so a stub that keeps reporting is never killed. The thread runs with
all signals blocked (os_run_helper_thread()), reports via write(2) to stderr
rather than printk(), which is unsafe from its non-kernel context, and is
started once via a compare-and-swap guard. The watchdog is preferred over a
bounded FUTEX_WAIT timeout: it costs one counter bump on the per-syscall hot
path and catches a stall anywhere on stub input, not just on the one futex.
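
The start-once guard mentioned above can be sketched as follows. This is an illustrative fragment under assumptions: `start_once()` and `start_count` are hypothetical names, the real init path (timerfd setup and thread creation) is replaced by a counter, and `__sync_bool_compare_and_swap()` is the GCC/Clang builtin the patch itself uses.

```c
#include <assert.h>
#include <stdbool.h>

static int started;     /* 0 = unclaimed, 1 = some caller claimed the start */
static int start_count; /* how often the init path actually ran (demo only) */

/*
 * Returns true only for the single caller that flips `started` from
 * 0 to 1. Every other caller, concurrent or later, sees the swap fail
 * and returns without touching the init path.
 */
static bool start_once(void)
{
	if (!__sync_bool_compare_and_swap(&started, 0, 1))
		return false;

	/* Stand-in for timerfd_create() + os_run_helper_thread(). */
	start_count++;
	assert(started == 1);
	return true;
}
```

As in the patch, the flag stays set even if the init path were to fail, so a failed start is not retried by later callers.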
Verified on UML (UP and 2-CPU SMP): heavy mmap/munmap churn and CPU-bound
loops on every vCPU run with zero false kills; a stub with SIGALRM blocked
is killed in ~5s and the monitor recovers, while syscall-making processes
are untouched.
Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
arch/um/os-Linux/skas/process.c | 129 ++++++++++++++++++++++++++++++++
1 file changed, 129 insertions(+)
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index 2010b4529c41..7ffde2b00b61 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -21,6 +21,7 @@
#include <as-layout.h>
#include <init.h>
#include <kern_util.h>
+#include <smp.h>
#include <mem.h>
#include <os.h>
#include <ptrace_user.h>
@@ -33,6 +34,8 @@
#include <linux/threads.h>
#include <sys/ioctl.h>
#include <poll.h>
+#include <signal.h>
+#include <sys/timerfd.h>
#include <timetravel.h>
#include <asm-generic/rwonce.h>
#include "../internal.h"
@@ -254,11 +257,131 @@ static void seccomp_serve_stub_syscalls(struct mm_id *mm_idp)
seccomp_notify_serve(mm_idp);
}
+#define SECCOMP_WD_TICK_SECS 1
+#define SECCOMP_WD_STALL_TICKS 5 /* ~5s of no progress before killing */
+
+static int seccomp_wd_pid[CONFIG_NR_CPUS] = { [0 ... CONFIG_NR_CPUS - 1] = -1 };
+static unsigned long seccomp_wd_seq[CONFIG_NR_CPUS];
+
+static inline void seccomp_wd_enter(int pid)
+{
+ int cpu = uml_curr_cpu();
+
+ __WRITE_ONCE(seccomp_wd_seq[cpu], seccomp_wd_seq[cpu] + 1);
+ __WRITE_ONCE(seccomp_wd_pid[cpu], pid);
+}
+
+static inline void seccomp_wd_progress(void)
+{
+ int cpu = uml_curr_cpu();
+
+ __WRITE_ONCE(seccomp_wd_seq[cpu], seccomp_wd_seq[cpu] + 1);
+}
+
+static inline void seccomp_wd_exit(void)
+{
+ __WRITE_ONCE(seccomp_wd_pid[uml_curr_cpu()], -1);
+}
+
+/* Per-CPU snapshot the watchdog compares against the next tick. */
+struct seccomp_wd_cpu {
+ int prev_pid;
+ unsigned long prev_seq;
+ int stall;
+};
+
+static void seccomp_wd_check_cpu(int cpu, struct seccomp_wd_cpu *st)
+{
+ static const char kill_msg[] =
+ "seccomp watchdog: killing unresponsive stub (SIGALRM blocked?)\n";
+ int pid = __READ_ONCE(seccomp_wd_pid[cpu]);
+ unsigned long seq = __READ_ONCE(seccomp_wd_seq[cpu]);
+
+ if (pid >= 0 && pid == st->prev_pid && seq == st->prev_seq) {
+ if (++st->stall >= SECCOMP_WD_STALL_TICKS) {
+ /* printk() is unsafe from this thread. */
+ (void)!write(2, kill_msg, sizeof(kill_msg) - 1);
+ kill(pid, SIGKILL);
+ st->stall = 0;
+ }
+ } else {
+ st->stall = 0;
+ }
+
+ st->prev_pid = pid;
+ st->prev_seq = seq;
+}
+
+static void *seccomp_watchdog(void *arg)
+{
+ int tfd = (int)(long)arg;
+ struct seccomp_wd_cpu st[CONFIG_NR_CPUS];
+ int cpu;
+
+ for (cpu = 0; cpu < CONFIG_NR_CPUS; cpu++)
+ st[cpu] = (struct seccomp_wd_cpu){ .prev_pid = -1 };
+
+ for (;;) {
+ unsigned long long expirations;
+
+ /*
+ * One check per wakeup; ignore the expiration count so a
+ * descheduled watchdog accrues stalls more slowly, never faster.
+ */
+ if (read(tfd, &expirations, sizeof(expirations)) !=
+ sizeof(expirations))
+ continue;
+
+ for (cpu = 0; cpu < CONFIG_NR_CPUS; cpu++)
+ seccomp_wd_check_cpu(cpu, &st[cpu]);
+ }
+
+ return NULL;
+}
+
+static void start_seccomp_watchdog(void)
+{
+ static int started;
+ static struct os_helper_thread *wd_td;
+ struct itimerspec its = {
+ .it_value = { .tv_sec = SECCOMP_WD_TICK_SECS },
+ .it_interval = { .tv_sec = SECCOMP_WD_TICK_SECS },
+ };
+ int tfd;
+
+ /* Several vCPU monitors may race here; only the first starts the thread. */
+ if (!__sync_bool_compare_and_swap(&started, 0, 1))
+ return;
+
+ tfd = timerfd_create(CLOCK_MONOTONIC, 0);
+ if (tfd < 0) {
+ printk(UM_KERN_ERR "%s : timerfd_create failed, errno = %d\n",
+ __func__, errno);
+ return;
+ }
+
+ if (timerfd_settime(tfd, 0, &its, NULL) < 0) {
+ printk(UM_KERN_ERR "%s : timerfd_settime failed, errno = %d\n",
+ __func__, errno);
+ close(tfd);
+ return;
+ }
+
+ if (os_run_helper_thread(&wd_td, seccomp_watchdog, (void *)(long)tfd) < 0) {
+ printk(UM_KERN_ERR "%s : failed to start watchdog thread\n",
+ __func__);
+ close(tfd);
+ wd_td = NULL;
+ }
+}
+
void wait_stub_done_seccomp(struct mm_id *mm_idp, int running, int wait_sigsys)
{
struct stub_data *data = (void *)mm_idp->stack;
int ret;
+ seccomp_wd_enter(mm_idp->pid);
+
do {
if (!running) {
wake_seccomp_stub(mm_idp);
@@ -296,6 +419,9 @@ void wait_stub_done_seccomp(struct mm_id *mm_idp, int running, int wait_sigsys)
if (__READ_ONCE(mm_idp->pid) < 0)
goto out_kill;
+ /* The stub reported back: record progress for the watchdog. */
+ seccomp_wd_progress();
+
running = 0;
/* We may receive a SIGALRM before SIGSYS, iterate again. */
@@ -312,9 +438,11 @@ void wait_stub_done_seccomp(struct mm_id *mm_idp, int running, int wait_sigsys)
goto out_kill;
}
+ seccomp_wd_exit();
return;
out_kill:
+ seccomp_wd_exit();
printk(UM_KERN_ERR "%s : failed to wait for stub, pid = %d, errno = %d\n",
__func__, mm_idp->pid, errno);
/* This is not true inside start_userspace */
@@ -633,6 +761,7 @@ int start_userspace(struct mm_id *mm_id)
err = get_stub_listener(mm_id);
if (err)
goto out_kill;
+ start_seccomp_watchdog();
} else {
close(tramp_data.sockpair[1]);
}
--
2.43.0
* [PATCH 7/7] um: skas: refresh stub security notes after closing the known issues
2026-06-20 3:22 [PATCH 0/7] um: skas: harden the seccomp userspace stub Cong Wang
` (5 preceding siblings ...)
2026-06-20 3:22 ` [PATCH 6/7] um: skas: kill stubs that block SIGALRM via a watchdog thread Cong Wang
@ 2026-06-20 3:22 ` Cong Wang
6 siblings, 0 replies; 8+ messages in thread
From: Cong Wang @ 2026-06-20 3:22 UTC (permalink / raw)
To: Richard Weinberger, Anton Ivanov, Johannes Berg, linux-um; +Cc: Benjamin Berg
From: Cong Wang <cwang@multikernel.io>
Drop the stale "Known security issues" comment from stub.c and remove the
paragraphs of the seccomp= setup help text that still warned about these
exact issues and labelled the mode "insecure".
Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
arch/um/kernel/skas/stub.c | 22 ----------------------
arch/um/os-Linux/start_up.c | 6 ------
2 files changed, 28 deletions(-)
diff --git a/arch/um/kernel/skas/stub.c b/arch/um/kernel/skas/stub.c
index e09216a20cb5..7845638d595d 100644
--- a/arch/um/kernel/skas/stub.c
+++ b/arch/um/kernel/skas/stub.c
@@ -9,28 +9,6 @@
#include <sys/socket.h>
#include <errno.h>
-/*
- * Known security issues
- *
- * Userspace can jump to this address to execute *any* syscall that is
- * permitted by the stub. As we will return afterwards, it can do
- * whatever it likes, including:
- * - Tricking the kernel into handing out the memory FD
- * - Using this memory FD to read/write all physical memory
- * - Running in parallel to the kernel processing a syscall
- * (possibly creating data races?)
- * - Blocking e.g. SIGALRM to avoid time based scheduling
- *
- * To avoid this, the permitted location for each syscall needs to be
- * checked for in the SECCOMP filter (which is reasonably simple). Also,
- * more care will need to go into considerations how the code might be
- * tricked by using a prepared stack (or even modifying the stack from
- * another thread in case SMP support is added).
- *
- * As for the SIGALRM, the best counter measure will be to check in the
- * kernel that the process is reporting back the SIGALRM in a timely
- * fashion.
- */
static __always_inline int syscall_handler(int fd_map[STUB_MAX_FDS])
{
struct stub_data *d = get_stub_data();
diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c
index 054ac03bbf5e..b01942bec953 100644
--- a/arch/um/os-Linux/start_up.c
+++ b/arch/um/os-Linux/start_up.c
@@ -452,12 +452,6 @@ __uml_setup("seccomp=", uml_seccomp_config,
" This method is overall faster than the ptrace based userspace, primarily\n"
" because it reduces the number of context switches for (minor) page faults.\n"
"\n"
-" However, the SECCOMP filter is not (yet) restrictive enough to prevent\n"
-" userspace from reading and writing all physical memory. Userspace\n"
-" processes could also trick the stub into disabling SIGALRM which\n"
-" prevents it from being interrupted for scheduling purposes.\n"
-"\n"
-" This is insecure and should only be used with a trusted userspace\n\n"
);
void __init os_early_checks(void)
--
2.43.0