* [PATCH v10 00/13] nommu UML
From: Hajime Tazaki @ 2025-06-22 21:32 UTC
To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett, linux-kernel
This patchset is another spin of adding nommu (!MMU) mode to UML. We
would appreciate hearing your opinions on it.

There are still limitations/issues that we have already found; they are
listed below.

- memory mapped by loadable modules is not distinguished from
  userspace memory.
-- Hajime
v10:
- fix wrong comment on gs register handling ([09/13])
- remove unnecessary code of early syscall implementation ([04/13])
v9:
- rebase with the latest uml/next branch
- add performance numbers of new SECCOMP mode, and update results ([12/13])
- add a workaround for upstream change on MMU dependency to PCI drivers ([10/13])
- https://lore.kernel.org/all/cover.1750294482.git.thehajime@gmail.com/
v8:
- rebase with the latest uml/next branch
- clean up segv_handler to align with the latest uml ([9/12])
- https://lore.kernel.org/all/cover.1745980082.git.thehajime@gmail.com/
v7:
- properly handle FP register upon signal delivery [10/13]
- update benchmark result with new FP register handling [12/13]
- fix arch_has_single_step() for !MMU case [07/13]
- revert stack alignment as it is in uml/fixes tree [10/13]
- https://lore.kernel.org/all/cover.1737348399.git.thehajime@gmail.com/
v6:
- rebase to the latest uml/next tree
- more clean up on mmu/nommu for signal handling [10/13]
- rename functions of mcontext routines [06,10/13]
- added Acked-by tag for binfmt_elf_fdpic [02/13]
- https://lore.kernel.org/linux-um/cover.1736853925.git.thehajime@gmail.com/
v5:
- clean up stack manipulation code [05,06,07,10/13]
- https://lore.kernel.org/linux-um/cover.1733998168.git.thehajime@gmail.com/
v4:
- add arch/um/nommu, arch/x86/um/nommu to contain !MMU specific code
- remove zpoline patch
- drop binfmt_elf_fdpic patch
- reduce ifndef CONFIG_MMU if possible
- split to elf header cleanup patch [01/13]
- fix kernel test robot warnings [06/13]
- fix coding styles [07/13]
- move task_top_of_stack definition [05/13]
- https://lore.kernel.org/linux-um/cover.1733652929.git.thehajime@gmail.com/
v3:
- https://lore.kernel.org/linux-um/cover.1733199769.git.thehajime@gmail.com/
- add seccomp-based syscall hook in addition to zpoline [06/13]
- remove RFC, add a line to MAINTAINERS file
- fix kernel test robot warnings [02/13,08/13,10/13]
- add base-commit tag to cover letter
- pull the latest uml/next
- clean up SIGSEGV handling [10/13]
- detect fsgsbase availability with elf aux vector [08/13]
- simplify vdso code with macros [09/13]
RFC v2:
- https://lore.kernel.org/linux-um/cover.1731290567.git.thehajime@gmail.com/
- base branch is now uml/linux.git instead of torvalds/linux.git.
- reorganize the patch series to clean up
- fixed various coding style issues
- clean up exec code path [07/13]
- fixed the crash/SIGSEGV case on userspace programs [10/13]
- add seccomp filter to limit syscall caller address [06/13]
- detect fsgsbase availability with sigsetjmp/siglongjmp [08/13]
- removes unrelated changes
- removes unneeded ifndef CONFIG_MMU
- convert UML_CONFIG_MMU to CONFIG_MMU as using uml/linux.git
- proposed a patch of maple-tree issue (resolving a limitation in RFC v1)
https://lore.kernel.org/linux-mm/20241108222834.3625217-1-thehajime@gmail.com/
RFC:
- https://lore.kernel.org/linux-um/cover.1729770373.git.thehajime@gmail.com/
Hajime Tazaki (13):
x86/um: nommu: elf loader for fdpic
um: decouple MMU specific code from the common part
um: nommu: memory handling
x86/um: nommu: syscall handling
um: nommu: seccomp syscalls hook
x86/um: nommu: process/thread handling
um: nommu: configure fs register on host syscall invocation
x86/um/vdso: nommu: vdso memory update
x86/um: nommu: signal handling
um: nommu: a work around for MMU dependency to PCI driver
um: change machine name for uname output
um: nommu: add documentation of nommu UML
um: nommu: plug nommu code into build system
Documentation/virt/uml/nommu-uml.rst | 180 ++++++++++++++++++++++
MAINTAINERS | 1 +
arch/um/Kconfig | 14 +-
arch/um/Makefile | 10 ++
arch/um/configs/x86_64_nommu_defconfig | 54 +++++++
arch/um/include/asm/dma.h | 13 ++
arch/um/include/asm/futex.h | 4 +
arch/um/include/asm/mmu.h | 8 +
arch/um/include/asm/mmu_context.h | 2 +
arch/um/include/asm/ptrace-generic.h | 8 +-
arch/um/include/asm/uaccess.h | 7 +-
arch/um/include/shared/kern_util.h | 6 +
arch/um/include/shared/os.h | 16 ++
arch/um/kernel/Makefile | 5 +-
arch/um/kernel/mem-pgtable.c | 55 +++++++
arch/um/kernel/mem.c | 38 +----
arch/um/kernel/process.c | 25 +++
arch/um/kernel/skas/process.c | 27 ----
arch/um/kernel/um_arch.c | 3 +
arch/um/nommu/Makefile | 3 +
arch/um/nommu/os-Linux/Makefile | 7 +
arch/um/nommu/os-Linux/signal.c | 29 ++++
arch/um/nommu/trap.c | 194 ++++++++++++++++++++++++
arch/um/os-Linux/Makefile | 8 +-
arch/um/os-Linux/internal.h | 5 +
arch/um/os-Linux/mem.c | 4 +
arch/um/os-Linux/process.c | 144 +++++++++++++++++-
arch/um/os-Linux/seccomp.c | 87 +++++++++++
arch/um/os-Linux/signal.c | 8 +
arch/um/os-Linux/skas/process.c | 132 ----------------
arch/um/os-Linux/start_up.c | 25 ++-
arch/um/os-Linux/util.c | 3 +-
arch/x86/um/Makefile | 7 +-
arch/x86/um/asm/elf.h | 8 +-
arch/x86/um/nommu/Makefile | 8 +
arch/x86/um/nommu/do_syscall_64.c | 68 +++++++++
arch/x86/um/nommu/entry_64.S | 113 ++++++++++++++
arch/x86/um/nommu/os-Linux/Makefile | 6 +
arch/x86/um/nommu/os-Linux/mcontext.c | 24 +++
arch/x86/um/nommu/syscalls.h | 16 ++
arch/x86/um/nommu/syscalls_64.c | 115 ++++++++++++++
arch/x86/um/shared/sysdep/mcontext.h | 5 +
arch/x86/um/shared/sysdep/ptrace.h | 2 +-
arch/x86/um/shared/sysdep/syscalls_64.h | 6 +
arch/x86/um/vdso/vma.c | 17 ++-
fs/Kconfig.binfmt | 2 +-
46 files changed, 1306 insertions(+), 216 deletions(-)
create mode 100644 Documentation/virt/uml/nommu-uml.rst
create mode 100644 arch/um/configs/x86_64_nommu_defconfig
create mode 100644 arch/um/kernel/mem-pgtable.c
create mode 100644 arch/um/nommu/Makefile
create mode 100644 arch/um/nommu/os-Linux/Makefile
create mode 100644 arch/um/nommu/os-Linux/signal.c
create mode 100644 arch/um/nommu/trap.c
create mode 100644 arch/um/os-Linux/seccomp.c
create mode 100644 arch/x86/um/nommu/Makefile
create mode 100644 arch/x86/um/nommu/do_syscall_64.c
create mode 100644 arch/x86/um/nommu/entry_64.S
create mode 100644 arch/x86/um/nommu/os-Linux/Makefile
create mode 100644 arch/x86/um/nommu/os-Linux/mcontext.c
create mode 100644 arch/x86/um/nommu/syscalls.h
create mode 100644 arch/x86/um/nommu/syscalls_64.c
base-commit: cfc4ca8986bb1f6182da6cd7bb57f228590b4643
--
2.43.0
* [PATCH v10 01/13] x86/um: nommu: elf loader for fdpic
From: Hajime Tazaki @ 2025-06-22 21:32 UTC
To: linux-um
Cc: thehajime, ricarkol, Liam.Howlett, linux-kernel, Eric Biederman,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel
As UML now supports the CONFIG_MMU=n case, it has to use an alternate ELF
loader, the FDPIC ELF loader. This commit adds the necessary definitions
to the arch code, as UML has not used this loader so far. It also updates
the Kconfig file so that BINFMT_ELF_FDPIC can be used in a !MMU environment.
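
For readers unfamiliar with FDPIC: a load map records, per segment, where
a piece of the ELF image ended up in memory, which is what lets segments
be placed independently of each other. The following is a minimal
user-space sketch of how such a map translates a link-time address into a
run-time one. The field names are written from memory of the kernel's
FDPIC load map structures and should be checked against
include/uapi/linux/elf-fdpic.h; sketch_loadseg, sketch_loadmap and
sketch_relocate are hypothetical names that do not appear in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of one FDPIC load-map segment (hypothetical names). */
struct sketch_loadseg {
	uint64_t addr;		/* run-time address the segment was mapped to */
	uint64_t p_vaddr;	/* link-time address recorded in the ELF file */
	uint64_t p_memsz;	/* size of the segment in memory */
};

struct sketch_loadmap {
	uint16_t version;
	uint16_t nsegs;				/* number of entries in segs */
	const struct sketch_loadseg *segs;	/* assumption: pointer instead of
						 * a trailing array, for brevity */
};

/*
 * Translate a link-time address into a run-time address.
 * Returns 0 if no segment covers the address.
 */
static uint64_t sketch_relocate(const struct sketch_loadmap *map,
				uint64_t vaddr)
{
	assert(map != NULL);

	for (size_t i = 0; i < map->nsegs; i++) {
		const struct sketch_loadseg *seg = &map->segs[i];

		if (vaddr >= seg->p_vaddr &&
		    vaddr - seg->p_vaddr < seg->p_memsz)
			return seg->addr + (vaddr - seg->p_vaddr);
	}
	return 0;
}
```

Because every segment carries its own base address, text and data do not
need to keep a fixed distance from each other, which suits a flat !MMU
address space where a single large contiguous region may not be available.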
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Acked-by: Kees Cook <kees@kernel.org>
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
arch/um/include/asm/mmu.h | 5 +++++
arch/um/include/asm/ptrace-generic.h | 6 ++++++
arch/x86/um/asm/elf.h | 8 ++++++--
fs/Kconfig.binfmt | 2 +-
4 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index 4d0e4239f3cc..e9661846b4a3 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -17,6 +17,11 @@ typedef struct mm_context {
/* Address range in need of a TLB sync */
unsigned long sync_tlb_range_from;
unsigned long sync_tlb_range_to;
+
+#ifdef CONFIG_BINFMT_ELF_FDPIC
+ unsigned long exec_fdpic_loadmap;
+ unsigned long interp_fdpic_loadmap;
+#endif
} mm_context_t;
#endif
diff --git a/arch/um/include/asm/ptrace-generic.h b/arch/um/include/asm/ptrace-generic.h
index 4696f24d1492..4ff844bcb1cd 100644
--- a/arch/um/include/asm/ptrace-generic.h
+++ b/arch/um/include/asm/ptrace-generic.h
@@ -29,6 +29,12 @@ struct pt_regs {
#define PTRACE_OLDSETOPTIONS 21
+#ifdef CONFIG_BINFMT_ELF_FDPIC
+#define PTRACE_GETFDPIC 31
+#define PTRACE_GETFDPIC_EXEC 0
+#define PTRACE_GETFDPIC_INTERP 1
+#endif
+
struct task_struct;
extern long subarch_ptrace(struct task_struct *child, long request,
diff --git a/arch/x86/um/asm/elf.h b/arch/x86/um/asm/elf.h
index 62ed5d68a978..33f69f1eac10 100644
--- a/arch/x86/um/asm/elf.h
+++ b/arch/x86/um/asm/elf.h
@@ -9,6 +9,7 @@
#include <skas.h>
#define CORE_DUMP_USE_REGSET
+#define ELF_FDPIC_CORE_EFLAGS 0
#ifdef CONFIG_X86_32
@@ -190,8 +191,11 @@ extern int arch_setup_additional_pages(struct linux_binprm *bprm,
extern unsigned long um_vdso_addr;
#define AT_SYSINFO_EHDR 33
-#define ARCH_DLINFO NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr)
-
+#define ARCH_DLINFO \
+do { \
+ NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr); \
+ NEW_AUX_ENT(AT_MINSIGSTKSZ, 0); \
+} while (0)
#endif
typedef unsigned long elf_greg_t;
diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index bd2f530e5740..419ba0282806 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -58,7 +58,7 @@ config ARCH_USE_GNU_PROPERTY
config BINFMT_ELF_FDPIC
bool "Kernel support for FDPIC ELF binaries"
default y if !BINFMT_ELF
- depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
+ depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
select ELFCORE
help
ELF FDPIC binaries are based on ELF, but allow the individual load
--
2.43.0
* [PATCH v10 02/13] um: decouple MMU specific code from the common part
From: Hajime Tazaki @ 2025-06-22 21:33 UTC
To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett, linux-kernel
This splits the memory and process related code into common and MMU
specific parts in order to avoid ifdefs in .c files and duplication
between the MMU and !MMU cases.
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
arch/um/kernel/Makefile | 5 +-
arch/um/kernel/mem-pgtable.c | 55 +++++++++++++
arch/um/kernel/mem.c | 35 ---------
arch/um/kernel/process.c | 25 ++++++
arch/um/kernel/skas/process.c | 27 -------
arch/um/os-Linux/Makefile | 3 +-
arch/um/os-Linux/internal.h | 5 ++
arch/um/os-Linux/process.c | 134 ++++++++++++++++++++++++++++++++
arch/um/os-Linux/skas/process.c | 132 -------------------------------
9 files changed, 224 insertions(+), 197 deletions(-)
create mode 100644 arch/um/kernel/mem-pgtable.c
diff --git a/arch/um/kernel/Makefile b/arch/um/kernel/Makefile
index 4669db2aa9be..b7922f937213 100644
--- a/arch/um/kernel/Makefile
+++ b/arch/um/kernel/Makefile
@@ -16,9 +16,10 @@ extra-y := vmlinux.lds
obj-y = config.o exec.o exitcode.o irq.o ksyms.o mem.o \
physmem.o process.o ptrace.o reboot.o sigio.o \
- signal.o sysrq.o time.o tlb.o trap.o \
- um_arch.o umid.o kmsg_dump.o capflags.o skas/
+ signal.o sysrq.o time.o \
+ um_arch.o umid.o kmsg_dump.o capflags.o
obj-y += load_file.o
+obj-$(CONFIG_MMU) += mem-pgtable.o tlb.o trap.o skas/
obj-$(CONFIG_BLK_DEV_INITRD) += initrd.o
obj-$(CONFIG_GPROF) += gprof_syms.o
diff --git a/arch/um/kernel/mem-pgtable.c b/arch/um/kernel/mem-pgtable.c
new file mode 100644
index 000000000000..549da1d3bff0
--- /dev/null
+++ b/arch/um/kernel/mem-pgtable.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
+ */
+
+#include <linux/stddef.h>
+#include <linux/module.h>
+#include <linux/memblock.h>
+#include <linux/swap.h>
+#include <linux/slab.h>
+#include <asm/page.h>
+#include <asm/pgalloc.h>
+#include <as-layout.h>
+#include <init.h>
+#include <kern.h>
+#include <kern_util.h>
+#include <mem_user.h>
+#include <os.h>
+#include <um_malloc.h>
+
+
+/* Allocate and free page tables. */
+
+pgd_t *pgd_alloc(struct mm_struct *mm)
+{
+ pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
+
+ if (pgd) {
+ memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t));
+ memcpy(pgd + USER_PTRS_PER_PGD,
+ swapper_pg_dir + USER_PTRS_PER_PGD,
+ (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
+ }
+ return pgd;
+}
+
+static const pgprot_t protection_map[16] = {
+ [VM_NONE] = PAGE_NONE,
+ [VM_READ] = PAGE_READONLY,
+ [VM_WRITE] = PAGE_COPY,
+ [VM_WRITE | VM_READ] = PAGE_COPY,
+ [VM_EXEC] = PAGE_READONLY,
+ [VM_EXEC | VM_READ] = PAGE_READONLY,
+ [VM_EXEC | VM_WRITE] = PAGE_COPY,
+ [VM_EXEC | VM_WRITE | VM_READ] = PAGE_COPY,
+ [VM_SHARED] = PAGE_NONE,
+ [VM_SHARED | VM_READ] = PAGE_READONLY,
+ [VM_SHARED | VM_WRITE] = PAGE_SHARED,
+ [VM_SHARED | VM_WRITE | VM_READ] = PAGE_SHARED,
+ [VM_SHARED | VM_EXEC] = PAGE_READONLY,
+ [VM_SHARED | VM_EXEC | VM_READ] = PAGE_READONLY,
+ [VM_SHARED | VM_EXEC | VM_WRITE] = PAGE_SHARED,
+ [VM_SHARED | VM_EXEC | VM_WRITE | VM_READ] = PAGE_SHARED
+};
+DECLARE_VM_GET_PAGE_PROT
diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 76bec7de81b5..106a2f85ab5c 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -6,7 +6,6 @@
#include <linux/stddef.h>
#include <linux/module.h>
#include <linux/memblock.h>
-#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/slab.h>
#include <linux/init.h>
@@ -207,45 +206,11 @@ void free_initmem(void)
{
}
-/* Allocate and free page tables. */
-
-pgd_t *pgd_alloc(struct mm_struct *mm)
-{
- pgd_t *pgd = __pgd_alloc(mm, 0);
-
- if (pgd)
- memcpy(pgd + USER_PTRS_PER_PGD,
- swapper_pg_dir + USER_PTRS_PER_PGD,
- (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
-
- return pgd;
-}
-
void *uml_kmalloc(int size, int flags)
{
return kmalloc(size, flags);
}
-static const pgprot_t protection_map[16] = {
- [VM_NONE] = PAGE_NONE,
- [VM_READ] = PAGE_READONLY,
- [VM_WRITE] = PAGE_COPY,
- [VM_WRITE | VM_READ] = PAGE_COPY,
- [VM_EXEC] = PAGE_READONLY,
- [VM_EXEC | VM_READ] = PAGE_READONLY,
- [VM_EXEC | VM_WRITE] = PAGE_COPY,
- [VM_EXEC | VM_WRITE | VM_READ] = PAGE_COPY,
- [VM_SHARED] = PAGE_NONE,
- [VM_SHARED | VM_READ] = PAGE_READONLY,
- [VM_SHARED | VM_WRITE] = PAGE_SHARED,
- [VM_SHARED | VM_WRITE | VM_READ] = PAGE_SHARED,
- [VM_SHARED | VM_EXEC] = PAGE_READONLY,
- [VM_SHARED | VM_EXEC | VM_READ] = PAGE_READONLY,
- [VM_SHARED | VM_EXEC | VM_WRITE] = PAGE_SHARED,
- [VM_SHARED | VM_EXEC | VM_WRITE | VM_READ] = PAGE_SHARED
-};
-DECLARE_VM_GET_PAGE_PROT
-
void mark_rodata_ro(void)
{
unsigned long rodata_start = PFN_ALIGN(__start_rodata);
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 0cd6fad3d908..08959745c30d 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -25,6 +25,7 @@
#include <linux/tick.h>
#include <linux/threads.h>
#include <linux/resume_user_mode.h>
+#include <linux/start_kernel.h>
#include <asm/current.h>
#include <asm/mmu_context.h>
#include <asm/switch_to.h>
@@ -46,6 +47,8 @@
struct task_struct *cpu_tasks[NR_CPUS];
EXPORT_SYMBOL(cpu_tasks);
+static char cpu0_irqstack[THREAD_SIZE] __aligned(THREAD_SIZE);
+
void free_stack(unsigned long stack, int order)
{
free_pages(stack, order);
@@ -295,3 +298,25 @@ unsigned long __get_wchan(struct task_struct *p)
return 0;
}
+
+
+static int __init start_kernel_proc(void *unused)
+{
+ block_signals_trace();
+
+ start_kernel();
+ return 0;
+}
+
+int __init start_uml(void)
+{
+ stack_protections((unsigned long) &cpu0_irqstack);
+ set_sigstack(cpu0_irqstack, THREAD_SIZE);
+
+ init_new_thread_signals();
+
+ init_task.thread.request.thread.proc = start_kernel_proc;
+ init_task.thread.request.thread.arg = NULL;
+ return start_idle_thread(task_stack_page(&init_task),
+ &init_task.thread.switch_buf);
+}
diff --git a/arch/um/kernel/skas/process.c b/arch/um/kernel/skas/process.c
index 05dcdc057af9..5247121d3419 100644
--- a/arch/um/kernel/skas/process.c
+++ b/arch/um/kernel/skas/process.c
@@ -16,33 +16,6 @@
#include <skas.h>
#include <kern_util.h>
-extern void start_kernel(void);
-
-static int __init start_kernel_proc(void *unused)
-{
- block_signals_trace();
-
- start_kernel();
- return 0;
-}
-
-extern int userspace_pid[];
-
-static char cpu0_irqstack[THREAD_SIZE] __aligned(THREAD_SIZE);
-
-int __init start_uml(void)
-{
- stack_protections((unsigned long) &cpu0_irqstack);
- set_sigstack(cpu0_irqstack, THREAD_SIZE);
-
- init_new_thread_signals();
-
- init_task.thread.request.thread.proc = start_kernel_proc;
- init_task.thread.request.thread.arg = NULL;
- return start_idle_thread(task_stack_page(&init_task),
- &init_task.thread.switch_buf);
-}
-
unsigned long current_stub_stack(void)
{
if (current->mm == NULL)
diff --git a/arch/um/os-Linux/Makefile b/arch/um/os-Linux/Makefile
index fae836713487..c048fc838068 100644
--- a/arch/um/os-Linux/Makefile
+++ b/arch/um/os-Linux/Makefile
@@ -8,7 +8,8 @@ KCOV_INSTRUMENT := n
obj-y = execvp.o file.o helper.o irq.o main.o mem.o process.o \
registers.o sigio.o signal.o start_up.o time.o tty.o \
- umid.o user_syms.o util.o skas/
+ umid.o user_syms.o util.o
+obj-$(CONFIG_MMU) += skas/
CFLAGS_signal.o += -Wframe-larger-than=4096
diff --git a/arch/um/os-Linux/internal.h b/arch/um/os-Linux/internal.h
index 5d8d3b0817a9..89cfab0d5e47 100644
--- a/arch/um/os-Linux/internal.h
+++ b/arch/um/os-Linux/internal.h
@@ -5,6 +5,11 @@
#include <mm_id.h>
#include <stub-data.h>
+/*
+ * process.c
+ */
+extern int userspace_pid[];
+
/*
* elf_aux.c
*/
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index 00b49e90d05f..4eb7e137ef4b 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -6,6 +6,7 @@
#include <stdio.h>
#include <stdlib.h>
+#include <stdbool.h>
#include <unistd.h>
#include <errno.h>
#include <signal.h>
@@ -15,10 +16,17 @@
#include <sys/prctl.h>
#include <sys/wait.h>
#include <asm/unistd.h>
+#include <linux/threads.h>
#include <init.h>
#include <longjmp.h>
#include <os.h>
#include <skas/skas.h>
+#include <as-layout.h>
+#include <kern_util.h>
+
+int using_seccomp;
+int userspace_pid[NR_CPUS];
+int unscheduled_userspace_iterations;
void os_alarm_process(int pid)
{
@@ -189,3 +197,129 @@ void os_set_pdeathsig(void)
{
prctl(PR_SET_PDEATHSIG, SIGKILL);
}
+
+int is_skas_winch(int pid, int fd, void *data)
+{
+ return pid == getpgrp();
+}
+
+void new_thread(void *stack, jmp_buf *buf, void (*handler)(void))
+{
+ (*buf)[0].JB_IP = (unsigned long) handler;
+ (*buf)[0].JB_SP = (unsigned long) stack + UM_THREAD_SIZE -
+ sizeof(void *);
+}
+
+#define INIT_JMP_NEW_THREAD 0
+#define INIT_JMP_CALLBACK 1
+#define INIT_JMP_HALT 2
+#define INIT_JMP_REBOOT 3
+
+void switch_threads(jmp_buf *me, jmp_buf *you)
+{
+ unscheduled_userspace_iterations = 0;
+
+ if (UML_SETJMP(me) == 0)
+ UML_LONGJMP(you, 1);
+}
+
+static jmp_buf initial_jmpbuf;
+
+/* XXX Make these percpu */
+static void (*cb_proc)(void *arg);
+static void *cb_arg;
+static jmp_buf *cb_back;
+
+int start_idle_thread(void *stack, jmp_buf *switch_buf)
+{
+ int n;
+
+ set_handler(SIGWINCH);
+
+ /*
+ * Can't use UML_SETJMP or UML_LONGJMP here because they save
+ * and restore signals, with the possible side-effect of
+ * trying to handle any signals which came when they were
+ * blocked, which can't be done on this stack.
+ * Signals must be blocked when jumping back here and restored
+ * after returning to the jumper.
+ */
+ n = setjmp(initial_jmpbuf);
+ switch (n) {
+ case INIT_JMP_NEW_THREAD:
+ (*switch_buf)[0].JB_IP = (unsigned long) uml_finishsetup;
+ (*switch_buf)[0].JB_SP = (unsigned long) stack +
+ UM_THREAD_SIZE - sizeof(void *);
+ break;
+ case INIT_JMP_CALLBACK:
+ (*cb_proc)(cb_arg);
+ longjmp(*cb_back, 1);
+ break;
+ case INIT_JMP_HALT:
+ kmalloc_ok = 0;
+ return 0;
+ case INIT_JMP_REBOOT:
+ kmalloc_ok = 0;
+ return 1;
+ default:
+ printk(UM_KERN_ERR "Bad sigsetjmp return in %s - %d\n",
+ __func__, n);
+ fatal_sigsegv();
+ }
+ longjmp(*switch_buf, 1);
+
+ /* unreachable */
+ printk(UM_KERN_ERR "impossible long jump!");
+ fatal_sigsegv();
+ return 0;
+}
+
+void initial_thread_cb_skas(void (*proc)(void *), void *arg)
+{
+ jmp_buf here;
+
+ cb_proc = proc;
+ cb_arg = arg;
+ cb_back = &here;
+
+ block_signals_trace();
+ if (UML_SETJMP(&here) == 0)
+ UML_LONGJMP(&initial_jmpbuf, INIT_JMP_CALLBACK);
+ unblock_signals_trace();
+
+ cb_proc = NULL;
+ cb_arg = NULL;
+ cb_back = NULL;
+}
+
+void halt_skas(void)
+{
+ block_signals_trace();
+ UML_LONGJMP(&initial_jmpbuf, INIT_JMP_HALT);
+}
+
+static bool noreboot;
+
+static int __init noreboot_cmd_param(char *str, int *add)
+{
+ *add = 0;
+ noreboot = true;
+ return 0;
+}
+
+__uml_setup("noreboot", noreboot_cmd_param,
+"noreboot\n"
+" Rather than rebooting, exit always, akin to QEMU's -no-reboot option.\n"
+" This is useful if you're using CONFIG_PANIC_TIMEOUT in order to catch\n"
+" crashes in CI\n");
+
+void reboot_skas(void)
+{
+ block_signals_trace();
+ UML_LONGJMP(&initial_jmpbuf, noreboot ? INIT_JMP_HALT : INIT_JMP_REBOOT);
+}
+
+void __switch_mm(struct mm_id *mm_idp)
+{
+ userspace_pid[0] = mm_idp->pid;
+}
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index e42ffac23e3c..808d8c205b65 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -18,7 +18,6 @@
#include <sys/stat.h>
#include <sys/socket.h>
#include <asm/unistd.h>
-#include <as-layout.h>
#include <init.h>
#include <kern_util.h>
#include <mem.h>
@@ -29,16 +28,10 @@
#include <sysdep/stub.h>
#include <sysdep/mcontext.h>
#include <linux/futex.h>
-#include <linux/threads.h>
#include <timetravel.h>
#include <asm-generic/rwonce.h>
#include "../internal.h"
-int is_skas_winch(int pid, int fd, void *data)
-{
- return pid == getpgrp();
-}
-
static const char *ptrace_reg_name(int idx)
{
#define R(n) case HOST_##n: return #n
@@ -433,9 +426,6 @@ static int __init init_stub_exe_fd(void)
}
__initcall(init_stub_exe_fd);
-int using_seccomp;
-int userspace_pid[NR_CPUS];
-
/**
* start_userspace() - prepare a new userspace process
* @mm_id: The corresponding struct mm_id
@@ -548,7 +538,6 @@ int start_userspace(struct mm_id *mm_id)
return err;
}
-int unscheduled_userspace_iterations;
extern unsigned long tt_extra_sched_jiffies;
void userspace(struct uml_pt_regs *regs)
@@ -786,124 +775,3 @@ void userspace(struct uml_pt_regs *regs)
}
}
}
-
-void new_thread(void *stack, jmp_buf *buf, void (*handler)(void))
-{
- (*buf)[0].JB_IP = (unsigned long) handler;
- (*buf)[0].JB_SP = (unsigned long) stack + UM_THREAD_SIZE -
- sizeof(void *);
-}
-
-#define INIT_JMP_NEW_THREAD 0
-#define INIT_JMP_CALLBACK 1
-#define INIT_JMP_HALT 2
-#define INIT_JMP_REBOOT 3
-
-void switch_threads(jmp_buf *me, jmp_buf *you)
-{
- unscheduled_userspace_iterations = 0;
-
- if (UML_SETJMP(me) == 0)
- UML_LONGJMP(you, 1);
-}
-
-static jmp_buf initial_jmpbuf;
-
-/* XXX Make these percpu */
-static void (*cb_proc)(void *arg);
-static void *cb_arg;
-static jmp_buf *cb_back;
-
-int start_idle_thread(void *stack, jmp_buf *switch_buf)
-{
- int n;
-
- set_handler(SIGWINCH);
-
- /*
- * Can't use UML_SETJMP or UML_LONGJMP here because they save
- * and restore signals, with the possible side-effect of
- * trying to handle any signals which came when they were
- * blocked, which can't be done on this stack.
- * Signals must be blocked when jumping back here and restored
- * after returning to the jumper.
- */
- n = setjmp(initial_jmpbuf);
- switch (n) {
- case INIT_JMP_NEW_THREAD:
- (*switch_buf)[0].JB_IP = (unsigned long) uml_finishsetup;
- (*switch_buf)[0].JB_SP = (unsigned long) stack +
- UM_THREAD_SIZE - sizeof(void *);
- break;
- case INIT_JMP_CALLBACK:
- (*cb_proc)(cb_arg);
- longjmp(*cb_back, 1);
- break;
- case INIT_JMP_HALT:
- kmalloc_ok = 0;
- return 0;
- case INIT_JMP_REBOOT:
- kmalloc_ok = 0;
- return 1;
- default:
- printk(UM_KERN_ERR "Bad sigsetjmp return in %s - %d\n",
- __func__, n);
- fatal_sigsegv();
- }
- longjmp(*switch_buf, 1);
-
- /* unreachable */
- printk(UM_KERN_ERR "impossible long jump!");
- fatal_sigsegv();
- return 0;
-}
-
-void initial_thread_cb_skas(void (*proc)(void *), void *arg)
-{
- jmp_buf here;
-
- cb_proc = proc;
- cb_arg = arg;
- cb_back = &here;
-
- block_signals_trace();
- if (UML_SETJMP(&here) == 0)
- UML_LONGJMP(&initial_jmpbuf, INIT_JMP_CALLBACK);
- unblock_signals_trace();
-
- cb_proc = NULL;
- cb_arg = NULL;
- cb_back = NULL;
-}
-
-void halt_skas(void)
-{
- block_signals_trace();
- UML_LONGJMP(&initial_jmpbuf, INIT_JMP_HALT);
-}
-
-static bool noreboot;
-
-static int __init noreboot_cmd_param(char *str, int *add)
-{
- *add = 0;
- noreboot = true;
- return 0;
-}
-
-__uml_setup("noreboot", noreboot_cmd_param,
-"noreboot\n"
-" Rather than rebooting, exit always, akin to QEMU's -no-reboot option.\n"
-" This is useful if you're using CONFIG_PANIC_TIMEOUT in order to catch\n"
-" crashes in CI\n");
-
-void reboot_skas(void)
-{
- block_signals_trace();
- UML_LONGJMP(&initial_jmpbuf, noreboot ? INIT_JMP_HALT : INIT_JMP_REBOOT);
-}
-
-void __switch_mm(struct mm_id *mm_idp)
-{
- userspace_pid[0] = mm_idp->pid;
-}
--
2.43.0
* [PATCH v10 03/13] um: nommu: memory handling
From: Hajime Tazaki @ 2025-06-22 21:33 UTC
To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett, linux-kernel
This commit adds memory operations for UML in a !MMU environment.
Some parts of the original UML code that rely on CONFIG_MMU are excluded
from compilation when !CONFIG_MMU. Additionally, the generic implementations
of uaccess, futex, and memcpy/strnlen/strncpy can be used, since user and
kernel space share a single address space in !CONFIG_MMU mode.
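
To illustrate why the generic helpers suffice: when user and kernel code
live in one address space, a copy from "user" memory needs no separate
mapping or fault fixup and can reduce to a plain memory copy. The
following user-space sketch shows that idea only; it is not the kernel's
code. flat_copy_from_user and flat_strnlen_user are hypothetical names,
and their return conventions (bytes not copied; length including the
terminating NUL, or a value larger than count when no NUL is found) are
modelled on the kernel's uaccess helpers.

```c
#include <assert.h>
#include <string.h>

/*
 * Sketch: with a single shared address space a "user" pointer can be
 * dereferenced directly, so the copy is an ordinary memcpy().
 * Returns the number of bytes that could NOT be copied (always 0 here).
 */
static unsigned long flat_copy_from_user(void *to, const void *from,
					 unsigned long n)
{
	assert(to != NULL && from != NULL);
	memcpy(to, from, n);
	return 0;
}

/*
 * Sketch: string length including the terminating NUL, or count + 1 if
 * no NUL is found within the first count bytes.
 */
static unsigned long flat_strnlen_user(const char *str, unsigned long count)
{
	const char *end;

	assert(str != NULL);
	end = memchr(str, '\0', count);
	return end ? (unsigned long)(end - str) + 1 : count + 1;
}
```

The sketch performs no access checking at all; it only shows why no
address-space switch is involved in the copy itself.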
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
arch/um/Makefile | 4 ++++
arch/um/include/asm/futex.h | 4 ++++
arch/um/include/asm/mmu.h | 3 +++
arch/um/include/asm/mmu_context.h | 2 ++
arch/um/include/asm/uaccess.h | 7 ++++---
arch/um/kernel/mem.c | 3 ++-
arch/um/os-Linux/mem.c | 4 ++++
arch/um/os-Linux/process.c | 4 ++--
8 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/arch/um/Makefile b/arch/um/Makefile
index 7be0143b5ba3..5371c9a1b11e 100644
--- a/arch/um/Makefile
+++ b/arch/um/Makefile
@@ -46,6 +46,10 @@ ARCH_INCLUDE := -I$(srctree)/$(SHARED_HEADERS)
ARCH_INCLUDE += -I$(srctree)/$(HOST_DIR)/um/shared
KBUILD_CPPFLAGS += -I$(srctree)/$(HOST_DIR)/um
+ifneq ($(CONFIG_MMU),y)
+core-y += $(ARCH_DIR)/nommu/
+endif
+
# -Dvmap=kernel_vmap prevents anything from referencing the libpcap.o symbol so
# named - it's a common symbol in libpcap, so we get a binary which crashes.
#
diff --git a/arch/um/include/asm/futex.h b/arch/um/include/asm/futex.h
index 780aa6bfc050..785fd6649aa2 100644
--- a/arch/um/include/asm/futex.h
+++ b/arch/um/include/asm/futex.h
@@ -7,8 +7,12 @@
#include <asm/errno.h>
+#ifdef CONFIG_MMU
int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr);
int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
u32 oldval, u32 newval);
+#else
+#include <asm-generic/futex.h>
+#endif
#endif
diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index e9661846b4a3..9f30c69e5278 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -18,10 +18,13 @@ typedef struct mm_context {
unsigned long sync_tlb_range_from;
unsigned long sync_tlb_range_to;
+#ifndef CONFIG_MMU
+ unsigned long end_brk;
#ifdef CONFIG_BINFMT_ELF_FDPIC
unsigned long exec_fdpic_loadmap;
unsigned long interp_fdpic_loadmap;
#endif
+#endif /* !CONFIG_MMU */
} mm_context_t;
#endif
diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h
index 23dcc914d44e..033a70166066 100644
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -36,11 +36,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
}
}
+#ifdef CONFIG_MMU
#define init_new_context init_new_context
extern int init_new_context(struct task_struct *task, struct mm_struct *mm);
#define destroy_context destroy_context
extern void destroy_context(struct mm_struct *mm);
+#endif
#include <asm-generic/mmu_context.h>
diff --git a/arch/um/include/asm/uaccess.h b/arch/um/include/asm/uaccess.h
index 1c6e0ae41b0c..b9677758e759 100644
--- a/arch/um/include/asm/uaccess.h
+++ b/arch/um/include/asm/uaccess.h
@@ -23,6 +23,7 @@
#define __addr_range_nowrap(addr, size) \
((unsigned long) (addr) <= ((unsigned long) (addr) + (size)))
+#ifdef CONFIG_MMU
extern unsigned long raw_copy_from_user(void *to, const void __user *from, unsigned long n);
extern unsigned long raw_copy_to_user(void __user *to, const void *from, unsigned long n);
extern unsigned long __clear_user(void __user *mem, unsigned long len);
@@ -34,9 +35,6 @@ static inline int __access_ok(const void __user *ptr, unsigned long size);
#define INLINE_COPY_FROM_USER
#define INLINE_COPY_TO_USER
-
-#include <asm-generic/uaccess.h>
-
static inline int __access_ok(const void __user *ptr, unsigned long size)
{
unsigned long addr = (unsigned long)ptr;
@@ -70,5 +68,8 @@ do { \
barrier(); \
current->thread.segv_continue = NULL; \
} while (0)
+#endif
+
+#include <asm-generic/uaccess.h>
#endif
diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 106a2f85ab5c..4be1cf240d71 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -64,7 +64,8 @@ void __init arch_mm_preinit(void)
* to be turned on.
*/
brk_end = (unsigned long) UML_ROUND_UP(sbrk(0));
- map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 0);
+ map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1,
+ !IS_ENABLED(CONFIG_MMU));
memblock_free((void *)brk_end, uml_reserved - brk_end);
uml_reserved = brk_end;
min_low_pfn = PFN_UP(__pa(uml_reserved));
diff --git a/arch/um/os-Linux/mem.c b/arch/um/os-Linux/mem.c
index 72f302f4d197..4f5d9a94f8e2 100644
--- a/arch/um/os-Linux/mem.c
+++ b/arch/um/os-Linux/mem.c
@@ -213,6 +213,10 @@ int __init create_mem_file(unsigned long long len)
{
int err, fd;
+ /* NOMMU kernel uses -1 as a fd for further use (e.g., mmap) */
+ if (!IS_ENABLED(CONFIG_MMU))
+ return -1;
+
fd = create_tmp_file(len);
err = os_set_exec_close(fd);
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index 4eb7e137ef4b..8a1ab59a089f 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -99,8 +99,8 @@ int os_map_memory(void *virt, int fd, unsigned long long off, unsigned long len,
prot = (r ? PROT_READ : 0) | (w ? PROT_WRITE : 0) |
(x ? PROT_EXEC : 0);
- loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED,
- fd, off);
+ loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED |
+ (!IS_ENABLED(CONFIG_MMU) ? MAP_ANONYMOUS : 0), fd, off);
if (loc == MAP_FAILED)
return -errno;
return 0;
--
2.43.0
* [PATCH v10 04/13] x86/um: nommu: syscall handling
2025-06-22 21:32 [PATCH v10 00/13] nommu UML Hajime Tazaki
` (2 preceding siblings ...)
2025-06-22 21:33 ` [PATCH v10 03/13] um: nommu: memory handling Hajime Tazaki
@ 2025-06-22 21:33 ` Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 05/13] um: nommu: seccomp syscalls hook Hajime Tazaki
` (8 subsequent siblings)
12 siblings, 0 replies; 25+ messages in thread
From: Hajime Tazaki @ 2025-06-22 21:33 UTC (permalink / raw)
To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett, linux-kernel
This commit introduces a syscall entry point for !MMU mode. The entry
function, __kernel_vsyscall, is a kernel-wide global symbol accessible
from any location.
Although it is out of scope for this commit, the entry point can also be
exposed via the vdso image, which is directly accessible from userspace.
A standard library (e.g., libc) can use this entry point to implement
its syscall wrappers; it can also be reached by hooking the syscall
instructions of unmodified userspace applications/libraries, which is
implemented in the subsequent commit.
Only the 64-bit mode of the x86 architecture is supported.
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
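Note for readers (not part of the patch): the dispatch done by
do_syscall_64() can be sketched in user space as below. The return slot
is seeded with -ENOSYS, mirroring PUSH_AND_CLEAR_REGS rax=$-ENOSYS in the
entry code, and is only overwritten when the syscall number selects a
valid table entry. All names here (model_regs, model_table,
MODEL_NR_SYSCALLS, model_do_syscall and the two toy handlers) are
hypothetical, and the explicit lower-bound check is an assumption of this
sketch rather than something taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the register frame built by the entry code. */
struct model_regs {
	long orig_ax;   /* syscall number on entry */
	long ax;        /* return value slot */
	long args[6];   /* rdi, rsi, rdx, r10, r8, r9 */
};

typedef long (*model_handler_t)(const long args[6]);

/* Two toy handlers, only there to populate the table. */
static long model_sys_add(const long args[6]) { return args[0] + args[1]; }
static long model_sys_zero(const long args[6]) { (void)args; return 0; }

#define MODEL_NR_SYSCALLS 2

static const model_handler_t model_table[MODEL_NR_SYSCALLS] = {
	model_sys_add,
	model_sys_zero,
};

/* Out-of-range numbers leave the pre-seeded -ENOSYS untouched. */
static void model_do_syscall(struct model_regs *regs)
{
	long nr;

	assert(regs != NULL);
	nr = regs->orig_ax;

	regs->ax = -ENOSYS;
	if (nr >= 0 && nr < MODEL_NR_SYSCALLS)
		regs->ax = model_table[nr](regs->args);
}
```

A caller fills orig_ax and args, calls model_do_syscall(), and reads the
result from ax, which is roughly what a libc wrapper sees in %rax after
__kernel_vsyscall returns.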
arch/x86/um/Makefile | 4 ++
arch/x86/um/nommu/Makefile | 8 +++
arch/x86/um/nommu/do_syscall_64.c | 25 +++++++
arch/x86/um/nommu/entry_64.S | 91 +++++++++++++++++++++++++
arch/x86/um/nommu/syscalls.h | 16 +++++
arch/x86/um/shared/sysdep/syscalls_64.h | 6 ++
6 files changed, 150 insertions(+)
create mode 100644 arch/x86/um/nommu/Makefile
create mode 100644 arch/x86/um/nommu/do_syscall_64.c
create mode 100644 arch/x86/um/nommu/entry_64.S
create mode 100644 arch/x86/um/nommu/syscalls.h
diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile
index b42c31cd2390..227af2a987e2 100644
--- a/arch/x86/um/Makefile
+++ b/arch/x86/um/Makefile
@@ -32,6 +32,10 @@ obj-y += syscalls_64.o vdso/
subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o \
../lib/memmove_64.o ../lib/memset_64.o
+ifneq ($(CONFIG_MMU),y)
+obj-y += nommu/
+endif
+
endif
subarch-$(CONFIG_MODULES) += ../kernel/module.o
diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile
new file mode 100644
index 000000000000..d72c63afffa5
--- /dev/null
+++ b/arch/x86/um/nommu/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+ifeq ($(CONFIG_X86_32),y)
+ BITS := 32
+else
+ BITS := 64
+endif
+
+obj-y = do_syscall_$(BITS).o entry_$(BITS).o
diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c
new file mode 100644
index 000000000000..6b08daab6afe
--- /dev/null
+++ b/arch/x86/um/nommu/do_syscall_64.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/kernel.h>
+#include <linux/ptrace.h>
+#include <kern_util.h>
+#include <sysdep/syscalls.h>
+#include <os.h>
+
+__visible void do_syscall_64(struct pt_regs *regs)
+{
+ int syscall;
+
+ syscall = PT_SYSCALL_NR(regs->regs.gp);
+ UPT_SYSCALL_NR(&regs->regs) = syscall;
+
+ if (likely(syscall < NR_syscalls)) {
+ PT_REGS_SET_SYSCALL_RETURN(regs,
+ EXECUTE_SYSCALL(syscall, regs));
+ }
+
+ PT_REGS_SYSCALL_RET(regs) = regs->regs.gp[HOST_AX];
+
+ /* handle tasks and signals at the end */
+ interrupt_end();
+}
diff --git a/arch/x86/um/nommu/entry_64.S b/arch/x86/um/nommu/entry_64.S
new file mode 100644
index 000000000000..e9bfc7b93c84
--- /dev/null
+++ b/arch/x86/um/nommu/entry_64.S
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/errno.h>
+
+#include <linux/linkage.h>
+#include <asm/percpu.h>
+#include <asm/desc.h>
+
+#include "../entry/calling.h"
+
+#ifdef CONFIG_SMP
+#error need to stash these variables somewhere else
+#endif
+
+#define UM_GLOBAL_VAR(x) .data; .align 8; .globl x; x:; .long 0
+
+UM_GLOBAL_VAR(current_top_of_stack)
+UM_GLOBAL_VAR(current_ptregs)
+
+.code64
+.section .entry.text, "ax"
+
+.align 8
+#undef ENTRY
+#define ENTRY(x) .text; .globl x; .type x,%function; x:
+#undef END
+#define END(x) .size x, . - x
+
+/*
+ * %rcx has the return address (we set it before entering __kernel_vsyscall).
+ *
+ * Registers on entry:
+ * rax system call number
+ * rcx return address
+ * rdi arg0
+ * rsi arg1
+ * rdx arg2
+ * r10 arg3
+ * r8 arg4
+ * r9 arg5
+ *
+ * (note: we are allowed to mess with r11: r11 is callee-clobbered
+ * register in C ABI)
+ */
+ENTRY(__kernel_vsyscall)
+
+ movq %rsp, %r11
+
+ /* Point rsp to the top of the ptregs array, so we can
+ just fill it with a bunch of push'es. */
+ movq current_ptregs, %rsp
+
+ /* 8 bytes * 20 registers (plus 8 for the push) */
+ addq $168, %rsp
+
+ /* Construct struct pt_regs on stack */
+ pushq $0 /* pt_regs->ss (index 20) */
+ pushq %r11 /* pt_regs->sp */
+ pushfq /* pt_regs->flags */
+ pushq $0 /* pt_regs->cs */
+ pushq %rcx /* pt_regs->ip */
+ pushq %rax /* pt_regs->orig_ax */
+
+ PUSH_AND_CLEAR_REGS rax=$-ENOSYS
+
+ mov %rsp, %rdi
+
+ /*
+ * Switch to current top of stack, so "current->" points
+ * to the right task.
+ */
+ movq current_top_of_stack, %rsp
+
+ call do_syscall_64
+
+ movq current_ptregs, %rsp
+
+ POP_REGS
+
+ addq $8, %rsp /* skip orig_ax */
+ popq %rcx /* pt_regs->ip */
+ addq $8, %rsp /* skip cs */
+ addq $8, %rsp /* skip flags */
+ popq %rsp
+
+ /*
+ * not return w/ ret but w/ jmp as the stack is already popped before
+ * entering __kernel_vsyscall
+ */
+ jmp *%rcx
+
+END(__kernel_vsyscall)
diff --git a/arch/x86/um/nommu/syscalls.h b/arch/x86/um/nommu/syscalls.h
new file mode 100644
index 000000000000..a2433756b1fc
--- /dev/null
+++ b/arch/x86/um/nommu/syscalls.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __UM_NOMMU_SYSCALLS_H
+#define __UM_NOMMU_SYSCALLS_H
+
+
+#define task_top_of_stack(task) \
+({ \
+ unsigned long __ptr = (unsigned long)task->stack; \
+ __ptr += THREAD_SIZE; \
+ __ptr; \
+})
+
+extern long current_top_of_stack;
+extern long current_ptregs;
+
+#endif
diff --git a/arch/x86/um/shared/sysdep/syscalls_64.h b/arch/x86/um/shared/sysdep/syscalls_64.h
index b6b997225841..ffd80ee3b9dc 100644
--- a/arch/x86/um/shared/sysdep/syscalls_64.h
+++ b/arch/x86/um/shared/sysdep/syscalls_64.h
@@ -25,4 +25,10 @@ extern syscall_handler_t *sys_call_table[];
extern syscall_handler_t sys_modify_ldt;
extern syscall_handler_t sys_arch_prctl;
+#ifndef CONFIG_MMU
+extern void do_syscall_64(struct pt_regs *regs);
+extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
+ int64_t a4, int64_t a5, int64_t a6);
+#endif
+
#endif
--
2.43.0
* [PATCH v10 05/13] um: nommu: seccomp syscalls hook
2025-06-22 21:32 [PATCH v10 00/13] nommu UML Hajime Tazaki
` (3 preceding siblings ...)
2025-06-22 21:33 ` [PATCH v10 04/13] x86/um: nommu: syscall handling Hajime Tazaki
@ 2025-06-22 21:33 ` Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 06/13] x86/um: nommu: process/thread handling Hajime Tazaki
` (7 subsequent siblings)
12 siblings, 0 replies; 25+ messages in thread
From: Hajime Tazaki @ 2025-06-22 21:33 UTC (permalink / raw)
To: linux-um
Cc: thehajime, ricarkol, Liam.Howlett, linux-kernel, Kenichi Yasukata
This commit adds a syscall hook based on seccomp.
The seccomp filter raises SIGSYS in the UML process; the (UML) kernel
catches the signal and redirects execution to the syscall entry point,
__kernel_vsyscall, thereby hooking the original syscall instruction.
SIGSYS is raised when a syscall is executed from an address between
uml_reserved and high_physmem, the range that holds userspace memory.
It also renames the existing static function sigsys_handler() in
start_up.c to avoid a name conflict with the new handler.
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Kenichi Yasukata <kenichi.yasukata@gmail.com>
---
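Note for readers (not part of the patch): the filter in
arch/um/os-Linux/seccomp.c has to decide on a 64-bit instruction pointer,
but classic BPF only compares 32-bit words, so the address is checked as
a high half and a low half. The plain-C sketch below models the same
allow/trap decision. The names hi32, lo32 and model_ip_traps are
hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t hi32(uint64_t v) { return (uint32_t)(v >> 32); }
static uint32_t lo32(uint64_t v) { return (uint32_t)v; }

/*
 * Returns true when a syscall issued at address ip would be trapped
 * (SECCOMP_RET_TRAP), i.e. when start <= ip < end. Returning false
 * corresponds to SECCOMP_RET_ALLOW.
 */
static bool model_ip_traps(uint64_t ip, uint64_t start, uint64_t end)
{
	assert(start <= end);

	/* ip at or above end: allow */
	if (hi32(ip) > hi32(end))
		return false;
	if (hi32(ip) == hi32(end) && lo32(ip) >= lo32(end))
		return false;

	/* ip below start: allow */
	if (hi32(ip) < hi32(start))
		return false;
	if (hi32(ip) == hi32(start) && lo32(ip) < lo32(start))
		return false;

	/* inside [start, end): trap */
	return true;
}
```

With start = uml_reserved and end = high_physmem, syscalls issued by the
UML kernel itself (outside the range) go straight to the host, while
syscalls issued from userspace memory are trapped and redirected.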
arch/um/include/shared/kern_util.h | 2 +
arch/um/include/shared/os.h | 10 +++
arch/um/kernel/um_arch.c | 3 +
arch/um/nommu/Makefile | 3 +
arch/um/nommu/os-Linux/Makefile | 7 +++
arch/um/nommu/os-Linux/signal.c | 16 +++++
arch/um/os-Linux/Makefile | 5 ++
arch/um/os-Linux/seccomp.c | 87 +++++++++++++++++++++++++++
arch/um/os-Linux/signal.c | 8 +++
arch/um/os-Linux/start_up.c | 4 +-
arch/x86/um/nommu/Makefile | 2 +-
arch/x86/um/nommu/os-Linux/Makefile | 6 ++
arch/x86/um/nommu/os-Linux/mcontext.c | 13 ++++
arch/x86/um/shared/sysdep/mcontext.h | 4 ++
14 files changed, 167 insertions(+), 3 deletions(-)
create mode 100644 arch/um/nommu/Makefile
create mode 100644 arch/um/nommu/os-Linux/Makefile
create mode 100644 arch/um/nommu/os-Linux/signal.c
create mode 100644 arch/um/os-Linux/seccomp.c
create mode 100644 arch/x86/um/nommu/os-Linux/Makefile
create mode 100644 arch/x86/um/nommu/os-Linux/mcontext.c
diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
index 00ca3e12fd9a..ec8ba1f13c58 100644
--- a/arch/um/include/shared/kern_util.h
+++ b/arch/um/include/shared/kern_util.h
@@ -66,6 +66,8 @@ extern void segv_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs
extern void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs,
void *mc);
extern void fatal_sigsegv(void) __attribute__ ((noreturn));
+extern void sigsys_handler(int sig, struct siginfo *si, struct uml_pt_regs *regs,
+ void *mc);
void um_idle_sleep(void);
diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index b35cc8ce333b..1251f08e26d0 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -338,4 +338,14 @@ extern void um_trace_signals_off(void);
/* time-travel */
extern void deliver_time_travel_irqs(void);
+/* seccomp.c */
+#ifdef CONFIG_MMU
+static inline int os_setup_seccomp(void)
+{
+ return 0;
+}
+#else
+extern int os_setup_seccomp(void);
+#endif
+
#endif
diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c
index 2f5ee045bc7a..14b9dcab9907 100644
--- a/arch/um/kernel/um_arch.c
+++ b/arch/um/kernel/um_arch.c
@@ -431,6 +431,9 @@ void __init setup_arch(char **cmdline_p)
add_bootloader_randomness(rng_seed, sizeof(rng_seed));
memzero_explicit(rng_seed, sizeof(rng_seed));
}
+
+ /* install seccomp filter */
+ os_setup_seccomp();
}
void __init arch_cpu_finalize_init(void)
diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile
new file mode 100644
index 000000000000..baab7c2f57c2
--- /dev/null
+++ b/arch/um/nommu/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y := os-Linux/
diff --git a/arch/um/nommu/os-Linux/Makefile b/arch/um/nommu/os-Linux/Makefile
new file mode 100644
index 000000000000..68833c576437
--- /dev/null
+++ b/arch/um/nommu/os-Linux/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y := signal.o
+USER_OBJS := $(obj-y)
+
+include $(srctree)/arch/um/scripts/Makefile.rules
+USER_CFLAGS+=-I$(srctree)/arch/um/os-Linux
diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-Linux/signal.c
new file mode 100644
index 000000000000..19043b9652e2
--- /dev/null
+++ b/arch/um/nommu/os-Linux/signal.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <signal.h>
+#include <kern_util.h>
+#include <os.h>
+#include <sysdep/mcontext.h>
+#include <sys/ucontext.h>
+
+void sigsys_handler(int sig, struct siginfo *si,
+ struct uml_pt_regs *regs, void *ptr)
+{
+ mcontext_t *mc = (mcontext_t *) ptr;
+
+ /* hook syscall via SIGSYS */
+ set_mc_sigsys_hook(mc);
+}
diff --git a/arch/um/os-Linux/Makefile b/arch/um/os-Linux/Makefile
index c048fc838068..432476a4239a 100644
--- a/arch/um/os-Linux/Makefile
+++ b/arch/um/os-Linux/Makefile
@@ -21,4 +21,9 @@ USER_OBJS := $(user-objs-y) elf_aux.o execvp.o file.o helper.o irq.o \
main.o mem.o process.o registers.o sigio.o signal.o start_up.o time.o \
tty.o umid.o util.o
+ifneq ($(CONFIG_MMU),y)
+obj-y += seccomp.o
+USER_OBJS += seccomp.o
+endif
+
include $(srctree)/arch/um/scripts/Makefile.rules
diff --git a/arch/um/os-Linux/seccomp.c b/arch/um/os-Linux/seccomp.c
new file mode 100644
index 000000000000..d1cfa6e3d632
--- /dev/null
+++ b/arch/um/os-Linux/seccomp.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <signal.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h> /* For SYS_xxx definitions */
+#include <init.h>
+#include <as-layout.h>
+#include <os.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+
+int __init os_setup_seccomp(void)
+{
+ int err;
+ unsigned long __userspace_start = uml_reserved,
+ __userspace_end = high_physmem;
+
+ struct sock_filter filter[] = {
+ /* if (IP_high > __userspace_end) allow; */
+ BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+ offsetof(struct seccomp_data, instruction_pointer) + 4),
+ BPF_JUMP(BPF_JMP + BPF_JGT + BPF_K, __userspace_end >> 32,
+ /*true-skip=*/0, /*false-skip=*/1),
+ BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+ /* if (IP_high == __userspace_end && IP_low >= __userspace_end) allow; */
+ BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+ offsetof(struct seccomp_data, instruction_pointer) + 4),
+ BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __userspace_end >> 32,
+ /*true-skip=*/0, /*false-skip=*/3),
+ BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+ offsetof(struct seccomp_data, instruction_pointer)),
+ BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_end,
+ /*true-skip=*/0, /*false-skip=*/1),
+ BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+ /* if (IP_high < __userspace_start) allow; */
+ BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+ offsetof(struct seccomp_data, instruction_pointer) + 4),
+ BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_start >> 32,
+ /*true-skip=*/1, /*false-skip=*/0),
+ BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+ /* if (IP_high == __userspace_start && IP_low < __userspace_start) allow; */
+ BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+ offsetof(struct seccomp_data, instruction_pointer) + 4),
+ BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __userspace_start >> 32,
+ /*true-skip=*/0, /*false-skip=*/3),
+ BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+ offsetof(struct seccomp_data, instruction_pointer)),
+ BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_start,
+ /*true-skip=*/1, /*false-skip=*/0),
+ BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+ /* other address; trap */
+ BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRAP),
+ };
+ struct sock_fprog prog = {
+ .len = ARRAY_SIZE(filter),
+ .filter = filter,
+ };
+
+ err = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ if (err)
+ os_warn("PR_SET_NO_NEW_PRIVS (err=%d, errno=%d)\n",
+ err, errno);
+
+ err = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
+ SECCOMP_FILTER_FLAG_TSYNC, &prog);
+ if (err) {
+ os_warn("SECCOMP_SET_MODE_FILTER (err=%d, errno=%d)\n",
+ err, errno);
+ exit(1);
+ }
+
+ set_handler(SIGSYS);
+
+ os_info("seccomp: setup filter syscalls in the range: 0x%lx-0x%lx\n",
+ __userspace_start, __userspace_end);
+
+ return 0;
+}
+
diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
index 11f07f498270..53e276e81b37 100644
--- a/arch/um/os-Linux/signal.c
+++ b/arch/um/os-Linux/signal.c
@@ -20,6 +20,7 @@
#include <um_malloc.h>
#include <sys/ucontext.h>
#include <timetravel.h>
+#include <linux/compiler_attributes.h>
void (*sig_info[NSIG])(int, struct siginfo *, struct uml_pt_regs *, void *mc) = {
[SIGTRAP] = relay_signal,
@@ -30,6 +31,7 @@ void (*sig_info[NSIG])(int, struct siginfo *, struct uml_pt_regs *, void *mc) =
[SIGSEGV] = segv_handler,
[SIGIO] = sigio_handler,
[SIGCHLD] = sigchld_handler,
+ [SIGSYS] = sigsys_handler,
};
static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc)
@@ -176,6 +178,11 @@ static void sigusr1_handler(int sig, struct siginfo *unused_si, mcontext_t *mc)
uml_pm_wake();
}
+__weak void sigsys_handler(int sig, struct siginfo *unused_si,
+ struct uml_pt_regs *regs, void *mc)
+{
+}
+
void register_pm_wake_signal(void)
{
set_handler(SIGUSR1);
@@ -187,6 +194,7 @@ static void (*handlers[_NSIG])(int sig, struct siginfo *si, mcontext_t *mc) = {
[SIGILL] = sig_handler,
[SIGFPE] = sig_handler,
[SIGTRAP] = sig_handler,
+ [SIGSYS] = sig_handler,
[SIGIO] = sig_handler,
[SIGWINCH] = sig_handler,
diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c
index a827c2e01aa5..4e1f05360c49 100644
--- a/arch/um/os-Linux/start_up.c
+++ b/arch/um/os-Linux/start_up.c
@@ -238,7 +238,7 @@ extern unsigned long *exec_fp_regs;
__initdata static struct stub_data *seccomp_test_stub_data;
-static void __init sigsys_handler(int sig, siginfo_t *info, void *p)
+static void __init _sigsys_handler(int sig, siginfo_t *info, void *p)
{
ucontext_t *uc = p;
@@ -273,7 +273,7 @@ static int __init seccomp_helper(void *data)
sizeof(seccomp_test_stub_data->sigstack));
sa.sa_flags = SA_ONSTACK | SA_NODEFER | SA_SIGINFO;
- sa.sa_sigaction = (void *) sigsys_handler;
+ sa.sa_sigaction = (void *) _sigsys_handler;
sa.sa_restorer = NULL;
if (sigaction(SIGSYS, &sa, NULL) < 0)
exit(2);
diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile
index d72c63afffa5..ebe47d4836f4 100644
--- a/arch/x86/um/nommu/Makefile
+++ b/arch/x86/um/nommu/Makefile
@@ -5,4 +5,4 @@ else
BITS := 64
endif
-obj-y = do_syscall_$(BITS).o entry_$(BITS).o
+obj-y = do_syscall_$(BITS).o entry_$(BITS).o os-Linux/
diff --git a/arch/x86/um/nommu/os-Linux/Makefile b/arch/x86/um/nommu/os-Linux/Makefile
new file mode 100644
index 000000000000..4571e403a6ff
--- /dev/null
+++ b/arch/x86/um/nommu/os-Linux/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y = mcontext.o
+USER_OBJS := mcontext.o
+
+include $(srctree)/arch/um/scripts/Makefile.rules
diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c
new file mode 100644
index 000000000000..c4ef877d5ea0
--- /dev/null
+++ b/arch/x86/um/nommu/os-Linux/mcontext.c
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/ucontext.h>
+#define __FRAME_OFFSETS
+#include <asm/ptrace.h>
+#include <sysdep/ptrace.h>
+#include <sysdep/mcontext.h>
+#include <sysdep/syscalls.h>
+
+void set_mc_sigsys_hook(mcontext_t *mc)
+{
+ mc->gregs[REG_RCX] = mc->gregs[REG_RIP];
+ mc->gregs[REG_RIP] = (unsigned long) __kernel_vsyscall;
+}
diff --git a/arch/x86/um/shared/sysdep/mcontext.h b/arch/x86/um/shared/sysdep/mcontext.h
index 6fe490cc5b98..9a0d6087f357 100644
--- a/arch/x86/um/shared/sysdep/mcontext.h
+++ b/arch/x86/um/shared/sysdep/mcontext.h
@@ -17,6 +17,10 @@ extern int get_stub_state(struct uml_pt_regs *regs, struct stub_data *data,
extern int set_stub_state(struct uml_pt_regs *regs, struct stub_data *data,
int single_stepping);
+#ifndef CONFIG_MMU
+extern void set_mc_sigsys_hook(mcontext_t *mc);
+#endif
+
#ifdef __i386__
#define GET_FAULTINFO_FROM_MC(fi, mc) \
--
2.43.0
* [PATCH v10 06/13] x86/um: nommu: process/thread handling
2025-06-22 21:32 [PATCH v10 00/13] nommu UML Hajime Tazaki
` (4 preceding siblings ...)
2025-06-22 21:33 ` [PATCH v10 05/13] um: nommu: seccomp syscalls hook Hajime Tazaki
@ 2025-06-22 21:33 ` Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 07/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
` (6 subsequent siblings)
12 siblings, 0 replies; 25+ messages in thread
From: Hajime Tazaki @ 2025-06-22 21:33 UTC (permalink / raw)
To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett, linux-kernel
Since the ptrace facility isn't used in !MMU mode of UML, processes and
threads are started through a different code path: no external process
is used, and some registers (e.g., the fs segment register for TLS) must
be configured properly on every context switch.
Signals aren't delivered on non-ptrace syscall entry/exit, so we also
need to handle pending signals ourselves.
ptrace-related syscalls are not tested yet, so arch_has_single_step() is
marked unsupported in the !MMU environment.
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
arch/um/include/asm/ptrace-generic.h | 2 +-
arch/x86/um/Makefile | 3 +-
arch/x86/um/nommu/Makefile | 2 +-
arch/x86/um/nommu/entry_64.S | 22 ++++++++++++++
arch/x86/um/nommu/syscalls_64.c | 44 ++++++++++++++++++++++++++++
5 files changed, 70 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/um/nommu/syscalls_64.c
diff --git a/arch/um/include/asm/ptrace-generic.h b/arch/um/include/asm/ptrace-generic.h
index 4ff844bcb1cd..a9778c9a59a3 100644
--- a/arch/um/include/asm/ptrace-generic.h
+++ b/arch/um/include/asm/ptrace-generic.h
@@ -14,7 +14,7 @@ struct pt_regs {
struct uml_pt_regs regs;
};
-#define arch_has_single_step() (1)
+#define arch_has_single_step() (IS_ENABLED(CONFIG_MMU))
#define EMPTY_REGS { .regs = EMPTY_UML_PT_REGS }
diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile
index 227af2a987e2..53c9ebb3c41c 100644
--- a/arch/x86/um/Makefile
+++ b/arch/x86/um/Makefile
@@ -27,7 +27,8 @@ subarch-y += ../kernel/sys_ia32.o
else
-obj-y += syscalls_64.o vdso/
+obj-y += vdso/
+obj-$(CONFIG_MMU) += syscalls_64.o
subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o \
../lib/memmove_64.o ../lib/memset_64.o
diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile
index ebe47d4836f4..4018d9e0aba0 100644
--- a/arch/x86/um/nommu/Makefile
+++ b/arch/x86/um/nommu/Makefile
@@ -5,4 +5,4 @@ else
BITS := 64
endif
-obj-y = do_syscall_$(BITS).o entry_$(BITS).o os-Linux/
+obj-y = do_syscall_$(BITS).o entry_$(BITS).o syscalls_$(BITS).o os-Linux/
diff --git a/arch/x86/um/nommu/entry_64.S b/arch/x86/um/nommu/entry_64.S
index e9bfc7b93c84..950447dfa66b 100644
--- a/arch/x86/um/nommu/entry_64.S
+++ b/arch/x86/um/nommu/entry_64.S
@@ -89,3 +89,25 @@ ENTRY(__kernel_vsyscall)
jmp *%rcx
END(__kernel_vsyscall)
+
+// void userspace(struct uml_pt_regs *regs)
+ENTRY(userspace)
+
+ /* align the stack for x86_64 ABI */
+ and $-0x10, %rsp
+ /* Handle any immediate reschedules or signals */
+ call interrupt_end
+
+ movq current_ptregs, %rsp
+
+ POP_REGS
+
+ addq $8, %rsp /* skip orig_ax */
+ popq %r11 /* pt_regs->ip */
+ addq $8, %rsp /* skip cs */
+ addq $8, %rsp /* skip flags */
+ popq %rsp
+
+ jmp *%r11
+
+END(userspace)
diff --git a/arch/x86/um/nommu/syscalls_64.c b/arch/x86/um/nommu/syscalls_64.c
new file mode 100644
index 000000000000..e88e93e9d80a
--- /dev/null
+++ b/arch/x86/um/nommu/syscalls_64.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2003 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
+ * Copyright 2003 PathScale, Inc.
+ *
+ * Licensed under the GPL
+ */
+
+#include <linux/sched.h>
+#include <linux/sched/mm.h>
+#include <linux/syscalls.h>
+#include <linux/uaccess.h>
+#include <asm/prctl.h> /* XXX This should get the constants from libc */
+#include <registers.h>
+#include <os.h>
+#include "syscalls.h"
+
+void arch_switch_to(struct task_struct *to)
+{
+ /*
+ * In !CONFIG_MMU, it doesn't ptrace thus,
+ * The FS_BASE registers are saved here.
+ */
+ current_top_of_stack = task_top_of_stack(to);
+ current_ptregs = (long)task_pt_regs(to);
+
+ if ((to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] == 0) ||
+ (to->mm == NULL))
+ return;
+
+ /* this changes the FS on every context switch */
+ arch_prctl(to, ARCH_SET_FS,
+ (void __user *) to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)]);
+}
+
+SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
+ unsigned long, prot, unsigned long, flags,
+ unsigned long, fd, unsigned long, off)
+{
+ if (off & ~PAGE_MASK)
+ return -EINVAL;
+
+ return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
+}
--
2.43.0
* [PATCH v10 07/13] um: nommu: configure fs register on host syscall invocation
2025-06-22 21:32 [PATCH v10 00/13] nommu UML Hajime Tazaki
` (5 preceding siblings ...)
2025-06-22 21:33 ` [PATCH v10 06/13] x86/um: nommu: process/thread handling Hajime Tazaki
@ 2025-06-22 21:33 ` Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 08/13] x86/um/vdso: nommu: vdso memory update Hajime Tazaki
` (5 subsequent siblings)
12 siblings, 0 replies; 25+ messages in thread
From: Hajime Tazaki @ 2025-06-22 21:33 UTC (permalink / raw)
To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett, linux-kernel
Userspace on UML/!MMU also needs to configure the %fs register while it
is running in order to access its thread structure correctly, so host
syscalls implemented in os-Linux drivers may be confused when they are
called with the userspace value still in place. Thus the %fs register
has to be configured via arch_prctl(ARCH_SET_FS) on every host syscall.
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
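Note for readers (not part of the patch): the host-side query that
os_arch_prctl() wraps can be tried from an ordinary user-space program.
The sketch below reads the calling thread's FS base with
arch_prctl(ARCH_GET_FS), which is the value the patch stashes in host_fs
at boot. The helper name model_get_host_fs is hypothetical, and the
non-x86-64 fallback is an assumption added only so the sketch builds
elsewhere.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

#if defined(__linux__) && defined(__x86_64__)
#include <asm/prctl.h>	/* ARCH_GET_FS */
#endif

/*
 * Read the calling thread's FS base from the host.
 * Returns 0 on success, -1 on failure or on non-x86-64 hosts.
 */
static int model_get_host_fs(unsigned long *base)
{
	assert(base != NULL);
#if defined(__linux__) && defined(__x86_64__)
	return (int)syscall(SYS_arch_prctl, ARCH_GET_FS, base);
#else
	errno = ENOSYS;
	return -1;
#endif
}
```

The patch performs the matching ARCH_SET_FS twice per syscall: once with
host_fs before the handler runs, and once with the task's saved FS_BASE
before returning to userspace.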
arch/um/include/shared/os.h | 6 +++
arch/um/os-Linux/process.c | 6 +++
arch/um/os-Linux/start_up.c | 21 +++++++++
arch/x86/um/nommu/do_syscall_64.c | 37 ++++++++++++++++
arch/x86/um/nommu/syscalls_64.c | 71 +++++++++++++++++++++++++++++++
5 files changed, 141 insertions(+)
diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index 1251f08e26d0..7c6a8bc0447c 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -189,6 +189,7 @@ extern void check_host_supports_tls(int *supports_tls, int *tls_min);
extern void get_host_cpu_features(
void (*flags_helper_func)(char *line),
void (*cache_helper_func)(char *line));
+extern int host_has_fsgsbase;
/* mem.c */
extern int create_mem_file(unsigned long long len);
@@ -213,6 +214,11 @@ extern int os_protect_memory(void *addr, unsigned long len,
extern int os_unmap_memory(void *addr, int len);
extern int os_drop_memory(void *addr, int length);
extern int can_drop_memory(void);
+extern int os_arch_prctl(int pid, int option, unsigned long *arg);
+#ifndef CONFIG_MMU
+extern long long host_fs;
+#endif
+
void os_set_pdeathsig(void);
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index 8a1ab59a089f..3a6d34ccd12b 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -16,6 +16,7 @@
#include <sys/prctl.h>
#include <sys/wait.h>
#include <asm/unistd.h>
+#include <sys/syscall.h> /* For SYS_xxx definitions */
#include <linux/threads.h>
#include <init.h>
#include <longjmp.h>
@@ -178,6 +179,11 @@ int __init can_drop_memory(void)
return ok;
}
+int os_arch_prctl(int pid, int option, unsigned long *arg2)
+{
+ return syscall(SYS_arch_prctl, option, arg2);
+}
+
void init_new_thread_signals(void)
{
set_handler(SIGSEGV);
diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c
index 4e1f05360c49..55dd92bd2a0b 100644
--- a/arch/um/os-Linux/start_up.c
+++ b/arch/um/os-Linux/start_up.c
@@ -20,6 +20,8 @@
#include <sys/resource.h>
#include <asm/ldt.h>
#include <asm/unistd.h>
+#include <sys/auxv.h>
+#include <asm/hwcap2.h>
#include <init.h>
#include <os.h>
#include <kern_util.h>
@@ -36,6 +38,8 @@
#include <skas.h>
#include "internal.h"
+int host_has_fsgsbase;
+
static void ptrace_child(void)
{
int ret;
@@ -459,6 +463,20 @@ __uml_setup("seccomp=", uml_seccomp_config,
" This is insecure and should only be used with a trusted userspace\n\n"
);
+static void __init check_fsgsbase(void)
+{
+ unsigned long auxv = getauxval(AT_HWCAP2);
+
+ os_info("Checking FSGSBASE instructions...");
+ if (auxv & HWCAP2_FSGSBASE) {
+ host_has_fsgsbase = 1;
+ os_info("OK\n");
+ } else {
+ host_has_fsgsbase = 0;
+ os_info("disabled\n");
+ }
+}
+
void __init os_early_checks(void)
{
int pid;
@@ -484,6 +502,9 @@ void __init os_early_checks(void)
using_seccomp = 0;
check_ptrace();
+ /* probe fsgsbase instruction */
+ check_fsgsbase();
+
pid = start_ptraced_child();
if (init_pid_registers(pid))
fatal("Failed to initialize default registers");
diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c
index 6b08daab6afe..74d5bcc4508d 100644
--- a/arch/x86/um/nommu/do_syscall_64.c
+++ b/arch/x86/um/nommu/do_syscall_64.c
@@ -2,10 +2,38 @@
#include <linux/kernel.h>
#include <linux/ptrace.h>
+#include <asm/fsgsbase.h>
+#include <asm/prctl.h>
#include <kern_util.h>
#include <sysdep/syscalls.h>
#include <os.h>
+static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2)
+{
+ if (!host_has_fsgsbase)
+ return os_arch_prctl(pid, option, arg2);
+
+ switch (option) {
+ case ARCH_SET_FS:
+ wrfsbase(*arg2);
+ break;
+ case ARCH_SET_GS:
+ wrgsbase(*arg2);
+ break;
+ case ARCH_GET_FS:
+ *arg2 = rdfsbase();
+ break;
+ case ARCH_GET_GS:
+ *arg2 = rdgsbase();
+ break;
+ default:
+ pr_warn("%s: unsupported option: 0x%x", __func__, option);
+ break;
+ }
+
+ return 0;
+}
+
__visible void do_syscall_64(struct pt_regs *regs)
{
int syscall;
@@ -13,6 +41,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
syscall = PT_SYSCALL_NR(regs->regs.gp);
UPT_SYSCALL_NR(&regs->regs) = syscall;
+ /* set fs register to the original host one */
+ os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
+
if (likely(syscall < NR_syscalls)) {
PT_REGS_SET_SYSCALL_RETURN(regs,
EXECUTE_SYSCALL(syscall, regs));
@@ -22,4 +53,10 @@ __visible void do_syscall_64(struct pt_regs *regs)
/* handle tasks and signals at the end */
interrupt_end();
+
+ /* restore back fs register to userspace configured one */
+ os_x86_arch_prctl(0, ARCH_SET_FS,
+ (void *)(current->thread.regs.regs.gp[FS_BASE
+ / sizeof(unsigned long)]));
+
}
diff --git a/arch/x86/um/nommu/syscalls_64.c b/arch/x86/um/nommu/syscalls_64.c
index e88e93e9d80a..f213251c5e35 100644
--- a/arch/x86/um/nommu/syscalls_64.c
+++ b/arch/x86/um/nommu/syscalls_64.c
@@ -13,8 +13,70 @@
#include <asm/prctl.h> /* XXX This should get the constants from libc */
#include <registers.h>
#include <os.h>
+#include <asm/thread_info.h>
+#include <asm/mman.h>
#include "syscalls.h"
+/*
+ * The guest libc can change FS, which confuses the host libc.
+ * In fact, changing FS directly is not supported (check
+ * man arch_prctl). So, whenever we make a host syscall,
+ * we should be changing FS to the original FS (not the
+ * one set by the guest libc). This original FS is stored
+ * in host_fs.
+ */
+long long host_fs = -1;
+
+long arch_prctl(struct task_struct *task, int option,
+ unsigned long __user *arg2)
+{
+ long ret = -EINVAL;
+ unsigned long *ptr = arg2, tmp;
+
+ switch (option) {
+ case ARCH_SET_FS:
+ if (host_fs == -1)
+ os_arch_prctl(0, ARCH_GET_FS, (void *)&host_fs);
+ ret = 0;
+ break;
+ case ARCH_SET_GS:
+ ret = 0;
+ break;
+ case ARCH_GET_FS:
+ case ARCH_GET_GS:
+ ptr = &tmp;
+ break;
+ }
+
+ ret = os_arch_prctl(0, option, ptr);
+ if (ret)
+ return ret;
+
+ switch (option) {
+ case ARCH_SET_FS:
+ current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] =
+ (unsigned long) arg2;
+ break;
+ case ARCH_SET_GS:
+ current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)] =
+ (unsigned long) arg2;
+ break;
+ case ARCH_GET_FS:
+ ret = put_user(current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)], arg2);
+ break;
+ case ARCH_GET_GS:
+ ret = put_user(current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)], arg2);
+ break;
+ }
+
+ return ret;
+}
+
+SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
+{
+ return arch_prctl(current, option, (unsigned long __user *) arg2);
+}
+
void arch_switch_to(struct task_struct *to)
{
/*
@@ -42,3 +104,12 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
}
+
+static int __init um_nommu_setup_hostfs(void)
+{
+ /* initialize the host_fs value at boottime */
+ os_arch_prctl(0, ARCH_GET_FS, (void *)&host_fs);
+
+ return 0;
+}
+arch_initcall(um_nommu_setup_hostfs);
--
2.43.0
* [PATCH v10 08/13] x86/um/vdso: nommu: vdso memory update
2025-06-22 21:32 [PATCH v10 00/13] nommu UML Hajime Tazaki
` (6 preceding siblings ...)
2025-06-22 21:33 ` [PATCH v10 07/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
@ 2025-06-22 21:33 ` Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 09/13] x86/um: nommu: signal handling Hajime Tazaki
` (4 subsequent siblings)
12 siblings, 0 replies; 25+ messages in thread
From: Hajime Tazaki @ 2025-06-22 21:33 UTC (permalink / raw)
To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett, linux-kernel
In !MMU mode, the address of the vdso is directly accessible from
userspace. This commit implements the entry point by pointing
um_vdso_addr at the address of the allocated page. It also sets the
memory permissions of the vdso page so that it is executable.
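The permission change described above can be sketched as a plain host
program. This is only an illustration, assuming that os_protect_memory()
ends up behaving like mprotect(2) on the host (its body is not part of
this patch); alloc_exec_page() is a hypothetical helper, not a function
from the series.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy `len` bytes into a fresh page and make that
 * page readable and executable, but no longer writable.
 * Returns NULL on failure.
 */
static void *alloc_exec_page(const void *code, size_t len)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *page;

	if (page_size <= 0 || len > (size_t)page_size)
		return NULL;

	page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return NULL;

	memcpy(page, code, len);

	/* r=1, w=0, x=1: the same permission triple the patch passes */
	if (mprotect(page, (size_t)page_size, PROT_READ | PROT_EXEC) != 0) {
		munmap(page, (size_t)page_size);
		return NULL;
	}
	return page;
}
```

The page is populated while it is still writable and only then switched
to read+execute, which mirrors the order in init_vdso() (copy_page()
first, permission change afterwards).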
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
arch/x86/um/vdso/vma.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
index dc8dfb2abd80..1c8c39f87681 100644
--- a/arch/x86/um/vdso/vma.c
+++ b/arch/x86/um/vdso/vma.c
@@ -9,6 +9,7 @@
#include <asm/page.h>
#include <asm/elf.h>
#include <linux/init.h>
+#include <os.h>
static unsigned int __read_mostly vdso_enabled = 1;
unsigned long um_vdso_addr;
@@ -21,14 +22,24 @@ static int __init init_vdso(void)
{
BUG_ON(vdso_end - vdso_start > PAGE_SIZE);
- um_vdso_addr = task_size - PAGE_SIZE;
-
um_vdso = alloc_page(GFP_KERNEL);
if (!um_vdso)
goto oom;
copy_page(page_address(um_vdso), vdso_start);
+#ifdef CONFIG_MMU
+ um_vdso_addr = task_size - PAGE_SIZE;
+#else
+ /* this is fine with NOMMU as everything is accessible */
+ um_vdso_addr = (unsigned long)page_address(um_vdso);
+ os_protect_memory((void *)um_vdso_addr, vdso_end - vdso_start, 1, 0, 1);
+#endif
+
+ pr_info("vdso_start=%lx um_vdso_addr=%lx pg_um_vdso=%lx",
+ (unsigned long)vdso_start, um_vdso_addr,
+ (unsigned long)page_address(um_vdso));
+
return 0;
oom:
@@ -39,6 +50,7 @@ static int __init init_vdso(void)
}
subsys_initcall(init_vdso);
+#ifdef CONFIG_MMU
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
struct vm_area_struct *vma;
@@ -63,3 +75,4 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
return IS_ERR(vma) ? PTR_ERR(vma) : 0;
}
+#endif
--
2.43.0
* [PATCH v10 09/13] x86/um: nommu: signal handling
2025-06-22 21:32 [PATCH v10 00/13] nommu UML Hajime Tazaki
` (7 preceding siblings ...)
2025-06-22 21:33 ` [PATCH v10 08/13] x86/um/vdso: nommu: vdso memory update Hajime Tazaki
@ 2025-06-22 21:33 ` Hajime Tazaki
2025-06-24 23:20 ` Benjamin Berg
2025-06-22 21:33 ` [PATCH v10 10/13] um: nommu: a work around for MMU dependency to PCI driver Hajime Tazaki
` (3 subsequent siblings)
12 siblings, 1 reply; 25+ messages in thread
From: Hajime Tazaki @ 2025-06-22 21:33 UTC (permalink / raw)
To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett, linux-kernel
This commit updates the behavior of signal handling in the !MMU
environment. It adds the alignment code for the signal frame, as the
frame is used in userspace as-is.
Floating point registers are carefully saved and restored upon entry
to and exit from the syscall routine so that signal handlers can
read/write the contents of the registers.
It also adds a follow-up routine for SIGSEGV: signal delivery runs on
the same stack frame, so we have to avoid an endless SIGSEGV loop.
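As a side note on the floating point handling: FXSAVE/FXRSTOR require a
16-byte-aligned memory operand, which is presumably why fp[] gains an
aligned(16) attribute in this patch. A minimal sketch of that layout
constraint follows; struct demo_regs and demo_regs_alloc() are
hypothetical stand-ins with made-up field sizes, not the real
struct uml_pt_regs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct uml_pt_regs; field sizes are made up. */
struct demo_regs {
	unsigned long gp[27];
	int is_user;
	/* FXSAVE/FXRSTOR fault on operands that are not 16-byte aligned */
	unsigned long fp[] __attribute__((aligned(16)));
};

/* Allocate a register block with room for fp_size bytes of FP state. */
static struct demo_regs *demo_regs_alloc(size_t fp_size)
{
	struct demo_regs *r;
	size_t size = sizeof(struct demo_regs) + fp_size;

	/* aligned_alloc() wants a size that is a multiple of the alignment */
	size = (size + 15) & ~(size_t)15;
	r = aligned_alloc(16, size);
	if (r)
		assert(((uintptr_t)r->fp & 15) == 0);
	return r;
}
```

The attribute only fixes the offset of fp[] within the structure; the
structure itself still has to be placed at a 16-byte-aligned address,
which is why the sketch allocates it with aligned_alloc().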
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
arch/um/include/shared/kern_util.h | 4 +
arch/um/nommu/Makefile | 2 +-
arch/um/nommu/os-Linux/signal.c | 13 ++
arch/um/nommu/trap.c | 194 ++++++++++++++++++++++++++
arch/x86/um/nommu/do_syscall_64.c | 6 +
arch/x86/um/nommu/os-Linux/mcontext.c | 11 ++
arch/x86/um/shared/sysdep/mcontext.h | 1 +
arch/x86/um/shared/sysdep/ptrace.h | 2 +-
8 files changed, 231 insertions(+), 2 deletions(-)
create mode 100644 arch/um/nommu/trap.c
diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
index ec8ba1f13c58..f559943b52cb 100644
--- a/arch/um/include/shared/kern_util.h
+++ b/arch/um/include/shared/kern_util.h
@@ -73,4 +73,8 @@ void um_idle_sleep(void);
void kasan_map_memory(void *start, size_t len);
+#ifndef CONFIG_MMU
+extern void arch_sigsegv_handler(int sig, struct siginfo *si, void *mc);
+#endif
+
#endif
diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile
index baab7c2f57c2..096221590cfd 100644
--- a/arch/um/nommu/Makefile
+++ b/arch/um/nommu/Makefile
@@ -1,3 +1,3 @@
# SPDX-License-Identifier: GPL-2.0
-obj-y := os-Linux/
+obj-y := trap.o os-Linux/
diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-Linux/signal.c
index 19043b9652e2..b2cd0470b67c 100644
--- a/arch/um/nommu/os-Linux/signal.c
+++ b/arch/um/nommu/os-Linux/signal.c
@@ -5,6 +5,7 @@
#include <os.h>
#include <sysdep/mcontext.h>
#include <sys/ucontext.h>
+#include <as-layout.h>
void sigsys_handler(int sig, struct siginfo *si,
struct uml_pt_regs *regs, void *ptr)
@@ -14,3 +15,15 @@ void sigsys_handler(int sig, struct siginfo *si,
/* hook syscall via SIGSYS */
set_mc_sigsys_hook(mc);
}
+
+void arch_sigsegv_handler(int sig, struct siginfo *si, void *ptr)
+{
+ mcontext_t *mc = (mcontext_t *) ptr;
+
+ /* !MMU specific part; detection of userspace */
+ if (mc->gregs[REG_RIP] > uml_reserved &&
+ mc->gregs[REG_RIP] < high_physmem) {
+ /* !MMU: force handle signals after rt_sigreturn() */
+ set_mc_userspace_relay_signal(mc);
+ }
+}
diff --git a/arch/um/nommu/trap.c b/arch/um/nommu/trap.c
new file mode 100644
index 000000000000..2053a3b5071b
--- /dev/null
+++ b/arch/um/nommu/trap.c
@@ -0,0 +1,194 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/hardirq.h>
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/sched/debug.h>
+#include <asm/current.h>
+#include <asm/tlbflush.h>
+#include <arch.h>
+#include <as-layout.h>
+#include <kern_util.h>
+#include <os.h>
+#include <skas.h>
+
+/*
+ * Note this is constrained to return 0, -EFAULT, -EACCES, -ENOMEM by
+ * segv().
+ */
+int handle_page_fault(unsigned long address, unsigned long ip,
+ int is_write, int is_user, int *code_out)
+{
+ /* !MMU has no pagefault */
+ return -EFAULT;
+}
+
+static void show_segv_info(struct uml_pt_regs *regs)
+{
+ struct task_struct *tsk = current;
+ struct faultinfo *fi = UPT_FAULTINFO(regs);
+
+ if (!unhandled_signal(tsk, SIGSEGV))
+ return;
+
+ pr_warn_ratelimited("%s%s[%d]: segfault at %lx ip %p sp %p error %x",
+ task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
+ tsk->comm, task_pid_nr(tsk), FAULT_ADDRESS(*fi),
+ (void *)UPT_IP(regs), (void *)UPT_SP(regs),
+ fi->error_code);
+}
+
+static void bad_segv(struct faultinfo fi, unsigned long ip)
+{
+ current->thread.arch.faultinfo = fi;
+ force_sig_fault(SIGSEGV, SEGV_ACCERR, (void __user *) FAULT_ADDRESS(fi));
+}
+
+void fatal_sigsegv(void)
+{
+ force_fatal_sig(SIGSEGV);
+ do_signal(¤t->thread.regs);
+ /*
+ * This is to tell gcc that we're not returning - do_signal
+ * can, in general, return, but in this case, it's not, since
+ * we just got a fatal SIGSEGV queued.
+ */
+ os_dump_core();
+}
+
+/**
+ * segv_handler() - the SIGSEGV handler
+ * @sig: the signal number
+ * @unused_si: the signal info struct; unused in this handler
+ * @regs: the ptrace register information
+ *
+ * The handler first extracts the faultinfo from the UML ptrace regs struct.
+ * If the userfault did not happen in an UML userspace process, bad_segv is called.
+ * Otherwise the signal did happen in a cloned userspace process, handle it.
+ */
+void segv_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs,
+ void *mc)
+{
+ struct faultinfo *fi = UPT_FAULTINFO(regs);
+
+ /* !MMU specific part; detection of userspace */
+ /* mark is_user=1 when the IP is from userspace code. */
+ if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) < high_physmem)
+ regs->is_user = 1;
+
+ if (UPT_IS_USER(regs) && !SEGV_IS_FIXABLE(fi)) {
+ show_segv_info(regs);
+ bad_segv(*fi, UPT_IP(regs));
+ return;
+ }
+ segv(*fi, UPT_IP(regs), UPT_IS_USER(regs), regs, mc);
+
+ /* !MMU specific part; detection of userspace */
+ arch_sigsegv_handler(sig, unused_si, mc);
+}
+
+/*
+ * We give a *copy* of the faultinfo in the regs to segv.
+ * This must be done, since nesting SEGVs could overwrite
+ * the info in the regs. A pointer to the info then would
+ * give us bad data!
+ */
+unsigned long segv(struct faultinfo fi, unsigned long ip, int is_user,
+ struct uml_pt_regs *regs, void *mc)
+{
+ int si_code;
+ int err;
+ int is_write = FAULT_WRITE(fi);
+ unsigned long address = FAULT_ADDRESS(fi);
+
+ if (!is_user && regs)
+ current->thread.segv_regs = container_of(regs, struct pt_regs, regs);
+
+ if (current->mm == NULL) {
+ show_regs(container_of(regs, struct pt_regs, regs));
+ panic("Segfault with no mm");
+ } else if (!is_user && address > PAGE_SIZE && address < TASK_SIZE) {
+ show_regs(container_of(regs, struct pt_regs, regs));
+ panic("Kernel tried to access user memory at addr 0x%lx, ip 0x%lx",
+ address, ip);
+ }
+
+ if (SEGV_IS_FIXABLE(&fi))
+ err = handle_page_fault(address, ip, is_write, is_user,
+ &si_code);
+ else {
+ err = -EFAULT;
+ /*
+ * A thread accessed NULL, we get a fault, but CR2 is invalid.
+ * This code is used in __do_copy_from_user() of TT mode.
+ * XXX tt mode is gone, so maybe this isn't needed any more
+ */
+ address = 0;
+ }
+
+ if (!err)
+ goto out;
+ else if (!is_user && arch_fixup(ip, regs))
+ goto out;
+
+ if (!is_user) {
+ show_regs(container_of(regs, struct pt_regs, regs));
+ panic("Kernel mode fault at addr 0x%lx, ip 0x%lx",
+ address, ip);
+ }
+
+ show_segv_info(regs);
+
+ if (err == -EACCES) {
+ current->thread.arch.faultinfo = fi;
+ force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
+ } else {
+ WARN_ON_ONCE(err != -EFAULT);
+ current->thread.arch.faultinfo = fi;
+ force_sig_fault(SIGSEGV, si_code, (void __user *) address);
+ }
+
+out:
+ if (regs)
+ current->thread.segv_regs = NULL;
+
+ return 0;
+}
+
+void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs *regs,
+ void *mc)
+{
+ int code, err;
+
+ if (!UPT_IS_USER(regs)) {
+ if (sig == SIGBUS)
+ pr_err("Bus error - the host /dev/shm or /tmp mount likely just ran out of space\n");
+ panic("Kernel mode signal %d", sig);
+ }
+
+ arch_examine_signal(sig, regs);
+
+ /* Is the signal layout for the signal known?
+ * Signal data must be scrubbed to prevent information leaks.
+ */
+ code = si->si_code;
+ err = si->si_errno;
+ if ((err == 0) && (siginfo_layout(sig, code) == SIL_FAULT)) {
+ struct faultinfo *fi = UPT_FAULTINFO(regs);
+
+ current->thread.arch.faultinfo = *fi;
+ force_sig_fault(sig, code, (void __user *)FAULT_ADDRESS(*fi));
+ } else {
+ pr_err("Attempted to relay unknown signal %d (si_code = %d) with errno %d\n",
+ sig, code, err);
+ force_sig(sig);
+ }
+}
+
+void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs,
+ void *mc)
+{
+ do_IRQ(WINCH_IRQ, regs);
+}
diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c
index 74d5bcc4508d..d77e69e097c1 100644
--- a/arch/x86/um/nommu/do_syscall_64.c
+++ b/arch/x86/um/nommu/do_syscall_64.c
@@ -44,6 +44,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
/* set fs register to the original host one */
os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
+ /* save fp registers */
+ asm volatile("fxsaveq %0" : "=m"(*(struct _xstate *)regs->regs.fp));
+
if (likely(syscall < NR_syscalls)) {
PT_REGS_SET_SYSCALL_RETURN(regs,
EXECUTE_SYSCALL(syscall, regs));
@@ -54,6 +57,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
/* handle tasks and signals at the end */
interrupt_end();
+ /* restore fp registers */
+ asm volatile("fxrstorq %0" : : "m"((current->thread.regs.regs.fp)));
+
/* restore back fs register to userspace configured one */
os_x86_arch_prctl(0, ARCH_SET_FS,
(void *)(current->thread.regs.regs.gp[FS_BASE
diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c
index c4ef877d5ea0..955e7d9f4765 100644
--- a/arch/x86/um/nommu/os-Linux/mcontext.c
+++ b/arch/x86/um/nommu/os-Linux/mcontext.c
@@ -6,6 +6,17 @@
#include <sysdep/mcontext.h>
#include <sysdep/syscalls.h>
+static void __userspace_relay_signal(void)
+{
+ /* XXX: dummy syscall */
+ __asm__ volatile("call *%0" : : "r"(__kernel_vsyscall), "a"(39) :);
+}
+
+void set_mc_userspace_relay_signal(mcontext_t *mc)
+{
+ mc->gregs[REG_RIP] = (unsigned long) __userspace_relay_signal;
+}
+
void set_mc_sigsys_hook(mcontext_t *mc)
{
mc->gregs[REG_RCX] = mc->gregs[REG_RIP];
diff --git a/arch/x86/um/shared/sysdep/mcontext.h b/arch/x86/um/shared/sysdep/mcontext.h
index 9a0d6087f357..479fd923ff1d 100644
--- a/arch/x86/um/shared/sysdep/mcontext.h
+++ b/arch/x86/um/shared/sysdep/mcontext.h
@@ -19,6 +19,7 @@ extern int set_stub_state(struct uml_pt_regs *regs, struct stub_data *data,
#ifndef CONFIG_MMU
extern void set_mc_sigsys_hook(mcontext_t *mc);
+extern void set_mc_userspace_relay_signal(mcontext_t *mc);
#endif
#ifdef __i386__
diff --git a/arch/x86/um/shared/sysdep/ptrace.h b/arch/x86/um/shared/sysdep/ptrace.h
index 8f7476ff6e95..7d553d9f05be 100644
--- a/arch/x86/um/shared/sysdep/ptrace.h
+++ b/arch/x86/um/shared/sysdep/ptrace.h
@@ -65,7 +65,7 @@ struct uml_pt_regs {
int is_user;
/* Dynamically sized FP registers (holds an XSTATE) */
- unsigned long fp[];
+ unsigned long fp[] __attribute__((aligned(16)));
};
#define EMPTY_UML_PT_REGS { }
--
2.43.0
* [PATCH v10 10/13] um: nommu: a work around for MMU dependency to PCI driver
2025-06-22 21:32 [PATCH v10 00/13] nommu UML Hajime Tazaki
` (8 preceding siblings ...)
2025-06-22 21:33 ` [PATCH v10 09/13] x86/um: nommu: signal handling Hajime Tazaki
@ 2025-06-22 21:33 ` Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 11/13] um: change machine name for uname output Hajime Tazaki
` (2 subsequent siblings)
12 siblings, 0 replies; 25+ messages in thread
From: Hajime Tazaki @ 2025-06-22 21:33 UTC (permalink / raw)
To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett, linux-kernel
Commit 8fe743b5eba0 ("PCI: Add CONFIG_MMU dependency") restricts the
PCI base driver to depend on MMU. While nommu UML _can_ implement PCI
drivers over PCI devices (e.g., virtio-pci), the current nommu UML
doesn't implement them.
But without PCI drivers kunit complains, as the config for kunit
(arch_uml.config) defines a dependency on PCI drivers.
This commit fixes the compile failures that occur when building PCI
drivers with nommu UML. In particular, the fix is to undefine the
constant PCI_IOBASE so that the pci_unmap_iospace() call is bypassed.
Once PCI drivers are supported for nommu UML, this code will be
refactored.
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
arch/um/include/asm/dma.h | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/um/include/asm/dma.h b/arch/um/include/asm/dma.h
index fdc53642c718..643d74555671 100644
--- a/arch/um/include/asm/dma.h
+++ b/arch/um/include/asm/dma.h
@@ -4,6 +4,19 @@
#include <asm/io.h>
+/**
+ * Now that the PCI core driver depends on CONFIG_MMU in Linus' tree,
+ * nommu UML cannot build with PCI, but without PCI kunit doesn't build
+ * due to the dependency on CONFIG_VIRTIO_UML.
+ *
+ * This is a workaround to silence build failures on kunit, which is
+ * valid until nommu UML supports PCI drivers (e.g., virtio-pci) in the
+ * future.
+ */
+#ifndef CONFIG_MMU
+#undef PCI_IOBASE
+#endif
+
extern unsigned long uml_physmem;
#define MAX_DMA_ADDRESS (uml_physmem)
--
2.43.0
* [PATCH v10 11/13] um: change machine name for uname output
2025-06-22 21:32 [PATCH v10 00/13] nommu UML Hajime Tazaki
` (9 preceding siblings ...)
2025-06-22 21:33 ` [PATCH v10 10/13] um: nommu: a work around for MMU dependency to PCI driver Hajime Tazaki
@ 2025-06-22 21:33 ` Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 12/13] um: nommu: add documentation of nommu UML Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 13/13] um: nommu: plug nommu code into build system Hajime Tazaki
12 siblings, 0 replies; 25+ messages in thread
From: Hajime Tazaki @ 2025-06-22 21:33 UTC (permalink / raw)
To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett, linux-kernel
This commit makes the machine name reported by uname(2) reflect the
MMU/!MMU mode so that users can tell which mode of UML is running.
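For reference, a userspace program could read the field in question as
sketched below. read_machine_name() is a hypothetical helper; judging
from the Makefile and setup_machinename() changes, the value inside a
guest would presumably be a prefix such as um or um(nommu) followed by
"/" and the host machine name, but the exact string is not confirmed
here.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Copy the machine field reported by uname(2) into buf; 0 on success. */
static int read_machine_name(char *buf, size_t len)
{
	struct utsname u;

	if (uname(&u) != 0)
		return -1;

	snprintf(buf, len, "%s", u.machine);
	return 0;
}
```

This is the same field that `uname -m` prints.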
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
arch/um/Makefile | 6 ++++++
arch/um/os-Linux/util.c | 3 ++-
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/um/Makefile b/arch/um/Makefile
index 5371c9a1b11e..9bc8fc149514 100644
--- a/arch/um/Makefile
+++ b/arch/um/Makefile
@@ -153,6 +153,12 @@ export CFLAGS_vmlinux := $(LINK-y) $(LINK_WRAPS) $(LD_FLAGS_CMDLINE) $(CC_FLAGS_
CLEAN_FILES += linux x.i gmon.out
MRPROPER_FILES += $(HOST_DIR)/include/generated
+ifeq ($(CONFIG_MMU),y)
+UTS_MACHINE := "um"
+else
+UTS_MACHINE := "um\(nommu\)"
+endif
+
archclean:
@find . \( -name '*.bb' -o -name '*.bbg' -o -name '*.da' \
-o -name '*.gcov' \) -type f -print | xargs rm -f
diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c
index 4193e04d7e4a..20421e9f0f77 100644
--- a/arch/um/os-Linux/util.c
+++ b/arch/um/os-Linux/util.c
@@ -65,7 +65,8 @@ void setup_machinename(char *machine_out)
}
# endif
#endif
- strcpy(machine_out, host.machine);
+ strcat(machine_out, "/");
+ strcat(machine_out, host.machine);
}
void setup_hostinfo(char *buf, int len)
--
2.43.0
* [PATCH v10 12/13] um: nommu: add documentation of nommu UML
2025-06-22 21:32 [PATCH v10 00/13] nommu UML Hajime Tazaki
` (10 preceding siblings ...)
2025-06-22 21:33 ` [PATCH v10 11/13] um: change machine name for uname output Hajime Tazaki
@ 2025-06-22 21:33 ` Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 13/13] um: nommu: plug nommu code into build system Hajime Tazaki
12 siblings, 0 replies; 25+ messages in thread
From: Hajime Tazaki @ 2025-06-22 21:33 UTC (permalink / raw)
To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett, linux-kernel
This commit adds an initial documentation for !MMU mode of UML.
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
Documentation/virt/uml/nommu-uml.rst | 180 +++++++++++++++++++++++++++
MAINTAINERS | 1 +
2 files changed, 181 insertions(+)
create mode 100644 Documentation/virt/uml/nommu-uml.rst
diff --git a/Documentation/virt/uml/nommu-uml.rst b/Documentation/virt/uml/nommu-uml.rst
new file mode 100644
index 000000000000..1a988253bef8
--- /dev/null
+++ b/Documentation/virt/uml/nommu-uml.rst
@@ -0,0 +1,180 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+UML has been built with CONFIG_MMU since day 0. This patchset
+introduces a nommu mode for UML, from a different angle than what the
+Linux Kernel Library tried.
+
+.. contents:: :local:
+
+What is it for?
+===============
+
+- Alleviate syscall hook overhead implemented with ptrace(2)
+- Exercise nommu code over UML (and over KUnit)
+- Reduce dependency on host facilities
+
+
+How does it work?
+=================
+
+To illustrate how this feature works, the following shows how syscalls
+are called in a nommu UML environment.
+
+- boot the kernel and install a seccomp filter that fires when a
+ ``syscall`` instruction is executed from userspace memory, determined
+ by the address of the instruction pointer
+- (userspace starts)
+- calls ``vfork``/``execve`` syscalls
+- a ``SIGSYS`` signal is raised; the handler calls the syscall entry point ``__kernel_vsyscall``
+- the handler function in ``sys_call_table[]`` is called, following the
+ usual UML syscall path
+- return to userspace
+
+
+What are the differences from MMU-full UML?
+===========================================
+
+The current nommu implementation adds three features which
+MMU-full UML doesn't have:
+
+- kernel address space is directly accessible from userspace
+ - so, ``uaccess()`` always returns 1
+ - generic implementation of memcpy/strcpy/futex is also used
+- alternate syscall entrypoint without ptrace
+- alternate syscall hook
+ - hook syscall by seccomp filter
+
+These modifications allow us to use unmodified userspace
+binaries with nommu UML.
+
+
+History
+=======
+
+This feature was originally introduced by Ricardo Koller at Open
+Source Summit NA 2020, and was later integrated with the syscall
+translation functionality, along with a cleanup of the original code.
+
+Building and running
+====================
+
+::
+
+ make ARCH=um x86_64_nommu_defconfig
+ make ARCH=um
+
+will build UML with ``CONFIG_MMU=n`` applied.
+
+Kunit tests can run with the following command::
+
+ ./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_MMU=n
+
+To run a typical Linux distribution, we need nommu-aware userspace.
+We can use a stock version of Alpine Linux with nommu-built version of
+busybox and musl-libc.
+
+
+Preparing root filesystem
+=========================
+
+nommu UML requires a standard library that is aware of nommu
+kernels. We have tested custom-built musl-libc and busybox,
+both of which have built-in support for nommu kernels.
+
+There are no Linux distributions available for nommu on the x86_64
+architecture, so we need to prepare our own image for the root
+filesystem. We use Alpine Linux as a base distribution and replace
+busybox and musl-libc on top of that. The following are the steps to
+prepare the filesystem for the quick start::
+
+ container_id=$(docker create ghcr.io/thehajime/alpine:3.20.3-um-nommu)
+ docker start $container_id
+ docker wait $container_id
+ docker export $container_id > alpine.tar
+ docker rm $container_id
+
+ mnt=$(mktemp -d)
+ dd if=/dev/zero of=alpine.ext4 bs=1 count=0 seek=1G
+ sudo chmod og+wr "alpine.ext4"
+ yes 2>/dev/null | mkfs.ext4 "alpine.ext4" || true
+ sudo mount "alpine.ext4" $mnt
+ sudo tar -xf alpine.tar -C $mnt
+ sudo umount $mnt
+
+This will create a file image, ``alpine.ext4``, which contains nommu
+builds of busybox and musl on the Alpine Linux root filesystem. The
+file can be passed via the ``ubd0=`` argument on the UML command line::
+
+ ./vmlinux ubd0=./alpine.ext4 rw mem=1024m loglevel=8 init=/sbin/init
+
+We plan to upstream apk packages for busybox and musl so that we can
+follow the proper procedure to set up the root filesystem.
+
+
+Quick start with docker
+=======================
+
+There is a docker image that you can quickly start with a simple step::
+
+ docker run -it -v /dev/shm:/dev/shm --rm ghcr.io/thehajime/alpine:3.20.3-um-nommu
+
+This will launch a UML instance with a pre-configured root filesystem.
+
+Benchmark
+=========
+
+The following shows example performance measurements taken with
+lmbench and a (self-crafted) getpid benchmark (with v6.15-rc0 uml/next
+tree).
+
+.. csv-table:: lmbench (usec)
+ :header: ,native,um,um-mmu(s),um-nommu(s)
+
+ select-10 ,0.5224,28.3882,27.2839,3.0046
+ select-100 ,1.5641,30.3775,28.8091,3.8546
+ select-1000 ,11.6922,38.2021,32.5367,12.2568
+ syscall ,0.1635,27.8278,24.8049,2.6957
+ read ,0.3063,29.0073,23.5953,2.8127
+ write ,0.2531,29.6342,26.3339,2.7932
+ stat ,1.8827,41.2546,34.6495,3.3199
+ open/close ,3.2548,67.5806,62.4781,6.4189
+ fork+sh ,1108.8000,5618.0000,3604.6667,456.0476
+ fork+execve ,519.1579,2242.8000,1425.7500,138.1316
+
+.. csv-table:: do_getpid bench (nsec)
+ :header: ,native,um,um-mmu(s),um-nommu(s)
+
+ getpid , 162 , 27049 , 24444 , 2696
+
+(um-nommu(s) is nommu UML with the seccomp syscall hook; um-mmu(s) is
+MMU UML in SECCOMP mode)
+
+Limitations
+===========
+
+generic nommu limitations
+-------------------------
+Since this port is a kernel for a nommu architecture, the
+implementation inherits the characteristics of other nommu kernels
+(riscv, arm, etc), described below.
+
+- vfork(2) should be used instead of fork(2)
+- ELF loader only loads PIE (position independent executable) binaries
+- all processes share a single address space
+- mmap(2) offers a subset of functionalities (e.g., unsupported
+ MAP_FIXED)
+
+Thus, we have limited options for userspace programs. We have tested
+Alpine Linux with musl-libc, which supports nommu kernels.
+
+supported architecture
+----------------------
+The current implementation of nommu UML works only on the x86_64 SUBARCH.
+It has not been tested in a 32-bit environment.
+
+
+Further reading about NOMMU UML
+===============================
+
+- NOMMU UML (original code by Ricardo Koller)
+ - https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf
diff --git a/MAINTAINERS b/MAINTAINERS
index ac8ccc837bab..822efc04bbe1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25753,6 +25753,7 @@ USER-MODE LINUX (UML)
M: Richard Weinberger <richard@nod.at>
M: Anton Ivanov <anton.ivanov@cambridgegreys.com>
M: Johannes Berg <johannes@sipsolutions.net>
+M: Hajime Tazaki <thehajime@gmail.com>
L: linux-um@lists.infradead.org
S: Maintained
W: http://user-mode-linux.sourceforge.net
--
2.43.0
* [PATCH v10 13/13] um: nommu: plug nommu code into build system
2025-06-22 21:32 [PATCH v10 00/13] nommu UML Hajime Tazaki
` (11 preceding siblings ...)
2025-06-22 21:33 ` [PATCH v10 12/13] um: nommu: add documentation of nommu UML Hajime Tazaki
@ 2025-06-22 21:33 ` Hajime Tazaki
12 siblings, 0 replies; 25+ messages in thread
From: Hajime Tazaki @ 2025-06-22 21:33 UTC (permalink / raw)
To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett, linux-kernel
Add the nommu kernel to the um build. A defconfig is also provided.
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
arch/um/Kconfig | 14 ++++++-
arch/um/configs/x86_64_nommu_defconfig | 54 ++++++++++++++++++++++++++
2 files changed, 66 insertions(+), 2 deletions(-)
create mode 100644 arch/um/configs/x86_64_nommu_defconfig
diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index f08e8a7fac93..81a79c7a5a6f 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -31,14 +31,17 @@ config UML
select ARCH_SUPPORTS_LTO_CLANG_THIN
select TRACE_IRQFLAGS_SUPPORT
select TTY # Needed for line.c
- select HAVE_ARCH_VMAP_STACK
+ select HAVE_ARCH_VMAP_STACK if MMU
select HAVE_RUST
select ARCH_HAS_UBSAN
select HAVE_ARCH_TRACEHOOK
select THREAD_INFO_IN_TASK
+ select UACCESS_MEMCPY if !MMU
+ select GENERIC_STRNLEN_USER if !MMU
+ select GENERIC_STRNCPY_FROM_USER if !MMU
config MMU
- bool
+ bool "MMU-based Paged Memory Management Support" if 64BIT
default y
config UML_DMA_EMULATION
@@ -185,8 +188,15 @@ config MAGIC_SYSRQ
The keys are documented in <file:Documentation/admin-guide/sysrq.rst>. Don't say Y
unless you really know what this hack does.
+config ARCH_FORCE_MAX_ORDER
+ int "Order of maximal physically contiguous allocations" if EXPERT
+ default "10" if MMU
+ default "16" if !MMU
+
config KERNEL_STACK_ORDER
int "Kernel stack size order"
+ default 3 if !MMU
+ range 3 10 if !MMU
default 2 if 64BIT
range 2 10 if 64BIT
default 1 if !64BIT
diff --git a/arch/um/configs/x86_64_nommu_defconfig b/arch/um/configs/x86_64_nommu_defconfig
new file mode 100644
index 000000000000..02cb87091c9f
--- /dev/null
+++ b/arch/um/configs/x86_64_nommu_defconfig
@@ -0,0 +1,54 @@
+CONFIG_SYSVIPC=y
+CONFIG_POSIX_MQUEUE=y
+CONFIG_NO_HZ=y
+CONFIG_HIGH_RES_TIMERS=y
+CONFIG_BSD_PROCESS_ACCT=y
+CONFIG_IKCONFIG=y
+CONFIG_IKCONFIG_PROC=y
+CONFIG_LOG_BUF_SHIFT=14
+CONFIG_CGROUPS=y
+CONFIG_BLK_CGROUP=y
+CONFIG_CGROUP_SCHED=y
+CONFIG_CGROUP_DEVICE=y
+CONFIG_CGROUP_CPUACCT=y
+# CONFIG_PID_NS is not set
+CONFIG_CC_OPTIMIZE_FOR_SIZE=y
+# CONFIG_MMU is not set
+CONFIG_HOSTFS=y
+CONFIG_MAGIC_SYSRQ=y
+CONFIG_SSL=y
+CONFIG_NULL_CHAN=y
+CONFIG_PORT_CHAN=y
+CONFIG_PTY_CHAN=y
+CONFIG_TTY_CHAN=y
+CONFIG_CON_CHAN="pts"
+CONFIG_SSL_CHAN="pts"
+CONFIG_MODULES=y
+CONFIG_MODULE_UNLOAD=y
+CONFIG_IOSCHED_BFQ=m
+CONFIG_BINFMT_MISC=m
+CONFIG_NET=y
+CONFIG_PACKET=y
+CONFIG_UNIX=y
+CONFIG_INET=y
+CONFIG_DEVTMPFS=y
+CONFIG_DEVTMPFS_MOUNT=y
+CONFIG_BLK_DEV_UBD=y
+CONFIG_BLK_DEV_LOOP=m
+CONFIG_BLK_DEV_NBD=m
+CONFIG_DUMMY=m
+CONFIG_TUN=m
+CONFIG_PPP=m
+CONFIG_SLIP=m
+CONFIG_LEGACY_PTY_COUNT=32
+CONFIG_UML_RANDOM=y
+CONFIG_EXT4_FS=y
+CONFIG_QUOTA=y
+CONFIG_AUTOFS_FS=m
+CONFIG_ISO9660_FS=m
+CONFIG_JOLIET=y
+CONFIG_NLS=y
+CONFIG_DEBUG_KERNEL=y
+CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
+CONFIG_FRAME_WARN=1024
+CONFIG_IPV6=y
--
2.43.0
* Re: [PATCH v10 09/13] x86/um: nommu: signal handling
2025-06-22 21:33 ` [PATCH v10 09/13] x86/um: nommu: signal handling Hajime Tazaki
@ 2025-06-24 23:20 ` Benjamin Berg
2025-06-27 13:50 ` Hajime Tazaki
0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Berg @ 2025-06-24 23:20 UTC (permalink / raw)
To: Hajime Tazaki, linux-um; +Cc: ricarkol, Liam.Howlett, linux-kernel
Hi,
On Mon, 2025-06-23 at 06:33 +0900, Hajime Tazaki wrote:
> This commit updates the behavior of signal handling under !MMU
> environment. It adds the alignment code for signal frame as the frame
> is used in userspace as-is.
>
> floating point register is carefully handling upon entry/leave of
> syscall routine so that signal handlers can read/write the contents of
> the register.
>
> It also adds the follow up routine for SIGSEGV as a signal delivery runs
> in the same stack frame while we have to avoid endless SIGSEGV.
>
> Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
> ---
> arch/um/include/shared/kern_util.h | 4 +
> arch/um/nommu/Makefile | 2 +-
> arch/um/nommu/os-Linux/signal.c | 13 ++
> arch/um/nommu/trap.c | 194 ++++++++++++++++++++++++++
> arch/x86/um/nommu/do_syscall_64.c | 6 +
> arch/x86/um/nommu/os-Linux/mcontext.c | 11 ++
> arch/x86/um/shared/sysdep/mcontext.h | 1 +
> arch/x86/um/shared/sysdep/ptrace.h | 2 +-
> 8 files changed, 231 insertions(+), 2 deletions(-)
> create mode 100644 arch/um/nommu/trap.c
>
> [SNIP]
> diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c
> index c4ef877d5ea0..955e7d9f4765 100644
> --- a/arch/x86/um/nommu/os-Linux/mcontext.c
> +++ b/arch/x86/um/nommu/os-Linux/mcontext.c
> @@ -6,6 +6,17 @@
> #include <sysdep/mcontext.h>
> #include <sysdep/syscalls.h>
>
> +static void __userspace_relay_signal(void)
> +{
> + /* XXX: dummy syscall */
> + __asm__ volatile("call *%0" : : "r"(__kernel_vsyscall), "a"(39) :);
> +}
39 is __NR_getpid, I assume?
The "call *%0" looks like it is retpoline code; I think this would
currently just segfault.
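On the syscall number: on x86_64 hosts, 39 is indeed getpid in the
syscall table. A small host-side sketch to check this follows;
raw_getpid() is a hypothetical helper, and the snippet is not meant to
reproduce the __kernel_vsyscall path from the patch.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Invoke getpid through the raw syscall(2) wrapper instead of libc's getpid(). */
static pid_t raw_getpid(void)
{
	return (pid_t)syscall(SYS_getpid);
}
```

Using the symbolic SYS_getpid (or __NR_getpid) rather than the literal
39 would also make the intent of the dummy call easier to read.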
> +
> +void set_mc_userspace_relay_signal(mcontext_t *mc)
> +{
> + mc->gregs[REG_RIP] = (unsigned long) __userspace_relay_signal;
> +}
> +
And this is really confusing me. The way I am reading it, the code
tries to do:
 1. Rewrite RIP to jump to __userspace_relay_signal
 2. Trigger a getpid syscall (to do "nothing"?)
 3. Let do_syscall_64 fire the signal from interrupt_end
However, that really confuses me, because:
 * If I am reading it correctly, then this approach will destroy the
   contents of various registers (RIP, RAX and likely more)
 * This would result in an incorrect mcontext in the userspace signal
   handler (which could be relevant if userspace is inspecting it)
 * Worst of all, rt_sigreturn will eventually jump back into
   __userspace_relay_signal, which has nothing to return to.
 * Also, relay_signal doesn't use this? What happens for a SIGFPE, how
   is userspace interrupted immediately in that case?
Honestly, I really think we should take a step back and swap the
current syscall entry/exit code. That would likely also simplify
floating point register handling, which I think is currently
insufficient to deal with the odd special cases caused by different
x86_64 hardware extensions.
Basically, I think nommu mode should use the same general approach as
the current SECCOMP mode. Which is to use rt_sigreturn to jump into
userspace and let the host kernel deal with the ugly details of how to
do that.
I believe that this requires a second "userspace" sigaltstack in
addition to the current "IRQ" sigaltstack. One would then switch
between the two (note that the "userspace" one is also used for IRQs if
those happen while userspace is executing).
So, in principle I would think something like:
 * to jump into userspace, you would:
   - block all signals
   - set "userspace" sigaltstack
   - setup mcontext for rt_sigreturn
   - setup RSP for rt_sigreturn
   - call rt_sigreturn syscall
 * all signal handlers can (except pure IRQs):
   - check on which stack they are
     -> easy to detect whether we are in kernel mode
   - for IRQs one can probably handle them directly (and return)
   - in user mode:
     + store mcontext location and information needed for rt_sigreturn
     + jump back into kernel task stack
 * kernel task handler to continue would:
   - set sigaltstack to IRQ stack
   - fetch register from mcontext
   - unblock all signals
   - handle syscall/signal in whatever way needed
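The "check on which stack they are" step could look roughly like the
following on the host side. This is a minimal sketch in a plain host
program, not UML code: alt_stack stands in for the "userspace"
sigaltstack, and all names (addr_in_stack(), probe_signal_stack(), the
64 KiB size, the use of SIGUSR1) are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static stack_t alt_stack;	/* hypothetical "userspace" signal stack */
static volatile sig_atomic_t on_alt_stack;

/* Return 1 if address p lies inside the stack described by ss. */
static int addr_in_stack(const void *p, const stack_t *ss)
{
	uintptr_t a = (uintptr_t)p;
	uintptr_t base = (uintptr_t)ss->ss_sp;

	return a >= base && a < base + ss->ss_size;
}

static void handler(int sig)
{
	int marker;	/* lives on whatever stack the handler runs on */

	(void)sig;
	on_alt_stack = addr_in_stack(&marker, &alt_stack);
}

/* Returns 1 if the handler ran on alt_stack, 0 if not, -1 on setup error. */
static int probe_signal_stack(void)
{
	struct sigaction sa;

	alt_stack.ss_size = 64 * 1024;
	alt_stack.ss_sp = malloc(alt_stack.ss_size);
	alt_stack.ss_flags = 0;
	if (!alt_stack.ss_sp || sigaltstack(&alt_stack, NULL) != 0)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = handler;
	sa.sa_flags = SA_ONSTACK;	/* deliver on the alternate stack */
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return -1;

	raise(SIGUSR1);	/* handler has run by the time raise() returns */
	return on_alt_stack;
}
```

With two alternate stacks, the same range check against each of them
would tell a handler whether it interrupted userspace or kernel
execution, without having to look at the instruction pointer.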
Now that I wrote about it, I am thinking that it might be possible to
just use the kernel task stack for the signal stack. One would probably
need to increase the kernel stack size a bit, but it would also mean
that no special code is needed for "rt_sigreturn" handling. The rest
would remain the same.
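To make the "check on which stack they are" step concrete, here is a minimal sketch, assuming a plain Linux host process rather than UML itself. Every name in it (addr_on_stack, handler_ran_on_altstack, the 64 KiB stack size) is hypothetical. It registers one alternate stack and lets the handler decide where it runs by comparing the address of a local variable against the bounds of that stack.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; none of this is UML code. */
static char *alt_base;
static size_t alt_size;
static volatile sig_atomic_t ran_on_alt = -1;

/* Return 1 if addr lies inside [base, base + size). */
static int addr_on_stack(const void *addr, const void *base, size_t size)
{
    uintptr_t a = (uintptr_t)addr;
    uintptr_t b = (uintptr_t)base;

    return a >= b && a - b < size;
}

static void handler(int sig)
{
    int marker; /* lives on whichever stack the handler runs on */

    (void)sig;
    ran_on_alt = addr_on_stack(&marker, alt_base, alt_size);
}

/*
 * Deliver SIGUSR1 with (use_alt = 1) or without (use_alt = 0) SA_ONSTACK
 * and report whether the handler found itself on the alternate stack.
 */
int handler_ran_on_altstack(int use_alt)
{
    stack_t ss;
    struct sigaction sa;

    assert(use_alt == 0 || use_alt == 1);

    alt_size = 64 * 1024;
    if (!alt_base)
        alt_base = malloc(alt_size);
    if (!alt_base)
        return -1;

    memset(&ss, 0, sizeof(ss));
    ss.ss_sp = alt_base;
    ss.ss_size = alt_size;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sa.sa_flags = use_alt ? SA_ONSTACK : 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    ran_on_alt = -1;
    raise(SIGUSR1); /* handler runs before raise() returns */
    return ran_on_alt;
}
```

Calling handler_ran_on_altstack(1) should report 1 and handler_ran_on_altstack(0) should report 0. With two alternate stacks ("userspace" and "IRQ"), the same address comparison would tell the handler which mode was interrupted.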
Thoughts?
Benjamin
> [SNIP]
* Re: [PATCH v10 09/13] x86/um: nommu: signal handling
2025-06-24 23:20 ` Benjamin Berg
@ 2025-06-27 13:50 ` Hajime Tazaki
2025-06-27 15:02 ` Benjamin Berg
0 siblings, 1 reply; 25+ messages in thread
From: Hajime Tazaki @ 2025-06-27 13:50 UTC (permalink / raw)
To: benjamin; +Cc: linux-um, ricarkol, Liam.Howlett, linux-kernel
Hello,
thanks for the comment on the complicated part of the kernel (signal).
On Wed, 25 Jun 2025 08:20:03 +0900,
Benjamin Berg wrote:
>
> Hi,
>
> On Mon, 2025-06-23 at 06:33 +0900, Hajime Tazaki wrote:
> > This commit updates the behavior of signal handling under !MMU
> > environment. It adds the alignment code for signal frame as the frame
> > is used in userspace as-is.
> >
> > floating point register is carefully handling upon entry/leave of
> > syscall routine so that signal handlers can read/write the contents of
> > the register.
> >
> > It also adds the follow up routine for SIGSEGV as a signal delivery runs
> > in the same stack frame while we have to avoid endless SIGSEGV.
> >
> > Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
> > ---
> > arch/um/include/shared/kern_util.h | 4 +
> > arch/um/nommu/Makefile | 2 +-
> > arch/um/nommu/os-Linux/signal.c | 13 ++
> > arch/um/nommu/trap.c | 194 ++++++++++++++++++++++++++
> > arch/x86/um/nommu/do_syscall_64.c | 6 +
> > arch/x86/um/nommu/os-Linux/mcontext.c | 11 ++
> > arch/x86/um/shared/sysdep/mcontext.h | 1 +
> > arch/x86/um/shared/sysdep/ptrace.h | 2 +-
> > 8 files changed, 231 insertions(+), 2 deletions(-)
> > create mode 100644 arch/um/nommu/trap.c
> >
> > [SNIP]
> > diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c
> > index c4ef877d5ea0..955e7d9f4765 100644
> > --- a/arch/x86/um/nommu/os-Linux/mcontext.c
> > +++ b/arch/x86/um/nommu/os-Linux/mcontext.c
> > @@ -6,6 +6,17 @@
> > #include <sysdep/mcontext.h>
> > #include <sysdep/syscalls.h>
> >
> > +static void __userspace_relay_signal(void)
> > +{
> > + /* XXX: dummy syscall */
> > + __asm__ volatile("call *%0" : : "r"(__kernel_vsyscall), "a"(39) :);
> > +}
>
> 39 is NR__getpid, I assume?
>
> The "call *%0" looks like it is code for retpolin, I think this would
> currently just segfault.
# if by "retpolin" you mean zpoline:
zpoline uses `call *%rax`, so this is not about zpoline.
> > +void set_mc_userspace_relay_signal(mcontext_t *mc)
> > +{
> > + mc->gregs[REG_RIP] = (unsigned long) __userspace_relay_signal;
> > +}
> > +
This is somewhat scary code that I wrote to handle the case where the
host raises SIGSEGV for a userspace program running on UML (nommu).
# and I should remember that my XXX tags are important to fix....
Let me try to explain what happens and what I tried to solve.
The SEGV signal from a userspace program is delivered to userspace, but
if we don't fix the code raising the signal, execution will restart
from $rip after (um) rt_sigreturn and raise SIGSEGV again.
# so, yes, we already rely on the host's and um's rt_sigreturn to
restore various things.
when a uml userspace crashes with SIGSEGV,
- host kernel raises SIGSEGV (at original $rip)
- caught by uml process (hard_handler)
- raise a signal to uml userspace process (segv_handler)
- handler ends (hard_handler)
- (host) run restorer (rt_sigreturn, registered by (libc)sigaction,
not (host) rt_sigaction)
- return back to the original $rip
- (back to top)
This is how the endless loop happens.
um's sa_handler isn't called because rt_sigreturn (um) isn't called,
and __userspace_relay_signal was my original attempt to solve that.
I agree that it is lazy to call a dummy syscall (indeed, getpid).
I'm trying to introduce another routine to jump into userspace and
call (um) rt_sigreturn after (host) rt_sigreturn.
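The restart behaviour described above can be reproduced in an ordinary Linux process. This is a minimal sketch with hypothetical names, not UML code. A store to a read-only page faults, the handler returns without changing RIP, and the host's rt_sigreturn re-executes the same store. The loop only ends once the handler repairs the cause of the fault, which here means making the page writable after n faults.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical names throughout; this only illustrates the host behaviour. */
static volatile sig_atomic_t fault_count;
static volatile sig_atomic_t repair_after;
static void * volatile page;
static volatile long page_size;

static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)info;
    (void)ctx;

    fault_count++;
    /*
     * Returning without touching the saved RIP re-executes the faulting
     * store. Only once the cause is repaired does the loop end.
     * (mprotect is a thin syscall wrapper; POSIX does not list it as
     * async-signal-safe, which is acceptable for a demonstration.)
     */
    if (fault_count >= repair_after)
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

/* Fault on the same instruction n times, then let the store succeed. */
int count_faults_until_repaired(int n)
{
    struct sigaction sa, old;
    volatile int *p;
    int seen;

    assert(n >= 1);

    page_size = sysconf(_SC_PAGESIZE);
    page = mmap(NULL, page_size, PROT_READ,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    fault_count = 0;
    repair_after = n;
    p = page;
    *p = 42; /* faults n times before it finally succeeds */

    seen = fault_count;
    sigaction(SIGSEGV, &old, NULL);
    munmap(page, page_size);
    return seen;
}
```

count_faults_until_repaired(3) is expected to return 3: the handler runs three times for a single store. If the handler never repaired the page, the process would spin on that store forever, which is the endless SIGSEGV case.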
> And this is really confusing me. The way I am reading it, the code
> tries to do:
> 1. Rewrite RIP to jump to __userspace_relay_signal
> 2. Trigger a getpid syscall (to do "nothing"?)
> 3. Let do_syscall_64 fire the signal from interrupt_end
correct.
> However, then that really confuses me, because:
> * If I am reading it correctly, then this approach will destroy the
> contents of various registers (RIP, RAX and likely more)
> * This would result in an incorrect mcontext in the userspace signal
> handler (which could be relevant if userspace is inspecting it)
> * However, worst, rt_sigreturn will eventually jump back
> into __userspace_relay_signal, which has nothing to return to.
> * Also, relay_signal doesn't use this? What happens for a SIGFPE, how
> is userspace interrupted immediately in that case?
Indeed, relay_signal shares the same goal.
But I guess the issue with `mc->gregs[REG_RIP]` (endless signals) still
exists.
> Honestly, I really think we should take a step back and swap the
> current syscall entry/exit code. That would likely also simplify
> floating point register handling, which I think is currently
> insufficient to deal with the odd special cases caused by different
> x86_64 hardware extensions.
>
> Basically, I think nommu mode should use the same general approach as
> the current SECCOMP mode. Which is to use rt_sigreturn to jump into
> userspace and let the host kernel deal with the ugly details of how to
> do that.
I looked at how MMU mode (ptrace/seccomp) handles this case.
In nommu mode, we don't have an external process to catch signals, so
nommu mode uses hard_handler() to catch SEGV/FPE of userspace
programs, while mmu mode does not call segv_handler in the context of
a signal handler.
# correct me if I'm wrong.
Thus, mmu mode doesn't have this situation.
I'm trying various approaches, such as calling um's rt_sigreturn
instead of the host's one, which doesn't work because the host restore
procedures (unblocking masked signals, restoring register states, etc.)
aren't called.
I'll update here if I find a good direction, but it would be great if
you could suggest how it should be handled.
-- Hajime
> I believe that this requires a second "userspace" sigaltstack in
> addition to the current "IRQ" sigaltstack. Then switching in between
> the two (note that the "userspace" one is also used for IRQs if those
> happen while userspace is executing).
>
> So, in principle I would think something like:
> * to jump into userspace, you would:
> - block all signals
> - set "userspace" sigaltstack
> - setup mcontext for rt_sigreturn
> - setup RSP for rt_sigreturn
> - call rt_sigreturn syscall
> * all signal handlers can (except pure IRQs):
> - check on which stack they are
> -> easy to detect whether we are in kernel mode
> - for IRQs one can probably handle them directly (and return)
> - in user mode:
> + store mcontext location and information needed for rt_sigreturn
> + jump back into kernel task stack
> * kernel task handler to continue would:
> - set sigaltstack to IRQ stack
> - fetch register from mcontext
> - unblock all signals
> - handle syscall/signal in whatever way needed
>
> Now that I wrote about it, I am thinking that it might be possible to
> just use the kernel task stack for the signal stack. One would probably
> need to increase the kernel stack size a bit, but it would also mean
> that no special code is needed for "rt_sigreturn" handling. The rest
> would remain the same.
>
> Thoughts?
>
> Benjamin
>
> > [SNIP]
>
* Re: [PATCH v10 09/13] x86/um: nommu: signal handling
2025-06-27 13:50 ` Hajime Tazaki
@ 2025-06-27 15:02 ` Benjamin Berg
2025-06-30 1:04 ` Hajime Tazaki
0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Berg @ 2025-06-27 15:02 UTC (permalink / raw)
To: Hajime Tazaki; +Cc: linux-um, ricarkol, Liam.Howlett, linux-kernel
Hi,
On Fri, 2025-06-27 at 22:50 +0900, Hajime Tazaki wrote:
> thanks for the comment on the complicated part of the kernel (signal).
This stuff isn't simple.
Actually, I am starting to think that the current MMU UML kernel also
needs a redesign with regard to signal handling and stack use in that
case. My current impression is that the design right now only permits
voluntary scheduling. More specifically, scheduling in response to an
interrupt is impossible.
I suppose that works fine, but it also does not seem quite right.
> On Wed, 25 Jun 2025 08:20:03 +0900,
> Benjamin Berg wrote:
> >
> > Hi,
> >
> > On Mon, 2025-06-23 at 06:33 +0900, Hajime Tazaki wrote:
> > > This commit updates the behavior of signal handling under !MMU
> > > environment. It adds the alignment code for signal frame as the frame
> > > is used in userspace as-is.
> > >
> > > floating point register is carefully handling upon entry/leave of
> > > syscall routine so that signal handlers can read/write the contents of
> > > the register.
> > >
> > > It also adds the follow up routine for SIGSEGV as a signal delivery runs
> > > in the same stack frame while we have to avoid endless SIGSEGV.
> > >
> > > Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
> > > ---
> > > arch/um/include/shared/kern_util.h | 4 +
> > > arch/um/nommu/Makefile | 2 +-
> > > arch/um/nommu/os-Linux/signal.c | 13 ++
> > > arch/um/nommu/trap.c | 194 ++++++++++++++++++++++++++
> > > arch/x86/um/nommu/do_syscall_64.c | 6 +
> > > arch/x86/um/nommu/os-Linux/mcontext.c | 11 ++
> > > arch/x86/um/shared/sysdep/mcontext.h | 1 +
> > > arch/x86/um/shared/sysdep/ptrace.h | 2 +-
> > > 8 files changed, 231 insertions(+), 2 deletions(-)
> > > create mode 100644 arch/um/nommu/trap.c
> > >
> > > [SNIP]
> > > diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c
> > > index c4ef877d5ea0..955e7d9f4765 100644
> > > --- a/arch/x86/um/nommu/os-Linux/mcontext.c
> > > +++ b/arch/x86/um/nommu/os-Linux/mcontext.c
> > > @@ -6,6 +6,17 @@
> > > #include <sysdep/mcontext.h>
> > > #include <sysdep/syscalls.h>
> > >
> > > +static void __userspace_relay_signal(void)
> > > +{
> > > + /* XXX: dummy syscall */
> > > + __asm__ volatile("call *%0" : : "r"(__kernel_vsyscall), "a"(39) :);
> > > +}
> >
> > 39 is NR__getpid, I assume?
> >
> > The "call *%0" looks like it is code for retpolin, I think this would
> > currently just segfault.
>
> # if you mean retpolin as zpoline,
>
> zpoline uses `call *%rax` so, this is not about zpoline.
Ah, yes, of course.
> > > +void set_mc_userspace_relay_signal(mcontext_t *mc)
> > > +{
> > > + mc->gregs[REG_RIP] = (unsigned long) __userspace_relay_signal;
> > > +}
> > > +
>
> This is a bit scary code which I tried to handle when SIGSEGV is
> raised by host for a userspace program running on UML (nommu).
>
> # and I should remember my XXX tag is important to fix....
>
> let me try to explain what happens and what I tried to solve.
>
> The SEGV signal from userspace program is delivered to userspace but
> if we don't fix the code raising the signal, after (um) rt_sigreturn,
> it will restart from $rip and raise SIGSEGV again.
>
> # so, yes, we've already relied on host and um's rt_sigreturn to
> restore various things.
>
> when a uml userspace crashes with SIGSEGV,
>
> - host kernel raises SIGSEGV (at original $rip)
> - caught by uml process (hard_handler)
> - raise a signal to uml userspace process (segv_handler)
> - handler ends (hard_handler)
> - (host) run restorer (rt_sigreturn, registered by (libc)sigaction,
> not (host) rt_sigaction)
> - return back to the original $rip
> - (back to top)
>
> this is the case where endless loop is happened.
> um's sa_handler isn't called as rt_sigreturn (um) isn't called.
> and the my original attempt (__userspace_relay_signal) is what I tried.
>
> I agree that it is lazy to call a dummy syscall (indeed, getpid).
> I'm trying to introduce another routine to jump into userspace and
> call (um) rt_sigreturn after (host) rt_sigreturn.
>
> > And this is really confusing me. The way I am reading it, the code
> > tries to do:
> > 1. Rewrite RIP to jump to __userspace_relay_signal
> > 2. Trigger a getpid syscall (to do "nothing"?)
> > 3. Let do_syscall_64 fire the signal from interrupt_end
>
> correct.
>
> > However, then that really confuses me, because:
> > * If I am reading it correctly, then this approach will destroy the
> > contents of various registers (RIP, RAX and likely more)
> > * This would result in an incorrect mcontext in the userspace signal
> > handler (which could be relevant if userspace is inspecting it)
> > * However, worst, rt_sigreturn will eventually jump back
> > into __userspace_relay_signal, which has nothing to return to.
> > * Also, relay_signal doesn't use this? What happens for a SIGFPE, how
> > is userspace interrupted immediately in that case?
>
> relay_signal shares the same goal of this, indeed.
> but the issue with `mc->gregs[REG_RIP]` (endless signals) still exists
> I guess.
Well, endless signals only exist as long as you exit to the same
location. My suggestion was to read the user state from the mcontext
(as SECCOMP mode does) and execute the signal right away, i.e.:
* Fetch the current registers from the mcontext
* Push the signal context onto the userspace stack
* Modify the host mcontext to set registers for the signal handler
* Jump back to userspace by doing a "return"
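The four steps above can be sketched in a plain x86_64 Linux process. This is a sketch under stated assumptions, not the UML implementation: guest_handler stands in for the UML userspace signal handler, guest_stack is a separate static stack used instead of pushing a signal frame onto the real userspace stack, and all names are hypothetical. The host handler only edits the saved mcontext and returns. The host's rt_sigreturn then "jumps" into the guest handler, so the faulting instruction is never re-executed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>

/* Hypothetical names throughout; x86_64 Linux only. */
static sigjmp_buf resume_point;
static volatile sig_atomic_t guest_saw_signal;
static char guest_stack[64 * 1024] __attribute__((aligned(16)));

/* Stand-in for the UML userspace handler: entered by a jump, never by a call. */
static void guest_handler(int sig)
{
    guest_saw_signal = sig;
    siglongjmp(resume_point, 1); /* it has no return address to return to */
}

/* Host-side handler: rewrite the saved registers, then simply return. */
static void host_handler(int sig, siginfo_t *info, void *ctx)
{
    ucontext_t *uc = ctx;
    uintptr_t top = (uintptr_t)guest_stack + sizeof(guest_stack);

    (void)info;
    /* First argument of guest_handler. */
    uc->uc_mcontext.gregs[REG_RDI] = sig;
    /* RSP % 16 == 8, as it would be right after a call instruction. */
    uc->uc_mcontext.gregs[REG_RSP] = (greg_t)(top - 8);
    /* rt_sigreturn will resume here instead of at the faulting store. */
    uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(uintptr_t)guest_handler;
}

/* Returns the signal number seen by guest_handler, or -1 on setup failure. */
int redirect_fault_to_guest_handler(void)
{
    struct sigaction sa, old;
    volatile int *page;
    int seen;

    assert(sizeof(guest_stack) % 16 == 0);

    page = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = host_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    guest_saw_signal = 0;
    if (sigsetjmp(resume_point, 1) == 0)
        *page = 1; /* faults; execution continues in guest_handler */

    seen = guest_saw_signal;
    sigaction(SIGSEGV, &old, NULL);
    munmap((void *)page, 4096);
    return seen;
}
```

redirect_fault_to_guest_handler() is expected to return SIGSEGV after a single fault. The sketch leaves all register and FP state restoration to the host kernel, which is the point of the suggestion.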
Said differently, I really prefer deferring as much logic as possible
to the host. This is both safer and easier to understand. Plus, it also
has the advantage of making it simpler to port UML to other
architectures.
> > Honestly, I really think we should take a step back and swap the
> > current syscall entry/exit code. That would likely also simplify
> > floating point register handling, which I think is currently
> > insufficient to deal with the odd special cases caused by different
> > x86_64 hardware extensions.
> >
> > Basically, I think nommu mode should use the same general approach as
> > the current SECCOMP mode. Which is to use rt_sigreturn to jump into
> > userspace and let the host kernel deal with the ugly details of how to
> > do that.
>
> I looked at how MMU mode (ptrace/seccomp) does handle this case.
>
> In nommu mode, we don't have external process to catch signals so, the
> nommu mode uses hard_handler() to catch SEGV/FPE of userspace
> programs. While mmu mode calls segv_handler not in a context of
> signal handler.
>
> # correct me if I'm wrong.
>
> thus, mmu mode doesn't have this situation.
Yes, it does not have this specific issue. But see the top of the mail
for other issues that are somewhat related.
> I'm attempting various ways; calling um's rt_sigreturn instead of
> host's one, which doesn't work as host restore procedures (unblocking
> masked signals, restoring register states, etc) aren't called.
>
> I'll update here if I found a good direction, but would be great if
> you see how it should be handled.
Can we please discuss possible solutions? We can figure out the details
once it is clear how the interaction with the host should work.
I still think that the idea of using the kernel task stack as the
signal stack is really elegant. Actually, doing that in normal UML may
be how we can fix the issues mentioned at the top of my mail. And for
nommu, we can also use the host mcontext to jump back into userspace
using a simple "return".
Conceptually it seems so simple.
Benjamin
>
> -- Hajime
>
> > I believe that this requires a second "userspace" sigaltstack in
> > addition to the current "IRQ" sigaltstack. Then switching in between
> > the two (note that the "userspace" one is also used for IRQs if those
> > happen while userspace is executing).
> >
> > So, in principle I would think something like:
> > * to jump into userspace, you would:
> > - block all signals
> > - set "userspace" sigaltstack
> > - setup mcontext for rt_sigreturn
> > - setup RSP for rt_sigreturn
> > - call rt_sigreturn syscall
> > * all signal handlers can (except pure IRQs):
> > - check on which stack they are
> > -> easy to detect whether we are in kernel mode
> > - for IRQs one can probably handle them directly (and return)
> > - in user mode:
> > + store mcontext location and information needed for rt_sigreturn
> > + jump back into kernel task stack
> > * kernel task handler to continue would:
> > - set sigaltstack to IRQ stack
> > - fetch register from mcontext
> > - unblock all signals
> > - handle syscall/signal in whatever way needed
> >
> > Now that I wrote about it, I am thinking that it might be possible to
> > just use the kernel task stack for the signal stack. One would probably
> > need to increase the kernel stack size a bit, but it would also mean
> > that no special code is needed for "rt_sigreturn" handling. The rest
> > would remain the same.
> >
> > Thoughts?
> >
> > Benjamin
> >
> > > [SNIP]
> >
>
* Re: [PATCH v10 09/13] x86/um: nommu: signal handling
2025-06-27 15:02 ` Benjamin Berg
@ 2025-06-30 1:04 ` Hajime Tazaki
2025-07-01 12:03 ` Benjamin Berg
0 siblings, 1 reply; 25+ messages in thread
From: Hajime Tazaki @ 2025-06-30 1:04 UTC (permalink / raw)
To: benjamin; +Cc: linux-um, ricarkol, Liam.Howlett, linux-kernel
Hello Benjamin,
On Sat, 28 Jun 2025 00:02:05 +0900,
Benjamin Berg wrote:
>
> Hi,
>
> On Fri, 2025-06-27 at 22:50 +0900, Hajime Tazaki wrote:
> > thanks for the comment on the complicated part of the kernel (signal).
>
> This stuff isn't simple.
>
> Actually, I am starting to think that the current MMU UML kernel also
> needs a redesign with regard to signal handling and stack use in that
> case. My current impression is that the design right now only permits
> voluntary scheduling. More specifically, scheduling in response to an
> interrupt is impossible.
>
> I suppose that works fine, but it also does not seem quite right.
thanks for the info. it's very useful to understand what's going on.
(snip)
> > > > +void set_mc_userspace_relay_signal(mcontext_t *mc)
> > > > +{
> > > > + mc->gregs[REG_RIP] = (unsigned long) __userspace_relay_signal;
> > > > +}
> > > > +
> >
> > This is a bit scary code which I tried to handle when SIGSEGV is
> > raised by host for a userspace program running on UML (nommu).
> >
> > # and I should remember my XXX tag is important to fix....
> >
> > let me try to explain what happens and what I tried to solve.
> >
> > The SEGV signal from userspace program is delivered to userspace but
> > if we don't fix the code raising the signal, after (um) rt_sigreturn,
> > it will restart from $rip and raise SIGSEGV again.
> >
> > # so, yes, we've already relied on host and um's rt_sigreturn to
> > restore various things.
> >
> > when a uml userspace crashes with SIGSEGV,
> >
> > - host kernel raises SIGSEGV (at original $rip)
> > - caught by uml process (hard_handler)
> > - raise a signal to uml userspace process (segv_handler)
> > - handler ends (hard_handler)
> > - (host) run restorer (rt_sigreturn, registered by (libc)sigaction,
> > not (host) rt_sigaction)
> > - return back to the original $rip
> > - (back to top)
> >
> > this is the case where endless loop is happened.
> > um's sa_handler isn't called as rt_sigreturn (um) isn't called.
> > and the my original attempt (__userspace_relay_signal) is what I tried.
> >
> > I agree that it is lazy to call a dummy syscall (indeed, getpid).
> > I'm trying to introduce another routine to jump into userspace and
> > call (um) rt_sigreturn after (host) rt_sigreturn.
> >
> > > And this is really confusing me. The way I am reading it, the code
> > > tries to do:
> > > 1. Rewrite RIP to jump to __userspace_relay_signal
> > > 2. Trigger a getpid syscall (to do "nothing"?)
> > > 3. Let do_syscall_64 fire the signal from interrupt_end
> >
> > correct.
> >
> > > However, then that really confuses me, because:
> > > * If I am reading it correctly, then this approach will destroy the
> > > contents of various registers (RIP, RAX and likely more)
> > > * This would result in an incorrect mcontext in the userspace signal
> > > handler (which could be relevant if userspace is inspecting it)
> > > * However, worst, rt_sigreturn will eventually jump back
> > > into __userspace_relay_signal, which has nothing to return to.
> > > * Also, relay_signal doesn't use this? What happens for a SIGFPE, how
> > > is userspace interrupted immediately in that case?
> >
> > relay_signal shares the same goal of this, indeed.
> > but the issue with `mc->gregs[REG_RIP]` (endless signals) still exists
> > I guess.
>
> Well, endless signals only exist as long as you exit to the same
> location. My suggestion was to read the user state from the mcontext
> (as SECCOMP mode does it) and executing the signal right away, i.e.:
thanks too; below is my understanding.
> * Fetch the current registers from the mcontext
I guess this is already done in sig_handler_common().
> * Push the signal context onto the userspace stack
I guess this is already done in handle_signal() => setup_signal_stack_si().
> * Modify the host mcontext to set registers for the signal handler
This is the part I don't understand well.
- Do you mean the host handler when you say "for the signal handler",
or the userspace handler?
- If the former (the host one), maybe the mcontext is already there, so
it might not be the one you mentioned.
- If the latter, how does the original handler (the host one,
hard_handler()) work? Even if we can call the userspace handler
instead of the host one, we need to call the host handler (and
restorer). Do we call both?
- And by "to set registers", which registers do you mean? The
registers inspected by the userspace signal handler? But if you set a
register, for instance RIP, to the fault location in the host
registers, won't it return to RIP after the handler and trigger the
fault again?
> * Jump back to userspace by doing a "return"
This is also still unclear to me.
It would be very helpful if you could point to the code (in the
uml/next tree) that shows how SECCOMP mode does this. I'm also looking
at it, but it is really hard to map what you described onto the code
(sorry).
All of the above runs within hard_handler() in nommu mode on SIGSEGV.
My best guess is that this is different from what ptrace/seccomp do.
> Said differently, I really prefer deferring as much logic as possible
> to the host. This is both safer and easier to understand. Plus, it also
> has the advantage of making it simpler to port UML to other
> architectures.
okay.
>
> > > Honestly, I really think we should take a step back and swap the
> > > current syscall entry/exit code. That would likely also simplify
> > > floating point register handling, which I think is currently
> > > insufficient to deal with the odd special cases caused by different
> > > x86_64 hardware extensions.
> > >
> > > Basically, I think nommu mode should use the same general approach as
> > > the current SECCOMP mode. Which is to use rt_sigreturn to jump into
> > > userspace and let the host kernel deal with the ugly details of how to
> > > do that.
> >
> > I looked at how MMU mode (ptrace/seccomp) does handle this case.
> >
> > In nommu mode, we don't have external process to catch signals so, the
> > nommu mode uses hard_handler() to catch SEGV/FPE of userspace
> > programs. While mmu mode calls segv_handler not in a context of
> > signal handler.
> >
> > # correct me if I'm wrong.
> >
> > thus, mmu mode doesn't have this situation.
>
> Yes, it does not have this specific issue. But see the top of the mail
> for other issues that are somewhat related.
>
> > I'm attempting various ways; calling um's rt_sigreturn instead of
> > host's one, which doesn't work as host restore procedures (unblocking
> > masked signals, restoring register states, etc) aren't called.
> >
> > I'll update here if I found a good direction, but would be great if
> > you see how it should be handled.
>
> Can we please discuss possible solutions? We can figure out the details
> once it is clear how the interaction with the host should work.
I just wanted to let you know that I'm working on it. Your comments
are always helpful to me. Thanks.
-- Hajime
> I still think that the idea of using the kernel task stack as the
> signal stack is really elegant. Actually, doing that in normal UML may
> be how we can fix the issues mentioned at the top of my mail. And for
> nommu, we can also use the host mcontext to jump back into userspace
> using a simple "return".
>
> Conceptually it seems so simple.
>
> Benjamin
>
>
> >
> > -- Hajime
> >
> > > I believe that this requires a second "userspace" sigaltstack in
> > > addition to the current "IRQ" sigaltstack. Then switching in between
> > > the two (note that the "userspace" one is also used for IRQs if those
> > > happen while userspace is executing).
> > >
> > > So, in principle I would think something like:
> > > * to jump into userspace, you would:
> > > - block all signals
> > > - set "userspace" sigaltstack
> > > - setup mcontext for rt_sigreturn
> > > - setup RSP for rt_sigreturn
> > > - call rt_sigreturn syscall
> > > * all signal handlers can (except pure IRQs):
> > > - check on which stack they are
> > > -> easy to detect whether we are in kernel mode
> > > - for IRQs one can probably handle them directly (and return)
> > > - in user mode:
> > > + store mcontext location and information needed for rt_sigreturn
> > > + jump back into kernel task stack
> > > * kernel task handler to continue would:
> > > - set sigaltstack to IRQ stack
> > > - fetch register from mcontext
> > > - unblock all signals
> > > - handle syscall/signal in whatever way needed
> > >
> > > Now that I wrote about it, I am thinking that it might be possible to
> > > just use the kernel task stack for the signal stack. One would probably
> > > need to increase the kernel stack size a bit, but it would also mean
> > > that no special code is needed for "rt_sigreturn" handling. The rest
> > > would remain the same.
> > >
> > > Thoughts?
> > >
> > > Benjamin
> > >
> > > > [SNIP]
> > >
> >
>
* Re: [PATCH v10 09/13] x86/um: nommu: signal handling
2025-06-30 1:04 ` Hajime Tazaki
@ 2025-07-01 12:03 ` Benjamin Berg
2025-07-02 4:37 ` Hajime Tazaki
0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Berg @ 2025-07-01 12:03 UTC (permalink / raw)
To: Hajime Tazaki; +Cc: linux-um, ricarkol, Liam.Howlett, linux-kernel
Hi Hajime,
On Mon, 2025-06-30 at 10:04 +0900, Hajime Tazaki wrote:
>
> Hello Benjamin,
>
> On Sat, 28 Jun 2025 00:02:05 +0900,
> Benjamin Berg wrote:
> >
> > Hi,
> >
> > On Fri, 2025-06-27 at 22:50 +0900, Hajime Tazaki wrote:
> > > thanks for the comment on the complicated part of the kernel (signal).
> >
> > This stuff isn't simple.
> >
> > Actually, I am starting to think that the current MMU UML kernel also
> > needs a redesign with regard to signal handling and stack use in that
> > case. My current impression is that the design right now only permits
> > voluntary scheduling. More specifically, scheduling in response to an
> > interrupt is impossible.
> >
> > I suppose that works fine, but it also does not seem quite right.
>
> thanks for the info. it's very useful to understand what's going on.
>
> (snip)
>
> > > > > +void set_mc_userspace_relay_signal(mcontext_t *mc)
> > > > > +{
> > > > > + mc->gregs[REG_RIP] = (unsigned long) __userspace_relay_signal;
> > > > > +}
> > > > > +
> > >
> > > This is a bit scary code which I tried to handle when SIGSEGV is
> > > raised by host for a userspace program running on UML (nommu).
> > >
> > > # and I should remember my XXX tag is important to fix....
> > >
> > > let me try to explain what happens and what I tried to solve.
> > >
> > > The SEGV signal from userspace program is delivered to userspace but
> > > if we don't fix the code raising the signal, after (um) rt_sigreturn,
> > > it will restart from $rip and raise SIGSEGV again.
> > >
> > > # so, yes, we've already relied on host and um's rt_sigreturn to
> > > restore various things.
> > >
> > > when a uml userspace crashes with SIGSEGV,
> > >
> > > - host kernel raises SIGSEGV (at original $rip)
> > > - caught by uml process (hard_handler)
> > > - raise a signal to uml userspace process (segv_handler)
> > > - handler ends (hard_handler)
> > > - (host) run restorer (rt_sigreturn, registered by (libc)sigaction,
> > > not (host) rt_sigaction)
> > > - return back to the original $rip
> > > - (back to top)
> > >
> > > this is the case where endless loop is happened.
> > > um's sa_handler isn't called as rt_sigreturn (um) isn't called.
> > > and the my original attempt (__userspace_relay_signal) is what I tried.
> > >
> > > I agree that it is lazy to call a dummy syscall (indeed, getpid).
> > > I'm trying to introduce another routine to jump into userspace and
> > > call (um) rt_sigreturn after (host) rt_sigreturn.
> > >
> > > > And this is really confusing me. The way I am reading it, the code
> > > > tries to do:
> > > > 1. Rewrite RIP to jump to __userspace_relay_signal
> > > > 2. Trigger a getpid syscall (to do "nothing"?)
> > > > 3. Let do_syscall_64 fire the signal from interrupt_end
> > >
> > > correct.
> > >
> > > > However, then that really confuses me, because:
> > > > * If I am reading it correctly, then this approach will destroy the
> > > > contents of various registers (RIP, RAX and likely more)
> > > > * This would result in an incorrect mcontext in the userspace signal
> > > > handler (which could be relevant if userspace is inspecting it)
> > > > * However, worst, rt_sigreturn will eventually jump back
> > > > into __userspace_relay_signal, which has nothing to return to.
> > > > * Also, relay_signal doesn't use this? What happens for a SIGFPE, how
> > > > is userspace interrupted immediately in that case?
> > >
> > > relay_signal shares the same goal of this, indeed.
> > > but the issue with `mc->gregs[REG_RIP]` (endless signals) still exists
> > > I guess.
> >
> > Well, endless signals only exist as long as you exit to the same
> > location. My suggestion was to read the user state from the mcontext
> > (as SECCOMP mode does it) and executing the signal right away, i.e.:
>
> thanks too; below is my understanding.
>
> > * Fetch the current registers from the mcontext
>
> I guess this is already done in sig_handler_common().
Well, not really?
It does seem to fetch the general purpose registers. But the code
pretty much assumes we will return to the same location and only stores
them on the stack for the signal handler itself. Also, remember that it
might be userspace or kernel space in your case. The kernel task
registers are in "switch_buf" while the userspace registers are in
"regs" of "struct task_struct" (effectively "struct uml_pt_regs").
> > * Push the signal context onto the userspace stack
>
> (guess) this is already done on handle_signal() => setup_signal_stack_si().
>
> > * Modify the host mcontext to set registers for the signal handler
>
> this is something which I'm not well understanding.
> - do you mean the host handler when you say "for the signal handler" ?
> or the userspace handler ?
Both in a way ;-)
I mean modify the registers in the host mcontext so that the UML
userspace will continue executing inside its signal handler.
> - if former (the host one), maybe mcontext is already there so, it
> might not be the one you mentioned.
> - if the latter, how the original handler (the host one,
> hard_handler()) works ? even if we can call userspace handler
> instead of the host one, we need to call the host handler (and
> restorer). do we call both ?
> - and by "to set registers", what register do you mean ? for the
> registers inspected by userspace signal handler ? but if you set a
> register, for instance RIP, as the fault location to the host
> register, it will return to RIP after handler and restart the fault
> again ?
I am confused, why would the fault handler be restarted? If you modify
RIP, then the host kernel will not return to the faulting location. You
were using that already to jump into __userspace_relay_signal. All I am
arguing is that instead of jumping to __userspace_relay_signal you can
prepare everything and directly jump into the user's signal handler.
> > * Jump back to userspace by doing a "return"
>
> this is still also unclear to me.
>
> it would be very helpful if you point the location of the code (at
> uml/next tree) on how SECCOMP mode does. I'm also looking at but
> really hard to map what you described and the code (sorry).
"stub_signal_interrupt" simply returns, which means it jumps into the
restorer "stub_signal_restorer" which does the rt_sigreturn syscall.
This means the host kernel restores the userspace state from the
mcontext. As the mcontext resides in shared memory, the UML kernel can
update it to specify where the process should continue running (thread
switching, signals, syscall return value, …).
Benjamin
> all of above runs within hard_handler() in nommu mode on SIGSEGV.
> my best guess is this is different from what ptrace/seccomp do.
>
> > Said differently, I really prefer deferring as much logic as possible
> > to the host. This is both safer and easier to understand. Plus, it also
> > has the advantage of making it simpler to port UML to other
> > architectures.
>
> okay.
>
> >
> > > > Honestly, I really think we should take a step back and swap the
> > > > current syscall entry/exit code. That would likely also simplify
> > > > floating point register handling, which I think is currently
> > > > insufficient to deal with the odd special cases caused by different
> > > > x86_64 hardware extensions.
> > > >
> > > > Basically, I think nommu mode should use the same general approach as
> > > > the current SECCOMP mode. Which is to use rt_sigreturn to jump into
> > > > userspace and let the host kernel deal with the ugly details of how to
> > > > do that.
> > >
> > > I looked at how MMU mode (ptrace/seccomp) does handle this case.
> > >
> > > In nommu mode, we don't have external process to catch signals so, the
> > > nommu mode uses hard_handler() to catch SEGV/FPE of userspace
> > > programs. While mmu mode calls segv_handler not in a context of
> > > signal handler.
> > >
> > > # correct me if I'm wrong.
> > >
> > > thus, mmu mode doesn't have this situation.
> >
> > Yes, it does not have this specific issue. But see the top of the mail
> > for other issues that are somewhat related.
> >
> > > I'm attempting various ways; calling um's rt_sigreturn instead of
> > > host's one, which doesn't work as host restore procedures (unblocking
> > > masked signals, restoring register states, etc) aren't called.
> > >
> > > I'll update here if I found a good direction, but would be great if
> > > you see how it should be handled.
> >
> > Can we please discuss possible solutions? We can figure out the details
> > once it is clear how the interaction with the host should work.
>
> I was wishing to update to you that I'm working on it. So, your
> comments are always helpful to me. Thanks.
>
> -- Hajime
>
> > I still think that the idea of using the kernel task stack as the
> > signal stack is really elegant. Actually, doing that in normal UML may
> > be how we can fix the issues mentioned at the top of my mail. And for
> > nommu, we can also use the host mcontext to jump back into userspace
> > using a simple "return".
> >
> > Conceptually it seems so simple.
> >
> > Benjamin
> >
> >
> > >
> > > -- Hajime
> > >
> > > > I believe that this requires a second "userspace" sigaltstack in
> > > > addition to the current "IRQ" sigaltstack. Then switching in between
> > > > the two (note that the "userspace" one is also used for IRQs if those
> > > > happen while userspace is executing).
> > > >
> > > > So, in principle I would think something like:
> > > > * to jump into userspace, you would:
> > > > - block all signals
> > > > - set "userspace" sigaltstack
> > > > - setup mcontext for rt_sigreturn
> > > > - setup RSP for rt_sigreturn
> > > > - call rt_sigreturn syscall
> > > > * all signal handlers can (except pure IRQs):
> > > > - check on which stack they are
> > > > -> easy to detect whether we are in kernel mode
> > > > - for IRQs one can probably handle them directly (and return)
> > > > - in user mode:
> > > > + store mcontext location and information needed for rt_sigreturn
> > > > + jump back into kernel task stack
> > > > * kernel task handler to continue would:
> > > > - set sigaltstack to IRQ stack
> > > > - fetch register from mcontext
> > > > - unblock all signals
> > > > - handle syscall/signal in whatever way needed
> > > >
> > > > Now that I wrote about it, I am thinking that it might be possible to
> > > > just use the kernel task stack for the signal stack. One would probably
> > > > need to increase the kernel stack size a bit, but it would also mean
> > > > that no special code is needed for "rt_sigreturn" handling. The rest
> > > > would remain the same.
> > > >
> > > > Thoughts?
> > > >
> > > > Benjamin
> > > >
> > > > > [SNIP]
> > > >
> > >
> >
>
* Re: [PATCH v10 09/13] x86/um: nommu: signal handling
2025-07-01 12:03 ` Benjamin Berg
@ 2025-07-02 4:37 ` Hajime Tazaki
2025-07-10 23:59 ` Hajime Tazaki
0 siblings, 1 reply; 25+ messages in thread
From: Hajime Tazaki @ 2025-07-02 4:37 UTC (permalink / raw)
To: benjamin; +Cc: linux-um, ricarkol, Liam.Howlett, linux-kernel
Hello Benjamin,
On Tue, 01 Jul 2025 21:03:36 +0900,
Benjamin Berg wrote:
>
> Hi Hajim,
>
> On Mon, 2025-06-30 at 10:04 +0900, Hajime Tazaki wrote:
> >
> > Hello Benjamin,
> >
> > On Sat, 28 Jun 2025 00:02:05 +0900,
> > Benjamin Berg wrote:
> > >
> > > Hi,
> > >
> > > On Fri, 2025-06-27 at 22:50 +0900, Hajime Tazaki wrote:
> > > > thanks for the comment on the complicated part of the kernel (signal).
> > >
> > > This stuff isn't simple.
> > >
> > > Actually, I am starting to think that the current MMU UML kernel also
> > > needs a redesign with regard to signal handling and stack use in that
> > > case. My current impression is that the design right now only permits
> > > voluntarily scheduling. More specifically, scheduling in response to an
> > > interrupt is impossible.
> > >
> > > I suppose that works fine, but it also does not seem quite right.
> >
> > thanks for the info. it's very useful to understand what's going on.
> >
> > (snip)
> >
> > > > > > +void set_mc_userspace_relay_signal(mcontext_t *mc)
> > > > > > +{
> > > > > > + mc->gregs[REG_RIP] = (unsigned long) __userspace_relay_signal;
> > > > > > +}
> > > > > > +
> > > >
> > > > This is a bit scary code which I tried to handle when SIGSEGV is
> > > > raised by host for a userspace program running on UML (nommu).
> > > >
> > > > # and I should remember my XXX tag is important to fix....
> > > >
> > > > let me try to explain what happens and what I tried to solve.
> > > >
> > > > The SEGV signal from userspace program is delivered to userspace but
> > > > if we don't fix the code raising the signal, after (um) rt_sigreturn,
> > > > it will restart from $rip and raise SIGSEGV again.
> > > >
> > > > # so, yes, we've already relied on host and um's rt_sigreturn to
> > > > restore various things.
> > > >
> > > > when a uml userspace crashes with SIGSEGV,
> > > >
> > > > - host kernel raises SIGSEGV (at original $rip)
> > > > - caught by uml process (hard_handler)
> > > > - raise a signal to uml userspace process (segv_handler)
> > > > - handler ends (hard_handler)
> > > > - (host) run restorer (rt_sigreturn, registered by (libc)sigaction,
> > > > not (host) rt_sigaction)
> > > > - return back to the original $rip
> > > > - (back to top)
> > > >
> > > > this is the case where endless loop is happened.
> > > > um's sa_handler isn't called as rt_sigreturn (um) isn't called.
> > > > and the my original attempt (__userspace_relay_signal) is what I tried.
> > > >
> > > > I agree that it is lazy to call a dummy syscall (indeed, getpid).
> > > > I'm trying to introduce another routine to jump into userspace and
> > > > call (um) rt_sigreturn after (host) rt_sigreturn.
> > > >
> > > > > And this is really confusing me. The way I am reading it, the code
> > > > > tries to do:
> > > > > 1. Rewrite RIP to jump to __userspace_relay_signal
> > > > > 2. Trigger a getpid syscall (to do "nothing"?)
> > > > > 3. Let do_syscall_64 fire the signal from interrupt_end
> > > >
> > > > correct.
> > > >
> > > > > However, then that really confuses me, because:
> > > > > * If I am reading it correctly, then this approach will destroy the
> > > > > contents of various registers (RIP, RAX and likely more)
> > > > > * This would result in an incorrect mcontext in the userspace signal
> > > > > handler (which could be relevant if userspace is inspecting it)
> > > > > * However, worst, rt_sigreturn will eventually jump back
> > > > > > into __userspace_relay_signal, which has nothing to return to.
> > > > > * Also, relay_signal doesn't use this? What happens for a SIGFPE, how
> > > > > is userspace interrupted immediately in that case?
> > > >
> > > > relay_signal shares the same goal of this, indeed.
> > > > but the issue with `mc->gregs[REG_RIP]` (endless signals) still exists
> > > > I guess.
> > >
> > > Well, endless signals only exist as long as you exit to the same
> > > location. My suggestion was to read the user state from the mcontext
> > > (as SECCOMP mode does it) and executing the signal right away, i.e.:
> >
> > thanks too; below is my understanding.
> >
> > > * Fetch the current registers from the mcontext
> >
> > I guess this is already done in sig_handler_common().
>
> Well, not really?
>
> It does seem to fetch the general purpose registers. But the code
> pretty much assumes we will return to the same location and only stores
> them on the stack for the signal handler itself. Also, remember that it
> might be userspace or kernel space in your case. The kernel task
> registers are in "switch_buf" while the userspace registers are in
> "regs" of "struct task_struct" (effectively "struct uml_pt_regs").
indeed, the handler returns to the same location.
here is what the current patchset does for the signal handling.
# sorry, I might be writing the same things several times, but I hope
this helps us understand and discuss what it should be.
receive signal (from host)
- > call host sa_handler (hard_handler)
- > sig_handler_common => get_regs_from_mc (fetch host mcontext to um)
- > set TIF_SIGPENDING (um kernel)
- > set host mcontext[RIP] to __userspace_relay_signal
(host sa_handler ends)
- call host sa_restorer => return to mcontext[RIP]
- > call __userspace_relay_signal from mcontext[RIP]
- > call interrupt_end()
- > do_signal => handle_signal => setup_signal_stack_si
(because TIF_SIGPENDING is on above)
- > call userspace sa_handler
- > call userspace sa_restorer
instead of setting mcontext[RIP] to the userspace sa_handler, it uses
__userspace_relay_signal, which configures the stack and mcontext (via
interrupt_end, setup_signal_stack_si, etc) and calls the userspace
sa_handler/restorer after that.
in this way, the program runs the userspace sa_handler outside the
host sa_handler context. I guess this means we don't have to
configure the host register/mcontext with the userspace one ?
I agree that the current __userspace_relay_signal can be shrunk so
that it no longer calls __kernel_vsyscall and only does interrupt_end
and stack preparation.
> > > * Push the signal context onto the userspace stack
> >
> > (guess) this is already done on handle_signal() => setup_signal_stack_si().
> >
> > > * Modify the host mcontext to set registers for the signal handler
> >
> > this is something which I'm not well understanding.
> > - do you mean the host handler when you say "for the signal handler" ?
> > or the userspace handler ?
>
> Both in a way ;-)
>
> I mean modify the registers in the host mcontext so that the UML
> userspace will continue executing inside its signal handler.
>
> > - if former (the host one), maybe mcontext is already there so, it
> > might not be the one you mentioned.
> > - if the latter, how the original handler (the host one,
> > hard_handler()) works ? even if we can call userspace handler
> > instead of the host one, we need to call the host handler (and
> > restorer). do we call both ?
> > - and by "to set registers", what register do you mean ? for the
> > registers inspected by userspace signal handler ? but if you set a
> > register, for instance RIP, as the fault location to the host
> > register, it will return to RIP after handler and restart the fault
> > again ?
>
> I am confused, why would the fault handler be restarted? If you modify
> RIP, then the host kernel will not return to the faulting location. You
> were using that already to jump into __userspace_relay_signal. All I am
> arguing that instead of jumping to __userspace_relay_signal you can
> prepare everything and directly jump into the users signal handler.
what I meant in that example is: setting host mcontext[RIP] to the
fault location, as userspace information, would lead to the fault
again. But this doesn't change RIP before and after, so I guess this
isn't a good example.
Sorry for the confusion.
> > > * Jump back to userspace by doing a "return"
> >
> > this is still also unclear to me.
> >
> > it would be very helpful if you point the location of the code (at
> > uml/next tree) on how SECCOMP mode does. I'm also looking at but
> > really hard to map what you described and the code (sorry).
>
> "stub_signal_interrupt" simply returns, which means it jumps into the
> restorer "stub_signal_restorer" which does the rt_sigreturn syscall.
> This means the host kernel restores the userspace state from the
> mcontext. As the mcontext resides in shared memory, the UML kernel can
> update it to specify where the process should continue running (thread
> switching, signals, syscall return value, …).
thanks !
so, stub_signal_interrupt runs on a different host process.
nommu mode tries to reuse the existing host sa_handler (hard_handler)
to do the job (handle SEGV etc).
If there is something missing in hard_handler and co in nommu mode
compared to what userspace_tramp does in seccomp mode, I've been
trying to update it.
-- Hajime
>
> Benjamin
>
> > all of above runs within hard_handler() in nommu mode on SIGSEGV.
> > my best guess is this is different from what ptrace/seccomp do.
> >
> > > Said differently, I really prefer deferring as much logic as possible
> > > to the host. This is both safer and easier to understand. Plus, it also
> > > has the advantage of making it simpler to port UML to other
> > > architectures.
> >
> > okay.
> >
> > >
> > > > > Honestly, I really think we should take a step back and swap the
> > > > > current syscall entry/exit code. That would likely also simplify
> > > > > floating point register handling, which I think is currently
> > > > > > insufficient to deal with the odd special cases caused by different
> > > > > x86_64 hardware extensions.
> > > > >
> > > > > Basically, I think nommu mode should use the same general approach as
> > > > > the current SECCOMP mode. Which is to use rt_sigreturn to jump into
> > > > > userspace and let the host kernel deal with the ugly details of how to
> > > > > do that.
> > > >
> > > > I looked at how MMU mode (ptrace/seccomp) does handle this case.
> > > >
> > > > In nommu mode, we don't have external process to catch signals so, the
> > > > nommu mode uses hard_handler() to catch SEGV/FPE of userspace
> > > > programs. While mmu mode calls segv_handler not in a context of
> > > > signal handler.
> > > >
> > > > # correct me if I'm wrong.
> > > >
> > > > thus, mmu mode doesn't have this situation.
> > >
> > > Yes, it does not have this specific issue. But see the top of the mail
> > > for other issues that are somewhat related.
> > >
> > > > I'm attempting various ways; calling um's rt_sigreturn instead of
> > > > host's one, which doesn't work as host restore procedures (unblocking
> > > > masked signals, restoring register states, etc) aren't called.
> > > >
> > > > I'll update here if I found a good direction, but would be great if
> > > > you see how it should be handled.
> > >
> > > Can we please discuss possible solutions? We can figure out the details
> > > once it is clear how the interaction with the host should work.
> >
> > I was wishing to update to you that I'm working on it. So, your
> > comments are always helpful to me. Thanks.
> >
> > -- Hajime
> >
> > > I still think that the idea of using the kernel task stack as the
> > > signal stack is really elegant. Actually, doing that in normal UML may
> > > be how we can fix the issues mentioned at the top of my mail. And for
> > > nommu, we can also use the host mcontext to jump back into userspace
> > > using a simple "return".
> > >
> > > Conceptually it seems so simple.
> > >
> > > Benjamin
> > >
> > >
> > > >
> > > > -- Hajime
> > > >
> > > > > I believe that this requires a second "userspace" sigaltstack in
> > > > > addition to the current "IRQ" sigaltstack. Then switching in between
> > > > > the two (note that the "userspace" one is also used for IRQs if those
> > > > > happen while userspace is executing).
> > > > >
> > > > > So, in principle I would think something like:
> > > > > * to jump into userspace, you would:
> > > > > - block all signals
> > > > > - set "userspace" sigaltstack
> > > > > - setup mcontext for rt_sigreturn
> > > > > - setup RSP for rt_sigreturn
> > > > > - call rt_sigreturn syscall
> > > > > * all signal handlers can (except pure IRQs):
> > > > > - check on which stack they are
> > > > > -> easy to detect whether we are in kernel mode
> > > > > - for IRQs one can probably handle them directly (and return)
> > > > > - in user mode:
> > > > > + store mcontext location and information needed for rt_sigreturn
> > > > > + jump back into kernel task stack
> > > > > * kernel task handler to continue would:
> > > > > - set sigaltstack to IRQ stack
> > > > > - fetch register from mcontext
> > > > > - unblock all signals
> > > > > - handle syscall/signal in whatever way needed
> > > > >
> > > > > Now that I wrote about it, I am thinking that it might be possible to
> > > > > just use the kernel task stack for the signal stack. One would probably
> > > > > need to increase the kernel stack size a bit, but it would also mean
> > > > > that no special code is needed for "rt_sigreturn" handling. The rest
> > > > > would remain the same.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > Benjamin
> > > > >
> > > > > > [SNIP]
> > > > >
> > > >
> > >
> >
>
* Re: [PATCH v10 09/13] x86/um: nommu: signal handling
2025-07-02 4:37 ` Hajime Tazaki
@ 2025-07-10 23:59 ` Hajime Tazaki
2025-07-11 9:39 ` Benjamin Berg
0 siblings, 1 reply; 25+ messages in thread
From: Hajime Tazaki @ 2025-07-10 23:59 UTC (permalink / raw)
To: benjamin; +Cc: linux-um, ricarkol, Liam.Howlett, linux-kernel
Hello Benjamin,
below is the updated version of this patch.
I didn't follow your suggestion to use the host handler to execute
userspace handlers. instead, the patch sets up the stack and the %rip
register so that userspace handlers are called at the end of the host
handler.
It would be great to hear your opinion.
---
arch/um/include/shared/kern_util.h | 4 +
arch/um/nommu/Makefile | 2 +-
arch/um/nommu/os-Linux/signal.c | 8 +
arch/um/nommu/trap.c | 201 ++++++++++++++++++++++++
arch/um/os-Linux/signal.c | 3 +-
arch/x86/um/nommu/do_syscall_64.c | 6 +
arch/x86/um/nommu/entry_64.S | 14 ++
arch/x86/um/nommu/os-Linux/mcontext.c | 5 +
arch/x86/um/shared/sysdep/mcontext.h | 1 +
arch/x86/um/shared/sysdep/ptrace.h | 2 +-
arch/x86/um/shared/sysdep/syscalls_64.h | 1 +
11 files changed, 244 insertions(+), 3 deletions(-)
create mode 100644 arch/um/nommu/trap.c
diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
index ec8ba1f13c58..7f55402b6385 100644
--- a/arch/um/include/shared/kern_util.h
+++ b/arch/um/include/shared/kern_util.h
@@ -73,4 +73,8 @@ void um_idle_sleep(void);
void kasan_map_memory(void *start, size_t len);
+#ifndef CONFIG_MMU
+extern void nommu_relay_signal(void *ptr);
+#endif
+
#endif
diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile
index baab7c2f57c2..096221590cfd 100644
--- a/arch/um/nommu/Makefile
+++ b/arch/um/nommu/Makefile
@@ -1,3 +1,3 @@
# SPDX-License-Identifier: GPL-2.0
-obj-y := os-Linux/
+obj-y := trap.o os-Linux/
diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-Linux/signal.c
index 19043b9652e2..27b6b37744b7 100644
--- a/arch/um/nommu/os-Linux/signal.c
+++ b/arch/um/nommu/os-Linux/signal.c
@@ -5,6 +5,7 @@
#include <os.h>
#include <sysdep/mcontext.h>
#include <sys/ucontext.h>
+#include <as-layout.h>
void sigsys_handler(int sig, struct siginfo *si,
struct uml_pt_regs *regs, void *ptr)
@@ -14,3 +15,10 @@ void sigsys_handler(int sig, struct siginfo *si,
/* hook syscall via SIGSYS */
set_mc_sigsys_hook(mc);
}
+
+void nommu_relay_signal(void *ptr)
+{
+ mcontext_t *mc = (mcontext_t *) ptr;
+
+ set_mc_return_address(mc);
+}
diff --git a/arch/um/nommu/trap.c b/arch/um/nommu/trap.c
new file mode 100644
index 000000000000..430297517455
--- /dev/null
+++ b/arch/um/nommu/trap.c
@@ -0,0 +1,201 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/hardirq.h>
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/sched/debug.h>
+#include <asm/current.h>
+#include <asm/tlbflush.h>
+#include <arch.h>
+#include <as-layout.h>
+#include <kern_util.h>
+#include <os.h>
+#include <skas.h>
+
+/*
+ * Note this is constrained to return 0, -EFAULT, -EACCES, -ENOMEM by
+ * segv().
+ */
+int handle_page_fault(unsigned long address, unsigned long ip,
+ int is_write, int is_user, int *code_out)
+{
+ /* !MMU has no pagefault */
+ return -EFAULT;
+}
+
+static void show_segv_info(struct uml_pt_regs *regs)
+{
+ struct task_struct *tsk = current;
+ struct faultinfo *fi = UPT_FAULTINFO(regs);
+
+ if (!unhandled_signal(tsk, SIGSEGV))
+ return;
+
+ pr_warn_ratelimited("%s%s[%d]: segfault at %lx ip %p sp %p error %x",
+ task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
+ tsk->comm, task_pid_nr(tsk), FAULT_ADDRESS(*fi),
+ (void *)UPT_IP(regs), (void *)UPT_SP(regs),
+ fi->error_code);
+}
+
+static void bad_segv(struct faultinfo fi, unsigned long ip)
+{
+ current->thread.arch.faultinfo = fi;
+ force_sig_fault(SIGSEGV, SEGV_ACCERR, (void __user *) FAULT_ADDRESS(fi));
+}
+
+void fatal_sigsegv(void)
+{
+ force_fatal_sig(SIGSEGV);
+ do_signal(&current->thread.regs);
+ /*
+ * This is to tell gcc that we're not returning - do_signal
+ * can, in general, return, but in this case, it's not, since
+ * we just got a fatal SIGSEGV queued.
+ */
+ os_dump_core();
+}
+
+/**
+ * segv_handler() - the SIGSEGV handler
+ * @sig: the signal number
+ * @unused_si: the signal info struct; unused in this handler
+ * @regs: the ptrace register information
+ *
+ * The handler first extracts the faultinfo from the UML ptrace regs struct.
+ * If the userfault did not happen in an UML userspace process, bad_segv is called.
+ * Otherwise the signal did happen in a cloned userspace process, handle it.
+ */
+void segv_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs,
+ void *mc)
+{
+ struct faultinfo *fi = UPT_FAULTINFO(regs);
+
+ /* !MMU specific part; detection of userspace */
+ /* mark is_user=1 when the IP is from userspace code. */
+ if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) < high_physmem)
+ regs->is_user = 1;
+
+ if (UPT_IS_USER(regs) && !SEGV_IS_FIXABLE(fi)) {
+ show_segv_info(regs);
+ bad_segv(*fi, UPT_IP(regs));
+ return;
+ }
+ segv(*fi, UPT_IP(regs), UPT_IS_USER(regs), regs, mc);
+
+ /* !MMU specific part; relay the signal to the userspace handler */
+ relay_signal(sig, unused_si, regs, mc);
+}
+
+/*
+ * We give a *copy* of the faultinfo in the regs to segv.
+ * This must be done, since nesting SEGVs could overwrite
+ * the info in the regs. A pointer to the info then would
+ * give us bad data!
+ */
+unsigned long segv(struct faultinfo fi, unsigned long ip, int is_user,
+ struct uml_pt_regs *regs, void *mc)
+{
+ int si_code;
+ int err;
+ int is_write = FAULT_WRITE(fi);
+ unsigned long address = FAULT_ADDRESS(fi);
+
+ if (!is_user && regs)
+ current->thread.segv_regs = container_of(regs, struct pt_regs, regs);
+
+ if (current->mm == NULL) {
+ show_regs(container_of(regs, struct pt_regs, regs));
+ panic("Segfault with no mm");
+ } else if (!is_user && address > PAGE_SIZE && address < TASK_SIZE) {
+ show_regs(container_of(regs, struct pt_regs, regs));
+ panic("Kernel tried to access user memory at addr 0x%lx, ip 0x%lx",
+ address, ip);
+ }
+
+ if (SEGV_IS_FIXABLE(&fi))
+ err = handle_page_fault(address, ip, is_write, is_user,
+ &si_code);
+ else {
+ err = -EFAULT;
+ /*
+ * A thread accessed NULL, we get a fault, but CR2 is invalid.
+ * This code is used in __do_copy_from_user() of TT mode.
+ * XXX tt mode is gone, so maybe this isn't needed any more
+ */
+ address = 0;
+ }
+
+ if (!err)
+ goto out;
+ else if (!is_user && arch_fixup(ip, regs))
+ goto out;
+
+ if (!is_user) {
+ show_regs(container_of(regs, struct pt_regs, regs));
+ panic("Kernel mode fault at addr 0x%lx, ip 0x%lx",
+ address, ip);
+ }
+
+ show_segv_info(regs);
+
+ if (err == -EACCES) {
+ current->thread.arch.faultinfo = fi;
+ force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
+ } else {
+ WARN_ON_ONCE(err != -EFAULT);
+ current->thread.arch.faultinfo = fi;
+ force_sig_fault(SIGSEGV, si_code, (void __user *) address);
+ }
+
+out:
+ if (regs)
+ current->thread.segv_regs = NULL;
+
+ return 0;
+}
+
+void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs *regs,
+ void *mc)
+{
+ int code, err;
+
+ /* !MMU specific part; detection of userspace */
+ /* mark is_user=1 when the IP is from userspace code. */
+ if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) < high_physmem)
+ regs->is_user = 1;
+
+ if (!UPT_IS_USER(regs)) {
+ if (sig == SIGBUS)
+ pr_err("Bus error - the host /dev/shm or /tmp mount likely just ran out of space\n");
+ panic("Kernel mode signal %d", sig);
+ }
+ /* if is_user==1, set return to userspace sig handler to relay signal */
+ nommu_relay_signal(mc);
+
+ arch_examine_signal(sig, regs);
+
+ /* Is the signal layout for the signal known?
+ * Signal data must be scrubbed to prevent information leaks.
+ */
+ code = si->si_code;
+ err = si->si_errno;
+ if ((err == 0) && (siginfo_layout(sig, code) == SIL_FAULT)) {
+ struct faultinfo *fi = UPT_FAULTINFO(regs);
+
+ current->thread.arch.faultinfo = *fi;
+ force_sig_fault(sig, code, (void __user *)FAULT_ADDRESS(*fi));
+ } else {
+ pr_err("Attempted to relay unknown signal %d (si_code = %d) with errno %d\n",
+ sig, code, err);
+ force_sig(sig);
+ }
+}
+
+void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs,
+ void *mc)
+{
+ do_IRQ(WINCH_IRQ, regs);
+}
diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
index 53e276e81b37..67dcd88b45b1 100644
--- a/arch/um/os-Linux/signal.c
+++ b/arch/um/os-Linux/signal.c
@@ -40,9 +40,10 @@ static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc)
int save_errno = errno;
r.is_user = 0;
+ if (mc)
+ get_regs_from_mc(&r, mc);
if (sig == SIGSEGV) {
/* For segfaults, we want the data from the sigcontext. */
- get_regs_from_mc(&r, mc);
GET_FAULTINFO_FROM_MC(r.faultinfo, mc);
}
diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c
index 74d5bcc4508d..d77e69e097c1 100644
--- a/arch/x86/um/nommu/do_syscall_64.c
+++ b/arch/x86/um/nommu/do_syscall_64.c
@@ -44,6 +44,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
/* set fs register to the original host one */
os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
+ /* save fp registers */
+ asm volatile("fxsaveq %0" : "=m"(*(struct _xstate *)regs->regs.fp));
+
if (likely(syscall < NR_syscalls)) {
PT_REGS_SET_SYSCALL_RETURN(regs,
EXECUTE_SYSCALL(syscall, regs));
@@ -54,6 +57,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
/* handle tasks and signals at the end */
interrupt_end();
+ /* restore fp registers */
+ asm volatile("fxrstorq %0" : : "m"((current->thread.regs.regs.fp)));
+
/* restore back fs register to userspace configured one */
os_x86_arch_prctl(0, ARCH_SET_FS,
(void *)(current->thread.regs.regs.gp[FS_BASE
diff --git a/arch/x86/um/nommu/entry_64.S b/arch/x86/um/nommu/entry_64.S
index 950447dfa66b..e038bc7b53ac 100644
--- a/arch/x86/um/nommu/entry_64.S
+++ b/arch/x86/um/nommu/entry_64.S
@@ -111,3 +111,17 @@ ENTRY(userspace)
jmp *%r11
END(userspace)
+
+/*
+ * this routine prepares the stack to return via host-generated
+ * signals (e.g., SEGV, FPE) via do_signal() from interrupt_end().
+ */
+ENTRY(__prep_sigreturn)
+ /*
+ * Switch to current top of stack, so "current->" points
+ * to the right task.
+ */
+ movq current_top_of_stack, %rsp
+
+ jmp userspace
+END(__prep_sigreturn)
diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c
index c4ef877d5ea0..87fb2a35e7ff 100644
--- a/arch/x86/um/nommu/os-Linux/mcontext.c
+++ b/arch/x86/um/nommu/os-Linux/mcontext.c
@@ -6,6 +6,11 @@
#include <sysdep/mcontext.h>
#include <sysdep/syscalls.h>
+void set_mc_return_address(mcontext_t *mc)
+{
+ mc->gregs[REG_RIP] = (unsigned long) __prep_sigreturn;
+}
+
void set_mc_sigsys_hook(mcontext_t *mc)
{
mc->gregs[REG_RCX] = mc->gregs[REG_RIP];
diff --git a/arch/x86/um/shared/sysdep/mcontext.h b/arch/x86/um/shared/sysdep/mcontext.h
index 9a0d6087f357..de4041b758f3 100644
--- a/arch/x86/um/shared/sysdep/mcontext.h
+++ b/arch/x86/um/shared/sysdep/mcontext.h
@@ -19,6 +19,7 @@ extern int set_stub_state(struct uml_pt_regs *regs, struct stub_data *data,
#ifndef CONFIG_MMU
extern void set_mc_sigsys_hook(mcontext_t *mc);
+extern void set_mc_return_address(mcontext_t *mc);
#endif
#ifdef __i386__
diff --git a/arch/x86/um/shared/sysdep/ptrace.h b/arch/x86/um/shared/sysdep/ptrace.h
index 8f7476ff6e95..7d553d9f05be 100644
--- a/arch/x86/um/shared/sysdep/ptrace.h
+++ b/arch/x86/um/shared/sysdep/ptrace.h
@@ -65,7 +65,7 @@ struct uml_pt_regs {
int is_user;
/* Dynamically sized FP registers (holds an XSTATE) */
- unsigned long fp[];
+ unsigned long fp[] __attribute__((aligned(16)));
};
#define EMPTY_UML_PT_REGS { }
diff --git a/arch/x86/um/shared/sysdep/syscalls_64.h b/arch/x86/um/shared/sysdep/syscalls_64.h
index ffd80ee3b9dc..bd152422cdfb 100644
--- a/arch/x86/um/shared/sysdep/syscalls_64.h
+++ b/arch/x86/um/shared/sysdep/syscalls_64.h
@@ -29,6 +29,7 @@ extern syscall_handler_t sys_arch_prctl;
extern void do_syscall_64(struct pt_regs *regs);
extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
int64_t a4, int64_t a5, int64_t a6);
+extern void __prep_sigreturn(void);
#endif
#endif
--
2.43.0
thanks, and have a nice day,
-- Hajime
On Wed, 02 Jul 2025 13:37:50 +0900,
Hajime Tazaki wrote:
>
>
> Hello Benjamin,
>
> On Tue, 01 Jul 2025 21:03:36 +0900,
> Benjamin Berg wrote:
> >
> > Hi Hajim,
> >
> > On Mon, 2025-06-30 at 10:04 +0900, Hajime Tazaki wrote:
> > >
> > > Hello Benjamin,
> > >
> > > On Sat, 28 Jun 2025 00:02:05 +0900,
> > > Benjamin Berg wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Fri, 2025-06-27 at 22:50 +0900, Hajime Tazaki wrote:
> > > > > thanks for the comment on the complicated part of the kernel (signal).
> > > >
> > > > This stuff isn't simple.
> > > >
> > > > Actually, I am starting to think that the current MMU UML kernel also
> > > > needs a redesign with regard to signal handling and stack use in that
> > > > case. My current impression is that the design right now only permits
> > > > voluntarily scheduling. More specifically, scheduling in response to an
> > > > interrupt is impossible.
> > > >
> > > > I suppose that works fine, but it also does not seem quite right.
> > >
> > > thanks for the info. it's very useful to understand what's going on.
> > >
> > > (snip)
> > >
> > > > > > > +void set_mc_userspace_relay_signal(mcontext_t *mc)
> > > > > > > +{
> > > > > > > + mc->gregs[REG_RIP] = (unsigned long) __userspace_relay_signal;
> > > > > > > +}
> > > > > > > +
> > > > >
> > > > > This is a bit scary code which I tried to handle when SIGSEGV is
> > > > > raised by host for a userspace program running on UML (nommu).
> > > > >
> > > > > # and I should remember my XXX tag is important to fix....
> > > > >
> > > > > let me try to explain what happens and what I tried to solve.
> > > > >
> > > > > The SEGV signal from userspace program is delivered to userspace but
> > > > > if we don't fix the code raising the signal, after (um) rt_sigreturn,
> > > > > it will restart from $rip and raise SIGSEGV again.
> > > > >
> > > > > # so, yes, we've already relied on host and um's rt_sigreturn to
> > > > > restore various things.
> > > > >
> > > > > when a uml userspace crashes with SIGSEGV,
> > > > >
> > > > > - host kernel raises SIGSEGV (at original $rip)
> > > > > - caught by uml process (hard_handler)
> > > > > - raise a signal to uml userspace process (segv_handler)
> > > > > - handler ends (hard_handler)
> > > > > - (host) run restorer (rt_sigreturn, registered by (libc)sigaction,
> > > > > not (host) rt_sigaction)
> > > > > - return back to the original $rip
> > > > > - (back to top)
> > > > >
> > > > > > this is the case where the endless loop happens.
> > > > > > um's sa_handler isn't called as rt_sigreturn (um) isn't called.
> > > > > > __userspace_relay_signal was my original attempt to solve this.
> > > > >
> > > > > I agree that it is lazy to call a dummy syscall (indeed, getpid).
> > > > > I'm trying to introduce another routine to jump into userspace and
> > > > > call (um) rt_sigreturn after (host) rt_sigreturn.
> > > > >
> > > > > > And this is really confusing me. The way I am reading it, the code
> > > > > > tries to do:
> > > > > > 1. Rewrite RIP to jump to __userspace_relay_signal
> > > > > > 2. Trigger a getpid syscall (to do "nothing"?)
> > > > > > 3. Let do_syscall_64 fire the signal from interrupt_end
> > > > >
> > > > > correct.
> > > > >
> > > > > > However, then that really confuses me, because:
> > > > > > * If I am reading it correctly, then this approach will destroy the
> > > > > > contents of various registers (RIP, RAX and likely more)
> > > > > > * This would result in an incorrect mcontext in the userspace signal
> > > > > > handler (which could be relevant if userspace is inspecting it)
> > > > > > * However, worst, rt_sigreturn will eventually jump back
> > > > > > > into __userspace_relay_signal, which has nothing to return to.
> > > > > > * Also, relay_signal doesn't use this? What happens for a SIGFPE, how
> > > > > > is userspace interrupted immediately in that case?
> > > > >
> > > > > relay_signal shares the same goal of this, indeed.
> > > > > but the issue with `mc->gregs[REG_RIP]` (endless signals) still exists
> > > > > I guess.
> > > >
> > > > Well, endless signals only exist as long as you exit to the same
> > > > location. My suggestion was to read the user state from the mcontext
> > > > (as SECCOMP mode does it) and executing the signal right away, i.e.:
> > >
> > > thanks too; below is my understanding.
> > >
> > > > * Fetch the current registers from the mcontext
> > >
> > > I guess this is already done in sig_handler_common().
> >
> > Well, not really?
> >
> > It does seem to fetch the general purpose registers. But the code
> > pretty much assumes we will return to the same location and only stores
> > them on the stack for the signal handler itself. Also, remember that it
> > might be userspace or kernel space in your case. The kernel task
> > registers are in "switch_buf" while the userspace registers are in
> > "regs" of "struct task_struct" (effectively "struct uml_pt_regs").
>
> indeed, the handler returns to the same location.
> here is what the current patchset does for the signal handling.
>
> # sorry i might be writing same things several times, but I hope
> this will help to understand/discuss what it should be.
>
> receive signal (from host)
> - > call host sa_handler (hard_handler)
> - > sig_handler_common => get_regs_from_mc (fetch host mcontext to um)
> - > set TIF_SIGPENDING (um kernel)
> - > set host mcontext[RIP] to __userspace_relay_signal
> (host sa_handler ends)
> - call host sa_restorer => return to mcontext[RIP]
> - > call __userspace_relay_signal from mcontext[RIP]
> - > call interrupt_end()
> - > do_signal => handle_signal => setup_signal_stack_si
> (because TIF_SIGPENDING is on above)
> - > call userspace sa_handler
> - > call userspace sa_restorer
>
> instead of set mcontext[RIP] to userspace sa_handler, it uses
> __userspace_relay_signal, which configures stack and mcontext (via
> interrupt_end, setup_signal_stack_si, etc) and call userspace
> sa_handler/restorer after that.
>
> in this way, programs run userspace sa_handler not in the host
> sa_handler context. I guess this means we don't have to configure
> host register/mcontext with the userspace one ?
>
> I agree that the current __userspace_relay_signal can be shrunk not
> to call __kernel_vsyscall and focus on interrupt_end and stack
> preparation.
>
> > > > * Push the signal context onto the userspace stack
> > >
> > > (guess) this is already done on handle_signal() => setup_signal_stack_si().
> > >
> > > > * Modify the host mcontext to set registers for the signal handler
> > >
> > > this is something which I'm not well understanding.
> > > - do you mean the host handler when you say "for the signal handler" ?
> > > or the userspace handler ?
> >
> > Both in a way ;-)
> >
> > I mean modify the registers in the host mcontext so that the UML
> > userspace will continue executing inside its signal handler.
> >
> > > - if former (the host one), maybe mcontext is already there so, it
> > > might not be the one you mentioned.
> > > - if the latter, how the original handler (the host one,
> > > hard_handler()) works ? even if we can call userspace handler
> > > instead of the host one, we need to call the host handler (and
> > > restorer). do we call both ?
> > > - and by "to set registers", what register do you mean ? for the
> > > registers inspected by userspace signal handler ? but if you set a
> > > register, for instance RIP, as the fault location to the host
> > > register, it will return to RIP after handler and restart the fault
> > > again ?
> >
> > I am confused, why would the fault handler be restarted? If you modify
> > RIP, then the host kernel will not return to the faulting location. You
> > were using that already to jump into __userspace_relay_signal. All I am
> > arguing that instead of jumping to __userspace_relay_signal you can
> > prepare everything and directly jump into the users signal handler.
>
> what I meant in that example is; set host mcontext[RIP] to the fault
> location, as a userspace information, which will lead to the fault
> again. But this doesn't change RIP before and after so, I guess this
> isn't a good example..
> Sorry for the confusion.
>
> > > > * Jump back to userspace by doing a "return"
> > >
> > > this is still also unclear to me.
> > >
> > > it would be very helpful if you could point me to the code (in the
> > > uml/next tree) showing how SECCOMP mode does this. I'm also looking
> > > at it, but it is really hard to map what you described to the code (sorry).
> >
> > "stub_signal_interrupt" simply returns, which means it jumps into the
> > restorer "stub_signal_restorer" which does the rt_sigreturn syscall.
> > This means the host kernel restores the userspace state from the
> > mcontext. As the mcontext resides in shared memory, the UML kernel can
> > update it to specify where the process should continue running (thread
> > switching, signals, syscall return value, …).
>
> thanks !
>
> so, stub_signal_interrupt runs on a different host process.
> nommu mode tries to reuse existing host sa_handler (hard_handler) to
> do the job (handle SEGV etc).
>
> If something is missing in hard_handler and co in nommu mode
> compared to what userspace_tramp does in seccomp mode, I've been trying to
> update it.
>
> -- Hajime
>
> >
> > Benjamin
> >
> > > all of above runs within hard_handler() in nommu mode on SIGSEGV.
> > > my best guess is this is different from what ptrace/seccomp do.
> > >
> > > > Said differently, I really prefer deferring as much logic as possible
> > > > to the host. This is both safer and easier to understand. Plus, it also
> > > > has the advantage of making it simpler to port UML to other
> > > > architectures.
> > >
> > > okay.
> > >
> > > >
> > > > > > Honestly, I really think we should take a step back and swap the
> > > > > > current syscall entry/exit code. That would likely also simplify
> > > > > > floating point register handling, which I think is currently
> > > > > > > insufficient to deal with the odd special cases caused by different
> > > > > > x86_64 hardware extensions.
> > > > > >
> > > > > > Basically, I think nommu mode should use the same general approach as
> > > > > > the current SECCOMP mode. Which is to use rt_sigreturn to jump into
> > > > > > userspace and let the host kernel deal with the ugly details of how to
> > > > > > do that.
> > > > >
> > > > > I looked at how MMU mode (ptrace/seccomp) does handle this case.
> > > > >
> > > > > In nommu mode, we don't have external process to catch signals so, the
> > > > > nommu mode uses hard_handler() to catch SEGV/FPE of userspace
> > > > > programs. While mmu mode calls segv_handler not in a context of
> > > > > signal handler.
> > > > >
> > > > > # correct me if I'm wrong.
> > > > >
> > > > > thus, mmu mode doesn't have this situation.
> > > >
> > > > Yes, it does not have this specific issue. But see the top of the mail
> > > > for other issues that are somewhat related.
> > > >
> > > > > I'm attempting various ways; calling um's rt_sigreturn instead of
> > > > > host's one, which doesn't work as host restore procedures (unblocking
> > > > > masked signals, restoring register states, etc) aren't called.
> > > > >
> > > > > I'll update here if I found a good direction, but would be great if
> > > > > you see how it should be handled.
> > > >
> > > > Can we please discuss possible solutions? We can figure out the details
> > > > once it is clear how the interaction with the host should work.
> > >
> > > I was wishing to update to you that I'm working on it. So, your
> > > comments are always helpful to me. Thanks.
> > >
> > > -- Hajime
> > >
> > > > I still think that the idea of using the kernel task stack as the
> > > > signal stack is really elegant. Actually, doing that in normal UML may
> > > > be how we can fix the issues mentioned at the top of my mail. And for
> > > > nommu, we can also use the host mcontext to jump back into userspace
> > > > using a simple "return".
> > > >
> > > > Conceptually it seems so simple.
> > > >
> > > > Benjamin
> > > >
> > > >
> > > > >
> > > > > -- Hajime
> > > > >
> > > > > > I believe that this requires a second "userspace" sigaltstack in
> > > > > > addition to the current "IRQ" sigaltstack. Then switching in between
> > > > > > the two (note that the "userspace" one is also used for IRQs if those
> > > > > > happen while userspace is executing).
> > > > > >
> > > > > > So, in principle I would think something like:
> > > > > > * to jump into userspace, you would:
> > > > > > - block all signals
> > > > > > - set "userspace" sigaltstack
> > > > > > - setup mcontext for rt_sigreturn
> > > > > > - setup RSP for rt_sigreturn
> > > > > > - call rt_sigreturn syscall
> > > > > > * all signal handlers can (except pure IRQs):
> > > > > > - check on which stack they are
> > > > > > -> easy to detect whether we are in kernel mode
> > > > > > - for IRQs one can probably handle them directly (and return)
> > > > > > - in user mode:
> > > > > > + store mcontext location and information needed for rt_sigreturn
> > > > > > + jump back into kernel task stack
> > > > > > * kernel task handler to continue would:
> > > > > > - set sigaltstack to IRQ stack
> > > > > > - fetch register from mcontext
> > > > > > - unblock all signals
> > > > > > - handle syscall/signal in whatever way needed
> > > > > >
> > > > > > Now that I wrote about it, I am thinking that it might be possible to
> > > > > > just use the kernel task stack for the signal stack. One would probably
> > > > > > need to increase the kernel stack size a bit, but it would also mean
> > > > > > that no special code is needed for "rt_sigreturn" handling. The rest
> > > > > > would remain the same.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > Benjamin
> > > > > >
> > > > > > > [SNIP]
> > > > > >
> > > > >
> > > >
> > >
> >
* Re: [PATCH v10 09/13] x86/um: nommu: signal handling
2025-07-10 23:59 ` Hajime Tazaki
@ 2025-07-11 9:39 ` Benjamin Berg
2025-07-11 10:05 ` Benjamin Berg
0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Berg @ 2025-07-11 9:39 UTC (permalink / raw)
To: Hajime Tazaki; +Cc: linux-um, ricarkol, Liam.Howlett, linux-kernel
Hi Hajime,
On Fri, 2025-07-11 at 08:59 +0900, Hajime Tazaki wrote:
> below is the updated patch of this patch.
> I didn't follow your suggestion to use host handler to execute
> userspace handlers. instead, setup stack and %rip register to call
> userspace handlers at the end of the host handler.
>
> It would be great to hear your opinion.
Maybe I am missing something, but I do not see how this solves the
problem. The issue I raised was the correctness of the userspace
registers. Before, the syscall hook would fetch them (albeit they were
corrupted at that point). Now, I don't see any code to do this. Which
would mean the generated mcontext that userspace sees is based on some
really old register state.
Honestly, I think we need a test case to be able to move forward. The
test needs to trigger an exception (FPE, segfault, whatever) and then
handle the signal. In the signal handler, verify the register state in
the mcontext is expected (RIP, RSP, FP regs), then update it to not
raise an exception again and return. The test should obviously exit
cleanly afterwards.
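A minimal sketch of such a test for the segfault case, assuming an x86_64
Linux userspace with a glibc-style ucontext_t. The function name
run_mcontext_test() and the two asm labels are hypothetical, chosen only for
illustration. FP state is checked only through MXCSR here; a real test would
probably want to cover more (xmm contents, SIGFPE).

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <ucontext.h>

#if defined(__x86_64__) && defined(__linux__)
#ifndef REG_RIP		/* _GNU_SOURCE was not in effect for <signal.h> */
#define REG_RSP 15	/* x86_64 gregs[] indices */
#define REG_RIP 16
#endif

/* Hypothetical labels, defined in the inline asm below. */
extern char mc_test_fault_insn[], mc_test_resume_insn[];

static unsigned long expected_rsp;	/* RSP just before the fault */
static unsigned int expected_mxcsr;	/* MXCSR just before the fault */
static volatile sig_atomic_t handler_ran, rip_ok, rsp_ok, fp_ok;

static void on_segv(int sig, siginfo_t *si, void *ptr)
{
	ucontext_t *uc = ptr;
	greg_t *gregs = uc->uc_mcontext.gregs;

	(void)sig;
	(void)si;
	handler_ran = 1;
	/* The saved state must describe the faulting instruction. */
	rip_ok = (unsigned long)gregs[REG_RIP] ==
		 (unsigned long)mc_test_fault_insn;
	rsp_ok = (unsigned long)gregs[REG_RSP] == expected_rsp;
	fp_ok = uc->uc_mcontext.fpregs &&
		uc->uc_mcontext.fpregs->mxcsr == expected_mxcsr;
	/* Resume after the faulting store so the signal is not raised again. */
	gregs[REG_RIP] = (greg_t)(unsigned long)mc_test_resume_insn;
}

static void __attribute__((noinline)) trigger_fault(void)
{
	__asm__ volatile(
		"movq %%rsp, %0\n\t"		/* record RSP */
		"stmxcsr %1\n\t"		/* record MXCSR */
		".globl mc_test_fault_insn\n"
		"mc_test_fault_insn:\n\t"
		"movq $0, (%2)\n\t"		/* store to NULL: SIGSEGV */
		".globl mc_test_resume_insn\n"
		"mc_test_resume_insn:\n\t"
		: "=m"(expected_rsp), "=m"(expected_mxcsr)
		: "r"(0UL)
		: "memory");
}

/* Returns 0 when the handler ran and saw the expected register state. */
int run_mcontext_test(void)
{
	struct sigaction sa, old;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, &old))
		return -1;

	trigger_fault();

	sigaction(SIGSEGV, &old, NULL);
	return (handler_ran && rip_ok && rsp_ok && fp_ok) ? 0 : 1;
}
#else
/* Not x86_64 Linux: the register names above do not apply, so skip. */
int run_mcontext_test(void)
{
	return 0;
}
#endif
```

Calling run_mcontext_test() from main() and using its return value as the
exit status would give the "exit cleanly" property: a stale or corrupted
mcontext shows up as a non-zero exit, and a handler that cannot redirect RIP
shows up as an endless SIGSEGV loop.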
That said, I would also still like to see a higher level discussion on
how userspace registers are saved and restored. We have two separate
cases--interrupts/exceptions (host signals) and the syscall path--and
both need to be well defined. My hope is still that both of these can
use the same register save/restore mechanism.
Benjamin
>
> ---
> arch/um/include/shared/kern_util.h | 4 +
> arch/um/nommu/Makefile | 2 +-
> arch/um/nommu/os-Linux/signal.c | 8 +
> arch/um/nommu/trap.c | 201 ++++++++++++++++++++++++
> arch/um/os-Linux/signal.c | 3 +-
> arch/x86/um/nommu/do_syscall_64.c | 6 +
> arch/x86/um/nommu/entry_64.S | 14 ++
> arch/x86/um/nommu/os-Linux/mcontext.c | 5 +
> arch/x86/um/shared/sysdep/mcontext.h | 1 +
> arch/x86/um/shared/sysdep/ptrace.h | 2 +-
> arch/x86/um/shared/sysdep/syscalls_64.h | 1 +
> 11 files changed, 244 insertions(+), 3 deletions(-)
> create mode 100644 arch/um/nommu/trap.c
>
> diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
> index ec8ba1f13c58..7f55402b6385 100644
> --- a/arch/um/include/shared/kern_util.h
> +++ b/arch/um/include/shared/kern_util.h
> @@ -73,4 +73,8 @@ void um_idle_sleep(void);
>
> void kasan_map_memory(void *start, size_t len);
>
> +#ifndef CONFIG_MMU
> +extern void nommu_relay_signal(void *ptr);
> +#endif
> +
> #endif
> diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile
> index baab7c2f57c2..096221590cfd 100644
> --- a/arch/um/nommu/Makefile
> +++ b/arch/um/nommu/Makefile
> @@ -1,3 +1,3 @@
> # SPDX-License-Identifier: GPL-2.0
>
> -obj-y := os-Linux/
> +obj-y := trap.o os-Linux/
> diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-Linux/signal.c
> index 19043b9652e2..27b6b37744b7 100644
> --- a/arch/um/nommu/os-Linux/signal.c
> +++ b/arch/um/nommu/os-Linux/signal.c
> @@ -5,6 +5,7 @@
> #include <os.h>
> #include <sysdep/mcontext.h>
> #include <sys/ucontext.h>
> +#include <as-layout.h>
>
> void sigsys_handler(int sig, struct siginfo *si,
> struct uml_pt_regs *regs, void *ptr)
> @@ -14,3 +15,10 @@ void sigsys_handler(int sig, struct siginfo *si,
> /* hook syscall via SIGSYS */
> set_mc_sigsys_hook(mc);
> }
> +
> +void nommu_relay_signal(void *ptr)
> +{
> + mcontext_t *mc = (mcontext_t *) ptr;
> +
> + set_mc_return_address(mc);
> +}
> diff --git a/arch/um/nommu/trap.c b/arch/um/nommu/trap.c
> new file mode 100644
> index 000000000000..430297517455
> --- /dev/null
> +++ b/arch/um/nommu/trap.c
> @@ -0,0 +1,201 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/mm.h>
> +#include <linux/sched/signal.h>
> +#include <linux/hardirq.h>
> +#include <linux/module.h>
> +#include <linux/uaccess.h>
> +#include <linux/sched/debug.h>
> +#include <asm/current.h>
> +#include <asm/tlbflush.h>
> +#include <arch.h>
> +#include <as-layout.h>
> +#include <kern_util.h>
> +#include <os.h>
> +#include <skas.h>
> +
> +/*
> + * Note this is constrained to return 0, -EFAULT, -EACCES, -ENOMEM by
> + * segv().
> + */
> +int handle_page_fault(unsigned long address, unsigned long ip,
> + int is_write, int is_user, int *code_out)
> +{
> + /* !MMU has no pagefault */
> + return -EFAULT;
> +}
> +
> +static void show_segv_info(struct uml_pt_regs *regs)
> +{
> + struct task_struct *tsk = current;
> + struct faultinfo *fi = UPT_FAULTINFO(regs);
> +
> + if (!unhandled_signal(tsk, SIGSEGV))
> + return;
> +
> + pr_warn_ratelimited("%s%s[%d]: segfault at %lx ip %p sp %p error %x",
> + task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
> + tsk->comm, task_pid_nr(tsk), FAULT_ADDRESS(*fi),
> + (void *)UPT_IP(regs), (void *)UPT_SP(regs),
> + fi->error_code);
> +}
> +
> +static void bad_segv(struct faultinfo fi, unsigned long ip)
> +{
> + current->thread.arch.faultinfo = fi;
> + force_sig_fault(SIGSEGV, SEGV_ACCERR, (void __user *) FAULT_ADDRESS(fi));
> +}
> +
> +void fatal_sigsegv(void)
> +{
> + force_fatal_sig(SIGSEGV);
> + do_signal(&current->thread.regs);
> + /*
> + * This is to tell gcc that we're not returning - do_signal
> + * can, in general, return, but in this case, it's not, since
> + * we just got a fatal SIGSEGV queued.
> + */
> + os_dump_core();
> +}
> +
> +/**
> + * segv_handler() - the SIGSEGV handler
> + * @sig: the signal number
> + * @unused_si: the signal info struct; unused in this handler
> + * @regs: the ptrace register information
> + *
> + * The handler first extracts the faultinfo from the UML ptrace regs struct.
> + * If the userfault did not happen in an UML userspace process, bad_segv is called.
> + * Otherwise the signal did happen in a cloned userspace process, handle it.
> + */
> +void segv_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs,
> + void *mc)
> +{
> + struct faultinfo *fi = UPT_FAULTINFO(regs);
> +
> + /* !MMU specific part; detection of userspace */
> + /* mark is_user=1 when the IP is from userspace code. */
> + if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) < high_physmem)
> + regs->is_user = 1;
> +
> + if (UPT_IS_USER(regs) && !SEGV_IS_FIXABLE(fi)) {
> + show_segv_info(regs);
> + bad_segv(*fi, UPT_IP(regs));
> + return;
> + }
> + segv(*fi, UPT_IP(regs), UPT_IS_USER(regs), regs, mc);
> +
> + /* !MMU specific part; detection of userspace */
> + relay_signal(sig, unused_si, regs, mc);
> +}
> +
> +/*
> + * We give a *copy* of the faultinfo in the regs to segv.
> + * This must be done, since nesting SEGVs could overwrite
> + * the info in the regs. A pointer to the info then would
> + * give us bad data!
> + */
> +unsigned long segv(struct faultinfo fi, unsigned long ip, int is_user,
> + struct uml_pt_regs *regs, void *mc)
> +{
> + int si_code;
> + int err;
> + int is_write = FAULT_WRITE(fi);
> + unsigned long address = FAULT_ADDRESS(fi);
> +
> + if (!is_user && regs)
> + current->thread.segv_regs = container_of(regs, struct pt_regs, regs);
> +
> + if (current->mm == NULL) {
> + show_regs(container_of(regs, struct pt_regs, regs));
> + panic("Segfault with no mm");
> + } else if (!is_user && address > PAGE_SIZE && address < TASK_SIZE) {
> + show_regs(container_of(regs, struct pt_regs, regs));
> + panic("Kernel tried to access user memory at addr 0x%lx, ip 0x%lx",
> + address, ip);
> + }
> +
> + if (SEGV_IS_FIXABLE(&fi))
> + err = handle_page_fault(address, ip, is_write, is_user,
> + &si_code);
> + else {
> + err = -EFAULT;
> + /*
> + * A thread accessed NULL, we get a fault, but CR2 is invalid.
> + * This code is used in __do_copy_from_user() of TT mode.
> + * XXX tt mode is gone, so maybe this isn't needed any more
> + */
> + address = 0;
> + }
> +
> + if (!err)
> + goto out;
> + else if (!is_user && arch_fixup(ip, regs))
> + goto out;
> +
> + if (!is_user) {
> + show_regs(container_of(regs, struct pt_regs, regs));
> + panic("Kernel mode fault at addr 0x%lx, ip 0x%lx",
> + address, ip);
> + }
> +
> + show_segv_info(regs);
> +
> + if (err == -EACCES) {
> + current->thread.arch.faultinfo = fi;
> + force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
> + } else {
> + WARN_ON_ONCE(err != -EFAULT);
> + current->thread.arch.faultinfo = fi;
> + force_sig_fault(SIGSEGV, si_code, (void __user *) address);
> + }
> +
> +out:
> + if (regs)
> + current->thread.segv_regs = NULL;
> +
> + return 0;
> +}
> +
> +void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs *regs,
> + void *mc)
> +{
> + int code, err;
> +
> + /* !MMU specific part; detection of userspace */
> + /* mark is_user=1 when the IP is from userspace code. */
> + if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) < high_physmem)
> + regs->is_user = 1;
> +
> + if (!UPT_IS_USER(regs)) {
> + if (sig == SIGBUS)
> + pr_err("Bus error - the host /dev/shm or /tmp mount likely just ran out of space\n");
> + panic("Kernel mode signal %d", sig);
> + }
> + /* if is_user==1, set return to userspace sig handler to relay signal */
> + nommu_relay_signal(mc);
> +
> + arch_examine_signal(sig, regs);
> +
> + /* Is the signal layout for the signal known?
> + * Signal data must be scrubbed to prevent information leaks.
> + */
> + code = si->si_code;
> + err = si->si_errno;
> + if ((err == 0) && (siginfo_layout(sig, code) == SIL_FAULT)) {
> + struct faultinfo *fi = UPT_FAULTINFO(regs);
> +
> + current->thread.arch.faultinfo = *fi;
> + force_sig_fault(sig, code, (void __user *)FAULT_ADDRESS(*fi));
> + } else {
> + pr_err("Attempted to relay unknown signal %d (si_code = %d) with errno %d\n",
> + sig, code, err);
> + force_sig(sig);
> + }
> +}
> +
> +void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs,
> + void *mc)
> +{
> + do_IRQ(WINCH_IRQ, regs);
> +}
> diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
> index 53e276e81b37..67dcd88b45b1 100644
> --- a/arch/um/os-Linux/signal.c
> +++ b/arch/um/os-Linux/signal.c
> @@ -40,9 +40,10 @@ static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc)
> int save_errno = errno;
>
> r.is_user = 0;
> + if (mc)
> + get_regs_from_mc(&r, mc);
> if (sig == SIGSEGV) {
> /* For segfaults, we want the data from the sigcontext. */
> - get_regs_from_mc(&r, mc);
> GET_FAULTINFO_FROM_MC(r.faultinfo, mc);
> }
>
> diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c
> index 74d5bcc4508d..d77e69e097c1 100644
> --- a/arch/x86/um/nommu/do_syscall_64.c
> +++ b/arch/x86/um/nommu/do_syscall_64.c
> @@ -44,6 +44,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
> /* set fs register to the original host one */
> os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
>
> + /* save fp registers */
> + asm volatile("fxsaveq %0" : "=m"(*(struct _xstate *)regs->regs.fp));
> +
> if (likely(syscall < NR_syscalls)) {
> PT_REGS_SET_SYSCALL_RETURN(regs,
> EXECUTE_SYSCALL(syscall, regs));
> @@ -54,6 +57,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
> /* handle tasks and signals at the end */
> interrupt_end();
>
> + /* restore fp registers */
> + asm volatile("fxrstorq %0" : : "m"((current->thread.regs.regs.fp)));
> +
> /* restore back fs register to userspace configured one */
> os_x86_arch_prctl(0, ARCH_SET_FS,
> (void *)(current->thread.regs.regs.gp[FS_BASE
> diff --git a/arch/x86/um/nommu/entry_64.S b/arch/x86/um/nommu/entry_64.S
> index 950447dfa66b..e038bc7b53ac 100644
> --- a/arch/x86/um/nommu/entry_64.S
> +++ b/arch/x86/um/nommu/entry_64.S
> @@ -111,3 +111,17 @@ ENTRY(userspace)
> jmp *%r11
>
> END(userspace)
> +
> +/*
> + * this routine prepares the stack to return via host-generated
> + * signals (e.g., SEGV, FPE) via do_signal() from interrupt_end().
> + */
> +ENTRY(__prep_sigreturn)
> + /*
> + * Switch to current top of stack, so "current->" points
> + * to the right task.
> + */
> + movq current_top_of_stack, %rsp
> +
> + jmp userspace
> +END(__prep_sigreturn)
> diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c
> index c4ef877d5ea0..87fb2a35e7ff 100644
> --- a/arch/x86/um/nommu/os-Linux/mcontext.c
> +++ b/arch/x86/um/nommu/os-Linux/mcontext.c
> @@ -6,6 +6,11 @@
> #include <sysdep/mcontext.h>
> #include <sysdep/syscalls.h>
>
> +void set_mc_return_address(mcontext_t *mc)
> +{
> + mc->gregs[REG_RIP] = (unsigned long) __prep_sigreturn;
> +}
> +
> void set_mc_sigsys_hook(mcontext_t *mc)
> {
> mc->gregs[REG_RCX] = mc->gregs[REG_RIP];
> diff --git a/arch/x86/um/shared/sysdep/mcontext.h b/arch/x86/um/shared/sysdep/mcontext.h
> index 9a0d6087f357..de4041b758f3 100644
> --- a/arch/x86/um/shared/sysdep/mcontext.h
> +++ b/arch/x86/um/shared/sysdep/mcontext.h
> @@ -19,6 +19,7 @@ extern int set_stub_state(struct uml_pt_regs *regs, struct stub_data *data,
>
> #ifndef CONFIG_MMU
> extern void set_mc_sigsys_hook(mcontext_t *mc);
> +extern void set_mc_return_address(mcontext_t *mc);
> #endif
>
> #ifdef __i386__
> diff --git a/arch/x86/um/shared/sysdep/ptrace.h b/arch/x86/um/shared/sysdep/ptrace.h
> index 8f7476ff6e95..7d553d9f05be 100644
> --- a/arch/x86/um/shared/sysdep/ptrace.h
> +++ b/arch/x86/um/shared/sysdep/ptrace.h
> @@ -65,7 +65,7 @@ struct uml_pt_regs {
> int is_user;
>
> /* Dynamically sized FP registers (holds an XSTATE) */
> - unsigned long fp[];
> + unsigned long fp[] __attribute__((aligned(16)));
> };
>
> #define EMPTY_UML_PT_REGS { }
> diff --git a/arch/x86/um/shared/sysdep/syscalls_64.h b/arch/x86/um/shared/sysdep/syscalls_64.h
> index ffd80ee3b9dc..bd152422cdfb 100644
> --- a/arch/x86/um/shared/sysdep/syscalls_64.h
> +++ b/arch/x86/um/shared/sysdep/syscalls_64.h
> @@ -29,6 +29,7 @@ extern syscall_handler_t sys_arch_prctl;
> extern void do_syscall_64(struct pt_regs *regs);
> extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
> int64_t a4, int64_t a5, int64_t a6);
> +extern void __prep_sigreturn(void);
> #endif
>
> #endif
* Re: [PATCH v10 09/13] x86/um: nommu: signal handling
2025-07-11 9:39 ` Benjamin Berg
@ 2025-07-11 10:05 ` Benjamin Berg
2025-07-12 1:16 ` Hajime Tazaki
0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Berg @ 2025-07-11 10:05 UTC (permalink / raw)
To: Hajime Tazaki; +Cc: linux-um, ricarkol, Liam.Howlett, linux-kernel
On Fri, 2025-07-11 at 11:39 +0200, Benjamin Berg wrote:
> [SNIP]
>
> That said, I would also still like to see a higher level discussion on
> how userspace registers are saved and restored. We have two separate
> cases--interrupts/exceptions (host signals) and the syscall path--and
> both need to be well defined. My hope is still that both of these can
> use the same register save/restore mechanism.
Now syscalls are also just signals. The crucial difference is that for
syscalls you are allowed to clobber R11 and RCX. Your current syscall
entry code uses that fact, but that does not work for other signals.
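To illustrate the clobbering on x86_64: the "syscall" instruction itself
overwrites RCX with the return address and R11 with RFLAGS, so userspace
cannot expect either to survive a syscall. A small sketch (the helper name
raw_getpid() and the sentinel values are made up for illustration) that
issues getpid directly and reports what is left in the two registers:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Issue getpid with the raw "syscall" instruction and hand back whatever
 * RCX and R11 contain afterwards. Both are seeded with sentinels first.
 */
long raw_getpid(unsigned long *rcx_after, unsigned long *r11_after)
{
#if defined(__x86_64__) && defined(__linux__)
	register unsigned long r11 __asm__("r11") = 0x1111111111111111UL;
	unsigned long rcx = 0x2222222222222222UL;
	long ret;

	__asm__ volatile("syscall"
			 : "=a"(ret), "+c"(rcx), "+r"(r11)
			 : "0"((long)SYS_getpid)
			 : "memory");
	*rcx_after = rcx;
	*r11_after = r11;
	return ret;
#else
	/* Other architectures: no RCX/R11, just use the libc wrapper. */
	*rcx_after = 0;
	*r11_after = 0;
	return syscall(SYS_getpid);
#endif
}
```

On a native host, *rcx_after should come back as the address following the
"syscall" instruction and *r11_after as the saved RFLAGS, not the sentinels.
A host signal such as SIGSEGV can arrive at any instruction, where RCX and
R11 hold live user values, so an entry path that uses them as scratch
registers is only safe for the syscall case.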
Benjamin
>
> Benjamin
>
> >
> > ---
> > arch/um/include/shared/kern_util.h | 4 +
> > arch/um/nommu/Makefile | 2 +-
> > arch/um/nommu/os-Linux/signal.c | 8 +
> > arch/um/nommu/trap.c | 201 ++++++++++++++++++++++++
> > arch/um/os-Linux/signal.c | 3 +-
> > arch/x86/um/nommu/do_syscall_64.c | 6 +
> > arch/x86/um/nommu/entry_64.S | 14 ++
> > arch/x86/um/nommu/os-Linux/mcontext.c | 5 +
> > arch/x86/um/shared/sysdep/mcontext.h | 1 +
> > arch/x86/um/shared/sysdep/ptrace.h | 2 +-
> > arch/x86/um/shared/sysdep/syscalls_64.h | 1 +
> > 11 files changed, 244 insertions(+), 3 deletions(-)
> > create mode 100644 arch/um/nommu/trap.c
> >
> > diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
> > index ec8ba1f13c58..7f55402b6385 100644
> > --- a/arch/um/include/shared/kern_util.h
> > +++ b/arch/um/include/shared/kern_util.h
> > @@ -73,4 +73,8 @@ void um_idle_sleep(void);
> >
> > void kasan_map_memory(void *start, size_t len);
> >
> > +#ifndef CONFIG_MMU
> > +extern void nommu_relay_signal(void *ptr);
> > +#endif
> > +
> > #endif
> > diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile
> > index baab7c2f57c2..096221590cfd 100644
> > --- a/arch/um/nommu/Makefile
> > +++ b/arch/um/nommu/Makefile
> > @@ -1,3 +1,3 @@
> > # SPDX-License-Identifier: GPL-2.0
> >
> > -obj-y := os-Linux/
> > +obj-y := trap.o os-Linux/
> > diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-Linux/signal.c
> > index 19043b9652e2..27b6b37744b7 100644
> > --- a/arch/um/nommu/os-Linux/signal.c
> > +++ b/arch/um/nommu/os-Linux/signal.c
> > @@ -5,6 +5,7 @@
> > #include <os.h>
> > #include <sysdep/mcontext.h>
> > #include <sys/ucontext.h>
> > +#include <as-layout.h>
> >
> > void sigsys_handler(int sig, struct siginfo *si,
> > struct uml_pt_regs *regs, void *ptr)
> > @@ -14,3 +15,10 @@ void sigsys_handler(int sig, struct siginfo *si,
> > /* hook syscall via SIGSYS */
> > set_mc_sigsys_hook(mc);
> > }
> > +
> > +void nommu_relay_signal(void *ptr)
> > +{
> > + mcontext_t *mc = (mcontext_t *) ptr;
> > +
> > + set_mc_return_address(mc);
> > +}
> > diff --git a/arch/um/nommu/trap.c b/arch/um/nommu/trap.c
> > new file mode 100644
> > index 000000000000..430297517455
> > --- /dev/null
> > +++ b/arch/um/nommu/trap.c
> > @@ -0,0 +1,201 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include <linux/mm.h>
> > +#include <linux/sched/signal.h>
> > +#include <linux/hardirq.h>
> > +#include <linux/module.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/sched/debug.h>
> > +#include <asm/current.h>
> > +#include <asm/tlbflush.h>
> > +#include <arch.h>
> > +#include <as-layout.h>
> > +#include <kern_util.h>
> > +#include <os.h>
> > +#include <skas.h>
> > +
> > +/*
> > + * Note this is constrained to return 0, -EFAULT, -EACCES, -ENOMEM by
> > + * segv().
> > + */
> > +int handle_page_fault(unsigned long address, unsigned long ip,
> > + int is_write, int is_user, int *code_out)
> > +{
> > + /* !MMU has no pagefault */
> > + return -EFAULT;
> > +}
> > +
> > +static void show_segv_info(struct uml_pt_regs *regs)
> > +{
> > + struct task_struct *tsk = current;
> > + struct faultinfo *fi = UPT_FAULTINFO(regs);
> > +
> > + if (!unhandled_signal(tsk, SIGSEGV))
> > + return;
> > +
> > + pr_warn_ratelimited("%s%s[%d]: segfault at %lx ip %p sp %p error %x",
> > + task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
> > + tsk->comm, task_pid_nr(tsk), FAULT_ADDRESS(*fi),
> > + (void *)UPT_IP(regs), (void *)UPT_SP(regs),
> > + fi->error_code);
> > +}
> > +
> > +static void bad_segv(struct faultinfo fi, unsigned long ip)
> > +{
> > + current->thread.arch.faultinfo = fi;
> > + force_sig_fault(SIGSEGV, SEGV_ACCERR, (void __user *) FAULT_ADDRESS(fi));
> > +}
> > +
> > +void fatal_sigsegv(void)
> > +{
> > + force_fatal_sig(SIGSEGV);
> > + do_signal(&current->thread.regs);
> > + /*
> > + * This is to tell gcc that we're not returning - do_signal
> > + * can, in general, return, but in this case, it's not, since
> > + * we just got a fatal SIGSEGV queued.
> > + */
> > + os_dump_core();
> > +}
> > +
> > +/**
> > + * segv_handler() - the SIGSEGV handler
> > + * @sig: the signal number
> > + * @unused_si: the signal info struct; unused in this handler
> > + * @regs: the ptrace register information
> > + *
> > + * The handler first extracts the faultinfo from the UML ptrace regs struct.
> > + * If the userfault did not happen in an UML userspace process, bad_segv is called.
> > + * Otherwise the signal did happen in a cloned userspace process, handle it.
> > + */
> > +void segv_handler(int sig, struct siginfo *unused_si, struct
> > uml_pt_regs *regs,
> > + void *mc)
> > +{
> > + struct faultinfo *fi = UPT_FAULTINFO(regs);
> > +
> > + /* !MMU specific part; detection of userspace */
> > + /* mark is_user=1 when the IP is from userspace code. */
> > + if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) <
> > high_physmem)
> > + regs->is_user = 1;
> > +
> > + if (UPT_IS_USER(regs) && !SEGV_IS_FIXABLE(fi)) {
> > + show_segv_info(regs);
> > + bad_segv(*fi, UPT_IP(regs));
> > + return;
> > + }
> > + segv(*fi, UPT_IP(regs), UPT_IS_USER(regs), regs, mc);
> > +
> > + /* !MMU specific part; detection of userspace */
> > + relay_signal(sig, unused_si, regs, mc);
> > +}
> > +
> > +/*
> > + * We give a *copy* of the faultinfo in the regs to segv.
> > + * This must be done, since nesting SEGVs could overwrite
> > + * the info in the regs. A pointer to the info then would
> > + * give us bad data!
> > + */
> > +unsigned long segv(struct faultinfo fi, unsigned long ip, int
> > is_user,
> > + struct uml_pt_regs *regs, void *mc)
> > +{
> > + int si_code;
> > + int err;
> > + int is_write = FAULT_WRITE(fi);
> > + unsigned long address = FAULT_ADDRESS(fi);
> > +
> > + if (!is_user && regs)
> > + current->thread.segv_regs = container_of(regs,
> > struct pt_regs, regs);
> > +
> > + if (current->mm == NULL) {
> > + show_regs(container_of(regs, struct pt_regs,
> > regs));
> > + panic("Segfault with no mm");
> > + } else if (!is_user && address > PAGE_SIZE && address <
> > TASK_SIZE) {
> > + show_regs(container_of(regs, struct pt_regs,
> > regs));
> > + panic("Kernel tried to access user memory at addr
> > 0x%lx, ip 0x%lx",
> > + address, ip);
> > + }
> > +
> > + if (SEGV_IS_FIXABLE(&fi))
> > + err = handle_page_fault(address, ip, is_write,
> > is_user,
> > + &si_code);
> > + else {
> > + err = -EFAULT;
> > + /*
> > + * A thread accessed NULL, we get a fault, but CR2
> > is invalid.
> > + * This code is used in __do_copy_from_user() of
> > TT mode.
> > + * XXX tt mode is gone, so maybe this isn't needed
> > any more
> > + */
> > + address = 0;
> > + }
> > +
> > + if (!err)
> > + goto out;
> > + else if (!is_user && arch_fixup(ip, regs))
> > + goto out;
> > +
> > + if (!is_user) {
> > + show_regs(container_of(regs, struct pt_regs,
> > regs));
> > + panic("Kernel mode fault at addr 0x%lx, ip 0x%lx",
> > + address, ip);
> > + }
> > +
> > + show_segv_info(regs);
> > +
> > + if (err == -EACCES) {
> > + current->thread.arch.faultinfo = fi;
> > + force_sig_fault(SIGBUS, BUS_ADRERR, (void __user
> > *)address);
> > + } else {
> > + WARN_ON_ONCE(err != -EFAULT);
> > + current->thread.arch.faultinfo = fi;
> > + force_sig_fault(SIGSEGV, si_code, (void __user *)
> > address);
> > + }
> > +
> > +out:
> > + if (regs)
> > + current->thread.segv_regs = NULL;
> > +
> > + return 0;
> > +}
> > +
> > +void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs
> > *regs,
> > + void *mc)
> > +{
> > + int code, err;
> > +
> > + /* !MMU specific part; detection of userspace */
> > + /* mark is_user=1 when the IP is from userspace code. */
> > + if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) <
> > high_physmem)
> > + regs->is_user = 1;
> > +
> > + if (!UPT_IS_USER(regs)) {
> > + if (sig == SIGBUS)
> > + pr_err("Bus error - the host /dev/shm or
> > /tmp mount likely just ran out of space\n");
> > + panic("Kernel mode signal %d", sig);
> > + }
> > + /* if is_user==1, set return to userspace sig handler to
> > relay signal */
> > + nommu_relay_signal(mc);
> > +
> > + arch_examine_signal(sig, regs);
> > +
> > + /* Is the signal layout for the signal known?
> > + * Signal data must be scrubbed to prevent information
> > leaks.
> > + */
> > + code = si->si_code;
> > + err = si->si_errno;
> > + if ((err == 0) && (siginfo_layout(sig, code) ==
> > SIL_FAULT)) {
> > + struct faultinfo *fi = UPT_FAULTINFO(regs);
> > +
> > + current->thread.arch.faultinfo = *fi;
> > + force_sig_fault(sig, code, (void __user
> > *)FAULT_ADDRESS(*fi));
> > + } else {
> > + pr_err("Attempted to relay unknown signal %d
> > (si_code = %d) with errno %d\n",
> > + sig, code, err);
> > + force_sig(sig);
> > + }
> > +}
> > +
> > +void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs
> > *regs,
> > + void *mc)
> > +{
> > + do_IRQ(WINCH_IRQ, regs);
> > +}
> > diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
> > index 53e276e81b37..67dcd88b45b1 100644
> > --- a/arch/um/os-Linux/signal.c
> > +++ b/arch/um/os-Linux/signal.c
> > @@ -40,9 +40,10 @@ static void sig_handler_common(int sig, struct
> > siginfo *si, mcontext_t *mc)
> > int save_errno = errno;
> >
> > r.is_user = 0;
> > + if (mc)
> > + get_regs_from_mc(&r, mc);
> > if (sig == SIGSEGV) {
> > /* For segfaults, we want the data from the
> > sigcontext. */
> > - get_regs_from_mc(&r, mc);
> > GET_FAULTINFO_FROM_MC(r.faultinfo, mc);
> > }
> >
> > diff --git a/arch/x86/um/nommu/do_syscall_64.c
> > b/arch/x86/um/nommu/do_syscall_64.c
> > index 74d5bcc4508d..d77e69e097c1 100644
> > --- a/arch/x86/um/nommu/do_syscall_64.c
> > +++ b/arch/x86/um/nommu/do_syscall_64.c
> > @@ -44,6 +44,9 @@ __visible void do_syscall_64(struct pt_regs
> > *regs)
> > /* set fs register to the original host one */
> > os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
> >
> > + /* save fp registers */
> > + asm volatile("fxsaveq %0" : "=m"(*(struct _xstate *)regs-
> > >regs.fp));
> > +
> > if (likely(syscall < NR_syscalls)) {
> > PT_REGS_SET_SYSCALL_RETURN(regs,
> > EXECUTE_SYSCALL(syscall, regs));
> > @@ -54,6 +57,9 @@ __visible void do_syscall_64(struct pt_regs
> > *regs)
> > /* handle tasks and signals at the end */
> > interrupt_end();
> >
> > + /* restore fp registers */
> > + asm volatile("fxrstorq %0" : : "m"((current-
> > >thread.regs.regs.fp)));
> > +
> > /* restore back fs register to userspace configured one */
> > os_x86_arch_prctl(0, ARCH_SET_FS,
> > (void *)(current-
> > >thread.regs.regs.gp[FS_BASE
> > diff --git a/arch/x86/um/nommu/entry_64.S
> > b/arch/x86/um/nommu/entry_64.S
> > index 950447dfa66b..e038bc7b53ac 100644
> > --- a/arch/x86/um/nommu/entry_64.S
> > +++ b/arch/x86/um/nommu/entry_64.S
> > @@ -111,3 +111,17 @@ ENTRY(userspace)
> > jmp *%r11
> >
> > END(userspace)
> > +
> > +/*
> > + * this routine prepares the stack to return via host-generated
> > + * signals (e.g., SEGV, FPE) via do_signal() from interrupt_end().
> > + */
> > +ENTRY(__prep_sigreturn)
> > + /*
> > + * Switch to current top of stack, so "current->" points
> > + * to the right task.
> > + */
> > + movq current_top_of_stack, %rsp
> > +
> > + jmp userspace
> > +END(__prep_sigreturn)
> > diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c
> > b/arch/x86/um/nommu/os-Linux/mcontext.c
> > index c4ef877d5ea0..87fb2a35e7ff 100644
> > --- a/arch/x86/um/nommu/os-Linux/mcontext.c
> > +++ b/arch/x86/um/nommu/os-Linux/mcontext.c
> > @@ -6,6 +6,11 @@
> > #include <sysdep/mcontext.h>
> > #include <sysdep/syscalls.h>
> >
> > +void set_mc_return_address(mcontext_t *mc)
> > +{
> > + mc->gregs[REG_RIP] = (unsigned long) __prep_sigreturn;
> > +}
> > +
> > void set_mc_sigsys_hook(mcontext_t *mc)
> > {
> > mc->gregs[REG_RCX] = mc->gregs[REG_RIP];
> > diff --git a/arch/x86/um/shared/sysdep/mcontext.h
> > b/arch/x86/um/shared/sysdep/mcontext.h
> > index 9a0d6087f357..de4041b758f3 100644
> > --- a/arch/x86/um/shared/sysdep/mcontext.h
> > +++ b/arch/x86/um/shared/sysdep/mcontext.h
> > @@ -19,6 +19,7 @@ extern int set_stub_state(struct uml_pt_regs
> > *regs, struct stub_data *data,
> >
> > #ifndef CONFIG_MMU
> > extern void set_mc_sigsys_hook(mcontext_t *mc);
> > +extern void set_mc_return_address(mcontext_t *mc);
> > #endif
> >
> > #ifdef __i386__
> > diff --git a/arch/x86/um/shared/sysdep/ptrace.h
> > b/arch/x86/um/shared/sysdep/ptrace.h
> > index 8f7476ff6e95..7d553d9f05be 100644
> > --- a/arch/x86/um/shared/sysdep/ptrace.h
> > +++ b/arch/x86/um/shared/sysdep/ptrace.h
> > @@ -65,7 +65,7 @@ struct uml_pt_regs {
> > int is_user;
> >
> > /* Dynamically sized FP registers (holds an XSTATE) */
> > - unsigned long fp[];
> > + unsigned long fp[] __attribute__((aligned(16)));
> > };
> >
> > #define EMPTY_UML_PT_REGS { }
> > diff --git a/arch/x86/um/shared/sysdep/syscalls_64.h
> > b/arch/x86/um/shared/sysdep/syscalls_64.h
> > index ffd80ee3b9dc..bd152422cdfb 100644
> > --- a/arch/x86/um/shared/sysdep/syscalls_64.h
> > +++ b/arch/x86/um/shared/sysdep/syscalls_64.h
> > @@ -29,6 +29,7 @@ extern syscall_handler_t sys_arch_prctl;
> > extern void do_syscall_64(struct pt_regs *regs);
> > extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2,
> > int64_t a3,
> > int64_t a4, int64_t a5, int64_t a6);
> > +extern void __prep_sigreturn(void);
> > #endif
> >
> > #endif
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v10 09/13] x86/um: nommu: signal handling
2025-07-11 10:05 ` Benjamin Berg
@ 2025-07-12 1:16 ` Hajime Tazaki
2025-07-12 7:58 ` Benjamin Berg
0 siblings, 1 reply; 25+ messages in thread
From: Hajime Tazaki @ 2025-07-12 1:16 UTC (permalink / raw)
To: benjamin; +Cc: linux-um, ricarkol, Liam.Howlett, linux-kernel
Hello,
> Honestly, I think we need a test case to be able to move forward. The
> test needs to trigger an exception (FPE, segfault, whatever) and then
> handle the signal. In the signal handler, verify the register state in
> the mcontext is expected (RIP, RSP, FP regs), then update it to not
> raise an exception again and return. The test should obviously exit
> cleanly afterwards.
I agree that we should have a test case.
I played with your RFC patch ([RFC 0/2] Experimental kunit test for
signal context handling), which I guess is similar to the one you gave
me in the past, with minor modifications for nommu mode, and it looks
like that test passed.
(none):/# /root/test-fp-save-restore
TAP version 13
1..1
# pre-signal: 50 / 100, 11223344 55667788 99aabbcc ddeeff00
# sighandler: extended_size: 2700, xstate_size: 2696
# post-signal: 51200 / 100, 11233345 55677789 99abbbcd ddefff01 (should change: 1, changed: 1)
ok 1 mcontext
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
I couldn't invoke this test via `kunit.py run` (which I should
investigate further), but it can be a good starting point for the test
case you proposed.
I will follow up on the high-level discussion of how the syscall/signal
entry/exit code is implemented in nommu, which I think I've
explained several times, but why not :)
-- Hajime
On Fri, 11 Jul 2025 19:05:13 +0900,
Benjamin Berg wrote:
>
> On Fri, 2025-07-11 at 11:39 +0200, Benjamin Berg wrote:
> > [SNIP]
> >
> > That said, I would also still like to see a higher level discussion on
> > how userspace registers are saved and restored. We have two separate
> > cases--interrupts/exceptions (host signals) and the syscall path--and
> > both need to be well defined. My hope is still that both of these can
> > use the same register save/restore mechanism.
>
> Now syscalls are also just signals. The crucial difference is that for
> syscalls you are allowed to clobber R11 and RCX. Your current syscall
> entry code uses that fact, but that does not work for other signals.
>
> Benjamin
>
> >
> > Benjamin
> >
> > >
> > > ---
> > > arch/um/include/shared/kern_util.h | 4 +
> > > arch/um/nommu/Makefile | 2 +-
> > > arch/um/nommu/os-Linux/signal.c | 8 +
> > > arch/um/nommu/trap.c | 201
> > > ++++++++++++++++++++++++
> > > arch/um/os-Linux/signal.c | 3 +-
> > > arch/x86/um/nommu/do_syscall_64.c | 6 +
> > > arch/x86/um/nommu/entry_64.S | 14 ++
> > > arch/x86/um/nommu/os-Linux/mcontext.c | 5 +
> > > arch/x86/um/shared/sysdep/mcontext.h | 1 +
> > > arch/x86/um/shared/sysdep/ptrace.h | 2 +-
> > > arch/x86/um/shared/sysdep/syscalls_64.h | 1 +
> > > 11 files changed, 244 insertions(+), 3 deletions(-)
> > > create mode 100644 arch/um/nommu/trap.c
> > >
> > > diff --git a/arch/um/include/shared/kern_util.h
> > > b/arch/um/include/shared/kern_util.h
> > > index ec8ba1f13c58..7f55402b6385 100644
> > > --- a/arch/um/include/shared/kern_util.h
> > > +++ b/arch/um/include/shared/kern_util.h
> > > @@ -73,4 +73,8 @@ void um_idle_sleep(void);
> > >
> > > void kasan_map_memory(void *start, size_t len);
> > >
> > > +#ifndef CONFIG_MMU
> > > +extern void nommu_relay_signal(void *ptr);
> > > +#endif
> > > +
> > > #endif
> > > diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile
> > > index baab7c2f57c2..096221590cfd 100644
> > > --- a/arch/um/nommu/Makefile
> > > +++ b/arch/um/nommu/Makefile
> > > @@ -1,3 +1,3 @@
> > > # SPDX-License-Identifier: GPL-2.0
> > >
> > > -obj-y := os-Linux/
> > > +obj-y := trap.o os-Linux/
> > > diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-
> > > Linux/signal.c
> > > index 19043b9652e2..27b6b37744b7 100644
> > > --- a/arch/um/nommu/os-Linux/signal.c
> > > +++ b/arch/um/nommu/os-Linux/signal.c
> > > @@ -5,6 +5,7 @@
> > > #include <os.h>
> > > #include <sysdep/mcontext.h>
> > > #include <sys/ucontext.h>
> > > +#include <as-layout.h>
> > >
> > > void sigsys_handler(int sig, struct siginfo *si,
> > > struct uml_pt_regs *regs, void *ptr)
> > > @@ -14,3 +15,10 @@ void sigsys_handler(int sig, struct siginfo *si,
> > > /* hook syscall via SIGSYS */
> > > set_mc_sigsys_hook(mc);
> > > }
> > > +
> > > +void nommu_relay_signal(void *ptr)
> > > +{
> > > + mcontext_t *mc = (mcontext_t *) ptr;
> > > +
> > > + set_mc_return_address(mc);
> > > +}
> > > diff --git a/arch/um/nommu/trap.c b/arch/um/nommu/trap.c
> > > new file mode 100644
> > > index 000000000000..430297517455
> > > --- /dev/null
> > > +++ b/arch/um/nommu/trap.c
> > > @@ -0,0 +1,201 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +
> > > +#include <linux/mm.h>
> > > +#include <linux/sched/signal.h>
> > > +#include <linux/hardirq.h>
> > > +#include <linux/module.h>
> > > +#include <linux/uaccess.h>
> > > +#include <linux/sched/debug.h>
> > > +#include <asm/current.h>
> > > +#include <asm/tlbflush.h>
> > > +#include <arch.h>
> > > +#include <as-layout.h>
> > > +#include <kern_util.h>
> > > +#include <os.h>
> > > +#include <skas.h>
> > > +
> > > +/*
> > > + * Note this is constrained to return 0, -EFAULT, -EACCES, -ENOMEM
> > > by
> > > + * segv().
> > > + */
> > > +int handle_page_fault(unsigned long address, unsigned long ip,
> > > + int is_write, int is_user, int *code_out)
> > > +{
> > > + /* !MMU has no pagefault */
> > > + return -EFAULT;
> > > +}
> > > +
> > > +static void show_segv_info(struct uml_pt_regs *regs)
> > > +{
> > > + struct task_struct *tsk = current;
> > > + struct faultinfo *fi = UPT_FAULTINFO(regs);
> > > +
> > > + if (!unhandled_signal(tsk, SIGSEGV))
> > > + return;
> > > +
> > > + pr_warn_ratelimited("%s%s[%d]: segfault at %lx ip %p sp %p
> > > error %x",
> > > + task_pid_nr(tsk) > 1 ? KERN_INFO :
> > > KERN_EMERG,
> > > + tsk->comm, task_pid_nr(tsk),
> > > FAULT_ADDRESS(*fi),
> > > + (void *)UPT_IP(regs), (void
> > > *)UPT_SP(regs),
> > > + fi->error_code);
> > > +}
> > > +
> > > +static void bad_segv(struct faultinfo fi, unsigned long ip)
> > > +{
> > > + current->thread.arch.faultinfo = fi;
> > > + force_sig_fault(SIGSEGV, SEGV_ACCERR, (void __user *)
> > > FAULT_ADDRESS(fi));
> > > +}
> > > +
> > > +void fatal_sigsegv(void)
> > > +{
> > > + force_fatal_sig(SIGSEGV);
> > > + do_signal(&current->thread.regs);
> > > + /*
> > > + * This is to tell gcc that we're not returning -
> > > do_signal
> > > + * can, in general, return, but in this case, it's not,
> > > since
> > > + * we just got a fatal SIGSEGV queued.
> > > + */
> > > + os_dump_core();
> > > +}
> > > +
> > > +/**
> > > + * segv_handler() - the SIGSEGV handler
> > > + * @sig: the signal number
> > > + * @unused_si: the signal info struct; unused in this handler
> > > + * @regs: the ptrace register information
> > > + *
> > > + * The handler first extracts the faultinfo from the UML ptrace
> > > regs struct.
> > > + * If the userfault did not happen in an UML userspace process,
> > > bad_segv is called.
> > > + * Otherwise the signal did happen in a cloned userspace process,
> > > handle it.
> > > + */
> > > +void segv_handler(int sig, struct siginfo *unused_si, struct
> > > uml_pt_regs *regs,
> > > + void *mc)
> > > +{
> > > + struct faultinfo *fi = UPT_FAULTINFO(regs);
> > > +
> > > + /* !MMU specific part; detection of userspace */
> > > + /* mark is_user=1 when the IP is from userspace code. */
> > > + if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) <
> > > high_physmem)
> > > + regs->is_user = 1;
> > > +
> > > + if (UPT_IS_USER(regs) && !SEGV_IS_FIXABLE(fi)) {
> > > + show_segv_info(regs);
> > > + bad_segv(*fi, UPT_IP(regs));
> > > + return;
> > > + }
> > > + segv(*fi, UPT_IP(regs), UPT_IS_USER(regs), regs, mc);
> > > +
> > > + /* !MMU specific part; detection of userspace */
> > > + relay_signal(sig, unused_si, regs, mc);
> > > +}
> > > +
> > > +/*
> > > + * We give a *copy* of the faultinfo in the regs to segv.
> > > + * This must be done, since nesting SEGVs could overwrite
> > > + * the info in the regs. A pointer to the info then would
> > > + * give us bad data!
> > > + */
> > > +unsigned long segv(struct faultinfo fi, unsigned long ip, int
> > > is_user,
> > > + struct uml_pt_regs *regs, void *mc)
> > > +{
> > > + int si_code;
> > > + int err;
> > > + int is_write = FAULT_WRITE(fi);
> > > + unsigned long address = FAULT_ADDRESS(fi);
> > > +
> > > + if (!is_user && regs)
> > > + current->thread.segv_regs = container_of(regs,
> > > struct pt_regs, regs);
> > > +
> > > + if (current->mm == NULL) {
> > > + show_regs(container_of(regs, struct pt_regs,
> > > regs));
> > > + panic("Segfault with no mm");
> > > + } else if (!is_user && address > PAGE_SIZE && address <
> > > TASK_SIZE) {
> > > + show_regs(container_of(regs, struct pt_regs,
> > > regs));
> > > + panic("Kernel tried to access user memory at addr
> > > 0x%lx, ip 0x%lx",
> > > + address, ip);
> > > + }
> > > +
> > > + if (SEGV_IS_FIXABLE(&fi))
> > > + err = handle_page_fault(address, ip, is_write,
> > > is_user,
> > > + &si_code);
> > > + else {
> > > + err = -EFAULT;
> > > + /*
> > > + * A thread accessed NULL, we get a fault, but CR2
> > > is invalid.
> > > + * This code is used in __do_copy_from_user() of
> > > TT mode.
> > > + * XXX tt mode is gone, so maybe this isn't needed
> > > any more
> > > + */
> > > + address = 0;
> > > + }
> > > +
> > > + if (!err)
> > > + goto out;
> > > + else if (!is_user && arch_fixup(ip, regs))
> > > + goto out;
> > > +
> > > + if (!is_user) {
> > > + show_regs(container_of(regs, struct pt_regs,
> > > regs));
> > > + panic("Kernel mode fault at addr 0x%lx, ip 0x%lx",
> > > + address, ip);
> > > + }
> > > +
> > > + show_segv_info(regs);
> > > +
> > > + if (err == -EACCES) {
> > > + current->thread.arch.faultinfo = fi;
> > > + force_sig_fault(SIGBUS, BUS_ADRERR, (void __user
> > > *)address);
> > > + } else {
> > > + WARN_ON_ONCE(err != -EFAULT);
> > > + current->thread.arch.faultinfo = fi;
> > > + force_sig_fault(SIGSEGV, si_code, (void __user *)
> > > address);
> > > + }
> > > +
> > > +out:
> > > + if (regs)
> > > + current->thread.segv_regs = NULL;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs
> > > *regs,
> > > + void *mc)
> > > +{
> > > + int code, err;
> > > +
> > > + /* !MMU specific part; detection of userspace */
> > > + /* mark is_user=1 when the IP is from userspace code. */
> > > + if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) <
> > > high_physmem)
> > > + regs->is_user = 1;
> > > +
> > > + if (!UPT_IS_USER(regs)) {
> > > + if (sig == SIGBUS)
> > > + pr_err("Bus error - the host /dev/shm or
> > > /tmp mount likely just ran out of space\n");
> > > + panic("Kernel mode signal %d", sig);
> > > + }
> > > + /* if is_user==1, set return to userspace sig handler to
> > > relay signal */
> > > + nommu_relay_signal(mc);
> > > +
> > > + arch_examine_signal(sig, regs);
> > > +
> > > + /* Is the signal layout for the signal known?
> > > + * Signal data must be scrubbed to prevent information
> > > leaks.
> > > + */
> > > + code = si->si_code;
> > > + err = si->si_errno;
> > > + if ((err == 0) && (siginfo_layout(sig, code) ==
> > > SIL_FAULT)) {
> > > + struct faultinfo *fi = UPT_FAULTINFO(regs);
> > > +
> > > + current->thread.arch.faultinfo = *fi;
> > > + force_sig_fault(sig, code, (void __user
> > > *)FAULT_ADDRESS(*fi));
> > > + } else {
> > > + pr_err("Attempted to relay unknown signal %d
> > > (si_code = %d) with errno %d\n",
> > > + sig, code, err);
> > > + force_sig(sig);
> > > + }
> > > +}
> > > +
> > > +void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs
> > > *regs,
> > > + void *mc)
> > > +{
> > > + do_IRQ(WINCH_IRQ, regs);
> > > +}
> > > diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
> > > index 53e276e81b37..67dcd88b45b1 100644
> > > --- a/arch/um/os-Linux/signal.c
> > > +++ b/arch/um/os-Linux/signal.c
> > > @@ -40,9 +40,10 @@ static void sig_handler_common(int sig, struct
> > > siginfo *si, mcontext_t *mc)
> > > int save_errno = errno;
> > >
> > > r.is_user = 0;
> > > + if (mc)
> > > + get_regs_from_mc(&r, mc);
> > > if (sig == SIGSEGV) {
> > > /* For segfaults, we want the data from the
> > > sigcontext. */
> > > - get_regs_from_mc(&r, mc);
> > > GET_FAULTINFO_FROM_MC(r.faultinfo, mc);
> > > }
> > >
> > > diff --git a/arch/x86/um/nommu/do_syscall_64.c
> > > b/arch/x86/um/nommu/do_syscall_64.c
> > > index 74d5bcc4508d..d77e69e097c1 100644
> > > --- a/arch/x86/um/nommu/do_syscall_64.c
> > > +++ b/arch/x86/um/nommu/do_syscall_64.c
> > > @@ -44,6 +44,9 @@ __visible void do_syscall_64(struct pt_regs
> > > *regs)
> > > /* set fs register to the original host one */
> > > os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
> > >
> > > + /* save fp registers */
> > > + asm volatile("fxsaveq %0" : "=m"(*(struct _xstate *)regs-
> > > >regs.fp));
> > > +
> > > if (likely(syscall < NR_syscalls)) {
> > > PT_REGS_SET_SYSCALL_RETURN(regs,
> > > EXECUTE_SYSCALL(syscall, regs));
> > > @@ -54,6 +57,9 @@ __visible void do_syscall_64(struct pt_regs
> > > *regs)
> > > /* handle tasks and signals at the end */
> > > interrupt_end();
> > >
> > > + /* restore fp registers */
> > > + asm volatile("fxrstorq %0" : : "m"((current-
> > > >thread.regs.regs.fp)));
> > > +
> > > /* restore back fs register to userspace configured one */
> > > os_x86_arch_prctl(0, ARCH_SET_FS,
> > > (void *)(current-
> > > >thread.regs.regs.gp[FS_BASE
> > > diff --git a/arch/x86/um/nommu/entry_64.S
> > > b/arch/x86/um/nommu/entry_64.S
> > > index 950447dfa66b..e038bc7b53ac 100644
> > > --- a/arch/x86/um/nommu/entry_64.S
> > > +++ b/arch/x86/um/nommu/entry_64.S
> > > @@ -111,3 +111,17 @@ ENTRY(userspace)
> > > jmp *%r11
> > >
> > > END(userspace)
> > > +
> > > +/*
> > > + * this routine prepares the stack to return via host-generated
> > > + * signals (e.g., SEGV, FPE) via do_signal() from interrupt_end().
> > > + */
> > > +ENTRY(__prep_sigreturn)
> > > + /*
> > > + * Switch to current top of stack, so "current->" points
> > > + * to the right task.
> > > + */
> > > + movq current_top_of_stack, %rsp
> > > +
> > > + jmp userspace
> > > +END(__prep_sigreturn)
> > > diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c
> > > b/arch/x86/um/nommu/os-Linux/mcontext.c
> > > index c4ef877d5ea0..87fb2a35e7ff 100644
> > > --- a/arch/x86/um/nommu/os-Linux/mcontext.c
> > > +++ b/arch/x86/um/nommu/os-Linux/mcontext.c
> > > @@ -6,6 +6,11 @@
> > > #include <sysdep/mcontext.h>
> > > #include <sysdep/syscalls.h>
> > >
> > > +void set_mc_return_address(mcontext_t *mc)
> > > +{
> > > + mc->gregs[REG_RIP] = (unsigned long) __prep_sigreturn;
> > > +}
> > > +
> > > void set_mc_sigsys_hook(mcontext_t *mc)
> > > {
> > > mc->gregs[REG_RCX] = mc->gregs[REG_RIP];
> > > diff --git a/arch/x86/um/shared/sysdep/mcontext.h
> > > b/arch/x86/um/shared/sysdep/mcontext.h
> > > index 9a0d6087f357..de4041b758f3 100644
> > > --- a/arch/x86/um/shared/sysdep/mcontext.h
> > > +++ b/arch/x86/um/shared/sysdep/mcontext.h
> > > @@ -19,6 +19,7 @@ extern int set_stub_state(struct uml_pt_regs
> > > *regs, struct stub_data *data,
> > >
> > > #ifndef CONFIG_MMU
> > > extern void set_mc_sigsys_hook(mcontext_t *mc);
> > > +extern void set_mc_return_address(mcontext_t *mc);
> > > #endif
> > >
> > > #ifdef __i386__
> > > diff --git a/arch/x86/um/shared/sysdep/ptrace.h
> > > b/arch/x86/um/shared/sysdep/ptrace.h
> > > index 8f7476ff6e95..7d553d9f05be 100644
> > > --- a/arch/x86/um/shared/sysdep/ptrace.h
> > > +++ b/arch/x86/um/shared/sysdep/ptrace.h
> > > @@ -65,7 +65,7 @@ struct uml_pt_regs {
> > > int is_user;
> > >
> > > /* Dynamically sized FP registers (holds an XSTATE) */
> > > - unsigned long fp[];
> > > + unsigned long fp[] __attribute__((aligned(16)));
> > > };
> > >
> > > #define EMPTY_UML_PT_REGS { }
> > > diff --git a/arch/x86/um/shared/sysdep/syscalls_64.h
> > > b/arch/x86/um/shared/sysdep/syscalls_64.h
> > > index ffd80ee3b9dc..bd152422cdfb 100644
> > > --- a/arch/x86/um/shared/sysdep/syscalls_64.h
> > > +++ b/arch/x86/um/shared/sysdep/syscalls_64.h
> > > @@ -29,6 +29,7 @@ extern syscall_handler_t sys_arch_prctl;
> > > extern void do_syscall_64(struct pt_regs *regs);
> > > extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2,
> > > int64_t a3,
> > > int64_t a4, int64_t a5, int64_t a6);
> > > +extern void __prep_sigreturn(void);
> > > #endif
> > >
> > > #endif
> >
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v10 09/13] x86/um: nommu: signal handling
2025-07-12 1:16 ` Hajime Tazaki
@ 2025-07-12 7:58 ` Benjamin Berg
0 siblings, 0 replies; 25+ messages in thread
From: Benjamin Berg @ 2025-07-12 7:58 UTC (permalink / raw)
To: Hajime Tazaki; +Cc: linux-um, ricarkol, Liam.Howlett, linux-kernel
Hi,
On Sat, 2025-07-12 at 10:16 +0900, Hajime Tazaki wrote:
>
> Hello,
>
> > Honestly, I think we need a test case to be able to move forward. The
> > test needs to trigger an exception (FPE, segfault, whatever) and then
> > handle the signal. In the signal handler, verify the register state in
> > the mcontext is expected (RIP, RSP, FP regs), then update it to not
> > raise an exception again and return. The test should obviously exit
> > cleanly afterwards.
>
> I agree that we should have a test case.
>
> I played with your RFC patch ([RFC 0/2] Experimental kunit test for
> signal context handling), which I guess is similar to the one you gave
> me in the past, with minor modifications for nommu mode, and it looks
> like that test passed.
That test triggers the signal emission using a self-kill (i.e. SIGSYS
and then the syscall entry point). The problems that I believe exist
will only happen if the kernel is entered for other reasons. I was
primarily thinking about exceptions (e.g. SIGFPE), but I suppose it
could even be scheduling right now (SIGALRM).
Benjamin
>
>
> (none):/# /root/test-fp-save-restore
> TAP version 13
> 1..1
> # pre-signal: 50 / 100, 11223344 55667788 99aabbcc ddeeff00
> # sighandler: extended_size: 2700, xstate_size: 2696
> # post-signal: 51200 / 100, 11233345 55677789 99abbbcd ddefff01 (should change: 1, changed: 1)
> ok 1 mcontext
> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
>
>
> I couldn't invoke this test via `kunit.py run` (which I should
> investigate further), but it can be a good starting point for the test
> case you proposed.
>
> I will follow up on the high-level discussion of how the syscall/signal
> entry/exit code is implemented in nommu, which I think I've
> explained several times, but why not :)
>
> -- Hajime
>
> On Fri, 11 Jul 2025 19:05:13 +0900,
> Benjamin Berg wrote:
> >
> > On Fri, 2025-07-11 at 11:39 +0200, Benjamin Berg wrote:
> > > [SNIP]
> > >
> > > That said, I would also still like to see a higher level discussion on
> > > how userspace registers are saved and restored. We have two separate
> > > cases--interrupts/exceptions (host signals) and the syscall path--and
> > > both need to be well defined. My hope is still that both of these can
> > > use the same register save/restore mechanism.
> >
> > Now syscalls are also just signals. The crucial difference is that for
> > syscalls you are allowed to clobber R11 and RCX. Your current syscall
> > entry code uses that fact, but that does not work for other signals.
> >
> > Benjamin
> >
> > >
> > > Benjamin
> > >
> > > >
> > > > ---
> > > > arch/um/include/shared/kern_util.h | 4 +
> > > > arch/um/nommu/Makefile | 2 +-
> > > > arch/um/nommu/os-Linux/signal.c | 8 +
> > > > arch/um/nommu/trap.c | 201
> > > > ++++++++++++++++++++++++
> > > > arch/um/os-Linux/signal.c | 3 +-
> > > > arch/x86/um/nommu/do_syscall_64.c | 6 +
> > > > arch/x86/um/nommu/entry_64.S | 14 ++
> > > > arch/x86/um/nommu/os-Linux/mcontext.c | 5 +
> > > > arch/x86/um/shared/sysdep/mcontext.h | 1 +
> > > > arch/x86/um/shared/sysdep/ptrace.h | 2 +-
> > > > arch/x86/um/shared/sysdep/syscalls_64.h | 1 +
> > > > 11 files changed, 244 insertions(+), 3 deletions(-)
> > > > create mode 100644 arch/um/nommu/trap.c
> > > >
> > > > diff --git a/arch/um/include/shared/kern_util.h
> > > > b/arch/um/include/shared/kern_util.h
> > > > index ec8ba1f13c58..7f55402b6385 100644
> > > > --- a/arch/um/include/shared/kern_util.h
> > > > +++ b/arch/um/include/shared/kern_util.h
> > > > @@ -73,4 +73,8 @@ void um_idle_sleep(void);
> > > >
> > > > void kasan_map_memory(void *start, size_t len);
> > > >
> > > > +#ifndef CONFIG_MMU
> > > > +extern void nommu_relay_signal(void *ptr);
> > > > +#endif
> > > > +
> > > > #endif
> > > > diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile
> > > > index baab7c2f57c2..096221590cfd 100644
> > > > --- a/arch/um/nommu/Makefile
> > > > +++ b/arch/um/nommu/Makefile
> > > > @@ -1,3 +1,3 @@
> > > > # SPDX-License-Identifier: GPL-2.0
> > > >
> > > > -obj-y := os-Linux/
> > > > +obj-y := trap.o os-Linux/
> > > > diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-
> > > > Linux/signal.c
> > > > index 19043b9652e2..27b6b37744b7 100644
> > > > --- a/arch/um/nommu/os-Linux/signal.c
> > > > +++ b/arch/um/nommu/os-Linux/signal.c
> > > > @@ -5,6 +5,7 @@
> > > > #include <os.h>
> > > > #include <sysdep/mcontext.h>
> > > > #include <sys/ucontext.h>
> > > > +#include <as-layout.h>
> > > >
> > > > void sigsys_handler(int sig, struct siginfo *si,
> > > > struct uml_pt_regs *regs, void *ptr)
> > > > @@ -14,3 +15,10 @@ void sigsys_handler(int sig, struct siginfo *si,
> > > > /* hook syscall via SIGSYS */
> > > > set_mc_sigsys_hook(mc);
> > > > }
> > > > +
> > > > +void nommu_relay_signal(void *ptr)
> > > > +{
> > > > + mcontext_t *mc = (mcontext_t *) ptr;
> > > > +
> > > > + set_mc_return_address(mc);
> > > > +}
> > > > diff --git a/arch/um/nommu/trap.c b/arch/um/nommu/trap.c
> > > > new file mode 100644
> > > > index 000000000000..430297517455
> > > > --- /dev/null
> > > > +++ b/arch/um/nommu/trap.c
> > > > @@ -0,0 +1,201 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +
> > > > +#include <linux/mm.h>
> > > > +#include <linux/sched/signal.h>
> > > > +#include <linux/hardirq.h>
> > > > +#include <linux/module.h>
> > > > +#include <linux/uaccess.h>
> > > > +#include <linux/sched/debug.h>
> > > > +#include <asm/current.h>
> > > > +#include <asm/tlbflush.h>
> > > > +#include <arch.h>
> > > > +#include <as-layout.h>
> > > > +#include <kern_util.h>
> > > > +#include <os.h>
> > > > +#include <skas.h>
> > > > +
> > > > +/*
> > > > + * Note this is constrained to return 0, -EFAULT, -EACCES, -ENOMEM
> > > > by
> > > > + * segv().
> > > > + */
> > > > +int handle_page_fault(unsigned long address, unsigned long ip,
> > > > + int is_write, int is_user, int *code_out)
> > > > +{
> > > > + /* !MMU has no pagefault */
> > > > + return -EFAULT;
> > > > +}
> > > > +
> > > > +static void show_segv_info(struct uml_pt_regs *regs)
> > > > +{
> > > > + struct task_struct *tsk = current;
> > > > + struct faultinfo *fi = UPT_FAULTINFO(regs);
> > > > +
> > > > + if (!unhandled_signal(tsk, SIGSEGV))
> > > > + return;
> > > > +
> > > > + pr_warn_ratelimited("%s%s[%d]: segfault at %lx ip %p sp %p error %x",
> > > > + task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
> > > > + tsk->comm, task_pid_nr(tsk), FAULT_ADDRESS(*fi),
> > > > + (void *)UPT_IP(regs), (void *)UPT_SP(regs),
> > > > + fi->error_code);
> > > > +}
> > > > +
> > > > +static void bad_segv(struct faultinfo fi, unsigned long ip)
> > > > +{
> > > > + current->thread.arch.faultinfo = fi;
> > > > + force_sig_fault(SIGSEGV, SEGV_ACCERR, (void __user *)
> > > > FAULT_ADDRESS(fi));
> > > > +}
> > > > +
> > > > +void fatal_sigsegv(void)
> > > > +{
> > > > + force_fatal_sig(SIGSEGV);
> > > > + do_signal(&current->thread.regs);
> > > > + /*
> > > > + * This is to tell gcc that we're not returning -
> > > > do_signal
> > > > + * can, in general, return, but in this case, it's not,
> > > > since
> > > > + * we just got a fatal SIGSEGV queued.
> > > > + */
> > > > + os_dump_core();
> > > > +}
> > > > +
> > > > +/**
> > > > + * segv_handler() - the SIGSEGV handler
> > > > + * @sig: the signal number
> > > > + * @unused_si: the signal info struct; unused in this handler
> > > > + * @regs: the ptrace register information
> > > > + *
> > > > + * The handler first extracts the faultinfo from the UML ptrace
> > > > regs struct.
> > > > + * If the userfault did not happen in an UML userspace process,
> > > > bad_segv is called.
> > > > + * Otherwise the signal did happen in a cloned userspace process,
> > > > handle it.
> > > > + */
> > > > +void segv_handler(int sig, struct siginfo *unused_si, struct
> > > > uml_pt_regs *regs,
> > > > + void *mc)
> > > > +{
> > > > + struct faultinfo *fi = UPT_FAULTINFO(regs);
> > > > +
> > > > + /* !MMU specific part; detection of userspace */
> > > > + /* mark is_user=1 when the IP is from userspace code. */
> > > > + if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) <
> > > > high_physmem)
> > > > + regs->is_user = 1;
> > > > +
> > > > + if (UPT_IS_USER(regs) && !SEGV_IS_FIXABLE(fi)) {
> > > > + show_segv_info(regs);
> > > > + bad_segv(*fi, UPT_IP(regs));
> > > > + return;
> > > > + }
> > > > + segv(*fi, UPT_IP(regs), UPT_IS_USER(regs), regs, mc);
> > > > +
> > > > + /* !MMU specific part; detection of userspace */
> > > > + relay_signal(sig, unused_si, regs, mc);
> > > > +}
> > > > +
> > > > +/*
> > > > + * We give a *copy* of the faultinfo in the regs to segv.
> > > > + * This must be done, since nesting SEGVs could overwrite
> > > > + * the info in the regs. A pointer to the info then would
> > > > + * give us bad data!
> > > > + */
> > > > +unsigned long segv(struct faultinfo fi, unsigned long ip, int
> > > > is_user,
> > > > + struct uml_pt_regs *regs, void *mc)
> > > > +{
> > > > + int si_code;
> > > > + int err;
> > > > + int is_write = FAULT_WRITE(fi);
> > > > + unsigned long address = FAULT_ADDRESS(fi);
> > > > +
> > > > + if (!is_user && regs)
> > > > + current->thread.segv_regs = container_of(regs,
> > > > struct pt_regs, regs);
> > > > +
> > > > + if (current->mm == NULL) {
> > > > + show_regs(container_of(regs, struct pt_regs,
> > > > regs));
> > > > + panic("Segfault with no mm");
> > > > + } else if (!is_user && address > PAGE_SIZE && address <
> > > > TASK_SIZE) {
> > > > + show_regs(container_of(regs, struct pt_regs,
> > > > regs));
> > > > + panic("Kernel tried to access user memory at addr
> > > > 0x%lx, ip 0x%lx",
> > > > + address, ip);
> > > > + }
> > > > +
> > > > + if (SEGV_IS_FIXABLE(&fi))
> > > > + err = handle_page_fault(address, ip, is_write,
> > > > is_user,
> > > > + &si_code);
> > > > + else {
> > > > + err = -EFAULT;
> > > > + /*
> > > > + * A thread accessed NULL, we get a fault, but CR2
> > > > is invalid.
> > > > + * This code is used in __do_copy_from_user() of
> > > > TT mode.
> > > > + * XXX tt mode is gone, so maybe this isn't needed
> > > > any more
> > > > + */
> > > > + address = 0;
> > > > + }
> > > > +
> > > > + if (!err)
> > > > + goto out;
> > > > + else if (!is_user && arch_fixup(ip, regs))
> > > > + goto out;
> > > > +
> > > > + if (!is_user) {
> > > > + show_regs(container_of(regs, struct pt_regs,
> > > > regs));
> > > > + panic("Kernel mode fault at addr 0x%lx, ip 0x%lx",
> > > > + address, ip);
> > > > + }
> > > > +
> > > > + show_segv_info(regs);
> > > > +
> > > > + if (err == -EACCES) {
> > > > + current->thread.arch.faultinfo = fi;
> > > > + force_sig_fault(SIGBUS, BUS_ADRERR, (void __user
> > > > *)address);
> > > > + } else {
> > > > + WARN_ON_ONCE(err != -EFAULT);
> > > > + current->thread.arch.faultinfo = fi;
> > > > + force_sig_fault(SIGSEGV, si_code, (void __user *)
> > > > address);
> > > > + }
> > > > +
> > > > +out:
> > > > + if (regs)
> > > > + current->thread.segv_regs = NULL;
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs
> > > > *regs,
> > > > + void *mc)
> > > > +{
> > > > + int code, err;
> > > > +
> > > > + /* !MMU specific part; detection of userspace */
> > > > + /* mark is_user=1 when the IP is from userspace code. */
> > > > + if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) <
> > > > high_physmem)
> > > > + regs->is_user = 1;
> > > > +
> > > > + if (!UPT_IS_USER(regs)) {
> > > > + if (sig == SIGBUS)
> > > > + pr_err("Bus error - the host /dev/shm or
> > > > /tmp mount likely just ran out of space\n");
> > > > + panic("Kernel mode signal %d", sig);
> > > > + }
> > > > + /* if is_user==1, set return to userspace sig handler to
> > > > relay signal */
> > > > + nommu_relay_signal(mc);
> > > > +
> > > > + arch_examine_signal(sig, regs);
> > > > +
> > > > + /* Is the signal layout for the signal known?
> > > > + * Signal data must be scrubbed to prevent information
> > > > leaks.
> > > > + */
> > > > + code = si->si_code;
> > > > + err = si->si_errno;
> > > > + if ((err == 0) && (siginfo_layout(sig, code) ==
> > > > SIL_FAULT)) {
> > > > + struct faultinfo *fi = UPT_FAULTINFO(regs);
> > > > +
> > > > + current->thread.arch.faultinfo = *fi;
> > > > + force_sig_fault(sig, code, (void __user
> > > > *)FAULT_ADDRESS(*fi));
> > > > + } else {
> > > > + pr_err("Attempted to relay unknown signal %d
> > > > (si_code = %d) with errno %d\n",
> > > > + sig, code, err);
> > > > + force_sig(sig);
> > > > + }
> > > > +}
> > > > +
> > > > +void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs
> > > > *regs,
> > > > + void *mc)
> > > > +{
> > > > + do_IRQ(WINCH_IRQ, regs);
> > > > +}
> > > > diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
> > > > index 53e276e81b37..67dcd88b45b1 100644
> > > > --- a/arch/um/os-Linux/signal.c
> > > > +++ b/arch/um/os-Linux/signal.c
> > > > @@ -40,9 +40,10 @@ static void sig_handler_common(int sig, struct
> > > > siginfo *si, mcontext_t *mc)
> > > > int save_errno = errno;
> > > >
> > > > r.is_user = 0;
> > > > + if (mc)
> > > > + get_regs_from_mc(&r, mc);
> > > > if (sig == SIGSEGV) {
> > > > /* For segfaults, we want the data from the
> > > > sigcontext. */
> > > > - get_regs_from_mc(&r, mc);
> > > > GET_FAULTINFO_FROM_MC(r.faultinfo, mc);
> > > > }
> > > >
> > > > diff --git a/arch/x86/um/nommu/do_syscall_64.c
> > > > b/arch/x86/um/nommu/do_syscall_64.c
> > > > index 74d5bcc4508d..d77e69e097c1 100644
> > > > --- a/arch/x86/um/nommu/do_syscall_64.c
> > > > +++ b/arch/x86/um/nommu/do_syscall_64.c
> > > > @@ -44,6 +44,9 @@ __visible void do_syscall_64(struct pt_regs
> > > > *regs)
> > > > /* set fs register to the original host one */
> > > > os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
> > > >
> > > > + /* save fp registers */
> > > > + asm volatile("fxsaveq %0" : "=m"(*(struct _xstate *)regs->regs.fp));
> > > > +
> > > > if (likely(syscall < NR_syscalls)) {
> > > > PT_REGS_SET_SYSCALL_RETURN(regs,
> > > > EXECUTE_SYSCALL(syscall, regs));
> > > > @@ -54,6 +57,9 @@ __visible void do_syscall_64(struct pt_regs
> > > > *regs)
> > > > /* handle tasks and signals at the end */
> > > > interrupt_end();
> > > >
> > > > + /* restore fp registers */
> > > > + asm volatile("fxrstorq %0" : : "m"((current->thread.regs.regs.fp)));
> > > > +
> > > > /* restore back fs register to userspace configured one */
> > > > os_x86_arch_prctl(0, ARCH_SET_FS,
> > > > (void *)(current->thread.regs.regs.gp[FS_BASE
> > > > diff --git a/arch/x86/um/nommu/entry_64.S
> > > > b/arch/x86/um/nommu/entry_64.S
> > > > index 950447dfa66b..e038bc7b53ac 100644
> > > > --- a/arch/x86/um/nommu/entry_64.S
> > > > +++ b/arch/x86/um/nommu/entry_64.S
> > > > @@ -111,3 +111,17 @@ ENTRY(userspace)
> > > > jmp *%r11
> > > >
> > > > END(userspace)
> > > > +
> > > > +/*
> > > > + * this routine prepares the stack to return via host-generated
> > > > + * signals (e.g., SEGV, FPE) via do_signal() from interrupt_end().
> > > > + */
> > > > +ENTRY(__prep_sigreturn)
> > > > + /*
> > > > + * Switch to current top of stack, so "current->" points
> > > > + * to the right task.
> > > > + */
> > > > + movq current_top_of_stack, %rsp
> > > > +
> > > > + jmp userspace
> > > > +END(__prep_sigreturn)
> > > > diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c
> > > > b/arch/x86/um/nommu/os-Linux/mcontext.c
> > > > index c4ef877d5ea0..87fb2a35e7ff 100644
> > > > --- a/arch/x86/um/nommu/os-Linux/mcontext.c
> > > > +++ b/arch/x86/um/nommu/os-Linux/mcontext.c
> > > > @@ -6,6 +6,11 @@
> > > > #include <sysdep/mcontext.h>
> > > > #include <sysdep/syscalls.h>
> > > >
> > > > +void set_mc_return_address(mcontext_t *mc)
> > > > +{
> > > > + mc->gregs[REG_RIP] = (unsigned long) __prep_sigreturn;
> > > > +}
> > > > +
> > > > void set_mc_sigsys_hook(mcontext_t *mc)
> > > > {
> > > > mc->gregs[REG_RCX] = mc->gregs[REG_RIP];
> > > > diff --git a/arch/x86/um/shared/sysdep/mcontext.h
> > > > b/arch/x86/um/shared/sysdep/mcontext.h
> > > > index 9a0d6087f357..de4041b758f3 100644
> > > > --- a/arch/x86/um/shared/sysdep/mcontext.h
> > > > +++ b/arch/x86/um/shared/sysdep/mcontext.h
> > > > @@ -19,6 +19,7 @@ extern int set_stub_state(struct uml_pt_regs
> > > > *regs, struct stub_data *data,
> > > >
> > > > #ifndef CONFIG_MMU
> > > > extern void set_mc_sigsys_hook(mcontext_t *mc);
> > > > +extern void set_mc_return_address(mcontext_t *mc);
> > > > #endif
> > > >
> > > > #ifdef __i386__
> > > > diff --git a/arch/x86/um/shared/sysdep/ptrace.h
> > > > b/arch/x86/um/shared/sysdep/ptrace.h
> > > > index 8f7476ff6e95..7d553d9f05be 100644
> > > > --- a/arch/x86/um/shared/sysdep/ptrace.h
> > > > +++ b/arch/x86/um/shared/sysdep/ptrace.h
> > > > @@ -65,7 +65,7 @@ struct uml_pt_regs {
> > > > int is_user;
> > > >
> > > > /* Dynamically sized FP registers (holds an XSTATE) */
> > > > - unsigned long fp[];
> > > > + unsigned long fp[] __attribute__((aligned(16)));
> > > > };
> > > >
> > > > #define EMPTY_UML_PT_REGS { }
> > > > diff --git a/arch/x86/um/shared/sysdep/syscalls_64.h
> > > > b/arch/x86/um/shared/sysdep/syscalls_64.h
> > > > index ffd80ee3b9dc..bd152422cdfb 100644
> > > > --- a/arch/x86/um/shared/sysdep/syscalls_64.h
> > > > +++ b/arch/x86/um/shared/sysdep/syscalls_64.h
> > > > @@ -29,6 +29,7 @@ extern syscall_handler_t sys_arch_prctl;
> > > > extern void do_syscall_64(struct pt_regs *regs);
> > > > extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2,
> > > > int64_t a3,
> > > > int64_t a4, int64_t a5, int64_t a6);
> > > > +extern void __prep_sigreturn(void);
> > > > #endif
> > > >
> > > > #endif
> > >
> >
>
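
For readers following the quoted arch/um/nommu/trap.c above: in !MMU mode
both segv_handler() and relay_signal() decide whether a signal came from
userspace only by checking that the faulting IP lies strictly between
uml_reserved and high_physmem. A minimal sketch of that check in plain C
follows. The name ip_is_userspace() and its lo/hi parameters are
hypothetical and not part of the patch; they stand in for UPT_IP(regs),
uml_reserved and high_physmem.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not part of the patch: mirrors the open-interval
 * test (ip > lo && ip < hi) that the quoted code applies to UPT_IP(regs)
 * with uml_reserved as the lower bound and high_physmem as the upper one.
 * Both bounds are exclusive, so an IP equal to either bound is treated
 * as kernel context.
 */
bool ip_is_userspace(unsigned long ip, unsigned long lo, unsigned long hi)
{
	return ip > lo && ip < hi;
}
```

With such a helper, each of the two handlers in the quoted code could set
regs->is_user from one call instead of repeating the comparison. That is
only an illustration of the logic, not a proposed change to the series.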
Thread overview: 25+ messages
2025-06-22 21:32 [PATCH v10 00/13] nommu UML Hajime Tazaki
2025-06-22 21:32 ` [PATCH v10 01/13] x86/um: nommu: elf loader for fdpic Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 02/13] um: decouple MMU specific code from the common part Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 03/13] um: nommu: memory handling Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 04/13] x86/um: nommu: syscall handling Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 05/13] um: nommu: seccomp syscalls hook Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 06/13] x86/um: nommu: process/thread handling Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 07/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 08/13] x86/um/vdso: nommu: vdso memory update Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 09/13] x86/um: nommu: signal handling Hajime Tazaki
2025-06-24 23:20 ` Benjamin Berg
2025-06-27 13:50 ` Hajime Tazaki
2025-06-27 15:02 ` Benjamin Berg
2025-06-30 1:04 ` Hajime Tazaki
2025-07-01 12:03 ` Benjamin Berg
2025-07-02 4:37 ` Hajime Tazaki
2025-07-10 23:59 ` Hajime Tazaki
2025-07-11 9:39 ` Benjamin Berg
2025-07-11 10:05 ` Benjamin Berg
2025-07-12 1:16 ` Hajime Tazaki
2025-07-12 7:58 ` Benjamin Berg
2025-06-22 21:33 ` [PATCH v10 10/13] um: nommu: a work around for MMU dependency to PCI driver Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 11/13] um: change machine name for uname output Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 12/13] um: nommu: add documentation of nommu UML Hajime Tazaki
2025-06-22 21:33 ` [PATCH v10 13/13] um: nommu: plug nommu code into build system Hajime Tazaki