[RFC PATCH 00/13] nommu UML


* [RFC PATCH 00/13] nommu UML
@ 2024-10-24 12:09 Hajime Tazaki
  2024-10-24 12:09 ` [RFC PATCH 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec Hajime Tazaki
                   ` (14 more replies)
  0 siblings, 15 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-24 12:09 UTC (permalink / raw)
  To: linux-um, jdike, richard, anton.ivanov, johannes; +Cc: thehajime, ricarkol

This series adds a nommu (CONFIG_MMU=n) mode to UML.  Comments and
opinions on it would be appreciated.

We have already found several limitations/issues; they are listed
below.

- the prompt configured with /etc/profile is broken (variables are not
  expanded, e.g., ${HOSTNAME%%.*}:$PWD#)
- there is no mechanism to cache the memory mapped by exec(2), so
  files are read from the filesystem on every exec, which makes some
  benchmarks (lmbench) slow.
- a crash in a userspace program crashes the UML kernel instead of
  delivering SIGSEGV to the program.
- commit c27e618 (during the v6.12-rc1 merge), which updates the
  internal procedure of the maple_tree subsystem, introduces an invalid
  access to a vma structure in our case.  We are trying to fix the
  issue, but a random process still crashes on exit(2).

UML has been built with CONFIG_MMU since day 0.  This feature
introduces a nommu mode from a different angle than what the Linux
Kernel Library tried.


What is it for?
===============

- Alleviate the syscall hook overhead implemented with ptrace(2)
- Exercise nommu code over UML (and over KUnit)
- Reduce the dependency on host facilities


How does it work?
=================

To illustrate how this feature works, the following shows how syscalls
are called in the nommu UML environment.

- boot the kernel and set up the zpoline trampoline code (detailed
  later) at address 0x0
- (userspace starts)
- userspace calls the vfork/execve syscalls
- during execve, more specifically in the load_elf_fdpic_binary()
  function, the kernel replaces `syscall/sysenter` instructions with
  `call *%rax`.  At a syscall site %rax holds the syscall number, so
  the call targets an address between 0 and NR_syscalls (around 512),
  where the trampoline code was installed during startup.
- when userspace issues a syscall, it jumps to *%rax, slides through
  the `nop` instructions until they end, and jumps to the hooked
  function, `__kernel_vsyscall`, which is the syscall entry point in
  the nommu UML environment.
- the handler function in sys_call_table[] is called, and the rest
  follows the usual UML syscall path.
- return to userspace
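
The instruction replacement step above can be sketched in plain C.
This is an illustration only, not the code in arch/x86/um/zpoline.c:
the function name `rewrite_syscall_insns` is hypothetical, and the
naive byte scan is an assumption made for brevity (a real translator
has to decode instruction boundaries so that it does not patch bytes
belonging to an immediate or to another opcode).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: scan a code buffer and replace each 2-byte
 * syscall (0f 05) or sysenter (0f 34) with call *%rax (ff d0).
 * Returns the number of instructions patched.
 */
static size_t rewrite_syscall_insns(uint8_t *code, size_t len)
{
	size_t i, patched = 0;

	for (i = 0; i + 1 < len; i++) {
		if (code[i] == 0x0f &&
		    (code[i + 1] == 0x05 || code[i + 1] == 0x34)) {
			/* same length, so no code needs to be moved */
			code[i] = 0xff;
			code[i + 1] = 0xd0;
			patched++;
			i++;	/* skip the second byte just rewritten */
		}
	}
	return patched;
}
```

Because the replacement has the same length as the original
instruction, the surrounding code layout is unchanged.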


What are the differences from MMU-full UML?
===========================================

The current nommu implementation adds three features which MMU-full
UML doesn't have:

- the kernel address space is directly accessible from userspace
  - so access_ok() always returns 1
  - the generic implementations of memcpy/strcpy/futex are used as well
- an alternate syscall entry point that does not use ptrace
- translation of syscall/sysenter instructions into calls to
  trampoline code and syscall hooks

With these modifications, unmodified userspace binaries can be used
with nommu UML.
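
A rough sketch of what such trampoline code could look like, written
into an ordinary buffer rather than at address 0.  Everything here is
an assumption for illustration: the names `build_trampoline` and
`SKETCH_NR_SYSCALLS` are hypothetical, the sled length of 512 only
mirrors the "around 512" figure, and jumping through %r11 is just one
possible choice (the syscall instruction clobbers %r11 anyway); the
series may do this differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NR_SYSCALLS 512	/* assumed sled length */
#define SKETCH_TAIL_LEN    13	/* movabs (10 bytes) + jmp (3 bytes) */

/*
 * Hypothetical helper: fill buf with a nop sled followed by an
 * absolute jump to hook.  Returns the number of bytes written,
 * or 0 if buf is too small.
 */
static size_t build_trampoline(uint8_t *buf, size_t len, uint64_t hook)
{
	uint8_t *p;
	int i;

	if (len < SKETCH_NR_SYSCALLS + SKETCH_TAIL_LEN)
		return 0;

	/* a one-byte nop (0x90) at every possible syscall number */
	memset(buf, 0x90, SKETCH_NR_SYSCALLS);
	p = buf + SKETCH_NR_SYSCALLS;

	/* movabs $hook, %r11: 49 bb imm64 (little endian) */
	*p++ = 0x49;
	*p++ = 0xbb;
	for (i = 0; i < 8; i++)
		*p++ = (uint8_t)(hook >> (8 * i));

	/* jmp *%r11: 41 ff e3 */
	*p++ = 0x41;
	*p++ = 0xff;
	*p++ = 0xe3;

	return (size_t)(p - buf);
}
```

Whatever syscall number %rax holds, execution entering the sled at
that offset falls through the remaining nops and reaches the same
jump, so a single hook serves every syscall.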


History
=======

This feature was originally introduced by Ricardo Koller at Open
Source Summit NA 2020.  It has since been integrated with the syscall
translation functionality, along with a cleanup of the original code.

Building and running
====================

```
% make ARCH=um x86_64_nommu_defconfig
% make ARCH=um
```

will build UML with CONFIG_MMU=n applied.

KUnit tests can be run with the following command:

```
% ./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_MMU=n
```

To run a typical Linux distribution, we need a nommu-aware userspace.
We can use a stock version of Alpine Linux with nommu builds of
busybox and musl-libc.


Preparing root filesystem
=========================

nommu UML requires a standard library which is aware of nommu
kernels.  We have tested custom-built musl-libc and busybox, both of
which have built-in support for nommu kernels.

There are no Linux distributions available for nommu on the x86_64
architecture, so we need to prepare our own image for the root
filesystem.  We use Alpine Linux as the base distribution and replace
busybox and musl-libc on top of it.  The following steps prepare the
filesystem for a quick start.

```
     container_id=$(docker create ghcr.io/thehajime/alpine:3.20.3-um-nommu)
     docker start $container_id
     docker wait $container_id
     docker export $container_id > alpine.tar
     docker rm $container_id

     mnt=$(mktemp -d)
     dd if=/dev/zero of=alpine.ext4 bs=1 count=0 seek=1G
     sudo chmod og+wr "alpine.ext4"
     yes 2>/dev/null | mkfs.ext4 "alpine.ext4" || true
     sudo mount "alpine.ext4" $mnt
     sudo tar -xf alpine.tar -C $mnt
     sudo umount $mnt
```

This will create a file image, `alpine.ext4`, which contains the nommu
builds of busybox and musl on top of the Alpine Linux root filesystem.
The file can be passed to UML with the `ubd0=` command line argument.

```
  ./vmlinux eth0=tuntap,tap100,0e:fd:0:0:0:1,172.17.0.1 ubd0=./alpine.ext4 rw mem=1024m loglevel=8 init=/sbin/init
```

We plan to upstream apk packages for busybox and musl so that we can
follow the proper procedure to set up the root filesystem.


Quick start with docker
=======================

There is a docker image that lets you get started quickly in a single step.

```
  docker run -it -v /dev/shm:/dev/shm --rm ghcr.io/thehajime/alpine:3.20.3-um-nommu
```

This will launch a UML instance with a pre-configured root filesystem.

Benchmark
=========

The following shows an example of performance measurements conducted
with lmbench and a (self-crafted) getpid benchmark (with the v6.12-rc3
linus tree).

### lmbench (usec)

||native|um|um-nommu|
|--|--|--|--|
|select-10    |0.5645|28.3738|0.2647|
|select-100   |2.3872|28.8385|1.1021|
|select-1000  |20.5527|37.6364|9.4264|
|syscall      |0.1735|26.8711|0.1037|
|read         |0.3442|28.5771|0.1370|
|write        |0.2862|28.7340|0.1236|
|stat         |1.9236|38.5928|0.4640|
|open/close   |3.8308|66.8451|0.7789|
|fork+sh      |1176.4444|8221.5000|21443.0000|
|fork+execve  |533.1053|3034.5000|4894.3333|

### do_getpid bench (nsec)

||native|um|um-nommu|
|--|--|--|--|
|getpid | 180 | 31579 | 101|
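
The getpid benchmark itself is not part of this posting.  As a rough
idea of how such a number can be obtained, here is a minimal sketch;
the function name `bench_getpid_ns`, the use of CLOCK_MONOTONIC and
the averaging over a fixed iteration count are all assumptions, not a
description of the benchmark used for the table above.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/*
 * Hypothetical helper: average cost of one getpid syscall in
 * nanoseconds, measured over iters calls.
 */
static uint64_t bench_getpid_ns(unsigned long iters)
{
	struct timespec start, end;
	unsigned long i;
	int64_t elapsed;

	if (iters == 0)
		return 0;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iters; i++)
		syscall(SYS_getpid);	/* raw syscall, no libc caching */
	clock_gettime(CLOCK_MONOTONIC, &end);

	elapsed = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL +
		  (end.tv_nsec - start.tv_nsec);
	return (uint64_t)elapsed / iters;
}
```

Calling `bench_getpid_ns(1000000)` and printing the result gives a
per-call figure in the same unit as the table.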


Limitations
===========

generic nommu limitations
-------------------------
Since this port is a nommu kernel, the implementation inherits the
characteristics of other nommu kernels (riscv, arm, etc.), described
below.

- vfork(2) should be used instead of fork(2)
- the ELF loader only loads PIE (position independent executable)
  binaries
- all processes share a single address space
- mmap(2) offers only a subset of its functionality (e.g., MAP_FIXED
  is unsupported)

Thus, the choice of userspace programs is limited.  We have tested
Alpine Linux with musl-libc, which supports nommu kernels.
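
For the first limitation, a userspace program that spawns a child has
to follow the vfork(2) + exec pattern.  A minimal sketch, using only
standard POSIX calls (the helper name `spawn_vfork` is hypothetical):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] with vfork(2) + execv(2) and
 * return its exit status, or -1 on failure.
 */
static int spawn_vfork(char *const argv[])
{
	int status;
	pid_t pid = vfork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* child: only exec or _exit are safe after vfork() */
		execv(argv[0], argv);
		_exit(127);	/* exec failed */
	}
	/* parent resumes once the child has exec'ed or exited */
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child borrows the parent's memory until it calls exec or _exit,
which is why it must not touch anything else in between.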

access to mmap_min_addr
-----------------------
As the syscall translation mechanism relies on the ability to read
and write memory at address zero (0x0), the host kernel needs to be
configured with the following command:

```
% sh -c "echo 0 > /proc/sys/vm/mmap_min_addr"
```

supported architecture
----------------------
The current implementation of nommu UML only works with the x86_64
SUBARCH.  We have not tested it in a 32-bit environment.

target of syscall translation
-----------------------------
For the moment, the syscall translation only applies to the
executable and the interpreter of ELF binaries processed by the
execve(2) syscall: other libraries, such as linked or dlopen-ed ones,
aren't translated; we may be able to trigger the translation via
LD_PRELOAD.

Note that with musl-libc in Alpine Linux, which we have tested, most
syscalls are implemented in the interpreter file (ld-musl-x86_64.so),
so syscall/sysenter instructions issued from linked/loaded libraries
should be rare.  It is still possible, though, in which case a
workaround with LD_PRELOAD is effective.


Further readings about NOMMU UML
================================

- NOMMU UML (original code by Ricardo Koller)
https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf

- zpoline: syscall translation mechanism
https://www.usenix.org/conference/atc23/presentation/yasukata

Please review the following changes for suitability for inclusion. If you have
any objections or suggestions for improvement, please respond to the patches. If
you agree with the changes, please provide your Acked-by.

The following changes since commit c2ee9f594da826bea183ed14f2cc029c719bf4da:

  KVM: selftests: Fix build on on non-x86 architectures (2024-10-21 15:49:33 -0700)

are available in the Git repository at:

  https://github.com/thehajime/linux 82a7ee8b31c51edb47e144922581824a3b5e371d
  https://github.com/thehajime/linux/tree/um-nommu-v6.12-rc4-rfc

Hajime Tazaki (13):
  fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec
  x86/um: nommu: elf loader for fdpic
  um: nommu: memory handling
  x86/um: nommu: syscall handling
  x86/um: nommu: syscall translation by zpoline
  x86/um: nommu: process/thread handling
  um: nommu: configure fs register on host syscall invocation
  x86/um/vdso: nommu: vdso memory update
  x86/um: nommu: signal handling
  x86/um: nommu: stack save/restore on vfork
  um: change machine name for uname output
  um: nommu: add documentation of nommu UML
  um: nommu: plug nommu code into build system

 Documentation/virt/uml/nommu-uml.rst    | 219 +++++++++++++++++++++++
 arch/um/Kconfig                         |  13 +-
 arch/um/Makefile                        |   6 +
 arch/um/configs/x86_64_nommu_defconfig  |  64 +++++++
 arch/um/include/asm/futex.h             |   4 +
 arch/um/include/asm/mmu.h               |   8 +
 arch/um/include/asm/mmu_context.h       |  14 +-
 arch/um/include/asm/ptrace-generic.h    |  17 ++
 arch/um/include/asm/tlbflush.h          |  23 ++-
 arch/um/include/asm/uaccess.h           |   7 +-
 arch/um/include/shared/common-offsets.h |   3 +
 arch/um/include/shared/os.h             |   9 +
 arch/um/kernel/Makefile                 |   3 +-
 arch/um/kernel/exec.c                   |   8 +
 arch/um/kernel/mem.c                    |  13 ++
 arch/um/kernel/physmem.c                |   6 +
 arch/um/kernel/process.c                |  34 +++-
 arch/um/kernel/skas/Makefile            |   3 +-
 arch/um/kernel/trap.c                   |   4 +
 arch/um/os-Linux/main.c                 |   5 +
 arch/um/os-Linux/process.c              |  22 +++
 arch/um/os-Linux/skas/process.c         |   4 +
 arch/um/os-Linux/start_up.c             |  47 +++++
 arch/um/os-Linux/time.c                 |   3 +-
 arch/um/os-Linux/util.c                 |   3 +-
 arch/x86/um/Makefile                    |  18 ++
 arch/x86/um/asm/elf.h                   |  12 +-
 arch/x86/um/asm/module.h                |  19 +-
 arch/x86/um/asm/processor.h             |  12 ++
 arch/x86/um/do_syscall_64.c             | 113 ++++++++++++
 arch/x86/um/entry_64.S                  | 110 ++++++++++++
 arch/x86/um/shared/sysdep/syscalls_64.h |   4 +
 arch/x86/um/signal.c                    |  26 +++
 arch/x86/um/syscalls_64.c               |  67 +++++++
 arch/x86/um/vdso/um_vdso.c              |  20 +++
 arch/x86/um/vdso/vma.c                  |  16 +-
 arch/x86/um/zpoline.c                   | 228 ++++++++++++++++++++++++
 fs/Kconfig.binfmt                       |   2 +-
 fs/binfmt_elf_fdpic.c                   |  10 ++
 39 files changed, 1164 insertions(+), 35 deletions(-)
 create mode 100644 Documentation/virt/uml/nommu-uml.rst
 create mode 100644 arch/um/configs/x86_64_nommu_defconfig
 create mode 100644 arch/x86/um/do_syscall_64.c
 create mode 100644 arch/x86/um/entry_64.S
 create mode 100644 arch/x86/um/zpoline.c

-- 
2.43.0




* [RFC PATCH 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
@ 2024-10-24 12:09 ` Hajime Tazaki
  2024-10-24 12:09 ` [RFC PATCH 02/13] x86/um: nommu: elf loader for fdpic Hajime Tazaki
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-24 12:09 UTC (permalink / raw)
  To: linux-um, jdike, richard, anton.ivanov, johannes
  Cc: thehajime, ricarkol, Alexander Viro, Christian Brauner, Jan Kara,
	Eric Biederman, Kees Cook, linux-fsdevel, linux-mm

Add an architecture hook to the FDPIC ELF loader, called at the end of
loading binaries, to finalize the mapped memory before moving on to
the exec function.  The hook is used by UML under !MMU to translate
syscall/sysenter instructions during execve.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <kees@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 fs/binfmt_elf_fdpic.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 4fe5bb9f1b1f..ab16fdf475b0 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -175,6 +175,12 @@ static int elf_fdpic_fetch_phdrs(struct elf_fdpic_params *params,
 	return 0;
 }
 
+int __weak elf_arch_finalize_exec(struct elf_fdpic_params *exec_params,
+				  struct elf_fdpic_params *interp_params)
+{
+	return 0;
+}
+
 /*****************************************************************************/
 /*
  * load an fdpic binary into various bits of memory
@@ -457,6 +463,10 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
 			    dynaddr);
 #endif
 
+	retval = elf_arch_finalize_exec(&exec_params, &interp_params);
+	if (retval)
+		goto error;
+
 	finalize_exec(bprm);
 	/* everything is now ready... get the userspace context ready to roll */
 	entryaddr = interp_params.entry_addr ?: exec_params.entry_addr;
-- 
2.43.0




* [RFC PATCH 02/13] x86/um: nommu: elf loader for fdpic
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
  2024-10-24 12:09 ` [RFC PATCH 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec Hajime Tazaki
@ 2024-10-24 12:09 ` Hajime Tazaki
  2024-10-25  8:56   ` Johannes Berg
  2024-10-24 12:09 ` [RFC PATCH 03/13] um: nommu: memory handling Hajime Tazaki
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-24 12:09 UTC (permalink / raw)
  To: linux-um, jdike, richard, anton.ivanov, johannes
  Cc: thehajime, ricarkol, Eric Biederman, Kees Cook, Alexander Viro,
	Christian Brauner, Jan Kara, linux-mm, linux-fsdevel

As UML supports the CONFIG_MMU=n case, it has to use an alternate ELF
loader, the FDPIC ELF loader.  This commit adds the necessary
definitions to the arch, as UML has not used this loader so far.  It
also updates the Kconfig file to allow BINFMT_ELF_FDPIC for UML under
!MMU.

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/include/asm/mmu.h            |  5 +++++
 arch/um/include/asm/ptrace-generic.h | 17 +++++++++++++++++
 arch/x86/um/asm/elf.h                |  9 +++++++--
 arch/x86/um/asm/module.h             | 19 +------------------
 fs/Kconfig.binfmt                    |  2 +-
 5 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index a3eaca41ff61..01422b761aa0 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -14,6 +14,11 @@ typedef struct mm_context {
 	/* Address range in need of a TLB sync */
 	unsigned long sync_tlb_range_from;
 	unsigned long sync_tlb_range_to;
+
+#ifdef CONFIG_BINFMT_ELF_FDPIC
+	unsigned long   exec_fdpic_loadmap;
+	unsigned long   interp_fdpic_loadmap;
+#endif
 } mm_context_t;
 
 #endif
diff --git a/arch/um/include/asm/ptrace-generic.h b/arch/um/include/asm/ptrace-generic.h
index 4696f24d1492..fefa7631394e 100644
--- a/arch/um/include/asm/ptrace-generic.h
+++ b/arch/um/include/asm/ptrace-generic.h
@@ -29,6 +29,12 @@ struct pt_regs {
 
 #define PTRACE_OLDSETOPTIONS 21
 
+#ifdef CONFIG_BINFMT_ELF_FDPIC
+#define PTRACE_GETFDPIC		31
+#define PTRACE_GETFDPIC_EXEC	0
+#define PTRACE_GETFDPIC_INTERP	1
+#endif
+
 struct task_struct;
 
 extern long subarch_ptrace(struct task_struct *child, long request,
@@ -44,6 +50,17 @@ extern void clear_flushed_tls(struct task_struct *task);
 extern int syscall_trace_enter(struct pt_regs *regs);
 extern void syscall_trace_leave(struct pt_regs *regs);
 
+#ifndef CONFIG_MMU
+#include <asm-generic/bug.h>
+
+static inline const struct user_regset_view *task_user_regset_view(
+	struct task_struct *task)
+{
+	WARN_ON_ONCE(true);
+	return 0;
+}
+#endif
+
 #endif
 
 #endif
diff --git a/arch/x86/um/asm/elf.h b/arch/x86/um/asm/elf.h
index 6052200fe925..4f87980bc9e9 100644
--- a/arch/x86/um/asm/elf.h
+++ b/arch/x86/um/asm/elf.h
@@ -8,6 +8,8 @@
 #include <asm/user.h>
 #include <skas.h>
 
+#define ELF_FDPIC_CORE_EFLAGS  0
+
 #ifdef CONFIG_X86_32
 
 #define R_386_NONE	0
@@ -188,8 +190,11 @@ extern int arch_setup_additional_pages(struct linux_binprm *bprm,
 
 extern unsigned long um_vdso_addr;
 #define AT_SYSINFO_EHDR 33
-#define ARCH_DLINFO	NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr)
-
+#define ARCH_DLINFO						\
+do {								\
+	NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr);		\
+	NEW_AUX_ENT(AT_MINSIGSTKSZ, 0);			\
+} while (0)
 #endif
 
 typedef unsigned long elf_greg_t;
diff --git a/arch/x86/um/asm/module.h b/arch/x86/um/asm/module.h
index a3b061d66082..4f7be1481979 100644
--- a/arch/x86/um/asm/module.h
+++ b/arch/x86/um/asm/module.h
@@ -2,23 +2,6 @@
 #ifndef __UM_MODULE_H
 #define __UM_MODULE_H
 
-/* UML is simple */
-struct mod_arch_specific
-{
-};
-
-#ifdef CONFIG_X86_32
-
-#define Elf_Shdr Elf32_Shdr
-#define Elf_Sym Elf32_Sym
-#define Elf_Ehdr Elf32_Ehdr
-
-#else
-
-#define Elf_Shdr Elf64_Shdr
-#define Elf_Sym Elf64_Sym
-#define Elf_Ehdr Elf64_Ehdr
-
-#endif
+#include <asm-generic/module.h>
 
 #endif
diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index bd2f530e5740..419ba0282806 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -58,7 +58,7 @@ config ARCH_USE_GNU_PROPERTY
 config BINFMT_ELF_FDPIC
 	bool "Kernel support for FDPIC ELF binaries"
 	default y if !BINFMT_ELF
-	depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
+	depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
 	select ELFCORE
 	help
 	  ELF FDPIC binaries are based on ELF, but allow the individual load
-- 
2.43.0




* [RFC PATCH 03/13] um: nommu: memory handling
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
  2024-10-24 12:09 ` [RFC PATCH 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec Hajime Tazaki
  2024-10-24 12:09 ` [RFC PATCH 02/13] x86/um: nommu: elf loader for fdpic Hajime Tazaki
@ 2024-10-24 12:09 ` Hajime Tazaki
  2024-10-25  9:11   ` Johannes Berg
  2024-10-24 12:09 ` [RFC PATCH 04/13] x86/um: nommu: syscall handling Hajime Tazaki
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-24 12:09 UTC (permalink / raw)
  To: linux-um, jdike, richard, anton.ivanov, johannes; +Cc: thehajime, ricarkol

This commit adds memory operations for UML in the !MMU environment.

Some parts of the original UML code relying on CONFIG_MMU are excluded
from compilation when !CONFIG_MMU.  Additionally, the generic
implementations of uaccess, futex, and memcpy/strnlen/strncpy can be
used, as user- and kernel-space share the address space in
!CONFIG_MMU mode.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/include/asm/futex.h       |  4 ++++
 arch/um/include/asm/mmu.h         |  3 +++
 arch/um/include/asm/mmu_context.h | 14 ++++++++++++--
 arch/um/include/asm/tlbflush.h    | 23 ++++++++++++++++++++++-
 arch/um/include/asm/uaccess.h     |  7 ++++---
 arch/um/include/shared/os.h       |  6 ++++++
 arch/um/kernel/Makefile           |  3 ++-
 arch/um/kernel/mem.c              | 13 +++++++++++++
 arch/um/kernel/physmem.c          |  6 ++++++
 arch/um/kernel/process.c          |  1 +
 arch/um/kernel/skas/Makefile      |  3 ++-
 arch/um/kernel/trap.c             |  4 ++++
 arch/um/os-Linux/process.c        |  5 +++++
 13 files changed, 84 insertions(+), 8 deletions(-)

diff --git a/arch/um/include/asm/futex.h b/arch/um/include/asm/futex.h
index 780aa6bfc050..89a8ac0b6963 100644
--- a/arch/um/include/asm/futex.h
+++ b/arch/um/include/asm/futex.h
@@ -8,7 +8,11 @@
 
 
 int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr);
+#ifdef CONFIG_MMU
 int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 			      u32 oldval, u32 newval);
+#else
+#include <asm-generic/futex.h>
+#endif
 
 #endif
diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index 01422b761aa0..d4087f9499e2 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -15,10 +15,13 @@ typedef struct mm_context {
 	unsigned long sync_tlb_range_from;
 	unsigned long sync_tlb_range_to;
 
+#ifndef CONFIG_MMU
+	unsigned long   end_brk;
 #ifdef CONFIG_BINFMT_ELF_FDPIC
 	unsigned long   exec_fdpic_loadmap;
 	unsigned long   interp_fdpic_loadmap;
 #endif
+#endif /* !CONFIG_MMU */
 } mm_context_t;
 
 #endif
diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h
index 23dcc914d44e..279351beede5 100644
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -37,10 +37,20 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 }
 
 #define init_new_context init_new_context
-extern int init_new_context(struct task_struct *task, struct mm_struct *mm);
-
 #define destroy_context destroy_context
+#ifdef CONFIG_MMU
+extern int init_new_context(struct task_struct *task, struct mm_struct *mm);
 extern void destroy_context(struct mm_struct *mm);
+#else
+static inline int init_new_context(struct task_struct *task, struct mm_struct *mm)
+{
+	return 0;
+}
+static inline void destroy_context(struct mm_struct *mm)
+{
+}
+#endif
+
 
 #include <asm-generic/mmu_context.h>
 
diff --git a/arch/um/include/asm/tlbflush.h b/arch/um/include/asm/tlbflush.h
index db997976b6ea..620debb84956 100644
--- a/arch/um/include/asm/tlbflush.h
+++ b/arch/um/include/asm/tlbflush.h
@@ -29,7 +29,7 @@
  *  - flush_tlb_range(vma, start, end) flushes a range of pages
  *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
  */
-
+#ifdef CONFIG_MMU
 extern int um_tlb_sync(struct mm_struct *mm);
 
 extern void flush_tlb_all(void);
@@ -55,5 +55,26 @@ static inline void flush_tlb_kernel_range(unsigned long start,
 	/* Kernel needs to be synced immediately */
 	um_tlb_sync(&init_mm);
 }
+#else
+static inline int um_tlb_sync(struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline void flush_tlb_page(struct vm_area_struct *vma,
+				  unsigned long address)
+{
+}
+
+static inline void flush_tlb_range(struct vm_area_struct *vma,
+				   unsigned long start, unsigned long end)
+{
+}
+
+static inline void flush_tlb_kernel_range(unsigned long start,
+					  unsigned long end)
+{
+}
+#endif
 
 #endif
diff --git a/arch/um/include/asm/uaccess.h b/arch/um/include/asm/uaccess.h
index 1d4b6bbc1b65..9bfee12cb6b7 100644
--- a/arch/um/include/asm/uaccess.h
+++ b/arch/um/include/asm/uaccess.h
@@ -22,6 +22,7 @@
 #define __addr_range_nowrap(addr, size) \
 	((unsigned long) (addr) <= ((unsigned long) (addr) + (size)))
 
+#ifdef CONFIG_MMU
 extern unsigned long raw_copy_from_user(void *to, const void __user *from, unsigned long n);
 extern unsigned long raw_copy_to_user(void __user *to, const void *from, unsigned long n);
 extern unsigned long __clear_user(void __user *mem, unsigned long len);
@@ -33,9 +34,6 @@ static inline int __access_ok(const void __user *ptr, unsigned long size);
 
 #define INLINE_COPY_FROM_USER
 #define INLINE_COPY_TO_USER
-
-#include <asm-generic/uaccess.h>
-
 static inline int __access_ok(const void __user *ptr, unsigned long size)
 {
 	unsigned long addr = (unsigned long)ptr;
@@ -43,6 +41,9 @@ static inline int __access_ok(const void __user *ptr, unsigned long size)
 		(__under_task_size(addr, size) ||
 		 __access_ok_vsyscall(addr, size));
 }
+#endif
+
+#include <asm-generic/uaccess.h>
 
 /* no pagefaults for kernel addresses in um */
 #define __get_kernel_nofault(dst, src, type, err_label)			\
diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index 9a039d6f1f74..f6d3f3d7eade 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -196,7 +196,13 @@ extern void get_host_cpu_features(
 extern int create_mem_file(unsigned long long len);
 
 /* tlb.c */
+#ifdef CONFIG_MMU
 extern void report_enomem(void);
+#else
+static inline void report_enomem(void)
+{
+}
+#endif
 
 /* process.c */
 extern unsigned long os_process_pc(int pid);
diff --git a/arch/um/kernel/Makefile b/arch/um/kernel/Makefile
index f8567b933ffa..b41e9bcabbe3 100644
--- a/arch/um/kernel/Makefile
+++ b/arch/um/kernel/Makefile
@@ -16,9 +16,10 @@ extra-y := vmlinux.lds
 
 obj-y = config.o exec.o exitcode.o irq.o ksyms.o mem.o \
 	physmem.o process.o ptrace.o reboot.o sigio.o \
-	signal.o sysrq.o time.o tlb.o trap.o \
+	signal.o sysrq.o time.o trap.o \
 	um_arch.o umid.o maccess.o kmsg_dump.o capflags.o skas/
 obj-y += load_file.o
+obj-$(CONFIG_MMU) += tlb.o
 
 obj-$(CONFIG_BLK_DEV_INITRD) += initrd.o
 obj-$(CONFIG_GPROF)	+= gprof_syms.o
diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index a5b4fe2ad931..c7498d609f0f 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -67,7 +67,11 @@ void __init mem_init(void)
 	 * to be turned on.
 	 */
 	brk_end = (unsigned long) UML_ROUND_UP(sbrk(0));
+#ifdef CONFIG_MMU
 	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 0);
+#else
+	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 1);
+#endif
 	memblock_free((void *)brk_end, uml_reserved - brk_end);
 	uml_reserved = brk_end;
 
@@ -81,6 +85,7 @@ void __init mem_init(void)
  * Create a page table and place a pointer to it in a middle page
  * directory entry.
  */
+#ifdef CONFIG_MMU
 static void __init one_page_table_init(pmd_t *pmd)
 {
 	if (pmd_none(*pmd)) {
@@ -137,6 +142,12 @@ static void __init fixrange_init(unsigned long start, unsigned long end,
 		j = 0;
 	}
 }
+#else
+static void __init fixrange_init(unsigned long start, unsigned long end,
+				 pgd_t *pgd_base)
+{
+}
+#endif
 
 static void __init fixaddr_user_init( void)
 {
@@ -218,6 +229,7 @@ void *uml_kmalloc(int size, int flags)
 	return kmalloc(size, flags);
 }
 
+#ifdef CONFIG_MMU
 static const pgprot_t protection_map[16] = {
 	[VM_NONE]					= PAGE_NONE,
 	[VM_READ]					= PAGE_READONLY,
@@ -237,3 +249,4 @@ static const pgprot_t protection_map[16] = {
 	[VM_SHARED | VM_EXEC | VM_WRITE | VM_READ]	= PAGE_SHARED
 };
 DECLARE_VM_GET_PAGE_PROT
+#endif
diff --git a/arch/um/kernel/physmem.c b/arch/um/kernel/physmem.c
index fb2adfb49945..00ed840301ca 100644
--- a/arch/um/kernel/physmem.c
+++ b/arch/um/kernel/physmem.c
@@ -90,7 +90,11 @@ void __init setup_physmem(unsigned long start, unsigned long reserve_end,
 		exit(1);
 	}
 
+#ifdef CONFIG_MMU
 	physmem_fd = create_mem_file(len + highmem);
+#else
+	physmem_fd = -1;
+#endif
 
 	err = os_map_memory((void *) reserve_end, physmem_fd, reserve,
 			    map_size, 1, 1, 1);
@@ -101,6 +105,7 @@ void __init setup_physmem(unsigned long start, unsigned long reserve_end,
 		exit(1);
 	}
 
+#ifdef CONFIG_MMU
 	/*
 	 * Special kludge - This page will be mapped in to userspace processes
 	 * from physmem_fd, so it needs to be written out there.
@@ -108,6 +113,7 @@ void __init setup_physmem(unsigned long start, unsigned long reserve_end,
 	os_seek_file(physmem_fd, __pa(__syscall_stub_start));
 	os_write_file(physmem_fd, __syscall_stub_start, PAGE_SIZE);
 	os_fsync_file(physmem_fd);
+#endif
 
 	memblock_add(__pa(start), len + highmem);
 	memblock_reserve(__pa(start), reserve);
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index be2856af6d4c..b1b608afa036 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -29,6 +29,7 @@
 #include <asm/mmu_context.h>
 #include <asm/switch_to.h>
 #include <asm/exec.h>
+#include <asm/tlbflush.h>
 #include <linux/uaccess.h>
 #include <as-layout.h>
 #include <kern_util.h>
diff --git a/arch/um/kernel/skas/Makefile b/arch/um/kernel/skas/Makefile
index 6f86d53e3d69..97ea2d393e92 100644
--- a/arch/um/kernel/skas/Makefile
+++ b/arch/um/kernel/skas/Makefile
@@ -3,7 +3,8 @@
 # Copyright (C) 2002 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
 #
 
-obj-y := stub.o mmu.o process.o syscall.o uaccess.o
+obj-y := stub.o process.o
+obj-$(CONFIG_MMU) += mmu.o syscall.o uaccess.o
 
 # stub.o is in the stub, so it can't be built with profiling
 # GCC hardened also auto-enables -fpic, but we need %ebx so it can't work ->
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 97c8df9c4401..079f33d7d20c 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -24,6 +24,7 @@
 int handle_page_fault(unsigned long address, unsigned long ip,
 		      int is_write, int is_user, int *code_out)
 {
+#ifdef CONFIG_MMU
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	pmd_t *pmd;
@@ -129,6 +130,9 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 		goto out_nosemaphore;
 	pagefault_out_of_memory();
 	return 0;
+#else
+	return -EFAULT;
+#endif
 }
 
 static void show_segv_info(struct uml_pt_regs *regs)
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index e52dd37ddadc..b164873da2db 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -144,8 +144,13 @@ int os_map_memory(void *virt, int fd, unsigned long long off, unsigned long len,
 	prot = (r ? PROT_READ : 0) | (w ? PROT_WRITE : 0) |
 		(x ? PROT_EXEC : 0);
 
+#ifdef UML_CONFIG_MMU
 	loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED,
 		     fd, off);
+#else
+	loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED | MAP_ANONYMOUS,
+		     fd, off);
+#endif
 	if (loc == MAP_FAILED)
 		return -errno;
 	return 0;
-- 
2.43.0




* [RFC PATCH 04/13] x86/um: nommu: syscall handling
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
                   ` (2 preceding siblings ...)
  2024-10-24 12:09 ` [RFC PATCH 03/13] um: nommu: memory handling Hajime Tazaki
@ 2024-10-24 12:09 ` Hajime Tazaki
  2024-10-25  9:14   ` Johannes Berg
  2024-10-24 12:09 ` [RFC PATCH 05/13] x86/um: nommu: syscall translation by zpoline Hajime Tazaki
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-24 12:09 UTC (permalink / raw)
  To: linux-um, jdike, richard, anton.ivanov, johannes; +Cc: thehajime, ricarkol

This commit introduces the syscall entry point for !MMU mode.  It
uses an entry function, __kernel_vsyscall, a kernel-wide global
symbol accessible from any location.

Although it isn't in the scope of this commit, the entry point can
also be exposed via the vdso image, which is directly accessible from
userspace.  A standard library (i.e., libc) can use this entry point
to implement its syscall wrappers; we can also use it by hooking
syscalls of unmodified userspace applications/libraries, which will
be implemented in a subsequent commit.

This only supports the 64-bit mode of the x86 architecture.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/x86/um/do_syscall_64.c             | 42 ++++++++++++
 arch/x86/um/entry_64.S                  | 88 +++++++++++++++++++++++++
 arch/x86/um/shared/sysdep/syscalls_64.h |  4 ++
 3 files changed, 134 insertions(+)
 create mode 100644 arch/x86/um/do_syscall_64.c
 create mode 100644 arch/x86/um/entry_64.S

diff --git a/arch/x86/um/do_syscall_64.c b/arch/x86/um/do_syscall_64.c
new file mode 100644
index 000000000000..7af6e881ad58
--- /dev/null
+++ b/arch/x86/um/do_syscall_64.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/kernel.h>
+#include <linux/ptrace.h>
+#include <kern_util.h>
+#include <sysdep/syscalls.h>
+#include <os.h>
+
+#ifndef CONFIG_MMU
+
+__visible void do_syscall_64(struct pt_regs *regs)
+{
+	int syscall;
+
+	syscall = PT_SYSCALL_NR(regs->regs.gp);
+	UPT_SYSCALL_NR(&regs->regs) = syscall;
+
+	pr_debug("syscall(%d) (current=%lx) (fn=%lx)\n",
+		 syscall, (unsigned long)current,
+		 (unsigned long)sys_call_table[syscall]);
+
+	if (likely(syscall < NR_syscalls)) {
+		PT_REGS_SET_SYSCALL_RETURN(regs,
+				EXECUTE_SYSCALL(syscall, regs));
+	}
+
+	pr_debug("syscall(%d) --> %lx\n", syscall,
+		regs->regs.gp[HOST_AX]);
+
+	PT_REGS_SYSCALL_RET(regs) = regs->regs.gp[HOST_AX];
+
+	/* force do_signal() --> is_syscall() */
+	set_thread_flag(TIF_SIGPENDING);
+	interrupt_end();
+
+	/* execve succeeded */
+	if (syscall == __NR_execve && regs->regs.gp[HOST_AX] == 0) {
+		userspace(&current->thread.regs.regs,
+			current_thread_info()->aux_fp_regs);
+	}
+}
+#endif
diff --git a/arch/x86/um/entry_64.S b/arch/x86/um/entry_64.S
new file mode 100644
index 000000000000..12e11ac03543
--- /dev/null
+++ b/arch/x86/um/entry_64.S
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/errno.h>
+
+#include <linux/linkage.h>
+#include <asm/percpu.h>
+#include <asm/desc.h>
+
+#include "../entry/calling.h"
+
+#ifdef CONFIG_SMP
+#error need to stash these variables somewhere else
+#endif
+
+#ifndef CONFIG_MMU
+#define UM_GLOBAL_VAR(x) .data; .align 8; .globl x; x:; .long 0
+
+UM_GLOBAL_VAR(current_top_of_stack)
+UM_GLOBAL_VAR(current_ptregs)
+
+.code64
+.section .entry.text, "ax"
+
+.align 8
+#undef ENTRY
+#define ENTRY(x) .text; .globl x; .type x,%function; x:
+#undef END
+#define END(x)   .size x, . - x
+
+/*
+ * %rcx has the return address (we set it like that in zpoline trampoline).
+ *
+ * Registers on entry:
+ * rax  system call number
+ * rcx  return address
+ * rdi  arg0
+ * rsi  arg1
+ * rdx  arg2
+ * r10  arg3
+ * r8   arg4
+ * r9   arg5
+ *
+ * (note: we are allowed to mess with r11: r11 is callee-clobbered
+ * register in C ABI)
+ */
+ENTRY(__kernel_vsyscall)
+
+	movq	%rsp, %r11
+
+	/* Point rsp to the top of the ptregs array, so we can
+           just fill it with a bunch of push'es. */
+	movq	current_ptregs, %rsp
+
+	/* 8 bytes * 20 registers (plus 8 for the push) */
+	addq	$168, %rsp
+
+	/* Construct struct pt_regs on stack */
+	pushq	$0		/* pt_regs->ss (index 20) */
+	pushq   %r11		/* pt_regs->sp */
+	pushfq			/* pt_regs->flags */
+	pushq	$0		/* pt_regs->cs */
+	pushq	%rcx		/* pt_regs->ip */
+	pushq	%rax		/* pt_regs->orig_ax */
+
+	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
+
+	mov %rsp, %rdi
+
+	/*
+	 * Switch to current top of stack, so "current->" points
+	 * to the right task.
+	 */
+	movq	current_top_of_stack, %rsp
+
+	call	do_syscall_64
+
+	movq	current_ptregs, %rsp
+
+	POP_REGS
+
+	addq	$8, %rsp	/* skip orig_ax */
+	addq	$8, %rsp	/* skip ip */
+	addq	$8, %rsp	/* skip cs */
+	addq	$8, %rsp	/* skip flags */
+	popq	%rsp
+
+	ret
+
+END(__kernel_vsyscall)
diff --git a/arch/x86/um/shared/sysdep/syscalls_64.h b/arch/x86/um/shared/sysdep/syscalls_64.h
index b6b997225841..31aa0694cec0 100644
--- a/arch/x86/um/shared/sysdep/syscalls_64.h
+++ b/arch/x86/um/shared/sysdep/syscalls_64.h
@@ -25,4 +25,8 @@ extern syscall_handler_t *sys_call_table[];
 extern syscall_handler_t sys_modify_ldt;
 extern syscall_handler_t sys_arch_prctl;
 
+__visible void do_syscall_64(struct pt_regs *regs);
+extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
+			      int64_t a4, int64_t a5, int64_t a6);
+
 #endif
-- 
2.43.0




* [RFC PATCH 05/13] x86/um: nommu: syscall translation by zpoline
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
                   ` (3 preceding siblings ...)
  2024-10-24 12:09 ` [RFC PATCH 04/13] x86/um: nommu: syscall handling Hajime Tazaki
@ 2024-10-24 12:09 ` Hajime Tazaki
  2024-10-25  9:19   ` Johannes Berg
  2024-10-24 12:09 ` [RFC PATCH 06/13] x86/um: nommu: process/thread handling Hajime Tazaki
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-24 12:09 UTC (permalink / raw)
  To: linux-um, jdike, richard, anton.ivanov, johannes; +Cc: thehajime, ricarkol

This commit adds a mechanism to hook syscalls for unmodified userspace
programs running under UML in !MMU mode.  The mechanism, called zpoline,
replaces syscall/sysenter instructions with `call *%rax`, which is then
handled by trampoline code installed by an initcall during boot.  The
translation is triggered by elf_arch_finalize_exec(), an arch hook
introduced by another commit.
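
The byte-level effect of the translation can be sketched in userspace
as follows.  This is only an illustration under a strong assumption:
it scans raw bytes, whereas the patch walks each executable segment
with the kernel's x86 instruction decoder.  The function name
demo_rewrite_syscalls is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Replace every two-byte syscall (0f 05) or sysenter (0f 34) pattern in
 * buf with call *%rax (ff d0) and return the number of replacements.
 *
 * NOTE: this is a naive byte scan.  A 0f 05 pair that is really part of
 * an immediate or displacement would be rewritten by mistake; decoding
 * instruction boundaries, as the patch does, avoids that.
 */
size_t demo_rewrite_syscalls(uint8_t *buf, size_t len)
{
	size_t i = 0, count = 0;

	assert(buf != NULL || len == 0);
	while (i + 1 < len) {
		if (buf[i] == 0x0f &&
		    (buf[i + 1] == 0x05 || buf[i + 1] == 0x34)) {
			buf[i] = 0xff;		/* call ...  */
			buf[i + 1] = 0xd0;	/* ... *%rax */
			count++;
			i += 2;
		} else {
			i++;
		}
	}
	return count;
}
```

Both the original and the replacement are two bytes long, so the
rewrite can be done in place without moving any other instruction.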

All syscalls issued by userspace are thus redirected to a specific
function, __kernel_vsyscall, introduced as the syscall entry point for
!MMU UML.  This code path is entirely different from the ptrace(2)-based
syscall hooking used by UML with MMU.
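
The redirection relies on the layout of the trampoline page: since
%rax holds the syscall number, `call *%rax` lands somewhere in a run
of nop bytes at the start of the page and slides into the code that
jumps to the entry point.  A minimal sketch of that layout, built into
an ordinary buffer instead of address 0x0, could look like the
following.  The name demo_build_trampoline and its parameters are
hypothetical, and the old-syscall short jump from the patch is left
out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes emitted after the nop sled: 10 (movabs) + 4 (mov) + 3 (jmp). */
#define DEMO_TAIL_LEN 17

/*
 * Fill buf with nr_syscalls nop bytes followed by:
 *   49 bb imm64    movabs $target, %r11
 *   48 8b 0c 24    mov    (%rsp), %rcx
 *   41 ff e3       jmp    *%r11
 * Returns the number of bytes written, or 0 if buf is too small.
 */
size_t demo_build_trampoline(uint8_t *buf, size_t buflen,
			     size_t nr_syscalls, uint64_t target)
{
	static const uint8_t load_rcx[] = { 0x48, 0x8b, 0x0c, 0x24 };
	static const uint8_t jmp_r11[] = { 0x41, 0xff, 0xe3 };
	size_t p = nr_syscalls;
	int i;

	assert(buf != NULL);
	if (buflen < DEMO_TAIL_LEN || nr_syscalls > buflen - DEMO_TAIL_LEN)
		return 0;

	memset(buf, 0x90, nr_syscalls);		/* nop sled */
	buf[p++] = 0x49;
	buf[p++] = 0xbb;
	for (i = 0; i < 8; i++)			/* little-endian imm64 */
		buf[p++] = (uint8_t)(target >> (8 * i));
	memcpy(buf + p, load_rcx, sizeof(load_rcx));
	p += sizeof(load_rcx);
	memcpy(buf + p, jmp_r11, sizeof(jmp_r11));
	p += sizeof(jmp_r11);
	return p;
}
```

Loading (%rsp) into %rcx mimics what the syscall instruction does with
the return address, so the entry code can treat both cases alike.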

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 arch/x86/um/asm/elf.h |   3 +
 arch/x86/um/zpoline.c | 228 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 231 insertions(+)
 create mode 100644 arch/x86/um/zpoline.c

diff --git a/arch/x86/um/asm/elf.h b/arch/x86/um/asm/elf.h
index 4f87980bc9e9..05f90fc078b3 100644
--- a/arch/x86/um/asm/elf.h
+++ b/arch/x86/um/asm/elf.h
@@ -187,6 +187,9 @@ do {								\
 struct linux_binprm;
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
 	int uses_interp);
+struct elf_fdpic_params;
+extern int elf_arch_finalize_exec(struct elf_fdpic_params *exec_params,
+				  struct elf_fdpic_params *interp_params);
 
 extern unsigned long um_vdso_addr;
 #define AT_SYSINFO_EHDR 33
diff --git a/arch/x86/um/zpoline.c b/arch/x86/um/zpoline.c
new file mode 100644
index 000000000000..a25bb50680e8
--- /dev/null
+++ b/arch/x86/um/zpoline.c
@@ -0,0 +1,228 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *  zpoline.c
+ *
+ *  Replace syscall/sysenter instructions with `call *%rax` to hook syscalls.
+ *
+ */
+//#define DEBUG
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/elf-fdpic.h>
+#include <asm/unistd.h>
+#include <asm/insn.h>
+#include <sysdep/syscalls.h>
+#include <os.h>
+
+#ifndef CONFIG_MMU
+
+/* start of trampoline code area */
+static char *__zpoline_start;
+
+static int __zpoline_translate_syscalls(struct elf_fdpic_params *params)
+{
+	int count = 0, loop;
+	struct insn insn;
+	unsigned long addr;
+	struct elf_fdpic_loadseg *seg;
+	struct elf_phdr *phdr;
+	struct elfhdr *ehdr = (struct elfhdr *)params->elfhdr_addr;
+
+	if (!ehdr)
+		return 0;
+
+	seg = params->loadmap->segs;
+	phdr = params->phdrs;
+	for (loop = 0; loop < params->hdr.e_phnum; loop++, phdr++) {
+		if (phdr->p_type != PT_LOAD)
+			continue;
+		addr = seg->addr;
+		/* skip translation of trampoline code */
+		if (addr <= (unsigned long)(&__zpoline_start[0] + 0x1000 + 0x0100)) {
+			pr_warn("%lx: address is in the range of trampoline", addr);
+			return -EINVAL;
+		}
+
+		/* translate only segment with Executable flag */
+		if (!(phdr->p_flags & PF_X)) {
+			seg++;
+			continue;
+		}
+
+		pr_debug("translation 0x%lx-0x%llx", addr,
+			 seg->addr + seg->p_memsz);
+		/* now ready to translate */
+		while (addr < (seg->addr + seg->p_memsz)) {
+			insn_init(&insn, (void *)addr, MAX_INSN_SIZE, 1);
+			insn_get_length(&insn);
+
+			insn_get_opcode(&insn);
+
+			switch (insn.opcode.bytes[0]) {
+			case 0xf:
+				switch (insn.opcode.bytes[1]) {
+				case 0x05: /* syscall */
+				case 0x34: /* sysenter */
+					pr_debug("%lx: found syscall/sysenter", addr);
+					*(char *)addr = 0xff; // callq
+					*((char *)addr + 1) = 0xd0; // *%rax
+					count++;
+					break;
+				}
+			default:
+			}
+
+			addr += insn.length;
+			if (insn.length == 0) {
+				pr_debug("%lx: length zero with byte %x. skip ?",
+					 addr, insn.opcode.bytes[0]);
+				addr += 1;
+			}
+		}
+		seg++;
+	}
+	return count;
+}
+
+/**
+ * translate syscall/sysenter instruction upon loading ELF binary file
+ * on execve(2)&co syscall.
+ *
+ * suppose we have those instructions:
+ *
+ *    mov $sysnr, %rax
+ *    syscall                 0f 05
+ *
+ * this will translate it to:
+ *
+ *    mov $sysnr, %rax        (<= untouched)
+ *    call *%rax              ff d0
+ *
+ * this will finally call the hook function, guided by the trampoline code
+ * installed by setup_zpoline_trampoline().
+ */
+int elf_arch_finalize_exec(struct elf_fdpic_params *exec_params,
+			   struct elf_fdpic_params *interp_params)
+{
+	int err = 0, count = 0;
+	struct mm_struct *mm = current->mm;
+
+	if (down_write_killable(&mm->mmap_lock)) {
+		err = -EINTR;
+		return err;
+	}
+
+	/* translate for the executable */
+	err = __zpoline_translate_syscalls(exec_params);
+	if (err < 0) {
+		pr_info("zpoline: xlate error %d", err);
+		goto out;
+	}
+	count += err;
+	pr_debug("zpoline: rewritten (exec) %d syscalls\n", count);
+
+	/* translate for the interpreter */
+	err = __zpoline_translate_syscalls(interp_params);
+	if (err < 0) {
+		pr_info("zpoline: xlate error %d", err);
+		goto out;
+	}
+	count += err;
+
+	err = 0;
+	pr_debug("zpoline: rewritten (exec+interp) %d syscalls\n", count);
+
+out:
+	up_write(&mm->mmap_lock);
+	return err;
+}
+
+/**
+ * setup trampoline code for syscall hooks
+ *
+ * the trampoline code guides to call hooked function, __kernel_vsyscall
+ * in this case, via nop slides at the memory address zero (thus, zpoline).
+ *
+ * loaded binary by exec(2) is translated to call the function.
+ */
+static int __init setup_zpoline_trampoline(void)
+{
+	int i, ret;
+	int ptr;
+
+	/* zpoline: map area of trampoline code started from addr 0x0 */
+	__zpoline_start = 0x0;
+
+	ret = os_map_memory((void *) 0, -1, 0, 0x1000, 1, 1, 1);
+	if (ret)
+		panic("map failed\n NOTE: /proc/sys/vm/mmap_min_addr should be set 0\n");
+
+	/* fill nop instructions until the trampoline code */
+	for (i = 0; i < NR_syscalls; i++)
+		__zpoline_start[i] = 0x90;
+
+	/* optimization to skip old syscalls */
+	/* short jmp */
+	__zpoline_start[214 /* __NR_epoll_ctl_old */] = 0xeb;
+	/* range of a short jmp : -128 ~ +127 */
+	__zpoline_start[215 /* __NR_epoll_wait_old */] = 127;
+
+	/**
+	 * FIXME: shift red zone area to properly handle the case
+	 */
+
+	/**
+	 * put code for jumping to __kernel_vsyscall.
+	 *
+	 * here we embed the following code.
+	 *
+	 * movabs [$addr],%r11
+	 * jmpq   *%r11
+	 *
+	 */
+	ptr = NR_syscalls;
+	/* 49 bb [64-bit addr (8-byte)]    movabs [64-bit addr (8-byte)],%r11 */
+	__zpoline_start[ptr++] = 0x49;
+	__zpoline_start[ptr++] = 0xbb;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 0)) & 0xff;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 1)) & 0xff;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 2)) & 0xff;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 3)) & 0xff;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 4)) & 0xff;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 5)) & 0xff;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 6)) & 0xff;
+	__zpoline_start[ptr++] = ((uint64_t)
+				  __kernel_vsyscall >> (8 * 7)) & 0xff;
+
+	/*
+	 * pretending to be syscall instruction by putting return
+	 * address in %rcx.
+	 */
+	/* 48 8b 0c 24               mov    (%rsp),%rcx */
+	__zpoline_start[ptr++] = 0x48;
+	__zpoline_start[ptr++] = 0x8b;
+	__zpoline_start[ptr++] = 0x0c;
+	__zpoline_start[ptr++] = 0x24;
+
+	/* 41 ff e3                jmp    *%r11 */
+	__zpoline_start[ptr++] = 0x41;
+	__zpoline_start[ptr++] = 0xff;
+	__zpoline_start[ptr++] = 0xe3;
+
+	/* permission: XOM (PROT_EXEC only) */
+	ret = os_protect_memory(0, 0x1000, 0, 0, 1);
+	if (ret)
+		panic("failed: can't configure permission on trampoline code");
+
+	pr_info("zpoline: setting up trampoline code done\n");
+	return 0;
+}
+arch_initcall(setup_zpoline_trampoline);
+#endif /* !CONFIG_MMU */
-- 
2.43.0




* [RFC PATCH 06/13] x86/um: nommu: process/thread handling
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
                   ` (4 preceding siblings ...)
  2024-10-24 12:09 ` [RFC PATCH 05/13] x86/um: nommu: syscall translation by zpoline Hajime Tazaki
@ 2024-10-24 12:09 ` Hajime Tazaki
  2024-10-25  9:22   ` Johannes Berg
  2024-10-24 12:09 ` [RFC PATCH 07/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-24 12:09 UTC (permalink / raw)
  To: linux-um, jdike, richard, anton.ivanov, johannes; +Cc: thehajime, ricarkol

Since the ptrace facility isn't used in !MMU mode of UML, processes and
threads are invoked through a different code path: on entry to the
syscall interface, the stack pointer has to be adjusted to handle the
vfork(2) return address, no external process is used, and some
registers (the fs segment register for TLS, etc.) need to be properly
configured on every context switch.

Signals aren't delivered on the non-ptrace syscall entry/exit path, so
we also need to handle pending signals ourselves.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/kernel/exec.c           |  8 ++++++++
 arch/um/kernel/process.c        | 21 ++++++++++++++++++++-
 arch/um/os-Linux/process.c      |  6 ++++++
 arch/um/os-Linux/skas/process.c |  4 ++++
 arch/x86/um/asm/processor.h     | 12 ++++++++++++
 arch/x86/um/entry_64.S          | 22 ++++++++++++++++++++++
 arch/x86/um/syscalls_64.c       | 12 ++++++++++++
 7 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/arch/um/kernel/exec.c b/arch/um/kernel/exec.c
index cb8b5cd9285c..fe7d776a5962 100644
--- a/arch/um/kernel/exec.c
+++ b/arch/um/kernel/exec.c
@@ -24,8 +24,10 @@ void flush_thread(void)
 {
 	arch_flush_thread(&current->thread.arch);
 
+#ifdef CONFIG_MMU
 	get_safe_registers(current_pt_regs()->regs.gp,
 			   current_pt_regs()->regs.fp);
+#endif
 
 	__switch_mm(&current->mm->context.id);
 }
@@ -35,5 +37,11 @@ void start_thread(struct pt_regs *regs, unsigned long eip, unsigned long esp)
 	PT_REGS_IP(regs) = eip;
 	PT_REGS_SP(regs) = esp;
 	clear_thread_flag(TIF_SINGLESTEP);
+#ifndef CONFIG_MMU
+	current->thread.regs.regs.gp[REGS_IP_INDEX] = eip;
+	current->thread.regs.regs.gp[REGS_SP_INDEX] = esp;
+	new_thread(task_stack_page(current), &current->thread.switch_buf,
+		   (void *)eip);
+#endif
 }
 EXPORT_SYMBOL(start_thread);
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index b1b608afa036..270b5bd476be 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -117,13 +117,17 @@ void new_thread_handler(void)
 	 * callback returns only if the kernel thread execs a process
 	 */
 	fn(arg);
+#ifndef CONFIG_MMU
+	arch_switch_to(current);
+#endif
 	userspace(&current->thread.regs.regs, current_thread_info()->aux_fp_regs);
 }
 
 /* Called magically, see new_thread_handler above */
 static void fork_handler(void)
 {
-	schedule_tail(current->thread.prev_sched);
+	if (current->thread.prev_sched != NULL)
+		schedule_tail(current->thread.prev_sched);
 
 	/*
 	 * XXX: if interrupt_end() calls schedule, this call to
@@ -134,6 +138,21 @@ static void fork_handler(void)
 
 	current->thread.prev_sched = NULL;
 
+#ifndef CONFIG_MMU
+	/*
+	 * This fork can only come from libc's vfork, which
+	 * does this:
+	 *	popq %%rdx;
+	 *	call *%0; // vsyscall
+	 *	pushq %%rdx;
+	 * %rdx stores the return address which is stored
+	 * at pt_regs[HOST_IP] at the moment. We still
+	 * need to pop the pushed address by "call" though,
+	 * so this is what this next line does.
+	 */
+	if (current->thread.regs.regs.gp[HOST_ORIG_AX] == __NR_vfork)
+		current->thread.regs.regs.gp[REGS_SP_INDEX] += 8;
+#endif
 	userspace(&current->thread.regs.regs, current_thread_info()->aux_fp_regs);
 }
 
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index b164873da2db..f08bb20d95ec 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -92,7 +92,10 @@ int os_process_parent(int pid)
 
 void os_alarm_process(int pid)
 {
+/* !CONFIG_MMU doesn't send alarm signal to other processes */
+#ifdef UML_CONFIG_MMU
 	kill(pid, SIGALRM);
+#endif
 }
 
 void os_stop_process(int pid)
@@ -114,11 +117,14 @@ void os_kill_process(int pid, int reap_child)
 
 void os_kill_ptraced_process(int pid, int reap_child)
 {
+/* !CONFIG_MMU doesn't have ptraced process */
+#ifdef UML_CONFIG_MMU
 	kill(pid, SIGKILL);
 	ptrace(PTRACE_KILL, pid);
 	ptrace(PTRACE_CONT, pid);
 	if (reap_child)
 		CATCH_EINTR(waitpid(pid, NULL, __WALL));
+#endif
 }
 
 /* Don't use the glibc version, which caches the result in TLS. It misses some
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index b6f656bcffb1..2a0a20aa59b9 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -141,6 +141,7 @@ void wait_stub_done(int pid)
 
 extern unsigned long current_stub_stack(void);
 
+#ifdef UML_CONFIG_MMU
 static void get_skas_faultinfo(int pid, struct faultinfo *fi, unsigned long *aux_fp_regs)
 {
 	int err;
@@ -186,6 +187,7 @@ static void handle_trap(int pid, struct uml_pt_regs *regs)
 
 	handle_syscall(regs);
 }
+#endif
 
 extern char __syscall_stub_start[];
 
@@ -336,6 +338,7 @@ int start_userspace(unsigned long stub_stack)
 	return err;
 }
 
+#ifdef UML_CONFIG_MMU
 void userspace(struct uml_pt_regs *regs, unsigned long *aux_fp_regs)
 {
 	int err, status, op, pid = userspace_pid[0];
@@ -472,6 +475,7 @@ void userspace(struct uml_pt_regs *regs, unsigned long *aux_fp_regs)
 		}
 	}
 }
+#endif /* UML_CONFIG_MMU */
 
 void new_thread(void *stack, jmp_buf *buf, void (*handler)(void))
 {
diff --git a/arch/x86/um/asm/processor.h b/arch/x86/um/asm/processor.h
index 478710384b34..d88d7d9d5c18 100644
--- a/arch/x86/um/asm/processor.h
+++ b/arch/x86/um/asm/processor.h
@@ -38,6 +38,18 @@ static __always_inline void cpu_relax(void)
 
 #define task_pt_regs(t) (&(t)->thread.regs)
 
+#ifndef CONFIG_MMU
+#define task_top_of_stack(task) \
+({									\
+	unsigned long __ptr = (unsigned long)task->stack;	\
+	__ptr += THREAD_SIZE;			\
+	__ptr;					\
+})
+
+extern long current_top_of_stack;
+extern long current_ptregs;
+#endif
+
 #include <asm/processor-generic.h>
 
 #endif
diff --git a/arch/x86/um/entry_64.S b/arch/x86/um/entry_64.S
index 12e11ac03543..1fd5b3665efa 100644
--- a/arch/x86/um/entry_64.S
+++ b/arch/x86/um/entry_64.S
@@ -86,3 +86,25 @@ ENTRY(__kernel_vsyscall)
 	ret
 
 END(__kernel_vsyscall)
+
+// void userspace(struct uml_pt_regs *regs, unsigned long *aux_fp_regs)
+ENTRY(userspace)
+	/* align the stack for x86_64 ABI */
+	and     $-0x10, %rsp
+	/* Handle any immediate reschedules or signals */
+	call	interrupt_end
+
+	movq	current_ptregs, %rsp
+
+	POP_REGS
+
+	addq	$8, %rsp	/* skip orig_ax */
+	popq	%r11		/* pt_regs->ip */
+	addq	$8, %rsp	/* skip cs */
+	addq	$8, %rsp	/* skip flags */
+	popq	%rsp
+
+	jmp	*%r11
+
+END(userspace)
+#endif // !CONFIG_MMU
diff --git a/arch/x86/um/syscalls_64.c b/arch/x86/um/syscalls_64.c
index 6a00a28c9cca..8abf2a679578 100644
--- a/arch/x86/um/syscalls_64.c
+++ b/arch/x86/um/syscalls_64.c
@@ -51,6 +51,18 @@ void arch_switch_to(struct task_struct *to)
 	 * Nothing needs to be done on x86_64.
 	 * The FS_BASE/GS_BASE registers are saved in the ptrace register set.
 	 */
+#ifndef CONFIG_MMU
+	current_top_of_stack = task_top_of_stack(to);
+	current_ptregs = (long)task_pt_regs(to);
+
+	if ((to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] == 0)
+	    || (to->mm == NULL))
+		return;
+
+	// rkj: this changes the FS on every context switch
+	arch_prctl(to, ARCH_SET_FS,
+		   (void __user *) to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)]);
+#endif
 }
 
 SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
-- 
2.43.0




* [RFC PATCH 07/13] um: nommu: configure fs register on host syscall invocation
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
                   ` (5 preceding siblings ...)
  2024-10-24 12:09 ` [RFC PATCH 06/13] x86/um: nommu: process/thread handling Hajime Tazaki
@ 2024-10-24 12:09 ` Hajime Tazaki
  2024-10-25  9:28   ` Johannes Berg
  2024-10-24 12:09 ` [RFC PATCH 08/13] x86/um/vdso: nommu: vdso memory update Hajime Tazaki
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-24 12:09 UTC (permalink / raw)
  To: linux-um, jdike, richard, anton.ivanov, johannes; +Cc: thehajime, ricarkol

Userspace on UML/!MMU also needs to configure the %fs register while it
is running in order to access its thread structure correctly, so host
syscalls implemented in the os-Linux drivers may get confused when they
are called.  The %fs register therefore has to be configured via
arch_prctl(ARCH_SET_FS) on every host syscall.
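
As a rough illustration of the host-side primitive involved, the
sketch below reads the FS base with arch_prctl(2), writes the same
value back, and reads it again.  It assumes an x86-64 Linux host and
goes through syscall(2) directly; the name demo_fs_base_roundtrip and
the DEMO_* constants are hypothetical (the values mirror ARCH_SET_FS
and ARCH_GET_FS from <asm/prctl.h>).  It deliberately never installs a
different FS base, since that would break the TLS of the libc running
the sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* arch_prctl(2) option codes for x86-64, as in <asm/prctl.h>. */
#define DEMO_ARCH_SET_FS 0x1002
#define DEMO_ARCH_GET_FS 0x1003

/*
 * Read the FS base, write the same value back, and read it again.
 * Returns 0 when both reads agree, -1 on failure.  On builds that are
 * not x86-64 Linux this is a stub that stores 0 and returns 0.
 */
int demo_fs_base_roundtrip(unsigned long *fs_out)
{
	assert(fs_out != NULL);
#if defined(__linux__) && defined(__x86_64__)
	unsigned long before = 0, after = 0;

	if (syscall(SYS_arch_prctl, DEMO_ARCH_GET_FS, &before) != 0)
		return -1;
	/* Writing back the value just read keeps libc's TLS intact. */
	if (syscall(SYS_arch_prctl, DEMO_ARCH_SET_FS, before) != 0)
		return -1;
	if (syscall(SYS_arch_prctl, DEMO_ARCH_GET_FS, &after) != 0)
		return -1;
	*fs_out = after;
	return before == after ? 0 : -1;
#else
	*fs_out = 0;
	return 0;
#endif
}
```

The same get/set pair is what a save-and-restore around a host call
would be built from; which value is installed at which point is up to
the caller.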

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/include/shared/os.h |  3 ++
 arch/um/os-Linux/main.c     |  5 ++++
 arch/um/os-Linux/process.c  | 11 ++++++++
 arch/um/os-Linux/start_up.c | 47 +++++++++++++++++++++++++++++++
 arch/um/os-Linux/time.c     |  3 +-
 arch/x86/um/do_syscall_64.c | 35 +++++++++++++++++++++++
 arch/x86/um/syscalls_64.c   | 55 +++++++++++++++++++++++++++++++++++++
 7 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index f6d3f3d7eade..ba95e882f7e6 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -191,6 +191,7 @@ extern void check_host_supports_tls(int *supports_tls, int *tls_min);
 extern void get_host_cpu_features(
 	void (*flags_helper_func)(char *line),
 	void (*cache_helper_func)(char *line));
+extern int os_has_fsgsbase(void);
 
 /* mem.c */
 extern int create_mem_file(unsigned long long len);
@@ -225,6 +226,8 @@ extern int os_unmap_memory(void *addr, int len);
 extern int os_drop_memory(void *addr, int length);
 extern int can_drop_memory(void);
 extern int os_mincore(void *addr, unsigned long len);
+extern long long host_fs;
+extern int os_arch_prctl(int pid, int option, unsigned long *arg);
 
 /* execvp.c */
 extern int execvp_noalloc(char *buf, const char *file, char *const argv[]);
diff --git a/arch/um/os-Linux/main.c b/arch/um/os-Linux/main.c
index f98ff79cdbf7..8e5c7361bac1 100644
--- a/arch/um/os-Linux/main.c
+++ b/arch/um/os-Linux/main.c
@@ -16,6 +16,7 @@
 #include <kern_util.h>
 #include <os.h>
 #include <um_malloc.h>
+#include <asm/prctl.h> /* XXX This should get the constants from libc */
 #include "internal.h"
 
 #define PGD_BOUND (4 * 1024 * 1024)
@@ -142,6 +143,10 @@ int __init main(int argc, char **argv, char **envp)
 	change_sig(SIGPIPE, 0);
 	ret = linux_main(argc, argv);
 
+#ifndef CONFIG_MMU
+	os_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
+#endif
+
 	/*
 	 * Disable SIGPROF - I have no idea why libc doesn't do this or turn
 	 * off the profiling time, but UML dies with a SIGPROF just before
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index f08bb20d95ec..0be8f4ac5f70 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -296,3 +296,14 @@ void init_new_thread_signals(void)
 	set_handler(SIGIO);
 	signal(SIGWINCH, SIG_IGN);
 }
+
+#ifndef CONFIG_MMU
+
+#include <unistd.h>
+#include <sys/syscall.h>   /* For SYS_xxx definitions */
+
+int os_arch_prctl(int pid, int option, unsigned long *arg2)
+{
+	return syscall(SYS_arch_prctl, option, arg2);
+}
+#endif
diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c
index 93fc82c01aba..80dce242e67a 100644
--- a/arch/um/os-Linux/start_up.c
+++ b/arch/um/os-Linux/start_up.c
@@ -12,6 +12,7 @@
 #include <sched.h>
 #include <signal.h>
 #include <string.h>
+#include <longjmp.h>
 #include <sys/mman.h>
 #include <sys/stat.h>
 #include <sys/wait.h>
@@ -278,6 +279,49 @@ void  __init get_host_cpu_features(
 	}
 }
 
+/**
+ * get_host_cpu_features() reports X86_FEATURE_FSGSBASE even if the host
+ * kernel is older and has not enabled the fsgsbase instructions, so
+ * detection is based on whether SIGILL is raised or not.
+ */
+static jmp_buf jmpbuf;
+static int has_fsgsbase;
+
+static void sigill(int sig, siginfo_t *si, void *ctx_void)
+{
+	longjmp(jmpbuf, 1);
+}
+
+static int __init check_fsgsbase(void)
+{
+	unsigned long fsbase;
+	struct sigaction sa;
+
+	/* Probe FSGSBASE */
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_sigaction = sigill;
+	sa.sa_flags = SA_SIGINFO | SA_RESETHAND;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(SIGILL, &sa, 0))
+		os_warn("sigaction");
+
+	os_info("Checking FSGSBASE instructions...");
+	if (setjmp(jmpbuf) == 0) {
+		asm volatile("rdfsbase %0" : "=r" (fsbase) :: "memory");
+		has_fsgsbase = 1;
+		os_info("OK\n");
+	} else {
+		has_fsgsbase = 0;
+		os_info("disabled\n");
+	}
+
+	return 0;
+}
+
+int os_has_fsgsbase(void)
+{
+	return has_fsgsbase;
+}
 
 void __init os_early_checks(void)
 {
@@ -293,6 +337,9 @@ void __init os_early_checks(void)
 	 */
 	check_tmpexec();
 
+	/* probe fsgsbase instruction */
+	check_fsgsbase();
+
 	pid = start_ptraced_child();
 	if (init_pid_registers(pid))
 		fatal("Failed to initialize default registers");
diff --git a/arch/um/os-Linux/time.c b/arch/um/os-Linux/time.c
index 4d5591d96d8c..1ed14c6b67c4 100644
--- a/arch/um/os-Linux/time.c
+++ b/arch/um/os-Linux/time.c
@@ -89,7 +89,8 @@ long long os_nsecs(void)
 {
 	struct timespec ts;
 
-	clock_gettime(CLOCK_MONOTONIC,&ts);
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+
 	return timespec_to_ns(&ts);
 }
 
diff --git a/arch/x86/um/do_syscall_64.c b/arch/x86/um/do_syscall_64.c
index 7af6e881ad58..594248f7319c 100644
--- a/arch/x86/um/do_syscall_64.c
+++ b/arch/x86/um/do_syscall_64.c
@@ -2,12 +2,38 @@
 
 #include <linux/kernel.h>
 #include <linux/ptrace.h>
+#include <asm/fsgsbase.h>
+#include <asm/prctl.h>
 #include <kern_util.h>
 #include <sysdep/syscalls.h>
 #include <os.h>
 
 #ifndef CONFIG_MMU
 
+static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2)
+{
+	if (os_has_fsgsbase()) {
+		switch (option) {
+		case ARCH_SET_FS:
+			wrfsbase(*arg2);
+			break;
+		case ARCH_SET_GS:
+			wrgsbase(*arg2);
+			break;
+		case ARCH_GET_FS:
+			*arg2 = rdfsbase();
+			break;
+		case ARCH_GET_GS:
+			*arg2 = rdgsbase();
+			break;
+		}
+		return 0;
+	} else
+		return os_arch_prctl(pid, option, arg2);
+
+	return 0;
+}
+
 __visible void do_syscall_64(struct pt_regs *regs)
 {
 	int syscall;
@@ -19,6 +45,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
 		 syscall, (unsigned long)current,
 		 (unsigned long)sys_call_table[syscall]);
 
+	/* set fs register to the original host one */
+	os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
+
 	if (likely(syscall < NR_syscalls)) {
 		PT_REGS_SET_SYSCALL_RETURN(regs,
 				EXECUTE_SYSCALL(syscall, regs));
@@ -29,6 +58,11 @@ __visible void do_syscall_64(struct pt_regs *regs)
 
 	PT_REGS_SYSCALL_RET(regs) = regs->regs.gp[HOST_AX];
 
+	/* restore back fs register to userspace configured one */
+	os_x86_arch_prctl(0, ARCH_SET_FS,
+		      (void *)(current->thread.regs.regs.gp[FS_BASE
+						     / sizeof(unsigned long)]));
+
 	/* force do_signal() --> is_syscall() */
 	set_thread_flag(TIF_SIGPENDING);
 	interrupt_end();
@@ -39,4 +73,5 @@ __visible void do_syscall_64(struct pt_regs *regs)
 			current_thread_info()->aux_fp_regs);
 	}
 }
+
 #endif
diff --git a/arch/x86/um/syscalls_64.c b/arch/x86/um/syscalls_64.c
index 8abf2a679578..5a1b1b3efab2 100644
--- a/arch/x86/um/syscalls_64.c
+++ b/arch/x86/um/syscalls_64.c
@@ -12,11 +12,24 @@
 #include <asm/prctl.h> /* XXX This should get the constants from libc */
 #include <registers.h>
 #include <os.h>
+#include <asm/thread_info.h>
+#include <asm/mman.h>
+
+/*
+ * The guest libc can change FS, which confuses the host libc.
+ * In fact, changing FS directly is not supported (check
+ * man arch_prctl). So, whenever we make a host syscall,
+ * we should be changing FS to the original FS (not the
+ * one set by the guest libc). This original FS is stored
+ * in host_fs.
+ */
+long long host_fs = -1;
 
 long arch_prctl(struct task_struct *task, int option,
 		unsigned long __user *arg2)
 {
 	long ret = -EINVAL;
+#ifdef CONFIG_MMU
 
 	switch (option) {
 	case ARCH_SET_FS:
@@ -38,6 +51,48 @@ long arch_prctl(struct task_struct *task, int option,
 	}
 
 	return ret;
+#else
+
+	unsigned long *ptr = arg2, tmp;
+
+	switch (option) {
+	case ARCH_SET_FS:
+		if (host_fs == -1)
+			os_arch_prctl(0, ARCH_GET_FS, (void *)&host_fs);
+		ret = 0;
+		break;
+	case ARCH_SET_GS:
+		ret = 0;
+		break;
+	case ARCH_GET_FS:
+	case ARCH_GET_GS:
+		ptr = &tmp;
+		break;
+	}
+
+	ret = os_arch_prctl(0, option, ptr);
+	if (ret)
+		return ret;
+
+	switch (option) {
+	case ARCH_SET_FS:
+		current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] =
+			(unsigned long) arg2;
+		break;
+	case ARCH_SET_GS:
+		current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)] =
+			(unsigned long) arg2;
+		break;
+	case ARCH_GET_FS:
+		ret = put_user(current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)], arg2);
+		break;
+	case ARCH_GET_GS:
+		ret = put_user(current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)], arg2);
+		break;
+	}
+
+	return ret;
+#endif
 }
 
 SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
-- 
2.43.0




* [RFC PATCH 08/13] x86/um/vdso: nommu: vdso memory update
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
                   ` (6 preceding siblings ...)
  2024-10-24 12:09 ` [RFC PATCH 07/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
@ 2024-10-24 12:09 ` Hajime Tazaki
  2024-10-25  9:29   ` Johannes Berg
  2024-10-24 12:09 ` [RFC PATCH 09/13] x86/um: nommu: signal handling Hajime Tazaki
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-24 12:09 UTC (permalink / raw)
  To: linux-um, jdike, richard, anton.ivanov, johannes; +Cc: thehajime, ricarkol

In !MMU mode, the address of the vdso is accessible from userspace.  This
commit implements the entry point by pointing um_vdso_addr at the
address of the allocated vdso page.

This commit also adds the memory permission configuration that makes the
vdso page executable.
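
The permission step can be pictured with a small host-side sketch:
copy an image into a page and then change the page protection so that
it may be executed.  This is only an analogy under stated assumptions.
It uses mmap(2)/mprotect(2) directly instead of os_protect_memory(),
it picks read+execute where the patch passes r=1, w=1, x=1, and the
name demo_publish_image is hypothetical.  The copied bytes are never
executed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Copy len bytes of an image into a fresh anonymous page and switch the
 * page to read+execute.  Returns the page address, or NULL on failure
 * (including an image larger than one page).
 */
void *demo_publish_image(const void *image, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	assert(image != NULL);
	if (page <= 0 || len > (size_t)page)
		return NULL;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	memcpy(p, image, len);
	if (mprotect(p, (size_t)page, PROT_READ | PROT_EXEC) != 0) {
		munmap(p, (size_t)page);
		return NULL;
	}
	return p;
}
```

The one-page limit in the sketch plays the same role as the BUG_ON()
size check in init_vdso().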

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/x86/um/vdso/um_vdso.c | 20 ++++++++++++++++++++
 arch/x86/um/vdso/vma.c     | 16 +++++++++++++++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/arch/x86/um/vdso/um_vdso.c b/arch/x86/um/vdso/um_vdso.c
index cbae2584124f..eff3e6641a0e 100644
--- a/arch/x86/um/vdso/um_vdso.c
+++ b/arch/x86/um/vdso/um_vdso.c
@@ -23,10 +23,17 @@ int __vdso_clock_gettime(clockid_t clock, struct __kernel_old_timespec *ts)
 {
 	long ret;
 
+#ifdef CONFIG_MMU
 	asm("syscall"
 		: "=a" (ret)
 		: "0" (__NR_clock_gettime), "D" (clock), "S" (ts)
 		: "rcx", "r11", "memory");
+#else
+	asm("call *%1"
+		: "=a" (ret)
+		: "0" ((unsigned long)__NR_clock_gettime), "D" (clock), "S" (ts)
+		: "rcx", "r11", "memory");
+#endif
 
 	return ret;
 }
@@ -37,10 +44,17 @@ int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
 {
 	long ret;
 
+#ifdef CONFIG_MMU
 	asm("syscall"
 		: "=a" (ret)
 		: "0" (__NR_gettimeofday), "D" (tv), "S" (tz)
 		: "rcx", "r11", "memory");
+#else
+	asm("call *%1"
+		: "=a" (ret)
+		: "0" ((unsigned long)__NR_gettimeofday), "D" (tv), "S" (tz)
+		: "rcx", "r11", "memory");
+#endif
 
 	return ret;
 }
@@ -51,9 +65,15 @@ __kernel_old_time_t __vdso_time(__kernel_old_time_t *t)
 {
 	long secs;
 
+#ifdef CONFIG_MMU
 	asm volatile("syscall"
 		: "=a" (secs)
 		: "0" (__NR_time), "D" (t) : "cc", "r11", "cx", "memory");
+#else
+	asm("call *%1"
+		: "=a" (secs)
+		: "0" ((unsigned long)__NR_time), "D" (t) : "cc", "r11", "cx", "memory");
+#endif
 
 	return secs;
 }
diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
index f238f7b33cdd..36ac644147b6 100644
--- a/arch/x86/um/vdso/vma.c
+++ b/arch/x86/um/vdso/vma.c
@@ -9,6 +9,7 @@
 #include <asm/page.h>
 #include <asm/elf.h>
 #include <linux/init.h>
+#include <os.h>
 
 static unsigned int __read_mostly vdso_enabled = 1;
 unsigned long um_vdso_addr;
@@ -24,7 +25,9 @@ static int __init init_vdso(void)
 
 	BUG_ON(vdso_end - vdso_start > PAGE_SIZE);
 
+#ifdef CONFIG_MMU
 	um_vdso_addr = task_size - PAGE_SIZE;
+#endif
 
 	vdsop = kmalloc(sizeof(struct page *), GFP_KERNEL);
 	if (!vdsop)
@@ -40,16 +43,26 @@ static int __init init_vdso(void)
 	copy_page(page_address(um_vdso), vdso_start);
 	*vdsop = um_vdso;
 
+#ifndef CONFIG_MMU
+	/* this is fine with NOMMU as everything is accessible */
+	um_vdso_addr = (unsigned long)page_address(um_vdso);
+	os_protect_memory((void *)um_vdso_addr, vdso_end - vdso_start, 1, 1, 1);
+	pr_debug("vdso_start=%lx um_vdso_addr=%lx pg_um_vdso=%lx\n",
+	       (unsigned long)vdso_start, um_vdso_addr,
+	       (unsigned long)page_address(um_vdso));
+#endif
+
 	return 0;
 
 oom:
-	printk(KERN_ERR "Cannot allocate vdso\n");
+	pr_err("Cannot allocate vdso\n");
 	vdso_enabled = 0;
 
 	return -ENOMEM;
 }
 subsys_initcall(init_vdso);
 
+#ifdef CONFIG_MMU
 int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 {
 	struct vm_area_struct *vma;
@@ -74,3 +87,4 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 
 	return IS_ERR(vma) ? PTR_ERR(vma) : 0;
 }
+#endif
-- 
2.43.0




* [RFC PATCH 09/13] x86/um: nommu: signal handling
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
                   ` (7 preceding siblings ...)
  2024-10-24 12:09 ` [RFC PATCH 08/13] x86/um/vdso: nommu: vdso memory update Hajime Tazaki
@ 2024-10-24 12:09 ` Hajime Tazaki
  2024-10-25  9:30   ` Johannes Berg
  2024-10-24 12:09 ` [RFC PATCH 10/13] x86/um: nommu: stack save/restore on vfork Hajime Tazaki
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-24 12:09 UTC (permalink / raw)
  To: linux-um, jdike, richard, anton.ivanov, johannes; +Cc: thehajime, ricarkol

This commit updates the behavior of signal handling under the !MMU
environment: 1) the stack preparation for signal handlers and
2) the restoration of the stack after the rt_sigreturn(2) syscall.  Both
are needed because the stack usage on the vfork(2) syscall is different.
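
The pointer arithmetic involved can be sketched as below.  Both helpers
are hypothetical and only model the offsets used in the diff: one word
is reserved below the frame for the handler address, and one word is
skipped in rt_sigreturn for the address pushed by the translated "call".

```c
#include <stdint.h>

/* Hypothetical: slot one word below the signal frame, where the
 * handler address is stored in the !MMU case. */
static uintptr_t toy_handler_slot(uintptr_t frame)
{
	return frame - sizeof(unsigned long);
}

/* Hypothetical: locate the frame from the stack pointer on
 * rt_sigreturn entry in the !MMU case. */
static uintptr_t toy_sigreturn_frame(uintptr_t sp)
{
	uintptr_t frame = sp - sizeof(long);	/* existing computation */

	frame += sizeof(long);	/* skip the address pushed by "call" */
	return frame;
}
```

In this model the two adjustments in rt_sigreturn cancel out, so the
frame is found at the stack pointer itself.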

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 arch/x86/um/signal.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/um/signal.c b/arch/x86/um/signal.c
index 2cc8c2309022..ae9b231dd8f8 100644
--- a/arch/x86/um/signal.c
+++ b/arch/x86/um/signal.c
@@ -537,6 +537,18 @@ int setup_signal_stack_si(unsigned long stack_top, struct ksignal *ksig,
 		/* could use a vstub here */
 		return err;
 
+#ifndef CONFIG_MMU
+	/*
+	 * We need to push the handler address at the top of the stack:
+	 * __kernel_vsyscall, which runs after this, returns with a ret
+	 * that pops its target from the stack, so push the handler here.
+	 */
+	frame = (struct rt_sigframe __user *) ((unsigned long) frame -
+					       sizeof(unsigned long));
+	err |= __put_user((unsigned long)ksig->ka.sa.sa_handler,
+			  (unsigned long *)frame);
+#endif
+
 	if (err)
 		return err;
 
@@ -562,6 +574,20 @@ SYSCALL_DEFINE0(rt_sigreturn)
 	unsigned long sp = PT_REGS_SP(&current->thread.regs);
 	struct rt_sigframe __user *frame =
 		(struct rt_sigframe __user *)(sp - sizeof(long));
+#ifndef CONFIG_MMU
+	/*
+	 * we enter here with:
+	 *
+	 * __restore_rt:
+	 *     mov $15, %rax
+	 *     call *%rax (translated from syscall)
+	 *
+	 * (code is from musl libc)
+	 * so the address pushed by "call" needs to be popped off the
+	 * stack before looking at rt_sigframe.
+	 */
+	frame = (struct rt_sigframe __user *)((unsigned long)frame + sizeof(long));
+#endif
 	struct ucontext __user *uc = &frame->uc;
 	sigset_t set;
 
-- 
2.43.0




* [RFC PATCH 10/13] x86/um: nommu: stack save/restore on vfork
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
                   ` (8 preceding siblings ...)
  2024-10-24 12:09 ` [RFC PATCH 09/13] x86/um: nommu: signal handling Hajime Tazaki
@ 2024-10-24 12:09 ` Hajime Tazaki
  2024-10-24 12:09 ` [RFC PATCH 11/13] um: change machine name for uname output Hajime Tazaki
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-24 12:09 UTC (permalink / raw)
  To: linux-um, jdike, richard, anton.ivanov, johannes; +Cc: thehajime, ricarkol

This fork can only come from libc's vfork, which does this:

	popq %%rdx;
	call *%rax; // zpoline => __kernel_vsyscall
	pushq %%rdx;

%rcx holds the return address, which is currently stored in
pt_regs[HOST_IP].  As the child returns via userspace() with a jmp
instruction (while the parent returns via the ret instruction in
__kernel_vsyscall), we need to pop (advance past) the address pushed
by "call".

When the child returns from vfork, it overwrites the stack contents
(via the pushq in vfork), which confuses the parent after the child
returns.  Thus the contents must be restored before the parent returns
from vfork.  This is done in do_syscall_64().
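
A toy model of the save/restore, using a plain array in place of the
shared user stack, may make the sequence easier to follow.  All names
below are hypothetical and the model ignores everything except the one
8-byte slot at the stack pointer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the shared stack; slot[0] is where the
 * saved stack pointer points when vfork is entered. */
struct toy_stack {
	uint64_t slot[4];
};

/* Parent side, before the syscall handler runs: copy the 8 bytes. */
static uint64_t toy_vfork_save(const struct toy_stack *st)
{
	uint64_t saved;

	memcpy(&saved, &st->slot[0], sizeof(saved));
	return saved;
}

/* Child side: the "pushq %rdx" after vfork lands on the same slot. */
static void toy_child_push(struct toy_stack *st, uint64_t rdx)
{
	st->slot[0] = rdx;
}

/* Parent side, once the child is done: put the 8 bytes back. */
static void toy_vfork_restore(struct toy_stack *st, uint64_t saved)
{
	memcpy(&st->slot[0], &saved, sizeof(saved));
}

/* Returns what the parent finds in the slot after the round trip. */
static uint64_t toy_vfork_roundtrip(uint64_t parent_ret, uint64_t child_val)
{
	struct toy_stack st = { .slot = { parent_ret, 0, 0, 0 } };
	uint64_t saved = toy_vfork_save(&st);

	toy_child_push(&st, child_val);
	toy_vfork_restore(&st, saved);
	return st.slot[0];
}
```

Without the restore step the parent's ret would pick up the value
written by the child instead of its own return address.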

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/kernel/process.c    | 22 +++++++++++++++++-----
 arch/x86/um/do_syscall_64.c | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 270b5bd476be..134687530e5f 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -140,15 +140,27 @@ static void fork_handler(void)
 
 #ifndef CONFIG_MMU
 	/*
+	 * child of vfork(2) comes here.
+	 * clone(2) also enters here but doesn't need to advance the %rsp.
+	 *
 	 * This fork can only come from libc's vfork, which
 	 * does this:
 	 *	popq %%rdx;
-	 *	call *%0; // vsyscall
+	 *	call *%rax; // zpoline => __kernel_vsyscall
 	 *	pushq %%rdx;
-	 * %rdx stores the return address which is stored
-	 * at pt_regs[HOST_IP] at the moment. We still
-	 * need to pop the pushed address by "call" though,
-	 * so this is what this next line does.
+	 * %rcx stores the return address which is stored
+	 * at pt_regs[HOST_IP] at the moment.  As the child returns
+	 * via userspace() with a jmp instruction (while the parent
+	 * returns via the ret instruction in __kernel_vsyscall), we
+	 * need to pop (advance past) the address pushed by "call",
+	 * which is what the next line does.
+	 *
+	 * When the child returns from vfork, it overwrites the
+	 * stack contents (via the pushq in vfork), which confuses
+	 * the parent after the child returns.
+	 *
+	 * Thus the contents must be restored before the parent
+	 * returns from vfork.  This is done in do_syscall_64().
 	 */
 	if (current->thread.regs.regs.gp[HOST_ORIG_AX] == __NR_vfork)
 		current->thread.regs.regs.gp[REGS_SP_INDEX] += 8;
diff --git a/arch/x86/um/do_syscall_64.c b/arch/x86/um/do_syscall_64.c
index 594248f7319c..4235259bbee1 100644
--- a/arch/x86/um/do_syscall_64.c
+++ b/arch/x86/um/do_syscall_64.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 
+//#define DEBUG 1
 #include <linux/kernel.h>
 #include <linux/ptrace.h>
 #include <asm/fsgsbase.h>
@@ -34,9 +35,38 @@ static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2)
 	return 0;
 }
 
+/*
+ * save/restore the return address stored in the stack, as the child overwrites
+ * the contents after returning to userspace (i.e., by push %rdx).
+ *
+ * see the detail in fork_handler().
+ */
+static void *vfork_save_stack(void)
+{
+	unsigned char *stack_copy;
+
+	stack_copy = kzalloc(PAGE_SIZE << THREAD_SIZE_ORDER,
+			     GFP_KERNEL);
+	if (!stack_copy)
+		return NULL;
+
+	memcpy(stack_copy,
+	       (void *)current->thread.regs.regs.gp[HOST_SP], 8);
+
+	return stack_copy;
+}
+
+static void vfork_restore_stack(void *stack_copy)
+{
+	WARN_ON_ONCE(!stack_copy);
+	memcpy((void *)current->thread.regs.regs.gp[HOST_SP],
+	       stack_copy, 8);
+}
+
 __visible void do_syscall_64(struct pt_regs *regs)
 {
 	int syscall;
+	unsigned char *stack_copy = NULL;
 
 	syscall = PT_SYSCALL_NR(regs->regs.gp);
 	UPT_SYSCALL_NR(&regs->regs) = syscall;
@@ -45,6 +75,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
 		 syscall, (unsigned long)current,
 		 (unsigned long)sys_call_table[syscall]);
 
+	if (syscall == __NR_vfork)
+		stack_copy = vfork_save_stack();
+
 	/* set fs register to the original host one */
 	os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
 
@@ -72,6 +105,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
 		userspace(&current->thread.regs.regs,
 			current_thread_info()->aux_fp_regs);
 	}
+	/* only the parent of vfork restores the contents of the stack */
+	if (syscall == __NR_vfork && regs->regs.gp[HOST_AX] > 0)
+		vfork_restore_stack(stack_copy);
 }
 
 #endif
-- 
2.43.0




* [RFC PATCH 11/13] um: change machine name for uname output
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
                   ` (9 preceding siblings ...)
  2024-10-24 12:09 ` [RFC PATCH 10/13] x86/um: nommu: stack save/restore on vfork Hajime Tazaki
@ 2024-10-24 12:09 ` Hajime Tazaki
  2024-10-24 12:09 ` [RFC PATCH 12/13] um: nommu: add documentation of nommu UML Hajime Tazaki
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-24 12:09 UTC (permalink / raw)
  To: linux-um, jdike, richard, anton.ivanov, johannes; +Cc: thehajime, ricarkol

This commit displays the MMU/!MMU mode in the output of uname(2)
so that users can tell which mode of UML is currently running.
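
The string composition can be sketched with a bounded append.  This is
only an illustration under assumptions: toy_append_machine() and
TOY_UTS_LEN are hypothetical, and the buffer is assumed to already hold
the UML prefix before the host machine name is appended.

```c
#include <stdio.h>
#include <string.h>

#define TOY_UTS_LEN 65	/* assumption: room for a utsname field plus NUL */

/*
 * Hypothetical helper: append "/<host_machine>" to a buffer that already
 * contains a NUL-terminated prefix.  Returns 0 on success and -1 if the
 * result did not fit (the buffer may then hold a truncated string).
 */
static int toy_append_machine(char *machine_out, size_t len,
			      const char *host_machine)
{
	size_t used = strlen(machine_out);
	int n;

	if (used >= len)
		return -1;

	n = snprintf(machine_out + used, len - used, "/%s", host_machine);
	if (n < 0 || (size_t)n >= len - used)
		return -1;

	return 0;
}
```

Compared with two strcat() calls, the bounded form reports when the
combined name would not fit in the destination field.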

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 arch/um/Makefile        | 6 ++++++
 arch/um/os-Linux/util.c | 3 ++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/um/Makefile b/arch/um/Makefile
index 00b63bac5eff..ea0c263fea14 100644
--- a/arch/um/Makefile
+++ b/arch/um/Makefile
@@ -148,6 +148,12 @@ export CFLAGS_vmlinux := $(LINK-y) $(LINK_WRAPS) $(LD_FLAGS_CMDLINE) $(CC_FLAGS_
 CLEAN_FILES += linux x.i gmon.out
 MRPROPER_FILES += $(HOST_DIR)/include/generated
 
+ifeq ($(CONFIG_MMU),y)
+UTS_MACHINE := "um"
+else
+UTS_MACHINE := "um\(nommu\)"
+endif
+
 archclean:
 	@find . \( -name '*.bb' -o -name '*.bbg' -o -name '*.da' \
 		-o -name '*.gcov' \) -type f -print | xargs rm -f
diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c
index 1dca4ffbd572..0c7942ae8616 100644
--- a/arch/um/os-Linux/util.c
+++ b/arch/um/os-Linux/util.c
@@ -65,7 +65,8 @@ void setup_machinename(char *machine_out)
 	}
 # endif
 #endif
-	strcpy(machine_out, host.machine);
+	strcat(machine_out, "/");
+	strcat(machine_out, host.machine);
 }
 
 void setup_hostinfo(char *buf, int len)
-- 
2.43.0




* [RFC PATCH 12/13] um: nommu: add documentation of nommu UML
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
                   ` (10 preceding siblings ...)
  2024-10-24 12:09 ` [RFC PATCH 11/13] um: change machine name for uname output Hajime Tazaki
@ 2024-10-24 12:09 ` Hajime Tazaki
  2024-10-24 12:09 ` [RFC PATCH 13/13] um: nommu: plug nommu code into build system Hajime Tazaki
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-24 12:09 UTC (permalink / raw)
  To: linux-um, jdike, richard, anton.ivanov, johannes; +Cc: thehajime, ricarkol

This commit adds initial documentation for the !MMU mode of UML.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 Documentation/virt/uml/nommu-uml.rst | 219 +++++++++++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 Documentation/virt/uml/nommu-uml.rst

diff --git a/Documentation/virt/uml/nommu-uml.rst b/Documentation/virt/uml/nommu-uml.rst
new file mode 100644
index 000000000000..3171f34a50bd
--- /dev/null
+++ b/Documentation/virt/uml/nommu-uml.rst
@@ -0,0 +1,219 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+UML has been built with CONFIG_MMU since day 0.  This feature
+introduces a nommu mode, approached from a different angle than what
+the Linux Kernel Library tried.
+
+.. contents:: :local:
+
+What is it for ?
+================
+
+- Alleviate the overhead of syscall hooks implemented with ptrace(2)
+- Exercise nommu code over UML (and over KUnit)
+- Reduce dependency on host facilities
+
+
+How it works ?
+==============
+
+To illustrate how this feature works, the following shows how syscalls
+are called under the nommu UML environment.
+
+- boot kernel, set up zpoline trampoline code (detailed later) at address 0x0
+- (userspace starts)
+- calls vfork/execve syscalls
+- during execve, more specifically during the load_elf_fdpic_binary()
+  function, the kernel translates `syscall/sysenter` instructions into `call
+  *%rax`, where %rax usually points to an address from 0 to NR_syscalls
+  (around 512), where the trampoline code was installed during startup.
+- when a syscall is issued by userspace, it jumps to *%rax, slides
+  until the `nop` instructions end, and jumps to the hooked function,
+  `__kernel_vsyscall`, which is the entry point for syscalls under the
+  nommu UML environment.
+- call the handler function in sys_call_table[] and follow how UML
+  syscalls work.
+- return to userspace
+
+
+What are the differences from MMU-full UML ?
+============================================
+
+The current nommu implementation adds 3 different functions which
+MMU-full UML doesn't have:
+
+- kernel address space is directly accessible from userspace
+  - so, uaccess() always returns 1
+  - generic implementation of memcpy/strcpy/futex is also used
+- alternate syscall entrypoint without ptrace
+- translation of syscall/sysenter instructions to a trampoline code
+  and syscall hooks
+
+These modifications allow us to use unmodified userspace
+binaries with nommu UML.
+
+
+History
+=======
+
+This feature was originally introduced by Ricardo Koller at Open
+Source Summit NA 2020, and was then integrated with the syscall
+translation functionality, along with a cleanup of the original code.
+
+Building and running
+====================
+
+```
+% make ARCH=um x86_64_nommu_defconfig
+% make ARCH=um
+```
+
+will build UML with CONFIG_MMU=n applied.
+
+KUnit tests can be run with the following command:
+
+```
+% ./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_MMU=n
+```
+
+To run a typical Linux distribution, we need a nommu-aware userspace.
+We can use a stock version of Alpine Linux with nommu builds of
+busybox and musl-libc.
+
+
+Preparing root filesystem
+=========================
+
+nommu UML requires a standard library which is aware of nommu
+kernels.  We have tested custom-built musl-libc and busybox,
+both of which have built-in support for nommu kernels.
+
+There are no Linux distributions available for nommu on the x86_64
+architecture, so we need to prepare our own image for the root
+filesystem.  We use Alpine Linux as a base distribution and replace
+busybox and musl-libc on top of that.  The following are the steps to
+prepare the filesystem for the quick start.
+
+```
+     container_id=$(docker create ghcr.io/thehajime/alpine:3.20.3-um-nommu)
+     docker start $container_id
+     docker wait $container_id
+     docker export $container_id > alpine.tar
+     docker rm $container_id
+
+     mnt=$(mktemp -d)
+     dd if=/dev/zero of=alpine.ext4 bs=1 count=0 seek=1G
+     sudo chmod og+wr "alpine.ext4"
+     yes 2>/dev/null | mkfs.ext4 "alpine.ext4" || true
+     sudo mount "alpine.ext4" $mnt
+     sudo tar -xf alpine.tar -C $mnt
+     sudo umount $mnt
+```
+
+This will create a file image, `alpine.ext4`, which contains the nommu
+builds of busybox and musl on the Alpine Linux root filesystem.  The
+file can be passed via the `ubd0=` argument on the UML command line.
+
+```
+  ./vmlinux eth0=tuntap,tap100,0e:fd:0:0:0:1,172.17.0.1 ubd0=./alpine.ext4 rw mem=1024m loglevel=8 init=/sbin/init
+```
+
+We plan to upstream apk packages for busybox and musl so that we can
+follow the proper procedure to set up the root filesystem.
+
+
+Quick start with docker
+=======================
+
+There is a docker image that lets you get started quickly in a single step.
+
+```
+  docker run -it -v /dev/shm:/dev/shm --rm ghcr.io/thehajime/alpine:3.20.3-um-nommu
+```
+
+This will launch a UML instance with a pre-configured root filesystem.
+
+Benchmark
+=========
+
+The following shows an example of performance measurements conducted with
+lmbench and a (self-crafted) getpid benchmark (with the v6.12-rc3 Linus tree).
+
+### lmbench (usec)
+
+||native|um|um-nommu|
+|--|--|--|--|
+|select-10    |0.5645|28.3738|0.2647|
+|select-100   |2.3872|28.8385|1.1021|
+|select-1000  |20.5527|37.6364|9.4264|
+|syscall      |0.1735|26.8711|0.1037|
+|read         |0.3442|28.5771|0.1370|
+|write        |0.2862|28.7340|0.1236|
+|stat         |1.9236|38.5928|0.4640|
+|open/close   |3.8308|66.8451|0.7789|
+|fork+sh      |1176.4444|8221.5000|21443.0000|
+|fork+execve  |533.1053|3034.5000|4894.3333|
+
+### do_getpid bench (nsec)
+
+||native|um|um-nommu|
+|--|--|--|--|
+|getpid | 180 | 31579 | 101|
+
+
+Limitations
+===========
+
+generic nommu limitations
+-------------------------
+Since this port is a kernel for a nommu architecture, the
+implementation inherits the characteristics of other nommu kernels
+(riscv, arm, etc), described below.
+
+- vfork(2) should be used instead of fork(2)
+- ELF loader only loads PIE (position independent executable) binaries
+- processes share a single address space
+- mmap(2) offers a subset of functionalities (e.g., MAP_FIXED is
+  unsupported)
+
+Thus, we have limited options for userspace programs.  We have tested
+Alpine Linux with musl-libc, which supports nommu kernels.
+
+access to mmap_min_addr
+-----------------------
+As the mechanism of syscall translation relies on the ability to
+write/read memory address zero (0x0), we need to configure the host kernel
+with the following command:
+
+```
+% sh -c "echo 0 > /proc/sys/vm/mmap_min_addr"
+```
+
+supported architecture
+----------------------
+The current implementation of nommu UML only works on the x86_64 SUBARCH.
+We have not tested it in a 32-bit environment.
+
+target of syscall translation
+-----------------------------
+The syscall translation only applies to the executable and interpreter
+of ELF binary files which are processed by the execve(2) syscall for the
+moment: other libraries, such as linked libraries and dlopen-ed ones,
+aren't translated; we may be able to trigger the translation by
+LD_PRELOAD.
+
+Note that with musl-libc in Alpine Linux, which we have tested, most
+syscalls are implemented in the interpreter file
+(ld-musl-x86_64.so), and syscall/sysenter instructions in the
+linked/loaded libraries might be rare.  But they are definitely possible,
+so a workaround with LD_PRELOAD is effective.
+
+
+Further readings about NOMMU UML
+================================
+
+- NOMMU UML (original code by Ricardo Koller)
+https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf
+
+- zpoline: syscall translation mechanism
+https://www.usenix.org/conference/atc23/presentation/yasukata
-- 
2.43.0




* [RFC PATCH 13/13] um: nommu: plug nommu code into build system
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
                   ` (11 preceding siblings ...)
  2024-10-24 12:09 ` [RFC PATCH 12/13] um: nommu: add documentation of nommu UML Hajime Tazaki
@ 2024-10-24 12:09 ` Hajime Tazaki
  2024-10-25  9:33   ` Johannes Berg
  2024-10-26 10:19 ` [RFC PATCH 00/13] nommu UML Benjamin Berg
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
  14 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-24 12:09 UTC (permalink / raw)
  To: linux-um, jdike, richard, anton.ivanov, johannes; +Cc: thehajime, ricarkol

Add the nommu kernel to the um build.  A defconfig is also provided.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/Kconfig                         | 13 ++++-
 arch/um/configs/x86_64_nommu_defconfig  | 64 +++++++++++++++++++++++++
 arch/um/include/shared/common-offsets.h |  3 ++
 arch/x86/um/Makefile                    | 18 +++++++
 4 files changed, 96 insertions(+), 2 deletions(-)
 create mode 100644 arch/um/configs/x86_64_nommu_defconfig

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index c89575d05021..7e0d4285bd87 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -29,12 +29,15 @@ config UML
 	select ARCH_SUPPORTS_LTO_CLANG_THIN
 	select TRACE_IRQFLAGS_SUPPORT
 	select TTY # Needed for line.c
-	select HAVE_ARCH_VMAP_STACK
+	select HAVE_ARCH_VMAP_STACK if MMU
 	select HAVE_RUST
 	select ARCH_HAS_UBSAN
+	select UACCESS_MEMCPY if !MMU
+	select GENERIC_STRNLEN_USER if !MMU
+	select GENERIC_STRNCPY_FROM_USER if !MMU
 
 config MMU
-	bool
+	bool "MMU-based Paged Memory Management Support"
 	default y
 
 config UML_DMA_EMULATION
@@ -187,8 +190,14 @@ config MAGIC_SYSRQ
 	  The keys are documented in <file:Documentation/admin-guide/sysrq.rst>. Don't say Y
 	  unless you really know what this hack does.
 
+config ARCH_FORCE_MAX_ORDER
+	int "Order of maximal physically contiguous allocations"
+	default "10" if MMU
+	default "16" if !MMU
+
 config KERNEL_STACK_ORDER
 	int "Kernel stack size order"
+	default 3 if !MMU
 	default 2 if 64BIT
 	range 2 10 if 64BIT
 	default 1 if !64BIT
diff --git a/arch/um/configs/x86_64_nommu_defconfig b/arch/um/configs/x86_64_nommu_defconfig
new file mode 100644
index 000000000000..c2e0fb546987
--- /dev/null
+++ b/arch/um/configs/x86_64_nommu_defconfig
@@ -0,0 +1,64 @@
+CONFIG_SYSVIPC=y
+CONFIG_POSIX_MQUEUE=y
+CONFIG_NO_HZ=y
+CONFIG_HIGH_RES_TIMERS=y
+CONFIG_BSD_PROCESS_ACCT=y
+CONFIG_IKCONFIG=y
+CONFIG_IKCONFIG_PROC=y
+CONFIG_LOG_BUF_SHIFT=14
+CONFIG_CGROUPS=y
+CONFIG_BLK_CGROUP=y
+CONFIG_CGROUP_SCHED=y
+CONFIG_CGROUP_DEVICE=y
+CONFIG_CGROUP_CPUACCT=y
+# CONFIG_PID_NS is not set
+CONFIG_CC_OPTIMIZE_FOR_SIZE=y
+# CONFIG_MMU is not set
+CONFIG_HOSTFS=y
+CONFIG_MAGIC_SYSRQ=y
+CONFIG_SSL=y
+CONFIG_NULL_CHAN=y
+CONFIG_PORT_CHAN=y
+CONFIG_PTY_CHAN=y
+CONFIG_TTY_CHAN=y
+CONFIG_CON_CHAN="pts"
+CONFIG_SSL_CHAN="pts"
+CONFIG_UML_SOUND=m
+CONFIG_UML_NET=y
+CONFIG_UML_NET_ETHERTAP=y
+CONFIG_UML_NET_TUNTAP=y
+CONFIG_UML_NET_SLIP=y
+CONFIG_UML_NET_DAEMON=y
+CONFIG_UML_NET_MCAST=y
+CONFIG_UML_NET_SLIRP=y
+CONFIG_MODULES=y
+CONFIG_MODULE_UNLOAD=y
+CONFIG_IOSCHED_BFQ=m
+CONFIG_BINFMT_MISC=m
+CONFIG_NET=y
+CONFIG_PACKET=y
+CONFIG_UNIX=y
+CONFIG_INET=y
+CONFIG_DEVTMPFS=y
+CONFIG_DEVTMPFS_MOUNT=y
+CONFIG_BLK_DEV_UBD=y
+CONFIG_BLK_DEV_LOOP=m
+CONFIG_BLK_DEV_NBD=m
+CONFIG_DUMMY=m
+CONFIG_TUN=m
+CONFIG_PPP=m
+CONFIG_SLIP=m
+CONFIG_LEGACY_PTY_COUNT=32
+CONFIG_UML_RANDOM=y
+CONFIG_SOUND=m
+CONFIG_EXT4_FS=y
+CONFIG_REISERFS_FS=y
+CONFIG_QUOTA=y
+CONFIG_AUTOFS_FS=m
+CONFIG_ISO9660_FS=m
+CONFIG_JOLIET=y
+CONFIG_NLS=y
+CONFIG_DEBUG_KERNEL=y
+CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
+CONFIG_FRAME_WARN=1024
+CONFIG_IPV6=y
diff --git a/arch/um/include/shared/common-offsets.h b/arch/um/include/shared/common-offsets.h
index 579ed946a3a9..85a0e90a34b7 100644
--- a/arch/um/include/shared/common-offsets.h
+++ b/arch/um/include/shared/common-offsets.h
@@ -28,4 +28,7 @@ DEFINE(UML_CONFIG_64BIT, CONFIG_64BIT);
 #ifdef CONFIG_UML_TIME_TRAVEL_SUPPORT
 DEFINE(UML_CONFIG_UML_TIME_TRAVEL_SUPPORT, CONFIG_UML_TIME_TRAVEL_SUPPORT);
 #endif
+#ifdef CONFIG_MMU
+DEFINE(UML_CONFIG_MMU, CONFIG_MMU);
+#endif
 
diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile
index 36e67fc97c22..d0a37634979b 100644
--- a/arch/x86/um/Makefile
+++ b/arch/x86/um/Makefile
@@ -32,6 +32,24 @@ obj-y += syscalls_64.o vdso/
 subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o \
 	../lib/memmove_64.o ../lib/memset_64.o
 
+
+# used by zpoline.c to translate syscall/sysenter instructions
+# note: only in x86_64 w/ !CONFIG_MMU
+ifneq ($(CONFIG_MMU),y)
+inat_tables_script = $(srctree)/arch/x86/tools/gen-insn-attr-x86.awk
+inat_tables_maps = $(srctree)/arch/x86/lib/x86-opcode-map.txt
+quiet_cmd_inat_tables = GEN     $@
+      cmd_inat_tables = $(AWK) -f $(inat_tables_script) $(inat_tables_maps) > $@
+$(obj)/inat-tables.c: $(inat_tables_script) $(inat_tables_maps)
+	$(call cmd,inat_tables)
+targets += inat-tables.c
+$(obj)/../lib/inat.o: $(obj)/inat-tables.c
+subarch-y += ../lib/insn.o ../lib/inat.o
+
+
+obj-y += do_syscall_$(BITS).o entry_$(BITS).o zpoline.o
+endif
+
 endif
 
 subarch-$(CONFIG_MODULES) += ../kernel/module.o
-- 
2.43.0




* Re: [RFC PATCH 02/13] x86/um: nommu: elf loader for fdpic
  2024-10-24 12:09 ` [RFC PATCH 02/13] x86/um: nommu: elf loader for fdpic Hajime Tazaki
@ 2024-10-25  8:56   ` Johannes Berg
  2024-10-25 12:54     ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-10-25  8:56 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um, jdike, richard, anton.ivanov
  Cc: ricarkol, Eric Biederman, Kees Cook, Alexander Viro,
	Christian Brauner, Jan Kara, linux-mm, linux-fsdevel

On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> 
> +#ifndef CONFIG_MMU
> +#include <asm-generic/bug.h>

Not sure that makes so much sense in the middle of the file, no harm
always having it?
> 
> +static inline const struct user_regset_view *task_user_regset_view(
> +	struct task_struct *task)

What happened to indentation here ;-)

static inline const ..... *
task_user_regset_view(....)

would be far easier to read.

> +++ b/arch/x86/um/asm/module.h
> @@ -2,23 +2,6 @@
>  #ifndef __UM_MODULE_H
>  #define __UM_MODULE_H
>  
> -/* UML is simple */
> -struct mod_arch_specific
> -{
> -};
> -
> -#ifdef CONFIG_X86_32
> -
> -#define Elf_Shdr Elf32_Shdr
> -#define Elf_Sym Elf32_Sym
> -#define Elf_Ehdr Elf32_Ehdr
> -
> -#else
> -
> -#define Elf_Shdr Elf64_Shdr
> -#define Elf_Sym Elf64_Sym
> -#define Elf_Ehdr Elf64_Ehdr
> -
> -#endif
> +#include <asm-generic/module.h>
>  
>  #endif

That seems like a worthwhile cleanup on its own, but you should be able
to just remove the file entirely?

johannes



* Re: [RFC PATCH 03/13] um: nommu: memory handling
  2024-10-24 12:09 ` [RFC PATCH 03/13] um: nommu: memory handling Hajime Tazaki
@ 2024-10-25  9:11   ` Johannes Berg
  2024-10-25 12:55     ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-10-25  9:11 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um, jdike, richard, anton.ivanov; +Cc: ricarkol

(I should say, I'm still reading through this, and haven't formed an
overall opinion. Just nitpicking on the details as I see them for now)

> +#endif
> +
>  
>  #include <asm-generic/mmu_context.h>

extra newline

>  /* tlb.c */
> +#ifdef CONFIG_MMU
>  extern void report_enomem(void);
> +#else
> +static inline void report_enomem(void)
> +{
> +}
> +#endif

Should that really do _nothing_? Perhaps it's not called at all in no-
MMU, but then you don't need it, but otherwise it seems it should do
something even if it's just panic()?


>  	brk_end = (unsigned long) UML_ROUND_UP(sbrk(0));
> +#ifdef CONFIG_MMU
>  	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 0);
> +#else
> +	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 1);
> +#endif

That seems much simpler as

	map_memory(.....,
		   !IS_ENABLED(CONFIG_MMU));


> +#ifdef UML_CONFIG_MMU
>  	loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED,
>  		     fd, off);
> +#else
> +	loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED | MAP_ANONYMOUS,
> +		     fd, off);
> +#endif

Same here,

	mmap64(....
	       MAP_SHARED | MAP_FIXED |
		IS_ENABLED(CONFIG_MMU) ? MAP_ANONYMOUS : 0,
	       ...);
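
One detail worth checking when folding the #ifdef into a conditional
expression is operator precedence: in C, `|` binds tighter than `?:`,
so the ternary needs its own parentheses.  Note also that the quoted
#ifdef adds MAP_ANONYMOUS in the !MMU branch.  The sketch below uses
made-up flag values (TOY_*) and a plain `nommu` parameter in place of
the config test; it only demonstrates the parse, not the actual change.

```c
/* Hypothetical flag values for illustration; not the host's real ones. */
#define TOY_MAP_SHARED    0x01
#define TOY_MAP_FIXED     0x10
#define TOY_MAP_ANONYMOUS 0x20

/*
 * How "A | B | cond ? X : 0" is parsed without extra parentheses:
 * the whole OR expression becomes the condition.
 */
static int toy_flags_as_parsed(int nommu)
{
	return (TOY_MAP_SHARED | TOY_MAP_FIXED | nommu) ?
		TOY_MAP_ANONYMOUS : 0;
}

/* The intended form: only the optional flag depends on the condition. */
static int toy_flags_intended(int nommu)
{
	return TOY_MAP_SHARED | TOY_MAP_FIXED |
	       (nommu ? TOY_MAP_ANONYMOUS : 0);
}
```

In the first form the shared and fixed flags are lost regardless of the
condition; the second keeps them and adds the anonymous flag only when
`nommu` is set.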

johannes



* Re: [RFC PATCH 04/13] x86/um: nommu: syscall handling
  2024-10-24 12:09 ` [RFC PATCH 04/13] x86/um: nommu: syscall handling Hajime Tazaki
@ 2024-10-25  9:14   ` Johannes Berg
  2024-10-25 12:55     ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-10-25  9:14 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um, jdike, richard, anton.ivanov; +Cc: ricarkol

On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> 
> +++ b/arch/x86/um/do_syscall_64.c
> @@ -0,0 +1,42 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/kernel.h>
> +#include <linux/ptrace.h>
> +#include <kern_util.h>
> +#include <sysdep/syscalls.h>
> +#include <os.h>
> +
> +#ifndef CONFIG_MMU

This seems unnecessary, you don't build the file with CONFIG_MMU in the
first place.

> +++ b/arch/x86/um/entry_64.S
> @@ -0,0 +1,88 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <asm/errno.h>
> +
> +#include <linux/linkage.h>
> +#include <asm/percpu.h>
> +#include <asm/desc.h>
> +
> +#include "../entry/calling.h"
> +
> +#ifdef CONFIG_SMP
> +#error need to stash these variables somewhere else
> +#endif
> +
> +#ifndef CONFIG_MMU

same here.

> +++ b/arch/x86/um/shared/sysdep/syscalls_64.h
> @@ -25,4 +25,8 @@ extern syscall_handler_t *sys_call_table[];
>  extern syscall_handler_t sys_modify_ldt;
>  extern syscall_handler_t sys_arch_prctl;
>  
> +__visible void do_syscall_64(struct pt_regs *regs);
> +extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
> +			      int64_t a4, int64_t a5, int64_t a6);
> +
> 

but maybe that should be ifdef'ed?

johannes



* Re: [RFC PATCH 05/13] x86/um: nommu: syscall translation by zpoline
  2024-10-24 12:09 ` [RFC PATCH 05/13] x86/um: nommu: syscall translation by zpoline Hajime Tazaki
@ 2024-10-25  9:19   ` Johannes Berg
  2024-10-25 12:58     ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-10-25  9:19 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um, jdike, richard, anton.ivanov; +Cc: ricarkol

On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> This commit adds a mechanism to hook syscalls for unmodified userspace
> programs used under UML in !MMU mode. The mechanism, called zpoline,
> translates syscall/sysenter instructions with `call *%rax`, which can be
> processed by a trampoline code also installed upon an initcall during
> boot. The translation is triggered by elf_arch_finalize_exec(), an arch
> hook introduced by another commit.
> 
> All syscalls issued by userspace thus redirected to a speicific function,

typo: "specific"

> +	if (down_write_killable(&mm->mmap_lock)) {
> +		err = -EINTR;
> +		return err;

?


What happens if the binary JITs some code and you don't find it? I don't
remember from your talk - there you seemed to say this was fine just
slow, but that was zpoline in a different context (container)?

Perhaps UML could additionally install a seccomp filter or something on
itself while running a userspace program? Hmm.


> +/**
> + * setup trampoline code for syscall hooks
> + *
> + * the trampoline code guides to call hooked function, __kernel_vsyscall
> + * in this case, via nop slides at the memory address zero (thus, zpoline).
> + *
> + * loaded binary by exec(2) is translated to call the function.
> + */
> +static int __init setup_zpoline_trampoline(void)
> +{
> +	int i, ret;
> +	int ptr;
> +
> +	/* zpoline: map area of trampoline code started from addr 0x0 */
> +	__zpoline_start = 0x0;
> +
> +	ret = os_map_memory((void *) 0, -1, 0, 0x1000, 1, 1, 1);

(UM_)PAGE_SIZE?

> +	/**
> +	 * FIXME: shit red zone area to properly handle the case

"shift"? :)

> +	 */
> +
> +	/**
> +	 * put code for jumping to __kernel_vsyscall.
> +	 *
> +	 * here we embed the following code.
> +	 *
> +	 * movabs [$addr],%r11
> +	 * jmpq   *%r11
> +	 *
> +	 */
> +	ptr = NR_syscalls;
> +	/* 49 bb [64-bit addr (8-byte)]    movabs [64-bit addr (8-byte)],%r11 */
> +	__zpoline_start[ptr++] = 0x49;
> +	__zpoline_start[ptr++] = 0xbb;
> +	__zpoline_start[ptr++] = ((uint64_t)
> +				  __kernel_vsyscall >> (8 * 0)) & 0xff;

&0xff seems pointless with a u8 array?

> +	/* permission: XOM (PROT_EXEC only) */
> +	ret = os_protect_memory(0, 0x1000, 0, 0, 1);

(UM_)PAGE_SIZE?

johannes



* Re: [RFC PATCH 06/13] x86/um: nommu: process/thread handling
  2024-10-24 12:09 ` [RFC PATCH 06/13] x86/um: nommu: process/thread handling Hajime Tazaki
@ 2024-10-25  9:22   ` Johannes Berg
  2024-10-25 12:58     ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-10-25  9:22 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um, jdike, richard, anton.ivanov; +Cc: ricarkol

On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> Since ptrace facility isn't used under !MMU of UML, there is different
> code path to invoke proceeses/threads; on an entry to the syscall

typo: processes

>  /* Called magically, see new_thread_handler above */
>  static void fork_handler(void)
>  {
> -	schedule_tail(current->thread.prev_sched);
> +	if (current->thread.prev_sched != NULL)

nit: no need for "!= NULL"

> @@ -134,6 +138,21 @@ static void fork_handler(void)
>  
>  	current->thread.prev_sched = NULL;
>  
> +#ifndef CONFIG_MMU
> +	/*
> +	 * This fork can only come from libc's vfork, which
> +	 * does this:
> +	 *	popq %%rdx;
> +	 *	call *%0; // vsyscall
> +	 *	pushq %%rdx;
> +	 * %rdx stores the return address which is stored
> +	 * at pt_regs[HOST_IP] at the moment. We still
> +	 * need to pop the pushed address by "call" though,
> +	 * so this is what this next line does.
> +	 */
> +	if (current->thread.regs.regs.gp[HOST_ORIG_AX] == __NR_vfork)
> +		current->thread.regs.regs.gp[REGS_SP_INDEX] += 8;
> +#endif

Kind of ugly ... but I guess not much choice.

> +#ifndef CONFIG_MMU
> +	current_top_of_stack = task_top_of_stack(to);
> +	current_ptregs = (long)task_pt_regs(to);
> +
> +	if ((to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] == 0)
> +	    || (to->mm == NULL))

Put || on the previous line, "!to->mm"

> +		return;
> +
> +	// rkj: this changes the FS on every context switch

Not sure we're allowing C99 comments yet, and there shouldn't be a "rkj"
tag either :)

johannes



* Re: [RFC PATCH 07/13] um: nommu: configure fs register on host syscall invocation
  2024-10-24 12:09 ` [RFC PATCH 07/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
@ 2024-10-25  9:28   ` Johannes Berg
  2024-10-25 13:27     ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-10-25  9:28 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um, jdike, richard, anton.ivanov; +Cc: ricarkol

On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> 
> +static void sigill(int sig, siginfo_t *si, void *ctx_void)
> +{
> +	longjmp(jmpbuf, 1);
> +}

Should this code use sigsetjmp/siglongjmp?

> +int os_has_fsgsbase(void)
> +{
> +	return has_fsgsbase;
> +}

Why should this be a function rather than just exposing the variable?

> +++ b/arch/um/os-Linux/time.c
> @@ -89,7 +89,8 @@ long long os_nsecs(void)
>  {
>  	struct timespec ts;
>  
> -	clock_gettime(CLOCK_MONOTONIC,&ts);
> +	clock_gettime(CLOCK_MONOTONIC, &ts);
> +
>  	return timespec_to_ns(&ts);

unrelated changes

>  #ifndef CONFIG_MMU
>  
> +static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2)
> +{
> +	if (os_has_fsgsbase()) {
> +		switch (option) {
> +		case ARCH_SET_FS:
> +			wrfsbase(*arg2);
> +			break;
> +		case ARCH_SET_GS:
> +			wrgsbase(*arg2);
> +			break;
> +		case ARCH_GET_FS:
> +			*arg2 = rdfsbase();
> +			break;
> +		case ARCH_GET_GS:
> +			*arg2 = rdgsbase();
> +			break;
> +		}
> +		return 0;
> +	} else
> +		return os_arch_prctl(pid, option, arg2);

please use (or don't) {} on all branches


> @@ -39,4 +73,5 @@ __visible void do_syscall_64(struct pt_regs *regs)
>  			current_thread_info()->aux_fp_regs);
>  	}
>  }
> +
>  #endif

unrelated

johannes



* Re: [RFC PATCH 08/13] x86/um/vdso: nommu: vdso memory update
  2024-10-24 12:09 ` [RFC PATCH 08/13] x86/um/vdso: nommu: vdso memory update Hajime Tazaki
@ 2024-10-25  9:29   ` Johannes Berg
  2024-10-25 13:28     ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-10-25  9:29 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um, jdike, richard, anton.ivanov; +Cc: ricarkol


>  oom:
> -	printk(KERN_ERR "Cannot allocate vdso\n");
> +	pr_err("Cannot allocate vdso");

kind of unrelated change

johannes



* Re: [RFC PATCH 09/13] x86/um: nommu: signal handling
  2024-10-24 12:09 ` [RFC PATCH 09/13] x86/um: nommu: signal handling Hajime Tazaki
@ 2024-10-25  9:30   ` Johannes Berg
  2024-10-25 13:04     ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-10-25  9:30 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um, jdike, richard, anton.ivanov; +Cc: ricarkol

On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> This commit updates the behavior of signal handling under !MMU
> environment. 1) the stack preparation for the signal handlers and
> 2) retoration of stack after rt_sigreturn(2) syscall.  Those are 

typo: restoration

> @@ -562,6 +574,20 @@ SYSCALL_DEFINE0(rt_sigreturn)
>  	unsigned long sp = PT_REGS_SP(&current->thread.regs);
>  	struct rt_sigframe __user *frame =
>  		(struct rt_sigframe __user *)(sp - sizeof(long));
> +#ifndef CONFIG_MMU
> +	/**
> +	 * we enter here with:
> +	 *
> +	 * __restore_rt:
> +	 *     mov $15, %rax
> +	 *     call *%rax (translated from syscall)
> +	 *
> +	 * (code is from musl libc)
> +	 * so, stack needs to be popped of "call"ed address before
> +	 * looking at rt_sigframe.
> +	 */
> +	frame = (struct rt_sigframe __user *)((unsigned long)frame + sizeof(long));
> +#endif
>  	struct ucontext __user *uc = &frame->uc;

you shouldn't put code in the middle of variable declarations ...

I see why, but probably just split #if/#else/#endif?

johannes



* Re: [RFC PATCH 13/13] um: nommu: plug nommu code into build system
  2024-10-24 12:09 ` [RFC PATCH 13/13] um: nommu: plug nommu code into build system Hajime Tazaki
@ 2024-10-25  9:33   ` Johannes Berg
  2024-10-25 13:05     ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-10-25  9:33 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um, jdike, richard, anton.ivanov; +Cc: ricarkol

On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> 
>  config MMU
> -	bool
> +	bool "MMU-based Paged Memory Management Support"
>  	default y

"if !64bit" or something
 
>  config UML_DMA_EMULATION
> @@ -187,8 +190,14 @@ config MAGIC_SYSRQ
>  	  The keys are documented in <file:Documentation/admin-guide/sysrq.rst>. Don't say Y
>  	  unless you really know what this hack does.
>  
> +config ARCH_FORCE_MAX_ORDER
> +	int "Order of maximal physically contiguous allocations"
> +	default "10" if MMU
> +	default "16" if !

Should there even be a user prompt for it?
Perhaps "if EXPERT" or something?

>  config KERNEL_STACK_ORDER
>  	int "Kernel stack size order"
> +	default 3 if !MMU
>  	default 2 if 64BIT
>  	range 2 10 if 64BIT
>  	default 1 if !64BIT

give a different range for !MMU?

johannes



* Re: [RFC PATCH 02/13] x86/um: nommu: elf loader for fdpic
  2024-10-25  8:56   ` Johannes Berg
@ 2024-10-25 12:54     ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-25 12:54 UTC (permalink / raw)
  To: johannes
  Cc: linux-um, jdike, richard, anton.ivanov, ricarkol, ebiederm, kees,
	viro, brauner, jack, linux-mm, linux-fsdevel


Hello Johannes,

On Fri, 25 Oct 2024 17:56:51 +0900,
Johannes Berg wrote:
> 
> On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> > 
> > +#ifndef CONFIG_MMU
> > +#include <asm-generic/bug.h>
> 
> Not sure that makes so much sense in the middle of the file, no harm
> always having it?

agree.

> > +static inline const struct user_regset_view *task_user_regset_view(
> > +	struct task_struct *task)
> 
> What happened to indentation here ;-)
> 
> static inline const ..... *
> task_user_regset_view(....)
> 
> would be far easier to read.

fine, will fix it in the next revision.

> > +++ b/arch/x86/um/asm/module.h
> > @@ -2,23 +2,6 @@
> >  #ifndef __UM_MODULE_H
> >  #define __UM_MODULE_H
> >  
> > -/* UML is simple */
> > -struct mod_arch_specific
> > -{
> > -};
> > -
> > -#ifdef CONFIG_X86_32
> > -
> > -#define Elf_Shdr Elf32_Shdr
> > -#define Elf_Sym Elf32_Sym
> > -#define Elf_Ehdr Elf32_Ehdr
> > -
> > -#else
> > -
> > -#define Elf_Shdr Elf64_Shdr
> > -#define Elf_Sym Elf64_Sym
> > -#define Elf_Ehdr Elf64_Ehdr
> > -
> > -#endif
> > +#include <asm-generic/module.h>
> >  
> >  #endif
> 
> That seems like a worthwhile cleanup on its own, but you should be able
> to just remove the file entirely?

agree. will add module.h to arch/um/include/asm/Kbuild.

-- Hajime



* Re: [RFC PATCH 03/13] um: nommu: memory handling
  2024-10-25  9:11   ` Johannes Berg
@ 2024-10-25 12:55     ` Hajime Tazaki
  2024-10-25 15:15       ` Johannes Berg
  0 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-25 12:55 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol


On Fri, 25 Oct 2024 18:11:01 +0900,
Johannes Berg wrote:
> 
> (I should say, I'm still reading through this, and haven't formed an
> overall opinion. Just nitpicking on the details as I see them for now)

thanks anyway.  looking forward to any opinions.

> > +#endif
> > +
> >  
> >  #include <asm-generic/mmu_context.h>
> 
> extra newline

will fix it.

> >  /* tlb.c */
> > +#ifdef CONFIG_MMU
> >  extern void report_enomem(void);
> > +#else
> > +static inline void report_enomem(void)
> > +{
> > +}
> > +#endif
> 
> Should that really do _nothing_? Perhaps it's not called at all in no-
> MMU, but then you don't need it, but otherwise it seems it should do
> something even if it's just panic()?

it is also called in !MMU.  I'll figure out how the function can be
shared.
> 
> >  	brk_end = (unsigned long) UML_ROUND_UP(sbrk(0));
> > +#ifdef CONFIG_MMU
> >  	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 0);
> > +#else
> > +	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 1);
> > +#endif
> 
> That seems much simpler as
> 
> 	map_memory(.....,
> 		   !IS_ENABLED(CONFIG_MMU));

looks nice, will fix it.

> > +#ifdef UML_CONFIG_MMU
> >  	loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED,
> >  		     fd, off);
> > +#else
> > +	loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED | MAP_ANONYMOUS,
> > +		     fd, off);
> > +#endif
> 
> Same here,
> 
> 	mmap64(....
> 	       MAP_SHARED | MAP_FIXED |
> 		IS_ENABLED(CONFIG_MMU) ? MAP_ANONYMOUS : 0,
> 	       ...);

this part is under os-Linux, where we cannot use kconfig.h (IIUC)
features (e.g., IS_ENABLED), but I'll reformat it to simplify instead
of duplicating the same lines.

-- Hajime



* Re: [RFC PATCH 04/13] x86/um: nommu: syscall handling
  2024-10-25  9:14   ` Johannes Berg
@ 2024-10-25 12:55     ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-25 12:55 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol


On Fri, 25 Oct 2024 18:14:19 +0900,
Johannes Berg wrote:
> 
> On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> > 
> > +++ b/arch/x86/um/do_syscall_64.c
> > @@ -0,0 +1,42 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/ptrace.h>
> > +#include <kern_util.h>
> > +#include <sysdep/syscalls.h>
> > +#include <os.h>
> > +
> > +#ifndef CONFIG_MMU
> 
> This seems unnecessary, you don't build the file with CONFIG_MMU in the
> first place.

will fix it.

> > +++ b/arch/x86/um/entry_64.S
> > @@ -0,0 +1,88 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#include <asm/errno.h>
> > +
> > +#include <linux/linkage.h>
> > +#include <asm/percpu.h>
> > +#include <asm/desc.h>
> > +
> > +#include "../entry/calling.h"
> > +
> > +#ifdef CONFIG_SMP
> > +#error need to stash these variables somewhere else
> > +#endif
> > +
> > +#ifndef CONFIG_MMU
> 
> same here.

will fix it too.

> > +++ b/arch/x86/um/shared/sysdep/syscalls_64.h
> > @@ -25,4 +25,8 @@ extern syscall_handler_t *sys_call_table[];
> >  extern syscall_handler_t sys_modify_ldt;
> >  extern syscall_handler_t sys_arch_prctl;
> >  
> > +__visible void do_syscall_64(struct pt_regs *regs);
> > +extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
> > +			      int64_t a4, int64_t a5, int64_t a6);
> > +
> > 
> 
> but maybe that should be ifdef'ed?

thanks, will fix it too.

-- Hajime



* Re: [RFC PATCH 05/13] x86/um: nommu: syscall translation by zpoline
  2024-10-25  9:19   ` Johannes Berg
@ 2024-10-25 12:58     ` Hajime Tazaki
  2024-10-25 15:20       ` Johannes Berg
  0 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-25 12:58 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol



On Fri, 25 Oct 2024 18:19:25 +0900,
Johannes Berg wrote:
> 
> On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> > This commit adds a mechanism to hook syscalls for unmodified userspace
> > programs used under UML in !MMU mode. The mechanism, called zpoline,
> > translates syscall/sysenter instructions with `call *%rax`, which can be
> > processed by a trampoline code also installed upon an initcall during
> > boot. The translation is triggered by elf_arch_finalize_exec(), an arch
> > hook introduced by another commit.
> > 
> > All syscalls issued by userspace thus redirected to a speicific function,
> 
> typo: "specific"

thanks.

> > +	if (down_write_killable(&mm->mmap_lock)) {
> > +		err = -EINTR;
> > +		return err;
> 
> ?

the lock isn't needed actually so, will remove it.

> What happens if the binary JITs some code and you don't find it? I don't
> remember from your talk - there you seemed to say this was fine just
> slow, but that was zpoline in a different context (container)?

instructions loaded after the execve family (like JIT-generated code,
code loaded with dlopen, etc.) aren't going to be translated.  we can
translate them by tweaking the userspace loader (ld.so w/ LD_PRELOAD)
or by hooking the mprotect(2) syscall before executing JIT-generated
code.  a generic description is written in the document ([12/13]).
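
As a rough illustration of the translation step (a sketch, not the code
from this series): syscall (0f 05) and sysenter (0f 34) are both two
bytes long, and so is `call *%rax` (ff d0), so the rewrite can happen in
place.  Since %rax holds the syscall number, the call lands in the nop
slide at address zero and falls through to the trampoline.  The helper
name rewrite_syscalls() and the naive byte scan are assumptions made for
this example only; a real translator has to find instruction boundaries
first, because the same byte pair can also appear inside an immediate or
a longer instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): replace every 2-byte
 * syscall (0f 05) or sysenter (0f 34) with "call *%rax" (ff d0).
 *
 * This is a naive byte scan.  It does not decode instructions, so it
 * would also rewrite matching bytes that are really data or operands.
 * Returns the number of rewritten sites.
 */
size_t rewrite_syscalls(uint8_t *code, size_t len)
{
	size_t i, count = 0;

	assert(code != NULL || len == 0);

	for (i = 0; i + 1 < len; i++) {
		if (code[i] != 0x0f)
			continue;
		if (code[i + 1] != 0x05 && code[i + 1] != 0x34)
			continue;

		code[i] = 0xff;		/* call *%rax, opcode byte */
		code[i + 1] = 0xd0;	/* call *%rax, ModRM byte */
		count++;
		i++;			/* skip the byte just rewritten */
	}

	return count;
}
```

Code that shows up after this pass has run (JIT output, dlopen'ed
objects) is never scanned, which is the gap discussed above.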

> Perhaps UML could additionally install a seccomp filter or something on
> itself while running a userspace program? Hmm.

I'm trying to understand the purpose of the seccomp filter you suggested
here; is it for preventing syscalls from being executed by untranslated code?

> > +/**
> > + * setup trampoline code for syscall hooks
> > + *
> > + * the trampoline code guides to call hooked function, __kernel_vsyscall
> > + * in this case, via nop slides at the memory address zero (thus, zpoline).
> > + *
> > + * loaded binary by exec(2) is translated to call the function.
> > + */
> > +static int __init setup_zpoline_trampoline(void)
> > +{
> > +	int i, ret;
> > +	int ptr;
> > +
> > +	/* zpoline: map area of trampoline code started from addr 0x0 */
> > +	__zpoline_start = 0x0;
> > +
> > +	ret = os_map_memory((void *) 0, -1, 0, 0x1000, 1, 1, 1);
> 
> (UM_)PAGE_SIZE?

thanks, it's much better; will fix it.

> > +	/**
> > +	 * FIXME: shit red zone area to properly handle the case
> 
> "shift"? :)

thanks (╥﹏╥)

> > +	 */
> > +
> > +	/**
> > +	 * put code for jumping to __kernel_vsyscall.
> > +	 *
> > +	 * here we embed the following code.
> > +	 *
> > +	 * movabs [$addr],%r11
> > +	 * jmpq   *%r11
> > +	 *
> > +	 */
> > +	ptr = NR_syscalls;
> > +	/* 49 bb [64-bit addr (8-byte)]    movabs [64-bit addr (8-byte)],%r11 */
> > +	__zpoline_start[ptr++] = 0x49;
> > +	__zpoline_start[ptr++] = 0xbb;
> > +	__zpoline_start[ptr++] = ((uint64_t)
> > +				  __kernel_vsyscall >> (8 * 0)) & 0xff;
> 
> &0xff seems pointless with a u8 array?

agree, will fix it.
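
For reference, a minimal sketch of the byte emission without the mask,
assuming the destination is a uint8_t buffer as in the quoted code.  The
helper name emit_jump_via_r11() is made up for this example.  Storing a
wider unsigned value into a uint8_t keeps only the low 8 bits, so the
explicit `& 0xff` adds nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): write
 *
 *   49 bb <imm64>    movabs $imm64,%r11
 *   41 ff e3         jmpq   *%r11
 *
 * into buf and return the number of bytes written (13).
 */
size_t emit_jump_via_r11(uint8_t *buf, uint64_t target)
{
	size_t n = 0;
	int i;

	assert(buf != NULL);

	buf[n++] = 0x49;
	buf[n++] = 0xbb;

	/*
	 * Little-endian immediate.  The conversion to uint8_t truncates
	 * to the low byte, so no "& 0xff" is needed.
	 */
	for (i = 0; i < 8; i++)
		buf[n++] = (uint8_t)(target >> (8 * i));

	buf[n++] = 0x41;
	buf[n++] = 0xff;
	buf[n++] = 0xe3;

	return n;
}
```

In the quoted patch the equivalent bytes are written starting at offset
NR_syscalls, i.e. right after the nop slide.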

> > +	/* permission: XOM (PROT_EXEC only) */
> > +	ret = os_protect_memory(0, 0x1000, 0, 0, 1);
> 
> (UM_)PAGE_SIZE?

will fix it too.

-- Hajime


* Re: [RFC PATCH 06/13] x86/um: nommu: process/thread handling
  2024-10-25  9:22   ` Johannes Berg
@ 2024-10-25 12:58     ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-25 12:58 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol


On Fri, 25 Oct 2024 18:22:29 +0900,
Johannes Berg wrote:
> 
> On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> > Since ptrace facility isn't used under !MMU of UML, there is different
> > code path to invoke proceeses/threads; on an entry to the syscall
> 
> typo: processes

thanks. (I thought checkpatch.pl detects them..)

> >  /* Called magically, see new_thread_handler above */
> >  static void fork_handler(void)
> >  {
> > -	schedule_tail(current->thread.prev_sched);
> > +	if (current->thread.prev_sched != NULL)
> 
> nit: no need for "!= NULL"

will fix it.

> > @@ -134,6 +138,21 @@ static void fork_handler(void)
> >  
> >  	current->thread.prev_sched = NULL;
> >  
> > +#ifndef CONFIG_MMU
> > +	/*
> > +	 * This fork can only come from libc's vfork, which
> > +	 * does this:
> > +	 *	popq %%rdx;
> > +	 *	call *%0; // vsyscall
> > +	 *	pushq %%rdx;
> > +	 * %rdx stores the return address which is stored
> > +	 * at pt_regs[HOST_IP] at the moment. We still
> > +	 * need to pop the pushed address by "call" though,
> > +	 * so this is what this next line does.
> > +	 */
> > +	if (current->thread.regs.regs.gp[HOST_ORIG_AX] == __NR_vfork)
> > +		current->thread.regs.regs.gp[REGS_SP_INDEX] += 8;
> > +#endif
> 
> Kind of ugly ... but I guess not much choice.

(indeed)

> > +#ifndef CONFIG_MMU
> > +	current_top_of_stack = task_top_of_stack(to);
> > +	current_ptregs = (long)task_pt_regs(to);
> > +
> > +	if ((to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] == 0)
> > +	    || (to->mm == NULL))
> 
> Put || on the previous line, "!to->mm"

will fix it.

> > +		return;
> > +
> > +	// rkj: this changes the FS on every context switch
> 
> Not sure we're allowing C99 comments yet, and there shouldn't be a "rkj"
> tag either :)

this is my mistake; forgot to remove those private tags.
will fix it.

-- Hajime



* Re: [RFC PATCH 09/13] x86/um: nommu: signal handling
  2024-10-25  9:30   ` Johannes Berg
@ 2024-10-25 13:04     ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-25 13:04 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol


On Fri, 25 Oct 2024 18:30:41 +0900,
Johannes Berg wrote:
> 
> On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> > This commit updates the behavior of signal handling under !MMU
> > environment. 1) the stack preparation for the signal handlers and
> > 2) retoration of stack after rt_sigreturn(2) syscall.  Those are 
> 
> typo: restoration

will fix it.

> > @@ -562,6 +574,20 @@ SYSCALL_DEFINE0(rt_sigreturn)
> >  	unsigned long sp = PT_REGS_SP(&current->thread.regs);
> >  	struct rt_sigframe __user *frame =
> >  		(struct rt_sigframe __user *)(sp - sizeof(long));
> > +#ifndef CONFIG_MMU
> > +	/**
> > +	 * we enter here with:
> > +	 *
> > +	 * __restore_rt:
> > +	 *     mov $15, %rax
> > +	 *     call *%rax (translated from syscall)
> > +	 *
> > +	 * (code is from musl libc)
> > +	 * so, stack needs to be popped of "call"ed address before
> > +	 * looking at rt_sigframe.
> > +	 */
> > +	frame = (struct rt_sigframe __user *)((unsigned long)frame + sizeof(long));
> > +#endif
> >  	struct ucontext __user *uc = &frame->uc;
> 
> you shouldn't put code in the middle of variable declarations ...
> 
> I see why, but probably just split #if/#else/#endif?

thanks, will reformat it to make it clear.
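
One possible reading of the "split #if/#else/#endif" suggestion, as a
standalone sketch: pick the initializer per configuration so that no
statement sits between the declarations.  SKETCH_NOMMU stands in for
!CONFIG_MMU and sigframe_addr() is a made-up name; the real code works
on struct rt_sigframe __user pointers rather than plain integers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for !CONFIG_MMU, used only in this sketch. */
#define SKETCH_NOMMU 1

/*
 * Compute where the signal frame starts for a given stack pointer.
 *
 * With the zpoline translation, rt_sigreturn is reached through
 * "call *%rax", which pushed a return address.  Skipping that slot
 * cancels the usual "sp - sizeof(long)" adjustment.
 */
uintptr_t sigframe_addr(uintptr_t sp)
{
#if SKETCH_NOMMU
	uintptr_t frame = (sp - sizeof(long)) + sizeof(long);
#else
	uintptr_t frame = sp - sizeof(long);
#endif

	assert(sp >= sizeof(long));

	return frame;
}
```

Both variants stay plain declarations, so anything declared after
`frame` (such as `uc` in the quoted hunk) can keep its initializer.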

-- Hajime



* Re: [RFC PATCH 13/13] um: nommu: plug nommu code into build system
  2024-10-25  9:33   ` Johannes Berg
@ 2024-10-25 13:05     ` Hajime Tazaki
  2024-10-25 15:27       ` Johannes Berg
  0 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-25 13:05 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol


On Fri, 25 Oct 2024 18:33:06 +0900,
Johannes Berg wrote:
> 
> On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> > 
> >  config MMU
> > -	bool
> > +	bool "MMU-based Paged Memory Management Support"
> >  	default y
> 
> "if !64bit" or something

not sure if I understand correctly but where do you suggest to add "if
!64bit" in this block ?

> >  config UML_DMA_EMULATION
> > @@ -187,8 +190,14 @@ config MAGIC_SYSRQ
> >  	  The keys are documented in <file:Documentation/admin-guide/sysrq.rst>. Don't say Y
> >  	  unless you really know what this hack does.
> >  
> > +config ARCH_FORCE_MAX_ORDER
> > +	int "Order of maximal physically contiguous allocations"
> > +	default "10" if MMU
> > +	default "16" if !
> 
> Should there even be a user prompt for it?
> Perhaps "if EXPERT" or something?

I think it's already user-editable but your suggestion is to limit
with 'if EXPERT' ?

> >  config KERNEL_STACK_ORDER
> >  	int "Kernel stack size order"
> > +	default 3 if !MMU
> >  	default 2 if 64BIT
> >  	range 2 10 if 64BIT
> >  	default 1 if !64BIT
> 
> give a different range for !MMU?

will fix it.

-- Hajime



* Re: [RFC PATCH 07/13] um: nommu: configure fs register on host syscall invocation
  2024-10-25  9:28   ` Johannes Berg
@ 2024-10-25 13:27     ` Hajime Tazaki
  2024-10-25 15:22       ` Johannes Berg
  0 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-25 13:27 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol



On Fri, 25 Oct 2024 18:28:01 +0900,
Johannes Berg wrote:
> 
> On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> > 
> > +static void sigill(int sig, siginfo_t *si, void *ctx_void)
> > +{
> > +	longjmp(jmpbuf, 1);
> > +}
> 
> Should this code use sigsetjmp/siglongjmp?

the code is borrowed from tools/testing/selftests/x86/fsgsbase.c and
the original code uses sigsetjmp/siglongjmp indeed.

I was struggling to pull the definition of sigsetjmp & co from host
headers as it conflicts with UML definitions of jmp_buf, etc.

I will look into the details again, but it would be nice to hear if
you have any experience with this.

> > +int os_has_fsgsbase(void)
> > +{
> > +	return has_fsgsbase;
> > +}
> 
> Why should this be a function rather than just exposing the variable?

because it is referenced from arch/x86/um code.

> > +++ b/arch/um/os-Linux/time.c
> > @@ -89,7 +89,8 @@ long long os_nsecs(void)
> >  {
> >  	struct timespec ts;
> >  
> > -	clock_gettime(CLOCK_MONOTONIC,&ts);
> > +	clock_gettime(CLOCK_MONOTONIC, &ts);
> > +
> >  	return timespec_to_ns(&ts);
> 
> unrelated changes

will revert it.

> >  #ifndef CONFIG_MMU
> >  
> > +static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2)
> > +{
> > +	if (os_has_fsgsbase()) {
> > +		switch (option) {
> > +		case ARCH_SET_FS:
> > +			wrfsbase(*arg2);
> > +			break;
> > +		case ARCH_SET_GS:
> > +			wrgsbase(*arg2);
> > +			break;
> > +		case ARCH_GET_FS:
> > +			*arg2 = rdfsbase();
> > +			break;
> > +		case ARCH_GET_GS:
> > +			*arg2 = rdgsbase();
> > +			break;
> > +		}
> > +		return 0;
> > +	} else
> > +		return os_arch_prctl(pid, option, arg2);
> 
> please use (or don't) {} on all branches

thanks, will fix it.

> > @@ -39,4 +73,5 @@ __visible void do_syscall_64(struct pt_regs *regs)
> >  			current_thread_info()->aux_fp_regs);
> >  	}
> >  }
> > +
> >  #endif
> 
> unrelated

will revert it too.

-- Hajime



* Re: [RFC PATCH 08/13] x86/um/vdso: nommu: vdso memory update
  2024-10-25  9:29   ` Johannes Berg
@ 2024-10-25 13:28     ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-25 13:28 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol


On Fri, 25 Oct 2024 18:29:07 +0900,
Johannes Berg wrote:
> 
> 
> >  oom:
> > -	printk(KERN_ERR "Cannot allocate vdso\n");
> > +	pr_err("Cannot allocate vdso");
> 
> kind of unrelated change

thanks, will fix it.

-- Hajime




* Re: [RFC PATCH 03/13] um: nommu: memory handling
  2024-10-25 12:55     ` Hajime Tazaki
@ 2024-10-25 15:15       ` Johannes Berg
  2024-10-26  7:24         ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-10-25 15:15 UTC (permalink / raw)
  To: Hajime Tazaki; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol

On Fri, 2024-10-25 at 21:55 +0900, Hajime Tazaki wrote:
> > 
> > Should that really do _nothing_? Perhaps it's not called at all in no-
> > MMU, but then you don't need it, but otherwise it seems it should do
> > something even if it's just panic()?
> 
> it is also called in !MMU.  I'll figure out how the function can be
> shared.

Feels like it should do something then? Why not print like before? If it
happens in userspace we kill it, otherwise not sure what even happens...

> > 	mmap64(....
> > 	       MAP_SHARED | MAP_FIXED |
> > 		IS_ENABLED(CONFIG_MMU) ? MAP_ANONYMOUS : 0,
> > 	       ...);
> 
> this part is under os-Linux, where we cannot use kconfig.h (IIUC)
> features (e.g., IS_ENABLED), but I'll reformat it to simplify instead
> of duplicating the same lines.

Oh, missed that, sorry

still I guess putting

#ifndef CONFIG_MMU
	| MAP_ANONYMOUS
#endif

might be nicer.
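
To make the two spellings concrete, here is a small standalone sketch.
SKETCH_NOMMU is a made-up stand-in for the real config symbol and the
function names are invented for this example.  In the patch
MAP_ANONYMOUS is added in the !MMU branch, and a ternary variant needs
parentheses because ?: binds more loosely than |.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the !MMU configuration. */
#define SKETCH_NOMMU 1

/* Variant 1: splice the extra flag in with the preprocessor. */
int map_flags_ifdef(void)
{
	return MAP_SHARED | MAP_FIXED
#if SKETCH_NOMMU
		| MAP_ANONYMOUS
#endif
		;
}

/* Variant 2: ternary form, with the parentheses it needs. */
int map_flags_ternary(int nommu)
{
	assert(nommu == 0 || nommu == 1);

	return MAP_SHARED | MAP_FIXED | (nommu ? MAP_ANONYMOUS : 0);
}

/*
 * Pitfall: without parentheses this parses as
 * (MAP_SHARED | MAP_FIXED | nommu) ? MAP_ANONYMOUS : 0
 * and the condition is always non-zero, so the result is always
 * MAP_ANONYMOUS alone.
 */
int map_flags_unparenthesized(int nommu)
{
	return MAP_SHARED | MAP_FIXED | nommu ? MAP_ANONYMOUS : 0;
}
```

Some compilers warn about the last function, which is the point of
showing it.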

johannes



* Re: [RFC PATCH 05/13] x86/um: nommu: syscall translation by zpoline
  2024-10-25 12:58     ` Hajime Tazaki
@ 2024-10-25 15:20       ` Johannes Berg
  2024-10-26  7:36         ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-10-25 15:20 UTC (permalink / raw)
  To: Hajime Tazaki; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol

On Fri, 2024-10-25 at 21:58 +0900, Hajime Tazaki wrote:
> 
> > > +	if (down_write_killable(&mm->mmap_lock)) {
> > > +		err = -EINTR;
> > > +		return err;
> > 
> > ?
> 
> the lock isn't needed actually so, will remove it.

Oh, I was just looking at the weird handling of the err variable :)

> > What happens if the binary JITs some code and you don't find it? I don't
> > remember from your talk - there you seemed to say this was fine just
> > slow, but that was zpoline in a different context (container)?
> 
> instructions loaded after the execve family (like JIT-generated code,
> code loaded with dlopen, etc.) aren't going to be translated.  we can
> translate them by tweaking the userspace loader (ld.so w/ LD_PRELOAD)
> or by hooking the mprotect(2) syscall before executing JIT-generated
> code.  a generic description is written in the document ([12/13]).

Guess I should've read that, sorry.

> > Perhaps UML could additionally install a seccomp filter or something on
> > itself while running a userspace program? Hmm.
> 
> I'm trying to understand the purpose of the seccomp filter you suggested
> here; is it for preventing syscalls from being executed by untranslated code?

Yeah, that's what I was wondering.

Obviously you have to be able to get rid of the seccomp filter again so
it's not foolproof, but perhaps not _that_ bad?

I'm not worried about security or so, it's clear this isn't even _meant_
to have security. But I do wonder about really hard to debug issues if
userspace suddenly makes syscalls to the host, that'd be ... difficult
to understand?

johannes



* Re: [RFC PATCH 07/13] um: nommu: configure fs register on host syscall invocation
  2024-10-25 13:27     ` Hajime Tazaki
@ 2024-10-25 15:22       ` Johannes Berg
  2024-10-26  7:34         ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-10-25 15:22 UTC (permalink / raw)
  To: Hajime Tazaki; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol

On Fri, 2024-10-25 at 22:27 +0900, Hajime Tazaki wrote:
> 
> On Fri, 25 Oct 2024 18:28:01 +0900,
> Johannes Berg wrote:
> > 
> > On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> > > 
> > > +static void sigill(int sig, siginfo_t *si, void *ctx_void)
> > > +{
> > > +	longjmp(jmpbuf, 1);
> > > +}
> > 
> > Should this code use sigsetjmp/siglongjmp?
> 
> the code is borrowed from tools/testing/selftests/x86/fsgsbase.c and
> the original code uses sigsetjmp/siglongjmp indeed.

:)

> I was struggling to pull the definition of sigsetjmp & co from host
> headers as it conflicts with UML definitions of jmp_buf, etc.
> 
> I will look into the details again, but it would be nice to hear if
> you have any experience with this.

Hm. This is a userspace side so there shouldn't be much trouble with
that? Worst case put it into its own file in os-Linux/ and don't include
so many shared headers, I guess?
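
Something along these lines could live in such a standalone file.  This
is only a sketch using host libc headers; probe_insn() and the two
stubs are invented names, and raise(SIGILL) stands in for executing an
instruction the CPU may not support.  The reason to prefer
sigsetjmp(env, 1) is that SIGILL is blocked while its handler runs, and
siglongjmp() restores the saved mask, whereas POSIX leaves it
unspecified whether plain longjmp() does.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf probe_env;

static void sigill_handler(int sig)
{
	(void)sig;
	/* Jump back to the probe and restore the saved signal mask. */
	siglongjmp(probe_env, 1);
}

/* Hypothetical helper: returns 1 if fn() completed, 0 on SIGILL. */
int probe_insn(void (*fn)(void))
{
	struct sigaction sa, old;
	volatile int ok = 0;

	assert(fn != NULL);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigill_handler;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGILL, &sa, &old) < 0)
		return 0;

	/* savemask = 1, so SIGILL is unblocked again after the jump. */
	if (sigsetjmp(probe_env, 1) == 0) {
		fn();
		ok = 1;
	}

	sigaction(SIGILL, &old, NULL);
	return ok;
}

/* Stand-ins for "unsupported" and "supported" instructions. */
void faulting_stub(void)
{
	raise(SIGILL);
}

void working_stub(void)
{
}
```

Calling probe_insn(faulting_stub) twice in a row is a quick check that
the mask really is restored: if SIGILL stayed blocked, the second
raise() would not reach the handler.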

> > > +int os_has_fsgsbase(void)
> > > +{
> > > +	return has_fsgsbase;
> > > +}
> > 
> > Why should this be a function rather than just exposing the variable?
> 
> because it is referenced from arch/x86/um code.

Yeah but you have to also declare the function somewhere in a header
file - could as well declare "extern int host_has_fsgsbase;" or so?

johannes



* Re: [RFC PATCH 13/13] um: nommu: plug nommu code into build system
  2024-10-25 13:05     ` Hajime Tazaki
@ 2024-10-25 15:27       ` Johannes Berg
  2024-10-26  7:36         ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-10-25 15:27 UTC (permalink / raw)
  To: Hajime Tazaki; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol

On Fri, 2024-10-25 at 22:05 +0900, Hajime Tazaki wrote:
> On Fri, 25 Oct 2024 18:33:06 +0900,
> Johannes Berg wrote:
> > 
> > On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> > > 
> > >  config MMU
> > > -	bool
> > > +	bool "MMU-based Paged Memory Management Support"
> > >  	default y
> > 
> > "if !64bit" or something
> 
> not sure if I understand correctly but where do you suggest to add "if
> !64bit" in this block ?

config MMU
	bool "..." if 64BIT

was what I was thinking, since you cannot allow !MMU on 32-bit.

As you wrote it, the user would get prompted on both 32 and 64 bit, and
get an invalid configuration on 32-bit if they turn off MMU.

> > > +config ARCH_FORCE_MAX_ORDER
> > > +	int "Order of maximal physically contiguous allocations"
> > > +	default "10" if MMU
> > > +	default "16" if !
> > 
> > Should there even be a user prompt for it?
> > Perhaps "if EXPERT" or something?
> 
> I think it's already user-editable 

It is now, but it didn't even exist before, so nobody would've been
asked. I'm not sure it's _too_ useful to ask the users though? When
would they know how to set it? (Perhaps also add some help text? I'm
guessing it's related to the size of binaries or so?)

> but your suggestion is to limit with 'if EXPERT' ?

Or maybe it should just not be user editable at all?

johannes



* Re: [RFC PATCH 03/13] um: nommu: memory handling
  2024-10-25 15:15       ` Johannes Berg
@ 2024-10-26  7:24         ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-26  7:24 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol


On Sat, 26 Oct 2024 00:15:06 +0900,
Johannes Berg wrote:
> 
> On Fri, 2024-10-25 at 21:55 +0900, Hajime Tazaki wrote:
> > > 
> > > Should that really do _nothing_? Perhaps it's not called at all in no-
> > > MMU, but then you don't need it, but otherwise it seems it should do
> > > something even if it's just panic()?
> > 
> > it is called also in !MMU.  I'll think to figure out how the function
> > is shared.
> 
> Feels like it should do something then? Why not print like before? If it
> happens in userspace we kill it, otherwise not sure what even happens...

report_enomem() is defined in tlb.c and used in several places
(trap.c, os-Linux/skas/process.c).  With !MMU, tlb.c is filtered out
of the build while trap.c is still compiled, which leaves the symbol
undefined.

I can move report_enomem() somewhere else, like mem.c, but all of its
current callers are MMU-dependent code paths, so I thought it was
fine for it to do nothing.

> > > 	mmap64(....
> > > 	       MAP_SHARED | MAP_FIXED |
> > > 		IS_ENABLED(CONFIG_MMU) ? MAP_ANONYMOUS : 0,
> > > 	       ...);
> > 
> > since this is part under os-Linux and we cannot use kconfig.h (IIUC)
> > feature (e.g., IS_ENABLED). but I'll reformat it to simplify instead
> > of duplicating same lines.
> 
> Oh, missed that, sorry
> 
> still I guess putting
> 
> #ifndef CONFIG_MMU
> 	| MAP_ANONYMOUS
> #endif
> 
> might be nicer.

I thought the same thing.  Will fix it.
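
A minimal sketch of the preprocessor variant, assuming the flags are
collected in a hypothetical helper (`uml_map_flags()` is not a name
from the patch) and that MAP_ANONYMOUS is wanted only when CONFIG_MMU
is not defined, as in the snippet quoted above.  Building the mask in
its own statement also sidesteps a precedence trap of the one-line
form: `A | B | cond ? X : 0` parses as `(A | B | cond) ? X : 0`.

```c
#define _GNU_SOURCE
#include <sys/mman.h>

/*
 * Hypothetical helper, not taken from the patch: collect the mmap()
 * flags once so the mmap64() call itself is not duplicated.
 * Assumption: MAP_ANONYMOUS is added only for !MMU builds.
 */
static int uml_map_flags(void)
{
	int flags = MAP_SHARED | MAP_FIXED;

#ifndef CONFIG_MMU
	flags |= MAP_ANONYMOUS;
#endif
	return flags;
}
```

The caller would then pass `uml_map_flags()` as the flags argument of
the single mmap64() call.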

-- Hajime



* Re: [RFC PATCH 07/13] um: nommu: configure fs register on host syscall invocation
  2024-10-25 15:22       ` Johannes Berg
@ 2024-10-26  7:34         ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-26  7:34 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol


On Sat, 26 Oct 2024 00:22:48 +0900,
Johannes Berg wrote:
> 
> On Fri, 2024-10-25 at 22:27 +0900, Hajime Tazaki wrote:
> > 
> > On Fri, 25 Oct 2024 18:28:01 +0900,
> > Johannes Berg wrote:
> > > 
> > > On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> > > > 
> > > > +static void sigill(int sig, siginfo_t *si, void *ctx_void)
> > > > +{
> > > > +	longjmp(jmpbuf, 1);
> > > > +}
> > > 
> > > Should this code use sigsetjmp/siglongjmp?
> > 
> > the code is referred from tools/testing/selftests/x86/fsgsbase.c and
> > the original code uses sigsetjmp/siglongjmp indeed.
> 
> :)
> 
> > I was struggling to pull the definition of sigsetjmp & co from host
> > headers as it conflicts with UML definitions of jmp_buf, etc.
> > 
> > Will look into detail again but would be nice if you have an
> > experience on this.
> 
> Hm. This is a userspace side so there shouldn't be much trouble with
> that? Worst case put it into its own file in os-Linux/ and don't include
> so many shared headers, I guess?

thanks, I'll try with this approach.

> > > > +int os_has_fsgsbase(void)
> > > > +{
> > > > +	return has_fsgsbase;
> > > > +}
> > > 
> > > Why should this be a function rather than just exposing the variable?
> > 
> > as it is referred in arch/x86/um code.
> 
> Yeah but you have to also declare the function somewhere in a header
> file - could as well declare "extern int host_has_fsgsbase;" or so?

agreed.  I'll update that part.
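
A rough userspace sketch of both points, assuming a standalone file
that includes only host headers (so sigjmp_buf cannot clash with the
UML definitions) and exports the result as a plain variable.  The
names `probe_fsgsbase()` and `host_has_fsgsbase` are illustrative, and
using `rdfsbase` as the probe instruction is an assumption modelled
loosely on the selftest mentioned above.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <string.h>

/* Illustrative name; a header would carry "extern int host_has_fsgsbase;". */
int host_has_fsgsbase;

static sigjmp_buf probe_env;

static void sigill_handler(int sig)
{
	(void)sig;
	/* siglongjmp() restores the signal mask saved by sigsetjmp(env, 1). */
	siglongjmp(probe_env, 1);
}

/* Hypothetical probe: try one FSGSBASE instruction and catch SIGILL. */
static void probe_fsgsbase(void)
{
	struct sigaction sa, old;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigill_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGILL, &sa, &old);

	host_has_fsgsbase = 0;
	if (sigsetjmp(probe_env, 1) == 0) {
#if defined(__x86_64__)
		unsigned long base;

		/* Raises SIGILL if the host has not enabled FSGSBASE. */
		__asm__ volatile("rdfsbase %0" : "=r"(base));
		(void)base;
		host_has_fsgsbase = 1;
#endif
	}
	/* Put the previous SIGILL disposition back. */
	sigaction(SIGILL, &old, NULL);
}
```

With plain setjmp/longjmp the signal mask is typically not restored on
Linux, so SIGILL would stay blocked after the handler jumps out; the
sigsetjmp/siglongjmp pair avoids that.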

-- Hajime



* Re: [RFC PATCH 05/13] x86/um: nommu: syscall translation by zpoline
  2024-10-25 15:20       ` Johannes Berg
@ 2024-10-26  7:36         ` Hajime Tazaki
  2024-10-27  9:45           ` Johannes Berg
  0 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-26  7:36 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol

On Sat, 26 Oct 2024 00:20:49 +0900,
Johannes Berg wrote:
> 
> On Fri, 2024-10-25 at 21:58 +0900, Hajime Tazaki wrote:
> > 
> > > > +	if (down_write_killable(&mm->mmap_lock)) {
> > > > +		err = -EINTR;
> > > > +		return err;
> > > 
> > > ?
> > 
> > the lock isn't needed actually so, will remove it.
> 
> Oh, I was just looking at the weird handling of the err variable :)

Ah, now I see. I'd revert the lock part with `return -EINTR` instead.

> > > What happens if the binary JITs some code and you don't find it? I don't
> > > remember from your talk - there you seemed to say this was fine just
> > > slow, but that was zpoline in a different context (container)?
> > 
> > instructions loaded after execve family (like JIT generated code,
> > loaded with dlopen, etc) isn't going to be translated.  we can
> > translated it by tweaking the userspace loader (ld.so w/ LD_PRELOAD)
> > or hook mprotect(2) syscall before executing JIT generated code.
> > generic description is written in the document ([12/13]).
> 
> Guess I should've read that, sorry.

No problem; this part is a completely new feature and I'd like to
clear up any unclear points, so any input is always welcome.

# btw, the talk at the last netdev was not specific to containers; it
  focused on the syscall hook mechanism itself, so I didn't go into
  much detail at that time.

> > > Perhaps UML could additionally install a seccomp filter or something on
> > > itself while running a userspace program? Hmm.
> > 
> > I'm trying to understand the purpose of seccomp filter you suggested
> > here; is it for preventing executed by untranslated code ?
> 
> Yeah, that's what I was wondering.
> 
> Obviously you have to be able to get rid of the seccomp filter again so
> it's not foolproof, but perhaps not _that_ bad?
> 
> I'm not worried about security or so, it's clear this isn't even _meant_
> to have security. But I do wonder about really hard to debug issues if
> userspace suddenly makes syscalls to the host, that'd be ... difficult
> to understand?

I totally understand; I faced similar situations while developing
this patchset.

Originally our patchset had a whitelist-based seccomp filter (w/
SCMP_ACT_ALLOW), but I dropped it from this RFC because 1) it is not
a !MMU-specific feature (it can be applied to all UML use cases), and
2) we cannot block a syscall (e.g., ioctl(2)) from userspace when it
is whitelisted in our seccomp filter, so the filter would not be
perfect.

Maintaining the whitelist is also not easy; a syscall used in one
version may be replaced at some point in the future (what I faced was
that SCMP_SYS(open) had to be replaced with SCMP_SYS(openat)).

-- Hajime


* Re: [RFC PATCH 13/13] um: nommu: plug nommu code into build system
  2024-10-25 15:27       ` Johannes Berg
@ 2024-10-26  7:36         ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-26  7:36 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol


On Sat, 26 Oct 2024 00:27:08 +0900,
Johannes Berg wrote:
> 
> On Fri, 2024-10-25 at 22:05 +0900, Hajime Tazaki wrote:
> > On Fri, 25 Oct 2024 18:33:06 +0900,
> > Johannes Berg wrote:
> > > 
> > > On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> > > > 
> > > >  config MMU
> > > > -	bool
> > > > +	bool "MMU-based Paged Memory Management Support"
> > > >  	default y
> > > 
> > > "if !64bit" or something
> > 
> > not sure if I understand correctly but where do you suggest to add "if
> > !64bit" in this block ?
> 
> config MMU
> 	bool "..." if 64BIT
> 
> was what I was thinking, since you cannot allow !MMU on 32-bit.
> 
> As you wrote it, the user would get prompted on both 32 and 64 bit, and
> get an invalid configuration on 32-bit if they turn off MMU.

Ah, thanks.  Yes, !MMU isn't available on the 32-bit subarch for now,
so I'll fix it.

> > > > +config ARCH_FORCE_MAX_ORDER
> > > > +	int "Order of maximal physically contiguous allocations"
> > > > +	default "10" if MMU
> > > > +	default "16" if !
> > > 
> > > Should there even a be user prompt for it?
> > > Perhaps "if EXPERT" or something?
> > 
> > I think it's already user-editable 
> 
> It is now, but it didn't even exist before, so nobody would've been
> asked. I'm not sure it's _too_ useful to ask the users though? When
> would they know how to set it? (Perhaps also add some help text? I'm
> guessing it's related to the size of binaries or so?)

Ah, you're right.
I thought I had only added the !MMU part of this block (but that's
not correct)...

> > but your suggestion is to limit with 'if EXPERT' ?
> 
> Or maybe it should just not be user editable at all?

agreed to add if EXPERT.
I'll also add help text to that.

-- Hajime



* Re: [RFC PATCH 00/13] nommu UML
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
                   ` (12 preceding siblings ...)
  2024-10-24 12:09 ` [RFC PATCH 13/13] um: nommu: plug nommu code into build system Hajime Tazaki
@ 2024-10-26 10:19 ` Benjamin Berg
  2024-10-27  9:10   ` Hajime Tazaki
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
  14 siblings, 1 reply; 128+ messages in thread
From: Benjamin Berg @ 2024-10-26 10:19 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um, jdike, richard, anton.ivanov, johannes; +Cc: ricarkol

Hi,

On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> This is a series of patches of nommu arch addition to UML.  It would
> be nice to ask comments/opinions on this.
> 
> There are several limitations/issues which we already found; here is
> the list of those issues.
> 
> - prompt configured with /etc/profile is broken (variables are not
>   expanded, ${HOSTNAME%%.*}:$PWD#)
> - there are no mechanism implemented to cache for mapped memory of
>   exec(2) thus, always read files from filesystem upon every exec,
>   which makes slow on some benchmark (lmbench).
> - a crash on userspace programs crashes a UML kernel, not signaling
>   with SIGSEGV to the program.
> - commit c27e618 (during v6.12-rc1 merge) introduces invalid access to
>   a vma structure for our case, which updates the internal procedure
>   of maple_tree subsystem.  We're trying to fix issue but still a
>   random process on exit(2) crashes.

Btw. are you handling FP register save/restore? If it is not there, it
probably would not be too hard to add (XSAVE, etc.), though it might
add a bit of additional overhead. Especially as UML always saves the FP
state rather than optimizing it like the x86 architectures.


I am a bit confused overall. I mean, zpoline seems kind of neat, but a
requirement on patching userspace code also seems like a lot.

To me, it seems much more natural to catch the userspace syscalls using
a SECCOMP filter[1]. While quite a lot slower, that should be much more
portable across architectures. For improved speed one could still do
architecture specific things inside the vDSO or by using zpoline. But
those would then "just" be optimizations and unpatched code would still
work correctly (e.g. JIT).

For me, a big argument in favour of such an approach is its simplicity.
I am mostly basing that on the fact that this patchset should properly
handle other signals like SIGFPE and SIGSEGV. And, once it does that,
you will already have all the infrastructure to do the correct register
save/restore using the host mcontext, which is what is needed in the
SIGSYS handler when using SECCOMP. The filter itself should be simple
as it just needs to catch all syscalls within valid userspace
executable memory[2] ranges.

Benjamin

[1] Maybe not surprising, as I have been working on a SECCOMP based UML
that does not require ptrace.
[2] I am assuming that userspace executable code is already confined to
a certain address space within the UML process. Obviously, the kernel
itself and loaded modules need to be free to do host syscalls and
should not be affected by the SECCOMP filter.



> 
> UML has been built with CONFIG_MMU since day 0.  The feature
> introduces the nommu mode in a different angle from what Linux Kernel
> Library tried.
> 
> 
> What is it for ?
> ================
> 
> - Alleviate syscall hook overhead implemented with ptrace(2)
> - To exercises nommu code over UML (and over KUnit)
> - Less dependency to host facilities
> 
> 
> How it works ?
> ==============
> 
> To illustrate how this feature works, the below shows how syscalls are
> called under nommu/UML environment.
> 
> - boot kernel, setup zpoline trampoline code (detailed later) at address 0x0
> - (userspace starts)
> - calls vfork/execve syscalls
> - during execve, more specifically during load_elf_fdpic_binary()
>   function, kernel translates `syscall/sysenter` instructions with `call
>   *%rax`, which usually point to address 0 to NR_syscalls (around
>   512), where trampoline code was installed during startup.
> - when syscalls are issued by userspace, it jumps to *%rax, slides
>   until `nop` instructions end, and jump to hooked function,
>   `__kernel_vsyscall`, which is an entrypoint for syscall under nommu
>   UML environment.
> - call handler function in sys_call_table[] and follow how UML syscall
>   works.
> - return to userspace
> 
> 
> What are the differences from MMU-full UML ?
> ============================================
> 
> The current nommu implementation adds 3 different functions which
> MMU-full UML doesn't have:
> 
> - kernel address space can directly be accessible from userspace
>   - so, uaccess() always returns 1
>   - generic implementation of memcpy/strcpy/futex is also used
> - alternate syscall entrypoint without ptrace
> - translation of syscall/sysenter instructions to a trampoline code
>   and syscall hooks
> 
> With those modifications, it allows us to use unmodified userspace
> binaries with nommu UML.
> 
> 
> History
> =======
> 
> This feature was originally introduced by Ricardo Koller at Open
> Source Summit NA 2020, then integrated with the syscall translation
> functionality with the clean up to the original code.
> 
> Building and run
> ================
> 
> ```
> % make ARCH=um x86_64_nommu_defconfig
> % make ARCH=um
> ```
> 
> will build UML with CONFIG_MMU=n applied.
> 
> Kunit tests can run with the following command:
> 
> ```
> % ./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_MMU=n
> ```
> 
> To run a typical Linux distribution, we need nommu-aware userspace.
> We can use a stock version of Alpine Linux with nommu-built version of
> busybox and musl-libc.
> 
> 
> Preparing root filesystem
> =========================
> 
> nommu UML requires to use a specific standard library which is aware
> of nommu kernel.  We have tested custom-build musl-libc and busybox,
> both of which have built-in support for nommu kernels.
> 
> There are no available Linux distributions for nommu under x86_64
> architecture, so we need to prepare our own image for the root
> filesystem.  We use Alpine Linux as a base distribution and replace
> busybox and musl-libc on top of that.  The following are the step to
> prepare the filesystem for the quick start.
> 
> ```
>      container_id=$(docker create ghcr.io/thehajime/alpine:3.20.3-um-nommu)
>      docker start $container_id
>      docker wait $container_id
>      docker export $container_id > alpine.tar
>      docker rm $container_id
> 
>      mnt=$(mktemp -d)
>      dd if=/dev/zero of=alpine.ext4 bs=1 count=0 seek=1G
>      sudo chmod og+wr "alpine.ext4"
>      yes 2>/dev/null | mkfs.ext4 "alpine.ext4" || true
>      sudo mount "alpine.ext4" $mnt
>      sudo tar -xf alpine.tar -C $mnt
>      sudo umount $mnt
> ```
> 
> This will create a file image, `alpine.ext4`, which contains busybox
> and musl with nommu build on the Alpine Linux root filesystem.  The
> file can be specified to the argument `ubd0=` to the UML command line.
> 
> ```
>   ./vmlinux eth0=tuntap,tap100,0e:fd:0:0:0:1,172.17.0.1 ubd0=./alpine.ext4 rw mem=1024m loglevel=8 init=/sbin/init
> ```
> 
> We plan to upstream apk packages for busybox and musl so that we can
> follow the proper procedure to set up the root filesystem.
> 
> 
> Quick start with docker
> =======================
> 
> There is a docker image that you can quickly start with a simple step.
> 
> ```
>   docker run -it -v /dev/shm:/dev/shm --rm ghcr.io/thehajime/alpine:3.20.3-um-nommu
> ```
> 
> This will launch a UML instance with an pre-configured root filesystem.
> 
> Benchmark
> =========
> 
> The below shows an example of performance measurement conducted with
> lmbench and (self-crafted) getpid benchmark (with v6.12-rc3 linus tree).
> 
> ### lmbench (usec)
> 
> |             | native    | um        | um-nommu   |
> |-------------|-----------|-----------|------------|
> | select-10   | 0.5645    | 28.3738   | 0.2647     |
> | select-100  | 2.3872    | 28.8385   | 1.1021     |
> | select-1000 | 20.5527   | 37.6364   | 9.4264     |
> | syscall     | 0.1735    | 26.8711   | 0.1037     |
> | read        | 0.3442    | 28.5771   | 0.1370     |
> | write       | 0.2862    | 28.7340   | 0.1236     |
> | stat        | 1.9236    | 38.5928   | 0.4640     |
> | open/close  | 3.8308    | 66.8451   | 0.7789     |
> | fork+sh     | 1176.4444 | 8221.5000 | 21443.0000 |
> | fork+execve | 533.1053  | 3034.5000 | 4894.3333  |
> 
> ### do_getpid bench (nsec)
> 
> |        | native | um    | um-nommu |
> |--------|--------|-------|----------|
> | getpid | 180    | 31579 | 101      |
> 
> 
> Limitations
> ===========
> 
> generic nommu limitations
> -------------------------
> Since this port is a kernel of nommu architecture so, the
> implementation inherits the characteristics of other nommu kernels
> (riscv, arm, etc), described below.
> 
> - vfork(2) should be used instead of fork(2)
> - ELF loader only loads PIE (position independent executable) binaries
> - processes share the address space among others
> - mmap(2) offers a subset of functionalities (e.g., unsupported
>   MAP_FIXED)
> 
> Thus, we have limited options to userspace programs.  We have tested
> Alpine Linux with musl-libc, which has a support nommu kernel.
> 
> access to mmap_min_addr
> ----------------------
> As the mechanism of syscall translations relies on an ability to
> write/read memory address zero (0x0), we need to configure host kernel
> with the following command:
> 
> ```
> % sh -c "echo 0 > /proc/sys/vm/mmap_min_addr"
> ```
> 
> supported architecture
> ----------------------
> The current implementation of nommu UML only works on x86_64 SUBARCH.
> We have not tested with 32-bit environment.
> 
> target of syscall translation
> -----------------------------
> The syscall translation only applies to the executable and interpreter
> of ELF binary files which are processed by execve(2) syscall for the
> moment: other libraries such as linked library and dlopen-ed one
> aren't translated; we may be able to trigger the translation by
> LD_PRELOAD.
> 
> Note that with musl-libc in Alpine Linux which we've been tested, most
> of syscalls are implemented in the interpreter file
> (ld-musl-x86_64.so) and calling syscall/sysenter instructions from the
> linked/loaded libraries might be rare.  But it is definitely possible
> so, a workaround with LD_PRELOAD is effective.
> 
> 
> Further readings about NOMMU UML
> ================================
> 
> - NOMMU UML (original code by Ricardo Koller)
> https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf
> 
> - zpoline: syscall translation mechanism
> https://www.usenix.org/conference/atc23/presentation/yasukata
>
> Please review the following changes for suitability for inclusion. If you have
> any objections or suggestions for improvement, please respond to the patches. If
> you agree with the changes, please provide your Acked-by.
> 
> The following changes since commit c2ee9f594da826bea183ed14f2cc029c719bf4da:
> 
>   KVM: selftests: Fix build on on non-x86 architectures (2024-10-21 15:49:33 -0700)
> 
> are available in the Git repository at:
> 
>   https://github.com/thehajime/linux 82a7ee8b31c51edb47e144922581824a3b5e371d
>   https://github.com/thehajime/linux/tree/um-nommu-v6.12-rc4-rfc
> 
> Hajime Tazaki (13):
>   fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec
>   x86/um: nommu: elf loader for fdpic
>   um: nommu: memory handling
>   x86/um: nommu: syscall handling
>   x86/um: nommu: syscall translation by zpoline
>   x86/um: nommu: process/thread handling
>   um: nommu: configure fs register on host syscall invocation
>   x86/um/vdso: nommu: vdso memory update
>   x86/um: nommu: signal handling
>   x86/um: nommu: stack save/restore on vfork
>   um: change machine name for uname output
>   um: nommu: add documentation of nommu UML
>   um: nommu: plug nommu code into build system
> 
>  Documentation/virt/uml/nommu-uml.rst    | 219 +++++++++++++++++++++++
>  arch/um/Kconfig                         |  13 +-
>  arch/um/Makefile                        |   6 +
>  arch/um/configs/x86_64_nommu_defconfig  |  64 +++++++
>  arch/um/include/asm/futex.h             |   4 +
>  arch/um/include/asm/mmu.h               |   8 +
>  arch/um/include/asm/mmu_context.h       |  14 +-
>  arch/um/include/asm/ptrace-generic.h    |  17 ++
>  arch/um/include/asm/tlbflush.h          |  23 ++-
>  arch/um/include/asm/uaccess.h           |   7 +-
>  arch/um/include/shared/common-offsets.h |   3 +
>  arch/um/include/shared/os.h             |   9 +
>  arch/um/kernel/Makefile                 |   3 +-
>  arch/um/kernel/exec.c                   |   8 +
>  arch/um/kernel/mem.c                    |  13 ++
>  arch/um/kernel/physmem.c                |   6 +
>  arch/um/kernel/process.c                |  34 +++-
>  arch/um/kernel/skas/Makefile            |   3 +-
>  arch/um/kernel/trap.c                   |   4 +
>  arch/um/os-Linux/main.c                 |   5 +
>  arch/um/os-Linux/process.c              |  22 +++
>  arch/um/os-Linux/skas/process.c         |   4 +
>  arch/um/os-Linux/start_up.c             |  47 +++++
>  arch/um/os-Linux/time.c                 |   3 +-
>  arch/um/os-Linux/util.c                 |   3 +-
>  arch/x86/um/Makefile                    |  18 ++
>  arch/x86/um/asm/elf.h                   |  12 +-
>  arch/x86/um/asm/module.h                |  19 +-
>  arch/x86/um/asm/processor.h             |  12 ++
>  arch/x86/um/do_syscall_64.c             | 113 ++++++++++++
>  arch/x86/um/entry_64.S                  | 110 ++++++++++++
>  arch/x86/um/shared/sysdep/syscalls_64.h |   4 +
>  arch/x86/um/signal.c                    |  26 +++
>  arch/x86/um/syscalls_64.c               |  67 +++++++
>  arch/x86/um/vdso/um_vdso.c              |  20 +++
>  arch/x86/um/vdso/vma.c                  |  16 +-
>  arch/x86/um/zpoline.c                   | 228 ++++++++++++++++++++++++
>  fs/Kconfig.binfmt                       |   2 +-
>  fs/binfmt_elf_fdpic.c                   |  10 ++
>  39 files changed, 1164 insertions(+), 35 deletions(-)
>  create mode 100644 Documentation/virt/uml/nommu-uml.rst
>  create mode 100644 arch/um/configs/x86_64_nommu_defconfig
>  create mode 100644 arch/x86/um/do_syscall_64.c
>  create mode 100644 arch/x86/um/entry_64.S
>  create mode 100644 arch/x86/um/zpoline.c
> 




* Re: [RFC PATCH 00/13] nommu UML
  2024-10-26 10:19 ` [RFC PATCH 00/13] nommu UML Benjamin Berg
@ 2024-10-27  9:10   ` Hajime Tazaki
  2024-10-28 13:32     ` Benjamin Berg
  0 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-27  9:10 UTC (permalink / raw)
  To: benjamin; +Cc: linux-um, jdike, richard, anton.ivanov, johannes, ricarkol

Hello Benjamin,

thank you for your time looking at this.

On Sat, 26 Oct 2024 19:19:08 +0900,
Benjamin Berg wrote:

> > - a crash on userspace programs crashes a UML kernel, not signaling
> >   with SIGSEGV to the program.
> > - commit c27e618 (during v6.12-rc1 merge) introduces invalid access to
> >   a vma structure for our case, which updates the internal procedure
> >   of maple_tree subsystem.  We're trying to fix issue but still a
> >   random process on exit(2) crashes.
> 
> Btw. are you handling FP register save/restore? If it is not there, it
> probably would not be too hard to add (XSAVE, etc.), though it might
> add a bit of additional overhead. Especially as UML always saves the FP
> state rather than optimizing it like the x86 architectures.

The patchset handles the fp register on syscall entry/exit; patch
[07/13] contains this part.

I'm not familiar with this, but what kind of optimizations does the
x86 architecture do for fp register handling?

> I am a bit confused overall. I mean, zpoline seems kind of neat, but a
> requirement on patching userspace code also seems like a lot.
> 
> To me, it seems much more natural to catch the userspace syscalls using
> a SECCOMP filter[1]. While quite a lot slower, that should be much more
> portable across architectures. For improved speed one could still do
> architecture specific things inside the vDSO or by using zpoline. But
> those would then "just" be optimizations and unpatched code would still
> work correctly (e.g. JIT).

I'm not proposing this patchset as a replacement for the existing UML
implementation; for instance, it cannot run the CONFIG_MMU code in
the kernel tree, so the existing ptrace-based implementation still
has real use cases.  The ptrace-based syscall hook is indeed not
fast, and replacing it with a seccomp filter clearly has benefits,
but I think that is independent of this patchset.

So I think this patchset can exist in parallel while your seccomp
patches are also in review.

btw, though I mentioned in a different reply that JIT-generated code
is not currently handled, it can be implemented as an extension to
this patchset; the original implementation of zpoline is now able to
patch JIT-generated code as well.

https://github.com/yasukata/zpoline/pull/20/commits/c42af16757ad3fcdf7084c9f2139bb9105796873

It is not implemented in this patchset for the moment.
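
To illustrate what "patching" means here, the following is a
simplified sketch of the rewrite step applied to a code buffer before
it is executed.  It relies on the fact that `syscall` (0f 05) and
`sysenter` (0f 34) are two bytes long, the same length as
`call *%rax` (ff d0).  The function name `rewrite_syscalls()` is
hypothetical, and the raw byte scan is a deliberate simplification: a
real implementation has to decode instruction boundaries, since the
same byte pair can also appear inside an immediate or a longer
instruction.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: replace each 2-byte syscall/sysenter opcode in
 * code[0..len) with "call *%rax" and return the number of rewrites.
 * Naive on purpose: it scans raw bytes and does not decode instructions.
 */
static size_t rewrite_syscalls(uint8_t *code, size_t len)
{
	size_t i = 0, hits = 0;

	while (i + 1 < len) {
		if (code[i] == 0x0f &&
		    (code[i + 1] == 0x05 || code[i + 1] == 0x34)) {
			code[i] = 0xff;		/* call *%rax, opcode byte */
			code[i + 1] = 0xd0;	/* call *%rax, ModRM byte */
			hits++;
			i += 2;
		} else {
			i++;
		}
	}
	return hits;
}
```

Because %rax holds the syscall number at that point, the call lands in
the trampoline installed at address 0.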

In terms of portability, the basic idea of the zpoline syscall hook
is applicable to other platforms, like aarch64
(https://github.com/retrage/svc-hook), so I believe there is a chance
to extend this idea to architectures other than x86_64.

> For me, a big argument in favour of such an approach is its simplicity.
> I am mostly basing that on the fact that this patchset should properly
> handle other signals like SIGFPE and SIGSEGV. And, once it does that,
> you will already have all the infrastructure to do the correct register
> save/restore using the host mcontext, which is what is needed in the
> SIGSYS handler when using SECCOMP. The filter itself should be simple
> as it just needs to catch all syscalls within valid userspace
> executable memory[2] ranges.

I agree with your observation that the approach is simple.
I don't have a good idea of how to handle SIGSEGV yet, but I'll look
into it with your input.

> Benjamin
> 
> [1] Maybe not surprising, as I have been working on a SECCOMP based UML
> that does not require ptrace.

Yes, I have been aware of it for a while.  I have also benchmarked
several hook mechanisms, including seccomp, with a simple getpid
measurement.

https://speakerdeck.com/thehajime/netdev0x18-zpoline?slide=16

> [2] I am assuming that userspace executable code is already confined to
> a certain address space within the UML process. Obviously, the kernel
> itself and loaded modules need to be free to do host syscalls and
> should not be affected by the SECCOMP filter.

I think our !MMU UML doesn't break this assumption.  Or did you see
something in our patchset that does?

Thanks again,
-- Hajime


* Re: [RFC PATCH 05/13] x86/um: nommu: syscall translation by zpoline
  2024-10-26  7:36         ` Hajime Tazaki
@ 2024-10-27  9:45           ` Johannes Berg
  2024-10-28  7:47             ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-10-27  9:45 UTC (permalink / raw)
  To: Hajime Tazaki; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol

On Sat, 2024-10-26 at 16:36 +0900, Hajime Tazaki wrote:
> 
> Originally our patchset had a whitelist-based seccomp filter (w/
> SCMP_ACT_ALLOW), but dropped from this RFC as I found that 1) this is
> not the !MMU specific feature (it can be generally applied to all UML
> use cases), and 2) we cannot prevent a syscall (e.g., ioctl(2)) from
> userspace which is white-listed in our seccomp filter, thus the newly
> introduced filter may not be perfect.
> 
> the maintenance of the whitelist is also not easy; the syscall used in
> one version is renamed at some point in future (what I faced is
> SCMP_SYS(open) should be renamed with SCMP_SYS(openat)).

Sure, agree that would be awful. However, only kernel code should be
making real host syscalls, never userspace code, so you should be able
to filter simply based on address? Since it's NOMMU there's a single
process and a single address space, and userspace binaries always have
to be in certain places, I'd think?

This should be cheap since
 (a) it's not doing anything with (guest) syscalls that were already
     rewritten by zpoline (they don't exist as host syscalls)
 (b) while the real host syscalls made by the kernel would still be
     checked by the filter program, it'd just return "sure that's OK"
     and not redirect anything
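
A minimal sketch of such an address-based filter, written as classic
BPF over `struct seccomp_data`.  The assumptions here do not come from
the patchset: userspace code lives in one range [start, end) whose
bounds share the same upper 32 bits, the host is little-endian, and
`build_ip_filter()` / `eval_ip_filter()` are illustrative names.  The
small evaluator exists only to check the program's logic without
installing it; a real filter would also validate `arch` and be
installed with seccomp(2) or prctl(2).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Offsets of the two 32-bit halves of the IP (little-endian assumed). */
#define IP_LO ((uint32_t)offsetof(struct seccomp_data, instruction_pointer))
#define IP_HI (IP_LO + 4)

/* Fill prog[] (room for 7 entries) and return the program length. */
static unsigned int build_ip_filter(struct sock_filter *prog,
				    uint64_t start, uint64_t end)
{
	const uint32_t hi = (uint32_t)(start >> 32);
	const struct sock_filter insns[] = {
		/* 0: A = upper half of the instruction pointer */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, IP_HI),
		/* 1: different upper half: not userspace, jump to ALLOW */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, hi, 0, 4),
		/* 2: A = lower half of the instruction pointer */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, IP_LO),
		/* 3: below start: jump to ALLOW */
		BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, (uint32_t)start, 0, 2),
		/* 4: at or above end: ALLOW, otherwise fall into TRAP */
		BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, (uint32_t)end, 1, 0),
		/* 5: untranslated userspace syscall: deliver SIGSYS */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
		/* 6: host syscall made by the UML kernel itself */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};

	memcpy(prog, insns, sizeof(insns));
	return sizeof(insns) / sizeof(insns[0]);
}

/* Tiny evaluator covering only the opcodes used above (testing aid). */
static uint32_t eval_ip_filter(const struct sock_filter *prog,
			       unsigned int len,
			       const struct seccomp_data *sd)
{
	uint32_t acc = 0;
	unsigned int pc = 0;

	while (pc < len) {
		const struct sock_filter *f = &prog[pc++];

		switch (f->code) {
		case BPF_LD | BPF_W | BPF_ABS:
			memcpy(&acc, (const char *)sd + f->k, sizeof(acc));
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:
			pc += (acc == f->k) ? f->jt : f->jf;
			break;
		case BPF_JMP | BPF_JGE | BPF_K:
			pc += (acc >= f->k) ? f->jt : f->jf;
			break;
		case BPF_RET | BPF_K:
			return f->k;
		}
	}
	return SECCOMP_RET_KILL;
}
```

Syscalls already rewritten by zpoline never reach this program, so
only host syscalls from the kernel side and untranslated userspace
syscalls are evaluated.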

johannes



* Re: [RFC PATCH 05/13] x86/um: nommu: syscall translation by zpoline
  2024-10-27  9:45           ` Johannes Berg
@ 2024-10-28  7:47             ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-28  7:47 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, jdike, richard, anton.ivanov, ricarkol


Hello,

On Sun, 27 Oct 2024 18:45:39 +0900,
Johannes Berg wrote:
> 
> On Sat, 2024-10-26 at 16:36 +0900, Hajime Tazaki wrote:
> > 
> > Originally our patchset had a whitelist-based seccomp filter (w/
> > SCMP_ACT_ALLOW), but dropped from this RFC as I found that 1) this is
> > not the !MMU specific feature (it can be generally applied to all UML
> > use cases), and 2) we cannot prevent a syscall (e.g., ioctl(2)) from
> > userspace which is white-listed in our seccomp filter, thus the newly
> > introduced filter may not be perfect.
> > 
> > the maintenance of the whitelist is also not easy; the syscall used in
> > one version is renamed at some point in future (what I faced is
> > SCMP_SYS(open) should be renamed with SCMP_SYS(openat)).
> 
> Sure, agree that would be awful. However, only kernel code should be
> making real host syscalls, never userspace code, so you should be able
> to filter simply based on address? Since it's NOMMU there's a single
> process and a single address space, and userspace binaries always have
> to be in certain places, I'd think?

Yes, it should be possible to identify the address that issued the
syscall instruction.

> This should be cheap since
>  (a) it's not doing anything with (guest) syscalls that were already
>      rewritten by zpoline (they don't exist as host syscalls)
>  (b) while the real host syscalls made by the kernel would still be
>      checked by the filter program, it'd just return "sure that's OK"
>      and not redirect anything

That totally makes sense to me, and the filter would be nice to have.
I'll look into implementing it as a seccomp filter.

thanks for the idea,
-- Hajime




* Re: [RFC PATCH 00/13] nommu UML
  2024-10-27  9:10   ` Hajime Tazaki
@ 2024-10-28 13:32     ` Benjamin Berg
  2024-10-30  9:25       ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Benjamin Berg @ 2024-10-28 13:32 UTC (permalink / raw)
  To: Hajime Tazaki; +Cc: linux-um, jdike, richard, anton.ivanov, johannes, ricarkol

Hello Hajime,

On Sun, 2024-10-27 at 18:10 +0900, Hajime Tazaki wrote:
> thank you for your time looking at this.
> 
> On Sat, 26 Oct 2024 19:19:08 +0900,
> Benjamin Berg wrote:
> 
> > > - a crash on userspace programs crashes a UML kernel, not signaling
> > >   with SIGSEGV to the program.
> > > - commit c27e618 (during v6.12-rc1 merge) introduces invalid access to
> > >   a vma structure for our case, which updates the internal procedure
> > >   of maple_tree subsystem.  We're trying to fix issue but still a
> > >   random process on exit(2) crashes.
> > 
> > Btw. are you handling FP register save/restore? If it is not there, it
> > probably would not be too hard to add (XSAVE, etc.), though it might
> > add a bit of additional overhead. Especially as UML always saves the FP
> > state rather than optimizing it like the x86 architectures.
> 
> The patch handles fp register on entry/leave at syscall; [07/13] patch
> contains this part.

That looks like FS/GS registers which are for thread-local storage. I
was talking about floating point registers. Maybe you meant another
patch?

> I'm not familiar with that but what kind of optimizations does x86
> architecture do for fp register handling ?

The kernel does not usually need the FP registers. So it optimizes the
pretty common case of a userspace -> kernel -> userspace switch that
happens for a syscall by simply not saving/restoring these registers at
all.

Obviously, it then still needs to do the work when the task is switched
or in the rare case that the kernel wants to use floating point itself.

> > I am a bit confused overall. I mean, zpoline seems kind of neat, but a
> > requirement on patching userspace code also seems like a lot.
> > 
> > To me, it seems much more natural to catch the userspace syscalls using
> > a SECCOMP filter[1]. While quite a lot slower, that should be much more
> > portable across architectures. For improved speed one could still do
> > architecture specific things inside the vDSO or by using zpoline. But
> > those would then "just" be optimizations and unpatched code would still
> > work correctly (e.g. JIT).
> 
> I'm not proposing this patch to replace existing UML implementations;
> for instance, the patchset cannot run CONFIG_MMU code in the whole
> kernel tree so, existing ptrace-based implementation still has real
> usecase.  and ptrace based syscall hook is not indeed fast and the
> improvements with seccomp filter instead clearly has benefits.  I
> think it's independent to this patchset.

Of course. nommu mode is a completely independent feature.

I am still wondering a bit about the users for such a mode. It is not
interesting for us as we use it for testing. Of course, speed is nice
but it is not the primary objective.

I understand that it can be an approach for a small "container", but
then you would need a very strict SECCOMP filter for the kernel itself.

> So I think while your seccomp patches are also in review, this
> patchset can exist in parallel.
> 
> btw, though I mentioned that JIT generated code is not currently
> handled in a different reply, it can be implemented as an extension to
> this patchset; the original implementation of zpoline now is able to
> patch JIT generated code as well.
> 
> https://github.com/yasukata/zpoline/pull/20/commits/c42af16757ad3fcdf7084c9f2139bb9105796873
> 
> it is not implemented for the moment.
> 
> in terms of the portability, the basic idea of syscall hook with
> zpoline is applicable to other platform, like aarch64
> (https://github.com/retrage/svc-hook).  so I believe it has a chance
> to expand this idea to other architectures than x86_64.

Right, aarch64 is probably the most interesting one in general. At
least there was some interest in a UML port.

> > For me, a big argument in favour of such an approach is its simplicity.
> > I am mostly basing that on the fact that this patchset should properly
> > handle other signals like SIGFPE and SIGSEGV. And, once it does that,
> > you will already have all the infrastructure to do the correct register
> > save/restore using the host mcontex, which is what is needed in the
> > SIGSYS handler when using SECCOMP. The filter itself should be simple
> > as it just needs to catch all syscalls within valid userspace
> > executable memory[2] ranges.
> 
> I agree with your observation that the approach is simple.
> I don't have a good idea on how to handle SIGSEGV, but will try to see
> with your inputs.

You can probably use "[RFC PATCH v2 5/9] um: Add helper functions to
get/set state for SECCOMP" for getting the registers and also writing
them back if you want to restore using rt_sigreturn.

> > [1] Maybe not surprising, as I have been working on a SECCOMP based UML
> > that does not require ptrace.
> 
> yes, I'm aware of it since before.  I have also conducted a benchmark
> with several hook mechanisms, including seccomp with simple getpid
> measurement.
> 
> https://speakerdeck.com/thehajime/netdev0x18-zpoline?slide=16

Sure! I saw that :-)

> > [2] I am assuming that userspace executable code is already confined to
> > a certain address space within the UML process. Obviously, the kernel
> > itself and loaded modules need to be free to do host syscalls and
> > should not be affected by the SECCOMP filter.
> 
> I think our !MMU UML doesn't break this assumption.  But did you see
> something to our patchset ?

I also assume that is fine. One just needs to understand this when
writing a SECCOMP filter for syscall emulation in nommu mode.

Benjamin



* Re: [RFC PATCH 00/13] nommu UML
  2024-10-28 13:32     ` Benjamin Berg
@ 2024-10-30  9:25       ` Hajime Tazaki
  2024-11-09  0:52         ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-10-30  9:25 UTC (permalink / raw)
  To: benjamin; +Cc: linux-um, jdike, richard, anton.ivanov, johannes, ricarkol


Hello,

On Mon, 28 Oct 2024 22:32:43 +0900,
Benjamin Berg wrote:

> > > > - a crash on userspace programs crashes a UML kernel, not signaling
> > > >   with SIGSEGV to the program.
> > > > - commit c27e618 (during v6.12-rc1 merge) introduces invalid access to
> > > >   a vma structure for our case, which updates the internal procedure
> > > >   of maple_tree subsystem.  We're trying to fix issue but still a
> > > >   random process on exit(2) crashes.
> > > 
> > > Btw. are you handling FP register save/restore? If it is not there, it
> > > probably would not be too hard to add (XSAVE, etc.), though it might
> > > add a bit of additional overhead. Especially as UML always saves the FP
> > > state rather than optimizing it like the x86 architectures.
> > 
> > The patch handles fp register on entry/leave at syscall; [07/13] patch
> > contains this part.
> 
> That looks like FS/GS registers which are for thread-local storage. I
> was talking about floating point registers. Maybe you meant another
> patch?

oh, this is my terrible mistake...
no, the patch doesn't handle FP registers at all.

> > I'm not familiar with that but what kind of optimizations does x86
> > architecture do for fp register handling ?
> 
> The kernel does not usually need the FP registers. So it optimizes the
> pretty common case of a userspace -> kernel -> userspace switch that
> happens for a syscall by simply not saving/restoring these registers at
> all.
> 
> Obviously, it then still needs to do the work when the task is switched
> or in the rare case that the kernel wants to use floating point itself.

thanks for the information.

> > > I am a bit confused overall. I mean, zpoline seems kind of neat, but a
> > > requirement on patching userspace code also seems like a lot.
> > > 
> > > To me, it seems much more natural to catch the userspace syscalls using
> > > a SECCOMP filter[1]. While quite a lot slower, that should be much more
> > > portable across architectures. For improved speed one could still do
> > > architecture specific things inside the vDSO or by using zpoline. But
> > > those would then "just" be optimizations and unpatched code would still
> > > work correctly (e.g. JIT).
> > 
> > I'm not proposing this patch to replace existing UML implementations;
> > for instance, the patchset cannot run CONFIG_MMU code in the whole
> > kernel tree so, existing ptrace-based implementation still has real
> > usecase.  and ptrace based syscall hook is not indeed fast and the
> > improvements with seccomp filter instead clearly has benefits.  I
> > think it's independent to this patchset.
> 
> Of course. nommu mode is a completely independent feature.
> 
> I am still wondering a bit about the users for such a mode. It is not
> interesting for us as we use it for testing. Of course, speed is nice
> but it is not the primary objective.
> 
> I understand that it can be an approach for a small "container", but
> then you would need a very strict SECCOMP filter for the kernel itself.

I didn't specifically describe the use cases for this in the v1 patch,
but here is the list I have in mind.

1) a container-like use case can be one of them (the original work was
proposed for this),
2) testing nommu code in the kernel might be another use,
3) faster I/O workloads which involve a bunch of syscalls over UML can
also be interesting.

I think this list covers most of the reasons to have a !MMU mode in
the current MMU-full UML.

Speed might indeed not be the primary objective, but if you have
dozens of test cases which issue a bunch of syscalls (which I think is
a plausible case), this might be helpful.

(snip)

> > > For me, a big argument in favour of such an approach is its simplicity.
> > > I am mostly basing that on the fact that this patchset should properly
> > > handle other signals like SIGFPE and SIGSEGV. And, once it does that,
> > > you will already have all the infrastructure to do the correct register
> > > save/restore using the host mcontex, which is what is needed in the
> > > SIGSYS handler when using SECCOMP. The filter itself should be simple
> > > as it just needs to catch all syscalls within valid userspace
> > > executable memory[2] ranges.
> > 
> > I agree with your observation that the approach is simple.
> > I don't have a good idea on how to handle SIGSEGV, but will try to see
> > with your inputs.
> 
> You can probably use "[RFC PATCH v2 5/9] um: Add helper functions to
> get/set state for SECCOMP" for getting the registers and also writing
> them back if you want to restore using rt_sigreturn.

thanks,

I'm still testing various approaches to deliver SEGV to userspace,
but no luck so far...  I will get back to you once I come up with
something in good shape.

(snip)
> > > [2] I am assuming that userspace executable code is already confined to
> > > a certain address space within the UML process. Obviously, the kernel
> > > itself and loaded modules need to be free to do host syscalls and
> > > should not be affected by the SECCOMP filter.
> > 
> > I think our !MMU UML doesn't break this assumption.  But did you see
> > something to our patchset ?
> 
> I also assume that is fine. One just needs to understand this when
> writing a SECCOMP filter for syscall emulation in nommu mode.

okay, thanks for the clarification.

-- Hajime



* Re: [RFC PATCH 00/13] nommu UML
  2024-10-30  9:25       ` Hajime Tazaki
@ 2024-11-09  0:52         ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-09  0:52 UTC (permalink / raw)
  To: benjamin; +Cc: linux-um, jdike, richard, anton.ivanov, johannes, ricarkol

On Wed, 30 Oct 2024 18:25:18 +0900,
Hajime Tazaki wrote:
> 
> > > > > - a crash on userspace programs crashes a UML kernel, not signaling
> > > > >   with SIGSEGV to the program.

After investigating and trying to save/restore FP registers, I found
that the registers are not the reason that userspace programs cannot
handle/recover from SIGSEGV.

The cause is the following: I tried to call (the !MMU version of)
userspace() in the segv() function if is_user == 1 (I set the variable
based on the address in $rip), but since this never returns to
hard_handler(), which is the caller of segv(), SIGIO stays blocked
even after switching to the userspace process.  Thus no console input
is processed.

Because of that issue, I didn't include the patch to mark is_user = 1,
which resulted in the kernel exiting with panic().

If I unblock the signals (IO/WINCH/ALRM) after segv() when
is_user == 1, userspace can recover from/handle the SEGV signal.

I will try to implement this in a cleaner way.
Thanks for your comment,
-- Hajime


* [RFC PATCH v2 00/13] nommu UML
  2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
                   ` (13 preceding siblings ...)
  2024-10-26 10:19 ` [RFC PATCH 00/13] nommu UML Benjamin Berg
@ 2024-11-11  6:27 ` Hajime Tazaki
  2024-11-11  6:27   ` [RFC PATCH v2 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec Hajime Tazaki
                     ` (16 more replies)
  14 siblings, 17 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-11  6:27 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

This is a series of patches adding nommu support to the UML arch.
Comments and opinions on this would be appreciated.

There are still several limitations/issues which we have already
found; here is the list.

- the prompt configured with /etc/profile is broken (variables are not
  expanded, ${HOSTNAME%%.*}:$PWD#)
- there is no mechanism implemented to cache the mapped memory of
  exec(2); thus files are always read from the filesystem upon every
  exec, which makes some benchmarks (lmbench) slow.

-- Hajime


RFC v2:
- base branch is now uml/linux.git instead of torvalds/linux.git.
- reorganize the patch series to clean up
- fixed various coding style issues
- clean up exec code path [07/13]
- fixed the crash/SIGSEGV case on userspace programs [10/13]
- add seccomp filter to limit syscall caller address [06/13]
- detect fsgsbase availability with sigsetjmp/siglongjmp [08/13]
- removes unrelated changes
- removes unneeded ifndef CONFIG_MMU
- convert UML_CONFIG_MMU to CONFIG_MMU as using uml/linux.git
- proposed a patch for the maple-tree issue (resolving a limitation in RFC v1)
  https://lore.kernel.org/linux-mm/20241108222834.3625217-1-thehajime@gmail.com/

RFC:
- https://lore.kernel.org/linux-um/cover.1729770373.git.thehajime@gmail.com/

Hajime Tazaki (13):
  fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec
  x86/um: nommu: elf loader for fdpic
  um: nommu: memory handling
  x86/um: nommu: syscall handling
  x86/um: nommu: syscall translation by zpoline
  um: nommu: prevent host syscalls from userspace by seccomp filter
  x86/um: nommu: process/thread handling
  um: nommu: configure fs register on host syscall invocation
  x86/um/vdso: nommu: vdso memory update
  x86/um: nommu: signal handling
  um: change machine name for uname output
  um: nommu: add documentation of nommu UML
  um: nommu: plug nommu code into build system

 Documentation/virt/uml/nommu-uml.rst    | 221 +++++++++++++++++++++++
 arch/um/Kconfig                         |  14 +-
 arch/um/Makefile                        |   6 +
 arch/um/configs/x86_64_nommu_defconfig  |  64 +++++++
 arch/um/include/asm/Kbuild              |   1 +
 arch/um/include/asm/futex.h             |   4 +
 arch/um/include/asm/mmu.h               |   8 +
 arch/um/include/asm/mmu_context.h       |  13 +-
 arch/um/include/asm/ptrace-generic.h    |   6 +
 arch/um/include/asm/tlbflush.h          |  22 +++
 arch/um/include/asm/uaccess.h           |   7 +-
 arch/um/include/shared/kern_util.h      |   3 +
 arch/um/include/shared/os.h             |  14 ++
 arch/um/kernel/Makefile                 |   3 +-
 arch/um/kernel/mem.c                    |  12 +-
 arch/um/kernel/physmem.c                |   6 +
 arch/um/kernel/process.c                |  33 +++-
 arch/um/kernel/skas/Makefile            |   4 +-
 arch/um/kernel/trap.c                   |  14 ++
 arch/um/kernel/um_arch.c                |   4 +
 arch/um/os-Linux/Makefile               |   5 +-
 arch/um/os-Linux/cpu.c                  |  50 ++++++
 arch/um/os-Linux/internal.h             |   5 +
 arch/um/os-Linux/main.c                 |   5 +
 arch/um/os-Linux/process.c              |  94 +++++++++-
 arch/um/os-Linux/signal.c               |  18 +-
 arch/um/os-Linux/skas/process.c         |   4 +
 arch/um/os-Linux/start_up.c             |   3 +
 arch/um/os-Linux/util.c                 |   3 +-
 arch/x86/um/Makefile                    |  18 ++
 arch/x86/um/asm/elf.h                   |  11 +-
 arch/x86/um/asm/module.h                |  24 ---
 arch/x86/um/asm/processor.h             |  12 ++
 arch/x86/um/do_syscall_64.c             | 108 ++++++++++++
 arch/x86/um/entry_64.S                  | 108 ++++++++++++
 arch/x86/um/shared/sysdep/syscalls_64.h |   6 +
 arch/x86/um/signal.c                    |  37 +++-
 arch/x86/um/syscalls_64.c               |  69 ++++++++
 arch/x86/um/vdso/um_vdso.c              |  20 +++
 arch/x86/um/vdso/vma.c                  |  14 ++
 arch/x86/um/zpoline.c                   | 223 ++++++++++++++++++++++++
 fs/Kconfig.binfmt                       |   2 +-
 fs/binfmt_elf_fdpic.c                   |  10 ++
 43 files changed, 1262 insertions(+), 46 deletions(-)
 create mode 100644 Documentation/virt/uml/nommu-uml.rst
 create mode 100644 arch/um/configs/x86_64_nommu_defconfig
 create mode 100644 arch/um/os-Linux/cpu.c
 delete mode 100644 arch/x86/um/asm/module.h
 create mode 100644 arch/x86/um/do_syscall_64.c
 create mode 100644 arch/x86/um/entry_64.S
 create mode 100644 arch/x86/um/zpoline.c

-- 
2.43.0




* [RFC PATCH v2 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
@ 2024-11-11  6:27   ` Hajime Tazaki
  2024-11-11  6:27   ` [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic Hajime Tazaki
                     ` (15 subsequent siblings)
  16 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-11  6:27 UTC (permalink / raw)
  To: linux-um
  Cc: thehajime, ricarkol, Liam.Howlett, Alexander Viro,
	Christian Brauner, Jan Kara, Eric Biederman, Kees Cook,
	linux-fsdevel, linux-mm

Add an architecture hook to the FDPIC ELF loader, called at the end of
loading binaries, to finalize the mapped memory before moving on to
the exec function.  The hook is used by UML under !MMU to translate
syscall/sysenter instructions before calling execve.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <kees@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 fs/binfmt_elf_fdpic.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 4fe5bb9f1b1f..ab16fdf475b0 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -175,6 +175,12 @@ static int elf_fdpic_fetch_phdrs(struct elf_fdpic_params *params,
 	return 0;
 }
 
+int __weak elf_arch_finalize_exec(struct elf_fdpic_params *exec_params,
+				  struct elf_fdpic_params *interp_params)
+{
+	return 0;
+}
+
 /*****************************************************************************/
 /*
  * load an fdpic binary into various bits of memory
@@ -457,6 +463,10 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
 			    dynaddr);
 #endif
 
+	retval = elf_arch_finalize_exec(&exec_params, &interp_params);
+	if (retval)
+		goto error;
+
 	finalize_exec(bprm);
 	/* everything is now ready... get the userspace context ready to roll */
 	entryaddr = interp_params.entry_addr ?: exec_params.entry_addr;
-- 
2.43.0




* [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
  2024-11-11  6:27   ` [RFC PATCH v2 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec Hajime Tazaki
@ 2024-11-11  6:27   ` Hajime Tazaki
  2024-11-12 12:48     ` Geert Uytterhoeven
  2024-11-11  6:27   ` [RFC PATCH v2 03/13] um: nommu: memory handling Hajime Tazaki
                     ` (14 subsequent siblings)
  16 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-11  6:27 UTC (permalink / raw)
  To: linux-um
  Cc: thehajime, ricarkol, Liam.Howlett, Eric Biederman, Kees Cook,
	Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
	linux-fsdevel

As UML supports the CONFIG_MMU=n case, it has to use an alternate ELF
loader, the FDPIC ELF loader.  This commit adds the necessary
definitions to the arch, as UML has not used this loader so far.  It
also updates the Kconfig file to allow BINFMT_ELF_FDPIC under a !MMU
environment.

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/include/asm/Kbuild           |  1 +
 arch/um/include/asm/mmu.h            |  5 +++++
 arch/um/include/asm/ptrace-generic.h |  6 ++++++
 arch/x86/um/asm/elf.h                |  8 ++++++--
 arch/x86/um/asm/module.h             | 24 ------------------------
 fs/Kconfig.binfmt                    |  2 +-
 6 files changed, 19 insertions(+), 27 deletions(-)
 delete mode 100644 arch/x86/um/asm/module.h

diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index 18f902da8e99..cf8260fdcfe5 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -14,6 +14,7 @@ generic-y += irq_work.h
 generic-y += kdebug.h
 generic-y += mcs_spinlock.h
 generic-y += mmiowb.h
+generic-y += module.h
 generic-y += module.lds.h
 generic-y += param.h
 generic-y += parport.h
diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index a3eaca41ff61..01422b761aa0 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -14,6 +14,11 @@ typedef struct mm_context {
 	/* Address range in need of a TLB sync */
 	unsigned long sync_tlb_range_from;
 	unsigned long sync_tlb_range_to;
+
+#ifdef CONFIG_BINFMT_ELF_FDPIC
+	unsigned long   exec_fdpic_loadmap;
+	unsigned long   interp_fdpic_loadmap;
+#endif
 } mm_context_t;
 
 #endif
diff --git a/arch/um/include/asm/ptrace-generic.h b/arch/um/include/asm/ptrace-generic.h
index 4696f24d1492..4ff844bcb1cd 100644
--- a/arch/um/include/asm/ptrace-generic.h
+++ b/arch/um/include/asm/ptrace-generic.h
@@ -29,6 +29,12 @@ struct pt_regs {
 
 #define PTRACE_OLDSETOPTIONS 21
 
+#ifdef CONFIG_BINFMT_ELF_FDPIC
+#define PTRACE_GETFDPIC		31
+#define PTRACE_GETFDPIC_EXEC	0
+#define PTRACE_GETFDPIC_INTERP	1
+#endif
+
 struct task_struct;
 
 extern long subarch_ptrace(struct task_struct *child, long request,
diff --git a/arch/x86/um/asm/elf.h b/arch/x86/um/asm/elf.h
index 62ed5d68a978..33f69f1eac10 100644
--- a/arch/x86/um/asm/elf.h
+++ b/arch/x86/um/asm/elf.h
@@ -9,6 +9,7 @@
 #include <skas.h>
 
 #define CORE_DUMP_USE_REGSET
+#define ELF_FDPIC_CORE_EFLAGS  0
 
 #ifdef CONFIG_X86_32
 
@@ -190,8 +191,11 @@ extern int arch_setup_additional_pages(struct linux_binprm *bprm,
 
 extern unsigned long um_vdso_addr;
 #define AT_SYSINFO_EHDR 33
-#define ARCH_DLINFO	NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr)
-
+#define ARCH_DLINFO						\
+do {								\
+	NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr);		\
+	NEW_AUX_ENT(AT_MINSIGSTKSZ, 0);			\
+} while (0)
 #endif
 
 typedef unsigned long elf_greg_t;
diff --git a/arch/x86/um/asm/module.h b/arch/x86/um/asm/module.h
deleted file mode 100644
index a3b061d66082..000000000000
--- a/arch/x86/um/asm/module.h
+++ /dev/null
@@ -1,24 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __UM_MODULE_H
-#define __UM_MODULE_H
-
-/* UML is simple */
-struct mod_arch_specific
-{
-};
-
-#ifdef CONFIG_X86_32
-
-#define Elf_Shdr Elf32_Shdr
-#define Elf_Sym Elf32_Sym
-#define Elf_Ehdr Elf32_Ehdr
-
-#else
-
-#define Elf_Shdr Elf64_Shdr
-#define Elf_Sym Elf64_Sym
-#define Elf_Ehdr Elf64_Ehdr
-
-#endif
-
-#endif
diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index bd2f530e5740..419ba0282806 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -58,7 +58,7 @@ config ARCH_USE_GNU_PROPERTY
 config BINFMT_ELF_FDPIC
 	bool "Kernel support for FDPIC ELF binaries"
 	default y if !BINFMT_ELF
-	depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
+	depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
 	select ELFCORE
 	help
 	  ELF FDPIC binaries are based on ELF, but allow the individual load
-- 
2.43.0




* [RFC PATCH v2 03/13] um: nommu: memory handling
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
  2024-11-11  6:27   ` [RFC PATCH v2 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec Hajime Tazaki
  2024-11-11  6:27   ` [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic Hajime Tazaki
@ 2024-11-11  6:27   ` Hajime Tazaki
  2024-11-11  6:27   ` [RFC PATCH v2 04/13] x86/um: nommu: syscall handling Hajime Tazaki
                     ` (13 subsequent siblings)
  16 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-11  6:27 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

This commit adds memory operations for UML under a !MMU environment.

Some parts of the original UML code relying on CONFIG_MMU are excluded
from compilation when !CONFIG_MMU.  Additionally, generic functions
such as uaccess, futex, memcpy/strnlen/strncpy can be used, as user-
and kernel-space share the address space in !CONFIG_MMU mode.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/include/asm/futex.h       |  4 ++++
 arch/um/include/asm/mmu.h         |  3 +++
 arch/um/include/asm/mmu_context.h | 13 +++++++++++--
 arch/um/include/asm/tlbflush.h    | 22 ++++++++++++++++++++++
 arch/um/include/asm/uaccess.h     |  7 ++++---
 arch/um/include/shared/os.h       |  6 ++++++
 arch/um/kernel/Makefile           |  3 ++-
 arch/um/kernel/mem.c              | 12 +++++++++++-
 arch/um/kernel/physmem.c          |  6 ++++++
 arch/um/kernel/skas/Makefile      |  4 ++--
 arch/um/kernel/trap.c             |  4 ++++
 arch/um/os-Linux/Makefile         |  1 +
 arch/um/os-Linux/process.c        |  4 ++--
 13 files changed, 78 insertions(+), 11 deletions(-)

diff --git a/arch/um/include/asm/futex.h b/arch/um/include/asm/futex.h
index 780aa6bfc050..89a8ac0b6963 100644
--- a/arch/um/include/asm/futex.h
+++ b/arch/um/include/asm/futex.h
@@ -8,7 +8,11 @@
 
 
 int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr);
+#ifdef CONFIG_MMU
 int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 			      u32 oldval, u32 newval);
+#else
+#include <asm-generic/futex.h>
+#endif
 
 #endif
diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index 01422b761aa0..d4087f9499e2 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -15,10 +15,13 @@ typedef struct mm_context {
 	unsigned long sync_tlb_range_from;
 	unsigned long sync_tlb_range_to;
 
+#ifndef CONFIG_MMU
+	unsigned long   end_brk;
 #ifdef CONFIG_BINFMT_ELF_FDPIC
 	unsigned long   exec_fdpic_loadmap;
 	unsigned long   interp_fdpic_loadmap;
 #endif
+#endif /* !CONFIG_MMU */
 } mm_context_t;
 
 #endif
diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h
index 23dcc914d44e..da287e8c86b3 100644
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -37,10 +37,19 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 }
 
 #define init_new_context init_new_context
-extern int init_new_context(struct task_struct *task, struct mm_struct *mm);
-
 #define destroy_context destroy_context
+#ifdef CONFIG_MMU
+extern int init_new_context(struct task_struct *task, struct mm_struct *mm);
 extern void destroy_context(struct mm_struct *mm);
+#else
+static inline int init_new_context(struct task_struct *task, struct mm_struct *mm)
+{
+	return 0;
+}
+static inline void destroy_context(struct mm_struct *mm)
+{
+}
+#endif
 
 #include <asm-generic/mmu_context.h>
 
diff --git a/arch/um/include/asm/tlbflush.h b/arch/um/include/asm/tlbflush.h
index 13a3009942be..9157f71695c6 100644
--- a/arch/um/include/asm/tlbflush.h
+++ b/arch/um/include/asm/tlbflush.h
@@ -30,6 +30,7 @@
  *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
  */
 
+#ifdef CONFIG_MMU
 extern int um_tlb_sync(struct mm_struct *mm);
 
 extern void flush_tlb_all(void);
@@ -55,5 +56,26 @@ static inline void flush_tlb_kernel_range(unsigned long start,
 	/* Kernel needs to be synced immediately */
 	um_tlb_sync(&init_mm);
 }
+#else
+static inline int um_tlb_sync(struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline void flush_tlb_page(struct vm_area_struct *vma,
+				  unsigned long address)
+{
+}
+
+static inline void flush_tlb_range(struct vm_area_struct *vma,
+				   unsigned long start, unsigned long end)
+{
+}
+
+static inline void flush_tlb_kernel_range(unsigned long start,
+					  unsigned long end)
+{
+}
+#endif
 
 #endif
diff --git a/arch/um/include/asm/uaccess.h b/arch/um/include/asm/uaccess.h
index 1d4b6bbc1b65..9bfee12cb6b7 100644
--- a/arch/um/include/asm/uaccess.h
+++ b/arch/um/include/asm/uaccess.h
@@ -22,6 +22,7 @@
 #define __addr_range_nowrap(addr, size) \
 	((unsigned long) (addr) <= ((unsigned long) (addr) + (size)))
 
+#ifdef CONFIG_MMU
 extern unsigned long raw_copy_from_user(void *to, const void __user *from, unsigned long n);
 extern unsigned long raw_copy_to_user(void __user *to, const void *from, unsigned long n);
 extern unsigned long __clear_user(void __user *mem, unsigned long len);
@@ -33,9 +34,6 @@ static inline int __access_ok(const void __user *ptr, unsigned long size);
 
 #define INLINE_COPY_FROM_USER
 #define INLINE_COPY_TO_USER
-
-#include <asm-generic/uaccess.h>
-
 static inline int __access_ok(const void __user *ptr, unsigned long size)
 {
 	unsigned long addr = (unsigned long)ptr;
@@ -43,6 +41,9 @@ static inline int __access_ok(const void __user *ptr, unsigned long size)
 		(__under_task_size(addr, size) ||
 		 __access_ok_vsyscall(addr, size));
 }
+#endif
+
+#include <asm-generic/uaccess.h>
 
 /* no pagefaults for kernel addresses in um */
 #define __get_kernel_nofault(dst, src, type, err_label)			\
diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index 5babad8c5f75..6874be0c38a8 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -195,7 +195,13 @@ extern void get_host_cpu_features(
 extern int create_mem_file(unsigned long long len);
 
 /* tlb.c */
+#ifdef CONFIG_MMU
 extern void report_enomem(void);
+#else
+static inline void report_enomem(void)
+{
+}
+#endif
 
 /* process.c */
 extern void os_alarm_process(int pid);
diff --git a/arch/um/kernel/Makefile b/arch/um/kernel/Makefile
index f8567b933ffa..b41e9bcabbe3 100644
--- a/arch/um/kernel/Makefile
+++ b/arch/um/kernel/Makefile
@@ -16,9 +16,10 @@ extra-y := vmlinux.lds
 
 obj-y = config.o exec.o exitcode.o irq.o ksyms.o mem.o \
 	physmem.o process.o ptrace.o reboot.o sigio.o \
-	signal.o sysrq.o time.o tlb.o trap.o \
+	signal.o sysrq.o time.o trap.o \
 	um_arch.o umid.o maccess.o kmsg_dump.o capflags.o skas/
 obj-y += load_file.o
+obj-$(CONFIG_MMU) += tlb.o
 
 obj-$(CONFIG_BLK_DEV_INITRD) += initrd.o
 obj-$(CONFIG_GPROF)	+= gprof_syms.o
diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 53248ed04771..b674017d9871 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -64,7 +64,8 @@ void __init mem_init(void)
 	 * to be turned on.
 	 */
 	brk_end = (unsigned long) UML_ROUND_UP(sbrk(0));
-	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 0);
+	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1,
+		   !IS_ENABLED(CONFIG_MMU));
 	memblock_free((void *)brk_end, uml_reserved - brk_end);
 	uml_reserved = brk_end;
 
@@ -78,6 +79,7 @@ void __init mem_init(void)
  * Create a page table and place a pointer to it in a middle page
  * directory entry.
  */
+#ifdef CONFIG_MMU
 static void __init one_page_table_init(pmd_t *pmd)
 {
 	if (pmd_none(*pmd)) {
@@ -149,6 +151,12 @@ static void __init fixrange_init(unsigned long start, unsigned long end,
 		j = 0;
 	}
 }
+#else
+static void __init fixrange_init(unsigned long start, unsigned long end,
+				 pgd_t *pgd_base)
+{
+}
+#endif
 
 static void __init fixaddr_user_init( void)
 {
@@ -230,6 +238,7 @@ void *uml_kmalloc(int size, int flags)
 	return kmalloc(size, flags);
 }
 
+#ifdef CONFIG_MMU
 static const pgprot_t protection_map[16] = {
 	[VM_NONE]					= PAGE_NONE,
 	[VM_READ]					= PAGE_READONLY,
@@ -249,3 +258,4 @@ static const pgprot_t protection_map[16] = {
 	[VM_SHARED | VM_EXEC | VM_WRITE | VM_READ]	= PAGE_SHARED
 };
 DECLARE_VM_GET_PAGE_PROT
+#endif
diff --git a/arch/um/kernel/physmem.c b/arch/um/kernel/physmem.c
index a74f17b033c4..f55d46dbe173 100644
--- a/arch/um/kernel/physmem.c
+++ b/arch/um/kernel/physmem.c
@@ -84,7 +84,11 @@ void __init setup_physmem(unsigned long start, unsigned long reserve_end,
 		exit(1);
 	}
 
+#ifdef CONFIG_MMU
 	physmem_fd = create_mem_file(len);
+#else
+	physmem_fd = -1;
+#endif
 
 	err = os_map_memory((void *) reserve_end, physmem_fd, reserve,
 			    map_size, 1, 1, 1);
@@ -95,12 +99,14 @@ void __init setup_physmem(unsigned long start, unsigned long reserve_end,
 		exit(1);
 	}
 
+#ifdef CONFIG_MMU
 	/*
 	 * Special kludge - This page will be mapped in to userspace processes
 	 * from physmem_fd, so it needs to be written out there.
 	 */
 	os_seek_file(physmem_fd, __pa(__syscall_stub_start));
 	os_write_file(physmem_fd, __syscall_stub_start, PAGE_SIZE);
+#endif
 
 	memblock_add(__pa(start), len);
 	memblock_reserve(__pa(start), reserve);
diff --git a/arch/um/kernel/skas/Makefile b/arch/um/kernel/skas/Makefile
index 3384be42691f..64d7ba803b1a 100644
--- a/arch/um/kernel/skas/Makefile
+++ b/arch/um/kernel/skas/Makefile
@@ -3,8 +3,8 @@
 # Copyright (C) 2002 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
 #
 
-obj-y := stub.o mmu.o process.o syscall.o uaccess.o \
-	 stub_exe_embed.o
+obj-y := stub.o process.o stub_exe_embed.o
+obj-$(CONFIG_MMU) += mmu.o syscall.o uaccess.o
 
 # Stub executable
 
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index cdaee3e94273..a7519b3de4bf 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -24,6 +24,7 @@
 int handle_page_fault(unsigned long address, unsigned long ip,
 		      int is_write, int is_user, int *code_out)
 {
+#ifdef CONFIG_MMU
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	pmd_t *pmd;
@@ -129,6 +130,9 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 		goto out_nosemaphore;
 	pagefault_out_of_memory();
 	return 0;
+#else
+	return -EFAULT;
+#endif
 }
 
 static void show_segv_info(struct uml_pt_regs *regs)
diff --git a/arch/um/os-Linux/Makefile b/arch/um/os-Linux/Makefile
index 049dfa5bc9c6..20ff8d5971db 100644
--- a/arch/um/os-Linux/Makefile
+++ b/arch/um/os-Linux/Makefile
@@ -21,3 +21,4 @@ USER_OBJS := $(user-objs-y) elf_aux.o execvp.o file.o helper.o irq.o \
 	tty.o umid.o util.o
 
 include $(srctree)/arch/um/scripts/Makefile.rules
+CFLAGS_process.o=-g -O0
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index 9f086f939420..ef1a2f0aa06a 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -63,8 +63,8 @@ int os_map_memory(void *virt, int fd, unsigned long long off, unsigned long len,
 	prot = (r ? PROT_READ : 0) | (w ? PROT_WRITE : 0) |
 		(x ? PROT_EXEC : 0);
 
-	loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED,
-		     fd, off);
+	loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED |
+		     (!IS_ENABLED(CONFIG_MMU) ? MAP_ANONYMOUS : 0), fd, off);
 	if (loc == MAP_FAILED)
 		return -errno;
 	return 0;
-- 
2.43.0




* [RFC PATCH v2 04/13] x86/um: nommu: syscall handling
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
                     ` (2 preceding siblings ...)
  2024-11-11  6:27   ` [RFC PATCH v2 03/13] um: nommu: memory handling Hajime Tazaki
@ 2024-11-11  6:27   ` Hajime Tazaki
  2024-11-11  6:27   ` [RFC PATCH v2 05/13] x86/um: nommu: syscall translation by zpoline Hajime Tazaki
                     ` (12 subsequent siblings)
  16 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-11  6:27 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

This commit introduces the syscall entry point for !MMU mode: the
function __kernel_vsyscall, a kernel-wide global symbol that is
accessible from any location.

Although it is outside the scope of this commit, the entry point can
also be exposed via the vdso image, which is directly accessible from
userspace. A standard library (i.e., libc) can use it to implement its
syscall wrappers; it can also serve as the target when hooking syscalls
of unmodified userspace applications/libraries, which is implemented in
the subsequent commit.

Only the 64-bit mode of the x86 architecture is supported.
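
The entry code in entry_64.S fills a register frame with pushes, starting
168 bytes above the frame base.  As a rough check of that arithmetic, the
sketch below mirrors the usual x86-64 frame layout in a standalone struct.
The names frame_sketch and frame_top_offset are hypothetical and exist
only for this illustration; they are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the x86-64 register frame, lowest address first.
 * The first 15 slots are filled by PUSH_AND_CLEAR_REGS, the last 6 by the
 * explicit pushes (orig_ax, ip, cs, flags, sp, ss).
 */
struct frame_sketch {
	uint64_t r15, r14, r13, r12, bp, bx;
	uint64_t r11, r10, r9, r8;
	uint64_t ax, cx, dx, si, di;
	uint64_t orig_ax, ip, cs, flags, sp, ss;
};

/* 21 slots of 8 bytes each: this is where the 168 comes from. */
static_assert(sizeof(struct frame_sketch) == 21 * 8,
	      "frame must be 21 quadwords");

/* Offset added to the frame base before the first push (of ss). */
static size_t frame_top_offset(void)
{
	return sizeof(struct frame_sketch);
}
```

Pushing ss first and r15 last walks this struct from its end back to its
start, so %rsp ends up pointing at the frame base when do_syscall_64() is
called.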

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/x86/um/do_syscall_64.c             | 37 +++++++++++
 arch/x86/um/entry_64.S                  | 87 +++++++++++++++++++++++++
 arch/x86/um/shared/sysdep/syscalls_64.h |  6 ++
 3 files changed, 130 insertions(+)
 create mode 100644 arch/x86/um/do_syscall_64.c
 create mode 100644 arch/x86/um/entry_64.S

diff --git a/arch/x86/um/do_syscall_64.c b/arch/x86/um/do_syscall_64.c
new file mode 100644
index 000000000000..a1189ddb2b50
--- /dev/null
+++ b/arch/x86/um/do_syscall_64.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/kernel.h>
+#include <linux/ptrace.h>
+#include <kern_util.h>
+#include <sysdep/syscalls.h>
+#include <os.h>
+
+__visible void do_syscall_64(struct pt_regs *regs)
+{
+	int syscall;
+
+	syscall = PT_SYSCALL_NR(regs->regs.gp);
+	UPT_SYSCALL_NR(&regs->regs) = syscall;
+
+	pr_debug("syscall(%d) (current=%lx) (fn=%lx)\n",
+		 syscall, (unsigned long)current,
+		 (unsigned long)sys_call_table[syscall]);
+
+	if (likely(syscall < NR_syscalls)) {
+		PT_REGS_SET_SYSCALL_RETURN(regs,
+				EXECUTE_SYSCALL(syscall, regs));
+	}
+
+	pr_debug("syscall(%d) --> %lx\n", syscall,
+		regs->regs.gp[HOST_AX]);
+
+	PT_REGS_SYSCALL_RET(regs) = regs->regs.gp[HOST_AX];
+
+	/* force do_signal() --> is_syscall() */
+	set_thread_flag(TIF_SIGPENDING);
+	interrupt_end();
+
+	/* execve succeeded */
+	if (syscall == __NR_execve && regs->regs.gp[HOST_AX] == 0)
+		userspace(&current->thread.regs.regs);
+}
diff --git a/arch/x86/um/entry_64.S b/arch/x86/um/entry_64.S
new file mode 100644
index 000000000000..022a8122690b
--- /dev/null
+++ b/arch/x86/um/entry_64.S
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/errno.h>
+
+#include <linux/linkage.h>
+#include <asm/percpu.h>
+#include <asm/desc.h>
+
+#include "../entry/calling.h"
+
+#ifdef CONFIG_SMP
+#error need to stash these variables somewhere else
+#endif
+
+#define UM_GLOBAL_VAR(x) .data; .align 8; .globl x; x:; .long 0
+
+UM_GLOBAL_VAR(current_top_of_stack)
+UM_GLOBAL_VAR(current_ptregs)
+
+.code64
+.section .entry.text, "ax"
+
+.align 8
+#undef ENTRY
+#define ENTRY(x) .text; .globl x; .type x,%function; x:
+#undef END
+#define END(x)   .size x, . - x
+
+/*
+ * %rcx has the return address (we set it like that in zpoline trampoline).
+ *
+ * Registers on entry:
+ * rax  system call number
+ * rcx  return address
+ * rdi  arg0
+ * rsi  arg1
+ * rdx  arg2
+ * r10  arg3
+ * r8   arg4
+ * r9   arg5
+ *
+ * (note: we are allowed to mess with r11: r11 is callee-clobbered
+ * register in C ABI)
+ */
+ENTRY(__kernel_vsyscall)
+
+	movq	%rsp, %r11
+
+	/* Point rsp to the top of the ptregs array, so we can
+           just fill it with a bunch of push'es. */
+	movq	current_ptregs, %rsp
+
+	/* 8 bytes * 20 registers (plus 8 for the push) */
+	addq	$168, %rsp
+
+	/* Construct struct pt_regs on stack */
+	pushq	$0		/* pt_regs->ss (index 20) */
+	pushq   %r11		/* pt_regs->sp */
+	pushfq			/* pt_regs->flags */
+	pushq	$0		/* pt_regs->cs */
+	pushq	%rcx		/* pt_regs->ip */
+	pushq	%rax		/* pt_regs->orig_ax */
+
+	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
+
+	mov %rsp, %rdi
+
+	/*
+	 * Switch to current top of stack, so "current->" points
+	 * to the right task.
+	 */
+	movq	current_top_of_stack, %rsp
+
+	call	do_syscall_64
+
+	movq	current_ptregs, %rsp
+
+	POP_REGS
+
+	addq	$8, %rsp	/* skip orig_ax */
+	addq	$8, %rsp	/* skip ip */
+	addq	$8, %rsp	/* skip cs */
+	addq	$8, %rsp	/* skip flags */
+	popq	%rsp
+
+	ret
+
+END(__kernel_vsyscall)
diff --git a/arch/x86/um/shared/sysdep/syscalls_64.h b/arch/x86/um/shared/sysdep/syscalls_64.h
index b6b997225841..f3a4fd76673f 100644
--- a/arch/x86/um/shared/sysdep/syscalls_64.h
+++ b/arch/x86/um/shared/sysdep/syscalls_64.h
@@ -25,4 +25,10 @@ extern syscall_handler_t *sys_call_table[];
 extern syscall_handler_t sys_modify_ldt;
 extern syscall_handler_t sys_arch_prctl;
 
+#ifndef CONFIG_MMU
+__visible void do_syscall_64(struct pt_regs *regs);
+extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
+			      int64_t a4, int64_t a5, int64_t a6);
+#endif
+
 #endif
-- 
2.43.0




* [RFC PATCH v2 05/13] x86/um: nommu: syscall translation by zpoline
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
                     ` (3 preceding siblings ...)
  2024-11-11  6:27   ` [RFC PATCH v2 04/13] x86/um: nommu: syscall handling Hajime Tazaki
@ 2024-11-11  6:27   ` Hajime Tazaki
  2024-11-11  6:27   ` [RFC PATCH v2 06/13] um: nommu: prevent host syscalls from userspace by seccomp filter Hajime Tazaki
                     ` (11 subsequent siblings)
  16 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-11  6:27 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

This commit adds a mechanism to hook syscalls of unmodified userspace
programs running under UML in !MMU mode. The mechanism, called zpoline,
replaces syscall/sysenter instructions with `call *%rax`, which is then
handled by trampoline code installed by an initcall during boot. The
translation is triggered by elf_arch_finalize_exec(), an arch hook
introduced by another commit.
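
As a minimal userland sketch of the rewriting step, the function below
patches the two-byte syscall/sysenter encodings into `call *%rax`.  It is
deliberately naive: it scans raw bytes, whereas the patch walks the text
with the kernel instruction decoder so that matching bytes inside
immediates or other operands are not touched.  The name
rewrite_syscalls_sketch is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Patch every 0f 05 (syscall) or 0f 34 (sysenter) byte pair into
 * ff d0 (call *%rax) and return the number of patched sites.
 *
 * Assumption: `code` contains only opcode bytes at these positions; a
 * real implementation must decode instruction boundaries first.
 */
static int rewrite_syscalls_sketch(uint8_t *code, size_t len)
{
	int count = 0;
	size_t i = 0;

	assert(code != NULL || len == 0);

	while (i + 1 < len) {
		if (code[i] == 0x0f &&
		    (code[i + 1] == 0x05 || code[i + 1] == 0x34)) {
			code[i] = 0xff;		/* call ... */
			code[i + 1] = 0xd0;	/* ... *%rax */
			count++;
			i += 2;
		} else {
			i++;
		}
	}
	return count;
}
```

Both encodings are two bytes long, so the replacement fits in place and
no surrounding code has to move.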

All syscalls issued by userspace are thus redirected to a specific
function, __kernel_vsyscall, introduced as the syscall entry point for
!MMU UML.  This completely replaces the ptrace(2)-based syscall hooking
used by MMU-full UML.
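
The redirection works because `call *%rax` with a syscall number N in
%rax jumps to address N, which lands in a nop slide starting at address
zero and slides down into a short jump sequence.  The sketch below builds
that byte layout in an ordinary buffer rather than at address 0x0; the
function name and its parameters are hypothetical, and the "skip old
syscalls" short-jmp optimization from the patch is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill `buf` with `nr_syscalls` nops followed by
 *     movabs $target, %r11
 *     mov    (%rsp), %rcx
 *     jmp    *%r11
 * and return the number of bytes written.
 */
static size_t build_trampoline_sketch(uint8_t *buf, size_t cap,
				      size_t nr_syscalls, uint64_t target)
{
	static const uint8_t tail[] = {
		0x48, 0x8b, 0x0c, 0x24,	/* mov (%rsp), %rcx */
		0x41, 0xff, 0xe3,	/* jmp *%r11 */
	};
	size_t p = nr_syscalls;
	int i;

	assert(cap >= nr_syscalls + 2 + 8 + sizeof(tail));

	/* nop slide: one byte per possible syscall number */
	memset(buf, 0x90, nr_syscalls);

	/* 49 bb imm64: movabs $target, %r11 (little-endian immediate) */
	buf[p++] = 0x49;
	buf[p++] = 0xbb;
	for (i = 0; i < 8; i++)
		buf[p++] = (uint8_t)(target >> (8 * i));

	memcpy(buf + p, tail, sizeof(tail));
	return p + sizeof(tail);
}
```

Loading the return address from (%rsp) into %rcx mimics what the real
syscall instruction leaves in %rcx, which is what __kernel_vsyscall
expects on entry.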

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 arch/x86/um/asm/elf.h |   3 +
 arch/x86/um/zpoline.c | 223 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+)
 create mode 100644 arch/x86/um/zpoline.c

diff --git a/arch/x86/um/asm/elf.h b/arch/x86/um/asm/elf.h
index 33f69f1eac10..6f5977ff0d21 100644
--- a/arch/x86/um/asm/elf.h
+++ b/arch/x86/um/asm/elf.h
@@ -188,6 +188,9 @@ do {								\
 struct linux_binprm;
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
 	int uses_interp);
+struct elf_fdpic_params;
+extern int elf_arch_finalize_exec(struct elf_fdpic_params *exec_params,
+				  struct elf_fdpic_params *interp_params);
 
 extern unsigned long um_vdso_addr;
 #define AT_SYSINFO_EHDR 33
diff --git a/arch/x86/um/zpoline.c b/arch/x86/um/zpoline.c
new file mode 100644
index 000000000000..97f5345ab314
--- /dev/null
+++ b/arch/x86/um/zpoline.c
@@ -0,0 +1,223 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *  zpoline.c
+ *
+ *  Replace syscall/sysenter instructions with `call *%rax` to hook syscalls.
+ *
+ */
+//#define DEBUG
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/elf-fdpic.h>
+#include <asm/unistd.h>
+#include <asm/insn.h>
+#include <sysdep/syscalls.h>
+#include <os.h>
+
+/* start of trampoline code area */
+static char *__zpoline_start;
+
+static int __zpoline_translate_syscalls(struct elf_fdpic_params *params)
+{
+	int count = 0, loop;
+	struct insn insn;
+	unsigned long addr;
+	struct elf_fdpic_loadseg *seg;
+	struct elf_phdr *phdr;
+	struct elfhdr *ehdr = (struct elfhdr *)params->elfhdr_addr;
+
+	if (!ehdr)
+		return 0;
+
+	seg = params->loadmap->segs;
+	phdr = params->phdrs;
+	for (loop = 0; loop < params->hdr.e_phnum; loop++, phdr++) {
+		if (phdr->p_type != PT_LOAD)
+			continue;
+		addr = seg->addr;
+		/* skip translation of trampoline code */
+		if (addr <= (unsigned long)(&__zpoline_start[0] + 0x1000 + 0x0100)) {
+			pr_warn("%lx: address is in the range of trampoline", addr);
+			return -EINVAL;
+		}
+
+		/* translate only segment with Executable flag */
+		if (!(phdr->p_flags & PF_X)) {
+			seg++;
+			continue;
+		}
+
+		pr_debug("translation 0x%lx-0x%llx", addr,
+			 seg->addr + seg->p_memsz);
+		/* now ready to translate */
+		while (addr < (seg->addr + seg->p_memsz)) {
+			insn_init(&insn, (void *)addr, MAX_INSN_SIZE, 1);
+			insn_get_length(&insn);
+
+			insn_get_opcode(&insn);
+
+			switch (insn.opcode.bytes[0]) {
+			case 0xf:
+				switch (insn.opcode.bytes[1]) {
+				case 0x05: /* syscall */
+				case 0x34: /* sysenter */
+					pr_debug("%lx: found syscall/sysenter", addr);
+					*(char *)addr = 0xff; // callq
+					*((char *)addr + 1) = 0xd0; // *%rax
+					count++;
+					break;
+				}
+			default:
+				break;
+			}
+
+			addr += insn.length;
+			if (insn.length == 0) {
+				pr_debug("%lx: length zero with byte %x. skip ?",
+					 addr, insn.opcode.bytes[0]);
+				addr += 1;
+			}
+		}
+		seg++;
+	}
+	return count;
+}
+
+/**
+ * elf_arch_finalize_exec() - architecture hook to translate syscall/sysenter
+ *
+ * translate syscall/sysenter instruction upon loading ELF binary file
+ * on execve(2)&co syscall.
+ *
+ * suppose we have those instructions:
+ *
+ *    mov $sysnr, %rax
+ *    syscall                 0f 05
+ *
+ * this will be translated into:
+ *
+ *    mov $sysnr, %rax        (<= untouched)
+ *    call *%rax              ff d0
+ *
+ * which finally calls the hook function via the trampoline code installed
+ * by setup_zpoline_trampoline().
+ *
+ * @exec_params: ELF meta data for executable file
+ * @interp_params: ELF meta data for the interpreter file
+ */
+int elf_arch_finalize_exec(struct elf_fdpic_params *exec_params,
+			   struct elf_fdpic_params *interp_params)
+{
+	int err = 0, count = 0;
+	struct mm_struct *mm = current->mm;
+
+	if (down_write_killable(&mm->mmap_lock))
+		return -EINTR;
+
+	/* translate for the executable */
+	err = __zpoline_translate_syscalls(exec_params);
+	if (err < 0) {
+		pr_info("zpoline: xlate error %d", err);
+		goto out;
+	}
+	count += err;
+	pr_debug("zpoline: rewritten (exec) %d syscalls\n", count);
+
+	/* translate for the interpreter */
+	err = __zpoline_translate_syscalls(interp_params);
+	if (err < 0) {
+		pr_info("zpoline: xlate error %d", err);
+		goto out;
+	}
+	count += err;
+
+	err = 0;
+	pr_debug("zpoline: rewritten (exec+interp) %d syscalls\n", count);
+
+out:
+	up_write(&mm->mmap_lock);
+	return err;
+}
+
+/**
+ * setup_zpoline_trampoline() - install trampoline code for zpoline
+ *
+ * setup trampoline code for syscall hooks
+ *
+ * the trampoline code guides to call hooked function, __kernel_vsyscall
+ * in this case, via nop slides at the memory address zero (thus, zpoline).
+ *
+ * loaded binary by exec(2) is translated to call the function.
+ */
+static int __init setup_zpoline_trampoline(void)
+{
+	int i, ret;
+	int ptr;
+
+	/* zpoline: map area of trampoline code started from addr 0x0 */
+	__zpoline_start = 0x0;
+
+	ret = os_map_memory((void *) 0, -1, 0, PAGE_SIZE, 1, 1, 1);
+	if (ret)
+		panic("map failed\n NOTE: /proc/sys/vm/mmap_min_addr should be set 0\n");
+
+	/* fill nop instructions until the trampoline code */
+	for (i = 0; i < NR_syscalls; i++)
+		__zpoline_start[i] = 0x90;
+
+	/* optimization to skip old syscalls */
+	/* short jmp */
+	__zpoline_start[214 /* __NR_epoll_ctl_old */] = 0xeb;
+	/* range of a short jmp : -128 ~ +127 */
+	__zpoline_start[215 /* __NR_epoll_wait_old */] = 127;
+
+	/**
+	 * FIXME: shift red zone area to properly handle the case
+	 */
+
+	/**
+	 * put code for jumping to __kernel_vsyscall.
+	 *
+	 * here we embed the following code.
+	 *
+	 * movabs [$addr],%r11
+	 * jmpq   *%r11
+	 *
+	 */
+	ptr = NR_syscalls;
+	/* 49 bb [64-bit addr (8-byte)]    movabs [64-bit addr (8-byte)],%r11 */
+	__zpoline_start[ptr++] = 0x49;
+	__zpoline_start[ptr++] = 0xbb;
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 0));
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 1));
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 2));
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 3));
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 4));
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 5));
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 6));
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 7));
+
+	/*
+	 * pretending to be syscall instruction by putting return
+	 * address in %rcx.
+	 */
+	/* 48 8b 0c 24               mov    (%rsp),%rcx */
+	__zpoline_start[ptr++] = 0x48;
+	__zpoline_start[ptr++] = 0x8b;
+	__zpoline_start[ptr++] = 0x0c;
+	__zpoline_start[ptr++] = 0x24;
+
+	/* 41 ff e3                jmp    *%r11 */
+	__zpoline_start[ptr++] = 0x41;
+	__zpoline_start[ptr++] = 0xff;
+	__zpoline_start[ptr++] = 0xe3;
+
+	/* permission: XOM (PROT_EXEC only) */
+	ret = os_protect_memory(0, PAGE_SIZE, 0, 0, 1);
+	if (ret)
+		panic("failed: can't configure permission on trampoline code");
+
+	pr_info("zpoline: setting up trampoline code done\n");
+	return 0;
+}
+arch_initcall(setup_zpoline_trampoline);
-- 
2.43.0




* [RFC PATCH v2 06/13] um: nommu: prevent host syscalls from userspace by seccomp filter
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
                     ` (4 preceding siblings ...)
  2024-11-11  6:27   ` [RFC PATCH v2 05/13] x86/um: nommu: syscall translation by zpoline Hajime Tazaki
@ 2024-11-11  6:27   ` Hajime Tazaki
  2024-11-11  6:27   ` [RFC PATCH v2 07/13] x86/um: nommu: process/thread handling Hajime Tazaki
                     ` (10 subsequent siblings)
  16 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-11  6:27 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

The syscall translation done by zpoline assumes that userspace code
issues no direct syscalls, but such syscalls may still be issued by
1) dlopen-ed code containing syscall instructions, or 2) JIT-generated
code.  This commit adds a seccomp filter to prevent such syscalls from
userspace code.
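
Classic BPF operates on 32-bit words, so the filter has to compare the
64-bit instruction pointer against the userspace range in two halves.
The following plain-C model of that decision is only a sketch for
reasoning about the filter; ip_allowed_sketch is a hypothetical name and
the code does not install any seccomp program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when a host syscall issued at `ip` would be allowed, i.e.
 * when `ip` lies outside the userspace range [start, end).  The four
 * tests correspond to the four allow branches of the BPF program.
 */
static bool ip_allowed_sketch(uint64_t ip, uint64_t start, uint64_t end)
{
	uint32_t ip_hi = (uint32_t)(ip >> 32), ip_lo = (uint32_t)ip;
	uint32_t start_hi = (uint32_t)(start >> 32), start_lo = (uint32_t)start;
	uint32_t end_hi = (uint32_t)(end >> 32), end_lo = (uint32_t)end;

	assert(start <= end);

	if (ip_hi > end_hi)
		return true;		/* above the range */
	if (ip_hi == end_hi && ip_lo >= end_lo)
		return true;		/* at or above the end */
	if (ip_hi < start_hi)
		return true;		/* below the range */
	if (ip_hi == start_hi && ip_lo < start_lo)
		return true;		/* below the start */
	return false;			/* inside: SECCOMP_RET_TRAP */
}
```

Taken together the four tests are equivalent to
`ip >= end || ip < start` on the full 64-bit values.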

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 arch/um/include/shared/os.h |  3 ++
 arch/um/kernel/um_arch.c    |  4 ++
 arch/um/os-Linux/process.c  | 76 +++++++++++++++++++++++++++++++++++++
 3 files changed, 83 insertions(+)

diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index 6874be0c38a8..5a6722f254d5 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -220,6 +220,9 @@ extern int os_unmap_memory(void *addr, int len);
 extern int os_drop_memory(void *addr, int length);
 extern int can_drop_memory(void);
 extern int os_mincore(void *addr, unsigned long len);
+#ifndef CONFIG_MMU
+extern int os_setup_seccomp(void);
+#endif
 
 void os_set_pdeathsig(void);
 
diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c
index ec17576ce9fc..694e428ddf35 100644
--- a/arch/um/kernel/um_arch.c
+++ b/arch/um/kernel/um_arch.c
@@ -433,6 +433,10 @@ void __init setup_arch(char **cmdline_p)
 		add_bootloader_randomness(rng_seed, sizeof(rng_seed));
 		memzero_explicit(rng_seed, sizeof(rng_seed));
 	}
+
+#ifndef CONFIG_MMU
+	os_setup_seccomp();
+#endif
 }
 
 void __init arch_cpu_finalize_init(void)
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index ef1a2f0aa06a..ed3d99301dc8 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -17,7 +17,11 @@
 #include <asm/unistd.h>
 #include <init.h>
 #include <longjmp.h>
+#include <as-layout.h>
 #include <os.h>
+#include <sys/prctl.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
 
 void os_alarm_process(int pid)
 {
@@ -209,3 +213,75 @@ void os_set_pdeathsig(void)
 {
 	prctl(PR_SET_PDEATHSIG, SIGKILL);
 }
+
+#ifndef CONFIG_MMU
+int os_setup_seccomp(void)
+{
+	int err;
+	unsigned long __userspace_start = uml_reserved,
+		__userspace_end = high_physmem;
+
+	struct sock_filter filter[] = {
+		/* if (IP_high > __userspace_end) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JGT + BPF_K, __userspace_end >> 32,
+			 /*true-skip=*/0, /*false-skip=*/1),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* if (IP_high == __userspace_end && IP_low >= __userspace_end) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __userspace_end >> 32,
+			 /*true-skip=*/0, /*false-skip=*/3),
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer)),
+		BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_end,
+			 /*true-skip=*/0, /*false-skip=*/1),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* if (IP_high < __userspace_start) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_start >> 32,
+			 /*true-skip=*/1, /*false-skip=*/0),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* if (IP_high == __userspace_start && IP_low < __userspace_start) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __userspace_start >> 32,
+			 /*true-skip=*/0, /*false-skip=*/3),
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer)),
+		BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_start,
+			 /*true-skip=*/1, /*false-skip=*/0),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* other address; trap  */
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRAP),
+	};
+	struct sock_fprog prog = {
+		.len = ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	err = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	if (err)
+		os_warn("PR_SET_NO_NEW_PRIVS (err=%d, errno=%d)\n",
+		       err, errno);
+
+	err = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
+		      SECCOMP_FILTER_FLAG_TSYNC, &prog);
+	if (err) {
+		os_warn("SECCOMP_SET_MODE_FILTER (err=%d, errno=%d)\n",
+		       err, errno);
+		exit(-1);
+	}
+
+	os_info("seccomp: filter syscalls in the range: 0x%lx-0x%lx\n",
+		__userspace_start, __userspace_end);
+
+	return 0;
+}
+#endif
-- 
2.43.0




* [RFC PATCH v2 07/13] x86/um: nommu: process/thread handling
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
                     ` (5 preceding siblings ...)
  2024-11-11  6:27   ` [RFC PATCH v2 06/13] um: nommu: prevent host syscalls from userspace by seccomp filter Hajime Tazaki
@ 2024-11-11  6:27   ` Hajime Tazaki
  2024-11-11  6:27   ` [RFC PATCH v2 08/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
                     ` (9 subsequent siblings)
  16 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-11  6:27 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

Since the ptrace facility isn't used by UML in !MMU mode, processes and
threads are started through a different code path: on entry to the
syscall interface the stack pointer has to be adjusted to handle the
vfork(2) return address, no external process is used, and some
registers (e.g., the fs segment register for TLS) have to be configured
properly on every context switch.

Signals aren't delivered on the non-ptrace syscall entry/exit path, so
we also need to handle pending signals ourselves.
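
The vfork(2) handling relies on the parent saving the 8-byte stack slot
that holds its return address, because the child shares the stack and
overwrites that slot before the parent resumes.  A minimal userland
sketch of that save/restore pattern follows; the helper names are
hypothetical and a plain array stands in for the shared user stack.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the 8-byte slot at the (simulated) stack pointer before vfork. */
static uint64_t save_slot_sketch(const uint64_t *sp)
{
	uint64_t copy;

	assert(sp != NULL);
	memcpy(&copy, sp, sizeof(copy));
	return copy;
}

/* Put the saved value back once the parent resumes after the child. */
static void restore_slot_sketch(uint64_t *sp, uint64_t copy)
{
	assert(sp != NULL);
	memcpy(sp, &copy, sizeof(copy));
}
```

In the patch the same idea is implemented by vfork_save_stack() and
vfork_restore_stack() in do_syscall_64.c, with the copy kept in kernel
memory while the child runs.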

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/kernel/process.c        | 33 +++++++++++++++++++++++++++++-
 arch/um/os-Linux/process.c      |  6 ++++++
 arch/um/os-Linux/skas/process.c |  4 ++++
 arch/x86/um/asm/processor.h     | 12 +++++++++++
 arch/x86/um/do_syscall_64.c     | 36 +++++++++++++++++++++++++++++++++
 arch/x86/um/entry_64.S          | 21 +++++++++++++++++++
 arch/x86/um/syscalls_64.c       | 12 +++++++++++
 7 files changed, 123 insertions(+), 1 deletion(-)

diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 56e7e525fc91..b3708dceb731 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -116,13 +116,17 @@ void new_thread_handler(void)
 	 * callback returns only if the kernel thread execs a process
 	 */
 	fn(arg);
+#ifndef CONFIG_MMU
+	arch_switch_to(current);
+#endif
 	userspace(&current->thread.regs.regs);
 }
 
 /* Called magically, see new_thread_handler above */
 static void fork_handler(void)
 {
-	schedule_tail(current->thread.prev_sched);
+	if (current->thread.prev_sched)
+		schedule_tail(current->thread.prev_sched);
 
 	/*
 	 * XXX: if interrupt_end() calls schedule, this call to
@@ -133,6 +137,33 @@ static void fork_handler(void)
 
 	current->thread.prev_sched = NULL;
 
+#ifndef CONFIG_MMU
+	/*
+	 * child of vfork(2) comes here.
+	 * clone(2) also enters here but doesn't need to advance the %rsp.
+	 *
+	 * This fork can only come from libc's vfork, which
+	 * does this:
+	 *	popq %%rdx;
+	 *	call *%rax; // zpoline => __kernel_vsyscall
+	 *	pushq %%rdx;
+	 * %rcx stores the return address which is stored
+	 * at pt_regs[HOST_IP] at the moment.  As child returns
+	 * via userspace() with a jmp instruction (while parent
+	 * does via ret instruction in __kernel_vsyscall), we
+	 * need to pop (advance) the pushed address by "call"
+	 * though, so this is what this next line does.
+	 *
+	 * As a result of vfork return in child, stack contents
+	 * is overwritten by child (by pushq in vfork), which
+	 * makes the parent puzzled after child returns.
+	 *
+	 * thus the contents should be restored before vfork/parent
+	 * returns.  this is done in do_syscall_64().
+	 */
+	if (current->thread.regs.regs.gp[HOST_ORIG_AX] == __NR_vfork)
+		current->thread.regs.regs.gp[REGS_SP_INDEX] += 8;
+#endif
 	userspace(&current->thread.regs.regs);
 }
 
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index ed3d99301dc8..5acf6d41a4c2 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -25,7 +25,10 @@
 
 void os_alarm_process(int pid)
 {
+/* !CONFIG_MMU doesn't send alarm signal to other processes */
+#ifdef CONFIG_MMU
 	kill(pid, SIGALRM);
+#endif
 }
 
 void os_kill_process(int pid, int reap_child)
@@ -42,11 +45,14 @@ void os_kill_process(int pid, int reap_child)
 
 void os_kill_ptraced_process(int pid, int reap_child)
 {
+/* !CONFIG_MMU doesn't have ptraced process */
+#ifdef CONFIG_MMU
 	kill(pid, SIGKILL);
 	ptrace(PTRACE_KILL, pid);
 	ptrace(PTRACE_CONT, pid);
 	if (reap_child)
 		CATCH_EINTR(waitpid(pid, NULL, __WALL));
+#endif
 }
 
 /* Don't use the glibc version, which caches the result in TLS. It misses some
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index f683cfc9e51a..291136008431 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -144,6 +144,7 @@ void wait_stub_done(int pid)
 
 extern unsigned long current_stub_stack(void);
 
+#ifdef CONFIG_MMU
 static void get_skas_faultinfo(int pid, struct faultinfo *fi)
 {
 	int err;
@@ -176,6 +177,7 @@ static void handle_trap(int pid, struct uml_pt_regs *regs)
 
 	handle_syscall(regs);
 }
+#endif
 
 extern char __syscall_stub_start[];
 
@@ -389,6 +391,7 @@ int start_userspace(unsigned long stub_stack)
 }
 
 int unscheduled_userspace_iterations;
+#ifdef CONFIG_MMU
 extern unsigned long tt_extra_sched_jiffies;
 
 void userspace(struct uml_pt_regs *regs)
@@ -550,6 +553,7 @@ void userspace(struct uml_pt_regs *regs)
 		}
 	}
 }
+#endif /* UML_CONFIG_MMU */
 
 void new_thread(void *stack, jmp_buf *buf, void (*handler)(void))
 {
diff --git a/arch/x86/um/asm/processor.h b/arch/x86/um/asm/processor.h
index 478710384b34..d88d7d9d5c18 100644
--- a/arch/x86/um/asm/processor.h
+++ b/arch/x86/um/asm/processor.h
@@ -38,6 +38,18 @@ static __always_inline void cpu_relax(void)
 
 #define task_pt_regs(t) (&(t)->thread.regs)
 
+#ifndef CONFIG_MMU
+#define task_top_of_stack(task) \
+({									\
+	unsigned long __ptr = (unsigned long)task->stack;	\
+	__ptr += THREAD_SIZE;			\
+	__ptr;					\
+})
+
+extern long current_top_of_stack;
+extern long current_ptregs;
+#endif
+
 #include <asm/processor-generic.h>
 
 #endif
diff --git a/arch/x86/um/do_syscall_64.c b/arch/x86/um/do_syscall_64.c
index a1189ddb2b50..203bacc4cb3c 100644
--- a/arch/x86/um/do_syscall_64.c
+++ b/arch/x86/um/do_syscall_64.c
@@ -1,14 +1,43 @@
 // SPDX-License-Identifier: GPL-2.0
 
+//#define DEBUG 1
 #include <linux/kernel.h>
 #include <linux/ptrace.h>
 #include <kern_util.h>
 #include <sysdep/syscalls.h>
 #include <os.h>
 
+/*
+ * save/restore the return address stored in the stack, as the child overwrites
+ * the contents after returning to userspace (i.e., by push %rdx).
+ *
+ * see the detail in fork_handler().
+ */
+static void *vfork_save_stack(void)
+{
+	unsigned char *stack_copy;
+
+	stack_copy = kzalloc(8, GFP_KERNEL);
+	if (!stack_copy)
+		return NULL;
+
+	memcpy(stack_copy,
+	       (void *)current->thread.regs.regs.gp[HOST_SP], 8);
+
+	return stack_copy;
+}
+
+static void vfork_restore_stack(void *stack_copy)
+{
+	WARN_ON_ONCE(!stack_copy);
+	memcpy((void *)current->thread.regs.regs.gp[HOST_SP],
+	       stack_copy, 8);
+}
+
 __visible void do_syscall_64(struct pt_regs *regs)
 {
 	int syscall;
+	unsigned char *stack_copy = NULL;
 
 	syscall = PT_SYSCALL_NR(regs->regs.gp);
 	UPT_SYSCALL_NR(&regs->regs) = syscall;
@@ -17,6 +46,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
 		 syscall, (unsigned long)current,
 		 (unsigned long)sys_call_table[syscall]);
 
+	if (syscall == __NR_vfork)
+		stack_copy = vfork_save_stack();
+
 	if (likely(syscall < NR_syscalls)) {
 		PT_REGS_SET_SYSCALL_RETURN(regs,
 				EXECUTE_SYSCALL(syscall, regs));
@@ -34,4 +66,8 @@ __visible void do_syscall_64(struct pt_regs *regs)
 	/* execve succeeded */
 	if (syscall == __NR_execve && regs->regs.gp[HOST_AX] == 0)
 		userspace(&current->thread.regs.regs);
+
+	/* only the parent of vfork restores the stack contents */
+	if (syscall == __NR_vfork && regs->regs.gp[HOST_AX] > 0)
+		vfork_restore_stack(stack_copy);
 }
diff --git a/arch/x86/um/entry_64.S b/arch/x86/um/entry_64.S
index 022a8122690b..32f5002e2eb0 100644
--- a/arch/x86/um/entry_64.S
+++ b/arch/x86/um/entry_64.S
@@ -85,3 +85,24 @@ ENTRY(__kernel_vsyscall)
 	ret
 
 END(__kernel_vsyscall)
+
+// void userspace(struct uml_pt_regs *regs)
+ENTRY(userspace)
+	/* align the stack for x86_64 ABI */
+	and     $-0x10, %rsp
+	/* Handle any immediate reschedules or signals */
+	call	interrupt_end
+
+	movq	current_ptregs, %rsp
+
+	POP_REGS
+
+	addq	$8, %rsp	/* skip orig_ax */
+	popq	%r11		/* pt_regs->ip */
+	addq	$8, %rsp	/* skip cs */
+	addq	$8, %rsp	/* skip flags */
+	popq	%rsp
+
+	jmp	*%r11
+
+END(userspace)
diff --git a/arch/x86/um/syscalls_64.c b/arch/x86/um/syscalls_64.c
index 6a00a28c9cca..edb17fc73e07 100644
--- a/arch/x86/um/syscalls_64.c
+++ b/arch/x86/um/syscalls_64.c
@@ -51,6 +51,18 @@ void arch_switch_to(struct task_struct *to)
 	 * Nothing needs to be done on x86_64.
 	 * The FS_BASE/GS_BASE registers are saved in the ptrace register set.
 	 */
+#ifndef CONFIG_MMU
+	current_top_of_stack = task_top_of_stack(to);
+	current_ptregs = (long)task_pt_regs(to);
+
+	if ((to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] == 0) ||
+	    (to->mm == NULL))
+		return;
+
+	/* this changes the FS on every context switch */
+	arch_prctl(to, ARCH_SET_FS,
+		   (void __user *) to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)]);
+#endif
 }
 
 SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [RFC PATCH v2 08/13] um: nommu: configure fs register on host syscall invocation
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
                     ` (6 preceding siblings ...)
  2024-11-11  6:27   ` [RFC PATCH v2 07/13] x86/um: nommu: process/thread handling Hajime Tazaki
@ 2024-11-11  6:27   ` Hajime Tazaki
  2024-11-27 10:00     ` Benjamin Berg
  2024-11-11  6:27   ` [RFC PATCH v2 09/13] x86/um/vdso: nommu: vdso memory update Hajime Tazaki
                     ` (8 subsequent siblings)
  16 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-11  6:27 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

Userspace on UML/!MMU also needs to configure the %fs register while it
is running in order to access its thread structure correctly.  As a
result, host syscalls implemented in the os-Linux drivers may be
confused when they are called.  Thus the %fs register has to be
configured via arch_prctl(ARCH_SET_FS) on every host syscall.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
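For reviewers unfamiliar with the mechanism, the save/switch/restore
pattern described above can be sketched in plain userspace C.  This is
an illustration only, not code from this patch: it assumes an x86_64
Linux host, and fs_base_get(), fs_base_set(), call_with_host_fs() and
example_host_call() are hypothetical names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <asm/prctl.h>		/* ARCH_GET_FS, ARCH_SET_FS */
#include <sys/syscall.h>	/* SYS_arch_prctl */
#include <unistd.h>		/* syscall() */

/* Read the FS base of the calling thread; returns 0 on success. */
static int fs_base_get(unsigned long *base)
{
	return (int)syscall(SYS_arch_prctl, ARCH_GET_FS, base);
}

/* Set the FS base of the calling thread; returns 0 on success. */
static int fs_base_set(unsigned long base)
{
	return (int)syscall(SYS_arch_prctl, ARCH_SET_FS, base);
}

/* Hypothetical stand-in for a host-side function that relies on host TLS. */
static void example_host_call(void)
{
}

/*
 * Hypothetical helper: run fn() with the FS base temporarily set to
 * host_base, then put the caller's FS base back.
 */
static int call_with_host_fs(unsigned long host_base, void (*fn)(void))
{
	unsigned long saved;

	if (fs_base_get(&saved))
		return -1;
	if (fs_base_set(host_base))
		return -1;
	fn();
	return fs_base_set(saved);
}
```

The sketch always goes through the arch_prctl syscall; the patch itself
uses the FSGSBASE instructions instead when the host supports them.
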
 arch/um/include/shared/os.h |  5 ++++
 arch/um/os-Linux/Makefile   |  4 +--
 arch/um/os-Linux/cpu.c      | 50 ++++++++++++++++++++++++++++++++
 arch/um/os-Linux/internal.h |  5 ++++
 arch/um/os-Linux/main.c     |  5 ++++
 arch/um/os-Linux/process.c  |  8 ++++++
 arch/um/os-Linux/start_up.c |  3 ++
 arch/x86/um/do_syscall_64.c | 35 +++++++++++++++++++++++
 arch/x86/um/syscalls_64.c   | 57 +++++++++++++++++++++++++++++++++++++
 9 files changed, 170 insertions(+), 2 deletions(-)
 create mode 100644 arch/um/os-Linux/cpu.c

diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index 5a6722f254d5..69a7854f5f87 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -136,6 +136,9 @@ static inline struct openflags of_cloexec(struct openflags flags)
 	return flags;
 }
 
+/* cpu.c */
+extern int host_has_fsgsbase;
+
 /* file.c */
 extern int os_stat_file(const char *file_name, struct uml_stat *buf);
 extern int os_stat_fd(const int fd, struct uml_stat *buf);
@@ -221,6 +224,8 @@ extern int os_drop_memory(void *addr, int length);
 extern int can_drop_memory(void);
 extern int os_mincore(void *addr, unsigned long len);
 #ifndef CONFIG_MMU
+extern long long host_fs;
+extern int os_arch_prctl(int pid, int option, unsigned long *arg);
 extern int os_setup_seccomp(void);
 #endif
 
diff --git a/arch/um/os-Linux/Makefile b/arch/um/os-Linux/Makefile
index 20ff8d5971db..af7c5f4373bc 100644
--- a/arch/um/os-Linux/Makefile
+++ b/arch/um/os-Linux/Makefile
@@ -8,7 +8,7 @@ KCOV_INSTRUMENT                := n
 
 obj-y = execvp.o file.o helper.o irq.o main.o mem.o process.o \
 	registers.o sigio.o signal.o start_up.o time.o tty.o \
-	umid.o user_syms.o util.o drivers/ skas/
+	umid.o user_syms.o util.o cpu.o drivers/ skas/
 
 CFLAGS_signal.o += -Wframe-larger-than=4096
 
@@ -18,7 +18,7 @@ obj-$(CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA) += elf_aux.o
 
 USER_OBJS := $(user-objs-y) elf_aux.o execvp.o file.o helper.o irq.o \
 	main.o mem.o process.o registers.o sigio.o signal.o start_up.o time.o \
-	tty.o umid.o util.o
+	tty.o umid.o util.o cpu.o
 
 include $(srctree)/arch/um/scripts/Makefile.rules
 CFLAGS_process.o=-g -O0
diff --git a/arch/um/os-Linux/cpu.c b/arch/um/os-Linux/cpu.c
new file mode 100644
index 000000000000..49b6d8b4d65d
--- /dev/null
+++ b/arch/um/os-Linux/cpu.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <string.h>
+#include <signal.h>
+#include <setjmp.h>
+#include <init.h>
+#include "internal.h"
+
+int host_has_fsgsbase;
+/* These declarations could be pulled in from os.h, but including it
+ * causes conflicting jmp_buf definitions between longjmp.h (UM) and
+ * the host one.  Thus they are declared here instead.
+ */
+void os_info(const char *fmt, ...);
+void os_warn(const char *fmt, ...);
+
+/*
+ * get_host_cpu_features() reports X86_FEATURE_FSGSBASE even if the
+ * host kernel is older and has disabled the fsgsbase instructions.
+ * Thus detection is based on whether SIGILL is raised or not.
+ */
+static jmp_buf jmpbuf;
+static void sigill(int sig, siginfo_t *si, void *ctx_void)
+{
+	siglongjmp(jmpbuf, 1);
+}
+
+void __init check_fsgsbase(void)
+{
+	unsigned long fsbase;
+	struct sigaction sa;
+
+	/* Probe FSGSBASE */
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_sigaction = sigill;
+	sa.sa_flags = SA_SIGINFO | SA_RESETHAND;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(SIGILL, &sa, 0))
+		os_warn("sigaction");
+
+	os_info("Checking FSGSBASE instructions...");
+	if (sigsetjmp(jmpbuf, 0) == 0) {
+		asm volatile("rdfsbase %0" : "=r" (fsbase) :: "memory");
+		host_has_fsgsbase = 1;
+		os_info("OK\n");
+	} else {
+		host_has_fsgsbase = 0;
+		os_info("disabled\n");
+	}
+}
diff --git a/arch/um/os-Linux/internal.h b/arch/um/os-Linux/internal.h
index 317fca190c2b..60220b8b8843 100644
--- a/arch/um/os-Linux/internal.h
+++ b/arch/um/os-Linux/internal.h
@@ -2,6 +2,11 @@
 #ifndef __UM_OS_LINUX_INTERNAL_H
 #define __UM_OS_LINUX_INTERNAL_H
 
+/*
+ * cpu.c
+ */
+void check_fsgsbase(void);
+
 /*
  * elf_aux.c
  */
diff --git a/arch/um/os-Linux/main.c b/arch/um/os-Linux/main.c
index 0afcdeb8995b..aecf63d3db79 100644
--- a/arch/um/os-Linux/main.c
+++ b/arch/um/os-Linux/main.c
@@ -17,6 +17,7 @@
 #include <kern_util.h>
 #include <os.h>
 #include <um_malloc.h>
+#include <asm/prctl.h> /* XXX This should get the constants from libc */
 #include "internal.h"
 
 #define PGD_BOUND (4 * 1024 * 1024)
@@ -158,6 +159,10 @@ int __init main(int argc, char **argv, char **envp)
 	change_sig(SIGPIPE, 0);
 	ret = linux_main(argc, argv, envp);
 
+#ifndef CONFIG_MMU
+	os_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
+#endif
+
 	/*
 	 * Disable SIGPROF - I have no idea why libc doesn't do this or turn
 	 * off the profiling time, but UML dies with a SIGPROF just before
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index 5acf6d41a4c2..5a3b09096f92 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -221,6 +221,14 @@ void os_set_pdeathsig(void)
 }
 
 #ifndef CONFIG_MMU
+#include <unistd.h>
+#include <sys/syscall.h>   /* For SYS_xxx definitions */
+
+int os_arch_prctl(int pid, int option, unsigned long *arg2)
+{
+	return syscall(SYS_arch_prctl, option, arg2);
+}
+
 int os_setup_seccomp(void)
 {
 	int err;
diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c
index 93fc82c01aba..88164893cbec 100644
--- a/arch/um/os-Linux/start_up.c
+++ b/arch/um/os-Linux/start_up.c
@@ -293,6 +293,9 @@ void __init os_early_checks(void)
 	 */
 	check_tmpexec();
 
+	/* probe fsgsbase instruction */
+	check_fsgsbase();
+
 	pid = start_ptraced_child();
 	if (init_pid_registers(pid))
 		fatal("Failed to initialize default registers");
diff --git a/arch/x86/um/do_syscall_64.c b/arch/x86/um/do_syscall_64.c
index 203bacc4cb3c..75326acc931b 100644
--- a/arch/x86/um/do_syscall_64.c
+++ b/arch/x86/um/do_syscall_64.c
@@ -3,6 +3,8 @@
 //#define DEBUG 1
 #include <linux/kernel.h>
 #include <linux/ptrace.h>
+#include <asm/fsgsbase.h>
+#include <asm/prctl.h>
 #include <kern_util.h>
 #include <sysdep/syscalls.h>
 #include <os.h>
@@ -34,6 +36,31 @@ static void vfork_restore_stack(void *stack_copy)
 	       stack_copy, 8);
 }
 
+static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2)
+{
+	if (host_has_fsgsbase) {
+		switch (option) {
+		case ARCH_SET_FS:
+			wrfsbase(*arg2);
+			break;
+		case ARCH_SET_GS:
+			wrgsbase(*arg2);
+			break;
+		case ARCH_GET_FS:
+			*arg2 = rdfsbase();
+			break;
+		case ARCH_GET_GS:
+			*arg2 = rdgsbase();
+			break;
+		}
+		return 0;
+	} else {
+		return os_arch_prctl(pid, option, arg2);
+	}
+
+	return 0;
+}
+
 __visible void do_syscall_64(struct pt_regs *regs)
 {
 	int syscall;
@@ -49,6 +76,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
 	if (syscall == __NR_vfork)
 		stack_copy = vfork_save_stack();
 
+	/* set fs register to the original host one */
+	os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
+
 	if (likely(syscall < NR_syscalls)) {
 		PT_REGS_SET_SYSCALL_RETURN(regs,
 				EXECUTE_SYSCALL(syscall, regs));
@@ -63,6 +93,11 @@ __visible void do_syscall_64(struct pt_regs *regs)
 	set_thread_flag(TIF_SIGPENDING);
 	interrupt_end();
 
+	/* restore fs register to the one configured by userspace */
+	os_x86_arch_prctl(0, ARCH_SET_FS,
+		      (void *)(current->thread.regs.regs.gp[FS_BASE
+						     / sizeof(unsigned long)]));
+
 	/* execve succeeded */
 	if (syscall == __NR_execve && regs->regs.gp[HOST_AX] == 0)
 		userspace(&current->thread.regs.regs);
diff --git a/arch/x86/um/syscalls_64.c b/arch/x86/um/syscalls_64.c
index edb17fc73e07..d56df936a2d7 100644
--- a/arch/x86/um/syscalls_64.c
+++ b/arch/x86/um/syscalls_64.c
@@ -12,11 +12,26 @@
 #include <asm/prctl.h> /* XXX This should get the constants from libc */
 #include <registers.h>
 #include <os.h>
+#include <asm/thread_info.h>
+#include <asm/mman.h>
+
+#ifndef CONFIG_MMU
+/*
+ * The guest libc can change FS, which confuses the host libc.
+ * In fact, changing FS directly is not supported (check
+ * man arch_prctl). So, whenever we make a host syscall,
+ * we should be changing FS to the original FS (not the
+ * one set by the guest libc). This original FS is stored
+ * in host_fs.
+ */
+long long host_fs = -1;
+#endif
 
 long arch_prctl(struct task_struct *task, int option,
 		unsigned long __user *arg2)
 {
 	long ret = -EINVAL;
+#ifdef CONFIG_MMU
 
 	switch (option) {
 	case ARCH_SET_FS:
@@ -38,6 +53,48 @@ long arch_prctl(struct task_struct *task, int option,
 	}
 
 	return ret;
+#else
+
+	unsigned long *ptr = arg2, tmp;
+
+	switch (option) {
+	case ARCH_SET_FS:
+		if (host_fs == -1)
+			os_arch_prctl(0, ARCH_GET_FS, (void *)&host_fs);
+		ret = 0;
+		break;
+	case ARCH_SET_GS:
+		ret = 0;
+		break;
+	case ARCH_GET_FS:
+	case ARCH_GET_GS:
+		ptr = &tmp;
+		break;
+	}
+
+	ret = os_arch_prctl(0, option, ptr);
+	if (ret)
+		return ret;
+
+	switch (option) {
+	case ARCH_SET_FS:
+		current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] =
+			(unsigned long) arg2;
+		break;
+	case ARCH_SET_GS:
+		current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)] =
+			(unsigned long) arg2;
+		break;
+	case ARCH_GET_FS:
+		ret = put_user(current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)], arg2);
+		break;
+	case ARCH_GET_GS:
+		ret = put_user(current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)], arg2);
+		break;
+	}
+
+	return ret;
+#endif
 }
 
 SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [RFC PATCH v2 09/13] x86/um/vdso: nommu: vdso memory update
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
                     ` (7 preceding siblings ...)
  2024-11-11  6:27   ` [RFC PATCH v2 08/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
@ 2024-11-11  6:27   ` Hajime Tazaki
  2024-11-27 10:36     ` Benjamin Berg
  2024-11-11  6:27   ` [RFC PATCH v2 10/13] x86/um: nommu: signal handling Hajime Tazaki
                     ` (7 subsequent siblings)
  16 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-11  6:27 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

In !MMU mode, the address of the vdso is directly accessible from
userspace.  This commit implements the entry point by pointing
um_vdso_addr at the address of the vdso page.

This commit also adds the memory permission configuration that makes
the vdso page executable.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/x86/um/vdso/um_vdso.c | 20 ++++++++++++++++++++
 arch/x86/um/vdso/vma.c     | 14 ++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/arch/x86/um/vdso/um_vdso.c b/arch/x86/um/vdso/um_vdso.c
index cbae2584124f..eff3e6641a0e 100644
--- a/arch/x86/um/vdso/um_vdso.c
+++ b/arch/x86/um/vdso/um_vdso.c
@@ -23,10 +23,17 @@ int __vdso_clock_gettime(clockid_t clock, struct __kernel_old_timespec *ts)
 {
 	long ret;
 
+#ifdef CONFIG_MMU
 	asm("syscall"
 		: "=a" (ret)
 		: "0" (__NR_clock_gettime), "D" (clock), "S" (ts)
 		: "rcx", "r11", "memory");
+#else
+	asm("call *%1"
+		: "=a" (ret)
+		: "0" ((unsigned long)__NR_clock_gettime), "D" (clock), "S" (ts)
+		: "rcx", "r11", "memory");
+#endif
 
 	return ret;
 }
@@ -37,10 +44,17 @@ int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
 {
 	long ret;
 
+#ifdef CONFIG_MMU
 	asm("syscall"
 		: "=a" (ret)
 		: "0" (__NR_gettimeofday), "D" (tv), "S" (tz)
 		: "rcx", "r11", "memory");
+#else
+	asm("call *%1"
+		: "=a" (ret)
+		: "0" ((unsigned long)__NR_gettimeofday), "D" (tv), "S" (tz)
+		: "rcx", "r11", "memory");
+#endif
 
 	return ret;
 }
@@ -51,9 +65,15 @@ __kernel_old_time_t __vdso_time(__kernel_old_time_t *t)
 {
 	long secs;
 
+#ifdef CONFIG_MMU
 	asm volatile("syscall"
 		: "=a" (secs)
 		: "0" (__NR_time), "D" (t) : "cc", "r11", "cx", "memory");
+#else
+	asm("call *%1"
+		: "=a" (secs)
+		: "0" ((unsigned long)__NR_time), "D" (t) : "cc", "r11", "cx", "memory");
+#endif
 
 	return secs;
 }
diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
index f238f7b33cdd..83c861e2a815 100644
--- a/arch/x86/um/vdso/vma.c
+++ b/arch/x86/um/vdso/vma.c
@@ -9,6 +9,7 @@
 #include <asm/page.h>
 #include <asm/elf.h>
 #include <linux/init.h>
+#include <os.h>
 
 static unsigned int __read_mostly vdso_enabled = 1;
 unsigned long um_vdso_addr;
@@ -24,7 +25,9 @@ static int __init init_vdso(void)
 
 	BUG_ON(vdso_end - vdso_start > PAGE_SIZE);
 
+#ifdef CONFIG_MMU
 	um_vdso_addr = task_size - PAGE_SIZE;
+#endif
 
 	vdsop = kmalloc(sizeof(struct page *), GFP_KERNEL);
 	if (!vdsop)
@@ -40,6 +43,15 @@ static int __init init_vdso(void)
 	copy_page(page_address(um_vdso), vdso_start);
 	*vdsop = um_vdso;
 
+#ifndef CONFIG_MMU
+	/* this is fine with NOMMU as everything is accessible */
+	um_vdso_addr = (unsigned long)page_address(um_vdso);
+	os_protect_memory((void *)um_vdso_addr, vdso_end - vdso_start, 1, 1, 1);
+	pr_debug("vdso_start=%lx um_vdso_addr=%lx pg_um_vdso=%lx\n",
+	       (unsigned long)vdso_start, um_vdso_addr,
+	       (unsigned long)page_address(um_vdso));
+#endif
+
 	return 0;
 
 oom:
@@ -50,6 +62,7 @@ static int __init init_vdso(void)
 }
 subsys_initcall(init_vdso);
 
+#ifdef CONFIG_MMU
 int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 {
 	struct vm_area_struct *vma;
@@ -74,3 +87,4 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 
 	return IS_ERR(vma) ? PTR_ERR(vma) : 0;
 }
+#endif
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [RFC PATCH v2 10/13] x86/um: nommu: signal handling
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
                     ` (8 preceding siblings ...)
  2024-11-11  6:27   ` [RFC PATCH v2 09/13] x86/um/vdso: nommu: vdso memory update Hajime Tazaki
@ 2024-11-11  6:27   ` Hajime Tazaki
  2024-11-28 10:37     ` Benjamin Berg
  2024-11-11  6:27   ` [RFC PATCH v2 11/13] um: change machine name for uname output Hajime Tazaki
                     ` (6 subsequent siblings)
  16 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-11  6:27 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

This commit updates the behavior of signal handling in the !MMU
environment: 1) the stack preparation for signal handlers and
2) the restoration of the stack after the rt_sigreturn(2) syscall.
These are needed because the stack usage on the vfork(2) syscall is
different.

It also adds a follow-up routine for SIGSEGV: as signal delivery runs
in the same stack frame, we have to avoid an endless SIGSEGV.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 arch/um/include/shared/kern_util.h |  3 +++
 arch/um/kernel/trap.c              | 10 ++++++++
 arch/um/os-Linux/signal.c          | 18 ++++++++++++++-
 arch/x86/um/signal.c               | 37 +++++++++++++++++++++++++++++-
 4 files changed, 66 insertions(+), 2 deletions(-)
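
The is_user decision made in sig_handler_common() is a plain address
range check on the faulting instruction pointer.  A minimal sketch,
assuming hypothetical names (struct user_range, ip_in_userspace()); in
the patch the bounds are uml_reserved and high_physmem:

```c
#include <stdbool.h>

/* Hypothetical container for the two bounds used by the check. */
struct user_range {
	unsigned long start;	/* uml_reserved in the patch */
	unsigned long end;	/* high_physmem in the patch */
};

/*
 * Returns true when ip lies strictly between the two bounds, i.e. when
 * the signal was raised while running userspace code.
 */
static bool ip_in_userspace(const struct user_range *r, unsigned long ip)
{
	return ip > r->start && ip < r->end;
}
```

Both comparisons are exclusive, matching the `>` and `<` used in the
patch.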

diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
index f21dc8517538..bcc8d28279ae 100644
--- a/arch/um/include/shared/kern_util.h
+++ b/arch/um/include/shared/kern_util.h
@@ -62,6 +62,9 @@ extern int singlestepping(void);
 extern void segv_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs);
 extern void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs);
 extern void fatal_sigsegv(void) __attribute__ ((noreturn));
+#ifndef CONFIG_MMU
+extern void sigsegv_post_routine(void);
+#endif
 
 void um_idle_sleep(void);
 
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index a7519b3de4bf..b9b54e777894 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -174,6 +174,16 @@ void fatal_sigsegv(void)
 	os_dump_core();
 }
 
+#ifndef CONFIG_MMU
+void sigsegv_post_routine(void)
+{
+	change_sig(SIGIO, 1);
+	change_sig(SIGALRM, 1);
+	change_sig(SIGWINCH, 1);
+	userspace(&current->thread.regs.regs);
+}
+#endif
+
 /**
  * segv_handler() - the SIGSEGV handler
  * @sig:	the signal number
diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
index 52852018a3ad..a06622415d8f 100644
--- a/arch/um/os-Linux/signal.c
+++ b/arch/um/os-Linux/signal.c
@@ -36,7 +36,15 @@ static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc)
 	struct uml_pt_regs r;
 	int save_errno = errno;
 
-	r.is_user = 0;
+#ifndef CONFIG_MMU
+	memset(&r, 0, sizeof(r));
+	/* mark is_user=1 when the IP is from userspace code. */
+	if (mc && (REGS_IP(mc->gregs) > uml_reserved
+		   && REGS_IP(mc->gregs) < high_physmem))
+		r.is_user = 1;
+	else
+#endif
+		r.is_user = 0;
 	if (sig == SIGSEGV) {
 		/* For segfaults, we want the data from the sigcontext. */
 		get_regs_from_mc(&r, mc);
@@ -191,6 +199,7 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
 	ucontext_t *uc = p;
 	mcontext_t *mc = &uc->uc_mcontext;
 	unsigned long pending = 1UL << sig;
+	int is_segv = 0;
 
 	do {
 		int nested, bail;
@@ -214,6 +223,7 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
 
 		while ((sig = ffs(pending)) != 0){
 			sig--;
+			is_segv = (sig == SIGSEGV) ? 1 : 0;
 			pending &= ~(1 << sig);
 			(*handlers[sig])(sig, (struct siginfo *)si, mc);
 		}
@@ -227,6 +237,12 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
 		if (!nested)
 			pending = from_irq_stack(nested);
 	} while (pending);
+
+#ifndef CONFIG_MMU
+	/* if a SIGSEGV was handled, resume userspace; this does not return */
+	if (is_segv)
+		sigsegv_post_routine();
+#endif
 }
 
 void set_handler(int sig)
diff --git a/arch/x86/um/signal.c b/arch/x86/um/signal.c
index 75087e85b6fd..b7365c75a967 100644
--- a/arch/x86/um/signal.c
+++ b/arch/x86/um/signal.c
@@ -371,6 +371,13 @@ int setup_signal_stack_si(unsigned long stack_top, struct ksignal *ksig,
 		round_down(stack_top - sizeof(struct rt_sigframe), 16);
 
 	/* Add required space for math frame */
+#ifndef CONFIG_MMU
+	/*
+	 * the sig_frame on !MMU needs to be aligned for SSE as
+	 * the frame is used as-is.
+	 */
+	math_size = round_down(math_size, 16);
+#endif
 	frame = (struct rt_sigframe __user *)((unsigned long)frame - math_size);
 
 	/* Subtract 128 for a red zone and 8 for proper alignment */
@@ -417,6 +424,18 @@ int setup_signal_stack_si(unsigned long stack_top, struct ksignal *ksig,
 		/* could use a vstub here */
 		return err;
 
+#ifndef CONFIG_MMU
+	/*
+	 * We need to push the handler address on top of the stack:
+	 * __kernel_vsyscall, which runs after this, returns with a ret
+	 * that pops its target from the stack, so push the handler here.
+	 */
+	frame = (struct rt_sigframe __user *) ((unsigned long) frame -
+					       sizeof(unsigned long));
+	err |= __put_user((unsigned long)ksig->ka.sa.sa_handler,
+			  (unsigned long *)frame);
+#endif
+
 	if (err)
 		return err;
 
@@ -442,9 +461,25 @@ SYSCALL_DEFINE0(rt_sigreturn)
 	unsigned long sp = PT_REGS_SP(&current->thread.regs);
 	struct rt_sigframe __user *frame =
 		(struct rt_sigframe __user *)(sp - sizeof(long));
-	struct ucontext __user *uc = &frame->uc;
+	struct ucontext __user *uc;
 	sigset_t set;
 
+#ifndef CONFIG_MMU
+	/*
+	 * we enter here with:
+	 *
+	 * __restore_rt:
+	 *     mov $15, %rax
+	 *     call *%rax (translated from syscall)
+	 *
+	 * (code is from musl libc)
+	 * so, the return address pushed by the call needs to be popped
+	 * off the stack before looking at rt_sigframe.
+	 */
+	frame = (struct rt_sigframe __user *)((unsigned long)frame + sizeof(long));
+#endif
+	uc = &frame->uc;
+
 	if (copy_from_user(&set, &uc->uc_sigmask, sizeof(set)))
 		goto segfault;
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [RFC PATCH v2 11/13] um: change machine name for uname output
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
                     ` (9 preceding siblings ...)
  2024-11-11  6:27   ` [RFC PATCH v2 10/13] x86/um: nommu: signal handling Hajime Tazaki
@ 2024-11-11  6:27   ` Hajime Tazaki
  2024-11-11  6:27   ` [RFC PATCH v2 12/13] um: nommu: add documentation of nommu UML Hajime Tazaki
                     ` (5 subsequent siblings)
  16 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-11  6:27 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

This commit displays the MMU/!MMU mode in the output of uname(2) so
that users can tell which mode of UML is currently running.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 arch/um/Makefile        | 6 ++++++
 arch/um/os-Linux/util.c | 3 ++-
 2 files changed, 8 insertions(+), 1 deletion(-)
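
As an illustration of the resulting string, the following sketch builds
"<UTS_MACHINE>/<host machine>" with a bounded copy.  It is not the
patch's code: compose_machine() is a hypothetical helper, and it assumes
that the output buffer is expected to hold the UTS_MACHINE value before
the "/" separator and the host machine name are appended.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write "<uts_machine>/<host_machine>" into out[len].
 * Returns 0 on success, -1 if the result would not fit.
 */
static int compose_machine(char *out, size_t len,
			   const char *uts_machine,
			   const char *host_machine)
{
	int n = snprintf(out, len, "%s/%s", uts_machine, host_machine);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Using snprintf() instead of strcat() makes the buffer bound explicit,
which matters because the machine field of struct utsname is a
fixed-size array.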

diff --git a/arch/um/Makefile b/arch/um/Makefile
index 1d36a613aad8..e0cfa3a9eae4 100644
--- a/arch/um/Makefile
+++ b/arch/um/Makefile
@@ -151,6 +151,12 @@ export CFLAGS_vmlinux := $(LINK-y) $(LINK_WRAPS) $(LD_FLAGS_CMDLINE) $(CC_FLAGS_
 CLEAN_FILES += linux x.i gmon.out
 MRPROPER_FILES += $(HOST_DIR)/include/generated
 
+ifeq ($(CONFIG_MMU),y)
+UTS_MACHINE := "um"
+else
+UTS_MACHINE := "um\(nommu\)"
+endif
+
 archclean:
 	@find . \( -name '*.bb' -o -name '*.bbg' -o -name '*.da' \
 		-o -name '*.gcov' \) -type f -print | xargs rm -f
diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c
index 4193e04d7e4a..20421e9f0f77 100644
--- a/arch/um/os-Linux/util.c
+++ b/arch/um/os-Linux/util.c
@@ -65,7 +65,8 @@ void setup_machinename(char *machine_out)
 	}
 # endif
 #endif
-	strcpy(machine_out, host.machine);
+	strcat(machine_out, "/");
+	strcat(machine_out, host.machine);
 }
 
 void setup_hostinfo(char *buf, int len)
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [RFC PATCH v2 12/13] um: nommu: add documentation of nommu UML
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
                     ` (10 preceding siblings ...)
  2024-11-11  6:27   ` [RFC PATCH v2 11/13] um: change machine name for uname output Hajime Tazaki
@ 2024-11-11  6:27   ` Hajime Tazaki
  2024-11-11  6:27   ` [RFC PATCH v2 13/13] um: nommu: plug nommu code into build system Hajime Tazaki
                     ` (4 subsequent siblings)
  16 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-11  6:27 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

This commit adds initial documentation for the !MMU mode of UML.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 Documentation/virt/uml/nommu-uml.rst | 221 +++++++++++++++++++++++++++
 1 file changed, 221 insertions(+)
 create mode 100644 Documentation/virt/uml/nommu-uml.rst
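
The syscall path described in the "How it works" section of the document
below can be modeled with a small simulation.  This is only a sketch of
the nop-sled idea under simplifying assumptions: the byte values, the
512-slot size, and the names trampoline_init(), trampoline_call() and
echo_hook() are illustrative, and no real code is placed at address 0x0.

```c
#include <stddef.h>
#include <string.h>

#define NR_SLOTS 512	/* "around 512" in the text; assumed here */
#define OP_NOP   0x90	/* x86 one-byte nop */
#define OP_HOOK  0xff	/* placeholder marking the jump to the hook */

typedef long (*hook_fn)(long nr);

/* Model of the memory at addresses 0..NR_SLOTS. */
struct trampoline {
	unsigned char code[NR_SLOTS + 1];
};

/* Fill the sled with nops and terminate it with the hook marker. */
static void trampoline_init(struct trampoline *t)
{
	memset(t->code, OP_NOP, NR_SLOTS);
	t->code[NR_SLOTS] = OP_HOOK;
}

/* Hypothetical hook: stands in for the syscall entry point. */
static long echo_hook(long nr)
{
	return nr + 1000;
}

/*
 * Model of `call *%rax` with %rax == nr: start executing at byte nr,
 * slide over the nops, then enter the hook.  %rax still holds nr when
 * the hook runs, which is how the hook learns the syscall number.
 * Returns -1 if nr is outside the sled.
 */
static long trampoline_call(const struct trampoline *t, long nr,
			    hook_fn hook)
{
	size_t pc;

	if (nr < 0 || nr >= NR_SLOTS)
		return -1;

	pc = (size_t)nr;
	while (t->code[pc] == OP_NOP)
		pc++;

	/* pc now points at the hook marker */
	return hook(nr);
}
```

Every syscall number lands somewhere in the sled and reaches the same
hook, which is why a single entry point is enough.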

diff --git a/Documentation/virt/uml/nommu-uml.rst b/Documentation/virt/uml/nommu-uml.rst
new file mode 100644
index 000000000000..9172918be137
--- /dev/null
+++ b/Documentation/virt/uml/nommu-uml.rst
@@ -0,0 +1,221 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+UML has been built with CONFIG_MMU since day 0.  This patchset
+introduces a nommu mode for UML, taking a different approach from
+what the Linux Kernel Library (LKL) tried.
+
+.. contents:: :local:
+
+What is it for?
+================
+
+- Alleviate the overhead of syscall hooks implemented with ptrace(2)
+- Exercise nommu code on UML (and with KUnit)
+- Reduce dependency on host facilities
+
+
+How does it work?
+=================
+
+To illustrate how this feature works, the following shows how syscalls
+are called under the nommu UML environment.
+
+- the kernel boots and sets up the zpoline trampoline code (detailed later) at address 0x0
+- (userspace starts)
+- userspace calls the vfork/execve syscalls
+- during execve, more specifically in load_elf_fdpic_binary(), the
+  kernel replaces `syscall/sysenter` instructions with `call *%rax`,
+  which usually targets an address between 0 and NR_syscalls (around
+  512), where the trampoline code was installed during startup.
+- when userspace issues a syscall, it jumps to `*%rax`, slides through
+  the `nop` instructions, and jumps to the hooked function,
+  `__kernel_vsyscall`, which is the syscall entry point under the nommu
+  UML environment.
+- the handler in sys_call_table[] is called, following the usual UML
+  syscall path.
+- execution returns to userspace
+
+
+What are the differences from MMU-full UML?
+============================================
+
+The current nommu implementation adds three features which
+MMU-full UML doesn't have:
+
+- the kernel address space is directly accessible from userspace
+  - so, uaccess() always returns 1
+  - generic implementations of memcpy/strcpy/futex are also used
+- an alternate syscall entry point without ptrace
+- translation of syscall/sysenter instructions into trampoline code
+  and syscall hooks
+
+These modifications allow us to use unmodified userspace
+binaries with nommu UML.
+
+
+History
+=======
+
+This feature was originally introduced by Ricardo Koller at Open
+Source Summit NA 2020; it was then integrated with the syscall
+translation functionality, along with a cleanup of the original code.
+
+Building and running
+====================
+
+```
+% make ARCH=um x86_64_nommu_defconfig
+% make ARCH=um
+```
+
+will build UML with CONFIG_MMU=n applied.
+
+KUnit tests can be run with the following command:
+
+```
+% ./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_MMU=n
+```
+
+To run a typical Linux distribution, we need a nommu-aware userspace.
+We can use a stock version of Alpine Linux with nommu-built versions
+of busybox and musl-libc.
+
+
+Preparing root filesystem
+=========================
+
+nommu UML requires a specific standard library which is aware of
+nommu kernels.  We have tested custom-built musl-libc and busybox,
+both of which have built-in support for nommu kernels.
+
+There are no Linux distributions available for nommu on the x86_64
+architecture, so we need to prepare our own image for the root
+filesystem.  We use Alpine Linux as the base distribution and replace
+busybox and musl-libc on top of it.  The following steps prepare the
+filesystem for a quick start.
+
+```
+     container_id=$(docker create ghcr.io/thehajime/alpine:3.20.3-um-nommu)
+     docker start $container_id
+     docker wait $container_id
+     docker export $container_id > alpine.tar
+     docker rm $container_id
+
+     mnt=$(mktemp -d)
+     dd if=/dev/zero of=alpine.ext4 bs=1 count=0 seek=1G
+     sudo chmod og+wr "alpine.ext4"
+     yes 2>/dev/null | mkfs.ext4 "alpine.ext4" || true
+     sudo mount "alpine.ext4" $mnt
+     sudo tar -xf alpine.tar -C $mnt
+     sudo umount $mnt
+```
+
+This will create a file image, `alpine.ext4`, which contains the
+nommu builds of busybox and musl on the Alpine Linux root filesystem.
+The file can be passed via the `ubd0=` argument on the UML command line.
+
+```
+  ./vmlinux eth0=tuntap,tap100,0e:fd:0:0:0:1,172.17.0.1 ubd0=./alpine.ext4 rw mem=1024m loglevel=8 init=/sbin/init
+```
+
+We plan to upstream apk packages for busybox and musl so that we can
+follow the proper procedure to set up the root filesystem.
+
+
+Quick start with docker
+=======================
+
+There is a docker image that lets you get started quickly in a single step.
+
+```
+  docker run -it -v /dev/shm:/dev/shm --rm ghcr.io/thehajime/alpine:3.20.3-um-nommu
+```
+
+This will launch a UML instance with a pre-configured root filesystem.
+
+Benchmark
+=========
+
+The following shows an example of performance measurements conducted
+with lmbench and a (self-crafted) getpid benchmark (with the v6.12-rc2
+uml/next tree).
+
+### lmbench (usec)
+
+||native|um|um-nommu|
+|--|--|--|--|
+|select-10    |0.5644|31.0917|0.2743|
+|select-100   |2.3869|31.4651|1.1472|
+|select-1000  |20.4004|36.4966|9.7533|
+|syscall      |0.1733|25.9904|0.1053|
+|read         |0.3438|27.4873|0.1451|
+|write        |0.2862|25.8794|0.1361|
+|stat         |1.9250|37.5072|0.4532|
+|open/close   |3.8961|65.1736|0.7665|
+|fork+sh      |1173.8889|5404.5000|20577.0000|
+|fork+execve  |535.2105|2179.2000|4716.3333|
+
+### do_getpid bench (nsec)
+
+||native|um|um-nommu|
+|--|--|--|--|
+|getpid | 172 | 25602 | 103|
+
+
+Limitations
+===========
+
+generic nommu limitations
+-------------------------
+Since this port is a nommu kernel, the implementation inherits the
+characteristics of other nommu kernels (riscv, arm, etc.), as
+described below.
+
+- vfork(2) should be used instead of fork(2)
+- the ELF loader only loads PIE (position independent executable) binaries
+- all processes share a single address space
+- mmap(2) offers only a subset of its functionality (e.g., MAP_FIXED
+  is unsupported)
+
+Thus, the options for userspace programs are limited.  We have tested
+Alpine Linux with musl-libc, which supports nommu kernels.
+
+access to mmap_min_addr
+-----------------------
+As the syscall translation mechanism relies on the ability to
+read and write memory at address zero (0x0), we need to configure the
+host kernel with the following command:
+
+```
+% sh -c "echo 0 > /proc/sys/vm/mmap_min_addr"
+```
+
+supported architecture
+----------------------
+The current implementation of nommu UML only works on the x86_64
+SUBARCH.  We have not tested it in a 32-bit environment.
+
+target of syscall translation
+-----------------------------
+For the moment, the syscall translation only applies to the executable
+and the interpreter of ELF binaries processed by the execve(2)
+syscall: other libraries, such as linked libraries and dlopen-ed
+ones, aren't translated; we may be able to trigger the translation
+via LD_PRELOAD.  Code generated by a JIT compiler is produced after
+execve, and thus is not currently translated either.
+
+Note that with musl-libc in Alpine Linux, which we have tested, most
+syscalls are implemented in the interpreter file
+(ld-musl-x86_64.so), so syscall/sysenter instructions issued from
+linked/loaded libraries might be rare.  They are still possible,
+though, in which case a workaround with LD_PRELOAD is effective.
+
+
+Further reading about NOMMU UML
+================================
+
+- NOMMU UML (original code by Ricardo Koller)
+https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf
+
+- zpoline: syscall translation mechanism
+https://www.usenix.org/conference/atc23/presentation/yasukata
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [RFC PATCH v2 13/13] um: nommu: plug nommu code into build system
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
                     ` (11 preceding siblings ...)
  2024-11-11  6:27   ` [RFC PATCH v2 12/13] um: nommu: add documentation of nommu UML Hajime Tazaki
@ 2024-11-11  6:27   ` Hajime Tazaki
  2024-11-15 10:12   ` [RFC PATCH v2 00/13] nommu UML Johannes Berg
                     ` (3 subsequent siblings)
  16 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-11  6:27 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

Add nommu kernel for um build.  defconfig is also provided.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/Kconfig                        | 14 +++++-
 arch/um/configs/x86_64_nommu_defconfig | 64 ++++++++++++++++++++++++++
 arch/x86/um/Makefile                   | 18 ++++++++
 3 files changed, 94 insertions(+), 2 deletions(-)
 create mode 100644 arch/um/configs/x86_64_nommu_defconfig

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index a9876bdb5bf9..81897c496711 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -30,13 +30,16 @@ config UML
 	select ARCH_SUPPORTS_LTO_CLANG_THIN
 	select TRACE_IRQFLAGS_SUPPORT
 	select TTY # Needed for line.c
-	select HAVE_ARCH_VMAP_STACK
+	select HAVE_ARCH_VMAP_STACK if MMU
 	select HAVE_RUST
 	select ARCH_HAS_UBSAN
 	select HAVE_ARCH_TRACEHOOK
+	select UACCESS_MEMCPY if !MMU
+	select GENERIC_STRNLEN_USER if !MMU
+	select GENERIC_STRNCPY_FROM_USER if !MMU
 
 config MMU
-	bool
+	bool "MMU-based Paged Memory Management Support" if 64BIT
 	default y
 
 config UML_DMA_EMULATION
@@ -189,8 +192,15 @@ config MAGIC_SYSRQ
 	  The keys are documented in <file:Documentation/admin-guide/sysrq.rst>. Don't say Y
 	  unless you really know what this hack does.
 
+config ARCH_FORCE_MAX_ORDER
+	int "Order of maximal physically contiguous allocations" if EXPERT
+	default "10" if MMU
+	default "16" if !MMU
+
 config KERNEL_STACK_ORDER
 	int "Kernel stack size order"
+	default 3 if !MMU
+	range 3 10 if !MMU
 	default 2 if 64BIT
 	range 2 10 if 64BIT
 	default 1 if !64BIT
diff --git a/arch/um/configs/x86_64_nommu_defconfig b/arch/um/configs/x86_64_nommu_defconfig
new file mode 100644
index 000000000000..c2e0fb546987
--- /dev/null
+++ b/arch/um/configs/x86_64_nommu_defconfig
@@ -0,0 +1,64 @@
+CONFIG_SYSVIPC=y
+CONFIG_POSIX_MQUEUE=y
+CONFIG_NO_HZ=y
+CONFIG_HIGH_RES_TIMERS=y
+CONFIG_BSD_PROCESS_ACCT=y
+CONFIG_IKCONFIG=y
+CONFIG_IKCONFIG_PROC=y
+CONFIG_LOG_BUF_SHIFT=14
+CONFIG_CGROUPS=y
+CONFIG_BLK_CGROUP=y
+CONFIG_CGROUP_SCHED=y
+CONFIG_CGROUP_DEVICE=y
+CONFIG_CGROUP_CPUACCT=y
+# CONFIG_PID_NS is not set
+CONFIG_CC_OPTIMIZE_FOR_SIZE=y
+# CONFIG_MMU is not set
+CONFIG_HOSTFS=y
+CONFIG_MAGIC_SYSRQ=y
+CONFIG_SSL=y
+CONFIG_NULL_CHAN=y
+CONFIG_PORT_CHAN=y
+CONFIG_PTY_CHAN=y
+CONFIG_TTY_CHAN=y
+CONFIG_CON_CHAN="pts"
+CONFIG_SSL_CHAN="pts"
+CONFIG_UML_SOUND=m
+CONFIG_UML_NET=y
+CONFIG_UML_NET_ETHERTAP=y
+CONFIG_UML_NET_TUNTAP=y
+CONFIG_UML_NET_SLIP=y
+CONFIG_UML_NET_DAEMON=y
+CONFIG_UML_NET_MCAST=y
+CONFIG_UML_NET_SLIRP=y
+CONFIG_MODULES=y
+CONFIG_MODULE_UNLOAD=y
+CONFIG_IOSCHED_BFQ=m
+CONFIG_BINFMT_MISC=m
+CONFIG_NET=y
+CONFIG_PACKET=y
+CONFIG_UNIX=y
+CONFIG_INET=y
+CONFIG_DEVTMPFS=y
+CONFIG_DEVTMPFS_MOUNT=y
+CONFIG_BLK_DEV_UBD=y
+CONFIG_BLK_DEV_LOOP=m
+CONFIG_BLK_DEV_NBD=m
+CONFIG_DUMMY=m
+CONFIG_TUN=m
+CONFIG_PPP=m
+CONFIG_SLIP=m
+CONFIG_LEGACY_PTY_COUNT=32
+CONFIG_UML_RANDOM=y
+CONFIG_SOUND=m
+CONFIG_EXT4_FS=y
+CONFIG_REISERFS_FS=y
+CONFIG_QUOTA=y
+CONFIG_AUTOFS_FS=m
+CONFIG_ISO9660_FS=m
+CONFIG_JOLIET=y
+CONFIG_NLS=y
+CONFIG_DEBUG_KERNEL=y
+CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
+CONFIG_FRAME_WARN=1024
+CONFIG_IPV6=y
diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile
index b42c31cd2390..0513c4ad0130 100644
--- a/arch/x86/um/Makefile
+++ b/arch/x86/um/Makefile
@@ -32,6 +32,24 @@ obj-y += syscalls_64.o vdso/
 subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o \
 	../lib/memmove_64.o ../lib/memset_64.o
 
+
+# used by zpoline.c to translate syscall/sysenter instructions
+# note: only in x86_64 w/ !CONFIG_MMU
+ifneq ($(CONFIG_MMU),y)
+inat_tables_script = $(srctree)/arch/x86/tools/gen-insn-attr-x86.awk
+inat_tables_maps = $(srctree)/arch/x86/lib/x86-opcode-map.txt
+quiet_cmd_inat_tables = GEN     $@
+      cmd_inat_tables = $(AWK) -f $(inat_tables_script) $(inat_tables_maps) > $@
+$(obj)/inat-tables.c: $(inat_tables_script) $(inat_tables_maps)
+	$(call cmd,inat_tables)
+targets += inat-tables.c
+$(obj)/../lib/inat.o: $(obj)/inat-tables.c
+subarch-y += ../lib/insn.o ../lib/inat.o
+
+
+obj-y += do_syscall_$(BITS).o entry_$(BITS).o zpoline.o
+endif
+
 endif
 
 subarch-$(CONFIG_MODULES) += ../kernel/module.o
-- 
2.43.0




* Re: [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic
  2024-11-11  6:27   ` [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic Hajime Tazaki
@ 2024-11-12 12:48     ` Geert Uytterhoeven
  2024-11-12 22:07       ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Geert Uytterhoeven @ 2024-11-12 12:48 UTC (permalink / raw)
  To: Hajime Tazaki
  Cc: linux-um, ricarkol, Liam.Howlett, Eric Biederman, Kees Cook,
	Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
	linux-fsdevel

Hi Tazaki-san,

On Mon, Nov 11, 2024 at 7:28 AM Hajime Tazaki <thehajime@gmail.com> wrote:
> As UML supports CONFIG_MMU=n case, it has to use an alternate ELF
> loader, FDPIC ELF loader.  In this commit, we added necessary
> definitions in the arch, as UML has not been used so far.  It also
> updates Kconfig file to use BINFMT_ELF_FDPIC under !MMU environment.
>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Kees Cook <kees@kernel.org>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: linux-mm@kvack.org
> Cc: linux-fsdevel@vger.kernel.org
> Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
> Signed-off-by: Ricardo Koller <ricarkol@google.com>

Thanks for your patch!

> --- a/fs/Kconfig.binfmt
> +++ b/fs/Kconfig.binfmt
> @@ -58,7 +58,7 @@ config ARCH_USE_GNU_PROPERTY
>  config BINFMT_ELF_FDPIC
>         bool "Kernel support for FDPIC ELF binaries"
>         default y if !BINFMT_ELF
> -       depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
> +       depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)

s/UML/X86/?

>         select ELFCORE
>         help
>           ELF FDPIC binaries are based on ELF, but allow the individual load

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds



* Re: [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic
  2024-11-12 12:48     ` Geert Uytterhoeven
@ 2024-11-12 22:07       ` Hajime Tazaki
  2024-11-13  8:19         ` Geert Uytterhoeven
  0 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-12 22:07 UTC (permalink / raw)
  To: geert
  Cc: linux-um, ricarkol, Liam.Howlett, ebiederm, kees, viro, brauner,
	jack, linux-mm, linux-fsdevel


Hello Geert,

thank you for the message.

On Tue, 12 Nov 2024 21:48:28 +0900,
Geert Uytterhoeven wrote:
>
> On Mon, Nov 11, 2024 at 7:28 AM Hajime Tazaki <thehajime@gmail.com> wrote:
> > As UML supports CONFIG_MMU=n case, it has to use an alternate ELF
> > loader, FDPIC ELF loader.  In this commit, we added necessary
> > definitions in the arch, as UML has not been used so far.  It also
> > updates Kconfig file to use BINFMT_ELF_FDPIC under !MMU environment.
> >
> > Cc: Eric Biederman <ebiederm@xmission.com>
> > Cc: Kees Cook <kees@kernel.org>
> > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: linux-mm@kvack.org
> > Cc: linux-fsdevel@vger.kernel.org
> > Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
> > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> 
> Thanks for your patch!
> 
> > --- a/fs/Kconfig.binfmt
> > +++ b/fs/Kconfig.binfmt
> > @@ -58,7 +58,7 @@ config ARCH_USE_GNU_PROPERTY
> >  config BINFMT_ELF_FDPIC
> >         bool "Kernel support for FDPIC ELF binaries"
> >         default y if !BINFMT_ELF
> > -       depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
> > +       depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
> 
> s/UML/X86/?

I guess the fdpic loader could be used on X86, but this patchset only
allows UML to select it.  I intended to add UML to the nommu
family.

-- Hajime



* Re: [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic
  2024-11-12 22:07       ` Hajime Tazaki
@ 2024-11-13  8:19         ` Geert Uytterhoeven
  2024-11-13  8:36           ` Johannes Berg
  0 siblings, 1 reply; 128+ messages in thread
From: Geert Uytterhoeven @ 2024-11-13  8:19 UTC (permalink / raw)
  To: Hajime Tazaki
  Cc: linux-um, ricarkol, Liam.Howlett, ebiederm, kees, viro, brauner,
	jack, linux-mm, linux-fsdevel

Hi Tazaki-san,

On Tue, Nov 12, 2024 at 11:07 PM Hajime Tazaki <thehajime@gmail.com> wrote:
> On Tue, 12 Nov 2024 21:48:28 +0900,
> > On Mon, Nov 11, 2024 at 7:28 AM Hajime Tazaki <thehajime@gmail.com> wrote:
> > > As UML supports CONFIG_MMU=n case, it has to use an alternate ELF
> > > loader, FDPIC ELF loader.  In this commit, we added necessary
> > > definitions in the arch, as UML has not been used so far.  It also
> > > updates Kconfig file to use BINFMT_ELF_FDPIC under !MMU environment.
> > >
> > > Cc: Eric Biederman <ebiederm@xmission.com>
> > > Cc: Kees Cook <kees@kernel.org>
> > > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: linux-mm@kvack.org
> > > Cc: linux-fsdevel@vger.kernel.org
> > > Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
> > > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> >
> > Thanks for your patch!
> >
> > > --- a/fs/Kconfig.binfmt
> > > +++ b/fs/Kconfig.binfmt
> > > @@ -58,7 +58,7 @@ config ARCH_USE_GNU_PROPERTY
> > >  config BINFMT_ELF_FDPIC
> > >         bool "Kernel support for FDPIC ELF binaries"
> > >         default y if !BINFMT_ELF
> > > -       depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
> > > +       depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
> >
> > s/UML/X86/?
>
> I guess the fdpic loader can be used to X86, but this patchset only
> adds UML to be able to select it.  I intended to add UML into nommu
> family.

While currently x86-nommu is supported for UML only, this is really
x86-specific. I still hope UML will get support for other architectures
one day, at which point a dependency on UML here will become wrong...

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds



* Re: [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic
  2024-11-13  8:19         ` Geert Uytterhoeven
@ 2024-11-13  8:36           ` Johannes Berg
  2024-11-13  8:36             ` Johannes Berg
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-11-13  8:36 UTC (permalink / raw)
  To: Geert Uytterhoeven, Hajime Tazaki
  Cc: linux-um, ricarkol, Liam.Howlett, ebiederm, kees, viro, brauner,
	jack, linux-mm, linux-fsdevel

On Wed, 2024-11-13 at 09:19 +0100, Geert Uytterhoeven wrote:
> 
> > > > -       depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
> > > > +       depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
> > > 
> > > s/UML/X86/?
> > 
> > I guess the fdpic loader can be used to X86, but this patchset only
> > adds UML to be able to select it.  I intended to add UML into nommu
> > family.
> 
> While currently x86-nommu is supported for UML only, this is really
> x86-specific. I still hope UML will get support for other architectures
> one day, at which point a dependency on UML here will become wrong...
> 

X86 isn't set for UML, X86_32 and X86_64 are though.

Given that the no-MMU UM support even is 64-bit only, that probably
should then really be (UML && X86_64).

But it already has !MMU, so can't be selected otherwise, and it seems
that non-X86 UML 

johannes



* Re: [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic
  2024-11-13  8:36           ` Johannes Berg
@ 2024-11-13  8:36             ` Johannes Berg
  2024-11-13 10:27               ` Geert Uytterhoeven
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-11-13  8:36 UTC (permalink / raw)
  To: Geert Uytterhoeven, Hajime Tazaki
  Cc: linux-um, ricarkol, Liam.Howlett, ebiederm, kees, viro, brauner,
	jack, linux-mm, linux-fsdevel

(sorry, fat-fingered that)

On Wed, 2024-11-13 at 09:36 +0100, Johannes Berg wrote:
> On Wed, 2024-11-13 at 09:19 +0100, Geert Uytterhoeven wrote:
> > 
> > > > > -       depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
> > > > > +       depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
> > > > 
> > > > s/UML/X86/?
> > > 
> > > I guess the fdpic loader can be used to X86, but this patchset only
> > > adds UML to be able to select it.  I intended to add UML into nommu
> > > family.
> > 
> > While currently x86-nommu is supported for UML only, this is really
> > x86-specific. I still hope UML will get support for other architectures
> > one day, at which point a dependency on UML here will become wrong...
> > 
> 
> X86 isn't set for UML, X86_32 and X86_64 are though.
> 
> Given that the no-MMU UM support even is 64-bit only, that probably
> should then really be (UML && X86_64).
> 
> But it already has !MMU, so can't be selected otherwise, and it seems
> that non-X86 UML 

... would require far more changes in all kinds of places, so not sure
I'd be too concerned about it here.

johannes



* Re: [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic
  2024-11-13  8:36             ` Johannes Berg
@ 2024-11-13 10:27               ` Geert Uytterhoeven
  2024-11-13 13:17                 ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Geert Uytterhoeven @ 2024-11-13 10:27 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Hajime Tazaki, linux-um, ricarkol, Liam.Howlett, ebiederm, kees,
	viro, brauner, jack, linux-mm, linux-fsdevel

Hi Johannes,

On Wed, Nov 13, 2024 at 9:37 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> On Wed, 2024-11-13 at 09:36 +0100, Johannes Berg wrote:
> > On Wed, 2024-11-13 at 09:19 +0100, Geert Uytterhoeven wrote:
> > >
> > > > > > -       depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
> > > > > > +       depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
> > > > >
> > > > > s/UML/X86/?
> > > >
> > > > I guess the fdpic loader can be used to X86, but this patchset only
> > > > adds UML to be able to select it.  I intended to add UML into nommu
> > > > family.
> > >
> > > While currently x86-nommu is supported for UML only, this is really
> > > x86-specific. I still hope UML will get support for other architectures
> > > one day, at which point a dependency on UML here will become wrong...
> > >
> >
> > X86 isn't set for UML, X86_32 and X86_64 are though.
> >
> > Given that the no-MMU UM support even is 64-bit only, that probably
> > should then really be (UML && X86_64).
> >
> > But it already has !MMU, so can't be selected otherwise, and it seems
> > that non-X86 UML
>
> ... would require far more changes in all kinds of places, so not sure
> I'd be too concerned about it here.

OK, up to you...

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds



* Re: [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic
  2024-11-13 10:27               ` Geert Uytterhoeven
@ 2024-11-13 13:17                 ` Hajime Tazaki
  2024-11-13 13:55                   ` Geert Uytterhoeven
  0 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-13 13:17 UTC (permalink / raw)
  To: geert
  Cc: johannes, linux-um, ricarkol, Liam.Howlett, ebiederm, kees, viro,
	brauner, jack, linux-mm, linux-fsdevel


Hello,

thanks for the inputs Geert, Johannes,

On Wed, 13 Nov 2024 19:27:08 +0900,
Geert Uytterhoeven wrote:
> 
> Hi Johannes,
> 
> On Wed, Nov 13, 2024 at 9:37 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > On Wed, 2024-11-13 at 09:36 +0100, Johannes Berg wrote:
> > > On Wed, 2024-11-13 at 09:19 +0100, Geert Uytterhoeven wrote:
> > > >
> > > > > > > -       depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
> > > > > > > +       depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
> > > > > >
> > > > > > s/UML/X86/?
> > > > >
> > > > > I guess the fdpic loader can be used to X86, but this patchset only
> > > > > adds UML to be able to select it.  I intended to add UML into nommu
> > > > > family.
> > > >
> > > > While currently x86-nommu is supported for UML only, this is really
> > > > x86-specific. I still hope UML will get support for other architectures
> > > > one day, at which point a dependency on UML here will become wrong...
> > > >
> > >
> > > X86 isn't set for UML, X86_32 and X86_64 are though.
> > >
> > > Given that the no-MMU UM support even is 64-bit only, that probably
> > > should then really be (UML && X86_64).
> > >
> > > But it already has !MMU, so can't be selected otherwise, and it seems
> > > that non-X86 UML
> >
> > ... would require far more changes in all kinds of places, so not sure
> > I'd be too concerned about it here.
> 
> OK, up to you...

Indeed, this particular patch [02/13] intends to support the fdpic
loader under the following conditions: 1) x86_64 ELF binaries (w/ PIE),
2) on UML, and 3) with !MMU configured.  Given that situation, the
strict check should be like:

   depends on ARM || ((M68K || RISCV || SUPERH || (UML && X86_64) || XTENSA) && !MMU)

(as Johannes mentioned).

On the other hand, the fdpic loader works (afaik) in an MMU environment, so

   depends on ARM || (UML && X86_64) || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)

should also work, but this might be too broad for this patchset (and
I am not sure whether it enables a new use case).
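
To make the difference between the two `depends on` expressions
concrete, here is a small sketch in Python that evaluates both for every
architecture/MMU combination.  It is only an illustration: the label
`UML_X86_64` is shorthand for (UML && X86_64) and is not a real Kconfig
symbol, and the function names are made up for this example.

```python
# Architectures that only allow BINFMT_ELF_FDPIC when !MMU.
NOMMU_ONLY = {"M68K", "RISCV", "SUPERH", "XTENSA"}
# "UML_X86_64" is a hypothetical label standing for (UML && X86_64).
ARCHES = ["ARM", "UML_X86_64", *sorted(NOMMU_ONLY)]


def strict(arch: str, mmu: bool) -> bool:
    # ARM || ((M68K || RISCV || SUPERH || (UML && X86_64) || XTENSA) && !MMU)
    return arch == "ARM" or (
        (arch in NOMMU_ONLY or arch == "UML_X86_64") and not mmu
    )


def broad(arch: str, mmu: bool) -> bool:
    # ARM || (UML && X86_64) || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
    return arch == "ARM" or arch == "UML_X86_64" or (
        arch in NOMMU_ONLY and not mmu
    )


def differing_configs() -> list[tuple[str, bool]]:
    """List the (arch, mmu) pairs where the two expressions disagree."""
    return [
        (arch, mmu)
        for arch in ARCHES
        for mmu in (False, True)
        if strict(arch, mmu) != broad(arch, mmu)
    ]


if __name__ == "__main__":
    print(differing_configs())
```

The two expressions disagree only for UML on x86_64 with MMU enabled,
so that is the one configuration the broader variant would add.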

anyway, thank you for the comment.
# I really wanted to have comments from nommu folks.

-- Hajime



* Re: [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic
  2024-11-13 13:17                 ` Hajime Tazaki
@ 2024-11-13 13:55                   ` Geert Uytterhoeven
  2024-11-13 23:32                     ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Geert Uytterhoeven @ 2024-11-13 13:55 UTC (permalink / raw)
  To: Hajime Tazaki
  Cc: johannes, linux-um, ricarkol, Liam.Howlett, ebiederm, kees, viro,
	brauner, jack, linux-mm, linux-fsdevel, Greg Ungerer, Rich Felker

Hi Tazaki-san,

On Wed, Nov 13, 2024 at 2:17 PM Hajime Tazaki <thehajime@gmail.com> wrote:
> On Wed, 13 Nov 2024 19:27:08 +0900,
> Geert Uytterhoeven wrote:
> > On Wed, Nov 13, 2024 at 9:37 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > > On Wed, 2024-11-13 at 09:36 +0100, Johannes Berg wrote:
> > > > On Wed, 2024-11-13 at 09:19 +0100, Geert Uytterhoeven wrote:
> > > > >
> > > > > > > > -       depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
> > > > > > > > +       depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
> > > > > > >
> > > > > > > s/UML/X86/?
> > > > > >
> > > > > > I guess the fdpic loader can be used to X86, but this patchset only
> > > > > > adds UML to be able to select it.  I intended to add UML into nommu
> > > > > > family.
> > > > >
> > > > > While currently x86-nommu is supported for UML only, this is really
> > > > > x86-specific. I still hope UML will get support for other architectures
> > > > > one day, at which point a dependency on UML here will become wrong...
> > > > >
> > > >
> > > > X86 isn't set for UML, X86_32 and X86_64 are though.
> > > >
> > > > Given that the no-MMU UM support even is 64-bit only, that probably
> > > > should then really be (UML && X86_64).
> > > >
> > > > But it already has !MMU, so can't be selected otherwise, and it seems
> > > > that non-X86 UML
> > >
> > > ... would require far more changes in all kinds of places, so not sure
> > > I'd be too concerned about it here.
> >
> > OK, up to you...
>
> Indeed, this particular patch [02/13] intends to support the fdpic
> loader under the condition 1) x86_64 ELF binaries (w/ PIE), 2) on UML,
> 3) and with) !MMU configured.  Given that situation, the strict check
> should be like:
>
>    depends on ARM || ((M68K || RISCV || SUPERH || (UML && X86_64) || XTENSA) && !MMU)
>
> (as Johannes mentioned).
>
> on the other hand, the fdpic loader works (afaik) on MMU environment so,
>
>    depends on ARM || (UML && X86_64) || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
>
> should also works, but this might be too broad for this patchset (and
> not sure if this makes a new use case).

AFAIK that depends on the architecture's MMU context structure, cfr.
the comment in commit 782f4c5c44e7d99d ("m68knommu: allow elf_fdpic
loader to be selected"), which restricts it to nommu on m68k.  If it
does work on X86_64, you can drop the dependency on UML, and we're
(almost) back to my initial comment ;-)

> anyway, thank you for the comment.
> # I really wanted to have comments from nommu folks.

I've added some in CC...

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds



* Re: [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic
  2024-11-13 13:55                   ` Geert Uytterhoeven
@ 2024-11-13 23:32                     ` Hajime Tazaki
  2024-11-14  1:40                       ` Greg Ungerer
  0 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-13 23:32 UTC (permalink / raw)
  To: geert
  Cc: johannes, linux-um, ricarkol, Liam.Howlett, ebiederm, kees, viro,
	brauner, jack, linux-mm, linux-fsdevel, gerg, dalias


On Wed, 13 Nov 2024 22:55:02 +0900,
Geert Uytterhoeven wrote:
> On Wed, Nov 13, 2024 at 2:17 PM Hajime Tazaki <thehajime@gmail.com> wrote:
> > On Wed, 13 Nov 2024 19:27:08 +0900,
> > Geert Uytterhoeven wrote:
> > > On Wed, Nov 13, 2024 at 9:37 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > > > On Wed, 2024-11-13 at 09:36 +0100, Johannes Berg wrote:
> > > > > On Wed, 2024-11-13 at 09:19 +0100, Geert Uytterhoeven wrote:
> > > > > >
> > > > > > > > > -       depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
> > > > > > > > > +       depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
> > > > > > > >
> > > > > > > > s/UML/X86/?
> > > > > > >
> > > > > > > I guess the fdpic loader can be used to X86, but this patchset only
> > > > > > > adds UML to be able to select it.  I intended to add UML into nommu
> > > > > > > family.
> > > > > >
> > > > > > While currently x86-nommu is supported for UML only, this is really
> > > > > > x86-specific. I still hope UML will get support for other architectures
> > > > > > one day, at which point a dependency on UML here will become wrong...
> > > > > >
> > > > >
> > > > > X86 isn't set for UML, X86_32 and X86_64 are though.
> > > > >
> > > > > Given that the no-MMU UM support even is 64-bit only, that probably
> > > > > should then really be (UML && X86_64).
> > > > >
> > > > > But it already has !MMU, so can't be selected otherwise, and it seems
> > > > > that non-X86 UML
> > > >
> > > > ... would require far more changes in all kinds of places, so not sure
> > > > I'd be too concerned about it here.
> > >
> > > OK, up to you...
> >
> > Indeed, this particular patch [02/13] intends to support the fdpic
> > loader under the condition 1) x86_64 ELF binaries (w/ PIE), 2) on UML,
> > 3) and with) !MMU configured.  Given that situation, the strict check
> > should be like:
> >
> >    depends on ARM || ((M68K || RISCV || SUPERH || (UML && X86_64) || XTENSA) && !MMU)
> >
> > (as Johannes mentioned).
> >
> > on the other hand, the fdpic loader works (afaik) on MMU environment so,
> >
> >    depends on ARM || (UML && X86_64) || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
> >
> > should also works, but this might be too broad for this patchset (and
> > not sure if this makes a new use case).
> 
> AFAIK that depends on the architecture's MMU context structure, cfr.
> the comment in commit 782f4c5c44e7d99d ("m68knommu: allow elf_fdpic
> loader to be selected"), which restricts it to nommu on m68k.  If it
> does work on X86_64, you can drop the dependency on UML, and we're
> (almost) back to my initial comment ;-)

I checked and it doesn't work as-is with (UML && X86_64 && MMU).
Restricting it to nommu UML might be a good way to go for this patch.

even if it works, I would like to focus on UML && !MMU for this patch
series since I wish to make the (initial) patchset as small as
possible.  If we would like to make it broadly available on x86, that
would be a different patch.

> > anyway, thank you for the comment.
> > # I really wanted to have comments from nommu folks.
> 
> I've added some in CC...

Thanks,

-- Hajime



* Re: [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic
  2024-11-13 23:32                     ` Hajime Tazaki
@ 2024-11-14  1:40                       ` Greg Ungerer
  2024-11-14 10:41                         ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Greg Ungerer @ 2024-11-14  1:40 UTC (permalink / raw)
  To: Hajime Tazaki, geert
  Cc: johannes, linux-um, ricarkol, Liam.Howlett, ebiederm, kees, viro,
	brauner, jack, linux-mm, linux-fsdevel, dalias

Hi Hajime,

On 14/11/24 09:32, Hajime Tazaki wrote:
> On Wed, 13 Nov 2024 22:55:02 +0900,
> Geert Uytterhoeven wrote:
>> On Wed, Nov 13, 2024 at 2:17 PM Hajime Tazaki <thehajime@gmail.com> wrote:
>>> On Wed, 13 Nov 2024 19:27:08 +0900,
>>> Geert Uytterhoeven wrote:
>>>> On Wed, Nov 13, 2024 at 9:37 AM Johannes Berg <johannes@sipsolutions.net> wrote:
>>>>> On Wed, 2024-11-13 at 09:36 +0100, Johannes Berg wrote:
>>>>>> On Wed, 2024-11-13 at 09:19 +0100, Geert Uytterhoeven wrote:
>>>>>>>
>>>>>>>>>> -       depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
>>>>>>>>>> +       depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
>>>>>>>>>
>>>>>>>>> s/UML/X86/?
>>>>>>>>
>>>>>>>> I guess the fdpic loader can be used to X86, but this patchset only
>>>>>>>> adds UML to be able to select it.  I intended to add UML into nommu
>>>>>>>> family.
>>>>>>>
>>>>>>> While currently x86-nommu is supported for UML only, this is really
>>>>>>> x86-specific. I still hope UML will get support for other architectures
>>>>>>> one day, at which point a dependency on UML here will become wrong...
>>>>>>>
>>>>>>
>>>>>> X86 isn't set for UML, X86_32 and X86_64 are though.
>>>>>>
>>>>>> Given that the no-MMU UM support even is 64-bit only, that probably
>>>>>> should then really be (UML && X86_64).
>>>>>>
>>>>>> But it already has !MMU, so can't be selected otherwise, and it seems
>>>>>> that non-X86 UML
>>>>>
>>>>> ... would require far more changes in all kinds of places, so not sure
>>>>> I'd be too concerned about it here.
>>>>
>>>> OK, up to you...
>>>
>>> Indeed, this particular patch [02/13] intends to support the fdpic
>>> loader under the condition 1) x86_64 ELF binaries (w/ PIE), 2) on UML,
>>> 3) and with) !MMU configured.  Given that situation, the strict check
>>> should be like:
>>>
>>>     depends on ARM || ((M68K || RISCV || SUPERH || (UML && X86_64) || XTENSA) && !MMU)
>>>
>>> (as Johannes mentioned).
>>>
>>> on the other hand, the fdpic loader works (afaik) on MMU environment so,
>>>
>>>     depends on ARM || (UML && X86_64) || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
>>>
>>> should also works, but this might be too broad for this patchset (and
>>> not sure if this makes a new use case).
>>
>> AFAIK that depends on the architecture's MMU context structure, cfr.
>> the comment in commit 782f4c5c44e7d99d ("m68knommu: allow elf_fdpic
>> loader to be selected"), which restricts it to nommu on m68k.  If it
>> does work on X86_64, you can drop the dependency on UML, and we're
>> (almost) back to my initial comment ;-)
> 
> I checked and it doesn't work as-is with (UML_X86_64 && MMU).
> restricting nommu with UML might be a good to for this patch.
> 
> even if it works, I would like to focus on UML && !MMU for this patch
> series since I wish to make the (initial) patchset as small as
> possible.  If we would like to make it broadly available on x86, that
> would be a different patch.

Makes sense.

I was only interested in the ability to run ELF based static/PIE binaries
when I did 782f4c5c44e7d99d ("m68knommu: allow elf_fdpic loader to be selected").
I did the same thing for RISC-V in commit 9549fb354ef1 ("riscv: support the
elf-fdpic binfmt loader"), limiting it to !MMU configurations only.

There is no need for binfmt_fdpic in MMU configurations if all you want to
do is run ELF PIE binaries. The normal binfmt_elf loader can load and run
those already.
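
As an aside, whether a given binary is position independent can be
read from its ELF header.  The following is a minimal sketch in Python,
not taken from the patch series; the function names are illustrative.
It only inspects `e_type`, and `ET_DYN` is also used for ordinary
shared objects, so it cannot distinguish a PIE executable from a
library on its own.

```python
import struct

# e_type values from the ELF specification.
ET_EXEC = 2  # fixed-address executable
ET_DYN = 3   # shared object, also used for PIE executables


def elf_type(header: bytes) -> int:
    """Return the e_type field from the first bytes of an ELF file."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # e_ident[EI_DATA] (byte 5): 1 = little endian, 2 = big endian.
    endian = "<" if header[5] == 1 else ">"
    # e_type is the 16-bit field right after the 16-byte e_ident array.
    (e_type,) = struct.unpack_from(endian + "H", header, 16)
    return e_type


def is_position_independent(header: bytes) -> bool:
    """True if the header describes an ET_DYN (PIE or shared) object."""
    return elf_type(header) == ET_DYN


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as f:
        print(is_position_independent(f.read(18)))
```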

Regards
Greg



>>> anyway, thank you for the comment.
>>> # I really wanted to have comments from nommu folks.
>>
>> I've added some in CC...
> 
> Thanks,
> 
> -- Hajime



* Re: [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic
  2024-11-14  1:40                       ` Greg Ungerer
@ 2024-11-14 10:41                         ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-14 10:41 UTC (permalink / raw)
  To: gerg
  Cc: geert, johannes, linux-um, ricarkol, Liam.Howlett, ebiederm, kees,
	viro, brauner, jack, linux-mm, linux-fsdevel, dalias


Hello Greg,

On Thu, 14 Nov 2024 10:40:03 +0900,
Greg Ungerer wrote:

> I was only interested in the ability to run ELF based static/PIE binaries
> when I did 782f4c5c44e7d99d ("m68knommu: allow elf_fdpic loader to be selected").
> I did the same thing for RISC-V in commit 9549fb354ef1 ("riscv: support the
> elf-fdpic binfmt loader"), limiting it to !MMU configurations only.
> 
> There is no need for binfmt_fdpic in MMU configurations if all you want to
> do is run ELF PIE binaries. The normal binfmt_elf loader can load and run
> those already.

Yes, my motivation for using this loader is to run ELF PIE binaries
in a !MMU environment.

-- Hajime



* Re: [RFC PATCH v2 00/13] nommu UML
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
                     ` (12 preceding siblings ...)
  2024-11-11  6:27   ` [RFC PATCH v2 13/13] um: nommu: plug nommu code into build system Hajime Tazaki
@ 2024-11-15 10:12   ` Johannes Berg
  2024-11-15 10:26     ` Anton Ivanov
  2024-11-15 14:48     ` Hajime Tazaki
  2024-11-22  9:33   ` Lorenzo Stoakes
                     ` (2 subsequent siblings)
  16 siblings, 2 replies; 128+ messages in thread
From: Johannes Berg @ 2024-11-15 10:12 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um; +Cc: ricarkol, Liam.Howlett

On Mon, 2024-11-11 at 15:27 +0900, Hajime Tazaki wrote:
> This is a series of patches of nommu arch addition to UML.  It would
> be nice to ask comments/opinions on this.

So I've been thinking about this for a while now...

To be clear, I'm not really _against_ it. With around 1200 lines of
code, it really isn't even big. But I also don't know how brittle it is?
Testing it is made somewhat difficult with the map-at-zero requirement
too.

And really I keep coming back to asking myself what the use case is?

Is it to test something for no-MMU platforms more easily? But I'm not
sure what that would be? Have any no-MMU platform maintainers weighed in
on this, have they even _seen_ it? Is that interesting? Is it more
interesting than testing an emulated system with the right architecture?
With it this way you'd probably have to build the right libraries and
binaries for x86-64 no-MMU, does such a thing already exist somewhere?

It also doesn't look like it's meant to replace LKL? But even LKL I
don't really know - are people using it, and if so what for? Seems
lklfuse is a thing for some BSD folks?

Is there something else to use it for?

If it's the first (test no-MMU) then it probably should be smarter about
not really relying on retpoline. Why is the focus so much on that
anyway? If testing no-MMU was the most important thing then probably
you'd have started with seccomp, and actually execute the syscalls from
that, to not have all those restrictions that come from rewriting
binaries, rather than ignoring the whole thing. Though of course you did
add a filter now, but I think it'll just crash?
So I could perhaps see this use case, but then I'd probably think it
should be more generic (i.e. able to execute all no-MMU binaries
including ones that may be using JIT compilation etc.) and not _require_
retpoline, but rather use it as an optimisation where that's possible
(i.e. if you can map at zero)?

If the use case is instead more LKL-type usage, I guess I don't really
understand it, though to be honest I also don't really fully understand
LKL itself, but it always _seemed_ very different.

Somewhat hyperbolically, I'm wondering if it's just a tech demo for
retpoline?

So I dunno. Reading through it again there are a few minor things wrt.
code style and debug things left over, but it's not awful ;-) I'd also
prefer the code to be more clearly "marked" (as nommu), perhaps putting
new files into a nommu/ directory, or something like that. But that's
pretty minor.

Still it's in a lot of places and chances are it'll make bigger
refactoring (like seccomp mode) harder. Perhaps if at all it should come
after seccomp mode and use that to execute syscalls if zpoline can't be
done, and to catch all the cases where zpoline doesn't work (you have
that in the docs)?

What do others think? Would you use it? What for?

johannes

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH v2 00/13] nommu UML
  2024-11-15 10:12   ` [RFC PATCH v2 00/13] nommu UML Johannes Berg
@ 2024-11-15 10:26     ` Anton Ivanov
  2024-11-15 14:54       ` Hajime Tazaki
  2024-11-15 14:48     ` Hajime Tazaki
  1 sibling, 1 reply; 128+ messages in thread
From: Anton Ivanov @ 2024-11-15 10:26 UTC (permalink / raw)
  To: Johannes Berg, Hajime Tazaki, linux-um; +Cc: ricarkol, Liam.Howlett



On 15/11/2024 10:12, Johannes Berg wrote:
> On Mon, 2024-11-11 at 15:27 +0900, Hajime Tazaki wrote:
>> This is a series of patches of nommu arch addition to UML.  It would
>> be nice to ask comments/opinions on this.
> 
> So I've been thinking about this for a while now...
> 
> To be clear, I'm not really _against_ it. With around 1200 lines of
> code, it really isn't even big. But I also don't know how brittle it is?
> Testing it is made somewhat difficult with the map-at-zero requirement
> too.
> 
> 
> And really I keep coming back to asking myself what the use case is?
> 
> Is it to test something for no-MMU platforms more easily? But I'm not
> sure what that would be? Have any no-MMU platform maintainers weighed in
> on this, have they even _seen_ it? Is that interesting? Is it more
> interesting than testing an emulated system with the right architecture?
> With it this way you'd probably have to build the right libraries and
> binaries for x86-64 no-MMU, does such a thing already exist somewhere?
> 
> It also doesn't look like it's meant to replace LKL? But even LKL I
> don't really know - are people using it, and if so what for? Seems
> lklfuse is a thing for some BSD folks?
> 
> Is there something else to use it for?
> 
> If it's the first (test no-MMU) then it probably should be smarter about
> not really relying on retpoline. Why is the focus so much on that
> anyway? If testing no-MMU was the most important thing then probably
> you'd have started with seccomp, and actually execute the syscalls from
> that, to not have all those restrictions that come from rewriting
> binaries, rather than ignoring the whole thing. Though of course you did
> add a filter now, but I think it'll just crash?
> So I could perhaps see this use case, but then I'd probably think it
> should be more generic (i.e. able to execute all no-MMU binaries
> including ones that may be using JIT compilation etc.) and not _require_
> retpoline, but rather use it as an optimisation where that's possible
> (i.e. if you can map at zero)?
> 
> If the use case is instead more LKL-type usage, I guess I don't really
> understand it, though to be honest I also don't really fully understand
> LKL itself, but it always _seemed_ very different.
> 
> Somewhat hyperbolically, I'm wondering if it's just a tech demo for
> retpoline?
> 
> So I dunno. Reading through it again there are a few minor things wrt.
> code style and debug things left over, but it's not awful ;-) I'd also
> prefer the code to be more clearly "marked" (as nommu), perhaps putting
> new files into a nommu/ directory, or something like that. But that's
> pretty minor.
> 
> Still it's in a lot of places and chances are it'll make bigger
> refactoring (like seccomp mode) harder. Perhaps if at all it should come
> after seccomp mode and use that to execute syscalls if zpoline can't be
> done, and to catch all the cases where zpoline doesn't work (you have
> that in the docs)?
> 
> What do others think? Would you use it? What for?

I always thought of it as "another LKL". In that case, it can be compared
to LKL on merit and if it is equivalent or better - go into kernel.

If there is another use case, I will be glad to hear it.

> 
> johannes
> 
> 

-- 
Anton R. Ivanov
https://www.kot-begemot.co.uk/


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH v2 00/13] nommu UML
  2024-11-15 10:12   ` [RFC PATCH v2 00/13] nommu UML Johannes Berg
  2024-11-15 10:26     ` Anton Ivanov
@ 2024-11-15 14:48     ` Hajime Tazaki
  1 sibling, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-15 14:48 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, ricarkol, Liam.Howlett, gerg, geert, dalias

Hello Johannes,

# added Geert, Greg, Rich to Cc (sorry if you feel noisy)
# here is the original email of this thread: just in case.
# https://lore.kernel.org/linux-um/cover.1731290567.git.thehajime@gmail.com/

On Fri, 15 Nov 2024 19:12:39 +0900,
Johannes Berg wrote:
> 
> On Mon, 2024-11-11 at 15:27 +0900, Hajime Tazaki wrote:
> > This is a series of patches of nommu arch addition to UML.  It would
> > be nice to ask comments/opinions on this.
> 
> So I've been thinking about this for a while now...

thank you for your time !

> To be clear, I'm not really _against_ it. With around 1200 lines of
> code, it really isn't even big. But I also don't know how brittle it is?
> Testing it is made somewhat difficult with the map-at-zero requirement
> too.

Given that CI/testing facilities these days run on VMs, setting
/proc/sys/vm/mmap_min_addr to 0 in order to test this feature is not
so difficult.

> And really I keep coming back to asking myself what the use case is?
> 
> Is it to test something for no-MMU platforms more easily? But I'm not
> sure what that would be? Have any no-MMU platform maintainers weighed in
> on this, have they even _seen_ it? Is that interesting? Is it more
> interesting than testing an emulated system with the right architecture?

Let me explain one recent experience for the use case.

I spotted (and fixed, now in Linus' tree) an issue in the vma subsystem,
which uses the maple-tree library, during the development of this patch
series.

There is a (slightly) long thread discussing it with the maple-tree
maintainer, Liam (below).

- traversing vma on nommu
https://lists.infradead.org/pipermail/maple-tree/2024-November/003740.html

Bisecting showed that the issue is reproducible from v6.12-rc1 onward,
but it never happened with the other nommu archs (we tested with m68k
and riscv, both on buildroot qemu).  Maybe because I'm more familiar
with nommu UML than with m68k/riscv qemu, I could comfortably
reproduce/debug/test what's going on with gdb, and finally proposed a
fix (a one-liner patch).

- the patch (I hope it'll land in the 6.12 release)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=247d720b2c5d22f7281437fd6054a138256986ba

This is just one example of its usefulness.  I believe you can imagine
that the same can also happen with regular (MMU) UML.

I also privately run a CI test which verifies that my patch doesn't
break MMU UML, with a simple boot test (static/dynamic), 12 kunit
tests in the kernel tree, basic benchmarks with lmbench, etc.  This is
not a characteristic specific to nommu UML, though.

https://github.com/thehajime/linux/actions/runs/11811327291
# The above URL may expire in future.

> With it this way you'd probably have to build the right libraries and
> binaries for x86-64 no-MMU, does such a thing already exist somewhere?

I'm preparing patches for upstream Alpine Linux so that such binaries
become available in an appropriate way.  Note that I didn't modify the
code of the programs themselves (except for a clear bug); I just build
them with the NOMMU option which is already implemented in
busybox/musl-libc.

https://gitlab.alpinelinux.org/thehajime/aports/-/merge_requests/2/diffs

I have not contacted the upstream developers yet, so this diff might change.

> It also doesn't look like it's meant to replace LKL? But even LKL I
> don't really know - are people using it, and if so what for? Seems
> lklfuse is a thing for some BSD folks?
> 
> Is there something else to use it for?

This patchset is independent of, and unrelated to, LKL.
# you may confuse that I've still been working on LKL.

(off topic)
lklfuse is indeed used by FreeBSD but not well maintained (afaik).
NixOS (a linux pkg manager) also uses lklfuse iirc.

> If it's the first (test no-MMU) then it probably should be smarter about
> not really relying on retpoline.

# I assume s/retpoline/zpoline/ in the rest of your message.

> Why is the focus so much on that
> anyway? If testing no-MMU was the most important thing then probably
> you'd have started with seccomp, and actually execute the syscalls from
> that, to not have all those restrictions that come from rewriting
> binaries, rather than ignoring the whole thing.

For the JIT part (and also syscalls from dlopen-ed binaries), as I
mentioned in the other reply, it can be implemented but is not yet.

The choice of zpoline is based on the speed of syscall invocations.
Our investigation showed that seccomp (and similar mechanisms like SUD:
syscall user dispatch, ptrace, int3 signaling) is still slower than
binary rewriting, due to the signal delivery inherent in those
mechanisms.  LD_PRELOAD with symbol rewrites is faster (even than
binary rewrites) but fundamentally cannot hook all syscalls.

zpoline tries to fill this gap, and we thought this fits the UML
usage.

> Though of course you did
> add a filter now, but I think it'll just crash?

this part (just crash w/ SIGSYS) can be improved.

> So I could perhaps see this use case, but then I'd probably think it
> should be more generic (i.e. able to execute all no-MMU binaries
> including ones that may be using JIT compilation etc.) and not _require_
> retpoline, but rather use it as an optimisation where that's possible
> (i.e. if you can map at zero)?

I understand your point.

> If the use case is instead more LKL-type usage, I guess I don't really
> understand it, though to be honest I also don't really fully understand
> LKL itself, but it always _seemed_ very different.

I didn't explain the comparison between LKL and nommu UML, as I
thought those are independent of each other.

> Somewhat hyperbolically, I'm wondering if it's just a tech demo for
> retpoline?

An additional reason we used zpoline to replace the syscall instruction is:

our first implementation of this nommu UML used a modified version of
the (userspace) standard library (musl-libc), without zpoline.  We
reimplemented the syscall wrappers to call a syscall entry point
(__kernel_vsyscall) exposed via the ELF aux vector.

Like this:

static __inline long __syscall0(long n)
{
	unsigned long ret = -1;
	/* call the kernel entry point instead of executing "syscall" */
	__asm__ __volatile__ ("call *%1" : "=a"(ret)
			: "r"(__sysinfo), "a"(n)
			: "rcx", "r11", "memory");
	return ret;
}
# __sysinfo is exposed address from the aux vector.
# this was actually done not by myself, but Ricardo (in Cc)'s work.

https://github.com/nabla-containers/musl-libc/blob/e11be13e6abc06f7034d6b98552b5928d0ed0dfe/arch/x86_64/syscall_arch.h#L13-L20

With that, we can use unmodified binaries, but need to modify libc.so
and ld.so, which I thought isn't trivial.

My motivation to apply zpoline here is to eliminate this dependency;
with zpoline, we don't have to modify the standard library (musl).

In addition to that, since a NOMMU kernel shares one address space among
multiple userspace processes, we only have to prepare the trampoline
code once, while processes in the multiple-address-space model (the MMU
case) need to install the zpoline-related code on each process
invocation.  This is not a direct motivation to use zpoline here, but a
side benefit in the given environment.

> So I dunno. Reading through it again there are a few minor things wrt.
> code style and debug things left over, but it's not awful ;-)

oh really.  I'll double check them, but it would be nice to know any
flaws you found.

> I'd also
> prefer the code to be more clearly "marked" (as nommu), perhaps putting
> new files into a nommu/ directory, or something like that. But that's
> pretty minor.

I understand.  I'm afraid that there will still be multiple ifdefs, since
nommu UML relies on various parts of the existing UML infrastructure.

> Still it's in a lot of places and chances are it'll make bigger
> refactoring (like seccomp mode) harder. Perhaps if at all it should come
> after seccomp mode and use that to execute syscalls if zpoline can't be
> done, and to catch all the cases where zpoline doesn't work (you have
> that in the docs)?

fallback mechanism after zpoline failure might be interesting.

> What do others think? Would you use it? What for?

-- Hajime

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH v2 00/13] nommu UML
  2024-11-15 10:26     ` Anton Ivanov
@ 2024-11-15 14:54       ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-15 14:54 UTC (permalink / raw)
  To: anton.ivanov; +Cc: johannes, linux-um, ricarkol, Liam.Howlett

Hello Anton,

thanks for the comment.

On Fri, 15 Nov 2024 19:26:07 +0900,
Anton Ivanov wrote:

> > What do others think? Would you use it? What for?
> 
> I always thought of it as "another LKL". In that case, it can be compared
> to LKL on merit and if it is equivalent or better - go into kernel.
> 
> If there is another use case, I will be glad to hear it.

In a high-level view,

the usage is different (no merit/demerit).
LKL is used with userspace binaries, linked with, or dynamically
replaced with the liblinux.so.  LKL has a userspace API derived from
syscall interface, which can be used to bridge LKL-world and
host-kernel world (not specific to Linux host).

This patchset (nommu UML) doesn't change the usage of current UML.

From an internal implementation point of view,

both (LKL and nommu-UML) use !MMU.  While LKL can be implemented with
an MMU-full configuration, we found (the last patch was back in 2021)
that it is not trivial.

LKL has no process model; it currently only runs in a single (LKL)
process, with no vfork(2) support.
nommu-UML can host multiple processes with vfork available.

the patch size is:
LKL (last v8 patch): roughly 5k lines of modifications
nommu-UML: 1.2k lines of mods.

I think they look similar (as I come from LKL, which also uses !MMU),
but they differ in various aspects.

let me know if you wish to see more about the comparison.

-- Hajime

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH v2 00/13] nommu UML
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
                     ` (13 preceding siblings ...)
  2024-11-15 10:12   ` [RFC PATCH v2 00/13] nommu UML Johannes Berg
@ 2024-11-22  9:33   ` Lorenzo Stoakes
  2024-11-22  9:53     ` Johannes Berg
  2024-11-23  7:27   ` David Gow
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
  16 siblings, 1 reply; 128+ messages in thread
From: Lorenzo Stoakes @ 2024-11-22  9:33 UTC (permalink / raw)
  To: Hajime Tazaki
  Cc: linux-um, ricarkol, Liam.Howlett, vbabka, linux-mm, Jann Horn

+ VMA people, mm list

On Mon, Nov 11, 2024 at 03:27:00PM +0900, Hajime Tazaki wrote:
> This is a series of patches of nommu arch addition to UML.  It would
> be nice to ask comments/opinions on this.

In general, while I appreciate your work and don't mean to be negative, we
in mm consistently have problems with nommu as it is a rarely-tested
more-or-less hack used for very few very old architectures and a constant
source of problems and maintenance overhead for us.

It also complicates mm code and time taken to develop new features.

So ideally we'd avoid doing anything that requires us to maintain it going
forward unless the benefits really overwhelmingly outweigh the drawbacks.

There have been various murmurings about moving towards elimination of
nommu, obviously this would entirely prevent that.

Thanks, Lorenzo

>
> There are still several limitations/issues which we already found;
> here is the list of those issues.
>
> - prompt configured with /etc/profile is broken (variables are not
>   expanded, ${HOSTNAME%%.*}:$PWD#)
> - there are no mechanism implemented to cache for mapped memory of
>   exec(2) thus, always read files from filesystem upon every exec,
>   which makes slow on some benchmark (lmbench).
>
> -- Hajime
>
>
> RFC v2:
> - base branch is now uml/linux.git instead of torvalds/linux.git.
> - reorganize the patch series to clean up
> - fixed various coding styles issues
> - clean up exec code path [07/13]
> - fixed the crash/SIGSEGV case on userspace programs [10/13]
> - add seccomp filter to limit syscall caller address [06/13]
> - detect fsgsbase availability with sigsetjmp/siglongjmp [08/13]
> - removes unrelated changes
> - removes unneeded ifndef CONFIG_MMU
> - convert UML_CONFIG_MMU to CONFIG_MMU as using uml/linux.git
> - proposed a patch of maple-tree issue (resolving a limitation in RFC v1)
>   https://lore.kernel.org/linux-mm/20241108222834.3625217-1-thehajime@gmail.com/
>
> RFC:
> - https://lore.kernel.org/linux-um/cover.1729770373.git.thehajime@gmail.com/
>
> Hajime Tazaki (13):
>   fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec
>   x86/um: nommu: elf loader for fdpic
>   um: nommu: memory handling
>   x86/um: nommu: syscall handling
>   x86/um: nommu: syscall translation by zpoline
>   um: nommu: prevent host syscalls from userspace by seccomp filter
>   x86/um: nommu: process/thread handling
>   um: nommu: configure fs register on host syscall invocation
>   x86/um/vdso: nommu: vdso memory update
>   x86/um: nommu: signal handling
>   um: change machine name for uname output
>   um: nommu: add documentation of nommu UML
>   um: nommu: plug nommu code into build system
>
>  Documentation/virt/uml/nommu-uml.rst    | 221 +++++++++++++++++++++++
>  arch/um/Kconfig                         |  14 +-
>  arch/um/Makefile                        |   6 +
>  arch/um/configs/x86_64_nommu_defconfig  |  64 +++++++
>  arch/um/include/asm/Kbuild              |   1 +
>  arch/um/include/asm/futex.h             |   4 +
>  arch/um/include/asm/mmu.h               |   8 +
>  arch/um/include/asm/mmu_context.h       |  13 +-
>  arch/um/include/asm/ptrace-generic.h    |   6 +
>  arch/um/include/asm/tlbflush.h          |  22 +++
>  arch/um/include/asm/uaccess.h           |   7 +-
>  arch/um/include/shared/kern_util.h      |   3 +
>  arch/um/include/shared/os.h             |  14 ++
>  arch/um/kernel/Makefile                 |   3 +-
>  arch/um/kernel/mem.c                    |  12 +-
>  arch/um/kernel/physmem.c                |   6 +
>  arch/um/kernel/process.c                |  33 +++-
>  arch/um/kernel/skas/Makefile            |   4 +-
>  arch/um/kernel/trap.c                   |  14 ++
>  arch/um/kernel/um_arch.c                |   4 +
>  arch/um/os-Linux/Makefile               |   5 +-
>  arch/um/os-Linux/cpu.c                  |  50 ++++++
>  arch/um/os-Linux/internal.h             |   5 +
>  arch/um/os-Linux/main.c                 |   5 +
>  arch/um/os-Linux/process.c              |  94 +++++++++-
>  arch/um/os-Linux/signal.c               |  18 +-
>  arch/um/os-Linux/skas/process.c         |   4 +
>  arch/um/os-Linux/start_up.c             |   3 +
>  arch/um/os-Linux/util.c                 |   3 +-
>  arch/x86/um/Makefile                    |  18 ++
>  arch/x86/um/asm/elf.h                   |  11 +-
>  arch/x86/um/asm/module.h                |  24 ---
>  arch/x86/um/asm/processor.h             |  12 ++
>  arch/x86/um/do_syscall_64.c             | 108 ++++++++++++
>  arch/x86/um/entry_64.S                  | 108 ++++++++++++
>  arch/x86/um/shared/sysdep/syscalls_64.h |   6 +
>  arch/x86/um/signal.c                    |  37 +++-
>  arch/x86/um/syscalls_64.c               |  69 ++++++++
>  arch/x86/um/vdso/um_vdso.c              |  20 +++
>  arch/x86/um/vdso/vma.c                  |  14 ++
>  arch/x86/um/zpoline.c                   | 223 ++++++++++++++++++++++++
>  fs/Kconfig.binfmt                       |   2 +-
>  fs/binfmt_elf_fdpic.c                   |  10 ++
>  43 files changed, 1262 insertions(+), 46 deletions(-)
>  create mode 100644 Documentation/virt/uml/nommu-uml.rst
>  create mode 100644 arch/um/configs/x86_64_nommu_defconfig
>  create mode 100644 arch/um/os-Linux/cpu.c
>  delete mode 100644 arch/x86/um/asm/module.h
>  create mode 100644 arch/x86/um/do_syscall_64.c
>  create mode 100644 arch/x86/um/entry_64.S
>  create mode 100644 arch/x86/um/zpoline.c
>
> --
> 2.43.0
>
>


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH v2 00/13] nommu UML
  2024-11-22  9:33   ` Lorenzo Stoakes
@ 2024-11-22  9:53     ` Johannes Berg
  2024-11-22 10:29       ` Lorenzo Stoakes
  2024-11-22 12:18       ` Christoph Hellwig
  0 siblings, 2 replies; 128+ messages in thread
From: Johannes Berg @ 2024-11-22  9:53 UTC (permalink / raw)
  To: Lorenzo Stoakes, Hajime Tazaki
  Cc: linux-um, ricarkol, Liam.Howlett, vbabka, linux-mm, Jann Horn,
	Christoph Hellwig, Damien Le Moal

On Fri, 2024-11-22 at 09:33 +0000, Lorenzo Stoakes wrote:
> 
> In general, while I appreciate your work and don't mean to be negative, we
> in mm consistently have problems with nommu as it is a rarely-tested
> more-or-less hack used for very few very old architectures and a constant
> source of problems and maintenance overhead for us.
> 
> It also complicates mm code and time taken to develop new features.
> 
> So ideally we'd avoid doing anything that requires us to maintain it going
> forward unless the benefits really overwhelmingly outweigh the drawbacks.

:)

There aren't really any benefits to ARCH=um in *itself*, IMHO.

> There have been various murmurings about moving towards elimination of
> nommu, obviously this would entirely prevent that.

No objection from me, but e.g. RISC-V added nommu somewhat recently?
(+Christoph, Damien)

So we could argue the other way around and say that while we have other
architectures with nommu (like RISC-V), having ARCH=um could simplify
testing by e.g. allowing a kunit configuration in ARCH=um which is
simpler (and probably faster) to run for most people than simulating a
foreign architecture.

Anyway, I think that's where I am with my partial (and very limited)
ARCH=um maintainer role. I don't really care for having the feature in
UML itself, but if it's useful for testing nommu architectures for
someone else, it doesn't seem too problematic to support. And testing
such things is also a big part of the argument Hajime was making,
afaict.

johannes

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH v2 00/13] nommu UML
  2024-11-22  9:53     ` Johannes Berg
@ 2024-11-22 10:29       ` Lorenzo Stoakes
  2024-11-22 12:18       ` Christoph Hellwig
  1 sibling, 0 replies; 128+ messages in thread
From: Lorenzo Stoakes @ 2024-11-22 10:29 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Hajime Tazaki, linux-um, ricarkol, Liam.Howlett, vbabka, linux-mm,
	Jann Horn, Christoph Hellwig, Damien Le Moal

On Fri, Nov 22, 2024 at 10:53:18AM +0100, Johannes Berg wrote:
> On Fri, 2024-11-22 at 09:33 +0000, Lorenzo Stoakes wrote:
> >
> > In general, while I appreciate your work and don't mean to be negative, we
> > in mm consistently have problems with nommu as it is a rarely-tested
> > more-or-less hack used for very few very old architectures and a constant
> > source of problems and maintenance overhead for us.
> >
> > It also complicates mm code and time taken to develop new features.
> >
> > So ideally we'd avoid doing anything that requires us to maintain it going
> > forward unless the benefits really overwhelmingly outweigh the drawbacks.
>
> :)
>
> There aren't really any benefits to ARCH=um in *itself*, IMHO.
>
> > There have been various murmurings about moving towards elimination of
> > nommu, obviously this would entirely prevent that.
>
> No objection from me, but e.g. RISC-V added nommu somewhat recently?
> (+Christoph, Damien)

I mean it's not my place to object to this of course, but ideally we'd
avoid supporting the truly low spec RISC-V arches which do not have MMUs (I
wasn't aware there were some but I am wholly unfamiliar with RISC-V so
plead ignorance!)

>
> So we could argue the other way around and say that while we have other
> architectures with nommu (like RISC-V), having ARCH=um could simplify
> testing by e.g. allowing a kunit configuration in ARCH=um which is
> simpler (and probably faster) to run for most people than simulating a
> foreign architecture.

Yeah and this is the flip side of the coin, I mean it's actually very
useful to be able to test nommu stuff easily (I've had real issues getting
nommu m68k working in qemu for instance), but my concern is by adding more
dependency on this mechanism it makes it harder to remove later.

I would support this if in future there wouldn't be too much objection to
_this_ feature being removed should we come to a point where nommu removal
happens.

If a large part of the motivation is testing nommu arches, and we at some
point eliminate them, then I think hopefully given this would in that case
be the raison d'etre for the effort it'd not be too egregious to remove at
this point.

In which case, the flip side of the coin is that I am in fact positive
about the testing possibilities here :)

>
> Anyway, I think that's where I am with my partial (and very limited)
> ARCH=um maintainer role. I don't really care for having the feature in
> UML itself, but if it's useful for testing nommu architectures for
> someone else, it doesn't seem too problematic to support. And testing
> such things is also a big part of the argument Hajime was making,
> afaict.
>
> johannes
>

Thanks, and again I don't mean to be negative or difficult about this
series, I just want to raise the fact that 'in the wind' so to speak there
is desire to eliminate nommu at some point.

How realistic that desire is, I am not sure...

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH v2 00/13] nommu UML
  2024-11-22  9:53     ` Johannes Berg
  2024-11-22 10:29       ` Lorenzo Stoakes
@ 2024-11-22 12:18       ` Christoph Hellwig
  2024-11-22 12:25         ` Lorenzo Stoakes
  1 sibling, 1 reply; 128+ messages in thread
From: Christoph Hellwig @ 2024-11-22 12:18 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Lorenzo Stoakes, Hajime Tazaki, linux-um, ricarkol, Liam.Howlett,
	vbabka, linux-mm, Jann Horn, Christoph Hellwig, Damien Le Moal

Maybe I'm missing something, but where does this discussion about
killing nommu even come from?  Nommu is a long standing and reasonably
well maintained part of the kernel, why would anyone want to kill it
for no good reason?  I know quite a lot of products shipping it.

Btw, nommu UML certainly sounds interesting to me, at least indirectly.
I have a project for next year or so for which the linux kernel library
or something like it would be useful to run an in-kernel workload as
a user space process if needed.  nommu uml sounds like a really good
base for that as there basically won't be any userspace that needs
memory protection to start with.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH v2 00/13] nommu UML
  2024-11-22 12:18       ` Christoph Hellwig
@ 2024-11-22 12:25         ` Lorenzo Stoakes
  2024-11-22 12:38           ` Christoph Hellwig
  0 siblings, 1 reply; 128+ messages in thread
From: Lorenzo Stoakes @ 2024-11-22 12:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Johannes Berg, Hajime Tazaki, linux-um, ricarkol, Liam.Howlett,
	vbabka, linux-mm, Jann Horn, Damien Le Moal

On Fri, Nov 22, 2024 at 01:18:26PM +0100, Christoph Hellwig wrote:
> Maybe I'm missing something, but where does this discussion about
> killing nommu even come from?  Nommu is a long standing and reasonably
> well maintained part of the kernel, why would anyone want to kill it
> for no good reason?  I know quite a lot of products shipping it.

It's an ongoing maintenance burden, discussions about seeing whether it's
feasible to remove it have been had in multiple places.

I have personally run into issues having to accommodate it on numerous
occasions, as have many others.

I'd be interested to know which products specifically ship this and also
require tip kernel, perhaps this is just a case of my not being aware of
certain architectures?

My impression was that only legacy architectures specifically needed this,
but I'm happy to stand corrected.

Discussion which prompted this is specifically around m68k over at [0].

[0]:https://lore.kernel.org/all/9be80a9f-1587-4e8a-98cb-edf4920e587e@lucifer.local/

>
> Btw, nommu UML certainly sounds interesting to me, at least indirectly.
> I have a project for next year or so for which the linux kernel library
> or something like it would be useful to run an in-kernel workload as
> a user space process if needed.  nommu uml sounds like a really good
> base for that as there basically won't be any userspace that needs
> memory protection to start with.
>
>

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [RFC PATCH v2 00/13] nommu UML
  2024-11-22 12:25         ` Lorenzo Stoakes
@ 2024-11-22 12:38           ` Christoph Hellwig
  2024-11-22 12:49             ` Damien Le Moal
  0 siblings, 1 reply; 128+ messages in thread
From: Christoph Hellwig @ 2024-11-22 12:38 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Christoph Hellwig, Johannes Berg, Hajime Tazaki, linux-um,
	ricarkol, Liam.Howlett, vbabka, linux-mm, Jann Horn,
	Damien Le Moal

On Fri, Nov 22, 2024 at 12:25:19PM +0000, Lorenzo Stoakes wrote:
> It's an ongoing maintenance burden, discussions about seeing whether it's
> feasible to remove it have been had in multiple places.
> 
> I have personally run into issues having to accommodate it on numerous
> occasions, as have many others.
> 
> I'd be interested to know which products specifically ship this and also
> require tip kernel, perhaps this is just a case of my not being aware of
> certain architectures?

I can't tell you the products I know on a commercial basis.  Most of them
are arm based, but I also know about at least one RISC-V one.  They
all used the latest long term stable at the time of release and tend
to stay on that.  And the involved vendors keep spinning out new versions
of these every few years.




* Re: [RFC PATCH v2 00/13] nommu UML
  2024-11-22 12:38           ` Christoph Hellwig
@ 2024-11-22 12:49             ` Damien Le Moal
  2024-11-22 12:52               ` Lorenzo Stoakes
  0 siblings, 1 reply; 128+ messages in thread
From: Damien Le Moal @ 2024-11-22 12:49 UTC (permalink / raw)
  To: hch, Lorenzo Stoakes
  Cc: Johannes Berg, Hajime Tazaki, linux-um@lists.infradead.org,
	ricarkol@google.com, Liam.Howlett@oracle.com, vbabka@suse.cz,
	linux-mm@kvack.org, Jann Horn

On 11/22/24 21:38, Christoph Hellwig wrote:
> On Fri, Nov 22, 2024 at 12:25:19PM +0000, Lorenzo Stoakes wrote:
>> It's an ongoing maintenance burden, discussions about seeing whether it's
>> feasible to remove it have been had in multiple places.
>>
>> I have personally run into issues having to accommodate it on numerous
>> occasions, as have many others.
>>
>> I'd be interested to know which products specifically ship this and also
>> require tip kernel, perhaps this is just a case of my not being aware of
>> certain architectures?
> 
> I can't tell you the products I know on commercial basis.  Most of them
> are arm based, but I also know about at least one RISC-V one.    They
> all used the latest long term stable at the time of release and tend
> to stay on that.  And the involved vendors keep spinning out new versions
> of these every few years.

To add to this, we had a discussion at the RISC-V MC at plumbers last year (I
think it was) about removing the K210 RISC-V SoC and associated RISC-V NOMMU
support. But several people complained about that because several FPGAs
implementing RISC-V cores are NOMMU (for obvious reasons for the FPGA case). So
NOMMU is being used out there.

-- 
Damien Le Moal
Western Digital Research


* Re: [RFC PATCH v2 00/13] nommu UML
  2024-11-22 12:49             ` Damien Le Moal
@ 2024-11-22 12:52               ` Lorenzo Stoakes
  0 siblings, 0 replies; 128+ messages in thread
From: Lorenzo Stoakes @ 2024-11-22 12:52 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: hch, Johannes Berg, Hajime Tazaki, linux-um@lists.infradead.org,
	ricarkol@google.com, Liam.Howlett@oracle.com, vbabka@suse.cz,
	linux-mm@kvack.org, Jann Horn

On Fri, Nov 22, 2024 at 12:49:45PM +0000, Damien Le Moal wrote:
> On 11/22/24 21:38, Christoph Hellwig wrote:
> > On Fri, Nov 22, 2024 at 12:25:19PM +0000, Lorenzo Stoakes wrote:
> >> It's an ongoing maintenance burden, discussions about seeing whether it's
> >> feasible to remove it have been had in multiple places.
> >>
> >> I have personally run into issues having to accommodate it on numerous
> >> occasions, as have many others.
> >>
> >> I'd be interested to know which products specifically ship this and also
> >> require tip kernel, perhaps this is just a case of my not being aware of
> >> certain architectures?
> >
> > I can't tell you the products I know on commercial basis.  Most of them
> > are arm based, but I also know about at least one RISC-V one.    They
> > all used the latest long term stable at the time of release and tend
> > to stay on that.  And the involved vendors keep spinning out new versions
> > of these every few years.
>
> To add to this, we had a discussion at the RISC-V MC at plumbers last year (I
> think it was) about removing the K210 RISC-V SoC and associated RISC-V NOMMU
> support. But several people complained about that because several FPGAs
> implementing RISC-V cores are NOMMU (for obvious reasons for the FPGA case). So
> NOMMU is being used out there.

Thanks guys, appreciate the input, and this has made me aware of things I
simply was not before.

In that case, I am actually rather in favour of this series to make it
easier to test nommu things :)

>
> --
> Damien Le Moal
> Western Digital Research



* Re: [RFC PATCH v2 00/13] nommu UML
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
                     ` (14 preceding siblings ...)
  2024-11-22  9:33   ` Lorenzo Stoakes
@ 2024-11-23  7:27   ` David Gow
  2024-11-24  1:25     ` Hajime Tazaki
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
  16 siblings, 1 reply; 128+ messages in thread
From: David Gow @ 2024-11-23  7:27 UTC (permalink / raw)
  To: Hajime Tazaki; +Cc: linux-um, ricarkol, Liam.Howlett

On Mon, 11 Nov 2024 at 14:27, Hajime Tazaki <thehajime@gmail.com> wrote:
>
> This is a series of patches of nommu arch addition to UML.  It would
> be nice to ask comments/opinions on this.
>
> There are still several limitations/issues which we already found;
> here is the list of those issues.
>
> - prompt configured with /etc/profile is broken (variables are not
>   expanded, ${HOSTNAME%%.*}:$PWD#)
> - there are no mechanism implemented to cache for mapped memory of
>   exec(2) thus, always read files from filesystem upon every exec,
>   which makes slow on some benchmark (lmbench).
>
> -- Hajime
>

Thanks for sending this in!

I had a chance to give this a proper try with KUnit, and think it'd be
a great option to have available: it's certainly nice to have a fast,
easy nommu architecture for testing.

I'd echo the comments from others that — at least for the testing case
— it doesn't make much sense to go to the length to use the fancy
zpoline patching (as neat as it is) compared to a simpler, but slower
seccomp-based approach. It'd be nicer to have a simpler, more robust
implementation first, and if there's a particular reason to want to
speed it up later, zpoline can be added as an option.

Plus, if we can avoid the need for vm.mmap_min_addr, that'd make it
much easier to run the nommu tests alongside all the regular UML ones,
as none would require either root, or an otherwise particularly
special config.

Cheers,
-- David

>
> RFC v2:
> - base branch is now uml/linux.git instead of torvalds/linux.git.
> - reorganize the patch series to clean up
> - fixed various coding styles issues
> - clean up exec code path [07/13]
> - fixed the crash/SIGSEGV case on userspace programs [10/13]
> - add seccomp filter to limit syscall caller address [06/13]
> - detect fsgsbase availability with sigsetjmp/siglongjmp [08/13]
> - removes unrelated changes
> - removes unneeded ifndef CONFIG_MMU
> - convert UML_CONFIG_MMU to CONFIG_MMU as using uml/linux.git
> - proposed a patch of maple-tree issue (resolving a limitation in RFC v1)
>   https://lore.kernel.org/linux-mm/20241108222834.3625217-1-thehajime@gmail.com/
>
> RFC:
> - https://lore.kernel.org/linux-um/cover.1729770373.git.thehajime@gmail.com/
>


* Re: [RFC PATCH v2 00/13] nommu UML
  2024-11-23  7:27   ` David Gow
@ 2024-11-24  1:25     ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-24  1:25 UTC (permalink / raw)
  To: davidgow; +Cc: linux-um, ricarkol, Liam.Howlett


Hello David,

On Sat, 23 Nov 2024 16:27:27 +0900,
David Gow wrote:

> I had a chance to give this a proper try with KUnit, and think it'd be
> a great option to have available: it's certainly nice to have a fast,
> easy nommu architecture for testing.

Thanks for testing.

> I'd echo the comments from others that, at least for the testing case,
> it doesn't make much sense to go to the length to use the fancy
> zpoline patching (as neat as it is) compared to a simpler, but slower
> seccomp-based approach. It'd be nicer to have a simpler, more robust
> implementation first, and if there's a particular reason to want to
> speed it up later, zpoline can be added as an option.

I'll start exploring whether this option is feasible under nommu and will
report back here.

> Plus, if we can avoid the need for vm.mmap_min_addr, that'd make it
> much easier to run the nommu tests alongside all the regular UML ones,
> as none would require either root, or an otherwise particularly
> special config.

Although I didn't think this limitation had much impact, we'll also
investigate whether avoiding mmap_min_addr is possible.


Thanks for the feedback!

-- Hajime



* Re: [RFC PATCH v2 08/13] um: nommu: configure fs register on host syscall invocation
  2024-11-11  6:27   ` [RFC PATCH v2 08/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
@ 2024-11-27 10:00     ` Benjamin Berg
  2024-11-27 10:26       ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Benjamin Berg @ 2024-11-27 10:00 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um; +Cc: ricarkol, Liam.Howlett

Hi,

On Mon, 2024-11-11 at 15:27 +0900, Hajime Tazaki wrote:
> As userspace on UML/!MMU also need to configure %fs register when it is
> running to correctly access thread structure, host syscalls implemented
> in os-Linux drivers may be puzzled when they are called.  Thus it has to
> configure %fs register via arch_prctl(SET_FS) on every host syscalls.
> 
> [SNIP]
> +
> +/**
> + * get_host_cpu_features() return true with X86_FEATURE_FSGSBASE even
> + * if the kernel is older and disabled using fsgsbase instruction.
> + * thus detection is based on whether SIGILL is raised or not.
> + */
> +static jmp_buf jmpbuf;
> +static void sigill(int sig, siginfo_t *si, void *ctx_void)
> +{
> +	siglongjmp(jmpbuf, 1);
> +}
> +
> +void __init check_fsgsbase(void)
> +{
> +	unsigned long fsbase;
> +	struct sigaction sa;
> +
> +	/* Probe FSGSBASE */
> +	memset(&sa, 0, sizeof(sa));
> +	sa.sa_sigaction = sigill;
> +	sa.sa_flags = SA_SIGINFO | SA_RESETHAND;
> +	sigemptyset(&sa.sa_mask);
> +	if (sigaction(SIGILL, &sa, 0))
> +		os_warn("sigaction");
> +
> +	os_info("Checking FSGSBASE instructions...");
> +	if (sigsetjmp(jmpbuf, 0) == 0) {
> +		asm volatile("rdfsbase %0" : "=r" (fsbase) :: "memory");
> +		host_has_fsgsbase = 1;
> +		os_info("OK\n");
> +	} else {
> +		host_has_fsgsbase = 0;
> +		os_info("disabled\n");
> +	}
> +}

According to Documentation/arch/x86/x86_64/fsgs.rst it looks like this
can also be checked using the HWCAP2_FSGSBASE bit in AT_HWCAP2.

Maybe that is a bit simpler?

> [SNIP]
> 
>  __visible void do_syscall_64(struct pt_regs *regs)
>  {
>  	int syscall;
> @@ -49,6 +76,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
>  	if (syscall == __NR_vfork)
>  		stack_copy = vfork_save_stack();
>  
> +	/* set fs register to the original host one */
> +	os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
> +
>  	if (likely(syscall < NR_syscalls)) {
>  		PT_REGS_SET_SYSCALL_RETURN(regs,
>  				EXECUTE_SYSCALL(syscall, regs));
> @@ -63,6 +93,11 @@ __visible void do_syscall_64(struct pt_regs *regs)
>  	set_thread_flag(TIF_SIGPENDING);
>  	interrupt_end();
>  
> +	/* restore back fs register to userspace configured one */
> +	os_x86_arch_prctl(0, ARCH_SET_FS,
> +		      (void *)(current->thread.regs.regs.gp[FS_BASE
> +						     / sizeof(unsigned long)]));
> +
>  	/* execve succeeded */
>  	if (syscall == __NR_execve && regs->regs.gp[HOST_AX] == 0)
>  		userspace(&current->thread.regs.regs);
> diff --git a/arch/x86/um/syscalls_64.c b/arch/x86/um/syscalls_64.c
> index edb17fc73e07..d56df936a2d7 100644
> --- a/arch/x86/um/syscalls_64.c
> +++ b/arch/x86/um/syscalls_64.c
> @@ -12,11 +12,26 @@
>  #include <asm/prctl.h> /* XXX This should get the constants from libc */
>  #include <registers.h>
>  #include <os.h>
> +#include <asm/thread_info.h>
> +#include <asm/mman.h>
> +
> +#ifndef CONFIG_MMU
> +/*
> + * The guest libc can change FS, which confuses the host libc.
> + * In fact, changing FS directly is not supported (check
> + * man arch_prctl). So, whenever we make a host syscall,
> + * we should be changing FS to the original FS (not the
> + * one set by the guest libc). This original FS is stored
> + * in host_fs.
> + */
> +long long host_fs = -1;

Right, the libc already uses it for its own thread-local storage. That
is a bit annoying, as UML doesn't need threading in that sense.

Note that similar handling needs to happen for every userspace to
kernel switch. I guess the only other location is the signal handler.

Benjamin

> +#endif
>  
>  long arch_prctl(struct task_struct *task, int option,
>  		unsigned long __user *arg2)
>  {
>  	long ret = -EINVAL;
> +#ifdef CONFIG_MMU
>  
>  	switch (option) {
>  	case ARCH_SET_FS:
> @@ -38,6 +53,48 @@ long arch_prctl(struct task_struct *task, int option,
>  	}
>  
>  	return ret;
> +#else
> +
> +	unsigned long *ptr = arg2, tmp;
> +
> +	switch (option) {
> +	case ARCH_SET_FS:
> +		if (host_fs == -1)
> +			os_arch_prctl(0, ARCH_GET_FS, (void *)&host_fs);
> +		ret = 0;
> +		break;
> +	case ARCH_SET_GS:
> +		ret = 0;
> +		break;
> +	case ARCH_GET_FS:
> +	case ARCH_GET_GS:
> +		ptr = &tmp;
> +		break;
> +	}
> +
> +	ret = os_arch_prctl(0, option, ptr);
> +	if (ret)
> +		return ret;
> +
> +	switch (option) {
> +	case ARCH_SET_FS:
> +		current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] =
> +			(unsigned long) arg2;
> +		break;
> +	case ARCH_SET_GS:
> +		current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)] =
> +			(unsigned long) arg2;
> +		break;
> +	case ARCH_GET_FS:
> +		ret = put_user(current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)], arg2);
> +		break;
> +	case ARCH_GET_GS:
> +		ret = put_user(current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)], arg2);
> +		break;
> +	}
> +
> +	return ret;
> +#endif
>  }
>  
>  SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)




* Re: [RFC PATCH v2 08/13] um: nommu: configure fs register on host syscall invocation
  2024-11-27 10:00     ` Benjamin Berg
@ 2024-11-27 10:26       ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-27 10:26 UTC (permalink / raw)
  To: benjamin; +Cc: linux-um, ricarkol, Liam.Howlett


On Wed, 27 Nov 2024 19:00:11 +0900,
Benjamin Berg wrote:

> > +
> > +	os_info("Checking FSGSBASE instructions...");
> > +	if (sigsetjmp(jmpbuf, 0) == 0) {
> > +		asm volatile("rdfsbase %0" : "=r" (fsbase) :: "memory");
> > +		host_has_fsgsbase = 1;
> > +		os_info("OK\n");
> > +	} else {
> > +		host_has_fsgsbase = 0;
> > +		os_info("disabled\n");
> > +	}
> > +}
> 
> According to Documentation/arch/x86/x86_64/fsgs.rst it looks like this
> can also be checked using the HWCAP2_FSGSBASE bit in AT_HWCAP2.
> 
> Maybe that is a bit simpler?

Ah, thanks.  This should be much simpler:

+#include <sys/auxv.h>
+#include <asm/hwcap2.h>
+void __init check_fsgsbase(void)
+{
+       unsigned long auxv = getauxval(AT_HWCAP2);
+
+       os_info("Checking FSGSBASE instructions...");
+       if (auxv & HWCAP2_FSGSBASE) {
+               host_has_fsgsbase = 1;
+               os_info("OK\n");
+       } else {
+               host_has_fsgsbase = 0;
+               os_info("disabled\n");
+       }
+}
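
For trying the check outside of UML, here is a minimal standalone sketch of
the same test.  It assumes an x86_64 Linux host; the sketch_* names and the
local SKETCH_HWCAP2_FSGSBASE define are hypothetical, with the bit value
(bit 1) taken from the example in fsgs.rst rather than from <asm/hwcap2.h>:

```c
#include <sys/auxv.h>

/* Hypothetical local copy of HWCAP2_FSGSBASE (bit 1 on x86, per fsgs.rst). */
#define SKETCH_HWCAP2_FSGSBASE (1UL << 1)

/* Pure helper: test the FSGSBASE bit in a given AT_HWCAP2 word. */
static int sketch_hwcap2_has_fsgsbase(unsigned long hwcap2)
{
	return (hwcap2 & SKETCH_HWCAP2_FSGSBASE) != 0;
}

/*
 * Ask the host kernel via the auxiliary vector.  getauxval() returns 0
 * when the entry is missing, which is then reported as "not available".
 */
static int sketch_host_has_fsgsbase(void)
{
	return sketch_hwcap2_has_fsgsbase(getauxval(AT_HWCAP2));
}
```

On other architectures bit 1 of AT_HWCAP2 has a different meaning, so the
result is only meaningful on x86_64.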


> 
> > [SNIP]
> > 
> >  __visible void do_syscall_64(struct pt_regs *regs)
> >  {
> >  	int syscall;
> > @@ -49,6 +76,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
> >  	if (syscall == __NR_vfork)
> >  		stack_copy = vfork_save_stack();
> >  
> > +	/* set fs register to the original host one */
> > +	os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
> > +
> >  	if (likely(syscall < NR_syscalls)) {
> >  		PT_REGS_SET_SYSCALL_RETURN(regs,
> >  				EXECUTE_SYSCALL(syscall, regs));
> > @@ -63,6 +93,11 @@ __visible void do_syscall_64(struct pt_regs *regs)
> >  	set_thread_flag(TIF_SIGPENDING);
> >  	interrupt_end();
> >  
> > +	/* restore back fs register to userspace configured one */
> > +	os_x86_arch_prctl(0, ARCH_SET_FS,
> > +		      (void *)(current->thread.regs.regs.gp[FS_BASE
> > +						     / sizeof(unsigned long)]));
> > +
> >  	/* execve succeeded */
> >  	if (syscall == __NR_execve && regs->regs.gp[HOST_AX] == 0)
> >  		userspace(&current->thread.regs.regs);
> > diff --git a/arch/x86/um/syscalls_64.c b/arch/x86/um/syscalls_64.c
> > index edb17fc73e07..d56df936a2d7 100644
> > --- a/arch/x86/um/syscalls_64.c
> > +++ b/arch/x86/um/syscalls_64.c
> > @@ -12,11 +12,26 @@
> >  #include <asm/prctl.h> /* XXX This should get the constants from libc */
> >  #include <registers.h>
> >  #include <os.h>
> > +#include <asm/thread_info.h>
> > +#include <asm/mman.h>
> > +
> > +#ifndef CONFIG_MMU
> > +/*
> > + * The guest libc can change FS, which confuses the host libc.
> > + * In fact, changing FS directly is not supported (check
> > + * man arch_prctl). So, whenever we make a host syscall,
> > + * we should be changing FS to the original FS (not the
> > + * one set by the guest libc). This original FS is stored
> > + * in host_fs.
> > + */
> > +long long host_fs = -1;
> 
> Right, the libc already uses it for its own thread-local storage. That
> is a bit annoying, as UML doesn't need threading in that sense.
> 
> Note that similar handling needs to happen for every userspace to
> kernel switch. I guess the only other location is the signal handler.

Thanks for this as well.
I guess my patch handles the former (arch_switch_to) but not the latter
(signal handler).  Let me check.

-- Hajime



* Re: [RFC PATCH v2 09/13] x86/um/vdso: nommu: vdso memory update
  2024-11-11  6:27   ` [RFC PATCH v2 09/13] x86/um/vdso: nommu: vdso memory update Hajime Tazaki
@ 2024-11-27 10:36     ` Benjamin Berg
  2024-11-27 23:23       ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Benjamin Berg @ 2024-11-27 10:36 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um; +Cc: ricarkol, Liam.Howlett

Hi,

On Mon, 2024-11-11 at 15:27 +0900, Hajime Tazaki wrote:
> On !MMU mode, the address of vdso is accessible from userspace.  This
> commit implements the entry point by pointing a block of page address.
> 
> This commit also add memory permission configuration of vdso page to be
> executable.
> 
> Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> ---
>  arch/x86/um/vdso/um_vdso.c | 20 ++++++++++++++++++++
>  arch/x86/um/vdso/vma.c     | 14 ++++++++++++++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/arch/x86/um/vdso/um_vdso.c b/arch/x86/um/vdso/um_vdso.c
> index cbae2584124f..eff3e6641a0e 100644
> --- a/arch/x86/um/vdso/um_vdso.c
> +++ b/arch/x86/um/vdso/um_vdso.c
> @@ -23,10 +23,17 @@ int __vdso_clock_gettime(clockid_t clock, struct __kernel_old_timespec *ts)
>  {
>  	long ret;
>  
> +#ifdef CONFIG_MMU
>  	asm("syscall"
>  		: "=a" (ret)
>  		: "0" (__NR_clock_gettime), "D" (clock), "S" (ts)
>  		: "rcx", "r11", "memory");
> +#else
> +	asm("call *%1"
> +		: "=a" (ret)
> +		: "0" ((unsigned long)__NR_clock_gettime), "D" (clock), "S" (ts)
> +		: "rcx", "r11", "memory");
> +#endif
>  
>  	return ret;
>  }
> @@ -37,10 +44,17 @@ int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
>  {
>  	long ret;
>  
> +#ifdef CONFIG_MMU
>  	asm("syscall"
>  		: "=a" (ret)
>  		: "0" (__NR_gettimeofday), "D" (tv), "S" (tz)
>  		: "rcx", "r11", "memory");
> +#else
> +	asm("call *%1"
> +		: "=a" (ret)
> +		: "0" ((unsigned long)__NR_gettimeofday), "D" (tv), "S" (tz)
> +		: "rcx", "r11", "memory");
> +#endif
>  
>  	return ret;
>  }
> @@ -51,9 +65,15 @@ __kernel_old_time_t __vdso_time(__kernel_old_time_t *t)
>  {
>  	long secs;
>  
> +#ifdef CONFIG_MMU
>  	asm volatile("syscall"
>  		: "=a" (secs)
>  		: "0" (__NR_time), "D" (t) : "cc", "r11", "cx", "memory");
> +#else
> +	asm("call *%1"
> +		: "=a" (secs)
> +		: "0" ((unsigned long)__NR_time), "D" (t) : "cc", "r11", "cx", "memory");
> +#endif

Maybe introduce a macro for "syscall" vs. "call *%1"? The parameters
should be identical in both cases. The "call" could probably even jump
to the end of the NOP ramp directly in this case.

Though maybe I am missing something with the "(unsigned long)" cast?

>  	return secs;
>  }
> diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
> index f238f7b33cdd..83c861e2a815 100644
> --- a/arch/x86/um/vdso/vma.c
> +++ b/arch/x86/um/vdso/vma.c
> @@ -9,6 +9,7 @@
>  #include <asm/page.h>
>  #include <asm/elf.h>
>  #include <linux/init.h>
> +#include <os.h>
>  
>  static unsigned int __read_mostly vdso_enabled = 1;
>  unsigned long um_vdso_addr;
> @@ -24,7 +25,9 @@ static int __init init_vdso(void)
>  
>  	BUG_ON(vdso_end - vdso_start > PAGE_SIZE);
>  
> +#ifdef CONFIG_MMU
>  	um_vdso_addr = task_size - PAGE_SIZE;
> +#endif
>  
>  	vdsop = kmalloc(sizeof(struct page *), GFP_KERNEL);
>  	if (!vdsop)
> @@ -40,6 +43,15 @@ static int __init init_vdso(void)
>  	copy_page(page_address(um_vdso), vdso_start);
>  	*vdsop = um_vdso;
>  
> +#ifndef CONFIG_MMU
> +	/* this is fine with NOMMU as everything is accessible */
> +	um_vdso_addr = (unsigned long)page_address(um_vdso);
> +	os_protect_memory((void *)um_vdso_addr, vdso_end - vdso_start, 1, 1, 1);

I think this should be "1, 0, 1", i.e. we shouldn't enable write
access.

> +	pr_debug("vdso_start=%lx um_vdso_addr=%lx pg_um_vdso=%lx",
> +	       (unsigned long)vdso_start, um_vdso_addr,
> +	       (unsigned long)page_address(um_vdso));
> +#endif
> +
>  	return 0;
>  
>  oom:
> @@ -50,6 +62,7 @@ static int __init init_vdso(void)
>  }
>  subsys_initcall(init_vdso);
>  
> +#ifdef CONFIG_MMU
>  int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>  {
>  	struct vm_area_struct *vma;
> @@ -74,3 +87,4 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>  
>  	return IS_ERR(vma) ? PTR_ERR(vma) : 0;
>  }
> +#endif




* Re: [RFC PATCH v2 09/13] x86/um/vdso: nommu: vdso memory update
  2024-11-27 10:36     ` Benjamin Berg
@ 2024-11-27 23:23       ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-11-27 23:23 UTC (permalink / raw)
  To: benjamin; +Cc: linux-um, ricarkol, Liam.Howlett


Thanks Benjamin,

On Wed, 27 Nov 2024 19:36:44 +0900,
Benjamin Berg wrote:

> > @@ -51,9 +65,15 @@ __kernel_old_time_t __vdso_time(__kernel_old_time_t *t)
> >  {
> >  	long secs;
> >  
> > +#ifdef CONFIG_MMU
> >  	asm volatile("syscall"
> >  		: "=a" (secs)
> >  		: "0" (__NR_time), "D" (t) : "cc", "r11", "cx", "memory");
> > +#else
> > +	asm("call *%1"
> > +		: "=a" (secs)
> > +		: "0" ((unsigned long)__NR_time), "D" (t) : "cc", "r11", "cx", "memory");
> > +#endif
> 
> Maybe introduce a macro for "syscall" vs. "call *%1"? The parameters
> should be identical in both cases.

I agree, it would be nice to clean up this part.

> The "call" could probably even jump
> to the end of the NOP ramp directly in this case.

Ah, that would give a small speed-up.
I confirmed that it works.  I'll update it as long as code readability
doesn't suffer.

> Though maybe I am missing something with the "(unsigned long)" cast?

Without this, gcc says

  CC      arch/x86/um/vdso/um_vdso.o
../arch/x86/um/vdso/um_vdso.c: Assembler messages:
../arch/x86/um/vdso/um_vdso.c:50: Error: operand size mismatch for `call'

so the cast was intended to silence it.

But if I change the constraint as below, the error goes away.

-		: "0" ((unsigned long)__NR_time), "D" (t) : "cc",
+		: "a" (__NR_time), "D" (t) : "cc",

I will clean it up as well.

> >  	return secs;
> >  }
> > diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
> > index f238f7b33cdd..83c861e2a815 100644
> > --- a/arch/x86/um/vdso/vma.c
> > +++ b/arch/x86/um/vdso/vma.c
> > @@ -9,6 +9,7 @@
> >  #include <asm/page.h>
> >  #include <asm/elf.h>
> >  #include <linux/init.h>
> > +#include <os.h>
> >  
> >  static unsigned int __read_mostly vdso_enabled = 1;
> >  unsigned long um_vdso_addr;
> > @@ -24,7 +25,9 @@ static int __init init_vdso(void)
> >  
> >  	BUG_ON(vdso_end - vdso_start > PAGE_SIZE);
> >  
> > +#ifdef CONFIG_MMU
> >  	um_vdso_addr = task_size - PAGE_SIZE;
> > +#endif
> >  
> >  	vdsop = kmalloc(sizeof(struct page *), GFP_KERNEL);
> >  	if (!vdsop)
> > @@ -40,6 +43,15 @@ static int __init init_vdso(void)
> >  	copy_page(page_address(um_vdso), vdso_start);
> >  	*vdsop = um_vdso;
> >  
> > +#ifndef CONFIG_MMU
> > +	/* this is fine with NOMMU as everything is accessible */
> > +	um_vdso_addr = (unsigned long)page_address(um_vdso);
> > +	os_protect_memory((void *)um_vdso_addr, vdso_end - vdso_start, 1, 1, 1);
> 
> I think this should be "1, 0, 1", i.e. we shouldn't enable write
> access.

I'm not sure whether this is related, but with PROT_WRITE off I cannot set
a breakpoint with gdb in the vdso area.  Normal execution (without gdb)
looks fine.


-- Hajime



* Re: [RFC PATCH v2 10/13] x86/um: nommu: signal handling
  2024-11-11  6:27   ` [RFC PATCH v2 10/13] x86/um: nommu: signal handling Hajime Tazaki
@ 2024-11-28 10:37     ` Benjamin Berg
  2024-12-01  1:38       ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Benjamin Berg @ 2024-11-28 10:37 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um; +Cc: ricarkol, Liam.Howlett

Hi,

On Mon, 2024-11-11 at 15:27 +0900, Hajime Tazaki wrote:
> This commit updates the behavior of signal handling under !MMU
> environment. 1) the stack preparation for the signal handlers and
> 2) restoration of stack after rt_sigreturn(2) syscall.  Those are needed
> as the stack usage on vfork(2) syscall is different.
> 
> It also adds the follow up routine for SIGSEGV as a signal delivery runs
> in the same stack frame while we have to avoid endless SIGSEGV.
> 
> Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
> ---
>  arch/um/include/shared/kern_util.h |  3 +++
>  arch/um/kernel/trap.c              | 10 ++++++++
>  arch/um/os-Linux/signal.c          | 18 ++++++++++++++-
>  arch/x86/um/signal.c               | 37 +++++++++++++++++++++++++++++-
>  4 files changed, 66 insertions(+), 2 deletions(-)
> 
> [SNIP]
> diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
> index 52852018a3ad..a06622415d8f 100644
> --- a/arch/um/os-Linux/signal.c
> +++ b/arch/um/os-Linux/signal.c
> @@ -36,7 +36,15 @@ static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc)
>   struct uml_pt_regs r;
>   int save_errno = errno;
>  
> - r.is_user = 0;
> +#ifndef CONFIG_MMU
> + memset(&r, 0, sizeof(r));
> + /* mark is_user=1 when the IP is from userspace code. */
> + if (mc && (REGS_IP(mc->gregs) > uml_reserved
> +    && REGS_IP(mc->gregs) < high_physmem))
> + r.is_user = 1;
> + else
> +#endif
> + r.is_user = 0;

Does this work if we load modules dynamically?

I suppose one could map them into a separate memory area rather than
running them directly from the physical memory.
Otherwise we'll also get problems with the SECCOMP filter.

>   if (sig == SIGSEGV) {
>   /* For segfaults, we want the data from the sigcontext. */
>   get_regs_from_mc(&r, mc);
> @@ -191,6 +199,7 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
>   ucontext_t *uc = p;
>   mcontext_t *mc = &uc->uc_mcontext;
>   unsigned long pending = 1UL << sig;
> + int is_segv = 0;
>  
>   do {
>   int nested, bail;
> @@ -214,6 +223,7 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
>  
>   while ((sig = ffs(pending)) != 0){
>   sig--;
> + is_segv = (sig == SIGSEGV) ? 1 : 0;
>   pending &= ~(1 << sig);
>   (*handlers[sig])(sig, (struct siginfo *)si, mc);
>   }
> @@ -227,6 +237,12 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
>   if (!nested)
>   pending = from_irq_stack(nested);
>   } while (pending);
> +
> +#ifndef CONFIG_MMU
> + /* if there is SIGSEGV notified, let the userspace run w/ __noreturn */
> + if (is_segv)
> + sigsegv_post_routine();
> +#endif
>  }

I am confused, this doesn't feel quite correct to me.

So, for normal UML, I think we always do an rt_sigreturn. Which means,
we always go back to the corresponding *kernel* task. To schedule in
response to SIGALRM, we forward the signal to the userspace process.
I believe that means:
   1. We cannot schedule kernel threads (that seems like a bug)
   2. Scheduling for userspace happens once the signal is delivered.
      Then userspace() saves the state and calls interrupt_end().


Now, keep in mind that we are on the separate signal stack here. If we
jump anywhere directly, we abandon the old state information stored by
the host kernel into the mcontext. We can absolutely do that, but we
need to be careful to not forget anything.

As such, I wonder whether nommu should:
   1. When entering from kernel, update "current->thread.switch_buf"
      from the mcontext.
       - If we need to schedule, push a stack frame that calls the scheduling
         code and returns with the correct state.
   2. When entering from user, store the task registers from the
      mcontext. At some point (here or earlier) ensure that the
      "current->thread.switch_buf" is set up so that we can return to
      userspace by restoring the task registers.
       - To schedule, piggy back on 1. or add special code.
   3. Always do a UML_LONGJMP() back into the "current" task.

That said, I probably don't have the full picture right now.

Benjamin

PS: On a further note, I think the current code to enter userspace
cannot handle single stepping. I suppose that is fine, but you should
probably set arch_has_single_step to 0 for nommu.

>  void set_handler(int sig)
> diff --git a/arch/x86/um/signal.c b/arch/x86/um/signal.c
> index 75087e85b6fd..b7365c75a967 100644
> --- a/arch/x86/um/signal.c
> +++ b/arch/x86/um/signal.c
> @@ -371,6 +371,13 @@ int setup_signal_stack_si(unsigned long stack_top, struct ksignal *ksig,
>   round_down(stack_top - sizeof(struct rt_sigframe), 16);
>  
>   /* Add required space for math frame */
> +#ifndef CONFIG_MMU
> + /*
> + * the sig_frame on !MMU needs be aligned for SSE as
> + * the frame is used as-is.
> + */
> + math_size = round_down(math_size, 16);
> +#endif
>   frame = (struct rt_sigframe __user *)((unsigned long)frame - math_size);
>  
>   /* Subtract 128 for a red zone and 8 for proper alignment */
> @@ -417,6 +424,18 @@ int setup_signal_stack_si(unsigned long stack_top, struct ksignal *ksig,
>   /* could use a vstub here */
>   return err;
>  
> +#ifndef CONFIG_MMU
> + /*
> + * we need to push handler address at top of stack, as
> + * __kernel_vsyscall, called after this returns with ret with
> + * stack contents, thus push the handler here.
> + */
> + frame = (struct rt_sigframe __user *) ((unsigned long) frame -
> +        sizeof(unsigned long));
> + err |= __put_user((unsigned long)ksig->ka.sa.sa_handler,
> +   (unsigned long *)frame);
> +#endif
> +
>   if (err)
>   return err;
>  
> @@ -442,9 +461,25 @@ SYSCALL_DEFINE0(rt_sigreturn)
>   unsigned long sp = PT_REGS_SP(&current->thread.regs);
>   struct rt_sigframe __user *frame =
>   (struct rt_sigframe __user *)(sp - sizeof(long));
> - struct ucontext __user *uc = &frame->uc;
> + struct ucontext __user *uc;
>   sigset_t set;
>  
> +#ifndef CONFIG_MMU
> + /**
> + * we enter here with:
> + *
> + * __restore_rt:
> + *     mov $15, %rax
> + *     call *%rax (translated from syscall)
> + *
> + * (code is from musl libc)
> + * so, stack needs to be popped of "call"ed address before
> + * looking at rt_sigframe.
> + */
> + frame = (struct rt_sigframe __user *)((unsigned long)frame + sizeof(long));
> +#endif
> + uc = &frame->uc;
> +
>   if (copy_from_user(&set, &uc->uc_sigmask, sizeof(set)))
>   goto segfault;
>  




* Re: [RFC PATCH v2 10/13] x86/um: nommu: signal handling
  2024-11-28 10:37     ` Benjamin Berg
@ 2024-12-01  1:38       ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-01  1:38 UTC (permalink / raw)
  To: benjamin; +Cc: linux-um, ricarkol, Liam.Howlett


Hello,

On Thu, 28 Nov 2024 19:37:21 +0900,
Benjamin Berg wrote:

> > +#ifndef CONFIG_MMU
> > + memset(&r, 0, sizeof(r));
> > + /* mark is_user=1 when the IP is from userspace code. */
> > + if (mc && (REGS_IP(mc->gregs) > uml_reserved
> > +    && REGS_IP(mc->gregs) < high_physmem))
> > + r.is_user = 1;
> > + else
> > +#endif
> > + r.is_user = 0;
> 
> Does this work if we load modules dynamically?
> 
> I suppose one could map them into a separate memory area rather than
> running them directly from the physical memory.
> Otherwise we'll also get problem with the SECCOMP filter.

currently, I thought modules use a separate area obtained from execmem,
but the nommu allocator ignores this location info when mapping the
memory; instead, modules get mixed into the area used by userspace
programs.

we may be able to come up with execmem_arch_setup() to fix this
situation.

so, no, this is_user detection doesn't work; modules also become
is_user=1.

The full MMU allocator (normal UML, and seccomp as well?) seems to be
fine as long as it uses execmem.

I will look into the details of how we should handle this.
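
To make the problem concrete, here is a minimal sketch of the range test
quoted above.  The struct and function names are hypothetical stand-ins
(reserved_end for uml_reserved, physmem_end for high_physmem); it only
models the comparison, not the real handler.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's uml_reserved and high_physmem. */
struct demo_layout {
	unsigned long reserved_end;	/* end of the kernel's reserved area */
	unsigned long physmem_end;	/* end of UML "physical" memory */
};

/*
 * Mirrors the test in the patch: an instruction pointer counts as
 * "user" when it lies strictly between the two bounds.
 */
static bool demo_ip_is_user(const struct demo_layout *l, unsigned long ip)
{
	return ip > l->reserved_end && ip < l->physmem_end;
}
```

Since the nommu allocator places module text inside the same range, any
module address passed to such a check would also be reported as user,
which is why a dedicated area for modules would be needed.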

> >   if (sig == SIGSEGV) {
> >   /* For segfaults, we want the data from the sigcontext. */
> >   get_regs_from_mc(&r, mc);
> > @@ -191,6 +199,7 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
> >   ucontext_t *uc = p;
> >   mcontext_t *mc = &uc->uc_mcontext;
> >   unsigned long pending = 1UL << sig;
> > + int is_segv = 0;
> >  
> >   do {
> >   int nested, bail;
> > @@ -214,6 +223,7 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
> >  
> >   while ((sig = ffs(pending)) != 0){
> >   sig--;
> > + is_segv = (sig == SIGSEGV) ? 1 : 0;
> >   pending &= ~(1 << sig);
> >   (*handlers[sig])(sig, (struct siginfo *)si, mc);
> >   }
> > @@ -227,6 +237,12 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
> >   if (!nested)
> >   pending = from_irq_stack(nested);
> >   } while (pending);
> > +
> > +#ifndef CONFIG_MMU
> > + /* if there is SIGSEGV notified, let the userspace run w/ __noreturn */
> > + if (is_segv)
> > + sigsegv_post_routine();
> > +#endif
> >  }
> 
> I am confused, this doesn't feel quite correct to me.

thanks for pointing this out.  the above code, which I found to work
under nommu, is indeed suspicious and doesn't look right.

the signal handling in this patch is immature, and needs more work to
understand the existing code, the nommu characteristics, etc.

> So, for normal UML, I think we always do an rt_sigreturn. Which means,
> we always go back to the corresponding *kernel* task. To schedule in
> response to SIGALRM, we forward the signal to the userspace process.
> I believe that means:
>    1. We cannot schedule kernel threads (that seems like a bug)
>    2. Scheduling for userspace happens once the signal is delivered.
>       Then userspace() saves the state and calls interrupt_end().
> 
> 
> Now, keep in mind that we are on the separate signal stack here. If we
> jump anywhere directly, we abandon the old state information stored by
> the host kernel into the mcontext. We can absolutely do that, but we
> need to be careful to not forget anything.
> 
> As such, I wonder whether nommu should:
>    1. When entering from kernel, update "current->thread.switch_buf"
>       from the mcontext.
>        - If we need to schedule, push a stack frame that calls the scheduling
>          code and returns with the correct state.
>    2. When entering from user, store the task registers from the
>       mcontext. At some point (here or earlier) ensure that the
>       "current->thread.switch_buf" is set up so that we can return to
>       userspace by restoring the task registers.
>        - To schedule, piggy back on 1. or add special code.
>    3. Always do a UML_LONGJMP() back into the "current" task.

thanks, upon SIGSEGV the current code jumps out of the signal handler
and unblocks signals without returning from the handler (and without
calling rt_sigreturn on the host either), which should not work, as you
mentioned.

I will also investigate how I can handle this.

> That said, I am probably not having the full picture right now.
> 
> Benjamin
> 
> PS: On a further note, I think the current code to enter userspace
> cannot handle single stepping. I suppose that is fine, but you should
> probably set arch_has_single_step to 0 for nommu.

I did almost no testing with ptrace(2) (inside nommu UML) and may be
missing a lot of features that MMU UML supports.  I will also look into
that.

thanks,

-- Hajime



* [PATCH v3 00/13] nommu UML
  2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
                     ` (15 preceding siblings ...)
  2024-11-23  7:27   ` David Gow
@ 2024-12-03  4:22   ` Hajime Tazaki
  2024-12-03  4:23     ` [PATCH v3 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec Hajime Tazaki
                       ` (13 more replies)
  16 siblings, 14 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-03  4:22 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

This is a series of patches adding a nommu mode to the UML arch.
Comments and opinions on this are welcome.

There are still several limitations/issues which we have already found;
here is the list.

- there is no mechanism implemented to cache the mapped memory of
  exec(2); thus files are always read from the filesystem upon every
  exec, which makes some benchmarks (lmbench) slow.
- memory mapped by loadable modules is not distinguished from
  userspace memory.

-- Hajime

v3:
- add seccomp-based syscall hook in addition to zpoline [06/13]
- remove RFC, add a line to MAINTAINERS file
- fix kernel test robot warnings [02/13,08/13,10/13]
- add base-commit tag to cover letter
- pull the latest uml/next
- clean up SIGSEGV handling [10/13]
- detect fsgsbase availability with elf aux vector [08/13]
- simplify vdso code with macros [09/13]

RFC v2:
- https://lore.kernel.org/linux-um/cover.1731290567.git.thehajime@gmail.com/
- base branch is now uml/linux.git instead of torvalds/linux.git.
- reorganize the patch series to clean up
- fixed various coding styles issues
- clean up exec code path [07/13]
- fixed the crash/SIGSEGV case on userspace programs [10/13]
- add seccomp filter to limit syscall caller address [06/13]
- detect fsgsbase availability with sigsetjmp/siglongjmp [08/13]
- removes unrelated changes
- removes unneeded ifndef CONFIG_MMU
- convert UML_CONFIG_MMU to CONFIG_MMU as using uml/linux.git
- proposed a patch of maple-tree issue (resolving a limitation in RFC v1)
  https://lore.kernel.org/linux-mm/20241108222834.3625217-1-thehajime@gmail.com/

RFC:
- https://lore.kernel.org/linux-um/cover.1729770373.git.thehajime@gmail.com/

Hajime Tazaki (13):
  fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec
  x86/um: nommu: elf loader for fdpic
  um: nommu: memory handling
  x86/um: nommu: syscall handling
  x86/um: nommu: syscall translation by zpoline
  um: nommu: syscalls handler from userspace by seccomp filter
  x86/um: nommu: process/thread handling
  um: nommu: configure fs register on host syscall invocation
  x86/um/vdso: nommu: vdso memory update
  x86/um: nommu: signal handling
  um: change machine name for uname output
  um: nommu: add documentation of nommu UML
  um: nommu: plug nommu code into build system

 Documentation/virt/uml/nommu-uml.rst    | 230 +++++++++++++++++++++++
 MAINTAINERS                             |   1 +
 arch/um/Kconfig                         |  14 +-
 arch/um/Makefile                        |   6 +
 arch/um/configs/x86_64_nommu_defconfig  |  64 +++++++
 arch/um/include/asm/Kbuild              |   1 +
 arch/um/include/asm/futex.h             |   4 +
 arch/um/include/asm/mmu.h               |   8 +
 arch/um/include/asm/mmu_context.h       |  13 +-
 arch/um/include/asm/ptrace-generic.h    |   6 +
 arch/um/include/asm/tlbflush.h          |  22 +++
 arch/um/include/asm/uaccess.h           |   7 +-
 arch/um/include/shared/kern_util.h      |   2 +
 arch/um/include/shared/os.h             |  15 ++
 arch/um/kernel/Makefile                 |   3 +-
 arch/um/kernel/mem.c                    |  12 +-
 arch/um/kernel/physmem.c                |   6 +
 arch/um/kernel/process.c                |  33 +++-
 arch/um/kernel/skas/Makefile            |   4 +-
 arch/um/kernel/trap.c                   |  16 ++
 arch/um/kernel/um_arch.c                |   4 +
 arch/um/os-Linux/main.c                 |   5 +
 arch/um/os-Linux/process.c              |  96 +++++++++-
 arch/um/os-Linux/signal.c               |  38 +++-
 arch/um/os-Linux/skas/process.c         |   4 +
 arch/um/os-Linux/start_up.c             |  20 ++
 arch/um/os-Linux/util.c                 |   3 +-
 arch/x86/um/Makefile                    |  18 ++
 arch/x86/um/asm/elf.h                   |  11 +-
 arch/x86/um/asm/module.h                |  24 ---
 arch/x86/um/asm/processor.h             |  12 ++
 arch/x86/um/do_syscall_64.c             | 109 +++++++++++
 arch/x86/um/entry_64.S                  | 108 +++++++++++
 arch/x86/um/os-Linux/mcontext.c         |  22 +++
 arch/x86/um/shared/sysdep/mcontext.h    |   4 +
 arch/x86/um/shared/sysdep/syscalls_64.h |   6 +
 arch/x86/um/signal.c                    |  37 +++-
 arch/x86/um/syscalls_64.c               |  69 +++++++
 arch/x86/um/vdso/um_vdso.c              |  41 ++--
 arch/x86/um/vdso/vma.c                  |  14 ++
 arch/x86/um/zpoline.c                   | 238 ++++++++++++++++++++++++
 fs/Kconfig.binfmt                       |   2 +-
 fs/binfmt_elf_fdpic.c                   |  10 +
 include/linux/elf-fdpic.h               |   3 +
 44 files changed, 1307 insertions(+), 58 deletions(-)
 create mode 100644 Documentation/virt/uml/nommu-uml.rst
 create mode 100644 arch/um/configs/x86_64_nommu_defconfig
 delete mode 100644 arch/x86/um/asm/module.h
 create mode 100644 arch/x86/um/do_syscall_64.c
 create mode 100644 arch/x86/um/entry_64.S
 create mode 100644 arch/x86/um/zpoline.c


base-commit: bed2cc482600296fe04edbc38005ba2851449c10
-- 
2.43.0




* [PATCH v3 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
@ 2024-12-03  4:23     ` Hajime Tazaki
  2024-12-03  4:23     ` [PATCH v3 02/13] x86/um: nommu: elf loader for fdpic Hajime Tazaki
                       ` (12 subsequent siblings)
  13 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-03  4:23 UTC (permalink / raw)
  To: linux-um
  Cc: thehajime, ricarkol, Liam.Howlett, Alexander Viro,
	Christian Brauner, Jan Kara, Eric Biederman, Kees Cook,
	linux-fsdevel, linux-mm

Add an architecture hook to the FDPIC ELF loader, called at the end of
loading binaries, to finalize the mapped memory before moving on to the
exec function.  The hook is used by UML under !MMU to translate
syscall/sysenter instructions before calling execve.
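
The pattern can be sketched in plain C as follows.  This is only an
illustration under assumptions: the demo_* names and the struct are
hypothetical stand-ins for elf_arch_finalize_exec() and struct
elf_fdpic_params, and the weak attribute is spelled out the way GCC and
Clang accept it in userspace.

```c
/* Hypothetical stand-in for struct elf_fdpic_params. */
struct demo_params {
	unsigned long entry_addr;
};

/*
 * Default hook: does nothing and reports success.  An architecture
 * can provide a non-weak definition with the same name in another
 * file, and the linker will pick that one instead.
 */
__attribute__((weak)) int demo_arch_finalize_exec(struct demo_params *exec_params,
						  struct demo_params *interp_params)
{
	(void)exec_params;
	(void)interp_params;
	return 0;
}

/*
 * Caller: run the hook once loading is done, bail out on error,
 * then choose the entry address (interpreter first, if present).
 */
static int demo_load(struct demo_params *exec_params,
		     struct demo_params *interp_params,
		     unsigned long *entry)
{
	int retval = demo_arch_finalize_exec(exec_params, interp_params);

	if (retval)
		return retval;

	*entry = interp_params->entry_addr ? interp_params->entry_addr
					   : exec_params->entry_addr;
	return 0;
}
```

With the weak default, architectures that do not need the hook are
unaffected, while an architecture that does can override it without
touching the generic loader.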

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <kees@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 fs/binfmt_elf_fdpic.c     | 10 ++++++++++
 include/linux/elf-fdpic.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 4fe5bb9f1b1f..ab16fdf475b0 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -175,6 +175,12 @@ static int elf_fdpic_fetch_phdrs(struct elf_fdpic_params *params,
 	return 0;
 }
 
+int __weak elf_arch_finalize_exec(struct elf_fdpic_params *exec_params,
+				  struct elf_fdpic_params *interp_params)
+{
+	return 0;
+}
+
 /*****************************************************************************/
 /*
  * load an fdpic binary into various bits of memory
@@ -457,6 +463,10 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
 			    dynaddr);
 #endif
 
+	retval = elf_arch_finalize_exec(&exec_params, &interp_params);
+	if (retval)
+		goto error;
+
 	finalize_exec(bprm);
 	/* everything is now ready... get the userspace context ready to roll */
 	entryaddr = interp_params.entry_addr ?: exec_params.entry_addr;
diff --git a/include/linux/elf-fdpic.h b/include/linux/elf-fdpic.h
index e533f4513194..e7fd85a1d10f 100644
--- a/include/linux/elf-fdpic.h
+++ b/include/linux/elf-fdpic.h
@@ -56,4 +56,7 @@ extern void elf_fdpic_arch_lay_out_mm(struct elf_fdpic_params *exec_params,
 				      unsigned long *start_brk);
 #endif
 
+extern int elf_arch_finalize_exec(struct elf_fdpic_params *exec_params,
+				  struct elf_fdpic_params *interp_params);
+
 #endif /* _LINUX_ELF_FDPIC_H */
-- 
2.43.0




* [PATCH v3 02/13] x86/um: nommu: elf loader for fdpic
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
  2024-12-03  4:23     ` [PATCH v3 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec Hajime Tazaki
@ 2024-12-03  4:23     ` Hajime Tazaki
  2024-12-04 16:20       ` Johannes Berg
  2024-12-03  4:23     ` [PATCH v3 03/13] um: nommu: memory handling Hajime Tazaki
                       ` (11 subsequent siblings)
  13 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-03  4:23 UTC (permalink / raw)
  To: linux-um
  Cc: thehajime, ricarkol, Liam.Howlett, Eric Biederman, Kees Cook,
	Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
	linux-fsdevel

As UML now supports the CONFIG_MMU=n case, it has to use an alternate
ELF loader, the FDPIC ELF loader.  This commit adds the necessary
definitions to the arch, as UML has not used this loader so far.  It
also updates the Kconfig file to allow BINFMT_ELF_FDPIC under the !MMU
environment.

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/include/asm/Kbuild           |  1 +
 arch/um/include/asm/mmu.h            |  5 +++++
 arch/um/include/asm/ptrace-generic.h |  6 ++++++
 arch/x86/um/asm/elf.h                |  8 ++++++--
 arch/x86/um/asm/module.h             | 24 ------------------------
 fs/Kconfig.binfmt                    |  2 +-
 6 files changed, 19 insertions(+), 27 deletions(-)
 delete mode 100644 arch/x86/um/asm/module.h

diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index 428f2c5158c2..04ab3b653a48 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -13,6 +13,7 @@ generic-y += irq_work.h
 generic-y += kdebug.h
 generic-y += mcs_spinlock.h
 generic-y += mmiowb.h
+generic-y += module.h
 generic-y += module.lds.h
 generic-y += param.h
 generic-y += parport.h
diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index a3eaca41ff61..01422b761aa0 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -14,6 +14,11 @@ typedef struct mm_context {
 	/* Address range in need of a TLB sync */
 	unsigned long sync_tlb_range_from;
 	unsigned long sync_tlb_range_to;
+
+#ifdef CONFIG_BINFMT_ELF_FDPIC
+	unsigned long   exec_fdpic_loadmap;
+	unsigned long   interp_fdpic_loadmap;
+#endif
 } mm_context_t;
 
 #endif
diff --git a/arch/um/include/asm/ptrace-generic.h b/arch/um/include/asm/ptrace-generic.h
index 4696f24d1492..4ff844bcb1cd 100644
--- a/arch/um/include/asm/ptrace-generic.h
+++ b/arch/um/include/asm/ptrace-generic.h
@@ -29,6 +29,12 @@ struct pt_regs {
 
 #define PTRACE_OLDSETOPTIONS 21
 
+#ifdef CONFIG_BINFMT_ELF_FDPIC
+#define PTRACE_GETFDPIC		31
+#define PTRACE_GETFDPIC_EXEC	0
+#define PTRACE_GETFDPIC_INTERP	1
+#endif
+
 struct task_struct;
 
 extern long subarch_ptrace(struct task_struct *child, long request,
diff --git a/arch/x86/um/asm/elf.h b/arch/x86/um/asm/elf.h
index 62ed5d68a978..33f69f1eac10 100644
--- a/arch/x86/um/asm/elf.h
+++ b/arch/x86/um/asm/elf.h
@@ -9,6 +9,7 @@
 #include <skas.h>
 
 #define CORE_DUMP_USE_REGSET
+#define ELF_FDPIC_CORE_EFLAGS  0
 
 #ifdef CONFIG_X86_32
 
@@ -190,8 +191,11 @@ extern int arch_setup_additional_pages(struct linux_binprm *bprm,
 
 extern unsigned long um_vdso_addr;
 #define AT_SYSINFO_EHDR 33
-#define ARCH_DLINFO	NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr)
-
+#define ARCH_DLINFO						\
+do {								\
+	NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr);		\
+	NEW_AUX_ENT(AT_MINSIGSTKSZ, 0);			\
+} while (0)
 #endif
 
 typedef unsigned long elf_greg_t;
diff --git a/arch/x86/um/asm/module.h b/arch/x86/um/asm/module.h
deleted file mode 100644
index a3b061d66082..000000000000
--- a/arch/x86/um/asm/module.h
+++ /dev/null
@@ -1,24 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __UM_MODULE_H
-#define __UM_MODULE_H
-
-/* UML is simple */
-struct mod_arch_specific
-{
-};
-
-#ifdef CONFIG_X86_32
-
-#define Elf_Shdr Elf32_Shdr
-#define Elf_Sym Elf32_Sym
-#define Elf_Ehdr Elf32_Ehdr
-
-#else
-
-#define Elf_Shdr Elf64_Shdr
-#define Elf_Sym Elf64_Sym
-#define Elf_Ehdr Elf64_Ehdr
-
-#endif
-
-#endif
diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index bd2f530e5740..419ba0282806 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -58,7 +58,7 @@ config ARCH_USE_GNU_PROPERTY
 config BINFMT_ELF_FDPIC
 	bool "Kernel support for FDPIC ELF binaries"
 	default y if !BINFMT_ELF
-	depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
+	depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
 	select ELFCORE
 	help
 	  ELF FDPIC binaries are based on ELF, but allow the individual load
-- 
2.43.0




* [PATCH v3 03/13] um: nommu: memory handling
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
  2024-12-03  4:23     ` [PATCH v3 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec Hajime Tazaki
  2024-12-03  4:23     ` [PATCH v3 02/13] x86/um: nommu: elf loader for fdpic Hajime Tazaki
@ 2024-12-03  4:23     ` Hajime Tazaki
  2024-12-04 16:34       ` Johannes Berg
  2024-12-03  4:23     ` [PATCH v3 04/13] x86/um: nommu: syscall handling Hajime Tazaki
                       ` (10 subsequent siblings)
  13 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-03  4:23 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

This commit adds memory operations on UML under the !MMU environment.

Some parts of the original UML code that rely on CONFIG_MMU are excluded
from compilation when !CONFIG_MMU.  Additionally, generic functions such as
uaccess, futex, memcpy/strnlen/strncpy can be used, as user- and
kernel-space share the address space in !CONFIG_MMU mode.
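
As a rough illustration of why the generic helpers suffice, a user copy
in a shared address space can reduce to a plain memory copy.  The
function below is a hypothetical userspace model, not the kernel's code;
it only assumes the usual convention that such helpers return the number
of bytes that could not be copied.

```c
#include <string.h>

/*
 * Hypothetical model of a user copy when user- and kernel-space
 * share one address space: the "user" pointer is directly usable,
 * so the copy is a memcpy().  Returns the number of bytes NOT
 * copied, i.e. 0 on success.
 */
static unsigned long demo_raw_copy_from_user(void *to, const void *from,
					     unsigned long n)
{
	memcpy(to, from, n);
	return 0;
}
```

No page-table walk or fault fixup appears in this model, which matches
the idea that the MMU-specific uaccess code can be left out of the
build.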

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/include/asm/futex.h       |  4 ++++
 arch/um/include/asm/mmu.h         |  3 +++
 arch/um/include/asm/mmu_context.h | 13 +++++++++++--
 arch/um/include/asm/tlbflush.h    | 22 ++++++++++++++++++++++
 arch/um/include/asm/uaccess.h     |  7 ++++---
 arch/um/include/shared/os.h       |  6 ++++++
 arch/um/kernel/Makefile           |  3 ++-
 arch/um/kernel/mem.c              | 12 +++++++++++-
 arch/um/kernel/physmem.c          |  6 ++++++
 arch/um/kernel/skas/Makefile      |  4 ++--
 arch/um/kernel/trap.c             |  4 ++++
 arch/um/os-Linux/process.c        |  4 ++--
 12 files changed, 77 insertions(+), 11 deletions(-)

diff --git a/arch/um/include/asm/futex.h b/arch/um/include/asm/futex.h
index 780aa6bfc050..89a8ac0b6963 100644
--- a/arch/um/include/asm/futex.h
+++ b/arch/um/include/asm/futex.h
@@ -8,7 +8,11 @@
 
 
 int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr);
+#ifdef CONFIG_MMU
 int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 			      u32 oldval, u32 newval);
+#else
+#include <asm-generic/futex.h>
+#endif
 
 #endif
diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index 01422b761aa0..d4087f9499e2 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -15,10 +15,13 @@ typedef struct mm_context {
 	unsigned long sync_tlb_range_from;
 	unsigned long sync_tlb_range_to;
 
+#ifndef CONFIG_MMU
+	unsigned long   end_brk;
 #ifdef CONFIG_BINFMT_ELF_FDPIC
 	unsigned long   exec_fdpic_loadmap;
 	unsigned long   interp_fdpic_loadmap;
 #endif
+#endif /* !CONFIG_MMU */
 } mm_context_t;
 
 #endif
diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h
index 23dcc914d44e..da287e8c86b3 100644
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -37,10 +37,19 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 }
 
 #define init_new_context init_new_context
-extern int init_new_context(struct task_struct *task, struct mm_struct *mm);
-
 #define destroy_context destroy_context
+#ifdef CONFIG_MMU
+extern int init_new_context(struct task_struct *task, struct mm_struct *mm);
 extern void destroy_context(struct mm_struct *mm);
+#else
+static inline int init_new_context(struct task_struct *task, struct mm_struct *mm)
+{
+	return 0;
+}
+static inline void destroy_context(struct mm_struct *mm)
+{
+}
+#endif
 
 #include <asm-generic/mmu_context.h>
 
diff --git a/arch/um/include/asm/tlbflush.h b/arch/um/include/asm/tlbflush.h
index 13a3009942be..9157f71695c6 100644
--- a/arch/um/include/asm/tlbflush.h
+++ b/arch/um/include/asm/tlbflush.h
@@ -30,6 +30,7 @@
  *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
  */
 
+#ifdef CONFIG_MMU
 extern int um_tlb_sync(struct mm_struct *mm);
 
 extern void flush_tlb_all(void);
@@ -55,5 +56,26 @@ static inline void flush_tlb_kernel_range(unsigned long start,
 	/* Kernel needs to be synced immediately */
 	um_tlb_sync(&init_mm);
 }
+#else
+static inline int um_tlb_sync(struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline void flush_tlb_page(struct vm_area_struct *vma,
+				  unsigned long address)
+{
+}
+
+static inline void flush_tlb_range(struct vm_area_struct *vma,
+				   unsigned long start, unsigned long end)
+{
+}
+
+static inline void flush_tlb_kernel_range(unsigned long start,
+					  unsigned long end)
+{
+}
+#endif
 
 #endif
diff --git a/arch/um/include/asm/uaccess.h b/arch/um/include/asm/uaccess.h
index 1d4b6bbc1b65..9bfee12cb6b7 100644
--- a/arch/um/include/asm/uaccess.h
+++ b/arch/um/include/asm/uaccess.h
@@ -22,6 +22,7 @@
 #define __addr_range_nowrap(addr, size) \
 	((unsigned long) (addr) <= ((unsigned long) (addr) + (size)))
 
+#ifdef CONFIG_MMU
 extern unsigned long raw_copy_from_user(void *to, const void __user *from, unsigned long n);
 extern unsigned long raw_copy_to_user(void __user *to, const void *from, unsigned long n);
 extern unsigned long __clear_user(void __user *mem, unsigned long len);
@@ -33,9 +34,6 @@ static inline int __access_ok(const void __user *ptr, unsigned long size);
 
 #define INLINE_COPY_FROM_USER
 #define INLINE_COPY_TO_USER
-
-#include <asm-generic/uaccess.h>
-
 static inline int __access_ok(const void __user *ptr, unsigned long size)
 {
 	unsigned long addr = (unsigned long)ptr;
@@ -43,6 +41,9 @@ static inline int __access_ok(const void __user *ptr, unsigned long size)
 		(__under_task_size(addr, size) ||
 		 __access_ok_vsyscall(addr, size));
 }
+#endif
+
+#include <asm-generic/uaccess.h>
 
 /* no pagefaults for kernel addresses in um */
 #define __get_kernel_nofault(dst, src, type, err_label)			\
diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index 5babad8c5f75..6874be0c38a8 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -195,7 +195,13 @@ extern void get_host_cpu_features(
 extern int create_mem_file(unsigned long long len);
 
 /* tlb.c */
+#ifdef CONFIG_MMU
 extern void report_enomem(void);
+#else
+static inline void report_enomem(void)
+{
+}
+#endif
 
 /* process.c */
 extern void os_alarm_process(int pid);
diff --git a/arch/um/kernel/Makefile b/arch/um/kernel/Makefile
index f8567b933ffa..b41e9bcabbe3 100644
--- a/arch/um/kernel/Makefile
+++ b/arch/um/kernel/Makefile
@@ -16,9 +16,10 @@ extra-y := vmlinux.lds
 
 obj-y = config.o exec.o exitcode.o irq.o ksyms.o mem.o \
 	physmem.o process.o ptrace.o reboot.o sigio.o \
-	signal.o sysrq.o time.o tlb.o trap.o \
+	signal.o sysrq.o time.o trap.o \
 	um_arch.o umid.o maccess.o kmsg_dump.o capflags.o skas/
 obj-y += load_file.o
+obj-$(CONFIG_MMU) += tlb.o
 
 obj-$(CONFIG_BLK_DEV_INITRD) += initrd.o
 obj-$(CONFIG_GPROF)	+= gprof_syms.o
diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 53248ed04771..b674017d9871 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -64,7 +64,8 @@ void __init mem_init(void)
 	 * to be turned on.
 	 */
 	brk_end = (unsigned long) UML_ROUND_UP(sbrk(0));
-	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 0);
+	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1,
+		   !IS_ENABLED(CONFIG_MMU));
 	memblock_free((void *)brk_end, uml_reserved - brk_end);
 	uml_reserved = brk_end;
 
@@ -78,6 +79,7 @@ void __init mem_init(void)
  * Create a page table and place a pointer to it in a middle page
  * directory entry.
  */
+#ifdef CONFIG_MMU
 static void __init one_page_table_init(pmd_t *pmd)
 {
 	if (pmd_none(*pmd)) {
@@ -149,6 +151,12 @@ static void __init fixrange_init(unsigned long start, unsigned long end,
 		j = 0;
 	}
 }
+#else
+static void __init fixrange_init(unsigned long start, unsigned long end,
+				 pgd_t *pgd_base)
+{
+}
+#endif
 
 static void __init fixaddr_user_init( void)
 {
@@ -230,6 +238,7 @@ void *uml_kmalloc(int size, int flags)
 	return kmalloc(size, flags);
 }
 
+#ifdef CONFIG_MMU
 static const pgprot_t protection_map[16] = {
 	[VM_NONE]					= PAGE_NONE,
 	[VM_READ]					= PAGE_READONLY,
@@ -249,3 +258,4 @@ static const pgprot_t protection_map[16] = {
 	[VM_SHARED | VM_EXEC | VM_WRITE | VM_READ]	= PAGE_SHARED
 };
 DECLARE_VM_GET_PAGE_PROT
+#endif
diff --git a/arch/um/kernel/physmem.c b/arch/um/kernel/physmem.c
index a74f17b033c4..f55d46dbe173 100644
--- a/arch/um/kernel/physmem.c
+++ b/arch/um/kernel/physmem.c
@@ -84,7 +84,11 @@ void __init setup_physmem(unsigned long start, unsigned long reserve_end,
 		exit(1);
 	}
 
+#ifdef CONFIG_MMU
 	physmem_fd = create_mem_file(len);
+#else
+	physmem_fd = -1;
+#endif
 
 	err = os_map_memory((void *) reserve_end, physmem_fd, reserve,
 			    map_size, 1, 1, 1);
@@ -95,12 +99,14 @@ void __init setup_physmem(unsigned long start, unsigned long reserve_end,
 		exit(1);
 	}
 
+#ifdef CONFIG_MMU
 	/*
 	 * Special kludge - This page will be mapped in to userspace processes
 	 * from physmem_fd, so it needs to be written out there.
 	 */
 	os_seek_file(physmem_fd, __pa(__syscall_stub_start));
 	os_write_file(physmem_fd, __syscall_stub_start, PAGE_SIZE);
+#endif
 
 	memblock_add(__pa(start), len);
 	memblock_reserve(__pa(start), reserve);
diff --git a/arch/um/kernel/skas/Makefile b/arch/um/kernel/skas/Makefile
index 3384be42691f..64d7ba803b1a 100644
--- a/arch/um/kernel/skas/Makefile
+++ b/arch/um/kernel/skas/Makefile
@@ -3,8 +3,8 @@
 # Copyright (C) 2002 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
 #
 
-obj-y := stub.o mmu.o process.o syscall.o uaccess.o \
-	 stub_exe_embed.o
+obj-y := stub.o process.o stub_exe_embed.o
+obj-$(CONFIG_MMU) += mmu.o syscall.o uaccess.o
 
 # Stub executable
 
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index cdaee3e94273..a7519b3de4bf 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -24,6 +24,7 @@
 int handle_page_fault(unsigned long address, unsigned long ip,
 		      int is_write, int is_user, int *code_out)
 {
+#ifdef CONFIG_MMU
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	pmd_t *pmd;
@@ -129,6 +130,9 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 		goto out_nosemaphore;
 	pagefault_out_of_memory();
 	return 0;
+#else
+	return -EFAULT;
+#endif
 }
 
 static void show_segv_info(struct uml_pt_regs *regs)
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index 9f086f939420..ef1a2f0aa06a 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -63,8 +63,8 @@ int os_map_memory(void *virt, int fd, unsigned long long off, unsigned long len,
 	prot = (r ? PROT_READ : 0) | (w ? PROT_WRITE : 0) |
 		(x ? PROT_EXEC : 0);
 
-	loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED,
-		     fd, off);
+	loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED |
+		     (!IS_ENABLED(CONFIG_MMU) ? MAP_ANONYMOUS : 0), fd, off);
 	if (loc == MAP_FAILED)
 		return -errno;
 	return 0;
-- 
2.43.0




* [PATCH v3 04/13] x86/um: nommu: syscall handling
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
                       ` (2 preceding siblings ...)
  2024-12-03  4:23     ` [PATCH v3 03/13] um: nommu: memory handling Hajime Tazaki
@ 2024-12-03  4:23     ` Hajime Tazaki
  2024-12-04 16:37       ` Johannes Berg
  2024-12-03  4:23     ` [PATCH v3 05/13] x86/um: nommu: syscall translation by zpoline Hajime Tazaki
                       ` (9 subsequent siblings)
  13 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-03  4:23 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

This commit introduces an entry point of the syscall interface for !MMU
mode.  It uses an entry function, __kernel_vsyscall, a kernel-wide global
symbol accessible from any location.

Although it isn't in the scope of this commit, it can also be exposed
via the vdso image, which is directly accessible from userspace.  A
standard library (i.e., libc) can utilize this entry point to implement
syscall wrappers; we can also use it to hook syscalls for unmodified
userspace applications/libraries, which will be implemented in the
subsequent commit.

This only supports the 64-bit mode of the x86 architecture.
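
The dispatch step behind such an entry point can be sketched as a table
lookup keyed by the syscall number.  Everything named demo_* below is
hypothetical and only models the idea in userspace; the two-argument
handlers stand in for the real syscall table.

```c
#include <errno.h>

typedef long (*demo_syscall_fn)(long a0, long a1);

static long demo_add(long a0, long a1)
{
	return a0 + a1;
}

static long demo_sub(long a0, long a1)
{
	return a0 - a1;
}

/* Hypothetical table, indexed by syscall number. */
static const demo_syscall_fn demo_table[] = { demo_add, demo_sub };

#define DEMO_NR_SYSCALLS (sizeof(demo_table) / sizeof(demo_table[0]))

/*
 * Validate the number before touching the table, then call the
 * handler and hand its result back as the return value.
 */
static long demo_do_syscall(unsigned long nr, long a0, long a1)
{
	if (nr >= DEMO_NR_SYSCALLS)
		return -ENOSYS;

	return demo_table[nr](a0, a1);
}
```

A caller such as a libc wrapper would place the number and arguments in
the agreed registers and read the result back from the return register;
here that is modeled by ordinary function arguments.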

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/x86/um/do_syscall_64.c             | 37 +++++++++++
 arch/x86/um/entry_64.S                  | 87 +++++++++++++++++++++++++
 arch/x86/um/shared/sysdep/syscalls_64.h |  6 ++
 3 files changed, 130 insertions(+)
 create mode 100644 arch/x86/um/do_syscall_64.c
 create mode 100644 arch/x86/um/entry_64.S

diff --git a/arch/x86/um/do_syscall_64.c b/arch/x86/um/do_syscall_64.c
new file mode 100644
index 000000000000..5d0fa83e7fdc
--- /dev/null
+++ b/arch/x86/um/do_syscall_64.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/kernel.h>
+#include <linux/ptrace.h>
+#include <kern_util.h>
+#include <sysdep/syscalls.h>
+#include <os.h>
+
+__visible void do_syscall_64(struct pt_regs *regs)
+{
+	int syscall;
+
+	syscall = PT_SYSCALL_NR(regs->regs.gp);
+	UPT_SYSCALL_NR(&regs->regs) = syscall;
+
+	pr_debug("syscall(%d) (current=%lx) (fn=%lx)\n",
+		 syscall, (unsigned long)current,
+		 (unsigned long)sys_call_table[syscall]);
+
+	if (likely(syscall < NR_syscalls)) {
+		PT_REGS_SET_SYSCALL_RETURN(regs,
+				EXECUTE_SYSCALL(syscall, regs));
+	}
+
+	pr_debug("syscall(%d) --> %lx\n", syscall,
+		regs->regs.gp[HOST_AX]);
+
+	PT_REGS_SYSCALL_RET(regs) = regs->regs.gp[HOST_AX];
+
+	/* execve succeeded */
+	if (syscall == __NR_execve && regs->regs.gp[HOST_AX] == 0)
+		userspace(&current->thread.regs.regs);
+
+	/* force do_signal() --> is_syscall() */
+	set_thread_flag(TIF_SIGPENDING);
+	interrupt_end();
+}
diff --git a/arch/x86/um/entry_64.S b/arch/x86/um/entry_64.S
new file mode 100644
index 000000000000..022a8122690b
--- /dev/null
+++ b/arch/x86/um/entry_64.S
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/errno.h>
+
+#include <linux/linkage.h>
+#include <asm/percpu.h>
+#include <asm/desc.h>
+
+#include "../entry/calling.h"
+
+#ifdef CONFIG_SMP
+#error need to stash these variables somewhere else
+#endif
+
+#define UM_GLOBAL_VAR(x) .data; .align 8; .globl x; x:; .long 0
+
+UM_GLOBAL_VAR(current_top_of_stack)
+UM_GLOBAL_VAR(current_ptregs)
+
+.code64
+.section .entry.text, "ax"
+
+.align 8
+#undef ENTRY
+#define ENTRY(x) .text; .globl x; .type x,%function; x:
+#undef END
+#define END(x)   .size x, . - x
+
+/*
+ * %rcx has the return address (we set it like that in zpoline trampoline).
+ *
+ * Registers on entry:
+ * rax  system call number
+ * rcx  return address
+ * rdi  arg0
+ * rsi  arg1
+ * rdx  arg2
+ * r10  arg3
+ * r8   arg4
+ * r9   arg5
+ *
+ * (note: we are allowed to mess with r11: r11 is callee-clobbered
+ * register in C ABI)
+ */
+ENTRY(__kernel_vsyscall)
+
+	movq	%rsp, %r11
+
+	/* Point rsp to the top of the ptregs array, so we can
+           just fill it with a bunch of push'es. */
+	movq	current_ptregs, %rsp
+
+	/* 8 bytes * 20 registers (plus 8 for the push) */
+	addq	$168, %rsp
+
+	/* Construct struct pt_regs on stack */
+	pushq	$0		/* pt_regs->ss (index 20) */
+	pushq   %r11		/* pt_regs->sp */
+	pushfq			/* pt_regs->flags */
+	pushq	$0		/* pt_regs->cs */
+	pushq	%rcx		/* pt_regs->ip */
+	pushq	%rax		/* pt_regs->orig_ax */
+
+	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
+
+	mov %rsp, %rdi
+
+	/*
+	 * Switch to current top of stack, so "current->" points
+	 * to the right task.
+	 */
+	movq	current_top_of_stack, %rsp
+
+	call	do_syscall_64
+
+	movq	current_ptregs, %rsp
+
+	POP_REGS
+
+	addq	$8, %rsp	/* skip orig_ax */
+	addq	$8, %rsp	/* skip ip */
+	addq	$8, %rsp	/* skip cs */
+	addq	$8, %rsp	/* skip flags */
+	popq	%rsp
+
+	ret
+
+END(__kernel_vsyscall)
diff --git a/arch/x86/um/shared/sysdep/syscalls_64.h b/arch/x86/um/shared/sysdep/syscalls_64.h
index b6b997225841..ffd80ee3b9dc 100644
--- a/arch/x86/um/shared/sysdep/syscalls_64.h
+++ b/arch/x86/um/shared/sysdep/syscalls_64.h
@@ -25,4 +25,10 @@ extern syscall_handler_t *sys_call_table[];
 extern syscall_handler_t sys_modify_ldt;
 extern syscall_handler_t sys_arch_prctl;
 
+#ifndef CONFIG_MMU
+extern void do_syscall_64(struct pt_regs *regs);
+extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
+			      int64_t a4, int64_t a5, int64_t a6);
+#endif
+
 #endif
-- 
2.43.0




* [PATCH v3 05/13] x86/um: nommu: syscall translation by zpoline
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
                       ` (3 preceding siblings ...)
  2024-12-03  4:23     ` [PATCH v3 04/13] x86/um: nommu: syscall handling Hajime Tazaki
@ 2024-12-03  4:23     ` Hajime Tazaki
  2024-12-04 16:37       ` Johannes Berg
  2024-12-03  4:23     ` [PATCH v3 06/13] um: nommu: syscalls handler from userspace by seccomp filter Hajime Tazaki
                       ` (8 subsequent siblings)
  13 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-03  4:23 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

This commit adds a mechanism to hook syscalls for unmodified userspace
programs running under UML in !MMU mode. The mechanism, called zpoline,
replaces syscall/sysenter instructions with `call *%rax`, which is then
handled by trampoline code installed by an initcall during boot. The
translation is triggered by elf_arch_finalize_exec(), an arch hook
introduced by another commit.
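
The byte-level rewrite can be sketched in userspace C as follows. This is
a simplified illustration under stated assumptions, not the code of this
patch: demo_rewrite_syscalls() is a hypothetical name, and it scans raw
bytes, whereas the patch walks instruction boundaries with the kernel's
x86 instruction decoder. A raw scan can misfire on 0f 05 bytes that are
part of an immediate or of another instruction, so it is only suitable
for showing the encoding swap (0f 05 / 0f 34 become ff d0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Replace every "syscall" (0f 05) and "sysenter" (0f 34) byte pair in
 * code[0..len) with "call *%rax" (ff d0). Both encodings are two bytes
 * long, so the rewrite is done in place. Returns the number of
 * replacements.
 *
 * NOTE: naive byte scan, not instruction-aware (assumption for brevity).
 */
size_t demo_rewrite_syscalls(uint8_t *code, size_t len)
{
	size_t i = 0, count = 0;

	assert(code != NULL || len == 0);

	while (i + 1 < len) {
		if (code[i] == 0x0f &&
		    (code[i + 1] == 0x05 || code[i + 1] == 0x34)) {
			code[i] = 0xff;		/* call ... */
			code[i + 1] = 0xd0;	/* ... *%rax */
			count++;
			i += 2;
		} else {
			i++;
		}
	}
	return count;
}
```

Because the replacement has the same length as the original instruction,
no surrounding code has to be moved and no jump targets change.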

All syscalls issued by userspace are thus redirected to a specific
function, __kernel_vsyscall, introduced as the syscall entry point for !MMU
UML.  This is a completely different code path from the ptrace(2)-based
syscall hook used by MMU-full UML.
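
The redirection relies on the trampoline layout: since %rax holds the
syscall number, `call *%rax` jumps to a small address, lands in a run of
nops starting at address 0 and slides into a tail that loads the return
address into %rcx and jumps to the entry function. The sketch below builds
that byte sequence into an ordinary buffer. demo_build_trampoline() and
its parameters are hypothetical; it leaves out the short-jmp optimization
and the mapping at address 0 that the patch performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill buf with: nr_syscalls nop bytes, then
 *   49 bb <imm64>   movabs $target,%r11
 *   48 8b 0c 24     mov    (%rsp),%rcx   (return address, like syscall)
 *   41 ff e3        jmp    *%r11
 * Returns the number of bytes written, or 0 if buf is too small.
 */
size_t demo_build_trampoline(uint8_t *buf, size_t buflen,
			     size_t nr_syscalls, uint64_t target)
{
	static const uint8_t tail[] = {
		0x48, 0x8b, 0x0c, 0x24,	/* mov (%rsp),%rcx */
		0x41, 0xff, 0xe3,	/* jmp *%r11 */
	};
	size_t need = nr_syscalls + 2 + 8 + sizeof(tail);
	size_t p, i;

	if (buf == NULL || buflen < need)
		return 0;

	/* nop sled: one byte per possible syscall number */
	memset(buf, 0x90, nr_syscalls);
	p = nr_syscalls;

	/* movabs $target,%r11 with a little-endian 64-bit immediate */
	buf[p++] = 0x49;
	buf[p++] = 0xbb;
	for (i = 0; i < 8; i++)
		buf[p++] = (uint8_t)(target >> (8 * i));

	memcpy(buf + p, tail, sizeof(tail));
	p += sizeof(tail);

	assert(p == need);
	return p;
}
```

With nr_syscalls = 4 the sketch emits 4 + 17 = 21 bytes; a real setup
would use NR_syscalls and place the result in an executable mapping.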

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 arch/x86/um/asm/elf.h |   3 +
 arch/x86/um/zpoline.c | 223 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+)
 create mode 100644 arch/x86/um/zpoline.c

diff --git a/arch/x86/um/asm/elf.h b/arch/x86/um/asm/elf.h
index 33f69f1eac10..6f5977ff0d21 100644
--- a/arch/x86/um/asm/elf.h
+++ b/arch/x86/um/asm/elf.h
@@ -188,6 +188,9 @@ do {								\
 struct linux_binprm;
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
 	int uses_interp);
+struct elf_fdpic_params;
+extern int elf_arch_finalize_exec(struct elf_fdpic_params *exec_params,
+				  struct elf_fdpic_params *interp_params);
 
 extern unsigned long um_vdso_addr;
 #define AT_SYSINFO_EHDR 33
diff --git a/arch/x86/um/zpoline.c b/arch/x86/um/zpoline.c
new file mode 100644
index 000000000000..97f5345ab314
--- /dev/null
+++ b/arch/x86/um/zpoline.c
@@ -0,0 +1,223 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *  zpoline.c
+ *
+ *  Replace syscall/sysenter instructions with `call *%rax` to hook syscalls.
+ *
+ */
+//#define DEBUG
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/elf-fdpic.h>
+#include <asm/unistd.h>
+#include <asm/insn.h>
+#include <sysdep/syscalls.h>
+#include <os.h>
+
+/* start of trampoline code area */
+static char *__zpoline_start;
+
+static int __zpoline_translate_syscalls(struct elf_fdpic_params *params)
+{
+	int count = 0, loop;
+	struct insn insn;
+	unsigned long addr;
+	struct elf_fdpic_loadseg *seg;
+	struct elf_phdr *phdr;
+	struct elfhdr *ehdr = (struct elfhdr *)params->elfhdr_addr;
+
+	if (!ehdr)
+		return 0;
+
+	seg = params->loadmap->segs;
+	phdr = params->phdrs;
+	for (loop = 0; loop < params->hdr.e_phnum; loop++, phdr++) {
+		if (phdr->p_type != PT_LOAD)
+			continue;
+		addr = seg->addr;
+		/* skip translation of trampoline code */
+		if (addr <= (unsigned long)(&__zpoline_start[0] + 0x1000 + 0x0100)) {
+			pr_warn("%lx: address is in the range of trampoline", addr);
+			return -EINVAL;
+		}
+
+		/* translate only segment with Executable flag */
+		if (!(phdr->p_flags & PF_X)) {
+			seg++;
+			continue;
+		}
+
+		pr_debug("translation 0x%lx-0x%llx", addr,
+			 seg->addr + seg->p_memsz);
+		/* now ready to translate */
+		while (addr < (seg->addr + seg->p_memsz)) {
+			insn_init(&insn, (void *)addr, MAX_INSN_SIZE, 1);
+			insn_get_length(&insn);
+
+			insn_get_opcode(&insn);
+
+			switch (insn.opcode.bytes[0]) {
+			case 0xf:
+				switch (insn.opcode.bytes[1]) {
+				case 0x05: /* syscall */
+				case 0x34: /* sysenter */
+					pr_debug("%lx: found syscall/sysenter", addr);
+					*(char *)addr = 0xff; // callq
+					*((char *)addr + 1) = 0xd0; // *%rax
+					count++;
+					break;
+				}
+			default:
+				break;
+			}
+
+			addr += insn.length;
+			if (insn.length == 0) {
+				pr_debug("%lx: length zero with byte %x. skip ?",
+					 addr, insn.opcode.bytes[0]);
+				addr += 1;
+			}
+		}
+		seg++;
+	}
+	return count;
+}
+
+/**
+ * elf_arch_finalize_exec() - architecture hook to translate syscall/sysenter
+ *
+ * translate syscall/sysenter instruction upon loading ELF binary file
+ * on execve(2)&co syscall.
+ *
+ * suppose we have those instructions:
+ *
+ *    mov $sysnr, %rax
+ *    syscall                 0f 05
+ *
+ * this will be translated into:
+ *
+ *    mov $sysnr, %rax        (<= untouched)
+ *    call *%rax              ff d0
+ *
+ * which finally reaches the hook function via the trampoline code
+ * installed by setup_zpoline_trampoline().
+ *
+ * @exec_params: ELF meta data for executable file
+ * @interp_params: ELF meta data for the interpreter file
+ */
+int elf_arch_finalize_exec(struct elf_fdpic_params *exec_params,
+			   struct elf_fdpic_params *interp_params)
+{
+	int err = 0, count = 0;
+	struct mm_struct *mm = current->mm;
+
+	if (down_write_killable(&mm->mmap_lock))
+		return -EINTR;
+
+	/* translate for the executable */
+	err = __zpoline_translate_syscalls(exec_params);
+	if (err < 0) {
+		pr_info("zpoline: xlate error %d", err);
+		goto out;
+	}
+	count += err;
+	pr_debug("zpoline: rewritten (exec) %d syscalls\n", count);
+
+	/* translate for the interpreter */
+	err = __zpoline_translate_syscalls(interp_params);
+	if (err < 0) {
+		pr_info("zpoline: xlate error %d", err);
+		goto out;
+	}
+	count += err;
+
+	err = 0;
+	pr_debug("zpoline: rewritten (exec+interp) %d syscalls\n", count);
+
+out:
+	up_write(&mm->mmap_lock);
+	return err;
+}
+
+/**
+ * setup_zpoline_trampoline() - install trampoline code for zpoline
+ *
+ * setup trampoline code for syscall hooks
+ *
+ * the trampoline code guides to call hooked function, __kernel_vsyscall
+ * in this case, via nop slides at the memory address zero (thus, zpoline).
+ *
+ * loaded binary by exec(2) is translated to call the function.
+ */
+static int __init setup_zpoline_trampoline(void)
+{
+	int i, ret;
+	int ptr;
+
+	/* zpoline: map area of trampoline code started from addr 0x0 */
+	__zpoline_start = 0x0;
+
+	ret = os_map_memory((void *) 0, -1, 0, PAGE_SIZE, 1, 1, 1);
+	if (ret)
+		panic("map failed\n NOTE: /proc/sys/vm/mmap_min_addr should be set 0\n");
+
+	/* fill nop instructions until the trampoline code */
+	for (i = 0; i < NR_syscalls; i++)
+		__zpoline_start[i] = 0x90;
+
+	/* optimization to skip old syscalls */
+	/* short jmp */
+	__zpoline_start[214 /* __NR_epoll_ctl_old */] = 0xeb;
+	/* range of a short jmp : -128 ~ +127 */
+	__zpoline_start[215 /* __NR_epoll_wait_old */] = 127;
+
+	/**
+	 * FIXME: shift red zone area to properly handle the case
+	 */
+
+	/**
+	 * put code for jumping to __kernel_vsyscall.
+	 *
+	 * here we embed the following code.
+	 *
+	 * movabs [$addr],%r11
+	 * jmpq   *%r11
+	 *
+	 */
+	ptr = NR_syscalls;
+	/* 49 bb [64-bit addr (8-byte)]    movabs [64-bit addr (8-byte)],%r11 */
+	__zpoline_start[ptr++] = 0x49;
+	__zpoline_start[ptr++] = 0xbb;
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 0));
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 1));
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 2));
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 3));
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 4));
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 5));
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 6));
+	__zpoline_start[ptr++] = ((uint64_t) __kernel_vsyscall >> (8 * 7));
+
+	/*
+	 * pretending to be syscall instruction by putting return
+	 * address in %rcx.
+	 */
+	/* 48 8b 0c 24               mov    (%rsp),%rcx */
+	__zpoline_start[ptr++] = 0x48;
+	__zpoline_start[ptr++] = 0x8b;
+	__zpoline_start[ptr++] = 0x0c;
+	__zpoline_start[ptr++] = 0x24;
+
+	/* 41 ff e3                jmp    *%r11 */
+	__zpoline_start[ptr++] = 0x41;
+	__zpoline_start[ptr++] = 0xff;
+	__zpoline_start[ptr++] = 0xe3;
+
+	/* permission: XOM (PROT_EXEC only) */
+	ret = os_protect_memory(0, PAGE_SIZE, 0, 0, 1);
+	if (ret)
+		panic("failed: can't configure permission on trampoline code");
+
+	pr_info("zpoline: setting up trampoline code done\n");
+	return 0;
+}
+arch_initcall(setup_zpoline_trampoline);
-- 
2.43.0




* [PATCH v3 06/13] um: nommu: syscalls handler from userspace by seccomp filter
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
                       ` (4 preceding siblings ...)
  2024-12-03  4:23     ` [PATCH v3 05/13] x86/um: nommu: syscall translation by zpoline Hajime Tazaki
@ 2024-12-03  4:23     ` Hajime Tazaki
  2024-12-04 16:42       ` Johannes Berg
  2024-12-04 17:54       ` kernel test robot
  2024-12-03  4:23     ` [PATCH v3 07/13] x86/um: nommu: process/thread handling Hajime Tazaki
                       ` (7 subsequent siblings)
  13 siblings, 2 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-03  4:23 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett, Kenichi Yasukata

This commit adds a seccomp-based syscall handler, which serves two
purposes:

1) hook syscalls issued from userspace memory (identified by $rip), and
2) when zpoline is used, block and report syscalls that zpoline cannot
   translate, i.e., syscall/sysenter instructions in a) dlopen-ed code or
   b) JIT-generated code.

SIGSYS is raised when a syscall is executed from an address between
uml_reserved and high_physmem, the range in which userspace memory is
located.
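
The address check that the filter performs can be written out in plain C.
This is a sketch of the decision logic only, under the assumption that
the filter traps exactly when start <= ip < end; demo_ip_is_trapped() and
the helper names are hypothetical. The comparison is split into 32-bit
halves because classic BPF loads 32-bit words, so the 64-bit instruction
pointer has to be compared as a high word and a low word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t demo_hi32(uint64_t v)
{
	return (uint32_t)(v >> 32);
}

static uint32_t demo_lo32(uint64_t v)
{
	return (uint32_t)v;
}

/*
 * Return true if a syscall issued at ip would be trapped (SIGSYS),
 * i.e. if ip lies in the userspace range [start, end).
 */
bool demo_ip_is_trapped(uint64_t ip, uint64_t start, uint64_t end)
{
	assert(start <= end);

	/* ip >= end: outside the range, allow */
	if (demo_hi32(ip) > demo_hi32(end))
		return false;
	if (demo_hi32(ip) == demo_hi32(end) &&
	    demo_lo32(ip) >= demo_lo32(end))
		return false;

	/* ip < start: outside the range, allow */
	if (demo_hi32(ip) < demo_hi32(start))
		return false;
	if (demo_hi32(ip) == demo_hi32(start) &&
	    demo_lo32(ip) < demo_lo32(start))
		return false;

	/* start <= ip < end: userspace memory, trap */
	return true;
}
```

The four early returns correspond one to one to the four "allow" blocks
of the filter; anything that falls through is trapped.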

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Kenichi Yasukata <kenichi.yasukata@gmail.com>
---
 arch/um/include/shared/kern_util.h   |  2 +
 arch/um/include/shared/os.h          |  6 +++
 arch/um/kernel/trap.c                | 12 +++++
 arch/um/kernel/um_arch.c             |  4 ++
 arch/um/os-Linux/process.c           | 78 ++++++++++++++++++++++++++++
 arch/um/os-Linux/signal.c            | 22 ++++++++
 arch/x86/um/os-Linux/mcontext.c      | 22 ++++++++
 arch/x86/um/shared/sysdep/mcontext.h |  4 ++
 arch/x86/um/zpoline.c                | 15 ++++++
 9 files changed, 165 insertions(+)

diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
index f21dc8517538..9b26386dd2ea 100644
--- a/arch/um/include/shared/kern_util.h
+++ b/arch/um/include/shared/kern_util.h
@@ -67,4 +67,6 @@ void um_idle_sleep(void);
 
 void kasan_map_memory(void *start, size_t len);
 
+extern void trap_sigsys(struct uml_pt_regs *regs);
+
 #endif
diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index 6874be0c38a8..c979a8b15434 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -220,6 +220,9 @@ extern int os_unmap_memory(void *addr, int len);
 extern int os_drop_memory(void *addr, int length);
 extern int can_drop_memory(void);
 extern int os_mincore(void *addr, unsigned long len);
+#ifndef CONFIG_MMU
+extern int os_setup_seccomp(void);
+#endif
 
 void os_set_pdeathsig(void);
 
@@ -252,6 +255,9 @@ extern void register_pm_wake_signal(void);
 extern void block_signals_hard(void);
 extern void unblock_signals_hard(void);
 extern void mark_sigio_pending(void);
+#ifndef CONFIG_MMU
+extern int um_zpoline_enabled;
+#endif
 
 /* util.c */
 extern void stack_protections(unsigned long address);
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index a7519b3de4bf..f23ba7f9a82d 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -310,3 +310,15 @@ void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs)
 {
 	do_IRQ(WINCH_IRQ, regs);
 }
+
+void trap_sigsys(struct uml_pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+
+	pr_info_ratelimited("%s%s[%d]: sigsys ip %p sp %p\n",
+			    task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
+			    tsk->comm, task_pid_nr(tsk),
+			    (void *)UPT_IP(regs), (void *)UPT_SP(regs));
+
+	force_sig(SIGSYS);
+}
diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c
index 62ddb865eb91..d89752bf5be0 100644
--- a/arch/um/kernel/um_arch.c
+++ b/arch/um/kernel/um_arch.c
@@ -432,6 +432,10 @@ void __init setup_arch(char **cmdline_p)
 		add_bootloader_randomness(rng_seed, sizeof(rng_seed));
 		memzero_explicit(rng_seed, sizeof(rng_seed));
 	}
+
+#ifndef CONFIG_MMU
+	os_setup_seccomp();
+#endif
 }
 
 void __init arch_cpu_finalize_init(void)
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index ef1a2f0aa06a..4e0b21b4b00c 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -17,7 +17,11 @@
 #include <asm/unistd.h>
 #include <init.h>
 #include <longjmp.h>
+#include <as-layout.h>
 #include <os.h>
+#include <sys/prctl.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
 
 void os_alarm_process(int pid)
 {
@@ -209,3 +213,77 @@ void os_set_pdeathsig(void)
 {
 	prctl(PR_SET_PDEATHSIG, SIGKILL);
 }
+
+#ifndef CONFIG_MMU
+int os_setup_seccomp(void)
+{
+	int err;
+	unsigned long __userspace_start = uml_reserved,
+		__userspace_end = high_physmem;
+
+	struct sock_filter filter[] = {
+		/* if (IP_high > __userspace_end) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JGT + BPF_K, __userspace_end >> 32,
+			 /*true-skip=*/0, /*false-skip=*/1),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* if (IP_high == __userspace_end && IP_low >= __userspace_end) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __userspace_end >> 32,
+			 /*true-skip=*/0, /*false-skip=*/3),
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer)),
+		BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_end,
+			 /*true-skip=*/0, /*false-skip=*/1),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* if (IP_high < __userspace_start) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_start >> 32,
+			 /*true-skip=*/1, /*false-skip=*/0),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* if (IP_high == __userspace_start && IP_low < __userspace_start) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __userspace_start >> 32,
+			 /*true-skip=*/0, /*false-skip=*/3),
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer)),
+		BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_start,
+			 /*true-skip=*/1, /*false-skip=*/0),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* other address; trap  */
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRAP),
+	};
+	struct sock_fprog prog = {
+		.len = ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	err = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	if (err)
+		os_warn("PR_SET_NO_NEW_PRIVS (err=%d, errno=%d)\n",
+		       err, errno);
+
+	err = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
+		      SECCOMP_FILTER_FLAG_TSYNC, &prog);
+	if (err) {
+		os_warn("SECCOMP_SET_MODE_FILTER (err=%d, errno=%d)\n",
+		       err, errno);
+		exit(-1);
+	}
+
+	set_handler(SIGSYS);
+
+	os_info("seccomp: filter syscalls in the range: 0x%lx-0x%lx\n",
+		__userspace_start, __userspace_end);
+
+	return 0;
+}
+#endif
diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
index 9ea7269ffb77..c0d1fb1fc0c4 100644
--- a/arch/um/os-Linux/signal.c
+++ b/arch/um/os-Linux/signal.c
@@ -20,6 +20,25 @@
 #include <um_malloc.h>
 #include <sys/ucontext.h>
 #include <timetravel.h>
+#include <init.h>
+
+#ifndef CONFIG_MMU
+static void sigsys_handler(int sig, struct siginfo *si, mcontext_t *mc)
+{
+	struct uml_pt_regs r;
+
+	if (!um_zpoline_enabled) {
+		/* hook syscall via SIGSYS */
+		mc_set_sigsys_hook(mc);
+	} else {
+		/* trap SIGSYS to userspace */
+		get_regs_from_mc(&r, mc);
+		trap_sigsys(&r);
+		/* force handle signals after rt_sigreturn() */
+		mc_set_regs_ip_relay(mc);
+	}
+}
+#endif
 
 void (*sig_info[NSIG])(int, struct siginfo *, struct uml_pt_regs *) = {
 	[SIGTRAP]	= relay_signal,
@@ -178,6 +197,9 @@ static void (*handlers[_NSIG])(int sig, struct siginfo *si, mcontext_t *mc) = {
 	[SIGILL] = sig_handler,
 	[SIGFPE] = sig_handler,
 	[SIGTRAP] = sig_handler,
+#ifndef CONFIG_MMU
+	[SIGSYS] = sigsys_handler,
+#endif
 
 	[SIGIO] = sig_handler,
 	[SIGWINCH] = sig_handler,
diff --git a/arch/x86/um/os-Linux/mcontext.c b/arch/x86/um/os-Linux/mcontext.c
index e80ab7d28117..d876e34a9c7a 100644
--- a/arch/x86/um/os-Linux/mcontext.c
+++ b/arch/x86/um/os-Linux/mcontext.c
@@ -4,6 +4,7 @@
 #include <asm/ptrace.h>
 #include <sysdep/ptrace.h>
 #include <sysdep/mcontext.h>
+#include <sysdep/syscalls.h>
 
 void get_regs_from_mc(struct uml_pt_regs *regs, mcontext_t *mc)
 {
@@ -31,3 +32,24 @@ void get_regs_from_mc(struct uml_pt_regs *regs, mcontext_t *mc)
 	regs->gp[CS / sizeof(unsigned long)] |= 3;
 #endif
 }
+
+#ifndef CONFIG_MMU
+static void userspace_sigreturn(void)
+{
+	__asm__ volatile("movq $15, %rax");
+	__asm__ volatile("call *%0" : : "r"(__kernel_vsyscall) :);
+}
+
+void mc_set_regs_ip_relay(mcontext_t *mc)
+{
+	mc->gregs[REG_RIP] = (unsigned long) userspace_sigreturn;
+}
+
+void mc_set_sigsys_hook(mcontext_t *mc)
+{
+	mc->gregs[REG_RSP] -= sizeof(unsigned long);
+	*((unsigned long *) (mc->gregs[REG_RSP])) = mc->gregs[REG_RIP];
+	mc->gregs[REG_RCX] = mc->gregs[REG_RIP];
+	mc->gregs[REG_RIP] = (unsigned long) __kernel_vsyscall;
+}
+#endif
diff --git a/arch/x86/um/shared/sysdep/mcontext.h b/arch/x86/um/shared/sysdep/mcontext.h
index b724c54da316..0e837f4b5757 100644
--- a/arch/x86/um/shared/sysdep/mcontext.h
+++ b/arch/x86/um/shared/sysdep/mcontext.h
@@ -7,6 +7,10 @@
 #define __SYS_SIGCONTEXT_X86_H
 
 extern void get_regs_from_mc(struct uml_pt_regs *, mcontext_t *);
+#ifndef CONFIG_MMU
+extern void mc_set_sigsys_hook(mcontext_t *mc);
+extern void mc_set_regs_ip_relay(mcontext_t *mc);
+#endif
 
 #ifdef __i386__
 
diff --git a/arch/x86/um/zpoline.c b/arch/x86/um/zpoline.c
index 97f5345ab314..6ec44233276b 100644
--- a/arch/x86/um/zpoline.c
+++ b/arch/x86/um/zpoline.c
@@ -14,6 +14,7 @@
 #include <sysdep/syscalls.h>
 #include <os.h>
 
+int um_zpoline_enabled;
 /* start of trampoline code area */
 static char *__zpoline_start;
 
@@ -111,6 +112,10 @@ int elf_arch_finalize_exec(struct elf_fdpic_params *exec_params,
 	int err = 0, count = 0;
 	struct mm_struct *mm = current->mm;
 
+	/* zpoline disabled */
+	if (!um_zpoline_enabled)
+		return 0;
+
 	if (down_write_killable(&mm->mmap_lock))
 		return -EINTR;
 
@@ -221,3 +226,13 @@ static int __init setup_zpoline_trampoline(void)
 	return 0;
 }
 arch_initcall(setup_zpoline_trampoline);
+
+static int __init zpoline_set(char *str)
+{
+	int val = 0;
+
+	get_option(&str, &val);
+	um_zpoline_enabled = val;
+	return 1;
+}
+__setup("zpoline=", zpoline_set);
-- 
2.43.0




* [PATCH v3 07/13] x86/um: nommu: process/thread handling
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
                       ` (5 preceding siblings ...)
  2024-12-03  4:23     ` [PATCH v3 06/13] um: nommu: syscalls handler from userspace by seccomp filter Hajime Tazaki
@ 2024-12-03  4:23     ` Hajime Tazaki
  2024-12-04 16:50       ` Johannes Berg
  2024-12-03  4:23     ` [PATCH v3 08/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
                       ` (6 subsequent siblings)
  13 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-03  4:23 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

Since the ptrace facility isn't used in !MMU mode of UML, processes and
threads are invoked via a different code path: on entry to the syscall
interface, the stack pointer has to be manipulated to handle the vfork(2)
return address; no external process is used; and some registers (e.g., the
fs segment register for TLS) have to be configured properly on every
context switch.

Signals aren't delivered on non-ptrace syscall entry/exit, so we also need
to handle pending signals ourselves.
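
Why the vfork(2) return address needs special care can be shown with a
toy stack model. This is an illustrative simulation only, assuming the
libc sequence described in fork_handler() (popq %rdx; call *%rax;
pushq %rdx) and a stack shared between parent and child; struct demo_cpu
and demo_vfork_return_slot() are hypothetical names and the addresses are
made-up constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_STACK_SLOTS 8

/* Toy CPU: sp is an index into stack[], and the stack grows down. */
struct demo_cpu {
	uint64_t stack[DEMO_STACK_SLOTS];
	size_t sp;
};

static void demo_push(struct demo_cpu *c, uint64_t v)
{
	assert(c->sp > 0);
	c->stack[--c->sp] = v;
}

static uint64_t demo_pop(struct demo_cpu *c)
{
	assert(c->sp < DEMO_STACK_SLOTS);
	return c->stack[c->sp++];
}

/*
 * Return the address the parent finds on top of its stack after the
 * child has run. With restore != 0 the parent restores the saved slot.
 */
uint64_t demo_vfork_return_slot(int restore)
{
	const uint64_t ret_to_caller = 0x1111;	/* pushed by caller of vfork */
	const uint64_t ret_after_call = 0x2222;	/* pushed by call *%rax */
	struct demo_cpu c = { .sp = DEMO_STACK_SLOTS };
	uint64_t rdx, saved;
	size_t parent_sp;

	demo_push(&c, ret_to_caller);
	rdx = demo_pop(&c);			/* popq %rdx */
	demo_push(&c, ret_after_call);		/* call *%rax */

	/* parent side: remember the slot before the child touches it */
	saved = c.stack[c.sp];
	parent_sp = c.sp;

	/* child: resumes via jmp, so it skips the pushed address ... */
	c.sp++;
	/* ... and its pushq %rdx then overwrites that very slot */
	demo_push(&c, rdx);

	/* parent resumes on the same stack and returns via ret */
	c.sp = parent_sp;
	if (restore)
		c.stack[c.sp] = saved;
	return c.stack[c.sp];
}
```

Without the restore step the parent's ret would pick up the caller's
address (0x1111 in this model) instead of the address following the call
(0x2222), which is the situation the save/restore around vfork avoids.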

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/kernel/process.c        | 33 +++++++++++++++++++++++++++++-
 arch/um/os-Linux/process.c      |  6 ++++++
 arch/um/os-Linux/skas/process.c |  4 ++++
 arch/x86/um/asm/processor.h     | 12 +++++++++++
 arch/x86/um/do_syscall_64.c     | 36 +++++++++++++++++++++++++++++++++
 arch/x86/um/entry_64.S          | 21 +++++++++++++++++++
 arch/x86/um/syscalls_64.c       | 12 +++++++++++
 7 files changed, 123 insertions(+), 1 deletion(-)

diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 30bdc0a87dc8..865f3456f24b 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -117,13 +117,17 @@ void new_thread_handler(void)
 	 * callback returns only if the kernel thread execs a process
 	 */
 	fn(arg);
+#ifndef CONFIG_MMU
+	arch_switch_to(current);
+#endif
 	userspace(&current->thread.regs.regs);
 }
 
 /* Called magically, see new_thread_handler above */
 static void fork_handler(void)
 {
-	schedule_tail(current->thread.prev_sched);
+	if (current->thread.prev_sched)
+		schedule_tail(current->thread.prev_sched);
 
 	/*
 	 * XXX: if interrupt_end() calls schedule, this call to
@@ -134,6 +138,33 @@ static void fork_handler(void)
 
 	current->thread.prev_sched = NULL;
 
+#ifndef CONFIG_MMU
+	/*
+	 * child of vfork(2) comes here.
+	 * clone(2) also enters here but doesn't need to advance the %rsp.
+	 *
+	 * This fork can only come from libc's vfork, which
+	 * does this:
+	 *	popq %%rdx;
+	 *	call *%rax; // zpoline => __kernel_vsyscall
+	 *	pushq %%rdx;
+	 * %rcx stores the return address which is stored
+	 * at pt_regs[HOST_IP] at the moment.  As child returns
+	 * via userspace() with a jmp instruction (while parent
+	 * does via ret instruction in __kernel_vsyscall), we
+	 * need to pop (advance) the pushed address by "call"
+	 * though, so this is what this next line does.
+	 *
+	 * As a result of vfork return in child, stack contents
+	 * is overwritten by child (by pushq in vfork), which
+	 * makes the parent puzzled after child returns.
+	 *
+	 * thus the contents should be restored before vfork/parent
+	 * returns.  this is done in do_syscall_64().
+	 */
+	if (current->thread.regs.regs.gp[HOST_ORIG_AX] == __NR_vfork)
+		current->thread.regs.regs.gp[REGS_SP_INDEX] += 8;
+#endif
 	userspace(&current->thread.regs.regs);
 }
 
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index 4e0b21b4b00c..51473b834497 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -25,7 +25,10 @@
 
 void os_alarm_process(int pid)
 {
+/* !CONFIG_MMU doesn't send alarm signal to other processes */
+#ifdef CONFIG_MMU
 	kill(pid, SIGALRM);
+#endif
 }
 
 void os_kill_process(int pid, int reap_child)
@@ -42,11 +45,14 @@ void os_kill_process(int pid, int reap_child)
 
 void os_kill_ptraced_process(int pid, int reap_child)
 {
+/* !CONFIG_MMU doesn't have ptraced process */
+#ifdef CONFIG_MMU
 	kill(pid, SIGKILL);
 	ptrace(PTRACE_KILL, pid);
 	ptrace(PTRACE_CONT, pid);
 	if (reap_child)
 		CATCH_EINTR(waitpid(pid, NULL, __WALL));
+#endif
 }
 
 /* Don't use the glibc version, which caches the result in TLS. It misses some
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index f683cfc9e51a..80776bac168b 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -144,6 +144,7 @@ void wait_stub_done(int pid)
 
 extern unsigned long current_stub_stack(void);
 
+#ifdef CONFIG_MMU
 static void get_skas_faultinfo(int pid, struct faultinfo *fi)
 {
 	int err;
@@ -176,6 +177,7 @@ static void handle_trap(int pid, struct uml_pt_regs *regs)
 
 	handle_syscall(regs);
 }
+#endif
 
 extern char __syscall_stub_start[];
 
@@ -389,6 +391,7 @@ int start_userspace(unsigned long stub_stack)
 }
 
 int unscheduled_userspace_iterations;
+#ifdef CONFIG_MMU
 extern unsigned long tt_extra_sched_jiffies;
 
 void userspace(struct uml_pt_regs *regs)
@@ -550,6 +553,7 @@ void userspace(struct uml_pt_regs *regs)
 		}
 	}
 }
+#endif
 
 void new_thread(void *stack, jmp_buf *buf, void (*handler)(void))
 {
diff --git a/arch/x86/um/asm/processor.h b/arch/x86/um/asm/processor.h
index 478710384b34..d88d7d9d5c18 100644
--- a/arch/x86/um/asm/processor.h
+++ b/arch/x86/um/asm/processor.h
@@ -38,6 +38,18 @@ static __always_inline void cpu_relax(void)
 
 #define task_pt_regs(t) (&(t)->thread.regs)
 
+#ifndef CONFIG_MMU
+#define task_top_of_stack(task) \
+({									\
+	unsigned long __ptr = (unsigned long)task->stack;	\
+	__ptr += THREAD_SIZE;			\
+	__ptr;					\
+})
+
+extern long current_top_of_stack;
+extern long current_ptregs;
+#endif
+
 #include <asm/processor-generic.h>
 
 #endif
diff --git a/arch/x86/um/do_syscall_64.c b/arch/x86/um/do_syscall_64.c
index 5d0fa83e7fdc..ca468caff729 100644
--- a/arch/x86/um/do_syscall_64.c
+++ b/arch/x86/um/do_syscall_64.c
@@ -1,14 +1,43 @@
 // SPDX-License-Identifier: GPL-2.0
 
+//#define DEBUG 1
 #include <linux/kernel.h>
 #include <linux/ptrace.h>
 #include <kern_util.h>
 #include <sysdep/syscalls.h>
 #include <os.h>
 
+/*
+ * save/restore the return address stored in the stack, as the child overwrites
+ * the contents after returning to userspace (i.e., by push %rdx).
+ *
+ * see the detail in fork_handler().
+ */
+static void *vfork_save_stack(void)
+{
+	unsigned char *stack_copy;
+
+	stack_copy = kzalloc(8, GFP_KERNEL);
+	if (!stack_copy)
+		return NULL;
+
+	memcpy(stack_copy,
+	       (void *)current->thread.regs.regs.gp[HOST_SP], 8);
+
+	return stack_copy;
+}
+
+static void vfork_restore_stack(void *stack_copy)
+{
+	WARN_ON_ONCE(!stack_copy);
+	memcpy((void *)current->thread.regs.regs.gp[HOST_SP],
+	       stack_copy, 8);
+}
+
 __visible void do_syscall_64(struct pt_regs *regs)
 {
 	int syscall;
+	unsigned char *stack_copy = NULL;
 
 	syscall = PT_SYSCALL_NR(regs->regs.gp);
 	UPT_SYSCALL_NR(&regs->regs) = syscall;
@@ -17,6 +46,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
 		 syscall, (unsigned long)current,
 		 (unsigned long)sys_call_table[syscall]);
 
+	if (syscall == __NR_vfork)
+		stack_copy = vfork_save_stack();
+
 	if (likely(syscall < NR_syscalls)) {
 		PT_REGS_SET_SYSCALL_RETURN(regs,
 				EXECUTE_SYSCALL(syscall, regs));
@@ -31,6 +63,10 @@ __visible void do_syscall_64(struct pt_regs *regs)
 	if (syscall == __NR_execve && regs->regs.gp[HOST_AX] == 0)
 		userspace(&current->thread.regs.regs);
 
+	/* only the parent of vfork restores the stack contents */
+	if (syscall == __NR_vfork && regs->regs.gp[HOST_AX] > 0)
+		vfork_restore_stack(stack_copy);
+
 	/* force do_signal() --> is_syscall() */
 	set_thread_flag(TIF_SIGPENDING);
 	interrupt_end();
diff --git a/arch/x86/um/entry_64.S b/arch/x86/um/entry_64.S
index 022a8122690b..32f5002e2eb0 100644
--- a/arch/x86/um/entry_64.S
+++ b/arch/x86/um/entry_64.S
@@ -85,3 +85,24 @@ ENTRY(__kernel_vsyscall)
 	ret
 
 END(__kernel_vsyscall)
+
+// void userspace(struct uml_pt_regs *regs)
+ENTRY(userspace)
+	/* align the stack for x86_64 ABI */
+	and     $-0x10, %rsp
+	/* Handle any immediate reschedules or signals */
+	call	interrupt_end
+
+	movq	current_ptregs, %rsp
+
+	POP_REGS
+
+	addq	$8, %rsp	/* skip orig_ax */
+	popq	%r11		/* pt_regs->ip */
+	addq	$8, %rsp	/* skip cs */
+	addq	$8, %rsp	/* skip flags */
+	popq	%rsp
+
+	jmp	*%r11
+
+END(userspace)
diff --git a/arch/x86/um/syscalls_64.c b/arch/x86/um/syscalls_64.c
index 6a00a28c9cca..edb17fc73e07 100644
--- a/arch/x86/um/syscalls_64.c
+++ b/arch/x86/um/syscalls_64.c
@@ -51,6 +51,18 @@ void arch_switch_to(struct task_struct *to)
 	 * Nothing needs to be done on x86_64.
 	 * The FS_BASE/GS_BASE registers are saved in the ptrace register set.
 	 */
+#ifndef CONFIG_MMU
+	current_top_of_stack = task_top_of_stack(to);
+	current_ptregs = (long)task_pt_regs(to);
+
+	if ((to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] == 0) ||
+	    (to->mm == NULL))
+		return;
+
+	/* this changes the FS on every context switch */
+	arch_prctl(to, ARCH_SET_FS,
+		   (void __user *) to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)]);
+#endif
 }
 
 SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
-- 
2.43.0




* [PATCH v3 08/13] um: nommu: configure fs register on host syscall invocation
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
                       ` (6 preceding siblings ...)
  2024-12-03  4:23     ` [PATCH v3 07/13] x86/um: nommu: process/thread handling Hajime Tazaki
@ 2024-12-03  4:23     ` Hajime Tazaki
  2024-12-04 16:52       ` Johannes Berg
  2024-12-03  4:23     ` [PATCH v3 09/13] x86/um/vdso: nommu: vdso memory update Hajime Tazaki
                       ` (5 subsequent siblings)
  13 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-03  4:23 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

Userspace on UML/!MMU also needs to configure the %fs register while it
is running in order to access its thread structure correctly, so host
syscalls implemented in the os-Linux drivers may be confused when they
are called with the userspace value still in %fs.  Thus the %fs register
has to be configured via arch_prctl(ARCH_SET_FS) on every host syscall.
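
The save/switch/restore pattern described above can be sketched in plain
userspace C.  This is only an illustration, not the code of this patch:
call_with_host_fs(), host_work() and fs_switch_demo() are hypothetical
names, and the demo deliberately passes the current FS base as the
"host" value, because switching to a foreign base from ordinary
userspace would break TLS (errno, stack protector).

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

#if defined(__linux__) && defined(__x86_64__)
#include <asm/prctl.h>	/* ARCH_GET_FS, ARCH_SET_FS */

/* Hypothetical stand-in for "host side" work that relies on the host FS. */
long host_work(void)
{
	return (long)getpid();
}

/*
 * Hypothetical helper: remember the current (guest) FS base, switch to
 * host_fs, run fn(), then switch back to the guest value.
 * Returns fn()'s result, or -1 if any arch_prctl call fails.
 */
long call_with_host_fs(unsigned long host_fs, long (*fn)(void))
{
	unsigned long guest_fs;
	long ret;

	if (syscall(SYS_arch_prctl, ARCH_GET_FS, &guest_fs))
		return -1;
	if (syscall(SYS_arch_prctl, ARCH_SET_FS, host_fs))
		return -1;
	ret = fn();
	if (syscall(SYS_arch_prctl, ARCH_SET_FS, guest_fs))
		return -1;
	return ret;
}

/* Demo: use the current FS base as the host value, so nothing breaks. */
int fs_switch_demo(void)
{
	unsigned long fs;

	if (syscall(SYS_arch_prctl, ARCH_GET_FS, &fs))
		return -1;
	return call_with_host_fs(fs, host_work) == (long)getpid() ? 0 : -1;
}
#else
/* Not x86_64 Linux: there is no FS base to switch, so nothing to show. */
int fs_switch_demo(void)
{
	return 0;
}
#endif
```

In the patch itself the host value is kept in host_fs and the userspace
value in the saved register set; the sketch only shows the ordering of
the get/set calls.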

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/include/shared/os.h |  3 ++
 arch/um/os-Linux/main.c     |  5 ++++
 arch/um/os-Linux/process.c  |  8 ++++++
 arch/um/os-Linux/start_up.c | 20 +++++++++++++
 arch/x86/um/do_syscall_64.c | 36 +++++++++++++++++++++++
 arch/x86/um/syscalls_64.c   | 57 +++++++++++++++++++++++++++++++++++++
 6 files changed, 129 insertions(+)

diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index c979a8b15434..f7f4da322906 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -190,6 +190,7 @@ extern void check_host_supports_tls(int *supports_tls, int *tls_min);
 extern void get_host_cpu_features(
 	void (*flags_helper_func)(char *line),
 	void (*cache_helper_func)(char *line));
+extern int host_has_fsgsbase;
 
 /* mem.c */
 extern int create_mem_file(unsigned long long len);
@@ -221,6 +222,8 @@ extern int os_drop_memory(void *addr, int length);
 extern int can_drop_memory(void);
 extern int os_mincore(void *addr, unsigned long len);
 #ifndef CONFIG_MMU
+extern long long host_fs;
+extern int os_arch_prctl(int pid, int option, unsigned long *arg);
 extern int os_setup_seccomp(void);
 #endif
 
diff --git a/arch/um/os-Linux/main.c b/arch/um/os-Linux/main.c
index 0afcdeb8995b..aecf63d3db79 100644
--- a/arch/um/os-Linux/main.c
+++ b/arch/um/os-Linux/main.c
@@ -17,6 +17,7 @@
 #include <kern_util.h>
 #include <os.h>
 #include <um_malloc.h>
+#include <asm/prctl.h> /* XXX This should get the constants from libc */
 #include "internal.h"
 
 #define PGD_BOUND (4 * 1024 * 1024)
@@ -158,6 +159,10 @@ int __init main(int argc, char **argv, char **envp)
 	change_sig(SIGPIPE, 0);
 	ret = linux_main(argc, argv, envp);
 
+#ifndef CONFIG_MMU
+	os_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
+#endif
+
 	/*
 	 * Disable SIGPROF - I have no idea why libc doesn't do this or turn
 	 * off the profiling time, but UML dies with a SIGPROF just before
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index 51473b834497..346d297e89fe 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -221,6 +221,14 @@ void os_set_pdeathsig(void)
 }
 
 #ifndef CONFIG_MMU
+#include <unistd.h>
+#include <sys/syscall.h>   /* For SYS_xxx definitions */
+
+int os_arch_prctl(int pid, int option, unsigned long *arg2)
+{
+	return syscall(SYS_arch_prctl, option, arg2);
+}
+
 int os_setup_seccomp(void)
 {
 	int err;
diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c
index 93fc82c01aba..dbab091892b3 100644
--- a/arch/um/os-Linux/start_up.c
+++ b/arch/um/os-Linux/start_up.c
@@ -19,6 +19,8 @@
 #include <sys/resource.h>
 #include <asm/ldt.h>
 #include <asm/unistd.h>
+#include <sys/auxv.h>
+#include <asm/hwcap2.h>
 #include <init.h>
 #include <os.h>
 #include <kern_util.h>
@@ -28,6 +30,8 @@
 #include <skas.h>
 #include "internal.h"
 
+int host_has_fsgsbase;
+
 static void ptrace_child(void)
 {
 	int ret;
@@ -278,6 +282,19 @@ void  __init get_host_cpu_features(
 	}
 }
 
+static void __init check_fsgsbase(void)
+{
+	unsigned long auxv = getauxval(AT_HWCAP2);
+
+	os_info("Checking FSGSBASE instructions...");
+	if (auxv & HWCAP2_FSGSBASE) {
+		host_has_fsgsbase = 1;
+		os_info("OK\n");
+	} else {
+		host_has_fsgsbase = 0;
+		os_info("disabled\n");
+	}
+}
 
 void __init os_early_checks(void)
 {
@@ -293,6 +310,9 @@ void __init os_early_checks(void)
 	 */
 	check_tmpexec();
 
+	/* probe fsgsbase instruction */
+	check_fsgsbase();
+
 	pid = start_ptraced_child();
 	if (init_pid_registers(pid))
 		fatal("Failed to initialize default registers");
diff --git a/arch/x86/um/do_syscall_64.c b/arch/x86/um/do_syscall_64.c
index ca468caff729..c7e48c74c7a5 100644
--- a/arch/x86/um/do_syscall_64.c
+++ b/arch/x86/um/do_syscall_64.c
@@ -3,6 +3,8 @@
 //#define DEBUG 1
 #include <linux/kernel.h>
 #include <linux/ptrace.h>
+#include <asm/fsgsbase.h>
+#include <asm/prctl.h>
 #include <kern_util.h>
 #include <sysdep/syscalls.h>
 #include <os.h>
@@ -34,6 +36,31 @@ static void vfork_restore_stack(void *stack_copy)
 	       stack_copy, 8);
 }
 
+static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2)
+{
+	if (host_has_fsgsbase) {
+		switch (option) {
+		case ARCH_SET_FS:
+			wrfsbase(*arg2);
+			break;
+		case ARCH_SET_GS:
+			wrgsbase(*arg2);
+			break;
+		case ARCH_GET_FS:
+			*arg2 = rdfsbase();
+			break;
+		case ARCH_GET_GS:
+			*arg2 = rdgsbase();
+			break;
+		}
+		return 0;
+	} else {
+		return os_arch_prctl(pid, option, arg2);
+	}
+
+	return 0;
+}
+
 __visible void do_syscall_64(struct pt_regs *regs)
 {
 	int syscall;
@@ -49,6 +76,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
 	if (syscall == __NR_vfork)
 		stack_copy = vfork_save_stack();
 
+	/* set fs register to the original host one */
+	os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
+
 	if (likely(syscall < NR_syscalls)) {
 		PT_REGS_SET_SYSCALL_RETURN(regs,
 				EXECUTE_SYSCALL(syscall, regs));
@@ -70,4 +100,10 @@ __visible void do_syscall_64(struct pt_regs *regs)
 	/* force do_signal() --> is_syscall() */
 	set_thread_flag(TIF_SIGPENDING);
 	interrupt_end();
+
+	/* restore the fs register to the one configured by userspace */
+	os_x86_arch_prctl(0, ARCH_SET_FS,
+		      (void *)(current->thread.regs.regs.gp[FS_BASE
+						     / sizeof(unsigned long)]));
+
 }
diff --git a/arch/x86/um/syscalls_64.c b/arch/x86/um/syscalls_64.c
index edb17fc73e07..d56df936a2d7 100644
--- a/arch/x86/um/syscalls_64.c
+++ b/arch/x86/um/syscalls_64.c
@@ -12,11 +12,26 @@
 #include <asm/prctl.h> /* XXX This should get the constants from libc */
 #include <registers.h>
 #include <os.h>
+#include <asm/thread_info.h>
+#include <asm/mman.h>
+
+#ifndef CONFIG_MMU
+/*
+ * The guest libc can change FS, which confuses the host libc.
+ * In fact, changing FS directly is not supported (check
+ * man arch_prctl). So, whenever we make a host syscall,
+ * we should be changing FS to the original FS (not the
+ * one set by the guest libc). This original FS is stored
+ * in host_fs.
+ */
+long long host_fs = -1;
+#endif
 
 long arch_prctl(struct task_struct *task, int option,
 		unsigned long __user *arg2)
 {
 	long ret = -EINVAL;
+#ifdef CONFIG_MMU
 
 	switch (option) {
 	case ARCH_SET_FS:
@@ -38,6 +53,48 @@ long arch_prctl(struct task_struct *task, int option,
 	}
 
 	return ret;
+#else
+
+	unsigned long *ptr = arg2, tmp;
+
+	switch (option) {
+	case ARCH_SET_FS:
+		if (host_fs == -1)
+			os_arch_prctl(0, ARCH_GET_FS, (void *)&host_fs);
+		ret = 0;
+		break;
+	case ARCH_SET_GS:
+		ret = 0;
+		break;
+	case ARCH_GET_FS:
+	case ARCH_GET_GS:
+		ptr = &tmp;
+		break;
+	}
+
+	ret = os_arch_prctl(0, option, ptr);
+	if (ret)
+		return ret;
+
+	switch (option) {
+	case ARCH_SET_FS:
+		current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] =
+			(unsigned long) arg2;
+		break;
+	case ARCH_SET_GS:
+		current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)] =
+			(unsigned long) arg2;
+		break;
+	case ARCH_GET_FS:
+		ret = put_user(current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)], arg2);
+		break;
+	case ARCH_GET_GS:
+		ret = put_user(current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)], arg2);
+		break;
+	}
+
+	return ret;
+#endif
 }
 
 SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
-- 
2.43.0




* [PATCH v3 09/13] x86/um/vdso: nommu: vdso memory update
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
                       ` (7 preceding siblings ...)
  2024-12-03  4:23     ` [PATCH v3 08/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
@ 2024-12-03  4:23     ` Hajime Tazaki
  2024-12-03  4:23     ` [PATCH v3 10/13] x86/um: nommu: signal handling Hajime Tazaki
                       ` (4 subsequent siblings)
  13 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-03  4:23 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

In !MMU mode, the address of the vdso is directly accessible from
userspace.  This commit implements the entry point by pointing it at the
address of the page that holds the vdso.

This commit also configures the memory permission of the vdso page so
that it is executable.
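
As a rough host-side illustration of the permission change, the sketch
below maps a page and makes it readable and executable but not writable,
which is what the (1, 0, 1) arguments passed to os_protect_memory() in
this patch ask for.  It assumes that os_protect_memory() boils down to
mprotect(2) with the matching PROT_* bits; prot_from_rwx() and
make_page_rx() are hypothetical names used only here.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed translation of (r, w, x) flags into host PROT_* bits. */
int prot_from_rwx(int r, int w, int x)
{
	return (r ? PROT_READ : 0) |
	       (w ? PROT_WRITE : 0) |
	       (x ? PROT_EXEC : 0);
}

/*
 * Map one anonymous page read/write (as if the vdso image had just been
 * copied into it), then switch it to read+execute.
 * Returns 0 on success, -1 on failure.
 */
int make_page_rx(void)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;
	int ret;

	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	ret = mprotect(p, (size_t)page, prot_from_rwx(1, 0, 1));
	munmap(p, (size_t)page);
	return ret;
}
```

Hardened hosts may refuse to turn a previously writable mapping into an
executable one, so make_page_rx() can legitimately fail there.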

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/x86/um/vdso/um_vdso.c | 41 +++++++++++++++++++++++++-------------
 arch/x86/um/vdso/vma.c     | 14 +++++++++++++
 2 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/arch/x86/um/vdso/um_vdso.c b/arch/x86/um/vdso/um_vdso.c
index cbae2584124f..a78d095655f1 100644
--- a/arch/x86/um/vdso/um_vdso.c
+++ b/arch/x86/um/vdso/um_vdso.c
@@ -19,15 +19,35 @@ int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
 __kernel_old_time_t __vdso_time(__kernel_old_time_t *t);
 long __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused);
 
+#ifdef CONFIG_MMU
+#define __VDSO_SYSCALL1(sysnr, ret, a0)		\
+	asm("syscall"				\
+	    : "=a" (ret)			\
+	    : "0" (sysnr), "D" (a0)		\
+	    : "rcx", "r11", "memory")
+#define __VDSO_SYSCALL2(sysnr, ret, a0, a1)		\
+	asm("syscall"					\
+	    : "=a" (ret)				\
+	    : "0" (sysnr), "D" (a0), "S" (a1)		\
+	    : "rcx", "r11", "memory")
+#else
+#define __VDSO_SYSCALL1(sysnr, ret, a0)		\
+	asm("call *%%rax"				\
+	    : "=a" (ret)			\
+	    : "a" (sysnr), "D" (a0)	\
+	    : "rcx", "r11", "memory")
+#define __VDSO_SYSCALL2(sysnr, ret, a0, a1)		\
+	asm("call *%%rax"					\
+	    : "=a" (ret)				\
+	    : "a" (sysnr), "D" (a0), "S" (a1)	\
+	    : "rcx", "r11", "memory")
+#endif
+
 int __vdso_clock_gettime(clockid_t clock, struct __kernel_old_timespec *ts)
 {
 	long ret;
 
-	asm("syscall"
-		: "=a" (ret)
-		: "0" (__NR_clock_gettime), "D" (clock), "S" (ts)
-		: "rcx", "r11", "memory");
-
+	__VDSO_SYSCALL2(__NR_clock_gettime, ret, clock, ts);
 	return ret;
 }
 int clock_gettime(clockid_t, struct __kernel_old_timespec *)
@@ -37,11 +57,7 @@ int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
 {
 	long ret;
 
-	asm("syscall"
-		: "=a" (ret)
-		: "0" (__NR_gettimeofday), "D" (tv), "S" (tz)
-		: "rcx", "r11", "memory");
-
+	__VDSO_SYSCALL2(__NR_gettimeofday, ret, tv, tz);
 	return ret;
 }
 int gettimeofday(struct __kernel_old_timeval *, struct timezone *)
@@ -51,10 +67,7 @@ __kernel_old_time_t __vdso_time(__kernel_old_time_t *t)
 {
 	long secs;
 
-	asm volatile("syscall"
-		: "=a" (secs)
-		: "0" (__NR_time), "D" (t) : "cc", "r11", "cx", "memory");
-
+	__VDSO_SYSCALL1(__NR_time, secs, t);
 	return secs;
 }
 __kernel_old_time_t time(__kernel_old_time_t *t) __attribute__((weak, alias("__vdso_time")));
diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
index f238f7b33cdd..093fed27ad49 100644
--- a/arch/x86/um/vdso/vma.c
+++ b/arch/x86/um/vdso/vma.c
@@ -9,6 +9,7 @@
 #include <asm/page.h>
 #include <asm/elf.h>
 #include <linux/init.h>
+#include <os.h>
 
 static unsigned int __read_mostly vdso_enabled = 1;
 unsigned long um_vdso_addr;
@@ -24,7 +25,9 @@ static int __init init_vdso(void)
 
 	BUG_ON(vdso_end - vdso_start > PAGE_SIZE);
 
+#ifdef CONFIG_MMU
 	um_vdso_addr = task_size - PAGE_SIZE;
+#endif
 
 	vdsop = kmalloc(sizeof(struct page *), GFP_KERNEL);
 	if (!vdsop)
@@ -40,6 +43,15 @@ static int __init init_vdso(void)
 	copy_page(page_address(um_vdso), vdso_start);
 	*vdsop = um_vdso;
 
+#ifndef CONFIG_MMU
+	/* this is fine with NOMMU as everything is accessible */
+	um_vdso_addr = (unsigned long)page_address(um_vdso);
+	os_protect_memory((void *)um_vdso_addr, vdso_end - vdso_start, 1, 0, 1);
+	pr_info("vdso_start=%lx um_vdso_addr=%lx pg_um_vdso=%lx",
+	       (unsigned long)vdso_start, um_vdso_addr,
+	       (unsigned long)page_address(um_vdso));
+#endif
+
 	return 0;
 
 oom:
@@ -50,6 +62,7 @@ static int __init init_vdso(void)
 }
 subsys_initcall(init_vdso);
 
+#ifdef CONFIG_MMU
 int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 {
 	struct vm_area_struct *vma;
@@ -74,3 +87,4 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 
 	return IS_ERR(vma) ? PTR_ERR(vma) : 0;
 }
+#endif
-- 
2.43.0




* [PATCH v3 10/13] x86/um: nommu: signal handling
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
                       ` (8 preceding siblings ...)
  2024-12-03  4:23     ` [PATCH v3 09/13] x86/um/vdso: nommu: vdso memory update Hajime Tazaki
@ 2024-12-03  4:23     ` Hajime Tazaki
  2024-12-03  4:23     ` [PATCH v3 11/13] um: change machine name for uname output Hajime Tazaki
                       ` (3 subsequent siblings)
  13 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-03  4:23 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

This commit updates the behavior of signal handling in the !MMU
environment: 1) the stack preparation for signal handlers and
2) the restoration of the stack after the rt_sigreturn(2) syscall.
Both are needed because the stack usage on the vfork(2) syscall is
different.
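
The address arithmetic behind both steps can be sketched as follows.
This is a simplified model of the hunks in arch/x86/um/signal.c, not the
kernel code: the function names are made up for illustration and the
sizes passed in are arbitrary.

```c
#include <stdint.h>

/* Round x down to a multiple of a (a must be a power of two). */
#define ROUND_DOWN(x, a) ((x) & ~((uintptr_t)(a) - 1))

/*
 * Step 1, setup: place the rt_sigframe below stack_top, 16-byte
 * aligned, then reserve the (rounded-down) math frame below it.
 */
uintptr_t sigframe_addr(uintptr_t stack_top, uintptr_t frame_size,
			uintptr_t math_size)
{
	uintptr_t frame = ROUND_DOWN(stack_top - frame_size, 16);

	return frame - ROUND_DOWN(math_size, 16);
}

/* On !MMU one extra slot below the frame holds the handler address. */
uintptr_t handler_slot(uintptr_t frame)
{
	return frame - sizeof(uintptr_t);
}

/*
 * Step 2, rt_sigreturn: the frame is normally at sp - sizeof(long).
 * On !MMU the syscall arrives through "call *%rax", which pushed a
 * return address, so that slot is skipped again.
 */
uintptr_t sigreturn_frame(uintptr_t sp, int nommu)
{
	uintptr_t frame = sp - sizeof(long);

	if (nommu)
		frame += sizeof(long);
	return frame;
}
```

For example, with stack_top = 0x1000, frame_size = 0x108 and
math_size = 0x204 the frame lands at 0xcf0 and the handler slot one
word below it.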

It also adds a follow-up routine for SIGSEGV: since signal delivery runs
in the same stack frame, we have to avoid an endless loop of SIGSEGV.
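
The user/kernel decision that gates this follow-up can be sketched as a
plain range check on the faulting instruction pointer, mirroring the
condition added to sig_handler_common().  ip_is_user() is a hypothetical
name, and the bounds are passed as parameters here rather than read from
the uml_reserved and high_physmem globals.

```c
/*
 * Return 1 when ip lies strictly between the end of the reserved UML
 * kernel area and the top of physical memory, i.e. in memory that
 * userspace code can occupy on !MMU; return 0 otherwise.
 */
int ip_is_user(unsigned long ip, unsigned long uml_reserved,
	       unsigned long high_physmem)
{
	return ip > uml_reserved && ip < high_physmem;
}
```

Both bounds are exclusive, matching the ">" and "<" used in the patch.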

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 arch/um/os-Linux/signal.c | 16 +++++++++++++++-
 arch/x86/um/signal.c      | 37 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
index c0d1fb1fc0c4..de3ed8fc0268 100644
--- a/arch/um/os-Linux/signal.c
+++ b/arch/um/os-Linux/signal.c
@@ -55,7 +55,15 @@ static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc)
 	struct uml_pt_regs r;
 	int save_errno = errno;
 
-	r.is_user = 0;
+#ifndef CONFIG_MMU
+	memset(&r, 0, sizeof(r));
+	/* mark is_user=1 when the IP is from userspace code. */
+	if (mc && (REGS_IP(mc->gregs) > uml_reserved
+		   && REGS_IP(mc->gregs) < high_physmem))
+		r.is_user = 1;
+	else
+#endif
+		r.is_user = 0;
 	if (sig == SIGSEGV) {
 		/* For segfaults, we want the data from the sigcontext. */
 		get_regs_from_mc(&r, mc);
@@ -69,6 +77,12 @@ static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc)
 	(*sig_info[sig])(sig, si, &r);
 
 	errno = save_errno;
+
+#ifndef CONFIG_MMU
+	/* force handle signals after rt_sigreturn() */
+	if (r.is_user && sig == SIGSEGV)
+		mc_set_regs_ip_relay(mc);
+#endif
 }
 
 /*
diff --git a/arch/x86/um/signal.c b/arch/x86/um/signal.c
index 75087e85b6fd..e1b3a87ddc5d 100644
--- a/arch/x86/um/signal.c
+++ b/arch/x86/um/signal.c
@@ -370,6 +370,13 @@ int setup_signal_stack_si(unsigned long stack_top, struct ksignal *ksig,
 	frame = (struct rt_sigframe __user *)
 		round_down(stack_top - sizeof(struct rt_sigframe), 16);
 
+#ifndef CONFIG_MMU
+	/*
+	 * the sig_frame on !MMU needs to be aligned for SSE as
+	 * the frame is used as-is.
+	 */
+	math_size = round_down(math_size, 16);
+#endif
 	/* Add required space for math frame */
 	frame = (struct rt_sigframe __user *)((unsigned long)frame - math_size);
 
@@ -417,6 +424,18 @@ int setup_signal_stack_si(unsigned long stack_top, struct ksignal *ksig,
 		/* could use a vstub here */
 		return err;
 
+#ifndef CONFIG_MMU
+	/*
+	 * we need to push the handler address onto the top of the stack:
+	 * __kernel_vsyscall, which is called after this, returns with ret
+	 * using the stack contents, so the handler is pushed here.
+	 */
+	frame = (struct rt_sigframe __user *) ((unsigned long) frame -
+					       sizeof(unsigned long));
+	err |= __put_user((unsigned long)ksig->ka.sa.sa_handler,
+			  (unsigned long *)frame);
+#endif
+
 	if (err)
 		return err;
 
@@ -442,9 +461,25 @@ SYSCALL_DEFINE0(rt_sigreturn)
 	unsigned long sp = PT_REGS_SP(&current->thread.regs);
 	struct rt_sigframe __user *frame =
 		(struct rt_sigframe __user *)(sp - sizeof(long));
-	struct ucontext __user *uc = &frame->uc;
+	struct ucontext __user *uc;
 	sigset_t set;
 
+#ifndef CONFIG_MMU
+	/**
+	 * we enter here with:
+	 *
+	 * __restore_rt:
+	 *     mov $15, %rax
+	 *     call *%rax (translated from syscall)
+	 *
+	 * (code is from musl libc)
+	 * so, stack needs to be popped of "call"ed address before
+	 * looking at rt_sigframe.
+	 */
+	frame = (struct rt_sigframe __user *)((unsigned long)frame + sizeof(long));
+#endif
+	uc = &frame->uc;
+
 	if (copy_from_user(&set, &uc->uc_sigmask, sizeof(set)))
 		goto segfault;
 
-- 
2.43.0




* [PATCH v3 11/13] um: change machine name for uname output
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
                       ` (9 preceding siblings ...)
  2024-12-03  4:23     ` [PATCH v3 10/13] x86/um: nommu: signal handling Hajime Tazaki
@ 2024-12-03  4:23     ` Hajime Tazaki
  2024-12-03  4:23     ` [PATCH v3 12/13] um: nommu: add documentation of nommu UML Hajime Tazaki
                       ` (2 subsequent siblings)
  13 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-03  4:23 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

This commit displays the MMU/!MMU mode in the output of uname(2) so
that users can tell which mode of UML is currently running.
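
A small sketch of the resulting string, assuming that the machine field
is pre-seeded with UTS_MACHINE before setup_machinename() appends "/"
and the host machine name.  compose_machine() is a hypothetical helper
that uses snprintf() instead of the strcat() calls in the patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build "<uts_machine>/<host_machine>" into out, truncating if the
 * buffer is too small (snprintf always NUL-terminates when len > 0).
 */
void compose_machine(char *out, size_t len, const char *uts_machine,
		     const char *host_machine)
{
	snprintf(out, len, "%s/%s", uts_machine, host_machine);
}
```

Under that assumption a !MMU build on an x86_64 host would report
"um(nommu)/x86_64", and an MMU build "um/x86_64".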

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 arch/um/Makefile        | 6 ++++++
 arch/um/os-Linux/util.c | 3 ++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/um/Makefile b/arch/um/Makefile
index 1d36a613aad8..e0cfa3a9eae4 100644
--- a/arch/um/Makefile
+++ b/arch/um/Makefile
@@ -151,6 +151,12 @@ export CFLAGS_vmlinux := $(LINK-y) $(LINK_WRAPS) $(LD_FLAGS_CMDLINE) $(CC_FLAGS_
 CLEAN_FILES += linux x.i gmon.out
 MRPROPER_FILES += $(HOST_DIR)/include/generated
 
+ifeq ($(CONFIG_MMU),y)
+UTS_MACHINE := "um"
+else
+UTS_MACHINE := "um\(nommu\)"
+endif
+
 archclean:
 	@find . \( -name '*.bb' -o -name '*.bbg' -o -name '*.da' \
 		-o -name '*.gcov' \) -type f -print | xargs rm -f
diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c
index 4193e04d7e4a..20421e9f0f77 100644
--- a/arch/um/os-Linux/util.c
+++ b/arch/um/os-Linux/util.c
@@ -65,7 +65,8 @@ void setup_machinename(char *machine_out)
 	}
 # endif
 #endif
-	strcpy(machine_out, host.machine);
+	strcat(machine_out, "/");
+	strcat(machine_out, host.machine);
 }
 
 void setup_hostinfo(char *buf, int len)
-- 
2.43.0




* [PATCH v3 12/13] um: nommu: add documentation of nommu UML
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
                       ` (10 preceding siblings ...)
  2024-12-03  4:23     ` [PATCH v3 11/13] um: change machine name for uname output Hajime Tazaki
@ 2024-12-03  4:23     ` Hajime Tazaki
  2024-12-03  4:23     ` [PATCH v3 13/13] um: nommu: plug nommu code into build system Hajime Tazaki
  2024-12-04 16:20     ` [PATCH v3 00/13] nommu UML Johannes Berg
  13 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-03  4:23 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

This commit adds initial documentation for the !MMU mode of UML.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 Documentation/virt/uml/nommu-uml.rst | 230 +++++++++++++++++++++++++++
 MAINTAINERS                          |   1 +
 2 files changed, 231 insertions(+)
 create mode 100644 Documentation/virt/uml/nommu-uml.rst

diff --git a/Documentation/virt/uml/nommu-uml.rst b/Documentation/virt/uml/nommu-uml.rst
new file mode 100644
index 000000000000..3194b6ff8877
--- /dev/null
+++ b/Documentation/virt/uml/nommu-uml.rst
@@ -0,0 +1,230 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+UML has been built with CONFIG_MMU since day 0.  This patchset
+introduces a nommu mode for UML, from a different angle than what the
+Linux Kernel Library tried.
+
+.. contents:: :local:
+
+What is it for ?
+================
+
+- Alleviate the overhead of the syscall hook implemented with ptrace(2)
+- Exercise nommu code over UML (and over KUnit)
+- Depend less on host facilities
+
+
+How it works ?
+==============
+
+To illustrate how this feature works, the following shows how syscalls
+are called in the nommu UML environment.
+
+- boot kernel, install a seccomp filter that traps ``syscall``
+  instructions issued from userspace memory, based on the address of
+  the instruction pointer
+- (userspace starts)
+- calls ``vfork``/``execve`` syscalls
+- ``SIGSYS`` signal raised, handler calls syscall entry point ``__kernel_vsyscall``
+- call handler function in ``sys_call_table[]`` and follow how UML syscall
+  works.
+- return to userspace
+
+When users enable the zpoline syscall hook (configured with the boot
+parameter ``zpoline=1``), the code path looks as follows:
+
+- boot kernel, setup zpoline trampoline code (detailed later) at address 0x0
+- (userspace starts)
+- calls ``vfork``/``execve`` syscalls
+- during execve, more specifically in the ``load_elf_fdpic_binary()``
+  function, the kernel replaces ``syscall``/``sysenter`` instructions
+  with ``call *%rax``, whose target usually lies between address 0 and
+  ``NR_syscalls`` (around 512), where trampoline code was installed
+  during startup.
+- when userspace issues a syscall, it jumps to ``*%rax``, slides through
+  the ``nop`` instructions, and jumps to the hooked function,
+  ``__kernel_vsyscall``, the syscall entry point under nommu UML.
+- call handler function in ``sys_call_table[]`` and follow how UML syscall
+  works.
+- return to userspace
+
+With the zpoline syscall hook, syscall latency is greatly improved at
+the cost of a longer process startup time.  See the Benchmark section
+for details.
+
+What are the differences from MMU-full UML ?
+============================================
+
+The current nommu implementation adds three features which
+MMU-full UML doesn't have:
+
+- kernel address space is directly accessible from userspace
+  - so, ``uaccess()`` always returns 1
+  - generic implementation of memcpy/strcpy/futex is also used
+- alternate syscall entrypoint without ptrace
+- alternate syscall hook
+  - hook syscall by seccomp filter (when zpoline isn't used)
+  - translation of ``syscall``/``sysenter`` instructions to a trampoline
+    code and syscall hooks (when zpoline is used)
+
+Those modifications allow us to use unmodified userspace
+binaries with nommu UML.
+
+
+History
+=======
+
+This feature was originally introduced by Ricardo Koller at Open Source
+Summit NA 2020, and was later integrated with the syscall translation
+functionality, together with a cleanup of the original code.
+
+Building and running
+====================
+
+::
+
+   make ARCH=um x86_64_nommu_defconfig
+   make ARCH=um
+
+will build UML with ``CONFIG_MMU=n`` applied.
+
+KUnit tests can be run with the following command::
+
+   ./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_MMU=n
+
+To run a typical Linux distribution, we need a nommu-aware userspace.
+We can use a stock version of Alpine Linux with nommu-built versions of
+busybox and musl-libc.
+
+
+Preparing root filesystem
+=========================
+
+nommu UML requires a standard library that is aware of nommu kernels.
+We have tested custom-built musl-libc and busybox, both of which have
+built-in support for nommu kernels.
+
+There are no Linux distributions available for nommu on the x86_64
+architecture, so we need to prepare our own image for the root
+filesystem.  We use Alpine Linux as a base distribution and replace
+busybox and musl-libc on top of that.  The following steps prepare
+the filesystem for a quick start::
+
+     container_id=$(docker create ghcr.io/thehajime/alpine:3.20.3-um-nommu)
+     docker start $container_id
+     docker wait $container_id
+     docker export $container_id > alpine.tar
+     docker rm $container_id
+
+     mnt=$(mktemp -d)
+     dd if=/dev/zero of=alpine.ext4 bs=1 count=0 seek=1G
+     sudo chmod og+wr "alpine.ext4"
+     yes 2>/dev/null | mkfs.ext4 "alpine.ext4" || true
+     sudo mount "alpine.ext4" $mnt
+     sudo tar -xf alpine.tar -C $mnt
+     sudo umount $mnt
+
+This will create a file image, ``alpine.ext4``, which contains the
+nommu builds of busybox and musl on the Alpine Linux root filesystem.
+The file can be passed via the ``ubd0=`` argument on the UML command line::
+
+  ./vmlinux ubd0=./alpine.ext4 rw mem=1024m loglevel=8 init=/sbin/init
+
+We plan to upstream apk packages for busybox and musl so that we can
+follow the proper procedure to set up the root filesystem.
+
+
+Quick start with docker
+=======================
+
+There is a docker image that you can start quickly with a single step::
+
+  docker run -it -v /dev/shm:/dev/shm --rm ghcr.io/thehajime/alpine:3.20.3-um-nommu
+
+This will launch a UML instance with a pre-configured root filesystem.
+
+Benchmark
+=========
+
+The following shows an example of a performance measurement conducted
+with lmbench and a (self-crafted) getpid benchmark (with the v6.12-rc2
+uml/next tree).
+
+.. csv-table:: lmbench (usec)
+  :header: ,native,um,um-nommu(s),um-nommu(z)
+
+  select-10    ,0.5544,29.7143,2.8920,0.2834
+  select-100   ,2.3992,27.7262,3.7794,1.1732
+  select-1000  ,20.4708,42.0885,12.6920,10.0434
+  syscall      ,0.1734,26.2471,2.6070,0.0999
+  read         ,0.3433,29.8828,2.6923,0.1327
+  write        ,0.2866,25.9753,2.6925,0.1325
+  stat         ,1.9195,40.1164,3.1813,0.4642
+  open/close   ,3.8657,63.4730,6.2049,0.7283
+  fork+sh      ,1161.1111,5216.5000,462.3077,18744.0000
+  fork+execve  ,536.5263,2117.0000,131.0633,4840.6667
+
+.. csv-table:: do_getpid bench (nsec)
+  :header: ,native,um,um-nommu(s),um-nommu(z)
+
+  getpid,  172 , 26807 , 2614, 104
+
+
+(um-nommu(z) is nommu with the zpoline syscall hook and um-nommu(s) is
+nommu with the seccomp syscall hook)
+
+Limitations
+===========
+
+generic nommu limitations
+-------------------------
+Since this port is a kernel for a nommu architecture, the
+implementation inherits the characteristics of other nommu kernels
+(riscv, arm, etc.), described below.
+
+- vfork(2) should be used instead of fork(2)
+- the ELF loader only loads PIE (position independent executable) binaries
+- all processes share a single address space
+- mmap(2) offers a subset of its functionality (e.g., MAP_FIXED is
+  unsupported)
+
+Thus, we have limited options for userspace programs.  We have tested
+Alpine Linux with musl-libc, which supports nommu kernels.
+
+access to mmap_min_addr (if zpoline enabled)
+--------------------------------------------
+As the syscall translation mechanism relies on the ability to
+read/write memory at address zero (0x0), we need to configure the host
+kernel with the following command::
+
+  % sh -c "echo 0 > /proc/sys/vm/mmap_min_addr"
+
+supported architecture
+----------------------
+The current implementation of nommu UML only works on the x86_64 SUBARCH.
+We have not tested it in a 32-bit environment.
+
+target of syscall translation (if zpoline enabled)
+--------------------------------------------------
+For the moment, the syscall translation only applies to the executable
+and interpreter of ELF binary files processed by the execve(2) syscall:
+other libraries, such as linked or dlopen-ed ones, aren't translated;
+we may be able to trigger the translation via LD_PRELOAD.  Code
+generated by JIT compilers is produced after execve and thus is not
+currently translated either.
+
+Note that with musl-libc in Alpine Linux, which we have tested, most
+syscalls are implemented in the interpreter file
+(ld-musl-x86_64.so), and syscall/sysenter instructions in the
+linked/loaded libraries might be rare.  It is definitely possible,
+however, so a workaround with LD_PRELOAD would be useful.
+
+
+Further readings about NOMMU UML
+================================
+
+- NOMMU UML (original code by Ricardo Koller)
+ - https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf
+
+- zpoline: syscall translation mechanism
+ - https://www.usenix.org/conference/atc23/presentation/yasukata
diff --git a/MAINTAINERS b/MAINTAINERS
index a097afd76ded..aaffff989580 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -24186,6 +24186,7 @@ USER-MODE LINUX (UML)
 M:	Richard Weinberger <richard@nod.at>
 M:	Anton Ivanov <anton.ivanov@cambridgegreys.com>
 M:	Johannes Berg <johannes@sipsolutions.net>
+M:	Hajime Tazaki <thehajime@gmail.com>
 L:	linux-um@lists.infradead.org
 S:	Maintained
 W:	http://user-mode-linux.sourceforge.net
-- 
2.43.0




* [PATCH v3 13/13] um: nommu: plug nommu code into build system
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
                       ` (11 preceding siblings ...)
  2024-12-03  4:23     ` [PATCH v3 12/13] um: nommu: add documentation of nommu UML Hajime Tazaki
@ 2024-12-03  4:23     ` Hajime Tazaki
  2024-12-04 16:20     ` [PATCH v3 00/13] nommu UML Johannes Berg
  13 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-03  4:23 UTC (permalink / raw)
  To: linux-um; +Cc: thehajime, ricarkol, Liam.Howlett

Add the nommu kernel to the um build.  A defconfig is also provided.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/um/Kconfig                        | 14 +++++-
 arch/um/configs/x86_64_nommu_defconfig | 64 ++++++++++++++++++++++++++
 arch/x86/um/Makefile                   | 18 ++++++++
 3 files changed, 94 insertions(+), 2 deletions(-)
 create mode 100644 arch/um/configs/x86_64_nommu_defconfig

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 18051b1cfce0..2fc5a91c90a7 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -30,14 +30,17 @@ config UML
 	select ARCH_SUPPORTS_LTO_CLANG_THIN
 	select TRACE_IRQFLAGS_SUPPORT
 	select TTY # Needed for line.c
-	select HAVE_ARCH_VMAP_STACK
+	select HAVE_ARCH_VMAP_STACK if MMU
 	select HAVE_RUST
 	select ARCH_HAS_UBSAN
 	select HAVE_ARCH_TRACEHOOK
 	select THREAD_INFO_IN_TASK
+	select UACCESS_MEMCPY if !MMU
+	select GENERIC_STRNLEN_USER if !MMU
+	select GENERIC_STRNCPY_FROM_USER if !MMU
 
 config MMU
-	bool
+	bool "MMU-based Paged Memory Management Support" if 64BIT
 	default y
 
 config UML_DMA_EMULATION
@@ -190,8 +193,15 @@ config MAGIC_SYSRQ
 	  The keys are documented in <file:Documentation/admin-guide/sysrq.rst>. Don't say Y
 	  unless you really know what this hack does.
 
+config ARCH_FORCE_MAX_ORDER
+	int "Order of maximal physically contiguous allocations" if EXPERT
+	default "10" if MMU
+	default "16" if !MMU
+
 config KERNEL_STACK_ORDER
 	int "Kernel stack size order"
+	default 3 if !MMU
+	range 3 10 if !MMU
 	default 2 if 64BIT
 	range 2 10 if 64BIT
 	default 1 if !64BIT
diff --git a/arch/um/configs/x86_64_nommu_defconfig b/arch/um/configs/x86_64_nommu_defconfig
new file mode 100644
index 000000000000..c2e0fb546987
--- /dev/null
+++ b/arch/um/configs/x86_64_nommu_defconfig
@@ -0,0 +1,64 @@
+CONFIG_SYSVIPC=y
+CONFIG_POSIX_MQUEUE=y
+CONFIG_NO_HZ=y
+CONFIG_HIGH_RES_TIMERS=y
+CONFIG_BSD_PROCESS_ACCT=y
+CONFIG_IKCONFIG=y
+CONFIG_IKCONFIG_PROC=y
+CONFIG_LOG_BUF_SHIFT=14
+CONFIG_CGROUPS=y
+CONFIG_BLK_CGROUP=y
+CONFIG_CGROUP_SCHED=y
+CONFIG_CGROUP_DEVICE=y
+CONFIG_CGROUP_CPUACCT=y
+# CONFIG_PID_NS is not set
+CONFIG_CC_OPTIMIZE_FOR_SIZE=y
+# CONFIG_MMU is not set
+CONFIG_HOSTFS=y
+CONFIG_MAGIC_SYSRQ=y
+CONFIG_SSL=y
+CONFIG_NULL_CHAN=y
+CONFIG_PORT_CHAN=y
+CONFIG_PTY_CHAN=y
+CONFIG_TTY_CHAN=y
+CONFIG_CON_CHAN="pts"
+CONFIG_SSL_CHAN="pts"
+CONFIG_UML_SOUND=m
+CONFIG_UML_NET=y
+CONFIG_UML_NET_ETHERTAP=y
+CONFIG_UML_NET_TUNTAP=y
+CONFIG_UML_NET_SLIP=y
+CONFIG_UML_NET_DAEMON=y
+CONFIG_UML_NET_MCAST=y
+CONFIG_UML_NET_SLIRP=y
+CONFIG_MODULES=y
+CONFIG_MODULE_UNLOAD=y
+CONFIG_IOSCHED_BFQ=m
+CONFIG_BINFMT_MISC=m
+CONFIG_NET=y
+CONFIG_PACKET=y
+CONFIG_UNIX=y
+CONFIG_INET=y
+CONFIG_DEVTMPFS=y
+CONFIG_DEVTMPFS_MOUNT=y
+CONFIG_BLK_DEV_UBD=y
+CONFIG_BLK_DEV_LOOP=m
+CONFIG_BLK_DEV_NBD=m
+CONFIG_DUMMY=m
+CONFIG_TUN=m
+CONFIG_PPP=m
+CONFIG_SLIP=m
+CONFIG_LEGACY_PTY_COUNT=32
+CONFIG_UML_RANDOM=y
+CONFIG_SOUND=m
+CONFIG_EXT4_FS=y
+CONFIG_REISERFS_FS=y
+CONFIG_QUOTA=y
+CONFIG_AUTOFS_FS=m
+CONFIG_ISO9660_FS=m
+CONFIG_JOLIET=y
+CONFIG_NLS=y
+CONFIG_DEBUG_KERNEL=y
+CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
+CONFIG_FRAME_WARN=1024
+CONFIG_IPV6=y
diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile
index b42c31cd2390..0513c4ad0130 100644
--- a/arch/x86/um/Makefile
+++ b/arch/x86/um/Makefile
@@ -32,6 +32,24 @@ obj-y += syscalls_64.o vdso/
 subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o \
 	../lib/memmove_64.o ../lib/memset_64.o
 
+
+# used by zpoline.c to translate syscall/sysenter instructions
+# note: only in x86_64 w/ !CONFIG_MMU
+ifneq ($(CONFIG_MMU),y)
+inat_tables_script = $(srctree)/arch/x86/tools/gen-insn-attr-x86.awk
+inat_tables_maps = $(srctree)/arch/x86/lib/x86-opcode-map.txt
+quiet_cmd_inat_tables = GEN     $@
+      cmd_inat_tables = $(AWK) -f $(inat_tables_script) $(inat_tables_maps) > $@
+$(obj)/inat-tables.c: $(inat_tables_script) $(inat_tables_maps)
+	$(call cmd,inat_tables)
+targets += inat-tables.c
+$(obj)/../lib/inat.o: $(obj)/inat-tables.c
+subarch-y += ../lib/insn.o ../lib/inat.o
+
+
+obj-y += do_syscall_$(BITS).o entry_$(BITS).o zpoline.o
+endif
+
 endif
 
 subarch-$(CONFIG_MODULES) += ../kernel/module.o
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 00/13] nommu UML
  2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
                       ` (12 preceding siblings ...)
  2024-12-03  4:23     ` [PATCH v3 13/13] um: nommu: plug nommu code into build system Hajime Tazaki
@ 2024-12-04 16:20     ` Johannes Berg
  2024-12-05 13:41       ` Hajime Tazaki
  13 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-12-04 16:20 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um; +Cc: ricarkol, Liam.Howlett

On Tue, 2024-12-03 at 13:22 +0900, Hajime Tazaki wrote:
> This is a series of patches of nommu arch addition to UML.

Please next time you resend this, don't hide it in the old thread :)

johannes


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 02/13] x86/um: nommu: elf loader for fdpic
  2024-12-03  4:23     ` [PATCH v3 02/13] x86/um: nommu: elf loader for fdpic Hajime Tazaki
@ 2024-12-04 16:20       ` Johannes Berg
  2024-12-05 13:41         ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-12-04 16:20 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um
  Cc: ricarkol, Liam.Howlett, Eric Biederman, Kees Cook, Alexander Viro,
	Christian Brauner, Jan Kara, linux-mm, linux-fsdevel

On Tue, 2024-12-03 at 13:23 +0900, Hajime Tazaki wrote:
> 
>  arch/um/include/asm/Kbuild           |  1 +
> 
>  arch/x86/um/asm/module.h             | 24 ------------------------
> 

These changes could be a separate cleanup?

johannes


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 03/13] um: nommu: memory handling
  2024-12-03  4:23     ` [PATCH v3 03/13] um: nommu: memory handling Hajime Tazaki
@ 2024-12-04 16:34       ` Johannes Berg
  2024-12-05 13:46         ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-12-04 16:34 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um; +Cc: ricarkol, Liam.Howlett

On Tue, 2024-12-03 at 13:23 +0900, Hajime Tazaki wrote:
> 
> +++ b/arch/um/include/asm/futex.h
> @@ -8,7 +8,11 @@
>  
>  
>  int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr);
> +#ifdef CONFIG_MMU
>  int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
>  			      u32 oldval, u32 newval);
> +#else
> +#include <asm-generic/futex.h>
> +#endif

That seems somewhat problematic since it also defines
arch_futex_atomic_op_inuser ...

> +++ b/arch/um/include/shared/os.h
> @@ -195,7 +195,13 @@ extern void get_host_cpu_features(
>  extern int create_mem_file(unsigned long long len);
>  
>  /* tlb.c */
> +#ifdef CONFIG_MMU
>  extern void report_enomem(void);
> +#else
> +static inline void report_enomem(void)
> +{
> +}
> +#endif

That still seems simply wrong? Why is that even called, and why should
it do *nothing*?

I'm thinking you might just have patch sequence issues here - I can't
really see why this would be called at all, eventually.

> @@ -78,6 +79,7 @@ void __init mem_init(void)
>   * Create a page table and place a pointer to it in a middle page
>   * directory entry.
>   */
> +#ifdef CONFIG_MMU
>  static void __init one_page_table_init(pmd_t *pmd)
>  {
>  	if (pmd_none(*pmd)) {
> @@ -149,6 +151,12 @@ static void __init fixrange_init(unsigned long start, unsigned long end,
>  		j = 0;
>  	}
>  }
> +#else
> +static void __init fixrange_init(unsigned long start, unsigned long end,
> +				 pgd_t *pgd_base)
> +{
> +}
> +#endif

Really not a fan of all these randomly placed ifdefs ...

> +#ifdef CONFIG_MMU
>  static const pgprot_t protection_map[16] = {
>  	[VM_NONE]					= PAGE_NONE,
>  	[VM_READ]					= PAGE_READONLY,
> @@ -249,3 +258,4 @@ static const pgprot_t protection_map[16] = {
>  	[VM_SHARED | VM_EXEC | VM_WRITE | VM_READ]	= PAGE_SHARED
>  };
>  DECLARE_VM_GET_PAGE_PROT
> +#endif

Same here.

I think we can do better - perhaps move some code to mmu.c and nommu.c
or something like that.

> diff --git a/arch/um/kernel/physmem.c b/arch/um/kernel/physmem.c
> index a74f17b033c4..f55d46dbe173 100644
> --- a/arch/um/kernel/physmem.c
> +++ b/arch/um/kernel/physmem.c
> @@ -84,7 +84,11 @@ void __init setup_physmem(unsigned long start, unsigned long reserve_end,
>  		exit(1);
>  	}
>  
> +#ifdef CONFIG_MMU
>  	physmem_fd = create_mem_file(len);
> +#else
> +	physmem_fd = -1;
> +#endif

same here, create_mem_file() can just be in a file only built for mmu,
and otherwise be an inline that returns -1?
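
A rough sketch of that shape, as an illustration only. UML_HAS_MMU is a
made-up stand-in for CONFIG_MMU so the fragment builds outside the kernel
tree, and setup_physmem_fd() is a hypothetical caller, not a function from
the patch:

```c
/* Hypothetical stand-in for CONFIG_MMU; leave undefined to model nommu. */
/* #define UML_HAS_MMU 1 */

#ifdef UML_HAS_MMU
/* The real implementation would live in a file built only for MMU. */
extern int create_mem_file(unsigned long long len);
#else
/* nommu: there is no backing file, so callers just see an invalid fd. */
static inline int create_mem_file(unsigned long long len)
{
	(void)len;
	return -1;
}
#endif

/* Hypothetical caller: no #ifdef needed at the call site. */
static inline int setup_physmem_fd(unsigned long long len)
{
	return create_mem_file(len);
}
```

With the stub in a header, the assignment in setup_physmem() can stay
unconditional.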

> +#ifdef CONFIG_MMU
>  	/*
>  	 * Special kludge - This page will be mapped in to userspace processes
>  	 * from physmem_fd, so it needs to be written out there.
>  	 */
>  	os_seek_file(physmem_fd, __pa(__syscall_stub_start));
>  	os_write_file(physmem_fd, __syscall_stub_start, PAGE_SIZE);
> +#endif

That doesn't even do anything if the fd is -1, do we need the ifdef? ;-)

Still better as "if (IS_ENABLED())" or something anyway though.

> +++ b/arch/um/kernel/trap.c
> @@ -24,6 +24,7 @@
>  int handle_page_fault(unsigned long address, unsigned long ip,
>  		      int is_write, int is_user, int *code_out)
>  {
> +#ifdef CONFIG_MMU
>  	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma;
>  	pmd_t *pmd;
> @@ -129,6 +130,9 @@ int handle_page_fault(unsigned long address, unsigned long ip,
>  		goto out_nosemaphore;
>  	pagefault_out_of_memory();
>  	return 0;
> +#else
> +	return -EFAULT;
> +#endif
>  }

same comments here ... try not to sprinkle ifdefs everywhere.

johannes


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 04/13] x86/um: nommu: syscall handling
  2024-12-03  4:23     ` [PATCH v3 04/13] x86/um: nommu: syscall handling Hajime Tazaki
@ 2024-12-04 16:37       ` Johannes Berg
  2024-12-05 13:47         ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-12-04 16:37 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um; +Cc: ricarkol, Liam.Howlett

On Tue, 2024-12-03 at 13:23 +0900, Hajime Tazaki wrote:
> This commit introduces an entry point of syscall interface for !MMU
> mode. It uses an entry function, __kernel_vsyscall, a kernel-wide global
> symbol accessible from any locations.
> 
> Although it isn't in the scope of this commit, it can be also exposed
> via vdso image which is directly accessible from userspace. A standard
> library (i.e., libc) can utilize this entry point to implement syscall
> wrapper; we can also use this by hooking syscall for unmodified userspace
> applications/libraries, which will be implemented in the subsequent
> commit.
> 
> This only supports 64-bit mode of x86 architecture.
> 
> Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> ---
>  arch/x86/um/do_syscall_64.c             | 37 +++++++++++
>  arch/x86/um/entry_64.S                  | 87 +++++++++++++++++++++++++
> 

As I said before, I think it needs to be something obviously nommu.
Maybe in a new directory, or maybe nommu_syscall_64.c or something like
that.

Also needs comment style fixes and I'm not sure about all the
pr_debug(), but I guess they don't matter unless someone defines it? 

johannes


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 05/13] x86/um: nommu: syscall translation by zpoline
  2024-12-03  4:23     ` [PATCH v3 05/13] x86/um: nommu: syscall translation by zpoline Hajime Tazaki
@ 2024-12-04 16:37       ` Johannes Berg
  2024-12-05 13:48         ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-12-04 16:37 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um; +Cc: ricarkol, Liam.Howlett

On Tue, 2024-12-03 at 13:23 +0900, Hajime Tazaki wrote:
> This commit adds a mechanism to hook syscalls for unmodified userspace
> programs used under UML in !MMU mode. The mechanism, called zpoline,
> 

I think you should just leave this out of the first version entirely.

johannes


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 06/13] um: nommu: syscalls handler from userspace by seccomp filter
  2024-12-03  4:23     ` [PATCH v3 06/13] um: nommu: syscalls handler from userspace by seccomp filter Hajime Tazaki
@ 2024-12-04 16:42       ` Johannes Berg
  2024-12-05 13:51         ` Hajime Tazaki
  2024-12-04 17:54       ` kernel test robot
  1 sibling, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-12-04 16:42 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um; +Cc: ricarkol, Liam.Howlett, Kenichi Yasukata

On Tue, 2024-12-03 at 13:23 +0900, Hajime Tazaki wrote:
> 
> +#ifndef CONFIG_MMU
> +extern int um_zpoline_enabled;
> +#endif

That doesn't make sense, there's no good reason these two mechanisms
need to be mutually exclusive.

I think you should leave out zpoline initially, get all the other stuff
sorted out, and then add it later as an optimisation where possible (can
map at 0, can rewrite the binary, etc.)

> +void trap_sigsys(struct uml_pt_regs *regs)
> +{
> +	struct task_struct *tsk = current;
> +
> +	pr_info_ratelimited("%s%s[%d]: sigsys ip %p sp %p\n",
> +			    task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
> +			    tsk->comm, task_pid_nr(tsk),
> +			    (void *)UPT_IP(regs), (void *)UPT_SP(regs));

Shouldn't do that, not even rate-limited.

> +	if (err) {
> +		os_warn("SECCOMP_SET_MODE_FILTER (err=%d, ernro=%d)\n",
> +		       err, errno);
> +		exit(-1);

exit(-1) probably isn't quite right, it's more like a u8.
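
To illustrate the point (a standalone POSIX sketch, not code from the
series; observed_exit_status() is a hypothetical helper): the parent only
ever sees the low 8 bits of the value passed to exit, so -1 is reported
as 255.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with `code` and return the exit status the
 * parent observes via waitpid(), or -1 if fork/wait fails.
 */
static int observed_exit_status(int code)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(code);	/* child: only the low 8 bits survive */
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

A small positive value such as exit(1) says the same thing without relying
on the truncation.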

> +	os_info("seccomp: filter syscalls in the range: 0x%lx-0x%lx\n",
> +		__userspace_start, __userspace_end);

not sure we need that, but at least it's only once?

> +#ifndef CONFIG_MMU
> +static void sigsys_handler(int sig, struct siginfo *si, mcontext_t *mc)
> +{
> +	struct uml_pt_regs r;
> +
> +	if (!um_zpoline_enabled) {
> +		/* hook syscall via SIGSYS */
> +		mc_set_sigsys_hook(mc);
> +	} else {
> +		/* trap SIGSYS to userspace */
> +		get_regs_from_mc(&r, mc);
> +		trap_sigsys(&r);
> +		/* force handle signals after rt_sigreturn() */
> +		mc_set_regs_ip_relay(mc);

I don't understand why this behaves differently with and without
zpoline, it seems it shouldn't need to. Anyway, still think zpoline is
future work.

johannes


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 07/13] x86/um: nommu: process/thread handling
  2024-12-03  4:23     ` [PATCH v3 07/13] x86/um: nommu: process/thread handling Hajime Tazaki
@ 2024-12-04 16:50       ` Johannes Berg
  2024-12-05 13:56         ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-12-04 16:50 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um; +Cc: ricarkol, Liam.Howlett

On Tue, 2024-12-03 at 13:23 +0900, Hajime Tazaki wrote:
> 
> +++ b/arch/um/kernel/process.c
> @@ -117,13 +117,17 @@ void new_thread_handler(void)
>  	 * callback returns only if the kernel thread execs a process
>  	 */
>  	fn(arg);
> +#ifndef CONFIG_MMU
> +	arch_switch_to(current);
> +#endif
>  	userspace(&current->thread.regs.regs);

that doesn't make sense, arch_switch_to() does nothing anyway on 64-bit

>  /* Called magically, see new_thread_handler above */
>  static void fork_handler(void)
>  {
> -	schedule_tail(current->thread.prev_sched);
> +	if (current->thread.prev_sched)
> +		schedule_tail(current->thread.prev_sched);

Why is that NULL on nommu?

> @@ -134,6 +138,33 @@ static void fork_handler(void)
>  
>  	current->thread.prev_sched = NULL;
>  
> +#ifndef CONFIG_MMU

again, don't sprinkle ifdefs around the C code files - make inlines in a
header file or something

> +	/*
> +	 * child of vfork(2) comes here.
> +	 * clone(2) also enters here but doesn't need to advance the %rsp.
> +	 *
> +	 * This fork can only come from libc's vfork, which
> +	 * does this:
> +	 *	popq %%rdx;
> +	 *	call *%rax; // zpoline => __kernel_vsyscall
> +	 *	pushq %%rdx;

or maybe not zpoline ... so maybe need to update this

> +++ b/arch/um/os-Linux/skas/process.c
> @@ -144,6 +144,7 @@ void wait_stub_done(int pid)
>  
>  extern unsigned long current_stub_stack(void);
>  
> +#ifdef CONFIG_MMU

I'll stop commenting on ifdef sprinkling :)

> +++ b/arch/x86/um/asm/processor.h
> @@ -38,6 +38,18 @@ static __always_inline void cpu_relax(void)
>  
>  #define task_pt_regs(t) (&(t)->thread.regs)
>  
> +#ifndef CONFIG_MMU
> +#define task_top_of_stack(task) \
> +({									\
> +	unsigned long __ptr = (unsigned long)task->stack;	\
> +	__ptr += THREAD_SIZE;			\
> +	__ptr;					\
> +})
> +
> +extern long current_top_of_stack;
> +extern long current_ptregs;
> +#endif

no need for "extern".

you only use all that once, does it need to be here?

> +
>  #include <asm/processor-generic.h>
>  
>  #endif
> diff --git a/arch/x86/um/do_syscall_64.c b/arch/x86/um/do_syscall_64.c
> index 5d0fa83e7fdc..ca468caff729 100644
> --- a/arch/x86/um/do_syscall_64.c
> +++ b/arch/x86/um/do_syscall_64.c
> @@ -1,14 +1,43 @@
>  // SPDX-License-Identifier: GPL-2.0
>  
> +//#define DEBUG 1

please remove

>  #include <linux/kernel.h>
>  #include <linux/ptrace.h>
>  #include <kern_util.h>
>  #include <sysdep/syscalls.h>
>  #include <os.h>
>  
> +/*
> + * save/restore the return address stored in the stack, as the child overwrites
> + * the contents after returning to userspace (i.e., by push %rdx).
> + *
> + * see the detail in fork_handler().
> + */
> +static void *vfork_save_stack(void)
> +{
> +	unsigned char *stack_copy;
> +
> +	stack_copy = kzalloc(8, GFP_KERNEL);

Using a heap allocation to track 8 bytes, when the pointer to the
allocation is already 8 bytes (you're on 64-bit) seems ... rather
wasteful?

I also don't see you ever free it? Restore probably should, but anyway,
it shouldn't exist.
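
Something along these lines would avoid the allocation altogether. This is
only a sketch: struct vfork_state and the two-argument signatures are
assumptions for illustration, and in the real code the field would
presumably sit in the arch thread struct.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical per-thread state holding the saved return address. */
struct vfork_state {
	uint64_t saved_ret;	/* the 8 bytes at the parent's stack pointer */
};

/* Copy the 8 bytes at `sp` into the state; nothing to allocate or free. */
static void vfork_save_stack(struct vfork_state *st, const void *sp)
{
	memcpy(&st->saved_ret, sp, sizeof(st->saved_ret));
}

/* Write the saved 8 bytes back after the child has clobbered them. */
static void vfork_restore_stack(const struct vfork_state *st, void *sp)
{
	memcpy(sp, &st->saved_ret, sizeof(st->saved_ret));
}
```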

johannes


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 08/13] um: nommu: configure fs register on host syscall invocation
  2024-12-03  4:23     ` [PATCH v3 08/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
@ 2024-12-04 16:52       ` Johannes Berg
  2024-12-04 19:31         ` Geert Uytterhoeven
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-12-04 16:52 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um; +Cc: ricarkol, Liam.Howlett

On Tue, 2024-12-03 at 13:23 +0900, Hajime Tazaki wrote:
> 
> +static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2)
> +{
> +	if (host_has_fsgsbase) {
> +		switch (option) {
> +		case ARCH_SET_FS:
> +			wrfsbase(*arg2);
> +			break;
> +		case ARCH_SET_GS:
> +			wrgsbase(*arg2);
> +			break;
> +		case ARCH_GET_FS:
> +			*arg2 = rdfsbase();
> +			break;
> +		case ARCH_GET_GS:
> +			*arg2 = rdgsbase();
> +			break;
> +		}
> +		return 0;
> +	} else {

Even checkpatch complains about else after return :) No need.

Also here, I think need better separation of MMU/no-MMU code. Perhaps
some of the functions should have "nommu" in the name too, and be
otherwise empty inlines.

johannes


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 06/13] um: nommu: syscalls handler from userspace by seccomp filter
  2024-12-03  4:23     ` [PATCH v3 06/13] um: nommu: syscalls handler from userspace by seccomp filter Hajime Tazaki
  2024-12-04 16:42       ` Johannes Berg
@ 2024-12-04 17:54       ` kernel test robot
  1 sibling, 0 replies; 128+ messages in thread
From: kernel test robot @ 2024-12-04 17:54 UTC (permalink / raw)
  To: Hajime Tazaki, linux-um
  Cc: oe-kbuild-all, thehajime, ricarkol, Liam.Howlett,
	Kenichi Yasukata

Hi Hajime,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bed2cc482600296fe04edbc38005ba2851449c10]

url:    https://github.com/intel-lab-lkp/linux/commits/Hajime-Tazaki/fs-binfmt_elf_efpic-add-architecture-hook-elf_arch_finalize_exec/20241203-163016
base:   bed2cc482600296fe04edbc38005ba2851449c10
patch link:    https://lore.kernel.org/r/f1aed0f9233f49880510e9506deddd52b33c2fc6.1733199769.git.thehajime%40gmail.com
patch subject: [PATCH v3 06/13] um: nommu: syscalls handler from userspace by seccomp filter
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412050150.ZqbrNI0T-lkp@intel.com/

includecheck warnings: (new ones prefixed by >>)
>> arch/um/os-Linux/process.c: sys/prctl.h is included more than once.

vim +15 arch/um/os-Linux/process.c

  > 15	#include <sys/prctl.h>
    16	#include <sys/wait.h>
    17	#include <asm/unistd.h>
    18	#include <init.h>
    19	#include <longjmp.h>
    20	#include <as-layout.h>
    21	#include <os.h>
  > 22	#include <sys/prctl.h>
    23	#include <linux/filter.h>
    24	#include <linux/seccomp.h>
    25	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 08/13] um: nommu: configure fs register on host syscall invocation
  2024-12-04 16:52       ` Johannes Berg
@ 2024-12-04 19:31         ` Geert Uytterhoeven
  2024-12-05 13:58           ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Geert Uytterhoeven @ 2024-12-04 19:31 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Hajime Tazaki, linux-um, ricarkol, Liam.Howlett

On Wed, Dec 4, 2024 at 5:53 PM Johannes Berg <johannes@sipsolutions.net> wrote:
> On Tue, 2024-12-03 at 13:23 +0900, Hajime Tazaki wrote:
> >
> > +static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2)
> > +{
> > +     if (host_has_fsgsbase) {
> > +             switch (option) {
> > +             case ARCH_SET_FS:
> > +                     wrfsbase(*arg2);
> > +                     break;
> > +             case ARCH_SET_GS:
> > +                     wrgsbase(*arg2);
> > +                     break;
> > +             case ARCH_GET_FS:
> > +                     *arg2 = rdfsbase();
> > +                     break;
> > +             case ARCH_GET_GS:
> > +                     *arg2 = rdgsbase();
> > +                     break;
> > +             }
> > +             return 0;
> > +     } else {
>
> Even checkpatch complains about else after return :) No need.

And inverting the check would reduce indentation in the largest
branch (the smallest branch is just "return os_arch_prctl(...)").
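
A sketch of the inverted form, so it builds on its own: the
wr/rd{fs,gs}base helpers and os_arch_prctl() below are stubs written for
this example (the real ones come from the patch and the host), and the
os_arch_prctl() signature is assumed. The ARCH_* values are the x86-64
arch_prctl(2) codes.

```c
#include <stdbool.h>

#define ARCH_SET_GS 0x1001
#define ARCH_SET_FS 0x1002
#define ARCH_GET_FS 0x1003
#define ARCH_GET_GS 0x1004

/* Stubs standing in for the FSGSBASE instructions and the fallback. */
static bool host_has_fsgsbase = true;
static unsigned long fake_fs, fake_gs;

static void wrfsbase(unsigned long v) { fake_fs = v; }
static void wrgsbase(unsigned long v) { fake_gs = v; }
static unsigned long rdfsbase(void) { return fake_fs; }
static unsigned long rdgsbase(void) { return fake_gs; }

static int os_arch_prctl(int pid, int option, unsigned long *arg2)
{
	(void)pid; (void)option; (void)arg2;
	return -1;	/* stub: the real fallback would ask the host */
}

static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2)
{
	/* Early return for the short branch; no else needed. */
	if (!host_has_fsgsbase)
		return os_arch_prctl(pid, option, arg2);

	switch (option) {
	case ARCH_SET_FS:
		wrfsbase(*arg2);
		break;
	case ARCH_SET_GS:
		wrgsbase(*arg2);
		break;
	case ARCH_GET_FS:
		*arg2 = rdfsbase();
		break;
	case ARCH_GET_GS:
		*arg2 = rdgsbase();
		break;
	}
	return 0;
}
```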

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 00/13] nommu UML
  2024-12-04 16:20     ` [PATCH v3 00/13] nommu UML Johannes Berg
@ 2024-12-05 13:41       ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-05 13:41 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, ricarkol, Liam.Howlett


Thanks Johannes,

On Thu, 05 Dec 2024 01:20:04 +0900,
Johannes Berg wrote:
> 
> On Tue, 2024-12-03 at 13:22 +0900, Hajime Tazaki wrote:
> > This is a series of patches of nommu arch addition to UML.
> 
> Please next time you resend this, don't hide it in the old thread :)

ah, my bad.

I thought adding In-Reply-To was a must, but the documentation says it
should not be used for multi-patch series.

I will fix it.

-- Hajime



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 02/13] x86/um: nommu: elf loader for fdpic
  2024-12-04 16:20       ` Johannes Berg
@ 2024-12-05 13:41         ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-05 13:41 UTC (permalink / raw)
  To: johannes
  Cc: linux-um, ricarkol, Liam.Howlett, ebiederm, kees, viro, brauner,
	jack, linux-mm, linux-fsdevel


On Thu, 05 Dec 2024 01:20:51 +0900,
Johannes Berg wrote:
> 
> On Tue, 2024-12-03 at 13:23 +0900, Hajime Tazaki wrote:
> > 
> >  arch/um/include/asm/Kbuild           |  1 +
> > 
> >  arch/x86/um/asm/module.h             | 24 ------------------------
> > 
> 
> These changes could be a separate cleanup?

agree.  will do it.

-- Hajime



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 03/13] um: nommu: memory handling
  2024-12-04 16:34       ` Johannes Berg
@ 2024-12-05 13:46         ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-05 13:46 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, ricarkol, Liam.Howlett


On Thu, 05 Dec 2024 01:34:49 +0900,
Johannes Berg wrote:
> 
> On Tue, 2024-12-03 at 13:23 +0900, Hajime Tazaki wrote:
> > 
> > +++ b/arch/um/include/asm/futex.h
> > @@ -8,7 +8,11 @@
> >  
> >  
> >  int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr);
> > +#ifdef CONFIG_MMU
> >  int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
> >  			      u32 oldval, u32 newval);
> > +#else
> > +#include <asm-generic/futex.h>
> > +#endif
> 
> That seems somewhat problematic since it also defines
> arch_futex_atomic_op_inuser ...

indeed, will fix it.

> > +++ b/arch/um/include/shared/os.h
> > @@ -195,7 +195,13 @@ extern void get_host_cpu_features(
> >  extern int create_mem_file(unsigned long long len);
> >  
> >  /* tlb.c */
> > +#ifdef CONFIG_MMU
> >  extern void report_enomem(void);
> > +#else
> > +static inline void report_enomem(void)
> > +{
> > +}
> > +#endif
> 
> That still seems simply wrong? Why is that even called, and why should
> it do *nothing*?
> 
> I'm thinking you might just have patch sequence issues here - I can't
> really see why this would be called at all, eventually.

report_enomem was used in trap.c, which both MMU and !MMU use.  Now I
have split the !MMU-specific code out into its own file, so the above
chunk can be reverted.

> > @@ -78,6 +79,7 @@ void __init mem_init(void)
> >   * Create a page table and place a pointer to it in a middle page
> >   * directory entry.
> >   */
> > +#ifdef CONFIG_MMU
> >  static void __init one_page_table_init(pmd_t *pmd)
> >  {
> >  	if (pmd_none(*pmd)) {
> > @@ -149,6 +151,12 @@ static void __init fixrange_init(unsigned long start, unsigned long end,
> >  		j = 0;
> >  	}
> >  }
> > +#else
> > +static void __init fixrange_init(unsigned long start, unsigned long end,
> > +				 pgd_t *pgd_base)
> > +{
> > +}
> > +#endif
> 
> Really not a fan of all these randomly placed ifdefs ...

I decided to introduce nommu directory to avoid those ifdefs.

> > +#ifdef CONFIG_MMU
> >  static const pgprot_t protection_map[16] = {
> >  	[VM_NONE]					= PAGE_NONE,
> >  	[VM_READ]					= PAGE_READONLY,
> > @@ -249,3 +258,4 @@ static const pgprot_t protection_map[16] = {
> >  	[VM_SHARED | VM_EXEC | VM_WRITE | VM_READ]	= PAGE_SHARED
> >  };
> >  DECLARE_VM_GET_PAGE_PROT
> > +#endif
> 
> Same here.
> 
> I think we can do better - perhaps move some code to mmu.c and nommu.c
> or something like that.

will fix it.

> > diff --git a/arch/um/kernel/physmem.c b/arch/um/kernel/physmem.c
> > index a74f17b033c4..f55d46dbe173 100644
> > --- a/arch/um/kernel/physmem.c
> > +++ b/arch/um/kernel/physmem.c
> > @@ -84,7 +84,11 @@ void __init setup_physmem(unsigned long start, unsigned long reserve_end,
> >  		exit(1);
> >  	}
> >  
> > +#ifdef CONFIG_MMU
> >  	physmem_fd = create_mem_file(len);
> > +#else
> > +	physmem_fd = -1;
> > +#endif
> 
> same here, create_mem_file() can just be in a file only built for mmu,
> and otherwise be an inline that returns -1?

ditto.

> > +#ifdef CONFIG_MMU
> >  	/*
> >  	 * Special kludge - This page will be mapped in to userspace processes
> >  	 * from physmem_fd, so it needs to be written out there.
> >  	 */
> >  	os_seek_file(physmem_fd, __pa(__syscall_stub_start));
> >  	os_write_file(physmem_fd, __syscall_stub_start, PAGE_SIZE);
> > +#endif
> 
> That doesn't even do anything if the fd is -1, do we need the ifdef? ;-)
> 
> Still better as "if (IS_ENABLED())" or something anyway though.

I'll revert this part.

> > +++ b/arch/um/kernel/trap.c
> > @@ -24,6 +24,7 @@
> >  int handle_page_fault(unsigned long address, unsigned long ip,
> >  		      int is_write, int is_user, int *code_out)
> >  {
> > +#ifdef CONFIG_MMU
> >  	struct mm_struct *mm = current->mm;
> >  	struct vm_area_struct *vma;
> >  	pmd_t *pmd;
> > @@ -129,6 +130,9 @@ int handle_page_fault(unsigned long address, unsigned long ip,
> >  		goto out_nosemaphore;
> >  	pagefault_out_of_memory();
> >  	return 0;
> > +#else
> > +	return -EFAULT;
> > +#endif
> >  }
> 
> same comments here ... try not to sprinkle ifdefs everywhere.

will revert it.

-- Hajime


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 04/13] x86/um: nommu: syscall handling
  2024-12-04 16:37       ` Johannes Berg
@ 2024-12-05 13:47         ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-05 13:47 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, ricarkol, Liam.Howlett


On Thu, 05 Dec 2024 01:37:14 +0900,
Johannes Berg wrote:

> >  arch/x86/um/do_syscall_64.c             | 37 +++++++++++
> >  arch/x86/um/entry_64.S                  | 87 +++++++++++++++++++++++++
> > 
> 
> As I said before, I think it needs to be something obviously nommu.
> Maybe in a new directory, or maybe nommu_syscall_64.c or something like
> that.

agree.  I will make a directory for that.

> Also needs comment style fixes and I'm not sure about all the
> pr_debug(), but I guess they don't matter unless someone defines it? 

if you meant the commented-out `//#define DEBUG` line, I will remove
it.

-- Hajime


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 05/13] x86/um: nommu: syscall translation by zpoline
  2024-12-04 16:37       ` Johannes Berg
@ 2024-12-05 13:48         ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-05 13:48 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, ricarkol, Liam.Howlett


On Thu, 05 Dec 2024 01:37:42 +0900,
Johannes Berg wrote:
> 
> On Tue, 2024-12-03 at 13:23 +0900, Hajime Tazaki wrote:
> > This commit adds a mechanism to hook syscalls for unmodified userspace
> > programs used under UML in !MMU mode. The mechanism, called zpoline,
> > 
> 
> I think you should just leave this out of the first version entirely.

okay, agree.

-- Hajime


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 06/13] um: nommu: syscalls handler from userspace by seccomp filter
  2024-12-04 16:42       ` Johannes Berg
@ 2024-12-05 13:51         ` Hajime Tazaki
  2024-12-05 13:54           ` Johannes Berg
  0 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-05 13:51 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, ricarkol, Liam.Howlett, kenichi.yasukata


On Thu, 05 Dec 2024 01:42:11 +0900,
Johannes Berg wrote:
> 
> On Tue, 2024-12-03 at 13:23 +0900, Hajime Tazaki wrote:
> > 
> > +#ifndef CONFIG_MMU
> > +extern int um_zpoline_enabled;
> > +#endif
> 
> That doesn't make sense, there's no good reason these two mechanisms
> need to be mutually exclusive.
> 
> I think you should leave out zpoline initially, get all the other stuff
> sorted out, and then add it later as an optimisation where possible (can
> map at 0, can rewrite the binary, etc.)

okay.

> > +void trap_sigsys(struct uml_pt_regs *regs)
> > +{
> > +	struct task_struct *tsk = current;
> > +
> > +	pr_info_ratelimited("%s%s[%d]: sigsys ip %p sp %p\n",
> > +			    task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
> > +			    tsk->comm, task_pid_nr(tsk),
> > +			    (void *)UPT_IP(regs), (void *)UPT_SP(regs));
> 
> Shouldn't do that, not even rate-limited.

this actually followed the way show_segvinfo() does it, but yes, it
should not.  I will use pr_info() instead.

> > +	if (err) {
> > +		os_warn("SECCOMP_SET_MODE_FILTER (err=%d, ernro=%d)\n",
> > +		       err, errno);
> > +		exit(-1);
> 
> exit(-1) probably isn't quite right, it's more like a u8.

will fix it.

> > +	os_info("seccomp: filter syscalls in the range: 0x%lx-0x%lx\n",
> > +		__userspace_start, __userspace_end);
> 
> not sure we need that, but at least it's only once?

this message is reported at startup, after the filter is installed, so
it is useful for telling whether the feature is enabled or not.
# and this part is expected to run only once.

(and maybe I can put the following fix)

- int os_setup_seccomp(void)
+ int __init os_setup_seccomp(void)

> > +#ifndef CONFIG_MMU
> > +static void sigsys_handler(int sig, struct siginfo *si, mcontext_t *mc)
> > +{
> > +	struct uml_pt_regs r;
> > +
> > +	if (!um_zpoline_enabled) {
> > +		/* hook syscall via SIGSYS */
> > +		mc_set_sigsys_hook(mc);
> > +	} else {
> > +		/* trap SIGSYS to userspace */
> > +		get_regs_from_mc(&r, mc);
> > +		trap_sigsys(&r);
> > +		/* force handle signals after rt_sigreturn() */
> > +		mc_set_regs_ip_relay(mc);
> 
> I don't understand why this behaves differently with and without
> zpoline, it seems it shouldn't need to. Anyway, still think zpoline is
> future work.

I will remove the zpoline part.
When zpoline is used, a SIGSYS signal is a sign of an unexpected syscall
invocation, so the handler raises this signal to userspace (and prints
a message).

When it's not used, the handler calls the syscall entrypoint instead, so
the code paths have to differ.

-- Hajime


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH v3 06/13] um: nommu: syscalls handler from userspace by seccomp filter
  2024-12-05 13:51         ` Hajime Tazaki
@ 2024-12-05 13:54           ` Johannes Berg
  2024-12-06  2:51             ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-12-05 13:54 UTC (permalink / raw)
  To: Hajime Tazaki; +Cc: linux-um, ricarkol, Liam.Howlett, kenichi.yasukata

On Thu, 2024-12-05 at 22:51 +0900, Hajime Tazaki wrote:
> > 
> > I don't understand why this behaves differently with and without
> > zpoline, it seems it shouldn't need to. Anyway, still think zpoline is
> > future work.
> 
> > I will remove the zpoline part.
> > When zpoline is used, a SIGSYS signal indicates an unexpected syscall
> > invocation, so the handler raises the signal to userspace (and prints
> > a message).
> 

But why? We already established that zpoline cannot translate
everything, e.g. JIT code and similar. So even if you have zpoline you
can just have seccomp handle the syscall as a fallback, to catch cases
like that rather than failing, no?

johannes



* Re: [PATCH v3 07/13] x86/um: nommu: process/thread handling
  2024-12-04 16:50       ` Johannes Berg
@ 2024-12-05 13:56         ` Hajime Tazaki
  2024-12-05 13:58           ` Johannes Berg
  0 siblings, 1 reply; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-05 13:56 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, ricarkol, Liam.Howlett


On Thu, 05 Dec 2024 01:50:07 +0900,
Johannes Berg wrote:
> 
> On Tue, 2024-12-03 at 13:23 +0900, Hajime Tazaki wrote:
> > 
> > +++ b/arch/um/kernel/process.c
> > @@ -117,13 +117,17 @@ void new_thread_handler(void)
> >  	 * callback returns only if the kernel thread execs a process
> >  	 */
> >  	fn(arg);
> > +#ifndef CONFIG_MMU
> > +	arch_switch_to(current);
> > +#endif
> >  	userspace(&current->thread.regs.regs);
> 
> that doesn't make sense, arch_switch_to() does nothing anyway on 64-bit

makes sense.  will fix it.

I added code to arch_switch_to() that records the fs register for
nommu, because we don't use ptrace there; so arch_switch_to() does do
real work on 64-bit in this case.  For a kernel thread, however, we
don't need it, so I will remove the call.

> >  /* Called magically, see new_thread_handler above */
> >  static void fork_handler(void)
> >  {
> > -	schedule_tail(current->thread.prev_sched);
> > +	if (current->thread.prev_sched)
> > +		schedule_tail(current->thread.prev_sched);
> 
> Why is that NULL on nommu?

In earlier versions of the series, the pointer was sometimes NULL under
seemingly random conditions, but I can no longer reproduce it.

I'll revert the check until I can reproduce it.

> > @@ -134,6 +138,33 @@ static void fork_handler(void)
> >  
> >  	current->thread.prev_sched = NULL;
> >  
> > +#ifndef CONFIG_MMU
> 
> again, don't sprinkle ifdefs around the C code files - make inlines in a
> header file or something

will revert it.

> > +	/*
> > +	 * child of vfork(2) comes here.
> > +	 * clone(2) also enters here but doesn't need to advance the %rsp.
> > +	 *
> > +	 * This fork can only come from libc's vfork, which
> > +	 * does this:
> > +	 *	popq %%rdx;
> > +	 *	call *%rax; // zpoline => __kernel_vsyscall
> > +	 *	pushq %%rdx;
> 
> or maybe not zpoline ... so maybe need to update this

will do it.

> > +++ b/arch/um/os-Linux/skas/process.c
> > @@ -144,6 +144,7 @@ void wait_stub_done(int pid)
> >  
> >  extern unsigned long current_stub_stack(void);
> >  
> > +#ifdef CONFIG_MMU
> 
> I'll stop commenting on ifdef sprinkling :)

ditto.

> > +++ b/arch/x86/um/asm/processor.h
> > @@ -38,6 +38,18 @@ static __always_inline void cpu_relax(void)
> >  
> >  #define task_pt_regs(t) (&(t)->thread.regs)
> >  
> > +#ifndef CONFIG_MMU
> > +#define task_top_of_stack(task) \
> > +({									\
> > +	unsigned long __ptr = (unsigned long)task->stack;	\
> > +	__ptr += THREAD_SIZE;			\
> > +	__ptr;					\
> > +})
> > +
> > +extern long current_top_of_stack;
> > +extern long current_ptregs;
> > +#endif
> 
> no need for "extern".
> 
> you only use all that once, does it need to be here?

sorry, I don't understand either of these comments; could you
elaborate?

> > +
> >  #include <asm/processor-generic.h>
> >  
> >  #endif
> > diff --git a/arch/x86/um/do_syscall_64.c b/arch/x86/um/do_syscall_64.c
> > index 5d0fa83e7fdc..ca468caff729 100644
> > --- a/arch/x86/um/do_syscall_64.c
> > +++ b/arch/x86/um/do_syscall_64.c
> > @@ -1,14 +1,43 @@
> >  // SPDX-License-Identifier: GPL-2.0
> >  
> > +//#define DEBUG 1
> 
> please remove

yes, will do it.

> > +/*
> > + * save/restore the return address stored in the stack, as the child overwrites
> > + * the contents after returning to userspace (i.e., by push %rdx).
> > + *
> > + * see the detail in fork_handler().
> > + */
> > +static void *vfork_save_stack(void)
> > +{
> > +	unsigned char *stack_copy;
> > +
> > +	stack_copy = kzalloc(8, GFP_KERNEL);
> 
> Using a heap allocation to track 8 bytes, when the pointer to the
> allocation is already 8 bytes (you're on 64-bit) seems ... rather
> wasteful?
> 
> I also don't see you ever free it? Restore probably should, but anyway,
> it shouldn't exist.

oops, my bad...
indeed the memory is never freed.

I'll update this part to use a plain variable instead of a heap
allocation.
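
A minimal sketch of that direction, assuming the saved return address
lives in a fixed-size slot rather than in a kzalloc()'d buffer.  The
struct name vfork_ctx and the two-argument signatures are hypothetical
and only serve the illustration; where the slot would actually live in
the series is not decided here.

```c
#include <string.h>

/* Hypothetical holder for the 8 bytes that the child overwrites. */
struct vfork_ctx {
	unsigned long saved_ret;
};

/* Copy the return address found at the stack pointer into the slot. */
static void vfork_save_stack(struct vfork_ctx *ctx, const void *sp)
{
	memcpy(&ctx->saved_ret, sp, sizeof(ctx->saved_ret));
}

/* Put the saved return address back once the parent resumes. */
static void vfork_restore_stack(const struct vfork_ctx *ctx, void *sp)
{
	memcpy(sp, &ctx->saved_ret, sizeof(ctx->saved_ret));
}
```

With a plain slot there is nothing to allocate and nothing to free, so
the leak pointed out above cannot occur.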

-- Hajime




* Re: [PATCH v3 08/13] um: nommu: configure fs register on host syscall invocation
  2024-12-04 19:31         ` Geert Uytterhoeven
@ 2024-12-05 13:58           ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-05 13:58 UTC (permalink / raw)
  To: geert; +Cc: johannes, linux-um, ricarkol, Liam.Howlett


On Thu, 05 Dec 2024 04:31:11 +0900,
Geert Uytterhoeven wrote:
> 
> On Wed, Dec 4, 2024 at 5:53 PM Johannes Berg <johannes@sipsolutions.net> wrote:
> > On Tue, 2024-12-03 at 13:23 +0900, Hajime Tazaki wrote:
> > >
> > > +static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2)
> > > +{
> > > +     if (host_has_fsgsbase) {
> > > +             switch (option) {
> > > +             case ARCH_SET_FS:
> > > +                     wrfsbase(*arg2);
> > > +                     break;
> > > +             case ARCH_SET_GS:
> > > +                     wrgsbase(*arg2);
> > > +                     break;
> > > +             case ARCH_GET_FS:
> > > +                     *arg2 = rdfsbase();
> > > +                     break;
> > > +             case ARCH_GET_GS:
> > > +                     *arg2 = rdgsbase();
> > > +                     break;
> > > +             }
> > > +             return 0;
> > > +     } else {
> >
> > Even checkpatch complains about else after return :) No need.
> 
> And inverting the check would reduce indentation in the largest
> branch (the smallest branch is just "return os_arch_prctl(...)").

ah, thanks.  will fix it.
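
A rough sketch of the shape both comments point at: the check is
inverted, the short branch returns early, and the switch loses one
level of indentation.  The helper bodies below are userspace stand-ins
(assumptions) so that the fragment compiles on its own; the real
wrfsbase()/rdfsbase() helpers and os_arch_prctl() are not reproduced.

```c
#include <stdbool.h>

/* Option values as defined in the x86 <asm/prctl.h> UAPI header. */
#define ARCH_SET_GS 0x1001
#define ARCH_SET_FS 0x1002
#define ARCH_GET_FS 0x1003
#define ARCH_GET_GS 0x1004

/* Stand-ins (assumptions): plain variables instead of the real registers. */
static bool host_has_fsgsbase = true;
static unsigned long fake_fs, fake_gs;

static void wrfsbase(unsigned long v) { fake_fs = v; }
static void wrgsbase(unsigned long v) { fake_gs = v; }
static unsigned long rdfsbase(void) { return fake_fs; }
static unsigned long rdgsbase(void) { return fake_gs; }

/* Stand-in for the existing fallback path; always fails in this sketch. */
static int os_arch_prctl(int pid, int option, unsigned long *arg2)
{
	(void)pid; (void)option; (void)arg2;
	return -1;
}

static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2)
{
	/* Short branch first: no "else" is needed after the return. */
	if (!host_has_fsgsbase)
		return os_arch_prctl(pid, option, arg2);

	switch (option) {
	case ARCH_SET_FS:
		wrfsbase(*arg2);
		break;
	case ARCH_SET_GS:
		wrgsbase(*arg2);
		break;
	case ARCH_GET_FS:
		*arg2 = rdfsbase();
		break;
	case ARCH_GET_GS:
		*arg2 = rdgsbase();
		break;
	}
	return 0;
}
```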

-- Hajime



* Re: [PATCH v3 07/13] x86/um: nommu: process/thread handling
  2024-12-05 13:56         ` Hajime Tazaki
@ 2024-12-05 13:58           ` Johannes Berg
  2024-12-06  2:49             ` Hajime Tazaki
  0 siblings, 1 reply; 128+ messages in thread
From: Johannes Berg @ 2024-12-05 13:58 UTC (permalink / raw)
  To: Hajime Tazaki; +Cc: linux-um, ricarkol, Liam.Howlett

On Thu, 2024-12-05 at 22:56 +0900, Hajime Tazaki wrote:
> 
> > > +++ b/arch/x86/um/asm/processor.h
> > > @@ -38,6 +38,18 @@ static __always_inline void cpu_relax(void)
> > >  
> > >  #define task_pt_regs(t) (&(t)->thread.regs)
> > >  
> > > +#ifndef CONFIG_MMU
> > > +#define task_top_of_stack(task) \
> > > +({									\
> > > +	unsigned long __ptr = (unsigned long)task->stack;	\
> > > +	__ptr += THREAD_SIZE;			\
> > > +	__ptr;					\
> > > +})
> > > +
> > > +extern long current_top_of_stack;
> > > +extern long current_ptregs;
> > > +#endif
> > 
> > no need for "extern".
> > 
> > you only use all that once, does it need to be here?
> 
> sorry, I don't understand either of these comments; could you
> elaborate?

Sorry, you obviously do need 'extern', I was clearly confused.

Regarding the task_top_of_stack() macro, I thought you only used it in
one C file, so I'm not sure it should be in a header file that's
generally included across the kernel, rather than a private header file
or just in the C file itself.

johannes



* Re: [PATCH v3 07/13] x86/um: nommu: process/thread handling
  2024-12-05 13:58           ` Johannes Berg
@ 2024-12-06  2:49             ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-06  2:49 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, ricarkol, Liam.Howlett


On Thu, 05 Dec 2024 22:58:52 +0900,
Johannes Berg wrote:
> 
> On Thu, 2024-12-05 at 22:56 +0900, Hajime Tazaki wrote:
> > 
> > > > +++ b/arch/x86/um/asm/processor.h
> > > > @@ -38,6 +38,18 @@ static __always_inline void cpu_relax(void)
> > > >  
> > > >  #define task_pt_regs(t) (&(t)->thread.regs)
> > > >  
> > > > +#ifndef CONFIG_MMU
> > > > +#define task_top_of_stack(task) \
> > > > +({									\
> > > > +	unsigned long __ptr = (unsigned long)task->stack;	\
> > > > +	__ptr += THREAD_SIZE;			\
> > > > +	__ptr;					\
> > > > +})
> > > > +
> > > > +extern long current_top_of_stack;
> > > > +extern long current_ptregs;
> > > > +#endif
> > > 
> > > no need for "extern".
> > > 
> > > you only use all that once, does it need to be here?
> > 
> > sorry, I don't understand either of these comments; could you
> > elaborate?
> 
> Sorry, you obviously do need 'extern', I was clearly confused.
> 
> Regarding the task_top_of_stack() macro, I thought you only used it in
> one C file, so I'm not sure it should be in a header file that's
> generally included across the kernel, rather than a private header file
> or just in the C file itself.

I understand, thanks.
I'll move the macro to a different file.
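
One possible form, sketched under the assumption that the helper ends
up private to the single C file that uses it and becomes a static
inline function rather than a statement-expression macro.  THREAD_SIZE
and struct task_struct below are stand-ins so that the fragment builds
outside the kernel tree; the values are not the real ones.

```c
/* Stand-ins (assumptions) for the kernel definitions. */
#define THREAD_SIZE (4 * 4096UL)

struct task_struct {
	void *stack;
};

/* File-local replacement for the task_top_of_stack() macro. */
static inline unsigned long task_top_of_stack(const struct task_struct *task)
{
	return (unsigned long)task->stack + THREAD_SIZE;
}
```

Compared with the macro, the function form type-checks its argument
and evaluates it only once.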

-- Hajime




* Re: [PATCH v3 06/13] um: nommu: syscalls handler from userspace by seccomp filter
  2024-12-05 13:54           ` Johannes Berg
@ 2024-12-06  2:51             ` Hajime Tazaki
  0 siblings, 0 replies; 128+ messages in thread
From: Hajime Tazaki @ 2024-12-06  2:51 UTC (permalink / raw)
  To: johannes; +Cc: linux-um, ricarkol, Liam.Howlett, kenichi.yasukata


On Thu, 05 Dec 2024 22:54:21 +0900,
Johannes Berg wrote:
> 
> On Thu, 2024-12-05 at 22:51 +0900, Hajime Tazaki wrote:
> > > 
> > > I don't understand why this behaves differently with and without
> > > zpoline, it seems it shouldn't need to. Anyway, still think zpoline is
> > > future work.
> > 
> > > I will remove the zpoline part.
> > > When zpoline is used, a SIGSYS signal indicates an unexpected syscall
> > > invocation, so the handler raises the signal to userspace (and prints
> > > a message).
> > 
> 
> But why? We already established that zpoline cannot translate
> everything, e.g. JIT code and similar. So even if you have zpoline you
> can just have seccomp handle the syscall as a fallback, to catch cases
> like that rather than failing, no?

You have a better understanding of this part than I do; yes, the
fallback option should be the way to go.
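
A very rough sketch of what the fallback could mean for the handler:
SIGSYS is always routed to the syscall hook, whether or not zpoline is
enabled.  The register struct, the miss counter, and the body of
mc_set_sigsys_hook() are stand-ins (assumptions) so that the fragment
is self-contained; they are not the code from the series.

```c
#include <stdbool.h>

/* Stand-in (assumption) for the saved machine context. */
struct sketch_regs {
	bool redirected;	/* set once the syscall hook is armed */
};

bool um_zpoline_enabled;
unsigned long zpoline_misses;	/* hypothetical counter, for illustration */

/* Stand-in: the real helper redirects execution to the syscall entry. */
static void mc_set_sigsys_hook(struct sketch_regs *r)
{
	r->redirected = true;
}

static void sigsys_handler(struct sketch_regs *r)
{
	/* With zpoline on, SIGSYS marks a site the rewriting did not cover. */
	if (um_zpoline_enabled)
		zpoline_misses++;

	/* Either way, handle the syscall instead of failing. */
	mc_set_sigsys_hook(r);
}
```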

thanks,
-- Hajime


Thread overview: 128+ messages
2024-10-24 12:09 [RFC PATCH 00/13] nommu UML Hajime Tazaki
2024-10-24 12:09 ` [RFC PATCH 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec Hajime Tazaki
2024-10-24 12:09 ` [RFC PATCH 02/13] x86/um: nommu: elf loader for fdpic Hajime Tazaki
2024-10-25  8:56   ` Johannes Berg
2024-10-25 12:54     ` Hajime Tazaki
2024-10-24 12:09 ` [RFC PATCH 03/13] um: nommu: memory handling Hajime Tazaki
2024-10-25  9:11   ` Johannes Berg
2024-10-25 12:55     ` Hajime Tazaki
2024-10-25 15:15       ` Johannes Berg
2024-10-26  7:24         ` Hajime Tazaki
2024-10-24 12:09 ` [RFC PATCH 04/13] x86/um: nommu: syscall handling Hajime Tazaki
2024-10-25  9:14   ` Johannes Berg
2024-10-25 12:55     ` Hajime Tazaki
2024-10-24 12:09 ` [RFC PATCH 05/13] x86/um: nommu: syscall translation by zpoline Hajime Tazaki
2024-10-25  9:19   ` Johannes Berg
2024-10-25 12:58     ` Hajime Tazaki
2024-10-25 15:20       ` Johannes Berg
2024-10-26  7:36         ` Hajime Tazaki
2024-10-27  9:45           ` Johannes Berg
2024-10-28  7:47             ` Hajime Tazaki
2024-10-24 12:09 ` [RFC PATCH 06/13] x86/um: nommu: process/thread handling Hajime Tazaki
2024-10-25  9:22   ` Johannes Berg
2024-10-25 12:58     ` Hajime Tazaki
2024-10-24 12:09 ` [RFC PATCH 07/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
2024-10-25  9:28   ` Johannes Berg
2024-10-25 13:27     ` Hajime Tazaki
2024-10-25 15:22       ` Johannes Berg
2024-10-26  7:34         ` Hajime Tazaki
2024-10-24 12:09 ` [RFC PATCH 08/13] x86/um/vdso: nommu: vdso memory update Hajime Tazaki
2024-10-25  9:29   ` Johannes Berg
2024-10-25 13:28     ` Hajime Tazaki
2024-10-24 12:09 ` [RFC PATCH 09/13] x86/um: nommu: signal handling Hajime Tazaki
2024-10-25  9:30   ` Johannes Berg
2024-10-25 13:04     ` Hajime Tazaki
2024-10-24 12:09 ` [RFC PATCH 10/13] x86/um: nommu: stack save/restore on vfork Hajime Tazaki
2024-10-24 12:09 ` [RFC PATCH 11/13] um: change machine name for uname output Hajime Tazaki
2024-10-24 12:09 ` [RFC PATCH 12/13] um: nommu: add documentation of nommu UML Hajime Tazaki
2024-10-24 12:09 ` [RFC PATCH 13/13] um: nommu: plug nommu code into build system Hajime Tazaki
2024-10-25  9:33   ` Johannes Berg
2024-10-25 13:05     ` Hajime Tazaki
2024-10-25 15:27       ` Johannes Berg
2024-10-26  7:36         ` Hajime Tazaki
2024-10-26 10:19 ` [RFC PATCH 00/13] nommu UML Benjamin Berg
2024-10-27  9:10   ` Hajime Tazaki
2024-10-28 13:32     ` Benjamin Berg
2024-10-30  9:25       ` Hajime Tazaki
2024-11-09  0:52         ` Hajime Tazaki
2024-11-11  6:27 ` [RFC PATCH v2 " Hajime Tazaki
2024-11-11  6:27   ` [RFC PATCH v2 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec Hajime Tazaki
2024-11-11  6:27   ` [RFC PATCH v2 02/13] x86/um: nommu: elf loader for fdpic Hajime Tazaki
2024-11-12 12:48     ` Geert Uytterhoeven
2024-11-12 22:07       ` Hajime Tazaki
2024-11-13  8:19         ` Geert Uytterhoeven
2024-11-13  8:36           ` Johannes Berg
2024-11-13  8:36             ` Johannes Berg
2024-11-13 10:27               ` Geert Uytterhoeven
2024-11-13 13:17                 ` Hajime Tazaki
2024-11-13 13:55                   ` Geert Uytterhoeven
2024-11-13 23:32                     ` Hajime Tazaki
2024-11-14  1:40                       ` Greg Ungerer
2024-11-14 10:41                         ` Hajime Tazaki
2024-11-11  6:27   ` [RFC PATCH v2 03/13] um: nommu: memory handling Hajime Tazaki
2024-11-11  6:27   ` [RFC PATCH v2 04/13] x86/um: nommu: syscall handling Hajime Tazaki
2024-11-11  6:27   ` [RFC PATCH v2 05/13] x86/um: nommu: syscall translation by zpoline Hajime Tazaki
2024-11-11  6:27   ` [RFC PATCH v2 06/13] um: nommu: prevent host syscalls from userspace by seccomp filter Hajime Tazaki
2024-11-11  6:27   ` [RFC PATCH v2 07/13] x86/um: nommu: process/thread handling Hajime Tazaki
2024-11-11  6:27   ` [RFC PATCH v2 08/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
2024-11-27 10:00     ` Benjamin Berg
2024-11-27 10:26       ` Hajime Tazaki
2024-11-11  6:27   ` [RFC PATCH v2 09/13] x86/um/vdso: nommu: vdso memory update Hajime Tazaki
2024-11-27 10:36     ` Benjamin Berg
2024-11-27 23:23       ` Hajime Tazaki
2024-11-11  6:27   ` [RFC PATCH v2 10/13] x86/um: nommu: signal handling Hajime Tazaki
2024-11-28 10:37     ` Benjamin Berg
2024-12-01  1:38       ` Hajime Tazaki
2024-11-11  6:27   ` [RFC PATCH v2 11/13] um: change machine name for uname output Hajime Tazaki
2024-11-11  6:27   ` [RFC PATCH v2 12/13] um: nommu: add documentation of nommu UML Hajime Tazaki
2024-11-11  6:27   ` [RFC PATCH v2 13/13] um: nommu: plug nommu code into build system Hajime Tazaki
2024-11-15 10:12   ` [RFC PATCH v2 00/13] nommu UML Johannes Berg
2024-11-15 10:26     ` Anton Ivanov
2024-11-15 14:54       ` Hajime Tazaki
2024-11-15 14:48     ` Hajime Tazaki
2024-11-22  9:33   ` Lorenzo Stoakes
2024-11-22  9:53     ` Johannes Berg
2024-11-22 10:29       ` Lorenzo Stoakes
2024-11-22 12:18       ` Christoph Hellwig
2024-11-22 12:25         ` Lorenzo Stoakes
2024-11-22 12:38           ` Christoph Hellwig
2024-11-22 12:49             ` Damien Le Moal
2024-11-22 12:52               ` Lorenzo Stoakes
2024-11-23  7:27   ` David Gow
2024-11-24  1:25     ` Hajime Tazaki
2024-12-03  4:22   ` [PATCH v3 " Hajime Tazaki
2024-12-03  4:23     ` [PATCH v3 01/13] fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec Hajime Tazaki
2024-12-03  4:23     ` [PATCH v3 02/13] x86/um: nommu: elf loader for fdpic Hajime Tazaki
2024-12-04 16:20       ` Johannes Berg
2024-12-05 13:41         ` Hajime Tazaki
2024-12-03  4:23     ` [PATCH v3 03/13] um: nommu: memory handling Hajime Tazaki
2024-12-04 16:34       ` Johannes Berg
2024-12-05 13:46         ` Hajime Tazaki
2024-12-03  4:23     ` [PATCH v3 04/13] x86/um: nommu: syscall handling Hajime Tazaki
2024-12-04 16:37       ` Johannes Berg
2024-12-05 13:47         ` Hajime Tazaki
2024-12-03  4:23     ` [PATCH v3 05/13] x86/um: nommu: syscall translation by zpoline Hajime Tazaki
2024-12-04 16:37       ` Johannes Berg
2024-12-05 13:48         ` Hajime Tazaki
2024-12-03  4:23     ` [PATCH v3 06/13] um: nommu: syscalls handler from userspace by seccomp filter Hajime Tazaki
2024-12-04 16:42       ` Johannes Berg
2024-12-05 13:51         ` Hajime Tazaki
2024-12-05 13:54           ` Johannes Berg
2024-12-06  2:51             ` Hajime Tazaki
2024-12-04 17:54       ` kernel test robot
2024-12-03  4:23     ` [PATCH v3 07/13] x86/um: nommu: process/thread handling Hajime Tazaki
2024-12-04 16:50       ` Johannes Berg
2024-12-05 13:56         ` Hajime Tazaki
2024-12-05 13:58           ` Johannes Berg
2024-12-06  2:49             ` Hajime Tazaki
2024-12-03  4:23     ` [PATCH v3 08/13] um: nommu: configure fs register on host syscall invocation Hajime Tazaki
2024-12-04 16:52       ` Johannes Berg
2024-12-04 19:31         ` Geert Uytterhoeven
2024-12-05 13:58           ` Hajime Tazaki
2024-12-03  4:23     ` [PATCH v3 09/13] x86/um/vdso: nommu: vdso memory update Hajime Tazaki
2024-12-03  4:23     ` [PATCH v3 10/13] x86/um: nommu: signal handling Hajime Tazaki
2024-12-03  4:23     ` [PATCH v3 11/13] um: change machine name for uname output Hajime Tazaki
2024-12-03  4:23     ` [PATCH v3 12/13] um: nommu: add documentation of nommu UML Hajime Tazaki
2024-12-03  4:23     ` [PATCH v3 13/13] um: nommu: plug nommu code into build system Hajime Tazaki
2024-12-04 16:20     ` [PATCH v3 00/13] nommu UML Johannes Berg
2024-12-05 13:41       ` Hajime Tazaki
