linux-um.lists.infradead.org archive mirror
* [RFC PATCH 00/13] nommu UML
@ 2024-10-24 12:09 Hajime Tazaki
From: Hajime Tazaki @ 2024-10-24 12:09 UTC
  To: linux-um, jdike, richard, anton.ivanov, johannes; +Cc: thehajime, ricarkol

This patch series adds a nommu architecture mode to UML.  Comments and
opinions are welcome.

We have already found several limitations and issues, listed below.

- The prompt configured by /etc/profile is broken (variables are not
  expanded, leaving ${HOSTNAME%%.*}:$PWD# as is).
- There is no mechanism to cache the memory mapped by exec(2), so files
  are read from the filesystem on every exec, which slows down some
  benchmarks (lmbench).
- A crash in a userspace program crashes the UML kernel instead of
  delivering SIGSEGV to the program.
- Commit c27e618 (merged during v6.12-rc1), which updates an internal
  procedure of the maple_tree subsystem, introduces an invalid access
  to a vma structure in our case.  We are trying to fix the issue, but
  a random process still crashes on exit(2).

UML has been built with CONFIG_MMU since day 0.  This feature
introduces a nommu mode, approaching it from a different angle than
the Linux Kernel Library did.


What is it for?
===============

- Alleviate the overhead of syscall hooks implemented with ptrace(2)
- Exercise nommu code over UML (and over KUnit)
- Reduce dependency on host facilities


How does it work?
=================

To illustrate how this feature works, the following shows how syscalls
are called in the nommu UML environment.

- The kernel boots and sets up the zpoline trampoline code (detailed
  later) at address 0x0.
- (userspace starts)
- Userspace calls the vfork/execve syscalls.
- During execve, more specifically in the load_elf_fdpic_binary()
  function, the kernel replaces `syscall/sysenter` instructions with
  `call *%rax`.  Because %rax holds the syscall number, the call
  targets an address between 0 and NR_syscalls (around 512), where
  the trampoline code was installed during startup.
- When userspace issues a syscall, execution jumps to *%rax, slides
  through the `nop` instructions until they end, and jumps to the
  hooked function, `__kernel_vsyscall`, which is the syscall
  entrypoint in the nommu UML environment.
- The entrypoint calls the handler function in sys_call_table[] and
  follows the usual UML syscall path.
- Execution returns to userspace.
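
The rewrite and trampoline steps above can be sketched in plain
userspace C.  This is a simplified illustration, not the code in
arch/x86/um/zpoline.c: the function names, NR_SYSCALLS_SKETCH, and the
0xcc placeholder byte are assumptions made for the example, and the raw
byte scan ignores instruction boundaries, which a real implementation
has to respect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the text above says NR_syscalls is "around 512". */
#define NR_SYSCALLS_SKETCH 512

/*
 * Replace each 2-byte `syscall` (0f 05) or `sysenter` (0f 34) with
 * `call *%rax` (ff d0).  Returns the number of rewritten sites.
 * Simplification: a raw byte scan may also match operand bytes.
 */
static size_t rewrite_syscalls(uint8_t *code, size_t len)
{
	size_t i = 0, hits = 0;

	assert(code != NULL);
	while (i + 1 < len) {
		if (code[i] == 0x0f &&
		    (code[i + 1] == 0x05 || code[i + 1] == 0x34)) {
			code[i] = 0xff;
			code[i + 1] = 0xd0;
			hits++;
			i += 2;
		} else {
			i++;
		}
	}
	return hits;
}

/*
 * Build a trampoline image: one single-byte nop (0x90) per possible
 * syscall number, followed by a placeholder byte standing in for the
 * jump to the hook.  Returns the offset of that placeholder.
 */
static size_t build_trampoline(uint8_t *buf, size_t len)
{
	assert(buf != NULL && len > NR_SYSCALLS_SKETCH);
	memset(buf, 0x90, NR_SYSCALLS_SKETCH);
	buf[NR_SYSCALLS_SKETCH] = 0xcc; /* placeholder, not a real jump */
	return NR_SYSCALLS_SKETCH;
}

/* Simulate `call *%rax`: enter at `entry` and slide over the nops. */
static size_t slide(const uint8_t *buf, size_t len, size_t entry)
{
	while (entry < len && buf[entry] == 0x90)
		entry++;
	return entry;
}
```

Whatever syscall number is used as the entry offset, slide() ends at
the same placeholder, which is why a single hook can serve all
syscalls.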


What are the differences from MMU-full UML?
===========================================

The current nommu implementation adds three features which MMU-full
UML doesn't have:

- the kernel address space is directly accessible from userspace
  - so, uaccess() always returns 1
  - the generic implementations of memcpy/strcpy/futex are also used
- an alternate syscall entrypoint without ptrace
- translation of syscall/sysenter instructions to trampoline code
  and syscall hooks

These modifications allow us to use unmodified userspace binaries
with nommu UML.
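
As a rough illustration of an entrypoint that dispatches on the
syscall number without ptrace, here is a minimal userspace sketch of a
table lookup.  All names (sketch_table, sketch_dispatch, the two
handlers) are hypothetical and only mirror the idea of indexing
sys_call_table[]; the real entrypoint lives in the patches.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler signature; real syscalls take up to six args. */
typedef long (*sketch_syscall_fn)(long a0, long a1, long a2);

static long sketch_zero(long a0, long a1, long a2)
{
	(void)a0; (void)a1; (void)a2;
	return 0;
}

static long sketch_add(long a0, long a1, long a2)
{
	(void)a2;
	return a0 + a1;
}

#define SKETCH_NR 4

/* Sparse table: unset slots stay NULL and are reported as -ENOSYS. */
static const sketch_syscall_fn sketch_table[SKETCH_NR] = {
	[0] = sketch_zero,
	[1] = sketch_add,
};

/* Look up the handler by number and call it directly. */
static long sketch_dispatch(long nr, long a0, long a1, long a2)
{
	assert(SKETCH_NR > 0);
	if (nr < 0 || nr >= SKETCH_NR || sketch_table[nr] == NULL)
		return -ENOSYS;
	return sketch_table[nr](a0, a1, a2);
}
```

The point of the sketch is that the call is an ordinary function call
in the same address space, with no context switch to a tracer process.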


History
=======

This feature was originally introduced by Ricardo Koller at Open
Source Summit NA 2020.  The original code was then cleaned up and
integrated with the syscall translation functionality.

Building and running
====================

```
% make ARCH=um x86_64_nommu_defconfig
% make ARCH=um
```

This builds UML with CONFIG_MMU=n applied.

KUnit tests can be run with the following command:

```
% ./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_MMU=n
```

To run a typical Linux distribution, we need a nommu-aware userspace.
We can use a stock version of Alpine Linux with nommu builds of
busybox and musl-libc.


Preparing root filesystem
=========================

nommu UML requires a standard library which is aware of nommu
kernels.  We have tested custom-built musl-libc and busybox, both of
which have built-in support for nommu kernels.

No Linux distribution is available for nommu on the x86_64
architecture, so we need to prepare our own image for the root
filesystem.  We use Alpine Linux as the base distribution and replace
busybox and musl-libc on top of that.  The following steps prepare
the filesystem for a quick start.

```
     # Export the root filesystem of the prebuilt nommu Alpine image.
     container_id=$(docker create ghcr.io/thehajime/alpine:3.20.3-um-nommu)
     docker start "$container_id"
     docker wait "$container_id"
     docker export "$container_id" > alpine.tar
     docker rm "$container_id"

     # Create a sparse 1G ext4 image and unpack the tarball into it.
     mnt=$(mktemp -d)
     dd if=/dev/zero of=alpine.ext4 bs=1 count=0 seek=1G
     sudo chmod og+wr "alpine.ext4"
     yes 2>/dev/null | mkfs.ext4 "alpine.ext4" || true
     sudo mount "alpine.ext4" "$mnt"
     sudo tar -xf alpine.tar -C "$mnt"
     sudo umount "$mnt"
```

This creates a file image, `alpine.ext4`, which contains the nommu
builds of busybox and musl on the Alpine Linux root filesystem.  The
file can be passed via the `ubd0=` argument on the UML command line.

```
  ./vmlinux eth0=tuntap,tap100,0e:fd:0:0:0:1,172.17.0.1 ubd0=./alpine.ext4 rw mem=1024m loglevel=8 init=/sbin/init
```

We plan to upstream apk packages for busybox and musl so that we can
follow the proper procedure to set up the root filesystem.


Quick start with docker
=======================

A docker image is available for a quick start in a single step.

```
  docker run -it -v /dev/shm:/dev/shm --rm ghcr.io/thehajime/alpine:3.20.3-um-nommu
```

This will launch a UML instance with a pre-configured root filesystem.

Benchmark
=========

The following shows example performance measurements with lmbench and
a (self-crafted) getpid benchmark (on the v6.12-rc3 linus tree).

### lmbench (usec)

||native|um|um-nommu|
|--|--|--|--|
|select-10    |0.5645|28.3738|0.2647|
|select-100   |2.3872|28.8385|1.1021|
|select-1000  |20.5527|37.6364|9.4264|
|syscall      |0.1735|26.8711|0.1037|
|read         |0.3442|28.5771|0.1370|
|write        |0.2862|28.7340|0.1236|
|stat         |1.9236|38.5928|0.4640|
|open/close   |3.8308|66.8451|0.7789|
|fork+sh      |1176.4444|8221.5000|21443.0000|
|fork+execve  |533.1053|3034.5000|4894.3333|

### do_getpid bench (nsec)

||native|um|um-nommu|
|--|--|--|--|
|getpid | 180 | 31579 | 101|
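
The getpid benchmark is described only as self-crafted, so its source
is not shown here.  The following is a minimal sketch of what such a
microbenchmark could look like, assuming a simple timed loop; the
function names and the use of CLOCK_MONOTONIC are choices made for
this example and not the code behind the table above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average cost in nanoseconds of one getpid(2) over `iters` calls. */
static double bench_getpid_ns(unsigned long iters)
{
	uint64_t start, end;
	unsigned long i;

	assert(iters > 0);
	start = now_ns();
	for (i = 0; i < iters; i++)
		syscall(SYS_getpid); /* go through the syscall path each time */
	end = now_ns();
	return (double)(end - start) / (double)iters;
}
```

Calling bench_getpid_ns() with a large iteration count (for example
one million) and printing the result gives a per-call figure in the
same unit as the table.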


Limitations
===========

generic nommu limitations
-------------------------
Since this port is a nommu architecture kernel, the implementation
inherits the characteristics of other nommu kernels (riscv, arm,
etc.), described below.

- vfork(2) should be used instead of fork(2)
- the ELF loader only loads PIE (position independent executable)
  binaries
- all processes share a single address space
- mmap(2) offers a subset of functionalities (e.g., MAP_FIXED is
  unsupported)

Thus, the options for userspace programs are limited.  We have tested
Alpine Linux with musl-libc, which supports nommu kernels.
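
For userspace code that currently relies on fork(2), the usual
nommu-friendly pattern is vfork(2) followed immediately by execve(2).
A minimal sketch, where spawn_wait() is a hypothetical helper name
introduced for this example:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Run `path` with `argv` via vfork(2) + execve(2) and return its exit
 * status, or -1 on failure or abnormal termination.
 */
static int spawn_wait(const char *path, char *const argv[])
{
	int status;
	pid_t pid;

	assert(path != NULL && argv != NULL);
	pid = vfork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/*
		 * Child: it borrows the parent's memory until exec, so it
		 * must do nothing except execve() or _exit().
		 */
		execve(path, argv, environ);
		_exit(127);
	}
	/* Parent: resumes only after the child has exec'ed or exited. */
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The same code also works on an MMU kernel, so programs written this
way stay portable between the two modes.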

access to mmap_min_addr
-----------------------
As the syscall translation mechanism relies on the ability to read
and write memory at address zero (0x0), the host kernel needs to be
configured with the following command:

```
% sudo sh -c "echo 0 > /proc/sys/vm/mmap_min_addr"
```
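
Whether the host is configured as required can be checked before
starting UML.  The sketch below reads the sysctl file and also tries
to map one page at address zero; both helper names are hypothetical
and the 4096-byte page size is an assumption of the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>

/* Read the value stored in `path`; return -1 if it cannot be read. */
static long read_min_addr(const char *path)
{
	FILE *f;
	long val = -1;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/* Return 1 if one page can be mapped at address 0, else 0. */
static int can_map_zero_page(void)
{
	void *p = mmap((void *)0, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	if (p == MAP_FAILED)
		return 0;
	munmap(p, 4096);
	return 1;
}
```

A caller would pass "/proc/sys/vm/mmap_min_addr" to read_min_addr()
and expect 0 on a correctly configured host.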

supported architecture
----------------------
The current implementation of nommu UML only works on the x86_64
SUBARCH.  We have not tested it in a 32-bit environment.

target of syscall translation
-----------------------------
For the moment, the syscall translation only applies to the executable
and interpreter of ELF binary files processed by the execve(2)
syscall.  Other libraries, such as linked libraries and dlopen-ed
ones, aren't translated; we may be able to trigger the translation
with LD_PRELOAD.

Note that with musl-libc in Alpine Linux, which we have tested, most
syscalls are implemented in the interpreter file
(ld-musl-x86_64.so), so syscall/sysenter instructions issued from
linked/loaded libraries might be rare.  But they are definitely
possible, so a workaround with LD_PRELOAD is effective.


Further reading about NOMMU UML
===============================

- NOMMU UML (original code by Ricardo Koller)
https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf

- zpoline: syscall translation mechanism
https://www.usenix.org/conference/atc23/presentation/yasukata

Please review the following changes for suitability for inclusion. If you have
any objections or suggestions for improvement, please respond to the patches. If
you agree with the changes, please provide your Acked-by.

The following changes since commit c2ee9f594da826bea183ed14f2cc029c719bf4da:

  KVM: selftests: Fix build on on non-x86 architectures (2024-10-21 15:49:33 -0700)

are available in the Git repository at:

  https://github.com/thehajime/linux 82a7ee8b31c51edb47e144922581824a3b5e371d
  https://github.com/thehajime/linux/tree/um-nommu-v6.12-rc4-rfc

Hajime Tazaki (13):
  fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec
  x86/um: nommu: elf loader for fdpic
  um: nommu: memory handling
  x86/um: nommu: syscall handling
  x86/um: nommu: syscall translation by zpoline
  x86/um: nommu: process/thread handling
  um: nommu: configure fs register on host syscall invocation
  x86/um/vdso: nommu: vdso memory update
  x86/um: nommu: signal handling
  x86/um: nommu: stack save/restore on vfork
  um: change machine name for uname output
  um: nommu: add documentation of nommu UML
  um: nommu: plug nommu code into build system

 Documentation/virt/uml/nommu-uml.rst    | 219 +++++++++++++++++++++++
 arch/um/Kconfig                         |  13 +-
 arch/um/Makefile                        |   6 +
 arch/um/configs/x86_64_nommu_defconfig  |  64 +++++++
 arch/um/include/asm/futex.h             |   4 +
 arch/um/include/asm/mmu.h               |   8 +
 arch/um/include/asm/mmu_context.h       |  14 +-
 arch/um/include/asm/ptrace-generic.h    |  17 ++
 arch/um/include/asm/tlbflush.h          |  23 ++-
 arch/um/include/asm/uaccess.h           |   7 +-
 arch/um/include/shared/common-offsets.h |   3 +
 arch/um/include/shared/os.h             |   9 +
 arch/um/kernel/Makefile                 |   3 +-
 arch/um/kernel/exec.c                   |   8 +
 arch/um/kernel/mem.c                    |  13 ++
 arch/um/kernel/physmem.c                |   6 +
 arch/um/kernel/process.c                |  34 +++-
 arch/um/kernel/skas/Makefile            |   3 +-
 arch/um/kernel/trap.c                   |   4 +
 arch/um/os-Linux/main.c                 |   5 +
 arch/um/os-Linux/process.c              |  22 +++
 arch/um/os-Linux/skas/process.c         |   4 +
 arch/um/os-Linux/start_up.c             |  47 +++++
 arch/um/os-Linux/time.c                 |   3 +-
 arch/um/os-Linux/util.c                 |   3 +-
 arch/x86/um/Makefile                    |  18 ++
 arch/x86/um/asm/elf.h                   |  12 +-
 arch/x86/um/asm/module.h                |  19 +-
 arch/x86/um/asm/processor.h             |  12 ++
 arch/x86/um/do_syscall_64.c             | 113 ++++++++++++
 arch/x86/um/entry_64.S                  | 110 ++++++++++++
 arch/x86/um/shared/sysdep/syscalls_64.h |   4 +
 arch/x86/um/signal.c                    |  26 +++
 arch/x86/um/syscalls_64.c               |  67 +++++++
 arch/x86/um/vdso/um_vdso.c              |  20 +++
 arch/x86/um/vdso/vma.c                  |  16 +-
 arch/x86/um/zpoline.c                   | 228 ++++++++++++++++++++++++
 fs/Kconfig.binfmt                       |   2 +-
 fs/binfmt_elf_fdpic.c                   |  10 ++
 39 files changed, 1164 insertions(+), 35 deletions(-)
 create mode 100644 Documentation/virt/uml/nommu-uml.rst
 create mode 100644 arch/um/configs/x86_64_nommu_defconfig
 create mode 100644 arch/x86/um/do_syscall_64.c
 create mode 100644 arch/x86/um/entry_64.S
 create mode 100644 arch/x86/um/zpoline.c

-- 
2.43.0


