* [PATCH 0/5] Increased address space for 64 bit
@ 2024-05-28 8:54 benjamin
2024-05-28 8:54 ` [PATCH 1/5] um: Fix stub_start address calculation benjamin
` (4 more replies)
0 siblings, 5 replies; 15+ messages in thread
From: benjamin @ 2024-05-28 8:54 UTC (permalink / raw)
To: linux-um; +Cc: Benjamin Berg
From: Benjamin Berg <benjamin.berg@intel.com>
This patchset fixes a few bugs, adds a new method of discovering the
host task size and finally adds four-level page table support. All of
this makes the userspace TASK_SIZE much larger, which in turn permits
userspace applications that need a lot of virtual address space to
work.
One such application is ASAN, which uses a fixed address in memory that
would otherwise not be addressable.
Benjamin Berg (5):
um: Fix stub_start address calculation
um: Limit TASK_SIZE to the addressable range
um: Do a double clone to disable rseq
um: Discover host_task_size from envp
um: Add 4 level page table support
arch/um/Kconfig | 1 +
arch/um/include/asm/page.h | 14 +++-
arch/um/include/asm/pgalloc.h | 11 ++-
arch/um/include/asm/pgtable-4level.h | 119 +++++++++++++++++++++++++++
arch/um/include/asm/pgtable.h | 6 +-
arch/um/include/shared/as-layout.h | 2 +-
arch/um/include/shared/os.h | 2 +-
arch/um/kernel/mem.c | 17 +++-
arch/um/kernel/um_arch.c | 14 +++-
arch/um/os-Linux/main.c | 9 +-
arch/um/os-Linux/skas/process.c | 54 +++++++++++-
arch/x86/um/Kconfig | 38 ++++++---
arch/x86/um/os-Linux/task_size.c | 19 ++++-
13 files changed, 274 insertions(+), 32 deletions(-)
create mode 100644 arch/um/include/asm/pgtable-4level.h
--
2.45.1
* [PATCH 1/5] um: Fix stub_start address calculation
2024-05-28 8:54 [PATCH 0/5] Increased address space for 64 bit benjamin
@ 2024-05-28 8:54 ` benjamin
2024-05-28 8:54 ` [PATCH 2/5] um: Limit TASK_SIZE to the addressable range benjamin
` (3 subsequent siblings)
4 siblings, 0 replies; 15+ messages in thread
From: benjamin @ 2024-05-28 8:54 UTC (permalink / raw)
To: linux-um; +Cc: Benjamin Berg
From: Benjamin Berg <benjamin.berg@intel.com>
The calculation was wrong: it subtracted only one byte before rounding
down for alignment, which leaves too little room for the stub pages if
host_task_size is not already aligned.
This probably went unnoticed because on 64 bit host_task_size is bigger
than the value returned by os_get_top_address.
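The difference is easy to demonstrate with a small standalone sketch (illustrative helper names, not kernel code):

```c
/*
 * Illustrative sketch of the two calculations (not kernel code).
 * `limit` plays the role of host_task_size, `block` the size of the
 * stub data area (STUB_DATA_PAGES * PAGE_SIZE, a power of two).
 */
static unsigned long stub_start_old(unsigned long limit, unsigned long block)
{
	/* Align down from limit - 1: the block may still end above limit. */
	return (limit - 1) & ~(block - 1);
}

static unsigned long stub_start_new(unsigned long limit, unsigned long block)
{
	/* Reserve a full block below the limit first, then align down. */
	return (limit - block) & ~(block - 1);
}
```

With an unaligned limit such as 0x1800 and a 0x1000-byte block, the old calculation yields 0x1000, so the block would end at 0x2000, above the limit; the new one yields 0x0, which fits. For an aligned limit both agree.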
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
---
arch/um/kernel/um_arch.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c
index e95f805e5004..0d8b1a73cd5b 100644
--- a/arch/um/kernel/um_arch.c
+++ b/arch/um/kernel/um_arch.c
@@ -331,7 +331,8 @@ int __init linux_main(int argc, char **argv)
/* reserve a few pages for the stubs (taking care of data alignment) */
/* align the data portion */
BUILD_BUG_ON(!is_power_of_2(STUB_DATA_PAGES));
- stub_start = (host_task_size - 1) & ~(STUB_DATA_PAGES * PAGE_SIZE - 1);
+ stub_start = (host_task_size - STUB_DATA_PAGES * PAGE_SIZE) &
+ ~(STUB_DATA_PAGES * PAGE_SIZE - 1);
/* another page for the code portion */
stub_start -= PAGE_SIZE;
host_task_size = stub_start;
--
2.45.1
* [PATCH 2/5] um: Limit TASK_SIZE to the addressable range
2024-05-28 8:54 [PATCH 0/5] Increased address space for 64 bit benjamin
2024-05-28 8:54 ` [PATCH 1/5] um: Fix stub_start address calculation benjamin
@ 2024-05-28 8:54 ` benjamin
2024-05-28 8:54 ` [PATCH 3/5] um: Do a double clone to disable rseq benjamin
` (2 subsequent siblings)
4 siblings, 0 replies; 15+ messages in thread
From: benjamin @ 2024-05-28 8:54 UTC (permalink / raw)
To: linux-um; +Cc: Benjamin Berg
From: Benjamin Berg <benjamin.berg@intel.com>
We may have a TASK_SIZE from the host that is bigger than UML is able to
address with a three-level pagetable. Guard against that by clipping the
maximum TASK_SIZE to the maximum addressable area.
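As a standalone sketch (illustrative names and parameters, not the kernel code), the clipping amounts to:

```c
/*
 * Sketch: limit a task size to what `ptrs_per_pgd` top-level entries of
 * `pgdir_size` bytes each can map, then align down to a PGDIR boundary.
 * pgdir_size must be a power of two.
 */
static unsigned long clip_task_size(unsigned long host_task_size,
				    unsigned long ptrs_per_pgd,
				    unsigned long pgdir_size)
{
	unsigned long max_addressable = ptrs_per_pgd * pgdir_size;
	unsigned long task_size = host_task_size;

	if (task_size > max_addressable)
		task_size = max_addressable;

	/* equivalent of task_size & PGDIR_MASK */
	return task_size & ~(pgdir_size - 1);
}
```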
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
---
arch/um/kernel/um_arch.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c
index 0d8b1a73cd5b..5ab1a92b6bf7 100644
--- a/arch/um/kernel/um_arch.c
+++ b/arch/um/kernel/um_arch.c
@@ -337,11 +337,16 @@ int __init linux_main(int argc, char **argv)
stub_start -= PAGE_SIZE;
host_task_size = stub_start;
+ /* Limit TASK_SIZE to what is addressable by the page table */
+ task_size = host_task_size;
+ if (task_size > PTRS_PER_PGD * PGDIR_SIZE)
+ task_size = PTRS_PER_PGD * PGDIR_SIZE;
+
/*
* TASK_SIZE needs to be PGDIR_SIZE aligned or else exit_mmap craps
* out
*/
- task_size = host_task_size & PGDIR_MASK;
+ task_size = task_size & PGDIR_MASK;
/* OS sanity checks that need to happen before the kernel runs */
os_early_checks();
--
2.45.1
* [PATCH 3/5] um: Do a double clone to disable rseq
2024-05-28 8:54 [PATCH 0/5] Increased address space for 64 bit benjamin
2024-05-28 8:54 ` [PATCH 1/5] um: Fix stub_start address calculation benjamin
2024-05-28 8:54 ` [PATCH 2/5] um: Limit TASK_SIZE to the addressable range benjamin
@ 2024-05-28 8:54 ` benjamin
2024-05-28 10:16 ` Tiwei Bie
2024-05-28 8:54 ` [PATCH 4/5] um: Discover host_task_size from envp benjamin
2024-05-28 8:54 ` [PATCH 5/5] um: Add 4 level page table support benjamin
4 siblings, 1 reply; 15+ messages in thread
From: benjamin @ 2024-05-28 8:54 UTC (permalink / raw)
To: linux-um; +Cc: Benjamin Berg
From: Benjamin Berg <benjamin.berg@intel.com>
Newer glibc versions are enabling rseq support by default. This remains
enabled in the cloned child process, potentially causing the host kernel
to write/read memory in the child.
It appears that this has not been an issue so far only because the
memory area used happened to be above TASK_SIZE and remained mapped.
Note that a better approach would be to exec a small static binary that
does not link against other libraries. Using a memfd and execveat, the
binary could be embedded into UML itself, resulting in an entirely
clean execution environment for userspace.
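For reference, a minimal standalone sketch of the double-clone pattern (not the UML code itself; error handling trimmed). The first clone passes CLONE_VM, across which the host kernel does not carry the rseq registration into the new task, and the shared address space lets the helper report the grandchild's pid back; the second clone then creates the separate address space:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

static char outer_stack[16384];
static char inner_stack[16384];
static volatile int inner_pid = -1;

static int inner(void *arg)
{
	(void)arg;
	_exit(42);	/* stands in for userspace_tramp */
}

static int outer(void *arg)
{
	(void)arg;
	/*
	 * Second clone, without CLONE_VM: the child gets its own copy of
	 * the address space.  CLONE_PARENT reparents it to our parent so
	 * the original process can wait for it directly.
	 */
	inner_pid = clone(inner, inner_stack + sizeof(inner_stack),
			  CLONE_PARENT | SIGCHLD, NULL);
	if (inner_pid < 0)
		inner_pid = -errno;
	_exit(0);
}

/* Returns the inner child's exit code, or a negative error. */
static int run_double_clone(void)
{
	int pid, status;

	/* First clone, with CLONE_VM: the new task starts without an
	 * rseq registration, and sharing the VM lets it report the
	 * grandchild's pid back through inner_pid. */
	pid = clone(outer, outer_stack + sizeof(outer_stack),
		    CLONE_VM | SIGCHLD, NULL);
	if (pid < 0 || waitpid(pid, &status, 0) < 0)
		return -errno;
	if (inner_pid < 0)
		return inner_pid;
	if (waitpid(inner_pid, &status, 0) < 0)
		return -errno;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```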
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
---
arch/um/os-Linux/skas/process.c | 54 ++++++++++++++++++++++++++++++---
1 file changed, 50 insertions(+), 4 deletions(-)
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index 41a288dcfc34..ee332a2aeea6 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -255,6 +255,31 @@ static int userspace_tramp(void *stack)
int userspace_pid[NR_CPUS];
int kill_userspace_mm[NR_CPUS];
+struct tramp_data {
+ int pid;
+ void *clone_sp;
+ void *stack;
+};
+
+static int userspace_tramp_clone_vm(void *data)
+{
+ struct tramp_data *tramp_data = data;
+
+ /*
+ * This helper exists to do a double clone. The first clone uses
+ * CLONE_VM, which effectively disables things like rseq, and the
+ * second one then gets a new memory space.
+ */
+
+ tramp_data->pid = clone(userspace_tramp, tramp_data->clone_sp,
+ CLONE_PARENT | CLONE_FILES | SIGCHLD,
+ tramp_data->stack);
+ if (tramp_data->pid < 0)
+ tramp_data->pid = -errno;
+
+ exit(0);
+}
+
/**
* start_userspace() - prepare a new userspace process
* @stub_stack: pointer to the stub stack.
@@ -268,9 +293,10 @@ int kill_userspace_mm[NR_CPUS];
*/
int start_userspace(unsigned long stub_stack)
{
+ struct tramp_data tramp_data;
void *stack;
unsigned long sp;
- int pid, status, n, flags, err;
+ int pid, status, n, err;
/* setup a temporary stack page */
stack = mmap(NULL, UM_KERN_PAGE_SIZE,
@@ -286,10 +312,13 @@ int start_userspace(unsigned long stub_stack)
/* set stack pointer to the end of the stack page, so it can grow downwards */
sp = (unsigned long)stack + UM_KERN_PAGE_SIZE;
- flags = CLONE_FILES | SIGCHLD;
+ tramp_data.stack = (void *) stub_stack;
+ tramp_data.clone_sp = (void *) sp;
+ tramp_data.pid = -EINVAL;
/* clone into new userspace process */
- pid = clone(userspace_tramp, (void *) sp, flags, (void *) stub_stack);
+ pid = clone(userspace_tramp_clone_vm, (void *) sp,
+ CLONE_VM | CLONE_FILES | SIGCHLD, &tramp_data);
if (pid < 0) {
err = -errno;
printk(UM_KERN_ERR "%s : clone failed, errno = %d\n",
@@ -305,7 +334,24 @@ int start_userspace(unsigned long stub_stack)
__func__, errno);
goto out_kill;
}
- } while (WIFSTOPPED(status) && (WSTOPSIG(status) == SIGALRM));
+ } while (!WIFEXITED(status));
+
+ pid = tramp_data.pid;
+ if (pid < 0) {
+ printk(UM_KERN_ERR "%s : second clone failed, errno = %d\n",
+ __func__, -pid);
+ return pid;
+ }
+
+ do {
+ CATCH_EINTR(n = waitpid(pid, &status, WUNTRACED | __WALL));
+ if (n < 0) {
+ err = -errno;
+ printk(UM_KERN_ERR "%s : wait failed, errno = %d\n",
+ __func__, errno);
+ goto out_kill;
+ }
+ } while (WIFSTOPPED(status) && (WSTOPSIG(status) == SIGALRM));
if (!WIFSTOPPED(status) || (WSTOPSIG(status) != SIGSTOP)) {
err = -EINVAL;
--
2.45.1
* [PATCH 4/5] um: Discover host_task_size from envp
2024-05-28 8:54 [PATCH 0/5] Increased address space for 64 bit benjamin
` (2 preceding siblings ...)
2024-05-28 8:54 ` [PATCH 3/5] um: Do a double clone to disable rseq benjamin
@ 2024-05-28 8:54 ` benjamin
2024-05-28 8:54 ` [PATCH 5/5] um: Add 4 level page table support benjamin
4 siblings, 0 replies; 15+ messages in thread
From: benjamin @ 2024-05-28 8:54 UTC (permalink / raw)
To: linux-um; +Cc: Benjamin Berg
From: Benjamin Berg <benjamin.berg@intel.com>
When loading the UML binary, the host kernel will place the stack at the
highest possible address. It will then map the program name and
environment variables onto the start of the stack.
As such, an easy way to figure out the host_task_size is to use the
highest pointer to an environment variable as a reference.
Ensure that this works reliably by disabling address space layout
randomization and re-executing UML in case it was enabled.
This increases the available TASK_SIZE for 64 bit UML considerably.
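The scan itself is simple; a standalone sketch of the idea (simplified from the patch, which additionally seeds the search with a stack-local address):

```c
#include <stddef.h>

/*
 * Sketch of the envp scan: take the highest environment string pointer
 * and round up to the next page boundary.  page_size must be a power
 * of two.  Only the pointer values are compared; the strings themselves
 * are never dereferenced.
 */
static unsigned long top_from_envp(char *const *envp, unsigned long page_size)
{
	unsigned long top = 0;
	size_t i;

	for (i = 0; envp[i] != NULL; i++) {
		if ((unsigned long) envp[i] > top)
			top = (unsigned long) envp[i];
	}

	top &= ~(page_size - 1);	/* align down to the page start... */
	top += page_size;		/* ...then step to the next page */

	return top;
}
```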
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
---
arch/um/include/shared/as-layout.h | 2 +-
arch/um/include/shared/os.h | 2 +-
arch/um/kernel/um_arch.c | 4 ++--
arch/um/os-Linux/main.c | 9 ++++++++-
arch/x86/um/os-Linux/task_size.c | 19 +++++++++++++++----
5 files changed, 27 insertions(+), 9 deletions(-)
diff --git a/arch/um/include/shared/as-layout.h b/arch/um/include/shared/as-layout.h
index c22f46a757dc..480bb44ea1f2 100644
--- a/arch/um/include/shared/as-layout.h
+++ b/arch/um/include/shared/as-layout.h
@@ -48,7 +48,7 @@ extern unsigned long brk_start;
extern unsigned long host_task_size;
extern unsigned long stub_start;
-extern int linux_main(int argc, char **argv);
+extern int linux_main(int argc, char **argv, char **envp);
extern void uml_finishsetup(void);
struct siginfo;
diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index aff8906304ea..db644fc67069 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -327,7 +327,7 @@ extern int __ignore_sigio_fd(int fd);
extern int get_pty(void);
/* sys-$ARCH/task_size.c */
-extern unsigned long os_get_top_address(void);
+extern unsigned long os_get_top_address(char **envp);
long syscall(long number, ...);
diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c
index 5ab1a92b6bf7..046eaf356b28 100644
--- a/arch/um/kernel/um_arch.c
+++ b/arch/um/kernel/um_arch.c
@@ -305,7 +305,7 @@ static void parse_cache_line(char *line)
}
}
-int __init linux_main(int argc, char **argv)
+int __init linux_main(int argc, char **argv, char **envp)
{
unsigned long avail, diff;
unsigned long virtmem_size, max_physmem;
@@ -327,7 +327,7 @@ int __init linux_main(int argc, char **argv)
if (have_console == 0)
add_arg(DEFAULT_COMMAND_LINE_CONSOLE);
- host_task_size = os_get_top_address();
+ host_task_size = os_get_top_address(envp);
/* reserve a few pages for the stubs (taking care of data alignment) */
/* align the data portion */
BUILD_BUG_ON(!is_power_of_2(STUB_DATA_PAGES));
diff --git a/arch/um/os-Linux/main.c b/arch/um/os-Linux/main.c
index f98ff79cdbf7..9a61b1767795 100644
--- a/arch/um/os-Linux/main.c
+++ b/arch/um/os-Linux/main.c
@@ -11,6 +11,7 @@
#include <signal.h>
#include <string.h>
#include <sys/resource.h>
+#include <sys/personality.h>
#include <as-layout.h>
#include <init.h>
#include <kern_util.h>
@@ -108,6 +109,12 @@ int __init main(int argc, char **argv, char **envp)
char **new_argv;
int ret, i, err;
+ /* Disable randomization and re-exec if it was changed successfully */
+ ret = personality(PER_LINUX | ADDR_NO_RANDOMIZE);
+ if (ret >= 0 && (ret & (PER_LINUX | ADDR_NO_RANDOMIZE)) !=
+ (PER_LINUX | ADDR_NO_RANDOMIZE))
+ execve("/proc/self/exe", argv, envp);
+
set_stklim();
setup_env_path();
@@ -140,7 +147,7 @@ int __init main(int argc, char **argv, char **envp)
#endif
change_sig(SIGPIPE, 0);
- ret = linux_main(argc, argv);
+ ret = linux_main(argc, argv, envp);
/*
* Disable SIGPROF - I have no idea why libc doesn't do this or turn
diff --git a/arch/x86/um/os-Linux/task_size.c b/arch/x86/um/os-Linux/task_size.c
index 1dc9adc20b1c..33c26291545a 100644
--- a/arch/x86/um/os-Linux/task_size.c
+++ b/arch/x86/um/os-Linux/task_size.c
@@ -65,7 +65,7 @@ static int page_ok(unsigned long page)
return ok;
}
-unsigned long os_get_top_address(void)
+unsigned long os_get_top_address(char **envp)
{
struct sigaction sa, old;
unsigned long bottom = 0;
@@ -142,10 +142,21 @@ unsigned long os_get_top_address(void)
#else
-unsigned long os_get_top_address(void)
+unsigned long os_get_top_address(char **envp)
{
- /* The old value of CONFIG_TOP_ADDR */
- return 0x7fc0002000;
+ unsigned long top_addr = (unsigned long) &top_addr;
+ int i;
+
+ /* The earliest variable should be after the program name in ELF */
+ for (i = 0; envp[i]; i++) {
+ if ((unsigned long) envp[i] > top_addr)
+ top_addr = (unsigned long) envp[i];
+ }
+
+ top_addr &= ~(UM_KERN_PAGE_SIZE - 1);
+ top_addr += UM_KERN_PAGE_SIZE;
+
+ return top_addr;
}
#endif
--
2.45.1
* [PATCH 5/5] um: Add 4 level page table support
2024-05-28 8:54 [PATCH 0/5] Increased address space for 64 bit benjamin
` (3 preceding siblings ...)
2024-05-28 8:54 ` [PATCH 4/5] um: Discover host_task_size from envp benjamin
@ 2024-05-28 8:54 ` benjamin
2024-05-30 3:07 ` Tiwei Bie
4 siblings, 1 reply; 15+ messages in thread
From: benjamin @ 2024-05-28 8:54 UTC (permalink / raw)
To: linux-um; +Cc: Benjamin Berg
From: Benjamin Berg <benjamin.berg@intel.com>
The larger address space is useful to support more applications inside
UML. One example is ASAN instrumentation of userspace applications,
which requires addresses that would otherwise not be available.
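A back-of-the-envelope sketch of what the extra level buys, using the constants from pgtable-4level.h (4 KiB pages, 512 entries per level):

```c
/* Bytes mappable by a page table with the given number of levels,
 * assuming 4 KiB pages and 512 entries per table at every level. */
static unsigned long long mappable_bytes(int levels)
{
	unsigned long long bytes = 4096;	/* one PTE maps one page */
	int i;

	for (i = 0; i < levels; i++)
		bytes *= 512;			/* each level multiplies by 512 */

	return bytes;
}
```

Three levels cover 2^39 bytes (512 GiB) of virtual address space; four levels cover 2^48 bytes (256 TiB), matching PGDIR_SHIFT 39 plus 9 bits for the 512 PGD entries.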
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
---
arch/um/Kconfig | 1 +
arch/um/include/asm/page.h | 14 +++-
arch/um/include/asm/pgalloc.h | 11 ++-
arch/um/include/asm/pgtable-4level.h | 119 +++++++++++++++++++++++++++
arch/um/include/asm/pgtable.h | 6 +-
arch/um/kernel/mem.c | 17 +++-
arch/x86/um/Kconfig | 38 ++++++---
7 files changed, 189 insertions(+), 17 deletions(-)
create mode 100644 arch/um/include/asm/pgtable-4level.h
diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 93a5a8999b07..5d111fc8ccb7 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -208,6 +208,7 @@ config MMAPPER
config PGTABLE_LEVELS
int
+ default 4 if 4_LEVEL_PGTABLES
default 3 if 3_LEVEL_PGTABLES
default 2
diff --git a/arch/um/include/asm/page.h b/arch/um/include/asm/page.h
index 9ef9a8aedfa6..c3b2ae03b60c 100644
--- a/arch/um/include/asm/page.h
+++ b/arch/um/include/asm/page.h
@@ -57,14 +57,22 @@ typedef unsigned long long phys_t;
typedef struct { unsigned long pte; } pte_t;
typedef struct { unsigned long pgd; } pgd_t;
-#ifdef CONFIG_3_LEVEL_PGTABLES
+#if CONFIG_PGTABLE_LEVELS > 2
+
typedef struct { unsigned long pmd; } pmd_t;
#define pmd_val(x) ((x).pmd)
#define __pmd(x) ((pmd_t) { (x) } )
-#endif
-#define pte_val(x) ((x).pte)
+#if CONFIG_PGTABLE_LEVELS > 3
+typedef struct { unsigned long pud; } pud_t;
+#define pud_val(x) ((x).pud)
+#define __pud(x) ((pud_t) { (x) } )
+
+#endif /* CONFIG_PGTABLE_LEVELS > 3 */
+#endif /* CONFIG_PGTABLE_LEVELS > 2 */
+
+#define pte_val(x) ((x).pte)
#define pte_get_bits(p, bits) ((p).pte & (bits))
#define pte_set_bits(p, bits) ((p).pte |= (bits))
diff --git a/arch/um/include/asm/pgalloc.h b/arch/um/include/asm/pgalloc.h
index de5e31c64793..04fb4e6969a4 100644
--- a/arch/um/include/asm/pgalloc.h
+++ b/arch/um/include/asm/pgalloc.h
@@ -31,7 +31,7 @@ do { \
tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte))); \
} while (0)
-#ifdef CONFIG_3_LEVEL_PGTABLES
+#if CONFIG_PGTABLE_LEVELS > 2
#define __pmd_free_tlb(tlb, pmd, address) \
do { \
@@ -39,6 +39,15 @@ do { \
tlb_remove_page_ptdesc((tlb), virt_to_ptdesc(pmd)); \
} while (0)
+#if CONFIG_PGTABLE_LEVELS > 3
+
+#define __pud_free_tlb(tlb, pud, address) \
+do { \
+ pagetable_pud_dtor(virt_to_ptdesc(pud)); \
+ tlb_remove_page_ptdesc((tlb), virt_to_ptdesc(pud)); \
+} while (0)
+
+#endif
#endif
#endif
diff --git a/arch/um/include/asm/pgtable-4level.h b/arch/um/include/asm/pgtable-4level.h
new file mode 100644
index 000000000000..f912fcc16b7a
--- /dev/null
+++ b/arch/um/include/asm/pgtable-4level.h
@@ -0,0 +1,119 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2003 PathScale Inc
+ * Derived from include/asm-i386/pgtable.h
+ */
+
+#ifndef __UM_PGTABLE_4LEVEL_H
+#define __UM_PGTABLE_4LEVEL_H
+
+#include <asm-generic/pgtable-nop4d.h>
+
+/* PGDIR_SHIFT determines what a fourth-level page table entry can map */
+
+#define PGDIR_SHIFT 39
+#define PGDIR_SIZE (1UL << PGDIR_SHIFT)
+#define PGDIR_MASK (~(PGDIR_SIZE-1))
+
+/* PUD_SHIFT determines the size of the area a third-level page table can
+ * map
+ */
+
+#define PUD_SHIFT 30
+#define PUD_SIZE (1UL << PUD_SHIFT)
+#define PUD_MASK (~(PUD_SIZE-1))
+
+/* PMD_SHIFT determines the size of the area a second-level page table can
+ * map
+ */
+
+#define PMD_SHIFT 21
+#define PMD_SIZE (1UL << PMD_SHIFT)
+#define PMD_MASK (~(PMD_SIZE-1))
+
+/*
+ * entries per page directory level
+ */
+
+#define PTRS_PER_PTE 512
+#define PTRS_PER_PMD 512
+#define PTRS_PER_PUD 512
+#define PTRS_PER_PGD 512
+
+#define USER_PTRS_PER_PGD ((TASK_SIZE + (PGDIR_SIZE - 1)) / PGDIR_SIZE)
+
+#define pte_ERROR(e) \
+ printk("%s:%d: bad pte %p(%016lx).\n", __FILE__, __LINE__, &(e), \
+ pte_val(e))
+#define pmd_ERROR(e) \
+ printk("%s:%d: bad pmd %p(%016lx).\n", __FILE__, __LINE__, &(e), \
+ pmd_val(e))
+#define pud_ERROR(e) \
+ printk("%s:%d: bad pud %p(%016lx).\n", __FILE__, __LINE__, &(e), \
+ pud_val(e))
+#define pgd_ERROR(e) \
+ printk("%s:%d: bad pgd %p(%016lx).\n", __FILE__, __LINE__, &(e), \
+ pgd_val(e))
+
+#define pud_none(x) (!(pud_val(x) & ~_PAGE_NEWPAGE))
+#define pud_bad(x) ((pud_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)
+#define pud_present(x) (pud_val(x) & _PAGE_PRESENT)
+#define pud_populate(mm, pud, pmd) \
+ set_pud(pud, __pud(_PAGE_TABLE + __pa(pmd)))
+
+#define set_pud(pudptr, pudval) (*(pudptr) = (pudval))
+
+#define p4d_none(x) (!(p4d_val(x) & ~_PAGE_NEWPAGE))
+#define p4d_bad(x) ((p4d_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)
+#define p4d_present(x) (p4d_val(x) & _PAGE_PRESENT)
+#define p4d_populate(mm, p4d, pud) \
+ set_p4d(p4d, __p4d(_PAGE_TABLE + __pa(pud)))
+
+#define set_p4d(p4dptr, p4dval) (*(p4dptr) = (p4dval))
+
+
+static inline int pgd_newpage(pgd_t pgd)
+{
+ return(pgd_val(pgd) & _PAGE_NEWPAGE);
+}
+
+static inline void pgd_mkuptodate(pgd_t pgd) { pgd_val(pgd) &= ~_PAGE_NEWPAGE; }
+
+#define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval))
+
+static inline void pud_clear (pud_t *pud)
+{
+ set_pud(pud, __pud(_PAGE_NEWPAGE));
+}
+
+static inline void p4d_clear (p4d_t *p4d)
+{
+ set_p4d(p4d, __p4d(_PAGE_NEWPAGE));
+}
+
+#define pud_page(pud) phys_to_page(pud_val(pud) & PAGE_MASK)
+#define pud_pgtable(pud) ((pmd_t *) __va(pud_val(pud) & PAGE_MASK))
+
+#define p4d_page(p4d) phys_to_page(p4d_val(p4d) & PAGE_MASK)
+#define p4d_pgtable(p4d) ((pud_t *) __va(p4d_val(p4d) & PAGE_MASK))
+
+static inline unsigned long pte_pfn(pte_t pte)
+{
+ return phys_to_pfn(pte_val(pte));
+}
+
+static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
+{
+ pte_t pte;
+ phys_t phys = pfn_to_phys(page_nr);
+
+ pte_set_val(pte, phys, pgprot);
+ return pte;
+}
+
+static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
+{
+ return __pmd((page_nr << PAGE_SHIFT) | pgprot_val(pgprot));
+}
+
+#endif
diff --git a/arch/um/include/asm/pgtable.h b/arch/um/include/asm/pgtable.h
index e1ece21dbe3f..71a7651e2db7 100644
--- a/arch/um/include/asm/pgtable.h
+++ b/arch/um/include/asm/pgtable.h
@@ -24,9 +24,11 @@
/* We borrow bit 10 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE 0x400
-#ifdef CONFIG_3_LEVEL_PGTABLES
+#if CONFIG_PGTABLE_LEVELS == 4
+#include <asm/pgtable-4level.h>
+#elif CONFIG_PGTABLE_LEVELS == 3
#include <asm/pgtable-3level.h>
-#else
+#elif CONFIG_PGTABLE_LEVELS == 2
#include <asm/pgtable-2level.h>
#endif
diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index ca91accd64fc..2dc0d90c0550 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -99,7 +99,7 @@ static void __init one_page_table_init(pmd_t *pmd)
static void __init one_md_table_init(pud_t *pud)
{
-#ifdef CONFIG_3_LEVEL_PGTABLES
+#if CONFIG_PGTABLE_LEVELS > 2
pmd_t *pmd_table = (pmd_t *) memblock_alloc_low(PAGE_SIZE, PAGE_SIZE);
if (!pmd_table)
panic("%s: Failed to allocate %lu bytes align=%lx\n",
@@ -110,6 +110,19 @@ static void __init one_md_table_init(pud_t *pud)
#endif
}
+static void __init one_ud_table_init(p4d_t *p4d)
+{
+#if CONFIG_PGTABLE_LEVELS > 3
+ pud_t *pud_table = (pud_t *) memblock_alloc_low(PAGE_SIZE, PAGE_SIZE);
+ if (!pud_table)
+ panic("%s: Failed to allocate %lu bytes align=%lx\n",
+ __func__, PAGE_SIZE, PAGE_SIZE);
+
+ set_p4d(p4d, __p4d(_KERNPG_TABLE + (unsigned long) __pa(pud_table)));
+ BUG_ON(pud_table != pud_offset(p4d, 0));
+#endif
+}
+
static void __init fixrange_init(unsigned long start, unsigned long end,
pgd_t *pgd_base)
{
@@ -127,6 +140,8 @@ static void __init fixrange_init(unsigned long start, unsigned long end,
for ( ; (i < PTRS_PER_PGD) && (vaddr < end); pgd++, i++) {
p4d = p4d_offset(pgd, vaddr);
+ if (p4d_none(*p4d))
+ one_ud_table_init(p4d);
pud = pud_offset(p4d, vaddr);
if (pud_none(*pud))
one_md_table_init(pud);
diff --git a/arch/x86/um/Kconfig b/arch/x86/um/Kconfig
index 186f13268401..72dc7b0b3a33 100644
--- a/arch/x86/um/Kconfig
+++ b/arch/x86/um/Kconfig
@@ -28,16 +28,34 @@ config X86_64
def_bool 64BIT
select MODULES_USE_ELF_RELA
-config 3_LEVEL_PGTABLES
- bool "Three-level pagetables" if !64BIT
- default 64BIT
- help
- Three-level pagetables will let UML have more than 4G of physical
- memory. All the memory that can't be mapped directly will be treated
- as high memory.
-
- However, this it experimental on 32-bit architectures, so if unsure say
- N (on x86-64 it's automatically enabled, instead, as it's safe there).
+choice
+ prompt "Pagetable levels" if EXPERT
+ default 2_LEVEL_PGTABLES if !64BIT
+ default 4_LEVEL_PGTABLES if 64BIT
+
+ config 2_LEVEL_PGTABLES
+ bool "Two-level pagetables" if !64BIT
+ depends on !64BIT
+ help
+ Two-level page table for 32-bit architectures.
+
+ config 3_LEVEL_PGTABLES
+ bool "Three-level pagetables" if 64BIT
+ help
+ Three-level pagetables will let UML have more than 4G of physical
+ memory. All the memory that can't be mapped directly will be treated
+ as high memory.
+
+ However, this is experimental on 32-bit architectures, so if unsure
+ say N.
+
+ config 4_LEVEL_PGTABLES
+ bool "Four-level pagetables" if 64BIT
+ depends on 64BIT
+ help
+ Four-level pagetables give a bigger address space, which can be
+ useful for some applications (e.g. ASAN).
+endchoice
config ARCH_HAS_SC_SIGNALS
def_bool !64BIT
--
2.45.1
* Re: [PATCH 3/5] um: Do a double clone to disable rseq
2024-05-28 8:54 ` [PATCH 3/5] um: Do a double clone to disable rseq benjamin
@ 2024-05-28 10:16 ` Tiwei Bie
2024-05-28 10:30 ` Benjamin Berg
2024-05-28 11:57 ` Johannes Berg
0 siblings, 2 replies; 15+ messages in thread
From: Tiwei Bie @ 2024-05-28 10:16 UTC (permalink / raw)
To: benjamin, linux-um; +Cc: Benjamin Berg
On 5/28/24 4:54 PM, benjamin@sipsolutions.net wrote:
> From: Benjamin Berg <benjamin.berg@intel.com>
>
> Newer glibc versions are enabling rseq support by default. This remains
> enabled in the cloned child process, potentially causing the host kernel
> to write/read memory in the child.
>
> It appears that this has not been an issue so far only because the
> memory area used happened to be above TASK_SIZE and remained mapped.
I also encountered this issue. In my case, with "Force a static link"
(CONFIG_STATIC_LINK) enabled, UML will crash immediately every time
it starts up. I worked around this by setting the glibc.pthread.rseq
tunable via GLIBC_TUNABLES [1] before launching UML.
So another easy way to work around this issue without introducing runtime
overhead might be to add the GLIBC_TUNABLES=glibc.pthread.rseq=0 environment
variable and exec /proc/self/exe in UML on startup.
[1] https://www.gnu.org/software/libc/manual/html_node/Tunables.html
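That startup dance could look roughly like this (a hypothetical sketch, not actual UML code; a real implementation would need to merge with any pre-existing colon-separated GLIBC_TUNABLES entries instead of overwriting them):

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Does GLIBC_TUNABLES already ask glibc not to register rseq? */
static int rseq_disabled_by_tunable(const char *tunables)
{
	return tunables != NULL &&
	       strstr(tunables, "glibc.pthread.rseq=0") != NULL;
}

/* Set the tunable and re-exec ourselves once so it takes effect. */
static void maybe_reexec_without_rseq(char **argv)
{
	if (rseq_disabled_by_tunable(getenv("GLIBC_TUNABLES")))
		return;

	setenv("GLIBC_TUNABLES", "glibc.pthread.rseq=0", 1);
	execv("/proc/self/exe", argv);
	/* Only reached if execv() failed; continue without the tunable. */
}
```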
Regards,
Tiwei
>
> Note that a better approach would be to exec a small static binary that
> does not link against other libraries. Using a memfd and execveat, the
> binary could be embedded into UML itself, resulting in an entirely
> clean execution environment for userspace.
>
> Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
> ---
> arch/um/os-Linux/skas/process.c | 54 ++++++++++++++++++++++++++++++---
> 1 file changed, 50 insertions(+), 4 deletions(-)
>
> diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
> index 41a288dcfc34..ee332a2aeea6 100644
> --- a/arch/um/os-Linux/skas/process.c
> +++ b/arch/um/os-Linux/skas/process.c
> @@ -255,6 +255,31 @@ static int userspace_tramp(void *stack)
> int userspace_pid[NR_CPUS];
> int kill_userspace_mm[NR_CPUS];
>
> +struct tramp_data {
> + int pid;
> + void *clone_sp;
> + void *stack;
> +};
> +
> +static int userspace_tramp_clone_vm(void *data)
> +{
> + struct tramp_data *tramp_data = data;
> +
> + /*
> + * This helper exists to do a double clone. The first clone uses
> + * CLONE_VM, which effectively disables things like rseq, and the
> + * second one then gets a new memory space.
> + */
> +
> + tramp_data->pid = clone(userspace_tramp, tramp_data->clone_sp,
> + CLONE_PARENT | CLONE_FILES | SIGCHLD,
> + tramp_data->stack);
> + if (tramp_data->pid < 0)
> + tramp_data->pid = -errno;
> +
> + exit(0);
> +}
> +
> /**
> * start_userspace() - prepare a new userspace process
> * @stub_stack: pointer to the stub stack.
> @@ -268,9 +293,10 @@ int kill_userspace_mm[NR_CPUS];
> */
> int start_userspace(unsigned long stub_stack)
> {
> + struct tramp_data tramp_data;
> void *stack;
> unsigned long sp;
> - int pid, status, n, flags, err;
> + int pid, status, n, err;
>
> /* setup a temporary stack page */
> stack = mmap(NULL, UM_KERN_PAGE_SIZE,
> @@ -286,10 +312,13 @@ int start_userspace(unsigned long stub_stack)
> /* set stack pointer to the end of the stack page, so it can grow downwards */
> sp = (unsigned long)stack + UM_KERN_PAGE_SIZE;
>
> - flags = CLONE_FILES | SIGCHLD;
> + tramp_data.stack = (void *) stub_stack;
> + tramp_data.clone_sp = (void *) sp;
> + tramp_data.pid = -EINVAL;
>
> /* clone into new userspace process */
> - pid = clone(userspace_tramp, (void *) sp, flags, (void *) stub_stack);
> + pid = clone(userspace_tramp_clone_vm, (void *) sp,
> + CLONE_VM | CLONE_FILES | SIGCHLD, &tramp_data);
> if (pid < 0) {
> err = -errno;
> printk(UM_KERN_ERR "%s : clone failed, errno = %d\n",
> @@ -305,7 +334,24 @@ int start_userspace(unsigned long stub_stack)
> __func__, errno);
> goto out_kill;
> }
> - } while (WIFSTOPPED(status) && (WSTOPSIG(status) == SIGALRM));
> + } while (!WIFEXITED(status));
> +
> + pid = tramp_data.pid;
> + if (pid < 0) {
> + printk(UM_KERN_ERR "%s : second clone failed, errno = %d\n",
> + __func__, -pid);
> + return pid;
> + }
> +
> + do {
> + CATCH_EINTR(n = waitpid(pid, &status, WUNTRACED | __WALL));
> + if (n < 0) {
> + err = -errno;
> + printk(UM_KERN_ERR "%s : wait failed, errno = %d\n",
> + __func__, errno);
> + goto out_kill;
> + }
> + } while (WIFSTOPPED(status) && (WSTOPSIG(status) == SIGALRM));
>
> if (!WIFSTOPPED(status) || (WSTOPSIG(status) != SIGSTOP)) {
> err = -EINVAL;
* Re: [PATCH 3/5] um: Do a double clone to disable rseq
2024-05-28 10:16 ` Tiwei Bie
@ 2024-05-28 10:30 ` Benjamin Berg
2024-05-28 11:03 ` Tiwei Bie
2024-05-28 11:57 ` Johannes Berg
1 sibling, 1 reply; 15+ messages in thread
From: Benjamin Berg @ 2024-05-28 10:30 UTC (permalink / raw)
To: Tiwei Bie, linux-um
Hi Tiwei,
On Tue, 2024-05-28 at 18:16 +0800, Tiwei Bie wrote:
> On 5/28/24 4:54 PM, benjamin@sipsolutions.net wrote:
> > From: Benjamin Berg <benjamin.berg@intel.com>
> >
> > Newer glibc versions are enabling rseq support by default. This remains
> > enabled in the cloned child process, potentially causing the host kernel
> > to write/read memory in the child.
> >
> > It appears that this has not been an issue so far only because the
> > memory area used happened to be above TASK_SIZE and remained mapped.
>
> I also encountered this issue. In my case, with "Force a static link"
> (CONFIG_STATIC_LINK) enabled, UML will crash immediately every time
> it starts up. I worked around this by setting the glibc.pthread.rseq
> tunable via GLIBC_TUNABLES [1] before launching UML.
>
> So another easy way to work around this issue without introducing runtime
> overhead might be to add the GLIBC_TUNABLES=glibc.pthread.rseq=0 environment
> variable and exec /proc/self/exe in UML on startup.
I am not really worried about the overhead, but I agree that setting
GLIBC_TUNABLES is also a reasonable solution to the problem.
Doing the memfd/execveat dance with an embedded static binary would
still be best in my view, but either this or GLIBC_TUNABLES seems fine
in the meantime.
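For what it's worth, the memfd part of that idea is only a handful of lines (a sketch assuming the stub binary blob is linked into the UML binary; the blob and names here are illustrative):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/*
 * Copy an embedded binary blob into an anonymous in-memory file.  The
 * returned fd could then be handed to fexecve()/execveat() to get an
 * entirely clean execution environment.
 */
static int memfd_from_blob(const void *blob, size_t len)
{
	int fd = memfd_create("uml-stub", MFD_CLOEXEC);

	if (fd < 0)
		return -1;
	if (write(fd, blob, len) != (ssize_t) len) {
		close(fd);
		return -1;
	}
	return fd;	/* caller would do: fexecve(fd, argv, envp); */
}
```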
Do you want to submit the patch? Should I re-roll the patchset with
GLIBC_TUNABLES?
Benjamin
> [1] https://www.gnu.org/software/libc/manual/html_node/Tunables.html
>
> Regards,
> Tiwei
>
> >
> > Note that a better approach would be to exec a small static binary that
> > does not link against other libraries. Using a memfd and execveat, the
> > binary could be embedded into UML itself, resulting in an entirely
> > clean execution environment for userspace.
> >
> > Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
> > ---
> > arch/um/os-Linux/skas/process.c | 54 ++++++++++++++++++++++++++++++---
> > 1 file changed, 50 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
> > index 41a288dcfc34..ee332a2aeea6 100644
> > --- a/arch/um/os-Linux/skas/process.c
> > +++ b/arch/um/os-Linux/skas/process.c
> > @@ -255,6 +255,31 @@ static int userspace_tramp(void *stack)
> > int userspace_pid[NR_CPUS];
> > int kill_userspace_mm[NR_CPUS];
> >
> > +struct tramp_data {
> > + int pid;
> > + void *clone_sp;
> > + void *stack;
> > +};
> > +
> > +static int userspace_tramp_clone_vm(void *data)
> > +{
> > + struct tramp_data *tramp_data = data;
> > +
> > + /*
> > + * This helper exists to do a double clone. The first clone uses
> > + * CLONE_VM, which effectively disables things like rseq, and the
> > + * second one then gets a new memory space.
> > + */
> > +
> > + tramp_data->pid = clone(userspace_tramp, tramp_data->clone_sp,
> > + CLONE_PARENT | CLONE_FILES | SIGCHLD,
> > + tramp_data->stack);
> > + if (tramp_data->pid < 0)
> > + tramp_data->pid = -errno;
> > +
> > + exit(0);
> > +}
> > +
> > /**
> > * start_userspace() - prepare a new userspace process
> > * @stub_stack: pointer to the stub stack.
> > @@ -268,9 +293,10 @@ int kill_userspace_mm[NR_CPUS];
> > */
> > int start_userspace(unsigned long stub_stack)
> > {
> > + struct tramp_data tramp_data;
> > void *stack;
> > unsigned long sp;
> > - int pid, status, n, flags, err;
> > + int pid, status, n, err;
> >
> > /* setup a temporary stack page */
> > stack = mmap(NULL, UM_KERN_PAGE_SIZE,
> > @@ -286,10 +312,13 @@ int start_userspace(unsigned long stub_stack)
> > /* set stack pointer to the end of the stack page, so it can grow downwards */
> > sp = (unsigned long)stack + UM_KERN_PAGE_SIZE;
> >
> > - flags = CLONE_FILES | SIGCHLD;
> > + tramp_data.stack = (void *) stub_stack;
> > + tramp_data.clone_sp = (void *) sp;
> > + tramp_data.pid = -EINVAL;
> >
> > /* clone into new userspace process */
> > - pid = clone(userspace_tramp, (void *) sp, flags, (void *) stub_stack);
> > + pid = clone(userspace_tramp_clone_vm, (void *) sp,
> > + CLONE_VM | CLONE_FILES | SIGCHLD, &tramp_data);
> > if (pid < 0) {
> > err = -errno;
> > printk(UM_KERN_ERR "%s : clone failed, errno = %d\n",
> > @@ -305,7 +334,24 @@ int start_userspace(unsigned long stub_stack)
> > __func__, errno);
> > goto out_kill;
> > }
> > - } while (WIFSTOPPED(status) && (WSTOPSIG(status) == SIGALRM));
> > + } while (!WIFEXITED(status));
> > +
> > + pid = tramp_data.pid;
> > + if (pid < 0) {
> > + printk(UM_KERN_ERR "%s : second clone failed, errno = %d\n",
> > + __func__, -pid);
> > + return pid;
> > + }
> > +
> > + do {
> > + CATCH_EINTR(n = waitpid(pid, &status, WUNTRACED | __WALL));
> > + if (n < 0) {
> > + err = -errno;
> > + printk(UM_KERN_ERR "%s : wait failed, errno = %d\n",
> > + __func__, errno);
> > + goto out_kill;
> > + }
> > + } while (WIFSTOPPED(status) && (WSTOPSIG(status) == SIGALRM));
> >
> > if (!WIFSTOPPED(status) || (WSTOPSIG(status) != SIGSTOP)) {
> > err = -EINVAL;
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/5] um: Do a double clone to disable rseq
2024-05-28 10:30 ` Benjamin Berg
@ 2024-05-28 11:03 ` Tiwei Bie
0 siblings, 0 replies; 15+ messages in thread
From: Tiwei Bie @ 2024-05-28 11:03 UTC (permalink / raw)
To: Benjamin Berg, linux-um
Hi Benjamin,
On 5/28/24 6:30 PM, Benjamin Berg wrote:
> On Tue, 2024-05-28 at 18:16 +0800, Tiwei Bie wrote:
>> On 5/28/24 4:54 PM, benjamin@sipsolutions.net wrote:
>>> From: Benjamin Berg <benjamin.berg@intel.com>
>>>
>>> Newer glibc versions are enabling rseq support by default. This remains
>>> enabled in the cloned child process, potentially causing the host kernel
>>> to write/read memory in the child.
>>>
>>> It appears that this was not an issue purely because the memory area
>>> used happened to be above TASK_SIZE and remains mapped.
>>
>> I also encountered this issue. In my case, with "Force a static link"
>> (CONFIG_STATIC_LINK) enabled, UML will crash immediately every time
>> it starts up. I worked around this by setting the glibc.pthread.rseq
>> tunable via GLIBC_TUNABLES [1] before launching UML.
>>
>> So another easy way to work around this issue without introducing runtime
>> overhead might be to add the GLIBC_TUNABLES=glibc.pthread.rseq=0 environment
>> variable and exec /proc/self/exe in UML on startup.
>
> I am not really worried about the overhead, but I agree that setting
> GLIBC_TUNABLES is also a reasonable solution to the problem.
>
> Doing the memfd/execveat dance with an embedded static binary would
> still be best in my view, but either this or GLIBC_TUNABLES seems fine
> in the meantime.
>
> Do you want to submit the patch? Should I re-roll the patchset with
> GLIBC_TUNABLES?
Thanks for asking! :)
I don't have such a patch at the moment. I worked around this issue
using a script. I saw that you are already doing execve("/proc/self/exe",
...) in PATCH 4/5. Please just feel free to re-roll the patchset with
GLIBC_TUNABLES if you would like to choose this solution.
Regards,
Tiwei
>
> Benjamin
>
>> [1] https://www.gnu.org/software/libc/manual/html_node/Tunables.html
>>
>> Regards,
>> Tiwei
>>
>>>
>>> Note that a better approach would be to exec a small static binary that
>>> does not link with other libraries. Using a memfd and execveat, the
>>> binary could be embedded into UML itself and it would result in an
>>> entirely clean execution environment for userspace.
>>>
>>> Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
>>> ---
>>> arch/um/os-Linux/skas/process.c | 54 ++++++++++++++++++++++++++++++---
>>> 1 file changed, 50 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
>>> index 41a288dcfc34..ee332a2aeea6 100644
>>> --- a/arch/um/os-Linux/skas/process.c
>>> +++ b/arch/um/os-Linux/skas/process.c
>>> @@ -255,6 +255,31 @@ static int userspace_tramp(void *stack)
>>> int userspace_pid[NR_CPUS];
>>> int kill_userspace_mm[NR_CPUS];
>>>
>>> +struct tramp_data {
>>> + int pid;
>>> + void *clone_sp;
>>> + void *stack;
>>> +};
>>> +
>>> +static int userspace_tramp_clone_vm(void *data)
>>> +{
>>> + struct tramp_data *tramp_data = data;
>>> +
>>> + /*
>>> + * This helper exists to do a double clone: first with CLONE_VM, which
>>> + * effectively disables things like rseq, and then a second one to
>>> + * get a new memory space.
>>> + */
>>> +
>>> + tramp_data->pid = clone(userspace_tramp, tramp_data->clone_sp,
>>> + CLONE_PARENT | CLONE_FILES | SIGCHLD,
>>> + tramp_data->stack);
>>> + if (tramp_data->pid < 0)
>>> + tramp_data->pid = -errno;
>>> +
>>> + exit(0);
>>> +}
>>> +
>>> /**
>>> * start_userspace() - prepare a new userspace process
>>> * @stub_stack: pointer to the stub stack.
>>> @@ -268,9 +293,10 @@ int kill_userspace_mm[NR_CPUS];
>>> */
>>> int start_userspace(unsigned long stub_stack)
>>> {
>>> + struct tramp_data tramp_data;
>>> void *stack;
>>> unsigned long sp;
>>> - int pid, status, n, flags, err;
>>> + int pid, status, n, err;
>>>
>>> /* setup a temporary stack page */
>>> stack = mmap(NULL, UM_KERN_PAGE_SIZE,
>>> @@ -286,10 +312,13 @@ int start_userspace(unsigned long stub_stack)
>>> /* set stack pointer to the end of the stack page, so it can grow downwards */
>>> sp = (unsigned long)stack + UM_KERN_PAGE_SIZE;
>>>
>>> - flags = CLONE_FILES | SIGCHLD;
>>> + tramp_data.stack = (void *) stub_stack;
>>> + tramp_data.clone_sp = (void *) sp;
>>> + tramp_data.pid = -EINVAL;
>>>
>>> /* clone into new userspace process */
>>> - pid = clone(userspace_tramp, (void *) sp, flags, (void *) stub_stack);
>>> + pid = clone(userspace_tramp_clone_vm, (void *) sp,
>>> + CLONE_VM | CLONE_FILES | SIGCHLD, &tramp_data);
>>> if (pid < 0) {
>>> err = -errno;
>>> printk(UM_KERN_ERR "%s : clone failed, errno = %d\n",
>>> @@ -305,7 +334,24 @@ int start_userspace(unsigned long stub_stack)
>>> __func__, errno);
>>> goto out_kill;
>>> }
>>> - } while (WIFSTOPPED(status) && (WSTOPSIG(status) == SIGALRM));
>>> + } while (!WIFEXITED(status));
>>> +
>>> + pid = tramp_data.pid;
>>> + if (pid < 0) {
>>> + printk(UM_KERN_ERR "%s : second clone failed, errno = %d\n",
>>> + __func__, -pid);
>>> + return pid;
>>> + }
>>> +
>>> + do {
>>> + CATCH_EINTR(n = waitpid(pid, &status, WUNTRACED | __WALL));
>>> + if (n < 0) {
>>> + err = -errno;
>>> + printk(UM_KERN_ERR "%s : wait failed, errno = %d\n",
>>> + __func__, errno);
>>> + goto out_kill;
>>> + }
>>> + } while (WIFSTOPPED(status) && (WSTOPSIG(status) == SIGALRM));
>>>
>>> if (!WIFSTOPPED(status) || (WSTOPSIG(status) != SIGSTOP)) {
>>> err = -EINVAL;
>>
>>
* Re: [PATCH 3/5] um: Do a double clone to disable rseq
2024-05-28 10:16 ` Tiwei Bie
2024-05-28 10:30 ` Benjamin Berg
@ 2024-05-28 11:57 ` Johannes Berg
2024-05-28 14:13 ` Tiwei Bie
1 sibling, 1 reply; 15+ messages in thread
From: Johannes Berg @ 2024-05-28 11:57 UTC (permalink / raw)
To: Tiwei Bie, benjamin, linux-um; +Cc: Benjamin Berg
On Tue, 2024-05-28 at 18:16 +0800, Tiwei Bie wrote:
> On 5/28/24 4:54 PM, benjamin@sipsolutions.net wrote:
> > From: Benjamin Berg <benjamin.berg@intel.com>
> >
> > Newer glibc versions are enabling rseq support by default. This remains
> > enabled in the cloned child process, potentially causing the host kernel
> > to write/read memory in the child.
> >
> > It appears that this was not an issue purely because the memory area
> > used happened to be above TASK_SIZE and remains mapped.
>
> I also encountered this issue. In my case, with "Force a static link"
> (CONFIG_STATIC_LINK) enabled, UML will crash immediately every time
> it starts up. I worked around this by setting the glibc.pthread.rseq
> tunable via GLIBC_TUNABLES [1] before launching UML.
>
> So another easy way to work around this issue without introducing runtime
> overhead might be to add the GLIBC_TUNABLES=glibc.pthread.rseq=0 environment
> variable and exec /proc/self/exe in UML on startup.
>
It's also a bit of a question what to rely on - this would introduce a
dependency on glibc behaviour, whereas doing the double-clone proposed
here will work purely because of host kernel behaviour, regardless of
what part of the system set up rseq, how the tunables work, etc.
johannes
* Re: [PATCH 3/5] um: Do a double clone to disable rseq
2024-05-28 11:57 ` Johannes Berg
@ 2024-05-28 14:13 ` Tiwei Bie
2024-05-30 2:54 ` Tiwei Bie
0 siblings, 1 reply; 15+ messages in thread
From: Tiwei Bie @ 2024-05-28 14:13 UTC (permalink / raw)
To: Johannes Berg, benjamin, linux-um; +Cc: Benjamin Berg
On 5/28/24 7:57 PM, Johannes Berg wrote:
> On Tue, 2024-05-28 at 18:16 +0800, Tiwei Bie wrote:
>> On 5/28/24 4:54 PM, benjamin@sipsolutions.net wrote:
>>> From: Benjamin Berg <benjamin.berg@intel.com>
>>>
>>> Newer glibc versions are enabling rseq support by default. This remains
>>> enabled in the cloned child process, potentially causing the host kernel
>>> to write/read memory in the child.
>>>
>>> It appears that this was not an issue purely because the memory area
>>> used happened to be above TASK_SIZE and remains mapped.
>>
>> I also encountered this issue. In my case, with "Force a static link"
>> (CONFIG_STATIC_LINK) enabled, UML will crash immediately every time
>> it starts up. I worked around this by setting the glibc.pthread.rseq
>> tunable via GLIBC_TUNABLES [1] before launching UML.
>>
>> So another easy way to work around this issue without introducing runtime
>> overhead might be to add the GLIBC_TUNABLES=glibc.pthread.rseq=0 environment
>> variable and exec /proc/self/exe in UML on startup.
>>
>
> It's also a bit of a question what to rely on - this would introduce a
> dependency on glibc behaviour, whereas doing the double-clone proposed
> here will work purely because of host kernel behaviour, regardless of
> what part of the system set up rseq, how the tunables work, etc.
Makes sense. My previous concern was primarily about the runtime overhead,
but after taking a closer look at the patch, I realized that the double-clone
won't happen on the critical path, so there shouldn't be any performance
issues. I also think the double-clone proposal is better. :)
Regards,
Tiwei
* Re: [PATCH 3/5] um: Do a double clone to disable rseq
2024-05-28 14:13 ` Tiwei Bie
@ 2024-05-30 2:54 ` Tiwei Bie
2024-05-30 8:54 ` Benjamin Berg
0 siblings, 1 reply; 15+ messages in thread
From: Tiwei Bie @ 2024-05-30 2:54 UTC (permalink / raw)
To: Johannes Berg, benjamin, linux-um; +Cc: Benjamin Berg
On 5/28/24 10:13 PM, Tiwei Bie wrote:
> On 5/28/24 7:57 PM, Johannes Berg wrote:
>> On Tue, 2024-05-28 at 18:16 +0800, Tiwei Bie wrote:
>>> On 5/28/24 4:54 PM, benjamin@sipsolutions.net wrote:
>>>> From: Benjamin Berg <benjamin.berg@intel.com>
>>>>
>>>> Newer glibc versions are enabling rseq support by default. This remains
>>>> enabled in the cloned child process, potentially causing the host kernel
>>>> to write/read memory in the child.
>>>>
>>>> It appears that this was not an issue purely because the memory area
>>>> used happened to be above TASK_SIZE and remains mapped.
>>>
>>> I also encountered this issue. In my case, with "Force a static link"
>>> (CONFIG_STATIC_LINK) enabled, UML will crash immediately every time
>>> it starts up. I worked around this by setting the glibc.pthread.rseq
>>> tunable via GLIBC_TUNABLES [1] before launching UML.
>>>
>>> So another easy way to work around this issue without introducing runtime
>>> overhead might be to add the GLIBC_TUNABLES=glibc.pthread.rseq=0 environment
>>> variable and exec /proc/self/exe in UML on startup.
>>>
>>
>> It's also a bit of a question what to rely on - this would introduce a
>> dependency on glibc behaviour, whereas doing the double-clone proposed
>> here will work purely because of host kernel behaviour, regardless of
>> what part of the system set up rseq, how the tunables work, etc.
>
> Makes sense. My previous concern was primarily about the runtime overhead,
> but after taking a closer look at the patch, I realized that the double-clone
> won't happen on the critical path, so there shouldn't be any performance
> issues. I also think the double-clone proposal is better. :)
But when combined with this series [1], things might be different..
Double-clone will happen for each new mm context. That's something
we might want to avoid.
[1] https://patchwork.ozlabs.org/project/linux-um/list/?series=408104
Regards,
Tiwei
* Re: [PATCH 5/5] um: Add 4 level page table support
2024-05-28 8:54 ` [PATCH 5/5] um: Add 4 level page table support benjamin
@ 2024-05-30 3:07 ` Tiwei Bie
0 siblings, 0 replies; 15+ messages in thread
From: Tiwei Bie @ 2024-05-30 3:07 UTC (permalink / raw)
To: benjamin, linux-um; +Cc: Benjamin Berg
On 5/28/24 4:54 PM, benjamin@sipsolutions.net wrote:
[...]
> diff --git a/arch/um/Kconfig b/arch/um/Kconfig
> index 93a5a8999b07..5d111fc8ccb7 100644
> --- a/arch/um/Kconfig
> +++ b/arch/um/Kconfig
> @@ -208,6 +208,7 @@ config MMAPPER
>
> config PGTABLE_LEVELS
> int
> + default 4 if 4_LEVEL_PGTABLES
> default 3 if 3_LEVEL_PGTABLES
> default 2
>
[...]
> diff --git a/arch/x86/um/Kconfig b/arch/x86/um/Kconfig
> index 186f13268401..72dc7b0b3a33 100644
> --- a/arch/x86/um/Kconfig
> +++ b/arch/x86/um/Kconfig
> @@ -28,16 +28,34 @@ config X86_64
> def_bool 64BIT
> select MODULES_USE_ELF_RELA
>
> -config 3_LEVEL_PGTABLES
> - bool "Three-level pagetables" if !64BIT
> - default 64BIT
> - help
> - Three-level pagetables will let UML have more than 4G of physical
> - memory. All the memory that can't be mapped directly will be treated
> - as high memory.
> -
> - However, this it experimental on 32-bit architectures, so if unsure say
> - N (on x86-64 it's automatically enabled, instead, as it's safe there).
> +choice
> + prompt "Pagetable levels" if EXPERT
> + default 2_LEVEL_PGTABLES if !64BIT
> + default 4_LEVEL_PGTABLES if 64BIT
> +
> + config 2_LEVEL_PGTABLES
> + bool "Three-level pagetables" if !64BIT
Nit: s/Three-level/Two-level/
> + depends on !64BIT
> + help
> + Two-level page table for 32-bit architectures.
> +
> + config 3_LEVEL_PGTABLES
> + bool "Three-level pagetables" if 64BIT
> + help
> + Three-level pagetables will let UML have more than 4G of physical
> + memory. All the memory that can't be mapped directly will be treated
> + as high memory.
> +
> + However, this is experimental on 32-bit architectures, so if unsure say
> + N (on x86-64 it's automatically enabled, instead, as it's safe there).
> +
> + config 4_LEVEL_PGTABLES
> + bool "Four-level pagetables" if 64BIT
> + depends on 64BIT
> + help
> + Four-level pagetables, gives a bigger address space which can be
> + useful for some applications (e.g. ASAN).
On 64bit, it appears that 4_LEVEL_PGTABLES won't be selected by running
"make ARCH=um defconfig", and PGTABLE_LEVELS will be 2.
I got the following search result in "make ARCH=um menuconfig":
│ Symbol: 4_LEVEL_PGTABLES [=n]
│ Type : bool
│ Defined at arch/x86/um/Kconfig:51
│ Prompt: Four-level pagetables
│ Depends on: <choice> && 64BIT [=y]
│ Location:
│ (1) -> UML-specific options
│ -> Pagetable levels (<choice> [=n])
│ -> Four-level pagetables (4_LEVEL_PGTABLES [=n])
And 4_LEVEL_PGTABLES will be selected if I turn on EXPERT manually.
Regards,
Tiwei
> +endchoice
>
> config ARCH_HAS_SC_SIGNALS
> def_bool !64BIT
* Re: [PATCH 3/5] um: Do a double clone to disable rseq
2024-05-30 2:54 ` Tiwei Bie
@ 2024-05-30 8:54 ` Benjamin Berg
2024-05-30 14:05 ` Tiwei Bie
0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Berg @ 2024-05-30 8:54 UTC (permalink / raw)
To: Tiwei Bie, Johannes Berg, linux-um
Hi,
On Thu, 2024-05-30 at 10:54 +0800, Tiwei Bie wrote:
> On 5/28/24 10:13 PM, Tiwei Bie wrote:
> > On 5/28/24 7:57 PM, Johannes Berg wrote:
> > > On Tue, 2024-05-28 at 18:16 +0800, Tiwei Bie wrote:
> > > > On 5/28/24 4:54 PM, benjamin@sipsolutions.net wrote:
> > > > > From: Benjamin Berg <benjamin.berg@intel.com>
> > > > >
> > > > > Newer glibc versions are enabling rseq support by default. This remains
> > > > > enabled in the cloned child process, potentially causing the host kernel
> > > > > to write/read memory in the child.
> > > > >
> > > > > It appears that this was not an issue purely because the memory area
> > > > > used happened to be above TASK_SIZE and remains mapped.
> > > >
> > > > I also encountered this issue. In my case, with "Force a static link"
> > > > (CONFIG_STATIC_LINK) enabled, UML will crash immediately every time
> > > > it starts up. I worked around this by setting the glibc.pthread.rseq
> > > > tunable via GLIBC_TUNABLES [1] before launching UML.
> > > >
> > > > So another easy way to work around this issue without introducing runtime
> > > > overhead might be to add the GLIBC_TUNABLES=glibc.pthread.rseq=0 environment
> > > > variable and exec /proc/self/exe in UML on startup.
> > > >
> > >
> > > It's also a bit of a question what to rely on - this would introduce a
> > > dependency on glibc behaviour, whereas doing the double-clone proposed
> > > here will work purely because of host kernel behaviour, regardless of
> > > what part of the system set up rseq, how the tunables work, etc.
> >
> > Makes sense. My previous concern was primarily about the runtime overhead,
> > but after taking a closer look at the patch, I realized that the double-clone
> > won't happen on the critical path, so there shouldn't be any performance
> > issues. I also think the double-clone proposal is better. :)
>
> But when combined with this series [1], things might be different..
> Double-clone will happen for each new mm context. That's something
> we might want to avoid.
I don't believe this overhead is something to worry about. The
CLONE_VM step should be really fast compared to the second clone, as it
runs in the same MM as the kernel (this is how posix_spawn avoids the
fork overhead when executing another process, where possible).
Note that using execve in the second step would speed things up even
more as the process will then run in a new MM instead of copying the
kernel MM and cleaning it.
That said, this patch can be made simpler by using CLONE_VFORK.
Benjamin
>
> [1] https://patchwork.ozlabs.org/project/linux-um/list/?series=408104
>
> Regards,
> Tiwei
>
>
* Re: [PATCH 3/5] um: Do a double clone to disable rseq
2024-05-30 8:54 ` Benjamin Berg
@ 2024-05-30 14:05 ` Tiwei Bie
0 siblings, 0 replies; 15+ messages in thread
From: Tiwei Bie @ 2024-05-30 14:05 UTC (permalink / raw)
To: Benjamin Berg, Johannes Berg, linux-um
On 5/30/24 4:54 PM, Benjamin Berg wrote:
> Hi,
>
> On Thu, 2024-05-30 at 10:54 +0800, Tiwei Bie wrote:
>> On 5/28/24 10:13 PM, Tiwei Bie wrote:
>>> On 5/28/24 7:57 PM, Johannes Berg wrote:
>>>> On Tue, 2024-05-28 at 18:16 +0800, Tiwei Bie wrote:
>>>>> On 5/28/24 4:54 PM, benjamin@sipsolutions.net wrote:
>>>>>> From: Benjamin Berg <benjamin.berg@intel.com>
>>>>>>
>>>>>> Newer glibc versions are enabling rseq support by default. This remains
>>>>>> enabled in the cloned child process, potentially causing the host kernel
>>>>>> to write/read memory in the child.
>>>>>>
>>>>>> It appears that this was purely not an issue because the used memory
>>>>>> area happened to be above TASK_SIZE and remains mapped.
>>>>>
>>>>> I also encountered this issue. In my case, with "Force a static link"
>>>>> (CONFIG_STATIC_LINK) enabled, UML will crash immediately every time
>>>>> it starts up. I worked around this by setting the glibc.pthread.rseq
>>>>> tunable via GLIBC_TUNABLES [1] before launching UML.
>>>>>
>>>>> So another easy way to work around this issue without introducing runtime
>>>>> overhead might be to add the GLIBC_TUNABLES=glibc.pthread.rseq=0 environment
>>>>> variable and exec /proc/self/exe in UML on startup.
>>>>>
>>>>
>>>> It's also a bit of a question what to rely on - this would introduce a
>>>> dependency on glibc behaviour, whereas doing the double-clone proposed
>>>> here will work purely because of host kernel behaviour, regardless of
>>>> what part of the system set up rseq, how the tunables work, etc.
>>>
>>> Makes sense. My previous concern was primarily about the runtime overhead,
>>> but after taking a closer look at the patch, I realized that the double-clone
>>> won't happen on the critical path, so there shouldn't be any performance
>>> issues. I also think the double-clone proposal is better. :)
>>
>> But when combined with this series [1], things might be different..
>> Double-clone will happen for each new mm context. That's something
>> we might want to avoid.
>
> I don't believe this overhead is something to worry about. The
> CLONE_VM step should be really fast compared to the second clone, as it
> runs in the same MM as the kernel (this is how posix_spawn avoids the
> fork overhead when executing another process, where possible).
Hmm.. I just think that creating a temporary "thread" every time to do this
is perhaps a bit unnecessary. But I also agree that using GLIBC_TUNABLES in
UML will introduce a dependency on glibc behaviour, which is undesirable.
Honestly, I don't have a strong opinion on this.
Regards,
Tiwei
>
> Note that using execve in the second step would speed things up even
> more as the process will then run in a new MM instead of copying the
> kernel MM and cleaning it.
>
> That said, this patch can be made simpler by using CLONE_VFORK.
>
> Benjamin
>
>>
>> [1] https://patchwork.ozlabs.org/project/linux-um/list/?series=408104
>>
>> Regards,
>> Tiwei
>>
>>
end of thread, other threads:[~2024-05-30 14:05 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-28 8:54 [PATCH 0/5] Increased address space for 64 bit benjamin
2024-05-28 8:54 ` [PATCH 1/5] um: Fix stub_start address calculation benjamin
2024-05-28 8:54 ` [PATCH 2/5] um: Limit TASK_SIZE to the addressable range benjamin
2024-05-28 8:54 ` [PATCH 3/5] um: Do a double clone to disable rseq benjamin
2024-05-28 10:16 ` Tiwei Bie
2024-05-28 10:30 ` Benjamin Berg
2024-05-28 11:03 ` Tiwei Bie
2024-05-28 11:57 ` Johannes Berg
2024-05-28 14:13 ` Tiwei Bie
2024-05-30 2:54 ` Tiwei Bie
2024-05-30 8:54 ` Benjamin Berg
2024-05-30 14:05 ` Tiwei Bie
2024-05-28 8:54 ` [PATCH 4/5] um: Discover host_task_size from envp benjamin
2024-05-28 8:54 ` [PATCH 5/5] um: Add 4 level page table support benjamin
2024-05-30 3:07 ` Tiwei Bie