[PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API
From: madvenka @ 2020-07-28 13:10 UTC
  To: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86,
	madvenka
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
There are many applications that use trampoline code. Trampoline code is
usually placed in a data page or a stack page. In order to execute a
trampoline, the page that contains the trampoline needs to have execute
permissions.
Writable pages with execute permissions provide an attack surface for
hackers. To mitigate this, LSMs such as SELinux may prevent a page from
having both write and execute permissions.
An application may attempt to circumvent this by writing the trampoline
code into a temporary file and mapping the file into its process
address space with just execute permissions. This presents the same
opportunity to hackers as before. LSMs that implement cryptographic
verification of files can prevent such temporary files from being mapped.
Such security mitigations prevent genuine trampoline code from running
as well.
Typically, trampolines simply load some values in some registers and/or
push some values on the stack and jump to a target PC. For such simple
trampolines, an application could ask the kernel to do that work
instead of executing trampoline code itself. trampfd allows
applications to do exactly this.
Such applications can then run without having to relax security
settings for them. For instance, libffi trampolines can easily be
replaced by trampfd. libffi is used by a variety of applications.
trampfd_create() system call
----------------------------
A new system call is introduced to create a trampoline. The system call
number for this is 440. The system call is invoked like this:
	int	trampfd;
	trampfd = syscall(440, type, data, flags);
	type	Trampoline type.
	data	Trampoline type-specific data.
	flags	Reserved for future use. Must be 0.
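As a minimal user-space sketch of the call above: the wrapper and helper
names below are hypothetical, the enum and structure are copied from the
UAPI header added by this patch, and the third argument follows the
SYSCALL_DEFINE3() definition later in this patch. The number 440 is only
what this RFC proposes; a kernel without this patch may leave it
unassigned or use it for an unrelated system call, so the wrapper should
only be called on a kernel that carries this series.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

/*
 * Assumption: 440 is the number this RFC proposes. A kernel without this
 * patch may leave it unassigned or use it for an unrelated system call.
 */
#define TRAMPFD_CREATE_NR_PROPOSED	440

/* Copied from the UAPI header added by this patch. */
enum trampfd_type {
	TRAMPFD_USER,
	TRAMPFD_NUM_TYPES,
};

struct trampfd_user {
	uint32_t	flags;		/* for future enhancements, must be 0 */
	uint32_t	reserved;	/* must be 0 */
};

/* Hypothetical helper: type-specific data for a TRAMPFD_USER trampoline. */
static inline struct trampfd_user trampfd_user_data(void)
{
	struct trampfd_user	user = { .flags = 0, .reserved = 0 };

	return user;
}

/*
 * Hypothetical wrapper. Returns the trampoline file descriptor, or -1 with
 * errno set. The flags argument is reserved and must be 0.
 */
static inline int trampfd_create(int type, const void *data,
				 unsigned int flags)
{
	return (int)syscall(TRAMPFD_CREATE_NR_PROPOSED, type, data, flags);
}
```

A caller would do something like:
	struct trampfd_user user = trampfd_user_data();
	int fd = trampfd_create(TRAMPFD_USER, &user, 0);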
Types of trampolines
--------------------
Different types of trampolines can be defined based on the desired
functionality. In this initial work, the following type is defined:
	TRAMPFD_USER
This implements the simple trampoline type I referred to earlier.
The type-specific structure for TRAMPFD_USER is struct trampfd_user.
Trampoline contexts
-------------------
A trampoline can have one or more contexts associated with it. Contexts
are of two kinds:
	- Contexts that can be specified by the user. These can be added,
	  retrieved and removed by user code.
	- Contexts that are specified by the kernel. These can only be
	  added by the kernel, but they can be read by the user.
In this initial work, I define the following contexts:
User specifiable:
	Register Context
	----------------
	Contains register name-value pairs. When a trampoline is invoked,
	the specified values are loaded in the specified registers. This
	includes the value of the PC register. The kernel defines the
	subset of registers that can be set this way.
	Stack Context
	-------------
	Contains data to push on the user stack when a trampoline is
	invoked.
	Allowed PCs
	-----------
	This specifies a list of PCs that the trampoline is allowed to
	jump to. This prevents a hacker from modifying the trampoline's
	target PC.
Kernel specified:
	Mapping parameters
	------------------
	Used to map a trampoline into an address space. Mapping parameters
	are determined by the kernel based on the trampoline type and
	type-specific information.
Other contexts can be defined in the future.
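As an illustration of what a user-specified context looks like in memory,
here is a sketch that builds an allowed PCs list. The structure layout is
an ASSUMPTION: the kernel code in this patch only shows that struct
trampfd_values has nvalues, reserved and an array of 64-bit values, so the
field order and widths below are a guess. The helper name is hypothetical.
The checks mirror trampfd_copy_allowed_pcs(): reserved must be 0, no PC may
be 0, and the buffer length must be exactly the header plus nvalues 64-bit
entries. The kernel also enforces an upper bound, TRAMPFD_MAX_PCS, whose
value is not shown here and is not checked by this sketch.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * ASSUMED layout. The real definition lives in include/uapi/linux/trampfd.h;
 * only the field names are visible in the kernel code of this patch.
 */
struct trampfd_values {
	uint32_t	nvalues;
	uint32_t	reserved;	/* must be 0 */
	uint64_t	values[];
};

/*
 * Hypothetical helper: build an allowed PCs buffer from an array of PCs.
 * Returns a calloc()ed buffer and stores its length in *len_out, or returns
 * NULL if a PC is 0 or memory cannot be allocated.
 */
static inline struct trampfd_values *
trampfd_build_allowed_pcs(const uint64_t *pcs, uint32_t npcs, size_t *len_out)
{
	struct trampfd_values	*v;
	size_t			len;
	uint32_t		i;

	/* The kernel rejects a list that contains a zero PC. */
	for (i = 0; i < npcs; i++) {
		if (!pcs[i])
			return NULL;
	}

	/* The kernel requires count == header size + nvalues * 8. */
	len = sizeof(*v) + (size_t)npcs * sizeof(uint64_t);
	v = calloc(1, len);
	if (!v)
		return NULL;

	v->nvalues = npcs;
	for (i = 0; i < npcs; i++)
		v->values[i] = pcs[i];

	*len_out = len;
	return v;
}
```

A target function's address can be placed in the list as
(uint64_t)(uintptr_t)func. The caller frees the buffer with free() once it
has been written to the trampoline.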
How to set and read contexts
----------------------------
A symbolic file offset is associated with each context type.
	TRAMPFD_MAP_OFFSET
	TRAMPFD_REGS_OFFSET
	TRAMPFD_STACK_OFFSET
	TRAMPFD_ALLOWED_PCS_OFFSET
A structure is defined for each context type as well:
	struct trampfd_map
	struct trampfd_regs
	struct trampfd_stack
	struct trampfd_values
To set or retrieve a context, seek to the corresponding offset and
write() or read() the corresponding structure. As a convenience, pread()
and pwrite() can be used to do this in one call instead of two. The
mapping parameters can only be read, and the allowed PCs can only be
written.
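The following sketch shows the pread()/pwrite() pattern. The helper names
are hypothetical. The offsets enum and struct trampfd_map are copied from
the UAPI header added by this patch (as far as that header is visible
here). The helpers treat a short transfer as an error because the kernel
side returns the full count on success.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Copied from the UAPI header added by this patch. */
enum trampfd_offsets {
	TRAMPFD_MAP_OFFSET,
	TRAMPFD_REGS_OFFSET,
	TRAMPFD_STACK_OFFSET,
	TRAMPFD_ALLOWED_PCS_OFFSET,
	TRAMPFD_NUM_OFFSETS,
};

struct trampfd_map {
	uint32_t	size;		/* size of the mapping */
	uint32_t	prot;		/* memory protection */
	uint32_t	flags;		/* map flags */
	uint32_t	offset;		/* file offset */
	uint32_t	ioffset;	/* invocation offset */
	uint32_t	reserved;
};

/* Hypothetical helper: write a whole context at its symbolic offset. */
static inline int trampfd_set_context(int fd, off_t ctx, const void *buf,
				      size_t len)
{
	ssize_t	n = pwrite(fd, buf, len, ctx);

	if (n < 0)
		return -1;
	if ((size_t)n != len) {
		errno = EIO;
		return -1;
	}
	return 0;
}

/* Hypothetical helper: read a whole context from its symbolic offset. */
static inline int trampfd_get_context(int fd, off_t ctx, void *buf,
				      size_t len)
{
	ssize_t	n = pread(fd, buf, len, ctx);

	if (n < 0)
		return -1;
	if ((size_t)n != len) {
		errno = EIO;
		return -1;
	}
	return 0;
}

/* Hypothetical helper: fetch the kernel-specified mapping parameters. */
static inline int trampfd_read_map(int fd, struct trampfd_map *map)
{
	return trampfd_get_context(fd, TRAMPFD_MAP_OFFSET, map, sizeof(*map));
}
```

A register or stack context would be written the same way, for example
trampfd_set_context(fd, TRAMPFD_REGS_OFFSET, regs, regs_len).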
Invoking a trampoline
---------------------
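Read the mapping parameters from the trampoline and map the file
descriptor into the process address space using mmap() with those
parameters. The protection for the mapping is PROT_NONE. Adding the
invocation offset (ioffset) from the mapping parameters to the address
returned by mmap() gives the address to invoke the trampoline with.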
Execute the trampoline in one of two ways depending upon what the target
PC points to:
   - Branch to the trampoline address.
   - Use the trampoline address as a function pointer and call it.
Because the user process does not have execute permissions on the
trampoline address, it traps into the kernel. The kernel recognizes
it as a trampoline invocation and performs the action indicated by the
trampoline's type and context. In the case of TRAMPFD_USER, the
kernel loads the user registers with the values specified in the
register context, pushes the values specified in the stack context on
the user stack and sets the user PC to point to the PC register value
in the register context. Then, the process returns to user land and
continues execution at the target PC.
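A sketch of the mapping and invocation steps, under the assumption that
the mapping parameters have already been read into a struct trampfd_map
(copied here from the UAPI header in this patch). The helper names are
hypothetical. The invocation address is the mmap() return value plus
ioffset, as described in the comment above struct trampfd_map.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>

/* Copied from the UAPI header added by this patch. */
struct trampfd_map {
	uint32_t	size;		/* size of the mapping */
	uint32_t	prot;		/* memory protection */
	uint32_t	flags;		/* map flags */
	uint32_t	offset;		/* file offset */
	uint32_t	ioffset;	/* invocation offset */
	uint32_t	reserved;
};

/*
 * Hypothetical helper: map a trampoline with the kernel-specified parameters.
 * Returns the invocation address, or NULL if mmap() fails. If base_out is
 * not NULL, it receives the mapping base to pass to munmap() later.
 */
static inline void *trampfd_map_tramp(int fd, const struct trampfd_map *map,
				      void **base_out)
{
	void	*base;

	base = mmap(NULL, map->size, (int)map->prot, (int)map->flags,
		    fd, (off_t)map->offset);
	if (base == MAP_FAILED)
		return NULL;
	if (base_out)
		*base_out = base;
	return (char *)base + map->ioffset;
}

/*
 * Hypothetical helper: invoke the trampoline as a function that takes no
 * arguments and returns nothing. Real callers would cast to the signature
 * of the function that the target PC points to.
 */
static inline void trampfd_call(void *invoke_addr)
{
	void (*tramp)(void) = (void (*)(void))invoke_addr;

	tramp();
}
```

Because the mapping is PROT_NONE, the call in trampfd_call() faults into
the kernel, which is the behavior this design relies on. Calling it on
anything other than a trampfd mapping would simply crash the process.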
Removing a context
------------------
To remove a context, write the context structure into trampfd but
specify a zero context. For example, for the register context, specify
the number of registers as 0. For the stack context, specify the size
of the stack data as 0.
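A sketch of removing a context. It relies on the observation that an
all-zero context header has a zero count and zero reserved fields
regardless of field order, so no structure layout needs to be assumed: the
caller passes the header size, for example sizeof(struct trampfd_regs)
from the real UAPI header. The helper name and the 64-byte bound on the
header size are assumptions made for this illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: remove the context at symbolic offset ctx by writing
 * an all-zero header of hdr_size bytes. Returns 0 on success, or -1 with
 * errno set.
 */
static inline int trampfd_remove_context(int fd, off_t ctx, size_t hdr_size)
{
	unsigned char	zero[64];	/* assumed to be large enough */
	ssize_t		n;

	if (!hdr_size || hdr_size > sizeof(zero)) {
		errno = EINVAL;
		return -1;
	}
	memset(zero, 0, hdr_size);

	n = pwrite(fd, zero, hdr_size, ctx);
	if (n < 0)
		return -1;
	if ((size_t)n != hdr_size) {
		errno = EIO;
		return -1;
	}
	return 0;
}
```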
Removing a trampoline
---------------------
To remove a trampoline, unmap it and close the file descriptor. When
the last reference on the trampoline goes away, the trampoline is freed.
Sharing trampolines
-------------------
A trampoline created by one thread can be used by other threads sharing
the same address space.
Trampolines, in general, may be shared across processes by the usual
mechanism of sending the file descriptor to another process over a Unix
domain socket.
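For reference, the usual mechanism is an SCM_RIGHTS control message. The
sketch below is generic descriptor passing and contains nothing specific
to trampfd. The function names are hypothetical. One byte of ordinary data
is sent along with the descriptor because some socket types do not deliver
control messages without any data.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send file descriptor fd over Unix domain socket sock. 0 on success. */
static inline int send_fd(int sock, int fd)
{
	char		byte = 0;
	struct iovec	iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char		buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr	align;	/* forces correct alignment */
	} ctl;
	struct msghdr	msg;
	struct cmsghdr	*cmsg;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor from sock. Returns the new fd, or -1. */
static inline int recv_fd(int sock)
{
	char		byte;
	struct iovec	iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char		buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr	align;
	} ctl;
	struct msghdr	msg;
	struct cmsghdr	*cmsg;
	int		fd;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
	return fd;
}
```

The receiving process gets its own descriptor for the same trampoline and
would then map it into its own address space before invoking it.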
Architecture support
--------------------
The handling of the trampoline page fault and the setting up of the
register and stack contexts are architecture specific. Architecture
specific patches will implement support for the API.
The signal delivery code in the kernel already implements the elements
needed for this work. That will be leveraged.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 fs/Makefile                       |   1 +
 fs/trampfd/Makefile               |   6 ++
 fs/trampfd/trampfd_data.c         |  43 ++++++++
 fs/trampfd/trampfd_fops.c         | 131 +++++++++++++++++++++++
 fs/trampfd/trampfd_map.c          |  78 ++++++++++++++
 fs/trampfd/trampfd_pcs.c          |  95 +++++++++++++++++
 fs/trampfd/trampfd_regs.c         | 137 ++++++++++++++++++++++++
 fs/trampfd/trampfd_stack.c        | 131 +++++++++++++++++++++++
 fs/trampfd/trampfd_stubs.c        |  41 +++++++
 fs/trampfd/trampfd_syscall.c      |  92 ++++++++++++++++
 include/linux/syscalls.h          |   3 +
 include/linux/trampfd.h           |  82 ++++++++++++++
 include/uapi/asm-generic/unistd.h |   4 +-
 include/uapi/linux/trampfd.h      | 171 ++++++++++++++++++++++++++++++
 init/Kconfig                      |   8 ++
 kernel/sys_ni.c                   |   3 +
 16 files changed, 1025 insertions(+), 1 deletion(-)
 create mode 100644 fs/trampfd/Makefile
 create mode 100644 fs/trampfd/trampfd_data.c
 create mode 100644 fs/trampfd/trampfd_fops.c
 create mode 100644 fs/trampfd/trampfd_map.c
 create mode 100644 fs/trampfd/trampfd_pcs.c
 create mode 100644 fs/trampfd/trampfd_regs.c
 create mode 100644 fs/trampfd/trampfd_stack.c
 create mode 100644 fs/trampfd/trampfd_stubs.c
 create mode 100644 fs/trampfd/trampfd_syscall.c
 create mode 100644 include/linux/trampfd.h
 create mode 100644 include/uapi/linux/trampfd.h
diff --git a/fs/Makefile b/fs/Makefile
index 2ce5112b02c8..227761302000 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -136,3 +136,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
+obj-$(CONFIG_TRAMPFD)		+= trampfd/
diff --git a/fs/trampfd/Makefile b/fs/trampfd/Makefile
new file mode 100644
index 000000000000..bdf5e487facc
--- /dev/null
+++ b/fs/trampfd/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_TRAMPFD) += trampfd.o
+
+trampfd-y += trampfd_data.o trampfd_fops.o trampfd_map.o trampfd_pcs.o
+trampfd-y += trampfd_regs.o trampfd_stack.o trampfd_stubs.o trampfd_syscall.o
diff --git a/fs/trampfd/trampfd_data.c b/fs/trampfd/trampfd_data.c
new file mode 100644
index 000000000000..0a316754cbe4
--- /dev/null
+++ b/fs/trampfd/trampfd_data.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Trampoline type-specific code.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/mman.h>
+#include <linux/trampfd.h>
+
+int trampfd_create_data(struct trampfd *trampfd, const void __user *tramp_data)
+{
+	struct trampfd_map	*map = &trampfd->map;
+	struct trampfd_user	*user;
+
+	if (trampfd->type == TRAMPFD_USER) {
+		user = kmalloc(sizeof(*user), GFP_KERNEL);
+		if (!user)
+			return -ENOMEM;
+
+		if (copy_from_user(user, tramp_data, sizeof(*user))) {
+			kfree(user);
+			return -EFAULT;
+		}
+		if (user->flags || user->reserved) {
+			kfree(user);
+			return -EINVAL;
+		}
+		trampfd->data = user;
+
+		map->size = PAGE_SIZE;
+		map->prot = PROT_NONE;
+		map->flags = MAP_PRIVATE;
+		map->offset = 0;
+		map->ioffset = 0;
+	}
+	return 0;
+}
diff --git a/fs/trampfd/trampfd_fops.c b/fs/trampfd/trampfd_fops.c
new file mode 100644
index 000000000000..94b82e0da75b
--- /dev/null
+++ b/fs/trampfd/trampfd_fops.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - File operations.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/seq_file.h>
+#include <linux/trampfd.h>
+
+#ifdef CONFIG_PROC_FS
+static const char * const trampfd_type_names[TRAMPFD_NUM_TYPES] = {
+	"TRAMPFD_USER",
+};
+
+static void trampfd_show_fdinfo(struct seq_file *sfile, struct file *file)
+{
+	struct trampfd		*trampfd = file->private_data;
+
+	seq_printf(sfile, "type: %s\n", trampfd_type_names[trampfd->type]);
+}
+#endif
+
+static loff_t trampfd_llseek(struct file *file, loff_t offset, int whence)
+{
+	struct trampfd		*trampfd = file->private_data;
+
+	if (whence != SEEK_SET)
+		return -EINVAL;
+
+	if ((offset < 0) || (offset >= TRAMPFD_NUM_OFFSETS))
+		return -EINVAL;
+
+	mutex_lock(&trampfd->lock);
+	if (offset != file->f_pos) {
+		file->f_pos = offset;
+		file->f_version = 0;
+	}
+	mutex_unlock(&trampfd->lock);
+	return offset;
+}
+
+static ssize_t trampfd_read(struct file *file, char __user *arg,
+			    size_t count, loff_t *ppos)
+{
+	int		rc;
+
+	if (!arg || !count)
+		return -EINVAL;
+
+	switch (*ppos) {
+	case TRAMPFD_MAP_OFFSET:
+		rc = trampfd_get_map(file, arg, count);
+		break;
+
+	case TRAMPFD_REGS_OFFSET:
+		rc = trampfd_get_regs(file, arg, count);
+		break;
+
+	case TRAMPFD_STACK_OFFSET:
+		rc = trampfd_get_stack(file, arg, count);
+		break;
+
+	default:
+		rc = -EINVAL;
+		goto out;
+	}
+out:
+	return rc ? rc : (ssize_t) count;
+}
+
+static ssize_t trampfd_write(struct file *file, const char __user *arg,
+			     size_t count, loff_t *ppos)
+{
+	int		rc;
+
+	if (!arg || !count)
+		return -EINVAL;
+
+	switch (*ppos) {
+	case TRAMPFD_REGS_OFFSET:
+		rc = trampfd_set_regs(file, arg, count);
+		break;
+
+	case TRAMPFD_STACK_OFFSET:
+		rc = trampfd_set_stack(file, arg, count);
+		break;
+
+	case TRAMPFD_ALLOWED_PCS_OFFSET:
+		rc = trampfd_set_allowed_pcs(file, arg, count);
+		break;
+
+	default:
+		rc = -EINVAL;
+		goto out;
+	}
+out:
+	return rc ? rc : (ssize_t) count;
+}
+
+static int trampfd_release(struct inode *inode, struct file *file)
+{
+	struct trampfd		*trampfd = file->private_data;
+
+	if (trampfd->type == TRAMPFD_USER) {
+		kfree(trampfd->regs);
+		kfree(trampfd->stack);
+		kfree(trampfd->allowed_pcs);
+	}
+	kfree(trampfd->data);
+	mutex_destroy(&trampfd->lock);
+	kmem_cache_free(trampfd_cache, trampfd);
+	return 0;
+}
+
+const struct file_operations trampfd_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo		= trampfd_show_fdinfo,
+#endif
+	.llseek			= trampfd_llseek,
+	.read			= trampfd_read,
+	.write			= trampfd_write,
+	.release		= trampfd_release,
+	.mmap			= trampfd_mmap,
+	.get_unmapped_area	= trampfd_get_unmapped_area,
+};
diff --git a/fs/trampfd/trampfd_map.c b/fs/trampfd/trampfd_map.c
new file mode 100644
index 000000000000..1a156c850ca8
--- /dev/null
+++ b/fs/trampfd/trampfd_map.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Memory mapping.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/mman.h>
+#include <linux/security.h>
+#include <linux/trampfd.h>
+
+int trampfd_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct trampfd		*trampfd = file->private_data;
+
+	if (trampfd->type == TRAMPFD_USER) {
+		/*
+		 * These mappings are special mappings that should not be
+		 * merged or inherited. No physical page is currently allocated
+		 * to these mappings. So, there is nothing to read/write.
+		 * When the trampoline is invoked, an execute fault must be
+		 * encountered so the kernel can intercept the invocation and
+		 * set up user context.
+		 */
+		if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
+			return -EINVAL;
+		vma->vm_flags = VM_SPECIAL | VM_DONTCOPY | VM_DONTDUMP;
+	}
+	vma->vm_private_data = trampfd;
+	return 0;
+}
+
+unsigned long
+trampfd_get_unmapped_area(struct file *file, unsigned long orig_addr,
+			  unsigned long len, unsigned long pgoff,
+			  unsigned long flags)
+{
+	struct trampfd		*trampfd = file->private_data;
+	struct trampfd_map	*map = &trampfd->map;
+	unsigned long		map_pgoff = map->offset >> PAGE_SHIFT;
+
+	const typeof_member(struct file_operations, get_unmapped_area)
+	get_area = current->mm->get_unmapped_area;
+
+	if (len != map->size || pgoff != map_pgoff || (flags != map->flags))
+		return -EINVAL;
+
+	return get_area(file, orig_addr, len, pgoff, flags);
+}
+
+/*
+ * Retrieve the mapping parameters of a trampoline.
+ */
+int trampfd_get_map(struct file *file, char __user *arg, size_t count)
+{
+	struct trampfd		*trampfd = file->private_data;
+
+	if (count != sizeof(trampfd->map))
+		return -EINVAL;
+	if (copy_to_user(arg, &trampfd->map, count))
+		return -EFAULT;
+	return 0;
+}
+
+bool is_trampfd_vma(struct vm_area_struct *vma)
+{
+	struct file	*file = vma->vm_file;
+
+	if (!file)
+		return false;
+	return !strcmp(file->f_path.dentry->d_name.name, trampfd_name);
+}
+EXPORT_SYMBOL_GPL(is_trampfd_vma);
diff --git a/fs/trampfd/trampfd_pcs.c b/fs/trampfd/trampfd_pcs.c
new file mode 100644
index 000000000000..0ed36fd2169f
--- /dev/null
+++ b/fs/trampfd/trampfd_pcs.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Allowed PCs context.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/trampfd.h>
+
+/*
+ * Copy list of allowed PCs from the user and validate it.
+ */
+static int trampfd_copy_allowed_pcs(struct trampfd_values *allowed_pcs,
+			    const void __user *arg, size_t count)
+{
+	u32			npcs;
+	size_t			size;
+	u64			*values;
+	int			i;
+
+	if (copy_from_user(allowed_pcs, arg, count))
+		return -EFAULT;
+
+	if (allowed_pcs->reserved)
+		return -EINVAL;
+
+	npcs = allowed_pcs->nvalues;
+	if (npcs > TRAMPFD_MAX_PCS)
+		return -EINVAL;
+
+	size = sizeof(*allowed_pcs);
+	size += npcs * sizeof(u64);
+	if (size != count)
+		return -EINVAL;
+
+	values = allowed_pcs->values;
+	for (i = 0; i < npcs; i++) {
+		if (!values[i])
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * Set the allowed PCs for a trampoline. If the trampoline has a register
+ * context at this point, the PC register value in that register context is
+ * not checked against this list of allowed PCs.
+ */
+int trampfd_set_allowed_pcs(struct file *file, const char __user *arg,
+			    size_t count)
+{
+	struct trampfd			*trampfd = file->private_data;
+	struct trampfd_values		*allowed_pcs, *cur_allowed_pcs;
+	int				rc;
+
+	if (count < sizeof(*allowed_pcs) || count > TRAMPFD_MAX_PCS_SIZE)
+		return -EINVAL;
+
+	allowed_pcs = kmalloc(count, GFP_KERNEL);
+	if (!allowed_pcs)
+		return -ENOMEM;
+
+	rc = trampfd_copy_allowed_pcs(allowed_pcs, arg, count);
+	if (rc)
+		goto out;
+
+	/*
+	 * If the number of PCs is 0, there are no new PCs to set.
+	 */
+	if (!allowed_pcs->nvalues) {
+		kfree(allowed_pcs);
+		allowed_pcs = NULL;
+	}
+
+	/*
+	 * Swap the new PCs with the current one and free the current one,
+	 * if any.
+	 */
+	mutex_lock(&trampfd->lock);
+
+	cur_allowed_pcs = trampfd->allowed_pcs;
+	trampfd->allowed_pcs = allowed_pcs;
+	allowed_pcs = cur_allowed_pcs;
+
+	mutex_unlock(&trampfd->lock);
+out:
+	kfree(allowed_pcs);
+	return rc;
+}
diff --git a/fs/trampfd/trampfd_regs.c b/fs/trampfd/trampfd_regs.c
new file mode 100644
index 000000000000..35114d647385
--- /dev/null
+++ b/fs/trampfd/trampfd_regs.c
@@ -0,0 +1,137 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Register context.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/trampfd.h>
+
+/*
+ * Copy context from the user and validate it.
+ */
+static int trampfd_copy_regs(struct trampfd_regs *regs, const void __user *arg,
+			     size_t count)
+{
+	u32			nregs;
+	size_t			size;
+
+	if (copy_from_user(regs, arg, count))
+		return -EFAULT;
+
+	if (regs->reserved)
+		return -EINVAL;
+
+	nregs = regs->nregs;
+	if (nregs > TRAMPFD_MAX_REGS)
+		return -EINVAL;
+
+	size = sizeof(*regs);
+	size += nregs * sizeof(struct trampfd_reg);
+	if (size != count)
+		return -EINVAL;
+
+	if (nregs && !trampfd_valid_regs(regs))
+		return -EINVAL;
+	return 0;
+}
+
+/*
+ * Set the register context for a trampoline.
+ */
+int trampfd_set_regs(struct file *file, const char __user *arg, size_t count)
+{
+	struct trampfd			*trampfd = file->private_data;
+	struct trampfd_regs		*regs, *cur_regs;
+	int				rc;
+
+	if (count < sizeof(*regs) || count > TRAMPFD_MAX_REGS_SIZE)
+		return -EINVAL;
+
+	regs = kmalloc(count, GFP_KERNEL);
+	if (!regs)
+		return -ENOMEM;
+
+	rc = trampfd_copy_regs(regs, arg, count);
+	if (rc)
+		goto out;
+
+	/*
+	 * If nregs is 0, there is no new register context to set.
+	 */
+	if (!regs->nregs) {
+		kfree(regs);
+		regs = NULL;
+	}
+
+	/*
+	 * Swap the new register context with the current one and free the
+	 * current one, if any.
+	 */
+	mutex_lock(&trampfd->lock);
+
+	/*
+	 * Check if the specified PC is allowed.
+	 */
+	if (!regs || trampfd_allowed_pc(trampfd, regs)) {
+		cur_regs = trampfd->regs;
+		trampfd->regs = regs;
+		regs = cur_regs;
+	} else {
+		rc = -EINVAL;
+	}
+
+	mutex_unlock(&trampfd->lock);
+out:
+	kfree(regs);
+	return rc;
+}
+
+/*
+ * Retrieve the register context of a trampoline.
+ */
+int trampfd_get_regs(struct file *file, char __user *arg, size_t count)
+{
+	struct trampfd			*trampfd = file->private_data;
+	struct trampfd_regs		*regs, *cur_regs;
+	size_t				size;
+	int				rc = 0;
+
+	if (count < sizeof(*regs) || count > TRAMPFD_MAX_REGS_SIZE)
+		return -EINVAL;
+
+	regs = kmalloc(count, GFP_KERNEL);
+	if (!regs)
+		return -ENOMEM;
+
+	mutex_lock(&trampfd->lock);
+
+	/*
+	 * Copy the current register context into a local buffer so we can
+	 * copy it to the user outside the lock.
+	 */
+	cur_regs = trampfd->regs;
+	if (cur_regs) {
+		size = sizeof(*cur_regs);
+		size += sizeof(struct trampfd_reg) * cur_regs->nregs;
+		if (size > count)
+			size = count;
+		memcpy(regs, cur_regs, size);
+	} else {
+		size = sizeof(*regs);
+		memset(regs, 0, size);
+	}
+
+	mutex_unlock(&trampfd->lock);
+
+	if (copy_to_user(arg, regs, size))
+		rc = -EFAULT;
+
+	kfree(regs);
+	return rc;
+}
diff --git a/fs/trampfd/trampfd_stack.c b/fs/trampfd/trampfd_stack.c
new file mode 100644
index 000000000000..032c5ed70d57
--- /dev/null
+++ b/fs/trampfd/trampfd_stack.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Stack context.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/trampfd.h>
+
+/*
+ * Copy context from the user and validate it.
+ */
+static int trampfd_copy_stack(struct trampfd_stack *stack,
+			      const void __user *arg, size_t count)
+{
+	size_t			size;
+
+	if (copy_from_user(stack, arg, count))
+		return -EFAULT;
+
+	if (stack->reserved)
+		return -EINVAL;
+
+	size = stack->size;
+	if (size > TRAMPFD_MAX_DATA_SIZE)
+		return -EINVAL;
+
+	size += sizeof(*stack);
+	if (size != count)
+		return -EINVAL;
+
+	if (!stack->size)
+		return 0;
+
+	if ((stack->flags & ~TRAMPFD_SFLAGS) ||
+	    stack->offset > TRAMPFD_MAX_STACK_OFFSET)
+		return -EINVAL;
+	return 0;
+}
+
+/*
+ * Set the stack context for a trampoline.
+ */
+int trampfd_set_stack(struct file *file, const char __user *arg, size_t count)
+{
+	struct trampfd			*trampfd = file->private_data;
+	struct trampfd_stack		*stack, *cur_stack;
+	int				rc;
+
+	if (count < sizeof(*stack) || count > TRAMPFD_MAX_STACK_SIZE)
+		return -EINVAL;
+
+	stack = kmalloc(count, GFP_KERNEL);
+	if (!stack)
+		return -ENOMEM;
+
+	rc = trampfd_copy_stack(stack, arg, count);
+	if (rc)
+		goto out;
+
+	/*
+	 * If size is 0, there is no new stack context to set.
+	 */
+	if (!stack->size) {
+		kfree(stack);
+		stack = NULL;
+	}
+
+	/*
+	 * Swap the new stack context with the current one and free the
+	 * current one, if any.
+	 */
+	mutex_lock(&trampfd->lock);
+
+	cur_stack = trampfd->stack;
+	trampfd->stack = stack;
+	stack = cur_stack;
+
+	mutex_unlock(&trampfd->lock);
+out:
+	kfree(stack);
+	return rc;
+}
+
+/*
+ * Retrieve the stack context of a trampoline.
+ */
+int trampfd_get_stack(struct file *file, char __user *arg, size_t count)
+{
+	struct trampfd			*trampfd = file->private_data;
+	struct trampfd_stack		*stack, *cur_stack;
+	size_t				size;
+	int				rc = 0;
+
+	if (count < sizeof(*stack) || count > TRAMPFD_MAX_STACK_SIZE)
+		return -EINVAL;
+
+	stack = kmalloc(count, GFP_KERNEL);
+	if (!stack)
+		return -ENOMEM;
+
+	mutex_lock(&trampfd->lock);
+
+	/*
+	 * Copy the current stack context into a local buffer so we can
+	 * copy it to the user outside the lock.
+	 */
+	cur_stack = trampfd->stack;
+	if (cur_stack) {
+		size = sizeof(*cur_stack) + cur_stack->size;
+		if (size > count)
+			size = count;
+		memcpy(stack, cur_stack, size);
+	} else {
+		size = sizeof(*stack);
+		memset(stack, 0, size);
+	}
+
+	mutex_unlock(&trampfd->lock);
+
+	if (copy_to_user(arg, stack, size))
+		rc = -EFAULT;
+
+	kfree(stack);
+	return rc;
+}
diff --git a/fs/trampfd/trampfd_stubs.c b/fs/trampfd/trampfd_stubs.c
new file mode 100644
index 000000000000..8ca29dccbbf7
--- /dev/null
+++ b/fs/trampfd/trampfd_stubs.c
@@ -0,0 +1,41 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Stub functions.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/trampfd.h>
+
+/*
+ * Stub for the arch function that checks if a trampoline type is supported
+ * by the architecture. Return an error for all types that require architecture
+ * support. Return success for the rest as they are generic.
+ */
+int __attribute__((weak)) trampfd_check_arch(struct trampfd *trampfd)
+{
+	if (trampfd->type == TRAMPFD_USER)
+		return -EINVAL;
+	return 0;
+}
+
+/*
+ * Stub for the arch function that checks if a specified register context
+ * is valid.
+ */
+bool __attribute__((weak)) trampfd_valid_regs(struct trampfd_regs *regs)
+{
+	return false;
+}
+
+/*
+ * Stub for the arch function that checks if the PC register in a specified
+ * register context is allowed.
+ */
+bool __attribute__((weak)) trampfd_allowed_pc(struct trampfd *trampfd,
+					      struct trampfd_regs *regs)
+{
+	return false;
+}
diff --git a/fs/trampfd/trampfd_syscall.c b/fs/trampfd/trampfd_syscall.c
new file mode 100644
index 000000000000..675460afc521
--- /dev/null
+++ b/fs/trampfd/trampfd_syscall.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - System call.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/mman.h>
+#include <linux/syscalls.h>
+#include <linux/anon_inodes.h>
+#include <linux/trampfd.h>
+
+char	*trampfd_name = "[trampfd]";
+
+struct kmem_cache	*trampfd_cache;
+
+SYSCALL_DEFINE3(trampfd_create,
+		int, tramp_type,
+		const void __user *, tramp_data,
+		unsigned int, flags)
+{
+	struct trampfd		*trampfd;
+	struct file		*file;
+	int			fd, rc = 0;
+
+	if (!trampfd_cache)
+		return -ENOMEM;
+
+	/*
+	 * Flags are for future use.
+	 */
+	if (flags || !tramp_data)
+		return -EINVAL;
+
+	if (tramp_type < 0 || tramp_type >= TRAMPFD_NUM_TYPES)
+		return -EINVAL;
+
+	trampfd = kmem_cache_zalloc(trampfd_cache, GFP_KERNEL);
+	if (!trampfd)
+		return -ENOMEM;
+
+	mutex_init(&trampfd->lock);
+	trampfd->type = tramp_type;
+
+	rc = trampfd_create_data(trampfd, tramp_data);
+	if (rc)
+		goto freetramp;
+
+	rc = trampfd_check_arch(trampfd);
+	if (rc)
+		goto freedata;
+
+	rc = get_unused_fd_flags(O_CLOEXEC);
+	if (rc < 0)
+		goto freedata;
+	fd = rc;
+
+	file = anon_inode_getfile(trampfd_name, &trampfd_fops, trampfd, O_RDWR);
+	if (IS_ERR(file)) {
+		rc = PTR_ERR(file);
+		goto freefd;
+	}
+	file->f_mode |= (FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
+
+	fd_install(fd, file);
+	return fd;
+freefd:
+	put_unused_fd(fd);
+freedata:
+	kfree(trampfd->data);
+freetramp:
+	kmem_cache_free(trampfd_cache, trampfd);
+	return rc;
+}
+
+int __init trampfd_init(void)
+{
+	trampfd_cache = kmem_cache_create("trampfd_cache",
+		sizeof(struct trampfd), 0, SLAB_HWCACHE_ALIGN, NULL);
+
+	if (trampfd_cache == NULL) {
+		pr_warn("%s: kmem_cache_create failed\n", __func__);
+		return -ENOMEM;
+	}
+	return 0;
+}
+core_initcall(trampfd_init);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b951a87da987..25ddf29477bc 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1005,6 +1005,9 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 				       siginfo_t __user *info,
 				       unsigned int flags);
 asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
+asmlinkage long sys_trampfd_create(int tramp_type,
+				   const void __user *tramp_data,
+				   unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/linux/trampfd.h b/include/linux/trampfd.h
new file mode 100644
index 000000000000..383d7eeda2d1
--- /dev/null
+++ b/include/linux/trampfd.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Trampoline File Descriptor - Internal structures and definitions.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+#ifndef _LINUX_TRAMPFD_H
+#define _LINUX_TRAMPFD_H
+
+#include <uapi/linux/trampfd.h>
+
+#define TRAMPFD_MAX_REGS_SIZE						\
+	(sizeof(struct trampfd_regs) +					\
+	(sizeof(struct trampfd_reg) * TRAMPFD_MAX_REGS))
+
+#define TRAMPFD_MAX_STACK_SIZE						\
+	(sizeof(struct trampfd_stack) + TRAMPFD_MAX_DATA_SIZE)
+
+#define TRAMPFD_MAX_PCS_SIZE						\
+	(sizeof(struct trampfd_values) + sizeof(u64) * TRAMPFD_MAX_PCS)
+
+/*
+ * Trampoline structure.
+ */
+struct trampfd {
+	struct mutex		lock;		/* to serialize access */
+	enum trampfd_type	type;		/* type of trampoline */
+	void			*data;		/* type specific data */
+	struct trampfd_map	map;		/* mmap() parameters */
+	struct trampfd_regs	*regs;		/* register context */
+	struct trampfd_stack	*stack;		/* stack context */
+	struct trampfd_values	*allowed_pcs;	/* allowed PCs */
+};
+
+#ifdef CONFIG_TRAMPFD
+
+/* Trampoline mapping */
+int trampfd_mmap(struct file *file, struct vm_area_struct *vma);
+unsigned long trampfd_get_unmapped_area(struct file *file,
+					unsigned long orig_addr,
+					unsigned long len,
+					unsigned long pgoff,
+					unsigned long flags);
+bool is_trampfd_vma(struct vm_area_struct *vma);
+
+/* Trampoline context */
+int trampfd_get_map(struct file *file, char __user *arg, size_t count);
+int trampfd_set_regs(struct file *file, const char __user *arg, size_t count);
+int trampfd_get_regs(struct file *file, char __user *arg, size_t count);
+int trampfd_set_stack(struct file *file, const char __user *arg, size_t count);
+int trampfd_get_stack(struct file *file, char __user *arg, size_t count);
+int trampfd_set_allowed_pcs(struct file *file, const char __user *arg,
+			    size_t count);
+
+/* Arch functions */
+bool trampfd_fault(struct vm_area_struct *vma, struct pt_regs *pt_regs);
+bool trampfd_valid_regs(struct trampfd_regs *regs);
+bool trampfd_allowed_pc(struct trampfd *trampfd, struct trampfd_regs *regs);
+int trampfd_check_arch(struct trampfd *trampfd);
+
+/* Trampoline type-specific */
+int trampfd_create_data(struct trampfd *trampfd, const void __user *tramp_data);
+
+extern char				*trampfd_name;
+extern struct kmem_cache		*trampfd_cache;
+extern const struct file_operations	trampfd_fops;
+
+#define USERPTR(ptr)	((void __user *)(uintptr_t)(ptr))
+
+#else
+
+static inline bool trampfd_fault(struct vm_area_struct *vma,
+				 struct pt_regs *pt_regs)
+{
+	return false;
+}
+
+#endif /* CONFIG_TRAMPFD */
+
+#endif /* _LINUX_TRAMPFD_H */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index f4a01305d9a6..14e526a45624 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -857,9 +857,11 @@ __SYSCALL(__NR_openat2, sys_openat2)
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 #define __NR_faccessat2 439
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_trampfd_create 440
+__SYSCALL(__NR_trampfd_create, sys_trampfd_create)
 
 #undef __NR_syscalls
-#define __NR_syscalls 440
+#define __NR_syscalls 441
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/trampfd.h b/include/uapi/linux/trampfd.h
new file mode 100644
index 000000000000..bf9a6ef3683b
--- /dev/null
+++ b/include/uapi/linux/trampfd.h
@@ -0,0 +1,171 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Trampoline File Descriptor - API structures and definitions.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+#ifndef _UAPI_LINUX_TRAMPFD_H
+#define _UAPI_LINUX_TRAMPFD_H
+
+#include <linux/types.h>
+#include <linux/ptrace.h>
+
+/*
+ * All structure fields are defined so that they are the same width and at the
+ * same structure offset on 32-bit and 64-bit to avoid compat code.
+ *
+ * All fields named "reserved" must be set to 0. They are there primarily for
+ * alignment. But they may be used in the future.
+ */
+
+/* ------------------------- Types of Trampolines ------------------------- */
+
+/*
+ * TRAMPFD_USER
+ *	User programs use the kernel as a trampoline to set up a user context
+ *	and jump to a user function. This trampoline type can be used to
+ *	replace user trampoline code.
+ */
+enum trampfd_type {
+	TRAMPFD_USER,
+	TRAMPFD_NUM_TYPES,
+};
+
+/* ---------------------------- Context offsets ---------------------------- */
+
+/*
+ * A trampoline has different types of context associated with it. Each context
+ * type has a symbolic offset into trampfd. The context can be read from or
+ * written to at its symbolic offset in trampfd.
+ *
+ * TRAMPFD_MAP_OFFSET
+ *	To read trampoline mapping parameters - struct trampfd_map.
+ *
+ * TRAMPFD_REGS_OFFSET
+ *	To read/write trampoline register context - struct trampfd_regs.
+ *
+ * TRAMPFD_STACK_OFFSET
+ *	To read/write trampoline stack context - struct trampfd_stack.
+ *
+ * TRAMPFD_ALLOWED_PCS_OFFSET
+ *	To write a list of allowed PCs - struct trampfd_values.
+ */
+enum trampfd_offsets {
+	TRAMPFD_MAP_OFFSET,
+	TRAMPFD_REGS_OFFSET,
+	TRAMPFD_STACK_OFFSET,
+	TRAMPFD_ALLOWED_PCS_OFFSET,
+	TRAMPFD_NUM_OFFSETS,
+};
+
+/* ------------------- Trampoline type specific data -------------------- */
+
+/*
+ * For TRAMPFD_USER.
+ */
+struct trampfd_user {
+	__u32		flags;		/* for future enhancements */
+	__u32		reserved;
+};
+
+/* ------------------- Trampoline mapping parameters ---------------------- */
+
+/*
+ * Since the kernel implements the trampoline object, the kernel specifies
+ * how a trampoline should be mapped. User code must obtain these parameters
+ * and do an mmap() to map the trampoline. The first four parameters are used
+ * in the mmap() call. User code must add ioffset to the address returned by
+ * mmap() to get the actual invocation address for the trampoline.
+ */
+struct trampfd_map {
+	__u32			size;		/* Size of the mapping */
+	__u32			prot;		/* memory protection */
+	__u32			flags;		/* map flags */
+	__u32			offset;		/* file offset */
+	__u32			ioffset;	/* invocation offset */
+	__u32			reserved;
+};
+
+/* -------------------------- Register context -------------------------- */
+
+/*
+ * A register context may be specified for a trampoline, if applicable
+ * to the trampoline type. E.g., TRAMPFD_USER. The register context is
+ * an array of name-value pairs. When a trampoline is invoked, its user
+ * registers are loaded with the specified values. Register names are
+ * architecture specific and can be found in <linux/ptrace.h> for architectures
+ * that support trampolines. Enumerations reg_32_name and reg_64_name in
+ * <linux/ptrace.h> refer to 32-bit and 64-bit respectively.
+ */
+struct trampfd_reg {
+	__u32		name;		/* Register name */
+	__u32		reserved;
+	__u64		value;		/* Register value */
+};
+
+/*
+ * Register context. It is a variable sized structure sized by the number
+ * of registers.
+ */
+struct trampfd_regs {
+	__u32			nregs;		/* Number of registers */
+	__u32			reserved;
+	struct trampfd_reg	regs[0];	/* Array of registers */
+};
+
+#define TRAMPFD_MAX_REGS		40
+
+/* ---------------------------- Stack context ---------------------------- */
+
+/*
+ * A stack context may be specified for a trampoline, if applicable
+ * to the trampoline type. E.g., TRAMPFD_USER. The stack context contains
+ * a data buffer. When a trampoline is invoked, the specified data is pushed
+ * on the stack at a specified offset from the current stack pointer.
+ * Optionally, the stack pointer can be moved to the top of the data.
+ *
+ * This is a variable sized structure sized by the amount of data that is
+ * to be pushed on the user stack.
+ */
+struct trampfd_stack {
+	__u32		flags;			/* TRAMPFD_SFLAGS */
+	__u32		offset;			/* Offset from top of stack */
+	__u32		size;			/* Size of data to push */
+	__u32		reserved;
+	__u8		data[0];		/* Data to push on the stack */
+};
+
+#define TRAMPFD_MAX_DATA_SIZE		64
+#define TRAMPFD_MAX_STACK_OFFSET	256
+
+/*
+ * Stack context flags:
+ *
+ * TRAMPFD_SET_SP
+ *	After pushing the data to user stack, move the stack pointer to the
+ *	base of the data pushed. Note that the kernel will align the stack
+ *	pointer based on the alignment requirements of the architecture.
+ */
+#define TRAMPFD_SET_SP		0x1
+#define TRAMPFD_SFLAGS		(TRAMPFD_SET_SP)
+
+/* ---------------------------- Values context ---------------------------- */
+
+/*
+ * Some contexts may be just a list of values. For instance, the user can
+ * specify a list of allowed PCs for a trampoline. The following structure
+ * is used for those contexts.
+ */
+struct trampfd_values {
+	__u32		nvalues;		/* number of values */
+	__u32		reserved;
+	__u64		values[0];		/* Array of values */
+};
+
+#define TRAMPFD_MAX_PCS		16
+
+/* -------------------------------------------------------------------------- */
+
+#endif /* _UAPI_LINUX_TRAMPFD_H */
diff --git a/init/Kconfig b/init/Kconfig
index 0498af567f70..783a0b98fce1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2313,3 +2313,11 @@ config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 # <asm/syscall_wrapper.h>.
 config ARCH_HAS_SYSCALL_WRAPPER
 	def_bool n
+
+config TRAMPFD
+	bool "Enable trampfd_create() system call"
+	depends on MMU
+	help
+	  Enable the trampfd_create() system call that allows a process to
+	  map trampolines within its address space that can be invoked
+	  with the help of the kernel.
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 3b69a560a7ac..136acf9234a3 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -349,6 +349,9 @@ COND_SYSCALL(pkey_mprotect);
 COND_SYSCALL(pkey_alloc);
 COND_SYSCALL(pkey_free);
 
+/* Trampoline fd */
+COND_SYSCALL(trampfd_create);
+
 
 /*
  * Architecture specific weak syscall entries.
-- 
2.17.1
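Note (illustrative, not part of the patch): the UAPI header above describes the register context as a variable-sized structure, a small header followed by nregs name/value pairs. The sketch below shows one way user code might size and fill such a buffer. The structures are local mirrors of the ones proposed in the patch (they are not in any mainline header), a C99 flexible array member stands in for regs[0], and the helper names tramp_regs_size() and tramp_regs_alloc() are hypothetical. Rejecting nregs == 0 is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local mirrors of the structures proposed in the patch. */
struct trampfd_reg {
	uint32_t name;		/* Register name (arch-specific enum value) */
	uint32_t reserved;	/* Must be 0 */
	uint64_t value;		/* Register value */
};

struct trampfd_regs {
	uint32_t nregs;		/* Number of registers */
	uint32_t reserved;	/* Must be 0 */
	struct trampfd_reg regs[];	/* Flexible array member */
};

#define TRAMPFD_MAX_REGS	40

/* Hypothetical helper: bytes needed for a context holding nregs entries. */
static size_t tramp_regs_size(uint32_t nregs)
{
	return sizeof(struct trampfd_regs) +
	       (size_t)nregs * sizeof(struct trampfd_reg);
}

/*
 * Hypothetical helper: allocate and fill a register context.
 * calloc() zeroes the buffer, so every "reserved" field is 0 as the
 * header comment requires. Returns NULL on a bad count or on
 * allocation failure; the caller frees the result.
 */
static struct trampfd_regs *tramp_regs_alloc(const uint32_t *names,
					     const uint64_t *values,
					     uint32_t nregs)
{
	struct trampfd_regs *tregs;
	uint32_t i;

	assert(names != NULL && values != NULL);

	if (nregs == 0 || nregs > TRAMPFD_MAX_REGS)
		return NULL;

	tregs = calloc(1, tramp_regs_size(nregs));
	if (!tregs)
		return NULL;

	tregs->nregs = nregs;
	for (i = 0; i < nregs; i++) {
		tregs->regs[i].name = names[i];
		tregs->regs[i].value = values[i];
	}
	return tregs;
}
```

Because every field has a fixed width and an explicit offset, the buffer has the same layout for 32-bit and 64-bit callers, which is the property the header comment relies on to avoid compat code. How the buffer reaches the kernel (for example a pwrite() at TRAMPFD_REGS_OFFSET) is an assumption here, since the read/write implementation is not part of this excerpt.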
- * Re: [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API
  2020-07-28 13:10   ` [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API madvenka
@ 2020-07-28 14:50     ` Oleg Nesterov
  2020-07-28 14:58       ` Madhavan T. Venkataraman
  0 siblings, 1 reply; 64+ messages in thread
From: Oleg Nesterov @ 2020-07-28 14:50 UTC (permalink / raw)
  To: madvenka
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, x86
On 07/28, madvenka@linux.microsoft.com wrote:
>
> +bool is_trampfd_vma(struct vm_area_struct *vma)
> +{
> +	struct file	*file = vma->vm_file;
> +
> +	if (!file)
> +		return false;
> +	return !strcmp(file->f_path.dentry->d_name.name, trampfd_name);
Hmm, this looks obviously wrong or I am totally confused. A user can
create a file named "[trampfd]", mmap it, and fool trampfd_fault() ?
Why not
	return file->f_op == trampfd_fops;
?
> +EXPORT_SYMBOL_GPL(is_trampfd_vma);
why is it exported?
Oleg.
- * Re: [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API
  2020-07-28 14:50     ` Oleg Nesterov
@ 2020-07-28 14:58       ` Madhavan T. Venkataraman
  2020-07-28 16:06         ` Oleg Nesterov
  0 siblings, 1 reply; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-28 14:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, x86
Thanks. See inline..
On 7/28/20 9:50 AM, Oleg Nesterov wrote:
> On 07/28, madvenka@linux.microsoft.com wrote:
>> +bool is_trampfd_vma(struct vm_area_struct *vma)
>> +{
>> +	struct file	*file = vma->vm_file;
>> +
>> +	if (!file)
>> +		return false;
>> +	return !strcmp(file->f_path.dentry->d_name.name, trampfd_name);
> Hmm, this looks obviously wrong or I am totally confused. A user can
> create a file named "[trampfd]", mmap it, and fool trampfd_fault() ?
>
> Why not
>
> 	return file->f_op == trampfd_fops;
This is definitely the correct check. I will fix it.
>
> ?
>
>> +EXPORT_SYMBOL_GPL(is_trampfd_vma);
> why is it exported?
This is in common code and is called by arch code. Should I not export it?
I guess since the symbol is not used by any modules, I don't need to
export it. Please confirm and I will fix this.
Madhavan
- * Re: [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API
  2020-07-28 14:58       ` Madhavan T. Venkataraman
@ 2020-07-28 16:06         ` Oleg Nesterov
  0 siblings, 0 replies; 64+ messages in thread
From: Oleg Nesterov @ 2020-07-28 16:06 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, x86
On 07/28, Madhavan T. Venkataraman wrote:
>
> I guess since the symbol is not used by any modules, I don't need to
> export it.
Yes,
Oleg.
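Note (illustrative, not part of the thread): the point raised in this exchange is that a name can be chosen by any user, while the address of the kernel's own file_operations table cannot. The user-space model below sketches that difference. The types file_ops and file_model and both helper functions are hypothetical stand-ins, not kernel code. Since the patch declares trampfd_fops as a struct rather than a pointer, the real check would presumably compare against its address, that is &trampfd_fops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for struct file_operations. */
struct file_ops {
	int id;
};

/* Stand-in for the parts of struct file that matter here. */
struct file_model {
	const char *name;		/* models dentry->d_name.name */
	const struct file_ops *f_op;	/* models file->f_op */
};

static const struct file_ops trampfd_ops_model = { .id = 1 };
static const struct file_ops other_ops_model = { .id = 2 };
static const char *trampfd_name_model = "[trampfd]";

/* Name-based check: any file with a matching name passes. */
static bool is_trampfd_by_name(const struct file_model *f)
{
	if (!f)
		return false;
	assert(f->name != NULL);
	return !strcmp(f->name, trampfd_name_model);
}

/* Identity-based check: only files backed by the trampfd ops pass. */
static bool is_trampfd_by_ops(const struct file_model *f)
{
	return f && f->f_op == &trampfd_ops_model;
}
```

A file that merely carries the name "[trampfd]" passes the first check but fails the second, because its f_op points at a different table.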
 
 
 
- * [PATCH v1 2/4] [RFC] x86/trampfd: Provide support for the trampoline file descriptor
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
  2020-07-28 13:10   ` [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API madvenka
@ 2020-07-28 13:10   ` madvenka
  2020-07-30  9:06     ` Greg KH
  2020-07-28 13:10   ` [PATCH v1 3/4] [RFC] arm64/trampfd: " madvenka
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 64+ messages in thread
From: madvenka @ 2020-07-28 13:10 UTC (permalink / raw)
  To: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86,
	madvenka
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
Implement 32-bit and 64-bit X86 support for the trampoline file descriptor.
	- Define architecture specific register names
	- Handle the trampoline invocation page fault
	- Set up the user register context on trampoline invocation
	- Set up the user stack context on trampoline invocation
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 arch/x86/include/uapi/asm/ptrace.h     |  38 +++
 arch/x86/kernel/Makefile               |   2 +
 arch/x86/kernel/trampfd.c              | 313 +++++++++++++++++++++++++
 arch/x86/mm/fault.c                    |  11 +
 6 files changed, 366 insertions(+)
 create mode 100644 arch/x86/kernel/trampfd.c
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index d8f8a1a69ed1..77eb50414591 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -443,3 +443,4 @@
 437	i386	openat2			sys_openat2
 438	i386	pidfd_getfd		sys_pidfd_getfd
 439	i386	faccessat2		sys_faccessat2
+440	i386	trampfd_create		sys_trampfd_create
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 78847b32e137..9d962de1d21f 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -360,6 +360,7 @@
 437	common	openat2			sys_openat2
 438	common	pidfd_getfd		sys_pidfd_getfd
 439	common	faccessat2		sys_faccessat2
+440	common	trampfd_create		sys_trampfd_create
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/x86/include/uapi/asm/ptrace.h b/arch/x86/include/uapi/asm/ptrace.h
index 85165c0edafc..b031598f857e 100644
--- a/arch/x86/include/uapi/asm/ptrace.h
+++ b/arch/x86/include/uapi/asm/ptrace.h
@@ -9,6 +9,44 @@
 
 #ifndef __ASSEMBLY__
 
+/*
+ * These register names are to be used by 32-bit applications.
+ */
+enum reg_32_name {
+	x32_eax,
+	x32_ebx,
+	x32_ecx,
+	x32_edx,
+	x32_esi,
+	x32_edi,
+	x32_ebp,
+	x32_eip,
+	x32_max,
+};
+
+/*
+ * These register names are to be used by 64-bit applications.
+ */
+enum reg_64_name {
+	x64_rax = x32_max,
+	x64_rbx,
+	x64_rcx,
+	x64_rdx,
+	x64_rsi,
+	x64_rdi,
+	x64_rbp,
+	x64_r8,
+	x64_r9,
+	x64_r10,
+	x64_r11,
+	x64_r12,
+	x64_r13,
+	x64_r14,
+	x64_r15,
+	x64_rip,
+	x64_max,
+};
+
 #ifdef __i386__
 /* this struct defines the way the registers are stored on the
    stack during a system call. */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index e77261db2391..5d968ac4c7d9 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -157,3 +157,5 @@ ifeq ($(CONFIG_X86_64),y)
 endif
 
 obj-$(CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT)	+= ima_arch.o
+
+obj-$(CONFIG_TRAMPFD)			+= trampfd.o
diff --git a/arch/x86/kernel/trampfd.c b/arch/x86/kernel/trampfd.c
new file mode 100644
index 000000000000..f6b5507134d2
--- /dev/null
+++ b/arch/x86/kernel/trampfd.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - X86 support.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+
+#include <linux/thread_info.h>
+#include <linux/mm_types.h>
+#include <linux/trampfd.h>
+#include <linux/uaccess.h>
+
+/* ---------------------------- Register Context ---------------------------- */
+
+static inline bool is_compat(void)
+{
+	return (IS_ENABLED(CONFIG_X86_32) ||
+		(IS_ENABLED(CONFIG_COMPAT) && test_thread_flag(TIF_ADDR32)));
+}
+
+static void set_reg_32(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+	switch (name) {
+	case x32_eax:
+		pt_regs->ax = (unsigned long)value;
+		break;
+	case x32_ebx:
+		pt_regs->bx = (unsigned long)value;
+		break;
+	case x32_ecx:
+		pt_regs->cx = (unsigned long)value;
+		break;
+	case x32_edx:
+		pt_regs->dx = (unsigned long)value;
+		break;
+	case x32_esi:
+		pt_regs->si = (unsigned long)value;
+		break;
+	case x32_edi:
+		pt_regs->di = (unsigned long)value;
+		break;
+	case x32_ebp:
+		pt_regs->bp = (unsigned long)value;
+		break;
+	case x32_eip:
+		pt_regs->ip = (unsigned long)value;
+		break;
+	default:
+		WARN(1, "%s: Illegal register name %d\n", __func__, name);
+		break;
+	}
+}
+
+#ifdef __i386__
+
+static void set_reg_64(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+}
+
+#else
+
+static void set_reg_64(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+	switch (name) {
+	case x64_rax:
+		pt_regs->ax = (unsigned long)value;
+		break;
+	case x64_rbx:
+		pt_regs->bx = (unsigned long)value;
+		break;
+	case x64_rcx:
+		pt_regs->cx = (unsigned long)value;
+		break;
+	case x64_rdx:
+		pt_regs->dx = (unsigned long)value;
+		break;
+	case x64_rsi:
+		pt_regs->si = (unsigned long)value;
+		break;
+	case x64_rdi:
+		pt_regs->di = (unsigned long)value;
+		break;
+	case x64_rbp:
+		pt_regs->bp = (unsigned long)value;
+		break;
+	case x64_r8:
+		pt_regs->r8 = (unsigned long)value;
+		break;
+	case x64_r9:
+		pt_regs->r9 = (unsigned long)value;
+		break;
+	case x64_r10:
+		pt_regs->r10 = (unsigned long)value;
+		break;
+	case x64_r11:
+		pt_regs->r11 = (unsigned long)value;
+		break;
+	case x64_r12:
+		pt_regs->r12 = (unsigned long)value;
+		break;
+	case x64_r13:
+		pt_regs->r13 = (unsigned long)value;
+		break;
+	case x64_r14:
+		pt_regs->r14 = (unsigned long)value;
+		break;
+	case x64_r15:
+		pt_regs->r15 = (unsigned long)value;
+		break;
+	case x64_rip:
+		pt_regs->ip = (unsigned long)value;
+		break;
+	default:
+		WARN(1, "%s: Illegal register name %d\n", __func__, name);
+		break;
+	}
+}
+
+#endif /* __i386__ */
+
+static void set_regs(struct pt_regs *pt_regs, struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	bool			compat = is_compat();
+
+	for (; reg < reg_end; reg++) {
+		if (compat)
+			set_reg_32(pt_regs, reg->name, reg->value);
+		else
+			set_reg_64(pt_regs, reg->name, reg->value);
+	}
+}
+
+/*
+ * Check if the register names are valid. Check if the user PC has been set.
+ */
+bool trampfd_valid_regs(struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	int			min, max, pc_name;
+	bool			pc_set = false;
+
+	if (is_compat()) {
+		min = 0;
+		pc_name = x32_eip;
+		max = x32_max;
+	} else {
+		min = x32_max;
+		pc_name = x64_rip;
+		max = x64_max;
+	}
+
+	for (; reg < reg_end; reg++) {
+		if (reg->name < min || reg->name >= max || reg->reserved)
+			return false;
+		if (reg->name == pc_name && reg->value)
+			pc_set = true;
+	}
+	return pc_set;
+}
+EXPORT_SYMBOL_GPL(trampfd_valid_regs);
+
+/*
+ * Check if the PC specified in a register context is allowed.
+ */
+bool trampfd_allowed_pc(struct trampfd *trampfd, struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	struct trampfd_values	*allowed_pcs = trampfd->allowed_pcs;
+	u64			*allowed_values, pc_value = 0;
+	u32			nvalues, pc_name;
+	int			i;
+
+	if (!allowed_pcs)
+		return true;
+
+	pc_name = is_compat() ? x32_eip : x64_rip;
+
+	/*
+	 * Find the PC register and its value. If the PC register has been
+	 * specified multiple times, only the last one counts.
+	 */
+	for (; reg < reg_end; reg++) {
+		if (reg->name == pc_name)
+			pc_value = reg->value;
+	}
+
+	allowed_values = allowed_pcs->values;
+	nvalues = allowed_pcs->nvalues;
+
+	for (i = 0; i < nvalues; i++) {
+		if (pc_value == allowed_values[i])
+			return true;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_allowed_pc);
+
+/* ---------------------------- Stack Context ---------------------------- */
+
+static int push_data(struct pt_regs *pt_regs, struct trampfd_stack *tstack)
+{
+	unsigned long	sp;
+
+	sp = user_stack_pointer(pt_regs) - tstack->size - tstack->offset;
+	if (tstack->flags & TRAMPFD_SET_SP) {
+		if (is_compat())
+			sp = ((sp + 4) & -16ul) - 4;
+		else
+			sp = round_down(sp, 16) - 8;
+	}
+
+	if (!access_ok(sp, user_stack_pointer(pt_regs) - sp))
+		return -EFAULT;
+
+	if (copy_to_user(USERPTR(sp), tstack->data, tstack->size))
+		return -EFAULT;
+
+	if (tstack->flags & TRAMPFD_SET_SP)
+		user_stack_pointer_set(pt_regs, sp);
+
+	return 0;
+}
+
+/* ---------------------------- Fault Handlers ---------------------------- */
+
+static int trampfd_user_fault(struct trampfd *trampfd,
+			      struct vm_area_struct *vma,
+			      struct pt_regs *pt_regs)
+{
+	char			buf[TRAMPFD_MAX_STACK_SIZE];
+	struct trampfd_regs	*tregs;
+	struct trampfd_stack	*tstack = NULL;
+	unsigned long		addr;
+	size_t			size;
+	int			rc = 0;
+
+	mutex_lock(&trampfd->lock);
+
+	/*
+	 * Execution of the trampoline must start at the offset specified by
+	 * the kernel.
+	 */
+	addr = vma->vm_start + trampfd->map.ioffset;
+	if (addr != pt_regs->ip) {
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	/*
+	 * At a minimum, the user PC register must be specified for a
+	 * user trampoline.
+	 */
+	tregs = trampfd->regs;
+	if (!tregs) {
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	/*
+	 * Set the register context for the trampoline.
+	 */
+	set_regs(pt_regs, tregs);
+
+	if (trampfd->stack) {
+		/*
+		 * Copy the stack context into a local buffer and push stack
+		 * data after dropping the lock.
+		 */
+		size = sizeof(*trampfd->stack) + trampfd->stack->size;
+		tstack = (struct trampfd_stack *) buf;
+		memcpy(tstack, trampfd->stack, size);
+	}
+unlock:
+	mutex_unlock(&trampfd->lock);
+
+	if (!rc && tstack) {
+		mmap_read_unlock(vma->vm_mm);
+		rc = push_data(pt_regs, tstack);
+		mmap_read_lock(vma->vm_mm);
+	}
+	return rc;
+}
+
+/*
+ * Handle it if it is a trampoline fault.
+ */
+bool trampfd_fault(struct vm_area_struct *vma, struct pt_regs *pt_regs)
+{
+	struct trampfd		*trampfd;
+
+	if (!is_trampfd_vma(vma))
+		return false;
+	trampfd = vma->vm_private_data;
+
+	if (trampfd->type == TRAMPFD_USER)
+		return !trampfd_user_fault(trampfd, vma, pt_regs);
+	return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_fault);
+
+/* ------------------------- Arch Initialization ------------------------- */
+
+int trampfd_check_arch(struct trampfd *trampfd)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(trampfd_check_arch);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 1ead568c0101..a1432ee2a1a2 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_recover_from_page_fault()*/
 #include <linux/mm_types.h>
+#include <linux/trampfd.h>		/* trampoline invocation */
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1142,6 +1143,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	struct mm_struct *mm;
 	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
+	unsigned long tflags = X86_PF_INSTR | X86_PF_USER;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1275,6 +1277,15 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 */
 good_area:
 	if (unlikely(access_error(hw_error_code, vma))) {
+		/*
+		 * If it is a user execute fault, it could be a trampoline
+		 * invocation.
+		 */
+		if ((hw_error_code & tflags) == tflags &&
+		    trampfd_fault(vma, regs)) {
+			mmap_read_unlock(mm);
+			return;
+		}
 		bad_area_access_error(regs, hw_error_code, address, vma);
 		return;
 	}
-- 
2.17.1
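Note (illustrative, not part of the patch): push_data() in the x86 patch above computes where the stack data lands and, when TRAMPFD_SET_SP is set, aligns the new stack pointer. The sketch below reproduces only that arithmetic in user space so the resulting alignment can be checked by hand. The function name tramp_sp_x86() and its boolean parameters are hypothetical. The real code also performs access_ok() and copy_to_user(), which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the stack pointer computation in the x86 push_data():
 *   - start below the current sp by size + offset bytes;
 *   - with SET_SP on 64-bit, round down to 16 and subtract 8;
 *   - with SET_SP on 32-bit (compat), make sp + 4 a multiple of 16.
 * Both adjusted forms leave the stack as it would look right after a
 * call instruction has pushed a return address.
 */
static uint64_t tramp_sp_x86(uint64_t sp, uint32_t size, uint32_t offset,
			     bool set_sp, bool compat)
{
	uint64_t new_sp = sp - size - offset;

	if (set_sp) {
		if (compat)
			new_sp = ((new_sp + 4) & ~(uint64_t)15) - 4;
		else
			new_sp = (new_sp & ~(uint64_t)15) - 8;
	}
	return new_sp;
}
```

For a 64-bit caller with sp = 0x7ffd1000 and 24 bytes of data, the unaligned address is 0x7ffd0fe8, rounding down gives 0x7ffd0fe0, and subtracting 8 gives 0x7ffd0fd8. Note that with TRAMPFD_SET_SP the data is copied to the aligned address, so it may sit up to 23 bytes lower than sp - size - offset on 64-bit.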
- * Re: [PATCH v1 2/4] [RFC] x86/trampfd: Provide support for the trampoline file descriptor
  2020-07-28 13:10   ` [PATCH v1 2/4] [RFC] x86/trampfd: Provide support for the trampoline file descriptor madvenka
@ 2020-07-30  9:06     ` Greg KH
  2020-07-30 14:25       ` Madhavan T. Venkataraman
  0 siblings, 1 reply; 64+ messages in thread
From: Greg KH @ 2020-07-30  9:06 UTC (permalink / raw)
  To: madvenka
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86
On Tue, Jul 28, 2020 at 08:10:48AM -0500, madvenka@linux.microsoft.com wrote:
> +EXPORT_SYMBOL_GPL(trampfd_valid_regs);
Why are all of these exported?  I don't see a module user in this
series, or did I miss it somehow?
EXPORT_SYMBOL* is only needed for symbols to be used by modules, not by
code that is built into the kernel.
thanks,
greg k-h
- * Re: [PATCH v1 2/4] [RFC] x86/trampfd: Provide support for the trampoline file descriptor
  2020-07-30  9:06     ` Greg KH
@ 2020-07-30 14:25       ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-30 14:25 UTC (permalink / raw)
  To: Greg KH
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86
Yes. I will fix this.
Thanks.
Madhavan
On 7/30/20 4:06 AM, Greg KH wrote:
> On Tue, Jul 28, 2020 at 08:10:48AM -0500, madvenka@linux.microsoft.com wrote:
>> +EXPORT_SYMBOL_GPL(trampfd_valid_regs);
> Why are all of these exported?  I don't see a module user in this
> series, or did I miss it somehow?
>
> EXPORT_SYMBOL* is only needed for symbols to be used by modules, not by
> code that is built into the kernel.
>
> thanks,
>
> greg k-h
 
 
- * [PATCH v1 3/4] [RFC] arm64/trampfd: Provide support for the trampoline file descriptor
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
  2020-07-28 13:10   ` [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API madvenka
  2020-07-28 13:10   ` [PATCH v1 2/4] [RFC] x86/trampfd: Provide support for the trampoline file descriptor madvenka
@ 2020-07-28 13:10   ` madvenka
  2020-07-28 13:10   ` [PATCH v1 4/4] [RFC] arm/trampfd: " madvenka
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 64+ messages in thread
From: madvenka @ 2020-07-28 13:10 UTC (permalink / raw)
  To: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86,
	madvenka
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
Implement 64-bit ARM support for the trampoline file descriptor.
	- Define architecture specific register names
	- Handle the trampoline invocation page fault
	- Set up the user register context on trampoline invocation
	- Set up the user stack context on trampoline invocation
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 arch/arm64/include/asm/ptrace.h      |   9 +
 arch/arm64/include/asm/unistd.h      |   2 +-
 arch/arm64/include/asm/unistd32.h    |   2 +
 arch/arm64/include/uapi/asm/ptrace.h |  57 ++++++
 arch/arm64/kernel/Makefile           |   2 +
 arch/arm64/kernel/trampfd.c          | 278 +++++++++++++++++++++++++++
 arch/arm64/mm/fault.c                |  15 +-
 7 files changed, 361 insertions(+), 4 deletions(-)
 create mode 100644 arch/arm64/kernel/trampfd.c
diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index 953b6a1ce549..dad6cdbd59c6 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -232,6 +232,15 @@ static inline unsigned long user_stack_pointer(struct pt_regs *regs)
 	return regs->sp;
 }
 
+static inline void user_stack_pointer_set(struct pt_regs *regs,
+					  unsigned long val)
+{
+	if (compat_user_mode(regs))
+		regs->compat_sp = val;
+	else
+		regs->sp = val;
+}
+
 extern int regs_query_register_offset(const char *name);
 extern unsigned long regs_get_kernel_stack_nth(struct pt_regs *regs,
 					       unsigned int n);
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 3b859596840d..b3b2019f8d16 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		440
+#define __NR_compat_syscalls		441
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 6d95d0c8bf2f..821ddcaf9683 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -885,6 +885,8 @@ __SYSCALL(__NR_openat2, sys_openat2)
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 #define __NR_faccessat2 439
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_trampfd_create 440
+__SYSCALL(__NR_trampfd_create, sys_trampfd_create)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/arm64/include/uapi/asm/ptrace.h b/arch/arm64/include/uapi/asm/ptrace.h
index 42cbe34d95ce..f4d1974dd795 100644
--- a/arch/arm64/include/uapi/asm/ptrace.h
+++ b/arch/arm64/include/uapi/asm/ptrace.h
@@ -88,6 +88,63 @@ struct user_pt_regs {
 	__u64		pstate;
 };
 
+/*
+ * These register names are to be used by 32-bit applications.
+ */
+enum reg_32_name {
+	arm_r0,
+	arm_r1,
+	arm_r2,
+	arm_r3,
+	arm_r4,
+	arm_r5,
+	arm_r6,
+	arm_r7,
+	arm_r8,
+	arm_r9,
+	arm_r10,
+	arm_ip,
+	arm_pc,
+	arm_max,
+};
+
+/*
+ * These register names are to be used by 64-bit applications.
+ */
+enum reg_64_name {
+	arm64_r0 = arm_max,
+	arm64_r1,
+	arm64_r2,
+	arm64_r3,
+	arm64_r4,
+	arm64_r5,
+	arm64_r6,
+	arm64_r7,
+	arm64_r8,
+	arm64_r9,
+	arm64_r10,
+	arm64_r11,
+	arm64_r12,
+	arm64_r13,
+	arm64_r14,
+	arm64_r15,
+	arm64_r16,
+	arm64_r17,
+	arm64_r18,
+	arm64_r19,
+	arm64_r20,
+	arm64_r21,
+	arm64_r22,
+	arm64_r23,
+	arm64_r24,
+	arm64_r25,
+	arm64_r26,
+	arm64_r27,
+	arm64_r28,
+	arm64_pc,
+	arm64_max,
+};
+
 struct user_fpsimd_state {
 	__uint128_t	vregs[32];
 	__u32		fpsr;
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index a561cbb91d4d..18d373fb1208 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -71,3 +71,5 @@ extra-y					+= $(head-y) vmlinux.lds
 ifeq ($(CONFIG_DEBUG_EFI),y)
 AFLAGS_head.o += -DVMLINUX_PATH="\"$(realpath $(objtree)/vmlinux)\""
 endif
+
+obj-$(CONFIG_TRAMPFD)			+= trampfd.o
diff --git a/arch/arm64/kernel/trampfd.c b/arch/arm64/kernel/trampfd.c
new file mode 100644
index 000000000000..d79e749e0c30
--- /dev/null
+++ b/arch/arm64/kernel/trampfd.c
@@ -0,0 +1,278 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - ARM64 support.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+
+#include <linux/trampfd.h>
+#include <linux/mm_types.h>
+#include <linux/uaccess.h>
+
+/* ---------------------------- Register Context ---------------------------- */
+
+static inline bool is_compat(void)
+{
+	return is_compat_thread(task_thread_info(current));
+}
+
+static void set_reg_32(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+	switch (name) {
+	case arm_r0:
+	case arm_r1:
+	case arm_r2:
+	case arm_r3:
+	case arm_r4:
+	case arm_r5:
+	case arm_r6:
+	case arm_r7:
+	case arm_r8:
+	case arm_r9:
+	case arm_r10:
+		pt_regs->regs[name] = (__u64)value;
+		break;
+	case arm_ip:
+		pt_regs->regs[arm64_r16 - arm_max] = (__u64)value;
+		break;
+	case arm_pc:
+		pt_regs->pc = (__u64)value;
+		break;
+	default:
+		WARN(1, "%s: Illegal register name %d\n", __func__, name);
+		break;
+	}
+}
+
+static void set_reg_64(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+	switch (name) {
+	case arm64_r0:
+	case arm64_r1:
+	case arm64_r2:
+	case arm64_r3:
+	case arm64_r4:
+	case arm64_r5:
+	case arm64_r6:
+	case arm64_r7:
+	case arm64_r8:
+	case arm64_r9:
+	case arm64_r10:
+	case arm64_r11:
+	case arm64_r12:
+	case arm64_r13:
+	case arm64_r14:
+	case arm64_r15:
+	case arm64_r16:
+	case arm64_r17:
+	case arm64_r18:
+	case arm64_r19:
+	case arm64_r20:
+	case arm64_r21:
+	case arm64_r22:
+	case arm64_r23:
+	case arm64_r24:
+	case arm64_r25:
+	case arm64_r26:
+	case arm64_r27:
+	case arm64_r28:
+		pt_regs->regs[name - arm_max] = (__u64)value;
+		break;
+	case arm64_pc:
+		pt_regs->pc = (__u64)value;
+		break;
+	default:
+		WARN(1, "%s: Illegal register name %d\n", __func__, name);
+		break;
+	}
+}
+
+static void set_regs(struct pt_regs *pt_regs, struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	bool			compat = is_compat();
+
+	for (; reg < reg_end; reg++) {
+		if (compat)
+			set_reg_32(pt_regs, reg->name, reg->value);
+		else
+			set_reg_64(pt_regs, reg->name, reg->value);
+	}
+}
+
+/*
+ * Check if the register names are valid. Check if the user PC has been set.
+ */
+bool trampfd_valid_regs(struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	int			min, max, pc_name;
+	bool			pc_set = false;
+
+	if (is_compat()) {
+		min = 0;
+		pc_name = arm_pc;
+		max = arm_max;
+	} else {
+		min = arm_max;
+		pc_name = arm64_pc;
+		max = arm64_max;
+	}
+
+	for (; reg < reg_end; reg++) {
+		if (reg->name < min || reg->name >= max || reg->reserved)
+			return false;
+		if (reg->name == pc_name && reg->value)
+			pc_set = true;
+	}
+	return pc_set;
+}
+EXPORT_SYMBOL_GPL(trampfd_valid_regs);
+
+/*
+ * Check if the PC specified in a register context is allowed.
+ */
+bool trampfd_allowed_pc(struct trampfd *trampfd, struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	struct trampfd_values	*allowed_pcs = trampfd->allowed_pcs;
+	u64			*allowed_values, pc_value = 0;
+	u32			nvalues, pc_name;
+	int			i;
+
+	if (!allowed_pcs)
+		return true;
+
+	pc_name = is_compat() ? arm_pc : arm64_pc;
+
+	/*
+	 * Find the PC register and its value. If the PC register has been
+	 * specified multiple times, only the last one counts.
+	 */
+	for (; reg < reg_end; reg++) {
+		if (reg->name == pc_name)
+			pc_value = reg->value;
+	}
+
+	allowed_values = allowed_pcs->values;
+	nvalues = allowed_pcs->nvalues;
+
+	for (i = 0; i < nvalues; i++) {
+		if (pc_value == allowed_values[i])
+			return true;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_allowed_pc);
+
+/* ---------------------------- Stack Context ---------------------------- */
+
+static int push_data(struct pt_regs *pt_regs, struct trampfd_stack *tstack)
+{
+	unsigned long	sp;
+
+	sp = user_stack_pointer(pt_regs) - tstack->size - tstack->offset;
+	if (tstack->flags & TRAMPFD_SET_SP)
+		sp = round_down(sp, 16);
+
+	if (!access_ok((void *)sp, user_stack_pointer(pt_regs) - sp))
+		return -EFAULT;
+
+	if (copy_to_user(USERPTR(sp), tstack->data, tstack->size))
+		return -EFAULT;
+
+	if (tstack->flags & TRAMPFD_SET_SP)
+		user_stack_pointer_set(pt_regs, sp);
+
+	return 0;
+}
+
+/* ---------------------------- Fault Handlers ---------------------------- */
+
+static int trampfd_user_fault(struct trampfd *trampfd,
+			      struct vm_area_struct *vma,
+			      struct pt_regs *pt_regs)
+{
+	char			buf[TRAMPFD_MAX_STACK_SIZE];
+	struct trampfd_regs	*tregs;
+	struct trampfd_stack	*tstack = NULL;
+	unsigned long		addr;
+	size_t			size;
+	int			rc = 0;
+
+	mutex_lock(&trampfd->lock);
+
+	/*
+	 * Execution of the trampoline must start at the offset specified by
+	 * the kernel.
+	 */
+	addr = vma->vm_start + trampfd->map.ioffset;
+	if (addr != pt_regs->pc) {
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	/*
+	 * At a minimum, the user PC register must be specified for a
+	 * user trampoline.
+	 */
+	tregs = trampfd->regs;
+	if (!tregs) {
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	/*
+	 * Set the register context for the trampoline.
+	 */
+	set_regs(pt_regs, tregs);
+
+	if (trampfd->stack) {
+		/*
+		 * Copy the stack context into a local buffer and push stack
+		 * data after dropping the lock.
+		 */
+		size = sizeof(*trampfd->stack) + trampfd->stack->size;
+		tstack = (struct trampfd_stack *) buf;
+		memcpy(tstack, trampfd->stack, size);
+	}
+unlock:
+	mutex_unlock(&trampfd->lock);
+
+	if (!rc && tstack) {
+		mmap_read_unlock(vma->vm_mm);
+		rc = push_data(pt_regs, tstack);
+		mmap_read_lock(vma->vm_mm);
+	}
+	return rc;
+}
+
+/*
+ * Handle it if it is a trampoline fault.
+ */
+bool trampfd_fault(struct vm_area_struct *vma, struct pt_regs *pt_regs)
+{
+	struct trampfd		*trampfd;
+
+	if (!is_trampfd_vma(vma))
+		return false;
+	trampfd = vma->vm_private_data;
+
+	if (trampfd->type == TRAMPFD_USER)
+		return !trampfd_user_fault(trampfd, vma, pt_regs);
+	return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_fault);
+
+/* ---------------------------- Miscellaneous ---------------------------- */
+
+int trampfd_check_arch(struct trampfd *trampfd)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(trampfd_check_arch);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 8afb238ff335..6e5e3193919a 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -23,6 +23,7 @@
 #include <linux/perf_event.h>
 #include <linux/preempt.h>
 #include <linux/hugetlb.h>
+#include <linux/trampfd.h>
 
 #include <asm/acpi.h>
 #include <asm/bug.h>
@@ -404,7 +405,8 @@ static void do_bad_area(unsigned long addr, unsigned int esr, struct pt_regs *re
 #define VM_FAULT_BADACCESS	0x020000
 
 static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr,
-			   unsigned int mm_flags, unsigned long vm_flags)
+			   unsigned int mm_flags, unsigned long vm_flags,
+			   struct pt_regs *regs)
 {
 	struct vm_area_struct *vma = find_vma(mm, addr);
 
@@ -426,8 +428,15 @@ static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr,
 	 * Check that the permissions on the VMA allow for the fault which
 	 * occurred.
 	 */
-	if (!(vma->vm_flags & vm_flags))
+	if (!(vma->vm_flags & vm_flags)) {
+		/*
+		 * If it is an execute fault, it could be a trampoline
+		 * invocation.
+		 */
+		if ((vm_flags & VM_EXEC) && trampfd_fault(vma, regs))
+			return 0;
 		return VM_FAULT_BADACCESS;
+	}
 	return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags);
 }
 
@@ -516,7 +525,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 #endif
 	}
 
-	fault = __do_page_fault(mm, addr, mm_flags, vm_flags);
+	fault = __do_page_fault(mm, addr, mm_flags, vm_flags, regs);
 	major |= fault & VM_FAULT_MAJOR;
 
 	/* Quick path to respond to signals */
-- 
2.17.1
- * [PATCH v1 4/4] [RFC] arm/trampfd: Provide support for the trampoline file descriptor
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
                     ` (2 preceding siblings ...)
  2020-07-28 13:10   ` [PATCH v1 3/4] [RFC] arm64/trampfd: " madvenka
@ 2020-07-28 13:10   ` madvenka
  2020-07-28 15:13   ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor David Laight
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 64+ messages in thread
From: madvenka @ 2020-07-28 13:10 UTC (permalink / raw)
  To: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86,
	madvenka
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
Implement 32-bit ARM support for the trampoline file descriptor.
	- Define architecture specific register names
	- Handle the trampoline invocation page fault
	- Set up the user register context on trampoline invocation
	- Set up the user stack context on trampoline invocation
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 arch/arm/include/uapi/asm/ptrace.h |  20 +++
 arch/arm/kernel/Makefile           |   1 +
 arch/arm/kernel/trampfd.c          | 214 +++++++++++++++++++++++++++++
 arch/arm/mm/fault.c                |  12 +-
 arch/arm/tools/syscall.tbl         |   1 +
 5 files changed, 246 insertions(+), 2 deletions(-)
 create mode 100644 arch/arm/kernel/trampfd.c
diff --git a/arch/arm/include/uapi/asm/ptrace.h b/arch/arm/include/uapi/asm/ptrace.h
index e61c65b4018d..47b1c5e2f32c 100644
--- a/arch/arm/include/uapi/asm/ptrace.h
+++ b/arch/arm/include/uapi/asm/ptrace.h
@@ -151,6 +151,26 @@ struct pt_regs {
 #define ARM_r0		uregs[0]
 #define ARM_ORIG_r0	uregs[17]
 
+/*
+ * These register names are to be used by 32-bit applications.
+ */
+enum reg_32_name {
+	arm_r0,
+	arm_r1,
+	arm_r2,
+	arm_r3,
+	arm_r4,
+	arm_r5,
+	arm_r6,
+	arm_r7,
+	arm_r8,
+	arm_r9,
+	arm_r10,
+	arm_ip,
+	arm_pc,
+	arm_max,
+};
+
 /*
  * The size of the user-visible VFP state as seen by PTRACE_GET/SETVFPREGS
  * and core dumps.
diff --git a/arch/arm/kernel/Makefile b/arch/arm/kernel/Makefile
index 89e5d864e923..652c54c2f19a 100644
--- a/arch/arm/kernel/Makefile
+++ b/arch/arm/kernel/Makefile
@@ -105,5 +105,6 @@ obj-$(CONFIG_SMP)		+= psci_smp.o
 endif
 
 obj-$(CONFIG_HAVE_ARM_SMCCC)	+= smccc-call.o
+obj-$(CONFIG_TRAMPFD)		+= trampfd.o
 
 extra-y := $(head-y) vmlinux.lds
diff --git a/arch/arm/kernel/trampfd.c b/arch/arm/kernel/trampfd.c
new file mode 100644
index 000000000000..50fc5706e85b
--- /dev/null
+++ b/arch/arm/kernel/trampfd.c
@@ -0,0 +1,214 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - ARM support.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+
+#include <linux/trampfd.h>
+#include <linux/mm_types.h>
+#include <linux/uaccess.h>
+
+/* ---------------------------- Register Context ---------------------------- */
+
+static void set_reg(long *uregs, u32 name, u64 value)
+{
+	switch (name) {
+	case arm_r0:
+	case arm_r1:
+	case arm_r2:
+	case arm_r3:
+	case arm_r4:
+	case arm_r5:
+	case arm_r6:
+	case arm_r7:
+	case arm_r8:
+	case arm_r9:
+	case arm_r10:
+		uregs[name] = (__u64)value;
+		break;
+	case arm_ip:
+		ARM_ip = (__u64)value;
+		break;
+	case arm_pc:
+		ARM_pc = (__u64)value;
+		break;
+	default:
+		WARN(1, "%s: Illegal register name %d\n", __func__, name);
+		break;
+	}
+}
+
+static void set_regs(long *uregs, struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+
+	for (; reg < reg_end; reg++)
+		set_reg(uregs, reg->name, reg->value);
+}
+
+/*
+ * Check if the register names are valid. Check if the user PC has been set.
+ */
+bool trampfd_valid_regs(struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	bool			pc_set = false;
+
+	for (; reg < reg_end; reg++) {
+		if (reg->name >= arm_max || reg->reserved)
+			return false;
+		if (reg->name == arm_pc && reg->value)
+			pc_set = true;
+	}
+	return pc_set;
+}
+EXPORT_SYMBOL_GPL(trampfd_valid_regs);
+
+/*
+ * Check if the PC specified in a register context is allowed.
+ */
+bool trampfd_allowed_pc(struct trampfd *trampfd, struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	struct trampfd_values	*allowed_pcs = trampfd->allowed_pcs;
+	u64			*allowed_values, pc_value = 0;
+	u32			nvalues, pc_name;
+	int			i;
+
+	if (!allowed_pcs)
+		return true;
+
+	pc_name = arm_pc;
+
+	/*
+	 * Find the PC register and its value. If the PC register has been
+	 * specified multiple times, only the last one counts.
+	 */
+	for (; reg < reg_end; reg++) {
+		if (reg->name == pc_name)
+			pc_value = reg->value;
+	}
+
+	allowed_values = allowed_pcs->values;
+	nvalues = allowed_pcs->nvalues;
+
+	for (i = 0; i < nvalues; i++) {
+		if (pc_value == allowed_values[i])
+			return true;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_allowed_pc);
+
+/* ---------------------------- Stack Context ---------------------------- */
+
+static int push_data(long *uregs, struct trampfd_stack *tstack)
+{
+	unsigned long	sp;
+
+	sp = ARM_sp - tstack->size - tstack->offset;
+	if (tstack->flags & TRAMPFD_SET_SP)
+		sp &= ~7;
+
+	if (!access_ok(sp, ARM_sp - sp))
+		return -EFAULT;
+
+	if (copy_to_user(USERPTR(sp), tstack->data, tstack->size))
+		return -EFAULT;
+
+	if (tstack->flags & TRAMPFD_SET_SP)
+		ARM_sp = sp;
+	return 0;
+}
+
+/* ---------------------------- Fault Handlers ---------------------------- */
+
+static int trampfd_user_fault(struct trampfd *trampfd,
+			      struct vm_area_struct *vma,
+			      long *uregs)
+{
+	char			buf[TRAMPFD_MAX_STACK_SIZE];
+	struct trampfd_regs	*tregs;
+	struct trampfd_stack	*tstack = NULL;
+	unsigned long		addr;
+	size_t			size;
+	int			rc = 0;
+
+	mutex_lock(&trampfd->lock);
+
+	/*
+	 * Execution of the trampoline must start at the offset specified by
+	 * the kernel.
+	 */
+	addr = vma->vm_start + trampfd->map.ioffset;
+	if (addr != ARM_pc) {
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	/*
+	 * At a minimum, the user PC register must be specified for a
+	 * user trampoline.
+	 */
+	tregs = trampfd->regs;
+	if (!tregs) {
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	/*
+	 * Set the register context for the trampoline.
+	 */
+	set_regs(uregs, tregs);
+
+	if (trampfd->stack) {
+		/*
+		 * Copy the stack context into a local buffer and push stack
+		 * data after dropping the lock.
+		 */
+		size = sizeof(*trampfd->stack) + trampfd->stack->size;
+		tstack = (struct trampfd_stack *) buf;
+		memcpy(tstack, trampfd->stack, size);
+	}
+unlock:
+	mutex_unlock(&trampfd->lock);
+
+	if (!rc && tstack) {
+		mmap_read_unlock(vma->vm_mm);
+		rc = push_data(uregs, tstack);
+		mmap_read_lock(vma->vm_mm);
+	}
+	return rc;
+}
+
+/*
+ * Handle it if it is a trampoline fault.
+ */
+bool trampfd_fault(struct vm_area_struct *vma, struct pt_regs *pt_regs)
+{
+	struct trampfd		*trampfd;
+	unsigned long		*uregs = pt_regs->uregs;
+
+	if (!is_trampfd_vma(vma))
+		return false;
+	trampfd = vma->vm_private_data;
+
+	if (trampfd->type == TRAMPFD_USER)
+		return !trampfd_user_fault(trampfd, vma, uregs);
+	return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_fault);
+
+/* ---------------------------- Miscellaneous ---------------------------- */
+
+int trampfd_check_arch(struct trampfd *trampfd)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(trampfd_check_arch);
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index c6550eddfce1..21a81d19336b 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -17,6 +17,7 @@
 #include <linux/sched/debug.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/trampfd.h>
 
 #include <asm/system_misc.h>
 #include <asm/system_info.h>
@@ -202,7 +203,8 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)
 
 static vm_fault_t __kprobes
 __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
-		unsigned int flags, struct task_struct *tsk)
+		unsigned int flags, struct task_struct *tsk,
+		struct pt_regs *regs)
 {
 	struct vm_area_struct *vma;
 	vm_fault_t fault;
@@ -220,6 +222,12 @@ __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
 	 */
 good_area:
 	if (access_error(fsr, vma)) {
+		/*
+		 * If it is an execute fault, it could be a trampoline
+		 * invocation.
+		 */
+		if ((fsr & FSR_LNX_PF) && trampfd_fault(vma, regs))
+			return 0;
 		fault = VM_FAULT_BADACCESS;
 		goto out;
 	}
@@ -290,7 +298,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 #endif
 	}
 
-	fault = __do_page_fault(mm, addr, fsr, flags, tsk);
+	fault = __do_page_fault(mm, addr, fsr, flags, tsk, regs);
 
 	/* If we need to retry but a fatal signal is pending, handle the
 	 * signal first. We do not need to release the mmap_lock because
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index d5cae5ffede0..88cf4c45069a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -452,3 +452,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	trampfd_create			sys_trampfd_create
-- 
2.17.1
- * RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
                     ` (3 preceding siblings ...)
  2020-07-28 13:10   ` [PATCH v1 4/4] [RFC] arm/trampfd: " madvenka
@ 2020-07-28 15:13   ` David Laight
  2020-07-28 16:32     ` Madhavan T. Venkataraman
  2020-07-28 16:05   ` Casey Schaufler
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 64+ messages in thread
From: David Laight @ 2020-07-28 15:13 UTC (permalink / raw)
  To: 'madvenka@linux.microsoft.com',
	kernel-hardening@lists.openwall.com, linux-api@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-fsdevel@vger.kernel.org, linux-integrity@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, oleg@redhat.com,
	x86@kernel.org
From:  madvenka@linux.microsoft.com
> Sent: 28 July 2020 14:11
...
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
Isn't the performance of this going to be horrid?
If you don't care that much about performance the fixup can
all be done in userspace within the fault signal handler.
Since whatever you do needs the application changed why
not change the implementation of nested functions to not
need on-stack executable trampolines.
I can think of other alternatives that don't need much more
than an array of 'push constant; jump trampoline' instructions
be created (all jump to the same place).
You might want something to create an executable page of such
instructions.
	David
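To illustrate the first suggestion above (nested functions that do not need on-stack executable trampolines), a closure can be carried around as plain data: a pointer to ordinary compiled code plus a pointer to the captured environment. This is only a sketch under that assumption. The names struct closure, struct counter_env, add_step() and call_closure() are made up for the example and are not taken from GCC, libffi, or this patchset.

```c
/*
 * Hypothetical illustration: a closure represented as data (code pointer
 * plus environment pointer) instead of machine code generated at runtime.
 */
struct closure {
	int (*fn)(void *env, int arg);	/* ordinary, statically compiled code */
	void *env;			/* variables captured from the outer scope */
};

/* Stands in for the local variables of an enclosing function. */
struct counter_env {
	int step;
	int total;
};

/* Plays the role of a nested function that uses `step` and `total`. */
static int add_step(void *env, int arg)
{
	struct counter_env *e = env;

	e->total += e->step * arg;
	return e->total;
}

/* The environment is passed explicitly, so nothing writable is executed. */
static int call_closure(struct closure *c, int arg)
{
	return c->fn(c->env, arg);
}
```

A caller would build `struct closure c = { .fn = add_step, .env = &e };` and invoke `call_closure(&c, 2)`. The cost of this design is that every call site must know it is calling a closure rather than a bare function pointer, which is exactly the ABI change that existing trampolines avoid.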
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 15:13   ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor David Laight
@ 2020-07-28 16:32     ` Madhavan T. Venkataraman
  2020-07-28 17:16       ` Andy Lutomirski
  0 siblings, 1 reply; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-28 16:32 UTC (permalink / raw)
  To: David Laight, kernel-hardening@lists.openwall.com,
	linux-api@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-fsdevel@vger.kernel.org, linux-integrity@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, oleg@redhat.com,
	x86@kernel.org
Thanks. See inline..
On 7/28/20 10:13 AM, David Laight wrote:
> From:  madvenka@linux.microsoft.com
>> Sent: 28 July 2020 14:11
> ...
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
> Isn't the performance of this going to be horrid?
It takes about the same amount of time as getpid(). So, it is
one quick trip into the kernel. I expect that applications will
typically not care about this extra overhead as long as
they are able to run.
But I agree that if there is an application that cannot tolerate
this extra overhead, then it is an issue. See below for further
discussion.
In the libffi changes I have included in the cover letter, I have
done it in such a way that trampfd is chosen when current
security settings don't allow other methods such as
loading trampoline code into a file and mapping it. In this
case, the application can at least run with trampfd.
>
> If you don't care that much about performance the fixup can
> all be done in userspace within the fault signal handler.
I do care about performance.
This is a framework to address trampolines. In this initial
work, I want to establish one basic way for things to work.
In the future, trampfd can be enhanced for performance.
For instance, it is easy for an architecture to generate
the exact instructions required to load specified registers,
push specified values on the stack and jump to a target
PC. The kernel can map a page with the generated code
with execute permissions. In this case, the performance
issue goes away.
> Since whatever you do needs the application changed why
> not change the implementation of nested functions to not
> need on-stack executable trampolines.
I kinda agree with your suggestion.
But it is up to the GCC folks to change its implementation.
I am trying to provide a way for their existing implementation
to work in a more secure way.
> I can think of other alternatives that don't need much more
> than an array of 'push constant; jump trampoline' instructions
> be created (all jump to the same place).
>
> You might want something to create an executable page of such
> instructions.
Agreed. And that can be done within this framework as
I have mentioned above.
But it is not just this trampoline type that I have implemented
in this patchset. In the future, other types can be implemented
and other contexts can be defined. Basically, the approach is
for the user to supply a recipe to the kernel and leave it up to
the kernel to do it in the best way possible. I am hoping that
other forms of dynamic code can be addressed in the future
using the same framework.
*Purely as a hypothetical example*, a user can supply
instructions in a language such as BPF that the kernel
understands and have the kernel arrange for that to
be executed in user context.
Madhavan
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 16:32     ` Madhavan T. Venkataraman
@ 2020-07-28 17:16       ` Andy Lutomirski
  2020-07-28 18:52         ` Madhavan T. Venkataraman
       [not found]         ` <81d744c0-923e-35ad-6063-8b186f6a153c@linux.microsoft.com>
  0 siblings, 2 replies; 64+ messages in thread
From: Andy Lutomirski @ 2020-07-28 17:16 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: David Laight, kernel-hardening@lists.openwall.com,
	linux-api@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-fsdevel@vger.kernel.org, linux-integrity@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, oleg@redhat.com,
	x86@kernel.org
On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
> Thanks. See inline..
>
> On 7/28/20 10:13 AM, David Laight wrote:
> > From:  madvenka@linux.microsoft.com
> >> Sent: 28 July 2020 14:11
> > ...
> >> The kernel creates the trampoline mapping without any permissions. When
> >> the trampoline is executed by user code, a page fault happens and the
> >> kernel gets control. The kernel recognizes that this is a trampoline
> >> invocation. It sets up the user registers based on the specified
> >> register context, and/or pushes values on the user stack based on the
> >> specified stack context, and sets the user PC to the requested target
> >> PC. When the kernel returns, execution continues at the target PC.
> >> So, the kernel does the work of the trampoline on behalf of the
> >> application.
> > Isn't the performance of this going to be horrid?
>
> It takes about the same amount of time as getpid(). So, it is
> one quick trip into the kernel. I expect that applications will
> typically not care about this extra overhead as long as
> they are able to run.
What did you test this on?  A page fault on any modern x86_64 system
is much, much, much, much slower than a syscall.
--Andy
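One rough way to get a baseline for this comparison is to time a trivial system call from userspace. The sketch below assumes Linux with a libc that provides syscall(). The helper names now_ns() and ns_per_getpid() are hypothetical, and the absolute numbers will vary with the CPU, enabled mitigations, and whether the test runs in a guest.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average wall-clock cost, in nanoseconds, of one raw getpid system call. */
static double ns_per_getpid(unsigned long iters)
{
	uint64_t start, end;
	unsigned long i;

	start = now_ns();
	for (i = 0; i < iters; i++)
		syscall(SYS_getpid);	/* raw syscall, bypasses any libc caching */
	end = now_ns();

	return (double)(end - start) / (double)iters;
}
```

Calling `ns_per_getpid(1000000)` gives the per-call figure to compare against the time per trampoline invocation measured the same way.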
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:16       ` Andy Lutomirski
@ 2020-07-28 18:52         ` Madhavan T. Venkataraman
  2020-07-29  8:36           ` David Laight
       [not found]         ` <81d744c0-923e-35ad-6063-8b186f6a153c@linux.microsoft.com>
  1 sibling, 1 reply; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-28 18:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Laight, kernel-hardening@lists.openwall.com,
	linux-api@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-fsdevel@vger.kernel.org, linux-integrity@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, oleg@redhat.com,
	x86@kernel.org
On 7/28/20 12:16 PM, Andy Lutomirski wrote:
> On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>> Thanks. See inline..
>>
>> On 7/28/20 10:13 AM, David Laight wrote:
>>> From:  madvenka@linux.microsoft.com
>>>> Sent: 28 July 2020 14:11
>>> ...
>>>> The kernel creates the trampoline mapping without any permissions. When
>>>> the trampoline is executed by user code, a page fault happens and the
>>>> kernel gets control. The kernel recognizes that this is a trampoline
>>>> invocation. It sets up the user registers based on the specified
>>>> register context, and/or pushes values on the user stack based on the
>>>> specified stack context, and sets the user PC to the requested target
>>>> PC. When the kernel returns, execution continues at the target PC.
>>>> So, the kernel does the work of the trampoline on behalf of the
>>>> application.
>>> Isn't the performance of this going to be horrid?
>> It takes about the same amount of time as getpid(). So, it is
>> one quick trip into the kernel. I expect that applications will
>> typically not care about this extra overhead as long as
>> they are able to run.
> What did you test this on?  A page fault on any modern x86_64 system
> is much, much, much, much slower than a syscall.
I sent a response to this. But the mail was returned to me.
I am resending.
I tested it on a KVM guest running Ubuntu. So, when you say that a
page fault is much slower, do you mean a regular page fault that is handled
through the VM layer? Here is the relevant code in do_user_addr_fault():
        if (unlikely(access_error(hw_error_code, vma))) {
                /*
                 * If it is a user execute fault, it could be a trampoline
                 * invocation.
                 */
                if ((hw_error_code & tflags) == tflags &&
                    trampfd_fault(vma, regs)) {
                        up_read(&mm->mmap_sem);
                        return;
                }
                bad_area_access_error(regs, hw_error_code, address, vma);
                return;
        }
        ...
        fault = handle_mm_fault(vma, address, flags);
trampfd faults are instruction faults that go through a different code path than
the one that calls handle_mm_fault(). Perhaps, it is the handle_mm_fault() that
is time consuming. Could you clarify?
Thanks.
Madhavan
- * RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 18:52         ` Madhavan T. Venkataraman
@ 2020-07-29  8:36           ` David Laight
  2020-07-29 17:55             ` Madhavan T. Venkataraman
  0 siblings, 1 reply; 64+ messages in thread
From: David Laight @ 2020-07-29  8:36 UTC (permalink / raw)
  To: 'Madhavan T. Venkataraman', Andy Lutomirski
  Cc: kernel-hardening@lists.openwall.com, linux-api@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-fsdevel@vger.kernel.org, linux-integrity@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, oleg@redhat.com,
	x86@kernel.org
From: Madhavan T. Venkataraman
> Sent: 28 July 2020 19:52
...
> trampfd faults are instruction faults that go through a different code path than
> the one that calls handle_mm_fault(). Perhaps, it is the handle_mm_fault() that
> is time consuming. Could you clarify?
Given that the expectation is a few instructions in userspace
(eg to pick up the original arguments for a nested call)
the (probable) thousands of clocks taken by entering the
kernel (especially with page table separation) is a massive
delta.
If entering the kernel were cheap no one would have added
the vDSO functions for getting the time of day.
	David
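The time-of-day point can be checked from userspace by timing the same clock read through the C library (which on most Linux configurations is served from the vDSO without entering the kernel) and through a raw system call. This is a sketch assuming Linux and glibc-style syscall(). The helper names ts_to_ns(), ns_per_libc_clock_gettime() and ns_per_raw_clock_gettime() are hypothetical, and on systems where the vDSO path is unavailable the two figures may come out close to each other.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static uint64_t ts_to_ns(const struct timespec *ts)
{
	return (uint64_t)ts->tv_sec * 1000000000ull + (uint64_t)ts->tv_nsec;
}

/* Average ns per clock_gettime() via libc (normally the vDSO fast path). */
static double ns_per_libc_clock_gettime(unsigned long iters)
{
	struct timespec start, end, ts;
	unsigned long i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iters; i++)
		clock_gettime(CLOCK_MONOTONIC, &ts);
	clock_gettime(CLOCK_MONOTONIC, &end);

	return (double)(ts_to_ns(&end) - ts_to_ns(&start)) / (double)iters;
}

/* Average ns per clock_gettime issued as a real system call. */
static double ns_per_raw_clock_gettime(unsigned long iters)
{
	struct timespec start, end, ts;
	unsigned long i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iters; i++)
		syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
	clock_gettime(CLOCK_MONOTONIC, &end);

	return (double)(ts_to_ns(&end) - ts_to_ns(&start)) / (double)iters;
}
```

The difference between the two results approximates the kernel entry and exit cost that a kernel-assisted trampoline pays on every invocation.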
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-29  8:36           ` David Laight
@ 2020-07-29 17:55             ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-29 17:55 UTC (permalink / raw)
  To: David Laight, Andy Lutomirski
  Cc: kernel-hardening@lists.openwall.com, linux-api@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-fsdevel@vger.kernel.org, linux-integrity@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, oleg@redhat.com,
	x86@kernel.org
On 7/29/20 3:36 AM, David Laight wrote:
> From: Madhavan T. Venkataraman
>> Sent: 28 July 2020 19:52
> ...
>> trampfd faults are instruction faults that go through a different code path than
>> the one that calls handle_mm_fault(). Perhaps, it is the handle_mm_fault() that
>> is time consuming. Could you clarify?
> Given that the expectation is a few instructions in userspace
> (eg to pick up the original arguments for a nested call)
> the (probable) thousands of clocks taken by entering the
> kernel (especially with page table separation) is a massive
> delta.
>
> If entering the kernel were cheap no one would have added
> the vDSO functions for getting the time of day.
I hear you. BTW, I did not say that the overhead was trivial.
I only said that in most cases, applications may not mind that
extra overhead.
However, since multiple people have raised that as an issue,
I will address it. I mentioned before that the kernel can actually
supply the code page that sets the context and jumps to
a PC and map it so the performance issue can be addressed.
I was planning to do that as a future enhancement.
If there is a consensus that I must address it immediately, I
could do that.
I will continue this discussion in my reply to Andy's email. Let
us pick it up from there.
Thanks.
Madhavan
- [parent not found: <81d744c0-923e-35ad-6063-8b186f6a153c@linux.microsoft.com>] 
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
       [not found]         ` <81d744c0-923e-35ad-6063-8b186f6a153c@linux.microsoft.com>
@ 2020-07-29  5:16           ` Andy Lutomirski
  0 siblings, 0 replies; 64+ messages in thread
From: Andy Lutomirski @ 2020-07-29  5:16 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: Andy Lutomirski, David Laight,
	kernel-hardening@lists.openwall.com, linux-api@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-fsdevel@vger.kernel.org, linux-integrity@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, oleg@redhat.com,
	x86@kernel.org
On Tue, Jul 28, 2020 at 10:40 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
>
>
> On 7/28/20 12:16 PM, Andy Lutomirski wrote:
>
> On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>
> Thanks. See inline..
>
> On 7/28/20 10:13 AM, David Laight wrote:
>
> From:  madvenka@linux.microsoft.com
>
> Sent: 28 July 2020 14:11
>
> ...
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
>
> Isn't the performance of this going to be horrid?
>
> It takes about the same amount of time as getpid(). So, it is
> one quick trip into the kernel. I expect that applications will
> typically not care about this extra overhead as long as
> they are able to run.
>
> What did you test this on?  A page fault on any modern x86_64 system
> is much, much, much, much slower than a syscall.
>
>
> I tested it on a KVM guest running Ubuntu. So, when you say
> that a page fault is much slower, do you mean a regular page
> fault that is handled through the VM layer? Here is the relevant code
> in do_user_addr_fault():
I mean that x86 CPUs have reasonably fast SYSCALL and SYSRET instructions
(the former is used for 64-bit system calls on Linux and the latter is
mostly used to return from system calls), but hardware page fault
delivery and IRET (used to return from page faults) are very slow.
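A userspace program cannot exercise the trampfd fault path without the patchset, but the cost of fault delivery and return can be approximated by repeatedly faulting in an anonymous page. The sketch below is an assumption-laden stand-in, not a measurement of trampfd itself: each iteration takes one ordinary minor fault (which does go through handle_mm_fault()) and also pays for one madvise() system call, so it overstates the pure fault cost. The helper names now_ns() and ns_per_fault_cycle() are hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Average ns for one cycle of "touch page (minor fault) + MADV_DONTNEED".
 * Returns a negative value if the mapping cannot be created.
 */
static double ns_per_fault_cycle(unsigned long iters)
{
	long page = sysconf(_SC_PAGESIZE);
	volatile char *p;
	uint64_t start, end;
	unsigned long i;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1.0;

	start = now_ns();
	for (i = 0; i < iters; i++) {
		p[0] = 1;	/* page is not present: takes a page fault */
		madvise((void *)p, (size_t)page, MADV_DONTNEED); /* drop it */
	}
	end = now_ns();

	munmap((void *)p, (size_t)page);
	return (double)(end - start) / (double)iters;
}
```

Subtracting a separately measured system call cost from `ns_per_fault_cycle(100000)` gives a crude estimate of what a single fault round trip costs on the machine under test.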
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
                     ` (4 preceding siblings ...)
  2020-07-28 15:13   ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor David Laight
@ 2020-07-28 16:05   ` Casey Schaufler
  2020-07-28 16:49     ` Madhavan T. Venkataraman
  2020-07-28 17:05     ` James Morris
  2020-07-28 17:31   ` Andy Lutomirski
  2020-07-31 18:09   ` Mark Rutland
  7 siblings, 2 replies; 64+ messages in thread
From: Casey Schaufler @ 2020-07-28 16:05 UTC (permalink / raw)
  To: madvenka, kernel-hardening, linux-api, linux-arm-kernel,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86
On 7/28/2020 6:10 AM, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>
> Introduction
> ------------
>
> Trampolines are used in many different user applications. Trampoline
> code is often generated at runtime. Trampoline code can also just be a
> pre-defined sequence of machine instructions in a data buffer.
>
> Trampoline code is placed either in a data page or in a stack page. In
> order to execute a trampoline, the page it resides in needs to be mapped
> with execute permissions. Writable pages with execute permissions provide
> an attack surface for hackers. Attackers can use this to inject malicious
> code, modify existing code or do other harm.
>
> To mitigate this, LSMs such as SELinux may not allow pages to have both
> write and execute permissions. This prevents trampolines from executing
> and blocks applications that use trampolines. To allow genuine applications
> to run, exceptions have to be made for them (by setting execmem, etc).
> In this case, the attack surface is just the pages of such applications.
>
> An application that is not allowed to have writable executable pages
> may try to load trampoline code into a file and map the file with execute
> permissions. In this case, the attack surface is just the buffer that
> contains trampoline code. However, a successful exploit may provide the
> hacker with means to load his own code in a file, map it and execute it.
>
> LSMs (such as the IPE proposal [1]) may allow only properly signed object
> files to be mapped with execute permissions. This will prevent trampoline
> files from being mapped. Again, exceptions have to be made for genuine
> applications.
>
> We need a way to execute trampolines without making security exceptions
> where possible and to reduce the attack surface even further.
>
> Examples of trampolines
> -----------------------
>
> libffi (A Portable Foreign Function Interface Library):
>
> libffi allows a user to define functions with an arbitrary list of
> arguments and return value through a feature called "Closures".
> Closures use trampolines to jump to ABI handlers that handle calling
> conventions and call a target function. libffi is used by a lot
> of different applications. To name a few:
>
> 	- Python
> 	- Java
> 	- Javascript
> 	- Ruby FFI
> 	- Lisp
> 	- Objective C
>
> GCC nested functions:
>
> GCC has traditionally used trampolines for implementing nested
> functions. The trampoline is placed on the user stack. So, the stack
> needs to be executable.
>
> Currently available solution
> ----------------------------
>
> One solution that has been proposed to allow trampolines to be executed
> without making security exceptions is Trampoline Emulation. See:
>
> https://pax.grsecurity.net/docs/emutramp.txt
>
> In this solution, the kernel recognizes certain sequences of instructions
> as "well-known" trampolines. When such a trampoline is executed, a page
> fault happens because the trampoline page does not have execute permission.
> The kernel recognizes the trampoline and emulates it. Basically, the
> kernel does the work of the trampoline on behalf of the application.
What prevents a malicious process from using the "well-known" trampoline
to its own purposes? I expect it is obvious, but I'm not seeing it. Old
eyes, I suppose.
> Here, the attack surface is the buffer that contains the trampoline.
> The attack surface is narrower than before. A hacker may still be able to
> modify what gets loaded in the registers or modify the target PC to point
> to arbitrary locations.
>
> Currently, the emulated trampolines are the ones used in libffi and GCC
> nested functions. To my knowledge, only X86 is supported at this time.
>
> As noted in emutramp.txt, this is not a generic solution. For every new
> trampoline that needs to be supported, new instruction sequences need to
> be recognized by the kernel and emulated. And this has to be done for
> every architecture that needs to be supported.
>
> emutramp.txt notes the following:
>
> "... the real solution is not in emulation but by designing a kernel API
> for runtime code generation and modifying userland to make use of it."
>
> Trampoline File Descriptor (trampfd)
> ------------------------------------
>
> I am proposing a kernel API using anonymous file descriptors that
> can be used to create and execute trampolines with the help of the
> kernel. In this solution also, the kernel does the work of the trampoline.
> The API is described in patch 1/4 of this patchset. I provide a
> summary here:
>
> Trampolines commonly execute the following sequence:
>
> 	- Load some values in some registers and/or
> 	- Push some values on the stack
> 	- Jump to a target PC
>
> libffi and GCC nested function trampolines fit into this model.
>
> Using the kernel API, applications and libraries can:
>
> 	- Create a trampoline object
> 	- Associate a register context with the trampoline (including
> 	  a target PC)
> 	- Associate a stack context with the trampoline
> 	- Map the trampoline into a process address space
> 	- Execute the trampoline by executing at the trampoline address
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
>
> In this case, the attack surface is the context buffer. A hacker may
> attack an application with a vulnerability and may be able to modify the
> context buffer. So, when the register or stack context is set for
> a trampoline, the values may have been tampered with. From an attack
> surface perspective, this is similar to Trampoline Emulation. But
> with trampfd, user code can retrieve a trampoline's context from the
> kernel and add defensive checks to see if the context has been
> tampered with.
>
> As for the target PC, trampfd implements a measure called the
> "Allowed PCs" context (see Advantages) to prevent a hacker from making
> the target PC point to arbitrary locations. So, the attack surface is
> narrower than Trampoline Emulation.
>
> Advantages of the Trampoline File Descriptor approach
> -----------------------------------------------------
>
> - trampfd is customizable. The user can specify any combination of
>   allowed register name-value pairs in the register context and the kernel
>   will set it up accordingly. This allows different user trampolines to be
>   converted to use trampfd.
>
> - trampfd allows a stack context to be set up so that trampolines that
>   need to push values on the user stack can do that.
>
> - The initial work is targeted for X86 and ARM. But the implementation
>   leverages small portions of existing signal delivery code. Specifically,
>   it uses pt_regs for setting up user registers and copy_to_user()
>   to push values on the stack. So, this can be very easily ported to other
>   architectures.
>
> - trampfd provides a basic framework. In the future, new trampoline types
>   can be implemented, new contexts can be defined, and additional rules
>   can be implemented for security purposes.
>
> - For instance, trampfd defines an "Allowed PCs" context in this initial
>   work. As an example, libffi can create a read-only array of all ABI
>   handlers for an architecture at build time. This array can be used to
>   set the list of allowed PCs for a trampoline. This will mean that a hacker
>   cannot hack the PC part of the register context and make it point to
>   arbitrary locations.
>
> - An SELinux setting called "exectramp" can be implemented along the
>   lines of "execmem", "execstack" and "execheap" to selectively allow the
>   use of trampolines on a per application basis.
>
> - User code can add defensive checks in the code before invoking a
>   trampoline to make sure that a hacker has not modified the context data.
>   It can do this by getting the trampoline context from the kernel and
>   double checking it.
>
> - In the future, if the kernel can be enhanced to use a safe code
>   generation component, that code can be placed in the trampoline mapping
>   pages. Then, the trampoline invocation does not have to incur a trip
>   into the kernel.
>
> - Also, if the kernel can be enhanced to use a safe code generation
>   component, other forms of dynamic code such as JIT code can be
>   addressed by the trampfd framework.
>
> - Trampolines can be shared across processes which can give rise to
>   interesting uses in the future.
>
> - Trampfd can be used for other purposes to extend the kernel's
>   functionality.
>
> libffi
> ------
>
> I have implemented my solution for libffi and provided the changes for
> X86 and ARM, 32-bit and 64-bit. Here is the reference patch:
>
> http://linux.microsoft.com/~madvenka/libffi/libffi.txt
>
> If the trampfd patchset gets accepted, I will send the libffi changes
> to the maintainers for a review. BTW, I have also successfully executed
> the libffi self tests.
>
> Work that is pending
> --------------------
>
> - I am working on implementing an SELinux setting called "exectramp"
>   similar to "execmem" to allow the use of trampfd on a per application
>   basis.
You could make a separate LSM to do these checks instead of limiting
it to SELinux. Your use case, your call, of course.
>
> - I have a comprehensive test program to test the kernel API. I am
>   working on adding it to selftests.
>
> References
> ----------
>
> [1] https://microsoft.github.io/ipe/
> ---
> Madhavan T. Venkataraman (4):
>   fs/trampfd: Implement the trampoline file descriptor API
>   x86/trampfd: Support for the trampoline file descriptor
>   arm64/trampfd: Support for the trampoline file descriptor
>   arm/trampfd: Support for the trampoline file descriptor
>
>  arch/arm/include/uapi/asm/ptrace.h     |  20 ++
>  arch/arm/kernel/Makefile               |   1 +
>  arch/arm/kernel/trampfd.c              | 214 +++++++++++++++++
>  arch/arm/mm/fault.c                    |  12 +-
>  arch/arm/tools/syscall.tbl             |   1 +
>  arch/arm64/include/asm/ptrace.h        |   9 +
>  arch/arm64/include/asm/unistd.h        |   2 +-
>  arch/arm64/include/asm/unistd32.h      |   2 +
>  arch/arm64/include/uapi/asm/ptrace.h   |  57 +++++
>  arch/arm64/kernel/Makefile             |   2 +
>  arch/arm64/kernel/trampfd.c            | 278 ++++++++++++++++++++++
>  arch/arm64/mm/fault.c                  |  15 +-
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  arch/x86/include/uapi/asm/ptrace.h     |  38 +++
>  arch/x86/kernel/Makefile               |   2 +
>  arch/x86/kernel/trampfd.c              | 313 +++++++++++++++++++++++++
>  arch/x86/mm/fault.c                    |  11 +
>  fs/Makefile                            |   1 +
>  fs/trampfd/Makefile                    |   6 +
>  fs/trampfd/trampfd_data.c              |  43 ++++
>  fs/trampfd/trampfd_fops.c              | 131 +++++++++++
>  fs/trampfd/trampfd_map.c               |  78 ++++++
>  fs/trampfd/trampfd_pcs.c               |  95 ++++++++
>  fs/trampfd/trampfd_regs.c              | 137 +++++++++++
>  fs/trampfd/trampfd_stack.c             | 131 +++++++++++
>  fs/trampfd/trampfd_stubs.c             |  41 ++++
>  fs/trampfd/trampfd_syscall.c           |  92 ++++++++
>  include/linux/syscalls.h               |   3 +
>  include/linux/trampfd.h                |  82 +++++++
>  include/uapi/asm-generic/unistd.h      |   4 +-
>  include/uapi/linux/trampfd.h           | 171 ++++++++++++++
>  init/Kconfig                           |   8 +
>  kernel/sys_ni.c                        |   3 +
>  34 files changed, 1998 insertions(+), 7 deletions(-)
>  create mode 100644 arch/arm/kernel/trampfd.c
>  create mode 100644 arch/arm64/kernel/trampfd.c
>  create mode 100644 arch/x86/kernel/trampfd.c
>  create mode 100644 fs/trampfd/Makefile
>  create mode 100644 fs/trampfd/trampfd_data.c
>  create mode 100644 fs/trampfd/trampfd_fops.c
>  create mode 100644 fs/trampfd/trampfd_map.c
>  create mode 100644 fs/trampfd/trampfd_pcs.c
>  create mode 100644 fs/trampfd/trampfd_regs.c
>  create mode 100644 fs/trampfd/trampfd_stack.c
>  create mode 100644 fs/trampfd/trampfd_stubs.c
>  create mode 100644 fs/trampfd/trampfd_syscall.c
>  create mode 100644 include/linux/trampfd.h
>  create mode 100644 include/uapi/linux/trampfd.h
>
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 16:05   ` Casey Schaufler
@ 2020-07-28 16:49     ` Madhavan T. Venkataraman
  2020-07-28 17:05     ` James Morris
  1 sibling, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-28 16:49 UTC (permalink / raw)
  To: Casey Schaufler, kernel-hardening, linux-api, linux-arm-kernel,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86
Thanks.
On 7/28/20 11:05 AM, Casey Schaufler wrote:
>> In this solution, the kernel recognizes certain sequences of instructions
>> as "well-known" trampolines. When such a trampoline is executed, a page
>> fault happens because the trampoline page does not have execute permission.
>> The kernel recognizes the trampoline and emulates it. Basically, the
>> kernel does the work of the trampoline on behalf of the application.
> What prevents a malicious process from using the "well-known" trampoline
> to its own purposes? I expect it is obvious, but I'm not seeing it. Old
> eyes, I suppose.
You are quite right. As I note below, the attack surface is the
buffer that contains the trampoline code. Since the kernel does
check the instruction sequence, the sequence cannot be
changed by a hacker. But the hacker can presumably change
the register values and redirect the PC to his desired location.
The assumption with trampoline emulation is that the
system will have security settings that will prevent pages from
having both write and execute permissions. So, a hacker
cannot load his own code in a page and redirect the PC to
it and execute his own code. But he can probably set the
PC to point to arbitrary locations. For instance, jump to
the middle of a C library function.
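The "Allowed PCs" context described in the cover letter is aimed at exactly this kind of redirect. A minimal userspace sketch of such a check follows, assuming a read-only table of handler addresses built at compile time. The names `allowed_pcs`, `pc_is_allowed`, `handler_a` and `handler_b` are hypothetical and are not part of the trampfd API; the kernel-side check in the patchset may look quite different.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ABI handlers standing in for a library's closure entry points. */
static int handler_a(void) { return 1; }
static int handler_b(void) { return 2; }

/* Read-only table: a target PC is accepted only if it matches an entry exactly. */
static int (*const allowed_pcs[])(void) = { handler_a, handler_b };

static bool pc_is_allowed(uintptr_t pc)
{
    size_t n = sizeof(allowed_pcs) / sizeof(allowed_pcs[0]);

    for (size_t i = 0; i < n; i++) {
        if ((uintptr_t)allowed_pcs[i] == pc)
            return true;
    }
    return false;
}
```

With a check like this, a PC that points one byte into `handler_a` is rejected just like any other address that is not a listed entry point.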
>
>> Here, the attack surface is the buffer that contains the trampoline.
>> The attack surface is narrower than before. A hacker may still be able to
>> modify what gets loaded in the registers or modify the target PC to point
>> to arbitrary locations.
...
>> Work that is pending
>> --------------------
>>
>> - I am working on implementing an SELinux setting called "exectramp"
>>   similar to "execmem" to allow the use of trampfd on a per application
>>   basis.
> You could make a separate LSM to do these checks instead of limiting
> it to SELinux. Your use case, your call, of course.
OK. I will research this.
Madhavan
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 16:05   ` Casey Schaufler
  2020-07-28 16:49     ` Madhavan T. Venkataraman
@ 2020-07-28 17:05     ` James Morris
  2020-07-28 17:08       ` Madhavan T. Venkataraman
  1 sibling, 1 reply; 64+ messages in thread
From: James Morris @ 2020-07-28 17:05 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: madvenka, kernel-hardening, linux-api, linux-arm-kernel,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86
On Tue, 28 Jul 2020, Casey Schaufler wrote:
> You could make a separate LSM to do these checks instead of limiting
> it to SELinux. Your use case, your call, of course.
It's not limited to SELinux. This is hooked via the LSM API and 
implementable by any LSM (similar to execmem, execstack etc.)
-- 
James Morris
<jmorris@namei.org>
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:05     ` James Morris
@ 2020-07-28 17:08       ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-28 17:08 UTC (permalink / raw)
  To: James Morris, Casey Schaufler
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86
On 7/28/20 12:05 PM, James Morris wrote:
> On Tue, 28 Jul 2020, Casey Schaufler wrote:
>
>> You could make a separate LSM to do these checks instead of limiting
>> it to SELinux. Your use case, your call, of course.
> It's not limited to SELinux. This is hooked via the LSM API and 
> implementable by any LSM (similar to execmem, execstack etc.)
Yes. I have an implementation that I am testing right now that
defines the hook for exectramp and implements it for
SELinux. That is why I mentioned SELinux.
Madhavan
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
                     ` (5 preceding siblings ...)
  2020-07-28 16:05   ` Casey Schaufler
@ 2020-07-28 17:31   ` Andy Lutomirski
  2020-07-28 19:01     ` Madhavan T. Venkataraman
                       ` (5 more replies)
  2020-07-31 18:09   ` Mark Rutland
  7 siblings, 6 replies; 64+ messages in thread
From: Andy Lutomirski @ 2020-07-28 17:31 UTC (permalink / raw)
  To: madvenka
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
This is quite clever, but now I’m wondering just how much kernel help
is really needed. In your series, the trampoline is a non-executable
page.  I can think of at least two alternative approaches, and I'd
like to know the pros and cons.
1. Entirely userspace: a return trampoline would be something like:
1:
pushq %rax
pushq %rbx
pushq %rcx
...
pushq %r15
movq %rsp, %rdi # pointer to saved regs
leaq 1b(%rip), %rsi # pointer to the trampoline itself
callq trampoline_handler # see below
You would fill a page with a bunch of these, possibly compacted to get
more per page, and then you would remap as many copies as needed.  The
'callq trampoline_handler' part would need to be a bit clever to make
it continue to work despite this remapping.  This will be *much*
faster than trampfd. How much of your use case would it cover?  For
the inverse, it's not too hard to write a bit of asm to set all
registers and jump somewhere.
2. Use existing kernel functionality.  Raise a signal, modify the
state, and return from the signal.  This is very flexible and may not
be all that much slower than trampfd.
3. Use a syscall.  Instead of having the kernel handle page faults,
have the trampoline code push the syscall nr register, load a special
new syscall nr into the syscall nr register, and do a syscall. On
x86_64, this would be:
pushq %rax
movq $__NR_magic_trampoline, %rax
syscall
with some adjustment if the stack slot you're clobbering is important.
Also, will using trampfd cause issues with various unwinders?  I can
easily imagine unwinders expecting code to be readable, although this
is slowly going away for other reasons.
All this being said, I think that the kernel should absolutely add a
sensible interface for JITs to use to materialize their code.  This
would integrate sanely with LSMs and wouldn't require hacks like using
files, etc.  A cleverly designed JIT interface could function without
serialization IPIs, and even lame architectures like x86 could
potentially avoid shootdown IPIs if the interface copied code instead
of playing virtual memory games.  At its very simplest, this could be:
void *jit_create_code(const void *source, size_t len);
and the result would be a new anonymous mapping that contains exactly
the code requested.  There could also be:
int jittfd_create(...);
that does something similar but creates a memfd.  A nicer
implementation for short JIT sequences would allow appending more code
to an existing JIT region.  On x86, an appendable JIT region would
start filled with 0xCC, and I bet there's a way to materialize new
code into a previously 0xcc-filled virtual page without any
synchronization.  One approach would be to start with:
<some code>
0xcc
0xcc
...
0xcc
and to create a whole new page like:
<some code>
<some more code>
0xcc
...
0xcc
so that the only difference is that some code changed to some more
code.  Then replace the PTE to swap from the old page to the new page,
and arrange to avoid freeing the old page until we're sure it's gone
from all TLBs.  This may not work if <some more code> spans a page
boundary.  The #BP fixup would zap the TLB and retry.  Even just
directly copying code over some 0xcc bytes almost works, but there's a
nasty corner case involving instructions that cross I$ fetch
boundaries.  I'm not sure to what extent I$ snooping helps.
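A minimal sketch of the "build a whole new page" step, assuming a 4096-byte page and modelling only the page images in ordinary memory. The PTE swap, TLB handling and #BP fixup are not modelled, and the names `jit_page_init` and `jit_page_append` are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define JIT_PAGE_SIZE 4096u
#define JIT_FILL 0xCC /* x86 INT3 */

/* Fill a fresh page image with INT3 so stray execution traps. */
static void jit_page_init(uint8_t page[JIT_PAGE_SIZE])
{
    memset(page, JIT_FILL, JIT_PAGE_SIZE);
}

/*
 * Build the replacement image: copy the old page, then place `code` at
 * offset `used`. Refuses to run past the end of the page or to overwrite
 * anything other than filler, so the old and new images differ only in
 * bytes that were 0xCC. Returns the new `used` offset, or 0 on failure.
 */
static size_t jit_page_append(const uint8_t *old_page, uint8_t *new_page,
                              size_t used, const uint8_t *code, size_t len)
{
    if (len == 0 || used > JIT_PAGE_SIZE || len > JIT_PAGE_SIZE - used)
        return 0;
    for (size_t i = 0; i < len; i++) {
        if (old_page[used + i] != JIT_FILL)
            return 0;
    }
    memcpy(new_page, old_page, JIT_PAGE_SIZE);
    memcpy(new_page + used, code, len);
    return used + len;
}
```

The refusal to cross the page end mirrors the caveat above that the scheme may not work when the appended code spans a page boundary.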
--Andy
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:31   ` Andy Lutomirski
@ 2020-07-28 19:01     ` Madhavan T. Venkataraman
  2020-07-29 13:29     ` Florian Weimer
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-28 19:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
I am working on a response to this. I will send it soon.
Thanks.
Madhavan
On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is a non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbx
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed.  The
> 'callq trampoline_handler' part would need to be a bit clever to make
> it continue to work despite this remapping.  This will be *much*
> faster than trampfd. How much of your use case would it cover?  For
> the inverse, it's not too hard to write a bit of asm to set all
> registers and jump somewhere.
>
> 2. Use existing kernel functionality.  Raise a signal, modify the
> state, and return from the signal.  This is very flexible and may not
> be all that much slower than trampfd.
>
> 3. Use a syscall.  Instead of having the kernel handle page faults,
> have the trampoline code push the syscall nr register, load a special
> new syscall nr into the syscall nr register, and do a syscall. On
> x86_64, this would be:
>
> pushq %rax
> movq $__NR_magic_trampoline, %rax
> syscall
>
> with some adjustment if the stack slot you're clobbering is important.
>
>
> Also, will using trampfd cause issues with various unwinders?  I can
> easily imagine unwinders expecting code to be readable, although this
> is slowly going away for other reasons.
>
> All this being said, I think that the kernel should absolutely add a
> sensible interface for JITs to use to materialize their code.  This
> would integrate sanely with LSMs and wouldn't require hacks like using
> files, etc.  A cleverly designed JIT interface could function without
> serialization IPIs, and even lame architectures like x86 could
> potentially avoid shootdown IPIs if the interface copied code instead
> of playing virtual memory games.  At its very simplest, this could be:
>
> void *jit_create_code(const void *source, size_t len);
>
> and the result would be a new anonymous mapping that contains exactly
> the code requested.  There could also be:
>
> int jittfd_create(...);
>
> that does something similar but creates a memfd.  A nicer
> implementation for short JIT sequences would allow appending more code
> to an existing JIT region.  On x86, an appendable JIT region would
> start filled with 0xCC, and I bet there's a way to materialize new
> code into a previously 0xcc-filled virtual page without any
> synchronization.  One approach would be to start with:
>
> <some code>
> 0xcc
> 0xcc
> ...
> 0xcc
>
> and to create a whole new page like:
>
> <some code>
> <some more code>
> 0xcc
> ...
> 0xcc
>
> so that the only difference is that some code changed to some more
> code.  Then replace the PTE to swap from the old page to the new page,
> and arrange to avoid freeing the old page until we're sure it's gone
> from all TLBs.  This may not work if <some more code> spans a page
> boundary.  The #BP fixup would zap the TLB and retry.  Even just
> directly copying code over some 0xcc bytes almost works, but there's a
> nasty corner case involving instructions that cross I$ fetch
> boundaries.  I'm not sure to what extent I$ snooping helps.
>
> --Andy
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:31   ` Andy Lutomirski
  2020-07-28 19:01     ` Madhavan T. Venkataraman
@ 2020-07-29 13:29     ` Florian Weimer
  2020-07-30 13:09     ` David Laight
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 64+ messages in thread
From: Florian Weimer @ 2020-07-29 13:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: madvenka, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
* Andy Lutomirski:
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is a non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbx
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed.
libffi does something like this for iOS, I believe.
The only thing you really need is a PC-relative indirect call, with the
target address loaded from a different page.  The trampoline handler can
do all the rest because it can identify the trampoline from the stack.
Having a closure parameter loaded into a register will speed things up,
of course.
I still hope to transition libffi to this model for most Linux targets.
It really simplifies things because you don't have to deal with cache
flushes (on both the data and code aliases for SELinux support).
But the key observation is that efficient trampolines do not need
run-time code generation at all because their code is so regular.
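A minimal sketch of the address arithmetic such a scheme relies on, assuming a page of identical fixed-size stubs followed directly by a page of per-trampoline data. The 16-byte stub size, the layout and the names `tramp_slot` and `tramp_slot_for` are assumptions for illustration, not libffi's actual layout:

```c
#include <stddef.h>
#include <stdint.h>

#define TRAMP_PAGE_SIZE 4096u
#define TRAMP_STUB_SIZE 16u /* assumed size of one stub in the code page */

/* Per-trampoline data, kept in the writable page after the code page. */
struct tramp_slot {
    void (*target)(void); /* where the stub's indirect call goes */
    void *closure;        /* closure parameter for the handler */
};

/*
 * Map an address inside a stub (for example the return address that the
 * stub's call left on the stack) to the data slot paired with that stub.
 */
static struct tramp_slot *tramp_slot_for(uintptr_t stub_addr)
{
    uintptr_t code_page = stub_addr & ~(uintptr_t)(TRAMP_PAGE_SIZE - 1);
    size_t index = (stub_addr - code_page) / TRAMP_STUB_SIZE;
    struct tramp_slot *slots =
        (struct tramp_slot *)(code_page + TRAMP_PAGE_SIZE);

    return &slots[index];
}
```

Because every stub is the same bytes, the code page can be mapped read-only and executable from a file, and only the data page ever needs to be written.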
Thanks,
Florian
- * RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:31   ` Andy Lutomirski
  2020-07-28 19:01     ` Madhavan T. Venkataraman
  2020-07-29 13:29     ` Florian Weimer
@ 2020-07-30 13:09     ` David Laight
  2020-08-02 11:56       ` Pavel Machek
  2020-07-30 14:42     ` Madhavan T. Venkataraman
                       ` (2 subsequent siblings)
  5 siblings, 1 reply; 64+ messages in thread
From: David Laight @ 2020-07-30 13:09 UTC (permalink / raw)
  To: 'Andy Lutomirski', madvenka@linux.microsoft.com
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is a non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
> 
> 1. Entirely userspace: a return trampoline would be something like:
> 
> 1:
> pushq %rax
> pushq %rbx
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
For nested calls (where the trampoline needs to pass the
original stack frame to the nested function) I think you
just need a page full of:
	mov	$0, scratch_reg; jmp trampoline_handler
	mov	$1, scratch_reg; jmp trampoline_handler
You need an unused register, on x86-64 I think both
r10 and r11 are available.
On i386 I think eax can be used.
It might even be that the first argument register is
available - if that is used to pass in the stack frame.
The trampoline_handler then uses the passed in value
to index an array of stack frame and function pointers
and jumps to the real function.
You need to hold everything in __thread data.
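A minimal sketch of the lookup side of this, in C rather than assembly. The stub page itself is not shown; `index` stands for the value a stub would leave in the scratch register. The slot count and the names `nest_slot`, `nest_push`, `nest_pop` and `add_from_frame` are hypothetical:

```c
#include <stddef.h>

#define NEST_SLOTS 64 /* assumed number of stubs in the page */

struct nest_slot {
    void *frame;                       /* enclosing function's stack frame */
    long (*fn)(void *frame, long arg); /* the real nested function */
};

/* Per-thread table, as suggested above, so threads do not share slots. */
static __thread struct nest_slot nest_slots[NEST_SLOTS];
static __thread size_t nest_depth;

/* Claim the next free slot; returns its index, or -1 when the table is full. */
static int nest_push(void *frame, long (*fn)(void *frame, long arg))
{
    if (nest_depth == NEST_SLOTS)
        return -1;
    nest_slots[nest_depth].frame = frame;
    nest_slots[nest_depth].fn = fn;
    return (int)nest_depth++;
}

/* Release the most recently claimed slot. */
static void nest_pop(void)
{
    if (nest_depth > 0)
        nest_depth--;
}

/* What trampoline_handler would do with the index from the scratch register. */
static long trampoline_handler(size_t index, long arg)
{
    return nest_slots[index].fn(nest_slots[index].frame, arg);
}

/* Example "nested function": adds a variable that lives in the outer frame. */
static long add_from_frame(void *frame, long arg)
{
    return *(long *)frame + arg;
}
```

Slots are claimed and released in stack order here, which matches how nested-function trampolines come and go with their enclosing frames.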
And maybe be able to allocate an extra page for deeply
nested code paths (eg recursive nested functions).
You might then need a driver to create you a suitable
executable page. Somehow you need to pass in the address
of the trampoline_handler and the number for the first fault.
It needs to pass back the 'stride' of the array and number
of elements created.
But if you can take the cost of the page fault, then
you can interpret the existing trampoline in userspace
within the signal handler.
This is two kernel entry/exits.
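A minimal sketch of that signal-handler variant, assuming Linux on x86-64 or arm64 with glibc's `ucontext_t` layout. The trampoline is a PROT_NONE page; the SIGSEGV handler loads one argument register and redirects the PC, which costs the fault delivery plus the sigreturn. All names (`tramp_create`, `tramp_on_segv`, `tramp_add_one`) are hypothetical, a single global trampoline is used to keep it short, and this is not how trampfd itself is implemented:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>

#if defined(__x86_64__) && !defined(REG_RIP)
/* glibc hides these names unless _GNU_SOURCE is set before the first
 * include; the values are the gregs[] indexes from <sys/ucontext.h>. */
#define REG_RDI 8
#define REG_RIP 16
#endif

typedef long (*tramp_fn)(long);

static void *tramp_page;      /* PROT_NONE page acting as the trampoline */
static tramp_fn tramp_target; /* function the handler redirects to */
static long tramp_arg;        /* value placed in the first argument register */

/* Example target function. */
static long tramp_add_one(long x) { return x + 1; }

static void tramp_on_segv(int sig, siginfo_t *si, void *ctx)
{
    ucontext_t *uc = ctx;

    if (si->si_addr != tramp_page) {
        /* Not our trampoline: let the re-fault take the default action. */
        signal(sig, SIG_DFL);
        return;
    }
#if defined(__x86_64__)
    uc->uc_mcontext.gregs[REG_RDI] = (greg_t)tramp_arg;
    uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(uintptr_t)tramp_target;
#elif defined(__aarch64__)
    uc->uc_mcontext.regs[0] = (unsigned long long)tramp_arg;
    uc->uc_mcontext.pc = (unsigned long long)(uintptr_t)tramp_target;
#else
#error "sketch covers x86-64 and arm64 Linux only"
#endif
}

/* Returns a callable address; calling it runs target(arg). NULL on failure. */
static tramp_fn tramp_create(tramp_fn target, long arg)
{
    struct sigaction sa;

    tramp_page = mmap(NULL, 4096, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tramp_page == MAP_FAILED)
        return NULL;
    tramp_target = target;
    tramp_arg = arg;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = tramp_on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return NULL;
    return (tramp_fn)(uintptr_t)tramp_page;
}
```

The caller's return address is already in place when the fault happens (on the stack on x86-64, in the link register on arm64), so the target returns straight to the original caller and the argument passed by the caller is replaced by the stored one.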
Arbitrary JIT is a different problem entirely.
	David
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-30 13:09     ` David Laight
@ 2020-08-02 11:56       ` Pavel Machek
  2020-08-03  8:08         ` David Laight
  0 siblings, 1 reply; 64+ messages in thread
From: Pavel Machek @ 2020-08-02 11:56 UTC (permalink / raw)
  To: David Laight
  Cc: 'Andy Lutomirski', madvenka@linux.microsoft.com,
	Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
Hi!
> > This is quite clever, but now I'm wondering just how much kernel help
> > is really needed. In your series, the trampoline is a non-executable
> > page.  I can think of at least two alternative approaches, and I'd
> > like to know the pros and cons.
> > 
> > 1. Entirely userspace: a return trampoline would be something like:
> > 
> > 1:
> > pushq %rax
> > pushq %rbc
> > pushq %rcx
> > ...
> > pushq %r15
> > movq %rsp, %rdi # pointer to saved regs
> > leaq 1b(%rip), %rsi # pointer to the trampoline itself
> > callq trampoline_handler # see below
> 
> For nested calls (where the trampoline needs to pass the
> original stack frame to the nested function) I think you
> just need a page full of:
> 	mov	$0, scratch_reg; jmp trampoline_handler
I believe you could do with mov %pc, scratch_reg; jmp ...
That has the advantage of being able to share a single physical page across multiple virtual pages...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
- * RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-02 11:56       ` Pavel Machek
@ 2020-08-03  8:08         ` David Laight
  2020-08-03 15:57           ` Madhavan T. Venkataraman
  0 siblings, 1 reply; 64+ messages in thread
From: David Laight @ 2020-08-03  8:08 UTC (permalink / raw)
  To: 'Pavel Machek'
  Cc: 'Andy Lutomirski', madvenka@linux.microsoft.com,
	Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
From: Pavel Machek <pavel@ucw.cz>
> Sent: 02 August 2020 12:56
> Hi!
> 
> > > This is quite clever, but now I'm wondering just how much kernel help
> > > is really needed. In your series, the trampoline is a non-executable
> > > page.  I can think of at least two alternative approaches, and I'd
> > > like to know the pros and cons.
> > >
> > > 1. Entirely userspace: a return trampoline would be something like:
> > >
> > > 1:
> > > pushq %rax
> > > pushq %rbx
> > > pushq %rcx
> > > ...
> > > pushq %r15
> > > movq %rsp, %rdi # pointer to saved regs
> > > leaq 1b(%rip), %rsi # pointer to the trampoline itself
> > > callq trampoline_handler # see below
> >
> > For nested calls (where the trampoline needs to pass the
> > original stack frame to the nested function) I think you
> > just need a page full of:
> > 	mov	$0, scratch_reg; jmp trampoline_handler
> 
> I believe you could do with mov %pc, scratch_reg; jmp ...
> 
> That has advantage of being able to share single physical
> page across multiple virtual pages...
A lot of architectures don't let you copy %pc that way so you would
have to use 'call' - but that trashes the return address cache.
It also needs the trampoline handler to know the addresses
of the trampolines.
	David
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-03  8:08         ` David Laight
@ 2020-08-03 15:57           ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-03 15:57 UTC (permalink / raw)
  To: David Laight, 'Pavel Machek'
  Cc: 'Andy Lutomirski', Kernel Hardening, Linux API,
	linux-arm-kernel, Linux FS Devel, linux-integrity, LKML, LSM List,
	Oleg Nesterov, X86 ML
On 8/3/20 3:08 AM, David Laight wrote:
> From: Pavel Machek <pavel@ucw.cz>
>> Sent: 02 August 2020 12:56
>> Hi!
>>
>>>> This is quite clever, but now I'm wondering just how much kernel help
>>>> is really needed. In your series, the trampoline is a non-executable
>>>> page.  I can think of at least two alternative approaches, and I'd
>>>> like to know the pros and cons.
>>>>
>>>> 1. Entirely userspace: a return trampoline would be something like:
>>>>
>>>> 1:
>>>> pushq %rax
>>>> pushq %rbx
>>>> pushq %rcx
>>>> ...
>>>> pushq %r15
>>>> movq %rsp, %rdi # pointer to saved regs
>>>> leaq 1b(%rip), %rsi # pointer to the trampoline itself
>>>> callq trampoline_handler # see below
>>> For nested calls (where the trampoline needs to pass the
>>> original stack frame to the nested function) I think you
>>> just need a page full of:
>>> 	mov	$0, scratch_reg; jmp trampoline_handler
>> I believe you could do with mov %pc, scratch_reg; jmp ...
>>
>> That has advantage of being able to share single physical
>> page across multiple virtual pages...
> A lot of architectures don't let you copy %pc that way so you would
> have to use 'call' - but that trashes the return address cache.
> It also needs the trampoline handler to know the addresses
> of the trampolines.
Do you know which ones don't allow you to copy %pc?
Some of the architectures do not have PC-relative data references.
If they do not allow you to copy the PC into a general purpose
register, then there is no way to implement the statically defined
trampoline that has been discussed so far. In these cases, the
trampoline has to be generated at runtime.
Thanks.
Madhavan
 
 
 
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:31   ` Andy Lutomirski
                       ` (2 preceding siblings ...)
  2020-07-30 13:09     ` David Laight
@ 2020-07-30 14:42     ` Madhavan T. Venkataraman
       [not found]     ` <6540b4b7-3f70-adbf-c922-43886599713a@linux.microsoft.com>
  2020-08-02 18:54     ` Madhavan T. Venkataraman
  5 siblings, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-30 14:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
For some reason my email program is not delivering to all the
recipients because of some formatting issues. I am resending.
I apologize. I will try to get this fixed.
Sorry for the delay. I just needed to think about it a little.
I will respond to your first suggestion in this email. I will
respond to the others in separate emails if that is alright
with you.
On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is a non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbx
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed.  The
> 'callq trampoline_handler' part would need to be a bit clever to make
> it continue to work despite this remapping.  This will be *much*
> faster than trampfd. How much of your use case would it cover?  For
> the inverse, it's not too hard to write a bit of asm to set all
> registers and jump somewhere.
Let me state my understanding of what you are suggesting. Correct me if
I get anything wrong. If you don't mind, I will also take the liberty
of generalizing and paraphrasing your suggestion.
The goal is to create two page mappings that are adjacent to each other:
- a code page that contains template code for a trampoline. Since the
  template code would tend to be small in size, pack as many of them
  as possible within a page to conserve memory. In other words, create
  an array of the template code fragments. Each element in the array
  would be used for one trampoline instance.
- a data page that contains an array of data elements. Corresponding
  to each code element in the code page, there would be a data element
  in the data page that would contain data that is specific to a
  trampoline instance.
- Code will access data using PC-relative addressing.
The management of the code pages and allocation for each trampoline
instance would all be done in user space.
Is this the general idea?
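To make the layout concrete, here is a small C sketch of how a trampoline's data element could be located from its code address. All constants and names (TRAMP_PAGE_SIZE, CODE_STRIDE, DATA_STRIDE, tramp_data_addr) are assumptions chosen for illustration, not values taken from libffi or from the patch series. When the two strides are equal, every code element sits exactly one page below its data element, which is what lets the template use a single fixed PC-relative displacement.

```c
#include <assert.h>
#include <stdint.h>

#define TRAMP_PAGE_SIZE 4096u  /* assumed base page size */
#define CODE_STRIDE     16u    /* assumed bytes per code element */
#define DATA_STRIDE     16u    /* assumed bytes per data element */

/*
 * Given the start address of a trampoline in the code page, return the
 * address of its data element in the data page mapped directly after it.
 */
static uintptr_t tramp_data_addr(uintptr_t code_addr)
{
    uintptr_t page = code_addr & ~(uintptr_t)(TRAMP_PAGE_SIZE - 1);
    uintptr_t offset = code_addr - page;

    /* The address must be the start of an element, not its middle. */
    assert(offset % CODE_STRIDE == 0);

    uintptr_t index = offset / CODE_STRIDE;
    return page + TRAMP_PAGE_SIZE + index * DATA_STRIDE;
}
```

With unequal strides the same function still works, but the displacement then differs per element, so the template code could no longer be identical for every slot.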
Creating a code page
----------------------------
We can do this in one of the following ways:
- Allocate a writable page at run time, write the template code into
  the page and have execute permissions on the page.
- Allocate a writable page at run time, write the template code into
  the page and remap the page with just execute permissions.
- Allocate a writable page at run time, write the template code into
  the page, write the page into a temporary file and map the file with
  execute permissions.
- Include the template code in a code page at build time itself and
  just remap the code page each time you need a code page.
Pros and Cons
-------------------
As long as the OS provides the functionality to do this and the security
subsystem in the OS allows the actions, this is totally feasible. If not,
we need something like trampfd.
As Florian mentioned, libffi does implement something like this for MACH.
In fact, in my libffi changes, I use trampfd only after all the other methods
have failed because of security settings.
But the above approach only solves the problem for this simple type of
trampoline. It does not provide a framework for addressing more complex types
or even other forms of dynamic code.
Also, each application would need to implement this solution for itself
as opposed to relying on one implementation provided by the kernel.
Trampfd-based solution
-------------------------------
I outlined an enhancement to trampfd in a response to David Laight. In this
enhancement, the kernel is the one that would set up the code page.
The kernel would call an arch-specific support function to generate the
code required to load registers, push values on the stack and jump to a PC
for a trampoline instance based on its current context. The trampoline
instance data could be baked into the code.
My initial idea was to only have one trampoline instance per page. But I
think I can implement multiple instances per page. I just have to manage
the trampfd file private data and VMA private data accordingly to map an
element in a code page to its trampoline object.
The two approaches are similar except for the detail about who sets up
and manages the trampoline pages. In both approaches, the performance problem
is addressed. But trampfd can be used even when security settings are
restrictive.
Is my solution acceptable?
A couple of things
------------------------
- In the current trampfd implementation, no physical pages are actually
  allocated. It is just a virtual mapping. From a memory footprint
  perspective, this is good. Maybe we can let the user specify whether
  they want a fast trampoline that consumes memory or a slow one that doesn't?
- In the future, we may define additional types that need the kernel to do
  the job. Examples:
    - The kernel may have a trampoline type for which it is not willing
      or able to generate code
    - The kernel could emulate dynamic code for the user
    - The kernel could interpret dynamic code for the user
    - The kernel could allow the user to access some kernel functionality
      using the framework
  In such cases, there isn't any physical code page that gets mapped into
  the user address space. We need the kernel to handle the address fault
  and provide the functionality.
One question for the reviewers
----------------------------------------
Do you think that the file descriptor based approach is fine? Or, does this
need a regular system call based implementation? There are some advantages
with a regular system call:
- We don't consume file descriptors. E.g., in libffi, we have to
  keep the file descriptor open for a closure until the closure
  is freed.
- Trampoline operations can be performed based on the trampoline
  address instead of an fd.
- Sharing of objects across processes can be implemented through
  a regular ID based method rather than sending the file descriptor
  over a unix domain socket.
- Shared objects can be persistent.
- An fd based API does structure parsing in read()/write() calls
  to obtain arguments. With a regular system call, that is not
  necessary.
Please let me know your thoughts.
Madhavan
- [parent not found: <6540b4b7-3f70-adbf-c922-43886599713a@linux.microsoft.com>] 
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
       [not found]     ` <6540b4b7-3f70-adbf-c922-43886599713a@linux.microsoft.com>
@ 2020-07-30 20:54       ` Andy Lutomirski
  2020-07-31 17:13         ` Madhavan T. Venkataraman
  0 siblings, 1 reply; 64+ messages in thread
From: Andy Lutomirski @ 2020-07-30 20:54 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
> Sorry for the delay. I just wanted to think about this a little.
> In this email, I will respond to your first suggestion. I will
> respond to the rest in separate emails if that is alright with
> you.
>
> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>
> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
>
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is a non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbx
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed.  The
> 'callq trampoline_handler' part would need to be a bit clever to make
> it continue to work despite this remapping.  This will be *much*
> faster than trampfd. How much of your use case would it cover?  For
> the inverse, it's not too hard to write a bit of asm to set all
> registers and jump somewhere.
>
> Let me state what I have understood about this suggestion. Correct me if
> I get anything wrong. If you don't mind, I will also take the liberty
> of generalizing and paraphrasing your suggestion.
>
> The goal is to create two page mappings that are adjacent to each other:
>
> - a code page that contains template code for a trampoline. Since the
>  template code would tend to be small in size, pack as many of them
>  as possible within a page to conserve memory. In other words, create
>  an array of the template code fragments. Each element in the array
>  would be used for one trampoline instance.
>
> - a data page that contains an array of data elements. Corresponding
>  to each code element in the code page, there would be a data element
>  in the data page that would contain data that is specific to a
>  trampoline instance.
>
> - Code will access data using PC-relative addressing.
>
> The management of the code pages and allocation for each trampoline
> instance would all be done in user space.
>
> Is this the general idea?
Yes.
>
> Creating a code page
> --------------------
>
> We can do this in one of the following ways:
>
> - Allocate a writable page at run time, write the template code into
>   the page and have execute permissions on the page.
>
> - Allocate a writable page at run time, write the template code into
>   the page and remap the page with just execute permissions.
>
> - Allocate a writable page at run time, write the template code into
>   the page, write the page into a temporary file and map the file with
>   execute permissions.
>
> - Include the template code in a code page at build time itself and
>   just remap the code page each time you need a code page.
This latter part shouldn't need any special permissions as far as I know.
>
> Pros and Cons
> -------------
>
> As long as the OS provides the functionality to do this and the security
> subsystem in the OS allows the actions, this is totally feasible. If not,
> we need something like trampfd.
>
> As Florian mentioned, libffi does implement something like this for MACH.
>
> In fact, in my libffi changes, I use trampfd only after all the other methods
> have failed because of security settings.
>
> But the above approach only solves the problem for this simple type of
> trampoline. It does not provide a framework for addressing more complex types
> or even other forms of dynamic code.
>
> Also, each application would need to implement this solution for itself
> as opposed to relying on one implementation provided by the kernel.
I would argue this is a benefit.  If the whole implementation is in
userspace, there is no ABI compatibility issue.  The user program
contains the trampoline code and the code that uses it.
>
> Trampfd-based solution
> ----------------------
>
> I outlined an enhancement to trampfd in a response to David Laight. In this
> enhancement, the kernel is the one that would set up the code page.
>
> The kernel would call an arch-specific support function to generate the
> code required to load registers, push values on the stack and jump to a PC
> for a trampoline instance based on its current context. The trampoline
> instance data could be baked into the code.
>
> My initial idea was to only have one trampoline instance per page. But I
> think I can implement multiple instances per page. I just have to manage
> the trampfd file private data and VMA private data accordingly to map an
> element in a code page to its trampoline object.
>
> The two approaches are similar except for the detail about who sets up
> and manages the trampoline pages. In both approaches, the performance problem
> is addressed. But trampfd can be used even when security settings are
> restrictive.
>
> Is my solution acceptable?
Perhaps.  In general, before adding a new ABI to the kernel, it's nice
to understand how it's better than doing the same thing in userspace.
Saying that it's easier for user code to work with if it's in the
kernel isn't necessarily an adequate justification.
Why would remapping two pages of actual application text ever fail?
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-30 20:54       ` Andy Lutomirski
@ 2020-07-31 17:13         ` Madhavan T. Venkataraman
  2020-07-31 18:31           ` Mark Rutland
  2020-08-02 13:57           ` Florian Weimer
  0 siblings, 2 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-31 17:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
On 7/30/20 3:54 PM, Andy Lutomirski wrote:
> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>> ...
>> Creating a code page
>> --------------------
>>
>> We can do this in one of the following ways:
>>
>> - Allocate a writable page at run time, write the template code into
>>   the page and have execute permissions on the page.
>>
>> - Allocate a writable page at run time, write the template code into
>>   the page and remap the page with just execute permissions.
>>
>> - Allocate a writable page at run time, write the template code into
>>   the page, write the page into a temporary file and map the file with
>>   execute permissions.
>>
>> - Include the template code in a code page at build time itself and
>>   just remap the code page each time you need a code page.
> This latter part shouldn't need any special permissions as far as I know.
Agreed.
>
>> Pros and Cons
>> -------------
>>
>> As long as the OS provides the functionality to do this and the security
>> subsystem in the OS allows the actions, this is totally feasible. If not,
>> we need something like trampfd.
>>
>> As Florian mentioned, libffi does implement something like this for MACH.
>>
>> In fact, in my libffi changes, I use trampfd only after all the other methods
>> have failed because of security settings.
>>
>> But the above approach only solves the problem for this simple type of
>> trampoline. It does not provide a framework for addressing more complex types
>> or even other forms of dynamic code.
>>
>> Also, each application would need to implement this solution for itself
>> as opposed to relying on one implementation provided by the kernel.
> I would argue this is a benefit.  If the whole implementation is in
> userspace, there is no ABI compatibility issue.  The user program
> contains the trampoline code and the code that uses it.
The current trampfd implementation also does not have an ABI issue.
ABI details are to be handled in user land. In the case of libffi, they
are. Trampfd only addresses the trampoline required to jump to the
ABI handler.
>
>> Trampfd-based solution
>> ----------------------
>>
>> I outlined an enhancement to trampfd in a response to David Laight. In this
>> enhancement, the kernel is the one that would set up the code page.
>>
>> The kernel would call an arch-specific support function to generate the
>> code required to load registers, push values on the stack and jump to a PC
>> for a trampoline instance based on its current context. The trampoline
>> instance data could be baked into the code.
>>
>> My initial idea was to only have one trampoline instance per page. But I
>> think I can implement multiple instances per page. I just have to manage
>> the trampfd file private data and VMA private data accordingly to map an
>> element in a code page to its trampoline object.
>>
>> The two approaches are similar except for the detail about who sets up
>> and manages the trampoline pages. In both approaches, the performance problem
>> is addressed. But trampfd can be used even when security settings are
>> restrictive.
>>
>> Is my solution acceptable?
> Perhaps.  In general, before adding a new ABI to the kernel, it's nice
> to understand how it's better than doing the same thing in userspace.
> Saying that it's easier for user code to work with if it's in the
> kernel isn't necessarily an adequate justification.
Fair enough.
Dealing with multiple architectures
-----------------------------------------------
One good reason to use trampfd is multiple architecture support. The
trampoline table in a code page approach is neat. I don't deny that at
all. But my question is - can it be used in all cases?
It requires PC-relative data references. I have not worked on all architectures.
So, I need to study this. But do all ISAs support PC-relative data references?
Even in an ISA that supports it, there would be a maximum supported offset
from the current PC that can be reached for a data reference. That maximum
needs to be at least the size of a base page in the architecture. This is because
the code page and the data page need to be separate for security reasons.
Do all ISAs support a sufficiently large offset?
When the kernel generates the code for a trampoline, it can hard code data values
in the generated code itself so it does not need PC-relative data referencing.
And, for ISAs that do support the large offset, we would have to implement and
maintain the code page stuff for different ISAs for each application and library
if we did not use trampfd.
If you look at the libffi reference patch that I have linked in the cover letter,
I have added functions in common code that wrap trampfd calls. From architecture
specific code, there is just one function call to one of those wrapper functions
to set the register context for the trampoline. This is a very small C code change
in each architecture. So, support can be extended to all architectures without
exception easily.
Runtime generated trampolines
-------------------------------------------
libffi trampolines are simple. But there may be many cases out there where the
trampoline code cannot be statically defined at build time. It may have to be
generated at runtime. For this, we will need trampfd.
Security
-----------
With the user level trampoline table approach, the data part of the trampoline table
can be hacked by an attacker if an application has a vulnerability. Specifically, the
target PC can be altered to some arbitrary location. Trampfd implements an
"Allowed PCS" context. In the libffi changes, I have created a read-only array of
all ABI handlers used in closures for each architecture. This read-only array
can be used to restrict the PC values for libffi trampolines to prevent hacking.
To generalize, we can implement security rules/features if the trampoline
object is in the kernel.
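As a rough illustration of the allowed-PC idea, the check amounts to a membership test against a read-only table. The sketch below is plain user-space C with hypothetical names (allowed_pcs, pc_is_allowed, handler_a, handler_b); it does not use the trampfd API and only shows the kind of rule that would be enforced before a target PC is accepted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*abi_handler)(void);

/* Stand-ins for the ABI handlers a library would legitimately jump to. */
static int handler_a(void) { return 1; }
static int handler_b(void) { return 2; }

/* A function that is NOT in the table, standing in for a bogus target. */
static int not_allowed(void) { return 3; }

/* Read-only table of permitted targets (const data, placed in .rodata). */
static const abi_handler allowed_pcs[] = { handler_a, handler_b };

/* Return true only if pc is one of the permitted targets. */
static bool pc_is_allowed(abi_handler pc)
{
    for (size_t i = 0; i < sizeof allowed_pcs / sizeof allowed_pcs[0]; i++) {
        if (allowed_pcs[i] == pc)
            return true;
    }
    return false;
}
```

The difference between the two approaches is where this check runs: in user space it sits in the same address space as the data it protects, while a kernel-held trampoline object could apply it outside the reach of the application.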
Standardization
---------------------
Trampfd is a framework that can be used to implement multiple things. Maybe
a few of those things can also be implemented in user land itself. But I think having
just one mechanism to execute dynamic code objects is preferable to having
multiple mechanisms not standardized across all applications.
As an example, let us say that I am able to implement support for JIT code. Let us
say that an interpreter uses libffi to execute a generated function. The interpreter
would use trampfd for the JIT code object and get an address. Then, it would pass
that to libffi which would then use trampfd for the trampoline. So, trampfd based
code objects can be chained.
> Why would remapping two pages of actual application text ever fail?
Remapping a page may not be available on all OSes. However, that is not a problem
for the code page approach. One can always memory map the code page from the
binary file directly. So, yes, this would not fail.
Madhavan
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-31 17:13         ` Madhavan T. Venkataraman
@ 2020-07-31 18:31           ` Mark Rutland
  2020-08-03  8:27             ` David Laight
  2020-08-03 17:58             ` Madhavan T. Venkataraman
  2020-08-02 13:57           ` Florian Weimer
  1 sibling, 2 replies; 64+ messages in thread
From: Mark Rutland @ 2020-07-31 18:31 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
> > On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
> > <madvenka@linux.microsoft.com> wrote:
> Dealing with multiple architectures
> -----------------------------------------------
> 
> One good reason to use trampfd is multiple architecture support. The
> trampoline table in a code page approach is neat. I don't deny that at
> all. But my question is - can it be used in all cases?
> 
> It requires PC-relative data references. I have not worked on all architectures.
> So, I need to study this. But do all ISAs support PC-relative data references?
Not all do, but pretty much any recent ISA will as it's a practical
necessity for fast position-independent code.
> Even in an ISA that supports it, there would be a maximum supported offset
> from the current PC that can be reached for a data reference. That maximum
> needs to be at least the size of a base page in the architecture. This is because
> the code page and the data page need to be separate for security reasons.
> Do all ISAs support a sufficiently large offset?
ISAs with PC-relative addressing can usually generate PC-relative
addresses into a GPR, from which they can apply an arbitrarily large
offset.
> When the kernel generates the code for a trampoline, it can hard code data values
> in the generated code itself so it does not need PC-relative data referencing.
> 
> And, for ISAs that do support the large offset, we do have to implement and
> maintain the code page stuff for different ISAs for each application and library
> if we did not use trampfd.
Trampoline code is architecture specific today, so I don't see that as a
major issue. Common structural bits can probably be shared even if the
specific machine code cannot.
[...]
> Security
> -----------
> 
> With the user level trampoline table approach, the data part of the trampoline table
> can be hacked by an attacker if an application has a vulnerability. Specifically, the
> target PC can be altered to some arbitrary location. Trampfd implements an
> "Allowed PCS" context. In the libffi changes, I have created a read-only array of
> all ABI handlers used in closures for each architecture. This read-only array
> can be used to restrict the PC values for libffi trampolines to prevent hacking.
> 
> To generalize, we can implement security rules/features if the trampoline
> object is in the kernel.
I don't follow this argument. If it's possible to statically define that
in the kernel, it's also possible to do that in userspace without any
new kernel support.
[...]
> Trampfd is a framework that can be used to implement multiple things. May be,
> a few of those things can also be implemented in user land itself. But I think having
> just one mechanism to execute dynamic code objects is preferable to having
> multiple mechanisms not standardized across all applications.
In abstract, having a common interface sounds nice, but in practice
elements of this are always architecture-specific (e.g. interactions
with HW CFI), and that common interface can result in more pain as it
doesn't fit naturally into the context that ISAs were designed for (e.g.
where control-flow instructions are extended with new semantics).
It also means that you can't share the rough approach across OSs which
do not implement an identical mechanism, so for code abstracting by ISA
first, then by platform/ABI, there isn't much saving.
Thanks,
Mark.
- * RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-31 18:31           ` Mark Rutland
@ 2020-08-03  8:27             ` David Laight
  2020-08-03 16:03               ` Madhavan T. Venkataraman
  2020-08-03 17:58             ` Madhavan T. Venkataraman
  1 sibling, 1 reply; 64+ messages in thread
From: David Laight @ 2020-08-03  8:27 UTC (permalink / raw)
  To: 'Mark Rutland', Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
From: Mark Rutland
> Sent: 31 July 2020 19:32
...
> > It requires PC-relative data references. I have not worked on all architectures.
> > So, I need to study this. But do all ISAs support PC-relative data references?
> 
> Not all do, but pretty much any recent ISA will as it's a practical
> necessity for fast position-independent code.
i386 has neither PC-relative addressing nor moves from %pc.
The cpu architecture knows that the sequence:
	call	1f  
1:	pop	%reg  
is used to get the %pc value so is treated specially so that
it doesn't 'trash' the return stack.
So PIC code isn't too bad, but you have to use the correct
sequence.
	David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply	[flat|nested] 64+ messages in thread 
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-03  8:27             ` David Laight
@ 2020-08-03 16:03               ` Madhavan T. Venkataraman
  2020-08-03 16:57                 ` David Laight
  0 siblings, 1 reply; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-03 16:03 UTC (permalink / raw)
  To: David Laight, 'Mark Rutland'
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
On 8/3/20 3:27 AM, David Laight wrote:
> From: Mark Rutland
>> Sent: 31 July 2020 19:32
> ...
>>> It requires PC-relative data references. I have not worked on all architectures.
>>> So, I need to study this. But do all ISAs support PC-relative data references?
>> Not all do, but pretty much any recent ISA will as it's a practical
>> necessity for fast position-independent code.
> i386 has neither PC-relative addressing nor moves from %pc.
> The cpu architecture knows that the sequence:
> 	call	1f  
> 1:	pop	%reg  
> is used to get the %pc value so is treated specially so that
> it doesn't 'trash' the return stack.
>
> So PIC code isn't too bad, but you have to use the correct
> sequence.
Is that true for 32-bit systems only? I thought RIP-relative addressing was
introduced in 64-bit mode. Please confirm.
Madhavan
^ permalink raw reply	[flat|nested] 64+ messages in thread 
- * RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-03 16:03               ` Madhavan T. Venkataraman
@ 2020-08-03 16:57                 ` David Laight
  2020-08-03 17:00                   ` Madhavan T. Venkataraman
  0 siblings, 1 reply; 64+ messages in thread
From: David Laight @ 2020-08-03 16:57 UTC (permalink / raw)
  To: 'Madhavan T. Venkataraman', 'Mark Rutland'
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
From: Madhavan T. Venkataraman
> Sent: 03 August 2020 17:03
> 
> On 8/3/20 3:27 AM, David Laight wrote:
> > From: Mark Rutland
> >> Sent: 31 July 2020 19:32
> > ...
> >>> It requires PC-relative data references. I have not worked on all architectures.
> >>> So, I need to study this. But do all ISAs support PC-relative data references?
> >> Not all do, but pretty much any recent ISA will as it's a practical
> >> necessity for fast position-independent code.
> > i386 has neither PC-relative addressing nor moves from %pc.
> > The cpu architecture knows that the sequence:
> > 	call	1f
> > 1:	pop	%reg
> > is used to get the %pc value so is treated specially so that
> > it doesn't 'trash' the return stack.
> >
> > So PIC code isn't too bad, but you have to use the correct
> > sequence.
> 
> Is that true for 32-bit systems only? I thought RIP-relative addressing was
> introduced in 64-bit mode. Please confirm.
I said i386 not amd64 or x86-64.
So yes, 64bit code has PC-relative addressing.
But I'm pretty sure it has no other way to get the PC itself
except using call - certainly nothing in the 'usual' instructions.
	David
^ permalink raw reply	[flat|nested] 64+ messages in thread 
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-03 16:57                 ` David Laight
@ 2020-08-03 17:00                   ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-03 17:00 UTC (permalink / raw)
  To: David Laight, 'Mark Rutland'
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
On 8/3/20 11:57 AM, David Laight wrote:
> From: Madhavan T. Venkataraman
>> Sent: 03 August 2020 17:03
>>
>> On 8/3/20 3:27 AM, David Laight wrote:
>>> From: Mark Rutland
>>>> Sent: 31 July 2020 19:32
>>> ...
>>>>> It requires PC-relative data references. I have not worked on all architectures.
>>>>> So, I need to study this. But do all ISAs support PC-relative data references?
>>>> Not all do, but pretty much any recent ISA will as it's a practical
>>>> necessity for fast position-independent code.
>>> i386 has neither PC-relative addressing nor moves from %pc.
>>> The cpu architecture knows that the sequence:
>>> 	call	1f
>>> 1:	pop	%reg
>>> is used to get the %pc value so is treated specially so that
>>> it doesn't 'trash' the return stack.
>>>
>>> So PIC code isn't too bad, but you have to use the correct
>>> sequence.
>> Is that true for 32-bit systems only? I thought RIP-relative addressing was
>> introduced in 64-bit mode. Please confirm.
> I said i386 not amd64 or x86-64.
I am sorry. My bad.
>
> So yes, 64bit code has PC-relative addressing.
> But I'm pretty sure it has no other way to get the PC itself
> except using call - certainly nothing in the 'usual' instructions.
OK.
Madhavan
^ permalink raw reply	[flat|nested] 64+ messages in thread 
 
 
 
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-31 18:31           ` Mark Rutland
  2020-08-03  8:27             ` David Laight
@ 2020-08-03 17:58             ` Madhavan T. Venkataraman
  2020-08-04 13:55               ` Mark Rutland
  1 sibling, 1 reply; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-03 17:58 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
On 7/31/20 1:31 PM, Mark Rutland wrote:
> On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
>> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
>>> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
>>> <madvenka@linux.microsoft.com> wrote:
>> Dealing with multiple architectures
>> -----------------------------------------------
>>
>> One good reason to use trampfd is multiple architecture support. The
>> trampoline table in a code page approach is neat. I don't deny that at
>> all. But my question is - can it be used in all cases?
>>
>> It requires PC-relative data references. I have not worked on all architectures.
>> So, I need to study this. But do all ISAs support PC-relative data references?
> Not all do, but pretty much any recent ISA will as it's a practical
> necessity for fast position-independent code.
So, two questions:
1. IIUC, for position independent code, we need PC-relative control transfers. I know that
    PC-relative control transfers are kinda fundamental. So, I expect most architectures
    support it. But to implement the trampoline table suggestion, we need PC-relative
    data references. Like:
    movq    X(%rip), %rax
2. Do you know which architectures do not support PC-relative data references? I am
    going to study this. But if you have some information, I would appreciate it.
In any case, I think we should support all of the architectures on which Linux currently
runs even if they are legacy.
>
>> Even in an ISA that supports it, there would be a maximum supported offset
>> from the current PC that can be reached for a data reference. That maximum
>> needs to be at least the size of a base page in the architecture. This is because
>> the code page and the data page need to be separate for security reasons.
>> Do all ISAs support a sufficiently large offset?
> ISAs with PC-relative addressing can usually generate PC-relative
> addresses into a GPR, from which they can apply an arbitrarily large
> offset.
I will study this. I need to nail down the list of architectures that cannot do this.
>
>> When the kernel generates the code for a trampoline, it can hard code data values
>> in the generated code itself so it does not need PC-relative data referencing.
>>
>> And, for ISAs that do support the large offset, we do have to implement and
>> maintain the code page stuff for different ISAs for each application and library
>> if we did not use trampfd.
> Trampoline code is architecture specific today, so I don't see that as a
> major issue. Common structural bits can probably be shared even if the
> specific machine code cannot.
True. But an implementor may prefer a standard mechanism provided by
the kernel so all of his architectures can be supported easily with less
effort.
If you look at the libffi reference patch I have included, the architecture
specific changes to use trampfd just involve a single C function call to
a common code function.
So, from the point of view of adoption, IMHO, the kernel provided method
is preferable.
>
> [...]
>
>> Security
>> -----------
>>
>> With the user level trampoline table approach, the data part of the trampoline table
>> can be hacked by an attacker if an application has a vulnerability. Specifically, the
>> target PC can be altered to some arbitrary location. Trampfd implements an
>> "Allowed PCs" context. In the libffi changes, I have created a read-only array of
>> all ABI handlers used in closures for each architecture. This read-only array
>> can be used to restrict the PC values for libffi trampolines to prevent hacking.
>>
>> To generalize, we can implement security rules/features if the trampoline
>> object is in the kernel.
> I don't follow this argument. If it's possible to statically define that
> in the kernel, it's also possible to do that in userspace without any
> new kernel support.
It is not statically defined in the kernel.
Let us take the libffi example. In the 64-bit X86 arch code, there are 3
ABI handlers:
    ffi_closure_unix64_sse
    ffi_closure_unix64
    ffi_closure_win64
I could create an "Allowed PCs" context like this:
struct my_allowed_pcs {
    struct trampfd_values    pcs;
    __u64                    pc_values[3];
};
const struct my_allowed_pcs    my_allowed_pcs = {
    { 3, 0 },
    {
        (uintptr_t) ffi_closure_unix64_sse,
        (uintptr_t) ffi_closure_unix64,
        (uintptr_t) ffi_closure_win64,
    },
};
I have created a read-only array of allowed ABI handlers that closures use.
When I set up the context for a closure trampoline, I could do this:
    pwrite(trampfd, &my_allowed_pcs, sizeof(my_allowed_pcs), TRAMPFD_ALLOWED_PCS_OFFSET);
   
This copies the array into the trampoline object in the kernel.
When the register context is set for the trampoline, the kernel checks
the PC register value against allowed PCs.
Because my_allowed_pcs is read-only, a hacker cannot modify it. So, the only
permitted target PCs enforced by the kernel are the ABI handlers.
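The check itself can be sketched in ordinary userspace C. This is only an illustration of the idea, not the code in the patch: tramp_sketch, pc_is_allowed(), set_target_pc() and the three handler stand-ins are hypothetical names, and the real comparison would happen in the kernel when the register context is written.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the ABI handlers; hypothetical, not the libffi symbols. */
static int handler_unix64(void)     { return 1; }
static int handler_unix64_sse(void) { return 2; }
static int unrelated_code(void)     { return 3; }

/* Read-only list of permitted targets; const data normally lands in .rodata. */
static int (*const allowed_pcs[])(void) = {
    handler_unix64_sse,
    handler_unix64,
};

/* Hypothetical trampoline object: only the target PC is modelled here. */
struct tramp_sketch {
    uintptr_t target_pc;
};

/* Return 1 if pc matches one of the allowed targets, 0 otherwise. */
static int pc_is_allowed(uintptr_t pc)
{
    size_t n = sizeof(allowed_pcs) / sizeof(allowed_pcs[0]);

    for (size_t i = 0; i < n; i++)
        if ((uintptr_t)allowed_pcs[i] == pc)
            return 1;
    return 0;
}

/* Accept a new target only if it is allowed; 0 on success, -1 on rejection. */
static int set_target_pc(struct tramp_sketch *t, uintptr_t pc)
{
    if (!pc_is_allowed(pc))
        return -1;
    t->target_pc = pc;
    return 0;
}
```

With this shape, set_target_pc(&t, (uintptr_t)handler_unix64) succeeds, while a call with the address of unrelated_code is rejected and leaves the previous target in place.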
>
> [...]
>
>> Trampfd is a framework that can be used to implement multiple things. Maybe,
>> a few of those things can also be implemented in user land itself. But I think having
>> just one mechanism to execute dynamic code objects is preferable to having
>> multiple mechanisms not standardized across all applications.
> In abstract, having a common interface sounds nice, but in practice
> elements of this are always architecture-specific (e.g. interactions
> with HW CFI), and that common interface can result in more pain as it
> doesn't fit naturally into the context that ISAs were designed for (e.g. 
> where control-flow instructions are extended with new semantics).
In the case of trampfd, the code generation is indeed architecture
specific. But that is in the kernel. The application is not affected by it.
Again, referring to the libffi reference patch, I have defined wrapper
functions for trampfd in common code. The architecture specific code
in libffi only calls the set_context function defined in common code.
Even this is required only because register names are specific to each
architecture and the target PC (to the ABI handler) is specific to
each architecture-ABI combo.
> It also means that you can't share the rough approach across OSs which
> do not implement an identical mechanism, so for code abstracting by ISA
> first, then by platform/ABI, there isn't much saving.
Why can you not share the same approach across OSes? In fact,
I have tried to design it so that other OSes can use the same
mechanism.
The only thing is that I have defined the API to be based on a file
descriptor since that is what is generally preferred by the Linux community
for a new API. If I were to implement it as a regular system call, the same
system call can be implemented in other OSes as well.
Thanks.
Madhavan
^ permalink raw reply	[flat|nested] 64+ messages in thread
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-03 17:58             ` Madhavan T. Venkataraman
@ 2020-08-04 13:55               ` Mark Rutland
  2020-08-04 14:33                 ` David Laight
  2020-08-04 15:46                 ` Madhavan T. Venkataraman
  0 siblings, 2 replies; 64+ messages in thread
From: Mark Rutland @ 2020-08-04 13:55 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
On Mon, Aug 03, 2020 at 12:58:04PM -0500, Madhavan T. Venkataraman wrote:
> On 7/31/20 1:31 PM, Mark Rutland wrote:
> > On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
> >> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
> >>> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
> >>> <madvenka@linux.microsoft.com> wrote:
> >> When the kernel generates the code for a trampoline, it can hard code data values
> >> in the generated code itself so it does not need PC-relative data referencing.
> >>
> >> And, for ISAs that do support the large offset, we do have to implement and
> >> maintain the code page stuff for different ISAs for each application and library
> >> if we did not use trampfd.
> > Trampoline code is architecture specific today, so I don't see that as a
> > major issue. Common structural bits can probably be shared even if the
> > specific machine code cannot.
> 
> True. But an implementor may prefer a standard mechanism provided by
> the kernel so all of his architectures can be supported easily with less
> effort.
> 
> If you look at the libffi reference patch I have included, the architecture
> specific changes to use trampfd just involve a single C function call to
> a common code function.
Sure, but in addition to that each architecture backend had to define a
set of arguments to that function. I view the C function as analogous to
the "common structural bits".
I appreciate that your patch is small today (and architectures seem to
largely align on what they need), but I don't think it's necessarily
true that things will remain so simple as architectures are extended and
their calling conventions evolve, and I also don't think it's clear that
this will work for more complex cases elsewhere.
[...]
> >> With the user level trampoline table approach, the data part of the trampoline table
> >> can be hacked by an attacker if an application has a vulnerability. Specifically, the
> >> target PC can be altered to some arbitrary location. Trampfd implements an
> >> "Allowed PCs" context. In the libffi changes, I have created a read-only array of
> >> all ABI handlers used in closures for each architecture. This read-only array
> >> can be used to restrict the PC values for libffi trampolines to prevent hacking.
> >>
> >> To generalize, we can implement security rules/features if the trampoline
> >> object is in the kernel.
> > I don't follow this argument. If it's possible to statically define that
> > in the kernel, it's also possible to do that in userspace without any
> > new kernel support.
> It is not statically defined in the kernel.
> 
> Let us take the libffi example. In the 64-bit X86 arch code, there are 3
> ABI handlers:
> 
>     ffi_closure_unix64_sse
>     ffi_closure_unix64
>     ffi_closure_win64
> 
> I could create an "Allowed PCs" context like this:
> 
> struct my_allowed_pcs {
>     struct trampfd_values    pcs;
>     __u64                             pc_values[3];
> };
> 
> const struct my_allowed_pcs    my_allowed_pcs = {
>     { 3, 0 },
>     (uintptr_t) ffi_closure_unix64_sse,
>     (uintptr_t) ffi_closure_unix64,
>     (uintptr_t) ffi_closure_win64,
> };
> 
> I have created a read-only array of allowed ABI handlers that closures use.
> 
> When I set up the context for a closure trampoline, I could do this:
> 
>     pwrite(trampfd, &my_allowed_pcs, sizeof(my_allowed_pcs), TRAMPFD_ALLOWED_PCS_OFFSET);
>    
> This copies the array into the trampoline object in the kernel.
> When the register context is set for the trampoline, the kernel checks
> the PC register value against allowed PCs.
> 
> Because my_allowed_pcs is read-only, a hacker cannot modify it. So, the only
> permitted target PCs enforced by the kernel are the ABI handlers.
Sorry, when I said "statically define" I meant whether you know the
legitimate targets ahead of time when you create the trampoline (i.e.
whether you could enumerate those and know they would not change dynamically).
My point was that you can achieve the same in userspace if the
trampoline and array of legitimate targets are in read-only memory,
without having to trap to the kernel.
I think the key point here is that an adversary must be prevented from
altering a trampoline and any associated metadata, and I think that
there are ways of achieving that without having to trap into the kernel,
and without the kernel having to be intimately aware of the calling
conventions used in userspace.
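As a rough illustration of the userspace side of that idea, the sketch below copies a list of legitimate targets into its own mapping and then drops write permission with mprotect(). It assumes Linux or another POSIX system with anonymous mmap(); target_table, seal_targets() and target_is_allowed() are hypothetical names, and the sketch does nothing about an adversary who can call mprotect() themselves, which is where stronger enforcement of memory permissions would come in.

```c
#define _DEFAULT_SOURCE           /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical table of legitimate trampoline targets. */
struct target_table {
    size_t    count;
    uintptr_t targets[];
};

/*
 * Copy the targets into a fresh anonymous mapping and make it read-only.
 * Returns NULL on failure.
 */
static const struct target_table *seal_targets(const uintptr_t *targets,
                                               size_t count)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    size_t need = sizeof(struct target_table) + count * sizeof(uintptr_t);
    size_t len = (need + (size_t)page - 1) / (size_t)page * (size_t)page;

    struct target_table *t = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (t == MAP_FAILED)
        return NULL;

    t->count = count;
    memcpy(t->targets, targets, count * sizeof(uintptr_t));

    /* From here on, a stray write to the table faults instead of succeeding. */
    if (mprotect(t, len, PROT_READ) != 0) {
        munmap(t, len);
        return NULL;
    }
    return t;
}

/* Return 1 if pc is listed in the sealed table, 0 otherwise. */
static int target_is_allowed(const struct target_table *t, uintptr_t pc)
{
    for (size_t i = 0; i < t->count; i++)
        if (t->targets[i] == pc)
            return 1;
    return 0;
}
```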
[...]
> >> Trampfd is a framework that can be used to implement multiple things. Maybe,
> >> a few of those things can also be implemented in user land itself. But I think having
> >> just one mechanism to execute dynamic code objects is preferable to having
> >> multiple mechanisms not standardized across all applications.
> > In abstract, having a common interface sounds nice, but in practice
> > elements of this are always architecture-specific (e.g. interactions
> > with HW CFI), and that common interface can result in more pain as it
> > doesn't fit naturally into the context that ISAs were designed for (e.g. 
> > where control-flow instructions are extended with new semantics).
> 
> In the case of trampfd, the code generation is indeed architecture
> specific. But that is in the kernel. The application is not affected by it.
As an ABI detail, applications are *definitely* affected by this, and it
is wrong to suggest they are not even if you don't have a specific case
in mind today. As this forms a contract between userspace and the kernel
it's overly simplistic to say that it's the kernel's problem.
For example, in the case of BTI on arm64, what should the trampoline
set PSTATE.BTYPE to? Different use-cases *will* want different values,
and not necessarily the value of PSTATE at the instant the call to the
trampoline was made. In the case of libffi specifically using the
original value of PSTATE.BTYPE probably is sound, but other code
sequences may need to restrict/broaden or entirely change that.
> Again, referring to the libffi reference patch, I have defined wrapper
> functions for trampfd in common code. The architecture specific code
> in libffi only calls the set_context function defined in common code.
> Even this is required only because register names are specific to each
> architecture and the target PC (to the ABI handler) is specific to
> each architecture-ABI combo.
> 
> > It also means that you can't share the rough approach across OSs which
> > do not implement an identical mechanism, so for code abstracting by ISA
> > first, then by platform/ABI, there isn't much saving.
> 
> Why can you not share the same approach across OSes? In fact,
> I have tried to design it so that other OSes can use the same
> mechanism.
Sure, but where they *don't*, you must fall back to the existing
purely-userspace mechanisms, and so a codebase now has the burden of
maintaining two distinct mechanisms.
Whereas if there's a way of doing this in userspace with (stronger)
enforcement of memory permissions the trampoline code can be common for
when this is present or absent, which is much easier for a codebase to
maintain, and could make use of weaker existing mechanisms to improve
the situation on systems without the new functionality.
Thanks,
Mark.
^ permalink raw reply	[flat|nested] 64+ messages in thread
- * RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-04 13:55               ` Mark Rutland
@ 2020-08-04 14:33                 ` David Laight
  2020-08-04 14:44                   ` David Laight
  2020-08-04 14:48                   ` Madhavan T. Venkataraman
  2020-08-04 15:46                 ` Madhavan T. Venkataraman
  1 sibling, 2 replies; 64+ messages in thread
From: David Laight @ 2020-08-04 14:33 UTC (permalink / raw)
  To: 'Mark Rutland', Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
> > If you look at the libffi reference patch I have included, the architecture
> > specific changes to use trampfd just involve a single C function call to
> > a common code function.
No idea what libffi is, but it must surely be simpler to
rewrite it to avoid nested function definitions.
Or find a book from the 1960s on how to do recursive
calls and nested functions in FORTRAN-IV.
	David
^ permalink raw reply	[flat|nested] 64+ messages in thread 
- * RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-04 14:33                 ` David Laight
@ 2020-08-04 14:44                   ` David Laight
  2020-08-04 14:48                   ` Madhavan T. Venkataraman
  1 sibling, 0 replies; 64+ messages in thread
From: David Laight @ 2020-08-04 14:44 UTC (permalink / raw)
  To: David Laight, 'Mark Rutland', Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
> > > If you look at the libffi reference patch I have included, the architecture
> > > specific changes to use trampfd just involve a single C function call to
> > > a common code function.
> 
> No idea what libffi is, but it must surely be simpler to
> rewrite it to avoid nested function definitions.
> 
> Or find a book from the 1960s on how to do recursive
> calls and nested functions in FORTRAN-IV.
FWIW it is probably as simple as:
1) Put all the 'variables' the nested function accesses into a struct.
2) Add a field for the address of the 'nested' function.
3) Pass the address of the structure down instead of the
   address of the function.
If you aren't in control of the call sites then add the
structure to a linked list on a thread-local variable.
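Something like the following, as a sketch of steps 1 to 3 only. The names closure, add_base(), apply_twice() and run_example() are made up for illustration, and the thread-local list for call sites you do not control is not shown.

```c
#include <stddef.h>

/* Steps 1 and 2: the captured 'variables' plus the address of the function. */
struct closure {
    int (*fn)(struct closure *self, int arg);
    int base;    /* value the nested function would have read from its parent */
    int calls;   /* state the nested function would have updated in its parent */
};

/* The former nested function, now taking its context explicitly. */
static int add_base(struct closure *self, int arg)
{
    self->calls++;
    return self->base + arg;
}

/* Step 3: the callee receives the structure, not a bare function pointer. */
static int apply_twice(struct closure *c, int x)
{
    return c->fn(c, c->fn(c, x));
}

/* The former parent function: build the context and pass its address down. */
static int run_example(int base, int x)
{
    struct closure c = { .fn = add_base, .base = base, .calls = 0 };

    return apply_twice(&c, x);
}
```

No code is generated at run time, so nothing needs a writable and executable page.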
	David
^ permalink raw reply	[flat|nested] 64+ messages in thread 
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-04 14:33                 ` David Laight
  2020-08-04 14:44                   ` David Laight
@ 2020-08-04 14:48                   ` Madhavan T. Venkataraman
  1 sibling, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-04 14:48 UTC (permalink / raw)
  To: David Laight, 'Mark Rutland'
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
On 8/4/20 9:33 AM, David Laight wrote:
>>> If you look at the libffi reference patch I have included, the architecture
>>> specific changes to use trampfd just involve a single C function call to
>>> a common code function.
> No idea what libffi is, but it must surely be simpler to
> rewrite it to avoid nested function definitions.
Sorry if I wasn't clear.
libffi is a separate use case and GCC nested functions is a separate one.
libffi is not used to solve the nested function stuff.
For nested functions, GCC generates trampoline code and arranges to
place it on the stack and execute it.
I agree with your other points about nested function implementation.
Madhavan
> Or find a book from the 1960s on how to do recursive
> calls and nested functions in FORTRAN-IV.
>
> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
^ permalink raw reply	[flat|nested] 64+ messages in thread 
 
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-04 13:55               ` Mark Rutland
  2020-08-04 14:33                 ` David Laight
@ 2020-08-04 15:46                 ` Madhavan T. Venkataraman
  1 sibling, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-04 15:46 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
Hey Mark,
I am working on putting together an improved definition of trampfd per
Andy's comment. I will try to address your comments in that improved
definition. Once I send that out, I will respond to your emails as well.
Thanks.
Madhavan
On 8/4/20 8:55 AM, Mark Rutland wrote:
> On Mon, Aug 03, 2020 at 12:58:04PM -0500, Madhavan T. Venkataraman wrote:
>> On 7/31/20 1:31 PM, Mark Rutland wrote:
>>> On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
>>>> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
>>>>> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
>>>>> <madvenka@linux.microsoft.com> wrote:
>>>> When the kernel generates the code for a trampoline, it can hard code data values
>>>> in the generated code itself so it does not need PC-relative data referencing.
>>>>
>>>> And, for ISAs that do support the large offset, we do have to implement and
>>>> maintain the code page stuff for different ISAs for each application and library
>>>> if we did not use trampfd.
>>> Trampoline code is architecture specific today, so I don't see that as a
>>> major issue. Common structural bits can probably be shared even if the
>>> specific machine code cannot.
>> True. But an implementor may prefer a standard mechanism provided by
>> the kernel so all of his architectures can be supported easily with less
>> effort.
>>
>> If you look at the libffi reference patch I have included, the architecture
>> specific changes to use trampfd just involve a single C function call to
>> a common code function.
> Sure, but in addition to that each architecture backend had to define a
> set of arguments to that function. I view the C function as analogous to
> the "common structural bits".
>
> I appreciate that your patch is small today (and architectures seem to
> largely align on what they need), but I don't think it's necessarily
> true that things will remain so simple as architectures are extended and
> their calling conventions evolve, and I also don't think it's clear that
> this will work for more complex cases elsewhere.
>
> [...]
>
>>>> With the user level trampoline table approach, the data part of the trampoline table
>>>> can be hacked by an attacker if an application has a vulnerability. Specifically, the
>>>> target PC can be altered to some arbitrary location. Trampfd implements an
>>>> "Allowed PCs" context. In the libffi changes, I have created a read-only array of
>>>> all ABI handlers used in closures for each architecture. This read-only array
>>>> can be used to restrict the PC values for libffi trampolines to prevent hacking.
>>>>
>>>> To generalize, we can implement security rules/features if the trampoline
>>>> object is in the kernel.
>>> I don't follow this argument. If it's possible to statically define that
>>> in the kernel, it's also possible to do that in userspace without any
>>> new kernel support.
>> It is not statically defined in the kernel.
>>
>> Let us take the libffi example. In the 64-bit X86 arch code, there are 3
>> ABI handlers:
>>
>>     ffi_closure_unix64_sse
>>     ffi_closure_unix64
>>     ffi_closure_win64
>>
>> I could create an "Allowed PCs" context like this:
>>
>> struct my_allowed_pcs {
>>     struct trampfd_values    pcs;
>>     __u64                             pc_values[3];
>> };
>>
>> const struct my_allowed_pcs    my_allowed_pcs = {
>>     { 3, 0 },
>>     (uintptr_t) ffi_closure_unix64_sse,
>>     (uintptr_t) ffi_closure_unix64,
>>     (uintptr_t) ffi_closure_win64,
>> };
>>
>> I have created a read-only array of allowed ABI handlers that closures use.
>>
>> When I set up the context for a closure trampoline, I could do this:
>>
>>     pwrite(trampfd, &my_allowed_pcs, sizeof(my_allowed_pcs), TRAMPFD_ALLOWED_PCS_OFFSET);
>>    
>> This copies the array into the trampoline object in the kernel.
>> When the register context is set for the trampoline, the kernel checks
>> the PC register value against allowed PCs.
>>
>> Because my_allowed_pcs is read-only, a hacker cannot modify it. So, the only
>> permitted target PCs enforced by the kernel are the ABI handlers.
> Sorry, when I said "statically define" I meant whether you know the
> legitimate targets ahead of time when you create the trampoline (i.e.
> whether you could enumerate those and know they would not change dynamically).
>
> My point was that you can achieve the same in userspace if the
> trampoline and array of legitimate targets are in read-only memory,
> without having to trap to the kernel.
>
> I think the key point here is that an adversary must be prevented from
> altering a trampoline and any associated metadata, and I think that
> there are ways of achieving that without having to trap into the kernel,
> and without the kernel having to be intimately aware of the calling
> conventions used in userspace.
>
> [...]
>
>>>> Trampfd is a framework that can be used to implement multiple things. Maybe,
>>>> a few of those things can also be implemented in user land itself. But I think having
>>>> just one mechanism to execute dynamic code objects is preferable to having
>>>> multiple mechanisms not standardized across all applications.
>>> In abstract, having a common interface sounds nice, but in practice
>>> elements of this are always architecture-specific (e.g. interactions
>>> with HW CFI), and that common interface can result in more pain as it
>>> doesn't fit naturally into the context that ISAs were designed for (e.g. 
>>> where control-flow instructions are extended with new semantics).
>> In the case of trampfd, the code generation is indeed architecture
>> specific. But that is in the kernel. The application is not affected by it.
> As an ABI detail, applications are *definitely* affected by this, and it
> is wrong to suggest they are not even if you don't have a specific case
> in mind today. As this forms a contract between userspace and the kernel
> it's overly simplistic to say that it's the kernel's problem.
>
> For example, in the case of BTI on arm64, what should the trampoline
> set PSTATE.BTYPE to? Different use-cases *will* want different values,
> and not necessarily the value of PSTATE at the instant the call to the
> trampoline was made. In the case of libffi specifically using the
> original value of PSTATE.BTYPE probably is sound, but other code
> sequences may need to restrict/broaden or entirely change that.
>
>> Again, referring to the libffi reference patch, I have defined wrapper
>> functions for trampfd in common code. The architecture specific code
>> in libffi only calls the set_context function defined in common code.
>> Even this is required only because register names are specific to each
>> architecture and the target PC (to the ABI handler) is specific to
>> each architecture-ABI combo.
>>
>>> It also means that you can't share the rough approach across OSs which
>>> do not implement an identical mechanism, so for code abstracting by ISA
>>> first, then by platform/ABI, there isn't much saving.
>> Why can you not share the same approach across OSes? In fact,
>> I have tried to design it so that other OSes can use the same
>> mechanism.
> Sure, but where they *don't*, you must fall back to the existing
> purely-userspace mechanisms, and so a codebase now has the burden of
> maintaining two distinct mechanisms.
>
> Whereas if there's a way of doing this in userspace with (stronger)
> enforcement of memory permissions the trampoline code can be common for
> when this is present or absent, which is much easier for a codebase to
> maintain, and could make use of weaker existing mechanisms to improve
> the situation on systems without the new functionality.
>
> Thanks,
> Mark.
 
 
 
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-31 17:13         ` Madhavan T. Venkataraman
  2020-07-31 18:31           ` Mark Rutland
@ 2020-08-02 13:57           ` Florian Weimer
  1 sibling, 0 replies; 64+ messages in thread
From: Florian Weimer @ 2020-08-02 13:57 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
* Madhavan T. Venkataraman:
> Standardization
> ---------------------
>
> Trampfd is a framework that can be used to implement multiple
> things. Maybe, a few of those things can also be implemented in
> user land itself. But I think having just one mechanism to execute
> dynamic code objects is preferable to having multiple mechanisms not
> standardized across all applications.
>
> As an example, let us say that I am able to implement support for
> JIT code. Let us say that an interpreter uses libffi to execute a
> generated function. The interpreter would use trampfd for the JIT
> code object and get an address. Then, it would pass that to libffi
> which would then use trampfd for the trampoline. So, trampfd based
> code objects can be chained.
There is certainly value in coordination.  For example, it would be
nice if unwinders could recognize the trampolines during all phases
and unwind correctly through them (including when interrupted by an
asynchronous signal).  That requires some level of coordination with
the unwinder and dynamic linker.
A kernel solution could hide the intermediate state in a kernel-side
trap handler, but I think it wouldn't reduce the overall complexity.
 
 
 
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:31   ` Andy Lutomirski
                       ` (4 preceding siblings ...)
       [not found]     ` <6540b4b7-3f70-adbf-c922-43886599713a@linux.microsoft.com>
@ 2020-08-02 18:54     ` Madhavan T. Venkataraman
  2020-08-02 20:00       ` Andy Lutomirski
  2020-08-03  8:23       ` David Laight
  5 siblings, 2 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-02 18:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
More responses inline..
On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>
> 2. Use existing kernel functionality.  Raise a signal, modify the
> state, and return from the signal.  This is very flexible and may not
> be all that much slower than trampfd.
Let me understand this. You are saying that the trampoline code
would raise a signal and, in the signal handler, set up the context
so that when the signal handler returns, we end up in the target
function with the context correctly set up. And, this trampoline code
can be generated statically at build time so that there are no
security issues using it.
Have I understood your suggestion correctly?
So, my argument would be that this would always incur the overhead
of a trip to the kernel. I think twice the overhead if I am not mistaken.
With trampfd, we can have the kernel generate the code so that there
is no performance penalty at all.
Signals have many problems. Which signal number should we use for this
purpose? If we use an existing one, that might conflict with what the application
is already handling. Getting a new signal number for this could meet
with resistance from the community.
Also, signals are asynchronous. So, they are vulnerable to race conditions.
To prevent other signals from coming in while handling the raised signal,
we would need to block and unblock signals. This will cause more
overhead.
> 3. Use a syscall.  Instead of having the kernel handle page faults,
> have the trampoline code push the syscall nr register, load a special
> new syscall nr into the syscall nr register, and do a syscall. On
> x86_64, this would be:
>
> pushq %rax
> movq $__NR_magic_trampoline, %rax
> syscall
>
> with some adjustment if the stack slot you're clobbering is important.
How is this better than the kernel handling an address fault?
The system call still needs to do the same work as the fault handler.
We do need to specify the register and stack contexts beforehand
so the system call can do its job.
Also, this always incurs a trip to the kernel. With trampfd, the kernel
could generate the code to avoid the performance penalty.
>
> Also, will using trampfd cause issues with various unwinders?  I can
> easily imagine unwinders expecting code to be readable, although this
> is slowly going away for other reasons.
I need to study unwinders a little before I respond to this question.
So, bear with me.
> All this being said, I think that the kernel should absolutely add a
> sensible interface for JITs to use to materialize their code.  This
> would integrate sanely with LSMs and wouldn't require hacks like using
> files, etc.  A cleverly designed JIT interface could function without
> serialization IPIs, and even lame architectures like x86 could
> potentially avoid shootdown IPIs if the interface copied code instead
> of playing virtual memory games.  At its very simplest, this could be:
>
> void *jit_create_code(const void *source, size_t len);
>
> and the result would be a new anonymous mapping that contains exactly
> the code requested.  There could also be:
>
> int jittfd_create(...);
>
> that does something similar but creates a memfd.  A nicer
> implementation for short JIT sequences would allow appending more code
> to an existing JIT region.  On x86, an appendable JIT region would
> start filled with 0xCC, and I bet there's a way to materialize new
> code into a previously 0xcc-filled virtual page without any
> synchronization.  One approach would be to start with:
>
> <some code>
> 0xcc
> 0xcc
> ...
> 0xcc
>
> and to create a whole new page like:
>
> <some code>
> <some more code>
> 0xcc
> ...
> 0xcc
>
> so that the only difference is that some code changed to some more
> code.  Then replace the PTE to swap from the old page to the new page,
> and arrange to avoid freeing the old page until we're sure it's gone
> from all TLBs.  This may not work if <some more code> spans a page
> boundary.  The #BP fixup would zap the TLB and retry.  Even just
> directly copying code over some 0xcc bytes almost works, but there's a
> nasty corner case involving instructions that cross I$ fetch
> boundaries.  I'm not sure to what extent I$ snooping helps.
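To make the quoted page-swap idea concrete, here is a minimal sketch of how the replacement page could be built. It only models the byte layout (old code kept, new code appended, the rest still 0xcc). The PTE swap and TLB handling would be the kernel's job and are not modeled. The names jit_page_append, JIT_PAGE_SIZE and JIT_FILL are made up for illustration, and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define JIT_PAGE_SIZE 4096u     /* assumed page size */
#define JIT_FILL      0xCC      /* x86 int3, the filler byte from the proposal */

/*
 * Hypothetical helper: fill new_page so that it equals old_page except that
 * the first extra_len filler bytes after 'used' become new code.
 * Returns 0 on success, -1 if the appended code would not fit in the page.
 */
int jit_page_append(uint8_t *new_page, const uint8_t *old_page, size_t used,
                    const uint8_t *extra, size_t extra_len)
{
    assert(new_page != NULL && old_page != NULL && extra != NULL);

    if (used > JIT_PAGE_SIZE || extra_len > JIT_PAGE_SIZE - used)
        return -1;

    memcpy(new_page, old_page, used);                   /* <some code> */
    memcpy(new_page + used, extra, extra_len);          /* <some more code> */
    memset(new_page + used + extra_len, JIT_FILL,
           JIT_PAGE_SIZE - used - extra_len);           /* remaining 0xcc */
    return 0;
}
```

Because the old bytes are never modified, a CPU still running from the old page sees either valid code or 0xcc, which is the property the proposal relies on.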
I am thinking that the trampfd API can be used for addressing JIT
code as well. I have not yet started thinking about the details. But I
think the API is sufficient. E.g.,
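    struct trampfd_jit {
        void    *source;    /* code to be materialized */
        size_t    len;      /* length of the code in bytes */
    };
    struct trampfd_jit    jit;
    struct trampfd_map    map;
    void    *addr;
    int    fd;
    jit.source = blah;
    jit.len = blah;
    /* 440 is the trampfd system call number assumed here */
    fd = syscall(440, TRAMPFD_JIT, &jit, flags);
    /* Ask the kernel how the trampfd object should be mapped */
    pread(fd, &map, sizeof(map), TRAMPFD_MAP_OFFSET);
    addr = mmap(NULL, map.size, map.prot, map.flags, fd, map.offset);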
And addr would be used to invoke the generated JIT code.
Madhavan
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-02 18:54     ` Madhavan T. Venkataraman
@ 2020-08-02 20:00       ` Andy Lutomirski
  2020-08-02 22:58         ` Madhavan T. Venkataraman
                           ` (2 more replies)
  2020-08-03  8:23       ` David Laight
  1 sibling, 3 replies; 64+ messages in thread
From: Andy Lutomirski @ 2020-08-02 20:00 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
On Sun, Aug 2, 2020 at 11:54 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
> More responses inline..
>
> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
> >> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
> >>
> >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> >>
> >
> > 2. Use existing kernel functionality.  Raise a signal, modify the
> > state, and return from the signal.  This is very flexible and may not
> > be all that much slower than trampfd.
>
> Let me understand this. You are saying that the trampoline code
> would raise a signal and, in the signal handler, set up the context
> so that when the signal handler returns, we end up in the target
> function with the context correctly set up. And, this trampoline code
> can be generated statically at build time so that there are no
> security issues using it.
>
> Have I understood your suggestion correctly?
yes.
>
> So, my argument would be that this would always incur the overhead
> of a trip to the kernel. I think twice the overhead if I am not mistaken.
> With trampfd, we can have the kernel generate the code so that there
> is no performance penalty at all.
I feel like trampfd is too poorly defined at this point to evaluate.
There are three general things it could do.  It could generate actual
code that varies by instance.  It could have static code that does not
vary.  And it could actually involve a kernel entry.
If it involves a kernel entry, then it's slow.  Maybe this is okay for
some use cases.
If it involves only static code, I see no good reason that it should
be in the kernel.
If it involves dynamic code, then I think it needs a clearly defined
use case that actually requires dynamic code.
> Also, signals are asynchronous. So, they are vulnerable to race conditions.
> To prevent other signals from coming in while handling the raised signal,
> we would need to block and unblock signals. This will cause more
> overhead.
If you're worried about raise() racing against signals from out of
thread, you have bigger problems to deal with.
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-02 20:00       ` Andy Lutomirski
@ 2020-08-02 22:58         ` Madhavan T. Venkataraman
  2020-08-03 18:36         ` Madhavan T. Venkataraman
  2020-08-10 17:34         ` Madhavan T. Venkataraman
  2 siblings, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-02 22:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
On 8/2/20 3:00 PM, Andy Lutomirski wrote:
> On Sun, Aug 2, 2020 at 11:54 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>> More responses inline..
>>
>> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>>>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>>>
>>>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>>>
>>> 2. Use existing kernel functionality.  Raise a signal, modify the
>>> state, and return from the signal.  This is very flexible and may not
>>> be all that much slower than trampfd.
>> Let me understand this. You are saying that the trampoline code
>> would raise a signal and, in the signal handler, set up the context
>> so that when the signal handler returns, we end up in the target
>> function with the context correctly set up. And, this trampoline code
>> can be generated statically at build time so that there are no
>> security issues using it.
>>
>> Have I understood your suggestion correctly?
> yes.
>
>> So, my argument would be that this would always incur the overhead
>> of a trip to the kernel. I think twice the overhead if I am not mistaken.
>> With trampfd, we can have the kernel generate the code so that there
>> is no performance penalty at all.
> I feel like trampfd is too poorly defined at this point to evaluate.
> There are three general things it could do.  It could generate actual
> code that varies by instance.  It could have static code that does not
> vary.  And it could actually involve a kernel entry.
>
> If it involves a kernel entry, then it's slow.  Maybe this is okay for
> some use cases.
Yes. IMO, it is OK for most cases except where dynamic code
is used specifically for enhancing performance such as interpreters
using JIT code for frequently executed sequences and dynamic
binary translation.
> If it involves only static code, I see no good reason that it should
> be in the kernel.
It does not involve only static code. This is meant for dynamic code.
However, see below.
> If it involves dynamic code, then I think it needs a clearly defined
> use case that actually requires dynamic code.
Fair enough. I will work on this and get back to you. This might take
a little time. So, bear with me.
But I would like to make one point here. There are many applications
and libraries out there that use trampolines. They may all require the
same sort of things:
    - set register context
    - push stuff on stack
    - jump to a target PC
But in each case, the context would be different:
    - only register context
    - only stack context
    - both register and stack contexts
    - different registers
    - different values pushed on the stack
    - different target PCs
If we had to do this purely at user level, each application/library would
need to roll its own solution, and that solution would have to be
implemented and maintained for each supported architecture. While the code
is static in each separate case, it is dynamic across all of them.
That is, the kernel will generate the code on the fly for each trampoline
instance based on its current context. It will not maintain any static
trampoline code at all.
Basically, it will supply the context to an arch-specific function and say:
    - generate instructions for loading these regs with these values
    - generate instructions to push these values on the stack
    - generate an instruction to jump to this target PC
It will place all of those generated instructions on a page and return the address.
So, even with the static case, there is a lot of value in the kernel providing
this. Plus, it has the framework to handle dynamic code.
>> Also, signals are asynchronous. So, they are vulnerable to race conditions.
>> To prevent other signals from coming in while handling the raised signal,
>> we would need to block and unblock signals. This will cause more
>> overhead.
> If you're worried about raise() racing against signals from out of
> thread, you have bigger problems to deal with.
Agreed. The signal blocking is just one example of problems related
to signals. There are other bigger problems as well. So, let us remove
the signal-based approach from our discussions.
Thanks.
Madhavan
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-02 20:00       ` Andy Lutomirski
  2020-08-02 22:58         ` Madhavan T. Venkataraman
@ 2020-08-03 18:36         ` Madhavan T. Venkataraman
  2020-08-10 17:34         ` Madhavan T. Venkataraman
  2 siblings, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-03 18:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
On 8/2/20 3:00 PM, Andy Lutomirski wrote:
> I feel like trampfd is too poorly defined at this point to evaluate.
Point taken. It is because I wanted to start with something small
and specific and expand it in the future. So, I did not really describe the big
picture - the overall vision, future work, that sort of thing. In retrospect,
maybe, I should have done that.
I will take all of the input I have received so far and all of the responses
I have given, refine the definition of trampfd and send it out. Please
review that and let me know if anything is still missing from the
definition.
Thanks.
Madhavan
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-02 20:00       ` Andy Lutomirski
  2020-08-02 22:58         ` Madhavan T. Venkataraman
  2020-08-03 18:36         ` Madhavan T. Venkataraman
@ 2020-08-10 17:34         ` Madhavan T. Venkataraman
  2020-08-11 21:12           ` Madhavan T. Venkataraman
  2 siblings, 1 reply; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-10 17:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
Resending because of mailer problems. Some of the recipients did not receive
my email. I apologize. Sigh.
Here is a redefinition of trampfd based on review comments.
I wanted to address dynamic code in 3 different ways:
    Remove the need for dynamic code where possible
    --------------------------------------------------------------------
    If the kernel itself can perform the work of some dynamic code, then
    the code can be replaced by the kernel.
    This is what I implemented in the patchset. But reviewers objected
    to the performance impact. One trip to the kernel was needed for each
    trampoline invocation. So, I have decided to defer this approach.
    Convert dynamic code to static code where possible
    ----------------------------------------------------------------------
    This is possible with help from the kernel. This has no performance
    impact and can be used in libffi, GCC nested functions, etc. I have
    described the approach below.
    Deal with code generation
    -----------------------------------
    For cases like generating JIT code from Java byte code, I wanted to
    establish a framework. However, reviewers felt that details are missing.
    Should the kernel generate code or should it use a user-level code generator?
    How do you make sure that a user level code generator can be trusted?
    How would the communication work? ABI details? Architecture support?
    Support for different types - JIT, DBT, etc?
    I have come to the conclusion that this is best done separately.
My main interest is to provide a way to convert dynamic code such as
trampolines to static code without any special architecture support.
This can be done with the kernel's help. Any code that gets written in
the future can conform to this as well.
So, in version 2 of the Trampfd RFC, I would like to simplify trampfd and
just address item 2. I will reimplement the support in libffi and present it.
Convert dynamic code to static code
------------------------------------------------
One problem with dynamic code is that it cannot be verified or authenticated
by the kernel. The kernel cannot tell the difference between genuine dynamic
code and an attacker's code. Where possible, dynamic code should be converted
to static code and placed in the text segment of a binary file. This allows
the kernel to verify the code by verifying the signature of the file.
The other problem is that user-level methods to load and execute dynamic code
can potentially be exploited by an attacker to inject his code and have it
executed. To prevent this, a system may enforce W^X. If W^X is enforced
properly, genuine dynamic code will not be able to run. This is another
reason to convert dynamic code to static code.
The issue in converting dynamic code to static code is that the data is
dynamic. The code does not know beforehand where the data is going to be
at runtime.
Some architectures support PC-relative data references. So, if you co-locate
code and data, then the code can find the data at runtime. But this is not
supported on all architectures. When supported, there may be limitations to
deal with. Plus you have to take the trouble to co-locate code and data.
And, to deal with W^X, code and data need to be in different pages.
All architectures must be supported without any limitations. Fortunately,
the kernel can solve this problem quite easily. I suggest the following:
Convert dynamic code to static code like this:
    - Decide which register should point to the data that the code needs.
      Call it register R.
    - Write the static code assuming that R already points to the data.
    - Use trampfd and pass the following to the kernel:
        - pointers to the code and data
        - the name of the register R
The kernel will write the following instructions in a trampoline page
mapped into the caller's address space with R-X.
    - Load the data address in register R
    - Jump to the static code
Basically, the kernel provides a trampoline to jump to the user's code
and returns the kernel-provided trampoline's address to the user.
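As an illustration of the two steps above, here is a minimal sketch of the bytes such a trampoline could consist of on x86-64. It is only an assumption about what an implementation might emit, not what the kernel does: R is taken to be %r10, %r11 is used as a scratch register for the jump, and the function name emit_trampoline is made up. The sketch only encodes the instructions into a buffer; it does not map or execute them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TRAMP_SIZE = 23 };       /* 10 + 10 + 3 bytes */

/* Store a 64-bit immediate in little-endian order; returns bytes written. */
static size_t put_imm64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (8 * i));
    return 8;
}

/*
 * Hypothetical encoder: writes "load data address in R, jump to static code"
 * into buf. Returns the number of bytes written, or 0 if buf is too small.
 */
size_t emit_trampoline(uint8_t *buf, size_t cap,
                       uint64_t data_addr, uint64_t code_addr)
{
    size_t n = 0;

    assert(buf != NULL);
    if (cap < TRAMP_SIZE)
        return 0;

    buf[n++] = 0x49; buf[n++] = 0xBA;       /* movabs $data_addr, %r10 */
    n += put_imm64(buf + n, data_addr);
    buf[n++] = 0x49; buf[n++] = 0xBB;       /* movabs $code_addr, %r11 */
    n += put_imm64(buf + n, code_addr);
    buf[n++] = 0x41; buf[n++] = 0xFF;       /* jmp *%r11 */
    buf[n++] = 0xE3;
    return n;
}
```

The static code at code_addr would then be written on the assumption that %r10 already points to its data, as described in the list above.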
It is trivial to implement a trampoline table in the trampoline page to
conserve memory.
Issues raised previously
-------------------------------
I believe that the following issues that were raised by reviewers are not
problems in this scheme. Please re-review.
    - Florian mentioned the libffi trampoline table. Trampoline tables can be
      implemented in this scheme easily.
    - Florian mentioned stack unwinders. I am not an expert on unwinders.
      But I don't see an issue with unwinders.
    - Mark Rutland mentioned Intel's CET and CFI. Don't see a problem there.
    - Mark Rutland mentioned PAC+BTI on ARM64. Don't see a problem there.
If I have missed addressing any previously raised issue, I apologize.
Please let me know.
Thanks!
Madhavan
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-10 17:34         ` Madhavan T. Venkataraman
@ 2020-08-11 21:12           ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-11 21:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
I am working on version 2 of trampfd. Will send it out soon.
Thanks for all the comments so far!
Madhavan
 
 
- * RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-02 18:54     ` Madhavan T. Venkataraman
  2020-08-02 20:00       ` Andy Lutomirski
@ 2020-08-03  8:23       ` David Laight
  2020-08-03 15:59         ` Madhavan T. Venkataraman
  1 sibling, 1 reply; 64+ messages in thread
From: David Laight @ 2020-08-03  8:23 UTC (permalink / raw)
  To: 'Madhavan T. Venkataraman', Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
From: Madhavan T. Venkataraman
> Sent: 02 August 2020 19:55
> 
> More responses inline..
> 
> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
> >> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
> >>
> >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> >>
> >
> > 2. Use existing kernel functionality.  Raise a signal, modify the
> > state, and return from the signal.  This is very flexible and may not
> > be all that much slower than trampfd.
> 
> Let me understand this. You are saying that the trampoline code
> would raise a signal and, in the signal handler, set up the context
> so that when the signal handler returns, we end up in the target
> function with the context correctly set up. And, this trampoline code
> can be generated statically at build time so that there are no
> security issues using it.
> 
> Have I understood your suggestion correctly?
I was thinking that you'd just let the 'not executable' page fault
signal happen (SIGSEGV?) when the jump to the on-stack trampoline
is executed.
The user signal handler can then decode the faulting instruction
and, if it matches the expected on-stack trampoline, modify the
saved registers before returning from the signal.
No kernel changes and all you need to add to the program is
an architecture-dependent signal handler.
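A rough sketch of such a handler follows, assuming Linux on x86-64 or arm64. It is simplified: instead of decoding the trampoline contents it only compares the faulting address with one expected trampoline address, and it redirects the saved PC to a fixed target. A data page mapped without PROT_EXEC stands in for the on-stack trampoline. All names (expected_tramp, tramp_target, call_via_fault) are made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes REG_RIP on glibc */
#endif
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <ucontext.h>

#if defined(__x86_64__) && !defined(REG_RIP)
#define REG_RIP 16              /* index of RIP in gregs[] on glibc and musl */
#endif

/* Hypothetical bookkeeping: one expected trampoline address and its target. */
static void *expected_tramp;
static int (*tramp_target)(int);

static int add_one(int x) { return x + 1; }

/* If the fault is the jump to our trampoline, redirect the saved PC. */
static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    ucontext_t *uc = ctx;

    if (si->si_addr != expected_tramp) {
        signal(sig, SIG_DFL);   /* not ours: re-fault with the default action */
        return;
    }
#if defined(__x86_64__)
    uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(uintptr_t)tramp_target;
#elif defined(__aarch64__)
    uc->uc_mcontext.pc = (uintptr_t)tramp_target;
#else
#error "this sketch only covers x86-64 and arm64 Linux"
#endif
}

/* Calls through a non-executable page; the handler turns the fault into
 * a call to add_one(). Returns -1 if the page cannot be mapped. */
int call_via_fault(int arg)
{
    struct sigaction sa = {0}, old;
    int (*volatile fn)(int);
    int rc, result;

    expected_tramp = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (expected_tramp == MAP_FAILED)
        return -1;
    tramp_target = add_one;

    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGSEGV, &sa, &old);
    assert(rc == 0);
    (void)rc;

    fn = (int (*)(int))expected_tramp;  /* data page, not executable */
    result = fn(arg);                   /* faults; handler redirects the PC */

    sigaction(SIGSEGV, &old, NULL);
    munmap(expected_tramp, 4096);
    return result;
}
```

Every call through the trampoline costs a fault plus a signal delivery and return, so this trades performance for needing no kernel changes.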
	David
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-03  8:23       ` David Laight
@ 2020-08-03 15:59         ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-03 15:59 UTC (permalink / raw)
  To: David Laight, Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
On 8/3/20 3:23 AM, David Laight wrote:
> From: Madhavan T. Venkataraman
>> Sent: 02 August 2020 19:55
>>
>> More responses inline..
>>
>> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>>>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>>>
>>>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>>>
>>> 2. Use existing kernel functionality.  Raise a signal, modify the
>>> state, and return from the signal.  This is very flexible and may not
>>> be all that much slower than trampfd.
>> Let me understand this. You are saying that the trampoline code
>> would raise a signal and, in the signal handler, set up the context
>> so that when the signal handler returns, we end up in the target
>> function with the context correctly set up. And, this trampoline code
>> can be generated statically at build time so that there are no
>> security issues using it.
>>
>> Have I understood your suggestion correctly?
> I was thinking that you'd just let the 'not executable' page fault
> signal happen (SIGSEGV?) when the code jumps to the on-stack
> trampoline.
>
> The user signal handler can then decode the faulting instruction
> and, if it matches the expected on-stack trampoline, modify the
> saved registers before returning from the signal.
>
> No kernel changes and all you need to add to the program is
> an architecture-dependent signal handler.
Understood.
Madhavan
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
                     ` (6 preceding siblings ...)
  2020-07-28 17:31   ` Andy Lutomirski
@ 2020-07-31 18:09   ` Mark Rutland
  2020-07-31 20:08     ` Madhavan T. Venkataraman
  2020-08-03 16:57     ` Madhavan T. Venkataraman
  7 siblings, 2 replies; 64+ messages in thread
From: Mark Rutland @ 2020-07-31 18:09 UTC (permalink / raw)
  To: madvenka
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86
Hi,
On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> Trampoline code is placed either in a data page or in a stack page. In
> order to execute a trampoline, the page it resides in needs to be mapped
> with execute permissions. Writable pages with execute permissions provide
> an attack surface for hackers. Attackers can use this to inject malicious
> code, modify existing code or do other harm.
For the purpose of below, IIUC this assumes the adversary has an
arbitrary write.
> To mitigate this, LSMs such as SELinux may not allow pages to have both
> write and execute permissions. This prevents trampolines from executing
> and blocks applications that use trampolines. To allow genuine applications
> to run, exceptions have to be made for them (by setting execmem, etc).
> In this case, the attack surface is just the pages of such applications.
> 
> An application that is not allowed to have writable executable pages
> may try to load trampoline code into a file and map the file with execute
> permissions. In this case, the attack surface is just the buffer that
> contains trampoline code. However, a successful exploit may provide the
> hacker with means to load his own code in a file, map it and execute it.
It's not clear to me what power the adversary is assumed to have here,
and consequently it's not clear to me how the proposal mitigates this.
For example, if the attacker can control the arguments to syscalls, and
has an arbitrary write as above, what prevents them from creating a
trampfd of their own?
[...]
> GCC has traditionally used trampolines for implementing nested
> functions. The trampoline is placed on the user stack. So, the stack
> needs to be executable.
IIUC generally nested functions are avoided these days, specifically to
prevent the creation of gadgets on the stack. So I don't think those are
relevant as a case to care about. Applications using them should move
to not using them, and would be more secure generally for doing so.
[...]
> Trampoline File Descriptor (trampfd)
> --------------------------
> 
> I am proposing a kernel API using anonymous file descriptors that
> can be used to create and execute trampolines with the help of the
> kernel. In this solution also, the kernel does the work of the trampoline.
What's the rationale for the kernel emulating the trampoline here?
In the case of EMUTRAMP this was necessary to work with existing
application binaries and kernel ABIs which placed instructions onto the
stack, and the stack needed to remain RW for other reasons. That
restriction doesn't apply here.
Assuming trampfd creation is somehow authenticated, the code could be
placed in a r-x page (which the kernel could refuse to add write
permission), in order to prevent modification. If that's sufficient,
it's not much of a leap to allow userspace to generate the code.
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
> 
> In this case, the attack surface is the context buffer. A hacker may
> attack an application with a vulnerability and may be able to modify the
> context buffer. So, when the register or stack context is set for
> a trampoline, the values may have been tampered with. From an attack
> surface perspective, this is similar to Trampoline Emulation. But
> with trampfd, user code can retrieve a trampoline's context from the
> kernel and add defensive checks to see if the context has been
> tampered with.
Can you elaborate on this: what sort of checks would be applied, and
how?
Why is this not possible in a r-x user page?
[...]
> - trampfd provides a basic framework. In the future, new trampoline types
>   can be implemented, new contexts can be defined, and additional rules
>   can be implemented for security purposes.
From a kernel developer perspective, this reads as "this ABI will become
more complex", which I think is worrisome.
I'm also worried that this is liable to have nasty interaction with HW
CFI mechanisms (e.g. PAC+BTI on arm64) either now or in future, and that
we bake incompatibility into ABI.
> - For instance, trampfd defines an "Allowed PCs" context in this initial
>   work. As an example, libffi can create a read-only array of all ABI
>   handlers for an architecture at build time. This array can be used to
>   set the list of allowed PCs for a trampoline. This will mean that a hacker
>   cannot hack the PC part of the register context and make it point to
>   arbitrary locations.
I'm not exactly sure what's meant here. Do you mean that this prevents
userspace from branching into the middle of a trampoline, or that the
trampfd code restricts where the trampoline itself can branch to?
Both x86 and arm64 have upcoming HW CFI (CET and BTI) to deal with the
former, and I believe the latter can also be implemented in userspace
with defensive checks in the trampolines, provided that they are
protected read-only.
> - An SELinux setting called "exectramp" can be implemented along the
>   lines of "execmem", "execstack" and "execheap" to selectively allow the
>   use of trampolines on a per application basis.
> 
> - User code can add defensive checks in the code before invoking a
>   trampoline to make sure that a hacker has not modified the context data.
>   It can do this by getting the trampoline context from the kernel and
>   double checking it.
As above, without examples it's not clear to me what sort of checks are
possible nor where they would need to be made. So it's difficult to see
whether that's actually possible or subject to TOCTTOU races and
similar.
> - In the future, if the kernel can be enhanced to use a safe code
>   generation component, that code can be placed in the trampoline mapping
>   pages. Then, the trampoline invocation does not have to incur a trip
>   into the kernel.
> 
> - Also, if the kernel can be enhanced to use a safe code generation
>   component, other forms of dynamic code such as JIT code can be
>   addressed by the trampfd framework.
I don't see why it's necessary for the kernel to generate code at all.
If the trampfd creation requests can be trusted, what prevents trusting
a sealed set of instructions generated in userspace?
> - Trampolines can be shared across processes which can give rise to
>   interesting uses in the future.
This sounds like the use-case of a sealed memfd. Is a sealed executable
memfd not sufficient?
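For reference, a rough sketch of what a sealed memfd looks like from userspace. It uses the existing `memfd_create(2)` and `fcntl(2)` `F_ADD_SEALS` interfaces on Linux with glibc 2.27 or later; the helper names `make_sealed_memfd`, `sealed_write_is_rejected` and `map_sealed_rx` are made up for this example. Whether an LSM policy would permit the executable mapping is a separate question that this sketch does not settle.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy len bytes into a new memfd and seal it. Returns the fd, or -1. */
int make_sealed_memfd(const void *buf, size_t len)
{
	int fd = memfd_create("tramp-code", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len)
		goto fail;
	/* No more writes, no resizing, and no further seal changes. */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE | F_SEAL_GROW |
		  F_SEAL_SHRINK | F_SEAL_SEAL) != 0)
		goto fail;
	return fd;
fail:
	close(fd);
	return -1;
}

/* Returns 1 if a write to the sealed file is refused with EPERM. */
int sealed_write_is_rejected(int fd)
{
	char byte = 0;
	ssize_t n = pwrite(fd, &byte, 1, 0);

	return n < 0 && errno == EPERM;
}

/* Map the sealed contents read+execute only; MAP_FAILED on error. */
void *map_sealed_rx(int fd, size_t len)
{
	return mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
}
```

The seals protect the file contents from later modification through any descriptor or shared mapping. They say nothing about who generated the bytes in the first place, which is the trust question raised elsewhere in this thread.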
Thanks,
Mark.
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-31 18:09   ` Mark Rutland
@ 2020-07-31 20:08     ` Madhavan T. Venkataraman
  2020-08-03 16:57     ` Madhavan T. Venkataraman
  1 sibling, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-31 20:08 UTC (permalink / raw)
  To: Mark Rutland
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86
Thanks for the comments. I will respond to these and your next
email on Monday.
Madhavan
On 7/31/20 1:09 PM, Mark Rutland wrote:
> Hi,
>
> On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>> Trampoline code is placed either in a data page or in a stack page. In
>> order to execute a trampoline, the page it resides in needs to be mapped
>> with execute permissions. Writable pages with execute permissions provide
>> an attack surface for hackers. Attackers can use this to inject malicious
>> code, modify existing code or do other harm.
> For the purpose of below, IIUC this assumes the adversary has an
> arbitrary write.
>
>> To mitigate this, LSMs such as SELinux may not allow pages to have both
>> write and execute permissions. This prevents trampolines from executing
>> and blocks applications that use trampolines. To allow genuine applications
>> to run, exceptions have to be made for them (by setting execmem, etc).
>> In this case, the attack surface is just the pages of such applications.
>>
>> An application that is not allowed to have writable executable pages
>> may try to load trampoline code into a file and map the file with execute
>> permissions. In this case, the attack surface is just the buffer that
>> contains trampoline code. However, a successful exploit may provide the
>> hacker with means to load his own code in a file, map it and execute it.
> It's not clear to me what power the adversary is assumed to have here,
> and consequently it's not clear to me how the proposal mitigates this.
>
> For example, if the attacker can control the arguments to syscalls, and
> has an arbitrary write as above, what prevents them from creating a
> trampfd of their own?
>
> [...]
>
>> GCC has traditionally used trampolines for implementing nested
>> functions. The trampoline is placed on the user stack. So, the stack
>> needs to be executable.
> IIUC generally nested functions are avoided these days, specifically to
> prevent the creation of gadgets on the stack. So I don't think those are
> relevant as a case to care about. Applications using them should move
> to not using them, and would be more secure generally for doing so.
>
> [...]
>
>> Trampoline File Descriptor (trampfd)
>> --------------------------
>>
>> I am proposing a kernel API using anonymous file descriptors that
>> can be used to create and execute trampolines with the help of the
>> kernel. In this solution also, the kernel does the work of the trampoline.
> What's the rationale for the kernel emulating the trampoline here?
>
> In the case of EMUTRAMP this was necessary to work with existing
> application binaries and kernel ABIs which placed instructions onto the
> stack, and the stack needed to remain RW for other reasons. That
> restriction doesn't apply here.
>
> Assuming trampfd creation is somehow authenticated, the code could be
> placed in a r-x page (which the kernel could refuse to add write
> permission), in order to prevent modification. If that's sufficient,
> it's not much of a leap to allow userspace to generate the code.
>
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
>>
>> In this case, the attack surface is the context buffer. A hacker may
>> attack an application with a vulnerability and may be able to modify the
>> context buffer. So, when the register or stack context is set for
>> a trampoline, the values may have been tampered with. From an attack
>> surface perspective, this is similar to Trampoline Emulation. But
>> with trampfd, user code can retrieve a trampoline's context from the
>> kernel and add defensive checks to see if the context has been
>> tampered with.
> Can you elaborate on this: what sort of checks would be applied, and
> how?
>
> Why is this not possible in a r-x user page?
>
> [...]
>
>> - trampfd provides a basic framework. In the future, new trampoline types
>>   can be implemented, new contexts can be defined, and additional rules
>>   can be implemented for security purposes.
> From a kernel developer perspective, this reads as "this ABI will become
> more complex", which I think is worrisome.
>
> I'm also worried that this is liable to have nasty interaction with HW
> CFI mechanisms (e.g. PAC+BTI on arm64) either now or in future, and that
> we bake incompatibility into ABI.
>
>> - For instance, trampfd defines an "Allowed PCs" context in this initial
>>   work. As an example, libffi can create a read-only array of all ABI
>>   handlers for an architecture at build time. This array can be used to
>>   set the list of allowed PCs for a trampoline. This will mean that a hacker
>>   cannot hack the PC part of the register context and make it point to
>>   arbitrary locations.
> I'm not exactly sure what's meant here. Do you mean that this prevents
> userspace from branching into the middle of a trampoline, or that the
> trampfd code restricts where the trampoline itself can branch to?
>
> Both x86 and arm64 have upcoming HW CFI (CET and BTI) to deal with the
> former, and I believe the latter can also be implemented in userspace
> with defensive checks in the trampolines, provided that they are
> protected read-only.
>
>> - An SELinux setting called "exectramp" can be implemented along the
>>   lines of "execmem", "execstack" and "execheap" to selectively allow the
>>   use of trampolines on a per application basis.
>>
>> - User code can add defensive checks in the code before invoking a
>>   trampoline to make sure that a hacker has not modified the context data.
>>   It can do this by getting the trampoline context from the kernel and
>>   double checking it.
> As above, without examples it's not clear to me what sort of checks are
> possible nor where they would need to be made. So it's difficult to see
> whether that's actually possible or subject to TOCTTOU races and
> similar.
>
>> - In the future, if the kernel can be enhanced to use a safe code
>>   generation component, that code can be placed in the trampoline mapping
>>   pages. Then, the trampoline invocation does not have to incur a trip
>>   into the kernel.
>>
>> - Also, if the kernel can be enhanced to use a safe code generation
>>   component, other forms of dynamic code such as JIT code can be
>>   addressed by the trampfd framework.
> I don't see why it's necessary for the kernel to generate code at all.
> If the trampfd creation requests can be trusted, what prevents trusting
> a sealed set of instructions generated in userspace?
>
>> - Trampolines can be shared across processes which can give rise to
>>   interesting uses in the future.
> This sounds like the use-case of a sealed memfd. Is a sealed executable
> memfd not sufficient?
>
> Thanks,
> Mark.
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-31 18:09   ` Mark Rutland
  2020-07-31 20:08     ` Madhavan T. Venkataraman
@ 2020-08-03 16:57     ` Madhavan T. Venkataraman
  2020-08-04 14:30       ` Mark Rutland
  1 sibling, 1 reply; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-03 16:57 UTC (permalink / raw)
  To: Mark Rutland
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86
Responses inline..
On 7/31/20 1:09 PM, Mark Rutland wrote:
> Hi,
>
> On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>> Trampoline code is placed either in a data page or in a stack page. In
>> order to execute a trampoline, the page it resides in needs to be mapped
>> with execute permissions. Writable pages with execute permissions provide
>> an attack surface for hackers. Attackers can use this to inject malicious
>> code, modify existing code or do other harm.
> For the purpose of below, IIUC this assumes the adversary has an
> arbitrary write.
>
>> To mitigate this, LSMs such as SELinux may not allow pages to have both
>> write and execute permissions. This prevents trampolines from executing
>> and blocks applications that use trampolines. To allow genuine applications
>> to run, exceptions have to be made for them (by setting execmem, etc).
>> In this case, the attack surface is just the pages of such applications.
>>
>> An application that is not allowed to have writable executable pages
>> may try to load trampoline code into a file and map the file with execute
>> permissions. In this case, the attack surface is just the buffer that
>> contains trampoline code. However, a successful exploit may provide the
>> hacker with means to load his own code in a file, map it and execute it.
> It's not clear to me what power the adversary is assumed to have here,
> and consequently it's not clear to me how the proposal mitigates this.
>
> For example, if the attacker can control the arguments to syscalls, and
> has an arbitrary write as above, what prevents them from creating a
> trampfd of their own?
That is the point. If a process is allowed to have pages that are both
writable and executable, a hacker can exploit some vulnerability such
as buffer overflow to write his own code into a page and somehow
contrive to execute that.
So, the context is - if security settings in a system disallow a page to have
both write and execute permissions, how do you allow the execution of
genuine trampolines that are runtime generated and placed in a data
page or a stack page?
trampfd tries to address that. So, trampfd is not a measure that increases
the security of a system or mitigates a security problem. It is a framework
to allow safe forms of dynamic code to execute when security settings
will block them otherwise.
>
> [...]
>
>> GCC has traditionally used trampolines for implementing nested
>> functions. The trampoline is placed on the user stack. So, the stack
>> needs to be executable.
> IIUC generally nested functions are avoided these days, specifically to
> prevent the creation of gadgets on the stack. So I don't think those are
> relevant as a case to care about. Applications using them should move
> to not using them, and would be more secure generally for doing so.
Could not agree with you more.
>
> [...]
>
>> Trampoline File Descriptor (trampfd)
>> --------------------------
>>
>> I am proposing a kernel API using anonymous file descriptors that
>> can be used to create and execute trampolines with the help of the
>> kernel. In this solution also, the kernel does the work of the trampoline.
> What's the rationale for the kernel emulating the trampoline here?
>
> In the case of EMUTRAMP this was necessary to work with existing
> application binaries and kernel ABIs which placed instructions onto the
> stack, and the stack needed to remain RW for other reasons. That
> restriction doesn't apply here.
In addition to the stack, EMUTRAMP also allows the emulation
of the same well-known trampolines placed in a non-stack data page.
For instance, libffi closures embed a trampoline in a closure structure.
That gets executed when the caller of libffi invokes it.
The goal of EMUTRAMP is to allow safe trampolines to execute when
security settings disallow their execution. Mainly, it permits applications
that use libffi to run. A lot of applications use libffi.
They chose the emulation method so that no changes need to be made
to application code to use them. But the EMUTRAMP implementors note
in their description that the real solution to the problem is a kernel
API that is backed by a safe code generator.
trampfd is an attempt to define such an API. This is just a starting point.
I realize that we need to have a lot of discussion to refine the approach.
> Assuming trampfd creation is somehow authenticated, the code could be
> placed in a r-x page (which the kernel could refuse to add write
> permission), in order to prevent modification. If that's sufficient,
> it's not much of a leap to allow userspace to generate the code.
IIUC, you are suggesting that the user hands the kernel a code fragment
and requests it to be placed in an r-x page, correct? However, the
kernel cannot trust any code given to it by the user. Nor can it scan any
piece of code and reliably decide if it is safe or not.
So, the problem of executing dynamic code when security settings are
restrictive cannot be solved in userland. The only option I can think of is
to have the kernel provide support for dynamic code. It must have one
or more safe, trusted code generation components and an API to use
the components.
My goal is to introduce an API and start off by supporting simple, regular
trampolines that are widely used. Then, evolve the feature over a period
of time to include other forms of dynamic code such as JIT code.
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
>>
>> In this case, the attack surface is the context buffer. A hacker may
>> attack an application with a vulnerability and may be able to modify the
>> context buffer. So, when the register or stack context is set for
>> a trampoline, the values may have been tampered with. From an attack
>> surface perspective, this is similar to Trampoline Emulation. But
>> with trampfd, user code can retrieve a trampoline's context from the
>> kernel and add defensive checks to see if the context has been
>> tampered with.
> Can you elaborate on this: what sort of checks would be applied, and
> how?
So, an application that uses trampfd would do the following steps:
1. Create a trampoline by calling trampfd_create()
2. Set the register and/or stack contexts for the trampoline.
3. mmap() the trampoline to get an address
4. Invoke the trampoline using the address
Let us say that the application has a vulnerability such as buffer overflow
that allows a hacker to modify the data that is used to do step 2.
Potentially, a hacker could modify the following things:
    - register values specified in the register context
    - values specified in the stack context
    - the target PC specified in the register context
When the trampoline is invoked in step 4, the kernel will gain control,
load the registers, push stuff on the stack and transfer control to the target
PC. Whatever the hacker had modified in step 2 will take effect in step 4.
His values will get loaded and his PC is the one that will get control.
A paranoid application could add a step to this sequence. So, the steps
would be:
1. Create a trampoline by calling trampfd_create()
2. Set the register and/or stack contexts for the trampoline.
3. mmap() the trampoline to get an address
4a. Retrieve the register and stack context for the trampoline from the
      kernel and check if anything has been altered. If yes, abort.
4b. Invoke the trampoline using the address
The check that I mentioned will be in step 4a. Now, the hacker has to
hack both step 2 and step 4a to let his stuff take effect. That is far
less likely to succeed because there needs to exist a vulnerability in
both places.
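A rough sketch of what the check in step 4a could look like. Everything here is hypothetical: `struct tramp_ctx` is an invented layout, not the RFC's real structures, and `mock_set_ctx` / `mock_get_ctx` are in-process stand-ins for whatever calls would set and retrieve a trampoline's context from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

#define TRAMP_MAX_REGS 4

/* Hypothetical context layout; the RFC's real structures may differ. */
struct tramp_ctx {
	uint64_t target_pc;
	uint32_t nregs;
	uint64_t regs[TRAMP_MAX_REGS];
};

/* Stand-in for the copy the kernel would hold after step 2. */
static struct tramp_ctx kernel_copy;

/* Stand-ins for "set context" (step 2) and "get context" (step 4a). */
static void mock_set_ctx(const struct tramp_ctx *c) { kernel_copy = *c; }
static void mock_get_ctx(struct tramp_ctx *out) { *out = kernel_copy; }

/* Step 4a: compare what the kernel will use against what was intended. */
static bool tramp_ctx_matches(const struct tramp_ctx *want)
{
	struct tramp_ctx got;
	uint32_t i;

	mock_get_ctx(&got);
	if (got.target_pc != want->target_pc || got.nregs != want->nregs)
		return false;
	if (got.nregs > TRAMP_MAX_REGS)
		return false;
	for (i = 0; i < got.nregs; i++)
		if (got.regs[i] != want->regs[i])
			return false;
	return true;
}
```

The comparison is only as trustworthy as the `want` value passed in. If the expected context lives in writable memory that the same vulnerability can reach, the check compares tampered data with tampered data, and a window remains between the check and the invocation in step 4b.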
> Why is this not possible in a r-x user page?
This is answered above.
>
> [...]
>
>> - trampfd provides a basic framework. In the future, new trampoline types
>>   can be implemented, new contexts can be defined, and additional rules
>>   can be implemented for security purposes.
> From a kernel developer perspective, this reads as "this ABI will become
> more complex", which I think is worrisome.
I hear you. My goal from the beginning is to not have the kernel deal
with ABI issues. ABI handling is best left to userland (except in cases
like signal handlers where the kernel does have to deal with it).
In the libffi changes, this is certainly true. The kernel only helps with
the trampoline that passes control to the ABI handler. The ABI handler
itself is part of libffi.
> I'm also worried that this is liable to have nasty interaction with HW
> CFI mechanisms (e.g. PAC+BTI on arm64) either now or in future, and that
> we bake incompatibility into ABI.
I will study CFI and then answer this question. So, bear with me.
>> - For instance, trampfd defines an "Allowed PCs" context in this initial
>>   work. As an example, libffi can create a read-only array of all ABI
>>   handlers for an architecture at build time. This array can be used to
>>   set the list of allowed PCs for a trampoline. This will mean that a hacker
>>   cannot hack the PC part of the register context and make it point to
>>   arbitrary locations.
> I'm not exactly sure what's meant here. Do you mean that this prevents
> userspace from branching into the middle of a trampoline, or that the
> trampfd code restricts where the trampoline itself can branch to?
>
> Both x86 and arm64 have upcoming HW CFI (CET and BTI) to deal with the
> former, and I believe the latter can also be implemented in userspace
> with defensive checks in the trampolines, provided that they are
> protected read-only.
So, I mentioned before that a hacker can potentially alter the target
PC that a trampoline finally jumps to.
If a process were allowed to have pages with both write and execute
permissions, a hacker could load his own code in one of those pages and
point the PC to that.
In the context of trampfd, we are talking about the case where a process is
not permitted to have both write and execute permissions. In this case,
the hacker cannot load his own code anywhere and hope to execute it.
But a hacker can still point the PC to some arbitrary existing code, such
as a function in glibc (a return-to-libc style attack). The allowed PCs
list is meant to prevent that.
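As an illustration of the "Allowed PCs" idea quoted above, the membership test amounts to something like the sketch below. The handler names and `pc_is_allowed` are hypothetical, and in the proposal the list would be handed to the kernel for enforcement instead of being checked in userspace like this.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ABI handlers, standing in for libffi's entry points. */
static int abi_handler_a(void) { return 1; }
static int abi_handler_b(void) { return 2; }

/*
 * Fixed at build time. Being const, the array is placed in read-only
 * data (read-only after relocation when RELRO is in effect).
 */
static int (*const allowed_pcs[])(void) = {
	abi_handler_a,
	abi_handler_b,
};

/* Returns true only if pc is exactly one of the listed entry points. */
static bool pc_is_allowed(uintptr_t pc)
{
	size_t i;

	for (i = 0; i < sizeof(allowed_pcs) / sizeof(allowed_pcs[0]); i++)
		if ((uintptr_t)allowed_pcs[i] == pc)
			return true;
	return false;
}
```

With such a list in place, a tampered target PC that points anywhere other than a listed handler would be rejected. The register and stack values passed to an allowed handler are not covered by this check.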
>
>> - An SELinux setting called "exectramp" can be implemented along the
>>   lines of "execmem", "execstack" and "execheap" to selectively allow the
>>   use of trampolines on a per application basis.
>>
>> - User code can add defensive checks in the code before invoking a
>>   trampoline to make sure that a hacker has not modified the context data.
>>   It can do this by getting the trampoline context from the kernel and
>>   double checking it.
> As above, without examples it's not clear to me what sort of checks are
> possible nor where they would need to be made. So it's difficult to see
> whether that's actually possible or subject to TOCTTOU races and
> similar.
I have explained this above. If there are any further questions on that,
please let me know.
>
>> - In the future, if the kernel can be enhanced to use a safe code
>>   generation component, that code can be placed in the trampoline mapping
>>   pages. Then, the trampoline invocation does not have to incur a trip
>>   into the kernel.
>>
>> - Also, if the kernel can be enhanced to use a safe code generation
>>   component, other forms of dynamic code such as JIT code can be
>>   addressed by the trampfd framework.
> I don't see why it's necessary for the kernel to generate code at all.
> If the trampfd creation requests can be trusted, what prevents trusting
> a sealed set of instructions generated in userspace?
Let us consider a system in which:
    - a process is not permitted to have pages with both write and execute
      permissions
    - a process is not permitted to map any file as executable unless it
      is properly signed, in other words, cryptographically verified.
Then, the process cannot execute any code that is runtime generated.
That includes trampolines. Only trampoline code that is part of program
text at build time would be permitted to execute.
In this scenario, trampfd requests are coming from signed code. So, they
are trusted by the kernel. But trampoline code could be dynamically generated.
The kernel will not trust it.
>> - Trampolines can be shared across processes which can give rise to
>>   interesting uses in the future.
> This sounds like the use-case of a sealed memfd. Is a sealed executable
> memfd not sufficient?
I will answer this in a separate email.
Thanks.
Madhavan
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-03 16:57     ` Madhavan T. Venkataraman
@ 2020-08-04 14:30       ` Mark Rutland
  2020-08-06 17:26         ` Madhavan T. Venkataraman
  0 siblings, 1 reply; 64+ messages in thread
From: Mark Rutland @ 2020-08-04 14:30 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86
On Mon, Aug 03, 2020 at 11:57:57AM -0500, Madhavan T. Venkataraman wrote:
> Responses inline..
> 
> On 7/31/20 1:09 PM, Mark Rutland wrote:
> > Hi,
> >
> > On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote:
> >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> >> Trampoline code is placed either in a data page or in a stack page. In
> >> order to execute a trampoline, the page it resides in needs to be mapped
> >> with execute permissions. Writable pages with execute permissions provide
> >> an attack surface for hackers. Attackers can use this to inject malicious
> >> code, modify existing code or do other harm.
> > For the purpose of below, IIUC this assumes the adversary has an
> > arbitrary write.
> >
> >> To mitigate this, LSMs such as SELinux may not allow pages to have both
> >> write and execute permissions. This prevents trampolines from executing
> >> and blocks applications that use trampolines. To allow genuine applications
> >> to run, exceptions have to be made for them (by setting execmem, etc).
> >> In this case, the attack surface is just the pages of such applications.
> >>
> >> An application that is not allowed to have writable executable pages
> >> may try to load trampoline code into a file and map the file with execute
> >> permissions. In this case, the attack surface is just the buffer that
> >> contains trampoline code. However, a successful exploit may provide the
> >> hacker with means to load his own code in a file, map it and execute it.
> > It's not clear to me what power the adversary is assumed to have here,
> > and consequently it's not clear to me how the proposal mitigates this.
> >
> > For example, if the attack can control the arguments to syscalls, and
> > has an arbitrary write as above, what prevents them from creating a
> > trampfd of their own?
> 
> That is the point. If a process is allowed to have pages that are both
> writable and executable, a hacker can exploit some vulnerability such
> as buffer overflow to write his own code into a page and somehow
> contrive to execute that.
I understood that, and that was not my question.
> So, the context is - if security settings in a system disallow a page to have
> both write and execute permissions, how do you allow the execution of
> genuine trampolines that are runtime generated and placed in a data
> page or a stack page?
There are options today, e.g.
a) If the restriction is only per-alias, you can have distinct aliases
   where one is writable and another is executable, and you can make it
   hard to find the relationship between the two.
b) If the restriction is only temporal, you can write instructions into
   an RW- buffer, transition the buffer to R--, verify the buffer
   contents, then transition it to --X.
c) You can have two processes A and B where A generates instructions into
   a buffer that (only) B can execute (where B may be restricted from
   making syscalls like write, mprotect, etc).
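As a rough illustration of option (b), the RW- -> R-- -> --X sequence might look like the sketch below. The helper name stage_code_temporal() and the idea of comparing against a caller-supplied reference copy are assumptions made for this example; neither comes from the series. The byte values passed in are never executed here.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the series): stage `len` bytes of
 * generated code through RW- -> R-- -> --X. `reference` stands in for
 * whatever trusted copy a real verifier would compare against.
 * Returns the execute-only mapping, or NULL on failure.
 */
void *stage_code_temporal(const unsigned char *code,
                          const unsigned char *reference, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t map_len;
    unsigned char *buf;

    if (page <= 0 || len == 0)
        return NULL;
    /* Round the length up to a whole number of pages. */
    map_len = (len + (size_t)page - 1) / (size_t)page * (size_t)page;

    buf = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    assert(((uintptr_t)buf % (uintptr_t)page) == 0); /* mmap is page-aligned */

    memcpy(buf, code, len);                     /* RW-: write the instructions */

    if (mprotect(buf, map_len, PROT_READ) != 0) /* R--: drop write permission */
        goto fail;
    if (memcmp(buf, reference, len) != 0)       /* verify while read-only */
        goto fail;
    if (mprotect(buf, map_len, PROT_EXEC) != 0) /* --X: never W and X together */
        goto fail;
    return buf;

fail:
    munmap(buf, map_len);
    return NULL;
}
```

Whether the final mprotect() succeeds depends on the security policy in force; a policy that forbids execute permission on anonymous memory would make this helper return NULL.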
If (as this series appears to) you assume that an adversary can't
control the arguments to trampfd_create() and any such call is legitimate,
then something like (b) is not weaker and can be much more general
without many of the potential ABI or performance problems of trying to
fiddle with procedure call details in the kernel.
If that's not an assumption, then I'm missing how you expect to
determine that a trampfd_create() call is legitimate, and why that could
not be applied to other calls.
[...]
> Could not agree with you more.
> >
> > [...]
> >
> >> Trampoline File Descriptor (trampfd)
> >> --------------------------
> >>
> >> I am proposing a kernel API using anonymous file descriptors that
> >> can be used to create and execute trampolines with the help of the
> >> kernel. In this solution also, the kernel does the work of the trampoline.
> > What's the rationale for the kernel emulating the trampoline here?
> >
> > In the case of EMUTRAMP this was necessary to work with existing
> > application binaries and kernel ABIs which placed instructions onto the
> > stack, and the stack needed to remain RW for other reasons. That
> > restriction doesn't apply here.
> 
> In addition to the stack, EMUTRAMP also allows the emulation
> of the same well-known trampolines placed in a non-stack data page.
> For instance, libffi closures embed a trampoline in a closure structure.
> That gets executed when the caller of libffi invokes it.
> 
> The goal of EMUTRAMP is to allow safe trampolines to execute when
> security settings disallow their execution. Mainly, it permits applications
> that use libffi to run. A lot of applications use libffi.
> 
> They chose the emulation method so that no changes need to be made
> to application code to use them. But the EMUTRAMP implementors note
> in their description that the real solution to the problem is a kernel
> API that is backed by a safe code generator.
> 
> trampfd is an attempt to define such an API. This is just a starting point.
> I realize that we need to have a lot of discussion to refine the approach.
> 
> > Assuming trampfd creation is somehow authenticated, the code could be
> > placed in a r-x page (which the kernel could refuse to add write
> > permission), in order to prevent modification. If that's sufficient,
> > it's not much of a leap to allow userspace to generate the code.
> 
> IIUC, you are suggesting that the user hands the kernel a code fragment
> and requests it to be placed in an r-x page, correct? However, the
> kernel cannot trust any code given to it by the user. Nor can it scan any
> piece of code and reliably decide if it is safe or not.
Per that same logic the kernel cannot trust trampfd creation calls to be
legitimate as the adversary could mess with the arguments. It doesn't
matter if the kernel's codegen is trustworthy if it's potentially driven
by an adversary.
> So, the problem of executing dynamic code when security settings are
> restrictive cannot be solved in userland. The only option I can think of is
> to have the kernel provide support for dynamic code. It must have one
> or more safe, trusted code generation components and an API to use
> the components.
> 
> My goal is to introduce an API and start off by supporting simple, regular
> trampolines that are widely used. Then, evolve the feature over a period
> of time to include other forms of dynamic code such as JIT code.
I think that you're making a leap to this approach without sufficient
justification that it actually solves the problem, and I believe that
there will be ABI issues with this approach which can be sidestepped by
other potential approaches.
Taking a step back, I think it's necessary to better describe the
problem and constraints that you believe apply before attempting to
justify any potential solution.
[...]
> >> The kernel creates the trampoline mapping without any permissions. When
> >> the trampoline is executed by user code, a page fault happens and the
> >> kernel gets control. The kernel recognizes that this is a trampoline
> >> invocation. It sets up the user registers based on the specified
> >> register context, and/or pushes values on the user stack based on the
> >> specified stack context, and sets the user PC to the requested target
> >> PC. When the kernel returns, execution continues at the target PC.
> >> So, the kernel does the work of the trampoline on behalf of the
> >> application.
> >>
> >> In this case, the attack surface is the context buffer. A hacker may
> >> attack an application with a vulnerability and may be able to modify the
> >> context buffer. So, when the register or stack context is set for
> >> a trampoline, the values may have been tampered with. From an attack
> >> surface perspective, this is similar to Trampoline Emulation. But
> >> with trampfd, user code can retrieve a trampoline's context from the
> >> kernel and add defensive checks to see if the context has been
> >> tampered with.
> > Can you elaborate on this: what sort of checks would be applied, and
> > how?
> 
> So, an application that uses trampfd would do the following steps:
> 
> 1. Create a trampoline by calling trampfd_create()
> 2. Set the register and/or stack contexts for the trampoline.
> 3. mmap() the trampoline to get an address
> 4. Invoke the trampoline using the address
> 
> Let us say that the application has a vulnerability such as buffer overflow
> that allows a hacker to modify the data that is used to do step 2.
> 
> Potentially, a hacker could modify the following things:
>     - register values specified in the register context
>     - values specified in the stack context
>     - the target PC specified in the register context
> 
> When the trampoline is invoked in step 4, the kernel will gain control,
> load the registers, push stuff on the stack and transfer control to the target
> PC. Whatever the hacker had modified in step 2 will take effect in step 4.
> His values will get loaded and his PC is the one that will get control.
> 
> A paranoid application could add a step to this sequence. So, the steps
> would be:
> 
> 1. Create a trampoline by calling trampfd_create()
> 2. Set the register and/or stack contexts for the trampoline.
> 3. mmap() the trampoline to get an address
> 4a. Retrieve the register and stack context for the trampoline from the
>       kernel and check if anything has been altered. If yes, abort.
> 4b. Invoke the trampoline using the address
As above, you can also do this when using mprotect today, transitioning
the buffer RWX -> R-- -> R-X. If you're worried about subsequent
modification via an alias, a sealed memfd would work assuming that can
be mapped R-X.
This approach is applicable to trampfd, but it isn't a specific benefit
of trampfd.
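For reference, the sealed memfd variant could be sketched roughly as follows, using memfd_create(2) and the F_ADD_SEALS fcntl. The helper name map_sealed_code() is made up for this example, and whether the R-X mapping is permitted depends on the security policy in force.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy `len` bytes into a memfd, seal the file
 * against any further modification, then map it R-X.
 * Returns the mapping, or NULL on failure.
 */
void *map_sealed_code(const unsigned char *code, size_t len)
{
    const int seals = F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL;
    void *map = NULL;
    int fd;

    if (len == 0)
        return NULL;

    fd = memfd_create("sealed-code", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return NULL;

    /* Fill the file while it is still writable. */
    if (write(fd, code, len) != (ssize_t)len)
        goto out;

    /* After this, the contents and size can no longer change. */
    if (fcntl(fd, F_ADD_SEALS, seals) != 0)
        goto out;
    assert((fcntl(fd, F_GET_SEALS) & F_SEAL_WRITE) != 0);

    /* Map the sealed contents read + execute; no writable alias exists. */
    map = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        map = NULL;
out:
    close(fd);
    return map;
}
```

The contents can still be verified through the R-X mapping before use, since it is readable.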
[...] 
> >> - In the future, if the kernel can be enhanced to use a safe code
> >>   generation component, that code can be placed in the trampoline mapping
> >>   pages. Then, the trampoline invocation does not have to incur a trip
> >>   into the kernel.
> >>
> >> - Also, if the kernel can be enhanced to use a safe code generation
> >>   component, other forms of dynamic code such as JIT code can be
> >>   addressed by the trampfd framework.
> > I don't see why it's necessary for the kernel to generate code at all.
> > If the trampfd creation requests can be trusted, what prevents trusting
> > a sealed set of instructions generated in userspace?
> 
> Let us consider a system in which:
>     - a process is not permitted to have pages with both write and execute
>       permissions
>     - a process is not permitted to map any file as executable unless it
>       is properly signed. In other words, cryptographically verified.
> 
> Then, the process cannot execute any code that is runtime generated.
> That includes trampolines. Only trampoline code that is part of program
> text at build time would be permitted to execute.
> 
> In this scenario, trampfd requests are coming from signed code. So, they
> are trusted by the kernel. But trampoline code could be dynamically generated.
> The kernel will not trust it.
I think this is a very hand-wavy argument, as it suggests that generated
code is not trusted, but what is effectively a generated bytecode is.
If certain codegen can be trusted, then we can add mechanisms to permit
the results of this to be mapped r-x. If that is not possible, then the
same argument says that trampfd requests cannot be trusted.
Thanks,
Mark.
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-04 14:30       ` Mark Rutland
@ 2020-08-06 17:26         ` Madhavan T. Venkataraman
  2020-08-08 22:17           ` Pavel Machek
  2020-08-12 10:06           ` Mark Rutland
  0 siblings, 2 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-06 17:26 UTC (permalink / raw)
  To: Mark Rutland
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86
Thanks for the lively discussion. I have tried to answer some of the
comments below.
On 8/4/20 9:30 AM, Mark Rutland wrote:
>
>> So, the context is - if security settings in a system disallow a page to have
>> both write and execute permissions, how do you allow the execution of
>> genuine trampolines that are runtime generated and placed in a data
>> page or a stack page?
> There are options today, e.g.
>
> a) If the restriction is only per-alias, you can have distinct aliases
>    where one is writable and another is executable, and you can make it
>    hard to find the relationship between the two.
>
> b) If the restriction is only temporal, you can write instructions into
>    an RW- buffer, transition the buffer to R--, verify the buffer
>    contents, then transition it to --X.
>
> c) You can have two processes A and B where A generates instructions into
>    a buffer that (only) B can execute (where B may be restricted from
>    making syscalls like write, mprotect, etc).
The general principle of the mitigation is W^X. I would argue that
the above options are violations of the W^X principle. If they are
allowed today, they must be fixed. And they will be. So, we cannot
rely on them.
a) This requires a remap operation. Two mappings point to the same
     physical page. One mapping has W and the other one has X. This
     is a violation of W^X.
b) This is again a violation. The kernel should refuse to give execute
     permission to a page that was writeable in the past and refuse to
     give write permission to a page that was executable in the past.
c) This is just a variation of (a).
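To make the stricter rule in (b) concrete, here is a toy user-space model of a history-based check. It is only a sketch: the type page_hist, the bits P_R/P_W/P_X and the function toy_mprotect() are invented for illustration and do not correspond to any kernel or LSM code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy permission bits for the model; these are not the PROT_* values. */
enum { P_R = 1, P_W = 2, P_X = 4 };

/* Per-page history tracked by the model. */
struct page_hist {
    bool ever_w;    /* page has been writable at some point */
    bool ever_x;    /* page has been executable at some point */
    unsigned prot;  /* current permissions */
};

/*
 * Apply a permission change under the temporal rule:
 *  - W and X may never be requested together,
 *  - X is refused if the page was ever writable,
 *  - W is refused if the page was ever executable.
 * Returns 0 if the change is applied, -1 if it is refused.
 */
int toy_mprotect(struct page_hist *pg, unsigned prot)
{
    if ((prot & P_W) && (prot & P_X))
        return -1;
    if ((prot & P_X) && pg->ever_w)
        return -1;
    if ((prot & P_W) && pg->ever_x)
        return -1;

    pg->ever_w = pg->ever_w || (prot & P_W) != 0;
    pg->ever_x = pg->ever_x || (prot & P_X) != 0;
    pg->prot = prot;

    /* Invariant of the model: no page is ever both written and executed. */
    assert(!(pg->ever_w && pg->ever_x));
    return 0;
}
```

Under this model a fresh page that goes RW- and then R-- can never become --X, which is exactly the sequence in option (b).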
In general, the problem with user-level methods to map and execute
dynamic code is that the kernel cannot tell if a genuine application is
using them or an attacker is using them or piggy-backing on them.
If a security subsystem blocks all user-level methods for this reason,
we need a kernel mechanism to deal with the problem.
The kernel mechanism is not to be a backdoor. It is there to define
ways in which safe dynamic code can be executed.
I admit I have to provide more proof that my API and framework can
cover different cases. So, that is what I am doing now. I am in the process
of identifying other examples (per Andy's comment) and attempting to
show that this API and framework can address them. It will take a little time.
>>
>> IIUC, you are suggesting that the user hands the kernel a code fragment
>> and requests it to be placed in an r-x page, correct? However, the
>> kernel cannot trust any code given to it by the user. Nor can it scan any
>> piece of code and reliably decide if it is safe or not.
> Per that same logic the kernel cannot trust trampfd creation calls to be
> legitimate as the adversary could mess with the arguments. It doesn't
> matter if the kernel's codegen is trustworthy if it's potentially driven
> by an adversary.
That is not true. IMO, this is not a deficiency in trampfd. This is
something that is there even for regular system calls. For instance,
the write() system call will faithfully write out a buffer to a file
even if the buffer contents have been hacked by an attacker.
A system call can perform certain checks on incoming arguments.
But it cannot tell if a hacker has modified them.
So, there are two aspects in dynamic code that I am considering -
data and code. I submit that the data part can be hacked if an
application has a vulnerability such as buffer overflow. I don't see
how we can ever help that.
So, I am focused on the code generation part. Not all dynamic code
is the same. Different kinds deserve different degrees of trust.
Off the top of my head, I have tried to identify some examples
where we can have more trust in dynamic code and have the kernel
permit its execution.
1. If the kernel can do the job, then that is one safe way. Here, the kernel
    is the code. There is no code generation involved. This is what I
    have presented in the patch series as the first cut.
2. If the kernel can generate the code, then that code has a measure
    of trust. For trampolines, I agreed to do this for performance.
3. If the code resides in a signed file, then we know that it comes from
    a known source and it was generated at build time. So, it is not
    hacker generated. So, there is a measure of trust.
    This is not just program text. This could also be a buffer that contains
    trampoline code that resides in the read-only data section of a binary.
4. If the code resides in a signed file and is emulated (e.g. by QEMU)
    and we generate code for dynamic binary translation, we should
    be able to do that provided the code generator itself is not suspect.
    See the next point.   
5. The above are examples of actual machine code or equivalent.
    We could also have source code from which we generate machine
    code. E.g., JIT code from Java byte code. In this case, if the source
    code is in a signed file, we have a measure of trust in the source.
    If the kernel uses its own trusted code generator to generate the
    object code from the source code, then that object code has a
    measure of trust.
Anyway, these are just examples. The principle is - if we can identify
dynamic code that has a certain measure of trust, can the kernel
permit its execution?
All other code that cannot really be trusted by the kernel cannot be
executed safely (unless we find some safe and efficient way to
sandbox such code and limit the effects of the code to within
the sandbox). This is outside the scope of what I am doing.
>> So, the problem of executing dynamic code when security settings are
>> restrictive cannot be solved in userland. The only option I can think of is
>> to have the kernel provide support for dynamic code. It must have one
>> or more safe, trusted code generation components and an API to use
>> the components.
>>
>> My goal is to introduce an API and start off by supporting simple, regular
>> trampolines that are widely used. Then, evolve the feature over a period
>> of time to include other forms of dynamic code such as JIT code.
> I think that you're making a leap to this approach without sufficient
> justification that it actually solves the problem, and I believe that
> there will be ABI issues with this approach which can be sidestepped by
> other potential approaches.
>
> Taking a step back, I think it's necessary to better describe the
> problem and constraints that you believe apply before attempting to
> justify any potential solution.
I totally agree that more justification is needed and I am working on it.
As I have mentioned above, I intend to have the kernel generate code
only if the code generation is simple enough. For more complicated cases,
I plan to use a user-level code generator that is for exclusive kernel use.
I have yet to work out the details on how this would work. Need time.
>
> [...]
>
>>
>> 1. Create a trampoline by calling trampfd_create()
>> 2. Set the register and/or stack contexts for the trampoline.
>> 3. mmap() the trampoline to get an address
>> 4a. Retrieve the register and stack context for the trampoline from the
>>       kernel and check if anything has been altered. If yes, abort.
>> 4b. Invoke the trampoline using the address
> As above, you can also do this when using mprotect today, transitioning
> the buffer RWX -> R-- -> R-X. If you're worried about subsequent
> modification via an alias, a sealed memfd would work assuming that can
> be mapped R-X.
This is a violation of W^X and the security subsystem must be fixed
if it permits it.
> This approach is applicable to trampfd, but it isn't a specific benefit
> of trampfd.
>
> [...] 
>
>>>> - In the future, if the kernel can be enhanced to use a safe code
>>>>   generation component, that code can be placed in the trampoline mapping
>>>>   pages. Then, the trampoline invocation does not have to incur a trip
>>>>   into the kernel.
>>>>
>>>> - Also, if the kernel can be enhanced to use a safe code generation
>>>>   component, other forms of dynamic code such as JIT code can be
>>>>   addressed by the trampfd framework.
>>> I don't see why it's necessary for the kernel to generate code at all.
>>> If the trampfd creation requests can be trusted, what prevents trusting
>>> a sealed set of instructions generated in userspace?
>> Let us consider a system in which:
>>     - a process is not permitted to have pages with both write and execute
>>       permissions
>>     - a process is not permitted to map any file as executable unless it
>>       is properly signed. In other words, cryptographically verified.
>>
>> Then, the process cannot execute any code that is runtime generated.
>> That includes trampolines. Only trampoline code that is part of program
>> text at build time would be permitted to execute.
>>
>> In this scenario, trampfd requests are coming from signed code. So, they
>> are trusted by the kernel. But trampoline code could be dynamically generated.
>> The kernel will not trust it.
> I think this is a very hand-wavy argument, as it suggests that generated
> code is not trusted, but what is effectively a generated bytecode is.
> If certain codegen can be trusted, then we can add mechanisms to permit
> the results of this to be mapped r-x. If that is not possible, then the
> same argument says that trampfd requests cannot be trusted.
There is certainly an extra measure of trust in code that is in
signature verified files as compared to code that is generated
on the fly. At least, we know that the place from which we get
that code is known and the file was generated at build time
and not hacker generated. Such files could still contain a vulnerability.
But because these files are maintained by a known source, chances
are that there is nothing malicious in them.
Thanks.
Madhavan
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-06 17:26         ` Madhavan T. Venkataraman
@ 2020-08-08 22:17           ` Pavel Machek
  2020-08-11 12:41             ` Madhavan T. Venkataraman
  2020-08-12 10:06           ` Mark Rutland
  1 sibling, 1 reply; 64+ messages in thread
From: Pavel Machek @ 2020-08-08 22:17 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: Mark Rutland, kernel-hardening, linux-api, linux-arm-kernel,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86
Hi!
> Thanks for the lively discussion. I have tried to answer some of the
> comments below.
> > There are options today, e.g.
> >
> > a) If the restriction is only per-alias, you can have distinct aliases
> >    where one is writable and another is executable, and you can make it
> >    hard to find the relationship between the two.
> >
> > b) If the restriction is only temporal, you can write instructions into
> >    an RW- buffer, transition the buffer to R--, verify the buffer
> >    contents, then transition it to --X.
> >
> > c) You can have two processes A and B where A generates instructions into
> >    a buffer that (only) B can execute (where B may be restricted from
> >    making syscalls like write, mprotect, etc).
> 
> The general principle of the mitigation is W^X. I would argue that
> the above options are violations of the W^X principle. If they are
> allowed today, they must be fixed. And they will be. So, we cannot
> rely on them.
Would you mind describing your threat model?
Because I believe you are using a model different from everyone else.
In particular, I don't believe b) is a problem or should be fixed.
I'll add d): an application mmaps a file (R--), and uses the write syscall
to change the trampolines in it.
> b) This is again a violation. The kernel should refuse to give execute
>      permission to a page that was writeable in the past and refuse to
>      give write permission to a page that was executable in the past.
Why?
										Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-08 22:17           ` Pavel Machek
@ 2020-08-11 12:41             ` Madhavan T. Venkataraman
  2020-08-11 13:08               ` Pavel Machek
  0 siblings, 1 reply; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-11 12:41 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Mark Rutland, kernel-hardening, linux-api, linux-arm-kernel,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86
On 8/8/20 5:17 PM, Pavel Machek wrote:
> Hi!
> 
>> Thanks for the lively discussion. I have tried to answer some of the
>> comments below.
> 
>>> There are options today, e.g.
>>>
>>> a) If the restriction is only per-alias, you can have distinct aliases
>>>    where one is writable and another is executable, and you can make it
>>>    hard to find the relationship between the two.
>>>
>>> b) If the restriction is only temporal, you can write instructions into
>>>    an RW- buffer, transition the buffer to R--, verify the buffer
>>>    contents, then transition it to --X.
>>>
>>> c) You can have two processes A and B where A generates instructions into
>>>    a buffer that (only) B can execute (where B may be restricted from
>>>    making syscalls like write, mprotect, etc).
>>
>> The general principle of the mitigation is W^X. I would argue that
>> the above options are violations of the W^X principle. If they are
>> allowed today, they must be fixed. And they will be. So, we cannot
>> rely on them.
> 
> Would you mind describing your threat model?
> 
> Because I believe you are using a model different from everyone else.
> 
> In particular, I don't believe b) is a problem or should be fixed.
It is a problem because a kernel that implements W^X properly
will not allow it. It has no idea what has been done in userland.
It has no idea that the user has checked and verified the buffer
contents after transitioning the page to R--.
> 
> I'll add d), application mmaps a file(R--), and uses write syscall to change
> trampolines in it.
> 
No matter how you do it, these are all user-level methods that can be
hacked. The kernel cannot be sure that an attacker's code has
not found its way into the file.
>> b) This is again a violation. The kernel should refuse to give execute
>>      permission to a page that was writeable in the past and refuse to
>>      give write permission to a page that was executable in the past.
> 
> Why?
I don't know about the latter part. I guess I need to think about it.
But the former is valid. When a page is RW-, a hacker could hack the
page. Then it does not matter that the page is transitioned to R--.
Again, the kernel cannot be sure that the user has verified the contents
after R--.
IMO, W^X needs to be enforced temporally as well.
Madhavan
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-11 12:41             ` Madhavan T. Venkataraman
@ 2020-08-11 13:08               ` Pavel Machek
  2020-08-11 15:54                 ` Madhavan T. Venkataraman
  0 siblings, 1 reply; 64+ messages in thread
From: Pavel Machek @ 2020-08-11 13:08 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: Mark Rutland, kernel-hardening, linux-api, linux-arm-kernel,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86
Hi!
> >> Thanks for the lively discussion. I have tried to answer some of the
> >> comments below.
> > 
> >>> There are options today, e.g.
> >>>
> >>> a) If the restriction is only per-alias, you can have distinct aliases
> >>>    where one is writable and another is executable, and you can make it
> >>>    hard to find the relationship between the two.
> >>>
> >>> b) If the restriction is only temporal, you can write instructions into
> >>>    an RW- buffer, transition the buffer to R--, verify the buffer
> >>>    contents, then transition it to --X.
> >>>
> >>> c) You can have two processes A and B where A generates instructions into
> >>>    a buffer that (only) B can execute (where B may be restricted from
> >>>    making syscalls like write, mprotect, etc).
> >>
> >> The general principle of the mitigation is W^X. I would argue that
> >> the above options are violations of the W^X principle. If they are
> >> allowed today, they must be fixed. And they will be. So, we cannot
> >> rely on them.
> > 
> > Would you mind describing your threat model?
> > 
> > Because I believe you are using a model different from everyone else.
> > 
> > In particular, I don't believe b) is a problem or should be fixed.
> 
> It is a problem because a kernel that implements W^X properly
> will not allow it. It has no idea what has been done in userland.
> It has no idea that the user has checked and verified the buffer
> contents after transitioning the page to R--.
No, it is not a problem. W^X is designed to protect from attackers
doing buffer overflows, not attackers doing arbitrary syscalls.
Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-11 13:08               ` Pavel Machek
@ 2020-08-11 15:54                 ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-11 15:54 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Mark Rutland, kernel-hardening, linux-api, linux-arm-kernel,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86
On 8/11/20 8:08 AM, Pavel Machek wrote:
> Hi!
> 
>>>> Thanks for the lively discussion. I have tried to answer some of the
>>>> comments below.
>>>
>>>>> There are options today, e.g.
>>>>>
>>>>> a) If the restriction is only per-alias, you can have distinct aliases
>>>>>    where one is writable and another is executable, and you can make it
>>>>>    hard to find the relationship between the two.
>>>>>
>>>>> b) If the restriction is only temporal, you can write instructions into
>>>>>    an RW- buffer, transition the buffer to R--, verify the buffer
>>>>>    contents, then transition it to --X.
>>>>>
>>>>> c) You can have two processes A and B where A generates instructions into
>>>>>    a buffer that (only) B can execute (where B may be restricted from
>>>>>    making syscalls like write, mprotect, etc).
>>>>
>>>> The general principle of the mitigation is W^X. I would argue that
>>>> the above options are violations of the W^X principle. If they are
>>>> allowed today, they must be fixed. And they will be. So, we cannot
>>>> rely on them.
>>>
>>> Would you mind describing your threat model?
>>>
>>> Because I believe you are using a model different from everyone else.
>>>
>>> In particular, I don't believe b) is a problem or should be fixed.
>>
>> It is a problem because a kernel that implements W^X properly
>> will not allow it. It has no idea what has been done in userland.
>> It has no idea that the user has checked and verified the buffer
>> contents after transitioning the page to R--.
> 
> No, it is not a problem. W^X is designed to protect from attackers
> doing buffer overflows, not attackers doing arbitrary syscalls.
> 
Hey Pavel,
You are correct. The W^X implementation today still has some holes.
IIUC, the principle of W^X is that a user should not be able to (W) write code
into a page and use some trick to get it to (X) execute. So, what I
was trying to say was that the W^X principle is not implemented
completely today.
Mark Rutland mentioned some other tricks as well which are being used
today.
For instance, Microsoft has submitted this proposal:
 https://microsoft.github.io/ipe/
IPE is an LSM. In this proposal, only mappings that are backed by a
signature verified file can have execute permissions. This means that
all anonymous page based tricks will fail. And, file mapping based
tricks will fail as well when temporary files are used to load code
and mmap(). That is the intent.
Thanks!
Madhavan
 
 
 
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-06 17:26         ` Madhavan T. Venkataraman
  2020-08-08 22:17           ` Pavel Machek
@ 2020-08-12 10:06           ` Mark Rutland
  2020-08-12 18:47             ` Madhavan T. Venkataraman
  2020-08-19 18:53             ` Mickaël Salaün
  1 sibling, 2 replies; 64+ messages in thread
From: Mark Rutland @ 2020-08-12 10:06 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86
On Thu, Aug 06, 2020 at 12:26:02PM -0500, Madhavan T. Venkataraman wrote:
> Thanks for the lively discussion. I have tried to answer some of the
> comments below.
> 
> On 8/4/20 9:30 AM, Mark Rutland wrote:
> >
> >> So, the context is - if security settings in a system disallow a page to have
> >> both write and execute permissions, how do you allow the execution of
> >> genuine trampolines that are runtime generated and placed in a data
> >> page or a stack page?
> > There are options today, e.g.
> >
> > a) If the restriction is only per-alias, you can have distinct aliases
> >    where one is writable and another is executable, and you can make it
> >    hard to find the relationship between the two.
> >
> > b) If the restriction is only temporal, you can write instructions into
> >    an RW- buffer, transition the buffer to R--, verify the buffer
> >    contents, then transition it to --X.
> >
> > c) You can have two processes A and B where A generates instructions into
> >    a buffer that (only) B can execute (where B may be restricted from
> >    making syscalls like write, mprotect, etc).
> 
> The general principle of the mitigation is W^X. I would argue that
> the above options are violations of the W^X principle. If they are
> allowed today, they must be fixed. And they will be. So, we cannot
> rely on them.
Hold on.
Contemporary W^X means that a given virtual alias cannot be writable
and executable simultaneously, permitting (a) and (b). If you read the
references on the Wikipedia page for W^X you'll see the OpenBSD 3.3
release notes and related presentation make this clear, and further they
expect (b) to occur with JITs flipping W/X with mprotect().
Please don't conflate your assumed stronger semantics with the general
principle. It not matching your expectations does not necessarily mean
that it is wrong.
If you want a stronger W^X semantics, please refer to this specifically
with a distinct name.
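For illustration, option (b) can be sketched in C roughly as follows. This
is a minimal sketch and not code from this thread: the helper name
wx_flip_install() is hypothetical, and it assumes a Linux system where an
anonymous private mapping may be switched to PROT_EXEC (a policy such as
SELinux execmem may refuse that step). The buffer is never writable and
executable at the same time.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy code[0..len) into a fresh mapping using the
 * RW- -> R-- -> --X sequence. Returns the mapping, or NULL on failure.
 */
static void *wx_flip_install(const unsigned char *code, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t map_len;
	void *buf;

	if (code == NULL || len == 0 || page <= 0)
		return NULL;

	/* Round the length up to a whole number of pages. */
	map_len = (len + (size_t)page - 1) / (size_t)page * (size_t)page;

	/* Step 1: RW- buffer, not executable. */
	buf = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;
	memcpy(buf, code, len);

	/* Step 2: drop write permission before looking at the contents. */
	if (mprotect(buf, map_len, PROT_READ) != 0)
		goto fail;

	/* Step 3: verify. A byte compare stands in for a real check here. */
	if (memcmp(buf, code, len) != 0)
		goto fail;

	/* Step 4: make it executable; it is no longer writable. */
	if (mprotect(buf, map_len, PROT_EXEC) != 0)
		goto fail;

	return buf;

fail:
	munmap(buf, map_len);
	return NULL;
}
```

On architectures without coherent instruction caches, a real JIT would also
have to flush the instruction cache before jumping to the buffer; that is
left out of the sketch.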
> a) This requires a remap operation. Two mappings point to the same
>      physical page. One mapping has W and the other one has X. This
>      is a violation of W^X.
> 
> b) This is again a violation. The kernel should refuse to give execute
>      permission to a page that was writeable in the past and refuse to
>      give write permission to a page that was executable in the past.
> 
> c) This is just a variation of (a).
As above, this is not true.
If you have a rationale for why this is desirable or necessary, please
justify that before using this as justification for additional features.
> In general, the problem with user-level methods to map and execute
> dynamic code is that the kernel cannot tell if a genuine application is
> using them or an attacker is using them or piggy-backing on them.
Yes, and as I pointed out the same is true for trampfd unless you can
somehow authenticate the calls are legitimate (in both callsite and the
set of arguments), and I don't see any reasonable way of doing that.
If you relax your threat model to an attacker not being able to make
arbitrary syscalls, then your suggestion that userspace can perform
checks between syscalls may be sufficient, but as I pointed out that's
equally true for a sealed memfd or similar.
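As a rough illustration of the "sealed memfd" alternative, a sketch in C
might look like the following. It is not taken from any patch in this
series: the helper name sealed_memfd_map() and the memfd name string are
made up, and it assumes Linux with glibc 2.27 or later and a configuration
that permits executable mappings of a memfd.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write code[0..len) into a memfd, seal it against
 * any further modification, then map it read+execute. On success the
 * mapping is returned and *fd_out holds the sealed descriptor.
 */
static void *sealed_memfd_map(const unsigned char *code, size_t len,
			      int *fd_out)
{
	void *p;
	int fd;

	if (code == NULL || len == 0 || fd_out == NULL)
		return NULL;

	fd = memfd_create("trampoline-sketch", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return NULL;

	/* Fill the file while it is still writable. */
	if (write(fd, code, len) != (ssize_t)len)
		goto fail;

	/* After this, contents and size can no longer change. */
	if (fcntl(fd, F_ADD_SEALS,
		  F_SEAL_WRITE | F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL) != 0)
		goto fail;

	/* Map the now-immutable contents without write permission. */
	p = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		goto fail;

	*fd_out = fd;
	return p;

fail:
	close(fd);
	return NULL;
}
```

Userspace can inspect the contents between sealing and mapping, which is
the kind of check between syscalls referred to above.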
> Off the top of my head, I have tried to identify some examples
> where we can have more trust on dynamic code and have the kernel
> permit its execution.
> 
> 1. If the kernel can do the job, then that is one safe way. Here, the kernel
>     is the code. There is no code generation involved. This is what I
>     have presented in the patch series as the first cut.
This is sleight-of-hand; it doesn't matter where the logic is performed
if the power is identical. Practically speaking this is equivalent to
some dynamic code generation.
I think that it's misleading to say that because the kernel emulates
something it is safe when the provenance of the syscall arguments cannot
be verified.
[...]
> Anyway, these are just examples. The principle is - if we can identify
> dynamic code that has a certain measure of trust, can the kernel
> permit their execution?
My point generally is that the kernel cannot identify this, and if
userspace code is trusted to dynamically generate trampfd arguments it
can equally be trusted to dynamically generate code.
[...]
> As I have mentioned above, I intend to have the kernel generate code
> only if the code generation is simple enough. For more complicated cases,
> I plan to use a user-level code generator that is for exclusive kernel use.
> I have yet to work out the details on how this would work. Need time.
This reads to me like trampfd is only dealing with a few special cases
and we know that we need a more general solution.
I hope I am mistaken, but I get the strong impression that you're trying
to justify your existing solution rather than trying to understand the
problem space.
To be clear, my strong opinion is that we should not be trying to do
this sort of emulation or code generation within the kernel. I do think
it's worthwhile to look at mechanisms to make it harder to subvert
dynamic userspace code generation, but I think the code generation
itself needs to live in userspace (e.g. for ABI reasons I previously
mentioned).
Mark.
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-12 10:06           ` Mark Rutland
@ 2020-08-12 18:47             ` Madhavan T. Venkataraman
  2020-08-19 18:53             ` Mickaël Salaün
  1 sibling, 0 replies; 64+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-12 18:47 UTC (permalink / raw)
  To: Mark Rutland
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86
On 8/12/20 5:06 AM, Mark Rutland wrote:
> [..]
>>
>> The general principle of the mitigation is W^X. I would argue that
>> the above options are violations of the W^X principle. If they are
>> allowed today, they must be fixed. And they will be. So, we cannot
>> rely on them.
> 
> Hold on.
> 
> Contemporary W^X means that a given virtual alias cannot be writable
> and executable simultaneously, permitting (a) and (b). If you read the
> references on the Wikipedia page for W^X you'll see the OpenBSD 3.3
> release notes and related presentation make this clear, and further they
> expect (b) to occur with JITs flipping W/X with mprotect().
> 
> Please don't conflate your assumed stronger semantics with the general
> principle. It not matching your expectations does not necessarily mean
> that it is wrong.
> 
> If you want a stronger W^X semantics, please refer to this specifically
> with a distinct name.
OK. Fair enough. We can give a different name to the stronger requirement.
Just for the sake of this discussion and for the want of a better name,
let us call it WX2.
> 
>> a) This requires a remap operation. Two mappings point to the same
>>      physical page. One mapping has W and the other one has X. This
>>      is a violation of W^X.
>>
>> b) This is again a violation. The kernel should refuse to give execute
>>      permission to a page that was writeable in the past and refuse to
>>      give write permission to a page that was executable in the past.
>>
>> c) This is just a variation of (a).
> 
> As above, this is not true.
> 
> If you have a rationale for why this is desirable or necessary, please
> justify that before using this as justification for additional features.
> 
I already supplied the justification. Any user level method can potentially
be hijacked by an attacker for his purpose.
WX does not prevent all of the methods. We need WX2.
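To make the difference between the two rules concrete, here is a small
userspace model in C. It is only a sketch of the rules as stated in this
mail, not kernel code: the names P_W, P_X, struct page_hist, wx_allows()
and wx2_request() are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define P_W 0x2	/* write permission requested (illustrative value) */
#define P_X 0x4	/* execute permission requested (illustrative value) */

/* Per-page history that a WX2-style policy would have to remember. */
struct page_hist {
	bool was_w;	/* page has been writable at some point */
	bool was_x;	/* page has been executable at some point */
};

/* WX: only forbids W and X being set at the same time. */
static bool wx_allows(int prot)
{
	return !((prot & P_W) && (prot & P_X));
}

/*
 * WX2: additionally refuses X on a page that was ever writable, and
 * W on a page that was ever executable. Updates the history on success.
 */
static bool wx2_request(struct page_hist *h, int prot)
{
	if (!wx_allows(prot))
		return false;
	if ((prot & P_X) && h->was_w)
		return false;
	if ((prot & P_W) && h->was_x)
		return false;

	if (prot & P_W)
		h->was_w = true;
	if (prot & P_X)
		h->was_x = true;
	return true;
}
```

Under this model, the RW- to R-- to --X sequence from option (b) passes
wx_allows() at every step but is refused by wx2_request() at the last one.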
>> In general, the problem with user-level methods to map and execute
>> dynamic code is that the kernel cannot tell if a genuine application is
>> using them or an attacker is using them or piggy-backing on them.
> 
> Yes, and as I pointed out the same is true for trampfd unless you can
> somehow authenticate the calls are legitimate (in both callsite and the
> set of arguments), and I don't see any reasonable way of doing that.
> 
I am afraid I am not in agreement with this. If WX2 is not implemented,
an attacker can hack both code and data. If WX2 is implemented, an attacker
can only attack data. The attack surface is reduced.
Also, trampfd calls coming from code from a signed file can be authenticated.
trampfd calls coming from an attacker's generated code cannot be authenticated.
> If you relax your threat model to an attacker not being able to make
> arbitrary syscalls, then your suggestion that userspace can perform
> checks between syscalls may be sufficient, but as I pointed out that's
> equally true for a sealed memfd or similar.
> 
Actually, I did not suggest that userspace can perform checks. I said that
the kernel can perform checks.
User space cannot reliably perform checks between calls. A clever hacker
can cover his tracks.
In any case, the kernel has no knowledge of these checks. So, when execute
permissions are requested for a page, a properly implemented WX2 can refuse.
>> Off the top of my head, I have tried to identify some examples
>> where we can have more trust on dynamic code and have the kernel
>> permit its execution.
>>
>> 1. If the kernel can do the job, then that is one safe way. Here, the kernel
>>     is the code. There is no code generation involved. This is what I
>>     have presented in the patch series as the first cut.
> 
> This is sleight-of-hand; it doesn't matter where the logic is performed
> if the power is identical. Practically speaking this is equivalent to
> some dynamic code generation.
> 
> I think that it's misleading to say that because the kernel emulates
> something it is safe when the provenance of the syscall arguments cannot
> be verified.
I submit that there are two aspects - code and data. In one case, both
code and data can be hacked. So, an attacker can modify both code
and data. In the other case, the attacker can only modify data.
The power is not identical. The attack surface is not the same.
Most of the time, security measures are mitigations. They are not 100%
effective. This approach of not allowing the user to do certain things that
can be exploited and having the kernel do them increases our confidence.
From that perspective, the two approaches are different and it is worth
pursuing a kernel based mitigation.
> 
> [...]
> 
>> Anyway, these are just examples. The principle is - if we can identify
>> dynamic code that has a certain measure of trust, can the kernel
>> permit their execution?
> 
> My point generally is that the kernel cannot identify this, and if
> userspace code is trusted to dynamically generate trampfd arguments it
> can equally be trusted to dynamically generate code.
I am afraid not. See my previous response. Ability to hack only data
gives an attacker fewer options as compared to the ability to hack
both code and data.
> 
> [...]
> 
>> As I have mentioned above, I intend to have the kernel generate code
>> only if the code generation is simple enough. For more complicated cases,
>> I plan to use a user-level code generator that is for exclusive kernel use.
>> I have yet to work out the details on how this would work. Need time.
> 
> This reads to me like trampfd is only dealing with a few special cases
> and we know that we need a more general solution.
> 
> I hope I am mistaken, but I get the strong impression that you're trying
> to justify your existing solution rather than trying to understand the
> problem space.
> 
I do understand the problem space. I wanted to address dynamic code in 3
different ways in separate phases starting from the easiest and working
my way up to the more difficult ones.
1. Remove dynamic code where possible
   If the kernel can replace user level dynamic code, then do it.
   This is what I did in version 1.
2. Replace dynamic code with static code
   Where you cannot do (1), replace dynamic code with static code with
   the kernel's help. I wanted to do this later. But I have decided to
   do this in version 2. This combined with signature verification of
   files adds a measure of trust in the code.
3. Deal with JIT, DBT, etc
   In (1) and (2), we deal with machine code. In (3), there is some source
   from which dynamic code needs to be generated using a code generator.
   E.g., JIT code from Java byte code. Here, the solution I had in mind
   had two parts:
       - Make the source more trustworthy by requiring it to be part
         of a signed file
       - Design a code generator trusted and used exclusively by the kernel
In this patchset, I wanted to lay a foundation for all 3 and attempt to
solve (1) first. Once this was in place, I wanted to do (2) and then (3).
In retrospect, I should have probably started with the big picture first
instead of starting with just item (1). But I always had the big picture
in mind. That said, I did not necessarily have all the details fleshed
out for all the phases. (3) is complex.
My focus was to define the API in a generic enough fashion so that all
3 phases can be implemented. But I realize that it is a hard sell at this
point to convince people that the API is adequate for phase 3. So,
I have decided to do (1) and (2). (3) has to be done separately with
more thought and details put into it.
Also, it may be the case that there are some examples of dynamic code
out there that can never be addressed. My goal is to try to address a
majority of the dynamic code out there.
> To be clear, my strong opinion is that we should not be trying to do
> this sort of emulation or code generation within the kernel. I do think
> it's worthwhile to look at mechanisms to make it harder to subvert
> dynamic userspace code generation, but I think the code generation
> itself needs to live in userspace (e.g. for ABI reasons I previously
> mentioned).
> 
I completely agree that the kernel should not deal with the complexities
of code generation and ABI details. My version 1 did not have any code
generation. But since a performance issue was raised, I explored the idea
of kernel code generation. To be honest, I was not really that
comfortable with the idea.
That is why I have decided to implement the second piece I had in
my plan now. This piece does not have the code generation complexities
or ABI issues. This piece can be used to solve libffi, GCC, etc.
I will still write the code in such a way that I can use the first
approach in the future if I really need it. But it will not involve any
code generation from the kernel. It will only be used for cases that
don't mind the extra trip to the kernel.
Madhavan
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-12 10:06           ` Mark Rutland
  2020-08-12 18:47             ` Madhavan T. Venkataraman
@ 2020-08-19 18:53             ` Mickaël Salaün
  2020-09-01 15:42               ` Mark Rutland
  1 sibling, 1 reply; 64+ messages in thread
From: Mickaël Salaün @ 2020-08-19 18:53 UTC (permalink / raw)
  To: Mark Rutland, Madhavan T. Venkataraman
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86
On 12/08/2020 12:06, Mark Rutland wrote:
> On Thu, Aug 06, 2020 at 12:26:02PM -0500, Madhavan T. Venkataraman wrote:
>> Thanks for the lively discussion. I have tried to answer some of the
>> comments below.
>>
>> On 8/4/20 9:30 AM, Mark Rutland wrote:
>>>
>>>> So, the context is - if security settings in a system disallow a page to have
>>>> both write and execute permissions, how do you allow the execution of
>>>> genuine trampolines that are runtime generated and placed in a data
>>>> page or a stack page?
>>> There are options today, e.g.
>>>
>>> a) If the restriction is only per-alias, you can have distinct aliases
>>>    where one is writable and another is executable, and you can make it
>>>    hard to find the relationship between the two.
>>>
>>> b) If the restriction is only temporal, you can write instructions into
>>>    an RW- buffer, transition the buffer to R--, verify the buffer
>>>    contents, then transition it to --X.
>>>
>>> c) You can have two processes A and B where A generates instructions into
>>>    a buffer that (only) B can execute (where B may be restricted from
>>>    making syscalls like write, mprotect, etc).
>>
>> The general principle of the mitigation is W^X. I would argue that
>> the above options are violations of the W^X principle. If they are
>> allowed today, they must be fixed. And they will be. So, we cannot
>> rely on them.
> 
> Hold on.
> 
> Contemporary W^X means that a given virtual alias cannot be writable
> and executable simultaneously, permitting (a) and (b). If you read the
> references on the Wikipedia page for W^X you'll see the OpenBSD 3.3
> release notes and related presentation make this clear, and further they
> expect (b) to occur with JITs flipping W/X with mprotect().
W^X (with "permanent" mprotect restrictions [1]) goes back to 2000 with
PaX [2] (which predates the partial OpenBSD implementation from 2003).
[1] https://pax.grsecurity.net/docs/mprotect.txt
[2] https://undeadly.org/cgi?action=article;sid=20030417082752
> 
> Please don't conflate your assumed stronger semantics with the general
> principle. It not matching your expectations does not necessarily mean
> that it is wrong.
> 
> If you want a stronger W^X semantics, please refer to this specifically
> with a distinct name.
> 
>> a) This requires a remap operation. Two mappings point to the same
>>      physical page. One mapping has W and the other one has X. This
>>      is a violation of W^X.
>>
>> b) This is again a violation. The kernel should refuse to give execute
>>      permission to a page that was writeable in the past and refuse to
>>      give write permission to a page that was executable in the past.
>>
>> c) This is just a variation of (a).
> 
> As above, this is not true.
> 
> If you have a rationale for why this is desirable or necessary, please
> justify that before using this as justification for additional features.
> 
>> In general, the problem with user-level methods to map and execute
>> dynamic code is that the kernel cannot tell if a genuine application is
>> using them or an attacker is using them or piggy-backing on them.
> 
> Yes, and as I pointed out the same is true for trampfd unless you can
> somehow authenticate the calls are legitimate (in both callsite and the
> set of arguments), and I don't see any reasonable way of doing that.
> 
> If you relax your threat model to an attacker not being able to make
> arbitrary syscalls, then your suggestion that userspace can perform
> checks between syscalls may be sufficient, but as I pointed out that's
> equally true for a sealed memfd or similar.
> 
>> Off the top of my head, I have tried to identify some examples
>> where we can have more trust on dynamic code and have the kernel
>> permit its execution.
>>
>> 1. If the kernel can do the job, then that is one safe way. Here, the kernel
>>     is the code. There is no code generation involved. This is what I
>>     have presented in the patch series as the first cut.
> 
> This is sleight-of-hand; it doesn't matter where the logic is performed
> if the power is identical. Practically speaking this is equivalent to
> some dynamic code generation.
> 
> I think that it's misleading to say that because the kernel emulates
> something it is safe when the provenance of the syscall arguments cannot
> be verified.
> 
> [...]
> 
>> Anyway, these are just examples. The principle is - if we can identify
>> dynamic code that has a certain measure of trust, can the kernel
>> permit their execution?
> 
> My point generally is that the kernel cannot identify this, and if
> userspace code is trusted to dynamically generate trampfd arguments it
> can equally be trusted to dynamically generate code.
> 
> [...]
> 
>> As I have mentioned above, I intend to have the kernel generate code
>> only if the code generation is simple enough. For more complicated cases,
>> I plan to use a user-level code generator that is for exclusive kernel use.
>> I have yet to work out the details on how this would work. Need time.
> 
> This reads to me like trampfd is only dealing with a few special cases
> and we know that we need a more general solution.
> 
> I hope I am mistaken, but I get the strong impression that you're trying
> to justify your existing solution rather than trying to understand the
> problem space.
> 
> To be clear, my strong opinion is that we should not be trying to do
> this sort of emulation or code generation within the kernel. I do think
> it's worthwhile to look at mechanisms to make it harder to subvert
> dynamic userspace code generation, but I think the code generation
> itself needs to live in userspace (e.g. for ABI reasons I previously
> mentioned).
> 
> Mark.
> 
- * Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-19 18:53             ` Mickaël Salaün
@ 2020-09-01 15:42               ` Mark Rutland
  0 siblings, 0 replies; 64+ messages in thread
From: Mark Rutland @ 2020-09-01 15:42 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Madhavan T. Venkataraman, kernel-hardening, linux-api,
	linux-arm-kernel, linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86
On Wed, Aug 19, 2020 at 08:53:42PM +0200, Mickaël Salaün wrote:
> On 12/08/2020 12:06, Mark Rutland wrote:
> > Contemporary W^X means that a given virtual alias cannot be writable
> > and executable simultaneously, permitting (a) and (b). If you read the
> > references on the Wikipedia page for W^X you'll see the OpenBSD 3.3
> > release notes and related presentation make this clear, and further they
> > expect (b) to occur with JITs flipping W/X with mprotect().
> 
> W^X (with "permanent" mprotect restrictions [1]) goes back to 2000 with
> PaX [2] (which predates the partial OpenBSD implementation from 2003).
> 
> [1] https://pax.grsecurity.net/docs/mprotect.txt
> [2] https://undeadly.org/cgi?action=article;sid=20030417082752
Thanks for the pointers!
Mark.