Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [PATCH bpf-next v11 6/7] bpf,landlock: Add tests for the Landlock ptrace program type
From: Mickaël Salaün @ 2019-10-29 17:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Alexei Starovoitov, Andy Lutomirski,
	Casey Schaufler, Daniel Borkmann, David Drysdale, Florent Revest,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, KP Singh, Michael Kerrisk, Mickaël Salaün,
	Paul Moore, Sargun Dhillon, Serge E . Hallyn, Shuah Khan,
	Stephen Smalley
In-Reply-To: <20191029171505.6650-1-mic@digikod.net>

Test eBPF program context access and ptrace hooks semantic.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: James Morris <jmorris@namei.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Will Drewry <wad@chromium.org>
---

Changes since v10:
* rework tests with new Landlock ptrace programs which restrict ptrace
  thanks to the task_landlock_ptrace_ancestor() helper
* simplify ptrace tests (make expect_ptrace implicit)
* add tests:
  * check a child process tracing its parent
  * check Landlock domain without ptrace enforcement (e.g. useful for
    audit/signaling purpose)
  * check inherited-only domains
  * check task pointer arithmetic
* fix flaky test for multi-core
* increase log size
* cosmetic renames
* update and improve the Makefile

Changes since v9:
* replace subtype with expected_attach_type and expected_attach_triggers
* rename inode_map_lookup() into inode_map_lookup_elem()
* check for inode map entry without value (which is now possible thanks
  to the pointer null check)
* use read-only inode map for Landlock programs

Changes since v8:
* update eBPF include path for macros
* use TEST_GEN_PROGS and use the generic "clean" target
* add more verbose errors
* update the bpf/verifier files
* remove chain tests (from landlock and bpf/verifier)
* replace the whitelist tests with blacklist tests (because of stateless
  Landlock programs): remove "dotdot" tests and other depth tests
* sync the landlock Makefile with its bpf sibling directory and use
  bpf_load_program_xattr()

Changes since v7:
* update tests and add new ones for filesystem hierarchy and Landlock
  chains.

Changes since v6:
* use the new kselftest_harness.h
* use const variables
* replace ASSERT_STEP with ASSERT_*
* rename BPF_PROG_TYPE_LANDLOCK to BPF_PROG_TYPE_LANDLOCK_RULE
* force sample library rebuild
* fix install target

Changes since v5:
* add subtype test
* add ptrace tests
* split and rename files
* cleanup and rebase
---
 scripts/bpf_helpers_doc.py                    |   1 +
 tools/include/uapi/linux/bpf.h                |  23 +-
 tools/include/uapi/linux/landlock.h           |  22 ++
 tools/lib/bpf/libbpf_probes.c                 |   3 +
 tools/testing/selftests/bpf/config            |   3 +
 tools/testing/selftests/bpf/test_verifier.c   |   1 +
 .../testing/selftests/bpf/verifier/landlock.c |  56 +++++
 tools/testing/selftests/landlock/.gitignore   |   5 +
 tools/testing/selftests/landlock/Makefile     |  27 +++
 tools/testing/selftests/landlock/config       |   5 +
 tools/testing/selftests/landlock/test.h       |  48 ++++
 tools/testing/selftests/landlock/test_base.c  |  24 ++
 .../testing/selftests/landlock/test_ptrace.c  | 210 ++++++++++++++++++
 13 files changed, 427 insertions(+), 1 deletion(-)
 create mode 100644 tools/include/uapi/linux/landlock.h
 create mode 100644 tools/testing/selftests/bpf/verifier/landlock.c
 create mode 100644 tools/testing/selftests/landlock/.gitignore
 create mode 100644 tools/testing/selftests/landlock/Makefile
 create mode 100644 tools/testing/selftests/landlock/config
 create mode 100644 tools/testing/selftests/landlock/test.h
 create mode 100644 tools/testing/selftests/landlock/test_base.c
 create mode 100644 tools/testing/selftests/landlock/test_ptrace.c

diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
index 7548569e8076..8e4c0fe75663 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_helpers_doc.py
@@ -466,6 +466,7 @@ class PrinterHelpers(Printer):
             'const struct sk_buff': 'const struct __sk_buff',
             'struct sk_msg_buff': 'struct sk_msg_md',
             'struct xdp_buff': 'struct xdp_md',
+            'struct task_struct': 'void',
     }
 
     def print_header(self):
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4af8b0819a32..c88436b97163 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -173,6 +173,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_SYSCTL,
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
 	BPF_PROG_TYPE_CGROUP_SOCKOPT,
+	BPF_PROG_TYPE_LANDLOCK_HOOK,
 };
 
 enum bpf_attach_type {
@@ -199,6 +200,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_UDP6_RECVMSG,
 	BPF_CGROUP_GETSOCKOPT,
 	BPF_CGROUP_SETSOCKOPT,
+	BPF_LANDLOCK_PTRACE,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2775,6 +2777,24 @@ union bpf_attr {
  * 		restricted to raw_tracepoint bpf programs.
  * 	Return
  * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_task_landlock_ptrace_ancestor(struct task_struct *parent, struct task_struct *child)
+ *	Description
+ *		Check the relation of a potentially parent task with a child
+ *		one, according to their Landlock ptrace hook programs.
+ *	Return
+ *		**-EINVAL** if the child's ptrace programs are not comparable
+ *		to the parent ones, i.e. one of them is an empty set.
+ *
+ *		**-ENOENT** if the parent's ptrace programs are either in a
+ *		separate hierarchy of the child ones, or if the parent's ptrace
+ *		programs are a superset of the child ones.
+ *
+ *		0 if the parent's ptrace programs are the same as the child
+ *		ones.
+ *
+ *		1 if the parent's ptrace programs are indeed a subset of the
+ *		child ones.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2888,7 +2908,8 @@ union bpf_attr {
 	FN(sk_storage_delete),		\
 	FN(send_signal),		\
 	FN(tcp_gen_syncookie),		\
-	FN(skb_output),
+	FN(skb_output),			\
+	FN(task_landlock_ptrace_ancestor),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/include/uapi/linux/landlock.h b/tools/include/uapi/linux/landlock.h
new file mode 100644
index 000000000000..3db2d190c4e7
--- /dev/null
+++ b/tools/include/uapi/linux/landlock.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Landlock - UAPI headers
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#ifndef _UAPI__LINUX_LANDLOCK_H__
+#define _UAPI__LINUX_LANDLOCK_H__
+
+#include <linux/types.h>
+
+#define LANDLOCK_RET_ALLOW	0
+#define LANDLOCK_RET_DENY	1
+
+struct landlock_context_ptrace {
+	__u64 tracer;
+	__u64 tracee;
+};
+
+#endif /* _UAPI__LINUX_LANDLOCK_H__ */
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index 4b0b0364f5fc..1e0d6346a7c7 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -78,6 +78,9 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
 	case BPF_PROG_TYPE_KPROBE:
 		xattr.kern_version = get_kernel_version();
 		break;
+	case BPF_PROG_TYPE_LANDLOCK_HOOK:
+		xattr.expected_attach_type = BPF_LANDLOCK_PTRACE;
+		break;
 	case BPF_PROG_TYPE_UNSPEC:
 	case BPF_PROG_TYPE_SOCKET_FILTER:
 	case BPF_PROG_TYPE_SCHED_CLS:
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 5dc109f4c097..3161a88a6059 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -35,3 +35,6 @@ CONFIG_MPLS_ROUTING=m
 CONFIG_MPLS_IPTUNNEL=m
 CONFIG_IPV6_SIT=m
 CONFIG_BPF_JIT=y
+CONFIG_SECCOMP_FILTER=y
+CONFIG_SECURITY=y
+CONFIG_SECURITY_LANDLOCK=y
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index d27fd929abb9..74f249dafc0b 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -30,6 +30,7 @@
 #include <linux/bpf.h>
 #include <linux/if_ether.h>
 #include <linux/btf.h>
+#include <linux/landlock.h>
 
 #include <bpf/bpf.h>
 #include <bpf/libbpf.h>
diff --git a/tools/testing/selftests/bpf/verifier/landlock.c b/tools/testing/selftests/bpf/verifier/landlock.c
new file mode 100644
index 000000000000..59cd333745dc
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/landlock.c
@@ -0,0 +1,56 @@
+{
+	"landlock/ptrace: always accept",
+	.prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK,
+	.expected_attach_type = BPF_LANDLOCK_PTRACE,
+	.insns = {
+		BPF_MOV32_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+},
+{
+	"landlock/ptrace: forbid arbitrary return value",
+	.prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK,
+	.expected_attach_type = BPF_LANDLOCK_PTRACE,
+	.insns = {
+		BPF_MOV32_IMM(BPF_REG_0, 2),
+		BPF_EXIT_INSN(),
+	},
+	.result = REJECT,
+	.errstr = "At program exit the register R0 has value (0x2; 0x0) should have been in (0x0; 0x1)",
+},
+{
+	"landlock/ptrace: read context and call dedicated helper",
+	.prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK,
+	.expected_attach_type = BPF_LANDLOCK_PTRACE,
+	.insns = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+		BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6,
+			offsetof(struct landlock_context_ptrace, tracer)),
+		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_6,
+			offsetof(struct landlock_context_ptrace, tracer)),
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				BPF_FUNC_task_landlock_ptrace_ancestor),
+		BPF_MOV32_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+},
+{
+	"landlock/ptrace: forbid pointer arithmetic",
+	.prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK,
+	.expected_attach_type = BPF_LANDLOCK_PTRACE,
+	.insns = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+		BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6,
+			offsetof(struct landlock_context_ptrace, tracer)),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 1),
+		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_6,
+			offsetof(struct landlock_context_ptrace, tracee)),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, 1),
+		BPF_MOV32_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	},
+	.result = REJECT,
+	.errstr = "R1 pointer arithmetic on task prohibited",
+},
diff --git a/tools/testing/selftests/landlock/.gitignore b/tools/testing/selftests/landlock/.gitignore
new file mode 100644
index 000000000000..4c5c01d23fe0
--- /dev/null
+++ b/tools/testing/selftests/landlock/.gitignore
@@ -0,0 +1,5 @@
+/feature
+/fixdep
+/*libbpf*
+/test_base
+/test_ptrace
diff --git a/tools/testing/selftests/landlock/Makefile b/tools/testing/selftests/landlock/Makefile
new file mode 100644
index 000000000000..2da77c30e77f
--- /dev/null
+++ b/tools/testing/selftests/landlock/Makefile
@@ -0,0 +1,27 @@
+# SPDX-License-Identifier: GPL-2.0
+
+LIBDIR := $(abspath ../../../lib)
+BPFDIR := $(LIBDIR)/bpf
+TOOLSDIR := $(abspath ../../../include)
+APIDIR := $(TOOLSDIR)/uapi
+
+CFLAGS += -g -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(TOOLSDIR)
+LDLIBS += -lelf
+
+test_src = $(wildcard test_*.c)
+
+TEST_GEN_PROGS := $(test_src:.c=)
+
+include ../lib.mk
+
+BPFOBJ := $(OUTPUT)/libbpf.a
+
+$(TEST_GEN_PROGS): $(BPFOBJ) ../kselftest_harness.h
+
+.PHONY: force
+
+# force a rebuild of BPFOBJ when its dependencies are updated
+force:
+
+$(BPFOBJ): force
+	$(MAKE) -C $(BPFDIR) OUTPUT=$(OUTPUT)/
diff --git a/tools/testing/selftests/landlock/config b/tools/testing/selftests/landlock/config
new file mode 100644
index 000000000000..fa5081b840ad
--- /dev/null
+++ b/tools/testing/selftests/landlock/config
@@ -0,0 +1,5 @@
+CONFIG_BPF=y
+CONFIG_BPF_SYSCALL=y
+CONFIG_SECCOMP_FILTER=y
+CONFIG_SECURITY=y
+CONFIG_SECURITY_LANDLOCK=y
diff --git a/tools/testing/selftests/landlock/test.h b/tools/testing/selftests/landlock/test.h
new file mode 100644
index 000000000000..836df68b6bb8
--- /dev/null
+++ b/tools/testing/selftests/landlock/test.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Landlock helpers
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2019 ANSSI
+ */
+
+#include <bpf/bpf.h>
+#include <errno.h>
+#include <linux/filter.h>
+#include <linux/landlock.h>
+#include <linux/seccomp.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+
+#include "../kselftest_harness.h"
+#include "../../../../samples/bpf/bpf_load.h"
+
+#ifndef SECCOMP_PREPEND_LANDLOCK_PROG
+#define SECCOMP_PREPEND_LANDLOCK_PROG	4
+#endif
+
+#ifndef seccomp
+static int __attribute__((unused)) seccomp(unsigned int op, unsigned int flags,
+		void *args)
+{
+	errno = 0;
+	return syscall(__NR_seccomp, op, flags, args);
+}
+#endif
+
+static int __attribute__((unused)) ll_bpf_load_program(
+		const struct bpf_insn *bpf_insns, size_t insns_len,
+		char *log_buf, size_t log_buf_sz,
+		const enum bpf_attach_type attach_type)
+{
+	struct bpf_load_program_attr load_attr;
+
+	memset(&load_attr, 0, sizeof(struct bpf_load_program_attr));
+	load_attr.prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK;
+	load_attr.expected_attach_type = attach_type;
+	load_attr.insns = bpf_insns;
+	load_attr.insns_cnt = insns_len / sizeof(struct bpf_insn);
+	load_attr.license = "GPL";
+
+	return bpf_load_program_xattr(&load_attr, log_buf, log_buf_sz);
+}
diff --git a/tools/testing/selftests/landlock/test_base.c b/tools/testing/selftests/landlock/test_base.c
new file mode 100644
index 000000000000..db46f39048cb
--- /dev/null
+++ b/tools/testing/selftests/landlock/test_base.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Landlock tests - base
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+
+#include "test.h"
+
+TEST(seccomp_landlock)
+{
+	int ret;
+
+	ret = seccomp(SECCOMP_PREPEND_LANDLOCK_PROG, 0, NULL);
+	EXPECT_EQ(-1, ret);
+	EXPECT_EQ(EFAULT, errno) {
+		TH_LOG("Kernel does not support CONFIG_SECURITY_LANDLOCK");
+	}
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/landlock/test_ptrace.c b/tools/testing/selftests/landlock/test_ptrace.c
new file mode 100644
index 000000000000..f4ee67126394
--- /dev/null
+++ b/tools/testing/selftests/landlock/test_ptrace.c
@@ -0,0 +1,210 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Landlock tests - ptrace
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2019 ANSSI
+ */
+
+#define _GNU_SOURCE
+#include <signal.h> /* raise */
+#include <sys/ptrace.h>
+#include <sys/types.h> /* waitpid */
+#include <sys/wait.h> /* waitpid */
+#include <unistd.h> /* fork, pipe */
+
+#include "test.h"
+
+#define LOG_SIZE 512
+
+static void create_domain(struct __test_metadata *_metadata,
+		bool scoped_ptrace, bool inherited_only)
+{
+	const struct bpf_insn prog_void[] = {
+		BPF_MOV32_IMM(BPF_REG_0, LANDLOCK_RET_ALLOW),
+		BPF_EXIT_INSN(),
+	};
+	const struct bpf_insn prog_check[] = {
+		BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_1),
+		BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6,
+			offsetof(struct landlock_context_ptrace, tracer)),
+		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_6,
+			offsetof(struct landlock_context_ptrace, tracee)),
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				BPF_FUNC_task_landlock_ptrace_ancestor),
+		/* if @tracee is an ancestor or at the same level of @tracer,
+		 * then allow ptrace (warning: do not use BPF_JGE 0) */
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, inherited_only ? 0 : 1, 2),
+		BPF_MOV32_IMM(BPF_REG_0, LANDLOCK_RET_DENY),
+		BPF_EXIT_INSN(),
+		BPF_MOV32_IMM(BPF_REG_0, LANDLOCK_RET_ALLOW),
+		BPF_EXIT_INSN(),
+	};
+	int prog;
+	char log[LOG_SIZE] = "";
+
+	if (scoped_ptrace)
+		prog = ll_bpf_load_program(prog_check, sizeof(prog_check),
+				log, sizeof(log), BPF_LANDLOCK_PTRACE);
+	else
+		prog = ll_bpf_load_program(prog_void, sizeof(prog_void),
+				log, sizeof(log), BPF_LANDLOCK_PTRACE);
+	ASSERT_NE(-1, prog) {
+		TH_LOG("Failed to load the %s program: %s\n%s",
+				scoped_ptrace ? "check" : "void",
+				strerror(errno), log);
+	}
+	ASSERT_EQ(0, seccomp(SECCOMP_PREPEND_LANDLOCK_PROG, 0, &prog)) {
+		TH_LOG("Failed to create a Landlock domain: %s", strerror(errno));
+	}
+	EXPECT_EQ(0, close(prog));
+}
+
+/* test PTRACE_TRACEME and PTRACE_ATTACH for parent and child */
+static void _check_ptrace(struct __test_metadata *_metadata,
+		bool scoped_ptrace, bool domain_both,
+		bool domain_parent, bool domain_child)
+{
+	pid_t child, parent;
+	int status;
+	int pipe_child[2], pipe_parent[2];
+	char buf_parent;
+	const bool inherited_only = domain_both && !domain_parent && !domain_child;
+
+	parent = getpid();
+
+	ASSERT_EQ(0, pipe(pipe_child));
+	ASSERT_EQ(0, pipe(pipe_parent));
+	if (domain_both)
+		create_domain(_metadata, scoped_ptrace, inherited_only);
+
+	child = fork();
+	ASSERT_LE(0, child);
+	if (child == 0) {
+		char buf_child;
+
+		EXPECT_EQ(0, close(pipe_parent[1]));
+		EXPECT_EQ(0, close(pipe_child[0]));
+		if (domain_child)
+			create_domain(_metadata, scoped_ptrace, inherited_only);
+
+		/* sync #1 */
+		ASSERT_EQ(1, read(pipe_parent[0], &buf_child, 1)) {
+			TH_LOG("Failed to read() sync #1 from parent");
+		}
+		ASSERT_EQ('.', buf_child);
+
+		/* test the parent protection */
+		ASSERT_EQ((domain_child && scoped_ptrace) ? -1 : 0,
+				ptrace(PTRACE_ATTACH, parent, NULL, 0));
+		if (domain_child && scoped_ptrace) {
+			ASSERT_EQ(EPERM, errno);
+		} else {
+			ASSERT_EQ(parent, waitpid(parent, &status, 0));
+			ASSERT_EQ(1, WIFSTOPPED(status));
+			ASSERT_EQ(0, ptrace(PTRACE_DETACH, parent, NULL, 0));
+		}
+
+		/* sync #2 */
+		ASSERT_EQ(1, write(pipe_child[1], ".", 1)) {
+			TH_LOG("Failed to write() sync #2 to parent");
+		}
+
+		/* test traceme */
+		ASSERT_EQ((domain_parent && scoped_ptrace) ? -1 : 0,
+				ptrace(PTRACE_TRACEME));
+		if (domain_parent && scoped_ptrace) {
+			ASSERT_EQ(EPERM, errno);
+		} else {
+			ASSERT_EQ(0, raise(SIGSTOP));
+		}
+
+		/* sync #3 */
+		ASSERT_EQ(1, read(pipe_parent[0], &buf_child, 1)) {
+			TH_LOG("Failed to read() sync #3 from parent");
+		}
+		ASSERT_EQ('.', buf_child);
+		_exit(_metadata->passed ? EXIT_SUCCESS : EXIT_FAILURE);
+	}
+
+	EXPECT_EQ(0, close(pipe_child[1]));
+	EXPECT_EQ(0, close(pipe_parent[0]));
+	if (domain_parent)
+		create_domain(_metadata, scoped_ptrace, inherited_only);
+
+	/* sync #1 */
+	ASSERT_EQ(1, write(pipe_parent[1], ".", 1)) {
+		TH_LOG("Failed to write() sync #1 to child");
+	}
+
+	/* test the parent protection */
+	/* sync #2 */
+	ASSERT_EQ(1, read(pipe_child[0], &buf_parent, 1)) {
+		TH_LOG("Failed to read() sync #2 from child");
+	}
+	ASSERT_EQ('.', buf_parent);
+
+	/* test traceme */
+	if (!(domain_parent && scoped_ptrace)) {
+		ASSERT_EQ(child, waitpid(child, &status, 0));
+		ASSERT_EQ(1, WIFSTOPPED(status));
+		ASSERT_EQ(0, ptrace(PTRACE_DETACH, child, NULL, 0));
+	}
+	/* test attach */
+	ASSERT_EQ((domain_parent && scoped_ptrace) ? -1 : 0,
+			ptrace(PTRACE_ATTACH, child, NULL, 0));
+	if (domain_parent && scoped_ptrace) {
+		ASSERT_EQ(EPERM, errno);
+	} else {
+		ASSERT_EQ(child, waitpid(child, &status, 0));
+		ASSERT_EQ(1, WIFSTOPPED(status));
+		ASSERT_EQ(0, ptrace(PTRACE_DETACH, child, NULL, 0));
+	}
+
+	/* sync #3 */
+	ASSERT_EQ(1, write(pipe_parent[1], ".", 1)) {
+		TH_LOG("Failed to write() sync #3 to child");
+	}
+	ASSERT_EQ(child, waitpid(child, &status, 0));
+	if (WIFSIGNALED(status) || WEXITSTATUS(status))
+		_metadata->passed = 0;
+}
+
+/* keep the *_scoped order to check program inheritance */
+#define CHECK_PTRACE(name, domain_both, domain_parent, domain_child) \
+	TEST(name ## _unscoped) { \
+		_check_ptrace(_metadata, false, domain_both, domain_parent, \
+				domain_child); \
+	} \
+	TEST(name ## _scoped) { \
+		_check_ptrace(_metadata, false, domain_both, domain_parent, \
+				domain_child); \
+		_check_ptrace(_metadata, true, domain_both, domain_parent, \
+				domain_child); \
+	}
+
+/* no domain */
+CHECK_PTRACE(allow_without_domain, false, false, false);
+
+/* child domain */
+CHECK_PTRACE(allow_with_one_domain, false, false, true);
+
+/* parent domain */
+CHECK_PTRACE(deny_with_parent_domain, false, true, false);
+
+/* parent and child domain */
+CHECK_PTRACE(deny_with_sibling_domain, false, true, true);
+
+/* inherited domain */
+CHECK_PTRACE(allow_sibling_domain, true, false, false);
+
+/* inherited and child domain */
+CHECK_PTRACE(allow_with_nested_domain, true, false, true);
+
+/* inherited and parent domain */
+CHECK_PTRACE(deny_with_nested_and_parent_domain, true, true, false);
+
+/* inherited, parent and child domain */
+CHECK_PTRACE(deny_with_forked_domain, true, true, true);
+
+TEST_HARNESS_MAIN
-- 
2.23.0

^ permalink raw reply related

* [PATCH bpf-next v11 5/7] bpf,landlock: Add task_landlock_ptrace_ancestor() helper
From: Mickaël Salaün @ 2019-10-29 17:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Alexei Starovoitov, Andy Lutomirski,
	Casey Schaufler, Daniel Borkmann, David Drysdale, Florent Revest,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, KP Singh, Michael Kerrisk, Mickaël Salaün,
	Paul Moore, Sargun Dhillon, Serge E . Hallyn, Shuah Khan,
	Stephen Smalley
In-Reply-To: <20191029171505.6650-1-mic@digikod.net>

This new task_landlock_ptrace_ancestor() helper can be used to identify
if the Landlock domain tied to the current tracer is in the same
hierarchy as the domain of tracee.

Indeed, ptrace(2) can be used to impersonate an unsandboxed process and
lead to a privilege escalation.  A common use-case when sandboxing a
process is then to forbid it to debug a less-privileged process.  A
sandbox process (tracer) should only be allowed to trace another process
(tracee) if the tracee has fewer privileges than the tracer.  This
policy can be implemented with this helper.

More complex helpers could be added in the future to enable other ways
to check the relation between the tracer and the tracee.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: James Morris <jmorris@namei.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Will Drewry <wad@chromium.org>
---

Changes since v10:
* new patch taking inspiration from the previous static ptrace policy
---
 include/linux/bpf.h            |  2 +
 include/uapi/linux/bpf.h       | 21 ++++++++++-
 kernel/bpf/verifier.c          |  4 ++
 security/landlock/bpf_ptrace.c | 68 ++++++++++++++++++++++++++++++++++
 security/landlock/bpf_verify.c |  4 ++
 5 files changed, 98 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 819a3e207438..67ec198a90cb 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -214,6 +214,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_LONG,	/* pointer to long */
 	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock (fullsock) */
 	ARG_PTR_TO_BTF_ID,	/* pointer to in-kernel struct */
+	ARG_PTR_TO_TASK,	/* pointer to task_struct */
 };
 
 /* type of values returned from helper functions */
@@ -1088,6 +1089,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
 extern const struct bpf_func_proto bpf_strtol_proto;
 extern const struct bpf_func_proto bpf_strtoul_proto;
 extern const struct bpf_func_proto bpf_tcp_sock_proto;
+extern const struct bpf_func_proto bpf_task_landlock_ptrace_ancestor_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6e4147790f96..c88436b97163 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2777,6 +2777,24 @@ union bpf_attr {
  * 		restricted to raw_tracepoint bpf programs.
  * 	Return
  * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_task_landlock_ptrace_ancestor(struct task_struct *parent, struct task_struct *child)
+ *	Description
+ *		Check the relation of a potentially parent task with a child
+ *		one, according to their Landlock ptrace hook programs.
+ *	Return
+ *		**-EINVAL** if the child's ptrace programs are not comparable
+ *		to the parent ones, i.e. one of them is an empty set.
+ *
+ *		**-ENOENT** if the parent's ptrace programs are either in a
+ *		separate hierarchy of the child ones, or if the parent's ptrace
+ *		programs are a superset of the child ones.
+ *
+ *		0 if the parent's ptrace programs are the same as the child
+ *		ones.
+ *
+ *		1 if the parent's ptrace programs are indeed a subset of the
+ *		child ones.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2890,7 +2908,8 @@ union bpf_attr {
 	FN(sk_storage_delete),		\
 	FN(send_signal),		\
 	FN(tcp_gen_syncookie),		\
-	FN(skb_output),
+	FN(skb_output),			\
+	FN(task_landlock_ptrace_ancestor),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ebf1991906b7..af8f1a777a2d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3492,6 +3492,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 		    type != PTR_TO_MAP_VALUE &&
 		    type != expected_type)
 			goto err_type;
+	} else if (arg_type == ARG_PTR_TO_TASK) {
+		expected_type = PTR_TO_TASK;
+		if (type != expected_type)
+			goto err_type;
 	} else {
 		verbose(env, "unsupported arg_type %d\n", arg_type);
 		return -EFAULT;
diff --git a/security/landlock/bpf_ptrace.c b/security/landlock/bpf_ptrace.c
index 2ec73078ad01..0e1362951463 100644
--- a/security/landlock/bpf_ptrace.c
+++ b/security/landlock/bpf_ptrace.c
@@ -7,9 +7,13 @@
  */
 
 #include <linux/bpf.h>
+#include <linux/cred.h>
+#include <linux/kernel.h>
+#include <linux/rcupdate.h>
 #include <uapi/linux/landlock.h>
 
 #include "bpf_ptrace.h"
+#include "common.h"
 
 bool landlock_is_valid_access_ptrace(int off, enum bpf_access_type type,
 		enum bpf_reg_type *reg_type, int *max_size)
@@ -28,3 +32,67 @@ bool landlock_is_valid_access_ptrace(int off, enum bpf_access_type type,
 		return false;
 	}
 }
+
+/**
+ * domain_ptrace_ancestor - check domain ordering according to ptrace
+ *
+ * @parent: a parent domain
+ * @child: a potential child of @parent
+ *
+ * Check if the @parent domain is less or equal to (i.e. a subset of) the
+ * @child domain.
+ */
+static int domain_ptrace_ancestor(const struct landlock_domain *parent,
+		const struct landlock_domain *child)
+{
+	const struct landlock_prog_list *child_progs, *parent_progs;
+	const size_t hook = get_hook_index(LANDLOCK_HOOK_PTRACE);
+
+	if (!parent || !child)
+		/* @parent or @child has no ptrace restriction */
+		return -EINVAL;
+	parent_progs = parent->programs[hook];
+	child_progs = child->programs[hook];
+	if (!parent_progs || !child_progs)
+		/* @parent or @child has no ptrace restriction */
+		return -EINVAL;
+	if (child_progs == parent_progs)
+		/* @parent is at the same level as @child */
+		return 0;
+	for (child_progs = child_progs->prev; child_progs;
+			child_progs = child_progs->prev) {
+		if (child_progs == parent_progs)
+			/* @parent is one of the ancestors of @child */
+			return 1;
+	}
+	/*
+	 * Either there is no relationship between @parent and @child, or
+	 * @child is one of the ancestors of @parent.
+	 */
+	return -ENOENT;
+}
+
+/*
+ * Cf. include/uapi/linux/bpf.h - bpf_task_landlock_ptrace_ancestor
+ */
+BPF_CALL_2(bpf_task_landlock_ptrace_ancestor, const struct task_struct *,
+		parent, const struct task_struct *, child)
+{
+	const struct landlock_domain *dom_parent, *dom_child;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	if (WARN_ON(!parent || !child))
+		return -EFAULT;
+	dom_parent = landlock_cred(__task_cred(parent))->domain;
+	dom_child = landlock_cred(__task_cred(child))->domain;
+	return domain_ptrace_ancestor(dom_parent, dom_child);
+}
+
+const struct bpf_func_proto bpf_task_landlock_ptrace_ancestor_proto = {
+	.func		= bpf_task_landlock_ptrace_ancestor,
+	.gpl_only	= false,
+	.pkt_access	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TASK,
+	.arg2_type	= ARG_PTR_TO_TASK,
+};
diff --git a/security/landlock/bpf_verify.c b/security/landlock/bpf_verify.c
index 6ed921588178..a1d2db75d51d 100644
--- a/security/landlock/bpf_verify.c
+++ b/security/landlock/bpf_verify.c
@@ -70,6 +70,10 @@ static const struct bpf_func_proto *bpf_landlock_func_proto(
 		return &bpf_map_update_elem_proto;
 	case BPF_FUNC_map_delete_elem:
 		return &bpf_map_delete_elem_proto;
+	case BPF_FUNC_task_landlock_ptrace_ancestor:
+		if (get_hook_type(prog) == LANDLOCK_HOOK_PTRACE)
+			return &bpf_task_landlock_ptrace_ancestor_proto;
+		return NULL;
 	default:
 		return NULL;
 	}
-- 
2.23.0

^ permalink raw reply related

* [PATCH bpf-next v11 4/7] landlock: Add ptrace LSM hooks
From: Mickaël Salaün @ 2019-10-29 17:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Alexei Starovoitov, Andy Lutomirski,
	Casey Schaufler, Daniel Borkmann, David Drysdale, Florent Revest,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, KP Singh, Michael Kerrisk, Mickaël Salaün,
	Paul Moore, Sargun Dhillon, Serge E . Hallyn, Shuah Khan,
	Stephen Smalley
In-Reply-To: <20191029171505.6650-1-mic@digikod.net>

Add a first Landlock hook that can be used to enforce a security policy
or to audit some process activities.  For a sandboxing use-case, it is
needed to inform the kernel if a task can legitimately debug another.
ptrace(2) can also be used by an attacker to impersonate another task
and remain undetected while performing malicious activities.

Using ptrace(2) and related features on a target process can lead to a
privilege escalation.  A sandboxed task must then be able to tell the
kernel if another task is more privileged, via ptrace_may_access().

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: James Morris <jmorris@namei.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Will Drewry <wad@chromium.org>
---

Changes since v10:
* revamp and replace the static policy with a Landlock hook which may be
  used by the corresponding BPF_LANDLOCK_PTRACE program (attach) type
  and a dedicated process_cmp_landlock_ptrace() BPF helper
* check prog return value against LANDLOCK_RET_DENY (ret is a bitmask)

Changes since v6:
* factor out ptrace check
* constify pointers
* cleanup headers
* use the new security_add_hooks()
---
 security/landlock/Makefile       |   4 +-
 security/landlock/bpf_run.c      |  62 +++++++++++++++++
 security/landlock/bpf_run.h      |  25 +++++++
 security/landlock/hooks_ptrace.c | 114 +++++++++++++++++++++++++++++++
 security/landlock/hooks_ptrace.h |  19 ++++++
 security/landlock/init.c         |   2 +
 6 files changed, 224 insertions(+), 2 deletions(-)
 create mode 100644 security/landlock/bpf_run.c
 create mode 100644 security/landlock/bpf_run.h
 create mode 100644 security/landlock/hooks_ptrace.c
 create mode 100644 security/landlock/hooks_ptrace.h

diff --git a/security/landlock/Makefile b/security/landlock/Makefile
index 0b291f2c027c..93e4c2f31c8a 100644
--- a/security/landlock/Makefile
+++ b/security/landlock/Makefile
@@ -1,6 +1,6 @@
 obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
 
 landlock-y := init.o \
-	bpf_verify.o bpf_ptrace.o \
+	bpf_verify.o bpf_run.o bpf_ptrace.o \
 	domain_manage.o domain_syscall.o \
-	hooks_cred.o
+	hooks_cred.o hooks_ptrace.o
diff --git a/security/landlock/bpf_run.c b/security/landlock/bpf_run.c
new file mode 100644
index 000000000000..8874958bdc30
--- /dev/null
+++ b/security/landlock/bpf_run.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Landlock LSM - eBPF program evaluation
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#include <asm/current.h>
+#include <linux/bpf.h>
+#include <linux/errno.h>
+#include <linux/filter.h>
+#include <linux/rculist.h>
+#include <uapi/linux/landlock.h>
+
+#include "bpf_run.h"
+#include "common.h"
+#include "hooks_ptrace.h"
+
+static const void *get_prog_ctx(struct landlock_hook_ctx *hook_ctx)
+{
+	switch (hook_ctx->type) {
+	case LANDLOCK_HOOK_PTRACE:
+		return landlock_get_ctx_ptrace(hook_ctx->ctx_ptrace);
+	}
+	WARN_ON(1);
+	return NULL;
+}
+
+/**
+ * landlock_access_denied - run Landlock programs tied to a hook
+ *
+ * @domain: Landlock domain pointer
+ * @hook_ctx: non-NULL valid eBPF context pointer
+ *
+ * Return true if at least one program return deny, false otherwise.
+ */
+bool landlock_access_denied(struct landlock_domain *domain,
+		struct landlock_hook_ctx *hook_ctx)
+{
+	struct landlock_prog_list *prog_list;
+	const size_t hook = get_hook_index(hook_ctx->type);
+
+	if (!domain)
+		return false;
+
+	for (prog_list = domain->programs[hook]; prog_list;
+			prog_list = prog_list->prev) {
+		u32 ret;
+		const void *prog_ctx;
+
+		prog_ctx = get_prog_ctx(hook_ctx);
+		if (!prog_ctx || WARN_ON(IS_ERR(prog_ctx)))
+			return true;
+		rcu_read_lock();
+		ret = BPF_PROG_RUN(prog_list->prog, prog_ctx);
+		rcu_read_unlock();
+		if (ret & LANDLOCK_RET_DENY)
+			return true;
+	}
+	return false;
+}
diff --git a/security/landlock/bpf_run.h b/security/landlock/bpf_run.h
new file mode 100644
index 000000000000..3461cbb8ec12
--- /dev/null
+++ b/security/landlock/bpf_run.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Landlock LSM - eBPF program evaluation headers
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#ifndef _SECURITY_LANDLOCK_BPF_RUN_H
+#define _SECURITY_LANDLOCK_BPF_RUN_H
+
+#include "common.h"
+#include "hooks_ptrace.h"
+
+struct landlock_hook_ctx {
+	enum landlock_hook_type type;
+	union {
+		struct landlock_hook_ctx_ptrace *ctx_ptrace;
+	};
+};
+
+bool landlock_access_denied(struct landlock_domain *domain,
+		struct landlock_hook_ctx *hook_ctx);
+
+#endif /* _SECURITY_LANDLOCK_BPF_RUN_H */
diff --git a/security/landlock/hooks_ptrace.c b/security/landlock/hooks_ptrace.c
new file mode 100644
index 000000000000..8e518a472d04
--- /dev/null
+++ b/security/landlock/hooks_ptrace.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Landlock LSM - ptrace hooks
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2019 ANSSI
+ */
+
+#include <asm/current.h>
+#include <linux/cred.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/lsm_hooks.h>
+#include <linux/sched.h>
+#include <uapi/linux/landlock.h>
+
+#include "bpf_run.h"
+#include "common.h"
+#include "hooks_ptrace.h"
+
+struct landlock_hook_ctx_ptrace {
+	struct landlock_context_ptrace prog_ctx;
+};
+
+const struct landlock_context_ptrace *landlock_get_ctx_ptrace(
+		const struct landlock_hook_ctx_ptrace *hook_ctx)
+{
+	if (WARN_ON(!hook_ctx))
+		return NULL;
+
+	return &hook_ctx->prog_ctx;
+}
+
+static int check_ptrace(struct landlock_domain *domain,
+		struct task_struct *tracer, struct task_struct *tracee)
+{
+	struct landlock_hook_ctx_ptrace ctx_ptrace = {
+		.prog_ctx = {
+			.tracer = (uintptr_t)tracer,
+			.tracee = (uintptr_t)tracee,
+		},
+	};
+	struct landlock_hook_ctx hook_ctx = {
+		.type = LANDLOCK_HOOK_PTRACE,
+		.ctx_ptrace = &ctx_ptrace,
+	};
+
+	return landlock_access_denied(domain, &hook_ctx) ? -EPERM : 0;
+}
+
+/**
+ * hook_ptrace_access_check - determine whether the current process may access
+ *			      another
+ *
+ * @child: the process to be accessed
+ * @mode: the mode of attachment
+ *
+ * If the current task (i.e. tracer) has one or multiple BPF_LANDLOCK_PTRACE
+ * programs, then run them with the `struct landlock_context_ptrace` context.
+ * If one of these programs return LANDLOCK_RET_DENY, then deny access with
+ * -EPERM, else allow it by returning 0.
+ */
+static int hook_ptrace_access_check(struct task_struct *child,
+		unsigned int mode)
+{
+	struct landlock_domain *dom_current;
+	const size_t hook = get_hook_index(LANDLOCK_HOOK_PTRACE);
+
+	dom_current = landlock_cred(current_cred())->domain;
+	if (!(dom_current && dom_current->programs[hook]))
+		return 0;
+	return check_ptrace(dom_current, current, child);
+}
+
+/**
+ * hook_ptrace_traceme - determine whether another process may trace the
+ *			 current one
+ *
+ * @parent: the task proposed to be the tracer
+ *
+ * If the parent task (i.e. tracer) has one or multiple BPF_LANDLOCK_PTRACE
+ * programs, then run them with the `struct landlock_context_ptrace` context.
+ * If one of these programs return LANDLOCK_RET_DENY, then deny access with
+ * -EPERM, else allow it by returning 0.
+ */
+static int hook_ptrace_traceme(struct task_struct *parent)
+{
+	struct landlock_domain *dom_parent;
+	const size_t hook = get_hook_index(LANDLOCK_HOOK_PTRACE);
+	int ret;
+
+	rcu_read_lock();
+	dom_parent = landlock_cred(__task_cred(parent))->domain;
+	if (!(dom_parent && dom_parent->programs[hook])) {
+		ret = 0;
+		goto put_rcu;
+	}
+	ret = check_ptrace(dom_parent, parent, current);
+
+put_rcu:
+	rcu_read_unlock();
+	return ret;
+}
+
+static struct security_hook_list landlock_hooks[] = {
+	LSM_HOOK_INIT(ptrace_access_check, hook_ptrace_access_check),
+	LSM_HOOK_INIT(ptrace_traceme, hook_ptrace_traceme),
+};
+
+__init void landlock_add_hooks_ptrace(void)
+{
+	security_add_hooks(landlock_hooks, ARRAY_SIZE(landlock_hooks),
+			LANDLOCK_NAME);
+}
diff --git a/security/landlock/hooks_ptrace.h b/security/landlock/hooks_ptrace.h
new file mode 100644
index 000000000000..53fe651bdb3e
--- /dev/null
+++ b/security/landlock/hooks_ptrace.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Landlock LSM - ptrace hooks headers
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2019 ANSSI
+ */
+
+#ifndef _SECURITY_LANDLOCK_HOOKS_PTRACE_H
+#define _SECURITY_LANDLOCK_HOOKS_PTRACE_H
+
+struct landlock_hook_ctx_ptrace;
+
+const struct landlock_context_ptrace *landlock_get_ctx_ptrace(
+		const struct landlock_hook_ctx_ptrace *hook_ctx);
+
+__init void landlock_add_hooks_ptrace(void);
+
+#endif /* _SECURITY_LANDLOCK_HOOKS_PTRACE_H */
diff --git a/security/landlock/init.c b/security/landlock/init.c
index 8836ec4defd3..541aad17418e 100644
--- a/security/landlock/init.c
+++ b/security/landlock/init.c
@@ -10,11 +10,13 @@
 
 #include "common.h"
 #include "hooks_cred.h"
+#include "hooks_ptrace.h"
 
 static int __init landlock_init(void)
 {
 	pr_info(LANDLOCK_NAME ": Registering hooks\n");
 	landlock_add_hooks_cred();
+	landlock_add_hooks_ptrace();
 	return 0;
 }
 
-- 
2.23.0

^ permalink raw reply related

* [PATCH bpf-next v11 3/7] landlock,seccomp: Load Landlock programs per process hierarchy
From: Mickaël Salaün @ 2019-10-29 17:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Alexei Starovoitov, Andy Lutomirski,
	Casey Schaufler, Daniel Borkmann, David Drysdale, Florent Revest,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, KP Singh, Michael Kerrisk, Mickaël Salaün,
	Paul Moore, Sargun Dhillon, Serge E . Hallyn, Shuah Khan,
	Stephen Smalley
In-Reply-To: <20191029171505.6650-1-mic@digikod.net>

The seccomp(2) syscall can be used by a task to apply a Landlock program
to itself. As a seccomp filter, a Landlock program is enforced for the
current task and all its future children. A program is immutable and a
task can only add new restricting programs to itself, forming a list of
programs.

A Landlock program is tied to a Landlock hook. If the action on a kernel
object is allowed by the other Linux security mechanisms (e.g. DAC,
capabilities, other LSM), then a Landlock hook related to this kind of
object is triggered. The list of programs for this hook is then
evaluated. Each program return a binary value which can deny the action
on a kernel object with a non-zero value. If every programs of the list
return zero, then the action on the object is allowed.

The next commit adds the LSM hooks to enforce the memory protection
programs on the appropriate process hierarchies.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: James Morris <jmorris@namei.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Will Drewry <wad@chromium.org>
Link: https://lore.kernel.org/lkml/c10a503d-5e35-7785-2f3d-25ed8dd63fab@digikod.net/
---

Changes since v10:
* rewrite the Landlock program attaching mechanisme to not rely on
  internal seccomp structures but only on the (LSM-stacked) task's
  credentials:
  * make the use of seccomp (and task's credentials) optional if not
    relying on its syscall, which may be useful for domains defined by
    other means (e.g. cgroups or system-wide thanks to a dedicated
    securityfs)

Changes since v9:
* replace subtype with expected_attach_type and expected_attach_triggers

Changes since v8:
* Remove the chaining concept from the eBPF program contexts (chain and
  cookie). We need to keep these subtypes this way to be able to make
  them evolve, though.

Changes since v7:
* handle and verify program chains
* split and rename providers.c to enforce.c and enforce_seccomp.c
* rename LANDLOCK_SUBTYPE_* to LANDLOCK_*

Changes since v6:
* rename some functions with more accurate names to reflect that an eBPF
  program for Landlock could be used for something else than a rule
* reword rule "appending" to "prepending" and explain it
* remove the superfluous no_new_privs check, only check global
  CAP_SYS_ADMIN when prepending a Landlock rule (needed for containers)
* create and use {get,put}_seccomp_landlock() (suggested by Kees Cook)
* replace ifdef with static inlined function (suggested by Kees Cook)
* use get_user() (suggested by Kees Cook)
* replace atomic_t with refcount_t (requested by Kees Cook)
* move struct landlock_{rule,events} from landlock.h to common.h
* cleanup headers

Changes since v5:
* remove struct landlock_node and use a similar inheritance mechanisme
  as seccomp-bpf (requested by Andy Lutomirski)
* rename SECCOMP_ADD_LANDLOCK_RULE to SECCOMP_APPEND_LANDLOCK_RULE
* rename file manager.c to providers.c
* add comments
* typo and cosmetic fixes

Changes since v4:
* merge manager and seccomp patches
* return -EFAULT in seccomp(2) when user_bpf_fd is null to easely check
  if Landlock is supported
* only allow a process with the global CAP_SYS_ADMIN to use Landlock
  (will be lifted in the future)
* add an early check to exit as soon as possible if the current process
  does not have Landlock rules

Changes since v3:
* remove the hard link with seccomp (suggested by Andy Lutomirski and
  Kees Cook):
  * remove the cookie which could imply multiple evaluation of Landlock
    rules
  * remove the origin field in struct landlock_data
* remove documentation fix (merged upstream)
* rename the new seccomp command to SECCOMP_ADD_LANDLOCK_RULE
* internal renaming
* split commit
* new design to be able to inherit on the fly the parent rules

Changes since v2:
* Landlock programs can now be run without seccomp filter but for any
  syscall (from the process) or interruption
* move Landlock related functions and structs into security/landlock/*
  (to manage cgroups as well)
* fix seccomp filter handling: run Landlock programs for each of their
  legitimate seccomp filter
* properly clean up all seccomp results
* cosmetic changes to ease the understanding
* fix some ifdef
---
 MAINTAINERS                        |  1 +
 include/linux/landlock.h           | 25 +++++++++
 include/linux/lsm_hooks.h          |  1 +
 include/uapi/linux/seccomp.h       |  1 +
 kernel/seccomp.c                   |  4 ++
 security/landlock/Makefile         |  5 +-
 security/landlock/common.h         | 16 ++++++
 security/landlock/domain_syscall.c | 87 ++++++++++++++++++++++++++++++
 security/landlock/hooks_cred.c     | 47 ++++++++++++++++
 security/landlock/hooks_cred.h     | 14 +++++
 security/landlock/init.c           | 30 +++++++++++
 security/security.c                | 15 ++++++
 12 files changed, 244 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/landlock.h
 create mode 100644 security/landlock/domain_syscall.c
 create mode 100644 security/landlock/hooks_cred.c
 create mode 100644 security/landlock/hooks_cred.h
 create mode 100644 security/landlock/init.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 4cabb85ea52d..32bfd88159b0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9149,6 +9149,7 @@ F:	net/ipv4/tcp_bpf.c
 LANDLOCK SECURITY MODULE
 M:	Mickaël Salaün <mic@digikod.net>
 S:	Supported
+F:	include/linux/landlock.h
 F:	include/uapi/linux/landlock.h
 F:	security/landlock/
 K:	landlock
diff --git a/include/linux/landlock.h b/include/linux/landlock.h
new file mode 100644
index 000000000000..ffbf2397c459
--- /dev/null
+++ b/include/linux/landlock.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Landlock LSM - public kernel headers
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#ifndef _LINUX_LANDLOCK_H
+#define _LINUX_LANDLOCK_H
+
+#include <linux/errno.h>
+
+#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
+extern int landlock_seccomp_prepend_prog(unsigned int flags,
+		const int __user *user_bpf_fd);
+#else /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
+static inline int landlock_seccomp_prepend_prog(unsigned int flags,
+		const int __user *user_bpf_fd)
+{
+		return -EINVAL;
+}
+#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
+
+#endif /* _LINUX_LANDLOCK_H */
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index a3763247547c..a8ba679b388a 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -2106,6 +2106,7 @@ extern void security_add_hooks(struct security_hook_list *hooks, int count,
 enum lsm_order {
 	LSM_ORDER_FIRST = -1,	/* This is only for capabilities. */
 	LSM_ORDER_MUTABLE = 0,
+	LSM_ORDER_LAST = 1,	/* potentially-unprivileged LSM */
 };
 
 struct lsm_info {
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 90734aa5aa36..bce6534e7feb 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -16,6 +16,7 @@
 #define SECCOMP_SET_MODE_FILTER		1
 #define SECCOMP_GET_ACTION_AVAIL	2
 #define SECCOMP_GET_NOTIF_SIZES		3
+#define SECCOMP_PREPEND_LANDLOCK_PROG	4
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index dba52a7db5e8..af542a2d21e7 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -41,6 +41,7 @@
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
 #include <linux/anon_inodes.h>
+#include <linux/landlock.h>
 
 enum notify_state {
 	SECCOMP_NOTIFY_INIT,
@@ -1397,6 +1398,9 @@ static long do_seccomp(unsigned int op, unsigned int flags,
 			return -EINVAL;
 
 		return seccomp_get_notif_sizes(uargs);
+	case SECCOMP_PREPEND_LANDLOCK_PROG:
+		return landlock_seccomp_prepend_prog(flags,
+				(const int __user *)uargs);
 	default:
 		return -EINVAL;
 	}
diff --git a/security/landlock/Makefile b/security/landlock/Makefile
index dd5f70185778..0b291f2c027c 100644
--- a/security/landlock/Makefile
+++ b/security/landlock/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
 
-landlock-y := \
+landlock-y := init.o \
 	bpf_verify.o bpf_ptrace.o \
-	domain_manage.o
+	domain_manage.o domain_syscall.o \
+	hooks_cred.o
diff --git a/security/landlock/common.h b/security/landlock/common.h
index fb2990eb5fb4..3ae8340a5b3d 100644
--- a/security/landlock/common.h
+++ b/security/landlock/common.h
@@ -10,9 +10,15 @@
 #define _SECURITY_LANDLOCK_COMMON_H
 
 #include <linux/bpf.h>
+#include <linux/cred.h>
 #include <linux/filter.h>
+#include <linux/lsm_hooks.h>
 #include <linux/refcount.h>
 
+#define LANDLOCK_NAME "landlock"
+
+extern struct lsm_blob_sizes landlock_blob_sizes;
+
 enum landlock_hook_type {
 	LANDLOCK_HOOK_PTRACE = 1,
 };
@@ -43,6 +49,16 @@ struct landlock_domain {
 	refcount_t usage;
 };
 
+struct landlock_cred_security {
+	struct landlock_domain *domain;
+};
+
+static inline struct landlock_cred_security *landlock_cred(
+		const struct cred *cred)
+{
+	return cred->security + landlock_blob_sizes.lbs_cred;
+}
+
 /**
  * get_hook_index - get an index for the programs of struct landlock_prog_set
  *
diff --git a/security/landlock/domain_syscall.c b/security/landlock/domain_syscall.c
new file mode 100644
index 000000000000..62daa630353e
--- /dev/null
+++ b/security/landlock/domain_syscall.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Landlock LSM - seccomp syscall
+ *
+ * Copyright © 2016-2018 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#ifdef CONFIG_SECCOMP_FILTER
+
+#include <asm/current.h>
+#include <linux/bpf.h>
+#include <linux/capability.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/filter.h>
+#include <linux/landlock.h>
+#include <linux/refcount.h>
+#include <linux/sched.h>
+#include <linux/seccomp.h>
+#include <linux/uaccess.h>
+
+#include "common.h"
+#include "domain_manage.h"
+
+/**
+ * landlock_seccomp_prepend_prog - attach a Landlock program to the current
+ *                                 task
+ *
+ * current->cred->security[landlock]->domain is lazily allocated. When a new
+ * credential is created, only a pointer is copied.  When a new Landlock
+ * program is added by a task, if there is other references to this task's
+ * domain, then a new allocation is made to contain an array pointing to
+ * Landlock program lists.  This design enable low-performance impact and is
+ * memory efficient while keeping the property of prepend-only programs.
+ *
+ * For now, installing a Landlock program requires that the requesting task has
+ * the global CAP_SYS_ADMIN. We cannot force the use of no_new_privs to not
+ * exclude containers where a process may legitimately acquire more privileges
+ * thanks to an SUID binary.
+ *
+ * @flags: not used, must be 0
+ * @user_bpf_fd: file descriptor pointing to a loaded Landlock prog
+ */
+int landlock_seccomp_prepend_prog(unsigned int flags,
+		const int __user *user_bpf_fd)
+{
+	struct landlock_domain *new_domain;
+	struct cred *cred_new;
+	struct landlock_cred_security *llcred_new;
+	struct bpf_prog *prog;
+	int bpf_fd, err;
+
+	/* planned to be replaced with a no_new_privs check to allow
+	 * unprivileged tasks */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	/* enable to check if Landlock is supported with early EFAULT */
+	if (!user_bpf_fd)
+		return -EFAULT;
+	if (flags)
+		return -EINVAL;
+	err = get_user(bpf_fd, user_bpf_fd);
+	if (err)
+		return err;
+	prog = bpf_prog_get(bpf_fd);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	cred_new = prepare_creds();
+	if (!cred_new) {
+		bpf_prog_put(prog);
+		return -ENOMEM;
+	}
+	llcred_new = landlock_cred(cred_new);
+	/* the new creds are an atomic copy of the current creds */
+	new_domain = landlock_prepend_prog(llcred_new->domain, prog);
+	bpf_prog_put(prog);
+	if (IS_ERR(new_domain)) {
+		abort_creds(cred_new);
+		return PTR_ERR(new_domain);
+	}
+	llcred_new->domain = new_domain;
+	return commit_creds(cred_new);
+}
+
+#endif /* CONFIG_SECCOMP_FILTER */
diff --git a/security/landlock/hooks_cred.c b/security/landlock/hooks_cred.c
new file mode 100644
index 000000000000..def8678019a0
--- /dev/null
+++ b/security/landlock/hooks_cred.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Landlock LSM - credential hooks
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#include <linux/cred.h>
+#include <linux/lsm_hooks.h>
+
+#include "common.h"
+#include "domain_manage.h"
+#include "hooks_cred.h"
+
+static int hook_cred_prepare(struct cred *new, const struct cred *old,
+		gfp_t gfp)
+{
+	const struct landlock_cred_security *cred_old = landlock_cred(old);
+	struct landlock_cred_security *cred_new = landlock_cred(new);
+	struct landlock_domain *dom_old;
+
+	dom_old = cred_old->domain;
+	if (dom_old) {
+		landlock_get_domain(dom_old);
+		cred_new->domain = dom_old;
+	} else {
+		cred_new->domain = NULL;
+	}
+	return 0;
+}
+
+static void hook_cred_free(struct cred *cred)
+{
+	landlock_put_domain(landlock_cred(cred)->domain);
+}
+
+static struct security_hook_list landlock_hooks[] = {
+	LSM_HOOK_INIT(cred_prepare, hook_cred_prepare),
+	LSM_HOOK_INIT(cred_free, hook_cred_free),
+};
+
+__init void landlock_add_hooks_cred(void)
+{
+	security_add_hooks(landlock_hooks, ARRAY_SIZE(landlock_hooks),
+			LANDLOCK_NAME);
+}
diff --git a/security/landlock/hooks_cred.h b/security/landlock/hooks_cred.h
new file mode 100644
index 000000000000..641d66f6bf9a
--- /dev/null
+++ b/security/landlock/hooks_cred.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Landlock LSM - credential hooks headers
+ *
+ * Copyright © 2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2019 ANSSI
+ */
+
+#ifndef _SECURITY_LANDLOCK_HOOKS_CRED_H
+#define _SECURITY_LANDLOCK_HOOKS_CRED_H
+
+__init void landlock_add_hooks_cred(void);
+
+#endif /* _SECURITY_LANDLOCK_HOOKS_CRED_H */
diff --git a/security/landlock/init.c b/security/landlock/init.c
new file mode 100644
index 000000000000..8836ec4defd3
--- /dev/null
+++ b/security/landlock/init.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Landlock LSM - initialization
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#include <linux/lsm_hooks.h>
+
+#include "common.h"
+#include "hooks_cred.h"
+
+static int __init landlock_init(void)
+{
+	pr_info(LANDLOCK_NAME ": Registering hooks\n");
+	landlock_add_hooks_cred();
+	return 0;
+}
+
+struct lsm_blob_sizes landlock_blob_sizes __lsm_ro_after_init = {
+	.lbs_cred = sizeof(struct landlock_cred_security),
+};
+
+DEFINE_LSM(LANDLOCK_NAME) = {
+	.name = LANDLOCK_NAME,
+	.order = LSM_ORDER_LAST,
+	.blobs = &landlock_blob_sizes,
+	.init = landlock_init,
+};
diff --git a/security/security.c b/security/security.c
index 1bc000f834e2..03c7dce9e014 100644
--- a/security/security.c
+++ b/security/security.c
@@ -264,6 +264,21 @@ static void __init ordered_lsm_parse(const char *order, const char *origin)
 		}
 	}
 
+	/*
+	 * In case of an unprivileged access-control, we don't want to give the
+	 * ability to any process to do some checks (e.g. through an eBPF
+	 * program) on kernel objects (e.g. files) if a privileged security
+	 * policy forbid their access.  We must then load
+	 * potentially-unprivileged security modules after all other LSMs.
+	 *
+	 * LSM_ORDER_LAST is always last and does not appear in the modifiable
+	 * ordered list of enabled LSMs.
+	 */
+	for (lsm = __start_lsm_info; lsm < __end_lsm_info; lsm++) {
+		if (lsm->order == LSM_ORDER_LAST)
+			append_ordered_lsm(lsm, "last");
+	}
+
 	/* Disable all LSMs not in the ordered list. */
 	for (lsm = __start_lsm_info; lsm < __end_lsm_info; lsm++) {
 		if (exists_ordered_lsm(lsm))
-- 
2.23.0

^ permalink raw reply related

* [PATCH bpf-next v11 2/7] landlock: Add the management of domains
From: Mickaël Salaün @ 2019-10-29 17:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Alexei Starovoitov, Andy Lutomirski,
	Casey Schaufler, Daniel Borkmann, David Drysdale, Florent Revest,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, KP Singh, Michael Kerrisk, Mickaël Salaün,
	Paul Moore, Sargun Dhillon, Serge E . Hallyn, Shuah Khan,
	Stephen Smalley
In-Reply-To: <20191029171505.6650-1-mic@digikod.net>

A Landlock domain is a set of eBPF programs.  There is a list for each
different program types that can be run on a specific Landlock hook
(e.g. ptrace).  A domain is tied to a set of subjects (i.e. tasks).  A
Landlock program should not try (nor be able) to infer which subject is
currently enforced, but to have a unique security policy for all
subjects tied to the same domain.  This make the reasoning much easier
and help avoid pitfalls.

The next commits tie a domain to a task's credentials thanks to
seccomp(2), but we could use cgroups or a security file-system to
enforce a sysadmin-defined policy .

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: James Morris <jmorris@namei.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Will Drewry <wad@chromium.org>
---

Changes since v10:
* rename files and names to clearly define a domain
* create a standalone patch to ease review
---
 security/landlock/Makefile        |   3 +-
 security/landlock/common.h        |  38 +++++
 security/landlock/domain_manage.c | 265 ++++++++++++++++++++++++++++++
 security/landlock/domain_manage.h |  23 +++
 4 files changed, 328 insertions(+), 1 deletion(-)
 create mode 100644 security/landlock/domain_manage.c
 create mode 100644 security/landlock/domain_manage.h

diff --git a/security/landlock/Makefile b/security/landlock/Makefile
index 682b798c6b76..dd5f70185778 100644
--- a/security/landlock/Makefile
+++ b/security/landlock/Makefile
@@ -1,4 +1,5 @@
 obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
 
 landlock-y := \
-	bpf_verify.o bpf_ptrace.o
+	bpf_verify.o bpf_ptrace.o \
+	domain_manage.o
diff --git a/security/landlock/common.h b/security/landlock/common.h
index 0234c4bc4acd..fb2990eb5fb4 100644
--- a/security/landlock/common.h
+++ b/security/landlock/common.h
@@ -11,11 +11,49 @@
 
 #include <linux/bpf.h>
 #include <linux/filter.h>
+#include <linux/refcount.h>
 
 enum landlock_hook_type {
 	LANDLOCK_HOOK_PTRACE = 1,
 };
 
+#define _LANDLOCK_HOOK_LAST LANDLOCK_HOOK_PTRACE
+
+struct landlock_prog_list {
+	struct landlock_prog_list *prev;
+	struct bpf_prog *prog;
+	refcount_t usage;
+};
+
+/**
+ * struct landlock_domain - Landlock programs enforced on a set of tasks
+ *
+ * When prepending a new program, if &struct landlock_domain is shared with
+ * other tasks, then duplicate it and prepend the program to this new &struct
+ * landlock_domain.
+ *
+ * @usage: reference count to manage the object lifetime. When a task needs to
+ *	   add Landlock programs and if @usage is greater than 1, then the
+ *	   task must duplicate &struct landlock_domain to not change the
+ *	   children's programs as well.
+ * @programs: array of non-NULL &struct landlock_prog_list pointers
+ */
+struct landlock_domain {
+	struct landlock_prog_list *programs[_LANDLOCK_HOOK_LAST];
+	refcount_t usage;
+};
+
+/**
+ * get_hook_index - get an index for the programs of struct landlock_prog_set
+ *
+ * @type: a Landlock hook type
+ */
+static inline size_t get_hook_index(enum landlock_hook_type type)
+{
+	/* type ID > 0 for loaded programs */
+	return type - 1;
+}
+
 static inline enum landlock_hook_type get_hook_type(const struct bpf_prog *prog)
 {
 	switch (prog->expected_attach_type) {
diff --git a/security/landlock/domain_manage.c b/security/landlock/domain_manage.c
new file mode 100644
index 000000000000..c955b9c95c84
--- /dev/null
+++ b/security/landlock/domain_manage.c
@@ -0,0 +1,265 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Landlock LSM - domain management
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/refcount.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+#include "common.h"
+#include "domain_manage.h"
+
+void landlock_get_domain(struct landlock_domain *dom)
+{
+	if (!dom)
+		return;
+	refcount_inc(&dom->usage);
+}
+
+static void put_landlock_prog_list(struct landlock_prog_list *prog_list)
+{
+	struct landlock_prog_list *orig = prog_list;
+
+	/* clean up single-reference branches iteratively */
+	while (orig && refcount_dec_and_test(&orig->usage)) {
+		struct landlock_prog_list *freeme = orig;
+
+		if (orig->prog)
+			bpf_prog_put(orig->prog);
+		orig = orig->prev;
+		kfree(freeme);
+	}
+}
+
+void landlock_put_domain(struct landlock_domain *domain)
+{
+	if (domain && refcount_dec_and_test(&domain->usage)) {
+		size_t i;
+
+		for (i = 0; i < ARRAY_SIZE(domain->programs); i++)
+			put_landlock_prog_list(domain->programs[i]);
+		kfree(domain);
+	}
+}
+
+static struct landlock_domain *new_landlock_domain(void)
+{
+	struct landlock_domain *domain;
+
+	/* array filled with NULL values */
+	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+	if (!domain)
+		return ERR_PTR(-ENOMEM);
+	refcount_set(&domain->usage, 1);
+	return domain;
+}
+
+/**
+ * store_landlock_prog - prepend and deduplicate a Landlock prog_list
+ *
+ * Prepend @prog to @init_domain while ignoring @prog if they are already in
+ * @ref_domain.  Whatever is the result of this function call, you can call
+ * bpf_prog_put(@prog) after.
+ *
+ * @init_domain: empty domain to prepend to
+ * @ref_domain: domain to check for duplicate programs
+ * @prog: program to prepend
+ *
+ * Return -errno on error or 0 if @prog was successfully stored.
+ */
+static int store_landlock_prog(struct landlock_domain *init_domain,
+		const struct landlock_domain *ref_domain,
+		struct bpf_prog *prog)
+{
+	struct landlock_prog_list *tmp_list = NULL;
+	int err;
+	size_t hook;
+	enum landlock_hook_type last_type;
+	struct bpf_prog *new = prog;
+
+	/* allocate all the memory we need */
+	struct landlock_prog_list *new_list;
+
+	last_type = get_hook_type(new);
+
+	/* ignore duplicate programs */
+	if (ref_domain) {
+		struct landlock_prog_list *ref;
+
+		hook = get_hook_index(get_hook_type(new));
+		for (ref = ref_domain->programs[hook]; ref;
+				ref = ref->prev) {
+			if (ref->prog == new)
+				return -EINVAL;
+		}
+	}
+
+	new = bpf_prog_inc(new);
+	if (IS_ERR(new)) {
+		err = PTR_ERR(new);
+		goto put_tmp_list;
+	}
+	new_list = kzalloc(sizeof(*new_list), GFP_KERNEL);
+	if (!new_list) {
+		bpf_prog_put(new);
+		err = -ENOMEM;
+		goto put_tmp_list;
+	}
+	/* ignore Landlock types in this tmp_list */
+	new_list->prog = new;
+	new_list->prev = tmp_list;
+	refcount_set(&new_list->usage, 1);
+	tmp_list = new_list;
+
+	if (!tmp_list)
+		/* inform user space that this program was already added */
+		return -EEXIST;
+
+	/* properly store the list (without error cases) */
+	while (tmp_list) {
+		struct landlock_prog_list *new_list;
+
+		new_list = tmp_list;
+		tmp_list = tmp_list->prev;
+		/* do not increment the previous prog list usage */
+		hook = get_hook_index(get_hook_type(new_list->prog));
+		new_list->prev = init_domain->programs[hook];
+		/* no need to add from the last program to the first because
+		 * each of them are a different Landlock type */
+		smp_store_release(&init_domain->programs[hook], new_list);
+	}
+	return 0;
+
+put_tmp_list:
+	put_landlock_prog_list(tmp_list);
+	return err;
+}
+
+/* limit Landlock programs set to 256KB */
+#define LANDLOCK_PROGRAMS_MAX_PAGES (1 << 6)
+
+/**
+ * landlock_prepend_prog - attach a Landlock prog_list to @current_domain
+ *
+ * Whatever is the result of this function call, you can call
+ * bpf_prog_put(@prog) after.
+ *
+ * @current_domain: landlock_domain pointer, must be (RCU-)locked (if needed)
+ *                  to prevent a concurrent put/free. This pointer must not be
+ *                  freed after the call.
+ * @prog: non-NULL Landlock prog_list to prepend to @current_domain. @prog will
+ *        be owned by landlock_prepend_prog() and freed if an error happened.
+ *
+ * Return @current_domain or a new pointer when OK. Return a pointer error
+ * otherwise.
+ */
+struct landlock_domain *landlock_prepend_prog(
+		struct landlock_domain *current_domain,
+		struct bpf_prog *prog)
+{
+	struct landlock_domain *new_domain = current_domain;
+	unsigned long pages;
+	int err;
+	size_t i;
+	struct landlock_domain tmp_domain = {};
+
+	if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
+		return ERR_PTR(-EINVAL);
+
+	/* validate memory size allocation */
+	pages = prog->pages;
+	if (current_domain) {
+		size_t i;
+
+		for (i = 0; i < ARRAY_SIZE(current_domain->programs); i++) {
+			struct landlock_prog_list *walker_p;
+
+			for (walker_p = current_domain->programs[i];
+					walker_p; walker_p = walker_p->prev)
+				pages += walker_p->prog->pages;
+		}
+		/* count a struct landlock_domain if we need to allocate one */
+		if (refcount_read(&current_domain->usage) != 1)
+			pages += round_up(sizeof(*current_domain), PAGE_SIZE)
+				/ PAGE_SIZE;
+	}
+	if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
+		return ERR_PTR(-E2BIG);
+
+	/* ensure early that we can allocate enough memory for the new
+	 * prog_lists */
+	err = store_landlock_prog(&tmp_domain, current_domain, prog);
+	if (err)
+		return ERR_PTR(err);
+
+	/*
+	 * Each task_struct points to an array of prog list pointers.  These
+	 * tables are duplicated when additions are made (which means each
+	 * table needs to be refcounted for the processes using it). When a new
+	 * table is created, all the refcounters on the prog_list are bumped
+	 * (to track each table that references the prog). When a new prog is
+	 * added, it's just prepended to the list for the new table to point
+	 * at.
+	 *
+	 * Manage all the possible errors before this step to not uselessly
+	 * duplicate current_domain and avoid a rollback.
+	 */
+	if (!new_domain) {
+		/*
+		 * If there is no Landlock domain used by the current task,
+		 * then create a new one.
+		 */
+		new_domain = new_landlock_domain();
+		if (IS_ERR(new_domain))
+			goto put_tmp_lists;
+	} else if (refcount_read(&current_domain->usage) > 1) {
+		/*
+		 * If the current task is not the sole user of its Landlock
+		 * domain, then duplicate it.
+		 */
+		new_domain = new_landlock_domain();
+		if (IS_ERR(new_domain))
+			goto put_tmp_lists;
+		for (i = 0; i < ARRAY_SIZE(new_domain->programs); i++) {
+			new_domain->programs[i] =
+				READ_ONCE(current_domain->programs[i]);
+			if (new_domain->programs[i])
+				refcount_inc(&new_domain->programs[i]->usage);
+		}
+
+		/*
+		 * Landlock domain from the current task will not be freed here
+		 * because the usage is strictly greater than 1. It is only
+		 * prevented to be freed by another task thanks to the caller
+		 * of landlock_prepend_prog() which should be locked if needed.
+		 */
+		landlock_put_domain(current_domain);
+	}
+
+	/* prepend tmp_domain to new_domain */
+	for (i = 0; i < ARRAY_SIZE(tmp_domain.programs); i++) {
+		/* get the last new list */
+		struct landlock_prog_list *last_list =
+			tmp_domain.programs[i];
+
+		if (last_list) {
+			while (last_list->prev)
+				last_list = last_list->prev;
+			/* no need to increment usage (pointer replacement) */
+			last_list->prev = new_domain->programs[i];
+			new_domain->programs[i] = tmp_domain.programs[i];
+		}
+	}
+	return new_domain;
+
+put_tmp_lists:
+	for (i = 0; i < ARRAY_SIZE(tmp_domain.programs); i++)
+		put_landlock_prog_list(tmp_domain.programs[i]);
+	return new_domain;
+}
diff --git a/security/landlock/domain_manage.h b/security/landlock/domain_manage.h
new file mode 100644
index 000000000000..5b5b49f6e3e8
--- /dev/null
+++ b/security/landlock/domain_manage.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Landlock LSM - domain management headers
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#ifndef _SECURITY_LANDLOCK_DOMAIN_MANAGE_H
+#define _SECURITY_LANDLOCK_DOMAIN_MANAGE_H
+
+#include <linux/filter.h>
+
+#include "common.h"
+
+void landlock_get_domain(struct landlock_domain *dom);
+void landlock_put_domain(struct landlock_domain *dom);
+
+struct landlock_domain *landlock_prepend_prog(
+		struct landlock_domain *current_domain,
+		struct bpf_prog *prog);
+
+#endif /* _SECURITY_LANDLOCK_DOMAIN_MANAGE_H */
-- 
2.23.0

^ permalink raw reply related

* [PATCH bpf-next v11 1/7] bpf,landlock: Define an eBPF program type for Landlock hooks
From: Mickaël Salaün @ 2019-10-29 17:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Alexei Starovoitov, Andy Lutomirski,
	Casey Schaufler, Daniel Borkmann, David Drysdale, Florent Revest,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, KP Singh, Michael Kerrisk, Mickaël Salaün,
	Paul Moore, Sargun Dhillon, Serge E . Hallyn, Shuah Khan,
	Stephen Smalley
In-Reply-To: <20191029171505.6650-1-mic@digikod.net>

Add a new type of eBPF program used by Landlock hooks.  The goal of this
type of program is to accept or deny a requested access from userspace
to a kernel object (e.g. process).  This will be more useful with the
next commit adding a new eBPF helper.

This new BPF program type will be registered with the Landlock LSM
initialization.

Add an initial Landlock Kconfig and update the MAINTAINERS file.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: James Morris <jmorris@namei.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Will Drewry <wad@chromium.org>
---

Changes since v10:
* replace file system program types with a (simpler) ptrace program type
* add an eBPF task pointer type
* split files

Changes since v9:
* handle inode put and map put, which fix unmount (reported by Al Viro)
* replace subtype with expected_attach_type and expected_attach_triggers
* check eBPF program return code

Changes since v8:
* Remove the chaining concept from the eBPF program contexts (chain and
  cookie). We need to keep these subtypes this way to be able to make
  them evolve, though.
* remove bpf_landlock_put_extra() because there is no more a "previous"
  field to free (for now)

Changes since v7:
* cosmetic fixes
* rename LANDLOCK_SUBTYPE_* to LANDLOCK_*
* cleanup UAPI definitions and move them from bpf.h to landlock.h
  (suggested by Alexei Starovoitov)
* disable Landlock by default (suggested by Alexei Starovoitov)
* rename BPF_PROG_TYPE_LANDLOCK_{RULE,HOOK}
* update the Kconfig
* update the MAINTAINERS file
* replace the IOCTL, LOCK and FCNTL events with FS_PICK, FS_WALK and
  FS_GET hook types
* add the ability to chain programs with an eBPF program file descriptor
  (i.e. the "previous" field in a Landlock subtype) and keep a state
  with a "cookie" value available from the context
* add a "triggers" subtype bitfield to match specific actions (e.g.
  append, chdir, read...)

Changes since v6:
* add 3 more sub-events: IOCTL, LOCK, FCNTL
  https://lkml.kernel.org/r/2fbc99a6-f190-f335-bd14-04bdeed35571@digikod.net
* rename LANDLOCK_VERSION to LANDLOCK_ABI to better reflect its purpose,
  and move it from landlock.h to common.h
* rename BPF_PROG_TYPE_LANDLOCK to BPF_PROG_TYPE_LANDLOCK_RULE: an eBPF
  program could be used for something else than a rule
* simplify struct landlock_context by removing the arch and syscall_nr fields
* remove all eBPF map functions call, remove ABILITY_WRITE
* refactor bpf_landlock_func_proto() (suggested by Kees Cook)
* constify pointers
* fix doc inclusion

Changes since v5:
* rename file hooks.c to init.c
* fix spelling

Changes since v4:
* merge a minimal (not enabled) LSM code and Kconfig in this commit

Changes since v3:
* split commit
* revamp the landlock_context:
  * add arch, syscall_nr and syscall_cmd (ioctl, fcntl…) to be able to
    cross-check action with the event type
  * replace args array with dedicated fields to ease the addition of new
    fields
---
 MAINTAINERS                    |  8 ++++
 include/linux/bpf.h            |  1 +
 include/linux/bpf_types.h      |  3 ++
 include/uapi/linux/bpf.h       |  2 +
 include/uapi/linux/landlock.h  | 39 ++++++++++++++++
 kernel/bpf/syscall.c           |  9 ++++
 kernel/bpf/verifier.c          |  7 +++
 security/Kconfig               |  1 +
 security/Makefile              |  2 +
 security/landlock/Kconfig      | 19 ++++++++
 security/landlock/Makefile     |  4 ++
 security/landlock/bpf_ptrace.c | 30 ++++++++++++
 security/landlock/bpf_ptrace.h | 17 +++++++
 security/landlock/bpf_verify.c | 83 ++++++++++++++++++++++++++++++++++
 security/landlock/common.h     | 30 ++++++++++++
 15 files changed, 255 insertions(+)
 create mode 100644 include/uapi/linux/landlock.h
 create mode 100644 security/landlock/Kconfig
 create mode 100644 security/landlock/Makefile
 create mode 100644 security/landlock/bpf_ptrace.c
 create mode 100644 security/landlock/bpf_ptrace.h
 create mode 100644 security/landlock/bpf_verify.c
 create mode 100644 security/landlock/common.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 7fc074632eac..4cabb85ea52d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9146,6 +9146,14 @@ F:	net/core/skmsg.c
 F:	net/core/sock_map.c
 F:	net/ipv4/tcp_bpf.c
 
+LANDLOCK SECURITY MODULE
+M:	Mickaël Salaün <mic@digikod.net>
+S:	Supported
+F:	include/uapi/linux/landlock.h
+F:	security/landlock/
+K:	landlock
+K:	LANDLOCK
+
 LANTIQ / INTEL Ethernet drivers
 M:	Hauke Mehrtens <hauke@hauke-m.de>
 L:	netdev@vger.kernel.org
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 171be30fe0ae..819a3e207438 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -291,6 +291,7 @@ enum bpf_reg_type {
 	PTR_TO_TP_BUFFER,	 /* reg points to a writable raw tp's buffer */
 	PTR_TO_XDP_SOCK,	 /* reg points to struct xdp_sock */
 	PTR_TO_BTF_ID,		 /* reg points to kernel struct */
+	PTR_TO_TASK,		 /* reg points to struct task_struct */
 };
 
 /* The information passed from prog-specific *_is_valid_access
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 36a9c2325176..bddabc961a3b 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -38,6 +38,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
 #ifdef CONFIG_INET
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
 #endif
+#ifdef CONFIG_SECURITY_LANDLOCK
+BPF_PROG_TYPE(BPF_PROG_TYPE_LANDLOCK_HOOK, landlock)
+#endif
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4af8b0819a32..6e4147790f96 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -173,6 +173,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_SYSCTL,
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
 	BPF_PROG_TYPE_CGROUP_SOCKOPT,
+	BPF_PROG_TYPE_LANDLOCK_HOOK,
 };
 
 enum bpf_attach_type {
@@ -199,6 +200,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_UDP6_RECVMSG,
 	BPF_CGROUP_GETSOCKOPT,
 	BPF_CGROUP_SETSOCKOPT,
+	BPF_LANDLOCK_PTRACE,
 	__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
new file mode 100644
index 000000000000..3ffe3cbdbad6
--- /dev/null
+++ b/include/uapi/linux/landlock.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Landlock - UAPI headers
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#ifndef _UAPI__LINUX_LANDLOCK_H__
+#define _UAPI__LINUX_LANDLOCK_H__
+
+#include <linux/types.h>
+
+/**
+ * DOC: landlock_ret
+ *
+ * The return value of a landlock program is a bitmask that can allow or deny
+ * the action for which the program is run.
+ *
+ * In the future, this could be used to trigger an audit event as well.
+ *
+ * - %LANDLOCK_RET_ALLOW
+ * - %LANDLOCK_RET_DENY
+ */
+#define LANDLOCK_RET_ALLOW	0
+#define LANDLOCK_RET_DENY	1
+
+/**
+ * struct landlock_context_ptrace - context accessible to BPF_LANDLOCK_PTRACE
+ *
+ * @tracer: pointer to the task requesting to debug @tracee
+ * @tracee: pointer to the task being debugged
+ */
+struct landlock_context_ptrace {
+	__u64 tracer;
+	__u64 tracee;
+};
+
+#endif /* _UAPI__LINUX_LANDLOCK_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ff5225759553..5159e582a0d8 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1621,6 +1621,15 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 		default:
 			return -EINVAL;
 		}
+#ifdef CONFIG_SECURITY_LANDLOCK
+	case BPF_PROG_TYPE_LANDLOCK_HOOK:
+		switch (expected_attach_type) {
+		case BPF_LANDLOCK_PTRACE:
+			return 0;
+		default:
+			return -EINVAL;
+		}
+#endif
 	default:
 		return 0;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c59778c0fc4d..ebf1991906b7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -421,6 +421,7 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_TP_BUFFER]	= "tp_buffer",
 	[PTR_TO_XDP_SOCK]	= "xdp_sock",
 	[PTR_TO_BTF_ID]		= "ptr_",
+	[PTR_TO_TASK]		= "task",
 };
 
 static char slot_type_char[] = {
@@ -1878,6 +1879,7 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_TCP_SOCK:
 	case PTR_TO_TCP_SOCK_OR_NULL:
 	case PTR_TO_XDP_SOCK:
+	case PTR_TO_TASK:
 		return true;
 	default:
 		return false;
@@ -2600,6 +2602,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 	case PTR_TO_XDP_SOCK:
 		pointer_desc = "xdp_sock ";
 		break;
+	case PTR_TO_TASK:
+		pointer_desc = "task ";
+		break;
 	default:
 		break;
 	}
@@ -4527,6 +4532,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 	case PTR_TO_TCP_SOCK:
 	case PTR_TO_TCP_SOCK_OR_NULL:
 	case PTR_TO_XDP_SOCK:
+	case PTR_TO_TASK:
 		verbose(env, "R%d pointer arithmetic on %s prohibited\n",
 			dst, reg_type_str[ptr_reg->type]);
 		return -EACCES;
@@ -6278,6 +6284,7 @@ static int check_return_code(struct bpf_verifier_env *env)
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+	case BPF_PROG_TYPE_LANDLOCK_HOOK:
 		break;
 	default:
 		return 0;
diff --git a/security/Kconfig b/security/Kconfig
index 2a1a2d396228..9d9981394fb0 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -238,6 +238,7 @@ source "security/loadpin/Kconfig"
 source "security/yama/Kconfig"
 source "security/safesetid/Kconfig"
 source "security/lockdown/Kconfig"
+source "security/landlock/Kconfig"
 
 source "security/integrity/Kconfig"
 
diff --git a/security/Makefile b/security/Makefile
index be1dd9d2cb2f..60b7f6f2fd30 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -12,6 +12,7 @@ subdir-$(CONFIG_SECURITY_YAMA)		+= yama
 subdir-$(CONFIG_SECURITY_LOADPIN)	+= loadpin
 subdir-$(CONFIG_SECURITY_SAFESETID)    += safesetid
 subdir-$(CONFIG_SECURITY_LOCKDOWN_LSM)	+= lockdown
+subdir-$(CONFIG_SECURITY_LANDLOCK)		+= landlock
 
 # always enable default capabilities
 obj-y					+= commoncap.o
@@ -29,6 +30,7 @@ obj-$(CONFIG_SECURITY_YAMA)		+= yama/
 obj-$(CONFIG_SECURITY_LOADPIN)		+= loadpin/
 obj-$(CONFIG_SECURITY_SAFESETID)       += safesetid/
 obj-$(CONFIG_SECURITY_LOCKDOWN_LSM)	+= lockdown/
+obj-$(CONFIG_SECURITY_LANDLOCK)	+= landlock/
 obj-$(CONFIG_CGROUP_DEVICE)		+= device_cgroup.o
 
 # Object integrity file lists
diff --git a/security/landlock/Kconfig b/security/landlock/Kconfig
new file mode 100644
index 000000000000..44921bd72380
--- /dev/null
+++ b/security/landlock/Kconfig
@@ -0,0 +1,19 @@
+config SECURITY_LANDLOCK
+	bool "Landlock support"
+	depends on SECURITY
+	depends on BPF_SYSCALL
+	depends on SECCOMP_FILTER
+	default n
+	help
+	  This selects Landlock, a programmatic access control.  It enables to
+	  restrict processes on the fly (i.e. create a sandbox) or log some
+	  actions.  The security policy is a set of eBPF programs, dedicated to
+	  allow or deny a list of actions on specific kernel objects (e.g.
+	  process).
+
+	  You need to enable seccomp filter to apply a security policy to a
+	  process hierarchy (e.g. application with built-in sandboxing).
+
+	  See Documentation/security/landlock/ for further information.
+
+	  If you are unsure how to answer this question, answer N.
diff --git a/security/landlock/Makefile b/security/landlock/Makefile
new file mode 100644
index 000000000000..682b798c6b76
--- /dev/null
+++ b/security/landlock/Makefile
@@ -0,0 +1,4 @@
+obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
+
+landlock-y := \
+	bpf_verify.o bpf_ptrace.o
diff --git a/security/landlock/bpf_ptrace.c b/security/landlock/bpf_ptrace.c
new file mode 100644
index 000000000000..2ec73078ad01
--- /dev/null
+++ b/security/landlock/bpf_ptrace.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Landlock LSM - eBPF ptrace
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2019 ANSSI
+ */
+
+#include <linux/bpf.h>
+#include <uapi/linux/landlock.h>
+
+#include "bpf_ptrace.h"
+
+bool landlock_is_valid_access_ptrace(int off, enum bpf_access_type type,
+		enum bpf_reg_type *reg_type, int *max_size)
+{
+	if (type != BPF_READ)
+		return false;
+
+	switch (off) {
+	case offsetof(struct landlock_context_ptrace, tracer):
+		/* fall through */
+	case offsetof(struct landlock_context_ptrace, tracee):
+		*reg_type = PTR_TO_TASK;
+		*max_size = sizeof(u64);
+		return true;
+	default:
+		return false;
+	}
+}
diff --git a/security/landlock/bpf_ptrace.h b/security/landlock/bpf_ptrace.h
new file mode 100644
index 000000000000..85edce37b70a
--- /dev/null
+++ b/security/landlock/bpf_ptrace.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Landlock LSM - eBPF ptrace headers
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2019 ANSSI
+ */
+
+#ifndef _SECURITY_LANDLOCK_BPF_PTRACE_H
+#define _SECURITY_LANDLOCK_BPF_PTRACE_H
+
+#include <linux/bpf.h>
+
+bool landlock_is_valid_access_ptrace(int off, enum bpf_access_type type,
+		enum bpf_reg_type *reg_type, int *max_size);
+
+#endif /* _SECURITY_LANDLOCK_BPF_PTRACE_H */
diff --git a/security/landlock/bpf_verify.c b/security/landlock/bpf_verify.c
new file mode 100644
index 000000000000..6ed921588178
--- /dev/null
+++ b/security/landlock/bpf_verify.c
@@ -0,0 +1,83 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Landlock LSM - eBPF program verifications
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
+#include "common.h"
+#include "bpf_ptrace.h"
+
+static bool bpf_landlock_is_valid_access(int off, int size,
+		enum bpf_access_type type, const struct bpf_prog *prog,
+		struct bpf_insn_access_aux *info)
+{
+	enum bpf_reg_type reg_type = NOT_INIT;
+	int max_size = 0;
+
+	if (WARN_ON(!prog->expected_attach_type))
+		return false;
+
+	if (off < 0)
+		return false;
+	if (size <= 0 || size > sizeof(__u64))
+		return false;
+
+	/* set register type and max size */
+	switch (get_hook_type(prog)) {
+	case LANDLOCK_HOOK_PTRACE:
+		if (!landlock_is_valid_access_ptrace(off, type, &reg_type,
+					&max_size))
+			return false;
+		break;
+	}
+
+	/* check memory range access */
+	switch (reg_type) {
+	case NOT_INIT:
+		return false;
+	case SCALAR_VALUE:
+		/* allow partial raw value */
+		if (size > max_size)
+			return false;
+		info->ctx_field_size = max_size;
+		break;
+	default:
+		/* deny partial pointer */
+		if (size != max_size)
+			return false;
+	}
+
+	info->reg_type = reg_type;
+	return true;
+}
+
+static const struct bpf_func_proto *bpf_landlock_func_proto(
+		enum bpf_func_id func_id,
+		const struct bpf_prog *prog)
+{
+	if (WARN_ON(!prog->expected_attach_type))
+		return NULL;
+
+	switch (func_id) {
+	case BPF_FUNC_map_lookup_elem:
+		return &bpf_map_lookup_elem_proto;
+	case BPF_FUNC_map_update_elem:
+		return &bpf_map_update_elem_proto;
+	case BPF_FUNC_map_delete_elem:
+		return &bpf_map_delete_elem_proto;
+	default:
+		return NULL;
+	}
+}
+
+const struct bpf_verifier_ops landlock_verifier_ops = {
+	.get_func_proto	= bpf_landlock_func_proto,
+	.is_valid_access = bpf_landlock_is_valid_access,
+};
+
+const struct bpf_prog_ops landlock_prog_ops = {};
diff --git a/security/landlock/common.h b/security/landlock/common.h
new file mode 100644
index 000000000000..0234c4bc4acd
--- /dev/null
+++ b/security/landlock/common.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Landlock LSM - private headers
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#ifndef _SECURITY_LANDLOCK_COMMON_H
+#define _SECURITY_LANDLOCK_COMMON_H
+
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
+enum landlock_hook_type {
+	LANDLOCK_HOOK_PTRACE = 1,
+};
+
+static inline enum landlock_hook_type get_hook_type(const struct bpf_prog *prog)
+{
+	switch (prog->expected_attach_type) {
+	case BPF_LANDLOCK_PTRACE:
+		return LANDLOCK_HOOK_PTRACE;
+	default:
+		WARN_ON(1);
+		return BPF_LANDLOCK_PTRACE;
+	}
+}
+
+#endif /* _SECURITY_LANDLOCK_COMMON_H */
-- 
2.23.0

^ permalink raw reply related

* [PATCH bpf-next v11 0/7] Landlock LSM
From: Mickaël Salaün @ 2019-10-29 17:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Alexei Starovoitov, Andy Lutomirski,
	Casey Schaufler, Daniel Borkmann, David Drysdale, Florent Revest,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, KP Singh, Michael Kerrisk, Mickaël Salaün,
	Paul Moore, Sargun Dhillon, Serge E . Hallyn, Shuah Khan,
	Stephen Smalley

Hi,

This eleventh series is a major rework of the previous series [1] to
make the patches smaller while still putting in place the foundations of
Landlock and implementing a useful feature.  The whole file-system
support (i.e. inode map and program triggers) has been removed (but is
still planed in the future).  This series rewrite the previous static
ptrace restrictions with a programmatic one thanks to eBPF.  To be more
independent from seccomp, Landlock is now a full LSM using task's
credentials thanks to the LSM stacking infrastructure.  The only part of
seccomp still used is the syscall, which makes sense and which is a
quite simple interface.  The clear definition of Landlock domains (sets
of eBPF programs) can be used for other use-cases than sandboxing e.g.,
to implement a system-wide security signaling framework as described by
KRSI [2].  In addition to improvements and bug fixes, I split the
patches as much as possible to ease the review process.

As discussed at LSS NA [3] (with Kees Cook, James Morris, KP Singh and
Florent Revest) and at Kernel Recipes (with Alexei Starovoitov), I
planned to shrink the code of Landlock to a bare minimum to enable
incremental feature integration.  The idea was to create a "memory
protection" hook and the appropriate eBPF program type.  After some
experimentations, I concluded that it is not easy to implement a simple
interface to control actions such as mmap(2) or mprotect(2).  I then
focused on an old stable feature of Landlock: ptrace protection.  But
instead of keeping the static security policy, which make sense in a
sandboxing scenario, I extended Landlock with a ptrace hook and the
appropriate eBPF program type.  The ptrace enforcement is not mandatory
anymore but this new hook could be used (and extended with new helpers)
for a security signaling mechanism such as KRSI.  I hope the KRSI
developments could take advantage of this new Landlock version.

This is the first step of the roadmap discussed at LPC [4].  While the
intended final goal is to allow unprivileged users to use Landlock, this
series allows only a process with global CAP_SYS_ADMIN to load and
enforce a rule.  This may help to get feedback and avoid unexpected
behaviors.

This series can be applied on top of bpf-next, commit e93d99180abd
("selftests/bpf: Restore $(OUTPUT)/test_stub.o rule").  This can be
tested with CONFIG_BPF_SYSCALL, CONFIG_SECCOMP_FILTER and
CONFIG_SECURITY_LANDLOCK.  This patch series can be found in a Git
repository here:
https://github.com/landlock-lsm/linux/commits/landlock-v11
I would really appreciate constructive comments on the design and the
code.

# Landlock LSM

Landlock is a stackable LSM [5] intended to be used as a low-level
framework to build custom access-control/audit systems or safe endpoint
security agents.  There is currently one Landlock hook dedicated to
check ptrace(2).  This hook accepts a dedicated eBPF program, called a
Landlock program, which can currently compare its position in the
hierarchy of similar programs tied to other processes.  This enables to
enforce programmatic scoped ptrace restrictions.

The final goal of this new Linux Security Module (LSM) called Landlock
is to allow any process, including unprivileged ones, to create powerful
security sandboxes comparable to XNU Sandbox, FreeBSD Capsicum or
OpenBSD Pledge (which could be implemented with Landlock).  This kind of
sandbox is expected to help mitigate the security impact of bugs or
unexpected/malicious behaviors in user-space applications.

The use of seccomp and Landlock is more suitable with the help of a
user-space library (e.g.  libseccomp) that could help to specify a
high-level language to express a security policy instead of raw eBPF
programs.  Moreover, thanks to the LLVM front-end, it is quite easy to
write an eBPF program with a subset of the C language.

The documentation patch contains some kernel documentation, explanations
on how to use Landlock and a FAQ.  The compiled documentation and some
talks can be found here: https://landlock.io

# Frequently asked questions

## Why is seccomp-bpf not enough?

A seccomp filter can access only raw syscall arguments (i.e. the
register values) which means that it is not possible to filter according
to the value pointed to by an argument, such as a file pathname. As an
embryonic Landlock version demonstrated (i.e. seccomp-object), filtering
at the syscall level is complicated (e.g. need to take care of race
conditions). This is mainly because the access control checkpoints of
the kernel are not at this high-level but more underneath, at the
LSM-hook level. The LSM hooks are designed to handle this kind of
checks.  Landlock abstracts this approach to leverage the ability of
unprivileged users to limit themselves.

Cf. section "What it isn't?" in
Documentation/userspace-api/seccomp_filter.rst

## Why use the seccomp(2) syscall?

Landlock use the same semantic as seccomp to apply access rule
restrictions. It add a new layer of security for the current process
which is inherited by its children. It makes sense to use an unique
access-restricting syscall (that should be allowed by seccomp filters)
which can only drop privileges. Moreover, a Landlock rule could come
from outside a process (e.g.  passed through a UNIX socket). It is then
useful to differentiate the creation/load of Landlock eBPF programs via
bpf(2), from rule enforcement via seccomp(2).

## Why a new LSM? Are SELinux, AppArmor, Smack and Tomoyo not good
   enough?

The current access control LSMs are fine for their purpose which is to
give the *root* the ability to enforce a security policy for the
*system*. What is missing is a way to enforce a security policy for any
application by its developer and *unprivileged user* as seccomp can do
for raw syscall filtering.

Differences from other (access control) LSMs:
* not only dedicated to administrators (i.e. no_new_priv);
* limited kernel attack surface (e.g. policy parsing);
* constrained policy rules (no DoS: deterministic execution time);
* do not leak more information than the loader process can legitimately
  have access to (minimize metadata inference).

# Changes since v10

* remove all the file system related features: program types, inode
  map and expected_attach_triggers
* replace the static ptrace security policy with a new and simpler
  ptrace program (attached) type and a task_landlock_ptrace_ancestor()
  eBPF helper
* do not rely on seccomp internal structure but use stacked credentials
  insdead
* extend ptrace tests
* add more documentation
* split and rename files/patches
* miscellaneous fixes

Previous changes can be found in the previous cover-letter [1].

[1] https://lore.kernel.org/lkml/20190721213116.23476-1-mic@digikod.net/
[2] https://lore.kernel.org/lkml/20190910115527.5235-1-kpsingh@chromium.org/
[3] https://lwn.net/Articles/798157/
[4] https://lore.kernel.org/lkml/5828776A.1010104@digikod.net/
[5] https://lore.kernel.org/lkml/50db058a-7dde-441b-a7f9-f6837fe8b69f@schaufler-ca.com/

Regards,

Mickaël Salaün (7):
  bpf,landlock: Define an eBPF program type for Landlock hooks
  landlock: Add the management of domains
  landlock,seccomp: Load Landlock programs per process hierarchy
  landlock: Add ptrace LSM hooks
  bpf,landlock: Add task_landlock_ptrace_ancestor() helper
  bpf,landlock: Add tests for the Landlock ptrace program type
  landlock: Add user and kernel documentation for Landlock

 Documentation/security/index.rst              |   1 +
 Documentation/security/landlock/index.rst     |  22 ++
 Documentation/security/landlock/kernel.rst    | 139 +++++++++
 Documentation/security/landlock/user.rst      | 142 ++++++++++
 MAINTAINERS                                   |   9 +
 include/linux/bpf.h                           |   3 +
 include/linux/bpf_types.h                     |   3 +
 include/linux/landlock.h                      |  25 ++
 include/linux/lsm_hooks.h                     |   1 +
 include/uapi/linux/bpf.h                      |  23 +-
 include/uapi/linux/landlock.h                 |  39 +++
 include/uapi/linux/seccomp.h                  |   1 +
 kernel/bpf/syscall.c                          |   9 +
 kernel/bpf/verifier.c                         |  11 +
 kernel/seccomp.c                              |   4 +
 scripts/bpf_helpers_doc.py                    |   1 +
 security/Kconfig                              |   1 +
 security/Makefile                             |   2 +
 security/landlock/Kconfig                     |  19 ++
 security/landlock/Makefile                    |   6 +
 security/landlock/bpf_ptrace.c                |  98 +++++++
 security/landlock/bpf_ptrace.h                |  17 ++
 security/landlock/bpf_run.c                   |  62 ++++
 security/landlock/bpf_run.h                   |  25 ++
 security/landlock/bpf_verify.c                |  87 ++++++
 security/landlock/common.h                    |  84 ++++++
 security/landlock/domain_manage.c             | 265 ++++++++++++++++++
 security/landlock/domain_manage.h             |  23 ++
 security/landlock/domain_syscall.c            |  87 ++++++
 security/landlock/hooks_cred.c                |  47 ++++
 security/landlock/hooks_cred.h                |  14 +
 security/landlock/hooks_ptrace.c              | 114 ++++++++
 security/landlock/hooks_ptrace.h              |  19 ++
 security/landlock/init.c                      |  32 +++
 security/security.c                           |  15 +
 tools/include/uapi/linux/bpf.h                |  23 +-
 tools/include/uapi/linux/landlock.h           |  22 ++
 tools/lib/bpf/libbpf_probes.c                 |   3 +
 tools/testing/selftests/bpf/config            |   3 +
 tools/testing/selftests/bpf/test_verifier.c   |   1 +
 .../testing/selftests/bpf/verifier/landlock.c |  56 ++++
 tools/testing/selftests/landlock/.gitignore   |   5 +
 tools/testing/selftests/landlock/Makefile     |  27 ++
 tools/testing/selftests/landlock/config       |   5 +
 tools/testing/selftests/landlock/test.h       |  48 ++++
 tools/testing/selftests/landlock/test_base.c  |  24 ++
 .../testing/selftests/landlock/test_ptrace.c  | 210 ++++++++++++++
 47 files changed, 1875 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/security/landlock/index.rst
 create mode 100644 Documentation/security/landlock/kernel.rst
 create mode 100644 Documentation/security/landlock/user.rst
 create mode 100644 include/linux/landlock.h
 create mode 100644 include/uapi/linux/landlock.h
 create mode 100644 security/landlock/Kconfig
 create mode 100644 security/landlock/Makefile
 create mode 100644 security/landlock/bpf_ptrace.c
 create mode 100644 security/landlock/bpf_ptrace.h
 create mode 100644 security/landlock/bpf_run.c
 create mode 100644 security/landlock/bpf_run.h
 create mode 100644 security/landlock/bpf_verify.c
 create mode 100644 security/landlock/common.h
 create mode 100644 security/landlock/domain_manage.c
 create mode 100644 security/landlock/domain_manage.h
 create mode 100644 security/landlock/domain_syscall.c
 create mode 100644 security/landlock/hooks_cred.c
 create mode 100644 security/landlock/hooks_cred.h
 create mode 100644 security/landlock/hooks_ptrace.c
 create mode 100644 security/landlock/hooks_ptrace.h
 create mode 100644 security/landlock/init.c
 create mode 100644 tools/include/uapi/linux/landlock.h
 create mode 100644 tools/testing/selftests/bpf/verifier/landlock.c
 create mode 100644 tools/testing/selftests/landlock/.gitignore
 create mode 100644 tools/testing/selftests/landlock/Makefile
 create mode 100644 tools/testing/selftests/landlock/config
 create mode 100644 tools/testing/selftests/landlock/test.h
 create mode 100644 tools/testing/selftests/landlock/test_base.c
 create mode 100644 tools/testing/selftests/landlock/test_ptrace.c

-- 
2.23.0

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Andy Lutomirski @ 2019-10-29 17:03 UTC (permalink / raw)
  To: Reshetova, Elena
  Cc: Mike Rapoport, linux-kernel@vger.kernel.org, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Dave Hansen, James Bottomley, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	linux-api@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	Mike Rapoport, Tycho Andersen
In-Reply-To: <2236FBA76BA1254E88B949DDB74E612BA4EEC0CE@IRSMSX102.ger.corp.intel.com>

On Tue, Oct 29, 2019 at 4:25 AM Reshetova, Elena
<elena.reshetova@intel.com> wrote:
>
> > The patch below aims to allow applications to create mappins that have
> > pages visible only to the owning process. Such mappings could be used to
> > store secrets so that these secrets are not visible neither to other
> > processes nor to the kernel.
>
> Hi Mike,
>
> I have actually been looking into the closely related problem for the past
> couple of weeks (on and off). What is common here is the need for userspace
> to indicate to kernel that some pages contain secrets. And then there are
> actually a number of things that kernel can do to try to protect these secrets
> better. Unmap from direct map is one of them. Another thing is to map such
> pages as non-cached, which can help us to prevent or considerably restrict
> speculation on such pages. The initial proof of concept for marking pages as
> "UNCACHED" that I got from Dave Hansen was actually based on mlock2()
> and a new flag for it for this purpose. Since then I have been thinking on what
> interface suits the use case better and actually selected going with new madvise()
> flag instead because of all possible implications for fragmentation and performance.

Doing all of this with MAP_SECRET seems bad to me.  If user code wants
UC memory, it should ask for UC memory -- having the kernel involved
in the decision to use UC memory is a bad idea, because the
performance impact of using UC memory where user code wasn't expecting
it wil be so bad that the system might as well not work at all.  (For
kicks, I once added a sysctl to turn off caching in CR0.  I enabled it
in gnome-shell.  The system slowed down to such an extent that I was
unable to enter the three or so keystrokes to turn it back off.)

EXCLUSIVE makes sense.  Saying "don't ptrace this" makes sense.  UC
makes sense.  But having one flag to rule them all does not make sense
to me.

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Andy Lutomirski @ 2019-10-29 17:00 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: LKML, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Linux API, Linux-MM, X86 ML, Mike Rapoport
In-Reply-To: <20191029093254.GE18773@rapoport-lnx>

On Tue, Oct 29, 2019 at 2:33 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Mon, Oct 28, 2019 at 02:44:23PM -0600, Andy Lutomirski wrote:
> >
> > > On Oct 27, 2019, at 4:17 AM, Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > >
> > > Hi,
> > >
> > > The patch below aims to allow applications to create mappins that have
> > > pages visible only to the owning process. Such mappings could be used to
> > > store secrets so that these secrets are not visible neither to other
> > > processes nor to the kernel.
> > >
> > > I've only tested the basic functionality, the changes should be verified
> > > against THP/migration/compaction. Yet, I'd appreciate early feedback.
> >
> > I’ve contemplated the concept a fair amount, and I think you should
> > consider a change to the API. In particular, rather than having it be a
> > MAP_ flag, make it a chardev.  You can, at least at first, allow only
> > MAP_SHARED, and admins can decide who gets to use it.  It might also play
> > better with the VM overall, and you won’t need a VM_ flag for it — you
> > can just wire up .fault to do the right thing.
>
> I think mmap()/mprotect()/madvise() are the natural APIs for such
> interface.

Then you have a whole bunch of questions to answer.  For example:

What happens if you mprotect() or similar when the mapping is already
in use in a way that's incompatible with MAP_EXCLUSIVE?

Is it actually reasonable to malloc() some memory and then make it exclusive?

Are you permitted to map a file MAP_EXCLUSIVE?  What does it mean?

What does MAP_PRIVATE | MAP_EXCLUSIVE do?

How does one pass exclusive memory via SCM_RIGHTS?  (If it's a
memfd-like or chardev interface, it's trivial.  mmap(), not so much.)

And finally, there's my personal giant pet peeve: a major use of this
will be for virtualization.  I suspect that a lot of people would like
the majority of KVM guest memory to be unmapped from the host
pagetables.  But people might also like for guest memory to be
unmapped in *QEMU's* pagetables, and mmap() is a basically worthless
interface for this.  Getting fd-backed memory into a guest will take
some possibly major work in the kernel, but getting vma-backed memory
into a guest without mapping it in the host user address space seems
much, much worse.

> Switching to a chardev doesn't solve the major problem of direct
> map fragmentation and defeats the ability to use exclusive memory mappings
> with the existing allocators, while mprotect() and madvise() do not.
>

Will people really want to do malloc() and then remap it exclusive?
This sounds dubiously useful at best.

^ permalink raw reply

* Re: For review: documentation of clone3() system call
From: Christian Brauner @ 2019-10-29 16:05 UTC (permalink / raw)
  To: Jann Horn
  Cc: Florian Weimer, Michael Kerrisk-manpages, lkml, linux-man,
	Kees Cook, Oleg Nesterov, Arnd Bergmann, David Howells,
	Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	Linux API
In-Reply-To: <CAG48ez1pH9fGacQF6m7=R39bDMRqNR_ML7d2v-e=-kVLJhBuPA@mail.gmail.com>

On Tue, Oct 29, 2019 at 04:20:37PM +0100, Jann Horn wrote:
> On Tue, Oct 29, 2019 at 3:26 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> > On Tue, Oct 29, 2019 at 12:27:07PM +0100, Christian Brauner wrote:
> > > On Mon, Oct 28, 2019 at 08:09:13PM +0100, Jann Horn wrote:
> > > > On Mon, Oct 28, 2019 at 6:21 PM Christian Brauner
> > > > <christian.brauner@ubuntu.com> wrote:
> > > > > where stack + stack_size is addition on a void pointer which usually
> > > > > clang and gcc are not very happy about.
> > > > > I wanted to bring this up on the mailing list soon: If possible, I don't
> > > > > want userspace to need to know about stack direction and just have stack
> > > > > point to the beginning and then have the kernel do the + stack_size
> > > > > after the copy_clone_args_from_user() if the arch needs it. For example,
> > > > > by having a dumb helder similar to copy_thread_tls()/coyp_thread() that
> > > > > either does the + stack_size or not. Right now, clone3() is supported on
> > > > > parisc and afaict, the stack grows upwards for it. I'm not sure if there
> > > > > are obvious reasons why that won't work or it would be a bad idea...
> > > >
> > > > That would mean adding a new clone flag that redefines how those
> > > > parameters work and describing the current behavior in the manpage as
> > > > the behavior without the flag (which doesn't exist on 5.3), right?
> > >
> > > I would break API and if someone reports breakage we'll revert and go
> > > the more complicated route you outlined (see [1]).
> >
> > @Jann, I think the following patch might even be enough?...
> [...]
> > +static inline void clone3_prepare_stack(struct kernel_clone_args *kargs)
> > +{
> > +#if !defined(CONFIG_STACK_GROWSUP) && !defined(CONFIG_IA64)
> > +       kargs->stack += kargs->stack_size;
> > +#endif
> > +}
> 
> I guess it might work as long as nobody is actually using clone3 yet
> and you can get this patch into the 5.3 stable tree and any distro
> kernels on 5.3 before people do start using clone3?

Yes, that would be my preferred approach. As I said doing it this way is
pretty common. A recent example where we did this is the file_max
sysctl.

Christian

^ permalink raw reply

* Re: For review: documentation of clone3() system call
From: Christian Brauner @ 2019-10-29 16:04 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Jann Horn, Michael Kerrisk-manpages, lkml, linux-man, Kees Cook,
	Oleg Nesterov, Arnd Bergmann, David Howells, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, Linux API
In-Reply-To: <87wocn39fu.fsf@oldenburg2.str.redhat.com>

On Tue, Oct 29, 2019 at 03:36:37PM +0100, Florian Weimer wrote:
> * Christian Brauner:
> 
> > @Florian, do you have an opinion about always passing the stack from the
> > lowest address with clone3()?
> 
> Do you mean that the stack extends from stack to stack_size?  I guess

Specifically, that userspace doesn't need to know whether it needs to
pass stack or stack + stack_size. The kernel will just do the stack + stack_size
if the architecture has a downwards growing stack. So for _all_
architectures, ia64 or not, you'd always pass:

void *p[PAGE_SIZE];

struct clone_args args = {
	.stack = p,
	.stack_size = PAGE_SIZE,
};


> that makes sense.  What about architectures which need two stacks (I
> think ia64 is one)?

I don't think ia64 needs any special treament. ia64 requires you to pass
the lowest address of the stack and the kernel does the additon to reach
the top of the stack and the alignemnt too. So ia64 _in the kernel_
currently does:

arch/ia64/kernel/entry.S:sys_clone2()
- setup stack and stack size and call into do_fork()
  -> kernel/fork.c:do_fork()
     -> copy_thread_tls()
       -> arch/ia64/kernel/process.c:copy_thread():

	if (user_stack_base) {
		child_ptregs->r12 = user_stack_base + user_stack_size - 16;
		child_ptregs->ar_bspstore = user_stack_base;
		child_ptregs->ar_rnat = 0;
		child_ptregs->loadrs = 0;
	}

> There is also the matter whose responsibility is the alignment of the
> initial stack pointer.

Hm, probably also a detail that userspace shouldn't need to know
about?

Christian

^ permalink raw reply

* Re: For review: documentation of clone3() system call
From: Jann Horn @ 2019-10-29 15:20 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Florian Weimer, Michael Kerrisk-manpages, lkml, linux-man,
	Kees Cook, Oleg Nesterov, Arnd Bergmann, David Howells,
	Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	Linux API
In-Reply-To: <20191029142622.jxmssu4s4ndui7bw@wittgenstein>

On Tue, Oct 29, 2019 at 3:26 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> On Tue, Oct 29, 2019 at 12:27:07PM +0100, Christian Brauner wrote:
> > On Mon, Oct 28, 2019 at 08:09:13PM +0100, Jann Horn wrote:
> > > On Mon, Oct 28, 2019 at 6:21 PM Christian Brauner
> > > <christian.brauner@ubuntu.com> wrote:
> > > > where stack + stack_size is addition on a void pointer which usually
> > > > clang and gcc are not very happy about.
> > > > I wanted to bring this up on the mailing list soon: If possible, I don't
> > > > want userspace to need to know about stack direction and just have stack
> > > > point to the beginning and then have the kernel do the + stack_size
> > > > after the copy_clone_args_from_user() if the arch needs it. For example,
> > > > by having a dumb helder similar to copy_thread_tls()/coyp_thread() that
> > > > either does the + stack_size or not. Right now, clone3() is supported on
> > > > parisc and afaict, the stack grows upwards for it. I'm not sure if there
> > > > are obvious reasons why that won't work or it would be a bad idea...
> > >
> > > That would mean adding a new clone flag that redefines how those
> > > parameters work and describing the current behavior in the manpage as
> > > the behavior without the flag (which doesn't exist on 5.3), right?
> >
> > I would break API and if someone reports breakage we'll revert and go
> > the more complicated route you outlined (see [1]).
>
> @Jann, I think the following patch might even be enough?...
[...]
> +static inline void clone3_prepare_stack(struct kernel_clone_args *kargs)
> +{
> +#if !defined(CONFIG_STACK_GROWSUP) && !defined(CONFIG_IA64)
> +       kargs->stack += kargs->stack_size;
> +#endif
> +}

I guess it might work as long as nobody is actually using clone3 yet
and you can get this patch into the 5.3 stable tree and any distro
kernels on 5.3 before people do start using clone3?

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Tycho Andersen @ 2019-10-29 15:13 UTC (permalink / raw)
  To: Reshetova, Elena
  Cc: Mike Rapoport, linux-kernel@vger.kernel.org, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Dave Hansen, James Bottomley, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	linux-api@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	Mike Rapoport, Alan Cox
In-Reply-To: <2236FBA76BA1254E88B949DDB74E612BA4EEC0CE@IRSMSX102.ger.corp.intel.com>

Hi Elena, Mike,

On Tue, Oct 29, 2019 at 11:25:12AM +0000, Reshetova, Elena wrote:
> > The patch below aims to allow applications to create mappins that have
> > pages visible only to the owning process. Such mappings could be used to
> > store secrets so that these secrets are not visible neither to other
> > processes nor to the kernel.
> 
> Hi Mike, 
> 
> I have actually been looking into the closely related problem for the past 
> couple of weeks (on and off). What is common here is the need for userspace
> to indicate to kernel that some pages contain secrets. And then there are
> actually a number of things that kernel can do to try to protect these secrets
> better. Unmap from direct map is one of them. Another thing is to map such
> pages as non-cached, which can help us to prevent or considerably restrict
> speculation on such pages. The initial proof of concept for marking pages as
> "UNCACHED" that I got from Dave Hansen was actually based on mlock2() 
> and a new flag for it for this purpose. Since then I have been thinking on what
> interface suits the use case better and actually selected going with new madvise() 
> flag instead because of all possible implications for fragmentation and performance. 
> My logic was that we better allocate the secret data explicitly (using mmap()) 
> to make sure that no other process data accidentally gets to suffer.
> Imagine I would allocate a buffer to hold a secret key, signal with mlock
>  to protect it and suddenly my other high throughput non-secret buffer 
> (which happened to live on the same page by chance) became very slow
>  and I don't even have an easy way (apart from mmap()ing it!) to guarantee
>  that it won't be affected.
> 
> So, I ended up towards smth like:
> 
>   secret_buffer =  mmap(NULL, PAGE_SIZE, ...)
>    madvise(secret_buffer, size, MADV_SECRET)
> 
> I have work in progress code here:
>  https://github.com/ereshetova/linux/commits/madvise
> 
> I haven't sent it for review, because it is not ready yet and I am now working
> on trying to add the page wiping functionality. Otherwise it would be useless
> to protect the page during the time it is used in userspace, but then allow it
> to get reused by a different process later after it has been released back and
> userspace was stupid enough not to wipe the contents (or was crashed on 
> purpose before it was able to wipe anything out). 

I was looking at this and thinking that wiping during do_exit() might
be a nice place, but I haven't tried anything yet.

> We have also had some discussions with Tycho that XPFO can be also
> applied selectively for such "SECRET" marked pages and I know that he has also
> did some initial prototyping on this, so I think it would be great to decide
> on userspace interface first and then see how we can assemble together all
> these features. 

Yep! Here's my tree with the direct un-mapping bits ported from XPFO:
https://github.com/tych0/linux/commits/madvise

As noted in one of the commit messages I think the bit math for page
prot flags needs a bit of work, but the test passes, so :)

In any case, I'll try to look at Mike's patches later today.

Cheers,

Tycho

^ permalink raw reply

* Re: For review: documentation of clone3() system call
From: Florian Weimer @ 2019-10-29 14:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Michael Kerrisk-manpages, lkml, linux-man, Kees Cook,
	Oleg Nesterov, Arnd Bergmann, David Howells, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, Linux API
In-Reply-To: <20191029142622.jxmssu4s4ndui7bw@wittgenstein>

* Christian Brauner:

> @Florian, do you have an opinion about always passing the stack from the
> lowest address with clone3()?

Do you mean that the stack extends from stack to stack_size?  I guess
that makes sense.  What about architectures which need two stacks (I
think ia64 is one)?

There is also the matter whose responsibility is the alignment of the
initial stack pointer.

Thanks,
Florian

^ permalink raw reply

* Re: For review: documentation of clone3() system call
From: Christian Brauner @ 2019-10-29 14:26 UTC (permalink / raw)
  To: Jann Horn, Florian Weimer
  Cc: Michael Kerrisk-manpages, lkml, linux-man, Kees Cook,
	Oleg Nesterov, Arnd Bergmann, David Howells, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, Linux API
In-Reply-To: <20191029112706.p5dd5yzpcgouo6n5@wittgenstein>

On Tue, Oct 29, 2019 at 12:27:07PM +0100, Christian Brauner wrote:
> On Mon, Oct 28, 2019 at 08:09:13PM +0100, Jann Horn wrote:
> > On Mon, Oct 28, 2019 at 6:21 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > > On Mon, Oct 28, 2019 at 04:12:09PM +0100, Jann Horn wrote:
> > > > On Fri, Oct 25, 2019 at 6:59 PM Michael Kerrisk (man-pages)
> > > > <mtk.manpages@gmail.com> wrote:
> > > > > I've made a first shot at adding documentation for clone3(). You can
> > > > > see the diff here:
> > > > > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=faa0e55ae9e490d71c826546bbdef954a1800969
> > [...]
> > > > You might want to note somewhere that its flags can't be
> > > > seccomp-filtered because they're stored in memory, making it
> > > > inappropriate to use in heavily sandboxed processes.
> > >
> > > Hm, I don't think that belongs on the clone manpage. Granted that
> > > process creation is an important syscall but so are a bunch of others
> > > that aren't filterable because of pointer arguments.
> > > We can probably mention on the seccomp manpage that seccomp can't filter
> > > on pointer arguments and then provide a list of examples. If you setup a
> > > seccomp filter and don't know that you can't filter syscalls with
> > > pointer args that seems pretty bad to begin with.
> > 
> > Fair enough.
> > 
> > [...]
> > > One thing I never liked about clone() was that userspace had to know
> > > about stack direction. And there is a lot of ugly code in userspace that
> > > has nasty clone() wrappers like:
> > [...]
> > > where stack + stack_size is addition on a void pointer which usually
> > > clang and gcc are not very happy about.
> > > I wanted to bring this up on the mailing list soon: If possible, I don't
> > > want userspace to need to know about stack direction and just have stack
> > > point to the beginning and then have the kernel do the + stack_size
> > > after the copy_clone_args_from_user() if the arch needs it. For example,
> > > by having a dumb helder similar to copy_thread_tls()/coyp_thread() that
> > > either does the + stack_size or not. Right now, clone3() is supported on
> > > parisc and afaict, the stack grows upwards for it. I'm not sure if there
> > > are obvious reasons why that won't work or it would be a bad idea...
> > 
> > That would mean adding a new clone flag that redefines how those
> > parameters work and describing the current behavior in the manpage as
> > the behavior without the flag (which doesn't exist on 5.3), right?
> 
> I would break API and if someone reports breakage we'll revert and go
> the more complicated route you outlined (see [1]).

@Jann, I think the following patch might even be enough?...

@Florian, do you have an opinion about always passing the stack from the
lowest address with clone3()?

>From 72b2a5711fd37e34e87df1b29b2e1885bb28cf75 Mon Sep 17 00:00:00 2001
From: Christian Brauner <christian.brauner@ubuntu.com>
Date: Tue, 29 Oct 2019 13:55:39 +0100
Subject: [PATCH] fork: stack direction

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 kernel/fork.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index bcdf53125210..22dc72071a6d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2584,6 +2584,13 @@ static bool clone3_args_valid(const struct kernel_clone_args *kargs)
 	return true;
 }
 
+static inline void clone3_prepare_stack(struct kernel_clone_args *kargs)
+{
+#if !defined(CONFIG_STACK_GROWSUP) && !defined(CONFIG_IA64)
+	kargs->stack += kargs->stack_size;
+#endif
+}
+
 /**
  * clone3 - create a new process with specific properties
  * @uargs: argument structure
@@ -2605,6 +2612,8 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
 	if (err)
 		return err;
 
+	clone3_prepare_stack(&kargs);
+
 	if (!clone3_args_valid(&kargs))
 		return -EINVAL;
 
-- 
2.23.0

^ permalink raw reply related

* AMD TLB errata, (Was: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings)
From: Peter Zijlstra @ 2019-10-29 12:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dan Williams, Mike Rapoport, Linux Kernel Mailing List,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Linux API, linux-mm,
	the arch/x86 maintainers, Mike Rapoport, thomas.lendacky
In-Reply-To: <20191029110024.yjytp22lhd2vekrv@box>

On Tue, Oct 29, 2019 at 02:00:24PM +0300, Kirill A. Shutemov wrote:
> On Tue, Oct 29, 2019 at 09:56:02AM +0100, Peter Zijlstra wrote:
> > On Tue, Oct 29, 2019 at 09:43:18AM +0300, Kirill A. Shutemov wrote:
> > > But some CPUs don't like to have two TLB entries for the same memory with
> > > different sizes at the same time. See for instance AMD erratum 383.
> > > 
> > > Getting it right would require making the range not present, flush TLB and
> > > only then install huge page. That's what we do for userspace.
> > > 
> > > It will not fly for the direct mapping. There is no reasonable way to
> > > exclude other CPU from accessing the range while it's not present (call
> > > stop_machine()? :P). Moreover, the range may contain the code that doing
> > > the collapse or data required for it...
> > > 
> > > BTW, looks like current __split_large_page() in pageattr.c is susceptible
> > > to the errata. Maybe we can get away with the easy way...
> > 
> > As you write above, there is just no way we can have a (temporary) hole
> > in the direct map.
> > 
> > We are careful about that other errata, and make sure both translations
> > are identical wrt everything else.
> 
> It's not clear if it is enough to avoid the issue. "under a highly specific
> and detailed set of conditions" is not very specific set of conditions :P

Yeah, I know ... :/ Tom is there any chance you could shed a little more
light on that errata?

^ permalink raw reply

* Re: For review: documentation of clone3() system call
From: Christian Brauner @ 2019-10-29 11:27 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michael Kerrisk-manpages, lkml, linux-man, Kees Cook,
	Florian Weimer, Oleg Nesterov, Arnd Bergmann, David Howells,
	Pavel Emelyanov, Andrew Morton, Adrian Reber, Andrei Vagin,
	Linux API
In-Reply-To: <CAG48ez20hn8vToY+=C62nA-rbUfxh=JD6N-f7XVS3_GZOoPjxw@mail.gmail.com>

On Mon, Oct 28, 2019 at 08:09:13PM +0100, Jann Horn wrote:
> On Mon, Oct 28, 2019 at 6:21 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> > On Mon, Oct 28, 2019 at 04:12:09PM +0100, Jann Horn wrote:
> > > On Fri, Oct 25, 2019 at 6:59 PM Michael Kerrisk (man-pages)
> > > <mtk.manpages@gmail.com> wrote:
> > > > I've made a first shot at adding documentation for clone3(). You can
> > > > see the diff here:
> > > > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=faa0e55ae9e490d71c826546bbdef954a1800969
> [...]
> > > You might want to note somewhere that its flags can't be
> > > seccomp-filtered because they're stored in memory, making it
> > > inappropriate to use in heavily sandboxed processes.
> >
> > Hm, I don't think that belongs on the clone manpage. Granted that
> > process creation is an important syscall but so are a bunch of others
> > that aren't filterable because of pointer arguments.
> > We can probably mention on the seccomp manpage that seccomp can't filter
> > on pointer arguments and then provide a list of examples. If you setup a
> > seccomp filter and don't know that you can't filter syscalls with
> > pointer args that seems pretty bad to begin with.
> 
> Fair enough.
> 
> [...]
> > One thing I never liked about clone() was that userspace had to know
> > about stack direction. And there is a lot of ugly code in userspace that
> > has nasty clone() wrappers like:
> [...]
> > where stack + stack_size is addition on a void pointer which usually
> > clang and gcc are not very happy about.
> > I wanted to bring this up on the mailing list soon: If possible, I don't
> > want userspace to need to know about stack direction and just have stack
> > point to the beginning and then have the kernel do the + stack_size
> > after the copy_clone_args_from_user() if the arch needs it. For example,
> > by having a dumb helder similar to copy_thread_tls()/coyp_thread() that
> > either does the + stack_size or not. Right now, clone3() is supported on
> > parisc and afaict, the stack grows upwards for it. I'm not sure if there
> > are obvious reasons why that won't work or it would be a bad idea...
> 
> That would mean adding a new clone flag that redefines how those
> parameters work and describing the current behavior in the manpage as
> the behavior without the flag (which doesn't exist on 5.3), right?

I would break API and if someone reports breakage we'll revert and go
the more complicated route you outlined (see [1]).
But I don't think it will a big deal. First, we haven't documented how
stack needs to be passed so who knows what people currently do. Second,
clone3() has not been out for a long time and currently does _not_
provide features that legacy clone() does not provide apart from a
cleaner interface. So userspace has no incentive to use clone3() over
clone() right now. That'll change latest with v5.5 where we have new
features on top of clone3() (CLONE_CLEAR_SIGHAND). So let's just try and
fix it.

[1]: This is basically what Linus has repeatedly said: it's not about
     never breaking api in principle but rather about whether this
     breaks someones usecase. And if it does break, we need to revert.

Christian

^ permalink raw reply

* RE: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Reshetova, Elena @ 2019-10-29 11:25 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel@vger.kernel.org
  Cc: Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	linux-api@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	Mike Rapoport, Tycho Andersen, Alan Cox
In-Reply-To: <1572171452-7958-1-git-send-email-rppt@kernel.org>

> The patch below aims to allow applications to create mappins that have
> pages visible only to the owning process. Such mappings could be used to
> store secrets so that these secrets are not visible neither to other
> processes nor to the kernel.

Hi Mike, 

I have actually been looking into the closely related problem for the past 
couple of weeks (on and off). What is common here is the need for userspace
to indicate to kernel that some pages contain secrets. And then there are
actually a number of things that kernel can do to try to protect these secrets
better. Unmap from direct map is one of them. Another thing is to map such
pages as non-cached, which can help us to prevent or considerably restrict
speculation on such pages. The initial proof of concept for marking pages as
"UNCACHED" that I got from Dave Hansen was actually based on mlock2() 
and a new flag for it for this purpose. Since then I have been thinking on what
interface suits the use case better and actually selected going with new madvise() 
flag instead because of all possible implications for fragmentation and performance. 
My logic was that we better allocate the secret data explicitly (using mmap()) 
to make sure that no other process data accidentally gets to suffer.
Imagine I would allocate a buffer to hold a secret key, signal with mlock
 to protect it and suddenly my other high throughput non-secret buffer 
(which happened to live on the same page by chance) became very slow
 and I don't even have an easy way (apart from mmap()ing it!) to guarantee
 that it won't be affected.

So, I ended up towards smth like:

  secret_buffer =  mmap(NULL, PAGE_SIZE, ...)
   madvise(secret_buffer, size, MADV_SECRET)

I have work in progress code here:
 https://github.com/ereshetova/linux/commits/madvise

I haven't sent it for review, because it is not ready yet and I am now working
on trying to add the page wiping functionality. Otherwise it would be useless
to protect the page during the time it is used in userspace, but then allow it
to get reused by a different process later after it has been released back and
userspace was stupid enough not to wipe the contents (or was crashed on 
purpose before it was able to wipe anything out). 

We have also had some discussions with Tycho that XPFO can be also
applied selectively for such "SECRET" marked pages and I know that he has also
did some initial prototyping on this, so I think it would be great to decide
on userspace interface first and then see how we can assemble together all
these features. 

The *very* far fetching goal for all of this would be something that Alan Cox
suggested when I started looking into this - actually have a new libc function to 
allocate memory in a secure way, which can hide all the dancing with mmap()/madvise()
(or/and potentially interaction with a chardev that Andy was suggesting also) and
implement an efficient allocator for such secret pages. Openssl has its
own version of  "secure heap", which is essentially mmap area with additional 
MLOCK_ONFAULT and MADV_DONTDUMP flags for protection. Some other 
apps or libs must use smth similar if they want additional protection, which
makes them to reimplement the same concept again and again. Sadly or surprisingly 
other major libs like boringssl, mbedTLS or client like openssh do not user any mlock()/
madvise() flags for any additional protection of secrets that they hold in memory. 
Maybe if all of it would be behind a single secure API situation would start to 
change in userspace towards better. 

Best Regards,
Elena.

. 

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: David Hildenbrand @ 2019-10-29 11:02 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Peter Zijlstra,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <1572171452-7958-2-git-send-email-rppt@kernel.org>

On 27.10.19 11:17, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> The mappings created with MAP_EXCLUSIVE are visible only in the context of
> the owning process and can be used by applications to store secret
> information that will not be visible not only to other processes but to the
> kernel as well.
> 
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.
> 

Just a thought, the kernel is still able to indirectly read the contents 
of these pages by doing a kdump from kexec environment, right?. Also, I 
wonder what would happen if you map such pages via /dev/mem into another 
user space application and e.g., use them along with kvm [1].

[1] https://lwn.net/Articles/778240/

-- 

Thanks,

David / dhildenb

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Kirill A. Shutemov @ 2019-10-29 11:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dan Williams, Mike Rapoport, Linux Kernel Mailing List,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, James Bottomley, Steven Rostedt,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Linux API, linux-mm,
	the arch/x86 maintainers, Mike Rapoport
In-Reply-To: <20191029085602.GI4114@hirez.programming.kicks-ass.net>

On Tue, Oct 29, 2019 at 09:56:02AM +0100, Peter Zijlstra wrote:
> On Tue, Oct 29, 2019 at 09:43:18AM +0300, Kirill A. Shutemov wrote:
> > But some CPUs don't like to have two TLB entries for the same memory with
> > different sizes at the same time. See for instance AMD erratum 383.
> > 
> > Getting it right would require making the range not present, flush TLB and
> > only then install huge page. That's what we do for userspace.
> > 
> > It will not fly for the direct mapping. There is no reasonable way to
> > exclude other CPU from accessing the range while it's not present (call
> > stop_machine()? :P). Moreover, the range may contain the code that doing
> > the collapse or data required for it...
> > 
> > BTW, looks like current __split_large_page() in pageattr.c is susceptible
> > to the errata. Maybe we can get away with the easy way...
> 
> As you write above, there is just no way we can have a (temporary) hole
> in the direct map.
> 
> We are careful about that other errata, and make sure both translations
> are identical wrt everything else.

It's not clear if it is enough to avoid the issue. "under a highly specific
and detailed set of conditions" is not very specific set of conditions :P

-- 
 Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Christopher Lameter @ 2019-10-29 10:12 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Kirill A. Shutemov, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport
In-Reply-To: <20191029085551.GA18773@rapoport-lnx>



On Tue, 29 Oct 2019, Mike Rapoport wrote:

> I've talked with Thomas yesterday and he suggested something similar:
>
> When the MAP_EXCLUSIVE request comes for the first time, we allocate a huge
> page for it and then use this page as a pool of 4K pages for subsequent
> requests. Once this huge page is full we allocate a new one and append it
> to the pool. When all the 4K pages that comprise the huge page are freed
> the huge page is collapsed.

Or write a device driver that allows you to mmap a secure area and avoid
all core kernel modifications?

/dev/securemem or so?

It may exist already.

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Mike Rapoport @ 2019-10-29  9:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <CA5C22D9-BC3E-4B69-8DD9-4D3B75E40BD5@amacapital.net>

On Mon, Oct 28, 2019 at 02:44:23PM -0600, Andy Lutomirski wrote:
> 
> > On Oct 27, 2019, at 4:17 AM, Mike Rapoport <rppt@kernel.org> wrote:
> > 
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > Hi,
> > 
> > The patch below aims to allow applications to create mappins that have
> > pages visible only to the owning process. Such mappings could be used to
> > store secrets so that these secrets are not visible neither to other
> > processes nor to the kernel.
> > 
> > I've only tested the basic functionality, the changes should be verified
> > against THP/migration/compaction. Yet, I'd appreciate early feedback.
> 
> I’ve contemplated the concept a fair amount, and I think you should
> consider a change to the API. In particular, rather than having it be a
> MAP_ flag, make it a chardev.  You can, at least at first, allow only
> MAP_SHARED, and admins can decide who gets to use it.  It might also play
> better with the VM overall, and you won’t need a VM_ flag for it — you
> can just wire up .fault to do the right thing.

I think mmap()/mprotect()/madvise() are the natural APIs for such
interface. Switching to a chardev doesn't solve the major problem of direct
map fragmentation and defeats the ability to use exclusive memory mappings
with the existing allocators, while mprotect() and madvise() do not.

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Mike Rapoport @ 2019-10-29  9:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Hansen, linux-kernel, Alexey Dobriyan, Andrew Morton,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	James Bottomley, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, linux-api, linux-mm, x86,
	Mike Rapoport
In-Reply-To: <20191028180808.GA17669@bombadil.infradead.org>

On Mon, Oct 28, 2019 at 11:08:08AM -0700, Matthew Wilcox wrote:
> On Mon, Oct 28, 2019 at 10:12:44AM -0700, Dave Hansen wrote:
> > Some other random thoughts:
> > 
> >  * The page flag is probably not a good idea.  It would be probably
> >    better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
> >    into the slow path.
> >  * This really stops being "normal" memory.  You can't do futexes on it,
> >    cant splice it.  Probably need a more fleshed-out list of
> >    incompatible features.
> >  * As Kirill noted, each 4k page ends up with a potential 1GB "blast
> >    radius" of demoted pages in the direct map.  Not cool.  This is
> >    probably a non-starter as it stands.
> >  * The global TLB flushes are going to eat you alive.  They probably
> >    border on a DoS on larger systems.
> >  * Do we really want this user interface to dictate the kernel
> >    implementation?  In other words, do we really want MAP_EXCLUSIVE,
> >    or do we want MAP_SECRET?  One tells the kernel what do *do*, the
> >    other tells the kernel what the memory *IS*.
> >  * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
> >    Persistent Memory, where the kernel direct map is a liability in some
> >    way.  We probably need some kind of overall, architected solution
> >    rather than five or ten things all poking at the direct map.
> 
> Another random set of thoughts:
> 
>  - Should devices be permitted to DMA to/from MAP_SECRET pages?

I can't say I have a clear cut yes or no here. One possible use case for
such pages is to read a secrets from storage directly into them. On the
other side, DMA to/from a device can be used to exploit those secrets...

>  - How about GUP?

Do you mean GUP for "remote" memory? I'd say no.

>  - Can I ptrace my way into another process's secret pages?

No.

>  - What if I splice() the page into a pipe?

I think it should fail.

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Mike Rapoport @ 2019-10-29  9:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <d6ac08fe-23f3-c2d5-24c4-88e68f3fd4d0@intel.com>

On Mon, Oct 28, 2019 at 10:12:44AM -0700, Dave Hansen wrote:
> On 10/27/19 3:17 AM, Mike Rapoport wrote:
> > The pages in these mappings are removed from the kernel direct map and
> > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > the pages are mapped back into the direct map.
> 
> This looks fun.  It's certainly simple.
> 
> But, the description is not really calling out the pros and cons very
> well.  I'm also not sure that folks will use an interface like this that
> requires up-front, special code to do an allocation instead of something
> like madvise().  That's why protection keys ended up the way it did: if
> you do this as a mmap() replacement, you need to modify all *allocators*
> to be enabled for this.  If you do it with mprotect()-style, you can
> apply it to existing allocations.

Actually, I've started with mprotect() and then realized that mmap() would
be simpler, so I switched over to mmap().
 
> Some other random thoughts:
> 
>  * The page flag is probably not a good idea.  It would be probably
>    better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
>    into the slow path.

The page flag won't work on 32-bit, indeed. But do we really need such
functionality on 32-bit?

>  * This really stops being "normal" memory.  You can't do futexes on it,
>    cant splice it.  Probably need a more fleshed-out list of
>    incompatible features.

True, my bad. I should have mentioned more than THP/compaction/migration.

>  * As Kirill noted, each 4k page ends up with a potential 1GB "blast
>    radius" of demoted pages in the direct map.  Not cool.  This is
>    probably a non-starter as it stands.
>  * The global TLB flushes are going to eat you alive.  They probably
>    border on a DoS on larger systems.

As I wrote in another email, we could use some kind of pooling to reduce
the "blast radius" and that will reduce that amount of TLB flushes as well.
The size of the MAP_EXCLUSIVE obeys the RLIMIT_MEMLOCK and we can add a
system-wide limit for size of such allocations.

>  * Do we really want this user interface to dictate the kernel
>    implementation?  In other words, do we really want MAP_EXCLUSIVE,
>    or do we want MAP_SECRET?  One tells the kernel what do *do*, the
>    other tells the kernel what the memory *IS*.

I hesitated quite some time between EXCLUSIVE and SECRET. I've settled down
on EXCLUSIVE because in my view that better describes the fact that the
region is only mapped in its owner address space. And as such it can be
used to store secrets, but it can be used for other purposes as well.

>  * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
>    Persistent Memory, where the kernel direct map is a liability in some
>    way.  We probably need some kind of overall, architected solution
>    rather than five or ten things all poking at the direct map.

Agree.
 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
From: Mike Rapoport @ 2019-10-29  9:01 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, James Bottomley,
	Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, linux-api, linux-mm, x86, Mike Rapoport
In-Reply-To: <87h83s62mi.fsf@mid.deneb.enyo.de>

On Mon, Oct 28, 2019 at 09:23:17PM +0100, Florian Weimer wrote:
> * Mike Rapoport:
> 
> > On October 27, 2019 12:30:21 PM GMT+02:00, Florian Weimer
> > <fw@deneb.enyo.de> wrote:
> >>* Mike Rapoport:
> >>
> >>> The patch below aims to allow applications to create mappins that
> >>have
> >>> pages visible only to the owning process. Such mappings could be used
> >>to
> >>> store secrets so that these secrets are not visible neither to other
> >>> processes nor to the kernel.
> >>
> >>How is this expected to interact with CRIU?
> >
> > CRIU dumps the memory contents using a parasite code from inside the
> > dumpee address space, so it would work the same way as for the other
> > mappings. Of course, at the restore time the exclusive mapping should
> > be recreated with the appropriate flags.
> 
> Hmm, so it would use a bounce buffer to perform the extraction?

At first I thought that CRIU would extract the memory contents from these
mappings just as it does now using vmsplice(). But it seems that such
mappings won't play well with pipes, so CRIU will need a bounce buffer
indeed.
 
> >>> I've only tested the basic functionality, the changes should be
> >>verified
> >>> against THP/migration/compaction. Yet, I'd appreciate early feedback.
> >>
> >>What are the expected semantics for VM migration?  Should it fail?
> >
> > I don't quite follow. If qemu would use such mappings it would be able
> > to transfer them during live migration.
> 
> I was wondering if the special state is supposed to bubble up to the
> host eventually.

Well, that was not intended.

-- 
Sincerely yours,
Mike.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox