Linux Security Modules development

Linux Security Modules development
 help / color / mirror / Atom feed

* Re: [PATCH v20 17/23] LSM: security_secid_to_secctx in netlink netfilter
From: Pablo Neira Ayuso @ 2020-09-08 10:46 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: casey.schaufler, jmorris, linux-security-module, selinux,
	linux-audit, keescook, john.johansen, penguin-kernel, paul, sds,
	netdev
In-Reply-To: <20200826145247.10029-18-casey@schaufler-ca.com>

On Wed, Aug 26, 2020 at 07:52:41AM -0700, Casey Schaufler wrote:
> Change netlink netfilter interfaces to use lsmcontext
> pointers, and remove scaffolding.
> 
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Reviewed-by: John Johansen <john.johansen@canonical.com>
> Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
> cc: netdev@vger.kernel.org
> ---
>  net/netfilter/nfnetlink_queue.c | 31 ++++++++++++-------------------
>  1 file changed, 12 insertions(+), 19 deletions(-)
> 
[...]
>  static u32 nfqnl_get_bridge_size(struct nf_queue_entry *entry)
> @@ -401,8 +399,7 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue,
>  	enum ip_conntrack_info ctinfo;
>  	struct nfnl_ct_hook *nfnl_ct;
>  	bool csum_verify;
> -	struct lsmcontext scaff; /* scaffolding */
> -	char *secdata = NULL;
> +	struct lsmcontext context = { };
>  	u32 seclen = 0;

While at it, please introduce reverse xmas tree in variable
definitions incrementally:

	struct lsmcontext context = { };
  	enum ip_conntrack_info ctinfo;
  	struct nfnl_ct_hook *nfnl_ct;
  	bool csum_verify;
 	u32 seclen = 0;

And Cc: netfilter-devel@vger.kernel.org for patches that update the
Netfilter codebase.

Thanks.

^ permalink raw reply

* Re: [RFC PATCH] sched: only issue an audit on privileged operation
From: peterz @ 2020-09-08 10:25 UTC (permalink / raw)
  To: Christian Göttsche
  Cc: linux-kernel, selinux, linux-security-module, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman
In-Reply-To: <20200904160031.6444-1-cgzones@googlemail.com>

On Fri, Sep 04, 2020 at 06:00:31PM +0200, Christian Göttsche wrote:
> sched_setattr(2) does via kernel/sched/core.c:__sched_setscheduler()
> issue a CAP_SYS_NICE audit event unconditionally, even when the requested
> operation does not require that capability / is un-privileged.
> 
> Perform privilged/unprivileged catigorization first and perform a
> capable test only if needed.
> 
> Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
> ---
>  kernel/sched/core.c | 65 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 47 insertions(+), 18 deletions(-)

So who sodding cares about audit, and why is that a reason to make a
trainwreck of code?

^ permalink raw reply

* [RFC PATCH v8 1/3] fs: Introduce AT_INTERPRETED flag for faccessat2(2)
From: Mickaël Salaün @ 2020-09-08  7:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andrew Morton, Andy Lutomirski, Christian Brauner,
	Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
	Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
	Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
	Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Miklos Szeredi,
	Mimi Zohar, Philippe Trébuchet, Scott Shell,
	Sean Christopherson, Shuah Khan, Steve Dower, Steve Grubb,
	Tetsuo Handa, Thibaut Sautereau, Vincent Strubel,
	kernel-hardening, linux-api, linux-integrity,
	linux-security-module, linux-fsdevel, Thibaut Sautereau,
	Mickaël Salaün
In-Reply-To: <20200908075956.1069018-1-mic@digikod.net>

From: Mickaël Salaün <mic@linux.microsoft.com>

The AT_INTERPRETED flag combined with the X_OK mode enable trusted user
space tasks to check that files are allowed to be executed by user
space.  The security policy is consistently managed by the kernel
through a sysctl or implemented by an LSM thanks to the inode_permission
hook and a new kernel flag: MAY_INTERPRETED_EXEC.

The underlying idea is to be able to restrict scripts interpretation
according to a policy defined by the system administrator.  For this to
be possible, script interpreters must use faccessat2(2) with the
AT_INTERPRETED flag and the X_OK mode.  To be fully effective, these
interpreters also need to handle the other ways to execute code: command
line parameters (e.g., option -e for Perl), module loading (e.g., option
-m for Python), stdin, file sourcing, environment variables,
configuration files, etc.  According to the threat model, it may be
acceptable to allow some script interpreters (e.g. Bash) to interpret
commands from stdin, may it be a TTY or a pipe, because it may not be
enough to (directly) perform syscalls.  Further documentation can be
found in a following patch.

Even without enforced security policy, userland interpreters can set it
to enforce the system policy at their level, knowing that it will not
break anything on running systems which do not care about this feature.
However, on systems which want this feature enforced, there will be
knowledgeable people (i.e. sysadmins who enforced AT_INTERPRETED with
X_OK deliberately) to manage it.  A simple security policy
implementation, configured through a dedicated sysctl, is available in a
following patch.

AT_INTERPRETED with X_OK should not be confused with the O_EXEC flag
(for open) which is intended for execute-only, which obviously doesn't
work for scripts.  However, a similar behavior could be implemented in
userland with O_PATH:
https://lore.kernel.org/lkml/1e2f6913-42f2-3578-28ed-567f6a4bdda1@digikod.net/

This is a new implementation of a patch initially written by
Vincent Strubel for CLIP OS 4:
https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d6b684752e403b4e41b39f7004d88e561/1901_open_mayexec.patch
This patch has been used for more than 12 years with customized script
interpreters.  Some examples (with the original O_MAYEXEC) can be found
here:
https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC

Co-developed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
---

Changes since v7:
* Replaces openat2/O_MAYEXEC with faccessat2/X_OK/AT_INTERPRETED .
  Switching to an FD-based syscall was suggested by Al Viro and Jann
  Horn.

Changes since v6:
* Do not set __FMODE_EXEC for now because of inconsistent behavior:
  https://lore.kernel.org/lkml/202007160822.CCDB5478@keescook/
* Returns EISDIR when opening a directory with O_MAYEXEC.
* Removed Deven Bowers and Kees Cook Reviewed-by tags because of the
  current update.

Changes since v5:
* Update commit message.

Changes since v3:
* Switch back to O_MAYEXEC, but only handle it with openat2(2) which
  checks unknown flags (suggested by Aleksa Sarai). Cf.
  https://lore.kernel.org/lkml/20200430015429.wuob7m5ofdewubui@yavin.dot.cyphar.com/

Changes since v2:
* Replace O_MAYEXEC with RESOLVE_MAYEXEC from openat2(2).  This change
  enables to not break existing application using bogus O_* flags that
  may be ignored by current kernels by using a new dedicated flag, only
  usable through openat2(2) (suggested by Jeff Layton).  Using this flag
  will results in an error if the running kernel does not support it.
  User space needs to manage this case, as with other RESOLVE_* flags.
  The best effort approach to security (for most common distros) will
  simply consists of ignoring such an error and retry without
  RESOLVE_MAYEXEC.  However, a fully controlled system may which to
  error out if such an inconsistency is detected.

Changes since v1:
* Set __FMODE_EXEC when using O_MAYEXEC to make this information
  available through the new fanotify/FAN_OPEN_EXEC event (suggested by
  Jan Kara and Matthew Bobrowski):
  https://lore.kernel.org/lkml/20181213094658.GA996@lithium.mbobrowski.org/
---
 fs/open.c                  | 31 +++++++++++++++++++++++++++++--
 include/linux/fs.h         |  2 ++
 include/uapi/linux/fcntl.h | 12 +++++++++++-
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 9af548fb841b..879bdfbdc6fa 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -405,9 +405,13 @@ static long do_faccessat(int dfd, const char __user *filename, int mode, int fla
 	if (mode & ~S_IRWXO)	/* where's F_OK, X_OK, W_OK, R_OK? */
 		return -EINVAL;

-	if (flags & ~(AT_EACCESS | AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH))
+	if (flags & ~(AT_EACCESS | AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH |
+				AT_INTERPRETED))
 		return -EINVAL;

+	/* Only allows X_OK with AT_INTERPRETED for now. */
+	if ((flags & AT_INTERPRETED) && !(mode & S_IXOTH))
+		return -EINVAL;
 	if (flags & AT_SYMLINK_NOFOLLOW)
 		lookup_flags &= ~LOOKUP_FOLLOW;
 	if (flags & AT_EMPTY_PATH)
@@ -426,7 +430,30 @@ static long do_faccessat(int dfd, const char __user *filename, int mode, int fla

 	inode = d_backing_inode(path.dentry);

-	if ((mode & MAY_EXEC) && S_ISREG(inode->i_mode)) {
+	if ((flags & AT_INTERPRETED)) {
+		/*
+		 * For compatibility reasons, without a defined security policy
+		 * (via sysctl or LSM), using AT_INTERPRETED must map the
+		 * execute permission to the read permission.  Indeed, from
+		 * user space point of view, being able to execute data (e.g.
+		 * scripts) implies to be able to read this data.
+		 *
+		 * The MAY_INTERPRETED_EXEC bit is set to enable LSMs to add
+		 * custom checks, while being compatible with current policies.
+		 */
+		if ((mode & MAY_EXEC)) {
+			mode |= MAY_INTERPRETED_EXEC;
+			/*
+			 * For compatibility reasons, if the system-wide policy
+			 * doesn't enforce file permission checks, then
+			 * replaces the execute permission request with a read
+			 * permission request.
+			 */
+			mode &= ~MAY_EXEC;
+			/* To be executed *by* user space, files must be readable. */
+			mode |= MAY_READ;
+		}
+	} else if ((mode & MAY_EXEC) && S_ISREG(inode->i_mode)) {
 		/*
 		 * MAY_EXEC on regular files is denied if the fs is mounted
 		 * with the "noexec" flag.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7519ae003a08..03f1b2da6a87 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -101,6 +101,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define MAY_CHDIR		0x00000040
 /* called from RCU mode, don't block */
 #define MAY_NOT_BLOCK		0x00000080
+/* interpreted accesses checked with faccessat2 and AT_INTERPRETED */
+#define MAY_INTERPRETED_EXEC	0x00000100

 /*
  * flags in file.f_mode.  Note that FMODE_READ and FMODE_WRITE must correspond
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..dca082b02634 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -90,7 +90,8 @@
  * unlinkat.  The two functions do completely different things and therefore,
  * the flags can be allowed to overlap.  For example, passing AT_REMOVEDIR to
  * faccessat would be undefined behavior and thus treating it equivalent to
- * AT_EACCESS is valid undefined behavior.
+ * AT_EACCESS is valid undefined behavior.  The same goes for AT_SYMLINK_FOLLOW
+ * and AT_INTERPRETED.
  */
 #define AT_FDCWD		-100    /* Special value used to indicate
                                            openat should use the current
@@ -100,6 +101,15 @@
                                            effective IDs, not real IDs.  */
 #define AT_REMOVEDIR		0x200   /* Remove directory instead of
                                            unlinking file.  */
+#define AT_INTERPRETED		0x400	/* Check if the current process should
+					   grant access (e.g. execution) for a
+					   specific file, i.e. enables RWX to
+					   be enforced *by* user space.  The
+					   main usage is for script
+					   interpreters to enforce a policy
+					   consistent with the kernel's one
+					   (through sysctl configuration or LSM
+					   policy).  */
 #define AT_SYMLINK_FOLLOW	0x400   /* Follow symbolic links.  */
 #define AT_NO_AUTOMOUNT		0x800	/* Suppress terminal automount traversal */
 #define AT_EMPTY_PATH		0x1000	/* Allow empty relative pathname */
-- 
2.28.0

^ permalink raw reply related

* [RFC PATCH v8 3/3] selftest/interpreter: Add tests for AT_INTERPRETED enforcing
From: Mickaël Salaün @ 2020-09-08  7:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andrew Morton, Andy Lutomirski, Christian Brauner,
	Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
	Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
	Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
	Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Miklos Szeredi,
	Mimi Zohar, Philippe Trébuchet, Scott Shell,
	Sean Christopherson, Shuah Khan, Steve Dower, Steve Grubb,
	Tetsuo Handa, Thibaut Sautereau, Vincent Strubel,
	kernel-hardening, linux-api, linux-integrity,
	linux-security-module, linux-fsdevel, Mickaël Salaün,
	Thibaut Sautereau
In-Reply-To: <20200908075956.1069018-1-mic@digikod.net>

From: Mickaël Salaün <mic@linux.microsoft.com>

Test that checks performed by faccessat2(2) with AT_INTERPRETED on file
path and file descriptors are consistent with noexec mount points and
file execute permissions, according to the policy configured with the
fs.interpreted_access sysctl.

Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
Reviewed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
---

Changes since v7:
* Update tests with faccessat2/AT_INTERPRETED, including new ones to
  check that setting R_OK or W_OK returns EINVAL.
* Add tests for memfd, pipefs and nsfs.
* Rename and move back tests to a standalone directory.

Changes since v6:
* Add full combination tests for all file types, including block
  devices, character devices, fifos, sockets and symlinks.
* Properly save and restore initial sysctl value for all tests.

Changes since v5:
* Refactor with FIXTURE_VARIANT, which make the tests much more easy to
  read and maintain.
* Save and restore initial sysctl value (suggested by Kees Cook).
* Test with a sysctl value of 0.
* Check errno in sysctl_access_write test.
* Update tests for the CAP_SYS_ADMIN switch.
* Update tests to check -EISDIR (replacing -EACCES).
* Replace FIXTURE_DATA() with FIXTURE() (spotted by Kees Cook).
* Use global const strings.

Changes since v3:
* Replace RESOLVE_MAYEXEC with O_MAYEXEC.
* Add tests to check that O_MAYEXEC is ignored by open(2) and openat(2).

Changes since v2:
* Move tests from exec/ to openat2/ .
* Replace O_MAYEXEC with RESOLVE_MAYEXEC from openat2(2).
* Cleanup tests.

Changes since v1:
* Move tests from yama/ to exec/ .
* Fix _GNU_SOURCE in kselftest_harness.h .
* Add a new test sysctl_access_write to check if CAP_MAC_ADMIN is taken
  into account.
* Test directory execution which is always forbidden since commit
  73601ea5b7b1 ("fs/open.c: allow opening only regular files during
  execve()"), and also check that even the root user can not bypass file
  execution checks.
* Make sure delete_workspace() always as enough right to succeed.
* Cosmetic cleanup.
---
 .../testing/selftests/interpreter/.gitignore  |   2 +
 tools/testing/selftests/interpreter/Makefile  |  18 +
 tools/testing/selftests/interpreter/config    |   1 +
 .../interpreter/interpreted_access_test.c     | 384 ++++++++++++++++++
 4 files changed, 405 insertions(+)
 create mode 100644 tools/testing/selftests/interpreter/.gitignore
 create mode 100644 tools/testing/selftests/interpreter/Makefile
 create mode 100644 tools/testing/selftests/interpreter/config
 create mode 100644 tools/testing/selftests/interpreter/interpreted_access_test.c

diff --git a/tools/testing/selftests/interpreter/.gitignore b/tools/testing/selftests/interpreter/.gitignore
new file mode 100644
index 000000000000..82a4846cbc4b
--- /dev/null
+++ b/tools/testing/selftests/interpreter/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+/*_test
diff --git a/tools/testing/selftests/interpreter/Makefile b/tools/testing/selftests/interpreter/Makefile
new file mode 100644
index 000000000000..6b3e8c3e533b
--- /dev/null
+++ b/tools/testing/selftests/interpreter/Makefile
@@ -0,0 +1,18 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2
+LDLIBS += -lcap
+
+src_test := $(wildcard *_test.c)
+TEST_GEN_PROGS := $(src_test:.c=)
+
+KSFT_KHDR_INSTALL := 1
+include ../lib.mk
+
+khdr_dir = $(top_srcdir)/usr/include
+
+$(khdr_dir)/asm-generic/unistd.h: khdr
+	@:
+
+$(OUTPUT)/%_test: %_test.c $(khdr_dir)/asm-generic/unistd.h ../kselftest_harness.h
+	$(LINK.c) $< $(LDLIBS) -o $@ -I$(khdr_dir)
diff --git a/tools/testing/selftests/interpreter/config b/tools/testing/selftests/interpreter/config
new file mode 100644
index 000000000000..dd53c266bf52
--- /dev/null
+++ b/tools/testing/selftests/interpreter/config
@@ -0,0 +1 @@
+CONFIG_SYSCTL=y
diff --git a/tools/testing/selftests/interpreter/interpreted_access_test.c b/tools/testing/selftests/interpreter/interpreted_access_test.c
new file mode 100644
index 000000000000..6458dccabe51
--- /dev/null
+++ b/tools/testing/selftests/interpreter/interpreted_access_test.c
@@ -0,0 +1,384 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test AT_INTERPRETED
+ *
+ * Copyright © 2018-2020 ANSSI
+ *
+ * Author: Mickaël Salaün <mic@digikod.net>
+ */
+
+#define _GNU_SOURCE
+#include <asm-generic/unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/capability.h>
+#include <sys/mman.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/sysmacros.h>
+#include <unistd.h>
+
+#include "../kselftest_harness.h"
+
+#ifndef AT_INTERPRETED
+#define AT_INTERPRETED		0x400
+#endif
+
+#ifndef faccessat2
+static int faccessat2(int dirfd, const char *pathname, int mode, int flags)
+{
+	errno = 0;
+	return syscall(__NR_faccessat2, dirfd, pathname, mode, flags);
+}
+#endif
+
+static const char sysctl_path[] = "/proc/sys/fs/interpreted_access";
+
+static const char workdir_path[] = "./test-mount";
+static const char reg_file_path[] = "./test-mount/regular_file";
+static const char dir_path[] = "./test-mount/directory";
+static const char symlink_path[] = "./test-mount/symlink";
+static const char block_dev_path[] = "./test-mount/block_device";
+static const char char_dev_path[] = "./test-mount/character_device";
+static const char fifo_path[] = "./test-mount/fifo";
+static const char sock_path[] = "./test-mount/socket";
+
+static void ignore_dac(struct __test_metadata *_metadata, int override)
+{
+	cap_t caps;
+	const cap_value_t cap_val[2] = {
+		CAP_DAC_OVERRIDE,
+		CAP_DAC_READ_SEARCH,
+	};
+
+	caps = cap_get_proc();
+	ASSERT_NE(NULL, caps);
+	ASSERT_EQ(0, cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_val,
+				override ? CAP_SET : CAP_CLEAR));
+	ASSERT_EQ(0, cap_set_proc(caps));
+	EXPECT_EQ(0, cap_free(caps));
+}
+
+static void ignore_sys_admin(struct __test_metadata *_metadata, int override)
+{
+	cap_t caps;
+	const cap_value_t cap_val[1] = {
+		CAP_SYS_ADMIN,
+	};
+
+	caps = cap_get_proc();
+	ASSERT_NE(NULL, caps);
+	ASSERT_EQ(0, cap_set_flag(caps, CAP_EFFECTIVE, 1, cap_val,
+				override ? CAP_SET : CAP_CLEAR));
+	ASSERT_EQ(0, cap_set_proc(caps));
+	EXPECT_EQ(0, cap_free(caps));
+}
+
+static void test_omx(struct __test_metadata *_metadata,
+		const char *const path, const int err_open,
+		const int err_access)
+{
+	int flags = O_RDONLY | O_NOFOLLOW | O_CLOEXEC;
+	int fd, access_ret, access_errno;
+
+	/* Do not block on pipes. */
+	if (path == fifo_path)
+		flags |= O_NONBLOCK;
+
+	fd = open(path, flags);
+	if (err_open) {
+		ASSERT_EQ(err_open, errno) {
+			TH_LOG("Wrong error for open %s: %s", path, strerror(errno));
+		}
+		ASSERT_EQ(-1, fd);
+	} else {
+		ASSERT_LE(0, fd) {
+			TH_LOG("Failed to open %s: %s", path, strerror(errno));
+		}
+		access_ret = faccessat2(fd, "", X_OK, AT_EMPTY_PATH | AT_INTERPRETED);
+		access_errno = errno;
+		EXPECT_EQ(0, close(fd));
+		if (err_access) {
+			ASSERT_EQ(err_access, access_errno) {
+				TH_LOG("Wrong error for faccessat2 w/o path %s: %s",
+						path, strerror(access_errno));
+			}
+			ASSERT_EQ(-1, access_ret);
+		} else {
+			ASSERT_EQ(0, access_ret) {
+				TH_LOG("Access denied for %s: %s", path, strerror(access_errno));
+			}
+		}
+	}
+
+	access_ret = faccessat2(AT_FDCWD, path, X_OK, AT_SYMLINK_NOFOLLOW | AT_INTERPRETED);
+	if (err_access) {
+		ASSERT_EQ(err_access, errno) {
+			TH_LOG("Wrong error for faccessat2 w/ path %s: %s", path, strerror(errno));
+		}
+		ASSERT_EQ(-1, access_ret);
+	} else {
+		ASSERT_EQ(0, access_ret) {
+			TH_LOG("Access denied for %s: %s", path, strerror(errno));
+		}
+	}
+
+	/* Tests read access. */
+	access_ret = faccessat2(AT_FDCWD, path, R_OK, AT_SYMLINK_NOFOLLOW | AT_INTERPRETED);
+	ASSERT_EQ(-1, access_ret);
+	ASSERT_EQ(EINVAL, errno);
+
+	/* Tests write access. */
+	access_ret = faccessat2(AT_FDCWD, path, W_OK, AT_SYMLINK_NOFOLLOW | AT_INTERPRETED);
+	ASSERT_EQ(-1, access_ret);
+	ASSERT_EQ(EINVAL, errno);
+}
+
+static void test_policy_fd(struct __test_metadata *_metadata, const int fd,
+		const bool has_policy)
+{
+	const int ret = faccessat2(fd, "", X_OK, AT_EMPTY_PATH | AT_INTERPRETED);
+
+	if (has_policy) {
+		ASSERT_EQ(-1, ret);
+		ASSERT_EQ(EACCES, errno) {
+			TH_LOG("Wrong error for faccessat2 with an FD: %s", strerror(errno));
+		}
+	} else {
+		ASSERT_EQ(0, ret) {
+			TH_LOG("Access denied for an FD: %s", strerror(errno));
+		}
+	}
+}
+
+FIXTURE(access) {
+	char initial_sysctl_value;
+	int memfd, pipefd;
+	int pipe_fds[2];
+};
+
+static void test_file_types(struct __test_metadata *_metadata, FIXTURE_DATA(access) *self,
+		const int err_code, const bool has_policy)
+{
+	/* Tests are performed on a tmpfs mount point. */
+	test_omx(_metadata, reg_file_path, 0, err_code);
+	test_omx(_metadata, dir_path, 0, has_policy ? EACCES : 0);
+	test_omx(_metadata, symlink_path, ELOOP, has_policy ? EACCES : 0);
+	test_omx(_metadata, block_dev_path, 0, has_policy ? EACCES : 0);
+	test_omx(_metadata, char_dev_path, 0, has_policy ? EACCES : 0);
+	test_omx(_metadata, fifo_path, 0, has_policy ? EACCES : 0);
+	test_omx(_metadata, sock_path, ENXIO, has_policy ? EACCES : 0);
+
+	test_omx(_metadata, "/proc/self/ns/mnt", ELOOP, has_policy ? EACCES : 0);
+
+	/* Checks that exec is denied for any memfd. */
+	test_policy_fd(_metadata, self->memfd, has_policy);
+
+	/* Checks that exec is denied for any pipefs fd. */
+	test_policy_fd(_metadata, self->pipefd, has_policy);
+}
+
+static void test_files(struct __test_metadata *_metadata, FIXTURE_DATA(access) *self,
+		const int err_code, const bool has_policy)
+{
+	/* Tests as root. */
+	ignore_dac(_metadata, 1);
+	test_file_types(_metadata, self, err_code, has_policy);
+
+	/* Tests without bypass. */
+	ignore_dac(_metadata, 0);
+	test_file_types(_metadata, self, err_code, has_policy);
+}
+
+static void sysctl_write_char(struct __test_metadata *_metadata, const char value)
+{
+	int fd;
+
+	fd = open(sysctl_path, O_WRONLY | O_CLOEXEC);
+	ASSERT_LE(0, fd);
+	ASSERT_EQ(1, write(fd, &value, 1));
+	EXPECT_EQ(0, close(fd));
+}
+
+static char sysctl_read_char(struct __test_metadata *_metadata)
+{
+	int fd;
+	char sysctl_value;
+
+	fd = open(sysctl_path, O_RDONLY | O_CLOEXEC);
+	ASSERT_LE(0, fd);
+	ASSERT_EQ(1, read(fd, &sysctl_value, 1));
+	EXPECT_EQ(0, close(fd));
+	return sysctl_value;
+}
+
+FIXTURE_VARIANT(access) {
+	const bool mount_exec;
+	const bool file_exec;
+	const int sysctl_err_code[3];
+};
+
+FIXTURE_VARIANT_ADD(access, mount_exec_file_exec) {
+	.mount_exec = true,
+	.file_exec = true,
+	.sysctl_err_code = {0, 0, 0},
+};
+
+FIXTURE_VARIANT_ADD(access, mount_exec_file_noexec)
+{
+	.mount_exec = true,
+	.file_exec = false,
+	.sysctl_err_code = {0, EACCES, EACCES},
+};
+
+FIXTURE_VARIANT_ADD(access, mount_noexec_file_exec)
+{
+	.mount_exec = false,
+	.file_exec = true,
+	.sysctl_err_code = {EACCES, 0, EACCES},
+};
+
+FIXTURE_VARIANT_ADD(access, mount_noexec_file_noexec)
+{
+	.mount_exec = false,
+	.file_exec = false,
+	.sysctl_err_code = {EACCES, EACCES, EACCES},
+};
+
+FIXTURE_SETUP(access)
+{
+	int procfd_path_size;
+	static const char path_template[] = "/proc/self/fd/%d";
+	char procfd_path[sizeof(path_template) + 10];
+
+	/*
+	 * Cleans previous workspace if any error previously happened (don't
+	 * check errors).
+	 */
+	umount(workdir_path);
+	rmdir(workdir_path);
+
+	/* Creates a clean mount point. */
+	ASSERT_EQ(0, mkdir(workdir_path, 00700));
+	ASSERT_EQ(0, mount("test", workdir_path, "tmpfs", MS_MGC_VAL |
+				(variant->mount_exec ? 0 : MS_NOEXEC),
+				"mode=0700,size=4k"));
+
+	/* Creates a regular file. */
+	ASSERT_EQ(0, mknod(reg_file_path, S_IFREG | (variant->file_exec ? 0500 : 0400), 0));
+	/* Creates a directory. */
+	ASSERT_EQ(0, mkdir(dir_path, variant->file_exec ? 0500 : 0400));
+	/* Creates a symlink pointing to the regular file. */
+	ASSERT_EQ(0, symlink("regular_file", symlink_path));
+	/* Creates a character device: /dev/null. */
+	ASSERT_EQ(0, mknod(char_dev_path, S_IFCHR | 0400, makedev(1, 3)));
+	/* Creates a block device: /dev/loop0 */
+	ASSERT_EQ(0, mknod(block_dev_path, S_IFBLK | 0400, makedev(7, 0)));
+	/* Creates a fifo. */
+	ASSERT_EQ(0, mknod(fifo_path, S_IFIFO | 0400, 0));
+	/* Creates a socket. */
+	ASSERT_EQ(0, mknod(sock_path, S_IFSOCK | 0400, 0));
+
+	/* Creates a regular file without user mount point. */
+	self->memfd = memfd_create("test-interpreted", MFD_CLOEXEC);
+	ASSERT_LE(0, self->memfd);
+	/* Sets mode, which must be ignored by the exec check. */
+	ASSERT_EQ(0, fchmod(self->memfd, variant->file_exec ? 0500 : 0400));
+
+	/* Creates a pipefs file descriptor. */
+	ASSERT_EQ(0, pipe(self->pipe_fds));
+	procfd_path_size = snprintf(procfd_path, sizeof(procfd_path),
+			path_template, self->pipe_fds[0]);
+	ASSERT_LT(procfd_path_size, sizeof(procfd_path));
+	self->pipefd = open(procfd_path, O_RDONLY | O_CLOEXEC);
+	ASSERT_LE(0, self->pipefd);
+	ASSERT_EQ(0, fchmod(self->pipefd, variant->file_exec ? 0500 : 0400));
+
+	/* Saves initial sysctl value. */
+	self->initial_sysctl_value = sysctl_read_char(_metadata);
+
+	/* Prepares for sysctl writes. */
+	ignore_sys_admin(_metadata, 1);
+}
+
+FIXTURE_TEARDOWN(access)
+{
+	EXPECT_EQ(0, close(self->memfd));
+	EXPECT_EQ(0, close(self->pipefd));
+	EXPECT_EQ(0, close(self->pipe_fds[0]));
+	EXPECT_EQ(0, close(self->pipe_fds[1]));
+
+	/* Restores initial sysctl value. */
+	sysctl_write_char(_metadata, self->initial_sysctl_value);
+
+	/* There is no need to unlink the test files. */
+	ASSERT_EQ(0, umount(workdir_path));
+	ASSERT_EQ(0, rmdir(workdir_path));
+}
+
+TEST_F(access, sysctl_0)
+{
+	/* Do not enforce anything. */
+	sysctl_write_char(_metadata, '0');
+	test_files(_metadata, self, 0, false);
+}
+
+TEST_F(access, sysctl_1)
+{
+	/* Enforces mount exec check. */
+	sysctl_write_char(_metadata, '1');
+	test_files(_metadata, self, variant->sysctl_err_code[0], true);
+}
+
+TEST_F(access, sysctl_2)
+{
+	/* Enforces file exec check. */
+	sysctl_write_char(_metadata, '2');
+	test_files(_metadata, self, variant->sysctl_err_code[1], true);
+}
+
+TEST_F(access, sysctl_3)
+{
+	/* Enforces mount and file exec check. */
+	sysctl_write_char(_metadata, '3');
+	test_files(_metadata, self, variant->sysctl_err_code[2], true);
+}
+
+FIXTURE(cleanup) {
+	char initial_sysctl_value;
+};
+
+FIXTURE_SETUP(cleanup)
+{
+	/* Saves initial sysctl value. */
+	self->initial_sysctl_value = sysctl_read_char(_metadata);
+}
+
+FIXTURE_TEARDOWN(cleanup)
+{
+	/* Restores initial sysctl value. */
+	ignore_sys_admin(_metadata, 1);
+	sysctl_write_char(_metadata, self->initial_sysctl_value);
+}
+
+TEST_F(cleanup, sysctl_access_write)
+{
+	int fd;
+	ssize_t ret;
+
+	ignore_sys_admin(_metadata, 1);
+	sysctl_write_char(_metadata, '0');
+
+	ignore_sys_admin(_metadata, 0);
+	fd = open(sysctl_path, O_WRONLY | O_CLOEXEC);
+	ASSERT_LE(0, fd);
+	ret = write(fd, "0", 1);
+	ASSERT_EQ(-1, ret);
+	ASSERT_EQ(EPERM, errno);
+	EXPECT_EQ(0, close(fd));
+}
+
+TEST_HARNESS_MAIN
-- 
2.28.0


^ permalink raw reply related

* [RFC PATCH v8 0/3] Add support for AT_INTERPRETED (was O_MAYEXEC)
From: Mickaël Salaün @ 2020-09-08  7:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andrew Morton, Andy Lutomirski, Christian Brauner,
	Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
	Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
	Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
	Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Miklos Szeredi,
	Mimi Zohar, Philippe Trébuchet, Scott Shell,
	Sean Christopherson, Shuah Khan, Steve Dower, Steve Grubb,
	Tetsuo Handa, Thibaut Sautereau, Vincent Strubel,
	kernel-hardening, linux-api, linux-integrity,
	linux-security-module, linux-fsdevel

Hi,

This height patch series rework the previous O_MAYEXEC series by not
adding a new flag to openat2(2) but to faccessat2(2) instead.  As
suggested, this enables to perform the access check on a file descriptor
instead of on a file path (while opening it).  This may require two
checks (one on open and then with faccessat2) but it is a more generic
approach [8].

The IMA patch is removed for now because the only LSM hook triggered by
faccessat2(2) is inode_permission() which takes a struct inode as
argument.  However, struct path and then struct file are still available
in this syscall, which enables to add a new hook to fit the needs of IMA
and other path-based LSMs.

We also removed the three patches from Kees Cook which are no longer
required for this new implementation.

Goal of AT_INTERPRETED
======================

The goal of this patch series is to enable to control script execution
with interpreters help.  A new AT_INTERPRETED flag, usable through
faccessat2(2), is added to enable userspace script interpreters to
delegate to the kernel (and thus the system security policy) the
permission to interpret/execute scripts or other files containing what
can be seen as commands.

A simple system-wide security policy can be enforced by the system
administrator through a sysctl configuration consistent with the mount
points or the file access rights.  The documentation patch explains the
prerequisites.

Furthermore, the security policy can also be delegated to an LSM, either
a MAC system or an integrity system.  For instance, the new kernel
MAY_INTERPRETED_EXEC flag is required to close a major IMA
measurement/appraisal interpreter integrity gap by bringing the ability
to check the use of scripts [1].  Other uses are expected, such as for
magic-links [2], SGX integration [3], bpffs [4] or IPE [5].

Possible extended usage
=======================

For now, only the X_OK mode is compatible with the AT_INTERPRETED flag.
This enables to restrict the addition of new control flows in a process.
Using R_OK or W_OK with AT_INTERPRETED returns -EINVAL.

Possible future use-cases for R_OK with AT_INTERPRETED may be to check
configuration files that may impact the behavior of applications (i.e.
influence critical part of the current control flow).  Those should then
be trusted as well.  The W_OK with AT_INTERPRETED could be used to check
that a file descriptor is allowed to receive sensitive data such as
debug logs.

Prerequisite of its use
=======================

Userspace needs to adapt to take advantage of this new feature.  For
example, the PEP 578 [6] (Runtime Audit Hooks) enables Python 3.8 to be
extended with policy enforcement points related to code interpretation,
which can be used to align with the PowerShell audit features.
Additional Python security improvements (e.g. a limited interpreter
without -c, stdin piping of code) are on their way [7].

Examples
========

The initial idea comes from CLIP OS 4 and the original implementation
has been used for more than 12 years:
https://github.com/clipos-archive/clipos4_doc
Chrome OS has a similar approach:
https://chromium.googlesource.com/chromiumos/docs/+/master/security/noexec_shell_scripts.md

Userland patches can be found here:
https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC
Actually, there is more than the O_MAYEXEC changes (which matches this search)
e.g., to prevent Python interactive execution. There are patches for
Bash, Wine, Java (Icedtea), Busybox's ash, Perl and Python. There are
also some related patches which do not directly rely on O_MAYEXEC but
which restrict the use of browser plugins and extensions, which may be
seen as scripts too:
https://github.com/clipos-archive/clipos4_portage-overlay/tree/master/www-client

An introduction to O_MAYEXEC was given at the Linux Security Summit
Europe 2018 - Linux Kernel Security Contributions by ANSSI:
https://www.youtube.com/watch?v=chNjCRtPKQY&t=17m15s
The "write xor execute" principle was explained at Kernel Recipes 2018 -
CLIP OS: a defense-in-depth OS:
https://www.youtube.com/watch?v=PjRE0uBtkHU&t=11m14s
See also an overview article: https://lwn.net/Articles/820000/

This patch series can be applied on top of v5.9-rc4 .  This can be tested
with CONFIG_SYSCTL.  I would really appreciate constructive comments on
this patch series.

Previous version:
https://lore.kernel.org/lkml/20200723171227.446711-1-mic@digikod.net/

[1] https://lore.kernel.org/lkml/1544647356.4028.105.camel@linux.ibm.com/
[2] https://lore.kernel.org/lkml/20190904201933.10736-6-cyphar@cyphar.com/
[3] https://lore.kernel.org/lkml/CALCETrVovr8XNZSroey7pHF46O=kj_c5D9K8h=z2T_cNrpvMig@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CALCETrVeZ0eufFXwfhtaG_j+AdvbzEWE0M3wjXMWVEO7pj+xkw@mail.gmail.com/
[5] https://lore.kernel.org/lkml/20200406221439.1469862-12-deven.desai@linux.microsoft.com/
[6] https://www.python.org/dev/peps/pep-0578/
[7] https://lore.kernel.org/lkml/0c70debd-e79e-d514-06c6-4cd1e021fa8b@python.org/
[8] https://lore.kernel.org/lkml/e7c1f99d7cdf706ca0867e5fb76ae4cb38bc83f5.camel@linux.ibm.com/

Regards,

Mickaël Salaün (3):
  fs: Introduce AT_INTERPRETED flag for faccessat2(2)
  fs,doc: Enable to configure exec checks for AT_INTERPRETED
  selftest/interpreter: Add tests for AT_INTERPRETED enforcing

 Documentation/admin-guide/sysctl/fs.rst       |  54 +++
 fs/open.c                                     |  67 ++-
 include/linux/fs.h                            |   3 +
 include/uapi/linux/fcntl.h                    |  12 +-
 kernel/sysctl.c                               |  12 +-
 .../testing/selftests/interpreter/.gitignore  |   2 +
 tools/testing/selftests/interpreter/Makefile  |  18 +
 tools/testing/selftests/interpreter/config    |   1 +
 .../interpreter/interpreted_access_test.c     | 384 ++++++++++++++++++
 9 files changed, 548 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/interpreter/.gitignore
 create mode 100644 tools/testing/selftests/interpreter/Makefile
 create mode 100644 tools/testing/selftests/interpreter/config
 create mode 100644 tools/testing/selftests/interpreter/interpreted_access_test.c

-- 
2.28.0

^ permalink raw reply

* [RFC PATCH v8 2/3] fs,doc: Enable to configure exec checks for AT_INTERPRETED
From: Mickaël Salaün @ 2020-09-08  7:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andrew Morton, Andy Lutomirski, Christian Brauner,
	Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
	Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
	Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
	Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Miklos Szeredi,
	Mimi Zohar, Philippe Trébuchet, Scott Shell,
	Sean Christopherson, Shuah Khan, Steve Dower, Steve Grubb,
	Tetsuo Handa, Thibaut Sautereau, Vincent Strubel,
	kernel-hardening, linux-api, linux-integrity,
	linux-security-module, linux-fsdevel, Mickaël Salaün,
	Thibaut Sautereau
In-Reply-To: <20200908075956.1069018-1-mic@digikod.net>

From: Mickaël Salaün <mic@linux.microsoft.com>

This enables to configure a policy for executable scripts which can be
queried with faccessat2(2) and the AT_INTERPRETED flag.  This may allow
script interpreters to check execution permission before reading
commands from a file, or dynamic linkers to allow shared object loading.
This may be seen as a way for a trusted task (e.g. interpreter) to check
the trustworthiness of files (e.g. scripts) before extending its control
flow graph with new ones originating from these files.

Add a new sysctl fs.interpreted_access to enable system administrators
to enforce two complementary security policies according to the
installed system: enforce the noexec mount option, and enforce
executable file permission.  Indeed, because of compatibility with
installed systems, only system administrators are able to check that
this new enforcement is in line with the system mount points and file
permissions.

Being able to restrict execution also enables to protect the kernel by
restricting arbitrary syscalls that an attacker could perform with a
crafted binary or certain script languages.  It also improves multilevel
isolation by reducing the ability of an attacker to use side channels
with specific code.  These restrictions can natively be enforced for ELF
binaries (with the noexec mount option) but require this kernel
extension to properly handle scripts (e.g. Python, Perl).  To get a
consistent execution policy, additional memory restrictions should also
be enforced (e.g. thanks to SELinux).

Because the AT_INTERPRETED flag combined with X_OK mode is a mean to
enforce a system-wide security policy (but not application-centric
policies), it does not make sense for user space to check the sysctl
value.  Indeed, this new flag only enables to extend the system ability
to enforce a policy thanks to (some trusted) user space collaboration.
Moreover, additional security policies could be managed by LSMs.  This
is a best-effort approach from the application developer point of view:
https://lore.kernel.org/lkml/1477d3d7-4b36-afad-7077-a38f42322238@digikod.net/

Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
Reviewed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
---

Changes since v7:
* Handle special file descriptors.
* Add a compatibility mode for execute/read check.
* Move the sysctl policy from fs/namei.c to fs/open.c for the new
  faccessat2/AT_INTERPRETED.
* Rename the sysctl from fs.open_mayexec_enforce to
  fs.interpreted_access .
* Update documentation accordingly.

Changes since v6:
* Allow opening pipes, block devices and character devices with
  O_MAYEXEC when there is no enforced policy, but forbid any non-regular
  file opened with O_MAYEXEC otherwise (i.e. for any enforced policy).
* Add a paragraph about the non-regular files policy.
* Move path_noexec() calls out of the fast-path (suggested by Kees
  Cook).

Changes since v5:
* Remove the static enforcement configuration through Kconfig because it
  makes the code more simple like this, and because the current sysctl
  configuration can only be set with CAP_SYS_ADMIN, the same way mount
  options (i.e. noexec) can be set.  If an harden distro wants to
  enforce a configuration, it should restrict capabilities or sysctl
  configuration.  Furthermore, an LSM can easily leverage O_MAYEXEC to
  fit its need.
* Move checks from inode_permission() to may_open() and make the error
  codes more consistent according to file types (in line with a previous
  commit): opening a directory with O_MAYEXEC returns EISDIR and other
  non-regular file types may return EACCES.
* In may_open(), when OMAYEXEC_ENFORCE_FILE is set, replace explicit
  call to generic_permission() with an artificial MAY_EXEC to avoid
  double calls.  This makes sense especially when an LSM policy forbids
  execution of a file.
* Replace the custom proc_omayexec() with
  proc_dointvec_minmax_sysadmin(), and then replace the CAP_MAC_ADMIN
  check with a CAP_SYS_ADMIN one (suggested by Kees Cook and Stephen
  Smalley).
* Use BIT() (suggested by Kees Cook).
* Rename variables (suggested by Kees Cook).
* Reword the kconfig help.
* Import the documentation patch (suggested by Kees Cook):
  https://lore.kernel.org/lkml/20200505153156.925111-6-mic@digikod.net/
* Update documentation and add LWN.net article.

Changes since v4:
* Add kernel configuration options to enforce O_MAYEXEC at build time,
  and disable the sysctl in such case (requested by James Morris).
* Reword commit message.

Changes since v3:
* Update comment with O_MAYEXEC.

Changes since v2:
* Cosmetic changes.

Changes since v1:
* Move code from Yama to the FS subsystem (suggested by Kees Cook).
* Make omayexec_inode_permission() static (suggested by Jann Horn).
* Use mode 0600 for the sysctl.
* Only match regular files (not directories nor other types), which
  follows the same semantic as commit 73601ea5b7b1 ("fs/open.c: allow
  opening only regular files during execve()").
---
 Documentation/admin-guide/sysctl/fs.rst | 54 +++++++++++++++++++++++++
 fs/open.c                               | 38 ++++++++++++++++-
 include/linux/fs.h                      |  1 +
 kernel/sysctl.c                         | 12 +++++-
 4 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index f48277a0a850..66d1c1bd67a5 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -36,6 +36,7 @@ Currently, these files are in /proc/sys/fs:
 - inode-max
 - inode-nr
 - inode-state
+- interpreted_access
 - nr_open
 - overflowuid
 - overflowgid
@@ -165,6 +166,59 @@ system needs to prune the inode list instead of allocating
 more.

+interpreted_access
+------------------
+
+The ``AT_INTERPRETED`` flag with an ``X_OK`` mode can be passed to
+:manpage:`faccessat2(2)` by an interpreter to check that regular files are
+expected to be executable.  If the file is not identified as executable, then
+the syscall returns -EACCES.  This may allow a script interpreter to check
+executable permission before reading commands from a file, or a dynamic linker
+to only load executable shared objects.  One interesting use case is to enforce
+a "write xor execute" policy through interpreters.
+
+To avoid race-conditions, it is highly recommended to first open the file and
+then do the check on the new file descriptor thanks to the ``AT_EMPTY_PATH``
+flag.
+
+The ability to restrict code execution must be thought as a system-wide policy,
+which first starts by restricting mount points with the ``noexec`` option.
+This option is also automatically applied to special filesystems such as /proc .
+This prevents files on such mount points to be directly executed by the kernel
+or mapped as executable memory (e.g. libraries).  With script interpreters
+using :manpage:`faccessat2(2)` and ``AT_INTERPRETED``, the executable
+permission can then be checked before reading commands from files.  This makes
+it possible to enforce the ``noexec`` at the interpreter level, and thus
+propagates this security policy to scripts.  To be fully effective, these
+interpreters also need to handle the other ways to execute code: command line
+parameters (e.g., option ``-e`` for Perl), module loading (e.g., option ``-m``
+for Python), stdin, file sourcing, environment variables, configuration files,
+etc.  According to the threat model, it may be acceptable to allow some script
+interpreters (e.g. Bash) to interpret commands from stdin, may it be a TTY or a
+pipe, because it may not be enough to (directly) perform syscalls.
+
+There are two complementary security policies: enforce the ``noexec`` mount
+option, and enforce executable file permission.  These policies are handled by
+the ``fs.interpreted_access`` sysctl (writable only with ``CAP_SYS_ADMIN``)
+as a bitmask:
+
+1 - Mount restriction: checks that the mount options for the underlying VFS
+    mount do not prevent execution.
+
+2 - File permission restriction: checks that the file is marked as
+    executable for the current process (e.g., POSIX permissions, ACLs).
+
+Note that as long as a policy is enforced, checking any non-regular file with
+``AT_INTERPRETED`` returns -EINVAL (e.g. TTYs, pipe), even when such a file is
+marked as executable or is on an executable mount point.
+
+Code samples can be found in
+tools/testing/selftests/interpreter/interpreted_access_test.c and interpreter
+patches (for the original O_MAYEXEC) are available at
+https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC .
+See also an overview article: https://lwn.net/Articles/820000/ .
+
+
 overflowgid & overflowuid
 -------------------------

diff --git a/fs/open.c b/fs/open.c
index 879bdfbdc6fa..ef01ab35449d 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -32,6 +32,7 @@
 #include <linux/ima.h>
 #include <linux/dnotify.h>
 #include <linux/compat.h>
+#include <linux/sysctl.h>

 #include "internal.h"

@@ -394,6 +395,11 @@ static const struct cred *access_override_creds(void)
 	return old_cred;
 }

+#define INTERPRETED_EXEC_MOUNT		BIT(0)
+#define INTERPRETED_EXEC_FILE		BIT(1)
+
+int sysctl_interpreted_access __read_mostly;
+
 static long do_faccessat(int dfd, const char __user *filename, int mode, int flags)
 {
 	struct path path;
@@ -443,13 +449,43 @@ static long do_faccessat(int dfd, const char __user *filename, int mode, int fla
 		 */
 		if ((mode & MAY_EXEC)) {
 			mode |= MAY_INTERPRETED_EXEC;
+			res = -EACCES;
+			/*
+			 * If there is a system-wide execute policy enforced,
+			 * then forbids access to non-regular files and special
+			 * superblocks.
+			 */
+			if ((sysctl_interpreted_access & (INTERPRETED_EXEC_MOUNT |
+							INTERPRETED_EXEC_FILE))) {
+				if (!S_ISREG(inode->i_mode))
+					goto out_path_release;
+				/*
+				 * Denies access to pseudo filesystems that
+				 * will never be mountable (e.g. sockfs,
+				 * pipefs) but can still be reachable through
+				 * /proc/self/fd, or memfd-like file
+				 * descriptors, or nsfs-like files.
+				 *
+				 * According to the tests, SB_NOEXEC seems to
+				 * be only used by proc and nsfs filesystems.
+				 * Is it correct?
+				 */
+				if ((path.dentry->d_sb->s_flags &
+							(SB_NOUSER | SB_KERNMOUNT | SB_NOEXEC)))
+					goto out_path_release;
+			}
+
+			if ((sysctl_interpreted_access & INTERPRETED_EXEC_MOUNT) &&
+					path_noexec(&path))
+				goto out_path_release;
 			/*
 			 * For compatibility reasons, if the system-wide policy
 			 * doesn't enforce file permission checks, then
 			 * replaces the execute permission request with a read
 			 * permission request.
 			 */
-			mode &= ~MAY_EXEC;
+			if (!(sysctl_interpreted_access & INTERPRETED_EXEC_FILE))
+				mode &= ~MAY_EXEC;
 			/* To be executed *by* user space, files must be readable. */
 			mode |= MAY_READ;
 		}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 03f1b2da6a87..ef39550f2464 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -83,6 +83,7 @@ extern int sysctl_protected_symlinks;
 extern int sysctl_protected_hardlinks;
 extern int sysctl_protected_fifos;
 extern int sysctl_protected_regular;
+extern int sysctl_interpreted_access;

 typedef __kernel_rwf_t rwf_t;

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 09e70ee2332e..899fa52b4ee8 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -113,6 +113,7 @@ static int sixty = 60;

 static int __maybe_unused neg_one = -1;
 static int __maybe_unused two = 2;
+static int __maybe_unused three = 3;
 static int __maybe_unused four = 4;
 static unsigned long zero_ul;
 static unsigned long one_ul = 1;
@@ -887,7 +888,6 @@ static int proc_taint(struct ctl_table *table, int write,
 	return err;
 }

-#ifdef CONFIG_PRINTK
 static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
 				void *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -896,7 +896,6 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,

 	return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
 }
-#endif

 /**
  * struct do_proc_dointvec_minmax_conv_param - proc_dointvec_minmax() range checking structure
@@ -3293,6 +3292,15 @@ static struct ctl_table fs_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= &two,
 	},
+	{
+		.procname       = "interpreted_access",
+		.data           = &sysctl_interpreted_access,
+		.maxlen         = sizeof(int),
+		.mode           = 0600,
+		.proc_handler	= proc_dointvec_minmax_sysadmin,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &three,
+	},
 #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
 	{
 		.procname	= "binfmt_misc",
-- 
2.28.0

^ permalink raw reply related

* Re: [PATCH] SELinux: Measure state and hash of policy using IMA
From: Lakshmi Ramasubramanian @ 2020-09-08  4:44 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Mimi Zohar, Paul Moore, Ondrej Mosnacek, Casey Schaufler,
	Tyler Hicks, tusharsu, Sasha Levin, James Morris, linux-integrity,
	SElinux list, LSM List, linux-kernel
In-Reply-To: <CAEjxPJ5C64AmmVKuuPmtbfnY06w49ziryRAnARurWxpQumzfow@mail.gmail.com>

On 9/7/20 3:32 PM, Stephen Smalley wrote:

>> Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
>> Suggested-by: Stephen Smalley <stephen.smalley.work@gmail.com>
>> Reported-by: kernel test robot <lkp@intel.com> # error: implicit declaration of function 'vfree'
>> Reported-by: kernel test robot <lkp@intel.com> # error: implicit declaration of function 'crypto_alloc_shash'
>> Reported-by: kernel test robot <lkp@intel.com> # sparse: symbol 'security_read_selinux_policy' was not declared. Should it be static?
> 
> Not sure these Reported-by lines are useful since they were just on
> submitted versions of the patch not on an actual merged commit.

I'll remove them when I update the patch.

> 
>> diff --git a/security/selinux/measure.c b/security/selinux/measure.c
>> new file mode 100644
>> index 000000000000..caf9107937d9
>> --- /dev/null
>> +++ b/security/selinux/measure.c
> <snip>
>> +void selinux_measure_state(struct selinux_state *state, bool policy_mutex_held)
>> +{
> <snip>
>> +
>> +       if (!policy_mutex_held)
>> +               mutex_lock(&state->policy_mutex);
>> +
>> +       rc = security_read_policy_kernel(state, &policy, &policy_len);
>> +
>> +       if (!policy_mutex_held)
>> +               mutex_unlock(&state->policy_mutex);
> 
> This kind of conditional taking of a mutex is generally frowned upon
> in my experience.
> You should likely just always take the mutex in the callers of
> selinux_measure_state() instead.
> In some cases, it may be the caller of the caller.  Arguably selinuxfs
> could be taking it around all state modifying operations (e.g.
> enforce, checkreqprot) not just policy modifying ones although it
> isn't strictly for that purpose.

Since currently policy_mutex is not used to synchronize access to state 
variables (enforce, checkreqprot, etc.) I am wondering if 
selinux_measure_state() should measure only state if policy_mutex is not 
held by the caller - similar to how we skip measuring policy if 
initialization is not yet completed.

	/*
	 * Measure SELinux policy only after initialization is
          * completed.
	 */
	if (!initialized)
		goto out;

  -lakshmi



^ permalink raw reply

* Re: [PATCH v20 05/23] net: Prepare UDS for security module stacking
From: Stephen Smalley @ 2020-09-08  1:28 UTC (permalink / raw)
  To: John Johansen
  Cc: Casey Schaufler, Paul Moore, Casey Schaufler, James Morris,
	LSM List, SElinux list, linux-audit, Kees Cook, Tetsuo Handa,
	Stephen Smalley
In-Reply-To: <9a58d14c-eaff-3acf-4689-925cf08ba406@canonical.com>

On Sat, Sep 5, 2020 at 3:07 PM John Johansen
<john.johansen@canonical.com> wrote:
>
> On 9/5/20 11:13 AM, Casey Schaufler wrote:
> > On 9/5/2020 6:25 AM, Paul Moore wrote:
> >> On Fri, Sep 4, 2020 at 7:58 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>> On 9/4/2020 2:53 PM, Paul Moore wrote:
> >>>> On Fri, Sep 4, 2020 at 5:35 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>>>> On 9/4/2020 1:08 PM, Paul Moore wrote:
> >> ...
> >>
> >>>> I understand the concerns you mention, they are all valid as far as
> >>>> I'm concerned, but I think we are going to get burned by this code as
> >>>> it currently stands.
> >>> Yes, I can see that. We're getting burned by the non-extensibility
> >>> of secids. It will take someone smarter than me to figure out how to
> >>> fit N secids into 32bits without danger of either failure or memory
> >>> allocation.
> >> Sooo what are the next steps here?  It sounds like there is some
> >> agreement that the currently proposed unix_skb_params approach is a
> >> problem, but it also sounds like you just want to merge it anyway?
> >
> > There are real problems with all the approaches. This is by far the
> > least invasive of the lot. If this is acceptable for now I will commit
> > to including the dynamic allocation version in the full stacking
> > (e.g. Smack + SELinux) stage. If it isn't, well, this stage is going
> > to take even longer than it already has. Sigh.
> >
> >
> >> I was sorta hoping for something a bit better.
> >
> > I will be looking at alternatives. I am very much open to suggestions.
> > I'm not even 100% convinced that Stephen's objections to my separate
> > allocation strategy outweigh its advantages. If you have an opinion on
> > that, I'd love to hear it.
> >
>
> fwiw I prefer the separate allocation strategy, but as you have already
> said it trading off one set of problems for another. I would rather see
> this move forward and one set of trade offs isn't significantly worse
> than the other to me so, either wfm.

I remain unclear that AppArmor needs this patch at all even when
support for SO_PEERSEC lands.
Contrary to the patch description, it is about supporting SCM_SECURITY
for datagram not SO_PEERSEC.  And I don't know of any actual users of
SCM_SECURITY even for SELinux, just SO_PEERSEC.

^ permalink raw reply

* Re: [PATCH] SELinux: Measure state and hash of policy using IMA
From: Stephen Smalley @ 2020-09-07 22:32 UTC (permalink / raw)
  To: Lakshmi Ramasubramanian
  Cc: Mimi Zohar, Paul Moore, Ondrej Mosnacek, Casey Schaufler,
	Tyler Hicks, tusharsu, Sasha Levin, James Morris, linux-integrity,
	SElinux list, LSM List, linux-kernel
In-Reply-To: <20200907213855.3572-1-nramas@linux.microsoft.com>

On Mon, Sep 7, 2020 at 5:39 PM Lakshmi Ramasubramanian
<nramas@linux.microsoft.com> wrote:
>
> Critical data structures of security modules are currently not measured.
> Therefore an attestation service, for instance, would not be able to
> attest whether the security modules are always operating with the policies
> and configuration that the system administrator had setup. The policies
> and configuration for the security modules could be tampered with by
> rogue user mode agents or modified through some inadvertent actions on
> the system. Measuring such critical data would enable an attestation
> service to reliably assess the security configuration of the system.
>
> SELinux configuration and policy are some of the critical data for this
> security module that needs to be measured. This measurement can be used
> by an attestation service, for instance, to verify if the configuration
> and policies have been setup correctly and that they haven't been tampered
> with at runtime.
>
> Measure SELinux configuration, policy capabilities settings, and the hash
> of the loaded policy by calling the IMA hook ima_measure_critical_data().
> Since the size of the loaded policy can be quite large, hash of the policy
> is measured instead of the entire policy to avoid bloating the IMA log.
>
> Enable early boot measurement for SELinux in IMA since SELinux
> initializes its state and policy before custom IMA policy is loaded.
>
> Sample measurement of SELinux state and hash of the policy:
>
> 10 e32e...5ac3 ima-buf sha256:86e8...4594 selinux-state-1595389364:287899386 696e697469616c697a65643d313b656e61626c65643d313b656e666f7263696e673d303b636865636b72657170726f743d313b6e6574776f726b5f706565725f636f6e74726f6c733d313b6f70656e5f7065726d733d313b657874656e6465645f736f636b65745f636c6173733d313b616c776179735f636865636b5f6e6574776f726b3d303b6367726f75705f7365636c6162656c3d313b6e6e705f6e6f737569645f7472616e736974696f6e3d313b67656e66735f7365636c6162656c5f73796d6c696e6b733d303
> 10 9e81...0857 ima-buf sha256:4941...68fc selinux-policy-hash-1597335667:462051628 8d1d...1834
>
> To verify the measurement check the following:
>
> Execute the following command to extract the measured data
> from the IMA log for SELinux configuration (selinux-state).
>
>   grep -m 1 "selinux-state" /sys/kernel/security/integrity/ima/ascii_runtime_measurements | cut -d' ' -f 6 | xxd -r -p
>
> The output should be the list of key-value pairs. For example,
>  initialized=1;enabled=1;enforcing=0;checkreqprot=1;network_peer_controls=1;open_perms=1;extended_socket_class=1;always_check_network=0;cgroup_seclabel=1;nnp_nosuid_transition=1;genfs_seclabel_symlinks=0;
>
> To verify the measured data with the current SELinux state:
>
>  => enabled should be set to 1 if /sys/fs/selinux folder exists,
>     0 otherwise
>
> For other entries, compare the integer value in the files
>  => /sys/fs/selinux/enforce
>  => /sys/fs/selinux/checkreqprot
> And, each of the policy capabilities files under
>  => /sys/fs/selinux/policy_capabilities
>
> For selinux-policy-hash, the hash of SELinux policy is included
> in the IMA log entry.
>
> To verify the measured data with the current SELinux policy run
> the following commands and verify the output hash values match.
>
>   sha256sum /sys/fs/selinux/policy | cut -d' ' -f 1
>
>   grep -m 1 "selinux-policy-hash" /sys/kernel/security/integrity/ima/ascii_runtime_measurements | cut -d' ' -f 6
>
> This patch is based on commit 66ccd2560aff ("selinux: simplify away security_policydb_len()")
> in "next" branch in https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git
>
> This patch is dependent on the following patch series and must be
> applied in the given order:
>         https://patchwork.kernel.org/patch/11709527/
>         https://patchwork.kernel.org/patch/11730193/
>         https://patchwork.kernel.org/patch/11730757/
>
> Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
> Suggested-by: Stephen Smalley <stephen.smalley.work@gmail.com>
> Reported-by: kernel test robot <lkp@intel.com> # error: implicit declaration of function 'vfree'
> Reported-by: kernel test robot <lkp@intel.com> # error: implicit declaration of function 'crypto_alloc_shash'
> Reported-by: kernel test robot <lkp@intel.com> # sparse: symbol 'security_read_selinux_policy' was not declared. Should it be static?

Not sure these Reported-by lines are useful since they were just on
submitted versions of the patch not on an actual merged commit.

> diff --git a/security/selinux/measure.c b/security/selinux/measure.c
> new file mode 100644
> index 000000000000..caf9107937d9
> --- /dev/null
> +++ b/security/selinux/measure.c
<snip>
> +void selinux_measure_state(struct selinux_state *state, bool policy_mutex_held)
> +{
<snip>
> +
> +       if (!policy_mutex_held)
> +               mutex_lock(&state->policy_mutex);
> +
> +       rc = security_read_policy_kernel(state, &policy, &policy_len);
> +
> +       if (!policy_mutex_held)
> +               mutex_unlock(&state->policy_mutex);

This kind of conditional taking of a mutex is generally frowned upon
in my experience.
You should likely just always take the mutex in the callers of
selinux_measure_state() instead.
In some cases, it may be the caller of the caller.  Arguably selinuxfs
could be taking it around all state modifying operations (e.g.
enforce, checkreqprot) not just policy modifying ones although it
isn't strictly for that purpose.

^ permalink raw reply

* [PATCH] SELinux: Measure state and hash of policy using IMA
From: Lakshmi Ramasubramanian @ 2020-09-07 21:38 UTC (permalink / raw)
  To: zohar, stephen.smalley.work, paul, omosnace, casey
  Cc: tyhicks, tusharsu, sashal, jmorris, linux-integrity, selinux,
	linux-security-module, linux-kernel

Critical data structures of security modules are currently not measured.
Therefore an attestation service, for instance, would not be able to
attest whether the security modules are always operating with the policies
and configuration that the system administrator had setup. The policies
and configuration for the security modules could be tampered with by
rogue user mode agents or modified through some inadvertent actions on
the system. Measuring such critical data would enable an attestation
service to reliably assess the security configuration of the system.

SELinux configuration and policy are some of the critical data for this
security module that needs to be measured. This measurement can be used
by an attestation service, for instance, to verify if the configuration
and policies have been setup correctly and that they haven't been tampered
with at runtime.

Measure SELinux configuration, policy capabilities settings, and the hash
of the loaded policy by calling the IMA hook ima_measure_critical_data().
Since the size of the loaded policy can be quite large, hash of the policy
is measured instead of the entire policy to avoid bloating the IMA log.

Enable early boot measurement for SELinux in IMA since SELinux
initializes its state and policy before custom IMA policy is loaded.

Sample measurement of SELinux state and hash of the policy:

10 e32e...5ac3 ima-buf sha256:86e8...4594 selinux-state-1595389364:287899386 696e697469616c697a65643d313b656e61626c65643d313b656e666f7263696e673d303b636865636b72657170726f743d313b6e6574776f726b5f706565725f636f6e74726f6c733d313b6f70656e5f7065726d733d313b657874656e6465645f736f636b65745f636c6173733d313b616c776179735f636865636b5f6e6574776f726b3d303b6367726f75705f7365636c6162656c3d313b6e6e705f6e6f737569645f7472616e736974696f6e3d313b67656e66735f7365636c6162656c5f73796d6c696e6b733d303
10 9e81...0857 ima-buf sha256:4941...68fc selinux-policy-hash-1597335667:462051628 8d1d...1834

To verify the measurement check the following:

Execute the following command to extract the measured data
from the IMA log for SELinux configuration (selinux-state).

  grep -m 1 "selinux-state" /sys/kernel/security/integrity/ima/ascii_runtime_measurements | cut -d' ' -f 6 | xxd -r -p

The output should be the list of key-value pairs. For example,
 initialized=1;enabled=1;enforcing=0;checkreqprot=1;network_peer_controls=1;open_perms=1;extended_socket_class=1;always_check_network=0;cgroup_seclabel=1;nnp_nosuid_transition=1;genfs_seclabel_symlinks=0;

To verify the measured data with the current SELinux state:

 => enabled should be set to 1 if /sys/fs/selinux folder exists,
    0 otherwise

For other entries, compare the integer value in the files
 => /sys/fs/selinux/enforce
 => /sys/fs/selinux/checkreqprot
And, each of the policy capabilities files under
 => /sys/fs/selinux/policy_capabilities

For selinux-policy-hash, the hash of SELinux policy is included
in the IMA log entry.

To verify the measured data with the current SELinux policy run
the following commands and verify the output hash values match.

  sha256sum /sys/fs/selinux/policy | cut -d' ' -f 1

  grep -m 1 "selinux-policy-hash" /sys/kernel/security/integrity/ima/ascii_runtime_measurements | cut -d' ' -f 6

This patch is based on commit 66ccd2560aff ("selinux: simplify away security_policydb_len()")
in "next" branch in https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git

This patch is dependent on the following patch series and must be
applied in the given order:
	https://patchwork.kernel.org/patch/11709527/
	https://patchwork.kernel.org/patch/11730193/
	https://patchwork.kernel.org/patch/11730757/

Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
Suggested-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Reported-by: kernel test robot <lkp@intel.com> # error: implicit declaration of function 'vfree'
Reported-by: kernel test robot <lkp@intel.com> # error: implicit declaration of function 'crypto_alloc_shash'
Reported-by: kernel test robot <lkp@intel.com> # sparse: symbol 'security_read_selinux_policy' was not declared. Should it be static?
---
 security/integrity/ima/Kconfig      |   3 +-
 security/integrity/ima/ima.h        |   1 +
 security/selinux/Makefile           |   2 +
 security/selinux/hooks.c            |   1 +
 security/selinux/include/security.h |  18 +++-
 security/selinux/measure.c          | 156 ++++++++++++++++++++++++++++
 security/selinux/selinuxfs.c        |   3 +
 security/selinux/ss/services.c      |  70 +++++++++++--
 8 files changed, 242 insertions(+), 12 deletions(-)
 create mode 100644 security/selinux/measure.c

diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
index 953314d145bb..9bf0f65d720b 100644
--- a/security/integrity/ima/Kconfig
+++ b/security/integrity/ima/Kconfig
@@ -324,8 +324,7 @@ config IMA_MEASURE_ASYMMETRIC_KEYS
 
 config IMA_QUEUE_EARLY_BOOT_DATA
 	bool
-	depends on IMA_MEASURE_ASYMMETRIC_KEYS
-	depends on SYSTEM_TRUSTED_KEYRING
+	depends on (IMA_MEASURE_ASYMMETRIC_KEYS && SYSTEM_TRUSTED_KEYRING) || SECURITY_SELINUX
 	default y
 
 config IMA_SECURE_AND_OR_TRUSTED_BOOT
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 422fe833037d..710648eeb21b 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -230,6 +230,7 @@ struct modsig;
 
 #define __ima_supported_kernel_data_sources(source)	\
 	source(MIN_SOURCE, min_source)			\
+	source(SELINUX, selinux)			\
 	source(MAX_SOURCE, max_source)
 
 #define __ima_enum_stringify(ENUM, str) (#str),
diff --git a/security/selinux/Makefile b/security/selinux/Makefile
index 4d8e0e8adf0b..83d512116341 100644
--- a/security/selinux/Makefile
+++ b/security/selinux/Makefile
@@ -16,6 +16,8 @@ selinux-$(CONFIG_NETLABEL) += netlabel.o
 
 selinux-$(CONFIG_SECURITY_INFINIBAND) += ibpkey.o
 
+selinux-$(CONFIG_IMA) += measure.o
+
 ccflags-y := -I$(srctree)/security/selinux -I$(srctree)/security/selinux/include
 
 $(addprefix $(obj)/,$(selinux-y)): $(obj)/flask.h
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 6210e98219a5..222c8473e5b5 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -7402,6 +7402,7 @@ int selinux_disable(struct selinux_state *state)
 	}
 
 	selinux_mark_disabled(state);
+	selinux_measure_state(state, false);
 
 	pr_info("SELinux:  Disabled at runtime.\n");
 
diff --git a/security/selinux/include/security.h b/security/selinux/include/security.h
index cbdd3c7aff8b..c971ec09d855 100644
--- a/security/selinux/include/security.h
+++ b/security/selinux/include/security.h
@@ -209,6 +209,11 @@ static inline bool selinux_policycap_genfs_seclabel_symlinks(void)
 	return state->policycap[POLICYDB_CAPABILITY_GENFS_SECLABEL_SYMLINKS];
 }
 
+static inline bool selinux_checkreqprot(const struct selinux_state *state)
+{
+	return READ_ONCE(state->checkreqprot);
+}
+
 int security_mls_enabled(struct selinux_state *state);
 int security_load_policy(struct selinux_state *state,
 			void *data, size_t len,
@@ -219,7 +224,8 @@ void selinux_policy_cancel(struct selinux_state *state,
 			struct selinux_policy *policy);
 int security_read_policy(struct selinux_state *state,
 			 void **data, size_t *len);
-
+int security_read_policy_kernel(struct selinux_state *state,
+				void **data, size_t *len);
 int security_policycap_supported(struct selinux_state *state,
 				 unsigned int req_cap);
 
@@ -436,4 +442,14 @@ extern void ebitmap_cache_init(void);
 extern void hashtab_cache_init(void);
 extern int security_sidtab_hash_stats(struct selinux_state *state, char *page);
 
+#ifdef CONFIG_IMA
+extern void selinux_measure_state(struct selinux_state *selinux_state,
+				  bool policy_mutex_held);
+#else
+static inline void selinux_measure_state(struct selinux_state *selinux_state,
+					 bool policy_mutex_held)
+{
+}
+#endif
+
 #endif /* _SELINUX_SECURITY_H_ */
diff --git a/security/selinux/measure.c b/security/selinux/measure.c
new file mode 100644
index 000000000000..caf9107937d9
--- /dev/null
+++ b/security/selinux/measure.c
@@ -0,0 +1,156 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Measure SELinux state using IMA subsystem.
+ */
+#include <linux/vmalloc.h>
+#include <linux/ktime.h>
+#include <linux/ima.h>
+#include "security.h"
+
+/*
+ * This function creates an unique name by appending the timestamp to
+ * the given string. This string is passed as "event name" to the IMA
+ * hook to measure the given SELinux data.
+ *
+ * The data provided by SELinux to the IMA subsystem for measuring may have
+ * already been measured (for instance the same state existed earlier).
+ * But for SELinux the current data represents a state change and hence
+ * needs to be measured again. To enable this, pass an unique "Event Name"
+ * to the IMA hook so that IMA subsystem will always measure the given data.
+ *
+ * For example,
+ * At time T0 SELinux data to be measured is "foo". IMA measures it.
+ * At time T1 the data is changed to "bar". IMA measures it.
+ * At time T2 the data is changed to "foo" again. IMA will not measure it
+ * (since it was already measured) unless the event name, for instance,
+ * is different in this call.
+ */
+static char *selinux_event_name(const char *name_prefix)
+{
+	char *event_name = NULL;
+	struct timespec64 cur_time;
+
+	ktime_get_real_ts64(&cur_time);
+	event_name = kasprintf(GFP_KERNEL, "%s-%lld:%09ld", name_prefix,
+			       cur_time.tv_sec, cur_time.tv_nsec);
+	if (!event_name) {
+		pr_err("%s: event name not allocated.\n", __func__);
+		return NULL;
+	}
+
+	return event_name;
+}
+
+static int read_selinux_state(char **state_str, int *state_str_len,
+			      struct selinux_state *state)
+{
+	char *buf, *str_fmt = "%s=%d;";
+	int i, buf_len, curr;
+	bool initialized = selinux_initialized(state);
+	bool enabled = !selinux_disabled(state);
+	bool enforcing = enforcing_enabled(state);
+	bool checkreqprot = selinux_checkreqprot(state);
+
+	buf_len = snprintf(NULL, 0, str_fmt, "initialized", initialized);
+	buf_len += snprintf(NULL, 0, str_fmt, "enabled", enabled);
+	buf_len += snprintf(NULL, 0, str_fmt, "enforcing", enforcing);
+	buf_len += snprintf(NULL, 0, str_fmt, "checkreqprot", checkreqprot);
+
+	for (i = 0; i < __POLICYDB_CAPABILITY_MAX; i++) {
+		buf_len += snprintf(NULL, 0, str_fmt,
+				    selinux_policycap_names[i],
+				    state->policycap[i]);
+	}
+	++buf_len;
+
+	buf = kzalloc(buf_len, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	curr = snprintf(buf, buf_len, str_fmt,
+			"initialized", initialized);
+	curr += snprintf((buf + curr), (buf_len - curr), str_fmt,
+			 "enabled", enabled);
+	curr += snprintf((buf + curr), (buf_len - curr), str_fmt,
+			 "enforcing", enforcing);
+	curr += snprintf((buf + curr), (buf_len - curr), str_fmt,
+			 "checkreqprot", checkreqprot);
+
+	for (i = 0; i < __POLICYDB_CAPABILITY_MAX; i++) {
+		curr += snprintf((buf + curr), (buf_len - curr), str_fmt,
+				 selinux_policycap_names[i],
+				 state->policycap[i]);
+	}
+
+	*state_str = buf;
+	*state_str_len = curr;
+
+	return 0;
+}
+
+void selinux_measure_state(struct selinux_state *state, bool policy_mutex_held)
+{
+	void *policy = NULL;
+	char *state_event_name = NULL;
+	char *policy_event_name = NULL;
+	char *state_str = NULL;
+	size_t policy_len;
+	int state_str_len, rc = 0;
+	bool initialized = selinux_initialized(state);
+
+	rc = read_selinux_state(&state_str, &state_str_len, state);
+	if (rc) {
+		pr_err("%s: Failed to read selinux state.\n", __func__);
+		return;
+	}
+
+	/*
+	 * Get an unique string for measuring the current SELinux state.
+	 */
+	state_event_name = selinux_event_name("selinux-state");
+	if (!state_event_name) {
+		pr_err("%s: Event name for state not allocated.\n",
+		       __func__);
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	rc = ima_measure_critical_data(state_event_name, "selinux",
+				       state_str, state_str_len, false);
+	if (rc)
+		goto out;
+
+	/*
+	 * Measure SELinux policy only after initialization is completed.
+	 */
+	if (!initialized)
+		goto out;
+
+	if (!policy_mutex_held)
+		mutex_lock(&state->policy_mutex);
+
+	rc = security_read_policy_kernel(state, &policy, &policy_len);
+
+	if (!policy_mutex_held)
+		mutex_unlock(&state->policy_mutex);
+
+	if (rc)
+		goto out;
+
+	policy_event_name = selinux_event_name("selinux-policy-hash");
+	if (!policy_event_name) {
+		pr_err("%s: Event name for policy not allocated.\n",
+		       __func__);
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	rc = ima_measure_critical_data(policy_event_name, "selinux",
+				       policy, policy_len, true);
+
+out:
+	kfree(state_event_name);
+	kfree(policy_event_name);
+	kfree(state_str);
+	vfree(policy);
+}
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index 45e9efa9bf5b..bb460954de03 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -176,6 +176,7 @@ static ssize_t sel_write_enforce(struct file *file, const char __user *buf,
 			from_kuid(&init_user_ns, audit_get_loginuid(current)),
 			audit_get_sessionid(current));
 		enforcing_set(state, new_value);
+		selinux_measure_state(state, false);
 		if (new_value)
 			avc_ss_reset(state->avc, 0);
 		selnl_notify_setenforce(new_value);
@@ -761,6 +762,8 @@ static ssize_t sel_write_checkreqprot(struct file *file, const char __user *buf,
 
 	fsi->state->checkreqprot = new_value ? 1 : 0;
 	length = count;
+	selinux_measure_state(fsi->state, false);
+
 out:
 	kfree(page);
 	return length;
diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
index 8dc111fbe23a..04a9c3d8c19b 100644
--- a/security/selinux/ss/services.c
+++ b/security/selinux/ss/services.c
@@ -2179,6 +2179,7 @@ static void selinux_notify_policy_change(struct selinux_state *state,
 	selinux_status_update_policyload(state, seqno);
 	selinux_netlbl_cache_invalidate();
 	selinux_xfrm_notify_policyload();
+	selinux_measure_state(state, true);
 }
 
 void selinux_policy_commit(struct selinux_state *state,
@@ -3874,6 +3875,30 @@ int security_netlbl_sid_to_secattr(struct selinux_state *state,
 }
 #endif /* CONFIG_NETLABEL */
 
+/**
+ * security_read_selinux_policy - read the policy.
+ * @policy: SELinux policy
+ * @data: binary policy data
+ * @len: length of data in bytes
+ *
+ */
+static int security_read_selinux_policy(struct selinux_policy *policy,
+					void **data, size_t *len)
+{
+	int rc;
+	struct policy_file fp;
+
+	fp.data = *data;
+	fp.len = *len;
+
+	rc = policydb_write(&policy->policydb, &fp);
+	if (rc)
+		return rc;
+
+	*len = (unsigned long)fp.data - (unsigned long)*data;
+	return 0;
+}
+
 /**
  * security_read_policy - read the policy.
  * @data: binary policy data
@@ -3884,8 +3909,6 @@ int security_read_policy(struct selinux_state *state,
 			 void **data, size_t *len)
 {
 	struct selinux_policy *policy;
-	int rc;
-	struct policy_file fp;
 
 	policy = rcu_dereference_protected(
 			state->policy, lockdep_is_held(&state->policy_mutex));
@@ -3897,14 +3920,43 @@ int security_read_policy(struct selinux_state *state,
 	if (!*data)
 		return -ENOMEM;
 
-	fp.data = *data;
-	fp.len = *len;
+	return security_read_selinux_policy(policy, data, len);
+}
 
-	rc = policydb_write(&policy->policydb, &fp);
-	if (rc)
-		return rc;
+/**
+ * security_read_policy_kernel - read the policy.
+ * @state: selinux_state
+ * @data: binary policy data
+ * @len: length of data in bytes
+ *
+ * Allocates kernel memory for reading SELinux policy.
+ * This function is for internal use only and should not
+ * be used for returning data to user space.
+ *
+ * This function must be called with policy_mutex held.
+ */
+int security_read_policy_kernel(struct selinux_state *state,
+				void **data, size_t *len)
+{
+	struct selinux_policy *policy;
+	int rc = 0;
 
-	*len = (unsigned long)fp.data - (unsigned long)*data;
-	return 0;
+	policy = rcu_dereference_protected(
+			state->policy, lockdep_is_held(&state->policy_mutex));
+	if (!policy) {
+		rc = -EINVAL;
+		goto out;
+	}
 
+	*len = policy->policydb.len;
+	*data = vmalloc(*len);
+	if (!*data) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	rc = security_read_selinux_policy(policy, data, len);
+
+out:
+	return rc;
 }
-- 
2.28.0


^ permalink raw reply related

* Re: [PATCH v8 1/3] Add a new LSM-supporting anonymous inode interface
From: Lokesh Gidra @ 2020-09-07  7:45 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, James Morris, Stephen Smalley, Casey Schaufler,
	Eric Biggers, Serge E. Hallyn, Paul Moore, Eric Paris,
	Daniel Colascione, Kees Cook, Eric W. Biederman, KP Singh,
	David Howells, Thomas Cedeno, Anders Roxell, Sami Tolvanen,
	Matthew Garrett, Aaron Goidel, Randy Dunlap,
	Joel Fernandes (Google), YueHaibing, Alexei Starovoitov,
	Alexey Budankov, Adrian Reber, Aleksa Sarai, Linux FS Devel,
	linux-kernel, LSM List, SElinux list, Kalesh Singh, Calin Juravle,
	Suren Baghdasaryan, Nick Kralevich, Jeffrey Vander Stoep,
	kernel-team, Jann Horn
In-Reply-To: <20200901124136.r3krb2p23343licq@wittgenstein>

On Tue, Sep 1, 2020 at 5:41 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Wed, Aug 26, 2020 at 11:35:20PM -0700, Lokesh Gidra wrote:
> > From: Daniel Colascione <dancol@google.com>
> >
> > This change adds a new function, anon_inode_getfd_secure, that creates
> > anonymous-node file with individual non-S_PRIVATE inode to which security
> > modules can apply policy. Existing callers continue using the original
> > singleton-inode kind of anonymous-inode file. We can transition anonymous
> > inode users to the new kind of anonymous inode in individual patches for
> > the sake of bisection and review.
> >
> > The new function accepts an optional context_inode parameter that
> > callers can use to provide additional contextual information to
> > security modules for granting/denying permission to create an anon inode
> > of the same type.
> >
> > For example, in case of userfaultfd, the created inode is a
> > 'logical child' of the context_inode (userfaultfd inode of the
> > parent process) in the sense that it provides the security context
> > required during creation of the child process' userfaultfd inode.
> >
> > Signed-off-by: Daniel Colascione <dancol@google.com>
> >
> > [Fix comment documenting return values of inode_init_security_anon()]
> > [Add context_inode description in comments to anon_inode_getfd_secure()]
> > [Remove definition of anon_inode_getfile_secure() as there are no callers]
> > [Make _anon_inode_getfile() static]
> > [Use correct error cast in _anon_inode_getfile()]
> > [Fix error handling in _anon_inode_getfile()]
> >
> > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> > ---
> >  fs/anon_inodes.c              | 147 +++++++++++++++++++++++++---------
> >  include/linux/anon_inodes.h   |   8 ++
> >  include/linux/lsm_hook_defs.h |   2 +
> >  include/linux/lsm_hooks.h     |   9 +++
> >  include/linux/security.h      |  10 +++
> >  security/security.c           |   8 ++
> >  6 files changed, 144 insertions(+), 40 deletions(-)
> >
> > diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
> > index 89714308c25b..c3f16deda211 100644
> > --- a/fs/anon_inodes.c
> > +++ b/fs/anon_inodes.c
> > @@ -55,61 +55,79 @@ static struct file_system_type anon_inode_fs_type = {
> >       .kill_sb        = kill_anon_super,
> >  };
> >
> > -/**
> > - * anon_inode_getfile - creates a new file instance by hooking it up to an
> > - *                      anonymous inode, and a dentry that describe the "class"
> > - *                      of the file
> > - *
> > - * @name:    [in]    name of the "class" of the new file
> > - * @fops:    [in]    file operations for the new file
> > - * @priv:    [in]    private data for the new file (will be file's private_data)
> > - * @flags:   [in]    flags
> > - *
> > - * Creates a new file by hooking it on a single inode. This is useful for files
> > - * that do not need to have a full-fledged inode in order to operate correctly.
> > - * All the files created with anon_inode_getfile() will share a single inode,
> > - * hence saving memory and avoiding code duplication for the file/inode/dentry
> > - * setup.  Returns the newly created file* or an error pointer.
> > - */
> > -struct file *anon_inode_getfile(const char *name,
> > -                             const struct file_operations *fops,
> > -                             void *priv, int flags)
> > +static struct inode *anon_inode_make_secure_inode(
> > +     const char *name,
> > +     const struct inode *context_inode)
> >  {
> > -     struct file *file;
> > +     struct inode *inode;
> > +     const struct qstr qname = QSTR_INIT(name, strlen(name));
> > +     int error;
> > +
> > +     inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
> > +     if (IS_ERR(inode))
> > +             return inode;
> > +     inode->i_flags &= ~S_PRIVATE;
> > +     error = security_inode_init_security_anon(inode, &qname, context_inode);
> > +     if (error) {
> > +             iput(inode);
> > +             return ERR_PTR(error);
> > +     }
> > +     return inode;
> > +}
>
> Hey,
>
> Iiuc, this makes each newly created anon inode fd correspond to a unique
> file and to a unique inode:
>
> fd1 -> file1 -> inode1
> fd2 -> file2 -> inode2
>
Not every anon inode. Just the ones created through
anon_inode_getfd_secure() API.

> Whereas before we had every anon inode fd correspond to a unique file
> but all files map to the _same_ inode:
>
> fd1 -> file1 -> inode
> fd2 -> file2 -> inode
>
Thils is still the case if anon_inode_getfile() and/or
anon_inode_getfd() APIs are used.

> The old behavior of hooking up a each anon inode fd to the same inode
> prevented having an evict method attached to the inode. Because it was
> shared that wasn't possible but also simply because that inode never got
> evicted anyway. That surely was intended but it's a bummer to some
> extent.
> With the new model you also can't have an evict method because now you
> have a separate inode for each file.
>
> I'm probably going to get killed for suggesting this but:
> If we're going to expand the anonymous inode infrastructure anyway is
> there a way we can make it so that we have a way to allocate a single
> inode for multiple anonymous inode fds and have callers opt-in to this
> behavior. We'd need a way to find this inode again, obviously.
>
> This would allow for some features on top of anonymous inode fds that
> can refer to the same object, i.e. anonymous inode fds that currently
> stash away the same object in f->private_data.
> In such a model we could allow such anonymous inode fds to stash away
> objects in inode->i_private instead of f->private_data and attach an
> evict method to it. This would e.g. allow a process to be killed when
> the last pidfd to it is closed or a seccomp notifier fd to notify when
> the filter is released without having to do separate reference counting.
>
I didn't fully understand the example you gave and the role that evict
method will play in it. Can you please elaborate a bit more.

But, I'd like to point you to a previous discussion between Daniel
Colascione (the original contributor of this patch series) and Stephan
Smalley on the topic of inodes
https://lore.kernel.org/lkml/CAKOZuesUVSYJ6EjHFL3QyiWKVmyhm1fLp5Bm_SHjB3_s1gn08A@mail.gmail.com/

I agree with Daniel (see his replies in the thread link above) that a
separate inode per anon inode fd keeps the design simple, particularly
from the security context perspective.

> This would need a way to lookup that inode by the object that is stashed
> away in it of course which could probably be done by an idr or an
> xarray or something cleverer. It would obviously only affect a subset of
> anonymous inode fds so any other anonymous inode fds wouldn't be
> impacted since they can still use the single-anon-inode interface.
>
> Christian

^ permalink raw reply

* Re: [RFC PATCH 00/30] ima: Introduce IMA namespace
From: Dr. Greg @ 2020-09-06 17:14 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: Christian Brauner, krzysztof.struczynski, linux-integrity,
	linux-kernel, containers, linux-security-module, stefanb,
	sunyuqiong1988, mkayaalp, dmitry.kasatkin, serge, jmorris,
	christian, silviu.vlasceanu, roberto.sassu, ebiederm, viro,
	torvalds, luto, jannh, nick.dusek
In-Reply-To: <d77a6cd783319702fddd06783cb84fdeb86210a6.camel@linux.ibm.com>

On Wed, Sep 02, 2020 at 03:54:58PM -0400, Mimi Zohar wrote:

Good morning, I hope the weekend is going well for everyone.

A follow on to my previous e-mail regarding what 'namespaced IMA'
should look like.

> On Tue, 2020-08-18 at 18:49 +0200, Christian Brauner wrote:
> > On Tue, Aug 18, 2020 at 05:20:07PM +0200, krzysztof.struczynski@huawei.com wrote:
> > > From: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
> > > IMA has not been designed to work with containers. It handles
> > > every process in the same way, and it cannot distinguish if a
> > > process belongs to a container or not.
> > >
> > > Containers use namespaces to make it appear to the processes in
> > > the containers that they have their own isolated instance of the
> > > global resource. For IMA as well, it is desirable to let
> > > processes in the

> > IMA is brought up on a regular basis with "we want to have this" for
> > years and then non-one seems to really care enough.

I don't think it is a matter of lack of interest or not caring.  The
challenge becomes whether or not a business case exists for expending
the resources and navigating the challenges needed to advance
infrastructure for inclusion in the kernel.

> There is a lot of interest in IMA namespacing, but the question
> always comes back to how to enable it.  Refer to
> https://kernsec.org/wiki/index.php/IMA_Namespacing_design_considerations
> for Stefan's analysis.

As I noted in my previous e-mail, I believe the path forward is not to
figure out how to address the rather complex and invasive issue of how
to namespace IMA, but instead, needs to be what a flexible
'next-generation' architecture for platform behavioral assessment
looks like.

In a larger context, I believe the future for security is going to
involve at a minimum the kernel, and more likely, something tightly
coupled to the kernel, making active decisions on whether or not a
behavior that the kernel is contemplating mediating is consistent with
platform or container security policy.

To frame this a bit, the well understood role of the kernel with
respect to security is to mediate 'Turing Events', or Actor/Subject
(A/S) interactions.  Both DAC and MAC are based on this concept with
the interface between an Actor and Subject being a security 'gate'.

Given this model, the most simplistic and direct path forward is to
provide a namespace capable method of exporting a description of the
identity parameters that characterize the entities involved in an A/S
interaction.  The kernel, or as I previously suggested as more likely
moving forward, a closely linked entity can then make a decision as to
whether or not the behavior should be allowed.

FWIW, as an example of the minimal impact of this method, here is the
diffstat of such an implementation:

---------------------------------------------------------------------------
 arch/x86/entry/syscalls/syscall_32.tbl    |    2 +
 arch/x86/entry/syscalls/syscall_64.tbl    |    2 +
 fs/proc/array.c                           |    7 +
 fs/proc/namespaces.c                      |    4 +
 include/linux/ima.h                       |   27 +
 include/linux/nsproxy.h                   |    2 +
 include/linux/proc_ns.h                   |    2 +
 include/linux/sched.h                     |    3 +
 include/linux/syscalls.h                  |    6 +
 include/uapi/asm-generic/unistd.h         |    7 +-
 include/uapi/linux/sched.h                |    1 +
 kernel/fork.c                             |    5 +-
 kernel/nsproxy.c                          |   18 +-
 kernel/sys_ni.c                           |    4 +
 security/Kconfig                          |    1 +
 security/Makefile                         |    2 +
 Security/ai/Kconfig                       |   12 +
 security/ai/Makefile                      |    3 +
 security/ai/ai.c                          |  137 ++
 security/integrity/iint.c                 |    5 +
 security/integrity/ima/Makefile           |    2 +-
 security/integrity/ima/ima.h              |   17 +
 security/integrity/ima/ima_api.c          |   37 +-
 security/integrity/ima/ima_fs.c           |   10 +
 security/integrity/ima/ima_identity.c     | 2204 +++++++++++++++++++++++++++++
 security/integrity/ima/ima_init.c         |    6 +-
 security/integrity/ima/ima_main.c         |    7 +
 security/integrity/ima/ima_policy.c       |   74 +-
 security/integrity/ima/ima_queue.c        |   10 +-
 security/integrity/ima/ima_template.c     |    6 +
 security/integrity/ima/ima_template_lib.c |   71 +
 security/integrity/ima/ima_template_lib.h |   10 +
 security/integrity/integrity.h            |    4 +-
 security/security.c                       |    1 +
 34 files changed, 2684 insertions(+), 25 deletions(-)
---------------------------------------------------------------------------

As can be seen, this was built out in the context of the IMA
sub-system with the majority of the changes being encapsulated in one
file.  This includes all of the infrastructure needed for a Trusted
Execution Environment (TEE) to enforce kernel security policy
decisions.

In addition to a container specific representation of the current
behavioral state, each namespace exports via a sysfs pseudo-file, the
following behavioral definition for each A/S interaction.

exchange pid{1} event{cboot:/home/greg/runc} actor{uid=0, euid=0, suid=0, gid=0, egid=0, sgid=0, fsuid=0, fsgid=0, cap=0x3fffffffff} subject{uid=50, gid=50, mode=0100755, name_length=15, name=f0da604ff3f0a3e16163bc9d2f99bb9bcd70397d211b746d0104299972cc5505, s_id=sda1, s_uuid=1bfef8aaa45f4bcaa846640ae4547ddc, digest=791a7cf8dec2afe302836b974b3c0f7b0a5983f76d857aa97658ce09d54f60f8}

Which provides the framework for implementing any number of policy
decisions, of which integrity is only one element.

We had initially used SGX to implement a TEE based enforcement engine,
but given the direction of hardware support, we have largely shelved
our SGX development efforts in favor of using a micro-controller based
approach.

Given what appears to be the direction for mobile devices, a
collection of specialized harware linked by an OS, the notion of a
separate entity making security policy decisions seems relevant.

> I understand "containers" is not a kernel construct, but from my very
> limited perspective, IMA namespacing only makes sense in the context of
> a "container".  The container owner may want to know which files have
> been accessed/executed (measurements, remote attestation) and/or
> constrain which files may be accessed/executed based on signatures
> (appraisal).

Trying to implement supportable and field maintainable 'trusted
computing' is a fools errand without the notion of containerization of
platform behavior.  This is true whether the target is the cloud or
endpoint/IOT class devices.

It seems well understood, that while containers are not a first class
kernel entity, the kernel takes responsibility for implementing
compartmentalization of resources.  It would seem that security event
characterizations are consistent with that model.

In addition, none of this works without developer support.  Framing
behavior assessment in the form of containers means that behavioral
trajectory definitions for the containers can be a byproduct of
standard DEVOP's pipelines.

> > I'm highly skeptical of the value of ~2500 lines of code even if
> > it includes a bunch of namespace boilerplate. It's yet another
> > namespace, and yet another security framework.  Why does IMA need
> > to be a separate namespace? Keyrings are tied to user namespaces
> > why can't IMA be?

> In the context of a container, the measurement list and IMA/EVM
> keyrings need to be setup before the first file is measured,
> signature verified, or file hash included in the audit log.

As I've noted previously, namespacing IMA is problematic, what is
needed is something far simpler and more flexible that provides a
framework for implementing policy outside of the kernel, or if in the
kernel, in a highly customizable fashion.

I think that it would be found that user namespaces bring too much
baggage to the table.  The most effective path forward in this venue
would seem to be to bring forward the most simplistic, flexiable and
uncomplicated mechanism possible.

In this model, classic IMA would serve as a trust root on whose
shoulders a security orchestration framework stands.

> > I believe Eric has even pointed that out before.
> >
> > Eric, thoughts?

> Any help with the above scenario would very be much appreciated.

Hopefully the conversation will benefit from actual field experience
with doing this sort of thing in a supportable fashion.

The concept of 'trusted computing' has been around since the days when
Dave Grawrock designed TXT, which is heavily linked to the heritage of
IMA.  The fact that effective solutions in widespread practice have
not emerged, in the face of demonstrated need, suggests the need to
develop new solution strategies.

Just to be clear, we are not campaigning or advocating what we have
done but are simply providing background for discussion.  We haven't
campaigned this approach given how complex the kernel development has
become, particurlarly with respect to security infrastructure.

Candidly, given the politics of security technology being viewed as
'constraining' user rights, I think that a lot of forthcoming security
technology may end up being out of tree moving forward.

> Mimi

Best wishes for a productive week to everyone.

Dr. Greg

As always,
Dr. Greg Wettstein, Ph.D, Worker      Autonomously self-defensive
Enjellic Systems Development, LLC     IOT platforms and edge devices.
4206 N. 19th Ave.
Fargo, ND  58102
PH: 701-281-1686                      EMAIL: dg@enjellic.com
------------------------------------------------------------------------------
"A large number of the world's technical challenges have been solved.
 The far greater challenge lies in conveying an understanding of this
 to the world."
                                -- Dr. Greg Wettstein
                                   Resurrection

^ permalink raw reply

* Re: [PATCH v20 20/23] Audit: Add new record for multiple process LSM attributes
From: Paul Moore @ 2020-09-06 16:32 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: casey.schaufler, James Morris, linux-security-module, selinux,
	linux-audit, keescook, john.johansen, penguin-kernel,
	Stephen Smalley
In-Reply-To: <20200826145247.10029-21-casey@schaufler-ca.com>

On Wed, Aug 26, 2020 at 11:23 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> Create a new audit record type to contain the subject information
> when there are multiple security modules that require such data.
> This record is linked with the same timestamp and serial number.
> The record is produced only in cases where there is more than one
> security module with a process "context".
>
> Before this change the only audit events that required multiple
> records were syscall events. Several non-syscall events include
> subject contexts, so the use of audit_context data has been expanded
> as necessary.
>
> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
> Cc: linux-audit@redhat.com
> ---
>  drivers/android/binder.c                |  2 +-
>  include/linux/audit.h                   | 13 +++-
>  include/linux/security.h                | 18 ++++-
>  include/net/netlabel.h                  |  2 +-
>  include/net/scm.h                       |  3 +-
>  include/net/xfrm.h                      |  4 +-
>  include/uapi/linux/audit.h              |  1 +
>  kernel/audit.c                          | 89 ++++++++++++++++++-------
>  kernel/auditfilter.c                    |  2 +-
>  kernel/auditsc.c                        | 87 ++++++++++++++++++++++--
>  net/ipv4/ip_sockglue.c                  |  2 +-
>  net/netfilter/nf_conntrack_netlink.c    |  4 +-
>  net/netfilter/nf_conntrack_standalone.c |  2 +-
>  net/netfilter/nfnetlink_queue.c         |  2 +-
>  net/netlabel/netlabel_unlabeled.c       | 16 ++---
>  net/netlabel/netlabel_user.c            | 12 ++--
>  net/netlabel/netlabel_user.h            |  6 +-
>  security/integrity/integrity_audit.c    |  2 +-
>  security/security.c                     | 73 +++++++++++++++-----
>  security/smack/smackfs.c                |  3 +-
>  20 files changed, 259 insertions(+), 84 deletions(-)

Regardless of the code, the audit folks generally ask for an example
of the new audit record in the commit description; it make it easier
for the people not familiar with the kernel to review the record
format/information.

I've also been requiring that a test is added to the audit test suite
for new features added to the kernel.  I recognize this is a big ask
for something like this, but it is something we have been doing for
several years now and I think the benefits far outweigh the costs.

* https://github.com/linux-audit/audit-testsuite

> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index ba1cd38d601b..fe027df0d9a8 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -186,7 +186,9 @@ extern void             audit_log_path_denied(int type,
>                                                   const char *operation);
>  extern void                audit_log_lost(const char *message);
>
> -extern int audit_log_task_context(struct audit_buffer *ab);
> +extern void audit_log_lsm(struct lsmblob *blob, bool exiting);
> +extern int audit_log_task_context(struct audit_buffer *ab,
> +                                 struct lsmblob *blob);

The audit functions tend to get abused, and this is mostly due to a
rather poor design, so I'd prefer to see audit_log_task_context()
remain with just a single argument: the audit buffer.  I believe this
covers the vast majority of the cases and should remain the preferred
option for callers.  This should help shrink the patch and focus it a
bit more.

For the handful of callers that do need to specify a separate task
context, you can create a __audit_log_task_context(ab, blob) that
allows the caller to specify the blob (and obviously you can write
audit_log_task_context() as a one line wrapper around this function).

> diff --git a/include/net/xfrm.h b/include/net/xfrm.h
> index 2737d24ec244..9e8cac6228b4 100644
> --- a/include/net/xfrm.h
> +++ b/include/net/xfrm.h
> @@ -675,11 +675,13 @@ static inline struct audit_buffer *xfrm_audit_start(const char *op)
>
>         if (audit_enabled == AUDIT_OFF)
>                 return NULL;
> +       audit_stamp_context(audit_context());
>         audit_buf = audit_log_start(audit_context(), GFP_ATOMIC,
>                                     AUDIT_MAC_IPSEC_EVENT);

Is audit_stamp_context() necessary here?  Generally if the first
argument to audit_log_start() is not NULL then you shouldn't need to
create a "local" audit context.

I wonder if you might be getting mixed up in testing by not having
audit properly enabled (see link below)?  Make sure you don't have any
audit disable rules loaded into the kernel, and you can even add
"audit=1" to the kernel command line.

https://github.com/linux-audit/audit-documentation/wiki/HOWTO-Fedora-Enable-Auditing

>         if (audit_buf == NULL)
>                 return NULL;
>         audit_log_format(audit_buf, "op=%s", op);
> +       audit_log_lsm(NULL, false);
>         return audit_buf;
>  }

...

> diff --git a/kernel/audit.c b/kernel/audit.c
> index 594b42fc88ff..0e7831c9f321 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -1070,13 +1071,31 @@ static void audit_log_common_recv_msg(struct audit_context *context,
>                 return;
>         audit_log_format(*ab, "pid=%d uid=%u ", pid, uid);
>         audit_log_session_info(*ab);
> -       audit_log_task_context(*ab);
> +       audit_log_task_context(*ab, NULL);
>  }
>
>  static inline void audit_log_user_recv_msg(struct audit_buffer **ab,
>                                            u16 msg_type)
>  {
> -       audit_log_common_recv_msg(NULL, ab, msg_type);
> +       struct audit_context *context;
> +
> +       if (!lsm_multiple_contexts()) {
> +               audit_log_common_recv_msg(NULL, ab, msg_type);
> +               return;
> +       }
> +
> +       context = audit_context();
> +       if (context) {
> +               if (!context->in_syscall)
> +                       audit_stamp_context(context);
> +               audit_log_common_recv_msg(context, ab, msg_type);
> +               return;
> +       }
> +
> +       audit_alloc(current);
> +       context = audit_context();
> +
> +       audit_log_common_recv_msg(context, ab, msg_type);
>  }

Hmm.  Take a look at Richard's patch for adding the audit container ID
record to audit user records, it should give you a better idea of how
to approach this.  The above changes in audit_log_user_recv_msg() are
not really what we want.

https://lore.kernel.org/linux-audit/4a5019ed3cfab416aeb6549b791ac6d8cc9fb8b7.1593198710.git.rgb@redhat.com

> @@ -1869,6 +1889,10 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask,
>         }
>
>         audit_get_stamp(ab->ctx, &t, &serial);
> +       if (type == AUDIT_MAC_TASK_CONTEXTS && ab->ctx->serial == 0) {
> +               audit_stamp_context(ab->ctx);
> +               audit_get_stamp(ab->ctx, &t, &serial);
> +       }
>         audit_log_format(ab, "audit(%llu.%03lu:%u): ",
>                          (unsigned long long)t.tv_sec, t.tv_nsec/1000000, serial);

Can you walk me through what you are trying to do here?  This doesn't
seem right to me, but I'm sure you put it here for a reason.

> @@ -2126,30 +2150,47 @@ void audit_log_key(struct audit_buffer *ab, char *key)
>                 audit_log_format(ab, "(null)");
>  }
>
> -int audit_log_task_context(struct audit_buffer *ab)
> +int audit_log_task_context(struct audit_buffer *ab, struct lsmblob *blob)
>  {
> +       int i;
>         int error;
> -       struct lsmblob blob;
> -       struct lsmcontext context;
> +       struct lsmblob localblob;
> +       struct lsmcontext lsmdata;
>
> -       security_task_getsecid(current, &blob);
> -       if (!lsmblob_is_set(&blob))
> +       /*
> +        * If there is more than one security module that has a
> +        * subject "context" it's necessary to put the subject data
> +        * into a separate record to maintain compatibility.
> +        */
> +       if (lsm_multiple_contexts()) {
> +               audit_log_format(ab, " subj=?");
>                 return 0;
> +       }
>
> -       error = security_secid_to_secctx(&blob, &context);
> -       if (error) {
> -               if (error != -EINVAL)
> -                       goto error_path;
> -               return 0;
> +       if (blob == NULL) {
> +               security_task_getsecid(current, &localblob);

Why is localblob necessary?  You know blob is NULL here, just use it
directly and skip the assignment later in this code block.

> +               if (!lsmblob_is_set(&localblob)) {
> +                       audit_log_format(ab, " subj=?");
> +                       return 0;
> +               }
> +               blob = &localblob;
>         }
>
> -       audit_log_format(ab, " subj=%s", context.context);
> -       security_release_secctx(&context);
> -       return 0;
> +       for (i = 0; i < LSMBLOB_ENTRIES; i++) {
> +               if (blob->secid[i] == 0)
> +                       continue;
> +               error = security_secid_to_secctx(blob, &lsmdata, i);
> +               if (error && error != -EINVAL) {
> +                       audit_panic("error in audit_log_task_context");
> +                       return error;
> +               }
>
> -error_path:
> -       audit_panic("error in audit_log_task_context");
> -       return error;
> +               audit_log_format(ab, " subj=%s", lsmdata.context);
> +               security_release_secctx(&lsmdata);
> +               break;
> +       }
> +
> +       return 0;
>  }
>  EXPORT_SYMBOL(audit_log_task_context);

...

> @@ -2279,6 +2320,7 @@ static void audit_log_set_loginuid(kuid_t koldloginuid, kuid_t kloginuid,
>         if (!audit_enabled)
>                 return;
>
> +       audit_stamp_context(audit_context());
>         ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_LOGIN);
>         if (!ab)
>                 return;

Similar to above, I'm not sure audit_stamp_context() is what we want here.

> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 4af5861bcb9a..cf5dbd0e3a3d 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -962,10 +962,12 @@ int audit_alloc(struct task_struct *tsk)
>                 return 0; /* Return if not auditing. */
>
>         state = audit_filter_task(tsk, &key);
> -       if (state == AUDIT_DISABLED) {
> +       if (!lsm_multiple_contexts() && state == AUDIT_DISABLED) {
>                 clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);
>                 return 0;
>         }
> +       if (state == AUDIT_DISABLED)
> +               clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);

Hmmm.  I think we've hit critical mass on the amount of hacks in the
audit subsystem for this code ... Let's back up a bit and start with
some basic requirements (correct me if I'm wrong on any of these):

1. This patch only deals with LSM process/task/subject contexts.
2. Multiple LSMs may, or may not exist; audit logs and behavior should
remain unchanged in the single LSM case.
3. If multiple LSMs are present, the subject field in existing records
should take the form of "subj=?" and a new audit record is created
containing fields of the format "subj_<LSM>=<context>".

Assuming all of that is true, why not simply keep the basic logic in
audit_log_task_context(), but also stash a lsmblob struct in the
audit_context for later use by audit_log_exit()?  Stashing of the
lsmblob would really only be necessary to handle the few cases where
the current task is not the proper subject.

> @@ -1483,6 +1486,52 @@ static void audit_log_proctitle(void)
>         audit_log_end(ab);
>  }
>
> +void audit_log_lsm(struct lsmblob *blob, bool exiting)
> +{
> +       struct audit_context *context = audit_context();
> +       struct lsmcontext lsmdata;
> +       struct audit_buffer *ab;
> +       struct lsmblob localblob;
> +       bool sep = false;
> +       int error;
> +       int i;
> +
> +       if (!lsm_multiple_contexts())
> +               return;
> +
> +       if (context && context->in_syscall && !exiting)
> +               return;

I think the "!lsm_multiple_contexts()" and "!exiting" checks fit well
with my comments about generating the AUDIT_MAC_TASK_CONTEXT record in
audit_log_exit.  This code needs a rethink.

> +       ab = audit_log_start(context, GFP_ATOMIC, AUDIT_MAC_TASK_CONTEXTS);
> +       if (!ab)
> +               return; /* audit_panic or being filtered */
> +
> +       if (blob == NULL) {
> +               security_task_getsecid(current, &localblob);
> +               if (!lsmblob_is_set(&localblob))
> +                       return;
> +               blob = &localblob;
> +       }
> +
> +       for (i = 0; i < LSMBLOB_ENTRIES; i++) {
> +               if (blob->secid[i] == 0)
> +                       continue;
> +               error = security_secid_to_secctx(blob, &lsmdata, i);
> +               if (error && error != -EINVAL) {
> +                       audit_panic("error in audit_log_lsm");
> +                       return; fits well with a
> +               }
> +
> +               audit_log_format(ab, "%ssubj_%s=%s", sep ? " " : "",
> +                                security_lsm_slot_name(i), lsmdata.context);
> +               sep = true;
> +
> +               security_release_secctx(&lsmdata);
> +       }
> +
> +       audit_log_end(ab);
> +}

...

> @@ -2217,6 +2267,21 @@ void __audit_inode_child(struct inode *parent,
>  }
>  EXPORT_SYMBOL_GPL(__audit_inode_child);
>
> +/**
> + * audit_stamp_context - set the timestamp+serial in an audit context
> + * @ctx: audit_context to set
> + */
> +void audit_stamp_context(struct audit_context *ctx)
> +{
> +       /* ctx will be NULL unless lsm_multiple_contexts() is true */
> +       if (!ctx)
> +               return;
> +
> +       ktime_get_coarse_real_ts64(&ctx->ctime);
> +       ctx->serial = audit_serial();
> +       ctx->current_state = AUDIT_BUILD_CONTEXT;
> +}

Based on previous discussions and what I *think* you are trying to do
in this patchset, I believe Richard's audit_alloc_local()
implementation (link below) is a better and cleaner solution.  His
latest revisions needs some minor tweaks (see my feeback), but I think
you could probably work with him to pull that single patch into your
patchset.

To be clear, I'm talking about just that one patch; I'm a firm
believer that tying the LSM stacking and audit container ID patches
beyond this would be a disaster.  I can deal with any merge conflicts
that arise whenever the different patchsets land.

* Richard's audit_alloc_local() patch (07/13 of the latest audit
container ID patchset)
https://lore.kernel.org/linux-audit/21e6c4e1ac179c8dcf35803e603899ccfc69300a.1593198710.git.rgb@redhat.com

> diff --git a/net/netlabel/netlabel_user.c b/net/netlabel/netlabel_user.c
> index 951ba0639d20..4e9064754b5f 100644
> --- a/net/netlabel/netlabel_user.c
> +++ b/net/netlabel/netlabel_user.c
> @@ -84,12 +84,12 @@ struct audit_buffer *netlbl_audit_start_common(int type,
>                                                struct netlbl_audit *audit_info)
>  {
>         struct audit_buffer *audit_buf;
> -       struct lsmcontext context;
> -       struct lsmblob blob;
>
>         if (audit_enabled == AUDIT_OFF)
>                 return NULL;
>
> +       audit_stamp_context(audit_context());

Another one.

>         audit_buf = audit_log_start(audit_context(), GFP_ATOMIC, type);
>         if (audit_buf == NULL)
>                 return NULL;

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* [RFC PATCH 6/9] security/fbfam: Mitigate a fork brute force attack
From: John Wood @ 2020-09-06 15:24 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: John Wood

In order to mitigate a fork brute force attack it is necessary to kill
all the offending tasks. This tasks are all the ones that share the
statistical data with the current task (the task that has crashed).

Since the attack detection is done in the function fbfam_handle_attack()
that is called every time a core dump is triggered, only is needed to
kill the others tasks that share the same statistical data, not the
current one as this is in the path to be killed.

When the SIGKILL signal is sent to the offending tasks from the function
fbfam_kill_tasks(), this one will be called again during the core dump
due to the shared statistical data shows a quickly crashing rate. So, to
avoid kill again the same tasks due to a recursive call of this
function, it is necessary to disable the attack detection.

To disable this attack detection, add a condition in the function
fbfam_handle_attack() to not compute the crashing rate when the jiffies
stored in the statistical data are set to zero.

Signed-off-by: John Wood <john.wood@gmx.com>
---
 security/fbfam/fbfam.c | 76 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 71 insertions(+), 5 deletions(-)

diff --git a/security/fbfam/fbfam.c b/security/fbfam/fbfam.c
index 3aa669e4ea51..173a6122390f 100644
--- a/security/fbfam/fbfam.c
+++ b/security/fbfam/fbfam.c
@@ -4,8 +4,11 @@
 #include <linux/errno.h>
 #include <linux/gfp.h>
 #include <linux/jiffies.h>
+#include <linux/pid.h>
 #include <linux/printk.h>
+#include <linux/rcupdate.h>
 #include <linux/refcount.h>
+#include <linux/sched/signal.h>
 #include <linux/signal.h>
 #include <linux/slab.h>

@@ -24,7 +27,8 @@ unsigned long sysctl_crashing_rate_threshold = 30000;
  * struct fbfam_stats - Fork brute force attack mitigation statistics.
  * @refc: Reference counter.
  * @faults: Number of crashes since jiffies.
- * @jiffies: First fork or execve timestamp.
+ * @jiffies: First fork or execve timestamp. If zero, the attack detection is
+ *           disabled.
  *
  * The purpose of this structure is to manage all the necessary information to
  * compute the crashing rate of an application. So, it holds a first fork or
@@ -175,13 +179,69 @@ int fbfam_exit(void)
 }

 /**
- * fbfam_handle_attack() - Fork brute force attack detection.
+ * fbfam_kill_tasks() - Kill the offending tasks
+ *
+ * When a fork brute force attack is detected it is necessary to kill all the
+ * offending tasks. Since this function is called from fbfam_handle_attack(),
+ * and so, every time a core dump is triggered, only is needed to kill the
+ * others tasks that share the same statistical data, not the current one as
+ * this is in the path to be killed.
+ *
+ * When the SIGKILL signal is sent to the offending tasks, this function will be
+ * called again during the core dump due to the shared statistical data shows a
+ * quickly crashing rate. So, to avoid kill again the same tasks due to a
+ * recursive call of this function, it is necessary to disable the attack
+ * detection setting the jiffies to zero.
+ *
+ * To improve the for_each_process loop it is possible to end it when all the
+ * tasks that shared the same statistics are found.
+ *
+ * Return: -EFAULT if the current task doesn't have statistical data. Zero
+ *         otherwise.
+ */
+static int fbfam_kill_tasks(void)
+{
+	struct fbfam_stats *stats = current->fbfam_stats;
+	struct task_struct *p;
+	unsigned int to_kill, killed = 0;
+
+	if (!stats)
+		return -EFAULT;
+
+	to_kill = refcount_read(&stats->refc) - 1;
+	if (!to_kill)
+		return 0;
+
+	/* Disable the attack detection */
+	stats->jiffies = 0;
+	rcu_read_lock();
+
+	for_each_process(p) {
+		if (p == current || p->fbfam_stats != stats)
+			continue;
+
+		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_PID);
+		pr_warn("fbfam: Offending process with PID %d killed\n",
+			p->pid);
+
+		killed += 1;
+		if (killed >= to_kill)
+			break;
+	}
+
+	rcu_read_unlock();
+	return 0;
+}
+
+/**
+ * fbfam_handle_attack() - Fork brute force attack detection and mitigation.
  * @signal: Signal number that causes the core dump.
  *
  * The crashing rate of an application is computed in milliseconds per fault in
  * each crash. So, if this rate goes under a certain threshold there is a clear
  * signal that the application is crashing quickly. At this moment, a fork brute
- * force attack is happening.
+ * force attack is happening. Under this scenario it is necessary to kill all
+ * the offending tasks in order to mitigate the attack.
  *
  * Return: -EFAULT if the current task doesn't have statistical data. Zero
  *         otherwise.
@@ -195,6 +255,10 @@ int fbfam_handle_attack(int signal)
 	if (!stats)
 		return -EFAULT;

+	/* The attack detection is disabled */
+	if (!stats->jiffies)
+		return 0;
+
 	if (!(signal == SIGILL || signal == SIGBUS || signal == SIGKILL ||
 	      signal == SIGSEGV || signal == SIGSYS))
 		return 0;
@@ -205,9 +269,11 @@ int fbfam_handle_attack(int signal)
 	delta_time = jiffies64_to_msecs(delta_jiffies);
 	crashing_rate = delta_time / (u64)stats->faults;

-	if (crashing_rate < (u64)sysctl_crashing_rate_threshold)
-		pr_warn("fbfam: Fork brute force attack detected\n");
+	if (crashing_rate >= (u64)sysctl_crashing_rate_threshold)
+		return 0;

+	pr_warn("fbfam: Fork brute force attack detected\n");
+	fbfam_kill_tasks();
 	return 0;
 }

--
2.25.1


^ permalink raw reply related

* [RFC PATCH 5/9] security/fbfam: Detect a fork brute force attack
From: John Wood @ 2020-09-06 15:23 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: John Wood

To detect a fork brute force attack it is necessary to compute the
crashing rate of the application. This calculation is performed in each
fatal fail of a task, or in other words, when a core dump is triggered.
If this rate shows that the application is crashing quickly, there is a
clear signal that an attack is happening.

Since the crashing rate is computed in milliseconds per fault, if this
rate goes under a certain threshold a warning is triggered.

Signed-off-by: John Wood <john.wood@gmx.com>
---
 fs/coredump.c          |  2 ++
 include/fbfam/fbfam.h  |  2 ++
 security/fbfam/fbfam.c | 39 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 43 insertions(+)

diff --git a/fs/coredump.c b/fs/coredump.c
index 76e7c10edfc0..d4ba4e1828d5 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -51,6 +51,7 @@
 #include "internal.h"

 #include <trace/events/sched.h>
+#include <fbfam/fbfam.h>

 int core_uses_pid;
 unsigned int core_pipe_limit;
@@ -825,6 +826,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 fail_creds:
 	put_cred(cred);
 fail:
+	fbfam_handle_attack(siginfo->si_signo);
 	return;
 }

diff --git a/include/fbfam/fbfam.h b/include/fbfam/fbfam.h
index 2cfe51d2b0d5..9ac8e33d8291 100644
--- a/include/fbfam/fbfam.h
+++ b/include/fbfam/fbfam.h
@@ -12,10 +12,12 @@ extern struct ctl_table fbfam_sysctls[];
 int fbfam_fork(struct task_struct *child);
 int fbfam_execve(void);
 int fbfam_exit(void);
+int fbfam_handle_attack(int signal);
 #else
 static inline int fbfam_fork(struct task_struct *child) { return 0; }
 static inline int fbfam_execve(void) { return 0; }
 static inline int fbfam_exit(void) { return 0; }
+static inline int fbfam_handle_attack(int signal) { return 0; }
 #endif

 #endif /* _FBFAM_H_ */
diff --git a/security/fbfam/fbfam.c b/security/fbfam/fbfam.c
index 9be4639b72eb..3aa669e4ea51 100644
--- a/security/fbfam/fbfam.c
+++ b/security/fbfam/fbfam.c
@@ -4,7 +4,9 @@
 #include <linux/errno.h>
 #include <linux/gfp.h>
 #include <linux/jiffies.h>
+#include <linux/printk.h>
 #include <linux/refcount.h>
+#include <linux/signal.h>
 #include <linux/slab.h>

 /**
@@ -172,3 +174,40 @@ int fbfam_exit(void)
 	return 0;
 }

+/**
+ * fbfam_handle_attack() - Fork brute force attack detection.
+ * @signal: Signal number that causes the core dump.
+ *
+ * The crashing rate of an application is computed in milliseconds per fault in
+ * each crash. So, if this rate goes under a certain threshold there is a clear
+ * signal that the application is crashing quickly. At this moment, a fork brute
+ * force attack is happening.
+ *
+ * Return: -EFAULT if the current task doesn't have statistical data. Zero
+ *         otherwise.
+ */
+int fbfam_handle_attack(int signal)
+{
+	struct fbfam_stats *stats = current->fbfam_stats;
+	u64 delta_jiffies, delta_time;
+	u64 crashing_rate;
+
+	if (!stats)
+		return -EFAULT;
+
+	if (!(signal == SIGILL || signal == SIGBUS || signal == SIGKILL ||
+	      signal == SIGSEGV || signal == SIGSYS))
+		return 0;
+
+	stats->faults += 1;
+
+	delta_jiffies = get_jiffies_64() - stats->jiffies;
+	delta_time = jiffies64_to_msecs(delta_jiffies);
+	crashing_rate = delta_time / (u64)stats->faults;
+
+	if (crashing_rate < (u64)sysctl_crashing_rate_threshold)
+		pr_warn("fbfam: Fork brute force attack detected\n");
+
+	return 0;
+}
+
--
2.25.1


^ permalink raw reply related

* [RFC PATCH 4/9] security/fbfam: Add a new sysctl to control the crashing rate threshold
From: John Wood @ 2020-09-06 15:23 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: John Wood

This is a previous step to add the detection feature.

A fork brute force attack will be detected when an application crashes
quickly. Since, a rate can be defined as a time per fault, add a new
sysctl to control the crashing rate threshold.

This way, each system can tune the detection's sensibility adjusting the
milliseconds per fault. So, if the application's crashing rate falls
under this threshold an attack will be detected. So, the higher this
value, the faster an attack will be detected.

Signed-off-by: John Wood <john.wood@gmx.com>
---
 include/fbfam/fbfam.h   |  4 ++++
 kernel/sysctl.c         |  9 +++++++++
 security/fbfam/Makefile |  1 +
 security/fbfam/fbfam.c  | 11 +++++++++++
 security/fbfam/sysctl.c | 20 ++++++++++++++++++++
 5 files changed, 45 insertions(+)
 create mode 100644 security/fbfam/sysctl.c

diff --git a/include/fbfam/fbfam.h b/include/fbfam/fbfam.h
index b5b7d1127a52..2cfe51d2b0d5 100644
--- a/include/fbfam/fbfam.h
+++ b/include/fbfam/fbfam.h
@@ -3,8 +3,12 @@
 #define _FBFAM_H_

 #include <linux/sched.h>
+#include <linux/sysctl.h>

 #ifdef CONFIG_FBFAM
+#ifdef CONFIG_SYSCTL
+extern struct ctl_table fbfam_sysctls[];
+#endif
 int fbfam_fork(struct task_struct *child);
 int fbfam_execve(void);
 int fbfam_exit(void);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 287862f91717..104b70c98251 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -77,6 +77,8 @@
 #include <linux/uaccess.h>
 #include <asm/processor.h>

+#include <fbfam/fbfam.h>
+
 #ifdef CONFIG_X86
 #include <asm/nmi.h>
 #include <asm/stacktrace.h>
@@ -2661,6 +2663,13 @@ static struct ctl_table kern_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+#endif
+#ifdef CONFIG_FBFAM
+	{
+		.procname	= "fbfam",
+		.mode		= 0555,
+		.child		= fbfam_sysctls,
+	},
 #endif
 	{ }
 };
diff --git a/security/fbfam/Makefile b/security/fbfam/Makefile
index f4b9f0b19c44..b8d5751ecea4 100644
--- a/security/fbfam/Makefile
+++ b/security/fbfam/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_FBFAM) += fbfam.o
+obj-$(CONFIG_SYSCTL) += sysctl.o
diff --git a/security/fbfam/fbfam.c b/security/fbfam/fbfam.c
index 0387f95f6408..9be4639b72eb 100644
--- a/security/fbfam/fbfam.c
+++ b/security/fbfam/fbfam.c
@@ -7,6 +7,17 @@
 #include <linux/refcount.h>
 #include <linux/slab.h>

+/**
+ * sysctl_crashing_rate_threshold - Crashing rate threshold.
+ *
+ * The rate's units are in milliseconds per fault.
+ *
+ * A fork brute force attack will be detected if the application's crashing rate
+ * falls under this threshold. So, the higher this value, the faster an attack
+ * will be detected.
+ */
+unsigned long sysctl_crashing_rate_threshold = 30000;
+
 /**
  * struct fbfam_stats - Fork brute force attack mitigation statistics.
  * @refc: Reference counter.
diff --git a/security/fbfam/sysctl.c b/security/fbfam/sysctl.c
new file mode 100644
index 000000000000..430323ad8e9f
--- /dev/null
+++ b/security/fbfam/sysctl.c
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/sysctl.h>
+
+extern unsigned long sysctl_crashing_rate_threshold;
+static unsigned long ulong_one = 1;
+static unsigned long ulong_max = ULONG_MAX;
+
+struct ctl_table fbfam_sysctls[] = {
+	{
+		.procname	= "crashing_rate_threshold",
+		.data		= &sysctl_crashing_rate_threshold,
+		.maxlen		= sizeof(sysctl_crashing_rate_threshold),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1		= &ulong_one,
+		.extra2		= &ulong_max,
+	},
+	{ }
+};
+
--
2.25.1


^ permalink raw reply related

* [RFC PATCH 2/9] security/fbfam: Add the api to manage statistics
From: John Wood @ 2020-09-06 15:22 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: John Wood

Create a statistical data structure to hold all the necessary
information involve in a fork brute force attack. This info is a
timestamp for the first fork or execve and the number of crashes since
then. Moreover, due to this statitistical data will be shared between
different tasks, a reference counter it is necessary.

For a correct management of an attack it is also necessary that all the
tasks hold statistical data. The same statistical data needs to be
shared between all the tasks that hold the same memory contents or in
other words, between all the tasks that have been forked without any
execve call.

When a forked task calls the execve system call, the memory contents are
set with new values. So, in this scenario the parent's statistical data
no need to be share. Instead, a new statistical data structure must be
allocated to start a new cycle.

The statistical data that every task holds needs to be clear when a task
exits. Due to this data is shared across multiples tasks, the reference
counter is useful to free the previous allocated data only when there
are not other pointers to the same data. Or in other words, when the
reference counter reaches zero.

So, based in all the previous information add the api to manage all the
commented cases.

Also, add to the struct task_struct a new field to point to the
statitistical data related to an attack. This way, all the tasks will
have access to this information.

Signed-off-by: John Wood <john.wood@gmx.com>
---
 include/fbfam/fbfam.h   |  18 +++++
 include/linux/sched.h   |   4 +
 security/Makefile       |   4 +
 security/fbfam/Makefile |   2 +
 security/fbfam/fbfam.c  | 163 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 191 insertions(+)
 create mode 100644 include/fbfam/fbfam.h
 create mode 100644 security/fbfam/Makefile
 create mode 100644 security/fbfam/fbfam.c

diff --git a/include/fbfam/fbfam.h b/include/fbfam/fbfam.h
new file mode 100644
index 000000000000..b5b7d1127a52
--- /dev/null
+++ b/include/fbfam/fbfam.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _FBFAM_H_
+#define _FBFAM_H_
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_FBFAM
+int fbfam_fork(struct task_struct *child);
+int fbfam_execve(void);
+int fbfam_exit(void);
+#else
+static inline int fbfam_fork(struct task_struct *child) { return 0; }
+static inline int fbfam_execve(void) { return 0; }
+static inline int fbfam_exit(void) { return 0; }
+#endif
+
+#endif /* _FBFAM_H_ */
+
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 93ecd930efd3..f3f0cc169204 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1315,6 +1315,10 @@ struct task_struct {
 	struct callback_head		mce_kill_me;
 #endif

+#ifdef CONFIG_FBFAM
+	struct fbfam_stats		*fbfam_stats;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/security/Makefile b/security/Makefile
index 3baf435de541..36dc4b536349 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -36,3 +36,7 @@ obj-$(CONFIG_BPF_LSM)			+= bpf/
 # Object integrity file lists
 subdir-$(CONFIG_INTEGRITY)		+= integrity
 obj-$(CONFIG_INTEGRITY)			+= integrity/
+
+# Object fbfam file lists
+subdir-$(CONFIG_FBFAM)			+= fbfam
+obj-$(CONFIG_FBFAM)			+= fbfam/
diff --git a/security/fbfam/Makefile b/security/fbfam/Makefile
new file mode 100644
index 000000000000..f4b9f0b19c44
--- /dev/null
+++ b/security/fbfam/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_FBFAM) += fbfam.o
diff --git a/security/fbfam/fbfam.c b/security/fbfam/fbfam.c
new file mode 100644
index 000000000000..0387f95f6408
--- /dev/null
+++ b/security/fbfam/fbfam.c
@@ -0,0 +1,163 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <asm/current.h>
+#include <fbfam/fbfam.h>
+#include <linux/errno.h>
+#include <linux/gfp.h>
+#include <linux/jiffies.h>
+#include <linux/refcount.h>
+#include <linux/slab.h>
+
+/**
+ * struct fbfam_stats - Fork brute force attack mitigation statistics.
+ * @refc: Reference counter.
+ * @faults: Number of crashes since jiffies.
+ * @jiffies: First fork or execve timestamp.
+ *
+ * The purpose of this structure is to manage all the necessary information to
+ * compute the crashing rate of an application. So, it holds a first fork or
+ * execve timestamp and a number of crashes since then. This way the crashing
+ * rate in milliseconds per fault can be compute when necessary with the
+ * following formula:
+ *
+ * u64 delta_jiffies = get_jiffies_64() - fbfam_stats::jiffies;
+ * u64 delta_time = jiffies64_to_msecs(delta_jiffies);
+ * u64 crashing_rate = delta_time / (u64)fbfam_stats::faults;
+ *
+ * If the fbfam_stats::faults is zero, the above formula can't be used. In this
+ * case, the crashing rate is zero.
+ *
+ * Moreover, since the same allocated structure will be used in every fork
+ * since the first one or execve, it's also necessary a reference counter.
+ */
+struct fbfam_stats {
+	refcount_t refc;
+	unsigned int faults;
+	u64 jiffies;
+};
+
+/**
+ * fbfam_new_stats() - Allocation of new statistics structure.
+ *
+ * If the allocation is successful the reference counter is set to one to
+ * indicate that there will be one task that points to this structure. The
+ * faults field is initialize to zero and the timestamp for this moment is set.
+ *
+ * Return: NULL if the allocation fails. A pointer to the new allocated
+ *         statistics structure if it success.
+ */
+static struct fbfam_stats *fbfam_new_stats(void)
+{
+	struct fbfam_stats *stats = kmalloc(sizeof(struct fbfam_stats),
+					    GFP_KERNEL);
+
+	if (stats) {
+		refcount_set(&stats->refc, 1);
+		stats->faults = 0;
+		stats->jiffies = get_jiffies_64();
+	}
+
+	return stats;
+}
+
+/*
+ * fbfam_fork() - Fork management.
+ * @child: Pointer to the child task that will be created with the fork system
+ *         call.
+ *
+ * For a correct management of a fork brute force attack it is necessary that
+ * all the tasks hold statistical data. The same statistical data needs to be
+ * shared between all the tasks that hold the same memory contents or in other
+ * words, between all the tasks that have been forked without any execve call.
+ *
+ * To ensure this, if the current task doesn't have statistical data when forks
+ * (only possible in the first fork of the zero task), it is mandatory to
+ * allocate a new one. This way, the child task always will share the statistics
+ * with its parent.
+ *
+ * Return: -ENOMEN if the allocation of the new statistics structure fails.
+ *         Zero otherwise.
+ */
+int fbfam_fork(struct task_struct *child)
+{
+	struct fbfam_stats **stats = &current->fbfam_stats;
+
+	if (!*stats) {
+		*stats = fbfam_new_stats();
+		if (!*stats)
+			return -ENOMEM;
+	}
+
+	refcount_inc(&(*stats)->refc);
+	child->fbfam_stats = *stats;
+	return 0;
+}
+
+/**
+ * fbfam_execve() - Execve management.
+ *
+ * When a forked task calls the execve system call, the memory contents are set
+ * with new values. So, in this scenario the parent's statistical data no need
+ * to be share. Instead, a new statistical data structure must be allocated to
+ * start a new cycle. This condition is detected when the statistics reference
+ * counter holds a value greater than or equal to two (a fork always sets the
+ * statistics reference counter to two since the parent and the child task are
+ * sharing the same data).
+ *
+ * However, if the execve function is called immediately after another execve
+ * call, althought the memory contents are reset, there is no need to allocate
+ * a new statistical data structure. This is possible because at this moment
+ * only one task (the task that calls the execve function) points to the data.
+ * In this case, the previous allocation is used and only the faults and time
+ * fields are reset.
+ *
+ * Return: -ENOMEN if the allocation of the new statistics structure fails.
+ *         -EFAULT if the current task doesn't have statistical data. Zero
+ *         otherwise.
+ */
+int fbfam_execve(void)
+{
+	struct fbfam_stats **stats = &current->fbfam_stats;
+
+	if (!*stats)
+		return -EFAULT;
+
+	if (!refcount_dec_not_one(&(*stats)->refc)) {
+		/* execve call after an execve call */
+		(*stats)->faults = 0;
+		(*stats)->jiffies = get_jiffies_64();
+		return 0;
+	}
+
+	/* execve call after a fork call */
+	*stats = fbfam_new_stats();
+	if (!*stats)
+		return -ENOMEM;
+
+	return 0;
+}
+
+/**
+ * fbfam_exit() - Exit management.
+ *
+ * The statistical data that every task holds needs to be clear when a task
+ * exits. Due to this data is shared across multiples tasks, the reference
+ * counter is useful to free the previous allocated data only when there are
+ * not other pointers to the same data. Or in other words, when the reference
+ * counter reaches zero.
+ *
+ * Return: -EFAULT if the current task doesn't have statistical data. Zero
+ *         otherwise.
+ */
+int fbfam_exit(void)
+{
+	struct fbfam_stats *stats = current->fbfam_stats;
+
+	if (!stats)
+		return -EFAULT;
+
+	if (refcount_dec_and_test(&stats->refc))
+		kfree(stats);
+
+	return 0;
+}
+
--
2.25.1


^ permalink raw reply related

* [RFC PATCH 3/9] security/fbfam: Use the api to manage statistics
From: John Wood @ 2020-09-06 15:22 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: John Wood

Use the previous defined api to manage statistics calling it accordingly
when a task forks, calls execve or exits.

Signed-off-by: John Wood <john.wood@gmx.com>
---
 fs/exec.c     | 2 ++
 kernel/exit.c | 2 ++
 kernel/fork.c | 4 ++++
 3 files changed, 8 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index a91003e28eaa..b30118674d32 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -71,6 +71,7 @@
 #include "internal.h"

 #include <trace/events/sched.h>
+#include <fbfam/fbfam.h>

 static int bprm_creds_from_file(struct linux_binprm *bprm);

@@ -1940,6 +1941,7 @@ static int bprm_execve(struct linux_binprm *bprm,
 	task_numa_free(current, false);
 	if (displaced)
 		put_files_struct(displaced);
+	fbfam_execve();
 	return retval;

 out:
diff --git a/kernel/exit.c b/kernel/exit.c
index 733e80f334e7..39a6139dcf31 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -67,6 +67,7 @@
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
 #include <asm/mmu_context.h>
+#include <fbfam/fbfam.h>

 static void __unhash_process(struct task_struct *p, bool group_dead)
 {
@@ -852,6 +853,7 @@ void __noreturn do_exit(long code)
 		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
 	exit_rcu();
 	exit_tasks_rcu_finish();
+	fbfam_exit();

 	lockdep_free_task(tsk);
 	do_task_dead();
diff --git a/kernel/fork.c b/kernel/fork.c
index 4d32190861bd..088aa5b62634 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -107,6 +107,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/task.h>

+#include <fbfam/fbfam.h>
+
 /*
  * Minimum number of threads to boot the kernel
  */
@@ -941,6 +943,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
 #endif
+
+	fbfam_fork(tsk);
 	return tsk;

 free_stack:
--
2.25.1


^ permalink raw reply related

* [RFC PATCH 1/9] security/fbfam: Add a Kconfig to enable the fbfam feature
From: John Wood @ 2020-09-06 15:21 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: John Wood

Add a menu entry under "Security options" to enable the "Fork brute
force attack mitigation" feature.

Signed-off-by: John Wood <john.wood@gmx.com>
---
 security/Kconfig       |  1 +
 security/fbfam/Kconfig | 10 ++++++++++
 2 files changed, 11 insertions(+)
 create mode 100644 security/fbfam/Kconfig

diff --git a/security/Kconfig b/security/Kconfig
index 7561f6f99f1d..00a90e25b8d5 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -290,6 +290,7 @@ config LSM
 	  If unsure, leave this as the default.

 source "security/Kconfig.hardening"
+source "security/fbfam/Kconfig"

 endmenu

diff --git a/security/fbfam/Kconfig b/security/fbfam/Kconfig
new file mode 100644
index 000000000000..bbe7f6aad369
--- /dev/null
+++ b/security/fbfam/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0
+config FBFAM
+	bool "Fork brute force attack mitigation"
+	default n
+	help
+	  This is a user defense that detects any fork brute force attack
+	  based on the application's crashing rate. When this measure is
+	  triggered the fork system call is blocked.
+
+	  If you are unsure how to answer this question, answer N.
--
2.25.1


^ permalink raw reply related

* Re: [RFC PATCH 0/9] Fork brute force attack mitigation (fbfam)
From: John Wood @ 2020-09-06 14:13 UTC (permalink / raw)
  To: linux-security-module
In-Reply-To: <20200906121544.4204-1-john.wood@gmx.com>

Hi. I have problems with my email account to send the patch serie  [1] to
all the recipients. My account blocks if I send an email to a big amount of
recipients.

Please don't reply to this message.
Thanks

[1] https://lore.kernel.org/kernel-hardening/20200906121544.4204-1-john.wood@gmx.com/

^ permalink raw reply

* [RFC PATCH 0/9] Fork brute force attack mitigation (fbfam)
From: John Wood @ 2020-09-06 12:15 UTC (permalink / raw)
  To: Kees Cook
  Cc: John Wood, Jonathan Corbet, Alexander Viro, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Luis Chamberlain,
	Iurii Zaikin, James Morris, Serge E. Hallyn, linux-doc,
	linux-kernel, linux-fsdevel, linux-security-module,
	kernel-hardening

The goal of this patch serie is to detect and mitigate a fork brute force
attack.

Attacks with the purpose to break ASLR or bypass canaries traditionaly use
some level of brute force with the help of the fork system call. This is
possible since when creating a new process using fork its memory contents
are the same as those of the parent process (the process that called the
fork system call). So, the attacker can test the memory infinite times to
find the correct memory values or the correct memory addresses without
worrying about crashing the application.

Based on the above scenario it would be nice to have this detected and
mitigated, and this is the goal of this implementation.

Other implementations
---------------------

The public version of grsecurity, as a summary, is based on the idea of
delay the fork system call if a child died due to a fatal error. This has
some issues:

1.- Bad practices: Add delays to the kernel is, in general, a bad idea.

2.- Weak points: This protection can be bypassed using two different
    methods since it acts only when the fork is called after a child has
    crashed.

    2.1.- Bypass 1: So, it would still be possible for an attacker to fork
	  a big amount of children (in the order of thousands), then probe
	  all of them, and finally wait the protection time before repeat
	  the steps.

    2.2.- Bypass 2: This method is based on the idea that the protection
	  doesn't act if the parent crashes. So, it would still be possible
	  for an attacker to fork a process and probe itself. Then, fork
	  the child process and probe itself again. This way, these steps
	  can be repeated infinite times without any mitigation.

This implementation
-------------------

The main idea behind this implementation is to improve the existing ones
focusing on the weak points annotated before. So, the solution for the
first bypass method is to detect a fast crash rate instead of only one
simple crash. For the second bypass method the solution is to detect both
the crash of parent and child processes. Moreover, as a mitigation method
it is better to kill all the offending tasks involve in the attack instead
of use delays.

So, the solution to the two bypass methods previously commented is to use
some statistical data shared across all the processes that can have the
same memory contents. Or in other words, a statistical data shared between
all the processes that fork the task 0, and all the processes that fork
after an execve system call.

These statistics hold the timestamp for the first fork (case of a fork of
task zero) or the timestamp for the execve system call (the other case).
Also, hold the number of faults of all the tasks that share the same
statistical data since the commented timestamp.

With this information it is possible to detect a brute force attack when a
task die in a fatal way computing the crashing rate. This rate shows the
milliseconds per fault and when it goes under a certain threshold there is
a clear signal that something malicious is happening.

Once detected, the mitigation only kills the processes that share the same
statistical data and so, all the tasks that can have the same memory
contents. This way, an attack is rejected.

The fbfam feature can be enabled, disabled and tuned as follows:

1.- Per system enabling: This feature can be enabled in build time using
    the config application under:

    Security options  --->  Fork brute force attack mitigation

2.- Per process enabling/disabling: To allow that specific applications can
    turn off or turn on the detection and mitigation of a fork brute force
    attack when required, there are two new prctls.

    prctl(PR_FBFAM_ENABLE, 0, 0, 0, 0)  -> To enable the feature
    prctl(PR_FBFAM_DISABLE, 0, 0, 0, 0) -> To disable the feature

    Both functions return zero on success and -EFAULT if the current task
    doesn't have statistical data.

3.- Fine tuning: To customize the detection's sensibility there is a new
    sysctl that allows to set the crashing rate threshold. It is accessible
    through the file:

    /proc/sys/kernel/fbfam/crashing_rate_threshold

    The units are in milliseconds per fault and the attack's mitigation is
    triggered if the crashing rate of an application goes under this
    threshold. So, the higher this value, the faster an attack will be
    detected.

So, knowing all this information I will explain now the different patches:

The 1/9 patch adds a new config for the fbfam feature.

The 2/9 and 3/9 patches add and use the api to manage the statistical data
necessary to compute the crashing rate of an application.

The 4/9 patch adds a new sysctl to fine tuning the detection's sensibility.

The 5/9 patch detects a fork brute force attack calculating the crashing
rate.

The 6/9 patch mitigates the attack killing all the offending tasks.

The 7/9 patch adds two new prctls to allow per task enabling/disabling.

The 8/9 patch adds general documentation.

The 9/9 patch adds an entry to the maintainers list.

This patch series is a task of the KSPP [1] and it is worth to mention
that there is a previous attempt without any continuation [2].

[1] https://github.com/KSPP/linux/issues/39
[2] https://lore.kernel.org/linux-fsdevel/1419457167-15042-1-git-send-email-richard@nod.at/

Any constructive comments are welcome.

Note: During the compilation these warnings were shown:

kernel/exit.o: warning: objtool: __x64_sys_exit_group()+0x18: unreachable instruction
arch/x86/kernel/cpu/mce/core.o: warning: objtool: mce_panic()+0x123: unreachable instruction
arch/x86/kernel/smpboot.o: warning: objtool: native_play_dead()+0x122: unreachable instruction
net/core/skbuff.o: warning: objtool: skb_push.cold()+0x14: unreachable instruction

John Wood (9):
  security/fbfam: Add a Kconfig to enable the fbfam feature
  security/fbfam: Add the api to manage statistics
  security/fbfam: Use the api to manage statistics
  security/fbfam: Add a new sysctl to control the crashing rate
    threshold
  security/fbfam: Detect a fork brute force attack
  security/fbfam: Mitigate a fork brute force attack
  security/fbfam: Add two new prctls to enable and disable the fbfam
    feature
  Documentation/security: Add documentation for the fbfam feature
  MAINTAINERS: Add a new entry for the fbfam feature

 Documentation/security/fbfam.rst | 111 +++++++++++
 Documentation/security/index.rst |   1 +
 MAINTAINERS                      |   7 +
 fs/coredump.c                    |   2 +
 fs/exec.c                        |   2 +
 include/fbfam/fbfam.h            |  29 +++
 include/linux/sched.h            |   4 +
 include/uapi/linux/prctl.h       |   4 +
 kernel/exit.c                    |   2 +
 kernel/fork.c                    |   4 +
 kernel/sys.c                     |   8 +
 kernel/sysctl.c                  |   9 +
 security/Kconfig                 |   1 +
 security/Makefile                |   4 +
 security/fbfam/Kconfig           |  10 +
 security/fbfam/Makefile          |   3 +
 security/fbfam/fbfam.c           | 329 +++++++++++++++++++++++++++++++
 security/fbfam/sysctl.c          |  20 ++
 18 files changed, 550 insertions(+)
 create mode 100644 Documentation/security/fbfam.rst
 create mode 100644 include/fbfam/fbfam.h
 create mode 100644 security/fbfam/Kconfig
 create mode 100644 security/fbfam/Makefile
 create mode 100644 security/fbfam/fbfam.c
 create mode 100644 security/fbfam/sysctl.c

--
2.25.1

^ permalink raw reply

* Re: [PATCH v20 19/23] LSM: Verify LSM display sanity in binder
From: Paul Moore @ 2020-09-06  3:30 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: casey.schaufler, James Morris, linux-security-module, selinux,
	linux-audit, keescook, john.johansen, penguin-kernel,
	Stephen Smalley
In-Reply-To: <20200826145247.10029-20-casey@schaufler-ca.com>

On Wed, Aug 26, 2020 at 11:22 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> Verify that the tasks on the ends of a binder transaction
> use the same "display" security module. This prevents confusion
> of security "contexts".
>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
> ---
>  security/security.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)

Acked-by: Paul Moore <paul@paul-moore.com>

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH v20 18/23] NET: Store LSM netlabel data in a lsmblob
From: Paul Moore @ 2020-09-06  3:27 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: casey.schaufler, James Morris, linux-security-module, selinux,
	linux-audit, keescook, john.johansen, penguin-kernel,
	Stephen Smalley, netdev
In-Reply-To: <20200826145247.10029-19-casey@schaufler-ca.com>

On Wed, Aug 26, 2020 at 11:21 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> Netlabel uses LSM interfaces requiring an lsmblob and
> the internal storage is used to pass information between
> these interfaces, so change the internal data from a secid
> to a lsmblob. Update the netlabel interfaces and their
> callers to accommodate the change. This requires that the
> modules using netlabel use the lsm_id.slot to access the
> correct secid when using netlabel.
>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Reviewed-by: John Johansen <john.johansen@canonical.com>
> Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
> Cc: netdev@vger.kernel.org
> ---
>  include/net/netlabel.h              |  8 +--
>  net/ipv4/cipso_ipv4.c               | 27 ++++++----
>  net/netlabel/netlabel_kapi.c        |  6 +--
>  net/netlabel/netlabel_unlabeled.c   | 79 +++++++++--------------------
>  net/netlabel/netlabel_unlabeled.h   |  2 +-
>  security/selinux/hooks.c            |  2 +-
>  security/selinux/include/security.h |  1 +
>  security/selinux/netlabel.c         |  2 +-
>  security/selinux/ss/services.c      |  4 +-
>  security/smack/smack.h              |  1 +
>  security/smack/smack_lsm.c          |  5 +-
>  security/smack/smackfs.c            | 10 ++--
>  12 files changed, 65 insertions(+), 82 deletions(-)

Minor change suggested to a comment below, but looks good otherwise.

Acked-by: Paul Moore <paul@paul-moore.com>

> diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
> index 2eb71579f4d2..8182b923e802 100644
> --- a/net/ipv4/cipso_ipv4.c
> +++ b/net/ipv4/cipso_ipv4.c
> @@ -106,15 +106,17 @@ int cipso_v4_rbm_strictvalid = 1;
>  /* Base length of the local tag (non-standard tag).
>   *  Tag definition (may change between kernel versions)
>   *
> - * 0          8          16         24         32
> - * +----------+----------+----------+----------+
> - * | 10000000 | 00000110 | 32-bit secid value  |
> - * +----------+----------+----------+----------+
> - * | in (host byte order)|
> - * +----------+----------+
> - *
> + * 0          8          16                    16 + sizeof(struct lsmblob)
> + * +----------+----------+---------------------+
> + * | 10000000 | 00000110 | LSM blob data       |
> + * +----------+----------+---------------------+
> + *
> + * All secid and flag fields are in host byte order.
> + * The lsmblob structure size varies depending on which
> + * Linux security modules are built in the kernel.
> + * The data is opaque.
>   */
> -#define CIPSO_V4_TAG_LOC_BLEN         6
> +#define CIPSO_V4_TAG_LOC_BLEN         (2 + sizeof(struct lsmblob))
>
>  /*
>   * Helper Functions
> @@ -1469,7 +1471,12 @@ static int cipso_v4_gentag_loc(const struct cipso_v4_doi *doi_def,
>
>         buffer[0] = CIPSO_V4_TAG_LOCAL;
>         buffer[1] = CIPSO_V4_TAG_LOC_BLEN;
> -       *(u32 *)&buffer[2] = secattr->attr.secid;
> +       /* Ensure that there is sufficient space in the CIPSO header
> +        * for the LSM data. This should never become an issue.
> +        * The check is made from an abundance of caution. */
> +       BUILD_BUG_ON(CIPSO_V4_TAG_LOC_BLEN > CIPSO_V4_OPT_LEN_MAX);

I like the BUILD_BUG_ON() check, but for reasons very similar to the
unix_skb_params changes I don't really like the "should never become
an issue" commentary.  Granted, it is unlikely, but I don't think it
is wise to thumb our nose at fate.

> +       memcpy(&buffer[2], &secattr->attr.lsmblob,
> +              sizeof(secattr->attr.lsmblob));
>
>         return CIPSO_V4_TAG_LOC_BLEN;
>  }

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH v20 17/23] LSM: security_secid_to_secctx in netlink netfilter
From: Paul Moore @ 2020-09-06  3:11 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: casey.schaufler, James Morris, linux-security-module, selinux,
	linux-audit, keescook, john.johansen, penguin-kernel,
	Stephen Smalley, netdev
In-Reply-To: <20200826145247.10029-18-casey@schaufler-ca.com>

On Wed, Aug 26, 2020 at 11:20 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> Change netlink netfilter interfaces to use lsmcontext
> pointers, and remove scaffolding.
>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Reviewed-by: John Johansen <john.johansen@canonical.com>
> Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
> cc: netdev@vger.kernel.org
> ---
>  net/netfilter/nfnetlink_queue.c | 31 ++++++++++++-------------------
>  1 file changed, 12 insertions(+), 19 deletions(-)

...

> diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
> index d3f8e808c5d3..c830401f7792 100644
> --- a/net/netfilter/nfnetlink_queue.c
> +++ b/net/netfilter/nfnetlink_queue.c
> @@ -401,8 +399,7 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue,
>         enum ip_conntrack_info ctinfo;
>         struct nfnl_ct_hook *nfnl_ct;
>         bool csum_verify;
> -       struct lsmcontext scaff; /* scaffolding */
> -       char *secdata = NULL;
> +       struct lsmcontext context = { };
>         u32 seclen = 0;
>
>         size = nlmsg_total_size(sizeof(struct nfgenmsg))
> @@ -469,7 +466,7 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue,
>         }
>
>         if ((queue->flags & NFQA_CFG_F_SECCTX) && entskb->sk) {
> -               seclen = nfqnl_get_sk_secctx(entskb, &secdata);
> +               seclen = nfqnl_get_sk_secctx(entskb, &context);
>                 if (seclen)
>                         size += nla_total_size(seclen);
>         }

I think we can get rid of the local "seclen" variable, right?  We can
embed the nfqnl_get_sk_secctx() in the conditional and then simply
reference "context.len" everywhere else, yes?  For example:

  if (nfqnl_get_sk_secctx(..., &context))
    size += nla_total_size(context.len);

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH v20 16/23] LSM: Use lsmcontext in security_inode_getsecctx
From: Paul Moore @ 2020-09-06  2:55 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: casey.schaufler, James Morris, linux-security-module, selinux,
	linux-audit, keescook, john.johansen, penguin-kernel,
	Stephen Smalley
In-Reply-To: <20200826145247.10029-17-casey@schaufler-ca.com>

On Wed, Aug 26, 2020 at 11:19 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> Change the security_inode_getsecctx() interface to fill
> a lsmcontext structure instead of data and length pointers.
> This provides the information about which LSM created the
> context so that security_release_secctx() can use the
> correct hook.
>
> Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
> ---
>  fs/nfsd/nfs4xdr.c        | 23 +++++++++--------------
>  include/linux/security.h |  5 +++--
>  security/security.c      | 13 +++++++++++--
>  3 files changed, 23 insertions(+), 18 deletions(-)

Acked-by: Paul Moore <paul@paul-moore.com>

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox