Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Song Liu @ 2019-07-30  5:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Kees Cook, linux-security@vger.kernel.org,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API
In-Reply-To: <77354A95-4107-41A7-8936-D144F01C3CA4@fb.com>

Hi Andy, 

> On Jul 27, 2019, at 11:20 AM, Song Liu <songliubraving@fb.com> wrote:
> 
> Hi Andy, 
> 
>>>>> 
>>>> 
>>>> Well, yes. sys_bpf() is pretty powerful. 
>>>> 
>>>> The goal of /dev/bpf is to enable special users to call sys_bpf(). In 
>>>> the meanwhile, such users should not take down the whole system easily
>>>> by accident, e.g., with rm -rf /.
>>> 
>>> That’s easy, though — bpftool could learn to read /etc/bpfusers before allowing ruid != 0.
>> 
>> This is a great idea! fscaps + /etc/bpfusers should do the trick. 
> 
> After some discussions and more thinking on this, I have some concerns 
> with the user space only approach.  
> 
> IIUC, your proposal for user space only approach is like: 
> 
> 1. bpftool (and other tools) check /etc/bpfusers and only do 
>   setuid for allowed users:
> 
> 	int main()
> 	{
> 		if (/* uid in /etc/bpfusers */)
> 			setuid(0);
> 		sys_bpf(...);
> 	}
> 
> 2. bpftool (and other tools) is installed with CAP_SETUID:
> 
> 	setcap cap_setuid=e+p /bin/bpftool
> 
> 3. sys admin maintains proper /etc/bpfusers. 
> 
> This approach is not ideal, because we need to trust the tool to give 
> it CAP_SETUID. A hacked tool could easily bypass /etc/bpfusers check
> or use other root only sys calls after setuid(0). 
> 

I would like more comments on this. 

Currently, bpf permission is more or less "root or nothing", which we 
would like to change. 

The short term goal is to separate bpf from root, in other words, it is 
"all or nothing". Special user space utilities, such as systemd, would
benefit from this. Once this is implemented, systemd can call sys_bpf() 
when it is not running as root. 

In longer term, it may be useful to provide finer grain permission of 
sys_bpf(). For example, sys_bpf() should be aware of containers; and
user may only have access to certain bpf maps. Let's call this 
"fine grain" capability. 

Since we are seeing new use cases every year, we will need many 
iterations to implement the fine grain permission. I think we need an 
API that is flexible enough to cover different types of permission 
control. 

For example, bpf_with_cap() can be flexible:

	bpf_with_cap(cmd, attr, size, perm_fd);

We can get different types of permission via different combinations of 
arguments:

    A perm_fd to /dev/bpf gives access to all sys_bpf() commands, so 
    this is "all or nothing" permission. 

    A perm_fd to /sys/fs/cgroup/.../bpf.xxx would only allow some 
    commands to this specific cgroup. 

Alexei raised another idea in offline discussions: instead of adding
bpf_with_cap(), we add a command LOAD_PERM_FD, which enables special
permission for the _next_ sys_bpf() from current task:

    bpf(LOAD_PERM_FD, perm_fd);
    /* the next sys_bpf() uses permission from perm_fd */
    bpf(cmd, attr, size);

This is equivalent to bpf_with_cap(cmd, attr, size, perm_fd), but 
doesn't require the new sys call. 

For both these ideas, we will start with /dev/bpf. As we grow the
fine grain permission control, fewer users/processes will need access 
to /dev/bpf. 

Please let us know your thought on this. Would this make /dev/bpf 
more reasonable? :-)

A few notes for previous discussions:

1. User space only approach doesn't work, even for "all or nothing" 
   permission control. I expanded the discussion in the previous
   email. Please let me know if I missed anything there. 

2. Permission control only at BPF_PROG_ATTACH time is not sufficient.
   We need permission control during BPF_PROG_LOAD, e.g., is_priv in 
   the verifier. 

Thanks,
Song

^ permalink raw reply

* Re: [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing
From: Joel Fernandes @ 2019-07-30 13:06 UTC (permalink / raw)
  To: LKML
  Cc: Alexey Dobriyan, Andrew Morton, Brendan Gregg, Christian Hansen,
	Daniel Colascione, Florian Mayer, John Dias, Joel Fernandes,
	Jonathan Corbet, Kees Cook, kernel-team, Linux API,
	open list:DOCUMENTATION, Linux FS Devel, linux-mm, Michal Hocko,
	Mike Rapoport, Minchan Kim, Namhyung Kim, Roman Gushchin
In-Reply-To: <20190726152319.134152-1-joel@joelfernandes.org>

On Fri, Jul 26, 2019 at 11:23 AM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> Looking up PFN from pagemap in Android devices is not supported by
> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
>
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> looking up PFN through pagemap is not needed since the interface uses
> virtual frame numbers, and at the same time also does not require
> SYS_ADMIN.
>
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time. This method solves the security issue
> with userspace learning the PFN, and while at it is also shown to yield
> better results than the pagemap lookup, the theory being that the window
> where the address space can change is reduced by eliminating the
> intermediate pagemap look up stage. In virtual address indexing, the
> process's mmap_sem is held for the duration of the access.
>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>
> ---
> v2->v3:
> Fixed a bug where I was doing a kfree that is not needed due to not
> needing to do GFP_ATOMIC allocations.
>
> v1->v2:
> Mark swap ptes as idle (Minchan)
> Avoid need for GFP_ATOMIC (Andrew)
> Get rid of idle_page_list lock by moving list to stack

I believe all suggestions have been addressed.  Do these look good now?

thanks,

 - Joel



> Internal review -> v1:
> Fixes from Suren.
> Corrections to change log, docs (Florian, Sandeep)
>
>  fs/proc/base.c            |   3 +
>  fs/proc/internal.h        |   1 +
>  fs/proc/task_mmu.c        |  57 +++++++
>  include/linux/page_idle.h |   4 +
>  mm/page_idle.c            | 340 +++++++++++++++++++++++++++++++++-----
>  5 files changed, 360 insertions(+), 45 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 77eb628ecc7f..a58dd74606e9 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
>         REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
>         REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),

^ permalink raw reply

* [PATCH] tracefs: Restrict tracefs when the kernel is locked down
From: Matthew Garrett @ 2019-07-30 18:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: jmorris, linux-security-module, linux-kernel, linux-api,
	Matthew Garrett, Matthew Garrett
In-Reply-To: <20190724222354.7cbd6c6e@oasis.local.home>

Tracefs may release more information about the kernel than desirable, so
restrict it when the kernel is locked down in confidentiality mode by
preventing open().

Signed-off-by: Matthew Garrett <mjg59@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---

Added the iput()

 fs/tracefs/inode.c           | 40 +++++++++++++++++++++++++++++++++++-
 include/linux/security.h     |  1 +
 security/lockdown/lockdown.c |  1 +
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index eeeae0475da9..fb4d1d89ce53 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -20,6 +20,7 @@
 #include <linux/parser.h>
 #include <linux/magic.h>
 #include <linux/slab.h>
+#include <linux/security.h>
 
 #define TRACEFS_DEFAULT_MODE	0700
 
@@ -27,6 +28,23 @@ static struct vfsmount *tracefs_mount;
 static int tracefs_mount_count;
 static bool tracefs_registered;
 
+static int default_open_file(struct inode *inode, struct file *filp)
+{
+	struct dentry *dentry = filp->f_path.dentry;
+	struct file_operations *real_fops;
+	int ret;
+
+	if (!dentry)
+		return -EINVAL;
+
+	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	if (ret)
+		return ret;
+
+	real_fops = dentry->d_fsdata;
+	return real_fops->open(inode, filp);
+}
+
 static ssize_t default_read_file(struct file *file, char __user *buf,
 				 size_t count, loff_t *ppos)
 {
@@ -221,6 +239,12 @@ static int tracefs_apply_options(struct super_block *sb)
 	return 0;
 }
 
+static void tracefs_destroy_inode(struct inode *inode)
+{
+	if (S_ISREG(inode->i_mode))
+		kfree(inode->i_fop);
+}
+
 static int tracefs_remount(struct super_block *sb, int *flags, char *data)
 {
 	int err;
@@ -256,6 +280,7 @@ static int tracefs_show_options(struct seq_file *m, struct dentry *root)
 
 static const struct super_operations tracefs_super_operations = {
 	.statfs		= simple_statfs,
+	.destroy_inode  = tracefs_destroy_inode,
 	.remount_fs	= tracefs_remount,
 	.show_options	= tracefs_show_options,
 };
@@ -387,6 +412,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
 				   struct dentry *parent, void *data,
 				   const struct file_operations *fops)
 {
+	struct file_operations *proxy_fops;
 	struct dentry *dentry;
 	struct inode *inode;
 
@@ -402,8 +428,20 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
 	if (unlikely(!inode))
 		return failed_creating(dentry);
 
+	proxy_fops = kzalloc(sizeof(struct file_operations), GFP_KERNEL);
+	if (unlikely(!proxy_fops)) {
+		iput(inode);
+		return failed_creating(dentry);
+	}
+
+	if (!fops)
+		fops = &tracefs_file_operations;
+
+	dentry->d_fsdata = (void *)fops;
+	memcpy(proxy_fops, fops, sizeof(*proxy_fops));
+	proxy_fops->open = default_open_file;
 	inode->i_mode = mode;
-	inode->i_fop = fops ? fops : &tracefs_file_operations;
+	inode->i_fop = proxy_fops;
 	inode->i_private = data;
 	d_instantiate(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
diff --git a/include/linux/security.h b/include/linux/security.h
index d92323b44a3f..807dc0d24982 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -121,6 +121,7 @@ enum lockdown_reason {
 	LOCKDOWN_KPROBES,
 	LOCKDOWN_BPF_READ,
 	LOCKDOWN_PERF,
+	LOCKDOWN_TRACEFS,
 	LOCKDOWN_CONFIDENTIALITY_MAX,
 };
 
diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
index 88064ce1c844..173191562047 100644
--- a/security/lockdown/lockdown.c
+++ b/security/lockdown/lockdown.c
@@ -36,6 +36,7 @@ static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
 	[LOCKDOWN_KPROBES] = "use of kprobes",
 	[LOCKDOWN_BPF_READ] = "use of bpf to read kernel RAM",
 	[LOCKDOWN_PERF] = "unsafe use of perf",
+	[LOCKDOWN_TRACEFS] = "use of tracefs",
 	[LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
 };
 
-- 
2.22.0.770.g0f2c4a37fd-goog

^ permalink raw reply related

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-07-30 20:20 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, linux-security@vger.kernel.org,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API
In-Reply-To: <77354A95-4107-41A7-8936-D144F01C3CA4@fb.com>

On Sat, Jul 27, 2019 at 11:20 AM Song Liu <songliubraving@fb.com> wrote:
>
> Hi Andy,
>
> >>>>
> >>>
> >>> Well, yes. sys_bpf() is pretty powerful.
> >>>
> >>> The goal of /dev/bpf is to enable special users to call sys_bpf(). In
> >>> the meanwhile, such users should not take down the whole system easily
> >>> by accident, e.g., with rm -rf /.
> >>
> >> That’s easy, though — bpftool could learn to read /etc/bpfusers before allowing ruid != 0.
> >
> > This is a great idea! fscaps + /etc/bpfusers should do the trick.
>
> After some discussions and more thinking on this, I have some concerns
> with the user space only approach.
>
> IIUC, your proposal for user space only approach is like:
>
> 1. bpftool (and other tools) check /etc/bpfusers and only do
>    setuid for allowed users:
>
>         int main()
>         {
>                 if (/* uid in /etc/bpfusers */)
>                         setuid(0);
>                 sys_bpf(...);
>         }
>
> 2. bpftool (and other tools) is installed with CAP_SETUID:
>
>         setcap cap_setuid=e+p /bin/bpftool
>

You have this a bit backwards.  You wouldn't use CAP_SETUID.  You
would use the setuid *mode* bit, i.e. chmod 4111 (or 4100 and use ACLs
to further lock it down).  Or you could use setcap cap_sys_admin=p,
although the details vary.  It woks a bit like this:

First, if you are running with elevated privilege due to SUID or
fscaps, the kernel and glibc offer you a degree of protection: you are
protected from ptrace(), LD_PRELOAD, etc.  You are *not* protected
from yourself.  For example, you may be running in a context in which
an attacker has malicious values in your environment variables, CWD,
etc.  Do you need to carefully decide whether you are willing to run
with elevated privilege on behalf of the user, which you learn like
this:

uid_t real_uid = getuid();

Your decision may may depend on command-line arguments as well (i.e.
you might want to allow tracing but not filtering, say).  Once you've
made this decision, the details vary:

For SUID, you either continue to run with euid == 0, or you drop
privilege using something like:

if (setresuid(real_uid, real_uid, real_uid) != 0) {
 /* optionally print an error to stderr */
 exit(1);
}

For fscaps, if you want to be privileged, you use something like
capng_update(); capng_apply() to set CAP_SYS_ADMIN to be effective
when you want privilege.  If you want to be unprivileged (because
bpfusers says so, for example), you could use capng_update() to drop
CAP_SYS_ADMIN entirely and see if the calls still work without
privilege.  But this is a little bit awkward, since you don't directly
know whether the user that invoked you in the first place had
CAP_SYS_ADMIN to begin with.

In general, SUID is a bit easier to work with.

> This approach is not ideal, because we need to trust the tool to give
> it CAP_SETUID. A hacked tool could easily bypass /etc/bpfusers check
> or use other root only sys calls after setuid(0).

How?  The whole SUID mechanism is designed fairly carefully to prevent
this.  /bin/sudo is likely to be SUID on your system, but you can't
just "hack" it to become root.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-07-30 20:24 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <369476A8-4CE1-43DA-9239-06437C0384C7@fb.com>

On Mon, Jul 29, 2019 at 10:07 PM Song Liu <songliubraving@fb.com> wrote:
>
> Hi Andy,
>
> > On Jul 27, 2019, at 11:20 AM, Song Liu <songliubraving@fb.com> wrote:
> >
> > Hi Andy,
> >
> >>>>>
> >>>>
> >>>> Well, yes. sys_bpf() is pretty powerful.
> >>>>
> >>>> The goal of /dev/bpf is to enable special users to call sys_bpf(). In
> >>>> the meanwhile, such users should not take down the whole system easily
> >>>> by accident, e.g., with rm -rf /.
> >>>
> >>> That’s easy, though — bpftool could learn to read /etc/bpfusers before allowing ruid != 0.
> >>
> >> This is a great idea! fscaps + /etc/bpfusers should do the trick.
> >
> > After some discussions and more thinking on this, I have some concerns
> > with the user space only approach.
> >
> > IIUC, your proposal for user space only approach is like:
> >
> > 1. bpftool (and other tools) check /etc/bpfusers and only do
> >   setuid for allowed users:
> >
> >       int main()
> >       {
> >               if (/* uid in /etc/bpfusers */)
> >                       setuid(0);
> >               sys_bpf(...);
> >       }
> >
> > 2. bpftool (and other tools) is installed with CAP_SETUID:
> >
> >       setcap cap_setuid=e+p /bin/bpftool
> >
> > 3. sys admin maintains proper /etc/bpfusers.
> >
> > This approach is not ideal, because we need to trust the tool to give
> > it CAP_SETUID. A hacked tool could easily bypass /etc/bpfusers check
> > or use other root only sys calls after setuid(0).
> >
>
> I would like more comments on this.
>
> Currently, bpf permission is more or less "root or nothing", which we
> would like to change.
>
> The short term goal is to separate bpf from root, in other words, it is
> "all or nothing". Special user space utilities, such as systemd, would
> benefit from this. Once this is implemented, systemd can call sys_bpf()
> when it is not running as root.

As generally nasty as Linux capabilities are, this sounds like a good
use for CAP_BPF_ADMIN.

But what do you have in mind?  Isn't non-root systemd mostly just the
user systemd session?  That should *not* have bpf() privileges until
bpf() is improved such that you can't use it to compromise the system.

>
> In longer term, it may be useful to provide finer grain permission of
> sys_bpf(). For example, sys_bpf() should be aware of containers; and
> user may only have access to certain bpf maps. Let's call this
> "fine grain" capability.
>
>
> Since we are seeing new use cases every year, we will need many
> iterations to implement the fine grain permission. I think we need an
> API that is flexible enough to cover different types of permission
> control.
>
> For example, bpf_with_cap() can be flexible:
>
>         bpf_with_cap(cmd, attr, size, perm_fd);
>
> We can get different types of permission via different combinations of
> arguments:
>
>     A perm_fd to /dev/bpf gives access to all sys_bpf() commands, so
>     this is "all or nothing" permission.
>
>     A perm_fd to /sys/fs/cgroup/.../bpf.xxx would only allow some
>     commands to this specific cgroup.
>

I don't see why you need to invent a whole new mechanism for this.
The entire cgroup ecosystem outside bpf() does just fine using the
write permission on files in cgroupfs to control access.  Why can't
bpf() do the same thing?

>
> Alexei raised another idea in offline discussions: instead of adding
> bpf_with_cap(), we add a command LOAD_PERM_FD, which enables special
> permission for the _next_ sys_bpf() from current task:
>
>     bpf(LOAD_PERM_FD, perm_fd);
>     /* the next sys_bpf() uses permission from perm_fd */
>     bpf(cmd, attr, size);
>
> This is equivalent to bpf_with_cap(cmd, attr, size, perm_fd), but
> doesn't require the new sys call.

That sounds almost every bit as problematic as the approach where you
ask for permission once and it sticks.

>
> 1. User space only approach doesn't work, even for "all or nothing"
>    permission control. I expanded the discussion in the previous
>    email. Please let me know if I missed anything there.

As in my previous email, I disagree.

^ permalink raw reply

* Re: [PATCH] tracefs: Restrict tracefs when the kernel is locked down
From: Steven Rostedt @ 2019-07-31  1:48 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: jmorris, linux-security-module, linux-kernel, linux-api,
	Matthew Garrett
In-Reply-To: <20190730184734.202386-1-matthewgarrett@google.com>

On Tue, 30 Jul 2019 11:47:34 -0700
Matthew Garrett <matthewgarrett@google.com> wrote:

> Tracefs may release more information about the kernel than desirable, so
> restrict it when the kernel is locked down in confidentiality mode by
> preventing open().
> 
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>

Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve

^ permalink raw reply

* Re: [PATCH bpf-next v10 10/10] landlock: Add user and kernel documentation for Landlock
From: Randy Dunlap @ 2019-07-31  1:53 UTC (permalink / raw)
  To: Mickaël Salaün, linux-kernel
  Cc: Alexander Viro, Alexei Starovoitov, Andrew Morton,
	Andy Lutomirski, Arnaldo Carvalho de Melo, Casey Schaufler,
	Daniel Borkmann, David Drysdale, David S . Miller,
	Eric W . Biederman, James Morris, Jann Horn, John Johansen,
	Jonathan Corbet, Kees Cook, Michael Kerrisk,
	Mickaël Salaün, Paul Moore, Sargun Dhillon
In-Reply-To: <20190721213116.23476-11-mic@digikod.net>

On 7/21/19 2:31 PM, Mickaël Salaün wrote:
> This documentation can be built with the Sphinx framework.
> 
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: James Morris <jmorris@namei.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Serge E. Hallyn <serge@hallyn.com>
> ---
> 
> Changes since v9:
> * update with expected attach type and expected attach triggers
> 
> Changes since v8:
> * remove documentation related to chaining and tagging according to this
>   patch series
> 
> Changes since v7:
> * update documentation according to the Landlock revamp
> 
> Changes since v6:
> * add a check for ctx->event
> * rename BPF_PROG_TYPE_LANDLOCK to BPF_PROG_TYPE_LANDLOCK_RULE
> * rename Landlock version to ABI to better reflect its purpose and add a
>   dedicated changelog section
> * update tables
> * relax no_new_privs recommendations
> * remove ABILITY_WRITE related functions
> * reword rule "appending" to "prepending" and explain it
> * cosmetic fixes
> 
> Changes since v5:
> * update the rule hierarchy inheritance explanation
> * briefly explain ctx->arg2
> * add ptrace restrictions
> * explain EPERM
> * update example (subtype)
> * use ":manpage:"
> ---
>  Documentation/security/index.rst           |   1 +
>  Documentation/security/landlock/index.rst  |  20 +++
>  Documentation/security/landlock/kernel.rst |  99 ++++++++++++++
>  Documentation/security/landlock/user.rst   | 147 +++++++++++++++++++++
>  4 files changed, 267 insertions(+)
>  create mode 100644 Documentation/security/landlock/index.rst
>  create mode 100644 Documentation/security/landlock/kernel.rst
>  create mode 100644 Documentation/security/landlock/user.rst


> diff --git a/Documentation/security/landlock/kernel.rst b/Documentation/security/landlock/kernel.rst
> new file mode 100644
> index 000000000000..7d1e06d544bf
> --- /dev/null
> +++ b/Documentation/security/landlock/kernel.rst
> @@ -0,0 +1,99 @@
> +==============================
> +Landlock: kernel documentation
> +==============================
> +
> +eBPF properties
> +===============
> +
> +To get an expressive language while still being safe and small, Landlock is
> +based on eBPF. Landlock should be usable by untrusted processes and must
> +therefore expose a minimal attack surface. The eBPF bytecode is minimal,
> +powerful, widely used and designed to be used by untrusted applications. Thus,
> +reusing the eBPF support in the kernel enables a generic approach while
> +minimizing new code.
> +
> +An eBPF program has access to an eBPF context containing some fields used to
> +inspect the current object. These arguments can be used directly (e.g. cookie)
> +or passed to helper functions according to their types (e.g. inode pointer). It
> +is then possible to do complex access checks without race conditions or
> +inconsistent evaluation (i.e.  `incorrect mirroring of the OS code and state
> +<https://www.ndss-symposium.org/ndss2003/traps-and-pitfalls-practical-problems-system-call-interposition-based-security-tools/>`_).
> +
> +A Landlock hook describes a particular access type.  For now, there is two

                                                                 there are two

> +hooks dedicated to filesystem related operations: LANDLOCK_HOOK_FS_PICK and
> +LANDLOCK_HOOK_FS_WALK.  A Landlock program is tied to one hook.  This makes it
> +possible to statically check context accesses, potentially performed by such
> +program, and hence prevents kernel address leaks and ensure the right use of

                                                        ensures

> +hook arguments with eBPF functions.  Any user can add multiple Landlock
> +programs per Landlock hook.  They are stacked and evaluated one after the
> +other, starting from the most recent program, as seccomp-bpf does with its
> +filters.  Underneath, a hook is an abstraction over a set of LSM hooks.
> +
> +
> +Guiding principles
> +==================
> +
> +Unprivileged use
> +----------------
> +
> +* Landlock helpers and context should be usable by any unprivileged and
> +  untrusted program while following the system security policy enforced by
> +  other access control mechanisms (e.g. DAC, LSM).
> +
> +
> +Landlock hook and context
> +-------------------------
> +
> +* A Landlock hook shall be focused on access control on kernel objects instead
> +  of syscall filtering (i.e. syscall arguments), which is the purpose of
> +  seccomp-bpf.
> +* A Landlock context provided by a hook shall express the minimal and more
> +  generic interface to control an access for a kernel object.
> +* A hook shall guaranty that all the BPF function calls from a program are> +  safe.  Thus, the related Landlock context arguments shall always be of the
> +  same type for a particular hook.  For example, a network hook could share
> +  helpers with a file hook because of UNIX socket.  However, the same helpers
> +  may not be compatible for a file system handle and a net handle.
> +* Multiple hooks may use the same context interface.
> +
> +
> +Landlock helpers
> +----------------
> +
> +* Landlock helpers shall be as generic as possible while at the same time being
> +  as simple as possible and following the syscall creation principles (cf.
> +  *Documentation/adding-syscalls.txt*).
> +* The only behavior change allowed on a helper is to fix a (logical) bug to
> +  match the initial semantic.
> +* Helpers shall be reentrant, i.e. only take inputs from arguments (e.g. from
> +  the BPF context), to enable a hook to use a cache.  Future program options
> +  might change this cache behavior.
> +* It is quite easy to add new helpers to extend Landlock.  The main concern
> +  should be about the possibility to leak information from the kernel that may
> +  not be accessible otherwise (i.e. side-channel attack).
> +
> +
> +Questions and answers
> +=====================
> +
> +Why not create a custom hook for each kind of action?
> +-----------------------------------------------------
> +
> +Landlock programs can handle these checks.  Adding more exceptions to the
> +kernel code would lead to more code complexity.  A decision to ignore a kind of
> +action can and should be done at the beginning of a Landlock program.
> +
> +
> +Why a program does not return an errno or a kill code?
> +------------------------------------------------------
> +
> +seccomp filters can return multiple kind of code, including an errno value or a

                                       kinds

> +kill signal, which may be convenient for access control.  Those return codes
> +are hardwired in the userland ABI.  Instead, Landlock's approach is to return a
> +boolean to allow or deny an action, which is much simpler and more generic.
> +Moreover, we do not really have a choice because, unlike to seccomp, Landlock
> +programs are not enforced at the syscall entry point but may be executed at any
> +point in the kernel (through LSM hooks) where an errno return code may not make
> +sense.  However, with this simple ABI and with the ability to call helpers,
> +Landlock may gain features similar to seccomp-bpf in the future while being
> +compatible with previous programs.
> diff --git a/Documentation/security/landlock/user.rst b/Documentation/security/landlock/user.rst
> new file mode 100644
> index 000000000000..14c4f3b377bd
> --- /dev/null
> +++ b/Documentation/security/landlock/user.rst
> @@ -0,0 +1,147 @@
> +================================
> +Landlock: userland documentation
> +================================
> +
> +Landlock programs
> +=================
> +
> +eBPF programs are used to create security programs.  They are contained and can
> +call only a whitelist of dedicated functions. Moreover, they can only loop
> +under strict conditions, which protects from denial of service.  More
> +information on BPF can be found in *Documentation/networking/filter.txt*.
> +
> +
> +Writing a program
> +-----------------
> +
> +To enforce a security policy, a thread first needs to create a Landlock program.
> +The easiest way to write an eBPF program depicting a security program is to write
> +it in the C language.  As described in *samples/bpf/README.rst*, LLVM can
> +compile such programs.  Files *samples/bpf/landlock1_kern.c* and those in
> +*tools/testing/selftests/landlock/* can be used as examples.
> +
> +Once the eBPF program is created, the next step is to create the metadata
> +describing the Landlock program.  This metadata includes an expected attach type which
> +contains the hook type to which the program is tied, and expected attach
> +triggers which identify the actions for which the program should be run.
> +
> +A hook is a policy decision point which exposes the same context type for
> +each program evaluation.
> +
> +A Landlock hook describes the kind of kernel object for which a program will be
> +triggered to allow or deny an action.  For example, the hook
> +BPF_LANDLOCK_FS_PICK can be triggered every time a landlocked thread performs a
> +set of action related to the filesystem (e.g. open, read, write, mount...).

          actions

> +This actions are identified by the `triggers` bitfield.
> +
> +The next step is to fill a :c:type:`struct bpf_load_program_attr
> +<bpf_load_program_attr>` with BPF_PROG_TYPE_LANDLOCK_HOOK, the expected attach
> +type and other BPF program metadata.  This bpf_attr must then be passed to the
> +:manpage:`bpf(2)` syscall alongside the BPF_PROG_LOAD command.  If everything
> +is deemed correct by the kernel, the thread gets a file descriptor referring to
> +this program.
> +
> +In the following code, the *insn* variable is an array of BPF instructions
> +which can be extracted from an ELF file as is done in bpf_load_file() from
> +*samples/bpf/bpf_load.c*.

A little confusing.  Is there a mixup of <insn> and <insns>?

> +
> +.. code-block:: c
> +
> +    int prog_fd;
> +    struct bpf_load_program_attr load_attr;
> +
> +    memset(&load_attr, 0, sizeof(struct bpf_load_program_attr));
> +    load_attr.prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK;
> +    load_attr.expected_attach_type = BPF_LANDLOCK_FS_PICK;
> +    load_attr.expected_attach_triggers = LANDLOCK_TRIGGER_FS_PICK_OPEN;
> +    load_attr.insns = insns;
> +    load_attr.insns_cnt = sizeof(insn) / sizeof(struct bpf_insn);
> +    load_attr.license = "GPL";
> +
> +    prog_fd = bpf_load_program_xattr(&load_attr, log_buf, log_buf_sz);
> +    if (prog_fd == -1)
> +        exit(1);
> +
> +
> +Enforcing a program
> +-------------------
> +
> +Once the Landlock program has been created or received (e.g. through a UNIX
> +socket), the thread willing to sandbox itself (and its future children) should
> +perform the following two steps.
> +
> +The thread should first request to never be allowed to get new privileges with a
> +call to :manpage:`prctl(2)` and the PR_SET_NO_NEW_PRIVS option.  More
> +information can be found in *Documentation/prctl/no_new_privs.txt*.
> +
> +.. code-block:: c
> +
> +    if (prctl(PR_SET_NO_NEW_PRIVS, 1, NULL, 0, 0))
> +        exit(1);
> +
> +A thread can apply a program to itself by using the :manpage:`seccomp(2)` syscall.
> +The operation is SECCOMP_PREPEND_LANDLOCK_PROG, the flags must be empty and the
> +*args* argument must point to a valid Landlock program file descriptor.
> +
> +.. code-block:: c
> +
> +    if (seccomp(SECCOMP_PREPEND_LANDLOCK_PROG, 0, &fd))
> +        exit(1);
> +
> +If the syscall succeeds, the program is now enforced on the calling thread and
> +will be enforced on all its subsequently created children of the thread as
> +well.  Once a thread is landlocked, there is no way to remove this security
> +policy, only stacking more restrictions is allowed.  The program evaluation is
> +performed from the newest to the oldest.
> +
> +When a syscall ask for an action on a kernel object, if this action is denied,

                  asks

> +then an EACCES errno code is returned through the syscall.
> +
> +
> +.. _inherited_programs:
> +
> +Inherited programs
> +------------------
> +
> +Every new thread resulting from a :manpage:`clone(2)` inherits Landlock program
> +restrictions from its parent.  This is similar to the seccomp inheritance as
> +described in *Documentation/prctl/seccomp_filter.txt*.
> +
> +
> +Ptrace restrictions
> +-------------------
> +
> +A landlocked process has less privileges than a non-landlocked process and must
> +then be subject to additional restrictions when manipulating another process.
> +To be allowed to use :manpage:`ptrace(2)` and related syscalls on a target
> +process, a landlocked process must have a subset of the target process programs.
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Maybe that last statement is correct, but it seems to me that it is missing something.

> +
> +
> +Landlock structures and constants
> +=================================
> +
> +Hook types
> +----------
> +
> +.. kernel-doc:: include/uapi/linux/landlock.h
> +    :functions: landlock_hook_type
> +
> +
> +Contexts
> +--------
> +
> +.. kernel-doc:: include/uapi/linux/landlock.h
> +    :functions: landlock_ctx_fs_pick landlock_ctx_fs_walk landlock_ctx_fs_get
> +
> +
> +Triggers for fs_pick
> +--------------------
> +
> +.. kernel-doc:: include/uapi/linux/landlock.h
> +    :functions: landlock_triggers
> +
> +
> +Additional documentation
> +========================
> +
> +See https://landlock.io
> 


-- 
~Randy

^ permalink raw reply

* Re: [PATCH v3 2/2] doc: Update documentation for page_idle virtual address indexing
From: Mike Rapoport @ 2019-07-31  6:44 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Brendan Gregg,
	Christian Hansen, dancol, fmayer, joaodias, joelaf,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Michal Hocko, minchan, namhyung,
	Roman Gushchin, Stephen Rothwell, surenb, tkjos, Vladimir Davydov,
	Vlastimil Babka, wvw
In-Reply-To: <20190726152319.134152-2-joel@joelfernandes.org>

On Fri, Jul 26, 2019 at 11:23:19AM -0400, Joel Fernandes (Google) wrote:
> This patch updates the documentation with the new page_idle tracking
> feature which uses virtual address indexing.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

One nit below, otherwise

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  .../admin-guide/mm/idle_page_tracking.rst     | 43 ++++++++++++++++---
>  1 file changed, 36 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst
> index df9394fb39c2..1eeac78c94a7 100644
> --- a/Documentation/admin-guide/mm/idle_page_tracking.rst
> +++ b/Documentation/admin-guide/mm/idle_page_tracking.rst
> @@ -19,10 +19,14 @@ It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
>  
>  User API
>  ========
> +There are 2 ways to access the idle page tracking API. One uses physical
> +address indexing, another uses a simpler virtual address indexing scheme.
>  
> -The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
> -Currently, it consists of the only read-write file,
> -``/sys/kernel/mm/page_idle/bitmap``.
> +Physical address indexing
> +-------------------------
> +The idle page tracking API for physical address indexing using page frame
> +numbers (PFN) is located at ``/sys/kernel/mm/page_idle``.  Currently, it
> +consists of the only read-write file, ``/sys/kernel/mm/page_idle/bitmap``.
>  
>  The file implements a bitmap where each bit corresponds to a memory page. The
>  bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
> @@ -74,6 +78,31 @@ See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
>  information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
>  ``/proc/kpagecgroup``.
>  
> +Virtual address indexing
> +------------------------
> +The idle page tracking API for virtual address indexing using virtual page
> +frame numbers (VFN) is located at ``/proc/<pid>/page_idle``. It is a bitmap
> +that follows the same semantics as ``/sys/kernel/mm/page_idle/bitmap``
> +except that it uses virtual instead of physical frame numbers.

Can you please make it more explicit that VFNs are in the <pid>'s address
space?

> +
> +This idle page tracking API does not need deal with PFN so it does not require
> +prior lookups of ``pagemap`` in order to find if page is idle or not. This is
> +an advantage on some systems where looking up PFN is considered a security
> +issue.  Also in some cases, this interface could be slightly more reliable to
> +use than physical address indexing, since in physical address indexing, address
> +space changes can occur between reading the ``pagemap`` and reading the
> +``bitmap``, while in virtual address indexing, the process's ``mmap_sem`` is
> +held for the duration of the access.
> +
> +To estimate the amount of pages that are not used by a workload one should:
> +
> + 1. Mark all the workload's pages as idle by setting corresponding bits in
> +    ``/proc/<pid>/page_idle``.
> +
> + 2. Wait until the workload accesses its working set.
> +
> + 3. Read ``/proc/<pid>/page_idle`` and count the number of bits set.
> +
>  .. _impl_details:
>  
>  Implementation Details
> @@ -99,10 +128,10 @@ When a dirty page is written to swap or disk as a result of memory reclaim or
>  exceeding the dirty memory limit, it is not marked referenced.
>  
>  The idle memory tracking feature adds a new page flag, the Idle flag. This flag
> -is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
> -:ref:`User API <user_api>`
> -section), and cleared automatically whenever a page is referenced as defined
> -above.
> +is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` for physical
> +addressing or by writing to ``/proc/<pid>/page_idle`` for virtual
> +addressing (see the :ref:`User API <user_api>` section), and cleared
> +automatically whenever a page is referenced as defined above.
>  
>  When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
>  mapped to, otherwise we will not be able to detect accesses to the page coming
> -- 
> 2.22.0.709.g102302147b-goog
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Song Liu @ 2019-07-31  7:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, linux-security@vger.kernel.org, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API
In-Reply-To: <CALCETrVS5FwtmTyspyg-UNoZTHes9wUNbbsvNYwQwXUUfrtaiQ@mail.gmail.com>

Hi Andy,

> On Jul 30, 2019, at 1:20 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sat, Jul 27, 2019 at 11:20 AM Song Liu <songliubraving@fb.com> wrote:
>> 
>> Hi Andy,
>> 
>>>>>> 
>>>>> 
>>>>> Well, yes. sys_bpf() is pretty powerful.
>>>>> 
>>>>> The goal of /dev/bpf is to enable special users to call sys_bpf(). In
>>>>> the meanwhile, such users should not take down the whole system easily
>>>>> by accident, e.g., with rm -rf /.
>>>> 
>>>> That’s easy, though — bpftool could learn to read /etc/bpfusers before allowing ruid != 0.
>>> 
>>> This is a great idea! fscaps + /etc/bpfusers should do the trick.
>> 
>> After some discussions and more thinking on this, I have some concerns
>> with the user space only approach.
>> 
>> IIUC, your proposal for user space only approach is like:
>> 
>> 1. bpftool (and other tools) check /etc/bpfusers and only do
>>   setuid for allowed users:
>> 
>>        int main()
>>        {
>>                if (/* uid in /etc/bpfusers */)
>>                        setuid(0);
>>                sys_bpf(...);
>>        }
>> 
>> 2. bpftool (and other tools) is installed with CAP_SETUID:
>> 
>>        setcap cap_setuid=e+p /bin/bpftool
>> 
> 
> You have this a bit backwards.  You wouldn't use CAP_SETUID.  You
> would use the setuid *mode* bit, i.e. chmod 4111 (or 4100 and use ACLs
> to further lock it down).  Or you could use setcap cap_sys_admin=p,
> although the details vary.  It woks a bit like this:
> 
> First, if you are running with elevated privilege due to SUID or
> fscaps, the kernel and glibc offer you a degree of protection: you are
> protected from ptrace(), LD_PRELOAD, etc.  You are *not* protected
> from yourself.  For example, you may be running in a context in which
> an attacker has malicious values in your environment variables, CWD,
> etc.  Do you need to carefully decide whether you are willing to run
> with elevated privilege on behalf of the user, which you learn like
> this:
> 
> uid_t real_uid = getuid();
> 
> Your decision may may depend on command-line arguments as well (i.e.
> you might want to allow tracing but not filtering, say).  Once you've
> made this decision, the details vary:
> 
> For SUID, you either continue to run with euid == 0, or you drop
> privilege using something like:
> 
> if (setresuid(real_uid, real_uid, real_uid) != 0) {
> /* optionally print an error to stderr */
> exit(1);
> }
> 
> For fscaps, if you want to be privileged, you use something like
> capng_update(); capng_apply() to set CAP_SYS_ADMIN to be effective
> when you want privilege.  If you want to be unprivileged (because
> bpfusers says so, for example), you could use capng_update() to drop
> CAP_SYS_ADMIN entirely and see if the calls still work without
> privilege.  But this is a little bit awkward, since you don't directly
> know whether the user that invoked you in the first place had
> CAP_SYS_ADMIN to begin with.
> 
> In general, SUID is a bit easier to work with.

Thanks a lot for these explanations. I learned a lot today via reading
this email and Googling. 

> 
>> This approach is not ideal, because we need to trust the tool to give
>> it CAP_SETUID. A hacked tool could easily bypass /etc/bpfusers check
>> or use other root only sys calls after setuid(0).
> 
> How?  The whole SUID mechanism is designed fairly carefully to prevent
> this.  /bin/sudo is likely to be SUID on your system, but you can't
> just "hack" it to become root.

I guess "hacked" was not the right description. I should have used 
"buggy" instead. 

SUID mechanism is great for small and solid tools, like sudo, passwd,
mount, etc. The user or sys admin could trust the tool to always do the 
right thing. On the other hand, I think SUID is not ideal for complex
tools that are under heavy development. As you mentioned, SUID doesn't 
protect against oneself. Therefore, it won't protect the system from 
buggy tool that uses SUID on something it should not. 

I think SUID is good for bpftool, because it is easy to use SUID 
correctly in bpftool. But, it is not easy to use SUID correctly in 
systemd. 

systemd and similar user space daemons use sys_bpf() to manage cgroup 
bpf programs and maps (BPF_MAP_TYPE_*CGROUP*, BPF_PROG_TYPE_CGROUP_*, 
BPF_CGROUP_*). The daemon runs in the background, and handles user 
requests to attach/detach BPF programs. If the daemon uses SUID or 
similar mechanism, it has to use euid == 0 for sys_bpf(). However, 
such daemons are usually pretty complicated. It is not easy to prove 
the daemon won't misuse euid == 0. 

Does this make sense so far? I will discuss more about cgroup in the 
next email. 

Thanks,
Song

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Song Liu @ 2019-07-31  8:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List
In-Reply-To: <CALCETrUpVMrk7aaf0trfg9AfZ4fy279uJgZH7V+gZzjFw=hUxA@mail.gmail.com>



> On Jul 30, 2019, at 1:24 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Mon, Jul 29, 2019 at 10:07 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> Hi Andy,
>> 
>>> On Jul 27, 2019, at 11:20 AM, Song Liu <songliubraving@fb.com> wrote:
>>> 
>>> Hi Andy,
>>> 
>>> 

[...]

>>> 
>> 
>> I would like more comments on this.
>> 
>> Currently, bpf permission is more or less "root or nothing", which we
>> would like to change.
>> 
>> The short term goal is to separate bpf from root, in other words, it is
>> "all or nothing". Special user space utilities, such as systemd, would
>> benefit from this. Once this is implemented, systemd can call sys_bpf()
>> when it is not running as root.
> 
> As generally nasty as Linux capabilities are, this sounds like a good
> use for CAP_BPF_ADMIN.

I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make 
existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.

> 
> But what do you have in mind?  Isn't non-root systemd mostly just the
> user systemd session?  That should *not* have bpf() privileges until
> bpf() is improved such that you can't use it to compromise the system.

cgroup bpf is the major use case here. A less important use case is to 
run bpf selftests without being root. 

> 
>> 
>> In longer term, it may be useful to provide finer grain permission of
>> sys_bpf(). For example, sys_bpf() should be aware of containers; and
>> user may only have access to certain bpf maps. Let's call this
>> "fine grain" capability.
>> 
>> 
>> Since we are seeing new use cases every year, we will need many
>> iterations to implement the fine grain permission. I think we need an
>> API that is flexible enough to cover different types of permission
>> control.
>> 
>> For example, bpf_with_cap() can be flexible:
>> 
>>        bpf_with_cap(cmd, attr, size, perm_fd);
>> 
>> We can get different types of permission via different combinations of
>> arguments:
>> 
>>    A perm_fd to /dev/bpf gives access to all sys_bpf() commands, so
>>    this is "all or nothing" permission.
>> 
>>    A perm_fd to /sys/fs/cgroup/.../bpf.xxx would only allow some
>>    commands to this specific cgroup.
>> 
> 
> I don't see why you need to invent a whole new mechanism for this.
> The entire cgroup ecosystem outside bpf() does just fine using the
> write permission on files in cgroupfs to control access.  Why can't
> bpf() do the same thing?

It is easier to use write permission for BPF_PROG_ATTACH. But it is 
not easy to do the same for other bpf commands: BPF_PROG_LOAD and 
BPF_MAP_*. A lot of these commands don't have target concept. Maybe 
we should have target concept for all these commands. But that is a 
much bigger project. OTOH, "all or nothing" model allows all these 
commands at once.

Well, that being said, I will look more into using write permission 
in cgroupfs. 

Thanks again for all these comments and suggestions. Please let us 
know your future thoughts and insights. 

Song

^ permalink raw reply

* Re: [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing
From: Minchan Kim @ 2019-07-31  8:53 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Brendan Gregg,
	Christian Hansen, dancol, fmayer, joaodias, joelaf,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport, namhyung,
	Roman Gushchin, Stephen Rothwell, surenb, tkjos, Vladimir Davydov,
	Vlastimil Babka, wvw
In-Reply-To: <20190726152319.134152-1-joel@joelfernandes.org>

Hi Joel,

On Fri, Jul 26, 2019 at 11:23:18AM -0400, Joel Fernandes (Google) wrote:
> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> Looking up PFN from pagemap in Android devices is not supported by
> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> 
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> looking up PFN through pagemap is not needed since the interface uses
> virtual frame numbers, and at the same time also does not require
> SYS_ADMIN.
> 
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time. This method solves the security issue
> with userspace learning the PFN, and while at it is also shown to yield
> better results than the pagemap lookup, the theory being that the window
> where the address space can change is reduced by eliminating the
> intermediate pagemap look up stage. In virtual address indexing, the
> process's mmap_sem is held for the duration of the access.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> ---
> v2->v3:
> Fixed a bug where I was doing a kfree that is not needed due to not
> needing to do GFP_ATOMIC allocations.
> 
> v1->v2:
> Mark swap ptes as idle (Minchan)
> Avoid need for GFP_ATOMIC (Andrew)
> Get rid of idle_page_list lock by moving list to stack
> 
> Internal review -> v1:
> Fixes from Suren.
> Corrections to change log, docs (Florian, Sandeep)
> 
>  fs/proc/base.c            |   3 +
>  fs/proc/internal.h        |   1 +
>  fs/proc/task_mmu.c        |  57 +++++++
>  include/linux/page_idle.h |   4 +
>  mm/page_idle.c            | 340 +++++++++++++++++++++++++++++++++-----
>  5 files changed, 360 insertions(+), 45 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 77eb628ecc7f..a58dd74606e9 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
>  	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
>  	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
>  	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> +	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
> +#endif
>  #endif
>  #ifdef CONFIG_SECURITY
>  	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index cd0c8d5ce9a1..bc9371880c63 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
>  extern const struct file_operations proc_pid_smaps_rollup_operations;
>  extern const struct file_operations proc_clear_refs_operations;
>  extern const struct file_operations proc_pagemap_operations;
> +extern const struct file_operations proc_page_idle_operations;
>  
>  extern unsigned long task_vsize(struct mm_struct *);
>  extern unsigned long task_statm(struct mm_struct *,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 4d2b860dbc3f..11ccc53da38e 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1642,6 +1642,63 @@ const struct file_operations proc_pagemap_operations = {
>  	.open		= pagemap_open,
>  	.release	= pagemap_release,
>  };
> +
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> +static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
> +				   size_t count, loff_t *ppos)
> +{
> +	int ret;
> +	struct task_struct *tsk = get_proc_task(file_inode(file));
> +
> +	if (!tsk)
> +		return -EINVAL;
> +	ret = page_idle_proc_read(file, buf, count, ppos, tsk);
> +	put_task_struct(tsk);

Why do you need task_struct here? You already got the task in open
and got mm there so you could pass the MM here instead of task.

> +	return ret;
> +}
> +
> +static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
> +				 size_t count, loff_t *ppos)
> +{
> +	int ret;
> +	struct task_struct *tsk = get_proc_task(file_inode(file));
> +
> +	if (!tsk)
> +		return -EINVAL;
> +	ret = page_idle_proc_write(file, (char __user *)buf, count, ppos, tsk);
> +	put_task_struct(tsk);
> +	return ret;
> +}
> +
> +static int proc_page_idle_open(struct inode *inode, struct file *file)
> +{
> +	struct mm_struct *mm;
> +
> +	mm = proc_mem_open(inode, PTRACE_MODE_READ);
> +	if (IS_ERR(mm))
> +		return PTR_ERR(mm);
> +	file->private_data = mm;
> +	return 0;
> +}
> +
> +static int proc_page_idle_release(struct inode *inode, struct file *file)
> +{
> +	struct mm_struct *mm = file->private_data;
> +
> +	if (mm)
> +		mmdrop(mm);
> +	return 0;
> +}
> +
> +const struct file_operations proc_page_idle_operations = {
> +	.llseek		= mem_lseek, /* borrow this */
> +	.read		= proc_page_idle_read,
> +	.write		= proc_page_idle_write,
> +	.open		= proc_page_idle_open,
> +	.release	= proc_page_idle_release,
> +};
> +#endif /* CONFIG_IDLE_PAGE_TRACKING */
> +
>  #endif /* CONFIG_PROC_PAGE_MONITOR */
>  
>  #ifdef CONFIG_NUMA
> diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> index 1e894d34bdce..f1bc2640d85e 100644
> --- a/include/linux/page_idle.h
> +++ b/include/linux/page_idle.h
> @@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
>  }
>  #endif /* CONFIG_64BIT */
>  
> +ssize_t page_idle_proc_write(struct file *file,
> +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
> +ssize_t page_idle_proc_read(struct file *file,
> +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
>  #else /* !CONFIG_IDLE_PAGE_TRACKING */
>  
>  static inline bool page_is_young(struct page *page)
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index 295512465065..86244f7f1faa 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -5,12 +5,15 @@
>  #include <linux/sysfs.h>
>  #include <linux/kobject.h>
>  #include <linux/mm.h>
> -#include <linux/mmzone.h>
> -#include <linux/pagemap.h>
> -#include <linux/rmap.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/mmzone.h>
>  #include <linux/page_ext.h>
>  #include <linux/page_idle.h>
> +#include <linux/pagemap.h>
> +#include <linux/rmap.h>
> +#include <linux/sched/mm.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>  
>  #define BITMAP_CHUNK_SIZE	sizeof(u64)
>  #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> @@ -25,18 +28,13 @@
>   * page tracking. With such an indicator of user pages we can skip isolated
>   * pages, but since there are not usually many of them, it will hardly affect
>   * the overall result.
> - *
> - * This function tries to get a user memory page by pfn as described above.
>   */
> -static struct page *page_idle_get_page(unsigned long pfn)
> +static struct page *page_idle_get_page(struct page *page_in)

Looks weird function name after you changed the argument.
Maybe "bool check_valid_page(struct page *page)"?

>  {
>  	struct page *page;
>  	pg_data_t *pgdat;
>  
> -	if (!pfn_valid(pfn))
> -		return NULL;
> -
> -	page = pfn_to_page(pfn);
> +	page = page_in;
>  	if (!page || !PageLRU(page) ||
>  	    !get_page_unless_zero(page))
>  		return NULL;
> @@ -51,6 +49,18 @@ static struct page *page_idle_get_page(unsigned long pfn)
>  	return page;
>  }
>  
> +/*
> + * This function tries to get a user memory page by pfn as described above.
> + */
> +static struct page *page_idle_get_page_pfn(unsigned long pfn)

So we could use page_idle_get_page name here.

> +{
> +
> +	if (!pfn_valid(pfn))
> +		return NULL;


    page = pfn_to_page(pfn);

    return check_valid_page(page) ? page : NULL;
> +
> +	return page_idle_get_page(pfn_to_page(pfn));
> +}
> +
>  static bool page_idle_clear_pte_refs_one(struct page *page,
>  					struct vm_area_struct *vma,
>  					unsigned long addr, void *arg)
> @@ -118,6 +128,47 @@ static void page_idle_clear_pte_refs(struct page *page)
>  		unlock_page(page);
>  }
>  
> +/* Helper to get the start and end frame given a pos and count */
> +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long max_frame;
> +
> +	/* If an mm is not given, assume we want physical frames */
> +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> +
> +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> +		return -EINVAL;
> +
> +	*start = pos * BITS_PER_BYTE;
> +	if (*start >= max_frame)
> +		return -ENXIO;
> +
> +	*end = *start + count * BITS_PER_BYTE;
> +	if (*end > max_frame)
> +		*end = max_frame;
> +	return 0;
> +}
> +
> +static bool page_really_idle(struct page *page)

Just minor:
Instead of creating new API, could we combine page_is_idle with
introducing furthere argument pte_check?

    bool page_is_idle(struct page *page, bool pte_check);

> +{
> +	if (!page)
> +		return false;
> +
> +	if (page_is_idle(page)) {
> +		/*
> +		 * The page might have been referenced via a
> +		 * pte, in which case it is not idle. Clear
> +		 * refs and recheck.
> +		 */
> +		page_idle_clear_pte_refs(page);
> +		if (page_is_idle(page))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
>  static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
>  				     struct bin_attribute *attr, char *buf,
>  				     loff_t pos, size_t count)
> @@ -125,35 +176,21 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
>  	u64 *out = (u64 *)buf;
>  	struct page *page;
>  	unsigned long pfn, end_pfn;
> -	int bit;
> +	int bit, ret;
>  
> -	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> -		return -EINVAL;
> -
> -	pfn = pos * BITS_PER_BYTE;
> -	if (pfn >= max_pfn)
> -		return 0;
> -
> -	end_pfn = pfn + count * BITS_PER_BYTE;
> -	if (end_pfn > max_pfn)
> -		end_pfn = max_pfn;
> +	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> +	if (ret == -ENXIO)
> +		return 0;  /* Reads beyond max_pfn do nothing */
> +	else if (ret)
> +		return ret;
>  
>  	for (; pfn < end_pfn; pfn++) {
>  		bit = pfn % BITMAP_CHUNK_BITS;
>  		if (!bit)
>  			*out = 0ULL;
> -		page = page_idle_get_page(pfn);
> -		if (page) {
> -			if (page_is_idle(page)) {
> -				/*
> -				 * The page might have been referenced via a
> -				 * pte, in which case it is not idle. Clear
> -				 * refs and recheck.
> -				 */
> -				page_idle_clear_pte_refs(page);
> -				if (page_is_idle(page))
> -					*out |= 1ULL << bit;
> -			}
> +		page = page_idle_get_page_pfn(pfn);
> +		if (page && page_really_idle(page)) {
> +			*out |= 1ULL << bit;
>  			put_page(page);
>  		}
>  		if (bit == BITMAP_CHUNK_BITS - 1)
> @@ -170,23 +207,16 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
>  	const u64 *in = (u64 *)buf;
>  	struct page *page;
>  	unsigned long pfn, end_pfn;
> -	int bit;
> +	int bit, ret;
>  
> -	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> -		return -EINVAL;
> -
> -	pfn = pos * BITS_PER_BYTE;
> -	if (pfn >= max_pfn)
> -		return -ENXIO;
> -
> -	end_pfn = pfn + count * BITS_PER_BYTE;
> -	if (end_pfn > max_pfn)
> -		end_pfn = max_pfn;
> +	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> +	if (ret)
> +		return ret;
>  
>  	for (; pfn < end_pfn; pfn++) {
>  		bit = pfn % BITMAP_CHUNK_BITS;
>  		if ((*in >> bit) & 1) {
> -			page = page_idle_get_page(pfn);
> +			page = page_idle_get_page_pfn(pfn);
>  			if (page) {
>  				page_idle_clear_pte_refs(page);
>  				set_page_idle(page);
> @@ -224,6 +254,226 @@ struct page_ext_operations page_idle_ops = {
>  };
>  #endif
>  
> +/*  page_idle tracking for /proc/<pid>/page_idle */
> +
> +struct page_node {
> +	struct page *page;
> +	unsigned long addr;
> +	struct list_head list;
> +};
> +
> +struct page_idle_proc_priv {
> +	unsigned long start_addr;
> +	char *buffer;
> +	int write;
> +
> +	/* Pre-allocate and provide nodes to add_page_idle_list() */
> +	struct page_node *page_nodes;
> +	int cur_page_node;
> +	struct list_head *idle_page_list;
> +};
> +
> +/*
> + * Add a page to the idle page list. page can be NULL if pte is
> + * from a swapped page.
> + */
> +static void add_page_idle_list(struct page *page,
> +			       unsigned long addr, struct mm_walk *walk)
> +{
> +	struct page *page_get = NULL;
> +	struct page_node *pn;
> +	int bit;
> +	unsigned long frames;
> +	struct page_idle_proc_priv *priv = walk->private;
> +	u64 *chunk = (u64 *)priv->buffer;
> +
> +	if (priv->write) {
> +		/* Find whether this page was asked to be marked */
> +		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> +		bit = frames % BITMAP_CHUNK_BITS;
> +		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> +		if (((*chunk >> bit) & 1) == 0)
> +			return;
> +	}
> +
> +	if (page) {
> +		page_get = page_idle_get_page(page);
> +		if (!page_get)
> +			return;
> +	}
> +
> +	pn = &(priv->page_nodes[priv->cur_page_node++]);
> +	pn->page = page_get;
> +	pn->addr = addr;
> +	list_add(&pn->list, priv->idle_page_list);
> +}
> +
> +static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
> +				    unsigned long end,
> +				    struct mm_walk *walk)
> +{
> +	struct vm_area_struct *vma = walk->vma;
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +	struct page *page;
> +
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		if (pmd_present(*pmd)) {
> +			page = follow_trans_huge_pmd(vma, addr, pmd,
> +						     FOLL_DUMP|FOLL_WRITE);
> +			if (!IS_ERR_OR_NULL(page))
> +				add_page_idle_list(page, addr, walk);
> +		}
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		/*
> +		 * We add swapped pages to the idle_page_list so that we can
> +		 * reported to userspace that they are idle.
> +		 */
> +		if (is_swap_pte(*pte)) {

I suggested "let's consider every swapped out pages as IDLE" but
let's think about this case:

1. mark heap of the process as IDLE
2. process touch working set
3. process's heap pages are swap out by meory spike or madvise
4. heap profiler investigates the process's IDLE page and surprised all of
heap are idle.

It's the good scenario for other purpose because non-idle pages(IOW,
workingset) could be readahead when the app will restart.

Maybe, squeeze the idle bit in the swap pte to check it.

> +			add_page_idle_list(NULL, addr, walk);
> +			continue;
> +		}
> +
> +		if (!pte_present(*pte))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, *pte);
> +		if (page)
> +			add_page_idle_list(page, addr, walk);
> +	}
> +
> +	pte_unmap_unlock(pte - 1, ptl);
> +	return 0;
> +}
> +
> +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> +			       size_t count, loff_t *pos,
> +			       struct task_struct *tsk, int write)
> +{
> +	int ret;
> +	char *buffer;
> +	u64 *out;
> +	unsigned long start_addr, end_addr, start_frame, end_frame;
> +	struct mm_struct *mm = file->private_data;
> +	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
> +	struct page_node *cur;
> +	struct page_idle_proc_priv priv;
> +	bool walk_error = false;
> +	LIST_HEAD(idle_page_list);
> +
> +	if (!mm || !mmget_not_zero(mm))
> +		return -EINVAL;
> +
> +	if (count > PAGE_SIZE)
> +		count = PAGE_SIZE;
> +
> +	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!buffer) {
> +		ret = -ENOMEM;
> +		goto out_mmput;
> +	}
> +	out = (u64 *)buffer;
> +
> +	if (write && copy_from_user(buffer, ubuff, count)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
> +	if (ret)
> +		goto out;
> +
> +	start_addr = (start_frame << PAGE_SHIFT);
> +	end_addr = (end_frame << PAGE_SHIFT);
> +	priv.buffer = buffer;
> +	priv.start_addr = start_addr;
> +	priv.write = write;
> +
> +	priv.idle_page_list = &idle_page_list;
> +	priv.cur_page_node = 0;
> +	priv.page_nodes = kzalloc(sizeof(struct page_node) *
> +				  (end_frame - start_frame), GFP_KERNEL);
> +	if (!priv.page_nodes) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	walk.private = &priv;
> +	walk.mm = mm;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	/*
> +	 * idle_page_list is needed because walk_page_vma() holds ptlock which
> +	 * deadlocks with page_idle_clear_pte_refs(). So we have to collect all
> +	 * pages first, and then call page_idle_clear_pte_refs().
> +	 */

Thanks for the comment, I was curious why you want to have
idle_page_list and the reason is here.

How about making this /proc/<pid>/page_idle per-process granuariy,
unlike system level /sys/xxx/page_idle? What I meant is not to check
rmap to see any reference from random process but just check only
access from the target process. It would be more proper as /proc/
<pid>/ interface and good for per-process tracking as well as
fast.

> +	ret = walk_page_range(start_addr, end_addr, &walk);
> +	if (ret)
> +		walk_error = true;
> +
> +	list_for_each_entry(cur, &idle_page_list, list) {
> +		int bit, index;
> +		unsigned long off;
> +		struct page *page = cur->page;
> +
> +		if (unlikely(walk_error))
> +			goto remove_page;
> +
> +		if (write) {
> +			if (page) {
> +				page_idle_clear_pte_refs(page);
> +				set_page_idle(page);
> +			}
> +		} else {
> +			if (!page || page_really_idle(page)) {
> +				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
> +				bit = off % BITMAP_CHUNK_BITS;
> +				index = off / BITMAP_CHUNK_BITS;
> +				out[index] |= 1ULL << bit;
> +			}
> +		}
> +remove_page:
> +		if (page)
> +			put_page(page);
> +	}
> +
> +	if (!write && !walk_error)
> +		ret = copy_to_user(ubuff, buffer, count);
> +
> +	up_read(&mm->mmap_sem);
> +	kfree(priv.page_nodes);
> +out:
> +	kfree(buffer);
> +out_mmput:
> +	mmput(mm);
> +	if (!ret)
> +		ret = count;
> +	return ret;
> +
> +}
> +
> +ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
> +			    size_t count, loff_t *pos, struct task_struct *tsk)
> +{
> +	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 0);
> +}
> +
> +ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
> +			     size_t count, loff_t *pos, struct task_struct *tsk)
> +{
> +	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 1);
> +}
> +
>  static int __init page_idle_init(void)
>  {
>  	int err;
> -- 
> 2.22.0.709.g102302147b-goog

^ permalink raw reply

* Re: [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing
From: Joel Fernandes @ 2019-07-31 17:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Brendan Gregg,
	Christian Hansen, dancol, fmayer, joaodias, Jonathan Corbet,
	Kees Cook, kernel-team, linux-api, linux-doc, linux-fsdevel,
	linux-mm, Michal Hocko, Mike Rapoport, namhyung, Roman Gushchin,
	Stephen Rothwell, surenb, tkjos, Vladimir Davydov,
	Vlastimil Babka, wvw
In-Reply-To: <20190731085335.GD155569@google.com>

On Wed, Jul 31, 2019 at 05:53:35PM +0900, Minchan Kim wrote:
> Hi Joel,
> 
> On Fri, Jul 26, 2019 at 11:23:18AM -0400, Joel Fernandes (Google) wrote:
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > Looking up PFN from pagemap in Android devices is not supported by
> > unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> > 
[snip]
> > index 77eb628ecc7f..a58dd74606e9 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
> >  	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
> >  	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
> >  	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
> > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > +	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
> > +#endif
> >  #endif
> >  #ifdef CONFIG_SECURITY
> >  	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
> > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > index cd0c8d5ce9a1..bc9371880c63 100644
> > --- a/fs/proc/internal.h
> > +++ b/fs/proc/internal.h
> > @@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
> >  extern const struct file_operations proc_pid_smaps_rollup_operations;
> >  extern const struct file_operations proc_clear_refs_operations;
> >  extern const struct file_operations proc_pagemap_operations;
> > +extern const struct file_operations proc_page_idle_operations;
> >  
> >  extern unsigned long task_vsize(struct mm_struct *);
> >  extern unsigned long task_statm(struct mm_struct *,
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 4d2b860dbc3f..11ccc53da38e 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1642,6 +1642,63 @@ const struct file_operations proc_pagemap_operations = {
> >  	.open		= pagemap_open,
> >  	.release	= pagemap_release,
> >  };
> > +
> > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > +static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
> > +				   size_t count, loff_t *ppos)
> > +{
> > +	int ret;
> > +	struct task_struct *tsk = get_proc_task(file_inode(file));
> > +
> > +	if (!tsk)
> > +		return -EINVAL;
> > +	ret = page_idle_proc_read(file, buf, count, ppos, tsk);
> > +	put_task_struct(tsk);
> 
> Why do you need task_struct here? You already got the task in open
> and got mm there so you could pass the MM here instead of task.

Good point, will just use mm.

[snip]
> > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > index 1e894d34bdce..f1bc2640d85e 100644
> > --- a/include/linux/page_idle.h
> > +++ b/include/linux/page_idle.h
> > @@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
> >  }
> >  #endif /* CONFIG_64BIT */
> >  
> > +ssize_t page_idle_proc_write(struct file *file,
> > +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
> > +ssize_t page_idle_proc_read(struct file *file,
> > +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
> >  #else /* !CONFIG_IDLE_PAGE_TRACKING */
> >  
> >  static inline bool page_is_young(struct page *page)
> > diff --git a/mm/page_idle.c b/mm/page_idle.c
> > index 295512465065..86244f7f1faa 100644
> > --- a/mm/page_idle.c
> > +++ b/mm/page_idle.c
> > @@ -5,12 +5,15 @@
> >  #include <linux/sysfs.h>
> >  #include <linux/kobject.h>
> >  #include <linux/mm.h>
> > -#include <linux/mmzone.h>
> > -#include <linux/pagemap.h>
> > -#include <linux/rmap.h>
> >  #include <linux/mmu_notifier.h>
> > +#include <linux/mmzone.h>
> >  #include <linux/page_ext.h>
> >  #include <linux/page_idle.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/rmap.h>
> > +#include <linux/sched/mm.h>
> > +#include <linux/swap.h>
> > +#include <linux/swapops.h>
> >  
> >  #define BITMAP_CHUNK_SIZE	sizeof(u64)
> >  #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> > @@ -25,18 +28,13 @@
> >   * page tracking. With such an indicator of user pages we can skip isolated
> >   * pages, but since there are not usually many of them, it will hardly affect
> >   * the overall result.
> > - *
> > - * This function tries to get a user memory page by pfn as described above.
> >   */
> > -static struct page *page_idle_get_page(unsigned long pfn)
> > +static struct page *page_idle_get_page(struct page *page_in)
> 
> Looks weird function name after you changed the argument.
> Maybe "bool check_valid_page(struct page *page)"?


I don't think so, this function does a get_page_unless_zero() on the page as well.

> >  {
> >  	struct page *page;
> >  	pg_data_t *pgdat;
> >  
> > -	if (!pfn_valid(pfn))
> > -		return NULL;
> > -
> > -	page = pfn_to_page(pfn);
> > +	page = page_in;
> >  	if (!page || !PageLRU(page) ||
> >  	    !get_page_unless_zero(page))
> >  		return NULL;
> > @@ -51,6 +49,18 @@ static struct page *page_idle_get_page(unsigned long pfn)
> >  	return page;
> >  }
> >  
> > +/*
> > + * This function tries to get a user memory page by pfn as described above.
> > + */
> > +static struct page *page_idle_get_page_pfn(unsigned long pfn)
> 
> So we could use page_idle_get_page name here.


Based on above comment, I prefer to keep same name. Do you agree?


> > +	return page_idle_get_page(pfn_to_page(pfn));
> > +}
> > +
> >  static bool page_idle_clear_pte_refs_one(struct page *page,
> >  					struct vm_area_struct *vma,
> >  					unsigned long addr, void *arg)
> > @@ -118,6 +128,47 @@ static void page_idle_clear_pte_refs(struct page *page)
> >  		unlock_page(page);
> >  }
> >  
> > +/* Helper to get the start and end frame given a pos and count */
> > +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> > +				unsigned long *start, unsigned long *end)
> > +{
> > +	unsigned long max_frame;
> > +
> > +	/* If an mm is not given, assume we want physical frames */
> > +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> > +
> > +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > +		return -EINVAL;
> > +
> > +	*start = pos * BITS_PER_BYTE;
> > +	if (*start >= max_frame)
> > +		return -ENXIO;
> > +
> > +	*end = *start + count * BITS_PER_BYTE;
> > +	if (*end > max_frame)
> > +		*end = max_frame;
> > +	return 0;
> > +}
> > +
> > +static bool page_really_idle(struct page *page)
> 
> Just minor:
> Instead of creating new API, could we combine page_is_idle with
> introducing furthere argument pte_check?


I cannot see in the code where pte_check will be false when this is called? I
could rename the function to page_idle_check_ptes() if that's Ok with you.

[snip]
> +
> > +static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
> > +				    unsigned long end,
> > +				    struct mm_walk *walk)
> > +{
> > +	struct vm_area_struct *vma = walk->vma;
> > +	pte_t *pte;
> > +	spinlock_t *ptl;
> > +	struct page *page;
> > +
> > +	ptl = pmd_trans_huge_lock(pmd, vma);
> > +	if (ptl) {
> > +		if (pmd_present(*pmd)) {
> > +			page = follow_trans_huge_pmd(vma, addr, pmd,
> > +						     FOLL_DUMP|FOLL_WRITE);
> > +			if (!IS_ERR_OR_NULL(page))
> > +				add_page_idle_list(page, addr, walk);
> > +		}
> > +		spin_unlock(ptl);
> > +		return 0;
> > +	}
> > +
> > +	if (pmd_trans_unstable(pmd))
> > +		return 0;
> > +
> > +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> > +		/*
> > +		 * We add swapped pages to the idle_page_list so that we can
> > +		 * reported to userspace that they are idle.
> > +		 */
> > +		if (is_swap_pte(*pte)) {
> 
> I suggested "let's consider every swapped out pages as IDLE" but
> let's think about this case:
> 
> 1. mark heap of the process as IDLE
> 2. process touch working set
> 3. process's heap pages are swap out by meory spike or madvise
> 4. heap profiler investigates the process's IDLE page and surprised all of
> heap are idle.
> 
> It's the good scenario for other purpose because non-idle pages(IOW,
> workingset) could be readahead when the app will restart.
> 
> Maybe, squeeze the idle bit in the swap pte to check it.


Ok, I will look more into this. Konstantin had similar ideas here too.


> > +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> > +			       size_t count, loff_t *pos,
> > +			       struct task_struct *tsk, int write)
> > +{
> > +	int ret;
> > +	char *buffer;
> > +	u64 *out;
> > +	unsigned long start_addr, end_addr, start_frame, end_frame;
> > +	struct mm_struct *mm = file->private_data;
> > +	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
> > +	struct page_node *cur;
> > +	struct page_idle_proc_priv priv;
> > +	bool walk_error = false;
> > +	LIST_HEAD(idle_page_list);
> > +
> > +	if (!mm || !mmget_not_zero(mm))
> > +		return -EINVAL;
> > +
> > +	if (count > PAGE_SIZE)
> > +		count = PAGE_SIZE;
> > +
> > +	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > +	if (!buffer) {
> > +		ret = -ENOMEM;
> > +		goto out_mmput;
> > +	}
> > +	out = (u64 *)buffer;
> > +
> > +	if (write && copy_from_user(buffer, ubuff, count)) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +
> > +	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
> > +	if (ret)
> > +		goto out;
> > +
> > +	start_addr = (start_frame << PAGE_SHIFT);
> > +	end_addr = (end_frame << PAGE_SHIFT);
> > +	priv.buffer = buffer;
> > +	priv.start_addr = start_addr;
> > +	priv.write = write;
> > +
> > +	priv.idle_page_list = &idle_page_list;
> > +	priv.cur_page_node = 0;
> > +	priv.page_nodes = kzalloc(sizeof(struct page_node) *
> > +				  (end_frame - start_frame), GFP_KERNEL);
> > +	if (!priv.page_nodes) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +
> > +	walk.private = &priv;
> > +	walk.mm = mm;
> > +
> > +	down_read(&mm->mmap_sem);
> > +
> > +	/*
> > +	 * idle_page_list is needed because walk_page_vma() holds ptlock which
> > +	 * deadlocks with page_idle_clear_pte_refs(). So we have to collect all
> > +	 * pages first, and then call page_idle_clear_pte_refs().
> > +	 */
> 
> Thanks for the comment, I was curious why you want to have
> idle_page_list and the reason is here.
> 
> How about making this /proc/<pid>/page_idle per-process granuariy,
> unlike system level /sys/xxx/page_idle? What I meant is not to check
> rmap to see any reference from random process but just check only
> access from the target process. It would be more proper as /proc/
> <pid>/ interface and good for per-process tracking as well as
> fast.


I prefer not to do this for the following reasons:
(1) It makes a feature lost, now accesses to shared pages will not be
accounted properly. 

(2) It makes it inconsistent with other idle page tracking mechanism. I
prefer if post per-process. At the heart of it, the tracking is always at the
physical page level -- I feel that is how it should be. Other drawback, is
also we have to document this subtlety.

Another reason is the performance is pretty good already with this mechanism
with rmap. I did idle tracking on 512MB range in about 15ms for read and 15ms
for write.

In the future if it is an issue, we can consider it.

thanks,

 - Joel

^ permalink raw reply

* Re: [PATCH v4 3/3] f2fs: Support case-insensitive file name lookups
From: Nathan Chancellor @ 2019-07-31 17:57 UTC (permalink / raw)
  To: Daniel Rosenberg
  Cc: Jaegeuk Kim, Chao Yu, Jonathan Corbet, linux-f2fs-devel,
	linux-kernel, linux-doc, linux-fsdevel, linux-api, kernel-team
In-Reply-To: <20190723230529.251659-4-drosen@google.com>

Hi all,

<snip>

> diff --git a/fs/f2fs/hash.c b/fs/f2fs/hash.c
> index cc82f142f811f..99e79934f5088 100644
> --- a/fs/f2fs/hash.c
> +++ b/fs/f2fs/hash.c
> @@ -14,6 +14,7 @@
>  #include <linux/f2fs_fs.h>
>  #include <linux/cryptohash.h>
>  #include <linux/pagemap.h>
> +#include <linux/unicode.h>
>  
>  #include "f2fs.h"
>  
> @@ -67,7 +68,7 @@ static void str2hashbuf(const unsigned char *msg, size_t len,
>  		*buf++ = pad;
>  }
>  
> -f2fs_hash_t f2fs_dentry_hash(const struct qstr *name_info,
> +static f2fs_hash_t __f2fs_dentry_hash(const struct qstr *name_info,
>  				struct fscrypt_name *fname)
>  {
>  	__u32 hash;
> @@ -103,3 +104,35 @@ f2fs_hash_t f2fs_dentry_hash(const struct qstr *name_info,
>  	f2fs_hash = cpu_to_le32(hash & ~F2FS_HASH_COL_BIT);
>  	return f2fs_hash;
>  }
> +
> +f2fs_hash_t f2fs_dentry_hash(const struct inode *dir,
> +		const struct qstr *name_info, struct fscrypt_name *fname)
> +{
> +#ifdef CONFIG_UNICODE
> +	struct f2fs_sb_info *sbi = F2FS_SB(dir->i_sb);
> +	const struct unicode_map *um = sbi->s_encoding;
> +	int r, dlen;
> +	unsigned char *buff;
> +	struct qstr *folded;
> +
> +	if (name_info->len && IS_CASEFOLDED(dir)) {
> +		buff = f2fs_kzalloc(sbi, sizeof(char) * PATH_MAX, GFP_KERNEL);
> +		if (!buff)
> +			return -ENOMEM;
> +
> +		dlen = utf8_casefold(um, name_info, buff, PATH_MAX);
> +		if (dlen < 0) {
> +			kvfree(buff);
> +			goto opaque_seq;
> +		}
> +		folded->name = buff;
> +		folded->len = dlen;
> +		r = __f2fs_dentry_hash(folded, fname);
> +
> +		kvfree(buff);
> +		return r;
> +	}
> +opaque_seq:
> +#endif
> +	return __f2fs_dentry_hash(name_info, fname);
> +}

Clang now warns:

fs/f2fs/hash.c:128:3: warning: variable 'folded' is uninitialized when used here [-Wuninitialized]
                folded->name = buff;
                ^~~~~~
fs/f2fs/hash.c:116:21: note: initialize the variable 'folded' to silence this warning
        struct qstr *folded;
                           ^
                            = NULL
1 warning generated.

I assume that it wants to be initialized with f2fs_kzalloc as well but
I am not familiar with this code and what it expects to do.

Please look into this when you get a chance!
Nathan

^ permalink raw reply

* Re: [PATCH v7 07/16] fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl
From: Eric Biggers @ 2019-07-31 18:38 UTC (permalink / raw)
  To: Theodore Y. Ts'o, linux-fscrypt, linux-fsdevel, linux-ext4,
	linux-f2fs-devel, linux-mtd, linux-api, linux-crypto, keyrings,
	Paul Crowley, Satya Tangirala
In-Reply-To: <20190729195827.GF169027@gmail.com>

On Mon, Jul 29, 2019 at 12:58:28PM -0700, Eric Biggers wrote:
> On Sun, Jul 28, 2019 at 03:24:17PM -0400, Theodore Y. Ts'o wrote:
> > > +
> > > +/*
> > > + * Try to remove an fscrypt master encryption key.  If other users have also
> > > + * added the key, we'll remove the current user's usage of the key, then return
> > > + * -EUSERS.  Otherwise we'll continue on and try to actually remove the key.
> > 
> > Nit: this should be moved to patch #11
> > 
> > Also, perror(EUSERS) will display "Too many users" which is going to
> > be confusing.  I understand why you chose this; we would like to
> > distinguish between there are still inodes using this key, and there
> > are other users using this key.
> > 
> > Do we really need to return EUSERS in this case?  It's actually not an
> > *error* that other users are using the key.  After all, the unlink(2)
> > system call doesn't return an advisory error when you delete a file
> > which has other hard links.  And an application which does care about
> > this detail can always call FS_IOC_ENCRYPTION_KEY_STATUS() and check
> > user_count.
> > 
> 
> Returning 0 when the key wasn't fully removed might also be confusing.  But I
> guess you're right that returning an error doesn't match how syscalls usually
> work.  It did remove the current user's usage of the key, after all, rather than
> completely fail.  And as you point out, if someone cares about other users
> having added the key, they can use FS_IOC_GET_ENCRYPTION_KEY_STATUS.
> 
> So I guess I'll change it to 0.
> 

So after making this change and thinking about it some more, I'm not sure it's
actually an improvement.

The normal use case for this ioctl is to "lock" some encrypted directory(s).  If
it returns 0 and doesn't lock the directory(s), that's unexpected.

This is perhaps different from what users expect from unlink().  It's well known
that unlink() just deletes the filename, not the file itself if it's still open
or has other links.  And unlink() by itself isn't meant for use cases where the
file absolutely must be securely erased.  But FS_IOC_REMOVE_ENCRYPTION_KEY
really is meant primarily for that sort of thing.

To give a concrete example: my patch for the userspace tool
https://github.com/google/fscrypt adds a command 'fscrypt lock' which locks an
encrypted directory.  If, say, someone runs 'fscrypt unlock' as uid 0 and then
'fscrypt lock' as uid 1000, then FS_IOC_REMOVE_ENCRYPTION_KEY can't actually
remove the key.  I need to make the tool show a proper error message in this
case.  To do so, it would help to get a unique error code (e.g. EUSERS) from
FS_IOC_REMOVE_ENCRYPTION_KEY, rather than get the ambiguous error code ENOKEY
and have to call FS_IOC_GET_ENCRYPTION_KEY_STATUS to get the real status.

Also, we already have the EBUSY case.  This means that the ioctl removed the
master key secret itself; however, some files were still in-use, so the key
remains in the "incompletely removed" state.  If we were actually going for
unlink() semantics, then for consistency this case really ought to return 0 and
unlink the key object, and people who care about in-use files would need to use
FS_IOC_GET_ENCRYPTION_KEY_STATUS.  But most people *will* care about this, and
may even want to retry the ioctl later, which isn't something you can do with
pure unlink() semantics.

So I'm leaning towards keeping the EUSERS and EBUSY errors.

- Eric

^ permalink raw reply

* Re: [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Mickaël Salaün @ 2019-07-31 18:46 UTC (permalink / raw)
  To: Alexei Starovoitov, Mickaël Salaün
  Cc: linux-kernel, Alexander Viro, Alexei Starovoitov, Andrew Morton,
	Andy Lutomirski, Arnaldo Carvalho de Melo, Casey Schaufler,
	Daniel Borkmann, David Drysdale, David S . Miller,
	Eric W . Biederman, James Morris, Jann Horn, John Johansen,
	Jonathan Corbet, Kees Cook, Michael Kerrisk, Paul Moore,
	Sargun Dhillon
In-Reply-To: <20190727014048.3czy3n2hi6hfdy3m@ast-mbp.dhcp.thefacebook.com>


On 27/07/2019 03:40, Alexei Starovoitov wrote:
> On Sun, Jul 21, 2019 at 11:31:12PM +0200, Mickaël Salaün wrote:
>> FIXME: 64-bits in the doc

FYI, this FIXME was fixed, just not removed from this message. :)

>>
>> This new map store arbitrary values referenced by inode keys.  The map
>> can be updated from user space with file descriptor pointing to inodes
>> tied to a file system.  From an eBPF (Landlock) program point of view,
>> such a map is read-only and can only be used to retrieved a value tied
>> to a given inode.  This is useful to recognize an inode tagged by user
>> space, without access right to this inode (i.e. no need to have a write
>> access to this inode).
>>
>> Add dedicated BPF functions to handle this type of map:
>> * bpf_inode_htab_map_update_elem()
>> * bpf_inode_htab_map_lookup_elem()
>> * bpf_inode_htab_map_delete_elem()
>>
>> This new map require a dedicated helper inode_map_lookup_elem() because
>> of the key which is a pointer to an opaque data (only provided by the
>> kernel).  This act like a (physical or cryptographic) key, which is why
>> it is also not allowed to get the next key.
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>
> there are too many things to comment on.
> Let's do this patch.
>
> imo inode_map concept is interesting, but see below...
>
>> +
>> +    /*
>> +     * Limit number of entries in an inode map to the maximum number of
>> +     * open files for the current process. The maximum number of file
>> +     * references (including all inode maps) for a process is then
>> +     * (RLIMIT_NOFILE - 1) * RLIMIT_NOFILE. If the process' RLIMIT_NOFILE
>> +     * is 0, then any entry update is forbidden.
>> +     *
>> +     * An eBPF program can inherit all the inode map FD. The worse case is
>> +     * to fill a bunch of arraymaps, create an eBPF program, close the
>> +     * inode map FDs, and start again. The maximum number of inode map
>> +     * entries can then be close to RLIMIT_NOFILE^3.
>> +     */
>> +    if (attr->max_entries > rlimit(RLIMIT_NOFILE))
>> +            return -EMFILE;
>
> rlimit is checked, but no fd are consumed.
> Once created such inode map_fd can be passed to a different process.
> map_fd can be pinned into bpffs.
> etc.
> what the value of the check?

I was looking for the most meaningful limit for a process, and rlimit is
the best I found. As the limit of open FD per processes, rlimit is not
perfect, but I think the semantic is close here (e.g. a process can also
pass FD through unix socket).

>
>> +
>> +    /* decorelate UAPI from kernel API */
>> +    attr->key_size = sizeof(struct inode *);
>> +
>> +    return htab_map_alloc_check(attr);
>> +}
>> +
>> +static void inode_htab_put_key(void *key)
>> +{
>> +    struct inode **inode = key;
>> +
>> +    if ((*inode)->i_state & I_FREEING)
>> +            return;
>
> checking the state without take a lock? isn't it racy?

This should only trigger when called from security_inode_free(). I'll
add a comment.

>
>> +    iput(*inode);
>> +}
>> +
>> +/* called from syscall or (never) from eBPF program */
>> +static int map_get_next_no_key(struct bpf_map *map, void *key, void *next_key)
>> +{
>> +    /* do not leak a file descriptor */
>
> what this comment suppose to mean?

Because a key is a reference to an inode, a possible return value for
this function could be a file descriptor pointing to this inode (the
same way a file descriptor is use to add an element). For now, I don't
want to implement a way for a process with such a map to extract such
inode, which I compare to a possible leak (of information, not kernel
memory nor object). This could be implemented in the future if there is
value in it (and probably some additional safeguards), though.

>
>> +    return -ENOTSUPP;
>> +}
>> +
>> +/* must call iput(inode) after this call */
>> +static struct inode *inode_from_fd(int ufd, bool check_access)
>> +{
>> +    struct inode *ret;
>> +    struct fd f;
>> +    int deny;
>> +
>> +    f = fdget(ufd);
>> +    if (unlikely(!f.file))
>> +            return ERR_PTR(-EBADF);
>> +    /* TODO?: add this check when called from an eBPF program too (already
>> +    * checked by the LSM parent hooks anyway) */
>> +    if (unlikely(IS_PRIVATE(file_inode(f.file)))) {
>> +            ret = ERR_PTR(-EINVAL);
>> +            goto put_fd;
>> +    }
>> +    /* check if the FD is tied to a mount point */
>> +    /* TODO?: add this check when called from an eBPF program too */
>> +    if (unlikely(f.file->f_path.mnt->mnt_flags & MNT_INTERNAL)) {
>> +            ret = ERR_PTR(-EINVAL);
>> +            goto put_fd;
>> +    }
>
> a bunch of TODOs do not inspire confidence.

I think the current implement is good, but these TODOs are here to draw
attention on particular points for which I would like external review
and opinion (hence the "?").

>
>> +    if (check_access) {
>> +            /*
>> +            * must be allowed to access attributes from this file to then
>> +            * be able to compare an inode to its map entry
>> +            */
>> +            deny = security_inode_getattr(&f.file->f_path);
>> +            if (deny) {
>> +                    ret = ERR_PTR(deny);
>> +                    goto put_fd;
>> +            }
>> +    }
>> +    ret = file_inode(f.file);
>> +    ihold(ret);
>> +
>> +put_fd:
>> +    fdput(f);
>> +    return ret;
>> +}
>> +
>> +/*
>> + * The key is a FD when called from a syscall, but an inode address when called
>> + * from an eBPF program.
>> + */
>> +
>> +/* called from syscall */
>> +int bpf_inode_fd_htab_map_lookup_elem(struct bpf_map *map, int *key, void *value)
>> +{
>> +    void *ptr;
>> +    struct inode *inode;
>> +    int ret;
>> +
>> +    /* check inode access */
>> +    inode = inode_from_fd(*key, true);
>> +    if (IS_ERR(inode))
>> +            return PTR_ERR(inode);
>> +
>> +    rcu_read_lock();
>> +    ptr = htab_map_lookup_elem(map, &inode);
>> +    iput(inode);
>> +    if (IS_ERR(ptr)) {
>> +            ret = PTR_ERR(ptr);
>> +    } else if (!ptr) {
>> +            ret = -ENOENT;
>> +    } else {
>> +            ret = 0;
>> +            copy_map_value(map, value, ptr);
>> +    }
>> +    rcu_read_unlock();
>> +    return ret;
>> +}
>> +
>> +/* called from kernel */
>
> wrong comment?
> kernel side cannot call it, right?

This is called from bpf_inode_fd_htab_map_delete_elem() (code just
beneath), and from
kernel/bpf/syscall.c:bpf_inode_ptr_unlocked_htab_map_delet_elem() which
can be called by security_inode_free() (hook_inode_free_security).

>
>> +int bpf_inode_ptr_locked_htab_map_delete_elem(struct bpf_map *map,
>> +            struct inode **key, bool remove_in_inode)
>> +{
>> +    if (remove_in_inode)
>> +            landlock_inode_remove_map(*key, map);
>> +    return htab_map_delete_elem(map, key);
>> +}
>> +
>> +/* called from syscall */
>> +int bpf_inode_fd_htab_map_delete_elem(struct bpf_map *map, int *key)
>> +{
>> +    struct inode *inode;
>> +    int ret;
>> +
>> +    /* do not check inode access (similar to directory check) */
>> +    inode = inode_from_fd(*key, false);
>> +    if (IS_ERR(inode))
>> +            return PTR_ERR(inode);
>> +    ret = bpf_inode_ptr_locked_htab_map_delete_elem(map, &inode, true);
>> +    iput(inode);
>> +    return ret;
>> +}
>> +
>> +/* called from syscall */
>> +int bpf_inode_fd_htab_map_update_elem(struct bpf_map *map, int *key, void *value,
>> +            u64 map_flags)
>> +{
>> +    struct inode *inode;
>> +    int ret;
>> +
>> +    WARN_ON_ONCE(!rcu_read_lock_held());
>> +
>> +    /* check inode access */
>> +    inode = inode_from_fd(*key, true);
>> +    if (IS_ERR(inode))
>> +            return PTR_ERR(inode);
>> +    ret = htab_map_update_elem(map, &inode, value, map_flags);
>> +    if (!ret)
>> +            ret = landlock_inode_add_map(inode, map);
>> +    iput(inode);
>> +    return ret;
>> +}
>> +
>> +static void inode_htab_map_free(struct bpf_map *map)
>> +{
>> +    struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>> +    struct hlist_nulls_node *n;
>> +    struct hlist_nulls_head *head;
>> +    struct htab_elem *l;
>> +    int i;
>> +
>> +    for (i = 0; i < htab->n_buckets; i++) {
>> +            head = select_bucket(htab, i);
>> +            hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
>> +                    landlock_inode_remove_map(*((struct inode **)l->key), map);
>> +            }
>> +    }
>> +    htab_map_free(map);
>> +}
>
> user space can delete the map.
> that will trigger inode_htab_map_free() which will call
> landlock_inode_remove_map().
> which will simply itereate the list and delete from the list.

landlock_inode_remove_map() removes the reference to the map (being
freed) from the inode (with an RCU lock).

>
> While in parallel inode can be destoyed and hook_inode_free_security()
> will be called.
> I think nothing that protects from this race.

According to security_inode_free(), the inode is effectively freed after
the RCU grace period. However, I forgot to call bpf_map_inc() in
landlock_inode_add_map(), which would prevent the map to be freed
outside of the security_inode_free(). I'll fix that.

>
>> +
>> +/*
>> + * We need a dedicated helper to deal with inode maps because the key is a
>> + * pointer to an opaque data, only provided by the kernel.  This really act
>> + * like a (physical or cryptographic) key, which is why it is also not allowed
>> + * to get the next key with map_get_next_key().
>
> inode pointer is like cryptographic key? :)

I wanted to highlight the fact that, contrary to other map key types,
the value of this one should not be readable, only usable. A "secret
value" is more appropriate but still confusing. I'll rephrase that.

>
>> + */
>> +BPF_CALL_2(bpf_inode_map_lookup_elem, struct bpf_map *, map, void *, key)
>> +{
>> +    WARN_ON_ONCE(!rcu_read_lock_held());
>> +    return (unsigned long)htab_map_lookup_elem(map, &key);
>> +}
>> +
>> +const struct bpf_func_proto bpf_inode_map_lookup_elem_proto = {
>> +    .func           = bpf_inode_map_lookup_elem,
>> +    .gpl_only       = false,
>> +    .pkt_access     = true,
>
> pkt_access ? :)

This slipped in with this rebase, I'll remove it. :)

>
>> +    .ret_type       = RET_PTR_TO_MAP_VALUE_OR_NULL,
>> +    .arg1_type      = ARG_CONST_MAP_PTR,
>> +    .arg2_type      = ARG_PTR_TO_INODE,
>> +};
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index b2a8cb14f28e..e46441c42b68 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -801,6 +801,8 @@ static int map_lookup_elem(union bpf_attr *attr)
>>      } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
>>                 map->map_type == BPF_MAP_TYPE_STACK) {
>>              err = map->ops->map_peek_elem(map, value);
>> +    } else if (map->map_type == BPF_MAP_TYPE_INODE) {
>> +            err = bpf_inode_fd_htab_map_lookup_elem(map, key, value);
>>      } else {
>>              rcu_read_lock();
>>              if (map->ops->map_lookup_elem_sys_only)
>> @@ -951,6 +953,10 @@ static int map_update_elem(union bpf_attr *attr)
>>      } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
>>                 map->map_type == BPF_MAP_TYPE_STACK) {
>>              err = map->ops->map_push_elem(map, value, attr->flags);
>> +    } else if (map->map_type == BPF_MAP_TYPE_INODE) {
>> +            rcu_read_lock();
>> +            err = bpf_inode_fd_htab_map_update_elem(map, key, value, attr->flags);
>> +            rcu_read_unlock();
>>      } else {
>>              rcu_read_lock();
>>              err = map->ops->map_update_elem(map, key, value, attr->flags);
>> @@ -1006,7 +1012,10 @@ static int map_delete_elem(union bpf_attr *attr)
>>      preempt_disable();
>>      __this_cpu_inc(bpf_prog_active);
>>      rcu_read_lock();
>> -    err = map->ops->map_delete_elem(map, key);
>> +    if (map->map_type == BPF_MAP_TYPE_INODE)
>> +            err = bpf_inode_fd_htab_map_delete_elem(map, key);
>> +    else
>> +            err = map->ops->map_delete_elem(map, key);
>>      rcu_read_unlock();
>>      __this_cpu_dec(bpf_prog_active);
>>      preempt_enable();
>> @@ -1018,6 +1027,22 @@ static int map_delete_elem(union bpf_attr *attr)
>>      return err;
>>  }
>>
>> +int bpf_inode_ptr_unlocked_htab_map_delete_elem(struct bpf_map *map,
>> +                                            struct inode **key, bool remove_in_inode)
>> +{
>> +    int err;
>> +
>> +    preempt_disable();
>> +    __this_cpu_inc(bpf_prog_active);
>> +    rcu_read_lock();
>> +    err = bpf_inode_ptr_locked_htab_map_delete_elem(map, key, remove_in_inode);
>> +    rcu_read_unlock();
>> +    __this_cpu_dec(bpf_prog_active);
>> +    preempt_enable();
>> +    maybe_wait_bpf_programs(map);
>
> if that function was actually doing synchronize_rcu() the consequences
> would have been unpleasant. Fortunately it's a nop in this case.
> Please read the code carefully before copy-paste.
> Also what do you think the reason of bpf_prog_active above?
> What is the reason of rcu_read_lock above?

The RCU is used as for every map modifications (usually from userspace).
I wasn't sure about the other protections so I kept the same (generic)
checks as in map_delete_elem() (just above) because this function follow
the same semantic. What can I safely remove?

>
> I think the patch set needs to shrink at least in half to be reviewable.
> The way you tie seccomp and lsm is probably the biggest obstacle
> than any of the bugs above.
> Can you drop seccomp ? and do it as normal lsm ?

The seccomp/enforcement part is needed to have a minimum viable product,
i.e. a process able to sandbox itself. Are you suggesting to first merge
a version when it is only possible to create inode maps but not use them
in an useful way (i.e. for sandboxing)? I can do it if it's OK with you,
and I hope it will not be a problem for the security folks if it can
help to move forward.

--
Mickaël Salaün
ANSSI/SDE/ST/LAM

Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.

^ permalink raw reply

* Re: [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Alexei Starovoitov @ 2019-07-31 18:58 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Mickaël Salaün, LKML, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, Michael Kerrisk, Paul Moore
In-Reply-To: <a870c2c9-d2f7-e0fa-c8cc-35dbf8b5b87d@ssi.gouv.fr>

On Wed, Jul 31, 2019 at 11:46 AM Mickaël Salaün
<mickael.salaun@ssi.gouv.fr> wrote:
> >> +    for (i = 0; i < htab->n_buckets; i++) {
> >> +            head = select_bucket(htab, i);
> >> +            hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
> >> +                    landlock_inode_remove_map(*((struct inode **)l->key), map);
> >> +            }
> >> +    }
> >> +    htab_map_free(map);
> >> +}
> >
> > user space can delete the map.
> > that will trigger inode_htab_map_free() which will call
> > landlock_inode_remove_map().
> > which will simply itereate the list and delete from the list.
>
> landlock_inode_remove_map() removes the reference to the map (being
> freed) from the inode (with an RCU lock).

I'm going to ignore everything else for now and focus only on this bit,
since it's fundamental issue to address before this discussion can
go any further.
rcu_lock is not a spin_lock. I'm pretty sure you know this.
But you're arguing that it's somehow protecting from the race
I mentioned above?

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-07-31 19:09 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <D4040C0C-47D6-4852-933C-59EB53C05242@fb.com>

On Wed, Jul 31, 2019 at 1:10 AM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jul 30, 2019, at 1:24 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Mon, Jul 29, 2019 at 10:07 PM Song Liu <songliubraving@fb.com> wrote:
> >>
> >> Hi Andy,
> >>
> >>> On Jul 27, 2019, at 11:20 AM, Song Liu <songliubraving@fb.com> wrote:
> >>>
> >>> Hi Andy,
> >>>
> >>>
>
> [...]
>
> >>>
> >>
> >> I would like more comments on this.
> >>
> >> Currently, bpf permission is more or less "root or nothing", which we
> >> would like to change.
> >>
> >> The short term goal is to separate bpf from root, in other words, it is
> >> "all or nothing". Special user space utilities, such as systemd, would
> >> benefit from this. Once this is implemented, systemd can call sys_bpf()
> >> when it is not running as root.
> >
> > As generally nasty as Linux capabilities are, this sounds like a good
> > use for CAP_BPF_ADMIN.
>
> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.

It's been done before -- it's not that hard.  IMO the main tricky bit
would be try be somewhat careful about defining exactly what
CAP_BPF_ADMIN does.

> > I don't see why you need to invent a whole new mechanism for this.
> > The entire cgroup ecosystem outside bpf() does just fine using the
> > write permission on files in cgroupfs to control access.  Why can't
> > bpf() do the same thing?
>
> It is easier to use write permission for BPF_PROG_ATTACH. But it is
> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
> we should have target concept for all these commands. But that is a
> much bigger project. OTOH, "all or nothing" model allows all these
> commands at once.

For BPF_PROG_LOAD, I admit I've never understood why permission is
required at all.  I think that CAP_SYS_ADMIN or similar should be
needed to get is_priv in the verifier, but I think that should mainly
be useful for tracing, and that requires lots of privilege anyway.
BPF_MAP_* is probably the trickiest part.  One solution would be some
kind of bpffs, but I'm sure other solutions are possible.

^ permalink raw reply

* Re: [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Mickaël Salaün @ 2019-07-31 19:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mickaël Salaün, LKML, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, Michael Kerrisk, Paul Moore
In-Reply-To: <CAADnVQLqkfVijWoOM29PxCL_yK6K0fr8B89r4c5EKgddevJhGQ@mail.gmail.com>

On 31/07/2019 20:58, Alexei Starovoitov wrote:
> On Wed, Jul 31, 2019 at 11:46 AM Mickaël Salaün
> <mickael.salaun@ssi.gouv.fr> wrote:
>>>> +    for (i = 0; i < htab->n_buckets; i++) {
>>>> +            head = select_bucket(htab, i);
>>>> +            hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
>>>> +                    landlock_inode_remove_map(*((struct inode **)l->key), map);
>>>> +            }
>>>> +    }
>>>> +    htab_map_free(map);
>>>> +}
>>>
>>> user space can delete the map.
>>> that will trigger inode_htab_map_free() which will call
>>> landlock_inode_remove_map().
>>> which will simply itereate the list and delete from the list.
>>
>> landlock_inode_remove_map() removes the reference to the map (being
>> freed) from the inode (with an RCU lock).
>
> I'm going to ignore everything else for now and focus only on this bit,
> since it's fundamental issue to address before this discussion can
> go any further.
> rcu_lock is not a spin_lock. I'm pretty sure you know this.
> But you're arguing that it's somehow protecting from the race
> I mentioned above?
>

I was just clarifying your comment to avoid misunderstanding about what
is being removed.

As said in the full response, there is currently a race but, if I add a
bpf_map_inc() call when the map is referenced by inode->security, then I
don't see how a race could occur because such added map could only be
freed in a security_inode_free() (as long as it retains a reference to
this inode).

--
Mickaël Salaün
ANSSI/SDE/ST/LAM

Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.

^ permalink raw reply

* [PATCH V37 00/29] security: Add support for locking down the kernel
From: Matthew Garrett @ 2019-07-31 22:15 UTC (permalink / raw)
  To: jmorris; +Cc: linux-security-module, linux-kernel, linux-api

A minor fix to the tracefs patch, some Acks and reviews added to the SOB
chain, and rebased on next/master (there were a couple of minor fixes needed to
align that).

^ permalink raw reply

* [PATCH V37 01/29] security: Support early LSMs
From: Matthew Garrett @ 2019-07-31 22:15 UTC (permalink / raw)
  To: jmorris
  Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
	Matthew Garrett, Kees Cook, Casey Schaufler
In-Reply-To: <20190731221617.234725-1-matthewgarrett@google.com>

The lockdown module is intended to allow for kernels to be locked down
early in boot - sufficiently early that we don't have the ability to
kmalloc() yet. Add support for early initialisation of some LSMs, and
then add them to the list of names when we do full initialisation later.
Early LSMs are initialised in link order and cannot be overridden via
boot parameters, and cannot make use of kmalloc() (since the allocator
isn't initialised yet).

Signed-off-by: Matthew Garrett <mjg59@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
---
 include/asm-generic/vmlinux.lds.h |  8 ++++-
 include/linux/lsm_hooks.h         |  6 ++++
 include/linux/security.h          |  1 +
 init/main.c                       |  1 +
 security/security.c               | 50 ++++++++++++++++++++++++++-----
 5 files changed, 57 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index cd28f63bfbc7..dae64600ccbf 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -215,8 +215,13 @@
 			__start_lsm_info = .;				\
 			KEEP(*(.lsm_info.init))				\
 			__end_lsm_info = .;
+#define EARLY_LSM_TABLE()	. = ALIGN(8);				\
+			__start_early_lsm_info = .;			\
+			KEEP(*(.early_lsm_info.init))			\
+			__end_early_lsm_info = .;
 #else
 #define LSM_TABLE()
+#define EARLY_LSM_TABLE()
 #endif
 
 #define ___OF_TABLE(cfg, name)	_OF_TABLE_##cfg(name)
@@ -627,7 +632,8 @@
 	ACPI_PROBE_TABLE(timer)						\
 	THERMAL_TABLE(governor)						\
 	EARLYCON_TABLE()						\
-	LSM_TABLE()
+	LSM_TABLE()							\
+	EARLY_LSM_TABLE()
 
 #define INIT_TEXT							\
 	*(.init.text .init.text.*)					\
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index df1318d85f7d..aebb0e032072 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -2104,12 +2104,18 @@ struct lsm_info {
 };
 
 extern struct lsm_info __start_lsm_info[], __end_lsm_info[];
+extern struct lsm_info __start_early_lsm_info[], __end_early_lsm_info[];
 
 #define DEFINE_LSM(lsm)							\
 	static struct lsm_info __lsm_##lsm				\
 		__used __section(.lsm_info.init)			\
 		__aligned(sizeof(unsigned long))
 
+#define DEFINE_EARLY_LSM(lsm)						\
+	static struct lsm_info __early_lsm_##lsm			\
+		__used __section(.early_lsm_info.init)			\
+		__aligned(sizeof(unsigned long))
+
 #ifdef CONFIG_SECURITY_SELINUX_DISABLE
 /*
  * Assuring the safety of deleting a security module is up to
diff --git a/include/linux/security.h b/include/linux/security.h
index 5f7441abbf42..66a2fcbe6ab0 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -195,6 +195,7 @@ int unregister_blocking_lsm_notifier(struct notifier_block *nb);
 
 /* prototypes */
 extern int security_init(void);
+extern int early_security_init(void);
 
 /* Security operations */
 int security_binder_set_context_mgr(struct task_struct *mgr);
diff --git a/init/main.c b/init/main.c
index 96f8d5af52d6..565af7b963e1 100644
--- a/init/main.c
+++ b/init/main.c
@@ -593,6 +593,7 @@ asmlinkage __visible void __init start_kernel(void)
 	boot_cpu_init();
 	page_address_init();
 	pr_notice("%s", linux_banner);
+	early_security_init();
 	setup_arch(&command_line);
 	mm_init_cpumask(&init_mm);
 	setup_command_line(command_line);
diff --git a/security/security.c b/security/security.c
index 250ee2d76406..90f1e291c800 100644
--- a/security/security.c
+++ b/security/security.c
@@ -33,6 +33,7 @@
 
 /* How many LSMs were built into the kernel? */
 #define LSM_COUNT (__end_lsm_info - __start_lsm_info)
+#define EARLY_LSM_COUNT (__end_early_lsm_info - __start_early_lsm_info)
 
 struct security_hook_heads security_hook_heads __lsm_ro_after_init;
 static BLOCKING_NOTIFIER_HEAD(blocking_lsm_notifier_chain);
@@ -277,6 +278,8 @@ static void __init ordered_lsm_parse(const char *order, const char *origin)
 static void __init lsm_early_cred(struct cred *cred);
 static void __init lsm_early_task(struct task_struct *task);
 
+static int lsm_append(const char *new, char **result);
+
 static void __init ordered_lsm_init(void)
 {
 	struct lsm_info **lsm;
@@ -323,6 +326,26 @@ static void __init ordered_lsm_init(void)
 	kfree(ordered_lsms);
 }
 
+int __init early_security_init(void)
+{
+	int i;
+	struct hlist_head *list = (struct hlist_head *) &security_hook_heads;
+	struct lsm_info *lsm;
+
+	for (i = 0; i < sizeof(security_hook_heads) / sizeof(struct hlist_head);
+	     i++)
+		INIT_HLIST_HEAD(&list[i]);
+
+	for (lsm = __start_early_lsm_info; lsm < __end_early_lsm_info; lsm++) {
+		if (!lsm->enabled)
+			lsm->enabled = &lsm_enabled_true;
+		prepare_lsm(lsm);
+		initialize_lsm(lsm);
+	}
+
+	return 0;
+}
+
 /**
  * security_init - initializes the security framework
  *
@@ -330,14 +353,18 @@ static void __init ordered_lsm_init(void)
  */
 int __init security_init(void)
 {
-	int i;
-	struct hlist_head *list = (struct hlist_head *) &security_hook_heads;
+	struct lsm_info *lsm;
 
 	pr_info("Security Framework initializing\n");
 
-	for (i = 0; i < sizeof(security_hook_heads) / sizeof(struct hlist_head);
-	     i++)
-		INIT_HLIST_HEAD(&list[i]);
+	/*
+	 * Append the names of the early LSM modules now that kmalloc() is
+	 * available
+	 */
+	for (lsm = __start_early_lsm_info; lsm < __end_early_lsm_info; lsm++) {
+		if (lsm->enabled)
+			lsm_append(lsm->name, &lsm_names);
+	}
 
 	/* Load LSMs in specified order. */
 	ordered_lsm_init();
@@ -384,7 +411,7 @@ static bool match_last_lsm(const char *list, const char *lsm)
 	return !strcmp(last, lsm);
 }
 
-static int lsm_append(char *new, char **result)
+static int lsm_append(const char *new, char **result)
 {
 	char *cp;
 
@@ -422,8 +449,15 @@ void __init security_add_hooks(struct security_hook_list *hooks, int count,
 		hooks[i].lsm = lsm;
 		hlist_add_tail_rcu(&hooks[i].list, hooks[i].head);
 	}
-	if (lsm_append(lsm, &lsm_names) < 0)
-		panic("%s - Cannot get early memory.\n", __func__);
+
+	/*
+	 * Don't try to append during early_security_init(), we'll come back
+	 * and fix this up afterwards.
+	 */
+	if (slab_is_available()) {
+		if (lsm_append(lsm, &lsm_names) < 0)
+			panic("%s - Cannot get early memory.\n", __func__);
+	}
 }
 
 int call_blocking_lsm_notifier(enum lsm_event event, void *data)
-- 
2.22.0.770.g0f2c4a37fd-goog

^ permalink raw reply related

* [PATCH V37 02/29] security: Add a "locked down" LSM hook
From: Matthew Garrett @ 2019-07-31 22:15 UTC (permalink / raw)
  To: jmorris
  Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
	Matthew Garrett, Kees Cook, Casey Schaufler
In-Reply-To: <20190731221617.234725-1-matthewgarrett@google.com>

Add a mechanism to allow LSMs to make a policy decision around whether
kernel functionality that would allow tampering with or examining the
runtime state of the kernel should be permitted.

Signed-off-by: Matthew Garrett <mjg59@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
---
 include/linux/lsm_hooks.h |  2 ++
 include/linux/security.h  | 32 ++++++++++++++++++++++++++++++++
 security/security.c       |  6 ++++++
 3 files changed, 40 insertions(+)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index aebb0e032072..29c22cf40113 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1807,6 +1807,7 @@ union security_list_options {
 	int (*bpf_prog_alloc_security)(struct bpf_prog_aux *aux);
 	void (*bpf_prog_free_security)(struct bpf_prog_aux *aux);
 #endif /* CONFIG_BPF_SYSCALL */
+	int (*locked_down)(enum lockdown_reason what);
 };
 
 struct security_hook_heads {
@@ -2046,6 +2047,7 @@ struct security_hook_heads {
 	struct hlist_head bpf_prog_alloc_security;
 	struct hlist_head bpf_prog_free_security;
 #endif /* CONFIG_BPF_SYSCALL */
+	struct hlist_head locked_down;
 } __randomize_layout;
 
 /*
diff --git a/include/linux/security.h b/include/linux/security.h
index 66a2fcbe6ab0..c2b1204e8e26 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -77,6 +77,33 @@ enum lsm_event {
 	LSM_POLICY_CHANGE,
 };
 
+/*
+ * These are reasons that can be passed to the security_locked_down()
+ * LSM hook. Lockdown reasons that protect kernel integrity (ie, the
+ * ability for userland to modify kernel code) are placed before
+ * LOCKDOWN_INTEGRITY_MAX.  Lockdown reasons that protect kernel
+ * confidentiality (ie, the ability for userland to extract
+ * information from the running kernel that would otherwise be
+ * restricted) are placed before LOCKDOWN_CONFIDENTIALITY_MAX.
+ *
+ * LSM authors should note that the semantics of any given lockdown
+ * reason are not guaranteed to be stable - the same reason may block
+ * one set of features in one kernel release, and a slightly different
+ * set of features in a later kernel release. LSMs that seek to expose
+ * lockdown policy at any level of granularity other than "none",
+ * "integrity" or "confidentiality" are responsible for either
+ * ensuring that they expose a consistent level of functionality to
+ * userland, or ensuring that userland is aware that this is
+ * potentially a moving target. It is easy to misuse this information
+ * in a way that could break userspace. Please be careful not to do
+ * so.
+ */
+enum lockdown_reason {
+	LOCKDOWN_NONE,
+	LOCKDOWN_INTEGRITY_MAX,
+	LOCKDOWN_CONFIDENTIALITY_MAX,
+};
+
 /* These functions are in security/commoncap.c */
 extern int cap_capable(const struct cred *cred, struct user_namespace *ns,
 		       int cap, unsigned int opts);
@@ -393,6 +420,7 @@ void security_inode_invalidate_secctx(struct inode *inode);
 int security_inode_notifysecctx(struct inode *inode, void *ctx, u32 ctxlen);
 int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen);
 int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);
+int security_locked_down(enum lockdown_reason what);
 #else /* CONFIG_SECURITY */
 
 static inline int call_blocking_lsm_notifier(enum lsm_event event, void *data)
@@ -1205,6 +1233,10 @@ static inline int security_inode_getsecctx(struct inode *inode, void **ctx, u32
 {
 	return -EOPNOTSUPP;
 }
+static inline int security_locked_down(enum lockdown_reason what)
+{
+	return 0;
+}
 #endif	/* CONFIG_SECURITY */
 
 #ifdef CONFIG_SECURITY_NETWORK
diff --git a/security/security.c b/security/security.c
index 90f1e291c800..ce6c945bf347 100644
--- a/security/security.c
+++ b/security/security.c
@@ -2392,3 +2392,9 @@ void security_bpf_prog_free(struct bpf_prog_aux *aux)
 	call_void_hook(bpf_prog_free_security, aux);
 }
 #endif /* CONFIG_BPF_SYSCALL */
+
+int security_locked_down(enum lockdown_reason what)
+{
+	return call_int_hook(locked_down, 0, what);
+}
+EXPORT_SYMBOL(security_locked_down);
-- 
2.22.0.770.g0f2c4a37fd-goog

^ permalink raw reply related

* [PATCH V37 03/29] security: Add a static lockdown policy LSM
From: Matthew Garrett @ 2019-07-31 22:15 UTC (permalink / raw)
  To: jmorris
  Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
	Matthew Garrett, Kees Cook, David Howells
In-Reply-To: <20190731221617.234725-1-matthewgarrett@google.com>

While existing LSMs can be extended to handle lockdown policy,
distributions generally want to be able to apply a straightforward
static policy. This patch adds a simple LSM that can be configured to
reject either integrity or all lockdown queries, and can be configured
at runtime (through securityfs), boot time (via a kernel parameter) or
build time (via a kconfig option). Based on initial code by David
Howells.

Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
---
 .../admin-guide/kernel-parameters.txt         |   9 +
 include/linux/security.h                      |   3 +
 security/Kconfig                              |  11 +-
 security/Makefile                             |   2 +
 security/lockdown/Kconfig                     |  47 +++++
 security/lockdown/Makefile                    |   1 +
 security/lockdown/lockdown.c                  | 172 ++++++++++++++++++
 7 files changed, 240 insertions(+), 5 deletions(-)
 create mode 100644 security/lockdown/Kconfig
 create mode 100644 security/lockdown/Makefile
 create mode 100644 security/lockdown/lockdown.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ef52d01524af..0b962844ac2a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2267,6 +2267,15 @@
 	lockd.nlm_udpport=M	[NFS] Assign UDP port.
 			Format: <integer>
 
+	lockdown=	[SECURITY]
+			{ integrity | confidentiality }
+			Enable the kernel lockdown feature. If set to
+			integrity, kernel features that allow userland to
+			modify the running kernel are disabled. If set to
+			confidentiality, kernel features that allow userland
+			to extract confidential information from the kernel
+			are also disabled.
+
 	locktorture.nreaders_stress= [KNL]
 			Set the number of locking read-acquisition kthreads.
 			Defaults to being automatically set based on the
diff --git a/include/linux/security.h b/include/linux/security.h
index c2b1204e8e26..54a0532ec12f 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -97,6 +97,9 @@ enum lsm_event {
  * potentially a moving target. It is easy to misuse this information
  * in a way that could break userspace. Please be careful not to do
  * so.
+ *
+ * If you add to this, remember to extend lockdown_reasons in
+ * security/lockdown/lockdown.c.
  */
 enum lockdown_reason {
 	LOCKDOWN_NONE,
diff --git a/security/Kconfig b/security/Kconfig
index 0d65594b5196..2a1a2d396228 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -237,6 +237,7 @@ source "security/apparmor/Kconfig"
 source "security/loadpin/Kconfig"
 source "security/yama/Kconfig"
 source "security/safesetid/Kconfig"
+source "security/lockdown/Kconfig"
 
 source "security/integrity/Kconfig"
 
@@ -276,11 +277,11 @@ endchoice
 
 config LSM
 	string "Ordered list of enabled LSMs"
-	default "yama,loadpin,safesetid,integrity,smack,selinux,tomoyo,apparmor" if DEFAULT_SECURITY_SMACK
-	default "yama,loadpin,safesetid,integrity,apparmor,selinux,smack,tomoyo" if DEFAULT_SECURITY_APPARMOR
-	default "yama,loadpin,safesetid,integrity,tomoyo" if DEFAULT_SECURITY_TOMOYO
-	default "yama,loadpin,safesetid,integrity" if DEFAULT_SECURITY_DAC
-	default "yama,loadpin,safesetid,integrity,selinux,smack,tomoyo,apparmor"
+	default "lockdown,yama,loadpin,safesetid,integrity,smack,selinux,tomoyo,apparmor" if DEFAULT_SECURITY_SMACK
+	default "lockdown,yama,loadpin,safesetid,integrity,apparmor,selinux,smack,tomoyo" if DEFAULT_SECURITY_APPARMOR
+	default "lockdown,yama,loadpin,safesetid,integrity,tomoyo" if DEFAULT_SECURITY_TOMOYO
+	default "lockdown,yama,loadpin,safesetid,integrity" if DEFAULT_SECURITY_DAC
+	default "lockdown,yama,loadpin,safesetid,integrity,selinux,smack,tomoyo,apparmor"
 	help
 	  A comma-separated list of LSMs, in initialization order.
 	  Any LSMs left off this list will be ignored. This can be
diff --git a/security/Makefile b/security/Makefile
index c598b904938f..be1dd9d2cb2f 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -11,6 +11,7 @@ subdir-$(CONFIG_SECURITY_APPARMOR)	+= apparmor
 subdir-$(CONFIG_SECURITY_YAMA)		+= yama
 subdir-$(CONFIG_SECURITY_LOADPIN)	+= loadpin
 subdir-$(CONFIG_SECURITY_SAFESETID)    += safesetid
+subdir-$(CONFIG_SECURITY_LOCKDOWN_LSM)	+= lockdown
 
 # always enable default capabilities
 obj-y					+= commoncap.o
@@ -27,6 +28,7 @@ obj-$(CONFIG_SECURITY_APPARMOR)		+= apparmor/
 obj-$(CONFIG_SECURITY_YAMA)		+= yama/
 obj-$(CONFIG_SECURITY_LOADPIN)		+= loadpin/
 obj-$(CONFIG_SECURITY_SAFESETID)       += safesetid/
+obj-$(CONFIG_SECURITY_LOCKDOWN_LSM)	+= lockdown/
 obj-$(CONFIG_CGROUP_DEVICE)		+= device_cgroup.o
 
 # Object integrity file lists
diff --git a/security/lockdown/Kconfig b/security/lockdown/Kconfig
new file mode 100644
index 000000000000..7374ba76d8eb
--- /dev/null
+++ b/security/lockdown/Kconfig
@@ -0,0 +1,47 @@
+config SECURITY_LOCKDOWN_LSM
+	bool "Basic module for enforcing kernel lockdown"
+	depends on SECURITY
+	help
+	  Build support for an LSM that enforces a coarse kernel lockdown
+	  behaviour.
+
+config SECURITY_LOCKDOWN_LSM_EARLY
+	bool "Enable lockdown LSM early in init"
+	depends on SECURITY_LOCKDOWN_LSM
+	help
+	  Enable the lockdown LSM early in boot. This is necessary in order
+	  to ensure that lockdown enforcement can be carried out on kernel
+	  boot parameters that are otherwise parsed before the security
+	  subsystem is fully initialised. If enabled, lockdown will
+	  unconditionally be called before any other LSMs.
+
+choice
+	prompt "Kernel default lockdown mode"
+	default LOCK_DOWN_KERNEL_FORCE_NONE
+	depends on SECURITY_LOCKDOWN_LSM
+	help
+	  The kernel can be configured to default to differing levels of
+	  lockdown.
+
+config LOCK_DOWN_KERNEL_FORCE_NONE
+	bool "None"
+	help
+	  No lockdown functionality is enabled by default. Lockdown may be
+	  enabled via the kernel commandline or /sys/kernel/security/lockdown.
+
+config LOCK_DOWN_KERNEL_FORCE_INTEGRITY
+	bool "Integrity"
+	help
+	 The kernel runs in integrity mode by default. Features that allow
+	 the kernel to be modified at runtime are disabled.
+
+config LOCK_DOWN_KERNEL_FORCE_CONFIDENTIALITY
+	bool "Confidentiality"
+	help
+	 The kernel runs in confidentiality mode by default. Features that
+	 allow the kernel to be modified at runtime or that permit userland
+	 code to read confidential material held inside the kernel are
+	 disabled.
+
+endchoice
+
diff --git a/security/lockdown/Makefile b/security/lockdown/Makefile
new file mode 100644
index 000000000000..e3634b9017e7
--- /dev/null
+++ b/security/lockdown/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_SECURITY_LOCKDOWN_LSM) += lockdown.o
diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
new file mode 100644
index 000000000000..d30c4d254b5f
--- /dev/null
+++ b/security/lockdown/lockdown.c
@@ -0,0 +1,172 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Lock down the kernel
+ *
+ * Copyright (C) 2016 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/security.h>
+#include <linux/export.h>
+#include <linux/lsm_hooks.h>
+
+static enum lockdown_reason kernel_locked_down;
+
+static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
+	[LOCKDOWN_NONE] = "none",
+	[LOCKDOWN_INTEGRITY_MAX] = "integrity",
+	[LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
+};
+
+static enum lockdown_reason lockdown_levels[] = {LOCKDOWN_NONE,
+						 LOCKDOWN_INTEGRITY_MAX,
+						 LOCKDOWN_CONFIDENTIALITY_MAX};
+
+/*
+ * Put the kernel into lock-down mode.
+ */
+static int lock_kernel_down(const char *where, enum lockdown_reason level)
+{
+	if (kernel_locked_down >= level)
+		return -EPERM;
+
+	kernel_locked_down = level;
+	pr_notice("Kernel is locked down from %s; see man kernel_lockdown.7\n",
+		  where);
+	return 0;
+}
+
+static int __init lockdown_param(char *level)
+{
+	if (!level)
+		return -EINVAL;
+
+	if (strcmp(level, "integrity") == 0)
+		lock_kernel_down("command line", LOCKDOWN_INTEGRITY_MAX);
+	else if (strcmp(level, "confidentiality") == 0)
+		lock_kernel_down("command line", LOCKDOWN_CONFIDENTIALITY_MAX);
+	else
+		return -EINVAL;
+
+	return 0;
+}
+
+early_param("lockdown", lockdown_param);
+
+/**
+ * lockdown_is_locked_down - Find out if the kernel is locked down
+ * @what: Tag to use in notice generated if lockdown is in effect
+ */
+static int lockdown_is_locked_down(enum lockdown_reason what)
+{
+	if (kernel_locked_down >= what) {
+		if (lockdown_reasons[what])
+			pr_notice("Lockdown: %s is restricted; see man kernel_lockdown.7\n",
+				  lockdown_reasons[what]);
+		return -EPERM;
+	}
+
+	return 0;
+}
+
+static struct security_hook_list lockdown_hooks[] __lsm_ro_after_init = {
+	LSM_HOOK_INIT(locked_down, lockdown_is_locked_down),
+};
+
+static int __init lockdown_lsm_init(void)
+{
+#if defined(CONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY)
+	lock_kernel_down("Kernel configuration", LOCKDOWN_INTEGRITY_MAX);
+#elif defined(CONFIG_LOCK_DOWN_KERNEL_FORCE_CONFIDENTIALITY)
+	lock_kernel_down("Kernel configuration", LOCKDOWN_CONFIDENTIALITY_MAX);
+#endif
+	security_add_hooks(lockdown_hooks, ARRAY_SIZE(lockdown_hooks),
+			   "lockdown");
+	return 0;
+}
+
+static ssize_t lockdown_read(struct file *filp, char __user *buf, size_t count,
+			     loff_t *ppos)
+{
+	char temp[80];
+	int i, offset = 0;
+
+	for (i = 0; i < ARRAY_SIZE(lockdown_levels); i++) {
+		enum lockdown_reason level = lockdown_levels[i];
+
+		if (lockdown_reasons[level]) {
+			const char *label = lockdown_reasons[level];
+
+			if (kernel_locked_down == level)
+				offset += sprintf(temp+offset, "[%s] ", label);
+			else
+				offset += sprintf(temp+offset, "%s ", label);
+		}
+	}
+
+	/* Convert the last space to a newline if needed. */
+	if (offset > 0)
+		temp[offset-1] = '\n';
+
+	return simple_read_from_buffer(buf, count, ppos, temp, strlen(temp));
+}
+
+static ssize_t lockdown_write(struct file *file, const char __user *buf,
+			      size_t n, loff_t *ppos)
+{
+	char *state;
+	int i, len, err = -EINVAL;
+
+	state = memdup_user_nul(buf, n);
+	if (IS_ERR(state))
+		return PTR_ERR(state);
+
+	len = strlen(state);
+	if (len && state[len-1] == '\n') {
+		state[len-1] = '\0';
+		len--;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(lockdown_levels); i++) {
+		enum lockdown_reason level = lockdown_levels[i];
+		const char *label = lockdown_reasons[level];
+
+		if (label && !strcmp(state, label))
+			err = lock_kernel_down("securityfs", level);
+	}
+
+	kfree(state);
+	return err ? err : n;
+}
+
+static const struct file_operations lockdown_ops = {
+	.read  = lockdown_read,
+	.write = lockdown_write,
+};
+
+static int __init lockdown_secfs_init(void)
+{
+	struct dentry *dentry;
+
+	dentry = securityfs_create_file("lockdown", 0600, NULL, NULL,
+					&lockdown_ops);
+	if (IS_ERR(dentry))
+		return PTR_ERR(dentry);
+
+	return 0;
+}
+
+core_initcall(lockdown_secfs_init);
+
+#ifdef CONFIG_SECURITY_LOCKDOWN_LSM_EARLY
+DEFINE_EARLY_LSM(lockdown) = {
+#else
+DEFINE_LSM(lockdown) = {
+#endif
+	.name = "lockdown",
+	.init = lockdown_lsm_init,
+};
-- 
2.22.0.770.g0f2c4a37fd-goog

^ permalink raw reply related

* [PATCH V37 04/29] Enforce module signatures if the kernel is locked down
From: Matthew Garrett @ 2019-07-31 22:15 UTC (permalink / raw)
  To: jmorris
  Cc: linux-security-module, linux-kernel, linux-api, David Howells,
	Matthew Garrett, Kees Cook, Jessica Yu
In-Reply-To: <20190731221617.234725-1-matthewgarrett@google.com>

From: David Howells <dhowells@redhat.com>

If the kernel is locked down, require that all modules have valid
signatures that we can verify.

I have adjusted the errors generated:

 (1) If there's no signature (ENODATA) or we can't check it (ENOPKG,
     ENOKEY), then:

     (a) If signatures are enforced then EKEYREJECTED is returned.

     (b) If there's no signature or we can't check it, but the kernel is
	 locked down then EPERM is returned (this is then consistent with
	 other lockdown cases).

 (2) If the signature is unparseable (EBADMSG, EINVAL), the signature fails
     the check (EKEYREJECTED) or a system error occurs (eg. ENOMEM), we
     return the error we got.

Note that the X.509 code doesn't check for key expiry as the RTC might not
be valid or might not have been transferred to the kernel's clock yet.

 [Modified by Matthew Garrett to remove the IMA integration. This will
  be replaced with integration with the IMA architecture policy
  patchset.]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <matthewgarrett@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Jessica Yu <jeyu@kernel.org>
---
 include/linux/security.h     |  1 +
 kernel/module.c              | 37 +++++++++++++++++++++++++++++-------
 security/lockdown/lockdown.c |  1 +
 3 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/include/linux/security.h b/include/linux/security.h
index 54a0532ec12f..8e70063074a1 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -103,6 +103,7 @@ enum lsm_event {
  */
 enum lockdown_reason {
 	LOCKDOWN_NONE,
+	LOCKDOWN_MODULE_SIGNATURE,
 	LOCKDOWN_INTEGRITY_MAX,
 	LOCKDOWN_CONFIDENTIALITY_MAX,
 };
diff --git a/kernel/module.c b/kernel/module.c
index cd8df516666d..318209889e26 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -2771,8 +2771,9 @@ static inline void kmemleak_load_module(const struct module *mod,
 #ifdef CONFIG_MODULE_SIG
 static int module_sig_check(struct load_info *info, int flags)
 {
-	int err = -ENOKEY;
+	int err = -ENODATA;
 	const unsigned long markerlen = sizeof(MODULE_SIG_STRING) - 1;
+	const char *reason;
 	const void *mod = info->hdr;
 
 	/*
@@ -2787,16 +2788,38 @@ static int module_sig_check(struct load_info *info, int flags)
 		err = mod_verify_sig(mod, info);
 	}
 
-	if (!err) {
+	switch (err) {
+	case 0:
 		info->sig_ok = true;
 		return 0;
-	}
 
-	/* Not having a signature is only an error if we're strict. */
-	if (err == -ENOKEY && !is_module_sig_enforced())
-		err = 0;
+		/* We don't permit modules to be loaded into trusted kernels
+		 * without a valid signature on them, but if we're not
+		 * enforcing, certain errors are non-fatal.
+		 */
+	case -ENODATA:
+		reason = "Loading of unsigned module";
+		goto decide;
+	case -ENOPKG:
+		reason = "Loading of module with unsupported crypto";
+		goto decide;
+	case -ENOKEY:
+		reason = "Loading of module with unavailable key";
+	decide:
+		if (is_module_sig_enforced()) {
+			pr_notice("%s is rejected\n", reason);
+			return -EKEYREJECTED;
+		}
 
-	return err;
+		return security_locked_down(LOCKDOWN_MODULE_SIGNATURE);
+
+		/* All other errors are fatal, including nomem, unparseable
+		 * signatures and signature check failures - even if signatures
+		 * aren't required.
+		 */
+	default:
+		return err;
+	}
 }
 #else /* !CONFIG_MODULE_SIG */
 static int module_sig_check(struct load_info *info, int flags)
diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
index d30c4d254b5f..2c53fd9f5c9b 100644
--- a/security/lockdown/lockdown.c
+++ b/security/lockdown/lockdown.c
@@ -18,6 +18,7 @@ static enum lockdown_reason kernel_locked_down;
 
 static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
 	[LOCKDOWN_NONE] = "none",
+	[LOCKDOWN_MODULE_SIGNATURE] = "unsigned module loading",
 	[LOCKDOWN_INTEGRITY_MAX] = "integrity",
 	[LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
 };
-- 
2.22.0.770.g0f2c4a37fd-goog

^ permalink raw reply related

* [PATCH V37 05/29] Restrict /dev/{mem,kmem,port} when the kernel is locked down
From: Matthew Garrett @ 2019-07-31 22:15 UTC (permalink / raw)
  To: jmorris
  Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
	David Howells, Matthew Garrett, Kees Cook, x86
In-Reply-To: <20190731221617.234725-1-matthewgarrett@google.com>

From: Matthew Garrett <mjg59@srcf.ucam.org>

Allowing users to read and write to core kernel memory makes it possible
for the kernel to be subverted, avoiding module loading restrictions, and
also to steal cryptographic information.

Disallow /dev/mem and /dev/kmem from being opened this when the kernel has
been locked down to prevent this.

Also disallow /dev/port from being opened to prevent raw ioport access and
thus DMA from being used to accomplish the same thing.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: x86@kernel.org
---
 drivers/char/mem.c           | 7 +++++--
 include/linux/security.h     | 1 +
 security/lockdown/lockdown.c | 1 +
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index b08dc50f9f26..d0148aee1aab 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -29,8 +29,8 @@
 #include <linux/export.h>
 #include <linux/io.h>
 #include <linux/uio.h>
-
 #include <linux/uaccess.h>
+#include <linux/security.h>
 
 #ifdef CONFIG_IA64
 # include <linux/efi.h>
@@ -786,7 +786,10 @@ static loff_t memory_lseek(struct file *file, loff_t offset, int orig)
 
 static int open_port(struct inode *inode, struct file *filp)
 {
-	return capable(CAP_SYS_RAWIO) ? 0 : -EPERM;
+	if (!capable(CAP_SYS_RAWIO))
+		return -EPERM;
+
+	return security_locked_down(LOCKDOWN_DEV_MEM);
 }
 
 #define zero_lseek	null_lseek
diff --git a/include/linux/security.h b/include/linux/security.h
index 8e70063074a1..9458152601b5 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -104,6 +104,7 @@ enum lsm_event {
 enum lockdown_reason {
 	LOCKDOWN_NONE,
 	LOCKDOWN_MODULE_SIGNATURE,
+	LOCKDOWN_DEV_MEM,
 	LOCKDOWN_INTEGRITY_MAX,
 	LOCKDOWN_CONFIDENTIALITY_MAX,
 };
diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
index 2c53fd9f5c9b..d2ef29d9f0b2 100644
--- a/security/lockdown/lockdown.c
+++ b/security/lockdown/lockdown.c
@@ -19,6 +19,7 @@ static enum lockdown_reason kernel_locked_down;
 static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
 	[LOCKDOWN_NONE] = "none",
 	[LOCKDOWN_MODULE_SIGNATURE] = "unsigned module loading",
+	[LOCKDOWN_DEV_MEM] = "/dev/mem,kmem,port",
 	[LOCKDOWN_INTEGRITY_MAX] = "integrity",
 	[LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
 };
-- 
2.22.0.770.g0f2c4a37fd-goog

^ permalink raw reply related

* [PATCH V37 06/29] kexec_load: Disable at runtime if the kernel is locked down
From: Matthew Garrett @ 2019-07-31 22:15 UTC (permalink / raw)
  To: jmorris-gx6/JNMH7DfYtjvyW6yDsg
  Cc: Matthew Garrett, Kees Cook, linux-api-u79uwXL29TY76Z2rM5mHXA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Matthew Garrett,
	David Howells, linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	Dave Young
In-Reply-To: <20190731221617.234725-1-matthewgarrett-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

From: Matthew Garrett <mjg59-1xO5oi07KQx4cg9Nei1l7Q@public.gmane.org>

The kexec_load() syscall permits the loading and execution of arbitrary
code in ring 0, which is something that lock-down is meant to prevent. It
makes sense to disable kexec_load() in this situation.

This does not affect kexec_file_load() syscall which can check for a
signature on the image to be booted.

Signed-off-by: David Howells <dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Matthew Garrett <mjg59-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Acked-by: Dave Young <dyoung-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
cc: kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
---
 include/linux/security.h     | 1 +
 kernel/kexec.c               | 8 ++++++++
 security/lockdown/lockdown.c | 1 +
 3 files changed, 10 insertions(+)

diff --git a/include/linux/security.h b/include/linux/security.h
index 9458152601b5..69c5de539e9a 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -105,6 +105,7 @@ enum lockdown_reason {
 	LOCKDOWN_NONE,
 	LOCKDOWN_MODULE_SIGNATURE,
 	LOCKDOWN_DEV_MEM,
+	LOCKDOWN_KEXEC,
 	LOCKDOWN_INTEGRITY_MAX,
 	LOCKDOWN_CONFIDENTIALITY_MAX,
 };
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 1b018f1a6e0d..bc933c0db9bf 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -205,6 +205,14 @@ static inline int kexec_load_check(unsigned long nr_segments,
 	if (result < 0)
 		return result;
 
+	/*
+	 * kexec can be used to circumvent module loading restrictions, so
+	 * prevent loading in that case
+	 */
+	result = security_locked_down(LOCKDOWN_KEXEC);
+	if (result)
+		return result;
+
 	/*
 	 * Verify we have a legal set of flags
 	 * This leaves us room for future extensions.
diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
index d2ef29d9f0b2..6f302c156bc8 100644
--- a/security/lockdown/lockdown.c
+++ b/security/lockdown/lockdown.c
@@ -20,6 +20,7 @@ static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
 	[LOCKDOWN_NONE] = "none",
 	[LOCKDOWN_MODULE_SIGNATURE] = "unsigned module loading",
 	[LOCKDOWN_DEV_MEM] = "/dev/mem,kmem,port",
+	[LOCKDOWN_KEXEC] = "kexec of unsigned images",
 	[LOCKDOWN_INTEGRITY_MAX] = "integrity",
 	[LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
 };
-- 
2.22.0.770.g0f2c4a37fd-goog

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox