* [PATCH v10 0/9] namei: openat2(2) path resolution restrictions
From: Aleksa Sarai @ 2019-07-19 16:42 UTC
To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan
Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Jann Horn,
Christian Brauner, David Drysdale, Tycho Andersen, Kees Cook,
Linus Torvalds, containers, linux-fsdevel, linux-api,
Andrew Morton, Alexei Starovoitov, Chanho Min, Oleg Nesterov,
Aleksa Sarai, linux-alpha, linux-arch, linux-arm-kernel,
linux-ia64, linux-kernel
This patchset is being developed here (with snapshots of each series
version being stashed in separate branches with names of the form
"resolveat/vX-summary"):
<https://github.com/cyphar/linux/tree/resolveat/master>
Patch changelog:
v10:
* Ensure that unlazy_walk() will fail if we are in a scoped walk and
the caller has zeroed nd->root (this happens in a few places; I'm
not sure why, because unlazy_walk() already does legitimize_path()).
In this case we need to go through path_init() again to reset it
(otherwise set_root() will cause a breakout).
* Also add a WARN_ON (and return -ENOTRECOVERABLE) if
LOOKUP_IN_ROOT is set and we are in set_root() -- which should
never happen and will cause a breakout.
* Make changes suggested by Al Viro:
* Remove nd->{opath_mask,acc_mode} by moving all of the magic-link
permission logic so that it is done after trailing_symlink() (with
trailing_magiclink()) and only within path_openat().
* Introduce LOOKUP_MAGICLINK_JUMPED to be able to detect
magic-link jumps done with nd_jump_link() (so we don't end up
blocking other LOOKUP_JUMPED cases).
* Simplify all of the path_init() changes to make the code far
less confusing. dirfd_path_init() turns out to be un-necessary.
* Make openat2(2) also -EINVAL on unknown how->flags.
[Dmitry V. Levin]
* Clean up bad definitions of O_EMPTYPATH on architectures where O_*
flags are subtly different to <asm-generic/fcntl.h>.
* Switch away from passing a struct to build_open_flags() and
instead just copy the one field we need to temporarily modify
(how->flags). Also fix a bug in OPENHOW_MODE. [Rasmus Villemoes]
* Fix syscall linkages and switch to 437. [Arnd Bergmann]
* Clean up text in commit messages and the cover-letter.
[Rolf Eike Beer]
* Fix openat2 selftest makefile. [Michael Ellerman]
The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS,
the flag has been split up into separate flags. However, instead of
being an openat(2) flag it is provided through a new syscall openat2(2)
which provides an alternative way to get an O_PATH file descriptor (the
reasoning for doing this is included in the patch description). The
following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do
not trigger this.
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional work to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because
magic-links can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[6] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have been mitigated by having LOOKUP_IN_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
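To make the notion of "escaping" concrete, here is a minimal userspace
sketch of a purely lexical check: does a relative path ever climb above
its starting directory via ".."? This is not what the kernel does and it
is not part of this patchset. The helper name stays_beneath_lexically()
is hypothetical, and the check knows nothing about symlinks, mount points
or concurrent renames, which is exactly why the real protection has to
live in the path walker.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patchset): returns true if @path,
 * interpreted lexically relative to a starting directory, never climbs
 * above that directory. Symlinks and mount points are NOT considered.
 */
static bool stays_beneath_lexically(const char *path)
{
	int depth = 0;

	assert(path != NULL);

	/* Absolute paths restart from "/", outside the starting tree. */
	if (path[0] == '/')
		return false;

	while (*path) {
		size_t len = strcspn(path, "/");

		if (len == 2 && path[0] == '.' && path[1] == '.') {
			if (--depth < 0)
				return false;	/* ".." above the start */
		} else if (len > 0 && !(len == 1 && path[0] == '.')) {
			depth++;		/* ordinary component */
		}
		path += len;
		while (*path == '/')
			path++;			/* skip separator(s) */
	}
	return true;
}
```

A path such as "a/../b" passes while "a/../../b" does not. A symlink
named "a" pointing at "/" would defeat this check entirely, which is the
class of problem the LOOKUP_* flags above address in the kernel.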
Furthermore, several semantics of file descriptor "re-opening" have been
changed to prevent attacks like CVE-2019-5736 by restricting how
magic-links can be resolved (based on their mode). This required some
other changes to the semantics of the modes of O_PATH file descriptors'
associated /proc/self/fd magic-links. openat2(2) has the ability to
further restrict re-opening of its own O_PATH fds, so that users can
make even better use of this feature.
Finally, O_EMPTYPATH was added so that users can do /proc/self/fd-style
re-opening without depending on procfs. The new restricted semantics for
magic-links are applied here too.
In order to make all of the above more usable, I'm working on
libpathrs[7] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: David Drysdale <drysdale@google.com>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <containers@lists.linux-foundation.org>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: <linux-api@vger.kernel.org>
[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin
[7]: https://github.com/openSUSE/libpathrs
Aleksa Sarai (9):
namei: obey trailing magic-link DAC permissions
procfs: switch magic-link modes to be more sane
open: O_EMPTYPATH: procfs-less file descriptor re-opening
namei: O_BENEATH-style path resolution flags
namei: LOOKUP_IN_ROOT: chroot-like path resolution
namei: aggressively check for nd->root escape on ".." resolution
open: openat2(2) syscall
kselftest: save-and-restore errno to allow for %m formatting
selftests: add openat2(2) selftests
Documentation/filesystems/path-lookup.rst | 12 +-
arch/alpha/include/uapi/asm/fcntl.h | 1 +
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/include/uapi/asm/fcntl.h | 39 +-
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/include/uapi/asm/fcntl.h | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/fcntl.c | 2 +-
fs/internal.h | 1 +
fs/namei.c | 270 ++++++++++--
fs/open.c | 112 ++++-
fs/proc/base.c | 20 +-
fs/proc/fd.c | 23 +-
fs/proc/namespaces.c | 2 +-
include/linux/fcntl.h | 17 +-
include/linux/fs.h | 8 +-
include/linux/namei.h | 9 +
include/linux/syscalls.h | 17 +-
include/uapi/asm-generic/fcntl.h | 4 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/fcntl.h | 42 ++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/kselftest.h | 15 +
tools/testing/selftests/memfd/memfd_test.c | 7 +-
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 162 +++++++
tools/testing/selftests/openat2/helpers.h | 114 +++++
.../testing/selftests/openat2/linkmode_test.c | 326 ++++++++++++++
.../selftests/openat2/rename_attack_test.c | 124 ++++++
.../testing/selftests/openat2/resolve_test.c | 397 ++++++++++++++++++
46 files changed, 1652 insertions(+), 107 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/linkmode_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
--
2.22.0
* [PATCH v10 1/9] namei: obey trailing magic-link DAC permissions
From: Aleksa Sarai @ 2019-07-19 16:42 UTC
To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan
Cc: Aleksa Sarai, Andy Lutomirski, Christian Brauner, Eric Biederman,
Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
linux-arch, linux-arm-kernel, linux-fsdevel, linux-ia64,
linux-kernel
In-Reply-To: <20190719164225.27083-1-cyphar@cyphar.com>
The ability for userspace to "re-open" file descriptors through
/proc/self/fd has been a very useful tool for all sorts of use cases
(container runtimes are one common example). However, the current
interface for doing this has resulted in some pretty subtle security
holes. Userspace can re-open a file descriptor with more permissions
than the original, which can result in cases such as /proc/$pid/exe
being re-opened O_RDWR at a later date even though (by definition)
/proc/$pid/exe cannot be opened for writing. When combined with O_PATH
the results can get even more confusing.
We cannot block this outright. Aside from userspace already depending on
it, it's a useful feature which can actually increase the security of
userspace. For instance, LXC keeps an O_PATH of the container's
/dev/pts/ptmx that gets re-opened to create new ptys and then uses
TIOCGPTPEER to get the slave end. This allows for pty allocation without
resolving paths inside an (untrusted) container's rootfs. There isn't a
trivial way of doing this that is as straight-forward and safe as O_PATH
re-opening.
Instead we have to restrict it in such a way that it doesn't break
(good) users but does block potential attackers. The solution applied in
this patch is to restrict *re-opening* (not resolution through)
magic-links by requiring that the mode of the link be obeyed. Normal
symlinks have modes of a+rwx but magic-links have other modes. These
magic-link modes were historically ignored during path resolution, but
they've now been re-purposed for more useful ends.
It is also necessary to define semantics for the mode of an O_PATH
descriptor, since re-opening a magic-link through an O_PATH needs to be
just as restricted as the corresponding magic-link -- otherwise the
above protection can be bypassed. There are two distinct cases:
1. The target is a regular file (not a magic-link). Userspace depends
on being able to re-open the O_PATH of a regular file, so we must
define the mode to be a+rwx.
2. The target is a magic-link. In this case, we simply copy the mode of
the magic-link. This results in an O_PATH of a magic-link
effectively acting as a no-op in terms of how many re-opening
privileges a process has.
CAP_DAC_OVERRIDE can be used to override all of these restrictions, but
we only permit &init_userns's capabilities to affect these semantics.
The reason for this is that there isn't a clear way to track what
user_ns is the original owner of a given O_PATH chain -- thus an
unprivileged user could create a new userns and O_PATH the file
descriptor, owning it. All signs would indicate that the user really
does have CAP_DAC_OVERRIDE over the new descriptor and the protection
would be bypassed. We thus opt for the more conservative approach.
I have run this patch on several machines for several days. So far, the
only processes which have hit this case ("loadkeys" and "kbd_mode" from
the kbd package[1]) gracefully handle the permission error and do not
cause any user-visible problems. In order to give users a heads-up, a
warning is output to dmesg whenever may_open_magiclink() refuses access.
[1]: http://git.altlinux.org/people/legion/packages/kbd.git
Co-developed-by: Andy Lutomirski <luto@kernel.org>
Co-developed-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Documentation/filesystems/path-lookup.rst | 12 +--
fs/internal.h | 1 +
fs/namei.c | 105 +++++++++++++++++++---
fs/open.c | 3 +-
fs/proc/fd.c | 23 ++++-
include/linux/fs.h | 4 +
include/linux/namei.h | 1 +
7 files changed, 130 insertions(+), 19 deletions(-)
diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 434a07b0002b..a57d78ec8bee 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -1310,12 +1310,14 @@ longer needed.
``LOOKUP_JUMPED`` means that the current dentry was chosen not because
it had the right name but for some other reason. This happens when
following "``..``", following a symlink to ``/``, crossing a mount point
-or accessing a "``/proc/$PID/fd/$FD``" symlink. In this case the
-filesystem has not been asked to revalidate the name (with
-``d_revalidate()``). In such cases the inode may still need to be
-revalidated, so ``d_op->d_weak_revalidate()`` is called if
+or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
+link"). In this case the filesystem has not been asked to revalidate the
+name (with ``d_revalidate()``). In such cases the inode may still need
+to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
``LOOKUP_JUMPED`` is set when the look completes - which may be at the
-final component or, when creating, unlinking, or renaming, at the penultimate component.
+final component or, when creating, unlinking, or renaming, at the
+penultimate component. ``LOOKUP_MAGICLINK_JUMPED`` is set alongside
+``LOOKUP_JUMPED`` if a magic-link was traversed.
Final-component flags
~~~~~~~~~~~~~~~~~~~~~
diff --git a/fs/internal.h b/fs/internal.h
index a48ef81be37d..12847f502f49 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -119,6 +119,7 @@ struct open_flags {
int acc_mode;
int intent;
int lookup_flags;
+ fmode_t opath_mask;
};
extern struct file *do_filp_open(int dfd, struct filename *pathname,
const struct open_flags *op);
diff --git a/fs/namei.c b/fs/namei.c
index 20831c2fbb34..c6ba4ccafc51 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -872,7 +872,7 @@ void nd_jump_link(struct path *path)
nd->path = *path;
nd->inode = nd->path.dentry->d_inode;
- nd->flags |= LOOKUP_JUMPED;
+ nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
}
static inline void put_link(struct nameidata *nd)
@@ -1066,6 +1066,7 @@ const char *get_link(struct nameidata *nd)
return ERR_PTR(error);
nd->last_type = LAST_BIND;
+ nd->flags &= ~LOOKUP_MAGICLINK_JUMPED;
res = READ_ONCE(inode->i_link);
if (!res) {
const char * (*get)(struct dentry *, struct inode *,
@@ -3501,16 +3502,73 @@ static int do_tmpfile(struct nameidata *nd, unsigned flags,
return error;
}
-static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file)
+/**
+ * may_open_magiclink - Check permissions for opening a trailing magic-link
+ * @opath_mask: the O_PATH mask of the magic-link
+ * @acc_mode: ACC_MODE which the user is attempting
+ *
+ * We block magic-link re-opening if the @opath_mask is more strict than the
+ * @acc_mode being requested, unless the user is capable(CAP_DAC_OVERRIDE).
+ *
+ * Returns 0 if successful, -ve on error.
+ */
+static int may_open_magiclink(fmode_t opath_mask, int acc_mode)
{
- struct path path;
- int error = path_lookupat(nd, flags, &path);
- if (!error) {
- audit_inode(nd->name, path.dentry, 0);
- error = vfs_open(&path, file);
- path_put(&path);
- }
- return error;
+ /*
+ * We only allow for init_userns to be able to override magic-links.
+ * This is done to avoid cases where an unprivileged userns could take
+ * an O_PATH of the fd, resulting in it being very unclear whether
+ * CAP_DAC_OVERRIDE should work on the new O_PATH fd (given that it
+ * pipes through to the underlying file).
+ */
+ if (capable(CAP_DAC_OVERRIDE))
+ return 0;
+
+ if ((acc_mode & MAY_READ) &&
+ !(opath_mask & (FMODE_READ | FMODE_PATH_READ)))
+ goto err;
+ if ((acc_mode & MAY_WRITE) &&
+ !(opath_mask & (FMODE_WRITE | FMODE_PATH_WRITE)))
+ goto err;
+
+ return 0;
+
+err:
+ pr_warn_ratelimited("%s[%d]: magic-link re-open blocked (acc_mode=%s%s%s, opath_mask=%s%s%s%s)\n",
+ current->comm, task_pid_nr(current),
+ (acc_mode & MAY_READ) ? "r": "",
+ (acc_mode & MAY_WRITE) ? "w": "",
+ (acc_mode & MAY_EXEC) ? "x": "",
+ (opath_mask & FMODE_READ) ? "R" : "",
+ (opath_mask & FMODE_PATH_READ) ? "r" : "",
+ (opath_mask & FMODE_WRITE) ? "W" : "",
+ (opath_mask & FMODE_PATH_WRITE) ? "w" : "");
+ return -EACCES;
+}
+
+static int trailing_magiclink(struct nameidata *nd, int acc_mode,
+ fmode_t *opath_mask)
+{
+ struct inode *inode = nd->link_inode;
+ fmode_t new_mask = 0;
+
+ /* Trailing symlink was not a magic-link. */
+ if (!(nd->flags & LOOKUP_MAGICLINK_JUMPED))
+ return 0;
+
+ /*
+ * Figure out the O_PATH mask. Rather than using acl_permission_check,
+ * we check whether any of the rw bits are set in the mode.
+ */
+ if (inode->i_mode & S_IRUGO)
+ new_mask |= FMODE_PATH_READ;
+ if (inode->i_mode & S_IWUGO)
+ new_mask |= FMODE_PATH_WRITE;
+ if (opath_mask)
+ *opath_mask &= new_mask;
+
+ /* Is the new opath_mask more restrictive than acc_mode? */
+ return may_open_magiclink(new_mask, acc_mode);
}
static struct file *path_openat(struct nameidata *nd,
@@ -3526,13 +3584,38 @@ static struct file *path_openat(struct nameidata *nd,
if (unlikely(file->f_flags & __O_TMPFILE)) {
error = do_tmpfile(nd, flags, op, file);
} else if (unlikely(file->f_flags & O_PATH)) {
- error = do_o_path(nd, flags, file);
+ /* Inlined path_lookupat() with a trailing_magiclink() check. */
+ const char *s = path_init(nd, flags);
+ fmode_t opath_mask = op->opath_mask;
+
+ while (!(error = link_path_walk(s, nd))
+ && ((error = lookup_last(nd)) > 0)) {
+ s = trailing_symlink(nd);
+ error = trailing_magiclink(nd, op->acc_mode, &opath_mask);
+ if (error)
+ s = ERR_PTR(error);
+ }
+ if (!error)
+ error = complete_walk(nd);
+
+ if (!error && nd->flags & LOOKUP_DIRECTORY)
+ if (!d_can_lookup(nd->path.dentry))
+ error = -ENOTDIR;
+ if (!error) {
+ audit_inode(nd->name, nd->path.dentry, 0);
+ error = vfs_open(&nd->path, file);
+ file->f_mode |= opath_mask;
+ }
+ terminate_walk(nd);
} else {
const char *s = path_init(nd, flags);
while (!(error = link_path_walk(s, nd)) &&
(error = do_last(nd, file, op)) > 0) {
nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
s = trailing_symlink(nd);
+ error = trailing_magiclink(nd, op->acc_mode, NULL);
+ if (error)
+ s = ERR_PTR(error);
}
terminate_walk(nd);
}
diff --git a/fs/open.c b/fs/open.c
index b5b80469b93d..ab20eae39df7 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -982,8 +982,9 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
acc_mode |= MAY_APPEND;
op->acc_mode = acc_mode;
-
op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN;
+ /* For O_PATH backwards-compatibility we default to an all-set mask. */
+ op->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE;
if (flags & O_CREAT) {
op->intent |= LOOKUP_CREATE;
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 81882a13212d..9b7d8becb002 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -104,11 +104,30 @@ static void tid_fd_update_inode(struct task_struct *task, struct inode *inode,
task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
if (S_ISLNK(inode->i_mode)) {
+ /*
+ * Always set +x (depending on the fmode type), since there
+ * currently aren't FMODE_PATH_EXEC restrictions and there is
+ * no O_MAYEXEC yet. This might change in the future, in which
+ * case we will restrict +x.
+ */
unsigned i_mode = S_IFLNK;
+ if (f_mode & FMODE_PATH)
+ i_mode |= S_IXGRP;
+ else
+ i_mode |= S_IXUSR;
+ /*
+ * Construct the mode bits based on the open-mode. The u+rwx
+ * bits are for "ordinary" open modes while g+rwx are for
+ * O_PATH modes.
+ */
if (f_mode & FMODE_READ)
- i_mode |= S_IRUSR | S_IXUSR;
+ i_mode |= S_IRUSR;
if (f_mode & FMODE_WRITE)
- i_mode |= S_IWUSR | S_IXUSR;
+ i_mode |= S_IWUSR;
+ if (f_mode & FMODE_PATH_READ)
+ i_mode |= S_IRGRP;
+ if (f_mode & FMODE_PATH_WRITE)
+ i_mode |= S_IWGRP;
inode->i_mode = i_mode;
}
security_task_to_inode(task, inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f7fdfe93e25d..f7df213405ea 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -173,6 +173,10 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File does not contribute to nr_files count */
#define FMODE_NOACCOUNT ((__force fmode_t)0x20000000)
+/* File is an O_PATH descriptor which can be upgraded to (read, write). */
+#define FMODE_PATH_READ ((__force fmode_t)0x40000000)
+#define FMODE_PATH_WRITE ((__force fmode_t)0x80000000)
+
/*
* Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
* that indicates that they should check the contents of the iovec are
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 9138b4471dbf..bd6d3eb7764d 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -49,6 +49,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_ROOT 0x2000
#define LOOKUP_EMPTY 0x4000
#define LOOKUP_DOWN 0x8000
+#define LOOKUP_MAGICLINK_JUMPED 0x10000
extern int path_pts(struct path *path);
--
2.22.0
* [PATCH v10 2/9] procfs: switch magic-link modes to be more sane
From: Aleksa Sarai @ 2019-07-19 16:42 UTC
To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan
Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
Alexei Starovoitov, Kees Cook, Jann Horn, Christian Brauner,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
linux-arch, linux-arm-kernel, linux-fsdevel, linux-ia64,
linux-kernel
In-Reply-To: <20190719164225.27083-1-cyphar@cyphar.com>
Now that magic-link modes are obeyed for file re-opening purposes, some
of the pre-existing magic-link modes need to be adjusted to be more
semantically correct.
The most blatant example of this is /proc/self/exe, which had a mode of
a+rwx even though tautologically the file could never be opened for
writing (because it is the current->mm of a live process).
With the new O_PATH restrictions, changing the default mode of these
magic-links allows us to avoid delayed-access attacks such as we saw in
CVE-2019-5736.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
fs/proc/base.c | 20 ++++++++++----------
fs/proc/namespaces.c | 2 +-
2 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 255f6754c70d..82c06c21e69d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -133,9 +133,9 @@ struct pid_entry {
#define DIR(NAME, MODE, iops, fops) \
NOD(NAME, (S_IFDIR|(MODE)), &iops, &fops, {} )
-#define LNK(NAME, get_link) \
- NOD(NAME, (S_IFLNK|S_IRWXUGO), \
- &proc_pid_link_inode_operations, NULL, \
+#define LNK(NAME, MODE, get_link) \
+ NOD(NAME, (S_IFLNK|(MODE)), \
+ &proc_pid_link_inode_operations, NULL, \
{ .proc_get_link = get_link } )
#define REG(NAME, MODE, fops) \
NOD(NAME, (S_IFREG|(MODE)), NULL, &fops, {})
@@ -2995,9 +2995,9 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
#endif
REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
- LNK("cwd", proc_cwd_link),
- LNK("root", proc_root_link),
- LNK("exe", proc_exe_link),
+ LNK("cwd", S_IRWXUGO, proc_cwd_link),
+ LNK("root", S_IRWXUGO, proc_root_link),
+ LNK("exe", S_IRUGO|S_IXUGO, proc_exe_link),
REG("mounts", S_IRUGO, proc_mounts_operations),
REG("mountinfo", S_IRUGO, proc_mountinfo_operations),
REG("mountstats", S_IRUSR, proc_mountstats_operations),
@@ -3393,11 +3393,11 @@ static const struct pid_entry tid_base_stuff[] = {
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
#endif
REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
- LNK("cwd", proc_cwd_link),
- LNK("root", proc_root_link),
- LNK("exe", proc_exe_link),
+ LNK("cwd", S_IRWXUGO, proc_cwd_link),
+ LNK("root", S_IRWXUGO, proc_root_link),
+ LNK("exe", S_IRUGO|S_IXUGO, proc_exe_link),
REG("mounts", S_IRUGO, proc_mounts_operations),
- REG("mountinfo", S_IRUGO, proc_mountinfo_operations),
+ REG("mountinfo", S_IRUGO, proc_mountinfo_operations),
#ifdef CONFIG_PROC_PAGE_MONITOR
REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
REG("smaps", S_IRUGO, proc_pid_smaps_operations),
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..cd1e130913f7 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -94,7 +94,7 @@ static struct dentry *proc_ns_instantiate(struct dentry *dentry,
struct inode *inode;
struct proc_inode *ei;
- inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRWXUGO);
+ inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRUGO);
if (!inode)
return ERR_PTR(-ENOENT);
--
2.22.0
* [PATCH v10 3/9] open: O_EMPTYPATH: procfs-less file descriptor re-opening
From: Aleksa Sarai @ 2019-07-19 16:42 UTC
To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan
Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
Alexei Starovoitov, Kees Cook, Jann Horn, Christian Brauner,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
linux-arch, linux-arm-kernel, linux-fsdevel, linux-ia64,
linux-kernel
In-Reply-To: <20190719164225.27083-1-cyphar@cyphar.com>
Userspace has made use of /proc/self/fd very liberally to allow for
descriptors to be re-opened. There are a wide variety of uses for this
feature, but it has always required constructing a pathname and could
not be done without procfs mounted. The obvious solution for this is to
extend openat(2) to have an AT_EMPTY_PATH-equivalent -- O_EMPTYPATH.
Now that descriptor re-opening has been made safe through the new
magic-link resolution restrictions, we can replicate these restrictions
for O_EMPTYPATH. In particular, we only allow "upgrading" the file
descriptor if the corresponding FMODE_PATH_* bit is set (or the
FMODE_{READ,WRITE} cases for non-O_PATH file descriptors).
When doing openat(O_EMPTYPATH|O_PATH), O_PATH takes precedence and
O_EMPTYPATH is ignored. Very few users ever have a need to O_PATH
re-open an existing file descriptor, and so accommodating them at the
expense of further complicating O_PATH makes little sense. Ultimately,
if users ask for this we can always add RESOLVE_EMPTY_PATH to
openat2(2) in the future.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
arch/alpha/include/uapi/asm/fcntl.h | 1 +
arch/parisc/include/uapi/asm/fcntl.h | 39 ++++++++++++++--------------
arch/sparc/include/uapi/asm/fcntl.h | 1 +
fs/fcntl.c | 2 +-
fs/namei.c | 20 ++++++++++++++
fs/open.c | 7 ++++-
include/linux/fcntl.h | 2 +-
include/uapi/asm-generic/fcntl.h | 4 +++
8 files changed, 54 insertions(+), 22 deletions(-)
diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
index 50bdc8e8a271..1f879bade68b 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -34,6 +34,7 @@
#define O_PATH 040000000
#define __O_TMPFILE 0100000000
+#define O_EMPTYPATH 0200000000
#define F_GETLK 7
#define F_SETLK 8
diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
index 03ce20e5ad7d..5d709058a76f 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -2,26 +2,27 @@
#ifndef _PARISC_FCNTL_H
#define _PARISC_FCNTL_H
-#define O_APPEND 000000010
-#define O_BLKSEEK 000000100 /* HPUX only */
-#define O_CREAT 000000400 /* not fcntl */
-#define O_EXCL 000002000 /* not fcntl */
-#define O_LARGEFILE 000004000
-#define __O_SYNC 000100000
+#define O_APPEND 0000000010
+#define O_BLKSEEK 0000000100 /* HPUX only */
+#define O_CREAT 0000000400 /* not fcntl */
+#define O_EXCL 0000002000 /* not fcntl */
+#define O_LARGEFILE 0000004000
+#define __O_SYNC 0000100000
#define O_SYNC (__O_SYNC|O_DSYNC)
-#define O_NONBLOCK 000200004 /* HPUX has separate NDELAY & NONBLOCK */
-#define O_NOCTTY 000400000 /* not fcntl */
-#define O_DSYNC 001000000 /* HPUX only */
-#define O_RSYNC 002000000 /* HPUX only */
-#define O_NOATIME 004000000
-#define O_CLOEXEC 010000000 /* set close_on_exec */
-
-#define O_DIRECTORY 000010000 /* must be a directory */
-#define O_NOFOLLOW 000000200 /* don't follow links */
-#define O_INVISIBLE 004000000 /* invisible I/O, for DMAPI/XDSM */
-
-#define O_PATH 020000000
-#define __O_TMPFILE 040000000
+#define O_NONBLOCK 0000200004 /* HPUX has separate NDELAY & NONBLOCK */
+#define O_NOCTTY 0000400000 /* not fcntl */
+#define O_DSYNC 0001000000 /* HPUX only */
+#define O_RSYNC 0002000000 /* HPUX only */
+#define O_NOATIME 0004000000
+#define O_CLOEXEC 0010000000 /* set close_on_exec */
+
+#define O_DIRECTORY 0000010000 /* must be a directory */
+#define O_NOFOLLOW 0000000200 /* don't follow links */
+#define O_INVISIBLE 0004000000 /* invisible I/O, for DMAPI/XDSM */
+
+#define O_PATH 0020000000
+#define __O_TMPFILE 0040000000
+#define O_EMPTYPATH 0100000000
#define F_GETLK64 8
#define F_SETLK64 9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index 67dae75e5274..dc86c9eaf950 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -37,6 +37,7 @@
#define O_PATH 0x1000000
#define __O_TMPFILE 0x2000000
+#define O_EMPTYPATH 0x4000000
#define F_GETOWN 5 /* for sockets. */
#define F_SETOWN 6 /* for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 3d40771e8e7c..4cf05a2fd162 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(
(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index c6ba4ccafc51..165861d621c2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3571,6 +3571,24 @@ static int trailing_magiclink(struct nameidata *nd, int acc_mode,
return may_open_magiclink(new_mask, acc_mode);
}
+static int do_emptypath(struct nameidata *nd, const struct open_flags *op,
+ struct file *file)
+{
+ int error;
+ /* We don't support AT_FDCWD (since O_PATH is disallowed here). */
+ struct fd f = fdget_raw(nd->dfd);
+
+ if (!f.file)
+ return -EBADF;
+
+ /* Apply trailing_magiclink()-like restrictions. */
+ error = may_open_magiclink(f.file->f_mode, op->acc_mode);
+ if (!error)
+ error = vfs_open(&f.file->f_path, file);
+ fdput(f);
+ return error;
+}
+
static struct file *path_openat(struct nameidata *nd,
const struct open_flags *op, unsigned flags)
{
@@ -3583,6 +3601,8 @@ static struct file *path_openat(struct nameidata *nd,
if (unlikely(file->f_flags & __O_TMPFILE)) {
error = do_tmpfile(nd, flags, op, file);
+ } else if (unlikely(file->f_flags & O_EMPTYPATH)) {
+ error = do_emptypath(nd, op, file);
} else if (unlikely(file->f_flags & O_PATH)) {
/* Inlined path_lookupat() with a trailing_magiclink() check. */
const char *s = path_init(nd, flags);
diff --git a/fs/open.c b/fs/open.c
index ab20eae39df7..bdca45528524 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -996,6 +996,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
lookup_flags |= LOOKUP_DIRECTORY;
if (!(flags & O_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
+ if (flags & O_EMPTYPATH)
+ lookup_flags |= LOOKUP_EMPTY;
op->lookup_flags = lookup_flags;
return 0;
}
@@ -1057,14 +1059,17 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
struct open_flags op;
int fd = build_open_flags(flags, mode, &op);
+ int empty = 0;
struct filename *tmp;
if (fd)
return fd;
- tmp = getname(filename);
+ tmp = getname_flags(filename, op.lookup_flags, &empty);
if (IS_ERR(tmp))
return PTR_ERR(tmp);
+ if (!empty)
+ op.open_flag &= ~O_EMPTYPATH;
fd = get_unused_fd_flags(flags);
if (fd >= 0) {
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index d019df946cb2..2868ae6c8fc1 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -9,7 +9,7 @@
(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
- O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_EMPTYPATH)
#ifndef force_o_largefile
#define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..ae6862f69cc2 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -89,6 +89,10 @@
#define __O_TMPFILE 020000000
#endif
+#ifndef O_EMPTYPATH
+#define O_EMPTYPATH 040000000
+#endif
+
/* a horrid kludge trying to make sure that this will fail on old kernels */
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
--
2.22.0
* [PATCH v10 4/9] namei: O_BENEATH-style path resolution flags
From: Aleksa Sarai @ 2019-07-19 16:42 UTC (permalink / raw)
To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan
Cc: Aleksa Sarai, Christian Brauner, David Drysdale, Andy Lutomirski,
Linus Torvalds, Eric Biederman, Andrew Morton, Alexei Starovoitov,
Kees Cook, Jann Horn, Tycho Andersen, Chanho Min, Oleg Nesterov,
Aleksa Sarai, containers, linux-alpha, linux-api, linux-arch,
linux-arm-kernel, linux-fsdevel, linux-ia64, linux-kernel
In-Reply-To: <20190719164225.27083-1-cyphar@cyphar.com>
Add the following flags to allow various restrictions on path resolution
(these affect the *entire* resolution, rather than just the final path
component -- as is the case with LOOKUP_FOLLOW).
The primary justification for these flags is to allow for programs to be
far more strict about how they want path resolution to handle symlinks,
mountpoint crossings, and paths that escape the dirfd (through an
absolute path or ".." shenanigans).
This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init). More classical applications (which have their own
potentially buggy userspace path sanitisation code) include web servers,
archive extraction tools, network file servers, and so on.
These flags are exposed to userspace through openat2(2) in a later
patchset.
* LOOKUP_NO_XDEV: Disallow mount-point crossing (both *down* into one,
or *up* from one). Both bind-mounts and cross-filesystem mounts are
blocked by this flag. The naming is based on "find -xdev" as well as
-EXDEV (though find(1) doesn't walk upwards, the semantics seem
obvious).
* LOOKUP_NO_MAGICLINKS: Disallows ->get_link "symlink" (or rather,
magic-link) jumping. This is a very specific restriction, and it
exists because /proc/$pid/fd/... "symlinks" allow for access outside
nd->root and pose risk to container runtimes that don't want to be
tricked into accessing a host path (but do want to allow
no-funny-business symlink resolution).
* LOOKUP_NO_SYMLINKS: Disallows resolution through symlinks of any kind
(including magic-links).
* LOOKUP_BENEATH: Disallow "escapes" from the starting point of the
filesystem tree during resolution (you must stay "beneath" the
starting point at all times). Currently this is done by disallowing
".." and absolute paths (either in the given path or found during
symlink resolution) entirely, as well as all magic-link jumping.
The wholesale banning of ".." is because it is currently not safe to
allow ".." resolution (races can cause the path to be moved outside of
the root -- this is conceptually similar to historical chroot(2)
escape attacks). Future patches in this series will address this, and
will re-enable ".." resolution once it is safe. With those patches,
".." resolution will only be allowed if it remains in the root
throughout resolution (such as "a/../b" not "a/../../outside/b").
The banning of magic-link jumping is done because it is not clear
whether semantically they should be allowed -- while some magic-links
are safe, there are many that can cause escapes (and once a
resolution is outside of the root, LOOKUP_BENEATH will no longer detect
it). Future patches may re-enable magic-link jumping when such jumps
would remain inside the root.
The LOOKUP_NO_*LINKS flags return -ELOOP if path resolution would
violate their requirements, while the others all return -EXDEV.
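As an illustration only, the LOOKUP_BENEATH rules described above can be
modelled lexically in userspace. The sketch below is not the kernel
implementation: beneath_check_lexical() and its allow_dotdot parameter
are hypothetical names, and symlinks, mounts, magic-links and races are
not modelled at all. With allow_dotdot == 0 it mirrors this patch (any
".." fails); with allow_dotdot != 0 it mirrors the intended later
behaviour, where "a/../b" is fine but "a/../../outside/b" is not.

```c
#include <errno.h>
#include <string.h>

/*
 * Userspace model only (hypothetical helper, not a kernel function).
 * Returns 0 if `path` stays beneath the starting point when read purely
 * lexically, and -EXDEV otherwise.
 */
int beneath_check_lexical(const char *path, int allow_dotdot)
{
	int depth = 0;		/* components below the starting point */
	const char *p = path;

	if (*p == '/')
		return -EXDEV;	/* absolute paths escape the dirfd */

	while (*p) {
		size_t len = strcspn(p, "/");

		if (len == 2 && p[0] == '.' && p[1] == '.') {
			if (!allow_dotdot)
				return -EXDEV;	/* behaviour in this patch */
			if (depth == 0)
				return -EXDEV;	/* would step above the start */
			depth--;
		} else if (len > 0 && !(len == 1 && p[0] == '.')) {
			depth++;	/* ordinary component: one level down */
		}
		p += len;
		while (*p == '/')	/* skip separators */
			p++;
	}
	return 0;
}
```

A purely lexical check like this is exactly what userspace sanitisers
tend to do, and it cannot account for symlinks or concurrent renames,
which is why the restriction is enforced during the kernel's own walk.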
This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.
[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/619151/
[3]: https://lwn.net/Articles/603929/
[4]: https://lwn.net/Articles/723057/
Cc: Christian Brauner <christian@brauner.io>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
fs/namei.c | 85 ++++++++++++++++++++++++++++++++++++-------
include/linux/namei.h | 7 ++++
2 files changed, 78 insertions(+), 14 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 165861d621c2..9df4aa35aedc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -674,7 +674,11 @@ static int unlazy_walk(struct nameidata *nd)
goto out2;
if (unlikely(!legitimize_path(nd, &nd->path, nd->seq)))
goto out1;
- if (nd->root.mnt && !(nd->flags & LOOKUP_ROOT)) {
+ if (!nd->root.mnt) {
+ /* Restart from path_init() if nd->root was cleared. */
+ if (nd->flags & LOOKUP_BENEATH)
+ goto out;
+ } else if (!(nd->flags & LOOKUP_ROOT)) {
if (unlikely(!legitimize_path(nd, &nd->root, nd->root_seq)))
goto out;
}
@@ -843,6 +847,13 @@ static inline void path_to_nameidata(const struct path *path,
static int nd_jump_root(struct nameidata *nd)
{
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
+ /* Absolute path arguments to path_init() are allowed. */
+ if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
+ return -EXDEV;
+ }
if (nd->flags & LOOKUP_RCU) {
struct dentry *d;
nd->path = nd->root;
@@ -1051,6 +1062,9 @@ const char *get_link(struct nameidata *nd)
int error;
const char *res;
+ if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+ return ERR_PTR(-ELOOP);
+
if (!(nd->flags & LOOKUP_RCU)) {
touch_atime(&last->link);
cond_resched();
@@ -1082,14 +1096,22 @@ const char *get_link(struct nameidata *nd)
} else {
res = get(dentry, inode, &last->done);
}
+ if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
+ if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
+ return ERR_PTR(-ELOOP);
+ /* Not currently safe. */
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return ERR_PTR(-EXDEV);
+ }
if (IS_ERR_OR_NULL(res))
return res;
}
if (*res == '/') {
if (!nd->root.mnt)
set_root(nd);
- if (unlikely(nd_jump_root(nd)))
- return ERR_PTR(-ECHILD);
+ error = nd_jump_root(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
while (unlikely(*++res == '/'))
;
}
@@ -1270,12 +1292,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
break;
}
- if (need_mntput && path->mnt == mnt)
- mntput(path->mnt);
+ if (need_mntput) {
+ if (path->mnt == mnt)
+ mntput(path->mnt);
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ ret = -EXDEV;
+ else
+ nd->flags |= LOOKUP_JUMPED;
+ }
if (ret == -EISDIR || !ret)
ret = 1;
- if (need_mntput)
- nd->flags |= LOOKUP_JUMPED;
if (unlikely(ret < 0))
path_put_conditional(path, nd);
return ret;
@@ -1332,6 +1358,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
mounted = __lookup_mnt(path->mnt, path->dentry);
if (!mounted)
break;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ return false;
path->mnt = &mounted->mnt;
path->dentry = mounted->mnt.mnt_root;
nd->flags |= LOOKUP_JUMPED;
@@ -1352,8 +1380,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
struct inode *inode = nd->inode;
while (1) {
- if (path_equal(&nd->path, &nd->root))
+ if (path_equal(&nd->path, &nd->root)) {
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
break;
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
struct dentry *old = nd->path.dentry;
struct dentry *parent = old->d_parent;
@@ -1378,6 +1409,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -ECHILD;
if (&mparent->mnt == nd->path.mnt)
break;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ return -EXDEV;
/* we know that mountpoint was pinned */
nd->path.dentry = mountpoint;
nd->path.mnt = &mparent->mnt;
@@ -1392,6 +1425,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -ECHILD;
if (!mounted)
break;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ return -EXDEV;
nd->path.mnt = &mounted->mnt;
nd->path.dentry = mounted->mnt.mnt_root;
inode = nd->path.dentry->d_inode;
@@ -1480,8 +1515,11 @@ static int path_parent_directory(struct path *path)
static int follow_dotdot(struct nameidata *nd)
{
while(1) {
- if (path_equal(&nd->path, &nd->root))
+ if (path_equal(&nd->path, &nd->root)) {
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
break;
+ }
if (nd->path.dentry != nd->path.mnt->mnt_root) {
int ret = path_parent_directory(&nd->path);
if (ret)
@@ -1490,6 +1528,8 @@ static int follow_dotdot(struct nameidata *nd)
}
if (!follow_up(&nd->path))
break;
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ return -EXDEV;
}
follow_mount(&nd->path);
nd->inode = nd->path.dentry->d_inode;
@@ -1704,6 +1744,13 @@ static inline int may_lookup(struct nameidata *nd)
static inline int handle_dots(struct nameidata *nd, int type)
{
if (type == LAST_DOTDOT) {
+ /*
+ * LOOKUP_BENEATH resolving ".." is not currently safe -- races can
+ * cause our parent to have moved outside of the root and us to skip
+ * over it.
+ */
+ if (unlikely(nd->flags & LOOKUP_BENEATH))
+ return -EXDEV;
if (!nd->root.mnt)
set_root(nd);
if (nd->flags & LOOKUP_RCU) {
@@ -2170,6 +2217,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
/* must be paired with terminate_walk() */
static const char *path_init(struct nameidata *nd, unsigned flags)
{
+ int error;
const char *s = nd->name->name;
if (!*s)
@@ -2202,11 +2250,13 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->path.dentry = NULL;
nd->m_seq = read_seqbegin(&mount_lock);
+
+ /* Figure out the starting path and root (if needed). */
if (*s == '/') {
set_root(nd);
- if (likely(!nd_jump_root(nd)))
- return s;
- return ERR_PTR(-ECHILD);
+ error = nd_jump_root(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
} else if (nd->dfd == AT_FDCWD) {
if (flags & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
@@ -2222,7 +2272,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
get_fs_pwd(current->fs, &nd->path);
nd->inode = nd->path.dentry->d_inode;
}
- return s;
} else {
/* Caller must check execute permissions on the starting path component */
struct fd f = fdget_raw(nd->dfd);
@@ -2247,8 +2296,16 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->inode = nd->path.dentry->d_inode;
}
fdput(f);
- return s;
}
+ /* For scoped-lookups we need to set the root to the dirfd as well. */
+ if (flags & LOOKUP_BENEATH) {
+ nd->root = nd->path;
+ if (flags & LOOKUP_RCU)
+ nd->root_seq = nd->seq;
+ else
+ path_get(&nd->root);
+ }
+ return s;
}
static const char *trailing_symlink(struct nameidata *nd)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index bd6d3eb7764d..be407415c28a 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -51,6 +51,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_DOWN 0x8000
#define LOOKUP_MAGICLINK_JUMPED 0x10000
+/* Scoping flags for lookup. */
+#define LOOKUP_BENEATH 0x020000 /* No escaping from starting point. */
+#define LOOKUP_NO_XDEV 0x040000 /* No mountpoint crossing. */
+#define LOOKUP_NO_MAGICLINKS 0x080000 /* No /proc/$pid/fd/ "symlink" crossing. */
+#define LOOKUP_NO_SYMLINKS 0x100000 /* No symlink crossing *at all*.
+ Implies LOOKUP_NO_MAGICLINKS. */
+
extern int path_pts(struct path *path);
extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
--
2.22.0
* [PATCH v10 5/9] namei: LOOKUP_IN_ROOT: chroot-like path resolution
From: Aleksa Sarai @ 2019-07-19 16:42 UTC (permalink / raw)
To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan
Cc: Aleksa Sarai, Christian Brauner, Eric Biederman, Andy Lutomirski,
Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
linux-arch, linux-arm-kernel, linux-fsdevel, linux-ia64,
linux-kernel
In-Reply-To: <20190719164225.27083-1-cyphar@cyphar.com>
The primary motivation for this flag is container runtimes, which have
to interact with malicious root filesystems in the host namespaces. One
of the first requirements for a container runtime to be secure against a
malicious rootfs is that it correctly scopes symlinks (that is, they
should be scoped as though they are chroot(2)ed into the container's
rootfs) and ".."-style paths[*]. The already-existing LOOKUP_NO_XDEV and
LOOKUP_NO_MAGICLINKS help defend against other potential attacks in a
malicious rootfs scenario.
Currently most container runtimes try to do this resolution in
userspace[1], causing many potential race conditions. In addition, the
"obvious" alternative (actually performing a {ch,pivot_}root(2))
requires a fork+exec (for some runtimes) which is *very* costly if
necessary for every filesystem operation involving a container.
[*] At the moment, ".." and magic-link jumping are disallowed for the
same reason they are disabled for LOOKUP_BENEATH -- currently it is not
safe to allow them. Future patches may enable them unconditionally once
we have resolved the possible races (for "..") and semantics (for
magic-link jumping).
The most significant *at(2) semantic change with LOOKUP_IN_ROOT is that
absolute pathnames no longer cause the dirfd to be ignored completely.
The rationale is that LOOKUP_IN_ROOT must necessarily chroot-scope
symlinks with absolute paths to dirfd, and so doing it for the base path
seems to be the most consistent behaviour (and also avoids foot-gunning
users who want to scope paths that are absolute).
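To make the absolute-path semantics concrete, here is a small userspace
model of chroot-like scoping. It is a sketch under stated assumptions:
in_root_resolve_lexical() is a hypothetical helper, it works on strings
only, and it ignores symlinks and mounts entirely. Leading slashes are
dropped (so the path is taken relative to the dirfd), and ".." at the
top is clamped rather than allowed to escape. Note that in this patch
".." still fails with -EXDEV; the clamping shown here is only the
chroot(2)-style behaviour that the series is working towards.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model only (hypothetical helper). Writes the path, as it
 * would be interpreted relative to the dirfd, into `out`.
 * Returns 0 on success or -ENAMETOOLONG if `out` is too small.
 */
int in_root_resolve_lexical(const char *path, char *out, size_t outsz)
{
	size_t used = 0;	/* bytes used in out, excluding the NUL */
	const char *p = path;

	if (outsz == 0)
		return -ENAMETOOLONG;
	out[0] = '\0';

	while (*p) {
		size_t len;

		while (*p == '/')	/* leading and repeated slashes */
			p++;
		len = strcspn(p, "/");
		if (len == 0)
			break;

		if (len == 2 && p[0] == '.' && p[1] == '.') {
			/* Pop one component; at the root, ".." stays put. */
			while (used > 0 && out[used - 1] != '/')
				used--;
			if (used > 0)
				used--;	/* drop the separating '/' */
			out[used] = '\0';
		} else if (!(len == 1 && p[0] == '.')) {
			size_t need = len + (used > 0 ? 1 : 0);

			if (used + need + 1 > outsz)
				return -ENAMETOOLONG;
			if (used > 0)
				out[used++] = '/';
			memcpy(out + used, p, len);
			used += len;
			out[used] = '\0';
		}
		p += len;
	}
	return 0;
}
```

In this model "/etc/passwd" and "../../etc/passwd" both end up as
"etc/passwd" relative to the dirfd, which is the consistency argument
made above for treating the base path the same way as absolute symlinks.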
[1]: https://github.com/cyphar/filepath-securejoin
Co-developed-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
fs/namei.c | 41 +++++++++++++++++++++++++++++++----------
include/linux/namei.h | 1 +
2 files changed, 32 insertions(+), 10 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 9df4aa35aedc..617fc7d55977 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -676,7 +676,7 @@ static int unlazy_walk(struct nameidata *nd)
goto out1;
if (!nd->root.mnt) {
/* Restart from path_init() if nd->root was cleared. */
- if (nd->flags & LOOKUP_BENEATH)
+ if (nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))
goto out;
} else if (!(nd->flags & LOOKUP_ROOT)) {
if (unlikely(!legitimize_path(nd, &nd->root, nd->root_seq)))
@@ -809,10 +809,18 @@ static int complete_walk(struct nameidata *nd)
return status;
}
-static void set_root(struct nameidata *nd)
+static int set_root(struct nameidata *nd)
{
struct fs_struct *fs = current->fs;
+ /*
+ * Jumping to the real root as part of LOOKUP_IN_ROOT is a BUG in namei,
+ * but we still have to ensure it doesn't happen because it will cause a
+ * breakout from the dirfd.
+ */
+ if (WARN_ON(nd->flags & LOOKUP_IN_ROOT))
+ return -ENOTRECOVERABLE;
+
if (nd->flags & LOOKUP_RCU) {
unsigned seq;
@@ -824,6 +832,7 @@ static void set_root(struct nameidata *nd)
} else {
get_fs_root(fs, &nd->root);
}
+ return 0;
}
static void path_put_conditional(struct path *path, struct nameidata *nd)
@@ -854,6 +863,11 @@ static int nd_jump_root(struct nameidata *nd)
if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
return -EXDEV;
}
+ if (!nd->root.mnt) {
+ int error = set_root(nd);
+ if (error)
+ return error;
+ }
if (nd->flags & LOOKUP_RCU) {
struct dentry *d;
nd->path = nd->root;
@@ -1100,15 +1114,13 @@ const char *get_link(struct nameidata *nd)
if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
return ERR_PTR(-ELOOP);
/* Not currently safe. */
- if (unlikely(nd->flags & LOOKUP_BENEATH))
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
return ERR_PTR(-EXDEV);
}
if (IS_ERR_OR_NULL(res))
return res;
}
if (*res == '/') {
- if (!nd->root.mnt)
- set_root(nd);
error = nd_jump_root(nd);
if (unlikely(error))
return ERR_PTR(error);
@@ -1744,15 +1756,20 @@ static inline int may_lookup(struct nameidata *nd)
static inline int handle_dots(struct nameidata *nd, int type)
{
if (type == LAST_DOTDOT) {
+ int error = 0;
+
/*
* LOOKUP_BENEATH resolving ".." is not currently safe -- races can
* cause our parent to have moved outside of the root and us to skip
* over it.
*/
- if (unlikely(nd->flags & LOOKUP_BENEATH))
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
return -EXDEV;
- if (!nd->root.mnt)
- set_root(nd);
+ if (!nd->root.mnt) {
+ error = set_root(nd);
+ if (error)
+ return error;
+ }
if (nd->flags & LOOKUP_RCU) {
return follow_dotdot_rcu(nd);
} else
@@ -2251,9 +2268,13 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->m_seq = read_seqbegin(&mount_lock);
+ /* LOOKUP_IN_ROOT treats absolute paths as being relative-to-dirfd. */
+ if (flags & LOOKUP_IN_ROOT)
+ while (*s == '/')
+ s++;
+
/* Figure out the starting path and root (if needed). */
if (*s == '/') {
- set_root(nd);
error = nd_jump_root(nd);
if (unlikely(error))
return ERR_PTR(error);
@@ -2298,7 +2319,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
fdput(f);
}
/* For scoped-lookups we need to set the root to the dirfd as well. */
- if (flags & LOOKUP_BENEATH) {
+ if (flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)) {
nd->root = nd->path;
if (flags & LOOKUP_RCU)
nd->root_seq = nd->seq;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index be407415c28a..ec2c6c588ea7 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -57,6 +57,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_NO_MAGICLINKS 0x080000 /* No /proc/$pid/fd/ "symlink" crossing. */
#define LOOKUP_NO_SYMLINKS 0x100000 /* No symlink crossing *at all*.
Implies LOOKUP_NO_MAGICLINKS. */
+#define LOOKUP_IN_ROOT 0x200000 /* Treat dirfd as %current->fs->root. */
extern int path_pts(struct path *path);
--
2.22.0
* [PATCH v10 6/9] namei: aggressively check for nd->root escape on ".." resolution
From: Aleksa Sarai @ 2019-07-19 16:42 UTC (permalink / raw)
To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan
Cc: Aleksa Sarai, Jann Horn, Kees Cook, Eric Biederman,
Andy Lutomirski, Andrew Morton, Alexei Starovoitov,
Christian Brauner, Tycho Andersen, David Drysdale, Chanho Min,
Oleg Nesterov, Aleksa Sarai, Linus Torvalds, containers,
linux-alpha, linux-api, linux-arch, linux-arm-kernel,
linux-fsdevel, linux-ia64, linux-kernel
In-Reply-To: <20190719164225.27083-1-cyphar@cyphar.com>
This patch allows for LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit
".." resolution (in the case of LOOKUP_BENEATH the resolution will still
fail if ".." resolution would resolve a path outside of the root --
while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps
are still disallowed entirely because now they could result in
inconsistent behaviour if resolution encounters a subsequent ".."[*].
The need for this patch is explained by observing there is a fairly
easy-to-exploit race condition with chroot(2) (and thus by extension
LOOKUP_IN_ROOT and LOOKUP_BENEATH if ".." is allowed) where a rename(2)
of a path can be used to "skip over" nd->root and thus escape to the
filesystem above nd->root.
thread1 [attacker]:
for (;;)
renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
thread2 [victim]:
for (;;)
resolveat(dirb, "b/c/../../etc/shadow", RESOLVE_IN_ROOT);
With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.
With this patch, such cases will be detected *during* ".." resolution
(which is the weak point of chroot(2) -- since walking *into* a
subdirectory tautologically cannot result in you walking *outside*
nd->root -- except through a bind-mount or magic-link). By detecting
this at ".." resolution (rather than checking only at the end of the
entire resolution) we can both correct escapes by jumping back to the
root (in the case of LOOKUP_IN_ROOT), as well as avoid revealing to
attackers the structure of the filesystem outside of the root (through
timing attacks for instance).
In order to avoid a quadratic lookup with each ".." entry, we only
activate the slow path if a write through &rename_lock or &mount_lock
has occurred during path resolution (&rename_lock and &mount_lock are
re-taken to further optimise the lookup). Since the primary attack being
protected against is MS_MOVE or rename(2), not doing additional checks
unless a mount or rename has occurred avoids making the common case
slow.
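The fast-path/slow-path decision can be sketched in userspace with a
plain sequence counter. This is a simplified, single-threaded model and
not the kernel's seqlock: struct seq_model and all seq_model_*() and
dotdot_needs_slow_check() names are hypothetical, the memory ordering is
not sufficient for a real concurrent seqlock, and the LOOKUP_RCU special
case in the patch is left out. It only shows the shape of the logic:
compare the saved sequence numbers, skip the expensive check if nothing
changed, and otherwise re-take the snapshots before checking.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of a sequence counter: even when idle, odd during a write. */
struct seq_model {
	atomic_uint sequence;
};

unsigned seq_model_read_begin(struct seq_model *s)
{
	unsigned seq;

	/* Wait out any in-progress writer so the snapshot is even. */
	do {
		seq = atomic_load_explicit(&s->sequence, memory_order_acquire);
	} while (seq & 1);
	return seq;
}

bool seq_model_read_retry(struct seq_model *s, unsigned start)
{
	return atomic_load_explicit(&s->sequence, memory_order_acquire) != start;
}

/* Stand-in for a completed rename(2) or mount change. */
void seq_model_write(struct seq_model *s)
{
	atomic_fetch_add_explicit(&s->sequence, 1, memory_order_release);
	/* ... the tree would be modified here ... */
	atomic_fetch_add_explicit(&s->sequence, 1, memory_order_release);
}

/*
 * Returns true when the slow containment check must run after a "..",
 * refreshing whichever snapshot went stale.
 */
bool dotdot_needs_slow_check(struct seq_model *mount, unsigned *m_seq,
			     struct seq_model *rename, unsigned *r_seq)
{
	bool m_retry = seq_model_read_retry(mount, *m_seq);
	bool r_retry = seq_model_read_retry(rename, *r_seq);

	if (!m_retry && !r_retry)
		return false;	/* common case: nothing moved */

	if (m_retry)
		*m_seq = seq_model_read_begin(mount);
	if (r_retry)
		*r_seq = seq_model_read_begin(rename);
	return true;
}
```

Because the snapshots are refreshed, a single rename triggers one slow
check rather than one per subsequent "..", which is the point of
re-taking the sequence numbers.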
The use of path_is_under() here might seem suspect, but on further
inspection of the most important race (a path was *inside* the root but
is now *outside*), there appears to be no attack potential:
* If path_is_under() occurs before the rename, then the path will be
resolved -- however the path was originally inside the root and thus
there is no escape (and to userspace it'd look like the rename
occurred after the path was resolved). If path_is_under() occurs
afterwards, the resolution is blocked.
* Subsequent ".." jumps are guaranteed to check path_is_under() -- by
construction, &rename_lock or &mount_lock must have been taken by
the attacker after path_is_under() returned in the victim. Thus ".."
will not be able to escape from the previously-inside-root path.
* Walking down in the moved path is still safe since the entire
subtree was moved (either by rename(2) or MS_MOVE) and because (as
discussed above) walking down is safe.
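The containment argument above can be pictured with a toy tree of parent
pointers. This is an assumption-laden model, not path_is_under() itself:
struct node and node_is_under() are hypothetical, there are no mounts or
locks, and a rename is modelled by simply re-pointing a parent. The
check walks upwards from the current position and succeeds only if it
meets the root on the way.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy directory node; parent is NULL for the top of the filesystem. */
struct node {
	struct node *parent;
};

/*
 * Model of a containment check: true if `n` is `root` or one of its
 * descendants, found by walking parent pointers upwards.
 */
bool node_is_under(const struct node *n, const struct node *root)
{
	for (; n; n = n->parent)
		if (n == root)
			return true;
	return false;
}
```

If a directory is moved out from under the root, the check fails for
that directory and for everything below it, since the whole subtree
moves together. That is why walking down after a passed check needs no
further verification in this model, while each ".." after a rename does.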
A variant of the above attack is included in the selftests for
openat2(2) later in this patch series. I've run this test on several
machines for several days and no instances of a breakout were detected.
While this is not concrete proof that this is safe, when combined with
the above argument it should lend some trustworthiness to this
construction.
[*] It may be acceptable in the future to do a path_is_under() check
after resolving a magic-link and permit resolution if the
nd_jump_link() result is still within the dirfd. However this seems
unlikely to be a feature that people *really* need -- it can be
added later if it turns out a lot of people want it.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
fs/namei.c | 45 +++++++++++++++++++++++++++++++--------------
1 file changed, 31 insertions(+), 14 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 617fc7d55977..b3abf5754ddd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
struct path root;
struct inode *inode; /* path.dentry.d_inode */
unsigned int flags;
- unsigned seq, m_seq;
+ unsigned seq, m_seq, r_seq;
int last_type;
unsigned depth;
int total_link_count;
@@ -1758,22 +1758,36 @@ static inline int handle_dots(struct nameidata *nd, int type)
if (type == LAST_DOTDOT) {
int error = 0;
- /*
- * LOOKUP_BENEATH resolving ".." is not currently safe -- races can
- * cause our parent to have moved outside of the root and us to skip
- * over it.
- */
- if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
- return -EXDEV;
if (!nd->root.mnt) {
error = set_root(nd);
if (error)
return error;
}
- if (nd->flags & LOOKUP_RCU) {
- return follow_dotdot_rcu(nd);
- } else
- return follow_dotdot(nd);
+ if (nd->flags & LOOKUP_RCU)
+ error = follow_dotdot_rcu(nd);
+ else
+ error = follow_dotdot(nd);
+ if (error)
+ return error;
+
+ if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) {
+ bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+ bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+ /*
+ * Don't bother checking unless there's a racing
+ * rename(2) or MS_MOVE.
+ */
+ if (likely(!m_retry && !r_retry))
+ return 0;
+
+ if (m_retry && !(nd->flags & LOOKUP_RCU))
+ nd->m_seq = read_seqbegin(&mount_lock);
+ if (r_retry)
+ nd->r_seq = read_seqbegin(&rename_lock);
+ if (!path_is_under(&nd->path, &nd->root))
+ return -EXDEV;
+ }
}
return 0;
}
@@ -2245,6 +2259,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->last_type = LAST_ROOT; /* if there are only slashes... */
nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
nd->depth = 0;
+
+ nd->m_seq = read_seqbegin(&mount_lock);
+ if (flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))
+ nd->r_seq = read_seqbegin(&rename_lock);
+
if (flags & LOOKUP_ROOT) {
struct dentry *root = nd->root.dentry;
struct inode *inode = root->d_inode;
@@ -2266,8 +2285,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->path.mnt = NULL;
nd->path.dentry = NULL;
- nd->m_seq = read_seqbegin(&mount_lock);
-
/* LOOKUP_IN_ROOT treats absolute paths as being relative-to-dirfd. */
if (flags & LOOKUP_IN_ROOT)
while (*s == '/')
--
2.22.0
* [PATCH v10 7/9] open: openat2(2) syscall
From: Aleksa Sarai @ 2019-07-19 16:42 UTC (permalink / raw)
To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan
Cc: Aleksa Sarai, Christian Brauner, Eric Biederman, Andy Lutomirski,
Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
linux-arch, linux-arm-kernel, linux-fsdevel, linux-ia64,
linux-kernel
In-Reply-To: <20190719164225.27083-1-cyphar@cyphar.com>
The most obvious syscall to add support for the new LOOKUP_* scoping
flags would be openat(2). However, there are a few reasons why this is
not the best course of action:
* The new LOOKUP_* flags are intended to be security features, and
openat(2) will silently ignore all unknown flags. This means that
users would need to avoid foot-gunning themselves constantly when
using this interface if it were part of openat(2). This can be fixed
by having userspace libraries handle this for users[1], but should be
avoided if possible.
* Resolution scoping feels like a different operation to the existing
O_* flags. And since openat(2) has limited flag space, it seems to be
quite wasteful to clutter it with 5 flags that are all
resolution-related. Arguably O_NOFOLLOW is also a resolution flag but
its entire purpose is to error out if you encounter a trailing
symlink -- not to scope resolution.
* Other systems would be able to reimplement this syscall, allowing for
cross-OS standardisation. Hiding the feature amongst O_* flags may
result in it not being used by all the parties that might want to use
it (file servers, web servers, container runtimes, etc).
* It gives us the opportunity to iterate on the O_PATH interface. In
particular, the new @how->upgrade_mask field for fd re-opening is
only possible because we have a clean slate without needing to re-use
the ACC_MODE flag design nor the existing openat(2) @mode semantics.
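The first point, about unknown flags, comes down to the difference
between masking and rejecting. The sketch below is illustrative only:
sketch_legacy_filter() and sketch_check_flags() are hypothetical names,
and the `valid` mask is passed in by the caller rather than being any
real kernel constant. The first helper models an interface that silently
drops bits it does not understand; the second models one that fails
with -EINVAL instead.

```c
#include <errno.h>
#include <stdint.h>

/* Legacy-style handling: unknown bits are silently discarded. */
uint64_t sketch_legacy_filter(uint64_t flags, uint64_t valid)
{
	return flags & valid;
}

/* Strict handling: any unknown bit is an error. */
int sketch_check_flags(uint64_t flags, uint64_t valid)
{
	if (flags & ~valid)
		return -EINVAL;
	return 0;
}
```

With the strict variant, a program that asks for a restriction on a
kernel that does not know it gets an error it can act on, rather than an
open that quietly proceeded without the restriction.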
To this end, we introduce the openat2(2) syscall. It provides all of the
features of openat(2) through the @how->flags argument, but also
provides a new @how->resolve argument which exposes RESOLVE_* flags
that map to our new LOOKUP_* flags. It also eliminates the long-standing
ugliness of variadic-open(2) by embedding it in a struct.
In order to allow for userspace to lock down their usage of file
descriptor re-opening, openat2(2) has the ability for users to disallow
certain re-opening modes through @how->upgrade_mask. At the moment,
there is no UPGRADE_NOEXEC. The open_how struct is padded to 64 bytes
for future extensions (all of the reserved bits must be zeroed).
[1]: https://github.com/openSUSE/libpathrs
Co-developed-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/open.c | 106 ++++++++++++++++----
include/linux/fcntl.h | 15 ++-
include/linux/fs.h | 4 +-
include/linux/syscalls.h | 17 +++-
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/fcntl.h | 42 ++++++++
24 files changed, 178 insertions(+), 30 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..cebe813a947f 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
541 common fsconfig sys_fsconfig
542 common fsmount sys_fsmount
543 common fspick sys_fspick
+547 common openat2 sys_openat2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..2a0b94e595e7 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+437 common openat2 sys_openat2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index c9f8dd421c5f..40b8fec7ba55 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -33,7 +33,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 434
+#define __NR_compat_syscalls 438
#endif
#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index aa995920bd34..df60d417832e 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -875,6 +875,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..4da37f0f17fc 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -354,3 +354,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+437 common openat2 sys_openat2
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..323a3695c8a3 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+437 common openat2 sys_openat2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..8a5b81663387 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+437 common openat2 sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 0e2dd68ade57..70dfc63d160d 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -372,3 +372,4 @@
431 n32 fsconfig sys_fsconfig
432 n32 fsmount sys_fsmount
433 n32 fspick sys_fspick
+437 n32 openat2 sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 5eebfa0d155c..1fcad88a326f 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -348,3 +348,4 @@
431 n64 fsconfig sys_fsconfig
432 n64 fsmount sys_fsmount
433 n64 fspick sys_fspick
+437 n64 openat2 sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 3cc1374e02d0..5ef8bdce49ca 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -421,3 +421,4 @@
431 o32 fsconfig sys_fsconfig
432 o32 fsmount sys_fsmount
433 o32 fspick sys_fspick
+437 o32 openat2 sys_openat2 sys_openat2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c9e377d59232..176de3591738 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+437 common openat2 sys_openat2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 103655d84b4b..59591311f8e2 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+437 common openat2 sys_openat2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index e822b2964a83..36ca509e26c2 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig sys_fsconfig
432 common fsmount sys_fsmount sys_fsmount
433 common fspick sys_fspick sys_fspick
+437 common openat2 sys_openat2 sys_openat2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 016a727d4357..d5ba779f92c2 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+437 common openat2 sys_openat2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e047480b1605..d45bebdfdfae 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+437 common openat2 sys_openat2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..88825c5e631f 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
431 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
432 i386 fsmount sys_fsmount __ia32_sys_fsmount
433 i386 fspick sys_fspick __ia32_sys_fspick
+437 i386 openat2 sys_openat2 __ia32_sys_openat2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..ebfde1799001 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
431 common fsconfig __x64_sys_fsconfig
432 common fsmount __x64_sys_fsmount
433 common fspick __x64_sys_fspick
+437 common openat2 __x64_sys_openat2
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 5fa0ee1c8e00..927a642859a1 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -404,3 +404,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+437 common openat2 sys_openat2
diff --git a/fs/open.c b/fs/open.c
index bdca45528524..062761136f21 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -928,19 +928,29 @@ struct file *open_with_fake_path(const struct path *path, int flags,
}
EXPORT_SYMBOL(open_with_fake_path);
-static inline int build_open_flags(int flags, umode_t mode, struct open_flags *op)
+static inline int build_open_flags(const struct open_how *how,
+ struct open_flags *op)
{
+ int flags = how->flags;
int lookup_flags = 0;
+ int opath_mask = 0;
int acc_mode = ACC_MODE(flags);
/*
- * Clear out all open flags we don't know about so that we don't report
- * them in fcntl(F_GETFD) or similar interfaces.
+ * Older syscalls still clear these bits before calling
+ * build_open_flags(), but openat2(2) checks all its arguments.
*/
- flags &= VALID_OPEN_FLAGS;
+ if (flags & ~VALID_OPEN_FLAGS)
+ return -EINVAL;
+ if (how->resolve & ~VALID_RESOLVE_FLAGS)
+ return -EINVAL;
+ if (!(how->flags & (O_PATH | O_CREAT | __O_TMPFILE)) && how->mode != 0)
+ return -EINVAL;
+ if (memchr_inv(how->reserved, 0, sizeof(how->reserved)))
+ return -EINVAL;
if (flags & (O_CREAT | __O_TMPFILE))
- op->mode = (mode & S_IALLUGO) | S_IFREG;
+ op->mode = (how->mode & S_IALLUGO) | S_IFREG;
else
op->mode = 0;
@@ -968,6 +978,14 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
*/
flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
acc_mode = 0;
+
+ /* Allow userspace to restrict the re-opening of O_PATH fds. */
+ if (how->upgrade_mask & ~VALID_UPGRADE_FLAGS)
+ return -EINVAL;
+ if (!(how->upgrade_mask & UPGRADE_NOREAD))
+ opath_mask |= FMODE_PATH_READ;
+ if (!(how->upgrade_mask & UPGRADE_NOWRITE))
+ opath_mask |= FMODE_PATH_WRITE;
}
op->open_flag = flags;
@@ -983,8 +1001,7 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
op->acc_mode = acc_mode;
op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN;
- /* For O_PATH backwards-compatibility we default to an all-set mask. */
- op->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE;
+ op->opath_mask = opath_mask;
if (flags & O_CREAT) {
op->intent |= LOOKUP_CREATE;
@@ -998,6 +1015,18 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
lookup_flags |= LOOKUP_FOLLOW;
if (flags & O_EMPTYPATH)
lookup_flags |= LOOKUP_EMPTY;
+
+ if (how->resolve & RESOLVE_NO_XDEV)
+ lookup_flags |= LOOKUP_NO_XDEV;
+ if (how->resolve & RESOLVE_NO_MAGICLINKS)
+ lookup_flags |= LOOKUP_NO_MAGICLINKS;
+ if (how->resolve & RESOLVE_NO_SYMLINKS)
+ lookup_flags |= LOOKUP_NO_SYMLINKS;
+ if (how->resolve & RESOLVE_BENEATH)
+ lookup_flags |= LOOKUP_BENEATH;
+ if (how->resolve & RESOLVE_IN_ROOT)
+ lookup_flags |= LOOKUP_IN_ROOT;
+
op->lookup_flags = lookup_flags;
return 0;
}
@@ -1016,8 +1045,14 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
struct file *file_open_name(struct filename *name, int flags, umode_t mode)
{
struct open_flags op;
- int err = build_open_flags(flags, mode, &op);
- return err ? ERR_PTR(err) : do_filp_open(AT_FDCWD, name, &op);
+ struct open_how how = {
+ .flags = flags & VALID_OPEN_FLAGS,
+ .mode = OPENHOW_MODE(flags, mode),
+ };
+ int err = build_open_flags(&how, &op);
+ if (err)
+ return ERR_PTR(err);
+ return do_filp_open(AT_FDCWD, name, &op);
}
/**
@@ -1048,17 +1083,22 @@ struct file *file_open_root(struct dentry *dentry, struct vfsmount *mnt,
const char *filename, int flags, umode_t mode)
{
struct open_flags op;
- int err = build_open_flags(flags, mode, &op);
+ struct open_how how = {
+ .flags = flags & VALID_OPEN_FLAGS,
+ .mode = OPENHOW_MODE(flags, mode),
+ };
+ int err = build_open_flags(&how, &op);
if (err)
return ERR_PTR(err);
return do_file_open_root(dentry, mnt, filename, &op);
}
EXPORT_SYMBOL(file_open_root);
-long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
+long do_sys_open(int dfd, const char __user *filename,
+ struct open_how *how)
{
struct open_flags op;
- int fd = build_open_flags(flags, mode, &op);
+ int fd = build_open_flags(how, &op);
int empty = 0;
struct filename *tmp;
@@ -1071,7 +1111,7 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
if (!empty)
op.open_flag &= ~O_EMPTYPATH;
- fd = get_unused_fd_flags(flags);
+ fd = get_unused_fd_flags(how->flags);
if (fd >= 0) {
struct file *f = do_filp_open(dfd, tmp, &op);
if (IS_ERR(f)) {
@@ -1088,19 +1128,35 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
- if (force_o_largefile())
- flags |= O_LARGEFILE;
-
- return do_sys_open(AT_FDCWD, filename, flags, mode);
+ return ksys_open(filename, flags, mode);
}
SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
umode_t, mode)
{
+ struct open_how how = {
+ .flags = flags & VALID_OPEN_FLAGS,
+ .mode = OPENHOW_MODE(flags, mode),
+ };
+
+ if (force_o_largefile())
+ how.flags |= O_LARGEFILE;
+
+ return do_sys_open(dfd, filename, &how);
+}
+
+SYSCALL_DEFINE3(openat2, int, dfd, const char __user *, filename,
+ const struct open_how __user *, how)
+{
+ struct open_how tmp;
+
+ if (copy_from_user(&tmp, how, sizeof(tmp)))
+ return -EFAULT;
+
if (force_o_largefile())
- flags |= O_LARGEFILE;
+ tmp.flags |= O_LARGEFILE;
- return do_sys_open(dfd, filename, flags, mode);
+ return do_sys_open(dfd, filename, &tmp);
}
#ifdef CONFIG_COMPAT
@@ -1110,7 +1166,11 @@ SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
*/
COMPAT_SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
- return do_sys_open(AT_FDCWD, filename, flags, mode);
+ struct open_how how = {
+ .flags = flags & VALID_OPEN_FLAGS,
+ .mode = OPENHOW_MODE(flags, mode),
+ };
+ return do_sys_open(AT_FDCWD, filename, &how);
}
/*
@@ -1119,7 +1179,11 @@ COMPAT_SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t,
*/
COMPAT_SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags, umode_t, mode)
{
- return do_sys_open(dfd, filename, flags, mode);
+ struct open_how how = {
+ .flags = flags & VALID_OPEN_FLAGS,
+ .mode = OPENHOW_MODE(flags, mode),
+ };
+ return do_sys_open(dfd, filename, &how);
}
#endif
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 2868ae6c8fc1..f7f378e1f43c 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -4,13 +4,26 @@
#include <uapi/linux/fcntl.h>
-/* list of all valid flags for the open/openat flags argument: */
+/* Should open_how.mode be set for older syscall wrappers? */
+#define OPENHOW_MODE(flags, mode) \
+ (((flags) & (O_CREAT | __O_TMPFILE)) ? (mode) : 0)
+
+/* List of all valid flags for the open/openat flags argument: */
#define VALID_OPEN_FLAGS \
(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_EMPTYPATH)
+/* List of all valid flags for the how->upgrade_mask argument: */
+#define VALID_UPGRADE_FLAGS \
+ (UPGRADE_NOWRITE | UPGRADE_NOREAD)
+
+/* List of all valid flags for the how->resolve argument: */
+#define VALID_RESOLVE_FLAGS \
+ (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
+ RESOLVE_BENEATH | RESOLVE_IN_ROOT)
+
#ifndef force_o_largefile
#define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f7df213405ea..a3aede2b3a91 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2515,8 +2515,8 @@ extern int do_truncate(struct dentry *, loff_t start, unsigned int time_attrs,
struct file *filp);
extern int vfs_fallocate(struct file *file, int mode, loff_t offset,
loff_t len);
-extern long do_sys_open(int dfd, const char __user *filename, int flags,
- umode_t mode);
+extern long do_sys_open(int dfd, const char __user *filename,
+ struct open_how *how);
extern struct file *file_open_name(struct filename *, int, umode_t);
extern struct file *filp_open(const char *, int, umode_t);
extern struct file *file_open_root(struct dentry *, struct vfsmount *,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2bcef4c70183..db141c67c977 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -68,6 +68,7 @@ struct sigaltstack;
struct rseq;
union bpf_attr;
struct io_uring_params;
+struct open_how;
#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -438,6 +439,8 @@ asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user,
asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
umode_t mode);
+asmlinkage long sys_openat2(int dfd, const char __user *filename,
+ const struct open_how *how);
asmlinkage long sys_close(unsigned int fd);
asmlinkage long sys_vhangup(void);
@@ -1369,15 +1372,21 @@ static inline int ksys_close(unsigned int fd)
return __close_fd(current->files, fd);
}
-extern long do_sys_open(int dfd, const char __user *filename, int flags,
- umode_t mode);
+extern long do_sys_open(int dfd, const char __user *filename,
+ struct open_how *how);
static inline long ksys_open(const char __user *filename, int flags,
umode_t mode)
{
+ struct open_how how = {
+ .flags = flags & VALID_OPEN_FLAGS,
+ .mode = OPENHOW_MODE(flags, mode),
+ };
+
if (force_o_largefile())
- flags |= O_LARGEFILE;
- return do_sys_open(AT_FDCWD, filename, flags, mode);
+ how.flags |= O_LARGEFILE;
+
+ return do_sys_open(AT_FDCWD, filename, &how);
}
extern long do_sys_truncate(const char __user *pathname, loff_t length);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..e4e8eb7b20c1 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)
#undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 438
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 1d338357df8a..ebfc97b3d8aa 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -93,5 +93,47 @@
#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
+/**
+ * Arguments for how openat2(2) should open the target path. If @resolve is
+ * zero, then openat2(2) operates identically to openat(2).
+ *
+ * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
+ * than being silently ignored. In addition, @mode (or @upgrade_mask) must be
+ * zero unless one of {O_CREAT, O_TMPFILE, O_PATH} is set.
+ *
+ * @flags: O_* flags.
+ * @mode: O_CREAT/O_TMPFILE file mode.
+ * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
+ * @resolve: RESOLVE_* flags.
+ * @reserved: reserved for future extensions, must be zeroed.
+ */
+struct open_how {
+ __u32 flags;
+ union {
+ __u16 mode;
+ __u16 upgrade_mask;
+ };
+ __u16 resolve;
+ __u64 reserved[7]; /* must be zeroed */
+};
+
+/* how->resolve flags for openat2(2). */
+#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings
+ (includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style
+ "magic-links". */
+#define RESOLVE_NO_SYMLINKS 0x04 /* Block traversal through all symlinks
+ (implies RESOLVE_NO_MAGICLINKS) */
+#define RESOLVE_BENEATH 0x08 /* Block "lexical" trickery like
+ "..", symlinks, and absolute
+ paths which escape the dirfd. */
+#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".."
+ be scoped inside the dirfd
+ (similar to chroot(2)). */
+
+/* how->upgrade_mask flags for openat2(2). */
+/* First bit is reserved for a future UPGRADE_NOEXEC flag. */
+#define UPGRADE_NOREAD 0x02 /* Block re-opening with MAY_READ. */
+#define UPGRADE_NOWRITE 0x04 /* Block re-opening with MAY_WRITE. */
#endif /* _UAPI_LINUX_FCNTL_H */
--
2.22.0
* [PATCH v10 8/9] kselftest: save-and-restore errno to allow for %m formatting
From: Aleksa Sarai @ 2019-07-19 16:42 UTC (permalink / raw)
To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan
Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
Alexei Starovoitov, Kees Cook, Jann Horn, Christian Brauner,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
linux-arch, linux-arm-kernel, linux-fsdevel, linux-ia64,
linux-kernel
In-Reply-To: <20190719164225.27083-1-cyphar@cyphar.com>
Previously, using "%m" in a ksft_* format string could result in strange
output because the errno value wasn't saved before calling other libc
functions. The solution is to simply save and restore the errno before
we format the user-supplied format string.
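The pattern can be shown outside of kselftest.h with a minimal sketch.
It assumes a libc whose printf family understands the "%m" extension
(glibc does). sketch_noisy_prefix() is a made-up stand-in for the
printf("# ") style call that may clobber errno, and the other sketch_*
names are hypothetical helpers for this example only.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for any libc call made before formatting that clobbers errno. */
void sketch_noisy_prefix(void)
{
	errno = 0;
}

/* Format @fmt into @buf so that "%m" sees the caller's errno. */
int sketch_format_msg(char *buf, size_t len, const char *fmt, ...)
{
	int saved_errno = errno;	/* save before touching libc */
	va_list args;
	int n;

	sketch_noisy_prefix();

	va_start(args, fmt);
	errno = saved_errno;		/* restore right before formatting */
	n = vsnprintf(buf, len, fmt, args);
	va_end(args);
	return n;
}

/* Returns non-zero if @buf contains the strerror() text for @err. */
int sketch_mentions_errno(const char *buf, int err)
{
	return strstr(buf, strerror(err)) != NULL;
}
```

Without the restore, the message would describe whatever errno the
prefix call left behind rather than the failure the test wants to report.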
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
tools/testing/selftests/kselftest.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/tools/testing/selftests/kselftest.h b/tools/testing/selftests/kselftest.h
index ec15c4f6af55..0ac49d91a260 100644
--- a/tools/testing/selftests/kselftest.h
+++ b/tools/testing/selftests/kselftest.h
@@ -10,6 +10,7 @@
#ifndef __KSELFTEST_H
#define __KSELFTEST_H
+#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdarg.h>
@@ -81,58 +82,68 @@ static inline void ksft_print_cnts(void)
static inline void ksft_print_msg(const char *msg, ...)
{
+ int saved_errno = errno;
va_list args;
va_start(args, msg);
printf("# ");
+ errno = saved_errno;
vprintf(msg, args);
va_end(args);
}
static inline void ksft_test_result_pass(const char *msg, ...)
{
+ int saved_errno = errno;
va_list args;
ksft_cnt.ksft_pass++;
va_start(args, msg);
printf("ok %d ", ksft_test_num());
+ errno = saved_errno;
vprintf(msg, args);
va_end(args);
}
static inline void ksft_test_result_fail(const char *msg, ...)
{
+ int saved_errno = errno;
va_list args;
ksft_cnt.ksft_fail++;
va_start(args, msg);
printf("not ok %d ", ksft_test_num());
+ errno = saved_errno;
vprintf(msg, args);
va_end(args);
}
static inline void ksft_test_result_skip(const char *msg, ...)
{
+ int saved_errno = errno;
va_list args;
ksft_cnt.ksft_xskip++;
va_start(args, msg);
printf("not ok %d # SKIP ", ksft_test_num());
+ errno = saved_errno;
vprintf(msg, args);
va_end(args);
}
static inline void ksft_test_result_error(const char *msg, ...)
{
+ int saved_errno = errno;
va_list args;
ksft_cnt.ksft_error++;
va_start(args, msg);
printf("not ok %d # error ", ksft_test_num());
+ errno = saved_errno;
vprintf(msg, args);
va_end(args);
}
@@ -152,10 +163,12 @@ static inline int ksft_exit_fail(void)
static inline int ksft_exit_fail_msg(const char *msg, ...)
{
+ int saved_errno = errno;
va_list args;
va_start(args, msg);
printf("Bail out! ");
+ errno = saved_errno;
vprintf(msg, args);
va_end(args);
@@ -178,10 +191,12 @@ static inline int ksft_exit_xpass(void)
static inline int ksft_exit_skip(const char *msg, ...)
{
if (msg) {
+ int saved_errno = errno;
va_list args;
va_start(args, msg);
printf("not ok %d # SKIP ", 1 + ksft_test_num());
+ errno = saved_errno;
vprintf(msg, args);
va_end(args);
} else {
--
2.22.0
* [PATCH v10 9/9] selftests: add openat2(2) selftests
From: Aleksa Sarai @ 2019-07-19 16:42 UTC (permalink / raw)
To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan
Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Andrew Morton,
Alexei Starovoitov, Kees Cook, Jann Horn, Christian Brauner,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
linux-arch, linux-arm-kernel, linux-fsdevel, linux-ia64,
linux-kernel
In-Reply-To: <20190719164225.27083-1-cyphar@cyphar.com>
Test all of the various openat2(2) flags, as well as how file
descriptor re-opening works. A small stress-test of a symlink-rename
attack is included to show that the protections against ".."-based
attacks are sufficient.
In addition, the memfd selftest is fixed to no longer depend on the
now-disallowed functionality of upgrading an O_RDONLY descriptor to
O_RDWR.
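For readers unfamiliar with fd re-opening: it is done by opening the
/proc/self/fd/<n> magic-link of an existing descriptor, which is the path
the restrictions in this series apply to. A minimal sketch of that
operation follows. sketch_reopen_fd() is a hypothetical helper (not part
of the selftests) and it assumes procfs is mounted at /proc.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Re-open @fd with new @flags through its procfs magic-link.
 * Returns the new fd, or -errno on failure (same convention as the
 * sys_* wrappers in the selftest helpers).
 */
int sketch_reopen_fd(int fd, int flags)
{
	char path[64];
	int ret;

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	ret = open(path, flags);
	return ret >= 0 ? ret : -errno;
}
```

With this series applied, whether such a re-open succeeds depends on the
mode of the original descriptor, which is why the memfd test has to go
through an O_PATH descriptor to obtain write access.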
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/memfd/memfd_test.c | 7 +-
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 162 +++++++
tools/testing/selftests/openat2/helpers.h | 114 +++++
.../testing/selftests/openat2/linkmode_test.c | 326 ++++++++++++++
.../selftests/openat2/rename_attack_test.c | 124 ++++++
.../testing/selftests/openat2/resolve_test.c | 397 ++++++++++++++++++
9 files changed, 1138 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/linkmode_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 9781ca79794a..42a27d029c10 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -37,6 +37,7 @@ TARGETS += powerpc
TARGETS += proc
TARGETS += pstore
TARGETS += ptrace
+TARGETS += openat2
TARGETS += rseq
TARGETS += rtc
TARGETS += seccomp
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index c67d32eeb668..e71df3d3e55d 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -925,7 +925,7 @@ static void test_share_mmap(char *banner, char *b_suffix)
*/
static void test_share_open(char *banner, char *b_suffix)
{
- int fd, fd2;
+ int procfd, fd, fd2;
printf("%s %s %s\n", memfd_str, banner, b_suffix);
@@ -950,13 +950,16 @@ static void test_share_open(char *banner, char *b_suffix)
mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
+ /* We cannot do a MAY_WRITE re-open of an O_RDONLY fd. */
+ procfd = mfd_assert_open(fd2, O_PATH, 0);
close(fd2);
- fd2 = mfd_assert_open(fd, O_RDWR, 0);
+ fd2 = mfd_assert_open(procfd, O_WRONLY, 0);
mfd_assert_add_seals(fd2, F_SEAL_SEAL);
mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+ close(procfd);
close(fd2);
close(fd);
}
diff --git a/tools/testing/selftests/openat2/.gitignore b/tools/testing/selftests/openat2/.gitignore
new file mode 100644
index 000000000000..bd68f6c3fd07
--- /dev/null
+++ b/tools/testing/selftests/openat2/.gitignore
@@ -0,0 +1 @@
+/*_test
diff --git a/tools/testing/selftests/openat2/Makefile b/tools/testing/selftests/openat2/Makefile
new file mode 100644
index 000000000000..a0c1b53fd268
--- /dev/null
+++ b/tools/testing/selftests/openat2/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := linkmode_test resolve_test rename_attack_test
+
+include ../lib.mk
+
+$(TEST_GEN_PROGS): helpers.c
diff --git a/tools/testing/selftests/openat2/helpers.c b/tools/testing/selftests/openat2/helpers.c
new file mode 100644
index 000000000000..c16213ff1946
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.c
@@ -0,0 +1,162 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <string.h>
+#include <syscall.h>
+#include <limits.h>
+
+#include "helpers.h"
+
+int sys_openat2(int dfd, const char *path, const struct open_how *how)
+{
+ int ret = syscall(__NR_openat2, dfd, path, how);
+ return ret >= 0 ? ret : -errno;
+}
+
+int sys_openat(int dfd, const char *path, const struct open_how *how)
+{
+ int ret = openat(dfd, path, how->flags, how->mode);
+ return ret >= 0 ? ret : -errno;
+}
+
+int sys_renameat2(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath, unsigned int flags)
+{
+ int ret = syscall(__NR_renameat2, olddirfd, oldpath,
+ newdirfd, newpath, flags);
+ return ret >= 0 ? ret : -errno;
+}
+
+char *openat_flags(unsigned int flags)
+{
+ char *flagset, *accmode = "(none)";
+
+ switch (flags & 0x03) {
+ case O_RDWR:
+ accmode = "O_RDWR";
+ break;
+ case O_RDONLY:
+ accmode = "O_RDONLY";
+ break;
+ case O_WRONLY:
+ accmode = "O_WRONLY";
+ break;
+ }
+
+ E_asprintf(&flagset, "%s%s%s",
+ (flags & O_PATH) ? "O_PATH|" : "",
+ (flags & O_CREAT) ? "O_CREAT|" : "",
+ accmode);
+
+ return flagset;
+}
+
+char *openat2_flags(const struct open_how *how)
+{
+ char *p;
+ char *flags_set, *resolve_set, *acc_set, *set;
+
+ flags_set = openat_flags(how->flags);
+
+ E_asprintf(&resolve_set, "%s%s%s%s%s0",
+ (how->resolve & RESOLVE_NO_XDEV) ? "RESOLVE_NO_XDEV|" : "",
+ (how->resolve & RESOLVE_NO_MAGICLINKS) ? "RESOLVE_NO_MAGICLINKS|" : "",
+ (how->resolve & RESOLVE_NO_SYMLINKS) ? "RESOLVE_NO_SYMLINKS|" : "",
+ (how->resolve & RESOLVE_BENEATH) ? "RESOLVE_BENEATH|" : "",
+ (how->resolve & RESOLVE_IN_ROOT) ? "RESOLVE_IN_ROOT|" : "");
+
+ /* Remove trailing "|0". */
+ p = strstr(resolve_set, "|0");
+ if (p)
+ *p = '\0';
+
+ if (how->flags & O_PATH)
+ E_asprintf(&acc_set, ", upgrade_mask=%s%s0",
+ (how->upgrade_mask & UPGRADE_NOREAD) ? "UPGRADE_NOREAD|" : "",
+ (how->upgrade_mask & UPGRADE_NOWRITE) ? "UPGRADE_NOWRITE|" : "");
+ else if (how->flags & O_CREAT)
+ E_asprintf(&acc_set, ", mode=0%o", how->mode);
+ else
+ acc_set = strdup("");
+
+ /* Remove trailing "|0". */
+ p = strstr(acc_set, "|0");
+ if (p)
+ *p = '\0';
+
+ /* And now generate our flagset. */
+ E_asprintf(&set, "[flags=%s, resolve=%s%s]",
+ flags_set, resolve_set, acc_set);
+
+ free(flags_set);
+ free(resolve_set);
+ free(acc_set);
+ return set;
+}
+
+int touchat(int dfd, const char *path)
+{
+ int fd = openat(dfd, path, O_CREAT, 0700);
+ if (fd >= 0)
+ close(fd);
+ return fd;
+}
+
+char *fdreadlink(int fd)
+{
+ char *target, *tmp;
+
+ E_asprintf(&tmp, "/proc/self/fd/%d", fd);
+
+ target = malloc(PATH_MAX);
+ if (!target)
+ ksft_exit_fail_msg("fdreadlink: malloc failed\n");
+ memset(target, 0, PATH_MAX);
+
+ E_readlink(tmp, target, PATH_MAX);
+ free(tmp);
+ return target;
+}
+
+bool fdequal(int fd, int dfd, const char *path)
+{
+ char *fdpath, *dfdpath, *other;
+ bool cmp;
+
+ fdpath = fdreadlink(fd);
+ dfdpath = fdreadlink(dfd);
+
+ if (!path)
+ E_asprintf(&other, "%s", dfdpath);
+ else if (*path == '/')
+ E_asprintf(&other, "%s", path);
+ else
+ E_asprintf(&other, "%s/%s", dfdpath, path);
+
+ cmp = !strcmp(fdpath, other);
+ if (!cmp)
+ ksft_print_msg("fdequal: expected '%s' but got '%s'\n", other, fdpath);
+
+ free(fdpath);
+ free(dfdpath);
+ free(other);
+ return cmp;
+}
+
+void test_openat2_supported(void)
+{
+ struct open_how how = {};
+ int fd = sys_openat2(AT_FDCWD, ".", &how);
+ if (fd == -ENOSYS)
+ ksft_exit_skip("openat2(2) unsupported on this kernel\n");
+ if (fd < 0)
+ ksft_exit_fail_msg("openat2(2) supported check failed: %s\n", strerror(-fd));
+ close(fd);
+}
diff --git a/tools/testing/selftests/openat2/helpers.h b/tools/testing/selftests/openat2/helpers.h
new file mode 100644
index 000000000000..93d1154c1aea
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.h
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#ifndef __RESOLVEAT_H__
+#define __RESOLVEAT_H__
+
+#define _GNU_SOURCE
+#include <stdint.h>
+#include "../kselftest.h"
+
+#define ARRAY_LEN(X) (sizeof (X) / sizeof (*(X)))
+
+#ifndef SYS_openat2
+#ifndef __NR_openat2
+#define __NR_openat2 437
+#endif /* __NR_openat2 */
+#define SYS_openat2 __NR_openat2
+#endif /* SYS_openat2 */
+
+/**
+ * Arguments for how openat2(2) should open the target path. If @resolve is
+ * zero, then openat2(2) operates identically to openat(2). Only one of @mode
+ * or @upgrade_mask may be set at any given time.
+ *
+ * @flags: O_* flags (-EINVAL on unknown flags).
+ * @mode: O_CREAT/O_TMPFILE file mode (must be zero otherwise).
+ * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
+ * @resolve: RESOLVE_* flags (-EINVAL on unknown flags).
+ * @reserved: reserved for future extensions, must be zeroed.
+ */
+struct open_how {
+ uint32_t flags;
+ union {
+ uint16_t mode;
+ uint16_t upgrade_mask;
+ };
+ uint16_t resolve;
+ uint64_t reserved[7]; /* must be zeroed */
+};
+
+#ifndef RESOLVE_IN_ROOT
+/* how->resolve flags for openat2(2). */
+#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings
+ (includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style
+ "magic-links". */
+#define RESOLVE_NO_SYMLINKS 0x04 /* Block traversal through all symlinks
+ (implies RESOLVE_NO_MAGICLINKS) */
+#define RESOLVE_BENEATH 0x08 /* Block "lexical" trickery like
+ "..", symlinks, and absolute
+ paths which escape the dirfd. */
+#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".."
+ be scoped inside the dirfd
+ (similar to chroot(2)). */
+#endif /* RESOLVE_IN_ROOT */
+
+#ifndef UPGRADE_NOREAD
+/* how->upgrade_mask flags for openat2(2). */
+/* First bit is reserved for a future UPGRADE_NOEXEC flag. */
+#define UPGRADE_NOREAD 0x02 /* Block re-opening with MAY_READ. */
+#define UPGRADE_NOWRITE 0x04 /* Block re-opening with MAY_WRITE. */
+#endif /* UPGRADE_NOREAD */
+
+#ifndef O_EMPTYPATH
+#define O_EMPTYPATH 040000000
+#endif /* O_EMPTYPATH */
+
+#define E_func(func, ...) \
+ do { \
+ if (func(__VA_ARGS__) < 0) \
+ ksft_exit_fail_msg("%s:%d %s failed\n", \
+ __FILE__, __LINE__, #func);\
+ } while (0)
+
+#define E_mkdirat(...) E_func(mkdirat, __VA_ARGS__)
+#define E_symlinkat(...) E_func(symlinkat, __VA_ARGS__)
+#define E_touchat(...) E_func(touchat, __VA_ARGS__)
+#define E_readlink(...) E_func(readlink, __VA_ARGS__)
+#define E_fstatat(...) E_func(fstatat, __VA_ARGS__)
+#define E_asprintf(...) E_func(asprintf, __VA_ARGS__)
+#define E_fchdir(...) E_func(fchdir, __VA_ARGS__)
+#define E_mount(...) E_func(mount, __VA_ARGS__)
+#define E_unshare(...) E_func(unshare, __VA_ARGS__)
+#define E_setresuid(...) E_func(setresuid, __VA_ARGS__)
+#define E_chmod(...) E_func(chmod, __VA_ARGS__)
+
+#define E_assert(expr, msg, ...) \
+ do { \
+ if (!(expr)) \
+ ksft_exit_fail_msg("ASSERT(%s:%d) failed (%s): " msg "\n", \
+ __FILE__, __LINE__, #expr, ##__VA_ARGS__); \
+ } while (0)
+
+typedef int (*openfunc_t)(int dfd, const char *path, const struct open_how *how);
+
+int sys_openat2(int dfd, const char *path, const struct open_how *how);
+char *openat2_flags(const struct open_how *how);
+
+int sys_openat(int dfd, const char *path, const struct open_how *how);
+char *openat_flags(unsigned int flags);
+
+int sys_renameat2(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath, unsigned int flags);
+
+int touchat(int dfd, const char *path);
+char *fdreadlink(int fd);
+bool fdequal(int fd, int dfd, const char *path);
+
+void test_openat2_supported(void);
+
+#endif /* __RESOLVEAT_H__ */
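
[Annotation, not part of the patch.] The struct open_how definition in helpers.h above packs a 32-bit flags word, a 16-bit mode/upgrade_mask union, a 16-bit resolve word, and seven reserved 64-bit words. A minimal sketch of checking that layout at compile time follows. It assumes the v10 selftest definition copied verbatim from this patch (which need not match any released kernel header), and open_how_reserved_is_zero() is a hypothetical helper introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copied from the v10 helpers.h above; not a released UAPI definition. */
struct open_how {
	uint32_t flags;
	union {
		uint16_t mode;
		uint16_t upgrade_mask;
	};
	uint16_t resolve;
	uint64_t reserved[7]; /* must be zeroed */
};

/* 4 + 2 + 2 + 7 * 8 bytes, with no padding on common ABIs. */
static_assert(sizeof(struct open_how) == 64, "unexpected open_how size");
static_assert(offsetof(struct open_how, mode) ==
	      offsetof(struct open_how, upgrade_mask),
	      "mode and upgrade_mask must share storage");
static_assert(offsetof(struct open_how, resolve) == 6, "resolve offset");
static_assert(offsetof(struct open_how, reserved) == 8, "reserved offset");

/* Hypothetical helper (not in the patch): non-zero if reserved[] is clean. */
static int open_how_reserved_is_zero(const struct open_how *how)
{
	for (size_t i = 0; i < sizeof(how->reserved) / sizeof(how->reserved[0]); i++)
		if (how->reserved[i] != 0)
			return 0;
	return 1;
}
```

Initialising with "struct open_how how = {};" (as test_openat2_supported() does) or with designated initialisers leaves reserved[] zeroed, which is what the "must be zeroed" comment asks for.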
diff --git a/tools/testing/selftests/openat2/linkmode_test.c b/tools/testing/selftests/openat2/linkmode_test.c
new file mode 100644
index 000000000000..87d94413659f
--- /dev/null
+++ b/tools/testing/selftests/openat2/linkmode_test.c
@@ -0,0 +1,326 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <stdbool.h>
+#include <string.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+static mode_t fdmode(int fd)
+{
+ char *fdpath;
+ struct stat statbuf;
+ mode_t mode;
+
+ E_asprintf(&fdpath, "/proc/self/fd/%d", fd);
+ E_fstatat(AT_FDCWD, fdpath, &statbuf, AT_SYMLINK_NOFOLLOW);
+ mode = (statbuf.st_mode & ~S_IFMT);
+ free(fdpath);
+
+ return mode;
+}
+
+static int reopen_proc(int fd, unsigned int flags)
+{
+ int ret, saved_errno;
+ char *fdpath;
+
+ E_asprintf(&fdpath, "/proc/self/fd/%d", fd);
+ ret = open(fdpath, flags);
+ saved_errno = errno;
+ free(fdpath);
+
+ return ret >= 0 ? ret : -saved_errno;
+}
+
+static int reopen_oemptypath(int fd, unsigned int flags)
+{
+ int ret = openat(fd, "", O_EMPTYPATH | flags);
+ return ret >= 0 ? ret : -errno;
+}
+
+struct reopen_test {
+ openfunc_t open;
+ mode_t chmod_mode;
+ struct {
+ struct open_how how;
+ mode_t mode;
+ int err;
+ } orig, new;
+};
+
+static bool reopen(int fd, struct reopen_test *test)
+{
+ int newfd;
+ mode_t proc_mode;
+ bool failed = false;
+
+ /* Check that the proc mode is correct. */
+ proc_mode = fdmode(fd);
+ if (proc_mode != test->orig.mode) {
+ ksft_print_msg("incorrect fdmode (got[%o] != want[%o])\n",
+ proc_mode, test->orig.mode);
+ failed = true;
+ }
+
+ /* Re-open through /proc. */
+ newfd = reopen_proc(fd, test->new.how.flags);
+ if (newfd != test->new.err && (newfd < 0 || test->new.err < 0)) {
+ ksft_print_msg("/proc failure (%d != %d [%s])\n",
+ newfd, test->new.err, strerror(-test->new.err));
+ failed = true;
+ }
+ if (newfd >= 0) {
+ proc_mode = fdmode(newfd);
+ if (proc_mode != test->new.mode) {
+ ksft_print_msg("/proc wrong fdmode (got[%o] != want[%o])\n",
+ proc_mode, test->new.mode);
+ failed = true;
+ }
+ close(newfd);
+ }
+
+ /* Re-open with O_EMPTYPATH. */
+ newfd = reopen_oemptypath(fd, test->new.how.flags);
+ if (newfd != test->new.err && (newfd < 0 || test->new.err < 0)) {
+ ksft_print_msg("O_EMPTYPATH failure (%d != %d [%s])\n",
+ newfd, test->new.err, strerror(-test->new.err));
+ failed = true;
+ }
+ if (newfd >= 0) {
+ proc_mode = fdmode(newfd);
+ if (proc_mode != test->new.mode) {
+ ksft_print_msg("O_EMPTYPATH wrong fdmode (got[%o] != want[%o])\n",
+ proc_mode, test->new.mode);
+ failed = true;
+ }
+ close(newfd);
+ }
+
+ return failed;
+}
+
+void test_reopen_ordinary(bool privileged)
+{
+ int fd;
+ int err_access = privileged ? 0 : -EACCES;
+ char tmpfile[] = "/tmp/ksft-openat2-reopen-testfile.XXXXXX";
+
+ fd = mkstemp(tmpfile);
+ E_assert(fd >= 0, "mkstemp failed: %m\n");
+ close(fd);
+
+ struct reopen_test tests[] = {
+ /* Re-opening with the same mode should succeed. */
+ { .open = sys_openat, .chmod_mode = 0400,
+ .orig.how.flags = O_RDONLY, .orig.mode = 0500,
+ .new.how.flags = O_RDONLY, .new.mode = 0500 },
+ { .open = sys_openat, .chmod_mode = 0200,
+ .orig.how.flags = O_WRONLY, .orig.mode = 0300,
+ .new.how.flags = O_WRONLY, .new.mode = 0300 },
+ { .open = sys_openat, .chmod_mode = 0600,
+ .orig.how.flags = O_RDWR, .orig.mode = 0700,
+ .new.how.flags = O_RDWR, .new.mode = 0700 },
+ { .open = sys_openat, .chmod_mode = 0600,
+ .orig.how.flags = O_RDWR, .orig.mode = 0700,
+ .new.how.flags = O_RDONLY, .new.mode = 0500 },
+ { .open = sys_openat, .chmod_mode = 0600,
+ .orig.how.flags = O_RDWR, .orig.mode = 0700,
+ .new.how.flags = O_WRONLY, .new.mode = 0300 },
+
+ /*
+ * Re-opening with a different mode will always fail (with an obvious
+ * carve-out for privileged users).
+ */
+ { .open = sys_openat, .chmod_mode = 0600,
+ .orig.how.flags = O_RDONLY, .orig.mode = 0500,
+ .new.how.flags = O_WRONLY, .new.mode = 0300, .new.err = err_access },
+ { .open = sys_openat, .chmod_mode = 0600,
+ .orig.how.flags = O_WRONLY, .orig.mode = 0300,
+ .new.how.flags = O_RDONLY, .new.mode = 0500, .new.err = err_access },
+ { .open = sys_openat, .chmod_mode = 0600,
+ .orig.how.flags = O_RDONLY, .orig.mode = 0500,
+ .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access },
+ { .open = sys_openat, .chmod_mode = 0600,
+ .orig.how.flags = O_WRONLY, .orig.mode = 0300,
+ .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access },
+
+ /* Doubly so if they didn't even have permissions at open-time. */
+ { .open = sys_openat, .chmod_mode = 0400,
+ .orig.how.flags = O_RDONLY, .orig.mode = 0500,
+ .new.how.flags = O_WRONLY, .new.mode = 0300, .new.err = err_access },
+ { .open = sys_openat, .chmod_mode = 0200,
+ .orig.how.flags = O_WRONLY, .orig.mode = 0300,
+ .new.how.flags = O_RDONLY, .new.mode = 0500, .new.err = err_access },
+ { .open = sys_openat, .chmod_mode = 0400,
+ .orig.how.flags = O_RDONLY, .orig.mode = 0500,
+ .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access },
+ { .open = sys_openat, .chmod_mode = 0200,
+ .orig.how.flags = O_WRONLY, .orig.mode = 0300,
+ .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access },
+
+ /* O_PATH re-opens (of ordinary files) will always work. */
+ { .open = sys_openat, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0070,
+ .new.how.flags = O_WRONLY, .new.mode = 0300 },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0070,
+ .new.how.flags = O_WRONLY, .new.mode = 0300 },
+
+ { .open = sys_openat, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0070,
+ .new.how.flags = O_RDONLY, .new.mode = 0500 },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0070,
+ .new.how.flags = O_RDONLY, .new.mode = 0500 },
+
+ { .open = sys_openat, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0070,
+ .new.how.flags = O_RDWR, .new.mode = 0700 },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0070,
+ .new.how.flags = O_RDWR, .new.mode = 0700 },
+
+ /*
+ * openat2(2) UPGRADE_NO* flags. In the privileged case, the re-open
+ * will work but the mode will still be scoped to the mode (or'd with
+ * the open acc_mode).
+ */
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0010,
+ .orig.how.upgrade_mask = UPGRADE_NOREAD | UPGRADE_NOWRITE,
+ .new.how.flags = O_RDONLY, .new.mode = 0500, .new.err = err_access },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0010,
+ .orig.how.upgrade_mask = UPGRADE_NOREAD | UPGRADE_NOWRITE,
+ .new.how.flags = O_WRONLY, .new.mode = 0300, .new.err = err_access },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0010,
+ .orig.how.upgrade_mask = UPGRADE_NOREAD | UPGRADE_NOWRITE,
+ .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access },
+
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0050,
+ .orig.how.upgrade_mask = UPGRADE_NOWRITE,
+ .new.how.flags = O_RDONLY, .new.mode = 0500 },
+
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0030,
+ .orig.how.upgrade_mask = UPGRADE_NOREAD,
+ .new.how.flags = O_WRONLY, .new.mode = 0300 },
+
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0030,
+ .orig.how.upgrade_mask = UPGRADE_NOREAD,
+ .new.how.flags = O_RDONLY, .new.mode = 0500, .new.err = err_access },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0050,
+ .orig.how.upgrade_mask = UPGRADE_NOWRITE,
+ .new.how.flags = O_WRONLY, .new.mode = 0300, .new.err = err_access },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0030,
+ .orig.how.upgrade_mask = UPGRADE_NOREAD,
+ .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access },
+ { .open = sys_openat2, .chmod_mode = 0000,
+ .orig.how.flags = O_PATH, .orig.mode = 0050,
+ .orig.how.upgrade_mask = UPGRADE_NOWRITE,
+ .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access },
+ };
+
+ for (int i = 0; i < ARRAY_LEN(tests); i++) {
+ int fd;
+ char *orig_flagset, *new_flagset;
+ struct reopen_test *test = &tests[i];
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+
+ E_chmod(tmpfile, test->chmod_mode);
+
+ fd = test->open(AT_FDCWD, tmpfile, &test->orig.how);
+ E_assert(fd >= 0, "open '%s' failed: %m\n", tmpfile);
+
+ /* Make sure that any EACCES we see is not from inode permissions. */
+ E_chmod(tmpfile, 0777);
+
+ if (reopen(fd, test))
+ resultfn = ksft_test_result_fail;
+
+ close(fd);
+
+ new_flagset = openat_flags(test->new.how.flags);
+ if (test->open == sys_openat)
+ orig_flagset = openat_flags(test->orig.how.flags);
+ else if (test->open == sys_openat2)
+ orig_flagset = openat2_flags(&test->orig.how);
+ else
+ ksft_exit_fail_msg("unknown test->open\n");
+
+ resultfn("%sordinary reopen of (orig[%s]=%s, new=%s) chmod=%.3o %s\n",
+ privileged ? "privileged " : "",
+ test->open == sys_openat ? "openat" : "openat2",
+ orig_flagset, new_flagset, test->chmod_mode,
+ test->new.err < 0 ? strerror(-test->new.err) : "works");
+ fflush(stdout);
+
+ free(new_flagset);
+ free(orig_flagset);
+ }
+
+ unlink(tmpfile);
+}
+
+void test_openat2_cloexec_test(void)
+{
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+ struct open_how how = {
+ .flags = O_CLOEXEC | O_PATH | O_DIRECTORY,
+ };
+
+ int fd = sys_openat2(AT_FDCWD, ".", &how);
+ E_assert(fd >= 0, "open '.' failed: %m\n");
+
+ int flags = fcntl(fd, F_GETFD);
+ E_assert(flags >= 0, "F_GETFD failed: %m\n");
+
+ if (!(flags & FD_CLOEXEC))
+ resultfn = ksft_test_result_fail;
+
+ resultfn("openat2(O_CLOEXEC) works as expected\n");
+}
+
+int main(int argc, char **argv)
+{
+ bool privileged;
+
+ ksft_print_header();
+ test_openat2_supported();
+
+ /*
+ * Technically we should be checking CAP_DAC_OVERRIDE, but it's easier to
+ * just assume that euid=0 has the full capability set.
+ */
+ privileged = (geteuid() == 0);
+ if (!privileged)
+ ksft_test_result_skip("privileged tests require euid == 0\n");
+ else {
+ test_reopen_ordinary(privileged);
+
+ E_setresuid(65534, 65534, 65534);
+ privileged = (geteuid() == 0);
+ }
+
+ test_reopen_ordinary(privileged);
+ test_openat2_cloexec_test();
+
+ if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/openat2/rename_attack_test.c b/tools/testing/selftests/openat2/rename_attack_test.c
new file mode 100644
index 000000000000..b5e2a68609f1
--- /dev/null
+++ b/tools/testing/selftests/openat2/rename_attack_test.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <syscall.h>
+#include <limits.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/* Construct a test directory with the following structure:
+ *
+ * root/
+ * |-- a/
+ * | `-- c/
+ * `-- b/
+ */
+int setup_testdir(void)
+{
+ int dfd;
+ char dirname[] = "/tmp/ksft-openat2-rename-attack.XXXXXX";
+
+ /* Make the top-level directory. */
+ if (!mkdtemp(dirname))
+ ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n");
+ dfd = open(dirname, O_PATH | O_DIRECTORY);
+ if (dfd < 0)
+ ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+
+ E_mkdirat(dfd, "a", 0755);
+ E_mkdirat(dfd, "b", 0755);
+ E_mkdirat(dfd, "a/c", 0755);
+
+ return dfd;
+}
+
+/* Swap @dirfd/@a and @dirfd/@b constantly. Parent must kill this process. */
+pid_t spawn_attack(int dirfd, char *a, char *b)
+{
+ pid_t child = fork();
+ if (child != 0)
+ return child;
+
+ /* If the parent (the test process) dies, kill ourselves too. */
+ prctl(PR_SET_PDEATHSIG, SIGKILL);
+
+ /* Swap @a and @b. */
+ for (;;)
+ renameat2(dirfd, a, dirfd, b, RENAME_EXCHANGE);
+ exit(1);
+}
+
+#define ROUNDS 400000
+void test_rename_attack(void)
+{
+ int dfd, afd, escaped_count = 0;
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+ pid_t child;
+
+ dfd = setup_testdir();
+ afd = openat(dfd, "a", O_PATH);
+ if (afd < 0)
+ ksft_exit_fail_msg("test_rename_attack: failed to open 'a'\n");
+
+ child = spawn_attack(dfd, "a/c", "b");
+
+ for (int i = 0; i < ROUNDS; i++) {
+ int fd;
+ bool failed;
+ struct open_how how = {
+ .flags = O_PATH,
+ .resolve = RESOLVE_IN_ROOT,
+ };
+ char *victim_path = "c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../..";
+
+ fd = sys_openat2(afd, victim_path, &how);
+ if (fd < 0)
+ failed = (fd != -EXDEV);
+ else
+ failed = !fdequal(fd, afd, NULL);
+
+ escaped_count += failed;
+ close(fd);
+ }
+
+ if (escaped_count > 0)
+ resultfn = ksft_test_result_fail;
+
+ resultfn("rename attack fails (expected 0 breakouts in %d runs, got %d)\n",
+ ROUNDS, escaped_count);
+
+ /* Should be killed anyway, but might as well make sure. */
+ kill(child, SIGKILL);
+}
+
+int main(int argc, char **argv)
+{
+ ksft_print_header();
+ test_openat2_supported();
+
+ test_rename_attack();
+
+ if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
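
[Annotation, not part of the patch.] The victim_path literal in test_rename_attack() is the unit "c/../.." repeated and joined with "/". Starting from afd (directory "a"), each unit enters "c" and then walks up twice; under RESOLVE_IN_ROOT the second ".." should stay pinned at afd, so a walk that is not raced ends at afd itself. If the attacker swaps "a/c" and "b" mid-walk, the extra ".." is what would escape. A rough sketch of building such a path programmatically is below; build_victim_path() and its reps parameter are hypothetical and do not appear in the patch, which uses a fixed literal.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a malloc()ed string made of "c/../.."
 * repeated reps times and joined with '/', or NULL on error.
 * The caller frees the result.
 */
static char *build_victim_path(size_t reps)
{
	static const char unit[] = "c/../..";
	const size_t unit_len = sizeof(unit) - 1;
	size_t len;
	char *path, *p;

	if (reps == 0)
		return NULL;

	/* reps units plus (reps - 1) '/' separators. */
	len = reps * unit_len + (reps - 1);
	path = malloc(len + 1);
	if (!path)
		return NULL;

	p = path;
	for (size_t i = 0; i < reps; i++) {
		if (i > 0)
			*p++ = '/';
		memcpy(p, unit, unit_len);
		p += unit_len;
	}
	/* Sanity check: exactly len bytes were written. */
	assert((size_t)(p - path) == len);
	*p = '\0';
	return path;
}
```

A longer path gives the racing renameat2(RENAME_EXCHANGE) loop more chances to land in the middle of a single lookup.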
diff --git a/tools/testing/selftests/openat2/resolve_test.c b/tools/testing/selftests/openat2/resolve_test.c
new file mode 100644
index 000000000000..5a9b478c9295
--- /dev/null
+++ b/tools/testing/selftests/openat2/resolve_test.c
@@ -0,0 +1,397 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/*
+ * Construct a test directory with the following structure:
+ *
+ * root/
+ * |-- procexe -> /proc/self/exe
+ * |-- procroot -> /proc/self/root
+ * |-- root/
+ * |-- mnt/ [mountpoint]
+ * | |-- self -> ../mnt/
+ * | `-- absself -> /mnt/
+ * |-- etc/
+ * | `-- passwd
+ * |-- creatlink -> /newfile3
+ * |-- relsym -> etc/passwd
+ * |-- abssym -> /etc/passwd
+ * |-- abscheeky -> /cheeky
+ * |-- abscheeky -> /cheeky
+ * `-- cheeky/
+ * |-- absself -> /
+ * |-- self -> ../../root/
+ * |-- garbageself -> /../../root/
+ * |-- passwd -> ../cheeky/../etc/../etc/passwd
+ * |-- abspasswd -> /../cheeky/../etc/../etc/passwd
+ * |-- dotdotlink -> ../../../../../../../../../../../../../../etc/passwd
+ * `-- garbagelink -> /../../../../../../../../../../../../../../etc/passwd
+ */
+int setup_testdir(void)
+{
+ int dfd, tmpfd;
+ char dirname[] = "/tmp/ksft-openat2-testdir.XXXXXX";
+
+ /* Unshare and make /tmp a new directory. */
+ E_unshare(CLONE_NEWNS);
+ E_mount("", "/tmp", "", MS_PRIVATE, "");
+
+ /* Make the top-level directory. */
+ if (!mkdtemp(dirname))
+ ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n");
+ dfd = open(dirname, O_PATH | O_DIRECTORY);
+ if (dfd < 0)
+ ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+
+ /* A sub-directory which is actually used for tests. */
+ E_mkdirat(dfd, "root", 0755);
+ tmpfd = openat(dfd, "root", O_PATH | O_DIRECTORY);
+ if (tmpfd < 0)
+ ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+ close(dfd);
+ dfd = tmpfd;
+
+ E_symlinkat("/proc/self/exe", dfd, "procexe");
+ E_symlinkat("/proc/self/root", dfd, "procroot");
+ E_mkdirat(dfd, "root", 0755);
+
+ /* There is no mountat(2), so use chdir. */
+ E_mkdirat(dfd, "mnt", 0755);
+ E_fchdir(dfd);
+ E_mount("tmpfs", "./mnt", "tmpfs", MS_NOSUID | MS_NODEV, "");
+ E_symlinkat("../mnt/", dfd, "mnt/self");
+ E_symlinkat("/mnt/", dfd, "mnt/absself");
+
+ E_mkdirat(dfd, "etc", 0755);
+ E_touchat(dfd, "etc/passwd");
+
+ E_symlinkat("/newfile3", dfd, "creatlink");
+ E_symlinkat("etc/passwd", dfd, "relsym");
+ E_symlinkat("/etc/passwd", dfd, "abssym");
+ E_symlinkat("/cheeky", dfd, "abscheeky");
+
+ E_mkdirat(dfd, "cheeky", 0755);
+
+ E_symlinkat("/", dfd, "cheeky/absself");
+ E_symlinkat("../../root/", dfd, "cheeky/self");
+ E_symlinkat("/../../root/", dfd, "cheeky/garbageself");
+
+ E_symlinkat("../cheeky/../etc/../etc/passwd", dfd, "cheeky/passwd");
+ E_symlinkat("/../cheeky/../etc/../etc/passwd", dfd, "cheeky/abspasswd");
+
+ E_symlinkat("../../../../../../../../../../../../../../etc/passwd",
+ dfd, "cheeky/dotdotlink");
+ E_symlinkat("/../../../../../../../../../../../../../../etc/passwd",
+ dfd, "cheeky/garbagelink");
+
+ return dfd;
+}
+
+struct basic_test {
+ const char *dir;
+ const char *path;
+ struct open_how how;
+ bool pass;
+ union {
+ int err;
+ const char *path;
+ } out;
+};
+
+void test_openat2_opath_tests(void)
+{
+ int rootfd;
+ char *procselfexe;
+
+ E_asprintf(&procselfexe, "/proc/%d/exe", getpid());
+ rootfd = setup_testdir();
+
+ struct basic_test tests[] = {
+ /** RESOLVE_BENEATH **/
+ /* Attempts to cross dirfd should be blocked. */
+ { .path = "/", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "cheeky/absself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "abscheeky/absself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "..", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "../root/", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "cheeky/self", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "abscheeky/self", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "cheeky/garbageself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "abscheeky/garbageself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ /* Only relative paths that stay inside dirfd should work. */
+ { .path = "root", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "root", .pass = true },
+ { .path = "etc", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc", .pass = true },
+ { .path = "etc/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "relsym", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "cheeky/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "abscheeky/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "abssym", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "/etc/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "cheeky/abspasswd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ /* Tricky paths should fail. */
+ { .path = "cheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "cheeky/garbagelink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+
+ /** RESOLVE_IN_ROOT **/
+ /* All attempts to cross the dirfd will be scoped-to-root. */
+ { .path = "/", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .path = "cheeky/absself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .path = "abscheeky/absself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .path = "..", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .path = "../root/", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .path = "../root/", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .path = "cheeky/self", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .path = "cheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .path = "abscheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .path = "root", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .path = "etc", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc", .pass = true },
+ { .path = "etc/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "relsym", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "cheeky/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "abscheeky/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "abssym", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "/etc/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "cheeky/abspasswd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "abscheeky/abspasswd",.how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "cheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "/../../../../abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "cheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .path = "/../../../../abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ /* O_CREAT should handle trailing symlinks correctly. */
+ { .path = "newfile1", .how.flags = O_CREAT,
+ .how.mode = 0700,
+ .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "newfile1", .pass = true },
+ { .path = "/newfile2", .how.flags = O_CREAT,
+ .how.mode = 0700,
+ .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "newfile2", .pass = true },
+ { .path = "/creatlink", .how.flags = O_CREAT,
+ .how.mode = 0700,
+ .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "newfile3", .pass = true },
+
+ /** RESOLVE_NO_XDEV **/
+ /* Crossing *down* into a mountpoint is disallowed. */
+ { .path = "mnt", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "mnt/", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "mnt/.", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ /* Crossing *up* out of a mountpoint is disallowed. */
+ { .dir = "mnt", .path = ".", .how.resolve = RESOLVE_NO_XDEV,
+ .out.path = "mnt", .pass = true },
+ { .dir = "mnt", .path = "..", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .dir = "mnt", .path = "../mnt", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .dir = "mnt", .path = "self", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .dir = "mnt", .path = "absself", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ /* Jumping to "/" is ok, but later components cannot cross. */
+ { .dir = "mnt", .path = "/", .how.resolve = RESOLVE_NO_XDEV,
+ .out.path = "/", .pass = true },
+ { .dir = "/", .path = "/", .how.resolve = RESOLVE_NO_XDEV,
+ .out.path = "/", .pass = true },
+ { .path = "/proc/1", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .path = "/tmp", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+
+ /** RESOLVE_NO_MAGICLINKS **/
+ /* Regular symlinks should work. */
+ { .path = "relsym", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.path = "etc/passwd", .pass = true },
+ /* Magic-links should not work. */
+ { .path = "procexe", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "/proc/self/exe", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "procroot/etc", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "/proc/self/root/etc", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "/proc/self/root/etc", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "/proc/self/exe", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.path = procselfexe, .pass = true },
+
+ /** RESOLVE_NO_SYMLINKS **/
+ /* Normal paths should work. */
+ { .path = ".", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = NULL, .pass = true },
+ { .path = "root", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "root", .pass = true },
+ { .path = "etc", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "etc", .pass = true },
+ { .path = "etc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "etc/passwd", .pass = true },
+ /* Regular symlinks are blocked. */
+ { .path = "relsym", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "abssym", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "cheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "abscheeky/absself", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ /* Trailing symlinks with O_NOFOLLOW. */
+ { .path = "relsym", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "relsym", .pass = true },
+ { .path = "abssym", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "abssym", .pass = true },
+ { .path = "cheeky/garbagelink", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "cheeky/garbagelink", .pass = true },
+ { .path = "abscheeky/garbagelink", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .path = "abscheeky/absself", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ };
+
+ for (int i = 0; i < ARRAY_LEN(tests); i++) {
+ int dfd, fd;
+ bool failed;
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+ struct basic_test *test = &tests[i];
+ char *flagstr;
+
+ /* Auto-set O_PATH. */
+ if (!(test->how.flags & O_CREAT))
+ test->how.flags |= O_PATH;
+ flagstr = openat2_flags(&test->how);
+
+ if (test->dir)
+ dfd = openat(rootfd, test->dir, O_PATH | O_DIRECTORY);
+ else
+ dfd = dup(rootfd);
+ if (dfd < 0) {
+ resultfn = ksft_test_result_error;
+ goto next;
+ }
+
+ fd = sys_openat2(dfd, test->path, &test->how);
+ if (test->pass)
+ failed = (fd < 0 || !fdequal(fd, rootfd, test->out.path));
+ else
+ failed = (fd != test->out.err);
+ if (fd >= 0)
+ close(fd);
+ close(dfd);
+
+ if (failed)
+ resultfn = ksft_test_result_fail;
+
+next:
+ if (test->pass)
+ resultfn("openat2(root[%s], %s, %s) ==> %s\n",
+ test->dir ?: ".", test->path, flagstr,
+ test->out.path ?: ".");
+ else
+ resultfn("openat2(root[%s], %s, %s) ==> %d (%s)\n",
+ test->dir ?: ".", test->path, flagstr,
+ test->out.err, strerror(-test->out.err));
+ fflush(stdout);
+
+ free(flagstr);
+ }
+
+ free(procselfexe);
+ close(rootfd);
+}
+
+int main(int argc, char **argv)
+{
+ ksft_print_header();
+ test_openat2_supported();
+
+ /* NOTE: We should be checking for CAP_SYS_ADMIN here... */
+ if (geteuid() != 0)
+ ksft_exit_skip("openat2(2) tests require euid == 0\n");
+
+ test_openat2_opath_tests();
+
+ if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
--
2.22.0
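
[Annotation, not part of the patch.] The RESOLVE_BENEATH and RESOLVE_IN_ROOT tables in resolve_test.c differ mainly in what happens when a walk reaches the dirfd and tries to leave it: BENEATH expects -EXDEV, IN_ROOT expects the walk to be clamped at the dirfd. The sketch below is a purely lexical userspace model of that difference, assuming no symlinks, mounts, or races. It is not how the kernel implements the flags, and lexical_resolve() and enum scope_mode are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

enum scope_mode { SCOPE_BENEATH, SCOPE_IN_ROOT }; /* hypothetical names */

/*
 * Lexically walk path from an imaginary dirfd. On success, out holds the
 * result relative to the dirfd ("" means the dirfd itself) and 0 is
 * returned; otherwise -EXDEV or -ENAMETOOLONG.
 */
static int lexical_resolve(const char *path, enum scope_mode mode,
			   char *out, size_t outlen)
{
	size_t len = 0;
	const char *p = path;

	assert(path != NULL && out != NULL);
	if (outlen == 0)
		return -ENAMETOOLONG;
	out[0] = '\0';

	/* A leading "/" is an escape for BENEATH, the dirfd for IN_ROOT. */
	if (*p == '/' && mode == SCOPE_BENEATH)
		return -EXDEV;

	while (*p) {
		size_t n;

		while (*p == '/')
			p++;
		n = strcspn(p, "/");
		if (n == 0)
			break;
		if (n == 1 && p[0] == '.') {
			/* "." stays where we are. */
		} else if (n == 2 && p[0] == '.' && p[1] == '.') {
			if (len == 0) {
				/* ".." at the dirfd: error or clamp. */
				if (mode == SCOPE_BENEATH)
					return -EXDEV;
			} else {
				/* Drop the last component and its separator. */
				while (len > 0 && out[len - 1] != '/')
					len--;
				if (len > 0)
					len--;
				out[len] = '\0';
			}
		} else {
			size_t need = n + (len > 0 ? 1 : 0);

			if (len + need + 1 > outlen)
				return -ENAMETOOLONG;
			if (len > 0)
				out[len++] = '/';
			memcpy(out + len, p, n);
			len += n;
			out[len] = '\0';
		}
		p += n;
	}
	return 0;
}
```

In this model ".." and "/etc/passwd" fail with -EXDEV under SCOPE_BENEATH, while under SCOPE_IN_ROOT they resolve to the dirfd and "etc/passwd" respectively, matching the symlink-free rows of the tables.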
* Re: [PATCH ghak90 V6 03/10] audit: read container ID of a process
From: Richard Guy Briggs @ 2019-07-19 17:05 UTC (permalink / raw)
To: Eric W. Biederman
Cc: nhorman, linux-api, containers, LKML, dhowells,
Linux-Audit Mailing List, netfilter-devel, simo, netdev,
linux-fsdevel, eparis, serge
In-Reply-To: <87d0i6dnag.fsf@xmission.com>
On 2019-07-19 11:03, Eric W. Biederman wrote:
> Richard Guy Briggs <rgb@redhat.com> writes:
>
> > Add support for reading the audit container identifier from the proc
> > filesystem.
> >
> > This is a read from the proc entry of the form
> > /proc/PID/audit_containerid where PID is the process ID of the task
> > whose audit container identifier is sought.
> >
> > The read expects up to a u64 value (unset: 18446744073709551615).
> >
> > This read requires CAP_AUDIT_CONTROL.
>
> This scares me. As this seems to make it easy to reuse an audit
> containerid for non-audit purporses.
At this point, given that capable(CAP_AUDIT_CONTROL) is not available to
any userspace container orchestrator/engine, it is moot anyway, and we
will need another method.
> I would think it would be safer and easier to poke audit and ask it to
> log a message with your audit container id.
For it to be useful to a container orchestrator/engine, I think that
would depend on whether we are setting the value, or it is being
assigned by the kernel. At this stage it is set by the orchestrator so
this could work.
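As an aside on the mechanics of the read in the quoted patch: the
TMPBUFLEN bump from 11 to 21 follows from printing a u64 in decimal. The
unset value 18446744073709551615 is 20 digits, plus one byte for the
terminating NUL. A rough userspace sketch of that arithmetic, where
snprintf() stands in for the kernel's scnprintf() and
contid_decimal_len() is a made-up helper name, not something from the
patch:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define TMPBUFLEN 21	/* 20 decimal digits for UINT64_MAX + NUL */

/* Hypothetical helper: characters needed to print contid in decimal. */
static int contid_decimal_len(uint64_t contid)
{
	char tmpbuf[TMPBUFLEN];

	/* snprintf() returns the length the full output would need. */
	return snprintf(tmpbuf, sizeof(tmpbuf), "%" PRIu64, contid);
}
```

With the old value of 11 only a u32 (10 digits) would fit, which was
enough for loginuid and sessionid but not for a u64 container ID.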
> Eric
>
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > Acked-by: Serge Hallyn <serge@hallyn.com>
> > Acked-by: Neil Horman <nhorman@tuxdriver.com>
> > Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> > ---
> > fs/proc/base.c | 25 ++++++++++++++++++++++---
> > 1 file changed, 22 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 43fd0c4b87de..acc70239d0cb 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -1211,7 +1211,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
> > };
> >
> > #ifdef CONFIG_AUDIT
> > -#define TMPBUFLEN 11
> > +#define TMPBUFLEN 21
> > static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
> > size_t count, loff_t *ppos)
> > {
> > @@ -1295,6 +1295,24 @@ static ssize_t proc_sessionid_read(struct file * file, char __user * buf,
> > .llseek = generic_file_llseek,
> > };
> >
> > +static ssize_t proc_contid_read(struct file *file, char __user *buf,
> > + size_t count, loff_t *ppos)
> > +{
> > + struct inode *inode = file_inode(file);
> > + struct task_struct *task = get_proc_task(inode);
> > + ssize_t length;
> > + char tmpbuf[TMPBUFLEN];
> > +
> > + if (!task)
> > + return -ESRCH;
> > + /* if we don't have caps, reject */
> > + if (!capable(CAP_AUDIT_CONTROL))
> > + return -EPERM;
> > + length = scnprintf(tmpbuf, TMPBUFLEN, "%llu", audit_get_contid(task));
> > + put_task_struct(task);
> > + return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
> > +}
> > +
> > static ssize_t proc_contid_write(struct file *file, const char __user *buf,
> > size_t count, loff_t *ppos)
> > {
> > @@ -1325,6 +1343,7 @@ static ssize_t proc_contid_write(struct file *file, const char __user *buf,
> > }
> >
> > static const struct file_operations proc_contid_operations = {
> > + .read = proc_contid_read,
> > .write = proc_contid_write,
> > .llseek = generic_file_llseek,
> > };
> > @@ -3067,7 +3086,7 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
> > #ifdef CONFIG_AUDIT
> > REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
> > REG("sessionid", S_IRUGO, proc_sessionid_operations),
> > - REG("audit_containerid", S_IWUSR, proc_contid_operations),
> > + REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
> > #endif
> > #ifdef CONFIG_FAULT_INJECTION
> > REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> > @@ -3466,7 +3485,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
> > #ifdef CONFIG_AUDIT
> > REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
> > REG("sessionid", S_IRUGO, proc_sessionid_operations),
> > - REG("audit_containerid", S_IWUSR, proc_contid_operations),
> > + REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
> > #endif
> > #ifdef CONFIG_FAULT_INJECTION
> > REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
^ permalink raw reply
* [v4 PATCH 0/2] mm: mempolicy: fix mbind()'s inconsistent behavior for unmovable pages
From: Yang Shi @ 2019-07-19 17:21 UTC (permalink / raw)
To: vbabka, mhocko, mgorman, akpm; +Cc: yang.shi, linux-mm, linux-kernel, linux-api
Changelog
v4: * Fixed the comments from Vlastimil.
* Collected Vlastimil's Reviewed-by.
v3: * Adopted the suggestions from Vlastimil. Saved another 20 lines.
Using a flag in struct queue_pages does not look much better than
renumbering the retval, since we still have to return 1 to tell the caller
there are unmovable pages. So just renumber the retval.
* The manpage is not very clear about shared pages when MPOL_MF_MOVE is
specified; leave it as it is for now until it gets clarified.
v2: * Fixed the inconsistent behavior by not aborting !vma_migratable()
immediately by a separate patch (patch 1/2), and this is also the
preparation for patch 2/2. For the details please see the commit
log. Per Vlastimil.
* Do not abort immediately if an unmovable page is met. This should handle
non-LRU movable pages and temporarily off-LRU pages more gracefully.
Per Vlastimil and Michal Hocko.
Yang Shi (2):
mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified
mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind
mm/mempolicy.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 73 insertions(+), 27 deletions(-)
^ permalink raw reply
* [v4 PATCH 1/2] mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified
From: Yang Shi @ 2019-07-19 17:21 UTC (permalink / raw)
To: vbabka, mhocko, mgorman, akpm; +Cc: yang.shi, linux-mm, linux-kernel, linux-api
In-Reply-To: <1563556862-54056-1-git-send-email-yang.shi@linux.alibaba.com>
When both MPOL_MF_MOVE* and MPOL_MF_STRICT are specified, mbind() should
make a best effort to migrate misplaced pages and, if some of the pages
could not be migrated, return -EIO.
There are three different sub-cases:
1. vma is not migratable
2. vma is migratable, but there are unmovable pages
3. vma is migratable, pages are movable, but migrate_pages() fails
If #1 happens, the kernel aborts immediately and returns -EIO, since
commit a7f40cfe3b7ada57af9b62fd28430eeb4a7cfcb7 ("mm:
mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified").
If #3 happens, the kernel sets the policy and migrates pages on a
best-effort basis, but does not roll back the migrated pages or reset the
policy.
Before that commit, the two cases behaved the same way. It would be better
to keep their behavior consistent. But rolling back the migrated pages and
resetting the policy does not sound feasible, so just make #1 behave the
same as #3.
Userspace will know that not everything was successfully migrated (via
-EIO), and can take whatever steps it deems necessary - attempt rollback,
determine which exact page(s) are violating the policy, etc.
Make queue_pages_range() return 1 to indicate that there are unmovable
pages or that the vma is not migratable.
Case #2 is not handled correctly in the current kernel; the following
patch will fix it.
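To make the intended return-value convention concrete, here is a small
userspace model of how do_mbind() combines the result of
queue_pages_range() with the migration outcome after this patch. It is
only a sketch and not kernel code: mbind_result_model() and its
policy_set out-parameter are invented for illustration, and
mbind_range() is assumed to succeed.

```c
#include <errno.h>

#define MPOL_MF_STRICT	(1 << 0)
#define MPOL_MF_MOVE	(1 << 1)

/*
 * Userspace model (not kernel code) of the error handling in do_mbind().
 * queue_ret: result of queue_pages_range(): <0 error, 0 ok, 1 unmovable.
 * nr_failed: number of pages migrate_pages() could not move.
 * Returns 0 or -EIO; *policy_set reports whether the policy was applied.
 */
static int mbind_result_model(int queue_ret, int nr_failed,
			      unsigned long flags, int *policy_set)
{
	*policy_set = 0;

	/* Misplaced page with only MPOL_MF_STRICT: bail out early. */
	if (queue_ret < 0)
		return -EIO;

	/* From here on the policy is set and is not rolled back. */
	*policy_set = 1;

	if (queue_ret > 0 || (nr_failed && (flags & MPOL_MF_STRICT)))
		return -EIO;

	return 0;
}
```

In this model a non-migratable vma (queue_ret == 1) and a failed
migrate_pages() under MPOL_MF_STRICT end up the same way: the policy is
applied and the caller gets -EIO.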
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
mm/mempolicy.c | 68 +++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 48 insertions(+), 20 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f48693f..932c268 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -429,11 +429,14 @@ static inline bool queue_pages_required(struct page *page,
}
/*
- * queue_pages_pmd() has three possible return values:
- * 1 - pages are placed on the right node or queued successfully.
- * 0 - THP was split.
- * -EIO - is migration entry or MPOL_MF_STRICT was specified and an existing
- * page was already on a node that does not follow the policy.
+ * queue_pages_pmd() has four possible return values:
+ * 0 - pages are placed on the right node or queued successfully.
+ * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
+ * specified.
+ * 2 - THP was split.
+ * -EIO - is migration entry or only MPOL_MF_STRICT was specified and an
+ * existing page was already on a node that does not follow the
+ * policy.
*/
static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
unsigned long end, struct mm_walk *walk)
@@ -451,19 +454,17 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
if (is_huge_zero_page(page)) {
spin_unlock(ptl);
__split_huge_pmd(walk->vma, pmd, addr, false, NULL);
+ ret = 2;
goto out;
}
- if (!queue_pages_required(page, qp)) {
- ret = 1;
+ if (!queue_pages_required(page, qp))
goto unlock;
- }
- ret = 1;
flags = qp->flags;
/* go to thp migration */
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
if (!vma_migratable(walk->vma)) {
- ret = -EIO;
+ ret = 1;
goto unlock;
}
@@ -479,6 +480,13 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
/*
* Scan through pages checking if pages follow certain conditions,
* and move them to the pagelist if they do.
+ *
+ * queue_pages_pte_range() has three possible return values:
+ * 0 - pages are placed on the right node or queued successfully.
+ * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
+ * specified.
+ * -EIO - only MPOL_MF_STRICT was specified and an existing page was already
+ * on a node that does not follow the policy.
*/
static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
@@ -488,17 +496,17 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
int ret;
+ bool has_unmovable = false;
pte_t *pte;
spinlock_t *ptl;
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
- if (ret > 0)
- return 0;
- else if (ret < 0)
+ if (ret != 2)
return ret;
}
+ /* THP was split, fall through to pte walk */
if (pmd_trans_unstable(pmd))
return 0;
@@ -519,14 +527,21 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
if (!queue_pages_required(page, qp))
continue;
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
- if (!vma_migratable(vma))
+ /* MPOL_MF_STRICT must be specified if we get here */
+ if (!vma_migratable(vma)) {
+ has_unmovable = true;
break;
+ }
migrate_page_add(page, qp->pagelist, flags);
} else
break;
}
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
+
+ if (has_unmovable)
+ return 1;
+
return addr != end ? -EIO : 0;
}
@@ -639,7 +654,13 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end,
*
* If pages found in a given range are on a set of nodes (determined by
* @nodes and @flags,) it's isolated and queued to the pagelist which is
- * passed via @private.)
+ * passed via @private.
+ *
+ * queue_pages_range() has three possible return values:
+ * 1 - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
+ * specified.
+ * 0 - queue pages successfully or no misplaced page.
+ * -EIO - there is misplaced page and only MPOL_MF_STRICT was specified.
*/
static int
queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
@@ -1182,6 +1203,7 @@ static long do_mbind(unsigned long start, unsigned long len,
struct mempolicy *new;
unsigned long end;
int err;
+ int ret;
LIST_HEAD(pagelist);
if (flags & ~(unsigned long)MPOL_MF_VALID)
@@ -1243,10 +1265,15 @@ static long do_mbind(unsigned long start, unsigned long len,
if (err)
goto mpol_out;
- err = queue_pages_range(mm, start, end, nmask,
+ ret = queue_pages_range(mm, start, end, nmask,
flags | MPOL_MF_INVERT, &pagelist);
- if (!err)
- err = mbind_range(mm, start, end, new);
+
+ if (ret < 0) {
+ err = -EIO;
+ goto up_out;
+ }
+
+ err = mbind_range(mm, start, end, new);
if (!err) {
int nr_failed = 0;
@@ -1259,13 +1286,14 @@ static long do_mbind(unsigned long start, unsigned long len,
putback_movable_pages(&pagelist);
}
- if (nr_failed && (flags & MPOL_MF_STRICT))
+ if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
err = -EIO;
} else
putback_movable_pages(&pagelist);
+up_out:
up_write(&mm->mmap_sem);
- mpol_out:
+mpol_out:
mpol_put(new);
return err;
}
--
1.8.3.1
^ permalink raw reply related
* [v4 PATCH 2/2] mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind
From: Yang Shi @ 2019-07-19 17:21 UTC (permalink / raw)
To: vbabka, mhocko, mgorman, akpm; +Cc: yang.shi, linux-mm, linux-kernel, linux-api
In-Reply-To: <1563556862-54056-1-git-send-email-yang.shi@linux.alibaba.com>
When running syzkaller internally, we ran into the below bug on 4.9.x
kernel:
kernel BUG at mm/huge_memory.c:2124!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
task: ffff880067b34900 task.stack: ffff880068998000
RIP: 0010:[<ffffffff81895d6b>] [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
RSP: 0018:ffff88006899f980 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffffea00018f1700 RCX: 0000000000000000
RDX: 1ffffd400031e2e7 RSI: 0000000000000001 RDI: ffffea00018f1738
RBP: ffff88006899f9e8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: fffffbfff0d8b13e R12: ffffea00018f1400
R13: ffffea00018f1400 R14: ffffea00018f1720 R15: ffffea00018f1401
FS: 00007fa333996740(0000) GS:ffff88006c600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000040 CR3: 0000000066b9c000 CR4: 00000000000606f0
Stack:
0000000000000246 ffff880067b34900 0000000000000000 ffff88007ffdc000
0000000000000000 ffff88006899f9e8 ffffffff812b4015 ffff880064c64e18
ffffea00018f1401 dffffc0000000000 ffffea00018f1700 0000000020ffd000
Call Trace:
[<ffffffff818490f1>] split_huge_page include/linux/huge_mm.h:100 [inline]
[<ffffffff818490f1>] queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538
[<ffffffff817ed0da>] walk_pmd_range mm/pagewalk.c:50 [inline]
[<ffffffff817ed0da>] walk_pud_range mm/pagewalk.c:90 [inline]
[<ffffffff817ed0da>] walk_pgd_range mm/pagewalk.c:116 [inline]
[<ffffffff817ed0da>] __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208
[<ffffffff817edb94>] walk_page_range+0x154/0x370 mm/pagewalk.c:285
[<ffffffff81844515>] queue_pages_range+0x115/0x150 mm/mempolicy.c:694
[<ffffffff8184f493>] do_mbind mm/mempolicy.c:1241 [inline]
[<ffffffff8184f493>] SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370
[<ffffffff81850146>] SyS_mbind+0x46/0x60 mm/mempolicy.c:1352
[<ffffffff810097e2>] do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282
[<ffffffff82ff6f93>] entry_SYSCALL_64_after_swapgs+0x5d/0xdb
Code: c7 80 1c 02 00 e8 26 0a 76 01 <0f> 0b 48 c7 c7 40 46 45 84 e8 4c
RIP [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
RSP <ffff88006899f980>
with the below test:
---8<---
uint64_t r[1] = {0xffffffffffffffff};
int main(void)
{
syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
intptr_t res = 0;
res = syscall(__NR_socket, 0x11, 3, 0x300);
if (res != -1)
r[0] = res;
*(uint32_t*)0x20000040 = 0x10000;
*(uint32_t*)0x20000044 = 1;
*(uint32_t*)0x20000048 = 0xc520;
*(uint32_t*)0x2000004c = 1;
syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10);
syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0);
*(uint64_t*)0x20000340 = 2;
syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340,
0x45d4, 3);
return 0;
}
---8<---
Actually the test does:
mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
socket(AF_PACKET, SOCK_RAW, 768) = 3
setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0
mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_DENYWRITE, 3, 0) = 0x20fed000
mbind(..., MPOL_MF_STRICT|MPOL_MF_MOVE) = 0
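The decoding above can be double-checked mechanically. The sketch below
ORs the symbolic flags back together and compares them with the raw
numbers from the reproducer. The R_* constants are copied by hand from
the x86 uapi headers, which is an assumption of this sketch (alpha, mips
and some other architectures number MAP_* differently). The reading of
the mbind() mode 0x4002 as MPOL_BIND | MPOL_F_RELATIVE_NODES is my own
decoding and is not spelled out in the trace.

```c
#include <stdbool.h>

/* Values as on x86 (assumed); the R_ prefix marks them as local copies. */
enum {
	R_PROT_READ		= 0x1,
	R_PROT_WRITE		= 0x2,
	R_MAP_SHARED		= 0x01,
	R_MAP_PRIVATE		= 0x02,
	R_MAP_FIXED		= 0x10,
	R_MAP_ANONYMOUS		= 0x20,
	R_MAP_DENYWRITE		= 0x0800,
	R_MAP_POPULATE		= 0x8000,
	R_MPOL_BIND		= 2,
	R_MPOL_F_RELATIVE_NODES	= 1 << 14,
	R_MPOL_MF_STRICT	= 1 << 0,
	R_MPOL_MF_MOVE		= 1 << 1,
};

/* Returns true if the raw numbers in the reproducer decode as claimed. */
static bool repro_flags_decode_ok(void)
{
	return 3 == (R_PROT_READ | R_PROT_WRITE) &&
	       0x32 == (R_MAP_PRIVATE | R_MAP_FIXED | R_MAP_ANONYMOUS) &&
	       0x8811 == (R_MAP_SHARED | R_MAP_FIXED | R_MAP_POPULATE |
			  R_MAP_DENYWRITE) &&
	       0x4002 == (R_MPOL_BIND | R_MPOL_F_RELATIVE_NODES) &&
	       3 == (R_MPOL_MF_STRICT | R_MPOL_MF_MOVE) &&
	       0x1000000 == 16777216 &&
	       0x10000 == 65536;
}
```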
The setsockopt() would allocate compound pages (16 pages in this test)
for packet tx ring, then the mmap() would call packet_mmap() to map the
pages into the user address space specified by the mmap() call.
When mbind() is called, it scans the vma to queue the pages for
migration to the new node. It splits any huge page since 4.9
doesn't support THP migration; however, the packet tx ring compound
pages are not THP and are not even movable, so the above bug is triggered.
Later kernels are not hit by this issue due to
commit d44d363f65780f2ac2 ("mm: don't assume anonymous pages have
SwapBacked flag"), which happens to remove the PageSwapBacked check for a
different reason.
But there is a deeper issue. According to the semantics of mbind(), it
should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and
MPOL_MF_STRICT was also specified, but the kernel was unable to move
all existing pages in the range. The tx ring of the packet socket is
definitely not movable; however, mbind() returns success in this case.
Although most socket files are associated with non-movable pages, XDP
may have movable pages from gup. So it does not seem right to just check
the underlying file type of the vma in vma_migratable().
Change migrate_page_add() to check whether the page is movable; if it
is unmovable, just return -EIO. But do not abort the pte walk immediately,
since there may be pages temporarily off the LRU. We should still migrate
the other pages if MPOL_MF_MOVE* is specified. Set the has_unmovable flag
if some pages could not be moved, then return -EIO from mbind() eventually.
With this change the above test returns -EIO as expected.
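The "note it, but keep walking" behavior described above can be modelled
in userspace as follows. This is a simplified sketch, not the kernel
code: struct model_page and both model_* functions are invented for
illustration, and the page_mapcount()/MPOL_MF_MOVE_ALL condition in the
real migrate_page_add() is left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MPOL_MF_STRICT	(1 << 0)
#define MPOL_MF_MOVE	(1 << 1)

/* Hypothetical stand-in for struct page: only tracks isolatability. */
struct model_page {
	bool can_isolate;	/* false: unmovable or temporarily off LRU */
	bool queued;		/* set once the page is on the pagelist */
};

/* Model of migrate_page_add(): -EIO only if STRICT and not isolatable. */
static int model_migrate_page_add(struct model_page *page,
				  unsigned long flags)
{
	if (page->can_isolate) {
		page->queued = true;
		return 0;
	}
	return (flags & MPOL_MF_STRICT) ? -EIO : 0;
}

/* Model of the pte loop: returns 1 if any page could not be queued. */
static int model_queue_pages(struct model_page *pages, size_t n,
			     unsigned long flags)
{
	bool has_unmovable = false;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Record the failure but keep queueing the other pages. */
		if (model_migrate_page_add(&pages[i], flags))
			has_unmovable = true;
	}
	return has_unmovable ? 1 : 0;
}
```

With one unmovable page in the middle of a range, the pages before and
after it are still queued, and the walk reports 1 so that mbind() can
return -EIO at the end.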
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
mm/mempolicy.c | 32 +++++++++++++++++++++++++-------
1 file changed, 25 insertions(+), 7 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 932c268..547cd40 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -403,7 +403,7 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
},
};
-static void migrate_page_add(struct page *page, struct list_head *pagelist,
+static int migrate_page_add(struct page *page, struct list_head *pagelist,
unsigned long flags);
struct queue_pages {
@@ -463,12 +463,11 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
flags = qp->flags;
/* go to thp migration */
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
- if (!vma_migratable(walk->vma)) {
+ if (!vma_migratable(walk->vma) ||
+ migrate_page_add(page, qp->pagelist, flags)) {
ret = 1;
goto unlock;
}
-
- migrate_page_add(page, qp->pagelist, flags);
} else
ret = -EIO;
unlock:
@@ -532,7 +531,14 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
has_unmovable = true;
break;
}
- migrate_page_add(page, qp->pagelist, flags);
+
+ /*
+ * Do not abort immediately since there may be
+ * temporary off LRU pages in the range. Still
+ * need migrate other LRU pages.
+ */
+ if (migrate_page_add(page, qp->pagelist, flags))
+ has_unmovable = true;
} else
break;
}
@@ -961,7 +967,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
/*
* page migration, thp tail pages can be passed.
*/
-static void migrate_page_add(struct page *page, struct list_head *pagelist,
+static int migrate_page_add(struct page *page, struct list_head *pagelist,
unsigned long flags)
{
struct page *head = compound_head(page);
@@ -974,8 +980,19 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
mod_node_page_state(page_pgdat(head),
NR_ISOLATED_ANON + page_is_file_cache(head),
hpage_nr_pages(head));
+ } else if (flags & MPOL_MF_STRICT) {
+ /*
+ * Non-movable page may reach here. And, there may be
+ * temporary off LRU pages or non-LRU movable pages.
+ * Treat them as unmovable pages since they can't be
+ * isolated, so they can't be moved at the moment. It
+ * should return -EIO for this case too.
+ */
+ return -EIO;
}
}
+
+ return 0;
}
/* page allocation callback for NUMA node migration */
@@ -1178,9 +1195,10 @@ static struct page *new_page(struct page *page, unsigned long start)
}
#else
-static void migrate_page_add(struct page *page, struct list_head *pagelist,
+static int migrate_page_add(struct page *page, struct list_head *pagelist,
unsigned long flags)
{
+ return -EIO;
}
int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH v10 8/9] kselftest: save-and-restore errno to allow for %m formatting
From: shuah @ 2019-07-19 22:25 UTC (permalink / raw)
To: Aleksa Sarai, Al Viro, Jeff Layton, J. Bruce Fields,
Arnd Bergmann, David Howells, Shuah Khan
Cc: Eric Biederman, Andy Lutomirski, Andrew Morton,
Alexei Starovoitov, Kees Cook, Jann Horn, Christian Brauner,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linus Torvalds, containers, linux-alpha, linux-api,
linux-arch, linux-arm-kernel, linux-fsdevel, linux-ia64,
linux-kernel, linux-kselftest
In-Reply-To: <20190719164225.27083-9-cyphar@cyphar.com>
On 7/19/19 10:42 AM, Aleksa Sarai wrote:
> Previously, using "%m" in a ksft_* format string can result in strange
> output because the errno value wasn't saved before calling other libc
> functions. The solution is to simply save and restore the errno before
> we format the user-supplied format string.
>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
> tools/testing/selftests/kselftest.h | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/tools/testing/selftests/kselftest.h b/tools/testing/selftests/kselftest.h
> index ec15c4f6af55..0ac49d91a260 100644
> --- a/tools/testing/selftests/kselftest.h
> +++ b/tools/testing/selftests/kselftest.h
> @@ -10,6 +10,7 @@
> #ifndef __KSELFTEST_H
> #define __KSELFTEST_H
>
> +#include <errno.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <stdarg.h>
> @@ -81,58 +82,68 @@ static inline void ksft_print_cnts(void)
>
> static inline void ksft_print_msg(const char *msg, ...)
> {
> + int saved_errno = errno;
> va_list args;
>
> va_start(args, msg);
> printf("# ");
> + errno = saved_errno;
> vprintf(msg, args);
> va_end(args);
> }
>
> static inline void ksft_test_result_pass(const char *msg, ...)
> {
> + int saved_errno = errno;
> va_list args;
>
> ksft_cnt.ksft_pass++;
>
> va_start(args, msg);
> printf("ok %d ", ksft_test_num());
> + errno = saved_errno;
> vprintf(msg, args);
> va_end(args);
> }
>
> static inline void ksft_test_result_fail(const char *msg, ...)
> {
> + int saved_errno = errno;
> va_list args;
>
> ksft_cnt.ksft_fail++;
>
> va_start(args, msg);
> printf("not ok %d ", ksft_test_num());
> + errno = saved_errno;
> vprintf(msg, args);
> va_end(args);
> }
>
> static inline void ksft_test_result_skip(const char *msg, ...)
> {
> + int saved_errno = errno;
> va_list args;
>
> ksft_cnt.ksft_xskip++;
>
> va_start(args, msg);
> printf("not ok %d # SKIP ", ksft_test_num());
> + errno = saved_errno;
> vprintf(msg, args);
> va_end(args);
> }
>
> static inline void ksft_test_result_error(const char *msg, ...)
> {
> + int saved_errno = errno;
> va_list args;
>
> ksft_cnt.ksft_error++;
>
> va_start(args, msg);
> printf("not ok %d # error ", ksft_test_num());
> + errno = saved_errno;
> vprintf(msg, args);
> va_end(args);
> }
> @@ -152,10 +163,12 @@ static inline int ksft_exit_fail(void)
>
> static inline int ksft_exit_fail_msg(const char *msg, ...)
> {
> + int saved_errno = errno;
> va_list args;
>
> va_start(args, msg);
> printf("Bail out! ");
> + errno = saved_errno;
> vprintf(msg, args);
> va_end(args);
>
> @@ -178,10 +191,12 @@ static inline int ksft_exit_xpass(void)
> static inline int ksft_exit_skip(const char *msg, ...)
> {
> if (msg) {
> + int saved_errno = errno;
> va_list args;
>
> va_start(args, msg);
> printf("not ok %d # SKIP ", 1 + ksft_test_num());
> + errno = saved_errno;
> vprintf(msg, args);
> va_end(args);
> } else {
>
Hi Aleksa,
Can you send this patch separately from the patch series? I will apply
this as a bug fix for 5.3-rc2 or rc3.
This isn't part of this series anyway and I would like to get this in
right away.
thanks,
-- Shuah
^ permalink raw reply
* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Burn Alting @ 2019-07-19 22:41 UTC (permalink / raw)
To: Eric W. Biederman, Paul Moore
Cc: Tycho Andersen, nhorman, Richard, Briggs, linux-api, containers,
LKML, dhowells, Linux-Audit Mailing List, netfilter-devel, simo,
netdev, linux-fsdevel, Eric Paris, Serge E. Hallyn
In-Reply-To: <87muhadnfr.fsf@xmission.com>
On Fri, 2019-07-19 at 11:00 -0500, Eric W. Biederman wrote:
> Paul Moore <paul@paul-moore.com> writes:
>
>
> On Wed, Jul 17, 2019 at 8:52 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>
> On 2019-07-16 19:30, Paul Moore wrote:
>
> ...
>
>
>
> We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit container
> ID policy, we can not trust ns_capable(CAP_AUDIT_CONTROL).
>
> Ok. So does a process in a non-init user namespace have two (or more)
> sets of capabilities stored in creds, one in the init_user_ns, and one
> in current_user_ns? Or does it get stripped of all its capabilities in
> init_user_ns once it has its own set in current_user_ns? If the former,
> then we can use capable(). If the latter, we need another mechanism, as
> you have suggested might be needed.
>
> Unfortunately I think the problem is that ultimately we need to allow
> any container orchestrator that has been given privileges to manage
> the audit container ID to also grant that privilege to any of the
> child process/containers it manages. I don't believe we can do that
> with capabilities based on the code I've looked at, and the
> discussions I've had, but if you find a way I would leave to hear it.
>
>
>
> If some random unprivileged user wants to fire up a container
> orchestrator/engine in his own user namespace, then audit needs to be
> namespaced. Can we safely discard this scenario for now?
>
> I think the only time we want to allow a container orchestrator to
> manage the audit container ID is if it has been granted that privilege
> by someone who has that privilege already. In the zero-container, or
> single-level of containers, case this is relatively easy, and we can
> accomplish it using CAP_AUDIT_CONTROL as the privilege. If we start
> nesting container orchestrators it becomes more complicated as we need
> to be able to support granting and inheriting this privilege in a
> manner; this is why I suggested a new mechanism *may* be necessary.
>
>
> Let me segway a bit and see if I can get this conversation out of the
> rut it seems to have drifted into.
>
> Unprivileged containers and nested containers exist today and are going
> to become increasingly common. Let that be a given.
>
> As I recall the interesting thing for audit to log is actions by
> privileged processes. Audit can log more but generally configuring
> logging by of the actions of unprivileged users is effectively a self
> DOS.
>
> So I think the initial implementation can safely ignore actions of
> nested containers and unprivileged containers because you don't care
> about their actions.
>
> If we start allow running audit in a container then we need to deal with
> all of the nesting issues but until then I don't think you folks care.
>
> Or am I wrong. Do the requirements for securely auditing things from
> the kernel care about the actions of unprivileged users?
Depending on the sensitivity of the information the host or system
manages, yes, the actions of unprivileged users are important to security
auditing. Kernel auditing is sometimes the only opportunity an incident
responder has to identify a user's (privileged or not) interaction with
the data the host manages.
>
> Eric
>
> --
> Linux-audit mailing list
> Linux-audit@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-audit
>
^ permalink raw reply
* Re: [PATCH v10 8/9] kselftest: save-and-restore errno to allow for %m formatting
From: Aleksa Sarai @ 2019-07-20 0:09 UTC (permalink / raw)
To: shuah
Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Eric Biederman, Andy Lutomirski,
Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
Christian Brauner, Tycho Andersen, David Drysdale, Chanho Min,
Oleg Nesterov, Aleksa Sarai, Linus Torvalds, containers,
linux-alpha
In-Reply-To: <b32d95a1-8a49-65ef-4ddd-fe86a7ca01d5@kernel.org>
On 2019-07-19, shuah <shuah@kernel.org> wrote:
> On 7/19/19 10:42 AM, Aleksa Sarai wrote:
> > Previously, using "%m" in a ksft_* format string can result in strange
> > output because the errno value wasn't saved before calling other libc
> > functions. The solution is to simply save and restore the errno before
> > we format the user-supplied format string.
> >
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> [...]
> Hi Aleksa,
>
> Can you send this patch separate from the patch series. I will apply
> this as bug fix to 5.3-rc2 or rc3.
>
> This isn't part of this series anyway and I would like to get this in
> right away.
Done, and I'll drop it in v11 after the rest gets reviewed.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
^ permalink raw reply
* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: James Bottomley @ 2019-07-20 2:19 UTC (permalink / raw)
To: Eric W. Biederman, Paul Moore
Cc: nhorman, linux-api, containers, LKML, omosnace, dhowells,
Linux-Audit Mailing List, netfilter-devel, simo, netdev,
linux-fsdevel, Eric Paris, sgrubb
In-Reply-To: <87muhadnfr.fsf@xmission.com>
On Fri, 2019-07-19 at 11:00 -0500, Eric W. Biederman wrote:
> Paul Moore <paul@paul-moore.com> writes:
>
> > On Wed, Jul 17, 2019 at 8:52 PM Richard Guy Briggs <rgb@redhat.com>
> > wrote:
> > > On 2019-07-16 19:30, Paul Moore wrote:
> >
> > ...
> >
> > > > We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit
> > > > container ID policy, we can not trust
> > > > ns_capable(CAP_AUDIT_CONTROL).
> > >
> > > Ok. So does a process in a non-init user namespace have two (or
> > > more) sets of capabilities stored in creds, one in the
> > > init_user_ns, and one in current_user_ns? Or does it get
> > > stripped of all its capabilities in init_user_ns once it has its
> > > own set in current_user_ns? If the former, then we can use
> > > capable(). If the latter, we need another mechanism, as
> > > you have suggested might be needed.
> >
> > Unfortunately I think the problem is that ultimately we need to
> > allow any container orchestrator that has been given privileges to
> > manage the audit container ID to also grant that privilege to any
> > of the child process/containers it manages. I don't believe we can
> > do that with capabilities based on the code I've looked at, and the
> > discussions I've had, but if you find a way I would leave to hear
> > it.
> > > If some random unprivileged user wants to fire up a container
> > > orchestrator/engine in his own user namespace, then audit needs
> > > to be namespaced. Can we safely discard this scenario for now?
> >
> > I think the only time we want to allow a container orchestrator to
> > manage the audit container ID is if it has been granted that
> > privilege by someone who has that privilege already. In the zero-
> > container, or single-level of containers, case this is relatively
> > easy, and we can accomplish it using CAP_AUDIT_CONTROL as the
> > privilege. If we start nesting container orchestrators it becomes
> > more complicated as we need to be able to support granting and
> > inheriting this privilege in a manner; this is why I suggested a
> > new mechanism *may* be necessary.
>
>
> Let me segway a bit and see if I can get this conversation out of the
> rut it seems to have drifted into.
>
> Unprivileged containers and nested containers exist today and are
> going to become increasingly common. Let that be a given.
Agree fully.
> As I recall the interesting thing for audit to log is actions by
> privileged processes. Audit can log more but generally configuring
> logging of the actions of unprivileged users is effectively a self
> DOS.
>
> So I think the initial implementation can safely ignore actions of
> nested containers and unprivileged containers because you don't care
> about their actions.
I don't entirely agree here: remember there might be two consumers for
the audit data: the physical system owner (checking up on the tenants)
and the tenant themselves who might be watching either their sub
tenants or their users (and who, obviously, won't get the full audit
stream). In either case, the tenant may or may not be privileged, and
if they're privileged, it might be through the user_ns in which case
the physical system owner and the kernel would see them as "not
privileged". So I think we are ultimately going to need the ability to
audit unprivileged containers.
I also think audit has a role to play in intrusion detection and
forensic analysis for fully unprivileged containers running external
services, but I don't think we have to solve that case immediately.
> If we start to allow running audit in a container then we need to deal
> with all of the nesting issues but until then I don't think you folks
> care.
>
> Or am I wrong? Do the requirements for securely auditing things from
> the kernel care about the actions of unprivileged users?
I think ultimately we have to care, but it could be three phases: first
would be genuinely privileged containers (i.e. with real root inside,
being our most dangerous problem) the second would be user_ns
privileged containers (i.e. with both user_ns and an interior root
mapping) and the third would be unprivileged containers (with or
without user_ns but no interior root).
James
^ permalink raw reply
* [PATCH bpf-next v10 00/10] Landlock LSM: Toward unprivileged sandboxing
From: Mickaël Salaün @ 2019-07-21 21:31 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Alexander Viro, Alexei Starovoitov,
Andrew Morton, Andy Lutomirski, Arnaldo Carvalho de Melo,
Casey Schaufler, Daniel Borkmann, David Drysdale,
David S . Miller, Eric W . Biederman, James Morris, Jann Horn,
John Johansen, Jonathan Corbet, Kees Cook, Michael Kerrisk,
Mickaël Salaün
Hi,
This tenth series mainly replaces the previous [1] inode map
implementation with a hash map, which ensures uniqueness of keys, improves
performance, and switches to an arbitrary value size. The inode and map
lifetimes are now handled by LSM hooks. The previous subtype is replaced
with the already existing expected attach type and a new expected attach
triggers field.
Landlock is a stackable LSM [4] intended to be used as a low-level
framework to build custom access-control systems or safe endpoint
security agents. There are two types of Landlock hooks: FS_WALK and
FS_PICK. Each of them accepts a dedicated eBPF program, called a
Landlock program. The set of actions on a file is well defined (e.g.
read, write, ioctl, append, lock, mount...) taking inspiration from the
major Linux LSMs and some other access-controls like Capsicum.
The example patch shows how a file system access control can be built
based on a list of denied files and directories. From a security point
of view, it may be preferable to use a whitelist instead of a blacklist,
but this series only enables matching a specific list of files. Bringing
back a way to evaluate a path is planned for a future dedicated series,
once this base Landlock framework is merged. I may take inspiration
from the LOOKUP_BENEATH approach [5], but from an eBPF point of view.
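As a rough illustration of the deny-list idea, here is a small userspace
model in plain C. It is not the eBPF example from this series: the struct,
the function names and the inode numbers are hypothetical, and it assumes
the check is applied to every inode met while resolving a path, so that
denying a directory also denies what lies beneath it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical deny list of inode identifiers (not the eBPF inode map). */
struct model_denylist {
	const uint64_t *inodes;
	size_t count;
};

/* Return true if `ino` is tagged as denied. */
bool model_is_denied(const struct model_denylist *dl, uint64_t ino)
{
	assert(dl != NULL);
	for (size_t i = 0; i < dl->count; i++) {
		if (dl->inodes[i] == ino)
			return true;
	}
	return false;
}

/*
 * `walk` lists every inode met while resolving a path, ending with the
 * target itself. Denying a directory therefore denies what is beneath it.
 */
bool model_walk_allowed(const struct model_denylist *dl,
			const uint64_t *walk, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (model_is_denied(dl, walk[i]))
			return false;
	}
	return true;
}
```

A walk that never touches a denied inode is allowed; a walk that crosses a
denied directory, or ends on a denied file, is refused.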
The documentation patch contains some kernel documentation and
explanations on how to use Landlock. The compiled documentation and
some talks can be found here: https://landlock.io
This patch series can be found in a Git repository here:
https://github.com/landlock-lsm/linux/commits/landlock-v10
This is the first step of the roadmap discussed at LPC [2]. While the
intended final goal is to allow unprivileged users to use Landlock, this
series allows only a process with global CAP_SYS_ADMIN to load and
enforce a rule. This may help to get feedback and avoid unexpected
behaviors.
This series can be applied on top of bpf-next, commit 88091ff56b71
("selftests, bpf: Add test for veth native XDP"). This can be tested
with CONFIG_SECCOMP_FILTER and CONFIG_SECURITY_LANDLOCK. I would really
appreciate constructive comments on the design and the code.
# Landlock LSM
The goal of this new Linux Security Module (LSM) called Landlock is to
allow any process, including unprivileged ones, to create powerful
security sandboxes comparable to XNU Sandbox or OpenBSD Pledge (which
could be implemented with Landlock). This kind of sandbox is expected
to help mitigate the security impact of bugs or unexpected/malicious
behaviors in user-space applications.
The approach taken is to add the minimum amount of code while still
allowing the user-space application to create quite complex access
rules. A dedicated security policy language such as the one used by
SELinux, AppArmor and other major LSMs involves a lot of code and is
usually permitted to only a trusted user (i.e. root). On the contrary,
eBPF programs already exist and are designed to be safely loaded by
unprivileged user-space.
This design does not seem too intrusive but is flexible enough to allow
a powerful sandbox mechanism accessible by any process on Linux. The use
of seccomp and Landlock is more suitable with the help of a user-space
library (e.g. libseccomp) that could help to specify a high-level
language to express a security policy instead of raw eBPF programs.
Moreover, thanks to the LLVM front-end, it is quite easy to write an
eBPF program with a subset of the C language.
# Frequently asked questions
## Why is seccomp-bpf not enough?
A seccomp filter can access only raw syscall arguments (i.e. the
register values) which means that it is not possible to filter according
to the value pointed to by an argument, such as a file pathname. As an
embryonic Landlock version demonstrated, filtering at the syscall level
is complicated (e.g. need to take care of race conditions). This is
mainly because the access control checkpoints of the kernel are not at
this high-level but more underneath, at the LSM-hook level. The LSM
hooks are designed to handle this kind of checks. Landlock abstracts
this approach to leverage the ability of unprivileged users to limit
themselves.
Cf. section "What it isn't?" in Documentation/prctl/seccomp_filter.txt
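The limitation can be shown with a small userspace model. This is only a
sketch: struct model_seccomp_data is a simplified stand-in for the data a
seccomp filter receives, and the helper names and the syscall number are
made up. The point is that a filter can compare a pointer argument with a
constant, but it never sees the string behind the pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified model of what a seccomp filter is given: register values. */
struct model_seccomp_data {
	int nr;
	uint64_t args[6];
};

/* A filter can compare args[] with constants but cannot dereference them. */
bool model_filter_allows(const struct model_seccomp_data *sd,
			 uint64_t allowed_ptr)
{
	assert(sd != NULL);
	return sd->args[0] == allowed_ptr;
}

/* Build the data for a hypothetical open(path) call. */
struct model_seccomp_data model_open_call(const char *path)
{
	struct model_seccomp_data sd;

	memset(&sd, 0, sizeof(sd));
	sd.nr = 2; /* arbitrary syscall number, for illustration only */
	sd.args[0] = (uint64_t)(uintptr_t)path;
	return sd;
}
```

Two buffers holding the same pathname at different addresses are treated
differently by such a filter, which is why pathname-based policies need to
be enforced deeper in the kernel, where the resolved object is known.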
## Why use the seccomp(2) syscall?
Landlock uses the same semantics as seccomp to apply access rule
restrictions. It adds a new layer of security for the current process
which is inherited by its children. It makes sense to use a unique
access-restricting syscall (that should be allowed by seccomp filters)
which can only drop privileges. Moreover, a Landlock rule could come
from outside a process (e.g. passed through a UNIX socket). It is then
useful to differentiate the creation/load of Landlock eBPF programs via
bpf(2), from rule enforcement via seccomp(2).
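The "can only drop privileges" property can be modelled as a chain of
layers where an access must be accepted by every layer. The following is a
simplified userspace sketch, not the enforcement code of this series; all
type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical rule: returns true to allow the access identified by `id`. */
typedef bool (*model_rule_fn)(int id);

/* One layer of restrictions; `prev` points to the inherited layers. */
struct model_layer {
	model_rule_fn rule;
	const struct model_layer *prev;
};

/* An access is granted only if every layer in the chain allows it. */
bool model_access_allowed(const struct model_layer *top, int id)
{
	for (const struct model_layer *l = top; l; l = l->prev) {
		assert(l->rule != NULL);
		if (!l->rule(id))
			return false;
	}
	return true;
}

/* Example rules, for illustration only. */
bool deny_id_1(int id) { return id != 1; }
bool deny_id_2(int id) { return id != 2; }
bool allow_all(int id) { (void)id; return true; }
```

A child that stacks deny_id_2 on top of a parent using deny_id_1 loses
access to both identifiers, and stacking allow_all on top of the parent
does not give anything back: a new layer can only narrow what is allowed.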
## Why a new LSM? Are SELinux, AppArmor, Smack and Tomoyo not good
enough?
The current access control LSMs are fine for their purpose which is to
give the *root* the ability to enforce a security policy for the
*system*. What is missing is a way to enforce a security policy for any
application by its developer and *unprivileged user* as seccomp can do
for raw syscall filtering.
Differences from other (access control) LSMs:
* not only dedicated to administrators (i.e. no_new_priv);
* limited kernel attack surface (e.g. policy parsing);
* constrained policy rules (no DoS: deterministic execution time);
* do not leak more information than the loader process can legitimately
have access to (minimize metadata inference).
# Changes since v9
* replace subtype with expected_attach_type and a new expected_attach_triggers
and update libbpf accordingly
* handle inode and map lifetime with LSM hooks
* use a hash map for the inode map: integrate inodemap.c into hashtab.c
* allow arbitrary value size instead of 64-bits
# Changes since v8
* fit with the new LSM stacking framework (security blobs were tested
but are not used in this series because of the code reduction)
* remove the Landlock program chaining and the file path evaluation
feature to get a minimum viable product and ease the review
* replace the example with a simple blacklist policy
* rebase on bpf-next
# Changes since v7
* major revamp of the file system enforcement:
* new eBPF map dedicated to tie an inode with an arbitrary 64-bits
value, which can be used to tag files
* three new Landlock hooks: FS_WALK, FS_PICK and FS_GET
* add the ability to chain Landlock programs
* add a new eBPF map type to compare inodes
* don't use macros anymore
* replace subtype fields:
* triggers: fine-grained bitfield of actions on which a Landlock
program may be called (if it comes from a sandbox process)
* previous: a parent chained program
* upstreamed patches:
* commit 369130b63178 ("selftests: Enhance kselftest_harness.h to
print which assert failed")
# Changes since v6
* upstreamed patches:
* commit 752ba56fb130 ("bpf: Extend check_uarg_tail_zero() checks")
* commit 0b40808a1084 ("selftests: Make test_harness.h more generally
available") and related ones
* commit 3bb857e47e49 ("LSM: Enable multiple calls to
security_add_hooks() for the same LSM")
* simplify the landlock_context (remove syscall_* fields) and add three
FS sub-events: IOCTL, LOCK, FCNTL
* minimize the number of callable BPF functions from a Landlock rule
* do not split put_seccomp_filter() with put_seccomp()
* rename Landlock version to Landlock ABI
* miscellaneous fixes
* rebase on net-next
# Changes since v5
* eBPF program subtype:
* use a prog_subtype pointer instead of inlining it into bpf_attr
* enable a future-proof behavior (reject unhandled data/size)
* add tests
* use a simple rule hierarchy (similar to seccomp-bpf)
* add a ptrace scope protection
* add more tests
* add more documentation
* rename some files
* miscellaneous fixes
* rebase on net-next
# Changes since v4
* upstreamed patches:
* commit d498f8719a09 ("bpf: Rebuild bpf.o for any dependency update")
* commit a734fb5d6006 ("samples/bpf: Reset global variables") and
related ones
* commit f4874d01beba ("bpf: Use bpf_create_map() from the library")
and related ones
* commit d02d8986a768 ("bpf: Always test unprivileged programs")
* commit 640eb7e7b524 ("fs: Constify path_is_under()'s arguments")
* commit 535e7b4b5ef2 ("bpf: Use u64_to_user_ptr()")
* revamp Landlock to not expose an LSM hook interface but wrap and
abstract them with Landlock events (currently one for all filesystem
related operations: LANDLOCK_SUBTYPE_EVENT_FS)
* wrap all filesystem kernel objects through the same FS handle (struct
landlock_handle_fs): struct file, struct inode, struct path and struct
dentry
* a rule doesn't return an errno code but only a boolean to allow or deny
an access request
* handle all filesystem related LSM hooks
* add some tests and a sample:
* BPF context tests
* Landlock sandboxing tests and sample
* write Landlock rules in C and compile them with LLVM
* change field names of eBPF program subtype
* remove arraymap of handles for now (will be replaced with a revamped
map)
* remove cgroup handling for now
* add user and kernel documentation
* rebase on net-next
# Changes since v3
* upstreamed patch:
* commit 1955351da41c ("bpf: Set register type according to
is_valid_access()")
* use abstract LSM hook arguments with custom types (e.g.
*_LANDLOCK_ARG_FS for struct file, struct inode and struct path)
* add more LSM hooks to support full filesystem access control
* improve the sandbox example
* fix races and RCU issues:
* eBPF program execution and eBPF helpers
* revamp the arraymap of handles to cleanly deal with update/delete
* eBPF program subtype for Landlock:
* remove the "origin" field
* add an "option" field
* rebase onto Daniel Mack's patches v7 [3]
* remove merged commit 1955351da41c ("bpf: Set register type according
to is_valid_access()")
* fix spelling mistakes
* cleanup some type and variable names
* split patches
* for now, remove cgroup delegation handling for unprivileged user
* remove extra access check for cgroup_get_from_fd()
* remove unused example code dealing with skb
* remove seccomp-bpf link:
* no more seccomp cookie
* for now, it is no longer possible to check the current syscall
properties
# Changes since v2
* revamp cgroup handling:
* use Daniel Mack's patches "Add eBPF hooks for cgroups" v5
* remove bpf_landlock_cmp_cgroup_beneath()
* make BPF_PROG_ATTACH usable with delegated cgroups
* add a new CGRP_NO_NEW_PRIVS flag for safe cgroups
* handle Landlock sandboxing for cgroups hierarchy
* allow unprivileged processes to attach Landlock eBPF program to
cgroups
* add subtype to eBPF programs:
* replace Landlock hook identification by custom eBPF program types
with a dedicated subtype field
* manage fine-grained privileged Landlock programs
* register Landlock programs for dedicated trigger origins (e.g.
syscall, return from seccomp filter and/or interruption)
* performance and memory optimizations: use an array to access Landlock
hooks directly but do not duplicate it for each thread
(seccomp-based)
* allow running Landlock programs without seccomp filter
* fix seccomp-related issues
* remove extra errno bounding check for Landlock programs
* add some examples for optional eBPF functions or context access
(network related) according to security checks to allow more features
for privileged programs (e.g. Checmate)
# Changes since v1
* focus on the LSM hooks, not the syscalls:
* much more simple implementation
* does not need audit cache tricks to avoid race conditions
* more simple to use and more generic because using the LSM hook
abstraction directly
* more efficient because only checking in LSM hooks
* architecture agnostic
* switch from cBPF to eBPF:
* new eBPF program types dedicated to Landlock
* custom functions used by the eBPF program
* gain some new features (e.g. 10 registers, can load values of
different size, LLVM translator) but only a few functions allowed
and a dedicated map type
* new context: LSM hook ID, cookie and LSM hook arguments
* need to set the sysctl kernel.unprivileged_bpf_disable to 0 (default
value) to be able to load hook filters as unprivileged users
* smaller and simpler:
* no more checker groups but dedicated arraymap of handles
* simpler userland structs thanks to eBPF functions
* distinctive name: Landlock
[1] https://lore.kernel.org/linux-security-module/20190625215239.11136-1-mic@digikod.net/
[2] https://lore.kernel.org/lkml/5828776A.1010104@digikod.net/
[3] https://lore.kernel.org/netdev/1477390454-12553-1-git-send-email-daniel@zonque.org/
[4] https://lore.kernel.org/lkml/50db058a-7dde-441b-a7f9-f6837fe8b69f@schaufler-ca.com/
[5] https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/
Regards,
Mickaël Salaün (10):
fs,security: Add a new file access type: MAY_CHROOT
bpf: Add expected_attach_triggers and a is_valid_triggers() verifier
bpf,landlock: Define an eBPF program type for Landlock hooks
seccomp,landlock: Enforce Landlock programs per process hierarchy
landlock: Handle filesystem access control
bpf,landlock: Add a new map type: inode
landlock: Add ptrace restrictions
bpf: Add a Landlock sandbox example
bpf,landlock: Add tests for Landlock
landlock: Add user and kernel documentation for Landlock
Documentation/security/index.rst | 1 +
Documentation/security/landlock/index.rst | 20 +
Documentation/security/landlock/kernel.rst | 99 +++
Documentation/security/landlock/user.rst | 147 ++++
MAINTAINERS | 13 +
fs/open.c | 3 +-
include/linux/bpf.h | 18 +
include/linux/bpf_types.h | 6 +
include/linux/fs.h | 1 +
include/linux/landlock.h | 38 ++
include/linux/lsm_hooks.h | 1 +
include/linux/seccomp.h | 5 +
include/uapi/linux/bpf.h | 16 +-
include/uapi/linux/landlock.h | 94 +++
include/uapi/linux/seccomp.h | 1 +
kernel/bpf/core.c | 2 +
kernel/bpf/hashtab.c | 253 +++++++
kernel/bpf/syscall.c | 41 +-
kernel/bpf/verifier.c | 26 +
kernel/fork.c | 8 +-
kernel/seccomp.c | 4 +
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 3 +
samples/bpf/landlock1.h | 8 +
samples/bpf/landlock1_kern.c | 55 ++
samples/bpf/landlock1_user.c | 250 +++++++
security/Kconfig | 1 +
security/Makefile | 2 +
security/landlock/Kconfig | 18 +
security/landlock/Makefile | 5 +
security/landlock/common.h | 105 +++
security/landlock/enforce.c | 272 ++++++++
security/landlock/enforce.h | 18 +
security/landlock/enforce_seccomp.c | 92 +++
security/landlock/hooks.c | 94 +++
security/landlock/hooks.h | 31 +
security/landlock/hooks_fs.c | 639 ++++++++++++++++++
security/landlock/hooks_fs.h | 31 +
security/landlock/hooks_ptrace.c | 121 ++++
security/landlock/hooks_ptrace.h | 8 +
security/landlock/init.c | 148 ++++
security/security.c | 15 +
tools/include/uapi/linux/bpf.h | 16 +-
tools/include/uapi/linux/landlock.h | 109 +++
tools/lib/bpf/bpf.h | 1 +
tools/lib/bpf/libbpf.c | 44 +-
tools/lib/bpf/libbpf.h | 7 +-
tools/lib/bpf/libbpf.map | 2 +
tools/lib/bpf/libbpf_probes.c | 2 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/bpf/bpf_helpers.h | 2 +
.../selftests/bpf/test_section_names.c | 2 +-
.../selftests/bpf/test_sockopt_multi.c | 4 +-
tools/testing/selftests/bpf/test_sockopt_sk.c | 2 +-
tools/testing/selftests/bpf/test_verifier.c | 1 +
.../testing/selftests/bpf/verifier/landlock.c | 24 +
tools/testing/selftests/landlock/.gitignore | 4 +
tools/testing/selftests/landlock/Makefile | 39 ++
tools/testing/selftests/landlock/test.h | 50 ++
tools/testing/selftests/landlock/test_base.c | 24 +
tools/testing/selftests/landlock/test_fs.c | 256 +++++++
.../testing/selftests/landlock/test_ptrace.c | 148 ++++
62 files changed, 3432 insertions(+), 20 deletions(-)
create mode 100644 Documentation/security/landlock/index.rst
create mode 100644 Documentation/security/landlock/kernel.rst
create mode 100644 Documentation/security/landlock/user.rst
create mode 100644 include/linux/landlock.h
create mode 100644 include/uapi/linux/landlock.h
create mode 100644 samples/bpf/landlock1.h
create mode 100644 samples/bpf/landlock1_kern.c
create mode 100644 samples/bpf/landlock1_user.c
create mode 100644 security/landlock/Kconfig
create mode 100644 security/landlock/Makefile
create mode 100644 security/landlock/common.h
create mode 100644 security/landlock/enforce.c
create mode 100644 security/landlock/enforce.h
create mode 100644 security/landlock/enforce_seccomp.c
create mode 100644 security/landlock/hooks.c
create mode 100644 security/landlock/hooks.h
create mode 100644 security/landlock/hooks_fs.c
create mode 100644 security/landlock/hooks_fs.h
create mode 100644 security/landlock/hooks_ptrace.c
create mode 100644 security/landlock/hooks_ptrace.h
create mode 100644 security/landlock/init.c
create mode 100644 tools/include/uapi/linux/landlock.h
create mode 100644 tools/testing/selftests/bpf/verifier/landlock.c
create mode 100644 tools/testing/selftests/landlock/.gitignore
create mode 100644 tools/testing/selftests/landlock/Makefile
create mode 100644 tools/testing/selftests/landlock/test.h
create mode 100644 tools/testing/selftests/landlock/test_base.c
create mode 100644 tools/testing/selftests/landlock/test_fs.c
create mode 100644 tools/testing/selftests/landlock/test_ptrace.c
--
2.22.0
^ permalink raw reply
* [PATCH bpf-next v10 01/10] fs,security: Add a new file access type: MAY_CHROOT
From: Mickaël Salaün @ 2019-07-21 21:31 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Alexander Viro, Alexei Starovoitov,
Andrew Morton, Andy Lutomirski, Arnaldo Carvalho de Melo,
Casey Schaufler, Daniel Borkmann, David Drysdale,
David S . Miller, Eric W . Biederman, James Morris, Jann Horn,
John Johansen, Jonathan Corbet, Kees Cook, Michael Kerrisk,
Mickaël Salaün
In-Reply-To: <20190721213116.23476-1-mic@digikod.net>
For compatibility reasons, MAY_CHROOT is always set with MAY_CHDIR.
However, this new flag makes it possible to differentiate a chdir from a chroot.
This is needed for the Landlock LSM to be able to evaluate a new root
directory.
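As an illustration, here is a minimal userspace sketch (not kernel code)
of how a permission hook could tell the two requests apart from the mask
alone. The flag values are the ones from include/linux/fs.h with this
patch applied; the enum and the function name are hypothetical.

```c
#include <assert.h>

/* Flag values as defined in include/linux/fs.h with this patch applied. */
#define MAY_EXEC   0x00000001
#define MAY_CHDIR  0x00000040
#define MAY_CHROOT 0x00000100

/* Hypothetical classification, for illustration only. */
enum dir_request { REQ_OTHER, REQ_CHDIR, REQ_CHROOT };

/*
 * MAY_CHROOT is only ever set together with MAY_CHDIR, so it has to be
 * tested first: a chroot request also matches the MAY_CHDIR check.
 */
enum dir_request classify_dir_request(int mask)
{
	if (mask & MAY_CHROOT) {
		assert(mask & MAY_CHDIR); /* invariant stated above */
		return REQ_CHROOT;
	}
	if (mask & MAY_CHDIR)
		return REQ_CHDIR;
	return REQ_OTHER;
}
```

Code that only tests MAY_CHDIR keeps working unchanged, which is the
compatibility reason for setting both flags on chroot.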
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: James Morris <jmorris@namei.org>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: linux-fsdevel@vger.kernel.org
---
fs/open.c | 3 ++-
include/linux/fs.h | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/open.c b/fs/open.c
index b5b80469b93d..e8767318fd03 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -494,7 +494,8 @@ int ksys_chroot(const char __user *filename)
if (error)
goto out;
- error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_CHDIR);
+ error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_CHDIR |
+ MAY_CHROOT);
if (error)
goto dput_and_out;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 75f2ed289a3f..7a0d92b1da85 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -99,6 +99,7 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
#define MAY_CHDIR 0x00000040
/* called from RCU mode, don't block */
#define MAY_NOT_BLOCK 0x00000080
+#define MAY_CHROOT 0x00000100
/*
* flags in file.f_mode. Note that FMODE_READ and FMODE_WRITE must correspond
--
2.22.0
^ permalink raw reply related
* [PATCH bpf-next v10 02/10] bpf: Add expected_attach_triggers and a is_valid_triggers() verifier
From: Mickaël Salaün @ 2019-07-21 21:31 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Alexander Viro, Alexei Starovoitov,
Andrew Morton, Andy Lutomirski, Arnaldo Carvalho de Melo,
Casey Schaufler, Daniel Borkmann, David Drysdale,
David S . Miller, Eric W . Biederman, James Morris, Jann Horn,
John Johansen, Jonathan Corbet, Kees Cook, Michael Kerrisk,
Mickaël Salaün
In-Reply-To: <20190721213116.23476-1-mic@digikod.net>
The goal of the program triggers is to be able to have static triggers
(bitflags) conditioning an eBPF program interpretation. This helps to
avoid unnecessary runs.
The struct bpf_verifier_ops gets a new optional function:
is_valid_triggers(). This new verifier is called at the beginning of the
eBPF program verification to check if the (optional) program triggers
are valid.
For now, only Landlock eBPF programs are using program triggers (see
next commits) but this could be used by other program types in the
future.
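The decision this patch adds to bpf_check() can be modelled in plain C as
follows. This is a simplified userspace sketch: struct model_prog and
struct model_verifier_ops are hypothetical stand-ins for the relevant
parts of struct bpf_prog and struct bpf_verifier_ops, and the example
callback with its accepted bits is made up.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of struct bpf_prog. */
struct model_prog {
	uint64_t expected_attach_triggers;
};

/* Hypothetical stand-in for the relevant part of struct bpf_verifier_ops. */
struct model_verifier_ops {
	bool (*is_valid_triggers)(const struct model_prog *prog);
};

/* Mirrors the decision made early in bpf_check(). */
int model_check_triggers(const struct model_verifier_ops *ops,
			 const struct model_prog *prog)
{
	assert(ops != NULL && prog != NULL);
	if (ops->is_valid_triggers) {
		/* The program type knows about triggers: let it decide. */
		if (!ops->is_valid_triggers(prog))
			return -EINVAL;
	} else if (prog->expected_attach_triggers) {
		/* Do not accept triggers if they are not handled. */
		return -EINVAL;
	}
	return 0;
}

/* Example callback: accept only the two lowest trigger bits (made up). */
bool example_is_valid_triggers(const struct model_prog *prog)
{
	return (prog->expected_attach_triggers & ~UINT64_C(0x3)) == 0;
}
```

A program type without the callback still loads as before as long as no
triggers are requested, so existing program types are unaffected.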
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Link: https://lkml.kernel.org/r/20160827205559.GA43880@ast-mbp.thefacebook.com
---
Changes since v9:
* replace subtype with expected_attach_type (suggested by Alexei
Starovoitov) and a new expected_attach_triggers
* add new bpf_attach_type: BPF_LANDLOCK_FS_PICK and BPF_LANDLOCK_FS_WALK
* remove bpf_prog_extra from bpf_base_func_proto()
* update libbpf and test_verifier to handle triggers
Changes since v8:
* use bpf_load_program_xattr() instead of bpf_load_program() and add
bpf_verify_program_xattr() to deal with subtypes
* remove put_extra() since there is no more "previous" field (for now)
Changes since v7:
* rename LANDLOCK_SUBTYPE_* to LANDLOCK_*
* move subtype in bpf_prog_aux and use only one bit for has_subtype
(suggested by Alexei Starovoitov)
* wrap the prog_subtype with a prog_extra to be able to reference kernel
pointers:
* add an optional put_extra() function to struct bpf_prog_ops to be
able to free the pointed data
* replace all the prog_subtype with prog_extra in the struct
bpf_verifier_ops functions
* remove the ABI field (requested by Alexei Starovoitov)
* rename subtype fields
Changes since v6:
* rename Landlock version to ABI to better reflect its purpose
* fix unsigned integer checks
* fix pointer cast
* constify pointers
* rebase
Changes since v5:
* use a prog_subtype pointer and make it future-proof
* add subtype test
* constify bpf_load_program()'s subtype argument
* cleanup subtype initialization
* rebase
Changes since v4:
* replace the "status" field with "version" (more generic)
* replace the "access" field with "ability" (less confusing)
Changes since v3:
* remove the "origin" field
* add an "option" field
* cleanup comments
---
include/linux/bpf.h | 2 ++
include/linux/bpf_types.h | 3 +++
include/uapi/linux/bpf.h | 3 +++
kernel/bpf/syscall.c | 14 +++++++++++++-
kernel/bpf/verifier.c | 11 +++++++++++
tools/include/uapi/linux/bpf.h | 3 +++
tools/lib/bpf/bpf.h | 1 +
tools/lib/bpf/libbpf.map | 1 +
8 files changed, 37 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 18f4cc2c6acd..6d9c7a08713e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -319,6 +319,7 @@ struct bpf_verifier_ops {
const struct bpf_insn *src,
struct bpf_insn *dst,
struct bpf_prog *prog, u32 *target_size);
+ bool (*is_valid_triggers)(const struct bpf_prog *prog);
};
struct bpf_prog_offload_ops {
@@ -418,6 +419,7 @@ struct bpf_prog_aux {
struct work_struct work;
struct rcu_head rcu;
};
+ u64 expected_attach_triggers;
};
struct bpf_array {
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index eec5aeeeaf92..2ab647323f3a 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -38,6 +38,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
#ifdef CONFIG_INET
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
#endif
+#ifdef CONFIG_SECURITY_LANDLOCK
+BPF_PROG_TYPE(BPF_PROG_TYPE_LANDLOCK_HOOK, landlock)
+#endif
BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6f68438aa4ed..1664d260861b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -197,6 +197,8 @@ enum bpf_attach_type {
BPF_CGROUP_UDP6_RECVMSG,
BPF_CGROUP_GETSOCKOPT,
BPF_CGROUP_SETSOCKOPT,
+ BPF_LANDLOCK_FS_PICK,
+ BPF_LANDLOCK_FS_WALK,
__MAX_BPF_ATTACH_TYPE
};
@@ -412,6 +414,7 @@ union bpf_attr {
__u32 line_info_rec_size; /* userspace bpf_line_info size */
__aligned_u64 line_info; /* line info */
__u32 line_info_cnt; /* number of bpf_line_info records */
+ __aligned_u64 expected_attach_triggers; /* bitfield of triggers, e.g. LANDLOCK_TRIGGER_* */
};
struct { /* anonymous struct used by BPF_OBJ_* commands */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5d141f16f6fa..b2a8cb14f28e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1598,13 +1598,23 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
default:
return -EINVAL;
}
+#ifdef CONFIG_SECURITY_LANDLOCK
+ case BPF_PROG_TYPE_LANDLOCK_HOOK:
+ switch (expected_attach_type) {
+ case BPF_LANDLOCK_FS_PICK:
+ case BPF_LANDLOCK_FS_WALK:
+ return 0;
+ default:
+ return -EINVAL;
+ }
+#endif
default:
return 0;
}
}
/* last field in 'union bpf_attr' used by this command */
-#define BPF_PROG_LOAD_LAST_FIELD line_info_cnt
+#define BPF_PROG_LOAD_LAST_FIELD expected_attach_triggers
static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
{
@@ -1694,6 +1704,8 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
if (err)
goto free_prog;
+ prog->aux->expected_attach_triggers = attr->expected_attach_triggers;
+
/* run eBPF verifier */
err = bpf_check(&prog, attr, uattr);
if (err < 0)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a2e763703c30..94a43d7c8175 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -9265,6 +9265,17 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
if (ret < 0)
goto skip_full_check;
+ if (env->ops->is_valid_triggers) {
+ if (!env->ops->is_valid_triggers(env->prog)) {
+ ret = -EINVAL;
+ goto err_unlock;
+ }
+ } else if (env->prog->aux->expected_attach_triggers) {
+ /* do not accept triggers if they are not handled */
+ ret = -EINVAL;
+ goto err_unlock;
+ }
+
if (bpf_prog_is_dev_bound(env->prog->aux)) {
ret = bpf_prog_offload_verifier_prep(env->prog);
if (ret)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f506c68b2612..232747393405 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -197,6 +197,8 @@ enum bpf_attach_type {
BPF_CGROUP_UDP6_RECVMSG,
BPF_CGROUP_GETSOCKOPT,
BPF_CGROUP_SETSOCKOPT,
+ BPF_LANDLOCK_FS_PICK,
+ BPF_LANDLOCK_FS_WALK,
__MAX_BPF_ATTACH_TYPE
};
@@ -412,6 +414,7 @@ union bpf_attr {
__u32 line_info_rec_size; /* userspace bpf_line_info size */
__aligned_u64 line_info; /* line info */
__u32 line_info_cnt; /* number of bpf_line_info records */
+ __aligned_u64 expected_attach_triggers; /* bitfield of triggers, e.g. LANDLOCK_TRIGGER_* */
};
struct { /* anonymous struct used by BPF_OBJ_* commands */
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index ff42ca043dc8..468bb3ac0be0 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -102,6 +102,7 @@ LIBBPF_API int bpf_load_program(enum bpf_prog_type type,
const struct bpf_insn *insns, size_t insns_cnt,
const char *license, __u32 kern_version,
char *log_buf, size_t log_buf_sz);
+LIBBPF_API int bpf_verify_program_xattr(union bpf_attr *attr, size_t attr_sz);
LIBBPF_API int bpf_verify_program(enum bpf_prog_type type,
const struct bpf_insn *insns,
size_t insns_cnt, __u32 prog_flags,
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index f9d316e873d8..36ac26bdfda0 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -107,6 +107,7 @@ LIBBPF_0.0.1 {
bpf_set_link_xdp_fd;
bpf_task_fd_query;
bpf_verify_program;
+ bpf_verify_program_xattr;
btf__fd;
btf__find_by_name;
btf__free;
--
2.22.0
^ permalink raw reply related
* [PATCH bpf-next v10 03/10] bpf,landlock: Define an eBPF program type for Landlock hooks
From: Mickaël Salaün @ 2019-07-21 21:31 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Alexander Viro, Alexei Starovoitov,
Andrew Morton, Andy Lutomirski, Arnaldo Carvalho de Melo,
Casey Schaufler, Daniel Borkmann, David Drysdale,
David S . Miller, Eric W . Biederman, James Morris, Jann Horn,
John Johansen, Jonathan Corbet, Kees Cook, Michael Kerrisk,
Mickaël Salaün
In-Reply-To: <20190721213116.23476-1-mic@digikod.net>
Add a new type of eBPF program used by Landlock hooks. The goal of this
type of program is to accept or deny a requested access from userspace
to a kernel object (e.g. a file).
This new BPF program type will be registered with the Landlock LSM
initialization.
Add an initial Landlock Kconfig and update the MAINTAINERS file.
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
---
Changes since v9:
* handle inode put and map put, which fixes unmount (reported by Al Viro)
* replace subtype with expected_attach_type and expected_attach_triggers
* check eBPF program return code
Changes since v8:
* Remove the chaining concept from the eBPF program contexts (chain and
cookie). We need to keep these subtypes this way to be able to make
them evolve, though.
* remove bpf_landlock_put_extra() because there is no longer a "previous"
field to free (for now)
Changes since v7:
* cosmetic fixes
* rename LANDLOCK_SUBTYPE_* to LANDLOCK_*
* cleanup UAPI definitions and move them from bpf.h to landlock.h
(suggested by Alexei Starovoitov)
* disable Landlock by default (suggested by Alexei Starovoitov)
* rename BPF_PROG_TYPE_LANDLOCK_{RULE,HOOK}
* update the Kconfig
* update the MAINTAINERS file
* replace the IOCTL, LOCK and FCNTL events with FS_PICK, FS_WALK and
FS_GET hook types
* add the ability to chain programs with an eBPF program file descriptor
(i.e. the "previous" field in a Landlock subtype) and keep a state
with a "cookie" value available from the context
* add a "triggers" subtype bitfield to match specific actions (e.g.
append, chdir, read...)
Changes since v6:
* add 3 more sub-events: IOCTL, LOCK, FCNTL
https://lkml.kernel.org/r/2fbc99a6-f190-f335-bd14-04bdeed35571@digikod.net
* rename LANDLOCK_VERSION to LANDLOCK_ABI to better reflect its purpose,
and move it from landlock.h to common.h
* rename BPF_PROG_TYPE_LANDLOCK to BPF_PROG_TYPE_LANDLOCK_RULE: an eBPF
program could be used for something else than a rule
* simplify struct landlock_context by removing the arch and syscall_nr fields
* remove all eBPF map function calls, remove ABILITY_WRITE
* refactor bpf_landlock_func_proto() (suggested by Kees Cook)
* constify pointers
* fix doc inclusion
Changes since v5:
* rename file hooks.c to init.c
* fix spelling
Changes since v4:
* merge a minimal (not enabled) LSM code and Kconfig in this commit
Changes since v3:
* split commit
* revamp the landlock_context:
* add arch, syscall_nr and syscall_cmd (ioctl, fcntl…) to be able to
cross-check action with the event type
* replace args array with dedicated fields to ease the addition of new
fields
---
MAINTAINERS | 13 ++++
include/uapi/linux/bpf.h | 1 +
include/uapi/linux/landlock.h | 94 ++++++++++++++++++++++++
kernel/bpf/verifier.c | 1 +
security/Kconfig | 1 +
security/Makefile | 2 +
security/landlock/Kconfig | 18 +++++
security/landlock/Makefile | 3 +
security/landlock/common.h | 44 +++++++++++
security/landlock/init.c | 100 +++++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 1 +
tools/include/uapi/linux/landlock.h | 109 ++++++++++++++++++++++++++++
tools/lib/bpf/libbpf.c | 1 +
tools/lib/bpf/libbpf_probes.c | 1 +
14 files changed, 389 insertions(+)
create mode 100644 include/uapi/linux/landlock.h
create mode 100644 security/landlock/Kconfig
create mode 100644 security/landlock/Makefile
create mode 100644 security/landlock/common.h
create mode 100644 security/landlock/init.c
create mode 100644 tools/include/uapi/linux/landlock.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 211ea3a199bd..d30b535a613a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8947,6 +8947,19 @@ F: net/core/skmsg.c
F: net/core/sock_map.c
F: net/ipv4/tcp_bpf.c
+LANDLOCK SECURITY MODULE
+M: Mickaël Salaün <mic@digikod.net>
+S: Supported
+F: Documentation/security/landlock/
+F: include/linux/landlock.h
+F: include/uapi/linux/landlock.h
+F: samples/bpf/landlock*
+F: security/landlock/
+F: tools/include/uapi/linux/landlock.h
+F: tools/testing/selftests/landlock/
+K: landlock
+K: LANDLOCK
+
LANTIQ / INTEL Ethernet drivers
M: Hauke Mehrtens <hauke@hauke-m.de>
L: netdev@vger.kernel.org
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1664d260861b..d68613f737f3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -171,6 +171,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_CGROUP_SYSCTL,
BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
BPF_PROG_TYPE_CGROUP_SOCKOPT,
+ BPF_PROG_TYPE_LANDLOCK_HOOK,
};
enum bpf_attach_type {
diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
new file mode 100644
index 000000000000..be3c4757a3ad
--- /dev/null
+++ b/include/uapi/linux/landlock.h
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Landlock - UAPI headers
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#ifndef _UAPI__LINUX_LANDLOCK_H__
+#define _UAPI__LINUX_LANDLOCK_H__
+
+#include <linux/types.h>
+
+#define LANDLOCK_RET_ALLOW 0
+#define LANDLOCK_RET_DENY 1
+
+/**
+ * DOC: landlock_triggers
+ *
+ * A landlock trigger is used as a bitmask in subtype.landlock_hook.triggers
+ * for a fs_pick program. It defines a set of actions for which the program
+ * should verify an access request.
+ *
+ * - %LANDLOCK_TRIGGER_FS_PICK_APPEND
+ * - %LANDLOCK_TRIGGER_FS_PICK_CHDIR
+ * - %LANDLOCK_TRIGGER_FS_PICK_CHROOT
+ * - %LANDLOCK_TRIGGER_FS_PICK_CREATE
+ * - %LANDLOCK_TRIGGER_FS_PICK_EXECUTE
+ * - %LANDLOCK_TRIGGER_FS_PICK_FCNTL
+ * - %LANDLOCK_TRIGGER_FS_PICK_GETATTR
+ * - %LANDLOCK_TRIGGER_FS_PICK_IOCTL
+ * - %LANDLOCK_TRIGGER_FS_PICK_LINK
+ * - %LANDLOCK_TRIGGER_FS_PICK_LINKTO
+ * - %LANDLOCK_TRIGGER_FS_PICK_LOCK
+ * - %LANDLOCK_TRIGGER_FS_PICK_MAP
+ * - %LANDLOCK_TRIGGER_FS_PICK_MOUNTON
+ * - %LANDLOCK_TRIGGER_FS_PICK_OPEN
+ * - %LANDLOCK_TRIGGER_FS_PICK_READ
+ * - %LANDLOCK_TRIGGER_FS_PICK_READDIR
+ * - %LANDLOCK_TRIGGER_FS_PICK_RECEIVE
+ * - %LANDLOCK_TRIGGER_FS_PICK_RENAME
+ * - %LANDLOCK_TRIGGER_FS_PICK_RENAMETO
+ * - %LANDLOCK_TRIGGER_FS_PICK_RMDIR
+ * - %LANDLOCK_TRIGGER_FS_PICK_SETATTR
+ * - %LANDLOCK_TRIGGER_FS_PICK_TRANSFER
+ * - %LANDLOCK_TRIGGER_FS_PICK_UNLINK
+ * - %LANDLOCK_TRIGGER_FS_PICK_WRITE
+ */
+#define LANDLOCK_TRIGGER_FS_PICK_APPEND (1ULL << 0)
+#define LANDLOCK_TRIGGER_FS_PICK_CHDIR (1ULL << 1)
+#define LANDLOCK_TRIGGER_FS_PICK_CHROOT (1ULL << 2)
+#define LANDLOCK_TRIGGER_FS_PICK_CREATE (1ULL << 3)
+#define LANDLOCK_TRIGGER_FS_PICK_EXECUTE (1ULL << 4)
+#define LANDLOCK_TRIGGER_FS_PICK_FCNTL (1ULL << 5)
+#define LANDLOCK_TRIGGER_FS_PICK_GETATTR (1ULL << 6)
+#define LANDLOCK_TRIGGER_FS_PICK_IOCTL (1ULL << 7)
+#define LANDLOCK_TRIGGER_FS_PICK_LINK (1ULL << 8)
+#define LANDLOCK_TRIGGER_FS_PICK_LINKTO (1ULL << 9)
+#define LANDLOCK_TRIGGER_FS_PICK_LOCK (1ULL << 10)
+#define LANDLOCK_TRIGGER_FS_PICK_MAP (1ULL << 11)
+#define LANDLOCK_TRIGGER_FS_PICK_MOUNTON (1ULL << 12)
+#define LANDLOCK_TRIGGER_FS_PICK_OPEN (1ULL << 13)
+#define LANDLOCK_TRIGGER_FS_PICK_READ (1ULL << 14)
+#define LANDLOCK_TRIGGER_FS_PICK_READDIR (1ULL << 15)
+#define LANDLOCK_TRIGGER_FS_PICK_RECEIVE (1ULL << 16)
+#define LANDLOCK_TRIGGER_FS_PICK_RENAME (1ULL << 17)
+#define LANDLOCK_TRIGGER_FS_PICK_RENAMETO (1ULL << 18)
+#define LANDLOCK_TRIGGER_FS_PICK_RMDIR (1ULL << 19)
+#define LANDLOCK_TRIGGER_FS_PICK_SETATTR (1ULL << 20)
+#define LANDLOCK_TRIGGER_FS_PICK_TRANSFER (1ULL << 21)
+#define LANDLOCK_TRIGGER_FS_PICK_UNLINK (1ULL << 22)
+#define LANDLOCK_TRIGGER_FS_PICK_WRITE (1ULL << 23)
+
+/**
+ * struct landlock_ctx_fs_pick - context accessible to a fs_pick program
+ *
+ * @inode: pointer to the current kernel object that can be used to compare
+ * inodes from an inode map.
+ */
+struct landlock_ctx_fs_pick {
+ __u64 inode;
+};
+
+/**
+ * struct landlock_ctx_fs_walk - context accessible to a fs_walk program
+ *
+ * @inode: pointer to the current kernel object that can be used to compare
+ * inodes from an inode map.
+ */
+struct landlock_ctx_fs_walk {
+ __u64 inode;
+};
+
+#endif /* _UAPI__LINUX_LANDLOCK_H__ */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 94a43d7c8175..026c68cb9116 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6116,6 +6116,7 @@ static int check_return_code(struct bpf_verifier_env *env)
case BPF_PROG_TYPE_CGROUP_DEVICE:
case BPF_PROG_TYPE_CGROUP_SYSCTL:
case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+ case BPF_PROG_TYPE_LANDLOCK_HOOK:
break;
default:
return 0;
diff --git a/security/Kconfig b/security/Kconfig
index 06a30851511a..549822ea60b7 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -237,6 +237,7 @@ source "security/apparmor/Kconfig"
source "security/loadpin/Kconfig"
source "security/yama/Kconfig"
source "security/safesetid/Kconfig"
+source "security/landlock/Kconfig"
source "security/integrity/Kconfig"
diff --git a/security/Makefile b/security/Makefile
index c598b904938f..396ff107f70d 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -11,6 +11,7 @@ subdir-$(CONFIG_SECURITY_APPARMOR) += apparmor
subdir-$(CONFIG_SECURITY_YAMA) += yama
subdir-$(CONFIG_SECURITY_LOADPIN) += loadpin
subdir-$(CONFIG_SECURITY_SAFESETID) += safesetid
+subdir-$(CONFIG_SECURITY_LANDLOCK) += landlock
# always enable default capabilities
obj-y += commoncap.o
@@ -27,6 +28,7 @@ obj-$(CONFIG_SECURITY_APPARMOR) += apparmor/
obj-$(CONFIG_SECURITY_YAMA) += yama/
obj-$(CONFIG_SECURITY_LOADPIN) += loadpin/
obj-$(CONFIG_SECURITY_SAFESETID) += safesetid/
+obj-$(CONFIG_SECURITY_LANDLOCK) += landlock/
obj-$(CONFIG_CGROUP_DEVICE) += device_cgroup.o
# Object integrity file lists
diff --git a/security/landlock/Kconfig b/security/landlock/Kconfig
new file mode 100644
index 000000000000..8bd103102008
--- /dev/null
+++ b/security/landlock/Kconfig
@@ -0,0 +1,18 @@
+config SECURITY_LANDLOCK
+ bool "Landlock support"
+ depends on SECURITY
+ depends on BPF_SYSCALL
+ depends on SECCOMP_FILTER
+ default n
+ help
+ This selects Landlock, a programmatic access control mechanism. It makes
+ it possible to restrict processes on the fly (i.e. create a sandbox). The
+ security policy is a set of eBPF programs dedicated to denying a list of
+ actions on specific kernel objects (e.g. a file).
+
+ You need to enable seccomp filter to apply a security policy to a
+ process hierarchy (e.g. application with built-in sandboxing).
+
+ See Documentation/security/landlock/ for further information.
+
+ If you are unsure how to answer this question, answer N.
diff --git a/security/landlock/Makefile b/security/landlock/Makefile
new file mode 100644
index 000000000000..7205f9a7a2ee
--- /dev/null
+++ b/security/landlock/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
+
+landlock-y := init.o
diff --git a/security/landlock/common.h b/security/landlock/common.h
new file mode 100644
index 000000000000..80dc36f4d0ac
--- /dev/null
+++ b/security/landlock/common.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Landlock LSM - private headers
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#ifndef _SECURITY_LANDLOCK_COMMON_H
+#define _SECURITY_LANDLOCK_COMMON_H
+
+#include <linux/bpf.h> /* enum bpf_attach_type */
+#include <linux/filter.h> /* bpf_prog */
+#include <linux/refcount.h> /* refcount_t */
+#include <uapi/linux/landlock.h> /* LANDLOCK_TRIGGER_* */
+
+#define LANDLOCK_NAME "landlock"
+
+/* UAPI bounds and bitmasks */
+
+#define _LANDLOCK_HOOK_LAST LANDLOCK_HOOK_FS_WALK
+
+#define _LANDLOCK_TRIGGER_FS_PICK_LAST LANDLOCK_TRIGGER_FS_PICK_WRITE
+#define _LANDLOCK_TRIGGER_FS_PICK_MASK ((_LANDLOCK_TRIGGER_FS_PICK_LAST << 1ULL) - 1)
+
+enum landlock_hook_type {
+ LANDLOCK_HOOK_FS_PICK = 1,
+ LANDLOCK_HOOK_FS_WALK,
+};
+
+static inline enum landlock_hook_type get_hook_type(const struct bpf_prog *prog)
+{
+ switch (prog->expected_attach_type) {
+ case BPF_LANDLOCK_FS_PICK:
+ return LANDLOCK_HOOK_FS_PICK;
+ case BPF_LANDLOCK_FS_WALK:
+ return LANDLOCK_HOOK_FS_WALK;
+ default:
+ WARN_ON(1);
+ return LANDLOCK_HOOK_FS_PICK;
+ }
+}
+
+#endif /* _SECURITY_LANDLOCK_COMMON_H */
diff --git a/security/landlock/init.c b/security/landlock/init.c
new file mode 100644
index 000000000000..8dfd5fea3c1f
--- /dev/null
+++ b/security/landlock/init.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Landlock LSM - init
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#include <linux/bpf.h> /* enum bpf_access_type */
+#include <linux/capability.h> /* capable */
+#include <linux/filter.h> /* struct bpf_prog */
+
+#include "common.h" /* LANDLOCK_* */
+
+static bool bpf_landlock_is_valid_access(int off, int size,
+ enum bpf_access_type type, const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ enum bpf_reg_type reg_type = NOT_INIT;
+ int max_size = 0;
+
+ if (WARN_ON(!prog->expected_attach_type))
+ return false;
+
+ if (off < 0)
+ return false;
+ if (size <= 0 || size > sizeof(__u64))
+ return false;
+
+ /* check memory range access */
+ switch (reg_type) {
+ case NOT_INIT:
+ return false;
+ case SCALAR_VALUE:
+ /* allow partial raw value */
+ if (size > max_size)
+ return false;
+ info->ctx_field_size = max_size;
+ break;
+ default:
+ /* deny partial pointer */
+ if (size != max_size)
+ return false;
+ }
+
+ info->reg_type = reg_type;
+ return true;
+}
+
+static bool bpf_landlock_is_valid_triggers(const struct bpf_prog *prog)
+{
+ u64 triggers;
+
+ if (!prog)
+ return false;
+ triggers = prog->aux->expected_attach_triggers;
+
+ switch (get_hook_type(prog)) {
+ case LANDLOCK_HOOK_FS_PICK:
+ if (!triggers || triggers & ~_LANDLOCK_TRIGGER_FS_PICK_MASK)
+ return false;
+ break;
+ case LANDLOCK_HOOK_FS_WALK:
+ if (triggers)
+ return false;
+ break;
+ }
+ return true;
+}
+
+static const struct bpf_func_proto *bpf_landlock_func_proto(
+ enum bpf_func_id func_id,
+ const struct bpf_prog *prog)
+{
+ if (WARN_ON(!prog->expected_attach_type))
+ return NULL;
+
+ /* generic functions */
+ /* TODO: do we need/want update/delete functions for every LL prog?
+ * => impurity vs. audit */
+ switch (func_id) {
+ case BPF_FUNC_map_lookup_elem:
+ return &bpf_map_lookup_elem_proto;
+ case BPF_FUNC_map_update_elem:
+ return &bpf_map_update_elem_proto;
+ case BPF_FUNC_map_delete_elem:
+ return &bpf_map_delete_elem_proto;
+ default:
+ break;
+ }
+ return NULL;
+}
+
+const struct bpf_verifier_ops landlock_verifier_ops = {
+ .get_func_proto = bpf_landlock_func_proto,
+ .is_valid_access = bpf_landlock_is_valid_access,
+ .is_valid_triggers = bpf_landlock_is_valid_triggers,
+};
+
+const struct bpf_prog_ops landlock_prog_ops = {};
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 232747393405..7b7a4f6c3104 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -171,6 +171,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_CGROUP_SYSCTL,
BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
BPF_PROG_TYPE_CGROUP_SOCKOPT,
+ BPF_PROG_TYPE_LANDLOCK_HOOK,
};
enum bpf_attach_type {
diff --git a/tools/include/uapi/linux/landlock.h b/tools/include/uapi/linux/landlock.h
new file mode 100644
index 000000000000..9e6d8e10ec2c
--- /dev/null
+++ b/tools/include/uapi/linux/landlock.h
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Landlock - UAPI headers
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#ifndef _UAPI__LINUX_LANDLOCK_H__
+#define _UAPI__LINUX_LANDLOCK_H__
+
+#include <linux/types.h>
+
+#define LANDLOCK_RET_ALLOW 0
+#define LANDLOCK_RET_DENY 1
+
+/**
+ * enum landlock_hook_type - hook type for which a Landlock program is called
+ *
+ * A hook is a policy decision point which exposes the same context type for
+ * each program evaluation.
+ *
+ * @LANDLOCK_HOOK_FS_PICK: called for the last element of a file path
+ * @LANDLOCK_HOOK_FS_WALK: called for each directory of a file path (excluding
+ * the directory passed to fs_pick, if any)
+ */
+enum landlock_hook_type {
+ LANDLOCK_HOOK_FS_PICK = 1,
+ LANDLOCK_HOOK_FS_WALK,
+};
+
+/**
+ * DOC: landlock_triggers
+ *
+ * A landlock trigger is used as a bitmask in subtype.landlock_hook.triggers
+ * for a fs_pick program. It defines a set of actions for which the program
+ * should verify an access request.
+ *
+ * - %LANDLOCK_TRIGGER_FS_PICK_APPEND
+ * - %LANDLOCK_TRIGGER_FS_PICK_CHDIR
+ * - %LANDLOCK_TRIGGER_FS_PICK_CHROOT
+ * - %LANDLOCK_TRIGGER_FS_PICK_CREATE
+ * - %LANDLOCK_TRIGGER_FS_PICK_EXECUTE
+ * - %LANDLOCK_TRIGGER_FS_PICK_FCNTL
+ * - %LANDLOCK_TRIGGER_FS_PICK_GETATTR
+ * - %LANDLOCK_TRIGGER_FS_PICK_IOCTL
+ * - %LANDLOCK_TRIGGER_FS_PICK_LINK
+ * - %LANDLOCK_TRIGGER_FS_PICK_LINKTO
+ * - %LANDLOCK_TRIGGER_FS_PICK_LOCK
+ * - %LANDLOCK_TRIGGER_FS_PICK_MAP
+ * - %LANDLOCK_TRIGGER_FS_PICK_MOUNTON
+ * - %LANDLOCK_TRIGGER_FS_PICK_OPEN
+ * - %LANDLOCK_TRIGGER_FS_PICK_READ
+ * - %LANDLOCK_TRIGGER_FS_PICK_READDIR
+ * - %LANDLOCK_TRIGGER_FS_PICK_RECEIVE
+ * - %LANDLOCK_TRIGGER_FS_PICK_RENAME
+ * - %LANDLOCK_TRIGGER_FS_PICK_RENAMETO
+ * - %LANDLOCK_TRIGGER_FS_PICK_RMDIR
+ * - %LANDLOCK_TRIGGER_FS_PICK_SETATTR
+ * - %LANDLOCK_TRIGGER_FS_PICK_TRANSFER
+ * - %LANDLOCK_TRIGGER_FS_PICK_UNLINK
+ * - %LANDLOCK_TRIGGER_FS_PICK_WRITE
+ */
+#define LANDLOCK_TRIGGER_FS_PICK_APPEND (1ULL << 0)
+#define LANDLOCK_TRIGGER_FS_PICK_CHDIR (1ULL << 1)
+#define LANDLOCK_TRIGGER_FS_PICK_CHROOT (1ULL << 2)
+#define LANDLOCK_TRIGGER_FS_PICK_CREATE (1ULL << 3)
+#define LANDLOCK_TRIGGER_FS_PICK_EXECUTE (1ULL << 4)
+#define LANDLOCK_TRIGGER_FS_PICK_FCNTL (1ULL << 5)
+#define LANDLOCK_TRIGGER_FS_PICK_GETATTR (1ULL << 6)
+#define LANDLOCK_TRIGGER_FS_PICK_IOCTL (1ULL << 7)
+#define LANDLOCK_TRIGGER_FS_PICK_LINK (1ULL << 8)
+#define LANDLOCK_TRIGGER_FS_PICK_LINKTO (1ULL << 9)
+#define LANDLOCK_TRIGGER_FS_PICK_LOCK (1ULL << 10)
+#define LANDLOCK_TRIGGER_FS_PICK_MAP (1ULL << 11)
+#define LANDLOCK_TRIGGER_FS_PICK_MOUNTON (1ULL << 12)
+#define LANDLOCK_TRIGGER_FS_PICK_OPEN (1ULL << 13)
+#define LANDLOCK_TRIGGER_FS_PICK_READ (1ULL << 14)
+#define LANDLOCK_TRIGGER_FS_PICK_READDIR (1ULL << 15)
+#define LANDLOCK_TRIGGER_FS_PICK_RECEIVE (1ULL << 16)
+#define LANDLOCK_TRIGGER_FS_PICK_RENAME (1ULL << 17)
+#define LANDLOCK_TRIGGER_FS_PICK_RENAMETO (1ULL << 18)
+#define LANDLOCK_TRIGGER_FS_PICK_RMDIR (1ULL << 19)
+#define LANDLOCK_TRIGGER_FS_PICK_SETATTR (1ULL << 20)
+#define LANDLOCK_TRIGGER_FS_PICK_TRANSFER (1ULL << 21)
+#define LANDLOCK_TRIGGER_FS_PICK_UNLINK (1ULL << 22)
+#define LANDLOCK_TRIGGER_FS_PICK_WRITE (1ULL << 23)
+
+/**
+ * struct landlock_ctx_fs_pick - context accessible to a fs_pick program
+ *
+ * @inode: pointer to the current kernel object that can be used to compare
+ * inodes from an inode map.
+ */
+struct landlock_ctx_fs_pick {
+ __u64 inode;
+};
+
+/**
+ * struct landlock_ctx_fs_walk - context accessible to a fs_walk program
+ *
+ * @inode: pointer to the current kernel object that can be used to compare
+ * inodes from an inode map.
+ */
+struct landlock_ctx_fs_walk {
+ __u64 inode;
+};
+
+#endif /* _UAPI__LINUX_LANDLOCK_H__ */
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index ed07789b3e62..ab3b8b510b8a 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2665,6 +2665,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
case BPF_PROG_TYPE_PERF_EVENT:
case BPF_PROG_TYPE_CGROUP_SYSCTL:
case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+ case BPF_PROG_TYPE_LANDLOCK_HOOK:
return false;
case BPF_PROG_TYPE_KPROBE:
default:
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index ace1a0708d99..03c910d1f84c 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -102,6 +102,7 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
case BPF_PROG_TYPE_FLOW_DISSECTOR:
case BPF_PROG_TYPE_CGROUP_SYSCTL:
case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+ case BPF_PROG_TYPE_LANDLOCK_HOOK:
default:
break;
}
--
2.22.0
^ permalink raw reply related
* [PATCH bpf-next v10 04/10] seccomp,landlock: Enforce Landlock programs per process hierarchy
From: Mickaël Salaün @ 2019-07-21 21:31 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Alexander Viro, Alexei Starovoitov,
Andrew Morton, Andy Lutomirski, Arnaldo Carvalho de Melo,
Casey Schaufler, Daniel Borkmann, David Drysdale,
David S . Miller, Eric W . Biederman, James Morris, Jann Horn,
John Johansen, Jonathan Corbet, Kees Cook, Michael Kerrisk,
Mickaël Salaün
In-Reply-To: <20190721213116.23476-1-mic@digikod.net>
The seccomp(2) syscall can be used by a task to apply a Landlock program
to itself. As a seccomp filter, a Landlock program is enforced for the
current task and all its future children. A program is immutable and a
task can only add new restricting programs to itself, forming a list of
programs.
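The inheritance model described above can be sketched in plain userspace C, under simplifying assumptions: a program is represented by an integer ID instead of a struct bpf_prog pointer, a plain counter replaces refcount_t, and the names `prog_node`, `get_progs()`, `prepend_prog()`, `put_progs()` and `count_progs()` are hypothetical. The point is only that prepending never modifies existing nodes, so a child can extend its own list while sharing the older entries with its parent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for struct landlock_prog_list. */
struct prog_node {
	struct prog_node *prev;	/* older programs, possibly shared */
	int prog_id;		/* stand-in for struct bpf_prog * */
	unsigned int usage;	/* number of pointers to this node */
};

/* Take an extra reference on a list head, as a fork would. */
static struct prog_node *get_progs(struct prog_node *head)
{
	if (head)
		head->usage++;
	return head;
}

/*
 * Prepend a program. The caller's reference on @head moves into the new
 * node and existing nodes are left untouched. Returns NULL on allocation
 * failure, in which case the caller still owns its reference on @head.
 */
static struct prog_node *prepend_prog(struct prog_node *head, int prog_id)
{
	struct prog_node *node = malloc(sizeof(*node));

	if (!node)
		return NULL;
	node->prev = head;
	node->prog_id = prog_id;
	node->usage = 1;
	return node;
}

/* Drop one reference, freeing nodes until one is still in use. */
static void put_progs(struct prog_node *head)
{
	while (head) {
		struct prog_node *freeme = head;

		assert(head->usage > 0);
		if (--head->usage > 0)
			break;
		head = head->prev;
		free(freeme);
	}
}

/* Number of programs visible from @head, newest first. */
static size_t count_progs(const struct prog_node *head)
{
	size_t n = 0;

	for (; head; head = head->prev)
		n++;
	return n;
}
```

A child created with get_progs() and then extended with prepend_prog() sees its own program plus every program of the parent, while the parent's view is unchanged.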
A Landlock program is tied to a Landlock hook. If the action on a kernel
object is allowed by the other Linux security mechanisms (e.g. DAC,
capabilities, other LSM), then a Landlock hook related to this kind of
object is triggered. The list of programs for this hook is then
evaluated. Each program returns a binary value: a non-zero value denies the
action on the kernel object. If every program in the list returns zero, then
the action on the object is allowed.
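The decision rule above amounts to a logical OR over the program results. The following is a minimal userspace sketch of that rule, assuming each program can be modelled as a C function taking the inode value from its context; the type `landlock_prog_fn` and the functions `evaluate_progs()`, `allow_all()` and `deny_inode_42()` are hypothetical and only serve the illustration. The error code actually reported to the caller on denial is not covered here.

```c
#include <assert.h>
#include <stddef.h>

/* Return values from the Landlock UAPI header. */
#define LANDLOCK_RET_ALLOW	0
#define LANDLOCK_RET_DENY	1

/* Hypothetical stand-in for running one eBPF program on a context. */
typedef int (*landlock_prog_fn)(unsigned long long inode);

/* Deny as soon as one program returns non-zero, allow otherwise. */
static int evaluate_progs(const landlock_prog_fn *progs, size_t n,
			  unsigned long long inode)
{
	size_t i;

	assert(progs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (progs[i](inode) != 0)
			return LANDLOCK_RET_DENY;
	}
	return LANDLOCK_RET_ALLOW;
}

/* Example program: never objects. */
static int allow_all(unsigned long long inode)
{
	(void)inode;
	return LANDLOCK_RET_ALLOW;
}

/* Example program: denies access to one specific (made-up) inode value. */
static int deny_inode_42(unsigned long long inode)
{
	return inode == 42 ? LANDLOCK_RET_DENY : LANDLOCK_RET_ALLOW;
}
```

An empty list allows the action, which matches the early-exit case of a task that has no Landlock program attached.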
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: James Morris <jmorris@namei.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Will Drewry <wad@chromium.org>
Link: https://lkml.kernel.org/r/c10a503d-5e35-7785-2f3d-25ed8dd63fab@digikod.net
---
Changes since v9:
* replace subtype with expected_attach_type and expected_attach_triggers
Changes since v8:
* Remove the chaining concept from the eBPF program contexts (chain and
cookie). We need to keep these subtypes this way to be able to make
them evolve, though.
Changes since v7:
* handle and verify program chains
* split and rename providers.c to enforce.c and enforce_seccomp.c
* rename LANDLOCK_SUBTYPE_* to LANDLOCK_*
Changes since v6:
* rename some functions with more accurate names to reflect that an eBPF
program for Landlock could be used for something else than a rule
* reword rule "appending" to "prepending" and explain it
* remove the superfluous no_new_privs check, only check global
CAP_SYS_ADMIN when prepending a Landlock rule (needed for containers)
* create and use {get,put}_seccomp_landlock() (suggested by Kees Cook)
* replace ifdef with static inlined function (suggested by Kees Cook)
* use get_user() (suggested by Kees Cook)
* replace atomic_t with refcount_t (requested by Kees Cook)
* move struct landlock_{rule,events} from landlock.h to common.h
* cleanup headers
Changes since v5:
* remove struct landlock_node and use a similar inheritance mechanism
as seccomp-bpf (requested by Andy Lutomirski)
* rename SECCOMP_ADD_LANDLOCK_RULE to SECCOMP_APPEND_LANDLOCK_RULE
* rename file manager.c to providers.c
* add comments
* typo and cosmetic fixes
Changes since v4:
* merge manager and seccomp patches
* return -EFAULT in seccomp(2) when user_bpf_fd is null to easily check
if Landlock is supported
* only allow a process with the global CAP_SYS_ADMIN to use Landlock
(will be lifted in the future)
* add an early check to exit as soon as possible if the current process
does not have Landlock rules
Changes since v3:
* remove the hard link with seccomp (suggested by Andy Lutomirski and
Kees Cook):
* remove the cookie which could imply multiple evaluation of Landlock
rules
* remove the origin field in struct landlock_data
* remove documentation fix (merged upstream)
* rename the new seccomp command to SECCOMP_ADD_LANDLOCK_RULE
* internal renaming
* split commit
* new design to be able to inherit on the fly the parent rules
Changes since v2:
* Landlock programs can now be run without seccomp filter but for any
syscall (from the process) or interruption
* move Landlock related functions and structs into security/landlock/*
(to manage cgroups as well)
* fix seccomp filter handling: run Landlock programs for each of their
legitimate seccomp filter
* properly clean up all seccomp results
* cosmetic changes to ease the understanding
* fix some ifdef
---
include/linux/landlock.h | 34 ++++
include/linux/seccomp.h | 5 +
include/uapi/linux/seccomp.h | 1 +
kernel/fork.c | 8 +-
kernel/seccomp.c | 4 +
security/landlock/Makefile | 3 +-
security/landlock/common.h | 38 ++++
security/landlock/enforce.c | 272 ++++++++++++++++++++++++++++
security/landlock/enforce.h | 18 ++
security/landlock/enforce_seccomp.c | 92 ++++++++++
10 files changed, 473 insertions(+), 2 deletions(-)
create mode 100644 include/linux/landlock.h
create mode 100644 security/landlock/enforce.c
create mode 100644 security/landlock/enforce.h
create mode 100644 security/landlock/enforce_seccomp.c
diff --git a/include/linux/landlock.h b/include/linux/landlock.h
new file mode 100644
index 000000000000..8ac7942f50fc
--- /dev/null
+++ b/include/linux/landlock.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Landlock LSM - public kernel headers
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#ifndef _LINUX_LANDLOCK_H
+#define _LINUX_LANDLOCK_H
+
+#include <linux/errno.h>
+#include <linux/sched.h> /* task_struct */
+
+#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
+extern int landlock_seccomp_prepend_prog(unsigned int flags,
+ const int __user *user_bpf_fd);
+extern void put_seccomp_landlock(struct task_struct *tsk);
+extern void get_seccomp_landlock(struct task_struct *tsk);
+#else /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
+static inline int landlock_seccomp_prepend_prog(unsigned int flags,
+ const int __user *user_bpf_fd)
+{
+ return -EINVAL;
+}
+static inline void put_seccomp_landlock(struct task_struct *tsk)
+{
+}
+static inline void get_seccomp_landlock(struct task_struct *tsk)
+{
+}
+#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
+
+#endif /* _LINUX_LANDLOCK_H */
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 84868d37b35d..106a0ceff3d7 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -11,6 +11,7 @@
#ifdef CONFIG_SECCOMP
+#include <linux/landlock.h>
#include <linux/thread_info.h>
#include <asm/seccomp.h>
@@ -22,6 +23,7 @@ struct seccomp_filter;
* system calls available to a process.
* @filter: must always point to a valid seccomp-filter or NULL as it is
* accessed without locking during system call entry.
+ * @landlock_prog_set: contains a set of Landlock programs.
*
* @filter must only be accessed from the context of current as there
* is no read locking.
@@ -29,6 +31,9 @@ struct seccomp_filter;
struct seccomp {
int mode;
struct seccomp_filter *filter;
+#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
+ struct landlock_prog_set *landlock_prog_set;
+#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
};
#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 90734aa5aa36..bce6534e7feb 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -16,6 +16,7 @@
#define SECCOMP_SET_MODE_FILTER 1
#define SECCOMP_GET_ACTION_AVAIL 2
#define SECCOMP_GET_NOTIF_SIZES 3
+#define SECCOMP_PREPEND_LANDLOCK_PROG 4
/* Valid flags for SECCOMP_SET_MODE_FILTER */
#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0)
diff --git a/kernel/fork.c b/kernel/fork.c
index 8f3e2d97d771..6c43517abdb9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -51,6 +51,7 @@
#include <linux/security.h>
#include <linux/hugetlb.h>
#include <linux/seccomp.h>
+#include <linux/landlock.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -458,6 +459,7 @@ void free_task(struct task_struct *tsk)
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
put_seccomp_filter(tsk);
+ put_seccomp_landlock(tsk);
arch_release_task_struct(tsk);
if (tsk->flags & PF_KTHREAD)
free_kthread_struct(tsk);
@@ -888,7 +890,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
* the usage counts on the error path calling free_task.
*/
tsk->seccomp.filter = NULL;
-#endif
+#ifdef CONFIG_SECURITY_LANDLOCK
+ tsk->seccomp.landlock_prog_set = NULL;
+#endif /* CONFIG_SECURITY_LANDLOCK */
+#endif /* CONFIG_SECCOMP */
setup_thread_stack(tsk, orig);
clear_user_return_notifier(tsk);
@@ -1604,6 +1609,7 @@ static void copy_seccomp(struct task_struct *p)
/* Ref-count the new filter user, and assign it. */
get_seccomp_filter(current);
+ get_seccomp_landlock(current);
p->seccomp = current->seccomp;
/*
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index dba52a7db5e8..af542a2d21e7 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -41,6 +41,7 @@
#include <linux/tracehook.h>
#include <linux/uaccess.h>
#include <linux/anon_inodes.h>
+#include <linux/landlock.h>
enum notify_state {
SECCOMP_NOTIFY_INIT,
@@ -1397,6 +1398,9 @@ static long do_seccomp(unsigned int op, unsigned int flags,
return -EINVAL;
return seccomp_get_notif_sizes(uargs);
+ case SECCOMP_PREPEND_LANDLOCK_PROG:
+ return landlock_seccomp_prepend_prog(flags,
+ (const int __user *)uargs);
default:
return -EINVAL;
}
diff --git a/security/landlock/Makefile b/security/landlock/Makefile
index 7205f9a7a2ee..2a1a7082a365 100644
--- a/security/landlock/Makefile
+++ b/security/landlock/Makefile
@@ -1,3 +1,4 @@
obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
-landlock-y := init.o
+landlock-y := init.o \
+ enforce.o enforce_seccomp.o
diff --git a/security/landlock/common.h b/security/landlock/common.h
index 80dc36f4d0ac..2cf36dbf4560 100644
--- a/security/landlock/common.h
+++ b/security/landlock/common.h
@@ -28,6 +28,44 @@ enum landlock_hook_type {
LANDLOCK_HOOK_FS_WALK,
};
+struct landlock_prog_list {
+ struct landlock_prog_list *prev;
+ struct bpf_prog *prog;
+ refcount_t usage;
+};
+
+/**
+ * struct landlock_prog_set - Landlock programs enforced on a thread
+ *
+ * This keeps the performance impact low when forking a process. Instead of
+ * copying the full array and incrementing the usage of each entry, only
+ * create a pointer to &struct landlock_prog_set and increment its usage. When
+ * prepending a new program, if &struct landlock_prog_set is shared with other
+ * tasks, then duplicate it and prepend the program to this new &struct
+ * landlock_prog_set.
+ *
+ * @usage: reference count to manage the object lifetime. When a thread needs
+ * to add Landlock programs and @usage is greater than 1, the thread
+ * must duplicate &struct landlock_prog_set so as not to change the
+ * children's programs as well.
+ * @programs: array of non-NULL &struct landlock_prog_list pointers
+ */
+struct landlock_prog_set {
+ struct landlock_prog_list *programs[_LANDLOCK_HOOK_LAST];
+ refcount_t usage;
+};
+
+/**
+ * get_hook_index - get an index for the programs of struct landlock_prog_set
+ *
+ * @type: a Landlock hook type
+ */
+static inline int get_hook_index(enum landlock_hook_type type)
+{
+ /* type ID > 0 for loaded programs */
+ return type - 1;
+}
+
static inline enum landlock_hook_type get_hook_type(const struct bpf_prog *prog)
{
switch (prog->expected_attach_type) {
diff --git a/security/landlock/enforce.c b/security/landlock/enforce.c
new file mode 100644
index 000000000000..b6979de69d3e
--- /dev/null
+++ b/security/landlock/enforce.c
@@ -0,0 +1,272 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Landlock LSM - enforcing helpers
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#include <asm/barrier.h> /* smp_store_release() */
+#include <asm/page.h> /* PAGE_SIZE */
+#include <linux/bpf.h> /* bpf_prog_put() */
+#include <linux/compiler.h> /* READ_ONCE() */
+#include <linux/err.h> /* PTR_ERR() */
+#include <linux/errno.h>
+#include <linux/filter.h> /* struct bpf_prog */
+#include <linux/refcount.h>
+#include <linux/slab.h> /* kzalloc(), kfree() */
+
+#include "common.h" /* struct landlock_prog_list */
+
+/* TODO: use a dedicated kmem_cache_alloc() instead of k*alloc() */
+
+static void put_landlock_prog_list(struct landlock_prog_list *prog_list)
+{
+ struct landlock_prog_list *orig = prog_list;
+
+ /* clean up single-reference branches iteratively */
+ while (orig && refcount_dec_and_test(&orig->usage)) {
+ struct landlock_prog_list *freeme = orig;
+
+ if (orig->prog)
+ bpf_prog_put(orig->prog);
+ orig = orig->prev;
+ kfree(freeme);
+ }
+}
+
+void landlock_put_prog_set(struct landlock_prog_set *prog_set)
+{
+ if (prog_set && refcount_dec_and_test(&prog_set->usage)) {
+ size_t i;
+
+ for (i = 0; i < ARRAY_SIZE(prog_set->programs); i++)
+ put_landlock_prog_list(prog_set->programs[i]);
+ kfree(prog_set);
+ }
+}
+
+void landlock_get_prog_set(struct landlock_prog_set *prog_set)
+{
+ if (!prog_set)
+ return;
+ refcount_inc(&prog_set->usage);
+}
+
+static struct landlock_prog_set *new_landlock_prog_set(void)
+{
+ struct landlock_prog_set *ret;
+
+ /* array filled with NULL values */
+ ret = kzalloc(sizeof(*ret), GFP_KERNEL);
+ if (!ret)
+ return ERR_PTR(-ENOMEM);
+ refcount_set(&ret->usage, 1);
+ return ret;
+}
+
+/**
+ * store_landlock_prog - prepend and deduplicate a Landlock prog_list
+ *
+ * Prepend @prog to @init_prog_set, ignoring @prog if it is already
+ * in @ref_prog_set. Whatever the result of this function call, you
+ * can call bpf_prog_put(@prog) afterwards.
+ *
+ * @init_prog_set: empty prog_set to prepend to
+ * @ref_prog_set: prog_set to check for duplicate programs
+ * @prog: program to prepend
+ *
+ * Return -errno on error or 0 if @prog was successfully stored.
+ */
+static int store_landlock_prog(struct landlock_prog_set *init_prog_set,
+ const struct landlock_prog_set *ref_prog_set,
+ struct bpf_prog *prog)
+{
+ struct landlock_prog_list *tmp_list = NULL;
+ int err;
+ u32 hook_idx;
+ enum landlock_hook_type last_type;
+ struct bpf_prog *new = prog;
+
+ /* allocate all the memory we need */
+ struct landlock_prog_list *new_list;
+
+ last_type = get_hook_type(new);
+
+ /* ignore duplicate programs */
+ if (ref_prog_set) {
+ struct landlock_prog_list *ref;
+
+ hook_idx = get_hook_index(get_hook_type(new));
+ for (ref = ref_prog_set->programs[hook_idx];
+ ref; ref = ref->prev) {
+ if (ref->prog == new)
+ return -EINVAL;
+ }
+ }
+
+ new = bpf_prog_inc(new);
+ if (IS_ERR(new)) {
+ err = PTR_ERR(new);
+ goto put_tmp_list;
+ }
+ new_list = kzalloc(sizeof(*new_list), GFP_KERNEL);
+ if (!new_list) {
+ bpf_prog_put(new);
+ err = -ENOMEM;
+ goto put_tmp_list;
+ }
+ /* ignore Landlock types in this tmp_list */
+ new_list->prog = new;
+ new_list->prev = tmp_list;
+ refcount_set(&new_list->usage, 1);
+ tmp_list = new_list;
+
+ if (!tmp_list)
+ /* inform user space that this program was already added */
+ return -EEXIST;
+
+ /* properly store the list (without error cases) */
+ while (tmp_list) {
+ struct landlock_prog_list *new_list;
+
+ new_list = tmp_list;
+ tmp_list = tmp_list->prev;
+ /* do not increment the previous prog list usage */
+ hook_idx = get_hook_index(get_hook_type(new_list->prog));
+ new_list->prev = init_prog_set->programs[hook_idx];
+ /* no need to add from the last program to the first because
+ * each of them are a different Landlock type */
+ smp_store_release(&init_prog_set->programs[hook_idx], new_list);
+ }
+ return 0;
+
+put_tmp_list:
+ put_landlock_prog_list(tmp_list);
+ return err;
+}
+
+/* limit Landlock programs set to 256KB */
+#define LANDLOCK_PROGRAMS_MAX_PAGES (1 << 6)
+
+/**
+ * landlock_prepend_prog - attach a Landlock prog_list to @current_prog_set
+ *
+ * Whatever is the result of this function call, you can call
+ * bpf_prog_put(@prog) after.
+ *
+ * @current_prog_set: landlock_prog_set pointer, must be locked (if needed) to
+ * prevent a concurrent put/free. This pointer must not be
+ * freed after the call.
+ * @prog: non-NULL Landlock prog_list to prepend to @current_prog_set. @prog
+ * will be owned by landlock_prepend_prog() and freed if an error
+ * happened.
+ *
+ * Return @current_prog_set or a new pointer when OK. Return a pointer error
+ * otherwise.
+ */
+struct landlock_prog_set *landlock_prepend_prog(
+ struct landlock_prog_set *current_prog_set,
+ struct bpf_prog *prog)
+{
+ struct landlock_prog_set *new_prog_set = current_prog_set;
+ unsigned long pages;
+ int err;
+ size_t i;
+ struct landlock_prog_set tmp_prog_set = {};
+
+ if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
+ return ERR_PTR(-EINVAL);
+
+ /* validate memory size allocation */
+ pages = prog->pages;
+ if (current_prog_set) {
+ size_t i;
+
+ for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
+ struct landlock_prog_list *walker_p;
+
+ for (walker_p = current_prog_set->programs[i];
+ walker_p; walker_p = walker_p->prev)
+ pages += walker_p->prog->pages;
+ }
+ /* count a struct landlock_prog_set if we need to allocate one */
+ if (refcount_read(&current_prog_set->usage) != 1)
+ pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
+ / PAGE_SIZE;
+ }
+ if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
+ return ERR_PTR(-E2BIG);
+
+ /* ensure early that we can allocate enough memory for the new
+ * prog_lists */
+ err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
+ if (err)
+ return ERR_PTR(err);
+
+ /*
+ * Each task_struct points to an array of prog list pointers. These
+ * tables are duplicated when additions are made (which means each
+ * table needs to be refcounted for the processes using it). When a new
+ * table is created, all the refcounters on the prog_list are bumped (to
+ * track each table that references the prog). When a new prog is
+ * added, it's just prepended to the list for the new table to point
+ * at.
+ *
+ * Manage all the possible errors before this step to not uselessly
+ * duplicate current_prog_set and avoid a rollback.
+ */
+ if (!new_prog_set) {
+ /*
+ * If there is no Landlock program set used by the current task,
+ * then create a new one.
+ */
+ new_prog_set = new_landlock_prog_set();
+ if (IS_ERR(new_prog_set))
+ goto put_tmp_lists;
+ } else if (refcount_read(&current_prog_set->usage) > 1) {
+ /*
+ * If the current task is not the sole user of its Landlock
+ * program set, then duplicate them.
+ */
+ new_prog_set = new_landlock_prog_set();
+ if (IS_ERR(new_prog_set))
+ goto put_tmp_lists;
+ for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
+ new_prog_set->programs[i] =
+ READ_ONCE(current_prog_set->programs[i]);
+ if (new_prog_set->programs[i])
+ refcount_inc(&new_prog_set->programs[i]->usage);
+ }
+
+ /*
+ * Landlock program set from the current task will not be freed
+ * here because the usage is strictly greater than 1. It is
+ * only prevented to be freed by another task thanks to the
+ * caller of landlock_prepend_prog() which should be locked if
+ * needed.
+ */
+ landlock_put_prog_set(current_prog_set);
+ }
+
+ /* prepend tmp_prog_set to new_prog_set */
+ for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
+ /* get the last new list */
+ struct landlock_prog_list *last_list =
+ tmp_prog_set.programs[i];
+
+ if (last_list) {
+ while (last_list->prev)
+ last_list = last_list->prev;
+ /* no need to increment usage (pointer replacement) */
+ last_list->prev = new_prog_set->programs[i];
+ new_prog_set->programs[i] = tmp_prog_set.programs[i];
+ }
+ }
+ return new_prog_set;
+
+put_tmp_lists:
+ for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
+ put_landlock_prog_list(tmp_prog_set.programs[i]);
+ return new_prog_set;
+}
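The copy-on-write behaviour described in the landlock_prepend_prog() comment can be modelled in user space. The sketch below is not the kernel code: `struct node`, `struct set` and `model_prepend()` are hypothetical stand-ins, plain integers replace refcount_t, and only one hook type is tracked. It shows the three cases handled above: no set yet, a set shared with another task, and a set owned by a single task.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct node {
	int usage;		/* plain counter standing in for refcount_t */
	int prog_id;		/* stands in for struct bpf_prog * */
	struct node *prev;
};

struct set {
	int usage;
	struct node *head;	/* a single hook type, for brevity */
};

/* Prepend prog_id to cur; duplicate the set first if it is shared. */
static struct set *model_prepend(struct set *cur, int prog_id)
{
	struct node *n;
	struct set *out = cur;

	assert(!cur || cur->usage >= 1);
	n = calloc(1, sizeof(*n));
	if (!n)
		return NULL;
	n->usage = 1;
	n->prog_id = prog_id;

	if (!cur || cur->usage > 1) {
		out = calloc(1, sizeof(*out));
		if (!out) {
			free(n);
			return NULL;
		}
		out->usage = 1;
		if (cur) {
			/* share the existing list and drop our reference on the old set */
			out->head = cur->head;
			if (out->head)
				out->head->usage++;
			cur->usage--;
		}
	}
	/* pointer replacement: the new node inherits the set's reference on the old head */
	n->prev = out->head;
	out->head = n;
	return out;
}
```

In this model a set with a usage of 1 is modified in place, while a shared set is left untouched for its other users and only the list nodes are shared between the two sets.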
diff --git a/security/landlock/enforce.h b/security/landlock/enforce.h
new file mode 100644
index 000000000000..39b800d9999f
--- /dev/null
+++ b/security/landlock/enforce.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Landlock LSM - enforcing helpers headers
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#ifndef _SECURITY_LANDLOCK_ENFORCE_H
+#define _SECURITY_LANDLOCK_ENFORCE_H
+
+struct landlock_prog_set *landlock_prepend_prog(
+ struct landlock_prog_set *current_prog_set,
+ struct bpf_prog *prog);
+void landlock_put_prog_set(struct landlock_prog_set *prog_set);
+void landlock_get_prog_set(struct landlock_prog_set *prog_set);
+
+#endif /* _SECURITY_LANDLOCK_ENFORCE_H */
diff --git a/security/landlock/enforce_seccomp.c b/security/landlock/enforce_seccomp.c
new file mode 100644
index 000000000000..c38c81e6b01a
--- /dev/null
+++ b/security/landlock/enforce_seccomp.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Landlock LSM - enforcing with seccomp
+ *
+ * Copyright © 2016-2018 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#ifdef CONFIG_SECCOMP_FILTER
+
+#include <linux/bpf.h> /* bpf_prog_put() */
+#include <linux/capability.h>
+#include <linux/err.h> /* PTR_ERR() */
+#include <linux/errno.h>
+#include <linux/filter.h> /* struct bpf_prog */
+#include <linux/landlock.h>
+#include <linux/refcount.h>
+#include <linux/sched.h> /* current */
+#include <linux/uaccess.h> /* get_user() */
+
+#include "enforce.h"
+
+/* headers in include/linux/landlock.h */
+
+/**
+ * landlock_seccomp_prepend_prog - attach a Landlock program to the current
+ * process
+ *
+ * current->seccomp.landlock_prog_set is lazily allocated. When a process
+ * forks, only a pointer is copied. When a process adds a new program while
+ * other references to its prog_set exist, a new prog_set is allocated to
+ * hold an array pointing to the Landlock program lists. This design has a
+ * low performance impact and is memory efficient while keeping the
+ * property of prepend-only programs.
+ *
+ * For now, installing a Landlock prog requires that the requesting task has
+ * the global CAP_SYS_ADMIN. We cannot force the use of no_new_privs to not
+ * exclude containers where a process may legitimately acquire more privileges
+ * thanks to an SUID binary.
+ *
+ * @flags: not used for now, but could be used for TSYNC
+ * @user_bpf_fd: file descriptor pointing to a loaded Landlock prog
+ */
+int landlock_seccomp_prepend_prog(unsigned int flags,
+ const int __user *user_bpf_fd)
+{
+ struct landlock_prog_set *new_prog_set;
+ struct bpf_prog *prog;
+ int bpf_fd, err;
+
+ /* planned to be replaced with a no_new_privs check to allow
+ * unprivileged tasks */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ /* let user space check whether Landlock is supported via an early EFAULT */
+ if (!user_bpf_fd)
+ return -EFAULT;
+ if (flags)
+ return -EINVAL;
+ err = get_user(bpf_fd, user_bpf_fd);
+ if (err)
+ return err;
+
+ prog = bpf_prog_get(bpf_fd);
+ if (IS_ERR(prog))
+ return PTR_ERR(prog);
+
+ /*
+ * We don't need to lock anything for the current process hierarchy,
+ * everything is guarded by the atomic counters.
+ */
+ new_prog_set = landlock_prepend_prog(
+ current->seccomp.landlock_prog_set, prog);
+ bpf_prog_put(prog);
+ /* @prog is managed/freed by landlock_prepend_prog() */
+ if (IS_ERR(new_prog_set))
+ return PTR_ERR(new_prog_set);
+ current->seccomp.landlock_prog_set = new_prog_set;
+ return 0;
+}
+
+void put_seccomp_landlock(struct task_struct *tsk)
+{
+ landlock_put_prog_set(tsk->seccomp.landlock_prog_set);
+}
+
+void get_seccomp_landlock(struct task_struct *tsk)
+{
+ landlock_get_prog_set(tsk->seccomp.landlock_prog_set);
+}
+
+#endif /* CONFIG_SECCOMP_FILTER */
--
2.22.0
* [PATCH bpf-next v10 05/10] landlock: Handle filesystem access control
From: Mickaël Salaün @ 2019-07-21 21:31 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Alexander Viro, Alexei Starovoitov,
Andrew Morton, Andy Lutomirski, Arnaldo Carvalho de Melo,
Casey Schaufler, Daniel Borkmann, David Drysdale,
David S . Miller, Eric W . Biederman, James Morris, Jann Horn,
John Johansen, Jonathan Corbet, Kees Cook, Michael Kerrisk,
Mickaël Salaün
In-Reply-To: <20190721213116.23476-1-mic@digikod.net>
This adds two Landlock hooks: FS_WALK and FS_PICK.
The FS_WALK hook is used to walk through a file path. A program tied to
this hook will be evaluated for each directory traversal except the last
one if it is the leaf of the path. It is important to differentiate
this hook from FS_PICK to enable more powerful path evaluation in the
future (cf. Landlock patch v8).
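As a rough illustration of how the two hooks are told apart, the following user-space sketch models the check done in hook_inode_permission() in this patch: an execute-only access on a directory is treated as a path walk step, anything else goes to FS_PICK. The `MODEL_*` names and values and `model_select_hook()` are assumptions made for the example; the real MAY_* flags come from <linux/fs.h>.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel MAY_* permission bits. */
#define MODEL_MAY_EXEC	0x1
#define MODEL_MAY_WRITE	0x2
#define MODEL_MAY_READ	0x4

enum model_hook { MODEL_FS_WALK, MODEL_FS_PICK };

/* Select the hook type for one inode permission check. */
static enum model_hook model_select_hook(int may_mask, bool is_dir)
{
	assert(may_mask != 0);
	/* execute-only on a directory: this is a traversal step */
	if (is_dir && may_mask == MODEL_MAY_EXEC)
		return MODEL_FS_WALK;
	return MODEL_FS_PICK;
}
```

Under this model, listing a directory (read and execute) or executing a regular file is handled by FS_PICK programs, and only plain traversal reaches FS_WALK programs.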
The FS_PICK hook is used to validate a set of actions requested on a
file. These actions are defined with triggers (e.g. read, write, open,
append...).
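A simplified user-space model of the trigger mapping may help here. It follows the shape of fs_may_to_triggers() in this patch but covers only a few bits; the `MODEL_*` constants, their values and both function names are assumptions for illustration, not the UAPI definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for MAY_* and LANDLOCK_TRIGGER_FS_PICK_*. */
#define MODEL_MAY_EXEC	0x1
#define MODEL_MAY_WRITE	0x2
#define MODEL_MAY_READ	0x4

#define MODEL_TRIGGER_EXECUTE	(1ULL << 0)
#define MODEL_TRIGGER_READ	(1ULL << 1)
#define MODEL_TRIGGER_READDIR	(1ULL << 2)
#define MODEL_TRIGGER_WRITE	(1ULL << 3)

/* Translate a permission mask into a trigger bitmask. */
static uint64_t model_may_to_triggers(int may_mask, bool is_dir)
{
	uint64_t ret = 0;

	if (may_mask & MODEL_MAY_EXEC)
		ret |= MODEL_TRIGGER_EXECUTE;
	if (may_mask & MODEL_MAY_READ)
		ret |= is_dir ? MODEL_TRIGGER_READDIR : MODEL_TRIGGER_READ;
	if (may_mask & MODEL_MAY_WRITE)
		ret |= MODEL_TRIGGER_WRITE;
	/* mirrors the WARN_ON(!ret) in the patch */
	assert(ret != 0);
	return ret;
}

/* A program runs only if it expects at least one requested trigger. */
static bool model_prog_wants(uint64_t expected, uint64_t triggers)
{
	return (expected & triggers) != 0;
}
```

A program that registered only for the write trigger would therefore be skipped for a read-only access in this model.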
The Landlock LSM hook registration is done after other LSMs to only run
actions from user-space, via eBPF programs, if the access was granted by
major (privileged) LSMs.
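The effect of registering last can be sketched with a small user-space model of a hook chain in which the first non-zero result stops the walk. This assumes the usual LSM behaviour of returning on the first denial; `model_call_hooks()` and the sample hooks are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*model_hook_fn)(void);

/* Counts how often the modelled Landlock hook actually ran. */
static int model_landlock_calls;

static int model_major_allow(void) { return 0; }
static int model_major_deny(void) { return -13; /* -EACCES */ }
static int model_landlock(void) { model_landlock_calls++; return 0; }

/* Run hooks in registration order; the first non-zero result wins. */
static int model_call_hooks(const model_hook_fn *hooks, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		int rc;

		assert(hooks[i] != NULL);
		rc = hooks[i]();
		if (rc)
			return rc;
	}
	return 0;
}
```

With the Landlock hook placed last in the array, it is reached only when every earlier hook returned 0, which is the ordering property the paragraph above relies on.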
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
---
Changes since v9:
* replace subtype with expected_attach_type and expected_attach_triggers
Changes since v8:
* add a new LSM_ORDER_LAST, cf. commit e2bc445b66ca ("LSM: Introduce
enum lsm_order")
* add WARN_ON() for pointer dereferencing
* remove the FS_GET subtype which relies on program chaining
* remove the subtype option which was only used for chaining (with the
"previous" field)
* remove inode_lookup which depends on the (removed) nameidata security
blob
* remove eBPF helpers to get and set Landlock inode tags
* do not use task LSM credentials (for now)
Changes since v7:
* major rewrite with clean Landlock hooks able to deal with file paths
Changes since v6:
* add 3 more sub-events: IOCTL, LOCK, FCNTL
https://lkml.kernel.org/r/2fbc99a6-f190-f335-bd14-04bdeed35571@digikod.net
* use the new security_add_hooks()
* explain the -Werror=unused-function
* constify pointers
* cleanup headers
Changes since v5:
* split hooks.[ch] into hooks.[ch] and hooks_fs.[ch]
* add more documentation
* cosmetic fixes
* rebase (SCALAR_VALUE)
Changes since v4:
* add LSM hook abstraction called Landlock event
* use the compiler type checking to verify hooks use by an event
* handle all filesystem related LSM hooks (e.g. file_permission,
mmap_file, sb_mount...)
* register BPF programs for Landlock just after LSM hooks registration
* move hooks registration after other LSMs
* add failsafes to check if a hook is not used by the kernel
* allow partial raw value access from the context (needed for programs
generated by LLVM)
Changes since v3:
* split commit
* add hooks dealing with struct inode and struct path pointers:
inode_permission and inode_getattr
* add abstraction over eBPF helper arguments thanks to wrapping structs
---
include/linux/lsm_hooks.h | 1 +
security/landlock/Makefile | 3 +-
security/landlock/common.h | 9 +
security/landlock/hooks.c | 94 ++++++
security/landlock/hooks.h | 31 ++
security/landlock/hooks_fs.c | 554 +++++++++++++++++++++++++++++++++++
security/landlock/hooks_fs.h | 31 ++
security/landlock/init.c | 33 +++
security/security.c | 15 +
9 files changed, 770 insertions(+), 1 deletion(-)
create mode 100644 security/landlock/hooks.c
create mode 100644 security/landlock/hooks.h
create mode 100644 security/landlock/hooks_fs.c
create mode 100644 security/landlock/hooks_fs.h
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index df1318d85f7d..c06ad8a1d424 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -2092,6 +2092,7 @@ extern void security_add_hooks(struct security_hook_list *hooks, int count,
enum lsm_order {
LSM_ORDER_FIRST = -1, /* This is only for capabilities. */
LSM_ORDER_MUTABLE = 0,
+ LSM_ORDER_LAST = 1, /* potentially-unprivileged LSM */
};
struct lsm_info {
diff --git a/security/landlock/Makefile b/security/landlock/Makefile
index 2a1a7082a365..270ece5d93de 100644
--- a/security/landlock/Makefile
+++ b/security/landlock/Makefile
@@ -1,4 +1,5 @@
obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
landlock-y := init.o \
- enforce.o enforce_seccomp.o
+ enforce.o enforce_seccomp.o \
+ hooks.o hooks_fs.o
diff --git a/security/landlock/common.h b/security/landlock/common.h
index 2cf36dbf4560..b2ee018eb6fc 100644
--- a/security/landlock/common.h
+++ b/security/landlock/common.h
@@ -79,4 +79,13 @@ static inline enum landlock_hook_type get_hook_type(const struct bpf_prog *prog)
}
}
+__maybe_unused
+static bool current_has_prog_type(enum landlock_hook_type hook_type)
+{
+ struct landlock_prog_set *prog_set;
+
+ prog_set = current->seccomp.landlock_prog_set;
+ return (prog_set && prog_set->programs[get_hook_index(hook_type)]);
+}
+
#endif /* _SECURITY_LANDLOCK_COMMON_H */
diff --git a/security/landlock/hooks.c b/security/landlock/hooks.c
new file mode 100644
index 000000000000..97c54957f17b
--- /dev/null
+++ b/security/landlock/hooks.c
@@ -0,0 +1,94 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Landlock LSM - hook helpers
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#include <asm/current.h>
+#include <linux/bpf.h> /* struct bpf_prog_aux */
+#include <linux/errno.h>
+#include <linux/filter.h> /* BPF_PROG_RUN() */
+#include <linux/rculist.h> /* list_add_tail_rcu */
+#include <uapi/linux/landlock.h> /* struct landlock_context */
+
+#include "common.h" /* struct landlock_rule, get_hook_index() */
+#include "hooks.h" /* landlock_hook_ctx */
+
+#include "hooks_fs.h"
+
+/* return a Landlock program context (e.g. hook_ctx->fs_walk.prog_ctx) */
+static const void *get_ctx(enum landlock_hook_type hook_type,
+ struct landlock_hook_ctx *hook_ctx)
+{
+ switch (hook_type) {
+ case LANDLOCK_HOOK_FS_WALK:
+ return landlock_get_ctx_fs_walk(hook_ctx->fs_walk);
+ case LANDLOCK_HOOK_FS_PICK:
+ return landlock_get_ctx_fs_pick(hook_ctx->fs_pick);
+ }
+ WARN_ON(1);
+ return NULL;
+}
+
+/**
+ * landlock_access_deny - run Landlock programs tied to a hook
+ *
+ * @hook_type: hook type, used to index the programs array
+ * @hook_ctx: non-NULL valid hook context
+ * @prog_set: Landlock program set pointer
+ * @triggers: a bitmask to check if a program should be run
+ *
+ * Return true if at least one program returns deny.
+ */
+static bool landlock_access_deny(enum landlock_hook_type hook_type,
+ struct landlock_hook_ctx *hook_ctx,
+ struct landlock_prog_set *prog_set, u64 triggers)
+{
+ struct landlock_prog_list *prog_list, *prev_list = NULL;
+ u32 hook_idx = get_hook_index(hook_type);
+
+ if (!prog_set)
+ return false;
+
+ for (prog_list = prog_set->programs[hook_idx];
+ prog_list; prog_list = prog_list->prev) {
+ u32 ret;
+ const void *prog_ctx;
+
+ /* check if @prog expects at least one of these triggers */
+ if (triggers && !(triggers & prog_list->prog->aux->
+ expected_attach_triggers))
+ continue;
+ prog_ctx = get_ctx(hook_type, hook_ctx);
+ if (!prog_ctx || WARN_ON(IS_ERR(prog_ctx)))
+ return true;
+ rcu_read_lock();
+ ret = BPF_PROG_RUN(prog_list->prog, prog_ctx);
+ rcu_read_unlock();
+ /* deny access if a program returns a value different than 0 */
+ if (ret)
+ return true;
+ if (prev_list && prog_list->prev && prog_list->prev->prog->
+ expected_attach_type ==
+ prev_list->prog->expected_attach_type)
+ WARN_ON(prog_list->prev != prev_list);
+ prev_list = prog_list;
+ }
+ return false;
+}
+
+int landlock_decide(enum landlock_hook_type hook_type,
+ struct landlock_hook_ctx *hook_ctx, u64 triggers)
+{
+ bool deny = false;
+
+#ifdef CONFIG_SECCOMP_FILTER
+ deny = landlock_access_deny(hook_type, hook_ctx,
+ current->seccomp.landlock_prog_set, triggers);
+#endif /* CONFIG_SECCOMP_FILTER */
+
+ /* should we use -EPERM or -EACCES? */
+ return deny ? -EACCES : 0;
+}
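The decision loop in landlock_access_deny() can be summarised with a user-space sketch. It keeps only the trigger filter and the "any non-zero return denies" rule; `struct model_prog`, `model_access_deny()` and the sample programs are hypothetical, and a plain function pointer stands in for BPF_PROG_RUN().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_prog {
	uint64_t expected_triggers;	/* stands in for aux->expected_attach_triggers */
	int (*run)(void);		/* non-zero means deny */
	const struct model_prog *prev;	/* older program in the list */
};

static int model_allow(void) { return 0; }
static int model_deny(void) { return 1; }

/* Walk the list from newest to oldest; deny as soon as one program denies. */
static bool model_access_deny(const struct model_prog *head, uint64_t triggers)
{
	const struct model_prog *p;

	for (p = head; p; p = p->prev) {
		/* a zero trigger mask (fs_walk) disables the filter */
		if (triggers && !(triggers & p->expected_triggers))
			continue;
		assert(p->run != NULL);
		if (p->run())
			return true;
	}
	return false;
}
```

An empty list never denies, and a program whose expected triggers do not intersect the requested ones is skipped rather than run.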
diff --git a/security/landlock/hooks.h b/security/landlock/hooks.h
new file mode 100644
index 000000000000..31446e6629fb
--- /dev/null
+++ b/security/landlock/hooks.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Landlock LSM - hooks helpers
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#include <asm/current.h>
+#include <linux/sched.h> /* struct task_struct */
+#include <linux/seccomp.h>
+
+#include "hooks_fs.h"
+
+struct landlock_hook_ctx {
+ union {
+ struct landlock_hook_ctx_fs_walk *fs_walk;
+ struct landlock_hook_ctx_fs_pick *fs_pick;
+ };
+};
+
+static inline bool landlocked(const struct task_struct *task)
+{
+#ifdef CONFIG_SECCOMP_FILTER
+ return !!(task->seccomp.landlock_prog_set);
+#else
+ return false;
+#endif /* CONFIG_SECCOMP_FILTER */
+}
+
+int landlock_decide(enum landlock_hook_type, struct landlock_hook_ctx *, u64);
diff --git a/security/landlock/hooks_fs.c b/security/landlock/hooks_fs.c
new file mode 100644
index 000000000000..3f81b7fc2938
--- /dev/null
+++ b/security/landlock/hooks_fs.c
@@ -0,0 +1,554 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Landlock LSM - filesystem hooks
+ *
+ * Copyright © 2016-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#include <linux/bpf.h> /* enum bpf_access_type */
+#include <linux/kernel.h> /* ARRAY_SIZE */
+#include <linux/lsm_hooks.h>
+#include <linux/rcupdate.h> /* synchronize_rcu() */
+#include <linux/stat.h> /* S_ISDIR */
+#include <linux/stddef.h> /* offsetof */
+#include <linux/types.h> /* uintptr_t */
+#include <linux/workqueue.h> /* INIT_WORK() */
+
+/* permissions translation */
+#include <linux/fs.h> /* MAY_* */
+#include <linux/mman.h> /* PROT_* */
+#include <linux/namei.h>
+
+/* hook arguments */
+#include <linux/dcache.h> /* struct dentry */
+#include <linux/fs.h> /* struct inode, struct iattr */
+#include <linux/mm_types.h> /* struct vm_area_struct */
+#include <linux/mount.h> /* struct vfsmount */
+#include <linux/path.h> /* struct path */
+#include <linux/sched.h> /* struct task_struct */
+#include <linux/time.h> /* struct timespec */
+
+#include "common.h"
+#include "hooks_fs.h"
+#include "hooks.h"
+
+/* fs_pick */
+
+#include <asm/page.h> /* PAGE_SIZE */
+#include <asm/syscall.h>
+#include <linux/dcache.h> /* d_path, dentry_path_raw */
+#include <linux/err.h> /* *_ERR */
+#include <linux/gfp.h> /* __get_free_page, GFP_KERNEL */
+#include <linux/path.h> /* struct path */
+
+bool landlock_is_valid_access_fs_pick(int off, enum bpf_access_type type,
+ enum bpf_reg_type *reg_type, int *max_size)
+{
+ switch (off) {
+ default:
+ return false;
+ }
+}
+
+bool landlock_is_valid_access_fs_walk(int off, enum bpf_access_type type,
+ enum bpf_reg_type *reg_type, int *max_size)
+{
+ switch (off) {
+ default:
+ return false;
+ }
+}
+
+/* fs_walk */
+
+struct landlock_hook_ctx_fs_walk {
+ struct landlock_ctx_fs_walk prog_ctx;
+};
+
+const struct landlock_ctx_fs_walk *landlock_get_ctx_fs_walk(
+ const struct landlock_hook_ctx_fs_walk *hook_ctx)
+{
+ if (WARN_ON(!hook_ctx))
+ return NULL;
+
+ return &hook_ctx->prog_ctx;
+}
+
+static int decide_fs_walk(int may_mask, struct inode *inode)
+{
+ struct landlock_hook_ctx_fs_walk fs_walk = {};
+ struct landlock_hook_ctx hook_ctx = {
+ .fs_walk = &fs_walk,
+ };
+ const enum landlock_hook_type hook_type = LANDLOCK_HOOK_FS_WALK;
+
+ if (!current_has_prog_type(hook_type))
+ /* no fs_walk */
+ return 0;
+ if (WARN_ON(!inode))
+ return -EFAULT;
+
+ /* init common data: inode */
+ fs_walk.prog_ctx.inode = (uintptr_t)inode;
+ return landlock_decide(hook_type, &hook_ctx, 0);
+}
+
+/* fs_pick */
+
+struct landlock_hook_ctx_fs_pick {
+ __u64 triggers;
+ struct landlock_ctx_fs_pick prog_ctx;
+};
+
+const struct landlock_ctx_fs_pick *landlock_get_ctx_fs_pick(
+ const struct landlock_hook_ctx_fs_pick *hook_ctx)
+{
+ if (WARN_ON(!hook_ctx))
+ return NULL;
+
+ return &hook_ctx->prog_ctx;
+}
+
+static int decide_fs_pick(__u64 triggers, struct inode *inode)
+{
+ struct landlock_hook_ctx_fs_pick fs_pick = {};
+ struct landlock_hook_ctx hook_ctx = {
+ .fs_pick = &fs_pick,
+ };
+ const enum landlock_hook_type hook_type = LANDLOCK_HOOK_FS_PICK;
+
+ if (WARN_ON(!triggers))
+ return 0;
+ if (!current_has_prog_type(hook_type))
+ /* no fs_pick */
+ return 0;
+ if (WARN_ON(!inode))
+ return -EFAULT;
+
+ fs_pick.triggers = triggers;
+ /* init common data: inode */
+ fs_pick.prog_ctx.inode = (uintptr_t)inode;
+ return landlock_decide(hook_type, &hook_ctx, fs_pick.triggers);
+}
+
+/* helpers */
+
+static u64 fs_may_to_triggers(int may_mask, umode_t mode)
+{
+ u64 ret = 0;
+
+ if (may_mask & MAY_EXEC)
+ ret |= LANDLOCK_TRIGGER_FS_PICK_EXECUTE;
+ if (may_mask & MAY_READ) {
+ if (S_ISDIR(mode))
+ ret |= LANDLOCK_TRIGGER_FS_PICK_READDIR;
+ else
+ ret |= LANDLOCK_TRIGGER_FS_PICK_READ;
+ }
+ if (may_mask & MAY_WRITE)
+ ret |= LANDLOCK_TRIGGER_FS_PICK_WRITE;
+ if (may_mask & MAY_APPEND)
+ ret |= LANDLOCK_TRIGGER_FS_PICK_APPEND;
+ if (may_mask & MAY_OPEN)
+ ret |= LANDLOCK_TRIGGER_FS_PICK_OPEN;
+ if (may_mask & MAY_CHROOT)
+ ret |= LANDLOCK_TRIGGER_FS_PICK_CHROOT;
+ else if (may_mask & MAY_CHDIR)
+ ret |= LANDLOCK_TRIGGER_FS_PICK_CHDIR;
+ /* XXX: ignore MAY_ACCESS */
+ WARN_ON(!ret);
+ return ret;
+}
+
+static inline u64 mem_prot_to_triggers(unsigned long prot, bool private)
+{
+ u64 ret = LANDLOCK_TRIGGER_FS_PICK_MAP;
+
+ /* private mappings do not write to files */
+ if (!private && (prot & PROT_WRITE))
+ ret |= LANDLOCK_TRIGGER_FS_PICK_WRITE;
+ if (prot & PROT_READ)
+ ret |= LANDLOCK_TRIGGER_FS_PICK_READ;
+ if (prot & PROT_EXEC)
+ ret |= LANDLOCK_TRIGGER_FS_PICK_EXECUTE;
+ WARN_ON(!ret);
+ return ret;
+}
+
+/* binder hooks */
+
+static int hook_binder_transfer_file(struct task_struct *from,
+ struct task_struct *to, struct file *file)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!file))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_TRANSFER,
+ file_inode(file));
+}
+
+/* sb hooks */
+
+static int hook_sb_statfs(struct dentry *dentry)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!dentry))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_GETATTR,
+ dentry->d_inode);
+}
+
+/* TODO: handle mount source and remount */
+static int hook_sb_mount(const char *dev_name, const struct path *path,
+ const char *type, unsigned long flags, void *data)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!path))
+ return 0;
+ if (WARN_ON(!path->dentry))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_MOUNTON,
+ path->dentry->d_inode);
+}
+
+/*
+ * The @old_path is similar to a destination mount point.
+ */
+static int hook_sb_pivotroot(const struct path *old_path,
+ const struct path *new_path)
+{
+ int err;
+
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!old_path))
+ return 0;
+ if (WARN_ON(!old_path->dentry))
+ return 0;
+ err = decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_MOUNTON,
+ old_path->dentry->d_inode);
+ if (err)
+ return err;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_CHROOT,
+ new_path->dentry->d_inode);
+}
+
+/* inode hooks */
+
+/* a directory inode contains only one dentry */
+static int hook_inode_create(struct inode *dir, struct dentry *dentry,
+ umode_t mode)
+{
+ if (!landlocked(current))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_CREATE, dir);
+}
+
+static int hook_inode_link(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *new_dentry)
+{
+ if (!landlocked(current))
+ return 0;
+ if (!WARN_ON(!old_dentry)) {
+ int ret = decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_LINK,
+ old_dentry->d_inode);
+ if (ret)
+ return ret;
+ }
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_LINKTO, dir);
+}
+
+static int hook_inode_unlink(struct inode *dir, struct dentry *dentry)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!dentry))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_UNLINK,
+ dentry->d_inode);
+}
+
+static int hook_inode_symlink(struct inode *dir, struct dentry *dentry,
+ const char *old_name)
+{
+ if (!landlocked(current))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_CREATE, dir);
+}
+
+static int hook_inode_mkdir(struct inode *dir, struct dentry *dentry,
+ umode_t mode)
+{
+ if (!landlocked(current))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_CREATE, dir);
+}
+
+static int hook_inode_rmdir(struct inode *dir, struct dentry *dentry)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!dentry))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_RMDIR, dentry->d_inode);
+}
+
+static int hook_inode_mknod(struct inode *dir, struct dentry *dentry,
+ umode_t mode, dev_t dev)
+{
+ if (!landlocked(current))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_CREATE, dir);
+}
+
+static int hook_inode_rename(struct inode *old_dir, struct dentry *old_dentry,
+ struct inode *new_dir, struct dentry *new_dentry)
+{
+ if (!landlocked(current))
+ return 0;
+ if (!WARN_ON(!old_dentry)) {
+ int ret = decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_RENAME,
+ old_dentry->d_inode);
+ if (ret)
+ return ret;
+ }
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_RENAMETO, new_dir);
+}
+
+static int hook_inode_readlink(struct dentry *dentry)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!dentry))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_READ, dentry->d_inode);
+}
+
+/*
+ * ignore the inode_follow_link hook (could set is_symlink in the fs_walk
+ * context)
+ */
+
+static int hook_inode_permission(struct inode *inode, int mask)
+{
+ u64 triggers;
+
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!inode))
+ return 0;
+
+ triggers = fs_may_to_triggers(mask, inode->i_mode);
+ /*
+ * decide_fs_walk() is exclusive with decide_fs_pick(): in a path walk,
+ * ignore execute-only access on directory for any fs_pick program
+ */
+ if (triggers == LANDLOCK_TRIGGER_FS_PICK_EXECUTE &&
+ S_ISDIR(inode->i_mode))
+ return decide_fs_walk(mask, inode);
+
+ return decide_fs_pick(triggers, inode);
+}
+
+static int hook_inode_setattr(struct dentry *dentry, struct iattr *attr)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!dentry))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_SETATTR,
+ dentry->d_inode);
+}
+
+static int hook_inode_getattr(const struct path *path)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!path))
+ return 0;
+ if (WARN_ON(!path->dentry))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_GETATTR,
+ path->dentry->d_inode);
+}
+
+static int hook_inode_setxattr(struct dentry *dentry, const char *name,
+ const void *value, size_t size, int flags)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!dentry))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_SETATTR,
+ dentry->d_inode);
+}
+
+static int hook_inode_getxattr(struct dentry *dentry, const char *name)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!dentry))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_GETATTR,
+ dentry->d_inode);
+}
+
+static int hook_inode_listxattr(struct dentry *dentry)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!dentry))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_GETATTR,
+ dentry->d_inode);
+}
+
+static int hook_inode_removexattr(struct dentry *dentry, const char *name)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!dentry))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_SETATTR,
+ dentry->d_inode);
+}
+
+static int hook_inode_getsecurity(struct inode *inode, const char *name,
+ void **buffer, bool alloc)
+{
+ if (!landlocked(current))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_GETATTR, inode);
+}
+
+static int hook_inode_setsecurity(struct inode *inode, const char *name,
+ const void *value, size_t size, int flag)
+{
+ if (!landlocked(current))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_SETATTR, inode);
+}
+
+static int hook_inode_listsecurity(struct inode *inode, char *buffer,
+ size_t buffer_size)
+{
+ if (!landlocked(current))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_GETATTR, inode);
+}
+
+/* file hooks */
+
+static int hook_file_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!file))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_IOCTL,
+ file_inode(file));
+}
+
+static int hook_file_lock(struct file *file, unsigned int cmd)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!file))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_LOCK, file_inode(file));
+}
+
+static int hook_file_fcntl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!file))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_FCNTL,
+ file_inode(file));
+}
+
+static int hook_mmap_file(struct file *file, unsigned long reqprot,
+ unsigned long prot, unsigned long flags)
+{
+ if (!landlocked(current))
+ return 0;
+ /* file can be null for anonymous mmap */
+ if (!file)
+ return 0;
+ return decide_fs_pick(mem_prot_to_triggers(prot, flags & MAP_PRIVATE),
+ file_inode(file));
+}
+
+static int hook_file_mprotect(struct vm_area_struct *vma,
+ unsigned long reqprot, unsigned long prot)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!vma))
+ return 0;
+ if (!vma->vm_file)
+ return 0;
+ return decide_fs_pick(mem_prot_to_triggers(prot,
+ !(vma->vm_flags & VM_SHARED)),
+ file_inode(vma->vm_file));
+}
+
+static int hook_file_receive(struct file *file)
+{
+ if (!landlocked(current))
+ return 0;
+ if (WARN_ON(!file))
+ return 0;
+ return decide_fs_pick(LANDLOCK_TRIGGER_FS_PICK_RECEIVE,
+ file_inode(file));
+}
+
+static struct security_hook_list landlock_hooks[] = {
+ LSM_HOOK_INIT(binder_transfer_file, hook_binder_transfer_file),
+
+ LSM_HOOK_INIT(sb_statfs, hook_sb_statfs),
+ LSM_HOOK_INIT(sb_mount, hook_sb_mount),
+ LSM_HOOK_INIT(sb_pivotroot, hook_sb_pivotroot),
+
+ LSM_HOOK_INIT(inode_create, hook_inode_create),
+ LSM_HOOK_INIT(inode_link, hook_inode_link),
+ LSM_HOOK_INIT(inode_unlink, hook_inode_unlink),
+ LSM_HOOK_INIT(inode_symlink, hook_inode_symlink),
+ LSM_HOOK_INIT(inode_mkdir, hook_inode_mkdir),
+ LSM_HOOK_INIT(inode_rmdir, hook_inode_rmdir),
+ LSM_HOOK_INIT(inode_mknod, hook_inode_mknod),
+ LSM_HOOK_INIT(inode_rename, hook_inode_rename),
+ LSM_HOOK_INIT(inode_readlink, hook_inode_readlink),
+ LSM_HOOK_INIT(inode_permission, hook_inode_permission),
+ LSM_HOOK_INIT(inode_setattr, hook_inode_setattr),
+ LSM_HOOK_INIT(inode_getattr, hook_inode_getattr),
+ LSM_HOOK_INIT(inode_setxattr, hook_inode_setxattr),
+ LSM_HOOK_INIT(inode_getxattr, hook_inode_getxattr),
+ LSM_HOOK_INIT(inode_listxattr, hook_inode_listxattr),
+ LSM_HOOK_INIT(inode_removexattr, hook_inode_removexattr),
+ LSM_HOOK_INIT(inode_getsecurity, hook_inode_getsecurity),
+ LSM_HOOK_INIT(inode_setsecurity, hook_inode_setsecurity),
+ LSM_HOOK_INIT(inode_listsecurity, hook_inode_listsecurity),
+
+ /* do not handle file_permission for now */
+ LSM_HOOK_INIT(file_ioctl, hook_file_ioctl),
+ LSM_HOOK_INIT(file_lock, hook_file_lock),
+ LSM_HOOK_INIT(file_fcntl, hook_file_fcntl),
+ LSM_HOOK_INIT(mmap_file, hook_mmap_file),
+ LSM_HOOK_INIT(file_mprotect, hook_file_mprotect),
+ LSM_HOOK_INIT(file_receive, hook_file_receive),
+ /* file_open is not handled, use inode_permission instead */
+};
+
+__init void landlock_add_hooks_fs(void)
+{
+ security_add_hooks(landlock_hooks, ARRAY_SIZE(landlock_hooks),
+ LANDLOCK_NAME);
+}
diff --git a/security/landlock/hooks_fs.h b/security/landlock/hooks_fs.h
new file mode 100644
index 000000000000..eeae4dcd842f
--- /dev/null
+++ b/security/landlock/hooks_fs.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Landlock LSM - filesystem hooks
+ *
+ * Copyright © 2017-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018-2019 ANSSI
+ */
+
+#include <linux/bpf.h> /* enum bpf_access_type */
+
+__init void landlock_add_hooks_fs(void);
+
+/* fs_pick */
+
+struct landlock_hook_ctx_fs_pick;
+
+bool landlock_is_valid_access_fs_pick(int off, enum bpf_access_type type,
+ enum bpf_reg_type *reg_type, int *max_size);
+
+const struct landlock_ctx_fs_pick *landlock_get_ctx_fs_pick(
+ const struct landlock_hook_ctx_fs_pick *hook_ctx);
+
+/* fs_walk */
+
+struct landlock_hook_ctx_fs_walk;
+
+bool landlock_is_valid_access_fs_walk(int off, enum bpf_access_type type,
+ enum bpf_reg_type *reg_type, int *max_size);
+
+const struct landlock_ctx_fs_walk *landlock_get_ctx_fs_walk(
+ const struct landlock_hook_ctx_fs_walk *hook_ctx);
diff --git a/security/landlock/init.c b/security/landlock/init.c
index 8dfd5fea3c1f..391e88bd4d3a 100644
--- a/security/landlock/init.c
+++ b/security/landlock/init.c
@@ -9,8 +9,10 @@
#include <linux/bpf.h> /* enum bpf_access_type */
#include <linux/capability.h> /* capable */
#include <linux/filter.h> /* struct bpf_prog */
+#include <linux/lsm_hooks.h>
#include "common.h" /* LANDLOCK_* */
+#include "hooks_fs.h"
static bool bpf_landlock_is_valid_access(int off, int size,
enum bpf_access_type type, const struct bpf_prog *prog,
@@ -27,6 +29,20 @@ static bool bpf_landlock_is_valid_access(int off, int size,
if (size <= 0 || size > sizeof(__u64))
return false;
+ /* set register type and max size */
+ switch (get_hook_type(prog)) {
+ case LANDLOCK_HOOK_FS_PICK:
+ if (!landlock_is_valid_access_fs_pick(off, type, &reg_type,
+ &max_size))
+ return false;
+ break;
+ case LANDLOCK_HOOK_FS_WALK:
+ if (!landlock_is_valid_access_fs_walk(off, type, &reg_type,
+ &max_size))
+ return false;
+ break;
+ }
+
/* check memory range access */
switch (reg_type) {
case NOT_INIT:
@@ -98,3 +114,20 @@ const struct bpf_verifier_ops landlock_verifier_ops = {
};
const struct bpf_prog_ops landlock_prog_ops = {};
+
+static int __init landlock_init(void)
+{
+ pr_info(LANDLOCK_NAME ": Initializing (sandbox with seccomp)\n");
+ landlock_add_hooks_fs();
+ return 0;
+}
+
+struct lsm_blob_sizes landlock_blob_sizes __lsm_ro_after_init = {
+};
+
+DEFINE_LSM(LANDLOCK_NAME) = {
+ .name = LANDLOCK_NAME,
+ .order = LSM_ORDER_LAST,
+ .blobs = &landlock_blob_sizes,
+ .init = landlock_init,
+};
diff --git a/security/security.c b/security/security.c
index 250ee2d76406..e694e5fe7021 100644
--- a/security/security.c
+++ b/security/security.c
@@ -263,6 +263,21 @@ static void __init ordered_lsm_parse(const char *order, const char *origin)
}
}
+ /*
+ * In case of an unprivileged access-control, we don't want to give the
+ * ability to any process to do some checks (e.g. through an eBPF
+ * program) on kernel objects (e.g. files) if a privileged security
+ * policy forbids their access. We must then load
+ * potentially-unprivileged security modules after all other LSMs.
+ *
+ * LSM_ORDER_LAST is always last and does not appear in the modifiable
+ * ordered list of enabled LSMs.
+ */
+ for (lsm = __start_lsm_info; lsm < __end_lsm_info; lsm++) {
+ if (lsm->order == LSM_ORDER_LAST)
+ append_ordered_lsm(lsm, "last");
+ }
+
/* Disable all LSMs not in the ordered list. */
for (lsm = __start_lsm_info; lsm < __end_lsm_info; lsm++) {
if (exists_ordered_lsm(lsm))
--
2.22.0
* [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Mickaël Salaün @ 2019-07-21 21:31 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Alexander Viro, Alexei Starovoitov,
Andrew Morton, Andy Lutomirski, Arnaldo Carvalho de Melo,
Casey Schaufler, Daniel Borkmann, David Drysdale,
David S . Miller, Eric W . Biederman, James Morris, Jann Horn,
John Johansen, Jonathan Corbet, Kees Cook, Michael Kerrisk,
Mickaël Salaün
In-Reply-To: <20190721213116.23476-1-mic@digikod.net>
FIXME: 64-bits in the doc
This new map stores arbitrary values referenced by inode keys. The map
can be updated from user space with file descriptors pointing to inodes
tied to a file system. From an eBPF (Landlock) program's point of view,
such a map is read-only and can only be used to retrieve a value tied
to a given inode. This is useful to recognize an inode tagged by user
space without holding any access right to this inode (i.e. no need to
have write access to this inode).
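As a rough user-space analogy of the behaviour described above (not the kernel implementation, and not the bpf(2) API), the sketch below keys a small table on inode identity while callers only ever hand in file descriptors. Every name in it (inode_tag_map, tag_update(), tag_lookup(), demo_same_inode()) is hypothetical, and it assumes that the (st_dev, st_ino) pair from fstat(2) is an acceptable stand-in for the struct inode pointer used by the real map.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define TAG_SLOTS 64

/* Hypothetical user-space stand-in for one inode map entry. */
struct inode_tag {
	dev_t dev;
	ino_t ino;
	int value;
	int used;
};

struct inode_tag_map {
	struct inode_tag slots[TAG_SLOTS];
};

/* Find the slot for (dev, ino) with linear probing; optionally claim one. */
static struct inode_tag *tag_slot(struct inode_tag_map *m, dev_t dev,
				  ino_t ino, int create)
{
	size_t start = ((size_t)dev ^ (size_t)ino) % TAG_SLOTS;
	size_t i;

	assert(m != NULL);
	for (i = 0; i < TAG_SLOTS; i++) {
		struct inode_tag *t = &m->slots[(start + i) % TAG_SLOTS];

		if (t->used && t->dev == dev && t->ino == ino)
			return t;
		if (!t->used) {
			if (!create)
				return NULL;
			t->used = 1;
			t->dev = dev;
			t->ino = ino;
			return t;
		}
	}
	return NULL; /* table full */
}

/* The caller passes an FD; the table is keyed on the inode behind it. */
static int tag_update(struct inode_tag_map *m, int fd, int value)
{
	struct stat st;
	struct inode_tag *t;

	if (fstat(fd, &st) != 0)
		return -1;
	t = tag_slot(m, st.st_dev, st.st_ino, 1);
	if (!t)
		return -1;
	t->value = value;
	return 0;
}

static int tag_lookup(struct inode_tag_map *m, int fd, int *value)
{
	struct stat st;
	struct inode_tag *t;

	if (fstat(fd, &st) != 0)
		return -1;
	t = tag_slot(m, st.st_dev, st.st_ino, 0);
	if (!t)
		return -1;
	*value = t->value;
	return 0;
}

/* Tag a file through one FD, then find the tag through a second FD. */
static int demo_same_inode(const char *path)
{
	static struct inode_tag_map map; /* zero-initialized */
	int fd1, fd2, value = -1;

	fd1 = open(path, O_RDONLY);
	fd2 = open(path, O_RDONLY);
	if (fd1 >= 0 && fd2 >= 0 && tag_update(&map, fd1, 42) == 0)
		tag_lookup(&map, fd2, &value);
	if (fd1 >= 0)
		close(fd1);
	if (fd2 >= 0)
		close(fd2);
	return value;
}
```

Calling demo_same_inode("/") tags the root directory through one descriptor and retrieves the tag through another, which is the property the inode map relies on: the descriptor is only a way to name the inode, not the key itself.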
Add dedicated BPF functions to handle this type of map from a syscall:
* bpf_inode_fd_htab_map_update_elem()
* bpf_inode_fd_htab_map_lookup_elem()
* bpf_inode_fd_htab_map_delete_elem()
This new map requires a dedicated helper, inode_map_lookup_elem(),
because the key is a pointer to opaque data (only provided by the
kernel). This acts like a (physical or cryptographic) key, which is why
it is also not allowed to get the next key.
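The opaque-key idea can be illustrated with a small user-space sketch, under the assumption that a plain pointer comparison models the kernel-provided inode pointer. The names (struct obj, ptr_map_update(), ptr_map_lookup(), ptr_map_get_next_key()) are hypothetical and exist only for this illustration. The sketch returns -ENOTSUP, the closest user-space errno, where the patch returns the kernel-internal -ENOTSUPP.

```c
#include <errno.h>
#include <stddef.h>

#define PTR_MAP_MAX 8

/* Stands in for struct inode: callers only ever hold a pointer to it. */
struct obj {
	int id;
};

struct ptr_entry {
	const struct obj *key;
	int value;
};

struct ptr_map {
	struct ptr_entry entries[PTR_MAP_MAX];
	size_t count;
};

/* Insert or overwrite the value tied to this exact object. */
static int ptr_map_update(struct ptr_map *m, const struct obj *key, int value)
{
	size_t i;

	for (i = 0; i < m->count; i++) {
		if (m->entries[i].key == key) {
			m->entries[i].value = value;
			return 0;
		}
	}
	if (m->count == PTR_MAP_MAX)
		return -E2BIG;
	m->entries[m->count].key = key;
	m->entries[m->count].value = value;
	m->count++;
	return 0;
}

/* A lookup only succeeds for a caller that already holds the pointer. */
static const int *ptr_map_lookup(const struct ptr_map *m,
				 const struct obj *key)
{
	size_t i;

	for (i = 0; i < m->count; i++) {
		if (m->entries[i].key == key)
			return &m->entries[i].value;
	}
	return NULL;
}

/* Enumeration is refused so that keys cannot be discovered by iteration. */
static int ptr_map_get_next_key(const struct ptr_map *m, const void *key,
				void *next_key)
{
	(void)m;
	(void)key;
	(void)next_key;
	return -ENOTSUP;
}
```

Because there is no way to enumerate the entries, holding the pointer is the only way to reach a value, which is the sense in which the key behaves like a physical or cryptographic key.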
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Jann Horn <jann@thejh.net>
---
Changes since v9:
* use a hash map for the inode map: integrate inodemap.c into hashtab.c
* add map_put_key() to struct bpf_map_ops to allow putting an inode
reference used as a key
* allow arbitrary value size instead of 64-bits
* handle inode and map lifetime with LSM hooks
* check access for inode lookup via syscall: similar to adding xattr,
except it does not touch the file system (which is handy for read-only
ones)
* force read-only inode map for Landlock programs
* rename inode_map_lookup() into inode_map_lookup_elem()
* fix inode and mnt checks (suggested by Al Viro)
Changes since v8:
* remove prog chaining and object tagging to ease review
* use bpf_map_init_from_attr()
Changes since v7:
* new design with a dedicated map and a BPF function to tie a value to
an inode
* add the ability to set or get a tag on an inode from a Landlock
program
Changes since v6:
* remove WARN_ON() for missing dentry->d_inode
* refactor bpf_landlock_func_proto() (suggested by Kees Cook)
Changes since v5:
* cosmetic fixes and rebase
Changes since v4:
* use a file abstraction (handle) to wrap inode, dentry, path and file
structs
* remove bpf_landlock_cmp_fs_beneath()
* rename the BPF helper and move it to kernel/bpf/
* tighten helpers accessible by a Landlock rule
Changes since v3:
* remove bpf_landlock_cmp_fs_prop() (suggested by Alexei Starovoitov)
* add hooks dealing with struct inode and struct path pointers:
inode_permission and inode_getattr
* add abstraction over eBPF helper arguments thanks to wrapping structs
* add bpf_landlock_get_fs_mode() helper to check file type and mode
* merge WARN_ON() (suggested by Kees Cook)
* fix and update bpf_helpers.h
* use BPF_CALL_* for eBPF helpers (suggested by Alexei Starovoitov)
* make handle arraymap safe (RCU) and remove buggy synchronize_rcu()
* factor out the arraymap walk
* use size_t to index array (suggested by Jann Horn)
Changes since v2:
* add MNT_INTERNAL check to only add file handle from user-visible FS
(e.g. no anonymous inode)
* replace struct file* with struct path* in map_landlock_handle
* add BPF protos
* fix bpf_landlock_cmp_fs_prop_with_struct_file()
---
include/linux/bpf.h | 16 +++
include/linux/bpf_types.h | 3 +
include/linux/landlock.h | 4 +
include/uapi/linux/bpf.h | 12 +-
kernel/bpf/core.c | 2 +
kernel/bpf/hashtab.c | 253 +++++++++++++++++++++++++++++++++
kernel/bpf/syscall.c | 27 +++-
kernel/bpf/verifier.c | 14 ++
security/landlock/common.h | 14 ++
security/landlock/hooks_fs.c | 85 +++++++++++
security/landlock/init.c | 13 ++
tools/include/uapi/linux/bpf.h | 12 +-
tools/lib/bpf/libbpf_probes.c | 1 +
13 files changed, 453 insertions(+), 3 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6d9c7a08713e..c507438e56b5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -47,6 +47,7 @@ struct bpf_map_ops {
void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file,
int fd);
void (*map_fd_put_ptr)(void *ptr);
+ void (*map_put_key)(void *key);
u32 (*map_gen_lookup)(struct bpf_map *map, struct bpf_insn *insn_buf);
u32 (*map_fd_sys_lookup_elem)(void *ptr);
void (*map_seq_show_elem)(struct bpf_map *map, void *key,
@@ -208,6 +209,8 @@ enum bpf_arg_type {
ARG_PTR_TO_INT, /* pointer to int */
ARG_PTR_TO_LONG, /* pointer to long */
ARG_PTR_TO_SOCKET, /* pointer to bpf_sock (fullsock) */
+
+ ARG_PTR_TO_INODE, /* pointer to a struct inode */
};
/* type of values returned from helper functions */
@@ -278,6 +281,7 @@ enum bpf_reg_type {
PTR_TO_TCP_SOCK_OR_NULL, /* reg points to struct tcp_sock or NULL */
PTR_TO_TP_BUFFER, /* reg points to a writable raw tp's buffer */
PTR_TO_XDP_SOCK, /* reg points to struct xdp_sock */
+ PTR_TO_INODE, /* reg points to struct inode */
};
/* The information passed from prog-specific *_is_valid_access
@@ -479,6 +483,7 @@ struct bpf_event_entry {
struct rcu_head rcu;
};
+
bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp);
int bpf_prog_calc_tag(struct bpf_prog *fp);
@@ -684,6 +689,16 @@ int bpf_fd_array_map_lookup_elem(struct bpf_map *map, void *key, u32 *value);
int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
void *key, void *value, u64 map_flags);
int bpf_fd_htab_map_lookup_elem(struct bpf_map *map, void *key, u32 *value);
+int bpf_inode_fd_htab_map_lookup_elem(struct bpf_map *map, int *key, void *value);
+int bpf_inode_fd_htab_map_delete_elem(struct bpf_map *map, int *key);
+int bpf_inode_ptr_unlocked_htab_map_delete_elem(struct bpf_map *map,
+ struct inode **key,
+ bool remove_in_inode);
+int bpf_inode_ptr_locked_htab_map_delete_elem(struct bpf_map *map,
+ struct inode **key,
+ bool remove_in_inode);
+int bpf_inode_fd_htab_map_update_elem(struct bpf_map *map, int *key,
+ void *value, u64 map_flags);
int bpf_get_file_flag(int flags);
int bpf_check_uarg_tail_zero(void __user *uaddr, size_t expected_size,
@@ -1055,6 +1070,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
extern const struct bpf_func_proto bpf_strtol_proto;
extern const struct bpf_func_proto bpf_strtoul_proto;
extern const struct bpf_func_proto bpf_tcp_sock_proto;
+extern const struct bpf_func_proto bpf_inode_map_lookup_elem_proto;
/* Shared helpers among cBPF and eBPF. */
void bpf_user_rnd_init_once(void);
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2ab647323f3a..ea177818d67e 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -80,3 +80,6 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
#endif
BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
+#ifdef CONFIG_SECURITY_LANDLOCK
+BPF_MAP_TYPE(BPF_MAP_TYPE_INODE, htab_inode_ops)
+#endif
diff --git a/include/linux/landlock.h b/include/linux/landlock.h
index 8ac7942f50fc..731b89cdf977 100644
--- a/include/linux/landlock.h
+++ b/include/linux/landlock.h
@@ -9,6 +9,7 @@
#ifndef _LINUX_LANDLOCK_H
#define _LINUX_LANDLOCK_H
+#include <linux/bpf.h>
#include <linux/errno.h>
#include <linux/sched.h> /* task_struct */
@@ -31,4 +32,7 @@ static inline void get_seccomp_landlock(struct task_struct *tsk)
}
#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
+int landlock_inode_add_map(struct inode *inode, struct bpf_map *map);
+void landlock_inode_remove_map(struct inode *inode, const struct bpf_map *map);
+
#endif /* _LINUX_LANDLOCK_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d68613f737f3..2da054ca9c8b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -134,6 +134,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_QUEUE,
BPF_MAP_TYPE_STACK,
BPF_MAP_TYPE_SK_STORAGE,
+ BPF_MAP_TYPE_INODE,
};
/* Note that tracing related programs such as
@@ -2717,6 +2718,14 @@ union bpf_attr {
* **-EPERM** if no permission to send the *sig*.
*
* **-EAGAIN** if bpf program can try again.
+ *
+ * void *bpf_inode_map_lookup_elem(struct bpf_map *map, const void *key)
+ * Description
+ * Perform a lookup in *map* for an entry associated to an inode
+ * *key*.
+ * Return
+ * Map value associated to *key*, or **NULL** if no entry was
+ * found.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -2828,7 +2837,8 @@ union bpf_attr {
FN(strtoul), \
FN(sk_storage_get), \
FN(sk_storage_delete), \
- FN(send_signal),
+ FN(send_signal), \
+ FN(inode_map_lookup_elem),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 16079550db6d..4177c818e5cd 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2040,6 +2040,8 @@ const struct bpf_func_proto bpf_get_current_comm_proto __weak;
const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
const struct bpf_func_proto bpf_get_local_storage_proto __weak;
+const struct bpf_func_proto bpf_inode_map_lookup_elem_proto __weak;
+
const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
{
return NULL;
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 22066a62c8c9..4fc7755042f0 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1,13 +1,21 @@
// SPDX-License-Identifier: GPL-2.0-only
/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
* Copyright (c) 2016 Facebook
+ * Copyright (c) 2017-2019 Mickaël Salaün <mic@digikod.net>
+ * Copyright (c) 2019 ANSSI
*/
+#include <asm/resource.h> /* RLIMIT_NOFILE */
#include <linux/bpf.h>
#include <linux/btf.h>
+#include <linux/err.h>
#include <linux/jhash.h>
+#include <linux/fs.h> /* iput() */
#include <linux/filter.h>
+#include <linux/landlock.h>
+#include <linux/mount.h> /* MNT_INTERNAL */
#include <linux/rculist_nulls.h>
#include <linux/random.h>
+#include <linux/sched/signal.h> /* rlimit() */
#include <uapi/linux/btf.h>
#include "percpu_freelist.h"
#include "bpf_lru_list.h"
@@ -684,6 +692,8 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
map->ops->map_fd_put_ptr(ptr);
}
+ if (map->ops->map_put_key)
+ map->ops->map_put_key(l->key);
if (htab_is_prealloc(htab)) {
__pcpu_freelist_push(&htab->freelist, &l->fnode);
@@ -1514,3 +1524,246 @@ const struct bpf_map_ops htab_of_maps_map_ops = {
.map_gen_lookup = htab_of_map_gen_lookup,
.map_check_btf = map_check_no_btf,
};
+
+/* inode_htab */
+
+static int inode_htab_map_alloc_check(union bpf_attr *attr)
+{
+ /* only allow root to create this type of map (for now), should be
+ * removed when Landlock will be usable by unprivileged users */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* the key is a file descriptor */
+ if (attr->max_entries == 0 || attr->key_size != sizeof(int) ||
+ (attr->map_flags & ~(BPF_F_RDONLY | BPF_F_WRONLY |
+ BPF_F_RDONLY_PROG)) ||
+ /* for now, force read-only map for eBPF programs because only
+ * bpf_inode_map_lookup_elem() is able to access them */
+ !(attr->map_flags & BPF_F_RDONLY_PROG) ||
+ bpf_map_attr_numa_node(attr) != NUMA_NO_NODE)
+ return -EINVAL;
+
+ /*
+ * Limit number of entries in an inode map to the maximum number of
+ * open files for the current process. The maximum number of file
+ * references (including all inode maps) for a process is then
+ * (RLIMIT_NOFILE - 1) * RLIMIT_NOFILE. If the process' RLIMIT_NOFILE
+ * is 0, then any entry update is forbidden.
+ *
+ * An eBPF program can inherit all the inode map FDs. The worst case is
+ * to fill a bunch of arraymaps, create an eBPF program, close the
+ * inode map FDs, and start again. The maximum number of inode map
+ * entries can then be close to RLIMIT_NOFILE^3.
+ */
+ if (attr->max_entries > rlimit(RLIMIT_NOFILE))
+ return -EMFILE;
+
+ /* decorrelate UAPI from kernel API */
+ attr->key_size = sizeof(struct inode *);
+
+ return htab_map_alloc_check(attr);
+}
+
+static void inode_htab_put_key(void *key)
+{
+ struct inode **inode = key;
+
+ if ((*inode)->i_state & I_FREEING)
+ return;
+ iput(*inode);
+}
+
+/* called from syscall or (never) from eBPF program */
+static int map_get_next_no_key(struct bpf_map *map, void *key, void *next_key)
+{
+ /* do not leak a file descriptor */
+ return -ENOTSUPP;
+}
+
+/* must call iput(inode) after this call */
+static struct inode *inode_from_fd(int ufd, bool check_access)
+{
+ struct inode *ret;
+ struct fd f;
+ int deny;
+
+ f = fdget(ufd);
+ if (unlikely(!f.file))
+ return ERR_PTR(-EBADF);
+ /* TODO?: add this check when called from an eBPF program too (already
+ * checked by the LSM parent hooks anyway) */
+ if (unlikely(IS_PRIVATE(file_inode(f.file)))) {
+ ret = ERR_PTR(-EINVAL);
+ goto put_fd;
+ }
+ /* check if the FD is tied to a mount point */
+ /* TODO?: add this check when called from an eBPF program too */
+ if (unlikely(f.file->f_path.mnt->mnt_flags & MNT_INTERNAL)) {
+ ret = ERR_PTR(-EINVAL);
+ goto put_fd;
+ }
+ if (check_access) {
+ /*
+ * must be allowed to access attributes from this file to then
+ * be able to compare an inode to its map entry
+ */
+ deny = security_inode_getattr(&f.file->f_path);
+ if (deny) {
+ ret = ERR_PTR(deny);
+ goto put_fd;
+ }
+ }
+ ret = file_inode(f.file);
+ ihold(ret);
+
+put_fd:
+ fdput(f);
+ return ret;
+}
+
+/*
+ * The key is a FD when called from a syscall, but an inode address when called
+ * from an eBPF program.
+ */
+
+/* called from syscall */
+int bpf_inode_fd_htab_map_lookup_elem(struct bpf_map *map, int *key, void *value)
+{
+ void *ptr;
+ struct inode *inode;
+ int ret;
+
+ /* check inode access */
+ inode = inode_from_fd(*key, true);
+ if (IS_ERR(inode))
+ return PTR_ERR(inode);
+
+ rcu_read_lock();
+ ptr = htab_map_lookup_elem(map, &inode);
+ iput(inode);
+ if (IS_ERR(ptr)) {
+ ret = PTR_ERR(ptr);
+ } else if (!ptr) {
+ ret = -ENOENT;
+ } else {
+ ret = 0;
+ copy_map_value(map, value, ptr);
+ }
+ rcu_read_unlock();
+ return ret;
+}
+
+/* called from kernel */
+int bpf_inode_ptr_locked_htab_map_delete_elem(struct bpf_map *map,
+ struct inode **key, bool remove_in_inode)
+{
+ if (remove_in_inode)
+ landlock_inode_remove_map(*key, map);
+ return htab_map_delete_elem(map, key);
+}
+
+/* called from syscall */
+int bpf_inode_fd_htab_map_delete_elem(struct bpf_map *map, int *key)
+{
+ struct inode *inode;
+ int ret;
+
+ /* do not check inode access (similar to directory check) */
+ inode = inode_from_fd(*key, false);
+ if (IS_ERR(inode))
+ return PTR_ERR(inode);
+ ret = bpf_inode_ptr_locked_htab_map_delete_elem(map, &inode, true);
+ iput(inode);
+ return ret;
+}
+
+/* called from syscall */
+int bpf_inode_fd_htab_map_update_elem(struct bpf_map *map, int *key, void *value,
+ u64 map_flags)
+{
+ struct inode *inode;
+ int ret;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ /* check inode access */
+ inode = inode_from_fd(*key, true);
+ if (IS_ERR(inode))
+ return PTR_ERR(inode);
+ ret = htab_map_update_elem(map, &inode, value, map_flags);
+ if (!ret)
+ ret = landlock_inode_add_map(inode, map);
+ iput(inode);
+ return ret;
+}
+
+static void inode_htab_map_free(struct bpf_map *map)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+ struct hlist_nulls_node *n;
+ struct hlist_nulls_head *head;
+ struct htab_elem *l;
+ int i;
+
+ for (i = 0; i < htab->n_buckets; i++) {
+ head = select_bucket(htab, i);
+ hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
+ landlock_inode_remove_map(*((struct inode **)l->key), map);
+ }
+ }
+ htab_map_free(map);
+}
+
+/* use the bpf_inode_map_lookup_elem() helper instead */
+static void *map_lookup_no_elem(struct bpf_map *map, void *key)
+{
+ WARN_ON_ONCE(1);
+ return NULL;
+}
+
+static int map_delete_no_elem(struct bpf_map *map, void *key)
+{
+ WARN_ON_ONCE(1);
+ return -ENOTSUPP;
+}
+
+static int map_update_no_elem(struct bpf_map *map, void *key, void *value,
+ u64 flags)
+{
+ WARN_ON_ONCE(1);
+ return -ENOTSUPP;
+}
+
+const struct bpf_map_ops htab_inode_ops = {
+ .map_alloc_check = inode_htab_map_alloc_check,
+ .map_alloc = htab_map_alloc,
+ .map_free = inode_htab_map_free,
+ .map_put_key = inode_htab_put_key,
+ .map_get_next_key = map_get_next_no_key,
+ .map_lookup_elem = map_lookup_no_elem,
+ .map_delete_elem = map_delete_no_elem,
+ .map_update_elem = map_update_no_elem,
+ .map_check_btf = map_check_no_btf,
+};
+
+/*
+ * We need a dedicated helper to deal with inode maps because the key is a
+ * pointer to opaque data, only provided by the kernel. This really acts
+ * like a (physical or cryptographic) key, which is why it is also not allowed
+ * to get the next key with map_get_next_key().
+ */
+BPF_CALL_2(bpf_inode_map_lookup_elem, struct bpf_map *, map, void *, key)
+{
+ WARN_ON_ONCE(!rcu_read_lock_held());
+ return (unsigned long)htab_map_lookup_elem(map, &key);
+}
+
+const struct bpf_func_proto bpf_inode_map_lookup_elem_proto = {
+ .func = bpf_inode_map_lookup_elem,
+ .gpl_only = false,
+ .pkt_access = true,
+ .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
+ .arg1_type = ARG_CONST_MAP_PTR,
+ .arg2_type = ARG_PTR_TO_INODE,
+};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b2a8cb14f28e..e46441c42b68 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -801,6 +801,8 @@ static int map_lookup_elem(union bpf_attr *attr)
} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
map->map_type == BPF_MAP_TYPE_STACK) {
err = map->ops->map_peek_elem(map, value);
+ } else if (map->map_type == BPF_MAP_TYPE_INODE) {
+ err = bpf_inode_fd_htab_map_lookup_elem(map, key, value);
} else {
rcu_read_lock();
if (map->ops->map_lookup_elem_sys_only)
@@ -951,6 +953,10 @@ static int map_update_elem(union bpf_attr *attr)
} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
map->map_type == BPF_MAP_TYPE_STACK) {
err = map->ops->map_push_elem(map, value, attr->flags);
+ } else if (map->map_type == BPF_MAP_TYPE_INODE) {
+ rcu_read_lock();
+ err = bpf_inode_fd_htab_map_update_elem(map, key, value, attr->flags);
+ rcu_read_unlock();
} else {
rcu_read_lock();
err = map->ops->map_update_elem(map, key, value, attr->flags);
@@ -1006,7 +1012,10 @@ static int map_delete_elem(union bpf_attr *attr)
preempt_disable();
__this_cpu_inc(bpf_prog_active);
rcu_read_lock();
- err = map->ops->map_delete_elem(map, key);
+ if (map->map_type == BPF_MAP_TYPE_INODE)
+ err = bpf_inode_fd_htab_map_delete_elem(map, key);
+ else
+ err = map->ops->map_delete_elem(map, key);
rcu_read_unlock();
__this_cpu_dec(bpf_prog_active);
preempt_enable();
@@ -1018,6 +1027,22 @@ static int map_delete_elem(union bpf_attr *attr)
return err;
}
+int bpf_inode_ptr_unlocked_htab_map_delete_elem(struct bpf_map *map,
+ struct inode **key, bool remove_in_inode)
+{
+ int err;
+
+ preempt_disable();
+ __this_cpu_inc(bpf_prog_active);
+ rcu_read_lock();
+ err = bpf_inode_ptr_locked_htab_map_delete_elem(map, key, remove_in_inode);
+ rcu_read_unlock();
+ __this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+ maybe_wait_bpf_programs(map);
+ return err;
+}
+
/* last field in 'union bpf_attr' used by this command */
#define BPF_MAP_GET_NEXT_KEY_LAST_FIELD next_key
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 026c68cb9116..3972b9f02dac 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -400,6 +400,7 @@ static const char * const reg_type_str[] = {
[PTR_TO_TCP_SOCK_OR_NULL] = "tcp_sock_or_null",
[PTR_TO_TP_BUFFER] = "tp_buffer",
[PTR_TO_XDP_SOCK] = "xdp_sock",
+ [PTR_TO_INODE] = "inode",
};
static char slot_type_char[] = {
@@ -1846,6 +1847,7 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
case PTR_TO_TCP_SOCK:
case PTR_TO_TCP_SOCK_OR_NULL:
case PTR_TO_XDP_SOCK:
+ case PTR_TO_INODE:
return true;
default:
return false;
@@ -3306,6 +3308,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
verbose(env, "verifier internal error\n");
return -EFAULT;
}
+ } else if (arg_type == ARG_PTR_TO_INODE) {
+ expected_type = PTR_TO_INODE;
+ if (type != expected_type)
+ goto err_type;
} else if (arg_type_is_mem_ptr(arg_type)) {
expected_type = PTR_TO_STACK;
/* One exception here. In case function allows for NULL to be
@@ -3511,6 +3517,10 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
func_id != BPF_FUNC_sk_storage_delete)
goto error;
break;
+ case BPF_MAP_TYPE_INODE:
+ if (func_id != BPF_FUNC_inode_map_lookup_elem)
+ goto error;
+ break;
default:
break;
}
@@ -3579,6 +3589,10 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
if (map->map_type != BPF_MAP_TYPE_SK_STORAGE)
goto error;
break;
+ case BPF_FUNC_inode_map_lookup_elem:
+ if (map->map_type != BPF_MAP_TYPE_INODE)
+ goto error;
+ break;
default:
break;
}
diff --git a/security/landlock/common.h b/security/landlock/common.h
index b2ee018eb6fc..b0ba3f31ac7d 100644
--- a/security/landlock/common.h
+++ b/security/landlock/common.h
@@ -11,6 +11,7 @@
#include <linux/bpf.h> /* enum bpf_attach_type */
#include <linux/filter.h> /* bpf_prog */
+#include <linux/lsm_hooks.h> /* lsm_blob_sizes */
#include <linux/refcount.h> /* refcount_t */
#include <uapi/linux/landlock.h> /* LANDLOCK_TRIGGER_* */
@@ -23,6 +24,8 @@
#define _LANDLOCK_TRIGGER_FS_PICK_LAST LANDLOCK_TRIGGER_FS_PICK_WRITE
#define _LANDLOCK_TRIGGER_FS_PICK_MASK ((_LANDLOCK_TRIGGER_FS_PICK_LAST << 1ULL) - 1)
+extern struct lsm_blob_sizes landlock_blob_sizes;
+
enum landlock_hook_type {
LANDLOCK_HOOK_FS_PICK = 1,
LANDLOCK_HOOK_FS_WALK,
@@ -55,6 +58,17 @@ struct landlock_prog_set {
refcount_t usage;
};
+struct landlock_inode_map {
+ struct list_head list;
+ struct rcu_head rcu_put;
+ struct bpf_map *map;
+ /*
+ * It would be nice to remove the inode field, but it is necessary for
+ * call_rcu().
+ */
+ struct inode *inode;
+};
+
/**
* get_hook_index - get an index for the programs of struct landlock_prog_set
*
diff --git a/security/landlock/hooks_fs.c b/security/landlock/hooks_fs.c
index 3f81b7fc2938..8c9d6a333111 100644
--- a/security/landlock/hooks_fs.c
+++ b/security/landlock/hooks_fs.c
@@ -46,6 +46,12 @@ bool landlock_is_valid_access_fs_pick(int off, enum bpf_access_type type,
enum bpf_reg_type *reg_type, int *max_size)
{
switch (off) {
+ case offsetof(struct landlock_ctx_fs_pick, inode):
+ if (type != BPF_READ)
+ return false;
+ *reg_type = PTR_TO_INODE;
+ *max_size = sizeof(u64);
+ return true;
default:
return false;
}
@@ -55,6 +61,12 @@ bool landlock_is_valid_access_fs_walk(int off, enum bpf_access_type type,
enum bpf_reg_type *reg_type, int *max_size)
{
switch (off) {
+ case offsetof(struct landlock_ctx_fs_walk, inode):
+ if (type != BPF_READ)
+ return false;
+ *reg_type = PTR_TO_INODE;
+ *max_size = sizeof(u64);
+ return true;
default:
return false;
}
@@ -237,8 +249,79 @@ static int hook_sb_pivotroot(const struct path *old_path,
new_path->dentry->d_inode);
}
+/* inode helpers */
+
+static inline struct list_head *inode_landlock(const struct inode *inode)
+{
+ return inode->i_security + landlock_blob_sizes.lbs_inode;
+}
+
+int landlock_inode_add_map(struct inode *inode, struct bpf_map *map)
+{
+ struct landlock_inode_map *inode_map;
+
+ inode_map = kzalloc(sizeof(*inode_map), GFP_ATOMIC);
+ if (!inode_map)
+ return -ENOMEM;
+ INIT_LIST_HEAD(&inode_map->list);
+ inode_map->map = map;
+ inode_map->inode = inode;
+ list_add_tail(&inode_map->list, inode_landlock(inode));
+ return 0;
+}
+
+static void put_landlock_inode_map(struct rcu_head *head)
+{
+ struct landlock_inode_map *inode_map;
+ int err;
+
+ inode_map = container_of(head, struct landlock_inode_map, rcu_put);
+ err = bpf_inode_ptr_unlocked_htab_map_delete_elem(inode_map->map,
+ &inode_map->inode, false);
+ bpf_map_put(inode_map->map);
+ kfree(inode_map);
+}
+
+void landlock_inode_remove_map(struct inode *inode, const struct bpf_map *map)
+{
+ struct landlock_inode_map *inode_map;
+ bool found = false;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(inode_map, inode_landlock(inode), list) {
+ if (inode_map->map == map) {
+ found = true;
+ list_del_rcu(&inode_map->list);
+ kfree_rcu(inode_map, rcu_put);
+ break;
+ }
+ }
+ rcu_read_unlock();
+ WARN_ON(!found);
+}
+
/* inode hooks */
+static int hook_inode_alloc_security(struct inode *inode)
+{
+ struct list_head *ll_inode = inode_landlock(inode);
+
+ INIT_LIST_HEAD(ll_inode);
+ return 0;
+}
+
+static void hook_inode_free_security(struct inode *inode)
+{
+ struct landlock_inode_map *inode_map;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(inode_map, inode_landlock(inode), list) {
+ list_del_rcu(&inode_map->list);
+ call_rcu(&inode_map->rcu_put, put_landlock_inode_map);
+ }
+ rcu_read_unlock();
+}
+
/* a directory inode contains only one dentry */
static int hook_inode_create(struct inode *dir, struct dentry *dentry,
umode_t mode)
@@ -517,6 +600,8 @@ static struct security_hook_list landlock_hooks[] = {
LSM_HOOK_INIT(sb_mount, hook_sb_mount),
LSM_HOOK_INIT(sb_pivotroot, hook_sb_pivotroot),
+ LSM_HOOK_INIT(inode_alloc_security, hook_inode_alloc_security),
+ LSM_HOOK_INIT(inode_free_security, hook_inode_free_security),
LSM_HOOK_INIT(inode_create, hook_inode_create),
LSM_HOOK_INIT(inode_link, hook_inode_link),
LSM_HOOK_INIT(inode_unlink, hook_inode_unlink),
diff --git a/security/landlock/init.c b/security/landlock/init.c
index 391e88bd4d3a..eec4467cb5ee 100644
--- a/security/landlock/init.c
+++ b/security/landlock/init.c
@@ -104,6 +104,18 @@ static const struct bpf_func_proto *bpf_landlock_func_proto(
default:
break;
}
+
+ switch (get_hook_type(prog)) {
+ case LANDLOCK_HOOK_FS_WALK:
+ case LANDLOCK_HOOK_FS_PICK:
+ switch (func_id) {
+ case BPF_FUNC_inode_map_lookup_elem:
+ return &bpf_inode_map_lookup_elem_proto;
+ default:
+ break;
+ }
+ break;
+ }
return NULL;
}
@@ -123,6 +135,7 @@ static int __init landlock_init(void)
}
struct lsm_blob_sizes landlock_blob_sizes __lsm_ro_after_init = {
+ .lbs_inode = sizeof(struct list_head),
};
DEFINE_LSM(LANDLOCK_NAME) = {
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7b7a4f6c3104..7a55535f5dc1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -134,6 +134,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_QUEUE,
BPF_MAP_TYPE_STACK,
BPF_MAP_TYPE_SK_STORAGE,
+ BPF_MAP_TYPE_INODE,
};
/* Note that tracing related programs such as
@@ -2714,6 +2715,14 @@ union bpf_attr {
* **-EPERM** if no permission to send the *sig*.
*
* **-EAGAIN** if bpf program can try again.
+ *
+ * void *bpf_inode_map_lookup_elem(struct bpf_map *map, const void *key)
+ * Description
+ * Perform a lookup in *map* for an entry associated to an inode
+ * *key*.
+ * Return
+ * Map value associated to *key*, or **NULL** if no entry was
+ * found.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -2825,7 +2834,8 @@ union bpf_attr {
FN(strtoul), \
FN(sk_storage_get), \
FN(sk_storage_delete), \
- FN(send_signal),
+ FN(send_signal), \
+ FN(inode_map_lookup_elem),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index 03c910d1f84c..98875221310d 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -250,6 +250,7 @@ bool bpf_probe_map_type(enum bpf_map_type map_type, __u32 ifindex)
case BPF_MAP_TYPE_XSKMAP:
case BPF_MAP_TYPE_SOCKHASH:
case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY:
+ case BPF_MAP_TYPE_INODE:
default:
break;
}
--
2.22.0