* [PATCH v10 0/9] namei: openat2(2) path resolution restrictions
From: Aleksa Sarai @ 2019-07-19 16:42 UTC (permalink / raw)
To: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan
Cc: Aleksa Sarai, Eric Biederman, Andy Lutomirski, Jann Horn,
Christian Brauner, David Drysdale, Tycho Andersen, Kees Cook,
Linus Torvalds, containers, linux-fsdevel, linux-api,
Andrew Morton, Alexei Starovoitov, Chanho Min, Oleg Nesterov,
Aleksa Sarai, linux-alpha, linux-arch, linux-arm-kernel,
linux-ia64, linux-kernel
This patch is being developed here (with snapshots of each series
version being stashed in separate branches with names of the form
"resolveat/vX-summary"):
<https://github.com/cyphar/linux/tree/resolveat/master>
Patch changelog:
v10:
* Ensure that unlazy_walk() will fail if we are in a scoped walk and
the caller has zeroed nd->root (this happens in a few places, I'm
not sure why because unlazy_walk() does legitimize_path()
already). In this case we need to go through path_init() again to
reset it (otherwise we will have a breakout because set_root()
will breakout).
* Also add a WARN_ON (and return -ENOTRECOVERABLE) if
LOOKUP_IN_ROOT is set and we are in set_root() -- which should
never happen and will cause a breakout.
* Make changes suggested by Al Viro:
* Remove nd->{opath_mask,acc_mode} by moving all of the magic-link
permission logic be done after trailing_symlink() (with
trailing_magiclink()) only within path_openat().
* Introduce LOOKUP_MAGICLINK_JUMPED to be able to detect
magic-link jumps done with nd_jump_link() (so we don't end up
blocking other LOOKUP_JUMPED cases).
* Simplify all of the path_init() changes to make the code far
less confusing. dirfd_path_init() turns out to be un-necessary.
* Make openat2(2) also -EINVAL on unknown how->flags.
[Dmitry V. Levin]
* Clean up bad definitions of O_EMPTYPATH on architectures where O_*
flags are subtly different to <asm-generic/fcntl.h>.
* Switch away from passing a struct to build_open_flags() and
instead just copy the one field we need to temporarily modify
(how->flags). Also fix a bug in OPENHOW_MODE. [Rasmus Villemoes]
* Fix syscall linkages and switch to 437. [Arnd Bergmann]
* Clean up text in commit messages and the cover-letter.
[Rolf Eike Beer]
* Fix openat2 selftest makefile. [Michael Ellerman]
The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS,
the flag has been split up into separate flags. However, instead of
being an openat(2) flag it is provided through a new syscall openat2(2)
which provides an alternative way to get an O_PATH file descriptor (the
reasoning for doing this is included in the patch description). The
following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do
not trigger this.
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[6] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
And further, several semantics of file descriptor "re-opening" are now
changed to prevent attacks like CVE-2019-5736 by restricting how
magic-links can be resolved (based on their mode). This required some
other changes to the semantics of the modes of O_PATH file descriptor's
associated /proc/self/fd magic-links. openat2(2) has the ability to
further restrict re-opening of its own O_PATH fds, so that users can
make even better use of this feature.
Finally, O_EMPTYPATH was added so that users can do /proc/self/fd-style
re-opening without depending on procfs. The new restricted semantics for
magic-links are applied here too.
In order to make all of the above more usable, I'm working on
libpathrs[7] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: David Drysdale <drysdale@google.com>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <containers@lists.linux-foundation.org>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: <linux-api@vger.kernel.org>
[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin
[7]: https://github.com/openSUSE/libpathrs
Aleksa Sarai (9):
namei: obey trailing magic-link DAC permissions
procfs: switch magic-link modes to be more sane
open: O_EMPTYPATH: procfs-less file descriptor re-opening
namei: O_BENEATH-style path resolution flags
namei: LOOKUP_IN_ROOT: chroot-like path resolution
namei: aggressively check for nd->root escape on ".." resolution
open: openat2(2) syscall
kselftest: save-and-restore errno to allow for %m formatting
selftests: add openat2(2) selftests
Documentation/filesystems/path-lookup.rst | 12 +-
arch/alpha/include/uapi/asm/fcntl.h | 1 +
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/include/uapi/asm/fcntl.h | 39 +-
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/include/uapi/asm/fcntl.h | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/fcntl.c | 2 +-
fs/internal.h | 1 +
fs/namei.c | 270 ++++++++++--
fs/open.c | 112 ++++-
fs/proc/base.c | 20 +-
fs/proc/fd.c | 23 +-
fs/proc/namespaces.c | 2 +-
include/linux/fcntl.h | 17 +-
include/linux/fs.h | 8 +-
include/linux/namei.h | 9 +
include/linux/syscalls.h | 17 +-
include/uapi/asm-generic/fcntl.h | 4 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/fcntl.h | 42 ++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/kselftest.h | 15 +
tools/testing/selftests/memfd/memfd_test.c | 7 +-
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 162 +++++++
tools/testing/selftests/openat2/helpers.h | 114 +++++
.../testing/selftests/openat2/linkmode_test.c | 326 ++++++++++++++
.../selftests/openat2/rename_attack_test.c | 124 ++++++
.../testing/selftests/openat2/resolve_test.c | 397 ++++++++++++++++++
46 files changed, 1652 insertions(+), 107 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/linkmode_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
--
2.22.0
^ permalink raw reply
* Re: [v3 PATCH 1/2] mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified
From: Yang Shi @ 2019-07-19 16:18 UTC (permalink / raw)
To: Vlastimil Babka, mhocko, mgorman, akpm; +Cc: linux-mm, linux-kernel, linux-api
In-Reply-To: <c1e2b48a-972f-3944-bc17-598cb81a6658@suse.cz>
On 7/19/19 5:48 AM, Vlastimil Babka wrote:
> On 7/18/19 7:17 PM, Yang Shi wrote:
>> When both MPOL_MF_MOVE* and MPOL_MF_STRICT was specified, mbind() should
>> try best to migrate misplaced pages, if some of the pages could not be
>> migrated, then return -EIO.
>>
>> There are three different sub-cases:
>> 1. vma is not migratable
>> 2. vma is migratable, but there are unmovable pages
>> 3. vma is migratable, pages are movable, but migrate_pages() fails
>>
>> If #1 happens, kernel would just abort immediately, then return -EIO,
>> after the commit a7f40cfe3b7ada57af9b62fd28430eeb4a7cfcb7 ("mm:
>> mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified").
>>
>> If #3 happens, kernel would set policy and migrate pages with best-effort,
>> but won't rollback the migrated pages and reset the policy back.
>>
>> Before that commit, they behaves in the same way. It'd better to keep
>> their behavior consistent. But, rolling back the migrated pages and
>> resetting the policy back sounds not feasible, so just make #1 behave as
>> same as #3.
>>
>> Userspace will know that not everything was successfully migrated (via
>> -EIO), and can take whatever steps it deems necessary - attempt rollback,
>> determine which exact page(s) are violating the policy, etc.
>>
>> Make queue_pages_range() return 1 to indicate there are unmovable pages
>> or vma is not migratable.
>>
>> The #2 is not handled correctly in the current kernel, the following
>> patch will fix it.
>>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
> Some nits below (I guess Andrew can incorporate them, no need to resend)
>
> ...
>
>> @@ -488,15 +496,15 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
>> struct queue_pages *qp = walk->private;
>> unsigned long flags = qp->flags;
>> int ret;
>> + bool has_unmovable = false;
>> pte_t *pte;
>> spinlock_t *ptl;
>>
>> ptl = pmd_trans_huge_lock(pmd, vma);
>> if (ptl) {
>> ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
>> - if (ret > 0)
>> - return 0;
>> - else if (ret < 0)
>> + /* THP was split, fall through to pte walk */
>> + if (ret != 2)
>> return ret;
> The comment should better go here after the if, as that's where fall through
> happens.
>
>> }
>>
>> @@ -519,14 +527,21 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
>> if (!queue_pages_required(page, qp))
>> continue;
>> if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
>> - if (!vma_migratable(vma))
>> + /* MPOL_MF_STRICT must be specified if we get here */
>> + if (!vma_migratable(vma)) {
>> + has_unmovable |= true;
> '|=' is weird, just use '='
>
>> break;
>> + }
>> migrate_page_add(page, qp->pagelist, flags);
>> } else
>> break;
>> }
>> pte_unmap_unlock(pte - 1, ptl);
>> cond_resched();
>> +
>> + if (has_unmovable)
>> + return 1;
>> +
>> return addr != end ? -EIO : 0;
>> }
>>
> ...
>> @@ -1259,11 +1286,12 @@ static long do_mbind(unsigned long start, unsigned long len,
>> putback_movable_pages(&pagelist);
>> }
>>
>> - if (nr_failed && (flags & MPOL_MF_STRICT))
>> + if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
>> err = -EIO;
>> } else
>> putback_movable_pages(&pagelist);
>>
>> +up_out:
>> up_write(&mm->mmap_sem);
>> mpol_out:
> The new label made the wrong identation of this one stand out, so I'd just fix
> it up while here.
Thanks, will fix all of these. I will resend this patch along with patch
2/2 which has to be resent anyway.
Yang
> Thanks!
>
>> mpol_put(new);
>>
^ permalink raw reply
* Re: [v3 PATCH 2/2] mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind
From: Yang Shi @ 2019-07-19 16:17 UTC (permalink / raw)
To: Vlastimil Babka, mhocko, mgorman, akpm; +Cc: linux-mm, linux-kernel, linux-api
In-Reply-To: <6ba72e56-9f62-36bf-ded7-f337522715d5@suse.cz>
On 7/19/19 6:01 AM, Vlastimil Babka wrote:
> On 7/18/19 7:17 PM, Yang Shi wrote:
>> When running syzkaller internally, we ran into the below bug on 4.9.x
>> kernel:
>>
>> kernel BUG at mm/huge_memory.c:2124!
>> invalid opcode: 0000 [#1] SMP KASAN
>> Dumping ftrace buffer:
>> (ftrace buffer empty)
>> Modules linked in:
>> CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
>> task: ffff880067b34900 task.stack: ffff880068998000
>> RIP: 0010:[<ffffffff81895d6b>] [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
>> RSP: 0018:ffff88006899f980 EFLAGS: 00010286
>> RAX: 0000000000000000 RBX: ffffea00018f1700 RCX: 0000000000000000
>> RDX: 1ffffd400031e2e7 RSI: 0000000000000001 RDI: ffffea00018f1738
>> RBP: ffff88006899f9e8 R08: 0000000000000001 R09: 0000000000000000
>> R10: 0000000000000000 R11: fffffbfff0d8b13e R12: ffffea00018f1400
>> R13: ffffea00018f1400 R14: ffffea00018f1720 R15: ffffea00018f1401
>> FS: 00007fa333996740(0000) GS:ffff88006c600000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000000020000040 CR3: 0000000066b9c000 CR4: 00000000000606f0
>> Stack:
>> 0000000000000246 ffff880067b34900 0000000000000000 ffff88007ffdc000
>> 0000000000000000 ffff88006899f9e8 ffffffff812b4015 ffff880064c64e18
>> ffffea00018f1401 dffffc0000000000 ffffea00018f1700 0000000020ffd000
>> Call Trace:
>> [<ffffffff818490f1>] split_huge_page include/linux/huge_mm.h:100 [inline]
>> [<ffffffff818490f1>] queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538
>> [<ffffffff817ed0da>] walk_pmd_range mm/pagewalk.c:50 [inline]
>> [<ffffffff817ed0da>] walk_pud_range mm/pagewalk.c:90 [inline]
>> [<ffffffff817ed0da>] walk_pgd_range mm/pagewalk.c:116 [inline]
>> [<ffffffff817ed0da>] __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208
>> [<ffffffff817edb94>] walk_page_range+0x154/0x370 mm/pagewalk.c:285
>> [<ffffffff81844515>] queue_pages_range+0x115/0x150 mm/mempolicy.c:694
>> [<ffffffff8184f493>] do_mbind mm/mempolicy.c:1241 [inline]
>> [<ffffffff8184f493>] SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370
>> [<ffffffff81850146>] SyS_mbind+0x46/0x60 mm/mempolicy.c:1352
>> [<ffffffff810097e2>] do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282
>> [<ffffffff82ff6f93>] entry_SYSCALL_64_after_swapgs+0x5d/0xdb
>> Code: c7 80 1c 02 00 e8 26 0a 76 01 <0f> 0b 48 c7 c7 40 46 45 84 e8 4c
>> RIP [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
>> RSP <ffff88006899f980>
> ...
>
>> @@ -532,7 +531,14 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
>> has_unmovable |= true;
>> break;
>> }
>> - migrate_page_add(page, qp->pagelist, flags);
>> +
>> + /*
>> + * Do not abort immediately since there may be
>> + * temporary off LRU pages in the range. Still
>> + * need migrate other LRU pages.
>> + */
>> + if (migrate_page_add(page, qp->pagelist, flags))
>> + has_unmovable |= true;
> Also = instead of |=
OK
>
>> } else
>> break;
>> }
>> @@ -961,10 +967,21 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>> /*
>> * page migration, thp tail pages can be passed.
>> */
>> -static void migrate_page_add(struct page *page, struct list_head *pagelist,
>> +static int migrate_page_add(struct page *page, struct list_head *pagelist,
>> unsigned long flags)
>> {
>> struct page *head = compound_head(page);
>> +
>> + /*
>> + * Non-movable page may reach here. And, there may be
>> + * temporary off LRU pages or non-LRU movable pages.
>> + * Treat them as unmovable pages since they can't be
>> + * isolated, so they can't be moved at the moment. It
>> + * should return -EIO for this case too.
>> + */
>> + if (!PageLRU(head) && (flags & MPOL_MF_STRICT))
>> + return -EIO;
> As this test is racy, why not just use the result of isolate_lru_page().
Sounds good to me. Will fix in v4.
Thanks,
Yang
>
>> +
>> /*
>> * Avoid migrating a page that is shared with others.
>> */
>> @@ -976,6 +993,8 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
>> hpage_nr_pages(head));
>> }
>> }
>> +
>> + return 0;
>> }
>>
>> /* page allocation callback for NUMA node migration */
>> @@ -1178,9 +1197,10 @@ static struct page *new_page(struct page *page, unsigned long start)
>> }
>> #else
>>
>> -static void migrate_page_add(struct page *page, struct list_head *pagelist,
>> +static int migrate_page_add(struct page *page, struct list_head *pagelist,
>> unsigned long flags)
>> {
>> + return -EIO;
>> }
>>
>> int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
>>
^ permalink raw reply
* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Eric W. Biederman @ 2019-07-19 16:07 UTC (permalink / raw)
To: Richard Guy Briggs
Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel, Paul Moore, sgrubb, omosnace,
dhowells, simo, eparis, serge, nhorman
In-Reply-To: <9edad39c40671fb53f28d76862304cc2647029c6.1554732921.git.rgb@redhat.com>
Richard Guy Briggs <rgb@redhat.com> writes:
> Implement the proc fs write to set the audit container identifier of a
> process, emitting an AUDIT_CONTAINER_OP record to document the event.
>
> This is a write from the container orchestrator task to a proc entry of
> the form /proc/PID/audit_containerid where PID is the process ID of the
> newly created task that is to become the first task in a container, or
> an additional task added to a container.
>
> The write expects up to a u64 value (unset: 18446744073709551615).
>
> The writer must have capability CAP_AUDIT_CONTROL.
>
> This will produce a record such as this:
> type=CONTAINER_OP msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 contid=123456 old-contid=18446744073709551615 pid=628 auid=root uid=root tty=ttyS0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 comm=bash exe=/usr/bin/bash res=yes
>
> The "op" field indicates an initial set. The "pid" to "ses" fields are
> the orchestrator while the "opid" field is the object's PID, the process
> being "contained". New and old audit container identifier values are
> given in the "contid" fields, while res indicates its success.
>
> It is not permitted to unset the audit container identifier.
> A child inherits its parent's audit container identifier.
Why get proc involved in this? I know it more or less fits as
this is about a process and it's descendants. But this seems to
encouarge being able to read this value, and being able to read
this value seems to encourage misuse.
So I am not of fan of using proc for this.
> Please see the github audit kernel issue for the main feature:
> https://github.com/linux-audit/audit-kernel/issues/90
> Please see the github audit userspace issue for supporting additions:
> https://github.com/linux-audit/audit-userspace/issues/51
> Please see the github audit testsuiite issue for the test case:
> https://github.com/linux-audit/audit-testsuite/issues/64
> Please see the github audit wiki for the feature overview:
> https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
>
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> Acked-by: Serge Hallyn <serge@hallyn.com>
> Acked-by: Steve Grubb <sgrubb@redhat.com>
> Acked-by: Neil Horman <nhorman@tuxdriver.com>
> Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> ---
> fs/proc/base.c | 36 ++++++++++++++++++++++++
> include/linux/audit.h | 25 +++++++++++++++++
> include/uapi/linux/audit.h | 2 ++
> kernel/audit.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++
> kernel/audit.h | 1 +
> kernel/auditsc.c | 4 +++
> 6 files changed, 137 insertions(+)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ddef482f1334..43fd0c4b87de 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1294,6 +1294,40 @@ static ssize_t proc_sessionid_read(struct file * file, char __user * buf,
> .read = proc_sessionid_read,
> .llseek = generic_file_llseek,
> };
> +
> +static ssize_t proc_contid_write(struct file *file, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct inode *inode = file_inode(file);
> + u64 contid;
> + int rv;
> + struct task_struct *task = get_proc_task(inode);
> +
> + if (!task)
> + return -ESRCH;
> + if (*ppos != 0) {
> + /* No partial writes. */
> + put_task_struct(task);
> + return -EINVAL;
> + }
> +
> + rv = kstrtou64_from_user(buf, count, 10, &contid);
> + if (rv < 0) {
> + put_task_struct(task);
> + return rv;
> + }
> +
> + rv = audit_set_contid(task, contid);
> + put_task_struct(task);
> + if (rv < 0)
> + return rv;
> + return count;
> +}
> +
> +static const struct file_operations proc_contid_operations = {
> + .write = proc_contid_write,
> + .llseek = generic_file_llseek,
> +};
> #endif
>
> #ifdef CONFIG_FAULT_INJECTION
> @@ -3033,6 +3067,7 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
> #ifdef CONFIG_AUDIT
> REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
> REG("sessionid", S_IRUGO, proc_sessionid_operations),
> + REG("audit_containerid", S_IWUSR, proc_contid_operations),
> #endif
> #ifdef CONFIG_FAULT_INJECTION
> REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> @@ -3431,6 +3466,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
> #ifdef CONFIG_AUDIT
> REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
> REG("sessionid", S_IRUGO, proc_sessionid_operations),
> + REG("audit_containerid", S_IWUSR, proc_contid_operations),
> #endif
> #ifdef CONFIG_FAULT_INJECTION
> REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index bde346e73f0c..301337776193 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -89,6 +89,7 @@ struct audit_field {
> struct audit_task_info {
> kuid_t loginuid;
> unsigned int sessionid;
> + u64 contid;
> #ifdef CONFIG_AUDITSYSCALL
> struct audit_context *ctx;
> #endif
> @@ -189,6 +190,15 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
> return tsk->audit->sessionid;
> }
>
> +extern int audit_set_contid(struct task_struct *tsk, u64 contid);
> +
> +static inline u64 audit_get_contid(struct task_struct *tsk)
> +{
> + if (!tsk->audit)
> + return AUDIT_CID_UNSET;
> + return tsk->audit->contid;
> +}
> +
> extern u32 audit_enabled;
> #else /* CONFIG_AUDIT */
> static inline int audit_alloc(struct task_struct *task)
> @@ -250,6 +260,11 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
> return AUDIT_SID_UNSET;
> }
>
> +static inline u64 audit_get_contid(struct task_struct *tsk)
> +{
> + return AUDIT_CID_UNSET;
> +}
> +
> #define audit_enabled AUDIT_OFF
> #endif /* CONFIG_AUDIT */
>
> @@ -606,6 +621,16 @@ static inline bool audit_loginuid_set(struct task_struct *tsk)
> return uid_valid(audit_get_loginuid(tsk));
> }
>
> +static inline bool audit_contid_valid(u64 contid)
> +{
> + return contid != AUDIT_CID_UNSET;
> +}
> +
> +static inline bool audit_contid_set(struct task_struct *tsk)
> +{
> + return audit_contid_valid(audit_get_contid(tsk));
> +}
> +
> static inline void audit_log_string(struct audit_buffer *ab, const char *buf)
> {
> audit_log_n_string(ab, buf, strlen(buf));
> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index 3901c51c0b93..4a6a8bf1de32 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -71,6 +71,7 @@
> #define AUDIT_TTY_SET 1017 /* Set TTY auditing status */
> #define AUDIT_SET_FEATURE 1018 /* Turn an audit feature on or off */
> #define AUDIT_GET_FEATURE 1019 /* Get which features are enabled */
> +#define AUDIT_CONTAINER_OP 1020 /* Define the container id and info */
>
> #define AUDIT_FIRST_USER_MSG 1100 /* Userspace messages mostly uninteresting to kernel */
> #define AUDIT_USER_AVC 1107 /* We filter this differently */
> @@ -485,6 +486,7 @@ struct audit_tty_status {
>
> #define AUDIT_UID_UNSET (unsigned int)-1
> #define AUDIT_SID_UNSET ((unsigned int)-1)
> +#define AUDIT_CID_UNSET ((u64)-1)
>
> /* audit_rule_data supports filter rules with both integer and string
> * fields. It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
> diff --git a/kernel/audit.c b/kernel/audit.c
> index 3fb09783cd4a..182b0f2c183d 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -244,6 +244,7 @@ int audit_alloc(struct task_struct *tsk)
> }
> info->loginuid = audit_get_loginuid(current);
> info->sessionid = audit_get_sessionid(current);
> + info->contid = audit_get_contid(current);
> tsk->audit = info;
>
> ret = audit_alloc_syscall(tsk);
> @@ -258,6 +259,7 @@ int audit_alloc(struct task_struct *tsk)
> struct audit_task_info init_struct_audit = {
> .loginuid = INVALID_UID,
> .sessionid = AUDIT_SID_UNSET,
> + .contid = AUDIT_CID_UNSET,
> #ifdef CONFIG_AUDITSYSCALL
> .ctx = NULL,
> #endif
> @@ -2341,6 +2343,73 @@ int audit_set_loginuid(kuid_t loginuid)
> }
>
> /**
> + * audit_set_contid - set current task's audit contid
> + * @contid: contid value
> + *
> + * Returns 0 on success, -EPERM on permission failure.
> + *
> + * Called (set) from fs/proc/base.c::proc_contid_write().
> + */
> +int audit_set_contid(struct task_struct *task, u64 contid)
> +{
> + u64 oldcontid;
> + int rc = 0;
> + struct audit_buffer *ab;
> + uid_t uid;
> + struct tty_struct *tty;
> + char comm[sizeof(current->comm)];
> +
> + task_lock(task);
> + /* Can't set if audit disabled */
> + if (!task->audit) {
> + task_unlock(task);
> + return -ENOPROTOOPT;
> + }
> + oldcontid = audit_get_contid(task);
> + read_lock(&tasklist_lock);
> + /* Don't allow the audit containerid to be unset */
> + if (!audit_contid_valid(contid))
> + rc = -EINVAL;
> + /* if we don't have caps, reject */
> + else if (!capable(CAP_AUDIT_CONTROL))
> + rc = -EPERM;
> + /* if task has children or is not single-threaded, deny */
> + else if (!list_empty(&task->children))
> + rc = -EBUSY;
> + else if (!(thread_group_leader(task) && thread_group_empty(task)))
> + rc = -EALREADY;
> + read_unlock(&tasklist_lock);
> + if (!rc)
> + task->audit->contid = contid;
> + task_unlock(task);
> +
> + if (!audit_enabled)
> + return rc;
> +
> + ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONTAINER_OP);
> + if (!ab)
> + return rc;
> +
> + uid = from_kuid(&init_user_ns, task_uid(current));
> + tty = audit_get_tty();
> + audit_log_format(ab,
> + "op=set opid=%d contid=%llu old-contid=%llu pid=%d uid=%u auid=%u tty=%s ses=%u",
> + task_tgid_nr(task), contid, oldcontid,
> + task_tgid_nr(current), uid,
> + from_kuid(&init_user_ns, audit_get_loginuid(current)),
> + tty ? tty_name(tty) : "(none)",
> + audit_get_sessionid(current));
> + audit_put_tty(tty);
> + audit_log_task_context(ab);
> + audit_log_format(ab, " comm=");
> + audit_log_untrustedstring(ab, get_task_comm(comm, current));
> + audit_log_d_path_exe(ab, current->mm);
> + audit_log_format(ab, " res=%d", !rc);
> + audit_log_end(ab);
> + return rc;
> +}
> +
> +/**
> * audit_log_end - end one audit record
> * @ab: the audit_buffer
> *
> diff --git a/kernel/audit.h b/kernel/audit.h
> index c00e2ee3c6b3..e2912924af0d 100644
> --- a/kernel/audit.h
> +++ b/kernel/audit.h
> @@ -148,6 +148,7 @@ struct audit_context {
> kuid_t target_uid;
> unsigned int target_sessionid;
> u32 target_sid;
> + u64 target_cid;
> char target_comm[TASK_COMM_LEN];
>
> struct audit_tree_refs *trees, *first_trees;
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index fd7ca983de4f..1f7edf035b16 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -113,6 +113,7 @@ struct audit_aux_data_pids {
> kuid_t target_uid[AUDIT_AUX_PIDS];
> unsigned int target_sessionid[AUDIT_AUX_PIDS];
> u32 target_sid[AUDIT_AUX_PIDS];
> + u64 target_cid[AUDIT_AUX_PIDS];
> char target_comm[AUDIT_AUX_PIDS][TASK_COMM_LEN];
> int pid_count;
> };
> @@ -2368,6 +2369,7 @@ void __audit_ptrace(struct task_struct *t)
> context->target_uid = task_uid(t);
> context->target_sessionid = audit_get_sessionid(t);
> security_task_getsecid(t, &context->target_sid);
> + context->target_cid = audit_get_contid(t);
> memcpy(context->target_comm, t->comm, TASK_COMM_LEN);
> }
>
> @@ -2408,6 +2410,7 @@ int audit_signal_info(int sig, struct task_struct *t)
> ctx->target_uid = t_uid;
> ctx->target_sessionid = audit_get_sessionid(t);
> security_task_getsecid(t, &ctx->target_sid);
> + ctx->target_cid = audit_get_contid(t);
> memcpy(ctx->target_comm, t->comm, TASK_COMM_LEN);
> return 0;
> }
> @@ -2429,6 +2432,7 @@ int audit_signal_info(int sig, struct task_struct *t)
> axp->target_uid[axp->pid_count] = t_uid;
> axp->target_sessionid[axp->pid_count] = audit_get_sessionid(t);
> security_task_getsecid(t, &axp->target_sid[axp->pid_count]);
> + axp->target_cid[axp->pid_count] = audit_get_contid(t);
> memcpy(axp->target_comm[axp->pid_count], t->comm, TASK_COMM_LEN);
> axp->pid_count++;
^ permalink raw reply
* Re: [PATCH ghak90 V6 03/10] audit: read container ID of a process
From: Eric W. Biederman @ 2019-07-19 16:03 UTC (permalink / raw)
To: Richard Guy Briggs
Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel, Paul Moore, sgrubb, omosnace,
dhowells, simo, eparis, serge, nhorman
In-Reply-To: <846df5e5bf5a49094fede082a2ace135ab6f5772.1554732921.git.rgb@redhat.com>
Richard Guy Briggs <rgb@redhat.com> writes:
> Add support for reading the audit container identifier from the proc
> filesystem.
>
> This is a read from the proc entry of the form
> /proc/PID/audit_containerid where PID is the process ID of the task
> whose audit container identifier is sought.
>
> The read expects up to a u64 value (unset: 18446744073709551615).
>
> This read requires CAP_AUDIT_CONTROL.
This scares me. As this seems to make it easy to reuse an audit
containerid for non-audit purporses.
I would think it would be safer and easier to poke audit and ask it to
log a message with your audit container id.
Eric
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> Acked-by: Serge Hallyn <serge@hallyn.com>
> Acked-by: Neil Horman <nhorman@tuxdriver.com>
> Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> ---
> fs/proc/base.c | 25 ++++++++++++++++++++++---
> 1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 43fd0c4b87de..acc70239d0cb 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1211,7 +1211,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
> };
>
> #ifdef CONFIG_AUDIT
> -#define TMPBUFLEN 11
> +#define TMPBUFLEN 21
> static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
> size_t count, loff_t *ppos)
> {
> @@ -1295,6 +1295,24 @@ static ssize_t proc_sessionid_read(struct file * file, char __user * buf,
> .llseek = generic_file_llseek,
> };
>
> +static ssize_t proc_contid_read(struct file *file, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct inode *inode = file_inode(file);
> + struct task_struct *task = get_proc_task(inode);
> + ssize_t length;
> + char tmpbuf[TMPBUFLEN];
> +
> + if (!task)
> + return -ESRCH;
> + /* if we don't have caps, reject */
> + if (!capable(CAP_AUDIT_CONTROL))
> + return -EPERM;
> + length = scnprintf(tmpbuf, TMPBUFLEN, "%llu", audit_get_contid(task));
> + put_task_struct(task);
> + return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
> +}
> +
> static ssize_t proc_contid_write(struct file *file, const char __user *buf,
> size_t count, loff_t *ppos)
> {
> @@ -1325,6 +1343,7 @@ static ssize_t proc_contid_write(struct file *file, const char __user *buf,
> }
>
> static const struct file_operations proc_contid_operations = {
> + .read = proc_contid_read,
> .write = proc_contid_write,
> .llseek = generic_file_llseek,
> };
> @@ -3067,7 +3086,7 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
> #ifdef CONFIG_AUDIT
> REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
> REG("sessionid", S_IRUGO, proc_sessionid_operations),
> - REG("audit_containerid", S_IWUSR, proc_contid_operations),
> + REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
> #endif
> #ifdef CONFIG_FAULT_INJECTION
> REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> @@ -3466,7 +3485,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
> #ifdef CONFIG_AUDIT
> REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
> REG("sessionid", S_IRUGO, proc_sessionid_operations),
> - REG("audit_containerid", S_IWUSR, proc_contid_operations),
> + REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
> #endif
> #ifdef CONFIG_FAULT_INJECTION
> REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
^ permalink raw reply
* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Eric W. Biederman @ 2019-07-19 16:00 UTC (permalink / raw)
To: Paul Moore
Cc: Richard Guy Briggs, Serge E. Hallyn, Tycho Andersen, containers,
linux-api, Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
nhorman
In-Reply-To: <CAHC9VhTYV02ws3QcezER5cY+Xt+tExcJEO-dumTDx=FXGFh3nw@mail.gmail.com>
Paul Moore <paul@paul-moore.com> writes:
> On Wed, Jul 17, 2019 at 8:52 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>> On 2019-07-16 19:30, Paul Moore wrote:
>
> ...
>
>> > We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit container
>> > ID policy, we can not trust ns_capable(CAP_AUDIT_CONTROL).
>>
>> Ok. So does a process in a non-init user namespace have two (or more)
>> sets of capabilities stored in creds, one in the init_user_ns, and one
>> in current_user_ns? Or does it get stripped of all its capabilities in
>> init_user_ns once it has its own set in current_user_ns? If the former,
>> then we can use capable(). If the latter, we need another mechanism, as
>> you have suggested might be needed.
>
> Unfortunately I think the problem is that ultimately we need to allow
> any container orchestrator that has been given privileges to manage
> the audit container ID to also grant that privilege to any of the
> child process/containers it manages. I don't believe we can do that
> with capabilities based on the code I've looked at, and the
> discussions I've had, but if you find a way I would leave to hear it.
>> If some random unprivileged user wants to fire up a container
>> orchestrator/engine in his own user namespace, then audit needs to be
>> namespaced. Can we safely discard this scenario for now?
>
> I think the only time we want to allow a container orchestrator to
> manage the audit container ID is if it has been granted that privilege
> by someone who has that privilege already. In the zero-container, or
> single-level of containers, case this is relatively easy, and we can
> accomplish it using CAP_AUDIT_CONTROL as the privilege. If we start
> nesting container orchestrators it becomes more complicated as we need
> to be able to support granting and inheriting this privilege in a
> manner; this is why I suggested a new mechanism *may* be necessary.
Let me segway a bit and see if I can get this conversation out of the
rut it seems to have drifted into.
Unprivileged containers and nested containers exist today and are going
to become increasingly common. Let that be a given.
As I recall the interesting thing for audit to log is actions by
privileged processes. Audit can log more but generally configuring
logging by of the actions of unprivileged users is effectively a self
DOS.
So I think the initial implementation can safely ignore actions of
nested containers and unprivileged containers because you don't care
about their actions.
If we start allow running audit in a container then we need to deal with
all of the nesting issues but until then I don't think you folks care.
Or am I wrong. Do the requirements for securely auditing things from
the kernel care about the actions of unprivileged users?
Eric
^ permalink raw reply
* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Eric W. Biederman @ 2019-07-19 15:32 UTC (permalink / raw)
To: Richard Guy Briggs
Cc: Paul Moore, Serge E. Hallyn, Tycho Andersen, containers,
linux-api, Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
nhorman
In-Reply-To: <20190718005145.eshekqfr3navqqiy@madcap2.tricolour.ca>
Richard Guy Briggs <rgb@redhat.com> writes:
> On 2019-07-16 19:30, Paul Moore wrote:
>> On Tue, Jul 16, 2019 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>> > On 2019-07-15 17:04, Paul Moore wrote:
>> > > On Mon, Jul 8, 2019 at 2:06 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>>
>> > > > At this point I would say we are at an impasse unless we trust
>> > > > ns_capable() or we implement audit namespaces.
>> > >
>> > > I'm not sure how we can trust ns_capable(), but if you can think of a
>> > > way I would love to hear it. I'm also not sure how namespacing audit
>> > > is helpful (see my above comments), but if you think it is please
>> > > explain.
>> >
>> > So if we are not namespacing, why do we not trust capabilities?
>>
>> We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit container
>> ID policy, we can not trust ns_capable(CAP_AUDIT_CONTROL).
>
> Ok. So does a process in a non-init user namespace have two (or more)
> sets of capabilities stored in creds, one in the init_user_ns, and one
> in current_user_ns? Or does it get stripped of all its capabilities in
> init_user_ns once it has its own set in current_user_ns? If the former,
> then we can use capable(). If the latter, we need another mechanism, as
> you have suggested might be needed.
The latter. There is only one set of capabilities and it is in the
processes current user namespace.
Eric
^ permalink raw reply
* Re: [v3 PATCH 2/2] mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind
From: Vlastimil Babka @ 2019-07-19 13:01 UTC (permalink / raw)
To: Yang Shi, mhocko, mgorman, akpm; +Cc: linux-mm, linux-kernel, linux-api
In-Reply-To: <1563470274-52126-3-git-send-email-yang.shi@linux.alibaba.com>
On 7/18/19 7:17 PM, Yang Shi wrote:
> When running syzkaller internally, we ran into the below bug on 4.9.x
> kernel:
>
> kernel BUG at mm/huge_memory.c:2124!
> invalid opcode: 0000 [#1] SMP KASAN
> Dumping ftrace buffer:
> (ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
> task: ffff880067b34900 task.stack: ffff880068998000
> RIP: 0010:[<ffffffff81895d6b>] [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
> RSP: 0018:ffff88006899f980 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: ffffea00018f1700 RCX: 0000000000000000
> RDX: 1ffffd400031e2e7 RSI: 0000000000000001 RDI: ffffea00018f1738
> RBP: ffff88006899f9e8 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: fffffbfff0d8b13e R12: ffffea00018f1400
> R13: ffffea00018f1400 R14: ffffea00018f1720 R15: ffffea00018f1401
> FS: 00007fa333996740(0000) GS:ffff88006c600000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020000040 CR3: 0000000066b9c000 CR4: 00000000000606f0
> Stack:
> 0000000000000246 ffff880067b34900 0000000000000000 ffff88007ffdc000
> 0000000000000000 ffff88006899f9e8 ffffffff812b4015 ffff880064c64e18
> ffffea00018f1401 dffffc0000000000 ffffea00018f1700 0000000020ffd000
> Call Trace:
> [<ffffffff818490f1>] split_huge_page include/linux/huge_mm.h:100 [inline]
> [<ffffffff818490f1>] queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538
> [<ffffffff817ed0da>] walk_pmd_range mm/pagewalk.c:50 [inline]
> [<ffffffff817ed0da>] walk_pud_range mm/pagewalk.c:90 [inline]
> [<ffffffff817ed0da>] walk_pgd_range mm/pagewalk.c:116 [inline]
> [<ffffffff817ed0da>] __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208
> [<ffffffff817edb94>] walk_page_range+0x154/0x370 mm/pagewalk.c:285
> [<ffffffff81844515>] queue_pages_range+0x115/0x150 mm/mempolicy.c:694
> [<ffffffff8184f493>] do_mbind mm/mempolicy.c:1241 [inline]
> [<ffffffff8184f493>] SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370
> [<ffffffff81850146>] SyS_mbind+0x46/0x60 mm/mempolicy.c:1352
> [<ffffffff810097e2>] do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282
> [<ffffffff82ff6f93>] entry_SYSCALL_64_after_swapgs+0x5d/0xdb
> Code: c7 80 1c 02 00 e8 26 0a 76 01 <0f> 0b 48 c7 c7 40 46 45 84 e8 4c
> RIP [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
> RSP <ffff88006899f980>
...
> @@ -532,7 +531,14 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
> has_unmovable |= true;
> break;
> }
> - migrate_page_add(page, qp->pagelist, flags);
> +
> + /*
> + * Do not abort immediately since there may be
> + * temporary off LRU pages in the range. Still
> + * need migrate other LRU pages.
> + */
> + if (migrate_page_add(page, qp->pagelist, flags))
> + has_unmovable |= true;
Also = instead of |=
> } else
> break;
> }
> @@ -961,10 +967,21 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
> /*
> * page migration, thp tail pages can be passed.
> */
> -static void migrate_page_add(struct page *page, struct list_head *pagelist,
> +static int migrate_page_add(struct page *page, struct list_head *pagelist,
> unsigned long flags)
> {
> struct page *head = compound_head(page);
> +
> + /*
> + * Non-movable page may reach here. And, there may be
> + * temporary off LRU pages or non-LRU movable pages.
> + * Treat them as unmovable pages since they can't be
> + * isolated, so they can't be moved at the moment. It
> + * should return -EIO for this case too.
> + */
> + if (!PageLRU(head) && (flags & MPOL_MF_STRICT))
> + return -EIO;
As this test is racy, why not just use the result of isolate_lru_page().
> +
> /*
> * Avoid migrating a page that is shared with others.
> */
> @@ -976,6 +993,8 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
> hpage_nr_pages(head));
> }
> }
> +
> + return 0;
> }
>
> /* page allocation callback for NUMA node migration */
> @@ -1178,9 +1197,10 @@ static struct page *new_page(struct page *page, unsigned long start)
> }
> #else
>
> -static void migrate_page_add(struct page *page, struct list_head *pagelist,
> +static int migrate_page_add(struct page *page, struct list_head *pagelist,
> unsigned long flags)
> {
> + return -EIO;
> }
>
> int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
>
^ permalink raw reply
* Re: [v3 PATCH 1/2] mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified
From: Vlastimil Babka @ 2019-07-19 12:48 UTC (permalink / raw)
To: Yang Shi, mhocko, mgorman, akpm; +Cc: linux-mm, linux-kernel, linux-api
In-Reply-To: <1563470274-52126-2-git-send-email-yang.shi@linux.alibaba.com>
On 7/18/19 7:17 PM, Yang Shi wrote:
> When both MPOL_MF_MOVE* and MPOL_MF_STRICT was specified, mbind() should
> try best to migrate misplaced pages, if some of the pages could not be
> migrated, then return -EIO.
>
> There are three different sub-cases:
> 1. vma is not migratable
> 2. vma is migratable, but there are unmovable pages
> 3. vma is migratable, pages are movable, but migrate_pages() fails
>
> If #1 happens, kernel would just abort immediately, then return -EIO,
> after the commit a7f40cfe3b7ada57af9b62fd28430eeb4a7cfcb7 ("mm:
> mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified").
>
> If #3 happens, kernel would set policy and migrate pages with best-effort,
> but won't rollback the migrated pages and reset the policy back.
>
> Before that commit, they behaves in the same way. It'd better to keep
> their behavior consistent. But, rolling back the migrated pages and
> resetting the policy back sounds not feasible, so just make #1 behave as
> same as #3.
>
> Userspace will know that not everything was successfully migrated (via
> -EIO), and can take whatever steps it deems necessary - attempt rollback,
> determine which exact page(s) are violating the policy, etc.
>
> Make queue_pages_range() return 1 to indicate there are unmovable pages
> or vma is not migratable.
>
> The #2 is not handled correctly in the current kernel, the following
> patch will fix it.
>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Some nits below (I guess Andrew can incorporate them, no need to resend)
...
> @@ -488,15 +496,15 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
> struct queue_pages *qp = walk->private;
> unsigned long flags = qp->flags;
> int ret;
> + bool has_unmovable = false;
> pte_t *pte;
> spinlock_t *ptl;
>
> ptl = pmd_trans_huge_lock(pmd, vma);
> if (ptl) {
> ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
> - if (ret > 0)
> - return 0;
> - else if (ret < 0)
> + /* THP was split, fall through to pte walk */
> + if (ret != 2)
> return ret;
The comment should better go here after the if, as that's where fall through
happens.
> }
>
> @@ -519,14 +527,21 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
> if (!queue_pages_required(page, qp))
> continue;
> if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
> - if (!vma_migratable(vma))
> + /* MPOL_MF_STRICT must be specified if we get here */
> + if (!vma_migratable(vma)) {
> + has_unmovable |= true;
'|=' is weird, just use '='
> break;
> + }
> migrate_page_add(page, qp->pagelist, flags);
> } else
> break;
> }
> pte_unmap_unlock(pte - 1, ptl);
> cond_resched();
> +
> + if (has_unmovable)
> + return 1;
> +
> return addr != end ? -EIO : 0;
> }
>
...
> @@ -1259,11 +1286,12 @@ static long do_mbind(unsigned long start, unsigned long len,
> putback_movable_pages(&pagelist);
> }
>
> - if (nr_failed && (flags & MPOL_MF_STRICT))
> + if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
> err = -EIO;
> } else
> putback_movable_pages(&pagelist);
>
> +up_out:
> up_write(&mm->mmap_sem);
> mpol_out:
The new label made the wrong identation of this one stand out, so I'd just fix
it up while here.
Thanks!
> mpol_put(new);
>
^ permalink raw reply
* Re: [PATCH v9 08/10] open: openat2(2) syscall
From: Christian Brauner @ 2019-07-19 10:29 UTC (permalink / raw)
To: Dmitry V. Levin
Cc: Arnd Bergmann, Aleksa Sarai, Al Viro, Jeff Layton,
J. Bruce Fields, David Howells, Shuah Khan, Shuah Khan,
Eric Biederman, Andy Lutomirski, Andrew Morton,
Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
Linus Torvalds, containers
In-Reply-To: <20190719021218.GB18022@altlinux.org>
On Fri, Jul 19, 2019 at 05:12:18AM +0300, Dmitry V. Levin wrote:
> On Thu, Jul 18, 2019 at 11:29:50PM +0200, Arnd Bergmann wrote:
> [...]
> > 5. you get the same problem with seccomp and strace that
> > clone3() has -- these and others only track the register
> > arguments by default.
>
> Just for the record, this is definitely not the case for strace:
> it decodes arrays, structures, netlink messages, and so on by default.
There sure is value in trying to design syscalls that can be handled
nicely by seccomp but that shouldn't become a burden on designing
extensible syscalls.
I suggested a session for Ksummit where we can discuss if and how we can
make seccomp more compatible with pointer-args in syscalls.
Christian
^ permalink raw reply
* [PATCH AUTOSEL 5.2 124/171] rseq/selftests: Fix Thumb mode build failure on arm32
From: Sasha Levin @ 2019-07-19 3:55 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Mathieu Desnoyers, Will Deacon, Peter Zijlstra, Thomas Gleixner,
Joel Fernandes, Catalin Marinas, Dave Watson, Shuah Khan,
Andi Kleen, linux-kselftest, H . Peter Anvin, Chris Lameter,
Russell King, Michael Kerrisk, Paul E . McKenney, Paul Turner,
Boqun Feng, Josh Triplett, Steven Rostedt, Ben Maurer, linux-api@
In-Reply-To: <20190719035643.14300-1-sashal@kernel.org>
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
[ Upstream commit ee8a84c60bcc1f1615bd9cb3edfe501e26cdc85b ]
Using ".arm .inst" for the arm signature introduces build issues for
programs compiled in Thumb mode because the assembler stays in the
arm mode for the rest of the inline assembly. Revert to using a ".word"
to express the signature as data instead.
The choice of signature is a valid trap instruction on arm32 little
endian, where both code and data are little endian.
ARMv6+ big endian (BE8) generates mixed endianness code vs data:
little-endian code and big-endian data. The data value of the signature
needs to have its byte order reversed to generate the trap instruction.
Prior to ARMv6, -mbig-endian generates big-endian code and data
(which match), so the endianness of the data representation of the
signature should not be reversed. However, the choice between BE32
and BE8 is done by the linker, so we cannot know whether code and
data endianness will be mixed before the linker is invoked. So rather
than try to play tricks with the linker, the rseq signature is simply
data (not a trap instruction) prior to ARMv6 on big endian. This is
why the signature is expressed as data (.word) rather than as
instruction (.inst) in assembler.
Because a ".word" is used to emit the signature, it will be interpreted
as a literal pool by a disassembler, not as an actual instruction.
Considering that the signature is not meant to be executed except in
scenarios where the program execution is completely bogus, this should
not be an issue.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Will Deacon <will.deacon@arm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Joel Fernandes <joelaf@google.com>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Dave Watson <davejwatson@fb.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Andi Kleen <andi@firstfloor.org>
CC: linux-kselftest@vger.kernel.org
CC: "H . Peter Anvin" <hpa@zytor.com>
CC: Chris Lameter <cl@linux.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
CC: Paul Turner <pjt@google.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ben Maurer <bmaurer@fb.com>
CC: linux-api@vger.kernel.org
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Carlos O'Donell <carlos@redhat.com>
CC: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
tools/testing/selftests/rseq/rseq-arm.h | 61 +++++++++++++------------
1 file changed, 33 insertions(+), 28 deletions(-)
diff --git a/tools/testing/selftests/rseq/rseq-arm.h b/tools/testing/selftests/rseq/rseq-arm.h
index 84f28f147fb6..5943c816c07c 100644
--- a/tools/testing/selftests/rseq/rseq-arm.h
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -6,6 +6,8 @@
*/
/*
+ * - ARM little endian
+ *
* RSEQ_SIG uses the udf A32 instruction with an uncommon immediate operand
* value 0x5de3. This traps if user-space reaches this instruction by mistake,
* and the uncommon operand ensures the kernel does not move the instruction
@@ -22,36 +24,40 @@
* def3 udf #243 ; 0xf3
* e7f5 b.n <7f5>
*
- * pre-ARMv6 big endian code:
- * e7f5 b.n <7f5>
- * def3 udf #243 ; 0xf3
+ * - ARMv6+ big endian (BE8):
*
* ARMv6+ -mbig-endian generates mixed endianness code vs data: little-endian
- * code and big-endian data. Ensure the RSEQ_SIG data signature matches code
- * endianness. Prior to ARMv6, -mbig-endian generates big-endian code and data
- * (which match), so there is no need to reverse the endianness of the data
- * representation of the signature. However, the choice between BE32 and BE8
- * is done by the linker, so we cannot know whether code and data endianness
- * will be mixed before the linker is invoked.
+ * code and big-endian data. The data value of the signature needs to have its
+ * byte order reversed to generate the trap instruction:
+ *
+ * Data: 0xf3def5e7
+ *
+ * Translates to this A32 instruction pattern:
+ *
+ * e7f5def3 udf #24035 ; 0x5de3
+ *
+ * Translates to this T16 instruction pattern:
+ *
+ * def3 udf #243 ; 0xf3
+ * e7f5 b.n <7f5>
+ *
+ * - Prior to ARMv6 big endian (BE32):
+ *
+ * Prior to ARMv6, -mbig-endian generates big-endian code and data
+ * (which match), so the endianness of the data representation of the
+ * signature should not be reversed. However, the choice between BE32
+ * and BE8 is done by the linker, so we cannot know whether code and
+ * data endianness will be mixed before the linker is invoked. So rather
+ * than try to play tricks with the linker, the rseq signature is simply
+ * data (not a trap instruction) prior to ARMv6 on big endian. This is
+ * why the signature is expressed as data (.word) rather than as
+ * instruction (.inst) in assembler.
*/
-#define RSEQ_SIG_CODE 0xe7f5def3
-
-#ifndef __ASSEMBLER__
-
-#define RSEQ_SIG_DATA \
- ({ \
- int sig; \
- asm volatile ("b 2f\n\t" \
- "1: .inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
- "2:\n\t" \
- "ldr %[sig], 1b\n\t" \
- : [sig] "=r" (sig)); \
- sig; \
- })
-
-#define RSEQ_SIG RSEQ_SIG_DATA
-
+#ifdef __ARMEB__
+#define RSEQ_SIG 0xf3def5e7 /* udf #24035 ; 0x5de3 (ARMv6+) */
+#else
+#define RSEQ_SIG 0xe7f5def3 /* udf #24035 ; 0x5de3 */
#endif
#define rseq_smp_mb() __asm__ __volatile__ ("dmb" ::: "memory", "cc")
@@ -125,8 +131,7 @@ do { \
__rseq_str(table_label) ":\n\t" \
".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
- ".arm\n\t" \
- ".inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
+ ".word " __rseq_str(RSEQ_SIG) "\n\t" \
__rseq_str(label) ":\n\t" \
teardown \
"b %l[" __rseq_str(abort_label) "]\n\t"
--
2.20.1
^ permalink raw reply related
* Re: [PATCH v9 08/10] open: openat2(2) syscall
From: Aleksa Sarai @ 2019-07-19 2:19 UTC (permalink / raw)
To: Dmitry V. Levin
Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan, Christian Brauner,
Eric Biederman, Andy Lutomirski, Andrew Morton,
Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
Linus Torvalds, containers
In-Reply-To: <20190719015933.GA18022@altlinux.org>
[-- Attachment #1: Type: text/plain, Size: 1088 bytes --]
On 2019-07-19, Dmitry V. Levin <ldv@altlinux.org> wrote:
> On Sun, Jul 07, 2019 at 12:57:35AM +1000, Aleksa Sarai wrote:
> [...]
> > +/**
> > + * Arguments for how openat2(2) should open the target path. If @extra is zero,
> > + * then openat2(2) is identical to openat(2).
> > + *
> > + * @flags: O_* flags (unknown flags ignored).
>
> What was the rationale for implementing this semantics?
> Ignoring unknown flags makes potential extension of this new interface
> problematic. This has bitten us many times already, so ...
I am mirroring the semantics of open(2) and openat(2).
To be clear, I am in favour of doing it -- and it would definitely be
possible to implement it with -EINVAL (you would just mask off
~VALID_OPEN_FLAGS for the older syscalls). But Linus' response to my
point about (the lack of) -EINVAL for unknown open(2) flags gave me the
impression he would be against this idea (though I might be
misunderstanding the point he was making).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply
* Re: [PATCH v9 08/10] open: openat2(2) syscall
From: Dmitry V. Levin @ 2019-07-19 2:12 UTC (permalink / raw)
To: Arnd Bergmann
Cc: linux-ia64, Linux-sh list, Alexei Starovoitov,
Linux Kernel Mailing List, David Howells,
open list:KERNEL SELFTEST FRAMEWORK, sparclinux, Shuah Khan,
linux-arch, linux-s390, Tycho Andersen, Aleksa Sarai, Linux ARM,
linux-mips, linux-xtensa, Kees Cook, Jann Horn, linuxppc-dev,
Aleksa Sarai, Al Viro, Andy Lutomirski, Shuah Khan,
David Drysdale <drysdale@
In-Reply-To: <CAK8P3a3MiYK4bJiA3G_m5H-TpfN5__--b+=szsJBhG7_it+NQg@mail.gmail.com>
[-- Attachment #1.1: Type: text/plain, Size: 376 bytes --]
On Thu, Jul 18, 2019 at 11:29:50PM +0200, Arnd Bergmann wrote:
[...]
> 5. you get the same problem with seccomp and strace that
> clone3() has -- these and others only track the register
> arguments by default.
Just for the record, this is definitely not the case for strace:
it decodes arrays, structures, netlink messages, and so on by default.
--
ldv
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]
[-- Attachment #2: Type: text/plain, Size: 176 bytes --]
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply
* Re: [PATCH v9 08/10] open: openat2(2) syscall
From: Dmitry V. Levin @ 2019-07-19 1:59 UTC (permalink / raw)
To: Aleksa Sarai
Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
David Howells, Shuah Khan, Shuah Khan, Christian Brauner,
Eric Biederman, Andy Lutomirski, Andrew Morton,
Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
Linus Torvalds, containers
In-Reply-To: <20190706145737.5299-9-cyphar@cyphar.com>
[-- Attachment #1: Type: text/plain, Size: 785 bytes --]
On Sun, Jul 07, 2019 at 12:57:35AM +1000, Aleksa Sarai wrote:
[...]
> +/**
> + * Arguments for how openat2(2) should open the target path. If @extra is zero,
> + * then openat2(2) is identical to openat(2).
> + *
> + * @flags: O_* flags (unknown flags ignored).
What was the rationale for implementing this semantics?
Ignoring unknown flags makes potential extension of this new interface
problematic. This has bitten us many times already, so ...
> + * @mode: O_CREAT file mode (ignored otherwise).
> + * @upgrade_mask: restrict how the O_PATH may be re-opened (ignored otherwise).
> + * @resolve: RESOLVE_* flags (-EINVAL on unknown flags).
... could you consider implementing this (-EINVAL on unknown flags) semantics
for @flags as well, please?
--
ldv
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]
^ permalink raw reply
* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-07-18 21:52 UTC (permalink / raw)
To: Richard Guy Briggs
Cc: Serge E. Hallyn, Tycho Andersen, containers, linux-api,
Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
ebiederm, nhorman
In-Reply-To: <20190718005145.eshekqfr3navqqiy@madcap2.tricolour.ca>
On Wed, Jul 17, 2019 at 8:52 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-07-16 19:30, Paul Moore wrote:
...
> > We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit container
> > ID policy, we can not trust ns_capable(CAP_AUDIT_CONTROL).
>
> Ok. So does a process in a non-init user namespace have two (or more)
> sets of capabilities stored in creds, one in the init_user_ns, and one
> in current_user_ns? Or does it get stripped of all its capabilities in
> init_user_ns once it has its own set in current_user_ns? If the former,
> then we can use capable(). If the latter, we need another mechanism, as
> you have suggested might be needed.
Unfortunately I think the problem is that ultimately we need to allow
any container orchestrator that has been given privileges to manage
the audit container ID to also grant that privilege to any of the
child process/containers it manages. I don't believe we can do that
with capabilities based on the code I've looked at, and the
discussions I've had, but if you find a way I would leave to hear it.
> If some random unprivileged user wants to fire up a container
> orchestrator/engine in his own user namespace, then audit needs to be
> namespaced. Can we safely discard this scenario for now?
I think the only time we want to allow a container orchestrator to
manage the audit container ID is if it has been granted that privilege
by someone who has that privilege already. In the zero-container, or
single-level of containers, case this is relatively easy, and we can
accomplish it using CAP_AUDIT_CONTROL as the privilege. If we start
nesting container orchestrators it becomes more complicated as we need
to be able to support granting and inheriting this privilege in a
manner; this is why I suggested a new mechanism *may* be necessary.
--
paul moore
www.paul-moore.com
^ permalink raw reply
* Re: [PATCH v9 08/10] open: openat2(2) syscall
From: Arnd Bergmann @ 2019-07-18 21:29 UTC (permalink / raw)
To: Aleksa Sarai
Cc: Al Viro, Jeff Layton, J. Bruce Fields, David Howells, Shuah Khan,
Shuah Khan, Christian Brauner, Eric Biederman, Andy Lutomirski,
Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
Aleksa Sarai, Linus Torvalds, containers, alpha <linux-alp>
In-Reply-To: <20190718161231.xcno272nvqpln3wj@yavin>
On Thu, Jul 18, 2019 at 6:12 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-07-18, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Sat, Jul 6, 2019 at 5:00 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > In fact, that seems similar enough to the existing openat() that I think
> > you could also just add the fifth argument to the existing call when
> > a newly defined flag is set, similarly to how we only use the 'mode'
> > argument when O_CREAT or O_TMPFILE are set.
>
> I considered doing this (and even had a preliminary version of it), but
> I discovered that I was not in favour of this idea -- once I started to
> write tests using it -- for a few reasons:
>
> 1. It doesn't really allow for clean extension for a future 6th
> argument (because you are using up O_* flags to signify "use the
> next argument", and O_* flags don't give -EINVAL if they're
> unknown). Now, yes you can do the on-start runtime check that
> everyone does -- but I've never really liked having to do it.
>
> Having reserved padding for later extensions (that is actually
> checked and gives -EINVAL) matches more modern syscall designs.
>
> 2. I really was hoping that the variadic openat(2) could be done away
> using this union setup (Linus said he didn't like it, and suggested
> using something like 'struct stat' as an argument for openat(2) --
> though personally I am not sure I would personally like to use an
> interface like that).
>
> 3. In order to avoid wasting a syscall argument for mode/mask you need
> to either have something like your suggested mode_mask (which makes
> the syscall arguments less consistent) or have some sort of
> mode-like argument that is treated specially (which is really awful
> on multiple levels -- this one I also tried and even wrote my
> original tests using). And in both cases, the shims for
> open{,at}(2) are somewhat less clean.
These are all good reasons, thanks for providing the background.
> All of that being said, I'd be happy to switch to whatever you think
> makes the most sense. As long as it's possible to get an O_PATH with
> RESOLVE_IN_ROOT set, I'm happy.
I don't feel I should be in charge of making the decision. I'd still
prefer avoiding the indirect argument structure because
4. it's inconsistent with most other syscalls
5. you get the same problem with seccomp and strace that
clone3() has -- these and others only track the register
arguments by default.
6. copying the structure adds a small overhead compared to
passing registers
7. the calling conventions may be inconvenient for a user space
library, so you end up with different prototypes for the low-level
syscall and the libc abstraction.
I don't see any of the above seven points as a showstopper
either way, so I hope someone else has a strong opinion
and can make the decision easier for you.
In the meantime just keep what you have, so you don't have
to change it multiple times.
Arnd
^ permalink raw reply
* Re: [PATCH V36 23/29] bpf: Restrict bpf when kernel lockdown is in confidentiality mode
From: Kees Cook @ 2019-07-18 21:06 UTC (permalink / raw)
To: Matthew Garrett
Cc: jmorris, linux-security-module, linux-kernel, linux-api,
David Howells, Alexei Starovoitov, Matthew Garrett, netdev,
Chun-Yi Lee, Daniel Borkmann
In-Reply-To: <20190718194415.108476-24-matthewgarrett@google.com>
On Thu, Jul 18, 2019 at 12:44:09PM -0700, Matthew Garrett wrote:
> From: David Howells <dhowells@redhat.com>
>
> bpf_read() and bpf_read_str() could potentially be abused to (eg) allow
> private keys in kernel memory to be leaked. Disable them if the kernel
> has been locked down in confidentiality mode.
>
> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
-Kees
> cc: netdev@vger.kernel.org
> cc: Chun-Yi Lee <jlee@suse.com>
> cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> ---
> include/linux/security.h | 1 +
> kernel/trace/bpf_trace.c | 10 ++++++++++
> security/lockdown/lockdown.c | 1 +
> 3 files changed, 12 insertions(+)
>
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 987d8427f091..8dd1741a52cd 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -118,6 +118,7 @@ enum lockdown_reason {
> LOCKDOWN_INTEGRITY_MAX,
> LOCKDOWN_KCORE,
> LOCKDOWN_KPROBES,
> + LOCKDOWN_BPF_READ,
> LOCKDOWN_CONFIDENTIALITY_MAX,
> };
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ca1255d14576..492a8bfaae98 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -142,8 +142,13 @@ BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
> {
> int ret;
>
> + ret = security_locked_down(LOCKDOWN_BPF_READ);
> + if (ret < 0)
> + goto out;
> +
> ret = probe_kernel_read(dst, unsafe_ptr, size);
> if (unlikely(ret < 0))
> +out:
> memset(dst, 0, size);
>
> return ret;
> @@ -569,6 +574,10 @@ BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
> {
> int ret;
>
> + ret = security_locked_down(LOCKDOWN_BPF_READ);
> + if (ret < 0)
> + goto out;
> +
> /*
> * The strncpy_from_unsafe() call will likely not fill the entire
> * buffer, but that's okay in this circumstance as we're probing
> @@ -580,6 +589,7 @@ BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
> */
> ret = strncpy_from_unsafe(dst, unsafe_ptr, size);
> if (unlikely(ret < 0))
> +out:
> memset(dst, 0, size);
>
> return ret;
> diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
> index 6b123cbf3748..1b89d3e8e54d 100644
> --- a/security/lockdown/lockdown.c
> +++ b/security/lockdown/lockdown.c
> @@ -33,6 +33,7 @@ static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
> [LOCKDOWN_INTEGRITY_MAX] = "integrity",
> [LOCKDOWN_KCORE] = "/proc/kcore access",
> [LOCKDOWN_KPROBES] = "use of kprobes",
> + [LOCKDOWN_BPF_READ] = "use of bpf to read kernel RAM",
> [LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
> };
>
> --
> 2.22.0.510.g264f2c817a-goog
>
--
Kees Cook
^ permalink raw reply
* Re: [PATCH V36 20/29] x86/mmiotrace: Lock down the testmmiotrace module
From: Kees Cook @ 2019-07-18 21:06 UTC (permalink / raw)
To: Matthew Garrett
Cc: jmorris, linux-security-module, linux-kernel, linux-api,
David Howells, Thomas Gleixner, Matthew Garrett, Steven Rostedt,
Ingo Molnar, H. Peter Anvin, x86
In-Reply-To: <20190718194415.108476-21-matthewgarrett@google.com>
On Thu, Jul 18, 2019 at 12:44:06PM -0700, Matthew Garrett wrote:
> From: David Howells <dhowells@redhat.com>
>
> The testmmiotrace module shouldn't be permitted when the kernel is locked
> down as it can be used to arbitrarily read and write MMIO space. This is
> a runtime check rather than buildtime in order to allow configurations
> where the same kernel may be run in both locked down or permissive modes
> depending on local policy.
>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: David Howells <dhowells@redhat.com
Reviewed-by: Kees Cook <keescook@chromium.org>
-Kees
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> cc: Thomas Gleixner <tglx@linutronix.de>
> cc: Steven Rostedt <rostedt@goodmis.org>
> cc: Ingo Molnar <mingo@kernel.org>
> cc: "H. Peter Anvin" <hpa@zytor.com>
> cc: x86@kernel.org
> ---
> arch/x86/mm/testmmiotrace.c | 5 +++++
> include/linux/security.h | 1 +
> security/lockdown/lockdown.c | 1 +
> 3 files changed, 7 insertions(+)
>
> diff --git a/arch/x86/mm/testmmiotrace.c b/arch/x86/mm/testmmiotrace.c
> index 0881e1ff1e58..a8bd952e136d 100644
> --- a/arch/x86/mm/testmmiotrace.c
> +++ b/arch/x86/mm/testmmiotrace.c
> @@ -8,6 +8,7 @@
> #include <linux/module.h>
> #include <linux/io.h>
> #include <linux/mmiotrace.h>
> +#include <linux/security.h>
>
> static unsigned long mmio_address;
> module_param_hw(mmio_address, ulong, iomem, 0);
> @@ -115,6 +116,10 @@ static void do_test_bulk_ioremapping(void)
> static int __init init(void)
> {
> unsigned long size = (read_far) ? (8 << 20) : (16 << 10);
> + int ret = security_locked_down(LOCKDOWN_MMIOTRACE);
> +
> + if (ret)
> + return ret;
>
> if (mmio_address == 0) {
> pr_err("you have to use the module argument mmio_address.\n");
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 43fa3486522b..3f7b6a4cd65a 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -114,6 +114,7 @@ enum lockdown_reason {
> LOCKDOWN_PCMCIA_CIS,
> LOCKDOWN_TIOCSSERIAL,
> LOCKDOWN_MODULE_PARAMETERS,
> + LOCKDOWN_MMIOTRACE,
> LOCKDOWN_INTEGRITY_MAX,
> LOCKDOWN_CONFIDENTIALITY_MAX,
> };
> diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
> index 5177938cfa0d..37b7d7e50474 100644
> --- a/security/lockdown/lockdown.c
> +++ b/security/lockdown/lockdown.c
> @@ -29,6 +29,7 @@ static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
> [LOCKDOWN_PCMCIA_CIS] = "direct PCMCIA CIS storage",
> [LOCKDOWN_TIOCSSERIAL] = "reconfiguration of serial port IO",
> [LOCKDOWN_MODULE_PARAMETERS] = "unsafe module parameters",
> + [LOCKDOWN_MMIOTRACE] = "unsafe mmio",
> [LOCKDOWN_INTEGRITY_MAX] = "integrity",
> [LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
> };
> --
> 2.22.0.510.g264f2c817a-goog
>
--
Kees Cook
^ permalink raw reply
* Re: [PATCH V36 02/29] security: Add a "locked down" LSM hook
From: Casey Schaufler @ 2019-07-18 20:03 UTC (permalink / raw)
To: Matthew Garrett, jmorris
Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
Kees Cook
In-Reply-To: <20190718194415.108476-3-matthewgarrett@google.com>
On 7/18/2019 12:43 PM, Matthew Garrett wrote:
> Add a mechanism to allow LSMs to make a policy decision around whether
> kernel functionality that would allow tampering with or examining the
> runtime state of the kernel should be permitted.
>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
> ---
> include/linux/lsm_hooks.h | 2 ++
> include/linux/security.h | 32 ++++++++++++++++++++++++++++++++
> security/security.c | 6 ++++++
> 3 files changed, 40 insertions(+)
>
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index aebb0e032072..29c22cf40113 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -1807,6 +1807,7 @@ union security_list_options {
> int (*bpf_prog_alloc_security)(struct bpf_prog_aux *aux);
> void (*bpf_prog_free_security)(struct bpf_prog_aux *aux);
> #endif /* CONFIG_BPF_SYSCALL */
> + int (*locked_down)(enum lockdown_reason what);
> };
>
> struct security_hook_heads {
> @@ -2046,6 +2047,7 @@ struct security_hook_heads {
> struct hlist_head bpf_prog_alloc_security;
> struct hlist_head bpf_prog_free_security;
> #endif /* CONFIG_BPF_SYSCALL */
> + struct hlist_head locked_down;
> } __randomize_layout;
>
> /*
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 66a2fcbe6ab0..c2b1204e8e26 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -77,6 +77,33 @@ enum lsm_event {
> LSM_POLICY_CHANGE,
> };
>
> +/*
> + * These are reasons that can be passed to the security_locked_down()
> + * LSM hook. Lockdown reasons that protect kernel integrity (ie, the
> + * ability for userland to modify kernel code) are placed before
> + * LOCKDOWN_INTEGRITY_MAX. Lockdown reasons that protect kernel
> + * confidentiality (ie, the ability for userland to extract
> + * information from the running kernel that would otherwise be
> + * restricted) are placed before LOCKDOWN_CONFIDENTIALITY_MAX.
> + *
> + * LSM authors should note that the semantics of any given lockdown
> + * reason are not guaranteed to be stable - the same reason may block
> + * one set of features in one kernel release, and a slightly different
> + * set of features in a later kernel release. LSMs that seek to expose
> + * lockdown policy at any level of granularity other than "none",
> + * "integrity" or "confidentiality" are responsible for either
> + * ensuring that they expose a consistent level of functionality to
> + * userland, or ensuring that userland is aware that this is
> + * potentially a moving target. It is easy to misuse this information
> + * in a way that could break userspace. Please be careful not to do
> + * so.
> + */
> +enum lockdown_reason {
> + LOCKDOWN_NONE,
> + LOCKDOWN_INTEGRITY_MAX,
> + LOCKDOWN_CONFIDENTIALITY_MAX,
> +};
> +
> /* These functions are in security/commoncap.c */
> extern int cap_capable(const struct cred *cred, struct user_namespace *ns,
> int cap, unsigned int opts);
> @@ -393,6 +420,7 @@ void security_inode_invalidate_secctx(struct inode *inode);
> int security_inode_notifysecctx(struct inode *inode, void *ctx, u32 ctxlen);
> int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen);
> int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);
> +int security_locked_down(enum lockdown_reason what);
> #else /* CONFIG_SECURITY */
>
> static inline int call_blocking_lsm_notifier(enum lsm_event event, void *data)
> @@ -1205,6 +1233,10 @@ static inline int security_inode_getsecctx(struct inode *inode, void **ctx, u32
> {
> return -EOPNOTSUPP;
> }
> +static inline int security_locked_down(enum lockdown_reason what)
> +{
> + return 0;
> +}
> #endif /* CONFIG_SECURITY */
>
> #ifdef CONFIG_SECURITY_NETWORK
> diff --git a/security/security.c b/security/security.c
> index 90f1e291c800..ce6c945bf347 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -2392,3 +2392,9 @@ void security_bpf_prog_free(struct bpf_prog_aux *aux)
> call_void_hook(bpf_prog_free_security, aux);
> }
> #endif /* CONFIG_BPF_SYSCALL */
> +
> +int security_locked_down(enum lockdown_reason what)
> +{
> + return call_int_hook(locked_down, 0, what);
> +}
> +EXPORT_SYMBOL(security_locked_down);
^ permalink raw reply
* Re: [PATCH V36 01/29] security: Support early LSMs
From: Casey Schaufler @ 2019-07-18 20:02 UTC (permalink / raw)
To: Matthew Garrett, jmorris
Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
Kees Cook
In-Reply-To: <20190718194415.108476-2-matthewgarrett@google.com>
On 7/18/2019 12:43 PM, Matthew Garrett wrote:
> The lockdown module is intended to allow for kernels to be locked down
> early in boot - sufficiently early that we don't have the ability to
> kmalloc() yet. Add support for early initialisation of some LSMs, and
> then add them to the list of names when we do full initialisation later.
> Early LSMs are initialised in link order and cannot be overridden via
> boot parameters, and cannot make use of kmalloc() (since the allocator
> isn't initialised yet).
>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
> ---
> include/asm-generic/vmlinux.lds.h | 8 ++++-
> include/linux/lsm_hooks.h | 6 ++++
> include/linux/security.h | 1 +
> init/main.c | 1 +
> security/security.c | 50 ++++++++++++++++++++++++++-----
> 5 files changed, 57 insertions(+), 9 deletions(-)
>
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index ca42182992a5..6cc6174a2a4c 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -215,8 +215,13 @@
> __start_lsm_info = .; \
> KEEP(*(.lsm_info.init)) \
> __end_lsm_info = .;
> +#define EARLY_LSM_TABLE() . = ALIGN(8); \
> + __start_early_lsm_info = .; \
> + KEEP(*(.early_lsm_info.init)) \
> + __end_early_lsm_info = .;
> #else
> #define LSM_TABLE()
> +#define EARLY_LSM_TABLE()
> #endif
>
> #define ___OF_TABLE(cfg, name) _OF_TABLE_##cfg(name)
> @@ -616,7 +621,8 @@
> ACPI_PROBE_TABLE(irqchip) \
> ACPI_PROBE_TABLE(timer) \
> EARLYCON_TABLE() \
> - LSM_TABLE()
> + LSM_TABLE() \
> + EARLY_LSM_TABLE()
>
> #define INIT_TEXT \
> *(.init.text .init.text.*) \
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index df1318d85f7d..aebb0e032072 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -2104,12 +2104,18 @@ struct lsm_info {
> };
>
> extern struct lsm_info __start_lsm_info[], __end_lsm_info[];
> +extern struct lsm_info __start_early_lsm_info[], __end_early_lsm_info[];
>
> #define DEFINE_LSM(lsm) \
> static struct lsm_info __lsm_##lsm \
> __used __section(.lsm_info.init) \
> __aligned(sizeof(unsigned long))
>
> +#define DEFINE_EARLY_LSM(lsm) \
> + static struct lsm_info __early_lsm_##lsm \
> + __used __section(.early_lsm_info.init) \
> + __aligned(sizeof(unsigned long))
> +
> #ifdef CONFIG_SECURITY_SELINUX_DISABLE
> /*
> * Assuring the safety of deleting a security module is up to
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 5f7441abbf42..66a2fcbe6ab0 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -195,6 +195,7 @@ int unregister_blocking_lsm_notifier(struct notifier_block *nb);
>
> /* prototypes */
> extern int security_init(void);
> +extern int early_security_init(void);
>
> /* Security operations */
> int security_binder_set_context_mgr(struct task_struct *mgr);
> diff --git a/init/main.c b/init/main.c
> index ff5803b0841c..0fefca3fd43c 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -593,6 +593,7 @@ asmlinkage __visible void __init start_kernel(void)
> boot_cpu_init();
> page_address_init();
> pr_notice("%s", linux_banner);
> + early_security_init();
> setup_arch(&command_line);
> mm_init_cpumask(&init_mm);
> setup_command_line(command_line);
> diff --git a/security/security.c b/security/security.c
> index 250ee2d76406..90f1e291c800 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -33,6 +33,7 @@
>
> /* How many LSMs were built into the kernel? */
> #define LSM_COUNT (__end_lsm_info - __start_lsm_info)
> +#define EARLY_LSM_COUNT (__end_early_lsm_info - __start_early_lsm_info)
>
> struct security_hook_heads security_hook_heads __lsm_ro_after_init;
> static BLOCKING_NOTIFIER_HEAD(blocking_lsm_notifier_chain);
> @@ -277,6 +278,8 @@ static void __init ordered_lsm_parse(const char *order, const char *origin)
> static void __init lsm_early_cred(struct cred *cred);
> static void __init lsm_early_task(struct task_struct *task);
>
> +static int lsm_append(const char *new, char **result);
> +
> static void __init ordered_lsm_init(void)
> {
> struct lsm_info **lsm;
> @@ -323,6 +326,26 @@ static void __init ordered_lsm_init(void)
> kfree(ordered_lsms);
> }
>
> +int __init early_security_init(void)
> +{
> + int i;
> + struct hlist_head *list = (struct hlist_head *) &security_hook_heads;
> + struct lsm_info *lsm;
> +
> + for (i = 0; i < sizeof(security_hook_heads) / sizeof(struct hlist_head);
> + i++)
> + INIT_HLIST_HEAD(&list[i]);
> +
> + for (lsm = __start_early_lsm_info; lsm < __end_early_lsm_info; lsm++) {
> + if (!lsm->enabled)
> + lsm->enabled = &lsm_enabled_true;
> + prepare_lsm(lsm);
> + initialize_lsm(lsm);
> + }
> +
> + return 0;
> +}
> +
> /**
> * security_init - initializes the security framework
> *
> @@ -330,14 +353,18 @@ static void __init ordered_lsm_init(void)
> */
> int __init security_init(void)
> {
> - int i;
> - struct hlist_head *list = (struct hlist_head *) &security_hook_heads;
> + struct lsm_info *lsm;
>
> pr_info("Security Framework initializing\n");
>
> - for (i = 0; i < sizeof(security_hook_heads) / sizeof(struct hlist_head);
> - i++)
> - INIT_HLIST_HEAD(&list[i]);
> + /*
> + * Append the names of the early LSM modules now that kmalloc() is
> + * available
> + */
> + for (lsm = __start_early_lsm_info; lsm < __end_early_lsm_info; lsm++) {
> + if (lsm->enabled)
> + lsm_append(lsm->name, &lsm_names);
> + }
>
> /* Load LSMs in specified order. */
> ordered_lsm_init();
> @@ -384,7 +411,7 @@ static bool match_last_lsm(const char *list, const char *lsm)
> return !strcmp(last, lsm);
> }
>
> -static int lsm_append(char *new, char **result)
> +static int lsm_append(const char *new, char **result)
> {
> char *cp;
>
> @@ -422,8 +449,15 @@ void __init security_add_hooks(struct security_hook_list *hooks, int count,
> hooks[i].lsm = lsm;
> hlist_add_tail_rcu(&hooks[i].list, hooks[i].head);
> }
> - if (lsm_append(lsm, &lsm_names) < 0)
> - panic("%s - Cannot get early memory.\n", __func__);
> +
> + /*
> + * Don't try to append during early_security_init(), we'll come back
> + * and fix this up afterwards.
> + */
> + if (slab_is_available()) {
> + if (lsm_append(lsm, &lsm_names) < 0)
> + panic("%s - Cannot get early memory.\n", __func__);
> + }
> }
>
> int call_blocking_lsm_notifier(enum lsm_event event, void *data)
^ permalink raw reply
* [PATCH V36 29/29] lockdown: Print current->comm in restriction messages
From: Matthew Garrett @ 2019-07-18 19:44 UTC (permalink / raw)
To: jmorris
Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
David Howells, Matthew Garrett, Kees Cook
In-Reply-To: <20190718194415.108476-1-matthewgarrett@google.com>
Print the content of current->comm in messages generated by lockdown to
indicate a restriction that was hit. This makes it a bit easier to find
out what caused the message.
The message now patterned something like:
Lockdown: <comm>: <what> is restricted; see man kernel_lockdown.7
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
fs/proc/kcore.c | 5 +++--
security/lockdown/lockdown.c | 8 ++++++--
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index ee2c576cc94e..e2ed8e08cc7a 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -548,11 +548,12 @@ static int open_kcore(struct inode *inode, struct file *filp)
{
int ret = security_locked_down(LOCKDOWN_KCORE);
- if (ret)
- return ret;
if (!capable(CAP_SYS_RAWIO))
return -EPERM;
+ if (ret)
+ return ret;
+
filp->private_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (!filp->private_data)
return -ENOMEM;
diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
index 173191562047..f6c74cf6a798 100644
--- a/security/lockdown/lockdown.c
+++ b/security/lockdown/lockdown.c
@@ -81,10 +81,14 @@ early_param("lockdown", lockdown_param);
*/
static int lockdown_is_locked_down(enum lockdown_reason what)
{
+ if (WARN(what >= LOCKDOWN_CONFIDENTIALITY_MAX,
+ "Invalid lockdown reason"))
+ return -EPERM;
+
if (kernel_locked_down >= what) {
if (lockdown_reasons[what])
- pr_notice("Lockdown: %s is restricted; see man kernel_lockdown.7\n",
- lockdown_reasons[what]);
+ pr_notice("Lockdown: %s: %s is restricted; see man kernel_lockdown.7\n",
+ current->comm, lockdown_reasons[what]);
return -EPERM;
}
--
2.22.0.510.g264f2c817a-goog
^ permalink raw reply related
* [PATCH V36 28/29] efi: Restrict efivar_ssdt_load when the kernel is locked down
From: Matthew Garrett @ 2019-07-18 19:44 UTC (permalink / raw)
To: jmorris
Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
Matthew Garrett, Ard Biesheuvel, Kees Cook, linux-efi
In-Reply-To: <20190718194415.108476-1-matthewgarrett@google.com>
efivar_ssdt_load allows the kernel to import arbitrary ACPI code from an
EFI variable, which gives arbitrary code execution in ring 0. Prevent
that when the kernel is locked down.
Signed-off-by: Matthew Garrett <mjg59@google.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: linux-efi@vger.kernel.org
---
drivers/firmware/efi/efi.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index ad3b1f4866b3..776f479e5499 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -30,6 +30,7 @@
#include <linux/acpi.h>
#include <linux/ucs2_string.h>
#include <linux/memblock.h>
+#include <linux/security.h>
#include <asm/early_ioremap.h>
@@ -242,6 +243,11 @@ static void generic_ops_unregister(void)
static char efivar_ssdt[EFIVAR_SSDT_NAME_MAX] __initdata;
static int __init efivar_ssdt_setup(char *str)
{
+ int ret = security_locked_down(LOCKDOWN_ACPI_TABLES);
+
+ if (ret)
+ return ret;
+
if (strlen(str) < sizeof(efivar_ssdt))
memcpy(efivar_ssdt, str, strlen(str));
else
--
2.22.0.510.g264f2c817a-goog
^ permalink raw reply related
* [PATCH V36 27/29] tracefs: Restrict tracefs when the kernel is locked down
From: Matthew Garrett @ 2019-07-18 19:44 UTC (permalink / raw)
To: jmorris
Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
Matthew Garrett, Steven Rostedt
In-Reply-To: <20190718194415.108476-1-matthewgarrett@google.com>
Tracefs may release more information about the kernel than desirable, so
restrict it when the kernel is locked down in confidentiality mode by
preventing open().
Signed-off-by: Matthew Garrett <mjg59@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
fs/tracefs/inode.c | 38 +++++++++++++++++++++++++++++++++++-
include/linux/security.h | 1 +
security/lockdown/lockdown.c | 1 +
3 files changed, 39 insertions(+), 1 deletion(-)
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index eeeae0475da9..8a20137e1d8f 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -20,6 +20,7 @@
#include <linux/parser.h>
#include <linux/magic.h>
#include <linux/slab.h>
+#include <linux/security.h>
#define TRACEFS_DEFAULT_MODE 0700
@@ -27,6 +28,23 @@ static struct vfsmount *tracefs_mount;
static int tracefs_mount_count;
static bool tracefs_registered;
+static int default_open_file(struct inode *inode, struct file *filp)
+{
+ struct dentry *dentry = filp->f_path.dentry;
+ struct file_operations *real_fops;
+ int ret;
+
+ if (!dentry)
+ return -EINVAL;
+
+ ret = security_locked_down(LOCKDOWN_TRACEFS);
+ if (ret)
+ return ret;
+
+ real_fops = dentry->d_fsdata;
+ return real_fops->open(inode, filp);
+}
+
static ssize_t default_read_file(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
@@ -221,6 +239,12 @@ static int tracefs_apply_options(struct super_block *sb)
return 0;
}
+static void tracefs_destroy_inode(struct inode *inode)
+{
+ if (S_ISREG(inode->i_mode))
+ kfree(inode->i_fop);
+}
+
static int tracefs_remount(struct super_block *sb, int *flags, char *data)
{
int err;
@@ -256,6 +280,7 @@ static int tracefs_show_options(struct seq_file *m, struct dentry *root)
static const struct super_operations tracefs_super_operations = {
.statfs = simple_statfs,
+ .destroy_inode = tracefs_destroy_inode,
.remount_fs = tracefs_remount,
.show_options = tracefs_show_options,
};
@@ -387,6 +412,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
struct dentry *parent, void *data,
const struct file_operations *fops)
{
+ struct file_operations *proxy_fops;
struct dentry *dentry;
struct inode *inode;
@@ -402,8 +428,18 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
if (unlikely(!inode))
return failed_creating(dentry);
+ proxy_fops = kzalloc(sizeof(struct file_operations), GFP_KERNEL);
+ if (!proxy_fops)
+ return failed_creating(dentry);
+
+ if (!fops)
+ fops = &tracefs_file_operations;
+
+ dentry->d_fsdata = (void *)fops;
+ memcpy(proxy_fops, fops, sizeof(*proxy_fops));
+ proxy_fops->open = default_open_file;
inode->i_mode = mode;
- inode->i_fop = fops ? fops : &tracefs_file_operations;
+ inode->i_fop = proxy_fops;
inode->i_private = data;
d_instantiate(dentry, inode);
fsnotify_create(dentry->d_parent->d_inode, dentry);
diff --git a/include/linux/security.h b/include/linux/security.h
index d92323b44a3f..807dc0d24982 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -121,6 +121,7 @@ enum lockdown_reason {
LOCKDOWN_KPROBES,
LOCKDOWN_BPF_READ,
LOCKDOWN_PERF,
+ LOCKDOWN_TRACEFS,
LOCKDOWN_CONFIDENTIALITY_MAX,
};
diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
index 88064ce1c844..173191562047 100644
--- a/security/lockdown/lockdown.c
+++ b/security/lockdown/lockdown.c
@@ -36,6 +36,7 @@ static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
[LOCKDOWN_KPROBES] = "use of kprobes",
[LOCKDOWN_BPF_READ] = "use of bpf to read kernel RAM",
[LOCKDOWN_PERF] = "unsafe use of perf",
+ [LOCKDOWN_TRACEFS] = "use of tracefs",
[LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
};
--
2.22.0.510.g264f2c817a-goog
^ permalink raw reply related
* [PATCH V36 26/29] debugfs: Restrict debugfs when the kernel is locked down
From: Matthew Garrett @ 2019-07-18 19:44 UTC (permalink / raw)
To: jmorris
Cc: linux-security-module, linux-kernel, linux-api, David Howells,
Andy Shevchenko, acpi4asus-user, platform-driver-x86,
Matthew Garrett, Thomas Gleixner, Greg KH, Rafael J . Wysocki,
Matthew Garrett
In-Reply-To: <20190718194415.108476-1-matthewgarrett@google.com>
From: David Howells <dhowells@redhat.com>
Disallow opening of debugfs files that might be used to muck around when
the kernel is locked down as various drivers give raw access to hardware
through debugfs. Given the effort of auditing all 2000 or so files and
manually fixing each one as necessary, I've chosen to apply a heuristic
instead. The following changes are made:
(1) chmod and chown are disallowed on debugfs objects (though the root dir
can be modified by mount and remount, but I'm not worried about that).
(2) When the kernel is locked down, only files with the following criteria
are permitted to be opened:
- The file must have mode 00444
- The file must not have ioctl methods
- The file must not have mmap
(3) When the kernel is locked down, files may only be opened for reading.
Normal device interaction should be done through configfs, sysfs or a
miscdev, not debugfs.
Note that this makes it unnecessary to specifically lock down show_dsts(),
show_devs() and show_call() in the asus-wmi driver.
I would actually prefer to lock down all files by default and have the
the files unlocked by the creator. This is tricky to manage correctly,
though, as there are 19 creation functions and ~1600 call sites (some of
them in loops scanning tables).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Andy Shevchenko <andy.shevchenko@gmail.com>
cc: acpi4asus-user@lists.sourceforge.net
cc: platform-driver-x86@vger.kernel.org
cc: Matthew Garrett <mjg59@srcf.ucam.org>
cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg KH <greg@kroah.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Matthew Garrett <matthewgarrett@google.com>
---
fs/debugfs/file.c | 30 ++++++++++++++++++++++++++++++
fs/debugfs/inode.c | 32 ++++++++++++++++++++++++++++++--
include/linux/security.h | 1 +
security/lockdown/lockdown.c | 1 +
4 files changed, 62 insertions(+), 2 deletions(-)
diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 93e4ca6b2ad7..87846aad594b 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -19,6 +19,7 @@
#include <linux/atomic.h>
#include <linux/device.h>
#include <linux/poll.h>
+#include <linux/security.h>
#include "internal.h"
@@ -136,6 +137,25 @@ void debugfs_file_put(struct dentry *dentry)
}
EXPORT_SYMBOL_GPL(debugfs_file_put);
+/*
+ * Only permit access to world-readable files when the kernel is locked down.
+ * We also need to exclude any file that has ways to write or alter it as root
+ * can bypass the permissions check.
+ */
+static bool debugfs_is_locked_down(struct inode *inode,
+ struct file *filp,
+ const struct file_operations *real_fops)
+{
+ if ((inode->i_mode & 07777) == 0444 &&
+ !(filp->f_mode & FMODE_WRITE) &&
+ !real_fops->unlocked_ioctl &&
+ !real_fops->compat_ioctl &&
+ !real_fops->mmap)
+ return false;
+
+ return security_locked_down(LOCKDOWN_DEBUGFS);
+}
+
static int open_proxy_open(struct inode *inode, struct file *filp)
{
struct dentry *dentry = F_DENTRY(filp);
@@ -147,6 +167,11 @@ static int open_proxy_open(struct inode *inode, struct file *filp)
return r == -EIO ? -ENOENT : r;
real_fops = debugfs_real_fops(filp);
+
+ r = debugfs_is_locked_down(inode, filp, real_fops);
+ if (r)
+ goto out;
+
real_fops = fops_get(real_fops);
if (!real_fops) {
/* Huh? Module did not clean up after itself at exit? */
@@ -272,6 +297,11 @@ static int full_proxy_open(struct inode *inode, struct file *filp)
return r == -EIO ? -ENOENT : r;
real_fops = debugfs_real_fops(filp);
+
+ r = debugfs_is_locked_down(inode, filp, real_fops);
+ if (r)
+ goto out;
+
real_fops = fops_get(real_fops);
if (!real_fops) {
/* Huh? Module did not cleanup after itself at exit? */
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 042b688ed124..7b975dbb2bb4 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -26,6 +26,7 @@
#include <linux/parser.h>
#include <linux/magic.h>
#include <linux/slab.h>
+#include <linux/security.h>
#include "internal.h"
@@ -35,6 +36,32 @@ static struct vfsmount *debugfs_mount;
static int debugfs_mount_count;
static bool debugfs_registered;
+/*
+ * Don't allow access attributes to be changed whilst the kernel is locked down
+ * so that we can use the file mode as part of a heuristic to determine whether
+ * to lock down individual files.
+ */
+static int debugfs_setattr(struct dentry *dentry, struct iattr *ia)
+{
+ int ret = security_locked_down(LOCKDOWN_DEBUGFS);
+
+ if (ret && (ia->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID)))
+ return ret;
+ return simple_setattr(dentry, ia);
+}
+
+static const struct inode_operations debugfs_file_inode_operations = {
+ .setattr = debugfs_setattr,
+};
+static const struct inode_operations debugfs_dir_inode_operations = {
+ .lookup = simple_lookup,
+ .setattr = debugfs_setattr,
+};
+static const struct inode_operations debugfs_symlink_inode_operations = {
+ .get_link = simple_get_link,
+ .setattr = debugfs_setattr,
+};
+
static struct inode *debugfs_get_inode(struct super_block *sb)
{
struct inode *inode = new_inode(sb);
@@ -369,6 +396,7 @@ static struct dentry *__debugfs_create_file(const char *name, umode_t mode,
inode->i_mode = mode;
inode->i_private = data;
+ inode->i_op = &debugfs_file_inode_operations;
inode->i_fop = proxy_fops;
dentry->d_fsdata = (void *)((unsigned long)real_fops |
DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
@@ -532,7 +560,7 @@ struct dentry *debugfs_create_dir(const char *name, struct dentry *parent)
}
inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
- inode->i_op = &simple_dir_inode_operations;
+ inode->i_op = &debugfs_dir_inode_operations;
inode->i_fop = &simple_dir_operations;
/* directory inodes start off with i_nlink == 2 (for "." entry) */
@@ -632,7 +660,7 @@ struct dentry *debugfs_create_symlink(const char *name, struct dentry *parent,
return failed_creating(dentry);
}
inode->i_mode = S_IFLNK | S_IRWXUGO;
- inode->i_op = &simple_symlink_inode_operations;
+ inode->i_op = &debugfs_symlink_inode_operations;
inode->i_link = link;
d_instantiate(dentry, inode);
return end_creating(dentry);
diff --git a/include/linux/security.h b/include/linux/security.h
index 8ef366de70b0..d92323b44a3f 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -115,6 +115,7 @@ enum lockdown_reason {
LOCKDOWN_TIOCSSERIAL,
LOCKDOWN_MODULE_PARAMETERS,
LOCKDOWN_MMIOTRACE,
+ LOCKDOWN_DEBUGFS,
LOCKDOWN_INTEGRITY_MAX,
LOCKDOWN_KCORE,
LOCKDOWN_KPROBES,
diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
index fb437a7ef5f2..88064ce1c844 100644
--- a/security/lockdown/lockdown.c
+++ b/security/lockdown/lockdown.c
@@ -30,6 +30,7 @@ static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
[LOCKDOWN_TIOCSSERIAL] = "reconfiguration of serial port IO",
[LOCKDOWN_MODULE_PARAMETERS] = "unsafe module parameters",
[LOCKDOWN_MMIOTRACE] = "unsafe mmio",
+ [LOCKDOWN_DEBUGFS] = "debugfs access",
[LOCKDOWN_INTEGRITY_MAX] = "integrity",
[LOCKDOWN_KCORE] = "/proc/kcore access",
[LOCKDOWN_KPROBES] = "use of kprobes",
--
2.22.0.510.g264f2c817a-goog
^ permalink raw reply related
* [PATCH V36 25/29] kexec: Allow kexec_file() with appropriate IMA policy when locked down
From: Matthew Garrett @ 2019-07-18 19:44 UTC (permalink / raw)
To: jmorris
Cc: linux-security-module, linux-kernel, linux-api, Matthew Garrett,
Matthew Garrett, Mimi Zohar, Dmitry Kasatkin, linux-integrity
In-Reply-To: <20190718194415.108476-1-matthewgarrett@google.com>
Systems in lockdown mode should block the kexec of untrusted kernels.
For x86 and ARM we can ensure that a kernel is trustworthy by validating
a PE signature, but this isn't possible on other architectures. On those
platforms we can use IMA digital signatures instead. Add a function to
determine whether IMA has or will verify signatures for a given event type,
and if so permit kexec_file() even if the kernel is otherwise locked down.
This is restricted to cases where CONFIG_INTEGRITY_TRUSTED_KEYRING is set
in order to prevent an attacker from loading additional keys at runtime.
Signed-off-by: Matthew Garrett <mjg59@google.com>
Acked-by: Mimi Zohar <zohar@linux.ibm.com>
Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
Cc: linux-integrity@vger.kernel.org
---
include/linux/ima.h | 9 ++++++
kernel/kexec_file.c | 12 +++++--
security/integrity/ima/ima.h | 2 ++
security/integrity/ima/ima_main.c | 2 +-
security/integrity/ima/ima_policy.c | 50 +++++++++++++++++++++++++++++
5 files changed, 72 insertions(+), 3 deletions(-)
diff --git a/include/linux/ima.h b/include/linux/ima.h
index a20ad398d260..1c37f17f7203 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -131,4 +131,13 @@ static inline int ima_inode_removexattr(struct dentry *dentry,
return 0;
}
#endif /* CONFIG_IMA_APPRAISE */
+
+#if defined(CONFIG_IMA_APPRAISE) && defined(CONFIG_INTEGRITY_TRUSTED_KEYRING)
+extern bool ima_appraise_signature(enum kernel_read_file_id func);
+#else
+static inline bool ima_appraise_signature(enum kernel_read_file_id func)
+{
+ return false;
+}
+#endif /* CONFIG_IMA_APPRAISE && CONFIG_INTEGRITY_TRUSTED_KEYRING */
#endif /* _LINUX_IMA_H */
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index dd06f1070d66..13c9960a5860 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -228,9 +228,17 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
goto out;
}
- ret = security_locked_down(LOCKDOWN_KEXEC);
- if (ret)
+ ret = 0;
+
+ /* If IMA is guaranteed to appraise a signature on the kexec
+ * image, permit it even if the kernel is otherwise locked
+ * down.
+ */
+ if (!ima_appraise_signature(READING_KEXEC_IMAGE) &&
+ security_locked_down(LOCKDOWN_KEXEC)) {
+ ret = -EPERM;
goto out;
+ }
break;
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 011b91c79351..64dcb11cf444 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -113,6 +113,8 @@ struct ima_kexec_hdr {
u64 count;
};
+extern const int read_idmap[];
+
#ifdef CONFIG_HAVE_IMA_KEXEC
void ima_load_kexec_buffer(void);
#else
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 584019728660..b9f57503af2c 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -502,7 +502,7 @@ int ima_read_file(struct file *file, enum kernel_read_file_id read_id)
return 0;
}
-static const int read_idmap[READING_MAX_ID] = {
+const int read_idmap[READING_MAX_ID] = {
[READING_FIRMWARE] = FIRMWARE_CHECK,
[READING_FIRMWARE_PREALLOC_BUFFER] = FIRMWARE_CHECK,
[READING_MODULE] = MODULE_CHECK,
diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
index 6df7f641ff66..827f1e33fe86 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -1456,3 +1456,53 @@ int ima_policy_show(struct seq_file *m, void *v)
return 0;
}
#endif /* CONFIG_IMA_READ_POLICY */
+
+#if defined(CONFIG_IMA_APPRAISE) && defined(CONFIG_INTEGRITY_TRUSTED_KEYRING)
+/*
+ * ima_appraise_signature: whether IMA will appraise a given function using
+ * an IMA digital signature. This is restricted to cases where the kernel
+ * has a set of built-in trusted keys in order to avoid an attacker simply
+ * loading additional keys.
+ */
+bool ima_appraise_signature(enum kernel_read_file_id id)
+{
+ struct ima_rule_entry *entry;
+ bool found = false;
+ enum ima_hooks func;
+
+ if (id >= READING_MAX_ID)
+ return false;
+
+ func = read_idmap[id] ?: FILE_CHECK;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(entry, ima_rules, list) {
+ if (entry->action != APPRAISE)
+ continue;
+
+ /*
+ * A generic entry will match, but otherwise require that it
+ * match the func we're looking for
+ */
+ if (entry->func && entry->func != func)
+ continue;
+
+ /*
+ * We require this to be a digital signature, not a raw IMA
+ * hash.
+ */
+ if (entry->flags & IMA_DIGSIG_REQUIRED)
+ found = true;
+
+ /*
+ * We've found a rule that matches, so break now even if it
+ * didn't require a digital signature - a later rule that does
+ * won't override it, so would be a false positive.
+ */
+ break;
+ }
+
+ rcu_read_unlock();
+ return found;
+}
+#endif /* CONFIG_IMA_APPRAISE && CONFIG_INTEGRITY_TRUSTED_KEYRING */
--
2.22.0.510.g264f2c817a-goog
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox