From: Kees Cook <keescook@chromium.org>
To: Jann Horn <jannh@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>,
	Tycho Andersen <tycho@tycho.ws>,
	Sargun Dhillon <sargun@sargun.me>,
	Matt Denton <mpdenton@google.com>,
	Chris Palmer <palmer@google.com>,
	Jeffrey Vander Stoep <jeffv@google.com>,
	Linux Containers <containers@lists.linux-foundation.org>,
	Linux API <linux-api@vger.kernel.org>,
	kernel list <linux-kernel@vger.kernel.org>
Subject: Re: seccomp feature development
Date: Wed, 3 Jun 2020 12:09:44 -0700	[thread overview]
Message-ID: <202005181630.60E58CA0C5@keescook> (raw)
In-Reply-To: <CAG48ez1LrQvR2RHD5-ZCEihL4YT1tVgoAJfGYo+M3QukumX=OQ@mail.gmail.com>
[trying to get back to this thread -- I've been distracted]
On Tue, May 19, 2020 at 12:39:39AM +0200, Jann Horn wrote:
> On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook@chromium.org> wrote:
> > ## deep argument inspection
> >
> > Background: seccomp users would like to write filters that traverse
> > the user pointers passed into many syscalls, but seccomp can't do this
> > dereference for a variety of reasons (mostly involving race conditions and
> > rearchitecting the entire kernel syscall and copy_from_user() code flows).
> 
> Also, other than for syscall entry, it might be worth thinking about
> whether we want to have a special hook into seccomp for io_uring.
> io_uring is growing support for more and more syscalls, including
> things like openat2, connect, sendmsg, splice and so on, and that list
> is probably just going to grow in the future. If people start wanting
> to use io_uring in software with seccomp filters, it might be
> necessary to come up with some mechanism to prevent io_uring from
> permitting access to almost everything else...
/me perks up. Oh my, I hadn't been paying attention -- I thought this
was strictly for I/O ... like it's named. I will go read up.
> [...]
> > The argument caching bit is, I think, rather mechanical in nature since
> > it's all "just" internal to the kernel: seccomp can likely adjust how it
> > allocates seccomp_data (maybe going so far as to have it split across two
> > pages with the syscall argument struct always starting on the 2nd page
> > boundary), and copying the EA struct into that page, which will be both
> > used by the filter and by the syscall.
> 
> We could also do the same kind of thing the eBPF verifier does in
> convert_ctx_accesses(), and rewrite the context accesses to actually
> go through two different pointers depending on the (constant) offset
> into seccomp_data.
Ah, as in "for seccomp_data accesses, add offset $foo and for EA struct
add offset $bar"? Yeah, though my preference is to avoid rewriting the
filters as much as possible. But yes, that's a good point about not
requiring them be strictly contiguous.
> > I imagine state tracking ("is
> > there a cached EA?", "what is the address of seccomp_data?", "what is
> > the address of the EA?") can be associated with the thread struct.
> 
> You probably mean the task struct?
Yup; think-o.
> > ## syscall bitmasks
> 
> YES PLEASE
I've got a working PoC for this now. It's sneaky.
> Other options:
>  - add a "load from read-only memory" opcode and permit specifying the
> data that should be in that memory when loading the filter
I think you've mentioned something like this before to me, but can you
remind me the details? If you mean RO userspace memory, don't we still
run the risk of racing mprotect, etc?
>  - make the seccomp API take an array of (syscall-number,
> instruction-offset) tuples, and start evaluation of the filter at an
> offset if it's one of those syscalls
To avoid making cBPF changes, yeah, perhaps have a way to add per-syscall
filters.
> One more thing that would be really nice: Is there a way we can have
> 64-bit registers in our seccomp filters? At the moment, every
> comparison has to turn into three ALU ops, which is pretty messy;
> libseccomp got that wrong (<https://crbug.com/project-zero/1769>), and
> it contributes to the horrific code Chrome's BPF generator creates.
> Here's some pseudocode from my hacky BPF disassembler, which shows
> pretty much just the filter code for filtering getpriority() and
> setpriority() in a Chrome renderer, with tons of useless dead code:
> 
> 0139         if args[0].high == 0x00000000: [true +3, false +0]
> 013e           if args[0].low != 0x00000000: [true +157, false +0] -> ret TRAP
> 0140           if args[1].high == 0x00000000: [true +3, false +0]
> 0145             if args[1].low == 0x00000000: [true +149, false +0]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 0147             if args[1].high == 0x00000000: [true +3, false +0]
> 014c               if args[1].low == 0x00000001: [true +142, false
> +141] -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da               ret ERRNO
> 0148             if args[1].high != 0xffffffff: [true +142, false +0]
> -> ret TRAP
> 014a             if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
> false +0] -> ret TRAP
> 014c             if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da             ret ERRNO
> 0141           if args[1].high != 0xffffffff: [true +149, false +0] -> ret TRAP
> 0143           if args[1].low NO-COMMON-BITS 0x80000000: [true +147,
> false +0] -> ret TRAP
> 0145           if args[1].low == 0x00000000: [true +149, false +0] ->
> ret ALLOW (syscalls: getpriority, setpriority)
> 0147           if args[1].high == 0x00000000: [true +3, false +0]
> 014c             if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da             ret ERRNO
> 0148           if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
> 014a           if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
> false +0] -> ret TRAP
> 014c           if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da           ret ERRNO
> 013a         if args[0].high != 0xffffffff: [true +156, false +0] -> ret TRAP
> 013c         if args[0].low NO-COMMON-BITS 0x80000000: [true +154,
> false +0] -> ret TRAP
> 013e         if args[0].low != 0x00000000: [true +157, false +0] -> ret TRAP
> 0140         if args[1].high == 0x00000000: [true +3, false +0]
> 0145           if args[1].low == 0x00000000: [true +149, false +0] ->
> ret ALLOW (syscalls: getpriority, setpriority)
> 0147           if args[1].high == 0x00000000: [true +3, false +0]
> 014c             if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da             ret ERRNO
> 0148           if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
> 014a           if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
> false +0] -> ret TRAP
> 014c           if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da           ret ERRNO
> 0141         if args[1].high != 0xffffffff: [true +149, false +0] -> ret TRAP
> 0143         if args[1].low NO-COMMON-BITS 0x80000000: [true +147,
> false +0] -> ret TRAP
> 0145         if args[1].low == 0x00000000: [true +149, false +0] ->
> ret ALLOW (syscalls: getpriority, setpriority)
> 0147         if args[1].high == 0x00000000: [true +3, false +0]
> 014c           if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da           ret ERRNO
> 0148         if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
> 014a         if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
> false +0] -> ret TRAP
> 014c         if args[1].low == 0x00000001: [true +142, false +141] ->
> ret ALLOW (syscalls: getpriority, setpriority)
> 01da         ret ERRNO
> 
> which is generated by this little snippet of C++ code:
> 
> ResultExpr RestrictGetSetpriority(pid_t target_pid) {
>   const Arg<int> which(0);
>   const Arg<int> who(1);
>   return If(which == PRIO_PROCESS,
>             Switch(who).CASES((0, target_pid), Allow()).Default(Error(EPERM)))
>       .Else(CrashSIGSYS());
> }
What would this look like in eBPF?
> On anything other than 32-bit MIPS, 32-bit powerpc and 32-bit sparc,
> we're actually already using the eBPF backend infrastructure... so
> maybe it would be an option to keep enforcing basically the same rules
> that we currently have for cBPF, but use the eBPF instruction format?
Yeah, I think this might be a good idea just for the reduction in
complexity for these things. The "unpriv BPF" problem needs to be solved
for this still, though, yes?
-- 
Kees Cook
next prev parent reply	other threads:[~2020-06-03 19:09 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-18 21:04 seccomp feature development Kees Cook
2020-05-18 22:39 ` Jann Horn
2020-05-19  2:48   ` Aleksa Sarai
2020-05-19 10:35     ` Christian Brauner
2020-05-19 16:18     ` Alexei Starovoitov
2020-05-19 21:40       ` Kees Cook
2020-05-20  1:20       ` Aleksa Sarai
2020-05-20  5:17         ` Alexei Starovoitov
2020-05-20  6:18           ` Aleksa Sarai
2020-05-19  7:24   ` Sargun Dhillon
2020-05-19  8:30     ` Christian Brauner
2020-06-03 19:09   ` Kees Cook [this message]
2020-05-18 22:55 ` Andy Lutomirski
2020-05-19  7:09 ` Aleksa Sarai
2020-05-19 11:01   ` Christian Brauner
2020-05-19 13:42     ` Aleksa Sarai
2020-05-19 13:53       ` Aleksa Sarai
2020-05-19 10:26 ` Christian Brauner
2020-05-20  8:23   ` Sargun Dhillon
2020-05-19 14:07 ` Tycho Andersen
2020-05-20  9:05 ` Sargun Dhillon
2020-05-22 20:09 ` Sargun Dhillon
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox
  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):
  git send-email \
    --in-reply-to=202005181630.60E58CA0C5@keescook \
    --to=keescook@chromium.org \
    --cc=christian.brauner@ubuntu.com \
    --cc=containers@lists.linux-foundation.org \
    --cc=jannh@google.com \
    --cc=jeffv@google.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mpdenton@google.com \
    --cc=palmer@google.com \
    --cc=sargun@sargun.me \
    --cc=tycho@tycho.ws \
    /path/to/YOUR_REPLY
  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
  Be sure your reply has a Subject: header at the top and a blank line
  before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).