Re: [PATCH v4 1/4] seccomp: add a return code to trap to userspace

From: Tycho Andersen <tycho@tycho.ws>
To: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>,
	kernel list <linux-kernel@vger.kernel.org>,
	containers@lists.linux-foundation.org,
	Linux API <linux-api@vger.kernel.org>,
	Andy Lutomirski <luto@amacapital.net>,
	Oleg Nesterov <oleg@redhat.com>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	"Serge E. Hallyn" <serge@hallyn.com>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	Tyler Hicks <tyhicks@canonical.com>,
	suda.akihiro@lab.ntt.co.jp, "Tobin C. Harding" <me@tobin.cc>
Subject: Re: [PATCH v4 1/4] seccomp: add a return code to trap to userspace
Date: Fri, 22 Jun 2018 09:15:14 -0600
Message-ID: <20180622151514.GM3992@cisco>
In-Reply-To: <CAG48ez3Ek_KG54ejR=Q=XtW_HDs8hQ+cgFODzn4rQ0nVDVpODg@mail.gmail.com>

Hi Jann,

On Fri, Jun 22, 2018 at 04:40:20PM +0200, Jann Horn wrote:
> On Fri, Jun 22, 2018 at 12:05 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > This patch introduces a means for syscalls matched in seccomp to notify
> > some other task that a particular filter has been triggered.
> >
> > The motivation for this is primarily for use with containers. For example,
> > if a container does an init_module(), we obviously don't want to load this
> > untrusted code, which may be compiled for the wrong version of the kernel
> > anyway. Instead, we could parse the module image, figure out which module
> > the container is trying to load and load it on the host.
> >
> > As another example, containers cannot mknod(), since this checks
> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > coding some whitelist in the kernel. Another example is mount(), which has
> > many security restrictions for good reason, but configuration or runtime
> > knowledge could potentially be used to relax these restrictions.
> >
> > This patch adds functionality that is already possible via at least two
> > other means that I know about, both of which involve ptrace(): first, one
> > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > Unfortunately this is slow, so a faster version would be to install a
> > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > Since ptrace allows only one tracer, if the container runtime is that
> > tracer, users inside the container (or outside) trying to debug it will not
> > be able to use ptrace, which is annoying. It also means that older
> > distributions based on Upstart cannot boot inside containers using ptrace,
> > since upstart itself uses ptrace to start services.
> >
> > The actual implementation of this is fairly small, although getting the
> > synchronization right was/is slightly complex.
> >
> > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > memory data from the task still applies here, but can be avoided with
> > careful design of the userspace handler: if the userspace handler reads all
> > of the task memory that is necessary before applying its security policy,
> > the tracee's subsequent memory edits will not be read by the tracer.
> 
> I've been thinking about how one would actually write userspace code
> that uses this API, and whether PID reuse is an issue here. As far as
> I can tell, the following situation can happen:
> 
>  - seccomped process tries to perform a syscall that gets trapped
>  - notification is sent to the supervisor
>  - supervisor reads the notification
>  - seccomped process gets SIGKILLed
>  - new process appears with the PID that the seccomped process had
>  - supervisor tries to access memory of the seccomped process via
> process_vm_{read,write}v or /proc/$pid/mem
>  - supervisor unintentionally accesses memory of the new process instead
> 
> This could have particularly nasty consequences if the supervisor has
> to write to memory of the seccomped process for some reason.
> It might make sense to explicitly document how the API has to be used
> to prevent such a scenario from occurring. AFAICS,
> process_vm_{read,write}v are fundamentally unsafe for this;
> /proc/$pid/mem might be safe if you do the following dance in the
> supervisor to validate that you have a reference to the right struct
> mm before starting to actually access memory:
> 
>  - supervisor reads a syscall notification for the seccomped process with PID $A
>  - supervisor opens /proc/$A/mem [taking a reference on the mm of the
> process that currently has PID $A]
>  - supervisor reads all pending events from the notification FD; if
> one of them says that PID $A was signalled, send back -ERESTARTSYS (or
> -ERESTARTNOINTR?) and bail out
>  - [at this point, the open FD to /proc/$A/mem is known to actually
> refer to the mm struct of the seccomped process]
>  - read and write on the open FD to /proc/$A/mem as necessary
>  - send back the syscall result

Yes, this is a nasty problem :(. We have the id in the
request/response structs to avoid this race, so perhaps we can re-use
that? So it would look like:

- supervisor gets syscall notification for $A
- supervisor opens /proc/$A/mem or /proc/$A/map_files/... or a dir fd
  to the container's root or whatever
- supervisor calls seccomp(SECCOMP_NOTIFICATION_IS_VALID, req->id, listener_fd)
- supervisor knows that the fds it has open are safe

That way it doesn't have to flush the whole queue. Of course the extra
call makes things slower, but it makes more than just memory accesses
safe (anything opened between reading the notification and the
validity check), and handlers that don't need to read the task's
memory can skip it entirely.
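
Roughly, the supervisor side of that sequence might look like the C
sketch below. SECCOMP_NOTIFICATION_IS_VALID is only a proposal in this
mail and is not an existing kernel interface, so the check is passed in
as a callback. The names open_mem_checked, notif_valid_fn and
demo_is_valid are made up for illustration. The one thing the sketch is
meant to show is the ordering: open first, validate second.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical validity check: returns nonzero if notification `id` is
 * still pending on `listener_fd`. In the proposal above this would wrap
 * seccomp(SECCOMP_NOTIFICATION_IS_VALID, ...), which is an assumption,
 * not an interface that exists.
 */
typedef int (*notif_valid_fn)(int listener_fd, uint64_t id);

/*
 * Open /proc/<pid>/mem and only then ask whether the notification is
 * still live. If it is, the task that sent it has not gone away, so the
 * fd refers to that task's mm and not to one that recycled the PID.
 * Returns an open fd on success, -1 on failure or a stale notification.
 */
static int open_mem_checked(int listener_fd, pid_t pid, uint64_t id,
                            notif_valid_fn is_valid)
{
    char path[64];
    int fd;

    snprintf(path, sizeof(path), "/proc/%ld/mem", (long)pid);
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    /* Validating before the open would leave the PID-reuse window open. */
    if (!is_valid(listener_fd, id)) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Stand-in validator for trying the sketch without kernel support:
 * treats id 0 as stale and every other id as live.
 */
static int demo_is_valid(int listener_fd, uint64_t id)
{
    (void)listener_fd;
    return id != 0;
}
```

The same open-then-validate pattern would apply to the other fds
mentioned above (/proc/$A/map_files/..., a dir fd to the container's
root), with only the open() call changing.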

> It might be nice if the kernel was able to directly give the
> supervisor an FD to /proc/$A/mem that is guaranteed to point to the
> right struct mm, but trying to implement that would probably make this
> patch set significantly larger?

I'll take a look and see how big it is; it doesn't *seem* like it
should be that hard. Famous last words :)

Tycho
