From mboxrd@z Thu Jan  1 00:00:00 1970
From: Will Drewry <wad@chromium.org>
Subject: Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
Date: Wed, 1 Feb 2012 01:02:11 -0800
Message-ID: <CABqD9ha4x0_YmWjv-h6GmimCk21UARO9u7AqLkXivffDv6vPVg@mail.gmail.com>
References: <1327706681-11959-1-git-send-email-wad@chromium.org>
	<1327706681-11959-2-git-send-email-wad@chromium.org>
	<c211732dcacbff72c70127ccab36122a.squirrel@webmail.greenhost.nl>
	<CABqD9hatfCSSqUt-FYUa6-LcCeDEht8gLMAY7cnWKe6gp-9upg@mail.gmail.com>
	<13b3f9dcf188908604a9529ef1934ecf.squirrel@webmail.greenhost.nl>
	<CABqD9hYXJapEiDcA_VxggkTkC6KXnJwmoE3zAoPzvYDCgkQXSA@mail.gmail.com>
	<8a37c5805a9941a8616f1c28245a0880.squirrel@webmail.greenhost.nl>
	<CABqD9hYpiDtCM5gPgjjHOJ6DcFs-HHUORAGi+AMFs1+9EAk1UA@mail.gmail.com>
	<bd6c07e54f2a86916550f69013e4dcc6.squirrel@webmail.greenhost.nl>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: linux-kernel@vger.kernel.org, keescook@chromium.org,
	john.johansen@canonical.com, serge.hallyn@canonical.com,
	coreyb@linux.vnet.ibm.com, pmoore@redhat.com, eparis@redhat.com,
	djm@mindrot.org, torvalds@linux-foundation.org,
	segoon@openwall.com, rostedt@goodmis.org, jmorris@namei.org,
	scarybeasts@gmail.com, avi@redhat.com, penberg@cs.helsinki.fi,
	viro@zeniv.linux.org.uk, luto@mit.edu, mingo@elte.hu,
	akpm@linux-foundation.org, khilman@ti.com, borislav.petkov@amd.com,
	amwang@redhat.com, oleg@redhat.com, ak@linux.intel.com,
	eric.dumazet@gmail.com, gregkh@suse.de, dhowells@redhat.com,
	daniel.lezcano@free.fr, linux-fsdevel@vger.kernel.org,
	linux-security-module@vger.kernel.org, olofj@chromium.org,
	mhalcrow@google.com, dlaor@redhat.com, corbet@lwn.net,
	alan@lxorguk.ukuu.org.uk, mcgrathr@chromium.org
To: Indan Zupancic <indan@nul.nu>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mail-bk0-f46.google.com ([209.85.214.46]:63310 "EHLO
	mail-bk0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753051Ab2BAJCN convert rfc822-to-8bit (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Wed, 1 Feb 2012 04:02:13 -0500
In-Reply-To: <bd6c07e54f2a86916550f69013e4dcc6.squirrel@webmail.greenhost.nl>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Tue, Jan 31, 2012 at 5:36 PM, Indan Zupancic <indan@nul.nu> wrote:
> On Tue, January 31, 2012 12:04, Will Drewry wrote:
>> On Mon, Jan 30, 2012 at 7:42 PM, Indan Zupancic <indan@nul.nu> wrote=
:
>>>> I vote for:
>>>>
>>>> 3. Add tracehook support to all archs.
>>
>> I don't see these #3 as mutually exclusive :)
>
> They are if you really add tracehook support to all archs. ;-)
>
>> tracehook requires:
>> - task_pt_regs() =A0 =A0 =A0 =A0 =A0in asm/processor.h or asm/ptrace=
=2Eh
>> - arch_has_single_step() =A0if there is hardware single-step support
>> - arch_has_block_step() =A0 if there is hardware block-step support
>> - asm/syscall.h =A0 =A0 =A0 =A0 =A0 supplying asm-generic/syscall.h =
interface
>> - linux/regset.h =A0 =A0 =A0 =A0 =A0user_regset interfaces
>> - CORE_DUMP_USE_REGSET =A0 =A0#define'd in linux/elf.h
>> -TIF_SYSCALL_TRACE =A0 =A0 =A0 calls tracehook_report_syscall_{entry=
,exit}
>> - TIF_NOTIFY_RESUME =A0 =A0 =A0 calls tracehook_notify_resume()
>> - signal delivery =A0 =A0 =A0 =A0 calls tracehook_signal_handler()
>
> Okay, that's a bit fuzzier than I expected. I suppose the archs imple=
ment
> some of that in another way currently?
>
>>>> Maybe not all archs, but at least some more. That way, every time =
someone
>>>> adds something tracehook specific, more archs support it.
>>
>> Well the other arch I want this on specifically for my purposes is
>> arm, but someone recently posted a partial asm/syscall.h for arm, bu=
t
>> I didn't see that one go anywhere just yet. =A0(I know syscall_get_n=
r
>> can be tricky on arm because it doesn't (didn't) have a way of
>> tracking in-syscall state.)
>>
>> ref: https://lkml.org/lkml/2011/12/1/131
>
> That totally ignores OABI, it should depend on CONFIG_AEABI and
> on !CONFIG_OABI_COMPAT.
>
>>>> syscall.h has no TRACEHOOK defines or anything though.
>>
>> Nope - it is just part of what is expected.
>>
>>>> Only syscall_rollback() looks tricky. I have no clue what the diff=
erence
>>>> between syscall_get_error() and syscall_get_return_value() is. But=
 you
>>>> only need to add syscall_get_nr() and syscall_[gs]et_arguments(), =
which
>>>> should be possible for all archs.
>>
>> It seems even syscall_get_nr can have some wrinkles :)
>>
>> ref: http://lkml.indiana.edu/hypermail/linux/kernel/0906.3/00096.htm=
l
>
> That's 2009! I wonder why no progress happened since then.
>
> At least for SECCOMP you get the syscall nr directly as a parameter,
> so you at least don't actually need syscall_get_nr(). Same for ptrace=
=2E
> That doesn't help for /proc/$PID/syscall or coredumps though.
>
> One solution I can think of for the OABI compat problem is to let the
> OABI entry path store the syscall nr in thread_info->syscall and set
> an OABI task flag somewhere. Then syscall_get_nr() can check that to
> decide between r7 or the stored value. This would help /proc and does=
n't
> cause any overhead for non-compat syscalls.
>
> Another ugly way of doing it would be to store the trap instruction f=
or
> OABI calls, and when user space does a PTRACE_PEEK_TEXT on that it re=
turns
> the saved instruction instead of rereading the actual memory. This av=
oids
> races and lets current user space work without modifications. Problem=
 is
> that user space can't easily check if it's running on a fixed kernel.
>
> All in all I propose to only support this stuff for kernels not havin=
g OABI
> support, as that keeps things nice and simple, which was probably the=
 point
> of moving to a new ARM ABI anyway. Tough, but that's what you get whe=
n you
> change ABIs.
>
>>>> How many archs don't support tracehook?
>>
>> 14 out of 26. =A0However, 5 of those still have asm/syscall.h
>
> That's a lot of tricky work.
>
>>>> Well, the thing is, this recursion is controlled by user space dep=
ending
>>>> on how many filters they have installed. What is preventing them t=
o force
>>>> you out of stack?
>>
>> Hrm true. =A0The easiest option is to just convert it to iterative b=
y
>> not using kref_t, but I'll look more closely.
>
> Probably.
>
>>>> So perhaps add at least some arbitrary filter limit to avoid this?
>>
>> Definitely possible -- probably as a sysctl. =A0I'm not quite sure w=
hat
>> number makes sense yet, but I'll look at breaking the recursion firs=
t.
>> =A0Thanks!
>
> Please no sysctl. The point of arbitrary limits is that they are
> arbitrarily high, but low enough to avoid any problems. A limit
> of 1K filters should be plenty for user space, but still put a
> limit on the total (memory) overhead of filters.

I'd love to avoid a limit at all, but picking something in a sane
range makes perfect sense.

>>>>> I'll clarify a bit. =A0My original ptrace integration worked such=
 that a
>>>>> tracer may only intercept syscall failures if it attached prior t=
o the
>>>>> failing filter being installed. =A0I did it this way to avoid usi=
ng
>>>>> ptrace to escape system call filtering. =A0However, since I don't=
 have
>>>>> that as part of the patch series, it doesn't make sense to keep i=
t. (I
>>>>> tracked a tracer pid_struct in the filters.) =A0If it needs to co=
me back
>>>>> with later patchsets, then it can be tackled then!
>>>>
>>>> The problem of that is that filters can be shared between processe=
s with
>>>> different ptracers. And you have all the hassle of keeping it up t=
o date.
>>>>
>>>> I think seccomp should always come first and just trust ptrace. Th=
is
>>>> because it can deny all ptrace() calls for filtered tasks, so the =
only
>>>> untrusted tasks doing ptrace() are outside of seccomp's filtering =
control.
>>>> And those could do the same system calls themselves.
>>>>
>>>> The case where there is one task being filtered and allowed to do =
ptrace,
>>>> but not some other syscall, ptracing another filtered task which i=
sn't
>>>> allowed to do ptrace, but allowed to do that other syscall, is qui=
te far
>>>> fetched I think. If you really want to handle this, then you could=
 run
>>>> the ptracer's filters against the tracee's post-ptrace syscall sta=
te.
>>>> This is best done in the ptracer's context, just before continuing=
 the
>>>> system call. (You really want Oleg's SIKILL immediate patch then.)
>>>>
>>>> What about:
>>>>
>>>> 1) Seccomp filters can deny a syscall by killing the task.
>>>>
>>>> 2) Seccomp can deny a syscall and let it return an error value.
>>>>
>>>> =A0 I know you're not fond of this one. It's just a performance
>>>> =A0 optimisation as sometimes a lot of denied but harmless syscall=
s
>>>> =A0 are called by glibc all the time, like getxattr(). Hardcoding
>>>> =A0 the kill policy seems wrong when it can be avoided. If this is
>>>> =A0 too hard then skip it, but if it's easy to add then please do.
>>>> =A0 I'll take a look at this with your next patch version.
>>
>> It's easy on x86 harder on other arches. =A0I would suggest saving
>> changing the __secure_computing signature until after the core
>> functionality lands, but that's just me.
>
> The only problem of that is that you will have two versions of seccom=
p
> filtering in the wild: One that does support this, and one that doesn=
't.

Hrm.  It'd be the addition of a return value which is essentially an
expansion of the possible filters.  It wouldn't make the existing
filters any less valid -- it's just that they wouldn't return errors.

However, I realize there's a desire to have all the pieces in place
upfront. I'll see how much work this really pans out to being. On x86,
it's easy.  Some other arches, a little less so, but probably not too
bad.

> It seems cleaner to consolidate the seccomp entry path with the ptrac=
e path,
> and do the seccomp check first in there. This saves a few instruction=
s on
> some archs for the seccomp check too and would simplify the entry cod=
e of
> most archs.
> Ptrace is expected to change the syscall nr already, so that can be u=
sed
> to set it to -1 or something. A different return value than -ENOSYS c=
an
> be set in the syscall exit path, if required. All this can be shared =
with
> PTRACE_SYSEMU.

That's roughly how it is on x86 now except seccomp is after all the
slow-path copy stuff.  It'd be cool to bump it up in front of that
work then pass through its return value.  I'll poke around at this and
look at the use of __secure_computing on the other arches to make sure
I understand the impact.  Maybe it is easier than I first thought it'd
be.

>>>> 3) Seccomp can allow a syscall to proceed normally.
>>>>
>>>> 4) Seccomp can set a hint to skip ptrace syscall events for this s=
yscall.
>>>> =A0 A filter sets this by returning a specific value.
>>>>
>>>> 5) Ptrace always gets a syscall event when it asked for it.
>>>>
>>>> 6) Ptrace can set an option to honour seccomp's hint and to not ge=
t all
>>>> =A0 syscall events.
>>>>
>>>> This way all seccomp needs to do is to set some flags which ptrace=
 can check.
>>
>> I like the use of flags/options to trigger ptrace handling. =A0If I =
were
>> to stack rank these for pursuit after the core functionality lands,
>> it'd be to add #6 (and its deps) then #2. =A0With #6, #2 can be
>> simulated (by having a supervisor that changes the syscall number to
>> -1), but that is much less ideal than just returning SECCOMP_ERROR
>> instead of SECCOMP_ALLOW/DENY and letting an error code get bubbled
>> up.
>
> But it would require two filter versions and hence make it more diffi=
cult
> to generally support BPF syscall filtering.

True.  I'm not quite sure that it makes sense to have the BPF program
decide what the tracer sees or doesn't see.  I can see it as a nice
optimization for a sandbox implementation, but I could see something
similar being done purely by letting a tracer catch disallowed
syscalls (possibly via it registering an option indicating it wants
seccomp_events). It wouldn't be quite as flexible, but it would avoid
the simple filter becoming a more complex piece of logic and if a
returned error were allowed (that didn't trigger ptrace), you'd not
see a ridiculous amount of overhead from pointless syscalls.  I'm not
sure though.

I'll poke at the ptrace bits too and see if one approach seems to fit
better than another.

thanks!
will
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel=
" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html