From: Will Drewry <wad@chromium.org>
To: Steven Rostedt <rostedt@goodmis.org>, Ingo Molnar <mingo@elte.hu>,
Peter Zijlstra <peterz@infradead.org>,
Frederic Weisbecker <fweisbec@gmail.com>
Cc: linux-mips@linux-mips.org, linux-sh@vger.kernel.org,
Heiko Carstens <heiko.carstens@de.ibm.com>,
Oleg Nesterov <oleg@redhat.com>,
David Howells <dhowells@redhat.com>,
Paul Mackerras <paulus@samba.org>,
Ralf Baechle <ralf@linux-mips.org>,
"H. Peter Anvin" <hpa@zytor.com>,
sparclinux@vger.kernel.org, Jiri Slaby <jslaby@suse.cz>,
linux-s390@vger.kernel.org, Russell King <linux@arm.linux.org.uk>,
x86@kernel.org, James Morris <jmorris@namei.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Ingo Molnar <mingo@redhat.com>,
kees.cook@canonical.com, "Serge E. Hallyn" <serge@hallyn.com>,
Tejun Heo <tj@kernel.org>, Thomas Gleixner <tglx@linutronix.de>,
linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
Michal Marek <mmarek@suse.cz>, Michal Simek <monstr@monstr.eu>,
linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
Eric Paris <eparis@redhat.com>, Paul Mundt <lethal@linux-sh.org>,
Martin Schwidefsky <schwidefsky@de.ibm.com>,
linux390@de.ibm.com, Andrew Morton <akpm@linux-foundation.org>,
agl@chromium.org, "David S. Miller" <davem@davemloft.net>
Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
Date: Tue, 24 May 2011 10:59:04 -0500 [thread overview]
Message-ID: <BANLkTim9UyYAGhg06vCFLxkYPX18cPymEQ@mail.gmail.com> (raw)
In-Reply-To: <BANLkTiki8aQJbFkKOFC+s6xAEiuVyMM5MQ@mail.gmail.com>
On Thu, May 19, 2011 at 4:05 PM, Will Drewry <wad@chromium.org> wrote:
> On Thu, May 19, 2011 at 7:22 AM, Steven Rostedt <rostedt@goodmis.org> wro=
te:
>> On Wed, 2011-05-18 at 21:07 -0700, Will Drewry wrote:
>>
>>> Do event_* that return non-void exist in the tree at all now? =A0I've
>>> looked at the various tracepoint macros as well as some of the other
>>> handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
>>> places where a return value is honored nor could be. =A0At best, the
>>> perf_tp_event can be short-circuited it in the hlist_for_each, but
>>> it'd still need a way to bubble up a failure and result in not calling
>>> the trace/event that the hook precedes.
>>
>> No, none of the current trace hooks have return values. That was what I
>> was talking about how to implement in my previous emails.
>
> Led on by complete ignorance, I think I'm finally beginning to untwist
> the different pieces of the tracing infrastructure. =A0Unfortunately,
> that means I took a few wrong turns along the way...
>
> I think function tracing looks something like this:
>
> ftrace_call has been injected into at a specific callsite. =A0Upon hit:
> 1. ftrace_call triggers
> 2. does some checks then calls ftrace_trace_function (via mcount instrs)
> 3. ftrace_trace_function may be a single func or a list. For a list it
> will be: ftrace_list_func
> 4. ftrace_list_func calls each registered hook for that function in a
> while loop ignoring return values
> 5. registered hook funcs may then track the call, farm it out to
> specific sub-handlers, etc.
>
> This seems to be a red herring for my use case :/ though this helped
> me understand your back and forth (Ingo & Steve) regarding dynamic
> versus explicit events.
>
>
> System call tracing is done via kernel/tracepoint.c events fed in via
> arch/[arch]/kernel/ptrace.c where it calls trace_sys_enter. =A0This
> yields direct sys_enter and sys_exit event sources (and an event class
> to hook up per-system call events). =A0This means that
> ftrace_syscall_enter() does the event prep prior to doing a filter
> check comparing the ftrace_event object for the given syscall_nr to
> the event data. =A0perf_sysenter_enter() is similar but it pushes the
> info over to perf_tp_event to be matched not against the global
> syscall event entry, but against any sub-event in the linked list on
> that syscall's event.
>
> Is that roughly an accurate representation of the two? =A0I wish I
> hadn't gotten distracted along the function path, but at least I
> learned something (and it is relevant to the larger scope of this
> thread's discussion).
>
>
> After doing that digging, it looks like providing hook call
> pre-emption and return value propagation will be a unique and fun task
> for each tracer and event subsystem. =A0If I look solely at tracepoints,
> a generic change would be to make the trace_##subsys function return
> an int (which I think was the event_vfs_getname proposal?). =A0The other
> option is to change the trace_sys_enter proto to include a 'int
> *retcode'.
>
> That change would allow the propagation of some sort of policy
> information. =A0To put it to use, seccomp mode 1 could be implemented on
> top of trace_syscalls. =A0The following changes would need to happen:
> 1. dummy metadata should be inserted for all unhooked system calls
> 2. perf_trace_buf_submit would need to return an int or a new
> TRACE_REG_SECCOMP_REGISTER handler would need to be setup in
> syscall_enter_register.
> 3. If perf is abused, a "kill/fail_on_discard" bit would be added to
> event->attrs.
> 4. perf_trace_buf_submit/perf_tp_event will return 0 for no matches,
> 'n' for the number of event matches, and -EACCES/? if a
> 'fail_on_discard' event is seen.
> 5. perf_syscall_enter would set *retcode =3D perf_trace_buf_submit()'s re=
tcode
> 6. trace_sys_enter() would need to be moved to be the first entry
> arch/../kernel/ptrace.c for incoming syscalls
> 7. if trace_sys_enter() yields a negative return code, then
> do_exit(SIGKILL) the process and return.
>
> Entering into seccomp mode 1 would require adding a =A0"0" filter for
> every system call number (which is why we need a dummy event call for
> them since failing to check the bitmask can't be flagged
> fail_on_discard) with the fail_on_discard bit. =A0For the three calls
> that are allowed, a '1' filter would be set.
>
> That would roughly implement seccomp mode 1. =A0It's pretty ugly and the
> fact that every system call that's disallowed has to be blacklisted is
> not ideal. =A0An alternate model would be to just use the seccomp mode
> as we do today and let secure_computing() handle the return code of "#
> of matches". =A0If it the # of matches is 0, it terminates. A
> 'fail_on_discard' bit then would only be good to stop further
> tracepoint callback evaluation. =A0This approach would also *almost* nix
> the need to provide dummy syscall hooks. =A0(Since sigreturn isn't
> hooked on x86 because it uses ptregs fixup, a dummy would still be
> needed to apply a "1" filter to.)
>
> Even with that tweak to move to a whitelist model, the perf event
> evaluation and tracepoint callback ordering is still not guaranteed.
> Without changing tracepoint itself, all other TPs will still execute.
> And for perf events, it'll be first-come-first-serve until a
> fail_on_discard is hit.
>
> After using seccomp mode=3D1 as the sample case to reduce scope, it's
> possible to ignore it for now :) and look at the seccomp-filter/mode=3D2
> case. =A0The same mechanism could be used to inject more expressive
> filters. =A0This would be done over the sys_perf_event_open() interface
> assuming the new attr is added to stop perf event list enumeration.
> Assuming a whitelist model, a call to prctl(PR_SET_SECCOMP, 2) would
> be needed (maybe with the ON_EXEC flag option too to mirror the
> event->attr on-exec bit). That would yield the ability to register
> perf events for system calls then use ioctl() to SET_FILTER on them.
>
> Reducing the privileges of the filters after installation could be
> done with another attribute bit like 'set_filter_ands'. =A0If that is
> also set on the event, and a filter is installed to ensure that
> sys_perf_event_open is blocked, then it should be reasonably sane.
>
> I'd need to add a PERF_EVENT_IOC_GET_FILTER handler to allow
> extracting the settings.
>
> Clearly, I haven't written the code for that yet, though making the
> change for a single platform shouldn't be too much code.
>
> So that leaves me with some questions:
> - Is this the type of reuse that was being encouraged?
> - Does it really make sense to cram this through the perf interface
> and events? =A0While the changed attributes are innocuous and
> potentially reusable, it seems that a large amount of the perf
> facilities are exposed that could have weird side effects, but I'm
> sure I still don't fully grok the perf infrastructure.
> - Does extending one tracepoint to provide return values via a pointer
> make sense? I'd hesitate to make all tracepoint hooks return an int by
> default.
> - If all tracepoints returned an int, what would the standard value
> look like - or would it need to be per tracepoint impl?
> - How is ambiguity resolved if multiple perf_events are registered for
> a syscall with different filters? =A0Maybe a 'stop_on_match'? though
> ordering is still a problem then.
> - Is there a way to affect a task-wide change without a seccomp flag
> (or new task_struct entry) via the existing sys_perf_event_open
> syscall? =A0I considered suggesting a attr bit call 'secure_computing'
> that when an event with the bit is enabled, it automagically forces
> the task into seccomp enforcement mode, but that, again, seemed
> hackish.
>
> While I like the idea of sharing the tracepoints infrastructure and
> the trace_syscalls hooks as well as using a pre-existing interface
> with very minor changes, I'm worried that the complexity of the
> interface and of the infrastructure might undermine the ability to
> continue meeting the desired security goals. =A0I had originally stayed
> very close to the seccomp world because of how trivial it is to review
> the code and verify its accuracy/effectiveness. =A0This approach leaves
> a lot of gaps for kernel code to seep through and a fair amount of
> ambiguity in what locked down syscall filters might look like.
>
> To me, the best part of the above is that it shows that even if we go
> with a prctl SET_SECCOMP_FILTER-style interface, it is completely
> certain that if a perf/ftrace-based security infrastructure is on our
> future, it will be entirely compatible -- even if the prctl()
> interface is just the "simpler" interface at that point somewhere down
> the road.
>
>
> Regardless, I'll hack up a proof of concept based on the outline
> above. Perhaps something more elegant will emerge once I start
> tweaking the source, but I'm still seeing too many gaps to be
> comfortable so far.
>
>
> [[There is a whole other approach to this too. We could continue with
> the prctl() interface and mirror the trace_sys_enter model for
> secure_computing(). =A0 Instead of keeping a seccomp_t-based hlist of
> events, they could be stored in a new hlist for seccomp_events in
> struct ftrace_event_call. =A0The ftrace filters would be installed there
> and the seccomp_syscall_enter() function could do the checks and pass
> up some state data on the task's seccomp_t that indicates it needs to
> do_exit(). =A0That would likely reduce the amount of code in
> seccomp_filter.c pretty seriously (though not entirely
> beneficially).]]
>
>
> Thanks again for all the feedback and insights! I really hope we can
> come to an agreeable approach for implementing kernel attack surface
> reduction.
Just a quick follow up. I have a PoC of the perf-based system call
filtering code in place which uses seccomp.mode as a task_context
flag, adds require_secure and err_on_discard perf_event_attrs, and
enforces these new events terminate the process in perf_syscall_enter
via a return value on perf_buf_submit. This lets a call like:
LD_LIBRARY_PATH=3D/host/home/wad/kernel.uml/tests/
/host/home/wad/kernel.uml/tests/perf record \
-S \
-e 'syscalls:sys_enter_access' \
-e 'syscalls:sys_enter_brk' \
-e 'syscalls:sys_enter_close' \
-e 'syscalls:sys_enter_exit_group' \
-e 'syscalls:sys_enter_fcntl64' \
-e 'syscalls:sys_enter_fstat64' \
-e 'syscalls:sys_enter_getdents64' \
-e 'syscalls:sys_enter_getpid' \
-e 'syscalls:sys_enter_getuid' \
-e 'syscalls:sys_enter_ioctl' \
-e 'syscalls:sys_enter_lstat64' \
-e 'syscalls:sys_enter_mmap_pgoff' \
-e 'syscalls:sys_enter_mprotect' \
-e 'syscalls:sys_enter_munmap' \
-e 'syscalls:sys_enter_open' \
-e 'syscalls:sys_enter_read' \
-e 'syscalls:sys_enter_stat64' \
-e 'syscalls:sys_enter_time' \
-e 'syscalls:sys_enter_newuname' \
-e 'syscalls:sys_enter_write' --filter "fd =3D=3D 1" \
/bin/ls
run successfully while omitting a syscall or passing in --filter "fd
=3D=3D 0" properly terminates it. (ignoring the need for execve and
set_thread_area for PoC purposes).
The change avoids defining a new trace call type or a huge number of
internal changes and hides seccomp.mode=3D2 from ABI-exposure in prctl,
but the attack surface is non-trivial to verify, and I'm not sure if
this ABI change makes sense. It amounts to:
include/linux/ftrace_event.h | 4 +-
include/linux/perf_event.h | 10 +++++---
kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++=
---
kernel/seccomp.c | 8 ++++++
kernel/trace/trace_syscalls.c | 27 +++++++++++++++++-----
5 files changed, 82 insertions(+), 16 deletions(-)
And can be found here: http://static.dataspill.org/perf_secure/v1/
If there is any interest at all, I can post it properly to this giant
CC list. However, it's unclear to me where this thread stands. I
also see a completely independent path for implementing system call
filtering now, but I'll hold off posting that patch until I see where
this goes.
Any guidance will be appreciated - thanks again!
will
next prev parent reply other threads:[~2011-05-24 15:59 UTC|newest]
Thread overview: 77+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <1304017638.18763.205.camel@gandalf.stny.rr.com>
2011-05-12 3:02 ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Will Drewry
2011-05-12 7:48 ` Ingo Molnar
2011-05-12 9:24 ` Kees Cook
2011-05-12 10:49 ` Ingo Molnar
2011-05-12 11:44 ` James Morris
2011-05-12 13:01 ` Ingo Molnar
2011-05-12 16:26 ` Will Drewry
2011-05-16 12:55 ` Ingo Molnar
2011-05-16 14:42 ` Will Drewry
2011-05-13 0:18 ` James Morris
2011-05-13 12:10 ` Ingo Molnar
2011-05-13 12:19 ` Peter Zijlstra
2011-05-13 12:26 ` Ingo Molnar
2011-05-13 12:39 ` Peter Zijlstra
2011-05-13 12:43 ` Peter Zijlstra
2011-05-13 12:54 ` Ingo Molnar
2011-05-13 13:08 ` Peter Zijlstra
2011-05-13 13:18 ` Ingo Molnar
2011-05-13 13:55 ` Peter Zijlstra
2011-05-13 14:57 ` Ingo Molnar
2011-05-13 15:27 ` Peter Zijlstra
2011-05-14 7:05 ` Ingo Molnar
2011-05-16 16:23 ` Steven Rostedt
2011-05-16 16:52 ` Ingo Molnar
2011-05-16 17:03 ` Steven Rostedt
2011-05-17 12:42 ` Ingo Molnar
2011-05-17 13:05 ` Steven Rostedt
2011-05-17 13:19 ` Ingo Molnar
2011-05-19 4:07 ` Will Drewry
2011-05-19 12:22 ` Steven Rostedt
2011-05-19 21:05 ` Will Drewry
2011-05-24 15:59 ` Will Drewry [this message]
2011-05-24 16:20 ` Peter Zijlstra
2011-05-24 16:25 ` Thomas Gleixner
2011-05-24 19:00 ` Will Drewry
2011-05-24 19:54 ` Ingo Molnar
2011-05-24 20:10 ` Ingo Molnar
2011-05-25 10:35 ` Thomas Gleixner
2011-05-25 15:01 ` Ingo Molnar
2011-05-25 17:43 ` Peter Zijlstra
2011-05-29 20:17 ` Ingo Molnar
2011-05-25 17:48 ` Thomas Gleixner
2011-05-26 8:43 ` Ingo Molnar
2011-05-26 9:15 ` Ingo Molnar
2011-05-24 20:08 ` Ingo Molnar
2011-05-24 20:14 ` Steven Rostedt
2011-05-13 15:17 ` Eric Paris
2011-05-13 15:29 ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system callfiltering David Laight
2011-05-16 12:03 ` Ingo Molnar
2011-05-13 12:49 ` [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering Ingo Molnar
2011-05-13 13:55 ` Peter Zijlstra
2011-05-13 15:02 ` Ingo Molnar
2011-05-13 15:10 ` Eric Paris
2011-05-13 15:23 ` Peter Zijlstra
2011-05-13 15:55 ` Eric Paris
2011-05-13 16:29 ` Will Drewry
2011-05-14 7:30 ` Ingo Molnar
2011-05-14 20:57 ` Will Drewry
2011-05-16 12:43 ` Ingo Molnar
2011-05-16 15:29 ` Will Drewry
2011-05-17 12:57 ` Ingo Molnar
2011-05-16 0:36 ` James Morris
2011-05-16 15:08 ` Ingo Molnar
2011-05-17 2:24 ` James Morris
2011-05-17 13:10 ` Ingo Molnar
2011-05-17 13:29 ` James Morris
2011-05-17 18:34 ` Ingo Molnar
2011-05-26 6:27 ` Pavel Machek
2011-05-26 8:35 ` Ingo Molnar
2011-05-12 12:15 ` Frederic Weisbecker
2011-05-12 11:33 ` James Morris
2011-05-13 19:35 ` Arnd Bergmann
2011-05-14 20:58 ` Will Drewry
2011-05-15 6:42 ` Arnd Bergmann
2011-05-16 12:00 ` Ingo Molnar
2011-05-16 15:26 ` Steven Rostedt
2011-05-16 15:28 ` Will Drewry
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=BANLkTim9UyYAGhg06vCFLxkYPX18cPymEQ@mail.gmail.com \
--to=wad@chromium.org \
--cc=agl@chromium.org \
--cc=akpm@linux-foundation.org \
--cc=davem@davemloft.net \
--cc=dhowells@redhat.com \
--cc=eparis@redhat.com \
--cc=fweisbec@gmail.com \
--cc=heiko.carstens@de.ibm.com \
--cc=hpa@zytor.com \
--cc=jmorris@namei.org \
--cc=jslaby@suse.cz \
--cc=kees.cook@canonical.com \
--cc=lethal@linux-sh.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mips@linux-mips.org \
--cc=linux-s390@vger.kernel.org \
--cc=linux-sh@vger.kernel.org \
--cc=linux390@de.ibm.com \
--cc=linux@arm.linux.org.uk \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=mingo@elte.hu \
--cc=mingo@redhat.com \
--cc=mmarek@suse.cz \
--cc=monstr@monstr.eu \
--cc=oleg@redhat.com \
--cc=paulus@samba.org \
--cc=peterz@infradead.org \
--cc=ralf@linux-mips.org \
--cc=rostedt@goodmis.org \
--cc=schwidefsky@de.ibm.com \
--cc=serge@hallyn.com \
--cc=sparclinux@vger.kernel.org \
--cc=tglx@linutronix.de \
--cc=tj@kernel.org \
--cc=torvalds@linux-foundation.org \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).