From: Kees Cook <keescook@chromium.org>
To: Jann Horn <jannh@google.com>
Cc: YiFei Zhu <yifeifz2@illinois.edu>,
Christian Brauner <christian.brauner@ubuntu.com>,
Tycho Andersen <tycho@tycho.pizza>,
Andy Lutomirski <luto@amacapital.net>,
Will Drewry <wad@chromium.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Giuseppe Scrivano <gscrivan@redhat.com>,
Tobin Feldman-Fitzthum <tobin@ibm.com>,
Dimitrios Skarlatos <dskarlat@cs.cmu.edu>,
Valentin Rothberg <vrothber@redhat.com>,
Hubertus Franke <frankeh@us.ibm.com>,
Jack Chen <jianyan2@illinois.edu>,
Josep Torrellas <torrella@illinois.edu>,
Tianyin Xu <tyxu@illinois.edu>, bpf <bpf@vger.kernel.org>,
Linux Containers <containers@lists.linux-foundation.org>,
Linux API <linux-api@vger.kernel.org>,
kernel list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/6] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE
Date: Thu, 24 Sep 2020 00:11:01 -0700 [thread overview]
Message-ID: <202009232353.FD011DAA0@keescook> (raw)
In-Reply-To: <CAG48ez17XK2Co+1LbUWTc4x_W7nza=TObNh2Kpz6P+ba3OKPsw@mail.gmail.com>
On Thu, Sep 24, 2020 at 02:41:36AM +0200, Jann Horn wrote:
> On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> > For systems that provide multiple syscall maps based on audit
> > architectures (e.g. AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 via
> > CONFIG_COMPAT) or via syscall masks (e.g. x86_x32), allow a fast way
> > to pin the process to a specific syscall table, instead of needing
> > to generate all filters with an architecture check as the first filter
> > action.
> >
> > This creates the internal representation that seccomp itself can use
> > (which is separate from the filters, which need to stay runtime
> > agnostic). Additionally paves the way for constant-action bitmaps.
>
> I don't really see the point in providing this UAPI - the syscall
> number checking will probably have much more performance cost than the
> architecture number check, and it's not like this lets us avoid the
> check, we're just moving it over into C code.
It's desirable for libseccomp and is a request from systemd (which is,
at this point, the largest seccomp user I know of), as they have no way
to force an arch without doing it in filters, which doesn't help much
with reducing filter runtime.
>
> > Signed-off-by: Kees Cook <keescook@chromium.org>
> > ---
> > include/linux/seccomp.h | 9 +++
> > include/uapi/linux/seccomp.h | 1 +
> > kernel/seccomp.c | 79 ++++++++++++++++++-
> > tools/testing/selftests/seccomp/seccomp_bpf.c | 33 ++++++++
> > 4 files changed, 120 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> > index 02aef2844c38..0be20bc81ea9 100644
> > --- a/include/linux/seccomp.h
> > +++ b/include/linux/seccomp.h
> > @@ -20,12 +20,18 @@
> > #include <linux/atomic.h>
> > #include <asm/seccomp.h>
> >
> > +#define SECCOMP_ARCH_IS_NATIVE 1
> > +#define SECCOMP_ARCH_IS_COMPAT 2
>
> FYI, mips has three different possible "arch" values (per kernel build
> config; the __AUDIT_ARCH_LE flag can also be set, but that's fixed
> based on the config):
>
> - AUDIT_ARCH_MIPS
> - AUDIT_ARCH_MIPS | __AUDIT_ARCH_64BIT
> - AUDIT_ARCH_MIPS | __AUDIT_ARCH_64BIT | __AUDIT_ARCH_CONVENTION_MIPS64_N32
>
> But I guess we can deal with that once someone wants to actually add
> support for this on mips.
Yup!
>
> > +#define SECCOMP_ARCH_IS_MULTIPLEX 3
>
> Why should X32 be handled specially? If the seccomp filter allows
Because it's a masked lookup into a separate table; the syscalls don't
map to x86_64's table; so for seccomp to correctly figure out which
bitmap to use, it has to do this decoding.
> specific syscalls (as it should), we don't have to care about X32.
> Only in weird cases where the seccomp filter wants to deny specific
> syscalls (a horrible idea), X32 is a concern, and in such cases, the
> userspace code can generate a single conditional jump to deal with it.
I feel like I must not understand what you mean. The x32-aware seccomp
filters are using syscall tests with 0x40000000 included in the values.
So seccomp's bitmap cannot handle this because it must know how many
syscalls to include in a linearly-allocated bitmap.
> And when seccomp is used properly to allow specific syscalls, the
> kernel will just waste time uselessly checking this X32 stuff.
It not measurable in my tests -- seccomp_data::nr is rather hot in the
cache. ;) That said, if it's unwanted, then CONFIG_X86_X32=n is the way
to go.
> [...]
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> [...]
> > +static long seccomp_pin_architecture(void)
> > +{
> > +#ifdef SECCOMP_ARCH
> > + struct task_struct *task = current;
> > +
> > + u8 arch = seccomp_get_arch(syscall_get_arch(task),
> > + syscall_get_nr(task, task_pt_regs(task)));
> > +
> > + /* How did you even get here? */
>
> Via a racing TSYNC, that's how.
Yes; thanks. This will need to take ¤t->sighand->siglock.
>
> > + if (task->seccomp.arch && task->seccomp.arch != arch)
> > + return -EBUSY;
> > +
> > + task->seccomp.arch = arch;
> > +#endif
> > + return 0;
> > +}
>
> Why does this return 0 if SECCOMP_ARCH is not defined? That suggests
> to userspace that we have successfully pinned the ABI, even though
> we're actually unable to do so.
Yup; thanks for the catch. This is a logical leftover from the RFC. This
should be, I think:
+ task->seccomp.arch = arch;
+ return 0;
+#else
+ return -EINVAL;
+#endif
--
Kees Cook
WARNING: multiple messages have this Message-ID (diff)
From: Kees Cook <keescook@chromium.org>
To: Jann Horn <jannh@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
Giuseppe Scrivano <gscrivan@redhat.com>,
Will Drewry <wad@chromium.org>, bpf <bpf@vger.kernel.org>,
YiFei Zhu <yifeifz2@illinois.edu>,
Linux API <linux-api@vger.kernel.org>,
Linux Containers <containers@lists.linux-foundation.org>,
Tobin Feldman-Fitzthum <tobin@ibm.com>,
Hubertus Franke <frankeh@us.ibm.com>,
Andy Lutomirski <luto@amacapital.net>,
Valentin Rothberg <vrothber@redhat.com>,
Dimitrios Skarlatos <dskarlat@cs.cmu.edu>,
Jack Chen <jianyan2@illinois.edu>,
Josep Torrellas <torrella@illinois.edu>,
Tianyin Xu <tyxu@illinois.edu>,
kernel list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/6] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE
Date: Thu, 24 Sep 2020 00:11:01 -0700 [thread overview]
Message-ID: <202009232353.FD011DAA0@keescook> (raw)
In-Reply-To: <CAG48ez17XK2Co+1LbUWTc4x_W7nza=TObNh2Kpz6P+ba3OKPsw@mail.gmail.com>
On Thu, Sep 24, 2020 at 02:41:36AM +0200, Jann Horn wrote:
> On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> > For systems that provide multiple syscall maps based on audit
> > architectures (e.g. AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 via
> > CONFIG_COMPAT) or via syscall masks (e.g. x86_x32), allow a fast way
> > to pin the process to a specific syscall table, instead of needing
> > to generate all filters with an architecture check as the first filter
> > action.
> >
> > This creates the internal representation that seccomp itself can use
> > (which is separate from the filters, which need to stay runtime
> > agnostic). Additionally paves the way for constant-action bitmaps.
>
> I don't really see the point in providing this UAPI - the syscall
> number checking will probably have much more performance cost than the
> architecture number check, and it's not like this lets us avoid the
> check, we're just moving it over into C code.
It's desirable for libseccomp and is a request from systemd (which is,
at this point, the largest seccomp user I know of), as they have no way
to force an arch without doing it in filters, which doesn't help much
with reducing filter runtime.
>
> > Signed-off-by: Kees Cook <keescook@chromium.org>
> > ---
> > include/linux/seccomp.h | 9 +++
> > include/uapi/linux/seccomp.h | 1 +
> > kernel/seccomp.c | 79 ++++++++++++++++++-
> > tools/testing/selftests/seccomp/seccomp_bpf.c | 33 ++++++++
> > 4 files changed, 120 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> > index 02aef2844c38..0be20bc81ea9 100644
> > --- a/include/linux/seccomp.h
> > +++ b/include/linux/seccomp.h
> > @@ -20,12 +20,18 @@
> > #include <linux/atomic.h>
> > #include <asm/seccomp.h>
> >
> > +#define SECCOMP_ARCH_IS_NATIVE 1
> > +#define SECCOMP_ARCH_IS_COMPAT 2
>
> FYI, mips has three different possible "arch" values (per kernel build
> config; the __AUDIT_ARCH_LE flag can also be set, but that's fixed
> based on the config):
>
> - AUDIT_ARCH_MIPS
> - AUDIT_ARCH_MIPS | __AUDIT_ARCH_64BIT
> - AUDIT_ARCH_MIPS | __AUDIT_ARCH_64BIT | __AUDIT_ARCH_CONVENTION_MIPS64_N32
>
> But I guess we can deal with that once someone wants to actually add
> support for this on mips.
Yup!
>
> > +#define SECCOMP_ARCH_IS_MULTIPLEX 3
>
> Why should X32 be handled specially? If the seccomp filter allows
Because it's a masked lookup into a separate table; the syscalls don't
map to x86_64's table; so for seccomp to correctly figure out which
bitmap to use, it has to do this decoding.
> specific syscalls (as it should), we don't have to care about X32.
> Only in weird cases where the seccomp filter wants to deny specific
> syscalls (a horrible idea), X32 is a concern, and in such cases, the
> userspace code can generate a single conditional jump to deal with it.
I feel like I must not understand what you mean. The x32-aware seccomp
filters are using syscall tests with 0x40000000 included in the values.
So seccomp's bitmap cannot handle this because it must know how many
syscalls to include in a linearly-allocated bitmap.
> And when seccomp is used properly to allow specific syscalls, the
> kernel will just waste time uselessly checking this X32 stuff.
It not measurable in my tests -- seccomp_data::nr is rather hot in the
cache. ;) That said, if it's unwanted, then CONFIG_X86_X32=n is the way
to go.
> [...]
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> [...]
> > +static long seccomp_pin_architecture(void)
> > +{
> > +#ifdef SECCOMP_ARCH
> > + struct task_struct *task = current;
> > +
> > + u8 arch = seccomp_get_arch(syscall_get_arch(task),
> > + syscall_get_nr(task, task_pt_regs(task)));
> > +
> > + /* How did you even get here? */
>
> Via a racing TSYNC, that's how.
Yes; thanks. This will need to take ¤t->sighand->siglock.
>
> > + if (task->seccomp.arch && task->seccomp.arch != arch)
> > + return -EBUSY;
> > +
> > + task->seccomp.arch = arch;
> > +#endif
> > + return 0;
> > +}
>
> Why does this return 0 if SECCOMP_ARCH is not defined? That suggests
> to userspace that we have successfully pinned the ABI, even though
> we're actually unable to do so.
Yup; thanks for the catch. This is a logical leftover from the RFC. This
should be, I think:
+ task->seccomp.arch = arch;
+ return 0;
+#else
+ return -EINVAL;
+#endif
--
Kees Cook
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers
next prev parent reply other threads:[~2020-09-24 7:12 UTC|newest]
Thread overview: 81+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-09-23 23:29 [PATCH v1 0/6] seccomp: Implement constant action bitmaps Kees Cook
2020-09-23 23:29 ` Kees Cook
2020-09-23 23:29 ` [PATCH 1/6] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE Kees Cook
2020-09-23 23:29 ` Kees Cook
2020-09-24 0:41 ` Jann Horn
2020-09-24 0:41 ` Jann Horn via Containers
2020-09-24 7:11 ` Kees Cook [this message]
2020-09-24 7:11 ` Kees Cook
2020-09-23 23:29 ` [PATCH 2/6] x86: Enable seccomp architecture tracking Kees Cook
2020-09-23 23:29 ` Kees Cook
2020-09-24 0:45 ` Jann Horn
2020-09-24 0:45 ` Jann Horn via Containers
2020-09-24 7:12 ` Kees Cook
2020-09-24 7:12 ` Kees Cook
2020-09-23 23:29 ` [PATCH 3/6] seccomp: Implement constant action bitmaps Kees Cook
2020-09-23 23:29 ` Kees Cook
2020-09-24 0:25 ` Jann Horn
2020-09-24 0:25 ` Jann Horn via Containers
2020-09-24 7:36 ` Kees Cook
2020-09-24 7:36 ` Kees Cook
2020-09-24 8:07 ` YiFei Zhu
2020-09-24 8:07 ` YiFei Zhu
2020-09-24 8:15 ` Kees Cook
2020-09-24 8:15 ` Kees Cook
2020-09-24 8:22 ` YiFei Zhu
2020-09-24 8:22 ` YiFei Zhu
2020-09-24 12:28 ` Jann Horn
2020-09-24 12:28 ` Jann Horn via Containers
2020-09-24 12:37 ` David Laight
2020-09-24 12:37 ` David Laight
2020-09-24 12:56 ` Jann Horn
2020-09-24 12:56 ` Jann Horn via Containers
[not found] ` <DM6PR11MB271492D0565E91475D949F5DEF390@DM6PR11MB2714.namprd11.prod.outlook.com>
2020-09-24 0:36 ` YiFei Zhu
2020-09-24 0:36 ` YiFei Zhu
2020-09-24 7:38 ` Kees Cook
2020-09-24 7:38 ` Kees Cook
2020-09-24 7:51 ` YiFei Zhu
2020-09-24 7:51 ` YiFei Zhu
2020-09-23 23:29 ` [PATCH 4/6] seccomp: Emulate basic filters for constant action results Kees Cook
2020-09-23 23:29 ` Kees Cook
2020-09-23 23:47 ` Jann Horn
2020-09-23 23:47 ` Jann Horn via Containers
2020-09-24 7:46 ` Kees Cook
2020-09-24 7:46 ` Kees Cook
2020-09-24 15:28 ` Paul Moore
2020-09-24 15:28 ` Paul Moore
2020-09-24 19:52 ` Kees Cook
2020-09-24 19:52 ` Kees Cook
2020-09-24 20:46 ` Paul Moore
2020-09-24 20:46 ` Paul Moore
2020-09-24 21:35 ` Kees Cook
2020-09-24 21:35 ` Kees Cook
2020-09-24 13:16 ` kernel test robot
2020-09-23 23:29 ` [PATCH 5/6] selftests/seccomp: Compare bitmap vs filter overhead Kees Cook
2020-09-23 23:29 ` Kees Cook
2020-09-23 23:29 ` [PATCH 6/6] [DEBUG] seccomp: Report bitmap coverage ranges Kees Cook
2020-09-23 23:29 ` Kees Cook
2020-09-24 13:40 ` [PATCH v1 0/6] seccomp: Implement constant action bitmaps Rasmus Villemoes
2020-09-24 13:40 ` Rasmus Villemoes
2020-09-24 13:58 ` YiFei Zhu
2020-09-24 13:58 ` YiFei Zhu
2020-09-25 5:56 ` Rasmus Villemoes
2020-09-25 5:56 ` Rasmus Villemoes
2020-09-25 7:07 ` YiFei Zhu
2020-09-25 7:07 ` YiFei Zhu
2020-09-26 18:11 ` YiFei Zhu
2020-09-26 18:11 ` YiFei Zhu
2020-09-28 20:04 ` Kees Cook
2020-09-28 20:04 ` Kees Cook
2020-09-28 20:16 ` YiFei Zhu
2020-09-28 20:16 ` YiFei Zhu
2020-09-24 14:05 ` Jann Horn
2020-09-24 14:05 ` Jann Horn via Containers
2020-09-24 18:57 ` Andrea Arcangeli
2020-09-24 18:57 ` Andrea Arcangeli
2020-09-24 19:18 ` Jann Horn
2020-09-24 19:18 ` Jann Horn via Containers
[not found] ` <9dbe8e3bbdad43a1872202ff38c34ca2@DM5PR11MB1692.namprd11.prod.outlook.com>
2020-09-24 19:48 ` Tianyin Xu
2020-09-24 19:48 ` Tianyin Xu
2020-09-24 20:00 ` Kees Cook
2020-09-24 20:00 ` Kees Cook
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=202009232353.FD011DAA0@keescook \
--to=keescook@chromium.org \
--cc=aarcange@redhat.com \
--cc=bpf@vger.kernel.org \
--cc=christian.brauner@ubuntu.com \
--cc=containers@lists.linux-foundation.org \
--cc=dskarlat@cs.cmu.edu \
--cc=frankeh@us.ibm.com \
--cc=gscrivan@redhat.com \
--cc=jannh@google.com \
--cc=jianyan2@illinois.edu \
--cc=linux-api@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=luto@amacapital.net \
--cc=tobin@ibm.com \
--cc=torrella@illinois.edu \
--cc=tycho@tycho.pizza \
--cc=tyxu@illinois.edu \
--cc=vrothber@redhat.com \
--cc=wad@chromium.org \
--cc=yifeifz2@illinois.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.