From: Jann Horn <jannh@google.com>
To: Eyal Birger <eyal.birger@gmail.com>
Cc: jolsa@kernel.org, kees@kernel.org, luto@amacapital.net,
wad@chromium.org, oleg@redhat.com, mhiramat@kernel.org,
andrii@kernel.org, alexei.starovoitov@gmail.com,
olsajiri@gmail.com, cyphar@cyphar.com, songliubraving@fb.com,
yhs@fb.com, john.fastabend@gmail.com, peterz@infradead.org,
tglx@linutronix.de, bp@alien8.de, daniel@iogearbox.net,
ast@kernel.org, andrii.nakryiko@gmail.com, rostedt@goodmis.org,
rafi@rbk.io, shmulik.ladkani@gmail.com, bpf@vger.kernel.org,
linux-api@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
x86@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 0/2] seccomp: pass uretprobe system call through seccomp
Date: Fri, 7 Feb 2025 17:50:06 +0100
Message-ID: <CAG48ez0c-n1K=Ui-Awp+pGq-k6QvaWarjqz0znyKi5HO5R5P7A@mail.gmail.com>
In-Reply-To: <CAHsH6GtiwCGJevfkE5=VzzuQcusKp-ugnRD+AD+5a+8kqOGyZA@mail.gmail.com>
On Fri, Feb 7, 2025 at 5:20 PM Eyal Birger <eyal.birger@gmail.com> wrote:
> On Fri, Feb 7, 2025 at 7:27 AM Jann Horn <jannh@google.com> wrote:
> >
> > On Sun, Feb 2, 2025 at 5:29 PM Eyal Birger <eyal.birger@gmail.com> wrote:
> > > uretprobe(2) is a performance enhancement system call added to improve
> > > uretprobes on x86_64.
> > >
> > > Confinement environments such as Docker are not aware of this new system
> > > call and kill confined processes when uretprobes are attached to them.
> >
> > FYI, you might have similar issues with Syscall User Dispatch
> > (https://docs.kernel.org/admin-guide/syscall-user-dispatch.html) and
> > potentially also with ptrace-based sandboxes, depending on what kinda
> > processes you inject uprobes into. For Syscall User Dispatch, there is
> > already precedent for a bypass based on instruction pointer (see
> > syscall_user_dispatch()).
>
> Thanks. This is interesting.
>
> Do you know of confinement environments using this?

Not for Syscall User Dispatch; I think that was mostly intended for
stuff like emulating Windows syscalls in WINE. I'm not sure who
actually uses it, I just know a bit about the kernel side of it.

From what I know, ptrace sandboxing is a technique used by some
configurations of gVisor
(https://gvisor.dev/docs/architecture_guide/platforms/#ptrace), though
I now see that the page says this configuration is no longer
supported. I am also not sure whether you'd ever have uprobes
installed in files from which instructions are executed in this
environment.

> > > Since uretprobe is a "kernel implementation detail" system call which is
> > > not used by userspace application code directly, pass this system call
> > > through seccomp without forcing existing userspace confinement environments
> > > to be changed.
> >
> > This makes me feel kinda uncomfortable. The purpose of seccomp() is
> > that you can create a process that is as locked down as you want; you
> > can use it for some light limits on what a process can do (like in
> > Docker), or you can use it to make a process that has access to
> > essentially nothing except read(), write() and exit_group(). Even
> > stuff like restart_syscall() and rt_sigreturn() is not currently
> > excepted from that.
>
> Yes, this has been discussed at length in the threads mentioned
> in the "Link" tags.
>
> >
> > I guess your usecase is a little special in that you were already
> > calling from userspace into the kernel with SWBP before, which is also
> > not subject to seccomp; and the syscall is essentially an
> > arch-specific hack to make the SWBP a little faster.
>
> Indeed. The uretprobe mechanism wasn't enforced by seccomp before
> this syscall. This change preserves this.
>
> >
> > If we do this, we should at least ensure that there is absolutely no
> > way for anything to happen in sys_uretprobe when no uretprobes are
> > configured for the process - the first check in the syscall
> > implementation almost does that, but the implementation could be a bit
> > stricter. It checks for "regs->ip != trampoline_check_ip()", but if no
> > uprobe region exists for the process, trampoline_check_ip() returns
> > `-1 + (uretprobe_syscall_check - uretprobe_trampoline_entry)`. So
> > there is a userspace instruction pointer near the bottom of the
> > address space that is allowed to call into the syscall if uretprobes
> > are not set up. Though the mmap minimum address restrictions will
> > typically prevent creating mappings there, and
> > uprobe_handle_trampoline() will SIGILL us if we get that far without a
> > valid uretprobe.
>
> I'm not sure I understand your point. If creating mappings in that
> area is prevented, what is the issue?

It is usually prevented, not always - root can do it depending on
system configuration.

Also, in a syscall like this that will be reachable in every sandbox,
I think we should try to be more careful about edge cases and avoid
things like this offset calculation on address -1.

> also, this would be related to the
> uretprobe syscall implementation in general, no?

Yes. I just think it is relevant to the seccomp change because
exempting a syscall from seccomp makes it more important that the
syscall is robust and correct.

> To me this seems unrelated to the seccomp change.
> Jiri, do you have any input on this?
>
> Thanks!
> Eyal.
Thread overview: 15+ messages
2025-02-02 16:29 [PATCH v3 0/2] seccomp: pass uretprobe system call through seccomp Eyal Birger
2025-02-02 16:29 ` [PATCH v3 1/2] seccomp: passthrough uretprobe systemcall without filtering Eyal Birger
2025-02-06 21:20 ` Kees Cook
2025-02-02 16:29 ` [PATCH v3 2/2] selftests/seccomp: validate uretprobe syscall passes through seccomp Eyal Birger
2025-02-02 20:51 ` Jiri Olsa
2025-02-02 21:13 ` Eyal Birger
2025-02-06 21:18 ` Kees Cook
2025-02-06 21:21 ` [PATCH v3 0/2] seccomp: pass uretprobe system call " Kees Cook
2025-02-07 1:06 ` Eyal Birger
2025-02-07 13:24 ` Jiri Olsa
2025-02-07 15:27 ` Jann Horn
2025-02-07 16:20 ` Eyal Birger
2025-02-07 16:50 ` Jann Horn [this message]
2025-02-08 0:03 ` Jiri Olsa
2025-02-08 20:35 ` Kees Cook