public inbox for linux-kernel@vger.kernel.org
From: Gabriel Krisman Bertazi <krisman@collabora.com>
To: Rich Felker <dalias@libc.org>
Cc: libc-alpha@sourceware.org, Florian Weimer <fw@deneb.enyo.de>,
	linux-kernel@vger.kernel.org
Subject: Re: Kernel prctl feature for syscall interception and emulation
Date: Thu, 19 Nov 2020 11:15:46 -0500
Message-ID: <87h7pltj9p.fsf@collabora.com>
In-Reply-To: <20201119151317.GF534@brightrain.aerifal.cx> (Rich Felker's message of "Thu, 19 Nov 2020 10:13:18 -0500")

Rich Felker <dalias@libc.org> writes:

> On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via Libc-alpha wrote:

[...]

>
> SIGSYS (or signal handling in general) is not the right way to do
> this. It has all the same problems that came up in seccomp filtering
> with SIGSYS, and which were solved by user_notif mode (running the
> interception in a separate thread rather than an async context
> interrupting the syscall. In fact I wouldn't be surprised if what you
> want can already be done with reasonable efficiency using seccomp
> user_notif.

Hi Rich,

User_notif was raised in the kernel discussion and we had experimented
with it, but the latency of user_notif is even worse than what we can do
right now with other seccomp actions.

Regarding SIGSYS, the x86 maintainer suggested redirecting the syscall
return to a userspace thunk, but the understanding among Wine developers
is that SIGSYS is enough for their emulation needs.

> The default-intercept and excepting libc code segment is also bogus,
> and will break stuff, including vdso syscall mechanism on i386 and any
> code outside libc that makes its own syscalls from asm. If you need to
> tag regions to control interception, it should be tagging the emulated
> Windows guest code, which is bounded and you have full control over,
> rather than the host code, which is unbounded and includes any
> libraries that get linked indirectly by Wine.

The vdso trampoline, for the architectures that have it, is handled by
the kernel implementation, which makes sure that region is allowed.

The Linux code is not bounded, but the dispatcher region's main goal is
to support trampolines outside of the vdso case. The correct userspace
implementation requires flipping the selector on every Windows/Linux
code boundary crossing, exactly because other libraries can issue
syscalls directly.  The fact that libc is not the only one issuing
syscalls is the exact reason we need something more complex than a few
seccomp filters.

Flipping the selector on every boundary crossing is fine for performance,
since we don't go into the kernel.  But if we can avoid checking it from
kernelspace, that's an optimization, which is what I meant by the
dispatcher region allowing more parts of the glibc code.  It is not
strictly necessary for correctness.
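
As a rough illustration of the selector flipping described above, the
following is a minimal userspace-only sketch.  It assumes a per-thread
selector byte that the kernel would consult on syscall entry; the names
(dispatch_selector, SELECTOR_ALLOW, SELECTOR_BLOCK, call_guest,
call_host) are hypothetical and not taken from the patch set, and the
prctl registration of the selector is deliberately left out.

```c
#include <stdint.h>

/* Hypothetical selector values; illustrative only, not a confirmed ABI. */
enum { SELECTOR_ALLOW = 0, SELECTOR_BLOCK = 1 };

/*
 * One selector byte per thread.  In the proposed scheme the kernel would
 * read this byte on syscall entry; here it is only flipped in userspace,
 * which is why a boundary crossing costs no kernel transition.
 */
static _Thread_local volatile uint8_t dispatch_selector = SELECTOR_ALLOW;

typedef int (*boundary_fn)(void *arg);

/* Host (Linux) -> guest (Windows) crossing: intercept syscalls in fn. */
static int call_guest(boundary_fn fn, void *arg)
{
    uint8_t saved = dispatch_selector;
    dispatch_selector = SELECTOR_BLOCK;
    int ret = fn(arg);
    dispatch_selector = saved;      /* restore on the way back */
    return ret;
}

/* Guest -> host crossing: let native code issue syscalls directly. */
static int call_host(boundary_fn fn, void *arg)
{
    uint8_t saved = dispatch_selector;
    dispatch_selector = SELECTOR_ALLOW;
    int ret = fn(arg);
    dispatch_selector = saved;
    return ret;
}

/* Sample callee that reports the selector value it observes. */
static int read_selector(void *unused)
{
    (void)unused;
    return dispatch_selector;
}

/* Sample guest routine that calls back into host code. */
static int guest_calls_host(void *unused)
{
    (void)unused;
    return call_host(read_selector, 0);
}
```

Saving and restoring the previous value, rather than hard-coding it,
keeps nested crossings (host -> guest -> host) consistent.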

I still don't think anything is broken here.

> But I'm skeptical that doing any new kernel-side logic for tagging is
> needed. Seccomp already lets you filter on instruction pointer so you
> can install filters that will trigger user_notif just for guest code,
> then let you execute the emulation in the watcher thread and skip the
> actual syscall in the watched thread.

As I mentioned, we can check IP in seccomp and write filters.  But this
has two problems:

1) Performance.  seccomp filters use cBPF, which means 32-bit comparisons,
no maps and a very limited instruction set.  We need to generate
boundary checks for each memory segment.  The filter becomes very large
very quickly and becomes an observable bottleneck.

2) Seccomp filters cannot be removed.  And we'd need to update them
frequently.
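
To give a feel for point 1, here is a hedged sketch of what an
instruction-pointer range check could look like in cBPF.  It only builds
the instruction array and never installs it.  The assumptions: a
little-endian host, each range lying inside a single 4 GiB window and not
ending exactly on a 4 GiB boundary, and SECCOMP_RET_TRAP as the action
for matching ranges.  struct ip_range and build_ip_filter are
hypothetical names used for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Offsets of the two 32-bit halves of the 64-bit IP (little-endian). */
#define IP_LO ((uint32_t)offsetof(struct seccomp_data, instruction_pointer))
#define IP_HI (IP_LO + 4)

#define INSNS_PER_RANGE 6

/* Hypothetical description of one guest code segment: [start, end). */
struct ip_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Emit one block of checks per range, then a final "allow".
 * 'out' must have room for n * INSNS_PER_RANGE + 1 instructions.
 * Returns the number of instructions written.
 */
static size_t build_ip_filter(struct sock_filter *out,
                              const struct ip_range *r, size_t n)
{
    size_t i = 0;

    for (size_t k = 0; k < n; k++) {
        uint32_t hi = (uint32_t)(r[k].start >> 32);
        uint32_t lo_start = (uint32_t)r[k].start;
        uint32_t lo_end = (uint32_t)r[k].end;

        /* A = high half of IP; if it differs, skip to the next range. */
        out[i++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS, IP_HI);
        out[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, hi, 0, 4);
        /* A = low half of IP; below start means skip to the next range. */
        out[i++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS, IP_LO);
        out[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, lo_start, 0, 2);
        /* At or past end means skip; otherwise fall through to the trap. */
        out[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, lo_end, 1, 0);
        out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP);
    }

    /* No range matched: let the syscall through. */
    out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
    return i;
}
```

Even under these simplifying assumptions each segment costs six
instructions that are evaluated linearly on every syscall, so the filter
grows with the number of mapped segments.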

-- 
Gabriel Krisman Bertazi
