Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fw@deneb.enyo.de>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	carlos <carlos@redhat.com>, Thomas Gleixner <tglx@linutronix.de>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	paulmck <paulmck@linux.ibm.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	"H. Peter Anvin" <hpa@zytor.com>, Paul Turner <pjt@google.com>,
	linux-api <linux-api@vger.kernel.org>,
	Dmitry Vyukov <dvyukov@google.com>,
	Neel Natu <neelnatu@google.com>
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID
Date: Mon, 13 Jul 2020 14:40:20 -0400 (EDT)	[thread overview]
Message-ID: <1305865358.10354.1594665620975.JavaMail.zimbra@efficios.com> (raw)
In-Reply-To: <20200711155459.4mswacrvosw377tk@wittgenstein>

----- On Jul 11, 2020, at 11:54 AM, Christian Brauner christian.brauner@ubuntu.com wrote:

> 
> The registration is a thread-group property I'll assume, right, i.e. all
> threads will have rseq TLS or no thread will have it?

No, rseq registration is a per-thread property, but it would arguably be
a bit weird for a thread-group to observe different registration states
for different threads.

> Some things I seem to be able to assume (correct me if I'm wrong):
> - rseq registration is expected to be done at thread creation time

True.

> - rseq registration _should_ only be done once, i.e. if a caller detects
>  that rseq is already registered for a thread, then they could
>  technically re-register rseq - risking a feature mismatch if doing so
>  - but they shouldn't re-register but simply assume that someone else
>  is in control of rseq. If they violate that assumption than you're
>  hosed anyway.

Right.

> So it seems as long as callers leave rseq registration alone whenever
> they detect that it is already registered then one can assume that rseq
> is under consistent control of a single library/program. If that's the
> case it should safe to assume that the library will use the same rseq
> registration for all threads bounded by the available kernel features or
> by the set of features it is aware of.

The rseq registration is per-thread. But yes, as soon as one user registers
rseq, other users for that thread are expected to piggy-back on that
registration.

> I proposed that specific scheme because I was under the impression that
> you are in need of a mechanism for a caller (at thread creation I
> assume) to check what feature set is supported by rseq _without_
> issung a system call. If you were to record the requested flags in
> struct rseq or in some other way, then another library which tries to
> register rseq for a thread but detects it has already been registered
> can look at e.g. whether the reliable cpu feature is around and then
> adjust it's behavior accordingly.
> Another reason why this seems worthwhile is because of how rseq works in
> general. Since it registers a piece of userspace memory which userspace
> can trivially access enforcing that userspace issue a syscall to get at
> the feature list seems odd when you can just record it in the struct.
> But that's a matter of style, I guess.

Good points.

> 
> Also, I'm thinking about the case of adding one or two new features that
> introduce mutually exclusive behavior for rseq, i.e. if you register
> rseq with FEAT1 and someone else registers it with FEAT2 and FEAT1 and
> FEAT2 would lead to incompatible behavior for an aspect or all of rseq.
> Even if you had a way to query the kernel for FEAT1 and FEAT2 in the
> rseq syscall it still be a problem since a caller wouldn't know at rseq
> registration time whether the library registered rseq with FEAT1 or
> FEAT2. If you record the behavior somewhere - kernel_flags or whatever -
> then the caller could check at registration time "ah, rseq is registered
> with this behavior" I need to adjust my behavior.

I think what we want here is to be able to extend the feature set, but not
"pick and choose" different incompatible features.

[...]
>> 
>> One additional thing to keep in mind: the application can itself choose
>> to define the __rseq_abi TLS, which AFAIU (please let me know if I am
>> wrong) would take precedence over glibc's copy. So extending the
> 
> You mean either that an application could simply choose to ignore that e.g.
> glibc has already registered rseq and e.g. unregister it and register
> it's own or that it registers it's own rseq before glibc comes into the
> play?

No quite.

I mean that an application binary or a preloaded library is allowed to
interpose with glibc and expose a __rseq_abi symbol with a size smaller
than glibc's __rseq_abi. The issue is that glibc (or other library
responsible for rseq registration) is unaware of that symbol's size.

This means that extending __rseq_abi cannot be done by means of additional
flags or parameters passed directly from the registration owner to the
rseq system call.

> I mean, if I interpreted what you're trying to say correctly, I think
> one needs to work under the assumption that if someone else has already
> registered rseq than it becomes the single source of truth. I think that
> basically needs to be the contract. Same way you expect a user of
> pthreads to not suddenly go and call clone3() with CLONE_THREAD |
> CLONE_VM | [...] and assume that this won't mess with glibc's internal
> state.

Right. The issue is not about which library owns the registration, but
rather making sure everyone agree on the size of __rseq_abi, including
interposition scenarios.

[...]
>> 
>> For both approaches, we could either pass them as parameters with rseq
>> registration, and make rseq registration success conditional on the
>> kernel implementing those feature/fix-version, or validate the flag/version
>> separately from registration.
>> 
>> If this is done on registration, it means glibc will eventually have to
>> handle this. This prevents user libraries with specific needs to query
>> whether their features are available. Doing the feature/version validation
>> separately from registration allows each user library to make its own
>> queries and take advantage of new kernel features before glibc is
>> upgraded to be made aware of them.
> 
> Why isn't there a "dual scheme"? I.e. you record the features somewhere
> in struct rseq or some other place so userspace can query an already
> registered thread for the features it was registered with and have a way
> to query the kernel for the supported features through the system call
> (See what I suggested above with the feature checking flags.).

This discussion got my head into gears over the weekend, and I think
I came up with something that could elegantly solve all the "rseq extensibility"
problem. More below.

[...]

> I really think this is not an rseq specific problem. This seems to be a
> generic problem any interface has when it e.g. makes use of an extended
> struct and e.g. decides to make its own syscalls without going through
> the glibc wrappers (If there are any...). That's the reality right now
> and will likely always be that way short of us blocking non-libc
> syscalls similar to some bsds at which point we need a 1:1 kernel + libc
> development. :) That's not going to happen. The problem ranges from
> struct statx to struct clone_args and struct open_how and so on.

AFAIU the only "special" thing about rseq is that its __rseq_abi is
a TLS symbol shared between application/libraries, and interposition is
allowed.

> 
> But one question. Musn't the assumption be that all threads in a
> thread-group if they have registered rseq then the same
> application/library has done that registration?

No, __rseq_abi is a TLS, and the registration is per-thread.

> And if that's the case
> then the application/library doing the registration is what defines the
> layout for that thread-group and becomes the one source of truth.
> Basically, if an application uses it's own rseq then glibc must be out
> of the picture. If that's not part of the contract then it feels to me
> that rseq cannot be extended (for now).

Indeed, the new scheme I have in mind for rseq extensibility would allow
new features to be used between new application/library and kernel even
with an older glibc which would contain the rseq registration code, but
be unaware of those new features.

[...]
> 
> Wouldn't the non-updated application just access fields and methods of
> the smaller struct? The way struct extensions work is that we only
> extend them after the last member and always correctly 64 bit aligned.
> And as long as you only extend the struct at the end wouldn't that be
> ok? An application with a non-updated/smaller struct rseq would just
> access fields that the larger struct supports, no?

The issue is symbol interposition, as discussed above.

So here is the idea which emerged through the weekend as I was thinking
about your email:

* Current technical constraints
  - struct rseq __rseq_abi can be interposed by preloaded libraries and
    application, without knowledge from the registration "owner" (typically
    glibc).

* Objectives:
  - Allow extending the size of struct rseq to add additional features,
    such as node_id field, signal-disabling fields, and so on.
  - Allow extending this size without requiring an upgrade of the library
    performing rseq registration. Simply upgrading the rseq "user" application
    or library and the kernel should suffice.

* Proposed ABI
  - Reserve a bit in the field (struct rseq *)->flags of the TLS __rseq_abi,
    named e.g.: RSEQ_CS_FLAG_SIZE = (1U << 3).
  - A definition wishing to extend struct rseq would be required to initialize
    __rseq_abi with this bit set in the flags field.
  - When RSEQ_CS_FLAG_SIZE is set, struct rseq is guaranteed to have two new
    fields after the flags field: a __u32 user_size and a __u32 kernel_size field.
  - The user_size field is meant to be initialized to sizeof(struct rseq) by the
    __rseq_abi definition. In an interposition scenario, a kernel supporting this
    size feature will know about the size of the symbol by checking both the
    RSEQ_CS_FLAG_SIZE flag and the user_size field.
  - On registration, if the __rseq_abi.flags RSEQ_CS_FLAG_SIZE flag is set (and this
    flag is supported by the kernel), the kernel updates the kernel_size field to
    min(sizeof(struct rseq), __rseq_abi.user_size), effectively the subset of size
    supported by both the user-space __rseq_abi definition and by the kernel.
  - Users wishing to use additional fields beyond __rseq_abi.flags would need to check
    that __rseq_abi->flags & RSEQ_CS_FLAG_SIZE is true, and that
    __rseq_abi.kernel_size >= offsetof(struct rseq, feature_field) + sizeof(__rseq_abi.feature_field)
    This would ensure the fields are only used if the symbol is large enough to
    hold them *and* the kernel supports them.

With this kind of scheme, we could then easily extend struct rseq to cover additional
use-cases such as:

- adding a new "node_id" field to speed up getcpu(3), user-space locking on NUMA,
  and possibly memory allocators,
- adding fields allowing to quickly disable/enable signals,
- adding a "__u64 features" field, which would contain for instance
  RSEQ_FEATURE_RELIABLE_CPU_ID.

I'm not sure why I did not think of this earlier, but it all seems to fit nicely.

Any comments ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

next prev parent reply	other threads:[~2020-07-13 18:40 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-06 20:49 [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix Mathieu Desnoyers
2020-07-06 20:49 ` [RFC PATCH for 5.8 1/4] sched: Fix unreliable rseq cpu_id for new tasks Mathieu Desnoyers
2020-07-07  7:30   ` Florian Weimer
2020-07-07 10:51     ` Mathieu Desnoyers
2020-07-06 20:49 ` [RFC PATCH for 5.8 2/4] rseq: Introduce RSEQ_FLAG_REGISTER Mathieu Desnoyers
2020-07-06 20:49 ` [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID Mathieu Desnoyers
2020-07-07  7:29   ` Florian Weimer
2020-07-07 10:48     ` Mathieu Desnoyers
2020-07-07 11:32       ` Florian Weimer
2020-07-07 12:06         ` Mathieu Desnoyers
2020-07-07 18:53           ` Carlos O'Donell
2020-07-07 18:59             ` Mathieu Desnoyers
2020-07-08  8:31               ` Florian Weimer
2020-07-07 19:55             ` Florian Weimer
2020-07-08 15:33               ` Mathieu Desnoyers
2020-07-08 16:22                 ` Christian Brauner
2020-07-08 16:36                   ` Florian Weimer
2020-07-08 17:34                   ` Mathieu Desnoyers
2020-07-09 12:49                     ` Christian Brauner
2020-07-09 15:15                       ` Mathieu Desnoyers
2020-07-11 15:54                         ` Christian Brauner
2020-07-13 18:40                           ` Mathieu Desnoyers [this message]
2020-07-06 20:49 ` [RFC PATCH for 5.8 4/4] rseq: selftests: Expect reliable cpu_id field Mathieu Desnoyers
2020-07-07  6:26 ` [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix Florian Weimer
2020-07-07 14:54 ` Florian Weimer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1305865358.10354.1594665620975.JavaMail.zimbra@efficios.com \
    --to=mathieu.desnoyers@efficios.com \
    --cc=boqun.feng@gmail.com \
    --cc=carlos@redhat.com \
    --cc=christian.brauner@ubuntu.com \
    --cc=dvyukov@google.com \
    --cc=fw@deneb.enyo.de \
    --cc=hpa@zytor.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=neelnatu@google.com \
    --cc=paulmck@linux.ibm.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.