public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Thomas Gleixner <tglx@linutronix.de>, Andi Kleen <andi@firstfloor.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Dave Watson <davejwatson@fb.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	linux-api <linux-api@vger.kernel.org>,
	Paul Turner <pjt@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Russell King <linux@arm.linux.org.uk>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	Andrew Hunter <ahh@google.com>, Chris Lameter <cl@linux.com>,
	Ben Maurer <bmaurer@fb.com>, rostedt <rostedt@goodmis.org>,
	Josh Triplett <josh@joshtriplett.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will.deacon@arm.com>,
	Michael Kerrisk <mtk.manpages@gmail.com>
Subject: Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
Date: Fri, 17 Nov 2017 17:14:32 +0000 (UTC)	[thread overview]
Message-ID: <1756446476.17265.1510938872121.JavaMail.zimbra@efficios.com> (raw)
In-Reply-To: <alpine.DEB.2.20.1711171107450.1709@nanos>

----- On Nov 17, 2017, at 5:09 AM, Thomas Gleixner tglx@linutronix.de wrote:

> On Thu, 16 Nov 2017, Andi Kleen wrote:
>> My preference would be just to drop this new super ugly system call.
>> 
>> It's also not just the ugliness, but the very large attack surface
>> that worries me here.
>> 
>> As far as I know it is only needed to support single stepping, correct?
> 
> I can't figure that out because the changelog describes only WHAT the patch
> does and not WHY. Useful, isn't it?
> 
>> Then this whole mess would disappear.
> 
> Agreed. That would be much appreciated.

Let's have a look at why cpu_opv is needed. I'll make sure to enhance the
changelog and documentation to include that information.

1) Handling single-stepping from tools

Tools like debuggers, and simulators like record-replay ("rr") use
single-stepping to run through existing programs. If core libraries start
to use restartable sequences for e.g. memory allocation, this means
pre-existing programs cannot be single-stepped, simply because the
underlying glibc or jemalloc has changed.

The rseq user-space does expose a __rseq_table section for the sake of
debuggers, so they can skip over the rseq critical sections if they want.
However, this requires upgrading tools, and still breaks single-stepping
in case where glibc or jemalloc is updated, but not the tooling.

Having a performance-related library improvement break tooling is likely to
cause a big push-back against wide adoption of rseq. *I* would not even
be using rseq in liburcu and lttng-ust until gdb gets updated in every
distributions that my users depend on. This will likely happen... never.


2) Forward-progress guarantee

Having a piece of user-space code that stops progressing due to
external conditions is pretty bad. We are used to think of fast-path and
slow-path (e.g. for locking), where the contended vs uncontended cases
have different performance characteristics, but each need to provide some
level of progress guarantees.

I'm very concerned about proposing just "rseq" without the associated
slow-path (cpu_opv) that guarantees progress. It's just asking for trouble
when real-life will happen: page faults, uprobes, and other unforeseen
conditions that would seldom cause a rseq fast-path to never progress.


3) Handling page faults

If we get creative enough, it's pretty easy to come up with corner-case
scenarios where rseq does not progress without the help from cpu_opv. For
instance, a system with swap enabled which is under high memory pressure
could trigger page faults at pretty much every rseq attempt. I recognize
that this scenario is extremely unlikely, but I'm not comfortable making
rseq the weak link of the chain here.


4) Comparison with LL/SC

The layman versed in the load-link/store-conditional instructions in
RISC architectures will notice the similarity between rseq and LL/SC
critical sections. The comparison can even be pushed further: since
debuggers can handle those LL/SC critical sections, they should be
able to handle rseq c.s. in the same way.

First, the way gdb recognises LL/SC c.s. patterns is very fragile:
it's limited to specific common patterns, and will miss the pattern
in all other cases. But fear not, having the rseq c.s. expose a
__rseq_table to debuggers removes that guessing part.

The main difference between LL/SC and rseq is that debuggers had
to support single-stepping through LL/SC critical sections from the
get go in order to support a given architecture. For rseq, we're
adding critical sections into pre-existing applications/libraries,
so the user expectation is that tools don't break due to a library
optimization.


5) Perform maintenance operations on per-cpu data

rseq c.s. are quite limited feature-wise: they need to end with a
*single* commit instruction that updates a memory location. On the
other hand, the cpu_opv system call can combine a sequence of operations
that need to be executed with preemption disabled. While slower than
rseq, this allows for more complex maintenance operations to be
performed on per-cpu data concurrently with rseq fast-paths, in cases
where it's not possible to map those sequences of ops to a rseq.


6) Use cpu_opv as generic implementation for architectures not
   implementing rseq assembly code

rseq critical sections require architecture-specific user-space code
to be crafted in order to port an algorithm to a given architecture.
In addition, it requires that the kernel architecture implementation
adds hooks into signal delivery and resume to user-space.

In order to facilitate integration of rseq into user-space, cpu_opv
can provide a (relatively slower) architecture-agnostic implementation
of rseq. This means that user-space code can be ported to all
architectures through use of cpu_opv initially, and have the fast-path
use rseq whenever the asm code is implemented.


7) Allow libraries with multi-part algorithms to work on same per-cpu
   data without affecting the allowed cpu mask

I stumbled on an interesting use-case within the lttng-ust tracer
per-cpu buffers: the algorithm needs to update a "reserve" counter,
serialize data into the buffer, and then update a "commit" counter
_on the same per-cpu buffer_. My goal is to use rseq for both reserve
and commit.

Clearly, if rseq reserve fails, the algorithm can retry on a different
per-cpu buffer. However, it's not that easy for the commit. It needs to
be performed on the same per-cpu buffer as the reserve.

The cpu_opv system call solves that problem by receiving the cpu number
on which the operation needs to be performed as argument. It can push
the task to the right CPU if needed, and perform the operations there
with preemption disabled.

Changing the allowed cpu mask for the current thread is not an acceptable
alternative for a tracing library, because the application being traced
does not expect that mask to be changed by libraries.


So, TLDR: cpu_opv is needed for many use-cases other that single-stepping,
and facilitates adoption of rseq into pre-existing applications.


Thanks,

Mathieu

> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

  reply	other threads:[~2017-11-17 17:13 UTC|newest]

Thread overview: 80+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call Mathieu Desnoyers
2017-11-14 20:39   ` Ben Maurer
2017-11-14 20:52     ` Mathieu Desnoyers
2017-11-14 21:48       ` Ben Maurer
     [not found]   ` <CY4PR15MB168884529B3C0F8E6CC06257CF280@CY4PR15MB1688.namprd15.prod.outlook.com>
2017-11-14 20:49     ` Ben Maurer
2017-11-14 21:03       ` Mathieu Desnoyers
2017-11-16 16:18   ` Peter Zijlstra
2017-11-16 16:27     ` Mathieu Desnoyers
2017-11-16 16:32       ` Peter Zijlstra
2017-11-16 17:09         ` Mathieu Desnoyers
2017-11-16 18:43   ` Peter Zijlstra
2017-11-16 18:49     ` Mathieu Desnoyers
2017-11-16 19:06       ` Thomas Gleixner
2017-11-16 20:06         ` Mathieu Desnoyers
2017-11-16 19:14   ` Peter Zijlstra
2017-11-16 20:37     ` Mathieu Desnoyers
2017-11-16 20:46       ` Peter Zijlstra
2017-11-16 21:08   ` Thomas Gleixner
2017-11-19 17:24     ` Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 02/24] Restartable sequences: ARM 32 architecture support Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 03/24] Restartable sequences: wire up ARM 32 system call Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support Mathieu Desnoyers
2017-11-16 21:14   ` Thomas Gleixner
2017-11-19 17:41     ` Mathieu Desnoyers
2017-11-20  8:38       ` Thomas Gleixner
2017-11-14 20:03 ` [RFC PATCH for 4.15 05/24] Restartable sequences: wire up x86 32/64 system call Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 06/24] Restartable sequences: powerpc architecture support Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 07/24] Restartable sequences: Wire up powerpc system call Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv " Mathieu Desnoyers
2017-11-15  1:34   ` Mathieu Desnoyers
2017-11-15  7:44   ` Michael Kerrisk (man-pages)
2017-11-15 14:30     ` Mathieu Desnoyers
2017-11-16 23:26   ` Thomas Gleixner
2017-11-17  0:14     ` Andi Kleen
2017-11-17 10:09       ` Thomas Gleixner
2017-11-17 17:14         ` Mathieu Desnoyers [this message]
2017-11-17 18:18           ` Andi Kleen
2017-11-17 18:59             ` Thomas Gleixner
2017-11-17 19:15               ` Andi Kleen
2017-11-17 20:07                 ` Thomas Gleixner
2017-11-18 21:09                   ` Andy Lutomirski
2017-11-17 20:22           ` Thomas Gleixner
2017-11-20 17:13             ` Mathieu Desnoyers
2017-11-20 16:13     ` Mathieu Desnoyers
2017-11-20 17:48       ` Thomas Gleixner
2017-11-20 18:03         ` Thomas Gleixner
2017-11-20 18:42           ` Mathieu Desnoyers
2017-11-20 18:39         ` Mathieu Desnoyers
2017-11-20 18:49           ` Andi Kleen
2017-11-20 22:46             ` Mathieu Desnoyers
2017-11-20 19:44           ` Thomas Gleixner
2017-11-21 11:25             ` Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 09/24] cpu_opv: Wire up x86 32/64 " Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 10/24] cpu_opv: Wire up powerpc " Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 11/24] cpu_opv: Wire up ARM32 " Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 12/24] cpu_opv: Implement selftests Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 13/24] Restartable sequences: Provide self-tests Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 14/24] Restartable sequences selftests: arm: workaround gcc asm size guess Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 15/24] membarrier: selftest: Test private expedited cmd Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v7 for 4.15 16/24] membarrier: powerpc: Skip memory barrier in switch_mm() Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v5 for 4.15 17/24] membarrier: Document scheduler barrier requirements Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 18/24] membarrier: provide SHARED_EXPEDITED command Mathieu Desnoyers
2017-11-15  1:36   ` Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 19/24] membarrier: selftest: Test shared expedited cmd Mathieu Desnoyers
2017-11-17 15:07   ` Shuah Khan
2017-11-14 20:04 ` [RFC PATCH for 4.15 20/24] membarrier: Provide core serializing command Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 21/24] x86: Introduce sync_core_before_usermode Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 22/24] membarrier: x86: Provide core serializing command Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd Mathieu Desnoyers
2017-11-17 15:09   ` Shuah Khan
2017-11-17 16:17     ` Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 24/24] membarrier: arm64: Provide core serializing command Mathieu Desnoyers
2017-11-14 21:08 ` [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Linus Torvalds
2017-11-14 21:15   ` Andy Lutomirski
2017-11-14 21:32     ` Paul Turner
2018-03-27 18:15       ` Mathieu Desnoyers
2017-11-14 21:32     ` Mathieu Desnoyers
2017-11-15  4:12       ` Andy Lutomirski
2017-11-15  6:34         ` Mathieu Desnoyers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1756446476.17265.1510938872121.JavaMail.zimbra@efficios.com \
    --to=mathieu.desnoyers@efficios.com \
    --cc=ahh@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=bmaurer@fb.com \
    --cc=boqun.feng@gmail.com \
    --cc=catalin.marinas@arm.com \
    --cc=cl@linux.com \
    --cc=davejwatson@fb.com \
    --cc=hpa@zytor.com \
    --cc=josh@joshtriplett.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@arm.linux.org.uk \
    --cc=luto@amacapital.net \
    --cc=mingo@redhat.com \
    --cc=mtk.manpages@gmail.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox