Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Dave Watson <davejwatson@fb.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	linux-api <linux-api@vger.kernel.org>,
	Paul Turner <pjt@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Russell King <linux@arm.linux.org.uk>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	Andrew Hunter <ahh@google.com>, Andi Kleen <andi@firstfloor.org>,
	Chris Lameter <cl@linux.com>, Ben Maurer <bmaurer@fb.com>,
	rostedt <rostedt@goodmis.org>,
	Josh Triplett <josh@joshtriplett.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will.deacon@arm.com>,
	Michael Kerrisk <mtk.manpages@gmail.com>
Subject: Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
Date: Tue, 21 Nov 2017 11:25:58 +0000 (UTC)	[thread overview]
Message-ID: <1830717892.19023.1511263558845.JavaMail.zimbra@efficios.com> (raw)
In-Reply-To: <alpine.DEB.2.20.1711202037280.2348@nanos>

----- On Nov 20, 2017, at 2:44 PM, Thomas Gleixner tglx@linutronix.de wrote:

> On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
>> ----- On Nov 20, 2017, at 12:48 PM, Thomas Gleixner tglx@linutronix.de wrote:
>> The use-case for 4k memcpy operation is a per-cpu ring buffer where
>> the rseq fast-path does the following:
>> 
>> - ring buffer push: in the rseq asm instruction sequence, a memcpy of a
>>   given structure (limited to 4k in size) into a ring buffer,
>>   followed by the final commit instruction which increments the current
>>   position offset by the number of bytes pushed.
>> 
>> - ring buffer pop: in the rseq asm instruction sequence, a memcpy of
>>   a given structure (up to 4k) from the ring buffer, at "position" offset.
>>   The final commit instruction decrements the current position offset by
>>   the number of bytes pop'd.
>> 
>> Having cpu_opv do a 4k memcpy allow it to handle scenarios where
>> rseq fails to progress.
> 
> I'm still confused. Before you talked about the sequence:
> 
>    1) Reserve
> 
>    2) Commit
> 
> and both use rseq, but obviously you cannot do two "atomic" operation in
> one section.
> 
> So now you talk about push. Is that what you described earlier as commit?
> 
> Assumed that it is, I still have no idea why the memcpy needs to happen in
> that preempt disabled region.
> 
> If you have space reserved, then the copy does not have any dependencies
> neither on the cpu it runs on nor on anything else. So the only
> interresting operation is the final commit instruction which tells the
> consumer that its ready.
> 
> So what is the part I am missing here aside of having difficulties to map
> the constantly changing names of this stuff?

Let's clear up some confusion: those are two different use-cases. The
ring buffer with reserve+commit is a FIFO ring buffer, and the ring buffer
with memcpy+position update is a LIFO queue.

Let me explain the various use-cases here, so we know what we're talking
about.

rseq and cpu_opv use-cases

1) per-cpu spinlock

A per-cpu spinlock can be implemented as a rseq consisting of a
comparison operation (== 0) on a word, and a word store (1), followed
by an acquire barrier after control dependency. The unlock path can be
performed with a simple store-release of 0 to the word, which does
not require rseq.

The cpu_opv fallback requires a single-word comparison (== 0) and a
single-word store (1).

2) per-cpu statistics counters

A per-cpu statistics counters can be implemented as a rseq consisting
of a final "add" instruction on a word as commit.

The cpu_opv fallback can be implemented as a "ADD" operation.

Besides statistics tracking, these counters can be used to implement
user-space RCU per-cpu grace period tracking for both single and
multi-process user-space RCU.

3) per-cpu LIFO linked-list (unlimited size stack)

A per-cpu LIFO linked-list has a "push" and "pop" operation,
which respectively adds an item to the list, and removes an
item from the list.

The "push" operation can be implemented as a rseq consisting of
a word comparison instruction against head followed by a word store
(commit) to head. Its cpu_opv fallback can be implemented as a
word-compare followed by word-store as well.

The "pop" operation can be implemented as a rseq consisting of
loading head, comparing it against NULL, loading the next pointer
at the right offset within the head item, and the next pointer as
a new head, returning the old head on success.

The cpu_opv fallback for "pop" differs from its rseq algorithm:
considering that cpu_opv requires to know all pointers at system
call entry so it can pin all pages, so cpu_opv cannot simply load
head and then load the head->next address within the preempt-off
critical section. User-space needs to pass the head and head->next
addresses to the kernel, and the kernel needs to check that the
head address is unchanged since it has been loaded by user-space.
However, when accessing head->next in a ABA situation, it's
possible that head is unchanged, but loading head->next can
result in a page fault due to a concurrently freed head object.
This is why the "expect_fault" operation field is introduced: if a
fault is triggered by this access, "-EAGAIN" will be returned by
cpu_opv rather than -EFAULT, thus indicating the the operation
vector should be attempted again. The "pop" operation can thus be
implemented as a word comparison of head against the head loaded
by user-space, followed by a load of the head->next pointer (which
may fault), and a store of that pointer as a new head.

4) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack)

This structure is useful for passing around allocated objects
by passing pointers through per-cpu fixed-sized stack.

The "push" side can be implemented with a check of the current
offset against the maximum buffer length, followed by a rseq
consisting of a comparison of the previously loaded offset
against the current offset, a word "try store" operation into the
next ring buffer array index (it's OK to abort after a try-store,
since it's not the commit, and its side-effect can be overwritten),
then followed by a word-store to increment the current offset (commit).

The "push" cpu_opv fallback can be done with the comparison, and
two consecutive word stores, all within the preempt-off section.

The "pop" side can be implemented with a check that offset is not
0 (whether the buffer is empty), a load of the "head" pointer before the
offset array index, followed by a rseq consisting of a word
comparison checking that the offset is unchanged since previously
loaded, another check ensuring that the "head" pointer is unchanged,
followed by a store decrementing the current offset.

The cpu_opv "pop" can be implemented with the same algorithm
as the rseq fast-path (compare, compare, store).

5) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack)
   supporting "peek" from remote CPU

In order to implement work queues with work-stealing between CPUs, it is
useful to ensure the offset "commit" in scenario 4) "push" have a
store-release semantic, thus allowing remote CPU to load the offset
with acquire semantic, and load the top pointer, in order to check if
work-stealing should be performed. The task (work queue item) existence
should be protected by other means, e.g. RCU.

If the peek operation notices that work-stealing should indeed be
performed, a thread can use cpu_opv to move the task between per-cpu
workqueues, by first invoking cpu_opv passing the remote work queue
cpu number as argument to pop the task, and then again as "push" with
the target work queue CPU number.

6) per-cpu LIFO ring buffer with data copy (fixed-sized stack)
   (with and without acquire-release)

This structure is useful for passing around data without requiring
memory allocation by copying the data content into per-cpu fixed-sized
stack.

The "push" operation is performed with an offset comparison against
the buffer size (figuring out if the buffer is full), followed by
a rseq consisting of a comparison of the offset, a try-memcpy attempting
to copy the data content into the buffer (which can be aborted and
overwritten), and a final store incrementing the offset.

The cpu_opv fallback needs to same operations, except that the memcpy
is guaranteed to complete, given that it is performed with preemption
disabled. This requires a memcpy operation supporting length up to 4kB.

The "pop" operation is similar to the "push, except that the offset
is first compared to 0 to ensure the buffer is not empty. The
copy source is the ring buffer, and the destination is an output
buffer.

7) per-cpu FIFO ring buffer (fixed-sized queue)

This structure is useful wherever a FIFO behavior (queue) is needed.
One major use-case is tracer ring buffer.

An implementation of this ring buffer has a "reserve", followed by
serialization of multiple bytes into the buffer, ended by a "commit".
The "reserve" can be implemented as a rseq consisting of a word
comparison followed by a word store. The reserve operation moves the
producer "head". The multi-byte serialization can be performed
non-atomically. Finally, the "commit" update can be performed with
a rseq "add" commit instruction with store-release semantic. The
ring buffer consumer reads the commit value with load-acquire
semantic to know whenever it is safe to read from the ring buffer.

This use-case requires that both "reserve" and "commit" operations
be performed on the same per-cpu ring buffer, even if a migration
happens between those operations. In the typical case, both operations
will happens on the same CPU and use rseq. In the unlikely event of a
migration, the cpu_opv system call will ensure the commit can be
performed on the right CPU by migrating the task to that CPU.

On the consumer side, an alternative to using store-release and
load-acquire on the commit counter would be to use cpu_opv to
ensure the commit counter load is performed on the right CPU. This
effectively allows moving a consumer thread between CPUs to execute
close to the ring buffer cache lines it will read.

Thanks,

Mathieu

> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

next prev parent reply	other threads:[~2017-11-21 11:24 UTC|newest]

Thread overview: 80+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call Mathieu Desnoyers
2017-11-14 20:39   ` Ben Maurer
2017-11-14 20:52     ` Mathieu Desnoyers
2017-11-14 21:48       ` Ben Maurer
     [not found]   ` <CY4PR15MB168884529B3C0F8E6CC06257CF280@CY4PR15MB1688.namprd15.prod.outlook.com>
2017-11-14 20:49     ` Ben Maurer
2017-11-14 21:03       ` Mathieu Desnoyers
2017-11-16 16:18   ` Peter Zijlstra
2017-11-16 16:27     ` Mathieu Desnoyers
2017-11-16 16:32       ` Peter Zijlstra
2017-11-16 17:09         ` Mathieu Desnoyers
2017-11-16 18:43   ` Peter Zijlstra
2017-11-16 18:49     ` Mathieu Desnoyers
2017-11-16 19:06       ` Thomas Gleixner
2017-11-16 20:06         ` Mathieu Desnoyers
2017-11-16 19:14   ` Peter Zijlstra
2017-11-16 20:37     ` Mathieu Desnoyers
2017-11-16 20:46       ` Peter Zijlstra
2017-11-16 21:08   ` Thomas Gleixner
2017-11-19 17:24     ` Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 02/24] Restartable sequences: ARM 32 architecture support Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 03/24] Restartable sequences: wire up ARM 32 system call Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support Mathieu Desnoyers
2017-11-16 21:14   ` Thomas Gleixner
2017-11-19 17:41     ` Mathieu Desnoyers
2017-11-20  8:38       ` Thomas Gleixner
2017-11-14 20:03 ` [RFC PATCH for 4.15 05/24] Restartable sequences: wire up x86 32/64 system call Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 06/24] Restartable sequences: powerpc architecture support Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 07/24] Restartable sequences: Wire up powerpc system call Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv " Mathieu Desnoyers
2017-11-15  1:34   ` Mathieu Desnoyers
2017-11-15  7:44   ` Michael Kerrisk (man-pages)
2017-11-15 14:30     ` Mathieu Desnoyers
2017-11-16 23:26   ` Thomas Gleixner
2017-11-17  0:14     ` Andi Kleen
2017-11-17 10:09       ` Thomas Gleixner
2017-11-17 17:14         ` Mathieu Desnoyers
2017-11-17 18:18           ` Andi Kleen
2017-11-17 18:59             ` Thomas Gleixner
2017-11-17 19:15               ` Andi Kleen
2017-11-17 20:07                 ` Thomas Gleixner
2017-11-18 21:09                   ` Andy Lutomirski
2017-11-17 20:22           ` Thomas Gleixner
2017-11-20 17:13             ` Mathieu Desnoyers
2017-11-20 16:13     ` Mathieu Desnoyers
2017-11-20 17:48       ` Thomas Gleixner
2017-11-20 18:03         ` Thomas Gleixner
2017-11-20 18:42           ` Mathieu Desnoyers
2017-11-20 18:39         ` Mathieu Desnoyers
2017-11-20 18:49           ` Andi Kleen
2017-11-20 22:46             ` Mathieu Desnoyers
2017-11-20 19:44           ` Thomas Gleixner
2017-11-21 11:25             ` Mathieu Desnoyers [this message]
2017-11-14 20:03 ` [RFC PATCH for 4.15 09/24] cpu_opv: Wire up x86 32/64 " Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 10/24] cpu_opv: Wire up powerpc " Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 11/24] cpu_opv: Wire up ARM32 " Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 12/24] cpu_opv: Implement selftests Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 13/24] Restartable sequences: Provide self-tests Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 14/24] Restartable sequences selftests: arm: workaround gcc asm size guess Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 15/24] membarrier: selftest: Test private expedited cmd Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v7 for 4.15 16/24] membarrier: powerpc: Skip memory barrier in switch_mm() Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v5 for 4.15 17/24] membarrier: Document scheduler barrier requirements Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 18/24] membarrier: provide SHARED_EXPEDITED command Mathieu Desnoyers
2017-11-15  1:36   ` Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 19/24] membarrier: selftest: Test shared expedited cmd Mathieu Desnoyers
2017-11-17 15:07   ` Shuah Khan
2017-11-14 20:04 ` [RFC PATCH for 4.15 20/24] membarrier: Provide core serializing command Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 21/24] x86: Introduce sync_core_before_usermode Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 22/24] membarrier: x86: Provide core serializing command Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd Mathieu Desnoyers
2017-11-17 15:09   ` Shuah Khan
2017-11-17 16:17     ` Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 24/24] membarrier: arm64: Provide core serializing command Mathieu Desnoyers
2017-11-14 21:08 ` [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Linus Torvalds
2017-11-14 21:15   ` Andy Lutomirski
2017-11-14 21:32     ` Paul Turner
2018-03-27 18:15       ` Mathieu Desnoyers
2017-11-14 21:32     ` Mathieu Desnoyers
2017-11-15  4:12       ` Andy Lutomirski
2017-11-15  6:34         ` Mathieu Desnoyers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1830717892.19023.1511263558845.JavaMail.zimbra@efficios.com \
    --to=mathieu.desnoyers@efficios.com \
    --cc=ahh@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=bmaurer@fb.com \
    --cc=boqun.feng@gmail.com \
    --cc=catalin.marinas@arm.com \
    --cc=cl@linux.com \
    --cc=davejwatson@fb.com \
    --cc=hpa@zytor.com \
    --cc=josh@joshtriplett.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@arm.linux.org.uk \
    --cc=luto@amacapital.net \
    --cc=mingo@redhat.com \
    --cc=mtk.manpages@gmail.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox