From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
Boqun Feng <boqun.feng@gmail.com>,
Andy Lutomirski <luto@amacapital.net>,
Dave Watson <davejwatson@fb.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
linux-api <linux-api@vger.kernel.org>,
Paul Turner <pjt@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Russell King <linux@arm.linux.org.uk>,
Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
Andrew Hunter <ahh@google.com>, Andi Kleen <andi@firstfloor.org>,
Chris Lameter <cl@linux.com>, Ben Maurer <bmaurer@fb.com>,
rostedt <rostedt@goodmis.org>,
Josh Triplett <josh@joshtriplett.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will.deacon@arm.com>,
Michael Kerrisk <mtk.manpages@gmail.com>
Subject: Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call
Date: Tue, 21 Nov 2017 11:25:58 +0000 (UTC) [thread overview]
Message-ID: <1830717892.19023.1511263558845.JavaMail.zimbra@efficios.com> (raw)
In-Reply-To: <alpine.DEB.2.20.1711202037280.2348@nanos>
----- On Nov 20, 2017, at 2:44 PM, Thomas Gleixner tglx@linutronix.de wrote:
> On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
>> ----- On Nov 20, 2017, at 12:48 PM, Thomas Gleixner tglx@linutronix.de wrote:
>> The use-case for 4k memcpy operation is a per-cpu ring buffer where
>> the rseq fast-path does the following:
>>
>> - ring buffer push: in the rseq asm instruction sequence, a memcpy of a
>> given structure (limited to 4k in size) into a ring buffer,
>> followed by the final commit instruction which increments the current
>> position offset by the number of bytes pushed.
>>
>> - ring buffer pop: in the rseq asm instruction sequence, a memcpy of
>> a given structure (up to 4k) from the ring buffer, at "position" offset.
>> The final commit instruction decrements the current position offset by
>> the number of bytes pop'd.
>>
>> Having cpu_opv do a 4k memcpy allow it to handle scenarios where
>> rseq fails to progress.
>
> I'm still confused. Before you talked about the sequence:
>
> 1) Reserve
>
> 2) Commit
>
> and both use rseq, but obviously you cannot do two "atomic" operation in
> one section.
>
> So now you talk about push. Is that what you described earlier as commit?
>
> Assumed that it is, I still have no idea why the memcpy needs to happen in
> that preempt disabled region.
>
> If you have space reserved, then the copy does not have any dependencies
> neither on the cpu it runs on nor on anything else. So the only
> interresting operation is the final commit instruction which tells the
> consumer that its ready.
>
> So what is the part I am missing here aside of having difficulties to map
> the constantly changing names of this stuff?
Let's clear up some confusion: those are two different use-cases. The
ring buffer with reserve+commit is a FIFO ring buffer, and the ring buffer
with memcpy+position update is a LIFO queue.
Let me explain the various use-cases here, so we know what we're talking
about.
rseq and cpu_opv use-cases
1) per-cpu spinlock
A per-cpu spinlock can be implemented as a rseq consisting of a
comparison operation (== 0) on a word, and a word store (1), followed
by an acquire barrier after control dependency. The unlock path can be
performed with a simple store-release of 0 to the word, which does
not require rseq.
The cpu_opv fallback requires a single-word comparison (== 0) and a
single-word store (1).
2) per-cpu statistics counters
A per-cpu statistics counters can be implemented as a rseq consisting
of a final "add" instruction on a word as commit.
The cpu_opv fallback can be implemented as a "ADD" operation.
Besides statistics tracking, these counters can be used to implement
user-space RCU per-cpu grace period tracking for both single and
multi-process user-space RCU.
3) per-cpu LIFO linked-list (unlimited size stack)
A per-cpu LIFO linked-list has a "push" and "pop" operation,
which respectively adds an item to the list, and removes an
item from the list.
The "push" operation can be implemented as a rseq consisting of
a word comparison instruction against head followed by a word store
(commit) to head. Its cpu_opv fallback can be implemented as a
word-compare followed by word-store as well.
The "pop" operation can be implemented as a rseq consisting of
loading head, comparing it against NULL, loading the next pointer
at the right offset within the head item, and the next pointer as
a new head, returning the old head on success.
The cpu_opv fallback for "pop" differs from its rseq algorithm:
considering that cpu_opv requires to know all pointers at system
call entry so it can pin all pages, so cpu_opv cannot simply load
head and then load the head->next address within the preempt-off
critical section. User-space needs to pass the head and head->next
addresses to the kernel, and the kernel needs to check that the
head address is unchanged since it has been loaded by user-space.
However, when accessing head->next in a ABA situation, it's
possible that head is unchanged, but loading head->next can
result in a page fault due to a concurrently freed head object.
This is why the "expect_fault" operation field is introduced: if a
fault is triggered by this access, "-EAGAIN" will be returned by
cpu_opv rather than -EFAULT, thus indicating the the operation
vector should be attempted again. The "pop" operation can thus be
implemented as a word comparison of head against the head loaded
by user-space, followed by a load of the head->next pointer (which
may fault), and a store of that pointer as a new head.
4) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack)
This structure is useful for passing around allocated objects
by passing pointers through per-cpu fixed-sized stack.
The "push" side can be implemented with a check of the current
offset against the maximum buffer length, followed by a rseq
consisting of a comparison of the previously loaded offset
against the current offset, a word "try store" operation into the
next ring buffer array index (it's OK to abort after a try-store,
since it's not the commit, and its side-effect can be overwritten),
then followed by a word-store to increment the current offset (commit).
The "push" cpu_opv fallback can be done with the comparison, and
two consecutive word stores, all within the preempt-off section.
The "pop" side can be implemented with a check that offset is not
0 (whether the buffer is empty), a load of the "head" pointer before the
offset array index, followed by a rseq consisting of a word
comparison checking that the offset is unchanged since previously
loaded, another check ensuring that the "head" pointer is unchanged,
followed by a store decrementing the current offset.
The cpu_opv "pop" can be implemented with the same algorithm
as the rseq fast-path (compare, compare, store).
5) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack)
supporting "peek" from remote CPU
In order to implement work queues with work-stealing between CPUs, it is
useful to ensure the offset "commit" in scenario 4) "push" have a
store-release semantic, thus allowing remote CPU to load the offset
with acquire semantic, and load the top pointer, in order to check if
work-stealing should be performed. The task (work queue item) existence
should be protected by other means, e.g. RCU.
If the peek operation notices that work-stealing should indeed be
performed, a thread can use cpu_opv to move the task between per-cpu
workqueues, by first invoking cpu_opv passing the remote work queue
cpu number as argument to pop the task, and then again as "push" with
the target work queue CPU number.
6) per-cpu LIFO ring buffer with data copy (fixed-sized stack)
(with and without acquire-release)
This structure is useful for passing around data without requiring
memory allocation by copying the data content into per-cpu fixed-sized
stack.
The "push" operation is performed with an offset comparison against
the buffer size (figuring out if the buffer is full), followed by
a rseq consisting of a comparison of the offset, a try-memcpy attempting
to copy the data content into the buffer (which can be aborted and
overwritten), and a final store incrementing the offset.
The cpu_opv fallback needs to same operations, except that the memcpy
is guaranteed to complete, given that it is performed with preemption
disabled. This requires a memcpy operation supporting length up to 4kB.
The "pop" operation is similar to the "push, except that the offset
is first compared to 0 to ensure the buffer is not empty. The
copy source is the ring buffer, and the destination is an output
buffer.
7) per-cpu FIFO ring buffer (fixed-sized queue)
This structure is useful wherever a FIFO behavior (queue) is needed.
One major use-case is tracer ring buffer.
An implementation of this ring buffer has a "reserve", followed by
serialization of multiple bytes into the buffer, ended by a "commit".
The "reserve" can be implemented as a rseq consisting of a word
comparison followed by a word store. The reserve operation moves the
producer "head". The multi-byte serialization can be performed
non-atomically. Finally, the "commit" update can be performed with
a rseq "add" commit instruction with store-release semantic. The
ring buffer consumer reads the commit value with load-acquire
semantic to know whenever it is safe to read from the ring buffer.
This use-case requires that both "reserve" and "commit" operations
be performed on the same per-cpu ring buffer, even if a migration
happens between those operations. In the typical case, both operations
will happens on the same CPU and use rseq. In the unlikely event of a
migration, the cpu_opv system call will ensure the commit can be
performed on the right CPU by migrating the task to that CPU.
On the consumer side, an alternative to using store-release and
load-acquire on the commit counter would be to use cpu_opv to
ensure the commit counter load is performed on the right CPU. This
effectively allows moving a consumer thread between CPUs to execute
close to the ring buffer cache lines it will read.
Thanks,
Mathieu
>
> Thanks,
>
> tglx
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
next prev parent reply other threads:[~2017-11-21 11:24 UTC|newest]
Thread overview: 80+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-11-14 20:03 [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call Mathieu Desnoyers
2017-11-14 20:39 ` Ben Maurer
2017-11-14 20:52 ` Mathieu Desnoyers
2017-11-14 21:48 ` Ben Maurer
[not found] ` <CY4PR15MB168884529B3C0F8E6CC06257CF280@CY4PR15MB1688.namprd15.prod.outlook.com>
2017-11-14 20:49 ` Ben Maurer
2017-11-14 21:03 ` Mathieu Desnoyers
2017-11-16 16:18 ` Peter Zijlstra
2017-11-16 16:27 ` Mathieu Desnoyers
2017-11-16 16:32 ` Peter Zijlstra
2017-11-16 17:09 ` Mathieu Desnoyers
2017-11-16 18:43 ` Peter Zijlstra
2017-11-16 18:49 ` Mathieu Desnoyers
2017-11-16 19:06 ` Thomas Gleixner
2017-11-16 20:06 ` Mathieu Desnoyers
2017-11-16 19:14 ` Peter Zijlstra
2017-11-16 20:37 ` Mathieu Desnoyers
2017-11-16 20:46 ` Peter Zijlstra
2017-11-16 21:08 ` Thomas Gleixner
2017-11-19 17:24 ` Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 02/24] Restartable sequences: ARM 32 architecture support Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 03/24] Restartable sequences: wire up ARM 32 system call Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support Mathieu Desnoyers
2017-11-16 21:14 ` Thomas Gleixner
2017-11-19 17:41 ` Mathieu Desnoyers
2017-11-20 8:38 ` Thomas Gleixner
2017-11-14 20:03 ` [RFC PATCH for 4.15 05/24] Restartable sequences: wire up x86 32/64 system call Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 06/24] Restartable sequences: powerpc architecture support Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH for 4.15 07/24] Restartable sequences: Wire up powerpc system call Mathieu Desnoyers
2017-11-14 20:03 ` [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv " Mathieu Desnoyers
2017-11-15 1:34 ` Mathieu Desnoyers
2017-11-15 7:44 ` Michael Kerrisk (man-pages)
2017-11-15 14:30 ` Mathieu Desnoyers
2017-11-16 23:26 ` Thomas Gleixner
2017-11-17 0:14 ` Andi Kleen
2017-11-17 10:09 ` Thomas Gleixner
2017-11-17 17:14 ` Mathieu Desnoyers
2017-11-17 18:18 ` Andi Kleen
2017-11-17 18:59 ` Thomas Gleixner
2017-11-17 19:15 ` Andi Kleen
2017-11-17 20:07 ` Thomas Gleixner
2017-11-18 21:09 ` Andy Lutomirski
2017-11-17 20:22 ` Thomas Gleixner
2017-11-20 17:13 ` Mathieu Desnoyers
2017-11-20 16:13 ` Mathieu Desnoyers
2017-11-20 17:48 ` Thomas Gleixner
2017-11-20 18:03 ` Thomas Gleixner
2017-11-20 18:42 ` Mathieu Desnoyers
2017-11-20 18:39 ` Mathieu Desnoyers
2017-11-20 18:49 ` Andi Kleen
2017-11-20 22:46 ` Mathieu Desnoyers
2017-11-20 19:44 ` Thomas Gleixner
2017-11-21 11:25 ` Mathieu Desnoyers [this message]
2017-11-14 20:03 ` [RFC PATCH for 4.15 09/24] cpu_opv: Wire up x86 32/64 " Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 10/24] cpu_opv: Wire up powerpc " Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 11/24] cpu_opv: Wire up ARM32 " Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 12/24] cpu_opv: Implement selftests Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 13/24] Restartable sequences: Provide self-tests Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 14/24] Restartable sequences selftests: arm: workaround gcc asm size guess Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 15/24] membarrier: selftest: Test private expedited cmd Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v7 for 4.15 16/24] membarrier: powerpc: Skip memory barrier in switch_mm() Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v5 for 4.15 17/24] membarrier: Document scheduler barrier requirements Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 18/24] membarrier: provide SHARED_EXPEDITED command Mathieu Desnoyers
2017-11-15 1:36 ` Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 19/24] membarrier: selftest: Test shared expedited cmd Mathieu Desnoyers
2017-11-17 15:07 ` Shuah Khan
2017-11-14 20:04 ` [RFC PATCH for 4.15 20/24] membarrier: Provide core serializing command Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 21/24] x86: Introduce sync_core_before_usermode Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH v2 for 4.15 22/24] membarrier: x86: Provide core serializing command Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd Mathieu Desnoyers
2017-11-17 15:09 ` Shuah Khan
2017-11-17 16:17 ` Mathieu Desnoyers
2017-11-14 20:04 ` [RFC PATCH for 4.15 24/24] membarrier: arm64: Provide core serializing command Mathieu Desnoyers
2017-11-14 21:08 ` [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 Linus Torvalds
2017-11-14 21:15 ` Andy Lutomirski
2017-11-14 21:32 ` Paul Turner
2018-03-27 18:15 ` Mathieu Desnoyers
2017-11-14 21:32 ` Mathieu Desnoyers
2017-11-15 4:12 ` Andy Lutomirski
2017-11-15 6:34 ` Mathieu Desnoyers
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1830717892.19023.1511263558845.JavaMail.zimbra@efficios.com \
--to=mathieu.desnoyers@efficios.com \
--cc=ahh@google.com \
--cc=akpm@linux-foundation.org \
--cc=andi@firstfloor.org \
--cc=bmaurer@fb.com \
--cc=boqun.feng@gmail.com \
--cc=catalin.marinas@arm.com \
--cc=cl@linux.com \
--cc=davejwatson@fb.com \
--cc=hpa@zytor.com \
--cc=josh@joshtriplett.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux@arm.linux.org.uk \
--cc=luto@amacapital.net \
--cc=mingo@redhat.com \
--cc=mtk.manpages@gmail.com \
--cc=paulmck@linux.vnet.ibm.com \
--cc=peterz@infradead.org \
--cc=pjt@google.com \
--cc=rostedt@goodmis.org \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
--cc=will.deacon@arm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox