[RFC PATCH v9 for 4.15 01/14] Restartable sequences system call

linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found] <20171012230326.19984-1-mathieu.desnoyers@efficios.com>
@ 2017-10-12 23:03 ` Mathieu Desnoyers
  2017-10-13  0:36   ` Linus Torvalds
                     ` (2 more replies)
  2017-10-12 23:03 ` [RFC PATCH for 4.15 02/14] tracing: instrument restartable sequences Mathieu Desnoyers
                   ` (9 subsequent siblings)
  10 siblings, 3 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-12 23:03 UTC (permalink / raw)
  To: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon
  Cc: linux-kernel, Mathieu Desnoyers, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Ben Maurer,
	Steven Rostedt, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Michael Kerrisk, Alexander Viro, linux-api

Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.

* Restartable sequences (per-cpu atomics)

Restartables sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.

The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path. A locking-based
fall-back, purely implemented in user-space, is proposed here to deal
with debugger single-stepping. This fallback interacts with rseq_start()
and rseq_finish(), which force retries in response to concurrent
lock-based activity.

Here are benchmarks of counter increment in various scenarios compared
to restartable sequences. Those benchmarks were taken on v8 of the
patchset.

ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck

                      Counter increment speed (ns/increment)
                             1 thread    2 threads
global increment (baseline)      6           N/A
percpu rseq increment           50            52
percpu rseq spinlock            94            94
global atomic increment         48            74 (__sync_add_and_fetch_4)
global atomic CAS               50           172 (__sync_val_compare_and_swap_4)
global pthread mutex           148           862

ARMv7 Processor rev 10 (v7l)
Machine model: Wandboard

                      Counter increment speed (ns/increment)
                             1 thread    4 threads
global increment (baseline)      7           N/A
percpu rseq increment           50            50
percpu rseq spinlock            82            84
global atomic increment         44           262 (__sync_add_and_fetch_4)
global atomic CAS               46           316 (__sync_val_compare_and_swap_4)
global pthread mutex           146          1400

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:

                      Counter increment speed (ns/increment)
                              1 thread           8 threads
global increment (baseline)      3.0                N/A
percpu rseq increment            3.6                3.8
percpu rseq spinlock             5.6                6.2
global LOCK; inc                 8.0              166.4
global LOCK; cmpxchg            13.4              435.2
global pthread mutex            25.2             1363.6

* Reading the current CPU number

Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up do date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.

Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:

- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
  executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
  assembly, which makes it a useful building block for restartable
  sequences.
- The approach of reading the cpu id through memory mapping shared
  between kernel and user-space is portable (e.g. ARM), which is not the
  case for the lsl-based x86 vdso.

On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.

Benchmarking various approaches for reading the current CPU number:

ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop):                                    8.4 ns
- Read CPU from rseq cpu_id:                               16.7 ns
- Read CPU from rseq cpu_id (lazy register):               19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
- getcpu system call:                                     234.9 ns

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop):                                    0.8 ns
- Read CPU from rseq cpu_id:                                0.8 ns
- Read CPU from rseq cpu_id (lazy register):                0.8 ns
- Read using gs segment selector:                           0.8 ns
- "lsl" inline assembly:                                   13.0 ns
- glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
- getcpu system call:                                      53.9 ns

- Speed

Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:

Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.

* CONFIG_RSEQ=n

avg.:      41.37 s
std.dev.:   0.36 s

* CONFIG_RSEQ=y

avg.:      40.46 s
std.dev.:   0.33 s

- Size

On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
2855 bytes, and the data size increase of vmlinux is 1024 bytes.

* CONFIG_RSEQ=n

   text	   data	    bss	    dec	    hex	filename
9964559	4256280	 962560	15183399	 e7ae27	vmlinux.norseq

* CONFIG_RSEQ=y

   text	   data	    bss	    dec	    hex	filename
9967414	4257304	 962560	15187278	 e7bd4e	vmlinux.rseq

[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: linux-api@vger.kernel.org
---

Changes since v1:
- Return -1, errno=EINVAL if cpu_cache pointer is not aligned on
  sizeof(int32_t).
- Update man page to describe the pointer alignement requirements and
  update atomicity guarantees.
- Add MAINTAINERS file GETCPU_CACHE entry.
- Remove dynamic memory allocation: go back to having a single
  getcpu_cache entry per thread. Update documentation accordingly.
- Rebased on Linux 4.4.

Changes since v2:
- Introduce a "cmd" argument, along with an enum with GETCPU_CACHE_GET
  and GETCPU_CACHE_SET. Introduce a uapi header linux/getcpu_cache.h
  defining this enumeration.
- Split resume notifier architecture implementation from the system call
  wire up in the following arch-specific patches.
- Man pages updates.
- Handle 32-bit compat pointers.
- Simplify handling of getcpu_cache GETCPU_CACHE_SET compiler barrier:
  set the current cpu cache pointer before doing the cache update, and
  set it back to NULL if the update fails. Setting it back to NULL on
  error ensures that no resume notifier will trigger a SIGSEGV if a
  migration happened concurrently.

Changes since v3:
- Fix __user annotations in compat code,
- Update memory ordering comments.
- Rebased on kernel v4.5-rc5.

Changes since v4:
- Inline getcpu_cache_fork, getcpu_cache_execve, and getcpu_cache_exit.
- Add new line between if() and switch() to improve readability.
- Added sched switch benchmarks (hackbench) and size overhead comparison
  to change log.

Changes since v5:
- Rename "getcpu_cache" to "thread_local_abi", allowing to extend
  this system call to cover future features such as restartable critical
  sections. Generalizing this system call ensures that we can add
  features similar to the cpu_id field within the same cache-line
  without having to track one pointer per feature within the task
  struct.
- Add a tlabi_nr parameter to the system call, thus allowing to extend
  the ABI beyond the initial 64-byte structure by registering structures
  with tlabi_nr greater than 0. The initial ABI structure is associated
  with tlabi_nr 0.
- Rebased on kernel v4.5.

Changes since v6:
- Integrate "restartable sequences" v2 patchset from Paul Turner.
- Add handling of single-stepping purely in user-space, with a
  fallback to locking after 2 rseq failures to ensure progress, and
  by exposing a __rseq_table section to debuggers so they know where
  to put breakpoints when dealing with rseq assembly blocks which
  can be aborted at any point.
- make the code and ABI generic: porting the kernel implementation
  simply requires to wire up the signal handler and return to user-space
  hooks, and allocate the syscall number.
- extend testing with a fully configurable test program. See
  param_spinlock_test -h for details.
- handling of rseq ENOSYS in user-space, also with a fallback
  to locking.
- modify Paul Turner's rseq ABI to only require a single TLS store on
  the user-space fast-path, removing the need to populate two additional
  registers. This is made possible by introducing struct rseq_cs into
  the ABI to describe a critical section start_ip, post_commit_ip, and
  abort_ip.
- Rebased on kernel v4.7-rc7.

Changes since v7:
- Documentation updates.
- Integrated powerpc architecture support.
- Compare rseq critical section start_ip, allows shriking the user-space
  fast-path code size.
- Added Peter Zijlstra, Paul E. McKenney and Boqun Feng as
  co-maintainers.
- Added do_rseq2 and do_rseq_memcpy to test program helper library.
- Code cleanup based on review from Peter Zijlstra, Andy Lutomirski and
  Boqun Feng.
- Rebase on kernel v4.8-rc2.

Changes since v8:
- clear rseq_cs even if non-nested. Speeds up user-space fast path by
  removing the final "rseq_cs=NULL" assignment.
- add enum rseq_flags: critical sections and threads can set migration,
  preemption and signal "disable" flags to inhibit rseq behavior.
- rseq_event_counter needs to be updated with a pre-increment: Otherwise
  misses an increment after exec (when TLS and in-kernel states are
  initially 0).

Man page associated:

RSEQ(2)                 Linux Programmer's Manual                 RSEQ(2)

NAME
       rseq - Restartable sequences and cpu number cache

SYNOPSIS
       #include <linux/rseq.h>

       int rseq(struct rseq * rseq, int flags);

DESCRIPTION
       The  rseq()  ABI accelerates user-space operations on per-cpu data
       by defining a shared data structure ABI  between  each  user-space
       thread and the kernel.

       It  allows user-space to perform update operations on per-cpu data
       without requiring heavy-weight atomic operations.

       Restartable sequences are atomic with respect to preemption  (mak‐
       ing  it  atomic  with respect to other threads running on the same
       CPU), as well as signal delivery  (user-space  execution  contexts
       nested over the same thread).

       It is suited for update operations on per-cpu data.

       It  can be used on data structures shared between threads within a
       process, and on data structures shared between threads across dif‐
       ferent processes.

       Some examples of operations that can be accelerated by this ABI:

       · Querying the current CPU number,

       · Incrementing per-CPU counters,

       · Modifying data protected by per-CPU spinlocks,

       · Inserting/removing elements in per-CPU linked-lists,

       · Writing/reading per-CPU ring buffers content.

       The  rseq argument is a pointer to the thread-local rseq structure
       to be shared between kernel and  user-space.  A  NULL  rseq  value
       unregisters the current thread rseq structure.

       The layout of struct rseq is as follows:

       Structure alignment
              This structure is aligned on multiples of 128 bytes.

       Structure size
              This structure has a fixed size of 128 bytes.

       Fields

           cpu_id
              Cache of the CPU number on which the current thread is run‐
              ning.

           event_counter
              Counter guaranteed  to  be  incremented  when  the  current
              thread  is  preempted  or when a signal is delivered to the
              current thread.

           rseq_cs
              The rseq_cs field is a pointer to a struct rseq_cs.  Is  is
              NULL when no rseq assembly block critical section is active
              for the current thread.  Setting it to point to a  critical
              section  descriptor (struct rseq_cs) marks the beginning of
              the critical section. It is cleared after the  end  of  the
              critical section.

       The layout of struct rseq_cs is as follows:

       Structure alignment
              This structure is aligned on multiples of 256 bytes.

       Structure size
              This structure has a fixed size of 256 bytes.

       Fields

           start_ip
              Instruction pointer address of the first instruction of the
              sequence of consecutive assembly instructions.

           post_commit_ip
              Instruction pointer address after the last  instruction  of
              the sequence of consecutive assembly instructions.

           abort_ip
              Instruction  pointer  address  where  to move the execution
              flow in case of abort of the sequence of consecutive assem‐
              bly instructions.

       Upon registration, the flags argument is currently unused and must
       be specified as 0. Upon unregistration, the flags argument can  be
       either  specified  as  0,  or as RSEQ_FORCE_UNREGISTER, which will
       force unregistration of  the  current  rseq  address  rather  than
       requiring each registration to be matched by an unregistration.

       Libraries  and  applications  should  keep the rseq structure in a
       thread-local storage variable.  Since only one rseq address can be
       registered  per  thread,  applications and libraries should define
       their struct rseq as a volatile thread-local storage variable with
       the weak symbol __rseq_abi.  This allows using rseq from an appli‐
       cation executable and from multiple shared libraries linked to the
       same executable. The cpu_id field should be initialized to -1.

       Each  thread  is responsible for registering and unregistering its
       rseq structure. No more than one rseq  structure  address  can  be
       registered  per  thread  at  a given time. The same address can be
       registered more than once for  a  thread,  and  each  registration
       needs  to  have  a  matching  unregistration before the address is
       effectively unregistered. After the rseq  address  is  effectively
       unregistered for a thread, a new address can be registered. Unreg‐
       istration of associated rseq  structure  is  implicitly  performed
       when a thread or process exits.

       In  a  typical  usage  scenario,  the  thread registering the rseq
       structure will be performing loads and stores from/to that  struc‐
       ture. It is however also allowed to read that structure from other
       threads.  The rseq field updates performed by the  kernel  provide
       relaxed  atomicity  semantics,  which guarantee that other threads
       performing relaxed atomic reads  of  the  cpu  number  cache  will
       always observe a consistent value.

RETURN VALUE
       A  return  value of 0 indicates success. On error, -1 is returned,
       and errno is set appropriately.

ERRORS
       EINVAL Either flags contains an invalid value, or rseq contains an
              address which is not appropriately aligned.

       ENOSYS The rseq() system call is not implemented by this kernel.

       EFAULT rseq is an invalid address.

       EBUSY  The rseq argument contains a non-NULL address which differs
              from  the  memory  location  already  registered  for  this
              thread.

       EOVERFLOW
              Registering  the  rseq  address  is  not allowed because it
              would cause a reference counter overflow.

       ENOENT The rseq argument is NULL, but no memory location  is  cur‐
              rently registered for this thread.

VERSIONS
       The rseq() system call was added in Linux 4.X (TODO).

CONFORMING TO
       rseq() is Linux-specific.

ALGORITHM
       The restartable sequences mechanism is the overlap of two distinct
       restart mechanisms: a sequence  counter  tracking  preemption  and
       signal  delivery for high-level code, and an ip-fixup-based mecha‐
       nism for the final assembly instruction sequence.

       A high-level summary of the algorithm to use rseq from  user-space
       is as follows:

       The  high-level  code between rseq_start() and rseq_finish() loads
       the current value of the sequence  counter  in  rseq_start(),  and
       then  it  gets  compared  with  the  new  current value within the
       rseq_finish()   restartable    instruction    sequence.    Between
       rseq_start()  and  rseq_finish(),  the high-level code can perform
       operations that do not have side-effects, such as getting the cur‐
       rent CPU number, and loading from variables.

       Stores  are  performed at the very end of the restartable sequence
       assembly block. Each  assembly  block  defines  a  struct  rseq_cs
       structure   which   describes   the  start_ip  and  post_commit_ip
       addresses, as well as the abort_ip address where the kernel should
       move  the  thread  instruction  pointer if a rseq critical section
       assembly block is preempted or if a signal is delivered on top  of
       a rseq critical section assembly block.

       Detailed algorithm of rseq use:

       rseq_start()

           0. Userspace  loads  the  current event counter value from the
              event_counter field of the registered struct rseq TLS area,

       rseq_finish()

              Steps [1]-[3] (inclusive) need to be a sequence of instruc‐
              tions  in  userspace  that  can  handle  being moved to the
              abort_ip between any of those instructions.

              The abort_ip address needs to be  less  than  start_ip,  or
              greater-or-equal  the  post_commit_ip.   Step  [4]  and the
              failure code step [F1] need to be at addresses lesser  than
              start_ip, or greater-or-equal the post_commit_ip.

           [ start_ip ]

           1. Userspace stores the address of the struct rseq_cs assembly
              block descriptor into the rseq_cs field of  the  registered
              struct rseq TLS area.

           2. Userspace  tests  to  see whether the current event_counter
              value match the value loaded at [0].  Manually  jumping  to
              [F1] in case of a mismatch.

              Note  that  if  we are preempted or interrupted by a signal
              after [1] and before post_commit_ip, then the  kernel  also
              performs the comparison performed in [2], and conditionally
              clears the rseq_cs field of struct rseq, then jumps  us  to
              abort_ip.

           3. Userspace   critical   section   final  instruction  before
              post_commit_ip is the commit. The critical section is self-
              terminating.

           [ post_commit_ip ]

           4. Userspace  clears  the rseq_cs field of the struct rseq TLS
              area.

           5. Return true.

           On failure at [2]:

           F1.
              Userspace clears the rseq_cs field of the struct  rseq  TLS
              area. Followed by step [F2].

           [ abort_ip ]

           F2.
              Return false.

EXAMPLE
       The following code uses the rseq() system call to keep a thread-local
       storage variable up to date with the current CPU number, with a fall‐
       back on sched_getcpu(3) if the cache is not  available.  For  example
       simplicity,  it  is  done in main(), but multithreaded programs would
       need to invoke rseq() from each program thread.

           #define _GNU_SOURCE
           #include <stdlib.h>
           #include <stdio.h>
           #include <unistd.h>
           #include <stdint.h>
           #include <sched.h>
           #include <stddef.h>
           #include <errno.h>
           #include <string.h>
           #include <stdbool.h>
           #include <sys/syscall.h>
           #include <linux/rseq.h>

           __attribute__((weak)) __thread volatile struct rseq __rseq_abi = {
               .u.e.cpu_id = -1,
           };

           static int
           sys_rseq(volatile struct rseq *rseq_abi, int flags)
           {
               return syscall(__NR_rseq, rseq_abi, flags);
           }

           static int32_t
           rseq_current_cpu_raw(void)
           {
               return __rseq_abi.u.e.cpu_id;
           }

           static int32_t
           rseq_current_cpu(void)
           {
               int32_t cpu;

               cpu = rseq_current_cpu_raw();
               if (cpu < 0)
                   cpu = sched_getcpu();
               return cpu;
           }

           static int
           rseq_register_current_thread(void)
           {
               int rc;

               rc = sys_rseq(&__rseq_abi, 0);
               if (rc) {
                   fprintf(stderr,
                       "Error: sys_rseq(...) register failed(%d): %s\n",
                       errno, strerror(errno));
                   return -1;
               }
               return 0;
           }

           static int
           rseq_unregister_current_thread(void)
           {
               int rc;

               rc = sys_rseq(NULL, 0);
               if (rc) {
                   fprintf(stderr,
                       "Error: sys_rseq(...) unregister failed(%d): %s\n",
                       errno, strerror(errno));
                   return -1;
               }
               return 0;
           }

           int
           main(int argc, char **argv)
           {
               bool rseq_registered = false;

               if (!rseq_register_current_thread()) {
                   rseq_registered = true;
               } else {
                   fprintf(stderr,
                       "Unable to register restartable sequences.\n");
                   fprintf(stderr, "Using sched_getcpu() as fallback.\n");
               }

               printf("Current CPU number: %d\n", rseq_current_cpu());

               if (rseq_registered && rseq_unregister_current_thread()) {
                   exit(EXIT_FAILURE);
               }
               exit(EXIT_SUCCESS);
           }

SEE ALSO
       sched_getcpu(3)

Linux                           2016-08-19                        RSEQ(2)
---
 MAINTAINERS               |  10 ++
 arch/Kconfig              |   7 +
 fs/exec.c                 |   1 +
 include/linux/sched.h     |  89 ++++++++++++
 include/uapi/linux/rseq.h | 131 +++++++++++++++++
 init/Kconfig              |  13 ++
 kernel/Makefile           |   1 +
 kernel/fork.c             |   2 +
 kernel/rseq.c             | 347 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c       |   4 +
 kernel/sys_ni.c           |   3 +
 11 files changed, 608 insertions(+)
 create mode 100644 include/uapi/linux/rseq.h
 create mode 100644 kernel/rseq.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1c3feffb1c1c..f05c526fe1e8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11224,6 +11224,16 @@ F:	include/dt-bindings/reset/
 F:	include/linux/reset.h
 F:	include/linux/reset-controller.h
 
+RESTARTABLE SEQUENCES SUPPORT
+M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+M:	Peter Zijlstra <peterz@infradead.org>
+M:	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
+M:	Boqun Feng <boqun.feng@gmail.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	kernel/rseq.c
+F:	include/uapi/linux/rseq.h
+
 RFKILL
 M:	Johannes Berg <johannes@sipsolutions.net>
 L:	linux-wireless@vger.kernel.org
diff --git a/arch/Kconfig b/arch/Kconfig
index 21d0089117fe..6f1203612403 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -257,6 +257,13 @@ config HAVE_REGS_AND_STACK_ACCESS_API
 	  declared in asm/ptrace.h
 	  For example the kprobes-based event tracer needs this API.
 
+config HAVE_RSEQ
+	bool
+	depends on HAVE_REGS_AND_STACK_ACCESS_API
+	help
+	  This symbol should be selected by an architecture if it
+	  supports an implementation of restartable sequences.
+
 config HAVE_CLK
 	bool
 	help
diff --git a/fs/exec.c b/fs/exec.c
index 62175cbcc801..75fcbaeb0206 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1794,6 +1794,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	rseq_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current);
 	free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c05ac5f5aa03..203abf387a14 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -26,6 +26,7 @@
 #include <linux/signal_types.h>
 #include <linux/mm_types_task.h>
 #include <linux/task_io_accounting.h>
+#include <linux/rseq.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -966,6 +967,13 @@ struct task_struct {
 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_RSEQ
+	struct rseq __user *rseq;
+	u32 rseq_event_counter;
+	unsigned int rseq_refcount;
+	bool rseq_preempt, rseq_signal, rseq_migrate;
+#endif
+
 	struct tlbflush_unmap_batch	tlb_ubc;
 
 	struct rcu_head			rcu;
@@ -1626,4 +1634,85 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 #define TASK_SIZE_OF(tsk)	TASK_SIZE
 #endif
 
+#ifdef CONFIG_RSEQ
+static inline void rseq_set_notify_resume(struct task_struct *t)
+{
+	if (t->rseq)
+		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+void __rseq_handle_notify_resume(struct pt_regs *regs);
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+{
+	if (current->rseq)
+		__rseq_handle_notify_resume(regs);
+}
+/*
+ * If parent process has a registered restartable sequences area, the
+ * child inherits. Only applies when forking a process, not a thread. In
+ * case a parent fork() in the middle of a restartable sequence, set the
+ * resume notifier to force the child to retry.
+ */
+static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
+{
+	if (clone_flags & CLONE_THREAD) {
+		t->rseq = NULL;
+		t->rseq_event_counter = 0;
+		t->rseq_refcount = 0;
+	} else {
+		t->rseq = current->rseq;
+		t->rseq_event_counter = current->rseq_event_counter;
+		t->rseq_refcount = current->rseq_refcount;
+		rseq_set_notify_resume(t);
+	}
+}
+static inline void rseq_execve(struct task_struct *t)
+{
+	t->rseq = NULL;
+	t->rseq_event_counter = 0;
+	t->rseq_refcount = 0;
+}
+static inline void rseq_sched_out(struct task_struct *t)
+{
+	rseq_set_notify_resume(t);
+}
+static inline void rseq_signal_deliver(struct pt_regs *regs)
+{
+	current->rseq_signal = true;
+	rseq_handle_notify_resume(regs);
+}
+static inline void rseq_preempt(struct task_struct *t)
+{
+	t->rseq_preempt = true;
+}
+static inline void rseq_migrate(struct task_struct *t)
+{
+	t->rseq_migrate = true;
+}
+#else
+static inline void rseq_set_notify_resume(struct task_struct *t)
+{
+}
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+{
+}
+static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
+{
+}
+static inline void rseq_execve(struct task_struct *t)
+{
+}
+static inline void rseq_sched_out(struct task_struct *t)
+{
+}
+static inline void rseq_signal_deliver(struct pt_regs *regs)
+{
+}
+static inline void rseq_preempt(struct task_struct *t)
+{
+}
+static inline void rseq_migrate(struct task_struct *t)
+{
+}
+#endif
+
 #endif
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
new file mode 100644
index 000000000000..8abd8b638ce0
--- /dev/null
+++ b/include/uapi/linux/rseq.h
@@ -0,0 +1,131 @@
+#ifndef _UAPI_LINUX_RSEQ_H
+#define _UAPI_LINUX_RSEQ_H
+
+/*
+ * linux/rseq.h
+ *
+ * Restartable sequences system call API
+ *
+ * Copyright (c) 2015-2016 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifdef __KERNEL__
+# include <linux/types.h>
+#else	/* #ifdef __KERNEL__ */
+# include <stdint.h>
+#endif	/* #else #ifdef __KERNEL__ */
+
+#include <asm/byteorder.h>
+
+#ifdef __LP64__
+# define RSEQ_FIELD_u32_u64(field)	uint64_t field
+#elif defined(__BYTE_ORDER) ? \
+	__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+# define RSEQ_FIELD_u32_u64(field)	uint32_t _padding ## field, field
+#else
+# define RSEQ_FIELD_u32_u64(field)	uint32_t field, _padding ## field
+#endif
+
+enum rseq_flags {
+	RSEQ_FORCE_UNREGISTER = (1 << 0),
+};
+
+enum rseq_cs_flags {
+	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT	= (1U << 0),
+	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL	= (1U << 1),
+	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	= (1U << 2),
+};
+
+/*
+ * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
+ * contained within a single cache-line. It is usually declared as
+ * link-time constant data.
+ */
+struct rseq_cs {
+	RSEQ_FIELD_u32_u64(start_ip);
+	RSEQ_FIELD_u32_u64(post_commit_ip);
+	RSEQ_FIELD_u32_u64(abort_ip);
+	uint32_t flags;
+} __attribute__((aligned(4 * sizeof(uint64_t))));
+
+union rseq_cpu_event {
+	struct {
+		/*
+		 * Restartable sequences cpu_id field.
+		 * Updated by the kernel, and read by user-space with
+		 * single-copy atomicity semantics. Aligned on 32-bit.
+		 * Negative values are reserved for user-space.
+		 */
+		int32_t cpu_id;
+		/*
+		 * Restartable sequences event_counter field.
+		 * Updated by the kernel, and read by user-space with
+		 * single-copy atomicity semantics. Aligned on 32-bit.
+		 */
+		uint32_t event_counter;
+	} e;
+	/*
+	 * On architectures with 64-bit aligned reads, both cpu_id and
+	 * event_counter can be read with single-copy atomicity
+	 * semantics.
+	 */
+	uint64_t v;
+};
+
+/*
+ * struct rseq is aligned on 4 * 8 bytes to ensure it is always
+ * contained within a single cache-line.
+ */
+struct rseq {
+	union rseq_cpu_event u;
+	/*
+	 * Restartable sequences rseq_cs field.
+	 * Contains NULL when no critical section is active for the
+	 * current thread, or holds a pointer to the currently active
+	 * struct rseq_cs.
+	 * Updated by user-space at the beginning and end of assembly
+	 * instruction sequence block, and by the kernel when it
+	 * restarts an assembly instruction sequence block. Read by the
+	 * kernel with single-copy atomicity semantics. Aligned on
+	 * 64-bit.
+	 */
+	RSEQ_FIELD_u32_u64(rseq_cs);
+	/*
+	 * - RSEQ_DISABLE flag:
+	 *
+	 * Fallback fast-track flag for single-stepping.
+	 * Set by user-space if lack of progress is detected.
+	 * Cleared by user-space after rseq finish.
+	 * Read by the kernel.
+	 * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+	 *     Inhibit instruction sequence block restart and event
+	 *     counter increment on preemption for this thread.
+	 * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+	 *     Inhibit instruction sequence block restart and event
+	 *     counter increment on signal delivery for this thread.
+	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+	 *     Inhibit instruction sequence block restart and event
+	 *     counter increment on migration for this thread.
+	 */
+	uint32_t flags;
+} __attribute__((aligned(4 * sizeof(uint64_t))));
+
+#endif /* _UAPI_LINUX_RSEQ_H */
diff --git a/init/Kconfig b/init/Kconfig
index 8514b25db21c..b8aa41bd4f4f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1395,6 +1395,19 @@ config MEMBARRIER
 
 	  If unsure, say Y.
 
+config RSEQ
+	bool "Enable rseq() system call" if EXPERT
+	default y
+	depends on HAVE_RSEQ
+	help
+	  Enable the restartable sequences system call. It provides a
+	  user-space cache for the current CPU number value, which
+	  speeds up getting the current CPU number from user-space,
+	  as well as an ABI to speed up user-space operations on
+	  per-CPU data.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 4cb8e8b23c6e..5c09592b3b9f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -111,6 +111,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_RSEQ) += rseq.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index b7e9e57b71ea..f311a99fb1d1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1849,6 +1849,8 @@ static __latent_entropy struct task_struct *copy_process(
 	 */
 	copy_seccomp(p);
 
+	rseq_fork(p, clone_flags);
+
 	/*
 	 * Process group and session signals need to be delivered to just the
 	 * parent before the fork or both the parent and the child after the
diff --git a/kernel/rseq.c b/kernel/rseq.c
new file mode 100644
index 000000000000..706a83bd885c
--- /dev/null
+++ b/kernel/rseq.c
@@ -0,0 +1,347 @@
+/*
+ * Restartable sequences system call
+ *
+ * Restartable sequences are a lightweight interface that allows
+ * user-level code to be executed atomically relative to scheduler
+ * preemption and signal delivery. Typically used for implementing
+ * per-cpu operations.
+ *
+ * It allows user-space to perform update operations on per-cpu data
+ * without requiring heavy-weight atomic operations.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) 2015, Google, Inc.,
+ * Paul Turner <pjt@google.com> and Andrew Hunter <ahh@google.com>
+ * Copyright (C) 2015-2016, EfficiOS Inc.,
+ * Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/rseq.h>
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+/*
+ * The restartable sequences mechanism is the overlap of two distinct
+ * restart mechanisms: a sequence counter tracking preemption and signal
+ * delivery for high-level code, and an ip-fixup-based mechanism for the
+ * final assembly instruction sequence.
+ *
+ * A high-level summary of the algorithm to use rseq from user-space is
+ * as follows:
+ *
+ * The high-level code between rseq_start() and rseq_finish() loads the
+ * current value of the sequence counter in rseq_start(), and then it
+ * gets compared with the new current value within the rseq_finish()
+ * restartable instruction sequence. Between rseq_start() and
+ * rseq_finish(), the high-level code can perform operations that do not
+ * have side-effects, such as getting the current CPU number, and
+ * loading from variables.
+ *
+ * Stores are performed at the very end of the restartable sequence
+ * assembly block. Each assembly block within rseq_finish() defines a
+ * "struct rseq_cs" structure which describes the start_ip and
+ * post_commit_ip addresses, as well as the abort_ip address where the
+ * kernel should move the thread instruction pointer if a rseq critical
+ * section assembly block is preempted or if a signal is delivered on
+ * top of a rseq critical section assembly block.
+ *
+ * Detailed algorithm of rseq use:
+ *
+ * rseq_start()
+ *
+ *   0. Userspace loads the current event counter value from the
+ *      event_counter field of the registered struct rseq TLS area,
+ *
+ * rseq_finish()
+ *
+ *   Steps [1]-[3] (inclusive) need to be a sequence of instructions in
+ *   userspace that can handle being moved to the abort_ip between any
+ *   of those instructions.
+ *
+ *   The abort_ip address needs to be less than start_ip, or
+ *   greater-or-equal the post_commit_ip. Step [4] and the failure
+ *   code step [F1] need to be at addresses lesser than start_ip, or
+ *   greater-or-equal the post_commit_ip.
+ *
+ *       [start_ip]
+ *   1.  Userspace stores the address of the struct rseq_cs assembly
+ *       block descriptor into the rseq_cs field of the registered
+ *       struct rseq TLS area. This update is performed through a single
+ *       store, followed by a compiler barrier which prevents the
+ *       compiler from moving following loads or stores before this
+ *       store.
+ *
+ *   2.  Userspace tests to see whether the current event counter value
+ *       match the value loaded at [0]. Manually jumping to [F1] in case
+ *       of a mismatch.
+ *
+ *       Note that if we are preempted or interrupted by a signal
+ *       after [1] and before post_commit_ip, then the kernel also
+ *       performs the comparison performed in [2], and conditionally
+ *       clears the rseq_cs field of struct rseq, then jumps us to
+ *       abort_ip.
+ *
+ *   3.  Userspace critical section final instruction before
+ *       post_commit_ip is the commit. The critical section is
+ *       self-terminating.
+ *       [post_commit_ip]
+ *
+ *   4.  Userspace clears the rseq_cs field of the struct rseq
+ *       TLS area.
+ *
+ *   5.  Return true.
+ *
+ *   On failure at [2]:
+ *
+ *   F1. Userspace clears the rseq_cs field of the struct rseq
+ *       TLS area. Followed by step [F2].
+ *
+ *       [abort_ip]
+ *   F2. Return false.
+ */
+
+/*
+ * The rseq_event_counter allow user-space to detect preemption and
+ * signal delivery. It increments at least once before returning to
+ * user-space if a thread is preempted or has a signal delivered. It is
+ * not meant to be an exact counter of such events.
+ *
+ * Overflow of the event counter is not a problem in practice. It
+ * increments at most once between each user-space thread instruction
+ * executed, so we would need a thread to execute 2^32 instructions or
+ * more between rseq_start() and rseq_finish(), while single-stepping,
+ * for this to be an issue.
+ *
+ * On 64-bit architectures, both cpu_id and event_counter can be updated
+ * with a single 64-bit store. On 32-bit architectures, __put_user() is
+ * expected to perform two 32-bit single-copy stores to guarantee
+ * single-copy atomicity semantics for other threads.
+ */
+static bool rseq_update_cpu_id_event_counter(struct task_struct *t,
+		bool inc_event_counter)
+{
+	union rseq_cpu_event u;
+
+	u.e.cpu_id = raw_smp_processor_id();
+	u.e.event_counter = inc_event_counter ? ++t->rseq_event_counter :
+			t->rseq_event_counter;
+	if (__put_user(u.v, &t->rseq->u.v))
+		return false;
+	return true;
+}
+
+static bool rseq_get_rseq_cs(struct task_struct *t,
+		void __user **start_ip,
+		void __user **post_commit_ip,
+		void __user **abort_ip,
+		uint32_t *cs_flags)
+{
+	unsigned long ptr;
+	struct rseq_cs __user *urseq_cs;
+	struct rseq_cs rseq_cs;
+
+	if (__get_user(ptr, &t->rseq->rseq_cs))
+		return false;
+	if (!ptr)
+		return true;
+	urseq_cs = (struct rseq_cs __user *)ptr;
+	if (copy_from_user(&rseq_cs, urseq_cs, sizeof(rseq_cs)))
+		return false;
+	/*
+	 * We need to clear rseq_cs upon entry into a signal handler
+	 * nested on top of a rseq assembly block, so the signal handler
+	 * will not be fixed up if itself interrupted by a nested signal
+	 * handler or preempted.  We also need to clear rseq_cs if we
+	 * preempt or deliver a signal on top of code outside of the
+	 * rseq assembly block, to ensure that a following preemption or
+	 * signal delivery will not try to perform a fixup needlessly.
+	 */
+	if (clear_user(&t->rseq->rseq_cs, sizeof(t->rseq->rseq_cs)))
+		return false;
+	*start_ip = (void __user *)rseq_cs.start_ip;
+	*post_commit_ip = (void __user *)rseq_cs.post_commit_ip;
+	*abort_ip = (void __user *)rseq_cs.abort_ip;
+	*cs_flags = rseq_cs.flags;
+	return true;
+}
+
+static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags)
+{
+	bool need_restart = false;
+	uint32_t flags;
+
+	/* Get thread flags. */
+	if (__get_user(flags, &t->rseq->flags))
+		return -EFAULT;
+
+	/* Take into account critical section flags. */
+	flags |= cs_flags;
+
+	/*
+	 * Restart on signal can only be inhibited when restart on
+	 * preempt and restart on migrate are inhibited too. Otherwise,
+	 * a preempted signal handler could fail to restart the prior
+	 * execution context on sigreturn.
+	 */
+	if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) {
+		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
+			return -EINVAL;
+		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
+			return -EINVAL;
+	}
+	if (t->rseq_migrate
+			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
+		need_restart = true;
+	else if (t->rseq_preempt
+			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
+		need_restart = true;
+	else if (t->rseq_signal
+			&& !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL))
+		need_restart = true;
+
+	t->rseq_preempt = false;
+	t->rseq_signal = false;
+	t->rseq_migrate = false;
+	if (need_restart)
+		return 1;
+	return 0;
+}
+
+static int rseq_ip_fixup(struct pt_regs *regs)
+{
+	struct task_struct *t = current;
+	void __user *start_ip = NULL;
+	void __user *post_commit_ip = NULL;
+	void __user *abort_ip = NULL;
+	uint32_t cs_flags = 0;
+	int ret;
+
+	ret = rseq_get_rseq_cs(t, &start_ip, &post_commit_ip, &abort_ip,
+			&cs_flags);
+	if (!ret)
+		return -EFAULT;
+
+	ret = rseq_need_restart(t, cs_flags);
+	if (ret < 0)
+		return -EFAULT;
+	if (!ret)
+		return 0;
+
+	/* Handle potentially not being within a critical section. */
+	if ((void __user *)instruction_pointer(regs) >= post_commit_ip ||
+			(void __user *)instruction_pointer(regs) < start_ip)
+		return 1;
+
+	/*
+	 * We set this after potentially failing in
+	 * clear_user so that the signal arrives at the
+	 * faulting rip.
+	 */
+	instruction_pointer_set(regs, (unsigned long)abort_ip);
+	return 1;
+}
+
+/*
+ * This resume handler should always be executed between any of:
+ * - preemption,
+ * - signal delivery,
+ * and return to user-space.
+ *
+ * This is how we can ensure that the entire rseq critical section,
+ * consisting of both the C part and the assembly instruction sequence,
+ * will issue the commit instruction only if executed atomically with
+ * respect to other threads scheduled on the same CPU, and with respect
+ * to signal handlers.
+ */
+void __rseq_handle_notify_resume(struct pt_regs *regs)
+{
+	struct task_struct *t = current;
+	int ret;
+
+	if (unlikely(t->flags & PF_EXITING))
+		return;
+	if (unlikely(!access_ok(VERIFY_WRITE, t->rseq, sizeof(*t->rseq))))
+		goto error;
+	ret = rseq_ip_fixup(regs);
+	if (unlikely(ret < 0))
+		goto error;
+	if (unlikely(!rseq_update_cpu_id_event_counter(t, ret)))
+		goto error;
+	return;
+
+error:
+	force_sig(SIGSEGV, t);
+}
+
+/*
+ * sys_rseq - setup restartable sequences for caller thread.
+ */
+SYSCALL_DEFINE2(rseq, struct rseq __user *, rseq, int, flags)
+{
+	if (!rseq) {
+		/* Unregister rseq for current thread. */
+		if (unlikely(flags & ~RSEQ_FORCE_UNREGISTER))
+			return -EINVAL;
+		if (flags & RSEQ_FORCE_UNREGISTER) {
+			current->rseq = NULL;
+			current->rseq_refcount = 0;
+			return 0;
+		}
+		if (!current->rseq_refcount)
+			return -ENOENT;
+		if (!--current->rseq_refcount)
+			current->rseq = NULL;
+		return 0;
+	}
+
+	if (unlikely(flags))
+		return -EINVAL;
+
+	if (current->rseq) {
+		/*
+		 * If rseq is already registered, check whether
+		 * the provided address differs from the prior
+		 * one.
+		 */
+		BUG_ON(!current->rseq_refcount);
+		if (current->rseq != rseq)
+			return -EBUSY;
+		if (current->rseq_refcount == UINT_MAX)
+			return -EOVERFLOW;
+		current->rseq_refcount++;
+	} else {
+		/*
+		 * If there was no rseq previously registered,
+		 * we need to ensure the provided rseq is
+		 * properly aligned and valid.
+		 */
+		BUG_ON(current->rseq_refcount);
+		if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq)))
+			return -EINVAL;
+		if (!access_ok(VERIFY_WRITE, rseq, sizeof(*rseq)))
+			return -EFAULT;
+		current->rseq = rseq;
+		current->rseq_refcount = 1;
+		/*
+		 * If rseq was previously inactive, and has just
+		 * been registered, ensure the cpu_id and
+		 * event_counter fields are updated before
+		 * returning to user-space.
+		 */
+		rseq_set_notify_resume(current);
+	}
+
+	return 0;
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0869b20fba81..12da0f771d73 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1170,6 +1170,8 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 #endif
 #endif
 
+	rseq_migrate(p);
+
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
@@ -2572,6 +2574,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 {
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
+	rseq_sched_out(prev);
 	fire_sched_out_preempt_notifiers(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
@@ -3322,6 +3325,7 @@ static void __sched notrace __schedule(bool preempt)
 	clear_preempt_need_resched();
 
 	if (likely(prev != next)) {
+		rseq_preempt(prev);
 		rq->nr_switches++;
 		rq->curr = next;
 		++*switch_count;
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 8acef8576ce9..c7b366ccf39c 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -258,3 +258,6 @@ cond_syscall(sys_membarrier);
 cond_syscall(sys_pkey_mprotect);
 cond_syscall(sys_pkey_alloc);
 cond_syscall(sys_pkey_free);
+
+/* restartable sequence */
+cond_syscall(sys_rseq);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-12 23:03 ` [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call Mathieu Desnoyers
@ 2017-10-13  0:36   ` Linus Torvalds
  2017-10-13  9:35     ` Ben Maurer
  2017-10-13 12:50   ` Florian Weimer
  2017-10-18 16:41   ` Ben Maurer
  2 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2017-10-13  0:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, Linux Kernel Mailing List, Thomas Gleixner,
	Andi Kleen, Chris Lameter, Ingo Molnar, H. Peter Anvin,
	Ben Maurer, Steven Rostedt, Andrew Morton, Russell King,
	Catalin Marinas, Michael

I do not hate this series, and I'd be happy to apply it, but I will
repeat what I've asked for EVERY SINGLE TIME this series has come up:

On Thu, Oct 12, 2017 at 4:03 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Here are benchmarks of counter increment in various scenarios compared
> to restartable sequences. Those benchmarks were taken on v8 of the
> patchset.

I want to see real numbers from real issues.

A "increment percpu value" simply isn't relevant.

When I asked last time, people pointed me to potential uses, including
malloc libraries that could get per-thread performance with just
per-cpu (not per thread) data structure overhead. I see that you once
more point to the slides from 2013 that again talks about it.

But people didn't post code, people didn't post numbers, and people
didn't point to actual real uses, just "this could happen".

I really really want more than hand-waving. I want more than pointless
"this is how quickly you can increment a per-thread counter". I want
to hear about _real_ uses, and real numbers.

This has been going on for long enough, that if there *still* are no
actual real users, then I'm *still* not interested in having this
merged.

Because without real-world uses, it's not obvious that there won't be
somebody who goes "oh, this isn't quite enough for us, the semantics
are subtly incompatible with our real-world use case".

So I want real numbers from a real implementation of malloc/free. And
if it's not malloc/free, then what is it? I want something *real*, not
some micro-benchmark that benchmarks a totally pointless load.

Because until there are that kind of "yes, this is more than theory",
I'm not really willing to have this merged.

                   Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-13  0:36   ` Linus Torvalds
@ 2017-10-13  9:35     ` Ben Maurer
       [not found]       ` <DM5PR15MB1690DA99E4AA74FBE54CF7F9CF480-kTBAvIqET4EjX1lkf7hTyId3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Ben Maurer @ 2017-10-13  9:35 UTC (permalink / raw)
  To: Linus Torvalds, Mathieu Desnoyers, David Goldblatt, Qi Wang
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, Linux Kernel Mailing List, Thomas Gleixner,
	Andi Kleen, Chris Lameter, Ingo Molnar, H. Peter Anvin,
	Steven Rostedt, Andrew Morton, Russell King, Catalin Marinas,
	Michael Kerrisk

Hey,

I'm really excited to hear that you're open to this patch set and totally understand the desire for some more numbers. I have a few thoughts and questions -- hopefully ones that could help better understand where you'd like to see more data (potentially from myself and other Facebook folks)

> A "increment percpu value" simply isn't relevant.

While I understand it seems trivial, my experience has been that this type of operation can actually be important in many server workloads. In applications with 1000s of threads, keeping a counter like this can pose a challenge. One can use a per-thread variable, but the size overhead here can be very large (8 bytes per counter per thread adds up very quickly). And atomic instructions can cause contention quickly. Server programs tend to have many statistical counters, being able to implement them efficiently without size bloat is a real world win.

This type of per-cpu counter can also very quickly be used to implement other abstractions in common use -- eg an asymmetric reader-writer lock or a reference counted object that is changed infrequently. While thread local storage can also be used in these cases this can be a substantial size overhead and can also require cooperation between the application and the library to manage thread lifecycle.

At least from what I've seen of our usage of these types of abstractions within Facebook, if rseq met these use cases and did absolutely nothing else it would still be a feature that our applications would benefit from. Hopefully we can find evidence that it can do even more than this, but I think that this "trivial" use case is actually addressing a real world problem.

> When I asked last time, people pointed me to potential uses, including
> malloc libraries that could get per-thread performance with just
> per-cpu (not per thread) data structure overhead. I see that you once
> more point to the slides from 2013 that again talks about it.
>
> But people didn't post code, people didn't post numbers, and people
> didn't point to actual real uses, just "this could happen".

At Facebook we did some work to experiment with rseq and jemalloc Qi and David (cc'd) may be able to provide more context on the current state.

> Because without real-world uses, it's not obvious that there won't be
> somebody who goes "oh, this isn't quite enough for us, the semantics
> are subtly incompatible with our real-world use case".

Is your concern mainly this question (is this patchset a good way to bring per-cpu algorithms to userspace)? I'm hoping that given the variety of ways that per-cpu data structures are used in the kernel the concerns around this patch set are mainly around what approach we should take rather than if per-cpu algorithms are a good idea at all. If this is your main concern perhaps our focus should be around demonstrating that a number of useful per-cpu algorithms can be implemented using restartable sequences.

Ultimately I'm worried there's a chicken and egg problem here. It's hard to get people to commit to investing in rseq without a clear path to the patchset seeing the light of day. It's also easy to understand why you'd be reluctant to merge such a substantial and unfamiliar API without extensive data. If we're still not able to get compelling data, I'm wondering if there are other approaches that could get us unstuck, eg

(1) Could we merge enough of this patchset (eg things like the faster getcpu() operation, which seems like a good improvement over the current approach). If we make the remaining patches small enough it may be easier for sophisticated users to apply the remaining patches, maintain them, and provide real-world operational experience with this abstraction.

(2) Could we implement restartable sequences in the kernel but only allow the vdso to make use of them? We could have the vdso export a number of high-level operations (like the ones suggested in Paul Turner's original presentation -- per-cpu spin lock, per-cpu atomic increment/decrement, per-cpu list push/pop). This would allow us to get real-world data about how these primitives are used without committing to a complex ABI -- only committing to support the specific operations. If the whole idea flops we could eventually create a slow/naive implementation of the vdso functions and kill restartable sequences entirely.

-b

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <DM5PR15MB1690DA99E4AA74FBE54CF7F9CF480-kTBAvIqET4EjX1lkf7hTyId3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]       ` <DM5PR15MB1690DA99E4AA74FBE54CF7F9CF480-kTBAvIqET4EjX1lkf7hTyId3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-13 18:30         ` Linus Torvalds
       [not found]           ` <CA+55aFzPBES0JOYuZhuNM7NKN+G9ytZQT2daueFPw0j9HGpdGQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-10-14  3:01           ` Andi Kleen
  0 siblings, 2 replies; 61+ messages in thread
From: Linus Torvalds @ 2017-10-13 18:30 UTC (permalink / raw)
  To: Ben Maurer
  Cc: Mathieu Desnoyers, David Goldblatt, Qi Wang, Paul E. McKenney,
	Boqun Feng, Peter Zijlstra, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Dave Watson, Josh Triplett, Will Deacon,
	Linux Kernel Mailing List, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Steven Rostedt,
	Andrew Morton

On Fri, Oct 13, 2017 at 2:35 AM, Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org> wrote:
>
> I'm really excited to hear that you're open to this patch set and totally understand the desire for some more numbers.

So the patch-set actually looks very reasonable today. I looked
through it (ok, I wasn't cc'd on the ppc-only patches so I didn't look
at those, but I don't think they are likely objectionable either), and
everything looked fine from a patch standpoint.

But it's not _just_ numbers for real loads I'm looking for, it's
actually an _existence proof_ for a real load too. I'd like to know
that the suggested interface _really_ works in practice too for all
the expected users.

In particular, it's easy to make test-cases to show basic
functionality, but that does not necessarily show that the interface
then works in "real life".

For example, if this is supposed to work for a malloc library, it's
important that people show that yes, this can really work in a
*LIBRARY*.

That sounds so obvious and stupid that you might go "What do you
mean?", but for things to work for libraries, they have to work
together with *other* users, and with *independent* users.

For example, say that you're some runtime that wants to use the percpu
thing for percpu counters - because you want to avoid cache ping-pong,
and you want to avoid per-thread allocation overhead (or per-thread
scaling for just summing up the counters) when you have potentially
tens of thousands of threads.

Now, how does this runtime work *together* with

 - CPU hotplug adding new cpu's while you are running (and after you
allocated your percpu areas)

 - libraries and system admins that limit - or extend - you to a
certain set of CPUs

 - another library (like the malloc library) that wants to use the
same interface for its percpu allocation queues.

maybe all of this "just works", but I really want to see an existence
proof.  Not just a "dedicated use of the interface for one benchmark".

So yes, I want to see numbers, but I really want to see something much
more fundamental. I want to feel like there is a good reason to
believe that the interface really is sufficient and that it really
does work, even when a single thread may have multiple *different*
uses for this. Statistics, memory allocation queues, RCU, per-cpu
locking, yadda yadda. All these things may want to use this, but they
want to use it *together*, and without you having to write special
code where every user needs to know about every other user statically.

Can you load two different *dynamic* libraries that each independently
uses this thing for their own use, without having to be built together
for each other?

>> A "increment percpu value" simply isn't relevant.
>
> While I understand it seems trivial, my experience has been that this type of operation can actually be important in many server workloads.

Oh, I'm not saying that it's not relevant to have high-performance
statistics gathering using percpu data structures. Of _course_ that is
important, we do that very much in the kernel itself.

But a benchmark that does nothing else really isn't relevant.  If the
*only* thing somebody uses this for is statistics, it's simply not
good enough.

>> Because without real-world uses, it's not obvious that there won't be
>> somebody who goes "oh, this isn't quite enough for us, the semantics
>> are subtly incompatible with our real-world use case".
>
> Is your concern mainly this question (is this patchset a good way to
> bring per-cpu algorithms to userspace)? I'm hoping that given the
> variety of ways that per-cpu data structures are used in the kernel
> the concerns around this patch set are mainly around what approach we
> should take rather than if per-cpu algorithms are a good idea at all.
> If this is your main concern perhaps our focus should be around
> demonstrating that a number of useful per-cpu algorithms can be
> implemented using restartable sequences.

The important thing for me is that it should demonstrate that you can
have users co-exists, and that the interface is sufficient for that.

So I do want to see "just numbers" in the sense that I would want to
see that people have actually written code that takes advantage of the
percpu nature to do real things (like an allocator). But more than
that, I want to see *use*.

> Ultimately I'm worried there's a chicken and egg problem here.

This patch-set has been around for *years* in some form. It's improved
over the years, but the basic approaches are not new.

Honestly, if people still don't have any actual user-level code that
really _uses_ this, I'm not interested in merging it.

There's no chicken-and-egg here. Anybody who wants to push this
patch-set needs to write the user level code to validate that the
patch-set makes sense. That's not chicken-and-egg, that's just
"without the user-space code, the kernel code has never been tested,
validated or used".

And if nobody can be bothered to write the user-level code and test
this patch-series, then obviously it's not important enough for the
kernel to merge it.

                      Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <CA+55aFzPBES0JOYuZhuNM7NKN+G9ytZQT2daueFPw0j9HGpdGQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]           ` <CA+55aFzPBES0JOYuZhuNM7NKN+G9ytZQT2daueFPw0j9HGpdGQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-13 20:54             ` Paul E. McKenney
       [not found]               ` <20171013205418.GM3521-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Paul E. McKenney @ 2017-10-13 20:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben Maurer, Mathieu Desnoyers, David Goldblatt, Qi Wang,
	Boqun Feng, Peter Zijlstra, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Dave Watson, Josh Triplett, Will Deacon,
	Linux Kernel Mailing List, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Steven Rostedt,
	Andrew Morton, Russell King

On Fri, Oct 13, 2017 at 11:30:29AM -0700, Linus Torvalds wrote:
> On Fri, Oct 13, 2017 at 2:35 AM, Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org> wrote:
> >
> > I'm really excited to hear that you're open to this patch set and totally understand the desire for some more numbers.
> 
> So the patch-set actually looks very reasonable today. I looked
> through it (ok, I wasn't cc'd on the ppc-only patches so I didn't look
> at those, but I don't think they are likely objectionable either), and
> everything looked fine from a patch standpoint.
> 
> But it's not _just_ numbers for real loads I'm looking for, it's
> actually an _existence proof_ for a real load too. I'd like to know
> that the suggested interface _really_ works in practice too for all
> the expected users.
> 
> In particular, it's easy to make test-cases to show basic
> functionality, but that does not necessarily show that the interface
> then works in "real life".
> 
> For example, if this is supposed to work for a malloc library, it's
> important that people show that yes, this can really work in a
> *LIBRARY*.
> 
> That sounds so obvious and stupid that you might go "What do you
> mean?", but for things to work for libraries, they have to work
> together with *other* users, and with *independent* users.
> 
> For example, say that you're some runtime that wants to use the percpu
> thing for percpu counters - because you want to avoid cache ping-pong,
> and you want to avoid per-thread allocation overhead (or per-thread
> scaling for just summing up the counters) when you have potentially
> tens of thousands of threads.
> 
> Now, how does this runtime work *together* with
> 
>  - CPU hotplug adding new cpu's while you are running (and after you
> allocated your percpu areas)
> 
>  - libraries and system admins that limit - or extend - you to a
> certain set of CPUs
> 
>  - another library (like the malloc library) that wants to use the
> same interface for its percpu allocation queues.
> 
> maybe all of this "just works", but I really want to see an existence
> proof.  Not just a "dedicated use of the interface for one benchmark".
> 
> So yes, I want to see numbers, but I really want to see something much
> more fundamental. I want to feel like there is a good reason to
> believe that the interface really is sufficient and that it really
> does work, even when a single thread may have multiple *different*
> uses for this. Statistics, memory allocation queues, RCU, per-cpu
> locking, yadda yadda. All these things may want to use this, but they
> want to use it *together*, and without you having to write special
> code where every user needs to know about every other user statically.
> 
> Can you load two different *dynamic* libraries that each independently
> uses this thing for their own use, without having to be built together
> for each other?
> 
> >> A "increment percpu value" simply isn't relevant.
> >
> > While I understand it seems trivial, my experience has been that this type of operation can actually be important in many server workloads.
> 
> Oh, I'm not saying that it's not relevant to have high-performance
> statistics gathering using percpu data structures. Of _course_ that is
> important, we do that very much in the kernel itself.
> 
> But a benchmark that does nothing else really isn't relevant.  If the
> *only* thing somebody uses this for is statistics, it's simply not
> good enough.
> 
> 
> >> Because without real-world uses, it's not obvious that there won't be
> >> somebody who goes "oh, this isn't quite enough for us, the semantics
> >> are subtly incompatible with our real-world use case".
> >
> > Is your concern mainly this question (is this patchset a good way to
> > bring per-cpu algorithms to userspace)? I'm hoping that given the
> > variety of ways that per-cpu data structures are used in the kernel
> > the concerns around this patch set are mainly around what approach we
> > should take rather than if per-cpu algorithms are a good idea at all.
> > If this is your main concern perhaps our focus should be around
> > demonstrating that a number of useful per-cpu algorithms can be
> > implemented using restartable sequences.
> 
> The important thing for me is that it should demonstrate that you can
> have users co-exists, and that the interface is sufficient for that.
> 
> So I do want to see "just numbers" in the sense that I would want to
> see that people have actually written code that takes advantage of the
> percpu nature to do real things (like an allocator). But more than
> that, I want to see *use*.
> 
> > Ultimately I'm worried there's a chicken and egg problem here.
> 
> This patch-set has been around for *years* in some form. It's improved
> over the years, but the basic approaches are not new.
> 
> Honestly, if people still don't have any actual user-level code that
> really _uses_ this, I'm not interested in merging it.
> 
> There's no chicken-and-egg here. Anybody who wants to push this
> patch-set needs to write the user level code to validate that the
> patch-set makes sense. That's not chicken-and-egg, that's just
> "without the user-space code, the kernel code has never been tested,
> validated or used".
> 
> And if nobody can be bothered to write the user-level code and test
> this patch-series, then obviously it's not important enough for the
> kernel to merge it.

My guess is that it will take some time, probably measured in months,
to carry out this level of integration and testing to.  But agreed, it
is needed -- as you know, I recently removed some code from RCU that
was requested but then never used.  Not fun.  Even worse would be a
case where the requested code was half-used in a inefficient way, but
precluding improvements.  And we actually had that problem some years
back with the userspace-accessible ring-buffer code.

So if it would help the people doing the testing, I would be happy to
maintain a out-of-tree repository for this series.  That way, if the
testing showed that kernel-code changes were required, these changes
could be easily made without worrying about backwards compatibility
(you don't get the backwards-compatibility guarantee until the code
hits mainline).  This repository would be similar in some ways to the
-rt tree, however, given the small size of this patchset, I cannot
justify a separate git tree.  My thought is to provide (for example)
v4.14-rc4-rseq tags within my -rcu tree.

If there are problems with this approach, or if someone has a better idea,
please let me know.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <20171013205418.GM3521-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]               ` <20171013205418.GM3521-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2017-10-13 21:05                 ` Linus Torvalds
       [not found]                   ` <CA+55aFwvNS95ByZJTh1yG25QfaD0K0ZByK3iXeeRU8LafFiGFQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2017-10-13 21:05 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Ben Maurer, Mathieu Desnoyers, David Goldblatt, Qi Wang,
	Boqun Feng, Peter Zijlstra, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Dave Watson, Josh Triplett, Will Deacon,
	Linux Kernel Mailing List, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Steven Rostedt,
	Andrew Morton, Russell King

On Fri, Oct 13, 2017 at 1:54 PM, Paul E. McKenney
<paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
>>
>> And if nobody can be bothered to write the user-level code and test
>> this patch-series, then obviously it's not important enough for the
>> kernel to merge it.
>
> My guess is that it will take some time, probably measured in months,
> to carry out this level of integration and testing to.

That would be an argument if this was a new patch series. "Wait a few months".

But that just isn't the case here.

The fact is, these patches have been floating around in one form or
another not for a couple of months, but for years. There's a LWN
article about it from 2015, and it wasn't new back then either (slides
from 2013).

I wouldn't be surprised if there had been academic _papers_ written
about the notion.

So if there  *still* is no actual real code around this, then that
just strengthens my point - no way should we merge something where
people haven't actually bothered to write the user-mode component for
years and years.

It really boils down to: "if nobody can be bothered to write the user
mode parts after several years, why should it be merged into the
kernel"?

I don't think that's too much to ask for.

                Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <CA+55aFwvNS95ByZJTh1yG25QfaD0K0ZByK3iXeeRU8LafFiGFQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]                   ` <CA+55aFwvNS95ByZJTh1yG25QfaD0K0ZByK3iXeeRU8LafFiGFQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-13 21:21                     ` Paul E. McKenney
  2017-10-13 21:36                     ` Mathieu Desnoyers
  1 sibling, 0 replies; 61+ messages in thread
From: Paul E. McKenney @ 2017-10-13 21:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben Maurer, Mathieu Desnoyers, David Goldblatt, Qi Wang,
	Boqun Feng, Peter Zijlstra, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Dave Watson, Josh Triplett, Will Deacon,
	Linux Kernel Mailing List, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Steven Rostedt,
	Andrew Morton, Russell King

On Fri, Oct 13, 2017 at 02:05:50PM -0700, Linus Torvalds wrote:
> On Fri, Oct 13, 2017 at 1:54 PM, Paul E. McKenney
> <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> >>
> >> And if nobody can be bothered to write the user-level code and test
> >> this patch-series, then obviously it's not important enough for the
> >> kernel to merge it.
> >
> > My guess is that it will take some time, probably measured in months,
> > to carry out this level of integration and testing to.
> 
> That would be an argument if this was a new patch series. "Wait a few months".
> 
> But that just isn't the case here.
> 
> The fact is, these patches have been floating around in one form or
> another not for a couple of months, but for years. There's a LWN
> article about it from 2015, and it wasn't new back then either (slides
> from 2013).
> 
> I wouldn't be surprised if there had been academic _papers_ written
> about the notion.
> 
> So if there  *still* is no actual real code around this, then that
> just strengthens my point - no way should we merge something where
> people haven't actually bothered to write the user-mode component for
> years and years.
> 
> It really boils down to: "if nobody can be bothered to write the user
> mode parts after several years, why should it be merged into the
> kernel"?
> 
> I don't think that's too much to ask for.

Completely and totally agreed.  This doesn't go into mainline without
being integrated into the various relevant usermode parts, and even then
not without some significant advantages having been clearly demonstrated.

The same as has been the case for the -rt patches.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]                   ` <CA+55aFwvNS95ByZJTh1yG25QfaD0K0ZByK3iXeeRU8LafFiGFQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-10-13 21:21                     ` Paul E. McKenney
@ 2017-10-13 21:36                     ` Mathieu Desnoyers
  2017-10-16 16:04                       ` Carlos O'Donell
  1 sibling, 1 reply; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-13 21:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul E. McKenney, Ben Maurer, David Goldblatt, Qi Wang,
	Boqun Feng, Peter Zijlstra, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Dave Watson, Josh Triplett, Will Deacon,
	linux-kernel, Thomas Gleixner, Andi Kleen, Chris Lameter,
	Ingo Molnar, H. Peter Anvin, rostedt, Andrew Morton, Russell King,
	Catalin

----- On Oct 13, 2017, at 5:05 PM, Linus Torvalds torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org wrote:

> On Fri, Oct 13, 2017 at 1:54 PM, Paul E. McKenney
> <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
>>>
>>> And if nobody can be bothered to write the user-level code and test
>>> this patch-series, then obviously it's not important enough for the
>>> kernel to merge it.
>>
>> My guess is that it will take some time, probably measured in months,
>> to carry out this level of integration and testing to.
> 
> That would be an argument if this was a new patch series. "Wait a few months".
> 
> But that just isn't the case here.
> 
> The fact is, these patches have been floating around in one form or
> another not for a couple of months, but for years. There's a LWN
> article about it from 2015, and it wasn't new back then either (slides
> from 2013).
> 
> I wouldn't be surprised if there had been academic _papers_ written
> about the notion.
> 
> So if there  *still* is no actual real code around this, then that
> just strengthens my point - no way should we merge something where
> people haven't actually bothered to write the user-mode component for
> years and years.
> 
> It really boils down to: "if nobody can be bothered to write the user
> mode parts after several years, why should it be merged into the
> kernel"?
> 
> I don't think that's too much to ask for.

I remember hearing that Google have been running their own version of
this patch on their servers. My understanding is that they did not
really care about things like single-stepping on server workloads,
because they never single-step. One issue there is that getting
rseq to work for specifically tuned systems (e.g. no cpu hotplug,
no single-stepping, and so on) is fairly straightforward. The tricky
part is to make it work in all situations, and I don't think Google
had incentive to complete those tricky bits, because they don't need
them.

Facebook seem to try to follow upstream more closely. My understanding
is that they can do specific prototypes to prove the value of the
approach, as they did with their jemalloc integration from last year.

I have myself integrated the LTTng-UST tracer to rseq as a prototype
branch, and created a Userspace RCU prototype branch that works
similarly to SRCU in the kernel, using per-cpu counters. Those are
staying prototypes because I won't release an open source tool
or library based on non-mainline system call numbers, this just cannot
end well. Once/if rseq gets in, my next step will be to implement a
multi-process userspace RCU flavor based on per-cpu counters (with
rseq). Doing this with atomic operations is not worth it, because it
just leads to really poor performance for read-side.

I also spoke to Carlos O'Donell from glibc about it, and he was very
excited about the possible use of rseq for malloc speedup/memory usage
improvement. But again, I don't see a project like glibc starting to
use a system call for which the number will have to be bumped every
now and then.

I would *not* want this merged before we gather significant user feedback.
The question is: how can we best gather that feedback ?

Perhaps one approach could be to reserve system call numbers for
sys_rseq and sys_cpu_opv, but leave them unimplemented for now
(ENOSYS). This would lessen the amount of pain user-space would have
to go through to adapt to system call number changes, and we could
provide the implementation of those system calls in a -rseq tree, which
I'd be happy to maintain in order to gather feedback. If it ends up that
it's not the right approach after all, all we would have lost is two
unwired system call numbers per architecture.

Thoughts ?

Thanks,

Mathieu

> 
>                 Linus

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-13 21:36                     ` Mathieu Desnoyers
@ 2017-10-16 16:04                       ` Carlos O'Donell
  2017-10-16 16:46                         ` Andi Kleen
  0 siblings, 1 reply; 61+ messages in thread
From: Carlos O'Donell @ 2017-10-16 16:04 UTC (permalink / raw)
  To: Mathieu Desnoyers, Linus Torvalds
  Cc: Paul E. McKenney, Ben Maurer, David Goldblatt, Qi Wang,
	Boqun Feng, Peter Zijlstra, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Dave Watson, Josh Triplett, Will Deacon,
	linux-kernel, Thomas Gleixner, Andi Kleen, Chris Lameter,
	Ingo Molnar, H. Peter Anvin, rostedt, Andrew Morton, Russell King,
	Catalin

On 10/13/2017 02:36 PM, Mathieu Desnoyers wrote:
> I also spoke to Carlos O'Donell from glibc about it, and he was very
> excited about the possible use of rseq for malloc speedup/memory usage
> improvement. But again, I don't see a project like glibc starting to
> use a system call for which the number will have to be bumped every
> now and then.
> 
> I would *not* want this merged before we gather significant user feedback.
> The question is: how can we best gather that feedback ?
> 
> Perhaps one approach could be to reserve system call numbers for
> sys_rseq and sys_cpu_opv, but leave them unimplemented for now
> (ENOSYS). This would lessen the amount of pain user-space would have
> to go through to adapt to system call number changes, and we could
> provide the implementation of those system calls in a -rseq tree, which
> I'd be happy to maintain in order to gather feedback. If it ends up that
> it's not the right approach after all, all we would have lost is two
> unwired system call numbers per architecture.
> 
> Thoughts ?

We have similar problems in glibc with API/ABI issues, and there 
isn't really any way around this except to present a reviewer with
an overwhelming amount of evidence that use cases exist and work.

How you collect, summarize, and analyze that overwhelming evidence
is up to you, specific to each change, and difficult to do accurately
and with any large measure of statistical confidence. The reviewer
has to basically trust you to some degree :-)

We should probably be working together to present the case to Linus
that glibc is immediately ready to use restartable sequences and
provide the use cases we have in mind with a public branch showing
the work and the results. This would at least convince people that
if we turned this on, every application would get benefit from a
GNU system running glibc (which is less than the number of people
running Linux on phones these days so YMMV).

As always, glibc can use any new kernel features immediately,
and only needs to detect presence at startup.

My only concrete suggestion would be to add a level of indirection,
some way to fetch the new syscalls dynamically at program startup,
then I could construct a way to call them, mark it RO, and use that
e.g. a userspace syscall table populated dynamically for experimental
syscalls (semantic changes would require changes in the name used for
lookup). It's just an expansion of the number of bits used to identify
the syscall. Obviously such a patch is only for downstream testing
in order to gather consensus for upstream patches.

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-16 16:04                       ` Carlos O'Donell
@ 2017-10-16 16:46                         ` Andi Kleen
  2017-10-16 22:17                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 61+ messages in thread
From: Andi Kleen @ 2017-10-16 16:46 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Mathieu Desnoyers, Linus Torvalds, Paul E. McKenney, Ben Maurer,
	David Goldblatt, Qi Wang, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, rostedt

> How you collect, summarize, and analyze that overwhelming evidence
> is up to you, specific to each change, and difficult to do accurately
> and with any large measure of statistical confidence. The reviewer
> has to basically trust you to some degree :-)

I think Linus' just asked for some working "real world, not micro" code that
demonstrates use.

A prototype type implementation of the glibc malloc cache using this may
be good enough.

Even if the API still changes slightly later in review I would assume
the basic concepts will stay the same, so it would be likely not
too difficult to convert that prototype to the later final API.

-Andi

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-16 16:46                         ` Andi Kleen
@ 2017-10-16 22:17                           ` Mathieu Desnoyers
       [not found]                             ` <21865534.42661.1508192263844.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-16 22:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: carlos, Linus Torvalds, Paul E. McKenney, Ben Maurer,
	David Goldblatt, Qi Wang, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Chris Lameter,
	Ingo Molnar, H. Peter Anvin, rostedt, Andrew Morton, Russ

----- On Oct 16, 2017, at 12:46 PM, Andi Kleen andi@firstfloor.org wrote:

>> How you collect, summarize, and analyze that overwhelming evidence
>> is up to you, specific to each change, and difficult to do accurately
>> and with any large measure of statistical confidence. The reviewer
>> has to basically trust you to some degree :-)
> 
> I think Linus' just asked for some working "real world, not micro" code that
> demonstrates use.
> 
> A prototype type implementation of the glibc malloc cache using this may
> be good enough.
> 
> Even if the API still changes slightly later in review I would assume
> the basic concepts will stay the same, so it would be likely not
> too difficult to convert that prototype to the later final API.

In that respect, I have working prototypes of two non-trivial library
projects using rseq within the same process.

Those can be considered as being "early adopters" of rseq, before it
becomes available in glibc.

- liburcu per-cpu flavor prototype [1]
  Interesting bits at
  https://github.com/compudj/userspace-rcu-dev/blob/urcu-percpu/include/urcu/static/urcu-percpu.h
  https://github.com/compudj/userspace-rcu-dev/blob/urcu-percpu/src/urcu-percpu.c
  (it also has its own copy of rseq and cpu-opv helper libraries)

- lttng-ust tracer rseq prototype [2, 3]
  Interesting bits at
  https://github.com/compudj/lttng-ust-dev/blob/rseq-integration-oct-2017/libringbuffer/getcpu.h#L85
  https://github.com/compudj/lttng-ust-dev/blob/rseq-integration-oct-2017/libringbuffer/vatomic.h#L60
  (it also has its own copy of rseq and cpu-opv helper libraries)

They use a slightly updated version of the rseq patchset, which I
plan to push into a new "rseq" tree on kernel.org soon. It takes care
of the comments I received in the past few days.

They end up sharing the "__rseq_abi" TLS weak symbol (initial state of
cpu_id = -1). They lazy-detect whether rseq needs to be registered for
the current thread by checking if the cpu_id read from the rseq TLS
is < 0. If rseq registration fails, they set its value to -2 and won't
try to register again (will use their fallback). When they successfully
register, they setup a pthread_key so rseq is unregistered when the
thread exits.

So far the restrictions I see for libraries using this symbol are:
- They should never be unloaded,
- They should never be loaded with dlopen RTLD_LOCAL flag.

If those are considered acceptable limitations, then we can stick to
the "single rseq TLS per thread" rule, and we don't have to implement
a linked-list of rseq TLS per thread.

When glibc eventually adds support for rseq, I expect it to deal with
rseq TLS registration and unregistration at thread creation/exit.
Therefore, the checks for negative cpu_id performed by lttng-ust and
liburcu will figure out that rseq is already registered, and skip
registration altogether when it's already performed by glibc.

Thoughts ?

Thanks,

Mathieu

[1] https://github.com/compudj/userspace-rcu-dev/tree/urcu-percpu
[2] https://github.com/compudj/lttng-ust-dev/tree/rseq-integration-oct-2017
[3] https://github.com/compudj/lttng-tools-dev/tree/urcu-percpu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <21865534.42661.1508192263844.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]                             ` <21865534.42661.1508192263844.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-10-17 16:19                               ` Ben Maurer
       [not found]                                 ` <CY4PR15MB168879D6220D976B04FE482CCF4C0-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Ben Maurer @ 2017-10-17 16:19 UTC (permalink / raw)
  To: Mathieu Desnoyers, Andi Kleen
  Cc: carlos, Linus Torvalds, Paul E. McKenney, David Goldblatt,
	Qi Wang, Boqun Feng, Peter Zijlstra, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Dave Watson, Josh Triplett, Will Deacon,
	linux-kernel, Thomas Gleixner, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, rostedt, Andrew Morton, Russell King

Hey,

> So far the restrictions I see for libraries using this symbol are:
> - They should never be unloaded,
> - They should never be loaded with dlopen RTLD_LOCAL flag.

We talked a bit about this off-list but I wanted to state publicly that I think this model works well for our use case. Specifically,

(1) It reduces complexity by focusing on the common case -- long term we expect glibc to manage the process of using this feature and registering/deregistering threads for rseq. Unloading isn't a challenge in these situations, so why add the complexity for it?

(2) This still allows for early adopters to use rseq before there is glibc support. I believe the vast majority of real world applications meet these two criteria you've listed. If not, they can create a thin shared library that has the sole purpose of providing the weak symbol and that never gets unloaded

(3) This allows for applications to provide the __rseq_abi so that they can ensure it uses the initial_exec tls model and optimize in-application assembly code for it. This is a good optimization for server applications that tend to statically link.

If others agree with this, would it make sense to remove the concept of reference counting in the system call that defines and redefines the per-thread area? Seems like it would remove complexity.

-b

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <CY4PR15MB168879D6220D976B04FE482CCF4C0-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]                                 ` <CY4PR15MB168879D6220D976B04FE482CCF4C0-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-17 16:33                                   ` Mathieu Desnoyers
       [not found]                                     ` <1292309161.43101.1508258000235.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-10-18  6:22                                   ` Greg KH
  1 sibling, 1 reply; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-17 16:33 UTC (permalink / raw)
  To: Ben Maurer
  Cc: Andi Kleen, carlos, Linus Torvalds, Paul E. McKenney,
	David Goldblatt, Qi Wang, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Chris Lameter,
	Ingo Molnar, H. Peter Anvin, rostedt, Andrew Morton, Russell

----- On Oct 17, 2017, at 12:19 PM, Ben Maurer bmaurer-b10kYP2dOMg@public.gmane.org wrote:

> Hey,
> 
>> So far the restrictions I see for libraries using this symbol are:
>> - They should never be unloaded,
>> - They should never be loaded with dlopen RTLD_LOCAL flag.
> 
> We talked a bit about this off-list but I wanted to state publicly that I think
> this model works well for our use case. Specifically,
> 
> (1) It reduces complexity by focusing on the common case -- long term we expect
> glibc to manage the process of using this feature and registering/deregistering
> threads for rseq. Unloading isn't a challenge in these situations, so why add
> the complexity for it?
> 
> (2) This still allows for early adopters to use rseq before there is glibc
> support. I believe the vast majority of real world applications meet these two
> criteria you've listed. If not, they can create a thin shared library that has
> the sole purpose of providing the weak symbol and that never gets unloaded
> 
> (3) This allows for applications to provide the __rseq_abi so that they can
> ensure it uses the initial_exec tls model and optimize in-application assembly
> code for it. This is a good optimization for server applications that tend to
> statically link.

Agreed with all the above,

> 
> If others agree with this, would it make sense to remove the concept of
> reference counting in the system call that defines and redefines the per-thread
> area? Seems like it would remove complexity.

I have a use-case for keeping the reference counting in place though. It's
use of rseq in signal handlers.

If we have two early-adopter libraries trying to lazy-register rseq, and
one of those libraries can be called within a signal handler (e.g. lttng-ust),
we run into a situation where signal handler could nest on top of the
first library lazy-register (test, branch, register), and race against it.
So having reference counting in place allows the kernel to deal with
those multi-lib use-cases atomically wrt signal handlers from a thread
perspective.

And I don't want to require every early-adopter library to disable signals
just in case some _other_ library would be invoked in a signal handler.

Thoughts ?

Thanks,

Mathieu

> 
> -b

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <1292309161.43101.1508258000235.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]                                     ` <1292309161.43101.1508258000235.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-10-17 16:41                                       ` Ben Maurer
       [not found]                                         ` <CY4PR15MB16886FD43FB48592F3F5892FCF4C0-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Ben Maurer @ 2017-10-17 16:41 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, carlos, Linus Torvalds, Paul E. McKenney,
	David Goldblatt, Qi Wang, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Chris Lameter,
	Ingo Molnar, H. Peter Anvin, rostedt, Andrew Morton

> I have a use-case for keeping the reference counting in place though. It's
> use of rseq in signal handlers.

Would this be solved by saying that the rseq api will return an error if you register and there's already a block registered. In this case the signal handler would register the rseq abi area just as the non-signal code is trying to do the same. The non-signal code would see this error code and realize that its job had been done for it and then go on it's way.

It would be unsafe for signal handler code to *unregister* the area, but I don't think that's necessary.

Basically we'd support a refcount of either 0 or 1, but nothing else.

If a signal handler registers the ABI area, how will it ensure the ABI is cleaned up at thread exit? I can't imagine pthread_key is signal safe.

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <CY4PR15MB16886FD43FB48592F3F5892FCF4C0-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]                                         ` <CY4PR15MB16886FD43FB48592F3F5892FCF4C0-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-17 17:48                                           ` Mathieu Desnoyers
  0 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-17 17:48 UTC (permalink / raw)
  To: Ben Maurer
  Cc: Andi Kleen, carlos, Linus Torvalds, Paul E. McKenney,
	David Goldblatt, Qi Wang, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Chris Lameter,
	Ingo Molnar, H. Peter Anvin, rostedt, Andrew Morton, Russell

----- On Oct 17, 2017, at 12:41 PM, Ben Maurer bmaurer-b10kYP2dOMg@public.gmane.org wrote:

>> I have a use-case for keeping the reference counting in place though. It's
>> use of rseq in signal handlers.
> 
> Would this be solved by saying that the rseq api will return an error if you
> register and there's already a block registered. In this case the signal
> handler would register the rseq abi area just as the non-signal code is trying
> to do the same. The non-signal code would see this error code and realize that
> its job had been done for it and then go on it's way.

Yes, that should work, as long as we return a specific error code, e.g. -EBUSY,
to tell the caller that rseq has actually been registered.

> 
> It would be unsafe for signal handler code to *unregister* the area, but I don't
> think that's necessary.

Right.

> 
> Basically we'd support a refcount of either 0 or 1, but nothing else.

Yep, I'll try this out.

> 
> If a signal handler registers the ABI area, how will it ensure the ABI is
> cleaned up at thread exit? I can't imagine pthread_key is signal safe.

You have a very good point there. This highlights a signal-safety issue
I have in liburcu-bp when used by lttng-ust. pthread_setspecific is
indeed not listed as being signal-safe: it can perform on-demand memory
allocation when a second level array is needed.

I'll have to scratch my head a bit to fix this one.

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]                                 ` <CY4PR15MB168879D6220D976B04FE482CCF4C0-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2017-10-17 16:33                                   ` Mathieu Desnoyers
@ 2017-10-18  6:22                                   ` Greg KH
       [not found]                                     ` <20171018062226.GB18857-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 61+ messages in thread
From: Greg KH @ 2017-10-18  6:22 UTC (permalink / raw)
  To: Ben Maurer
  Cc: Mathieu Desnoyers, Andi Kleen, carlos, Linus Torvalds,
	Paul E. McKenney, David Goldblatt, Qi Wang, Boqun Feng,
	Peter Zijlstra, Paul Turner, Andrew Hunter, Andy Lutomirski,
	Dave Watson, Josh Triplett, Will Deacon, linux-kernel,
	Thomas Gleixner, Chris Lameter, Ingo Molnar, H. Peter Anvin,
	rostedt

On Tue, Oct 17, 2017 at 04:19:41PM +0000, Ben Maurer wrote:
> Hey,
> 
> > So far the restrictions I see for libraries using this symbol are:
> > - They should never be unloaded,
> > - They should never be loaded with dlopen RTLD_LOCAL flag.
> 
> We talked a bit about this off-list but I wanted to state publicly
> that I think this model works well for our use case. Specifically,
> 
> (1) It reduces complexity by focusing on the common case -- long term
> we expect glibc to manage the process of using this feature and
> registering/deregistering threads for rseq. Unloading isn't a
> challenge in these situations, so why add the complexity for it?

You never install a new version of glibc on a running system, and expect
everything to keep running successfully?  Breaking that would not be
good...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <20171018062226.GB18857-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]                                     ` <20171018062226.GB18857-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
@ 2017-10-18 16:28                                       ` Mathieu Desnoyers
  0 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-18 16:28 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Ben Maurer, Andi Kleen, Carlos O'Donell, Linus Torvalds,
	Paul E. McKenney, David Goldblatt, Qi Wang, Boqun Feng,
	Peter Zijlstra, Paul Turner, Andrew Hunter, Andy Lutomirski,
	Dave Watson, Josh Triplett, Will Deacon, linux-kernel,
	Thomas Gleixner, Chris Lameter, Ingo Molnar, H. Peter Anvin,
	rostedt

----- On Oct 18, 2017, at 2:22 AM, Greg Kroah-Hartman gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org wrote:

> On Tue, Oct 17, 2017 at 04:19:41PM +0000, Ben Maurer wrote:
>> Hey,
>> 
>> > So far the restrictions I see for libraries using this symbol are:
>> > - They should never be unloaded,
>> > - They should never be loaded with dlopen RTLD_LOCAL flag.
>> 
>> We talked a bit about this off-list but I wanted to state publicly
>> that I think this model works well for our use case. Specifically,
>> 
>> (1) It reduces complexity by focusing on the common case -- long term
>> we expect glibc to manage the process of using this feature and
>> registering/deregistering threads for rseq. Unloading isn't a
>> challenge in these situations, so why add the complexity for it?
> 
> You never install a new version of glibc on a running system, and expect
> everything to keep running successfully?  Breaking that would not be
> good...

If we share the __rseq_abi TLS weak symbol between glibc, applications,
and early-adopter libraries, we just need those early adopters to
check whether the TLS is already registered (cpu_id field >= 0), and
don't bother doing their lazy registration if it's already been done
for them (either by glibc or by the application).

If either the application or a lazy-registering library gets a EBUSY
errno from rseq registration, it can consider that another library
already performed the registration for them (probably glibc).

As long as early adopter libraries don't expect to be sole users
of __rseq_abi, upgrading to newer glibc should work fine.

But this means we need to get all early adopters to get it right from
the get-go, or things could break when you compose them.

Thoughts ?

Thanks,

Mathieu

> 
> thanks,
> 
> greg k-h

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-13 18:30         ` Linus Torvalds
       [not found]           ` <CA+55aFzPBES0JOYuZhuNM7NKN+G9ytZQT2daueFPw0j9HGpdGQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-14  3:01           ` Andi Kleen
  2017-10-14  4:05             ` Linus Torvalds
  1 sibling, 1 reply; 61+ messages in thread
From: Andi Kleen @ 2017-10-14  3:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben Maurer, Mathieu Desnoyers, David Goldblatt, Qi Wang,
	Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, Linux Kernel Mailing List, Thomas Gleixner,
	Andi Kleen, Chris Lameter, Ingo Molnar, H. Peter Anvin,
	Steven Rostedt

> That sounds so obvious and stupid that you might go "What do you
> mean?", but for things to work for libraries, they have to work
> together with *other* users, and with *independent* users.

As far as I can see the current model fundamentally only works for
one user per process (because there is only a single range and abort IP) 

So once malloc started using it noone else could.

Since malloc is the primary user it would be pointless to ever
try it on something else.

It seems fixing that would complicate everything quite a bit --
all ranges would need to be (unlimited?) arrays.

-Andi

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-14  3:01           ` Andi Kleen
@ 2017-10-14  4:05             ` Linus Torvalds
  2017-10-14 11:37               ` Mathieu Desnoyers
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2017-10-14  4:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ben Maurer, Mathieu Desnoyers, David Goldblatt, Qi Wang,
	Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, Linux Kernel Mailing List, Thomas Gleixner,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Steven Rostedt,
	Andrew Morton

On Fri, Oct 13, 2017 at 8:01 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> As far as I can see the current model fundamentally only works for
> one user per process (because there is only a single range and abort IP)

No, it should work for libraries, you just need to always initialize
the proper start/commit/abort IP's for every transaction. Then
everybody should be fine.

So I _think_ it's all good. But I really would want to see that
actually being the case.

                Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-14  4:05             ` Linus Torvalds
@ 2017-10-14 11:37               ` Mathieu Desnoyers
  0 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-14 11:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Ben Maurer, David Goldblatt, Qi Wang,
	Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Chris Lameter,
	Ingo Molnar, H. Peter Anvin, rostedt, Andrew Morton, Russell King,
	Catalin

----- On Oct 14, 2017, at 12:05 AM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Fri, Oct 13, 2017 at 8:01 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>
>> As far as I can see the current model fundamentally only works for
>> one user per process (because there is only a single range and abort IP)
> 
> No, it should work for libraries, you just need to always initialize
> the proper start/commit/abort IP's for every transaction. Then
> everybody should be fine.

Yes, it does work for libraries. I have used it in my lttng-ust and
liburcu prototypes, which are libraries. LTTng-UST requires at least
two distinct critical sections (reserve and commit). For use in
both executable and multiple libraries, we need each to declare the
struct rseq TLS as a weak symbol, so only one gets picked throughout the
process.

One clarification about your statement above: the user-space fast-path
does not need to initialize much at runtime: one "rseq_cs descriptor"
is created by each rseq_finish assembly section. Each of those is
initialized by the dynamic loader with the proper addresses.

All the user-space fast-path really needs to do is to store the address
to that descriptor into the TLS "rseq_cs" field. It does not even have to
clear it after the critical section: the kernel can do it lazily.

> 
> So I _think_ it's all good. But I really would want to see that
> actually being the case.

There is one other use-case I've been made aware of in the past months:
Will Deacon want to use rseq on aarch64 to read PMU counters on
big.LITTLE to prevent migration and use of an unsupported PMC on a
LITTLE core, which could trigger a fault.

You had a really good point about cpu hotplug by the way. I recently
realize that algorithms that have multiple non-atomic steps may
_require_ to execute a series of steps on the same CPU.
One example is lttng-ust ring buffer: it works on per-cpu buffers,
and does a series of operations: reserve, [write to buffer], commit.
Both reserve and commit can benefit from rseq, but we really need
the commit to happen on the right CPU. Currently, in order to handle
CPU hotplug, lttng-ust allocates CPU buffers for all possible cpus.
If a CPU is hotunplugged between the reserve and commit though, we
would run into a scenario where the "commit" could never be completed
on the right CPU. I've actually prepared a follow-up patch [1]
yesterday that fixes this in the cpu_opv() system call: it detects
situations where the target CPU is possible but not online, prevents
cpu hotplug, grabs a mutex, and performs the requested operation
from whichever CPU it happens to run on.

Those are the kind of use-cases I want to gather more feedback on
before we integrate those system calls for good.

Thanks,

Mathieu

[1] https://github.com/compudj/linux-percpu-dev/commit/b602821e446f7bd8a0a2de44c598f257cf4120f5

> 
>                 Linus

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-12 23:03 ` [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call Mathieu Desnoyers
  2017-10-13  0:36   ` Linus Torvalds
@ 2017-10-13 12:50   ` Florian Weimer
  2017-10-13 13:40     ` Mathieu Desnoyers
  2017-10-18 16:41   ` Ben Maurer
  2 siblings, 1 reply; 61+ messages in thread
From: Florian Weimer @ 2017-10-13 12:50 UTC (permalink / raw)
  To: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Peter Zijlstra,
	Paul Turner, Andrew Hunter, Andy Lutomirski, Dave Watson,
	Josh Triplett, Will Deacon
  Cc: linux-kernel, Thomas Gleixner, Andi Kleen, Chris Lameter,
	Ingo Molnar, H. Peter Anvin, Ben Maurer, Steven Rostedt,
	Linus Torvalds, Andrew Morton, Russell King, Catalin Marinas,
	Michael Kerrisk, Alexander Viro, linux-api

On 10/13/2017 01:03 AM, Mathieu Desnoyers wrote:
> Expose a new system call allowing each thread to register one userspace
> memory area to be used as an ABI between kernel and user-space for two
> purposes: user-space restartable sequences and quick access to read the
> current CPU number value from user-space.
> 
> * Restartable sequences (per-cpu atomics)
> 
> Restartables sequences allow user-space to perform update operations on
> per-cpu data without requiring heavy-weight atomic operations.
> 
> The restartable critical sections (percpu atomics) work has been started
> by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
> critical sections. [1] [2] The re-implementation proposed here brings a
> few simplifications to the ABI which facilitates porting to other
> architectures and speeds up the user-space fast path. A locking-based
> fall-back, purely implemented in user-space, is proposed here to deal
> with debugger single-stepping. This fallback interacts with rseq_start()
> and rseq_finish(), which force retries in response to concurrent
> lock-based activity.

This functionality essentially relies on writable function pointers (or 
pointers to data containing function pointers), right?  Is there a way 
to make this a less attractive target for exploit writers?

Thanks,
Florian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-13 12:50   ` Florian Weimer
@ 2017-10-13 13:40     ` Mathieu Desnoyers
  2017-10-13 13:56       ` Florian Weimer
  0 siblings, 1 reply; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-13 13:40 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Ben Maurer, rostedt,
	Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas <catalin.

----- On Oct 13, 2017, at 8:50 AM, Florian Weimer fweimer@redhat.com wrote:

> On 10/13/2017 01:03 AM, Mathieu Desnoyers wrote:
>> Expose a new system call allowing each thread to register one userspace
>> memory area to be used as an ABI between kernel and user-space for two
>> purposes: user-space restartable sequences and quick access to read the
>> current CPU number value from user-space.
>> 
>> * Restartable sequences (per-cpu atomics)
>> 
>> Restartables sequences allow user-space to perform update operations on
>> per-cpu data without requiring heavy-weight atomic operations.
>> 
>> The restartable critical sections (percpu atomics) work has been started
>> by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
>> critical sections. [1] [2] The re-implementation proposed here brings a
>> few simplifications to the ABI which facilitates porting to other
>> architectures and speeds up the user-space fast path.

This part:

>> A locking-based
>> fall-back, purely implemented in user-space, is proposed here to deal
>> with debugger single-stepping. This fallback interacts with rseq_start()
>> and rseq_finish(), which force retries in response to concurrent
>> lock-based activity.

should have been updated in this series to:

A second system call, cpu_opv(), is proposed as fallback to deal with debugger
single-stepping. cpu_opv() executes a sequence of operations on behalf of
user-space with preemption disabled.

> This functionality essentially relies on writable function pointers (or
> pointers to data containing function pointers), right?  Is there a way
> to make this a less attractive target for exploit writers?

The proposed ABI does not require to store any function pointer. For a given
rseq_finish() critical section, pointers to specific instructions (within a
function) are emitted at link-time into a struct rseq_cs:

struct rseq_cs {
        RSEQ_FIELD_u32_u64(start_ip);
        RSEQ_FIELD_u32_u64(post_commit_ip);
        RSEQ_FIELD_u32_u64(abort_ip);
        uint32_t flags;
} __attribute__((aligned(4 * sizeof(uint64_t))));

Then, at runtime, the fast-path stores the address of that struct rseq_cs
into the TLS struct rseq "rseq_cs" field.

So all we store at runtime is a pointer to data, not a pointer to functions.

But you seem to hint that having a pointer to data containing pointers to code
may still be making it easier for exploit writers. Can you elaborate on the
scenario ?

Thanks,

Mathieu


> 
> Thanks,
> Florian

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-13 13:40     ` Mathieu Desnoyers
@ 2017-10-13 13:56       ` Florian Weimer
  2017-10-13 14:27         ` Mathieu Desnoyers
  0 siblings, 1 reply; 61+ messages in thread
From: Florian Weimer @ 2017-10-13 13:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Ben Maurer, rostedt,
	Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas <catalin.

On 10/13/2017 03:40 PM, Mathieu Desnoyers wrote:
> The proposed ABI does not require to store any function pointer. For a given
> rseq_finish() critical section, pointers to specific instructions (within a
> function) are emitted at link-time into a struct rseq_cs:
> 
> struct rseq_cs {
>          RSEQ_FIELD_u32_u64(start_ip);
>          RSEQ_FIELD_u32_u64(post_commit_ip);
>          RSEQ_FIELD_u32_u64(abort_ip);
>          uint32_t flags;
> } __attribute__((aligned(4 * sizeof(uint64_t))));
> 
> Then, at runtime, the fast-path stores the address of that struct rseq_cs
> into the TLS struct rseq "rseq_cs" field.
> 
> So all we store at runtime is a pointer to data, not a pointer to functions.
> 
> But you seem to hint that having a pointer to data containing pointers to code
> may still be making it easier for exploit writers. Can you elaborate on the
> scenario ?

I'm concerned that the exploit writer writes a totally made up struct 
rseq_cs object into writable memory, along with function pointers, and 
puts the address of that in to the rseq_cs field.

This would be comparable to how C++ vtable pointers are targeted 
(including those in the glibc libio implementation of stdio streams).

Does this answer your questions?

Thanks,
Florian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-13 13:56       ` Florian Weimer
@ 2017-10-13 14:27         ` Mathieu Desnoyers
  2017-10-13 17:24           ` Andy Lutomirski
  0 siblings, 1 reply; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-13 14:27 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Ben Maurer, rostedt,
	Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas <catalin.

----- On Oct 13, 2017, at 9:56 AM, Florian Weimer fweimer@redhat.com wrote:

> On 10/13/2017 03:40 PM, Mathieu Desnoyers wrote:
>> The proposed ABI does not require to store any function pointer. For a given
>> rseq_finish() critical section, pointers to specific instructions (within a
>> function) are emitted at link-time into a struct rseq_cs:
>> 
>> struct rseq_cs {
>>          RSEQ_FIELD_u32_u64(start_ip);
>>          RSEQ_FIELD_u32_u64(post_commit_ip);
>>          RSEQ_FIELD_u32_u64(abort_ip);
>>          uint32_t flags;
>> } __attribute__((aligned(4 * sizeof(uint64_t))));
>> 
>> Then, at runtime, the fast-path stores the address of that struct rseq_cs
>> into the TLS struct rseq "rseq_cs" field.
>> 
>> So all we store at runtime is a pointer to data, not a pointer to functions.
>> 
>> But you seem to hint that having a pointer to data containing pointers to code
>> may still be making it easier for exploit writers. Can you elaborate on the
>> scenario ?
> 
> I'm concerned that the exploit writer writes a totally made up struct
> rseq_cs object into writable memory, along with function pointers, and
> puts the address of that in to the rseq_cs field.
> 
> This would be comparable to how C++ vtable pointers are targeted
> (including those in the glibc libio implementation of stdio streams).
> 
> Does this answer your questions?

Yes, it does. How about we add a "canary" field to the TLS struct rseq, e.g.:

struct rseq {
        union rseq_cpu_event u;
        RSEQ_FIELD_u32_u64(rseq_cs);  -> pointer to struct rseq_cs
        uint32_t flags;
        uint32_t canary;   -> 32 low bits of rseq_cs ^ canary_mask
};

We could then add a "uint32_t canary_mask" argument to sys_rseq, e.g.:

SYSCALL_DEFINE3(rseq, struct rseq __user *, rseq, uint32_t, canary_mask, int, flags);

So a thread which does not care about hardening would simply register its
struct rseq TLS with a canary mask of "0". Nothing changes on the fast-path.

A thread belonging to a process that cares about hardening could use a random
value as canary, and pass it as canary_mask argument to the syscall. The
fast-path could then set the struct rseq "canary" value to
(32-low-bits of rseq_cs) ^ canary_mask just surrounding the critical section,
and set it back to 0 afterward.

In the kernel, whenever the rseq_cs pointer would be loaded, its 32 low bits
would be checked to match (canary ^ canary_mask). If it differs, then the
kernel kills the process with SIGSEGV.

Would that take care of your concern ?

Thanks,

Mathieu

> 
> Thanks,
> Florian

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-13 14:27         ` Mathieu Desnoyers
@ 2017-10-13 17:24           ` Andy Lutomirski
       [not found]             ` <CALCETrXccCp8apoyUJV8kWLOavnFnenZoU-fbb6cOVZvWp-fnA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2017-10-13 17:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Florian Weimer, Paul E. McKenney, Boqun Feng, Peter Zijlstra,
	Paul Turner, Andrew Hunter, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Ben Maurer, rostedt,
	Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas <catalin.ma>

On Fri, Oct 13, 2017 at 7:27 AM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> ----- On Oct 13, 2017, at 9:56 AM, Florian Weimer fweimer@redhat.com wrote:
>
>> On 10/13/2017 03:40 PM, Mathieu Desnoyers wrote:
>>> The proposed ABI does not require to store any function pointer. For a given
>>> rseq_finish() critical section, pointers to specific instructions (within a
>>> function) are emitted at link-time into a struct rseq_cs:
>>>
>>> struct rseq_cs {
>>>          RSEQ_FIELD_u32_u64(start_ip);
>>>          RSEQ_FIELD_u32_u64(post_commit_ip);
>>>          RSEQ_FIELD_u32_u64(abort_ip);
>>>          uint32_t flags;
>>> } __attribute__((aligned(4 * sizeof(uint64_t))));
>>>
>>> Then, at runtime, the fast-path stores the address of that struct rseq_cs
>>> into the TLS struct rseq "rseq_cs" field.
>>>
>>> So all we store at runtime is a pointer to data, not a pointer to functions.
>>>
>>> But you seem to hint that having a pointer to data containing pointers to code
>>> may still be making it easier for exploit writers. Can you elaborate on the
>>> scenario ?
>>
>> I'm concerned that the exploit writer writes a totally made up struct
>> rseq_cs object into writable memory, along with function pointers, and
>> puts the address of that in to the rseq_cs field.
>>
>> This would be comparable to how C++ vtable pointers are targeted
>> (including those in the glibc libio implementation of stdio streams).
>>
>> Does this answer your questions?
>
> Yes, it does. How about we add a "canary" field to the TLS struct rseq, e.g.:
>
> struct rseq {
>         union rseq_cpu_event u;
>         RSEQ_FIELD_u32_u64(rseq_cs);  -> pointer to struct rseq_cs
>         uint32_t flags;
>         uint32_t canary;   -> 32 low bits of rseq_cs ^ canary_mask
> };
>
> We could then add a "uint32_t canary_mask" argument to sys_rseq, e.g.:
>
> SYSCALL_DEFINE3(rseq, struct rseq __user *, rseq, uint32_t, canary_mask, int, flags);
>
> So a thread which does not care about hardening would simply register its
> struct rseq TLS with a canary mask of "0". Nothing changes on the fast-path.
>
> A thread belonging to a process that cares about hardening could use a random
> value as canary, and pass it as canary_mask argument to the syscall. The
> fast-path could then set the struct rseq "canary" value to
> (32-low-bits of rseq_cs) ^ canary_mask just surrounding the critical section,
> and set it back to 0 afterward.
>
> In the kernel, whenever the rseq_cs pointer would be loaded, its 32 low bits
> would be checked to match (canary ^ canary_mask). If it differs, then the
> kernel kills the process with SIGSEGV.
>
> Would that take care of your concern ?
>

I would propose a slightly different solution: have the kernel verify
that it jumps to a code sequence that occurs just after some
highly-unlikely magic bytes in the text *and* that those bytes have
some signature that matches a signature in the struct rseq that's
passed in.

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <CALCETrXccCp8apoyUJV8kWLOavnFnenZoU-fbb6cOVZvWp-fnA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]             ` <CALCETrXccCp8apoyUJV8kWLOavnFnenZoU-fbb6cOVZvWp-fnA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-13 17:53               ` Florian Weimer
       [not found]                 ` <3358e696-43e9-15d3-9634-68e9da79e121-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Florian Weimer @ 2017-10-13 17:53 UTC (permalink / raw)
  To: Andy Lutomirski, Mathieu Desnoyers
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Dave Watson, Josh Triplett, Will Deacon,
	linux-kernel, Thomas Gleixner, Andi Kleen, Chris Lameter,
	Ingo Molnar, H. Peter Anvin, Ben Maurer, rostedt, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas,
	Michael Kerrisk <mtk.

On 10/13/2017 07:24 PM, Andy Lutomirski wrote:
> On Fri, Oct 13, 2017 at 7:27 AM, Mathieu Desnoyers
> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>> ----- On Oct 13, 2017, at 9:56 AM, Florian Weimer fweimer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org wrote:
>>
>>> On 10/13/2017 03:40 PM, Mathieu Desnoyers wrote:
>>>> The proposed ABI does not require to store any function pointer. For a given
>>>> rseq_finish() critical section, pointers to specific instructions (within a
>>>> function) are emitted at link-time into a struct rseq_cs:
>>>>
>>>> struct rseq_cs {
>>>>           RSEQ_FIELD_u32_u64(start_ip);
>>>>           RSEQ_FIELD_u32_u64(post_commit_ip);
>>>>           RSEQ_FIELD_u32_u64(abort_ip);
>>>>           uint32_t flags;
>>>> } __attribute__((aligned(4 * sizeof(uint64_t))));
>>>>
>>>> Then, at runtime, the fast-path stores the address of that struct rseq_cs
>>>> into the TLS struct rseq "rseq_cs" field.
>>>>
>>>> So all we store at runtime is a pointer to data, not a pointer to functions.
>>>>
>>>> But you seem to hint that having a pointer to data containing pointers to code
>>>> may still be making it easier for exploit writers. Can you elaborate on the
>>>> scenario ?
>>>
>>> I'm concerned that the exploit writer writes a totally made up struct
>>> rseq_cs object into writable memory, along with function pointers, and
>>> puts the address of that in to the rseq_cs field.
>>>
>>> This would be comparable to how C++ vtable pointers are targeted
>>> (including those in the glibc libio implementation of stdio streams).
>>>
>>> Does this answer your questions?
>>
>> Yes, it does. How about we add a "canary" field to the TLS struct rseq, e.g.:
>>
>> struct rseq {
>>          union rseq_cpu_event u;
>>          RSEQ_FIELD_u32_u64(rseq_cs);  -> pointer to struct rseq_cs
>>          uint32_t flags;
>>          uint32_t canary;   -> 32 low bits of rseq_cs ^ canary_mask
>> };
>>
>> We could then add a "uint32_t canary_mask" argument to sys_rseq, e.g.:
>>
>> SYSCALL_DEFINE3(rseq, struct rseq __user *, rseq, uint32_t, canary_mask, int, flags);
>>
>> So a thread which does not care about hardening would simply register its
>> struct rseq TLS with a canary mask of "0". Nothing changes on the fast-path.
>>
>> A thread belonging to a process that cares about hardening could use a random
>> value as canary, and pass it as canary_mask argument to the syscall. The
>> fast-path could then set the struct rseq "canary" value to
>> (32-low-bits of rseq_cs) ^ canary_mask just surrounding the critical section,
>> and set it back to 0 afterward.
>>
>> In the kernel, whenever the rseq_cs pointer would be loaded, its 32 low bits
>> would be checked to match (canary ^ canary_mask). If it differs, then the
>> kernel kills the process with SIGSEGV.
>>
>> Would that take care of your concern ?
>>
> 
> I would propose a slightly different solution: have the kernel verify
> that it jumps to a code sequence that occurs just after some
> highly-unlikely magic bytes in the text *and* that those bytes have
> some signature that matches a signature in the struct rseq that's
> passed in.

And the signature is fixed at the time of the rseq syscall?

Yes, that would be far more reliable.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <3358e696-43e9-15d3-9634-68e9da79e121-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]                 ` <3358e696-43e9-15d3-9634-68e9da79e121-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-10-13 18:17                   ` Andy Lutomirski
       [not found]                     ` <CALCETrVWZxC=mT9p7HTrAwcAdMzaxwa=A-O0uQt79qy1Cpky_g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2017-10-13 18:17 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Peter Zijlstra,
	Paul Turner, Andrew Hunter, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Ben Maurer, rostedt,
	Linus Torvalds, Andrew Morton, Russell King, Catalin

On Fri, Oct 13, 2017 at 10:53 AM, Florian Weimer <fweimer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 10/13/2017 07:24 PM, Andy Lutomirski wrote:
>>
>> On Fri, Oct 13, 2017 at 7:27 AM, Mathieu Desnoyers
>> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>>>
>>> ----- On Oct 13, 2017, at 9:56 AM, Florian Weimer fweimer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
>>> wrote:
>>>
>>>> On 10/13/2017 03:40 PM, Mathieu Desnoyers wrote:
>>>>>
>>>>> The proposed ABI does not require to store any function pointer. For a
>>>>> given
>>>>> rseq_finish() critical section, pointers to specific instructions
>>>>> (within a
>>>>> function) are emitted at link-time into a struct rseq_cs:
>>>>>
>>>>> struct rseq_cs {
>>>>>           RSEQ_FIELD_u32_u64(start_ip);
>>>>>           RSEQ_FIELD_u32_u64(post_commit_ip);
>>>>>           RSEQ_FIELD_u32_u64(abort_ip);
>>>>>           uint32_t flags;
>>>>> } __attribute__((aligned(4 * sizeof(uint64_t))));
>>>>>
>>>>> Then, at runtime, the fast-path stores the address of that struct
>>>>> rseq_cs
>>>>> into the TLS struct rseq "rseq_cs" field.
>>>>>
>>>>> So all we store at runtime is a pointer to data, not a pointer to
>>>>> functions.
>>>>>
>>>>> But you seem to hint that having a pointer to data containing pointers
>>>>> to code
>>>>> may still be making it easier for exploit writers. Can you elaborate on
>>>>> the
>>>>> scenario ?
>>>>
>>>>
>>>> I'm concerned that the exploit writer writes a totally made up struct
>>>> rseq_cs object into writable memory, along with function pointers, and
>>>> puts the address of that in to the rseq_cs field.
>>>>
>>>> This would be comparable to how C++ vtable pointers are targeted
>>>> (including those in the glibc libio implementation of stdio streams).
>>>>
>>>> Does this answer your questions?
>>>
>>>
>>> Yes, it does. How about we add a "canary" field to the TLS struct rseq,
>>> e.g.:
>>>
>>> struct rseq {
>>>          union rseq_cpu_event u;
>>>          RSEQ_FIELD_u32_u64(rseq_cs);  -> pointer to struct rseq_cs
>>>          uint32_t flags;
>>>          uint32_t canary;   -> 32 low bits of rseq_cs ^ canary_mask
>>> };
>>>
>>> We could then add a "uint32_t canary_mask" argument to sys_rseq, e.g.:
>>>
>>> SYSCALL_DEFINE3(rseq, struct rseq __user *, rseq, uint32_t, canary_mask,
>>> int, flags);
>>>
>>> So a thread which does not care about hardening would simply register its
>>> struct rseq TLS with a canary mask of "0". Nothing changes on the
>>> fast-path.
>>>
>>> A thread belonging to a process that cares about hardening could use a
>>> random
>>> value as canary, and pass it as canary_mask argument to the syscall. The
>>> fast-path could then set the struct rseq "canary" value to
>>> (32-low-bits of rseq_cs) ^ canary_mask just surrounding the critical
>>> section,
>>> and set it back to 0 afterward.
>>>
>>> In the kernel, whenever the rseq_cs pointer would be loaded, its 32 low
>>> bits
>>> would be checked to match (canary ^ canary_mask). If it differs, then the
>>> kernel kills the process with SIGSEGV.
>>>
>>> Would that take care of your concern ?
>>>
>>
>> I would propose a slightly different solution: have the kernel verify
>> that it jumps to a code sequence that occurs just after some
>> highly-unlikely magic bytes in the text *and* that those bytes have
>> some signature that matches a signature in the struct rseq that's
>> passed in.
>
>
> And the signature is fixed at the time of the rseq syscall?

The point of the signature is to prevent an rseq landing pad from
being used out of context.  Actually getting the details right might
be tricky.

>
> Yes, that would be far more reliable.
>
> Thanks,
> Florian



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <CALCETrVWZxC=mT9p7HTrAwcAdMzaxwa=A-O0uQt79qy1Cpky_g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]                     ` <CALCETrVWZxC=mT9p7HTrAwcAdMzaxwa=A-O0uQt79qy1Cpky_g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-14 11:53                       ` Mathieu Desnoyers
  0 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-14 11:53 UTC (permalink / raw)
  To: Andy Lutomirski, Florian Weimer
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Dave Watson, Josh Triplett, Will Deacon,
	linux-kernel, Thomas Gleixner, Andi Kleen, Chris Lameter,
	Ingo Molnar, H. Peter Anvin, Ben Maurer, rostedt, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas,
	Michael Kerrisk <mtk.

----- On Oct 13, 2017, at 2:17 PM, Andy Lutomirski luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org wrote:

> On Fri, Oct 13, 2017 at 10:53 AM, Florian Weimer <fweimer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> On 10/13/2017 07:24 PM, Andy Lutomirski wrote:
>>>
>>> On Fri, Oct 13, 2017 at 7:27 AM, Mathieu Desnoyers
>>> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>>>>
>>>> ----- On Oct 13, 2017, at 9:56 AM, Florian Weimer fweimer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
>>>> wrote:
>>>>
>>>>> On 10/13/2017 03:40 PM, Mathieu Desnoyers wrote:
>>>>>>
>>>>>> The proposed ABI does not require to store any function pointer. For a
>>>>>> given
>>>>>> rseq_finish() critical section, pointers to specific instructions
>>>>>> (within a
>>>>>> function) are emitted at link-time into a struct rseq_cs:
>>>>>>
>>>>>> struct rseq_cs {
>>>>>>           RSEQ_FIELD_u32_u64(start_ip);
>>>>>>           RSEQ_FIELD_u32_u64(post_commit_ip);
>>>>>>           RSEQ_FIELD_u32_u64(abort_ip);
>>>>>>           uint32_t flags;
>>>>>> } __attribute__((aligned(4 * sizeof(uint64_t))));
>>>>>>
>>>>>> Then, at runtime, the fast-path stores the address of that struct
>>>>>> rseq_cs
>>>>>> into the TLS struct rseq "rseq_cs" field.
>>>>>>
>>>>>> So all we store at runtime is a pointer to data, not a pointer to
>>>>>> functions.
>>>>>>
>>>>>> But you seem to hint that having a pointer to data containing pointers
>>>>>> to code
>>>>>> may still be making it easier for exploit writers. Can you elaborate on
>>>>>> the
>>>>>> scenario ?
>>>>>
>>>>>
>>>>> I'm concerned that the exploit writer writes a totally made up struct
>>>>> rseq_cs object into writable memory, along with function pointers, and
>>>>> puts the address of that in to the rseq_cs field.
>>>>>
>>>>> This would be comparable to how C++ vtable pointers are targeted
>>>>> (including those in the glibc libio implementation of stdio streams).
>>>>>
>>>>> Does this answer your questions?
>>>>
>>>>
>>>> Yes, it does. How about we add a "canary" field to the TLS struct rseq,
>>>> e.g.:
>>>>
>>>> struct rseq {
>>>>          union rseq_cpu_event u;
>>>>          RSEQ_FIELD_u32_u64(rseq_cs);  -> pointer to struct rseq_cs
>>>>          uint32_t flags;
>>>>          uint32_t canary;   -> 32 low bits of rseq_cs ^ canary_mask
>>>> };
>>>>
>>>> We could then add a "uint32_t canary_mask" argument to sys_rseq, e.g.:
>>>>
>>>> SYSCALL_DEFINE3(rseq, struct rseq __user *, rseq, uint32_t, canary_mask,
>>>> int, flags);
>>>>
>>>> So a thread which does not care about hardening would simply register its
>>>> struct rseq TLS with a canary mask of "0". Nothing changes on the
>>>> fast-path.
>>>>
>>>> A thread belonging to a process that cares about hardening could use a
>>>> random
>>>> value as canary, and pass it as canary_mask argument to the syscall. The
>>>> fast-path could then set the struct rseq "canary" value to
>>>> (32-low-bits of rseq_cs) ^ canary_mask just surrounding the critical
>>>> section,
>>>> and set it back to 0 afterward.
>>>>
>>>> In the kernel, whenever the rseq_cs pointer would be loaded, its 32 low
>>>> bits
>>>> would be checked to match (canary ^ canary_mask). If it differs, then the
>>>> kernel kills the process with SIGSEGV.
>>>>
>>>> Would that take care of your concern ?
>>>>
>>>
>>> I would propose a slightly different solution: have the kernel verify
>>> that it jumps to a code sequence that occurs just after some
>>> highly-unlikely magic bytes in the text *and* that those bytes have
>>> some signature that matches a signature in the struct rseq that's
>>> passed in.
>>
>>
>> And the signature is fixed at the time of the rseq syscall?
> 
> The point of the signature is to prevent an rseq landing pad from
> being used out of context.  Actually getting the details right might
> be tricky.

So my understanding is that we want to prevent an attacker that
controls the stack to easily use rseq to trick the kernel into
branching into an arbitrary pre-existing executable address in
the process.

I like the idea of putting a signature just before the abort_ip
landing address and having it checked by the kernel. We could start
by using a fixed hardcoded signature for now, and pass the
signature value to the kernel when registering rseq. This would
eventually allow a process to use a randomized signature if we
figure out it's needed in the future.

I don't see how placing this signature in struct rseq TLS area
is a good idea: an attacker could then just overwrite that value
so it matches whatever code is before the branch target it wishes
to branch to.

I also don't get how having the signature in the struct rseq_cs
(restartable sequence descriptor) alongside with start/end/abort
ip can be useful. Typically, an attacker would put its fake structure
either on the stack, in data, or in rw memory, and make sure it
uses the right signature in there. In the end, we don't really care
whether the user ends up controlling the content of a struct rseq_cs,
what we really care about is that it does not make the kernel branch
to a pre-existing executable code address of its choosing.

So having the kernel validate a signature placed just before the
abort_ip should be enough for hardening purposes.

Thoughts ?

Thanks,

Mathieu


> 
>>
>> Yes, that would be far more reliable.
>>
>> Thanks,
>> Florian
> 
> 
> 
> --
> Andy Lutomirski
> AMA Capital Management, LLC

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-12 23:03 ` [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call Mathieu Desnoyers
  2017-10-13  0:36   ` Linus Torvalds
  2017-10-13 12:50   ` Florian Weimer
@ 2017-10-18 16:41   ` Ben Maurer
       [not found]     ` <CY4PR15MB1688286D6B1283A1C234BAE6CF4E0-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2 siblings, 1 reply; 61+ messages in thread
From: Ben Maurer @ 2017-10-18 16:41 UTC (permalink / raw)
  To: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Peter Zijlstra,
	Paul Turner, Andrew Hunter, Andy Lutomirski, Dave Watson,
	Josh Triplett, Will Deacon
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Steven Rostedt,
	Linus Torvalds, Andrew Morton, Russell King, Catalin Marinas,
	Michael Kerrisk, Alexander Viro, linux-api@vger.kernel.org

> The layout of struct rseq_cs is as follows:

> start_ip
> Instruction pointer address of the first instruction of the
> sequence of consecutive assembly instructions.

> post_commit_ip
> Instruction pointer address after the last  instruction  of
>  the sequence of consecutive assembly instructions.

>  abort_ip
> Instruction  pointer  address  where  to move the execution
>  flow in case of abort of the sequence of consecutive assem‐
>  bly instructions.

Really minor performance performance thought here.

1) In the kernel at context switch time you'd need code like:

if (ip >= start_ip && ip <= post_commit_ip)

This branch would be hard to predict because most instruction pointers would be either before or after. If post_commit_ip were relative to start_ip you could do this:

if (ip - start_ip <= post_commit_offset)

which is a single branch that would be more predictable.

2) In a shared library a rseq_cs structure would have to be relocated at runtime because at compilation time the final address of the library wouldn't be known. I'm not sure if this is important enough to address, but it could be solved by making the pointers relative to the address of rseq_cs. But this would make for an uglier API.

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <CY4PR15MB1688286D6B1283A1C234BAE6CF4E0-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]     ` <CY4PR15MB1688286D6B1283A1C234BAE6CF4E0-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-18 18:11       ` Mathieu Desnoyers
       [not found]         ` <515879378.43966.1508350299712.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-18 18:11 UTC (permalink / raw)
  To: Ben Maurer
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, rostedt,
	Linus Torvalds, Andrew Morton, Russell King, Catalin Marinas,
	Michael

----- On Oct 18, 2017, at 12:41 PM, Ben Maurer bmaurer-b10kYP2dOMg@public.gmane.org wrote:

>> The layout of struct rseq_cs is as follows:
> 
>> start_ip
>> Instruction pointer address of the first instruction of the
>> sequence of consecutive assembly instructions.
> 
>> post_commit_ip
>> Instruction pointer address after the last  instruction  of
>>  the sequence of consecutive assembly instructions.
> 
>>  abort_ip
>> Instruction  pointer  address  where  to move the execution
>>  flow in case of abort of the sequence of consecutive assem‐
>>  bly instructions.
> 
> Really minor performance performance thought here.
> 
> 1) In the kernel at context switch time you'd need code like:
> 
> if (ip >= start_ip && ip <= post_commit_ip)
> 
> This branch would be hard to predict because most instruction pointers would be
> either before or after. If post_commit_ip were relative to start_ip you could
> do this:
> 
> if (ip - start_ip <= post_commit_offset)
> 
> which is a single branch that would be more predictable.

The actual context switch code only sets the "t->rseq_preempt"
flags and TIF_NOTIFY_RESUME. The user-accesses happen when
returning to user-space with TIF_NOTIFY_RESUME set.

Indeed, we can expect most context switch out of a registered
rseq thread to trigger one __rseq_handle_notify_resume on
return to user-space for that thread.

As you point out, the "common" case is *not* nested over a
critical section. This means t->rseq->rseq_cs is NULL.
This effectively means post_commit_ip and start_ip are NULL in
rseq_ip_fixup when compared to the current ip.

The check is currently implemented like this:

        /* Handle potentially not being within a critical section. */
        if ((void __user *)instruction_pointer(regs) >= post_commit_ip ||
                        (void __user *)instruction_pointer(regs) < start_ip)
                return 1;

So if non-nested over c.s., the first branch is ip >= NULL, which turns
out to be true, and we therefore return 1 from rseq_ip_fixup.

I suspect that we'd need to cast those pointers to (unsigned long) to
be strictly C standard compliant.

If we instead use "post_commit_offset" relative to start_ip, a non-nested
common case would have start_ip = NULL, post_commit_offset = 0. The check
you propose for not being nested over a c.s. would look like:

if (!((long)ip - (long)start_ip <= (long)post_commit_offset))
   return 1;

This introduces an issue here: if "ip" is lower than "start_ip", we
can incorrectly think we are in a critical section, when we are in
fact not.

With the previous approach proposed by Paul Turner, this was not an
issue, because he was setting the equivalent of the rseq_cs pointer
back to NULL at the end of the assembly fast-path. However, I have
a fast-path optimization that only sets the rseq_cs pointer at the
beginning of the fast-path, without clearing it afterward. It's up
to the following critical section to overwrite the rseq_cs pointer,
or to the kernel to set it back to NULL if it finds out that it is
preempting/delivering a signal over an instruction pointer outside
of the current rseq_cs start_ip/post_commit_ip range (lazy clear).

Moreover, this modification would add a subtraction on the common case
(ip - start_ip), and makes the ABI slightly uglier.

> 
> 2) In a shared library a rseq_cs structure would have to be relocated at runtime
> because at compilation time the final address of the library wouldn't be known.
> I'm not sure if this is important enough to address, but it could be solved by
> making the pointers relative to the address of rseq_cs. But this would make for
> an uglier API.

If I understand well, you are proposing to speed up .so load time by
means of removing relocations of pointers within rseq_cs, done by
making those relative to the rseq_cs address.

So the downside here is extra arithmetic operations on resume to
userspace (__rseq_handle_notify_resume): the kernel would have
to calculate the offset of start_ip and post_commit_ip from the
address of rseq_cs. Sure, we're only talking about two additions
there, but I don't think marginally speeding up library load
justifies extra work in that kernel path, nor the uglier ABI.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <515879378.43966.1508350299712.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]         ` <515879378.43966.1508350299712.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-10-19 11:35           ` Mathieu Desnoyers
  2017-10-19 17:01             ` Florian Weimer
  2017-10-23 17:30           ` Ben Maurer
  1 sibling, 1 reply; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-19 11:35 UTC (permalink / raw)
  To: Ben Maurer, Carlos O'Donell, Florian Weimer
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, rostedt,
	Linus Torvalds, Andrew Morton, Russell King, Catalin Marinas,
	Michael

Speaking of optimization, I think the rseq.c helper library
(and eventually glibc) should define the __rseq_abi TLS
variable with __attribute__((tls_model("initial-exec"))).
It provides faster, and signal-safe, accesses to the TLS
variable from libraries.

The idea you were suggesting where the application could
override the glibc symbol with its own just ends up enforcing
an initial-exec model, but it looks like we can do this
directly from the library.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
  2017-10-19 11:35           ` Mathieu Desnoyers
@ 2017-10-19 17:01             ` Florian Weimer
  0 siblings, 0 replies; 61+ messages in thread
From: Florian Weimer @ 2017-10-19 17:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ben Maurer, Carlos O'Donell, Paul E. McKenney, Boqun Feng,
	Peter Zijlstra, Paul Turner, Andrew Hunter, Andy Lutomirski,
	Dave Watson, Josh Triplett, Will Deacon, linux-kernel,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, rostedt, Linus Torvalds, Andrew Morton,
	Russell King

* Mathieu Desnoyers:

> Speaking of optimization, I think the rseq.c helper library
> (and eventually glibc) should define the __rseq_abi TLS
> variable with __attribute__((tls_model("initial-exec"))).
> It provides faster, and signal-safe, accesses to the TLS
> variable from libraries.
>
> The idea you were suggesting where the application could
> override the glibc symbol with its own just ends up enforcing
> an initial-exec model, but it looks like we can do this
> directly from the library.

This really depends on how the programming model turns out once
multiple libraries are involved.  I think it's premature to discuss
such details.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]         ` <515879378.43966.1508350299712.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-10-19 11:35           ` Mathieu Desnoyers
@ 2017-10-23 17:30           ` Ben Maurer
       [not found]             ` <CY4PR15MB16888F91F41A4A1D322C102CCF460-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  1 sibling, 1 reply; 61+ messages in thread
From: Ben Maurer @ 2017-10-23 17:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, rostedt,
	Linus Torvalds, Andrew Morton, Russell King, Catalin Marinas,
	Michael

> if (!((long)ip - (long)start_ip <= (long)post_commit_offset))
>   return 1;

> This introduces an issue here: if "ip" is lower than "start_ip", we
> can incorrectly think we are in a critical section, when we are in
> fact not.

This shouldn't be an issue if we used unsigned numbers. Eg if start_ip is X and post_commit_offset is L, then (ip - X <= L) means that if ip is less than X ip - X will be signed, which will become a large unsigned value. 

> or to the kernel to set it back to NULL if it finds out that it is
> preempting/delivering a signal over an instruction pointer outside
> of the current rseq_cs start_ip/post_commit_ip range (lazy clear).

I see, lazy clear makes sense. Still, if during most execution periods the user code enters some rseq section (likely if rseq is used for something like malloc) on every context switch this code will have to be run.

> Moreover, this modification would add a subtraction on the common case
> (ip - start_ip), and makes the ABI slightly uglier.

We could benchmark it but the subtraction should be similar in cost to the extra comparison but reducing the number of branches seems like it will help as well. FWIW GCC attempts to translate this kind of sequence to a subtract and compare: https://godbolt.org/g/5DGLvo.

I agree the ABI is uglier, but since we're mucking with every context switch I thought I'd point it out.

> If I understand well, you are proposing to speed up .so load time by
> means of removing relocations of pointers within rseq_cs, done by
> making those relative to the rseq_cs address.

Yeah, I think this may be overkill as optimization.

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <CY4PR15MB16888F91F41A4A1D322C102CCF460-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>]

* Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
       [not found]             ` <CY4PR15MB16888F91F41A4A1D322C102CCF460-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-23 20:44               ` Mathieu Desnoyers
  0 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-23 20:44 UTC (permalink / raw)
  To: Ben Maurer
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, rostedt,
	Linus Torvalds, Andrew Morton, Russell King, Catalin Marinas,
	Michael

----- On Oct 23, 2017, at 7:30 PM, Ben Maurer bmaurer-b10kYP2dOMg@public.gmane.org wrote:

>> if (!((long)ip - (long)start_ip <= (long)post_commit_offset))
>>   return 1;
> 
>> This introduces an issue here: if "ip" is lower than "start_ip", we
>> can incorrectly think we are in a critical section, when we are in
>> fact not.
> 
> This shouldn't be an issue if we used unsigned numbers. Eg if start_ip is X and
> post_commit_offset is L, then (ip - X <= L) means that if ip is less than X ip
> - X will be signed, which will become a large unsigned value.
> 
>> or to the kernel to set it back to NULL if it finds out that it is
>> preempting/delivering a signal over an instruction pointer outside
>> of the current rseq_cs start_ip/post_commit_ip range (lazy clear).
> 
> I see, lazy clear makes sense. Still, if during most execution periods the user
> code enters some rseq section (likely if rseq is used for something like
> malloc) on every context switch this code will have to be run.
> 
>> Moreover, this modification would add a subtraction on the common case
>> (ip - start_ip), and makes the ABI slightly uglier.
> 
> We could benchmark it but the subtraction should be similar in cost to the extra
> comparison but reducing the number of branches seems like it will help as well.
> FWIW GCC attempts to translate this kind of sequence to a subtract and compare:
> https://godbolt.org/g/5DGLvo.
> 
> I agree the ABI is uglier, but since we're mucking with every context switch I
> thought I'd point it out.

Thanks for following up on this. I did not initially realize the importance
of doing the unsigned comparison. I've pushed a commit in my private dev branch
implementing your suggestion.

https://github.com/compudj/linux-percpu-dev/commit/4cf8e9104636b51741c0118f2c88519e3acab7aa

Thanks!

Mathieu

> 
>> If I understand well, you are proposing to speed up .so load time by
>> means of removing relocations of pointers within rseq_cs, done by
>> making those relative to the rseq_cs address.
> 
> Yeah, I think this may be overkill as optimization.

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [RFC PATCH for 4.15 02/14] tracing: instrument restartable sequences
       [not found] <20171012230326.19984-1-mathieu.desnoyers@efficios.com>
  2017-10-12 23:03 ` [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call Mathieu Desnoyers
@ 2017-10-12 23:03 ` Mathieu Desnoyers
  2017-10-12 23:03 ` [RFC PATCH for 4.15 03/14] Restartable sequences: ARM 32 architecture support Mathieu Desnoyers
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-12 23:03 UTC (permalink / raw)
  To: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon
  Cc: linux-kernel, Mathieu Desnoyers, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Ben Maurer,
	Steven Rostedt, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Michael Kerrisk, linux-api

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 MAINTAINERS                 |  1 +
 include/trace/events/rseq.h | 64 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/rseq.c               |  7 +++++
 3 files changed, 72 insertions(+)
 create mode 100644 include/trace/events/rseq.h

diff --git a/MAINTAINERS b/MAINTAINERS
index f05c526fe1e8..9d6a830a8c32 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11233,6 +11233,7 @@ L:	linux-kernel@vger.kernel.org
 S:	Supported
 F:	kernel/rseq.c
 F:	include/uapi/linux/rseq.h
+F:	include/trace/events/rseq.h
 
 RFKILL
 M:	Johannes Berg <johannes@sipsolutions.net>
diff --git a/include/trace/events/rseq.h b/include/trace/events/rseq.h
new file mode 100644
index 000000000000..63a8eb7d553d
--- /dev/null
+++ b/include/trace/events/rseq.h
@@ -0,0 +1,64 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM rseq
+
+#if !defined(_TRACE_RSEQ_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_RSEQ_H
+
+#include <linux/tracepoint.h>
+#include <linux/types.h>
+
+TRACE_EVENT(rseq_update,
+
+	TP_PROTO(struct task_struct *t),
+
+	TP_ARGS(t),
+
+	TP_STRUCT__entry(
+		__field(s32, cpu_id)
+		__field(u32, event_counter)
+	),
+
+	TP_fast_assign(
+		__entry->cpu_id = raw_smp_processor_id();
+		__entry->event_counter = t->rseq_event_counter;
+	),
+
+	TP_printk("cpu_id=%d event_counter=%u",
+		__entry->cpu_id, __entry->event_counter)
+);
+
+TRACE_EVENT(rseq_ip_fixup,
+
+	TP_PROTO(void __user *regs_ip, void __user *start_ip,
+		void __user *post_commit_ip, void __user *abort_ip,
+		u32 kevcount, int ret),
+
+	TP_ARGS(regs_ip, start_ip, post_commit_ip, abort_ip, kevcount, ret),
+
+	TP_STRUCT__entry(
+		__field(void __user *, regs_ip)
+		__field(void __user *, start_ip)
+		__field(void __user *, post_commit_ip)
+		__field(void __user *, abort_ip)
+		__field(u32, kevcount)
+		__field(int, ret)
+	),
+
+	TP_fast_assign(
+		__entry->regs_ip = regs_ip;
+		__entry->start_ip = start_ip;
+		__entry->post_commit_ip = post_commit_ip;
+		__entry->abort_ip = abort_ip;
+		__entry->kevcount = kevcount;
+		__entry->ret = ret;
+	),
+
+	TP_printk("regs_ip=%p start_ip=%p post_commit_ip=%p abort_ip=%p kevcount=%u ret=%d",
+		__entry->regs_ip, __entry->start_ip, __entry->post_commit_ip,
+		__entry->abort_ip, __entry->kevcount, __entry->ret)
+);
+
+#endif /* _TRACE_SOCK_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 706a83bd885c..31582e5ff3be 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -32,6 +32,9 @@
 #include <linux/types.h>
 #include <asm/ptrace.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/rseq.h>
+
 /*
  * The restartable sequences mechanism is the overlap of two distinct
  * restart mechanisms: a sequence counter tracking preemption and signal
@@ -139,6 +142,7 @@ static bool rseq_update_cpu_id_event_counter(struct task_struct *t,
 			t->rseq_event_counter;
 	if (__put_user(u.v, &t->rseq->u.v))
 		return false;
+	trace_rseq_update(t);
 	return true;
 }
 
@@ -230,6 +234,9 @@ static int rseq_ip_fixup(struct pt_regs *regs)
 
 	ret = rseq_get_rseq_cs(t, &start_ip, &post_commit_ip, &abort_ip,
 			&cs_flags);
+	trace_rseq_ip_fixup((void __user *)instruction_pointer(regs),
+		start_ip, post_commit_ip, abort_ip, t->rseq_event_counter,
+		ret);
 	if (!ret)
 		return -EFAULT;
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH for 4.15 03/14] Restartable sequences: ARM 32 architecture support
       [not found] <20171012230326.19984-1-mathieu.desnoyers@efficios.com>
  2017-10-12 23:03 ` [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call Mathieu Desnoyers
  2017-10-12 23:03 ` [RFC PATCH for 4.15 02/14] tracing: instrument restartable sequences Mathieu Desnoyers
@ 2017-10-12 23:03 ` Mathieu Desnoyers
  2017-10-12 23:03 ` [RFC PATCH for 4.15 04/14] Restartable sequences: wire up ARM 32 system call Mathieu Desnoyers
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-12 23:03 UTC (permalink / raw)
  To: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon
  Cc: linux-kernel, Mathieu Desnoyers, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	Ben Maurer, Steven Rostedt, Linus Torvalds, Andrew Morton,
	linux-api

Call the rseq_handle_notify_resume() function on return to
userspace if TIF_NOTIFY_RESUME thread flag is set.

Increment the event counter and perform fixup on the pre-signal frame
when a signal is delivered on top of a restartable sequence critical
section.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 arch/arm/Kconfig         | 1 +
 arch/arm/kernel/signal.c | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 61a0cb15067e..85bc5d8de3eb 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -86,6 +86,7 @@ config ARM
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_RCU_TABLE_FREE if (SMP && ARM_LPAE)
 	select HAVE_REGS_AND_STACK_ACCESS_API
+	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UID16
 	select HAVE_VIRT_CPU_ACCOUNTING_GEN
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index 5814298ef0b7..7de5df4ba6ec 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -517,6 +517,12 @@ static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 	int ret;
 
 	/*
+	 * Increment event counter and perform fixup for the pre-signal
+	 * frame.
+	 */
+	rseq_signal_deliver(regs);
+
+	/*
 	 * Set up the stack frame
 	 */
 	if (ksig->ka.sa.sa_flags & SA_SIGINFO)
@@ -636,6 +642,7 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 			} else {
 				clear_thread_flag(TIF_NOTIFY_RESUME);
 				tracehook_notify_resume(regs);
+				rseq_handle_notify_resume(regs);
 			}
 		}
 		local_irq_disable();
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH for 4.15 04/14] Restartable sequences: wire up ARM 32 system call
       [not found] <20171012230326.19984-1-mathieu.desnoyers@efficios.com>
                   ` (2 preceding siblings ...)
  2017-10-12 23:03 ` [RFC PATCH for 4.15 03/14] Restartable sequences: ARM 32 architecture support Mathieu Desnoyers
@ 2017-10-12 23:03 ` Mathieu Desnoyers
  2017-10-12 23:03 ` [RFC PATCH for 4.15 05/14] Restartable sequences: x86 32/64 architecture support Mathieu Desnoyers
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-12 23:03 UTC (permalink / raw)
  To: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon
  Cc: linux-kernel, Mathieu Desnoyers, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	Ben Maurer, Steven Rostedt, Linus Torvalds, Andrew Morton,
	linux-api

Wire up the rseq system call on 32-bit ARM.

This provides an ABI improving the speed of a user-space getcpu
operation on ARM by skipping the getcpu system call on the fast path, as
well as improving the speed of user-space operations on per-cpu data
compared to using load-linked/store-conditional.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 arch/arm/tools/syscall.tbl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 0bb0e9c6376c..fbc74b5fa3ed 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -412,3 +412,4 @@
 395	common	pkey_alloc		sys_pkey_alloc
 396	common	pkey_free		sys_pkey_free
 397	common	statx			sys_statx
+398	common	rseq			sys_rseq
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH for 4.15 05/14] Restartable sequences: x86 32/64 architecture support
       [not found] <20171012230326.19984-1-mathieu.desnoyers@efficios.com>
                   ` (3 preceding siblings ...)
  2017-10-12 23:03 ` [RFC PATCH for 4.15 04/14] Restartable sequences: wire up ARM 32 system call Mathieu Desnoyers
@ 2017-10-12 23:03 ` Mathieu Desnoyers
  2017-10-12 23:03 ` [RFC PATCH for 4.15 06/14] Restartable sequences: wire up x86 32/64 system call Mathieu Desnoyers
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-12 23:03 UTC (permalink / raw)
  To: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon
  Cc: linux-kernel, Mathieu Desnoyers, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, Steven Rostedt, Linus Torvalds,
	Andrew Morton, linux-api

Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.

Increment the event counter and perform fixup on the pre-signal frame
when a signal is delivered on top of a restartable sequence critical
section.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 arch/x86/Kconfig         | 1 +
 arch/x86/entry/common.c  | 1 +
 arch/x86/kernel/signal.c | 6 ++++++
 3 files changed, 8 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 323cb065be5e..b2ce9970bc9c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -170,6 +170,7 @@ config X86
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE		if X86_64 && FRAME_POINTER && STACK_VALIDATION
 	select HAVE_STACK_VALIDATION		if X86_64
+	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
 	select HAVE_USER_RETURN_NOTIFIER
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index cdefcfdd9e63..2077086ecb94 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -159,6 +159,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_NOTIFY_RESUME) {
 			clear_thread_flag(TIF_NOTIFY_RESUME);
 			tracehook_notify_resume(regs);
+			rseq_handle_notify_resume(regs);
 		}
 
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index cc30a74e4adb..61adc59cde3f 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -686,6 +686,12 @@ setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 	sigset_t *set = sigmask_to_save();
 	compat_sigset_t *cset = (compat_sigset_t *) set;
 
+	/*
+	 * Increment event counter and perform fixup for the pre-signal
+	 * frame.
+	 */
+	rseq_signal_deliver(regs);
+
 	/* Set up the stack frame */
 	if (is_ia32_frame(ksig)) {
 		if (ksig->ka.sa.sa_flags & SA_SIGINFO)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH for 4.15 06/14] Restartable sequences: wire up x86 32/64 system call
       [not found] <20171012230326.19984-1-mathieu.desnoyers@efficios.com>
                   ` (4 preceding siblings ...)
  2017-10-12 23:03 ` [RFC PATCH for 4.15 05/14] Restartable sequences: x86 32/64 architecture support Mathieu Desnoyers
@ 2017-10-12 23:03 ` Mathieu Desnoyers
  2017-10-12 23:03 ` [RFC PATCH for 4.15 09/14] Provide cpu_opv " Mathieu Desnoyers
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-12 23:03 UTC (permalink / raw)
  To: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon
  Cc: linux-kernel, Mathieu Desnoyers, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, Steven Rostedt, Linus Torvalds,
	Andrew Morton, linux-api

Wire up the rseq system call on x86 32/64.

This provides an ABI improving the speed of a user-space getcpu
operation on x86 by removing the need to perform a function call, "lsl"
instruction, or system call on the fast path, as well as improving the
speed of user-space operations on per-cpu data.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 448ac2161112..ba43ee75e425 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
 382	i386	pkey_free		sys_pkey_free
 383	i386	statx			sys_statx
 384	i386	arch_prctl		sys_arch_prctl			compat_sys_arch_prctl
+385	i386	rseq			sys_rseq
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..3ad03495bbb9 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
 330	common	pkey_alloc		sys_pkey_alloc
 331	common	pkey_free		sys_pkey_free
 332	common	statx			sys_statx
+333	common	rseq			sys_rseq
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH for 4.15 09/14] Provide cpu_opv system call
       [not found] <20171012230326.19984-1-mathieu.desnoyers@efficios.com>
                   ` (5 preceding siblings ...)
  2017-10-12 23:03 ` [RFC PATCH for 4.15 06/14] Restartable sequences: wire up x86 32/64 system call Mathieu Desnoyers
@ 2017-10-12 23:03 ` Mathieu Desnoyers
       [not found]   ` <20171012230326.19984-10-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-10-12 23:03 ` [RFC PATCH for 4.15 10/14] cpu_opv: Wire up x86 32/64 " Mathieu Desnoyers
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-12 23:03 UTC (permalink / raw)
  To: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon
  Cc: linux-kernel, Mathieu Desnoyers, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Ben Maurer,
	Steven Rostedt, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Michael Kerrisk, linux-api

This new cpu_opv system call executes a vector of operations on behalf
of user-space on a specific CPU with preemption disabled. It is inspired
from readv() and writev() system calls which take a "struct iovec" array
as argument.

The operations available are: comparison, memcpy, add, or, and, xor,
left shift, and right shift. The system call receives a CPU number from
user-space as argument, which is the CPU on which those operations need
to be performed. All preparation steps such as loading pointers, and
applying offsets to arrays, need to be performed by user-space before
invoking the system call. The "comparison" operation can be used to
check that the data used in the preparation step did not change between
preparation of system call inputs and operation execution within the
preempt-off critical section.

The reason why we require all pointer offsets to be calculated by
user-space beforehand is because we need to use get_user_pages_fast() to
first pin all pages touched by each operation. This takes care of
faulting-in the pages. Then, preemption is disabled, and the operations
are performed atomically with respect to other thread execution on that
CPU, without generating any page fault.

A maximum limit of 16 operations per cpu_opv syscall invocation is
enforced, so user-space cannot generate a too long preempt-off critical
section. Each operation is also limited a length of PAGE_SIZE bytes,
meaning that an operation can touch a maximum of 4 pages (memcpy: 2
pages for source, 2 pages for destination if addresses are not aligned
on page boundaries).

If the thread is not running on the requested CPU, a new
push_task_to_cpu() is invoked to migrate the task to the requested CPU.
If the requested CPU is not part of the cpus allowed mask of the thread,
the system call fails with EINVAL. After the migration has been
performed, preemption is disabled, and the current CPU number is checked
again and compared to the requested CPU number. If it still differs, it
means the scheduler migrated us away from that CPU. Return EAGAIN to
user-space in that case, and let user-space retry (either requesting the
same CPU number, or a different one, depending on the user-space
algorithm constraints).

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul Turner <pjt@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Andrew Hunter <ahh@google.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 MAINTAINERS                  |    7 +
 include/uapi/linux/cpu_opv.h |   93 ++++
 init/Kconfig                 |   14 +
 kernel/Makefile              |    1 +
 kernel/cpu_opv.c             | 1000 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          |   37 ++
 kernel/sched/sched.h         |    2 +
 kernel/sys_ni.c              |    1 +
 8 files changed, 1155 insertions(+)
 create mode 100644 include/uapi/linux/cpu_opv.h
 create mode 100644 kernel/cpu_opv.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 9d6a830a8c32..6a5f3afb2ea4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3611,6 +3611,13 @@ B:	https://bugzilla.kernel.org
 F:	drivers/cpuidle/*
 F:	include/linux/cpuidle.h
 
+CPU NON-PREEMPTIBLE OPERATION VECTOR SUPPORT
+M:	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	kernel/cpu_opv.c
+F:	include/uapi/linux/cpu_opv.h
+
 CRAMFS FILESYSTEM
 W:	http://sourceforge.net/projects/cramfs/
 S:	Orphan / Obsolete
diff --git a/include/uapi/linux/cpu_opv.h b/include/uapi/linux/cpu_opv.h
new file mode 100644
index 000000000000..a3fcdebd063b
--- /dev/null
+++ b/include/uapi/linux/cpu_opv.h
@@ -0,0 +1,93 @@
+#ifndef _UAPI_LINUX_CPU_OPV_H
+#define _UAPI_LINUX_CPU_OPV_H
+
+/*
+ * linux/cpu_opv.h
+ *
+ * CPU preempt-off operation vector system call API
+ *
+ * Copyright (c) 2017 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifdef __KERNEL__
+# include <linux/types.h>
+#else	/* #ifdef __KERNEL__ */
+# include <stdint.h>
+#endif	/* #else #ifdef __KERNEL__ */
+
+#include <asm/byteorder.h>
+
+#ifdef __LP64__
+# define CPU_OP_FIELD_u32_u64(field)	uint64_t field
+#elif defined(__BYTE_ORDER) ? \
+	__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+# define CPU_OP_FIELD_u32_u64(field)	uint32_t _padding ## field, field
+#else
+# define CPU_OP_FIELD_u32_u64(field)	uint32_t field, _padding ## field
+#endif
+
+#define CPU_OP_VEC_LEN_MAX		16
+#define CPU_OP_ARG_LEN_MAX		24
+#define CPU_OP_DATA_LEN_MAX		PAGE_SIZE
+#define CPU_OP_MAX_PAGES		4	/* Max. pages per op. */
+
+enum cpu_op_type {
+	CPU_COMPARE_EQ_OP,	/* compare */
+	CPU_COMPARE_NE_OP,	/* compare */
+	CPU_MEMCPY_OP,		/* memcpy */
+	CPU_ADD_OP,		/* arithmetic */
+	CPU_OR_OP,		/* bitwise */
+	CPU_AND_OP,		/* bitwise */
+	CPU_XOR_OP,		/* bitwise */
+	CPU_LSHIFT_OP,		/* shift */
+	CPU_RSHIFT_OP,		/* shift */
+};
+
+/* Vector of operations to perform. Limited to 16. */
+struct cpu_op {
+	int32_t op;	/* enum cpu_op_type. */
+	uint32_t len;	/* data length, in bytes. */
+	union {
+		struct {
+			CPU_OP_FIELD_u32_u64(a);
+			CPU_OP_FIELD_u32_u64(b);
+		} compare_op;
+		struct {
+			CPU_OP_FIELD_u32_u64(dst);
+			CPU_OP_FIELD_u32_u64(src);
+		} memcpy_op;
+		struct {
+			CPU_OP_FIELD_u32_u64(p);
+			int64_t count;
+		} arithmetic_op;
+		struct {
+			CPU_OP_FIELD_u32_u64(p);
+			uint64_t mask;
+		} bitwise_op;
+		struct {
+			CPU_OP_FIELD_u32_u64(p);
+			uint32_t bits;
+		} shift_op;
+		char __padding[CPU_OP_ARG_LEN_MAX];
+	} u;
+};
+
+#endif /* _UAPI_LINUX_CPU_OPV_H */
diff --git a/init/Kconfig b/init/Kconfig
index b8aa41bd4f4f..98b79eb9020e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1399,6 +1399,7 @@ config RSEQ
 	bool "Enable rseq() system call" if EXPERT
 	default y
 	depends on HAVE_RSEQ
+	select CPU_OPV
 	help
 	  Enable the restartable sequences system call. It provides a
 	  user-space cache for the current CPU number value, which
@@ -1408,6 +1409,19 @@ config RSEQ
 
 	  If unsure, say Y.
 
+config CPU_OPV
+	bool "Enable cpu_opv() system call" if EXPERT
+	default y
+	help
+	  Enable the CPU preempt-off operation vector system call.
+	  It allows user-space to perform a sequence of operations on
+	  per-cpu data with preemption disabled. Useful as
+	  single-stepping fall-back for restartable sequences, and for
+	  performing more complex operations on per-cpu data that would
+	  not be otherwise possible to do with restartable sequences.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 5c09592b3b9f..8301e454c2a8 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -112,6 +112,7 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
 obj-$(CONFIG_RSEQ) += rseq.o
+obj-$(CONFIG_CPU_OPV) += cpu_opv.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/cpu_opv.c b/kernel/cpu_opv.c
new file mode 100644
index 000000000000..2e615612acb1
--- /dev/null
+++ b/kernel/cpu_opv.c
@@ -0,0 +1,1000 @@
+/*
+ * CPU preempt-off operation vector system call
+ *
+ * It allows user-space to perform a sequence of operations on per-cpu
+ * data with preemption disabled. Useful as single-stepping fall-back
+ * for restartable sequences, and for performing more complex operations
+ * on per-cpu data that would not be otherwise possible to do with
+ * restartable sequences.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) 2017, EfficiOS Inc.,
+ * Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/cpu_opv.h>
+#include <linux/types.h>
+#include <asm/ptrace.h>
+#include <asm/byteorder.h>
+
+#include "sched/sched.h"
+
+#define TMP_BUFLEN			64
+#define NR_PINNED_PAGES_ON_STACK	8
+
+/*
+ * The cpu_opv system call executes a vector of operations on behalf of
+ * user-space on a specific CPU with preemption disabled. It is inspired
+ * from readv() and writev() system calls which take a "struct iovec"
+ * array as argument.
+ * 
+ * The operations available are: comparison, memcpy, add, or, and, xor,
+ * left shift, and right shift. The system call receives a CPU number
+ * from user-space as argument, which is the CPU on which those
+ * operations need to be performed. All preparation steps such as
+ * loading pointers, and applying offsets to arrays, need to be
+ * performed by user-space before invoking the system call. The
+ * "comparison" operation can be used to check that the data used in the
+ * preparation step did not change between preparation of system call
+ * inputs and operation execution within the preempt-off critical
+ * section.
+ * 
+ * The reason why we require all pointer offsets to be calculated by
+ * user-space beforehand is because we need to use get_user_pages_fast()
+ * to first pin all pages touched by each operation. This takes care of
+ * faulting-in the pages. Then, preemption is disabled, and the
+ * operations are performed atomically with respect to other thread
+ * execution on that CPU, without generating any page fault.
+ * 
+ * A maximum limit of 16 operations per cpu_opv syscall invocation is
+ * enforced, so user-space cannot generate a too long preempt-off
+ * critical section. Each operation is also limited a length of
+ * PAGE_SIZE bytes, meaning that an operation can touch a maximum of 4
+ * pages (memcpy: 2 pages for source, 2 pages for destination if
+ * addresses are not aligned on page boundaries).
+ * 
+ * If the thread is not running on the requested CPU, a new
+ * push_task_to_cpu() is invoked to migrate the task to the requested
+ * CPU.  If the requested CPU is not part of the cpus allowed mask of
+ * the thread, the system call fails with EINVAL. After the migration
+ * has been performed, preemption is disabled, and the current CPU
+ * number is checked again and compared to the requested CPU number. If
+ * it still differs, it means the scheduler migrated us away from that
+ * CPU. Return EAGAIN to user-space in that case, and let user-space
+ * retry (either requesting the same CPU number, or a different one,
+ * depending on the user-space algorithm constraints).
+ */
+
+/*
+ * Check operation types and length parameters.
+ */
+static int cpu_opv_check(struct cpu_op *cpuop, int cpuopcnt)
+{
+	int i;
+
+	for (i = 0; i < cpuopcnt; i++) {
+		struct cpu_op *op = &cpuop[i];
+
+		switch (op->op) {
+		case CPU_COMPARE_EQ_OP:
+		case CPU_COMPARE_NE_OP:
+		case CPU_MEMCPY_OP:
+			if (op->len > CPU_OP_DATA_LEN_MAX)
+				return -EINVAL;
+			break;
+		case CPU_ADD_OP:
+		case CPU_OR_OP:
+		case CPU_AND_OP:
+		case CPU_XOR_OP:
+			switch (op->len) {
+			case 1:
+			case 2:
+			case 4:
+			case 8:
+				break;
+			default:
+				return -EINVAL;
+			}
+			break;
+		case CPU_LSHIFT_OP:
+		case CPU_RSHIFT_OP:
+			switch (op->len) {
+			case 1:
+				if (op->u.shift_op.bits > 7)
+					return -EINVAL;
+				break;
+			case 2:
+				if (op->u.shift_op.bits > 15)
+					return -EINVAL;
+				break;
+			case 4:
+				if (op->u.shift_op.bits > 31)
+					return -EINVAL;
+				break;
+			case 8:
+				if (op->u.shift_op.bits > 63)
+					return -EINVAL;
+				break;
+			default:
+				return -EINVAL;
+			}
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static unsigned long cpu_op_range_nr_pages(unsigned long addr,
+		unsigned long len)
+{
+	return ((addr + len - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1;
+}
+
+static int cpu_op_pin_pages(unsigned long addr, unsigned long len,
+		struct page ***pinned_pages_ptr, size_t *nr_pinned)
+{
+	unsigned long nr_pages;
+	struct page *pages[2];
+	int ret;
+
+	if (!len)
+		return 0;
+	nr_pages = cpu_op_range_nr_pages(addr, len);
+	BUG_ON(nr_pages > 2);
+	if (*nr_pinned + nr_pages > NR_PINNED_PAGES_ON_STACK) {
+		struct page **pinned_pages =
+			kzalloc(CPU_OP_VEC_LEN_MAX * CPU_OP_MAX_PAGES
+				* sizeof(struct page *), GFP_KERNEL);
+		if (!pinned_pages)
+			return -ENOMEM;
+		memcpy(pinned_pages, *pinned_pages_ptr,
+			*nr_pinned * sizeof(struct page *));
+		*pinned_pages_ptr = pinned_pages;
+	}
+	ret = get_user_pages_fast(addr, nr_pages, 0, pages);
+	if (ret < nr_pages) {
+		if (ret > 0)
+			put_page(pages[0]);
+		return -EFAULT;
+	}
+	(*pinned_pages_ptr)[(*nr_pinned)++] = pages[0];
+	if (nr_pages > 1)
+		(*pinned_pages_ptr)[(*nr_pinned)++] = pages[1];
+	return 0;
+}
+
+static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt,
+		struct page ***pinned_pages_ptr, size_t *nr_pinned)
+{
+	int ret, i;
+
+	/* Check access, pin pages. */
+	for (i = 0; i < cpuopcnt; i++) {
+		struct cpu_op *op = &cpuop[i];
+
+		switch (op->op) {
+		case CPU_COMPARE_EQ_OP:
+		case CPU_COMPARE_NE_OP:
+			if (!access_ok(VERIFY_READ, op->u.compare_op.a,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.compare_op.a,
+					op->len, pinned_pages_ptr, nr_pinned);
+			if (ret)
+				goto error;
+			if (!access_ok(VERIFY_READ, op->u.compare_op.b,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.compare_op.b,
+					op->len, pinned_pages_ptr, nr_pinned);
+			if (ret)
+				goto error;
+			break;
+		case CPU_MEMCPY_OP:
+			if (!access_ok(VERIFY_WRITE, op->u.memcpy_op.dst,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.memcpy_op.dst,
+					op->len, pinned_pages_ptr, nr_pinned);
+			if (ret)
+				goto error;
+			if (!access_ok(VERIFY_READ, op->u.memcpy_op.src,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.memcpy_op.src,
+					op->len, pinned_pages_ptr, nr_pinned);
+			if (ret)
+				goto error;
+			break;
+		case CPU_ADD_OP:
+			if (!access_ok(VERIFY_WRITE, op->u.arithmetic_op.p,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.arithmetic_op.p,
+					op->len, pinned_pages_ptr, nr_pinned);
+			if (ret)
+				goto error;
+			break;
+		case CPU_OR_OP:
+		case CPU_AND_OP:
+		case CPU_XOR_OP:
+			if (!access_ok(VERIFY_WRITE, op->u.bitwise_op.p,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.bitwise_op.p,
+					op->len, pinned_pages_ptr, nr_pinned);
+			if (ret)
+				goto error;
+			break;
+		case CPU_LSHIFT_OP:
+		case CPU_RSHIFT_OP:
+			if (!access_ok(VERIFY_WRITE, op->u.shift_op.p,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.shift_op.p,
+					op->len, pinned_pages_ptr, nr_pinned);
+			if (ret)
+				goto error;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+	return 0;
+
+error:
+	for (i = 0; i < *nr_pinned; i++)
+		put_page((*pinned_pages_ptr)[i]);
+	*nr_pinned = 0;
+	return ret;
+}
+
+/* Return 0 if same, > 0 if different, < 0 on error. */
+static int do_cpu_op_compare_iter(void __user *a, void __user *b, uint32_t len)
+{
+	char bufa[TMP_BUFLEN], bufb[TMP_BUFLEN];
+	uint32_t compared = 0;
+
+	while (compared != len) {
+		unsigned long to_compare;
+
+		to_compare = min_t(uint32_t, TMP_BUFLEN, len - compared);
+		if (__copy_from_user_inatomic(bufa, a + compared, to_compare))
+			return -EFAULT;
+		if (__copy_from_user_inatomic(bufb, b + compared, to_compare))
+			return -EFAULT;
+		if (memcmp(bufa, bufb, to_compare))
+			return 1;	/* different */
+		compared += to_compare;
+	}
+	return 0;	/* same */
+}
+
+/* Return 0 if same, > 0 if different, < 0 on error. */
+static int do_cpu_op_compare(void __user *a, void __user *b, uint32_t len)
+{
+	int ret = -EFAULT;
+	union {
+		uint8_t _u8;
+		uint16_t _u16;
+		uint32_t _u32;
+		uint64_t _u64;
+#if (BITS_PER_LONG < 64)
+		uint32_t _u64_split[2];
+#endif
+	} tmp[2];
+
+	pagefault_disable();
+	switch (len) {
+	case 1:
+		if (__get_user(tmp[0]._u8, (uint8_t __user *)a))
+			goto end;
+		if (__get_user(tmp[1]._u8, (uint8_t __user *)b))
+			goto end;
+		ret = !!(tmp[0]._u8 != tmp[1]._u8);
+		break;
+	case 2:
+		if (__get_user(tmp[0]._u16, (uint16_t __user *)a))
+			goto end;
+		if (__get_user(tmp[1]._u16, (uint16_t __user *)b))
+			goto end;
+		ret = !!(tmp[0]._u16 != tmp[1]._u16);
+		break;
+	case 4:
+		if (__get_user(tmp[0]._u32, (uint32_t __user *)a))
+			goto end;
+		if (__get_user(tmp[1]._u32, (uint32_t __user *)b))
+			goto end;
+		ret = !!(tmp[0]._u32 != tmp[1]._u32);
+		break;
+	case 8:
+#if (BITS_PER_LONG >= 64)
+		if (__get_user(tmp[0]._u64, (uint64_t __user *)a))
+			goto end;
+		if (__get_user(tmp[1]._u64, (uint64_t __user *)b))
+			goto end;
+#else
+		if (__get_user(tmp[0]._u64_split[0], (uint32_t __user *)a))
+			goto end;
+		if (__get_user(tmp[0]._u64_split[1], (uint32_t __user *)a + 1))
+			goto end;
+		if (__get_user(tmp[1]._u64_split[0], (uint32_t __user *)b))
+			goto end;
+		if (__get_user(tmp[1]._u64_split[1], (uint32_t __user *)b + 1))
+			goto end;
+#endif
+		ret = !!(tmp[0]._u64 != tmp[1]._u64);
+		break;
+	default:
+		pagefault_enable();
+		return do_cpu_op_compare_iter(a, b, len);
+	}
+end:
+	pagefault_enable();
+	return ret;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_memcpy_iter(void __user *dst, void __user *src,
+		uint32_t len)
+{
+	char buf[TMP_BUFLEN];
+	uint32_t copied = 0;
+
+	while (copied != len) {
+		unsigned long to_copy;
+
+		to_copy = min_t(uint32_t, TMP_BUFLEN, len - copied);
+		if (__copy_from_user_inatomic(buf, src + copied, to_copy))
+			return -EFAULT;
+		if (__copy_to_user_inatomic(dst + copied, buf, to_copy))
+			return -EFAULT;
+		copied += to_copy;
+	}
+	return 0;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_memcpy(void __user *dst, void __user *src, uint32_t len)
+{
+	int ret = -EFAULT;
+	union {
+		uint8_t _u8;
+		uint16_t _u16;
+		uint32_t _u32;
+		uint64_t _u64;
+#if (BITS_PER_LONG < 64)
+		uint32_t _u64_split[2];
+#endif
+	} tmp;
+
+	pagefault_disable();
+	switch (len) {
+	case 1:
+		if (__get_user(tmp._u8, (uint8_t __user *)src))
+			goto end;
+		if (__put_user(tmp._u8, (uint8_t __user *)dst))
+			goto end;
+		break;
+	case 2:
+		if (__get_user(tmp._u16, (uint16_t __user *)src))
+			goto end;
+		if (__put_user(tmp._u16, (uint16_t __user *)dst))
+			goto end;
+		break;
+	case 4:
+		if (__get_user(tmp._u32, (uint32_t __user *)src))
+			goto end;
+		if (__put_user(tmp._u32, (uint32_t __user *)dst))
+			goto end;
+		break;
+	case 8:
+#if (BITS_PER_LONG >= 64)
+		if (__get_user(tmp._u64, (uint64_t __user *)src))
+			goto end;
+		if (__put_user(tmp._u64, (uint64_t __user *)dst))
+			goto end;
+#else
+		if (__get_user(tmp._u64_split[0], (uint32_t __user *)src))
+			goto end;
+		if (__get_user(tmp._u64_split[1], (uint32_t __user *)src + 1))
+			goto end;
+		if (__put_user(tmp._u64_split[0], (uint32_t __user *)dst))
+			goto end;
+		if (__put_user(tmp._u64_split[1], (uint32_t __user *)dst + 1))
+			goto end;
+#endif
+		break;
+	default:
+		pagefault_enable();
+		return do_cpu_op_memcpy_iter(dst, src, len);
+	}
+	ret = 0;
+end:
+	pagefault_enable();
+	return ret;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_add(void __user *p, int64_t count, uint32_t len)
+{
+	int ret = -EFAULT;
+	union {
+		uint8_t _u8;
+		uint16_t _u16;
+		uint32_t _u32;
+		uint64_t _u64;
+#if (BITS_PER_LONG < 64)
+		uint32_t _u64_split[2];
+#endif
+	} tmp;
+
+	pagefault_disable();
+	switch (len) {
+	case 1:
+		if (__get_user(tmp._u8, (uint8_t __user *)p))
+			goto end;
+		tmp._u8 += (uint8_t)count;
+		if (__put_user(tmp._u8, (uint8_t __user *)p))
+			goto end;
+		break;
+	case 2:
+		if (__get_user(tmp._u16, (uint16_t __user *)p))
+			goto end;
+		tmp._u16 += (uint16_t)count;
+		if (__put_user(tmp._u16, (uint16_t __user *)p))
+			goto end;
+		break;
+	case 4:
+		if (__get_user(tmp._u32, (uint32_t __user *)p))
+			goto end;
+		tmp._u32 += (uint32_t)count;
+		if (__put_user(tmp._u32, (uint32_t __user *)p))
+			goto end;
+		break;
+	case 8:
+#if (BITS_PER_LONG >= 64)
+		if (__get_user(tmp._u64, (uint64_t __user *)p))
+			goto end;
+#else
+		if (__get_user(tmp._u64_split[0], (uint32_t __user *)p))
+			goto end;
+		if (__get_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
+			goto end;
+#endif
+		tmp._u64 += (uint64_t)count;
+#if (BITS_PER_LONG >= 64)
+		if (__put_user(tmp._u64, (uint64_t __user *)p))
+			goto end;
+#else
+		if (__put_user(tmp._u64_split[0], (uint32_t __user *)p))
+			goto end;
+		if (__put_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
+			goto end;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		goto end;
+	}
+	ret = 0;
+end:
+	pagefault_enable();
+	return ret;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_or(void __user *p, uint64_t mask, uint32_t len)
+{
+	int ret = -EFAULT;
+	union {
+		uint8_t _u8;
+		uint16_t _u16;
+		uint32_t _u32;
+		uint64_t _u64;
+#if (BITS_PER_LONG < 64)
+		uint32_t _u64_split[2];
+#endif
+	} tmp;
+
+	pagefault_disable();
+	switch (len) {
+	case 1:
+		if (__get_user(tmp._u8, (uint8_t __user *)p))
+			goto end;
+		tmp._u8 |= (uint8_t)mask;
+		if (__put_user(tmp._u8, (uint8_t __user *)p))
+			goto end;
+		break;
+	case 2:
+		if (__get_user(tmp._u16, (uint16_t __user *)p))
+			goto end;
+		tmp._u16 |= (uint16_t)mask;
+		if (__put_user(tmp._u16, (uint16_t __user *)p))
+			goto end;
+		break;
+	case 4:
+		if (__get_user(tmp._u32, (uint32_t __user *)p))
+			goto end;
+		tmp._u32 |= (uint32_t)mask;
+		if (__put_user(tmp._u32, (uint32_t __user *)p))
+			goto end;
+		break;
+	case 8:
+#if (BITS_PER_LONG >= 64)
+		if (__get_user(tmp._u64, (uint64_t __user *)p))
+			goto end;
+#else
+		if (__get_user(tmp._u64_split[0], (uint32_t __user *)p))
+			goto end;
+		if (__get_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
+			goto end;
+#endif
+		tmp._u64 |= (uint64_t)mask;
+#if (BITS_PER_LONG >= 64)
+		if (__put_user(tmp._u64, (uint64_t __user *)p))
+			goto end;
+#else
+		if (__put_user(tmp._u64_split[0], (uint32_t __user *)p))
+			goto end;
+		if (__put_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
+			goto end;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		goto end;
+	}
+	ret = 0;
+end:
+	pagefault_enable();
+	return ret;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_and(void __user *p, uint64_t mask, uint32_t len)
+{
+	int ret = -EFAULT;
+	union {
+		uint8_t _u8;
+		uint16_t _u16;
+		uint32_t _u32;
+		uint64_t _u64;
+#if (BITS_PER_LONG < 64)
+		uint32_t _u64_split[2];
+#endif
+	} tmp;
+
+	pagefault_disable();
+	switch (len) {
+	case 1:
+		if (__get_user(tmp._u8, (uint8_t __user *)p))
+			goto end;
+		tmp._u8 &= (uint8_t)mask;
+		if (__put_user(tmp._u8, (uint8_t __user *)p))
+			goto end;
+		break;
+	case 2:
+		if (__get_user(tmp._u16, (uint16_t __user *)p))
+			goto end;
+		tmp._u16 &= (uint16_t)mask;
+		if (__put_user(tmp._u16, (uint16_t __user *)p))
+			goto end;
+		break;
+	case 4:
+		if (__get_user(tmp._u32, (uint32_t __user *)p))
+			goto end;
+		tmp._u32 &= (uint32_t)mask;
+		if (__put_user(tmp._u32, (uint32_t __user *)p))
+			goto end;
+		break;
+	case 8:
+#if (BITS_PER_LONG >= 64)
+		if (__get_user(tmp._u64, (uint64_t __user *)p))
+			goto end;
+#else
+		if (__get_user(tmp._u64_split[0], (uint32_t __user *)p))
+			goto end;
+		if (__get_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
+			goto end;
+#endif
+		tmp._u64 &= (uint64_t)mask;
+#if (BITS_PER_LONG >= 64)
+		if (__put_user(tmp._u64, (uint64_t __user *)p))
+			goto end;
+#else
+		if (__put_user(tmp._u64_split[0], (uint32_t __user *)p))
+			goto end;
+		if (__put_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
+			goto end;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		goto end;
+	}
+	ret = 0;
+end:
+	pagefault_enable();
+	return ret;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_xor(void __user *p, uint64_t mask, uint32_t len)
+{
+	int ret = -EFAULT;
+	union {
+		uint8_t _u8;
+		uint16_t _u16;
+		uint32_t _u32;
+		uint64_t _u64;
+#if (BITS_PER_LONG < 64)
+		uint32_t _u64_split[2];
+#endif
+	} tmp;
+
+	pagefault_disable();
+	switch (len) {
+	case 1:
+		if (__get_user(tmp._u8, (uint8_t __user *)p))
+			goto end;
+		tmp._u8 ^= (uint8_t)mask;
+		if (__put_user(tmp._u8, (uint8_t __user *)p))
+			goto end;
+		break;
+	case 2:
+		if (__get_user(tmp._u16, (uint16_t __user *)p))
+			goto end;
+		tmp._u16 ^= (uint16_t)mask;
+		if (__put_user(tmp._u16, (uint16_t __user *)p))
+			goto end;
+		break;
+	case 4:
+		if (__get_user(tmp._u32, (uint32_t __user *)p))
+			goto end;
+		tmp._u32 ^= (uint32_t)mask;
+		if (__put_user(tmp._u32, (uint32_t __user *)p))
+			goto end;
+		break;
+	case 8:
+#if (BITS_PER_LONG >= 64)
+		if (__get_user(tmp._u64, (uint64_t __user *)p))
+			goto end;
+#else
+		if (__get_user(tmp._u64_split[0], (uint32_t __user *)p))
+			goto end;
+		if (__get_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
+			goto end;
+#endif
+		tmp._u64 ^= (uint64_t)mask;
+#if (BITS_PER_LONG >= 64)
+		if (__put_user(tmp._u64, (uint64_t __user *)p))
+			goto end;
+#else
+		if (__put_user(tmp._u64_split[0], (uint32_t __user *)p))
+			goto end;
+		if (__put_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
+			goto end;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		goto end;
+	}
+	ret = 0;
+end:
+	pagefault_enable();
+	return ret;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_lshift(void __user *p, uint32_t bits, uint32_t len)
+{
+	int ret = -EFAULT;
+	union {
+		uint8_t _u8;
+		uint16_t _u16;
+		uint32_t _u32;
+		uint64_t _u64;
+#if (BITS_PER_LONG < 64)
+		uint32_t _u64_split[2];
+#endif
+	} tmp;
+
+	pagefault_disable();
+	switch (len) {
+	case 1:
+		if (__get_user(tmp._u8, (uint8_t __user *)p))
+			goto end;
+		tmp._u8 <<= bits;
+		if (__put_user(tmp._u8, (uint8_t __user *)p))
+			goto end;
+		break;
+	case 2:
+		if (__get_user(tmp._u16, (uint16_t __user *)p))
+			goto end;
+		tmp._u16 <<= bits;
+		if (__put_user(tmp._u16, (uint16_t __user *)p))
+			goto end;
+		break;
+	case 4:
+		if (__get_user(tmp._u32, (uint32_t __user *)p))
+			goto end;
+		tmp._u32 <<= bits;
+		if (__put_user(tmp._u32, (uint32_t __user *)p))
+			goto end;
+		break;
+	case 8:
+#if (BITS_PER_LONG >= 64)
+		if (__get_user(tmp._u64, (uint64_t __user *)p))
+			goto end;
+#else
+		if (__get_user(tmp._u64_split[0], (uint32_t __user *)p))
+			goto end;
+		if (__get_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
+			goto end;
+#endif
+		tmp._u64 <<= bits;
+#if (BITS_PER_LONG >= 64)
+		if (__put_user(tmp._u64, (uint64_t __user *)p))
+			goto end;
+#else
+		if (__put_user(tmp._u64_split[0], (uint32_t __user *)p))
+			goto end;
+		if (__put_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
+			goto end;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		goto end;
+	}
+	ret = 0;
+end:
+	pagefault_enable();
+	return ret;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_rshift(void __user *p, uint32_t bits, uint32_t len)
+{
+	int ret = -EFAULT;
+	union {
+		uint8_t _u8;
+		uint16_t _u16;
+		uint32_t _u32;
+		uint64_t _u64;
+#if (BITS_PER_LONG < 64)
+		uint32_t _u64_split[2];
+#endif
+	} tmp;
+
+	pagefault_disable();
+	switch (len) {
+	case 1:
+		if (__get_user(tmp._u8, (uint8_t __user *)p))
+			goto end;
+		tmp._u8 >>= bits;
+		if (__put_user(tmp._u8, (uint8_t __user *)p))
+			goto end;
+		break;
+	case 2:
+		if (__get_user(tmp._u16, (uint16_t __user *)p))
+			goto end;
+		tmp._u16 >>= bits;
+		if (__put_user(tmp._u16, (uint16_t __user *)p))
+			goto end;
+		break;
+	case 4:
+		if (__get_user(tmp._u32, (uint32_t __user *)p))
+			goto end;
+		tmp._u32 >>= bits;
+		if (__put_user(tmp._u32, (uint32_t __user *)p))
+			goto end;
+		break;
+	case 8:
+#if (BITS_PER_LONG >= 64)
+		if (__get_user(tmp._u64, (uint64_t __user *)p))
+			goto end;
+#else
+		if (__get_user(tmp._u64_split[0], (uint32_t __user *)p))
+			goto end;
+		if (__get_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
+			goto end;
+#endif
+		tmp._u64 >>= bits;
+#if (BITS_PER_LONG >= 64)
+		if (__put_user(tmp._u64, (uint64_t __user *)p))
+			goto end;
+#else
+		if (__put_user(tmp._u64_split[0], (uint32_t __user *)p))
+			goto end;
+		if (__put_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
+			goto end;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		goto end;
+	}
+	ret = 0;
+end:
+	pagefault_enable();
+	return ret;
+}
+
+static int __do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt)
+{
+	int i, ret;
+
+	for (i = 0; i < cpuopcnt; i++) {
+		struct cpu_op *op = &cpuop[i];
+
+		switch (op->op) {
+		case CPU_COMPARE_EQ_OP:
+			ret = do_cpu_op_compare(
+					(void __user *)op->u.compare_op.a,
+					(void __user *)op->u.compare_op.b,
+					op->len);
+			/* Stop execution on error. */
+			if (ret < 0)
+				return ret;
+			/*
+			 * Stop execution, return op index + 1 if comparison
+			 * differs.
+			 */
+			if (ret > 0)
+				return i + 1;
+			break;
+		case CPU_COMPARE_NE_OP:
+			ret = do_cpu_op_compare(
+					(void __user *)op->u.compare_op.a,
+					(void __user *)op->u.compare_op.b,
+					op->len);
+			/* Stop execution on error. */
+			if (ret < 0)
+				return ret;
+			/*
+			 * Stop execution, return op index + 1 if comparison
+			 * is identical.
+			 */
+			if (ret == 0)
+				return i + 1;
+			break;
+		case CPU_MEMCPY_OP:
+			ret = do_cpu_op_memcpy(
+					(void __user *)op->u.memcpy_op.dst,
+					(void __user *)op->u.memcpy_op.src,
+					op->len);
+			/* Stop execution on error. */
+			if (ret)
+				return ret;
+			break;
+		case CPU_ADD_OP:
+			ret = do_cpu_op_add((void __user *)op->u.arithmetic_op.p,
+					op->u.arithmetic_op.count, op->len);
+			/* Stop execution on error. */
+			if (ret)
+				return ret;
+			break;
+		case CPU_OR_OP:
+			ret = do_cpu_op_or((void __user *)op->u.bitwise_op.p,
+					op->u.bitwise_op.mask, op->len);
+			/* Stop execution on error. */
+			if (ret)
+				return ret;
+			break;
+		case CPU_AND_OP:
+			ret = do_cpu_op_and((void __user *)op->u.bitwise_op.p,
+					op->u.bitwise_op.mask, op->len);
+			/* Stop execution on error. */
+			if (ret)
+				return ret;
+			break;
+		case CPU_XOR_OP:
+			ret = do_cpu_op_xor((void __user *)op->u.bitwise_op.p,
+					op->u.bitwise_op.mask, op->len);
+			/* Stop execution on error. */
+			if (ret)
+				return ret;
+			break;
+		case CPU_LSHIFT_OP:
+			ret = do_cpu_op_lshift((void __user *)op->u.shift_op.p,
+					op->u.shift_op.bits, op->len);
+			/* Stop execution on error. */
+			if (ret)
+				return ret;
+			break;
+		case CPU_RSHIFT_OP:
+			ret = do_cpu_op_rshift((void __user *)op->u.shift_op.p,
+					op->u.shift_op.bits, op->len);
+			/* Stop execution on error. */
+			if (ret)
+				return ret;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt, int cpu)
+{
+	int ret;
+
+	if (cpu != raw_smp_processor_id()) {
+		ret = push_task_to_cpu(current, cpu);
+		if (ret)
+			return ret;
+	}
+	preempt_disable();
+	if (cpu != smp_processor_id()) {
+		ret = -EAGAIN;
+		goto end;
+	}
+	ret = __do_cpu_opv(cpuop, cpuopcnt);
+end:
+	preempt_enable();
+	return ret;
+}
+
+/*
+ * cpu_opv - execute operation vector on a given CPU with preempt off.
+ *
+ * Userspace should pass current CPU number as parameter. May fail with
+ * -EAGAIN if currently executing on the wrong CPU.
+ */
+SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
+		int, cpu, int, flags)
+{
+	struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX];
+	struct page *pinned_pages_on_stack[NR_PINNED_PAGES_ON_STACK];
+	struct page **pinned_pages = pinned_pages_on_stack;
+	int ret, i;
+	size_t nr_pinned = 0;
+
+	if (unlikely(flags))
+		return -EINVAL;
+	if (unlikely(cpu < 0))
+		return -EINVAL;
+	if (cpuopcnt < 0 || cpuopcnt > CPU_OP_VEC_LEN_MAX)
+		return -EINVAL;
+	if (copy_from_user(cpuopv, ucpuopv, cpuopcnt * sizeof(struct cpu_op)))
+		return -EFAULT;
+	ret = cpu_opv_check(cpuopv, cpuopcnt);
+	if (ret)
+		return ret;
+	ret = cpu_opv_pin_pages(cpuopv, cpuopcnt,
+				&pinned_pages, &nr_pinned);
+	if (ret)
+		goto end;
+	ret = do_cpu_opv(cpuopv, cpuopcnt, cpu);
+	for (i = 0; i < nr_pinned; i++)
+		put_page(pinned_pages[i]);
+end:
+	if (pinned_pages != pinned_pages_on_stack)
+		kfree(pinned_pages);
+	return ret;
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 12da0f771d73..db50984f7535 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1047,6 +1047,43 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 		set_curr_task(rq, p);
 }
 
+int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+	int ret = 0;
+
+	rq = task_rq_lock(p, &rf);
+	update_rq_clock(rq);
+
+	if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (task_cpu(p) == dest_cpu)
+		goto out;
+
+	if (task_running(rq, p) || p->state == TASK_WAKING) {
+		struct migration_arg arg = { p, dest_cpu };
+		/* Need help from migration thread: drop lock and wait. */
+		task_rq_unlock(rq, p, &rf);
+		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
+		tlb_migrate_finish(p->mm);
+		return 0;
+	} else if (task_on_rq_queued(p)) {
+		/*
+		 * OK, since we're going to drop the lock immediately
+		 * afterwards anyway.
+		 */
+		rq = move_queued_task(rq, &rf, p, dest_cpu);
+	}
+out:
+	task_rq_unlock(rq, p, &rf);
+
+	return ret;
+}
+
 /*
  * Change a given task's CPU affinity. Migrate the thread to a
  * proper CPU and schedule it away if the CPU it's executing on
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eeef1a3086d1..a1c0e60006f8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1207,6 +1207,8 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 #endif
 }
 
+int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);
+
 /*
  * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
  */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index c7b366ccf39c..044808ac8197 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -261,3 +261,4 @@ cond_syscall(sys_pkey_free);
 
 /* restartable sequence */
 cond_syscall(sys_rseq);
+cond_syscall(sys_cpu_opv);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 61+ messages in thread

[parent not found: <20171012230326.19984-10-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>]

* Re: [RFC PATCH for 4.15 09/14] Provide cpu_opv system call
       [not found]   ` <20171012230326.19984-10-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-10-13 13:57     ` Alan Cox
  2017-10-13 14:50       ` Mathieu Desnoyers
  2017-10-13 17:20     ` Andy Lutomirski
  2017-10-14  2:50     ` Andi Kleen
  2 siblings, 1 reply; 61+ messages in thread
From: Alan Cox @ 2017-10-13 13:57 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner,
	Andi Kleen, Chris Lameter, Ingo Molnar, H. Peter Anvin,
	Ben Maurer, Steven Rostedt, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas

> A maximum limit of 16 operations per cpu_opv syscall invocation is
> enforced, so user-space cannot generate a too long preempt-off critical
> section. 

Except that all the operations could be going to mmapped I/O space and if
I pick the right targets could take quite a long time to complete. It's
still only 16 operations - But 160ms is a lot worse than 10ms. In fact
with compare_iter I could make it much much worse still as I get 2 x
TMP_BUFLEN x 16 x worst case latency in my attack. That's enough to screw
up plenty of things.

So it seems to me at minimum it needs to be restricted to genuine RAM user
pages, and in fact would be far far simpler code as well if it were
limited to a single page for a given invocation or if like futexes you
had to specifically create a per_cpu_opv mapping.

Alan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH for 4.15 09/14] Provide cpu_opv system call
  2017-10-13 13:57     ` Alan Cox
@ 2017-10-13 14:50       ` Mathieu Desnoyers
       [not found]         ` <854849583.40647.1507906233368.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-13 14:50 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Ben Maurer, rostedt,
	Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas <catalin.

----- On Oct 13, 2017, at 9:57 AM, One Thousand Gnomes gnomes@lxorguk.ukuu.org.uk wrote:

>> A maximum limit of 16 operations per cpu_opv syscall invocation is
>> enforced, so user-space cannot generate a too long preempt-off critical
>> section.
> 
> Except that all the operations could be going to mmapped I/O space and if
> I pick the right targets could take quite a long time to complete.

We could check whether a struct page belongs to mmapped I/O space, and return
EINVAL in that case.

> It's
> still only 16 operations - But 160ms is a lot worse than 10ms. In fact
> with compare_iter I could make it much much worse still as I get 2 x
> TMP_BUFLEN x 16 x worst case latency in my attack. That's enough to screw
> up plenty of things.

Would a check that ensures the page is not mmapped I/O space be sufficient
to take care of this ? If happen to know which API I need to look for, it
would be welcome.

> 
> So it seems to me at minimum it needs to be restricted to genuine RAM user
> pages, and in fact would be far far simpler code as well if it were
> limited to a single page for a given invocation or if like futexes you
> had to specifically create a per_cpu_opv mapping.

I've had requests to implement per-cpu ring buffers with memcpy + offset
pointer update restartable sequences. Having a memcpy operation which does not
require page-alignment allows cpu_opv() to be used as a single-stepping
fallback for those use-cases.

I'm open to consider simplifying the other operations such as compare, add,
bitwise ops, and shift ops by requiring that they target aligned content,
which would therefore fit within a single page. However, given that we already
want to support the unaligned memcpy operation, it does not add much extra
complexity to support unaligned accesses for the other cases. We could also
limit the "compare" operation to 1, 2, 4, 8 aligned bytes rather than being an
up-to-PAGE_SIZE compare, but it would limit its usefulness in case of structure
content comparison.

Thanks,

Mathieu

> 
> Alan

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <854849583.40647.1507906233368.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>]

* Re: [RFC PATCH for 4.15 09/14] Provide cpu_opv system call
       [not found]         ` <854849583.40647.1507906233368.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-10-14 14:22           ` Mathieu Desnoyers
  0 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-14 14:22 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Ben Maurer, rostedt,
	Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas <catalin.

----- On Oct 13, 2017, at 10:50 AM, Mathieu Desnoyers mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org wrote:

> ----- On Oct 13, 2017, at 9:57 AM, One Thousand Gnomes
> gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org wrote:
> 
>>> A maximum limit of 16 operations per cpu_opv syscall invocation is
>>> enforced, so user-space cannot generate a too long preempt-off critical
>>> section.
>> 
>> Except that all the operations could be going to mmapped I/O space and if
>> I pick the right targets could take quite a long time to complete.
> 
> We could check whether a struct page belongs to mmapped I/O space, and return
> EINVAL in that case.
> 
>> It's
>> still only 16 operations - But 160ms is a lot worse than 10ms. In fact
>> with compare_iter I could make it much much worse still as I get 2 x
>> TMP_BUFLEN x 16 x worst case latency in my attack. That's enough to screw
>> up plenty of things.
> 
> Would a check that ensures the page is not mmapped I/O space be sufficient
> to take care of this ? If happen to know which API I need to look for, it
> would be welcome.

I think is_zone_device_page() is what I was looking for.

Let me know if I missed something,

Thanks,

Mathieu

> Thanks,
> 
> Mathieu
> 
> 
>> 
>> Alan
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH for 4.15 09/14] Provide cpu_opv system call
       [not found]   ` <20171012230326.19984-10-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-10-13 13:57     ` Alan Cox
@ 2017-10-13 17:20     ` Andy Lutomirski
  2017-10-14  2:50     ` Andi Kleen
  2 siblings, 0 replies; 61+ messages in thread
From: Andy Lutomirski @ 2017-10-13 17:20 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Dave Watson, Josh Triplett, Will Deacon,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, Steven Rostedt, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas

On Thu, Oct 12, 2017 at 4:03 PM, Mathieu Desnoyers
<mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> This new cpu_opv system call executes a vector of operations on behalf
> of user-space on a specific CPU with preemption disabled. It is inspired
> from readv() and writev() system calls which take a "struct iovec" array
> as argument.
>
> The operations available are: comparison, memcpy, add, or, and, xor,
> left shift, and right shift. The system call receives a CPU number from
> user-space as argument, which is the CPU on which those operations need
> to be performed. All preparation steps such as loading pointers, and
> applying offsets to arrays, need to be performed by user-space before
> invoking the system call. The "comparison" operation can be used to
> check that the data used in the preparation step did not change between
> preparation of system call inputs and operation execution within the
> preempt-off critical section.
>
> The reason why we require all pointer offsets to be calculated by
> user-space beforehand is because we need to use get_user_pages_fast() to
> first pin all pages touched by each operation. This takes care of
> faulting-in the pages. Then, preemption is disabled, and the operations
> are performed atomically with respect to other thread execution on that
> CPU, without generating any page fault.
>
> A maximum limit of 16 operations per cpu_opv syscall invocation is
> enforced, so user-space cannot generate a too long preempt-off critical
> section. Each operation is also limited a length of PAGE_SIZE bytes,
> meaning that an operation can touch a maximum of 4 pages (memcpy: 2
> pages for source, 2 pages for destination if addresses are not aligned
> on page boundaries).
>
> If the thread is not running on the requested CPU, a new
> push_task_to_cpu() is invoked to migrate the task to the requested CPU.
> If the requested CPU is not part of the cpus allowed mask of the thread,
> the system call fails with EINVAL. After the migration has been
> performed, preemption is disabled, and the current CPU number is checked
> again and compared to the requested CPU number. If it still differs, it
> means the scheduler migrated us away from that CPU. Return EAGAIN to
> user-space in that case, and let user-space retry (either requesting the
> same CPU number, or a different one, depending on the user-space
> algorithm constraints).

This series seems to get more complicated every time, and it's been so
long that I've mostly forgetten all the details.  I would have sworn
we had a solution that got single-stepping right without any
complicated work like this in the kernel and had at most a minor
performance hit relative to the absolutely fastest solution.  I'll try
to dig it up.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH for 4.15 09/14] Provide cpu_opv system call
       [not found]   ` <20171012230326.19984-10-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  2017-10-13 13:57     ` Alan Cox
  2017-10-13 17:20     ` Andy Lutomirski
@ 2017-10-14  2:50     ` Andi Kleen
       [not found]       ` <20171014025029.GL2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
  2 siblings, 1 reply; 61+ messages in thread
From: Andi Kleen @ 2017-10-14  2:50 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner,
	Andi Kleen, Chris Lameter, Ingo Molnar, H. Peter Anvin,
	Ben Maurer, Steven Rostedt, Linus Torvalds, Andrew Morton,
	Russell King, Catalin Marinas

> +	pagefault_disable();
> +	switch (len) {
> +	case 1:
> +		if (__get_user(tmp._u8, (uint8_t __user *)p))
> +			goto end;
> +		tmp._u8 += (uint8_t)count;
> +		if (__put_user(tmp._u8, (uint8_t __user *)p))
> +			goto end;
> +		break;

It seems the code size could be dramatically shrunk by using a function
pointer callback for the actual operation, and then avoiding all the
copies. Since this is only a fallback this shouldn't be a problem
(and modern branch predictors are fairly good in such situations anyways)

If you really want to keep the code it would be better if you used
some macros -- i bet there's a typo in here somewhere in this
forest.

-Andi

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <20171014025029.GL2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>]

* Re: [RFC PATCH for 4.15 09/14] Provide cpu_opv system call
       [not found]       ` <20171014025029.GL2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
@ 2017-10-14 13:35         ` Mathieu Desnoyers
  0 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-14 13:35 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Thomas Gleixner, Chris Lameter,
	Ingo Molnar, H. Peter Anvin, Ben Maurer, rostedt, Linus Torvalds,
	Andrew Morton, Russell King, Catalin Marinas, Michael

----- On Oct 13, 2017, at 10:50 PM, Andi Kleen andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org wrote:

>> +	pagefault_disable();
>> +	switch (len) {
>> +	case 1:
>> +		if (__get_user(tmp._u8, (uint8_t __user *)p))
>> +			goto end;
>> +		tmp._u8 += (uint8_t)count;
>> +		if (__put_user(tmp._u8, (uint8_t __user *)p))
>> +			goto end;
>> +		break;
> 
> It seems the code size could be dramatically shrunk by using a function
> pointer callback for the actual operation, and then avoiding all the
> copies. Since this is only a fallback this shouldn't be a problem
> (and modern branch predictors are fairly good in such situations anyways)
> 
> If you really want to keep the code it would be better if you used
> some macros -- i bet there's a typo in here somewhere in this
> forest.

Good point!

I'll add this in the next round. Meanwhile, here is the resulting
commit: https://github.com/compudj/linux-percpu-dev/commit/864b444f64f4c227ddc587d12631ff0d3440796c

Thanks,

Mathieu


> 
> -Andi

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [RFC PATCH for 4.15 10/14] cpu_opv: Wire up x86 32/64 system call
       [not found] <20171012230326.19984-1-mathieu.desnoyers@efficios.com>
                   ` (6 preceding siblings ...)
  2017-10-12 23:03 ` [RFC PATCH for 4.15 09/14] Provide cpu_opv " Mathieu Desnoyers
@ 2017-10-12 23:03 ` Mathieu Desnoyers
  2017-10-12 23:03 ` [RFC PATCH for 4.15 12/14] cpu_opv: Wire up ARM32 " Mathieu Desnoyers
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-12 23:03 UTC (permalink / raw)
  To: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon
  Cc: linux-kernel, Mathieu Desnoyers, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Ben Maurer,
	Steven Rostedt, Linus Torvalds, Andrew Morton, Russell King,
	Catalin Marinas, Michael Kerrisk, linux-api

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul Turner <pjt@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Andrew Hunter <ahh@google.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ba43ee75e425..afc6988fb2c8 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -392,3 +392,4 @@
 383	i386	statx			sys_statx
 384	i386	arch_prctl		sys_arch_prctl			compat_sys_arch_prctl
 385	i386	rseq			sys_rseq
+386	i386	cpu_opv			sys_cpu_opv
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 3ad03495bbb9..ab5d1f9f9396 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -340,6 +340,7 @@
 331	common	pkey_free		sys_pkey_free
 332	common	statx			sys_statx
 333	common	rseq			sys_rseq
+334	common	cpu_opv			sys_cpu_opv
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH for 4.15 12/14] cpu_opv: Wire up ARM32 system call
       [not found] <20171012230326.19984-1-mathieu.desnoyers@efficios.com>
                   ` (7 preceding siblings ...)
  2017-10-12 23:03 ` [RFC PATCH for 4.15 10/14] cpu_opv: Wire up x86 32/64 " Mathieu Desnoyers
@ 2017-10-12 23:03 ` Mathieu Desnoyers
  2017-10-12 23:03 ` [RFC PATCH for 4.15 13/14] cpu_opv: Implement selftests Mathieu Desnoyers
  2017-10-12 23:03 ` [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests Mathieu Desnoyers
  10 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-12 23:03 UTC (permalink / raw)
  To: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon
  Cc: linux-kernel, Mathieu Desnoyers, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	Ben Maurer, Steven Rostedt, Linus Torvalds, Andrew Morton,
	linux-api

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 arch/arm/tools/syscall.tbl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index fbc74b5fa3ed..213ccfc2c437 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -413,3 +413,4 @@
 396	common	pkey_free		sys_pkey_free
 397	common	statx			sys_statx
 398	common	rseq			sys_rseq
+399	common	cpu_opv			sys_cpu_opv
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH for 4.15 13/14] cpu_opv: Implement selftests
       [not found] <20171012230326.19984-1-mathieu.desnoyers@efficios.com>
                   ` (8 preceding siblings ...)
  2017-10-12 23:03 ` [RFC PATCH for 4.15 12/14] cpu_opv: Wire up ARM32 " Mathieu Desnoyers
@ 2017-10-12 23:03 ` Mathieu Desnoyers
  2017-10-12 23:03 ` [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests Mathieu Desnoyers
  10 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-12 23:03 UTC (permalink / raw)
  To: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon
  Cc: linux-kernel, Mathieu Desnoyers, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, Steven Rostedt, Linus Torvalds,
	Andrew Morton, Shuah Khan, linux-kselftest, linux-api

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Shuah Khan <shuah@kernel.org>
CC: linux-kselftest@vger.kernel.org
CC: linux-api@vger.kernel.org
---
 MAINTAINERS                                        |   1 +
 tools/testing/selftests/cpu-opv/.gitignore         |   1 +
 tools/testing/selftests/cpu-opv/Makefile           |  13 +
 .../testing/selftests/cpu-opv/basic_cpu_opv_test.c | 828 +++++++++++++++++++++
 tools/testing/selftests/cpu-opv/cpu-op.c           | 189 +++++
 tools/testing/selftests/cpu-opv/cpu-op.h           |  53 ++
 6 files changed, 1085 insertions(+)
 create mode 100644 tools/testing/selftests/cpu-opv/.gitignore
 create mode 100644 tools/testing/selftests/cpu-opv/Makefile
 create mode 100644 tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c
 create mode 100644 tools/testing/selftests/cpu-opv/cpu-op.c
 create mode 100644 tools/testing/selftests/cpu-opv/cpu-op.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 6a5f3afb2ea4..9134a3234737 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3617,6 +3617,7 @@ L:	linux-kernel@vger.kernel.org
 S:	Supported
 F:	kernel/cpu_opv.c
 F:	include/uapi/linux/cpu_opv.h
+F:	tools/testing/selftests/cpu-opv/
 
 CRAMFS FILESYSTEM
 W:	http://sourceforge.net/projects/cramfs/
diff --git a/tools/testing/selftests/cpu-opv/.gitignore b/tools/testing/selftests/cpu-opv/.gitignore
new file mode 100644
index 000000000000..c7186eb95cf5
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/.gitignore
@@ -0,0 +1 @@
+basic_cpu_opv_test
diff --git a/tools/testing/selftests/cpu-opv/Makefile b/tools/testing/selftests/cpu-opv/Makefile
new file mode 100644
index 000000000000..81d0596824ee
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/Makefile
@@ -0,0 +1,13 @@
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/
+LDFLAGS += -lpthread
+
+TESTS = basic_cpu_opv_test
+
+all: $(TESTS)
+%: %.c cpu-op.c cpu-op.h
+	$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
+
+include ../lib.mk
+
+clean:
+	$(RM) $(TESTS)
diff --git a/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c b/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c
new file mode 100644
index 000000000000..e2ad818cca1c
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c
@@ -0,0 +1,828 @@
+/*
+ * Basic test coverage for cpu_opv system call.
+ */
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/time.h>
+#include <errno.h>
+#include <stdlib.h>
+
+#include "cpu-op.h"
+
+#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
+
+#define TESTBUFLEN	4096
+
+static int test_compare_eq_op(char *a, char *b, size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)a,
+			.u.compare_op.b = (unsigned long)b,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_compare_eq_same(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_compare_eq same";
+
+	printf("Testing %s\n", test_name);
+
+	/* Test compare_eq */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf2[i] = (char)i;
+	ret = test_compare_eq_op(buf2, buf1, TESTBUFLEN);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret > 0) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 0);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_compare_eq_diff(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_compare_eq different";
+
+	printf("Testing %s\n", test_name);
+
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN);
+	ret = test_compare_eq_op(buf2, buf1, TESTBUFLEN);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret == 0) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 1);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_compare_ne_op(char *a, char *b, size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_NE_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)a,
+			.u.compare_op.b = (unsigned long)b,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_compare_ne_same(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_compare_ne same";
+
+	printf("Testing %s\n", test_name);
+
+	/* Test compare_ne */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf2[i] = (char)i;
+	ret = test_compare_ne_op(buf2, buf1, TESTBUFLEN);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret == 0) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 1);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_compare_ne_diff(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_compare_ne different";
+
+	printf("Testing %s\n", test_name);
+
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN);
+	ret = test_compare_ne_op(buf2, buf1, TESTBUFLEN);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret != 0) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 0);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_2compare_eq_op(char *a, char *b, char *c, char *d,
+		size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)a,
+			.u.compare_op.b = (unsigned long)b,
+		},
+		[1] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)c,
+			.u.compare_op.b = (unsigned long)d,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_2compare_eq_index(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	char buf3[TESTBUFLEN];
+	char buf4[TESTBUFLEN];
+	const char *test_name = "test_2compare_eq index";
+
+	printf("Testing %s\n", test_name);
+
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN);
+	memset(buf3, 0, TESTBUFLEN);
+	memset(buf4, 0, TESTBUFLEN);
+
+	/* First compare failure is op[0], expect 1. */
+	ret = test_2compare_eq_op(buf2, buf1, buf4, buf3, TESTBUFLEN);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret != 1) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 1);
+		return -1;
+	}
+
+	/* All compares succeed. */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf2[i] = (char)i;
+	ret = test_2compare_eq_op(buf2, buf1, buf4, buf3, TESTBUFLEN);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret != 0) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 0);
+		return -1;
+	}
+
+	/* First compare failure is op[1], expect 2. */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf3[i] = (char)i;
+	ret = test_2compare_eq_op(buf2, buf1, buf4, buf3, TESTBUFLEN);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret != 2) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 2);
+		return -1;
+	}
+
+	return 0;
+}
+
+static int test_2compare_ne_op(char *a, char *b, char *c, char *d,
+		size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_NE_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)a,
+			.u.compare_op.b = (unsigned long)b,
+		},
+		[1] = {
+			.op = CPU_COMPARE_NE_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)c,
+			.u.compare_op.b = (unsigned long)d,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_2compare_ne_index(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	char buf3[TESTBUFLEN];
+	char buf4[TESTBUFLEN];
+	const char *test_name = "test_2compare_ne index";
+
+	printf("Testing %s\n", test_name);
+
+	memset(buf1, 0, TESTBUFLEN);
+	memset(buf2, 0, TESTBUFLEN);
+	memset(buf3, 0, TESTBUFLEN);
+	memset(buf4, 0, TESTBUFLEN);
+
+	/* First compare ne failure is op[0], expect 1. */
+	ret = test_2compare_ne_op(buf2, buf1, buf4, buf3, TESTBUFLEN);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret != 1) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 1);
+		return -1;
+	}
+
+	/* All compare ne succeed. */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf3[i] = (char)i;
+	ret = test_2compare_ne_op(buf2, buf1, buf4, buf3, TESTBUFLEN);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret != 0) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 0);
+		return -1;
+	}
+
+	/* First compare failure is op[1], expect 2. */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf4[i] = (char)i;
+	ret = test_2compare_ne_op(buf2, buf1, buf4, buf3, TESTBUFLEN);
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret != 2) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 2);
+		return -1;
+	}
+
+	return 0;
+}
+
+
+static int test_memcpy_op(void *dst, void *src, size_t len)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)dst,
+			.u.memcpy_op.src = (unsigned long)src,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_memcpy(void)
+{
+	int i, ret;
+	char buf1[TESTBUFLEN];
+	char buf2[TESTBUFLEN];
+	const char *test_name = "test_memcpy";
+
+	printf("Testing %s\n", test_name);
+
+	/* Test memcpy */
+	for (i = 0; i < TESTBUFLEN; i++)
+		buf1[i] = (char)i;
+	memset(buf2, 0, TESTBUFLEN);
+	ret = test_memcpy_op(buf2, buf1, TESTBUFLEN);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	for (i = 0; i < TESTBUFLEN; i++) {
+		if (buf2[i] != (char)i) {
+			printf("%s failed. Expecting '%d', found '%d' at offset %d\n",
+				test_name, (char)i, buf2[i], i);
+			return -1;
+		}
+	}
+	return 0;
+}
+
+static int test_memcpy_u32(void)
+{
+	int ret;
+	uint32_t v1, v2;
+	const char *test_name = "test_memcpy_u32";
+
+	printf("Testing %s\n", test_name);
+
+	/* Test memcpy_u32 */
+	v1 = 42;
+	v2 = 0;
+	ret = test_memcpy_op(&v2, &v1, sizeof(v1));
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (v1 != v2) {
+		printf("%s failed. Expecting '%d', found '%d'\n",
+			test_name, v1, v2);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_add_op(int *v, int64_t increment)
+{
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_op_add(v, increment, sizeof(*v), cpu);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_add(void)
+{
+	int orig_v = 42, v, ret;
+	int increment = 1;
+	const char *test_name = "test_add";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_add_op(&v, increment);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != orig_v + increment) {
+		printf("%s unexpected value: %d. Should be %d.\n",
+			test_name, v, orig_v);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_two_add_op(int *v, int64_t *increments)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_ADD_OP,
+			.len = sizeof(*v),
+			.u.arithmetic_op.p = (unsigned long)v,
+			.u.arithmetic_op.count = increments[0],
+		},
+		[1] = {
+			.op = CPU_ADD_OP,
+			.len = sizeof(*v),
+			.u.arithmetic_op.p = (unsigned long)v,
+			.u.arithmetic_op.count = increments[1],
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_two_add(void)
+{
+	int orig_v = 42, v, ret;
+	int64_t increments[2] = { 99, 123 };
+	const char *test_name = "test_two_add";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_two_add_op(&v, increments);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != orig_v + increments[0] + increments[1]) {
+		printf("%s unexpected value: %d. Should be %d.\n",
+			test_name, v, orig_v);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_or_op(int *v, uint64_t mask)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_OR_OP,
+			.len = sizeof(*v),
+			.u.bitwise_op.p = (unsigned long)v,
+			.u.bitwise_op.mask = mask,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_or(void)
+{
+	int orig_v = 0xFF00000, v, ret;
+	uint32_t mask = 0xFFF;
+	const char *test_name = "test_or";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_or_op(&v, mask);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v | mask)) {
+		printf("%s unexpected value: %d. Should be %d.\n",
+			test_name, v, orig_v | mask);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_and_op(int *v, uint64_t mask)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_AND_OP,
+			.len = sizeof(*v),
+			.u.bitwise_op.p = (unsigned long)v,
+			.u.bitwise_op.mask = mask,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_and(void)
+{
+	int orig_v = 0xF00, v, ret;
+	uint32_t mask = 0xFFF;
+	const char *test_name = "test_and";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_and_op(&v, mask);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v & mask)) {
+		printf("%s unexpected value: %d. Should be %d.\n",
+			test_name, v, orig_v & mask);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_xor_op(int *v, uint64_t mask)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_XOR_OP,
+			.len = sizeof(*v),
+			.u.bitwise_op.p = (unsigned long)v,
+			.u.bitwise_op.mask = mask,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_xor(void)
+{
+	int orig_v = 0xF00, v, ret;
+	uint32_t mask = 0xFFF;
+	const char *test_name = "test_xor";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_xor_op(&v, mask);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v ^ mask)) {
+		printf("%s unexpected value: %d. Should be %d.\n",
+			test_name, v, orig_v ^ mask);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_lshift_op(int *v, uint32_t bits)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_LSHIFT_OP,
+			.len = sizeof(*v),
+			.u.shift_op.p = (unsigned long)v,
+			.u.shift_op.bits = bits,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_lshift(void)
+{
+	int orig_v = 0xF00, v, ret;
+	uint32_t bits = 5;
+	const char *test_name = "test_lshift";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_lshift_op(&v, bits);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v << bits)) {
+		printf("%s unexpected value: %d. Should be %d.\n",
+			test_name, v, orig_v << bits);
+		return -1;
+	}
+	return 0;
+}
+
+
+static int test_rshift_op(int *v, uint32_t bits)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_RSHIFT_OP,
+			.len = sizeof(*v),
+			.u.shift_op.p = (unsigned long)v,
+			.u.shift_op.bits = bits,
+		},
+	};
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+static int test_rshift(void)
+{
+	int orig_v = 0xF00, v, ret;
+	uint32_t bits = 5;
+	const char *test_name = "test_rshift";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_rshift_op(&v, bits);
+	if (ret) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		return -1;
+	}
+	if (v != (orig_v >> bits)) {
+		printf("%s unexpected value: %d. Should be %d.\n",
+			test_name, v, orig_v >> bits);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_cmpxchg_op(void *v, void *expect, void *old, void *n,
+		size_t len)
+{
+	int ret, cpu;
+
+	do {
+		cpu = cpu_op_get_current_cpu();
+		ret = cpu_op_cmpxchg(v, expect, old, n, len, cpu);
+	} while (ret == -1 && errno == EAGAIN);
+
+	return ret;
+}
+
+
+static int test_cmpxchg_success(void)
+{
+	int ret;
+	uint64_t orig_v = 1, v, expect = 1, old = 0, n = 3;
+	const char *test_name = "test_cmpxchg success";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_cmpxchg_op(&v, &expect, &old, &n, sizeof(uint64_t));
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 0);
+		return -1;
+	}
+	if (v != n) {
+		printf("%s v is %lld, expecting %lld\n",
+			test_name, (long long)v, (long long)n);
+		return -1;
+	}
+	if (old != orig_v) {
+		printf("%s old is %lld, expecting %lld\n",
+			test_name, (long long)old, (long long)orig_v);
+		return -1;
+	}
+	return 0;
+}
+
+static int test_cmpxchg_fail(void)
+{
+	int ret;
+	uint64_t orig_v = 1, v, expect = 123, old = 0, n = 3;
+	const char *test_name = "test_cmpxchg fail";
+
+	printf("Testing %s\n", test_name);
+
+	v = orig_v;
+	ret = test_cmpxchg_op(&v, &expect, &old, &n, sizeof(uint64_t));
+	if (ret < 0) {
+		printf("%s returned with %d, errno: %s\n",
+			test_name, ret, strerror(errno));
+		exit(-1);
+	}
+	if (ret == 0) {
+		printf("%s returned %d, expecting %d\n",
+			test_name, ret, 1);
+		return -1;
+	}
+	if (v == n) {
+		printf("%s v is %lld, expecting %lld\n",
+			test_name, (long long)v, (long long)orig_v);
+		return -1;
+	}
+	if (old != orig_v) {
+		printf("%s old is %lld, expecting %lld\n",
+			test_name, (long long)old, (long long)orig_v);
+		return -1;
+	}
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	int ret = 0;
+
+	ret |= test_compare_eq_same();
+	ret |= test_compare_eq_diff();
+	ret |= test_compare_ne_same();
+	ret |= test_compare_ne_diff();
+	ret |= test_2compare_eq_index();
+	ret |= test_2compare_ne_index();
+	ret |= test_memcpy();
+	ret |= test_memcpy_u32();
+	ret |= test_add();
+	ret |= test_two_add();
+	ret |= test_or();
+	ret |= test_and();
+	ret |= test_xor();
+	ret |= test_lshift();
+	ret |= test_rshift();
+	ret |= test_cmpxchg_success();
+	ret |= test_cmpxchg_fail();
+
+	return ret;
+}
diff --git a/tools/testing/selftests/cpu-opv/cpu-op.c b/tools/testing/selftests/cpu-opv/cpu-op.c
new file mode 100644
index 000000000000..d25420c74a71
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/cpu-op.c
@@ -0,0 +1,189 @@
+/*
+ * cpu-op.c
+ *
+ * Copyright (C) 2017 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; only
+ * version 2.1 of the License.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <syscall.h>
+#include <assert.h>
+#include <signal.h>
+
+#include "cpu-op.h"
+
+#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
+
+int cpu_opv(struct cpu_op *cpu_opv, int cpuopcnt, int cpu, int flags)
+{
+	return syscall(__NR_cpu_opv, cpu_opv, cpuopcnt, cpu, flags);
+}
+
+int cpu_op_get_current_cpu(void)
+{
+	int cpu;
+
+	cpu = sched_getcpu();
+	if (cpu < 0) {
+		perror("sched_getcpu()");
+		abort();
+	}
+	return cpu;
+}
+
+int cpu_op_cmpstore(void *v, void *expect, void *n, size_t len, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)expect,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)n,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_2cmp1store(void *v, void *expect, void *n, void *check2,
+		void *expect2, size_t len, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)expect,
+		},
+		[1] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)check2,
+			.u.compare_op.b = (unsigned long)expect2,
+		},
+		[2] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)n,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_1cmp2store(void *v, void *expect, void *_new,
+		void *v2, void *_new2, size_t len, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)expect,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)_new,
+		},
+		[2] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)v2,
+			.u.memcpy_op.src = (unsigned long)_new2,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpxchg(void *v, void *expect, void *old, void *n,
+		size_t len, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)old,
+			.u.memcpy_op.src = (unsigned long)v,
+		},
+		[1] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)expect,
+		},
+		[2] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)n,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_add(void *v, int64_t count, size_t len, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_ADD_OP,
+			.len = len,
+			.u.arithmetic_op.p = (unsigned long)v,
+			.u.arithmetic_op.count = count,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpstorememcpy(void *v, void *expect, void *_new, size_t len,
+		void *dst, void *src, size_t copylen, int cpu)
+{
+	struct cpu_op opvec[] = {
+		[0] = {
+			.op = CPU_COMPARE_EQ_OP,
+			.len = len,
+			.u.compare_op.a = (unsigned long)v,
+			.u.compare_op.b = (unsigned long)expect,
+		},
+		[1] = {
+			.op = CPU_MEMCPY_OP,
+			.len = len,
+			.u.memcpy_op.dst = (unsigned long)v,
+			.u.memcpy_op.src = (unsigned long)_new,
+		},
+		[2] = {
+			.op = CPU_MEMCPY_OP,
+			.len = copylen,
+			.u.memcpy_op.dst = (unsigned long)dst,
+			.u.memcpy_op.src = (unsigned long)src,
+		},
+	};
+
+	return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
diff --git a/tools/testing/selftests/cpu-opv/cpu-op.h b/tools/testing/selftests/cpu-opv/cpu-op.h
new file mode 100644
index 000000000000..3f5679e643bb
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/cpu-op.h
@@ -0,0 +1,53 @@
+/*
+ * cpu-op.h
+ *
+ * (C) Copyright 2017 - Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef CPU_OPV_H
+#define CPU_OPV_H
+
+#include <stdlib.h>
+#include <linux/cpu_opv.h>
+
+#define likely(x)		__builtin_expect(!!(x), 1)
+#define unlikely(x)		__builtin_expect(!!(x), 0)
+#define barrier()		__asm__ __volatile__("" : : : "memory")
+
+#define ACCESS_ONCE(x)		(*(__volatile__  __typeof__(x) *)&(x))
+#define WRITE_ONCE(x, v)	__extension__ ({ ACCESS_ONCE(x) = (v); })
+#define READ_ONCE(x)		ACCESS_ONCE(x)
+
+int cpu_opv(struct cpu_op *cpuopv, int cpuopcnt, int cpu, int flags);
+int cpu_op_get_current_cpu(void);
+
+int cpu_op_cmpstore(void *v, void *expect, void *_new, size_t len, int cpu);
+int cpu_op_2cmp1store(void *v, void *expect, void *_new, void *check2,
+		void *expect2, size_t len, int cpu);
+int cpu_op_1cmp2store(void *v, void *expect, void *_new,
+		void *v2, void *_new2, size_t len, int cpu);
+int cpu_op_cmpxchg(void *v, void *expect, void *old, void *_new,
+		size_t len, int cpu);
+int cpu_op_add(void *v, int64_t count, size_t len, int cpu);
+int cpu_op_cmpstorememcpy(void *v, void *expect, void *_new, size_t len,
+		void *dst, void *src, size_t copylen, int cpu);
+
+#endif  /* CPU_OPV_H_ */
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests
       [not found] <20171012230326.19984-1-mathieu.desnoyers@efficios.com>
                   ` (9 preceding siblings ...)
  2017-10-12 23:03 ` [RFC PATCH for 4.15 13/14] cpu_opv: Implement selftests Mathieu Desnoyers
@ 2017-10-12 23:03 ` Mathieu Desnoyers
  2017-10-16  2:51   ` Michael Ellerman
       [not found]   ` <20171012230326.19984-15-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  10 siblings, 2 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-12 23:03 UTC (permalink / raw)
  To: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon
  Cc: linux-kernel, Mathieu Desnoyers, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, Steven Rostedt, Linus Torvalds,
	Andrew Morton, Shuah Khan, linux-kselftest, linux-api

Implements two basic tests of RSEQ functionality, and one more
exhaustive parameterizable test.

The first, "basic_test" only asserts that RSEQ works moderately
correctly.
E.g. that:
- The CPUID pointer works
- Code infinitely looping within a critical section will eventually be
  interrupted.
- Critical sections are interrupted by signals.

"basic_percpu_ops_test" is a slightly more "realistic" variant,
implementing a few simple per-cpu operations and testing their
correctness.

"param_test" is a parametrizable restartable sequences test. See
the "--help" output for usage.

As part of those tests, a helper library "rseq" implements a user-space
API around restartable sequences. It uses the cpu_opv system call as
fallback when single-stepped by a debugger. It exposes the instruction
pointer addresses where the rseq assembly blocks begin and end, as well
as the associated abort instruction pointer, in the __rseq_table
section. This section allows debuggers may know where to place
breakpoints when single-stepping through assembly blocks which may be
aborted at any point by the kernel.

The following rseq APIs are implemented in this helper library:
- rseq_register_current_thread()/rseq_unregister_current_thread():
    register/unregister current thread's use of rseq,
- rseq_current_cpu_raw():
    current CPU number,
- rseq_start():
    beginning of a restartable sequence,
- rseq_cpu_at_start():
    CPU number at start of restartable sequence,
- rseq_finish():
    End of restartable sequence made of zero or more loads, completed by
    a word-sized store,
- rseq_finish2():
    End of restartable sequence made of zero or more loads, one
    speculative word-sized store, completed by a word-sized store,
- rseq_finish2_release():
    End of restartable sequence made of zero or more loads, one
    speculative word-sized store, completed by a word-sized store with
    release semantic,
- rseq_finish_memcpy():
    End of restartable sequence made of zero or more loads, a
    speculative copy of a variable length memory region, completed by a
    word-sized store.
- rseq_finish_memcpy_release():
    End of restartable sequence made of zero or more loads, a
    speculative copy of a variable length memory region, completed by a
    word-sized store with release semantic.

PowerPC tests have been implemented by Boqun Feng.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul Turner <pjt@google.com>
CC: Andrew Hunter <ahh@google.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Shuah Khan <shuah@kernel.org>
CC: linux-kselftest@vger.kernel.org
CC: linux-api@vger.kernel.org
---
 MAINTAINERS                                        |    1 +
 tools/testing/selftests/rseq/.gitignore            |    4 +
 tools/testing/selftests/rseq/Makefile              |   13 +
 .../testing/selftests/rseq/basic_percpu_ops_test.c |  319 +++++
 tools/testing/selftests/rseq/basic_test.c          |   97 ++
 tools/testing/selftests/rseq/param_test.c          | 1246 ++++++++++++++++++++
 tools/testing/selftests/rseq/rseq-arm.h            |  159 +++
 tools/testing/selftests/rseq/rseq-ppc.h            |  266 +++++
 tools/testing/selftests/rseq/rseq-x86.h            |  304 +++++
 tools/testing/selftests/rseq/rseq.c                |   78 ++
 tools/testing/selftests/rseq/rseq.h                |  298 +++++
 11 files changed, 2785 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/.gitignore
 create mode 100644 tools/testing/selftests/rseq/Makefile
 create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.c
 create mode 100644 tools/testing/selftests/rseq/basic_test.c
 create mode 100644 tools/testing/selftests/rseq/param_test.c
 create mode 100644 tools/testing/selftests/rseq/rseq-arm.h
 create mode 100644 tools/testing/selftests/rseq/rseq-ppc.h
 create mode 100644 tools/testing/selftests/rseq/rseq-x86.h
 create mode 100644 tools/testing/selftests/rseq/rseq.c
 create mode 100644 tools/testing/selftests/rseq/rseq.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 9134a3234737..a79b0b473e7f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11242,6 +11242,7 @@ S:	Supported
 F:	kernel/rseq.c
 F:	include/uapi/linux/rseq.h
 F:	include/trace/events/rseq.h
+F:	tools/testing/selftests/rseq/
 
 RFKILL
 M:	Johannes Berg <johannes@sipsolutions.net>
diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
new file mode 100644
index 000000000000..9409c3db99b2
--- /dev/null
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -0,0 +1,4 @@
+basic_percpu_ops_test
+basic_test
+basic_rseq_op_test
+param_test
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
new file mode 100644
index 000000000000..7f0153556b80
--- /dev/null
+++ b/tools/testing/selftests/rseq/Makefile
@@ -0,0 +1,13 @@
+CFLAGS += -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/
+LDFLAGS += -lpthread
+
+TESTS = basic_test basic_percpu_ops_test param_test
+
+all: $(TESTS)
+%: %.c rseq.h rseq-*.h rseq.c ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h
+	$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
+
+include ../lib.mk
+
+clean:
+	$(RM) $(TESTS)
diff --git a/tools/testing/selftests/rseq/basic_percpu_ops_test.c b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
new file mode 100644
index 000000000000..5771470862bf
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
@@ -0,0 +1,319 @@
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "rseq.h"
+#include "cpu-op.h"
+
+#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
+
+struct percpu_lock_entry {
+	intptr_t v;
+} __attribute__((aligned(128)));
+
+struct percpu_lock {
+	struct percpu_lock_entry c[CPU_SETSIZE];
+};
+
+struct test_data_entry {
+	intptr_t count;
+} __attribute__((aligned(128)));
+
+struct spinlock_test_data {
+	struct percpu_lock lock;
+	struct test_data_entry c[CPU_SETSIZE];
+	int reps;
+};
+
+struct percpu_list_node {
+	intptr_t data;
+	struct percpu_list_node *next;
+};
+
+struct percpu_list_entry {
+	struct percpu_list_node *head;
+} __attribute__((aligned(128)));
+
+struct percpu_list {
+	struct percpu_list_entry c[CPU_SETSIZE];
+};
+
+/* A simple percpu spinlock.  Returns the cpu lock was acquired on. */
+int rseq_percpu_lock(struct percpu_lock *lock)
+{
+	int cpu;
+
+	for (;;) {
+		struct rseq_state rseq_state;
+		intptr_t expect = 0, n = 1;
+		int ret;
+
+		/* Try fast path. */
+		rseq_state = rseq_start();
+		cpu = rseq_cpu_at_start(rseq_state);
+		if (unlikely(lock->c[cpu].v != 0))
+			continue;	/* Retry.*/
+		if (likely(rseq_finish(&lock->c[cpu].v, 1, rseq_state)))
+			break;
+		/* Fallback on cpu_opv system call. */
+		cpu = rseq_current_cpu_raw();
+		ret = cpu_op_cmpstore(&lock->c[cpu].v, &expect, &n,
+			sizeof(intptr_t), cpu);
+		if (likely(!ret))
+			break;
+		assert(ret >= 0 || errno == EAGAIN);
+	}
+	/*
+	 * Acquire semantic when taking lock after control dependency.
+	 * Matches smp_store_release().
+	 */
+	smp_acquire__after_ctrl_dep();
+	return cpu;
+}
+
+void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+	assert(lock->c[cpu].v == 1);
+	/*
+	 * Release lock, with release semantic. Matches
+	 * smp_acquire__after_ctrl_dep().
+	 */
+	smp_store_release(&lock->c[cpu].v, 0);
+}
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+	struct spinlock_test_data *data = arg;
+	int i, cpu;
+
+	if (rseq_register_current_thread())
+		abort();
+	for (i = 0; i < data->reps; i++) {
+		cpu = rseq_percpu_lock(&data->lock);
+		data->c[cpu].count++;
+		rseq_percpu_unlock(&data->lock, cpu);
+	}
+	if (rseq_unregister_current_thread())
+		abort();
+
+	return NULL;
+}
+
+/*
+ * A simple test which implements a sharded counter using a per-cpu
+ * lock.  Obviously real applications might prefer to simply use a
+ * per-cpu increment; however, this is reasonable for a test and the
+ * lock can be extended to synchronize more complicated operations.
+ */
+void test_percpu_spinlock(void)
+{
+	const int num_threads = 200;
+	int i;
+	uint64_t sum;
+	pthread_t test_threads[num_threads];
+	struct spinlock_test_data data;
+
+	memset(&data, 0, sizeof(data));
+	data.reps = 5000;
+
+	for (i = 0; i < num_threads; i++)
+		pthread_create(&test_threads[i], NULL,
+			test_percpu_spinlock_thread, &data);
+
+	for (i = 0; i < num_threads; i++)
+		pthread_join(test_threads[i], NULL);
+
+	sum = 0;
+	for (i = 0; i < CPU_SETSIZE; i++)
+		sum += data.c[i].count;
+
+	assert(sum == (uint64_t)data.reps * num_threads);
+}
+
+int percpu_list_push(struct percpu_list *list, struct percpu_list_node *node)
+{
+	struct rseq_state rseq_state;
+	intptr_t *targetptr, newval, expect;
+	int cpu;
+
+	/* Try fast path. */
+	rseq_state = rseq_start();
+	cpu = rseq_cpu_at_start(rseq_state);
+	newval = (intptr_t)node;
+	targetptr = (intptr_t *)&list->c[cpu].head;
+	node->next = list->c[cpu].head;
+	if (unlikely(!rseq_finish(targetptr, newval, rseq_state))) {
+		/* Fallback on cpu_opv system call. */
+		for (;;) {
+			int ret;
+
+			cpu = rseq_current_cpu_raw();
+			/* Load list->c[cpu].head with single-copy atomicity. */
+			expect = (intptr_t)READ_ONCE(list->c[cpu].head);
+			newval = (intptr_t)node;
+			targetptr = (intptr_t *)&list->c[cpu].head;
+			node->next = (struct percpu_list_node *)expect;
+			ret = cpu_op_cmpstore(targetptr, &expect, &newval,
+				sizeof(intptr_t), cpu);
+			if (likely(!ret))
+				break;
+			assert(ret >= 0 || errno == EAGAIN);
+		}
+	}
+	return cpu;
+}
+
+/*
+ * Unlike a traditional lock-less linked list; the availability of a
+ * rseq primitive allows us to implement pop without concerns over
+ * ABA-type races.
+ */
+struct percpu_list_node *percpu_list_pop(struct percpu_list *list)
+{
+	struct percpu_list_node *head, *next;
+	struct rseq_state rseq_state;
+	intptr_t *targetptr, newval, expect;
+	int cpu;
+
+	/* Try fast path. */
+	rseq_state = rseq_start();
+	cpu = rseq_cpu_at_start(rseq_state);
+	/* Load head with single-copy atomicity. */
+	head = READ_ONCE(list->c[cpu].head);
+	if (!head)
+		return NULL;
+	/* Load head->next with single-copy atomicity. */
+	next = READ_ONCE(head->next);
+	newval = (intptr_t)next;
+	targetptr = (intptr_t *)&list->c[cpu].head;
+	if (unlikely(!rseq_finish(targetptr, newval, rseq_state))) {
+		/* Fallback on cpu_opv system call. */
+		for (;;) {
+			int ret;
+
+			cpu = rseq_current_cpu_raw();
+			/* Load head with single-copy atomicity. */
+			head = READ_ONCE(list->c[cpu].head);
+			if (!head)
+				return NULL;
+			expect = (intptr_t)head;
+			/* Load head->next with single-copy atomicity. */
+			next = READ_ONCE(head->next);
+			newval = (intptr_t)next;
+			targetptr = (intptr_t *)&list->c[cpu].head;
+			ret = cpu_op_2cmp1store(targetptr, &expect, &newval,
+				&head->next, &next,
+				sizeof(intptr_t), cpu);
+			if (likely(!ret))
+				break;
+			assert(ret >= 0 || errno == EAGAIN);
+		}
+	}
+
+	return head;
+}
+
+void *test_percpu_list_thread(void *arg)
+{
+	int i;
+	struct percpu_list *list = (struct percpu_list *)arg;
+
+	if (rseq_register_current_thread())
+		abort();
+
+	for (i = 0; i < 100000; i++) {
+		struct percpu_list_node *node = percpu_list_pop(list);
+
+		sched_yield();  /* encourage shuffling */
+		if (node)
+			percpu_list_push(list, node);
+	}
+
+	if (rseq_unregister_current_thread())
+		abort();
+
+	return NULL;
+}
+
+/* Simultaneous modification to a per-cpu linked list from many threads.  */
+void test_percpu_list(void)
+{
+	int i, j;
+	uint64_t sum = 0, expected_sum = 0;
+	struct percpu_list list;
+	pthread_t test_threads[200];
+	cpu_set_t allowed_cpus;
+
+	memset(&list, 0, sizeof(list));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+		for (j = 1; j <= 100; j++) {
+			struct percpu_list_node *node;
+
+			expected_sum += j;
+
+			node = malloc(sizeof(*node));
+			assert(node);
+			node->data = j;
+			node->next = list.c[i].head;
+			list.c[i].head = node;
+		}
+	}
+
+	for (i = 0; i < 200; i++)
+		assert(pthread_create(&test_threads[i], NULL,
+			test_percpu_list_thread, &list) == 0);
+
+	for (i = 0; i < 200; i++)
+		pthread_join(test_threads[i], NULL);
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		cpu_set_t pin_mask;
+		struct percpu_list_node *node;
+
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+
+		CPU_ZERO(&pin_mask);
+		CPU_SET(i, &pin_mask);
+		sched_setaffinity(0, sizeof(pin_mask), &pin_mask);
+
+		while ((node = percpu_list_pop(&list))) {
+			sum += node->data;
+			free(node);
+		}
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external
+	 * actor is interfering with our allowed affinity while this
+	 * test is running).
+	 */
+	assert(sum == expected_sum);
+}
+
+int main(int argc, char **argv)
+{
+	if (rseq_register_current_thread())
+		goto error;
+	printf("spinlock\n");
+	test_percpu_spinlock();
+	printf("percpu_list\n");
+	test_percpu_list();
+	if (rseq_unregister_current_thread())
+		goto error;
+	return 0;
+
+error:
+	return -1;
+}
+
diff --git a/tools/testing/selftests/rseq/basic_test.c b/tools/testing/selftests/rseq/basic_test.c
new file mode 100644
index 000000000000..236bbe2610af
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_test.c
@@ -0,0 +1,97 @@
+/*
+ * Basic test coverage for critical regions and rseq_current_cpu().
+ */
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+volatile int signals_delivered;
+volatile __thread struct rseq_state sigtest_start;
+
+void test_cpu_pointer(void)
+{
+	cpu_set_t affinity, test_affinity;
+	int i;
+
+	sched_getaffinity(0, sizeof(affinity), &affinity);
+	CPU_ZERO(&test_affinity);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (CPU_ISSET(i, &affinity)) {
+			CPU_SET(i, &test_affinity);
+			sched_setaffinity(0, sizeof(test_affinity),
+					&test_affinity);
+			assert(rseq_current_cpu() == sched_getcpu());
+			assert(rseq_current_cpu() == i);
+			CPU_CLR(i, &test_affinity);
+		}
+	}
+	sched_setaffinity(0, sizeof(affinity), &affinity);
+}
+
+/*
+ * This depends solely on some environmental event triggering a counter
+ * increase.
+ */
+void test_critical_section(void)
+{
+	struct rseq_state start;
+	uint32_t event_counter;
+
+	start = rseq_start();
+	event_counter = start.event_counter;
+	do {
+		start = rseq_start();
+	} while (start.event_counter == event_counter);
+}
+
+void test_signal_interrupt_handler(int signo)
+{
+	struct rseq_state current;
+
+	current = rseq_start();
+	/*
+	 * The potential critical section bordered by 'start' must be
+	 * invalid.
+	 */
+	assert(current.event_counter != sigtest_start.event_counter);
+	signals_delivered++;
+}
+
+void test_signal_interrupts(void)
+{
+	struct itimerval it = { { 0, 1 }, { 0, 1 } };
+	struct itimerval stop_it = { { 0, 0 }, { 0, 0 } };
+
+	setitimer(ITIMER_PROF, &it, NULL);
+	signal(SIGPROF, test_signal_interrupt_handler);
+
+	do {
+		sigtest_start = rseq_start();
+	} while (signals_delivered < 10);
+	setitimer(ITIMER_PROF, &stop_it, NULL);
+}
+
+int main(int argc, char **argv)
+{
+	if (rseq_register_current_thread())
+		goto init_thread_error;
+	printf("testing current cpu\n");
+	test_cpu_pointer();
+	printf("testing critical section\n");
+	test_critical_section();
+	printf("testing critical section is interrupted by signal\n");
+	test_signal_interrupts();
+	if (rseq_unregister_current_thread())
+		goto init_thread_error;
+	return 0;
+
+init_thread_error:
+	return -1;
+}
diff --git a/tools/testing/selftests/rseq/param_test.c b/tools/testing/selftests/rseq/param_test.c
new file mode 100644
index 000000000000..a68fa0886d50
--- /dev/null
+++ b/tools/testing/selftests/rseq/param_test.c
@@ -0,0 +1,1246 @@
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+#include <poll.h>
+#include <sys/types.h>
+#include <signal.h>
+#include <errno.h>
+
+#include "cpu-op.h"
+
+static inline pid_t gettid(void)
+{
+	return syscall(__NR_gettid);
+}
+
+#define NR_INJECT	9
+static int loop_cnt[NR_INJECT + 1];
+
+static int opt_modulo;
+
+static int opt_yield, opt_signal, opt_sleep,
+		opt_disable_rseq, opt_threads = 200,
+		opt_reps = 5000, opt_disable_mod = 0, opt_test = 's';
+
+static __thread unsigned int signals_delivered;
+
+#ifndef BENCHMARK
+
+static __thread unsigned int yield_mod_cnt, nr_retry;
+
+#define printf_nobench(fmt, ...)	printf(fmt, ## __VA_ARGS__)
+
+#define RSEQ_INJECT_INPUT \
+	, [loop_cnt_1]"m"(loop_cnt[1]) \
+	, [loop_cnt_2]"m"(loop_cnt[2]) \
+	, [loop_cnt_3]"m"(loop_cnt[3]) \
+	, [loop_cnt_4]"m"(loop_cnt[4]) \
+	, [loop_cnt_5]"m"(loop_cnt[5])
+
+#if defined(__x86_64__) || defined(__i386__)
+
+#define INJECT_ASM_REG	"eax"
+
+#define RSEQ_INJECT_CLOBBER \
+	, INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+	"mov %[loop_cnt_" #n "], %%" INJECT_ASM_REG "\n\t" \
+	"test %%" INJECT_ASM_REG ",%%" INJECT_ASM_REG "\n\t" \
+	"jz 333f\n\t" \
+	"222:\n\t" \
+	"dec %%" INJECT_ASM_REG "\n\t" \
+	"jnz 222b\n\t" \
+	"333:\n\t"
+
+#elif defined(__ARMEL__)
+
+#define INJECT_ASM_REG	"r4"
+
+#define RSEQ_INJECT_CLOBBER \
+	, INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+	"ldr " INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
+	"cmp " INJECT_ASM_REG ", #0\n\t" \
+	"beq 333f\n\t" \
+	"222:\n\t" \
+	"subs " INJECT_ASM_REG ", #1\n\t" \
+	"bne 222b\n\t" \
+	"333:\n\t"
+
+#elif __PPC__
+#define INJECT_ASM_REG	"r18"
+
+#define RSEQ_INJECT_CLOBBER \
+	, INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+	"lwz %%" INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
+	"cmpwi %%" INJECT_ASM_REG ", 0\n\t" \
+	"beq 333f\n\t" \
+	"222:\n\t" \
+	"subic. %%" INJECT_ASM_REG ", %%" INJECT_ASM_REG ", 1\n\t" \
+	"bne 222b\n\t" \
+	"333:\n\t"
+#else
+#error unsupported target
+#endif
+
+#define RSEQ_INJECT_FAILED \
+	nr_retry++;
+
+#define RSEQ_INJECT_C(n) \
+{ \
+	int loc_i, loc_nr_loops = loop_cnt[n]; \
+	\
+	for (loc_i = 0; loc_i < loc_nr_loops; loc_i++) { \
+		barrier(); \
+	} \
+	if (loc_nr_loops == -1 && opt_modulo) { \
+		if (yield_mod_cnt == opt_modulo - 1) { \
+			if (opt_sleep > 0) \
+				poll(NULL, 0, opt_sleep); \
+			if (opt_yield) \
+				sched_yield(); \
+			if (opt_signal) \
+				raise(SIGUSR1); \
+			yield_mod_cnt = 0; \
+		} else { \
+			yield_mod_cnt++; \
+		} \
+	} \
+}
+
+#else
+
+#define printf_nobench(fmt, ...)
+
+#endif /* BENCHMARK */
+
+#include "rseq.h"
+
+struct percpu_lock_entry {
+	intptr_t v;
+} __attribute__((aligned(128)));
+
+struct percpu_lock {
+	struct percpu_lock_entry c[CPU_SETSIZE];
+};
+
+struct test_data_entry {
+	intptr_t count;
+} __attribute__((aligned(128)));
+
+struct spinlock_test_data {
+	struct percpu_lock lock;
+	struct test_data_entry c[CPU_SETSIZE];
+};
+
+struct spinlock_thread_test_data {
+	struct spinlock_test_data *data;
+	int reps;
+	int reg;
+};
+
+struct inc_test_data {
+	struct test_data_entry c[CPU_SETSIZE];
+};
+
+struct inc_thread_test_data {
+	struct inc_test_data *data;
+	int reps;
+	int reg;
+};
+
+struct percpu_list_node {
+	intptr_t data;
+	struct percpu_list_node *next;
+};
+
+struct percpu_list_entry {
+	struct percpu_list_node *head;
+} __attribute__((aligned(128)));
+
+struct percpu_list {
+	struct percpu_list_entry c[CPU_SETSIZE];
+};
+
+#define BUFFER_ITEM_PER_CPU	100
+
+struct percpu_buffer_node {
+	intptr_t data;
+};
+
+struct percpu_buffer_entry {
+	intptr_t offset;
+	intptr_t buflen;
+	struct percpu_buffer_node **array;
+} __attribute__((aligned(128)));
+
+struct percpu_buffer {
+	struct percpu_buffer_entry c[CPU_SETSIZE];
+};
+
+#define MEMCPY_BUFFER_ITEM_PER_CPU	100
+
+struct percpu_memcpy_buffer_node {
+	intptr_t data1;
+	uint64_t data2;
+};
+
+struct percpu_memcpy_buffer_entry {
+	intptr_t offset;
+	intptr_t buflen;
+	struct percpu_memcpy_buffer_node *array;
+} __attribute__((aligned(128)));
+
+struct percpu_memcpy_buffer {
+	struct percpu_memcpy_buffer_entry c[CPU_SETSIZE];
+};
+
+/* A simple percpu spinlock.  Returns the cpu lock was acquired on. */
+static int rseq_percpu_lock(struct percpu_lock *lock)
+{
+	int cpu;
+
+	for (;;) {
+#ifndef SKIP_FASTPATH
+		struct rseq_state rseq_state;
+
+		/* Try fast path. */
+		rseq_state = rseq_start();
+		cpu = rseq_cpu_at_start(rseq_state);
+		if (unlikely(lock->c[cpu].v != 0))
+			continue;	/* Retry.*/
+		if (likely(rseq_finish(&lock->c[cpu].v, 1, rseq_state)))
+			break;
+		else
+#endif
+		{
+			/* Fallback on cpu_opv system call. */
+			intptr_t expect = 0, n = 1;
+			int ret;
+
+			cpu = rseq_current_cpu_raw();
+			ret = cpu_op_cmpstore(&lock->c[cpu].v, &expect, &n,
+				sizeof(intptr_t), cpu);
+			if (likely(!ret))
+				break;
+			assert(ret >= 0 || errno == EAGAIN);
+		}
+	}
+	/*
+	 * Acquire semantic when taking lock after control dependency.
+	 * Matches smp_store_release().
+	 */
+	smp_acquire__after_ctrl_dep();
+	return cpu;
+}
+
+static void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+	assert(lock->c[cpu].v == 1);
+	/*
+	 * Release lock, with release semantic. Matches
+	 * smp_acquire__after_ctrl_dep().
+	 */
+	smp_store_release(&lock->c[cpu].v, 0);
+}
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+	struct spinlock_thread_test_data *thread_data = arg;
+	struct spinlock_test_data *data = thread_data->data;
+	int i, cpu;
+
+	if (!opt_disable_rseq && thread_data->reg
+			&& rseq_register_current_thread())
+		abort();
+	for (i = 0; i < thread_data->reps; i++) {
+		cpu = rseq_percpu_lock(&data->lock);
+		data->c[cpu].count++;
+		rseq_percpu_unlock(&data->lock, cpu);
+#ifndef BENCHMARK
+		if (i != 0 && !(i % (thread_data->reps / 10)))
+			printf("tid %d: count %d\n", (int) gettid(), i);
+#endif
+	}
+	printf_nobench("tid %d: number of retry: %d, signals delivered: %u\n",
+		(int) gettid(), nr_retry, signals_delivered);
+	if (rseq_unregister_current_thread())
+		abort();
+	return NULL;
+}
+
+/*
+ * A simple test which implements a sharded counter using a per-cpu
+ * lock.  Obviously real applications might prefer to simply use a
+ * per-cpu increment; however, this is reasonable for a test and the
+ * lock can be extended to synchronize more complicated operations.
+ */
+void test_percpu_spinlock(void)
+{
+	const int num_threads = opt_threads;
+	int i, ret;
+	uint64_t sum;
+	pthread_t test_threads[num_threads];
+	struct spinlock_test_data data;
+	struct spinlock_thread_test_data thread_data[num_threads];
+
+	memset(&data, 0, sizeof(data));
+	for (i = 0; i < num_threads; i++) {
+		thread_data[i].reps = opt_reps;
+		if (opt_disable_mod <= 0 || (i % opt_disable_mod))
+			thread_data[i].reg = 1;
+		else
+			thread_data[i].reg = 0;
+		thread_data[i].data = &data;
+		ret = pthread_create(&test_threads[i], NULL,
+			test_percpu_spinlock_thread, &thread_data[i]);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	sum = 0;
+	for (i = 0; i < CPU_SETSIZE; i++)
+		sum += data.c[i].count;
+
+	assert(sum == (uint64_t)opt_reps * num_threads);
+}
+
+void *test_percpu_inc_thread(void *arg)
+{
+	struct inc_thread_test_data *thread_data = arg;
+	struct inc_test_data *data = thread_data->data;
+	int i;
+
+	if (!opt_disable_rseq && thread_data->reg
+			&& rseq_register_current_thread())
+		abort();
+	for (i = 0; i < thread_data->reps; i++) {
+		int cpu;
+
+#ifndef SKIP_FASTPATH
+		struct rseq_state rseq_state;
+		intptr_t *targetptr, newval;
+
+		/* Try fast path. */
+		rseq_state = rseq_start();
+		cpu = rseq_cpu_at_start(rseq_state);
+		newval = (intptr_t)data->c[cpu].count + 1;
+		targetptr = (intptr_t *)&data->c[cpu].count;
+		if (unlikely(!rseq_finish(targetptr, newval, rseq_state)))
+#endif
+		{
+			for (;;) {
+				/* Fallback on cpu_opv system call. */
+				int ret;
+
+				cpu = rseq_current_cpu_raw();
+				ret = cpu_op_add(&data->c[cpu].count, 1,
+					sizeof(intptr_t), cpu);
+				if (likely(!ret))
+					break;
+				assert(ret >= 0 || errno == EAGAIN);
+			}
+		}
+
+#ifndef BENCHMARK
+		if (i != 0 && !(i % (thread_data->reps / 10)))
+			printf("tid %d: count %d\n", (int) gettid(), i);
+#endif
+	}
+	printf_nobench("tid %d: number of retry: %d, signals delivered: %u\n",
+		(int) gettid(), nr_retry, signals_delivered);
+	if (rseq_unregister_current_thread())
+		abort();
+	return NULL;
+}
+
+void test_percpu_inc(void)
+{
+	const int num_threads = opt_threads;
+	int i, ret;
+	uint64_t sum;
+	pthread_t test_threads[num_threads];
+	struct inc_test_data data;
+	struct inc_thread_test_data thread_data[num_threads];
+
+	memset(&data, 0, sizeof(data));
+	for (i = 0; i < num_threads; i++) {
+		thread_data[i].reps = opt_reps;
+		if (opt_disable_mod <= 0 || (i % opt_disable_mod))
+			thread_data[i].reg = 1;
+		else
+			thread_data[i].reg = 0;
+		thread_data[i].data = &data;
+		ret = pthread_create(&test_threads[i], NULL,
+			test_percpu_inc_thread, &thread_data[i]);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	sum = 0;
+	for (i = 0; i < CPU_SETSIZE; i++)
+		sum += data.c[i].count;
+
+	assert(sum == (uint64_t)opt_reps * num_threads);
+}
+
+int percpu_list_push(struct percpu_list *list, struct percpu_list_node *node)
+{
+	intptr_t *targetptr, newval, expect;
+	int cpu;
+#ifndef SKIP_FASTPATH
+	struct rseq_state rseq_state;
+
+	/* Try fast path. */
+	rseq_state = rseq_start();
+	cpu = rseq_cpu_at_start(rseq_state);
+	newval = (intptr_t)node;
+	targetptr = (intptr_t *)&list->c[cpu].head;
+	node->next = list->c[cpu].head;
+	if (unlikely(!rseq_finish(targetptr, newval, rseq_state)))
+#endif
+	{
+		/* Fallback on cpu_opv system call. */
+		for (;;) {
+			int ret;
+
+			cpu = rseq_current_cpu_raw();
+			/* Load list->c[cpu].head with single-copy atomicity. */
+			expect = (intptr_t)READ_ONCE(list->c[cpu].head);
+			newval = (intptr_t)node;
+			targetptr = (intptr_t *)&list->c[cpu].head;
+			node->next = (struct percpu_list_node *)expect;
+			ret = cpu_op_cmpstore(targetptr, &expect, &newval,
+				sizeof(intptr_t), cpu);
+			if (likely(!ret))
+				break;
+			assert(ret >= 0 || errno == EAGAIN);
+		}
+	}
+	return cpu;
+}
+
+/*
+ * Unlike a traditional lock-less linked list; the availability of a
+ * rseq primitive allows us to implement pop without concerns over
+ * ABA-type races.
+ */
+struct percpu_list_node *percpu_list_pop(struct percpu_list *list)
+{
+	struct percpu_list_node *head, *next;
+	intptr_t *targetptr, newval, expect;
+	int cpu;
+#ifndef SKIP_FASTPATH
+	struct rseq_state rseq_state;
+
+	/* Try fast path. */
+	rseq_state = rseq_start();
+	cpu = rseq_cpu_at_start(rseq_state);
+	/* Load list->c[cpu].head with single-copy atomicity. */
+	head = READ_ONCE(list->c[cpu].head);
+	if (!head)
+		return NULL;
+	/* Load head->next with single-copy atomicity. */
+	next = READ_ONCE(head->next);
+	newval = (intptr_t)next;
+	targetptr = (intptr_t *)&list->c[cpu].head;
+	if (unlikely(!rseq_finish(targetptr, newval, rseq_state)))
+#endif
+	{
+		/* Fallback on cpu_opv system call. */
+		for (;;) {
+			int ret;
+
+			cpu = rseq_current_cpu_raw();
+			/* Load list->c[cpu].head with single-copy atomicity. */
+			head = READ_ONCE(list->c[cpu].head);
+			if (!head)
+				return NULL;
+			expect = (intptr_t)head;
+			/* Load head->next with single-copy atomicity. */
+			next = READ_ONCE(head->next);
+			newval = (intptr_t)next;
+			targetptr = (intptr_t *)&list->c[cpu].head;
+			ret = cpu_op_2cmp1store(targetptr, &expect, &newval,
+				&head->next, &next,
+				sizeof(intptr_t), cpu);
+			if (likely(!ret))
+				break;
+			assert(ret >= 0 || errno == EAGAIN);
+		}
+	}
+
+	return head;
+}
+
+void *test_percpu_list_thread(void *arg)
+{
+	int i;
+	struct percpu_list *list = (struct percpu_list *)arg;
+
+	if (rseq_register_current_thread())
+		abort();
+
+	for (i = 0; i < opt_reps; i++) {
+		struct percpu_list_node *node = percpu_list_pop(list);
+
+		if (opt_yield)
+			sched_yield();  /* encourage shuffling */
+		if (node)
+			percpu_list_push(list, node);
+	}
+
+	if (rseq_unregister_current_thread())
+		abort();
+
+	return NULL;
+}
+
+/* Simultaneous modification to a per-cpu linked list from many threads.  */
+void test_percpu_list(void)
+{
+	const int num_threads = opt_threads;
+	int i, j, ret;
+	uint64_t sum = 0, expected_sum = 0;
+	struct percpu_list list;
+	pthread_t test_threads[num_threads];
+	cpu_set_t allowed_cpus;
+
+	memset(&list, 0, sizeof(list));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+		for (j = 1; j <= 100; j++) {
+			struct percpu_list_node *node;
+
+			expected_sum += j;
+
+			node = malloc(sizeof(*node));
+			assert(node);
+			node->data = j;
+			node->next = list.c[i].head;
+			list.c[i].head = node;
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_create(&test_threads[i], NULL,
+			test_percpu_list_thread, &list);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		cpu_set_t pin_mask;
+		struct percpu_list_node *node;
+
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+
+		CPU_ZERO(&pin_mask);
+		CPU_SET(i, &pin_mask);
+		sched_setaffinity(0, sizeof(pin_mask), &pin_mask);
+
+		while ((node = percpu_list_pop(&list))) {
+			sum += node->data;
+			free(node);
+		}
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external
+	 * actor is interfering with our allowed affinity while this
+	 * test is running).
+	 */
+	assert(sum == expected_sum);
+}
+
+bool percpu_buffer_push(struct percpu_buffer *buffer,
+		struct percpu_buffer_node *node)
+{
+	intptr_t *targetptr_spec, newval_spec;
+	intptr_t *targetptr_final, newval_final;
+	int cpu;
+	intptr_t offset;
+#ifndef SKIP_FASTPATH
+	struct rseq_state rseq_state;
+
+	/* Try fast path. */
+	rseq_state = rseq_start();
+	cpu = rseq_cpu_at_start(rseq_state);
+	/* Load offset with single-copy atomicity. */
+	offset = READ_ONCE(buffer->c[cpu].offset);
+	if (offset == buffer->c[cpu].buflen)
+		return false;
+	newval_spec = (intptr_t)node;
+	targetptr_spec = (intptr_t *)&buffer->c[cpu].array[offset];
+	newval_final = offset + 1;
+	targetptr_final = &buffer->c[cpu].offset;
+	if (unlikely(!rseq_finish2(targetptr_spec, newval_spec,
+			targetptr_final, newval_final, rseq_state)))
+#endif
+	{
+		/* Fallback on cpu_opv system call. */
+		for (;;) {
+			int ret;
+
+			cpu = rseq_current_cpu_raw();
+			/* Load offset with single-copy atomicity. */
+			offset = READ_ONCE(buffer->c[cpu].offset);
+			if (offset == buffer->c[cpu].buflen)
+				return false;
+			newval_spec = (intptr_t)node;
+			targetptr_spec = (intptr_t *)&buffer->c[cpu].array[offset];
+			newval_final = offset + 1;
+			targetptr_final = &buffer->c[cpu].offset;
+			ret = cpu_op_1cmp2store(targetptr_final, &offset, &newval_final,
+				targetptr_spec, &newval_spec,
+				sizeof(intptr_t), cpu);
+			if (likely(!ret))
+				break;
+			assert(ret >= 0 || errno == EAGAIN);
+		}
+	}
+	return true;
+}
+
+struct percpu_buffer_node *percpu_buffer_pop(struct percpu_buffer *buffer)
+{
+	struct percpu_buffer_node *head;
+	intptr_t *targetptr, newval;
+	int cpu;
+	intptr_t offset;
+#ifndef SKIP_FASTPATH
+	struct rseq_state rseq_state;
+
+	/* Try fast path. */
+	rseq_state = rseq_start();
+	cpu = rseq_cpu_at_start(rseq_state);
+	/* Load offset with single-copy atomicity. */
+	offset = READ_ONCE(buffer->c[cpu].offset);
+	if (offset == 0)
+		return NULL;
+	head = buffer->c[cpu].array[offset - 1];
+	newval = offset - 1;
+	targetptr = (intptr_t *)&buffer->c[cpu].offset;
+	if (unlikely(!rseq_finish(targetptr, newval, rseq_state)))
+#endif
+	{
+		/* Fallback on cpu_opv system call. */
+		for (;;) {
+			int ret;
+
+			cpu = rseq_current_cpu_raw();
+			/* Load offset with single-copy atomicity. */
+			offset = READ_ONCE(buffer->c[cpu].offset);
+			if (offset == 0)
+				return NULL;
+			head = buffer->c[cpu].array[offset - 1];
+			newval = offset - 1;
+			targetptr = (intptr_t *)&buffer->c[cpu].offset;
+			ret = cpu_op_2cmp1store(targetptr, &offset, &newval,
+				&buffer->c[cpu].array[offset - 1], &head,
+				sizeof(intptr_t), cpu);
+			if (likely(!ret))
+				break;
+			assert(ret >= 0 || errno == EAGAIN);
+		}
+	}
+	return head;
+}
+
+void *test_percpu_buffer_thread(void *arg)
+{
+	int i;
+	struct percpu_buffer *buffer = (struct percpu_buffer *)arg;
+
+	if (rseq_register_current_thread())
+		abort();
+
+	for (i = 0; i < opt_reps; i++) {
+		struct percpu_buffer_node *node = percpu_buffer_pop(buffer);
+
+		if (opt_yield)
+			sched_yield();  /* encourage shuffling */
+		if (node) {
+			if (!percpu_buffer_push(buffer, node)) {
+				/* Should increase buffer size. */
+				abort();
+			}
+		}
+	}
+
+	if (rseq_unregister_current_thread())
+		abort();
+
+	return NULL;
+}
+
+/* Simultaneous modification to a per-cpu buffer from many threads.  */
+void test_percpu_buffer(void)
+{
+	const int num_threads = opt_threads;
+	int i, j, ret;
+	uint64_t sum = 0, expected_sum = 0;
+	struct percpu_buffer buffer;
+	pthread_t test_threads[num_threads];
+	cpu_set_t allowed_cpus;
+
+	memset(&buffer, 0, sizeof(buffer));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+		/* Worse-case is every item in same CPU. */
+		buffer.c[i].array =
+			malloc(sizeof(*buffer.c[i].array) * CPU_SETSIZE
+				* BUFFER_ITEM_PER_CPU);
+		assert(buffer.c[i].array);
+		buffer.c[i].buflen = CPU_SETSIZE * BUFFER_ITEM_PER_CPU;
+		for (j = 1; j <= BUFFER_ITEM_PER_CPU; j++) {
+			struct percpu_buffer_node *node;
+
+			expected_sum += j;
+
+			/*
+			 * We could theoretically put the word-sized
+			 * "data" directly in the buffer. However, we
+			 * want to model objects that would not fit
+			 * within a single word, so allocate an object
+			 * for each node.
+			 */
+			node = malloc(sizeof(*node));
+			assert(node);
+			node->data = j;
+			buffer.c[i].array[j - 1] = node;
+			buffer.c[i].offset++;
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_create(&test_threads[i], NULL,
+			test_percpu_buffer_thread, &buffer);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		cpu_set_t pin_mask;
+		struct percpu_buffer_node *node;
+
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+
+		CPU_ZERO(&pin_mask);
+		CPU_SET(i, &pin_mask);
+		sched_setaffinity(0, sizeof(pin_mask), &pin_mask);
+
+		while ((node = percpu_buffer_pop(&buffer))) {
+			sum += node->data;
+			free(node);
+		}
+		free(buffer.c[i].array);
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external
+	 * actor is interfering with our allowed affinity while this
+	 * test is running).
+	 */
+	assert(sum == expected_sum);
+}
+
+bool percpu_memcpy_buffer_push(struct percpu_memcpy_buffer *buffer,
+		struct percpu_memcpy_buffer_node item)
+{
+	char *destptr, *srcptr;
+	size_t copylen;
+	intptr_t *targetptr_final, newval_final;
+	int cpu;
+	intptr_t offset;
+#ifndef SKIP_FASTPATH
+	struct rseq_state rseq_state;
+
+	/* Try fast path. */
+	rseq_state = rseq_start();
+	cpu = rseq_cpu_at_start(rseq_state);
+	/* Load offset with single-copy atomicity. */
+	offset = READ_ONCE(buffer->c[cpu].offset);
+	if (offset == buffer->c[cpu].buflen)
+		return false;
+	destptr = (char *)&buffer->c[cpu].array[offset];
+	srcptr = (char *)&item;
+	copylen = sizeof(item);
+	newval_final = offset + 1;
+	targetptr_final = &buffer->c[cpu].offset;
+	if (unlikely(!rseq_finish_memcpy(destptr, srcptr, copylen,
+			targetptr_final, newval_final, rseq_state)))
+#endif
+	{
+		/* Fallback on cpu_opv system call. */
+		for (;;) {
+			int ret;
+
+			cpu = rseq_current_cpu_raw();
+			/* Load offset with single-copy atomicity. */
+			offset = READ_ONCE(buffer->c[cpu].offset);
+			if (offset == buffer->c[cpu].buflen)
+				return false;
+			destptr = (char *)&buffer->c[cpu].array[offset];
+			srcptr = (char *)&item;
+			copylen = sizeof(item);
+			newval_final = offset + 1;
+			targetptr_final = &buffer->c[cpu].offset;
+			/* copylen must be <= PAGE_SIZE. */
+			ret = cpu_op_cmpstorememcpy(targetptr_final, &offset, &newval_final,
+				sizeof(intptr_t), destptr, srcptr, copylen, cpu);
+			if (likely(!ret))
+				break;
+			assert(ret >= 0 || errno == EAGAIN);
+		}
+	}
+	return true;
+}
+
+bool percpu_memcpy_buffer_pop(struct percpu_memcpy_buffer *buffer,
+		struct percpu_memcpy_buffer_node *item)
+{
+	char *destptr, *srcptr;
+	size_t copylen;
+	intptr_t *targetptr_final, newval_final;
+	int cpu;
+	intptr_t offset;
+#ifndef SKIP_FASTPATH
+	struct rseq_state rseq_state;
+
+	/* Try fast path. */
+	rseq_state = rseq_start();
+	cpu = rseq_cpu_at_start(rseq_state);
+	/* Load offset with single-copy atomicity. */
+	offset = READ_ONCE(buffer->c[cpu].offset);
+	if (offset == 0)
+		return false;
+	destptr = (char *)item;
+	srcptr = (char *)&buffer->c[cpu].array[offset - 1];
+	copylen = sizeof(*item);
+	newval_final = offset - 1;
+	targetptr_final = &buffer->c[cpu].offset;
+	if (unlikely(!rseq_finish_memcpy(destptr, srcptr, copylen,
+			targetptr_final, newval_final, rseq_state)))
+#endif
+	{
+		/* Fallback on cpu_opv system call. */
+		for (;;) {
+			int ret;
+
+			cpu = rseq_current_cpu_raw();
+			/* Load offset with single-copy atomicity. */
+			offset = READ_ONCE(buffer->c[cpu].offset);
+			if (offset == 0)
+				return false;
+			destptr = (char *)item;
+			srcptr = (char *)&buffer->c[cpu].array[offset - 1];
+			copylen = sizeof(*item);
+			newval_final = offset - 1;
+			targetptr_final = &buffer->c[cpu].offset;
+			/* copylen must be <= PAGE_SIZE. */
+			ret = cpu_op_cmpstorememcpy(targetptr_final, &offset, &newval_final,
+				sizeof(intptr_t), destptr, srcptr, copylen, cpu);
+			if (likely(!ret))
+				break;
+			assert(ret >= 0 || errno == EAGAIN);
+		}
+	}
+	return true;
+}
+
+void *test_percpu_memcpy_buffer_thread(void *arg)
+{
+	int i;
+	struct percpu_memcpy_buffer *buffer = (struct percpu_memcpy_buffer *)arg;
+
+	if (rseq_register_current_thread())
+		abort();
+
+	for (i = 0; i < opt_reps; i++) {
+		struct percpu_memcpy_buffer_node item;
+		bool result;
+
+		result = percpu_memcpy_buffer_pop(buffer, &item);
+		if (opt_yield)
+			sched_yield();  /* encourage shuffling */
+		if (result) {
+			if (!percpu_memcpy_buffer_push(buffer, item)) {
+				/* Should increase buffer size. */
+				abort();
+			}
+		}
+	}
+
+	if (rseq_unregister_current_thread())
+		abort();
+
+	return NULL;
+}
+
+/* Simultaneous modification to a per-cpu buffer from many threads.  */
+void test_percpu_memcpy_buffer(void)
+{
+	const int num_threads = opt_threads;
+	int i, j, ret;
+	uint64_t sum = 0, expected_sum = 0;
+	struct percpu_memcpy_buffer buffer;
+	pthread_t test_threads[num_threads];
+	cpu_set_t allowed_cpus;
+
+	memset(&buffer, 0, sizeof(buffer));
+
+	/* Generate list entries for every usable cpu. */
+	sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+		/* Worse-case is every item in same CPU. */
+		buffer.c[i].array =
+			malloc(sizeof(*buffer.c[i].array) * CPU_SETSIZE
+				* MEMCPY_BUFFER_ITEM_PER_CPU);
+		assert(buffer.c[i].array);
+		buffer.c[i].buflen = CPU_SETSIZE * MEMCPY_BUFFER_ITEM_PER_CPU;
+		for (j = 1; j <= MEMCPY_BUFFER_ITEM_PER_CPU; j++) {
+			expected_sum += 2 * j + 1;
+
+			/*
+			 * We could theoretically put the word-sized
+			 * "data" directly in the buffer. However, we
+			 * want to model objects that would not fit
+			 * within a single word, so allocate an object
+			 * for each node.
+			 */
+			buffer.c[i].array[j - 1].data1 = j;
+			buffer.c[i].array[j - 1].data2 = j + 1;
+			buffer.c[i].offset++;
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		ret = pthread_create(&test_threads[i], NULL,
+			test_percpu_memcpy_buffer_thread, &buffer);
+		if (ret) {
+			errno = ret;
+			perror("pthread_create");
+			abort();
+		}
+	}
+
+	for (i = 0; i < num_threads; i++) {
+		pthread_join(test_threads[i], NULL);
+		if (ret) {
+			errno = ret;
+			perror("pthread_join");
+			abort();
+		}
+	}
+
+	for (i = 0; i < CPU_SETSIZE; i++) {
+		cpu_set_t pin_mask;
+		struct percpu_memcpy_buffer_node item;
+
+		if (!CPU_ISSET(i, &allowed_cpus))
+			continue;
+
+		CPU_ZERO(&pin_mask);
+		CPU_SET(i, &pin_mask);
+		sched_setaffinity(0, sizeof(pin_mask), &pin_mask);
+
+		while (percpu_memcpy_buffer_pop(&buffer, &item)) {
+			sum += item.data1;
+			sum += item.data2;
+		}
+		free(buffer.c[i].array);
+	}
+
+	/*
+	 * All entries should now be accounted for (unless some external
+	 * actor is interfering with our allowed affinity while this
+	 * test is running).
+	 */
+	assert(sum == expected_sum);
+}
+
+static void test_signal_interrupt_handler(int signo)
+{
+	signals_delivered++;
+}
+
+static int set_signal_handler(void)
+{
+	int ret = 0;
+	struct sigaction sa;
+	sigset_t sigset;
+
+	ret = sigemptyset(&sigset);
+	if (ret < 0) {
+		perror("sigemptyset");
+		return ret;
+	}
+
+	sa.sa_handler = test_signal_interrupt_handler;
+	sa.sa_mask = sigset;
+	sa.sa_flags = 0;
+	ret = sigaction(SIGUSR1, &sa, NULL);
+	if (ret < 0) {
+		perror("sigaction");
+		return ret;
+	}
+
+	printf_nobench("Signal handler set for SIGUSR1\n");
+
+	return ret;
+}
+
+static void show_usage(int argc, char **argv)
+{
+	printf("Usage : %s <OPTIONS>\n",
+		argv[0]);
+	printf("OPTIONS:\n");
+	printf("	[-1 loops] Number of loops for delay injection 1\n");
+	printf("	[-2 loops] Number of loops for delay injection 2\n");
+	printf("	[-3 loops] Number of loops for delay injection 3\n");
+	printf("	[-4 loops] Number of loops for delay injection 4\n");
+	printf("	[-5 loops] Number of loops for delay injection 5\n");
+	printf("	[-6 loops] Number of loops for delay injection 6 (-1 to enable -m)\n");
+	printf("	[-7 loops] Number of loops for delay injection 7 (-1 to enable -m)\n");
+	printf("	[-8 loops] Number of loops for delay injection 8 (-1 to enable -m)\n");
+	printf("	[-9 loops] Number of loops for delay injection 9 (-1 to enable -m)\n");
+	printf("	[-m N] Yield/sleep/kill every modulo N (default 0: disabled) (>= 0)\n");
+	printf("	[-y] Yield\n");
+	printf("	[-k] Kill thread with signal\n");
+	printf("	[-s S] S: =0: disabled (default), >0: sleep time (ms)\n");
+	printf("	[-t N] Number of threads (default 200)\n");
+	printf("	[-r N] Number of repetitions per thread (default 5000)\n");
+	printf("	[-d] Disable rseq system call (no initialization)\n");
+	printf("	[-D M] Disable rseq for each M threads\n");
+	printf("	[-T test] Choose test: (s)pinlock, (l)ist, (b)uffer, (m)emcpy, (i)ncrement\n");
+	printf("	[-h] Show this help.\n");
+	printf("\n");
+}
+
+int main(int argc, char **argv)
+{
+	int i;
+
+	if (set_signal_handler())
+		goto error;
+	for (i = 1; i < argc; i++) {
+		if (argv[i][0] != '-')
+			continue;
+		switch (argv[i][1]) {
+		case '1':
+		case '2':
+		case '3':
+		case '4':
+		case '5':
+		case '6':
+		case '7':
+		case '8':
+		case '9':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			loop_cnt[argv[i][1] - '0'] = atol(argv[i + 1]);
+			i++;
+			break;
+		case 'm':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_modulo = atol(argv[i + 1]);
+			if (opt_modulo < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 's':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_sleep = atol(argv[i + 1]);
+			if (opt_sleep < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'y':
+			opt_yield = 1;
+			break;
+		case 'k':
+			opt_signal = 1;
+			break;
+		case 'd':
+			opt_disable_rseq = 1;
+			break;
+		case 'D':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_disable_mod = atol(argv[i + 1]);
+			if (opt_disable_mod < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 't':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_threads = atol(argv[i + 1]);
+			if (opt_threads < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'r':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_reps = atol(argv[i + 1]);
+			if (opt_reps < 0) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		case 'h':
+			show_usage(argc, argv);
+			goto end;
+		case 'T':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				goto error;
+			}
+			opt_test = *argv[i + 1];
+			switch (opt_test) {
+			case 's':
+			case 'l':
+			case 'i':
+			case 'b':
+			case 'm':
+				break;
+			default:
+				show_usage(argc, argv);
+				goto error;
+			}
+			i++;
+			break;
+		default:
+			show_usage(argc, argv);
+			goto error;
+		}
+	}
+
+	if (!opt_disable_rseq && rseq_register_current_thread())
+		goto error;
+	switch (opt_test) {
+	case 's':
+		printf_nobench("spinlock\n");
+		test_percpu_spinlock();
+		break;
+	case 'l':
+		printf_nobench("linked list\n");
+		test_percpu_list();
+		break;
+	case 'b':
+		printf_nobench("buffer\n");
+		test_percpu_buffer();
+		break;
+	case 'm':
+		printf_nobench("memcpy buffer\n");
+		test_percpu_memcpy_buffer();
+		break;
+	case 'i':
+		printf_nobench("counter increment\n");
+		test_percpu_inc();
+		break;
+	}
+	if (rseq_unregister_current_thread())
+		abort();
+end:
+	return 0;
+
+error:
+	return -1;
+}
diff --git a/tools/testing/selftests/rseq/rseq-arm.h b/tools/testing/selftests/rseq/rseq-arm.h
new file mode 100644
index 000000000000..b5f57d250071
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -0,0 +1,159 @@
+/*
+ * rseq-arm.h
+ *
+ * (C) Copyright 2016 - Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#define smp_mb()	__asm__ __volatile__ ("dmb" : : : "memory")
+#define smp_rmb()	__asm__ __volatile__ ("dmb" : : : "memory")
+#define smp_wmb()	__asm__ __volatile__ ("dmb" : : : "memory")
+
+#define smp_load_acquire(p)						\
+__extension__ ({							\
+	__typeof(*p) ____p1 = READ_ONCE(*p);				\
+	smp_mb();							\
+	____p1;								\
+})
+
+#define smp_acquire__after_ctrl_dep()	smp_rmb()
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	WRITE_ONCE(*p, v);						\
+} while (0)
+
+#define has_fast_acquire_release()	0
+#define has_single_copy_load_64()	1
+
+/*
+ * The __rseq_table section can be used by debuggers to better handle
+ * single-stepping through the restartable critical sections.
+ */
+
+#define RSEQ_FINISH_ASM(_target_final, _to_write_final, _start_value, \
+		_failure, _spec_store, _spec_input, \
+		_final_store, _final_input, _extra_clobber, \
+		_setup, _teardown, _scratch) \
+do { \
+	_scratch \
+	__asm__ __volatile__ goto ( \
+		".pushsection __rseq_table, \"aw\"\n\t" \
+		".balign 32\n\t" \
+		".word 1f, 0x0, 2f, 0x0, 5f, 0x0, 0x0, 0x0\n\t" \
+		".popsection\n\t" \
+		"1:\n\t" \
+		_setup \
+		RSEQ_INJECT_ASM(1) \
+		"adr r0, 3f\n\t" \
+		"str r0, [%[rseq_cs]]\n\t" \
+		RSEQ_INJECT_ASM(2) \
+		"ldr r0, %[current_event_counter]\n\t" \
+		"cmp %[start_event_counter], r0\n\t" \
+		"bne 5f\n\t" \
+		RSEQ_INJECT_ASM(3) \
+		_spec_store \
+		_final_store \
+		"2:\n\t" \
+		RSEQ_INJECT_ASM(5) \
+		_teardown \
+		"b 4f\n\t" \
+		".balign 32\n\t" \
+		"3:\n\t" \
+		".word 1b, 0x0, 2b, 0x0, 5f, 0x0, 0x0, 0x0\n\t" \
+		"5:\n\t" \
+		_teardown \
+		"b %l[failure]\n\t" \
+		"4:\n\t" \
+		: /* gcc asm goto does not allow outputs */ \
+		: [start_event_counter]"r"((_start_value).event_counter), \
+		  [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \
+		  [rseq_cs]"r"(&(_start_value).rseqp->rseq_cs) \
+		  _spec_input \
+		  _final_input \
+		  RSEQ_INJECT_INPUT \
+		: "r0", "memory", "cc" \
+		  _extra_clobber \
+		  RSEQ_INJECT_CLOBBER \
+		: _failure \
+	); \
+} while (0)
+
+#define RSEQ_FINISH_FINAL_STORE_ASM() \
+		"str %[to_write_final], [%[target_final]]\n\t"
+
+#define RSEQ_FINISH_FINAL_STORE_RELEASE_ASM() \
+		"dmb\n\t" \
+		RSEQ_FINISH_FINAL_STORE_ASM()
+
+#define RSEQ_FINISH_FINAL_STORE_INPUT(_target_final, _to_write_final) \
+		, [to_write_final]"r"(_to_write_final), \
+		[target_final]"r"(_target_final)
+
+#define RSEQ_FINISH_SPECULATIVE_STORE_ASM() \
+		"str %[to_write_spec], [%[target_spec]]\n\t" \
+		RSEQ_INJECT_ASM(4)
+
+#define RSEQ_FINISH_SPECULATIVE_STORE_INPUT(_target_spec, _to_write_spec) \
+		, [to_write_spec]"r"(_to_write_spec), \
+		[target_spec]"r"(_target_spec)
+
+/* TODO: implement a faster memcpy. */
+#define RSEQ_FINISH_MEMCPY_STORE_ASM() \
+		"cmp %[len_memcpy], #0\n\t" \
+		"beq 333f\n\t" \
+		"222:\n\t" \
+		"ldrb %%r0, [%[to_write_memcpy]]\n\t" \
+		"strb %%r0, [%[target_memcpy]]\n\t" \
+		"adds %[to_write_memcpy], #1\n\t" \
+		"adds %[target_memcpy], #1\n\t" \
+		"subs %[len_memcpy], #1\n\t" \
+		"bne 222b\n\t" \
+		"333:\n\t" \
+		RSEQ_INJECT_ASM(4)
+
+#define RSEQ_FINISH_MEMCPY_STORE_INPUT(_target_memcpy, _to_write_memcpy, _len_memcpy) \
+		, [to_write_memcpy]"r"(_to_write_memcpy), \
+		[target_memcpy]"r"(_target_memcpy), \
+		[len_memcpy]"r"(_len_memcpy), \
+		[rseq_scratch0]"m"(rseq_scratch[0]), \
+		[rseq_scratch1]"m"(rseq_scratch[1]), \
+		[rseq_scratch2]"m"(rseq_scratch[2])
+
+/* We can use r0. */
+#define RSEQ_FINISH_MEMCPY_CLOBBER()
+
+#define RSEQ_FINISH_MEMCPY_SCRATCH() \
+		uint32_t rseq_scratch[3];
+
+/*
+ * We need to save and restore those input registers so they can be
+ * modified within the assembly.
+ */
+#define RSEQ_FINISH_MEMCPY_SETUP() \
+		"str %[to_write_memcpy], %[rseq_scratch0]\n\t" \
+		"str %[target_memcpy], %[rseq_scratch1]\n\t" \
+		"str %[len_memcpy], %[rseq_scratch2]\n\t"
+
+#define RSEQ_FINISH_MEMCPY_TEARDOWN() \
+		"ldr %[len_memcpy], %[rseq_scratch2]\n\t" \
+		"ldr %[target_memcpy], %[rseq_scratch1]\n\t" \
+		"ldr %[to_write_memcpy], %[rseq_scratch0]\n\t"
diff --git a/tools/testing/selftests/rseq/rseq-ppc.h b/tools/testing/selftests/rseq/rseq-ppc.h
new file mode 100644
index 000000000000..94c8ba0b4311
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq-ppc.h
@@ -0,0 +1,266 @@
+/*
+ * rseq-ppc.h
+ *
+ * (C) Copyright 2016 - Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ * (C) Copyright 2016 - Boqun Feng <boqun.feng@gmail.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#define smp_mb()	__asm__ __volatile__ ("sync" : : : "memory")
+#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
+#define smp_rmb()	smp_lwsync()
+#define smp_wmb()	smp_lwsync()
+
+#define smp_load_acquire(p)						\
+__extension__ ({							\
+	__typeof(*p) ____p1 = READ_ONCE(*p);				\
+	smp_lwsync();							\
+	____p1;								\
+})
+
+#define smp_acquire__after_ctrl_dep()	smp_lwsync()
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_lwsync();							\
+	WRITE_ONCE(*p, v);						\
+} while (0)
+
+#define has_fast_acquire_release()	0
+
+#ifdef __PPC64__
+#define has_single_copy_load_64()	1
+#else
+#define has_single_copy_load_64()	0
+#endif
+
+/*
+ * The __rseq_table section can be used by debuggers to better handle
+ * single-stepping through the restartable critical sections.
+ */
+
+#ifdef __PPC64__
+
+#define RSEQ_FINISH_ASM(_target_final, _to_write_final, _start_value, \
+		_failure, _spec_store, _spec_input, \
+		_final_store, _final_input, _extra_clobber, \
+		_setup, _teardown, _scratch) \
+	__asm__ __volatile__ goto ( \
+		".pushsection __rseq_table, \"aw\"\n\t" \
+		".balign 32\n\t" \
+		"3:\n\t" \
+		".quad 1f, 2f, 4f\n\t" \
+		".long 0x0, 0x0\n\t" \
+		".popsection\n\t" \
+		"1:\n\t" \
+		_setup \
+		RSEQ_INJECT_ASM(1) \
+		"lis %%r17, (3b)@highest\n\t" \
+		"ori %%r17, %%r17, (3b)@higher\n\t" \
+		"rldicr %%r17, %%r17, 32, 31\n\t" \
+		"oris %%r17, %%r17, (3b)@h\n\t" \
+		"ori %%r17, %%r17, (3b)@l\n\t" \
+		"std %%r17, 0(%[rseq_cs])\n\t" \
+		RSEQ_INJECT_ASM(2) \
+		"lwz %%r17, %[current_event_counter]\n\t" \
+		"cmpw cr7, %[start_event_counter], %%r17\n\t" \
+		"bne- cr7, 4f\n\t" \
+		RSEQ_INJECT_ASM(3) \
+		_spec_store \
+		_final_store \
+		"2:\n\t" \
+		RSEQ_INJECT_ASM(5) \
+		_teardown \
+		"b 5f\n\t" \
+		"4:\n\t" \
+		_teardown \
+		"b %l[failure]\n\t" \
+		"5:\n\t" \
+		: /* gcc asm goto does not allow outputs */ \
+		: [start_event_counter]"r"((_start_value).event_counter), \
+		  [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \
+		  [rseq_cs]"b"(&(_start_value).rseqp->rseq_cs) \
+		  _spec_input \
+		  _final_input \
+		  RSEQ_INJECT_INPUT \
+		: "r17", "memory", "cc" \
+		  _extra_clobber \
+		  RSEQ_INJECT_CLOBBER \
+		: _failure \
+	)
+
+#define RSEQ_FINISH_FINAL_STORE_ASM() \
+		"std %[to_write_final], 0(%[target_final])\n\t"
+
+#define RSEQ_FINISH_FINAL_STORE_RELEASE_ASM() \
+		"lwsync\n\t" \
+		RSEQ_FINISH_FINAL_STORE_ASM()
+
+#define RSEQ_FINISH_FINAL_STORE_INPUT(_target_final, _to_write_final) \
+		, [to_write_final]"r"(_to_write_final), \
+		[target_final]"b"(_target_final)
+
+#define RSEQ_FINISH_SPECULATIVE_STORE_ASM() \
+		"std %[to_write_spec], 0(%[target_spec])\n\t" \
+		RSEQ_INJECT_ASM(4)
+
+#define RSEQ_FINISH_SPECULATIVE_STORE_INPUT(_target_spec, _to_write_spec) \
+		, [to_write_spec]"r"(_to_write_spec), \
+		[target_spec]"b"(_target_spec)
+
+/* TODO: implement a faster memcpy. */
+#define RSEQ_FINISH_MEMCPY_STORE_ASM() \
+		"cmpdi %%r19, 0\n\t" \
+		"beq 333f\n\t" \
+		"addi %%r20, %%r20, -1\n\t" \
+		"addi %%r21, %%r21, -1\n\t" \
+		"222:\n\t" \
+		"lbzu %%r18, 1(%%r20)\n\t" \
+		"stbu %%r18, 1(%%r21)\n\t" \
+		"addi %%r19, %%r19, -1\n\t" \
+		"cmpdi %%r19, 0\n\t" \
+		"bne 222b\n\t" \
+		"333:\n\t" \
+		RSEQ_INJECT_ASM(4)
+
+#define RSEQ_FINISH_MEMCPY_STORE_INPUT(_target_memcpy, _to_write_memcpy, _len_memcpy) \
+		, [to_write_memcpy]"r"(_to_write_memcpy), \
+		[target_memcpy]"r"(_target_memcpy), \
+		[len_memcpy]"r"(_len_memcpy)
+
+#define RSEQ_FINISH_MEMCPY_CLOBBER() \
+		, "r18", "r19", "r20", "r21"
+
+#define RSEQ_FINISH_MEMCPY_SCRATCH()
+
+/*
+ * We use extra registers to hold the input registers, and we don't need to
+ * save and restore the input registers.
+ */
+#define RSEQ_FINISH_MEMCPY_SETUP() \
+		"mr %%r19, %[len_memcpy]\n\t" \
+		"mr %%r20, %[to_write_memcpy]\n\t" \
+		"mr %%r21, %[target_memcpy]\n\t" \
+
+#define RSEQ_FINISH_MEMCPY_TEARDOWN()
+
+#else	/* #ifdef __PPC64__ */
+
+#define RSEQ_FINISH_ASM(_target_final, _to_write_final, _start_value, \
+		_failure, _spec_store, _spec_input, \
+		_final_store, _final_input, _extra_clobber, \
+		_setup, _teardown, _scratch) \
+	__asm__ __volatile__ goto ( \
+		".pushsection __rseq_table, \"aw\"\n\t" \
+		".balign 32\n\t" \
+		"3:\n\t" \
+		/* 32-bit only supported on BE */ \
+		".long 0x0, 1f, 0x0, 2f, 0x0, 4f, 0x0, 0x0\n\t" \
+		".popsection\n\t" \
+		"1:\n\t" \
+		_setup \
+		RSEQ_INJECT_ASM(1) \
+		"lis %%r17, (3b)@ha\n\t" \
+		"addi %%r17, %%r17, (3b)@l\n\t" \
+		"stw %%r17, 0(%[rseq_cs])\n\t" \
+		RSEQ_INJECT_ASM(2) \
+		"lwz %%r17, %[current_event_counter]\n\t" \
+		"cmpw cr7, %[start_event_counter], %%r17\n\t" \
+		"bne- cr7, 4f\n\t" \
+		RSEQ_INJECT_ASM(3) \
+		_spec_store \
+		_final_store \
+		"2:\n\t" \
+		RSEQ_INJECT_ASM(5) \
+		_teardown \
+		"b 5f\n\t" \
+		"4:\n\t" \
+		_teardown \
+		"b %l[failure]\n\t" \
+		"5:\n\t" \
+		: /* gcc asm goto does not allow outputs */ \
+		: [start_event_counter]"r"((_start_value).event_counter), \
+		  [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \
+		  [rseq_cs]"b"(&(_start_value).rseqp->rseq_cs) \
+		  _spec_input \
+		  _final_input \
+		  RSEQ_INJECT_INPUT \
+		: "r17", "memory", "cc" \
+		  _extra_clobber \
+		  RSEQ_INJECT_CLOBBER \
+		: _failure \
+	)
+
+#define RSEQ_FINISH_FINAL_STORE_ASM() \
+		"stw %[to_write_final], 0(%[target_final])\n\t"
+
+#define RSEQ_FINISH_FINAL_STORE_RELEASE_ASM() \
+		"lwsync\n\t" \
+		RSEQ_FINISH_FINAL_STORE_ASM()
+
+#define RSEQ_FINISH_FINAL_STORE_INPUT(_target_final, _to_write_final) \
+		, [to_write_final]"r"(_to_write_final), \
+		[target_final]"b"(_target_final)
+
+#define RSEQ_FINISH_SPECULATIVE_STORE_ASM() \
+		"stw %[to_write_spec], 0(%[target_spec])\n\t" \
+		RSEQ_INJECT_ASM(4)
+
+#define RSEQ_FINISH_SPECULATIVE_STORE_INPUT(_target_spec, _to_write_spec) \
+		, [to_write_spec]"r"(_to_write_spec), \
+		[target_spec]"b"(_target_spec)
+
+/* TODO: implement a faster memcpy. */
+#define RSEQ_FINISH_MEMCPY_STORE_ASM() \
+		"cmpwi %%r19, 0\n\t" \
+		"beq 333f\n\t" \
+		"addi %%r20, %%r20, -1\n\t" \
+		"addi %%r21, %%r21, -1\n\t" \
+		"222:\n\t" \
+		"lbzu %%r18, 1(%%r20)\n\t" \
+		"stbu %%r18, 1(%%r21)\n\t" \
+		"addi %%r19, %%r19, -1\n\t" \
+		"cmpwi %%r19, 0\n\t" \
+		"bne 222b\n\t" \
+		"333:\n\t" \
+		RSEQ_INJECT_ASM(4)
+
+#define RSEQ_FINISH_MEMCPY_STORE_INPUT(_target_memcpy, _to_write_memcpy, _len_memcpy) \
+		, [to_write_memcpy]"r"(_to_write_memcpy), \
+		[target_memcpy]"r"(_target_memcpy), \
+		[len_memcpy]"r"(_len_memcpy)
+
+#define RSEQ_FINISH_MEMCPY_CLOBBER() \
+		, "r18", "r19", "r20", "r21"
+
+#define RSEQ_FINISH_MEMCPY_SCRATCH()
+
+/*
+ * We use extra registers to hold the input registers, and we don't need to
+ * save and restore the input registers.
+ */
+#define RSEQ_FINISH_MEMCPY_SETUP() \
+		"mr %%r19, %[len_memcpy]\n\t" \
+		"mr %%r20, %[to_write_memcpy]\n\t" \
+		"mr %%r21, %[target_memcpy]\n\t" \
+
+#define RSEQ_FINISH_MEMCPY_TEARDOWN()
+
+#endif	/* #else #ifdef __PPC64__ */
diff --git a/tools/testing/selftests/rseq/rseq-x86.h b/tools/testing/selftests/rseq/rseq-x86.h
new file mode 100644
index 000000000000..2896186eef9b
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq-x86.h
@@ -0,0 +1,304 @@
+/*
+ * rseq-x86.h
+ *
+ * (C) Copyright 2016 - Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifdef __x86_64__
+
+#define smp_mb()	__asm__ __volatile__ ("mfence" : : : "memory")
+#define smp_rmb()	barrier()
+#define smp_wmb()	barrier()
+
+#define smp_load_acquire(p)						\
+__extension__ ({							\
+	__typeof(*p) ____p1 = READ_ONCE(*p);				\
+	barrier();							\
+	____p1;								\
+})
+
+#define smp_acquire__after_ctrl_dep()	smp_rmb()
+
+#define smp_store_release(p, v)						\
+do {									\
+	barrier();							\
+	WRITE_ONCE(*p, v);						\
+} while (0)
+
+#define has_fast_acquire_release()	1
+#define has_single_copy_load_64()	1
+
+/*
+ * The __rseq_table section can be used by debuggers to better handle
+ * single-stepping through the restartable critical sections.
+ */
+#define RSEQ_FINISH_ASM(_target_final, _to_write_final, _start_value, \
+		_failure, _spec_store, _spec_input, \
+		_final_store, _final_input, _extra_clobber, \
+		_setup, _teardown, _scratch) \
+do { \
+	_scratch \
+	__asm__ __volatile__ goto ( \
+		".pushsection __rseq_table, \"aw\"\n\t" \
+		".balign 32\n\t" \
+		"3:\n\t" \
+		".quad 1f, 2f, 4f\n\t" \
+		".long 0x0, 0x0\n\t" \
+		".popsection\n\t" \
+		"1:\n\t" \
+		_setup \
+		RSEQ_INJECT_ASM(1) \
+		"leaq 3b(%%rip), %%rax\n\t" \
+		"movq %%rax, %[rseq_cs]\n\t" \
+		RSEQ_INJECT_ASM(2) \
+		"cmpl %[start_event_counter], %[current_event_counter]\n\t" \
+		"jnz 4f\n\t" \
+		RSEQ_INJECT_ASM(3) \
+		_spec_store \
+		_final_store \
+		"2:\n\t" \
+		RSEQ_INJECT_ASM(5) \
+		_teardown \
+		".pushsection __rseq_failure, \"a\"\n\t" \
+		"4:\n\t" \
+		_teardown \
+		"jmp %l[failure]\n\t" \
+		".popsection\n\t" \
+		: /* gcc asm goto does not allow outputs */ \
+		: [start_event_counter]"r"((_start_value).event_counter), \
+		  [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \
+		  [rseq_cs]"m"((_start_value).rseqp->rseq_cs) \
+		  _spec_input \
+		  _final_input \
+		  RSEQ_INJECT_INPUT \
+		: "memory", "cc", "rax" \
+		  _extra_clobber \
+		  RSEQ_INJECT_CLOBBER \
+		: _failure \
+	); \
+} while (0)
+
+#define RSEQ_FINISH_FINAL_STORE_ASM() \
+		"movq %[to_write_final], %[target_final]\n\t"
+
+/* x86-64 is TSO */
+#define RSEQ_FINISH_FINAL_STORE_RELEASE_ASM() \
+		RSEQ_FINISH_FINAL_STORE_ASM()
+
+#define RSEQ_FINISH_FINAL_STORE_INPUT(_target_final, _to_write_final) \
+		, [to_write_final]"r"(_to_write_final), \
+		[target_final]"m"(*(_target_final))
+
+#define RSEQ_FINISH_SPECULATIVE_STORE_ASM() \
+		"movq %[to_write_spec], %[target_spec]\n\t" \
+		RSEQ_INJECT_ASM(4)
+
+#define RSEQ_FINISH_SPECULATIVE_STORE_INPUT(_target_spec, _to_write_spec) \
+		, [to_write_spec]"r"(_to_write_spec), \
+		[target_spec]"m"(*(_target_spec))
+
+/* TODO: implement a faster memcpy. */
+#define RSEQ_FINISH_MEMCPY_STORE_ASM() \
+		"test %[len_memcpy], %[len_memcpy]\n\t" \
+		"jz 333f\n\t" \
+		"222:\n\t" \
+		"movb (%[to_write_memcpy]), %%al\n\t" \
+		"movb %%al, (%[target_memcpy])\n\t" \
+		"inc %[to_write_memcpy]\n\t" \
+		"inc %[target_memcpy]\n\t" \
+		"dec %[len_memcpy]\n\t" \
+		"jnz 222b\n\t" \
+		"333:\n\t" \
+		RSEQ_INJECT_ASM(4)
+
+#define RSEQ_FINISH_MEMCPY_STORE_INPUT(_target_memcpy, _to_write_memcpy, _len_memcpy) \
+		, [to_write_memcpy]"r"(_to_write_memcpy), \
+		[target_memcpy]"r"(_target_memcpy), \
+		[len_memcpy]"r"(_len_memcpy), \
+		[rseq_scratch0]"m"(rseq_scratch[0]), \
+		[rseq_scratch1]"m"(rseq_scratch[1]), \
+		[rseq_scratch2]"m"(rseq_scratch[2])
+
+#define RSEQ_FINISH_MEMCPY_CLOBBER()	\
+		, "rax"
+
+#define RSEQ_FINISH_MEMCPY_SCRATCH() \
+		uint64_t rseq_scratch[3];
+
+/*
+ * We need to save and restore those input registers so they can be
+ * modified within the assembly.
+ */
+#define RSEQ_FINISH_MEMCPY_SETUP() \
+		"movq %[to_write_memcpy], %[rseq_scratch0]\n\t" \
+		"movq %[target_memcpy], %[rseq_scratch1]\n\t" \
+		"movq %[len_memcpy], %[rseq_scratch2]\n\t"
+
+#define RSEQ_FINISH_MEMCPY_TEARDOWN() \
+		"movq %[rseq_scratch2], %[len_memcpy]\n\t" \
+		"movq %[rseq_scratch1], %[target_memcpy]\n\t" \
+		"movq %[rseq_scratch0], %[to_write_memcpy]\n\t"
+
+#elif __i386__
+
+/*
+ * Support older 32-bit architectures that do not implement fence
+ * instructions.
+ */
+#define smp_mb()	\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
+#define smp_rmb()	\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
+#define smp_wmb()	\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
+
+#define smp_load_acquire(p)						\
+__extension__ ({							\
+	__typeof(*p) ____p1 = READ_ONCE(*p);				\
+	smp_mb();							\
+	____p1;								\
+})
+
+#define smp_acquire__after_ctrl_dep()	smp_rmb()
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	WRITE_ONCE(*p, v);						\
+} while (0)
+
+#define has_fast_acquire_release()	0
+#define has_single_copy_load_64()	0
+
+/*
+ * Use eax as scratch register and take memory operands as input to
+ * lessen register pressure. Especially needed when compiling
+ * do_rseq_memcpy() in O0.
+ */
+#define RSEQ_FINISH_ASM(_target_final, _to_write_final, _start_value, \
+		_failure, _spec_store, _spec_input, \
+		_final_store, _final_input, _extra_clobber, \
+		_setup, _teardown, _scratch) \
+do { \
+	_scratch \
+	__asm__ __volatile__ goto ( \
+		".pushsection __rseq_table, \"aw\"\n\t" \
+		".balign 32\n\t" \
+		"3:\n\t" \
+		".long 1f, 0x0, 2f, 0x0, 4f, 0x0, 0x0, 0x0\n\t" \
+		".popsection\n\t" \
+		"1:\n\t" \
+		_setup \
+		RSEQ_INJECT_ASM(1) \
+		"movl $3b, %[rseq_cs]\n\t" \
+		RSEQ_INJECT_ASM(2) \
+		"movl %[start_event_counter], %%eax\n\t" \
+		"cmpl %%eax, %[current_event_counter]\n\t" \
+		"jnz 4f\n\t" \
+		RSEQ_INJECT_ASM(3) \
+		_spec_store \
+		_final_store \
+		"2:\n\t" \
+		RSEQ_INJECT_ASM(5) \
+		_teardown \
+		".pushsection __rseq_failure, \"a\"\n\t" \
+		"4:\n\t" \
+		_teardown \
+		"jmp %l[failure]\n\t" \
+		".popsection\n\t" \
+		: /* gcc asm goto does not allow outputs */ \
+		: [start_event_counter]"m"((_start_value).event_counter), \
+		  [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \
+		  [rseq_cs]"m"((_start_value).rseqp->rseq_cs) \
+		  _spec_input \
+		  _final_input \
+		  RSEQ_INJECT_INPUT \
+		: "memory", "cc", "eax" \
+		  _extra_clobber \
+		  RSEQ_INJECT_CLOBBER \
+		: _failure \
+	); \
+} while (0)
+
+#define RSEQ_FINISH_FINAL_STORE_ASM() \
+		"movl %[to_write_final], %%eax\n\t" \
+		"movl %%eax, %[target_final]\n\t"
+
+#define RSEQ_FINISH_FINAL_STORE_RELEASE_ASM() \
+		"lock; addl $0,0(%%esp)\n\t" \
+		RSEQ_FINISH_FINAL_STORE_ASM()
+
+#define RSEQ_FINISH_FINAL_STORE_INPUT(_target_final, _to_write_final) \
+		, [to_write_final]"m"(_to_write_final), \
+		[target_final]"m"(*(_target_final))
+
+#define RSEQ_FINISH_SPECULATIVE_STORE_ASM() \
+		"movl %[to_write_spec], %%eax\n\t" \
+		"movl %%eax, %[target_spec]\n\t" \
+		RSEQ_INJECT_ASM(4)
+
+#define RSEQ_FINISH_SPECULATIVE_STORE_INPUT(_target_spec, _to_write_spec) \
+		, [to_write_spec]"m"(_to_write_spec), \
+		[target_spec]"m"(*(_target_spec))
+
+/* TODO: implement a faster memcpy. */
+#define RSEQ_FINISH_MEMCPY_STORE_ASM() \
+		"movl %[len_memcpy], %%eax\n\t" \
+		"test %%eax, %%eax\n\t" \
+		"jz 333f\n\t" \
+		"222:\n\t" \
+		"movb (%[to_write_memcpy]), %%al\n\t" \
+		"movb %%al, (%[target_memcpy])\n\t" \
+		"inc %[to_write_memcpy]\n\t" \
+		"inc %[target_memcpy]\n\t" \
+		"decl %[rseq_scratch2]\n\t" \
+		"jnz 222b\n\t" \
+		"333:\n\t" \
+		RSEQ_INJECT_ASM(4)
+
+#define RSEQ_FINISH_MEMCPY_STORE_INPUT(_target_memcpy, _to_write_memcpy, _len_memcpy) \
+		, [to_write_memcpy]"r"(_to_write_memcpy), \
+		[target_memcpy]"r"(_target_memcpy), \
+		[len_memcpy]"m"(_len_memcpy), \
+		[rseq_scratch0]"m"(rseq_scratch[0]), \
+		[rseq_scratch1]"m"(rseq_scratch[1]), \
+		[rseq_scratch2]"m"(rseq_scratch[2])
+
+#define RSEQ_FINISH_MEMCPY_CLOBBER()
+
+#define RSEQ_FINISH_MEMCPY_SCRATCH() \
+		uint32_t rseq_scratch[3];
+
+/*
+ * We need to save and restore those input registers so they can be
+ * modified within the assembly.
+ */
+#define RSEQ_FINISH_MEMCPY_SETUP() \
+		"movl %[to_write_memcpy], %[rseq_scratch0]\n\t" \
+		"movl %[target_memcpy], %[rseq_scratch1]\n\t" \
+		"movl %[len_memcpy], %%eax\n\t" \
+		"movl %%eax, %[rseq_scratch2]\n\t"
+
+#define RSEQ_FINISH_MEMCPY_TEARDOWN() \
+		"movl %[rseq_scratch1], %[target_memcpy]\n\t" \
+		"movl %[rseq_scratch0], %[to_write_memcpy]\n\t"
+
+#endif
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
new file mode 100644
index 000000000000..79eba7f20064
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -0,0 +1,78 @@
+/*
+ * rseq.c
+ *
+ * Copyright (C) 2016 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; only
+ * version 2.1 of the License.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <syscall.h>
+#include <assert.h>
+#include <signal.h>
+
+#include "rseq.h"
+
+#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
+
+__attribute__((weak)) __thread volatile struct rseq __rseq_abi = {
+	.u.e.cpu_id = -1,
+};
+
+static int sys_rseq(volatile struct rseq *rseq_abi, int flags)
+{
+	return syscall(__NR_rseq, rseq_abi, flags);
+}
+
+int rseq_register_current_thread(void)
+{
+	int rc;
+
+	rc = sys_rseq(&__rseq_abi, 0);
+	if (rc) {
+		fprintf(stderr, "Error: sys_rseq(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		return -1;
+	}
+	assert(rseq_current_cpu() >= 0);
+	return 0;
+}
+
+int rseq_unregister_current_thread(void)
+{
+	int rc;
+
+	rc = sys_rseq(NULL, 0);
+	if (rc) {
+		fprintf(stderr, "Error: sys_rseq(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		return -1;
+	}
+	return 0;
+}
+
+int rseq_fallback_current_cpu(void)
+{
+	int cpu;
+
+	cpu = sched_getcpu();
+	if (cpu < 0) {
+		perror("sched_getcpu()");
+		abort();
+	}
+	return cpu;
+}
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
new file mode 100644
index 000000000000..b0015f255ffc
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -0,0 +1,298 @@
+/*
+ * rseq.h
+ *
+ * (C) Copyright 2016 - Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef RSEQ_H
+#define RSEQ_H
+
+#include <stdint.h>
+#include <stdbool.h>
+#include <pthread.h>
+#include <signal.h>
+#include <sched.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sched.h>
+#include <linux/rseq.h>
+
+/*
+ * Empty code injection macros, override when testing.
+ * It is important to consider that the ASM injection macros need to be
+ * fully reentrant (e.g. do not modify the stack).
+ */
+#ifndef RSEQ_INJECT_ASM
+#define RSEQ_INJECT_ASM(n)
+#endif
+
+#ifndef RSEQ_INJECT_C
+#define RSEQ_INJECT_C(n)
+#endif
+
+#ifndef RSEQ_INJECT_INPUT
+#define RSEQ_INJECT_INPUT
+#endif
+
+#ifndef RSEQ_INJECT_CLOBBER
+#define RSEQ_INJECT_CLOBBER
+#endif
+
+#ifndef RSEQ_INJECT_FAILED
+#define RSEQ_INJECT_FAILED
+#endif
+
+extern __thread volatile struct rseq __rseq_abi;
+
+#define likely(x)		__builtin_expect(!!(x), 1)
+#define unlikely(x)		__builtin_expect(!!(x), 0)
+#define barrier()		__asm__ __volatile__("" : : : "memory")
+
+#define ACCESS_ONCE(x)		(*(__volatile__  __typeof__(x) *)&(x))
+#define WRITE_ONCE(x, v)	__extension__ ({ ACCESS_ONCE(x) = (v); })
+#define READ_ONCE(x)		ACCESS_ONCE(x)
+
+#if defined(__x86_64__) || defined(__i386__)
+#include <rseq-x86.h>
+#elif defined(__ARMEL__)
+#include <rseq-arm.h>
+#elif defined(__PPC__)
+#include <rseq-ppc.h>
+#else
+#error unsupported target
+#endif
+
+/* State returned by rseq_start, passed as argument to rseq_finish. */
+struct rseq_state {
+	volatile struct rseq *rseqp;
+	int32_t cpu_id;		/* cpu_id at start. */
+	uint32_t event_counter;	/* event_counter at start. */
+};
+
+/*
+ * Register rseq for the current thread. This needs to be called once
+ * by any thread which uses restartable sequences, before they start
+ * using restartable sequences. If initialization is not invoked, or if
+ * it fails, the restartable critical sections will fall-back on locking
+ * (rseq_lock).
+ */
+int rseq_register_current_thread(void);
+
+/*
+ * Unregister rseq for current thread.
+ */
+int rseq_unregister_current_thread(void);
+
+/*
+ * Restartable sequence fallback for reading the current CPU number.
+ */
+int rseq_fallback_current_cpu(void);
+
+static inline int32_t rseq_cpu_at_start(struct rseq_state start_value)
+{
+	return start_value.cpu_id;
+}
+
+static inline int32_t rseq_current_cpu_raw(void)
+{
+	return ACCESS_ONCE(__rseq_abi.u.e.cpu_id);
+}
+
+static inline int32_t rseq_current_cpu(void)
+{
+	int32_t cpu;
+
+	cpu = rseq_current_cpu_raw();
+	if (unlikely(cpu < 0))
+		cpu = rseq_fallback_current_cpu();
+	return cpu;
+}
+
+static inline __attribute__((always_inline))
+struct rseq_state rseq_start(void)
+{
+	struct rseq_state result;
+
+	result.rseqp = &__rseq_abi;
+	if (has_single_copy_load_64()) {
+		union rseq_cpu_event u;
+
+		u.v = ACCESS_ONCE(result.rseqp->u.v);
+		result.event_counter = u.e.event_counter;
+		result.cpu_id = u.e.cpu_id;
+	} else {
+		result.event_counter =
+			ACCESS_ONCE(result.rseqp->u.e.event_counter);
+		/* load event_counter before cpu_id. */
+		RSEQ_INJECT_C(6)
+		result.cpu_id = ACCESS_ONCE(result.rseqp->u.e.cpu_id);
+	}
+	RSEQ_INJECT_C(7)
+	/*
+	 * Ensure the compiler does not re-order loads of protected
+	 * values before we load the event counter.
+	 */
+	barrier();
+	return result;
+}
+
+enum rseq_finish_type {
+	RSEQ_FINISH_SINGLE,
+	RSEQ_FINISH_TWO,
+	RSEQ_FINISH_MEMCPY,
+};
+
+/*
+ * p_spec and to_write_spec are used for a speculative write attempted
+ * near the end of the restartable sequence. A rseq_finish2 may fail
+ * even after this write takes place.
+ *
+ * p_final and to_write_final are used for the final write. If this
+ * write takes place, the rseq_finish2 is guaranteed to succeed.
+ */
+static inline __attribute__((always_inline))
+bool __rseq_finish(intptr_t *p_spec, intptr_t to_write_spec,
+		void *p_memcpy, void *to_write_memcpy, size_t len_memcpy,
+		intptr_t *p_final, intptr_t to_write_final,
+		struct rseq_state start_value,
+		enum rseq_finish_type type, bool release)
+{
+	RSEQ_INJECT_C(9)
+
+	switch (type) {
+	case RSEQ_FINISH_SINGLE:
+		RSEQ_FINISH_ASM(p_final, to_write_final, start_value, failure,
+			/* no speculative write */, /* no speculative write */,
+			RSEQ_FINISH_FINAL_STORE_ASM(),
+			RSEQ_FINISH_FINAL_STORE_INPUT(p_final, to_write_final),
+			/* no extra clobber */, /* no arg */, /* no arg */,
+			/* no arg */
+		);
+		break;
+	case RSEQ_FINISH_TWO:
+		if (release) {
+			RSEQ_FINISH_ASM(p_final, to_write_final, start_value, failure,
+				RSEQ_FINISH_SPECULATIVE_STORE_ASM(),
+				RSEQ_FINISH_SPECULATIVE_STORE_INPUT(p_spec, to_write_spec),
+				RSEQ_FINISH_FINAL_STORE_RELEASE_ASM(),
+				RSEQ_FINISH_FINAL_STORE_INPUT(p_final, to_write_final),
+				/* no extra clobber */, /* no arg */, /* no arg */,
+				/* no arg */
+			);
+		} else {
+			RSEQ_FINISH_ASM(p_final, to_write_final, start_value, failure,
+				RSEQ_FINISH_SPECULATIVE_STORE_ASM(),
+				RSEQ_FINISH_SPECULATIVE_STORE_INPUT(p_spec, to_write_spec),
+				RSEQ_FINISH_FINAL_STORE_ASM(),
+				RSEQ_FINISH_FINAL_STORE_INPUT(p_final, to_write_final),
+				/* no extra clobber */, /* no arg */, /* no arg */,
+				/* no arg */
+			);
+		}
+		break;
+	case RSEQ_FINISH_MEMCPY:
+		if (release) {
+			RSEQ_FINISH_ASM(p_final, to_write_final, start_value, failure,
+				RSEQ_FINISH_MEMCPY_STORE_ASM(),
+				RSEQ_FINISH_MEMCPY_STORE_INPUT(p_memcpy, to_write_memcpy, len_memcpy),
+				RSEQ_FINISH_FINAL_STORE_RELEASE_ASM(),
+				RSEQ_FINISH_FINAL_STORE_INPUT(p_final, to_write_final),
+				RSEQ_FINISH_MEMCPY_CLOBBER(),
+				RSEQ_FINISH_MEMCPY_SETUP(),
+				RSEQ_FINISH_MEMCPY_TEARDOWN(),
+				RSEQ_FINISH_MEMCPY_SCRATCH()
+			);
+		} else {
+			RSEQ_FINISH_ASM(p_final, to_write_final, start_value, failure,
+				RSEQ_FINISH_MEMCPY_STORE_ASM(),
+				RSEQ_FINISH_MEMCPY_STORE_INPUT(p_memcpy, to_write_memcpy, len_memcpy),
+				RSEQ_FINISH_FINAL_STORE_ASM(),
+				RSEQ_FINISH_FINAL_STORE_INPUT(p_final, to_write_final),
+				RSEQ_FINISH_MEMCPY_CLOBBER(),
+				RSEQ_FINISH_MEMCPY_SETUP(),
+				RSEQ_FINISH_MEMCPY_TEARDOWN(),
+				RSEQ_FINISH_MEMCPY_SCRATCH()
+			);
+		}
+		break;
+	}
+	return true;
+failure:
+	RSEQ_INJECT_FAILED
+	return false;
+}
+
+static inline __attribute__((always_inline))
+bool rseq_finish(intptr_t *p, intptr_t to_write,
+		struct rseq_state start_value)
+{
+	return __rseq_finish(NULL, 0,
+			NULL, NULL, 0,
+			p, to_write, start_value,
+			RSEQ_FINISH_SINGLE, false);
+}
+
+static inline __attribute__((always_inline))
+bool rseq_finish2(intptr_t *p_spec, intptr_t to_write_spec,
+		intptr_t *p_final, intptr_t to_write_final,
+		struct rseq_state start_value)
+{
+	return __rseq_finish(p_spec, to_write_spec,
+			NULL, NULL, 0,
+			p_final, to_write_final, start_value,
+			RSEQ_FINISH_TWO, false);
+}
+
+static inline __attribute__((always_inline))
+bool rseq_finish2_release(intptr_t *p_spec, intptr_t to_write_spec,
+		intptr_t *p_final, intptr_t to_write_final,
+		struct rseq_state start_value)
+{
+	return __rseq_finish(p_spec, to_write_spec,
+			NULL, NULL, 0,
+			p_final, to_write_final, start_value,
+			RSEQ_FINISH_TWO, true);
+}
+
+static inline __attribute__((always_inline))
+bool rseq_finish_memcpy(void *p_memcpy, void *to_write_memcpy,
+		size_t len_memcpy, intptr_t *p_final, intptr_t to_write_final,
+		struct rseq_state start_value)
+{
+	return __rseq_finish(NULL, 0,
+			p_memcpy, to_write_memcpy, len_memcpy,
+			p_final, to_write_final, start_value,
+			RSEQ_FINISH_MEMCPY, false);
+}
+
+static inline __attribute__((always_inline))
+bool rseq_finish_memcpy_release(void *p_memcpy, void *to_write_memcpy,
+		size_t len_memcpy, intptr_t *p_final, intptr_t to_write_final,
+		struct rseq_state start_value)
+{
+	return __rseq_finish(NULL, 0,
+			p_memcpy, to_write_memcpy, len_memcpy,
+			p_final, to_write_final, start_value,
+			RSEQ_FINISH_MEMCPY, true);
+}
+
+#endif  /* RSEQ_H_ */
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests
  2017-10-12 23:03 ` [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests Mathieu Desnoyers
@ 2017-10-16  2:51   ` Michael Ellerman
  2017-10-16 14:23     ` Mathieu Desnoyers
       [not found]     ` <871sm3n6sy.fsf-W0DJWXSxmBNbyGPkN3NxC2scP1bn1w/D@public.gmane.org>
       [not found]   ` <20171012230326.19984-15-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  1 sibling, 2 replies; 61+ messages in thread
From: Michael Ellerman @ 2017-10-16  2:51 UTC (permalink / raw)
  To: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon
  Cc: linux-kernel, Mathieu Desnoyers, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, Steven Rostedt, Linus Torvalds,
	Andrew Morton, Shuah Khan, linux-kselftest, linux-api

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:

> Implements two basic tests of RSEQ functionality, and one more
> exhaustive parameterizable test.
>
> The first, "basic_test" only asserts that RSEQ works moderately
> correctly.
> E.g. that:
> - The CPUID pointer works
> - Code infinitely looping within a critical section will eventually be
>   interrupted.
> - Critical sections are interrupted by signals.
>
> "basic_percpu_ops_test" is a slightly more "realistic" variant,
> implementing a few simple per-cpu operations and testing their
> correctness.
>
> "param_test" is a parametrizable restartable sequences test. See
> the "--help" output for usage.

Thanks for providing selftests :)

The Makefiles could use a little clean up:
  - cpu-opv doesn't need libpthread
  - you don't need to define your own rule just for building
  - use TEST_GEN_PROGS to hook into the right parts of lib.mk
  - .. which means you can use the clean rule in lib.mk


I notice you didn't add rseq or cpu-opv to the list of TARGETS in
tools/testing/selftests/Makefile, was that deliberate?

Feel free to squash this patch in if you're happy to.

This still works with:
  $ make -C tools/testing/selftests TARGETS=rseq

and:
  $ cd tools/testing/selftests/rseq; make

cheers

diff --git a/tools/testing/selftests/cpu-opv/Makefile b/tools/testing/selftests/cpu-opv/Makefile
index 81d0596824ee..d41670ad5c43 100644
--- a/tools/testing/selftests/cpu-opv/Makefile
+++ b/tools/testing/selftests/cpu-opv/Makefile
@@ -1,13 +1,9 @@
 CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/
-LDFLAGS += -lpthread
 
-TESTS = basic_cpu_opv_test
+TEST_GEN_PROGS = basic_cpu_opv_test
 
-all: $(TESTS)
-%: %.c cpu-op.c cpu-op.h
-	$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
+all: $(TEST_GEN_PROGS)
 
-include ../lib.mk
+$(TEST_GEN_PROGS): cpu-op.c cpu-op.h
 
-clean:
-	$(RM) $(TESTS)
+include ../lib.mk
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 7f0153556b80..9f8257b4ce14 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -1,13 +1,10 @@
 CFLAGS += -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/
-LDFLAGS += -lpthread
+LDLIBS += -lpthread
 
-TESTS = basic_test basic_percpu_ops_test param_test
+TEST_GEN_PROGS = basic_test basic_percpu_ops_test param_test
 
-all: $(TESTS)
-%: %.c rseq.h rseq-*.h rseq.c ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h
-	$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
+all: $(TEST_GEN_PROGS)
 
-include ../lib.mk
+$(TEST_GEN_PROGS): rseq.h rseq-*.h rseq.c ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h
 
-clean:
-	$(RM) $(TESTS)
+include ../lib.mk

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests
  2017-10-16  2:51   ` Michael Ellerman
@ 2017-10-16 14:23     ` Mathieu Desnoyers
       [not found]       ` <399058130.42156.1508163782335.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
       [not found]     ` <871sm3n6sy.fsf-W0DJWXSxmBNbyGPkN3NxC2scP1bn1w/D@public.gmane.org>
  1 sibling, 1 reply; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-16 14:23 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, rostedt, Linus Torvalds,
	Andrew Morton

----- On Oct 15, 2017, at 10:51 PM, Michael Ellerman mpe@ellerman.id.au wrote:

> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> 
>> Implements two basic tests of RSEQ functionality, and one more
>> exhaustive parameterizable test.
>>
>> The first, "basic_test" only asserts that RSEQ works moderately
>> correctly.
>> E.g. that:
>> - The CPUID pointer works
>> - Code infinitely looping within a critical section will eventually be
>>   interrupted.
>> - Critical sections are interrupted by signals.
>>
>> "basic_percpu_ops_test" is a slightly more "realistic" variant,
>> implementing a few simple per-cpu operations and testing their
>> correctness.
>>
>> "param_test" is a parametrizable restartable sequences test. See
>> the "--help" output for usage.
> 
> Thanks for providing selftests :)
> 
> The Makefiles could use a little clean up:
>  - cpu-opv doesn't need libpthread
>  - you don't need to define your own rule just for building
>  - use TEST_GEN_PROGS to hook into the right parts of lib.mk
>  - .. which means you can use the clean rule in lib.mk
> 
> 
> I notice you didn't add rseq or cpu-opv to the list of TARGETS in
> tools/testing/selftests/Makefile, was that deliberate?

No, I think I just copied some other selftest which perhaps did
not have those rules back then.

> 
> Feel free to squash this patch in if you're happy to.
> 
> This still works with:
>  $ make -C tools/testing/selftests TARGETS=rseq
> 
> and:
>  $ cd tools/testing/selftests/rseq; make

Great ! I'll fold this patch in.

Thanks!

Mathieu

> 
> cheers
> 
> diff --git a/tools/testing/selftests/cpu-opv/Makefile
> b/tools/testing/selftests/cpu-opv/Makefile
> index 81d0596824ee..d41670ad5c43 100644
> --- a/tools/testing/selftests/cpu-opv/Makefile
> +++ b/tools/testing/selftests/cpu-opv/Makefile
> @@ -1,13 +1,9 @@
> CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/
> -LDFLAGS += -lpthread
> 
> -TESTS = basic_cpu_opv_test
> +TEST_GEN_PROGS = basic_cpu_opv_test
> 
> -all: $(TESTS)
> -%: %.c cpu-op.c cpu-op.h
> -	$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
> +all: $(TEST_GEN_PROGS)
> 
> -include ../lib.mk
> +$(TEST_GEN_PROGS): cpu-op.c cpu-op.h
> 
> -clean:
> -	$(RM) $(TESTS)
> +include ../lib.mk
> diff --git a/tools/testing/selftests/rseq/Makefile
> b/tools/testing/selftests/rseq/Makefile
> index 7f0153556b80..9f8257b4ce14 100644
> --- a/tools/testing/selftests/rseq/Makefile
> +++ b/tools/testing/selftests/rseq/Makefile
> @@ -1,13 +1,10 @@
> CFLAGS += -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/
> -LDFLAGS += -lpthread
> +LDLIBS += -lpthread
> 
> -TESTS = basic_test basic_percpu_ops_test param_test
> +TEST_GEN_PROGS = basic_test basic_percpu_ops_test param_test
> 
> -all: $(TESTS)
> -%: %.c rseq.h rseq-*.h rseq.c ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h
> -	$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
> +all: $(TEST_GEN_PROGS)
> 
> -include ../lib.mk
> +$(TEST_GEN_PROGS): rseq.h rseq-*.h rseq.c ../cpu-opv/cpu-op.c
> ../cpu-opv/cpu-op.h
> 
> -clean:
> -	$(RM) $(TESTS)
> +include ../lib.mk

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <399058130.42156.1508163782335.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>]

* Re: [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests
       [not found]       ` <399058130.42156.1508163782335.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-10-17 10:38         ` Michael Ellerman
  2017-10-17 13:50           ` Mathieu Desnoyers
  0 siblings, 1 reply; 61+ messages in thread
From: Michael Ellerman @ 2017-10-17 10:38 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, rostedt, Linus Torvalds,
	Andrew Morton

Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> writes:

> ----- On Oct 15, 2017, at 10:51 PM, Michael Ellerman mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org wrote:
>
>> Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> writes:
>> 
>>> Implements two basic tests of RSEQ functionality, and one more
>>> exhaustive parameterizable test.
>>>
>>> The first, "basic_test" only asserts that RSEQ works moderately
>>> correctly.
>>> E.g. that:
>>> - The CPUID pointer works
>>> - Code infinitely looping within a critical section will eventually be
>>>   interrupted.
>>> - Critical sections are interrupted by signals.
>>>
>>> "basic_percpu_ops_test" is a slightly more "realistic" variant,
>>> implementing a few simple per-cpu operations and testing their
>>> correctness.
>>>
>>> "param_test" is a parametrizable restartable sequences test. See
>>> the "--help" output for usage.
>> 
>> Thanks for providing selftests :)
>> 
>> The Makefiles could use a little clean up:
>>  - cpu-opv doesn't need libpthread
>>  - you don't need to define your own rule just for building
>>  - use TEST_GEN_PROGS to hook into the right parts of lib.mk
>>  - .. which means you can use the clean rule in lib.mk
>> 
>> 
>> I notice you didn't add rseq or cpu-opv to the list of TARGETS in
>> tools/testing/selftests/Makefile, was that deliberate?
>
> No, I think I just copied some other selftest which perhaps did
> not have those rules back then.

OK. I don't see why these tests shouldn't be part of the default test
run, so here's a patch to add them.

cheers


diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index ff805643b5f7..b79ef5915c74 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -3,6 +3,7 @@ TARGETS += breakpoints
 TARGETS += capabilities
 TARGETS += cpufreq
 TARGETS += cpu-hotplug
+TARGETS += cpu-opv
 TARGETS += efivarfs
 TARGETS += exec
 TARGETS += firmware
@@ -23,6 +24,7 @@ TARGETS += nsfs
 TARGETS += powerpc
 TARGETS += pstore
 TARGETS += ptrace
+TARGETS += rseq
 TARGETS += seccomp
 TARGETS += sigaltstack
 TARGETS += size

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests
  2017-10-17 10:38         ` Michael Ellerman
@ 2017-10-17 13:50           ` Mathieu Desnoyers
  0 siblings, 0 replies; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-17 13:50 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, rostedt, Linus Torvalds,
	Andrew Morton

----- On Oct 17, 2017, at 6:38 AM, Michael Ellerman mpe@ellerman.id.au wrote:

> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> 
>> ----- On Oct 15, 2017, at 10:51 PM, Michael Ellerman mpe@ellerman.id.au wrote:
>>
>>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
>>> 
>>>> Implements two basic tests of RSEQ functionality, and one more
>>>> exhaustive parameterizable test.
>>>>
>>>> The first, "basic_test" only asserts that RSEQ works moderately
>>>> correctly.
>>>> E.g. that:
>>>> - The CPUID pointer works
>>>> - Code infinitely looping within a critical section will eventually be
>>>>   interrupted.
>>>> - Critical sections are interrupted by signals.
>>>>
>>>> "basic_percpu_ops_test" is a slightly more "realistic" variant,
>>>> implementing a few simple per-cpu operations and testing their
>>>> correctness.
>>>>
>>>> "param_test" is a parametrizable restartable sequences test. See
>>>> the "--help" output for usage.
>>> 
>>> Thanks for providing selftests :)
>>> 
>>> The Makefiles could use a little clean up:
>>>  - cpu-opv doesn't need libpthread
>>>  - you don't need to define your own rule just for building
>>>  - use TEST_GEN_PROGS to hook into the right parts of lib.mk
>>>  - .. which means you can use the clean rule in lib.mk
>>> 
>>> 
>>> I notice you didn't add rseq or cpu-opv to the list of TARGETS in
>>> tools/testing/selftests/Makefile, was that deliberate?
>>
>> No, I think I just copied some other selftest which perhaps did
>> not have those rules back then.
> 
> OK. I don't see why these tests shouldn't be part of the default test
> run, so here's a patch to add them.

Perfect, folded that change into my series as well.

Thanks!

Mathieu

> 
> cheers
> 
> 
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index ff805643b5f7..b79ef5915c74 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -3,6 +3,7 @@ TARGETS += breakpoints
> TARGETS += capabilities
> TARGETS += cpufreq
> TARGETS += cpu-hotplug
> +TARGETS += cpu-opv
> TARGETS += efivarfs
> TARGETS += exec
> TARGETS += firmware
> @@ -23,6 +24,7 @@ TARGETS += nsfs
> TARGETS += powerpc
> TARGETS += pstore
> TARGETS += ptrace
> +TARGETS += rseq
> TARGETS += seccomp
> TARGETS += sigaltstack
>  TARGETS += size

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <871sm3n6sy.fsf-W0DJWXSxmBNbyGPkN3NxC2scP1bn1w/D@public.gmane.org>]

* Re: [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests
       [not found]     ` <871sm3n6sy.fsf-W0DJWXSxmBNbyGPkN3NxC2scP1bn1w/D@public.gmane.org>
@ 2017-10-16 18:50       ` Mathieu Desnoyers
       [not found]         ` <1998166049.42520.1508179805908.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-16 18:50 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, rostedt, Linus Torvalds,
	Andrew Morton

Hi Michael,

With your changes integrated, both rseq and cpu-opv selftests fail to
build if I pass e.g. -j32 to make.

cd tools/testing/selftests/cpu-opv

efficios@compudjdev:~/git/linux-percpu-dev/tools/testing/selftests/cpu-opv$ make clean; make
rm -f -r /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test   
gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c cpu-op.c cpu-op.h   -o basic_cpu_opv_test

efficios@compudjdev:~/git/linux-percpu-dev/tools/testing/selftests/cpu-opv$ make clean; make -j32
rm -f -r /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test   
gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c cpu-op.c cpu-op.h   -o basic_cpu_opv_test
gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c  -o /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test
/tmp/ccDthnqM.o: In function `test_memcpy_op':
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:364: undefined reference to `cpu_op_get_current_cpu'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:365: undefined reference to `cpu_opv'
/tmp/ccDthnqM.o: In function `test_2compare_ne_op':
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:279: undefined reference to `cpu_op_get_current_cpu'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:280: undefined reference to `cpu_opv'
/tmp/ccDthnqM.o: In function `test_2compare_eq_op':
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:189: undefined reference to `cpu_op_get_current_cpu'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:190: undefined reference to `cpu_opv'
/tmp/ccDthnqM.o: In function `test_compare_ne_op':
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:108: undefined reference to `cpu_op_get_current_cpu'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:109: undefined reference to `cpu_opv'
/tmp/ccDthnqM.o: In function `test_compare_eq_op':
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:34: undefined reference to `cpu_op_get_current_cpu'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:35: undefined reference to `cpu_opv'
/tmp/ccDthnqM.o: In function `test_add_op':
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:430: undefined reference to `cpu_op_get_current_cpu'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:431: undefined reference to `cpu_op_add'
/tmp/ccDthnqM.o: In function `test_two_add_op':
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:479: undefined reference to `cpu_op_get_current_cpu'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:480: undefined reference to `cpu_opv'
/tmp/ccDthnqM.o: In function `test_or_op':
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:522: undefined reference to `cpu_op_get_current_cpu'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:523: undefined reference to `cpu_opv'
/tmp/ccDthnqM.o: In function `test_and_op':
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:565: undefined reference to `cpu_op_get_current_cpu'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:566: undefined reference to `cpu_opv'
/tmp/ccDthnqM.o: In function `test_xor_op':
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:608: undefined reference to `cpu_op_get_current_cpu'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:609: undefined reference to `cpu_opv'
/tmp/ccDthnqM.o: In function `test_lshift_op':
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:651: undefined reference to `cpu_op_get_current_cpu'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:652: undefined reference to `cpu_opv'
/tmp/ccDthnqM.o: In function `test_rshift_op':
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:695: undefined reference to `cpu_op_get_current_cpu'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:696: undefined reference to `cpu_opv'
/tmp/ccDthnqM.o: In function `test_cmpxchg_op':
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:731: undefined reference to `cpu_op_get_current_cpu'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:732: undefined reference to `cpu_op_cmpxchg'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:731: undefined reference to `cpu_op_get_current_cpu'
/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:732: undefined reference to `cpu_op_cmpxchg'
collect2: error: ld returned 1 exit status
make: *** [/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test] Error 1
make: *** Waiting for unfinished jobs....

Any idea what is going on here ?

Thanks,

Mathieu

----- On Oct 15, 2017, at 10:51 PM, Michael Ellerman mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org wrote:

> Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> writes:
> 
>> Implements two basic tests of RSEQ functionality, and one more
>> exhaustive parameterizable test.
>>
>> The first, "basic_test" only asserts that RSEQ works moderately
>> correctly.
>> E.g. that:
>> - The CPUID pointer works
>> - Code infinitely looping within a critical section will eventually be
>>   interrupted.
>> - Critical sections are interrupted by signals.
>>
>> "basic_percpu_ops_test" is a slightly more "realistic" variant,
>> implementing a few simple per-cpu operations and testing their
>> correctness.
>>
>> "param_test" is a parametrizable restartable sequences test. See
>> the "--help" output for usage.
> 
> Thanks for providing selftests :)
> 
> The Makefiles could use a little clean up:
>  - cpu-opv doesn't need libpthread
>  - you don't need to define your own rule just for building
>  - use TEST_GEN_PROGS to hook into the right parts of lib.mk
>  - .. which means you can use the clean rule in lib.mk
> 
> 
> I notice you didn't add rseq or cpu-opv to the list of TARGETS in
> tools/testing/selftests/Makefile, was that deliberate?
> 
> Feel free to squash this patch in if you're happy to.
> 
> This still works with:
>  $ make -C tools/testing/selftests TARGETS=rseq
> 
> and:
>  $ cd tools/testing/selftests/rseq; make
> 
> cheers
> 
> diff --git a/tools/testing/selftests/cpu-opv/Makefile
> b/tools/testing/selftests/cpu-opv/Makefile
> index 81d0596824ee..d41670ad5c43 100644
> --- a/tools/testing/selftests/cpu-opv/Makefile
> +++ b/tools/testing/selftests/cpu-opv/Makefile
> @@ -1,13 +1,9 @@
> CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/
> -LDFLAGS += -lpthread
> 
> -TESTS = basic_cpu_opv_test
> +TEST_GEN_PROGS = basic_cpu_opv_test
> 
> -all: $(TESTS)
> -%: %.c cpu-op.c cpu-op.h
> -	$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
> +all: $(TEST_GEN_PROGS)
> 
> -include ../lib.mk
> +$(TEST_GEN_PROGS): cpu-op.c cpu-op.h
> 
> -clean:
> -	$(RM) $(TESTS)
> +include ../lib.mk
> diff --git a/tools/testing/selftests/rseq/Makefile
> b/tools/testing/selftests/rseq/Makefile
> index 7f0153556b80..9f8257b4ce14 100644
> --- a/tools/testing/selftests/rseq/Makefile
> +++ b/tools/testing/selftests/rseq/Makefile
> @@ -1,13 +1,10 @@
> CFLAGS += -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/
> -LDFLAGS += -lpthread
> +LDLIBS += -lpthread
> 
> -TESTS = basic_test basic_percpu_ops_test param_test
> +TEST_GEN_PROGS = basic_test basic_percpu_ops_test param_test
> 
> -all: $(TESTS)
> -%: %.c rseq.h rseq-*.h rseq.c ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h
> -	$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
> +all: $(TEST_GEN_PROGS)
> 
> -include ../lib.mk
> +$(TEST_GEN_PROGS): rseq.h rseq-*.h rseq.c ../cpu-opv/cpu-op.c
> ../cpu-opv/cpu-op.h
> 
> -clean:
> -	$(RM) $(TESTS)
> +include ../lib.mk

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <1998166049.42520.1508179805908.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>]

* Re: [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests
       [not found]         ` <1998166049.42520.1508179805908.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-10-17 10:36           ` Michael Ellerman
       [not found]             ` <87d15mjc1g.fsf-W0DJWXSxmBNbyGPkN3NxC2scP1bn1w/D@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Michael Ellerman @ 2017-10-17 10:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, rostedt, Linus Torvalds,
	Andrew Morton

Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> writes:

> Hi Michael,
>
> With your changes integrated, both rseq and cpu-opv selftests fail to
> build if I pass e.g. -j32 to make.
>
> cd tools/testing/selftests/cpu-opv
>
> efficios@compudjdev:~/git/linux-percpu-dev/tools/testing/selftests/cpu-opv$ make clean; make
> rm -f -r /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test   
> gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c cpu-op.c cpu-op.h   -o basic_cpu_opv_test
>
> efficios@compudjdev:~/git/linux-percpu-dev/tools/testing/selftests/cpu-opv$ make clean; make -j32
> rm -f -r /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test   
> gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c cpu-op.c cpu-op.h   -o basic_cpu_opv_test
> gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c  -o /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test
> /tmp/ccDthnqM.o: In function `test_memcpy_op':
> /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:364: undefined reference to `cpu_op_get_current_cpu'
> /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:365: undefined reference to `cpu_opv'
...
> make: *** [/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test] Error 1
> make: *** Waiting for unfinished jobs....
>
> Any idea what is going on here ?

Ugh sorry, yes.

New patch below should fix it. Tested with -j:

  ~/linux/tools/testing/selftests$ make TARGETS="rseq cpu-opv" -j
  make[1]: Entering directory '/home/michael/linux/tools/testing/selftests/rseq'
  gcc -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/    basic_test.c rseq.h rseq-ppc.h rseq-x86.h rseq-arm.h rseq.c ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h -lpthread -o /home/michael/linux/tools/testing/selftests/rseq/basic_test
  gcc -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/    basic_percpu_ops_test.c rseq.h rseq-ppc.h rseq-x86.h rseq-arm.h rseq.c ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h -lpthread -o /home/michael/linux/tools/testing/selftests/rseq/basic_percpu_ops_test
  gcc -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/    param_test.c rseq.h rseq-ppc.h rseq-x86.h rseq-arm.h rseq.c ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h -lpthread -o /home/michael/linux/tools/testing/selftests/rseq/param_test
  make[1]: Leaving directory '/home/michael/linux/tools/testing/selftests/rseq'
  make[1]: Entering directory '/home/michael/linux/tools/testing/selftests/cpu-opv'
  gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c cpu-op.c cpu-op.h  -o /home/michael/linux/tools/testing/selftests/cpu-opv/basic_cpu_opv_test
  make[1]: Leaving directory '/home/michael/linux/tools/testing/selftests/cpu-opv'
  ~/linux/tools/testing/selftests$ make TARGETS="rseq cpu-opv" clean
  make[1]: Entering directory '/home/michael/linux/tools/testing/selftests/rseq'
  rm -f -r /home/michael/linux/tools/testing/selftests/rseq/basic_test /home/michael/linux/tools/testing/selftests/rseq/basic_percpu_ops_test /home/michael/linux/tools/testing/selftests/rseq/param_test   
  make[1]: Leaving directory '/home/michael/linux/tools/testing/selftests/rseq'
  make[1]: Entering directory '/home/michael/linux/tools/testing/selftests/cpu-opv'
  rm -f -r /home/michael/linux/tools/testing/selftests/cpu-opv/basic_cpu_opv_test   
  make[1]: Leaving directory '/home/michael/linux/tools/testing/selftests/cpu-opv'
  ~/linux/tools/testing/selftests$ cd cpu-opv/
  ~/linux/tools/testing/selftests/cpu-opv$ make -j
  gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c cpu-op.c cpu-op.h  -o /home/michael/linux/tools/testing/selftests/cpu-opv/basic_cpu_opv_test
  ~/linux/tools/testing/selftests/cpu-opv$ make clean
  rm -f -r /home/michael/linux/tools/testing/selftests/cpu-opv/basic_cpu_opv_test   
  ~/linux/tools/testing/selftests/cpu-opv$ cd ../rseq/
  ~/linux/tools/testing/selftests/rseq$ make -j
  gcc -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/    basic_test.c rseq.h rseq-ppc.h rseq-x86.h rseq-arm.h rseq.c ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h -lpthread -o /home/michael/linux/tools/testing/selftests/rseq/basic_test
  gcc -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/    basic_percpu_ops_test.c rseq.h rseq-ppc.h rseq-x86.h rseq-arm.h rseq.c ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h -lpthread -o /home/michael/linux/tools/testing/selftests/rseq/basic_percpu_ops_test
  gcc -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/    param_test.c rseq.h rseq-ppc.h rseq-x86.h rseq-arm.h rseq.c ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h -lpthread -o /home/michael/linux/tools/testing/selftests/rseq/param_test
  ~/linux/tools/testing/selftests/rseq$ make clean
  rm -f -r /home/michael/linux/tools/testing/selftests/rseq/basic_test /home/michael/linux/tools/testing/selftests/rseq/basic_percpu_ops_test /home/michael/linux/tools/testing/selftests/rseq/param_test   



cheers

diff --git a/tools/testing/selftests/cpu-opv/Makefile b/tools/testing/selftests/cpu-opv/Makefile
index 81d0596824ee..d27bd0f74deb 100644
--- a/tools/testing/selftests/cpu-opv/Makefile
+++ b/tools/testing/selftests/cpu-opv/Makefile
@@ -1,13 +1,7 @@
 CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/
-LDFLAGS += -lpthread
 
-TESTS = basic_cpu_opv_test
-
-all: $(TESTS)
-%: %.c cpu-op.c cpu-op.h
-       $(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
+TEST_GEN_PROGS = basic_cpu_opv_test
 
 include ../lib.mk
 
-clean:
-       $(RM) $(TESTS)
+$(TEST_GEN_PROGS): cpu-op.c cpu-op.h
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 7f0153556b80..7f625147b7fe 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -1,13 +1,8 @@
 CFLAGS += -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/
-LDFLAGS += -lpthread
+LDLIBS += -lpthread
 
-TESTS = basic_test basic_percpu_ops_test param_test
-
-all: $(TESTS)
-%: %.c rseq.h rseq-*.h rseq.c ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h
-       $(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
+TEST_GEN_PROGS = basic_test basic_percpu_ops_test param_test
 
 include ../lib.mk
 
-clean:
-       $(RM) $(TESTS)
+$(TEST_GEN_PROGS): rseq.h rseq-*.h rseq.c ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h

^ permalink raw reply related	[flat|nested] 61+ messages in thread

[parent not found: <87d15mjc1g.fsf-W0DJWXSxmBNbyGPkN3NxC2scP1bn1w/D@public.gmane.org>]

* Re: [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests
       [not found]             ` <87d15mjc1g.fsf-W0DJWXSxmBNbyGPkN3NxC2scP1bn1w/D@public.gmane.org>
@ 2017-10-17 13:50               ` Mathieu Desnoyers
       [not found]                 ` <1618170495.42951.1508248216596.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Mathieu Desnoyers @ 2017-10-17 13:50 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, rostedt, Linus Torvalds,
	Andrew Morton

----- On Oct 17, 2017, at 6:36 AM, Michael Ellerman mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org wrote:

> Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> writes:
> 
>> Hi Michael,
>>
>> With your changes integrated, both rseq and cpu-opv selftests fail to
>> build if I pass e.g. -j32 to make.
>>
>> cd tools/testing/selftests/cpu-opv
>>
>> efficios@compudjdev:~/git/linux-percpu-dev/tools/testing/selftests/cpu-opv$ make
>> clean; make
>> rm -f -r
>> /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test
>> gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c
>> cpu-op.c cpu-op.h   -o basic_cpu_opv_test
>>
>> efficios@compudjdev:~/git/linux-percpu-dev/tools/testing/selftests/cpu-opv$ make
>> clean; make -j32
>> rm -f -r
>> /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test
>> gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c
>> cpu-op.c cpu-op.h   -o basic_cpu_opv_test
>> gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c  -o
>> /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test
>> /tmp/ccDthnqM.o: In function `test_memcpy_op':
>> /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:364:
>> undefined reference to `cpu_op_get_current_cpu'
>> /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:365:
>> undefined reference to `cpu_opv'
> ...
>> make: ***
>> [/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test]
>> Error 1
>> make: *** Waiting for unfinished jobs....
>>
>> Any idea what is going on here ?
> 
> Ugh sorry, yes.
> 
> New patch below should fix it. Tested with -j:

Perfect, folding it into my series.

I see that the "all" target was redundant here.

Thanks,

Mathieu

> 
>  ~/linux/tools/testing/selftests$ make TARGETS="rseq cpu-opv" -j
>  make[1]: Entering directory '/home/michael/linux/tools/testing/selftests/rseq'
>  gcc -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/    basic_test.c
>  rseq.h rseq-ppc.h rseq-x86.h rseq-arm.h rseq.c ../cpu-opv/cpu-op.c
>  ../cpu-opv/cpu-op.h -lpthread -o
>  /home/michael/linux/tools/testing/selftests/rseq/basic_test
>  gcc -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/
>  basic_percpu_ops_test.c rseq.h rseq-ppc.h rseq-x86.h rseq-arm.h rseq.c
>  ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h -lpthread -o
>  /home/michael/linux/tools/testing/selftests/rseq/basic_percpu_ops_test
>  gcc -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/    param_test.c
>  rseq.h rseq-ppc.h rseq-x86.h rseq-arm.h rseq.c ../cpu-opv/cpu-op.c
>  ../cpu-opv/cpu-op.h -lpthread -o
>  /home/michael/linux/tools/testing/selftests/rseq/param_test
>  make[1]: Leaving directory '/home/michael/linux/tools/testing/selftests/rseq'
>  make[1]: Entering directory
>  '/home/michael/linux/tools/testing/selftests/cpu-opv'
>  gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c
>  cpu-op.c cpu-op.h  -o
>  /home/michael/linux/tools/testing/selftests/cpu-opv/basic_cpu_opv_test
>  make[1]: Leaving directory '/home/michael/linux/tools/testing/selftests/cpu-opv'
>  ~/linux/tools/testing/selftests$ make TARGETS="rseq cpu-opv" clean
>  make[1]: Entering directory '/home/michael/linux/tools/testing/selftests/rseq'
>  rm -f -r /home/michael/linux/tools/testing/selftests/rseq/basic_test
>  /home/michael/linux/tools/testing/selftests/rseq/basic_percpu_ops_test
>  /home/michael/linux/tools/testing/selftests/rseq/param_test
>  make[1]: Leaving directory '/home/michael/linux/tools/testing/selftests/rseq'
>  make[1]: Entering directory
>  '/home/michael/linux/tools/testing/selftests/cpu-opv'
>  rm -f -r /home/michael/linux/tools/testing/selftests/cpu-opv/basic_cpu_opv_test
>  make[1]: Leaving directory '/home/michael/linux/tools/testing/selftests/cpu-opv'
>  ~/linux/tools/testing/selftests$ cd cpu-opv/
>  ~/linux/tools/testing/selftests/cpu-opv$ make -j
>  gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c
>  cpu-op.c cpu-op.h  -o
>  /home/michael/linux/tools/testing/selftests/cpu-opv/basic_cpu_opv_test
>  ~/linux/tools/testing/selftests/cpu-opv$ make clean
>  rm -f -r /home/michael/linux/tools/testing/selftests/cpu-opv/basic_cpu_opv_test
>  ~/linux/tools/testing/selftests/cpu-opv$ cd ../rseq/
>  ~/linux/tools/testing/selftests/rseq$ make -j
>  gcc -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/    basic_test.c
>  rseq.h rseq-ppc.h rseq-x86.h rseq-arm.h rseq.c ../cpu-opv/cpu-op.c
>  ../cpu-opv/cpu-op.h -lpthread -o
>  /home/michael/linux/tools/testing/selftests/rseq/basic_test
>  gcc -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/
>  basic_percpu_ops_test.c rseq.h rseq-ppc.h rseq-x86.h rseq-arm.h rseq.c
>  ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h -lpthread -o
>  /home/michael/linux/tools/testing/selftests/rseq/basic_percpu_ops_test
>  gcc -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/    param_test.c
>  rseq.h rseq-ppc.h rseq-x86.h rseq-arm.h rseq.c ../cpu-opv/cpu-op.c
>  ../cpu-opv/cpu-op.h -lpthread -o
>  /home/michael/linux/tools/testing/selftests/rseq/param_test
>  ~/linux/tools/testing/selftests/rseq$ make clean
>  rm -f -r /home/michael/linux/tools/testing/selftests/rseq/basic_test
>  /home/michael/linux/tools/testing/selftests/rseq/basic_percpu_ops_test
>  /home/michael/linux/tools/testing/selftests/rseq/param_test
> 
> 
> 
> cheers
> 
> diff --git a/tools/testing/selftests/cpu-opv/Makefile
> b/tools/testing/selftests/cpu-opv/Makefile
> index 81d0596824ee..d27bd0f74deb 100644
> --- a/tools/testing/selftests/cpu-opv/Makefile
> +++ b/tools/testing/selftests/cpu-opv/Makefile
> @@ -1,13 +1,7 @@
> CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/
> -LDFLAGS += -lpthread
> 
> -TESTS = basic_cpu_opv_test
> -
> -all: $(TESTS)
> -%: %.c cpu-op.c cpu-op.h
> -       $(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
> +TEST_GEN_PROGS = basic_cpu_opv_test
> 
> include ../lib.mk
> 
> -clean:
> -       $(RM) $(TESTS)
> +$(TEST_GEN_PROGS): cpu-op.c cpu-op.h
> diff --git a/tools/testing/selftests/rseq/Makefile
> b/tools/testing/selftests/rseq/Makefile
> index 7f0153556b80..7f625147b7fe 100644
> --- a/tools/testing/selftests/rseq/Makefile
> +++ b/tools/testing/selftests/rseq/Makefile
> @@ -1,13 +1,8 @@
> CFLAGS += -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/
> -LDFLAGS += -lpthread
> +LDLIBS += -lpthread
> 
> -TESTS = basic_test basic_percpu_ops_test param_test
> -
> -all: $(TESTS)
> -%: %.c rseq.h rseq-*.h rseq.c ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h
> -       $(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
> +TEST_GEN_PROGS = basic_test basic_percpu_ops_test param_test
> 
> include ../lib.mk
> 
> -clean:
> -       $(RM) $(TESTS)
> +$(TEST_GEN_PROGS): rseq.h rseq-*.h rseq.c ../cpu-opv/cpu-op.c
> ../cpu-opv/cpu-op.h

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <1618170495.42951.1508248216596.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>]

* Re: [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests
       [not found]                 ` <1618170495.42951.1508248216596.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-10-18  5:45                   ` Michael Ellerman
  0 siblings, 0 replies; 61+ messages in thread
From: Michael Ellerman @ 2017-10-18  5:45 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, rostedt, Linus Torvalds,
	Andrew Morton

Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> writes:

> ----- On Oct 17, 2017, at 6:36 AM, Michael Ellerman mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org wrote:
>
>> Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> writes:
>> 
>>> Hi Michael,
>>>
>>> With your changes integrated, both rseq and cpu-opv selftests fail to
>>> build if I pass e.g. -j32 to make.
>>>
>>> cd tools/testing/selftests/cpu-opv
>>>
>>> efficios@compudjdev:~/git/linux-percpu-dev/tools/testing/selftests/cpu-opv$ make
>>> clean; make
>>> rm -f -r
>>> /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test
>>> gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c
>>> cpu-op.c cpu-op.h   -o basic_cpu_opv_test
>>>
>>> efficios@compudjdev:~/git/linux-percpu-dev/tools/testing/selftests/cpu-opv$ make
>>> clean; make -j32
>>> rm -f -r
>>> /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test
>>> gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c
>>> cpu-op.c cpu-op.h   -o basic_cpu_opv_test
>>> gcc -O2 -Wall -g -I./ -I../../../../usr/include/    basic_cpu_opv_test.c  -o
>>> /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test
>>> /tmp/ccDthnqM.o: In function `test_memcpy_op':
>>> /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:364:
>>> undefined reference to `cpu_op_get_current_cpu'
>>> /home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c:365:
>>> undefined reference to `cpu_opv'
>> ...
>>> make: ***
>>> [/home/efficios/git/linux-percpu-dev/tools/testing/selftests/cpu-opv/basic_cpu_opv_test]
>>> Error 1
>>> make: *** Waiting for unfinished jobs....
>>>
>>> Any idea what is going on here ?
>> 
>> Ugh sorry, yes.
>> 
>> New patch below should fix it. Tested with -j:
>
> Perfect, folding it into my series.

Thanks.

> I see that the "all" target was redundant here.

Yeah.

Unfortunately lib.mk has grown a bit more complicated than I would like,
including defining the all target.

It does make for nice short Makefiles for individual test directories,
but it's not all that easy to use if you're not familiar with it [or
even if you are, as I demostrated here :)]

cheers

^ permalink raw reply	[flat|nested] 61+ messages in thread

[parent not found: <20171012230326.19984-15-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>]

* Re: [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests
       [not found]   ` <20171012230326.19984-15-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2017-10-16  3:00     ` Michael Ellerman
  2017-10-16  3:48       ` Boqun Feng
  0 siblings, 1 reply; 61+ messages in thread
From: Michael Ellerman @ 2017-10-16  3:00 UTC (permalink / raw)
  To: Paul E. McKenney, Boqun Feng, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Mathieu Desnoyers,
	Russell King, Catalin Marinas, Thomas Gleixner, Andi Kleen,
	Chris Lameter, Ingo Molnar, H. Peter Anvin, Ben Maurer,
	Steven Rostedt, Linus Torvalds, Andrew Morton, Shuah Khan,
	linux-kselftest-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> writes:

> Implements two basic tests of RSEQ functionality, and one more
> exhaustive parameterizable test.
>
> The first, "basic_test" only asserts that RSEQ works moderately
> correctly.
> E.g. that:
> - The CPUID pointer works
> - Code infinitely looping within a critical section will eventually be
>   interrupted.
> - Critical sections are interrupted by signals.
>
> "basic_percpu_ops_test" is a slightly more "realistic" variant,
> implementing a few simple per-cpu operations and testing their
> correctness.
>
> "param_test" is a parametrizable restartable sequences test. See
> the "--help" output for usage.
>
> As part of those tests, a helper library "rseq" implements a user-space
> API around restartable sequences. It uses the cpu_opv system call as
> fallback when single-stepped by a debugger. It exposes the instruction
> pointer addresses where the rseq assembly blocks begin and end, as well
> as the associated abort instruction pointer, in the __rseq_table
> section. This section allows debuggers may know where to place
> breakpoints when single-stepping through assembly blocks which may be
> aborted at any point by the kernel.
>
> The following rseq APIs are implemented in this helper library:
> - rseq_register_current_thread()/rseq_unregister_current_thread():
>     register/unregister current thread's use of rseq,
> - rseq_current_cpu_raw():
>     current CPU number,
> - rseq_start():
>     beginning of a restartable sequence,
> - rseq_cpu_at_start():
>     CPU number at start of restartable sequence,
> - rseq_finish():
>     End of restartable sequence made of zero or more loads, completed by
>     a word-sized store,
> - rseq_finish2():
>     End of restartable sequence made of zero or more loads, one
>     speculative word-sized store, completed by a word-sized store,
> - rseq_finish2_release():
>     End of restartable sequence made of zero or more loads, one
>     speculative word-sized store, completed by a word-sized store with
>     release semantic,
> - rseq_finish_memcpy():
>     End of restartable sequence made of zero or more loads, a
>     speculative copy of a variable length memory region, completed by a
>     word-sized store.
> - rseq_finish_memcpy_release():
>     End of restartable sequence made of zero or more loads, a
>     speculative copy of a variable length memory region, completed by a
>     word-sized store with release semantic.
>
> PowerPC tests have been implemented by Boqun Feng.

Hi Boqun,

I'm having trouble testing these, I get:

  ~/linus/tools/testing/selftests/cpu-opv$ ./basic_cpu_opv_test
  Testing test_compare_eq same
  Testing test_compare_eq different
  Testing test_compare_ne same
  Testing test_compare_ne different
  Testing test_2compare_eq index
  Testing test_2compare_ne index
  Testing test_memcpy
  Testing test_memcpy_u32
  Testing test_add
  Testing test_two_add
  Testing test_or
  Testing test_and
  Testing test_xor
  Testing test_lshift
  Testing test_rshift
  Testing test_cmpxchg success
  Testing test_cmpxchg fail
  
  ~/linus/tools/testing/selftests/rseq$ ./basic_test
  testing current cpu
  testing critical section
  testing critical section is interrupted by signal

  ~/linus/tools/testing/selftests/rseq$ ./basic_percpu_ops_test
  ./basic_percpu_ops_test: error while loading shared libraries: R_PPC64_ADDR16_HI re10d8f10a0 for symbol `' out of range
  ~/linus/tools/testing/selftests/rseq$ ./param_test
  ./param_test: error while loading shared libraries: R_PPC64_ADDR16_HI re136251b48 for symbol `' out of range


Any idea what's going on with the last two? I assume you don't see that
in your test setup :)

cheers

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests
  2017-10-16  3:00     ` Michael Ellerman
@ 2017-10-16  3:48       ` Boqun Feng
  2017-10-16 11:48         ` Michael Ellerman
  0 siblings, 1 reply; 61+ messages in thread
From: Boqun Feng @ 2017-10-16  3:48 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Mathieu Desnoyers, Paul E. McKenney, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, Steven Rostedt, Linus Torvalds

On Mon, Oct 16, 2017 at 02:00:33PM +1100, Michael Ellerman wrote:
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> 
> > Implements two basic tests of RSEQ functionality, and one more
> > exhaustive parameterizable test.
> >
> > The first, "basic_test" only asserts that RSEQ works moderately
> > correctly.
> > E.g. that:
> > - The CPUID pointer works
> > - Code infinitely looping within a critical section will eventually be
> >   interrupted.
> > - Critical sections are interrupted by signals.
> >
> > "basic_percpu_ops_test" is a slightly more "realistic" variant,
> > implementing a few simple per-cpu operations and testing their
> > correctness.
> >
> > "param_test" is a parametrizable restartable sequences test. See
> > the "--help" output for usage.
> >
> > As part of those tests, a helper library "rseq" implements a user-space
> > API around restartable sequences. It uses the cpu_opv system call as
> > fallback when single-stepped by a debugger. It exposes the instruction
> > pointer addresses where the rseq assembly blocks begin and end, as well
> > as the associated abort instruction pointer, in the __rseq_table
> > section. This section allows debuggers may know where to place
> > breakpoints when single-stepping through assembly blocks which may be
> > aborted at any point by the kernel.
> >
> > The following rseq APIs are implemented in this helper library:
> > - rseq_register_current_thread()/rseq_unregister_current_thread():
> >     register/unregister current thread's use of rseq,
> > - rseq_current_cpu_raw():
> >     current CPU number,
> > - rseq_start():
> >     beginning of a restartable sequence,
> > - rseq_cpu_at_start():
> >     CPU number at start of restartable sequence,
> > - rseq_finish():
> >     End of restartable sequence made of zero or more loads, completed by
> >     a word-sized store,
> > - rseq_finish2():
> >     End of restartable sequence made of zero or more loads, one
> >     speculative word-sized store, completed by a word-sized store,
> > - rseq_finish2_release():
> >     End of restartable sequence made of zero or more loads, one
> >     speculative word-sized store, completed by a word-sized store with
> >     release semantic,
> > - rseq_finish_memcpy():
> >     End of restartable sequence made of zero or more loads, a
> >     speculative copy of a variable length memory region, completed by a
> >     word-sized store.
> > - rseq_finish_memcpy_release():
> >     End of restartable sequence made of zero or more loads, a
> >     speculative copy of a variable length memory region, completed by a
> >     word-sized store with release semantic.
> >
> > PowerPC tests have been implemented by Boqun Feng.
> 
> Hi Boqun,
> 

Hello Michael,

> I'm having trouble testing these, I get:
> 
>   ~/linus/tools/testing/selftests/cpu-opv$ ./basic_cpu_opv_test
>   Testing test_compare_eq same
>   Testing test_compare_eq different
>   Testing test_compare_ne same
>   Testing test_compare_ne different
>   Testing test_2compare_eq index
>   Testing test_2compare_ne index
>   Testing test_memcpy
>   Testing test_memcpy_u32
>   Testing test_add
>   Testing test_two_add
>   Testing test_or
>   Testing test_and
>   Testing test_xor
>   Testing test_lshift
>   Testing test_rshift
>   Testing test_cmpxchg success
>   Testing test_cmpxchg fail
>   
>   ~/linus/tools/testing/selftests/rseq$ ./basic_test
>   testing current cpu
>   testing critical section
>   testing critical section is interrupted by signal
> 
>   ~/linus/tools/testing/selftests/rseq$ ./basic_percpu_ops_test
>   ./basic_percpu_ops_test: error while loading shared libraries: R_PPC64_ADDR16_HI re10d8f10a0 for symbol `' out of range
>   ~/linus/tools/testing/selftests/rseq$ ./param_test
>   ./param_test: error while loading shared libraries: R_PPC64_ADDR16_HI re136251b48 for symbol `' out of range
> 

I think this one is due to the same reason as:

	7998eb3dc700 ("powerpc: Fix 64 bit builds with binutils 2.24")

I have made the fix before, but seems forgot to send it to Mathieu...

so would this help?
	
diff --git a/tools/testing/selftests/rseq/rseq-ppc.h b/tools/testing/selftests/rseq/rseq-ppc.h
index bc78b4fd72b1..39cbabe89b0e 100644
--- a/tools/testing/selftests/rseq/rseq-ppc.h
+++ b/tools/testing/selftests/rseq/rseq-ppc.h
@@ -74,7 +74,7 @@ do {									\
 		"lis %%r17, (3b)@highest\n\t" \
 		"ori %%r17, %%r17, (3b)@higher\n\t" \
 		"rldicr %%r17, %%r17, 32, 31\n\t" \
-		"oris %%r17, %%r17, (3b)@h\n\t" \
+		"oris %%r17, %%r17, (3b)@high\n\t" \
 		"ori %%r17, %%r17, (3b)@l\n\t" \
 		"std %%r17, 0(%[rseq_cs])\n\t" \
 		RSEQ_INJECT_ASM(2) \

Regards,
Boqun

> 
> Any idea what's going on with the last two? I assume you don't see that
> in your test setup :)
> 
> cheers

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests
  2017-10-16  3:48       ` Boqun Feng
@ 2017-10-16 11:48         ` Michael Ellerman
  0 siblings, 0 replies; 61+ messages in thread
From: Michael Ellerman @ 2017-10-16 11:48 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Mathieu Desnoyers, Paul E. McKenney, Peter Zijlstra, Paul Turner,
	Andrew Hunter, Andy Lutomirski, Dave Watson, Josh Triplett,
	Will Deacon, linux-kernel, Russell King, Catalin Marinas,
	Thomas Gleixner, Andi Kleen, Chris Lameter, Ingo Molnar,
	H. Peter Anvin, Ben Maurer, Steven Rostedt, Linus Torvalds

Boqun Feng <boqun.feng@gmail.com> writes:
> On Mon, Oct 16, 2017 at 02:00:33PM +1100, Michael Ellerman wrote:
>
>> I'm having trouble testing these, I get:
...
>> 
>>   ~/linus/tools/testing/selftests/rseq$ ./basic_percpu_ops_test
>>   ./basic_percpu_ops_test: error while loading shared libraries: R_PPC64_ADDR16_HI re10d8f10a0 for symbol `' out of range
>>   ~/linus/tools/testing/selftests/rseq$ ./param_test
>>   ./param_test: error while loading shared libraries: R_PPC64_ADDR16_HI re136251b48 for symbol `' out of range
>> 
>
> I think this one is due to the same reason as:
>
> 	7998eb3dc700 ("powerpc: Fix 64 bit builds with binutils 2.24")
>
> I have made the fix before, but seems forgot to send it to Mathieu...
>
> so would this help?
> 	
> diff --git a/tools/testing/selftests/rseq/rseq-ppc.h b/tools/testing/selftests/rseq/rseq-ppc.h
> index bc78b4fd72b1..39cbabe89b0e 100644
> --- a/tools/testing/selftests/rseq/rseq-ppc.h
> +++ b/tools/testing/selftests/rseq/rseq-ppc.h
> @@ -74,7 +74,7 @@ do {									\
>  		"lis %%r17, (3b)@highest\n\t" \
>  		"ori %%r17, %%r17, (3b)@higher\n\t" \
>  		"rldicr %%r17, %%r17, 32, 31\n\t" \
> -		"oris %%r17, %%r17, (3b)@h\n\t" \
> +		"oris %%r17, %%r17, (3b)@high\n\t" \
>  		"ori %%r17, %%r17, (3b)@l\n\t" \
>  		"std %%r17, 0(%[rseq_cs])\n\t" \
>  		RSEQ_INJECT_ASM(2) \

Yes, that fixes it thanks!

cheers

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2017-10-23 20:44 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20171012230326.19984-1-mathieu.desnoyers@efficios.com>
2017-10-12 23:03 ` [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call Mathieu Desnoyers
2017-10-13  0:36   ` Linus Torvalds
2017-10-13  9:35     ` Ben Maurer
     [not found]       ` <DM5PR15MB1690DA99E4AA74FBE54CF7F9CF480-kTBAvIqET4EjX1lkf7hTyId3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-13 18:30         ` Linus Torvalds
     [not found]           ` <CA+55aFzPBES0JOYuZhuNM7NKN+G9ytZQT2daueFPw0j9HGpdGQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-13 20:54             ` Paul E. McKenney
     [not found]               ` <20171013205418.GM3521-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2017-10-13 21:05                 ` Linus Torvalds
     [not found]                   ` <CA+55aFwvNS95ByZJTh1yG25QfaD0K0ZByK3iXeeRU8LafFiGFQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-13 21:21                     ` Paul E. McKenney
2017-10-13 21:36                     ` Mathieu Desnoyers
2017-10-16 16:04                       ` Carlos O'Donell
2017-10-16 16:46                         ` Andi Kleen
2017-10-16 22:17                           ` Mathieu Desnoyers
     [not found]                             ` <21865534.42661.1508192263844.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-10-17 16:19                               ` Ben Maurer
     [not found]                                 ` <CY4PR15MB168879D6220D976B04FE482CCF4C0-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-17 16:33                                   ` Mathieu Desnoyers
     [not found]                                     ` <1292309161.43101.1508258000235.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-10-17 16:41                                       ` Ben Maurer
     [not found]                                         ` <CY4PR15MB16886FD43FB48592F3F5892FCF4C0-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-17 17:48                                           ` Mathieu Desnoyers
2017-10-18  6:22                                   ` Greg KH
     [not found]                                     ` <20171018062226.GB18857-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
2017-10-18 16:28                                       ` Mathieu Desnoyers
2017-10-14  3:01           ` Andi Kleen
2017-10-14  4:05             ` Linus Torvalds
2017-10-14 11:37               ` Mathieu Desnoyers
2017-10-13 12:50   ` Florian Weimer
2017-10-13 13:40     ` Mathieu Desnoyers
2017-10-13 13:56       ` Florian Weimer
2017-10-13 14:27         ` Mathieu Desnoyers
2017-10-13 17:24           ` Andy Lutomirski
     [not found]             ` <CALCETrXccCp8apoyUJV8kWLOavnFnenZoU-fbb6cOVZvWp-fnA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-13 17:53               ` Florian Weimer
     [not found]                 ` <3358e696-43e9-15d3-9634-68e9da79e121-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-10-13 18:17                   ` Andy Lutomirski
     [not found]                     ` <CALCETrVWZxC=mT9p7HTrAwcAdMzaxwa=A-O0uQt79qy1Cpky_g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-14 11:53                       ` Mathieu Desnoyers
2017-10-18 16:41   ` Ben Maurer
     [not found]     ` <CY4PR15MB1688286D6B1283A1C234BAE6CF4E0-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-18 18:11       ` Mathieu Desnoyers
     [not found]         ` <515879378.43966.1508350299712.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-10-19 11:35           ` Mathieu Desnoyers
2017-10-19 17:01             ` Florian Weimer
2017-10-23 17:30           ` Ben Maurer
     [not found]             ` <CY4PR15MB16888F91F41A4A1D322C102CCF460-ZVJ2su15u+xeX4ZvlgGe+Yd3EbNNOtPMvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-23 20:44               ` Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 02/14] tracing: instrument restartable sequences Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 03/14] Restartable sequences: ARM 32 architecture support Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 04/14] Restartable sequences: wire up ARM 32 system call Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 05/14] Restartable sequences: x86 32/64 architecture support Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 06/14] Restartable sequences: wire up x86 32/64 system call Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 09/14] Provide cpu_opv " Mathieu Desnoyers
     [not found]   ` <20171012230326.19984-10-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-10-13 13:57     ` Alan Cox
2017-10-13 14:50       ` Mathieu Desnoyers
     [not found]         ` <854849583.40647.1507906233368.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-10-14 14:22           ` Mathieu Desnoyers
2017-10-13 17:20     ` Andy Lutomirski
2017-10-14  2:50     ` Andi Kleen
     [not found]       ` <20171014025029.GL2482-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
2017-10-14 13:35         ` Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 10/14] cpu_opv: Wire up x86 32/64 " Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 12/14] cpu_opv: Wire up ARM32 " Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 13/14] cpu_opv: Implement selftests Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests Mathieu Desnoyers
2017-10-16  2:51   ` Michael Ellerman
2017-10-16 14:23     ` Mathieu Desnoyers
     [not found]       ` <399058130.42156.1508163782335.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-10-17 10:38         ` Michael Ellerman
2017-10-17 13:50           ` Mathieu Desnoyers
     [not found]     ` <871sm3n6sy.fsf-W0DJWXSxmBNbyGPkN3NxC2scP1bn1w/D@public.gmane.org>
2017-10-16 18:50       ` Mathieu Desnoyers
     [not found]         ` <1998166049.42520.1508179805908.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-10-17 10:36           ` Michael Ellerman
     [not found]             ` <87d15mjc1g.fsf-W0DJWXSxmBNbyGPkN3NxC2scP1bn1w/D@public.gmane.org>
2017-10-17 13:50               ` Mathieu Desnoyers
     [not found]                 ` <1618170495.42951.1508248216596.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-10-18  5:45                   ` Michael Ellerman
     [not found]   ` <20171012230326.19984-15-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2017-10-16  3:00     ` Michael Ellerman
2017-10-16  3:48       ` Boqun Feng
2017-10-16 11:48         ` Michael Ellerman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).