public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections
@ 2015-06-24 22:26 Paul Turner
  2015-06-24 22:26 ` [RFC PATCH 3/3] restartable sequences: basic user-space self-tests Paul Turner
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Paul Turner @ 2015-06-24 22:26 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney, Mathieu Desnoyers
  Cc: Andrew Hunter, Andi Kleen, Lai Jiangshan, linux-api, linux-kernel,
	Steven Rostedt, Josh Triplett, Ingo Molnar, Andrew Morton,
	Andy Lutomirski, Linus Torvalds, Chris Lameter

This is a fairly small series demonstrating a feature we've found to be quite
powerful in practice, "restartable sequences".

Most simply: these sequences comprise small snippets of user-code that are
guaranteed to be (effectively) executed serially, with support for restart (or
other handling) in the event of preemption or interruption.

This (when combined with an in-memory cpu-id; potentially optional on some
architectures) can be used to implement extremely fast per-cpu operations that
do not rely on the use of actual processor atomics.  We've used this to back
performance critical code such as our malloc implementation with good results.

We previously discussed this feature at LPC:
  http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Mathieu Desnoyers posted an alternate implementation based on the ideas above
at:
  https://lkml.org/lkml/2015/5/21/472

This RFC is based on the approach we currently use internally.  However, I
will likely argue that neither this approach, nor the one posted above, is
what we should ultimately adopt (provided sufficient interest exists).

The implementation in this series can be summarized as follows:
- We allow a single critical section (and associated handler) to be defined
  per address space.
- When a thread with RSEQ configured (via the new syscall) is interrupted
  within a critical section, we modify its return address, either during
  signal delivery or in the notify-resume path.  The scheduler touchpoint is
  only a preempt_notifier which (potentially, depending on config) examines
  the kernel copy of pt_regs.

There are a few core requirements which motivate the approach taken here:
1) We must not interfere with regular scheduling.  Unlike preemption protection
   (or similar), there are no special considerations for code running within a
   critical region beyond that we must move to the restart handler if
   interrupted.
2) The code executing in scheduler context must be tightly constrained, both in
   terms of overhead and that it must not require access to user memory.
3) The fast-path must be fast.  That is, user entry/execution/exit of a
   non-interrupted critical section is the most important case.  The restart
   handler is a 'slow' path that should only happen comparatively infrequently.
   We're strongly motivated here by high-performance, low-level primitives:
   think malloc, or rcu_read_lock.

While the contained implementation performs well under these constraints, it
has some notable limitations which we should consider for more general usage:

1) The definition of a single region works well for statically linked
   binaries but can be challenging when shared libraries want to use this
   feature.  In our experience this is partially mitigated by the fact that
   a library of primitives is generally more useful than a myriad of custom
   sequences, but it remains a major concern.
2) Due to the nature of restart and the explicit location requirements, it's
   only really reasonable to write such a critical section in assembly, which
   makes porting and maintenance less convenient.  (One of the included tests
   shows a reasonable example of what this looks like.)
3) TLS spreads the entrails of its implementation all over compile _and_ link.
   This makes properly handling it within the critical section cumbersome in
   the shared binary case.

We've been thinking about how to address these issues and are considering some
alternate ABIs that still satisfy (1)-(3), but have a more convenient
programming model.  I'll send this as a follow-up, but wanted to first share
the approach we've used to date.

Thanks,

- Paul




* [RFC PATCH 0/3] restartable sequences benchmarks
@ 2015-10-22 18:06 Dave Watson
  2015-10-22 18:06 ` [RFC PATCH 2/3] restartable sequences: x86 ABI Dave Watson
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Watson @ 2015-10-22 18:06 UTC (permalink / raw)
  To: davejwatson, kernel-team, linux-kernel, linux-api, pjt,
	mathieu.desnoyers

We've been testing out restartable sequences + malloc changes for use
at Facebook.  Below are some test results, as well as some possible
changes based on Paul Turner's original patches:

https://lkml.org/lkml/2015/6/24/665

I ran one service with several permutations of various mallocs.  The
service is CPU-bound, and hits the allocator quite hard.  Requests/s
are held constant at the source, so we use cpu idle time and latency
as an indicator of service quality. These are average numbers over
several hours.  Machines were dual E5-2660, total 16 cores +
hyperthreading.  This service has ~400 total threads, 70-90 of which
are doing work at any particular time.

                                   RSS CPUIDLE LATENCYMS
jemalloc 4.0.0                     31G   33%     390
jemalloc + this patch              25G   33%     390
jemalloc + this patch using lsl    25G   30%     420
jemalloc + PT's rseq patch         25G   32%     405
glibc malloc 2.20                  27G   30%     420
tcmalloc gperftools trunk (2.2)    21G   30%     480

jemalloc rseq patch used for testing:
https://github.com/djwatson/jemalloc

lsl test - using the lsl segment limit to get the cpu (i.e. an inlined
vdso getcpu on x86) instead of the thread caching used in this patch.
There have been some suggestions to add the thread-cached getcpu()
feature separately.  Having a thread-cached getcpu does seem to move
the needle by about ~3% in a real service.  I don't think we can use
restartable sequences in production without a faster getcpu.

GS-segment / migration only tests

There's been some interest in seeing if we can do this with only a gs
segment, so here are some numbers for that.  This doesn't have to be gs:
it could just as well be a migration signal sent to userspace; the same
approaches would apply.

GS patch: https://lkml.org/lkml/2014/9/13/59

                                   RSS CPUIDLE LATENCYMS
jemalloc 4.0.0                     31G   33%     390
jemalloc + percpu locking          25G   25%     420
jemalloc + preempt lock / signal   25G   32%     415

* Percpu locking - just lock everything percpu all the time.  If
  scheduled off during the critical section, other threads have to
  wait.

* 'Preempt lock' idea is that we grab a lock, but if we miss the lock,
  we send a signal to the offending thread (its tid is stored in the
  lock variable) to restart its critical section.  Libunwind was used
  to fix up instruction pointers in the signal handler, walking all the
  frames.  This is slower than the kernel preempt check, but happens
  less often: only if there was a preempt during the critical section.
  Critical sections were inlined using the same scheme as in this
  patch.  There is more overhead than restartable sequences in the hot
  path (an extra unlocked cmpxchg, some accounting).  Microbenchmarks
  showed it was 2x slower than rseq, but still faster than atomics.

  Roughly like this: https://gist.github.com/djwatson/9c268681a0dfa797990c

* I also tried a percpu version of stm (software transactional
  memory), but could never write anything better than ~3x slower than
  atomics in a microbenchmark.  I didn't test this in a real service.

Attached are two changes to the original patch:

1) Support more than one critical memory range in the kernel using
   binary search.  This has several advantages:

  * We don't need an extra register ABI to support multiplexing them
    in userspace.  This also avoids some complexity in knowing which
    registers/flags might be smashed by a restart.

  * There are no collisions between shared libraries

  * They can be inlined with gcc inline asm.  With optimization on,
    gcc correctly inlines and registers many more regions.  In a real
    service this does seem to improve latency a hair.  A
    microbenchmark shows ~20% faster.

Downsides: less control over how we search/jump to the regions, but I
didn't notice any difference in testing a reasonable number of regions
(fewer than 100).  We could set a maximum limit?

2) Additional checks in ptrace to single step over critical sections.
   We also prevent setting breakpoints, as these also seem to confuse
   gdb sometimes.

Dave Watson (3):
  restartable sequences: user-space per-cpu critical sections
  restartable sequences: x86 ABI
  restartable sequences: basic user-space self-tests

 arch/Kconfig                                       |   7 +
 arch/x86/Kconfig                                   |   1 +
 arch/x86/entry/common.c                            |   3 +
 arch/x86/entry/syscalls/syscall_64.tbl             |   1 +
 arch/x86/include/asm/restartable_sequences.h       |  44 +++
 arch/x86/kernel/Makefile                           |   2 +
 arch/x86/kernel/ptrace.c                           |   6 +-
 arch/x86/kernel/restartable_sequences.c            |  47 +++
 arch/x86/kernel/signal.c                           |  12 +-
 fs/exec.c                                          |   3 +-
 include/linux/sched.h                              |  39 +++
 include/uapi/asm-generic/unistd.h                  |   4 +-
 init/Kconfig                                       |   9 +
 kernel/Makefile                                    |   2 +-
 kernel/fork.c                                      |   1 +
 kernel/ptrace.c                                    |  15 +-
 kernel/restartable_sequences.c                     | 255 ++++++++++++++++
 kernel/sched/core.c                                |   5 +
 kernel/sched/sched.h                               |   3 +
 kernel/sys_ni.c                                    |   3 +
 tools/testing/selftests/rseq/Makefile              |  14 +
 .../testing/selftests/rseq/basic_percpu_ops_test.c | 331 +++++++++++++++++++++
 tools/testing/selftests/rseq/rseq.c                |  48 +++
 tools/testing/selftests/rseq/rseq.h                |  17 ++
 24 files changed, 862 insertions(+), 10 deletions(-)
 create mode 100644 arch/x86/include/asm/restartable_sequences.h
 create mode 100644 arch/x86/kernel/restartable_sequences.c
 create mode 100644 kernel/restartable_sequences.c
 create mode 100644 tools/testing/selftests/rseq/Makefile
 create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.c
 create mode 100644 tools/testing/selftests/rseq/rseq.c
 create mode 100644 tools/testing/selftests/rseq/rseq.h

-- 
2.4.6



end of thread, other threads:[~2015-10-22 18:07 UTC | newest]

Thread overview: 15+ messages
2015-06-24 22:26 [RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections Paul Turner
2015-06-24 22:26 ` [RFC PATCH 3/3] restartable sequences: basic user-space self-tests Paul Turner
2015-06-24 22:26 ` [RFC PATCH 1/3] restartable sequences: user-space per-cpu critical sections Paul Turner
2015-06-24 22:26 ` [RFC PATCH 2/3] restartable sequences: x86 ABI Paul Turner
2015-06-26 18:09   ` Mathieu Desnoyers
2015-06-26 19:04     ` Mathieu Desnoyers
2015-06-26 19:31     ` Andy Lutomirski
2015-06-27  1:33       ` Paul Turner
2015-06-25  0:07 ` [RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections Andy Lutomirski
2015-06-25  2:54   ` Paul Turner
2015-06-26  1:15     ` Mathieu Desnoyers
2015-06-26  2:05       ` Paul Turner
2015-06-27 16:25         ` Andy Lutomirski
2015-06-28 16:11           ` Mathieu Desnoyers
  -- strict thread matches above, loose matches on Subject: below --
2015-10-22 18:06 [RFC PATCH 0/3] restartable sequences benchmarks Dave Watson
2015-10-22 18:06 ` [RFC PATCH 2/3] restartable sequences: x86 ABI Dave Watson
