Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections

linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
       [not found] ` <1432219487-13364-1-git-send-email-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2015-05-22 20:26   ` Michael Kerrisk
       [not found]     ` <CAHO5Pa0Kok4_QN0v3JNWyzGT=GbZNZcRyLhu02R2npV9hSdt7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Kerrisk @ 2015-05-22 20:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul Turner, Andrew Hunter, Ben Maurer, Linux Kernel,
	Peter Zijlstra, Ingo Molnar, Steven Rostedt, Paul E. McKenney,
	Josh Triplett, Lai Jiangshan, Linus Torvalds, Andrew Morton,
	Linux API

[CC += linux-api@]

On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers
<mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> Expose a new system call allowing userspace threads to register
> a TLS area used as an ABI between the kernel and userspace to
> share information required to create efficient per-cpu critical
> sections in user-space.
>
> This ABI consists of a thread-local structure containing:
>
> - a nesting count surrounding the critical section,
> - a signal number to be sent to the thread when preempting a thread
>   with non-zero nesting count,
> - a flag indicating whether the signal has been sent within the
>   critical section,
> - an integer where to store the current CPU number, updated whenever
>   the thread is preempted. This CPU number cache is not strictly
>   needed, but performs better than getcpu vdso.
>
> This approach is inspired by Paul Turner and Andrew Hunter's work
> on percpu atomics, which lets the kernel handle restart of critical
> sections, ref. http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
>
> What is done differently here compared to percpu atomics: we track
> a single nesting counter per thread rather than many ranges of
> instruction pointer values. We deliver a signal to user-space and
> let the logic of restart be handled in user-space, thus moving
> the complexity out of the kernel. The nesting counter approach
> allows us to skip the complexity of interacting with signals that
> would be otherwise needed with the percpu atomics approach, which
> needs to know which instruction pointers are preempted, including
> when preemption occurs on a signal handler nested over an instruction
> pointer of interest.
>
> Advantages of this approach over percpu atomics:
> - kernel code is relatively simple: complexity of restart sections
>   is in user-space,
> - easy to port to other architectures: just need to reserve a new
>   system call,
> - for threads which have registered a TLS structure, the fast-path
>   at preemption is only a nesting counter check, along with the
>   optional store of the current CPU number, rather than comparing
>   instruction pointer with possibly many registered ranges,
>
> Caveats of this approach compared to the percpu atomics:
> - We need a signal number for this, so it cannot be done without
>   designing the application accordingly,
> - Handling restart in user-space is currently performed with page
>   protection, for which we install a SIGSEGV signal handler. Again,
>   this requires designing the application accordingly, especially
>   if the application installs its own segmentation fault handler,
> - It cannot be used for tracing of processes by injection of code
>   into their address space, due to interactions with application
>   signal handlers.
>
> The user-space proof of concept code implementing the restart section
> can be found here: https://github.com/compudj/percpu-dev
>
> Benchmarking sched_getcpu() vs tls cache approach. Getting the
> current CPU number:
>
> - With Linux vdso:            12.7 ns
> - With TLS-cached cpu number:  0.3 ns
>
> We will use the TLS-cached cpu number for the following
> benchmarks.
>
> On an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, comparison
> with a baseline running very few load/stores (no locking,
> no getcpu, assuming one thread per CPU with affinity),
> against locking scheme based on "lock; cmpxchg", "cmpxchg"
> (using restart signal), load-store (using restart signal).
> This is performed with 32 threads on a 16-core, hyperthread
> system:
>
>                  ns/loop      overhead (ns)
> Baseline:          3.7           0.0
> lock; cmpxchg:    22.0          18.3
> cmpxchg:          11.1           7.4
> load-store:        9.4           5.7
>
> Therefore, the load-store scheme has a speedup of 3.2x over the
> "lock; cmpxchg" scheme if both are using the tls-cache for the
> CPU number. If we use Linux sched_getcpu() for "lock; cmpxchg"
> we reach of speedup of 5.4x for load-store+tls-cache vs
> "lock; cmpxchg"+vdso-getcpu.
>
> I'm sending this out to trigger discussion, and hopefully to see
> Paul and Andrew's patches being posted publicly at some point, so
> we can compare our approaches.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
> CC: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> CC: Andrew Hunter <ahh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> CC: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
> CC: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> CC: Ben Maurer <bmaurer-b10kYP2dOMg@public.gmane.org>
> CC: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
> CC: "Paul E. McKenney" <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> CC: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
> CC: Lai Jiangshan <laijs-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> CC: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> ---
>  arch/x86/syscalls/syscall_64.tbl  |   1 +
>  fs/exec.c                         |   1 +
>  include/linux/sched.h             |  18 ++++++
>  include/uapi/asm-generic/unistd.h |   4 +-
>  init/Kconfig                      |  10 +++
>  kernel/Makefile                   |   1 +
>  kernel/fork.c                     |   2 +
>  kernel/percpu-user.c              | 126 ++++++++++++++++++++++++++++++++++++++
>  kernel/sys_ni.c                   |   3 +
>  9 files changed, 165 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/percpu-user.c
>
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index 8d656fb..0499703 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -329,6 +329,7 @@
>  320    common  kexec_file_load         sys_kexec_file_load
>  321    common  bpf                     sys_bpf
>  322    64      execveat                stub_execveat
> +323    common  percpu                  sys_percpu
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/exec.c b/fs/exec.c
> index c7f9b73..0a2f0b2 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1555,6 +1555,7 @@ static int do_execveat_common(int fd, struct filename *filename,
>         /* execve succeeded */
>         current->fs->in_exec = 0;
>         current->in_execve = 0;
> +       percpu_user_execve(current);
>         acct_update_integrals(current);
>         task_numa_free(current);
>         free_bprm(bprm);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a419b65..9c88bff 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1275,6 +1275,8 @@ enum perf_event_task_context {
>         perf_nr_task_contexts,
>  };
>
> +struct thread_percpu_user;
> +
>  struct task_struct {
>         volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */
>         void *stack;
> @@ -1710,6 +1712,10 @@ struct task_struct {
>  #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
>         unsigned long   task_state_change;
>  #endif
> +#ifdef CONFIG_PERCPU_USER
> +       struct preempt_notifier percpu_user_notifier;
> +       struct thread_percpu_user __user *percpu_user;
> +#endif
>  };
>
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> @@ -3090,4 +3096,16 @@ static inline unsigned long rlimit_max(unsigned int limit)
>         return task_rlimit_max(current, limit);
>  }
>
> +#ifdef CONFIG_PERCPU_USER
> +void percpu_user_fork(struct task_struct *t);
> +void percpu_user_execve(struct task_struct *t);
> +#else
> +static inline void percpu_user_fork(struct task_struct *t)
> +{
> +}
> +static inline void percpu_user_execve(struct task_struct *t)
> +{
> +}
> +#endif
> +
>  #endif
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index e016bd9..f4350d9 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
>  __SYSCALL(__NR_bpf, sys_bpf)
>  #define __NR_execveat 281
>  __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
> +#define __NR_percpu 282
> +__SYSCALL(__NR_percpu, sys_percpu)
>
>  #undef __NR_syscalls
> -#define __NR_syscalls 282
> +#define __NR_syscalls 283
>
>  /*
>   * All syscalls below here should go away really,
> diff --git a/init/Kconfig b/init/Kconfig
> index f5dbc6d..73c4070 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1559,6 +1559,16 @@ config PCI_QUIRKS
>           bugs/quirks. Disable this only if your target machine is
>           unaffected by PCI quirks.
>
> +config PERCPU_USER
> +       bool "Enable percpu() system call" if EXPERT
> +       default y
> +       select PREEMPT_NOTIFIERS
> +       help
> +         Enable the percpu() system call which provides a building block
> +         for fast per-cpu critical sections in user-space.
> +
> +         If unsure, say Y.
> +
>  config EMBEDDED
>         bool "Embedded system"
>         option allnoconfig_y
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 1408b33..76919a6 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -96,6 +96,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
>  obj-$(CONFIG_TORTURE_TEST) += torture.o
> +obj-$(CONFIG_PERCPU_USER) += percpu-user.o
>
>  $(obj)/configs.o: $(obj)/config_data.h
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index cf65139..63aaf5a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1549,6 +1549,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>         cgroup_post_fork(p);
>         if (clone_flags & CLONE_THREAD)
>                 threadgroup_change_end(current);
> +       if (!(clone_flags & CLONE_THREAD))
> +               percpu_user_fork(p);
>         perf_event_fork(p);
>
>         trace_task_newtask(p, clone_flags);
> diff --git a/kernel/percpu-user.c b/kernel/percpu-user.c
> new file mode 100644
> index 0000000..be3d439
> --- /dev/null
> +++ b/kernel/percpu-user.c
> @@ -0,0 +1,126 @@
> +/*
> + * Copyright (C) 2015 Mathieu Desnoyers <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
> + *
> + * percpu system call
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/preempt.h>
> +#include <linux/init.h>
> +#include <linux/sched.h>
> +#include <linux/uaccess.h>
> +#include <linux/syscalls.h>
> +
> +struct thread_percpu_user {
> +       int32_t nesting;
> +       int32_t signal_sent;
> +       int32_t signo;
> +       int32_t current_cpu;
> +};
> +
> +static void percpu_user_sched_in(struct preempt_notifier *notifier, int cpu)
> +{
> +       struct thread_percpu_user __user *tpu_user;
> +       struct thread_percpu_user tpu;
> +       struct task_struct *t = current;
> +
> +       tpu_user = t->percpu_user;
> +       if (tpu_user == NULL)
> +               return;
> +       if (unlikely(t->flags & PF_EXITING))
> +               return;
> +       /*
> +        * access_ok() of tpu_user has already been checked by sys_percpu().
> +        */
> +       if (__put_user(smp_processor_id(), &tpu_user->current_cpu)) {
> +               WARN_ON_ONCE(1);
> +               return;
> +       }
> +       if (__copy_from_user(&tpu, tpu_user, sizeof(tpu))) {
> +               WARN_ON_ONCE(1);
> +               return;
> +       }
> +       if (!tpu.nesting || tpu.signal_sent)
> +               return;
> +       if (do_send_sig_info(tpu.signo, SEND_SIG_PRIV, t, 0)) {
> +               WARN_ON_ONCE(1);
> +               return;
> +       }
> +       tpu.signal_sent = 1;
> +       if (__copy_to_user(tpu_user, &tpu, sizeof(tpu))) {
> +               WARN_ON_ONCE(1);
> +               return;
> +       }
> +}
> +
> +static void percpu_user_sched_out(struct preempt_notifier *notifier,
> +               struct task_struct *next)
> +{
> +}
> +
> +static struct preempt_ops percpu_user_ops = {
> +       .sched_in = percpu_user_sched_in,
> +       .sched_out = percpu_user_sched_out,
> +};
> +
> +/*
> + * If parent had a percpu-user preempt notifier, we need to setup our own.
> + */
> +void percpu_user_fork(struct task_struct *t)
> +{
> +       struct task_struct *parent = current;
> +
> +       if (!parent->percpu_user)
> +               return;
> +       preempt_notifier_init(&t->percpu_user_notifier, &percpu_user_ops);
> +       preempt_notifier_register(&t->percpu_user_notifier);
> +       t->percpu_user = parent->percpu_user;
> +}
> +
> +void percpu_user_execve(struct task_struct *t)
> +{
> +       if (!t->percpu_user)
> +               return;
> +       preempt_notifier_unregister(&t->percpu_user_notifier);
> +       t->percpu_user = NULL;
> +}
> +
> +/*
> + * sys_percpu - setup user-space per-cpu critical section for caller thread
> + */
> +SYSCALL_DEFINE1(percpu, struct thread_percpu_user __user *, tpu)
> +{
> +       struct task_struct *t = current;
> +
> +       if (tpu == NULL) {
> +               if (t->percpu_user)
> +                       preempt_notifier_unregister(&t->percpu_user_notifier);
> +               goto set_tpu;
> +       }
> +       if (!access_ok(VERIFY_WRITE, tpu, sizeof(struct thread_percpu_user)))
> +               return -EFAULT;
> +       preempt_disable();
> +       if (__put_user(smp_processor_id(), &tpu->current_cpu)) {
> +               WARN_ON_ONCE(1);
> +               preempt_enable();
> +               return -EFAULT;
> +       }
> +       preempt_enable();
> +       if (!current->percpu_user) {
> +               preempt_notifier_init(&t->percpu_user_notifier,
> +                               &percpu_user_ops);
> +               preempt_notifier_register(&t->percpu_user_notifier);
> +       }
> +set_tpu:
> +       current->percpu_user = tpu;
> +       return 0;
> +}
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 5adcb0a..16e2bc8 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -229,3 +229,6 @@ cond_syscall(sys_bpf);
>
>  /* execveat */
>  cond_syscall(sys_execveat);
> +
> +/* percpu userspace critical sections */
> +cond_syscall(sys_percpu);
> --
> 2.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 17+ messages in thread

[parent not found: <CAHO5Pa0Kok4_QN0v3JNWyzGT=GbZNZcRyLhu02R2npV9hSdt7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
       [not found]     ` <CAHO5Pa0Kok4_QN0v3JNWyzGT=GbZNZcRyLhu02R2npV9hSdt7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-22 20:53       ` Andy Lutomirski
  2015-05-22 21:34         ` Mathieu Desnoyers
       [not found]         ` <CALCETrUSBqHG3tbOq1yFz33v1_ckEgLNorgAxwLFi7MkjNcwLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 2 replies; 17+ messages in thread
From: Andy Lutomirski @ 2015-05-22 20:53 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Mathieu Desnoyers, Paul Turner, Andrew Hunter, Ben Maurer,
	Linux Kernel, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Paul E. McKenney, Josh Triplett, Lai Jiangshan, Linus Torvalds,
	Andrew Morton, Linux API

On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> [CC += linux-api@]
>
> On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers
> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>> Expose a new system call allowing userspace threads to register
>> a TLS area used as an ABI between the kernel and userspace to
>> share information required to create efficient per-cpu critical
>> sections in user-space.
>>
>> This ABI consists of a thread-local structure containing:
>>
>> - a nesting count surrounding the critical section,
>> - a signal number to be sent to the thread when preempting a thread
>>   with non-zero nesting count,
>> - a flag indicating whether the signal has been sent within the
>>   critical section,
>> - an integer where to store the current CPU number, updated whenever
>>   the thread is preempted. This CPU number cache is not strictly
>>   needed, but performs better than getcpu vdso.
>>
>> This approach is inspired by Paul Turner and Andrew Hunter's work
>> on percpu atomics, which lets the kernel handle restart of critical
>> sections, ref. http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
>>
>> What is done differently here compared to percpu atomics: we track
>> a single nesting counter per thread rather than many ranges of
>> instruction pointer values. We deliver a signal to user-space and
>> let the logic of restart be handled in user-space, thus moving
>> the complexity out of the kernel. The nesting counter approach
>> allows us to skip the complexity of interacting with signals that
>> would be otherwise needed with the percpu atomics approach, which
>> needs to know which instruction pointers are preempted, including
>> when preemption occurs on a signal handler nested over an instruction
>> pointer of interest.
>>

I talked about this kind of thing with PeterZ at LSF/MM, and I was
unable to convince myself that the kernel needs to help at all.  To do
this without kernel help, I want to relax the requirements slightly.
With true per-cpu atomic sections, you have a guarantee that you are
either really running on the same CPU for the entire duration of the
atomic section or you abort.  I propose a weaker primitive: you
acquire one of an array of locks (probably one per cpu), and you are
guaranteed that, if you don't abort, no one else acquires the same
lock while you hold it.  Here's how:

Create an array of user-managed locks, one per cpu.  Call them lock[i]
for 0 <= i < ncpus.

To acquire, look up your CPU number.  Then, atomically, check that
lock[cpu] isn't held and, if so, mark it held and record both your tid
and your lock acquisition count.  If you learn that the lock *was*
held after all, signal the holder (with kill or your favorite other
mechanism), telling it which lock acquisition count is being aborted.
Then atomically steal the lock, but only if the lock acquisition count
hasn't changed.

This has a few benefits over the in-kernel approach:

1. No kernel patch.

2. No unnecessary abort if you are preempted in favor of a thread that
doesn't content for your lock.

3. Greatly improved debuggability.

4. With long critical sections and heavy load, you can improve
performance by having several locks per cpu and choosing one at
random.

Is there a reason that a scheme like this doesn't work?

>> Benchmarking sched_getcpu() vs tls cache approach. Getting the
>> current CPU number:
>>
>> - With Linux vdso:            12.7 ns
>> - With TLS-cached cpu number:  0.3 ns

Slightly off-topic: try this again on a newer kernel.  The vdso should
have gotten a bit faster in 3.19 or 4.0 IIRC.

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
  2015-05-22 20:53       ` Andy Lutomirski
@ 2015-05-22 21:34         ` Mathieu Desnoyers
  2015-05-22 22:24           ` Andy Lutomirski
       [not found]         ` <CALCETrUSBqHG3tbOq1yFz33v1_ckEgLNorgAxwLFi7MkjNcwLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 17+ messages in thread
From: Mathieu Desnoyers @ 2015-05-22 21:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Michael Kerrisk, Paul Turner, Andrew Hunter, Ben Maurer,
	Linux Kernel, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Paul E. McKenney, Josh Triplett, Lai Jiangshan, Linus Torvalds,
	Andrew Morton, Linux API

----- Original Message -----
> On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk <mtk.manpages@gmail.com>
> wrote:
> > [CC += linux-api@]
> >
> > On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >> Expose a new system call allowing userspace threads to register
> >> a TLS area used as an ABI between the kernel and userspace to
> >> share information required to create efficient per-cpu critical
> >> sections in user-space.
> >>
> >> This ABI consists of a thread-local structure containing:
> >>
> >> - a nesting count surrounding the critical section,
> >> - a signal number to be sent to the thread when preempting a thread
> >>   with non-zero nesting count,
> >> - a flag indicating whether the signal has been sent within the
> >>   critical section,
> >> - an integer where to store the current CPU number, updated whenever
> >>   the thread is preempted. This CPU number cache is not strictly
> >>   needed, but performs better than getcpu vdso.
> >>
> >> This approach is inspired by Paul Turner and Andrew Hunter's work
> >> on percpu atomics, which lets the kernel handle restart of critical
> >> sections, ref.
> >> http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
> >>
> >> What is done differently here compared to percpu atomics: we track
> >> a single nesting counter per thread rather than many ranges of
> >> instruction pointer values. We deliver a signal to user-space and
> >> let the logic of restart be handled in user-space, thus moving
> >> the complexity out of the kernel. The nesting counter approach
> >> allows us to skip the complexity of interacting with signals that
> >> would be otherwise needed with the percpu atomics approach, which
> >> needs to know which instruction pointers are preempted, including
> >> when preemption occurs on a signal handler nested over an instruction
> >> pointer of interest.
> >>
> 
> I talked about this kind of thing with PeterZ at LSF/MM, and I was
> unable to convince myself that the kernel needs to help at all.  To do
> this without kernel help, I want to relax the requirements slightly.
> With true per-cpu atomic sections, you have a guarantee that you are
> either really running on the same CPU for the entire duration of the
> atomic section or you abort.  I propose a weaker primitive: you
> acquire one of an array of locks (probably one per cpu), and you are
> guaranteed that, if you don't abort, no one else acquires the same
> lock while you hold it.

In my proof of concept (https://github.com/compudj/percpu-dev) I
actually implement an array of per-cpu lock. The issue here boils
down to grabbing this per-cpu lock efficiently. Once the lock is taken,
the thread has exclusive access to that per-cpu lock, even if it
migrates.

>  Here's how:
> 
> Create an array of user-managed locks, one per cpu.  Call them lock[i]
> for 0 <= i < ncpus.
> 
> To acquire, look up your CPU number.  Then, atomically, check that
> lock[cpu] isn't held and, if so, mark it held and record both your tid
> and your lock acquisition count.  If you learn that the lock *was*
> held after all, signal the holder (with kill or your favorite other
> mechanism), telling it which lock acquisition count is being aborted.
> Then atomically steal the lock, but only if the lock acquisition count
> hasn't changed.
> 
> This has a few benefits over the in-kernel approach:
> 
> 1. No kernel patch.
> 
> 2. No unnecessary abort if you are preempted in favor of a thread that
> doesn't content for your lock.
> 
> 3. Greatly improved debuggability.
> 
> 4. With long critical sections and heavy load, you can improve
> performance by having several locks per cpu and choosing one at
> random.
> 
> Is there a reason that a scheme like this doesn't work?

What do you mean exactly by "atomically check that lock is not
held and, if so, mark it held" ? Do you imply using a lock-prefixed
atomic operation ?

The goal of this whole restart section approach is to allow grabbing
a lock (or doing other sequences of operations ending with a single
store) on per-cpu data without having to use slow lock-prefixed
atomic operations.

> 
> >> Benchmarking sched_getcpu() vs tls cache approach. Getting the
> >> current CPU number:
> >>
> >> - With Linux vdso:            12.7 ns
> >> - With TLS-cached cpu number:  0.3 ns
> 
> Slightly off-topic: try this again on a newer kernel.  The vdso should
> have gotten a bit faster in 3.19 or 4.0 IIRC.

Those benchmarks were done on Linux 4.0.

Thanks!

Mathieu

> 
> --Andy
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
  2015-05-22 21:34         ` Mathieu Desnoyers
@ 2015-05-22 22:24           ` Andy Lutomirski
       [not found]             ` <CALCETrUxp-dP-kaTy4prEdciM-=sTXjpqnMbkvk38g5BTEvX0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2015-05-22 22:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Michael Kerrisk, Paul Turner, Andrew Hunter, Ben Maurer,
	Linux Kernel, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Paul E. McKenney, Josh Triplett, Lai Jiangshan, Linus Torvalds,
	Andrew Morton, Linux API

On Fri, May 22, 2015 at 2:34 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> ----- Original Message -----
>> On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk <mtk.manpages@gmail.com>
>> wrote:
>> > [CC += linux-api@]
>> >
>> > On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers
>> > <mathieu.desnoyers@efficios.com> wrote:
>> >> Expose a new system call allowing userspace threads to register
>> >> a TLS area used as an ABI between the kernel and userspace to
>> >> share information required to create efficient per-cpu critical
>> >> sections in user-space.
>> >>
>> >> This ABI consists of a thread-local structure containing:
>> >>
>> >> - a nesting count surrounding the critical section,
>> >> - a signal number to be sent to the thread when preempting a thread
>> >>   with non-zero nesting count,
>> >> - a flag indicating whether the signal has been sent within the
>> >>   critical section,
>> >> - an integer where to store the current CPU number, updated whenever
>> >>   the thread is preempted. This CPU number cache is not strictly
>> >>   needed, but performs better than getcpu vdso.
>> >>
>> >> This approach is inspired by Paul Turner and Andrew Hunter's work
>> >> on percpu atomics, which lets the kernel handle restart of critical
>> >> sections, ref.
>> >> http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
>> >>
>> >> What is done differently here compared to percpu atomics: we track
>> >> a single nesting counter per thread rather than many ranges of
>> >> instruction pointer values. We deliver a signal to user-space and
>> >> let the logic of restart be handled in user-space, thus moving
>> >> the complexity out of the kernel. The nesting counter approach
>> >> allows us to skip the complexity of interacting with signals that
>> >> would be otherwise needed with the percpu atomics approach, which
>> >> needs to know which instruction pointers are preempted, including
>> >> when preemption occurs on a signal handler nested over an instruction
>> >> pointer of interest.
>> >>
>>
>> I talked about this kind of thing with PeterZ at LSF/MM, and I was
>> unable to convince myself that the kernel needs to help at all.  To do
>> this without kernel help, I want to relax the requirements slightly.
>> With true per-cpu atomic sections, you have a guarantee that you are
>> either really running on the same CPU for the entire duration of the
>> atomic section or you abort.  I propose a weaker primitive: you
>> acquire one of an array of locks (probably one per cpu), and you are
>> guaranteed that, if you don't abort, no one else acquires the same
>> lock while you hold it.
>
> In my proof of concept (https://github.com/compudj/percpu-dev) I
> actually implement an array of per-cpu lock. The issue here boils
> down to grabbing this per-cpu lock efficiently. Once the lock is taken,
> the thread has exclusive access to that per-cpu lock, even if it
> migrates.
>
>>  Here's how:
>>
>> Create an array of user-managed locks, one per cpu.  Call them lock[i]
>> for 0 <= i < ncpus.
>>
>> To acquire, look up your CPU number.  Then, atomically, check that
>> lock[cpu] isn't held and, if so, mark it held and record both your tid
>> and your lock acquisition count.  If you learn that the lock *was*
>> held after all, signal the holder (with kill or your favorite other
>> mechanism), telling it which lock acquisition count is being aborted.
>> Then atomically steal the lock, but only if the lock acquisition count
>> hasn't changed.
>>
>> This has a few benefits over the in-kernel approach:
>>
>> 1. No kernel patch.
>>
>> 2. No unnecessary abort if you are preempted in favor of a thread that
>> doesn't content for your lock.
>>
>> 3. Greatly improved debuggability.
>>
>> 4. With long critical sections and heavy load, you can improve
>> performance by having several locks per cpu and choosing one at
>> random.
>>
>> Is there a reason that a scheme like this doesn't work?
>
> What do you mean exactly by "atomically check that lock is not
> held and, if so, mark it held" ? Do you imply using a lock-prefixed
> atomic operation ?

Yes.

>
> The goal of this whole restart section approach is to allow grabbing
> a lock (or doing other sequences of operations ending with a single
> store) on per-cpu data without having to use slow lock-prefixed
> atomic operations.

Ah, ok, I assumed it was to allow multiple threads to work in parallel.

How arch-specific are you willing to be?  On x86, it might be possible
to play some GDT games so that an unlocked xchg relative to gs would
work if you arranged for gs to refer to the GDT.  On 32-bit userspace,
you can do this with set_thread_area and on 64-bit userspace you can
do it with arch_prctl.  You'd be relying on nothing else in the
process using gs, but your proposal already relies on nothing else in
the process using your signal number.

I still dislike anything that tells programs when they're preempted,
since it prevents any kind of single-stepping.  It would be nicer if
you could somehow only get notified that another thread in your
process took your place on your cpu.

>
>>
>> >> Benchmarking sched_getcpu() vs tls cache approach. Getting the
>> >> current CPU number:
>> >>
>> >> - With Linux vdso:            12.7 ns
>> >> - With TLS-cached cpu number:  0.3 ns
>>
>> Slightly off-topic: try this again on a newer kernel.  The vdso should
>> have gotten a bit faster in 3.19 or 4.0 IIRC.
>
> Those benchmarks were done on Linux 4.0.

What cpu?  I'm surprised it's that bad.  (TLS-cached will always be
better, but still.  Although I'm curious how you're making the
TLS-cached value work reliably enough.)

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

[parent not found: <CALCETrUxp-dP-kaTy4prEdciM-=sTXjpqnMbkvk38g5BTEvX0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
       [not found]             ` <CALCETrUxp-dP-kaTy4prEdciM-=sTXjpqnMbkvk38g5BTEvX0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-23 17:09               ` Mathieu Desnoyers
  2015-05-23 19:15                 ` Andy Lutomirski
  0 siblings, 1 reply; 17+ messages in thread
From: Mathieu Desnoyers @ 2015-05-23 17:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Michael Kerrisk, Paul Turner, Andrew Hunter, Ben Maurer,
	Linux Kernel, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Paul E. McKenney, Josh Triplett, Lai Jiangshan, Linus Torvalds,
	Andrew Morton, Linux API

----- Original Message -----
> On Fri, May 22, 2015 at 2:34 PM, Mathieu Desnoyers
> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> > ----- Original Message -----
> >> On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> >> wrote:
> >> > [CC += linux-api@]
> >> >
> >> > On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers
> >> > <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> >> >> Expose a new system call allowing userspace threads to register
> >> >> a TLS area used as an ABI between the kernel and userspace to
> >> >> share information required to create efficient per-cpu critical
> >> >> sections in user-space.
> >> >>
> >> >> This ABI consists of a thread-local structure containing:
> >> >>
> >> >> - a nesting count surrounding the critical section,
> >> >> - a signal number to be sent to the thread when preempting a thread
> >> >>   with non-zero nesting count,
> >> >> - a flag indicating whether the signal has been sent within the
> >> >>   critical section,
> >> >> - an integer where to store the current CPU number, updated whenever
> >> >>   the thread is preempted. This CPU number cache is not strictly
> >> >>   needed, but performs better than getcpu vdso.
> >> >>
> >> >> This approach is inspired by Paul Turner and Andrew Hunter's work
> >> >> on percpu atomics, which lets the kernel handle restart of critical
> >> >> sections, ref.
> >> >> http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
> >> >>
> >> >> What is done differently here compared to percpu atomics: we track
> >> >> a single nesting counter per thread rather than many ranges of
> >> >> instruction pointer values. We deliver a signal to user-space and
> >> >> let the logic of restart be handled in user-space, thus moving
> >> >> the complexity out of the kernel. The nesting counter approach
> >> >> allows us to skip the complexity of interacting with signals that
> >> >> would be otherwise needed with the percpu atomics approach, which
> >> >> needs to know which instruction pointers are preempted, including
> >> >> when preemption occurs on a signal handler nested over an instruction
> >> >> pointer of interest.
> >> >>
> >>
> >> I talked about this kind of thing with PeterZ at LSF/MM, and I was
> >> unable to convince myself that the kernel needs to help at all.  To do
> >> this without kernel help, I want to relax the requirements slightly.
> >> With true per-cpu atomic sections, you have a guarantee that you are
> >> either really running on the same CPU for the entire duration of the
> >> atomic section or you abort.  I propose a weaker primitive: you
> >> acquire one of an array of locks (probably one per cpu), and you are
> >> guaranteed that, if you don't abort, no one else acquires the same
> >> lock while you hold it.
> >
> > In my proof of concept (https://github.com/compudj/percpu-dev) I
> > actually implement an array of per-cpu lock. The issue here boils
> > down to grabbing this per-cpu lock efficiently. Once the lock is taken,
> > the thread has exclusive access to that per-cpu lock, even if it
> > migrates.
> >
> >>  Here's how:
> >>
> >> Create an array of user-managed locks, one per cpu.  Call them lock[i]
> >> for 0 <= i < ncpus.
> >>
> >> To acquire, look up your CPU number.  Then, atomically, check that
> >> lock[cpu] isn't held and, if so, mark it held and record both your tid
> >> and your lock acquisition count.  If you learn that the lock *was*
> >> held after all, signal the holder (with kill or your favorite other
> >> mechanism), telling it which lock acquisition count is being aborted.
> >> Then atomically steal the lock, but only if the lock acquisition count
> >> hasn't changed.
> >>
> >> This has a few benefits over the in-kernel approach:
> >>
> >> 1. No kernel patch.
> >>
> >> 2. No unnecessary abort if you are preempted in favor of a thread that
> >> doesn't content for your lock.
> >>
> >> 3. Greatly improved debuggability.
> >>
> >> 4. With long critical sections and heavy load, you can improve
> >> performance by having several locks per cpu and choosing one at
> >> random.
> >>
> >> Is there a reason that a scheme like this doesn't work?
> >
> > What do you mean exactly by "atomically check that lock is not
> > held and, if so, mark it held" ? Do you imply using a lock-prefixed
> > atomic operation ?
> 
> Yes.
> 
> >
> > The goal of this whole restart section approach is to allow grabbing
> > a lock (or doing other sequences of operations ending with a single
> > store) on per-cpu data without having to use slow lock-prefixed
> > atomic operations.
> 
> Ah, ok, I assumed it was to allow multiple threads to work in parallel.
> 
> How arch-specific are you willing to be?

I'd want this to be usable on every major architectures.

> On x86, it might be possible
> to play some GDT games so that an unlocked xchg relative

AFAIK, there is no such thing as an unlocked xchg. xchg always
imply the lock prefix on x86. I guess you mean cmpxchg here.

> to gs would
> work if you arranged for gs to refer to the GDT.  On 32-bit userspace,
> you can do this with set_thread_area and on 64-bit userspace you can
> do it with arch_prctl.  You'd be relying on nothing else in the
> process using gs, but your proposal already relies on nothing else in
> the process using your signal number.

Ideally, and this is indeed a limitation of my own approach, I would
like to be able to use this scheme from a library injected into
processes for tracing purposes, which means that I would be tempted
to stay away from solutions that affect the application too much.
This includes sending a signal unfortunately.

In addition, the gs approach you propose here would work as long
as we use non-lock-prefixed atomic operations (e.g. cmpxchg), but
it would not work for sequences of instructions that need to be
performed over the same data (e.g. load, test, conditional branch,
store), which performs slightly faster than non-lock-prefixed atomic
ops.

> 
> I still dislike anything that tells programs when they're preempted,
> since it prevents any kind of single-stepping.  It would be nicer if
> you could somehow only get notified that another thread in your
> process took your place on your cpu.

Good point about single-stepping. It would indeed not play nicely with
the restartable critical section. Single-stepping a restartable c.s.
that points its restart block to the beginning of the restartable
c.s. would prevent progress, and make the c.s. restart forever. One
way to work around this would be to point the restart block to a
a slow path, which is not restartable, and single-stepping friendly.
We'd have to consider the interaction of the fast and slow paths very
carefully though.

> 
> >
> >>
> >> >> Benchmarking sched_getcpu() vs tls cache approach. Getting the
> >> >> current CPU number:
> >> >>
> >> >> - With Linux vdso:            12.7 ns
> >> >> - With TLS-cached cpu number:  0.3 ns
> >>
> >> Slightly off-topic: try this again on a newer kernel.  The vdso should
> >> have gotten a bit faster in 3.19 or 4.0 IIRC.
> >
> > Those benchmarks were done on Linux 4.0.
> 
> What cpu?  I'm surprised it's that bad.

Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
(Haswell)
Please note that I run this benchmark under a kvm guest VM, but
I doubt it would impact these numbers.

> (TLS-cached will always be
> better, but still.  Although I'm curious how you're making the
> TLS-cached value work reliably enough.)

What do you have in mind as possibility of having an unreliable TLS-cached
value ? In my approach, the tls-cached value is updated in the preempt in
notifier, whenever the CPU number has changed compared to the value in
user-space. Accessing the TLS value from user-space is performed with the
help of the compiler, including all the complexity involved in using a
TLS from a library if need be (lazy allocation, internal glibc mutex, etc).
However, since I control very precisely where in my critical section the
execution flow can be restarted (it's only on the store operation that
touch the write-protected memory range), there is no need to worry about
restarting in the middle of a held lock. It also works if a system call
is issued within the critical section (e.g. sys_futex due to lock), and
works with function calls.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
  2015-05-23 17:09               ` Mathieu Desnoyers
@ 2015-05-23 19:15                 ` Andy Lutomirski
       [not found]                   ` <CALCETrWzoFX7hXqvQqDEq=r=7PNaGKVjZeHEBWxPvC28Zi1AKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2015-05-23 19:15 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Lai Jiangshan, Paul E. McKenney, Ben Maurer, Josh Triplett,
	Ingo Molnar, Andrew Morton, Linux API, Michael Kerrisk,
	Linux Kernel, Linus Torvalds, Peter Zijlstra, Paul Turner,
	Steven Rostedt, Andrew Hunter

On May 23, 2015 10:09 AM, "Mathieu Desnoyers"
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- Original Message -----
> > On Fri, May 22, 2015 at 2:34 PM, Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> > > ----- Original Message -----
> > >> On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk <mtk.manpages@gmail.com>
> > >> wrote:
> > >> > [CC += linux-api@]
> > >> >
> > >> > On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers
> > >> > <mathieu.desnoyers@efficios.com> wrote:
> > >> >> Expose a new system call allowing userspace threads to register
> > >> >> a TLS area used as an ABI between the kernel and userspace to
> > >> >> share information required to create efficient per-cpu critical
> > >> >> sections in user-space.
> > >> >>
> > >> >> This ABI consists of a thread-local structure containing:
> > >> >>
> > >> >> - a nesting count surrounding the critical section,
> > >> >> - a signal number to be sent to the thread when preempting a thread
> > >> >>   with non-zero nesting count,
> > >> >> - a flag indicating whether the signal has been sent within the
> > >> >>   critical section,
> > >> >> - an integer where to store the current CPU number, updated whenever
> > >> >>   the thread is preempted. This CPU number cache is not strictly
> > >> >>   needed, but performs better than getcpu vdso.
> > >> >>
> > >> >> This approach is inspired by Paul Turner and Andrew Hunter's work
> > >> >> on percpu atomics, which lets the kernel handle restart of critical
> > >> >> sections, ref.
> > >> >> http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
> > >> >>
> > >> >> What is done differently here compared to percpu atomics: we track
> > >> >> a single nesting counter per thread rather than many ranges of
> > >> >> instruction pointer values. We deliver a signal to user-space and
> > >> >> let the logic of restart be handled in user-space, thus moving
> > >> >> the complexity out of the kernel. The nesting counter approach
> > >> >> allows us to skip the complexity of interacting with signals that
> > >> >> would be otherwise needed with the percpu atomics approach, which
> > >> >> needs to know which instruction pointers are preempted, including
> > >> >> when preemption occurs on a signal handler nested over an instruction
> > >> >> pointer of interest.
> > >> >>
> > >>
> > >> I talked about this kind of thing with PeterZ at LSF/MM, and I was
> > >> unable to convince myself that the kernel needs to help at all.  To do
> > >> this without kernel help, I want to relax the requirements slightly.
> > >> With true per-cpu atomic sections, you have a guarantee that you are
> > >> either really running on the same CPU for the entire duration of the
> > >> atomic section or you abort.  I propose a weaker primitive: you
> > >> acquire one of an array of locks (probably one per cpu), and you are
> > >> guaranteed that, if you don't abort, no one else acquires the same
> > >> lock while you hold it.
> > >
> > > In my proof of concept (https://github.com/compudj/percpu-dev) I
> > > actually implement an array of per-cpu lock. The issue here boils
> > > down to grabbing this per-cpu lock efficiently. Once the lock is taken,
> > > the thread has exclusive access to that per-cpu lock, even if it
> > > migrates.
> > >
> > >>  Here's how:
> > >>
> > >> Create an array of user-managed locks, one per cpu.  Call them lock[i]
> > >> for 0 <= i < ncpus.
> > >>
> > >> To acquire, look up your CPU number.  Then, atomically, check that
> > >> lock[cpu] isn't held and, if so, mark it held and record both your tid
> > >> and your lock acquisition count.  If you learn that the lock *was*
> > >> held after all, signal the holder (with kill or your favorite other
> > >> mechanism), telling it which lock acquisition count is being aborted.
> > >> Then atomically steal the lock, but only if the lock acquisition count
> > >> hasn't changed.
> > >>
> > >> This has a few benefits over the in-kernel approach:
> > >>
> > >> 1. No kernel patch.
> > >>
> > >> 2. No unnecessary abort if you are preempted in favor of a thread that
> > >> doesn't content for your lock.
> > >>
> > >> 3. Greatly improved debuggability.
> > >>
> > >> 4. With long critical sections and heavy load, you can improve
> > >> performance by having several locks per cpu and choosing one at
> > >> random.
> > >>
> > >> Is there a reason that a scheme like this doesn't work?
> > >
> > > What do you mean exactly by "atomically check that lock is not
> > > held and, if so, mark it held" ? Do you imply using a lock-prefixed
> > > atomic operation ?
> >
> > Yes.
> >
> > >
> > > The goal of this whole restart section approach is to allow grabbing
> > > a lock (or doing other sequences of operations ending with a single
> > > store) on per-cpu data without having to use slow lock-prefixed
> > > atomic operations.
> >
> > Ah, ok, I assumed it was to allow multiple threads to work in parallel.
> >
> > How arch-specific are you willing to be?
>
> I'd want this to be usable on every major architectures.
>
> > On x86, it might be possible
> > to play some GDT games so that an unlocked xchg relative
>
> AFAIK, there is no such thing as an unlocked xchg. xchg always
> imply the lock prefix on x86. I guess you mean cmpxchg here.
>

Right, got my special cases mixed up.

I wonder if we could instead have a vdso function that did something like:

unsigned long __vdso_cpu_local_exchange(unsigned long *base, int
shift, unsigned long newval)
{
  unsigned long *ptr = base + (cpu << shift);
  unsigned long old = *ptr;
  *ptr = new;
  return *ptr;
}

I think this primitive would be sufficient to let user code do the
rest.  There might be other more simple primitives that would work.
It could be implemented by fiddling with IP ranges, but we could
change the implementation later without breaking anything.  The only
really hard part would be efficiently figuring out what CPU we're on.

FWIW, 'xchg' to cache-hot memory is only about 20 cycles on my laptop,
and cmpxchg seems to be about 6 cycles.  Both are faster than getcpu.
How much are we really saving with any of this over the pure userspace
approach?  I think that the most that the kernel can possibly help us
is to give us a faster getcpu and to help us deal with being migrated
to a different cpu if we want to avoid expensive operations and don't
want to play unlocked cmpxchg tricks.

I was wrong about set_thread_area giving us efficient per-cpu data,
BTW, although it would be easy to add a similar feature that would
give us a per-cpu segment base on x86.

> > to gs would
> > work if you arranged for gs to refer to the GDT.  On 32-bit userspace,
> > you can do this with set_thread_area and on 64-bit userspace you can
> > do it with arch_prctl.  You'd be relying on nothing else in the
> > process using gs, but your proposal already relies on nothing else in
> > the process using your signal number.
>
> Ideally, and this is indeed a limitation of my own approach, I would
> like to be able to use this scheme from a library injected into
> processes for tracing purposes, which means that I would be tempted
> to stay away from solutions that affect the application too much.
> This includes sending a signal unfortunately.
>
> In addition, the gs approach you propose here would work as long
> as we use non-lock-prefixed atomic operations (e.g. cmpxchg), but
> it would not work for sequences of instructions that need to be
> performed over the same data (e.g. load, test, conditional branch,
> store), which performs slightly faster than non-lock-prefixed atomic
> ops.
>

> > >> >> - With Linux vdso:            12.7 ns
> > >> >> - With TLS-cached cpu number:  0.3 ns
> > >>
> > >> Slightly off-topic: try this again on a newer kernel.  The vdso should
> > >> have gotten a bit faster in 3.19 or 4.0 IIRC.
> > >
> > > Those benchmarks were done on Linux 4.0.
> >
> > What cpu?  I'm surprised it's that bad.
>
> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> (Haswell)
> Please note that I run this benchmark under a kvm guest VM, but
> I doubt it would impact these numbers.

Go figure.  My laptop (SNB, 2.7GHz) can do it in 8ns or so.

>
> > (TLS-cached will always be
> > better, but still.  Although I'm curious how you're making the
> > TLS-cached value work reliably enough.)
>
> What do you have in mind as possibility of having an unreliable TLS-cached
> value ?

Something like the old vdso getcpu cache -- re-read the cpu count if
it has expired since last time.

> In my approach, the tls-cached value is updated in the preempt in
> notifier, whenever the CPU number has changed compared to the value in
> user-space.

Ah, I missed that part.

This ABI is kind of unfortunate in that it can't be shared between
multiple libraries.

Accessing the TLS value from user-space is performed with the
> help of the compiler, including all the complexity involved in using a
> TLS from a library if need be (lazy allocation, internal glibc mutex, etc).
> However, since I control very precisely where in my critical section the
> execution flow can be restarted (it's only on the store operation that
> touch the write-protected memory range), there is no need to worry about
> restarting in the middle of a held lock. It also works if a system call
> is issued within the critical section (e.g. sys_futex due to lock), and
> works with function calls.

Oh, you're delivering a signal every time you're preempted, even
outside the critical section?  That sounds expensive.

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

[parent not found: <CALCETrWzoFX7hXqvQqDEq=r=7PNaGKVjZeHEBWxPvC28Zi1AKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
       [not found]                   ` <CALCETrWzoFX7hXqvQqDEq=r=7PNaGKVjZeHEBWxPvC28Zi1AKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-25 18:30                     ` Mathieu Desnoyers
       [not found]                       ` <1184354091.7499.1432578613872.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Mathieu Desnoyers @ 2015-05-25 18:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Lai Jiangshan, Paul E. McKenney, Ben Maurer, Josh Triplett,
	Ingo Molnar, Andrew Morton, Linux API, Michael Kerrisk,
	Linux Kernel, Linus Torvalds, Peter Zijlstra, Paul Turner,
	Steven Rostedt, Andrew Hunter

----- Original Message -----
> On May 23, 2015 10:09 AM, "Mathieu Desnoyers"
> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> >
> > ----- Original Message -----
> > > On Fri, May 22, 2015 at 2:34 PM, Mathieu Desnoyers
> > > <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> > > > ----- Original Message -----
> > > >> On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk
> > > >> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > > >> wrote:
> > > >> > [CC += linux-api@]
> > > >> >
> > > >> > On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers
> > > >> > <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> > > >> >> Expose a new system call allowing userspace threads to register
> > > >> >> a TLS area used as an ABI between the kernel and userspace to
> > > >> >> share information required to create efficient per-cpu critical
> > > >> >> sections in user-space.
> > > >> >>
> > > >> >> This ABI consists of a thread-local structure containing:
> > > >> >>
> > > >> >> - a nesting count surrounding the critical section,
> > > >> >> - a signal number to be sent to the thread when preempting a thread
> > > >> >>   with non-zero nesting count,
> > > >> >> - a flag indicating whether the signal has been sent within the
> > > >> >>   critical section,
> > > >> >> - an integer where to store the current CPU number, updated
> > > >> >> whenever
> > > >> >>   the thread is preempted. This CPU number cache is not strictly
> > > >> >>   needed, but performs better than getcpu vdso.
> > > >> >>
> > > >> >> This approach is inspired by Paul Turner and Andrew Hunter's work
> > > >> >> on percpu atomics, which lets the kernel handle restart of critical
> > > >> >> sections, ref.
> > > >> >> http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
> > > >> >>
> > > >> >> What is done differently here compared to percpu atomics: we track
> > > >> >> a single nesting counter per thread rather than many ranges of
> > > >> >> instruction pointer values. We deliver a signal to user-space and
> > > >> >> let the logic of restart be handled in user-space, thus moving
> > > >> >> the complexity out of the kernel. The nesting counter approach
> > > >> >> allows us to skip the complexity of interacting with signals that
> > > >> >> would be otherwise needed with the percpu atomics approach, which
> > > >> >> needs to know which instruction pointers are preempted, including
> > > >> >> when preemption occurs on a signal handler nested over an
> > > >> >> instruction
> > > >> >> pointer of interest.
> > > >> >>
> > > >>
> > > >> I talked about this kind of thing with PeterZ at LSF/MM, and I was
> > > >> unable to convince myself that the kernel needs to help at all.  To do
> > > >> this without kernel help, I want to relax the requirements slightly.
> > > >> With true per-cpu atomic sections, you have a guarantee that you are
> > > >> either really running on the same CPU for the entire duration of the
> > > >> atomic section or you abort.  I propose a weaker primitive: you
> > > >> acquire one of an array of locks (probably one per cpu), and you are
> > > >> guaranteed that, if you don't abort, no one else acquires the same
> > > >> lock while you hold it.
> > > >
> > > > In my proof of concept (https://github.com/compudj/percpu-dev) I
> > > > actually implement an array of per-cpu lock. The issue here boils
> > > > down to grabbing this per-cpu lock efficiently. Once the lock is taken,
> > > > the thread has exclusive access to that per-cpu lock, even if it
> > > > migrates.
> > > >
> > > >>  Here's how:
> > > >>
> > > >> Create an array of user-managed locks, one per cpu.  Call them lock[i]
> > > >> for 0 <= i < ncpus.
> > > >>
> > > >> To acquire, look up your CPU number.  Then, atomically, check that
> > > >> lock[cpu] isn't held and, if so, mark it held and record both your tid
> > > >> and your lock acquisition count.  If you learn that the lock *was*
> > > >> held after all, signal the holder (with kill or your favorite other
> > > >> mechanism), telling it which lock acquisition count is being aborted.
> > > >> Then atomically steal the lock, but only if the lock acquisition count
> > > >> hasn't changed.
> > > >>
> > > >> This has a few benefits over the in-kernel approach:
> > > >>
> > > >> 1. No kernel patch.
> > > >>
> > > >> 2. No unnecessary abort if you are preempted in favor of a thread that
> > > >> doesn't content for your lock.
> > > >>
> > > >> 3. Greatly improved debuggability.
> > > >>
> > > >> 4. With long critical sections and heavy load, you can improve
> > > >> performance by having several locks per cpu and choosing one at
> > > >> random.
> > > >>
> > > >> Is there a reason that a scheme like this doesn't work?
> > > >
> > > > What do you mean exactly by "atomically check that lock is not
> > > > held and, if so, mark it held" ? Do you imply using a lock-prefixed
> > > > atomic operation ?
> > >
> > > Yes.
> > >
> > > >
> > > > The goal of this whole restart section approach is to allow grabbing
> > > > a lock (or doing other sequences of operations ending with a single
> > > > store) on per-cpu data without having to use slow lock-prefixed
> > > > atomic operations.
> > >
> > > Ah, ok, I assumed it was to allow multiple threads to work in parallel.
> > >
> > > How arch-specific are you willing to be?
> >
> > I'd want this to be usable on every major architectures.
> >
> > > On x86, it might be possible
> > > to play some GDT games so that an unlocked xchg relative
> >
> > AFAIK, there is no such thing as an unlocked xchg. xchg always
> > imply the lock prefix on x86. I guess you mean cmpxchg here.
> >
> 
> Right, got my special cases mixed up.
> 
> I wonder if we could instead have a vdso function that did something like:
> 
> unsigned long __vdso_cpu_local_exchange(unsigned long *base, int
> shift, unsigned long newval)
> {
>   unsigned long *ptr = base + (cpu << shift);
>   unsigned long old = *ptr;
>   *ptr = new;
>   return *ptr;
> }
> 
> I think this primitive would be sufficient to let user code do the
> rest.  There might be other more simple primitives that would work.
> It could be implemented by fiddling with IP ranges, but we could
> change the implementation later without breaking anything.  The only
> really hard part would be efficiently figuring out what CPU we're on.

The "fiddling with IP ranges" is where the restart sections come into
play. Paul Turner's approach indeed knows about IP ranges, and performs
the restart from the kernel. My alternative approach uses a signal and
page protection in user-space to reach the same result. It appears that
CONFIG_PREEMPT kernels are difficult to handle with Paul's approach, so
perhaps we could combine our approaches to get the best of both.

> 
> FWIW, 'xchg' to cache-hot memory is only about 20 cycles on my laptop,
> and cmpxchg seems to be about 6 cycles.  Both are faster than getcpu.
> How much are we really saving with any of this over the pure userspace
> approach?  I think that the most that the kernel can possibly help us
> is to give us a faster getcpu and to help us deal with being migrated
> to a different cpu if we want to avoid expensive operations and don't
> want to play unlocked cmpxchg tricks.

You appear to summarize the two things that are added to the kernel with
this RFC patch:
- faster getcpu(): 12.7 ns -> 0.3 ns
- allow using less expensive operations to perform per-cpu-atomic ops:
  (on my system):
  - lock; cmpxchg (18 ns) -> load-test-branch-store (5.4 ns)

> 
> I was wrong about set_thread_area giving us efficient per-cpu data,
> BTW, although it would be easy to add a similar feature that would
> give us a per-cpu segment base on x86.

ok

> 
> > > to gs would
> > > work if you arranged for gs to refer to the GDT.  On 32-bit userspace,
> > > you can do this with set_thread_area and on 64-bit userspace you can
> > > do it with arch_prctl.  You'd be relying on nothing else in the
> > > process using gs, but your proposal already relies on nothing else in
> > > the process using your signal number.
> >
> > Ideally, and this is indeed a limitation of my own approach, I would
> > like to be able to use this scheme from a library injected into
> > processes for tracing purposes, which means that I would be tempted
> > to stay away from solutions that affect the application too much.
> > This includes sending a signal unfortunately.
> >
> > In addition, the gs approach you propose here would work as long
> > as we use non-lock-prefixed atomic operations (e.g. cmpxchg), but
> > it would not work for sequences of instructions that need to be
> > performed over the same data (e.g. load, test, conditional branch,
> > store), which performs slightly faster than non-lock-prefixed atomic
> > ops.
> >
> 
> > > >> >> - With Linux vdso:            12.7 ns
> > > >> >> - With TLS-cached cpu number:  0.3 ns
> > > >>
> > > >> Slightly off-topic: try this again on a newer kernel.  The vdso should
> > > >> have gotten a bit faster in 3.19 or 4.0 IIRC.
> > > >
> > > > Those benchmarks were done on Linux 4.0.
> > >
> > > What cpu?  I'm surprised it's that bad.
> >
> > Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> > (Haswell)
> > Please note that I run this benchmark under a kvm guest VM, but
> > I doubt it would impact these numbers.
> 
> Go figure.  My laptop (SNB, 2.7GHz) can do it in 8ns or so.
> 
> >
> > > (TLS-cached will always be
> > > better, but still.  Although I'm curious how you're making the
> > > TLS-cached value work reliably enough.)
> >
> > What do you have in mind as possibility of having an unreliable TLS-cached
> > value ?
> 
> Something like the old vdso getcpu cache -- re-read the cpu count if
> it has expired since last time.
> 
> > In my approach, the tls-cached value is updated in the preempt in
> > notifier, whenever the CPU number has changed compared to the value in
> > user-space.
> 
> Ah, I missed that part.
> 
> This ABI is kind of unfortunate in that it can't be shared between
> multiple libraries.

Well ideally you'd want one library to implement this (e.g. glibc), and
other libraries to use it. I'm not sure doing a linked list of TLS entries
per thread would really be useful here.

> 
> Accessing the TLS value from user-space is performed with the
> > help of the compiler, including all the complexity involved in using a
> > TLS from a library if need be (lazy allocation, internal glibc mutex, etc).
> > However, since I control very precisely where in my critical section the
> > execution flow can be restarted (it's only on the store operation that
> > touch the write-protected memory range), there is no need to worry about
> > restarting in the middle of a held lock. It also works if a system call
> > is issued within the critical section (e.g. sys_futex due to lock), and
> > works with function calls.
> 
> Oh, you're delivering a signal every time you're preempted, even
> outside the critical section?  That sounds expensive.

No, the signal is only delivered if the kernel preempts the critical
section.

Thanks for the feedback!

Mathieu

> 
> --Andy
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

[parent not found: <1184354091.7499.1432578613872.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>]

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
       [not found]                       ` <1184354091.7499.1432578613872.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2015-05-25 18:54                         ` Andy Lutomirski
       [not found]                           ` <CALCETrW3_Hv0jc3cpiwsHTinBqJzvab_EiPS8BVJhX-xe5D8qw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-05-26 21:20                           ` Andi Kleen
  0 siblings, 2 replies; 17+ messages in thread
From: Andy Lutomirski @ 2015-05-25 18:54 UTC (permalink / raw)
  To: Mathieu Desnoyers, H. Peter Anvin, Andi Kleen, Borislav Petkov
  Cc: Lai Jiangshan, Paul E. McKenney, Ben Maurer, Josh Triplett,
	Ingo Molnar, Andrew Morton, Linux API, Michael Kerrisk,
	Linux Kernel, Linus Torvalds, Peter Zijlstra, Paul Turner,
	Steven Rostedt, Andrew Hunter

[cc: hpa, Borislav and Andi]

On Mon, May 25, 2015 at 11:30 AM, Mathieu Desnoyers
<mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> ----- Original Message -----
>> On May 23, 2015 10:09 AM, "Mathieu Desnoyers"
>> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>> >
>> > ----- Original Message -----
>> > > On Fri, May 22, 2015 at 2:34 PM, Mathieu Desnoyers
>> > > <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>> > > > ----- Original Message -----
>> > > >> On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk
>> > > >> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>> > > >> wrote:
>> > > >> > [CC += linux-api@]
>> > > >> >
>> > > >> > On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers
>> > > >> > <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>> > > >> >> Expose a new system call allowing userspace threads to register
>> > > >> >> a TLS area used as an ABI between the kernel and userspace to
>> > > >> >> share information required to create efficient per-cpu critical
>> > > >> >> sections in user-space.
>> > > >> >>
>> > > >> >> This ABI consists of a thread-local structure containing:
>> > > >> >>
>> > > >> >> - a nesting count surrounding the critical section,
>> > > >> >> - a signal number to be sent to the thread when preempting a thread
>> > > >> >>   with non-zero nesting count,
>> > > >> >> - a flag indicating whether the signal has been sent within the
>> > > >> >>   critical section,
>> > > >> >> - an integer where to store the current CPU number, updated
>> > > >> >> whenever
>> > > >> >>   the thread is preempted. This CPU number cache is not strictly
>> > > >> >>   needed, but performs better than getcpu vdso.
>> > > >> >>
>> > > >> >> This approach is inspired by Paul Turner and Andrew Hunter's work
>> > > >> >> on percpu atomics, which lets the kernel handle restart of critical
>> > > >> >> sections, ref.
>> > > >> >> http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
>> > > >> >>
>> > > >> >> What is done differently here compared to percpu atomics: we track
>> > > >> >> a single nesting counter per thread rather than many ranges of
>> > > >> >> instruction pointer values. We deliver a signal to user-space and
>> > > >> >> let the logic of restart be handled in user-space, thus moving
>> > > >> >> the complexity out of the kernel. The nesting counter approach
>> > > >> >> allows us to skip the complexity of interacting with signals that
>> > > >> >> would be otherwise needed with the percpu atomics approach, which
>> > > >> >> needs to know which instruction pointers are preempted, including
>> > > >> >> when preemption occurs on a signal handler nested over an
>> > > >> >> instruction
>> > > >> >> pointer of interest.
>> > > >> >>
>> > > >>
>> > > >> I talked about this kind of thing with PeterZ at LSF/MM, and I was
>> > > >> unable to convince myself that the kernel needs to help at all.  To do
>> > > >> this without kernel help, I want to relax the requirements slightly.
>> > > >> With true per-cpu atomic sections, you have a guarantee that you are
>> > > >> either really running on the same CPU for the entire duration of the
>> > > >> atomic section or you abort.  I propose a weaker primitive: you
>> > > >> acquire one of an array of locks (probably one per cpu), and you are
>> > > >> guaranteed that, if you don't abort, no one else acquires the same
>> > > >> lock while you hold it.
>> > > >
>> > > > In my proof of concept (https://github.com/compudj/percpu-dev) I
>> > > > actually implement an array of per-cpu lock. The issue here boils
>> > > > down to grabbing this per-cpu lock efficiently. Once the lock is taken,
>> > > > the thread has exclusive access to that per-cpu lock, even if it
>> > > > migrates.
>> > > >
>> > > >>  Here's how:
>> > > >>
>> > > >> Create an array of user-managed locks, one per cpu.  Call them lock[i]
>> > > >> for 0 <= i < ncpus.
>> > > >>
>> > > >> To acquire, look up your CPU number.  Then, atomically, check that
>> > > >> lock[cpu] isn't held and, if so, mark it held and record both your tid
>> > > >> and your lock acquisition count.  If you learn that the lock *was*
>> > > >> held after all, signal the holder (with kill or your favorite other
>> > > >> mechanism), telling it which lock acquisition count is being aborted.
>> > > >> Then atomically steal the lock, but only if the lock acquisition count
>> > > >> hasn't changed.
>> > > >>
>> > > >> This has a few benefits over the in-kernel approach:
>> > > >>
>> > > >> 1. No kernel patch.
>> > > >>
>> > > >> 2. No unnecessary abort if you are preempted in favor of a thread that
>> > > >> doesn't content for your lock.
>> > > >>
>> > > >> 3. Greatly improved debuggability.
>> > > >>
>> > > >> 4. With long critical sections and heavy load, you can improve
>> > > >> performance by having several locks per cpu and choosing one at
>> > > >> random.
>> > > >>
>> > > >> Is there a reason that a scheme like this doesn't work?
>> > > >
>> > > > What do you mean exactly by "atomically check that lock is not
>> > > > held and, if so, mark it held" ? Do you imply using a lock-prefixed
>> > > > atomic operation ?
>> > >
>> > > Yes.
>> > >
>> > > >
>> > > > The goal of this whole restart section approach is to allow grabbing
>> > > > a lock (or doing other sequences of operations ending with a single
>> > > > store) on per-cpu data without having to use slow lock-prefixed
>> > > > atomic operations.
>> > >
>> > > Ah, ok, I assumed it was to allow multiple threads to work in parallel.
>> > >
>> > > How arch-specific are you willing to be?
>> >
>> > I'd want this to be usable on every major architectures.
>> >
>> > > On x86, it might be possible
>> > > to play some GDT games so that an unlocked xchg relative
>> >
>> > AFAIK, there is no such thing as an unlocked xchg. xchg always
>> > imply the lock prefix on x86. I guess you mean cmpxchg here.
>> >
>>
>> Right, got my special cases mixed up.
>>
>> I wonder if we could instead have a vdso function that did something like:
>>
>> unsigned long __vdso_cpu_local_exchange(unsigned long *base, int
>> shift, unsigned long newval)
>> {
>>   unsigned long *ptr = base + (cpu << shift);
>>   unsigned long old = *ptr;
>>   *ptr = new;
>>   return *ptr;
>> }
>>
>> I think this primitive would be sufficient to let user code do the
>> rest.  There might be other more simple primitives that would work.
>> It could be implemented by fiddling with IP ranges, but we could
>> change the implementation later without breaking anything.  The only
>> really hard part would be efficiently figuring out what CPU we're on.
>
> The "fiddling with IP ranges" is where the restart sections come into
> play. Paul Turner's approach indeed knows about IP ranges, and performs
> the restart from the kernel. My alternative approach uses a signal and
> page protection in user-space to reach the same result. It appears that
> CONFIG_PREEMPT kernels are difficult to handle with Paul's approach, so
> perhaps we could combine our approaches to get the best of both.

I'm not sure why CONFIG_PREEMPT would matter.  What am I missing?

Doing this in the vdso has some sneaky benefits: rather than aborting
a very short vdso-based primitive on context switch, we could just fix
it up in the kernel and skip ahead to the end.

>
>>
>> FWIW, 'xchg' to cache-hot memory is only about 20 cycles on my laptop,
>> and cmpxchg seems to be about 6 cycles.  Both are faster than getcpu.
>> How much are we really saving with any of this over the pure userspace
>> approach?  I think that the most that the kernel can possibly help us
>> is to give us a faster getcpu and to help us deal with being migrated
>> to a different cpu if we want to avoid expensive operations and don't
>> want to play unlocked cmpxchg tricks.
>
> You appear to summarize the two things that are added to the kernel with
> this RFC patch:
> - faster getcpu(): 12.7 ns -> 0.3 ns

Yeah, I figured that out the second time I read your email and re-read
the original message, but I apparently forgot to fix up the email I
was typing.

> - allow using less expensive operations to perform per-cpu-atomic ops:
>   (on my system):
>   - lock; cmpxchg (18 ns) -> load-test-branch-store (5.4 ns)
>

When you say "load-test-branch-store", do you mean that sequence of
normal code or a literal cmpxchg?  On x86, we really can pull off the
sneaky trick of gs-relative cmpxchg in a single instruction, which
gives us per-cpu atomic locking without any special kernel fixups,
although that locks down gs and isn't currently supported.

IIRC some other OS's use gs as a userspace per-cpu pointer instead of
a per-thread pointer.  Would we want to consider enabling that?
This may interact oddly with the WIP wrgsbase stuff.  (Grr, Intel, why
couldn't you have made wrgsbase reset the gs selector to zero?  That
would have been so much nicer.)

>>
>> I was wrong about set_thread_area giving us efficient per-cpu data,
>> BTW, although it would be easy to add a similar feature that would
>> give us a per-cpu segment base on x86.
>
> ok
>

>>
>> This ABI is kind of unfortunate in that it can't be shared between
>> multiple libraries.
>
> Well ideally you'd want one library to implement this (e.g. glibc), and
> other libraries to use it. I'm not sure doing a linked list of TLS entries
> per thread would really be useful here.

True, but that may not play so well with tracing injection.

At least a pure vdso approach is guaranteed not to have this problem.


>> Oh, you're delivering a signal every time you're preempted, even
>> outside the critical section?  That sounds expensive.
>
> No, the signal is only delivered if the kernel preempts the critical
> section.

As noted above, I meant to delete this before sending :)

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

[parent not found: <CALCETrW3_Hv0jc3cpiwsHTinBqJzvab_EiPS8BVJhX-xe5D8qw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
       [not found]                           ` <CALCETrW3_Hv0jc3cpiwsHTinBqJzvab_EiPS8BVJhX-xe5D8qw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-26 19:57                             ` Andy Lutomirski
       [not found]                               ` <CALCETrXzmO=fQC=UdCh5b0zWiGWAJScEtdT4QDJkoqLgtgEVig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-05-26 20:38                             ` Mathieu Desnoyers
  1 sibling, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2015-05-26 19:57 UTC (permalink / raw)
  To: Andi Kleen, Mathieu Desnoyers, Borislav Petkov, H. Peter Anvin
  Cc: Lai Jiangshan, Ben Maurer, Paul E. McKenney, Ingo Molnar,
	Josh Triplett, Andrew Morton, Michael Kerrisk, Linux API,
	Linux Kernel, Paul Turner, Peter Zijlstra, Linus Torvalds,
	Steven Rostedt, Andrew Hunter

On May 25, 2015 11:54 AM, "Andy Lutomirski" <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>
> [cc: hpa, Borislav and Andi]
>
> On Mon, May 25, 2015 at 11:30 AM, Mathieu Desnoyers
> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> > ----- Original Message -----
> >> On May 23, 2015 10:09 AM, "Mathieu Desnoyers"
> >> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> >> >
> >> > ----- Original Message -----
> >> > > On Fri, May 22, 2015 at 2:34 PM, Mathieu Desnoyers
> >> > > <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> >> > > > ----- Original Message -----
> >> > > >> On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk
> >> > > >> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> >> > > >> wrote:
> >> > > >> > [CC += linux-api@]
> >> > > >> >
> >> > > >> > On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers
> >> > > >> > <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> >> > > >> >> Expose a new system call allowing userspace threads to register
> >> > > >> >> a TLS area used as an ABI between the kernel and userspace to
> >> > > >> >> share information required to create efficient per-cpu critical
> >> > > >> >> sections in user-space.
> >> > > >> >>
> >> > > >> >> This ABI consists of a thread-local structure containing:
> >> > > >> >>
> >> > > >> >> - a nesting count surrounding the critical section,
> >> > > >> >> - a signal number to be sent to the thread when preempting a thread
> >> > > >> >>   with non-zero nesting count,
> >> > > >> >> - a flag indicating whether the signal has been sent within the
> >> > > >> >>   critical section,
> >> > > >> >> - an integer where to store the current CPU number, updated
> >> > > >> >> whenever
> >> > > >> >>   the thread is preempted. This CPU number cache is not strictly
> >> > > >> >>   needed, but performs better than getcpu vdso.
> >> > > >> >>
> >> > > >> >> This approach is inspired by Paul Turner and Andrew Hunter's work
> >> > > >> >> on percpu atomics, which lets the kernel handle restart of critical
> >> > > >> >> sections, ref.
> >> > > >> >> http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
> >> > > >> >>
> >> > > >> >> What is done differently here compared to percpu atomics: we track
> >> > > >> >> a single nesting counter per thread rather than many ranges of
> >> > > >> >> instruction pointer values. We deliver a signal to user-space and
> >> > > >> >> let the logic of restart be handled in user-space, thus moving
> >> > > >> >> the complexity out of the kernel. The nesting counter approach
> >> > > >> >> allows us to skip the complexity of interacting with signals that
> >> > > >> >> would be otherwise needed with the percpu atomics approach, which
> >> > > >> >> needs to know which instruction pointers are preempted, including
> >> > > >> >> when preemption occurs on a signal handler nested over an
> >> > > >> >> instruction
> >> > > >> >> pointer of interest.
> >> > > >> >>
> >> > > >>
> >> > > >> I talked about this kind of thing with PeterZ at LSF/MM, and I was
> >> > > >> unable to convince myself that the kernel needs to help at all.  To do
> >> > > >> this without kernel help, I want to relax the requirements slightly.
> >> > > >> With true per-cpu atomic sections, you have a guarantee that you are
> >> > > >> either really running on the same CPU for the entire duration of the
> >> > > >> atomic section or you abort.  I propose a weaker primitive: you
> >> > > >> acquire one of an array of locks (probably one per cpu), and you are
> >> > > >> guaranteed that, if you don't abort, no one else acquires the same
> >> > > >> lock while you hold it.
> >> > > >
> >> > > > In my proof of concept (https://github.com/compudj/percpu-dev) I
> >> > > > actually implement an array of per-cpu lock. The issue here boils
> >> > > > down to grabbing this per-cpu lock efficiently. Once the lock is taken,
> >> > > > the thread has exclusive access to that per-cpu lock, even if it
> >> > > > migrates.
> >> > > >
> >> > > >>  Here's how:
> >> > > >>
> >> > > >> Create an array of user-managed locks, one per cpu.  Call them lock[i]
> >> > > >> for 0 <= i < ncpus.
> >> > > >>
> >> > > >> To acquire, look up your CPU number.  Then, atomically, check that
> >> > > >> lock[cpu] isn't held and, if so, mark it held and record both your tid
> >> > > >> and your lock acquisition count.  If you learn that the lock *was*
> >> > > >> held after all, signal the holder (with kill or your favorite other
> >> > > >> mechanism), telling it which lock acquisition count is being aborted.
> >> > > >> Then atomically steal the lock, but only if the lock acquisition count
> >> > > >> hasn't changed.
> >> > > >>
> >> > > >> This has a few benefits over the in-kernel approach:
> >> > > >>
> >> > > >> 1. No kernel patch.
> >> > > >>
> >> > > >> 2. No unnecessary abort if you are preempted in favor of a thread that
> >> > > >> doesn't content for your lock.
> >> > > >>
> >> > > >> 3. Greatly improved debuggability.
> >> > > >>
> >> > > >> 4. With long critical sections and heavy load, you can improve
> >> > > >> performance by having several locks per cpu and choosing one at
> >> > > >> random.
> >> > > >>
> >> > > >> Is there a reason that a scheme like this doesn't work?
> >> > > >
> >> > > > What do you mean exactly by "atomically check that lock is not
> >> > > > held and, if so, mark it held" ? Do you imply using a lock-prefixed
> >> > > > atomic operation ?
> >> > >
> >> > > Yes.
> >> > >
> >> > > >
> >> > > > The goal of this whole restart section approach is to allow grabbing
> >> > > > a lock (or doing other sequences of operations ending with a single
> >> > > > store) on per-cpu data without having to use slow lock-prefixed
> >> > > > atomic operations.
> >> > >
> >> > > Ah, ok, I assumed it was to allow multiple threads to work in parallel.
> >> > >
> >> > > How arch-specific are you willing to be?
> >> >
> >> > I'd want this to be usable on every major architectures.
> >> >
> >> > > On x86, it might be possible
> >> > > to play some GDT games so that an unlocked xchg relative
> >> >
> >> > AFAIK, there is no such thing as an unlocked xchg. xchg always
> >> > imply the lock prefix on x86. I guess you mean cmpxchg here.
> >> >
> >>
> >> Right, got my special cases mixed up.
> >>
> >> I wonder if we could instead have a vdso function that did something like:
> >>
> >> unsigned long __vdso_cpu_local_exchange(unsigned long *base, int
> >> shift, unsigned long newval)
> >> {
> >>   unsigned long *ptr = base + (cpu << shift);
> >>   unsigned long old = *ptr;
> >>   *ptr = new;
> >>   return *ptr;
> >> }
> >>
> >> I think this primitive would be sufficient to let user code do the
> >> rest.  There might be other more simple primitives that would work.
> >> It could be implemented by fiddling with IP ranges, but we could
> >> change the implementation later without breaking anything.  The only
> >> really hard part would be efficiently figuring out what CPU we're on.
> >
> > The "fiddling with IP ranges" is where the restart sections come into
> > play. Paul Turner's approach indeed knows about IP ranges, and performs
> > the restart from the kernel. My alternative approach uses a signal and
> > page protection in user-space to reach the same result. It appears that
> > CONFIG_PREEMPT kernels are difficult to handle with Paul's approach, so
> > perhaps we could combine our approaches to get the best of both.
>
> I'm not sure why CONFIG_PREEMPT would matter.  What am I missing?
>
> Doing this in the vdso has some sneaky benefits: rather than aborting
> a very short vdso-based primitive on context switch, we could just fix
> it up in the kernel and skip ahead to the end.

I might be guilty of being too x86-centric here.  On x86, as long as
the lock and unlock primitives are sufficiently atomic, everything
should be okay.  On other architectures, though, a primitive that
gives lock, unlock, and abort of a per-cpu lock without checking that
you're still on the right cpu at unlock time may not be sufficient.
If the primitive is implemented purely with loads and stores, then
even if you take the lock, migrate, finish your work, and unlock
without anyone else contending from the lock (and hence don't abort),
the next thread to take the same lock will end up unsynchronized
unless there's appropriate memory ordering.  For example, if taking
the lock were an acquire and unlocking were a release, we'd be fine.

Your RFC design certainly works (in principle -- I haven't looked at
the code in detail), but I can't shake the feeling that it's overkill
and that it could be improved to avoid unnecessary aborts every time
the lock holder is scheduled out.

This isn't a problem in your RFC design, but if we wanted to come up
with tighter primitives, we'd have to be quite careful to document
exactly what memory ordering guarantees they come with.

It may be that all architectures for which you care about the
performance boost already have efficient acquire and release
operations.  Certainly x86 does, and I don't know how fast the new ARM
instructions are, but I imagine they're pretty good.

It's too bad that not all architectures have a single-instruction
unlocked compare-and-exchange.

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

[parent not found: <CALCETrXzmO=fQC=UdCh5b0zWiGWAJScEtdT4QDJkoqLgtgEVig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
       [not found]                               ` <CALCETrXzmO=fQC=UdCh5b0zWiGWAJScEtdT4QDJkoqLgtgEVig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-26 21:04                                 ` Mathieu Desnoyers
       [not found]                                   ` <821493560.8531.1432674243321.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Mathieu Desnoyers @ 2015-05-26 21:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Borislav Petkov, H. Peter Anvin, Lai Jiangshan,
	Ben Maurer, Paul E. McKenney, Ingo Molnar, Josh Triplett,
	Andrew Morton, Michael Kerrisk, Linux API, Linux Kernel,
	Paul Turner, Peter Zijlstra, Linus Torvalds, Steven Rostedt,
	Andrew Hunter

----- Original Message -----
> On May 25, 2015 11:54 AM, "Andy Lutomirski" <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
[...]
> I might be guilty of being too x86-centric here.  On x86, as long as
> the lock and unlock primitives are sufficiently atomic, everything
> should be okay.  On other architectures, though, a primitive that
> gives lock, unlock, and abort of a per-cpu lock without checking that
> you're still on the right cpu at unlock time may not be sufficient.
> If the primitive is implemented purely with loads and stores, then
> even if you take the lock, migrate, finish your work, and unlock
> without anyone else contending from the lock (and hence don't abort),
> the next thread to take the same lock will end up unsynchronized
> unless there's appropriate memory ordering.  For example, if taking
> the lock were an acquire and unlocking were a release, we'd be fine.

If we look at x86-TSO, where stores reordered after loads is pretty
much the only thing we need to care about, we have:

  CPU 0                               CPU 1

                                      load, test lock[1]
                                      store lock[1]
                                      loads/stores to data[1]
                                      [preempted, migrate to cpu 0]
                                      [run other thread]
                                      load, test lock[1] (loop)
  loads/stores to data[1] (A)
  store 0 to lock[1]      (B)
                                      load, test lock[1]      (C)
                                      store lock[1]
                                      loads/stores to data[1] (D)

What we want to ensure is that load/stores to data[1] from CPU 0 and
1 don't get interleaved.

On CPU 0 doing unlock, loads and stores to data[1] (A) cannot be reordered
against the store to lock[1] (B) due to TSO (all we care about is stores
reordered after loads).

On CPU 1 grabbing the 2nd lock, we care about loads/store from/to
data[1] (C) being reordered before load, test of lock[1] (C). Again,
thanks to TSO, loads/stores from/to data[1] (D) cannot be reordered
wrt (C).

So on x86-TSO we should be OK, but as you say, we need to be careful
to have the right barriers in place on other architectures.

> 
> Your RFC design certainly works (in principle -- I haven't looked at
> the code in detail), but I can't shake the feeling that it's overkill

If we can find a simpler way to achieve the same result, I'm all ears :)

> and that it could be improved to avoid unnecessary aborts every time
> the lock holder is scheduled out.

The restart is only performed if preemption occurs when the thread is
trying to grab the lock. Once the thread has the lock (it's now the
lock holder), it will not be aborted even if it is migrated.

> 
> This isn't a problem in your RFC design, but if we wanted to come up
> with tighter primitives, we'd have to be quite careful to document
> exactly what memory ordering guarantees they come with.

Indeed.

> 
> It may be that all architectures for which you care about the
> performance boost already have efficient acquire and release
> operations.  Certainly x86 does, and I don't know how fast the new ARM
> instructions are, but I imagine they're pretty good.

We may even need memory barriers around lock and unlock on ARM. :/ ARM
smp_store_release() and smp_load_acquire() currently implements those with
smp_mb().

> 
> It's too bad that not all architectures have a single-instruction
> unlocked compare-and-exchange.

Based on my benchmarks, it's not clear that single-instruction
unlocked CAS is actually faster than doing the same with many
instructions.

Thanks,

Mathieu

> 
> --Andy
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

[parent not found: <821493560.8531.1432674243321.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>]

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
       [not found]                                   ` <821493560.8531.1432674243321.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2015-05-26 21:18                                     ` Andy Lutomirski
  2015-05-26 21:44                                       ` Andy Lutomirski
  0 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2015-05-26 21:18 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, Borislav Petkov, H. Peter Anvin, Lai Jiangshan,
	Ben Maurer, Paul E. McKenney, Ingo Molnar, Josh Triplett,
	Andrew Morton, Michael Kerrisk, Linux API, Linux Kernel,
	Paul Turner, Peter Zijlstra, Linus Torvalds, Steven Rostedt,
	Andrew Hunter

On Tue, May 26, 2015 at 2:04 PM, Mathieu Desnoyers
<mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> ----- Original Message -----
>> On May 25, 2015 11:54 AM, "Andy Lutomirski" <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> [...]
>> I might be guilty of being too x86-centric here.  On x86, as long as
>> the lock and unlock primitives are sufficiently atomic, everything
>> should be okay.  On other architectures, though, a primitive that
>> gives lock, unlock, and abort of a per-cpu lock without checking that
>> you're still on the right cpu at unlock time may not be sufficient.
>> If the primitive is implemented purely with loads and stores, then
>> even if you take the lock, migrate, finish your work, and unlock
>> without anyone else contending from the lock (and hence don't abort),
>> the next thread to take the same lock will end up unsynchronized
>> unless there's appropriate memory ordering.  For example, if taking
>> the lock were an acquire and unlocking were a release, we'd be fine.
>
> If we look at x86-TSO, where stores reordered after loads is pretty
> much the only thing we need to care about, we have:
>
>   CPU 0                               CPU 1
>
>                                       load, test lock[1]
>                                       store lock[1]
>                                       loads/stores to data[1]
>                                       [preempted, migrate to cpu 0]
>                                       [run other thread]
>                                       load, test lock[1] (loop)
>   loads/stores to data[1] (A)
>   store 0 to lock[1]      (B)
>                                       load, test lock[1]      (C)
>                                       store lock[1]
>                                       loads/stores to data[1] (D)
>
> What we want to ensure is that load/stores to data[1] from CPU 0 and
> 1 don't get interleaved.
>
> On CPU 0 doing unlock, loads and stores to data[1] (A) cannot be reordered
> against the store to lock[1] (B) due to TSO (all we care about is stores
> reordered after loads).
>
> On CPU 1 grabbing the 2nd lock, we care about loads/store from/to
> data[1] (C) being reordered before load, test of lock[1] (C). Again,
> thanks to TSO, loads/stores from/to data[1] (D) cannot be reordered
> wrt (C).
>
> So on x86-TSO we should be OK, but as you say, we need to be careful
> to have the right barriers in place on other architectures.
>
>>
>> Your RFC design certainly works (in principle -- I haven't looked at
>> the code in detail), but I can't shake the feeling that it's overkill
>
> If we can find a simpler way to achieve the same result, I'm all ears :)
>
>> and that it could be improved to avoid unnecessary aborts every time
>> the lock holder is scheduled out.
>
> The restart is only performed if preemption occurs when the thread is
> trying to grab the lock. Once the thread has the lock (it's now the
> lock holder), it will not be aborted even if it is migrated.

Then maybe this does have the same acquire/release issue after all.

>
>>
>> This isn't a problem in your RFC design, but if we wanted to come up
>> with tighter primitives, we'd have to be quite careful to document
>> exactly what memory ordering guarantees they come with.
>
> Indeed.
>
>>
>> It may be that all architectures for which you care about the
>> performance boost already have efficient acquire and release
>> operations.  Certainly x86 does, and I don't know how fast the new ARM
>> instructions are, but I imagine they're pretty good.
>
> We may even need memory barriers around lock and unlock on ARM. :/ ARM
> smp_store_release() and smp_load_acquire() currently implements those with
> smp_mb().

I would have expected it to use LDA/STL on new enough cpus:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/CIHDEACA.html


>
>>
>> It's too bad that not all architectures have a single-instruction
>> unlocked compare-and-exchange.
>
> Based on my benchmarks, it's not clear that single-instruction
> unlocked CAS is actually faster than doing the same with many
> instructions.

True, but with a single instruction the user can't get preempted in the middle.

Looking at your code, it looks like percpu_user_sched_in has some
potentially nasty issues with page faults.  Avoiding touching user
memory from the scheduler would be quite nice from an implementation
POV, and the x86-specific gs hack wins in that regard.

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
  2015-05-26 21:18                                     ` Andy Lutomirski
@ 2015-05-26 21:44                                       ` Andy Lutomirski
  0 siblings, 0 replies; 17+ messages in thread
From: Andy Lutomirski @ 2015-05-26 21:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, Borislav Petkov, H. Peter Anvin, Lai Jiangshan,
	Ben Maurer, Paul E. McKenney, Ingo Molnar, Josh Triplett,
	Andrew Morton, Michael Kerrisk, Linux API, Linux Kernel,
	Paul Turner, Peter Zijlstra, Linus Torvalds, Steven Rostedt,
	Andrew Hunter

On Tue, May 26, 2015 at 2:18 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, May 26, 2015 at 2:04 PM, Mathieu Desnoyers
>>
>>>
>>> It's too bad that not all architectures have a single-instruction
>>> unlocked compare-and-exchange.
>>
>> Based on my benchmarks, it's not clear that single-instruction
>> unlocked CAS is actually faster than doing the same with many
>> instructions.
>
> True, but with a single instruction the user can't get preempted in the middle.
>
> Looking at your code, it looks like percpu_user_sched_in has some
> potentially nasty issues with page faults.  Avoiding touching user
> memory from the scheduler would be quite nice from an implementation
> POV, and the x86-specific gs hack wins in that regard.

ARM has "TLB lockdown entries" which could, I think, be used to
implement per-cpu or per-thread mappings.  I'm actually rather
surprised that Linux doesn't already use a TLB lockdown entry for TLS.
(Hmm.  Maybe it's because the interface to write the entries requires
actually touching the page.  Maybe not -- the ARM docs, in general,
seem to be much less clear than the Intel and AMD docs.)

ARM doesn't seem to have any single-instruction compare-exchange or
similar instruction, though, so this might be all that useful.  On the
other hand, ARM can probably do reasonably efficient per-cpu memory
allocation and such with a single ldrex/strex pair.

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
       [not found]                           ` <CALCETrW3_Hv0jc3cpiwsHTinBqJzvab_EiPS8BVJhX-xe5D8qw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-05-26 19:57                             ` Andy Lutomirski
@ 2015-05-26 20:38                             ` Mathieu Desnoyers
       [not found]                               ` <933886515.8478.1432672739485.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 17+ messages in thread
From: Mathieu Desnoyers @ 2015-05-26 20:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Andi Kleen, Borislav Petkov, Lai Jiangshan,
	Paul E. McKenney, Ben Maurer, Josh Triplett, Ingo Molnar,
	Andrew Morton, Linux API, Michael Kerrisk, Linux Kernel,
	Linus Torvalds, Peter Zijlstra, Paul Turner, Steven Rostedt,
	Andrew Hunter

----- Original Message -----
> [cc: hpa, Borislav and Andi]
> 
> On Mon, May 25, 2015 at 11:30 AM, Mathieu Desnoyers
> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> > ----- Original Message -----
> >> On May 23, 2015 10:09 AM, "Mathieu Desnoyers"
> >> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> >> >
> >> > ----- Original Message -----
> >> > > On Fri, May 22, 2015 at 2:34 PM, Mathieu Desnoyers
> >> > > <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> >> > > > ----- Original Message -----
> >> > > >> On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk
> >> > > >> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> >> > > >> wrote:
> >> > > >> > [CC += linux-api@]
> >> > > >> >
> >> > > >> > On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers
> >> > > >> > <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> >> > > >> >> Expose a new system call allowing userspace threads to register
> >> > > >> >> a TLS area used as an ABI between the kernel and userspace to
> >> > > >> >> share information required to create efficient per-cpu critical
> >> > > >> >> sections in user-space.
> >> > > >> >>
> >> > > >> >> This ABI consists of a thread-local structure containing:
> >> > > >> >>
> >> > > >> >> - a nesting count surrounding the critical section,
> >> > > >> >> - a signal number to be sent to the thread when preempting a
> >> > > >> >> thread
> >> > > >> >>   with non-zero nesting count,
> >> > > >> >> - a flag indicating whether the signal has been sent within the
> >> > > >> >>   critical section,
> >> > > >> >> - an integer where to store the current CPU number, updated
> >> > > >> >> whenever
> >> > > >> >>   the thread is preempted. This CPU number cache is not strictly
> >> > > >> >>   needed, but performs better than getcpu vdso.
> >> > > >> >>
> >> > > >> >> This approach is inspired by Paul Turner and Andrew Hunter's
> >> > > >> >> work
> >> > > >> >> on percpu atomics, which lets the kernel handle restart of
> >> > > >> >> critical
> >> > > >> >> sections, ref.
> >> > > >> >> http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
> >> > > >> >>
> >> > > >> >> What is done differently here compared to percpu atomics: we
> >> > > >> >> track
> >> > > >> >> a single nesting counter per thread rather than many ranges of
> >> > > >> >> instruction pointer values. We deliver a signal to user-space
> >> > > >> >> and
> >> > > >> >> let the logic of restart be handled in user-space, thus moving
> >> > > >> >> the complexity out of the kernel. The nesting counter approach
> >> > > >> >> allows us to skip the complexity of interacting with signals
> >> > > >> >> that
> >> > > >> >> would be otherwise needed with the percpu atomics approach,
> >> > > >> >> which
> >> > > >> >> needs to know which instruction pointers are preempted,
> >> > > >> >> including
> >> > > >> >> when preemption occurs on a signal handler nested over an
> >> > > >> >> instruction
> >> > > >> >> pointer of interest.
> >> > > >> >>
> >> > > >>
> >> > > >> I talked about this kind of thing with PeterZ at LSF/MM, and I was
> >> > > >> unable to convince myself that the kernel needs to help at all.  To
> >> > > >> do
> >> > > >> this without kernel help, I want to relax the requirements
> >> > > >> slightly.
> >> > > >> With true per-cpu atomic sections, you have a guarantee that you
> >> > > >> are
> >> > > >> either really running on the same CPU for the entire duration of
> >> > > >> the
> >> > > >> atomic section or you abort.  I propose a weaker primitive: you
> >> > > >> acquire one of an array of locks (probably one per cpu), and you
> >> > > >> are
> >> > > >> guaranteed that, if you don't abort, no one else acquires the same
> >> > > >> lock while you hold it.
> >> > > >
> >> > > > In my proof of concept (https://github.com/compudj/percpu-dev) I
> >> > > > actually implement an array of per-cpu lock. The issue here boils
> >> > > > down to grabbing this per-cpu lock efficiently. Once the lock is
> >> > > > taken,
> >> > > > the thread has exclusive access to that per-cpu lock, even if it
> >> > > > migrates.
> >> > > >
> >> > > >>  Here's how:
> >> > > >>
> >> > > >> Create an array of user-managed locks, one per cpu.  Call them
> >> > > >> lock[i]
> >> > > >> for 0 <= i < ncpus.
> >> > > >>
> >> > > >> To acquire, look up your CPU number.  Then, atomically, check that
> >> > > >> lock[cpu] isn't held and, if so, mark it held and record both your
> >> > > >> tid
> >> > > >> and your lock acquisition count.  If you learn that the lock *was*
> >> > > >> held after all, signal the holder (with kill or your favorite other
> >> > > >> mechanism), telling it which lock acquisition count is being
> >> > > >> aborted.
> >> > > >> Then atomically steal the lock, but only if the lock acquisition
> >> > > >> count
> >> > > >> hasn't changed.
> >> > > >>
> >> > > >> This has a few benefits over the in-kernel approach:
> >> > > >>
> >> > > >> 1. No kernel patch.
> >> > > >>
> >> > > >> 2. No unnecessary abort if you are preempted in favor of a thread
> >> > > >> that
> >> > > >> doesn't content for your lock.
> >> > > >>
> >> > > >> 3. Greatly improved debuggability.
> >> > > >>
> >> > > >> 4. With long critical sections and heavy load, you can improve
> >> > > >> performance by having several locks per cpu and choosing one at
> >> > > >> random.
> >> > > >>
> >> > > >> Is there a reason that a scheme like this doesn't work?
> >> > > >
> >> > > > What do you mean exactly by "atomically check that lock is not
> >> > > > held and, if so, mark it held" ? Do you imply using a lock-prefixed
> >> > > > atomic operation ?
> >> > >
> >> > > Yes.
> >> > >
> >> > > >
> >> > > > The goal of this whole restart section approach is to allow grabbing
> >> > > > a lock (or doing other sequences of operations ending with a single
> >> > > > store) on per-cpu data without having to use slow lock-prefixed
> >> > > > atomic operations.
> >> > >
> >> > > Ah, ok, I assumed it was to allow multiple threads to work in
> >> > > parallel.
> >> > >
> >> > > How arch-specific are you willing to be?
> >> >
> >> > I'd want this to be usable on every major architectures.
> >> >
> >> > > On x86, it might be possible
> >> > > to play some GDT games so that an unlocked xchg relative
> >> >
> >> > AFAIK, there is no such thing as an unlocked xchg. xchg always
> >> > imply the lock prefix on x86. I guess you mean cmpxchg here.
> >> >
> >>
> >> Right, got my special cases mixed up.
> >>
> >> I wonder if we could instead have a vdso function that did something like:
> >>
> >> unsigned long __vdso_cpu_local_exchange(unsigned long *base, int
> >> shift, unsigned long newval)
> >> {
> >>   unsigned long *ptr = base + (cpu << shift);
> >>   unsigned long old = *ptr;
> >>   *ptr = new;
> >>   return *ptr;
> >> }
> >>
> >> I think this primitive would be sufficient to let user code do the
> >> rest.  There might be other more simple primitives that would work.
> >> It could be implemented by fiddling with IP ranges, but we could
> >> change the implementation later without breaking anything.  The only
> >> really hard part would be efficiently figuring out what CPU we're on.
> >
> > The "fiddling with IP ranges" is where the restart sections come into
> > play. Paul Turner's approach indeed knows about IP ranges, and performs
> > the restart from the kernel. My alternative approach uses a signal and
> > page protection in user-space to reach the same result. It appears that
> > CONFIG_PREEMPT kernels are difficult to handle with Paul's approach, so
> > perhaps we could combine our approaches to get the best of both.
> 
> I'm not sure why CONFIG_PREEMPT would matter.  What am I missing?

With CONFIG_PREEMPT, the scheduler can preempt the kernel while
it's running a page fault, or a system call, on behalf of a thread.

So in addition of having to check which instruction pointer is preempted,
and which IP is having a signal delivered on, we may need to add a check
when entering into the kernel on pretty much every path, including
page faults and system calls. I presume the interrupt handler is OK,
provided that interrupt handlers cannot be preempted, and that the explicit
preempt check is done before the interrupt handler returns to user-space,
AFAIU passing the user-space pt_regs, which should be OK already.

I may very have missed other subtleties of issues related to pjt's approach
with CONFIG_PREEMPT. Hopefully he can enlighten us with the missing parts. 

> 
> Doing this in the vdso has some sneaky benefits: rather than aborting
> a very short vdso-based primitive on context switch, we could just fix
> it up in the kernel and skip ahead to the end.

The fixup rather than restart idea is quite interesting. However, it may
require to do the fixup in the kernel before being migrated, and I'm not
sure it's that easy to find a preemptable spot where we can do the user
accesses needed for the fixup before migration. One advantage of the
"restart" approach is that we can perform the user-accesses after the
migration, which allow us to call that code right before return to
userspace. Unfortunately, doing the fixup from the kernel after the
migration might be tricky, even if we use a lock-atomic op in the fixup,
because we might be doing a store concurrently with a non-locked atomic
op running in userspace on the original CPU.

Here is an idea I'm not totally proud of, but might work if we really
want to do this without restarting userspace: we could expose a vdso
that both performs an atomic operation on an array of pointers to
per-cpu values, and gets the cpu number:

void __vdso_cmpxchg_getcpu(unsigned long **p, unsigned long _old,
    unsigned long _new, int *cpu);

If we preempt this vdso, the in-kernel fixup simply grabs the current CPU
number and updates the right array entry, all this locally. The advantage
here is that we can perform this after migration. It comes at the expense
of an indirection through the pointer array, which I rather dislike. If
we can rather find a way to perform the userspace fixup before migration,
it would be much better.

> 
> >
> >>
> >> FWIW, 'xchg' to cache-hot memory is only about 20 cycles on my laptop,
> >> and cmpxchg seems to be about 6 cycles.  Both are faster than getcpu.
> >> How much are we really saving with any of this over the pure userspace
> >> approach?  I think that the most that the kernel can possibly help us
> >> is to give us a faster getcpu and to help us deal with being migrated
> >> to a different cpu if we want to avoid expensive operations and don't
> >> want to play unlocked cmpxchg tricks.
> >
> > You appear to summarize the two things that are added to the kernel with
> > this RFC patch:
> > - faster getcpu(): 12.7 ns -> 0.3 ns
> 
> Yeah, I figured that out the second time I read your email and re-read
> the original message, but I apparently forgot to fix up the email I
> was typing.
> 
> > - allow using less expensive operations to perform per-cpu-atomic ops:
> >   (on my system):
> >   - lock; cmpxchg (18 ns) -> load-test-branch-store (5.4 ns)
> >
> 
> When you say "load-test-branch-store", do you mean that sequence of
> normal code or a literal cmpxchg?  On x86, we really can pull off the
> sneaky trick of gs-relative cmpxchg in a single instruction, which
> gives us per-cpu atomic locking without any special kernel fixups,
> although that locks down gs and isn't currently supported.

It's a sequence of normal code. Ideally I'd want this to be doable
on other architectures too (powerpc and ARM at least).

> 
> IIRC some other OS's use gs as a userspace per-cpu pointer instead of
> a per-thread pointer.  Would we want to consider enabling that?
> This may interact oddly with the WIP wrgsbase stuff.  (Grr, Intel, why
> couldn't you have made wrgsbase reset the gs selector to zero?  That
> would have been so much nicer.)
> 
> >>
> >> I was wrong about set_thread_area giving us efficient per-cpu data,
> >> BTW, although it would be easy to add a similar feature that would
> >> give us a per-cpu segment base on x86.
> >
> > ok
> >
> 
> >>
> >> This ABI is kind of unfortunate in that it can't be shared between
> >> multiple libraries.
> >
> > Well ideally you'd want one library to implement this (e.g. glibc), and
> > other libraries to use it. I'm not sure doing a linked list of TLS entries
> > per thread would really be useful here.
> 
> True, but that may not play so well with tracing injection.
> 
> At least a pure vdso approach is guaranteed not to have this problem.

Yep.

Thanks!

Mathieu

> 
> 
> >> Oh, you're delivering a signal every time you're preempted, even
> >> outside the critical section?  That sounds expensive.
> >
> > No, the signal is only delivered if the kernel preempts the critical
> > section.
> 
> As noted above, I meant to delete this before sending :)
> 
> --Andy
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

[parent not found: <933886515.8478.1432672739485.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>]

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
       [not found]                               ` <933886515.8478.1432672739485.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
@ 2015-05-26 20:58                                 ` Andy Lutomirski
  0 siblings, 0 replies; 17+ messages in thread
From: Andy Lutomirski @ 2015-05-26 20:58 UTC (permalink / raw)
  To: Mathieu Desnoyers, Konstantin Khlebnikov
  Cc: H. Peter Anvin, Andi Kleen, Borislav Petkov, Lai Jiangshan,
	Paul E. McKenney, Ben Maurer, Josh Triplett, Ingo Molnar,
	Andrew Morton, Linux API, Michael Kerrisk, Linux Kernel,
	Linus Torvalds, Peter Zijlstra, Paul Turner, Steven Rostedt,
	Andrew Hunter

On Tue, May 26, 2015 at 1:38 PM, Mathieu Desnoyers
<mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
> ----- Original Message -----
>> [cc: hpa, Borislav and Andi]
>>
>> On Mon, May 25, 2015 at 11:30 AM, Mathieu Desnoyers
>> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>> > ----- Original Message -----
>> >> On May 23, 2015 10:09 AM, "Mathieu Desnoyers"
>> >> <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>> >> >
>> >> > ----- Original Message -----
>> >> > > On Fri, May 22, 2015 at 2:34 PM, Mathieu Desnoyers
>> >> > > <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>> >> > > > ----- Original Message -----
>> >> > > >> On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk
>> >> > > >> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>> >> > > >> wrote:
>> >> > > >> > [CC += linux-api@]
>> >> > > >> >
>> >> > > >> > On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers
>> >> > > >> > <mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org> wrote:
>> >> > > >> >> Expose a new system call allowing userspace threads to register
>> >> > > >> >> a TLS area used as an ABI between the kernel and userspace to
>> >> > > >> >> share information required to create efficient per-cpu critical
>> >> > > >> >> sections in user-space.
>> >> > > >> >>
>> >> > > >> >> This ABI consists of a thread-local structure containing:
>> >> > > >> >>
>> >> > > >> >> - a nesting count surrounding the critical section,
>> >> > > >> >> - a signal number to be sent to the thread when preempting a
>> >> > > >> >> thread
>> >> > > >> >>   with non-zero nesting count,
>> >> > > >> >> - a flag indicating whether the signal has been sent within the
>> >> > > >> >>   critical section,
>> >> > > >> >> - an integer where to store the current CPU number, updated
>> >> > > >> >> whenever
>> >> > > >> >>   the thread is preempted. This CPU number cache is not strictly
>> >> > > >> >>   needed, but performs better than getcpu vdso.
>> >> > > >> >>
>> >> > > >> >> This approach is inspired by Paul Turner and Andrew Hunter's
>> >> > > >> >> work
>> >> > > >> >> on percpu atomics, which lets the kernel handle restart of
>> >> > > >> >> critical
>> >> > > >> >> sections, ref.
>> >> > > >> >> http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
>> >> > > >> >>
>> >> > > >> >> What is done differently here compared to percpu atomics: we
>> >> > > >> >> track
>> >> > > >> >> a single nesting counter per thread rather than many ranges of
>> >> > > >> >> instruction pointer values. We deliver a signal to user-space
>> >> > > >> >> and
>> >> > > >> >> let the logic of restart be handled in user-space, thus moving
>> >> > > >> >> the complexity out of the kernel. The nesting counter approach
>> >> > > >> >> allows us to skip the complexity of interacting with signals
>> >> > > >> >> that
>> >> > > >> >> would be otherwise needed with the percpu atomics approach,
>> >> > > >> >> which
>> >> > > >> >> needs to know which instruction pointers are preempted,
>> >> > > >> >> including
>> >> > > >> >> when preemption occurs on a signal handler nested over an
>> >> > > >> >> instruction
>> >> > > >> >> pointer of interest.
>> >> > > >> >>
>> >> > > >>
>> >> > > >> I talked about this kind of thing with PeterZ at LSF/MM, and I was
>> >> > > >> unable to convince myself that the kernel needs to help at all.  To
>> >> > > >> do
>> >> > > >> this without kernel help, I want to relax the requirements
>> >> > > >> slightly.
>> >> > > >> With true per-cpu atomic sections, you have a guarantee that you
>> >> > > >> are
>> >> > > >> either really running on the same CPU for the entire duration of
>> >> > > >> the
>> >> > > >> atomic section or you abort.  I propose a weaker primitive: you
>> >> > > >> acquire one of an array of locks (probably one per cpu), and you
>> >> > > >> are
>> >> > > >> guaranteed that, if you don't abort, no one else acquires the same
>> >> > > >> lock while you hold it.
>> >> > > >
>> >> > > > In my proof of concept (https://github.com/compudj/percpu-dev) I
>> >> > > > actually implement an array of per-cpu lock. The issue here boils
>> >> > > > down to grabbing this per-cpu lock efficiently. Once the lock is
>> >> > > > taken,
>> >> > > > the thread has exclusive access to that per-cpu lock, even if it
>> >> > > > migrates.
>> >> > > >
>> >> > > >>  Here's how:
>> >> > > >>
>> >> > > >> Create an array of user-managed locks, one per cpu.  Call them
>> >> > > >> lock[i]
>> >> > > >> for 0 <= i < ncpus.
>> >> > > >>
>> >> > > >> To acquire, look up your CPU number.  Then, atomically, check that
>> >> > > >> lock[cpu] isn't held and, if so, mark it held and record both your
>> >> > > >> tid
>> >> > > >> and your lock acquisition count.  If you learn that the lock *was*
>> >> > > >> held after all, signal the holder (with kill or your favorite other
>> >> > > >> mechanism), telling it which lock acquisition count is being
>> >> > > >> aborted.
>> >> > > >> Then atomically steal the lock, but only if the lock acquisition
>> >> > > >> count
>> >> > > >> hasn't changed.
>> >> > > >>
>> >> > > >> This has a few benefits over the in-kernel approach:
>> >> > > >>
>> >> > > >> 1. No kernel patch.
>> >> > > >>
>> >> > > >> 2. No unnecessary abort if you are preempted in favor of a thread
>> >> > > >> that
>> >> > > >> doesn't content for your lock.
>> >> > > >>
>> >> > > >> 3. Greatly improved debuggability.
>> >> > > >>
>> >> > > >> 4. With long critical sections and heavy load, you can improve
>> >> > > >> performance by having several locks per cpu and choosing one at
>> >> > > >> random.
>> >> > > >>
>> >> > > >> Is there a reason that a scheme like this doesn't work?
>> >> > > >
>> >> > > > What do you mean exactly by "atomically check that lock is not
>> >> > > > held and, if so, mark it held" ? Do you imply using a lock-prefixed
>> >> > > > atomic operation ?
>> >> > >
>> >> > > Yes.
>> >> > >
>> >> > > >
>> >> > > > The goal of this whole restart section approach is to allow grabbing
>> >> > > > a lock (or doing other sequences of operations ending with a single
>> >> > > > store) on per-cpu data without having to use slow lock-prefixed
>> >> > > > atomic operations.
>> >> > >
>> >> > > Ah, ok, I assumed it was to allow multiple threads to work in
>> >> > > parallel.
>> >> > >
>> >> > > How arch-specific are you willing to be?
>> >> >
>> >> > I'd want this to be usable on every major architectures.
>> >> >
>> >> > > On x86, it might be possible
>> >> > > to play some GDT games so that an unlocked xchg relative
>> >> >
>> >> > AFAIK, there is no such thing as an unlocked xchg. xchg always
>> >> > imply the lock prefix on x86. I guess you mean cmpxchg here.
>> >> >
>> >>
>> >> Right, got my special cases mixed up.
>> >>
>> >> I wonder if we could instead have a vdso function that did something like:
>> >>
>> >> unsigned long __vdso_cpu_local_exchange(unsigned long *base, int
>> >> shift, unsigned long newval)
>> >> {
>> >>   unsigned long *ptr = base + (cpu << shift);
>> >>   unsigned long old = *ptr;
>> >>   *ptr = new;
>> >>   return *ptr;
>> >> }
>> >>
>> >> I think this primitive would be sufficient to let user code do the
>> >> rest.  There might be other more simple primitives that would work.
>> >> It could be implemented by fiddling with IP ranges, but we could
>> >> change the implementation later without breaking anything.  The only
>> >> really hard part would be efficiently figuring out what CPU we're on.
>> >
>> > The "fiddling with IP ranges" is where the restart sections come into
>> > play. Paul Turner's approach indeed knows about IP ranges, and performs
>> > the restart from the kernel. My alternative approach uses a signal and
>> > page protection in user-space to reach the same result. It appears that
>> > CONFIG_PREEMPT kernels are difficult to handle with Paul's approach, so
>> > perhaps we could combine our approaches to get the best of both.
>>
>> I'm not sure why CONFIG_PREEMPT would matter.  What am I missing?
>
> With CONFIG_PREEMPT, the scheduler can preempt the kernel while
> it's running a page fault, or a system call, on behalf of a thread.
>
> So in addition of having to check which instruction pointer is preempted,
> and which IP is having a signal delivered on, we may need to add a check
> when entering into the kernel on pretty much every path, including
> page faults and system calls. I presume the interrupt handler is OK,
> provided that interrupt handlers cannot be preempted, and that the explicit
> preempt check is done before the interrupt handler returns to user-space,
> AFAIU passing the user-space pt_regs, which should be OK already.
>
> I may very have missed other subtleties of issues related to pjt's approach
> with CONFIG_PREEMPT. Hopefully he can enlighten us with the missing parts.

Oh, right.

FWIW, we don't really have accurate tracking of the original user
state before the outermost nested entry.  There's perf_get_regs_user,
but that's a hack.  On the other hand, we only care about potentially
preemptable entries, so no NMI, but even machine checks can sort of
sleep these days (even on non-CONFIG_PREEMPT).

Yuck.

>
>>
>> Doing this in the vdso has some sneaky benefits: rather than aborting
>> a very short vdso-based primitive on context switch, we could just fix
>> it up in the kernel and skip ahead to the end.
>
> The fixup rather than restart idea is quite interesting. However, it may
> require to do the fixup in the kernel before being migrated, and I'm not
> sure it's that easy to find a preemptable spot where we can do the user
> accesses needed for the fixup before migration.

Isn't schedule() itself a reasonable spot?  We could wrap it in
pagefault_disable() and presumably find some reasonable way to retry
if the fault fails.  We also might not need user vm access at all --
pt_regs could be sufficient.  But the same CONFIG_PREEMPT caveat
applies, I think.

> One advantage of the
> "restart" approach is that we can perform the user-accesses after the
> migration, which allow us to call that code right before return to
> userspace. Unfortunately, doing the fixup from the kernel after the
> migration might be tricky, even if we use a lock-atomic op in the fixup,
> because we might be doing a store concurrently with a non-locked atomic
> op running in userspace on the original CPU.

But we can check what cpu we're on if we're in the kernel.  In fact,
once the x86 exit-to-userspace code gets rewritten (working on that,
but maybe not ready for 4.2), this would actually be straightforward,
modulo forcing a slower exit path on migration if we don't have a good
way to exclude migrations that don't matter.

>
> Here is an idea I'm not totally proud of, but might work if we really
> want to do this without restarting userspace: we could expose a vdso
> that both performs an atomic operation on an array of pointers to
> per-cpu values, and gets the cpu number:
>
> void __vdso_cmpxchg_getcpu(unsigned long **p, unsigned long _old,
>     unsigned long _new, int *cpu);
>
> If we preempt this vdso, the in-kernel fixup simply grabs the current CPU
> number and updates the right array entry, all this locally. The advantage
> here is that we can perform this after migration. It comes at the expense
> of an indirection through the pointer array, which I rather dislike. If
> we can rather find a way to perform the userspace fixup before migration,
> it would be much better.

Sneaky.  I think that's actually kind of neat, except that it's
incompatible with the x86 gs-based cmpxchg trick.

FWIW, I should probably stop complaining about having some per-process
non-library-compatible state.  We already have that with
set_robust_list.  If we want to have userspace register a per-thread
virtual address at which the thread's cpu number lives, so be it.  How
do we handle a page fault on migration, though?

Grumble.  I really wish that we had per-cpu virtual memory mappings.
Oh, well.  Also, I think PeterZ's opinion on anything that mucks with
the scheduler is critical here.

>
>>
>> >
>> >>
>> >> FWIW, 'xchg' to cache-hot memory is only about 20 cycles on my laptop,
>> >> and cmpxchg seems to be about 6 cycles.  Both are faster than getcpu.
>> >> How much are we really saving with any of this over the pure userspace
>> >> approach?  I think that the most that the kernel can possibly help us
>> >> is to give us a faster getcpu and to help us deal with being migrated
>> >> to a different cpu if we want to avoid expensive operations and don't
>> >> want to play unlocked cmpxchg tricks.
>> >
>> > You appear to summarize the two things that are added to the kernel with
>> > this RFC patch:
>> > - faster getcpu(): 12.7 ns -> 0.3 ns
>>
>> Yeah, I figured that out the second time I read your email and re-read
>> the original message, but I apparently forgot to fix up the email I
>> was typing.
>>
>> > - allow using less expensive operations to perform per-cpu-atomic ops:
>> >   (on my system):
>> >   - lock; cmpxchg (18 ns) -> load-test-branch-store (5.4 ns)
>> >
>>
>> When you say "load-test-branch-store", do you mean that sequence of
>> normal code or a literal cmpxchg?  On x86, we really can pull off the
>> sneaky trick of gs-relative cmpxchg in a single instruction, which
>> gives us per-cpu atomic locking without any special kernel fixups,
>> although that locks down gs and isn't currently supported.
>
> It's a sequence of normal code. Ideally I'd want this to be doable
> on other architectures too (powerpc and ARM at least).

Agreed.

BTW, I just found this thread:

http://comments.gmane.org/gmane.linux.kernel/1786674

this is kind of like my sort-of-proposal for using gs.

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
  2015-05-25 18:54                         ` Andy Lutomirski
       [not found]                           ` <CALCETrW3_Hv0jc3cpiwsHTinBqJzvab_EiPS8BVJhX-xe5D8qw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-26 21:20                           ` Andi Kleen
       [not found]                             ` <20150526212041.GQ19417-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
  1 sibling, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2015-05-26 21:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathieu Desnoyers, H. Peter Anvin, Andi Kleen, Borislav Petkov,
	Lai Jiangshan, Paul E. McKenney, Ben Maurer, Josh Triplett,
	Ingo Molnar, Andrew Morton, Linux API, Michael Kerrisk,
	Linux Kernel, Linus Torvalds, Peter Zijlstra, Paul Turner,
	Steven Rostedt, Andrew Hunter


I haven't thought too much about it, but doing it in a vdso seems
reasonable. It would mean that the scheduler hot path would
need to know about its address though.

> > You appear to summarize the two things that are added to the kernel with
> > this RFC patch:
> > - faster getcpu(): 12.7 ns -> 0.3 ns
> 
> Yeah, I figured that out the second time I read your email and re-read
> the original message, but I apparently forgot to fix up the email I
> was typing.

The high speed users seem to have all switched to using LSL directly,
which is even faster. Since there are already users it cannot be changed
anymore. Should probably just document this as an official ABI,
and provide some standard include file with inline code.

> IIRC some other OS's use gs as a userspace per-cpu pointer instead of
> a per-thread pointer.  Would we want to consider enabling that?
> This may interact oddly with the WIP wrgsbase stuff.  (Grr, Intel, why
> couldn't you have made wrgsbase reset the gs selector to zero?  That
> would have been so much nicer.)

Applications really want the extra address registers. For example
OpenJDK could free a GPR. And it's needed for the TLS of user space
threads, which are becoming more and more popular. So there are
already more important users queued.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 17+ messages in thread

[parent not found: <20150526212041.GQ19417-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>]

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
       [not found]                             ` <20150526212041.GQ19417-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
@ 2015-05-26 21:26                               ` Andy Lutomirski
  0 siblings, 0 replies; 17+ messages in thread
From: Andy Lutomirski @ 2015-05-26 21:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mathieu Desnoyers, H. Peter Anvin, Borislav Petkov, Lai Jiangshan,
	Paul E. McKenney, Ben Maurer, Josh Triplett, Ingo Molnar,
	Andrew Morton, Linux API, Michael Kerrisk, Linux Kernel,
	Linus Torvalds, Peter Zijlstra, Paul Turner, Steven Rostedt,
	Andrew Hunter

On Tue, May 26, 2015 at 2:20 PM, Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org> wrote:
>
> I haven't thought too much about it, but doing it in a vdso seems
> reasonable. It would mean that the scheduler hot path would
> need to know about its address though.
>
>> > You appear to summarize the two things that are added to the kernel with
>> > this RFC patch:
>> > - faster getcpu(): 12.7 ns -> 0.3 ns
>>
>> Yeah, I figured that out the second time I read your email and re-read
>> the original message, but I apparently forgot to fix up the email I
>> was typing.
>
> The high speed users seem to have all switched to using LSL directly,
> which is even faster. Since there are already users it cannot be changed
> anymore. Should probably just document this as an official ABI,
> and provide some standard include file with inline code.
>
>> IIRC some other OS's use gs as a userspace per-cpu pointer instead of
>> a per-thread pointer.  Would we want to consider enabling that?
>> This may interact oddly with the WIP wrgsbase stuff.  (Grr, Intel, why
>> couldn't you have made wrgsbase reset the gs selector to zero?  That
>> would have been so much nicer.)
>
> Applications really want the extra address registers. For example
> OpenJDK could free a GPR. And it's needed for the TLS of user space
> threads, which are becoming more and more popular. So there are
> already more important users queued.

This isn't necessarily the end of the world.  They could turn it off
or hopefully repurpose or extend their FS usage instead of requiring
GS.

IOW, I think that ARCH_SET_GS should continue working, as should
wrgsbase.  We could allow a program, strictly as an alternative, to do
ARCH_SET_GS_PERCPU or something.  (If we do that, we should wire up
arch_prctl for x86_32 as well.  We could further consider restricting
the new per-cpu mode to require a specific *nonzero* value for the
selector.  Also, it's really annoying that we'll be forced to assign
some meaning to a non-zero selector with unexpected gsbase during
context switch.)

--Andy

^ permalink raw reply	[flat|nested] 17+ messages in thread

[parent not found: <CALCETrUSBqHG3tbOq1yFz33v1_ckEgLNorgAxwLFi7MkjNcwLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
       [not found]         ` <CALCETrUSBqHG3tbOq1yFz33v1_ckEgLNorgAxwLFi7MkjNcwLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-22 22:06           ` Andrew Hunter
  0 siblings, 0 replies; 17+ messages in thread
From: Andrew Hunter @ 2015-05-22 22:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Michael Kerrisk, Mathieu Desnoyers, Paul Turner, Ben Maurer,
	Linux Kernel, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Paul E. McKenney, Josh Triplett, Lai Jiangshan, Linus Torvalds,
	Andrew Morton, Linux API

On Fri, May 22, 2015 at 1:53 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> Create an array of user-managed locks, one per cpu.  Call them lock[i]
> for 0 <= i < ncpus.
>
> To acquire, look up your CPU number.  Then, atomically, check that
> lock[cpu] isn't held and, if so, mark it held and record both your tid
> and your lock acquisition count.  If you learn that the lock *was*
> held after all, signal the holder (with kill or your favorite other
> mechanism), telling it which lock acquisition count is being aborted.
> Then atomically steal the lock, but only if the lock acquisition count
> hasn't changed.
>

We had to deploy the userspace percpu API (percpu sharded locks,
{double,}compare-and-swap, atomic-increment, etc) universally to the
fleet without waiting for 100% kernel penetration, not to mention
wanting to disable the kernel acceleration in case of kernel bugs.
(Since this is mostly used in core infrastructure--malloc, various
statistics platforms, etc--in userspace, checking for availability
isn't feasible. The primitives have to work 100% of the time or it
would be too complex for our developers to bother using them.)

So we did basically this (without the lock stealing...): we have a
single per-cpu spin lock manipulated with atomics, which we take very
briefly to implement (e.g.) compare-and-swap.  The performance is
hugely worse; typical overheads are in the 10x range _without_ any
on-cpu contention. Uncontended atomics are much cheaper than they were
on pre-Nehalem chips, but they still can't hold a candle to
unsynchronized instructions.

As a fallback path for userspace, this is fine--if 5% of binaries on
busted kernels aren't quite as fast, we can work with that in exchange
for being able to write a percpu op without worrying about what to do
on -ENOSYS.  But it's just not fast enough to compete as the intended
way to do things.

AHH

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-05-26 21:44 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1432219487-13364-1-git-send-email-mathieu.desnoyers@efficios.com>
     [not found] ` <1432219487-13364-1-git-send-email-mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2015-05-22 20:26   ` [RFC PATCH] percpu system call: fast userspace percpu critical sections Michael Kerrisk
     [not found]     ` <CAHO5Pa0Kok4_QN0v3JNWyzGT=GbZNZcRyLhu02R2npV9hSdt7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-22 20:53       ` Andy Lutomirski
2015-05-22 21:34         ` Mathieu Desnoyers
2015-05-22 22:24           ` Andy Lutomirski
     [not found]             ` <CALCETrUxp-dP-kaTy4prEdciM-=sTXjpqnMbkvk38g5BTEvX0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-23 17:09               ` Mathieu Desnoyers
2015-05-23 19:15                 ` Andy Lutomirski
     [not found]                   ` <CALCETrWzoFX7hXqvQqDEq=r=7PNaGKVjZeHEBWxPvC28Zi1AKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-25 18:30                     ` Mathieu Desnoyers
     [not found]                       ` <1184354091.7499.1432578613872.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2015-05-25 18:54                         ` Andy Lutomirski
     [not found]                           ` <CALCETrW3_Hv0jc3cpiwsHTinBqJzvab_EiPS8BVJhX-xe5D8qw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-26 19:57                             ` Andy Lutomirski
     [not found]                               ` <CALCETrXzmO=fQC=UdCh5b0zWiGWAJScEtdT4QDJkoqLgtgEVig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-26 21:04                                 ` Mathieu Desnoyers
     [not found]                                   ` <821493560.8531.1432674243321.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2015-05-26 21:18                                     ` Andy Lutomirski
2015-05-26 21:44                                       ` Andy Lutomirski
2015-05-26 20:38                             ` Mathieu Desnoyers
     [not found]                               ` <933886515.8478.1432672739485.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
2015-05-26 20:58                                 ` Andy Lutomirski
2015-05-26 21:20                           ` Andi Kleen
     [not found]                             ` <20150526212041.GQ19417-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org>
2015-05-26 21:26                               ` Andy Lutomirski
     [not found]         ` <CALCETrUSBqHG3tbOq1yFz33v1_ckEgLNorgAxwLFi7MkjNcwLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-22 22:06           ` Andrew Hunter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).