* [PATCH RFC] x86_64: per-cpu memory for user-space
@ 2014-09-13 14:35 Konstantin Khlebnikov
2014-09-13 18:10 ` Dmitry Vyukov
2014-09-14 14:06 ` Andi Kleen
0 siblings, 2 replies; 4+ messages in thread
From: Konstantin Khlebnikov @ 2014-09-13 14:35 UTC (permalink / raw)
To: x86, linux-kernel
Cc: Thomas Gleixner, Andi Kleen, Ingo Molnar, Dmitry Vyukov,
H. Peter Anvin
This patch implements user-space per-cpu memory in the same manner as in
kernel-space: each cpu gets its own %gs base address. On x86_64 %fs is used
for thread-local storage, so %gs is usually free.
A user-space application cannot prevent preemption, but x86 read-modify-write
operations are atomic with respect to interrupts and context switches. Thus
per-cpu counters, ring-buffer cursors, per-cpu locks and other useful
primitives can be implemented in a very efficient way.
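
For illustration, a per-cpu counter update then takes a single %gs-relative
read-modify-write instruction. Here is a minimal user-space sketch, assuming
the striped area has already been configured via the interface below and that
a counter lives at offset 0 in each cpu's slice (the helper name and offset
are illustrative, not part of this patch):

/*
 * Preemption may migrate us to another cpu, but the read-modify-write
 * itself cannot be split by a context switch, so no LOCK prefix is needed.
 */
static inline void percpu_add(unsigned long off, long v)
{
	asm volatile("addq %1, %%gs:(%0)"
		     : : "r" (off), "r" (v) : "memory", "cc");
}

/* usage: increment the counter of whatever cpu we are currently on */
percpu_add(0, 1);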
After this patch the kernel recalculates the %gs base at each context switch.
This is implemented only via MSR_KERNEL_GS_BASE. Loading the base via a GDT
selector might be faster, but it is much more complicated.
By the way, newer Intel cpus have even faster instructions for changing the
%fs/%gs base (the FSGSBASE instructions), but they are not yet supported by
the kernel.
The additional overhead is close to zero: this patch adds one extra
multiplication to __switch_to() (only if %gs is set by user-space and its
base is above 4GB):
if (next->gs)
- wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+ wrmsrl(MSR_KERNEL_GS_BASE, next->gs +
+ cpu * next->gs_cpu_stride);
A child inherits the setup from its parent at clone() because it gets a copy
of the task_struct. Changing %gs via any other interface (selector,
ARCH_SET_GS) disables the striping.
Interface:
int arch_prctl(ARCH_GET_GS_PERCPU, unsigned long arg[2]);
int arch_prctl(ARCH_SET_GS_PERCPU, unsigned long arg[2]);
arg[0] - base address for cpu0
arg[1] - stride between consecutive cpus
Error codes:
-EINVAL - not implemented (or ia32 compat)
-ENOENT - not configured (only for get)
-EFAULT - arg isn't addressable
-EPERM - base above the addressable space (only for set)
-EOVERFLOW - stride too large for this base and nr_cpus cpus (only for set)
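
A minimal setup sequence from user-space might look like this (a sketch with
error handling trimmed; the SLICE size is an assumption, and arch_prctl() has
no glibc wrapper, hence the raw syscall()):

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_SET_GS_PERCPU	0x1005
#define SLICE			(64 * 1024)	/* assumed per-cpu slice size */

int percpu_area_init(void)
{
	/* NOTE: the kernel validates the stride against num_possible_cpus(),
	 * which may exceed the configured cpu count reported here. */
	long ncpu = sysconf(_SC_NPROCESSORS_CONF);
	unsigned long arg[2];
	void *base;

	base = mmap(NULL, ncpu * SLICE, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;
	arg[0] = (unsigned long)base;	/* base address for cpu0 */
	arg[1] = SLICE;			/* stride between consecutive cpus */
	return syscall(SYS_arch_prctl, ARCH_SET_GS_PERCPU, arg);
}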
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
---
arch/x86/include/asm/processor.h | 1 +
arch/x86/include/uapi/asm/prctl.h | 2 ++
arch/x86/kernel/process_64.c | 39 ++++++++++++++++++++++++++++++++++++-
3 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index eb71ec7..102c1f9 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -484,6 +484,7 @@ struct thread_struct {
#endif
#ifdef CONFIG_X86_64
unsigned long fs;
+ unsigned long gs_cpu_stride;
#endif
unsigned long gs;
/* Save middle states of ptrace breakpoints */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 3ac5032..026bd39 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -5,5 +5,7 @@
#define ARCH_SET_FS 0x1002
#define ARCH_GET_FS 0x1003
#define ARCH_GET_GS 0x1004
+#define ARCH_SET_GS_PERCPU 0x1005
+#define ARCH_GET_GS_PERCPU 0x1006
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ca5b02d..5e7af75 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -351,7 +351,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
prev->gs = 0;
}
if (next->gs)
- wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+ wrmsrl(MSR_KERNEL_GS_BASE, next->gs +
+ cpu * next->gs_cpu_stride);
prev->gsindex = gsindex;
switch_fpu_finish(next_p, fpu);
@@ -469,6 +470,7 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
if (addr >= TASK_SIZE_OF(task))
return -EPERM;
cpu = get_cpu();
+ task->thread.gs_cpu_stride = 0;
/* handle small bases via the GDT because that's faster to
switch. */
if (addr <= 0xffffffff) {
@@ -544,6 +546,41 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
ret = put_user(base, (unsigned long __user *)addr);
break;
}
+ case ARCH_GET_GS_PERCPU:
+ if (test_tsk_thread_flag(task, TIF_ADDR32))
+ return -EINVAL;
+ if (!task->thread.gs || !task->thread.gs_cpu_stride)
+ return -ENOENT;
+ ret = put_user(task->thread.gs,
+ (unsigned long __user *)addr);
+ if (!ret)
+ ret = put_user(task->thread.gs_cpu_stride,
+ ((unsigned long __user *)addr) + 1);
+ break;
+ case ARCH_SET_GS_PERCPU: {
+ unsigned long arg[2];
+
+ if (test_tsk_thread_flag(task, TIF_ADDR32))
+ return -EINVAL;
+ if (copy_from_user(arg, (void __user *)addr, sizeof(arg)))
+ return -EFAULT;
+ if (arg[0] >= TASK_SIZE_MAX)
+ return -EPERM;
+ if (arg[1] > (TASK_SIZE_MAX - arg[0]) / num_possible_cpus())
+ return -EOVERFLOW;
+
+ task->thread.gsindex = 0;
+ task->thread.gs = arg[0];
+ task->thread.gs_cpu_stride = arg[1];
+ if (doit) {
+ cpu = get_cpu();
+ load_gs_index(0);
+ ret = wrmsrl_safe(MSR_KERNEL_GS_BASE,
+ arg[0] + cpu * arg[1]);
+ put_cpu();
+ }
+ break;
+ }
default:
ret = -EINVAL;
^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH RFC] x86_64: per-cpu memory for user-space
  2014-09-13 14:35 [PATCH RFC] x86_64: per-cpu memory for user-space Konstantin Khlebnikov
@ 2014-09-13 18:10 ` Dmitry Vyukov
  2014-09-14 14:06 ` Andi Kleen
  1 sibling, 0 replies; 4+ messages in thread
From: Dmitry Vyukov @ 2014-09-13 18:10 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: x86, LKML, Thomas Gleixner, Andi Kleen, Ingo Molnar, H. Peter Anvin

On Sat, Sep 13, 2014 at 7:35 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> This patch implements user-space per-cpu memory in the same manner as in
> kernel-space: each cpu gets its own %gs base address. On x86_64 %fs is used
> for thread-local storage, so %gs is usually free.
>
> A user-space application cannot prevent preemption, but x86 read-modify-write
> operations are atomic with respect to interrupts and context switches. Thus
> per-cpu counters, ring-buffer cursors, per-cpu locks and other useful
> primitives can be implemented in a very efficient way.
>
> [... full patch quoted ...]

Nice!
Per-cpu non-lossy stats counters are trivial with this support. For more
complex data structures the trick is to put the cpu number into the per-cpu
region of memory, and to use a CAS loop to modify the data (the CAS does not
need a LOCK prefix in this case). For example, here is how a lock-free
per-cpu freelist can be implemented (GS[OFF] denotes the freelist_t slot in
the current cpu's region):

struct freelist_t {
	void	*head;
	uint16	cpu;	/* cpu that owns this slot */
	uint16	len;
	uint32	aba;	/* version counter against ABA */
};

bool freelist_push(void *p)
{
	freelist_t *fl, old, new;

	for (;;) {
		fl = (freelist_t *)&GS[OFF];
		old = atomic_load(fl);
		if (old.len == UINT16_MAX)
			return false;
		*(void **)p = old.head;
		new = old;
		new.aba++;
		new.len++;
		new.head = p;
		if (CMPXCHG16B(fl, old, new))	/* w/o LOCK prefix */
			return true;
	}
}

void *freelist_pop(void)
{
	freelist_t *fl, old, new;
	void *p;

	for (;;) {
		fl = (freelist_t *)&GS[OFF];
		old = atomic_load(fl);
		if (old.len == 0)
			return NULL;
		p = old.head;
		new = old;
		new.len--;
		new.head = *(void **)p;
		if (CMPXCHG16B(fl, old, new))	/* w/o LOCK prefix */
			return p;
	}
}

^ permalink raw reply	[flat|nested] 4+ messages in thread
* Re: [PATCH RFC] x86_64: per-cpu memory for user-space
  2014-09-13 14:35 [PATCH RFC] x86_64: per-cpu memory for user-space Konstantin Khlebnikov
  2014-09-13 18:10 ` Dmitry Vyukov
@ 2014-09-14 14:06 ` Andi Kleen
  2014-09-14 18:35   ` Dmitry Vyukov
  1 sibling, 1 reply; 4+ messages in thread
From: Andi Kleen @ 2014-09-14 14:06 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dmitry Vyukov,
	H. Peter Anvin

On Sat, Sep 13, 2014 at 06:35:34PM +0400, Konstantin Khlebnikov wrote:
> This patch implements user-space per-cpu memory in the same manner as in
> kernel-space: each cpu gets its own %gs base address. On x86_64 %fs is used
> for thread-local storage, so %gs is usually free.
>
> A user-space application cannot prevent preemption, but x86 read-modify-write
> operations are atomic with respect to interrupts and context switches. Thus
> per-cpu counters, ring-buffer cursors, per-cpu locks and other useful
> primitives can be implemented in a very efficient way.

Do you have some concrete examples for the more complex operations?

It seems to me the limitation to a simple instruction will be very limiting
for anything more complicated than a counter. Also it's not even clear how
someone would implement retry (short of something like kuchannel).

Of course it wouldn't be a problem with TSX transactions, but it's not
clear they need it.

The other problem with the approach is, how would cpu hotplug be handled?

> By the way, newer Intel cpus have even faster instructions for changing the
> %fs/%gs base (the FSGSBASE instructions), but they are not yet supported by
> the kernel.

Patch kits are pending.

-Andi

^ permalink raw reply	[flat|nested] 4+ messages in thread
* Re: [PATCH RFC] x86_64: per-cpu memory for user-space
  2014-09-14 14:06 ` Andi Kleen
@ 2014-09-14 18:35   ` Dmitry Vyukov
  0 siblings, 0 replies; 4+ messages in thread
From: Dmitry Vyukov @ 2014-09-14 18:35 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Konstantin Khlebnikov, x86, LKML, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Paul Turner, Andrew Hunter

On Sun, Sep 14, 2014 at 7:06 AM, Andi Kleen <ak@linux.intel.com> wrote:
> Do you have some concrete examples for the more complex operations?

Hi Andi,

I've shown how to implement a per-cpu freelist in a previous email.

Here is how you can do a per-cpu logging system: multiple threads enqueue
messages, a single thread polls all per-cpu queues (not tested as well):

struct elem_t {
	uint32	lap;
	T	data;
};

struct percpu_queue_t {
	uint64	enqueue_pos_cpu;	/* enqueue_pos in high half, cpu in low */
	uint32	dequeue_pos;
	elem_t	buf[N];
};

bool percpu_queue_enqueue(T *data)
{
	percpu_queue_t *q;
	uint32 enqueue_pos;
	elem_t *e;
	uint64 old;

	for (;;) {
		q = (percpu_queue_t *)&GS[OFF];
		old = atomic_load(&q->enqueue_pos_cpu, memory_order_relaxed);
		enqueue_pos = (uint32)(old >> 32);
		if (enqueue_pos - atomic32_load(&q->dequeue_pos,
						memory_order_acquire) >= N)
			return false;	/* full */
		/* claim the slot: advance enqueue_pos in the packed word,
		   w/o LOCK prefix */
		if (!CMPXCHG16B(&GS[OFF], old, old + (1ull << 32)))
			continue;
		e = &q->buf[enqueue_pos % N];
		e->data = *data;
		atomic_store(&e->lap, enqueue_pos / N + 1, memory_order_release);
		return true;
	}
}

bool percpu_queue_dequeue(percpu_queue_t *q, T *data)
{
	elem_t *e;
	uint32 dequeue_pos;

	dequeue_pos = q->dequeue_pos;
	e = &q->buf[dequeue_pos % N];
	if (atomic_load(&e->lap, memory_order_acquire) != dequeue_pos / N + 1)
		return false;	/* empty */
	*data = e->data;
	atomic_store(&q->dequeue_pos, dequeue_pos + 1, memory_order_release);
	return true;
}

This is a simple version. It can be modified to do variable-size messages,
batch updates of dequeue_pos to reduce write sharing, or split both enqueue
and dequeue into prepare and commit phases to do zero-copy communication.

So, generally, if you can reduce the per-cpu state modification to a single
CAS, then you pair the data with the cpu number and include it into the CAS.
This allows you to do a CAS loop on per-cpu state w/o LOCK prefix. And in
the freelist code, the line:

  if (CMPXCHG16B(fl, old, new))

needs to be replaced with (we want to ensure that we are still on the same
cpu):

  if (CMPXCHG16B(&GS[OFF], old, new))

For more complex data structures (e.g. a doubly-linked list), I *suspect*
that it's possible to do something like a transaction-journalling approach.
Namely, there is a single per-cpu slot for the current transaction. If it's
empty, then a thread installs its own transaction descriptor with a CAS w/o
LOCK prefix, executes the transaction and uninstalls the descriptor. If it's
not empty, then a thread first helps to execute the pending transaction and
uninstalls the trx descriptor. All mutations in transactions are done with
CAS w/o LOCK prefix (both when executed by the issuing thread and during
helping).
Care must be taken to ensure that threads do not operate on stale data and
on the wrong cpu; probably all data slots must be accompanied by an aba
counter and/or a cpu number. I've done this in the context of robust
shared-memory data structures. I have not worked out all the details for
per-cpu data structures, but I think that it can work.

> It seems to me the limitation to a simple instruction will be very limiting
> for anything more complicated than a counter. Also it's not even clear how
> someone would implement retry (short of something like kuchannel).

What is retry/kuchannel?

> Of course it wouldn't be a problem with TSX transactions, but it's not
> clear they need it.

As far as I know (I have not had a chance to try), the commit of a TSX
transaction includes a trailing store-load style fence. And the whole point
of this change is to eliminate the cost of the store-load fence. Otherwise,
you could just query any approximation of the current cpu and do either TSX
or a LOCK-prefixed RMW on the current cpu's state.

> The other problem with the approach is, how would cpu hotplug
> be handled?

^ permalink raw reply	[flat|nested] 4+ messages in thread
end of thread, other threads:[~2014-09-14 18:35 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-13 14:35 [PATCH RFC] x86_64: per-cpu memory for user-space Konstantin Khlebnikov
2014-09-13 18:10 ` Dmitry Vyukov
2014-09-14 14:06 ` Andi Kleen
2014-09-14 18:35   ` Dmitry Vyukov