* [PATCH] ARM: pcpu: ensure __my_cpu_offset cannot be re-ordered across barrier()
@ 2013-06-03 16:53 Will Deacon
  2013-06-03 19:33 ` Nicolas Pitre
  0 siblings, 1 reply; 6+ messages in thread
From: Will Deacon @ 2013-06-03 16:53 UTC (permalink / raw)
To: linux-arm-kernel
__my_cpu_offset is non-volatile, since we want its value to be cached
when we access several per-cpu variables in a row with preemption
disabled. This means that we rely on preempt_{en,dis}able to hazard
with the operation via the barrier() macro, so that we can't end up
migrating CPUs without reloading the per-cpu offset.
Unfortunately, GCC doesn't treat a "memory" clobber on a non-volatile
asm block as a side-effect, and will happily re-order it before other
memory clobbers (including those in preempt_disable()) and cache the
value. This has been observed to break the cmpxchg logic in the slub
allocator, leading to livelock in kmem_cache_alloc in mainline kernels.
This patch adds a dummy memory output operand to __my_cpu_offset,
forcing it to be ordered with respect to the barrier() macro.
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm/include/asm/percpu.h | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
index 968c0a1..93970eb 100644
--- a/arch/arm/include/asm/percpu.h
+++ b/arch/arm/include/asm/percpu.h
@@ -30,8 +30,12 @@ static inline void set_my_cpu_offset(unsigned long off)
 static inline unsigned long __my_cpu_offset(void)
 {
 	unsigned long off;
-	/* Read TPIDRPRW */
-	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : : "memory");
+	/*
+	 * Read TPIDRPRW.
+	 * We want to allow caching the value, so avoid using volatile and
+	 * instead use a fake memory access to hazard against barrier().
+	 */
+	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off), "=Qo" (off));
 	return off;
 }
 #define __my_cpu_offset __my_cpu_offset()
--
1.8.2.2
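The ordering requirement the patch describes can be modelled portably. The sketch below is an editorially added illustration, not kernel code: all `fake_*` names are hypothetical stand-ins, and a volatile empty asm takes the place of the MRC so each call really reloads the value.

```c
/*
 * Sketch of the required semantics: per-cpu accessors rely on the
 * barrier() inside preempt_{dis,en}able() to stop the compiler from
 * caching the per-cpu offset across a preemption point, where the
 * task could migrate to another CPU.
 */
static unsigned long fake_tpidrprw;	/* models the per-CPU register */
static int fake_preempt_count;

#define barrier() __asm__ __volatile__("" ::: "memory")

static void fake_preempt_disable(void) { fake_preempt_count++; barrier(); }
static void fake_preempt_enable(void)  { barrier(); fake_preempt_count--; }

static unsigned long fake_my_cpu_offset(void)
{
	unsigned long off;
	/*
	 * volatile forces a fresh read here; the real code wants to
	 * avoid volatile so back-to-back reads can still be cached.
	 */
	__asm__ __volatile__("" : "=r" (off) : "0" (fake_tpidrprw));
	return off;
}

/*
 * Two critical sections with a simulated migration in between:
 * the second read must observe the new offset.
 */
static unsigned long demo(void)
{
	unsigned long a, b;

	fake_tpidrprw = 4;
	fake_preempt_disable();
	a = fake_my_cpu_offset();
	fake_preempt_enable();

	fake_tpidrprw = 8;		/* "migrated" to another CPU */
	fake_preempt_disable();
	b = fake_my_cpu_offset();
	fake_preempt_enable();

	return a * 10 + b;		/* 48 iff both reads are fresh */
}
```

The kernel's problem, discussed in the replies below, is exactly that it wants the non-volatile variant of this read without losing the ordering against barrier().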
^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH] ARM: pcpu: ensure __my_cpu_offset cannot be re-ordered across barrier()
  2013-06-03 16:53 [PATCH] ARM: pcpu: ensure __my_cpu_offset cannot be re-ordered across barrier() Will Deacon
@ 2013-06-03 19:33 ` Nicolas Pitre
  2013-06-04 10:39   ` Will Deacon
  0 siblings, 1 reply; 6+ messages in thread
From: Nicolas Pitre @ 2013-06-03 19:33 UTC (permalink / raw)
To: linux-arm-kernel

On Mon, 3 Jun 2013, Will Deacon wrote:

> __my_cpu_offset is non-volatile, since we want its value to be cached
> when we access several per-cpu variables in a row with preemption
> disabled. This means that we rely on preempt_{en,dis}able to hazard
> with the operation via the barrier() macro, so that we can't end up
> migrating CPUs without reloading the per-cpu offset.
>
> Unfortunately, GCC doesn't treat a "memory" clobber on a non-volatile
> asm block as a side-effect, and will happily re-order it before other
> memory clobbers (including those in preempt_disable()) and cache the
> value. This has been observed to break the cmpxchg logic in the slub
> allocator, leading to livelock in kmem_cache_alloc in mainline kernels.
>
> This patch adds a dummy memory output operand to __my_cpu_offset,
> forcing it to be ordered with respect to the barrier() macro.
>
> Cc: Rob Herring <rob.herring@calxeda.com>
> Cc: Nicolas Pitre <nico@fluxnic.net>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  arch/arm/include/asm/percpu.h | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
> index 968c0a1..93970eb 100644
> --- a/arch/arm/include/asm/percpu.h
> +++ b/arch/arm/include/asm/percpu.h
> @@ -30,8 +30,12 @@ static inline void set_my_cpu_offset(unsigned long off)
>  static inline unsigned long __my_cpu_offset(void)
>  {
>  	unsigned long off;
> -	/* Read TPIDRPRW */
> -	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : : "memory");
> +	/*
> +	 * Read TPIDRPRW.
> +	 * We want to allow caching the value, so avoid using volatile and
> +	 * instead use a fake memory access to hazard against barrier().
> +	 */
> +	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off), "=Qo" (off));
>  	return off;
>  }

This doesn't work with the little test I wrote to see the compiler
behavior change.  Here's my test in case it is flawed:

static inline unsigned long __my_cpu_offset(void)
{
	unsigned long off;
	//asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : : "memory");
	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off), "=Qo" (off));
	return off;
}

#define barrier() asm volatile ("" : : : "memory")

int foo(int *a, int *b, char *c)
{
	int x, y, z;

	x = a[__my_cpu_offset()];
	barrier();
	y = b[__my_cpu_offset()];
	z = c[__my_cpu_offset()];
	barrier();
	return x + y + z + a[__my_cpu_offset()];
}

With the original memory clobber (commented out above) the asm output
is:

foo:
	mrc	p15, 0, r3, c13, c0, 4
	ldr	ip, [r0, r3, asl #2]
	ldr	r1, [r1, r3, asl #2]
	ldrb	r2, [r2, r3]	@ zero_extendqisi2
	ldr	r3, [r0, r3, asl #2]
	add	r0, ip, r1
	add	r0, r0, r2
	add	r0, r0, r3
	bx	lr

So, to confirm: no memory barrier is respected in any way.
With your change we get this instead:

foo:
	sub	sp, sp, #8
	mrc	p15, 0, r3, c13, c0, 4
	str	r3, [sp, #4]
	ldr	ip, [r0, r3, asl #2]
	ldr	r1, [r1, r3, asl #2]
	str	r3, [sp, #4]
	ldrb	r2, [r2, r3]	@ zero_extendqisi2
	ldr	r3, [r0, r3, asl #2]
	add	r0, ip, r1
	add	r0, r0, r2
	add	r0, r0, r3
	add	sp, sp, #8
	bx	lr

So not only is the barrier completely ineffective at reloading the CPU
offset, but it also introduces a useless store onto the stack.  Same
behavior with gcc versions 4.7.3, 4.6.2 and 4.5.1 (yes, that's really
what I have on my system -- waiting for 4.8.4 now).

So if the preemption count is really what this should be protected
against, we should simply add it as an input dependency in the inline
asm constraints, as follows:

diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
index 968c0a14e0..52f9ca33e7 100644
--- a/arch/arm/include/asm/percpu.h
+++ b/arch/arm/include/asm/percpu.h
@@ -31,7 +31,9 @@ static inline unsigned long __my_cpu_offset(void)
 {
 	unsigned long off;
 	/* Read TPIDRPRW */
-	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : : "memory");
+	asm("mrc p15, 0, %0, c13, c0, 4"
+	    : "=r" (off)
+	    : "m" (current_thread_info()->preempt_count));
 	return off;
 }
 #define __my_cpu_offset __my_cpu_offset()

The problem with this approach is that current_thread_info() is not a
static location either, as it is always derived from the stack pointer,
so it needlessly computes it even when nothing else uses it in the same
function.

Nicolas
* [PATCH] ARM: pcpu: ensure __my_cpu_offset cannot be re-ordered across barrier()
  2013-06-03 19:33 ` Nicolas Pitre
@ 2013-06-04 10:39   ` Will Deacon
  2013-06-04 17:15     ` Will Deacon
  0 siblings, 1 reply; 6+ messages in thread
From: Will Deacon @ 2013-06-04 10:39 UTC (permalink / raw)
To: linux-arm-kernel

Hi Nico,

Thanks for helping out with this.

On Mon, Jun 03, 2013 at 08:33:24PM +0100, Nicolas Pitre wrote:
> On Mon, 3 Jun 2013, Will Deacon wrote:
> > static inline unsigned long __my_cpu_offset(void)
> > {
> > 	unsigned long off;
> > -	/* Read TPIDRPRW */
> > -	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : : "memory");
> > +	/*
> > +	 * Read TPIDRPRW.
> > +	 * We want to allow caching the value, so avoid using volatile and
> > +	 * instead use a fake memory access to hazard against barrier().
> > +	 */
> > +	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off), "=Qo" (off));
> > 	return off;
> > }
>
> This doesn't work with the little test I wrote to see the compiler
> behavior change.  Here's my test in case it is flawed:

[...]

> int foo(int *a, int *b, char *c)
> {
> 	int x, y, z;
>
> 	x = a[__my_cpu_offset()];
> 	barrier();
> 	y = b[__my_cpu_offset()];
> 	z = c[__my_cpu_offset()];
> 	barrier();
> 	return x + y + z + a[__my_cpu_offset()];
> }
>
> With the original memory clobber (commented out above) the asm output
> is:
>
> foo:
> 	mrc	p15, 0, r3, c13, c0, 4
> 	ldr	ip, [r0, r3, asl #2]
> 	ldr	r1, [r1, r3, asl #2]
> 	ldrb	r2, [r2, r3]	@ zero_extendqisi2
> 	ldr	r3, [r0, r3, asl #2]
> 	add	r0, ip, r1
> 	add	r0, r0, r2
> 	add	r0, r0, r3
> 	bx	lr
>
> So to confirm that no memory barrier is respected in any way.
>
> With your change we get this instead:
>
> foo:
> 	sub	sp, sp, #8
> 	mrc	p15, 0, r3, c13, c0, 4
> 	str	r3, [sp, #4]
> 	ldr	ip, [r0, r3, asl #2]
> 	ldr	r1, [r1, r3, asl #2]
> 	str	r3, [sp, #4]
> 	ldrb	r2, [r2, r3]	@ zero_extendqisi2
> 	ldr	r3, [r0, r3, asl #2]
> 	add	r0, ip, r1
> 	add	r0, r0, r2
> 	add	r0, r0, r3
> 	add	sp, sp, #8
> 	bx	lr

Ah crap, that sucks.
Thanks for testing it -- I've just been looking at slub with my
toolchain, but your test case also falls to bits here.

> So not only the barrier is completely ineffective at reloading the CPU
> offset, but it introduces a useless store onto the stack.  Same
> behavior with gcc versions 4.7.3, 4.6.2 and 4.5.1 (yes, that's really
> what I have on my system -- waiting for 4.8.4 now).

Just a heads up: I've had nothing but problems with 4.8-based toolchains
(including those provided by Linaro).  Seriously, I had the thing ICE on
5 of the first 7 files while building the kernel!

> So if the preemption count is really what this should be protected
> against, we should simply add it as an input dependency in the inline
> asm constraints as follows:
>
> diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
> index 968c0a14e0..52f9ca33e7 100644
> --- a/arch/arm/include/asm/percpu.h
> +++ b/arch/arm/include/asm/percpu.h
> @@ -31,7 +31,9 @@ static inline unsigned long __my_cpu_offset(void)
>  {
>  	unsigned long off;
>  	/* Read TPIDRPRW */
> -	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : : "memory");
> +	asm("mrc p15, 0, %0, c13, c0, 4"
> +	    : "=r" (off)
> +	    : "m" (current_thread_info()->preempt_count));
>  	return off;
>  }
>  #define __my_cpu_offset __my_cpu_offset()
>
> The problem with this approach is that current_thread_info() is not a
> static location either, as it is always derived from the stack pointer,
> so it needlessly computes it even when nothing else uses it in the same
> function.

Yes, that's going to generate fairly horrible code :(

Actually, I just had a play around and managed to come up with something
a bit better.  It seems as though putting the memory constraint as an
*input* operand is what GCC really wants, but this sucks because you end
up generating prologue code to set things up for the addressing mode
which you never use.
For example, you can pretend you have a dependency on
*(unsigned long *)0 and your test case becomes:

c001cd48:	e52d4004 	push	{r4}		; (str r4, [sp, #-4]!)
c001cd4c:	e3a03000 	mov	r3, #0
c001cd50:	ee1dcf90 	mrc	15, 0, ip, cr13, cr0, {4}
c001cd54:	e790410c 	ldr	r4, [r0, ip, lsl #2]
c001cd58:	ee1dcf90 	mrc	15, 0, ip, cr13, cr0, {4}
c001cd5c:	e791110c 	ldr	r1, [r1, ip, lsl #2]
c001cd60:	e792210c 	ldr	r2, [r2, ip, lsl #2]
c001cd64:	ee1d3f90 	mrc	15, 0, r3, cr13, cr0, {4}
c001cd68:	e7903103 	ldr	r3, [r0, r3, lsl #2]
c001cd6c:	e0840001 	add	r0, r4, r1
c001cd70:	e0800002 	add	r0, r0, r2
c001cd74:	e0800003 	add	r0, r0, r3
c001cd78:	e8bd0010 	pop	{r4}
c001cd7c:	e12fff1e 	bx	lr

However, if you instead add a dependency on something that already lives
in a register, like __builtin_return_address, you get:

c001cd44:	e92d4010 	push	{r4, lr}
c001cd48:	ee1dcf90 	mrc	15, 0, ip, cr13, cr0, {4}
c001cd4c:	e790410c 	ldr	r4, [r0, ip, lsl #2]
c001cd50:	ee1dcf90 	mrc	15, 0, ip, cr13, cr0, {4}
c001cd54:	e791110c 	ldr	r1, [r1, ip, lsl #2]
c001cd58:	e792210c 	ldr	r2, [r2, ip, lsl #2]
c001cd5c:	ee1d3f90 	mrc	15, 0, r3, cr13, cr0, {4}
c001cd60:	e7903103 	ldr	r3, [r0, r3, lsl #2]
c001cd64:	e0840001 	add	r0, r4, r1
c001cd68:	e0800002 	add	r0, r0, r2
c001cd6c:	e0800003 	add	r0, r0, r3
c001cd70:	e8bd8010 	pop	{r4, pc}

Unfortunately, that builtin has clobber semantics on lr (no idea why) so
we push an extra register.  I'm going to have a chat with our tools guys
this afternoon to see if there's anything better we can do, but for the
time being my diff is below.

Cheers,

Will

--->8

diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
index 968c0a1..6780990 100644
--- a/arch/arm/include/asm/percpu.h
+++ b/arch/arm/include/asm/percpu.h
@@ -30,8 +30,16 @@ static inline void set_my_cpu_offset(unsigned long off)
 static inline unsigned long __my_cpu_offset(void)
 {
 	unsigned long off;
-	/* Read TPIDRPRW */
-	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : : "memory");
+
+	/*
+	 * Read TPIDRPRW.
+	 * We want to allow caching the value, so avoid using volatile and
+	 * instead use a fake memory access to hazard against barrier().
+	 */
+	asm("mrc p15, 0, %0, c13, c0, 4"
+	    : "=r" (off)
+	    : "Q" (*(unsigned long *)__builtin_return_address(0)));
+
 	return off;
 }
 #define __my_cpu_offset __my_cpu_offset()
* [PATCH] ARM: pcpu: ensure __my_cpu_offset cannot be re-ordered across barrier()
  2013-06-04 10:39 ` Will Deacon
@ 2013-06-04 17:15   ` Will Deacon
  2013-06-04 17:39     ` Nicolas Pitre
  0 siblings, 1 reply; 6+ messages in thread
From: Will Deacon @ 2013-06-04 17:15 UTC (permalink / raw)
To: linux-arm-kernel

On Tue, Jun 04, 2013 at 11:39:53AM +0100, Will Deacon wrote:
> However, if you instead add a dependency on something that already lives
> in a register, like __builtin_return_address, you get:
>
> c001cd44:	e92d4010 	push	{r4, lr}
> c001cd48:	ee1dcf90 	mrc	15, 0, ip, cr13, cr0, {4}
> c001cd4c:	e790410c 	ldr	r4, [r0, ip, lsl #2]
> c001cd50:	ee1dcf90 	mrc	15, 0, ip, cr13, cr0, {4}
> c001cd54:	e791110c 	ldr	r1, [r1, ip, lsl #2]
> c001cd58:	e792210c 	ldr	r2, [r2, ip, lsl #2]
> c001cd5c:	ee1d3f90 	mrc	15, 0, r3, cr13, cr0, {4}
> c001cd60:	e7903103 	ldr	r3, [r0, r3, lsl #2]
> c001cd64:	e0840001 	add	r0, r4, r1
> c001cd68:	e0800002 	add	r0, r0, r2
> c001cd6c:	e0800003 	add	r0, r0, r3
> c001cd70:	e8bd8010 	pop	{r4, pc}

Actually, I've managed to improve this further using sp instead (see
diff below), which also prevents GCC from moving the lr into other
registers for no reason.  The code generated for your test is:

c001cda4:	ee1d3f90 	mrc	15, 0, r3, cr13, cr0, {4}
c001cda8:	e790c103 	ldr	ip, [r0, r3, lsl #2]
c001cdac:	ee1d3f90 	mrc	15, 0, r3, cr13, cr0, {4}
c001cdb0:	e7911103 	ldr	r1, [r1, r3, lsl #2]
c001cdb4:	e7922103 	ldr	r2, [r2, r3, lsl #2]
c001cdb8:	ee1d3f90 	mrc	15, 0, r3, cr13, cr0, {4}
c001cdbc:	e7903103 	ldr	r3, [r0, r3, lsl #2]
c001cdc0:	e08c0001 	add	r0, ip, r1
c001cdc4:	e0800002 	add	r0, r0, r2
c001cdc8:	e0800003 	add	r0, r0, r3
c001cdcc:	e12fff1e 	bx	lr

I spoke to the tools guys and they'll look into why the original patch
didn't work, since that could have wider implications on our use of
barrier().
Will

--->8

diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
index 968c0a1..209e650 100644
--- a/arch/arm/include/asm/percpu.h
+++ b/arch/arm/include/asm/percpu.h
@@ -30,8 +30,15 @@ static inline void set_my_cpu_offset(unsigned long off)
 static inline unsigned long __my_cpu_offset(void)
 {
 	unsigned long off;
-	/* Read TPIDRPRW */
-	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : : "memory");
+	register unsigned long *sp asm ("sp");
+
+	/*
+	 * Read TPIDRPRW.
+	 * We want to allow caching the value, so avoid using volatile and
+	 * instead use a fake stack read to hazard against barrier().
+	 */
+	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : "Q" (*sp));
+
 	return off;
 }
 #define __my_cpu_offset __my_cpu_offset()
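The idiom in the diff above — a dummy memory *input* operand that barrier()'s "memory" clobber can order against, instead of marking the asm volatile — can be sketched portably. This is an editorially added illustration with hypothetical names: an empty asm template stands in for the MRC, and the generic `"m"` constraint stands in for the ARM-specific `"Q"` one.

```c
#define barrier() __asm__ __volatile__("" ::: "memory")

static unsigned long counter_value = 42;	/* stand-in for TPIDRPRW */

/*
 * Non-volatile asm with a dummy memory input: GCC may still merge
 * back-to-back calls (the caching we want), but the "m" operand gives
 * barrier()'s "memory" clobber something to order against, in the same
 * spirit as the '"Q" (*sp)' constraint in the patch.
 */
static unsigned long read_counter(void)
{
	unsigned long v;

	__asm__("" : "=r" (v) : "0" (counter_value), "m" (counter_value));
	return v;
}

static unsigned long demo_pair(void)
{
	unsigned long a = read_counter();	/* these two reads may be */
	unsigned long b = read_counter();	/* merged by the compiler  */

	barrier();				/* but not cached past here */
	counter_value = 7;			/* the "m" input sees this store */
	return a + b + read_counter();		/* 42 + 42 + 7 = 91 */
}
```

Under these assumptions the final read must be refreshed after the store, so `demo_pair()` yields 91 regardless of how aggressively the first two reads are combined.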
* [PATCH] ARM: pcpu: ensure __my_cpu_offset cannot be re-ordered across barrier()
  2013-06-04 17:15 ` Will Deacon
@ 2013-06-04 17:39   ` Nicolas Pitre
  2013-06-05  9:59     ` Will Deacon
  0 siblings, 1 reply; 6+ messages in thread
From: Nicolas Pitre @ 2013-06-04 17:39 UTC (permalink / raw)
To: linux-arm-kernel

On Tue, 4 Jun 2013, Will Deacon wrote:

> On Tue, Jun 04, 2013 at 11:39:53AM +0100, Will Deacon wrote:
> > However, if you instead add a dependency on something that already lives
> > in a register, like __builtin_return_address, you get:
> >
> > c001cd44:	e92d4010 	push	{r4, lr}
> > c001cd48:	ee1dcf90 	mrc	15, 0, ip, cr13, cr0, {4}
> > c001cd4c:	e790410c 	ldr	r4, [r0, ip, lsl #2]
> > c001cd50:	ee1dcf90 	mrc	15, 0, ip, cr13, cr0, {4}
> > c001cd54:	e791110c 	ldr	r1, [r1, ip, lsl #2]
> > c001cd58:	e792210c 	ldr	r2, [r2, ip, lsl #2]
> > c001cd5c:	ee1d3f90 	mrc	15, 0, r3, cr13, cr0, {4}
> > c001cd60:	e7903103 	ldr	r3, [r0, r3, lsl #2]
> > c001cd64:	e0840001 	add	r0, r4, r1
> > c001cd68:	e0800002 	add	r0, r0, r2
> > c001cd6c:	e0800003 	add	r0, r0, r3
> > c001cd70:	e8bd8010 	pop	{r4, pc}
>
> Actually, I've managed to improve this further using sp instead (see
> diff below), which also prevents GCC from moving the lr into other
> registers for no reason.  The code generated for your test is:
>
> c001cda4:	ee1d3f90 	mrc	15, 0, r3, cr13, cr0, {4}
> c001cda8:	e790c103 	ldr	ip, [r0, r3, lsl #2]
> c001cdac:	ee1d3f90 	mrc	15, 0, r3, cr13, cr0, {4}
> c001cdb0:	e7911103 	ldr	r1, [r1, r3, lsl #2]
> c001cdb4:	e7922103 	ldr	r2, [r2, r3, lsl #2]
> c001cdb8:	ee1d3f90 	mrc	15, 0, r3, cr13, cr0, {4}
> c001cdbc:	e7903103 	ldr	r3, [r0, r3, lsl #2]
> c001cdc0:	e08c0001 	add	r0, ip, r1
> c001cdc4:	e0800002 	add	r0, r0, r2
> c001cdc8:	e0800003 	add	r0, r0, r3
> c001cdcc:	e12fff1e 	bx	lr

Excellent.

> I spoke to the tools guys and they'll look into why the original patch
> didn't work, since that could have wider implications on our use of
> barrier().

Good.
However, given the wide range of gcc versions exhibiting the same
behavior, I think your patch is the best way forward even if gcc turns
out to be broken and is eventually fixed.

Reviewed-by: Nicolas Pitre <nico@linaro.org>

I'd suggest queuing it for stable as well.

> --->8
>
> diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
> index 968c0a1..209e650 100644
> --- a/arch/arm/include/asm/percpu.h
> +++ b/arch/arm/include/asm/percpu.h
> @@ -30,8 +30,15 @@ static inline void set_my_cpu_offset(unsigned long off)
>  static inline unsigned long __my_cpu_offset(void)
>  {
>  	unsigned long off;
> -	/* Read TPIDRPRW */
> -	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : : "memory");
> +	register unsigned long *sp asm ("sp");
> +
> +	/*
> +	 * Read TPIDRPRW.
> +	 * We want to allow caching the value, so avoid using volatile and
> +	 * instead use a fake stack read to hazard against barrier().
> +	 */
> +	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : "Q" (*sp));
> +
> 	return off;
>  }
>  #define __my_cpu_offset __my_cpu_offset()
* [PATCH] ARM: pcpu: ensure __my_cpu_offset cannot be re-ordered across barrier()
  2013-06-04 17:39 ` Nicolas Pitre
@ 2013-06-05  9:59   ` Will Deacon
  0 siblings, 0 replies; 6+ messages in thread
From: Will Deacon @ 2013-06-05 9:59 UTC (permalink / raw)
To: linux-arm-kernel

On Tue, Jun 04, 2013 at 06:39:42PM +0100, Nicolas Pitre wrote:
> On Tue, 4 Jun 2013, Will Deacon wrote:
> > I spoke to the tools guys and they'll look into why the original patch
> > didn't work, since that could have wider implications on our use of
> > barrier().
>
> Good.  However, given the wide range of gcc versions exhibiting the same
> behavior, I think your patch is the best way forward even if gcc turns
> out to be broken and is eventually fixed.
>
> Reviewed-by: Nicolas Pitre <nico@linaro.org>

Thanks Nicolas.

> I'd suggest queuing it for stable as well.

Will do.

Will