* test_and_set_bit implementation
From: Zoltan Menyhart @ 2006-12-12 15:01 UTC (permalink / raw)
To: linux-ia64
We have got a test_and_set_bit implementation as follows:
static __inline__ int
test_and_set_bit (int nr, volatile void *addr)
{
	__u32 bit, old, new;
	volatile __u32 *m;

	m = (volatile __u32 *) addr + (nr >> 5);
	bit = 1 << (nr & 31);
	do {
		old = *m;
		new = old | bit;
	} while (cmpxchg_acq(m, old, new) != old);
	return (old & bit) != 0;
}
Let's assume the bit to be tested and set is already set; why is the
cmpxchg_acq() executed then? Can't we just return, e.g. like this?
	do {
		old = *m;
		if (old & bit)
			return 1;
		new = old | bit;
	} while (cmpxchg_acq(m, old, new) != old);
Thanks,
Zoltán Menyhárt
* Re: test_and_set_bit implementation
From: Matthew Wilcox @ 2006-12-12 15:47 UTC (permalink / raw)
To: linux-ia64
On Tue, Dec 12, 2006 at 04:01:00PM +0100, Zoltan Menyhart wrote:
> Let's assume the bit to be tested and set is already set; why is the
> cmpxchg_acq() executed then? Can't we just return, e.g. like this?
>
> 	do {
> 		old = *m;
> 		if (old & bit)
> 			return 1;
> 		new = old | bit;
> 	} while (cmpxchg_acq(m, old, new) != old);
The original code and your rewrite both access memory twice in the loop.
Why don't we do it with one memory reference per loop instead?
{
	CMPXCHG_BUGCHECK_DECL

	u32 *m = (u32 *)addr + (nr >> 5);
	u32 bit = 1 << (nr & 31);

	u32 old = *m;
	while (!(old & bit)) {
		u32 new = old | bit;
		u32 prev = cmpxchg_acq(m, old, new);
		CMPXCHG_BUGCHECK(m);
		if (prev == old)
			return 1;
		old = prev;
	}

	return 0;
}
Looking at the disassembly of grab_block() in fs/ext2/balloc.c, I don't
see much difference. The ld4.acq turns into a regular ld4 (because
'm' is no longer tagged as volatile), and is hoisted out of the loop.
Interestingly, gcc chooses to reorder the tests, and make the loop four
bundles long instead of three, but will 'goto repeat' in two bundles
instead of four. Using likely()/unlikely() doesn't persuade gcc to
change the order of the two branches, so I assume it actually is better
to do it this way.
* Re: test_and_set_bit implementation
From: Christoph Lameter @ 2006-12-12 17:20 UTC (permalink / raw)
To: linux-ia64
On Tue, 12 Dec 2006, Zoltan Menyhart wrote:
> Let's assume the bit to be tested and set is already set; why is the
> cmpxchg_acq() executed then? Can't we just return, e.g. like this?
>
> 	do {
> 		old = *m;
> 		if (old & bit)
> 			return 1;
> 		new = old | bit;
> 	} while (cmpxchg_acq(m, old, new) != old);
I think this will work. However, you usually execute test_and_set_bit()
because the bit is not set and you want to set it. This change would
optimize a rarely taken path. Probably not worth it.
* Re: test_and_set_bit implementation
From: Christoph Lameter @ 2006-12-12 17:22 UTC (permalink / raw)
To: linux-ia64
On Tue, 12 Dec 2006, Matthew Wilcox wrote:
> The original code and your rewrite both access memory twice in the loop.
> Why don't we do it with one memory reference per loop instead?
Because the cmpxchg usually succeeds. It is rare that we have to redo
the loop.
* Re: test_and_set_bit implementation
From: Matthew Wilcox @ 2006-12-12 18:07 UTC (permalink / raw)
To: linux-ia64
On Tue, Dec 12, 2006 at 09:22:44AM -0800, Christoph Lameter wrote:
> On Tue, 12 Dec 2006, Matthew Wilcox wrote:
>
> > The original code and your rewrite both access memory twice in the loop.
> > Why don't we do it with one memory reference per loop instead?
>
> Because the cmpxchg usually succeeds. It is rare that we have to redo
> the loop.
Even if the loop executes exactly once, we still win by an ld4 vs an
ld4.acq. It can't be less expensive to do an ld4.acq than it is to do
an ld4.
* Re: test_and_set_bit implementation
From: Zoltan Menyhart @ 2006-12-13 10:02 UTC (permalink / raw)
To: linux-ia64
Matthew Wilcox wrote:
> The original code and your rewrite both access memory twice in the loop.
> Why don't we do it with one memory reference per loop instead?
>
> {
> 	CMPXCHG_BUGCHECK_DECL
>
> 	u32 *m = (u32 *)addr + (nr >> 5);
> 	u32 bit = 1 << (nr & 31);
>
> 	u32 old = *m;
> 	while (!(old & bit)) {
> 		u32 new = old | bit;
> 		u32 prev = cmpxchg_acq(m, old, new);
> 		CMPXCHG_BUGCHECK(m);
> 		if (prev == old)
> 			return 1;
> 		old = prev;
> 	}
>
> 	return 0;
> }
>
> Looking at the disassembly of grab_block() in fs/ext2/balloc.c, I don't
> see much difference. The ld4.acq turns into a regular ld4 (because
> 'm' is no longer tagged as volatile), and is hoisted out of the loop.
> Interestingly, gcc chooses to reorder the tests, and make the loop four
> bundles long instead of three, but will 'goto repeat' in two bundles
> instead of four. Using likely()/unlikely() doesn't persuade gcc to
> change the order of the two branches, so I assume it actually is better
> to do it this way.
I like this code with the following slight modifications:
- let's keep "m" as pointer to volatile
- let's keep on using "__u32" types
- not sure we need "new"
- return the old bit
	volatile __u32 *m = (volatile __u32 *)addr + (nr >> 5);
	__u32 bit = 1 << (nr & 31);

	__u32 old = *m;
	while (!(old & bit)) {
		__u32 prev = cmpxchg_acq(m, old, old | bit);
		CMPXCHG_BUGCHECK(m);
		if (prev == old)
			return 0;
		old = prev;
	}
	return 1;
It compiles to:
0: and r15=31,r32
6: mov r14=1
c: extr r32=r32,5,27;;
10: shladd r18=r32,2,r33;;
16: ld4.acq r16=[r18]
1c: shl r17=r14,r15;;
20: and r14=r17,r16;;
26: nop.m 0x0
2c: cmp4.eq p7,p6=0,r14
30: nop.b 0x0
36: nop.b 0x0
3c: (p06) br.cond.dpnt.few a0 <+0xa0>
40: nop.m 0x0
46: zxt4 r14=r16
4c: nop.b 0x0;;
50: mov.m ar.ccv=r14;;
56: nop.m 0x0
5c: or r14=r17,r16
60: nop.m 0x0;;
66: cmpxchg4.acq r14=[r18],r14,ar.ccv
6c: nop.i 0x0;;
70: cmp4.eq p7,p6=r14,r16
76: nop.f 0x0
7c: and r15=r17,r14
80: mov r16=r14
86: nop.f 0x0
8c: (p07) br.cond.dpnt.few b0 <+0xb0>;;
90: nop.m 0x0
96: cmp4.eq p7,p6=0,r15
9c: (p07) br.cond.dptk.few 40 <+0x40>
a0: nop.m 0x0
a6: mov r8=1
ac: br.ret.sptk.many b0;;
b0: nop.m 0x0
b6: mov r8=r0
bc: br.ret.sptk.many b0;;
It seems to be o.k., thanks.
Zoltán Menyhárt
* Re: test_and_set_bit implementation
From: Zoltan Menyhart @ 2006-12-13 10:20 UTC (permalink / raw)
To: linux-ia64
Christoph Lameter wrote:
> I think this will work. However, you usually execute test_and_set_bit()
> because the bit is not set and you want to set it. This change would
> optimize a rarely taken path. Probably not worth it.
What is the probability that the bit is not set?
Adding a test costs only a few cycles, say at most 4.
For an atomic operation, you need to go out and snoop.
Let's have a look at e.g. the "bit_spin_lock()":
static inline void bit_spin_lock(int bitnum, unsigned long *addr)
{
	preempt_disable();
	while (test_and_set_bit(bitnum, addr)) {
		while (test_bit(bitnum, addr)) {
			preempt_enable();
			cpu_relax();
			preempt_disable();
		}
	}
	__acquire(bitlock);
}
By executing the atomic operation unconditionally, you invalidate
the cache line that all the other waiting processors are looping on.
Thanks,
Zoltán Menyhárt
* Re: test_and_set_bit implementation
From: Matthew Wilcox @ 2006-12-13 12:29 UTC (permalink / raw)
To: linux-ia64
On Wed, Dec 13, 2006 at 11:02:48AM +0100, Zoltan Menyhart wrote:
> I like this code with the following slight modifications:
> - let's keep "m" as pointer to volatile
Why? What benefit is there to doing an ld4.acq instead of a plain ld4?
> - let's keep on using "__u32" types
Why be so ugly?
> - not sure we need "new"
Good idea.
> - return the old bit
I don't understand what you mean here.
> volatile __u32 *m = (volatile __u32 *)addr + (nr >> 5);
> __u32 bit = 1 << (nr & 31);
>
> __u32 old = *m;
> while (!(old & bit)) {
> 	__u32 prev = cmpxchg_acq(m, old, old | bit);
> 	CMPXCHG_BUGCHECK(m);
> 	if (prev == old)
> 		return 0;
> 	old = prev;
> }
> return 1;
>
> It compiles to:
>
> 0: and r15=31,r32
> 6: mov r14=1
> c: extr r32=r32,5,27;;
> 10: shladd r18=r32,2,r33;;
> 16: ld4.acq r16=[r18]
> 1c: shl r17=r14,r15;;
> 20: and r14=r17,r16;;
> 26: nop.m 0x0
> 2c: cmp4.eq p7,p6=0,r14
> 30: nop.b 0x0
> 36: nop.b 0x0
> 3c: (p06) br.cond.dpnt.few a0 <+0xa0>
> 40: nop.m 0x0
> 46: zxt4 r14=r16
> 4c: nop.b 0x0;;
> 50: mov.m ar.ccv=r14;;
> 56: nop.m 0x0
> 5c: or r14=r17,r16
> 60: nop.m 0x0;;
> 66: cmpxchg4.acq r14=[r18],r14,ar.ccv
> 6c: nop.i 0x0;;
> 70: cmp4.eq p7,p6=r14,r16
> 76: nop.f 0x0
> 7c: and r15=r17,r14
> 80: mov r16=r14
> 86: nop.f 0x0
> 8c: (p07) br.cond.dpnt.few b0 <+0xb0>;;
> 90: nop.m 0x0
> 96: cmp4.eq p7,p6=0,r15
> 9c: (p07) br.cond.dptk.few 40 <+0x40>
> a0: nop.m 0x0
> a6: mov r8=1
> ac: br.ret.sptk.many b0;;
> b0: nop.m 0x0
> b6: mov r8=r0
> bc: br.ret.sptk.many b0;;
>
> It seems to be o.k., thanks.
>
> Zoltán Menyhárt
>
* Re: test_and_set_bit implementation
From: Christoph Lameter @ 2006-12-13 18:28 UTC (permalink / raw)
To: linux-ia64
On Wed, 13 Dec 2006, Zoltan Menyhart wrote:
> What is the probability that the bit is not set?
Depends on the load on the system. Typically very near 100%.
> Adding a test costs only a few cycles, say at most 4.
But it reduces the performance of the commonly taken code path.
> For an atomic operation, you need to go out and snoop.
>
> Let's have a look at e.g. the "bit_spin_lock()":
>
> static inline void bit_spin_lock(int bitnum, unsigned long *addr)
> {
> 	preempt_disable();
> 	while (test_and_set_bit(bitnum, addr)) {
> 		while (test_bit(bitnum, addr)) {
> 			preempt_enable();
> 			cpu_relax();
> 			preempt_disable();
> 		}
> 	}
> 	__acquire(bitlock);
> }
>
> By executing the atomic operation unconditionally, you invalidate
> the cache line that all the other waiting processors are looping on.
But note also that we optimize the common case: the case where the
test-and-set succeeds. Only if it does not succeed do we fall back to
non-atomic loads. This is done to avoid cache lines bouncing while the
lock is contended.
* Re: test_and_set_bit implementation
From: Zoltan Menyhart @ 2006-12-14 9:24 UTC (permalink / raw)
To: linux-ia64
Matthew Wilcox wrote:
> On Wed, Dec 13, 2006 at 11:02:48AM +0100, Zoltan Menyhart wrote:
>
>>I like this code with the following slight modifications:
>>- let's keep "m" as pointer to volatile
>
> Why? What benefit is there to doing an ld4.acq instead of a plain ld4?
It is not about the ld4.acq but about the volatile qualifier (i.e. don't
assume the value won't be changed by someone else; reload it each time).
Not using the volatile keyword allows the compiler to optimize
out the actual load in loops like:

	while (test_and_set_bit(bitnum, addr))
		;
>>- let's keep on using "__u32" types
>
> Why be so ugly?
- It has been like that for a while; I got used to it :-)
- grep-ing for "__u32" gives fewer false positives than grep-ing for "u32"
>>- return the old bit
>
> I don't understand what you mean here.
In your code:

	...
	while (!(old & bit)) {
		...
		return 1;
		...
	}
	return 0;

You return the inverse of the old bit, while the original code says:
" * test_and_set_bit - Set a bit and return its old value"
Thanks.
Zoltán Menyhárt
* Re: test_and_set_bit implementation
From: Zoltan Menyhart @ 2006-12-14 9:37 UTC (permalink / raw)
To: linux-ia64
Christoph Lameter wrote:
>>What is the probability that the bit is not set?
> Depends on the load on the system. Typically very near 100%.
If you are right, why does "bit_spin_lock()" need to be so complicated?
	while (test_and_set_bit(bitnum, addr)) {
		while (test_bit(bitnum, addr)) {
			preempt_enable();
			cpu_relax();
			preempt_disable();
		}
	}
If "test_and_set_bit()" did not write to the lock word while it is busy,
a simpler solution would do:
	while (test_and_set_bit(bitnum, addr)) {
		preempt_enable();
		cpu_relax();
		preempt_disable();
	}
>>Adding a test costs only a few cycles, say at most 4.
>
> But it reduces the performance of the commonly taken code path.
...
>>
>>By executing the atomic operation unconditionally, you invalidate
>>the cache line that all the other waiting processors are looping on.
>
> But note also that we optimize the common case: the case where the
> test-and-set succeeds. Only if it does not succeed do we fall back to
> non-atomic loads. This is done to avoid cache lines bouncing while the
> lock is contended.
I can accept these arguments for small, not-too-busy systems,
but the snooping and cache-line-bouncing costs increase quadratically
with the system size...
Have you got some lock benchmarks to measure it?
Thanks,
Zoltán Menyhárt