public inbox for linux-ia64@vger.kernel.org
* test_and_set_bit implementation
From: Zoltan Menyhart @ 2006-12-12 15:01 UTC
  To: linux-ia64

We have got a test_and_set_bit implementation as follows:

static __inline__ int
test_and_set_bit (int nr, volatile void *addr)
{
        __u32 bit, old, new;
        volatile __u32 *m;

        m = (volatile __u32 *) addr + (nr >> 5);
        bit = 1 << (nr & 31);
        do {
                old = *m;
                new = old | bit;
        } while (cmpxchg_acq(m, old, new) != old);
        return (old & bit) != 0;
}
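
Here cmpxchg_acq(m, old, new) atomically replaces *m with "new", but
only if *m still equals "old"; it returns the value it actually found
there, with acquire ordering (cmpxchg4.acq on ia64). As a C-like sketch
of the semantics only (the real thing is an inline asm macro, and the
name below is just for illustration):

        static __inline__ __u32
        cmpxchg_acq_sketch (volatile __u32 *m, __u32 old, __u32 new)
        {
                __u32 found = *m;       /* all three steps happen atomically */
                if (found == old)
                        *m = new;
                return found;
        }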

Let's assume the bit to be tested and set is already set; why is the
cmpxchg_acq() executed then? Can't we just return, e.g. like this?

        do {
                old = *m;
                if (old & bit)
                        return 1;
                new = old | bit;
        } while (cmpxchg_acq(m, old, new) != old);

Thanks,

Zoltán Menyhárt


* Re: test_and_set_bit implementation
From: Matthew Wilcox @ 2006-12-12 15:47 UTC (permalink / raw)
  To: linux-ia64

On Tue, Dec 12, 2006 at 04:01:00PM +0100, Zoltan Menyhart wrote:
> Let's assume the bit to be tested and set is already set; why is the
> cmpxchg_acq() executed then? Can't we just return, e.g. like this?
> 
>        do {
>                old = *m;
> 		if (old & bit)
> 			return 1;
>                new = old | bit;
>        } while (cmpxchg_acq(m, old, new) != old);

The original code and your rewrite both access memory twice in the loop.
Why don't we do it with one memory reference per loop instead?

{
	CMPXCHG_BUGCHECK_DECL

	u32 *m = (u32 *)addr + (nr >> 5);
	u32 bit = 1 << (nr & 31);

	u32 old = *m;
	while (!(old & bit)) {
		u32 new = old | bit;
		u32 prev = cmpxchg_acq(m, old, new);
		CMPXCHG_BUGCHECK(m);
		if (prev == old)
			return 1;
		old = prev;
	}

	return 0;
}

Looking at the disassembly of grab_block() in fs/ext2/balloc.c, I don't
see much difference.  The ld4.acq turns into a regular ld4 (because
'm' is no longer tagged as volatile), and is hoisted out of the loop.
Interestingly, gcc chooses to reorder the tests, and make the loop four
bundles long instead of three, but will 'goto repeat' in two bundles
instead of four.  Using likely()/unlikely() doesn't persuade gcc to
change the order of the two branches, so I assume it actually is better
to do it this way.
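
For concreteness, the annotated variant was along these lines (a
sketch, not the exact code):

	u32 old = *m;
	while (!(old & bit)) {
		u32 new = old | bit;
		u32 prev = cmpxchg_acq(m, old, new);
		CMPXCHG_BUGCHECK(m);
		if (likely(prev == old))	/* cmpxchg usually succeeds */
			return 1;
		old = prev;
	}
	return 0;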


* Re: test_and_set_bit implementation
From: Christoph Lameter @ 2006-12-12 17:20 UTC (permalink / raw)
  To: linux-ia64

On Tue, 12 Dec 2006, Zoltan Menyhart wrote:

> Let's assume the bit to be tested and set is already set; why is the
> cmpxchg_acq() executed then? Can't we just return, e.g. like this?
> 
>        do {
>                old = *m;
> 		if (old & bit)
> 			return 1;
>                new = old | bit;
>        } while (cmpxchg_acq(m, old, new) != old);


I think this will work. However, usually you execute test_and_set_bit()
because the bit is not set and you want to set it. This change would
optimize a rarely taken path. Probably not worth it.




* Re: test_and_set_bit implementation
From: Christoph Lameter @ 2006-12-12 17:22 UTC (permalink / raw)
  To: linux-ia64

On Tue, 12 Dec 2006, Matthew Wilcox wrote:

> The original code and your rewrite both access memory twice in the loop.
> Why don't we do it with one memory reference per loop instead?

Because the cmpxchg usually succeeds. It is rare that we have to redo
the loop.



* Re: test_and_set_bit implementation
From: Matthew Wilcox @ 2006-12-12 18:07 UTC (permalink / raw)
  To: linux-ia64

On Tue, Dec 12, 2006 at 09:22:44AM -0800, Christoph Lameter wrote:
> On Tue, 12 Dec 2006, Matthew Wilcox wrote:
> 
> > The original code and your rewrite both access memory twice in the loop.
> > Why don't we do it with one memory reference per loop instead?
> 
> Because the cmpxchg usually succeeds. It is rare that we have to redo
> the loop.

Even if the loop executes exactly once, we still win by doing an ld4
instead of an ld4.acq.  An ld4.acq can never be less expensive than a
plain ld4: its acquire semantics prevent later memory accesses from
being reordered ahead of it.


* Re: test_and_set_bit implementation
From: Zoltan Menyhart @ 2006-12-13 10:02 UTC (permalink / raw)
  To: linux-ia64

Matthew Wilcox wrote:

> The original code and your rewrite both access memory twice in the loop.
> Why don't we do it with one memory reference per loop instead?
> 
> {
> 	CMPXCHG_BUGCHECK_DECL
> 
> 	u32 *m = (u32 *)addr + (nr >> 5);
> 	u32 bit = 1 << (nr & 31);
> 
> 	u32 old = *m;
> 	while (!(old & bit)) {
> 		u32 new = old | bit;
> 		u32 prev = cmpxchg_acq(m, old, new);
> 		CMPXCHG_BUGCHECK(m);
> 		if (prev == old)
> 			return 1;
> 		old = prev;
> 	}
> 
> 	return 0;
> }
> 
> Looking at the disassembly of grab_block() in fs/ext2/balloc.c, I don't
> see much difference.  The ld4.acq turns into a regular ld4 (because
> 'm' is no longer tagged as volatile), and is hoisted out of the loop.
> Interestingly, gcc chooses to reorder the tests, and make the loop four
> bundles long instead of three, but will 'goto repeat' in two bundles
> instead of four.  Using likely()/unlikely() doesn't persuade gcc to
> change the order of the two branches, so I assume it actually is better
> to do it this way.

I like this code with the following slight modifications:
- let's keep "m" as pointer to volatile
- let's keep on using "__u32" types
- not sure we need "new"
- return the old bit

	volatile __u32 *m = (volatile __u32 *)addr + (nr >> 5);
	__u32 bit = 1 << (nr & 31);

	__u32 old = *m;
	while (!(old & bit)) {
		__u32 prev = cmpxchg_acq(m, old, old | bit);
		CMPXCHG_BUGCHECK(m);
		if (prev == old)
			return 0;
		old = prev;
	}
	return 1;

It compiles to:

   0:         and r15=31,r32
   6:         mov r14=1
   c:         extr r32=r32,5,27;;
  10:         shladd r18=r32,2,r33;;
  16:         ld4.acq r16=[r18]
  1c:         shl r17=r14,r15;;
  20:         and r14=r17,r16;;
  26:         nop.m 0x0
  2c:         cmp4.eq p7,p6=0,r14
  30:         nop.b 0x0
  36:         nop.b 0x0
  3c:   (p06) br.cond.dpnt.few a0 <+0xa0>
  40:         nop.m 0x0
  46:         zxt4 r14=r16
  4c:         nop.b 0x0;;
  50:         mov.m ar.ccv=r14;;
  56:         nop.m 0x0
  5c:         or r14=r17,r16
  60:         nop.m 0x0;;
  66:         cmpxchg4.acq r14=[r18],r14,ar.ccv
  6c:         nop.i 0x0;;
  70:         cmp4.eq p7,p6=r14,r16
  76:         nop.f 0x0
  7c:         and r15=r17,r14
  80:         mov r16=r14
  86:         nop.f 0x0
  8c:   (p07) br.cond.dpnt.few b0 <+0xb0>;;
  90:         nop.m 0x0
  96:         cmp4.eq p7,p6=0,r15
  9c:   (p07) br.cond.dptk.few 40 <+0x40>
  a0:         nop.m 0x0
  a6:         mov r8=1
  ac:         br.ret.sptk.many b0;;
  b0:         nop.m 0x0
  b6:         mov r8=r0
  bc:         br.ret.sptk.many b0;;

It seems to be o.k., thanks.

Zoltán Menyhárt




* Re: test_and_set_bit implementation
From: Zoltan Menyhart @ 2006-12-13 10:20 UTC (permalink / raw)
  To: linux-ia64

Christoph Lameter wrote:

> I think this will work. However, usually you execute test_and_set_bit()
> because the bit is not set and you want to set it. This change would
> optimize a rarely taken path. Probably not worth it.

What is the probability that the bit is not set?

Adding a test can cost only a few cycles, say max 4.

For an atomic operation, you need to go out and snoop.

Let's have a look at e.g. the "bit_spin_lock()":

static inline void bit_spin_lock(int bitnum, unsigned long *addr)
{
        preempt_disable();
        while (test_and_set_bit(bitnum, addr)) {
                while (test_bit(bitnum, addr)) {
                        preempt_enable();
                        cpu_relax();
                        preempt_disable();
                }
        }
        __acquire(bitlock);
}

By executing the atomic operation unconditionally, you kill the cache
line that all the other waiting processors are spinning on.
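
(For reference, test_bit() is just a plain, non-atomic load, roughly:

        static __inline__ int
        test_bit (int nr, const volatile void *addr)
        {
                return 1 & (((const volatile __u32 *) addr)[nr >> 5] >> (nr & 31));
        }

so the waiters in the inner loop spin on their local, shared copies of
the cache line without generating any bus traffic.)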

Thanks,

Zoltán Menyhárt


* Re: test_and_set_bit implementation
From: Matthew Wilcox @ 2006-12-13 12:29 UTC (permalink / raw)
  To: linux-ia64

On Wed, Dec 13, 2006 at 11:02:48AM +0100, Zoltan Menyhart wrote:
> I like this code with the following slight modifications:
> - let's keep "m" as pointer to volatile

Why?  What benefit is there to doing an ld4.acq instead of a plain ld4?

> - let's keep on using "__u32" types

Why be so ugly?

> - not sure we need "new"

Good idea.

> - return the old bit

I don't understand what you mean here.

> 	volatile __u32 *m = (volatile __u32 *)addr + (nr >> 5);
> 	__u32 bit = 1 << (nr & 31);
> 
> 	__u32 old = *m;
> 	while (!(old & bit)) {
> 		__u32 prev = cmpxchg_acq(m, old, old | bit);
> 		CMPXCHG_BUGCHECK(m);
> 		if (prev == old)
> 			return 0;
> 		old = prev;
> 	}
> 	return 1;
> 
> It compiles to:
> 
>   0:         and r15=31,r32
>   6:         mov r14=1
>   c:         extr r32=r32,5,27;;
>  10:         shladd r18=r32,2,r33;;
>  16:         ld4.acq r16=[r18]
>  1c:         shl r17=r14,r15;;
>  20:         and r14=r17,r16;;
>  26:         nop.m 0x0
>  2c:         cmp4.eq p7,p6=0,r14
>  30:         nop.b 0x0
>  36:         nop.b 0x0
>  3c:   (p06) br.cond.dpnt.few a0 <+0xa0>
>  40:         nop.m 0x0
>  46:         zxt4 r14=r16
>  4c:         nop.b 0x0;;
>  50:         mov.m ar.ccv=r14;;
>  56:         nop.m 0x0
>  5c:         or r14=r17,r16
>  60:         nop.m 0x0;;
>  66:         cmpxchg4.acq r14=[r18],r14,ar.ccv
>  6c:         nop.i 0x0;;
>  70:         cmp4.eq p7,p6=r14,r16
>  76:         nop.f 0x0
>  7c:         and r15=r17,r14
>  80:         mov r16=r14
>  86:         nop.f 0x0
>  8c:   (p07) br.cond.dpnt.few b0 <+0xb0>;;
>  90:         nop.m 0x0
>  96:         cmp4.eq p7,p6=0,r15
>  9c:   (p07) br.cond.dptk.few 40 <+0x40>
>  a0:         nop.m 0x0
>  a6:         mov r8=1
>  ac:         br.ret.sptk.many b0;;
>  b0:         nop.m 0x0
>  b6:         mov r8=r0
>  bc:         br.ret.sptk.many b0;;
> 
> It seems to be o.k., thanks.
> 
> Zoltán Menyhárt
> 


* Re: test_and_set_bit implementation
From: Christoph Lameter @ 2006-12-13 18:28 UTC (permalink / raw)
  To: linux-ia64

On Wed, 13 Dec 2006, Zoltan Menyhart wrote:

> What is the probability that the bit is not set?

Depends on the load on the system. Typically very near 100%.

> Adding a test can cost only a few cycles, say max 4.

But it reduces the performance of the commonly taken code path.

> For an atomic operation, you need to go out and snoop.
> 
> Let's have a look at e.g. the "bit_spin_lock()":
> 
> static inline void bit_spin_lock(int bitnum, unsigned long *addr)
> {
>        preempt_disable();
>        while (test_and_set_bit(bitnum, addr)) {
>                while (test_bit(bitnum, addr)) {
>                        preempt_enable();
>                        cpu_relax();
>                        preempt_disable();
>                }
>        }
>        __acquire(bitlock);
> }
> 
> By executing the atomic operation unconditionally, you kill the cache
> line that all the other waiting processors are spinning on.

But note also that we optimize the common case: the case where the test
and set of the bit succeeds. Only if it was unsuccessful do we fall back
to non-atomic loads. This is done to avoid cachelines bouncing while the
lock is contended.



* Re: test_and_set_bit implementation
From: Zoltan Menyhart @ 2006-12-14  9:24 UTC (permalink / raw)
  To: linux-ia64

Matthew Wilcox wrote:
> On Wed, Dec 13, 2006 at 11:02:48AM +0100, Zoltan Menyhart wrote:
> 
>>I like this code with the following slight modifications:
>>- let's keep "m" as pointer to volatile
> 
> Why?  What benefit is there to doing an ld4.acq instead of a plain ld4?

It is not about the ld4.acq, but about the volatile qualifier (= don't
assume it won't be changed by someone else; reload it each time).

Not using the volatile keyword allows the compiler to optimize
out the actual load in loops like:

	while (test_and_set_bit(bitnum, addr))
		;
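
If everything is inlined and the load is not volatile, gcc would be
within its rights to hoist the load out of the loop and compile the
contended path, in effect, as:

	__u32 old = *m;		/* single load, hoisted out of the loop */
	while (old & bit)
		;		/* spins on the stale value forever */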


>>- let's keep on using "__u32" types
> 
> Why be so ugly?

- It has been like that for a while, I got used to it :-)
- grep-ing for "__u32" gives fewer false positives than grep-ing for "u32"
 
>>- return the old bit
> 
> I don't understand what you mean here.

In your code:
...
    while (!(old & bit)) {
...
            return 1;
...
    }
    return 0;

You return the inverse of the old bit, while the original code says:

" * test_and_set_bit - Set a bit and return its old value"

Thanks.

Zoltán Menyhárt



* Re: test_and_set_bit implementation
From: Zoltan Menyhart @ 2006-12-14  9:37 UTC (permalink / raw)
  To: linux-ia64

Christoph Lameter wrote:

>>What is the probability that the bit is not set?

> Depends on the load on the system. Typically very near 100%.

If you are right, why does "bit_spin_lock()" need to be this complicated?

       while (test_and_set_bit(bitnum, addr)) {
               while (test_bit(bitnum, addr)) {
                       preempt_enable();
                       cpu_relax();
                       preempt_disable();
               }
       } 

If "test_and_set_bit()" did not modify the lock while it's busy, a
simpler solution would do:

       while (test_and_set_bit(bitnum, addr)) {
               preempt_enable();
               cpu_relax();
               preempt_disable();
       } 

>>Adding a test can cost only a few cycles, say max 4.
> 
> But it reduces the performance of the commonly taken code path.
...
>>
>>By executing the atomic operation unconditionally, you kill the cache
>>line that all the other waiting processors are spinning on.
> 
> But note also that we optimize the common case: the case where the test
> and set of the bit succeeds. Only if it was unsuccessful do we fall back
> to non-atomic loads. This is done to avoid cachelines bouncing while the
> lock is contended.

I can accept these arguments for small, not-too-busy systems.

As the snooping and cache line bouncing costs increase quadratically
with the system size...

Have you got some lock benchmarks to measure it?

Thanks,

Zoltán Menyhárt

