[IA64] Reduce __clear_bit

All of lore.kernel.org
 help / color / mirror / Atom feed

* [IA64] Reduce __clear_bit_unlock overhead
@ 2007-10-19  3:38 Christoph Lameter
  2007-10-19  4:34 ` Nick Piggin
                   ` (7 more replies)
  0 siblings, 8 replies; 11+ messages in thread
From: Christoph Lameter @ 2007-10-19  3:38 UTC (permalink / raw)
  To: linux-ia64

On Fri, 19 Oct 2007, Nick Piggin wrote:

> I'm not sure, I had an idea it was relatively expensive on ia64,
> but I didn't really test with a good workload (a microbenchmark
> probably isn't that good because it won't generate too much out
> of order memory traffic that needs to be fenced).

Its expensive on IA64. Is it any less expensive on x86?

> > Where can I find your patchset? I looked through lkml but did not see it.
> 
> Infrastructure in -mm, starting at bitops-introduce-lock-ops.patch.
> bit_spin_lock-use-lock-bitops.patch and ia64-lock-bitops.patch are
> ones to look at.

ia64-lock-bitops.patch defines:

static __inline__ void
clear_bit_unlock (int nr, volatile void *addr)
{
       __u32 mask, old, new;
       volatile __u32 *m;
       CMPXCHG_BUGCHECK_DECL

       m = (volatile __u32 *) addr + (nr >> 5);
       mask = ~(1 << (nr & 31));
       do {
               CMPXCHG_BUGCHECK(m);
               old = *m;
               new = old & mask;
       } while (cmpxchg_rel(m, old, new) != old);
}

/**
 * __clear_bit_unlock - Non-atomically clear a bit with release
 *
 * This is like clear_bit_unlock, but the implementation may use a non-atomic
 * store (this one uses an atomic, however).
 */
#define __clear_bit_unlock clear_bit_unlock


A non atomic store is a misaligned store on IA64. That is not 
relevant here. The data is properly aligned. I guess it was intended to
refer to the cmpxchg.

How about this patch? [Works fine on IA64 simulator...]




IA64: Slim down __clear_bit_unlock

__clear_bit_unlock does not need to perform atomic operations on the variable.
Avoid a cmpxchg and simply do a store with release semantics. Add a barrier to
be safe that the compiler does not do funky things.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/asm-ia64/bitops.h |   12 ++++++++++++
 1 file changed, 12 insertions(+)

Index: linux-2.6.23-mm1/include/asm-ia64/bitops.h
=================================--- linux-2.6.23-mm1.orig/include/asm-ia64/bitops.h	2007-10-18 19:37:22.000000000 -0700
+++ linux-2.6.23-mm1/include/asm-ia64/bitops.h	2007-10-18 19:50:22.000000000 -0700
@@ -124,10 +124,21 @@ clear_bit_unlock (int nr, volatile void 
 /**
  * __clear_bit_unlock - Non-atomically clear a bit with release
  *
- * This is like clear_bit_unlock, but the implementation may use a non-atomic
- * store (this one uses an atomic, however).
+ * This is like clear_bit_unlock, but the implementation uses a store
+ * with release semantics. See also __raw_spin_unlock().
  */
-#define __clear_bit_unlock clear_bit_unlock
+static __inline__ void
+__clear_bit_unlock (int nr, volatile void *addr)
+{
+	__u32 mask, new;
+	volatile __u32 *m;
+
+	m = (volatile __u32 *) addr + (nr >> 5);
+	mask = ~(1 << (nr & 31));
+	new = *m & mask;
+	barrier();
+	asm volatile ("st4.rel.nta [%0] = %1\n\t" :: "r"(m), "r"(new));
+}
 
 /**
  * __clear_bit - Clears a bit in memory (non-atomic version)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [IA64] Reduce __clear_bit_unlock overhead
  2007-10-19  3:38 [IA64] Reduce __clear_bit_unlock overhead Christoph Lameter
@ 2007-10-19  4:34 ` Nick Piggin
  2007-10-19  9:14 ` Zoltan Menyhart
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Nick Piggin @ 2007-10-19  4:34 UTC (permalink / raw)
  To: linux-ia64

On Friday 19 October 2007 13:38, Christoph Lameter wrote:
> On Fri, 19 Oct 2007, Nick Piggin wrote:
> > I'm not sure, I had an idea it was relatively expensive on ia64,
> > but I didn't really test with a good workload (a microbenchmark
> > probably isn't that good because it won't generate too much out
> > of order memory traffic that needs to be fenced).
>
> Its expensive on IA64. Is it any less expensive on x86?

Probably relatively less because x86 has only a very small amount of
possible reordering, perhaps. (it can only reorder loads past
earlier stores).

Of course, we can avoid the fence altogether on x86 as well, because
it simply isn't needed for release semantics.


> > > Where can I find your patchset? I looked through lkml but did not see
> > > it.
> >
> > Infrastructure in -mm, starting at bitops-introduce-lock-ops.patch.
> > bit_spin_lock-use-lock-bitops.patch and ia64-lock-bitops.patch are
> > ones to look at.
>
> ia64-lock-bitops.patch defines:
>
> static __inline__ void
> clear_bit_unlock (int nr, volatile void *addr)
> {
>        __u32 mask, old, new;
>        volatile __u32 *m;
>        CMPXCHG_BUGCHECK_DECL
>
>        m = (volatile __u32 *) addr + (nr >> 5);
>        mask = ~(1 << (nr & 31));
>        do {
>                CMPXCHG_BUGCHECK(m);
>                old = *m;
>                new = old & mask;
>        } while (cmpxchg_rel(m, old, new) != old);
> }
>
> /**
>  * __clear_bit_unlock - Non-atomically clear a bit with release
>  *
>  * This is like clear_bit_unlock, but the implementation may use a
> non-atomic * store (this one uses an atomic, however).
>  */
> #define __clear_bit_unlock clear_bit_unlock
>
>
> A non atomic store is a misaligned store on IA64. That is not
> relevant here. The data is properly aligned. I guess it was intended to
> refer to the cmpxchg.

Well it's just saying that it _can_ be non-atomic, but the implementation
here is actually atomic, because....


> How about this patch? [Works fine on IA64 simulator...]
>
>
>
>
> IA64: Slim down __clear_bit_unlock
>
> __clear_bit_unlock does not need to perform atomic operations on the
> variable. Avoid a cmpxchg and simply do a store with release semantics. Add
> a barrier to be safe that the compiler does not do funky things.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
>  include/asm-ia64/bitops.h |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> Index: linux-2.6.23-mm1/include/asm-ia64/bitops.h
> =================================> --- linux-2.6.23-mm1.orig/include/asm-ia64/bitops.h	2007-10-18
> 19:37:22.000000000 -0700 +++
> linux-2.6.23-mm1/include/asm-ia64/bitops.h	2007-10-18 19:50:22.000000000
> -0700 @@ -124,10 +124,21 @@ clear_bit_unlock (int nr, volatile void
>  /**
>   * __clear_bit_unlock - Non-atomically clear a bit with release
>   *
> - * This is like clear_bit_unlock, but the implementation may use a
> non-atomic - * store (this one uses an atomic, however).
> + * This is like clear_bit_unlock, but the implementation uses a store
> + * with release semantics. See also __raw_spin_unlock().
>   */
> -#define __clear_bit_unlock clear_bit_unlock
> +static __inline__ void
> +__clear_bit_unlock (int nr, volatile void *addr)
> +{
> +	__u32 mask, new;
> +	volatile __u32 *m;
> +
> +	m = (volatile __u32 *) addr + (nr >> 5);
> +	mask = ~(1 << (nr & 31));
> +	new = *m & mask;
> +	barrier();
> +	asm volatile ("st4.rel.nta [%0] = %1\n\t" :: "r"(m), "r"(new));
> +}

.... I didn't realise ia64 could do plain stores with release
semantics -- I should have looked closer at spinlock.h. Yes,
this is exactly what we want here. This patch will now apply
upstream, so if you want to send it to Linus, that would be
nice. Thanks!

Acked-by: Nick Piggin <npiggin@suse.de>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [IA64] Reduce __clear_bit_unlock overhead
  2007-10-19  3:38 [IA64] Reduce __clear_bit_unlock overhead Christoph Lameter
  2007-10-19  4:34 ` Nick Piggin
@ 2007-10-19  9:14 ` Zoltan Menyhart
  2007-10-19  9:28 ` Nick Piggin
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Zoltan Menyhart @ 2007-10-19  9:14 UTC (permalink / raw)
  To: linux-ia64

You may want to avoid assembly magics:

static __inline__ void
__clear_bit_unlock(int const nr, volatile void * const addr)
{
        volatile __u32 * const m = (volatile __u32 *) addr + (nr >> 5);

        *m &= ~(1 << (nr & 0x1f));
}

GCC compiles volatile loads with ".acq" and stores with ".rel".
E.g. the following program:

int lo = 3;

main()
{
        __clear_bit_unlock(1, &lo);
}

compiles into (NOP-s removed):

4000000000000680 <main>:0b 70 e0 03 00 24       [MMI]       addl r14\x120,r1;;
4000000000000686:       f0 00 38 60 21 00                   ld4.acq r15=[r14]
4000000000000690:       0a 78 f4 1f 2c 22       [MMI]       and r15=-3,r15;;
4000000000000696:       00 78 38 60 23 00                   st4.rel [r14]=r15
40000000000006ac:       08 00 84 00                         br.ret.sptk.many b0;;

Actually, we don't need a load with ".acq". A somewhat less readable code:

static __inline__ void
__clear_bit_unlock(int const nr, void * const addr)
{
        __u32 * const p = (__u32 *) addr + (nr >> 5);

        * (volatile __u32 *) p = *p & ~(1 << (nr & 0x1f));
}

gives you:

4000000000000680 <main>:0b 70 e0 03 00 24       [MMI]       addl r14\x120,r1;;
4000000000000686:       f0 00 38 20 20 00                   ld4 r15=[r14]
4000000000000690:       0a 78 f4 1f 2c 22       [MMI]       and r15=-3,r15;;
4000000000000696:       00 78 38 60 23 00                   st4.rel [r14]=r15
40000000000006ac:       08 00 84 00                         br.ret.sptk.many b0;;

that can be slightly more efficient.

Another remark:
We are adding more variants of existing funtions, e.g.:

clear_bit()
__clear_bit()

I've got problems with hidden semantics.
Just reading the source (where they are used), I simply cannot guess
if a primitive is atomic or not, if it is with some fencing or w/o.

Cannot we have some "speaking names"? E.g.: bit_unlock_Natomic_rel()

Zoltan Menyhart

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [IA64] Reduce __clear_bit_unlock overhead
  2007-10-19  3:38 [IA64] Reduce __clear_bit_unlock overhead Christoph Lameter
  2007-10-19  4:34 ` Nick Piggin
  2007-10-19  9:14 ` Zoltan Menyhart
@ 2007-10-19  9:28 ` Nick Piggin
  2007-10-19 10:58 ` Christoph Lameter
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Nick Piggin @ 2007-10-19  9:28 UTC (permalink / raw)
  To: linux-ia64

On Friday 19 October 2007 19:14, Zoltan Menyhart wrote:
> You may want to avoid assembly magics:
>
> static __inline__ void
> __clear_bit_unlock(int const nr, volatile void * const addr)
> {
>         volatile __u32 * const m = (volatile __u32 *) addr + (nr >> 5);
>
>         *m &= ~(1 << (nr & 0x1f));
> }
>
> GCC compiles volatile loads with ".acq" and stores with ".rel".

Hmm, more arbitrary volatile semantics :(

Actually I personally would prefer to use a non-volatile pointer,
and do the assembly explicitly. However, that's not for me to
decide. Importantly, the load with acquire is not required and I
agree it should go. Thanks for noticing that.

> Another remark:
> We are adding more variants of existing funtions, e.g.:
>
> clear_bit()
> __clear_bit()
>
> I've got problems with hidden semantics.

Double underscore prepended versions of the bitops are always
non-atomic. This is nothing new.

Functions with "lock" or "unlock" in their names are relatively
clear to be acquire and release. I preferred to stay with the
lock / unlock theme rather than introduce the new acquire and
release barriers into the kernel. Of course they are the same
thing really, but lock/unlock is always associated with locking
and unlocking something.

I don't want people to use the acquire/release semantics of these
things for anything except taking and releasing a lock (in which
case you really don't have to worry much about memory ordering
issues).

> Just reading the source (where they are used), I simply cannot guess
> if a primitive is atomic or not, if it is with some fencing or w/o.
>
> Cannot we have some "speaking names"? E.g.: bit_unlock_Natomic_rel()

I used the double underscore prefix for bit_unlock simply because
it is pretty well associated with the bitops code. I personally do
not like _Natomic postfix though. Also _rel definitely is redundant
because an unlock operation is not exactly an unlock operation if
it doesn't provide the right memory ordering.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [IA64] Reduce __clear_bit_unlock overhead
  2007-10-19  3:38 [IA64] Reduce __clear_bit_unlock overhead Christoph Lameter
                   ` (2 preceding siblings ...)
  2007-10-19  9:28 ` Nick Piggin
@ 2007-10-19 10:58 ` Christoph Lameter
  2007-10-19 11:12 ` Christoph Lameter
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Christoph Lameter @ 2007-10-19 10:58 UTC (permalink / raw)
  To: linux-ia64

On Fri, 19 Oct 2007, Zoltan Menyhart wrote:

> You may want to avoid assembly magics:
> 
> static __inline__ void
> __clear_bit_unlock(int const nr, volatile void * const addr)
> {
>        volatile __u32 * const m = (volatile __u32 *) addr + (nr >> 5);
> 
>        *m &= ~(1 << (nr & 0x1f));
> }
> 
> GCC compiles volatile loads with ".acq" and stores with ".rel".

But gcc does not generate the .nta type of store.

> Cannot we have some "speaking names"? E.g.: bit_unlock_Natomic_rel()

The convention is that __xxx is the non atomic variant.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [IA64] Reduce __clear_bit_unlock overhead
  2007-10-19  3:38 [IA64] Reduce __clear_bit_unlock overhead Christoph Lameter
                   ` (3 preceding siblings ...)
  2007-10-19 10:58 ` Christoph Lameter
@ 2007-10-19 11:12 ` Christoph Lameter
  2007-10-19 14:15 ` Zoltan Menyhart
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Christoph Lameter @ 2007-10-19 11:12 UTC (permalink / raw)
  To: linux-ia64

On Fri, 19 Oct 2007, Nick Piggin wrote:

> Of course, we can avoid the fence altogether on x86 as well, because
> it simply isn't needed for release semantics.

So we can generally switch to using wmb there?



__clear_bit_unlock: A write barrier is sufficient for release semantics

Switch the smb_mb that causes a fence instruction to be issued on x86
to a write barrier which issues no instruction on x86.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6/include/asm-generic/bitops/lock.h
=================================--- linux-2.6.orig/include/asm-generic/bitops/lock.h	2007-10-19 04:08:58.000000000 -0700
+++ linux-2.6/include/asm-generic/bitops/lock.h	2007-10-19 04:09:17.000000000 -0700
@@ -37,7 +37,7 @@ do {					\
  */
 #define __clear_bit_unlock(nr, addr)	\
 do {					\
-	smp_mb();			\
+	smp_wmb();			\
 	__clear_bit(nr, addr);		\
 } while (0)
 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [IA64] Reduce __clear_bit_unlock overhead
  2007-10-19  3:38 [IA64] Reduce __clear_bit_unlock overhead Christoph Lameter
                   ` (4 preceding siblings ...)
  2007-10-19 11:12 ` Christoph Lameter
@ 2007-10-19 14:15 ` Zoltan Menyhart
  2007-10-19 17:44 ` Christoph Lameter
  2007-10-21  4:43 ` Nick Piggin
  7 siblings, 0 replies; 11+ messages in thread
From: Zoltan Menyhart @ 2007-10-19 14:15 UTC (permalink / raw)
  To: linux-ia64

Christoph Lameter wrote:
> On Fri, 19 Oct 2007, Zoltan Menyhart wrote:
> 
> 
>>You may want to avoid assembly magics:
>>
>>static __inline__ void
>>__clear_bit_unlock(int const nr, volatile void * const addr)
>>{
>>       volatile __u32 * const m = (volatile __u32 *) addr + (nr >> 5);
>>
>>       *m &= ~(1 << (nr & 0x1f));
>>}
>>
>>GCC compiles volatile loads with ".acq" and stores with ".rel".
> 
> 
> But gcc does not generate the .nta type of store.

Can you please tell me what is the advantage of ".nta " on the store?
I far as I can see in the I2 Microarch. Guide, Table 3-2 Processor
Cache Hints, ".nta " on stores means:
- L2 NRU bit is not updated
- No slot is allocated in L3

In order to be able to take advantage of an "st.nta", you have to
use "ld.nta" in __clear_bit_unlock(), and at the bit lock
acquisition, too.

I assume the critical region data protected by the lock is in the
same cache line as the bit lock itself, therefore all loads /
stores have to use ".nta " - that the GCC wont generate.
Should you do it by hand, you would not use the cache as it is
assumed to be used, therefore the cache itself becomes less
efficient.

Nick Piggin wrote:
> Actually I personally would prefer to use a non-volatile pointer,
> and do the assembly explicitly. However, that's not for me to
> decide. Importantly, the load with acquire is not required and I
> agree it should go. Thanks for noticing that.

Well, one of the primary requirements is to avoid people
misunderstanding the code. I can accept that using explicit,
special assembly instructions can help people to understand
the code.

Zoltan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [IA64] Reduce __clear_bit_unlock overhead
  2007-10-19  3:38 [IA64] Reduce __clear_bit_unlock overhead Christoph Lameter
                   ` (5 preceding siblings ...)
  2007-10-19 14:15 ` Zoltan Menyhart
@ 2007-10-19 17:44 ` Christoph Lameter
  2007-10-21  4:43 ` Nick Piggin
  7 siblings, 0 replies; 11+ messages in thread
From: Christoph Lameter @ 2007-10-19 17:44 UTC (permalink / raw)
  To: linux-ia64

On Fri, 19 Oct 2007, Zoltan Menyhart wrote:

> Can you please tell me what is the advantage of ".nta " on the store?
> I far as I can see in the I2 Microarch. Guide, Table 3-2 Processor
> Cache Hints, ".nta " on stores means:
> - L2 NRU bit is not updated
> - No slot is allocated in L3

The advantage is that the cacheline is pushed out faster to real memory 
which may save cacheline eviction if the lock has to be acquired by 
another processor.

> In order to be able to take advantage of an "st.nta", you have to
> use "ld.nta" in __clear_bit_unlock(), and at the bit lock
> acquisition, too.

Hmmm.. Yes this would make the function more complex to read. Maybe its 
not worth having the nta there.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [IA64] Reduce __clear_bit_unlock overhead
  2007-10-19  3:38 [IA64] Reduce __clear_bit_unlock overhead Christoph Lameter
                   ` (6 preceding siblings ...)
  2007-10-19 17:44 ` Christoph Lameter
@ 2007-10-21  4:43 ` Nick Piggin
  7 siblings, 0 replies; 11+ messages in thread
From: Nick Piggin @ 2007-10-21  4:43 UTC (permalink / raw)
  To: linux-ia64

On Friday 19 October 2007 21:12, Christoph Lameter wrote:
> On Fri, 19 Oct 2007, Nick Piggin wrote:
> > Of course, we can avoid the fence altogether on x86 as well, because
> > it simply isn't needed for release semantics.
>
> So we can generally switch to using wmb there?

Not for asm-generic/. But for x86, we don't anything actually
(because writes are ordered). I sent a patch to Linus for this
earlier -- he hasn't taken it yet, but I will ensure it gets
merged before 2.6.24.

Also I didn't get a chance to send out the other patches before,
because of issues with my net connection. Will do so now.

>
>
>
> __clear_bit_unlock: A write barrier is sufficient for release semantics
>
> Switch the smb_mb that causes a fence instruction to be issued on x86
> to a write barrier which issues no instruction on x86.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> Index: linux-2.6/include/asm-generic/bitops/lock.h
> =================================> --- linux-2.6.orig/include/asm-generic/bitops/lock.h	2007-10-19
> 04:08:58.000000000 -0700 +++
> linux-2.6/include/asm-generic/bitops/lock.h	2007-10-19 04:09:17.000000000
> -0700 @@ -37,7 +37,7 @@ do {					\
>   */
>  #define __clear_bit_unlock(nr, addr)	\
>  do {					\
> -	smp_mb();			\
> +	smp_wmb();			\
>  	__clear_bit(nr, addr);		\
>  } while (0)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* SLUB: Avoid atomic operation for slab_unlock
@ 2007-10-18 22:15 Christoph Lameter
  2007-10-19  1:56 ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Lameter @ 2007-10-18 22:15 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm

Currently page flags are only modified in SLUB under page lock. This means
that we do not need an atomic operation to release the lock since there
is nothing we can race against that is modifying page flags. We can simply
clear the bit without the use of an atomic operation and make sure that this
change becomes visible after the other changes to slab metadata through
a memory barrier.

The performance of slab_free() increases 10-15% (SMP configuration doing
a long series of remote frees).

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2007-10-18 14:12:59.000000000 -0700
+++ linux-2.6/mm/slub.c	2007-10-18 14:24:43.000000000 -0700
@@ -1180,9 +1180,22 @@ static __always_inline void slab_lock(st
 	bit_spin_lock(PG_locked, &page->flags);
 }
 
+/*
+ * Slab unlock version that avoids having to use atomic operations
+ * (echos some of the code of bit_spin_unlock!)
+ */
 static __always_inline void slab_unlock(struct page *page)
 {
-	bit_spin_unlock(PG_locked, &page->flags);
+#ifdef CONFIG_SMP
+	unsigned long flags;
+
+	flags = page->flags & ~(1 << PG_locked);
+
+	smp_wmb();
+	page->flags = flags;
+#endif
+	preempt_enable();
+	__release(bitlock);
 }
 
 static __always_inline int slab_trylock(struct page *page)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SLUB: Avoid atomic operation for slab_unlock
@ 2007-10-19  1:56 ` Nick Piggin
  2007-10-19  2:01   ` Christoph Lameter
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2007-10-19  1:56 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Linux Kernel Mailing List, akpm, linux-mm

On Friday 19 October 2007 11:21, Christoph Lameter wrote:
> On Fri, 19 Oct 2007, Nick Piggin wrote:
> > Ah, thanks, but can we just use my earlier patch that does the
> > proper __bit_spin_unlock which is provided by
> > bit_spin_lock-use-lock-bitops.patch
>
> Ok.
>
> > This primitive should have a better chance at being correct, and
> > also potentially be more optimised for each architecture (it
> > only has to provide release consistency).
>
> Yes that is what I attempted to do with the write barrier. To my knowledge
> there are no reads that could bleed out and I wanted to avoid a full fence
> instruction there.

Oh, OK. Bit risky ;) You might be right, but anyway I think it
should be just as fast with the optimised bit_unlock on most
architectures.

Which reminds me, it would be interesting to test the ia64
implementation I did. For the non-atomic unlock, I'm actually
doing an atomic operation there so that it can use the release
barrier rather than the mf. Maybe it's faster the other way around
though? Will be useful to test with something that isn't a trivial
loop, so the slub case would be a good benchmark.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SLUB: Avoid atomic operation for slab_unlock
  2007-10-19  1:56 ` Nick Piggin
@ 2007-10-19  2:01   ` Christoph Lameter
  2007-10-19  2:12     ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Lameter @ 2007-10-19  2:01 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel Mailing List, akpm, linux-mm

On Fri, 19 Oct 2007, Nick Piggin wrote:

> > Yes that is what I attempted to do with the write barrier. To my knowledge
> > there are no reads that could bleed out and I wanted to avoid a full fence
> > instruction there.
> 
> Oh, OK. Bit risky ;) You might be right, but anyway I think it
> should be just as fast with the optimised bit_unlock on most
> architectures.

How expensive is the fence? An store with release semantics would be safer 
and okay for IA64.
 
> Which reminds me, it would be interesting to test the ia64
> implementation I did. For the non-atomic unlock, I'm actually
> doing an atomic operation there so that it can use the release
> barrier rather than the mf. Maybe it's faster the other way around
> though? Will be useful to test with something that isn't a trivial
> loop, so the slub case would be a good benchmark.

Lets avoid mf (too expensive) and just use a store with release semantics.

Where can I find your patchset? I looked through lkml but did not see it.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SLUB: Avoid atomic operation for slab_unlock
  2007-10-19  2:01   ` Christoph Lameter
@ 2007-10-19  2:12     ` Nick Piggin
  2007-10-19  3:26         ` Christoph Lameter
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2007-10-19  2:12 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Linux Kernel Mailing List, akpm, linux-mm

On Friday 19 October 2007 12:01, Christoph Lameter wrote:
> On Fri, 19 Oct 2007, Nick Piggin wrote:
> > > Yes that is what I attempted to do with the write barrier. To my
> > > knowledge there are no reads that could bleed out and I wanted to avoid
> > > a full fence instruction there.
> >
> > Oh, OK. Bit risky ;) You might be right, but anyway I think it
> > should be just as fast with the optimised bit_unlock on most
> > architectures.
>
> How expensive is the fence? An store with release semantics would be safer
> and okay for IA64.

I'm not sure, I had an idea it was relatively expensive on ia64,
but I didn't really test with a good workload (a microbenchmark
probably isn't that good because it won't generate too much out
of order memory traffic that needs to be fenced).

> > Which reminds me, it would be interesting to test the ia64
> > implementation I did. For the non-atomic unlock, I'm actually
> > doing an atomic operation there so that it can use the release
> > barrier rather than the mf. Maybe it's faster the other way around
> > though? Will be useful to test with something that isn't a trivial
> > loop, so the slub case would be a good benchmark.
>
> Lets avoid mf (too expensive) and just use a store with release semantics.

OK, that's what I've done at the moment.

> Where can I find your patchset? I looked through lkml but did not see it.

Infrastructure in -mm, starting at bitops-introduce-lock-ops.patch.
bit_spin_lock-use-lock-bitops.patch and ia64-lock-bitops.patch are
ones to look at.

The rest of the patches I have queued here, apart from the SLUB patch,
I guess aren't so interesting to you (they don't do anything fancy
like convert to non-atomic unlocks, just switch things like page and
buffer locks to use new bitops).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [IA64] Reduce __clear_bit_unlock overhead
  2007-10-19  2:12     ` Nick Piggin
@ 2007-10-19  3:26         ` Christoph Lameter
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Lameter @ 2007-10-19  3:26 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel Mailing List, akpm, linux-mm, linux-ia64

On Fri, 19 Oct 2007, Nick Piggin wrote:

> I'm not sure, I had an idea it was relatively expensive on ia64,
> but I didn't really test with a good workload (a microbenchmark
> probably isn't that good because it won't generate too much out
> of order memory traffic that needs to be fenced).

Its expensive on IA64. Is it any less expensive on x86?

> > Where can I find your patchset? I looked through lkml but did not see it.
> 
> Infrastructure in -mm, starting at bitops-introduce-lock-ops.patch.
> bit_spin_lock-use-lock-bitops.patch and ia64-lock-bitops.patch are
> ones to look at.

ia64-lock-bitops.patch defines:

static __inline__ void
clear_bit_unlock (int nr, volatile void *addr)
{
       __u32 mask, old, new;
       volatile __u32 *m;
       CMPXCHG_BUGCHECK_DECL

       m = (volatile __u32 *) addr + (nr >> 5);
       mask = ~(1 << (nr & 31));
       do {
               CMPXCHG_BUGCHECK(m);
               old = *m;
               new = old & mask;
       } while (cmpxchg_rel(m, old, new) != old);
}

/**
 * __clear_bit_unlock - Non-atomically clear a bit with release
 *
 * This is like clear_bit_unlock, but the implementation may use a non-atomic
 * store (this one uses an atomic, however).
 */
#define __clear_bit_unlock clear_bit_unlock


A non atomic store is a misaligned store on IA64. That is not 
relevant here. The data is properly aligned. I guess it was intended to
refer to the cmpxchg.

How about this patch? [Works fine on IA64 simulator...]




IA64: Slim down __clear_bit_unlock

__clear_bit_unlock does not need to perform atomic operations on the variable.
Avoid a cmpxchg and simply do a store with release semantics. Add a barrier to
be safe that the compiler does not do funky things.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/asm-ia64/bitops.h |   12 ++++++++++++
 1 file changed, 12 insertions(+)

Index: linux-2.6.23-mm1/include/asm-ia64/bitops.h
===================================================================
--- linux-2.6.23-mm1.orig/include/asm-ia64/bitops.h	2007-10-18 19:37:22.000000000 -0700
+++ linux-2.6.23-mm1/include/asm-ia64/bitops.h	2007-10-18 19:50:22.000000000 -0700
@@ -124,10 +124,21 @@ clear_bit_unlock (int nr, volatile void 
 /**
  * __clear_bit_unlock - Non-atomically clear a bit with release
  *
- * This is like clear_bit_unlock, but the implementation may use a non-atomic
- * store (this one uses an atomic, however).
+ * This is like clear_bit_unlock, but the implementation uses a store
+ * with release semantics. See also __raw_spin_unlock().
  */
-#define __clear_bit_unlock clear_bit_unlock
+static __inline__ void
+__clear_bit_unlock (int nr, volatile void *addr)
+{
+	__u32 mask, new;
+	volatile __u32 *m;
+
+	m = (volatile __u32 *) addr + (nr >> 5);
+	mask = ~(1 << (nr & 31));
+	new = *m & mask;
+	barrier();
+	asm volatile ("st4.rel.nta [%0] = %1\n\t" :: "r"(m), "r"(new));
+}
 
 /**
  * __clear_bit - Clears a bit in memory (non-atomic version)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [IA64] Reduce __clear_bit_unlock overhead
@ 2007-10-19  3:26         ` Christoph Lameter
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Lameter @ 2007-10-19  3:26 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel Mailing List, akpm, linux-mm, linux-ia64

On Fri, 19 Oct 2007, Nick Piggin wrote:

> I'm not sure, I had an idea it was relatively expensive on ia64,
> but I didn't really test with a good workload (a microbenchmark
> probably isn't that good because it won't generate too much out
> of order memory traffic that needs to be fenced).

Its expensive on IA64. Is it any less expensive on x86?

> > Where can I find your patchset? I looked through lkml but did not see it.
> 
> Infrastructure in -mm, starting at bitops-introduce-lock-ops.patch.
> bit_spin_lock-use-lock-bitops.patch and ia64-lock-bitops.patch are
> ones to look at.

ia64-lock-bitops.patch defines:

static __inline__ void
clear_bit_unlock (int nr, volatile void *addr)
{
       __u32 mask, old, new;
       volatile __u32 *m;
       CMPXCHG_BUGCHECK_DECL

       m = (volatile __u32 *) addr + (nr >> 5);
       mask = ~(1 << (nr & 31));
       do {
               CMPXCHG_BUGCHECK(m);
               old = *m;
               new = old & mask;
       } while (cmpxchg_rel(m, old, new) != old);
}

/**
 * __clear_bit_unlock - Non-atomically clear a bit with release
 *
 * This is like clear_bit_unlock, but the implementation may use a non-atomic
 * store (this one uses an atomic, however).
 */
#define __clear_bit_unlock clear_bit_unlock


A non atomic store is a misaligned store on IA64. That is not 
relevant here. The data is properly aligned. I guess it was intended to
refer to the cmpxchg.

How about this patch? [Works fine on IA64 simulator...]




IA64: Slim down __clear_bit_unlock

__clear_bit_unlock does not need to perform atomic operations on the variable.
Avoid a cmpxchg and simply do a store with release semantics. Add a barrier to
be safe that the compiler does not do funky things.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/asm-ia64/bitops.h |   12 ++++++++++++
 1 file changed, 12 insertions(+)

Index: linux-2.6.23-mm1/include/asm-ia64/bitops.h
===================================================================
--- linux-2.6.23-mm1.orig/include/asm-ia64/bitops.h	2007-10-18 19:37:22.000000000 -0700
+++ linux-2.6.23-mm1/include/asm-ia64/bitops.h	2007-10-18 19:50:22.000000000 -0700
@@ -124,10 +124,21 @@ clear_bit_unlock (int nr, volatile void 
 /**
  * __clear_bit_unlock - Non-atomically clear a bit with release
  *
- * This is like clear_bit_unlock, but the implementation may use a non-atomic
- * store (this one uses an atomic, however).
+ * This is like clear_bit_unlock, but the implementation uses a store
+ * with release semantics. See also __raw_spin_unlock().
  */
-#define __clear_bit_unlock clear_bit_unlock
+static __inline__ void
+__clear_bit_unlock (int nr, volatile void *addr)
+{
+	__u32 mask, new;
+	volatile __u32 *m;
+
+	m = (volatile __u32 *) addr + (nr >> 5);
+	mask = ~(1 << (nr & 31));
+	new = *m & mask;
+	barrier();
+	asm volatile ("st4.rel.nta [%0] = %1\n\t" :: "r"(m), "r"(new));
+}
 
 /**
  * __clear_bit - Clears a bit in memory (non-atomic version)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2007-10-21  4:43 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-19  3:38 [IA64] Reduce __clear_bit_unlock overhead Christoph Lameter
2007-10-19  4:34 ` Nick Piggin
2007-10-19  9:14 ` Zoltan Menyhart
2007-10-19  9:28 ` Nick Piggin
2007-10-19 10:58 ` Christoph Lameter
2007-10-19 11:12 ` Christoph Lameter
2007-10-19 14:15 ` Zoltan Menyhart
2007-10-19 17:44 ` Christoph Lameter
2007-10-21  4:43 ` Nick Piggin
  -- strict thread matches above, loose matches on Subject: below --
2007-10-18 22:15 SLUB: Avoid atomic operation for slab_unlock Christoph Lameter
2007-10-19  1:56 ` Nick Piggin
2007-10-19  2:01   ` Christoph Lameter
2007-10-19  2:12     ` Nick Piggin
2007-10-19  3:26       ` [IA64] Reduce __clear_bit_unlock overhead Christoph Lameter
2007-10-19  3:26         ` Christoph Lameter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.