public inbox for linux-kernel@vger.kernel.org
* i386 spinlock fairness: bizarre test results
@ 2005-10-11  4:04 Chuck Ebbert
  2005-10-11  9:42 ` Eric Dumazet
  2005-10-11 13:00 ` Alan Cox
  0 siblings, 2 replies; 7+ messages in thread
From: Chuck Ebbert @ 2005-10-11  4:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux, Linus Torvalds, Kirill Korotaev

  After seeing Kirill's message about spinlocks I decided to do my own
testing with the userspace program below; the results were very strange.

  When using the 'mov' instruction to do the unlock I was able to reproduce
hogging of the spinlock by a single CPU even on Pentium II under some
conditions, while using 'xchg' always allowed the other CPU to get the
lock:


[me@d2 t]$ gcc -O2 -o spintest.o -Wall -DUSE_MOV=1 spintest.c ; ./spintest.o
parent CPU 1, child CPU 0, using mov instruction for unlock
parent did: 34534 of 10000001 iterations
CPU clocks:     2063591864

[me@d2 t]$ gcc -O2 -o spintest.o -Wall -DUSE_MOV=0 spintest.c ; ./spintest.o
parent CPU 1, child CPU 0, using xchg instruction for unlock
parent did: 5169760 of 10000001 iterations
CPU clocks:     2164689784


  The results were dependent on the alignment of the "lock ; decl" at the start
of the spinlock code.  If that 7-byte instruction spanned two cachelines then
the CPU with that code could somehow hog the spinlock.  Optimizing for
Pentium II forced 16-byte alignment and made the spinlock fairer, but still
somewhat biased when using 'mov':


[me@d2 t]$ gcc -O2 -o spintest.o -Wall -DUSE_MOV=1 -mcpu=pentium2 spintest.c ; ./spintest.o
parent CPU 1, child CPU 0, using mov instruction for unlock
parent did: 4181147 of 10000001 iterations
CPU clocks:     2057436825


  That test machine was a dual 350MHz Pentium II Xeon; on a dual 333MHz Pentium II
Overdrive (with very slow Socket 8 bus) I could not reproduce those results.
However, on that machine the 'xchg' instruction made the test run almost 20%
_faster_ than using 'mov'.

  So I think the i386 spinlock code should be changed to always use 'xchg' to do
spin_unlock.

================================================================================
/* spinlock test
 *
 * Tests spinlock fairness on SMP i386 machine.
 */

/* number of tests  */
#ifndef ITERS
#define ITERS 10000000
#endif

/* use "mov" instruction for spin_unlock instead of "xchg"  */
#ifndef USE_MOV
#define USE_MOV 1
#endif

/* cpu to run parent process; child will use !PARENT_CPU  */
#ifndef PARENT_CPU
#define PARENT_CPU 1
#endif

/* change this to match your version of glibc -- "2" means use 2 args */
#ifndef SETAFFINITY
#define SETAFFINITY 2
#endif

#if SETAFFINITY == 2
#define setaffinity(pid, mask) sched_setaffinity((pid), &(mask))
#else /* 3 args  */
#define setaffinity(pid, mask) sched_setaffinity((pid), sizeof(mask), &(mask))
#endif

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <asm/user.h>

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

#define RDTSCLL(r) asm volatile("rdtsc" : "=A" (r))

unsigned long long tsc1, tsc2;
cpu_set_t cpuset0, cpuset1;
int test = 1; /* for testing -- starts unlocked  */
int strt = 0; /* sync startup  -- starts locked  */
int stk[1024];
int iters, p_iters, c_iters;

static inline void __raw_spin_lock(int *l)
{
	asm volatile(
		"\n1:\t"
		"lock ; decl %0\n\t"
		"jns 3f\n"
		"2:\t"
		"rep;nop\n\t"
		"cmpl $0,%0\n\t"
		"jle 2b\n\t"
		"jmp 1b\n"
		"3:\n\t"
		:"=m" (*l) : : "memory");
}

#if USE_MOV

static inline void __raw_spin_unlock(int *l)
{
	asm volatile("movl $1,%0"
		     : "=m" (*l) : : "memory");
}

#else

static inline void __raw_spin_unlock(int *l)
{
	int oldval = 1;

	asm volatile("xchgl %0,%1"
		     : "=q" (oldval), "=m" (*l)
		     : "0" (oldval) : "memory");
}

#endif /* USE_MOV  */

int do_child(void *vp)
{
	CPU_SET(!PARENT_CPU, &cpuset1);
	if (unlikely(setaffinity(0, cpuset1)))
		perror("child setaffinity");

	/* Add "nop" insns as necessary to make 1st
	 * insn of __raw_spin_lock span cachelines.
	 */
	asm("nop ; nop ; nop");

	__raw_spin_unlock(&strt); /* release parent  */
again:
	__raw_spin_lock(&test);
	c_iters++;
	if (likely(++iters < ITERS)) {
		__raw_spin_unlock(&test);
		goto again;
	}
	__raw_spin_unlock(&test);

	return 0;
}

int main(int argc, char * const argv[])
{
	CPU_SET(PARENT_CPU, &cpuset0);
	if (unlikely(setaffinity(0, cpuset0)))
		perror("parent setaffinity");

	clone(do_child, &stk[1023], CLONE_VM, 0);

	__raw_spin_lock(&strt); /* wait for child to init  */
	RDTSCLL(tsc1);
again:
	__raw_spin_lock(&test);
	p_iters++;
	if (likely(++iters < ITERS)) {
		__raw_spin_unlock(&test);
		goto again;
	}
	__raw_spin_unlock(&test);
	RDTSCLL(tsc2);

	printf("parent CPU %d, child CPU %d, ", PARENT_CPU, !PARENT_CPU);
	printf("using %s instruction for unlock\n", USE_MOV ? "mov" : "xchg");
	printf("parent did: %d of %d iterations\n", p_iters, iters);
	printf("CPU clocks: %14llu\n", tsc2 - tsc1);

	return 0;
}
__
Chuck

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: i386 spinlock fairness: bizarre test results
       [not found] <4WjCM-7Aq-7@gated-at.bofh.it>
@ 2005-10-11  4:32 ` Robert Hancock
  2005-10-11 12:31   ` Joe Seigh
  0 siblings, 1 reply; 7+ messages in thread
From: Robert Hancock @ 2005-10-11  4:32 UTC (permalink / raw)
  To: linux-kernel

Chuck Ebbert wrote:
>   After seeing Kirill's message about spinlocks I decided to do my own
> testing with the userspace program below; the results were very strange.
> 
>   When using the 'mov' instruction to do the unlock I was able to reproduce
> hogging of the spinlock by a single CPU even on Pentium II under some
> conditions, while using 'xchg' always allowed the other CPU to get the
> lock:

This might not necessarily be a win in all situations. If two CPUs A and
B are trying to get into a spinlock-protected critical section to do 5
operations, it may well be more efficient for them to do AAAAABBBBB as
opposed to ABABABABAB, as the second situation may result in cache lines
bouncing between the two CPUs each time, etc.

I don't know that making spinlocks "fairer" is really very worthwhile. 
If some spinlocks are so heavily contended that fairness becomes an 
issue, it would be better to find a way to reduce that contention.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/



* Re: i386 spinlock fairness: bizarre test results
  2005-10-11  4:04 Chuck Ebbert
@ 2005-10-11  9:42 ` Eric Dumazet
  2005-10-11 13:00 ` Alan Cox
  1 sibling, 0 replies; 7+ messages in thread
From: Eric Dumazet @ 2005-10-11  9:42 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: linux-kernel, linux, Linus Torvalds, Kirill Korotaev

Chuck Ebbert wrote:
>   After seeing Kirill's message about spinlocks I decided to do my own
> testing with the userspace program below; the results were very strange.
> 
>   When using the 'mov' instruction to do the unlock I was able to reproduce
> hogging of the spinlock by a single CPU even on Pentium II under some
> conditions, while using 'xchg' always allowed the other CPU to get the
> lock:
> 
> 
> parent CPU 1, child CPU 0, using mov instruction for unlock
> parent did: 34534 of 10000001 iterations
> CPU clocks:     2063591864
> 
> parent CPU 1, child CPU 0, using xchg instruction for unlock
> parent did: 5169760 of 10000001 iterations
> CPU clocks:     2164689784
> 
Hi Chuck

That's interesting, because on my machines I get the opposite results:

On a Xeon 2.0 GHz, Hyper Threading, I get better results (in the sense of 
fairness) with the MOV version

    parent CPU 1, child CPU 0, using mov instruction for unlock
    parent did: 5291346 of 10000001 iterations
    CPU clocks:     3.437.053.244

    parent CPU 1, child CPU 0, using xchg instruction for unlock
    parent did: 7732349 of 10000001 iterations
    CPU clocks:     3.408.285.308


On a dual Opteron 248 (2.2GHz) machine, I get bad fairness results regardless 
of MOV/XCHG (but fewer cycles per iteration). There is a lot of variation in 
the results (maybe because of NUMA effects?), but xchg gives slightly better 
fairness.

    parent CPU 1, child CPU 0, using mov instruction for unlock
    parent did: 256810 of 10000001 iterations
    CPU clocks:      772.838.640

    parent CPU 1, child CPU 0, using xchg instruction for unlock
    parent did: 438280 of 10000001 iterations
    CPU clocks:     1.115.653.346

    parent CPU 1, child CPU 0, using xchg instruction for unlock
    parent did: 574501 of 10000001 iterations
    CPU clocks:     1.200.129.428


On a dual core Opteron 275 (2.2GHz), xchg is faster but unfair.

(threads running on the same physical CPU)
    parent CPU 1, child CPU 0, using mov instruction for unlock
    parent did: 4822270 of 10000001 iterations
    CPU clocks:      738.438.383

    parent CPU 1, child CPU 0, using xchg instruction for unlock
    parent did: 9702075 of 10000001 iterations
    CPU clocks:      561.457.724

Totally different results if the affinity is changed so that the threads run 
on different physical CPUs (XCHG is slower):

    parent CPU 0, child CPU 2, using mov instruction for unlock
    parent did: 1081522 of 10000001 iterations
    CPU clocks:      508.611.273

    parent CPU 0, child CPU 2, using xchg instruction for unlock
    parent did: 4310427 of 10000000 iterations
    CPU clocks:     1.074.170.246



For reference, if only one thread is running, the MOV version is also faster 
on both platforms:

[Xeon 2GHz]
    one thread, using mov instruction for unlock
    10000000 iterations
    CPU clocks:     1.278.879.528

[Xeon 2GHz]
    one thread, using xchg instruction for unlock
    10000000 iterations
    CPU clocks:     2.486.912.752

Of course Opterons are faster :) (less cycles per iteration)

[Opteron 248]
    one thread, using mov instruction for unlock
    10000000 iterations
    CPU clocks:      212.514.637

[Opteron 248]
    one thread, using xchg instruction for unlock
    10000000 iterations
    CPU clocks:      383.306.420

[Opteron 275]
    one thread, using mov instruction for unlock
    10000000 iterations
    CPU clocks:      208.472.009

[Opteron 275]
    one thread, using xchg instruction for unlock
    10000000 iterations
    CPU clocks:      417.502.675

In conclusion, I would say that for uncontended locks, MOV is faster. For 
contended locks, XCHG might be faster on some platforms (dual-core Opterons, 
and only if the threads are on the same physical CPU).

Eric



* Re: i386 spinlock fairness: bizarre test results
  2005-10-11  4:32 ` i386 spinlock fairness: bizarre test results Robert Hancock
@ 2005-10-11 12:31   ` Joe Seigh
  2005-10-11 21:01     ` Bill Davidsen
  0 siblings, 1 reply; 7+ messages in thread
From: Joe Seigh @ 2005-10-11 12:31 UTC (permalink / raw)
  To: linux-kernel

Robert Hancock wrote:
> Chuck Ebbert wrote:
> 
>>   After seeing Kirill's message about spinlocks I decided to do my own
>> testing with the userspace program below; the results were very strange.
>>
>>   When using the 'mov' instruction to do the unlock I was able to 
>> reproduce
>> hogging of the spinlock by a single CPU even on Pentium II under some
>> conditions, while using 'xchg' always allowed the other CPU to get the
>> lock:
> 
> 
> This might not necessarily be a win in all situations. If two CPUs A and 
>  B are trying to get into a spinlock-protected critical section to do 5 
> operations, it may well be more efficient for them to do AAAAABBBBB as 
> opposed to ABABABABAB, as the second situation may result in cache lines 
> bouncing between the two CPUs each time, etc.
> 
> I don't know that making spinlocks "fairer" is really very worthwhile. 
> If some spinlocks are so heavily contended that fairness becomes an 
> issue, it would be better to find a way to reduce that contention.
> 

You're right that it wouldn't be an issue on a system with relatively few
CPUs, since that amount of contention would cripple the system.  Though
with hundreds of CPUs you could get contention hotspots, with some spinlocks
being concurrently accessed by some subset of the CPUs for periods of time.

The real issue is scalability, or how gracefully a system degrades
when it starts to hit its contention limits.  It's not a good thing when
a system appears to run fine and then catastrophically hangs when it
bumps across its critical limit.  It's better when a system exhibits
some sort of linear degradation.  The former exhibits bistable behavior,
which requires a drastic, probably impossible, reduction in workload
to regain normal performance; reboots are the normal course of correction.
A linearly degrading system just requires moderation of the workload
to move back into acceptable performance.

Anyway, if you want to build a scalable system, it makes sense to build it
out of scalable components.  Right?

Joe Seigh




* Re: i386 spinlock fairness: bizarre test results
  2005-10-11  4:04 Chuck Ebbert
  2005-10-11  9:42 ` Eric Dumazet
@ 2005-10-11 13:00 ` Alan Cox
  2005-10-11 14:44   ` Linus Torvalds
  1 sibling, 1 reply; 7+ messages in thread
From: Alan Cox @ 2005-10-11 13:00 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: linux-kernel, linux, Linus Torvalds, Kirill Korotaev

On Maw, 2005-10-11 at 00:04 -0400, Chuck Ebbert wrote:
>   That test machine was a dual 350MHz Pentium II Xeon; on a dual 333MHz Pentium II
> Overdrive (with very slow Socket 8 bus) I could not reproduce those results.
> However, on that machine the 'xchg' instruction made the test run almost 20%
> _faster_ than using 'mov'.
> 
>   So I think the i386 spinlock code should be changed to always use 'xchg' to do
> spin_unlock.


Using xchg on the spin unlock path is expensive. Really expensive on P4
compared to movb. It also doesn't guarantee anything either way around,
especially as you go to four cores or change CPU (or in some cases quite
likely even chipset).

Spin lock paths should not be heavily contested. If they are then fix
the underlying problem with finer locking, or if you can't do that then
perhaps by serializing it with a queue.



* Re: i386 spinlock fairness: bizarre test results
  2005-10-11 13:00 ` Alan Cox
@ 2005-10-11 14:44   ` Linus Torvalds
  0 siblings, 0 replies; 7+ messages in thread
From: Linus Torvalds @ 2005-10-11 14:44 UTC (permalink / raw)
  To: Alan Cox; +Cc: Chuck Ebbert, linux-kernel, linux, Kirill Korotaev



On Tue, 11 Oct 2005, Alan Cox wrote:

> On Maw, 2005-10-11 at 00:04 -0400, Chuck Ebbert wrote:
> >   That test machine was a dual 350MHz Pentium II Xeon; on a dual 333MHz Pentium II
> > Overdrive (with very slow Socket 8 bus) I could not reproduce those results.
> > However, on that machine the 'xchg' instruction made the test run almost 20%
> > _faster_ than using 'mov'.
> > 
> >   So I think the i386 spinlock code should be changed to always use 'xchg' to do
> > spin_unlock.
> 
> 
> Using xchg on the spin unlock path is expensive. Really expensive on P4
> compared to movb. It also doesn't guarantee anything either way around
> especially as you go to four cores or change CPU (or in some cases quite
> likely even chipset).

Indeed.

I suspect that the behaviour Chuck saw is (a) only present under 
contention and (b) very much dependent on other timing issues.

(a) is the wrong thing to optimize for, and (b) means that Chuck's numbers 
aren't reliable anyway (as shown by the fact that things like instruction 
alignment matters, and by Eric's numbers on other machines).

We want the spinlocks to behave well when they are _not_ under heavy 
contention. If a spinlock gets so much contention that it starts having 
these kinds of issues, then there's something wrong at higher levels, and 
the fix is to use a different algorithm, or use a different kind of lock.

Spinlocks by definition are the _simplest_ locks there are. Not the 
smartest or most fair. Trying to make them anything else is kind of 
missing the whole point of them.

			Linus


* Re: i386 spinlock fairness: bizarre test results
  2005-10-11 12:31   ` Joe Seigh
@ 2005-10-11 21:01     ` Bill Davidsen
  0 siblings, 0 replies; 7+ messages in thread
From: Bill Davidsen @ 2005-10-11 21:01 UTC (permalink / raw)
  To: Joe Seigh, Linux Kernel Mailing List

Joe Seigh wrote:
> Robert Hancock wrote:
> 
>> Chuck Ebbert wrote:
>>
>>>   After seeing Kirill's message about spinlocks I decided to do my own
>>> testing with the userspace program below; the results were very strange.
>>>
>>>   When using the 'mov' instruction to do the unlock I was able to 
>>> reproduce
>>> hogging of the spinlock by a single CPU even on Pentium II under some
>>> conditions, while using 'xchg' always allowed the other CPU to get the
>>> lock:
>>
>>
>>
>> This might not necessarily be a win in all situations. If two CPUs A 
>> and  B are trying to get into a spinlock-protected critical section to 
>> do 5 operations, it may well be more efficient for them to do 
>> AAAAABBBBB as opposed to ABABABABAB, as the second situation may 
>> result in cache lines bouncing between the two CPUs each time, etc.
>>
>> I don't know that making spinlocks "fairer" is really very worthwhile. 
>> If some spinlocks are so heavily contended that fairness becomes an 
>> issue, it would be better to find a way to reduce that contention.
>>
> 
> You're right that it wouldn't be an issue on a system with relatively few
> cpu's since that amount of contention would cripple the system.  Though
> with 100's of cpu's you could get contention hotspots with some spin locks
> being concurrently accessed by some subset of the cpu's for periods of 
> time.
> 
> The real issue is scalability or how gracefully does a system degrade
> when it starts to hit its contention limits.  It's not a good thing when
> a system appears to run fine and then catastrophically hangs when it
> bumps across its critical limit.  It's better when a system exhibits
> some sort of linear degradation.  The former exhibits bistable behavior
> which requires a drastic, probably impossible, reduction in work load
> to regain normal performance.  Reboots are the normal course of correction.
> The linearly degrading systems just require moderation of the workload
> to move back into acceptable performance.
> 
> Anyway, if you want to build a scalable system, it makes sense to build it
> out of scalable components.  Right?
> 
Joe, I totally agree. That type of non-linear performance is sometimes 
called a "jackpot condition," and you see it in various algorithms. The 
comment that depending on fairness is poor design is correct, but 
because loads may be vastly higher than the original program design 
anticipated, it's sometimes not obvious that there could be high contention.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

