* Re: i386 spinlock fairness: bizarre test results
[not found] <4WjCM-7Aq-7@gated-at.bofh.it>
@ 2005-10-11 4:32 ` Robert Hancock
2005-10-11 12:31 ` Joe Seigh
0 siblings, 1 reply; 7+ messages in thread
From: Robert Hancock @ 2005-10-11 4:32 UTC (permalink / raw)
To: linux-kernel
Chuck Ebbert wrote:
> After seeing Kirill's message about spinlocks I decided to do my own
> testing with the userspace program below; the results were very strange.
>
> When using the 'mov' instruction to do the unlock I was able to reproduce
> hogging of the spinlock by a single CPU even on Pentium II under some
> conditions, while using 'xchg' always allowed the other CPU to get the
> lock:
This might not necessarily be a win in all situations. If two CPUs A and
B are trying to get into a spinlock-protected critical section to do 5
operations, it may well be more efficient for them to do AAAAABBBBB as
opposed to ABABABABAB, as the second situation may result in cache lines
bouncing between the two CPUs each time, etc.
I don't know that making spinlocks "fairer" is really very worthwhile.
If some spinlocks are so heavily contended that fairness becomes an
issue, it would be better to find a way to reduce that contention.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/
^ permalink raw reply [flat|nested] 7+ messages in thread

* Re: i386 spinlock fairness: bizarre test results
2005-10-11 4:32 ` i386 spinlock fairness: bizarre test results Robert Hancock
@ 2005-10-11 12:31 ` Joe Seigh
2005-10-11 21:01 ` Bill Davidsen
0 siblings, 1 reply; 7+ messages in thread
From: Joe Seigh @ 2005-10-11 12:31 UTC (permalink / raw)
To: linux-kernel
Robert Hancock wrote:
> Chuck Ebbert wrote:
>
>> After seeing Kirill's message about spinlocks I decided to do my own
>> testing with the userspace program below; the results were very strange.
>>
>> When using the 'mov' instruction to do the unlock I was able to
>> reproduce
>> hogging of the spinlock by a single CPU even on Pentium II under some
>> conditions, while using 'xchg' always allowed the other CPU to get the
>> lock:
>
>
> This might not necessarily be a win in all situations. If two CPUs A and
> B are trying to get into a spinlock-protected critical section to do 5
> operations, it may well be more efficient for them to do AAAAABBBBB as
> opposed to ABABABABAB, as the second situation may result in cache lines
> bouncing between the two CPUs each time, etc.
>
> I don't know that making spinlocks "fairer" is really very worthwhile.
> If some spinlocks are so heavily contended that fairness becomes an
> issue, it would be better to find a way to reduce that contention.
>
You're right that it wouldn't be an issue on a system with relatively few
CPUs, since that amount of contention would cripple the system. Though
with hundreds of CPUs you could get contention hotspots, with some spinlocks
being concurrently accessed by some subset of the CPUs for periods of time.
The real issue is scalability, or how gracefully a system degrades
when it starts to hit its contention limits. It's not a good thing when
a system appears to run fine and then catastrophically hangs when it
bumps across its critical limit. It's better when a system exhibits
some sort of linear degradation. The former exhibits bistable behavior
which requires a drastic, probably impossible, reduction in workload
to regain normal performance. Reboots are the normal course of correction.
The linearly degrading systems just require moderation of the workload
to move back into acceptable performance.
Anyway, if you want to build a scalable system, it makes sense to build it
out of scalable components. Right?
Joe Seigh
* Re: i386 spinlock fairness: bizarre test results
2005-10-11 12:31 ` Joe Seigh
@ 2005-10-11 21:01 ` Bill Davidsen
0 siblings, 0 replies; 7+ messages in thread
From: Bill Davidsen @ 2005-10-11 21:01 UTC (permalink / raw)
To: Joe Seigh, Linux Kernel Mailing List
Joe Seigh wrote:
> Robert Hancock wrote:
>
>> Chuck Ebbert wrote:
>>
>>> After seeing Kirill's message about spinlocks I decided to do my own
>>> testing with the userspace program below; the results were very strange.
>>>
>>> When using the 'mov' instruction to do the unlock I was able to
>>> reproduce
>>> hogging of the spinlock by a single CPU even on Pentium II under some
>>> conditions, while using 'xchg' always allowed the other CPU to get the
>>> lock:
>>
>>
>>
>> This might not necessarily be a win in all situations. If two CPUs A
>> and B are trying to get into a spinlock-protected critical section to
>> do 5 operations, it may well be more efficient for them to do
>> AAAAABBBBB as opposed to ABABABABAB, as the second situation may
>> result in cache lines bouncing between the two CPUs each time, etc.
>>
>> I don't know that making spinlocks "fairer" is really very worthwhile.
>> If some spinlocks are so heavily contended that fairness becomes an
>> issue, it would be better to find a way to reduce that contention.
>>
>
> You're right that it wouldn't be an issue on a system with relatively few
> CPUs, since that amount of contention would cripple the system. Though
> with hundreds of CPUs you could get contention hotspots, with some spinlocks
> being concurrently accessed by some subset of the CPUs for periods of time.
>
> The real issue is scalability, or how gracefully a system degrades
> when it starts to hit its contention limits. It's not a good thing when
> a system appears to run fine and then catastrophically hangs when it
> bumps across its critical limit. It's better when a system exhibits
> some sort of linear degradation. The former exhibits bistable behavior
> which requires a drastic, probably impossible, reduction in workload
> to regain normal performance. Reboots are the normal course of correction.
> The linearly degrading systems just require moderation of the workload
> to move back into acceptable performance.
>
> Anyway, if you want to build a scalable system, it makes sense to build it
> out of scalable components. Right?
>
Joe, I totally agree. That type of non-linear performance is sometimes
called a "jackpot condition," and you see it in various algorithms. The
comment that depending on fairness is poor design is correct, but
because loads may grow vastly higher than the original design anticipated,
it's sometimes not obvious that there could be high contention.
--
-bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
* i386 spinlock fairness: bizarre test results
@ 2005-10-11 4:04 Chuck Ebbert
2005-10-11 9:42 ` Eric Dumazet
2005-10-11 13:00 ` Alan Cox
0 siblings, 2 replies; 7+ messages in thread
From: Chuck Ebbert @ 2005-10-11 4:04 UTC (permalink / raw)
To: linux-kernel; +Cc: linux, Linus Torvalds, Kirill Korotaev
After seeing Kirill's message about spinlocks I decided to do my own
testing with the userspace program below; the results were very strange.
When using the 'mov' instruction to do the unlock I was able to reproduce
hogging of the spinlock by a single CPU even on Pentium II under some
conditions, while using 'xchg' always allowed the other CPU to get the
lock:
[me@d2 t]$ gcc -O2 -o spintest.o -Wall -DUSE_MOV=1 spintest.c ; ./spintest.o
parent CPU 1, child CPU 0, using mov instruction for unlock
parent did: 34534 of 10000001 iterations
CPU clocks: 2063591864
[me@d2 t]$ gcc -O2 -o spintest.o -Wall -DUSE_MOV=0 spintest.c ; ./spintest.o
parent CPU 1, child CPU 0, using xchg instruction for unlock
parent did: 5169760 of 10000001 iterations
CPU clocks: 2164689784
The results were dependent on the alignment of the "lock ; decl" at the start
of the spinlock code. If that 7-byte instruction spanned two cachelines then
the CPU with that code could somehow hog the spinlock. Optimizing for
Pentium II forced 16-byte alignment and made the spinlock fairer, but still
somewhat biased when using 'mov':
[me@d2 t]$ gcc -O2 -o spintest.o -Wall -DUSE_MOV=1 -mcpu=pentium2 spintest.c ; ./spintest.o
parent CPU 1, child CPU 0, using mov instruction for unlock
parent did: 4181147 of 10000001 iterations
CPU clocks: 2057436825
That test machine was a dual 350MHz Pentium II Xeon; on a dual 333MHz Pentium II
Overdrive (with very slow Socket 8 bus) I could not reproduce those results.
However, on that machine the 'xchg' instruction made the test run almost 20%
_faster_ than using 'mov'.
So I think the i386 spinlock code should be changed to always use 'xchg' to do
spin_unlock.
================================================================================
/* spinlock test
*
* Tests spinlock fairness on SMP i386 machine.
*/
/* number of tests */
#ifndef ITERS
#define ITERS 10000000
#endif
/* use "mov" instruction for spin_unlock instead of "xchg" */
#ifndef USE_MOV
#define USE_MOV 1
#endif
/* cpu to run parent process; child will use !PARENT_CPU */
#ifndef PARENT_CPU
#define PARENT_CPU 1
#endif
/* change this to match your version of glibc -- "2" means use 2 args */
#ifndef SETAFFINITY
#define SETAFFINITY 2
#endif
#if SETAFFINITY == 2
#define setaffinity(pid, mask) sched_setaffinity((pid), &(mask))
#else /* 3 args */
#define setaffinity(pid, mask) sched_setaffinity((pid), sizeof(mask), &(mask))
#endif
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <asm/user.h>
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#define RDTSCLL(r) asm volatile("rdtsc" : "=A" (r))
unsigned long long tsc1, tsc2;
cpu_set_t cpuset0, cpuset1;
int test = 1; /* for testing -- starts unlocked */
int strt = 0; /* sync startup -- starts locked */
int stk[1024];
int iters, p_iters, c_iters;
static inline void __raw_spin_lock(int *l)
{
	asm volatile(
		"\n1:\t"
		"lock ; decl %0\n\t"
		"jns 3f\n"
		"2:\t"
		"rep;nop\n\t"
		"cmpl $0,%0\n\t"
		"jle 2b\n\t"
		"jmp 1b\n"
		"3:\n\t"
		: "=m" (*l) : : "memory");
}
#if USE_MOV
static inline void __raw_spin_unlock(int *l)
{
	asm volatile("movl $1,%0"
		     : "=m" (*l) : : "memory");
}
#else
static inline void __raw_spin_unlock(int *l)
{
	int oldval = 1;

	asm volatile("xchgl %0,%1"
		     : "=q" (oldval), "=m" (*l)
		     : "0" (oldval) : "memory");
}
#endif /* USE_MOV */
int do_child(void *vp)
{
	CPU_SET(!PARENT_CPU, &cpuset1);
	if (unlikely(setaffinity(0, cpuset1)))
		perror("child setaffinity");

	/* Add "nop" insns as necessary to make 1st
	 * insn of __raw_spin_lock span cachelines.
	 */
	asm("nop ; nop ; nop");

	__raw_spin_unlock(&strt);	/* release parent */
again:
	__raw_spin_lock(&test);
	c_iters++;
	if (likely(++iters < ITERS)) {
		__raw_spin_unlock(&test);
		goto again;
	}
	__raw_spin_unlock(&test);
	return 0;
}
int main(int argc, char * const argv[])
{
	CPU_SET(PARENT_CPU, &cpuset0);
	if (unlikely(setaffinity(0, cpuset0)))
		perror("parent setaffinity");

	clone(do_child, &stk[1023], CLONE_VM, 0);
	__raw_spin_lock(&strt);		/* wait for child to init */

	RDTSCLL(tsc1);
again:
	__raw_spin_lock(&test);
	p_iters++;
	if (likely(++iters < ITERS)) {
		__raw_spin_unlock(&test);
		goto again;
	}
	__raw_spin_unlock(&test);
	RDTSCLL(tsc2);

	printf("parent CPU %d, child CPU %d, ", PARENT_CPU, !PARENT_CPU);
	printf("using %s instruction for unlock\n", USE_MOV ? "mov" : "xchg");
	printf("parent did: %d of %d iterations\n", p_iters, iters);
	printf("CPU clocks: %14llu\n", tsc2 - tsc1);
	return 0;
}
--
Chuck
* Re: i386 spinlock fairness: bizarre test results
2005-10-11 4:04 Chuck Ebbert
@ 2005-10-11 9:42 ` Eric Dumazet
2005-10-11 13:00 ` Alan Cox
1 sibling, 0 replies; 7+ messages in thread
From: Eric Dumazet @ 2005-10-11 9:42 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: linux-kernel, linux, Linus Torvalds, Kirill Korotaev
Chuck Ebbert wrote:
> After seeing Kirill's message about spinlocks I decided to do my own
> testing with the userspace program below; the results were very strange.
>
> When using the 'mov' instruction to do the unlock I was able to reproduce
> hogging of the spinlock by a single CPU even on Pentium II under some
> conditions, while using 'xchg' always allowed the other CPU to get the
> lock:
>
>
> parent CPU 1, child CPU 0, using mov instruction for unlock
> parent did: 34534 of 10000001 iterations
> CPU clocks: 2063591864
>
> parent CPU 1, child CPU 0, using xchg instruction for unlock
> parent did: 5169760 of 10000001 iterations
> CPU clocks: 2164689784
>
Hi Chuck
That's interesting, because on my machines I get the opposite results:
On a Xeon 2.0 GHz with Hyper-Threading, I get better results (in the sense
of fairness) with the MOV version:
parent CPU 1, child CPU 0, using mov instruction for unlock
parent did: 5291346 of 10000001 iterations
CPU clocks: 3.437.053.244
parent CPU 1, child CPU 0, using xchg instruction for unlock
parent did: 7732349 of 10000001 iterations
CPU clocks: 3.408.285.308
On a dual Opteron 248 (2.2GHz) machine, I get bad fairness results regardless
of MOV/XCHG (but fewer cycles per iteration). There is a lot of variation in
the results (maybe because of NUMA effects?), but XCHG gives slightly better
fairness.
parent CPU 1, child CPU 0, using mov instruction for unlock
parent did: 256810 of 10000001 iterations
CPU clocks: 772.838.640
parent CPU 1, child CPU 0, using xchg instruction for unlock
parent did: 438280 of 10000001 iterations
CPU clocks: 1.115.653.346
parent CPU 1, child CPU 0, using xchg instruction for unlock
parent did: 574501 of 10000001 iterations
CPU clocks: 1.200.129.428
On a dual-core Opteron 275 (2.2GHz), XCHG is faster but unfair
(threads running on the same physical CPU).
parent CPU 1, child CPU 0, using mov instruction for unlock
parent did: 4822270 of 10000001 iterations
CPU clocks: 738.438.383
parent CPU 1, child CPU 0, using xchg instruction for unlock
parent did: 9702075 of 10000001 iterations
CPU clocks: 561.457.724
Totally different results if the affinity is changed so that the threads run
on different physical CPUs (XCHG is slower):
parent CPU 0, child CPU 2, using mov instruction for unlock
parent did: 1081522 of 10000001 iterations
CPU clocks: 508.611.273
parent CPU 0, child CPU 2, using xchg instruction for unlock
parent did: 4310427 of 10000000 iterations
CPU clocks: 1.074.170.246
For reference, if only one thread is running, the MOV version is also faster
on both platforms:
[Xeon 2GHz]
one thread, using mov instruction for unlock
10000000 iterations
CPU clocks: 1.278.879.528
[Xeon 2GHz]
one thread, using xchg instruction for unlock
10000000 iterations
CPU clocks: 2.486.912.752
Of course Opterons are faster :) (less cycles per iteration)
[Opteron 248]
one thread, using mov instruction for unlock
10000000 iterations
CPU clocks: 212.514.637
[Opteron 248]
one thread, using xchg instruction for unlock
10000000 iterations
CPU clocks: 383.306.420
[Opteron 275]
one thread, using mov instruction for unlock
10000000 iterations
CPU clocks: 208.472.009
[Opteron 275]
one thread, using xchg instruction for unlock
10000000 iterations
CPU clocks: 417.502.675
In conclusion, I would say that for uncontended locks, MOV is faster. For
contended locks, XCHG might be faster on some platforms (dual-core Opterons,
but only when both threads run on the same physical CPU).
Eric
* Re: i386 spinlock fairness: bizarre test results
2005-10-11 4:04 Chuck Ebbert
2005-10-11 9:42 ` Eric Dumazet
@ 2005-10-11 13:00 ` Alan Cox
2005-10-11 14:44 ` Linus Torvalds
1 sibling, 1 reply; 7+ messages in thread
From: Alan Cox @ 2005-10-11 13:00 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: linux-kernel, linux, Linus Torvalds, Kirill Korotaev
On Maw, 2005-10-11 at 00:04 -0400, Chuck Ebbert wrote:
> That test machine was a dual 350MHz Pentium II Xeon; on a dual 333MHz Pentium II
> Overdrive (with very slow Socket 8 bus) I could not reproduce those results.
> However, on that machine the 'xchg' instruction made the test run almost 20%
> _faster_ than using 'mov'.
>
> So I think the i386 spinlock code should be changed to always use 'xchg' to do
> spin_unlock.
Using xchg on the spin unlock path is expensive, really expensive on P4
compared to movb. It also doesn't guarantee anything either way round,
especially as you go to four cores or change CPU (or in some cases quite
likely even chipset).
Spin lock paths should not be heavily contended. If they are, then fix
the underlying problem with finer-grained locking, or if you can't do that,
then perhaps by serializing access with a queue.
* Re: i386 spinlock fairness: bizarre test results
2005-10-11 13:00 ` Alan Cox
@ 2005-10-11 14:44 ` Linus Torvalds
0 siblings, 0 replies; 7+ messages in thread
From: Linus Torvalds @ 2005-10-11 14:44 UTC (permalink / raw)
To: Alan Cox; +Cc: Chuck Ebbert, linux-kernel, linux, Kirill Korotaev
On Tue, 11 Oct 2005, Alan Cox wrote:
> On Maw, 2005-10-11 at 00:04 -0400, Chuck Ebbert wrote:
> > That test machine was a dual 350MHz Pentium II Xeon; on a dual 333MHz Pentium II
> > Overdrive (with very slow Socket 8 bus) I could not reproduce those results.
> > However, on that machine the 'xchg' instruction made the test run almost 20%
> > _faster_ than using 'mov'.
> >
> > So I think the i386 spinlock code should be changed to always use 'xchg' to do
> > spin_unlock.
>
>
> Using xchg on the spin unlock path is expensive. Really expensive on P4
> compared to movb. It also doesn't guarantee anything either way around
> especially as you go to four cores or change CPU (or in some cases quite
> likely even chipset).
Indeed.
I suspect that the behaviour Chuck saw is (a) only present under
contention and (b) very much dependent on other timing issues.
(a) is the wrong thing to optimize for, and (b) means that Chuck's numbers
aren't reliable anyway (as shown by the fact that things like instruction
alignment matter, and by Eric's numbers on other machines).
We want the spinlocks to behave well when they are _not_ under heavy
contention. If a spinlock gets so much contention that it starts having
these kinds of issues, then there's something wrong at higher levels, and
the fix is to use a different algorithm, or use a different kind of lock.
Spinlocks by definition are the _simplest_ locks there are. Not the
smartest or most fair. Trying to make them anything else is kind of
missing the whole point of them.
Linus