* About a change to the implementation of spin lock in 2.6.12 kernel.
@ 2005-07-14 2:20 multisyncfe991
2005-07-14 5:16 ` Willy Tarreau
0 siblings, 1 reply; 9+ messages in thread
From: multisyncfe991 @ 2005-07-14 2:20 UTC (permalink / raw)
To: linux-kernel
Hi,
I found that _spin_lock uses a LOCK prefix to make the following operation
"decb %0" atomic. As you know, a LOCKed instruction alone takes almost 70
clock cycles to finish, and this adds a lot of cost to _spin_lock. However,
_spin_unlock does not use the LOCK prefix; it uses "movb $1,%0" instead,
since a simple aligned store is already atomic on x86.
So I want to rewrite the _spin_lock string defined in spinlock.h
(linux/include/asm-i386) as follows, to reduce the overhead of _spin_lock
and make it more efficient.
#define spin_lock_string \
"\n1:\t" \
"cmpb $0,%0\n\t" \
"jle 2f\n\t" \
"movb $0, %0\n\t" \
"jmp 3f\n" \
"2:\t" \
"rep;nop\n\t" \
"cmpb $0, %0\n\t" \
"jle 2b\n\t" \
"jmp 1b\n" \
"3:\n\t"
Compared with the original version below, the LOCK prefix is removed.
I rebuilt the Intel e1000 Gigabit driver with this _spin_lock and measured
about a 2% throughput improvement.
#define spin_lock_string \
"\n1:\t" \
"lock ; decb %0\n\t" \
"jns 3f\n" \
"2:\t" \
"rep;nop\n\t" \
"cmpb $0,%0\n\t" \
"jle 2b\n\t" \
"jmp 1b\n" \
"3:\n\t"
Do you think I can get a better performance if I dig further?
Any ideas will be greatly appreciated,
L.Y.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: About a change to the implementation of spin lock in 2.6.12 kernel.
2005-07-14 2:20 About a change to the implementation of spin lock in 2.6.12 kernel multisyncfe991
@ 2005-07-14 5:16 ` Willy Tarreau
2005-07-14 16:21 ` multisyncfe991
0 siblings, 1 reply; 9+ messages in thread
From: Willy Tarreau @ 2005-07-14 5:16 UTC (permalink / raw)
To: multisyncfe991; +Cc: linux-kernel
Hi,
On Wed, Jul 13, 2005 at 07:20:06PM -0700, multisyncfe991@hotmail.com wrote:
> Hi,
>
> I found _spin_lock used a LOCK instruction to make the following
> operation "decb %0" atomic. As you know, LOCK instruction alone takes
> almost 70 clock cycles to finish and this add lots of cost to the
> _spin_lock. However _spin_unlock does not use this LOCK instruction and
> it uses "movb $1,%0" instead since 4-byte writes on 4-byte aligned
> addresses are atomic.
_spin_unlock does not need locked operations because when it is run, the
code is already known to be the only one to hold the lock, so it can
release it without checking what others do.
> So I want rewrite the _spin_lock defined spinlock.h
> (/linux/include/asm-i386) as follows to reduce the overhead of _spin_lock
> and make it more efficient.
It does not work. You cannot write an inter-cpu atomic test-and-set with
several unlocked instructions.
> #define spin_lock_string \
> "\n1:\t" \
> "cmpb $0,%0\n\t" \
> "jle 2f\n\t" \
==> here, another thread or CPU can get the lock simultaneously.
> "movb $0, %0\n\t" \
> "jmp 3f\n" \
> "2:\t" \
> "rep;nop\n\t" \
> "cmpb $0, %0\n\t" \
> "jle 2b\n\t" \
> "jmp 1b\n" \
> "3:\n\t"
>
> Compared with the original version as follows, LOCK instruction is
> removed. I rebuilt the Intel e1000 Gigabit driver with this _spin_lock.
> There is about 2% throughput improvement.
> #define spin_lock_string \
> "\n1:\t" \
> "lock ; decb %0\n\t" \
> "jns 3f\n" \
> "2:\t" \
> "rep;nop\n\t" \
> "cmpb $0,%0\n\t" \
> "jle 2b\n\t" \
> "jmp 1b\n" \
> "3:\n\t"
>
> Do you think I can get a better performance if I dig further?
>
> Any ideas will be greatly appreciated,
Well, of course with those methods you can improve performance, but you
lose the guarantee that you're the only one holding the lock, and that's
bad.
Another, similar method to get a lock in some very controlled environments
is as follows:
1: cmp $0, %0
jne 1b
mov $CPUID, %0
membar
cmp $CPUID, %0
jne 1b
This only works with same-speed CPUs and interrupts disabled. In today's
environments this is very risky (hyperthreaded CPUs, etc.). However, it
is often OK for more deterministic CPUs such as microcontrollers.
Regards,
Willy
* Re: About a change to the implementation of spin lock in 2.6.12 kernel.
2005-07-14 5:16 ` Willy Tarreau
@ 2005-07-14 16:21 ` multisyncfe991
2005-07-14 16:26 ` Brandon Niemczyk
2005-07-15 19:22 ` Volatile vs Non-Volatile Spin Locks on SMP multisyncfe991
0 siblings, 2 replies; 9+ messages in thread
From: multisyncfe991 @ 2005-07-14 16:21 UTC (permalink / raw)
To: linux-kernel
Hi Willy,
I think at least I can remove the LOCK instruction when the lock is already
held by someone else and enter the spinning wait directly, right?
0:      cmpb $0, slp            # lock not available? then spin directly
        jle 2f                  #   without locking the bus
1:      lock; decb slp          # lock the bus and atomically decrement
        jns 3f                  # sign bit clear: jump forward to 3
2:      pause                   # spin - wait
        cmpb $0, slp            # spin - compare to 0
        jle 2b                  # spin - go back to 2 if <= 0 (locked)
        jmp 1b                  # unlocked; go back to 1 to try again
3:                              # we have acquired the lock
But based on the Lockmeter report, lock acquisition succeeds immediately
99.8% of the time, so maybe this will not make much difference.
Thanks,
Liang
----- Original Message -----
From: "Willy Tarreau" <willy@w.ods.org>
To: <multisyncfe991@hotmail.com>
Cc: <linux-kernel@vger.kernel.org>
Sent: Wednesday, July 13, 2005 10:16 PM
Subject: Re: About a change to the implementation of spin lock in 2.6.12
kernel.
> Hi,
>
> On Wed, Jul 13, 2005 at 07:20:06PM -0700, multisyncfe991@hotmail.com
> wrote:
>> Hi,
>>
>> I found _spin_lock used a LOCK instruction to make the following
>> operation "decb %0" atomic. As you know, LOCK instruction alone takes
>> almost 70 clock cycles to finish and this add lots of cost to the
>> _spin_lock. However _spin_unlock does not use this LOCK instruction and
>> it uses "movb $1,%0" instead since 4-byte writes on 4-byte aligned
>> addresses are atomic.
>
> _spin_unlock does not need locked operations because when it is run, the
> code is already known to be the only one to hold the lock, so it can
> release it without checking what others do.
>
>> So I want rewrite the _spin_lock defined spinlock.h
>> (/linux/include/asm-i386) as follows to reduce the overhead of _spin_lock
>> and make it more efficient.
>
> It does not work. You cannot write an inter-cpu atomic test-and-set with
> several unlocked instructions.
>
>> #define spin_lock_string \
>> "\n1:\t" \
>> "cmpb $0,%0\n\t" \
>> "jle 2f\n\t" \
>
> ==> here, another thread or CPU can get the lock simultaneously.
>
>> "movb $0, %0\n\t" \
>> "jmp 3f\n" \
>> "2:\t" \
>> "rep;nop\n\t" \
>> "cmpb $0, %0\n\t" \
>> "jle 2b\n\t" \
>> "jmp 1b\n" \
>> "3:\n\t"
>>
>> Compared with the original version as follows, LOCK instruction is
>> removed. I rebuilt the Intel e1000 Gigabit driver with this _spin_lock.
>> There is about 2% throughput improvement.
>> #define spin_lock_string \
>> "\n1:\t" \
>> "lock ; decb %0\n\t" \
>> "jns 3f\n" \
>> "2:\t" \
>> "rep;nop\n\t" \
>> "cmpb $0,%0\n\t" \
>> "jle 2b\n\t" \
>> "jmp 1b\n" \
>> "3:\n\t"
>>
>> Do you think I can get a better performance if I dig further?
>>
>> Any ideas will be greatly appreciated,
>
> well, of course with those methods you can improve performance, but you
> lose the warranty that you're alone to get a lock, and that's bad.
>
> another similar method to get a lock in some very controlled environment
> is as follows :
>
> 1: cmp $0, %0
> jne 1b
> mov $CPUID, %0
> membar
> cmp $CPUID, %0
> jne 1b
>
> This only works with same speed CPUs and interrupts disabled. But in
> todays
> environments, this is very risky (hyperthreaded CPUs, etc...). However,
> this
> is often OK for more deterministic CPUs such as microcontrollers.
>
> Regards,
> Willy
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
* Re: About a change to the implementation of spin lock in 2.6.12 kernel.
2005-07-14 16:21 ` multisyncfe991
@ 2005-07-14 16:26 ` Brandon Niemczyk
2005-11-09 17:57 ` Does Printk() block another CPU in dual cpu platforms? John Smith
2005-07-15 19:22 ` Volatile vs Non-Volatile Spin Locks on SMP multisyncfe991
1 sibling, 1 reply; 9+ messages in thread
From: Brandon Niemczyk @ 2005-07-14 16:26 UTC (permalink / raw)
To: multisyncfe991; +Cc: linux-kernel
On Thu, 2005-07-14 at 09:21 -0700, multisyncfe991@hotmail.com wrote:
> Hi Willy,
>
> I think at least I can remove the LOCK instruction when the lock is already
> held by someone else and enter the spinning wait directly, right?
If the lock is already held by someone else, the cpu is just going to
burn cycles until it's not. So why do you care?
--
Brandon Niemczyk
* Volatile vs Non-Volatile Spin Locks on SMP.
2005-07-14 16:21 ` multisyncfe991
2005-07-14 16:26 ` Brandon Niemczyk
@ 2005-07-15 19:22 ` multisyncfe991
2005-07-17 12:51 ` Joe Seigh
1 sibling, 1 reply; 9+ messages in thread
From: multisyncfe991 @ 2005-07-15 19:22 UTC (permalink / raw)
To: linux-kernel
Hello,
By using the volatile keyword for the lock field in spinlock_t, Linux
seems to choose to always reload the value of a spin lock from memory
instead of using the contents of a register. This may be helpful for
synchronization between multiple threads on a single CPU.
I use two Xeon CPUs with HyperThreading disabled on both. I think the
MESI protocol will enforce cache coherency and propagate the spin lock
value between these two CPUs automatically. So maybe we don't need
volatile any more, right?
Based on this, I rebuilt the Intel e1000 Gigabit network card driver
with the volatile removed, but I didn't notice any performance
improvement.
Any ideas about this?
Thanks a lot,
Liang
* Re: Volatile vs Non-Volatile Spin Locks on SMP.
2005-07-15 19:22 ` Volatile vs Non-Volatile Spin Locks on SMP multisyncfe991
@ 2005-07-17 12:51 ` Joe Seigh
2005-07-18 13:40 ` Joe Seigh
0 siblings, 1 reply; 9+ messages in thread
From: Joe Seigh @ 2005-07-17 12:51 UTC (permalink / raw)
To: linux-kernel
multisyncfe991@hotmail.com wrote:
> Hello,
>
> By using volatile keyword for spin lock defined by in spinlock_t, it
> seems Linux choose to always
> reload the value of spin locks from cache instead of using the content
> from registers. This may be
> helpful for synchronization between multithreads in a single CPU.
>
> I use two Xeon cpus with HyperThreading being disabled on both cpus. I
> think the MESI
> protocol will enforce the cache coherency and update the spin lock value
> automatically between
> these two cpus. So maybe we don't need to use the volatile any more, right?
>
> Based on this, I rebuilt the Intel e1000 Gigabit network card driver
> with volatile being removed,
> but I didn't notice any performance improvement.
>
> Any idea about this,
>
Volatile is meaningless as far as threading is concerned. Technically,
its meaning is implementation-defined, and since for Linux we're talking
about gcc, I suppose someone could claim it has some meaning, although
most of us will have no way of verifying those claims. You might see
some usage of volatile in the Linux kernel which makes it appear as
though it has some meaning, but be careful about depending on that,
since there's no way of knowing whether your interpretation of the
meaning is the same as what the authors of that code had in mind.
For synchronization you need memory barriers in most cases and the only
way to get these is using assembler since there are no C or gcc intrinsics
for these yet. For inline assembler, the convention seems to be to use
the volatile attribute, which I take as meaning no code movement across
the inline assembler code. If it doesn't mean that, then a lot of stuff
is broken AFAICT.
Assuming you're doing this in assembler, using volatile on the C declaration
will have no effect on performance in this case. You're seeing the most
"recent" value due to the cache implementation.
--
Joe Seigh
* Re: Volatile vs Non-Volatile Spin Locks on SMP.
2005-07-17 12:51 ` Joe Seigh
@ 2005-07-18 13:40 ` Joe Seigh
0 siblings, 0 replies; 9+ messages in thread
From: Joe Seigh @ 2005-07-18 13:40 UTC (permalink / raw)
To: linux-kernel
Joe Seigh wrote:
> For synchronization you need memory barriers in most cases and the only
> way to get these is using assembler since there are no C or gcc intrinsics
> for these yet. For inline assembler, the convention seems to be to use
> the volatile attribute, which I take as meaning no code movement across
> the inline assembler code. It if doesn't mean that then a lot of stuff
> is broken AFAICT.
>
Usenet rule #1. If you don't find something in the documentation, you
will find it after you post about it. Volatile does seem to be documented
somewhat in the gcc docs
http://gcc.gnu.org/onlinedocs/gcc-4.0.1/gcc/Extended-Asm.html#Extended-Asm
I was using "memory" in the clobber list as the main thing to keep optimization from
occurring across inline asm. This seems to say that you also need to say "volatile" to
tell the compiler that you really mean it.
"If your assembler instructions access memory in an unpredictable fashion, add `memory' to the list of clobbered registers. This will cause GCC to not keep memory values cached in registers across the assembler instruction and not optimize stores or loads to that memory. You will also want to add the volatile keyword if the memory affected is not listed in the inputs or outputs of the asm, as the `memory' clobber does not count as a side-effect of the asm. If you know how large the accessed memory is, you can add it as input or output but if this is not known, you should add `memory'."
Also this needs to be looked at, i.e. does "sequence" mean in program order or with no interleaved
C statements.
"Similarly, you can't expect a sequence of volatile asm instructions to remain perfectly consecutive. If you want consecutive output, use a single asm. Also, GCC will perform some optimizations across a volatile asm instruction; GCC does not “forget everything” when it encounters a volatile asm instruction the way some other compilers do."
One of the problems with volatile in C was that the compiler could move code around the volatile
accesses and even accesses to other volatile variables. This was a problem that Java had and
which they fixed with JSR-133 so you could actually do useful things with volatile in Java. It's
just worse in C, since C has nowhere near as useful or clear definitions
to work with. The only reason you can get away with something like
    do {
        while (lock != 0);          /* spin on a plain read */
    } while (!testandset(&lock));   /* interlocked test and set */
is that the correctness of the code isn't affected by how the compiler
treats the test for lock != 0, as long as it terminates in a finite
amount of time. (Nor by the fact that this isn't the best way to do a
spin wait on hyperthreaded Intel processors; Intel recommends you use a
PAUSE instruction in the wait loop.)
Anyway it looks like I'll have to do a little more augury on the gcc docs. :)
--
Joe Seigh
* Does Printk() block another CPU in dual cpu platforms?
2005-07-14 16:26 ` Brandon Niemczyk
@ 2005-11-09 17:57 ` John Smith
2005-11-10 3:31 ` Fawad Lateef
0 siblings, 1 reply; 9+ messages in thread
From: John Smith @ 2005-11-09 17:57 UTC (permalink / raw)
To: linux-kernel
Hello,
I just have a question about the usage of printk on multi-processor
platforms. If programs on two CPUs both try to call printk to output
something, will the program running on one CPU get blocked (or just
spin there) until the other is done with printk()?
Thanks,
Liang
* Re: Does Printk() block another CPU in dual cpu platforms?
2005-11-09 17:57 ` Does Printk() block another CPU in dual cpu platforms? John Smith
@ 2005-11-10 3:31 ` Fawad Lateef
0 siblings, 0 replies; 9+ messages in thread
From: Fawad Lateef @ 2005-11-10 3:31 UTC (permalink / raw)
To: John Smith; +Cc: linux-kernel
On 11/9/05, John Smith <multisyncfe991@hotmail.com> wrote:
>
> I just have a question about the usage of printk in multi-processor
> platforms. If the program on two CPUs both try to call printk to output
> something, will the program running on one CPUs get blocked (or just
> spinning there) till the other is done with printk()?
>
I think yes, but only for a very short time: printk takes the
logbuf_lock spinlock, which makes a concurrent printk on the other CPU
wait. printk just copies the message into the log buffer and then calls
release_console_sem, which actually sends the data to the console later.
--
Fawad Lateef
Thread overview: 9+ messages
2005-07-14 2:20 About a change to the implementation of spin lock in 2.6.12 kernel multisyncfe991
2005-07-14 5:16 ` Willy Tarreau
2005-07-14 16:21 ` multisyncfe991
2005-07-14 16:26 ` Brandon Niemczyk
2005-11-09 17:57 ` Does Printk() block another CPU in dual cpu platforms? John Smith
2005-11-10 3:31 ` Fawad Lateef
2005-07-15 19:22 ` Volatile vs Non-Volatile Spin Locks on SMP multisyncfe991
2005-07-17 12:51 ` Joe Seigh
2005-07-18 13:40 ` Joe Seigh