* [PATCH 0/3] 64-bit futexes: Intro
@ 2008-05-31 1:27 Ulrich Drepper
2008-05-31 2:13 ` Linus Torvalds
0 siblings, 1 reply; 26+ messages in thread
From: Ulrich Drepper @ 2008-05-31 1:27 UTC (permalink / raw)
To: linux-kernel; +Cc: akpm, mtk.manpages, torvalds
This patch series adds support for 64-bit futexes. The current futexes
only protect 32-bits. I don't know why, ask the original authors. It is
unnecessarily limiting, though, especially for 64-bit machines.
To understand the problem let me just say that futexes work by storing
a program-defined state in a variable. Threads then can wait for the
state of the variable to change. It all works dandy if the protocol for
using futexes is followed correctly; see
http://people.redhat.com/drepper/futexes.pdf
For mutexes, semaphores etc this is quite easy. For mutexes, for instance,
we have as state to protect a flag whether the mutex is locked and whether
there is any thread waiting for the release. This can be done easily with
the available 32 bits in a futex.
The situation is different for more complex synchronization objects. The
best example are reader/writer locks. Here the state consists at least
of
- number of readers which locked the rwlock
- flag whether rwlock is locked for reader, writing, or not at all
- number of readers waiting
- number of writers waiting
I.e., rwlocks are significantly more complicated. Much of the complication
stems from the requirement to have to types: those which prefer readers and
those which prefer writers.
Without imposing unreasonable limitations on the number of concurrent users
a rwlock can have the state cannot be represented in a single 32-bit variable.
This is why the current rwlock implementation uses an internal lock and
spreads the state out in several variables.
This works, but there is a significant performance penalty to be paid:
- at least two atomic operations to lock and unlock the internal lock
- multiple readers are not really able to run concurrently since they
might block on the internal lock
- following from the last point, the behaviour of the code is less
predictable due to scheduling artifacts
The logical solution is to extend the size of the state which can be
protected by allowing 64-bit futexes on architectures which can support
them. This is not new. The last time it was brought in
http://lkml.org/lkml/2007/3/21/65
but nothing happened. I must admit that I didn't follow thorugh myself.
Anyway, I finally bit the bullet and wrote a patch myself. It differs
from the old patch:
- FUTEX_64_FLAG is a flag to the existing syscall, not a new syscall.
Given the changes which are needed the latter would have been overkill
IMO
- no 64-bit support for the PI variants. It simply makes no sense. The
value of the variables protected by the PI variants is under control of
the kernel and it's always a TID. Those are 32-bit values.
As you can see and the first patch the required changes really aren't that
many. A large part of the patch deals with changing the interface of the
futex_atomic_op_inuser function. The additional parameter is simply
ignored if 64-bit futexes aren't supported.
The patch also goes to great length to avoid negative impact for
archictectures which do not have 64-bit futexes. You can see several
if-expressions which are decided at compile-time, optimizing out all the
additional code.
Finally, enabling 64-bit futexes for 64-bit architectures is quite simple.
The second patch enables them for x86-64. Most of the new lines are simply
copies of the inline asms used for the 32-bit futexes, adapted for 64-bit
operations.
As for 32-bit architectures, they are mostly out of luck. That's OK, all
the code still works as before, it's just slower. The exception is perhaps
x86. x86 has the cmpxchg8b instruction for many years now. This is why
I added enablement for 64-bit futexes for x86 as well. It's a bit more
complicated and it requires a new system call. The reason is obvious: the
existing system call takes 32-bit values for the value parameters. On
64-bit machines these are in any case 64-bit values, so passing 64-bit
values only requires a change to the function parameter type. For 32-bit
machines it is not that easy, hence the new system call. To make things
even more interesting, the required additional parameter pushes us beyond
the 6 parameter limit and the parameters have to be passed down in memory.
Nothing new.
Anyway, is it all worth it? Well, here's a measurement of a workload
similar for what customers of our are using. These are raw numbers,
for a reason:
# Old New Gain
threads Rwlocks Rwlocks
1 4,991,633,560 3,986,898,840 20.13%
2 14,187,564,600 5,080,538,220 64.19%
3 20,727,416,260 5,291,971,820 74.47%
4 23,079,608,650 6,652,036,830 71.18%
5 23,1766,32,860 6,570,373,500 71.65%
6 21,913,357,010 6,591,716,100 69.92%
7 22,975,750,700 6,597,761,790 71.28%
8 22,349,919,860 6,632,005,730 70.33%
9 24,784,438,890 6,599,062,590 73.37%
10 24,899,243,380 6,493,066,340 73.92%
11 25,358,788,130 6,735,662,240 73.44%
12 19,955,548,890 6,591,059,500 66.97%
13 24,058,349,440 6,709,288,100 72.11%
14 26,002,901,340 6,741,505,130 74.07%
15 19,790,724,570 6,720,973,690 66.04%
16 26,639,730,750 6,558,662,430 75.38%
The test is run on a uni-processor quad core machine. The task is run with
varying number of threads. In this example the actual work performed is
trivial. In other words, the overhead for the locking is quite large. I
have put the data in a graph:
http://people.redhat.com/drepper/newrwlock.png
What is obvious is that with the new implementation to uncontended case
is 20% faster. That alone should be a good reason. But as can be seen
it get even better with a large number of threads. It peaks when four
threads are used (== number of cores) but I still show the results for
more threads to point out the better predictability. These are the
minimum values from ten runs each, no averaging. The old code, due to
the use of the internal lock, causes a lotter of jitters in the times,
reducing predictability. In term of performance, the new code is about
four times faster on this workload. It's certainly not characteristic
but I haven't seen a workload where the new code is slower.
I also haven't seen any case where the additional overhead in the futex
system call implementation is noticeable.
I tested the code extensively on a number of x86-64 machines, from small
dual cores machines to quad socket/quad cores machines. I'm even using
it on my main workstation right now.
The three patches can be applied individually and everything should
continue to work fine. The patches apply to Linus' current tree.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-05-31 1:27 [PATCH 0/3] 64-bit futexes: Intro Ulrich Drepper
@ 2008-05-31 2:13 ` Linus Torvalds
2008-05-31 3:14 ` Ulrich Drepper
0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-05-31 2:13 UTC (permalink / raw)
To: Ulrich Drepper; +Cc: linux-kernel, akpm, mtk.manpages
On Fri, 30 May 2008, Ulrich Drepper wrote:
>
> Without imposing unreasonable limitations on the number of concurrent users
> a rwlock can have the state cannot be represented in a single 32-bit variable.
I call bull on that.
We've had a 32-bit rwsemaphore in the kernel for a *long* time.
I think it's totally stupid for glibc to even *try* to do something like
this, since almost nobody will have a kernel with 64-bit futex support for
a long time in the wild anyway. So you need to support a 32-bit semaphore
in practice, and it's been done before.
The x86 kernel rwsem implementation may limit things to 64k readers (I'm
not even sure that's true, I'm not going to look at the code again), but
if I recall correctly that's just because we wanted to use a single "xadd"
in the hotpath, instead of doign a load a cmpxchg.
I really object to adding another 32/64-bit difference just for something
like an rwsemaphore. It needs a whole lot of stronger reasons than that.
I suspect the old rwlocks are simply just stupid. You can do those things
with a single 32-bit locked op. Have you tried
- 29 bits of reader counts
- 1 bit of uncontended writer
- 1 bit of "contention" (ie a mark for requiring fairness)
- 1 bit for a spinlock so that you can do all the fairness without doing
any extra locked ops
or similar?
Linus
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-05-31 2:13 ` Linus Torvalds
@ 2008-05-31 3:14 ` Ulrich Drepper
2008-05-31 3:44 ` Linus Torvalds
0 siblings, 1 reply; 26+ messages in thread
From: Ulrich Drepper @ 2008-05-31 3:14 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, akpm, mtk.manpages
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Linus Torvalds wrote:
> We've had a 32-bit rwsemaphore in the kernel for a *long* time.
You misread what I wrote.
Semaphores are fine. The problem are reader/writer locks where we need
three counters and some flags. It is absolutely not acceptable to limit
the number of threads to 512 to make this fit.
> I think it's totally stupid for glibc to even *try* to do something like
> this, since almost nobody will have a kernel with 64-bit futex support for
> a long time in the wild anyway. So you need to support a 32-bit semaphore
> in practice, and it's been done before.
Again, misread. This is not functionality which is not available.
Semaphores are *of course* fine with 32 bits. And there is a
reader/writer lock implementation. It is just very slow compared to
what is possible. The transition between systems miss the 64-bit
support and those which have it is completely transparent. The same
glibc binary will work for both.
- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
iEYEARECAAYFAkhAwpsACgkQ2ijCOnn/RHS7PACdGvdgI9pv+IcsSPkXPVwFJ5ZF
YNQAn01fm1wEc4EI/fwOVuM57XtCy9Dq
=LcpJ
-----END PGP SIGNATURE-----
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-05-31 3:14 ` Ulrich Drepper
@ 2008-05-31 3:44 ` Linus Torvalds
2008-05-31 4:04 ` Ulrich Drepper
0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-05-31 3:44 UTC (permalink / raw)
To: Ulrich Drepper; +Cc: linux-kernel, akpm, mtk.manpages
On Fri, 30 May 2008, Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Linus Torvalds wrote:
> > We've had a 32-bit rwsemaphore in the kernel for a *long* time.
>
> You misread what I wrote.
No, you do.
The kernel calls reader-writer locks rwsemaphores.
Linus
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-05-31 3:44 ` Linus Torvalds
@ 2008-05-31 4:04 ` Ulrich Drepper
2008-05-31 4:16 ` Linus Torvalds
0 siblings, 1 reply; 26+ messages in thread
From: Ulrich Drepper @ 2008-05-31 4:04 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, akpm, mtk.manpages
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Linus Torvalds wrote:
> The kernel calls reader-writer locks rwsemaphores.
Well, then you don't understand the complexity of the required interface
together with the performance implications. Take your proposed allocation:
- 29 bits of reader counts
- 1 bit of uncontended writer
- 1 bit of "contention" (ie a mark for requiring fairness)
- 1 bit for a spinlock so that you can do all the fairness without
doing any extra locked ops
Ask yourself this:
- - How do you know when there is no more writer waiting? You cannot
reset a "writer waiting" bit after you wake up one writer without
waking every single thread waiting for the rwlock and letting them
fight for it
- - How do you handle the difference between reader-preferred rwlocks
and writer-preferred rwlocks? In the latter, if a rwlock is locked
for reading, future readers must be delayed until all writers are
gone
- - How do you do the accounting for the *timedlock variants? In the
case of those a functions, if the threads wake due to a timeout,
you have the decrement the waiter count. But you have only one bit...
I don't say you cannot implement rwlocks this way. Sure, it definitely
is possible. But this implementation would in the contended case like
10x as slow as the current code because you constantly have to wake up
every single thread.
If you want, I'll give you a libpthread.so with the new code. Then you
can test your code. I bet you whatever you want that you cannot achieve
the performance with your puny 32-bit futex without limiting the number
of threads to a ridiculously low number. The implementation I have
allows for 2^23 (possibly recursively locked) readers, 2^18 readers or
writers waiting.
When I was writing "you need more than 32 bits" I didn't even imagine
that somebody would suggest using such a primitive scheme which cannot
possibly work efficiently in a userlevel implementation.
- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
iEYEARECAAYFAkhAzlEACgkQ2ijCOnn/RHQ+bgCcDDmhSvbKdboyqa21smzlSt75
zHUAn2rr3NKns4Kb78Woas3NP5hbwkOU
=p6z6
-----END PGP SIGNATURE-----
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-05-31 4:04 ` Ulrich Drepper
@ 2008-05-31 4:16 ` Linus Torvalds
2008-05-31 4:23 ` Linus Torvalds
0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-05-31 4:16 UTC (permalink / raw)
To: Ulrich Drepper; +Cc: linux-kernel, akpm, mtk.manpages
On Fri, 30 May 2008, Ulrich Drepper wrote:
>
> Ask yourself this:
How about you post your code.
Because your questions seem to be total and utter crap.
> - How do you know when there is no more writer waiting? You cannot
> reset a "writer waiting" bit after you wake up one writer without
> waking every single thread waiting for the rwlock and letting them
> fight for it
Sure you can. By looking at the *other*data* that isn't atomic.
So when doing a "write_unlock()" (or whatever you call it - for the kernel
calls it "up_write()") you can look at the non-atomic write counts to
decide whether there are others.
Also note how you can use 64-bit atomic ops to do that all in user space,
without actually requiring 64-bit futex support - as long as the bits that
matter for waking (like "was there more than one pending writer") fit in
the one 32-bit word.
> - - How do you handle the difference between reader-preferred rwlocks
> and writer-preferred rwlocks? In the latter, if a rwlock is locked
> for reading, future readers must be delayed until all writers are
> gone
Again, that's not something that needs to be in the *atomic* part.
It's unquestionable that a rwlock is more than 32 bits, but what I
question is whether you need all those extra bits to be under futex
> - How do you do the accounting for the *timedlock variants? In the
> case of those a functions, if the threads wake due to a timeout,
> you have the decrement the waiter count. But you have only one bit...
Uli, you're not even trying any more.
NO, you don't have just one bit. You have as many bits as you want. It's
just that only 32 of the bits will be relevant for FUTEX_WAIT_OP etc.
Linus
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-05-31 4:16 ` Linus Torvalds
@ 2008-05-31 4:23 ` Linus Torvalds
2008-05-31 4:38 ` Ulrich Drepper
0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-05-31 4:23 UTC (permalink / raw)
To: Ulrich Drepper; +Cc: linux-kernel, akpm, mtk.manpages
On Fri, 30 May 2008, Linus Torvalds wrote:
>
> Also note how you can use 64-bit atomic ops to do that all in user space,
> without actually requiring 64-bit futex support - as long as the bits that
> matter for waking (like "was there more than one pending writer") fit in
> the one 32-bit word.
So here's an example of this:
- make all the readers/writers actually update a 64-bit word (by using
cmpxchg8b in user space to actually get the locks)
- but organize things so that a reader only needs to look at the high 32
bits to actually make it's wakeup-decision, and a sleeping writer only
needs to look at the low 32 bits. How? Make the low bits of the words
contain the "contention status".
Is it possible? I dunno. I personally suspect it is. I also suspect you
didn't even try.
Linus
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-05-31 4:23 ` Linus Torvalds
@ 2008-05-31 4:38 ` Ulrich Drepper
2008-05-31 4:58 ` Linus Torvalds
2008-05-31 22:25 ` Linus Torvalds
0 siblings, 2 replies; 26+ messages in thread
From: Ulrich Drepper @ 2008-05-31 4:38 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, akpm, mtk.manpages
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Linus Torvalds wrote:
> Is it possible? I dunno.
Once again not without limitations which are too low. The main problem
is that there is far more information needed on the reader side. There
are two counters and a number of bits for state information. Which
means the counters can be at most 14 bits in size.
A counter this low also means I cannot unconditionally increment them
(since there can be more than 2^14 threads sharing one page). This
means you get in the cache line race I described in the last mail. You
first have to load a word and then perform a the write operation while
at the same time other threads competing with you for the same cache line.
> I personally suspect it is. I also suspect you
> didn't even try.
I have to disappoint you. I thought of that and had to reject it.
- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
iEYEARECAAYFAkhA1j8ACgkQ2ijCOnn/RHTUXQCgnkUy6PNSuUI4O6BD4abxqqKr
PzMAn2qcvrXGM4qrhOytk1Yie0BsMeVa
=zvxZ
-----END PGP SIGNATURE-----
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-05-31 4:38 ` Ulrich Drepper
@ 2008-05-31 4:58 ` Linus Torvalds
2008-05-31 22:25 ` Linus Torvalds
1 sibling, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2008-05-31 4:58 UTC (permalink / raw)
To: Ulrich Drepper; +Cc: linux-kernel, akpm, mtk.manpages
On Fri, 30 May 2008, Ulrich Drepper wrote:
>
> Linus Torvalds wrote:
> > Is it possible? I dunno.
>
> Once again not without limitations which are too low. The main problem
> is that there is far more information needed on the reader side. There
> are two counters and a number of bits for state information. Which
> means the counters can be at most 14 bits in size.
Uli, you're NOT EVEN READING!
Go back and read it, dammit! How many times have I said that you can even
have more than 32 bits of data?
There's nothing wrong with using 64-bit atomics to update the words. All
I've *ever* questioned is whether you need more than 32 bits for the
FUTEX part.
For example, let's say that you need 4 bits of "state" information (ie
"next woken-up one is a reader"). And yes, you need counts for readers and
writers. Maybe we can make those counts be 30 bits each. 64 bits total
space. But does the FUTEX_OP things actually need to access all 64 bits?
So imagine that you bave a 64-bit atomic op, and bits 0-1 are the reader
state bits, with bits 2-31 being the reader count. Bits 32-33 are the
writer state bits, and bits 34-63 are the writer count.
Now you can use 64-bit atomic ops to update them, but perhaps just use a
32-bit FUTEX op to actually wait as a reader on the low 32 bits that are
the reader state bits. See?
THIS is what I've been trying to tell you. It's quite possible that the
algorithm you already have could work with just moving the bits around so
that each FUTEX op can be done on just one of the words.
And notice my use of the word "possible".
The kernel uses a 32-bit count word, but has various other state too that
it maintains under a separate locking setup. Same basic idea: the 32-bit
word is "special" (because it's the one that has to have the right format
for the 'xadd' op that we use for the fast path), but it's not the _only_
data in the thing. The other data is also accessed, it's just accessed
according to different principles.
Linus
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-05-31 4:38 ` Ulrich Drepper
2008-05-31 4:58 ` Linus Torvalds
@ 2008-05-31 22:25 ` Linus Torvalds
2008-05-31 22:32 ` Linus Torvalds
2008-06-02 18:54 ` Ingo Molnar
1 sibling, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2008-05-31 22:25 UTC (permalink / raw)
To: Ulrich Drepper; +Cc: linux-kernel, akpm, mtk.manpages
On Fri, 30 May 2008, Ulrich Drepper wrote:
>
> Once again not without limitations which are too low. The main problem
> is that there is far more information needed on the reader side. There
> are two counters and a number of bits for state information. Which
> means the counters can be at most 14 bits in size.
Ok, so I obviously think you're simply wrong, and I've told you so both in
private after seeing the 64-bit code you propose (and which you want to
patent), and publicly.
So let's have code talk. I'm going to give real code for the only case
that really matters, namely the uncontended case.
Do you agree that once you get contention, and you actually have to do a
FUTEX system call to sort it out, the number of atomic instructions do not
really matter? Because the system call is going to dominate. Agreed?
So I'm not going to write out the slow contention case, because you have
already agreed that you can do a 32-bit rwlock _slowly_ with the existing
32-bit FUTEX support.
IOW, I'm faking it, but I'm making a point. Namely that you can
efficiently do read-write lock using *only* 32-bit ops, and without any
real kind of limitations on the number of readers and writers.
So here goes the explanation and the pseudo-code.
- have two levels of locking: the contended case, and the uncontended case
- these are two *separate* locks. I'm going to describe the only case
that mattes for performance, namely the non-contention one. We just
make sure that the non-contention lock "gates" things to the slow case.
- in other words, here's a pretty optimal fast-case that we write in
assembly language, and all it does is to handle the non-contention
case, with any waiting happening on *another* lock entirely.
- on that first-level lock, the setup is as follows:
- the low bit (bit#0) is "writer holds"
- the next bit (bit#1) is "contended", and all it does is that it
guarantees that when set, everybody will go to the slow-path.
- bits 2-31 are "reader count"
- the second-level lock is another 32-bit lock that is *adjacent* in
memory, and they are 64-bit aligned, so that you can do an atomic
initialization and more importantly a 64-bit "it is now no longer
contended" assignment in one atomic op. That's going to be relevant
only for the slow-path, and all it means is that we can basically do
an atomic "sync" of the two locks even on 32-bit x86 with a single
"cmpxchg8b" (no other "full" 64-bit operations necessary).
And before I actually give the fast-path code, let me describe the
requirements for the slow-path code:
- it needs to work with 32-bit futexes. This code obviously does exist,
because you already have something like that in glibc. Performance is
not important, the only thing that matters is really "queueing
behaviour", ie this code needs to wake things up in the right order.
- it needs to be able to *notice* when things are no longer contended.
IOW, it doesn't need to have a fast-path case per se, but the unlock
sequences do need to notice when they are the last unlocker, and when
that happens, they need to be able to do that magic "lock cmpxchg8b"
that tests-and-clears the contention bit in the primary lock at the
same time as it does the secondary lock.
That second constraint is actually fairly obvious when you think about it:
it's effectively the thing that makes us re-enter the non-contention case
after the contention has been handled.
Ok, that's the setup. Now, let's look at the fast-path cases.
First, write_lock():
/*
* write_lock() can only succeed as a fast-path if the reader
* count is zero and there are no pending or holding writers.
* Ergo, the low 32 bits have to be 0, and should be 1 (for
* "writer exists, no other pending writer" if it succeeds.
*
* Note that we don't need to update the high bits if we
* succeed (because the writer count is the _pending_ writer
* count, not the holding one)
*/
eax = 0
edx = 1
lock cmpxchgl %edx,(rwlock)
testl %eax,%eax
jne slowpath
ret
slowpath:
lock orl $2,(rwlock)
call slow_write_lock
ret
And here's read_lock:
/*
* read_lock() can only succeed if we increment the reader
* count _and_ there were no pending or holding writers.
* So we do a 32-bit xaddl with 4 (incrementing the readers
* by one), and were successful if the old value didn't
* have any writer bits set
*/
eax = 4 /* Low two bits are writers - don't touch */
xaddl %eax,(rwlock)
testl $3,%eax /* Did we have any holding or pending writers? */
jne slowpath
ret
slowpath:
lock orl $2,(rwlock)
lock subl $4,(rwlock)
call slow_read_lock
ret
See? Fastpaths are fast - a single atomic op and a conditional.
Now, the unlock paths - they also need to be fast for the non-contention case:
The write_unlock is just
/*
* If there was no contention, then the low word must
* still have the value "1" - this one writer, and
* no other pending writers or any readers that tried
* to get the lock.
*/
eax = 1
edx = 0
cmpxchgl %edx,(rwlock)
cmpl $1,%eax
jne slowpath /* contention - need to wake up a reader or a writer */
ret
slowpath:
lock orl $2,(rwlock)
lock andl $~1,(rwlock)
call slow_write_unlock
ret
and the read_unlock is
eax = -4
xaddl %eax,(rwlock)
testl $3,%eax /* Do we have contention? */
jne slowpath
ret
slowpath:
lock orl $2,(rwlock)
call slow_read_unlock
ret
Ok, so notice a couple of things here. The above is not only designed to
handle the uncontended case with just a single locked op on the lock, but
it also makes sure that each path basically enters the slow-path with all
of its fast-path actions *done*, and with any unlocks done. So by the time
the slow-path is entered, we know that
(a) bit#1 ("contention") is guaranteed to be set, and it will be _our_
job to clear it once everybody has unlocked
(b) nobody else will ever succeed on the fast-path and they'll all come
to us (including any future unlockers that may need to wake things
up)
(c) anybody who got a fastpath lock will do their unlock action before
they then get trapped into the slow path.
Now, (c) above, in particular, means that the slow path should start out
making sure that all fast-path locks have been released. That's easy
enough to do: the first thing each slow-path member needs to do is to wait
until the 32-bit fastpath variable gets the value "2". It *will* happen
eventually, and once it does, we are guaranteed that no fast-path code
will ever get the lock again (even if the fast-path read_lock() may
increment the higher bits and then decrement them again when it notices
the contention - so it can still change from "2", but nobody will actually
*acquire* the lock (and obviously, if it has hit "2", that also means
that all _previous_ lockers will have released their locks, whether they
were read- or write-lockers).
So each slow-path thing needs to start out that way, but once it has done
that, it basically knows that the fast-path code is immaterial from a lock
acquire standpoint.
So now you do your existing slow-path code on the second lock. And at each
unlock (or failed trylock, for that matter), remember to see if you can do
an atomic "release both fast-path _and_ slow-path locks" with cmpxchg8b
(with the fastpath lock word being a 2->0 transition - what the slowpath
rule for "all done" is will obviously depend on the algorithm in
question).
See? I'm not guaranteeing that the above is "optimal" in the sense that I
suspect that there are probably clever ways to make the slow path faster
(including, I suspect, even for all 32-bit code), but I *am* claiming that
this is _obviously_ optimal for the non-contention case, and that the
actual contention case is almost certainly not going to care about a
couple of extra atomic ops.
Agreed? The above isn't actually all that complicated. And it would mean
that the exact same algorithm works on both 32-bit an 64-bit CPU's, and on
both old and new kernels (because you only really need 32-bit futexes).
Linus
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-05-31 22:25 ` Linus Torvalds
@ 2008-05-31 22:32 ` Linus Torvalds
2008-06-02 18:54 ` Ingo Molnar
1 sibling, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2008-05-31 22:32 UTC (permalink / raw)
To: Ulrich Drepper; +Cc: linux-kernel, akpm, mtk.manpages
On Sat, 31 May 2008, Linus Torvalds wrote:
>
> * Note that we don't need to update the high bits if we
> * succeed (because the writer count is the _pending_ writer
> * count, not the holding one)
Btw, that comment is bogus, ignore it. It comes from an earlier algorithm
I tried that also explained the contention case, but got unnecessarily
complex. Separating out the queuing behaviour from the actual
non-contended behaviour made things much simpler to explain, so I stopped
even bothering with a "both" case.
Too much cut-and-paste, IOW.
Linus
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-05-31 22:25 ` Linus Torvalds
2008-05-31 22:32 ` Linus Torvalds
@ 2008-06-02 18:54 ` Ingo Molnar
2008-06-02 20:22 ` Linus Torvalds
1 sibling, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2008-06-02 18:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Ulrich Drepper, linux-kernel, akpm, mtk.manpages
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> IOW, I'm faking it, but I'm making a point. Namely that you can
> efficiently do read-write lock using *only* 32-bit ops, and without
> any real kind of limitations on the number of readers and writers.
>
> So here goes the explanation and the pseudo-code.
>
> - have two levels of locking: the contended case, and the uncontended
> case
i suspect _any_ abstract locking functionality around a data structure
can be implemented via atomic control over just a single user-space bit.
That bit can be used as a lock and if all access to the state of that
atomic variable uses it, arbitrary higher-order atomic state transitions
can be derived from it. The cost would be a bit more instructions in the
fastpath, but there would still only be a single atomic op (the acquire
op), as the unlock would be a natural barrier (on x86 at least).
Concurrency (and scheduling) of that lock would still be exactly the
same as with genuine 64-bit (or even larger) atomic ops, and the
fastpath would be very close as well.
Ingo
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-06-02 18:54 ` Ingo Molnar
@ 2008-06-02 20:22 ` Linus Torvalds
2008-06-02 23:03 ` Ingo Molnar
2008-06-06 1:27 ` Nick Piggin
0 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2008-06-02 20:22 UTC (permalink / raw)
To: Ingo Molnar, Nick Piggin, David Howells
Cc: Ulrich Drepper, Linux Kernel Mailing List, Andrew Morton
[ I added Nick and DavidH to the Cc, since they have at least historically
shown interest in locking algorithms ]
On Mon, 2 Jun 2008, Ingo Molnar wrote:
>
> i suspect _any_ abstract locking functionality around a data structure
> can be implemented via atomic control over just a single user-space bit.
Well, theory is theory, and practice is different.
That's *especially* true when it comes to locking, which is just so
tightly coupled to the exact details of which atomics are fast on a
particular architecture.
Also, different people will want to see different performance profiles.
For example, it makes a *huge* difference whether you are strictly fair or
not. I can almost guarantee that a 100% fair implementation may be really
"nice" from a theoretical standpoint, but suck really badly in practice,
because if you want best performance, then you want the lock to have a
very strong CPU affinity - and the faster you can do your lock ops, the
more of a unfairness and CPU affinity they get!
And rwlocks in particular are actually much more interesting in this
respect, because they not only have that CPU affinity fairness, but also
the reader-vs-writer fairness. You optimize for one load, and it may give
you almost perfect performance, but at the cost of sucking at another
load.
For example, some loads are almost entirely read-read locks, with only
very occasional write locks for some uncommon config change thing. Do you
want to optimize for that? Maybe. And yes, you can make those uncontended
read-read locks go really quickly, but then (especially if you continue to
let reads through even when writers want to contend), that can slow down
writers a *lot*, to the point of starvation.
Different CPU's will also show different patterns.
Anyway, I was busy most of the weekend, but I've now coded up a partial
actual example implementation. Its' probably buggy. Uli is very correct in
saying that it's easy to screw up futex'es, but even in the absense of
futexes, it's just easy to screw up any threaded logic.
But if anybody wants to look at a rough draft that at least limps along
_partially_, there's
http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git;a=summary
for you.
And I'll freely admit that
(a) the above thing is pretty hacked up. No guarantees as to correctness
or anything else.
(b) I looked _mainly_ at the "all readers" case, and considered that the
primary goal. Which explains why it does really well on that
particular load.
(c) However, I refuse to starve writers. In fact, my thing will always
prefer a writer to any new readers, on the assumption that the sanest
rwlock situation is the "tons of readers, occasional writer".
(d) it does ok for me on the write-write contention case too, but the
random mixed "all threads do 5% writelocks" test-case seems to suck.
As an exmaple: the current glibc code I have seems to be uniformly and
boringly middle-of-the-road. It really doesn't seem to be horrible at all.
That said, the above thing _is_ 2-3x faster for me on that read-read case
(from single thread up to 20 threads - but just tested on one particular
machine), so if that is what you want to aim for, it's certainly easy to
beat.
But for the write case, while I can easily beat it for a low-thread count
with little contention, my example thing above benchmarks about equal for
bigger thread counts (caveat: again - I've only tested on one machine,
which is a single-socket dual-core thing. Caveat emptor).
And the pthreads implementation actually beats my hacked-up one at least
for the 5% random writelocks case. It's not beating it by a huge amount,
but it's not in the noise level either (maybe 15%). And I bet the pthreads
implementation is a hell of a lot better tested ;)
> That bit can be used as a lock and if all access to the state of that
> atomic variable uses it, arbitrary higher-order atomic state transitions
> can be derived from it. The cost would be a bit more instructions in the
> fastpath, but there would still only be a single atomic op (the acquire
> op), as the unlock would be a natural barrier (on x86 at least).
No, "unlocks as a natural barrier" only works for exclusive kernel locks
(spin_unlock and write_unlock). There we can just do a write to unlock.
But for anything that wants to handle contention differently than just
spinning, the unlock path needs to be able to do an atomic "unlock and
test if I need to do something else", because it may need to wake things
up.
Feel free to try out my code, and laugh at it. Getting locking right is
definitely not simple, and I would be really surprised if I got everything
right. It's meant as a "hey, here's real code people can at least play
with" for anybody who is interested in this kind of thing.
Linus
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-06-02 20:22 ` Linus Torvalds
@ 2008-06-02 23:03 ` Ingo Molnar
2008-06-03 3:24 ` Nick Piggin
2008-06-06 1:27 ` Nick Piggin
1 sibling, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2008-06-02 23:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, David Howells, Ulrich Drepper,
Linux Kernel Mailing List, Andrew Morton
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > That bit can be used as a lock and if all access to the state of
> > that atomic variable uses it, arbitrary higher-order atomic state
> > transitions can be derived from it. The cost would be a bit more
> > instructions in the fastpath, but there would still only be a single
> > atomic op (the acquire op), as the unlock would be a natural barrier
> > (on x86 at least).
>
> No, "unlocks as a natural barrier" only works for exclusive kernel
> locks (spin_unlock and write_unlock). There we can just do a write to
> unlock. But for anything that wants to handle contention differently
> than just spinning, the unlock path needs to be able to do an atomic
> "unlock and test if I need to do something else", because it may need
> to wake things up.
yeah, indeed. Compared to all the other costs that have to be dealt with
here, having a second atomic op isnt all that much of an issue either,
especially on latest hw. An atomic op will probably never be as cheap as
a non-atomic op, but ~20 cycles is still plenty fast for most practical
purposes.
Ingo
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-06-02 23:03 ` Ingo Molnar
@ 2008-06-03 3:24 ` Nick Piggin
2008-06-04 19:57 ` Linus Torvalds
0 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2008-06-03 3:24 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, David Howells, Ulrich Drepper,
Linux Kernel Mailing List, Andrew Morton
On Tue, Jun 03, 2008 at 01:03:17AM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > > That bit can be used as a lock and if all access to the state of
> > > that atomic variable uses it, arbitrary higher-order atomic state
> > > transitions can be derived from it. The cost would be a bit more
> > > instructions in the fastpath, but there would still only be a single
> > > atomic op (the acquire op), as the unlock would be a natural barrier
> > > (on x86 at least).
> >
> > No, "unlocks as a natural barrier" only works for exclusive kernel
> > locks (spin_unlock and write_unlock). There we can just do a write to
> > unlock. But for anything that wants to handle contention differently
> > than just spinning, the unlock path needs to be able to do an atomic
> > "unlock and test if I need to do something else", because it may need
> > to wake things up.
>
> yeah, indeed. Compared to all the other costs that have to be dealt with
> here, having a second atomic op isnt all that much of an issue either,
> especially on latest hw. An atomic op will probably never be as cheap as
> a non-atomic op, but ~20 cycles is still plenty fast for most practical
> purposes.
I think optimised our unlock_page in a way that it can do a
non-atomic unlock in the fastpath (no waiters) using 2 bits. In
practice it was still atomic but only because other page flags
operations could operate on ->flags at the same time.
I still have to get around to submitting the damn thing upstream,
but maybe if I bring it up here, I get the idea more reviewed :)
Anyway, the algorithm goes like this:
lock_page()
{
if (test_and_set_bit_lock(PG_locked))
lock_page_slow();
}
lock_page_slow()
{
/* slowpath */
again:
prepare_to_wait();
set_bit(PG_waiters);
if (test_and_set_bit_lock(PG_locked)) {
schedule();
goto again;
}
}
wake_page_waiters()
{
/* slowpath */
clear_bit(PG_waiters);
smp_mb__after_clear_bit(); /* order clear_bit store with wake_up_page load */
wake_up_page(PG_locked);
}
unlock_page()
{
clear_bit_unlock(PG_locked);
if (unlikely(test_bit(PG_waiters)))
wake_page_waiters();
}
We don't require any load/store barrier in the unlock_page fastpath
because the bits are in the same word, so cache coherency gives us a
sequential ordering anyway.
Now you notice the lock page slowpath can set bits without holding the
lock, at first glance you'd think the clear_bit to unlock has to be
atomic too then. But actually if we're careful, we can put them in seperate
parts of the word and use the sub-word operations on x86 to avoid the atomic
requirement. I'm not aware of any architecture in which operations to the
same word could be out of order.
Not only does this avoid all barriers (except acquire/release) in the
fastpaths, but it also avoids the unconditional load of the random
hash page waitqueue cacheline on unlock.
Anything applicable to your problem at hand?
^ permalink raw reply [flat|nested] 26+ messages in thread* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-06-03 3:24 ` Nick Piggin
@ 2008-06-04 19:57 ` Linus Torvalds
2008-06-04 20:38 ` Linus Torvalds
2008-06-05 1:45 ` Nick Piggin
0 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2008-06-04 19:57 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, David Howells, Ulrich Drepper,
Linux Kernel Mailing List, Andrew Morton
On Tue, 3 Jun 2008, Nick Piggin wrote:
>
> I think optimised our unlock_page in a way that it can do a
> non-atomic unlock in the fastpath (no waiters) using 2 bits. In
> practice it was still atomic but only because other page flags
> operations could operate on ->flags at the same time.
I'd be *very* nervous about this.
> We don't require any load/store barrier in the unlock_page fastpath
> because the bits are in the same word, so cache coherency gives us a
> sequential ordering anyway.
Yes and no.
Yes, the bits are int he same word, so cache coherency guarantees a lot.
HOWEVER. If you do the sub-word write using a regular store, you are now
invoking the _one_ non-coherent part of the x86 memory pipeline: the store
buffer. Normal stores can (and will) be forwarded to subsequent loads from
the store buffer, and they are not strongly ordered wrt cache coherency
while they are buffered.
IOW, on x86, loads are ordered wrt loads, and stores are ordered wrt other
stores, but loads are *not* ordered wrt other stores in the absense of a
serializing instruction, and it's exactly because of the write buffer.
So:
> But actually if we're careful, we can put them in seperate parts of the
> word and use the sub-word operations on x86 to avoid the atomic
> requirement. I'm not aware of any architecture in which operations to
> the same word could be out of order.
See above. The above is unsafe, because if you do a regular store to a
partial word, with no serializing instructions between that and a
subsequent load of the whole word, the value of the store can be bypassed
from the store buffer, and the load from the other part of the word can be
carried out _before_ the store has actually gotten that cacheline
exclusively!
So when you do
movb reg,(byteptr)
movl (byteptr),reg
you may actually get old data in the upper 24 bits, along with new data in
the lower 8.
I think.
Anyway, be careful. The cacheline itself will always be coherent, but the
store buffer is not going to be part of the coherency rules, and without
serialization (or locked ops), you _are_ going to invoke the store buffer!
Linus
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-06-04 19:57 ` Linus Torvalds
@ 2008-06-04 20:38 ` Linus Torvalds
2008-06-05 1:56 ` Nick Piggin
2008-06-05 1:45 ` Nick Piggin
1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-06-04 20:38 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, David Howells, Ulrich Drepper,
Linux Kernel Mailing List, Andrew Morton
On Wed, 4 Jun 2008, Linus Torvalds wrote:
>
> So when you do
>
> movb reg,(byteptr)
> movl (byteptr),reg
>
> you may actually get old data in the upper 24 bits, along with new data in
> the lower 8.
Put another way: the CPU may internally effectively rewrite the two
instructions as
movb reg,tmpreg (tmp = writebuffer)
movl (byteptr),reg (do the 32-bit read of the old cached contents)
movb tmpreg,reg (writebuffer snoop by reads)
movb tmpreg,(byteptr) (writebuffer actually drains into cacheline)
and *if* your algorithm is robust wrt these kinds of rewrites, you're ok.
But notice how there are two accesses to the cacheline, and how they are
actually in the "wrong" order, and how the cacheline could have been
updated by another CPU in between.
Does this actually happen? Yeah, I do believe it does. Is it a deathknell
for anybody trying to be clever with overlapping reads/writes? No, you can
still have things like causality rules that guarantee that your algorithm
is perfectly stable in the face of these kinds of reordering. But it *is*
one of the few re-orderings that I think is theoretically archtiecturally
visible.
For example, let's start out with a 32-bit word that contains zero, and
three CPU's. One CPU does
cmpxchgl 0->0x01010101,mem
another one does
cmpxchlg 0x01010101->0x02020202,mem
and the third one does that
movb $0x03,mem
movl mem,reg
and after it all completed, you may have 0x02020203 in memory, but "reg"
on the third CPU contains 0x01010103.
Note how NO OTHER CPU could _possibly_ have seen that value! That value
never ever existed in any caches. If the final value was 0x02020203, then
both the cmpxchgl's must have worked, so the cache coherency contents
*must* have gone from 0 -> 0x01010101 -> 0x02020202 -> 0x02020203 (with
the movb actually getting the exclusive cache access last).
So the third CPU saw a value for that load that actually *never* existed
in any cache-line: 0x01010103. Exactly because the x86 memory ordering
allows the store buffer contents to be forwarded within a CPU core.
And this is why atomic locked instructions are special. They bypass the
store buffer (or at least they _act_ as if they do - they likely actually
use the store buffer, but they flush it and the instruction pipeline
before the load and wait for it to drain after, and have a lock on the
cacheline that they take as part o the load, and release as part of the
store - all to make sure that the cacheline doesn't go away in between and
that nobody else can see the store buffer contents while this is going
on).
This is also why there is so much room for improvement in locked
instruction performance - you don't _have_ to flush things if you just are
very careful about tracking how and when you use which elements in the
store buffer, and track the ordering of cache accesses by all of this.
Linus
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-06-04 20:38 ` Linus Torvalds
@ 2008-06-05 1:56 ` Nick Piggin
2008-06-05 1:58 ` Nick Piggin
2008-06-05 3:08 ` Linus Torvalds
0 siblings, 2 replies; 26+ messages in thread
From: Nick Piggin @ 2008-06-05 1:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, David Howells, Ulrich Drepper,
Linux Kernel Mailing List, Andrew Morton
On Wed, Jun 04, 2008 at 01:38:50PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 4 Jun 2008, Linus Torvalds wrote:
> >
> > So when you do
> >
> > movb reg,(byteptr)
> > movl (byteptr),reg
> >
> > you may actually get old data in the upper 24 bits, along with new data in
> > the lower 8.
>
> Put another way: the CPU may internally effectively rewrite the two
> instructions as
>
> movb reg,tmpreg (tmp = writebuffer)
> movl (byteptr),reg (do the 32-bit read of the old cached contents)
> movb tmpreg,reg (writebuffer snoop by reads)
> movb tmpreg,(byteptr) (writebuffer actually drains into cacheline)
>
> and *if* your algorithm is robust wrt these kinds of rewrites, you're ok.
> But notice how there are two accesses to the cacheline, and how they are
> actually in the "wrong" order, and how the cacheline could have been
> updated by another CPU in between.
>
> Does this actually happen? Yeah, I do believe it does. Is it a deathknell
> for anybody trying to be clever with overlapping reads/writes? No, you can
> still have things like causality rules that guarantee that your algorithm
> is perfectly stable in the face of these kinds of reordering. But it *is*
> one of the few re-orderings that I think is theoretically archtiecturally
> visible.
>
> For example, let's start out with a 32-bit word that contains zero, and
> three CPU's. One CPU does
>
> cmpxchgl 0->0x01010101,mem
>
> another one does
>
> cmpxchlg 0x01010101->0x02020202,mem
>
> and the third one does that
>
> movb $0x03,mem
> movl mem,reg
>
> and after it all completed, you may have 0x02020203 in memory, but "reg"
> on the third CPU contains 0x01010103.
>
> Note how NO OTHER CPU could _possibly_ have seen that value! That value
> never ever existed in any caches. If the final value was 0x02020203, then
> both the cmpxchgl's must have worked, so the cache coherency contents
> *must* have gone from 0 -> 0x01010101 -> 0x02020202 -> 0x02020203 (with
> the movb actually getting the exclusive cache access last).
>
> So the third CPU saw a value for that load that actually *never* existed
> in any cache-line: 0x01010103. Exactly because the x86 memory ordering
> allows the store buffer contents to be forwarded within a CPU core.
>
> And this is why atomic locked instructions are special. They bypass the
> store buffer (or at least they _act_ as if they do - they likely actually
> use the store buffer, but they flush it and the instruction pipeline
> before the load and wait for it to drain after, and have a lock on the
> cacheline that they take as part o the load, and release as part of the
> store - all to make sure that the cacheline doesn't go away in between and
> that nobody else can see the store buffer contents while this is going
> on).
>
> This is also why there is so much room for improvement in locked
> instruction performance - you don't _have_ to flush things if you just are
> very careful about tracking how and when you use which elements in the
> store buffer, and track the ordering of cache accesses by all of this.
I see what you're saying, but I hadn't really considered it before.
Not that I've attempted to do much of these sub-word operations
without consulting you first (which brings us here) ;).
I'd have thought that for a case like this, you'd simply hit the store
alias logic and store forwarding would stall because it doesn't have
the full data.
I'd like to know for sure.
The other thing that could be possible, and I'd imagine maybe more likely
to be implemented in a real CPU because it should give more imrpovement
(and which does break my algorithm) is just that the load to the cacheline
may get to execute first, then if the cacheline gets evicted and
modified by another CPU before our store completes, we effectively see
store/load reordering again.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-06-05 1:56 ` Nick Piggin
@ 2008-06-05 1:58 ` Nick Piggin
2008-06-05 3:08 ` Linus Torvalds
1 sibling, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2008-06-05 1:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, David Howells, Ulrich Drepper,
Linux Kernel Mailing List, Andrew Morton
On Thu, Jun 05, 2008 at 03:56:34AM +0200, Nick Piggin wrote:
>
> The other thing that could be possible, and I'd imagine maybe more likely
> to be implemented in a real CPU because it should give more imrpovement
> (and which does break my algorithm) is just that the load to the cacheline
> may get to execute first, then if the cacheline gets evicted and
> modified by another CPU before our store completes, we effectively see
> store/load reordering again.
Well, I should qualify that: it doesn't actually break my algorithm
because you can still implement the unlock without atomic RMWs.
This may require you to have an mfence there, but we can still get
away without atomics (whether that's much cheaper or not, I haven't
measured recently!)
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-06-05 1:56 ` Nick Piggin
2008-06-05 1:58 ` Nick Piggin
@ 2008-06-05 3:08 ` Linus Torvalds
2008-06-05 4:29 ` Nick Piggin
1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-06-05 3:08 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, David Howells, Ulrich Drepper,
Linux Kernel Mailing List, Andrew Morton
On Thu, 5 Jun 2008, Nick Piggin wrote:
>
> I'd have thought that for a case like this, you'd simply hit the store
> alias logic and store forwarding would stall because it doesn't have
> the full data.
That's _one_ possible implementation.
Quite frankly, I think it's the less likely one. It's much more likely
that the cache read access and the store buffer probe happen in parallel
(this is a really important hotpath for any CPU, but even more so x86
where there are more of loads and stores that are spills). And then the
store buffer logic would return the data and a bytemask mask (where the
mask would be all zeroes for a miss), and the returned value is just the
appropriate mix of the two.
> I'd like to know for sure.
You'd have to ask somebody very knowledgeable inside Intel and AMD, and it
is quite likely that different microarchitectures have different
approaches...
> The other thing that could be possible, and I'd imagine maybe more likely
> to be implemented in a real CPU because it should give more imrpovement
> (and which does break my algorithm) is just that the load to the cacheline
> may get to execute first, then if the cacheline gets evicted and
> modified by another CPU before our store completes, we effectively see
> store/load reordering again.
Oh, absolutely, the perfect algorithm would actually get the right answer
and notice that the cacheline got evicted, and retried the whole sequence
such that it is coherent.
But we do know that Intel expressly documents loads and stores to pass
each other and documents the fact that the store buffer is there. So I bet
that this is visible in some micro-architecture, even if it's not
necessarily visible in _all_ of them.
The recent Intel memory ordering whitepaper makes it very clear that loads
can pass earlier stores and in particular that the store buffer allows
intra-processor forwarding to subsequent loads (2.4 in their whitepaper).
It _could_ be just a "for future CPU's", but quite frankly, I'm 100% sure
it isn't. The store->load forwarding is such a critical performance issue
that I can pretty much guarantee that it doesn't always hit the cacheline.
Of course, the partial store forwarding case is not nearly as important,
and stalling is quite a reasonable implementation approach. I just
personally suspect that doing the unconditional byte-masking is actually
_simpler_ to implement than the stall, so..
Linus
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-06-05 3:08 ` Linus Torvalds
@ 2008-06-05 4:29 ` Nick Piggin
0 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2008-06-05 4:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, David Howells, Ulrich Drepper,
Linux Kernel Mailing List, Andrew Morton
On Wed, Jun 04, 2008 at 08:08:37PM -0700, Linus Torvalds wrote:
>
>
> On Thu, 5 Jun 2008, Nick Piggin wrote:
> >
> > I'd have thought that for a case like this, you'd simply hit the store
> > alias logic and store forwarding would stall because it doesn't have
> > the full data.
>
> That's _one_ possible implementation.
>
> Quite frankly, I think it's the less likely one. It's much more likely
> that the cache read access and the store buffer probe happen in parallel
> (this is a really important hotpath for any CPU, but even more so x86
> where there are more of loads and stores that are spills). And then the
> store buffer logic would return the data and a bytemask mask (where the
> mask would be all zeroes for a miss), and the returned value is just the
> appropriate mix of the two.
>
> > I'd like to know for sure.
>
> You'd have to ask somebody very knowledgeable inside Intel and AMD, and it
> is quite likely that different microarchitectures have different
> approaches...
Well, it would just be nice to hear a "no we'll never do that", "we
already do", or "you can't rely on it" ;)
> > The other thing that could be possible, and I'd imagine maybe more likely
> > to be implemented in a real CPU because it should give more imrpovement
> > (and which does break my algorithm) is just that the load to the cacheline
> > may get to execute first, then if the cacheline gets evicted and
> > modified by another CPU before our store completes, we effectively see
> > store/load reordering again.
>
> Oh, absolutely, the perfect algorithm would actually get the right answer
> and notice that the cacheline got evicted, and retried the whole sequence
> such that it is coherent.
>
> But we do know that Intel expressly documents loads and stores to pass
> each other and documents the fact that the store buffer is there. So I bet
> that this is visible in some micro-architecture, even if it's not
> necessarily visible in _all_ of them.
>
> The recent Intel memory ordering whitepaper makes it very clear that loads
> can pass earlier stores and in particular that the store buffer allows
> intra-processor forwarding to subsequent loads (2.4 in their whitepaper).
> It _could_ be just a "for future CPU's", but quite frankly, I'm 100% sure
> it isn't. The store->load forwarding is such a critical performance issue
> that I can pretty much guarantee that it doesn't always hit the cacheline.
Well I have a simple test case to show loads pass earlier non conflicting
stores in the case that loads do not come from the store buffer (ie.
*inter* processor forwarding).
And store forwarding, by definition means that the load can complete before
the store can possibly be visible to another CPU I'd say. So yes, I'm
sure this does happen too.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-06-04 19:57 ` Linus Torvalds
2008-06-04 20:38 ` Linus Torvalds
@ 2008-06-05 1:45 ` Nick Piggin
1 sibling, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2008-06-05 1:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, David Howells, Ulrich Drepper,
Linux Kernel Mailing List, Andrew Morton
On Wed, Jun 04, 2008 at 12:57:13PM -0700, Linus Torvalds wrote:
>
>
> On Tue, 3 Jun 2008, Nick Piggin wrote:
> >
> > I think optimised our unlock_page in a way that it can do a
> > non-atomic unlock in the fastpath (no waiters) using 2 bits. In
> > practice it was still atomic but only because other page flags
> > operations could operate on ->flags at the same time.
>
> I'd be *very* nervous about this.
Heh ;) Well I'm not actually trying to do it in Linux (yet).
> > We don't require any load/store barrier in the unlock_page fastpath
> > because the bits are in the same word, so cache coherency gives us a
> > sequential ordering anyway.
>
> Yes and no.
>
> Yes, the bits are int he same word, so cache coherency guarantees a lot.
>
> HOWEVER. If you do the sub-word write using a regular store, you are now
> invoking the _one_ non-coherent part of the x86 memory pipeline: the store
> buffer. Normal stores can (and will) be forwarded to subsequent loads from
> the store buffer, and they are not strongly ordered wrt cache coherency
> while they are buffered.
>
> IOW, on x86, loads are ordered wrt loads, and stores are ordered wrt other
> stores, but loads are *not* ordered wrt other stores in the absense of a
> serializing instruction, and it's exactly because of the write buffer.
>
> So:
>
> > But actually if we're careful, we can put them in seperate parts of the
> > word and use the sub-word operations on x86 to avoid the atomic
> > requirement. I'm not aware of any architecture in which operations to
> > the same word could be out of order.
>
> See above. The above is unsafe, because if you do a regular store to a
> partial word, with no serializing instructions between that and a
> subsequent load of the whole word, the value of the store can be bypassed
> from the store buffer, and the load from the other part of the word can be
> carried out _before_ the store has actually gotten that cacheline
> exclusively!
>
> So when you do
>
> movb reg,(byteptr)
> movl (byteptr),reg
>
> you may actually get old data in the upper 24 bits, along with new data in
> the lower 8.
>
> I think.
I'd be very surprised if that was the case. But the unlock code needn't
do that anyway. It could do
movb reg,(byteptr) # clear PG_locked
movb (byteptr+1),reg # load PG_waiters
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-06-02 20:22 ` Linus Torvalds
2008-06-02 23:03 ` Ingo Molnar
@ 2008-06-06 1:27 ` Nick Piggin
2008-06-06 3:37 ` Linus Torvalds
1 sibling, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2008-06-06 1:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, David Howells, Ulrich Drepper,
Linux Kernel Mailing List, Andrew Morton
On Mon, Jun 02, 2008 at 01:22:57PM -0700, Linus Torvalds wrote:
>
> [ I added Nick and DavidH to the Cc, since they have at least historically
> shown interest in locking algorithms ]
>
> On Mon, 2 Jun 2008, Ingo Molnar wrote:
> >
> > i suspect _any_ abstract locking functionality around a data structure
> > can be implemented via atomic control over just a single user-space bit.
>
> Well, theory is theory, and practice is different.
>
> That's *especially* true when it comes to locking, which is just so
> tightly coupled to the exact details of which atomics are fast on a
> particular architecture.
>
> Also, different people will want to see different performance profiles.
>
> For example, it makes a *huge* difference whether you are strictly fair or
> not. I can almost guarantee that a 100% fair implementation may be really
> "nice" from a theoretical standpoint, but suck really badly in practice,
> because if you want best performance, then you want the lock to have a
> very strong CPU affinity - and the faster you can do your lock ops, the
> more of a unfairness and CPU affinity they get!
>
> And rwlocks in particular are actually much more interesting in this
> respect, because they not only have that CPU affinity fairness, but also
> the reader-vs-writer fairness. You optimize for one load, and it may give
> you almost perfect performance, but at the cost of sucking at another
> load.
>
> For example, some loads are almost entirely read-read locks, with only
> very occasional write locks for some uncommon config change thing. Do you
> want to optimize for that? Maybe. And yes, you can make those uncontended
> read-read locks go really quickly, but then (especially if you continue to
> let reads through even when writers want to contend), that can slow down
> writers a *lot*, to the point of starvation.
>
> Different CPU's will also show different patterns.
>
> Anyway, I was busy most of the weekend, but I've now coded up a partial
> actual example implementation. Its' probably buggy. Uli is very correct in
> saying that it's easy to screw up futex'es, but even in the absense of
> futexes, it's just easy to screw up any threaded logic.
>
> But if anybody wants to look at a rough draft that at least limps along
> _partially_, there's
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git;a=summary
>
> for you.
>
> And I'll freely admit that
> (a) the above thing is pretty hacked up. No guarantees as to correctness
> or anything else.
> (b) I looked _mainly_ at the "all readers" case, and considered that the
> primary goal. Which explains why it does really well on that
> particular load.
> (c) However, I refuse to starve writers. In fact, my thing will always
> prefer a writer to any new readers, on the assumption that the sanest
> rwlock situation is the "tons of readers, occasional writer".
> (d) it does ok for me on the write-write contention case too, but the
> random mixed "all threads do 5% writelocks" test-case seems to suck.
>
> As an exmaple: the current glibc code I have seems to be uniformly and
> boringly middle-of-the-road. It really doesn't seem to be horrible at all.
> That said, the above thing _is_ 2-3x faster for me on that read-read case
> (from single thread up to 20 threads - but just tested on one particular
> machine), so if that is what you want to aim for, it's certainly easy to
> beat.
>
> But for the write case, while I can easily beat it for a low-thread count
> with little contention, my example thing above benchmarks about equal for
> bigger thread counts (caveat: again - I've only tested on one machine,
> which is a single-socket dual-core thing. Caveat emptor).
>
> And the pthreads implementation actually beats my hacked-up one at least
> for the 5% random writelocks case. It's not beating it by a huge amount,
> but it's not in the noise level either (maybe 15%). And I bet the pthreads
> implementation is a hell of a lot better tested ;)
Had a bit of a look though this, seems pretty OK to me. Obviously with
rwlocks it *is* impossible to do non-atomic unlocks, so I can't see
a way to significantly improve your code there.
What you *could* maybe do, to slightly speed up the reader fastpath, at
the expense of the writer fastpath, is to also have the active writer add
4 to the count too, so your unlock can start with a lock xadd -4, count
in order to get the write-intent on the cacheline straight up.
Though that's a pretty tiny "optmisation", and not going to be an amazing
win, even when it does save a state transition on the cacheline...
I'd be more interested to know why this code can't be evolved into a full
rwlock implementation? This is a rather standard (though neat) looking rwlock
-- so my question is what can the patented 64-bit futex locks do that this
can't, or what can they do faster?
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-06-06 1:27 ` Nick Piggin
@ 2008-06-06 3:37 ` Linus Torvalds
2008-06-06 11:53 ` Nick Piggin
0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-06-06 3:37 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, David Howells, Ulrich Drepper,
Linux Kernel Mailing List, Andrew Morton
On Fri, 6 Jun 2008, Nick Piggin wrote:
>
> What you *could* maybe do, to slightly speed up the reader fastpath, at
> the expense of the writer fastpath, is to also have the active writer add
> 4 to the count too, so your unlock can start with a lock xadd -4, count
> in order to get the write-intent on the cacheline straight up.
Yes, nice idea. It avoids the possible unnecessary S->M transition, but
the downside is that it effectively slows down the write unlock by making
it do two atomic ops even for the fastpath. So if I were to _only_ care
about the reader path, I think it would be a great idea, but as it is, the
current non-contended write case is actually pretty close to optimal, and
doing the unconditional xaddl on the unlock path would slow that one down.
> I'd be more interested to know why this code can't be evolved into a full
> rwlock implementation? This is a rather standard (though neat) looking rwlock
> -- so my question is what can the patented 64-bit futex locks do that this
> can't, or what can they do faster?
Quite frankly - and this was my argument the whole time - I do not believe
that a "full" 64-bit implementation can do _anything_ more than this
32-bit one can do. That's the reason I wrote the code. I'm pretty sure
that you can do perfectly well with just 32 bit atomic futex ops (my
rwlocks obviously do use 64-bit cmpxchg's in user space, but not in the
fast-path, and it should work fine on x86-32 too using cmpxchg8b).
In fact, in the second version of my locks, I didn't do futex operations
on the actual lock itself at *all*, and just do futex ops on separate
"event" fields. That actually should have avoided a bug I think I had (but
couldn't really trigger in practice) in the first version, and made
everything look pretty straightforward.
I haven't really worked on them since I write the thing, but I did
consider things like timeouts etc. Timeouts are "hard" to handle because
they mean that you cannot use any kind of trivially incrementing "ticket
locks" with sequence numbers (because we may have to just avoid a sequence
if it times out), so the sequence number approach that we now use for
kernel spinlocks was not an option. I didn't actually *write* the timeout
versions, of course, but given the structure of the locks they really
should be very straightforward.
[ Half-way subtle thing: a writer that times out needs to be very careful
that it doesn't lose a wakeup event, but futexes actually make that part
pretty easy - since FUTEX_WAIT returns whether you got woken up or not,
you can just decide to wake up the next write-waiter if you cannot get
the lock immediately and have to exit due to a timeout. ]
But I really haven't tested my rwlocks very exhaustively, and I did not
verify that they actualyl scale with lots of CPU's, for example. I
literally only have dual-core CPU's in use at home, right now, nothing
fancier. Somebody with dual-socket quads would be a lot better off, and
the more the merrier, of course.
But correctness is even more important, and that can at least be somewhat
"thought" about.
Linus
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-06-06 3:37 ` Linus Torvalds
@ 2008-06-06 11:53 ` Nick Piggin
2008-06-06 15:01 ` Linus Torvalds
0 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2008-06-06 11:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, David Howells, Ulrich Drepper,
Linux Kernel Mailing List, Andrew Morton
On Thu, Jun 05, 2008 at 08:37:19PM -0700, Linus Torvalds wrote:
>
>
> On Fri, 6 Jun 2008, Nick Piggin wrote:
> >
> > What you *could* maybe do, to slightly speed up the reader fastpath, at
> > the expense of the writer fastpath, is to also have the active writer add
> > 4 to the count too, so your unlock can start with a lock xadd -4, count
> > in order to get the write-intent on the cacheline straight up.
>
> Yes, nice idea. It avoids the possible unnecessary S->M transition, but
> the downside is that it effectively slows down the write unlock by making
> it do two atomic ops even for the fastpath. So if I were to _only_ care
> about the reader path, I think it would be a great idea, but as it is, the
> current non-contended write case is actually pretty close to optimal, and
> doing the unconditional xaddl on the unlock path would slow that one down.
Yeah, it is a case of a large slowdown for write for a small speedup
for read (pity the API doesn't have explicit read and write unlocks
-- were they too lazy to type the last bit, or did they expect people
to lose track of whether they had a read or write lock? :P)
Anyway, it's obviously a tradeoff you'd just have to carefully
benchmark in real situations.
> > I'd be more interested to know why this code can't be evolved into a full
> > rwlock implementation? This is a rather standard (though neat) looking rwlock
> > -- so my question is what can the patented 64-bit futex locks do that this
> > can't, or what can they do faster?
>
> Quite frankly - and this was my argument the whole time - I do not believe
> consider things like timeouts etc. Timeouts are "hard" to handle because
> they mean that you cannot use any kind of trivially incrementing "ticket
> locks" with sequence numbers (because we may have to just avoid a sequence
> if it times out), so the sequence number approach that we now use for
> kernel spinlocks was not an option. I didn't actually *write* the timeout
> versions, of course, but given the structure of the locks they really
> should be very straightforward.
>
> [ Half-way subtle thing: a writer that times out needs to be very careful
> that it doesn't lose a wakeup event, but futexes actually make that part
> pretty easy - since FUTEX_WAIT returns whether you got woken up or not,
> you can just decide to wake up the next write-waiter if you cannot get
> the lock immediately and have to exit due to a timeout. ]
>
> But I really haven't tested my rwlocks very exhaustively, and I did not
> verify that they actualyl scale with lots of CPU's, for example. I
> literally only have dual-core CPU's in use at home, right now, nothing
> fancier. Somebody with dual-socket quads would be a lot better off, and
> the more the merrier, of course.
Well... a single lock is only going to be so scalable. I don't see how
it could be done really significantly better? Maybe a small factor of
improvement if you were to concentrate on the contended case (but you
wouldn't want to do that anyway)
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] 64-bit futexes: Intro
2008-06-06 11:53 ` Nick Piggin
@ 2008-06-06 15:01 ` Linus Torvalds
0 siblings, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2008-06-06 15:01 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, David Howells, Ulrich Drepper,
Linux Kernel Mailing List, Andrew Morton
On Fri, 6 Jun 2008, Nick Piggin wrote:
>
> Well... a single lock is only going to be so scalable. I don't see how
> it could be done really significantly better?
My worry is that I did something wrong in the slowpath, and that there is
some thundering-herd-wakeup kind of thing that makes that much slower than
it should be.
The slow path doesn't much matter from the angle of counting individual
cycles, but it still matters very much from a bigger picture. Does it have
bad behavior where we wake up a thousand readers, but then a writer gets
to come in first and all the readers have to go to sleep again?
That's one thing I tried to avoid in the second version (the "Another
approch" commit) where a read-wakeup actually moves the readers from the
pending count to the final count - both to get more fairness (which can be
_bad_ for performance), but also because I think it can avoid pathological
cases (reader starvation and unnecessary futex wakeup).
Linus
^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2008-06-06 15:02 UTC | newest]
Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-31 1:27 [PATCH 0/3] 64-bit futexes: Intro Ulrich Drepper
2008-05-31 2:13 ` Linus Torvalds
2008-05-31 3:14 ` Ulrich Drepper
2008-05-31 3:44 ` Linus Torvalds
2008-05-31 4:04 ` Ulrich Drepper
2008-05-31 4:16 ` Linus Torvalds
2008-05-31 4:23 ` Linus Torvalds
2008-05-31 4:38 ` Ulrich Drepper
2008-05-31 4:58 ` Linus Torvalds
2008-05-31 22:25 ` Linus Torvalds
2008-05-31 22:32 ` Linus Torvalds
2008-06-02 18:54 ` Ingo Molnar
2008-06-02 20:22 ` Linus Torvalds
2008-06-02 23:03 ` Ingo Molnar
2008-06-03 3:24 ` Nick Piggin
2008-06-04 19:57 ` Linus Torvalds
2008-06-04 20:38 ` Linus Torvalds
2008-06-05 1:56 ` Nick Piggin
2008-06-05 1:58 ` Nick Piggin
2008-06-05 3:08 ` Linus Torvalds
2008-06-05 4:29 ` Nick Piggin
2008-06-05 1:45 ` Nick Piggin
2008-06-06 1:27 ` Nick Piggin
2008-06-06 3:37 ` Linus Torvalds
2008-06-06 11:53 ` Nick Piggin
2008-06-06 15:01 ` Linus Torvalds
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.