[PATCH 0/3] 64-bit futexes: Intro

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] 64-bit futexes: Intro
@ 2008-05-31  1:27 Ulrich Drepper
  2008-05-31  2:13 ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Ulrich Drepper @ 2008-05-31  1:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, mtk.manpages, torvalds

This patch series adds support for 64-bit futexes.  The current futexes
only protect 32-bits.  I don't know why, ask the original authors.  It is
unnecessarily limiting, though, especially for 64-bit machines.

To understand the problem let me just say that futexes work by storing
a program-defined state in a variable.  Threads then can wait for the
state of the variable to change.  It all works dandy if the protocol for
using futexes is followed correctly; see

	http://people.redhat.com/drepper/futexes.pdf

For mutexes, semaphores etc this is quite easy.  For mutexes, for instance,
we have as state to protect a flag whether the mutex is locked and whether
there is any thread waiting for the release.  This can be done easily with
the available 32 bits in a futex.

The situation is different for more complex synchronization objects.  The
best example are reader/writer locks.  Here the state consists at least
of

- number of readers which locked the rwlock
- flag whether rwlock is locked for reader, writing, or not at all
- number of readers waiting
- number of writers waiting

I.e., rwlocks are significantly more complicated.  Much of the complication
stems from the requirement to have to types: those which prefer readers and
those which prefer writers.

Without imposing unreasonable limitations on the number of concurrent users
a rwlock can have the state cannot be represented in a single 32-bit variable.
This is why the current rwlock implementation uses an internal lock and
spreads the state out in several variables.

This works, but there is a significant performance penalty to be paid:

- at least two atomic operations to lock and unlock the internal lock
- multiple readers are not really able to run concurrently since they
  might block on the internal lock
- following from the last point, the behaviour of the code is less
  predictable due to scheduling artifacts

The logical solution is to extend the size of the state which can be
protected by allowing 64-bit futexes on architectures which can support
them.  This is not new.  The last time it was brought in

  http://lkml.org/lkml/2007/3/21/65

but nothing happened.  I must admit that I didn't follow thorugh myself.

Anyway, I finally bit the bullet and wrote a patch myself.  It differs
from the old patch:

- FUTEX_64_FLAG is a flag to the existing syscall, not a new syscall.
  Given the changes which are needed the latter would have been overkill
  IMO

- no 64-bit support for the PI variants.  It simply makes no sense.  The
  value of the variables protected by the PI variants is under control of
  the kernel and it's always a TID.  Those are 32-bit values.

As you can see and the first patch the required changes really aren't that
many.  A large part of the patch deals with changing the interface of the
futex_atomic_op_inuser function.  The additional parameter is simply
ignored if 64-bit futexes aren't supported.

The patch also goes to great length to avoid negative impact for
archictectures which do not have 64-bit futexes.  You can see several
if-expressions which are decided at compile-time, optimizing out all the
additional code.

Finally, enabling 64-bit futexes for 64-bit architectures is quite simple.
The second patch enables them for x86-64.  Most of the new lines are simply
copies of the inline asms used for the 32-bit futexes, adapted for 64-bit
operations.

As for 32-bit architectures, they are mostly out of luck.  That's OK, all
the code still works as before, it's just slower.  The exception is perhaps
x86.  x86 has the cmpxchg8b instruction for many years now.  This is why
I added enablement for 64-bit futexes for x86 as well.  It's a bit more
complicated and it requires a new system call.  The reason is obvious: the
existing system call takes 32-bit values for the value parameters.  On
64-bit machines these are in any case 64-bit values, so passing 64-bit
values only requires a change to the function parameter type.  For 32-bit
machines it is not that easy, hence the new system call.  To make things
even more interesting, the required additional parameter pushes us beyond
the 6 parameter limit and the parameters have to be passed down in memory.
Nothing new.

Anyway, is it all worth it?  Well, here's a measurement of a workload
similar for what customers of our are using.  These are raw numbers,
for a reason:

#            Old              New       Gain
threads      Rwlocks          Rwlocks

1      4,991,633,560    3,986,898,840   20.13%
2     14,187,564,600    5,080,538,220   64.19%
3     20,727,416,260    5,291,971,820   74.47%
4     23,079,608,650    6,652,036,830   71.18%
5     23,1766,32,860    6,570,373,500   71.65%
6     21,913,357,010    6,591,716,100   69.92%
7     22,975,750,700    6,597,761,790   71.28%
8     22,349,919,860    6,632,005,730   70.33%
9     24,784,438,890    6,599,062,590   73.37%
10    24,899,243,380    6,493,066,340   73.92%
11    25,358,788,130    6,735,662,240   73.44%
12    19,955,548,890    6,591,059,500   66.97%
13    24,058,349,440    6,709,288,100   72.11%
14    26,002,901,340    6,741,505,130   74.07%
15    19,790,724,570    6,720,973,690   66.04%
16    26,639,730,750    6,558,662,430   75.38%

The test is run on a uni-processor quad core machine.  The task is run with
varying number of threads.  In this example the actual work performed is
trivial.  In other words, the overhead for the locking is quite large.  I
have put the data in a graph:

  http://people.redhat.com/drepper/newrwlock.png

What is obvious is that with the new implementation to uncontended case
is 20% faster.  That alone should be a good reason.  But as can be seen
it get even better with a large number of threads.  It peaks when four
threads are used (== number of cores) but I still show the results for
more threads to point out the better predictability.  These are the
minimum values from ten runs each, no averaging.  The old code, due to
the use of the internal lock, causes a lotter of jitters in the times,
reducing predictability.  In term of performance, the new code is about
four times faster on this workload.  It's certainly not characteristic
but I haven't seen a workload where the new code is slower.

I also haven't seen any case where the additional overhead in the futex
system call implementation is noticeable.

I tested the code extensively on a number of x86-64 machines, from small
dual cores machines to quad socket/quad cores machines.  I'm even using
it on my main workstation right now.

The three patches can be applied individually and everything should
continue to work fine.  The patches apply to Linus' current tree.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-05-31  1:27 [PATCH 0/3] 64-bit futexes: Intro Ulrich Drepper
@ 2008-05-31  2:13 ` Linus Torvalds
  2008-05-31  3:14   ` Ulrich Drepper
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-05-31  2:13 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: linux-kernel, akpm, mtk.manpages

On Fri, 30 May 2008, Ulrich Drepper wrote:
> 
> Without imposing unreasonable limitations on the number of concurrent users
> a rwlock can have the state cannot be represented in a single 32-bit variable.

I call bull on that.

We've had a 32-bit rwsemaphore in the kernel for a *long* time.

I think it's totally stupid for glibc to even *try* to do something like 
this, since almost nobody will have a kernel with 64-bit futex support for 
a long time in the wild anyway. So you need to support a 32-bit semaphore 
in practice, and it's been done before.

The x86 kernel rwsem implementation may limit things to 64k readers (I'm 
not even sure that's true, I'm not going to look at the code again), but 
if I recall correctly that's just because we wanted to use a single "xadd" 
in the hotpath, instead of doign a load a cmpxchg.

I really object to adding another 32/64-bit difference just for something 
like an rwsemaphore. It needs a whole lot of stronger reasons than that.

I suspect the old rwlocks are simply just stupid. You can do those things 
with a single 32-bit locked op. Have you tried
 - 29 bits of reader counts
 - 1 bit of uncontended writer
 - 1 bit of "contention" (ie a mark for requiring fairness)
 - 1 bit for a spinlock so that you can do all the fairness without doing 
   any extra locked ops
or similar?

			Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-05-31  2:13 ` Linus Torvalds
@ 2008-05-31  3:14   ` Ulrich Drepper
  2008-05-31  3:44     ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Ulrich Drepper @ 2008-05-31  3:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, akpm, mtk.manpages

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:
> We've had a 32-bit rwsemaphore in the kernel for a *long* time.

You misread what I wrote.

Semaphores are fine.  The problem are reader/writer locks where we need
three counters and some flags.  It is absolutely not acceptable to limit
the number of threads to 512 to make this fit.

> I think it's totally stupid for glibc to even *try* to do something like 
> this, since almost nobody will have a kernel with 64-bit futex support for 
> a long time in the wild anyway. So you need to support a 32-bit semaphore 
> in practice, and it's been done before.

Again, misread.  This is not functionality which is not available.
Semaphores are *of course* fine with 32 bits.  And there is a
reader/writer lock implementation.  It is just very slow compared to
what is possible.  The transition between systems miss the 64-bit
support and those which have it is completely transparent.  The same
glibc binary will work for both.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkhAwpsACgkQ2ijCOnn/RHS7PACdGvdgI9pv+IcsSPkXPVwFJ5ZF
YNQAn01fm1wEc4EI/fwOVuM57XtCy9Dq
=LcpJ
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-05-31  3:14   ` Ulrich Drepper
@ 2008-05-31  3:44     ` Linus Torvalds
  2008-05-31  4:04       ` Ulrich Drepper
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-05-31  3:44 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: linux-kernel, akpm, mtk.manpages



On Fri, 30 May 2008, Ulrich Drepper wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Linus Torvalds wrote:
> > We've had a 32-bit rwsemaphore in the kernel for a *long* time.
> 
> You misread what I wrote.

No, you do.

The kernel calls reader-writer locks rwsemaphores.

		Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-05-31  3:44     ` Linus Torvalds
@ 2008-05-31  4:04       ` Ulrich Drepper
  2008-05-31  4:16         ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Ulrich Drepper @ 2008-05-31  4:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, akpm, mtk.manpages

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:
> The kernel calls reader-writer locks rwsemaphores.

Well, then you don't understand the complexity of the required interface
together with the performance implications.  Take your proposed allocation:

 - 29 bits of reader counts
 - 1 bit of uncontended writer
 - 1 bit of "contention" (ie a mark for requiring fairness)
 - 1 bit for a spinlock so that you can do all the fairness without
   doing any extra locked ops

Ask yourself this:

- - How do you know when there is no more writer waiting?  You cannot
  reset a "writer waiting" bit after you wake up one writer without
  waking every single thread waiting for the rwlock and letting them
  fight for it

- - How do you handle the difference between reader-preferred rwlocks
  and writer-preferred rwlocks?  In the latter, if a rwlock is locked
  for reading, future readers must be delayed until all writers are
  gone

- - How do you do the accounting for the *timedlock variants?  In the
  case of those a functions, if the threads wake due to a timeout,
  you have the decrement the waiter count.  But you have only one bit...

I don't say you cannot implement rwlocks this way.  Sure, it definitely
is possible.  But this implementation would in the contended case like
10x as slow as the current code because you constantly have to wake up
every single thread.

If you want, I'll give you a libpthread.so with the new code.  Then you
can test your code.  I bet you whatever you want that you cannot achieve
the performance with your puny 32-bit futex without limiting the number
of threads to a ridiculously low number.  The implementation I have
allows for 2^23 (possibly recursively locked) readers, 2^18 readers or
writers waiting.

When I was writing "you need more than 32 bits" I didn't even imagine
that somebody would suggest using such a primitive scheme which cannot
possibly work efficiently in a userlevel implementation.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkhAzlEACgkQ2ijCOnn/RHQ+bgCcDDmhSvbKdboyqa21smzlSt75
zHUAn2rr3NKns4Kb78Woas3NP5hbwkOU
=p6z6
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-05-31  4:04       ` Ulrich Drepper
@ 2008-05-31  4:16         ` Linus Torvalds
  2008-05-31  4:23           ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-05-31  4:16 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: linux-kernel, akpm, mtk.manpages

On Fri, 30 May 2008, Ulrich Drepper wrote:
>
> Ask yourself this:

How about you post your code. 

Because your questions seem to be total and utter crap.

> - How do you know when there is no more writer waiting?  You cannot
>   reset a "writer waiting" bit after you wake up one writer without
>   waking every single thread waiting for the rwlock and letting them
>   fight for it

Sure you can. By looking at the *other*data* that isn't atomic.

So when doing a "write_unlock()" (or whatever you call it - for the kernel 
calls it "up_write()") you can look at the non-atomic write counts to 
decide whether there are others.

Also note how you can use 64-bit atomic ops to do that all in user space, 
without actually requiring 64-bit futex support - as long as the bits that 
matter for waking (like "was there more than one pending writer") fit in 
the one 32-bit word.

> - - How do you handle the difference between reader-preferred rwlocks
>   and writer-preferred rwlocks?  In the latter, if a rwlock is locked
>   for reading, future readers must be delayed until all writers are
>   gone

Again, that's not something that needs to be in the *atomic* part.

It's unquestionable that a rwlock is more than 32 bits, but what I 
question is whether you need all those extra bits to be under futex

> - How do you do the accounting for the *timedlock variants?  In the
>   case of those a functions, if the threads wake due to a timeout,
>   you have the decrement the waiter count.  But you have only one bit...

Uli, you're not even trying any more.

NO, you don't have just one bit. You have as many bits as you want. It's 
just that only 32 of the bits will be relevant for FUTEX_WAIT_OP etc.

			Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-05-31  4:16         ` Linus Torvalds
@ 2008-05-31  4:23           ` Linus Torvalds
  2008-05-31  4:38             ` Ulrich Drepper
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-05-31  4:23 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: linux-kernel, akpm, mtk.manpages



On Fri, 30 May 2008, Linus Torvalds wrote:
> 
> Also note how you can use 64-bit atomic ops to do that all in user space, 
> without actually requiring 64-bit futex support - as long as the bits that 
> matter for waking (like "was there more than one pending writer") fit in 
> the one 32-bit word.

So here's an example of this:

 - make all the readers/writers actually update a 64-bit word (by using 
   cmpxchg8b in user space to actually get the locks)

 - but organize things so that a reader only needs to look at the high 32 
   bits to actually make it's wakeup-decision, and a sleeping writer only 
   needs to look at the low 32 bits. How? Make the low bits of the words 
   contain the "contention status".

Is it possible? I dunno. I personally suspect it is. I also suspect you 
didn't even try.

			Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-05-31  4:23           ` Linus Torvalds
@ 2008-05-31  4:38             ` Ulrich Drepper
  2008-05-31  4:58               ` Linus Torvalds
  2008-05-31 22:25               ` Linus Torvalds
  0 siblings, 2 replies; 26+ messages in thread
From: Ulrich Drepper @ 2008-05-31  4:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, akpm, mtk.manpages

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:
> Is it possible? I dunno.

Once again not without limitations which are too low.  The main problem
is that there is far more information needed on the reader side.  There
are two counters and a number of bits for state information.  Which
means the counters can be at most 14 bits in size.

A counter this low also means I cannot unconditionally increment them
(since there can be more than 2^14 threads sharing one page).  This
means you get in the cache line race I described in the last mail.  You
first have to load a word and then perform a the write operation while
at the same time other threads competing with you for the same cache line.

> I personally suspect it is. I also suspect you 
> didn't even try.

I have to disappoint you.  I thought of that and had to reject it.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkhA1j8ACgkQ2ijCOnn/RHTUXQCgnkUy6PNSuUI4O6BD4abxqqKr
PzMAn2qcvrXGM4qrhOytk1Yie0BsMeVa
=zvxZ
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-05-31  4:38             ` Ulrich Drepper
@ 2008-05-31  4:58               ` Linus Torvalds
  2008-05-31 22:25               ` Linus Torvalds
  1 sibling, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2008-05-31  4:58 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: linux-kernel, akpm, mtk.manpages

On Fri, 30 May 2008, Ulrich Drepper wrote:
> 
> Linus Torvalds wrote:
> > Is it possible? I dunno.
> 
> Once again not without limitations which are too low.  The main problem
> is that there is far more information needed on the reader side.  There
> are two counters and a number of bits for state information.  Which
> means the counters can be at most 14 bits in size.

Uli, you're NOT EVEN READING!

Go back and read it, dammit! How many times have I said that you can even 
have more than 32 bits of data? 

There's nothing wrong with using 64-bit atomics to update the words. All 
I've *ever* questioned is whether you need more than 32 bits for the 
FUTEX part.

For example, let's say that you need 4 bits of "state" information (ie 
"next woken-up one is a reader"). And yes, you need counts for readers and 
writers. Maybe we can make those counts be 30 bits each. 64 bits total 
space. But does the FUTEX_OP things actually need to access all 64 bits?

So imagine that you bave a 64-bit atomic op, and bits 0-1 are the reader 
state bits, with bits 2-31 being the reader count. Bits 32-33 are the 
writer state bits, and bits 34-63 are the writer count.

Now you can use 64-bit atomic ops to update them, but perhaps just use a 
32-bit FUTEX op to actually wait as a reader on the low 32 bits that are 
the reader state bits. See?

THIS is what I've been trying to tell you. It's quite possible that the 
algorithm you already have could work with just moving the bits around so 
that each FUTEX op can be done on just one of the words.

And notice my use of the word "possible". 

The kernel uses a 32-bit count word, but has various other state too that 
it maintains under a separate locking setup. Same basic idea: the 32-bit 
word is "special" (because it's the one that has to have the right format 
for the 'xadd' op that we use for the fast path), but it's not the _only_ 
data in the thing. The other data is also accessed, it's just accessed 
according to different principles.

			Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-05-31  4:38             ` Ulrich Drepper
  2008-05-31  4:58               ` Linus Torvalds
@ 2008-05-31 22:25               ` Linus Torvalds
  2008-05-31 22:32                 ` Linus Torvalds
  2008-06-02 18:54                 ` Ingo Molnar
  1 sibling, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2008-05-31 22:25 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: linux-kernel, akpm, mtk.manpages

On Fri, 30 May 2008, Ulrich Drepper wrote:
> 
> Once again not without limitations which are too low.  The main problem
> is that there is far more information needed on the reader side.  There
> are two counters and a number of bits for state information.  Which
> means the counters can be at most 14 bits in size.

Ok, so I obviously think you're simply wrong, and I've told you so both in 
private after seeing the 64-bit code you propose (and which you want to 
patent), and publicly.

So let's have code talk. I'm going to give real code for the only case 
that really matters, namely the uncontended case.

Do you agree that once you get contention, and you actually have to do a 
FUTEX system call to sort it out, the number of atomic instructions do not 
really matter? Because the system call is going to dominate. Agreed?

So I'm not going to write out the slow contention case, because you have 
already agreed that you can do a 32-bit rwlock _slowly_ with the existing 
32-bit FUTEX support.

IOW, I'm faking it, but I'm making a point. Namely that you can 
efficiently do read-write lock using *only* 32-bit ops, and without any 
real kind of limitations on the number of readers and writers.

So here goes the explanation and the pseudo-code.

 - have two levels of locking: the contended case, and the uncontended case

 - these are two *separate* locks.  I'm going to describe the only case
   that mattes for performance, namely the non-contention one.  We just
   make sure that the non-contention lock "gates" things to the slow case.

 - in other words, here's a pretty optimal fast-case that we write in
   assembly language, and all it does is to handle the non-contention 
   case, with any waiting happening on *another* lock entirely.

 - on that first-level lock, the setup is as follows:
	- the low bit (bit#0) is "writer holds"
	- the next bit (bit#1) is "contended", and all it does is that it 
	  guarantees that when set, everybody will go to the slow-path.
	- bits 2-31 are "reader count"

 - the second-level lock is another 32-bit lock that is *adjacent* in
   memory, and they are 64-bit aligned, so that you can do an atomic 
   initialization and more importantly a 64-bit "it is now no longer
   contended" assignment in one atomic op.  That's going to be relevant
   only for the slow-path, and all it means is that we can basically do
   an atomic "sync" of the two locks even on 32-bit x86 with a single
   "cmpxchg8b" (no other "full" 64-bit operations necessary).

And before I actually give the fast-path code, let me describe the 
requirements for the slow-path code:

 - it needs to work with 32-bit futexes. This code obviously does exist, 
   because you already have something like that in glibc. Performance is 
   not important, the only thing that matters is really "queueing 
   behaviour", ie this code needs to wake things up in the right order.

 - it needs to be able to *notice* when things are no longer contended. 
   IOW, it doesn't need to have a fast-path case per se, but the unlock 
   sequences do need to notice when they are the last unlocker, and when 
   that happens, they need to be able to do that magic "lock cmpxchg8b" 
   that tests-and-clears the contention bit in the primary lock at the 
   same time as it does the secondary lock.

That second constraint is actually fairly obvious when you think about it: 
it's effectively the thing that makes us re-enter the non-contention case 
after the contention has been handled.

Ok, that's the setup. Now, let's look at the fast-path cases.

First, write_lock():

	/*
	 * write_lock() can only succeed as a fast-path if the reader
	 * count is zero and there are no pending or holding writers.
	 * Ergo, the low 32 bits have to be 0, and should be 1 (for
	 * "writer exists, no other pending writer" if it succeeds.
	 *
	 * Note that we don't need to update the high bits if we
	 * succeed (because the writer count is the _pending_ writer
	 * count, not the holding one)
	 */
	eax = 0
	edx = 1
	lock cmpxchgl %edx,(rwlock)
	testl %eax,%eax
	jne slowpath
	ret
  slowpath:
	lock orl $2,(rwlock)
	call slow_write_lock
	ret

And here's read_lock:

	/*
	 * read_lock() can only succeed if we increment the reader
	 * count _and_ there were no pending or holding writers.
	 * So we do a 32-bit xaddl with 4 (incrementing the readers  
	 * by one), and were successful if the old value didn't
	 * have any writer bits set
	 */
	eax = 4         /* Low two bits are writers - don't touch */
	xaddl %eax,(rwlock)
	testl $3,%eax   /* Did we have any holding or pending writers? */
	jne slowpath
	ret
  slowpath:
	lock orl $2,(rwlock)
	lock subl $4,(rwlock)
	call slow_read_lock
	ret

See? Fastpaths are fast - a single atomic op and a conditional. 

Now, the unlock paths - they also need to be fast for the non-contention case:

The write_unlock is just

	/*
	 * If there was no contention, then the low word must
	 * still have the value "1" - this one writer, and
	 * no other pending writers or any readers that tried   
	 * to get the lock.
	 */
	eax = 1
	edx = 0
	cmpxchgl %edx,(rwlock)
	cmpl $1,%eax
	jne slowpath    /* contention - need to wake up a reader or a writer */
	ret
  slowpath:
	lock orl $2,(rwlock)
	lock andl $~1,(rwlock)
	call slow_write_unlock
	ret

and the read_unlock is

	eax = -4
	xaddl %eax,(rwlock)
	testl $3,%eax   	/* Do we have contention? */
	jne slowpath
	ret
  slowpath:
	lock orl $2,(rwlock)
	call slow_read_unlock
	ret

Ok, so notice a couple of things here. The above is not only designed to 
handle the uncontended case with just a single locked op on the lock, but 
it also makes sure that each path basically enters the slow-path with all 
of its fast-path actions *done*, and with any unlocks done. So by the time 
the slow-path is entered, we know that

 (a) bit#1 ("contention") is guaranteed to be set, and it will be _our_ 
     job to clear it once everybody has unlocked
 (b) nobody else will ever succeed on the fast-path and they'll all come 
     to us (including any future unlockers that may need to wake things 
     up)
 (c) anybody who got a fastpath lock will do their unlock action before 
     they then get trapped into the slow path.

Now, (c) above, in particular, means that the slow path should start out 
making sure that all fast-path locks have been released. That's easy 
enough to do: the first thing each slow-path member needs to do is to wait 
until the 32-bit fastpath variable gets the value "2". It *will* happen 
eventually, and once it does, we are guaranteed that no fast-path code 
will ever get the lock again (even if the fast-path read_lock() may 
increment the higher bits and then decrement them again when it notices 
the contention - so it can still change from "2", but nobody will actually 
*acquire* the lock (and obviously, if it has hit "2", that also means 
that all _previous_ lockers will have released their locks, whether they 
were read- or write-lockers).

So each slow-path thing needs to start out that way, but once it has done 
that, it basically knows that the fast-path code is immaterial from a lock 
acquire standpoint.

So now you do your existing slow-path code on the second lock. And at each 
unlock (or failed trylock, for that matter), remember to see if you can do 
an atomic "release both fast-path _and_ slow-path locks" with cmpxchg8b 
(with the fastpath lock word being a 2->0 transition - what the slowpath 
rule for "all done" is will obviously depend on the algorithm in 
question).

See? I'm not guaranteeing that the above is "optimal" in the sense that I 
suspect that there are probably clever ways to make the slow path faster 
(including, I suspect, even for all 32-bit code), but I *am* claiming that 
this is _obviously_ optimal for the non-contention case, and that the 
actual contention case is almost certainly not going to care about a 
couple of extra atomic ops.

Agreed? The above isn't actually all that complicated. And it would mean 
that the exact same algorithm works on both 32-bit an 64-bit CPU's, and on 
both old and new kernels (because you only really need 32-bit futexes).

			Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-05-31 22:25               ` Linus Torvalds
@ 2008-05-31 22:32                 ` Linus Torvalds
  2008-06-02 18:54                 ` Ingo Molnar
  1 sibling, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2008-05-31 22:32 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: linux-kernel, akpm, mtk.manpages

On Sat, 31 May 2008, Linus Torvalds wrote:
>
> 	 * Note that we don't need to update the high bits if we
> 	 * succeed (because the writer count is the _pending_ writer
> 	 * count, not the holding one)

Btw, that comment is bogus, ignore it. It comes from an earlier algorithm 
I tried that also explained the contention case, but got unnecessarily 
complex. Separating out the queuing behaviour from the actual 
non-contended behaviour made things much simpler to explain, so I stopped 
even bothering with a "both" case.

Too much cut-and-paste, IOW.

		Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-05-31 22:25               ` Linus Torvalds
  2008-05-31 22:32                 ` Linus Torvalds
@ 2008-06-02 18:54                 ` Ingo Molnar
  2008-06-02 20:22                   ` Linus Torvalds
  1 sibling, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2008-06-02 18:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ulrich Drepper, linux-kernel, akpm, mtk.manpages

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> IOW, I'm faking it, but I'm making a point. Namely that you can 
> efficiently do read-write lock using *only* 32-bit ops, and without 
> any real kind of limitations on the number of readers and writers.
> 
> So here goes the explanation and the pseudo-code.
> 
>  - have two levels of locking: the contended case, and the uncontended
>    case

i suspect _any_ abstract locking functionality around a data structure 
can be implemented via atomic control over just a single user-space bit.

That bit can be used as a lock and if all access to the state of that 
atomic variable uses it, arbitrary higher-order atomic state transitions 
can be derived from it. The cost would be a bit more instructions in the 
fastpath, but there would still only be a single atomic op (the acquire 
op), as the unlock would be a natural barrier (on x86 at least).

Concurrency (and scheduling) of that lock would still be exactly the 
same as with genuine 64-bit (or even larger) atomic ops, and the 
fastpath would be very close as well.

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-06-02 18:54                 ` Ingo Molnar
@ 2008-06-02 20:22                   ` Linus Torvalds
  2008-06-02 23:03                     ` Ingo Molnar
  2008-06-06  1:27                     ` Nick Piggin
  0 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2008-06-02 20:22 UTC (permalink / raw)
  To: Ingo Molnar, Nick Piggin, David Howells
  Cc: Ulrich Drepper, Linux Kernel Mailing List, Andrew Morton

[ I added Nick and DavidH to the Cc, since they have at least historically 
  shown interest in locking algorithms ]

On Mon, 2 Jun 2008, Ingo Molnar wrote:
> 
> i suspect _any_ abstract locking functionality around a data structure 
> can be implemented via atomic control over just a single user-space bit.

Well, theory is theory, and practice is different.

That's *especially* true when it comes to locking, which is just so 
tightly coupled to the exact details of which atomics are fast on a 
particular architecture.

Also, different people will want to see different performance profiles. 

For example, it makes a *huge* difference whether you are strictly fair or 
not. I can almost guarantee that a 100% fair implementation may be really 
"nice" from a theoretical standpoint, but suck really badly in practice, 
because if you want best performance, then you want the lock to have a 
very strong CPU affinity - and the faster you can do your lock ops, the 
more of a unfairness and CPU affinity they get!

And rwlocks in particular are actually much more interesting in this 
respect, because they not only have that CPU affinity fairness, but also 
the reader-vs-writer fairness. You optimize for one load, and it may give 
you almost perfect performance, but at the cost of sucking at another 
load.

For example, some loads are almost entirely read-read locks, with only 
very occasional write locks for some uncommon config change thing. Do you 
want to optimize for that? Maybe. And yes, you can make those uncontended 
read-read locks go really quickly, but then (especially if you continue to 
let reads through even when writers want to contend), that can slow down 
writers a *lot*, to the point of starvation.

Different CPU's will also show different patterns. 

Anyway, I was busy most of the weekend, but I've now coded up a partial 
actual example implementation. Its' probably buggy. Uli is very correct in 
saying that it's easy to screw up futex'es, but even in the absense of 
futexes, it's just easy to screw up any threaded logic.

But if anybody wants to look at a rough draft that at least limps along 
_partially_, there's

	http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git;a=summary

for you.

And I'll freely admit that 
 (a) the above thing is pretty hacked up. No guarantees as to correctness 
     or anything else.
 (b) I looked _mainly_ at the "all readers" case, and considered that the 
     primary goal. Which explains why it does really well on that 
     particular load.
 (c) However, I refuse to starve writers. In fact, my thing will always 
     prefer a writer to any new readers, on the assumption that the sanest 
     rwlock situation is the "tons of readers, occasional writer".
 (d) it does ok for me on the write-write contention case too, but the 
     random mixed "all threads do 5% writelocks" test-case seems to suck.

As an exmaple: the current glibc code I have seems to be uniformly and 
boringly middle-of-the-road. It really doesn't seem to be horrible at all. 
That said, the above thing _is_ 2-3x faster for me on that read-read case 
(from single thread up to 20 threads - but just tested on one particular 
machine), so if that is what you want to aim for, it's certainly easy to 
beat.

But for the write case, while I can easily beat it for a low-thread count 
with little contention, my example thing above benchmarks about equal for 
bigger thread counts (caveat: again - I've only tested on one machine, 
which is a single-socket dual-core thing. Caveat emptor).

And the pthreads implementation actually beats my hacked-up one at least 
for the 5% random writelocks case. It's not beating it by a huge amount, 
but it's not in the noise level either (maybe 15%). And I bet the pthreads 
implementation is a hell of a lot better tested ;)

> That bit can be used as a lock and if all access to the state of that 
> atomic variable uses it, arbitrary higher-order atomic state transitions 
> can be derived from it. The cost would be a bit more instructions in the 
> fastpath, but there would still only be a single atomic op (the acquire 
> op), as the unlock would be a natural barrier (on x86 at least).

No, "unlocks as a natural barrier" only works for exclusive kernel locks 
(spin_unlock and write_unlock). There we can just do a write to unlock. 
But for anything that wants to handle contention differently than just 
spinning, the unlock path needs to be able to do an atomic "unlock and 
test if I need to do something else", because it may need to wake things 
up.

Feel free to try out my code, and laugh at it. Getting locking right is 
definitely not simple, and I would be really surprised if I got everything 
right. It's meant as a "hey, here's real code people can at least play 
with" for anybody who is interested in this kind of thing. 

		Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-06-02 20:22                   ` Linus Torvalds
@ 2008-06-02 23:03                     ` Ingo Molnar
  2008-06-03  3:24                       ` Nick Piggin
  2008-06-06  1:27                     ` Nick Piggin
  1 sibling, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2008-06-02 23:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, David Howells, Ulrich Drepper,
	Linux Kernel Mailing List, Andrew Morton


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > That bit can be used as a lock and if all access to the state of 
> > that atomic variable uses it, arbitrary higher-order atomic state 
> > transitions can be derived from it. The cost would be a bit more 
> > instructions in the fastpath, but there would still only be a single 
> > atomic op (the acquire op), as the unlock would be a natural barrier 
> > (on x86 at least).
> 
> No, "unlocks as a natural barrier" only works for exclusive kernel 
> locks (spin_unlock and write_unlock). There we can just do a write to 
> unlock. But for anything that wants to handle contention differently 
> than just spinning, the unlock path needs to be able to do an atomic 
> "unlock and test if I need to do something else", because it may need 
> to wake things up.

yeah, indeed. Compared to all the other costs that have to be dealt with 
here, having a second atomic op isnt all that much of an issue either, 
especially on latest hw. An atomic op will probably never be as cheap as 
a non-atomic op, but ~20 cycles is still plenty fast for most practical 
purposes.

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-06-02 23:03                     ` Ingo Molnar
@ 2008-06-03  3:24                       ` Nick Piggin
  2008-06-04 19:57                         ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2008-06-03  3:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, David Howells, Ulrich Drepper,
	Linux Kernel Mailing List, Andrew Morton

On Tue, Jun 03, 2008 at 01:03:17AM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > > That bit can be used as a lock and if all access to the state of 
> > > that atomic variable uses it, arbitrary higher-order atomic state 
> > > transitions can be derived from it. The cost would be a bit more 
> > > instructions in the fastpath, but there would still only be a single 
> > > atomic op (the acquire op), as the unlock would be a natural barrier 
> > > (on x86 at least).
> > 
> > No, "unlocks as a natural barrier" only works for exclusive kernel 
> > locks (spin_unlock and write_unlock). There we can just do a write to 
> > unlock. But for anything that wants to handle contention differently 
> > than just spinning, the unlock path needs to be able to do an atomic 
> > "unlock and test if I need to do something else", because it may need 
> > to wake things up.
> 
> yeah, indeed. Compared to all the other costs that have to be dealt with 
> here, having a second atomic op isnt all that much of an issue either, 
> especially on latest hw. An atomic op will probably never be as cheap as 
> a non-atomic op, but ~20 cycles is still plenty fast for most practical 
> purposes.

I think optimised our unlock_page in a way that it can do a
non-atomic unlock in the fastpath (no waiters) using 2 bits. In
practice it was still atomic but only because other page flags
operations could operate on ->flags at the same time.

I still have to get around to submitting the damn thing upstream,
but maybe if I bring it up here, I get the idea more reviewed :)

Anyway, the algorithm goes like this:

lock_page()
{
  if (test_and_set_bit_lock(PG_locked))
    lock_page_slow();
}

lock_page_slow()
{
  /* slowpath */
again:
  prepare_to_wait();
  set_bit(PG_waiters);
  if (test_and_set_bit_lock(PG_locked)) {
    schedule();
    goto again;
  }
}

wake_page_waiters()
{
  /* slowpath */
  clear_bit(PG_waiters);
  smp_mb__after_clear_bit(); /* order clear_bit store with wake_up_page load */
  wake_up_page(PG_locked);
}

unlock_page()
{
  clear_bit_unlock(PG_locked);
  if (unlikely(test_bit(PG_waiters)))
    wake_page_waiters();
}

We don't require any load/store barrier in the unlock_page fastpath
because the bits are in the same word, so cache coherency gives us a
sequential ordering anyway.

Now you notice the lock page slowpath can set bits without holding the
lock, at first glance you'd think the clear_bit to unlock has to be
atomic too then. But actually if we're careful, we can put them in seperate
parts of the word and use the sub-word operations on x86 to avoid the atomic
requirement. I'm not aware of any architecture in which operations to the
same word could be out of order.

Not only does this avoid all barriers (except acquire/release) in the
fastpaths, but it also avoids the unconditional load of the random
hash page waitqueue cacheline on unlock.

Anything applicable to your problem at hand?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-06-03  3:24                       ` Nick Piggin
@ 2008-06-04 19:57                         ` Linus Torvalds
  2008-06-04 20:38                           ` Linus Torvalds
  2008-06-05  1:45                           ` Nick Piggin
  0 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2008-06-04 19:57 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, David Howells, Ulrich Drepper,
	Linux Kernel Mailing List, Andrew Morton

On Tue, 3 Jun 2008, Nick Piggin wrote:
> 
> I think optimised our unlock_page in a way that it can do a
> non-atomic unlock in the fastpath (no waiters) using 2 bits. In
> practice it was still atomic but only because other page flags
> operations could operate on ->flags at the same time.

I'd be *very* nervous about this.

> We don't require any load/store barrier in the unlock_page fastpath
> because the bits are in the same word, so cache coherency gives us a
> sequential ordering anyway.

Yes and no.

Yes, the bits are int he same word, so cache coherency guarantees a lot.

HOWEVER. If you do the sub-word write using a regular store, you are now 
invoking the _one_ non-coherent part of the x86 memory pipeline: the store 
buffer. Normal stores can (and will) be forwarded to subsequent loads from 
the store buffer, and they are not strongly ordered wrt cache coherency 
while they are buffered.

IOW, on x86, loads are ordered wrt loads, and stores are ordered wrt other 
stores, but loads are *not* ordered wrt other stores in the absense of a 
serializing instruction, and it's exactly because of the write buffer.

So:

> But actually if we're careful, we can put them in seperate parts of the 
> word and use the sub-word operations on x86 to avoid the atomic 
> requirement. I'm not aware of any architecture in which operations to 
> the same word could be out of order.

See above. The above is unsafe, because if you do a regular store to a 
partial word, with no serializing instructions between that and a 
subsequent load of the whole word, the value of the store can be bypassed 
from the store buffer, and the load from the other part of the word can be 
carried out _before_ the store has actually gotten that cacheline 
exclusively!

So when you do

	movb reg,(byteptr)
	movl (byteptr),reg

you may actually get old data in the upper 24 bits, along with new data in 
the lower 8.

I think.

Anyway, be careful. The cacheline itself will always be coherent, but the 
store buffer is not going to be part of the coherency rules, and without 
serialization (or locked ops), you _are_ going to invoke the store buffer!

		Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-06-04 19:57                         ` Linus Torvalds
@ 2008-06-04 20:38                           ` Linus Torvalds
  2008-06-05  1:56                             ` Nick Piggin
  2008-06-05  1:45                           ` Nick Piggin
  1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-06-04 20:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, David Howells, Ulrich Drepper,
	Linux Kernel Mailing List, Andrew Morton

On Wed, 4 Jun 2008, Linus Torvalds wrote:
> 
> So when you do
> 
> 	movb reg,(byteptr)
> 	movl (byteptr),reg
> 
> you may actually get old data in the upper 24 bits, along with new data in 
> the lower 8.

Put another way: the CPU may internally effectively rewrite the two 
instructions as

	movb reg,tmpreg		(tmp = writebuffer)
	movl (byteptr),reg	(do the 32-bit read of the old cached contents)
	movb tmpreg,reg		(writebuffer snoop by reads)
	movb tmpreg,(byteptr)	(writebuffer actually drains into cacheline)

and *if* your algorithm is robust wrt these kinds of rewrites, you're ok. 
But notice how there are two accesses to the cacheline, and how they are 
actually in the "wrong" order, and how the cacheline could have been 
updated by another CPU in between.

Does this actually happen? Yeah, I do believe it does. Is it a deathknell 
for anybody trying to be clever with overlapping reads/writes? No, you can 
still have things like causality rules that guarantee that your algorithm 
is perfectly stable in the face of these kinds of reordering. But it *is* 
one of the few re-orderings that I think is theoretically archtiecturally 
visible.

For example, let's start out with a 32-bit word that contains zero, and 
three CPU's. One CPU does

	cmpxchgl 0->0x01010101,mem

another one does

	cmpxchlg 0x01010101->0x02020202,mem

and the third one does that

	movb $0x03,mem
	movl mem,reg

and after it all completed, you may have 0x02020203 in memory, but "reg" 
on the third CPU contains 0x01010103.

Note how NO OTHER CPU could _possibly_ have seen that value! That value 
never ever existed in any caches. If the final value was 0x02020203, then 
both the cmpxchgl's must have worked, so the cache coherency contents 
*must* have gone from 0 -> 0x01010101 -> 0x02020202 -> 0x02020203 (with 
the movb actually getting the exclusive cache access last).

So the third CPU saw a value for that load that actually *never* existed 
in any cache-line: 0x01010103.  Exactly because the x86 memory ordering 
allows the store buffer contents to be forwarded within a CPU core.

And this is why atomic locked instructions are special. They bypass the 
store buffer (or at least they _act_ as if they do - they likely actually 
use the store buffer, but they flush it and the instruction pipeline 
before the load and wait for it to drain after, and have a lock on the 
cacheline that they take as part o the load, and release as part of the 
store - all to make sure that the cacheline doesn't go away in between and 
that nobody else can see the store buffer contents while this is going 
on).

This is also why there is so much room for improvement in locked 
instruction performance - you don't _have_ to flush things if you just are 
very careful about tracking how and when you use which elements in the 
store buffer, and track the ordering of cache accesses by all of this.

			Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-06-04 20:38                           ` Linus Torvalds
@ 2008-06-05  1:56                             ` Nick Piggin
  2008-06-05  1:58                               ` Nick Piggin
  2008-06-05  3:08                               ` Linus Torvalds
  0 siblings, 2 replies; 26+ messages in thread
From: Nick Piggin @ 2008-06-05  1:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, David Howells, Ulrich Drepper,
	Linux Kernel Mailing List, Andrew Morton

On Wed, Jun 04, 2008 at 01:38:50PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 4 Jun 2008, Linus Torvalds wrote:
> > 
> > So when you do
> > 
> > 	movb reg,(byteptr)
> > 	movl (byteptr),reg
> > 
> > you may actually get old data in the upper 24 bits, along with new data in 
> > the lower 8.
> 
> Put another way: the CPU may internally effectively rewrite the two 
> instructions as
> 
> 	movb reg,tmpreg		(tmp = writebuffer)
> 	movl (byteptr),reg	(do the 32-bit read of the old cached contents)
> 	movb tmpreg,reg		(writebuffer snoop by reads)
> 	movb tmpreg,(byteptr)	(writebuffer actually drains into cacheline)
> 
> and *if* your algorithm is robust wrt these kinds of rewrites, you're ok. 
> But notice how there are two accesses to the cacheline, and how they are 
> actually in the "wrong" order, and how the cacheline could have been 
> updated by another CPU in between.
> 
> Does this actually happen? Yeah, I do believe it does. Is it a deathknell 
> for anybody trying to be clever with overlapping reads/writes? No, you can 
> still have things like causality rules that guarantee that your algorithm 
> is perfectly stable in the face of these kinds of reordering. But it *is* 
> one of the few re-orderings that I think is theoretically archtiecturally 
> visible.
> 
> For example, let's start out with a 32-bit word that contains zero, and 
> three CPU's. One CPU does
> 
> 	cmpxchgl 0->0x01010101,mem
> 
> another one does
> 
> 	cmpxchlg 0x01010101->0x02020202,mem
> 
> and the third one does that
> 
> 	movb $0x03,mem
> 	movl mem,reg
> 
> and after it all completed, you may have 0x02020203 in memory, but "reg" 
> on the third CPU contains 0x01010103.
> 
> Note how NO OTHER CPU could _possibly_ have seen that value! That value 
> never ever existed in any caches. If the final value was 0x02020203, then 
> both the cmpxchgl's must have worked, so the cache coherency contents 
> *must* have gone from 0 -> 0x01010101 -> 0x02020202 -> 0x02020203 (with 
> the movb actually getting the exclusive cache access last).
> 
> So the third CPU saw a value for that load that actually *never* existed 
> in any cache-line: 0x01010103.  Exactly because the x86 memory ordering 
> allows the store buffer contents to be forwarded within a CPU core.
> 
> And this is why atomic locked instructions are special. They bypass the 
> store buffer (or at least they _act_ as if they do - they likely actually 
> use the store buffer, but they flush it and the instruction pipeline 
> before the load and wait for it to drain after, and have a lock on the 
> cacheline that they take as part o the load, and release as part of the 
> store - all to make sure that the cacheline doesn't go away in between and 
> that nobody else can see the store buffer contents while this is going 
> on).
> 
> This is also why there is so much room for improvement in locked 
> instruction performance - you don't _have_ to flush things if you just are 
> very careful about tracking how and when you use which elements in the 
> store buffer, and track the ordering of cache accesses by all of this.

I see what you're saying, but I hadn't really considered it before.
Not that I've attempted to do much of these sub-word operations
without consulting you first (which brings us here) ;).

I'd have thought that for a case like this, you'd simply hit the store
alias logic and store forwarding would stall because it doesn't have
the full data.

I'd like to know for sure.

The other thing that could be possible, and I'd imagine maybe more likely
to be implemented in a real CPU because it should give more imrpovement
(and which does break my algorithm) is just that the load to the cacheline
may get to execute first, then if the cacheline gets evicted and
modified by another CPU before our store completes, we effectively see
store/load reordering again.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-06-05  1:56                             ` Nick Piggin
@ 2008-06-05  1:58                               ` Nick Piggin
  2008-06-05  3:08                               ` Linus Torvalds
  1 sibling, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2008-06-05  1:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, David Howells, Ulrich Drepper,
	Linux Kernel Mailing List, Andrew Morton

On Thu, Jun 05, 2008 at 03:56:34AM +0200, Nick Piggin wrote:
> 
> The other thing that could be possible, and I'd imagine maybe more likely
> to be implemented in a real CPU because it should give more imrpovement
> (and which does break my algorithm) is just that the load to the cacheline
> may get to execute first, then if the cacheline gets evicted and
> modified by another CPU before our store completes, we effectively see
> store/load reordering again.

Well, I should qualify that: it doesn't actually break my algorithm
because you can still implement the unlock without atomic RMWs.
This may require you to have an mfence there, but we can still get
away without atomics (whether that's much cheaper or not, I haven't
measured recently!)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-06-05  1:56                             ` Nick Piggin
  2008-06-05  1:58                               ` Nick Piggin
@ 2008-06-05  3:08                               ` Linus Torvalds
  2008-06-05  4:29                                 ` Nick Piggin
  1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-06-05  3:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, David Howells, Ulrich Drepper,
	Linux Kernel Mailing List, Andrew Morton

On Thu, 5 Jun 2008, Nick Piggin wrote:
> 
> I'd have thought that for a case like this, you'd simply hit the store
> alias logic and store forwarding would stall because it doesn't have
> the full data.

That's _one_ possible implementation. 

Quite frankly, I think it's the less likely one. It's much more likely 
that the cache read access and the store buffer probe happen in parallel 
(this is a really important hotpath for any CPU, but even more so x86 
where there are more of loads and stores that are spills). And then the 
store buffer logic would return the data and a bytemask mask (where the 
mask would be all zeroes for a miss), and the returned value is just the 
appropriate mix of the two.

> I'd like to know for sure.

You'd have to ask somebody very knowledgeable inside Intel and AMD, and it 
is quite likely that different microarchitectures have different 
approaches...

> The other thing that could be possible, and I'd imagine maybe more likely
> to be implemented in a real CPU because it should give more imrpovement
> (and which does break my algorithm) is just that the load to the cacheline
> may get to execute first, then if the cacheline gets evicted and
> modified by another CPU before our store completes, we effectively see
> store/load reordering again.

Oh, absolutely, the perfect algorithm would actually get the right answer 
and notice that the cacheline got evicted, and retried the whole sequence 
such that it is coherent. 

But we do know that Intel expressly documents loads and stores to pass 
each other and documents the fact that the store buffer is there. So I bet 
that this is visible in some micro-architecture, even if it's not 
necessarily visible in _all_ of them.

The recent Intel memory ordering whitepaper makes it very clear that loads 
can pass earlier stores and in particular that the store buffer allows 
intra-processor forwarding to subsequent loads (2.4 in their whitepaper). 
It _could_ be just a "for future CPU's", but quite frankly, I'm 100% sure 
it isn't. The store->load forwarding is such a critical performance issue 
that I can pretty much guarantee that it doesn't always hit the cacheline.

Of course, the partial store forwarding case is not nearly as important, 
and stalling is quite a reasonable implementation approach. I just 
personally suspect that doing the unconditional byte-masking is actually 
_simpler_ to implement than the stall, so..

			Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-06-05  3:08                               ` Linus Torvalds
@ 2008-06-05  4:29                                 ` Nick Piggin
  0 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2008-06-05  4:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, David Howells, Ulrich Drepper,
	Linux Kernel Mailing List, Andrew Morton

On Wed, Jun 04, 2008 at 08:08:37PM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 5 Jun 2008, Nick Piggin wrote:
> > 
> > I'd have thought that for a case like this, you'd simply hit the store
> > alias logic and store forwarding would stall because it doesn't have
> > the full data.
> 
> That's _one_ possible implementation. 
> 
> Quite frankly, I think it's the less likely one. It's much more likely 
> that the cache read access and the store buffer probe happen in parallel 
> (this is a really important hotpath for any CPU, but even more so x86 
> where there are more of loads and stores that are spills). And then the 
> store buffer logic would return the data and a bytemask mask (where the 
> mask would be all zeroes for a miss), and the returned value is just the 
> appropriate mix of the two.
> 
> > I'd like to know for sure.
> 
> You'd have to ask somebody very knowledgeable inside Intel and AMD, and it 
> is quite likely that different microarchitectures have different 
> approaches...

Well, it would just be nice to hear a "no we'll never do that", "we
already do", or "you can't rely on it" ;)

 
> > The other thing that could be possible, and I'd imagine maybe more likely
> > to be implemented in a real CPU because it should give more imrpovement
> > (and which does break my algorithm) is just that the load to the cacheline
> > may get to execute first, then if the cacheline gets evicted and
> > modified by another CPU before our store completes, we effectively see
> > store/load reordering again.
> 
> Oh, absolutely, the perfect algorithm would actually get the right answer 
> and notice that the cacheline got evicted, and retried the whole sequence 
> such that it is coherent. 
> 
> But we do know that Intel expressly documents loads and stores to pass 
> each other and documents the fact that the store buffer is there. So I bet 
> that this is visible in some micro-architecture, even if it's not 
> necessarily visible in _all_ of them.
> 
> The recent Intel memory ordering whitepaper makes it very clear that loads 
> can pass earlier stores and in particular that the store buffer allows 
> intra-processor forwarding to subsequent loads (2.4 in their whitepaper). 
> It _could_ be just a "for future CPU's", but quite frankly, I'm 100% sure 
> it isn't. The store->load forwarding is such a critical performance issue 
> that I can pretty much guarantee that it doesn't always hit the cacheline.

Well I have a simple test case to show loads pass earlier non conflicting
stores in the case that loads do not come from the store buffer (ie.
*inter* processor forwarding).

And store forwarding, by definition means that the load can complete before
the store can possibly be visible to another CPU I'd say. So yes, I'm
sure this does happen too.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-06-04 19:57                         ` Linus Torvalds
  2008-06-04 20:38                           ` Linus Torvalds
@ 2008-06-05  1:45                           ` Nick Piggin
  1 sibling, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2008-06-05  1:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, David Howells, Ulrich Drepper,
	Linux Kernel Mailing List, Andrew Morton

On Wed, Jun 04, 2008 at 12:57:13PM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 3 Jun 2008, Nick Piggin wrote:
> > 
> > I think optimised our unlock_page in a way that it can do a
> > non-atomic unlock in the fastpath (no waiters) using 2 bits. In
> > practice it was still atomic but only because other page flags
> > operations could operate on ->flags at the same time.
> 
> I'd be *very* nervous about this.

Heh ;) Well I'm not actually trying to do it in Linux (yet).

 
> > We don't require any load/store barrier in the unlock_page fastpath
> > because the bits are in the same word, so cache coherency gives us a
> > sequential ordering anyway.
> 
> Yes and no.
> 
> Yes, the bits are int he same word, so cache coherency guarantees a lot.
> 
> HOWEVER. If you do the sub-word write using a regular store, you are now 
> invoking the _one_ non-coherent part of the x86 memory pipeline: the store 
> buffer. Normal stores can (and will) be forwarded to subsequent loads from 
> the store buffer, and they are not strongly ordered wrt cache coherency 
> while they are buffered.
> 
> IOW, on x86, loads are ordered wrt loads, and stores are ordered wrt other 
> stores, but loads are *not* ordered wrt other stores in the absense of a 
> serializing instruction, and it's exactly because of the write buffer.
> 
> So:
> 
> > But actually if we're careful, we can put them in seperate parts of the 
> > word and use the sub-word operations on x86 to avoid the atomic 
> > requirement. I'm not aware of any architecture in which operations to 
> > the same word could be out of order.
> 
> See above. The above is unsafe, because if you do a regular store to a 
> partial word, with no serializing instructions between that and a 
> subsequent load of the whole word, the value of the store can be bypassed 
> from the store buffer, and the load from the other part of the word can be 
> carried out _before_ the store has actually gotten that cacheline 
> exclusively!
> 
> So when you do
> 
> 	movb reg,(byteptr)
> 	movl (byteptr),reg
> 
> you may actually get old data in the upper 24 bits, along with new data in 
> the lower 8.
> 
> I think.

I'd be very surprised if that was the case. But the unlock code needn't
do that anyway. It could do

movb reg,(byteptr)   # clear PG_locked
movb (byteptr+1),reg # load PG_waiters


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-06-02 20:22                   ` Linus Torvalds
  2008-06-02 23:03                     ` Ingo Molnar
@ 2008-06-06  1:27                     ` Nick Piggin
  2008-06-06  3:37                       ` Linus Torvalds
  1 sibling, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2008-06-06  1:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, David Howells, Ulrich Drepper,
	Linux Kernel Mailing List, Andrew Morton

On Mon, Jun 02, 2008 at 01:22:57PM -0700, Linus Torvalds wrote:
> 
> [ I added Nick and DavidH to the Cc, since they have at least historically 
>   shown interest in locking algorithms ]
> 
> On Mon, 2 Jun 2008, Ingo Molnar wrote:
> > 
> > i suspect _any_ abstract locking functionality around a data structure 
> > can be implemented via atomic control over just a single user-space bit.
> 
> Well, theory is theory, and practice is different.
> 
> That's *especially* true when it comes to locking, which is just so 
> tightly coupled to the exact details of which atomics are fast on a 
> particular architecture.
> 
> Also, different people will want to see different performance profiles. 
> 
> For example, it makes a *huge* difference whether you are strictly fair or 
> not. I can almost guarantee that a 100% fair implementation may be really 
> "nice" from a theoretical standpoint, but suck really badly in practice, 
> because if you want best performance, then you want the lock to have a 
> very strong CPU affinity - and the faster you can do your lock ops, the 
> more of a unfairness and CPU affinity they get!
> 
> And rwlocks in particular are actually much more interesting in this 
> respect, because they not only have that CPU affinity fairness, but also 
> the reader-vs-writer fairness. You optimize for one load, and it may give 
> you almost perfect performance, but at the cost of sucking at another 
> load.
> 
> For example, some loads are almost entirely read-read locks, with only 
> very occasional write locks for some uncommon config change thing. Do you 
> want to optimize for that? Maybe. And yes, you can make those uncontended 
> read-read locks go really quickly, but then (especially if you continue to 
> let reads through even when writers want to contend), that can slow down 
> writers a *lot*, to the point of starvation.
> 
> Different CPU's will also show different patterns. 
> 
> Anyway, I was busy most of the weekend, but I've now coded up a partial 
> actual example implementation. Its' probably buggy. Uli is very correct in 
> saying that it's easy to screw up futex'es, but even in the absense of 
> futexes, it's just easy to screw up any threaded logic.
> 
> But if anybody wants to look at a rough draft that at least limps along 
> _partially_, there's
> 
> 	http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git;a=summary
> 
> for you.
> 
> And I'll freely admit that 
>  (a) the above thing is pretty hacked up. No guarantees as to correctness 
>      or anything else.
>  (b) I looked _mainly_ at the "all readers" case, and considered that the 
>      primary goal. Which explains why it does really well on that 
>      particular load.
>  (c) However, I refuse to starve writers. In fact, my thing will always 
>      prefer a writer to any new readers, on the assumption that the sanest 
>      rwlock situation is the "tons of readers, occasional writer".
>  (d) it does ok for me on the write-write contention case too, but the 
>      random mixed "all threads do 5% writelocks" test-case seems to suck.
> 
> As an exmaple: the current glibc code I have seems to be uniformly and 
> boringly middle-of-the-road. It really doesn't seem to be horrible at all. 
> That said, the above thing _is_ 2-3x faster for me on that read-read case 
> (from single thread up to 20 threads - but just tested on one particular 
> machine), so if that is what you want to aim for, it's certainly easy to 
> beat.
> 
> But for the write case, while I can easily beat it for a low-thread count 
> with little contention, my example thing above benchmarks about equal for 
> bigger thread counts (caveat: again - I've only tested on one machine, 
> which is a single-socket dual-core thing. Caveat emptor).
> 
> And the pthreads implementation actually beats my hacked-up one at least 
> for the 5% random writelocks case. It's not beating it by a huge amount, 
> but it's not in the noise level either (maybe 15%). And I bet the pthreads 
> implementation is a hell of a lot better tested ;)

Had a bit of a look though this, seems pretty OK to me. Obviously with
rwlocks it *is*  impossible to do non-atomic unlocks, so I can't see
a way to significantly improve your code there.

What you *could* maybe do, to slightly speed up the reader fastpath, at
the expense of the writer fastpath, is to also have the active writer add
4 to the count too, so your unlock can start with a lock xadd -4, count
in order to get the write-intent on the cacheline straight up.

Though that's a pretty tiny "optmisation", and not going to be an amazing
win, even when it does save a state transition on the cacheline...

I'd be more interested to know why this code can't be evolved into a full
rwlock implementation? This is a rather standard (though neat) looking rwlock
-- so my question is what can the patented 64-bit futex locks do that this
can't, or what can they do faster?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-06-06  1:27                     ` Nick Piggin
@ 2008-06-06  3:37                       ` Linus Torvalds
  2008-06-06 11:53                         ` Nick Piggin
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-06-06  3:37 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, David Howells, Ulrich Drepper,
	Linux Kernel Mailing List, Andrew Morton

On Fri, 6 Jun 2008, Nick Piggin wrote:
> 
> What you *could* maybe do, to slightly speed up the reader fastpath, at
> the expense of the writer fastpath, is to also have the active writer add
> 4 to the count too, so your unlock can start with a lock xadd -4, count
> in order to get the write-intent on the cacheline straight up.

Yes, nice idea. It avoids the possible unnecessary S->M transition, but 
the downside is that it effectively slows down the write unlock by making 
it do two atomic ops even for the fastpath. So if I were to _only_ care 
about the reader path, I think it would be a great idea, but as it is, the 
current non-contended write case is actually pretty close to optimal, and 
doing the unconditional xaddl on the unlock path would slow that one down.

> I'd be more interested to know why this code can't be evolved into a full
> rwlock implementation? This is a rather standard (though neat) looking rwlock
> -- so my question is what can the patented 64-bit futex locks do that this
> can't, or what can they do faster?

Quite frankly - and this was my argument the whole time - I do not believe 
that a "full" 64-bit implementation can do _anything_ more than this 
32-bit one can do.  That's the reason I wrote the code. I'm pretty sure 
that you can do perfectly well with just 32 bit atomic futex ops (my 
rwlocks obviously do use 64-bit cmpxchg's in user space, but not in the 
fast-path, and it should work fine on x86-32 too using cmpxchg8b).

In fact, in the second version of my locks, I didn't do futex operations 
on the actual lock itself at *all*, and just do futex ops on separate 
"event" fields. That actually should have avoided a bug I think I had (but 
couldn't really trigger in practice) in the first version, and made 
everything look pretty straightforward.

I haven't really worked on them since I write the thing, but I did 
consider things like timeouts etc. Timeouts are "hard" to handle because 
they mean that you cannot use any kind of trivially incrementing "ticket 
locks" with sequence numbers (because we may have to just avoid a sequence 
if it times out), so the sequence number approach that we now use for 
kernel spinlocks was not an option. I didn't actually *write* the timeout 
versions, of course, but given the structure of the locks they really 
should be very straightforward.

[ Half-way subtle thing: a writer that times out needs to be very careful 
  that it doesn't lose a wakeup event, but futexes actually make that part 
  pretty easy - since FUTEX_WAIT returns whether you got woken up or not, 
  you can just decide to wake up the next write-waiter if you cannot get 
  the lock immediately and have to exit due to a timeout. ]

But I really haven't tested my rwlocks very exhaustively, and I did not 
verify that they actualyl scale with lots of CPU's, for example.  I 
literally only have dual-core CPU's in use at home, right now, nothing 
fancier. Somebody with dual-socket quads would be a lot better off, and 
the more the merrier, of course.

But correctness is even more important, and that can at least be somewhat 
"thought" about.

		Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-06-06  3:37                       ` Linus Torvalds
@ 2008-06-06 11:53                         ` Nick Piggin
  2008-06-06 15:01                           ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2008-06-06 11:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, David Howells, Ulrich Drepper,
	Linux Kernel Mailing List, Andrew Morton

On Thu, Jun 05, 2008 at 08:37:19PM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 6 Jun 2008, Nick Piggin wrote:
> > 
> > What you *could* maybe do, to slightly speed up the reader fastpath, at
> > the expense of the writer fastpath, is to also have the active writer add
> > 4 to the count too, so your unlock can start with a lock xadd -4, count
> > in order to get the write-intent on the cacheline straight up.
> 
> Yes, nice idea. It avoids the possible unnecessary S->M transition, but 
> the downside is that it effectively slows down the write unlock by making 
> it do two atomic ops even for the fastpath. So if I were to _only_ care 
> about the reader path, I think it would be a great idea, but as it is, the 
> current non-contended write case is actually pretty close to optimal, and 
> doing the unconditional xaddl on the unlock path would slow that one down.

Yeah, it is a case of a large slowdown for write for a small speedup
for read (pity the API doesn't have explicit read and write unlocks
-- were they too lazy to type the last bit, or did they expect people
to lose track of whether they had a read or write lock? :P)

Anyway, it's obviously a tradeoff you'd just have to carefully
benchmark in real situations.

 
> > I'd be more interested to know why this code can't be evolved into a full
> > rwlock implementation? This is a rather standard (though neat) looking rwlock
> > -- so my question is what can the patented 64-bit futex locks do that this
> > can't, or what can they do faster?
> 
> Quite frankly - and this was my argument the whole time - I do not believe 

> consider things like timeouts etc. Timeouts are "hard" to handle because 
> they mean that you cannot use any kind of trivially incrementing "ticket 
> locks" with sequence numbers (because we may have to just avoid a sequence 
> if it times out), so the sequence number approach that we now use for 
> kernel spinlocks was not an option. I didn't actually *write* the timeout 
> versions, of course, but given the structure of the locks they really 
> should be very straightforward.
> 
> [ Half-way subtle thing: a writer that times out needs to be very careful 
>   that it doesn't lose a wakeup event, but futexes actually make that part 
>   pretty easy - since FUTEX_WAIT returns whether you got woken up or not, 
>   you can just decide to wake up the next write-waiter if you cannot get 
>   the lock immediately and have to exit due to a timeout. ]
> 
> But I really haven't tested my rwlocks very exhaustively, and I did not 
> verify that they actualyl scale with lots of CPU's, for example.  I 
> literally only have dual-core CPU's in use at home, right now, nothing 
> fancier. Somebody with dual-socket quads would be a lot better off, and 
> the more the merrier, of course.

Well... a single lock is only going to be so scalable. I don't see how
it could be done really significantly better? Maybe a small factor of
improvement if you were to concentrate on the contended case (but you
wouldn't want to do that anyway)




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] 64-bit futexes: Intro
  2008-06-06 11:53                         ` Nick Piggin
@ 2008-06-06 15:01                           ` Linus Torvalds
  0 siblings, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2008-06-06 15:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, David Howells, Ulrich Drepper,
	Linux Kernel Mailing List, Andrew Morton

On Fri, 6 Jun 2008, Nick Piggin wrote:
> 
> Well... a single lock is only going to be so scalable. I don't see how
> it could be done really significantly better?

My worry is that I did something wrong in the slowpath, and that there is 
some thundering-herd-wakeup kind of thing that makes that much slower than 
it should be.

The slow path doesn't much matter from the angle of counting individual 
cycles, but it still matters very much from a bigger picture. Does it have 
bad behavior where we wake up a thousand readers, but then a writer gets 
to come in first and all the readers have to go to sleep again?

That's one thing I tried to avoid in the second version (the "Another 
approch" commit) where a read-wakeup actually moves the readers from the 
pending count to the final count - both to get more fairness (which can be 
_bad_ for performance), but also because I think it can avoid pathological 
cases (reader starvation and unnecessary futex wakeup).

		Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2008-06-06 15:02 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-31  1:27 [PATCH 0/3] 64-bit futexes: Intro Ulrich Drepper
2008-05-31  2:13 ` Linus Torvalds
2008-05-31  3:14   ` Ulrich Drepper
2008-05-31  3:44     ` Linus Torvalds
2008-05-31  4:04       ` Ulrich Drepper
2008-05-31  4:16         ` Linus Torvalds
2008-05-31  4:23           ` Linus Torvalds
2008-05-31  4:38             ` Ulrich Drepper
2008-05-31  4:58               ` Linus Torvalds
2008-05-31 22:25               ` Linus Torvalds
2008-05-31 22:32                 ` Linus Torvalds
2008-06-02 18:54                 ` Ingo Molnar
2008-06-02 20:22                   ` Linus Torvalds
2008-06-02 23:03                     ` Ingo Molnar
2008-06-03  3:24                       ` Nick Piggin
2008-06-04 19:57                         ` Linus Torvalds
2008-06-04 20:38                           ` Linus Torvalds
2008-06-05  1:56                             ` Nick Piggin
2008-06-05  1:58                               ` Nick Piggin
2008-06-05  3:08                               ` Linus Torvalds
2008-06-05  4:29                                 ` Nick Piggin
2008-06-05  1:45                           ` Nick Piggin
2008-06-06  1:27                     ` Nick Piggin
2008-06-06  3:37                       ` Linus Torvalds
2008-06-06 11:53                         ` Nick Piggin
2008-06-06 15:01                           ` Linus Torvalds

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.