[patch 0/6] lightweight robust futexes: -V3

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [patch 0/6] lightweight robust futexes: -V3
@ 2006-02-16  9:41 Ingo Molnar
  2006-02-16 16:33 ` Daniel Walker
  2006-02-16 21:23 ` Paul Jackson
  0 siblings, 2 replies; 30+ messages in thread
From: Ingo Molnar @ 2006-02-16  9:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ulrich Drepper, Thomas Gleixner, Arjan van de Ven, Andrew Morton

This is release -V3 of the "lightweight robust futexes" patchset. The 
patchset can also be downloaded from:

  http://redhat.com/~mingo/lightweight-robust-futexes/

Changes since -V2:

Ulrich Drepper ran the code through more glibc testcases, which 
unearthed a couple of bugs:

 - fixed bug in the i386 and x86_64 assembly code (Ulrich Drepper)

 - fixed bug in the list walking futex-wakeups (found by Ulrich Drepper)

 - race fix: do not bail out in the list walk when the list_op_pending 
   pointer cannot be followed by the kernel - another userspace thread 
   may have already destroyed the mutex (and unmapped it), before this 
   thread had a chance to clear the field.

 - cleanup: renamed list_add_pending to list_op_pending. (the field is 
   used for list removals too)

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-16  9:41 [patch 0/6] lightweight robust futexes: -V3 Ingo Molnar
@ 2006-02-16 16:33 ` Daniel Walker
  2006-02-16 17:24   ` Ingo Molnar
  2006-02-16 21:23 ` Paul Jackson
  1 sibling, 1 reply; 30+ messages in thread
From: Daniel Walker @ 2006-02-16 16:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Ulrich Drepper, Thomas Gleixner, Arjan van de Ven,
	Andrew Morton


Another thing I noticed was that futex_offset on the surface looks like
a malicious users dream variable .. I didn't notice security addressed
at all in your initial write up . I was told it was a big topic at last
years OLS ..  In your write up you did say you corrupted the
robust_list , but did you corrupt the offset? Is this even a concern?

Daniel


On Thu, 2006-02-16 at 10:41 +0100, Ingo Molnar wrote:
> This is release -V3 of the "lightweight robust futexes" patchset. The 
> patchset can also be downloaded from:
> 
>   http://redhat.com/~mingo/lightweight-robust-futexes/
> 
> Changes since -V2:
> 
> Ulrich Drepper ran the code through more glibc testcases, which 
> unearthed a couple of bugs:
> 
>  - fixed bug in the i386 and x86_64 assembly code (Ulrich Drepper)
> 
>  - fixed bug in the list walking futex-wakeups (found by Ulrich Drepper)
> 
>  - race fix: do not bail out in the list walk when the list_op_pending 
>    pointer cannot be followed by the kernel - another userspace thread 
>    may have already destroyed the mutex (and unmapped it), before this 
>    thread had a chance to clear the field.
> 
>  - cleanup: renamed list_add_pending to list_op_pending. (the field is 
>    used for list removals too)
> 
> 	Ingo
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-16 16:33 ` Daniel Walker
@ 2006-02-16 17:24   ` Ingo Molnar
  2006-02-16 17:34     ` Daniel Walker
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2006-02-16 17:24 UTC (permalink / raw)
  To: Daniel Walker
  Cc: linux-kernel, Ulrich Drepper, Thomas Gleixner, Arjan van de Ven,
	Andrew Morton


* Daniel Walker <dwalker@mvista.com> wrote:

> Another thing I noticed was that futex_offset on the surface looks 
> like a malicious users dream variable .. [...]

i have no idea what you mean by that - could you explain whatever threat 
you have in mind, in more detail?

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-16 17:24   ` Ingo Molnar
@ 2006-02-16 17:34     ` Daniel Walker
  2006-02-16 19:06       ` [patch 0/6] lightweight robust futexes: -V3 - Why in userspace? Esben Nielsen
                         ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Daniel Walker @ 2006-02-16 17:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Ulrich Drepper, Thomas Gleixner, Arjan van de Ven,
	Andrew Morton

On Thu, 2006-02-16 at 18:24 +0100, Ingo Molnar wrote:
> * Daniel Walker <dwalker@mvista.com> wrote:
> 
> > Another thing I noticed was that futex_offset on the surface looks 
> > like a malicious users dream variable .. [...]
> 
> i have no idea what you mean by that - could you explain whatever threat 
> you have in mind, in more detail?

	As I said, "on the surface" you could manipulate the futex_offset to
access memory unrelated to the futex structure . That's all I'm
referring too .. 

Daniel


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3 - Why in userspace?
  2006-02-16 17:34     ` Daniel Walker
@ 2006-02-16 19:06       ` Esben Nielsen
  2006-02-16 19:34         ` Arjan van de Ven
  2006-02-16 20:23       ` [patch 0/6] lightweight robust futexes: -V3 Ingo Molnar
  2006-02-16 20:47       ` Paul Jackson
  2 siblings, 1 reply; 30+ messages in thread
From: Esben Nielsen @ 2006-02-16 19:06 UTC (permalink / raw)
  To: Daniel Walker
  Cc: Ingo Molnar, linux-kernel, Ulrich Drepper, Thomas Gleixner,
	Arjan van de Ven, Andrew Morton

I just jump into a thread somewhere to ask my question :-)

Why does the list have to be in userspace?

As I see it there can only be a problem when some thread has done
FUTEX_WAIT and is blocked. If no task is blocked (or on it's way to
being blocked) there is no problem. 

The solution I could imagine was the FUTEX_WAIT operation adds the
waiting task to a list of waiters attached to the mutex owner's task_t
(which is known by it's pid in the userspace flag) just before calling
schedule(). This list needs to be protected by a spinlock, ofcourse.
When a task dies it can wake up the waiters on it's list without relying
on the userspace.

What race conditions have I missed?

Esben

On Thu, 16 Feb 2006, Daniel Walker wrote:

> On Thu, 2006-02-16 at 18:24 +0100, Ingo Molnar wrote:
> > * Daniel Walker <dwalker@mvista.com> wrote:
> > 
> > > Another thing I noticed was that futex_offset on the surface looks 
> > > like a malicious users dream variable .. [...]
> > 
> > i have no idea what you mean by that - could you explain whatever threat 
> > you have in mind, in more detail?
> 
> 	As I said, "on the surface" you could manipulate the futex_offset to
> access memory unrelated to the futex structure . That's all I'm
> referring too .. 
> 
> Daniel
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3 - Why in userspace?
  2006-02-16 19:06       ` [patch 0/6] lightweight robust futexes: -V3 - Why in userspace? Esben Nielsen
@ 2006-02-16 19:34         ` Arjan van de Ven
  2006-02-16 20:04           ` Esben Nielsen
  0 siblings, 1 reply; 30+ messages in thread
From: Arjan van de Ven @ 2006-02-16 19:34 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Daniel Walker, Ingo Molnar, linux-kernel, Ulrich Drepper,
	Thomas Gleixner, Andrew Morton

On Thu, 2006-02-16 at 20:06 +0100, Esben Nielsen wrote:
> I just jump into a thread somewhere to ask my question :-)
> 
> Why does the list have to be in userspace?

because it's faster ;)




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3 - Why in userspace?
  2006-02-16 19:34         ` Arjan van de Ven
@ 2006-02-16 20:04           ` Esben Nielsen
  2006-02-16 20:17             ` Esben Nielsen
                               ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Esben Nielsen @ 2006-02-16 20:04 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Daniel Walker, Ingo Molnar, linux-kernel, Ulrich Drepper,
	Thomas Gleixner, Andrew Morton

On Thu, 16 Feb 2006, Arjan van de Ven wrote:

> On Thu, 2006-02-16 at 20:06 +0100, Esben Nielsen wrote:
> > I just jump into a thread somewhere to ask my question :-)
> > 
> > Why does the list have to be in userspace?
> 
> because it's faster ;)
> 
>
Faster??? 
As I see it, extra manipulations have to be done even in the non-congested
case: Every time the lock is taken the locking thread has to add the lock
to a the list, and reversely remove the lock from the list. I.e.
instructions are _added_ to the fast path where you stay purely in
userspace.

I am ofcourse comparing to a solution where you do a syscall on everytime
you do a lock. What I am asking about is whethere it wouldn't be
enough to maintain the list at the FUTEX_WAIT/FUTEX_WAKE operation - i.e.
the slow path where you have to go into the kernel.

Esben



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3 - Why in userspace?
  2006-02-16 20:04           ` Esben Nielsen
@ 2006-02-16 20:17             ` Esben Nielsen
  2006-02-16 20:23             ` Christopher Friesen
  2006-02-16 20:36             ` Ingo Molnar
  2 siblings, 0 replies; 30+ messages in thread
From: Esben Nielsen @ 2006-02-16 20:17 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Daniel Walker, Ingo Molnar, linux-kernel, Ulrich Drepper,
	Thomas Gleixner, Andrew Morton



On Thu, 16 Feb 2006, Esben Nielsen wrote:

> On Thu, 16 Feb 2006, Arjan van de Ven wrote:
> 
> > On Thu, 2006-02-16 at 20:06 +0100, Esben Nielsen wrote:
> > > I just jump into a thread somewhere to ask my question :-)
> > > 
> > > Why does the list have to be in userspace?
> > 
> > because it's faster ;)
> > 
> >
> Faster??? 
> As I see it, extra manipulations have to be done even in the non-congested
> case: Every time the lock is taken the locking thread has to add the lock
> to a the list, and reversely remove the lock from the list. I.e.
> instructions are _added_ to the fast path where you stay purely in
> userspace.
> 
> I am ofcourse comparing to a solution where you do a syscall on everytime
Correction:
"I am ofcourse NOT comparing"
...
> you do a lock. What I am asking about is whethere it wouldn't be
> enough to maintain the list at the FUTEX_WAIT/FUTEX_WAKE operation - i.e.
> the slow path where you have to go into the kernel.
> 
> Esben
> 
Esben


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3 - Why in userspace?
  2006-02-16 20:04           ` Esben Nielsen
  2006-02-16 20:17             ` Esben Nielsen
@ 2006-02-16 20:23             ` Christopher Friesen
  2006-02-16 20:36             ` Ingo Molnar
  2 siblings, 0 replies; 30+ messages in thread
From: Christopher Friesen @ 2006-02-16 20:23 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Arjan van de Ven, Daniel Walker, Ingo Molnar, linux-kernel,
	Ulrich Drepper, Thomas Gleixner, Andrew Morton

Esben Nielsen wrote:
> On Thu, 16 Feb 2006, Arjan van de Ven wrote:
> 
> 
>>On Thu, 2006-02-16 at 20:06 +0100, Esben Nielsen wrote:

>>>Why does the list have to be in userspace?
>>
>>because it's faster ;)

> Faster??? 
> As I see it, extra manipulations have to be done even in the non-congested
> case: Every time the lock is taken the locking thread has to add the lock
> to a the list, and reversely remove the lock from the list. I.e.
> instructions are _added_ to the fast path where you stay purely in
> userspace.
> 
> I am ofcourse comparing to a solution where you do a syscall on everytime
> you do a lock.


The whole *point* of futexes is that on uncontested operations you don't 
have to do a syscall.  Thus, if you can avoid taking a syscall while 
still getting reliability, you'll be faster.

Dropping to kernelspace isn't free.

Chris


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-16 17:34     ` Daniel Walker
  2006-02-16 19:06       ` [patch 0/6] lightweight robust futexes: -V3 - Why in userspace? Esben Nielsen
@ 2006-02-16 20:23       ` Ingo Molnar
  2006-02-16 20:54         ` Daniel Walker
  2006-02-16 20:47       ` Paul Jackson
  2 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2006-02-16 20:23 UTC (permalink / raw)
  To: Daniel Walker
  Cc: linux-kernel, Ulrich Drepper, Thomas Gleixner, Arjan van de Ven,
	Andrew Morton


* Daniel Walker <dwalker@mvista.com> wrote:

> On Thu, 2006-02-16 at 18:24 +0100, Ingo Molnar wrote:
> > * Daniel Walker <dwalker@mvista.com> wrote:
> > 
> > > Another thing I noticed was that futex_offset on the surface looks 
> > > like a malicious users dream variable .. [...]
> > 
> > i have no idea what you mean by that - could you explain whatever threat 
> > you have in mind, in more detail?
> 
> 	As I said, "on the surface" you could manipulate the 
> futex_offset to access memory unrelated to the futex structure . 
> That's all I'm referring too ..

and? You can 'manipulate' arbitrary userspace memory, be that used by 
the kernel or not, and you can do a sys_futex(FUTEX_WAKE) on any 
arbitrary userspace memory address too (this is a core property of 
futexes). You must have meant something specific when you said "on the 
surface looks like a malicious users dream variable". In other words: 
please move your statement out of innuendo by backing it up with 
specifics (or by retracting it) - right now it's hanging in the air :)

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3 - Why in userspace?
  2006-02-16 20:04           ` Esben Nielsen
  2006-02-16 20:17             ` Esben Nielsen
  2006-02-16 20:23             ` Christopher Friesen
@ 2006-02-16 20:36             ` Ingo Molnar
  2006-02-16 22:32               ` Esben Nielsen
  2 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2006-02-16 20:36 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Arjan van de Ven, Daniel Walker, linux-kernel, Ulrich Drepper,
	Thomas Gleixner, Andrew Morton

* Esben Nielsen <simlo@phys.au.dk> wrote:

> > On Thu, 2006-02-16 at 20:06 +0100, Esben Nielsen wrote:
> > > I just jump into a thread somewhere to ask my question :-)
> > > 
> > > Why does the list have to be in userspace?
> > 
> > because it's faster ;)
> > 
> >
> Faster??? 

yes, it's faster.

> As I see it, extra manipulations have to be done even in the non-congested
> case: Every time the lock is taken the locking thread has to add the lock
> to a the list, and reversely remove the lock from the list. I.e.
> instructions are _added_ to the fast path where you stay purely in
> userspace.

Note that glibc was already doing these list ops for the current 
(limited, userspace-only) implementation of robust mutexes, so moving 
the cleanup to kernel-space has only a small effect on the userspace 
fastpath.

even considering the list ops, they are 2-3 (non-LOCK-ed) instructions 
of memory values that are already cached => it's almost for free. Ulrich 
(who is a kind of person who writes glibc code in assembly whenever he 
can excuse it with a performance argument) would be pretty upset if it 
wasnt cheap :-)

> I am ofcourse comparing to a solution where you do a syscall on 
> everytime you do a lock. What I am asking about is whethere it 
> wouldn't be enough to maintain the list at the FUTEX_WAIT/FUTEX_WAKE 
> operation - i.e. the slow path where you have to go into the kernel.

no, that's not enough at all: we need to be able to clean up after 
futexes even if the kernel was _never involved_ with them. The pure 
userspace futex fastpath still can keep a lock stuck! In fact that is 
the common-case.

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-16 17:34     ` Daniel Walker
  2006-02-16 19:06       ` [patch 0/6] lightweight robust futexes: -V3 - Why in userspace? Esben Nielsen
  2006-02-16 20:23       ` [patch 0/6] lightweight robust futexes: -V3 Ingo Molnar
@ 2006-02-16 20:47       ` Paul Jackson
  2006-02-16 21:35         ` Ingo Molnar
  2 siblings, 1 reply; 30+ messages in thread
From: Paul Jackson @ 2006-02-16 20:47 UTC (permalink / raw)
  To: Daniel Walker; +Cc: mingo, linux-kernel, drepper, tglx, arjan, akpm

Daniel wrote:
> "on the surface" you could manipulate the futex_offset to
> access memory unrelated to the futex structure .

If a piece of malicious code has wormed its way far enough into my
application to be manipulating this list, then I don't think that code
will gain any further advantage by manpulating this list.  I think my
application is already powned.

That malicious code would have no need to have the kernel futext handling
code do its dirty work indirectly via manipulations of this list.  It can
just do the dirty work directly.

All Ingo needs to insure is that the kernel will assume no more
priviledge when reading/writing this list than the current task had,
from user space, reading/writing this list.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-16 20:23       ` [patch 0/6] lightweight robust futexes: -V3 Ingo Molnar
@ 2006-02-16 20:54         ` Daniel Walker
  2006-02-16 21:26           ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Walker @ 2006-02-16 20:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Ulrich Drepper, Thomas Gleixner, Arjan van de Ven,
	Andrew Morton


On Thu, 16 Feb 2006, Ingo Molnar wrote:
> and? You can 'manipulate' arbitrary userspace memory, be that used by
> the kernel or not, and you can do a sys_futex(FUTEX_WAKE) on any
> arbitrary userspace memory address too (this is a core property of
> futexes). You must have meant something specific when you said "on the
> surface looks like a malicious users dream variable". In other words:
> please move your statement out of innuendo by backing it up with
> specifics (or by retracting it) - right now it's hanging in the air :)


 	Sorry I didn't mean to leave something hanging out there. I was 
just making an observation. The 'dream variable' comment was maybe a little 
to much and I'll gladly retract that .. I'll replace it with , I think the 
code needs more review in that area ..

Daniel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-16  9:41 [patch 0/6] lightweight robust futexes: -V3 Ingo Molnar
  2006-02-16 16:33 ` Daniel Walker
@ 2006-02-16 21:23 ` Paul Jackson
  2006-02-16 21:50   ` Ingo Molnar
  1 sibling, 1 reply; 30+ messages in thread
From: Paul Jackson @ 2006-02-16 21:23 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, drepper, tglx, arjan, akpm

Nice stuff ...

I wonder if some of the initial questions about whether gcc would be
forcing something on the kernel, and whether it was unsafe for the
kernel to be walking a user list, are distracting from a more
interesting (in my view) question.

One can view this as just another sort of "interesting" system call,
where user code puts some data in various register and memory
locations, and then ends up by some predictable path in kernel code
which is acting on the request encoded in that data.

As always with system calls:
 1) the kernel can't trust the user data any further than the user
    could have thrown it, and
 2) the interface needs a robust ABI and one or more language API's,
    which will stand the test of time, over various architectures
    and 32-64 emulations.

>From what I could see glancing at the code and comments, Ingo has (1)
covered easily enough.

Would it make sense to have a language independent specification of
this interface, providing a detailed ABI, suitably generalized to cover
the various big endian, little endian, 32 and 64 and cross environments
that Linux normally supports?

I have in mind something that a competent assembly language coder could
write to, directly, when coding user access to this facility?  Or some
other language or library implementor, besides C and glibc, could
develop to?

The biggest problem that I find in new and interesting ways for the
kernel to interact with user space is not thinking carefully through
and documenting in obscene detail the exact interface (this byte here
means this, that little endian quad there means thus, ...) for all
archs and emulations of interest.  This tends to result in some corner
cases that have warts which can never be fixed, in order to maintain
compatibility.

This is sort of like specifying the over the wire protocols the
internet, where each byte is spelled out, avoiding any assumption
of what sort of computing device is on the other end.  Well, not
quite that bad.  I guess we can assume the user code is running
on the same arch as the kernel, give or take possible word size
and endian emulations ... though if performance of this even from
within machine architecture emulators was a priority, even that
assumption is perhaps not desirable.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-16 20:54         ` Daniel Walker
@ 2006-02-16 21:26           ` Ingo Molnar
  2006-02-16 21:50             ` Christopher Friesen
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2006-02-16 21:26 UTC (permalink / raw)
  To: Daniel Walker
  Cc: linux-kernel, Ulrich Drepper, Thomas Gleixner, Arjan van de Ven,
	Andrew Morton

* Daniel Walker <dwalker@mvista.com> wrote:

> 
> On Thu, 16 Feb 2006, Ingo Molnar wrote:
> >and? You can 'manipulate' arbitrary userspace memory, be that used by
> >the kernel or not, and you can do a sys_futex(FUTEX_WAKE) on any
> >arbitrary userspace memory address too (this is a core property of
> >futexes). You must have meant something specific when you said "on the
> >surface looks like a malicious users dream variable". In other words:
> >please move your statement out of innuendo by backing it up with
> >specifics (or by retracting it) - right now it's hanging in the air :)
> 
> 
> 	Sorry I didn't mean to leave something hanging out there. I was 
> just making an observation. The 'dream variable' comment was maybe a 
> little to much and I'll gladly retract that .. I'll replace it with , 
> I think the code needs more review in that area ..

basically, ->futex_offset is not blindly trusted by the kernel either: 
it's simply used to calculate a "userspace pointer" value, which it then 
uses in a (secure) get_user() access, to do a FUTEX_WAKEUP. [Note that 
FUTEX_WAKEUP is already done at do_exit() time via the ->clear_child_tid 
userspace pointer.] All in one: this is totally safe.

The purpose of ->futex_offset is to not hardcode glibc's data structure 
layout into the kernel. Since 'clean up after locks' is a relatively 
rare operation, it was the prudent thing to do.

We could also have registered the futex_offset in the kernel itself, but 
I didnt do it that way because that would add another word to 
task_struct (for the sake of an operation that is rare), and it would 
also make the sys_set_robust_list() operation a bit more expensive. So I 
minimized the API to only take a single userspace pointer, which pointer 
points to the robust_list_head structure which contains all data to 
continue. That data is never trusted and is handled very carefully.

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-16 20:47       ` Paul Jackson
@ 2006-02-16 21:35         ` Ingo Molnar
  0 siblings, 0 replies; 30+ messages in thread
From: Ingo Molnar @ 2006-02-16 21:35 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Daniel Walker, linux-kernel, drepper, tglx, arjan, akpm

* Paul Jackson <pj@sgi.com> wrote:

> That malicious code would have no need to have the kernel futext 
> handling code do its dirty work indirectly via manipulations of this 
> list.  It can just do the dirty work directly.
> 
> All Ingo needs to insure is that the kernel will assume no more 
> priviledge when reading/writing this list than the current task had, 
> from user space, reading/writing this list.

Correct, this is precisely what happens.

Furthermore, the new exit-time futex code within the kernel will do only 
one, very limited thing with userspace memory: it will atomically set 
bit 30 of a word at a userspace address (if the word is accessible to 
and writable by userspace), if and only if that word is equal to 
current->pid. This is really not the sort of memory writing capability 
attackers are looking for :-)

Btw., we already have a similar mechanism in the kernel (and had for 
years): the current->clear_child_tid pointer will be overwritten with 0 
by the kernel at do_exit() time, and causes a futex wakeup. See 
kernel/fork.c:mm_release():

        if (tsk->clear_child_tid && atomic_read(&mm->mm_users) > 1) {
                u32 __user * tidptr = tsk->clear_child_tid;
                tsk->clear_child_tid = NULL;

                /*
                 * We don't check the error code - if userspace has
                 * not set up a proper pointer then tough luck.
                 */
                put_user(0, tidptr);
                sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL, 0);

So the concept is not unprecedented at all, nor did it ever cause any 
security problems [and i think i'd know - i wrote the above code too].  
And 'write 0' is slightly more interesting to attackers than 'set bit 30 
if word equals to TID'.

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-16 21:26           ` Ingo Molnar
@ 2006-02-16 21:50             ` Christopher Friesen
  2006-02-16 21:55               ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Christopher Friesen @ 2006-02-16 21:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Daniel Walker, linux-kernel, Ulrich Drepper, Thomas Gleixner,
	Arjan van de Ven, Andrew Morton

Ingo Molnar wrote:

> basically, ->futex_offset is not blindly trusted by the kernel either: 
> it's simply used to calculate a "userspace pointer" value, which it then 
> uses in a (secure) get_user() access, to do a FUTEX_WAKEUP. [Note that 
> FUTEX_WAKEUP is already done at do_exit() time via the ->clear_child_tid 
> userspace pointer.] All in one: this is totally safe.

As mentioned by Paul...how do you deal with 32/64 compatibility where 
your pointers are different sizes?

Chris

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-16 21:23 ` Paul Jackson
@ 2006-02-16 21:50   ` Ingo Molnar
  2006-02-17  4:56     ` Paul Jackson
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2006-02-16 21:50 UTC (permalink / raw)
  To: Paul Jackson; +Cc: linux-kernel, drepper, tglx, arjan, akpm

* Paul Jackson <pj@sgi.com> wrote:

> Nice stuff ...
> 
> I wonder if some of the initial questions about whether gcc would be
> forcing something on the kernel, and whether it was unsafe for the
> kernel to be walking a user list, are distracting from a more
> interesting (in my view) question.
>
> One can view this as just another sort of "interesting" system call, 
> where user code puts some data in various register and memory 
> locations, and then ends up by some predictable path in kernel code 
> which is acting on the request encoded in that data.

correct.

> As always with system calls:
>  1) the kernel can't trust the user data any further than the user
>     could have thrown it, and
>  2) the interface needs a robust ABI and one or more language API's,
>     which will stand the test of time, over various architectures
>     and 32-64 emulations.
> 
> >From what I could see glancing at the code and comments, Ingo has (1)
> covered easily enough.
> 
> Would it make sense to have a language independent specification of 
> this interface, providing a detailed ABI, suitably generalized to 
> cover the various big endian, little endian, 32 and 64 and cross 
> environments that Linux normally supports?

little/big endian shouldnt be a problem i think, as this is a 
nonpersistent object. (futexes do not survive reboot)

The 32-bit-on-64-bit support code was indeed interesting, but it's also 
pretty straightforward. See kernel/futex_compat.c where the 64-bit 
kernel walks a 32-bit userspace. The method i took was to have _two_ 
lists:

        struct robust_list_head __user *robust_list;
#ifdef CONFIG_COMPAT
        struct compat_robust_list_head __user *compat_robust_list;
#endif

and at do_exit() time we process _both_ lists, first the 64-bit one, 
then the 32-bit one. This handles execution environments that have both 
32-bit and 64-bit state - they could crash in e.g. 32-bit mode holding 
robust futexes, while holding 64-bit robust futexes too. This method 
correctly handles e.g. x86 binaries on x86_64 [i checked that], and 
native binaries too.

> I have in mind something that a competent assembly language coder 
> could write to, directly, when coding user access to this facility?  
> Or some other language or library implementor, besides C and glibc, 
> could develop to?

in this particular case i dont think it could be described in a more 
generic way. I'm not against your idea per se - but someone would have 
to code it up ;) Nor do i think that in this particular case we'd need 
more flexibility than the patch offers: only a minimal amount of things 
are 'hardcoded' in the robust-list approach, and even those are either 
known futex properties, or are 'obvious' approaches like the fact that 
it's represented as a linked list. (which is what glibc uses anyway) But 
e.g. we dont force the single linked list: userspace can use a 
double-linked list too - the kernel will simply walk the single-linked 
component of that list in a forwards way.

> This is sort of like specifying the over the wire protocols the 
> internet, where each byte is spelled out, avoiding any assumption of 
> what sort of computing device is on the other end.  Well, not quite 
> that bad.  I guess we can assume the user code is running on the same 
> arch as the kernel, give or take possible word size and endian 
> emulations ... though if performance of this even from within machine 
> architecture emulators was a priority, even that assumption is perhaps 
> not desirable.

i think my patch is a good example of how to do it with our existing 
tools: i separated the list walking into a separate function 
(exit_robust_list() and compat_exit_robust_list()), which purely handles 
the data structure details.

In theory you are right, these two functions do essentially the same 
thing, and we could have automatically 'converted' 
compat_exit_robust_list() from the native exit_robust_list() function - 
but in practice it was a pretty straightforward process anyway for these 
~50-line functions. I think it would need a more complex example than 
this to justify some sort of new language.

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-16 21:50             ` Christopher Friesen
@ 2006-02-16 21:55               ` Ingo Molnar
  0 siblings, 0 replies; 30+ messages in thread
From: Ingo Molnar @ 2006-02-16 21:55 UTC (permalink / raw)
  To: Christopher Friesen
  Cc: Daniel Walker, linux-kernel, Ulrich Drepper, Thomas Gleixner,
	Arjan van de Ven, Andrew Morton


* Christopher Friesen <cfriesen@nortel.com> wrote:

> Ingo Molnar wrote:
> 
> >basically, ->futex_offset is not blindly trusted by the kernel either: 
> >it's simply used to calculate a "userspace pointer" value, which it then 
> >uses in a (secure) get_user() access, to do a FUTEX_WAKEUP. [Note that 
> >FUTEX_WAKEUP is already done at do_exit() time via the ->clear_child_tid 
> >userspace pointer.] All in one: this is totally safe.
> 
> As mentioned by Paul...how do you deal with 32/64 compatibility where 
> your pointers are different sizes?

i just replied to Paul's mail with details about this. (Please reply to 
that mail if there are any open questions.)

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3 - Why in userspace?
  2006-02-16 20:36             ` Ingo Molnar
@ 2006-02-16 22:32               ` Esben Nielsen
  2006-02-16 22:36                 ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Esben Nielsen @ 2006-02-16 22:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, Daniel Walker, linux-kernel, Ulrich Drepper,
	Thomas Gleixner, Andrew Morton


On Thu, 16 Feb 2006, Ingo Molnar wrote:

> [...]
> 
> > I am ofcourse comparing to a solution where you do a syscall on 
> > everytime you do a lock. What I am asking about is whethere it 
> > wouldn't be enough to maintain the list at the FUTEX_WAIT/FUTEX_WAKE 
> > operation - i.e. the slow path where you have to go into the kernel.
> 
> no, that's not enough at all: we need to be able to clean up after 
> futexes even if the kernel was _never involved_ with them. The pure 
> userspace futex fastpath still can keep a lock stuck! In fact that is 
> the common-case.
> 

As I understand the protocol the userspace task writes it's pid into the
lock atomically when locking it and erases it atomically when it leaves
the lock. If it is killed inbetween the pid is still there.
Now if another task comes along it reads the pid, sets the wait flag and goes 
into the kernel. The kernel will now be able to see that the pid is no
longer valid and therefore the owner must be dead. 

Esben

> 	Ingo
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3 - Why in userspace?
  2006-02-16 22:32               ` Esben Nielsen
@ 2006-02-16 22:36                 ` Ingo Molnar
  2006-02-16 23:20                   ` Esben Nielsen
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2006-02-16 22:36 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Arjan van de Ven, Daniel Walker, linux-kernel, Ulrich Drepper,
	Thomas Gleixner, Andrew Morton


* Esben Nielsen <simlo@phys.au.dk> wrote:

> As I understand the protocol the userspace task writes it's pid into 
> the lock atomically when locking it and erases it atomically when it 
> leaves the lock. If it is killed inbetween the pid is still there. Now 
> if another task comes along it reads the pid, sets the wait flag and 
> goes into the kernel. The kernel will now be able to see that the pid 
> is no longer valid and therefore the owner must be dead.

this is racy - we cannot know whether the PID wrapped around.

nor does this method offer any solution for the case where there are 
already waiters pending: they might be hung forever. With our solution 
one of those waiters gets woken up and notice that the lock is dead. 
(and in the unlikely even of that thread dying too while trying to 
recover the data, the kernel will do yet another wakeup, of the next 
waiter.)

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3 - Why in userspace?
  2006-02-16 22:36                 ` Ingo Molnar
@ 2006-02-16 23:20                   ` Esben Nielsen
  2006-02-16 23:39                     ` Ingo Molnar
  2006-02-17 23:47                     ` Andrew James Wade
  0 siblings, 2 replies; 30+ messages in thread
From: Esben Nielsen @ 2006-02-16 23:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, Daniel Walker, linux-kernel, Ulrich Drepper,
	Thomas Gleixner, Andrew Morton

On Thu, 16 Feb 2006, Ingo Molnar wrote:

> 
> * Esben Nielsen <simlo@phys.au.dk> wrote:
> 
> > As I understand the protocol the userspace task writes it's pid into 
> > the lock atomically when locking it and erases it atomically when it 
> > leaves the lock. If it is killed inbetween the pid is still there. Now 
> > if another task comes along it reads the pid, sets the wait flag and 
> > goes into the kernel. The kernel will now be able to see that the pid 
> > is no longer valid and therefore the owner must be dead.
> 
> this is racy - we cannot know whether the PID wrapped around.
>
What about adding more bits to check on? The PID to lookup the task_t and
then some extra bits to uniquely identify the actual task.
 
> nor does this method offer any solution for the case where there are 
> already waiters pending: they might be hung forever. 
It was for this case I suggested maintaining a list of waiters within the
kernel on each task_t. The adding has to be done FUTEX_WAIT so the adding
operation needs to be protected.

> With our solution 
> one of those waiters gets woken up and notice that the lock is dead. 
> (and in the unlikely even of that thread dying too while trying to 
> recover the data, the kernel will do yet another wakeup, of the next 
> waiter.)
> 
I admit your solution is a good one. The only drawback - besides being
untraditional - is that memory corruption can leave futexes locked at
exit.

Esben

> 	Ingo
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3 - Why in userspace?
  2006-02-16 23:20                   ` Esben Nielsen
@ 2006-02-16 23:39                     ` Ingo Molnar
  2006-02-17  0:20                       ` Esben Nielsen
  2006-02-17 23:47                     ` Andrew James Wade
  1 sibling, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2006-02-16 23:39 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Arjan van de Ven, Daniel Walker, linux-kernel, Ulrich Drepper,
	Thomas Gleixner, Andrew Morton


* Esben Nielsen <simlo@phys.au.dk> wrote:

> > this is racy - we cannot know whether the PID wrapped around.
> >
> What about adding more bits to check on? The PID to lookup the task_t 
> and then some extra bits to uniquely identify the actual task.

which would just be a fancy name for a wider PID space, and would thus 
still not protect against PID reuse :-)

> > nor does this method offer any solution for the case where there are 
> > already waiters pending: they might be hung forever. 
>
> It was for this case I suggested maintaining a list of waiters within 
> the kernel on each task_t. The adding has to be done FUTEX_WAIT so the 
> adding operation needs to be protected.

i'm not sure i follow - what list is this and how would it be 
maintained?

> > With our solution 
> > one of those waiters gets woken up and notice that the lock is dead. 
> > (and in the unlikely even of that thread dying too while trying to 
> > recover the data, the kernel will do yet another wakeup, of the next 
> > waiter.)
> > 
> I admit your solution is a good one. The only drawback - besides being 
> untraditional - is that memory corruption can leave futexes locked at 
> exit.

so? Memory corruption can overwrite the futex value anyway, and can thus 
cause the wrong owner to be identified - causing a locked futex. This 
patch does not protect against bad effects of memory corruption - 
there's really no way to keep userspace from breaking itself.

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3 - Why in userspace?
  2006-02-16 23:39                     ` Ingo Molnar
@ 2006-02-17  0:20                       ` Esben Nielsen
  2006-02-17  0:42                         ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Esben Nielsen @ 2006-02-17  0:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, Daniel Walker, linux-kernel, Ulrich Drepper,
	Thomas Gleixner, Andrew Morton

On Fri, 17 Feb 2006, Ingo Molnar wrote:

> 
> * Esben Nielsen <simlo@phys.au.dk> wrote:
> 
> > > this is racy - we cannot know whether the PID wrapped around.
> > >
> > What about adding more bits to check on? The PID to lookup the task_t 
> > and then some extra bits to uniquely identify the actual task.
> 
> which would just be a fancy name for a wider PID space, and would thus 
> still not protect against PID reuse :-)
> 
Can it really be correct there is no way to uniquely identify a thread in
the uptime of the system? It could be done with BigIntegers :-)


> > > nor does this method offer any solution for the case where there are 
> > > already waiters pending: they might be hung forever. 
> >
> > It was for this case I suggested maintaining a list of waiters within 
> > the kernel on each task_t. The adding has to be done FUTEX_WAIT so the 
> > adding operation needs to be protected.
> 
> i'm not sure i follow - what list is this and how would it be 
> maintained?
>

At the FUTEX_WAIT operation add the waiter to a list of waiters on the
owner's task_t. At FUTEX_WAKE remove the waiter. At task exit wake up the
waiters.
 
> > > With our solution 
> > > one of those waiters gets woken up and notice that the lock is dead. 
> > > (and in the unlikely even of that thread dying too while trying to 
> > > recover the data, the kernel will do yet another wakeup, of the next 
> > > waiter.)
> > > 
> > I admit your solution is a good one. The only drawback - besides being 
> > untraditional - is that memory corruption can leave futexes locked at 
> > exit.
> 
> so? Memory corruption can overwrite the futex value anyway, and can thus 
> cause the wrong owner to be identified - causing a locked futex. This 
> patch does not protect against bad effects of memory corruption - 
> there's really no way to keep userspace from breaking itself.
> 

At least you could wake up those who are already blocked in the kernel...

Esben

> 	Ingo
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3 - Why in userspace?
  2006-02-17  0:20                       ` Esben Nielsen
@ 2006-02-17  0:42                         ` Ingo Molnar
  0 siblings, 0 replies; 30+ messages in thread
From: Ingo Molnar @ 2006-02-17  0:42 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Arjan van de Ven, Daniel Walker, linux-kernel, Ulrich Drepper,
	Thomas Gleixner, Andrew Morton

* Esben Nielsen <simlo@phys.au.dk> wrote:

> > > > this is racy - we cannot know whether the PID wrapped around.
> > > >
> > > What about adding more bits to check on? The PID to lookup the task_t 
> > > and then some extra bits to uniquely identify the actual task.
> > 
> > which would just be a fancy name for a wider PID space, and would thus 
> > still not protect against PID reuse :-)
> > 
>
> Can it really be correct there is no way to uniquely identify a thread 
> in the uptime of the system? It could be done with BigIntegers :-)

well, that's how PIDs work. I'd have no problem with 64-bit PIDs/TIDs in 
theory, but besides a _massive_ ABI break (which makes the whole thing a 
non-starter), users would probably resist 'ps' output like:

 917239487098712349  tty8     Ss+    0:00 -bash
 1844674407370955161 tty9     Ss+    0:00 -bash
 1356415698798712343 tty10    Ss+    0:00 -bash

:-| Another problem is that futexes are fundamentally 32-bit, so there's 
no space in them for 64-bit TIDs ...

> > i'm not sure i follow - what list is this and how would it be 
> > maintained?
> 
> At the FUTEX_WAIT operation add the waiter to a list of waiters on the 
> owner's task_t. At FUTEX_WAKE remove the waiter. At task exit wake up 
> the waiters.

well, but even in this case the kernel has no idea (currently) about the 
owner - it is the waiters that have kernel state in this case. So extra 
code would have to be added to look up the TID, make sure the lookup is 
secure: check that the task looked up is really mapping that vma, and to 
queue waiters to the owner. Quite clumsy ... and this is still only the 
easy case, when the kernel is involved in the futex use.

> > > I admit your solution is a good one. The only drawback - besides being 
> > > untraditional - is that memory corruption can leave futexes locked at 
> > > exit.
> > 
> > so? Memory corruption can overwrite the futex value anyway, and can thus 
> > cause the wrong owner to be identified - causing a locked futex. This 
> > patch does not protect against bad effects of memory corruption - 
> > there's really no way to keep userspace from breaking itself.
> > 
> 
> At least you could wake up those who are already blocked in the 
> kernel...

which would be of little use if the wrong TID is in the futex. There's 
really no protection against userspace breaking itself.

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-16 21:50   ` Ingo Molnar
@ 2006-02-17  4:56     ` Paul Jackson
  2006-02-17  9:41       ` Ingo Molnar
  2006-02-17 11:59       ` Ingo Molnar
  0 siblings, 2 replies; 30+ messages in thread
From: Paul Jackson @ 2006-02-17  4:56 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, drepper, tglx, arjan, akpm

> in this particular case i dont think it could be described in a more 
> generic way. I'm not against your idea per se - but someone would have 
> to code it up ;)

I wasn't talking about 'automation' code, nor about 'more flexible
or generic' code, nor any other changes or additions to your code,
but rather about documentation that spelled out the ABI explicitly,
independent of kernel or glibc code.

Apparently my question was confusing, as you seem to have answered
a different question than I thought I asked.  Good answers, to some
other question.

Let me try again, from the beginning.

First, let me point out the tight coupling of this patch set, at least
as currently presented, with glibc.  Notice for example the following
comment from your patch:

+ * NOTE: this structure is part of the syscall ABI, and must only be
+ * changed if the change is first communicated with the glibc folks.

Perhaps I am being an old fogey, reflecting times and systems long
past their prime, but I'd have thought that it would be better if
this interface was not so much a contract written in C code between
the kernel and glibc (which is how the above comment sounds) but
rather a contract between the kernel code and a neutral document,
to which any user side implementation, such as glibc, could be written.

The comments and documents for robust_futexes seem to be written
as if Linux had some special arrangement with glibc to be the sole
purveyor of the userland interface.  Is that the case?

And half of this contract, the glibc code, isn't even explicitly
presented on this lkml thread.

Second, let me try again to explain what sort of more language neutral
Documentation I was hoping for, this time by example.

As it stands now, I have to read the kernel half of the code to figure
this out (yeah, I know, that complaint won't garner much sympathy on
this list ...)

Let me quit complaining and try to offer up something useful.

How about adding this to Documentation/robust-futexes.txt.

Be warned that the following may be seriously confused.

+++++++++++++++++++++++++++ Begin +++++++++++++++++++++++++++

The robust futex ABI
--------------------

Robust_futexes provide a mechanism that is used in addition to normal
futexes, for kernel assist of cleanup of held locks on task exit.

The interesting data as to what futexes a thread is holding is kept on
a linked list in user space, where it can be updated efficiently as
locks are taken and dropped, without kernel intervention.  The only
additional kernel intervention required for robust_futexes above and
beyond what is required for futexes is:
 1) a one time call, per thread, to tell the kernel where its list of
    held robust_futexes begins, and
 2) internal kernel code at exit, to handle any listed locks held
    by the exiting thread.

The existing normal futexes already provide a "Fast Userspace Locking"
mechanism, which handles uncontested locking without needing a
system call, and handles contested locking by maintaining a list of
waiting threads in the kernel.  Options on the sys_futex(2) system
call support waiting on a particular futex, and waking up the next
waiter on a particular futex.

For robust_futexes to work, the user code (typically in a library such
as glibc linked with the application) has to manage and place the
necessary list elements exactly as the kernel expects them.  If it
fails to do so, then improperly listed locks will not be cleaned up
on exit, probably causing deadlock or other such failure of the other
threads waiting on the same locks.

A thread that anticipates possibly using robust_futexes should first
issue the system call:
    asmlinkage long
    sys_set_robust_list(struct robust_list_head __user *head, size_t len);

The pointer 'head' points to a structure in the threads address space
consisting of three words.  Each word is 32 bits on 32 bit arch's,
or 64 bits on 64 bit arch's, and local byte order.  Each thread should
have its own thread private 'head'.

If a thread is running in 32 bit compatibility mode on a 64 native
arch kernel, then it can actually have two such structures - one
using 32 bit words for 32 bit compatibility mode, and one using 64 bit
words for 64 bit native mode.  The kernel, if it is a 64 bit kernel
supporting 32 bit compatibility mode, will attempt to process both
lists on each task exit, if the corresponding sys_set_robust_list()
call has been made to setup that list.

  The first word in the memory structure at 'head' contains a
  pointer to a single linked list of 'lock entries', one per lock,
  as described below.  If the list is empty, the pointer will point
  to itself, 'head'.  The last 'lock entry' points back to the 'head'.

  The second word, called 'offset', specifies the offset from the
  address of the associated 'lock entry', plus or minus, of what will
  be called the 'lock word', from that 'lock entry'.  The 'lock word'
  is always a 32 bit word, unlike the other words above.  The 'lock
  word' holds 3 flag bits in the upper 3 bits, and the thread id (TID)
  of the thread holding the lock in the bottom 29 bits.  See further
  below for a description of the flag bits.

  The third word, called 'list_op_pending', contains transient copy of
  the address of the 'lock entry', during list insertion and removal,
  and is needed to correctly resolve races should a thread exit while
  in the middle of a locking or unlocking operation.

Each 'lock entry' on the single linked list starting at 'head' consists
of just a single word, pointing to the next 'lock entry', or back to
'head' if there are no more entries.  In addition, nearby to each
'lock entry', at an offset from the 'lock entry' specified by the
'offset' word, is one 'lock word'.

The 'lock word' is always 32 bits, and is intended to be the same 32
bit lock variable used by the futex mechanism, in conjunction with
robust_futexes.  The kernel will only be able to wakeup the next thread
waiting for a lock on a threads exit if that next thread used the futex
mechanism to register the address of that 'lock word' with the kernel.

For each futex lock currently held by a thread, if it wants this
robust_futex support for exit cleanup of that lock, it should have
one 'lock entry' on this list, with its associated 'lock word' at
the specified 'offset'.  Should a thread die while holding any such
locks, the kernel will walk this list, mark any such locks with a bit
indicating their holder died, and wakeup the next thread waiting for
that lock using the futex mechanism.

When a thread has invoked the above system call to indicate it
anticipates using robust_futexes, the kernel stores the passed in
'head' pointer for that task.  The task may retrieve that value later
on by using the system call:
    asmlinkage long
    sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
                        size_t __user *len_ptr);

It is anticipated that threads will use robust_futexes embedded in
larger, user level locking structures, one per lock.  The kernel
robust_futex mechanism doesn't care what else is in that structure,
so long as the 'offset' to the 'lock word' is the same for all
robust_futexes used by that thread.  The thread should link those
locks it currently holds using the 'lock entry' pointers.  It may
also have other links between the locks, such as the reverse side of
a double linked list, but that doesn't matter to the kernel.

By keeping its locks linked this way, on a list starting with a 'head'
pointer known to the kernel, the kernel can provide to a thread the
essential service available for robust_futexes, which is to help
clean up locks held at the time of (a perhaps unexpectedly) exit.

Actual locking and unlocking, during normal operations, is handled
entirely by user level code in the contending threads, and by the
existing futex mechanism to wait for, and wakeup, locks.  The kernels
only essential involvement in robust_futexes is to remember where the
list 'head' is, and to walk the list on thread exit, handling locks
still held by the departing thread, as described below.

There may exist thousands of futex lock structures in a threads
shared memory, on various data structures, at a given point in time.
Only those lock structures for locks currently held by that thread
should be on that thread's robust_futex linked lock list a given time.

A given futex lock structure in a user shared memory region may be held
at different times by any of the threads with access to that region.
The thread currently holding such a lock, if any, is marked with the
threads TID in the lower 29 bits of the 'lock word'.

When adding or removing a lock from its list of held locks, in order
for the kernel to correctly handle lock cleanup regardless of when
the task exits (perhaps it gets an unexpected signal 9 in the middle
of manipulating this list), the user code must observe the following
protocol on 'lock entry' insertion and removal:

On insertion:
 1) set the 'list_op_pending' word to the address of the 'lock word'
    to be inserted,
 2) acquire the futex lock,
 3) add the lock entry, with its thread id (TID) in the bottom 29 bits
    of the 'lock word', to the linked list starting at 'head', and
 4) clear the 'list_op_pending' word.

	XXX I am particularly unsure of the following -pj XXX

On removal:
 1) set the 'list_op_pending' word to the address of the 'lock word'
    to be removed,
 2) remove the lock entry for this lock from the 'head' list,
 2) release the futex lock, and
 2) clear the 'lock_op_pending' word.

On exit, the kernel will consider the address stored in
'list_op_pending' and the address of each 'lock word' found by walking
the list starting at 'head'.  For each such address, if the bottom
29 bits of the 'lock word' at offset 'offset' from that address equals
the exiting threads TID, then the kernel will do two things:
 1) if bit 31 (0x80000000) is set in that word, then attempt a futex
    wakeup on that address, which will waken the next thread that has
    used to the futex mechanism to wait on that address, and
 2) atomically set  bit 30 (0x40000000) in the 'lock word'.

In the above, bit 31 was set by futex waiters on that lock to indicate
they were waiting, and bit 30 is set by the kernel to indicate that
the lock owner died holding the lock.

The kernel exit code will silently stop scanning the list further
if at any point:
 1) the 'head' pointer or an subsequent linked list pointer
    is not a valid address of a user space word
 2) the calculated location of the 'lock word' (address plus
    'offset') is not the valud address of a 32 bit user space
    word
 3) if the list contains more than 1 million (subject to
    future kernel configuration changes) elements.

When the kernel sees a list entry whose 'lock word' doesn't have the
current threads TID in the lower 29 bits, it does nothing with that
entry, and goes on to the next entry.

Bit 29 (0x20000000) of the 'lock word' is reserved for future use.

++++++++++++++++++++++++++++ End ++++++++++++++++++++++++++++

Other details ...

Nit ...

    +If a futex is found to be held at exit time, the kernel sets the highest
    +bit of the futex word:
    +
    +	#define FUTEX_OWNER_DIED        0x40000000

    Contrary to the comment, that doesn't look like the "highest bit."

Confusion ...

    +The list is guaranteed to be private and per-thread, so it's lockless.

    This statement seems like it is stretching the truth a bit.
    As best as I can tell, the 'head' is private per-thread, but the
    elements on the list are shared by all contending threads, and so
    adding and removing these elements from a given threads list requires
    some sort of contention handling mechanism, which the code provides.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-17  4:56     ` Paul Jackson
@ 2006-02-17  9:41       ` Ingo Molnar
  2006-02-17 11:59       ` Ingo Molnar
  1 sibling, 0 replies; 30+ messages in thread
From: Ingo Molnar @ 2006-02-17  9:41 UTC (permalink / raw)
  To: Paul Jackson; +Cc: linux-kernel, drepper, tglx, arjan, akpm

* Paul Jackson <pj@sgi.com> wrote:

> First, let me point out the tight coupling of this patch set, at least 
> as currently presented, with glibc.  Notice for example the following 
> comment from your patch:
> 
> + * NOTE: this structure is part of the syscall ABI, and must only be
> + * changed if the change is first communicated with the glibc folks.

Note that this is really business as usual: we already have dozens of 
different 'struct' parameters to hundreds of syscalls, to all of which 
exactly these restrictions apply: they must never be changed.

Furthermore there are a good deal of other implicit and explicit data 
structure assumptions that all form the ABI - and which the kernel must 
not break.

The only unusual thing i guess is that i documented it for this new bit 
of functionality ;-)

[ In fact, the robust_list syscalls are shaped so that the structures 
_can_ be changed if done with care (due to the length parameter). The 
overwhelming majority of our other ABI assumptions are hardcoded and are 
only changeable by writing totally new syscalls and phasing out the old 
ones. ]

I agree with your suggestion of better documenting the 
kernel<->userspace ABI, but this should be done independently of robust 
futexes.

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-17  4:56     ` Paul Jackson
  2006-02-17  9:41       ` Ingo Molnar
@ 2006-02-17 11:59       ` Ingo Molnar
  2006-02-17 20:50         ` Paul Jackson
  1 sibling, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2006-02-17 11:59 UTC (permalink / raw)
  To: Paul Jackson; +Cc: linux-kernel, drepper, tglx, arjan, akpm


* Paul Jackson <pj@sgi.com> wrote:

> [ ... nice writeup of the robust-futex ABI ... ]

can i put this into Documentation/robust-futex-ABI.txt?

> Other details ...
> 
> Nit ...
> 
>     +If a futex is found to be held at exit time, the kernel sets the highest
>     +bit of the futex word:
>     +
>     +	#define FUTEX_OWNER_DIED        0x40000000
> 
>     Contrary to the comment, that doesn't look like the "highest bit."

ok, i fixed this in the text.

> Confusion ...
> 
>     +The list is guaranteed to be private and per-thread, so it's lockless.
> 
>     This statement seems like it is stretching the truth a bit.
>     As best as I can tell, the 'head' is private per-thread, but the
>     elements on the list are shared by all contending threads, and so
>     adding and removing these elements from a given threads list requires
>     some sort of contention handling mechanism, which the code provides.

well, from the kernel's perspective, the list _as it exists_ is private 
and per-thread, so it can be accessed in a lockless way.

from the userspace perspective you are right, it's only private if the 
list entry is manipulated after acquiring the lock.

i fixed up the text to reflect this.

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3
  2006-02-17 11:59       ` Ingo Molnar
@ 2006-02-17 20:50         ` Paul Jackson
  0 siblings, 0 replies; 30+ messages in thread
From: Paul Jackson @ 2006-02-17 20:50 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, drepper, tglx, arjan, akpm

> > [ ... nice writeup of the robust-futex ABI ... ]
> 
> can i put this into Documentation/robust-futex-ABI.txt?

Good idea - so be it.

Could you review it for accuracy -- I'm sure I screwed
it up in some details, large or small.

Ulrich -- if you're reading this -- your review comments
would be most welcome as well.

In particular:
 1) See the description of the removal protocol, below
    the XXX comment.  I was really guessing there.
 2) Could you add a statement on how current code should
    handle the FUTEX_OWNER_PENDING bit (when to set it,
    when to clear it, when to preserve it) so that current
    code won't be incompatible with likely future uses of
    this big?
 3) You have implicit ABI versioning in the size of the
    head struct.  Could you add words describing that?

Thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/6] lightweight robust futexes: -V3 - Why in userspace?
  2006-02-16 23:20                   ` Esben Nielsen
  2006-02-16 23:39                     ` Ingo Molnar
@ 2006-02-17 23:47                     ` Andrew James Wade
  1 sibling, 0 replies; 30+ messages in thread
From: Andrew James Wade @ 2006-02-17 23:47 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Ingo Molnar, Arjan van de Ven, Daniel Walker, linux-kernel,
	Ulrich Drepper, Thomas Gleixner, Andrew Morton

[Resending: I accidentally did reply one.]

On Thursday 16 February 2006 18:20, Esben Nielsen wrote:
> On Thu, 16 Feb 2006, Ingo Molnar wrote:
>
> >
> > * Esben Nielsen <simlo@phys.au.dk> wrote:
> >
> > > As I understand the protocol the userspace task writes it's pid into
> > > the lock atomically when locking it and erases it atomically when it
> > > leaves the lock. If it is killed inbetween the pid is still there. Now
> > > if another task comes along it reads the pid, sets the wait flag and
> > > goes into the kernel. The kernel will now be able to see that the pid
> > > is no longer valid and therefore the owner must be dead.
> >
> > this is racy - we cannot know whether the PID wrapped around.
> >
> What about adding more bits to check on? The PID to lookup the task_t and
> then some extra bits to uniquely identify the actual task.

The extra identifying bits don't even have to be written at the same
time/place as the PID. They can be written after the futex is aquired,
and cleared before the futex is released. A mechanism similar to
list_op_pending can be used to fill the races: If the extra-id field is
clear, the kernel checks all the list_op_pendings registered for that PID
to see if any of the threads is in the process of aquiring/releasing that
futex. If not, FUTEX_OWNER_DIED. This last process is liable to be quite
nasty and heavy-weight, but it should also be rare if the races are small.

I believe all the races in the last process can be closed if we can count
on FUTEX_WAITERS informing us (the testing process) if the futex is
released. For each thread that might be holding the futex, if
list_op_pending doesn't equal the futex, then that thread can't be aquiring
the futex. If the extra_id is still clear, then that thread can't be holding
the futex. If list_op_pending still doesn't equal the futex, then that
thread can't be freeing the futex. Therefore that thread doesn't have the
futex. So if no threads are holding the futex, and the futex hasn't been
released during this process, then the owner must be dead.

[This assumes that the freeing thread is the same as the one that aquired
the mutex. This assumption can be relaxed to the freeing thread
being merely in the same process, but beyond that a syscall would be
needed on the freeing side to avoid races.]

I hope this is clear, I'd need to get up to speed on kernel hacking before
I could turn this into code.

Andrew Wade

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2006-02-17 23:47 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-02-16  9:41 [patch 0/6] lightweight robust futexes: -V3 Ingo Molnar
2006-02-16 16:33 ` Daniel Walker
2006-02-16 17:24   ` Ingo Molnar
2006-02-16 17:34     ` Daniel Walker
2006-02-16 19:06       ` [patch 0/6] lightweight robust futexes: -V3 - Why in userspace? Esben Nielsen
2006-02-16 19:34         ` Arjan van de Ven
2006-02-16 20:04           ` Esben Nielsen
2006-02-16 20:17             ` Esben Nielsen
2006-02-16 20:23             ` Christopher Friesen
2006-02-16 20:36             ` Ingo Molnar
2006-02-16 22:32               ` Esben Nielsen
2006-02-16 22:36                 ` Ingo Molnar
2006-02-16 23:20                   ` Esben Nielsen
2006-02-16 23:39                     ` Ingo Molnar
2006-02-17  0:20                       ` Esben Nielsen
2006-02-17  0:42                         ` Ingo Molnar
2006-02-17 23:47                     ` Andrew James Wade
2006-02-16 20:23       ` [patch 0/6] lightweight robust futexes: -V3 Ingo Molnar
2006-02-16 20:54         ` Daniel Walker
2006-02-16 21:26           ` Ingo Molnar
2006-02-16 21:50             ` Christopher Friesen
2006-02-16 21:55               ` Ingo Molnar
2006-02-16 20:47       ` Paul Jackson
2006-02-16 21:35         ` Ingo Molnar
2006-02-16 21:23 ` Paul Jackson
2006-02-16 21:50   ` Ingo Molnar
2006-02-17  4:56     ` Paul Jackson
2006-02-17  9:41       ` Ingo Molnar
2006-02-17 11:59       ` Ingo Molnar
2006-02-17 20:50         ` Paul Jackson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox