* [PATCH] Make futex waiters take an mm or inode reference
@ 2003-09-08 18:20 Jamie Lokier
2003-09-08 18:34 ` Jamie Lokier
0 siblings, 1 reply; 5+ messages in thread
From: Jamie Lokier @ 2003-09-08 18:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rusty Russell, Hugh Dickins, Ulrich Drepper, Andrew Morton,
Stephen Hemminger, Linux Kernel
Rusty Russell wrote:
> But why not solve the problem by just holding an mm reference, too?
Rusty also wrote:
> Why not make the code a *whole* lot more readable (and only marginally
> slower, if at all) by doing it in two passes: pull them off onto a
> (on-stack) list in one pass, then requeue them all in another.
This patch makes each futex waiter hold a reference to the mm or inode
that a futex is keyed on.
This is very important, because otherwise a malicious or erroneous
program can use FUTEX_FD to create futexes on mms or inodes which are
recycled, and steal wakeups from other, unrelated programs.
It isn't entirely trivial, because we can't call mmdrop() or iput()
while holding the spinlock, I think. (Does someone know to the
contrary?) Rusty, you will be glad to see that I have reimplemented
futex_requeue() exactly as you suggest: in two passes.
Ulrich will be glad to hear tst-cond2 runs just fine :)
Linus, please apply unless there are objections.
Thanks,
-- Jamie
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Make futex waiters take an mm or inode reference
2003-09-08 18:20 [PATCH] Make futex waiters take an mm or inode reference Jamie Lokier
@ 2003-09-08 18:34 ` Jamie Lokier
2003-09-08 18:52 ` Linus Torvalds
0 siblings, 1 reply; 5+ messages in thread
From: Jamie Lokier @ 2003-09-08 18:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rusty Russell, Hugh Dickins, Ulrich Drepper, Andrew Morton,
Stephen Hemminger, Linux Kernel
> seems you forgot the patch...
Ahem!
-- Jamie
Subject: [PATCH] Make futex waiters take an mm or inode reference
Patch: futex-refs-2.6.0-test4-bk10-40jl
Rusty Russell wrote:
> But why not solve the problem by just holding an mm reference, too?
Rusty also wrote:
> Why not make the code a *whole* lot more readable (and only marginally
> slower, if at all) by doing it in two passes: pull them off onto a
> (on-stack) list in one pass, then requeue them all in another.
This patch makes each futex waiter hold a reference to the mm or inode
that a futex is keyed on.
This is very important, because otherwise a malicious or erroneous
program can use FUTEX_FD to create futexes on mms or inodes which are
recycled, and steal wakeups from other, unrelated programs.
It isn't entirely trivial, because we can't call mmdrop() or iput()
while holding the spinlock, I think. (Does someone know to the
contrary?) Rusty, you will be glad to see that I have reimplemented
futex_requeue() exactly as you suggest: in two passes.
Ulrich will be glad to hear tst-cond2 runs just fine :)
Linus, please apply unless there are objections.
Thanks,
-- Jamie
diff -urN --exclude-from=dontdiff newfut-2.6.0-test4/kernel/futex.c laptop-2.6.0-test4/kernel/futex.c
--- newfut-2.6.0-test4/kernel/futex.c 2003-09-06 18:51:55.000000000 +0100
+++ laptop-2.6.0-test4/kernel/futex.c 2003-09-08 16:58:49.757693547 +0100
@@ -43,6 +43,9 @@
/*
* Futexes are matched on equal values of this key.
* The key type depends on whether it's a shared or private mapping.
+ *
+ * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
+ * We set bit 0 to indicate if it's an inode-based key.
*/
union futex_key {
struct {
@@ -164,10 +167,12 @@
return 0;
}
+ key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+ key->shared.inode = vma->vm_file->f_dentry->d_inode;
+
/*
* Linear mappings are also simple.
*/
- key->shared.inode = vma->vm_file->f_dentry->d_inode;
if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
key->shared.pgoff = (((uaddr - vma->vm_start) >> PAGE_SHIFT)
+ vma->vm_pgoff);
@@ -204,18 +209,58 @@
put_page(page);
return 0;
}
+
return err;
}
+/*
+ * Take a reference to the resource addressed by a key.
+ *
+ * NOTE: mmap_sem MUST be held between get_futex_key() and calling this
+ * function, if it is called at all. mmap_sem keeps key->shared.inode valid.
+ */
+static inline void get_key_refs(union futex_key *key)
+{
+ if (key->both.ptr != 0) {
+ if (key->both.offset & 1)
+ atomic_inc(&key->shared.inode->i_count);
+ else
+ atomic_inc(&key->private.mm->mm_count);
+ }
+}
+
+/*
+ * Drop a reference to the resource addressed by a key.
+ * futex_lock should not be held.
+ */
+static inline void drop_key_refs(union futex_key *key)
+{
+ if (key->both.ptr != 0) {
+ if (key->both.offset & 1)
+ iput(key->shared.inode);
+ else
+ mmdrop(key->private.mm);
+ }
+}
+
+/* futex_lock must be held when this is called. */
+static inline void wake_futex(struct futex_q *q)
+{
+ list_del_init(&q->list);
+ wake_up_all(&q->waiters);
+ if (q->filp)
+ send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
+}
/*
* Wake up all waiters hashed on the physical page that is mapped
* to this virtual address:
*/
-static int futex_wake(unsigned long uaddr, int num)
+static int futex_wake(unsigned long uaddr, int nr_wake)
{
- struct list_head *i, *next, *head;
union futex_key key;
+ struct futex_q *this, *next;
+ struct list_head *head;
int ret;
down_read(¤t->mm->mmap_sem);
@@ -227,16 +272,10 @@
head = hash_futex(&key);
spin_lock(&futex_lock);
- list_for_each_safe(i, next, head) {
- struct futex_q *this = list_entry(i, struct futex_q, list);
-
+ list_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key)) {
- list_del_init(i);
- wake_up_all(&this->waiters);
- if (this->filp)
- send_sigio(&this->filp->f_owner, this->fd, POLL_IN);
- ret++;
- if (ret >= num)
+ wake_futex(this);
+ if (++ret >= nr_wake)
break;
}
}
@@ -254,8 +293,9 @@
static int futex_requeue(unsigned long uaddr1, unsigned long uaddr2,
int nr_wake, int nr_requeue)
{
- struct list_head *i, *next, *head1, *head2;
union futex_key key1, key2;
+ struct list_head *head, moved;
+ struct futex_q *this, *next;
int ret;
down_read(¤t->mm->mmap_sem);
@@ -267,47 +307,59 @@
if (unlikely(ret != 0))
goto out;
- head1 = hash_futex(&key1);
- head2 = hash_futex(&key2);
+ head = hash_futex(&key1);
+ INIT_LIST_HEAD(&moved);
spin_lock(&futex_lock);
- list_for_each_safe(i, next, head1) {
- struct futex_q *this = list_entry(i, struct futex_q, list);
-
+ list_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key1)) {
- list_del_init(i);
if (++ret <= nr_wake) {
- wake_up_all(&this->waiters);
- if (this->filp)
- send_sigio(&this->filp->f_owner,
- this->fd, POLL_IN);
+ wake_futex(this);
} else {
- list_add_tail(i, head2);
- this->key = key2;
+ list_move_tail(&this->list, &moved);
if (ret - nr_wake >= nr_requeue)
break;
- /* Make sure to stop if key1 == key2 */
- if (head1 == head2 && head1 != next)
- head1 = i;
}
}
}
spin_unlock(&futex_lock);
+ if (!list_empty(&moved)) {
+ head = hash_futex(&key2);
+ list_for_each_entry_safe(this, next, &moved, list) {
+ /* Don't hold futex_lock during drop_key_refs(). */
+ drop_key_refs(&this->key);
+ this->key = key2;
+ get_key_refs(&this->key);
+
+ spin_lock(&futex_lock);
+ list_add_tail (&this->list, head);
+ spin_unlock(&futex_lock);
+ }
+ }
+
out:
up_read(¤t->mm->mmap_sem);
return ret;
}
-static inline void queue_me(struct futex_q *q, union futex_key *key,
- int fd, struct file *filp)
+/*
+ * queue_me and unqueue_me must be called as a pair, each
+ * exactly once. They are called with futex_lock held.
+ */
+
+/* The key must be already stored in q->key. */
+static inline void queue_me(struct futex_q *q, int fd, struct file *filp)
{
- struct list_head *head = hash_futex(key);
+ struct list_head *head;
- q->key = *key;
q->fd = fd;
q->filp = filp;
+ init_waitqueue_head(&q->waiters);
+ get_key_refs(&q->key);
+ head = hash_futex(&q->key);
+
spin_lock(&futex_lock);
list_add_tail(&q->list, head);
spin_unlock(&futex_lock);
@@ -324,6 +376,8 @@
ret = 1;
}
spin_unlock(&futex_lock);
+
+ drop_key_refs(&q->key);
return ret;
}
@@ -331,19 +385,16 @@
{
DECLARE_WAITQUEUE(wait, current);
int ret, curval;
- union futex_key key;
struct futex_q q;
try_again:
- init_waitqueue_head(&q.waiters);
-
down_read(¤t->mm->mmap_sem);
- ret = get_futex_key(uaddr, &key);
+ ret = get_futex_key(uaddr, &q.key);
if (unlikely(ret != 0))
goto out_release_sem;
- queue_me(&q, &key, -1, NULL);
+ queue_me(&q, -1, NULL);
/*
* Access the page after the futex is queued.
@@ -427,7 +478,7 @@
struct futex_q *q = filp->private_data;
unqueue_me(q);
- kfree(filp->private_data);
+ kfree(q);
return 0;
}
@@ -457,7 +508,6 @@
static int futex_fd(unsigned long uaddr, int signal)
{
struct futex_q *q;
- union futex_key key;
struct file *filp;
int ret, err;
@@ -499,20 +549,21 @@
}
down_read(¤t->mm->mmap_sem);
- err = get_futex_key(uaddr, &key);
- up_read(¤t->mm->mmap_sem);
+ err = get_futex_key(uaddr, &q->key);
if (unlikely(err != 0)) {
+ up_read(¤t->mm->mmap_sem);
put_unused_fd(ret);
put_filp(filp);
kfree(q);
return err;
}
- init_waitqueue_head(&q->waiters);
+ /* queue_me() must be called before releasing mmap_sem, because
+ key->shared.inode is valid without a reference until then. */
filp->private_data = q;
-
- queue_me(q, &key, ret, filp);
+ queue_me(q, ret, filp);
+ up_read(¤t->mm->mmap_sem);
/* Now we map fd to filp, so userspace can access it */
fd_install(ret, filp);
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Make futex waiters take an mm or inode reference
2003-09-08 18:34 ` Jamie Lokier
@ 2003-09-08 18:52 ` Linus Torvalds
2003-09-08 20:00 ` Jamie Lokier
2003-09-09 4:02 ` Rusty Russell
0 siblings, 2 replies; 5+ messages in thread
From: Linus Torvalds @ 2003-09-08 18:52 UTC (permalink / raw)
To: Jamie Lokier
Cc: Linus Torvalds, Rusty Russell, Hugh Dickins, Ulrich Drepper,
Andrew Morton, Stephen Hemminger, Linux Kernel
On Mon, 8 Sep 2003, Jamie Lokier wrote:
>
> This patch makes each futex waiter hold a reference to the mm or inode
> that a futex is keyed on.
I get the inode part, but let's think about the mm part a bit more.
In particular, passing off a futex that points to private memory to
somebody else just _doesn't_work_. It's insane. So I'd suggest saying that
an anonymous futex is only an <address,offset> pair, and drop the "mm"
entirely. Let's make an anonymous futex _really_ anonymous, and document
that it's only an "address" - passing it off via UNIX domain sockets is
fine, it just doesn't do anything useful.
So is there any reason to really having "private.mm" AT ALL? From what I
can tell, it is not actually ever used (all "mm" users are "current->mm"),
so I don't see the point of incrementing a count for it either.
Or did I miss something?
Linus
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Make futex waiters take an mm or inode reference
2003-09-08 18:52 ` Linus Torvalds
@ 2003-09-08 20:00 ` Jamie Lokier
2003-09-09 4:02 ` Rusty Russell
1 sibling, 0 replies; 5+ messages in thread
From: Jamie Lokier @ 2003-09-08 20:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linus Torvalds, Rusty Russell, Hugh Dickins, Ulrich Drepper,
Andrew Morton, Stephen Hemminger, Linux Kernel
Linus Torvalds wrote:
> So is there any reason to really having "private.mm" AT ALL? From what I
> can tell, it is not actually ever used (all "mm" users are "current->mm"),
> so I don't see the point of incrementing a count for it either.
>
> Or did I miss something?
Yes. The hash table is global to all processes, so "mm" is needed as
a hash key whether it is user-visible or not.
A process can do FUTEX_FD and then pass that fd to another mm, in
numerous ways (fork, exec, socket). Although that does have a
well-defined behaviour at present, I agree it's absolutely fine to
declare that "programmer error" and say it doesn't do anything useful.
But the implemenation is a security problem: a broken program will
cause _other_ unrelated programs to fail, by stealing their wakeups.
That is very bad. A userspace error should never cause random
unrelated programs to fail.
Possible fixes include:
- destroying futexes of an mm when the mm is destroyed
- marking the fds in a special way to prevent them being passed on
- taking an mm reference
Taking an mm reference is the simplest.
-- Jamie
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Make futex waiters take an mm or inode reference
2003-09-08 18:52 ` Linus Torvalds
2003-09-08 20:00 ` Jamie Lokier
@ 2003-09-09 4:02 ` Rusty Russell
1 sibling, 0 replies; 5+ messages in thread
From: Rusty Russell @ 2003-09-09 4:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jamie Lokier, Rusty Russell, Hugh Dickins, Ulrich Drepper,
Andrew Morton, Stephen Hemminger, Linux Kernel
In message <Pine.LNX.4.44.0309081144390.3202-100000@home.osdl.org> you write:
> So is there any reason to really having "private.mm" AT ALL? From what I
> can tell, it is not actually ever used (all "mm" users are "current->mm"),
> so I don't see the point of incrementing a count for it either.
>
> Or did I miss something?
Yes. Firstly, you can't do "wake one" if the one you wake might be
some completely unrelated process which happens to use the same
address. Secondly, I implemented fair futexes by relying on the
return value of FUTEX_WAKE to indicate how many people were woken: you
set the futex to a magic value, call FUTEX_WAKE(1), and if it returns
1, you've "passed" the futex directly, otherwise you unlock the futex
like normal. This is surprisingly useful for implementing
"drop_futex_if_someone_is_waiting()" in cleanup threads etc.
Sorry,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2003-09-09 4:04 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-09-08 18:20 [PATCH] Make futex waiters take an mm or inode reference Jamie Lokier
2003-09-08 18:34 ` Jamie Lokier
2003-09-08 18:52 ` Linus Torvalds
2003-09-08 20:00 ` Jamie Lokier
2003-09-09 4:02 ` Rusty Russell
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox