public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [patch] futex requeueing feature, futex-requeue-2.5.69-D3
@ 2003-05-19  9:31 Ingo Molnar
  2003-05-19 10:10 ` Christoph Hellwig
                   ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Ingo Molnar @ 2003-05-19  9:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Rusty Russell, Ulrich Drepper


the attached patch addresses a futex related SMP scalability problem of
glibc. A number of regressions have been reported to the NTPL mailing list
when going to many CPUs, for applications that use condition variables and
the pthread_cond_broadcast() API call. Using this functionality, testcode
shows a slowdown from 0.12 seconds runtime to over 237 seconds (!)  
runtime, on 4-CPU systems.

pthread condition variables use two futex-backed mutex-alike locks: an
internal one for the glibc CV state itself, and a user-supplied mutex
which the API guarantees to take in certain codepaths. (Unfortunately the
user-supplied mutex cannot be used to protect the CV state, so we've got
to deal with two locks.)

The cause of the slowdown is a 'swarm effect': if lots of threads are
blocked on a condition variable, and pthread_cond_broadcast() is done,
then glibc first does a FUTEX_WAKE on the cv-internal mutex, then down a
mutex_down() on the user-supplied mutex. Ie. a swarm of threads is created
which all race to serialize on the user-supplied mutex. The more threads
are used, the more likely it becomes that the scheduler will balance them
over to other CPUs - where they just schedule, try to lock the mutex, and
go to sleep. This 'swarm effect' is purely technical, a side-effect of
glibc's use of futexes, and the imperfect coupling of the two locks.

the solution to this problem is to not wake up the swarm of threads, but
'requeue' them from the CV-internal mutex to the user-supplied mutex. The
attached patch adds the FUTEX_REQUEUE feature FUTEX_REQUEUE requeues N
threads from futex address A to futex address B:

	sys_futex(uaddr, FUTEX_REQUEUE, nr_wake, NULL, uaddr2);

the 'val' parameter to sys_futex (nr_wake) is the # of woken up threads.  
This way glibc can wake up a single thread (which will take the
user-mutex), and can requeue the rest, with a single system-call.

Ulrich Drepper has implemented FUTEX_REQUEUE support in glibc, and a
number of people have tested it over the past couple of weeks. Here are
the measurements done by Saurabh Desai:

System: 4xPIII 700MHz

 ./cond-perf -r 100 -n 200:        1p       2p         4p
 Default NPTL:                 0.120s   0.211s   237.407s
 requeue NPTL:                 0.124s   0.156s     0.040s

 ./cond-perf -r 1000 -n 100:
 Default NPTL:                 0.276s   0.412s     0.530s
 requeue NPTL:                 0.349s   0.503s     0.550s

 ./pp -v -n 128 -i 1000 -S 32768:
 Default NPTL: 128 games in    1.111s   1.270s    16.894s
 requeue NPTL: 128 games in    1.111s   1.959s     2.426s

 ./pp -v -n 1024 -i 10 -S 32768:
 Default NPTL: 1024 games in   0.181s   0.394s     incompleted 2m+
 requeue NPTL: 1024 games in   0.166s   0.254s     0.341s

the speedup with increasing number of threads is quite significant, in the
128 threads, case it's more than 8 times. In the cond-perf test, on 4 CPUs
it's almost infinitely faster than the 'swarm of threads' catastrophy
triggered by the old code.

there's a slowdown on UP, which is expected: on UP the O(1) scheduler
implicitly serializes all active threads on the runqueue, and doesnt
degrade under lots of threads. On SMP the 'point of breakdown' depends on
the precise amount of time needed for the threads to become rated as
'cache-cold' by the load-balancer.

(the patch adds a new futex syscall parameter (uaddr2), which is a
compatible extension of sys_futex. Old NPTL applications will continue to
work without any impact, only the FUTEX_REQUEUE codepath uses the new
parameter.)

	Ingo

--- linux/include/linux/futex.h.orig	
+++ linux/include/linux/futex.h	
@@ -5,7 +5,8 @@
 #define FUTEX_WAIT (0)
 #define FUTEX_WAKE (1)
 #define FUTEX_FD (2)
+#define FUTEX_REQUEUE (3)
 
-extern asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime);
+extern asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2);
 
 #endif
--- linux/kernel/fork.c.orig	
+++ linux/kernel/fork.c	
@@ -457,7 +457,7 @@ void mm_release(struct task_struct *tsk,
 		 * not set up a proper pointer then tough luck.
 		 */
 		put_user(0, tidptr);
-		sys_futex(tidptr, FUTEX_WAKE, 1, NULL);
+		sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL);
 	}
 }
 
--- linux/kernel/compat.c.orig	
+++ linux/kernel/compat.c	
@@ -214,7 +214,7 @@ asmlinkage long compat_sys_sigprocmask(i
 extern long do_futex(unsigned long, int, int, unsigned long);
 
 asmlinkage long compat_sys_futex(u32 *uaddr, int op, int val,
-		struct compat_timespec *utime)
+		struct compat_timespec *utime, u32 *uaddr2)
 {
 	struct timespec t;
 	unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
@@ -224,7 +224,7 @@ asmlinkage long compat_sys_futex(u32 *ua
 			return -EFAULT;
 		timeout = timespec_to_jiffies(&t) + 1;
 	}
-	return do_futex((unsigned long)uaddr, op, val, timeout);
+	return do_futex((unsigned long)uaddr, op, val, timeout, (unsigned long)uaddr2);
 }
 
 asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit *rlim);
--- linux/kernel/futex.c.orig	
+++ linux/kernel/futex.c	
@@ -2,6 +2,9 @@
  *  Fast Userspace Mutexes (which I call "Futexes!").
  *  (C) Rusty Russell, IBM 2002
  *
+ *  Generalized futexes, futex requeueing, misc fixes by Ingo Molnar
+ *  (C) Copyright 2003 Red Hat Inc, All Rights Reserved
+ *
  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
  *  Kirkwood for proof-of-concept implementation.
@@ -9,9 +12,6 @@
  *  "The futexes are also cursed."
  *  "But they come in a choice of three flavours!"
  *
- *  Generalized futexes for every mapping type, Ingo Molnar, 2002
- *
- *
  *  This program is free software; you can redistribute it and/or modify
  *  it under the terms of the GNU General Public License as published by
  *  the Free Software Foundation; either version 2 of the License, or
@@ -216,6 +216,61 @@ static void futex_vcache_callback(vcache
 	spin_unlock(&futex_lock);
 }
 
+/*
+ * Requeue all waiters hashed on one physical page to another
+ * physical page.
+ */
+static int futex_requeue(unsigned long uaddr1, int offset1, unsigned long uaddr2, int offset2, int num)
+{
+	struct list_head *i, *next, *head1, *head2;
+	struct page *page1, *page2;
+	int ret = 0;
+
+	lock_futex_mm();
+
+	page1 = __pin_page(uaddr1 - offset1);
+	if (!page1) {
+		unlock_futex_mm();
+		return -EFAULT;
+	}
+	page2 = __pin_page(uaddr2 - offset2);
+	if (!page2) {
+		unlock_futex_mm();
+		return -EFAULT;
+	}
+
+	head1 = hash_futex(page1, offset1);
+	head2 = hash_futex(page2, offset2);
+
+	list_for_each_safe(i, next, head1) {
+		struct futex_q *this = list_entry(i, struct futex_q, list);
+
+		if (this->page == page1 && this->offset == offset1) {
+			list_del_init(i);
+			__detach_vcache(&this->vcache);
+			if (++ret <= num) {
+				wake_up_all(&this->waiters);
+				if (this->filp)
+					send_sigio(&this->filp->f_owner, this->fd, POLL_IN);
+			} else {
+				unpin_page(this->page);
+				__pin_page_atomic (page2);
+				list_add_tail(i, head2);
+				__attach_vcache(&this->vcache, uaddr2, current->mm, futex_vcache_callback);
+				this->offset = offset2;
+				this->page = page2;
+			}
+		}
+	}
+
+	unlock_futex_mm();
+
+	unpin_page(page1);
+	unpin_page(page2);
+
+	return ret;
+}
+
 static inline void __queue_me(struct futex_q *q, struct page *page,
 				unsigned long uaddr, int offset,
 				int fd, struct file *filp)
@@ -425,9 +480,9 @@ out:
 	return ret;
 }
 
-long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout)
+long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout, unsigned long uaddr2)
 {
-	unsigned long pos_in_page;
+	unsigned long pos_in_page, pos_in_page2;
 	int ret;
 
 	pos_in_page = uaddr % PAGE_SIZE;
@@ -443,6 +498,14 @@ long do_futex(unsigned long uaddr, int o
 	case FUTEX_WAKE:
 		ret = futex_wake(uaddr, pos_in_page, val);
 		break;
+	case FUTEX_REQUEUE:
+		pos_in_page2 = uaddr2 % PAGE_SIZE;
+
+		/* Must be "naturally" aligned */
+		if (pos_in_page2 % sizeof(u32))
+			return -EINVAL;
+		ret = futex_requeue(uaddr, pos_in_page, uaddr2, pos_in_page2, val);
+		break;
 	case FUTEX_FD:
 		/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
 		ret = futex_fd(uaddr, pos_in_page, val);
@@ -453,7 +516,7 @@ long do_futex(unsigned long uaddr, int o
 	return ret;
 }
 
-asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime)
+asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2)
 {
 	struct timespec t;
 	unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
@@ -463,7 +526,7 @@ asmlinkage long sys_futex(u32 __user *ua
 			return -EFAULT;
 		timeout = timespec_to_jiffies(&t) + 1;
 	}
-	return do_futex((unsigned long)uaddr, op, val, timeout);
+	return do_futex((unsigned long)uaddr, op, val, timeout, (unsigned long)uaddr2);
 }
 
 static struct super_block *



^ permalink raw reply	[flat|nested] 53+ messages in thread
[parent not found: <3ECAC2AE.8090401@redhat.com>]
* Re: [patch] futex requeueing feature, futex-requeue-2.5.69-D3
@ 2003-05-22 11:23 Martin Wirth
  2003-05-22 11:34 ` Ingo Molnar
  0 siblings, 1 reply; 53+ messages in thread
From: Martin Wirth @ 2003-05-22 11:23 UTC (permalink / raw)
  To: mingo, linux-kernel, rusty

Ingo Molnar wrote:

 >all that is needed now is some actual review of the new APIs from the
 >conceptual angle (i've done that and i think they are okay, but more 
 >eyes see more), so that we make sure these are good and we wont need to
 >discard any aspect of them anytime soon.

What about adding an u32 flags field to each one of the new futex 
sys_calls. This gives you more freedom for future extensions without
changing the API again. Possible uses may be:

- Specify the futexes to be mm-local: By using the pair mm* and vaddr as
   key it is possible to have process local futexes living on the same
   hash with the following advantages:
   1. no page_table lock contention (I implemented mm-local futexes
      for my application after I noticed long latencies on SMP where a
      high prio tasks spun in futex_wake while another task was doing 		
      mmap/munmap on a second processor).
   2. no vcache pollution (I guess 99% of all futexes will not be in 		 
     shared memory)
   3. Slightly faster, since no page pinning is needed

- Specify queueing or unqueueing in priority order


Martin


P.S.

By the way, your latest futex patch still contains the bogus
if (!list_empty(&q.list)) conditional, that's always true since
you hold the locks at this point and no one can have removed us
from the list:

 > 	add_wait_queue(&q.waiters, &wait);
 > 	set_current_state(TASK_INTERRUPTIBLE);
 >-	if (!list_empty(&q.list))
 >+	if (!list_empty(&q.list)) {
 >+		unlock_futex_mm();
 > 		time = schedule_timeout(time);
 >+	}

Of course the test would be (and was) pretty necessary if you drop the
locks before the get_user(...) call. And I must admit that I still
can't see why you need to hold the locks across get_user. Even if it's
save to do so at least every automatic code checker will bark at this point.



	


^ permalink raw reply	[flat|nested] 53+ messages in thread
[parent not found: <3ECCB319.4060706@dlr.de.suse.lists.linux.kernel>]

end of thread, other threads:[~2003-05-22 11:51 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-05-19  9:31 [patch] futex requeueing feature, futex-requeue-2.5.69-D3 Ingo Molnar
2003-05-19 10:10 ` Christoph Hellwig
2003-05-19 10:16   ` Ingo Molnar
2003-05-19 11:49     ` Christoph Hellwig
2003-05-19 12:33       ` [patch] futex API cleanups, futex-api-cleanup-2.5.69-A2 Ingo Molnar
2003-05-19 12:51         ` Ingo Molnar
2003-05-19 14:47         ` bert hubert
2003-05-19 16:02           ` Ingo Molnar
2003-05-19 15:18         ` Linus Torvalds
2003-05-19 16:06           ` Ingo Molnar
2003-05-19 23:33             ` Jamie Lokier
2003-05-20  0:39               ` Rusty Russell
2003-05-20  1:14                 ` Davide Libenzi
2003-05-20  1:44                   ` Jamie Lokier
2003-05-20 16:54                     ` Davide Libenzi
2003-05-20 17:59                       ` Jamie Lokier
2003-05-21 23:56                         ` Davide Libenzi
2003-05-20  1:50                 ` Jamie Lokier
2003-05-20  0:15             ` Rusty Russell
2003-05-20  0:08   ` [patch] futex requeueing feature, futex-requeue-2.5.69-D3 Rusty Russell
2003-05-20  0:31     ` Valdis.Kletnieks
2003-05-19 10:23 ` Andrew Morton
2003-05-19 10:30   ` [patch] futex requeueing feature, futex-requeue-2.5.69-D4 Ingo Molnar
2003-05-20  0:04 ` [patch] futex requeueing feature, futex-requeue-2.5.69-D3 Rusty Russell
2003-05-20  0:40   ` Ulrich Drepper
2003-05-20  1:46     ` Rusty Russell
2003-05-20  2:11       ` Ulrich Drepper
2003-05-20  6:27       ` Ingo Molnar
2003-05-20  6:56         ` Christoph Hellwig
2003-05-20  8:57           ` Rusty Russell
2003-05-20  9:03             ` Ingo Molnar
2003-05-20  9:51               ` Christoph Hellwig
2003-05-20 12:56                 ` [patch] futex patches, futex-2.5.69-A2 Ingo Molnar
2003-05-20 14:08                   ` Christoph Hellwig
2003-05-20 16:02                     ` Ingo Molnar
2003-05-20 19:55                       ` Christoph Hellwig
2003-05-21  5:06                         ` Martin Schlemmer
2003-05-21  6:31                           ` Christoph Hellwig
2003-05-20 15:46                 ` [patch] futex requeueing feature, futex-requeue-2.5.69-D3 Linus Torvalds
2003-05-20  9:12           ` Ingo Oeser
2003-05-20 15:41           ` Linus Torvalds
2003-05-20  8:55         ` Rusty Russell
2003-05-20  6:19   ` Ingo Molnar
     [not found] <3ECAC2AE.8090401@redhat.com>
2003-05-21  2:34 ` Rusty Russell
2003-05-21  9:48   ` Ingo Molnar
2003-05-21 10:24     ` Christoph Hellwig
2003-05-22  0:30     ` Rusty Russell
2003-05-22  9:15       ` Ingo Molnar
2003-05-22 10:35         ` Rusty Russell
  -- strict thread matches above, loose matches on Subject: below --
2003-05-22 11:23 Martin Wirth
2003-05-22 11:34 ` Ingo Molnar
2003-05-22 12:04   ` William Lee Irwin III
     [not found] <3ECCB319.4060706@dlr.de.suse.lists.linux.kernel>
2003-05-22 11:38 ` Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox