public inbox for linux-kernel@vger.kernel.org
* [BENCHMARKS] 2.6.4 vs 2.6.4-mm1
@ 2004-03-13  0:55 Nick Piggin
  2004-03-19  9:50 ` Ingo Molnar
  0 siblings, 1 reply; 6+ messages in thread
From: Nick Piggin @ 2004-03-13  0:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton

These are some benchmarks on a 16-way (4x4) NUMAQ, basically
measuring the scheduler patches with a couple of meaningless
but very scheduler-intensive benchmarks.

hackbench:
The number in () is a projection of the time 1000 groups would
take, assuming linear scaling. It would probably be better shown
on a graph, but you can see a non-linear element in 2.6.4 that is
basically absent in 2.6.4-mm1.

              2.6.4    2.6.4-mm1
 50      19.4 (388)   15.5 (310)
100      39.0 (390)   34.5 (345)
150      59.0 (393)   48.3 (322)
200      82.9 (414)   68.9 (344)
250     114.8 (459)   90.2 (360)
300     145.4 (484)  106.3 (354)
350     178.1 (508)  122.1 (348)
400     218.8 (547)  135.0 (337)
450     237.8 (528)  163.9 (364)
500     262.0 (524)  181.7 (363)

volanomark (MPS):
This one starts getting huge mmap_sem contention at 150+, coming
from futexes. I don't know what is taking the mmap_sem for writing;
maybe just brk or mmap.

        2.6.4   2.6.4-mm1
 15      5850        6221
 30      5682        5852
 45      4736        5700
 60      2857        5622
 75      1024        4840
 90      1832        5191
105       491        5036
120      1591        4228
135       393        4986
150      1056        1586



* Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1
  2004-03-13  0:55 [BENCHMARKS] 2.6.4 vs 2.6.4-mm1 Nick Piggin
@ 2004-03-19  9:50 ` Ingo Molnar
  2004-03-19  9:58   ` Nick Piggin
  0 siblings, 1 reply; 6+ messages in thread
From: Ingo Molnar @ 2004-03-19  9:50 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, Andrew Morton


* Nick Piggin <piggin@cyberone.com.au> wrote:

> volanomark (MPS):
> This one starts getting huge mmap_sem contention at 150+ coming
> from futexes. Don't know what is taking the mmap_sem for writing.
> Maybe just brk or mmap.

are you sure it's down_write() contention? down_read() can create
contention just as much, simply due to the fact that hundreds of threads
and a dozen CPUs are pounding in on the same poor lock.

i do think there should be a rw-semaphore variant that is per-cpu for
the read path. (This would also fix the 4:4 threading overhead.)

	Ingo


* Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1
  2004-03-19  9:50 ` Ingo Molnar
@ 2004-03-19  9:58   ` Nick Piggin
  2004-03-21  4:04     ` Nick Piggin
  0 siblings, 1 reply; 6+ messages in thread
From: Nick Piggin @ 2004-03-19  9:58 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Andrew Morton



Ingo Molnar wrote:

>* Nick Piggin <piggin@cyberone.com.au> wrote:
>
>
>>volanomark (MPS):
>>This one starts getting huge mmap_sem contention at 150+ coming
>>from futexes. Don't know what is taking the mmap_sem for writing.
>>Maybe just brk or mmap.
>>
>
>are you sure it's down_write() contention? down_read() can create
>contention just as much, simply due to the fact that hundreds of threads
>and a dozen CPUs are pounding in on the same poor lock.
>
>

No, I'm not sure actually; it could be just read-lock
contention. IIRC it was all coming from the semaphore's
spinlock, in up_read...

>i do think there should be a rw-semaphore variant that is per-cpu for
>the read path. (This would also fix the 4:4 threading overhead.)
>
>

That would be interesting, yes. I have (somewhere) a patch
that wakes up the semaphore's waiters outside its spinlock.
I think that only gave about 5% or so improvement though.



* Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1
  2004-03-19  9:58   ` Nick Piggin
@ 2004-03-21  4:04     ` Nick Piggin
  2004-03-21  7:31       ` Ingo Molnar
  0 siblings, 1 reply; 6+ messages in thread
From: Nick Piggin @ 2004-03-21  4:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 581 bytes --]



Nick Piggin wrote:

>
> That would be interesting, yes. I have (somewhere) a patch
> that wakes up the semaphore's waiters outside its spinlock.
> I think that only gave about 5% or so improvement though.
>
>

Here is a cleaned-up patch for comments. It is untested at the
moment because I don't have access to the 16-way NUMAQ right now.
It moves waking of the waiters outside the spinlock.

I think it gave about 5-10% improvement when the rwsem gets
really contended. Not as much as I had hoped, but every bit
helps.

The rwsem-spinlock.c code could use the same optimisation too.


[-- Attachment #2: rwsem-scale.patch --]
[-- Type: text/x-patch, Size: 4063 bytes --]


Move rwsem's up_read wakeups out of the semaphore's wait_lock


 linux-2.6-npiggin/lib/rwsem.c |   49 +++++++++++++++++++-----------------------
 1 files changed, 23 insertions(+), 26 deletions(-)

diff -puN lib/rwsem.c~rwsem-scale lib/rwsem.c
--- linux-2.6/lib/rwsem.c~rwsem-scale	2004-03-21 14:01:12.000000000 +1100
+++ linux-2.6-npiggin/lib/rwsem.c	2004-03-21 14:30:19.000000000 +1100
@@ -35,13 +35,15 @@ void rwsemtrace(struct rw_semaphore *sem
  * - the spinlock must be held by the caller
  * - woken process blocks are discarded from the list after having flags zeroised
  * - writers are only woken if wakewrite is non-zero
+ *
+ * The spinlock will be dropped by this function
  */
 static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
 {
+	LIST_HEAD(wake_list);
 	struct rwsem_waiter *waiter;
-	struct list_head *next;
 	signed long oldcount;
-	int woken, loop;
+	int woken;
 
 	rwsemtrace(sem,"Entering __rwsem_do_wake");
 
@@ -63,9 +65,8 @@ static inline struct rw_semaphore *__rws
 	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
 		goto readers_only;
 
-	list_del(&waiter->list);
+	list_move_tail(&waiter->list, &wake_list);
 	waiter->flags = 0;
-	wake_up_process(waiter->task);
 	goto out;
 
 	/* don't want to wake any writers */
@@ -74,13 +75,16 @@ static inline struct rw_semaphore *__rws
 	if (waiter->flags & RWSEM_WAITING_FOR_WRITE)
 		goto out;
 
-	/* grant an infinite number of read locks to the readers at the front of the queue
-	 * - note we increment the 'active part' of the count by the number of readers (less one
-	 *   for the activity decrement we've already done) before waking any processes up
+	/* grant an infinite number of read locks to the readers at the front
+	 * of the queue - note we increment the 'active part' of the count by
+	 * the number of readers (less one for the activity decrement we've
+	 * already done) before waking any processes up
 	 */
  readers_only:
 	woken = 0;
 	do {
+		list_move_tail(&waiter->list, &wake_list);
+		waiter->flags = 0;
 		woken++;
 
 		if (waiter->list.next==&sem->wait_list)
@@ -90,23 +94,17 @@ static inline struct rw_semaphore *__rws
 
 	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
 
-	loop = woken;
 	woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
 	woken -= RWSEM_ACTIVE_BIAS;
 	rwsem_atomic_add(woken,sem);
 
-	next = sem->wait_list.next;
-	for (; loop>0; loop--) {
-		waiter = list_entry(next,struct rwsem_waiter,list);
-		next = waiter->list.next;
-		waiter->flags = 0;
+ out:
+	spin_unlock(&sem->wait_lock);
+	while (!list_empty(&wake_list)) {
+		waiter = list_entry(wake_list.next,struct rwsem_waiter,list);
+		list_del(&waiter->list);
 		wake_up_process(waiter->task);
 	}
-
-	sem->wait_list.next = next;
-	next->prev = &sem->wait_list;
-
- out:
 	rwsemtrace(sem,"Leaving __rwsem_do_wake");
 	return sem;
 
@@ -130,9 +128,8 @@ static inline struct rw_semaphore *rwsem
 	set_task_state(tsk,TASK_UNINTERRUPTIBLE);
 
 	/* set up my own style of waitqueue */
-	spin_lock(&sem->wait_lock);
 	waiter->task = tsk;
-
+	spin_lock(&sem->wait_lock);
 	list_add_tail(&waiter->list,&sem->wait_list);
 
 	/* note that we're now waiting on the lock, but no longer actively read-locking */
@@ -143,8 +140,8 @@ static inline struct rw_semaphore *rwsem
 	 */
 	if (!(count & RWSEM_ACTIVE_MASK))
 		sem = __rwsem_do_wake(sem,1);
-
-	spin_unlock(&sem->wait_lock);
+	else
+		spin_unlock(&sem->wait_lock);
 
 	/* wait to be given the lock */
 	for (;;) {
@@ -204,8 +201,8 @@ struct rw_semaphore fastcall *rwsem_wake
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
 		sem = __rwsem_do_wake(sem,1);
-
-	spin_unlock(&sem->wait_lock);
+	else
+		spin_unlock(&sem->wait_lock);
 
 	rwsemtrace(sem,"Leaving rwsem_wake");
 
@@ -226,8 +223,8 @@ struct rw_semaphore fastcall *rwsem_down
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
 		sem = __rwsem_do_wake(sem,0);
-
-	spin_unlock(&sem->wait_lock);
+	else
+		spin_unlock(&sem->wait_lock);
 
 	rwsemtrace(sem,"Leaving rwsem_downgrade_wake");
 	return sem;

_


* Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1
  2004-03-21  4:04     ` Nick Piggin
@ 2004-03-21  7:31       ` Ingo Molnar
  2004-03-21  8:08         ` Nick Piggin
  0 siblings, 1 reply; 6+ messages in thread
From: Ingo Molnar @ 2004-03-21  7:31 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, Andrew Morton


your patch looks interesting. 

wrt. making a fully scalable MM read side:

perhaps RCU could be used to make lookup access to the vma tree and
lookup of the pagetables lockless. This would make futexes (and
pagefaults) fundamentally scalable.

another option would be to introduce an rwsem which is read-scalable,
but this would pessimise writes quite as badly as brlocks did. I'm
not sure how acceptable that is.

	Ingo

* Nick Piggin <piggin@cyberone.com.au> wrote:

> 
> 
> Nick Piggin wrote:
> 
> >
> >That would be interesting, yes. I have (somewhere) a patch
> >that wakes up the semaphore's waiters outside its spinlock.
> >I think that only gave about 5% or so improvement though.
> >
> >
> 
> Here is a cleaned up patch for comments. It is untested at the
> moment because I don't have access to the 16-way NUMAQ now. It
> moves waking of the waiters outside the spinlock.
> 
> I think it gave about 5-10% improvement when the rwsem gets
> really contended. Not as much as I had hoped, but every bit
> helps.
> 
> The rwsem-spinlock.c code could use the same optimisation too.
> 




* Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1
  2004-03-21  7:31       ` Ingo Molnar
@ 2004-03-21  8:08         ` Nick Piggin
  0 siblings, 0 replies; 6+ messages in thread
From: Nick Piggin @ 2004-03-21  8:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Andrew Morton



Ingo Molnar wrote:

>your patch looks interesting. 
>
>

I'll see if I can get some numbers for it soon.

>wrt. making a fully scalable MM read side:
>
>perhaps RCU could be used to make lookup access to the vma tree and
>lookup of the pagetables lockless. This would make futexes (and
>pagefaults) fundamentally scalable.
>
>another option would be to introduce an rwsem which is read-scalable,
>but this would pessimise writes quite as badly as brlocks did. I'm
>not sure how acceptable that is.
>
>

It is a pretty silly benchmark, but I guess one day someone
is going to complain about mm scalability.

