* [BENCHMARKS] 2.6.4 vs 2.6.4-mm1
@ 2004-03-13 0:55 Nick Piggin
2004-03-19 9:50 ` Ingo Molnar
0 siblings, 1 reply; 6+ messages in thread
From: Nick Piggin @ 2004-03-13 0:55 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton
These are some benchmarks on a 16-way (4x4) NUMAQ. They basically
measure the scheduler patches with a couple of meaningless
but very scheduler-intensive benchmarks.
hackbench:
The number in () is a projection of the time a run of 1000 would
take, assuming linear scaling. It is probably better shown on a
graph, but you can see a non-linear element in 2.6.4 that is
basically absent in 2.6.4-mm1.
        2.6.4           2.6.4-mm1
 50     19.4 (388)      15.5 (310)
100     39.0 (390)      34.5 (345)
150     59.0 (393)      48.3 (322)
200     82.9 (414)      68.9 (344)
250    114.8 (459)      90.2 (360)
300    145.4 (484)     106.3 (354)
350    178.1 (508)     122.1 (348)
400    218.8 (547)     135.0 (337)
450    237.8 (528)     163.9 (364)
500    262.0 (524)     181.7 (363)
volanomark (MPS):
This one starts getting huge mmap_sem contention at 150+ coming
from futexes. Don't know what is taking the mmap_sem for writing.
Maybe just brk or mmap.
        2.6.4    2.6.4-mm1
 15      5850     6221
 30      5682     5852
 45      4736     5700
 60      2857     5622
 75      1024     4840
 90      1832     5191
105       491     5036
120      1591     4228
135       393     4986
150      1056     1586
^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1
  2004-03-13 0:55 [BENCHMARKS] 2.6.4 vs 2.6.4-mm1 Nick Piggin
@ 2004-03-19 9:50 ` Ingo Molnar
  2004-03-19 9:58   ` Nick Piggin
  0 siblings, 1 reply; 6+ messages in thread
From: Ingo Molnar @ 2004-03-19 9:50 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, Andrew Morton

* Nick Piggin <piggin@cyberone.com.au> wrote:

> volanomark (MPS):
> This one starts getting huge mmap_sem contention at 150+ coming
> from futexes. Don't know what is taking the mmap_sem for writing.
> Maybe just brk or mmap.

are you sure it's down_write() contention? down_read() can create
contention just as much, simply due to the fact that hundreds of threads
and a dozen CPUs are pounding in on the same poor lock.

i do think there should be a rw-semaphore variant that is per-cpu for
the read path. (This would also fix the 4:4 threading overhead.)

	Ingo

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1
  2004-03-19 9:50 ` Ingo Molnar
@ 2004-03-19 9:58 ` Nick Piggin
  2004-03-21 4:04   ` Nick Piggin
  0 siblings, 1 reply; 6+ messages in thread
From: Nick Piggin @ 2004-03-19 9:58 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Andrew Morton

Ingo Molnar wrote:

> * Nick Piggin <piggin@cyberone.com.au> wrote:
>
>> volanomark (MPS):
>> This one starts getting huge mmap_sem contention at 150+ coming
>> from futexes. Don't know what is taking the mmap_sem for writing.
>> Maybe just brk or mmap.
>
> are you sure it's down_write() contention? down_read() can create
> contention just as much, simply due to the fact that hundreds of threads
> and a dozen CPUs are pounding in on the same poor lock.

No I'm not sure actually, it could be just read lock contention.
IIRC it was all coming from the semaphore's spinlock, in up_read...

> i do think there should be a rw-semaphore variant that is per-cpu for
> the read path. (This would also fix the 4:4 threading overhead.)

That would be interesting, yes. I have (somewhere) a patch
that wakes up the semaphore's waiters outside its spinlock.
I think that only gave about 5% or so improvement though.

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1
  2004-03-19 9:58 ` Nick Piggin
@ 2004-03-21 4:04 ` Nick Piggin
  2004-03-21 7:31   ` Ingo Molnar
  0 siblings, 1 reply; 6+ messages in thread
From: Nick Piggin @ 2004-03-21 4:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 581 bytes --]

Nick Piggin wrote:

> That would be interesting, yes. I have (somewhere) a patch
> that wakes up the semaphore's waiters outside its spinlock.
> I think that only gave about 5% or so improvement though.

Here is a cleaned up patch for comments. It is untested at the
moment because I don't have access to the 16-way NUMAQ now. It
moves waking of the waiters outside the spinlock.

I think it gave about 5-10% improvement when the rwsem gets
really contended. Not as much as I had hoped, but every bit
helps.

The rwsem-spinlock.c code could use the same optimisation too.

[-- Attachment #2: rwsem-scale.patch --]
[-- Type: text/x-patch, Size: 4063 bytes --]

Move rwsem's up_read wakeups out of the semaphore's wait_lock

 linux-2.6-npiggin/lib/rwsem.c |   49 +++++++++++++++++++-----------------------
 1 files changed, 23 insertions(+), 26 deletions(-)

diff -puN lib/rwsem.c~rwsem-scale lib/rwsem.c
--- linux-2.6/lib/rwsem.c~rwsem-scale	2004-03-21 14:01:12.000000000 +1100
+++ linux-2.6-npiggin/lib/rwsem.c	2004-03-21 14:30:19.000000000 +1100
@@ -35,13 +35,15 @@ void rwsemtrace(struct rw_semaphore *sem
  * - the spinlock must be held by the caller
  * - woken process blocks are discarded from the list after having flags zeroised
  * - writers are only woken if wakewrite is non-zero
+ *
+ * The spinlock will be dropped by this function
  */
 static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
 {
+	LIST_HEAD(wake_list);
 	struct rwsem_waiter *waiter;
-	struct list_head *next;
 	signed long oldcount;
-	int woken, loop;
+	int woken;
 
 	rwsemtrace(sem,"Entering __rwsem_do_wake");
 
@@ -63,9 +65,8 @@ static inline struct rw_semaphore *__rws
 	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
 		goto readers_only;
 
-	list_del(&waiter->list);
+	list_move_tail(&waiter->list, &wake_list);
 	waiter->flags = 0;
-	wake_up_process(waiter->task);
 	goto out;
 
 	/* don't want to wake any writers */
@@ -74,13 +75,16 @@ static inline struct rw_semaphore *__rws
 	if (waiter->flags & RWSEM_WAITING_FOR_WRITE)
 		goto out;
 
-	/* grant an infinite number of read locks to the readers at the front of the queue
-	 * - note we increment the 'active part' of the count by the number of readers (less one
-	 *   for the activity decrement we've already done) before waking any processes up
+	/* grant an infinite number of read locks to the readers at the front
+	 * of the queue - note we increment the 'active part' of the count by
+	 * the number of readers (less one for the activity decrement we've
+	 * already done) before waking any processes up
 	 */
  readers_only:
 	woken = 0;
 	do {
+		list_move_tail(&waiter->list, &wake_list);
+		waiter->flags = 0;
 		woken++;
 
 		if (waiter->list.next==&sem->wait_list)
@@ -90,23 +94,17 @@ static inline struct rw_semaphore *__rws
 
 	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
 
-	loop = woken;
 	woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
 	woken -= RWSEM_ACTIVE_BIAS;
 	rwsem_atomic_add(woken,sem);
 
-	next = sem->wait_list.next;
-	for (; loop>0; loop--) {
-		waiter = list_entry(next,struct rwsem_waiter,list);
-		next = waiter->list.next;
-		waiter->flags = 0;
+ out:
+	spin_unlock(&sem->wait_lock);
+	while (!list_empty(&wake_list)) {
+		waiter = list_entry(wake_list.next,struct rwsem_waiter,list);
+		list_del(&waiter->list);
 		wake_up_process(waiter->task);
 	}
-
-	sem->wait_list.next = next;
-	next->prev = &sem->wait_list;
-
- out:
 	rwsemtrace(sem,"Leaving __rwsem_do_wake");
 
 	return sem;
@@ -130,9 +128,8 @@ static inline struct rw_semaphore *rwsem
 	set_task_state(tsk,TASK_UNINTERRUPTIBLE);
 
 	/* set up my own style of waitqueue */
-	spin_lock(&sem->wait_lock);
 	waiter->task = tsk;
-
+	spin_lock(&sem->wait_lock);
 	list_add_tail(&waiter->list,&sem->wait_list);
 
 	/* note that we're now waiting on the lock, but no longer actively read-locking */
@@ -143,8 +140,8 @@ static inline struct rw_semaphore *rwsem
 	 */
 	if (!(count & RWSEM_ACTIVE_MASK))
 		sem = __rwsem_do_wake(sem,1);
-
-	spin_unlock(&sem->wait_lock);
+	else
+		spin_unlock(&sem->wait_lock);
 
 	/* wait to be given the lock */
 	for (;;) {
@@ -204,8 +201,8 @@ struct rw_semaphore fastcall *rwsem_wake
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
 		sem = __rwsem_do_wake(sem,1);
-
-	spin_unlock(&sem->wait_lock);
+	else
+		spin_unlock(&sem->wait_lock);
 
 	rwsemtrace(sem,"Leaving rwsem_wake");
 
@@ -226,8 +223,8 @@ struct rw_semaphore fastcall *rwsem_down
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
 		sem = __rwsem_do_wake(sem,0);
-
-	spin_unlock(&sem->wait_lock);
+	else
+		spin_unlock(&sem->wait_lock);
 
 	rwsemtrace(sem,"Leaving rwsem_downgrade_wake");
 	return sem;
_

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1
  2004-03-21 4:04 ` Nick Piggin
@ 2004-03-21 7:31 ` Ingo Molnar
  2004-03-21 8:08   ` Nick Piggin
  0 siblings, 1 reply; 6+ messages in thread
From: Ingo Molnar @ 2004-03-21 7:31 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, Andrew Morton

your patch looks interesting.

wrt. making a fully scalable MM read side:

perhaps RCU could be used to make lookup access to the vma tree and
lookup of the pagetables lockless. This would make futexes (and
pagefaults) fundamentally scalable.

another option would be to introduce a rwsem which is read-scalable, but
this would pessimise writes quite as badly as brlocks did. I'm not sure
how acceptable that is.

	Ingo

* Nick Piggin <piggin@cyberone.com.au> wrote:

> Nick Piggin wrote:
>
> > That would be interesting, yes. I have (somewhere) a patch
> > that wakes up the semaphore's waiters outside its spinlock.
> > I think that only gave about 5% or so improvement though.
>
> Here is a cleaned up patch for comments. It is untested at the
> moment because I don't have access to the 16-way NUMAQ now. It
> moves waking of the waiters outside the spinlock.
>
> I think it gave about 5-10% improvement when the rwsem gets
> really contended. Not as much as I had hoped, but every bit
> helps.
>
> The rwsem-spinlock.c code could use the same optimisation too.

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [BENCHMARKS] 2.6.4 vs 2.6.4-mm1
  2004-03-21 7:31 ` Ingo Molnar
@ 2004-03-21 8:08 ` Nick Piggin
  0 siblings, 0 replies; 6+ messages in thread
From: Nick Piggin @ 2004-03-21 8:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Andrew Morton

Ingo Molnar wrote:

> your patch looks interesting.

I'll see if I can get some numbers for it soon.

> wrt. making a fully scalable MM read side:
>
> perhaps RCU could be used to make lookup access to the vma tree and
> lookup of the pagetables lockless. This would make futexes (and
> pagefaults) fundamentally scalable.
>
> another option would be to introduce a rwsem which is read-scalable, but
> this would pessimise writes quite as badly as brlocks did. I'm not sure
> how acceptable that is.

It is a pretty silly benchmark. But I guess one day someone is
going to complain about mm scalability.

^ permalink raw reply	[flat|nested] 6+ messages in thread
end of thread, other threads:[~2004-03-21 8:09 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-03-13 0:55 [BENCHMARKS] 2.6.4 vs 2.6.4-mm1 Nick Piggin
2004-03-19 9:50 ` Ingo Molnar
2004-03-19 9:58   ` Nick Piggin
2004-03-21 4:04     ` Nick Piggin
2004-03-21 7:31       ` Ingo Molnar
2004-03-21 8:08         ` Nick Piggin