public inbox for linux-kernel@vger.kernel.org
* 2.5.50-BK + 24 CPUs
@ 2002-12-08 13:09 Anton Blanchard
  2002-12-08 14:49 ` Rik van Riel
  0 siblings, 1 reply; 11+ messages in thread
From: Anton Blanchard @ 2002-12-08 13:09 UTC (permalink / raw)
  To: linux-kernel


Hi,

I found time to run a few benchmarks over a largish machine (24 way
ppc64) running 2.5.50-BK from a few days ago.

1. kernel compile benchmark (i.e. build an x86 2.4.18 kernel)

I hijacked /proc/profile to log functions where we call schedule from.
It shows:

schedules:
 56283 total
 41984 pipe_wait
  9746 do_work
  1949 do_exit
  1834 sys_wait4

i.e. during the compile we scheduled 56283 times, and 41984 of those were
caused by pipes. Simple fix: remove -pipe from the Makefile of the
kernel I was building:

schedules:
  8497 total
  3665 do_work
  1878 do_exit
  1824 sys_wait4
   306 cpu_idle
   260 open_namei
   256 pipe_wait

Much nicer. Does it make sense to use -pipe in our kernel Makefile these
days? Note "do_work" is a ppc64 assembly function which checks
need_resched and calls schedule if the timeslice has been exceeded. So
it's nice to see that almost all of the schedules are due to timeslice
expiration, processes exiting, or processes doing a wait().

Now we can look at the profile:

profile:
 66260 total
 54227 cpu_idle
  1000 page_remove_rmap
   909 __get_page_state
   830 page_add_rmap
   753 save_remaining_regs
   646 do_anonymous_page
   529 do_page_fault
   475 release_pages
   468 pSeries_flush_hash_range
   462 pSeries_hpte_insert
   266 __copy_tofrom_user
   215 zap_pte_range
   214 sys_brk
   210 __pagevec_lru_add_active
   209 buffered_rmqueue
   201 find_get_page
   185 vm_enough_memory
   183 nr_free_pages

Mostly idle time; there's a limit to how much we can parallelize here.
Note: save_remaining_regs is the syscall/interrupt entry path for ppc64.

2. dbench 24

Let's not pay too much attention here, but there are a few things to
keep in mind:

schedules:
1635314 total
753694 cpu_idle
357910 ext2_new_block
289189 ext2_free_blocks
123788 ext2_new_inode
 95025 ext2_free_inode

Whee, look at all the schedules we took inside the ext2 code. Of course
it's due to the superblock lock semaphore.

profile:
370142 total
302615 cpu_idle
  8600 __copy_tofrom_user
  3119 schedule
  2760 current_kernel_time

Lots of idle time, in part due to the superblock lock (oh yeah, and my
slow-to-react finger stopping the profiling after the benchmark finished).
current_kernel_time makes a recent appearance in the profile; we are
working on a number of things to address this.

3. "University workload"

A benchmark that does lots of shell scripts, cc, troff, etc. 

schedules:
470212 total
126262 do_work
 86986 ext2_free_blocks
 58039 ext2_new_block
 53627 cpu_idle
 43140 ext2_new_inode
 30934 ext2_free_inode
 19849 do_exit
 18526 sys_wait4

The superblock lock semaphore makes an appearance in the schedule
summary again (ext2_*). Now for the profile:

profile:
136296 total
 41592 cpu_idle
 16319 page_remove_rmap
  7338 page_add_rmap
  3583 save_remaining_regs
  3072 pSeries_flush_hash_range
  2832 release_pages
  2584 do_page_fault
  2281 find_get_page
  2238 pSeries_hpte_insert
  2117 copy_page_range
  2085 current_kernel_time
  2028 zap_pte_range
  1886 __get_page_state
  1689 atomic_dec_and_lock

No big surprises in the profile. This benchmark tends to be a worst-case
scenario for rmap: think of 100s of shells all mapping the same text
pages.

Anton

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.5.50-BK + 24 CPUs
  2002-12-08 13:09 Anton Blanchard
@ 2002-12-08 14:49 ` Rik van Riel
  2002-12-08 16:45   ` Rik van Riel
  0 siblings, 1 reply; 11+ messages in thread
From: Rik van Riel @ 2002-12-08 14:49 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linux-kernel

On Mon, 9 Dec 2002, Anton Blanchard wrote:

> profile:
>  66260 total
>  54227 cpu_idle
>   1000 page_remove_rmap
>    909 __get_page_state
>    830 page_add_rmap

Looks like the bitflag locking in rmap is hurting you.
How does it work with a real spinlock in the struct page
instead of using a bit in page->flags ?

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/		http://guru.conectiva.com/
Current spamtrap:  october@surriel.com


* Re: 2.5.50-BK + 24 CPUs
  2002-12-08 14:49 ` Rik van Riel
@ 2002-12-08 16:45   ` Rik van Riel
  2002-12-09 14:08     ` Anton Blanchard
  0 siblings, 1 reply; 11+ messages in thread
From: Rik van Riel @ 2002-12-08 16:45 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linux-kernel

On Sun, 8 Dec 2002, Rik van Riel wrote:
> On Mon, 9 Dec 2002, Anton Blanchard wrote:
>
> > profile:
> >  66260 total
> >  54227 cpu_idle
> >   1000 page_remove_rmap
> >    909 __get_page_state
> >    830 page_add_rmap
>
> Looks like the bitflag locking in rmap is hurting you.
> How does it work with a real spinlock in the struct page
> instead of using a bit in page->flags ?

In particular, something like the (completely untested) patch
below.  Yes, this patch is on the wrong side of the space/time
tradeoff for machines with highmem, but it might be worth it
for 64 bit machines, especially those with slow bitops.

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/		http://guru.conectiva.com/
Current spamtrap:  october@surriel.com


===== include/linux/mm.h 1.97 vs edited =====
--- 1.97/include/linux/mm.h	Thu Nov  7 08:48:53 2002
+++ edited/include/linux/mm.h	Sun Dec  8 14:36:44 2002
@@ -169,6 +169,7 @@
 					 * protected by PG_chainlock */
 		pte_addr_t direct;
 	} pte;
+	spinlock_t ptechain_lock;	/* Lock for pte.chain and pte.direct */
 	unsigned long private;		/* mapping-private opaque data */

 	/*
===== include/linux/rmap-locking.h 1.1 vs edited =====
--- 1.1/include/linux/rmap-locking.h	Sun Sep  1 17:56:32 2002
+++ edited/include/linux/rmap-locking.h	Sun Dec  8 14:37:49 2002
@@ -14,20 +14,10 @@
 	 * busywait with less bus contention for a good time to
 	 * attempt to acquire the lock bit.
 	 */
-	preempt_disable();
-#ifdef CONFIG_SMP
-	while (test_and_set_bit(PG_chainlock, &page->flags)) {
-		while (test_bit(PG_chainlock, &page->flags))
-			cpu_relax();
-	}
-#endif
+	spin_lock(&page->ptechain_lock);
 }

 static inline void pte_chain_unlock(struct page *page)
 {
-#ifdef CONFIG_SMP
-	smp_mb__before_clear_bit();
-	clear_bit(PG_chainlock, &page->flags);
-#endif
-	preempt_enable();
+	spin_unlock(&page->ptechain_lock);
 }
===== mm/page_alloc.c 1.135 vs edited =====
--- 1.135/mm/page_alloc.c	Mon Dec  2 18:31:01 2002
+++ edited/mm/page_alloc.c	Sun Dec  8 14:39:06 2002
@@ -1129,6 +1129,7 @@
 			struct page *page = lmem_map + local_offset + i;
 			set_page_zone(page, nid * MAX_NR_ZONES + j);
 			set_page_count(page, 0);
+			page->ptechain_lock = SPIN_LOCK_UNLOCKED;
 			SetPageReserved(page);
 			INIT_LIST_HEAD(&page->list);
 #ifdef WANT_PAGE_VIRTUAL


* Re: 2.5.50-BK + 24 CPUs
@ 2002-12-08 21:22 Manfred Spraul
  2002-12-08 21:28 ` William Lee Irwin III
  0 siblings, 1 reply; 11+ messages in thread
From: Manfred Spraul @ 2002-12-08 21:22 UTC (permalink / raw)
  To: anton; +Cc: linux-kernel

Anton wrote:

>schedules:
> 56283 total
> 41984 pipe_wait
>  9746 do_work
>  1949 do_exit
>  1834 sys_wait4
>
>ie during the compile we scheduled 56283 times, and 41984 of them were
>caused by pipes.
>
The Linux pipe implementation has only a page-sized buffer - with 4 kB
pages, transferring 1 MB through a pipe means at least 512 context switches.

--
    Manfred



* Re: 2.5.50-BK + 24 CPUs
  2002-12-08 21:22 2.5.50-BK + 24 CPUs Manfred Spraul
@ 2002-12-08 21:28 ` William Lee Irwin III
  2002-12-08 23:22   ` David S. Miller
  0 siblings, 1 reply; 11+ messages in thread
From: William Lee Irwin III @ 2002-12-08 21:28 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: anton, linux-kernel

Anton wrote:
>> ie during the compile we scheduled 56283 times, and 41984 of them were
>> caused by pipes.

On Sun, Dec 08, 2002 at 10:22:03PM +0100, Manfred Spraul wrote:
> The Linux pipe implementation has only a page-sized buffer - with 4 kB
> pages, transferring 1 MB through a pipe means at least 512 context switches.

Hmm. What happened to that pipe buffer size increase patch? That sounds
like it might help here, but only if those things are trying to shove
more than 4KB through the pipe at a time.


Bill


* Re: 2.5.50-BK + 24 CPUs
  2002-12-08 23:22   ` David S. Miller
@ 2002-12-08 23:01     ` William Lee Irwin III
  2002-12-09 17:03     ` Manfred Spraul
  1 sibling, 0 replies; 11+ messages in thread
From: William Lee Irwin III @ 2002-12-08 23:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: Manfred Spraul, anton, linux-kernel

On Sun, 2002-12-08 at 13:28, William Lee Irwin III wrote:
>> Hmm. What happened to that pipe buffer size increase patch? That sounds
>> like it might help here, but only if those things are trying to shove
>> more than 4KB through the pipe at a time.

On Sun, Dec 08, 2002 at 03:22:58PM -0800, David S. Miller wrote:
> You probably mean the zero-copy pipe patches, which I think really
> should go in.  The most recent version of the diffs I saw didn't
> use the zero copy bits unless the transfers were quite large so it
> should be ok and not pessimize small transfers.
> That patch has been gathering cobwebs for more than a year now when I
> first did it, let's push this in already :-)

I was actually referring to one that explicitly used larger pipe
buffers, but this sounds useful too.


Bill


* Re: 2.5.50-BK + 24 CPUs
  2002-12-08 21:28 ` William Lee Irwin III
@ 2002-12-08 23:22   ` David S. Miller
  2002-12-08 23:01     ` William Lee Irwin III
  2002-12-09 17:03     ` Manfred Spraul
  0 siblings, 2 replies; 11+ messages in thread
From: David S. Miller @ 2002-12-08 23:22 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Manfred Spraul, anton, linux-kernel

On Sun, 2002-12-08 at 13:28, William Lee Irwin III wrote:
> Hmm. What happened to that pipe buffer size increase patch? That sounds
> like it might help here, but only if those things are trying to shove
> more than 4KB through the pipe at a time.

You probably mean the zero-copy pipe patches, which I think really
should go in.  The most recent version of the diffs I saw didn't
use the zero-copy bits unless the transfers were quite large, so it
should be OK and not pessimize small transfers.

That patch has been gathering cobwebs for more than a year now since I
first did it; let's push it in already :-)



* Re: 2.5.50-BK + 24 CPUs
  2002-12-08 16:45   ` Rik van Riel
@ 2002-12-09 14:08     ` Anton Blanchard
  0 siblings, 0 replies; 11+ messages in thread
From: Anton Blanchard @ 2002-12-09 14:08 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel


Thanks Rik,

> In particular, something like the (completely untested) patch
> below.  Yes, this patch is on the wrong side of the space/time
> tradeoff for machines with highmem, but it might be worth it
> for 64 bit machines, especially those with slow bitops.

I'll give it a spin when I next get some time on the machine.

Anton


* Re: 2.5.50-BK + 24 CPUs
  2002-12-08 23:22   ` David S. Miller
  2002-12-08 23:01     ` William Lee Irwin III
@ 2002-12-09 17:03     ` Manfred Spraul
  2002-12-09 20:15       ` David S. Miller
  1 sibling, 1 reply; 11+ messages in thread
From: Manfred Spraul @ 2002-12-09 17:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: William Lee Irwin III, anton, linux-kernel

David S. Miller wrote:

>On Sun, 2002-12-08 at 13:28, William Lee Irwin III wrote:
>  
>
>>Hmm. What happened to that pipe buffer size increase patch? That sounds
>>like it might help here, but only if those things are trying to shove
>>more than 4KB through the pipe at a time.
>>    
>>
>
>You probably mean the zero-copy pipe patches, which I think really
>should go in.  The most recent version of the diffs I saw didn't
>use the zero copy bits unless the transfers were quite large so it
>should be ok and not pessimize small transfers.
>
>That patch has been gathering cobwebs for more than a year now when I
>first did it, let's push this in already :-)
>  
>
Unfortunately zero-copy doesn't help avoid the schedules:
zero copy just avoids the copy to the kernel - you still need one schedule
for each page to be transferred.

writer calls
    for (;;) {
        prepare_data(buf);
        write(fd, buf, PAGE_SIZE);   /* blocks once the one-page buffer fills */
    }
reader calls
    for (;;) {
        read(fd, buf, PAGE_SIZE);    /* blocks while the buffer is empty */
        use_data(buf);
    }

What's needed is a larger kernel buffer - I've seen buffers between 64
and 256 kB in other Unices.
Zero copy only helps lmbench and other apps where the whole working set
fits into the CPU cache.

The difference between
    main-mem -> cache; cache -> main-mem   [non-zerocopy]
and
    main-mem -> main-mem                   [zerocopy, the copy to kernel is skipped]
is small.

--
    Manfred



* Re: 2.5.50-BK + 24 CPUs
  2002-12-09 17:03     ` Manfred Spraul
@ 2002-12-09 20:15       ` David S. Miller
  2002-12-09 21:12         ` Manfred Spraul
  0 siblings, 1 reply; 11+ messages in thread
From: David S. Miller @ 2002-12-09 20:15 UTC (permalink / raw)
  To: manfred; +Cc: wli, anton, linux-kernel

   From: Manfred Spraul <manfred@colorfullife.com>
   Date: Mon, 09 Dec 2002 18:03:10 +0100

   Unfortunately zero-copy doesn't help to avoid the schedules:
   Zero copy just avoids the copy to kernel - you still need one schedule
   for each page to be transferred.

The zerocopy patches copied up to 64k (or rather, 16 pages, something
like that) at once; that's going to lead to 16 times fewer schedules.

The 64k number was decided arbitrarily (it's what freebsd's pipe code
uses) and it can be experimented with.



* Re: 2.5.50-BK + 24 CPUs
  2002-12-09 20:15       ` David S. Miller
@ 2002-12-09 21:12         ` Manfred Spraul
  0 siblings, 0 replies; 11+ messages in thread
From: Manfred Spraul @ 2002-12-09 21:12 UTC (permalink / raw)
  To: David S. Miller; +Cc: wli, anton, linux-kernel

David S. Miller wrote:

>   From: Manfred Spraul <manfred@colorfullife.com>
>   Date: Mon, 09 Dec 2002 18:03:10 +0100
>
>   Unfortunately zero-copy doesn't help to avoid the schedules:
>   Zero copy just avoids the copy to kernel - you still need one schedule
>   for each page to be transferred.
>
>The zerocopy patches copied up to 64k (or rather, 16 pages, something
>like that) at once; that's going to lead to 16 times fewer schedules.
>
>The 64k number was decided arbitrarily (it's what freebsd's pipe code
>uses) and it can be experimented with.
>  
>
Only if user space writes in 64 kB chunks - if user space writes 4 kB
chunks, then zerocopy doesn't help much against schedules [depending on
the implementation, it halves the number of schedules].
And page-table (COW) tricks are not acceptable.

--
    Manfred


