RE: ext3 performance bottleneck as the number of spindles gets large

public inbox for linux-kernel@vger.kernel.org

* RE: ext3 performance bottleneck as the number of spindles gets large
@ 2002-06-20 21:50 Griffiths, Richard A
  2002-06-21  7:58 ` ext3 performance bottleneck as the number of spindles gets large Andrew Morton
  2002-06-23  4:02 ` Christopher E. Brown
  0 siblings, 2 replies; 25+ messages in thread
From: Griffiths, Richard A @ 2002-06-20 21:50 UTC (permalink / raw)
  To: 'Andrew Morton', mgross
  Cc: Griffiths, Richard A, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

I should have mentioned the throughput we saw on 4 adapters 6 drives was
126KB/s.  The max theoretical bus bandwidth is 640MB/s.

-----Original Message-----
From: Andrew Morton [mailto:akpm@zip.com.au]
Sent: Thursday, June 20, 2002 2:26 PM
To: mgross@unix-os.sc.intel.com
Cc: Griffiths, Richard A; 'Jens Axboe'; Linux Kernel Mailing List;
lse-tech@lists.sourceforge.net
Subject: Re: ext3 performance bottleneck as the number of spindles gets
large


mgross wrote:
> 
> On Thursday 20 June 2002 04:18 pm, Andrew Morton wrote:
> > Yup.  I take it back - high ext3 lock contention happens on 2.5
> > as well, which has block-highmem.  With heavy write traffic onto
> > six disks, two controllers, six filesystems, four CPUs the machine
> > spends about 40% of the time spinning on locks in fs/ext3/inode.c
> > You're on dual CPU, so the contention is less.
> >
> > Not very nice.  But given that the longest spin time was some
> > tens of milliseconds, with the average much lower, it shouldn't
> > affect overall I/O throughput.
> 
> How could losing 40% of your CPUs to spin locks NOT spank your
> throughput?

The limiting factor is usually disk bandwidth, seek latency, rotational
latency.  That's why I want to know your bandwidth.

> Can you copy your lockmeter data from its kernel_flag section?  I'd like to
> see it.

I don't find lockmeter very useful.  Here's oprofile output for 2.5.23:

c013ec08 873      1.07487     rmqueue                 
c018a8e4 950      1.16968     do_get_write_access     
c013b00c 969      1.19307     kmem_cache_alloc_batch  
c018165c 1120     1.37899     ext3_writepage          
c0193120 1457     1.79392     journal_add_journal_head 
c0180e30 1458     1.79515     ext3_prepare_write      
c0136948 6546     8.05969     generic_file_write      
c01838ac 42608    52.4606     .text.lock.inode      

So I lost two CPUs on the BKL in fs/ext3/inode.c.  The remaining
two should be enough to saturate all but the most heroic disk
subsystems.

A couple of possibilities come to mind:

1: Processes which should be submitting I/O against disk "A" are
   instead spending tons of time asleep in the page allocator waiting
   for I/O to complete against disk "B".

2: ext3 is just too slow for the rate of data which you're trying to
   push at it.   This exhibits as lock contention, but the root cause
   is the cost of things like ext3_mark_inode_dirty().  And *that*
   is something we can fix - can shave 75% off the cost of that.

Need more data...


> >
> > Possibly something else is happening.  Have you tested ext2?
> 
> No.  We're attempting to see if we can scale to large numbers of spindles
> with EXT3 at the moment.  Perhaps we can effect positive changes to ext3
> before giving up on it and moving to another Journaled FS.

Have you tried *any* other fs?

-

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-20 21:50 ext3 performance bottleneck as the number of spindles gets large Griffiths, Richard A
@ 2002-06-21  7:58 ` Andrew Morton
  2002-06-21 18:46   ` mgross
  2002-06-23  4:02 ` Christopher E. Brown
  1 sibling, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2002-06-21  7:58 UTC (permalink / raw)
  To: Griffiths, Richard A
  Cc: mgross, 'Jens Axboe', Linux Kernel Mailing List, lse-tech

"Griffiths, Richard A" wrote:
> 
> I should have mentioned the throughput we saw on 4 adapters 6 drives was
> 126KB/s.  The max theoretical bus bandwidth is 640MB/s.

I hope that was 128MB/s?

Please try the below patch (against 2.4.19-pre10).  It halves the lock
contention, and it does that by making the fs twice as efficient, so
that's a bonus.

I wouldn't be surprised if it made no difference.  I'm not seeing
much difference between ext2 and ext3 here.

If you have time, please test ext2 and/or reiserfs and/or ext3
in writeback mode.

And please tell us some more details regarding the performance bottleneck.
I assume that you mean that the IO rate per disk slows as more
disks are added to an adapter?  Or does the total throughput through
the adapter fall as more disks are added?

Thanks.




--- 2.4.19-pre10/fs/ext3/inode.c~ext3-speedup-1	Fri Jun 21 00:28:59 2002
+++ 2.4.19-pre10-akpm/fs/ext3/inode.c	Fri Jun 21 00:28:59 2002
@@ -1016,21 +1016,20 @@ static int ext3_prepare_write(struct fil
 	int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
 	handle_t *handle;
 
-	lock_kernel();
 	handle = ext3_journal_start(inode, needed_blocks);
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);
 		goto out;
 	}
-	unlock_kernel();
 	ret = block_prepare_write(page, from, to, ext3_get_block);
-	lock_kernel();
 	if (ret != 0)
 		goto prepare_write_failed;
 
 	if (ext3_should_journal_data(inode)) {
+		lock_kernel();
 		ret = walk_page_buffers(handle, page->buffers,
 				from, to, NULL, do_journal_get_write_access);
+		unlock_kernel();
 		if (ret) {
 			/*
 			 * We're going to fail this prepare_write(),
@@ -1043,10 +1042,12 @@ static int ext3_prepare_write(struct fil
 		}
 	}
 prepare_write_failed:
-	if (ret)
+	if (ret) {
+		lock_kernel();
 		ext3_journal_stop(handle, inode);
+		unlock_kernel();
+	}
 out:
-	unlock_kernel();
 	return ret;
 }
 
@@ -1094,7 +1095,6 @@ static int ext3_commit_write(struct file
 	struct inode *inode = page->mapping->host;
 	int ret = 0, ret2;
 
-	lock_kernel();
 	if (ext3_should_journal_data(inode)) {
 		/*
 		 * Here we duplicate the generic_commit_write() functionality
@@ -1102,22 +1102,43 @@ static int ext3_commit_write(struct file
 		int partial = 0;
 		loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
 
+		lock_kernel();
 		ret = walk_page_buffers(handle, page->buffers,
 			from, to, &partial, commit_write_fn);
+		unlock_kernel();
 		if (!partial)
 			SetPageUptodate(page);
 		kunmap(page);
 		if (pos > inode->i_size)
 			inode->i_size = pos;
 		EXT3_I(inode)->i_state |= EXT3_STATE_JDATA;
+		if (inode->i_size > inode->u.ext3_i.i_disksize) {
+			inode->u.ext3_i.i_disksize = inode->i_size;
+			lock_kernel();
+			ret2 = ext3_mark_inode_dirty(handle, inode);
+			unlock_kernel();
+			if (!ret) 
+				ret = ret2;
+		}
 	} else {
 		if (ext3_should_order_data(inode)) {
+			lock_kernel();
 			ret = walk_page_buffers(handle, page->buffers,
 				from, to, NULL, journal_dirty_sync_data);
+			unlock_kernel();
 		}
 		/* Be careful here if generic_commit_write becomes a
 		 * required invocation after block_prepare_write. */
 		if (ret == 0) {
+			/*
+			 * generic_commit_write() will run mark_inode_dirty()
+			 * if i_size changes.  So let's piggyback the
+			 * i_disksize mark_inode_dirty into that.
+			 */
+			loff_t new_i_size =
+				((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+			if (new_i_size > EXT3_I(inode)->i_disksize)
+				EXT3_I(inode)->i_disksize = new_i_size;
 			ret = generic_commit_write(file, page, from, to);
 		} else {
 			/*
@@ -1129,12 +1150,7 @@ static int ext3_commit_write(struct file
 			kunmap(page);
 		}
 	}
-	if (inode->i_size > inode->u.ext3_i.i_disksize) {
-		inode->u.ext3_i.i_disksize = inode->i_size;
-		ret2 = ext3_mark_inode_dirty(handle, inode);
-		if (!ret) 
-			ret = ret2;
-	}
+	lock_kernel();
 	ret2 = ext3_journal_stop(handle, inode);
 	unlock_kernel();
 	if (!ret)
@@ -2165,9 +2181,11 @@ bad_inode:
 /*
  * Post the struct inode info into an on-disk inode location in the
  * buffer-cache.  This gobbles the caller's reference to the
- * buffer_head in the inode location struct.  
+ * buffer_head in the inode location struct.
+ *
+ * On entry, the caller *must* have journal write access to the inode's
+ * backing block, at iloc->bh.
  */
-
 static int ext3_do_update_inode(handle_t *handle, 
 				struct inode *inode, 
 				struct ext3_iloc *iloc)
@@ -2176,12 +2194,6 @@ static int ext3_do_update_inode(handle_t
 	struct buffer_head *bh = iloc->bh;
 	int err = 0, rc, block;
 
-	if (handle) {
-		BUFFER_TRACE(bh, "get_write_access");
-		err = ext3_journal_get_write_access(handle, bh);
-		if (err)
-			goto out_brelse;
-	}
 	raw_inode->i_mode = cpu_to_le16(inode->i_mode);
 	if(!(test_opt(inode->i_sb, NO_UID32))) {
 		raw_inode->i_uid_low = cpu_to_le16(low_16_bits(inode->i_uid));
--- 2.4.19-pre10/mm/filemap.c~ext3-speedup-1	Fri Jun 21 00:28:59 2002
+++ 2.4.19-pre10-akpm/mm/filemap.c	Fri Jun 21 00:28:59 2002
@@ -2924,6 +2924,7 @@ generic_file_write(struct file *file,con
 	long		status = 0;
 	int		err;
 	unsigned	bytes;
+	time_t		time_now;
 
 	if ((ssize_t) count < 0)
 		return -EINVAL;
@@ -3026,8 +3027,12 @@ generic_file_write(struct file *file,con
 		goto out;
 
 	remove_suid(inode);
-	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
-	mark_inode_dirty_sync(inode);
+	time_now = CURRENT_TIME;
+	if (inode->i_ctime != time_now || inode->i_mtime != time_now) {
+		inode->i_ctime = time_now;
+		inode->i_mtime = time_now;
+		mark_inode_dirty_sync(inode);
+	}
 
 	if (file->f_flags & O_DIRECT)
 		goto o_direct;
--- 2.4.19-pre10/fs/jbd/transaction.c~ext3-speedup-1	Fri Jun 21 00:28:59 2002
+++ 2.4.19-pre10-akpm/fs/jbd/transaction.c	Fri Jun 21 00:28:59 2002
@@ -237,7 +237,9 @@ handle_t *journal_start(journal_t *journ
 	handle->h_ref = 1;
 	current->journal_info = handle;
 
+	lock_kernel();
 	err = start_this_handle(journal, handle);
+	unlock_kernel();
 	if (err < 0) {
 		kfree(handle);
 		current->journal_info = NULL;
@@ -1388,8 +1390,10 @@ int journal_stop(handle_t *handle)
 	transaction->t_outstanding_credits -= handle->h_buffer_credits;
 	transaction->t_updates--;
 	if (!transaction->t_updates) {
-		wake_up(&journal->j_wait_updates);
-		if (journal->j_barrier_count)
+		if (waitqueue_active(&journal->j_wait_updates))
+			wake_up(&journal->j_wait_updates);
+		if (journal->j_barrier_count &&
+			waitqueue_active(&journal->j_wait_transaction_locked))
 			wake_up(&journal->j_wait_transaction_locked);
 	}
 

-


* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-21  7:58 ` ext3 performance bottleneck as the number of spindles gets large Andrew Morton
@ 2002-06-21 18:46   ` mgross
  2002-06-21 19:26     ` Chris Mason
  2002-06-21 19:56     ` Andrew Morton
  0 siblings, 2 replies; 25+ messages in thread
From: mgross @ 2002-06-21 18:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Griffiths, Richard A, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

Andrew Morton wrote:

>"Griffiths, Richard A" wrote:
>
>>I should have mentioned the throughput we saw on 4 adapters 6 drives was
>>126KB/s.  The max theoretical bus bandwidth is 640MB/s.
>>
>
>I hope that was 128MB/s?
>
Yes, that was MB/s.  The data was taken in KB and a set of three zeros was missing.

>
>
>Please try the below patch (against 2.4.19-pre10).  It halves the lock
>contention, and it does that by making the fs twice as efficient, so
>that's a bonus.
>
We'll give it a try.  I'm on travel right now so it may be a few days if 
Richard doesn't get to it before I get back.

>
>
>I wouldn't be surprised if it made no difference.  I'm not seeing
>much difference between ext2 and ext3 here.
>
>If you have time, please test ext2 and/or reiserfs and/or ext3
>in writeback mode.
>
Soon after we finish beating the ext3 file system up I'll take a swing 
at some other file systems.

>
>And please tell us some more details regarding the performance bottleneck.
>I assume that you mean that the IO rate per disk slows as more
>disks are added to an adapter?  Or does the total throughput through
>the adapter fall as more disks are added?
>
No, the IO block write throughput for the system goes down as drives are 
added under this work load.  We measure the system throughput not the 
per drive throughput, but one could infer the per drive throughput by 
dividing.

Running bonnie++ with 300MB files doing 8KB sequential writes, we get
the following system-wide throughput as a function of the number of
drives attached and the number of adapters.

One adapter
1 drive per adapter     127,702 KB/sec
2 drives per adapter     93,283 KB/sec
6 drives per adapter     85,626 KB/sec

2 adapters
1 drive per adapter      92,095 KB/sec
2 drives per adapter    110,956 KB/sec
6 drives per adapter    106,883 KB/sec

4 adapters
1 drive per adapter     121,125 KB/sec
2 drives per adapter    117,575 KB/sec
6 drives per adapter    116,570 KB/sec

Not too pretty.

--mgross



* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-21 18:46   ` mgross
@ 2002-06-21 19:26     ` Chris Mason
  2002-06-21 19:56     ` Andrew Morton
  1 sibling, 0 replies; 25+ messages in thread
From: Chris Mason @ 2002-06-21 19:26 UTC (permalink / raw)
  To: mgross
  Cc: Andrew Morton, Griffiths, Richard A, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

On Fri, 2002-06-21 at 14:46, mgross wrote:
> Andrew Morton wrote:

> >
> >Please try the below patch (against 2.4.19-pre10).  It halves the lock
> >contention, and it does that by making the fs twice as efficient, so
> >that's a bonus.
> >
> We'll give it a try.  I'm on travel right now so it may be a few days if 
> Richard doesn't get to it before I get back.

You might want to try this too, Andrew fixed UPDATE_ATIME() to only call
the dirty_inode method once per second, but generic_file_write should do
the same.  It reduces BKL contention by reducing calls to ext3 and
reiserfs dirty_inode calls, which are much more expensive than simply
marking the inode dirty.

-chris

--- linux/mm/filemap.c Mon, 28 Jan 2002 09:51:50 -0500 
+++ linux/mm/filemap.c Sun, 12 May 2002 16:16:59 -0400 
@@ -2826,6 +2826,14 @@
 	}
 }
 
+static void update_inode_times(struct inode *inode) 
+{
+	time_t now = CURRENT_TIME;
+	if (inode->i_ctime != now || inode->i_mtime != now) {
+	    inode->i_ctime = inode->i_mtime = now;
+	    mark_inode_dirty_sync(inode);
+	} 
+}
 /*
  * Write to a file through the page cache. 
  *
@@ -2955,8 +2963,7 @@
 		goto out;
 
 	remove_suid(inode);
-	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
-	mark_inode_dirty_sync(inode);
+	update_inode_times(inode);
 
 	if (file->f_flags & O_DIRECT)
 		goto o_direct;
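
A rough userspace model of the idea in this patch may help. It is not kernel code: `struct model_inode`, `model_update_inode_times()` and `model_count_dirty_calls()` are hypothetical names made up for illustration, and the counter merely stands in for the call to mark_inode_dirty_sync(). Assuming one-second timestamp granularity, it shows the dirty call firing at most once per second regardless of how many writes arrive.

```c
#include <time.h>

/* Hypothetical stand-in for the few struct inode fields the patch touches. */
struct model_inode {
	time_t i_ctime;
	time_t i_mtime;
	unsigned long dirty_calls;	/* would-be mark_inode_dirty_sync() calls */
};

/* Mirrors update_inode_times() from the patch, with "now" passed in. */
void model_update_inode_times(struct model_inode *inode, time_t now)
{
	if (inode->i_ctime != now || inode->i_mtime != now) {
		inode->i_ctime = inode->i_mtime = now;
		inode->dirty_calls++;
	}
}

/* Simulate writes_per_sec writes in each of `seconds` consecutive seconds
 * and return how many times the inode would have been marked dirty. */
unsigned long model_count_dirty_calls(unsigned long writes_per_sec,
				      unsigned long seconds)
{
	struct model_inode inode = { 0, 0, 0 };
	unsigned long s, w;

	for (s = 0; s < seconds; s++) {
		time_t now = (time_t)(1000 + s);	/* arbitrary nonzero start */

		for (w = 0; w < writes_per_sec; w++)
			model_update_inode_times(&inode, now);
	}
	return inode.dirty_calls;
}
```

In this model, 3,000 writes spread over three seconds (1,000 per second) result in three dirty calls rather than 3,000.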



* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-21 18:46   ` mgross
  2002-06-21 19:26     ` Chris Mason
@ 2002-06-21 19:56     ` Andrew Morton
  1 sibling, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2002-06-21 19:56 UTC (permalink / raw)
  To: mgross
  Cc: Griffiths, Richard A, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

mgross wrote:
> 
> ...
> >And please tell us some more details regarding the performance bottleneck.
> >I assume that you mean that the IO rate per disk slows as more
> >disks are added to an adapter?  Or does the total throughput through
> >the adapter fall as more disks are added?
> >
> No, the IO block write throughput for the system goes down as drives are
> added under this work load.  We measure the system throughput not the
> per drive throughput, but one could infer the per drive throughput by
> dividing.
> 
> Running bonnie++ with 300MB files doing 8KB sequential writes, we get
> the following system-wide throughput as a function of the number of
> drives attached and the number of adapters.
> 
> One adapter
> 1 drive per adapter     127,702 KB/sec
> 2 drives per adapter     93,283 KB/sec
> 6 drives per adapter     85,626 KB/sec

127 megabytes/sec to a single disk?  Either that's a very
fast disk, or you're using very small bytes :)

> 2 adapters
> 1 drive per adapter      92,095 KB/sec
> 2 drives per adapter    110,956 KB/sec
> 6 drives per adapter    106,883 KB/sec
> 
> 4 adapters
> 1 drive per adapter     121,125 KB/sec
> 2 drives per adapter    117,575 KB/sec
> 6 drives per addapter   116,570 KB/Sec
> 

Possibly what is happening here is that a significant amount
of dirty data is being left in memory and is escaping the
measurement period.   When you run the test against more disks,
the *total* amount of dirty memory is increased, so the kernel
is forced to perform more writeback within the measurement period.

So with two filesystems, you're actually performing more I/O.

You need to either ensure that all I/O is occurring *within the
measurement interval*, or make the test write so much data (wrt
main memory size) that any leftover unwritten stuff is insignificant.

bonnie++ is too complex for this work.  Suggest you use
http://www.zip.com.au/~akpm/linux/write-and-fsync.c
which will just write and fsync a file.  Time how long that
takes.  Or you could experiment with bonnie++'s fsync option.

My suggestion is to work with this workload:

for i in /mnt/1 /mnt/2 /mnt/3 /mnt/4 ...
do
	write-and-fsync $i/foo 4000 &
done

which will write a 4 gig file to each disk.  This will defeat
any caching effects and is just a way simpler workload, which
will allow you to test one thing in isolation.

So anyway.  All this possibly explains the "negative scalability"
in the single-adapter case.  For four adapters with one disk on
each, 120 megs/sec seems reasonable, assuming the sustained
write bandwidth of a single disk is 30 megs/sec.

For four adapters, six disks on each you should be doing better.
Something does appear to be wrong there.

-


* RE: ext3 performance bottleneck as the number of spindles gets large
  2002-06-20 21:50 ext3 performance bottleneck as the number of spindles gets large Griffiths, Richard A
  2002-06-21  7:58 ` ext3 performance bottleneck as the number of spindles gets large Andrew Morton
@ 2002-06-23  4:02 ` Christopher E. Brown
  2002-06-23  4:33   ` Andreas Dilger
  1 sibling, 1 reply; 25+ messages in thread
From: Christopher E. Brown @ 2002-06-23  4:02 UTC (permalink / raw)
  To: Griffiths, Richard A
  Cc: 'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

On Thu, 20 Jun 2002, Griffiths, Richard A wrote:

> I should have mentioned the throughput we saw on 4 adapters 6 drives was
> 126KB/s.  The max theoretical bus bandwidth is 640MB/s.

This is *NOT* correct.  Assuming a 64bit 66Mhz PCI bus your MAX is
503MB/sec minus PCI overhead...

This of course assumes nothing else is using the PCI bus.

120 something MB/sec sounds a hell of a lot like topping out a 32bit
33Mhz PCI bus, but IIRC the earlier posting listed 39160 cards, PCI
64bit w/ backward compat to 32bit.

You do have *ALL* of these cards plugged into a full PCI 64bit/66Mhz
slot right?  Not plugging them into a 32bit/33Mhz slot?

32bit/33Mhz	(32 * 33,000,000) / (1024 * 1024 * 8) = 125.89 MByte/sec
64bit/33Mhz	(64 * 33,000,000) / (1024 * 1024 * 8) = 251.77 MByte/sec
64bit/66Mhz	(64 * 66,000,000) / (1024 * 1024 * 8) = 503.54 MByte/sec

NOTE: PCI transfer rates are often listed as

32bit/33Mhz, 132 MByte/sec
64bit/33Mhz, 264 MByte/sec
64bit/66Mhz, 528 MByte/sec

This is somewhat true, but only if we start with Mbit rates as used in
transmission rates (1,000,000 bits/sec) and work from there, instead
of 2^20 (1,048,576).  I will not argue about PCI 32bit/33Mhz being
1056Mbit, if talking about line rate, but when we are talking about
storage media and transfers to/from as measured by files remember to
convert.
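
The arithmetic above can be checked with a few lines of C. This is only a sketch: the function names are made up here, the clock is taken as exactly 33 or 66 MHz as in the table, and the results are theoretical peaks that ignore PCI protocol overhead.

```c
#include <stdio.h>

/* Theoretical peak in bytes/sec: width_bits/8 bytes moved on every clock. */
double pci_peak_bytes_per_sec(unsigned width_bits, double clock_hz)
{
	return (width_bits / 8.0) * clock_hz;
}

/* Decimal megabytes (10^6 bytes), the unit behind the 132/264/528 figures. */
double to_decimal_mb(double bytes)
{
	return bytes / 1e6;
}

/* Binary megabytes (2^20 bytes), the unit behind 125.89/251.77/503.54. */
double to_binary_mb(double bytes)
{
	return bytes / (1024.0 * 1024.0);
}

/* Print both conventions for the three bus configurations in the table. */
void print_pci_peaks(void)
{
	static const struct { unsigned bits; double hz; } bus[] = {
		{ 32, 33e6 }, { 64, 33e6 }, { 64, 66e6 },
	};
	size_t i;

	for (i = 0; i < sizeof(bus) / sizeof(bus[0]); i++) {
		double b = pci_peak_bytes_per_sec(bus[i].bits, bus[i].hz);

		printf("%ubit/%.0fMHz  %.2f MB/s (10^6)  %.2f MB/s (2^20)\n",
		       bus[i].bits, bus[i].hz / 1e6,
		       to_decimal_mb(b), to_binary_mb(b));
	}
}
```

The two columns differ only in the divisor, which is the whole of the 132 versus 125.89 discrepancy.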

-- 
I route, therefore you are.


* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  4:02 ` Christopher E. Brown
@ 2002-06-23  4:33   ` Andreas Dilger
  2002-06-23  6:00     ` Christopher E. Brown
  0 siblings, 1 reply; 25+ messages in thread
From: Andreas Dilger @ 2002-06-23  4:33 UTC (permalink / raw)
  To: Christopher E. Brown
  Cc: Griffiths, Richard A, 'Andrew Morton', mgross,
	'Jens Axboe', Linux Kernel Mailing List, lse-tech

On Jun 22, 2002  22:02 -0600, Christopher E. Brown wrote:
> On Thu, 20 Jun 2002, Griffiths, Richard A wrote:
> 
> > I should have mentioned the throughput we saw on 4 adapters 6 drives was
> > 126KB/s.  The max theoretical bus bandwidth is 640MB/s.
> 
> This is *NOT* correct.  Assuming a 64bit 66Mhz PCI bus your MAX is
> 503MB/sec minus PCI overhead...

Assuming you only have a single PCI bus...

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/



* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  4:33   ` Andreas Dilger
@ 2002-06-23  6:00     ` Christopher E. Brown
  2002-06-23  6:35       ` [Lse-tech] " William Lee Irwin III
  0 siblings, 1 reply; 25+ messages in thread
From: Christopher E. Brown @ 2002-06-23  6:00 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Griffiths, Richard A, 'Andrew Morton', mgross,
	'Jens Axboe', Linux Kernel Mailing List, lse-tech

On Sat, 22 Jun 2002, Andreas Dilger wrote:

> On Jun 22, 2002  22:02 -0600, Christopher E. Brown wrote:
> > On Thu, 20 Jun 2002, Griffiths, Richard A wrote:
> >
> > > I should have mentioned the throughput we saw on 4 adapters 6 drives was
> > > 126KB/s.  The max theoretical bus bandwidth is 640MB/s.
> >
> > This is *NOT* correct.  Assuming a 64bit 66Mhz PCI bus your MAX is
> > 503MB/sec minus PCI overhead...
>
> Assuming you only have a single PCI bus...


Yes, we could (for example) assume a DP264 board, it features 2/4/8
way memory interleave, dual 21264 CPUs, and 2 separate PCI 64bit 66Mhz
buses.

However, multiple busses are *rare* on x86.  There are a lot of chained
busses via PCI to PCI bridge, but few systems with 2 or more PCI
busses of any type with parallel access to the CPU.


 --
I route, therefore you are.



* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  6:00     ` Christopher E. Brown
@ 2002-06-23  6:35       ` William Lee Irwin III
  2002-06-23  7:29         ` Dave Hansen
  2002-06-23 17:06         ` Eric W. Biederman
  0 siblings, 2 replies; 25+ messages in thread
From: William Lee Irwin III @ 2002-06-23  6:35 UTC (permalink / raw)
  To: Christopher E. Brown
  Cc: Andreas Dilger, Griffiths, Richard A, 'Andrew Morton',
	mgross, 'Jens Axboe', Linux Kernel Mailing List, lse-tech

On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
> However, multiple busses are *rare* on x86.  There are a lot of chained
> busses via PCI to PCI bridge, but few systems with 2 or more PCI
> busses of any type with parallel access to the CPU.

NUMA-Q has them.


Cheers,
Bill


* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  6:35       ` [Lse-tech] " William Lee Irwin III
@ 2002-06-23  7:29         ` Dave Hansen
  2002-06-23  7:36           ` William Lee Irwin III
  2002-06-23 17:06         ` Eric W. Biederman
  1 sibling, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2002-06-23  7:29 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

William Lee Irwin III wrote:
> On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
> 
>>However, multiple busses are *rare* on x86.  There are a lot of chained
>>busses via PCI to PCI bridge, but few systems with 2 or more PCI
>>busses of any type with parallel access to the CPU.
> 
> NUMA-Q has them.
> 

Yep, 2 independent busses per quad.  That's a _lot_ of busses when you 
have an 8 or 16 quad system.  (I wonder who has one of those... ;)

Almost all of the server-type boxes that we play with have multiple 
PCI busses.  Even my old dual-PPro has 2.

-- 
Dave Hansen
haveblue@us.ibm.com



* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  7:29         ` Dave Hansen
@ 2002-06-23  7:36           ` William Lee Irwin III
  2002-06-23  7:45             ` Dave Hansen
  0 siblings, 1 reply; 25+ messages in thread
From: William Lee Irwin III @ 2002-06-23  7:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

>> On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
>>> However, multiple busses are *rare* on x86.  There are a lot of chained
>>> busses via PCI to PCI bridge, but few systems with 2 or more PCI
>>> busses of any type with parallel access to the CPU.

William Lee Irwin III wrote:
>> NUMA-Q has them.


On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
> Yep, 2 independent busses per quad.  That's a _lot_ of busses when you 
> have an 8 or 16 quad system.  (I wonder who has one of those... ;)
> Almost all of the server-type boxes that we play with have multiple 
> PCI busses.  Even my old dual-PPro has 2.

I thought I saw 3 PCI and 1 ISA per-quad., but maybe that's the
"independent" bit coming into play.


Cheers,
Bill


* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  7:36           ` William Lee Irwin III
@ 2002-06-23  7:45             ` Dave Hansen
  2002-06-23  7:55               ` Christopher E. Brown
  2002-06-23 16:21               ` Martin J. Bligh
  0 siblings, 2 replies; 25+ messages in thread
From: Dave Hansen @ 2002-06-23  7:45 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

William Lee Irwin III wrote:
 > On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
 >> Yep, 2 independent busses per quad.  That's a _lot_ of busses
 >> when you have an 8 or 16 quad system.  (I wonder who has one of
 >> those... ;) Almost all of the server-type boxes that we play with
 >>  have multiple PCI busses.  Even my old dual-PPro has 2.
 >
 > I thought I saw 3 PCI and 1 ISA per-quad., but maybe that's the
 > "independent" bit coming into play.
 >
Hmmmm.  Maybe there is another one for the onboard devices.  I thought
that there were 8 slots and 4 per bus.  I could
be wrong.  BTW, the ISA slot is EISA and as far as I can tell is only
used for the MDC.


-- 
Dave Hansen
haveblue@us.ibm.com



* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  7:45             ` Dave Hansen
@ 2002-06-23  7:55               ` Christopher E. Brown
  2002-06-23  8:11                 ` David Lang
  2002-06-23  8:31                 ` Dave Hansen
  2002-06-23 16:21               ` Martin J. Bligh
  1 sibling, 2 replies; 25+ messages in thread
From: Christopher E. Brown @ 2002-06-23  7:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: William Lee Irwin III, Andreas Dilger, Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

On Sun, 23 Jun 2002, Dave Hansen wrote:

> William Lee Irwin III wrote:
>  > On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
>  >> Yep, 2 independent busses per quad.  That's a _lot_ of busses
>  >> when you have an 8 or 16 quad system.  (I wonder who has one of
>  >> those... ;) Almost all of the server-type boxes that we play with
>  >>  have multiple PCI busses.  Even my old dual-PPro has 2.
>  >
>  > I thought I saw 3 PCI and 1 ISA per-quad., but maybe that's the
>  > "independent" bit coming into play.
>  >
> Hmmmm.  Maybe there is another one for the onboard devices.  I thought
> that there were 8 slots and 4 per bus.  I could
> be wrong.  BTW, the ISA slot is EISA and as far as I can tell is only
> used for the MDC.


Do you mean independent in that there are 2 sets of 4 slots each
detected as a separate PCI bus, or independent in that each set of 4
had *direct* access to the cpu side, and *does not* access via a
PCI:PCI bridge?



I have stacks of PPro/PII/Xeon boards around, but 9 out of 10 have
chained buses.  Even the old PPro x 6 (Avion 6600/ALR 6x6/Unisys
HR/HS6000) had 2 PCI buses, however the second BUS hung off of a
PCI:PCI bridge.


-- 
I route, therefore you are.



* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  7:55               ` Christopher E. Brown
@ 2002-06-23  8:11                 ` David Lang
  2002-06-23  8:31                 ` Dave Hansen
  1 sibling, 0 replies; 25+ messages in thread
From: David Lang @ 2002-06-23  8:11 UTC (permalink / raw)
  To: Christopher E. Brown
  Cc: Dave Hansen, William Lee Irwin III, Andreas Dilger,
	Griffiths, Richard A, 'Andrew Morton', mgross,
	'Jens Axboe', Linux Kernel Mailing List, lse-tech

most chipsets only have one PCI bus on them so any others need to be
bridged to that one.

David Lang

On Sun, 23 Jun 2002, Christopher E. Brown wrote:

> Date: Sun, 23 Jun 2002 01:55:28 -0600 (MDT)
> From: Christopher E. Brown <cbrown@woods.net>
> To: Dave Hansen <haveblue@us.ibm.com>
> Cc: William Lee Irwin III <wli@holomorphy.com>,
>      Andreas Dilger <adilger@clusterfs.com>,
>      "Griffiths, Richard A" <richard.a.griffiths@intel.com>,
>      'Andrew Morton' <akpm@zip.com.au>, mgross@unix-os.sc.intel.com,
>      'Jens Axboe' <axboe@suse.de>,
>      Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
>      lse-tech@lists.sourceforge.net
> Subject: Re: [Lse-tech] Re: ext3 performance bottleneck as the number of
>     spindles gets large
>
> On Sun, 23 Jun 2002, Dave Hansen wrote:
>
> > William Lee Irwin III wrote:
> >  > On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
> >  >> Yep, 2 independent busses per quad.  That's a _lot_ of busses
> >  >> when you have an 8 or 16 quad system.  (I wonder who has one of
> >  >> those... ;) Almost all of the server-type boxes that we play with
> >  >>  have multiple PCI busses.  Even my old dual-PPro has 2.
> >  >
> >  > I thought I saw 3 PCI and 1 ISA per-quad., but maybe that's the
> >  > "independent" bit coming into play.
> >  >
> > Hmmmm.  Maybe there is another one for the onboard devices.  I thought
> > that there were 8 slots and 4 per bus.  I could
> > be wrong.  BTW, the ISA slot is EISA and as far as I can tell is only
> > used for the MDC.
>
>
> Do you mean independent in that there are 2 sets of 4 slots each
> detected as a separate PCI bus, or independent in that each set of 4
> had *direct* access to the cpu side, and *does not* access via a
> PCI:PCI bridge?
>
>
>
> I have stacks of PPro/PII/Xeon boards around, but 9 out of 10 have
> chained buses.  Even the old PPro x 6 (Avion 6600/ALR 6x6/Unisys
> HR/HS6000) had 2 PCI buses, however the second BUS hung off of a
> PCI:PCI bridge.
>
>
> --
> I route, therefore you are.
>


* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  7:55               ` Christopher E. Brown
  2002-06-23  8:11                 ` David Lang
@ 2002-06-23  8:31                 ` Dave Hansen
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2002-06-23  8:31 UTC (permalink / raw)
  To: Christopher E. Brown
  Cc: William Lee Irwin III, Andreas Dilger, Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

Christopher E. Brown wrote:
> Do you mean independent in that there are 2 sets of 4 slots each
> detected as a separate PCI bus, or independent in that each set of 4
> had *direct* access to the cpu side, and *does not* access via a
> PCI:PCI bridge?

No PCI:PCI bridges, at least for NUMA-Q.
http://telia.dl.sourceforge.net/sourceforge/lse/linux_on_numaq.pdf

-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  7:45             ` Dave Hansen
  2002-06-23  7:55               ` Christopher E. Brown
@ 2002-06-23 16:21               ` Martin J. Bligh
  1 sibling, 0 replies; 25+ messages in thread
From: Martin J. Bligh @ 2002-06-23 16:21 UTC (permalink / raw)
  To: Dave Hansen, William Lee Irwin III
  Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

>  >> Yep, 2 independent busses per quad.  That's a _lot_ of busses
>  >> when you have an 8 or 16 quad system.  (I wonder who has one of
>  >> those... ;) Almost all of the server-type boxes that we play with
>  >>  have multiple PCI busses.  Even my old dual-PPro has 2.
>  >
>  > I thought I saw 3 PCI and 1 ISA per-quad., but maybe that's the
>  > "independent" bit coming into play.
>  >
> Hmmmm.  Maybe there is another one for the onboard devices.  I thought
> that there were 8 slots and 4 per bus.  I could
> be wrong.  BTW, the ISA slot is EISA and as far as I can tell is only
> used for the MDC.

NUMA-Q has 2 PCI buses per quad, 3 slots in one, 4 in the other,
plus the EISA slots.

Multiple independent PCI buses are also available on other more
common architectures, eg Netfinity 8500R, x360, x440, etc.

Anything with the Intel Profusion chipset will have this feature;
the bottleneck becomes the "P6 system bus" backplane they're all
connected to, which has a theoretical limit of 800Mb/s IIRC, though
nobody's been able to get more than 420Mb/s out of it in practice,
as far as I know. 

The thing that makes the NUMA-Q a massive IO shovelling engine is
having one of these IO backplanes per quad too ... 16 x 800Mb/s
= 12.8Gb/s ;-)
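
For anyone checking the arithmetic, here is a minimal sketch of the
aggregate figure. The 800 and 420 values are the ones quoted above (the
800 is from memory, "IIRC"); scaling the 420 measured figure linearly
across 16 quads is purely an assumption made for illustration, not
something anyone in this thread has measured.

```python
# Aggregate I/O backplane bandwidth for a 16-quad NUMA-Q, using the
# per-quad figures quoted in the message above (units as quoted there).
QUADS = 16
PER_QUAD_THEORETICAL = 800   # theoretical "P6 system bus" limit, as quoted (IIRC)
PER_QUAD_MEASURED = 420      # best figure anyone is said to have achieved

# Straight multiplication, as in the message: 16 x 800 = 12800 -> 12.8 "G"
theoretical_total = QUADS * PER_QUAD_THEORETICAL / 1000

# ASSUMPTION: the measured per-backplane ceiling scales linearly with quads.
measured_total = QUADS * PER_QUAD_MEASURED / 1000

print(f"theoretical aggregate: {theoretical_total:.1f}")
print(f"aggregate at the measured per-quad ceiling: {measured_total:.2f}")
```

If the measured ceiling did scale that way, the practical aggregate
would be a bit over half of the theoretical number.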

M.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  6:35       ` [Lse-tech] " William Lee Irwin III
  2002-06-23  7:29         ` Dave Hansen
@ 2002-06-23 17:06         ` Eric W. Biederman
  1 sibling, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2002-06-23 17:06 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

William Lee Irwin III <wli@holomorphy.com> writes:

> On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
> > However, multiple busses are *rare* on x86.  There are a lot of chained
> > busses via PCI to PCI bridge, but few systems with 2 or more PCI
> > busses of any type with parallel access to the CPU.
> 
> NUMA-Q has them.

As does the latest round of dual P4 Xeon chipsets: the Intel E7500 and
the Serverworks Grand Champion.

So on new systems this is easy to get if you want it.

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: ext3 performance bottleneck as the number of spindles gets la rge
@ 2002-06-20 15:26 Griffiths, Richard A
  2002-06-20 20:18 ` ext3 performance bottleneck as the number of spindles gets large Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Griffiths, Richard A @ 2002-06-20 15:26 UTC (permalink / raw)
  To: 'Jens Axboe', Andrew Morton
  Cc: mgross, Linux Kernel Mailing List, lse-tech, Griffiths, Richard A

We ran without highmem enabled so the Kernel only saw 1GB of memory.

Richard

-----Original Message-----
From: Jens Axboe [mailto:axboe@suse.de]
Sent: Wednesday, June 19, 2002 11:05 PM
To: Andrew Morton
Cc: mgross@unix-os.sc.intel.com; Linux Kernel Mailing List;
lse-tech@lists.sourceforge.net; richard.a.griffiths@intel.com
Subject: Re: ext3 performance bottleneck as the number of spindles gets
large


On Wed, Jun 19 2002, Andrew Morton wrote:
> mgross wrote:
> > 
> > We've been doing some throughput comparisons and benchmarks of block I/O
> > throughput for 8KB writes as the number of SCSI adapters and drives per
> > adapter is increased.
> > 
> > The Linux platform is a dual processor 1.2GHz PIII, 2Gig of RAM, 2U box.
> > Similar results have been seen with both 2.4.16 and 2.4.18 base kernel, as
> > well as one of those patched up O(1) 2.4.18 kernels out there.
> 
> umm.  Are you not using block-highmem?  That is a must-have.
> 
> http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre9aa2/00_block-highmem-all-18b-12.gz

please use

http://www.kernel.org/pub/linux/kernel/people/axboe/patches/v2.4/2.4.19-pre10/block-highmem-all-19.bz2

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-20 15:26 ext3 performance bottleneck as the number of spindles gets la rge Griffiths, Richard A
@ 2002-06-20 20:18 ` Andrew Morton
  2002-06-20 18:08   ` mgross
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2002-06-20 20:18 UTC (permalink / raw)
  To: Griffiths, Richard A
  Cc: 'Jens Axboe', mgross, Linux Kernel Mailing List, lse-tech

"Griffiths, Richard A" wrote:
> 
> We ran without highmem enabled so the Kernel only saw 1GB of memory.
> 

Yup.  I take it back - high ext3 lock contention happens on 2.5
as well, which has block-highmem.  With heavy write traffic onto
six disks, two controllers, six filesystems, four CPUs the machine
spends about 40% of the time spinning on locks in fs/ext3/inode.c.
You're on dual CPU, so the contention is less.

Not very nice.  But given that the longest spin time was some
tens of milliseconds, with the average much lower, it shouldn't
affect overall I/O throughput.

Possibly something else is happening.  Have you tested ext2?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-20 20:18 ` ext3 performance bottleneck as the number of spindles gets large Andrew Morton
@ 2002-06-20 18:08   ` mgross
  2002-06-20 21:25     ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: mgross @ 2002-06-20 18:08 UTC (permalink / raw)
  To: Andrew Morton, Griffiths, Richard A
  Cc: 'Jens Axboe', Linux Kernel Mailing List, lse-tech

On Thursday 20 June 2002 04:18 pm, Andrew Morton wrote:
> Yup.  I take it back - high ext3 lock contention happens on 2.5
> as well, which has block-highmem.  With heavy write traffic onto
> six disks, two controllers, six filesystems, four CPUs the machine
> spends about 40% of the time spinning on locks in fs/ext3/inode.c.
> You're on dual CPU, so the contention is less.
>
> Not very nice.  But given that the longest spin time was some
> tens of milliseconds, with the average much lower, it shouldn't
> affect overall I/O throughput.

How could losing 40% of your CPUs to spin locks NOT spank your throughput?
Can you copy your lockmeter data from its kernel_flag section?  I'd like to
see it.

>
> Possibly something else is happening.  Have you tested ext2?

No.  We're attempting to see if we can scale to large numbers of spindles 
with EXT3 at the moment.  Perhaps we can effect positive changes to ext3 
before giving up on it and moving to another Journaled FS.


--mgross

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-20 18:08   ` mgross
@ 2002-06-20 21:25     ` Andrew Morton
  0 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2002-06-20 21:25 UTC (permalink / raw)
  To: mgross
  Cc: Griffiths, Richard A, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

mgross wrote:
> 
> On Thursday 20 June 2002 04:18 pm, Andrew Morton wrote:
> > Yup.  I take it back - high ext3 lock contention happens on 2.5
> > as well, which has block-highmem.  With heavy write traffic onto
> > six disks, two controllers, six filesystems, four CPUs the machine
> > spends about 40% of the time spinning on locks in fs/ext3/inode.c.
> > You're on dual CPU, so the contention is less.
> >
> > Not very nice.  But given that the longest spin time was some
> > tens of milliseconds, with the average much lower, it shouldn't
> > affect overall I/O throughput.
> 
> How could losing 40% of your CPUs to spin locks NOT spank your throughput?

The limiting factor is usually disk bandwidth, seek latency, rotational
latency.  That's why I want to know your bandwidth.

> Can you copy your lockmeter data from its kernel_flag section?  I'd like to
> see it.

I don't find lockmeter very useful.  Here's oprofile output for 2.5.23:

c013ec08 873      1.07487     rmqueue                 
c018a8e4 950      1.16968     do_get_write_access     
c013b00c 969      1.19307     kmem_cache_alloc_batch  
c018165c 1120     1.37899     ext3_writepage          
c0193120 1457     1.79392     journal_add_journal_head 
c0180e30 1458     1.79515     ext3_prepare_write      
c0136948 6546     8.05969     generic_file_write      
c01838ac 42608    52.4606     .text.lock.inode      

So I lost two CPUs on the BKL in fs/ext3/inode.c.  The remaining
two should be enough to saturate all but the most heroic disk
subsystems.
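
The "two CPUs" figure follows from the profile above. A minimal sketch
of the arithmetic, assuming the percent column is each symbol's share of
all samples taken across the whole machine, and using the four-CPU count
from the test box described earlier in this thread:

```python
# oprofile rows quoted above: address, samples, percent, symbol
OPROFILE = """\
c013ec08 873      1.07487     rmqueue
c018a8e4 950      1.16968     do_get_write_access
c013b00c 969      1.19307     kmem_cache_alloc_batch
c018165c 1120     1.37899     ext3_writepage
c0193120 1457     1.79392     journal_add_journal_head
c0180e30 1458     1.79515     ext3_prepare_write
c0136948 6546     8.05969     generic_file_write
c01838ac 42608    52.4606     .text.lock.inode
"""
CPUS = 4  # four-CPU test machine, as described earlier in the thread

rows = []
for line in OPROFILE.splitlines():
    addr, samples, percent, symbol = line.split()
    rows.append((symbol, int(samples), float(percent)))

# The hottest symbol is the out-of-line lock spin code for fs/ext3/inode.c.
top_symbol, top_samples, top_percent = max(rows, key=lambda r: r[2])

# ASSUMPTION: percent is a share of samples summed over all CPUs, so
# multiplying by the CPU count gives "CPUs' worth" of time spent spinning.
cpus_lost = top_percent / 100 * CPUS
print(f"{top_symbol}: {top_percent:.1f}% of samples, about {cpus_lost:.1f} of {CPUS} CPUs")
```

That comes to roughly 2.1 CPUs' worth of spinning, leaving about two for
everything else.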

A couple of possibilities come to mind:

1: Processes which should be submitting I/O against disk "A" are
   instead spending tons of time asleep in the page allocator waiting
   for I/O to complete against disk "B".

2: ext3 is just too slow for the rate of data which you're trying to
   push at it.   This exhibits as lock contention, but the root cause
   is the cost of things like ext3_mark_inode_dirty().  And *that*
   is something we can fix - can shave 75% off the cost of that.

Need more data...

> >
> > Possibly something else is happening.  Have you tested ext2?
> 
> No.  We're attempting to see if we can scale to large numbers of spindles
> with EXT3 at the moment.  Perhaps we can effect positive changes to ext3
> before giving up on it and moving to another Journaled FS.

Have you tried *any* other fs?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* ext3 performance bottleneck as the number of spindles gets large
@ 2002-06-19 21:29 mgross
  2002-06-20  0:54 ` Andrew Morton
  2002-06-20  1:55 ` Andrew Morton
  0 siblings, 2 replies; 25+ messages in thread
From: mgross @ 2002-06-19 21:29 UTC (permalink / raw)
  To: Linux Kernel Mailing List, lse-tech; +Cc: richard.a.griffiths

[-- Attachment #1: Type: text/plain, Size: 2309 bytes --]

We've been doing some throughput comparisons and benchmarks of block I/O 
throughput for 8KB writes as the number of SCSI adapters and drives per 
adapter is increased.

The Linux platform is a dual processor 1.2GHz PIII, 2Gig of RAM, 2U box.
Similar results have been seen with both 2.4.16 and 2.4.18 base kernel, as 
well as one of those patched up O(1) 2.4.18 kernels out there.

The benchmark is Bonnie++.

What seems to be happening is that the throughput for 8KB sequential writes with 
300MB files goes down as the number of spindles goes up. We have negative scaling WRT 
spindles per SCSI adapter, and very poor scaling per SCSI adapter.

(The other 2 processor + OS platform sees its throughput go up with adapters and 
spindles. )

Running this benchmark with lockmeter ends up pointing a big finger at BKL 
contention in: ext3_commit_write, ext3_dirty_inode, ext3_get_block_handle, 
and ext3_prepare_write (twice!).  Attached is the output from the worst 
case, 4 SCSI adapters with 6 drives per adapter.

Has anyone done any work looking into the I/O scaling of Linux / ext3 per 
spindle or per adapter?  We would like to compare notes.

I've only just started to look at the ext3 code but it seems to me that replacing the 
BKL with a per-filesystem ext3 lock could remove some of the contention that's 
getting measured.  What data is the BKL protecting in these ext3 functions?  Could a 
lock per FS approach work?
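
To make the per-FS lock question concrete, here is a toy model, not
kernel code. It only counts time spent inside lock-protected sections
and ignores disk time, the journal, and everything else. The function
name, the one-filesystem-per-drive layout (24 filesystems for the 4x6
case), and the unit of work per filesystem are all hypothetical values
picked for illustration.

```python
def lock_bound_time(work_per_fs, cpus, per_fs_locks):
    """Best-case elapsed time spent in critical sections (toy model).

    work_per_fs  -- lock-held time needed by each filesystem (arbitrary units)
    cpus         -- number of CPUs available to run critical sections
    per_fs_locks -- False: one global lock (BKL-like); True: one lock per FS
    """
    if not per_fs_locks:
        # One global lock: every critical section runs one at a time.
        return sum(work_per_fs)
    # Per-FS locks: sections on different filesystems may overlap, limited
    # by CPU count.  Greedy longest-first placement approximates the best case.
    loads = [0.0] * cpus
    for work in sorted(work_per_fs, reverse=True):
        idx = loads.index(min(loads))
        loads[idx] += work
    return max(loads)


FILESYSTEMS = 24   # HYPOTHETICAL: one filesystem per drive, 4 adapters x 6 drives
CPUS = 2           # dual-processor box described above
work = [1.0] * FILESYSTEMS  # HYPOTHETICAL: equal lock-held time per filesystem

global_lock = lock_bound_time(work, CPUS, per_fs_locks=False)
per_fs = lock_bound_time(work, CPUS, per_fs_locks=True)
print(f"global lock: {global_lock}, per-FS locks: {per_fs}")
```

In this model the per-FS variant is bounded by CPU count rather than by
the number of filesystems. Whether real ext3 behaves that way depends on
what the BKL is actually protecting, which is the open question.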

Thoughts? 
Comments?
Ideas?

--mgross



- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

        3.7%  0.7us(  44ms)  7.8us(  44ms)(22.9%)  49644038 96.3%  3.7% 0.00%  *TOTAL*

 26.6% 71.2%   13us(  44ms)  8.0us(8076us)( 5.8%)    632107 28.8% 71.2%    0%    ext3_commit_write+0x38
  4.4% 30.3%  4.3us( 360us)   13us(7511us)( 2.1%)    316124 69.7% 30.3%    0%    ext3_dirty_inode+0x2c
 28.1%  7.9%   14us(1660us)  9.7us(6842us)(0.78%)    632239 92.1%  7.9%    0%    ext3_get_block_handle+0x8c
  1.2% 27.2%  0.6us( 240us)   11us(6604us)( 3.0%)    632107 72.8% 27.2%    0%    ext3_prepare_write+0x34
 0.26% 88.1%  0.1us(  74us)  9.6us(7026us)( 8.6%)    632107 11.9% 88.1%    0%    ext3_prepare_write+0xe0
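
A rough way to total the spin-wait cost of those five call sites is
sketched below. It assumes the parenthesised "(% CPU)" field is the
share of CPU time spent waiting at that call site, and compares the sum
against the 21.4% shown for kernel_flag as a whole in the attached full
report. The regular expression is an ad hoc parser for this excerpt, not
a general lockmeter parser.

```python
import re

# The five ext3 call-site rows from the lockmeter excerpt above.
ROWS = """\
 26.6% 71.2%   13us(  44ms)  8.0us(8076us)( 5.8%)    632107 28.8% 71.2%    0%    ext3_commit_write+0x38
  4.4% 30.3%  4.3us( 360us)   13us(7511us)( 2.1%)    316124 69.7% 30.3%    0%    ext3_dirty_inode+0x2c
 28.1%  7.9%   14us(1660us)  9.7us(6842us)(0.78%)    632239 92.1%  7.9%    0%    ext3_get_block_handle+0x8c
  1.2% 27.2%  0.6us( 240us)   11us(6604us)( 3.0%)    632107 72.8% 27.2%    0%    ext3_prepare_write+0x34
 0.26% 88.1%  0.1us(  74us)  9.6us(7026us)( 8.6%)    632107 11.9% 88.1%    0%    ext3_prepare_write+0xe0
"""
KERNEL_FLAG_WAIT_PCT = 21.4  # kernel_flag "(% CPU)" value from the full report

# Only the "(% CPU)" field is a parenthesised number ending in '%'.
WAIT_CPU = re.compile(r"\(\s*([\d.]+)%\)")

wait_by_site = {}
for line in ROWS.splitlines():
    name = line.split()[-1]                       # call site is the last column
    wait_by_site[name] = float(WAIT_CPU.search(line).group(1))

ext3_wait_pct = sum(wait_by_site.values())
share_of_bkl = ext3_wait_pct / KERNEL_FLAG_WAIT_PCT
print(f"ext3 call sites: {ext3_wait_pct:.2f}% CPU spent waiting")
print(f"share of all kernel_flag wait: {share_of_bkl:.0%}")
```

On these numbers the five ext3 call sites account for nearly all of the
time spent spinning on the BKL.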


[-- Attachment #2: lm_4x6_300MBw --]
[-- Type: text/plain, Size: 40755 bytes --]

Lockmeter statistics are now RESET
Lockmeter statistics are now ON
___________________________________________________________________________________________
System: Linux TSRLT2 2.4.18 #2 SMP Mon Jun 17 08:28:25 PDT 2002 i686
Total counts

All (2) CPUs

Start time: Mon Jun 17 11:12:36 2002
End   time: Mon Jun 17 11:13:07 2002
Delta Time: 30.99 sec.
Hash table slots in use:      314.
Global read lock slots in use: 884.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

        3.7%  0.7us(  44ms)  7.8us(  44ms)(22.9%)  49644038 96.3%  3.7% 0.00%  *TOTAL*

 0.00%    0%  1.6us( 3.4us)    0us                        3  100%    0%    0%  [0xdff2bf90]
 0.00%    0%  3.4us( 3.4us)    0us                        1  100%    0%    0%    complete+0x1c
 0.00%    0%  1.3us( 1.3us)    0us                        1  100%    0%    0%    wait_for_completion+0x18
 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%    wait_for_completion+0x98

 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%  [0xdff56694]
 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%    exec_mmap+0x8c

 0.00%    0%  1.2us( 2.0us)    0us                        7  100%    0%    0%  [0xf410c22c]
 0.00%    0%  0.9us( 1.5us)    0us                        5  100%    0%    0%    unmap_fixup+0x8c
 0.00%    0%  1.9us( 2.0us)    0us                        2  100%    0%    0%    unmap_fixup+0x134

 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%  [0xf683bf0c]
 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%    neigh_destroy+0x108

 0.00%    0%  2.5us( 3.9us)    0us                       15  100%    0%    0%  [0xf6e170d0]
 0.00%    0%  2.5us( 3.9us)    0us                       15  100%    0%    0%    dev_watchdog+0x14

 0.00%    0%  0.4us( 0.7us)    0us                        2  100%    0%    0%  [0xf703f504]
 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%    skb_recv_datagram+0x90
 0.00%    0%  0.7us( 0.7us)    0us                        1  100%    0%    0%    unix_dgram_sendmsg+0x35c

 0.37% 0.31%  2.5us(  12us)  3.9us(  10us)(0.00%)     45356 99.7% 0.31%    0%  allocator_request_lock
 0.07% 0.49%  1.0us( 7.8us)  3.9us(  10us)(0.00%)     22678 99.5% 0.49%    0%    scsi_free+0x1c
 0.30% 0.14%  4.1us(  12us)  3.7us( 7.0us)(0.00%)     22678 99.9% 0.14%    0%    scsi_malloc+0x48

 0.00%    0%  0.1us( 0.6us)    0us                       36  100%    0%    0%  arbitration_lock
 0.00%    0%  0.1us( 0.6us)    0us                       30  100%    0%    0%    deny_write_access+0xc
 0.00%    0%  0.2us( 0.3us)    0us                        6  100%    0%    0%    get_write_access+0xc

 0.00%    0%  0.7us( 2.8us)    0us                     1011  100%    0%    0%  bdev_lock
 0.00%    0%  0.7us( 2.8us)    0us                     1011  100%    0%    0%    bdget+0x34

 0.00%    0%  4.1us( 5.8us)    0us                       18  100%    0%    0%  call_lock
 0.00%    0%  4.1us( 5.8us)    0us                       18  100%    0%    0%    smp_call_function+0x58

 0.00%    0%  0.2us( 0.9us)    0us                        6  100%    0%    0%  cdev_lock
 0.00%    0%  0.2us( 0.9us)    0us                        6  100%    0%    0%    cdput+0x28

  2.9% 0.59%  0.7us( 6.5us)  1.0us( 3.6us)(0.01%)   1256678 99.4% 0.59%    0%  contig_page_data+0xa8
 0.72%  1.1%  0.4us( 5.8us)  1.0us( 3.4us)(0.01%)    628341 98.9%  1.1%    0%    __free_pages_ok+0xc8
  2.2% 0.06%  1.1us( 6.5us)  0.9us( 3.6us)(0.00%)    628337  100% 0.06%    0%    rmqueue+0x28

 0.14% 0.02%  0.1us(  84us)  4.4us(  47us)(0.00%)    325106  100% 0.02%    0%  dcache_lock
 0.00%    0%  0.1us( 0.6us)    0us                       20  100%    0%    0%    d_alloc+0x128
 0.00%    0%  0.1us( 0.1us)    0us                        2  100%    0%    0%    d_delete+0x10
 0.00%    0%  0.1us( 0.3us)    0us                       23  100%    0%    0%    d_instantiate+0x1c
 0.01% 0.02%  0.4us(  24us)  0.9us( 0.9us)(0.00%)      4375  100% 0.02%    0%    d_lookup+0x5c
 0.00%    0%  0.2us( 1.2us)    0us                       20  100%    0%    0%    d_rehash+0x40
 0.00%    0%  1.2us( 1.2us)    0us                        1  100%    0%    0%    do_readv_writev+0x28c
 0.00% 0.09%  0.1us( 3.1us)  1.2us( 1.2us)(0.00%)      1069  100% 0.09%    0%    dput+0x30
 0.00%    0%  2.0us( 3.4us)    0us                       10  100%    0%    0%    link_path_walk+0x2a8
 0.00%    0%  0.2us( 0.7us)    0us                        4  100%    0%    0%    notify_change+0xec
 0.00%    0%  1.1us( 1.9us)    0us                        4  100%    0%    0%    prune_dcache+0x14
 0.00% 0.87%  0.5us(  59us)  0.9us( 1.5us)(0.00%)      3231 99.1% 0.87%    0%    prune_dcache+0x138
 0.00%    0%  2.3us( 2.3us)    0us                        1  100%    0%    0%    sys_getcwd+0xc8
 0.00% 0.30%  0.2us( 1.4us)  0.6us( 0.6us)(0.00%)       334 99.7% 0.30%    0%    sys_read+0xac
 0.13% 0.01%  0.1us(  84us)  8.0us(  47us)(0.00%)    316012  100% 0.01%    0%    sys_write+0xac

 0.20% 0.10%  1.3us(  41us)  1.7us( 6.1us)(0.00%)     47856 99.9% 0.10%    0%  device_request_lock
 0.01% 0.15%  0.1us( 1.6us)  2.0us( 6.1us)(0.00%)     23928 99.8% 0.15%    0%    __scsi_release_command+0x14
 0.19% 0.05%  2.4us(  41us)  1.0us( 1.7us)(0.00%)     23928  100% 0.05%    0%    scsi_allocate_device+0x30

 0.00%    0%  0.2us( 2.5us)    0us                      642  100%    0%    0%  files_lock
 0.00%    0%  0.1us( 1.1us)    0us                      200  100%    0%    0%    file_move+0x18
 0.00%    0%  0.1us( 0.8us)    0us                      201  100%    0%    0%    fput+0x80
 0.00%    0%  0.3us( 2.5us)    0us                      203  100%    0%    0%    get_empty_filp+0xc
 0.00%    0%  0.7us( 1.4us)    0us                       37  100%    0%    0%    get_empty_filp+0xdc
 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%    put_filp+0x18

  2.9%  8.9%   32us(1727us)    0us                    27945 91.1%    0%  8.9%  global_bh_lock
  2.9%  8.9%   32us(1727us)    0us                    27945 91.1%    0%  8.9%    bh_action+0x18

 0.05%    0%  5.2us( 7.8us)    0us                     3099  100%    0%    0%  i8253_lock
 0.05%    0%  5.2us( 7.8us)    0us                     3099  100%    0%    0%    timer_interrupt+0x2c

 0.01%    0%  1.2us( 2.5us)    0us                     3099  100%    0%    0%  i8259A_lock
 0.01%    0%  1.2us( 2.5us)    0us                     3099  100%    0%    0%    timer_interrupt+0x90

 0.00%    0%  0.4us( 0.4us)    0us                        1  100%    0%    0%  inet_peer_unused_lock
 0.00%    0%  0.4us( 0.4us)    0us                        1  100%    0%    0%    cleanup_once+0x24

 0.00%    0%  0.1us( 1.2us)    0us                       34  100%    0%    0%  init_mm+0x2c
 0.00%    0%  0.9us( 1.2us)    0us                        2  100%    0%    0%    __vmalloc+0x70
 0.00%    0%  0.1us( 0.3us)    0us                       32  100%    0%    0%    __vmalloc+0x120

 0.00%    0%  0.5us( 295us)    0us                     2768  100%    0%    0%  inode_lock
 0.00%    0%  0.4us( 1.9us)    0us                      309  100%    0%    0%    __mark_inode_dirty+0x48
 0.00%    0%  0.8us( 1.0us)    0us                        4  100%    0%    0%    get_empty_inode+0x24
 0.00%    0%  0.7us( 1.3us)    0us                       12  100%    0%    0%    get_new_inode+0x34
 0.00%    0%  1.3us( 2.0us)    0us                       12  100%    0%    0%    iget4+0x3c
 0.00%    0%  0.5us( 0.5us)    0us                        2  100%    0%    0%    insert_inode_hash+0x44
 0.00%    0%  0.1us( 2.1us)    0us                     2138  100%    0%    0%    iput+0x68
 0.00%    0%  182us( 295us)    0us                        4  100%    0%    0%    prune_icache+0x1c
 0.00%    0%  5.8us( 8.0us)    0us                        6  100%    0%    0%    sync_unlocked_inodes+0x10
 0.00%    0%  1.1us(  42us)    0us                      281  100%    0%    0%    sync_unlocked_inodes+0x10c

  2.0%  2.6%  0.8us( 103us)  2.6us(  29us)(0.08%)    788392 97.4%  2.6%    0%  io_request_lock
 0.00%  6.5%  0.6us( 2.0us)  4.6us(  13us)(0.00%)       262 93.5%  6.5%    0%    __get_request_wait+0x90
  1.1%  2.0%  0.5us(  46us)  3.0us(  29us)(0.06%)    666006 98.0%  2.0%    0%    __make_request+0xc0
 0.19%  4.4%  2.4us(  36us)  2.8us(  22us)(0.00%)     23953 95.6%  4.4%    0%    ahc_linux_isr+0x2ec
 0.03%  9.6%  4.6us(  46us)  1.6us(  15us)(0.00%)      2319 90.4%  9.6%    0%    generic_unplug_device+0x10
 0.31%  8.9%  4.0us( 103us)  1.5us(  27us)(0.01%)     24068 91.1%  8.9%    0%    scsi_dispatch_cmd+0x11c
 0.02%  3.9%  0.3us( 1.9us)  2.3us(  20us)(0.00%)     23928 96.1%  3.9%    0%    scsi_finish_command+0x18
 0.17%  3.0%  2.2us(  36us)  2.2us(  15us)(0.00%)     23928 97.0%  3.0%    0%    scsi_queue_next_request+0x18
 0.16%  7.0%  2.0us(  37us)  1.4us(  11us)(0.00%)     23928 93.0%  7.0%    0%    scsi_request_fn+0x31c

 0.51%    0%  0.1us( 147us)    0us                  1300436  100%    0%    0%  jh_splice_lock
 0.27%    0%  0.1us(  59us)    0us                   665500  100%    0%    0%    __journal_remove_journal_head+0xe8
 0.24%    0%  0.1us( 147us)    0us                   634936  100%    0%    0%    journal_add_journal_head+0xd0

 12.8% 0.49%  0.1us(  15ms)  1.3us(  15ms)(0.33%)  33219940 99.5% 0.49%    0%  journal_datalist_lock
 0.00%    0%  2.3us( 2.3us)    0us                        1  100%    0%    0%    dispose_buffer+0x18
 0.00%  2.7%  1.1us( 7.8us)  0.7us( 1.0us)(0.00%)       828 97.3%  2.7%    0%    do_get_write_access+0x9c
  1.6% 0.25%  0.1us( 163us)  1.0us(  84us)(0.03%)   8220760 99.8% 0.25%    0%    do_get_write_access+0x204
  2.0% 0.40%  0.1us( 208us)  0.9us( 182us)(0.05%)   8855491 99.6% 0.40%    0%    journal_add_journal_head+0x10
 0.62% 0.49%  0.3us( 158us)  0.8us( 6.0us)(0.00%)    634936 99.5% 0.49%    0%    journal_add_journal_head+0x88
 0.01%    0%   25us(1934us)    0us                      179  100%    0%    0%    journal_commit_transaction+0x1bc
  3.0% 0.53% 2460us(  15ms)  1.0us( 1.2us)(0.00%)       377 99.5% 0.53%    0%    journal_commit_transaction+0x258
 0.08%    0%   23us( 369us)    0us                     1111  100%    0%    0%    journal_commit_transaction+0x3c0
 0.01% 0.73%  1.4us(  50us)  1.0us( 1.4us)(0.00%)      1786 99.3% 0.73%    0%    journal_commit_transaction+0xd5c
 0.00%  1.1%  0.2us( 0.8us)  1.1us( 1.3us)(0.00%)       179 98.9%  1.1%    0%    journal_commit_transaction+0xee4
 0.45% 0.44%  0.2us( 111us)  1.0us( 2.8us)(0.00%)    632107 99.6% 0.44%    0%    journal_dirty_data+0x54
  2.7% 0.28%  0.2us( 578us)  0.9us(  50us)(0.02%)   5376063 99.7% 0.28%    0%    journal_dirty_metadata+0x54
 0.01% 0.70%  0.4us(  19us)  1.0us( 2.9us)(0.00%)      5537 99.3% 0.70%    0%    journal_file_buffer+0x18
 0.00%    0%  2.2us(  20us)    0us                      619  100%    0%    0%    journal_get_create_access+0x130
 0.60%  9.4%  0.3us( 106us)  1.8us(  15ms)(0.18%)    632510 90.6%  9.4%    0%    journal_try_to_free_buffers+0x4c
 0.00% 0.36%  0.2us( 2.4us)  0.8us( 1.1us)(0.00%)      1965 99.6% 0.36%    0%    journal_unfile_buffer+0xc
  1.7% 0.28%  0.1us( 320us)  1.0us( 263us)(0.04%)   8855491 99.7% 0.28%    0%    journal_unlock_journal_head+0xc

 0.01%    0%   21us(  41us)    0us                      176  100%    0%    0%  kbd_controller_lock
 0.01%    0%   21us(  41us)    0us                      176  100%    0%    0%    keyboard_interrupt+0x14

 64.2% 46.6%  7.0us(  44ms)   10us(  13ms)(21.4%)   2845886 53.4% 46.6%    0%  kernel_flag
 0.00%    0%  1.3us( 1.3us)    0us                        1  100%    0%    0%    chrdev_open+0x4c
 0.00% 57.1%  0.4us( 0.5us)  8.6us(  20us)(0.00%)         7 42.9% 57.1%    0%    de_put+0x28
 0.00% 20.0%   89us( 142us)  4.2us( 4.2us)(0.00%)         5 80.0% 20.0%    0%    do_exit+0xd8
 26.6% 71.2%   13us(  44ms)  8.0us(8076us)( 5.8%)    632107 28.8% 71.2%    0%    ext3_commit_write+0x38
 0.00%    0%   43us(  43us)    0us                        1  100%    0%    0%    ext3_delete_inode+0x48
  4.4% 30.3%  4.3us( 360us)   13us(7511us)( 2.1%)    316124 69.7% 30.3%    0%    ext3_dirty_inode+0x2c
 0.00%    0%  2.2us( 2.2us)    0us                        1  100%    0%    0%    ext3_force_commit+0x38
 28.1%  7.9%   14us(1660us)  9.7us(6842us)(0.78%)    632239 92.1%  7.9%    0%    ext3_get_block_handle+0x8c
  1.2% 27.2%  0.6us( 240us)   11us(6604us)( 3.0%)    632107 72.8% 27.2%    0%    ext3_prepare_write+0x34
 0.26% 88.1%  0.1us(  74us)  9.6us(7026us)( 8.6%)    632107 11.9% 88.1%    0%    ext3_prepare_write+0xe0
 0.00%    0%  5.1us( 5.1us)    0us                        1  100%    0%    0%    get_chrfops+0x88
 0.00%  100%  0.8us( 0.8us)   33us(  33us)(0.00%)         1    0%  100%    0%    locks_remove_posix+0x3c
 0.00%    0%  137us( 221us)    0us                        2  100%    0%    0%    lookup_hash+0x7c
 0.00% 25.0%   17us(  49us)   27us(  27us)(0.00%)         4 75.0% 25.0%    0%    notify_change+0x50
 0.00% 50.0%   40us( 121us)  5.6us( 9.4us)(0.00%)        16 50.0% 50.0%    0%    real_lookup+0x64
  3.7% 58.4% 1007us(  15ms) 1062us(  13ms)( 1.1%)      1136 41.6% 58.4%    0%    schedule+0x508
 0.00% 83.3%  181us( 284us)   17us(  57us)(0.00%)         6 16.7% 83.3%    0%    sync_old_buffers+0x1c
 0.00%    0%  3.2us( 4.2us)    0us                        3  100%    0%    0%    sys_ioctl+0x4c
 0.00% 50.0%  1.3us( 2.1us)   33us( 105us)(0.00%)         8 50.0% 50.0%    0%    sys_llseek+0x88
 0.00%    0%  0.8us( 0.8us)    0us                        1  100%    0%    0%    sys_lseek+0x70
 0.00% 50.0%  7.5us( 9.1us)   13us(  13us)(0.00%)         2 50.0% 50.0%    0%    sys_sysctl+0x70
 0.00%    0%  115us( 172us)    0us                        2  100%    0%    0%    vfs_create+0x84
 0.00%    0%   13us(  13us)    0us                        1  100%    0%    0%    vfs_link+0xa4
 0.00%    0%   12us(  24us)    0us                        2  100%    0%    0%    vfs_readdir+0x68
 0.00%    0%   15us(  16us)    0us                        2  100%    0%    0%    vfs_unlink+0x108

 0.00%    0%  0.6us( 0.9us)    0us                        5  100%    0%    0%  lastpid_lock
 0.00%    0%  0.6us( 0.9us)    0us                        5  100%    0%    0%    get_pid+0x20

 0.00%    0%  0.5us( 1.3us)    0us                      176  100%    0%    0%  logbuf_lock
 0.00%    0%  0.5us( 1.3us)    0us                      176  100%    0%    0%    release_console_sem+0x1c

  7.5% 0.74%  0.9us(  44ms)   14us(  44ms)(0.46%)   2683880 99.3% 0.74%    0%  lru_list_lock
  1.2%  1.5%  7.1us( 499us)  2.5us( 134us)(0.00%)     50765 98.5%  1.5%    0%    balance_dirty+0x18
 0.06% 16.8%   24us( 121us)   12us(  91us)(0.00%)       792 83.2% 16.8%    0%    bdflush+0x98
 0.54%    0%   28ms(  44ms)    0us                        6  100%    0%    0%    bdflush+0xb8
 0.14% 0.42%  0.1us(  85us)  2.0us(  19us)(0.01%)    632101 99.6% 0.42%    0%    buffer_insert_inode_data_queue+0x10
 0.00%  100%  0.4us( 0.4us)   11us(  11us)(0.00%)         1    0%  100%    0%    fsync_inode_buffers+0x28
 0.00%    0%  0.7us( 0.7us)    0us                        1  100%    0%    0%    fsync_inode_data_buffers+0x28
 0.00%    0%  0.6us( 0.6us)    0us                        1  100%    0%    0%    fsync_inode_data_buffers+0xb4
 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%    fsync_inode_data_buffers+0x128
 0.00%  3.8%  0.1us( 0.7us)  1.4us( 3.5us)(0.00%)      1760 96.2%  3.8%    0%    inode_has_buffers+0x10
 0.00%  1.1%  0.1us( 0.7us)  1.1us( 1.9us)(0.00%)       877 98.9%  1.1%    0%    invalidate_inode_buffers+0x10
 0.19%    0% 9733us(  17ms)    0us                        6  100%    0%    0%    kupdate+0x98
 0.00%    0%  1.0us( 1.0us)    0us                        1  100%    0%    0%    osync_inode_buffers+0x14
 0.00%    0%  0.8us( 0.8us)    0us                        1  100%    0%    0%    osync_inode_data_buffers+0x14
  1.2% 0.29%  0.3us( 192us)   45us(  44ms)(0.29%)   1363983 99.7% 0.29%    0%    refile_buffer+0xc
 0.00%    0%  0.5us( 0.8us)    0us                        6  100%    0%    0%    sync_old_buffers+0x64
  4.2%  1.9%  2.1us( 384us)  7.7us(  29ms)(0.15%)    633578 98.1%  1.9%    0%    try_to_free_buffers+0x1c

 0.00%    0%  0.5us( 1.9us)    0us                       55  100%    0%    0%  mmlist_lock
 0.00%    0%  0.1us( 0.1us)    0us                        4  100%    0%    0%    copy_mm+0x120
 0.00%    0%  0.5us( 0.5us)    0us                        1  100%    0%    0%    exec_mmap+0x50
 0.00%    0%  0.4us( 0.6us)    0us                        5  100%    0%    0%    mmput+0x28
 0.00%    0%  0.6us( 1.9us)    0us                       45  100%    0%    0%    swap_out+0x50

 0.00%    0%  0.3us( 1.7us)    0us                      223  100%    0%    0%  page_uptodate_lock.0
 0.00%    0%  0.3us( 1.7us)    0us                      223  100%    0%    0%    end_buffer_io_async+0x38

  5.7% 0.86%  0.9us( 278us)  1.6us( 240us)(0.04%)   1901672 99.1% 0.86%    0%  pagecache_lock
 0.00%  2.5%  0.5us( 2.4us)  1.2us( 2.4us)(0.00%)      1028 97.5%  2.5%    0%    __find_get_page+0x18
  2.6% 0.35%  1.3us( 278us)  4.7us( 240us)(0.02%)    632107 99.7% 0.35%    0%    __find_lock_page+0xc
  1.6%  1.1%  0.8us( 196us)  1.0us( 158us)(0.01%)    632348 98.9%  1.1%    0%    add_to_page_cache_unique+0x18
 0.00% 0.29%  0.6us( 3.5us)  1.4us( 1.4us)(0.00%)       349 99.7% 0.29%    0%    do_generic_file_read+0x1a4
 0.00%    0%  1.1us( 1.5us)    0us                        2  100%    0%    0%    do_generic_file_read+0x370
 0.00%    0%  0.1us( 0.9us)    0us                      282  100%    0%    0%    filemap_fdatasync+0x20
 0.00% 0.35%  0.1us( 1.1us)  2.0us( 2.0us)(0.00%)       282 99.6% 0.35%    0%    filemap_fdatawait+0x14
 0.00% 0.69%  0.8us(  54us)  1.0us( 2.0us)(0.00%)      1011 99.3% 0.69%    0%    find_or_create_page+0x38
 0.00% 0.49%  0.7us( 2.4us)  1.0us( 1.4us)(0.00%)      1011 99.5% 0.49%    0%    find_or_create_page+0x78
 0.00% 0.68%  0.1us( 1.2us)  0.8us( 0.8us)(0.00%)       146 99.3% 0.68%    0%    page_cache_read+0x48
 0.00%    0%  0.3us( 0.3us)    0us                        1  100%    0%    0%    remove_inode_page+0x18
 0.00%  4.5%  0.1us( 0.9us)  1.4us( 1.5us)(0.00%)       111 95.5%  4.5%    0%    set_page_dirty+0x24
  1.5%  1.1%  0.7us( 163us)  1.1us(  69us)(0.01%)    632992 98.9%  1.1%    0%    shrink_cache+0x2c0
 0.00%    0%  0.9us( 0.9us)    0us                        1  100%    0%    0%    truncate_inode_pages+0x38
 0.00%    0%  0.2us( 0.2us)    0us                        1  100%    0%    0%    truncate_list_pages+0x158

  6.3% 20.8%  1.5us( 956us)  1.2us( 953us)(0.54%)   1310161 79.2% 20.8%    0%  pagemap_lru_lock
 0.00%  2.3%  0.7us(  71us)  1.0us( 2.2us)(0.00%)      1688 97.7%  2.3%    0%    activate_page+0xc
 0.69%  2.6%  0.3us( 166us)  1.5us( 254us)(0.04%)    634142 97.4%  2.6%    0%    lru_cache_add+0x1c
 0.00% 0.70%  0.5us( 135us)  1.0us( 1.2us)(0.00%)       714 99.3% 0.70%    0%    lru_cache_del+0xc
 0.03% 12.7%  0.4us(  30us)  1.7us(  79us)(0.01%)     19785 87.3% 12.7%    0%    refill_inactive+0x10
 0.11% 11.6%  1.8us(  71us)  2.3us( 122us)(0.01%)     19785 88.4% 11.6%    0%    shrink_cache+0x50
 0.00%  1.6%  1.0us(  26us)  1.5us( 1.9us)(0.00%)       385 98.4%  1.6%    0%    shrink_cache+0x194
 0.00%    0%  1.2us(  13us)    0us                       85  100%    0%    0%    shrink_cache+0x21c
  5.5% 39.8%  2.7us( 956us)  1.2us( 953us)(0.48%)    632981 60.2% 39.8%    0%    shrink_cache+0x290
 0.00% 18.0%  1.0us(  14us)  1.1us( 4.0us)(0.00%)       596 82.0% 18.0%    0%    shrink_cache+0x2b0

 0.45%  1.8%  3.8us(  21us)  3.3us(  11us)(0.00%)     36592 98.2%  1.8%    0%  runqueue_lock
 0.09%  4.6%  3.2us(  11us)  3.5us( 8.7us)(0.00%)      9263 95.4%  4.6%    0%    __wake_up+0x5c
 0.00%    0%  0.7us( 0.7us)    0us                        1  100%    0%    0%    complete+0x6c
 0.00%    0%  0.3us( 0.7us)    0us                        3  100%    0%    0%    deliver_signal+0x48
 0.00%    0%  3.0us( 9.8us)    0us                       20  100%    0%    0%    process_timeout+0x14
 0.00%    0%  3.6us( 3.6us)    0us                        1  100%    0%    0%    schedule_tail+0x58
 0.31% 0.35%  5.4us(  21us)  4.7us(  11us)(0.00%)     18215 99.7% 0.35%    0%    schedule+0xa0
 0.00%    0%  1.6us( 4.6us)    0us                       43  100%    0%    0%    schedule+0x264
 0.04%  2.2%  1.5us( 8.3us)  2.1us( 7.1us)(0.00%)      8325 97.8%  2.2%    0%    schedule+0x4c8
 0.00%    0%  0.8us( 8.4us)    0us                      721  100%    0%    0%    wake_up_process+0x14

 0.00%    0%  0.4us( 5.9us)    0us                      496  100%    0%    0%  sb_lock
 0.00%    0%  0.1us( 0.1us)    0us                      168  100%    0%    0%    drop_super+0x24
 0.00%    0%  0.5us( 4.4us)    0us                      174  100%    0%    0%    sync_supers+0x6c
 0.00%    0%  4.0us( 5.9us)    0us                        6  100%    0%    0%    sync_unlocked_inodes+0x18
 0.00%    0%  0.4us( 1.3us)    0us                      148  100%    0%    0%    sync_unlocked_inodes+0x18c

 0.03% 0.06%  0.1us( 1.9us)  1.2us( 2.1us)(0.00%)     68015  100% 0.06%    0%  scsi_bhqueue_lock
 0.01% 0.07%  0.1us( 1.2us)  1.3us( 2.1us)(0.00%)     43947  100% 0.07%    0%    scsi_bottom_half_handler+0x1c
 0.02% 0.04%  0.2us( 1.9us)  0.9us( 1.5us)(0.00%)     24068  100% 0.04%    0%    scsi_done+0x3c

 0.00%    0%  0.4us( 2.1us)    0us                     2366  100%    0%    0%  semaphore_lock
 0.00%    0%  0.4us( 1.9us)    0us                     1483  100%    0%    0%    __down+0x44
 0.00%    0%  0.5us( 2.1us)    0us                      715  100%    0%    0%    __down+0x78
 0.00%    0%  0.3us( 0.3us)    0us                      168  100%    0%    0%    __down_trylock+0x10

 0.00%    0%  0.1us( 6.3us)    0us                      416  100%    0%    0%  swap_info+0x8
 0.00%    0%  0.2us( 6.3us)    0us                      111  100%    0%    0%    get_swap_page+0x74
 0.00%    0%  0.1us( 1.3us)    0us                      123  100%    0%    0%    swap_duplicate+0x54
 0.00%    0%  0.1us( 0.8us)    0us                      182  100%    0%    0%    swap_info_get+0xb4

 0.00%    0%  0.4us( 9.0us)    0us                      294  100%    0%    0%  swaplock
 0.00%    0%  0.4us( 9.0us)    0us                      111  100%    0%    0%    get_swap_page+0x20
 0.00%    0%  1.7us( 1.7us)    0us                        1  100%    0%    0%    si_swapinfo+0x18
 0.00%    0%  0.4us( 2.3us)    0us                      182  100%    0%    0%    swap_info_get+0x88

 0.04% 0.07%  0.2us(  15us)  1.5us( 7.8us)(0.00%)     52142  100% 0.07%    0%  timerlist_lock
 0.01% 0.07%  0.2us( 2.9us)  1.3us( 3.1us)(0.00%)     24279  100% 0.07%    0%    add_timer+0x10
 0.01% 0.07%  0.1us( 1.4us)  1.6us( 7.8us)(0.00%)     24423  100% 0.07%    0%    del_timer+0x14
 0.00%    0%  0.3us( 0.7us)    0us                       27  100%    0%    0%    del_timer_sync+0x1c
 0.00%    0%  0.8us( 1.7us)    0us                      197  100%    0%    0%    mod_timer+0x18
 0.02%    0%  1.5us(  15us)    0us                     3099  100%    0%    0%    timer_bh+0xcc
 0.00%    0%  0.2us( 0.9us)    0us                      117  100%    0%    0%    timer_bh+0x254

 0.00% 0.02%  0.1us( 0.9us)  1.4us( 1.4us)(0.00%)      5566  100% 0.02%    0%  tqueue_lock
 0.00%    0%  0.1us( 0.8us)    0us                     2259  100%    0%    0%    __run_task_queue+0x14
 0.00%    0%  0.1us( 0.8us)    0us                     1067  100%    0%    0%    batch_entropy_store+0x7c
 0.00% 0.05%  0.1us( 0.6us)  1.4us( 1.4us)(0.00%)      2064  100% 0.05%    0%    generic_plug_device+0x34
 0.00%    0%  0.2us( 0.9us)    0us                      176  100%    0%    0%    schedule_task+0x28

 0.00%    0%  0.9us( 1.6us)    0us                        3  100%    0%    0%  uidhash_lock
 0.00%    0%  1.6us( 1.6us)    0us                        1  100%    0%    0%    alloc_uid+0x10
 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%    alloc_uid+0x94
 0.00%    0%  1.0us( 1.0us)    0us                        1  100%    0%    0%    free_uid+0x28

  1.5% 0.12%  0.4us( 192us)  2.0us( 147us)(0.00%)   1269889 99.9% 0.12%    0%  unused_list_lock
 0.37% 0.19%  0.2us( 192us)  2.0us( 142us)(0.00%)    635121 99.8% 0.19%    0%    get_unused_buffer_head+0x8
 0.00% 0.22%  0.1us( 1.3us)  1.0us( 2.4us)(0.00%)      1786 99.8% 0.22%    0%    put_unused_buffer_head+0xc
  1.2% 0.05%  0.6us( 157us)  2.0us( 147us)(0.00%)    632982  100% 0.05%    0%    try_to_free_buffers+0x54

 0.00%    0%  0.3us( 1.1us)    0us                        8  100%    0%    0%  __kmem_cache_shrink+0x18
 0.00%    0%  0.1us( 0.3us)    0us                      142  100%    0%    0%  __kmem_cache_shrink+0x48
 0.71% 0.01%  0.1us(  34us)  1.8us( 8.9us)(0.00%)   2045139  100% 0.01%    0%  __wake_up+0x24
 0.00%    0%  0.1us( 0.6us)    0us                     8216  100%    0%    0%  add_wait_queue+0x10
 0.00%  6.8%  0.1us( 1.3us)  2.9us( 8.4us)(0.00%)      1490 93.2%  6.8%    0%  add_wait_queue_exclusive+0x10
 0.50% 0.08%  4.3us( 124us)  2.8us(  10us)(0.00%)     35856  100% 0.08%    0%  ahc_linux_isr+0x24
 0.28% 0.21%  3.6us(  34us)  6.4us(  93us)(0.00%)     24068 99.8% 0.21%    0%  ahc_linux_queue+0x34
 0.00%    0%  0.7us( 1.6us)    0us                       24  100%    0%    0%  change_protection+0x34
 0.00%    0%   11us(  15us)    0us                        9  100%    0%    0%  clear_page_tables+0x1c
 0.00%    0%  0.1us( 0.4us)    0us                       63  100%    0%    0%  copy_mm+0x1e8
 0.00%    0%  2.8us(  48us)    0us                       84  100%    0%    0%  copy_mm+0x230
 0.00%    0%  3.9us(  48us)    0us                       84  100%    0%    0%  copy_page_range+0x100
 0.02%    0%  0.3us( 2.2us)    0us                    27234  100%    0%    0%  do_IRQ+0x40
 0.02%    0%  0.3us( 2.4us)    0us                    27234  100%    0%    0%  do_IRQ+0xc0
 0.00%    0%  0.9us( 5.2us)    0us                      617  100%    0%    0%  do_anonymous_page+0x5c
 0.00%    0%  0.4us( 0.9us)    0us                        5  100%    0%    0%  do_brk+0x1d4
 0.00%    0%  0.1us( 0.1us)    0us                        5  100%    0%    0%  do_exit+0x124
 0.00%    0%  0.1us( 0.3us)    0us                        5  100%    0%    0%  do_exit+0x84
 0.00%    0%  0.1us( 0.1us)    0us                        5  100%    0%    0%  do_exit+0xf4
 0.00%    0%  0.6us( 2.7us)    0us                       86  100%    0%    0%  do_mmap_pgoff+0x40c
 0.00%    0%  0.5us( 3.8us)    0us                      168  100%    0%    0%  do_mmap_pgoff+0x418
 0.00%    0%  0.1us( 0.8us)    0us                       38  100%    0%    0%  do_munmap+0x1b8
 0.00%    0%  0.6us( 5.4us)    0us                      114  100%    0%    0%  do_munmap+0xe0
 0.00%    0%  0.1us( 1.3us)    0us                     1025  100%    0%    0%  do_no_page+0xdc
 0.00%    0%  0.3us( 1.0us)    0us                       14  100%    0%    0%  do_page_fault+0xe0
 0.00%    0%  0.4us( 4.9us)    0us                      114  100%    0%    0%  do_sigaction+0x58
 0.00%    0%  0.6us( 1.5us)    0us                        9  100%    0%    0%  do_sigaction+0xd8
 0.00%    0%  2.6us( 6.2us)    0us                        7  100%    0%    0%  do_signal+0x54
 0.00%    0%  1.1us( 4.2us)    0us                      205  100%    0%    0%  do_wp_page+0x118
 0.00%    0%  0.1us( 0.1us)    0us                        9  100%    0%    0%  exit_mmap+0x18
 0.00%    0%  0.1us( 0.7us)    0us                      143  100%    0%    0%  exit_mmap+0x88
 0.00%    0%  1.7us( 2.8us)    0us                        5  100%    0%    0%  exit_sighand+0x18
 0.13% 0.02%  7.9us(  30us)  1.2us( 1.2us)(0.00%)      5232  100% 0.02%    0%  free_block+0x1c
 0.00%    0%  0.3us(  17us)    0us                     2025  100%    0%    0%  handle_mm_fault+0x34
 0.00%    0%  0.4us( 0.7us)    0us                        5  100%    0%    0%  handle_signal+0xb0
 0.00%    0%  1.7us( 3.3us)    0us                        5  100%    0%    0%  insert_vm_struct+0x60
 0.00%    0%  0.1us( 0.3us)    0us                      182  100%    0%    0%  interruptible_sleep_on+0x28
 0.00%    0%  0.1us( 0.2us)    0us                      182  100%    0%    0%  interruptible_sleep_on+0x54
 0.19% 0.02%  2.9us(  51us)  2.1us( 2.4us)(0.00%)     19844  100% 0.02%    0%  kmem_cache_alloc_batch+0x18
 0.00%    0%  0.1us( 1.3us)    0us                     6765  100%    0%    0%  kmem_cache_grow+0x1d4
 0.01% 0.01%  0.3us( 1.5us)  1.5us( 1.5us)(0.00%)      6765  100% 0.01%    0%  kmem_cache_grow+0x80
 0.00%    0%  0.1us( 0.6us)    0us                      160  100%    0%    0%  kmem_cache_reap+0x25c
 0.00% 0.07%  0.1us( 0.6us)   10us(  15us)(0.00%)      7063  100% 0.07%    0%  kmem_cache_reap+0x2c4
 0.72% 0.00%  1.2us( 327us)  3.0us( 6.9us)(0.00%)    194356  100% 0.00%    0%  kmem_cache_reap+0xa4
 0.00%    0%  1.0us( 2.0us)    0us                       24  100%    0%    0%  mprotect_fixup+0x2a8
 0.00%    0%  0.7us( 1.5us)    0us                       24  100%    0%    0%  mprotect_fixup+0x2b4
 0.00%    0%  5.0us(  41us)    0us                       27  100%    0%    0%  pte_alloc+0x88
 0.00%    0%  0.3us( 0.8us)    0us                        5  100%    0%    0%  put_dirty_page+0x3c
 0.00%    0%  0.1us( 0.4us)    0us                        5  100%    0%    0%  release_task+0x3c
 0.00%  2.9%  0.1us( 1.2us)  3.5us( 9.2us)(0.00%)      9706 97.1%  2.9%    0%  remove_wait_queue+0x10
 0.07%    0%  1.1us( 263us)    0us                    18189  100%    0%    0%  schedule+0x478
 0.00%    0%  1.7us( 7.4us)    0us                        5  100%    0%    0%  schedule_tail+0x20
 0.00%    0%   13us(  31us)    0us                        6  100%    0%    0%  send_sig_info+0x4c
 0.00%    0%  0.1us( 0.4us)    0us                       40  100%    0%    0%  sleep_on+0x28
 0.00%    0%  0.1us( 0.8us)    0us                       40  100%    0%    0%  sleep_on+0x54
 0.01%    0%   57us( 546us)    0us                       45  100%    0%    0%  swap_out+0xc8
 0.00%    0%  0.1us( 0.1us)    0us                        9  100%    0%    0%  sys_rt_sigprocmask+0x18c
 0.00%    0%  0.3us( 1.8us)    0us                       69  100%    0%    0%  sys_rt_sigprocmask+0x98
 0.00%    0%  0.4us( 0.7us)    0us                        5  100%    0%    0%  sys_sigreturn+0x84
 0.00%    0%  1.1us( 1.7us)    0us                        8  100%    0%    0%  unmap_fixup+0xa8
 0.00%    0%  0.6us( 1.0us)    0us                        8  100%    0%    0%  unmap_fixup+0xb8
 0.00%    0%  0.1us( 4.1us)    0us                      249  100%    0%    0%  vma_merge+0x54
 0.01%    0%  5.9us( 398us)    0us                      300  100%    0%    0%  zap_page_range+0x48

- - - - - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK READS   HOLD    MAX  RDR BUSY PERIOD      WAIT
  UTIL  CON    MEAN   RDRS   MEAN(  MAX )   MEAN(  MAX )( %CPU)     TOTAL NOWAIT SPIN  NAME

       0.68%                               2.1us( 229us)(0.11%)   5063033 99.3% 0.68%  *TOTAL*

 0.00%    0%   3.9us     1  3.9us(  19us)    0us                        3  100%    0%  [0xe6f1b044]
          0%                                 0us                        3  100%    0%    copy_files+0x158

 0.00%    0%   0.2us     1  0.2us( 5.8us)    0us                        1  100%    0%  [0xf417a044]
          0%                                 0us                        1  100%    0%    do_fcntl+0x104

 0.00%    0%   1.2us     1  1.2us( 1.2us)    0us                        1  100%    0%  [0xf703f224]
          0%                                 0us                        1  100%    0%    unix_write_space+0x14

 0.00%    0%   1.3us     1  1.3us( 1.3us)    0us                        1  100%    0%  [0xf703f448]
          0%                                 0us                        1  100%    0%    unix_dgram_sendmsg+0x80

 0.00%    0%  10.1us     1   10us(  10us)    0us                        1  100%    0%  [0xf703f564]
          0%                                 0us                        1  100%    0%    sock_def_readable+0x14

 0.00%    0%   4.5us     1  4.5us( 4.5us)    0us                        1  100%    0%  [0xf703f788]
          0%                                 0us                        1  100%    0%    unix_dgram_sendmsg+0x21c

 0.00%    0%   0.2us     1  0.2us( 3.3us)    0us                        1  100%    0%  [0xf7658d24]
          0%                                 0us                        1  100%    0%    sys_getcwd+0x38

 0.00%    0%   0.7us     1  0.7us( 0.9us)    0us                        2  100%    0%  arp_tbl+0xc4
          0%                                 0us                        2  100%    0%    neigh_lookup+0x40

 0.00%    0%   0.9us     1  0.9us( 1.3us)    0us                        5  100%    0%  binfmt_lock
          0%                                 0us                        5  100%    0%    search_binary_handler+0x38

 0.00%    0%   1.3us     1  1.3us( 1.3us)    0us                        1  100%    0%  chrdevs_lock
          0%                                 0us                        1  100%    0%    get_chrfops+0x28

 0.00%    0%   3.4us     1  3.4us( 4.2us)    0us                        4  100%    0%  fib_hash_lock
          0%                                 0us                        4  100%    0%    fn_hash_lookup+0x10

  2.9% 0.72%   0.2us     2  0.2us( 288us)  2.1us( 229us)(0.11%)   4745142 99.3% 0.72%  hash_table_lock
       0.72%                               2.1us( 229us)(0.11%)   4745142 99.3% 0.72%    get_hash_table+0x60

 0.00%    0%   0.8us     1  0.8us( 1.4us)    0us                        4  100%    0%  inetdev_lock
          0%                                 0us                        2  100%    0%    arp_rcv+0x28
          0%                                 0us                        2  100%    0%    ip_route_input_slow+0x18

 0.01%    0%  28.0us     2   28us(  80us)    0us                       69  100%    0%  tasklist_lock
          0%                                 0us                        6  100%    0%    count_active_tasks+0xc
          0%                                 0us                        5  100%    0%    exit_notify+0x18
          0%                                 0us                       43  100%    0%    schedule+0x218
          0%                                 0us                        1  100%    0%    sys_setsid+0x10
          0%                                 0us                       14  100%    0%    sys_wait4+0x8c

 0.00%    0%   2.1us     1  2.1us( 3.2us)    0us                        3  100%    0%  udp_hash_lock
          0%                                 0us                        3  100%    0%    udp_v4_mcast_deliver+0x10

 0.00%    0%   0.6us     1  0.6us( 1.4us)    0us                       36  100%    0%  xtime_lock
          0%                                 0us                       36  100%    0%    do_gettimeofday+0x14

          0%                                 0us                        5  100%    0%  copy_files+0x100
          0%                                 0us                        5  100%    0%  do_fork+0x35c
          0%                                 0us                       14  100%    0%  do_select+0x24
          0%                                 0us                   316664  100%    0%  fget+0x1c
          0%                                 0us                        5  100%    0%  ip_route_input+0x88
          0%                                 0us                        5  100%    0%  net_rx_action+0x48
          0%                                 0us                        4  100%    0%  path_init+0x114
          0%                                 0us                     1056  100%    0%  path_init+0x30

- - - - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK WRITES     HOLD           WAIT (ALL)           WAIT (WW) 
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )( %CPU)   MEAN(  MAX )     TOTAL NOWAIT SPIN(  WW )  NAME

        4.9%  1.4us( 217us)  1.2us( 218us)(0.06%)  0.3us(  34us)    643483 95.1%  3.5%( 1.4%)  *TOTAL*

 0.00%    0%  0.6us( 1.3us)    0us                   0us                 6  100%    0%(   0%)  [0xe6f1b204]
 0.00%    0%  0.5us( 1.3us)    0us                   0us                 3  100%    0%(   0%)    copy_files+0x12c
 0.00%    0%  0.6us( 0.7us)    0us                   0us                 3  100%    0%(   0%)    expand_fd_array+0x88

 0.00%    0%  4.4us(  21us)    0us                   0us                 5  100%    0%(   0%)  [0xf417a044]
 0.00%    0%  0.1us( 0.1us)    0us                   0us                 2  100%    0%(   0%)    do_fcntl+0x140
 0.00%    0%  7.3us(  21us)    0us                   0us                 3  100%    0%(   0%)    sys_dup2+0x2c

 0.00%    0%  0.1us( 0.1us)    0us                   0us                 4  100%    0%(   0%)  [0xf50ce044]
 0.00%    0%  0.1us( 0.1us)    0us                   0us                 2  100%    0%(   0%)    do_pipe+0x174
 0.00%    0%  0.1us( 0.1us)    0us                   0us                 2  100%    0%(   0%)    do_pipe+0x1a4

 0.00%    0%  0.7us( 0.7us)    0us                   0us                 1  100%    0%(   0%)  [0xf7658d24]
 0.00%    0%  0.7us( 0.7us)    0us                   0us                 1  100%    0%(   0%)    sys_chdir+0x9c

 0.00%    0%  0.1us( 0.1us)    0us                   0us                 1  100%    0%(   0%)  [0xf7df24f4]
 0.00%    0%  0.1us( 0.1us)    0us                   0us                 1  100%    0%(   0%)    neigh_destroy+0x8c

 0.00%    0%  104us( 104us)    0us                   0us                 1  100%    0%(   0%)  arp_tbl+0xc4
 0.00%    0%  104us( 104us)    0us                   0us                 1  100%    0%(   0%)    neigh_periodic_timer__thr+0x20

 0.00%    0%  0.5us( 0.6us)    0us                   0us                 2  100%    0%(   0%)  dn_lock
 0.00%    0%  0.5us( 0.6us)    0us                   0us                 2  100%    0%(   0%)    fcntl_dirnotify+0x94

  2.9%  5.0%  1.4us( 217us)  1.2us( 218us)(0.06%)  0.3us(  34us)    634589 95.0%  3.6%( 1.4%)  hash_table_lock
 0.00% 0.59%  1.3us(  30us)   17us(  34us)(0.00%)  8.8us(  34us)      1011 99.4% 0.10%(0.49%)    hash_page_buffers+0x48
  2.9%  5.0%  1.4us( 217us)  1.2us( 218us)(0.06%)  0.2us( 4.9us)    633578 95.0%  3.6%( 1.4%)    try_to_free_buffers+0x28

 0.00%    0%  7.6us(  38us)    0us                   0us                15  100%    0%(   0%)  tasklist_lock
 0.00%    0%  2.2us( 2.7us)    0us                   0us                 5  100%    0%(   0%)    do_fork+0x530
 0.00%    0%   20us(  38us)    0us                   0us                 5  100%    0%(   0%)    exit_notify+0x1b0
 0.00%    0%  0.6us( 1.0us)    0us                   0us                 5  100%    0%(   0%)    release_task+0x7c

 0.00%    0%   20us(  40us)    0us                   0us                 4  100%    0%(   0%)  vmlist_lock
 0.00%    0%  4.4us( 5.2us)    0us                   0us                 2  100%    0%(   0%)    get_vm_area+0x3c
 0.00%    0%   35us(  40us)    0us                   0us                 2  100%    0%(   0%)    vfree+0x58

 0.12%    0%  5.9us(  19us)    0us                   0us              6198  100%    0%(   0%)  xtime_lock
 0.02%    0%  1.7us( 6.7us)    0us                   0us              3099  100%    0%(   0%)    timer_bh+0xc
 0.10%    0%   10us(  19us)    0us                   0us              3099  100%    0%(   0%)    timer_interrupt+0x10

 0.00%    0%  0.9us( 1.2us)    0us                   0us                 5  100%    0%(   0%)  flush_old_exec+0x22c
 0.00%    0%  0.2us( 3.3us)    0us                   0us               813  100%    0%(   0%)  get_unused_fd+0x24
 0.00%    0%  0.1us( 0.1us)    0us                   0us                 5  100%    0%(   0%)  load_elf_binary+0x190
 0.00%    0%  0.4us( 0.5us)    0us                   0us                 4  100%    0%(   0%)  neigh_periodic_timer__thr+0xa8
 0.00%    0%  0.1us( 4.1us)    0us                   0us               820  100%    0%(   0%)  rt_check_expire__thr+0x64
 0.00%    0%  0.2us( 2.6us)    0us                   0us               206  100%    0%(   0%)  sys_close+0x1c
 0.00%    0%  0.1us( 1.7us)    0us                   0us               188  100%    0%(   0%)  sys_open+0x60
 0.00%    0%  0.1us( 0.8us)    0us                   0us               616  100%    0%(   0%)  sys_open+0xa8
_________________________________________________________________________________________________________________________
Number of read locks found=16
Lockmeter statistics are now OFF

* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-19 21:29 mgross
@ 2002-06-20  0:54 ` Andrew Morton
  2002-06-20  9:54   ` Stephen C. Tweedie
  2002-06-20  1:55 ` Andrew Morton
  1 sibling, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2002-06-20  0:54 UTC (permalink / raw)
  To: mgross; +Cc: Linux Kernel Mailing List, lse-tech, richard.a.griffiths

mgross wrote:
> 
> ...
> Has anyone done any work looking into the I/O scaling of Linux / ext3 per
> spindle or per adapter?  We would like to compare notes.

No.  ext3 scalability is very poor, I'm afraid.  The fs really wasn't
up and running until kernel 2.4.5 and we just didn't have time to
address that issue.

> I've only just started to look at the ext3 code but it seems to me that replacing the
> BKL with a per-ext3 file system lock could remove some of the contention that's
> getting measured.  What data are the BKL protecting in these ext3 functions?  Could a
> lock per FS approach work?

The vague plan there is to replace lock_kernel with lock_journal
where appropriate.  But ext3 scalability work of this nature
will be targeted at the 2.5 kernel, most probably.

I'll take a look, see if there's any low-hanging fruit in there,
but I doubt that the results will be fantastic.

-

* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-20  0:54 ` Andrew Morton
@ 2002-06-20  9:54   ` Stephen C. Tweedie
  0 siblings, 0 replies; 25+ messages in thread
From: Stephen C. Tweedie @ 2002-06-20  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mgross, Linux Kernel Mailing List, lse-tech, richard.a.griffiths,
	ext2-devel

Hi,

On Wed, Jun 19, 2002 at 05:54:46PM -0700, Andrew Morton wrote:

> The vague plan there is to replace lock_kernel with lock_journal
> where appropriate.  But ext3 scalability work of this nature
> will be targeted at the 2.5 kernel, most probably.

I think we can do better than that, with care.  lock_journal could
easily become a read/write lock to protect the transaction state
machine, as there's really only one place --- the commit thread ---
where we end up changing the state of a transaction itself (eg. from
running to committing).  For short-lived buffer transformations, we
already have the datalist spinlock.

There are a few intermediate types of operation, such as the
do_get_write_access.  That's a buffer operation, but it relies on us
being able to allocate memory for the old version of the buffer if we
happen to be committing the bh to disk already.  All of those cases
are already prepared to accept BKL being dropped during the memory
allocation, so there's no problem with doing the same for a short-term
buffer spinlock; and if the journal_lock is only taken shared in such
places, then there's no urgent need to drop that over the malloc.

Even the commit thread can probably avoid taking the journal lock in
many cases --- it would need it exclusively while changing a
transaction's global state, but while it's just manipulating blocks on
the committing transaction it can probably get away with much less
locking.

Cheers,
 Stephen

* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-19 21:29 mgross
  2002-06-20  0:54 ` Andrew Morton
@ 2002-06-20  1:55 ` Andrew Morton
  2002-06-20  6:05   ` Jens Axboe
  1 sibling, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2002-06-20  1:55 UTC (permalink / raw)
  To: mgross; +Cc: Linux Kernel Mailing List, lse-tech, richard.a.griffiths

mgross wrote:
> 
> We've been doing some throughput comparisons and benchmarks of block I/O
> throughput for 8KB writes as the number of SCSI adapters and drives per
> adapter is increased.
> 
> The Linux platform is a dual processor 1.2GHz PIII, 2Gig of RAM, 2U box.
> Similar results have been seen with both 2.4.16 and 2.4.18 base kernel, as
> well as one of those patched up O(1) 2.4.18 kernels out there.

umm.  Are you not using block-highmem?  That is a must-have.

http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre9aa2/00_block-highmem-all-18b-12.gz

-

* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-20  1:55 ` Andrew Morton
@ 2002-06-20  6:05   ` Jens Axboe
  0 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2002-06-20  6:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mgross, Linux Kernel Mailing List, lse-tech, richard.a.griffiths

On Wed, Jun 19 2002, Andrew Morton wrote:
> mgross wrote:
> > 
> > We've been doing some throughput comparisons and benchmarks of block I/O
> > throughput for 8KB writes as the number of SCSI adapters and drives per
> > adapter is increased.
> > 
> > The Linux platform is a dual processor 1.2GHz PIII, 2Gig of RAM, 2U box.
> > Similar results have been seen with both 2.4.16 and 2.4.18 base kernel, as
> > well as one of those patched up O(1) 2.4.18 kernels out there.
> 
> umm.  Are you not using block-highmem?  That is a must-have.
> 
> http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre9aa2/00_block-highmem-all-18b-12.gz

please use

http://www.kernel.org/pub/linux/kernel/people/axboe/patches/v2.4/2.4.19-pre10/block-highmem-all-19.bz2

-- 
Jens Axboe


end of thread, other threads:[~2002-06-23 17:17 UTC | newest]

Thread overview: 25+ messages
2002-06-20 21:50 ext3 performance bottleneck as the number of spindles gets la rge Griffiths, Richard A
2002-06-21  7:58 ` ext3 performance bottleneck as the number of spindles gets large Andrew Morton
2002-06-21 18:46   ` mgross
2002-06-21 19:26     ` Chris Mason
2002-06-21 19:56     ` Andrew Morton
2002-06-23  4:02 ` Christopher E. Brown
2002-06-23  4:33   ` Andreas Dilger
2002-06-23  6:00     ` Christopher E. Brown
2002-06-23  6:35       ` [Lse-tech] " William Lee Irwin III
2002-06-23  7:29         ` Dave Hansen
2002-06-23  7:36           ` William Lee Irwin III
2002-06-23  7:45             ` Dave Hansen
2002-06-23  7:55               ` Christopher E. Brown
2002-06-23  8:11                 ` David Lang
2002-06-23  8:31                 ` Dave Hansen
2002-06-23 16:21               ` Martin J. Bligh
2002-06-23 17:06         ` Eric W. Biederman
  -- strict thread matches above, loose matches on Subject: below --
2002-06-20 15:26 ext3 performance bottleneck as the number of spindles gets la rge Griffiths, Richard A
2002-06-20 20:18 ` ext3 performance bottleneck as the number of spindles gets large Andrew Morton
2002-06-20 18:08   ` mgross
2002-06-20 21:25     ` Andrew Morton
2002-06-19 21:29 mgross
2002-06-20  0:54 ` Andrew Morton
2002-06-20  9:54   ` Stephen C. Tweedie
2002-06-20  1:55 ` Andrew Morton
2002-06-20  6:05   ` Jens Axboe
