* ext3 writepages ?
@ 2005-02-02 15:32 Badari Pulavarty
2005-02-02 20:19 ` Sonny Rao
0 siblings, 1 reply; 21+ messages in thread
From: Badari Pulavarty @ 2005-02-02 15:32 UTC (permalink / raw)
To: linux-fsdevel
Hi,
I forgot the reason why we don't have ext3_writepages() ?
I can dig through to find out, but it would be easy to ask
people.
Please let me know.
Thanks,
Badari
* Re: ext3 writepages ?
2005-02-02 15:32 ext3 writepages ? Badari Pulavarty
@ 2005-02-02 20:19 ` Sonny Rao
2005-02-03 15:51 ` Badari Pulavarty
0 siblings, 1 reply; 21+ messages in thread
From: Sonny Rao @ 2005-02-02 20:19 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: linux-fsdevel
On Wed, Feb 02, 2005 at 07:32:04AM -0800, Badari Pulavarty wrote:
> Hi,
>
> I forgot the reason why we don't have ext3_writepages() ?
> I can dig through to find out, but it would be easy to ask
> people.
>
> Please let me know.
Badari, I seem to have successfully hacked the writeback mode to use
writepages on a User-Mode Linux instance. I'm going to try it on a
real box soon. The only issue is that pdflush is passing the create
parameter as 1 to writepages, which doesn't exactly make sense. I
suppose it might be needed for a filesystem like XFS which does
delayed block allocation ? In ext3 however, the blocks should have
been allocated beforehand.
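(For illustration, a minimal sketch of the hook being described -- essentially what the patch posted later in this thread does; the wrapper name is hypothetical, and it assumes ext3's existing ext3_get_block() can simply be called with create forced to 0:)
static int ext3_get_block_nocreate(struct inode *inode, sector_t iblock,
				   struct buffer_head *bh_result, int create)
{
	/* writeback mode: blocks were allocated at prepare_write time,
	 * so ignore the create flag mpage_writepages() passes down */
	return ext3_get_block(inode, iblock, bh_result, 0);
}

static int ext3_writeback_writepages(struct address_space *mapping,
				     struct writeback_control *wbc)
{
	return mpage_writepages(mapping, wbc, ext3_get_block_nocreate);
}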
Sonny
* Re: ext3 writepages ?
2005-02-02 20:19 ` Sonny Rao
@ 2005-02-03 15:51 ` Badari Pulavarty
2005-02-03 17:00 ` Sonny Rao
2005-02-03 20:50 ` Sonny Rao
0 siblings, 2 replies; 21+ messages in thread
From: Badari Pulavarty @ 2005-02-03 15:51 UTC (permalink / raw)
To: Sonny Rao; +Cc: linux-fsdevel
On Wed, 2005-02-02 at 12:19, Sonny Rao wrote:
> On Wed, Feb 02, 2005 at 07:32:04AM -0800, Badari Pulavarty wrote:
> > Hi,
> >
> > I forgot the reason why we don't have ext3_writepages() ?
> > I can dig through to find out, but it would be easy to ask
> > people.
> >
> > Please let me know.
>
> Badari, I seem to have successfully hacked the writeback mode to use
> writepages on a User-Mode Linux instance. I'm going to try it on a
> real box soon. The only issue is that pdflush is passing the create
> parameter as 1 to writepages, which doesn't exactly make sense. I
> suppose it might be needed for a filesystem like XFS which does
> delayed block allocation ? In ext3 however, the blocks should have
> been allocated beforehand.
Funny, I am also hacking writepages for writeback mode. You are a step
ahead of me :) Please let me know, how it goes.
Thanks,
Badari
* Re: ext3 writepages ?
2005-02-03 17:00 ` Sonny Rao
@ 2005-02-03 16:56 ` Badari Pulavarty
2005-02-03 17:24 ` Sonny Rao
0 siblings, 1 reply; 21+ messages in thread
From: Badari Pulavarty @ 2005-02-03 16:56 UTC (permalink / raw)
To: Sonny Rao; +Cc: linux-fsdevel
On Thu, 2005-02-03 at 09:00, Sonny Rao wrote:
>
> Well it seems to work, here's my (rather ugly) patch.
> I'm doing some performance comparisons now.
>
> Sonny
Interesting.. Why did you create a nobh_prepare_write() ?
mpage_writepages() can handle pages with buffer heads
attached.
And also, are you sure you don't need to journal start/stop
in writepages() ?
Thanks,
Badari
* Re: ext3 writepages ?
2005-02-03 15:51 ` Badari Pulavarty
@ 2005-02-03 17:00 ` Sonny Rao
2005-02-03 16:56 ` Badari Pulavarty
2005-02-03 20:50 ` Sonny Rao
1 sibling, 1 reply; 21+ messages in thread
From: Sonny Rao @ 2005-02-03 17:00 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: linux-fsdevel
On Thu, Feb 03, 2005 at 07:51:37AM -0800, Badari Pulavarty wrote:
> On Wed, 2005-02-02 at 12:19, Sonny Rao wrote:
> > On Wed, Feb 02, 2005 at 07:32:04AM -0800, Badari Pulavarty wrote:
> > > Hi,
> > >
> > > I forgot the reason why we don't have ext3_writepages() ?
> > > I can dig through to find out, but it would be easy to ask
> > > people.
> > >
> > > Please let me know.
> >
> > Badari, I seem to have successfully hacked the writeback mode to use
> > writepages on a User-Mode Linux instance. I'm going to try it on a
> > real box soon. The only issue is that pdflush is passing the create
> > parameter as 1 to writepages, which doesn't exactly make sense. I
> > suppose it might be needed for a filesystem like XFS which does
> > delayed block allocation ? In ext3 however, the blocks should have
> > been allocated beforehand.
>
> Funny, I am also hacking writepages for writeback mode. You are a step
> ahead of me :) Please let me know, how it goes.
Well it seems to work, here's my (rather ugly) patch.
I'm doing some performance comparisons now.
Sonny
[-- Attachment #2: ext3-wb-wpages.patch --]
diff -Naurp linux-2.6.10-original/fs/ext3/inode.c linux-2.6.10-working/fs/ext3/inode.c
--- linux-2.6.10-original/fs/ext3/inode.c 2004-12-24 15:35:01.000000000 -0600
+++ linux-2.6.10-working/fs/ext3/inode.c 2005-01-29 10:45:09.599837136 -0600
@@ -810,6 +810,18 @@ static int ext3_get_block(struct inode *
return ret;
}
+static int ext3_get_block_wpages(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
+{
+ /* ugly hack, just pass 0 for create to get_block_handle */
+ /* the blocks should have already been allocated if we're in */
+ /* writepages writeback */
+ return ext3_get_block_handle(NULL, inode, iblock,
+ bh_result, 0, 0);
+}
+
+
+
#define DIO_CREDITS (EXT3_RESERVE_TRANS_BLOCKS + 32)
static int
@@ -1025,6 +1037,32 @@ out:
return ret;
}
+static int ext3_nobh_prepare_write(struct file *file, struct page *page,
+ unsigned from, unsigned to)
+{
+ struct inode *inode = page->mapping->host;
+ int ret;
+ int needed_blocks = ext3_writepage_trans_blocks(inode);
+ handle_t *handle;
+ int retries = 0;
+
+retry:
+ handle = ext3_journal_start(inode, needed_blocks);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+ ret = nobh_prepare_write(page, from, to, ext3_get_block);
+ if (ret)
+ ext3_journal_stop(handle);
+ if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+out:
+
+ return ret;
+}
+
+
static int
ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
{
@@ -1092,7 +1130,7 @@ static int ext3_writeback_commit_write(s
new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
if (new_i_size > EXT3_I(inode)->i_disksize)
EXT3_I(inode)->i_disksize = new_i_size;
- ret = generic_commit_write(file, page, from, to);
+ ret = nobh_commit_write(file, page, from, to);
ret2 = ext3_journal_stop(handle);
if (!ret)
ret = ret2;
@@ -1321,6 +1359,14 @@ out_fail:
return ret;
}
+static int
+ext3_writeback_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ return mpage_writepages(mapping, wbc, ext3_get_block_wpages);
+}
+
+
static int ext3_writeback_writepage(struct page *page,
struct writeback_control *wbc)
{
@@ -1552,8 +1598,9 @@ static struct address_space_operations e
.readpage = ext3_readpage,
.readpages = ext3_readpages,
.writepage = ext3_writeback_writepage,
+ .writepages = ext3_writeback_writepages,
.sync_page = block_sync_page,
- .prepare_write = ext3_prepare_write,
+ .prepare_write = ext3_nobh_prepare_write,
.commit_write = ext3_writeback_commit_write,
.bmap = ext3_bmap,
.invalidatepage = ext3_invalidatepage,
* Re: ext3 writepages ?
2005-02-03 16:56 ` Badari Pulavarty
@ 2005-02-03 17:24 ` Sonny Rao
0 siblings, 0 replies; 21+ messages in thread
From: Sonny Rao @ 2005-02-03 17:24 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: linux-fsdevel
On Thu, Feb 03, 2005 at 08:56:50AM -0800, Badari Pulavarty wrote:
> On Thu, 2005-02-03 at 09:00, Sonny Rao wrote:
>
> >
> > Well it seems to work, here's my (rather ugly) patch.
> > I'm doing some performance comparisons now.
> >
> > Sonny
>
> Interesting.. Why did you create a nobh_prepare_write() ?
> mpage_writepages() can handle pages with buffer heads
> attached.
IIRC, block_prepare_write will attach buffer_heads for you, which I'm
explicitly trying to avoid.
> And also, are you sure you don't need to journal start/stop
> in writepages() ?
Heh, I'm not sure; I don't understand the semantics of those calls
well enough to say with certainty.
My guess is no, because the blocks on disk were already allocated
beforehand. Maybe it could be a problem if a truncate were in
progress elsewhere, but I don't think so, since the inode is
locked.
Sonny
* Re: ext3 writepages ?
2005-02-03 15:51 ` Badari Pulavarty
2005-02-03 17:00 ` Sonny Rao
@ 2005-02-03 20:50 ` Sonny Rao
2005-02-08 1:33 ` Andreas Dilger
1 sibling, 1 reply; 21+ messages in thread
From: Sonny Rao @ 2005-02-03 20:50 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: linux-fsdevel
On Thu, Feb 03, 2005 at 07:51:37AM -0800, Badari Pulavarty wrote:
> On Wed, 2005-02-02 at 12:19, Sonny Rao wrote:
> > On Wed, Feb 02, 2005 at 07:32:04AM -0800, Badari Pulavarty wrote:
> > > Hi,
> > >
> > > I forgot the reason why we don't have ext3_writepages() ?
> > > I can dig through to find out, but it would be easy to ask
> > > people.
> > >
> > > Please let me know.
> >
> > Badari, I seem to have successfully hacked the writeback mode to use
> > writepages on a User-Mode Linux instance. I'm going to try it on a
> > real box soon. The only issue is that pdflush is passing the create
> > parameter as 1 to writepages, which doesn't exactly make sense. I
> > suppose it might be needed for a filesystem like XFS which does
> > delayed block allocation ? In ext3 however, the blocks should have
> > been allocated beforehand.
>
> Funny, I am also hacking writepages for writeback mode. You are a step
> ahead of me :) Please let me know, how it goes.
Well, from what I can tell, my patch doesn't seem to make much of a
difference in write throughput other than allowing multi-page bios to
be sent down and cutting down on buffer_head usage.
If the only goal was to reduce buffer_head usage, then this works, but
using an mpage_writepage-like function should achieve the same result.
I did notice in my write throughput tests that ext2 still did
significantly better for some reason, even though no meta-data changes
were occurring. I'm looking into that.
Sonny
* Re: ext3 writepages ?
2005-02-03 20:50 ` Sonny Rao
@ 2005-02-08 1:33 ` Andreas Dilger
2005-02-08 5:38 ` Sonny Rao
0 siblings, 1 reply; 21+ messages in thread
From: Andreas Dilger @ 2005-02-08 1:33 UTC (permalink / raw)
To: Sonny Rao; +Cc: Badari Pulavarty, linux-fsdevel
On Feb 03, 2005 15:50 -0500, Sonny Rao wrote:
> Well, from what I can tell, my patch doesn't seem to make much of a
> difference in write throughput other than allowing multi-page bios to
> be sent down and cutting down on buffer_head usage.
Even if it doesn't make a difference in performance, it might reduce the
CPU usage. Did you check that at all?
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/
* Re: ext3 writepages ?
2005-02-08 1:33 ` Andreas Dilger
@ 2005-02-08 5:38 ` Sonny Rao
2005-02-09 21:11 ` Sonny Rao
0 siblings, 1 reply; 21+ messages in thread
From: Sonny Rao @ 2005-02-08 5:38 UTC (permalink / raw)
To: Andreas Dilger; +Cc: linux-fsdevel, Badari Pulavarty
On Mon, Feb 07, 2005 at 06:33:51PM -0700, Andreas Dilger wrote:
> On Feb 03, 2005 15:50 -0500, Sonny Rao wrote:
> > Well, from what I can tell, my patch doesn't seem to make much of a
> > difference in write throughput other than allowing multi-page bios to
> > be sent down and cutting down on buffer_head usage.
>
> Even if it doesn't make a difference in performance, it might reduce the
> CPU usage. Did you check that at all?
No I didn't, I'll check that out and post back.
Sonny
* Re: ext3 writepages ?
2005-02-08 5:38 ` Sonny Rao
@ 2005-02-09 21:11 ` Sonny Rao
2005-02-09 22:29 ` Badari Pulavarty
0 siblings, 1 reply; 21+ messages in thread
From: Sonny Rao @ 2005-02-09 21:11 UTC (permalink / raw)
To: Andreas Dilger; +Cc: linux-fsdevel, Badari Pulavarty
On Tue, Feb 08, 2005 at 12:38:08AM -0500, Sonny Rao wrote:
> On Mon, Feb 07, 2005 at 06:33:51PM -0700, Andreas Dilger wrote:
> > On Feb 03, 2005 15:50 -0500, Sonny Rao wrote:
> > > Well, from what I can tell, my patch doesn't seem to make much of a
> > > difference in write throughput other than allowing multi-page bios to
> > > be sent down and cutting down on buffer_head usage.
> >
> > Even if it doesn't make a difference in performance, it might reduce the
> > CPU usage. Did you check that at all?
>
> No I didn't, I'll check that out and post back.
>
> Sonny
Ok, I take it back, on a raid device I saw a significant increase in
throughput and approximately equal cpu utilization. I was comparing
the wrong data points before.. oops.
Sequential overwrite went from 75.6 MB/sec to 87.7 MB/sec both with an
average CPU utilization of 73% for both.
So, I see a 16% improvement in throughput for this test case and a
corresponding increase in efficiency.
Although, after reading what SCT wrote about writepage and writepages
needing to have a transaction handle, in some cases, that might make
the proper writepages code significantly more complex than my two-bit
hack. Still, I think it's worth it.
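(A rough sketch, for illustration only, of what wrapping the whole call in a handle might look like -- the function name is hypothetical, and it assumes a single ext3_writepage_trans_blocks() credit estimate covers the entire range, which is exactly the part that SCT's point suggests would get complicated:)
static int ext3_writeback_writepages_journalled(struct address_space *mapping,
						struct writeback_control *wbc)
{
	struct inode *inode = mapping->host;
	handle_t *handle;
	int ret, err;

	/* start a handle up front so blocks touched during writeback are
	 * covered; the credit estimate is assumed sufficient (see caveat above) */
	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	ret = mpage_writepages(mapping, wbc, ext3_get_block);
	err = ext3_journal_stop(handle);
	if (!ret)
		ret = err;
	return ret;
}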
Sonny
* Re: ext3 writepages ?
2005-02-09 21:11 ` Sonny Rao
@ 2005-02-09 22:29 ` Badari Pulavarty
2005-02-10 2:05 ` Bryan Henderson
0 siblings, 1 reply; 21+ messages in thread
From: Badari Pulavarty @ 2005-02-09 22:29 UTC (permalink / raw)
To: Sonny Rao; +Cc: Andreas Dilger, linux-fsdevel
On Wed, 2005-02-09 at 13:11, Sonny Rao wrote:
> On Tue, Feb 08, 2005 at 12:38:08AM -0500, Sonny Rao wrote:
> > On Mon, Feb 07, 2005 at 06:33:51PM -0700, Andreas Dilger wrote:
> > > On Feb 03, 2005 15:50 -0500, Sonny Rao wrote:
> > > > Well, from what I can tell, my patch doesn't seem to make much of a
> > > > difference in write throughput other than allowing multi-page bios to
> > > > be sent down and cutting down on buffer_head usage.
> > >
> > > Even if it doesn't make a difference in performance, it might reduce the
> > > CPU usage. Did you check that at all?
> >
> > No I didn't, I'll check that out and post back.
> >
> > Sonny
>
> Ok, I take it back, on a raid device I saw a significant increase in
> throughput and approximately equal cpu utilization. I was comparing
> the wrong data points before.. oops.
>
> Sequential overwrite went from 75.6 MB/sec to 87.7 MB/sec both with an
> average CPU utilization of 73% for both.
>
> So, I see a 16% improvement in throughput for this test case and a
> corresponding increase in efficiency.
>
> Although, after reading what SCT wrote about writepage and writepages
> needing to have a transaction handle, in some cases, that might make
> the proper writepages code significantly more complex than my two-bit
> hack. Still, I think it's worth it.
Yep. I hacked ext3_writepages() to use mpage_writepages() as you did
(without modifying the bufferhead stuff). With the limited testing I did,
I see much larger IO chunks and better throughput. So, I guess its
worth doing it - I am a little worried about error handling though...
Let's handle one issue at a time.
First fix writepages() without bufferhead changes? Then handle
bufferheads? I still can't figure out a way to work around the
bufferheads, especially for ordered writes.
Thanks,
Badari
* Re: ext3 writepages ?
2005-02-09 22:29 ` Badari Pulavarty
@ 2005-02-10 2:05 ` Bryan Henderson
2005-02-10 2:45 ` Sonny Rao
2005-02-10 16:02 ` Badari Pulavarty
0 siblings, 2 replies; 21+ messages in thread
From: Bryan Henderson @ 2005-02-10 2:05 UTC (permalink / raw)
To: pbadari; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao
>I see much larger IO chunks and better throughput. So, I guess its
>worth doing it
I hate to see something like this go ahead based on empirical results
without theory. It might make things worse somewhere else.
Do you have an explanation for why the IO chunks are larger? Is the I/O
scheduler not building large I/Os out of small requests? Is the queue
running dry while the device is actually busy?
--
Bryan Henderson San Jose California
IBM Almaden Research Center Filesystems
* Re: ext3 writepages ?
2005-02-10 2:05 ` Bryan Henderson
@ 2005-02-10 2:45 ` Sonny Rao
2005-02-10 17:51 ` Bryan Henderson
2005-02-10 16:02 ` Badari Pulavarty
1 sibling, 1 reply; 21+ messages in thread
From: Sonny Rao @ 2005-02-10 2:45 UTC (permalink / raw)
To: Bryan Henderson; +Cc: pbadari, Andreas Dilger, linux-fsdevel
On Wed, Feb 09, 2005 at 09:05:21PM -0500, Bryan Henderson wrote:
> >I see much larger IO chunks and better throughput. So, I guess its
> >worth doing it
>
> I hate to see something like this go ahead based on empirical results
> without theory. It might make things worse somewhere else.
>
> Do you have an explanation for why the IO chunks are larger? Is the I/O
> scheduler not building large I/Os out of small requests? Is the queue
> running dry while the device is actually busy?
Yes, the queue is running dry, and there is much more evidence of that
besides just the throughput numbers.
I am inferring this using iostat which shows that average device
utilization fluctuates between 83 and 99 percent and the average
request size is around 650 sectors (going to the device) without
writepages.
With writepages, device utilization never drops below 95 percent and
is usually about 98 percent utilized, and the average request size to
the device is around 1000 sectors. Not to mention the io-scheduler
merge rate is reduced by a few orders of magnitude (from ~16k merges to ~30).
I'm not sure what theory you are looking for here? We do the work of
coalescing io requests up front, rather than relying on an io-scheduler
to save us. What is the point of the 2.6 block-io subsystem (i.e. the
bio layer) if you don't use it to its fullest potential?
I can give you pointers to the data if you're interested.
Sonny
* Re: ext3 writepages ?
2005-02-10 2:05 ` Bryan Henderson
2005-02-10 2:45 ` Sonny Rao
@ 2005-02-10 16:02 ` Badari Pulavarty
2005-02-10 18:00 ` Bryan Henderson
1 sibling, 1 reply; 21+ messages in thread
From: Badari Pulavarty @ 2005-02-10 16:02 UTC (permalink / raw)
To: Bryan Henderson; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao
On Wed, 2005-02-09 at 18:05, Bryan Henderson wrote:
> >I see much larger IO chunks and better throughput. So, I guess its
> >worth doing it
>
> I hate to see something like this go ahead based on empirical results
> without theory. It might make things worse somewhere else.
>
> Do you have an explanation for why the IO chunks are larger? Is the I/O
> scheduler not building large I/Os out of small requests? Is the queue
> running dry while the device is actually busy?
>
Bryan,
I would like to find out what theory you are looking for.
Don't you think, filesystems submitting biggest chunks of IO
possible is better than submitting 1k-4k chunks and hoping that
IO schedulers do the perfect job ?
BTW, writepages() is being used for other filesystems like JFS.
We all learnt through the 2.4 RAW code about the overhead of doing 512-byte
IO and making the elevator merge all the pieces together. That's
one reason why the 2.6 DIO/RAW code was completely rewritten from
scratch to submit the biggest possible IO chunks.
Well, I agree that we should have a theory behind the results.
We are just playing with prototypes for now.
Thanks,
Badari
* Re: ext3 writepages ?
2005-02-10 2:45 ` Sonny Rao
@ 2005-02-10 17:51 ` Bryan Henderson
2005-02-10 19:02 ` Sonny Rao
0 siblings, 1 reply; 21+ messages in thread
From: Bryan Henderson @ 2005-02-10 17:51 UTC (permalink / raw)
To: Sonny Rao; +Cc: Andreas Dilger, linux-fsdevel, pbadari
>I am inferring this using iostat which shows that average device
>utilization fluctuates between 83 and 99 percent and the average
>request size is around 650 sectors (going to the device) without
>writepages.
>
>With writepages, device utilization never drops below 95 percent and
>is usually about 98 percent utilized, and the average request size to
>the device is around 1000 sectors.
Well that blows away the only two ways I know that this effect can happen.
The first has to do with certain code being more efficient than other
code at assembling I/Os, but the fact that the CPU utilization is the same
in both cases pretty much eliminates that. The other is where the
interactivity of the I/O generator doesn't match the buffering in the
device so that the device ends up 100% busy processing small I/Os that
were sent to it because it said all the while that it needed more work.
But in the small-I/O case, we don't see a 100% busy device.
So why would the device be up to 17% idle, since the writepages case makes
it apparent that the I/O generator is capable of generating much more
work? Is there some queue plugging (I/O scheduler delays sending I/O to
the device even though the device is idle) going on?
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
* Re: ext3 writepages ?
2005-02-10 16:02 ` Badari Pulavarty
@ 2005-02-10 18:00 ` Bryan Henderson
2005-02-10 18:32 ` Badari Pulavarty
0 siblings, 1 reply; 21+ messages in thread
From: Bryan Henderson @ 2005-02-10 18:00 UTC (permalink / raw)
To: pbadari; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao
>Don't you think, filesystems submitting biggest chunks of IO
>possible is better than submitting 1k-4k chunks and hoping that
>IO schedulers do the perfect job ?
No, I don't see why it would be better. In fact intuitively, I think the I/O
scheduler, being closer to the device, should do a better job of deciding
in what packages I/O should go to the device. After all, there exist
block devices that don't process big chunks faster than small ones. But
So this starts to look like something where you withhold data from the I/O
scheduler in order to prevent it from scheduling the I/O wrongly because
you (the pager/filesystem driver) know better. That shouldn't be the
architecture.
So I'd still like to see a theory that explains why submitting the
I/O a little at a time (i.e. including the bio_submit() in the loop that
assembles the I/O) causes the device to be idle more.
>We all learnt through the 2.4 RAW code about the overhead of doing 512-byte
>IO and making the elevator merge all the pieces together.
That was CPU time, right? In the present case, the numbers say it takes
the same amount of CPU time to assemble the I/O above the I/O scheduler as
inside it.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
* Re: ext3 writepages ?
2005-02-10 18:00 ` Bryan Henderson
@ 2005-02-10 18:32 ` Badari Pulavarty
2005-02-10 20:30 ` Bryan Henderson
0 siblings, 1 reply; 21+ messages in thread
From: Badari Pulavarty @ 2005-02-10 18:32 UTC (permalink / raw)
To: Bryan Henderson; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao
On Thu, 2005-02-10 at 10:00, Bryan Henderson wrote:
> >Don't you think, filesystems submitting biggest chunks of IO
> >possible is better than submitting 1k-4k chunks and hoping that
> >IO schedulers do the perfect job ?
>
> No, I don't see why it would be better. In fact intuitively, I think the I/O
> scheduler, being closer to the device, should do a better job of deciding
> in what packages I/O should go to the device. After all, there exist
> block devices that don't process big chunks faster than small ones. But
>
> So this starts to look like something where you withhold data from the I/O
> scheduler in order to prevent it from scheduling the I/O wrongly because
> you (the pager/filesystem driver) know better. That shouldn't be the
> architecture.
>
> So I'd still like to see a theory that explains why submitting the
> I/O a little at a time (i.e. including the bio_submit() in the loop that
> assembles the I/O) causes the device to be idle more.
>
> >We all learnt through the 2.4 RAW code about the overhead of doing 512-byte
> >IO and making the elevator merge all the pieces together.
>
> That was CPU time, right? In the present case, the numbers say it takes
> the same amount of CPU time to assemble the I/O above the I/O scheduler as
> inside it.
One clear distinction between submitting smaller chunks vs larger
ones is the number of callbacks we get and the processing we need to
do.
I don't think we have enough numbers here to get to the bottom of this.
CPU utilization remaining the same in both cases doesn't mean that the
test took exactly the same amount of time. I don't even think that we
are doing a fixed number of IOs. Its possible that by doing larger
IOs we save CPU and use that CPU to push more data ?
Thanks,
Badari
* Re: ext3 writepages ?
2005-02-10 17:51 ` Bryan Henderson
@ 2005-02-10 19:02 ` Sonny Rao
0 siblings, 0 replies; 21+ messages in thread
From: Sonny Rao @ 2005-02-10 19:02 UTC (permalink / raw)
To: Bryan Henderson; +Cc: Andreas Dilger, linux-fsdevel, pbadari
On Thu, Feb 10, 2005 at 09:51:42AM -0800, Bryan Henderson wrote:
> >I am inferring this using iostat which shows that average device
> >utilization fluctuates between 83 and 99 percent and the average
> >request size is around 650 sectors (going to the device) without
> >writepages.
> >
> >With writepages, device utilization never drops below 95 percent and
> >is usually about 98 percent utilized, and the average request size to
> >the device is around 1000 sectors.
>
> Well that blows away the only two ways I know that this effect can happen.
> The first has to do with certain code being more efficient than other
> code at assembling I/Os, but the fact that the CPU utilization is the same
> in both cases pretty much eliminates that.
No, I don't think you can draw that conclusion based on total CPU
utilization, because in the writepages case we are spending more time
(as a percentage of total time) copying data from userspace, which
leads to an increase in CPU utilization. So, I think this shows that the
writepages code path is in fact more efficient than the ioscheduler path.
Here's the oprofile output from the runs where you'll see
__copy_from_user_ll at the top of both profiles:
No writepages:
CPU: P4 / Xeon, speed 1997.8 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % image name app name symbol name
2225649 38.7482 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 __copy_from_user_ll
1471012 25.6101 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 poll_idle
104736 1.8234 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 __block_commit_write
92702 1.6139 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 mark_offset_cyclone
90077 1.5682 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 _spin_lock
83649 1.4563 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 __block_write_full_page
81483 1.4186 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 generic_file_buffered_write
69232 1.2053 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 ext3_writeback_commit_write
With writepages:
CPU: P4 / Xeon, speed 1997.98 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % image name app name symbol name
2487751 43.4411 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 __copy_from_user_ll
1518775 26.5209 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 poll_idle
124956 2.1820 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 _spin_lock
93689 1.6360 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 generic_file_buffered_write
93139 1.6264 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 mark_offset_cyclone
89683 1.5660 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 ext3_writeback_commit_write
So we see 38% vs 43%, which I believe should be directly correlated with
throughput (about a 12% difference here).
> The other is where the
> interactivity of the I/O generator doesn't match the buffering in the
> device so that the device ends up 100% busy processing small I/Os that
> were sent to it because it said all the while that it needed more work.
> But in the small-I/O case, we don't see a 100% busy device.
That might be possible, but I'm not sure how one could account for it?
The application, VM, and I/O systems are all so intertwined it would be
difficult to isolate the application if we are trying to measure
maximum throughput, no?
> So why would the device be up to 17% idle, since the writepages case makes
> it apparent that the I/O generator is capable of generating much more
> work? Is there some queue plugging (I/O scheduler delays sending I/O to
> the device even though the device is idle) going on?
Again, I think the amount of work being generated is directly related
to how quickly the dirty pages are being flushed out, so
inefficiencies in the io-system bubble up to the generator.
Sonny
* Re: ext3 writepages ?
2005-02-10 20:30 ` Bryan Henderson
@ 2005-02-10 20:25 ` Sonny Rao
2005-02-11 0:20 ` Bryan Henderson
0 siblings, 1 reply; 21+ messages in thread
From: Sonny Rao @ 2005-02-10 20:25 UTC (permalink / raw)
To: Bryan Henderson; +Cc: pbadari, Andreas Dilger, linux-fsdevel
On Thu, Feb 10, 2005 at 12:30:23PM -0800, Bryan Henderson wrote:
> >Its possible that by doing larger
> >IOs we save CPU and use that CPU to push more data ?
>
> This is absolutely right; my mistake -- the relevant number is CPU seconds
> per megabyte moved, not CPU seconds per elapsed second.
> But I don't think we're close enough to 100% CPU utilization that this
> explains much.
>
> In fact, the curious thing here is that neither the disk nor the CPU seems
> to be a bottleneck in the slow case. Maybe there's some serialization I'm
> not seeing that makes less parallelism between I/O and execution. Is this
> a single thread doing writes and syncs to a single file?
From what I've seen, without writepages, the application thread itself
tends to do the writing by falling into balance_dirty_pages() during
its write call, while in the writepages case, a pdflush thread seems
to do more of the writeback. This also depends somewhat on
processor speed (and number) and amount of RAM.
To try and isolate this more, I've limited RAM (1GB) and number of
CPUs (1) on my testing setup.
So yes, there could be better parallelism in the writepages case, but
again this behavior could be a symptom and not a cause, but I'm not
sure how to figure that out, any suggestions ?
Sonny
* Re: ext3 writepages ?
2005-02-10 18:32 ` Badari Pulavarty
@ 2005-02-10 20:30 ` Bryan Henderson
2005-02-10 20:25 ` Sonny Rao
0 siblings, 1 reply; 21+ messages in thread
From: Bryan Henderson @ 2005-02-10 20:30 UTC (permalink / raw)
To: pbadari; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao
>Its possible that by doing larger
>IOs we save CPU and use that CPU to push more data ?
This is absolutely right; my mistake -- the relevant number is CPU seconds
per megabyte moved, not CPU seconds per elapsed second.
But I don't think we're close enough to 100% CPU utilization that this
explains much.
In fact, the curious thing here is that neither the disk nor the CPU seems
to be a bottleneck in the slow case. Maybe there's some serialization I'm
not seeing that makes less parallelism between I/O and execution. Is this
a single thread doing writes and syncs to a single file?
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
* Re: ext3 writepages ?
2005-02-10 20:25 ` Sonny Rao
@ 2005-02-11 0:20 ` Bryan Henderson
0 siblings, 0 replies; 21+ messages in thread
From: Bryan Henderson @ 2005-02-11 0:20 UTC (permalink / raw)
To: Sonny Rao; +Cc: Andreas Dilger, linux-fsdevel, pbadari
I went back and looked more closely and see that you did more than add a
->writepages method. You replaced the ->prepare_write with one that
doesn't involve the buffer cache, right? And from your answer to Badari's
question about that, I believe you said this is not an integral part of
having ->writepages, but an additional enhancement. Well, that could
explain a lot. First of all, there's a significant amount of CPU time
involved in managing buffer heads. In the profile you posted, it's one of
the differences in CPU time between the writepages and non-writepages
case. But it also changes the whole way the file cache is managed,
doesn't it? That might account for the fact that in one case you see
cache cleaning happening via balance_dirty_pages() (i.e. memory fills up),
but in the other it happens via pdflush. I'm not really up on the buffer
cache; I haven't used it in my own studies for years.
I also saw that while you originally said CPU utilization was 73% in both
cases, in one of the profiles I add up at least 77% for the writepages
case, so I'm not sure we're really comparing straight across.
To investigate these effects further, I think you should monitor
/proc/meminfo. And/or make more isolated changes to the code.
>So yes, there could be better parallelism in the writepages case, but
>again this behavior could be a symptom and not a cause,
I'm not really suggesting that there's better parallelism in the
writepages case. I'm suggesting that there's poor parallelism (compared
to what I expect) in both cases, which means that adding CPU time directly
affects throughput. If the CPU time were in parallel with the I/O time,
adding an extra 1.8ms per megabyte to the CPU time (which is what one of
my calculations from your data gave) wouldn't affect throughput.
But I believe we've at least established doubt that submitting an entire
file cache in one bio is faster than submitting a bio for each page and
that smaller I/Os (to the device) cause lower throughput in the
non-writepages case (it seems more likely that the lower throughput causes
the smaller I/Os).
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
Thread overview: 21 messages
2005-02-02 15:32 ext3 writepages ? Badari Pulavarty
2005-02-02 20:19 ` Sonny Rao
2005-02-03 15:51 ` Badari Pulavarty
2005-02-03 17:00 ` Sonny Rao
2005-02-03 16:56 ` Badari Pulavarty
2005-02-03 17:24 ` Sonny Rao
2005-02-03 20:50 ` Sonny Rao
2005-02-08 1:33 ` Andreas Dilger
2005-02-08 5:38 ` Sonny Rao
2005-02-09 21:11 ` Sonny Rao
2005-02-09 22:29 ` Badari Pulavarty
2005-02-10 2:05 ` Bryan Henderson
2005-02-10 2:45 ` Sonny Rao
2005-02-10 17:51 ` Bryan Henderson
2005-02-10 19:02 ` Sonny Rao
2005-02-10 16:02 ` Badari Pulavarty
2005-02-10 18:00 ` Bryan Henderson
2005-02-10 18:32 ` Badari Pulavarty
2005-02-10 20:30 ` Bryan Henderson
2005-02-10 20:25 ` Sonny Rao
2005-02-11 0:20 ` Bryan Henderson