* ext3 writepages ?
@ 2005-02-02 15:32 Badari Pulavarty
2005-02-02 20:19 ` Sonny Rao
0 siblings, 1 reply; 21+ messages in thread
From: Badari Pulavarty @ 2005-02-02 15:32 UTC (permalink / raw)
To: linux-fsdevel
Hi,
I forgot the reason why we don't have ext3_writepages() ?
I can dig through to find out, but it would be easy to ask
people.
Please let me know.
Thanks,
Badari
* Re: ext3 writepages ?
2005-02-02 15:32 ext3 writepages ? Badari Pulavarty
@ 2005-02-02 20:19 ` Sonny Rao
2005-02-03 15:51 ` Badari Pulavarty
0 siblings, 1 reply; 21+ messages in thread
From: Sonny Rao @ 2005-02-02 20:19 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: linux-fsdevel
On Wed, Feb 02, 2005 at 07:32:04AM -0800, Badari Pulavarty wrote:
> Hi,
>
> I forgot the reason why we don't have ext3_writepages() ?
> I can dig through to find out, but it would be easy to ask
> people.
>
> Please let me know.
Badari, I seem to have successfully hacked the writeback mode to use
writepages on a User-Mode Linux instance. I'm going to try it on a
real box soon. The only issue is that pdflush is passing the create
parameter as 1 to writepages, which doesn't exactly make sense. I
suppose it might be needed for a filesystem like XFS which does
delayed block allocation ? In ext3 however, the blocks should have
been allocated beforehand.
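(For illustration, a minimal sketch of the hook being described -- essentially what the patch posted later in this thread does; the wrapper name is hypothetical, and it assumes ext3's existing ext3_get_block() can simply be called with create forced to 0:)
static int ext3_get_block_nocreate(struct inode *inode, sector_t iblock,
				   struct buffer_head *bh_result, int create)
{
	/* writeback mode: blocks were allocated at prepare_write time,
	 * so ignore the create flag mpage_writepages() passes down */
	return ext3_get_block(inode, iblock, bh_result, 0);
}

static int ext3_writeback_writepages(struct address_space *mapping,
				     struct writeback_control *wbc)
{
	return mpage_writepages(mapping, wbc, ext3_get_block_nocreate);
}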
Sonny
* Re: ext3 writepages ?
2005-02-02 20:19 ` Sonny Rao
@ 2005-02-03 15:51 ` Badari Pulavarty
2005-02-03 17:00 ` Sonny Rao
2005-02-03 20:50 ` Sonny Rao
0 siblings, 2 replies; 21+ messages in thread
From: Badari Pulavarty @ 2005-02-03 15:51 UTC (permalink / raw)
To: Sonny Rao; +Cc: linux-fsdevel
On Wed, 2005-02-02 at 12:19, Sonny Rao wrote:
> On Wed, Feb 02, 2005 at 07:32:04AM -0800, Badari Pulavarty wrote:
> > Hi,
> >
> > I forgot the reason why we don't have ext3_writepages() ?
> > I can dig through to find out, but it would be easy to ask
> > people.
> >
> > Please let me know.
>
> Badari, I seem to have successfully hacked the writeback mode to use
> writepages on a User-Mode Linux instance. I'm going to try it on a
> real box soon. The only issue is that pdflush is passing the create
> parameter as 1 to writepages, which doesn't exactly make sense. I
> suppose it might be needed for a filesystem like XFS which does
> delayed block allocation ? In ext3 however, the blocks should have
> been allocated beforehand.
Funny, I am also hacking writepages for writeback mode. You are a step
ahead of me :) Please let me know, how it goes.
Thanks,
Badari
* Re: ext3 writepages ?
2005-02-03 17:00 ` Sonny Rao
@ 2005-02-03 16:56 ` Badari Pulavarty
2005-02-03 17:24 ` Sonny Rao
0 siblings, 1 reply; 21+ messages in thread
From: Badari Pulavarty @ 2005-02-03 16:56 UTC (permalink / raw)
To: Sonny Rao; +Cc: linux-fsdevel
On Thu, 2005-02-03 at 09:00, Sonny Rao wrote:
>
> Well it seems to work, here's my (rather ugly) patch.
> I'm doing some performance comparisons now.
>
> Sonny
Interesting.. Why did you create a nobh_prepare_write() ?
mpage_writepages() can handle pages with buffer heads
attached.
And also, are you sure you don't need to journal start/stop
in writepages() ?
Thanks,
Badari
* Re: ext3 writepages ?
2005-02-03 15:51 ` Badari Pulavarty
@ 2005-02-03 17:00 ` Sonny Rao
2005-02-03 16:56 ` Badari Pulavarty
2005-02-03 20:50 ` Sonny Rao
1 sibling, 1 reply; 21+ messages in thread
From: Sonny Rao @ 2005-02-03 17:00 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: linux-fsdevel
On Thu, Feb 03, 2005 at 07:51:37AM -0800, Badari Pulavarty wrote:
> On Wed, 2005-02-02 at 12:19, Sonny Rao wrote:
> > On Wed, Feb 02, 2005 at 07:32:04AM -0800, Badari Pulavarty wrote:
> > > Hi,
> > >
> > > I forgot the reason why we don't have ext3_writepages() ?
> > > I can dig through to find out, but it would be easy to ask
> > > people.
> > >
> > > Please let me know.
> >
> > Badari, I seem to have successfully hacked the writeback mode to use
> > writepages on a User-Mode Linux instance. I'm going to try it on a
> > real box soon. The only issue is that pdflush is passing the create
> > parameter as 1 to writepages, which doesn't exactly make sense. I
> > suppose it might be needed for a filesystem like XFS which does
> > delayed block allocation ? In ext3 however, the blocks should have
> > been allocated beforehand.
>
> Funny, I am also hacking writepages for writeback mode. You are a step
> ahead of me :) Please let me know, how it goes.
Well it seems to work, here's my (rather ugly) patch.
I'm doing some performance comparisons now.
Sonny
[-- Attachment #2: ext3-wb-wpages.patch --]
diff -Naurp linux-2.6.10-original/fs/ext3/inode.c linux-2.6.10-working/fs/ext3/inode.c
--- linux-2.6.10-original/fs/ext3/inode.c 2004-12-24 15:35:01.000000000 -0600
+++ linux-2.6.10-working/fs/ext3/inode.c 2005-01-29 10:45:09.599837136 -0600
@@ -810,6 +810,18 @@ static int ext3_get_block(struct inode *
return ret;
}
+static int ext3_get_block_wpages(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
+{
+ /* ugly hack, just pass 0 for create to get_block_handle */
+ /* the blocks should have already been allocated if we're in */
+ /* writepages writeback */
+ return ext3_get_block_handle(NULL, inode, iblock,
+ bh_result, 0, 0);
+}
+
+
+
#define DIO_CREDITS (EXT3_RESERVE_TRANS_BLOCKS + 32)
static int
@@ -1025,6 +1037,32 @@ out:
return ret;
}
+static int ext3_nobh_prepare_write(struct file *file, struct page *page,
+ unsigned from, unsigned to)
+{
+ struct inode *inode = page->mapping->host;
+ int ret;
+ int needed_blocks = ext3_writepage_trans_blocks(inode);
+ handle_t *handle;
+ int retries = 0;
+
+retry:
+ handle = ext3_journal_start(inode, needed_blocks);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+ ret = nobh_prepare_write(page, from, to, ext3_get_block);
+ if (ret)
+ ext3_journal_stop(handle);
+ if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+out:
+
+ return ret;
+}
+
+
static int
ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
{
@@ -1092,7 +1130,7 @@ static int ext3_writeback_commit_write(s
new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
if (new_i_size > EXT3_I(inode)->i_disksize)
EXT3_I(inode)->i_disksize = new_i_size;
- ret = generic_commit_write(file, page, from, to);
+ ret = nobh_commit_write(file, page, from, to);
ret2 = ext3_journal_stop(handle);
if (!ret)
ret = ret2;
@@ -1321,6 +1359,14 @@ out_fail:
return ret;
}
+static int
+ext3_writeback_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ return mpage_writepages(mapping, wbc, ext3_get_block_wpages);
+}
+
+
static int ext3_writeback_writepage(struct page *page,
struct writeback_control *wbc)
{
@@ -1552,8 +1598,9 @@ static struct address_space_operations e
.readpage = ext3_readpage,
.readpages = ext3_readpages,
.writepage = ext3_writeback_writepage,
+ .writepages = ext3_writeback_writepages,
.sync_page = block_sync_page,
- .prepare_write = ext3_prepare_write,
+ .prepare_write = ext3_nobh_prepare_write,
.commit_write = ext3_writeback_commit_write,
.bmap = ext3_bmap,
.invalidatepage = ext3_invalidatepage,
* Re: ext3 writepages ?
2005-02-03 16:56 ` Badari Pulavarty
@ 2005-02-03 17:24 ` Sonny Rao
0 siblings, 0 replies; 21+ messages in thread
From: Sonny Rao @ 2005-02-03 17:24 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: linux-fsdevel
On Thu, Feb 03, 2005 at 08:56:50AM -0800, Badari Pulavarty wrote:
> On Thu, 2005-02-03 at 09:00, Sonny Rao wrote:
>
> >
> > Well it seems to work, here's my (rather ugly) patch.
> > I'm doing some performance comparisons now.
> >
> > Sonny
>
> Interesting.. Why did you create a nobh_prepare_write() ?
> mpage_writepages() can handle pages with buffer heads
> attached.
IIRC, block_prepare_write will attach buffer_heads for you, which I'm
explicitly trying to avoid.
> And also, are you sure you don't need to journal start/stop
> in writepages() ?
Heh, I'm not sure; I don't understand the semantics of those calls
well enough to say with certainty.
My guess is no, because the blocks on disk were already allocated
beforehand. Maybe it could be a problem if a truncate were in
progress elsewhere, but I don't think so, since the inode is
locked.
Sonny
* Re: ext3 writepages ?
2005-02-03 15:51 ` Badari Pulavarty
2005-02-03 17:00 ` Sonny Rao
@ 2005-02-03 20:50 ` Sonny Rao
2005-02-08 1:33 ` Andreas Dilger
1 sibling, 1 reply; 21+ messages in thread
From: Sonny Rao @ 2005-02-03 20:50 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: linux-fsdevel
On Thu, Feb 03, 2005 at 07:51:37AM -0800, Badari Pulavarty wrote:
> On Wed, 2005-02-02 at 12:19, Sonny Rao wrote:
> > On Wed, Feb 02, 2005 at 07:32:04AM -0800, Badari Pulavarty wrote:
> > > Hi,
> > >
> > > I forgot the reason why we don't have ext3_writepages() ?
> > > I can dig through to find out, but it would be easy to ask
> > > people.
> > >
> > > Please let me know.
> >
> > Badari, I seem to have successfully hacked the writeback mode to use
> > writepages on a User-Mode Linux instance. I'm going to try it on a
> > real box soon. The only issue is that pdflush is passing the create
> > parameter as 1 to writepages, which doesn't exactly make sense. I
> > suppose it might be needed for a filesystem like XFS which does
> > delayed block allocation ? In ext3 however, the blocks should have
> > been allocated beforehand.
>
> Funny, I am also hacking writepages for writeback mode. You are a step
> ahead of me :) Please let me know, how it goes.
Well, from what I can tell, my patch doesn't seem to make much of a
difference in write throughput other than allowing multi-page bios to
be sent down and cutting down on buffer_head usage.
If the only goal was to reduce buffer_head usage, then this works, but
using an mpage_writepage-like function should achieve the same result.
I did notice in my write throughput tests that ext2 still did
significantly better for some reason, even though no meta-data changes
were occurring. I'm looking into that.
Sonny
* Re: ext3 writepages ?
2005-02-03 20:50 ` Sonny Rao
@ 2005-02-08 1:33 ` Andreas Dilger
2005-02-08 5:38 ` Sonny Rao
0 siblings, 1 reply; 21+ messages in thread
From: Andreas Dilger @ 2005-02-08 1:33 UTC (permalink / raw)
To: Sonny Rao; +Cc: Badari Pulavarty, linux-fsdevel
On Feb 03, 2005 15:50 -0500, Sonny Rao wrote:
> Well, from what I can tell, my patch doesn't seem to make much of a
> difference in write throughput other than allowing multi-page bios to
> be sent down and cutting down on buffer_head usage.
Even if it doesn't make a difference in performance, it might reduce the
CPU usage. Did you check that at all?
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/
* Re: ext3 writepages ?
2005-02-08 1:33 ` Andreas Dilger
@ 2005-02-08 5:38 ` Sonny Rao
2005-02-09 21:11 ` Sonny Rao
0 siblings, 1 reply; 21+ messages in thread
From: Sonny Rao @ 2005-02-08 5:38 UTC (permalink / raw)
To: Andreas Dilger; +Cc: linux-fsdevel, Badari Pulavarty
On Mon, Feb 07, 2005 at 06:33:51PM -0700, Andreas Dilger wrote:
> On Feb 03, 2005 15:50 -0500, Sonny Rao wrote:
> > Well, from what I can tell, my patch doesn't seem to make much of a
> > difference in write throughput other than allowing multi-page bios to
> > be sent down and cutting down on buffer_head usage.
>
> Even if it doesn't make a difference in performance, it might reduce the
> CPU usage. Did you check that at all?
No I didn't, I'll check that out and post back.
Sonny
* Re: ext3 writepages ?
2005-02-08 5:38 ` Sonny Rao
@ 2005-02-09 21:11 ` Sonny Rao
2005-02-09 22:29 ` Badari Pulavarty
0 siblings, 1 reply; 21+ messages in thread
From: Sonny Rao @ 2005-02-09 21:11 UTC (permalink / raw)
To: Andreas Dilger; +Cc: linux-fsdevel, Badari Pulavarty
On Tue, Feb 08, 2005 at 12:38:08AM -0500, Sonny Rao wrote:
> On Mon, Feb 07, 2005 at 06:33:51PM -0700, Andreas Dilger wrote:
> > On Feb 03, 2005 15:50 -0500, Sonny Rao wrote:
> > > Well, from what I can tell, my patch doesn't seem to make much of a
> > > difference in write throughput other than allowing multi-page bios to
> > > be sent down and cutting down on buffer_head usage.
> >
> > Even if it doesn't make a difference in performance, it might reduce the
> > CPU usage. Did you check that at all?
>
> No I didn't, I'll check that out and post back.
>
> Sonny
Ok, I take it back, on a raid device I saw a significant increase in
throughput and approximately equal cpu utilization. I was comparing
the wrong data points before.. oops.
Sequential overwrite went from 75.6 MB/sec to 87.7 MB/sec both with an
average CPU utilization of 73% for both.
So, I see a 16% improvement in throughput for this test case and a
corresponding increase in efficiency.
Although, after reading what SCT wrote about writepage and writepages
needing to have a transaction handle, in some cases, that might make
the proper writepages code significantly more complex than my two-bit
hack. Still, I think it's worth it.
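(A rough sketch, for illustration only, of what wrapping the whole call in a handle might look like -- the function name is hypothetical, and it assumes a single ext3_writepage_trans_blocks() credit estimate covers the entire range, which is exactly the part that SCT's point suggests would get complicated:)
static int ext3_writeback_writepages_journalled(struct address_space *mapping,
						struct writeback_control *wbc)
{
	struct inode *inode = mapping->host;
	handle_t *handle;
	int ret, err;

	/* start a handle up front so blocks touched during writeback are
	 * covered; the credit estimate is assumed sufficient (see caveat above) */
	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	ret = mpage_writepages(mapping, wbc, ext3_get_block);
	err = ext3_journal_stop(handle);
	if (!ret)
		ret = err;
	return ret;
}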
Sonny
* Re: ext3 writepages ?
2005-02-09 21:11 ` Sonny Rao
@ 2005-02-09 22:29 ` Badari Pulavarty
2005-02-10 2:05 ` Bryan Henderson
0 siblings, 1 reply; 21+ messages in thread
From: Badari Pulavarty @ 2005-02-09 22:29 UTC (permalink / raw)
To: Sonny Rao; +Cc: Andreas Dilger, linux-fsdevel
On Wed, 2005-02-09 at 13:11, Sonny Rao wrote:
> On Tue, Feb 08, 2005 at 12:38:08AM -0500, Sonny Rao wrote:
> > On Mon, Feb 07, 2005 at 06:33:51PM -0700, Andreas Dilger wrote:
> > > On Feb 03, 2005 15:50 -0500, Sonny Rao wrote:
> > > > Well, from what I can tell, my patch doesn't seem to make much of a
> > > > difference in write throughput other than allowing multi-page bios to
> > > > be sent down and cutting down on buffer_head usage.
> > >
> > > Even if it doesn't make a difference in performance, it might reduce the
> > > CPU usage. Did you check that at all?
> >
> > No I didn't, I'll check that out and post back.
> >
> > Sonny
>
> Ok, I take it back, on a raid device I saw a significant increase in
> throughput and approximately equal cpu utilization. I was comparing
> the wrong data points before.. oops.
>
> Sequential overwrite went from 75.6 MB/sec to 87.7 MB/sec both with an
> average CPU utilization of 73% for both.
>
> So, I see a 16% improvement in throughput for this test case and a
> corresponding increase in efficiency.
>
> Although, after reading what SCT wrote about writepage and writepages
> needing to have a transaction handle, in some cases, that might make
> the proper writepages code significantly more complex than my two-bit
> hack. Still, I think it's worth it.
Yep. I hacked ext3_writepages() to use mpage_writepages() as you did
(without modifying the bufferhead stuff). With the limited testing I did,
I see much larger IO chunks and better throughput. So, I guess its
worth doing it - I am a little worried about error handling though...
Let's handle one issue at a time.
First fix writepages() without bufferhead changes? Then handle
bufferheads? I still can't figure out a way to work around the
bufferheads, especially for ordered writes.
Thanks,
Badari
* Re: ext3 writepages ?
2005-02-09 22:29 ` Badari Pulavarty
@ 2005-02-10 2:05 ` Bryan Henderson
2005-02-10 2:45 ` Sonny Rao
2005-02-10 16:02 ` Badari Pulavarty
0 siblings, 2 replies; 21+ messages in thread
From: Bryan Henderson @ 2005-02-10 2:05 UTC (permalink / raw)
To: pbadari; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao
>I see much larger IO chunks and better throughput. So, I guess its
>worth doing it
I hate to see something like this go ahead based on empirical results
without theory. It might make things worse somewhere else.
Do you have an explanation for why the IO chunks are larger? Is the I/O
scheduler not building large I/Os out of small requests? Is the queue
running dry while the device is actually busy?
--
Bryan Henderson San Jose California
IBM Almaden Research Center Filesystems
* Re: ext3 writepages ?
2005-02-10 2:05 ` Bryan Henderson
@ 2005-02-10 2:45 ` Sonny Rao
2005-02-10 17:51 ` Bryan Henderson
2005-02-10 16:02 ` Badari Pulavarty
1 sibling, 1 reply; 21+ messages in thread
From: Sonny Rao @ 2005-02-10 2:45 UTC (permalink / raw)
To: Bryan Henderson; +Cc: pbadari, Andreas Dilger, linux-fsdevel
On Wed, Feb 09, 2005 at 09:05:21PM -0500, Bryan Henderson wrote:
> >I see much larger IO chunks and better throughput. So, I guess its
> >worth doing it
>
> I hate to see something like this go ahead based on empirical results
> without theory. It might make things worse somewhere else.
>
> Do you have an explanation for why the IO chunks are larger? Is the I/O
> scheduler not building large I/Os out of small requests? Is the queue
> running dry while the device is actually busy?
Yes, the queue is running dry, and there is much more evidence of that
besides just the throughput numbers.
I am inferring this using iostat which shows that average device
utilization fluctuates between 83 and 99 percent and the average
request size is around 650 sectors (going to the device) without
writepages.
With writepages, device utilization never drops below 95 percent and
is usually about 98 percent utilized, and the average request size to
the device is around 1000 sectors. Not to mention the io-scheduler
merge rate is reduced by a few orders of magnitude (from ~16k merges to ~30).
I'm not sure what theory you are looking for here? We do the work of
coalescing io requests up front, rather than relying on an io-scheduler
to save us. What is the point of the 2.6 block-io subsystem (i.e. the
bio layer) if you don't use it to its fullest potential?
I can give you pointers to the data if you're interested.
Sonny
* Re: ext3 writepages ?
2005-02-10 2:05 ` Bryan Henderson
2005-02-10 2:45 ` Sonny Rao
@ 2005-02-10 16:02 ` Badari Pulavarty
2005-02-10 18:00 ` Bryan Henderson
1 sibling, 1 reply; 21+ messages in thread
From: Badari Pulavarty @ 2005-02-10 16:02 UTC (permalink / raw)
To: Bryan Henderson; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao
On Wed, 2005-02-09 at 18:05, Bryan Henderson wrote:
> >I see much larger IO chunks and better throughput. So, I guess its
> >worth doing it
>
> I hate to see something like this go ahead based on empirical results
> without theory. It might make things worse somewhere else.
>
> Do you have an explanation for why the IO chunks are larger? Is the I/O
> scheduler not building large I/Os out of small requests? Is the queue
> running dry while the device is actually busy?
>
Bryan,
I would like to find out what theory you are looking for.
Don't you think, filesystems submitting biggest chunks of IO
possible is better than submitting 1k-4k chunks and hoping that
IO schedulers do the perfect job ?
BTW, writepages() is being used for other filesystems like JFS.
We all learnt through the 2.4 RAW code about the overhead of doing 512-byte
IO and making the elevator merge all the pieces together. That's
one reason why the 2.6 DIO/RAW code was completely rewritten from
scratch to submit the biggest possible IO chunks.
Well, I agree that we should have a theory behind the results.
We are just playing with prototypes for now.
Thanks,
Badari
* Re: ext3 writepages ?
2005-02-10 2:45 ` Sonny Rao
@ 2005-02-10 17:51 ` Bryan Henderson
2005-02-10 19:02 ` Sonny Rao
0 siblings, 1 reply; 21+ messages in thread
From: Bryan Henderson @ 2005-02-10 17:51 UTC (permalink / raw)
To: Sonny Rao; +Cc: Andreas Dilger, linux-fsdevel, pbadari
>I am inferring this using iostat which shows that average device
>utilization fluctuates between 83 and 99 percent and the average
>request size is around 650 sectors (going to the device) without
>writepages.
>
>With writepages, device utilization never drops below 95 percent and
>is usually about 98 percent utilized, and the average request size to
>the device is around 1000 sectors.
Well that blows away the only two ways I know that this effect can happen.
The first has to do with certain code being more efficient than other
code at assembling I/Os, but the fact that the CPU utilization is the same
in both cases pretty much eliminates that. The other is where the
interactivity of the I/O generator doesn't match the buffering in the
device so that the device ends up 100% busy processing small I/Os that
were sent to it because it said all the while that it needed more work.
But in the small-I/O case, we don't see a 100% busy device.
So why would the device be up to 17% idle, since the writepages case makes
it apparent that the I/O generator is capable of generating much more
work? Is there some queue plugging (I/O scheduler delays sending I/O to
the device even though the device is idle) going on?
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
* Re: ext3 writepages ?
2005-02-10 16:02 ` Badari Pulavarty
@ 2005-02-10 18:00 ` Bryan Henderson
2005-02-10 18:32 ` Badari Pulavarty
0 siblings, 1 reply; 21+ messages in thread
From: Bryan Henderson @ 2005-02-10 18:00 UTC (permalink / raw)
To: pbadari; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao
>Don't you think, filesystems submitting biggest chunks of IO
>possible is better than submitting 1k-4k chunks and hoping that
>IO schedulers do the perfect job ?
No, I don't see why it would be better. In fact intuitively, I think the I/O
scheduler, being closer to the device, should do a better job of deciding
in what packages I/O should go to the device. After all, there exist
block devices that don't process big chunks faster than small ones. But
So this starts to look like something where you withhold data from the I/O
scheduler in order to prevent it from scheduling the I/O wrongly because
you (the pager/filesystem driver) know better. That shouldn't be the
architecture.
So I'd still like to see a theory that explains why submitting the
I/O a little at a time (i.e. including the bio_submit() in the loop that
assembles the I/O) causes the device to be idle more.
>We all learnt through the 2.4 RAW code about the overhead of doing 512-byte
>IO and making the elevator merge all the pieces together.
That was CPU time, right? In the present case, the numbers say it takes
the same amount of CPU time to assemble the I/O above the I/O scheduler as
inside it.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
* Re: ext3 writepages ?
2005-02-10 18:00 ` Bryan Henderson
@ 2005-02-10 18:32 ` Badari Pulavarty
2005-02-10 20:30 ` Bryan Henderson
0 siblings, 1 reply; 21+ messages in thread
From: Badari Pulavarty @ 2005-02-10 18:32 UTC (permalink / raw)
To: Bryan Henderson; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao
On Thu, 2005-02-10 at 10:00, Bryan Henderson wrote:
> >Don't you think, filesystems submitting biggest chunks of IO
> >possible is better than submitting 1k-4k chunks and hoping that
> >IO schedulers do the perfect job ?
>
> No, I don't see why it would be better. In fact intuitively, I think the I/O
> scheduler, being closer to the device, should do a better job of deciding
> in what packages I/O should go to the device. After all, there exist
> block devices that don't process big chunks faster than small ones. But
>
> So this starts to look like something where you withhold data from the I/O
> scheduler in order to prevent it from scheduling the I/O wrongly because
> you (the pager/filesystem driver) know better. That shouldn't be the
> architecture.
>
> So I'd still like to see a theory that explains why submitting the
> I/O a little at a time (i.e. including the bio_submit() in the loop that
> assembles the I/O) causes the device to be idle more.
>
> >We all learnt through the 2.4 RAW code about the overhead of doing 512-byte
> >IO and making the elevator merge all the pieces together.
>
> That was CPU time, right? In the present case, the numbers say it takes
> the same amount of CPU time to assemble the I/O above the I/O scheduler as
> inside it.
One clear distinction between submitting smaller chunks vs larger
ones is the number of callbacks we get and the processing we need to
do.
I don't think we have enough numbers here to get to the bottom of this.
CPU utilization remaining the same in both cases doesn't mean that the
test took exactly the same amount of time. I don't even think that we
are doing a fixed number of IOs. Its possible that by doing larger
IOs we save CPU and use that CPU to push more data ?
Thanks,
Badari
* Re: ext3 writepages ?
2005-02-10 17:51 ` Bryan Henderson
@ 2005-02-10 19:02 ` Sonny Rao
0 siblings, 0 replies; 21+ messages in thread
From: Sonny Rao @ 2005-02-10 19:02 UTC (permalink / raw)
To: Bryan Henderson; +Cc: Andreas Dilger, linux-fsdevel, pbadari
On Thu, Feb 10, 2005 at 09:51:42AM -0800, Bryan Henderson wrote:
> >I am inferring this using iostat which shows that average device
> >utilization fluctuates between 83 and 99 percent and the average
> >request size is around 650 sectors (going to the device) without
> >writepages.
> >
> >With writepages, device utilization never drops below 95 percent and
> >is usually about 98 percent utilized, and the average request size to
> >the device is around 1000 sectors.
>
> Well that blows away the only two ways I know that this effect can happen.
> The first has to do with certain code being more efficient than other
> code at assembling I/Os, but the fact that the CPU utilization is the same
> in both cases pretty much eliminates that.
No, I don't think you can draw that conclusion based on total CPU
utilization, because in the writepages case we are spending more time
(as a percentage of total time) copying data from userspace, which
leads to an increase in CPU utilization. So, I think this shows that the
writepages code path is in fact more efficient than the ioscheduler path.
Here's the oprofile output from the runs where you'll see
__copy_from_user_ll at the top of both profiles:
No writepages:
CPU: P4 / Xeon, speed 1997.8 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % image name app name symbol name
2225649 38.7482 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 __copy_from_user_ll
1471012 25.6101 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 poll_idle
104736 1.8234 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 __block_commit_write
92702 1.6139 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 mark_offset_cyclone
90077 1.5682 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 _spin_lock
83649 1.4563 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 __block_write_full_page
81483 1.4186 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 generic_file_buffered_write
69232 1.2053 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 ext3_writeback_commit_write
With writepages:
CPU: P4 / Xeon, speed 1997.98 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % image name app name symbol name
2487751 43.4411 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 __copy_from_user_ll
1518775 26.5209 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 poll_idle
124956 2.1820 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 _spin_lock
93689 1.6360 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 generic_file_buffered_write
93139 1.6264 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 mark_offset_cyclone
89683 1.5660 vmlinux-autobench-2.6.10-autokern1 vmlinux-autobench-2.6.10-autokern1 ext3_writeback_commit_write
So we see 38% vs 43%, which I believe should be directly correlated with
throughput (about a 12% difference here).
> The other is where the
> interactivity of the I/O generator doesn't match the buffering in the
> device so that the device ends up 100% busy processing small I/Os that
> were sent to it because it said all the while that it needed more work.
> But in the small-I/O case, we don't see a 100% busy device.
That might be possible, but I'm not sure how one could account for it?
The application, VM, and I/O systems are all so intertwined it would be
difficult to isolate the application if we are trying to measure
maximum throughput, no?
> So why would the device be up to 17% idle, since the writepages case makes
> it apparent that the I/O generator is capable of generating much more
> work? Is there some queue plugging (I/O scheduler delays sending I/O to
> the device even though the device is idle) going on?
Again, I think the amount of work being generated is directly related
to how quickly the dirty pages are being flushed out, so
inefficiencies in the io-system bubble up to the generator.
Sonny
* Re: ext3 writepages ?
2005-02-10 20:30 ` Bryan Henderson
@ 2005-02-10 20:25 ` Sonny Rao
2005-02-11 0:20 ` Bryan Henderson
0 siblings, 1 reply; 21+ messages in thread
From: Sonny Rao @ 2005-02-10 20:25 UTC (permalink / raw)
To: Bryan Henderson; +Cc: pbadari, Andreas Dilger, linux-fsdevel
On Thu, Feb 10, 2005 at 12:30:23PM -0800, Bryan Henderson wrote:
> >Its possible that by doing larger
> >IOs we save CPU and use that CPU to push more data ?
>
> This is absolutely right; my mistake -- the relevant number is CPU seconds
> per megabyte moved, not CPU seconds per elapsed second.
> But I don't think we're close enough to 100% CPU utilization that this
> explains much.
>
> In fact, the curious thing here is that neither the disk nor the CPU seems
> to be a bottleneck in the slow case. Maybe there's some serialization I'm
> not seeing that makes less parallelism between I/O and execution. Is this
> a single thread doing writes and syncs to a single file?
From what I've seen, without writepages, the application thread itself
tends to do the writing by falling into balance_dirty_pages() during
its write call, while in the writepages case, a pdflush thread seems
to do more of the writeback. This also depends somewhat on
processor speed (and number) and amount of RAM.
To try and isolate this more, I've limited RAM (1GB) and number of
CPUs (1) on my testing setup.
So yes, there could be better parallelism in the writepages case, but
again this behavior could be a symptom and not a cause, but I'm not
sure how to figure that out, any suggestions ?
Sonny
* Re: ext3 writepages ?
2005-02-10 18:32 ` Badari Pulavarty
@ 2005-02-10 20:30 ` Bryan Henderson
2005-02-10 20:25 ` Sonny Rao
0 siblings, 1 reply; 21+ messages in thread
From: Bryan Henderson @ 2005-02-10 20:30 UTC (permalink / raw)
To: pbadari; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao
>Its possible that by doing larger
>IOs we save CPU and use that CPU to push more data ?
This is absolutely right; my mistake -- the relevant number is CPU seconds
per megabyte moved, not CPU seconds per elapsed second.
But I don't think we're close enough to 100% CPU utilization that this
explains much.
In fact, the curious thing here is that neither the disk nor the CPU seems
to be a bottleneck in the slow case. Maybe there's some serialization I'm
not seeing that makes less parallelism between I/O and execution. Is this
a single thread doing writes and syncs to a single file?
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
* Re: ext3 writepages ?
2005-02-10 20:25 ` Sonny Rao
@ 2005-02-11 0:20 ` Bryan Henderson
0 siblings, 0 replies; 21+ messages in thread
From: Bryan Henderson @ 2005-02-11 0:20 UTC (permalink / raw)
To: Sonny Rao; +Cc: Andreas Dilger, linux-fsdevel, pbadari
I went back and looked more closely and see that you did more than add a
->writepages method. You replaced the ->prepare_write with one that
doesn't involve the buffer cache, right? And from your answer to Badari's
question about that, I believe you said this is not an integral part of
having ->writepages, but an additional enhancement. Well, that could
explain a lot. First of all, there's a significant amount of CPU time
involved in managing buffer heads. In the profile you posted, it's one of
the differences in CPU time between the writepages and non-writepages
case. But it also changes the whole way the file cache is managed,
doesn't it? That might account for the fact that in one case you see
cache cleaning happening via balance_dirty_pages() (i.e. memory fills up),
but in the other it happens via pdflush. I'm not really up on the buffer
cache; I haven't used it in my own studies for years.
I also saw that while you originally said CPU utilization was 73% in both
cases, in one of the profiles I add up at least 77% for the writepages
case, so I'm not sure we're really comparing straight across.
To investigate these effects further, I think you should monitor
/proc/meminfo. And/or make more isolated changes to the code.
>So yes, there could be better parallelism in the writepages case, but
>again this behavior could be a symptom and not a cause,
I'm not really suggesting that there's better parallelism in the
writepages case. I'm suggesting that there's poor parallelism (compared
to what I expect) in both cases, which means that adding CPU time directly
affects throughput. If the CPU time were in parallel with the I/O time,
adding an extra 1.8ms per megabyte to the CPU time (which is what one of
my calculations from your data gave) wouldn't affect throughput.
But I believe we've at least established doubt that submitting an entire
file cache in one bio is faster than submitting a bio for each page and
that smaller I/Os (to the device) cause lower throughput in the
non-writepages case (it seems more likely that the lower throughput causes
the smaller I/Os).
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
Thread overview: 21 messages
2005-02-02 15:32 ext3 writepages ? Badari Pulavarty
2005-02-02 20:19 ` Sonny Rao
2005-02-03 15:51 ` Badari Pulavarty
2005-02-03 17:00 ` Sonny Rao
2005-02-03 16:56 ` Badari Pulavarty
2005-02-03 17:24 ` Sonny Rao
2005-02-03 20:50 ` Sonny Rao
2005-02-08 1:33 ` Andreas Dilger
2005-02-08 5:38 ` Sonny Rao
2005-02-09 21:11 ` Sonny Rao
2005-02-09 22:29 ` Badari Pulavarty
2005-02-10 2:05 ` Bryan Henderson
2005-02-10 2:45 ` Sonny Rao
2005-02-10 17:51 ` Bryan Henderson
2005-02-10 19:02 ` Sonny Rao
2005-02-10 16:02 ` Badari Pulavarty
2005-02-10 18:00 ` Bryan Henderson
2005-02-10 18:32 ` Badari Pulavarty
2005-02-10 20:30 ` Bryan Henderson
2005-02-10 20:25 ` Sonny Rao
2005-02-11 0:20 ` Bryan Henderson