* ext3 writepages ?
From: Badari Pulavarty @ 2005-02-02 15:32 UTC
To: linux-fsdevel

Hi,

I forgot the reason why we don't have ext3_writepages() ?
I can dig through to find out, but it would be easy to ask
people.

Please let me know.

Thanks,
Badari
* Re: ext3 writepages ?
From: Sonny Rao @ 2005-02-02 20:19 UTC
To: Badari Pulavarty; +Cc: linux-fsdevel

On Wed, Feb 02, 2005 at 07:32:04AM -0800, Badari Pulavarty wrote:
> Hi,
>
> I forgot the reason why we don't have ext3_writepages() ?
> I can dig through to find out, but it would be easy to ask
> people.
>
> Please let me know.

Badari, I seem to have successfully hacked the writeback mode to use
writepages on a User-Mode Linux instance.  I'm going to try it on a
real box soon.  The only issue is that pdflush is passing the create
parameter as 1 to writepages, which doesn't exactly make sense.  I
suppose it might be needed for a filesystem like XFS which does
delayed block allocation ?  In ext3 however, the blocks should have
been allocated beforehand.

Sonny
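[Aside -- illustrative code, not part of the original thread]

A minimal sketch of the point Sonny is making, assuming 2.6-era ext3
internals; the helper name ext3_get_block_noalloc is invented for
illustration and the function is meant to sit in fs/ext3/inode.c next to
ext3_get_block().  The get_block callback used on the writepages path can
simply ignore the create flag that pdflush passes down, so writeback only
maps blocks that already exist.  Sonny's patch below does essentially this
via ext3_get_block_handle().

#include <linux/fs.h>
#include <linux/buffer_head.h>

/*
 * Hypothetical wrapper: for writeback of pages whose blocks were already
 * allocated at prepare_write time, force create = 0 so the block lookup
 * never tries to allocate anything.
 */
static int ext3_get_block_noalloc(struct inode *inode, sector_t iblock,
				  struct buffer_head *bh_result, int create)
{
	return ext3_get_block(inode, iblock, bh_result, 0);
}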
* Re: ext3 writepages ?
From: Badari Pulavarty @ 2005-02-03 15:51 UTC
To: Sonny Rao; +Cc: linux-fsdevel

On Wed, 2005-02-02 at 12:19, Sonny Rao wrote:
> On Wed, Feb 02, 2005 at 07:32:04AM -0800, Badari Pulavarty wrote:
> > Hi,
> >
> > I forgot the reason why we don't have ext3_writepages() ?
> > I can dig through to find out, but it would be easy to ask
> > people.
> >
> > Please let me know.
>
> Badari, I seem to have successfully hacked the writeback mode to use
> writepages on a User-Mode Linux instance.  I'm going to try it on a
> real box soon.  The only issue is that pdflush is passing the create
> parameter as 1 to writepages, which doesn't exactly make sense.  I
> suppose it might be needed for a filesystem like XFS which does
> delayed block allocation ?  In ext3 however, the blocks should have
> been allocated beforehand.

Funny, I am also hacking writepages for writeback mode.  You are a step
ahead of me :)  Please let me know, how it goes.

Thanks,
Badari
* Re: ext3 writepages ?
From: Sonny Rao @ 2005-02-03 17:00 UTC
To: Badari Pulavarty; +Cc: linux-fsdevel

On Thu, Feb 03, 2005 at 07:51:37AM -0800, Badari Pulavarty wrote:
> On Wed, 2005-02-02 at 12:19, Sonny Rao wrote:
> > On Wed, Feb 02, 2005 at 07:32:04AM -0800, Badari Pulavarty wrote:
> > > Hi,
> > >
> > > I forgot the reason why we don't have ext3_writepages() ?
> > > I can dig through to find out, but it would be easy to ask
> > > people.
> > >
> > > Please let me know.
> >
> > Badari, I seem to have successfully hacked the writeback mode to use
> > writepages on a User-Mode Linux instance.  I'm going to try it on a
> > real box soon.  The only issue is that pdflush is passing the create
> > parameter as 1 to writepages, which doesn't exactly make sense.  I
> > suppose it might be needed for a filesystem like XFS which does
> > delayed block allocation ?  In ext3 however, the blocks should have
> > been allocated beforehand.
>
> Funny, I am also hacking writepages for writeback mode.  You are a step
> ahead of me :)  Please let me know, how it goes.

Well it seems to work, here's my (rather ugly) patch.
I'm doing some performance comparisons now.

Sonny

[-- Attachment #2: ext3-wb-wpages.patch --]

diff -Naurp linux-2.6.10-original/fs/ext3/inode.c linux-2.6.10-working/fs/ext3/inode.c
--- linux-2.6.10-original/fs/ext3/inode.c	2004-12-24 15:35:01.000000000 -0600
+++ linux-2.6.10-working/fs/ext3/inode.c	2005-01-29 10:45:09.599837136 -0600
@@ -810,6 +810,18 @@ static int ext3_get_block(struct inode *
 	return ret;
 }

+static int ext3_get_block_wpages(struct inode *inode, sector_t iblock,
+				 struct buffer_head *bh_result, int create)
+{
+	/* ugly hack, just pass 0 for create to get_block_handle */
+	/* the blocks should have already been allocated if we're in */
+	/* writepages writeback */
+	return ext3_get_block_handle(NULL, inode, iblock,
+				     bh_result, 0, 0);
+}
+
+
+
 #define DIO_CREDITS (EXT3_RESERVE_TRANS_BLOCKS + 32)

 static int
@@ -1025,6 +1037,32 @@ out:
 	return ret;
 }

+static int ext3_nobh_prepare_write(struct file *file, struct page *page,
+				   unsigned from, unsigned to)
+{
+	struct inode *inode = page->mapping->host;
+	int ret;
+	int needed_blocks = ext3_writepage_trans_blocks(inode);
+	handle_t *handle;
+	int retries = 0;
+
+retry:
+	handle = ext3_journal_start(inode, needed_blocks);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		goto out;
+	}
+	ret = nobh_prepare_write(page, from, to, ext3_get_block);
+	if (ret)
+		ext3_journal_stop(handle);
+	if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
+		goto retry;
+out:
+
+	return ret;
+}
+
+
 static int ext3_journal_dirty_data(handle_t *handle,
 				   struct buffer_head *bh)
 {
@@ -1092,7 +1130,7 @@ static int ext3_writeback_commit_write(s
 	new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
 	if (new_i_size > EXT3_I(inode)->i_disksize)
 		EXT3_I(inode)->i_disksize = new_i_size;
-	ret = generic_commit_write(file, page, from, to);
+	ret = nobh_commit_write(file, page, from, to);
 	ret2 = ext3_journal_stop(handle);
 	if (!ret)
 		ret = ret2;
@@ -1321,6 +1359,14 @@ out_fail:
 	return ret;
 }

+static int
+ext3_writeback_writepages(struct address_space *mapping,
+			  struct writeback_control *wbc)
+{
+	return mpage_writepages(mapping, wbc, ext3_get_block_wpages);
+}
+
+
 static int ext3_writeback_writepage(struct page *page,
 				    struct writeback_control *wbc)
 {
@@ -1552,8 +1598,9 @@ static struct address_space_operations e
 	.readpage	= ext3_readpage,
 	.readpages	= ext3_readpages,
 	.writepage	= ext3_writeback_writepage,
+	.writepages	= ext3_writeback_writepages,
 	.sync_page	= block_sync_page,
-	.prepare_write	= ext3_prepare_write,
+	.prepare_write	= ext3_nobh_prepare_write,
 	.commit_write	= ext3_writeback_commit_write,
 	.bmap		= ext3_bmap,
 	.invalidatepage	= ext3_invalidatepage,
* Re: ext3 writepages ?
From: Badari Pulavarty @ 2005-02-03 16:56 UTC
To: Sonny Rao; +Cc: linux-fsdevel

On Thu, 2005-02-03 at 09:00, Sonny Rao wrote:
>
> Well it seems to work, here's my (rather ugly) patch.
> I'm doing some performance comparisons now.
>
> Sonny

Interesting.. Why did you create a nobh_prepare_write() ?
mpage_writepages() can handle pages with buffer heads
attached.

And also, are you sure you don't need to journal start/stop
in writepages() ?

Thanks,
Badari
* Re: ext3 writepages ?
From: Sonny Rao @ 2005-02-03 17:24 UTC
To: Badari Pulavarty; +Cc: linux-fsdevel

On Thu, Feb 03, 2005 at 08:56:50AM -0800, Badari Pulavarty wrote:
> On Thu, 2005-02-03 at 09:00, Sonny Rao wrote:
> >
> > Well it seems to work, here's my (rather ugly) patch.
> > I'm doing some performance comparisons now.
> >
> > Sonny
>
> Interesting.. Why did you create a nobh_prepare_write() ?
> mpage_writepages() can handle pages with buffer heads
> attached.

IIRC, block_prepare_write will attach buffer_heads for you, which I'm
explicitly trying to avoid.

> And also, are you sure you don't need to journal start/stop
> in writepages() ?

Heh, I'm not sure, I don't understand the semantics of those calls
well enough to say with certainty.  My guess is no, because the blocks
on disk were already allocated beforehand.  Maybe it could be a
problem if there could be a truncate in progress elsewhere, but I
don't think so since the inode is locked.

Sonny
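[Aside -- illustrative code, not part of the original thread]

If journalling did turn out to be required around the writepages call, the
smallest-change version would look roughly like the hedged sketch below.  It
assumes the 2.6-era ext3/JBD helpers already used in Sonny's patch
(ext3_journal_start/ext3_journal_stop, ext3_writepage_trans_blocks) and is
not his patch: his version calls mpage_writepages() without holding a
handle, on the assumption that no block allocation happens during writeback.

#include <linux/fs.h>
#include <linux/mpage.h>
#include <linux/writeback.h>
#include <linux/ext3_fs.h>
#include <linux/ext3_jbd.h>

/* Hypothetical: hold one transaction handle across the whole mpage run. */
static int ext3_writeback_writepages_journalled(struct address_space *mapping,
						struct writeback_control *wbc)
{
	struct inode *inode = mapping->host;
	handle_t *handle;
	int ret, err;

	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	ret = mpage_writepages(mapping, wbc, ext3_get_block);

	err = ext3_journal_stop(handle);
	if (!ret)
		ret = err;
	return ret;
}

In practice the handle would more likely have to be started per page, inside
the ->writepage path, since the credits are sized for a single page's worth
of blocks; that is the extra complexity Sonny alludes to later in the thread
when he mentions SCT's comments about writepage needing a transaction handle.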
* Re: ext3 writepages ?
From: Sonny Rao @ 2005-02-03 20:50 UTC
To: Badari Pulavarty; +Cc: linux-fsdevel

On Thu, Feb 03, 2005 at 07:51:37AM -0800, Badari Pulavarty wrote:
> On Wed, 2005-02-02 at 12:19, Sonny Rao wrote:
> > On Wed, Feb 02, 2005 at 07:32:04AM -0800, Badari Pulavarty wrote:
> > > Hi,
> > >
> > > I forgot the reason why we don't have ext3_writepages() ?
> > > I can dig through to find out, but it would be easy to ask
> > > people.
> > >
> > > Please let me know.
> >
> > Badari, I seem to have successfully hacked the writeback mode to use
> > writepages on a User-Mode Linux instance.  I'm going to try it on a
> > real box soon.  The only issue is that pdflush is passing the create
> > parameter as 1 to writepages, which doesn't exactly make sense.  I
> > suppose it might be needed for a filesystem like XFS which does
> > delayed block allocation ?  In ext3 however, the blocks should have
> > been allocated beforehand.
>
> Funny, I am also hacking writepages for writeback mode.  You are a step
> ahead of me :)  Please let me know, how it goes.

Well, from what I can tell, my patch doesn't seem to make much of a
difference in write throughput other than allowing multi-page bios to
be sent down and cutting down on buffer_head usage.  If the only goal
was to reduce buffer_head usage, then this works, but using an
mpage_writepage-like function should achieve the same result.

I did notice in my write throughput tests that ext2 still did
significantly better for some reason, even though no meta-data changes
were occurring.  I'm looking into that.

Sonny
* Re: ext3 writepages ?
From: Andreas Dilger @ 2005-02-08 1:33 UTC
To: Sonny Rao; +Cc: Badari Pulavarty, linux-fsdevel

On Feb 03, 2005 15:50 -0500, Sonny Rao wrote:
> Well, from what I can tell, my patch doesn't seem to make much of a
> difference in write throughput other than allowing multi-page bios to
> be sent down and cutting down on buffer_head usage.

Even if it doesn't make a difference in performance, it might reduce the
CPU usage.  Did you check that at all?

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/
http://members.shaw.ca/golinux/
* Re: ext3 writepages ?
From: Sonny Rao @ 2005-02-08 5:38 UTC
To: Andreas Dilger; +Cc: linux-fsdevel, Badari Pulavarty

On Mon, Feb 07, 2005 at 06:33:51PM -0700, Andreas Dilger wrote:
> On Feb 03, 2005 15:50 -0500, Sonny Rao wrote:
> > Well, from what I can tell, my patch doesn't seem to make much of a
> > difference in write throughput other than allowing multi-page bios to
> > be sent down and cutting down on buffer_head usage.
>
> Even if it doesn't make a difference in performance, it might reduce the
> CPU usage.  Did you check that at all?

No I didn't, I'll check that out and post back.

Sonny
* Re: ext3 writepages ?
From: Sonny Rao @ 2005-02-09 21:11 UTC
To: Andreas Dilger; +Cc: linux-fsdevel, Badari Pulavarty

On Tue, Feb 08, 2005 at 12:38:08AM -0500, Sonny Rao wrote:
> On Mon, Feb 07, 2005 at 06:33:51PM -0700, Andreas Dilger wrote:
> > On Feb 03, 2005 15:50 -0500, Sonny Rao wrote:
> > > Well, from what I can tell, my patch doesn't seem to make much of a
> > > difference in write throughput other than allowing multi-page bios to
> > > be sent down and cutting down on buffer_head usage.
> >
> > Even if it doesn't make a difference in performance, it might reduce the
> > CPU usage.  Did you check that at all?
>
> No I didn't, I'll check that out and post back.
>
> Sonny

Ok, I take it back, on a raid device I saw a significant increase in
throughput and approximately equal cpu utilization.  I was comparing
the wrong data points before.. oops.

Sequential overwrite went from 75.6 MB/sec to 87.7 MB/sec both with an
average CPU utilization of 73% for both.

So, I see a 16% improvement in throughput for this test case and a
corresponding increase in efficiency.

Although, after reading what SCT wrote about writepage and writepages
needing to have a transaction handle, in some cases, that might make
the proper writepages code significantly more complex than my two-bit
hack.  Still, I think it's worth it.

Sonny
* Re: ext3 writepages ?
From: Badari Pulavarty @ 2005-02-09 22:29 UTC
To: Sonny Rao; +Cc: Andreas Dilger, linux-fsdevel

On Wed, 2005-02-09 at 13:11, Sonny Rao wrote:
> On Tue, Feb 08, 2005 at 12:38:08AM -0500, Sonny Rao wrote:
> > On Mon, Feb 07, 2005 at 06:33:51PM -0700, Andreas Dilger wrote:
> > > On Feb 03, 2005 15:50 -0500, Sonny Rao wrote:
> > > > Well, from what I can tell, my patch doesn't seem to make much of a
> > > > difference in write throughput other than allowing multi-page bios to
> > > > be sent down and cutting down on buffer_head usage.
> > >
> > > Even if it doesn't make a difference in performance, it might reduce the
> > > CPU usage.  Did you check that at all?
> >
> > No I didn't, I'll check that out and post back.
> >
> > Sonny
>
> Ok, I take it back, on a raid device I saw a significant increase in
> throughput and approximately equal cpu utilization.  I was comparing
> the wrong data points before.. oops.
>
> Sequential overwrite went from 75.6 MB/sec to 87.7 MB/sec both with an
> average CPU utilization of 73% for both.
>
> So, I see a 16% improvement in throughput for this test case and a
> corresponding increase in efficiency.
>
> Although, after reading what SCT wrote about writepage and writepages
> needing to have a transaction handle, in some cases, that might make
> the proper writepages code significantly more complex than my two-bit
> hack.  Still, I think it's worth it.

Yep.  I hacked ext3_write_pages() to use mpage_writepages() as you did
(without modifying bufferheads stuff).  With the limited testing I did,
I see much larger IO chunks and better throughput.  So, I guess its
worth doing it - I am a little worried about error handling though..

Let's handle one issue at a time.  First fix writepages() without
bufferhead changes ?  Then handle bufferheads ?  I still can't figure
out a way to workaround the bufferheads especially for ordered writes.

Thanks,
Badari
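[Aside -- illustrative code, not part of the original thread]

To make the ordered-writes point concrete: in 2.6-era ext3, data=ordered
works by filing every buffer_head on the page with the running transaction,
so the data is guaranteed to hit disk before the metadata that references it
commits.  The simplified sketch below is loosely modelled on what
ext3_ordered_writepage() does; it is meant to sit inside fs/ext3/inode.c
(reusing that file's existing helpers such as ext3_journal_dirty_data(),
which appears in the patch context above), details and error paths are
reduced, and it is shown only to illustrate why buffer heads are hard to
drop on this path.

/*
 * Simplified sketch, not a working patch: ordered-mode writepage walks the
 * page's buffer_heads and hands each one to the journal, which is why a
 * buffer-head-free writepages is awkward for data=ordered.
 */
static int ordered_writepage_sketch(struct page *page,
				    struct writeback_control *wbc)
{
	struct inode *inode = page->mapping->host;
	struct buffer_head *bh, *head;
	handle_t *handle;
	int ret, err;

	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	if (!page_has_buffers(page))
		create_empty_buffers(page, inode->i_sb->s_blocksize,
				     (1 << BH_Dirty) | (1 << BH_Uptodate));

	/* Grab references so the buffers survive block_write_full_page(). */
	bh = head = page_buffers(page);
	do {
		get_bh(bh);
		bh = bh->b_this_page;
	} while (bh != head);

	ret = block_write_full_page(page, ext3_get_block, wbc);

	/* File every buffer with the current transaction, then drop refs. */
	bh = head;
	do {
		struct buffer_head *next = bh->b_this_page;

		err = ext3_journal_dirty_data(handle, bh);
		if (!ret)
			ret = err;
		put_bh(bh);
		bh = next;
	} while (bh != head);

	err = ext3_journal_stop(handle);
	if (!ret)
		ret = err;
	return ret;
}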
* Re: ext3 writepages ?
From: Bryan Henderson @ 2005-02-10 2:05 UTC
To: pbadari; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao

> I see much larger IO chunks and better throughput.  So, I guess its
> worth doing it

I hate to see something like this go ahead based on empirical results
without theory.  It might make things worse somewhere else.

Do you have an explanation for why the IO chunks are larger?  Is the I/O
scheduler not building large I/Os out of small requests?  Is the queue
running dry while the device is actually busy?

--
Bryan Henderson
San Jose California
IBM Almaden Research Center
Filesystems
* Re: ext3 writepages ?
From: Sonny Rao @ 2005-02-10 2:45 UTC
To: Bryan Henderson; +Cc: pbadari, Andreas Dilger, linux-fsdevel

On Wed, Feb 09, 2005 at 09:05:21PM -0500, Bryan Henderson wrote:
> > I see much larger IO chunks and better throughput.  So, I guess its
> > worth doing it
>
> I hate to see something like this go ahead based on empirical results
> without theory.  It might make things worse somewhere else.
>
> Do you have an explanation for why the IO chunks are larger?  Is the I/O
> scheduler not building large I/Os out of small requests?  Is the queue
> running dry while the device is actually busy?

Yes, the queue is running dry, and there is much more evidence of that
besides just the throughput numbers.

I am inferring this using iostat which shows that average device
utilization fluctuates between 83 and 99 percent and the average
request size is around 650 sectors (going to the device) without
writepages.

With writepages, device utilization never drops below 95 percent and
is usually about 98 percent utilized, and the average request size to
the device is around 1000 sectors.  Not to mention the io-scheduler
merge rate is reduced by a few orders of magnitude (16k vs ~30).

I'm not sure what theory you are looking for here?  We do the work of
coalescing io requests up front, rather than relying on an io-scheduler
to save us.  What is the point of the 2.6 block-io subsystem (i.e. the
bio layer) if you don't use it to its fullest potential?

I can give you pointers to the data if you're interested.

Sonny
* Re: ext3 writepages ?
From: Bryan Henderson @ 2005-02-10 17:51 UTC
To: Sonny Rao; +Cc: Andreas Dilger, linux-fsdevel, pbadari

> I am inferring this using iostat which shows that average device
> utilization fluctuates between 83 and 99 percent and the average
> request size is around 650 sectors (going to the device) without
> writepages.
>
> With writepages, device utilization never drops below 95 percent and
> is usually about 98 percent utilized, and the average request size to
> the device is around 1000 sectors.

Well that blows away the only two ways I know that this effect can happen.
The first has to do with certain code being more efficient than other
code at assembling I/Os, but the fact that the CPU utilization is the same
in both cases pretty much eliminates that.

The other is where the interactivity of the I/O generator doesn't match
the buffering in the device so that the device ends up 100% busy
processing small I/Os that were sent to it because it said all the while
that it needed more work.  But in the small-I/O case, we don't see a 100%
busy device.

So why would the device be up to 17% idle, since the writepages case makes
it apparent that the I/O generator is capable of generating much more
work?  Is there some queue plugging (I/O scheduler delays sending I/O to
the device even though the device is idle) going on?

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
* Re: ext3 writepages ?
From: Sonny Rao @ 2005-02-10 19:02 UTC
To: Bryan Henderson; +Cc: Andreas Dilger, linux-fsdevel, pbadari

On Thu, Feb 10, 2005 at 09:51:42AM -0800, Bryan Henderson wrote:
> > I am inferring this using iostat which shows that average device
> > utilization fluctuates between 83 and 99 percent and the average
> > request size is around 650 sectors (going to the device) without
> > writepages.
> >
> > With writepages, device utilization never drops below 95 percent and
> > is usually about 98 percent utilized, and the average request size to
> > the device is around 1000 sectors.
>
> Well that blows away the only two ways I know that this effect can happen.
> The first has to do with certain code being more efficient than other
> code at assembling I/Os, but the fact that the CPU utilization is the same
> in both cases pretty much eliminates that.

No, I don't think you can draw that conclusion based on total CPU
utilization, because in the writepages case we are spending more time
(as a percentage of total time) copying data from userspace, which
leads to an increase in CPU utilization.  So, I think this shows that
the writepages code path is in fact more efficient than the
ioscheduler path.

Here's the oprofile output from the runs where you'll see
__copy_from_user_ll at the top of both profiles:

No writepages:

CPU: P4 / Xeon, speed 1997.8 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples   %        image name                          app name                            symbol name
2225649   38.7482  vmlinux-autobench-2.6.10-autokern1  vmlinux-autobench-2.6.10-autokern1  __copy_from_user_ll
1471012   25.6101  vmlinux-autobench-2.6.10-autokern1  vmlinux-autobench-2.6.10-autokern1  poll_idle
 104736    1.8234  vmlinux-autobench-2.6.10-autokern1  vmlinux-autobench-2.6.10-autokern1  __block_commit_write
  92702    1.6139  vmlinux-autobench-2.6.10-autokern1  vmlinux-autobench-2.6.10-autokern1  mark_offset_cyclone
  90077    1.5682  vmlinux-autobench-2.6.10-autokern1  vmlinux-autobench-2.6.10-autokern1  _spin_lock
  83649    1.4563  vmlinux-autobench-2.6.10-autokern1  vmlinux-autobench-2.6.10-autokern1  __block_write_full_page
  81483    1.4186  vmlinux-autobench-2.6.10-autokern1  vmlinux-autobench-2.6.10-autokern1  generic_file_buffered_write
  69232    1.2053  vmlinux-autobench-2.6.10-autokern1  vmlinux-autobench-2.6.10-autokern1  ext3_writeback_commit_write

With writepages:

CPU: P4 / Xeon, speed 1997.98 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples   %        image name                          app name                            symbol name
2487751   43.4411  vmlinux-autobench-2.6.10-autokern1  vmlinux-autobench-2.6.10-autokern1  __copy_from_user_ll
1518775   26.5209  vmlinux-autobench-2.6.10-autokern1  vmlinux-autobench-2.6.10-autokern1  poll_idle
 124956    2.1820  vmlinux-autobench-2.6.10-autokern1  vmlinux-autobench-2.6.10-autokern1  _spin_lock
  93689    1.6360  vmlinux-autobench-2.6.10-autokern1  vmlinux-autobench-2.6.10-autokern1  generic_file_buffered_write
  93139    1.6264  vmlinux-autobench-2.6.10-autokern1  vmlinux-autobench-2.6.10-autokern1  mark_offset_cyclone
  89683    1.5660  vmlinux-autobench-2.6.10-autokern1  vmlinux-autobench-2.6.10-autokern1  ext3_writeback_commit_write

So we see 38% vs 43%, which I believe should be directly correlated with
throughput (about 12% diff. here).

> The other is where the
> interactivity of the I/O generator doesn't match the buffering in the
> device so that the device ends up 100% busy processing small I/Os that
> were sent to it because it said all the while that it needed more work.
> But in the small-I/O case, we don't see a 100% busy device.

That might be possible, but I'm not sure how one could account for it?
The application, VM, and I/O systems are all so intertwined it would be
difficult to isolate the application if we are trying to measure maximum
throughput, no?

> So why would the device be up to 17% idle, since the writepages case makes
> it apparent that the I/O generator is capable of generating much more
> work?  Is there some queue plugging (I/O scheduler delays sending I/O to
> the device even though the device is idle) going on?

Again, I think the amount of work being generated is directly related
to how quickly the dirty pages are being flushed out, so inefficiencies
in the io-system bubble up to the generator.

Sonny
* Re: ext3 writepages ?
From: Badari Pulavarty @ 2005-02-10 16:02 UTC
To: Bryan Henderson; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao

On Wed, 2005-02-09 at 18:05, Bryan Henderson wrote:
> > I see much larger IO chunks and better throughput.  So, I guess its
> > worth doing it
>
> I hate to see something like this go ahead based on empirical results
> without theory.  It might make things worse somewhere else.
>
> Do you have an explanation for why the IO chunks are larger?  Is the I/O
> scheduler not building large I/Os out of small requests?  Is the queue
> running dry while the device is actually busy?

Bryan,

I would like to find out what theory you are looking for.  Don't you
think, filesystems submitting biggest chunks of IO possible is better
than submitting 1k-4k chunks and hoping that IO schedulers do the
perfect job ?  BTW, writepages() is being used for other filesystems
like JFS.

We all learnt thro 2.4 RAW code about the overhead of doing 512-byte
IO and making the elevator merge all the pieces together.  That's one
reason why 2.6 DIO/RAW code is completely written from scratch to
submit the biggest possible IO chunks.

Well, I agree that we should have theory behind the results.  We are
just playing with prototypes for now.

Thanks,
Badari
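[Aside -- illustrative code, not part of the original thread]

The "submit the biggest possible chunks up front" argument comes down to
packing many pages into one multi-page bio instead of handing the elevator
one block-sized request at a time.  The hedged sketch below shows roughly
how that looks with the 2.6 block layer; the helper name and its minimal
error handling are invented for illustration, and this is essentially what
mpage_writepages() does internally with far more bookkeeping.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/pagemap.h>

/*
 * Hypothetical helper: pack up to nr_pages pages that are contiguous on
 * disk into a single bio and submit them as one write.  Returns the number
 * of pages actually packed so a caller can loop over the remainder.
 */
static int write_pages_as_one_bio(struct block_device *bdev, sector_t sector,
				  struct page **pages, int nr_pages,
				  bio_end_io_t *end_io)
{
	struct bio *bio;
	int i;

	bio = bio_alloc(GFP_NOFS, nr_pages);
	if (!bio)
		return -ENOMEM;

	bio->bi_bdev = bdev;
	bio->bi_sector = sector;
	bio->bi_end_io = end_io;

	for (i = 0; i < nr_pages; i++) {
		/* bio_add_page() returns the length added; a short add means full */
		if (bio_add_page(bio, pages[i], PAGE_CACHE_SIZE, 0) < PAGE_CACHE_SIZE)
			break;
	}

	if (i == 0) {
		bio_put(bio);
		return 0;
	}

	submit_bio(WRITE, bio);		/* one request, one completion callback */
	return i;
}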
* Re: ext3 writepages ?
From: Bryan Henderson @ 2005-02-10 18:00 UTC
To: pbadari; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao

> Don't you think, filesystems submitting biggest chunks of IO
> possible is better than submitting 1k-4k chunks and hoping that
> IO schedulers do the perfect job ?

No, I don't see why it would be better.  In fact intuitively, I think the
I/O scheduler, being closer to the device, should do a better job of
deciding in what packages I/O should go to the device.  After all, there
exist block devices that don't process big chunks faster than small ones.

So this starts to look like something where you withhold data from the I/O
scheduler in order to prevent it from scheduling the I/O wrongly because
you (the pager/filesystem driver) know better.  That shouldn't be the
architecture.

So I'd still like to see a theory that explains why submitting the
I/O a little at a time (i.e. including the bio_submit() in the loop that
assembles the I/O) causes the device to be idle more.

> We all learnt thro 2.4 RAW code about the overhead of doing 512-byte
> IO and making the elevator merge all the pieces together.

That was CPU time, right?  In the present case, the numbers say it takes
the same amount of CPU time to assemble the I/O above the I/O scheduler as
inside it.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
* Re: ext3 writepages ?
From: Badari Pulavarty @ 2005-02-10 18:32 UTC
To: Bryan Henderson; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao

On Thu, 2005-02-10 at 10:00, Bryan Henderson wrote:
> > Don't you think, filesystems submitting biggest chunks of IO
> > possible is better than submitting 1k-4k chunks and hoping that
> > IO schedulers do the perfect job ?
>
> No, I don't see why it would be better.  In fact intuitively, I think the
> I/O scheduler, being closer to the device, should do a better job of
> deciding in what packages I/O should go to the device.  After all, there
> exist block devices that don't process big chunks faster than small ones.
>
> So this starts to look like something where you withhold data from the I/O
> scheduler in order to prevent it from scheduling the I/O wrongly because
> you (the pager/filesystem driver) know better.  That shouldn't be the
> architecture.
>
> So I'd still like to see a theory that explains why submitting the
> I/O a little at a time (i.e. including the bio_submit() in the loop that
> assembles the I/O) causes the device to be idle more.
>
> > We all learnt thro 2.4 RAW code about the overhead of doing 512-byte
> > IO and making the elevator merge all the pieces together.
>
> That was CPU time, right?  In the present case, the numbers say it takes
> the same amount of CPU time to assemble the I/O above the I/O scheduler as
> inside it.

One clear distinction between submitting smaller chunks vs larger ones
is - number of call backs we get and the processing we need to do.

I don't think we have enough numbers here to get to bottom of this.
CPU utilization remains same in both cases, doesn't mean that - the test
took exactly same amount of time.  I don't even think that we are doing
a fixed number of IOs.  Its possible that by doing larger IOs we save
CPU and use that CPU to push more data ?

Thanks,
Badari
* Re: ext3 writepages ?
From: Bryan Henderson @ 2005-02-10 20:30 UTC
To: pbadari; +Cc: Andreas Dilger, linux-fsdevel, Sonny Rao

> Its possible that by doing larger
> IOs we save CPU and use that CPU to push more data ?

This is absolutely right; my mistake -- the relevant number is CPU seconds
per megabyte moved, not CPU seconds per elapsed second.  But I don't think
we're close enough to 100% CPU utilization that this explains much.

In fact, the curious thing here is that neither the disk nor the CPU seems
to be a bottleneck in the slow case.  Maybe there's some serialization I'm
not seeing that makes less parallelism between I/O and execution.  Is this
a single thread doing writes and syncs to a single file?

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
* Re: ext3 writepages ?
From: Sonny Rao @ 2005-02-10 20:25 UTC
To: Bryan Henderson; +Cc: pbadari, Andreas Dilger, linux-fsdevel

On Thu, Feb 10, 2005 at 12:30:23PM -0800, Bryan Henderson wrote:
> > Its possible that by doing larger
> > IOs we save CPU and use that CPU to push more data ?
>
> This is absolutely right; my mistake -- the relevant number is CPU seconds
> per megabyte moved, not CPU seconds per elapsed second.  But I don't think
> we're close enough to 100% CPU utilization that this explains much.
>
> In fact, the curious thing here is that neither the disk nor the CPU seems
> to be a bottleneck in the slow case.  Maybe there's some serialization I'm
> not seeing that makes less parallelism between I/O and execution.  Is this
> a single thread doing writes and syncs to a single file?

From what I've seen, without writepages, the application thread itself
tends to do the writing by falling into balance_dirty_pages() during
its write call, while in the writepages case, a pdflush thread seems to
do more of the writeback.  This also depends somewhat on processor
speed (and number) and amount of RAM.  To try and isolate this more,
I've limited RAM (1GB) and number of CPUs (1) on my testing setup.

So yes, there could be better parallelism in the writepages case, but
again this behavior could be a symptom and not a cause, but I'm not
sure how to figure that out, any suggestions ?

Sonny
* Re: ext3 writepages ?
From: Bryan Henderson @ 2005-02-11 0:20 UTC
To: Sonny Rao; +Cc: Andreas Dilger, linux-fsdevel, pbadari

I went back and looked more closely and see that you did more than add a
->writepages method.  You replaced the ->prepare_write with one that
doesn't involve the buffer cache, right?  And from your answer to Badari's
question about that, I believe you said this is not an integral part of
having ->writepages, but an additional enhancement.

Well, that could explain a lot.  First of all, there's a significant
amount of CPU time involved in managing buffer heads.  In the profile you
posted, it's one of the differences in CPU time between the writepages and
non-writepages case.  But it also changes the whole way the file cache is
managed, doesn't it?  That might account for the fact that in one case you
see cache cleaning happening via balance_dirty_pages() (i.e. memory fills
up), but in the other it happens via pdflush.  I'm not really up on the
buffer cache; I haven't used it in my own studies for years.

I also saw that while you originally said CPU utilization was 73% in both
cases, in one of the profiles I add up at least 77% for the writepages
case, so I'm not sure we're really comparing straight across.

To investigate these effects further, I think you should monitor
/proc/meminfo.  And/or make more isolated changes to the code.

> So yes, there could be better parallelism in the writepages case, but
> again this behavior could be a symptom and not a cause,

I'm not really suggesting that there's better parallelism in the
writepages case.  I'm suggesting that there's poor parallelism (compared
to what I expect) in both cases, which means that adding CPU time directly
affects throughput.  If the CPU time were in parallel with the I/O time,
adding an extra 1.8 ms per megabyte to the CPU time (which is what one of
my calculations from your data gave) wouldn't affect throughput.

But I believe we've at least established doubt that submitting an entire
file cache in one bio is faster than submitting a bio for each page and
that smaller I/Os (to the device) cause lower throughput in the
non-writepages case (it seems more likely that the lower throughput causes
the smaller I/Os).

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
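[Aside -- illustrative code, not part of the original thread]

One way to do the /proc/meminfo monitoring Bryan suggests above (my
assumption of what he has in mind, not something posted in the thread) is a
small userspace program that samples the Dirty and Writeback lines once a
second, which makes it easy to see whether the cache is being cleaned by
pdflush ahead of time or only when balance_dirty_pages() throttles the
writer.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Print the Dirty and Writeback lines from /proc/meminfo once per second. */
int main(void)
{
	char line[256];

	for (;;) {
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f) {
			perror("/proc/meminfo");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "Dirty:", 6) ||
			    !strncmp(line, "Writeback:", 10))
				fputs(line, stdout);
		}
		fclose(f);
		putchar('\n');
		fflush(stdout);
		sleep(1);
	}
	return 0;
}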