* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Christoph Hellwig @ 2009-08-30 16:52 UTC
To: Theodore Ts'o
Cc: linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason, jens.axboe

On Sat, Aug 29, 2009 at 10:54:18PM -0400, Theodore Ts'o wrote:
> MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a concern about
> holding I_SYNC for too long.  But this shouldn't be a concern since
> I_LOCK and I_SYNC have been separated.  So make it a tunable and
> change the default to 32768.
>
> This change is helpful for ext4 since it means we write out large files
> in bigger chunks than just 4 megabytes at a time, so that when we have
> multiple large files in the page cache waiting for writeback, the
> files don't end up getting interleaved.  There shouldn't be any downside.
>
> http://bugzilla.kernel.org/show_bug.cgi?id=13930

The current writeback sizes are definitely too small; we shoved a hack
into XFS to bump nr_to_write up to four times the value the VM sends us
in order to saturate medium-sized RAID arrays.  It turns out this was
not enough, and at least for Chris Mason's array we only started
saturating at * 16.  I suspect your patch will have a similar effect.

>  /*
> + * The maximum number of pages to write out in a single bdflush/kupdate
> + * operation.  We used to limit this to 1024 pages to avoid holding
> + * I_SYNC against an inode for a long period of time, but since
> + * I_SYNC has been separated out from I_LOCK, the only time a process
> + * waits for I_SYNC is when it is calling fsync() or otherwise forcing
> + * out the inode.
> + */
> +unsigned int max_writeback_pages = 32768;

While I'm sure this is a much better default than the brain-dead
previous one, I suspect we really need to scale it based on the amount
of available or dirty RAM to keep larger systems busy.

Also, that I_SYNC comment doesn't make much sense to me - if we do a
sync/fsync we need to write out all the data anyway, so waiting for
bdflush/kupdate is not a problem.  The only place where it could matter
is Jan's new O_SYNC implementation.

And btw, I think referring to the historic code in the comment is not a
good idea; it's just going to confuse the heck out of everyone looking
at it in the future.  The information above makes sense for the commit
message.

And the other big question is how this interacts with Jens' new per-bdi
flushing code that we still hope to merge in 2.6.32.  Maybe we'll
actually get some sane writeback code for the first time.

> +
> +/*
>   * Start background writeback (via pdflush) at this percentage
>   */
>  int dirty_background_ratio = 10;
> @@ -708,10 +709,10 @@ static void background_writeout(unsigned long _min_pages)
>  			break;
>  		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
> -		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> +		wbc.nr_to_write = max_writeback_pages;
>  		wbc.pages_skipped = 0;
>  		writeback_inodes(&wbc);
> -		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> +		min_pages -= max_writeback_pages - wbc.nr_to_write;
>  		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
>  			/* Wrote less than expected */
>  			if (wbc.encountered_congestion || wbc.more_io)
> @@ -783,7 +784,7 @@ static void wb_kupdate(unsigned long arg)
>  	while (nr_to_write > 0) {
>  		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
> -		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> +		wbc.nr_to_write = max_writeback_pages;
>  		writeback_inodes(&wbc);
>  		if (wbc.nr_to_write > 0) {
>  			if (wbc.encountered_congestion || wbc.more_io)
> @@ -791,7 +792,7 @@ static void wb_kupdate(unsigned long arg)
>  			else
>  				break;	/* All the old data is written */
>  		}
> -		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> +		nr_to_write -= max_writeback_pages - wbc.nr_to_write;
>  	}
>  	if (time_before(next_jif, jiffies + HZ))
>  		next_jif = jiffies + HZ;
> --
> 1.6.3.2.1.gb9f7d.dirty
---end quoted text---

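A minimal sketch of the RAM-based scaling Christoph suggests here -
purely illustrative, not from this thread, and assuming a hypothetical
helper global_dirtyable_memory() that reports the number of dirtyable
pages (everything else is the existing vm_dirty_ratio knob):

	/*
	 * Hypothetical: derive the writeback chunk size from the dirty
	 * threshold rather than a fixed constant, so large machines get
	 * proportionally larger chunks.  global_dirtyable_memory() is
	 * an assumed helper returning dirtyable pages, not existing code.
	 */
	static unsigned int scaled_max_writeback_pages(void)
	{
		unsigned long dirty_thresh =
			global_dirtyable_memory() * vm_dirty_ratio / 100;

		/* never below the old 1024-page chunk, cap at 1GB of 4k pages */
		return clamp_t(unsigned long, dirty_thresh / 4, 1024, 262144);
	}
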
* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Theodore Tso @ 2009-08-30 18:17 UTC
To: Christoph Hellwig
Cc: linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason, jens.axboe

On Sun, Aug 30, 2009 at 12:52:29PM -0400, Christoph Hellwig wrote:
> On Sat, Aug 29, 2009 at 10:54:18PM -0400, Theodore Ts'o wrote:
> > MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a concern about
> > holding I_SYNC for too long.  But this shouldn't be a concern since
> > I_LOCK and I_SYNC have been separated.  So make it a tunable and
> > change the default to 32768.
> >
> > This change is helpful for ext4 since it means we write out large files
> > in bigger chunks than just 4 megabytes at a time, so that when we have
> > multiple large files in the page cache waiting for writeback, the
> > files don't end up getting interleaved.  There shouldn't be any downside.
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=13930
>
> The current writeback sizes are definitely too small; we shoved a hack
> into XFS to bump nr_to_write up to four times the value the VM sends
> us in order to saturate medium-sized RAID arrays.

Hmm, should we make it a per-superblock tunable, so that it can either
be tuned on a per-block-device basis or the filesystem code can adjust
it to its liking?  I thought about it, but decided maybe it was better
to keep it simple.

> It turns out this was not enough, and at least for Chris Mason's array
> we only started saturating at * 16.  I suspect your patch will have a
> similar effect.

So you think 16384 would be a better default?  The reason why I picked
32768 was that it is the size of an ext4 block group, but otherwise it
was totally arbitrary.  I haven't done any benchmarking yet, which is
one of the reasons why I thought about making it a tunable.

> And btw, I think referring to the historic code in the comment is not
> a good idea; it's just going to confuse the heck out of everyone
> looking at it in the future.  The information above makes sense for
> the commit message.

Yeah, good point.

> And the other big question is how this interacts with Jens' new
> per-bdi flushing code that we still hope to merge in 2.6.32.

Jens?  What do you think?  Fixing MAX_WRITEBACK_PAGES was something I
really wanted to merge in 2.6.32, since it makes a huge difference for
the block allocation layout for an "rsync -avH /old-fs /new-fs" when we
are copying a bunch of large files (say, 800 meg ISO images); the fact
that the writeback routine is writing out 4 megs at a time means that
our files get horribly interleaved and thus get fragmented.

I initially thought about adding some massive workarounds in the
filesystem layer (which is, I guess, what XFS did), but I ultimately
decided this was begging to be solved in the page writeback code,
especially since it's *such* an easy fix.

> Maybe we'll actually get some sane writeback code for the first time.

To quote from "Fiddler on the Roof", from your lips to God's ears.... :-)

						- Ted

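One way to observe the interleaving Ted describes - a hedged sketch
with made-up paths, using filefrag from e2fsprogs:

	# Write two large files through the page cache at the same time,
	# then inspect their on-disk extents after writeback completes.
	dd if=/dev/zero of=/mnt/test/a bs=1M count=800 &
	dd if=/dev/zero of=/mnt/test/b bs=1M count=800 &
	wait
	sync
	filefrag -v /mnt/test/a /mnt/test/b

If the thread's analysis holds, small writeback chunks show up here as
many short extents per file, alternating on disk with the other file's
extents; larger chunks should drop the extent counts sharply.
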
* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Christoph Hellwig @ 2009-08-30 22:27 UTC
To: Theodore Tso
Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason, jens.axboe

On Sun, Aug 30, 2009 at 02:17:31PM -0400, Theodore Tso wrote:
> > The current writeback sizes are definitely too small; we shoved a
> > hack into XFS to bump nr_to_write up to four times the value the VM
> > sends us in order to saturate medium-sized RAID arrays.
>
> Hmm, should we make it a per-superblock tunable, so that it can either
> be tuned on a per-block-device basis or the filesystem code can adjust
> it to its liking?  I thought about it, but decided maybe it was better
> to keep it simple.

I don't think tuning it on a per-filesystem basis is a good idea; we
had to resort to this for 2.6.30 as a quick hack, and we will get rid
of it again in 2.6.31 one way or another.  I personally think we should
fight this cancer of per-filesystem hacks in the writeback code as much
as we can.  Right now people keep adding tuning hacks for specific
workloads there, and at least all the modern filesystems (ext4, btrfs
and XFS) have very similar requirements of the writeback code: give the
filesystem as much as possible to write at the same time, so it can
make intelligent decisions based on that.  The VM writeback code fails
horribly at that right now.

> > It turns out this was not enough, and at least for Chris Mason's
> > array we only started saturating at * 16.  I suspect your patch will
> > have a similar effect.
>
> So you think 16384 would be a better default?  The reason why I picked
> 32768 was that it is the size of an ext4 block group, but otherwise it
> was totally arbitrary.  I haven't done any benchmarking yet, which is
> one of the reasons why I thought about making it a tunable.

It was just another arbitrary number.  My suspicion is that the exact
number does not matter; it just needs to be much, much larger than the
current one.  And the latency concerns are also overrated, as the block
layer will tell us that we are congested if we push too much into a
queue.

> > And the other big question is how this interacts with Jens' new
> > per-bdi flushing code that we still hope to merge in 2.6.32.
>
> Jens?  What do you think?  Fixing MAX_WRITEBACK_PAGES was something I
> really wanted to merge in 2.6.32, since it makes a huge difference for
> the block allocation layout for an "rsync -avH /old-fs /new-fs" when
> we are copying a bunch of large files (say, 800 meg ISO images); the
> fact that the writeback routine is writing out 4 megs at a time means
> that our files get horribly interleaved and thus get fragmented.
>
> I initially thought about adding some massive workarounds in the
> filesystem layer (which is, I guess, what XFS did),

XFS is relatively good at doing the disk block layout even with smaller
writeouts, so we do not have that fragmentation hack.  And the big hack
for 2.6.30 is a one-liner:

--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -1268,6 +1268,14 @@ xfs_vm_writepage(
 
 	if (!page_has_buffers(page))
 		create_empty_buffers(page, 1 << inode->i_blkbits, 0);
+
+	/*
+	 * VM calculation for nr_to_write seems off.  Bump it way
+	 * up, this gets simple streaming writes zippy again.
+	 * To be reviewed again after Jens' writeback changes.
+	 */
+	wbc->nr_to_write *= 4;
+
 	/*
 	 * Convert delayed allocate, unwritten or unmapped space
 	 * to real space and flush out to disk.

This also went past -fsdevel, but I didn't manage to get the flames for
adding these hacks that I hoped to get ;-)

My stance is to wait for this until about -rc2, at which point Jens'
code is hopefully in and we can start doing all the fine-tuning,
including lots of benchmarking.

Btw, one thing I would really like to see from one of the big companies
or the LF is running benchmarks like yours above, or just a simple one-
or two-stream dd, on some big machines weekly, so we can immediately
see regressions once someone starts to tweak the VM again.  Preferably
including seekwatcher data.

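A sketch of the kind of recurring benchmark Christoph asks for - the
device and mount point are placeholders, and the seekwatcher invocation
is recalled from its README, so verify the exact flags against each
tool's help before relying on this:

	# Trace block I/O on the test device for up to two minutes
	# while a single streaming write runs.
	blktrace -d /dev/sdb -o dd-run -w 120 &
	dd if=/dev/zero of=/mnt/test/stream bs=1M count=8192 conv=fdatasync
	wait
	# Render the seek pattern from the collected blktrace data.
	seekwatcher -t dd-run -o dd-run.png

Comparing the resulting seek graphs week over week would make VM
writeback regressions visible as soon as they land.
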
* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Theodore Tso @ 2009-08-31 3:08 UTC
To: Christoph Hellwig
Cc: linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason, jens.axboe

On Sun, Aug 30, 2009 at 06:27:10PM -0400, Christoph Hellwig wrote:
> I don't think tuning it on a per-filesystem basis is a good idea; we
> had to resort to this for 2.6.30 as a quick hack, and we will get rid
> of it again in 2.6.31 one way or another.  I personally think we
> should fight this cancer of per-filesystem hacks in the writeback code
> as much as we can.  Right now people keep adding tuning hacks for
> specific workloads there, and at least all the modern filesystems
> (ext4, btrfs and XFS) have very similar requirements of the writeback
> code: give the filesystem as much as possible to write at the same
> time, so it can make intelligent decisions based on that.  The VM
> writeback code fails horribly at that right now.

Yep; and Jens' patch doesn't change that.  It is still sending writes
out to the filesystem a piddling 1024 pages at a time.

> My stance is to wait for this until about -rc2, at which point Jens'
> code is hopefully in and we can start doing all the fine-tuning,
> including lots of benchmarking.

Well, I've ported my patch so it applies on top of Jens' per-bdi patch.
It seems to be clearly needed; Jens, would you agree to add it to your
per-bdi patch series?  We can choose a different default if you like,
but making MAX_WRITEBACK_PAGES tunable seems to be clearly necessary.

By the way, while I was testing my patch on top of v13 of the per-bdi
patches, I found something *very* curious.  I did a test where I ran
the following commands on a freshly mkfs'ed ext4 filesystem:

	dd if=/dev/zero of=test1 bs=1024k count=128
	dd if=/dev/zero of=test2 bs=1024k count=128
	sync

I traced the calls to ext4_da_writepages() using ftrace, and found this:

 flush-8:16-1829 [001] 23.416351: ext4_da_writepages: dev sdb ino 12 nr_t_write 32759 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
 flush-8:16-1829 [000] 25.939354: ext4_da_writepages: dev sdb ino 12 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
 flush-8:16-1829 [000] 25.939486: ext4_da_writepages: dev sdb ino 13 nr_t_write 32759 pages_skipped 0 range_start 134180864 range_end 9223372036854775807 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
 flush-8:16-1829 [000] 27.055687: ext4_da_writepages: dev sdb ino 12 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
 flush-8:16-1829 [000] 27.055691: ext4_da_writepages: dev sdb ino 13 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
 flush-8:16-1829 [000] 27.878708: ext4_da_writepages: dev sdb ino 13 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1

The *first* time the per-bdi code called writepages on the second file
(test2, inode #13), range_start was 134180864 (which, curiously enough,
is 4096*32759, where 32759 was the value of nr_to_write passed to
ext4_da_writepages).  Given that the inode only had 32768 pages, the
fact that apparently *some* codepath called ext4_da_writepages starting
at logical block 32759, with nr_to_write set to 32759, seems very
curious indeed.  That doesn't seem right at all.  It's late, so I won't
try to trace it down now; plus which, it's your code, so I figure you
can probably figure it out faster....

						- Ted

From 763fb19b3d14d73e71f5d88bd3b5f56869536ae5 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Sun, 30 Aug 2009 23:06:24 -0400
Subject: [PATCH] vm: Add an tuning knob for vm.max_writeback_pages

Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern about holding I_SYNC for too long.  (At least, that was the
comment previously.)  This doesn't make sense now, because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in that
case we need to write out all of the data anyway.  Previously there may
have been other code paths that waited on I_SYNC, but not any more.

According to Christoph, the current writeback size is way too small,
and XFS had a hack that bumped out nr_to_write to four times the value
sent by the VM to be able to saturate medium-sized RAID arrays.  This
value was also problematic for ext4, as it caused large files to become
interleaved on disk in 8 megabyte chunks (we bumped up the nr_to_write
by a factor of two).

So, in this patch, we make MAX_WRITEBACK_PAGES a tunable, and change
the default to be 32768 pages.

http://bugzilla.kernel.org/show_bug.cgi?id=13930

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
This patch is designed to be applied on top of the per-bdi v13 patchset

 fs/fs-writeback.c         |   15 +++------------
 include/linux/writeback.h |    1 +
 kernel/sysctl.c           |    8 ++++++++
 mm/page-writeback.c       |    6 ++++++
 4 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index dfb4767..7e66de7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -264,15 +264,6 @@ void bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
 	}
 }
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES	1024
-
 static inline bool over_bground_thresh(void)
 {
 	unsigned long background_thresh, dirty_thresh;
@@ -325,11 +316,11 @@ static long wb_writeback(struct bdi_writeback *wb, long nr_pages,
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.nr_to_write = max_writeback_pages;
 		wbc.pages_skipped = 0;
 		generic_sync_wb_inodes(wb, sb, &wbc);
-		nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		nr_pages -= max_writeback_pages - wbc.nr_to_write;
+		wrote += max_writeback_pages - wbc.nr_to_write;
 		/*
 		 * If we ran out of stuff to write, bail unless more_io got set
 		 */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index e070b91..a27fc00 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -100,6 +100,7 @@ extern int vm_dirty_ratio;
 extern unsigned long vm_dirty_bytes;
 extern unsigned int dirty_writeback_interval;
 extern unsigned int dirty_expire_interval;
+extern unsigned int max_writeback_pages;
 extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 58be760..06d1c4c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1104,6 +1104,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_writeback_pages",
+		.data		= &max_writeback_pages,
+		.maxlen		= sizeof(max_writeback_pages),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.ctl_name	= VM_NR_PDFLUSH_THREADS,
 		.procname	= "nr_pdflush_threads",
 		.data		= &nr_pdflush_threads,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a727af9..46fe54f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -55,6 +55,12 @@ static inline long sync_writeback_pages(void)
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
+ * The maximum number of pages to write out in a single bdflush/kupdate
+ * operation.
+ */
+unsigned int max_writeback_pages = 32768;
+
+/*
  * Start background writeback (via pdflush) at this percentage
  */
 int dirty_background_ratio = 10;
-- 
1.6.3.2.1.gb9f7d.dirty

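With the patch above applied, exercising the new knob is ordinary
sysctl usage (a quick sketch; the sysctl table entry registers it with
mode 0644 and proc_dointvec, so all three forms should work):

	sysctl vm.max_writeback_pages                  # read the default (32768)
	sysctl -w vm.max_writeback_pages=16384         # try a smaller chunk size
	echo 65536 > /proc/sys/vm/max_writeback_pages  # equivalent via procfs
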
* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Jens Axboe @ 2009-08-31 10:29 UTC
To: Theodore Tso
Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason

On Sun, Aug 30 2009, Theodore Tso wrote:
> On Sun, Aug 30, 2009 at 06:27:10PM -0400, Christoph Hellwig wrote:
> > I don't think tuning it on a per-filesystem basis is a good idea; we
> > had to resort to this for 2.6.30 as a quick hack, and we will get
> > rid of it again in 2.6.31 one way or another.  I personally think we
> > should fight this cancer of per-filesystem hacks in the writeback
> > code as much as we can.  Right now people keep adding tuning hacks
> > for specific workloads there, and at least all the modern
> > filesystems (ext4, btrfs and XFS) have very similar requirements of
> > the writeback code: give the filesystem as much as possible to
> > write at the same time, so it can make intelligent decisions based
> > on that.  The VM writeback code fails horribly at that right now.
>
> Yep; and Jens' patch doesn't change that.  It is still sending writes
> out to the filesystem a piddling 1024 pages at a time.

Right, I didn't want to change too much of the logic; tuning is better
left to follow-up patches.  There is one change, though - where the
current logic splits ->nr_to_write between devices, with the writeback
patches each device gets the "full" MAX_WRITEBACK_PAGES.

> > My stance is to wait for this until about -rc2, at which point Jens'
> > code is hopefully in and we can start doing all the fine-tuning,
> > including lots of benchmarking.
>
> Well, I've ported my patch so it applies on top of Jens' per-bdi
> patch.  It seems to be clearly needed; Jens, would you agree to add it
> to your per-bdi patch series?  We can choose a different default if
> you like, but making MAX_WRITEBACK_PAGES tunable seems to be clearly
> necessary.

I don't mind adding it, but do we really want to export the value?  If
we plan on making this dynamically adaptable soon, then we'll be stuck
with some proc file that doesn't really do much.  I guess by then we
can leave it as a 'maximum ever' type control, which at least would do
something...

> By the way, while I was testing my patch on top of v13 of the per-bdi
> patches, I found something *very* curious.  I did a test where I ran
> the following commands on a freshly mkfs'ed ext4 filesystem:
>
> 	dd if=/dev/zero of=test1 bs=1024k count=128
> 	dd if=/dev/zero of=test2 bs=1024k count=128
> 	sync
>
> I traced the calls to ext4_da_writepages() using ftrace, and found this:
>
>  flush-8:16-1829 [001] 23.416351: ext4_da_writepages: dev sdb ino 12 nr_t_write 32759 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
>  flush-8:16-1829 [000] 25.939354: ext4_da_writepages: dev sdb ino 12 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
>  flush-8:16-1829 [000] 25.939486: ext4_da_writepages: dev sdb ino 13 nr_t_write 32759 pages_skipped 0 range_start 134180864 range_end 9223372036854775807 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
>  flush-8:16-1829 [000] 27.055687: ext4_da_writepages: dev sdb ino 12 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
>  flush-8:16-1829 [000] 27.055691: ext4_da_writepages: dev sdb ino 13 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
>  flush-8:16-1829 [000] 27.878708: ext4_da_writepages: dev sdb ino 13 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
>
> The *first* time the per-bdi code called writepages on the second file
> (test2, inode #13), range_start was 134180864 (which, curiously
> enough, is 4096*32759, where 32759 was the value of nr_to_write passed
> to ext4_da_writepages).  Given that the inode only had 32768 pages,
> the fact that apparently *some* codepath called ext4_da_writepages
> starting at logical block 32759, with nr_to_write set to 32759, seems
> very curious indeed.  That doesn't seem right at all.  It's late, so I
> won't try to trace it down now; plus which, it's your code, so I
> figure you can probably figure it out faster....

Interesting, needs checking up on.  I've prepared a v14 patchset today;
perhaps (if you have time) you can see if it reproduces there?  I'm
running some performance tests today, but will make a note to look into
this after that.

-- 
Jens Axboe

* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Jens Axboe @ 2009-08-31 10:47 UTC
To: Theodore Tso
Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason

On Mon, Aug 31 2009, Jens Axboe wrote:
> > I traced the calls to ext4_da_writepages() using ftrace, and found
> > this:
> >
> > [ ... ftrace output quoted in full in Ted's mail above ... ]
> >
> > The *first* time the per-bdi code called writepages on the second
> > file (test2, inode #13), range_start was 134180864 (which, curiously
> > enough, is 4096*32759, where 32759 was the value of nr_to_write
> > passed to ext4_da_writepages).  Given that the inode only had 32768
> > pages, the fact that apparently *some* codepath called
> > ext4_da_writepages starting at logical block 32759, with nr_to_write
> > set to 32759, seems very curious indeed.  That doesn't seem right at
> > all.  It's late, so I won't try to trace it down now; plus which,
> > it's your code, so I figure you can probably figure it out faster....
>
> Interesting, needs checking up on.  I've prepared a v14 patchset
> today; perhaps (if you have time) you can see if it reproduces there?
> I'm running some performance tests today, but will make a note to
> look into this after that.

It's because ext4 writepages sets ->range_start and wb_writeback() is
range cyclic, so the next iteration will have the previous end point as
the starting point.  Looks like we need to clear ->range_start in
wb_writeback(); the better place is probably to do that in
fs/fs-writeback.c:generic_sync_wb_inodes(), right after the
writeback_single_inode() call.  This, btw, should be no different from
the current code, weird/correct or not :-)

-- 
Jens Axboe

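A sketch of the fix Jens outlines - the context lines are guesses
against the per-bdi patchset, not a tested patch:

	--- a/fs/fs-writeback.c
	+++ b/fs/fs-writeback.c
	@@ (illustrative hunk; real line numbers omitted)
	 		writeback_single_inode(inode, wbc);
	+		/*
	+		 * Reset range_start so a value modified by one
	+		 * filesystem's ->writepages() cannot leak into the
	+		 * next inode's writeback.
	+		 */
	+		wbc->range_start = 0;
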
* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Theodore Tso @ 2009-08-31 12:37 UTC
To: Jens Axboe
Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason

On Mon, Aug 31, 2009 at 12:47:49PM +0200, Jens Axboe wrote:
> It's because ext4 writepages sets ->range_start and wb_writeback() is
> range cyclic, so the next iteration will have the previous end point
> as the starting point.  Looks like we need to clear ->range_start in
> wb_writeback(); the better place is probably to do that in
> fs/fs-writeback.c:generic_sync_wb_inodes(), right after the
> writeback_single_inode() call.  This, btw, should be no different
> from the current code, weird/correct or not :-)

Hmm, or we could have ext4_da_writepages save and restore
->range_start.

One of the things that has never been well documented is exactly what
the semantics of the various fields in the wbc struct are, and who is
allowed to modify which fields when.  If you have some time, it would
be great if you could document the rules filesystems should be
following with respect to the wbc struct, and then we can audit each
filesystem to make sure they follow those rules.

One of the things which is a bit scary about how the many wbc flags
work is that each time a filesystem wants some particular behavior, it
seems like we need to dive into the writeback code and figure out some
combination of flags/settings that makes the page writeback code do
what we want, and sometimes it's not clear whether that was a
designed-in semantic of the interface, or just something that happened
to work given the current implementation.

In any case, if one of the rules is that a filesystem's writepages
function shouldn't be modifying range_start, we can fix this problem up
by saving and restoring range_start inside ext4_da_writepages().

						- Ted

* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Theodore Tso @ 2009-08-31 15:54 UTC
To: Jens Axboe
Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason

At the risk of asking a stupid question, what *is* range_cyclic, and
what is it trying to do?  I've been looking at the code, and I'm
getting myself very confused about what the code is trying to do and
what its original intent was.

						- Ted

* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Jens Axboe @ 2009-08-31 20:36 UTC
To: Theodore Tso
Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason

On Mon, Aug 31 2009, Theodore Tso wrote:
> At the risk of asking a stupid question, what *is* range_cyclic, and
> what is it trying to do?  I've been looking at the code, and I'm
> getting myself very confused about what the code is trying to do and
> what its original intent was.

Range cyclic means that the current writeback in ->writepages() should
start where it left off the last time; non-range-cyclic writeback
starts at ->range_start.  ->writepages() can use
mapping->writeback_index to store such information.

-- 
Jens Axboe

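To make that concrete, here is an illustrative-only skeleton of how a
->writepages() implementation honors range_cyclic, loosely modeled on
write_cache_pages() in mm/page-writeback.c (the actual page-writing
loop is elided):

	static int example_writepages(struct address_space *mapping,
				      struct writeback_control *wbc)
	{
		pgoff_t index, end;

		if (wbc->range_cyclic) {
			/* resume where the previous writeback pass left off */
			index = mapping->writeback_index;
			end = -1;		/* i.e. through end of file */
		} else {
			/* honor the caller-supplied byte range */
			index = wbc->range_start >> PAGE_CACHE_SHIFT;
			end = wbc->range_end >> PAGE_CACHE_SHIFT;
		}

		/*
		 * ... write pages index..end, decrementing wbc->nr_to_write
		 * and advancing index as pages go out ...
		 */

		if (wbc->range_cyclic)
			mapping->writeback_index = index;  /* remember for next time */
		return 0;
	}
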
* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Theodore Tso @ 2009-08-31 21:03 UTC
To: Jens Axboe
Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason

On Mon, Aug 31, 2009 at 12:47:49PM +0200, Jens Axboe wrote:
> It's because ext4 writepages sets ->range_start and wb_writeback() is
> range cyclic, so the next iteration will have the previous end point
> as the starting point.  Looks like we need to clear ->range_start in
> wb_writeback(); the better place is probably to do that in
> fs/fs-writeback.c:generic_sync_wb_inodes(), right after the
> writeback_single_inode() call.  This, btw, should be no different
> from the current code, weird/correct or not :-)

Thanks for pointing it out.  After staring at the code, I now believe
this is the best fix for now.  What do other folks think?

						- Ted

commit 39cac8147479b48cd45b768d184aa6a80f23a2f7
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Mon Aug 31 17:00:59 2009 -0400

    ext4: Restore wbc->range_start in ext4_da_writepages()
    
    To solve a lock inversion problem, we implement part of the
    range_cyclic algorithm in ext4_da_writepages().  (See commit 2acf2c26
    for more details.)
    
    As part of that change wbc->range_start was modified by ext4's
    writepages function, which causes its callers to get confused since
    they aren't expecting the filesystem to modify it.  The simplest fix
    is to save and restore wbc->range_start in ext4_da_writepages.
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d61fb52..ff659e7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2749,6 +2749,7 @@ static int ext4_da_writepages(struct address_space *mapping,
 	long pages_skipped;
 	int range_cyclic, cycled = 1, io_done = 0;
 	int needed_blocks, ret = 0, nr_to_writebump = 0;
+	loff_t range_start = wbc->range_start;
 	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
 
 	trace_ext4_da_writepages(inode, wbc);
@@ -2917,6 +2918,7 @@ out_writepages:
 	if (!no_nrwrite_index_update)
 		wbc->no_nrwrite_index_update = 0;
 	wbc->nr_to_write -= nr_to_writebump;
+	wbc->range_start = range_start;
 	trace_ext4_da_writepages_result(inode, wbc, ret, pages_written);
 	return ret;
 }

* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Aneesh Kumar K.V @ 2009-09-01 7:57 UTC
To: Theodore Tso
Cc: Jens Axboe, Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason

On Mon, Aug 31, 2009 at 05:03:37PM -0400, Theodore Tso wrote:
> Thanks for pointing it out.  After staring at the code, I now believe
> this is the best fix for now.  What do other folks think?
>
>     ext4: Restore wbc->range_start in ext4_da_writepages()
>
>     To solve a lock inversion problem, we implement part of the
>     range_cyclic algorithm in ext4_da_writepages().  (See commit
>     2acf2c26 for more details.)
>
>     As part of that change wbc->range_start was modified by ext4's
>     writepages function, which causes its callers to get confused
>     since they aren't expecting the filesystem to modify it.  The
>     simplest fix is to save and restore wbc->range_start in
>     ext4_da_writepages.
>
> [ ... patch quoted in full above ... ]

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

We had range_start reset until commit
af6f029d3836eb7264cd3fbb13a6baf0e5fdb5ea.

-aneesh

* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Jens Axboe @ 2009-09-01 9:17 UTC
To: Theodore Tso
Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason

On Mon, Aug 31 2009, Theodore Tso wrote:
> Thanks for pointing it out.  After staring at the code, I now believe
> this is the best fix for now.  What do other folks think?
>
>     ext4: Restore wbc->range_start in ext4_da_writepages()
>
> [ ... commit message quoted in full above ... ]
>
> @@ -2749,6 +2749,7 @@ static int ext4_da_writepages(struct address_space *mapping,
>  	long pages_skipped;
>  	int range_cyclic, cycled = 1, io_done = 0;
>  	int needed_blocks, ret = 0, nr_to_writebump = 0;
> +	loff_t range_start = wbc->range_start;
>  	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
>  
>  	trace_ext4_da_writepages(inode, wbc);
> @@ -2917,6 +2918,7 @@ out_writepages:
>  	if (!no_nrwrite_index_update)
>  		wbc->no_nrwrite_index_update = 0;
>  	wbc->nr_to_write -= nr_to_writebump;
> +	wbc->range_start = range_start;
>  	trace_ext4_da_writepages_result(inode, wbc, ret, pages_written);
>  	return ret;
>  }

I was going to suggest using range_start locally and not touching
->range_start, but I see you pass the wbc further down.  So this looks
fine to me.

-- 
Jens Axboe

* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Chris Mason @ 2009-09-01 18:00 UTC
To: Theodore Tso
Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel, jens.axboe

On Sun, Aug 30, 2009 at 02:17:31PM -0400, Theodore Tso wrote:

[ ... ]

> Jens?  What do you think?  Fixing MAX_WRITEBACK_PAGES was something I
> really wanted to merge in 2.6.32, since it makes a huge difference for
> the block allocation layout for an "rsync -avH /old-fs /new-fs" when
> we are copying a bunch of large files (say, 800 meg ISO images); the
> fact that the writeback routine is writing out 4 megs at a time means
> that our files get horribly interleaved and thus get fragmented.

I did some runs comparing mainline with Jens' current writeback queue.
This is just btrfs, but it's a good example of how things are
improving.  These graphs show the 'compile' phase of compilebench,
where it is writing all the .o files into the 30 kernel trees.
Basically we have one thread creating a bunch of files based on the
sizes of all the .o files in a compiled kernel.  They are created in
random order, similar to the files produced by a make -j.

I haven't yet tried this without the max_writeback_pages patch, but the
graphs clearly show a speed improvement, and that the mainline code is
smearing writes across the drive while Jens' work is writing
sequentially.

Jens' writeback branch:
http://oss.oracle.com/~mason/seekwatcher/compilebench-writes-axboe.png

Mainline:
http://oss.oracle.com/~mason/seekwatcher/compilebench-writes-pdflush.png

-chris

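For reference, a compilebench run like the one described might be
invoked roughly as follows - the flag names are recalled from
compilebench's README and the mount point is a placeholder, so verify
against the tool's own help output:

	# 30 initial kernel trees, then a fixed number of benchmark runs
	compilebench -D /mnt/btrfs -i 30 -r 30
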
* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Theodore Tso @ 2009-09-01 20:30 UTC
To: Chris Mason, Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel, jens

On Tue, Sep 01, 2009 at 02:00:52PM -0400, Chris Mason wrote:
> I haven't yet tried this without the max_writeback_pages patch, but
> the graphs clearly show a speed improvement, and that the mainline
> code is smearing writes across the drive while Jens' work is writing
> sequentially.

FYI, you don't need to revert the max_writeback_pages patch; the whole
point of making it a tunable was to make it easier to run benchmarks.
If you want the effect of the original MAX_WRITEBACK_PAGES setting from
before the patch, just run as root:

	sysctl vm.max_writeback_pages=1024

						- Ted

end of thread, other threads: [~2009-09-01 20:30 UTC | newest]

Thread overview: 14+ messages (links below jump to the message on this page):

[not found] <1251600858-21294-1-git-send-email-tytso@mit.edu>
2009-08-30 16:52 ` [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages Christoph Hellwig
2009-08-30 18:17   ` Theodore Tso
2009-08-30 22:27     ` Christoph Hellwig
2009-08-31  3:08       ` Theodore Tso
2009-08-31 10:29         ` Jens Axboe
2009-08-31 10:47           ` Jens Axboe
2009-08-31 12:37             ` Theodore Tso
2009-08-31 15:54             ` Theodore Tso
2009-08-31 20:36               ` Jens Axboe
2009-08-31 21:03             ` Theodore Tso
2009-09-01  7:57               ` Aneesh Kumar K.V
2009-09-01  9:17               ` Jens Axboe
2009-09-01 18:00   ` Chris Mason
2009-09-01 20:30     ` Theodore Tso