* Bug in kernel 2.6.31, Slow wb_kupdate writeout
@ 2009-07-28 19:11 Chad Talbott
2009-07-28 21:49 ` Martin Bligh
2009-07-30 21:39 ` Jens Axboe
0 siblings, 2 replies; 32+ messages in thread
From: Chad Talbott @ 2009-07-28 19:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, wfg, Martin Bligh, Michael Rubin, Andrew Morton,
sandeen
I run a simple workload on a 4GB machine which dirties a few largish
inodes like so:
# seq 10 | xargs -P0 -n1 -i\{} dd if=/dev/zero of=/tmp/dump\{}
bs=1024k count=100
While the dds are running, data is written out at disk speed. However,
once the dds have run to completion and exited, there is ~500MB of
dirty memory left. Background writeout then takes about 3 more
minutes to clean memory at only ~3.3MB/s. When I explicitly sync, I
can see that the disk is capable of 40MB/s, which finishes off the
files in ~10s. [1]
An interesting recent-ish change is "writeback: speed up writeback of
big dirty files." When I revert the change to __sync_single_inode the
problem appears to go away and background writeout proceeds at disk
speed. Interestingly, that code is in the git commit [2], but not in
the post to LKML. [3]  This may not be the fix, but it makes this
test behave better.
Thanks,
Chad
[1] I've plotted the dirty memory from /proc/meminfo and disk write
speed from iostat at
http://sites.google.com/site/cwtlinux/2-6-31-writeback-bug
[2] git commit:
http://mirror.celinuxforum.org/gitstat/commit-detail.php?commit=8bc3be2751b4f74ab90a446da1912fd8204d53f7
[3] LKML post: http://marc.info/?l=linux-kernel&m=119131601130372&w=2
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-28 19:11 Bug in kernel 2.6.31, Slow wb_kupdate writeout Chad Talbott
@ 2009-07-28 21:49 ` Martin Bligh
  2009-07-29  7:15   ` Martin Bligh
  2009-07-30 21:39 ` Jens Axboe
  1 sibling, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-28 21:49 UTC (permalink / raw)
To: Chad Talbott
Cc: linux-kernel, linux-mm, wfg, Michael Rubin, Andrew Morton, sandeen

> An interesting recent-ish change is "writeback: speed up writeback of
> big dirty files."  When I revert the change to __sync_single_inode the
> problem appears to go away and background writeout proceeds at disk
> speed.  Interestingly, that code is in the git commit [2], but not in
> the post to LKML. [3]  This may not be the fix, but it makes this
> test behave better.

I'm fairly sure this is not fixing the root cause - but putting it at
the head rather than the tail of the queue causes the error not to
starve wb_kupdate for nearly so long - as long as we keep the queue
full, the bug is hidden.

M.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-28 21:49 ` Martin Bligh
@ 2009-07-29  7:15   ` Martin Bligh
  2009-07-29 11:43     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-29 7:15 UTC (permalink / raw)
To: Chad Talbott
Cc: linux-kernel, linux-mm, wfg, Michael Rubin, Andrew Morton,
    sandeen, Michael Davidson

On Tue, Jul 28, 2009 at 2:49 PM, Martin Bligh<mbligh@google.com> wrote:
>> An interesting recent-ish change is "writeback: speed up writeback of
>> big dirty files."  When I revert the change to __sync_single_inode the
>> problem appears to go away and background writeout proceeds at disk
>> speed.  Interestingly, that code is in the git commit [2], but not in
>> the post to LKML. [3]  This may not be the fix, but it makes this
>> test behave better.
>
> I'm fairly sure this is not fixing the root cause - but putting it at the head
> rather than the tail of the queue causes the error not to starve wb_kupdate
> for nearly so long - as long as we keep the queue full, the bug is hidden.

OK, it seems this is the root cause - I wasn't clear why all the pages
weren't being written back, and thought there was another bug. What
happens is we go into write_cache_pages, and stuff the disk queue with
as much as we can put into it, and then inevitably hit the congestion
limit.

Then we back out to __sync_single_inode, who says "huh, you didn't
manage to write your whole slice", and penalizes the poor blameless
inode in question by putting it back into the penalty box for 30s.

This results in very lumpy I/O writeback at 5s intervals, and very
poor throughput.

Patch below is inline and probably text munged, but is for RFC only.
I'll test it more thoroughly tomorrow. As for the comment about
starving other writes, I believe requeue_io moves it from s_io to
s_more_io, which should at least allow some progress of other files.

--- linux-2.6.30/fs/fs-writeback.c.old  2009-07-29 00:08:29.000000000 -0700
+++ linux-2.6.30/fs/fs-writeback.c      2009-07-29 00:11:28.000000000 -0700
@@ -322,46 +322,11 @@ __sync_single_inode(struct inode *inode,
 		/*
 		 * We didn't write back all the pages.  nfs_writepages()
 		 * sometimes bales out without doing anything. Redirty
-		 * the inode; Move it from s_io onto s_more_io/s_dirty.
+		 * the inode; Move it from s_io onto s_more_io. It
+		 * may well have just encountered congestion
 		 */
-		/*
-		 * akpm: if the caller was the kupdate function we put
-		 * this inode at the head of s_dirty so it gets first
-		 * consideration.  Otherwise, move it to the tail, for
-		 * the reasons described there.  I'm not really sure
-		 * how much sense this makes.  Presumably I had a good
-		 * reasons for doing it this way, and I'd rather not
-		 * muck with it at present.
-		 */
-		if (wbc->for_kupdate) {
-			/*
-			 * For the kupdate function we move the inode
-			 * to s_more_io so it will get more writeout as
-			 * soon as the queue becomes uncongested.
-			 */
-			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
-				/*
-				 * slice used up: queue for next turn
-				 */
-				requeue_io(inode);
-			} else {
-				/*
-				 * somehow blocked: retry later
-				 */
-				redirty_tail(inode);
-			}
-		} else {
-			/*
-			 * Otherwise fully redirty the inode so that
-			 * other inodes on this superblock will get some
-			 * writeout.  Otherwise heavy writing to one
-			 * file would indefinitely suspend writeout of
-			 * all the other files.
-			 */
-			inode->i_state |= I_DIRTY_PAGES;
-			redirty_tail(inode);
-		}
+		inode->i_state |= I_DIRTY_PAGES;
+		requeue_io(inode);
 	} else if (inode->i_state & I_DIRTY) {
 		/*
 		 * Someone redirtied the inode while were writing back
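
For readers following along without a tree handy, the two queueing
helpers the thread keeps contrasting behave roughly as below in the
2.6.30-era fs/fs-writeback.c. This is a paraphrased sketch, not a
verbatim copy; the list and field names match that tree:

	/* Paraphrased sketch of 2.6.30-era fs/fs-writeback.c helpers */
	static void redirty_tail(struct inode *inode)
	{
		struct super_block *sb = inode->i_sb;

		/*
		 * Re-stamp dirtied_when so the inode queues behind the
		 * youngest entry on s_dirty - it then sits out a full
		 * dirty expire interval (~30s) before kupdate will
		 * consider it again.
		 */
		if (!list_empty(&sb->s_dirty)) {
			struct inode *tail;

			tail = list_entry(sb->s_dirty.next,
					  struct inode, i_list);
			if (time_before(inode->dirtied_when,
					tail->dirtied_when))
				inode->dirtied_when = jiffies;
		}
		list_move(&inode->i_list, &sb->s_dirty);
	}

	static void requeue_io(struct inode *inode)
	{
		/* Stay on s_more_io: retried on the very next pass. */
		list_move(&inode->i_list, &inode->i_sb->s_more_io);
	}

The re-stamping in redirty_tail() is what puts a congested-but-blameless
inode into the ~30-second penalty box described above, while requeue_io()
merely defers it to the next writeback pass.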
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-29  7:15 ` Martin Bligh
@ 2009-07-29 11:43   ` Wu Fengguang
  2009-07-29 14:11     ` Martin Bligh
  2009-07-30  0:19     ` Martin Bligh
  0 siblings, 2 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-07-29 11:43 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    Andrew Morton, sandeen, Michael Davidson

On Wed, Jul 29, 2009 at 12:15:48AM -0700, Martin Bligh wrote:
> On Tue, Jul 28, 2009 at 2:49 PM, Martin Bligh<mbligh@google.com> wrote:
> >> An interesting recent-ish change is "writeback: speed up writeback of
> >> big dirty files."  When I revert the change to __sync_single_inode the
> >> problem appears to go away and background writeout proceeds at disk
> >> speed.  Interestingly, that code is in the git commit [2], but not in
> >> the post to LKML. [3]  This may not be the fix, but it makes this
> >> test behave better.
> >
> > I'm fairly sure this is not fixing the root cause - but putting it at the head
> > rather than the tail of the queue causes the error not to starve wb_kupdate
> > for nearly so long - as long as we keep the queue full, the bug is hidden.
>
> OK, it seems this is the root cause - I wasn't clear why all the pages
> weren't being written back, and thought there was another bug. What
> happens is we go into write_cache_pages, and stuff the disk queue with
> as much as we can put into it, and then inevitably hit the congestion
> limit.
>
> Then we back out to __sync_single_inode, who says "huh, you didn't
> manage to write your whole slice", and penalizes the poor blameless
> inode in question by putting it back into the penalty box for 30s.
>
> This results in very lumpy I/O writeback at 5s intervals, and very
> poor throughput.

You are right, so let's fix the congestion case. Your analysis would
make a perfect changelog :)

> Patch below is inline and probably text munged, but is for RFC only.
> I'll test it more thoroughly tomorrow. As for the comment about
> starving other writes, I believe requeue_io moves it from s_io to
> s_more_io, which should at least allow some progress of other files.
>
> --- linux-2.6.30/fs/fs-writeback.c.old  2009-07-29 00:08:29.000000000 -0700
> +++ linux-2.6.30/fs/fs-writeback.c      2009-07-29 00:11:28.000000000 -0700
> @@ -322,46 +322,11 @@ __sync_single_inode(struct inode *inode,
>  		/*
>  		 * We didn't write back all the pages.  nfs_writepages()
>  		 * sometimes bales out without doing anything. Redirty
[snip]
> -			if (wbc->nr_to_write <= 0) {
> -				/*
> -				 * slice used up: queue for next turn
> -				 */
> -				requeue_io(inode);
> -			} else {
> -				/*
> -				 * somehow blocked: retry later
> -				 */
> -				redirty_tail(inode);

Removing this line can be dangerous - we'll probably go into busy
waiting (I have tried that long ago).

Chad, can you try this small patch? Thank you.

Thanks,
Fengguang
---
 fs/fs-writeback.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
 			 * soon as the queue becomes uncongested.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
+			if (wbc->nr_to_write <= 0 ||
+			    wbc->encountered_congestion) {
 				/*
 				 * slice used up: queue for next turn
 				 */
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-29 11:43 ` Wu Fengguang
@ 2009-07-29 14:11   ` Martin Bligh
  2009-07-30  1:06     ` Wu Fengguang
  1 sibling, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-29 14:11 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    Andrew Morton, sandeen, Michael Davidson

> --- mm.orig/fs/fs-writeback.c
> +++ mm/fs/fs-writeback.c
> @@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
>  			 * soon as the queue becomes uncongested.
>  			 */
>  			inode->i_state |= I_DIRTY_PAGES;
> -			if (wbc->nr_to_write <= 0) {
> +			if (wbc->nr_to_write <= 0 ||
> +			    wbc->encountered_congestion) {
>  				/*
>  				 * slice used up: queue for next turn
>  				 */
>

That's not sufficient - it only fixes the problem in the wb_kupdate
path. If you want to be more conservative, how about we do this?

--- linux-2.6.30/fs/fs-writeback.c.old  2009-07-29 00:08:29.000000000 -0700
+++ linux-2.6.30/fs/fs-writeback.c      2009-07-29 07:08:48.000000000 -0700
@@ -323,43 +323,14 @@ __sync_single_inode(struct inode *inode,
 		 * We didn't write back all the pages.  nfs_writepages()
 		 * sometimes bales out without doing anything. Redirty
 		 * the inode; Move it from s_io onto s_more_io/s_dirty.
+		 * It may well have just encountered congestion
 		 */
-		/*
-		 * akpm: if the caller was the kupdate function we put
-		 * this inode at the head of s_dirty so it gets first
-		 * consideration.  Otherwise, move it to the tail, for
-		 * the reasons described there.  I'm not really sure
-		 * how much sense this makes.  Presumably I had a good
-		 * reasons for doing it this way, and I'd rather not
-		 * muck with it at present.
-		 */
-		if (wbc->for_kupdate) {
-			/*
-			 * For the kupdate function we move the inode
-			 * to s_more_io so it will get more writeout as
-			 * soon as the queue becomes uncongested.
-			 */
-			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
-				/*
-				 * slice used up: queue for next turn
-				 */
-				requeue_io(inode);
-			} else {
-				/*
-				 * somehow blocked: retry later
-				 */
-				redirty_tail(inode);
-			}
-		} else {
-			/*
-			 * Otherwise fully redirty the inode so that
-			 * other inodes on this superblock will get some
-			 * writeout.  Otherwise heavy writing to one
-			 * file would indefinitely suspend writeout of
-			 * all the other files.
-			 */
-			inode->i_state |= I_DIRTY_PAGES;
+		inode->i_state |= I_DIRTY_PAGES;
+		if (wbc->nr_to_write <= 0 ||	/* slice used up */
+		    wbc->encountered_congestion)
+			requeue_io(inode);
+		else {
+			/* somehow blocked: retry later */
 			redirty_tail(inode);
 		}
 	} else if (inode->i_state & I_DIRTY) {
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-29 14:11 ` Martin Bligh
@ 2009-07-30  1:06   ` Wu Fengguang
  2009-07-30  1:12     ` Martin Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 1:06 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Wed, Jul 29, 2009 at 10:11:10PM +0800, Martin Bligh wrote:
> > --- mm.orig/fs/fs-writeback.c
> > +++ mm/fs/fs-writeback.c
> > @@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
> >  			 * soon as the queue becomes uncongested.
> >  			 */
> >  			inode->i_state |= I_DIRTY_PAGES;
> > -			if (wbc->nr_to_write <= 0) {
> > +			if (wbc->nr_to_write <= 0 ||
> > +			    wbc->encountered_congestion) {
> >  				/*
> >  				 * slice used up: queue for next turn
> >  				 */
> >
>
> That's not sufficient - it only fixes the problem in the wb_kupdate
> path. If you want to be more conservative, how about we do this?

I agree on the unification of the kupdate and sync paths. In fact I had
a patch for doing this. And I'd recommend to do it in two patches:
one to fix the congestion case, another to do the code unification.

The sync path doesn't care whether requeue_io() or redirty_tail() is
used, because it disregards the time stamps totally - only the order of
inodes matters (ie. starvation), which is the same for
requeue_io()/redirty_tail().

Thanks,
Fengguang

> --- linux-2.6.30/fs/fs-writeback.c.old  2009-07-29 00:08:29.000000000 -0700
> +++ linux-2.6.30/fs/fs-writeback.c      2009-07-29 07:08:48.000000000 -0700
> @@ -323,43 +323,14 @@ __sync_single_inode(struct inode *inode,
>  		 * We didn't write back all the pages.  nfs_writepages()
>  		 * sometimes bales out without doing anything. Redirty
>  		 * the inode; Move it from s_io onto s_more_io/s_dirty.
> +		 * It may well have just encountered congestion
>  		 */
> -		/*
> -		 * akpm: if the caller was the kupdate function we put
> -		 * this inode at the head of s_dirty so it gets first
> -		 * consideration.  Otherwise, move it to the tail, for
> -		 * the reasons described there.  I'm not really sure
> -		 * how much sense this makes.  Presumably I had a good
> -		 * reasons for doing it this way, and I'd rather not
> -		 * muck with it at present.
> -		 */
> -		if (wbc->for_kupdate) {
> -			/*
> -			 * For the kupdate function we move the inode
> -			 * to s_more_io so it will get more writeout as
> -			 * soon as the queue becomes uncongested.
> -			 */
> -			inode->i_state |= I_DIRTY_PAGES;
> -			if (wbc->nr_to_write <= 0) {
> -				/*
> -				 * slice used up: queue for next turn
> -				 */
> -				requeue_io(inode);
> -			} else {
> -				/*
> -				 * somehow blocked: retry later
> -				 */
> -				redirty_tail(inode);
> -			}
> -		} else {
> -			/*
> -			 * Otherwise fully redirty the inode so that
> -			 * other inodes on this superblock will get some
> -			 * writeout.  Otherwise heavy writing to one
> -			 * file would indefinitely suspend writeout of
> -			 * all the other files.
> -			 */
> -			inode->i_state |= I_DIRTY_PAGES;
> +		inode->i_state |= I_DIRTY_PAGES;
> +		if (wbc->nr_to_write <= 0 ||	/* slice used up */
> +		    wbc->encountered_congestion)
> +			requeue_io(inode);
> +		else {
> +			/* somehow blocked: retry later */
>  			redirty_tail(inode);
>  		}
>  	} else if (inode->i_state & I_DIRTY) {
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  1:06 ` Wu Fengguang
@ 2009-07-30  1:12   ` Martin Bligh
  2009-07-30  1:57     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 1:12 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

> I agree on the unification of the kupdate and sync paths. In fact I had
> a patch for doing this. And I'd recommend to do it in two patches:
> one to fix the congestion case, another to do the code unification.
>
> The sync path doesn't care whether requeue_io() or redirty_tail() is
> used, because it disregards the time stamps totally - only the order of
> inodes matters (ie. starvation), which is the same for
> requeue_io()/redirty_tail().

But, as I understand it, both paths share the same lists, so we still
have to be consistent?

Also, you set flags like more_io higher up in sync_sb_inodes() based on
whether there's anything in the s_more_io queue, so it still seems to
have some effect to me?
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  1:12 ` Martin Bligh
@ 2009-07-30  1:57   ` Wu Fengguang
  2009-07-30  2:59     ` Martin Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 1:57 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Thu, Jul 30, 2009 at 09:12:26AM +0800, Martin Bligh wrote:
> > I agree on the unification of the kupdate and sync paths. In fact I had
> > a patch for doing this. And I'd recommend to do it in two patches:
> > one to fix the congestion case, another to do the code unification.
> >
> > The sync path doesn't care whether requeue_io() or redirty_tail() is
> > used, because it disregards the time stamps totally - only the order of
> > inodes matters (ie. starvation), which is the same for
> > requeue_io()/redirty_tail().
>
> But, as I understand it, both paths share the same lists, so we still
> have to be consistent?

Then let's first unify the code, then fix the congestion case? :)

> Also, you set flags like more_io higher up in sync_sb_inodes() based on
> whether there's anything in the s_more_io queue, so it still seems to
> have some effect to me?

Yes, maybe in some rare cases.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  1:57 ` Wu Fengguang
@ 2009-07-30  2:59   ` Martin Bligh
  2009-07-30  4:08     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 2:59 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Wed, Jul 29, 2009 at 6:57 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Thu, Jul 30, 2009 at 09:12:26AM +0800, Martin Bligh wrote:
>> > I agree on the unification of the kupdate and sync paths. In fact I had
>> > a patch for doing this. And I'd recommend to do it in two patches:
>> > one to fix the congestion case, another to do the code unification.
>> >
>> > The sync path doesn't care whether requeue_io() or redirty_tail() is
>> > used, because it disregards the time stamps totally - only the order of
>> > inodes matters (ie. starvation), which is the same for
>> > requeue_io()/redirty_tail().
>>
>> But, as I understand it, both paths share the same lists, so we still
>> have to be consistent?
>
> Then let's first unify the code, then fix the congestion case? :)

OK, I will send it out as separate patches. I am just finishing up the
testing first.

M.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  2:59 ` Martin Bligh
@ 2009-07-30  4:08   ` Wu Fengguang
  2009-07-30 19:55     ` Martin Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 4:08 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Thu, Jul 30, 2009 at 10:59:09AM +0800, Martin Bligh wrote:
> On Wed, Jul 29, 2009 at 6:57 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Thu, Jul 30, 2009 at 09:12:26AM +0800, Martin Bligh wrote:
> >> > I agree on the unification of the kupdate and sync paths. In fact I had
> >> > a patch for doing this. And I'd recommend to do it in two patches:
> >> > one to fix the congestion case, another to do the code unification.
> >> >
> >> > The sync path doesn't care whether requeue_io() or redirty_tail() is
> >> > used, because it disregards the time stamps totally - only the order of
> >> > inodes matters (ie. starvation), which is the same for
> >> > requeue_io()/redirty_tail().
> >>
> >> But, as I understand it, both paths share the same lists, so we still
> >> have to be consistent?
> >
> > Then let's first unify the code, then fix the congestion case? :)
>
> OK, I will send it out as separate patches. I am just finishing up the
> testing first.

Note that this is a simple fix that may have suboptimal write performance.
Here is an old reasoning:

        http://lkml.org/lkml/2009/3/28/235

Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  4:08 ` Wu Fengguang
@ 2009-07-30 19:55   ` Martin Bligh
  2009-08-01  2:02     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 19:55 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michael Rubin,
    sandeen@redhat.com, Michael Davidson, Andrew Morton, Peter Zijlstra

> Note that this is a simple fix that may have suboptimal write performance.
> Here is an old reasoning:
>
>         http://lkml.org/lkml/2009/3/28/235

The other thing I've been experimenting with is to disable the per-page
check in write_cache_pages, ie:

		if (wbc->nonblocking && bdi_write_congested(bdi)) {
			wb_stats_inc(WB_STATS_WCP_SECTION_CONG);
			wbc->encountered_congestion = 1;
			/* done = 1; */

This treats the congestion limits as soft, but encourages us to write
back in larger, more efficient chunks. If that's not going to scare
people unduly, I can submit that as well.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 19:55 ` Martin Bligh
@ 2009-08-01  2:02   ` Wu Fengguang
  0 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 2:02 UTC (permalink / raw)
To: Martin Bligh
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michael Rubin,
    sandeen@redhat.com, Michael Davidson, Andrew Morton, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 1560 bytes --]

On Fri, Jul 31, 2009 at 03:55:44AM +0800, Martin Bligh wrote:
> > Note that this is a simple fix that may have suboptimal write performance.
> > Here is an old reasoning:
> >
> >         http://lkml.org/lkml/2009/3/28/235
>
> The other thing I've been experimenting with is to disable the per-page
> check in write_cache_pages, ie:
>
> 		if (wbc->nonblocking && bdi_write_congested(bdi)) {
> 			wb_stats_inc(WB_STATS_WCP_SECTION_CONG);
> 			wbc->encountered_congestion = 1;
> 			/* done = 1; */
>
> This treats the congestion limits as soft, but encourages us to write
> back in larger, more efficient chunks. If that's not going to scare
> people unduly, I can submit that as well.

This risks hitting the hard limit (nr_requests), and blocking everyone,
including the ones with higher priority (ie. kswapd).

On the other hand, the simple fix in previous mails won't necessarily
behave too sub-optimally - it's only a potential problem. There is a
window of (1/16)*(nr_requests)*request_size (= 128*256KB/16 = 4MB)
between the congestion-on and congestion-off states. So at best we can
inject a big 4MB chunk into the async write queue once it becomes
uncongested.

I have a writeback debug patch that can help find out how that works
out in your real world workloads (by monitoring nr_to_write). You can
also try doubling the ratio (1/16) in blk_queue_congestion_threshold(),
to see how an increased congestion-on-off window may help.

Thanks,
Fengguang

[-- Attachment #2: writeback-debug-2.6.31.patch --]
[-- Type: text/x-diff, Size: 2713 bytes --]

 mm/page-writeback.c |   38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

--- sound-2.6.orig/mm/page-writeback.c
+++ sound-2.6/mm/page-writeback.c
@@ -116,6 +116,33 @@ EXPORT_SYMBOL(laptop_mode);

 /* End of sysctl-exported parameters */

+#define writeback_debug_report(n, wbc) do {				\
+	__writeback_debug_report(n, wbc, __FILE__, __LINE__, __FUNCTION__); \
+} while (0)
+
+void print_writeback_control(struct writeback_control *wbc)
+{
+	printk(KERN_DEBUG
+	       "global dirty %lu writeback %lu nfs %lu "
+	       "flags %c%c towrite %ld skipped %ld\n",
+	       global_page_state(NR_FILE_DIRTY),
+	       global_page_state(NR_WRITEBACK),
+	       global_page_state(NR_UNSTABLE_NFS),
+	       wbc->encountered_congestion ? 'C':'_',
+	       wbc->more_io ? 'M':'_',
+	       wbc->nr_to_write,
+	       wbc->pages_skipped);
+}
+
+void __writeback_debug_report(long n, struct writeback_control *wbc,
+		const char *file, int line, const char *func)
+{
+	printk(KERN_DEBUG "%s %d %s: %s(%d) %ld\n",
+	       file, line, func,
+	       current->comm, current->pid,
+	       n);
+	print_writeback_control(wbc);
+}

 static void background_writeout(unsigned long _min_pages);

@@ -550,6 +577,7 @@ static void balance_dirty_pages(struct a
 		pages_written += write_chunk - wbc.nr_to_write;
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 			       &bdi_thresh, bdi);
+		writeback_debug_report(pages_written, &wbc);
 	}

 	/*
@@ -576,6 +604,7 @@ static void balance_dirty_pages(struct a
 			break;		/* We've done our duty */

 		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		writeback_debug_report(-pages_written, &wbc);
 	}

 	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
@@ -670,6 +699,11 @@ void throttle_vm_writeout(gfp_t gfp_mask
 		    global_page_state(NR_WRITEBACK) <= dirty_thresh)
 			break;
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		printk(KERN_DEBUG "throttle_vm_writeout: "
+		       "congestion_wait on %lu+%lu > %lu\n",
+		       global_page_state(NR_UNSTABLE_NFS),
+		       global_page_state(NR_WRITEBACK),
+		       dirty_thresh);

 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -719,7 +753,9 @@ static void background_writeout(unsigned
 			else
 				break;
 		}
+		writeback_debug_report(min_pages, &wbc);
 	}
+	writeback_debug_report(min_pages, &wbc);
 }

 /*
@@ -792,7 +828,9 @@ static void wb_kupdate(unsigned long arg
 			break;	/* All the old data is written */
 		}
 		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		writeback_debug_report(nr_to_write, &wbc);
 	}
+	writeback_debug_report(nr_to_write, &wbc);
 	if (time_before(next_jif, jiffies + HZ))
 		next_jif = jiffies + HZ;
 	if (dirty_writeback_interval)
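
The hysteresis window Fengguang describes can be sanity-checked against
the block layer's threshold arithmetic. The sketch below assumes the
2.6.31-era blk_queue_congestion_threshold() formulas and a 256KB average
request size - both are assumptions, and with these particular numbers
the window works out nearer 2MB; larger requests widen it proportionally:

	#include <stdio.h>

	int main(void)
	{
		int nr_requests = 128;	/* default queue depth (assumed) */

		/* Assumed 2.6.31-era blk_queue_congestion_threshold()
		 * arithmetic: congestion is reported once allocated
		 * requests climb past "on", and cleared only after the
		 * queue drains below "off". */
		int on  = nr_requests - nr_requests / 8 + 1;	     /* 113 */
		int off = nr_requests - nr_requests / 8
				      - nr_requests / 16;	     /* 104 */
		int window = on - off;		/* ~nr_requests/16 requests */

		/* 256KB per request is purely illustrative. */
		printf("on=%d off=%d window=%d requests (~%d KB)\n",
		       on, off, window, window * 256);
		return 0;
	}

Doubling the 1/16 ratio, as suggested above, doubles that window and
therefore the size of the chunk that can be injected each time the
queue goes uncongested.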
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-29 11:43 ` Wu Fengguang
  2009-07-29 14:11   ` Martin Bligh
@ 2009-07-30  0:19   ` Martin Bligh
  2009-07-30  1:28     ` Martin Bligh
  2009-07-30  1:49     ` Wu Fengguang
  1 sibling, 2 replies; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 0:19 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    Andrew Morton, sandeen, Michael Davidson

BTW, can you explain this code at the bottom of generic_sync_sb_inodes
for me?

		if (wbc->nr_to_write <= 0) {
			wbc->more_io = 1;
			break;
		}

I don't understand why we are setting more_io here? AFAICS, more_io
means there's more stuff to write ... I would think we'd set this if
nr_to_write was > 0 ?

Or just have the section below brought up above this break check and do:

	if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
		wbc->more_io = 1;

Am I just misunderstanding the intent of more_io ?
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  0:19 ` Martin Bligh
@ 2009-07-30  1:28   ` Martin Bligh
  2009-07-30  2:09     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 1:28 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    Andrew Morton, sandeen, Michael Davidson

On Wed, Jul 29, 2009 at 5:19 PM, Martin Bligh<mbligh@google.com> wrote:
> BTW, can you explain this code at the bottom of generic_sync_sb_inodes
> for me?
>
> 		if (wbc->nr_to_write <= 0) {
> 			wbc->more_io = 1;
> 			break;
> 		}
>
> I don't understand why we are setting more_io here? AFAICS, more_io
> means there's more stuff to write ... I would think we'd set this if
> nr_to_write was > 0 ?
>
> Or just have the section below brought up above this break check and do:
>
> 	if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
> 		wbc->more_io = 1;
>
> Am I just misunderstanding the intent of more_io ?

I am thinking along the lines of:

@@ -638,13 +609,11 @@ sync_sb_inodes(struct super_block *sb, s
 		iput(inode);
 		cond_resched();
 		spin_lock(&inode_lock);
-		if (wbc->nr_to_write <= 0) {
-			wbc->more_io = 1;
+		if (wbc->nr_to_write <= 0)
 			break;
-		}
-		if (!list_empty(&sb->s_more_io))
-			wbc->more_io = 1;
 	}
+	if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
+		wbc->more_io = 1;
 	return;		/* Leave any unwritten inodes on s_io */
 }
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  1:28 ` Martin Bligh
@ 2009-07-30  2:09   ` Wu Fengguang
  2009-07-30  2:57     ` Martin Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 2:09 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Thu, Jul 30, 2009 at 09:28:07AM +0800, Martin Bligh wrote:
> On Wed, Jul 29, 2009 at 5:19 PM, Martin Bligh<mbligh@google.com> wrote:
> > BTW, can you explain this code at the bottom of generic_sync_sb_inodes
> > for me?
> >
> > 		if (wbc->nr_to_write <= 0) {
> > 			wbc->more_io = 1;
> > 			break;
> > 		}
> >
> > I don't understand why we are setting more_io here? AFAICS, more_io
> > means there's more stuff to write ... I would think we'd set this if
> > nr_to_write was > 0 ?
> >
> > Or just have the section below brought up above this break check and do:
> >
> > 	if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
> > 		wbc->more_io = 1;
> >
> > Am I just misunderstanding the intent of more_io ?
>
> I am thinking along the lines of:

On closer look I found this line:

		if (inode_dirtied_after(inode, start))
			break;

In this case "list_empty(&sb->s_io)" is not a good criterion: here we
are breaking away for some other reason, and shall not touch
wbc.more_io.

So let's stick with the current code?

Thanks,
Fengguang

> @@ -638,13 +609,11 @@ sync_sb_inodes(struct super_block *sb, s
>  		iput(inode);
>  		cond_resched();
>  		spin_lock(&inode_lock);
> -		if (wbc->nr_to_write <= 0) {
> -			wbc->more_io = 1;
> +		if (wbc->nr_to_write <= 0)
>  			break;
> -		}
> -		if (!list_empty(&sb->s_more_io))
> -			wbc->more_io = 1;
>  	}
> +	if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
> +		wbc->more_io = 1;
>  	return;		/* Leave any unwritten inodes on s_io */
>  }
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  2:09 ` Wu Fengguang
@ 2009-07-30  2:57   ` Martin Bligh
  2009-07-30  3:19     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 2:57 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

> On closer look I found this line:
>
> 		if (inode_dirtied_after(inode, start))
> 			break;

Ah, OK.

> In this case "list_empty(&sb->s_io)" is not a good criterion: here we
> are breaking away for some other reason, and shall not touch
> wbc.more_io.
>
> So let's stick with the current code?

Well, I see two problems. One is that we set more_io based on
whether s_more_io is empty or not before we finish the loop.
I can't see how this can be correct, especially as there can be
other concurrent writers. So somehow we need to check when
we exit the loop, not during it.

The other is that we're saying we are setting more_io when
nr_to_write is <= 0 ... but we only really check it when
nr_to_write is > 0 ... I can't see how this can be useful?
I'll admit there is one corner case when pages_skipped is set
from one of the branches, but I am really not sure what the
intended logic is here, given the above?

In the case where we hit the inode_dirtied_after break
condition, is it bad to set more_io ? There is more to do
on that inode after all. Is there a definition somewhere for
exactly what the more_io flag means?

M.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  2:57 ` Martin Bligh
@ 2009-07-30  3:19   ` Wu Fengguang
  2009-07-30 20:33     ` Martin Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 3:19 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Thu, Jul 30, 2009 at 10:57:35AM +0800, Martin Bligh wrote:
> > On closer look I found this line:
> >
> > 		if (inode_dirtied_after(inode, start))
> > 			break;
>
> Ah, OK.
>
> > In this case "list_empty(&sb->s_io)" is not a good criterion: here we
> > are breaking away for some other reason, and shall not touch
> > wbc.more_io.
> >
> > So let's stick with the current code?
>
> Well, I see two problems. One is that we set more_io based on
> whether s_more_io is empty or not before we finish the loop.
> I can't see how this can be correct, especially as there can be
> other concurrent writers. So somehow we need to check when
> we exit the loop, not during it.

It is correct inside the loop, however with some overheads.

We put it inside the loop because sometimes the whole filesystem is
skipped and we shall not set more_io on them whether or not s_more_io
is empty.

> The other is that we're saying we are setting more_io when
> nr_to_write is <= 0 ... but we only really check it when
> nr_to_write is > 0 ... I can't see how this can be useful?

That's the caller's fault - I guess the logic was changed a bit by
Jens in linux-next. I noticed this just now. It shall be fixed.

> I'll admit there is one corner case when pages_skipped is set
> from one of the branches, but I am really not sure what the
> intended logic is here, given the above?
>
> In the case where we hit the inode_dirtied_after break
> condition, is it bad to set more_io ? There is more to do
> on that inode after all. Is there a definition somewhere for
> exactly what the more_io flag means?

"More dirty pages to be put to io"?

The exact semantics of more_io is determined by the caller, which used
to be (in 2.6.31):

background_writeout():

		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
			/* Wrote less than expected */
			if (wbc.encountered_congestion || wbc.more_io)
				congestion_wait(BLK_RW_ASYNC, HZ/10);
			else
				break;
		}

wb_kupdate() is the same except that it does not check pages_skipped.

Note that in 2.6.31, more_io is not used at all for sync().

Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  3:19 ` Wu Fengguang
@ 2009-07-30 20:33   ` Martin Bligh
  2009-08-01  2:58     ` Wu Fengguang
  2009-08-01  4:10     ` Wu Fengguang
  0 siblings, 2 replies; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 20:33 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

(BTW: background ... I'm not picking through this code for fun, I'm
trying to debug writeback problems introduced in our new kernel
that are affecting Google production workloads ;-))

>> Well, I see two problems. One is that we set more_io based on
>> whether s_more_io is empty or not before we finish the loop.
>> I can't see how this can be correct, especially as there can be
>> other concurrent writers. So somehow we need to check when
>> we exit the loop, not during it.
>
> It is correct inside the loop, however with some overheads.
>
> We put it inside the loop because sometimes the whole filesystem is
> skipped and we shall not set more_io on them whether or not s_more_io
> is empty.

My point was that you're setting more_io based on a condition
at a point in time that isn't when you return to the caller.

By the time you return to the caller (after several more loop
iterations), that condition may no longer be true.

One other way to address that would be to set it only if we're
about to fall off the end of the loop, ie change it to:

	if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
		wbc->more_io = 1;

>> The other is that we're saying we are setting more_io when
>> nr_to_write is <= 0 ... but we only really check it when
>> nr_to_write is > 0 ... I can't see how this can be useful?
>
> That's the caller's fault - I guess the logic was changed a bit by
> Jens in linux-next. I noticed this just now. It shall be fixed.

I am guessing you're setting more_io here because we're stopping
because our slice expired, presumably without us completing
all the io there was to do? That doesn't seem entirely accurate,
we could have finished all the pending IO (particularly given that
we can go over nr_to_write somewhat and send it negative).
Hence, I thought that checking whether s_more_io and s_io were
empty at the time of return might be a more accurate check,
but on the other hand they are shared lists.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 20:33 ` Martin Bligh
@ 2009-08-01  2:58   ` Wu Fengguang
  0 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 2:58 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Fri, Jul 31, 2009 at 04:33:09AM +0800, Martin Bligh wrote:
> (BTW: background ... I'm not picking through this code for fun, I'm
> trying to debug writeback problems introduced in our new kernel
> that are affecting Google production workloads ;-))
>
> >> Well, I see two problems. One is that we set more_io based on
> >> whether s_more_io is empty or not before we finish the loop.
> >> I can't see how this can be correct, especially as there can be
> >> other concurrent writers. So somehow we need to check when
> >> we exit the loop, not during it.
> >
> > It is correct inside the loop, however with some overheads.
> >
> > We put it inside the loop because sometimes the whole filesystem is
> > skipped and we shall not set more_io on them whether or not s_more_io
> > is empty.
>
> My point was that you're setting more_io based on a condition
> at a point in time that isn't when you return to the caller.
>
> By the time you return to the caller (after several more loop
> iterations), that condition may no longer be true.

You are right in that sense. Sorry that my claim of correctness is
somehow biased: we normally care much about early abortion, and don't
mind one extra trip over the superblocks. And the extra trip should
be rare enough. I'd be surprised if you observed many of them in real
workloads.

> One other way to address that would be to set it only if we're
> about to fall off the end of the loop, ie change it to:
>
> 	if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
> 		wbc->more_io = 1;

Let more_io=0 when there are more inodes in s_io to be worked on?
I cannot understand it, and suspect we are talking about an imaginary
problem on this point ;)

> >> The other is that we're saying we are setting more_io when
> >> nr_to_write is <= 0 ... but we only really check it when
> >> nr_to_write is > 0 ... I can't see how this can be useful?
> >
> > That's the caller's fault - I guess the logic was changed a bit by
> > Jens in linux-next. I noticed this just now. It shall be fixed.
>
> I am guessing you're setting more_io here because we're stopping
> because our slice expired, presumably without us completing
> all the io there was to do? That doesn't seem entirely accurate,
> we could have finished all the pending IO (particularly given that
> we can go over nr_to_write somewhat and send it negative).
> Hence, I thought that checking whether s_more_io and s_io were
> empty at the time of return might be a more accurate check,
> but on the other hand they are shared lists.

Yes the current more_io logic is not entirely accurate, but I doubt
we can gain much and that the improvement can be done trivially (not
the line of code, but the analysis and tests involved).

Anyway if you would take the time to push forward a patch for reducing
the overheads of possible extra trips, I'll take the time to review
it ;)

Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 20:33 ` Martin Bligh
  2009-08-01  2:58   ` Wu Fengguang
@ 2009-08-01  4:10   ` Wu Fengguang
  1 sibling, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 4:10 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Fri, Jul 31, 2009 at 04:33:09AM +0800, Martin Bligh wrote:
> (BTW: background ... I'm not picking through this code for fun, I'm
> trying to debug writeback problems introduced in our new kernel
> that are affecting Google production workloads ;-))
>
> >> Well, I see two problems. One is that we set more_io based on
> >> whether s_more_io is empty or not before we finish the loop.
> >> I can't see how this can be correct, especially as there can be
> >> other concurrent writers. So somehow we need to check when
> >> we exit the loop, not during it.
> >
> > It is correct inside the loop, however with some overheads.
> >
> > We put it inside the loop because sometimes the whole filesystem is
> > skipped and we shall not set more_io on them whether or not s_more_io
> > is empty.
>
> My point was that you're setting more_io based on a condition
> at a point in time that isn't when you return to the caller.
>
> By the time you return to the caller (after several more loop
> iterations), that condition may no longer be true.
>
> One other way to address that would be to set it only if we're
> about to fall off the end of the loop, ie change it to:
>
> 	if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
> 		wbc->more_io = 1;

Ah I see it (as the below patch), looks reasonable to me.

Thanks,
Fengguang
---
 fs/fs-writeback.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- sound-2.6.orig/fs/fs-writeback.c
+++ sound-2.6/fs/fs-writeback.c
@@ -544,9 +544,9 @@ void generic_sync_sb_inodes(struct super
 			wbc->more_io = 1;
 			break;
 		}
-		if (!list_empty(&sb->s_more_io))
-			wbc->more_io = 1;
 	}
+	if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
+		wbc->more_io = 1;

 	if (sync) {
 		struct inode *inode, *old_inode = NULL;
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  0:19 ` Martin Bligh
  2009-07-30  1:28   ` Martin Bligh
@ 2009-07-30  1:49   ` Wu Fengguang
  1 sibling, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 1:49 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Thu, Jul 30, 2009 at 08:19:34AM +0800, Martin Bligh wrote:
> BTW, can you explain this code at the bottom of generic_sync_sb_inodes
> for me?
>
> 		if (wbc->nr_to_write <= 0) {
> 			wbc->more_io = 1;
> 			break;
> 		}
>
> I don't understand why we are setting more_io here? AFAICS, more_io
> means there's more stuff to write ... I would think we'd set this if
> nr_to_write was > 0 ?

That's true: wbc.nr_to_write will always be set to MAX_WRITEBACK_PAGES
by wb_writeback() before entering generic_sync_sb_inodes(). So
wbc.nr_to_write <= 0 indicates we are interrupted by the quota and
should revisit generic_sync_sb_inodes() to check for more io (which
will _normally_ find more dirty pages to write).

> Or just have the section below brought up above this break check and do:
>
> 	if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
> 		wbc->more_io = 1;
>
> Am I just misunderstanding the intent of more_io ?

It should be OK. I agree on the change if it makes the logic more
straightforward.

Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-28 19:11 Bug in kernel 2.6.31, Slow wb_kupdate writeout Chad Talbott
  2009-07-28 21:49 ` Martin Bligh
@ 2009-07-30 21:39 ` Jens Axboe
  2009-07-30 22:01   ` Martin Bligh
  1 sibling, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2009-07-30 21:39 UTC (permalink / raw)
To: Chad Talbott
Cc: linux-kernel, linux-mm, wfg, Martin Bligh, Michael Rubin,
    Andrew Morton, sandeen

On Tue, Jul 28 2009, Chad Talbott wrote:
> I run a simple workload on a 4GB machine which dirties a few largish
> inodes like so:
>
> # seq 10 | xargs -P0 -n1 -i\{} dd if=/dev/zero of=/tmp/dump\{}
> bs=1024k count=100
>
> While the dds are running, data is written out at disk speed. However,
> once the dds have run to completion and exited, there is ~500MB of
> dirty memory left. Background writeout then takes about 3 more
> minutes to clean memory at only ~3.3MB/s. When I explicitly sync, I
> can see that the disk is capable of 40MB/s, which finishes off the
> files in ~10s. [1]
>
> An interesting recent-ish change is "writeback: speed up writeback of
> big dirty files."  When I revert the change to __sync_single_inode the
> problem appears to go away and background writeout proceeds at disk
> speed.  Interestingly, that code is in the git commit [2], but not in
> the post to LKML. [3]  This may not be the fix, but it makes this
> test behave better.

Can I talk you into trying the per-bdi writeback patchset? I just tried
your test on a 16gb machine, and the dd's finish immediately since it
won't trip the writeout at that percentage of dirty memory. The 1GB of
dirty memory is flushed when it gets too old, 30 seconds later in two
chunks of writeout running at disk speed.

http://lkml.org/lkml/2009/7/30/302

You can either use the git branch, or just download

http://kernel.dk/writeback-v13.patch

and apply that to -git (or -rc4) directly.

--
Jens Axboe
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 21:39 ` Jens Axboe
@ 2009-07-30 22:01   ` Martin Bligh
  2009-07-30 22:17     ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 22:01 UTC (permalink / raw)
To: Jens Axboe
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin,
    Andrew Morton, sandeen

On Thu, Jul 30, 2009 at 2:39 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> On Tue, Jul 28 2009, Chad Talbott wrote:
>> I run a simple workload on a 4GB machine which dirties a few largish
>> inodes like so:
>>
>> # seq 10 | xargs -P0 -n1 -i\{} dd if=/dev/zero of=/tmp/dump\{}
>> bs=1024k count=100
>>
>> While the dds are running, data is written out at disk speed. However,
>> once the dds have run to completion and exited, there is ~500MB of
>> dirty memory left. Background writeout then takes about 3 more
>> minutes to clean memory at only ~3.3MB/s. When I explicitly sync, I
>> can see that the disk is capable of 40MB/s, which finishes off the
>> files in ~10s. [1]
>
> Can I talk you into trying the per-bdi writeback patchset? I just tried
> your test on a 16gb machine, and the dd's finish immediately since it
> won't trip the writeout at that percentage of dirty memory. The 1GB of
> dirty memory is flushed when it gets too old, 30 seconds later in two
> chunks of writeout running at disk speed.

How big did you make the dds? It has to be writing more data than
you have RAM, or it's not going to do anything much interesting ;-)
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 22:01 ` Martin Bligh
@ 2009-07-30 22:17   ` Jens Axboe
  2009-07-30 22:34     ` Martin Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2009-07-30 22:17 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin,
    Andrew Morton, sandeen

On Thu, Jul 30 2009, Martin Bligh wrote:
> On Thu, Jul 30, 2009 at 2:39 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> > Can I talk you into trying the per-bdi writeback patchset? I just tried
> > your test on a 16gb machine, and the dd's finish immediately since it
> > won't trip the writeout at that percentage of dirty memory. The 1GB of
> > dirty memory is flushed when it gets too old, 30 seconds later in two
> > chunks of writeout running at disk speed.
>
> How big did you make the dds? It has to be writing more data than
> you have RAM, or it's not going to do anything much interesting ;-)

The test case above on a 4G machine is only generating 1G of dirty data.
I ran the same test case on the 16G, resulting in only background
writeout. The relevant bit here being that the background writeout
finished quickly, writing at disk speed.

I re-ran the same test, but using 300 100MB files instead. While the
dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
x25-m). When the dd's are done, it continues doing 80MB/sec for 10
seconds or so. Then the remainder (about 2G) is written in bursts at
disk speeds, but with some time in between.

--
Jens Axboe
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 22:17 ` Jens Axboe
@ 2009-07-30 22:34   ` Martin Bligh
  2009-07-30 22:43     ` Jens Axboe
  2009-08-01  4:02     ` Wu Fengguang
  0 siblings, 2 replies; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 22:34 UTC (permalink / raw)
To: Jens Axboe
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin,
    Andrew Morton, sandeen

> The test case above on a 4G machine is only generating 1G of dirty data.
> I ran the same test case on the 16G, resulting in only background
> writeout. The relevant bit here being that the background writeout
> finished quickly, writing at disk speed.
>
> I re-ran the same test, but using 300 100MB files instead. While the
> dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> seconds or so. Then the remainder (about 2G) is written in bursts at
> disk speeds, but with some time in between.

OK, I think the test case is sensitive to how many files you have - if
we punt them to the back of the list, and yet we still have 299 other
ones, it may well be able to keep the disk spinning despite the bug
I outlined. Try using 30 1GB files?

Though it doesn't seem to happen with just one dd streamer, and
I don't see why the bug doesn't trigger in that case either.

I believe the bugfix is correct independent of any bdi changes?

M.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 22:34 ` Martin Bligh
@ 2009-07-30 22:43   ` Jens Axboe
  2009-07-30 22:48     ` Martin Bligh
  2009-08-01  4:02   ` Wu Fengguang
  1 sibling, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2009-07-30 22:43 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin,
    Andrew Morton, sandeen

On Thu, Jul 30 2009, Martin Bligh wrote:
> > The test case above on a 4G machine is only generating 1G of dirty data.
> > I ran the same test case on the 16G machine, resulting in only background
> > writeout. The relevant bit here being that the background writeout
> > finished quickly, writing at disk speed.
> >
> > I re-ran the same test, but using 300 100MB files instead. While the
> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> > seconds or so. Then the remainder (about 2G) is written in bursts at
> > disk speeds, but with some time in between.
>
> OK, I think the test case is sensitive to how many files you have - if
> we punt them to the back of the list, and yet we still have 299 other
> ones, it may well be able to keep the disk spinning despite the bug
> I outlined. Try using 30 1GB files?

If this disk starts spinning, then we have bigger bugs :-)

> Though it doesn't seem to happen with just one dd streamer, and
> I don't see why the bug doesn't trigger in that case either.
>
> I believe the bugfix is correct independent of any bdi changes?

Yeah I think so too, I'll run some more tests on this tomorrow and
verify it there as well.

-- 
Jens Axboe
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 22:43 ` Jens Axboe
@ 2009-07-30 22:48   ` Martin Bligh
  2009-07-31  7:50     ` Peter Zijlstra
  2009-08-01  4:03     ` Wu Fengguang
  0 siblings, 2 replies; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 22:48 UTC (permalink / raw)
To: Jens Axboe
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin,
    sandeen, Andrew Morton, Peter Zijlstra

On Thu, Jul 30, 2009 at 3:43 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> On Thu, Jul 30 2009, Martin Bligh wrote:
>> > The test case above on a 4G machine is only generating 1G of dirty data.
>> > I ran the same test case on the 16G machine, resulting in only background
>> > writeout. The relevant bit here being that the background writeout
>> > finished quickly, writing at disk speed.
>> >
>> > I re-ran the same test, but using 300 100MB files instead. While the
>> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
>> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
>> > seconds or so. Then the remainder (about 2G) is written in bursts at
>> > disk speeds, but with some time in between.
>>
>> OK, I think the test case is sensitive to how many files you have - if
>> we punt them to the back of the list, and yet we still have 299 other
>> ones, it may well be able to keep the disk spinning despite the bug
>> I outlined. Try using 30 1GB files?
>
> If this disk starts spinning, then we have bigger bugs :-)
>>
>> Though it doesn't seem to happen with just one dd streamer, and
>> I don't see why the bug doesn't trigger in that case either.
>>
>> I believe the bugfix is correct independent of any bdi changes?
>
> Yeah I think so too, I'll run some more tests on this tomorrow and
> verify it there as well.

There's another issue I was discussing with Peter Z. earlier that the
bdi changes might help with - if you look at where the dirty pages
get to, they are capped hard at the average of the dirty and
background thresholds, meaning we can only dirty about half the
pages we should be able to. That does very slowly go away when
the bdi limit catches up, but it seems to start at 0, and its progress
seems glacially slow (at least if you're impatient ;-))

This seems to affect some of our workloads badly when they have
a sharp spike in dirty data to one device; they get throttled heavily
when they wouldn't have been before the per-bdi dirty limits.
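The hard cap Martin describes comes from the per-BDI share of the global
dirty threshold: each backing device is only allowed the fraction of the
limit that matches its recent share of completed writeout, so a device that
suddenly turns busy starts near zero and has to earn its allowance. Below is
a simplified userspace sketch of that proportion; the function and variable
names are invented stand-ins for the 2.6.30-era get_dirty_limits() /
bdi_writeout_fraction() machinery, not the kernel code itself.

	#include <stdio.h>

	/*
	 * Sketch: a BDI's dirty threshold as a fraction of the global one.
	 * "completions" stand in for the kernel's floating-average writeout
	 * counters; the real code uses fixed-point proportions (struct
	 * prop_local), which this deliberately glosses over.
	 */
	static unsigned long bdi_dirty_limit(unsigned long dirty_thresh,
					     unsigned long bdi_completions,
					     unsigned long total_completions)
	{
		if (total_completions == 0)
			return 0;	/* a fresh BDI starts with no share */
		return (unsigned long)((unsigned long long)dirty_thresh *
				       bdi_completions / total_completions);
	}

	int main(void)
	{
		unsigned long dirty_thresh = 204800;	/* say, 800MB in pages */

		/* just after a burst starts, this BDI has done little writeout */
		printf("early: %lu pages\n",
		       bdi_dirty_limit(dirty_thresh, 10, 1000));
		/* once its share of completions catches up, it gets the limit */
		printf("late:  %lu pages\n",
		       bdi_dirty_limit(dirty_thresh, 990, 1000));
		return 0;
	}

With the real floating-average counters, catching up takes on the order of
writing out the full dirty capacity, which is the adjustment period Peter
quantifies in the next message.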
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 22:48 ` Martin Bligh
@ 2009-07-31  7:50   ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2009-07-31 7:50 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, wfg,
    Michael Rubin, sandeen, Andrew Morton

On Thu, 2009-07-30 at 15:48 -0700, Martin Bligh wrote:
> There's another issue I was discussing with Peter Z. earlier that the
> bdi changes might help with - if you look at where the dirty pages
> get to, they are capped hard at the average of the dirty and
> background thresholds, meaning we can only dirty about half the
> pages we should be able to. That does very slowly go away when
> the bdi limit catches up, but it seems to start at 0, and its progress
> seems glacially slow (at least if you're impatient ;-))
>
> This seems to affect some of our workloads badly when they have
> a sharp spike in dirty data to one device; they get throttled heavily
> when they wouldn't have been before the per-bdi dirty limits.

Right, currently that adjustment period is about the same order as
writing out the full dirty page capacity. If your system has unbalanced
memory vs IO capacity this might indeed end up being glacial.

I've been considering making it a sublinear function wrt the memory
size, so that larger machines get less and therefore adjust faster.

Something like the below perhaps -- the alternative is yet another
sysctl :/

Not sure how the sqrt works out on a wide variety of machines though...
we'll have to test and see.

---
 mm/page-writeback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81627eb..64aa140 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -152,7 +152,7 @@ static int calc_period_shift(void)
 	else
 		dirty_total = (vm_dirty_ratio *
 			       determine_dirtyable_memory()) / 100;
-	return 2 + ilog2(dirty_total - 1);
+	return 2 + ilog2(int_sqrt(dirty_total) - 1);
 }
 
 /*
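To get a feel for what the int_sqrt() change does to the adaptation period,
here is a small userspace calculation; ilog2() and int_sqrt() are
reimplemented to mimic the kernel helpers, and the dirty_total page counts
are illustrative guesses (roughly 20% of a 4GB and of a 64GB machine in 4KB
pages), not numbers taken from this thread.

	#include <stdio.h>

	/* userspace mimic of the kernel's ilog2(): floor(log2(v)) */
	static int ilog2_ul(unsigned long v)
	{
		int log = -1;

		while (v) {
			v >>= 1;
			log++;
		}
		return log;
	}

	/* userspace mimic of the kernel's int_sqrt(): floor(sqrt(v)) */
	static unsigned long int_sqrt_ul(unsigned long v)
	{
		unsigned long r = 0;

		while ((r + 1) * (r + 1) <= v)
			r++;
		return r;
	}

	int main(void)
	{
		/* dirty_total in 4KB pages: ~20% of 4GB and ~20% of 64GB */
		unsigned long totals[] = { 209715, 3355443 };

		for (int i = 0; i < 2; i++) {
			unsigned long t = totals[i];

			printf("dirty_total=%7lu  old shift=%2d  new shift=%2d\n",
			       t, 2 + ilog2_ul(t - 1),
			       2 + ilog2_ul(int_sqrt_ul(t) - 1));
		}
		return 0;
	}

This prints old shifts of 19 and 23 against new shifts of 10 and 12.
Roughly speaking, the floating proportions age over about 2^shift writeout
completions, so shrinking the shift lets a freshly busy BDI earn its share
of the dirty limit orders of magnitude faster.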
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 22:48 ` Martin Bligh
  2009-07-31  7:50   ` Peter Zijlstra
@ 2009-08-01  4:03   ` Wu Fengguang
  2009-08-01  4:53     ` Wu Fengguang
  1 sibling, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 4:03 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    sandeen, Andrew Morton, Peter Zijlstra

On Thu, Jul 30, 2009 at 03:48:02PM -0700, Martin Bligh wrote:
> On Thu, Jul 30, 2009 at 3:43 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> > On Thu, Jul 30 2009, Martin Bligh wrote:
> >> > The test case above on a 4G machine is only generating 1G of dirty data.
> >> > I ran the same test case on the 16G machine, resulting in only background
> >> > writeout. The relevant bit here being that the background writeout
> >> > finished quickly, writing at disk speed.
> >> >
> >> > I re-ran the same test, but using 300 100MB files instead. While the
> >> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> >> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> >> > seconds or so. Then the remainder (about 2G) is written in bursts at
> >> > disk speeds, but with some time in between.
> >>
> >> OK, I think the test case is sensitive to how many files you have - if
> >> we punt them to the back of the list, and yet we still have 299 other
> >> ones, it may well be able to keep the disk spinning despite the bug
> >> I outlined. Try using 30 1GB files?
> >
> > If this disk starts spinning, then we have bigger bugs :-)
> >>
> >> Though it doesn't seem to happen with just one dd streamer, and
> >> I don't see why the bug doesn't trigger in that case either.
> >>
> >> I believe the bugfix is correct independent of any bdi changes?
> >
> > Yeah I think so too, I'll run some more tests on this tomorrow and
> > verify it there as well.
>
> There's another issue I was discussing with Peter Z. earlier that the
> bdi changes might help with - if you look at where the dirty pages
> get to, they are capped hard at the average of the dirty and
> background thresholds, meaning we can only dirty about half the
> pages we should be able to. That does very slowly go away when
> the bdi limit catches up, but it seems to start at 0, and its progress
> seems glacially slow (at least if you're impatient ;-))

You mean the dirty limit will start from
(dirty_ratio+background_ratio)/2 = 15% and grow toward
dirty_ratio = 20% at a very slow pace? I did observe such curves
long ago, but it does not always show up, as in the mini bench below.

> This seems to affect some of our workloads badly when they have
> a sharp spike in dirty data to one device; they get throttled heavily
> when they wouldn't have been before the per-bdi dirty limits.

Here is a single dd on my laptop with 4G memory, kernel 2.6.30.

root /home/wfg# echo 10 > /proc/sys/vm/dirty_ratio
root /home/wfg# echo 20 > /proc/sys/vm/dirty_background_ratio

wfg ~% dd if=/dev/zero of=/opt/vm/10G bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 12.7143 s, 82.5 MB/s

output of vmmon:

 nr_dirty  nr_writeback
        0             0
        0             0
    56795             0
    51655         17020
    52071         17511
    51648         16898
    51655         16485
    52369         17425
    51648         16930
    51470         16809
    52630         17267
    51287         16634
    51260         16641
    51310         16903
    51281         16379
    46073         11169
    46086             0
    46089             0
     3132          9657
       21         17677
        3         14107
       14             2
        0             0
        0             0

In this case nr_dirty stays almost constant.

Thanks,
Fengguang
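vmmon is not a standard tool; it appears to be a local monitoring script of
Wu's. A minimal stand-in that prints the same two counters once a second
could look like this (assuming, as the field names suggest, that the values
come from /proc/vmstat):

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char line[128];
		unsigned long v, dirty, writeback;

		printf("%10s %13s\n", "nr_dirty", "nr_writeback");
		for (;;) {
			FILE *f = fopen("/proc/vmstat", "r");

			if (!f)
				return 1;
			dirty = writeback = 0;
			/* pick out just the two counters we care about */
			while (fgets(line, sizeof(line), f)) {
				if (sscanf(line, "nr_dirty %lu", &v) == 1)
					dirty = v;
				else if (sscanf(line, "nr_writeback %lu", &v) == 1)
					writeback = v;
			}
			fclose(f);
			printf("%10lu %13lu\n", dirty, writeback);
			sleep(1);
		}
	}

Run alongside the dd, this reproduces the two-column traces shown in these
mails.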
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-08-01  4:03 ` Wu Fengguang
@ 2009-08-01  4:53   ` Wu Fengguang
  2009-08-01  5:03     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 4:53 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    sandeen, Andrew Morton, Peter Zijlstra

On Sat, Aug 01, 2009 at 12:03:13PM +0800, Wu Fengguang wrote:
> On Thu, Jul 30, 2009 at 03:48:02PM -0700, Martin Bligh wrote:
> > On Thu, Jul 30, 2009 at 3:43 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> > > On Thu, Jul 30 2009, Martin Bligh wrote:
> > >> > The test case above on a 4G machine is only generating 1G of dirty data.
> > >> > I ran the same test case on the 16G machine, resulting in only background
> > >> > writeout. The relevant bit here being that the background writeout
> > >> > finished quickly, writing at disk speed.
> > >> >
> > >> > I re-ran the same test, but using 300 100MB files instead. While the
> > >> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> > >> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> > >> > seconds or so. Then the remainder (about 2G) is written in bursts at
> > >> > disk speeds, but with some time in between.
> > >>
> > >> OK, I think the test case is sensitive to how many files you have - if
> > >> we punt them to the back of the list, and yet we still have 299 other
> > >> ones, it may well be able to keep the disk spinning despite the bug
> > >> I outlined. Try using 30 1GB files?
> > >
> > > If this disk starts spinning, then we have bigger bugs :-)
> > >>
> > >> Though it doesn't seem to happen with just one dd streamer, and
> > >> I don't see why the bug doesn't trigger in that case either.
> > >>
> > >> I believe the bugfix is correct independent of any bdi changes?
> > >
> > > Yeah I think so too, I'll run some more tests on this tomorrow and
> > > verify it there as well.
> >
> > There's another issue I was discussing with Peter Z. earlier that the
> > bdi changes might help with - if you look at where the dirty pages
> > get to, they are capped hard at the average of the dirty and
> > background thresholds, meaning we can only dirty about half the
> > pages we should be able to. That does very slowly go away when
> > the bdi limit catches up, but it seems to start at 0, and its progress
> > seems glacially slow (at least if you're impatient ;-))
>
> You mean the dirty limit will start from
> (dirty_ratio+background_ratio)/2 = 15% and grow toward
> dirty_ratio = 20% at a very slow pace? I did observe such curves
> long ago, but it does not always show up, as in the mini bench below.
>
> > This seems to affect some of our workloads badly when they have
> > a sharp spike in dirty data to one device; they get throttled heavily
> > when they wouldn't have been before the per-bdi dirty limits.
>
> Here is a single dd on my laptop with 4G memory, kernel 2.6.30.
>
> root /home/wfg# echo 10 > /proc/sys/vm/dirty_ratio
> root /home/wfg# echo 20 > /proc/sys/vm/dirty_background_ratio
>
> wfg ~% dd if=/dev/zero of=/opt/vm/10G bs=1M count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 12.7143 s, 82.5 MB/s
>
> output of vmmon:
>
>  nr_dirty  nr_writeback
>         0             0
>         0             0
>     56795             0
>     51655         17020
>     52071         17511
>     51648         16898
>     51655         16485
>     52369         17425
>     51648         16930
>     51470         16809
>     52630         17267
>     51287         16634
>     51260         16641
>     51310         16903
>     51281         16379
>     46073         11169
>     46086             0
>     46089             0
>      3132          9657
>        21         17677
>         3         14107
>        14             2
>         0             0
>         0             0
>
> In this case nr_dirty stays almost constant.

I can see the growth when I increased the dd size to 2GB;
the dd throughput decreased from 82.5MB/s to 60.9MB/s.

wfg ~% dd if=/dev/zero of=/opt/vm/10G bs=1M count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 34.4114 s, 60.9 MB/s

 nr_dirty  nr_writeback
        0             0
    44980             0
    49929         20353
    49929         20353
    49189         17822
    54556         14852
    49191         17717
    52455         15501
    49903         19330
    50077         17293
    50040         19111
    52097          7040
    52656         16797
    53361         19455
    53551         16999
    57599         16396
    55165          6801
    57626         16534
    56193         18795
    57888         16655
    57740         18818
    65759         11304
    60015         19842
    61136         16618
    62166         17429
    62160         16782
    62036         11907
    59237         13715
    61991         18561
    66256         15111
    60574         17551
    17926         17930
    17919         17057
    17919         16379
       11         13717
    11470          4606
        2           913
        2             0
       10             0
       10             0
        0             0
        0             0

But when I redid the above test after dropping all the ~3GB of caches,
the dirty limit again seems to remain constant.

# echo 1 > /proc/sys/vm/drop_caches

wfg ~% dd if=/dev/zero of=/opt/vm/10G bs=1M count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 33.3299 s, 62.9 MB/s

 nr_dirty  nr_writeback
        0             0
    76425         10825
    66255         17302
    69942         15865
    65332         17305
    71207         14605
    69957         15380
    65901         18960
    66365         16233
    66040         17041
    66042         16378
    66434          2169
    67606         17143
    68660         17195
    67613         16514
    67366         17415
    65784          4620
    69053         16831
    66037         17033
    64601         19936
    64629         16922
    70459          9227
    66673         17789
    65638         20102
    65166         17662
    66255         16286
    69821         11264
    82247          4113
    64012         18060
    29585         17920
     5872         16653
     5872         14197
    25422          1913
     5884         16658
        0         12027
        2            26
        2             0
        2             0

Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-08-01  4:53 ` Wu Fengguang
@ 2009-08-01  5:03   ` Wu Fengguang
  0 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 5:03 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    sandeen, Andrew Morton, Peter Zijlstra

On Sat, Aug 01, 2009 at 12:53:46PM +0800, Wu Fengguang wrote:
> On Sat, Aug 01, 2009 at 12:03:13PM +0800, Wu Fengguang wrote:
> I can see the growth when I increased the dd size to 2GB;
> the dd throughput decreased from 82.5MB/s to 60.9MB/s.

The raw disk write throughput seems to be 60MB/s:

wfg ~% dd if=/dev/zero of=/opt/vm/200M bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 3.48137 s, 60.2 MB/s

read throughput is a bit better:

wfg ~% dd of=/dev/null if=/opt/vm/200M bs=1M count=200 iflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.66606 s, 78.7 MB/s

# hdparm -tT /dev/hda

/dev/hda:
 Timing cached reads:   10370 MB in  1.99 seconds = 5213.70 MB/sec
 Timing buffered disk reads:  216 MB in  3.03 seconds = 71.22 MB/sec

And sync writes are pretty slow:

wfg ~% dd if=/dev/zero of=/opt/vm/200M bs=1M count=200 oflag=sync
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 10.4741 s, 20.0 MB/s

Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 22:34 ` Martin Bligh
  2009-07-30 22:43   ` Jens Axboe
@ 2009-08-01  4:02   ` Wu Fengguang
  1 sibling, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 4:02 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    Andrew Morton, sandeen

On Thu, Jul 30, 2009 at 03:34:12PM -0700, Martin Bligh wrote:
> > The test case above on a 4G machine is only generating 1G of dirty data.
> > I ran the same test case on the 16G machine, resulting in only background
> > writeout. The relevant bit here being that the background writeout
> > finished quickly, writing at disk speed.
> >
> > I re-ran the same test, but using 300 100MB files instead. While the
> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> > seconds or so. Then the remainder (about 2G) is written in bursts at
> > disk speeds, but with some time in between.
>
> OK, I think the test case is sensitive to how many files you have - if
> we punt them to the back of the list, and yet we still have 299 other
> ones, it may well be able to keep the disk spinning despite the bug
> I outlined. Try using 30 1GB files?
>
> Though it doesn't seem to happen with just one dd streamer, and
> I don't see why the bug doesn't trigger in that case either.

I guess the bug is not related to the number of dd streamers, but to
whether there is a stream of newly dirtied inodes (atime dirtiness
would be enough): wb_kupdate() itself won't give up on congestion,
but redirty_tail() will refresh the inode dirty time if there are
newly dirtied inodes in front of it.

And we cannot claim this to be a bug of the list-based redirty_tail(),
since we call it with the belief that the inode is somehow blocked.
In this manner redirty_tail() can refresh the inode's dirty time (and
therefore delay its writeback for up to 30s) at will.

> I believe the bugfix is correct independent of any bdi changes?

Agreed.

Thanks,
Fengguang
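For reference on the mechanism Fengguang describes: redirty_tail() in
2.6.30-era fs/fs-writeback.c reads roughly as below (a from-memory sketch,
so treat it as illustrative rather than a verbatim quote of the source). If
the inode at the tail of s_dirty carries a newer dirtied_when, the requeued
inode's timestamp is refreshed to the current jiffies, restarting its ~30s
expiry clock:

	/* sketch of 2.6.30-era fs/fs-writeback.c:redirty_tail() */
	static void redirty_tail(struct inode *inode)
	{
		struct super_block *sb = inode->i_sb;

		if (!list_empty(&sb->s_dirty)) {
			struct inode *tail_inode;

			tail_inode = list_entry(sb->s_dirty.next,
						struct inode, i_list);
			/*
			 * Keep dirtied_when monotonic along s_dirty: if a
			 * newer inode sits ahead, pretend this one was
			 * dirtied just now.
			 */
			if (time_before(inode->dirtied_when,
					tail_inode->dirtied_when))
				inode->dirtied_when = jiffies;
		}
		list_move(&inode->i_list, &sb->s_dirty);
	}

This is why a steady trickle of newly dirtied (even just atime-dirtied)
inodes can keep deferring writeback of a blocked inode indefinitely.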
end of thread, other threads:[~2009-08-01  5:03 UTC | newest]

Thread overview: 32+ messages:
2009-07-28 19:11 Bug in kernel 2.6.31, Slow wb_kupdate writeout Chad Talbott
2009-07-28 21:49 ` Martin Bligh
2009-07-29  7:15   ` Martin Bligh
2009-07-29 11:43     ` Wu Fengguang
2009-07-29 14:11       ` Martin Bligh
2009-07-30  1:06         ` Wu Fengguang
2009-07-30  1:12           ` Martin Bligh
2009-07-30  1:57             ` Wu Fengguang
2009-07-30  2:59               ` Martin Bligh
2009-07-30  4:08                 ` Wu Fengguang
2009-07-30 19:55                   ` Martin Bligh
2009-08-01  2:02                     ` Wu Fengguang
2009-07-30  0:19       ` Martin Bligh
2009-07-30  1:28         ` Martin Bligh
2009-07-30  2:09           ` Wu Fengguang
2009-07-30  2:57             ` Martin Bligh
2009-07-30  3:19               ` Wu Fengguang
2009-07-30 20:33                 ` Martin Bligh
2009-08-01  2:58                   ` Wu Fengguang
2009-08-01  4:10                     ` Wu Fengguang
2009-07-30  1:49           ` Wu Fengguang
2009-07-30 21:39 ` Jens Axboe
2009-07-30 22:01   ` Martin Bligh
2009-07-30 22:17     ` Jens Axboe
2009-07-30 22:34       ` Martin Bligh
2009-07-30 22:43         ` Jens Axboe
2009-07-30 22:48           ` Martin Bligh
2009-07-31  7:50             ` Peter Zijlstra
2009-08-01  4:03             ` Wu Fengguang
2009-08-01  4:53               ` Wu Fengguang
2009-08-01  5:03                 ` Wu Fengguang
2009-08-01  4:02         ` Wu Fengguang